
Artificial Intelligence Algorithms for Polygenic Genotype-Phenotype Predictions and Diagnosis of Coronary Artery Disease

  • Shaikha Ahmed Alsuwaidi

Student thesis: Master's Thesis

Abstract

Leveraging genetic data is a core objective of precision medicine, particularly for complex diseases such as atherosclerotic Coronary Artery Disease (CAD). Most methods of identifying CAD are invasive, and by the time the disease is identified it is already advanced and can lead to further complications. The UK Biobank provides a large amount of clinical, environmental, and genotype data that machine learning classification models can exploit to identify patients with CAD. This thesis explores the link between the genetic Single Nucleotide Polymorphism (SNP) data provided in the UK Biobank and the specified CAD phenotype, and it introduces a novel score-based genetic feature engineering method that leverages Gene Ontology annotation for machine learning models. A systematic approach was taken to select the samples, clean the data, and evaluate both the Polygenic Risk Scores (PRSs) and the new score-based genetic features. With the PRS and the baseline features (i.e., age, gender, systolic blood pressure, total cholesterol, HDL-c, smoking status, medication use, history of diabetes, and the four genotype principal components), the best-performing model was the Light Gradient Boosting Machine (LGBM) with the Auto LDpred2 model (shrinkage parameter set to 0.5). The novel score-based approach comprised five phases, and the best variant was based on the parent-children node threshold set in phase three. It achieved the highest mean AUC on the validation and test sets (0.789 and 0.805, respectively), and its feature ranking, together with the LGBM feature ranking, highlighted a set of genetic score-based features that were most relevant to the CAD phenotype: the ATP9A gene score, the MAP3K7 (TAK1) gene score, and the score for GO:0060452, defined as 'positive regulation of cardiac muscle contraction'.
The results show that raising the granularity of PRSs is more beneficial than compressing them into a single score, especially in a machine learning setting, where finer granularity reduces some of the opacity of the learned model and can help explain the contrasting performance of the CAD PRS across publications.
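To illustrate the granularity argument above, the following is a minimal sketch and not the method used in the thesis. It assumes a PRS is a weighted sum of allele dosages and that every SNP maps to exactly one group; the function names, the toy dosages and weights, and the `GENE_A`/`GENE_B`/`GENE_C` labels are all hypothetical. It contrasts a single compressed score with per-group partial scores that a classifier could rank individually.

```python
import numpy as np

def polygenic_score(dosages, weights):
    """Single compressed score: weighted sum of allele dosages per sample."""
    return dosages @ weights

def grouped_scores(dosages, weights, snp_groups):
    """One partial score per group (e.g. per gene), using the same weights."""
    groups = sorted(set(snp_groups))
    columns = []
    for group in groups:
        mask = np.array([g == group for g in snp_groups])
        columns.append(dosages[:, mask] @ weights[mask])
    return groups, np.column_stack(columns)

# Toy data (hypothetical): 5 samples x 6 SNPs, dosages in {0, 1, 2}.
rng = np.random.default_rng(0)
dosages = rng.integers(0, 3, size=(5, 6)).astype(float)
weights = np.array([0.10, -0.05, 0.20, 0.03, -0.12, 0.08])
snp_groups = ["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B", "GENE_C"]

prs = polygenic_score(dosages, weights)                     # shape (5,)
groups, partial = grouped_scores(dosages, weights, snp_groups)  # shape (5, 3)

# Under the one-group-per-SNP assumption, the partial scores sum to the PRS,
# so the grouped features carry at least as much information as the total.
print(groups)
print(np.allclose(partial.sum(axis=1), prs))
```

Under these assumptions the grouped columns can be passed to a classifier as separate features, so that a feature ranking can point to individual groups instead of one opaque total.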
Date of Award: 22 Apr 2025
Original language: American English
Supervisor: Andreas Henschel

Keywords

  • Phenotype classification model
  • Genotype to phenotype prediction
  • Coronary Artery Disease
  • Annotation
  • Polygenic risk scores
