Artificial Intelligence Algorithms for Polygenic Genotype-Phenotype Predictions

  • Muhammad Muneeb

Student thesis: Master's Thesis


Muhammad Muneeb, 'Artificial Intelligence Algorithms for Polygenic Genotype-Phenotype Predictions', Master of Science in Computer Science, Department of Electrical Engineering and Computer Science, Khalifa University of Science and Technology, United Arab Emirates, May 2021.

This dissertation comprises seven chapters. Chapter 1 gives a general introduction to the problem. Genotype-phenotype prediction is indispensable in genetics because it helps identify the genetic mutations that cause variation among human beings. Approaches for finding such associations fall broadly into two classes: statistical techniques and machine learning methods. Chapter 2 reviews the existing literature and approaches such as genome-wide association studies (GWAS) and polygenic risk scores (PRS) for genotype-phenotype prediction. The problem is not limited to humans; it also applies to other living organisms such as animals, crops, and plants. Chapter 3 describes the machine learning algorithms in detail. We examined the eye-color phenotype, for which a public dataset was available. We used about nine classifiers, because each maps the genotype-phenotype relationship in its own way. Stacked ensembles of LSTMs outperformed the other algorithms for 1560 SNPs, with an overall accuracy of 0.96, AUC = 0.98 for brown eyes, and AUC = 0.97 for blue-green eyes. Chapter 4 proposes a three-step pipeline to identify genes associated with different phenotypes. We examined the hair-color phenotype, for which GWAS Catalog data is available. Chapter 5 compares machine learning and the polygenic risk score for genotype-phenotype prediction. PRS is a more informative quantity than a machine learning classification: PRS estimates a person's tendency to develop a specific disease, whereas a machine learning classifier only assigns the person to a phenotype class. We analyzed the depression phenotype. With machine learning, the accuracy we obtained was 0.56, and with PRS, it was about 0.024. Chapter 6 provides insight into transfer learning for genotype data. For some populations, the available dataset is very limited. We can nevertheless use datasets from other populations to learn about disease-causing SNPs and apply that knowledge to genotype-phenotype prediction in small populations. The same pipeline can be used for the analysis of endangered species and for personalized medicine. Chapter 7 summarizes the thesis and outlines future directions, such as multi-model machine learning algorithms that combine information from genotype data, mRNA, amino acids, and proteins.
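
The distinction drawn in Chapter 5 between a continuous PRS and a discrete class label can be illustrated with a minimal sketch. It assumes the common additive form of a PRS (a weighted sum of allele dosages, with weights taken from GWAS effect sizes); the abstract does not state which PRS tool or formula the thesis used, and the function names, toy genotypes, weights, and threshold below are all hypothetical.

```python
import numpy as np


def polygenic_risk_score(dosages, effect_sizes):
    """Additive PRS: one score per individual.

    dosages:      (n_individuals, n_snps) array of allele counts (0, 1, or 2)
    effect_sizes: (n_snps,) array of per-SNP weights (e.g. GWAS betas)
    """
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    # Weighted sum over SNPs for every individual.
    return dosages @ effect_sizes


def classify(scores, threshold):
    """Collapse continuous scores into a binary label (1 = above threshold)."""
    return (np.asarray(scores) > threshold).astype(int)


# Hypothetical toy data: 3 individuals, 3 SNPs.
dosages = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 0],
]
effect_sizes = [0.5, -0.25, 1.0]  # illustrative weights, not real estimates

scores = polygenic_risk_score(dosages, effect_sizes)
labels = classify(scores, threshold=1.0)  # threshold chosen arbitrarily

print(scores)  # continuous tendency per individual
print(labels)  # discrete class per individual
```

The score keeps the relative ordering of individuals, while the thresholded label discards it; this is the sense in which a PRS carries more information than a class assignment.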
Date of Award: May 2021
Original language: American English


  • machine learning
  • genotype-phenotype prediction
  • gene identification
  • transfer learning
  • genetics
