TY - JOUR
T1 - Improving T2D machine learning-based prediction accuracy with SNPs and younger age
AU - Al Hageh, Cynthia
AU - Henschel, Andreas
AU - Zhou, Hao
AU - Zubelli, Jorge
AU - Nader, Moni
AU - Chacar, Stephanie
AU - Iakovidou, Nantia
AU - Hatzikirou, Haralampos
AU - Abchee, Antoine
AU - O'Sullivan, Siobhan
AU - Zalloua, Pierre A.
N1 - Publisher Copyright: © 2025 The Authors
PY - 2025/1
Y1 - 2025/1
AB - Background: This study aimed to evaluate whether integrating clinical and genomic data improves the performance of machine learning (ML) models for predicting Type 2 Diabetes (T2D) risk. Methods: Six models—Random Forest, Support Vector Machine, Linear Discriminant Analysis, Logistic Regression, Gradient Boosting Machine, and Decision Tree—were trained and tested on a discovery dataset (N=3,546) and validated in the UK Biobank (N=31,620). Model performance was assessed using clinical data alone, combined clinical and genomic data, and in age-specific groups (>55 and ≤55 years). Results: The inclusion of genomic data modestly improved model performance across all algorithms in the discovery dataset. Clinical features such as family history of T2D and hypertension consistently ranked as top features. When SNPs were added, T2D-associated variants, including rs2943641 (IRS1), rs7903146 (TCF7L2), and rs7756992 (CDKAL1), emerged among the most important features, particularly in younger individuals. These findings demonstrate the translational potential of incorporating genomics for early risk identification. In the UK Biobank, all models achieved AUCs exceeding 91 % with combined clinical and genomic data. Performance was notably better among younger individuals (≤55 years), emphasizing the models’ potential for early detection. Integration of a polygenic risk score (PRS) further supported risk prediction, particularly in younger individuals, though incremental gains were modest. Conclusions: While traditional clinical factors remained the strongest predictors of T2D risk, integration of genomic data produced a modest improvement in model performance, especially among younger adults. Validation across independent datasets confirmed the generalizability of these findings, underscoring the value of multi-dimensional risk-prediction models to refine T2D risk assessment.
KW - AI
KW - Machine Learning
KW - Predictive models
KW - T2D
UR - https://www.scopus.com/pages/publications/105009308509
U2 - 10.1016/j.csbj.2025.06.038
DO - 10.1016/j.csbj.2025.06.038
M3 - Article
AN - SCOPUS:105009308509
SN - 2001-0370
VL - 27
SP - 2772
EP - 2781
JO - Computational and Structural Biotechnology Journal
JF - Computational and Structural Biotechnology Journal
ER -