Comparative Study of Early Diabetes Risk Stratification Based on Machine Learning Algorithms
Keywords
diabetes prediction, machine learning, feature importance, random forest, model comparison
Abstract
Objective: This study systematically compares the performance of multiple machine learning models in diabetes risk prediction and identifies key risk factors, providing data-driven decision support for early diabetes screening. Methods: Using the UCI Pima Indians Diabetes dataset, five models (logistic regression, K-nearest neighbors, support vector machine, decision tree, and random forest) were trained and evaluated. Model performance was assessed with metrics including AUC-ROC, precision, and recall, and feature importance analysis was used to identify the core diabetes risk factors. Results: The random forest model performed best across multiple metrics (AUC = 0.8167). Plasma glucose was consistently identified as the strongest predictor, with body mass index (BMI) and age also emerging as significant contributors. Conclusion: The random forest model performs robustly and captures feature interactions effectively, making it well suited to early diabetes prediction, with considerable potential for clinical application.
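The comparison described in the Methods can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the study's actual pipeline: the data are a synthetic stand-in generated with `make_classification` (768 rows and 8 features, matching the shape of the Pima dataset), and the train/test split, hyperparameters, scaling steps, and placeholder feature names are all assumptions. The scores it prints will therefore not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in data, not the real Pima records.
# Replace X, y with the actual dataset to reproduce a real comparison.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.65, 0.35], random_state=0,
)
# Hypothetical placeholder names; with real data these would be the dataset columns.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The five model families named in the abstract; hyperparameters are assumptions.
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}


def positive_scores(model, data):
    """Return a continuous score for the positive class, as AUC-ROC requires."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(data)[:, 1]
    return model.decision_function(data)  # e.g. SVC without probability estimates


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "auc": roc_auc_score(y_test, positive_scores(model, X_test)),
        "precision": precision_score(y_test, predictions, zero_division=0),
        "recall": recall_score(y_test, predictions),
    }
    print(name, {metric: round(value, 4) for metric, value in results[name].items()})

# Impurity-based feature importances from the fitted random forest, largest first.
importances = sorted(
    zip(feature_names, models["random forest"].feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for feature, weight in importances:
    print(f"{feature}: {weight:.3f}")
```

Ranking `importances` on the real dataset is the kind of analysis that would surface predictors such as plasma glucose, BMI, and age; a single held-out split is used here only for brevity, and cross-validation would give more stable estimates.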
