An Investigation of Molecular Property Prediction and Classification Based on Machine Learning Algorithms


Yanzi Chen

Keywords

model prediction, classification, random forest

Abstract

With the rapid advancement of science and technology, chemical and physical research has become increasingly complex and high-dimensional, and traditional research paradigms struggle to search vast chemical parameter spaces efficiently. The Machine Chemist platform was developed in this context; it leverages big data and intelligent models to automate chemical synthesis, characterization, and testing. This study aims to predict the properties y1, y2, and y3 and the class of 2,580 molecules from their physicochemical properties, and to improve model accuracy through data analysis and modeling. Data preprocessing removed missing values and duplicates, and removed outliers using the quartile method, yielding a dataset suitable for analysis. A scatter plot of y2 against id suggested a univariate polynomial relationship, which motivated a univariate nonlinear regression model. The model predicted accurately, with a low root mean square error and a high coefficient of determination. The relationships between class and the indicators y1 to y3 and x1 to x100 were then examined and found to be mostly nonlinear. A random forest model was established to classify the 2,580 molecules into classes 1 to 4, based on the properties of 200,000 chemical molecules. Its performance was evaluated using decision trees, confusion matrices, precision, and recall. Finally, the SHAP method was used to assess the impact of each feature indicator on the classification outcomes, contributing to a reliable molecular category prediction model.
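The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. The function names (`iqr_filter`, `rmse`, `r_squared`, `precision_recall`) are hypothetical, the 1.5 × IQR fence is the conventional choice and is assumed here because the abstract does not state the multiplier, and the sample values are made up for illustration only.

```python
import math
import statistics


def iqr_filter(values, k=1.5):
    """Quartile-method outlier removal: keep values in [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is an assumed, conventional multiplier, not one taken from the study.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def precision_recall(confusion, cls):
    """Precision and recall for one class.

    confusion[i][j] is the count of samples with true class i predicted as j.
    """
    tp = confusion[cls][cls]
    predicted = sum(row[cls] for row in confusion)  # column sum
    actual = sum(confusion[cls])                    # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Illustrative data only (not from the study).
cleaned = iqr_filter([1, 2, 3, 4, 5, 100])           # drops the extreme value 100
fit_error = rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])   # lower is better
fit_r2 = r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])  # closer to 1 is better
p0, r0 = precision_recall([[8, 2], [1, 9]], cls=0)
```

A low `rmse` together with an `r_squared` close to 1 corresponds to the kind of regression fit the abstract reports. The per-class precision and recall are read directly off the confusion matrix, which is how a multi-class classifier such as a random forest is usually summarized.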

