An Investigation of Molecular Property Prediction and Classification Based on Machine Learning Algorithms

Authors

  • Yanzi Chen, Hangzhou Normal University, Hangzhou 311121, China

DOI:

https://doi.org/10.70267/a5ghvb15

Keywords:

model prediction, classification, random forest

Abstract

With the rapid advancement of science and technology, chemical and physical research has entered an era of complexity and high dimensionality, in which traditional research paradigms struggle to search vast chemical parameter spaces efficiently. The Machine Chemist platform was developed in this context, leveraging big data and intelligent models to automate chemical synthesis, characterization, and testing. This study aims to predict the properties y1, y2, and y3 and the class of 2,580 molecules from their physicochemical properties, and to improve model accuracy through data analysis and modeling. Data preprocessing removed missing values, duplicates, and outliers (the latter identified with the quartile method), yielding an analyzable dataset. A scatter plot of y2 against id suggested a univariate polynomial relationship, which led to the construction of a univariate nonlinear regression model. The model achieved high prediction accuracy, with a low root mean square error and a high coefficient of determination. The relationship between class and the indicators y1 to y3 and x1 to x100 was then examined and found to be mostly nonlinear. A random forest model was established to classify the 2,580 molecules into classes 1 to 4 based on the properties of 200,000 chemical molecules. The model's performance was evaluated using decision trees, confusion matrices, precision, and recall. Finally, the SHAP method was used to assess the impact of each feature indicator on the classification outcome, contributing to the development of a reliable molecular category prediction model.
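The abstract names the quartile method for outlier removal and reports regression quality through the root mean square error and the coefficient of determination, but gives no implementation details. The following is a minimal sketch of those three pieces using only the Python standard library. The function names, the fence multiplier k = 1.5 (Tukey's convention), and the "inclusive" quantile method are illustrative assumptions, not details taken from the study.

```python
import math
import statistics


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the common Tukey convention, assumed here; the study
    does not state which multiplier it used.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A lower RMSE and an R² closer to 1 correspond to the "low root mean square error" and "high coefficient of determination" described above. In practice the filter would be applied per column before fitting, and the metrics computed on held-out predictions.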

Published

2024-07-09

Section

Research Articles

How to Cite

Chen, Y. (2024). An Investigation of Molecular Property Prediction and Classification Based on Machine Learning Algorithms. Computers and Artificial Intelligence, 1(1), 40-44. https://doi.org/10.70267/a5ghvb15