Accurate prediction of chemical product yield is crucial for optimizing reaction conditions and improving efficiency in both research and industrial settings. This study presents a data-driven pipeline for yield prediction using two gradient-boosted tree regressors, CatBoost and XGBoost. The methodology integrates comprehensive data preprocessing (outlier removal, feature engineering, normalization, dimensionality reduction with PCA, and feature selection) to enhance model performance and interpretability. Experimental data were split into training and testing sets, and hyperparameters were tuned via grid search to identify the best configuration for each model. In evaluation, CatBoost outperformed XGBoost, achieving a mean squared error (MSE) of 68.12, a root mean squared error (RMSE) of 8.25, a mean absolute error (MAE) of 5.67, and an R² score of 0.92; that is, the model explains 92% of the variance in yield. XGBoost also performed strongly, with an R² of 0.90. SHAP analysis, an explainable AI technique, identified the most influential features driving model predictions and provided transparency into feature interactions and their impact on yield outcomes. Visualizations of actual versus predicted yields and of feature importance further supported the models' effectiveness. The proposed pipeline offers a systematic and reproducible approach to chemical yield prediction, with insights applicable to process optimization and experimental planning in chemical research and industry.
Keywords: Chemical yield prediction, Machine Learning, CatBoost, XGBoost, Explainable AI (SHAP)
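
The four evaluation metrics reported above are linked by simple definitions: RMSE is the square root of MSE (so an MSE of 68.12 corresponds to an RMSE of about 8.25), and R² compares the residual sum of squares with the total variance of the observed yields. The following is a minimal sketch of those definitions in plain Python. The function name `regression_metrics` and the sample yield values are hypothetical and chosen only for illustration; they are not the study's code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 for paired actual/predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares and the two averaged error measures.
    ss_res = sum(r * r for r in residuals)
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n

    # Total sum of squares around the mean of the observed values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")

    return {
        "mse": mse,
        "rmse": math.sqrt(mse),       # RMSE is the square root of MSE
        "mae": mae,
        "r2": 1.0 - ss_res / ss_tot,  # fraction of variance explained
    }


# Hypothetical yields in percent, for illustration only (not study data).
actual = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 71.0, 79.0]
metrics = regression_metrics(actual, predicted)
```

Because RMSE and MAE are expressed in the same units as the yield itself, they are usually easier to interpret than MSE when judging whether a model's error is acceptable for experimental planning.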