Air pollution remains a critical global challenge, and effective environmental monitoring requires predictive models that are both accurate and interpretable. This study develops and evaluates machine learning models to predict the U.S. EPA Air Quality Index (AQI) from meteorological and pollutant data. We compare five standalone algorithms (XGBoost, TabNet, multilayer perceptron (MLP), support vector machine (SVM), and Random Forest (RF)) and propose a stacking ensemble that combines XGBoost, MLP, RF, and SVM through a logistic regression meta-learner. XGBoost achieves the highest individual performance (98.78% accuracy), while the stacking ensemble raises accuracy to 99.10% and improves predictive robustness, particularly at AQI class boundaries. Feature importance analysis identifies PM2.5, PM10, and CO as the most influential predictors, and spatial visualization reveals urban-industrial hotspots. The framework balances accuracy, computational efficiency, and interpretability: we recommend XGBoost for resource-constrained deployments and the ensemble for high-stakes applications. This work contributes a scalable solution for real-time air quality alerts and policy support, with implications for public health and environmental management.
Key words: Air Quality Index (AQI), machine learning, stacking ensemble, XGBoost, interpretability, environmental monitoring.
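
The stacking architecture described in the abstract (four base learners combined by a logistic regression meta-learner) can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration under stated assumptions, not the study's implementation: `GradientBoostingClassifier` stands in for XGBoost so the example depends only on scikit-learn (an `xgboost.XGBClassifier` could be substituted in its place), and the hyperparameters, the helper name `build_stacking_model`, and the synthetic three-class data are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_stacking_model(random_state=0):
    """Build a stacking classifier (hypothetical helper, illustrative settings)."""
    base_learners = [
        # Assumption: gradient boosting used here as a stand-in for XGBoost.
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=random_state)),
        # MLP and SVM are scale-sensitive, so each gets its own scaler.
        ("mlp", make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=random_state),
        )),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
        ("svm", make_pipeline(StandardScaler(), SVC())),
    ]
    # The meta-learner is trained on out-of-fold predictions of the base
    # learners (cv=3), which limits leakage from the training labels.
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    )


# Synthetic stand-in for pollutant/meteorological features and AQI classes.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = build_stacking_model()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here reflects only the synthetic data and says nothing about the reported AQI results.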