Abstract:
This study applies five machine learning models to the problem of predicting PM2.5 concentration in Nanjing: multiple linear regression, random forest, K-nearest neighbors (KNN), a back-propagation (BP) neural network, and eXtreme Gradient Boosting (XGBoost). The study was based on air quality and meteorological data for Nanjing from 2021 and 2022, and the models were trained and tested after data preprocessing and feature scaling. The evaluation metrics were the correlation coefficient, root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The results showed that all five models had good prediction performance in general, with the random forest model achieving the highest prediction accuracy and the lowest error. A seasonal analysis showed that multiple linear regression and the BP neural network were more accurate in spring and winter than in summer and fall, while the random forest, KNN, and XGBoost models were most accurate in winter. In terms of running efficiency, the BP neural network had the longest training time and the highest memory usage, while the KNN model had the shortest running time and the lowest memory usage. Considering both prediction accuracy and running efficiency, the random forest model performed best in predicting PM2.5 concentration in Nanjing. The methods and models in this study could also serve as a reference for air quality prediction in other regions.
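
For readers who want to reproduce the evaluation, the four metrics named above can be computed as in the following minimal sketch. It is an illustration under stated assumptions, not the code used in the study: the function name `regression_metrics` is hypothetical, the correlation coefficient is assumed to be Pearson's r, and MAPE is assumed to be reported as a percentage.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return Pearson r, RMSE, MAE and MAPE (in percent) for two sequences.

    Hypothetical helper for illustration; assumes every observed value in
    y_true is non-zero, since MAPE is undefined otherwise.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    # Prediction errors (predicted minus observed)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    # Pearson correlation between observed and predicted values
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)

    return {"r": r, "rmse": rmse, "mae": mae, "mape": mape}


# Made-up observed and predicted concentrations, for illustration only
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
```

A lower RMSE, MAE, and MAPE together with a correlation coefficient closer to 1 indicate a better fit. RMSE penalizes large errors more heavily than MAE, which is why the two are usually reported side by side.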