Comparison of models for predicting PM2.5 concentration in Wuhan, China
(USC Thesis Other)
University of Southern California
Department of Preventive Medicine

COMPARISON OF MODELS FOR PREDICTING PM2.5 CONCENTRATION IN WUHAN, CHINA

A Thesis in BIOSTATISTICS
by XIAOHE CHEN

Submitted in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE
August 2018

ABSTRACT

Background: Air pollution levels are very high in urban areas of China, posing a serious threat to public health. Accurate prediction of air pollution episodes is critically important, particularly to help protect vulnerable populations including children and the elderly.

Methods: Air pollution and meteorological data collected from the Huaqiao monitoring station in Wuhan, China were used for predictive modeling of fine particulate matter (PM2.5) concentrations. Initial data processing steps included predictive mean matching, which was applied to impute missing values. Principal component analysis was then used to reduce data dimensionality and multicollinearity between predictor variables. Support vector machines, random forests, gradient boosted decision trees and generalized additive models (GAM) were fitted and compared by cross validation.

Results: A significant delay phenomenon was found during one-step time series prediction. To eliminate this phenomenon and evaluate the true model performance, first-order differenced PM2.5 concentrations were used in our comparative modeling methods. Of the prediction methods, random forests and PCA-random forest had the worst performance, with a correlation coefficient (R) between predicted and observed values lower than 0.68 and a root mean square error (RMSE) larger than 7.6 in both the training and test datasets. PCA-SVM had the best performance on the training set (R = 0.83, RMSE = 5.65) but lower performance on the test set (R = 0.70, RMSE = 7.67). The GAM showed the best performance on the test set (R = 0.78, RMSE = 6.55).

TABLE OF CONTENTS

Chapter 1 Introduction
Chapter 2 Data and Methods
  2.1 Data
  2.2 Data pre-processing
    2.2.1 Missing Value Imputation
    2.2.2 Feature Engineering
    2.2.3 Detrending: Baseline Model
  2.3 Exploratory Analysis using Time Series Models
  2.4 Prediction Methods
    2.4.1 Principal Component Analysis
    2.4.2 Support Vector Machine
    2.4.3 Random Forest
    2.4.4 Gradient Boosted Decision Trees
    2.4.5 Generalized Additive Models
Chapter 3 Results and Discussions

Chapter 1 Introduction

Over the past several decades, air pollution has been an increasingly severe problem in China.
Only in recent years has air quality overtaken economic development as a concern of the highest importance. China has undergone rapid growth and industrialization since 1990, and through this transition period ambient air pollution has risen to rank fourth among risk factors in terms of age-standardized disability-adjusted life-years [1]. Air quality in China has become so problematic that some industrial cities in central and northern areas suffer from frequent haze episodes with concentrations far exceeding the World Health Organization Ambient Air Quality Guidelines [2]. Wuhan, the capital of Hubei Province, is a metropolitan area in central China consisting of three main cities: Hankou, Wuchang and Hanyang. Air in Wuhan is heavily polluted due to heavy industry and constant construction. Combustion for energy production and other industrial activities such as building and mining are the primary sources of air pollution in the region.

Particulate matter air pollution with a mean aerodynamic diameter of 2.5 µm or less (PM2.5) has been studied extensively due to its association with adverse health effects. The small size and light weight of these particles enable them to stay and travel in the air longer than heavier particles, and their chemistry, which depending on the source can include transition metals and other constituents, contributes to the toxicity of PM2.5. Researchers have shown that long-term exposure to combustion-related PM2.5 increases the risk of cardiopulmonary and lung cancer mortality [3]. In China, an estimated 1.37 million premature deaths were attributed to PM2.5 in 2013 [4]. Acute (short-term) exposure to PM2.5 has been associated with increases in cardiopulmonary mortality [5][6] and hospital admissions [7]. In central China, seasonal variations in PM2.5 were significantly associated with increases in cardiovascular mortality [8].

Given this substantial literature on the detrimental health effects associated with PM2.5, there is clearly a need for accurate air quality forecasting methods to assist in protecting human health and welfare. However, PM2.5 concentration series are intermittent and unstable, which makes forecasting them very difficult, so hybrid models that transform and decompose the original time series are usually used to improve one-step forecasting ability [9]. In this study we compare methods for predicting future PM2.5 concentrations given other historical air quality and meteorological data collected from a monitoring station in Wuhan, China.

Chapter 2 Data and Methods

2.1 Data

It has been suggested that meteorological conditions are associated with PM2.5 concentrations both in the United States [10] and in other countries such as Japan [11], so meteorological covariates were considered in our model development in order to predict PM2.5 concentration using statistical models. Meteorological and air pollution data were collected from a monitoring station located in Huaqiao, Hankou (latitude: 30°33′29″, longitude: 114°15′10″). There are no industrial air pollution sources within a 15 km radius of the monitoring station, and at the two closest power plants (at distances of 15 km and 27 km respectively) desulphurization and denitration are conducted so that exhaust gas complies with the Chinese national ambient air quality standard (GB 3095-2012). As a result, the air pollution concentrations at the Huaqiao monitoring site were considered to be relatively stable, with fewer abrupt changes than at other locations with monitoring data.

Hourly data from 2/1/2016 to 2/1/2018 were collected, including gaseous and particulate matter air pollution concentrations (sulfur dioxide SO2, oxides of nitrogen NO2, NO, NOx, carbon monoxide CO, ozone O3, coarse particulate matter PM10 and fine particulate matter PM2.5) and meteorological data (wind speed, wind direction, atmospheric pressure, humidity and temperature). Data from 1/1/2018 onward were used as the test dataset, while data before 2018 were used as the training dataset. The training set had 16,801 observations and the test set had 744 observations.

Figure 2-1. Distribution of hourly PM2.5 concentrations at Huaqiao monitoring station

2.2 Data pre-processing

Due to the working condition of the air pollution monitoring instruments, some hourly values were missing from the dataset. Many abnormal observations were also found, including extremely large values that did not follow the trend (much higher than the preceding and following values). This is to be expected, because a single monitoring station is easily influenced by local air conditions. For instance, when large trucks delivering construction materials passed by or local traffic jams occurred, the air pollution readings would rise significantly. All of these large and abnormal values were encoded as missing values, which produced many missing values that could heavily influence the performance of the prediction models.

Table 2-1. Summary statistics on missing values in the training set

Variable                    Missing values (N)    Missing rate (%)
PM2.5                       118                   0.70
PM10                        237                   1.41
SO2                         357                   2.12
NO2                         378                   2.25
NO                          321                   1.91
NOx                         339                   2.02
CO                          604                   3.59
O3                          547                   3.25
Wind direction and speed    44                    0.26
Atmospheric pressure        51                    0.30
Humidity                    50                    0.30
Temperature                 50                    0.30

Table 2-2. Summary statistics on missing values in the test set

Variable                    Missing values (N)    Missing rate (%)
PM2.5                       14                    1.88
PM10                        15                    2.02
SO2                         13                    1.75
NO2                         12                    1.61
NO                          12                    1.61
NOx                         12                    1.61
CO                          9                     1.21
O3                          9                     1.21
Wind direction and speed    9                     1.21
Atmospheric pressure        9                     1.21
Humidity                    9                     1.21
Temperature                 9                     1.21

Similar rates of missing data were present in the test and training sets. We note that the gaseous pollutants had the highest rates of missing values in the training set, led by CO (3.59%), followed by O3 (3.25%) and NO2 (2.25%), but in the test set PM10 had the most missing values (2.02%).

2.2.1 Missing Value Imputation

The data in this study represented an hourly time series over a two-year period, and for machine learning models to predict hourly PM2.5 concentration correctly, complete data for all input variables were necessary. Simply removing all observations with missing values would have discarded predictive information and broken the continuity of the time series, which our model development required. To solve this problem, multiple imputation by predictive mean matching (PMM), a technique for imputing continuous data, was used. PMM proceeds as follows:

1. Estimate a linear regression of a variable on the other variables for cases with no missing data, producing a set of coefficients B.
2. Randomly draw from the posterior distribution of B to produce a set of new coefficients B*.
3. Use B* to predict each missing value, and identify a set of observed cases whose predicted values are closest to that prediction. The missing value is then replaced by a value drawn at random from this set.
4. Repeat steps 1 through 3 to complete the dataset.

PMM has been shown to preserve the original distribution of empirical data more accurately than parametric multiple imputation methods when distributional assumptions fail [12], and it generally performs better when the sample size is large [12].
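
The four PMM steps above can be illustrated with a small, deliberately simplified sketch. It is not the thesis implementation: it uses a single predictor, approximates the posterior draw in step 2 by perturbing each coefficient independently with its standard error, and the function name `pmm_impute`, the donor count `k` and the `seed` argument are hypothetical choices made only for this example.

```python
import random


def pmm_impute(x, y, k=5, seed=0):
    """Impute None entries of y from predictor x by predictive mean matching."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    mean_x = sum(xi for xi, _ in obs) / n
    mean_y = sum(yi for _, yi in obs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)

    # Step 1: least-squares coefficients B = (a, b) on the complete cases.
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs) / sxx
    a = mean_y - b * mean_x

    # Step 2 (simplified assumption): perturb B with its standard errors to
    # obtain B*, ignoring the covariance between intercept and slope.
    resid_var = sum((yi - a - b * xi) ** 2 for xi, yi in obs) / (n - 2)
    se_b = (resid_var / sxx) ** 0.5
    se_a = (resid_var * (1 / n + mean_x ** 2 / sxx)) ** 0.5
    a_star = rng.gauss(a, se_a)
    b_star = rng.gauss(b, se_b)

    # Step 3: for each missing case, find the k observed cases whose
    # predictions are closest, then copy one donor's observed value.
    donors = [(a + b * xi, yi) for xi, yi in obs]
    imputed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            imputed.append(yi)
            continue
        target = a_star + b_star * xi
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        imputed.append(rng.choice(nearest)[1])
    return imputed
```

Because every imputed value is copied from an observed donor, the completed series cannot contain values outside the observed range, which is one reason PMM tends to preserve the empirical distribution.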

PMM imputation of the training set was implemented with the R package 'mice'. As the missing rates in the training set were not high, the number of multiple imputations was set to 5. The test set was imputed in the same way for the same reason.

2.2.2 Feature Engineering

Feature engineering is a term commonly used in machine learning for the data processing step that creates features (variables) in the form necessary for modeling. For this study, hourly lagged features were created for both the response (PM2.5 concentration) and the input variables (pollutants other than PM2.5 and meteorology). Denoting the value of a variable at time t by X(t), lags of up to 24 hours, namely X(t-1), X(t-2), …, X(t-24), were created for PM2.5. Lags of up to 5 hours, X(t-1), …, X(t-5), were created for the weather covariates and the other air pollutants. Additional time features created for the models included day of the year, day of the month, day of the week, time of the day and season (Spring, Summer, Autumn, Winter). Because variables with higher variance receive higher weights in later methods such as PCA, which aims to extract maximum variance from the data, all continuous and ordinal variables were standardized to zero mean and unit variance.

2.2.3 Detrending: Baseline Model

One-step prediction of PM2.5 concentration was the primary focus of this study. One-step predictions sometimes follow the trend of a time series very well, giving both high prediction accuracy and a small root mean squared error (RMSE) that suggest excellent performance; however, this is a common and rarely discussed trap. The prediction is usually delayed by one step, so that the predicted value is very similar to the previous observed value. Consider a random walk process, which can be written as

$$Y(t+1) = Y(t) + \varepsilon(t),$$

where in our case Y(t+1) is the PM2.5 concentration at time t+1, Y(t) is the PM2.5 concentration one hour earlier, and ε(t) is the error term. If the variance of the error term is extremely small compared to Y, then Y(t+1) will always be similar to Y(t), which means the one-step predicted PM2.5 concentration will always lag by one step. Although the error term of the model may not be independent or normally distributed, we can fit a linear regression model between Y(t) and Y(t+1) as a baseline. The coefficient of determination and RMSE of this regression give the proportion of variance in PM2.5 concentrations that can be predicted from the concentration in the previous hour. Any useful model should outperform this baseline. A better choice, however, is to use the first-order difference of PM2.5 concentration (the error term) as the response variable. If model performance on the difference is high (i.e. the difference between Y(t) and Y(t+1) can be predicted), then the delay phenomenon of one-step prediction is eliminated.

2.3 Exploratory Analysis using Time Series Models

The time series of PM2.5 concentrations was plotted to help assess its stationarity and whether there was a seasonal trend. The Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test [13] was used to check the stationarity of the time series, where a small p-value suggested that the null hypothesis of stationarity should be rejected. If the time series was not stationary, differencing was applied until it followed a stationary process. Autoregressive integrated moving average (ARIMA) models, which are commonly used to model time series, were then fitted. In general, if no seasonality is detected a non-seasonal ARIMA model can be used; otherwise seasonal ARIMA models are preferred. As the PM2.5 data tended to have a long seasonal period of about a year, and hourly data made this seasonal period too long for a seasonal ARIMA model, a non-seasonal ARIMA model was fitted to the training set.
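
The lagged features, the random-walk baseline and the first-order differencing described above can be illustrated with a short sketch. It uses a synthetic random walk rather than the Huaqiao data, and the helper names (`make_lagged_rows`, `difference`, `pearson_r`) are hypothetical. The point is only to show why a high correlation on the raw series can reflect persistence alone, while the differenced series reveals how little is actually predictable.

```python
import math
import random


def make_lagged_rows(series, n_lags):
    """Pair each X(t) with its predictors [X(t-1), ..., X(t-n_lags)]."""
    rows = []
    for t in range(n_lags, len(series)):
        lags = [series[t - k] for k in range(1, n_lags + 1)]
        rows.append((lags, series[t]))
    return rows


def difference(series):
    """First-order difference: Y(t+1) - Y(t) for consecutive hours."""
    return [nxt - cur for cur, nxt in zip(series, series[1:])]


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Synthetic random walk standing in for hourly PM2.5 (not the thesis data).
rng = random.Random(42)
y = [50.0]
for _ in range(999):
    y.append(y[-1] + rng.gauss(0.0, 1.0))

# Baseline: predict Y(t+1) with Y(t). A high R here reflects persistence only.
r_level = pearson_r(y[1:], y[:-1])

# On first-order differences, the previous change carries almost no
# information about the next change when the series is a random walk.
d = difference(y)
r_diff = pearson_r(d[1:], d[:-1])

# 24 hourly lags, as used for the PM2.5 response in Section 2.2.2.
rows = make_lagged_rows(y, n_lags=24)
```

On a series like this, `r_level` is close to 1 while `r_diff` is close to 0, which is the gap between apparent and real predictive skill that motivates modeling the differenced series.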

A non-seasonal ARIMA(p, d, q) model can be written as

$$y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t,$$

where y'_t is the d-th order differenced time series at time t, p is the order of the autoregressive part and q is the order of the moving average part. We initialized a large search range for p, d and q and fitted ARIMA models for all combinations of (p, d, q), choosing the model with the lowest AIC or BIC as the best model. This procedure, also called grid search, was implemented with the R package 'forecast'. Model residuals were checked to ensure they were uncorrelated white noise. The cross-correlation function (CCF) measures the correlation between two time series as a function of the displacement of one relative to the other, and is useful for identifying the time delay phenomenon in time series prediction. The CCF was checked to ensure that the fitted values were not simply the observed values from the previous step.

2.4 Prediction Methods

2.4.1 Principal Component Analysis

Hourly lagged environmental variables and the additional time features created during feature engineering were expected to have high multicollinearity. Pearson's correlation coefficient was used to assess the pairwise correlation between variables. Principal component analysis (PCA) is often used to handle multicollinearity. PCA uses an orthogonal transformation to convert observations of correlated variables into values of linearly uncorrelated variables [14]. Data must be normalized before PCA. The first K principal components that together accounted for 99% of the variance were extracted as new input variables for the predictive models, reducing the dimension of the data. New training and test sets were built from the PCA results.

2.4.2 Support Vector Machine

Support vector machines are a supervised machine learning technique commonly used for classification and regression [15].
Support vector regression finds the coefficients that minimize

$$L(\alpha) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) + \varepsilon\sum_{i=1}^{n}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i(\alpha_i^* - \alpha_i),$$

under the conditions that

$$\sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i \le C, \qquad 0 \le \alpha_i^* \le C,$$

where α_i and α_i* are the Lagrange multipliers for each observation i = 1, …, n, C is a constant, and ε is a hyperparameter. Non-linear support vector regression was used to model PM2.5 concentrations (with original features) and their first-order differences (both with and without principal components). The radial basis function

$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right),$$

where σ is the kernel width and γ = 1/(2σ²), was used as the kernel to project observations into a high-dimensional feature space. The ε of the insensitive loss function was set to 0.1 and the constant C of the regularization term in the Lagrange formulation was set to 1. Grid search and cross-validation were used to find the parameters giving the best model performance.

2.4.3 Random Forest

The random forest model is an ensemble machine learning method based on decision trees and bootstrap aggregating (bagging) [16]. It reduces the overfitting to which single decision trees are prone and generally improves model performance at the expense of slightly higher bias. The random forest algorithm was implemented as follows:

1. Let N be the sample size of the training set, and let M be the number of features.
2. Sample N times from the N observations with replacement to build a new training set; the observations left out (out-of-bag) are used for testing.
3. At every node in a decision tree, randomly sample m (m < M) features from which to choose the split criterion.
4. Do not prune any decision tree.

In this study, random forests were used to model both PM2.5 concentration (with original features) and its first-order differences (both with and without principal components). The number of trees was set to 500, and m was set to M/3. Variable importance was obtained to identify the main factors affecting the response variable.
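
Steps 2 and 3 of the random forest procedure above can be sketched as follows. This is only the resampling logic, not a full forest: the tree-growing step is omitted, and the function names and the sizes N = 1000 and M = 30 are hypothetical values chosen for illustration.

```python
import random


def bootstrap_split(n_obs, rng):
    """Step 2: draw N indices with replacement; unused indices are out-of-bag."""
    in_bag = [rng.randrange(n_obs) for _ in range(n_obs)]
    out_of_bag = sorted(set(range(n_obs)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_features(n_features, rng):
    """Step 3: sample m = M/3 distinct features (at least one) for a node."""
    m = max(1, n_features // 3)
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)

# One bootstrap replicate of a hypothetical training set with N = 1000 rows.
in_bag, oob = bootstrap_split(1000, rng)
oob_fraction = len(oob) / 1000

# Candidate features at one node for a hypothetical M = 30 predictors.
features = candidate_features(30, rng)
```

For large N, the out-of-bag fraction is expected to be near (1 - 1/N)^N ≈ 0.37, so roughly a third of the observations are available to test each tree without a separate hold-out set.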

2.4.4 Gradient Boosted Decision Trees

Gradient boosting is an ensemble machine learning method built from weak prediction models such as decision trees, in which case it is known as gradient boosted decision trees (GBDT) [17]. In each boosting iteration, the updated model takes a gradient descent step on the loss function of the previous model. For GBDT, the number of trees determines the number of iterations. In this study, GBDT was used to model both PM2.5 concentrations (with original features) and their first-order differences (both with and without principal components). The learning rate was set to 0.001, the sample fraction was set to 0.5 and the number of trees was set to 80,000.

2.4.5 Generalized Additive Models

The generalized additive model can be written in the following form:

$$g(E(Y)) = f_1(X_1) + f_2(X_2) + \cdots + f_m(X_m),$$

where an exponential family distribution with a link function g is specified for Y, and the f_j are smooth (non-linear) functions of the predictor variables represented by a set of basis functions [18]. In this study, predictors with relatively high importance in the random forest model were selected for the GAM. An identity link function under the Gaussian framework was used, and the predictors included a mixture of linear and smooth functions. Linear functions of the predictors selected from the random forest model were included, and smooth functions of time, decomposed into monthly, hourly and day-of-the-week trends, were included using a cyclic cubic regression spline for month and a regular cubic regression spline for hour and day of the week. Both PM2.5 concentration and its first-order difference were modeled as the response variable, but no principal components were used in the GAM.

Chapter 3 Results and Discussions

3.1 Results

Time series plots showed an annual seasonal trend in PM2.5 concentrations over 2016 to 2017 (Figure 3-1). Due to the multiple seasonal components in the PM2.5 concentration series, non-seasonal ARIMA models were used for time series prediction.
The KPSS test indicated that PM2.5 concentrations were not stationary (KPSS level = 1.9077, p-value < 0.01), but first-order differenced PM2.5 concentrations were stationary (KPSS level = 0.0084, p-value > 0.1), so ARIMA(p, 1, q) was chosen (Figure 3-2). The model with the lowest AIC ($1.2 \times 10^{5}$) was ARIMA(0,1,1). However, residual diagnostics showed remaining structure in the residuals (Figure 3-3), indicating that the model did not fully explain the temporal variability in PM2.5 (i.e. the residuals did not follow a white noise process).

Figure 3-1. PM2.5 concentrations from Feb 2016.

Figure 3-2. First-order differenced PM2.5 concentrations from Feb 2016.

Figure 3-3. Autocorrelation function of residuals from the ARIMA model.

Comparing the fitted and observed values from the ARIMA(0,1,1) model (Figure 3-4) showed a high correlation (R = 0.9643, RMSE = 9.60), but this is purely due to the delay phenomenon described in Chapter 2. Cross-correlation between the observed and fitted time series (Figure 3-5) shows that the one-hour lagged PM2.5 concentrations were even more strongly correlated with the fitted values (R = 0.9999, RMSE = $2.7 \times 10^{-5}$); the two were in fact the same (Figure 3-6). Thus, the ARIMA(0,1,1) model was not useful for prediction.

Figure 3-4. Scatterplot of observed vs fitted PM2.5 concentrations

Figure 3-5. Cross correlation (CCF) between observed and ARIMA fitted PM2.5 concentrations (lag in hours), showing stronger correlation at lag 1 than at lag 0.

Figure 3-6. Scatterplot of 1 h delayed observed vs fitted PM2.5 concentrations

The random forest and SVM models were fitted using all features discussed in Section 2.2.2 and compared with the baseline model. Correlation coefficients (R) between fitted and observed PM2.5 concentrations, as well as RMSE, were calculated (Table 3-1). We found that the baseline model, which uses only the previous hour's PM2.5 concentration for prediction, already performed very well.
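
The cross-correlation diagnosis used above can be reproduced in a few lines. The sketch below is illustrative only: it builds a synthetic random walk, uses a "model" that simply repeats the previous hour, and checks at which lag the fitted values correlate best with the observations. The names `cross_correlation` and `best_lag` are hypothetical.

```python
import math
import random


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def cross_correlation(observed, fitted, max_lag):
    """Correlate fitted(t) with observed(t - lag) for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        shifted_obs = observed[: len(observed) - lag]
        result[lag] = pearson_r(shifted_obs, fitted[lag:])
    return result


# Synthetic random walk standing in for hourly PM2.5 (not the thesis data).
rng = random.Random(7)
y = [50.0]
for _ in range(499):
    y.append(y[-1] + rng.gauss(0.0, 1.0))

# A delayed "model": each fitted value is the observation one hour earlier.
fitted = [y[0]] + y[:-1]

ccf = cross_correlation(y, fitted, max_lag=5)
best_lag = max(ccf, key=ccf.get)
```

A peak at lag 0 would mean the fitted values track the observations in time; a peak at lag 1, as produced here, is the signature of a prediction that merely echoes the previous hour.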

Both random forests and SVM had performance similar to the baseline model. A pronounced delay phenomenon was observed in the plot of the random forest predictions (Figure 3-7), and the cross-correlation function confirmed it (Figure 3-8). On the test set, the delay was evident as a one-hour shift of the predicted values relative to the observed values.

Table 3-1. Comparison of model performance using all features created in Section 2.2.2

Model            R (training set)    R (test set)    RMSE (training set)    RMSE (test set)
Baseline         0.9667              0.9699          9.54                   10.24
Random Forest    0.9784              0.9786          7.34                   8.64
SVM              0.9893              0.9695          5.41                   10.24

Figure 3-7. One-hour prediction of PM2.5 concentrations from the random forest model on the test set: observed (red), predicted (green)

Figure 3-8. Cross correlation between observed and random forest fitted PM2.5 concentrations, showing stronger correlation at lag 1 than at lag 0

The delay phenomenon was less severe in the SVM predictions according to both the prediction plot (Figure 3-9) and the cross-correlation function (Figure 3-10), suggesting that SVM performed better on this dataset. Nevertheless, it was difficult to assess the true predictive performance of the models because PM2.5 from the previous hour was the dominant influence in all of them.

Figure 3-9. One-hour prediction (SVM) of PM2.5 concentration on the test set: observed (red) and predicted (green)

Figure 3-10. Cross correlation between observed and SVM fitted PM2.5 concentrations, showing stronger correlation at lag 0

To assess the true performance of these models, first-order differenced PM2.5 concentration was used as the response variable for the random forest and SVM models. The increase in node purity for each predictor in the random forest model was calculated to rank variable importance. Variables with low importance (increase in node purity lower than $1.5 \times 10^{4}$), including PM2.5 concentrations lagged by 5 hours or more, were not used in the GAM.

To reduce data dimension and multicollinearity, the predictors in both the training set and the test set were transformed into principal components. In total, 53 components accounting for 99% of the variance were used for the random forest (RF) and SVM models. GBDT was also fitted using the principal components. Correlation coefficients and RMSE between predicted and observed first-order differenced PM2.5 concentrations in both the training set and the test set were recorded (Table 3-2). The cross-correlation function of each model confirmed that the delay phenomenon was eliminated (Figure 3-11). Of the prediction methods, random forests and PCA-random forest had the poorest performance, with a correlation coefficient (R) between predicted and observed values lower than 0.68 and a root mean square error larger than 7.6 in both the training and test datasets. PCA-SVM had the best performance on the training set (R = 0.83, RMSE = 5.65) but lower performance on the test set (R = 0.70, RMSE = 7.67). The GAM showed the best performance on the test set (R = 0.78, RMSE = 6.55).

Table 3-2. Comparison of model performance using first-order difference as the response variable

Model       R (training set)    R (test set)    RMSE (training set)    RMSE (test set)
SVM         0.7740              0.7025          6.32                   7.55
RF          0.6246              0.6776          7.68                   7.88
PCA-SVM     0.8273              0.6975          5.65                   7.67
PCA-RF      0.6392              0.6569          7.73                   8.24
PCA-GBDT    0.7336              0.7234          6.72                   7.44
GAM         0.7582              0.7780          6.29                   6.55

Figure 3-11. Cross correlation between observed and fitted first-order differenced PM2.5 concentration, showing stronger correlation at lag 0

3.2 Discussion

One-step prediction of PM2.5 concentration performed well in this study, as shown in Figure 3-7, Figure 3-9 and Table 3-1, when PM2.5 concentration itself was used as the response variable.
However, after the delay phenomenon was eliminated by modeling first-order differenced PM2.5 concentration, the performance of these methods dropped drastically, meaning that the difference in PM2.5 concentration between the next hour and the present hour could not be accurately predicted with these methods. This indicates that headline model performance should not be the sole standard in time series prediction, and that the delay phenomenon should be addressed because it can inflate confidence in model performance.

The hyperparameters of these models were only tuned roughly, however, and so may not give the best prediction performance. Grid search could be applied to SVM, random forest and GBDT to find parameters giving better model performance at the expense of more computing time. Additional potentially relevant variables, such as hourly traffic and industrial emissions, could be included to improve model performance. More complicated hybrid models could also be used to improve prediction accuracy. The intermittent and unstable nature of PM2.5 concentration time series leads to multiple frequency components, and decomposing them requires more complicated methods. As has been proposed, wavelet transformation (WT), variational mode decomposition (VMD) and the differential evolution (DE) algorithm combined with a back propagation neural network can significantly improve prediction accuracy compared with a back propagation neural network alone [9]. Decomposing the PM2.5 concentration series with WT and VMD, forecasting each variational mode using an SVM optimized by the DE algorithm, and then aggregating the forecasts could be tried to boost prediction performance.

References

[1] Yang, G., Wang, Y., Zeng, Y., Gao, G.F., Liang, X., Zhou, M., et al. (2013) Rapid health transition in China, 1990–2010: findings from the Global Burden of Disease Study 2010. The Lancet, 381, 1987–2015.
[2] Lu, M., Tang, X., Wang, Z., Gbaguidi, A., Liang, S., Hu, K., et al. (2017) Source tagging modeling study of heavy haze episodes under complex regional transport processes over Wuhan megacity, Central China. Environmental Pollution, 231, 612–621.
[3] Pope, C.A., III, et al. (2002) Lung Cancer, Cardiopulmonary Mortality, and Long-term Exposure to Fine Particulate Air Pollution. Journal of the American Medical Association, 287, 1132.
[4] Liu, J., Han, Y., Tang, X., Zhu, J. and Zhu, T. (2016) Estimating adult mortality attributable to PM2.5 exposure in China with assimilated PM2.5 concentrations based on a ground monitoring network. Science of The Total Environment, 568, 1253–1262.
[5] Franklin, M., Koutrakis, P. and Schwartz, J. (2008) The Role of Particle Composition on the Association Between PM2.5 and Mortality. Epidemiology, 19, 680–689.
[6] Franklin, M., Zeka, A. and Schwartz, J. (2006) Association between PM2.5 and all-cause and specific-cause mortality in 27 US communities. Journal of Exposure Science & Environmental Epidemiology, 17, 279–287.
[7] Zanobetti, A., Franklin, M., Koutrakis, P. and Schwartz, J. (2009) Fine particulate air pollution and its components in association with cause-specific emergency admissions. Environmental Health, 8.
[8] Huang, W., Cao, J., Tao, Y., Dai, L., Lu, S.-E., Hou, B., et al. (2012) Seasonal Variation of Chemical Species Associated With Short-Term Mortality Effects of PM2.5 in Xi'an, a Central City in China. American Journal of Epidemiology, 175, 556–566.
[9] Wang, D., Liu, Y., Luo, H., Yue, C. and Cheng, S. (2017) Day-Ahead PM2.5 Concentration Forecasting Using WT-VMD Based Decomposition Method and Back Propagation Neural Network Improved by Differential Evolution. International Journal of Environmental Research and Public Health, 14, 764.
[10] Tai, A.P., Mickley, L.J. and Jacob, D.J. (2010) Correlations between fine particulate matter (PM2.5) and meteorological variables in the United States: Implications for the sensitivity of PM2.5 to climate change. Atmospheric Environment, 44, 3976–3984.
[11] Wang, J. and Ogawa, S. (2015) Effects of Meteorological Conditions on PM2.5 Concentrations in Nagasaki, Japan. International Journal of Environmental Research and Public Health, 12, 9089–9101.
[12] Kleinke, K. (2017) Multiple Imputation Under Violated Distributional Assumptions: A Systematic Evaluation of the Assumed Robustness of Predictive Mean Matching. Journal of Educational and Behavioral Statistics, 42, 371–404.
[13] Kwiatkowski, D., Phillips, P.C., Schmidt, P. and Shin, Y. (1992) Testing the null hypothesis of stationarity against the alternative of a unit root. Journal of Econometrics, 54, 159–178.
[14] Abdi, H. and Williams, L.J. (2010) Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 433–459.
[15] Cortes, C. and Vapnik, V.N. (1995) Support-vector networks. Machine Learning, 20, 273–297.
[16] Tyralis, H. and Papacharalampous, G. (2017) Variable Selection in Time Series Forecasting Using Random Forests. Algorithms, 10, 114.
[17] Hastie, T., Friedman, J. and Tibshirani, R. (2001) Boosting and Additive Trees. The Elements of Statistical Learning, Springer Series in Statistics, 299–345.
[18] Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models. Chapman & Hall/CRC.
Abstract
Background: Air pollution levels are very high in urban areas of China, posing a serious threat to public health. Accurate prediction of air pollution episodes is critically important, particularly to help protect vulnerable populations, including children and the elderly.
Methods: Air pollution and meteorological data collected from the Huaqiao monitoring station in Wuhan, China were used for predictive modeling of fine particulate matter (PM2.5) concentrations. As an initial data processing step, predictive mean matching was applied to impute missing values. Principal component analysis (PCA) was then used to reduce the dimensionality of the data and the multicollinearity among predictor variables. Support vector machines (SVM), random forests, gradient boosted decision trees and generalized additive models (GAM) were fitted and compared by cross validation.
Results: A significant delay phenomenon was found during one-step time series prediction. To eliminate this phenomenon and evaluate the true model performance, first-order differenced PM2.5 concentrations were used in the model comparison. Of the prediction methods, random forests and PCA-random forest had the worst performance, with a correlation coefficient (R) between predicted and observed values lower than 0.68 and a root mean square error (RMSE) larger than 7.6 in both the training and test datasets. PCA-SVM had the best performance on the training set (R = 0.83, RMSE = 6.32) but lower performance on the test set (R = 0.70, RMSE = 7.55). The GAM showed the best performance on the test set (R = 0.78, RMSE = 6.55).
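The delay phenomenon and the two evaluation metrics named in the Results can be illustrated with a small sketch. This is not the thesis code or data: the series below is a synthetic random walk, and the helper names `first_difference`, `pearson_r` and `rmse` are hypothetical, chosen only for this example. It shows why a forecast that merely copies the previous value can score a high R on the original scale, while the same lag-copy idea scores close to zero once the series is first-order differenced.

```python
import numpy as np


def first_difference(series):
    """Return the changes y[t] - y[t-1] for t = 1..n-1."""
    return np.diff(np.asarray(series, dtype=float))


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(observed, predicted)[0, 1])


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


# Synthetic random-walk series standing in for a pollutant record
# (made-up values, NOT the Wuhan measurements).
rng = np.random.default_rng(0)
pm25 = 60.0 + np.cumsum(rng.normal(0.0, 5.0, size=500))

# Lag-copy forecast on the original scale: predict y[t] with y[t-1].
r_levels = pearson_r(pm25[1:], pm25[:-1])
rmse_levels = rmse(pm25[1:], pm25[:-1])

# Same lag-copy idea on the differenced scale: predict d[t] with d[t-1].
changes = first_difference(pm25)
r_changes = pearson_r(changes[1:], changes[:-1])

print(f"R on levels:    {r_levels:.2f}")
print(f"RMSE on levels: {rmse_levels:.2f}")
print(f"R on changes:   {r_changes:.2f}")
```

Under these assumptions, a model evaluated on the differenced series has to predict the direction and size of the change itself, so a score on that scale is harder to earn by simply echoing the last observation.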
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Uncertainty quantification in extreme gradient boosting with application to environmental epidemiology
Machine learning approaches for downscaling satellite observations of dust
Using multi-angle imaging spectroradiometer aerosol mixture properties and meteorology for PM₂.₅ assessment in Iran
Forecasting traffic volume using machine learning and kriging methods
Prediction and feature selection with regularized regression in integrative genomics
Covariance-based distance-weighted regression for incomplete and misaligned spatial data
Assessment of the mortality burden associated with ambient air pollution in rural and urban areas of India
Hierarchical regularized regression for incorporation of external data in high-dimensional models
Downscaling satellite observations of dust with deep learning
Nonlinear modeling of the relationship between smoking and DNA methylation in the multi-ethnic cohort
Identifying prognostic gene mutations in colorectal cancer with random forest survival analysis
Predicting hospital length of stay (LOS) using the National Inpatient Sample
Prediction modeling with meta data and comparison with lasso regression
Cell-specific case studies of enhancer function prediction using machine learning
Statistical downscaling with artificial neural network
Assessment of land cover change in Southern California from 2003 to 2011 using Landsat Thematic Mapper
Spatial analysis of PM₂.₅ air pollution in association with hospital admissions in California
Comparison of Cox regression and machine learning methods for survival analysis of prostate cancer
Machine learning-based breast cancer survival prediction
Associations of ambient air pollution exposures with perceived stress in the MADRES cohort
Asset Metadata
Creator
Chen, Xiaohe (author)
Core Title
Comparison of models for predicting PM2.5 concentration in Wuhan, China
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Publication Date
07/11/2018
Defense Date
07/10/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
feature engineering, machine learning, model comparison, OAI-PMH Harvest, PM 2.5, time series
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Franklin, Meredith (committee chair), Berhane, Kiros (committee member), Lewinger, Juan (committee member)
Creator Email
basten65@163.com,xiaohech@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-14716
Unique identifier
UC11671554
Identifier
etd-ChenXiaohe-6387.pdf (filename), usctheses-c89-14716 (legacy record id)
Legacy Identifier
etd-ChenXiaohe-6387.pdf
Dmrecord
14716
Document Type
Thesis
Rights
Chen, Xiaohe
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA