4 Methodology
A summarised methodology is outlined below. The full description of methodology can be found in the following publication:
Borchers-Arriagada, N., Morgan, G.G., Buskirk, J.V., Gopi, K., Yuen, C., Johnston, F.H., Guo, Y., Cope, M. and Hanigan, I.C. (2024) ‘Daily PM2.5 and Seasonal-Trend Decomposition to Identify Extreme Air Pollution Events from 2001 to 2020 for Continental Australia Using a Random Forest Model’, Atmosphere, 15(1341). Available at: https://doi.org/10.3390/atmos15111341.
4.1 PM2.5 estimation
A random forest model for the estimation of total PM2.5 concentration was developed on a range of satellite-derived, land use, land cover, road and traffic, and weather and climate predictors. The response variable PM2.5 was drawn from state and territory government monitor observations, aggregated from hourly to daily level.
Predictors were first filtered to exclude highly correlated variables (>80%). A grid search of hyperparameters was performed and, for the optimal set of hyperparameters, the variables of importance were selected (variables with importance higher than the mean). The final random forest model was then trained on these selected variables with default hyperparameter settings and used to estimate PM2.5 for a 5km resolution grid of Australia (mainland and Tasmania).
4.2 Seasonal Decomposition model summary
Decomposition of the predicted PM2.5 was performed using Seasonal and Trend Decomposition using LOESS (STL), splitting the total PM2.5 into trend, seasonal and remainder components.
The STL was calculated considering each grid point as a time series starting in 2001, imputing missing values with imputeTS::na_interpolation
. A grid search of the seasonal window parameter of the STL was performed (values = 15, 25, 35, 45) and partial autocorrelation calculated of the STL using stats::pacf
.
Very slight differences were observed in the partial autocorrelation when comparing seasonal windows. A seasonal window of 45 was selected on the basis of having the lowest maximum partial autocorrelation. It was also noted the higher value for seasonal window produced a more regular variation in seasonal component.
Due to the extraordinary bushfire events and resulting PM2.5 levels in January 2020, the STL decomposition did not include data from 2020 year. Decomposition of PM2.5 for 2020 was instead infilled by assuming identical daily seasonal and trend components to 2019 (for the 2001-2020 original dataset) or 2021 (for the 2020-2023 update). The remainder component for 2020 was then calculated as the difference between the estimated total PM2.5 and assumed trend + seasonal
component.
4.3 Flags
For each pixel, a series of binary flags to aid identification of PM2.5 source were extracted from a variety of satellite data and weather grids. Additionally two statistical threshold flags were calculated from the estimated PM2.5 and the STL remainder
component as an indication of unusually high levels of PM2.5.
Dust data was drawn from two sources - CAMS satellite aerosol optical depth (AOD) and MERRA-2 PM2.5 dust - and pixel flagged if it exceeded a specific percentile threshold calculated over the entire time period (2001-2020) and all pixels. In a similar manner, the temperature flags were set to indicate where the pixel had a mean daily temperature below a specific temperature.
Pixels were flagged for active fires (from satellite-derived data) for a given buffer size (ranging from 10km to 500km) if an active fire was present within the buffer of a pixel.
A statistical threshold was calculated from the total PM2.5 concentration by determining the 95th percentile of PM2.5 over the study time period for each pixel. The pixel was then flagged for each day where PM2.5 exceeded this threshold.
A second statistical threshold was derived by trimming the remainder
component to the 99th percentile then calculating the standard deviation (SD) for each pixel. The pixel was then flagged if the remainder
exceeded 2 \(\times\) SD of the trimmed remainder.