5 Caveats
5.1 Missing data
Due to the unavailability of solar exposure data, there are several missing days of prediction prior to 2010. Solar exposure data and predicted PM2.5 is complete from 2010 onwards.
Dust flags derived from the CAMS are not available pre-2003.
Due to unavailability of landscape fire data before November 2020 (the basis of active_fire
predictors), PM2.5 could not be estimated for years pre-2001.
5.2 Negative predicted PM2.5 values
PM2.5 predictions are occasionally negative. This could be a consequence of the training data, which allowed negative values in the response variable. It may also be a result of the application of the model to environments dissimilar to the locations in the training data. Most regulatory monitors are located in urban areas and/or close to human infrastructure, and therefore the model may be less accurate in rural or unpopulated regions.
The prediction_out_range
variable indicates where the predicted PM2.5 fall outside the range of the response variable values provided to train the model. There are a variety of methods to handle negative values:
- consider using
prediction_out_range
or a conditional on the PM2.5 value to identify improbable values - consider removing negative or improbable values (interpolating to fill missing values if needed) or truncating to some chosen threshold
- threshold may be absolute (specific value) or relative (e.g. percentile)
5.3 Infilled data
Raster cells along water bodies were not well-covered by some predictors and consequently the model could not estimate a PM2.5 concentration at those points. This was a result of the predictor extraction process, where data was extracted by point extraction (using the prediction grid cell centroid) from the source predictor raster. For some predictors, the source raster was missing over water bodies and thus would return NA
if the prediction grid cell centroid lay slightly offshore (though the cell might otherwise include significant land area).
As much of Australia’s population is concentrated along coastal regions, the dataset was extrapolated to ensure all missing coastal pixels were filled. Extrapolation was performed for missing coastal pixels by:
- calculating a focal window (size 3, queen-type contiguity) maximum on the binary flags (i.e. if any adjacent cell is flagged, the target cell is also flagged); and
- calculating a focal window (size 3) mean on the PM2.5 prediction and decomposition components (i.e. mean of all non-NA adjacent cells)
Additionally, flags were originally extracted for the superseded V1.2 grid and were remapped to the V1.3 grid, rather than performing a full extraction for the V1.3 grid. As the V1.2 grid did not entirely cover the V1.3 grid, flags were extrapolated by taking a 15km buffer on unmapped grid points and setting the flag to 1 if any V1.2 grid cell within the buffer was flagged 1.
5.4 Artefacts
PM2.5 was predicted on a grid of 5km resolution, however the predictions may not vary smoothly across the entirety of the extent due to the limited resolution of input predictors. This is sometimes noticeable in visualisations of the surface:
- grid pattern of roughly 55km width and height, matching the MERRA-2 satellite raster grid resolution
- circular regions (particular in northern Australia) corresponding to the 100km radius buffers of the
em_fire_dens_fireskm2_100000_daily
predictor - seemingly random small ‘patches’ in fixed positions across daily predictions, corresponding to the predictor
pop_dens_10000