4 Code guide
The structure and syntax of an R targets pipeline may be unfamiliar, depending on your level of coding experience. Some or all of the following may guide your understanding of the workflow, according to how you intend to use it. Links to further useful examples and documentation are provided.
4.1 R targets package
The R targets package, a set of pipeline implementation and management tools, forms the basis of the Air Health Scientific Workflow System. Using targets aids the reproducibility of analyses, tracking input data, parameters, code and dependencies to determine which steps need to be rerun when a change is detected.
A targets pipeline is structured as a list of targets, each of which has a name and associated code block. The result of each target is saved and can be used in other targets by referring to it by name.
On running the pipeline, each target is checked for changes to format, metadata and data, and is rerun if a change is detected. If there has been no change, the target is skipped (its code is not run). Note that if a target changes, all dependent downstream targets will also be rerun, as the dependencies are recorded in target metadata.
More complex pipelines can be set up with the branching functionality of targets and the tarchetypes package. See the targets manual and documentation for further information.
4.1.1 Function-oriented programming
The targets package is designed around function-oriented programming: you write functions and call them from targets. This is in contrast to script-style programming that runs step by step, from top to bottom.
An example of the latter:
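x <- 2
y <- 3
z <- x * y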
In targets:
do_multiply <- function(a, b) {
  a * b
}

list(
  tar_target(x, 2),
  tar_target(y, 3),
  tar_target(z, do_multiply(x, y))
)
While for this example there seems to be little benefit in creating a function (indeed, one could simply write tar_target(z, x*y)), functions aid code clarity and efficiency in more complex workflows. Clearly named functions can self-describe their intended purpose, with defined inputs (arguments) and outputs (return values). Carefully defined, generalised functions may be reused as needed within or across projects.
4.2 Directory and file structure
The key files and folders of the Air Health SWS targets pipeline are as follows:
├── main.R
├── _targets.R
├── R/
│   ├── func_analysis/
│   ├── func_data/
│   ├── func_helpers/
│   ├── func_tidy/
│   └── func_viz/
└── report.Rmd
The main.R script is where you should start. It contains a few lines of code for restoring the packages, visualising the targets pipeline, running the pipeline and viewing target results. It does not make any changes to the workflow. Further exploratory analysis, outside of the pipeline, can be added here.
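As a minimal sketch, a main.R along these lines might look as follows, assuming renv is used for package management (viz_an is one of the targets discussed later in this guide):

renv::restore()              # restore the project package library
targets::tar_visnetwork()    # visualise the pipeline as a dependency graph
targets::tar_make()          # run the pipeline
targets::tar_read(viz_an)    # view the result of a target by name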
The _targets.R script forms the essential core of the targets pipeline. This is where the targets are defined, along with the sourcing of required functions and the specification of required libraries. See Section B.3 for more information.
All custom functions are stored in the R/ folder. These are sourced near the top of the _targets.R script (and in main.R where needed). For clarity, the functions are arranged in a series of folders:
- func_analysis/: functions to analyse input data, acting on tidied and combined data
- func_data/: functions to combine and derive tidied input data to prepare for analysis
- func_helpers/: miscellaneous helper functions that are not directly concerned with the processing or analysis of data
- func_tidy/: functions to clean the initial input data
- func_viz/: functions to visualise or output data
4.3 _targets.R script
As is standard with any _targets.R script, the structure is as follows:
- Load targets library (and tarchetypes if using)
- Source custom functions
- Set global variables
- Set target options (specify which R packages are to be available to the pipeline)
- Define targets
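As a rough sketch of this structure, assuming hypothetical file paths, variable values, packages and a do_tidy_example function (none of which are the project's actual ones):

library(targets)
library(tarchetypes)

# Source custom functions from the R/ folder and its subfolders
invisible(lapply(
  list.files("R", pattern = "\\.R$", recursive = TRUE, full.names = TRUE),
  source
))

# Global variables (placeholder values)
state <- "NSW"
year  <- 2016

# Packages made available to all targets
tar_option_set(packages = c("dplyr", "sf"))

# Define targets
list(
  tar_target(file_example, "data/example.csv", format = "file"),
  tar_target(tidy_example, do_tidy_example(file_example))
)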
Custom functions used in the pipeline are loaded from the R/ folder. Though one can define functions within _targets.R itself, keeping them in separate files helps to keep code tidy. Additionally, the function files can be sourced or copied as needed for other projects.
Global variables, here state and year, are a tidier way to specify parameters that are used repeatedly. The state and year variables are used within the tidy... targets to subset the input data and do not have to be passed to the do_tidy... functions.
Packages specified in tar_option_set are globally available to subsequent targets. If a target fails because a function cannot be found, double-check that the required package is listed here.
Finally, targets are defined in a list. For legibility, targets are divided into five lists, assigned to the variables input, tidy, data, analysis and viz. These are then combined in the final list that defines the pipeline. Each of the target divisions has a corresponding function folder:
tidy ........ R/func_tidy
data ........ R/func_data
analysis .... R/func_analysis
viz ......... R/func_viz
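The combining step at the end of _targets.R might then look like the following one-liner (targets flattens nested lists of targets):

# Combine the five divisions into the pipeline definition
list(input, tidy, data, analysis, viz)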
It is, of course, not necessary to use this organisation of targets and function files. All targets can be directly defined within the pipeline list and all function files can be stored under R/, or some alternative arrangement is possible.
4.3.1 Input and tidy targets
Each target in the input list corresponds to the files of a specific input dataset. This is followed by a tidy list, which reads and tidies the input data with a do_tidy... function customised to the format of each dataset (e.g. the target for meshblock input files has a corresponding tidy target to clean and reformat the input meshblock data as needed).
Datasets are not combined or dependent on one another at this stage. Data is manipulated into a tidy format (e.g. standardising names, converting column data types, converting to long format) and may be subsetted to relevant columns if the dataset is large.
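As an illustration only, an input/tidy pair might look like the following; the file path, target names, column name and function body are hypothetical:

# In R/func_tidy/: a dataset-specific tidy function
do_tidy_meshblock <- function(path) {
  dat <- read.csv(path)
  names(dat) <- tolower(names(dat))          # standardise column names
  dat$mb_code <- as.character(dat$mb_code)   # convert column types
  dat
}

# In _targets.R: the input target tracks the file; the tidy target cleans it
list(
  tar_target(input_meshblock, "data/meshblocks.csv", format = "file"),
  tar_target(tidy_meshblock, do_tidy_meshblock(input_meshblock))
)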
Further processing of data, such as calculation, combination and extraction, takes place in the data list.
4.3.2 data targets
The separate tidied datasets are combined and processed to a stage where the data is ready to be passed to an analysis function. In this pipeline, this covers the extraction of environmental exposure (from source rasters), calculation of counterfactual exposure, calculation of mortality rate in the impact population, application of mortality rate to the study population and aggregation of population-weighted exposure to study population regions.
The end result of this section of targets is a data table of the study population with expected deaths (from application of mortality data) and exposure in baseline and counterfactual cases (from exposure rasters and defined counterfactual value).
It is not necessary to gather all data into a single table, but doing so avoids redundancy if you have several analysis targets working on the same data: you will not have to combine the data at the start of every analysis target.
As a side note, it may be observed that some datasets are split over multiple files; for instance, the ABS meshblock files are split by state. Depending on the dataset, tidying and processing may be best done file by file. This is possible through dynamic branching, where one target iterates over the output of other target(s) as defined by the pattern argument. Such dynamically-branched targets appear as a square in tar_glimpse or tar_visnetwork. (Read more in the targets manual.)
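A sketch of this pattern with hypothetical names, where the tidy target branches over a vector of per-state file paths:

list(
  # One file path per state, tracked so that file changes are detected
  tar_target(files_meshblock,
             list.files("data/meshblocks", full.names = TRUE),
             format = "file"),
  # One branch per file: the tidy function runs once per state
  tar_target(tidy_meshblock,
             do_tidy_meshblock(files_meshblock),
             pattern = map(files_meshblock))
)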
Consider the raster extraction to polygon step, one of the more time-consuming steps, where the inputs are exposure rasters (one file = one year) and ABS meshblock polygons (one file = one state). All the exposure rasters share the same projection, extent and resolution and thus can be stacked into a multi-layer raster; extracting from a multi-layer raster does not cost much more than extracting from a single-layer raster. On the other hand, the cost of extraction is heavily dependent on the number and complexity of polygons. If an extra state were to be added to the pipeline, processing and extracting each set of state meshblocks separately via dynamic branching would minimise cost. Consequently, dynamic branching is used only on the ABS meshblocks dataset and not on the exposure rasters, and branches are recombined only after the extraction process is complete.
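A sketch of that design choice, with hypothetical names and terra as one possible raster library:

list(
  # All exposure raster files (one per year), tracked together
  tar_target(files_exposure,
             list.files("data/exposure", pattern = "\\.tif$", full.names = TRUE),
             format = "file"),
  # Extract the full multi-layer stack once per state branch of meshblocks;
  # the stack is rebuilt inside the command because SpatRaster objects
  # (external pointers) do not serialise well as stored targets
  tar_target(exposure_mb,
             terra::extract(terra::rast(files_exposure),
                            terra::vect(tidy_meshblock),
                            fun = mean),
             pattern = map(tidy_meshblock))
)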
4.3.3 analysis targets
After data processing, analysis is performed through two targets - one defining the health impact response and the other applying the response to the processed data to calculate the attributable number of deaths. Further targets can be added to determine other health measures such as years of life lost.
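The shape of this pair of targets might look as follows; all names here are hypothetical:

list(
  # Define the health impact response function
  tar_target(response_fn, do_define_response()),
  # Apply it to the processed study-population data
  tar_target(attributable_deaths,
             do_apply_response(response_fn, data_study_pop))
)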
4.3.4 viz targets
Naturally, the results of analysis should be visualised in some form, at the very least as a common-sense check. A target can be defined to produce a plot or, if using tarchetypes, to render an R Markdown report that can draw on the output of other targets in its content.
In the targets viz_an and leaflet_an, the code block no longer consists of a single function call. Instead, it is a series of data manipulation statements, aggregating and merging with spatial data, before calling the custom plot function. This structure keeps the plot function generalisable rather than specific to the data inputs. The data manipulation code can also be split off into a separate function, or even a separate target if the results will be used in multiple targets.
The R Markdown report is rendered in a target by specifying the appropriate .Rmd file in the tarchetypes::tar_render function. To minimise computation time and take advantage of the targets pipeline features, minimal processing should occur in the R Markdown code. Target outputs are retrieved with tar_read or tar_load. Read more in the R targets manual - Literate Programming.
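A sketch of the rendering target, using report.Rmd from the directory listing above:

# Render the report as a target; it reruns only when its dependencies change
tarchetypes::tar_render(report, "report.Rmd")

Inside report.Rmd, code chunks call targets::tar_read() or targets::tar_load() to pull in pipeline results, so the heavy computation stays in the pipeline.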
4.4 Metaprogramming
The R targets package includes metaprogramming tools (based on the rlang package), that is, tools that use code to generate code. One of the simpler cases is combining branches after static branching. Static branching may be used to generate a number of branched targets over a series of varying parameters, e.g. using different methods of analysis or modelling, or changing an input parameter to the modelling function.
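As a sketch, static branching and combining with tarchetypes might look like the following (the method values and the do_fit function are hypothetical); the !!!.x splice in tar_combine is an instance of the rlang-based metaprogramming:

library(tarchetypes)

# Generate one model-fitting target per method value
mapped <- tar_map(
  values = list(method = c("linear", "loglinear")),
  tar_target(model_fit, do_fit(data_study_pop, method))
)

# Combine the branched results into a single target
combined <- tar_combine(
  model_fits,
  mapped,
  command = dplyr::bind_rows(!!!.x)
)

list(mapped, combined)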