Data Science Tips
CARDAT data
CARDAT stores a wide array of population, health and environmental datasets. There are three types of data:
- Open data - data that has been published both on Cloud CARDAT and in the public domain. These data have no special access requirements.
- General data - data that is published to Cloud CARDAT and is available to all CARDAT users.
- Restricted data - data that is published to Cloud CARDAT, but requires permission from the Data Owner before access can be granted.
A catalogue of these datasets, including metadata in Ecological Metadata Language (EML) format, is kept in the CAR data inventory. These records are published here.
Access to CARDAT data is administered by the CAR data management team (car.data@sydney.edu.au).
Data cleaning
It is inevitable that the data we work with have imperfections. As part of storing data in CARDAT or EHI we often take steps to clean the data. When this is done, we always:
- save a copy of the original data in the data_provided folder,
- save the code that is used to transform the data in the code folder,
- save the resulting cleaned data in the data_derived folder, and
- include a timestamp as part of the derived data file name.
This ensures that the transformation process is transparent and reproducible.
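As a minimal sketch of the naming step, the Python function below builds the path of a cleaned file inside data_derived with a timestamp in its name. The function name derived_path, the YYYYMMDD timestamp format and its position before the file extension are assumptions made for illustration; they are not a CARDAT rule.

```python
from datetime import datetime
from pathlib import Path


def derived_path(project_dir, original_name, now=None):
    """Return the data_derived path for a cleaned copy of a provided file.

    Assumption: the timestamp is written as YYYYMMDD just before the
    file extension. Adjust this to the convention used in your project.
    """
    now = now or datetime.now()
    source = Path(original_name)
    stamped = f"{source.stem}_{now:%Y%m%d}{source.suffix}"
    return Path(project_dir) / "data_derived" / stamped


# Example: a file supplied as data_provided/sites.csv and cleaned on 2023-04-12
print(derived_path("my_project", "sites.csv", datetime(2023, 4, 12)).as_posix())
```

The script that calls this function would itself be saved in the code folder, so that the path from data_provided to data_derived can be re-run.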
Data cleaning steps
The same data issues come up again and again. Cleaning data may require a number of steps, including:
- fixing character conversion errors for extended character sets,
- ensuring data is saved without embedded metadata such as leading rows with introductory text, summary text at the end or multiple header rows within the data,
- removing commas from numbers and casting them to numeric format,
- converting variables to relevant data formatting standards (for example the International Standards from ISO),
- identification of miscellaneous errors such as data inconsistencies and duplicate variable names,
- reformatting numeric or character strings where appropriate,
- identification of any out-of-range values (based on the specified units), or questionable data in general,
- renaming all files and variables using the lower_case_with_underscores naming convention,
- tabulating frequencies and variable distributions, noting any outliers for review,
- identifying any opportunities to make wide data longer, or to merge many files,
- ensuring that, where there are multiple linked tables, each table includes a key column that allows those tables to be linked unambiguously (such as the site_ID variables),
- checking that values in linked files correspond to values in related files (e.g. a site_ID in one file that is missing from the spatial data file),
- writing as CSV with quote encapsulated strings (for archival purposes),
- coding missing data as NA, or identifying whether these values were actually censored,
- coercing dates to ISO 8601 YYYY-MM-DD format,
- casting nominal variables that use integer codes as character,
- checking that all value labels in enumerated lists are described in the metadata (e.g. codes where “1” = “low”, “2” = “mid” and “3” = “high”),
- identifying and splitting any combined variables (for example, season AND year like “winter-97”),
- identifying any characters in numeric or date variables and considering whether to replace them with NA (adding the original value to a comments variable if possible, or stating the conversion in the metadata),
- identifying any values that Excel may try to convert to a date type (e.g. the site code “1-5” may appear as 5-Jan and should be rewritten as “site_1-5”),
- using a GIS to confirm the distribution of spatial coordinates, and transforming them to geographical coordinates in decimal degrees (GDA94) if they were supplied in metres (UTM or AMG), and
- renaming files to be consistent within the framework of the data storage system.
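Several of the steps above can be sketched with the Python standard library. This is an illustration only, not the code used by the CAR data team: every function name, the set of strings treated as missing, and the assumed day/month/year source date format are hypothetical and should be adapted to the dataset in hand.

```python
import re
from datetime import datetime

# Assumed markers for missing values; extend for each dataset.
MISSING = {"", "na", "n/a", "null", "-", "."}


def snake_case(name):
    """Convert a variable name to lower_case_with_underscores."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def to_number(text):
    """Remove commas and cast to float; None stands for NA."""
    cleaned = text.strip().replace(",", "")
    if cleaned.lower() in MISSING:
        return None
    try:
        return float(cleaned)
    except ValueError:
        # Stray characters: keep the original text in a comments variable.
        return None


def to_iso_date(text, fmt="%d/%m/%Y"):
    """Coerce a date in an assumed source format to ISO 8601 YYYY-MM-DD."""
    return datetime.strptime(text.strip(), fmt).date().isoformat()


def split_season_year(value):
    """Split a combined value such as 'winter-97' into season and year.

    The two-digit year is returned as text because the century is unknown.
    """
    season, year = value.strip().split("-", 1)
    return season.lower(), year


def unmatched_keys(keys, reference_keys):
    """Return key values (e.g. site_ID) that are absent from a related table."""
    return sorted(set(keys) - set(reference_keys))


print(snake_case("Site ID"), to_number("1,234.5"), to_iso_date("12/04/2023"))
print(split_season_year("winter-97"), unmatched_keys(["s1", "s2"], ["s1"]))
```

Keeping each check as a small function makes it easier to save the transformation code alongside the data and to re-run it when the provided data are updated.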
References
White, E., Baldridge, E., Brym, Z., Locey, K., McGlinn, D., & Supp, S. 2013. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution, 6(2), 1–10. doi:10.4033/iee.2013.6b.6.f
Wickham, H. 2014. Tidy Data. Journal of Statistical Software, 59(10), 1–23. doi:10.18637/jss.v059.i10. Available at: http://vita.had.co.nz/papers/tidy-data.pdf
Leek, J. 2014. How to share data with a statistician. https://github.com/jtleek/datasharing
Borer, E., Seabloom, E., Jones, M., and Schildhauer, M. 2009. Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America 90:205–214. http://dx.doi.org/10.1890/0012-9623-90.2.205
Campbell, J. L., Rustad, L. E., Porter, J. H., Taylor, J. R., Dereszynski, E. W., Shanley, J. B., Gries, C., Henshaw, D. L., Martin, M. E., Sheldon, W. M., and Boose, E. R. 2013. Quantity is Nothing without Quality: Automated QA/QC for Streaming Environmental Sensor Data. BioScience, 63, 574-585. http://dx.doi.org/10.1525/bio.2013.63.7.10
Working with GitHub Branches
- create a branch using: git checkout -b working_joe
- make and commit your changes, then push the branch to GitHub
- use GitHub to open a pull request, then merge it and delete the branch
- switch to another branch, then delete your local copy of the branch: git branch -D working_joe
- and remove the stale remote-tracking branch: git fetch origin --prune
Set up Cloud.car-dat.org WebDav using Rclone
author: “Lucas Hertzog” & “Anh Han”
date: “2023-04-12”
Cloud.car-dat.org WebDav is a service provided by NextCloud that allows you to access and manage the files you store on the NextCloud platform using the WebDav protocol. Rclone is a command-line tool that allows you to interact with Cloud.car-dat.org WebDav and perform operations such as syncing, copying, and moving files.
In this guide, we will show you how to set up NextCloud WebDav using Rclone on Windows, Mac, and Linux.
Windows
- Download and install Windows File System Proxy (WinFsp) from https://winfsp.org/.
- Download and install Rclone from https://rclone.org/downloads/.
- Configure Rclone by following these steps:
  - Open a command prompt and run the command “rclone config”.
  - Follow the on-screen instructions to configure Rclone.
  - When prompted for the WebDav URL, enter “https://cloud.car-dat.org/” and press Enter.
  - When prompted for the username and password, enter your NextCloud Files login credentials.
Optionally, you can create a .bat file named “mountrclone.bat” with the following content to automatically mount the drive:
@echo off
start c:\rclone\rclone mount --vfs-cache-mode full Cloud.car-dat.org:/ z:
This command mounts the Rclone remote named “Cloud.car-dat.org” to the Z: drive letter. The remote name must match the name you chose when running “rclone config”.
MacOS
- Install Rclone by running the command “brew install rclone” in Terminal.
- Configure Rclone by running the command “rclone config” in Terminal.
- Follow the on-screen instructions to configure Rclone.
- When prompted for the WebDav URL, enter “https://cloud.car-dat.org/” and press Enter.
- When prompted for the username and password, enter your NextCloud Files login credentials.
Linux
- Install Rclone by running the command “sudo apt-get install rclone” in Terminal.
- Configure Rclone by running the command “rclone config” in Terminal.
- Follow the on-screen instructions to configure Rclone.
- When prompted for the WebDav URL, enter “https://cloud.car-dat.org/” and press Enter.
- When prompted for the username and password, enter your NextCloud Files login credentials.
More information, including instructions for setting up automatic syncing, is available at https://linuxnewbieguide.org/rclone/
Set up local Cloud.car-dat.org Desktop Synchronization
author: “Anh Han”
date: “2023-12-15”
- Download NextCloud Desktop Client for Windows, MacOS and Linux from https://nextcloud.com/install/.
- Install the downloaded setup application.
- Reboot your operating system for the NextCloud configuration changes to take effect.
- Launch NextCloud desktop.
- When prompted for the NextCloud account, select the “Log in” option.
- When prompted for the Server address, enter “https://cloud.car-dat.org/” and press Enter.
- NextCloud automatically switches to the Cloud CARDAT server web interface for account login. When prompted for the username and password, enter your Cloud CARDAT login credentials and press “Grant access”.
- Configure the Sync Folder by accessing “Settings” and pressing “Add Folder Sync Connection” in NextCloud desktop.
- Choose the local folder on your computer that NextCloud will synchronise, changing the default directory if required.