
Quick start guide
Dax Kellie
2025-03-24
Source: vignettes/quick_start_guide.Rmd
corella is an R package for standardising data to the Darwin Core Standard. Darwin Core is the primary data standard for species occurrence data (records of organisms observed at a location and time) in the Atlas of Living Australia (ALA), other Living Atlases and the Global Biodiversity Information Facility (GBIF). The standard makes it possible to compile data from a variety of sources, which makes data easier to share, use and reuse.
The main tasks in standardising data to Darwin Core are:
- Ensure columns use valid Darwin Core terms as column names
- Include all required information (e.g. scientific name, unique observation ID, valid date)
- Ensure columns contain valid data
This process can be daunting. corella is designed to reduce confusion about how to get started, and to help determine which Darwin Core terms might match your column names.
Install
To install from CRAN:
install.packages("corella")
To install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("AtlasOfLivingAustralia/corella")
To load the package:
library(corella)
Rename, add or edit columns
Here is a minimal example dataset of cockatoo observations. Our dataframe df contains columns of information that we would like to standardise using Darwin Core.
library(tibble)
library(lubridate)
df <- tibble(
  latitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes
  longitude = c(149.125, 149.133),
  date = c("14-01-2023", "15-01-2023"),
  time = c("10:23:00", "11:25:00"),
  month = c("January", "February"),
  day = c(100, 101),
  species = c("Callocephalon fimbriatum", "Eolophus roseicapilla"),
  n = c(2, 3),
  crs = c("WGS84", "WGS8d"),
  country = c("Australia", "Denmark"),
  continent = c("Oceania", "Europe")
)
df
#> # A tibble: 2 × 11
#> latitude longitude date time month day species n crs country
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
#> 1 -35.31 149. 14-01-2023 10:23:00 Janu… 100 Calloc… 2 WGS84 Austra…
#> 2 -35.273 149. 15-01-2023 11:25:00 Febr… 101 Eoloph… 3 WGS8d Denmark
#> # ℹ 1 more variable: continent <chr>
We can standardise our data with set_ functions. Each set_ function has a suffix that identifies the type of data it standardises (e.g. set_coordinates(), set_datetime()), and the arguments of set_ functions are valid Darwin Core terms (i.e. column names). By grouping Darwin Core terms by data type, corella makes it easier for users to find relevant Darwin Core terms to use as column names (one of the most onerous parts of Darwin Core for new users).
Let’s specify that the scientific name (i.e. genus + species name) in our data is in the species column by using set_scientific_name(). You’ll notice 2 things happen:
- The species column in our dataframe is renamed to scientificName
- set_scientific_name() runs a check on our species column to make sure it is formatted correctly
df |>
  set_scientific_name(scientificName = species)
#> ⠙ Checking 1 column: scientificName
#> ✔ Checking 1 column: scientificName [335ms]
#>
#> # A tibble: 2 × 11
#> latitude longitude date time month day n crs country continent
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#> 1 -35.31 149. 14-01-2023 10:23… Janu… 100 2 WGS84 Austra… Oceania
#> 2 -35.273 149. 15-01-2023 11:25… Febr… 101 3 WGS8d Denmark Europe
#> # ℹ 1 more variable: scientificName <chr>
What happens when we add a column with an error in it? The latitude column in df is a character column, instead of a numeric column as it should be. When we try to update the column name using set_coordinates(), an error tells us the class is wrong.
df |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = latitude)
#> ⠙ Checking 1 column: scientificName
#> ✔ Checking 1 column: scientificName [315ms]
#>
#> ⠙ Checking 2 columns: decimalLatitude and decimalLongitude
#> ✔ Checking 2 columns: decimalLatitude and decimalLongitude [620ms]
#>
#> Error in `check_decimalLatitude()`:
#> ! decimalLatitude must be a numeric vector, not character.
Fix or update columns
To change, edit or fix a column, users can edit the column within the set_ function.
Each set_ function is essentially a specialised dplyr::mutate(), meaning users can edit columns using the same processes they would when using dplyr::mutate(). We can fix the latitude column so that it is class numeric within the set_coordinates() function.
df_darwincore <- df |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = as.numeric(latitude))
#> ⠙ Checking 1 column: scientificName
#> ⠹ Checking 1 column: scientificName
#> ✔ Checking 1 column: scientificName [316ms]
#>
#> ⠙ Checking 2 columns: decimalLatitude and decimalLongitude
#> ✔ Checking 2 columns: decimalLatitude and decimalLongitude [620ms]
#>
df_darwincore
#> # A tibble: 2 × 11
#> date time month day n crs country continent scientificName
#> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 14-01-2023 10:23:00 January 100 2 WGS84 Austra… Oceania Callocephalon…
#> 2 15-01-2023 11:25:00 Februa… 101 3 WGS8d Denmark Europe Eolophus rose…
#> # ℹ 2 more variables: decimalLatitude <dbl>, decimalLongitude <dbl>
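The class problem in the original latitude column, and the conversion applied inside set_coordinates() above, can be reproduced in base R without corella. This is a small illustrative sketch; the variable names are hypothetical and not part of the package.

```r
# Mixing a number and a string in c() coerces every element to character,
# which is why the latitude column failed the numeric check.
latitude <- c(-35.310, "-35.273")
class(latitude)
#> [1] "character"

# as.numeric() converts the strings back to numbers.
latitude_fixed <- as.numeric(latitude)
class(latitude_fixed)
#> [1] "numeric"

# Strings that cannot be parsed become NA (with a warning), so it is worth
# confirming that no NAs were introduced by the conversion.
anyNA(latitude_fixed)
#> [1] FALSE
```

Checking for NA after the conversion is a useful habit, because a malformed coordinate would otherwise pass the class check silently as a missing value.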
Auto-detect columns
corella can also detect when a data frame already contains columns named with valid Darwin Core terms. For example, df contains columns with locality information. We can add set_locality() to our pipe to identify these columns, but because two of them (country and continent) already have valid Darwin Core terms as column names, set_locality() will detect these columns in df and check them automatically.
df |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLongitude = longitude,
                  decimalLatitude = as.numeric(latitude)) |>
  set_locality()
#> ⠙ Checking 1 column: scientificName
#> ✔ Checking 1 column: scientificName [311ms]
#>
#> ⠙ Checking 2 columns: decimalLatitude and decimalLongitude
#> ✔ Checking 2 columns: decimalLatitude and decimalLongitude [620ms]
#>
#> ⠙ Checking 2 columns: country and continent
#> ✔ Checking 2 columns: country and continent [618ms]
#>
#> # A tibble: 2 × 11
#> date time month day n crs country continent scientificName
#> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 14-01-2023 10:23:00 January 100 2 WGS84 Austra… Oceania Callocephalon…
#> 2 15-01-2023 11:25:00 Februa… 101 3 WGS8d Denmark Europe Eolophus rose…
#> # ℹ 2 more variables: decimalLatitude <dbl>, decimalLongitude <dbl>
df_darwincore
#> # A tibble: 2 × 11
#> date time month day n crs country continent scientificName
#> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 14-01-2023 10:23:00 January 100 2 WGS84 Austra… Oceania Callocephalon…
#> 2 15-01-2023 11:25:00 Februa… 101 3 WGS8d Denmark Europe Eolophus rose…
#> # ℹ 2 more variables: decimalLatitude <dbl>, decimalLongitude <dbl>
corella’s auto-detection saves users from specifying every single column, reducing the amount of typing when their data already has valid Darwin Core column names!
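Conceptually, this kind of detection amounts to comparing column names against a list of known terms. The base R sketch below illustrates the idea only. It is not corella's implementation, and locality_terms is a hypothetical, hand-picked subset of Darwin Core location terms chosen for the example.

```r
# Column names from the example data frame
column_names <- c("latitude", "longitude", "date", "time", "month", "day",
                  "species", "n", "crs", "country", "continent")

# Hypothetical, hand-picked subset of Darwin Core locality terms
# (an assumption for illustration, not the list corella uses internally)
locality_terms <- c("continent", "country", "countryCode",
                    "stateProvince", "locality")

# Columns that are already named with one of these terms
detected <- intersect(column_names, locality_terms)
detected
#> [1] "country"   "continent"
```

The two detected names match the columns that set_locality() reported checking in the output above.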
Suggest a workflow
Unsure where to start? Confused about the minimum requirements to share your data? Using suggest_workflow() is the easiest way to get started in corella.
suggest_workflow() provides a high-level summary designed to show:
- Which column names match valid Darwin Core terms
- The minimum requirements for data in a Darwin Core Archive (i.e. a completed data resource in Darwin Core standard)
- A suggested workflow to help you add the minimum required columns
- Additional functions that could be added to a piped workflow (based on the provided dataset’s matching Darwin Core column names)
The intention of suggest_workflow() is to provide a general help function whenever users feel uncertain about what to do next. Let’s see what the output says about our original dataframe df.
df |>
  suggest_workflow()
#>
#> ── Matching Darwin Core terms ──────────────────────────────────────────────────
#> Matched 4 of 11 column names to DwC terms:
#> ✔ Matched: continent, country, day, month
#> ✖ Unmatched: crs, date, latitude, longitude, n, species, time
#>
#> ── Minimum required Darwin Core terms ──────────────────────────────────────────
#>
#> Type Matched term(s) Missing term(s)
#> ✖ Identifier (at least one) - occurrenceID, catalogNumber, recordNumber
#> ✖ Record type - basisOfRecord
#> ✖ Scientific name - scientificName
#> ✖ Location - decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
#> ✖ Date/Time - eventDate
#>
#> ── Suggested workflow ──────────────────────────────────────────────────────────
#>
#> To make your data Darwin Core compliant, use the following workflow:
#>
#> df |>
#> set_occurrences() |>
#> set_datetime() |>
#> set_coordinates() |>
#> set_scientific_name()
#>
#> ── Additional functions
#> Based on your matched terms, you can also add to your pipe:
#> • `set_datetime()` `set_locality()`
#> ℹ See all `set_` functions at
#> http://corella.ala.org.au/reference/index.html#add-rename-or-edit-columns-to-match-darwin-core-terms
suggest_workflow() updates the suggested function pipe so that it only suggests the functions still needed to standardise your data correctly.
For example, after using one of the suggested functions, set_occurrences(), if we run suggest_workflow() again, the output message no longer suggests set_occurrences().
df_edited <- df |>
  set_occurrences(
    occurrenceID = seq_len(nrow(df)),
    basisOfRecord = "humanObservation"
  )
df_edited |>
  suggest_workflow()
#>
#> ── Matching Darwin Core terms ──────────────────────────────────────────────────
#> Matched 6 of 13 column names to DwC terms:
#> ✔ Matched: basisOfRecord, continent, country, day, month, occurrenceID
#> ✖ Unmatched: crs, date, latitude, longitude, n, species, time
#>
#> ── Minimum required Darwin Core terms ──────────────────────────────────────────
#>
#> Type Matched term(s) Missing term(s)
#> ✔ Identifier (at least one) occurrenceID -
#> ✔ Record type basisOfRecord -
#> ✖ Scientific name - scientificName
#> ✖ Location - decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
#> ✖ Date/Time - eventDate
#>
#> ── Suggested workflow ──────────────────────────────────────────────────────────
#>
#> To make your data Darwin Core compliant, use the following workflow:
#>
#> df |>
#> set_datetime() |>
#> set_coordinates() |>
#> set_scientific_name()
#>
#> ── Additional functions
#> Based on your matched terms, you can also add to your pipe:
#> • `set_datetime()` `set_locality()`
#> ℹ See all `set_` functions at
#> http://corella.ala.org.au/reference/index.html#add-rename-or-edit-columns-to-match-darwin-core-terms
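Both summaries list eventDate as missing, and the date column in df stores day-month-year strings. As one possible preparation step (not shown in the original workflow), the strings can be parsed into Date values with base R. The variable names below are illustrative assumptions.

```r
# Dates from the example data, stored as day-month-year strings
date <- c("14-01-2023", "15-01-2023")

# Parse with an explicit format so that day and month are not confused
event_date <- as.Date(date, format = "%d-%m-%Y")
event_date
#> [1] "2023-01-14" "2023-01-15"

class(event_date)
#> [1] "Date"
```

Assuming set_datetime() follows the same mutate-like pattern as the other set_ functions, a parsed column like this could be supplied as eventDate. The function's documentation is the place to confirm which classes it accepts.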
Test your data
If your dataset already uses valid Darwin Core terms as column names, instead of working through each set_ function, you might wish to run tests on your entire dataset. To run checks on your data like a test suite, use check_dataset(). Much like devtools::test() or devtools::check(), check_dataset() runs the relevant check on each matching Darwin Core column and returns a summary of the results, along with any error messages returned by those checks.
df <- tibble(
  latitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes
  longitude = c(149.125, 149.133),
  date = c("14-01-2023", "15-01-2023"),
  individualCount = c(0, 2),
  species = c("Callocephalon fimbriatum", "Eolophus roseicapilla"),
  country = c("AU", "AU"),
  occurrenceStatus = c("present", "present")
)
df |>
  check_dataset()
#> ℹ Testing data
#> ✔ | E P | Column
#> ⠙ | 0 individualCount
#> ✔ | 1 ✖ | individualCount [29ms]
#>
#> ⠙ | 0 country
#> ✔ | 1 ✖ | country [24ms]
#>
#> ══ Results ═════════════════════════════════════════════════════════════════════
#>
#> [ Errors: 2 | Pass: 0 ]
#> ℹ Checking Darwin Core compliance
#> ✖ Data does not meet minimum Darwin Core column requirements
#> ℹ Use `suggest_workflow()` to see more information.
#> ── Error in term ───────────────────────────────────────────────────────────────
#>
#> individualCount values do not match occurrenceStatus.
#> ✖ Found 1 row where individualCount = 0 but occurrenceStatus = "present".
#> Unexpected value in country.
#> ✖ Invalid value: "AU"
The goal of check_dataset() is to make running many checks more efficient, and to cater to users who prefer a test-suite-like workflow.
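To see why the individualCount check fails, the same condition can be reproduced by hand. This is a base R sketch of the rule described in the error message, not corella's internal check, and conflicting_rows is a hypothetical name.

```r
# Values from the example data frame
individualCount <- c(0, 2)
occurrenceStatus <- c("present", "present")

# Rows where a count of zero is paired with a "present" status
conflicting_rows <- which(individualCount == 0 & occurrenceStatus == "present")
conflicting_rows
#> [1] 1

# Number of conflicting rows, matching the "Found 1 row" message
length(conflicting_rows)
#> [1] 1
```

Resolving the conflict means deciding which column is wrong: either the count should be greater than zero, or the status for that row should not be "present".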