
A Darwin Core Archive consists of several components that ensure a dataset is structured correctly (and can be reconstructed correctly in the future). These components include the dataset itself, a metadata statement, and an XML file detailing how the columns in the data relate to each other.

corella is designed to check whether a dataset conforms to the Darwin Core standard. This involves two main steps:

* Ensuring that a dataset uses valid Darwin Core terms as column names
* Checking that the data in each column is the correct type for the specified Darwin Core term

This vignette gives additional information about the second step: checking the data in each column.

Checking individual terms

corella consists of many internal check_ functions. Each one runs basic validation checks on the specified column to ensure the data conforms to the Darwin Core term’s expected data type.
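
To illustrate the general pattern, here is a simplified sketch of the kind of type and range validation such a function performs. This is a hypothetical base-R example (the function name and implementation are ours, not corella's internals):

```r
# Hypothetical sketch of the validation an internal check_ function
# performs; not corella's actual implementation
check_latitude_sketch <- function(x) {
  # Darwin Core requires decimalLatitude to be numeric
  if (!is.numeric(x)) {
    stop("decimalLatitude must be a numeric vector, not ", class(x)[1], ".")
  }
  # Latitudes must fall between -90 and 90 degrees
  if (any(x < -90 | x > 90, na.rm = TRUE)) {
    stop("Value is outside of expected range in decimalLatitude.")
  }
  invisible(x)
}

check_latitude_sketch(c(-35.310, -35.273))  # passes silently
```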

For example, here is a very small dataset with two observations of galahs (Eolophus roseicapilla) (class character), their latitude and longitude coordinates (class numeric), and a location description in the column place (class character).

library(corella)
library(tibble)

df <- tibble::tibble(
  name = c("Eolophus roseicapilla", "Eolophus roseicapilla"),
  latitude = c(-35.310, -35.273),
  longitude = c(149.125, 149.133),
  place = c("a big tree", "an open field")
)

df
## # A tibble: 2 × 4
##   name                  latitude longitude place        
##   <chr>                    <dbl>     <dbl> <chr>        
## 1 Eolophus roseicapilla    -35.3      149. a big tree   
## 2 Eolophus roseicapilla    -35.3      149. an open field

I can use the function use_coordinates() to specify which of my columns refer to the valid Darwin Core terms decimalLatitude and decimalLongitude. I have intentionally assigned the wrong column, place, to decimalLatitude. corella will return an error because the decimalLatitude and decimalLongitude fields must be numeric under the Darwin Core standard. This error comes from a small internal checking function called check_decimalLatitude().

df |>
  use_coordinates(decimalLatitude = place, # wrong column
                  decimalLongitude = longitude)
##  Checking 2 columns: decimalLatitude and decimalLongitude [644ms]
## 
## Error in `check_decimalLatitude()`:
## ! decimalLatitude must be a numeric vector, not character.

Supported terms

corella contains internal check_ functions for all individual Darwin Core terms that are supported. These are as follows:

Supported Darwin Core terms and their associated functions:

| Term | Check function | Use function |
|------|----------------|--------------|
| basisOfRecord | check_basisOfRecord() | use_occurrences() |
| occurrenceID | check_occurrenceID() | use_occurrences() |
| scientificName | check_scientificName() | use_scientific_name() |
| decimalLatitude | check_decimalLatitude() | use_coordinates() |
| decimalLongitude | check_decimalLongitude() | use_coordinates() |
| geodeticDatum | check_geodeticDatum() | use_coordinates() |
| coordinateUncertaintyInMeters | check_coordinateUncertaintyInMeters() | use_coordinates() |
| eventDate | check_eventDate() | use_datetime() |
| continent | check_continent() | use_locality() |
| country | check_country() | use_locality() |
| countryCode | check_countryCode() | use_locality() |
| stateProvince | check_stateProvince() | use_locality() |
| locality | check_locality() | use_locality() |
| kingdom | check_kingdom() | use_taxonomy() |
| phylum | check_phylum() | use_taxonomy() |
| class | check_class() | use_taxonomy() |
| order | check_order() | use_taxonomy() |
| family | check_family() | use_taxonomy() |
| genus | check_genus() | use_taxonomy() |
| specificEpithet | check_specificEpithet() | use_taxonomy() |
| vernacularName | check_vernacularName() | use_taxonomy() |
| individualCount | check_individualCount() | use_abundance() |
| organismQuantity | check_organismQuantity() | use_abundance() |
| organismQuantityType | check_organismQuantityType() | use_abundance() |
| datasetID | check_datasetID() | use_collection() |
| datasetName | check_datasetName() | use_collection() |
| catalogNumber | check_catalogNumber() | use_collection() |
| coordinatePrecision | check_coordinatePrecision() | use_coordinates() |
| taxonRank | check_taxonRank() | use_scientific_name() |
| scientificNameAuthorship | check_scientificNameAuthorship() | use_scientific_name() |
| year | check_year() | use_datetime() |
| month | check_month() | use_datetime() |
| day | check_day() | use_datetime() |
| eventTime | check_eventTime() | use_datetime() |
| individualID | check_individualID() | use_individual_traits() |
| lifeStage | check_lifeStage() | use_individual_traits() |
| sex | check_sex() | use_individual_traits() |
| vitality | check_vitality() | use_individual_traits() |
| reproductiveCondition | check_reproductiveCondition() | use_individual_traits() |
| recordedBy | check_recordedBy() | use_observer() |
| recordedByID | check_recordedByID() | use_observer() |
| eventID | check_eventID() | use_events() |
| eventType | check_eventType() | use_events() |
| parentEventID | check_parentEventID() | use_events() |


When a user assigns a column to a matching Darwin Core term in a use_ function (or when corella detects the column/term automatically), the use_ function triggers that term's check_ function. This process ensures that the data is correctly formatted before being saved in a Darwin Core Archive.
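
Conceptually, a use_ function renames the supplied column to its Darwin Core term and then runs the matching check. Here is a minimal base-R sketch of that rename-then-check pattern (the function name and body are hypothetical; corella's real use_ functions also handle tidy evaluation, automatic detection, and messaging):

```r
# Hypothetical sketch of the rename-then-check pattern a use_ function
# follows; not corella's actual implementation
use_latitude_sketch <- function(df, column) {
  # Rename the user's column to its Darwin Core term
  names(df)[names(df) == column] <- "decimalLatitude"
  # Trigger the matching check before returning the data
  if (!is.numeric(df$decimalLatitude)) {
    stop("decimalLatitude must be a numeric vector, not ",
         class(df$decimalLatitude)[1], ".")
  }
  df
}

df <- data.frame(latitude = c(-35.310, -35.273))
use_latitude_sketch(df, "latitude")
```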

It’s useful to know that these internal check_ functions exist because they are the building blocks of the full suite of checks, which users can run with check_dataset().
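
To see why individual checks compose naturally into a suite, here is a base-R sketch of how a runner could match columns to checks and tally passes and failures (hypothetical names and logic; check_dataset() itself is considerably more sophisticated):

```r
# Hypothetical sketch of running many checks at once;
# not how check_dataset() is actually implemented
checks <- list(
  decimalLatitude  = function(x) stopifnot(is.numeric(x), all(abs(x) <= 90)),
  decimalLongitude = function(x) stopifnot(is.numeric(x), all(abs(x) <= 180))
)

run_checks_sketch <- function(df) {
  # Only columns whose names match a known Darwin Core term are checked
  matched <- intersect(names(df), names(checks))
  passed <- vapply(matched, function(term) {
    # Record whether each term's check ran without error
    !inherits(try(checks[[term]](df[[term]]), silent = TRUE), "try-error")
  }, logical(1))
  data.frame(term = matched, passed = passed, row.names = NULL)
}

df <- data.frame(decimalLatitude  = c(-35.3, 95),  # 95 is out of range
                 decimalLongitude = c(149.1, 149.2))
run_checks_sketch(df)
```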

Checking a full dataset

For users who are familiar with Darwin Core standards, or who have datasets that already conform to Darwin Core standards (or are very close), it might be more convenient to run many checks at one time.

Users can run check_dataset() as a “test suite” on their dataset. check_dataset() detects all columns that match valid Darwin Core terms and runs every matching check_ function at once, interactively, much like devtools::test() or devtools::check().

check_dataset() returns:

* A summary table of whether each matching column’s check passed or failed
* The number of errors and passed columns
* Whether the data meets minimum Darwin Core requirements
* The first 5 error messages returned by checks

df <- tibble::tibble(
  decimalLatitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes
  decimalLongitude = c(149.125, 149.133),
  date = c("14-01-2023", "15-01-2023"),
  individualCount = c(0, 2),
  scientificName = c("Callocephalon fimbriatum", "Eolophus roseicapilla"),
  country = c("AU", "AU"),
  occurrenceStatus = c("present", "present")
  )

df |>
  check_dataset()
##  Testing data
##  | E P | Column
##  | 3  | decimalLatitude   [51ms]
##  | 0  | decimalLongitude  [9ms]
##  | 1  | individualCount   [33ms]
##  | 0  | scientificName    [9ms]
##  | 1  | country           [25ms]
## 
## ══ Results ═════════════════════════════════════════════════════════════════════
## 
## [ Errors: 5 | Pass: 2 ]
## 
##  Data does not meet minimum Darwin Core requirements
##  Use `suggest_workflow()` to see more information.
## ── Error in decimalLatitude ────────────────────────────────────────────────────
## 
## decimalLatitude must be a numeric vector, not character.
## decimalLatitude must be a numeric vector, not character.
## Value is outside of expected range in decimalLatitude.
##  Column contains values outside of -90 <= x <= 90.
## ── Error in individualCount ────────────────────────────────────────────────────
## 
## individualCount values do not match occurrenceStatus.
##  Found 1 row where individualCount = 0 but occurrenceStatus = "present".
## ── Error in country ────────────────────────────────────────────────────────────
## 
## Unexpected value in country.
##  Invalid value: "AU"

Note that check_dataset() currently only accepts occurrence-level datasets. Datasets with hierarchical event data (e.g. multiple or repeated surveys, or site locations) are not currently supported.

Users have options

corella offers two ways to check a dataset, as detailed above: running individual checks through use_ functions, or running a “test suite” with check_dataset(). We hope these alternatives suit different workflows, allowing users to choose their preferred method, or to switch between methods as they standardise their data.