Checking your dataset
Dax Kellie
2024-12-20
checking-your-dataset.Rmd
A Darwin Core Archive consists of several pieces that ensure a dataset is structured correctly (and can be restructured correctly in the future). These pieces include the dataset, a metadata statement, and an XML file detailing how the columns in the data relate to each other.
corella is designed to check whether a dataset conforms to the Darwin Core standard. This involves two main steps:

* Ensuring that a dataset uses valid Darwin Core terms as column names
* Checking that the data in each column is the correct type for the specified Darwin Core term

This vignette gives additional information about the second step of checking each column’s data.
Checking individual terms
corella consists of many internal `check_` functions. Each one runs basic validation checks on the specified column to ensure the data conforms to the Darwin Core term's expected data type.
For example, here is a very small dataset with two observations of galahs (Eolophus roseicapilla) (class `character`), their latitude and longitude coordinates (class `numeric`), and a location description in the column `place` (class `character`).
library(corella) # provides use_coordinates() and check_dataset()
library(tibble)
df <- tibble::tibble(
  name = c("Eolophus roseicapilla", "Eolophus roseicapilla"),
  latitude = c(-35.310, -35.273),
  longitude = c(149.125, 149.133),
  place = c("a big tree", "an open field")
)
df
## # A tibble: 2 × 4
##   name                  latitude longitude place
##   <chr>                    <dbl>     <dbl> <chr>
## 1 Eolophus roseicapilla    -35.3      149. a big tree
## 2 Eolophus roseicapilla    -35.3      149. an open field
I can use the function `use_coordinates()` to specify which of my columns refer to the valid Darwin Core terms `decimalLatitude` and `decimalLongitude`. I have intentionally assigned the wrong column, `place`, to `decimalLatitude`. corella will return an error because `decimalLatitude` and `decimalLongitude` fields must be numeric in the Darwin Core standard. This error comes from a small internal checking function called `check_decimalLatitude()`.
df |>
  use_coordinates(decimalLatitude = place, # wrong column
                  decimalLongitude = longitude)
## ⠙ Checking 2 columns: decimalLatitude and decimalLongitude
## ✔ Checking 2 columns: decimalLatitude and decimalLongitude [644ms]
##
## Error in `check_decimalLatitude()`:
## ! decimalLatitude must be a numeric vector, not character.
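To give a feel for what a check of this kind does, here is a minimal base R sketch. The helper `check_numeric_term()` and its arguments are hypothetical and written only for illustration; it is not corella's internal code, which may test more conditions and format its messages differently.

```r
# Hypothetical helper, for illustration only; not corella's internal code.
check_numeric_term <- function(x, term, lower = -Inf, upper = Inf) {
  # Type check: the column must be numeric
  if (!is.numeric(x)) {
    stop(term, " must be a numeric vector, not ", class(x)[1], ".",
         call. = FALSE)
  }
  # Range check: ignore missing values, flag anything outside the bounds
  out_of_range <- !is.na(x) & (x < lower | x > upper)
  if (any(out_of_range)) {
    stop("Value is outside of expected range in ", term, ".", call. = FALSE)
  }
  invisible(x)
}

# Passes silently: numeric values between -90 and 90
check_numeric_term(c(-35.310, -35.273), "decimalLatitude",
                   lower = -90, upper = 90)

# Fails: a character column, as in the example above.
# tryCatch() captures the error message instead of halting the script.
result <- tryCatch(
  check_numeric_term(c("a big tree", "an open field"), "decimalLatitude"),
  error = function(e) conditionMessage(e)
)
result
```

Keeping each check small and focused on a single term is what makes it possible to reuse the same checks in different places.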
Supported terms
corella contains internal `check_` functions for all individual Darwin Core terms that are supported. These are as follows:
Supported Darwin Core terms and their associated functions

Term | check function | use function |
---|---|---|
basisOfRecord | check_basisOfRecord() | use_occurrences() |
occurrenceID | check_occurrenceID() | use_occurrences() |
scientificName | check_scientificName() | use_scientific_name() |
decimalLatitude | check_decimalLatitude() | use_coordinates() |
decimalLongitude | check_decimalLongitude() | use_coordinates() |
geodeticDatum | check_geodeticDatum() | use_coordinates() |
coordinateUncertaintyInMeters | check_coordinateUncertaintyInMeters() | use_coordinates() |
eventDate | check_eventDate() | use_datetime() |
continent | check_continent() | use_locality() |
country | check_country() | use_locality() |
countryCode | check_countryCode() | use_locality() |
stateProvince | check_stateProvince() | use_locality() |
locality | check_locality() | use_locality() |
kingdom | check_kingdom() | use_taxonomy() |
phylum | check_phylum() | use_taxonomy() |
class | check_class() | use_taxonomy() |
order | check_order() | use_taxonomy() |
family | check_family() | use_taxonomy() |
genus | check_genus() | use_taxonomy() |
specificEpithet | check_specificEpithet() | use_taxonomy() |
vernacularName | check_vernacularName() | use_taxonomy() |
individualCount | check_individualCount() | use_abundance() |
organismQuantity | check_organismQuantity() | use_abundance() |
organismQuantityType | check_organismQuantityType() | use_abundance() |
datasetID | check_datasetID() | use_collection() |
datasetName | check_datasetName() | use_collection() |
catalogNumber | check_catalogNumber() | use_collection() |
coordinatePrecision | check_coordinatePrecision() | use_coordinates() |
taxonRank | check_taxonRank() | use_scientific_name() |
scientificNameAuthorship | check_scientificNameAuthorship() | use_scientific_name() |
year | check_year() | use_datetime() |
month | check_month() | use_datetime() |
day | check_day() | use_datetime() |
eventTime | check_eventTime() | use_datetime() |
individualID | check_individualID() | use_individual_traits() |
lifeStage | check_lifeStage() | use_individual_traits() |
sex | check_sex() | use_individual_traits() |
vitality | check_vitality() | use_individual_traits() |
reproductiveCondition | check_reproductiveCondition() | use_individual_traits() |
recordedBy | check_recordedBy() | use_observer() |
recordedByID | check_recordedByID() | use_observer() |
eventID | check_eventID() | use_events() |
eventType | check_eventType() | use_events() |
parentEventID | check_parentEventID() | use_events() |
When a user maps a column to a matching Darwin Core term in a `use_` function (or corella detects the column/term automatically), the `use_` function triggers that term's `check_` function. This process ensures that the data is correctly formatted before it is saved in a Darwin Core Archive.
It's useful to know that these internal, individual `check_` functions exist because they are the building blocks of a full suite of checks, which users can run with `check_dataset()`.
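The relationship between a `use_` function and its `check_` function can be sketched in a few lines of base R. Both functions below are hypothetical stand-ins with made-up names, and the sketch assumes the `use_` step simply renames the chosen column before checking it; corella's real functions do more than this.

```r
# Hypothetical functions, for illustration only; corella's internals differ.
check_decimalLongitude_sketch <- function(x) {
  if (!is.numeric(x)) {
    stop("decimalLongitude must be a numeric vector, not ", class(x)[1], ".",
         call. = FALSE)
  }
  invisible(x)
}

use_longitude_sketch <- function(df, column) {
  # Step 1 (assumed): rename the user's column to the Darwin Core term
  names(df)[names(df) == column] <- "decimalLongitude"
  # Step 2: run the matching check; an error here stops the pipeline
  check_decimalLongitude_sketch(df$decimalLongitude)
  df
}

obs <- data.frame(longitude = c(149.125, 149.133))
renamed <- use_longitude_sketch(obs, "longitude")
names(renamed)
```

Because the check runs inside the `use_` step, a badly typed column is caught at the moment it is assigned to a term rather than later in the workflow.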
Checking a full dataset
For users who are familiar with Darwin Core standards, or who have datasets that already conform to Darwin Core standards (or are very close), it might be more convenient to run many checks at one time.
Users can use the `check_dataset()` function to run a "test suite" on their dataset. `check_dataset()` detects all columns that match valid Darwin Core terms and runs every matching `check_` function at once, interactively, much like `devtools::test()` or `devtools::check()`.
`check_dataset()` returns:

* A summary table of whether each matching column's check passed or failed
* The number of errors and passed columns
* Whether the data meets minimum Darwin Core requirements
* The first 5 error messages returned by checks

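Before the worked example, here is a rough base R sketch of the idea behind such a test suite: match column names against a set of known terms, run each matching check, and collect the results rather than stopping at the first error. The `checks` list and `run_checks_sketch()` are hypothetical names invented for this illustration and do not reflect how `check_dataset()` is implemented.

```r
# Hypothetical registry of checks keyed by Darwin Core term (illustration only).
checks <- list(
  decimalLatitude = function(x) {
    if (!is.numeric(x)) {
      stop("decimalLatitude must be a numeric vector.", call. = FALSE)
    }
  },
  decimalLongitude = function(x) {
    if (!is.numeric(x)) {
      stop("decimalLongitude must be a numeric vector.", call. = FALSE)
    }
  }
)

run_checks_sketch <- function(df, checks) {
  # Only columns whose names match a registered term are checked
  matched <- intersect(names(df), names(checks))
  # Run every matching check, collecting error messages instead of stopping
  messages <- vapply(matched, function(term) {
    tryCatch(
      {
        checks[[term]](df[[term]])
        NA_character_ # NA means the check raised no error
      },
      error = function(e) conditionMessage(e)
    )
  }, character(1), USE.NAMES = FALSE)
  data.frame(
    column = matched,
    passed = is.na(messages),
    message = messages,
    stringsAsFactors = FALSE
  )
}

obs <- data.frame(
  decimalLatitude = c("-35.310", "-35.273"), # character: should fail
  decimalLongitude = c(149.125, 149.133),    # numeric: should pass
  date = c("14-01-2023", "15-01-2023"),      # not a registered term: skipped
  stringsAsFactors = FALSE
)
summary_tbl <- run_checks_sketch(obs, checks)
summary_tbl
```

Columns that do not match a known term, such as `date` here, are simply ignored by the sketch.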
df <- tibble::tibble(
  decimalLatitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes
  decimalLongitude = c(149.125, 149.133),
  date = c("14-01-2023", "15-01-2023"),
  individualCount = c(0, 2),
  scientificName = c("Callocephalon fimbriatum", "Eolophus roseicapilla"),
  country = c("AU", "AU"),
  occurrenceStatus = c("present", "present")
)
df |>
  check_dataset()
## ℹ Testing data
## ✔ | E P | Column
## ⠙ | 0 decimalLatitude
## ✔ | 3 ✖ | decimalLatitude [51ms]
##
## ⠙ | 0 decimalLongitude
## ✔ | 0 ✔ | decimalLongitude [9ms]
##
## ⠙ | 0 individualCount
## ⠹ | 1 ✖ | individualCount
## ✔ | 1 ✖ | individualCount [33ms]
##
## ⠙ | 0 scientificName
## ✔ | 0 ✔ | scientificName [9ms]
##
## ⠙ | 0 country
## ✔ | 1 ✖ | country [25ms]
##
## ══ Results ═════════════════════════════════════════════════════════════════════
##
## [ Errors: 5 | Pass: 2 ]
##
## ✖ Data does not meet minimum Darwin Core requirements
## ℹ Use `suggest_workflow()` to see more information.
## ── Error in decimalLatitude ────────────────────────────────────────────────────
##
## decimalLatitude must be a numeric vector, not character.
## decimalLatitude must be a numeric vector, not character.
## Value is outside of expected range in decimalLatitude.
## ℹ Column contains values outside of -90 <= x <= 90.
## ── Error in individualCount ────────────────────────────────────────────────────
##
## individualCount values do not match occurrenceStatus.
## ✖ Found 1 row where individualCount = 0 but occurrenceStatus = "present".
## ── Error in country ────────────────────────────────────────────────────────────
##
## Unexpected value in country.
## ✖ Invalid value: "AU"
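One possible way to address the three reported errors is sketched below using base R only. The replacement values are assumptions based on the Darwin Core term definitions (`country` holds a country name, `countryCode` holds a two-letter code, and a count of zero implies an absence); whether they satisfy every corella check has not been verified here, so re-running `check_dataset()` afterwards is the real test.

```r
# Base R only; the replacement values are assumptions, not corella output.
obs <- data.frame(
  decimalLatitude = c("-35.310", "-35.273"), # stored as character
  decimalLongitude = c(149.125, 149.133),
  individualCount = c(0, 2),
  country = c("AU", "AU"),
  occurrenceStatus = c("present", "present"),
  stringsAsFactors = FALSE
)

# 1. Convert latitude from character to numeric
obs$decimalLatitude <- as.numeric(obs$decimalLatitude)

# 2. Make occurrenceStatus agree with individualCount (a count of 0 = absent)
obs$occurrenceStatus <- ifelse(obs$individualCount == 0, "absent", "present")

# 3. Move the two-letter code to countryCode and use the full country name
obs$countryCode <- obs$country
obs$country <- "Australia"

str(obs)
```

Step 2 assumes the count is the trustworthy value; if the zero was itself a typo, the count should be corrected instead.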
Note that `check_dataset()` currently only accepts occurrence-level datasets. Datasets with hierarchical events data (e.g. multiple or repeated surveys, site locations) are not currently supported.
Users have options
corella offers two ways to check a dataset, both detailed above: running individual checks through `use_` functions, or running a "test suite" with `check_dataset()`.
We hope these alternatives let users choose the method that suits their workflow, or switch between methods as they standardise their data.