Checking your dataset

Dax Kellie, Amanda Buyan

A Darwin Core Archive consists of several pieces that together ensure a dataset is structured correctly (and can be restructured correctly in the future). These pieces are the dataset, a metadata statement, and an XML file detailing how the columns in the data relate to each other.

corella is designed to check whether a dataset conforms to the Darwin Core standard. This involves two main steps:

  • Ensuring that a dataset uses valid Darwin Core terms as column names

  • Checking that the data in each column is the correct type for the specified Darwin Core term

This vignette gives additional information about the second step of checking each column’s data.
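
As a rough illustration of the first step, the sketch below separates a data frame's column names into those that match Darwin Core terms and those that do not. KNOWN_TERMS and split_columns() are hypothetical names, and the term list is only a small subset taken from the examples in this guide; corella's own matching is internal and may work differently.

```python
import pandas as pd

# Hypothetical subset of Darwin Core terms, taken from the examples in this
# guide. This is NOT corella's internal list of terms.
KNOWN_TERMS = {
    "decimalLatitude", "decimalLongitude", "scientificName",
    "individualCount", "country", "occurrenceStatus",
}


def split_columns(df, known_terms=KNOWN_TERMS):
    """Return (matched, unmatched) column names, keeping column order."""
    matched = [col for col in df.columns if col in known_terms]
    unmatched = [col for col in df.columns if col not in known_terms]
    return matched, unmatched


df = pd.DataFrame({
    "decimalLatitude": [-35.310, -35.273],
    "date": ["14-01-2023", "15-01-2023"],
    "scientificName": ["Callocephalon fimbriatum", "Eolophus roseicapilla"],
})
matched, unmatched = split_columns(df)
print(matched)    # ['decimalLatitude', 'scientificName']
print(unmatched)  # ['date']
```

Columns that land in the unmatched list, such as date here, are the ones that would still need to be mapped to a term before the second step can check them.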

Checking individual terms

corella contains many internal check_ functions. Each one runs basic validation checks on the specified column to ensure the data conforms to the expected data type of the corresponding Darwin Core term.

For example, here is a very small dataset with two observations of galahs (Eolophus roseicapilla, class str), their latitude and longitude coordinates (class float), and a location description in the column place (class str).

>>> import corella
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'name': ['Eolophus roseicapilla', 'Eolophus roseicapilla'],
...     'latitude': [-35.310, -35.273],
...     'longitude': [149.125, 149.133],
...     'place': ['a big tree', 'an open field'],
... })
>>> df
                    name  latitude  longitude          place
0  Eolophus roseicapilla   -35.310    149.125     a big tree
1  Eolophus roseicapilla   -35.273    149.133  an open field

I can use the function set_coordinates() to specify which of my columns refer to the valid Darwin Core terms decimalLatitude and decimalLongitude. Here I have intentionally passed the wrong column, place, as decimalLatitude. corella returns an error because the decimalLatitude and decimalLongitude fields must be numeric in the Darwin Core standard. This error comes from a small internal checking function called check_decimalLatitude().

>>> df = corella.set_coordinates(dataframe=df,
...                              decimalLatitude = 'place', # wrong column
...                              decimalLongitude = 'longitude')
Checking 2 column(s): decimalLongitude, decimalLatitude
Traceback (most recent call last):
  File "/Users/buy003/Documents/GitHub/corella-python/docs/source/corella_user_guide/checking_dataset_code.py", line 19, in <module>
    df = corella.set_coordinates(dataframe=df,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/buy003/anaconda3/envs/galaxias-dev/lib/python3.11/site-packages/corella/set_coordinates.py", line 81, in set_coordinates
    raise ValueError("There are some errors in your data.  They are as follows:\n\n{}".format('\n'.join(errors)))
ValueError: There are some errors in your data.  They are as follows:

the decimalLatitude column must be numeric.
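
To make the behaviour above more concrete, here is a minimal sketch of a set_-style function that renames the chosen columns and then applies a numeric check to each term. It is not corella's source code: sketch_set_coordinates() is a hypothetical stand-in, the check is assumed to be a simple dtype test, and only the wording of the error message is copied from the traceback above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sketch_set_coordinates(df, decimalLatitude, decimalLongitude):
    """Hypothetical stand-in for a set_ function; not corella's source code."""
    # Rename the user's columns to the Darwin Core terms they map to.
    renamed = df.rename(columns={
        decimalLatitude: "decimalLatitude",
        decimalLongitude: "decimalLongitude",
    })
    # Run a numeric check on each term and collect the error messages.
    errors = []
    for term in ("decimalLatitude", "decimalLongitude"):
        if not is_numeric_dtype(renamed[term]):
            errors.append(f"the {term} column must be numeric.")
    if errors:
        raise ValueError(
            "There are some errors in your data.  They are as follows:\n\n"
            + "\n".join(errors)
        )
    return renamed


df = pd.DataFrame({
    "latitude": [-35.310, -35.273],
    "longitude": [149.125, 149.133],
    "place": ["a big tree", "an open field"],
})

# Mapping the text column onto decimalLatitude fails the numeric check.
message = None
try:
    sketch_set_coordinates(df, decimalLatitude="place", decimalLongitude="longitude")
except ValueError as exc:
    message = str(exc)

# Mapping the numeric column passes and returns the renamed data frame.
fixed = sketch_set_coordinates(df, decimalLatitude="latitude", decimalLongitude="longitude")
```

In this sketch the fix is simply to point decimalLatitude at the numeric latitude column instead of place.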

Supported terms

corella contains an internal check_ function for every Darwin Core term it supports. The supported terms, grouped by the set_ function that handles them, are as follows:

Supported Darwin Core Terms and Their Associated Functions

set_function             dwc_term
-----------------------  --------------------------------------------------------------------------------------------------------
set_occurrences()        basisOfRecord, occurrenceID
set_scientific_name()    scientificName, taxonRank, scientificNameAuthorship
set_coordinates()        decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, coordinatePrecision
set_datetime()           eventDate, year, month, day, eventTime
set_locality()           continent, country, countryCode, stateProvince, locality
set_taxonomy()           kingdom, phylum, class, order, family, genus, specificEpithet, vernacularName
set_abundance()          individualCount, organismQuantity, organismQuantityType
set_collection()         datasetID, datasetName, catalogNumber
set_individual_traits()  individualID, lifeStage, sex, vitality, reproductiveCondition
set_observer()           recordedBy, recordedByID
set_events()             eventID, eventType, parentEventID
set_license()            license, rightsHolder, accessRights

When a user maps a column to a matching Darwin Core term in a set_ function (or corella detects the column/term automatically), the set_ function triggers that term's check_ function. This ensures that the data is correctly formatted before it is saved in a Darwin Core Archive.

It’s useful to know that these internal, individual check_ functions exist because they are the building blocks of a full suite of checks, which users can run with check_dataset().
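
One way to picture these building blocks is as a lookup from term name to check function. The sketch below is an assumption about how such a lookup could be organised, not a description of corella's internals: CHECKS, check_numeric(), check_text() and run_matching_checks() are all hypothetical names, and the string rule and its message are invented for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numeric(series, term):
    """Return an error message if the column's dtype is not numeric."""
    return [] if is_numeric_dtype(series) else [f"the {term} column must be numeric."]


def check_text(series, term):
    # Hypothetical rule and message: every value should be a string.
    all_text = all(isinstance(value, str) for value in series)
    return [] if all_text else [f"the {term} column must contain strings."]


# Hypothetical registry mapping each term to its check; corella's real
# check_ functions are internal and may be organised differently.
CHECKS = {
    "decimalLatitude": check_numeric,
    "decimalLongitude": check_numeric,
    "scientificName": check_text,
}


def run_matching_checks(df):
    """Run the registered check for every column named after a known term."""
    return {
        column: CHECKS[column](df[column], column)
        for column in df.columns
        if column in CHECKS
    }


df = pd.DataFrame({
    "decimalLatitude": [-35.310, "-35.273"],  # mixed types, so not numeric
    "decimalLongitude": [149.125, 149.133],
    "scientificName": ["Callocephalon fimbriatum", "Eolophus roseicapilla"],
    "date": ["14-01-2023", "15-01-2023"],  # no registered check, so skipped
})
results = run_matching_checks(df)
```

Each entry in results is a list of error messages for one column, and an empty list means that column passed.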

Checking a full dataset

For users who are familiar with Darwin Core standards, or who have datasets that already conform to Darwin Core standards (or are very close), it might be more convenient to run many checks at one time.

Users can call the check_dataset() function to run a “test suite” on their dataset. check_dataset() detects every column that matches a valid Darwin Core term and runs all of the matching check_ functions in a single call.

The output of check_dataset() includes:

  • A summary table of whether each matching column’s check passed or failed

  • The number of errors and passed columns

  • Whether the data meets minimum Darwin Core requirements

  • The first 5 error messages returned by checks

>>> df = pd.DataFrame({
...     'decimalLatitude': [-35.310, "-35.273"], # deliberate error for demonstration purposes
...     'decimalLongitude': [149.125, 149.133],
...     'date': ["14-01-2023", "15-01-2023"],
...     'individualCount': [0, 2],
...     'scientificName': ["Callocephalon fimbriatum", "Eolophus roseicapilla"],
...     'country': ["AU", "AU"],
...     'occurrenceStatus': ["present", "present"],
... })
>>> corella.check_dataset(occurrences=df)
Checking 1 column(s): individualCount
Checking 1 column(s): occurrenceStatus
Checking 2 column(s): decimalLatitude, decimalLongitude
Checking 1 column(s): country
Checking 0 column(s): 
Checking 1 column(s): occurrenceStatus
Checking 1 column(s): scientificName

  Number of Errors  Pass/Fail    Column name
------------------  -----------  ----------------
                 1  ✗            decimalLatitude
                 0  ✓            decimalLongitude
                 0  ✓            individualCount
                 0  ✓            scientificName
                 1  ✗            country
                 1  ✗            occurrenceStatus


══ Results ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════


Errors: 3 | Passes: 3

✗ Data does not meet minimum Darwin core requirements
Use corella.suggest_workflow()

── Error in decimalLatitude ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

the decimalLatitude column must be numeric.

── Error in country ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Some of your country are incorrect.  Accepted values are found on Wikipedia: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2

── Error in occurrenceStatus ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Some of your individual counts are 0, yet the occurrence status is set to present.  Please change occurrenceStatus to ABSENT
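
The last error above comes from a rule that compares two columns, and the summary is a tally of per-column results. Both can be sketched as follows. The names check_count_vs_status() and summarise() are hypothetical, the rule is an assumption inferred from the error message, and the error lists for the other columns are copied from the output above instead of being recomputed.

```python
import pandas as pd


def check_count_vs_status(df):
    """Hypothetical cross-column rule: a count of 0 should not be 'present'."""
    zero_but_present = (df["individualCount"] == 0) & (
        df["occurrenceStatus"].str.lower() == "present"
    )
    if zero_but_present.any():
        return ["Some of your individual counts are 0, yet the occurrence "
                "status is set to present."]
    return []


def summarise(results):
    """Build a per-column summary table plus overall error and pass counts."""
    summary = pd.DataFrame({
        "Number of Errors": [len(errs) for errs in results.values()],
        "Pass/Fail": ["✓" if not errs else "✗" for errs in results.values()],
        "Column name": list(results.keys()),
    })
    n_errors = int(summary["Number of Errors"].sum())
    n_passes = int((summary["Number of Errors"] == 0).sum())
    return summary, n_errors, n_passes


df = pd.DataFrame({
    "individualCount": [0, 2],
    "occurrenceStatus": ["present", "present"],
})

# Error lists for the other columns are copied from the output shown above,
# not recomputed here.
results = {
    "decimalLatitude": ["the decimalLatitude column must be numeric."],
    "decimalLongitude": [],
    "individualCount": [],
    "scientificName": [],
    "country": ["Some of your country are incorrect."],
    "occurrenceStatus": check_count_vs_status(df),
}
summary, n_errors, n_passes = summarise(results)
print(summary.to_string(index=False))
print(f"Errors: {n_errors} | Passes: {n_passes}")  # Errors: 3 | Passes: 3
```

Here the error count is the total number of messages across columns, which happens to equal the number of failing columns in this example; corella may count differently.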

Users have options

corella offers two ways to check a dataset, both detailed above: running individual checks through set_ functions, or running a “test suite” with check_dataset(). We hope these alternatives give users flexibility in their workflow, allowing them to choose their favourite method or switch between methods as they standardise their data.