Quick start guide#
Dax Kellie, Amanda Buyan
corella is a Python tool for standardising data to the Darwin Core Standard.
Darwin Core Standard is the primary data standard for species occurrence data—records of
organisms observed in a location and time—in the Atlas of Living Australia (ALA), other
Living Atlases and the Global Biodiversity Information Facility (GBIF). The standard makes it
possible to compile data from a variety of sources, making data easier to share, use
and reuse.
The main tasks when standardising data to the Darwin Core Standard are:
Ensure columns use valid Darwin Core terms as column names
Include all required information (e.g. scientific name, unique observation ID, valid date)
Ensure columns contain valid data
This process can be daunting. corella is designed to reduce confusion about how to get started, and to help determine which Darwin Core terms might match your column names.
Install#
See Installation Instructions.
To load the package:
>>> import corella
Rename, add or edit columns#
Here is a minimal example dataset of cockatoo observations. In our dataframe,
df, there are columns that contain information that we would like to standardise
using Darwin Core.
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'latitude': [-35.310, "-35.273"], # deliberate error for demonstration purposes
... 'longitude': [149.125, 149.133],
... 'date': ["14-01-2023", "15-01-2023"],
... 'time': ["10:23:00", "11:25:00"],
... 'month': ["January", "February"],
... 'day': [100, 101],
... 'species': ["Callocephalon fimbriatum", "Eolophus roseicapilla"],
... 'n': [2, 3],
... 'crs': ["WGS84", "WGS8d"],
... 'country': ["Australia", "Denmark"],
... 'continent': ["Oceania", "Europe"]
... })
>>> df
latitude longitude date time ... n crs country continent
0 -35.31 149.125 14-01-2023 10:23:00 ... 2 WGS84 Australia Oceania
1 -35.273 149.133 15-01-2023 11:25:00 ... 3 WGS8d Denmark Europe
[2 rows x 11 columns]
We can standardise our data with set_ functions. Each set_ function has a suffix that
identifies the type of data it standardises (e.g. set_coordinates, set_datetime),
and the arguments of set_ functions are valid Darwin Core terms (i.e. column names). By grouping Darwin
Core terms based on their data type, corella makes it easier for users to find relevant Darwin Core
terms to use as column names (one of the most onerous parts of Darwin Core for new users).
Let’s specify that the scientific name (i.e. genus + species name) in our data is in the species
column by using set_scientific_name(). You’ll notice 2 things happen:
The species column in our dataframe is renamed to scientificName
set_scientific_name() runs a check on our species column to make sure it is formatted correctly
>>> df_dwc = corella.set_scientific_name(dataframe=df,scientificName='species')
Checking 1 column(s): scientificName
>>> df_dwc
latitude longitude date time ... n crs country continent
0 -35.31 149.125 14-01-2023 10:23:00 ... 2 WGS84 Australia Oceania
1 -35.273 149.133 15-01-2023 11:25:00 ... 3 WGS8d Denmark Europe
[2 rows x 11 columns]
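The rename-then-check behaviour described above can be pictured with plain pandas. This is only a rough sketch: rename_and_check is a hypothetical helper, and the non-empty-string rule is an assumed stand-in for whatever checks corella actually runs on scientificName.

```python
import pandas as pd


def rename_and_check(dataframe, new_name, old_name):
    """Hypothetical helper: rename one column, then validate its contents."""
    renamed = dataframe.rename(columns={old_name: new_name})
    # Illustrative rule only (an assumption): every value is a non-empty string
    is_text = renamed[new_name].map(lambda v: isinstance(v, str) and v.strip() != "")
    if not is_text.all():
        raise ValueError(f"the {new_name} column must contain non-empty strings.")
    return renamed


df = pd.DataFrame({
    "species": ["Callocephalon fimbriatum", "Eolophus roseicapilla"],
    "n": [2, 3],
})
df_renamed = rename_and_check(df, new_name="scientificName", old_name="species")
print(list(df_renamed.columns))
```

As in the corella call, the keyword-style arguments read as "Darwin Core term = existing column", and the original dataframe is left unchanged.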
What happens when a column contains an error? The latitude column in df mixes a number and
a string, so pandas stores it with the object dtype rather than as a numeric column. When we try
to update the column name using set_coordinates(), an error tells us the column must be numeric.
>>> df_dwc = corella.set_coordinates(dataframe=df_dwc,
... decimalLongitude = 'longitude',
... decimalLatitude = 'latitude')
Checking 1 column(s): scientificName
Checking 2 column(s): decimalLongitude, decimalLatitude
Traceback (most recent call last):
File "/Users/buy003/Documents/GitHub/corella-python/docs/source/getting_started/quick_start_code.py", line 31, in <module>
df_dwc = corella.set_coordinates(dataframe=df_dwc,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/buy003/anaconda3/envs/galaxias-dev/lib/python3.11/site-packages/corella/set_coordinates.py", line 81, in set_coordinates
raise ValueError("There are some errors in your data. They are as follows:\n\n{}".format('\n'.join(errors)))
ValueError: There are some errors in your data. They are as follows:
the decimalLatitude column must be numeric.
Fix or update columns#
To change, edit or fix a column, users can edit it with standard pandas operations and then
pass it to the relevant set_ function. Each set_ function is essentially a specialised pandas rename
function, so columns can be handled much as they would be with pandas DataFrame.rename. Here we
convert the latitude column to numeric with pd.to_numeric() and then call set_coordinates() again.
>>> df_dwc['latitude'] = pd.to_numeric(df_dwc['latitude'])
>>> df_dwc = corella.set_coordinates(dataframe=df_dwc,
... decimalLongitude = 'longitude',
... decimalLatitude = 'latitude')
Checking 1 column(s): scientificName
Checking 2 column(s): decimalLongitude, decimalLatitude
>>> df_dwc
decimalLatitude decimalLongitude date ... crs country continent
0 -35.310 149.125 14-01-2023 ... WGS84 Australia Oceania
1 -35.273 149.133 15-01-2023 ... WGS8d Denmark Europe
[2 rows x 11 columns]
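To see why the original latitude column was rejected, it can help to inspect its dtype directly. The snippet below uses only pandas. Using is_numeric_dtype here is an assumption about what a "must be numeric" check might look like, not a statement of how corella implements it.

```python
import pandas as pd

# One float and one string, as in the example dataframe
latitude = pd.Series([-35.310, "-35.273"])
print(latitude.dtype)                            # object
print(pd.api.types.is_numeric_dtype(latitude))   # False

# Converting the values gives a float column
fixed = pd.to_numeric(latitude)
print(fixed.dtype)                               # float64
print(pd.api.types.is_numeric_dtype(fixed))      # True
```

If some values cannot be parsed as numbers, pd.to_numeric() raises an error by default; passing errors="coerce" turns them into NaN instead, which then need to be reviewed by hand.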
Auto-detect columns#
corella is also able to detect when a column exists in a data frame that already has
a valid Darwin Core term as a column name. For example, df contains columns with
locality information. We can use set_locality() to identify these columns, but because
several columns already have valid Darwin Core terms as column names (country and continent),
set_locality() will detect these valid Darwin Core columns in the dataframe and check them automatically.
>>> df_dwc = corella.set_locality(dataframe=df_dwc)
Checking 1 column(s): scientificName
Checking 2 column(s): decimalLongitude, decimalLatitude
>>> df_dwc
decimalLatitude decimalLongitude date ... crs country continent
0 -35.310 149.125 14-01-2023 ... WGS84 Australia Oceania
1 -35.273 149.133 15-01-2023 ... WGS8d Denmark Europe
[2 rows x 11 columns]
corella’s auto-detection means users do not need to specify every single column, reducing
the amount of typing when they already have valid Darwin Core column names!
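The idea behind auto-detection can be sketched in a few lines of plain Python. Both the detect_terms helper and the short LOCALITY_TERMS list are illustrative assumptions: the list is a small subset of Darwin Core location terms, not the list corella uses internally.

```python
# Illustrative subset of Darwin Core location terms (not corella's internal list)
LOCALITY_TERMS = ["continent", "country", "countryCode", "stateProvince", "locality"]


def detect_terms(column_names, terms):
    """Hypothetical helper: return the columns whose names already match a term."""
    known = set(terms)
    return [name for name in column_names if name in known]


columns = [
    "decimalLatitude", "decimalLongitude", "date", "time", "month", "day",
    "scientificName", "n", "crs", "country", "continent",
]
print(detect_terms(columns, LOCALITY_TERMS))  # ['country', 'continent']
```

Only exact, case-sensitive matches are detected in this sketch, so a column named Country would still need to be supplied by hand.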
Suggest a workflow#
Unsure where to start? Confused about the minimum requirements to share your data? Using
suggest_workflow() is the easiest way to get started in corella.
suggest_workflow() provides a high level summary designed to show:
Which column names match valid Darwin Core terms
The minimum requirements for data in a Darwin Core Archive (i.e. a completed data resource in Darwin Core standard).
A suggested workflow to help you add the minimum required columns
Additional functions that could be added to a piped workflow (based on the provided dataset’s matching Darwin Core column names)
The intention of suggest_workflow() is to provide a general help function whenever users feel uncertain about what to do next. Let’s see what the output says about our original dataframe df.
>>> corella.suggest_workflow(occurrences=df)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── All DwC terms ──
Matched 4 of 11 column names to DwC terms:
✓ Matched: month, day, country, continent
✗ Unmatched: date, n, species, longitude, latitude, time, crs
── Minimum required DwC terms occurrences ──
Type Matched term(s) Missing term(s)
------------------------- ----------------- -------------------------------------------------------------------------------
Identifier (at least one) - occurrenceID OR catalogNumber OR recordNumber
Record type - basisOfRecord
Scientific name - scientificName
Location - decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
Date/Time - eventDate
── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Occurrences ──
To make your occurrences Darwin Core compliant, use the following workflow:
corella.set_occurrences()
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()
Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()
suggest_workflow() will update the suggested function pipe to only suggest functions
that are necessary to standardise your data correctly.
For example, after using one of the suggested functions, set_occurrences(), if we run
suggest_workflow() again, the output message no longer suggests set_occurrences().
>>> df_edited = corella.set_occurrences(occurrences=df,
... occurrenceID='sequential',
... basisOfRecord='HumanObservation')
Checking 1 column(s): basisOfRecord
Checking 1 column(s): occurrenceID
>>> corella.suggest_workflow(occurrences=df_edited)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── All DwC terms ──
Matched 6 of 13 column names to DwC terms:
✓ Matched: occurrenceID, month, day, country, continent, basisOfRecord
✗ Unmatched: crs, longitude, latitude, date, time, species, n
── Minimum required DwC terms occurrences ──
Type Matched term(s) Missing term(s)
------------------------- ----------------- -------------------------------------------------------------------------------
Identifier (at least one) occurrenceID -
Record type basisOfRecord -
Scientific name - scientificName
Location - decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
Date/Time - eventDate
── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Occurrences ──
To make your occurrences Darwin Core compliant, use the following workflow:
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()
Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()
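The "Minimum required DwC terms" table printed by suggest_workflow() can be approximated with a small lookup. The requirement groups below are copied from that output. The missing_requirements helper is hypothetical and only compares column names, so it is a sketch of the logic rather than corella's implementation.

```python
# Requirement groups as listed in the suggest_workflow() output
ANY_OF = {
    "Identifier (at least one)": ["occurrenceID", "catalogNumber", "recordNumber"],
}
ALL_OF = {
    "Record type": ["basisOfRecord"],
    "Scientific name": ["scientificName"],
    "Location": ["decimalLatitude", "decimalLongitude", "geodeticDatum",
                 "coordinateUncertaintyInMeters"],
    "Date/Time": ["eventDate"],
}


def missing_requirements(column_names):
    """Hypothetical helper: map each requirement type to the terms still missing."""
    present = set(column_names)
    missing = {}
    for label, terms in ANY_OF.items():
        # Satisfied as soon as any one of the alternatives is present
        if not present.intersection(terms):
            missing[label] = terms
    for label, terms in ALL_OF.items():
        absent = [term for term in terms if term not in present]
        if absent:
            missing[label] = absent
    return missing


columns = ["occurrenceID", "basisOfRecord", "latitude", "longitude", "date", "species"]
for label, terms in missing_requirements(columns).items():
    print(f"{label}: {', '.join(terms)}")
```

With these column names the identifier and record type rows are satisfied, leaving the scientific name, location and date/time terms still to add, which mirrors the second table above.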
Test your data#
If your dataset already uses valid Darwin Core terms as column names, instead of
working through each set_ function, you might wish to run tests on your entire dataset.
To run checks on your data like a test suite, use check_dataset(). check_dataset()
runs the relevant check on each matching Darwin Core column and returns a summary of the
results, along with any error messages returned by those checks.
>>> df = pd.DataFrame({
... 'latitude': [-35.310, "-35.273"], # deliberate error for demonstration purposes
... 'longitude': [149.125, 149.133],
... 'date': ["14-01-2023", "15-01-2023"],
... 'individualCount': [0, 2],
... 'species': ["Callocephalon fimbriatum", "Eolophus roseicapilla"],
... 'country': ["AU", "AU"],
... 'occurrenceStatus': ["present", "present"],
... })
>>> corella.check_dataset(occurrences=df)
Checking 1 column(s): individualCount
Checking 1 column(s): occurrenceStatus
Checking 1 column(s): country
Checking 0 column(s):
Checking 1 column(s): occurrenceStatus
Number of Errors Pass/Fail Column name
------------------ ----------- ----------------
0 ✓ individualCount
1 ✗ country
1 ✗ occurrenceStatus
══ Results ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Errors: 2 | Passes: 1
✗ Data does not meet minimum Darwin core requirements
Use corella.suggest_workflow()
── Error in country ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Some of your country are incorrect. Accepted values are found on Wikipedia: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
── Error in occurrenceStatus ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Some of your individual counts are 0, yet the occurrence status is set to present. Please change occurrenceStatus to ABSENT
The goal of check_dataset() is to make running many checks more efficient, and
to cater to users who prefer a test-suite-like workflow.
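A test-suite-like workflow of this kind can be sketched as a mapping from column names to check functions, each returning a list of error messages. Everything below is a hypothetical illustration in plain Python: the two rules (non-negative whole counts, and no count of 0 alongside a "present" status) are modelled on the messages shown above, not taken from corella's source.

```python
def check_individual_count(rows):
    """Counts must be whole numbers of zero or more."""
    return [
        f"row {i}: invalid individualCount {row['individualCount']!r}"
        for i, row in enumerate(rows)
        if not (isinstance(row["individualCount"], int) and row["individualCount"] >= 0)
    ]


def check_occurrence_status(rows):
    """A count of 0 should not be paired with a status of 'present'."""
    return [
        f"row {i}: individualCount is 0 but occurrenceStatus is 'present'"
        for i, row in enumerate(rows)
        if row.get("individualCount") == 0
        and str(row["occurrenceStatus"]).lower() == "present"
    ]


CHECKS = {
    "individualCount": check_individual_count,
    "occurrenceStatus": check_occurrence_status,
}


def run_checks(rows):
    """Run every check whose column is present; return errors per column."""
    columns = rows[0].keys() if rows else []
    return {name: check(rows) for name, check in CHECKS.items() if name in columns}


rows = [
    {"individualCount": 0, "occurrenceStatus": "present"},
    {"individualCount": 2, "occurrenceStatus": "present"},
]
results = run_checks(rows)
for name, errors in results.items():
    print(f"{len(errors)} error(s) in {name}")
```

Collecting the errors instead of raising on the first one is what lets a single run report every failing column together.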