What is Data Provenance?

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What should I be recording while I do my analyses?

Objectives
  • Train yourself to record important metadata needed for

Where are we?

In the previous lesson we touched on capturing provenance and metadata while you clean your own dataset. Now we turn to what you can do during the analysis phase of the data life cycle to implement FAIR practices.

Figure: data life cycle, analysis phase

What is Provenance?

Definition from Merriam-Webster Dictionary:

  • Origin, Source.
  • The history of ownership of a valued object or work of art or literature

Provenance was originally a term applied to works of fine art as a record of their history. It came to encompass not just the chain of custody, but a full record of what happened to a work and where it traveled. The term is now applied to data as well, and the meaning is the same. Where did a dataset come from? Who created it? What processing was done to it? Where was it stored?

Monet painting
Water Lilies, Monet (1919), photo by Szilas in the Metropolitan Museum of Art (2008). Image source: Wikipedia

The parallels between “artwork” and “data” still hold up today.

From artworkarchive’s “Provenance 101”:

  • “An ideal provenance captures the ownership history of an artwork all the way back to the creation of the artist. But, many times there are gaps in the object’s ownership record which can affect the work’s value.”
  • “Don’t make your artworks worthless by losing important information.”

Have you ever come back to plots or data you created and had no idea how you made them? At that point the provenance is gone. You can’t go back in time and collect some kinds of metadata. You have to keep good notes and records while you work with your data.

What to keep in mind when writing a provenance record

Example notes:

2022-06-22: Downloaded a subset of dataset “Niskin bottle samples” spanning 2004 to 2008 and saved it as “BATS_niskin/orig/bcodmo_dataset_3782_2004_to_2008.csv”

data source citation: Johnson, R. (2019) Niskin bottle water samples and CTD measurements at water sample depths collected at Bermuda Atlantic Time-Series sites in the Sargasso Sea ongoing from 1955-01-29 (BATS project). Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2019-05-29 [Subset 2004 to 2008]. http://lod.bco-dmo.org/id/dataset/3782 [Accessed on 2022-06-22]

Binned data by hour, ordered table by station, cast, and pressure, and saved to “BATS_niskin_2004_to_2008/hourly/BATS_niskin_hourly.xlsx”

  • exported Sheet 1 to “BATS_niskin_2004_to_2008/hourly/BATS_niskin_hourly.csv”
  • exported plot in Sheet 2 to “BATS_niskin_2004_to_2008/hourly/BATS_profiles.png”
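
Notes like these can also be written by a script while the analysis runs. Below is a minimal Python sketch, assuming a plain-text log file is an acceptable place for your notes. The function name log_step and its parameters are hypothetical, chosen only for illustration.

```python
from datetime import date
from pathlib import Path


def log_step(log_path, note, inputs=(), outputs=(), when=None):
    """Append one dated provenance note to a plain-text log and return it.

    log_step is a hypothetical helper, not part of any library.
    """
    when = when or date.today()
    lines = [f"{when.isoformat()}: {note}"]
    lines += [f"  input:  {p}" for p in inputs]
    lines += [f"  output: {p}" for p in outputs]
    entry = "\n".join(lines) + "\n"
    # Append mode keeps earlier entries, so the log stays a running history.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry
```

For example, calling log_step with a log file of your choice, the note “Binned data by hour”, the original .csv as an input, and the hourly .csv as an output appends one dated entry in the same spirit as the notes above.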

Clear records of how a plot was produced, and of which table it was produced from, are very important for reports and journal publications. They make your results reproducible and transparent, and they facilitate peer review. They also make writing your publication easier, since you already have the figure captions written!
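
One way to keep a plot tied to the table behind it is to save a small record next to the figure file. The following is a sketch only, assuming a JSON file stored beside the figure suits your workflow. The function name write_figure_record and the field names are hypothetical and do not follow any particular standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_figure_record(figure_path, table_path, caption):
    """Save a JSON record next to a figure naming the table it was made from.

    write_figure_record and its field names are hypothetical.
    """
    table_bytes = Path(table_path).read_bytes()
    record = {
        "figure": str(figure_path),
        "source_table": str(table_path),
        # The checksum lets you confirm later that the table has not changed.
        "source_table_sha256": hashlib.sha256(table_bytes).hexdigest(),
        "caption": caption,
        "recorded_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    # The record is named after the figure, e.g. plot.png -> plot.png.json
    record_path = Path(str(figure_path) + ".json")
    record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record_path
```

Storing the caption in the record means the text is ready to reuse when the figure goes into a report, and the checksum shows whether the source table has been edited since the plot was made.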

Key Points

  • You can’t go back in time and collect some kinds of metadata. You have to keep good notes and records.