PODS, a convention for paleoclimate data in spreadsheets


The LiPD container, based on the JSON for Linked Data (JSON-LD) format, is a practical solution to the problem of organizing and storing hierarchical paleoclimate data in a generalizable schema. This is an important step towards standardizing the representation and linkage of diverse paleoclimate datasets.

In this IPython notebook, I introduce a method to interact with the LiPD container using ordinary spreadsheets. The motivation is that the paleoclimate community edits and stores the data and metadata of its measurements mainly in spreadsheets, not in JSON-based formats. What is missing is a way to convert such spreadsheet-based data to the LiPD format and vice versa.

Working directly with LiPD has two other disadvantages:

  1. JSON notation will never be as easy to edit and modify as a spreadsheet document;
  2. For its data, LiPD refers to a headerless CSV file, which forces the user to continually go back to the LiPD file and navigate its nested attributes in order to figure out what the CSV columns refer to.

Therefore, I propose to stick with spreadsheets but standardize them into a convention where the data and the metadata are stored in two separate worksheets of the same spreadsheet document. The metadata is stored with a dot notation that represents the hierarchical nature of the attributes. This convention could be called PODS (Paleo Open Document Spreadsheet).
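As a minimal illustration of the dot notation (using attribute names from the example further below), a single Metadata row such as temperature.calibration.uncertainty with value 1.3 corresponds to the following nested structure once converted to LiPD/JSON:

row = ('temperature.calibration.uncertainty', 1.3)   # one (Attribute, Value) line of the Metadata worksheet

nested = {                                           # the same information in LiPD/JSON form
    'temperature': {
        'calibration': {
            'uncertainty': 1.3
        }
    }
}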

With the PODS convention, users can edit their data directly in an ordinary spreadsheet program like Excel or OpenOffice and later convert them to LiPD, which, being JSON, is a good container for storing data in a document database like MongoDB.
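For example, once the LiPD dictionary is built (named a_LiPD later in this notebook), it could be inserted into MongoDB with pymongo. This is a minimal sketch, assuming the pymongo 3 API and arbitrary database and collection names:

from pymongo import MongoClient

client = MongoClient()                    # connect to a local MongoDB instance
db = client['paleo']                      # hypothetical database name
db.lipd_records.insert_one(a_LiPD)        # store the LiPD dictionary as one document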

The following procedure describes how to organize a spreadsheet following the PODS convention and transform it into LiPD, and vice versa.


The PODS convention

  • Two worksheets: Data and Metadata.
  • The Data worksheet lists all parameters as columns with a column name, as in a CSV file.
  • The Metadata worksheet has two columns, Attribute and Value, describing all the parameters of the Data worksheet. There are no headers. Each line describes one attribute, with its hierarchy expressed in dot notation, and its value. If an attribute does not correspond to a parameter in the Data worksheet, it is assumed to be a global attribute.
  • Missing data are represented by empty cells.
  • No comments are allowed in the Data and Metadata worksheets, but other worksheets can be added for that purpose.

The Data worksheet: [screenshot; its contents appear as the dataframe dfD below]
The Metadata worksheet: [screenshot; its contents appear as the dataframe dfM below]


Notes:

  1. Today, pandas does not yet support reading and writing ODS (Open Document Spreadsheet) files, but this feature has been requested many times and should be available soon (pandas issue 2311).

  2. The LiPD container refers to a CSV file to store the data. This headerless CSV file, in which each column is identified only by its number, is poorly described and can lead to confusion. It would be safer and clearer to use one header line naming the columns after the parameters.

  3. The LiPD container can contain a list object as a value. It would be simpler to disallow this possibility and have only single values.

  4. Make sure spreadsheet cells are formatted correctly, i.e. numeric cells are specified as numbers and not text. The same applies to dates.

  5. A compliance checker needs to be built to verify that input files conform to the PODS convention, something like the netCDF CF-checker; a minimal sketch of such a check follows this list.
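A minimal sketch of what such a checker could verify, covering only the rules listed above (the function name and messages are illustrative):

import pandas as pd

def check_pods(filename):
    """Minimal PODS compliance check (a sketch, not a full CF-checker equivalent)."""
    xls = pd.ExcelFile(filename)
    problems = []
    # 1. Both required worksheets must be present.
    for sheet in ('Data', 'Metadata'):
        if sheet not in xls.sheet_names:
            problems.append('missing worksheet: %s' % sheet)
    if problems:
        return problems
    dfD = xls.parse('Data')
    dfM = xls.parse('Metadata', header=None)
    # 2. The Metadata worksheet must have exactly two columns (Attribute, Value).
    if dfM.shape[1] != 2:
        problems.append('Metadata worksheet must have exactly 2 columns')
        return problems
    # 3. Every Data column should be described by at least one metadata attribute;
    #    attributes whose first component is not a Data column are global attributes.
    prefixes = set(str(a).split('.')[0] for a in dfM[0])
    for col in dfD.columns:
        if col not in prefixes:
            problems.append('column %r has no metadata' % col)
    return problems

print(check_pods('spreadsheet_at_PODS_convention_01.xls'))   # an empty list when the file is compliant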


Reading a spreadsheet following the PODS convention into pandas dataframes

In [1]:
file = 'spreadsheet_at_PODS_convention_01.xls'
In [2]:
import pandas as pd

# read the Data worksheet (one column per parameter)
dfD = pd.read_excel(file, sheetname='Data')
dfD
Out[2]:


   depth  year  temperature
0      0  2015        14.34
1    100  2015        12.70
2    200  2015        10.30
3    400  2015        10.20
4    450  2015          NaN
5    500  2015        10.10
6   1000  2015        10.10

In [3]:
# read the Metadata worksheet; it has no header, so name the two columns explicitly
dfM = pd.read_excel(file, sheetname='Metadata', header=None, names=['Attribute', 'Value'])
dfM
Out[3]:


    Attribute                                           Value
0   depth.parameterType                                 measured
1   depth.units                                         cm
2   depth.notes                                         depth refers to top of sample
3   depth.datatype                                      csvw:NumericFormat
4   depth.description                                   depth below ocean floor
5   year.parameterType                                  inferred
6   year.units                                          AD
7   year.method                                         linear interpolation
8   year.datatype                                       csvw:NumericFormat
9   year.description                                    calendar year AD
10  temperature.parameterType                           inferred
11  temperature.description                             sea-surface temperature inferred from Mg/Ca ra...
12  temperature.climateInterpretation.seasonality       MJJ
13  temperature.climateInterpretation.parameter         T
14  temperature.climateInterpretation.parameterDetail   seaSurface
15  temperature.climateInterpretation.interpDirection   positive
16  temperature.climateInterpretation.basis             Mg/Ca calibration to SST
17  temperature.calibration.equation                    BAR2005: Mg/Ca=0.794*exp(0.10*SST)
18  temperature.calibration.uncertainty                 1.3
19  temperature.calibration.reference                   Barker et al., (2005), Thornalley et al., (2009)
20  temperature.datatype                                csvw:NumericFormat
21  temperature.material                                foramifera carbonate
22  temperature.proxy                                   Mg/Ca
23  temperature.units                                   deg C
24  paleoDataTableName                                  data
25  filename                                            atlantic0220Thornalley2009.csv


Note that, compared to the LiPD convention, the columns key has been changed: columns are referred to by parameter names instead of column numbers.


Convert the pandas dataframes to a LiPD container

In [4]:
import PODS
reload(PODS)

# build a nested LiPD dictionary from the metadata dataframe and the Data column names
a_LiPD = PODS.df_to_LiPD(dfM, dfD.columns, verbose=True)

import json
print json.dumps(a_LiPD, sort_keys=False, indent=4, separators=(',', ': '))

{
    "depth": {
        "parameterType": "measured",
        "units": "cm",
        "notes": "depth refers to top of sample",
        "datatype": "csvw:NumericFormat",
        "description": "depth below ocean floor"
    },
    "year": {
        "parameterType": "inferred",
        "units": "AD",
        "datatype": "csvw:NumericFormat",
        "method": "linear interpolation",
        "description": "calendar year AD"
    },
    "temperature": {
        "parameterType": "inferred",
        "description": "sea-surface temperature inferred from Mg/Ca ratios",
        "climateInterpretation": {
            "basis": "Mg/Ca calibration to SST",
            "parameter": "T",
            "parameterDetail": "seaSurface",
            "interpDirection": "positive",
            "seasonality": "MJJ"
        },
        "calibration": {
            "equation": "BAR2005: Mg/Ca=0.794*exp(0.10*SST)",
            "uncertainty": 1.3,
            "reference": "Barker et al., (2005), Thornalley et al., (2009)"
        },
        "datatype": "csvw:NumericFormat",
        "material": "foramifera carbonate",
        "proxy": "Mg/Ca",
        "units": "deg C"
    },
    "paleoDataTableName": "data",
    "filename": "atlantic0220Thornalley2009.csv"
}
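The PODS module itself is not listed in this notebook. As a rough illustration, the core of a df_to_LiPD-style conversion turns each dot-separated attribute into a path in a nested dictionary; this is a minimal sketch, not the actual implementation (which also uses the Data column names and handles types):

def df_to_lipd_sketch(dfM):
    # walk each (Attribute, Value) row and create the nested levels on the fly
    lipd = {}
    for attribute, value in zip(dfM['Attribute'], dfM['Value']):
        node = lipd
        parts = str(attribute).split('.')
        for key in parts[:-1]:
            node = node.setdefault(key, {})   # descend, creating intermediate dicts
        node[parts[-1]] = value               # set the leaf value
    return lipd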


Convert a LiPD container back to pandas dataframes

In [5]:
dfD, dfM = PODS.LiPD_to_df(a_LiPD)
In [6]:
dfD
Out[6]:


   depth  year  temperature
0      0  2015        14.34
1    100  2015        12.70
2    200  2015        10.30
3    400  2015        10.20
4    450  2015          NaN
5    500  2015        10.10
6   1000  2015        10.10

In [7]:
dfM
Out[7]:


    Attribute                                           Value
0   depth.parameterType                                 measured
1   depth.units                                         cm
2   depth.notes                                         depth refers to top of sample
3   depth.datatype                                      csvw:NumericFormat
4   depth.description                                   depth below ocean floor
5   year.parameterType                                  inferred
6   year.units                                          AD
7   year.datatype                                       csvw:NumericFormat
8   year.method                                         linear interpolation
9   year.description                                    calendar year AD
10  temperature.parameterType                           inferred
11  temperature.description                             sea-surface temperature inferred from Mg/Ca ra...
12  temperature.climateInterpretation.basis             Mg/Ca calibration to SST
13  temperature.climateInterpretation.parameter         T
14  temperature.climateInterpretation.parameterDetail   seaSurface
15  temperature.climateInterpretation.interpDirection   positive
16  temperature.climateInterpretation.seasonality       MJJ
17  temperature.calibration.equation                    BAR2005: Mg/Ca=0.794*exp(0.10*SST)
18  temperature.calibration.uncertainty                 1.3
19  temperature.calibration.reference                   Barker et al., (2005), Thornalley et al., (2009)
20  temperature.datatype                                csvw:NumericFormat
21  temperature.material                                foramifera carbonate
22  temperature.proxy                                   Mg/Ca
23  temperature.units                                   deg C
24  paleoDataTableName                                  data
25  filename                                            atlantic0220Thornalley2009.csv
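Going the other way, LiPD_to_df essentially flattens the nested dictionary back into dot-separated attributes. A minimal sketch of that flattening, again not the actual PODS implementation (judging from the outputs further below, the real function numbers only lists of objects and keeps plain lists of numbers as values):

def flatten_lipd_sketch(node, prefix=''):
    # return a list of (dot.separated.attribute, value) pairs for one nested LiPD object
    if isinstance(node, dict):
        items = list(node.items())
    elif isinstance(node, list) and all(isinstance(i, dict) for i in node):
        items = list(enumerate(node))          # lists of objects become numbered attributes
    else:
        return [(prefix, node)]                # leaf value (number, string, plain list)
    rows = []
    for key, value in items:
        path = '%s.%s' % (prefix, key) if prefix else str(key)
        rows.extend(flatten_lipd_sketch(value, path))
    return rows

# rows = flatten_lipd_sketch(a_LiPD)
# dfM = pd.DataFrame(rows, columns=['Attribute', 'Value'])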


Export pandas dataframes to a spreadsheet

In [8]:
writer = pd.ExcelWriter('spreadsheet_at_PODS_convention_02.xls')
dfD.to_excel(writer, sheet_name="Data", index=False)                       # one day will be to_ods
dfM.to_excel(writer, sheet_name="Metadata", index=False, header=False)
writer.save()

spreadsheet_at_PODS_convention_01.xls is identical to spreadsheet_at_PODS_convention_02.xls, which shows that the conversions are correct:

PODS ===> LiPD ===> PODS
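A quick programmatic check of this round trip, as a sketch (the row order of the Metadata worksheet may differ between the two files, so the comparison sorts by attribute first):

dfD1 = pd.read_excel('spreadsheet_at_PODS_convention_01.xls', sheetname='Data')
dfD2 = pd.read_excel('spreadsheet_at_PODS_convention_02.xls', sheetname='Data')
dfM1 = pd.read_excel('spreadsheet_at_PODS_convention_01.xls', sheetname='Metadata',
                     header=None, names=['Attribute', 'Value'])
dfM2 = pd.read_excel('spreadsheet_at_PODS_convention_02.xls', sheetname='Metadata',
                     header=None, names=['Attribute', 'Value'])

same_data = dfD1.equals(dfD2)
same_meta = (dfM1.sort_values('Attribute').reset_index(drop=True)
             .equals(dfM2.sort_values('Attribute').reset_index(drop=True)))
print(same_data and same_meta)   # expected: True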



Below are some tests dealing with values entered as lists.


A conversion test from a BibJSON record that describes publication metadata

In [9]:
a_LiPD = [{
    "author": [
    {"name" : "Thornalley, D.J.R"},
    {"name" : "Elderfield, H."},
    {"name" : "McCave, N"}
    ],
    "type" : "article",
    "identifier" : 
        {"type": "doi",
        "id": "10.1038/nature07717",
        "url": "http://dx.doi.org/10.1038/nature07717"}
        ,
    "pubYear": 2009
}]

dfD, dfM = PODS.LiPD_to_df(a_LiPD)
dfM
Out[9]:


   Attribute        Value
0  pubYear          2009
1  identifier.url   http://dx.doi.org/10.1038/nature07717
2  identifier.type  doi
3  identifier.id    10.1038/nature07717
4  type             article
5  author.0.name    Thornalley, D.J.R
6  author.1.name    Elderfield, H.
7  author.2.name    McCave, N


A conversion test with a GeoJSON object

In [10]:
a_LiPD = { "type": "FeatureCollection",
    "features": [
      { "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"prop0": "value0"}
        },
      { "type": "Feature",
        "geometry": {
          "type": "LineString",
          "coordinates": [
            [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
            ]
          },
        "properties": {
          "prop0": "value0",
          "prop1": 0.0
          }
        },
      { "type": "Feature",
         "geometry": {
           "type": "Polygon",
           "coordinates": [
             [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
               [100.0, 1.0], [100.0, 0.0] ]
             ]
         },
         "properties": {
           "prop0": "value0",
           "prop1": {"this": "that"}
           }
         }
       ]
     }

dfD, dfM = PODS.LiPD_to_df(a_LiPD)
dfM
Out[10]:


    Attribute                          Value
0   type                               FeatureCollection
1   features.0.geometry.type           Point
2   features.0.geometry.coordinates    [102.0, 0.5]
3   features.0.type                    Feature
4   features.0.properties.prop0        value0
5   features.1.geometry.type           LineString
6   features.1.geometry.coordinates    [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [10...
7   features.1.type                    Feature
8   features.1.properties.prop0        value0
9   features.1.properties.prop1        0
10  features.2.geometry.type           Polygon
11  features.2.geometry.coordinates    [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [1...
12  features.2.type                    Feature
13  features.2.properties.prop0        value0
14  features.2.properties.prop1.this   that


These last two conversions are not bijective, since the LiPD container contains lists, which PODS.LiPD_to_df transforms into numbered attributes, one per item:

  • author[] becomes author.0, author.1, author.2
  • features[] becomes features.0, features.1, features.2

Should lists be disallowed in LiPD? Otherwise, PODS.df_to_LiPD needs more work to recreate a list from the numbered attributes (see the sketch below):

  • author.0, author.1, author.2 back to author[]
  • features.0, features.1, features.2 back to features[]
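A minimal sketch of that extra step, turning dictionaries whose keys are consecutive integers back into lists (the function name is illustrative, not part of the PODS module):

def lists_from_numbered_keys(node):
    # recursively convert {'0': a, '1': b, ...} (or {0: a, 1: b, ...}) back into [a, b, ...]
    if not isinstance(node, dict):
        return node
    converted = dict((k, lists_from_numbered_keys(v)) for k, v in node.items())
    keys = list(converted.keys())
    if keys and all(str(k).isdigit() for k in keys):
        indices = sorted(int(k) for k in keys)
        if indices == list(range(len(indices))):          # keys are exactly 0..n-1
            return [converted[k] for k in sorted(keys, key=lambda k: int(k))]
    return converted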