The LiPD container based on the Linked Data JSON (JSON-LD) format is a practical solution to the problem of organizing and storing hierarchical paleoclimate data in a generalizable schema. This is an important step forward towards standardizing the representation and linkage of diverse paleoclimate datasets.
In this IPython notebook, I introduce a method to interact with the LiPD container using ordinary spreadsheets. The motivation to create this method is guided by the fact that the paleoclimate community uses mainly spreadsheets to edit and store the data and the metadata of their measurements, and not JSON-based formats. What is missing is a way to convert such spreadsheet-based data to LiPD format and vice versa.
Working directly with LiPD has two other disadvantages:
Therefore, I propose to stick with the use of spreadsheets but standardize them into a convention where the data and the metadata are stored in two separate worksheets of the same spreadsheet document.The metadata is stored with a dot notation to represent the hierarchical nature of the attributes. This convention could be called PODS (Paleo Open Document Spreadsheet).
With the PODS convention, users can directly edit their data in an ordinary spreadsheet program like Excel or OpenOffice and later convert them to LiPD, which is a good container for storing data in a document database like mongoDB (since it uses JSON).
The following procedure describes how to organize a spreadsheet at the PODS convention and transform it into LiPD, and vice versa.
Notes:
Today, pandas does not allow yet the reading and the writing of ODS (Open Document Spreadsheet) but there are a lot of requests for this feature and it should be feasible soon (pandas/issue 2311).
The LiPD container refers to a CSV file to store the data. This headerless CSV file where each column refers to a column number is very poorly described and could lead to confusion. I think it would be safer and clearer to use one line of header to name the columns by the parameter names.
The LiPD container can contain a list object for values. It would be simpler to disable this possibility and have only unique values.
Make sure spreadsheet cells are formatted correctly, i.e. numbers cells are specified as number and not text. Same for dates.
A compliance checker needs to be built to check that input files conform to the PODS convention, something like the netCDF CF-checker.
file = 'spreadsheet_at_PODS_convention_01.xls'
import pandas as pd dfD = pd.read_excel(file, sheetname='Data') dfD
depth | year | temperature | |
---|---|---|---|
0 | 0 | 2015 | 14.34 |
1 | 100 | 2015 | 12.70 |
2 | 200 | 2015 | 10.30 |
3 | 400 | 2015 | 10.20 |
4 | 450 | 2015 | NaN |
5 | 500 | 2015 | 10.10 |
6 | 1000 | 2015 | 10.10 |
dfM = pd.read_excel(file, sheetname='Metadata', header=None, names=['Attribute', 'Value']) dfM
Attribute | Value | |
---|---|---|
0 | depth.parameterType | measured |
1 | depth.units | cm |
2 | depth.notes | depth refers to top of sample |
3 | depth.datatype | csvw:NumericFormat |
4 | depth.description | depth below ocean floor |
5 | year.parameterType | inferred |
6 | year.units | AD |
7 | year.method | linear interpolation |
8 | year.datatype | csvw:NumericFormat |
9 | year.description | calendar year AD |
10 | temperature.parameterType | inferred |
11 | temperature.description | sea-surface temperature inferred from Mg/Ca ra... |
12 | temperature.climateInterpretation.seasonality | MJJ |
13 | temperature.climateInterpretation.parameter | T |
14 | temperature.climateInterpretation.parameterDetail | seaSurface |
15 | temperature.climateInterpretation.interpDirection | positive |
16 | temperature.climateInterpretation.basis | Mg/Ca calibration to SST |
17 | temperature.calibration.equation | BAR2005: Mg/Ca=0.794*exp(0.10*SST) |
18 | temperature.calibration.uncertainty | 1.3 |
19 | temperature.calibration.reference | Barker et al., (2005), Thornalley et al., (2009) |
20 | temperature.datatype | csvw:NumericFormat |
21 | temperature.material | foramifera carbonate |
22 | temperature.proxy | Mg/Ca |
23 | temperature.units | deg C |
24 | paleoDataTableName | data |
25 | filename | atlantic0220Thornalley2009.csv |
Note that columns key has been changed and column numbers changed to parameter names.
import PODS reload(PODS) a_LiPD = PODS.df_to_LiPD(dfM, dfD.columns, verbose=True) import json print json.dumps(a_LiPD, sort_keys=False, indent=4, separators=(',', ': '))
{ "depth": { "parameterType": "measured", "units": "cm", "notes": "depth refers to top of sample", "datatype": "csvw:NumericFormat", "description": "depth below ocean floor" }, "year": { "parameterType": "inferred", "units": "AD", "datatype": "csvw:NumericFormat", "method": "linear interpolation", "description": "calendar year AD" }, "temperature": { "parameterType": "inferred", "description": "sea-surface temperature inferred from Mg/Ca ratios", "climateInterpretation": { "basis": "Mg/Ca calibration to SST", "parameter": "T", "parameterDetail": "seaSurface", "interpDirection": "positive", "seasonality": "MJJ" }, "calibration": { "equation": "BAR2005: Mg/Ca=0.794*exp(0.10*SST)", "uncertainty": 1.3, "reference": "Barker et al., (2005), Thornalley et al., (2009)" }, "datatype": "csvw:NumericFormat", "material": "foramifera carbonate", "proxy": "Mg/Ca", "units": "deg C" }, "paleoDataTableName": "data", "filename": "atlantic0220Thornalley2009.csv" }
dfD, dfM = PODS.LiPD_to_df(a_LiPD)
dfD
depth | year | temperature | |
---|---|---|---|
0 | 0 | 2015 | 14.34 |
1 | 100 | 2015 | 12.70 |
2 | 200 | 2015 | 10.30 |
3 | 400 | 2015 | 10.20 |
4 | 450 | 2015 | NaN |
5 | 500 | 2015 | 10.10 |
6 | 1000 | 2015 | 10.10 |
dfM
Attribute | Value | |
---|---|---|
0 | depth.parameterType | measured |
1 | depth.units | cm |
2 | depth.notes | depth refers to top of sample |
3 | depth.datatype | csvw:NumericFormat |
4 | depth.description | depth below ocean floor |
5 | year.parameterType | inferred |
6 | year.units | AD |
7 | year.datatype | csvw:NumericFormat |
8 | year.method | linear interpolation |
9 | year.description | calendar year AD |
10 | temperature.parameterType | inferred |
11 | temperature.description | sea-surface temperature inferred from Mg/Ca ra... |
12 | temperature.climateInterpretation.basis | Mg/Ca calibration to SST |
13 | temperature.climateInterpretation.parameter | T |
14 | temperature.climateInterpretation.parameterDetail | seaSurface |
15 | temperature.climateInterpretation.interpDirection | positive |
16 | temperature.climateInterpretation.seasonality | MJJ |
17 | temperature.calibration.equation | BAR2005: Mg/Ca=0.794*exp(0.10*SST) |
18 | temperature.calibration.uncertainty | 1.3 |
19 | temperature.calibration.reference | Barker et al., (2005), Thornalley et al., (2009) |
20 | temperature.datatype | csvw:NumericFormat |
21 | temperature.material | foramifera carbonate |
22 | temperature.proxy | Mg/Ca |
23 | temperature.units | deg C |
24 | paleoDataTableName | data |
25 | filename | atlantic0220Thornalley2009.csv |
writer = pd.ExcelWriter('spreadsheet_at_PODS_convention_02.xls') dfD.to_excel(writer, sheet_name="Data", index=False) # one day will be to_ods dfM.to_excel(writer, sheet_name="Metadata", index=False, header=False) writer.save()
spreadsheet_at_PODS_convention_01.xls is identical to spreadsheet_at_PODS_convention_02.xls proof that convertions are correct.
PODS ===> LiPD ===> PODS
a_LiPD = [{ "author": [ {"name" : "Thornalley, D.J.R"}, {"name" : "Elderfield, H."}, {"name" : "McCave, N"} ], "type" : "article", "identifier" : {"type": "doi", "id": "10.1038/nature07717", "url": "http://dx.doi.org/10.1038/nature07717"} , "pubYear": 2009 }] dfD, dfM = PODS.LiPD_to_df(a_LiPD) dfM
Attribute | Value | |
---|---|---|
0 | pubYear | 2009 |
1 | identifier.url | http://dx.doi.org/10.1038/nature07717 |
2 | identifier.type | doi |
3 | identifier.id | 10.1038/nature07717 |
4 | type | article |
5 | author.0.name | Thornalley, D.J.R |
6 | author.1.name | Elderfield, H. |
7 | author.2.name | McCave, N |
a_LiPD = { "type": "FeatureCollection", "features": [ { "type": "Feature", "geometry": {"type": "Point", "coordinates": [102.0, 0.5]}, "properties": {"prop0": "value0"} }, { "type": "Feature", "geometry": { "type": "LineString", "coordinates": [ [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0] ] }, "properties": { "prop0": "value0", "prop1": 0.0 } }, { "type": "Feature", "geometry": { "type": "Polygon", "coordinates": [ [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ] ] }, "properties": { "prop0": "value0", "prop1": {"this": "that"} } } ] } dfD, dfM = PODS.LiPD_to_df(a_LiPD) dfM
Attribute | Value | |
---|---|---|
0 | type | FeatureCollection |
1 | features.0.geometry.type | Point |
2 | features.0.geometry.coordinates | [102.0, 0.5] |
3 | features.0.type | Feature |
4 | features.0.properties.prop0 | value0 |
5 | features.1.geometry.type | LineString |
6 | features.1.geometry.coordinates | [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [10... |
7 | features.1.type | Feature |
8 | features.1.properties.prop0 | value0 |
9 | features.1.properties.prop1 | 0 |
10 | features.2.geometry.type | Polygon |
11 | features.2.geometry.coordinates | [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [1... |
12 | features.2.type | Feature |
13 | features.2.properties.prop0 | value0 |
14 | features.2.properties.prop1.this | that |
Those last 2 convertions are not bijective since LiPD container contains lists that are transformed with numbers for each items when PODS.LiPD_to_df is used:
Should list be disallowed from LiPD ? Otherwise need more work to recreate a list from numbers when use PODS.df_to_LiPD