Making (Tabular) Data Packages More Tabular


#1

In working with data creators who aren’t familiar with programming, I found a resistance to creating metadata in JSON. So, I came up with a way to store all of the information in a datapackage.json file in a tabular format. The result is that you can build a data package with only CSV, and put the whole package in a single Excel file.

Here is the general idea: http://metatab.org

The tabular data format is not specific to any particular structure, but it can be tailored, so with the right input file the standard JSON converter will output a datapackage.json file:

http://metatab.org/2016/10/25/metatab-interface-to-datapackage-json/

I had a good call with Dan Fowler about the idea, and it seems that the problem that Metatab addresses – data creators who are unfamiliar with JSON – is fairly common. I’ve proposed that this format could be used to create a homologue to the datapackage.json file, datapackage.csv, which would be much easier for a lot of users to create, read and use, but have the same information, and can be converted to a datapackage.json file.
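To illustrate the conversion idea in the paragraph above, here is a minimal Python sketch. It is not the Metatab syntax or parser: the two-column (term, value) layout, the sample rows, and the function name rows_to_package are all assumptions made up for this example. It only shows how rows in a CSV could be folded into a datapackage.json-like structure with more than one resource.

```python
import csv
import io
import json

# Hypothetical two-column layout (term, value); NOT the actual Metatab syntax.
SAMPLE_CSV = """term,value
name,gdp
title,Country GDP
resource,data/gdp.csv
resource.format,csv
resource,data/notes.csv
resource.format,csv
"""


def rows_to_package(text):
    """Build a datapackage-like dict from (term, value) rows."""
    package = {"resources": []}
    for row in csv.DictReader(io.StringIO(text)):
        term = row["term"].strip().lower()
        value = row["value"]
        if term == "resource":
            # A bare 'resource' row starts a new resource entry.
            package["resources"].append({"path": value})
        elif term.startswith("resource."):
            if not package["resources"]:
                raise ValueError("resource property before any resource row")
            # 'resource.x' rows attach property x to the most recent resource.
            package["resources"][-1][term.split(".", 1)[1]] = value
        else:
            # Everything else is a top-level package property.
            package[term] = value
    return package


package = rows_to_package(SAMPLE_CSV)
print(json.dumps(package, indent=2))
```

The point of the sketch is that the person editing the rows never has to balance braces or quote keys; the nesting is implied by row order.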

The Metatab project has produced libraries for Python and Javascript, and a Google Spreadsheet plugin that can publish a spreadsheet directly to CKAN. The specs, however, have not been reviewed by anyone else, so comments are very much appreciated. There are two, one for the base tabular data format, and one that tailors the format to metadata packages.

thanks,

eric.


#2

Hey Eric, thanks for posting.

The point you make is a very good one about trying to make it easy to add metadata in an easy to use tool like a spreadsheet.

This point has been thought about quite a bit. For example, we had an early version of data packages in which you could store metadata in a table (like you do). It has also been used in some other (slightly different) efforts e.g. HXL - http://hxlstandard.org/ (where you add a metadata row to your data).

As you suggest the most natural approach would be to create a datapackage.csv that is a mapping of the datapackage.json – this was the original idea in the earlier version of tabular data package called “Simple Data Format” (SDF).

This is definitely worth exploring.

However, I would note the complications which led us to drop this originally:

  • The mapping is non-trivial and quite complex. E.g. a data package has many resources. Representing this in tabular structure leads either to complex structure in your single metadata sheet or multiple metadata sheets. Either way things start to get complex and many of the benefits of the “simple editor” start to fall away. Sure, you can use a spreadsheet to edit the metadata, but it starts to require expertise or a separate tool (in which case, why not use JSON?).
  • If you have this as well as JSON you have another format to support (and one where the mapping to and from your data structure is somewhat complex – unlike, say, YAML)

For the present, we have chosen JSON as a good half-way house:

  • Humans: pretty readable and editable by experts (coders)
  • Machines: very readable and editable (practically natively, with very little mapping needed as most languages have native support). In addition, things like npm already use a JSON-based packager, so many of the ideas are tried and tested.

However, we definitely remain open to improvements and it would be great to see what could be done here.


#3

Rufus,

Thanks for the history. I wasn’t aware of HXL, and although I figured that all of this had been thought about before, I didn’t know about it. Thanks.

HXL is a somewhat different thing. Metatab isn’t row oriented in the same way. Metatab is not really a CSV data file, it’s a grid format for structured data, sometimes stored in CSV. Since it’s not strictly row-oriented, the format can store general structured data, including data package metadata that has multiple resources.

The Metatab and Datapackage.json formats are homologous; one can be converted to the other. For instance, here is a Metatab input that creates the example datapackage.json for the GDP package, from the data package documentation:

“a data package has many resources.” The number of resources hasn’t been a problem so far; I’m also working on a Metatab version of the metadata for the US census, which has about 100 resources, 1000 tables and 9000 fields.

“If you have this as well as JSON you have another format to support” Not entirely; the two formats are homologous and the conversion between datapackage.csv and datapackage.json is completely programmatic. If you serialize a JSON file to Metatab and then convert back to JSON, you get the input file back (with canonicalization), for any possible JSON file. So Metatab can use the same tool chain as datapackage.json, by converting to JSON first. If the data package tool chain included the Metatab Python parser, it could work with a datapackage.csv file in exactly the same way as a datapackage.json file.

“or multiple metadata sheets.” For our use case, in which the metadata is stored in Excel, this is actually an advantage. Our users will be submitting Excel files into a workflow, and the Excel file will have the data and metadata in separate tabs. The largest section of the metadata, the schema, is also in a separate tab. This separation makes it easier to use, but, of course, it’s more tailored to Excel, not to CSVs in a zip file.

“Representing this in tabular structure leads either to complex structure in your single metadata sheet” I don’t think the Metatab structure is more complex than JSON, and it is actually much easier to read than JSON, especially for non-technical users. (Particularly for spreadsheets, where we can add color and styling to separate sections.)
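To make the round-trip claim above concrete (serialize JSON to a tabular form, then convert back and get the same document), here is a minimal Python sketch. It does not use Metatab’s syntax or parser; the two-column (path, value) row layout and the helper names flatten, unflatten, to_rows and from_rows are assumptions invented for illustration. It only demonstrates that arbitrary JSON, including a package with nested resources, can be written as flat rows and rebuilt without loss.

```python
import json


def flatten(node, path=()):
    """Yield (path, leaf) pairs; empty containers count as leaves."""
    if isinstance(node, dict) and node:
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    elif isinstance(node, list) and node:
        for index, child in enumerate(node):
            yield from flatten(child, path + (index,))
    else:
        yield path, node


def unflatten(pairs):
    """Rebuild the document from (path, leaf) pairs emitted in flatten() order."""
    root = None
    for path, leaf in pairs:
        if not path:
            return leaf  # the whole document is a scalar or an empty container
        if root is None:
            root = [] if isinstance(path[0], int) else {}
        node = root
        # Walk down to the parent of the leaf, creating containers as needed.
        for step, following in zip(path, path[1:]):
            container = [] if isinstance(following, int) else {}
            if isinstance(node, list):
                if step == len(node):
                    node.append(container)
                node = node[step]
            else:
                node = node.setdefault(step, container)
        if isinstance(node, list):
            node.append(leaf)
        else:
            node[path[-1]] = leaf
    return root


def to_rows(doc):
    """Two string columns per row, suitable for a CSV file or a spreadsheet."""
    return [(json.dumps(list(path)), json.dumps(leaf)) for path, leaf in flatten(doc)]


def from_rows(rows):
    return unflatten((tuple(json.loads(p)), json.loads(v)) for p, v in rows)


doc = {
    "name": "gdp",
    "keywords": [],
    "resources": [
        {
            "path": "data/gdp.csv",
            "schema": {
                "fields": [
                    {"name": "Year", "type": "date"},
                    {"name": "Value", "type": "number"},
                ]
            },
        }
    ],
}

rows = to_rows(doc)
restored = from_rows(rows)
print(restored == doc)
```

Integer path steps mark list positions and string steps mark object keys, which is why the rebuild is unambiguous. A real format would use friendlier row labels than JSON-encoded paths; that choice is only there to keep the sketch short.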

“Humans: pretty readable and editable by experts (coders)” That is the crux of our problem; our metadata creators aren’t coders, so if we don’t give them a more familiar way to create metadata, we don’t get any metadata.

Also, Metatab makes it possible to entirely manage a CKAN data package from a Google Spreadsheet.

Since we’re committed to the Metatab format for a pilot project in California, it would be really valuable to be able to learn from your experience and thoughts from the prior work you’d done on a tabular metadata format. Could you refer me to prior work or documents?

Perhaps, as a next step, you could propose a test case to validate the format against? I’d be happy to put together a demo.


#4

@ericbusboom I have been thinking more and more about this, and I think it would be really great to explore this further and try to get something really working here in terms of a pure tabular metadata representation.

The best way to proceed right now, I think, is a short chat.

To set that up, I would invite you to jump on the frictionless data chat channel and ping me there: https://gitter.im/frictionlessdata/chat


#5

UPDATE: Eric and I had a call last Thursday.

It looks very promising for Metatab to fully support Data Package by default.

We are also talking about other tool and spec collaboration.


#6

Hi All. I’ve been working on getting our basic tool process running with Metatab, and it’s now useful. Here is the github repo, along with a simple tutorial in the README:

https://github.com/CivicKnowledge/metatab-py

The tutorial demonstrates:

  • Creating a new Metatab package
  • Adding many data resources from a single ZIP file, or all datafiles linked on a web page.
  • Automatically creating schemas
  • Generating a Tabular Data Package compatible ZIP file archive.

From the Tabular Data Package perspective, the most interesting part of this is probably that, when creating a ZIP or S3 package, the system also creates a datapackage.json file, so data creators have a non-JSON way to create the package, but the package can also be used by TDP tooling.
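As a rough picture of what “a ZIP package that also carries a datapackage.json” means, here is a small Python sketch using only the standard library. It is not how metatab-py builds its archives; the function name build_package_zip, the descriptor contents and the file paths are hypothetical.

```python
import io
import json
import zipfile


def build_package_zip(descriptor, files):
    """Return ZIP bytes holding datapackage.json plus the given data files.

    `files` maps archive paths to text content. Hypothetical helper, for
    illustration only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The descriptor sits at the archive root so TDP tooling can find it.
        archive.writestr("datapackage.json", json.dumps(descriptor, indent=2))
        for path, content in files.items():
            archive.writestr(path, content)
    return buffer.getvalue()


descriptor = {
    "name": "gdp",
    "resources": [{"path": "data/gdp.csv", "format": "csv"}],
}
files = {"data/gdp.csv": "Year,Value\n2015,100\n"}

archive_bytes = build_package_zip(descriptor, files)
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    print(archive.namelist())
```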


#7

Also, here are some of the interesting Metatab configuration files:

https://github.com/CivicKnowledge/metatab/tree/master/declarations

In particular, the datapackage-0.1.csv file configures Metatab to use all of the Tabular Data Package terms, and to (mostly) output JSON in the structure of datapackage.json.

The Metatab Python library has a function for creating datapackage.json files from the standard Metatab terms, but we can also configure Metatab to almost exactly harmonize with Tabular Data Packages. Column I in the datapackage-0.1.csv file links the TDP terms to their Metatab equivalents. Rows 24-70 are the metadata terms, and rows 74 to 108 allow for using singular form terms in Metatab to produce the plural form used in the JSON file.


#8

Hi @ericbusboom very nice. I think I understand a bit better. Thanks for merging my edits to the README!

Also thanks for the clarification re: the two potential pathways for creating Data Packages with metatab. I wonder which is more ideal to promote. I’ve left a few other clarifying questions as issues on the repo.

I think we can combine the README and this post for a nice Labs blog post: http://okfnlabs.org/blog/


#9

@danfowler & c., I’ve been working on harmonizing Metatab and the Tabular Data Package spec, resulting in these two Metatab configuration files.

Metatab declarations: https://github.com/CivicKnowledge/metatab/blob/master/declarations/metatab-0.2.csv
Datapackage Declarations: https://github.com/CivicKnowledge/metatab/blob/master/declarations/datapackage-0.1.csv

These are Metatab-formatted files that the Metatab parser uses to validate Metatab files and to produce JSON output. In the Datapackage declarations, the Synonyms allow users to write a Metatab doc with singular names like ‘Resource.Title’ and the name will be translated to the plural name that the Tabular Data Package format defines.
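The synonym idea can be pictured with a few lines of Python. This is only a sketch: the table entries and the function name pluralize_term are invented for illustration, and the real synonym list is whatever the datapackage-0.1.csv declaration file defines.

```python
# Hypothetical synonym table, for illustration only. The actual entries are
# defined in the datapackage-0.1.csv declaration file and may differ.
SYNONYMS = {
    "resource": "resources",
    "source": "sources",
    "license": "licenses",
}


def pluralize_term(term):
    """Translate a singular dotted term such as 'Resource.Title' to its plural form."""
    parent, _, child = term.lower().partition(".")
    parent = SYNONYMS.get(parent, parent)  # unknown parents pass through unchanged
    return f"{parent}.{child}" if child else parent


print(pluralize_term("Resource.Title"))
```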

Using the Declaration document involves setting the first term in a Metatab doc to “Declare datapackage-latest”. After that, the default Metatab JSON output will also be valid TDP JSON. (Although it is probably somewhat broken right now …)

The harmonization process involves adding two columns to the files. In the Metatab decl there is a “DataPackageTerm” property (line 27), which lists the term in the Datapackage decl that is associated with the Metatab term. In the Datapackage decl there are properties (line 23) for DataPackageTerm, holding the original JSON key name in the TDP spec, and for a Metatab term, holding the name of the closest Metatab term.

As you can see from the coverage of these properties, Metatab has a superset of the TDP terms. Metatab is missing only one of the TDP terms, and that is implemented elsewhere. The terms for Columns/Fields and Table/Resources are virtually identical.

I’ve probably gone as far as I can on harmonizing, so it would be valuable to work more closely with you to figure out how to deal with the Metatab terms that don’t have TDP equivalents. I think all of them can be extensions in TDP, but there are a few you might want to consider formally adding to TDP. In particular, I’d like to recommend a more comprehensive set of terms for Contacts and a broader range of Resource types.


#10

Hi @ericbusboom this is brilliant. “decl” == Declaration file?

For those things that you believe should be formally added to TDP, the best first step would be to create an issue here: https://github.com/frictionlessdata/specs/issues

Did you have something in particular in mind when you mentioned “extensions” to TDP?


#11

I meant the normal Data Package extensibility and customisation, adding in properties that are not part of the defined profile.

“create an issue here”: Ok, will do.