Emerging patterns / workflows for Data Packages (2014)


#1

Moving material from this github issue in specs repo here: https://github.com/frictionlessdata/specs/issues/113. These notes were started in 2014.

This is a “hack” issue for recording thoughts and examples about how datapackage.json fits into people’s data workflows and how it might be adapted / extended to support them better.

Scraping and “Sources”

A frequent experience in putting together a data package (and dataset generally) is that you need to scrape / pull data from some original source and clean that up.

A very common requirement of this is to store the location of the original data file and use that in the processing / scraping. (Sometimes you even scrape the list of URLs to scrape - e.g. here and here).

You could see this as a kind of “config” info for your data processing workflow.

Ultimately, I’ve personally ended up just coding the config as a variable into the relevant script (see e.g. this recent example). Occasionally I’ve put in a quasi-config json file or similar.

This is unsystematic, and I often end up duplicating that info in other places such as the README and, now, the datapackage.json (in the sources field).

In particular, as I started to add it to the datapackage.json, I’ve been wondering whether one should regularly use the datapackage.json as the place to store and source this kind of info.

Specifically, for the case of source data URLs, the sources field seems a natural and extendable location.
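As a sketch of what this could look like, a datapackage.json might record its original source URLs in the sources array. (The key names below follow one common shape; exact field names have varied across spec versions, and the URL here is a placeholder.)

```json
{
  "name": "my-dataset",
  "sources": [
    {
      "name": "original-data",
      "web": "http://example.com/data/original.csv"
    }
  ]
}
```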

Pattern Proposal: document the use of the sources attribute for storing this kind of data source information, and its use in scraping scripts.
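A minimal sketch of the scraping side of this pattern: a script that reads its source URLs from datapackage.json instead of hard-coded variables, then fetches each into a local cache directory. The function names, the `archive/` cache directory, and the fallback between a `path` and an older `web` key are all illustrative assumptions, not part of any spec.

```python
import json
import urllib.request
from pathlib import Path

def source_urls(datapackage):
    """Extract source URLs from a parsed datapackage.json dict.

    Checks both a "path" key and the older "web" key, since the
    field name has varied across spec versions (an assumption
    made for robustness, not mandated anywhere).
    """
    urls = []
    for source in datapackage.get("sources", []):
        url = source.get("path") or source.get("web")
        if url:
            urls.append(url)
    return urls

def fetch_sources(datapackage_path="datapackage.json", cache_dir="archive"):
    """Download each source URL into a local cache directory."""
    with open(datapackage_path) as f:
        dp = json.load(f)
    Path(cache_dir).mkdir(exist_ok=True)
    for url in source_urls(dp):
        # Name the cached file after the last URL path segment
        dest = Path(cache_dir) / url.rstrip("/").split("/")[-1]
        urllib.request.urlretrieve(url, dest)
```

This keeps the “config” in one place: the same sources entries that document provenance for consumers of the package also drive the scraping step.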

TODO: document potential suggested additions to this field, such as “path”, which would indicate a cache path where that file has been cached locally.
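One hypothetical shape for that addition: each source entry keeps its original URL and gains a “path” pointing at the local cached copy. (This is only an illustration of the TODO, not an agreed part of the spec, and the key names and file locations are placeholders.)

```json
{
  "sources": [
    {
      "name": "original-data",
      "web": "http://example.com/data/original.csv",
      "path": "archive/original.csv"
    }
  ]
}
```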