Emerging patterns / workflows for Data Packages (2014)


#1

Moving material from this github issue in specs repo here: https://github.com/frictionlessdata/specs/issues/113. These notes were started in 2014.

This is a “hack” issue for recording thoughts and examples about how datapackage.json fits into people’s data workflows and how it might be adapted / extended to support them better.

Scraping and “Sources”

A frequent experience in putting together a data package (and dataset generally) is that you need to scrape / pull data from some original source and clean that up.

A very common requirement of this is to store the location of the original data file and use that in the processing / scraping. (Sometimes you even scrape the list of URLs to scrape - e.g. here and here).

You could see this as a kind of “config” info for your data processing workflow.

Ultimately, I’ve personally ended up just coding the config as a variable into the relevant script (see e.g. this recent example). Occasionally I’ve put in a quasi-config json file or similar.

This is unsystematic, and I often end up duplicating that info in other places such as the README and, now, the datapackage.json (in the sources field).

In particular, as I started to add it to the datapackage.json, I’ve been wondering whether one should regularly use the datapackage.json as the place to store and source this kind of info.

Specifically, for the case of source data URLs, the sources field seems a natural and extendable location.
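As a sketch of what this could look like, a datapackage.json might record its original source URLs in the sources array. (The key names below follow one common shape; exact field names have varied across spec versions, and the URL here is a placeholder.)

```json
{
  "name": "my-dataset",
  "sources": [
    {
      "name": "original-data",
      "web": "http://example.com/data/original.csv"
    }
  ]
}
```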

Pattern Proposal: document the use of the sources attribute for storing this kind of data source information, and its use in scraping scripts.
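A minimal sketch of the scraping side of this pattern: a script that reads its source URLs from datapackage.json instead of hard-coded variables, then fetches each into a local cache directory. The function names, the `archive/` cache directory, and the fallback between a `path` and an older `web` key are all illustrative assumptions, not part of any spec.

```python
import json
import urllib.request
from pathlib import Path

def source_urls(datapackage):
    """Extract source URLs from a parsed datapackage.json dict.

    Checks both a "path" key and the older "web" key, since the
    field name has varied across spec versions (an assumption
    made for robustness, not mandated anywhere).
    """
    urls = []
    for source in datapackage.get("sources", []):
        url = source.get("path") or source.get("web")
        if url:
            urls.append(url)
    return urls

def fetch_sources(datapackage_path="datapackage.json", cache_dir="archive"):
    """Download each source URL into a local cache directory."""
    with open(datapackage_path) as f:
        dp = json.load(f)
    Path(cache_dir).mkdir(exist_ok=True)
    for url in source_urls(dp):
        # Name the cached file after the last URL path segment
        dest = Path(cache_dir) / url.rstrip("/").split("/")[-1]
        urllib.request.urlretrieve(url, dest)
```

This keeps the “config” in one place: the same sources entries that document provenance for consumers of the package also drive the scraping step.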

TODO: document potential suggested additions to this field, such as “path”, which would indicate a cache path where that file has been cached locally.
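One hypothetical shape for that addition: each source entry keeps its original URL and gains a “path” pointing at the local cached copy. (This is only an illustration of the TODO, not an agreed part of the spec, and the key names and file locations are placeholders.)

```json
{
  "sources": [
    {
      "name": "original-data",
      "web": "http://example.com/data/original.csv",
      "path": "archive/original.csv"
    }
  ]
}
```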