Source vs Compiled for data (packages) [Musings]

rufuspollock · August 11, 2016, 1:40pm

Moved here from [idea/discussion] Source vs Compiled for data (packages) · Issue #120 · frictionlessdata/specs · GitHub (May 17 2014)

The source / compiled distinction is common in code, and I think there is a good and relevant analogy with data. It is best explained by some concrete examples:

Normalized to denormalized. Source data is normalized (for efficiency and structure), but the compiled form might be one denormalized file. A good example here is the CRA (~UK budget) dataset in OpenSpending: os-data/gb-country-regional-analysis on GitHub (UK Country Regional Analysis dataset, UK Government Budget). This is a pretty simple tabular dataset. Normalizing leads to a 4-5x reduction in file size, but when loading into OpenSpending you need to denormalize to a single file.
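
To make the normalized-to-denormalized step concrete, here is a minimal sketch in Python. The two tiny tables and the names `denormalize` and `to_csv` are made up for illustration; this is not the real CRA data or how OpenSpending actually loads it, just the general shape of joining a lookup table back into the fact rows.

```python
import csv
import io

# Hypothetical normalized "source" tables (illustrative only, not CRA data).
DEPARTMENTS_CSV = "dept_id,dept_name\n1,Health\n2,Education\n"
SPENDING_CSV = "dept_id,year,amount\n1,2010,100\n2,2010,50\n1,2011,120\n"


def denormalize(facts_csv, lookup_csv, key):
    """Join every fact row with its lookup row, giving one flat table."""
    lookup = {row[key]: row for row in csv.DictReader(io.StringIO(lookup_csv))}
    rows = []
    for fact in csv.DictReader(io.StringIO(facts_csv)):
        merged = dict(fact)
        merged.update(lookup[fact[key]])  # copy the lookup columns into the row
        rows.append(merged)
    return rows


def to_csv(rows):
    """Serialize the flat rows as a single "compiled" CSV string."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


compiled = denormalize(SPENDING_CSV, DEPARTMENTS_CSV, "dept_id")
compiled_csv = to_csv(compiled)
```

The department name is now repeated on every spending row, which is roughly why the compiled file is bigger than the source but easier to load as a single file.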

One can also think of things like:

Spreadsheet => PDF
geodata => tiles

I think for our purposes these are less relevant examples.

This is relevant because I think data package creators and managers (curators) want to work with source, but consumers will often want compiled. There is a good analogy here with code, and specifically Debian: the packager will often create pre-compiled versions of the software, which Debian users then install via apt-get, with these pre-compiled versions stored in the primary FTP storage area.

So overall I can imagine an architecture like this:

Source data packages stored in git or CKAN or …
Defined compilation pattern / step (e.g. a data package has a scripts/make or scripts/compile)
Pre-compiled versions cached into file storage (S3 etc.)
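
A rough sketch of how those three pieces could fit together, in Python. Everything here is an assumption for illustration: a local directory stands in for the file storage (S3 etc.), `example_compile` stands in for whatever scripts/compile would do, and keying the cache on a hash of the source files is just one possible choice.

```python
import hashlib
import tempfile
from pathlib import Path


def source_digest(source_dir):
    """Hash file names and contents so any source change yields a new key."""
    h = hashlib.sha256()
    for path in sorted(source_dir.rglob("*")):
        if path.is_file():
            h.update(path.relative_to(source_dir).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()


def get_compiled(source_dir, cache_dir, compile_step):
    """Return the cached compiled file, running compile_step only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / (source_digest(source_dir) + ".csv")
    if not target.exists():
        target.write_text(compile_step(source_dir))
    return target


calls = []


def example_compile(source_dir):
    """Hypothetical compile step: concatenate the CSV files in the source."""
    calls.append(source_dir)
    return "".join(p.read_text() for p in sorted(source_dir.glob("*.csv")))


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"  # stands in for the git/CKAN checkout
    cache = Path(tmp) / "cache"    # stands in for the file storage
    source.mkdir()
    (source / "data.csv").write_text("a,b\n1,2\n")
    first = get_compiled(source, cache, example_compile)
    second = get_compiled(source, cache, example_compile)  # served from cache
    compiled_text = first.read_text()
    same_path = first == second
```

Consumers would only ever fetch the cached file; curators keep working on the source, and a change there produces a new cache entry on the next request.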
