Moved here from "[idea/discussion] Source vs Compiled for data (packages)", frictionlessdata/specs issue #120 (May 17 2014)
The source / compiled distinction is common in code. I think there is a good and relevant analogy with data, best explained by some concrete examples:
- Normalized to denormalized. Source data is normalized (for efficiency and structure) but the compiled form might be one denormalized file. A good example here is the CRA (~UK budget) dataset in OpenSpending: os-data/gb-country-regional-analysis (UK Country Regional Analysis dataset, i.e. UK Government Budget). This is a pretty simple tabular dataset. Normalizing leads to a 4-5x reduction in file size, but when loading into OpenSpending you need to denormalize to a single file (a minimal sketch follows).
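To make the normalized vs. denormalized point concrete, here is a minimal sketch. The table and column names (regions, spending, region_id) are illustrative assumptions, not the actual CRA schema:

```python
import pandas as pd

# Hypothetical normalized source: a lookup table plus a fact table
# (illustrative columns, not the actual CRA layout).
regions = pd.DataFrame({
    "region_id": [1, 2],
    "region_name": ["North East", "London"],
})
spending = pd.DataFrame({
    "region_id": [1, 1, 2],
    "year": [2012, 2013, 2013],
    "amount": [100, 110, 250],
})

# The compiled (denormalized) form repeats region_name on every row --
# bigger on disk, but a single flat file a consumer can load directly.
flat = spending.merge(regions, on="region_id", how="left")
print(flat)
```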
One can also think of things like:
- Spreadsheet => PDF
- geodata => tiles
I think for our purposes these are less relevant examples.
This is relevant because I think data package creators and managers (curators) want to work with the source, but consumers will often want the compiled form. There is a good analogy here with code, and specifically Debian: the packager will often create pre-compiled versions of the software, which Debian users then install via apt-get, with these pre-compiled versions stored in the primary FTP storage area.
So overall I can imagine an architecture like this:
- Source data packages stored in git or ckan or …
- Defined compilation pattern / step (e.g. a data package has a scripts/make or scripts/compile; see the sketch below the list)
- Pre-compiled versions cached into file storage (S3 etc.)
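To illustrate how these pieces could fit together, here is a sketch of a hypothetical scripts/compile.py. The data/ and compiled/ directories, the column names, and the S3 bucket name are all assumptions for the sake of the example, not anything the spec defines:

```python
#!/usr/bin/env python
"""Hypothetical scripts/compile.py for a source data package.

Builds denormalized artifacts from the normalized source files in
data/ into compiled/, then caches them in file storage (S3 here).
Paths, column names and the bucket name are illustrative only.
"""
import os
import pandas as pd
import boto3

SOURCE_DIR = "data"
OUT_DIR = "compiled"
BUCKET = "example-datapackage-cache"   # hypothetical bucket


def compile_package():
    """Denormalize the source tables into a single flat CSV."""
    regions = pd.read_csv(os.path.join(SOURCE_DIR, "regions.csv"))
    spending = pd.read_csv(os.path.join(SOURCE_DIR, "spending.csv"))
    flat = spending.merge(regions, on="region_id", how="left")

    os.makedirs(OUT_DIR, exist_ok=True)
    out_path = os.path.join(OUT_DIR, "spending-denormalized.csv")
    flat.to_csv(out_path, index=False)
    return out_path


def cache(path):
    """Push the compiled artifact to the file-storage cache."""
    s3 = boto3.client("s3")
    s3.upload_file(path, BUCKET, os.path.basename(path))


if __name__ == "__main__":
    cache(compile_package())
```

A CI job or the registry could run this on each update: curators keep working against the normalized source in git / CKAN, while consumers fetch the cached compiled artifact from file storage. The spec would only need to standardize where the compile entry point lives (e.g. scripts/compile); how the compiled output is cached can stay out of scope.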