Source vs Compiled for data (packages) [Musings]


Moved here from (May 17 2014)

The source / compiled distinction is common in code. I think there is a good and relevant analogy with data. It is best explained by some concrete examples:

  • Normalized to denormalized. Source data is normalized (for efficiency and structure), but the compiled form might be a single denormalized file. A good example is the CRA (~UK budget) dataset in OpenSpending. This is a pretty simple tabular dataset. Normalizing leads to a 4-5x reduction in file size, but to load it into OpenSpending you need to denormalize it to a single file.
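
To make the normalized-to-denormalized step concrete, here is a minimal sketch in Python. The table layout, column names and values are invented for illustration (they are not the real CRA schema), and the helper names are hypothetical. The "source" is a fact table plus a small lookup table; the "compiled" form is one flat CSV with the lookup values repeated on every row.

```python
import csv
import io

# Hypothetical normalized "source": a lookup table plus a fact table.
# Column names and values are made up; this is not the real CRA layout.
DEPARTMENTS_CSV = "id,name\nD1,Health\nD2,Education\n"
ENTRIES_CSV = "year,department_id,amount\n2010,D1,100\n2010,D2,80\n2011,D1,110\n"


def read_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def denormalize(entries, departments):
    """Join each entry with its department so every row is self-contained."""
    by_id = {dept["id"]: dept for dept in departments}
    rows = []
    for entry in entries:
        dept = by_id[entry["department_id"]]
        rows.append({
            "year": entry["year"],
            "department_id": dept["id"],
            "department_name": dept["name"],
            "amount": entry["amount"],
        })
    return rows


def to_csv(rows):
    """Serialize the flat rows to a single CSV string (the "compiled" file)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


compiled = to_csv(denormalize(read_csv(ENTRIES_CSV), read_csv(DEPARTMENTS_CSV)))
```

The flat output repeats the department name on every row. That repetition is why the normalized source is smaller, and also why the compiled file is easier to load in one go.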

One can also think of things like:

  • Spreadsheet => PDF
  • geodata => tiles

I think for our purposes these are less relevant examples.

This matters because I think data package creators and managers (curators) want to work with source, but consumers will often want compiled. There is a good analogy here with code, and specifically Debian: the packager will often create pre-compiled versions of the software, which are stored in the primary FTP storage area and which Debian users then install via apt-get.

So overall I can imagine an architecture like this:

  • Source data packages stored in git or ckan or …
  • A defined compilation pattern / step (e.g. a data package has a scripts/make or scripts/compile)
  • Pre-compiled versions cached in file storage (S3 etc.)
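
The architecture above could be sketched roughly as follows. This is only one possible reading, under several assumptions: a local directory stands in for the file storage (S3 etc.), the cache key is a hash of the source files, and the compile step is passed in as a function. The names `source_digest`, `get_compiled` and `run_compile_script` are hypothetical, as is the convention that scripts/compile takes the output directory as its only argument.

```python
import hashlib
import shutil
import subprocess
import tempfile
from pathlib import Path


def source_digest(package_dir):
    """Hash every source file so the cache key changes whenever the source does."""
    package_dir = Path(package_dir)
    digest = hashlib.sha256()
    for path in sorted(package_dir.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(package_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()


def get_compiled(package_dir, cache_dir, compile_step):
    """Return the cached compiled output, compiling only on a cache miss.

    cache_dir is a local stand-in for remote file storage (an assumption).
    compile_step(package_dir, out_dir) must write its output into out_dir.
    """
    package_dir, cache_dir = Path(package_dir), Path(cache_dir)
    target = cache_dir / source_digest(package_dir)
    if not target.exists():
        with tempfile.TemporaryDirectory() as tmp:
            compile_step(package_dir, Path(tmp))
            cache_dir.mkdir(parents=True, exist_ok=True)
            shutil.copytree(tmp, target)
    return target


def run_compile_script(package_dir, out_dir):
    """Assumed convention: scripts/compile receives the output directory as its argument."""
    script = Path(package_dir) / "scripts" / "compile"
    subprocess.run([str(script), str(out_dir)], check=True)
```

A consumer would call something like `get_compiled(package_dir, cache_dir, run_compile_script)` and read from the returned directory, while curators keep editing the source package. Keying the cache on the source hash means a stale compiled version is never served after the source changes.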