Source vs Compiled for data (packages) [Musings]

Moved here from [idea/discussion] Source vs Compiled for data (packages) · Issue #120 · frictionlessdata/specs · GitHub (May 17 2014)

The source / compiled distinction is common in code, and I think there is a good and relevant analogy with data. Best explained by some concrete examples:

One can also think of things like:

  • Spreadsheet => PDF
  • geodata => tiles

I think for our purposes these are less relevant examples.

This is relevant because I think data package creators and managers (curators) want to work with source, while consumers will often want compiled. There is a good analogy here with code, and specifically with Debian: the packager will often create pre-compiled versions of the software, which are stored in the primary FTP storage area and which Debian users then install via apt-get.

So overall I can imagine an architecture like this:

  • Source data packages stored in git or CKAN or …
  • A defined compilation pattern / step (e.g. a data package has a scripts/make or scripts/compile)
  • Pre-compiled versions cached in file storage (S3 etc.)
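
To make the three pieces above a bit more concrete, here is a rough sketch of how they might fit together. It is only an illustration under several assumptions of mine: the function names (`source_digest`, `run_compile_script`, `get_compiled`) are made up, a local directory stands in for S3-style file storage, the cache is keyed by a hash of the source files, and `scripts/compile` is assumed to take the output directory as its only argument. None of this is defined by any spec.

```python
import hashlib
import subprocess
import shutil
from pathlib import Path


def source_digest(source_dir: Path) -> str:
    """Hash every file in the source package (paths and contents).

    Assumption: one cache entry per distinct state of the source.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(source_dir).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def run_compile_script(source_dir: Path, build_dir: Path) -> None:
    """Default compile step: run the package's own scripts/compile.

    Assumption: the script accepts the output directory as its only argument.
    """
    script = source_dir / "scripts" / "compile"
    subprocess.run([str(script), str(build_dir)], cwd=source_dir, check=True)


def get_compiled(source_dir, cache_root, compile_step=run_compile_script) -> Path:
    """Return the cached compiled package, compiling it first if needed."""
    source_dir, cache_root = Path(source_dir), Path(cache_root)
    target = cache_root / source_digest(source_dir)
    if target.is_dir():
        return target  # cache hit: consumers get the pre-compiled version

    # Build into a scratch directory so a failed compile never looks cached.
    partial = cache_root / (target.name + ".partial")
    shutil.rmtree(partial, ignore_errors=True)
    partial.mkdir(parents=True)
    compile_step(source_dir, partial)
    partial.rename(target)
    return target
```

A curator would keep editing the source package, while a consumer-facing service would call something like `get_compiled("my-package", "/var/cache/datapackages")` and serve whatever directory comes back. Swapping the local cache directory for real object storage would change the existence check and the final rename, but not the overall shape.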