What compression format (if any) should we use for OpenSpending data?


#1

Motivation: we may want to compress some data files in the OpenSpending DataStore (S3) to a) save money and b) make it easier to upload and download data.

Options:

  • zip
  • gzip
  • bzip2
  • xz

Comments:

  • Both bzip2 and xz are reasonably obscure (and newer). xz generally compresses better and decompresses faster than bzip2, so I think we can discard bzip2.
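
For anyone who wants to check the size difference on their own data, here is a rough sketch using only the Python standard library. The CSV rows are invented for illustration (they are not real OpenSpending data), and the ratios will differ on real files, so treat the output as indicative only.

```python
import bz2
import gzip
import lzma

# Invented sample data standing in for a spending CSV (an assumption, not real data)
rows = [f"{2000 + i % 15},region-{i % 40},{i * 37 % 9973}\n" for i in range(20000)]
raw = ("year,region,amount\n" + "".join(rows)).encode("utf-8")

# Each entry pairs a one-shot compress function with its matching decompress
codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),  # lzma defaults to the .xz container
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(raw)
    # Sanity check: the round trip must give back the original bytes
    assert decompress(packed) == raw
    sizes[name] = len(packed)

for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size / len(raw):.1%} of original)")
```

All three use default compression levels here; tuning the level changes both size and speed.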

Criteria (in order)

  • Widespread support (we can imagine people other than the core OS team wanting to grab these)
  • Support for streaming decompression (no need to have the whole file before starting to decompress)
    • In case we want to pull data into another service
  • Size over the wire
  • (probably not important) - speed of compression/decompression
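
To make the streaming criterion concrete, here is a minimal sketch of decompressing gzip data chunk by chunk with Python's `zlib` module, so output is available before the whole file has arrived. The helper name `iter_decompressed`, the sample rows, and the 64-byte chunk size are all made up for illustration; a real consumer would feed it chunks from an HTTP or S3 response body.

```python
import gzip
import zlib

def iter_decompressed(chunks):
    """Yield decompressed bytes as compressed gzip chunks arrive."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for chunk in chunks:
        data = decoder.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered once the input ends
    tail = decoder.flush()
    if tail:
        yield tail

# Invented sample data standing in for a large CSV (an assumption)
raw = b"year,amount\n" + b"".join(
    b"%d,%d\n" % (2000 + i % 15, i * 37) for i in range(1000)
)
compressed = gzip.compress(raw)

# Simulate a download arriving in small pieces
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(iter_decompressed(pieces))
```

The same incremental pattern exists for the other candidates in the standard library (`bz2.BZ2Decompressor`, `lzma.LZMADecompressor`), so this criterion alone may not separate them.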

#2

Since we’re starting to support Budget Data Packages, which rely on data packages, I think this should be something supported by data packages that can then trickle down into OpenSpending. Otherwise we end up deciding our own approach and being different from everyone else when (and if) data packages settle on something.


#3

Maybe, but I think it is better to address this specific issue here:

a) It’s specific (rather than the general case), which may make it easier to address
b) It was more about immediate convenience for (power) uploaders and downloaders (specifically, Holger, who is working on farmsubsidy, brought this up)

So: I think it is definitely worth discussing here :slight_smile: (even if we can eventually upstream)


#4

Doesn’t compression rob us of really amazing opportunities to do streaming stuff on CSV? I can’t imagine S3 costing so much this actually becomes a need so I would be a tentative -1 on compression as a whole :slight_smile:


#5

That was my instinct too …

However, Holger made a compelling argument in terms of upload and download: for some data (at least near-term) we may want compression because of size (e.g. the farmsubsidy data).

I’m still not sure.

I also wonder if some compression formats allow compression / decompression on the fly (I thought gzip supported this), in which case this might not be so bad.
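
For what it’s worth, here is a small sketch suggesting gzip does work on the fly, at least from Python: rows are compressed as they are written and decompressed as they are read, so the CSV never has to exist uncompressed in full. The in-memory buffer stands in for a file or an S3 object body, and the column names and values are invented for illustration.

```python
import csv
import gzip
import io

buf = io.BytesIO()  # stand-in for a file on disk or an S3 object body

# Write: each row is compressed as it is written
with gzip.open(buf, "wt", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["recipient", "amount"])
    for i in range(5):
        writer.writerow([f"farm-{i}", i * 100])

# Read: rows are decompressed lazily, one at a time
buf.seek(0)
total = 0
with gzip.open(buf, "rt", newline="") as src:
    for row in csv.DictReader(src):
        total += int(row["amount"])

print(total)
```

So row-by-row streaming over CSV still looks possible with gzip. What is lost is cheap random access into the middle of a file.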

PS: on the cost front, I started a cost analysis spreadsheet here, but I should probably open a separate thread for that …


#6

Here’s a pull request for adding compression support for data packages generally: https://github.com/dataprotocols/dataprotocols/pull/198