How to handle big data (OpenSpending in Ukraine)


#1

In Ukraine we have terabytes of spending data available.
It’s near real-time government spending data at the level of individual bank transactions.
How to handle such a massive array of information?
How much would it cost?
How to make it useful?


#2

As a first try, I’ve uploaded a 5-day extract of the data to the next.openspending platform.

Here’s my experience with OpenSpending:

  1. There are several fields for which I had to choose “Unknown”:
    a) transaction description. Unfortunately, this field is unstructured: it is a single raw string that mixes classification codes, contract identifiers, budgetary codes, court orders, etc. Can OpenSpending help in any way with parsing these raw descriptions and extracting the valuable information?
    b) recipient bank (name and code). I couldn’t find a way to specify which bank the money goes to.
    c) payer bank (name and code). Similarly, there is apparently no field dedicated to the payer’s bank account.

  2. When I visualize a recipient, I see its identifier but not its display name.

  3. When I edit the dataset, it looks like the date format is not saved; it’s always “%Y-%m-%d”.

  4. I couldn’t find a way to enrich my dataset with external resources. I found that there is an “external hooks” feature. What is it for? There’s also a “Data Mine” feature. Is that for enrichment?
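
To illustrate the kind of parsing asked about in 1a, here is a minimal Python sketch that pulls a few values out of a raw description string with regular expressions. The patterns, field names and sample string are all hypothetical: the real description format is not specified in this thread, so they would need to be replaced with rules derived from the actual data.

```python
import re

# Hypothetical patterns: the real description format is not documented here,
# so these names and regexes are illustrative assumptions only.
PATTERNS = {
    "classification_code": re.compile(r"\bcode\s+(\d{4})\b", re.IGNORECASE),
    "contract_id": re.compile(r"\bcontract\s+(?:no\.?\s*)?([\w/-]+)", re.IGNORECASE),
    "date": re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b"),
}


def extract_fields(description: str) -> dict:
    """Return the first match for each pattern, or None when it is absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(description)
        result[name] = match.group(1) if match else None
    return result


# Made-up sample string, not taken from the real dataset.
sample = "Payment for services; code 2240; contract No. 15/2016 dated 03.02.2016"
fields = extract_fields(sample)
print(fields)
```

Something like this could run as a preprocessing step before upload, so that the extracted values become separate columns instead of staying buried in one “Unknown” field.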

Overall, the uploading experience is great, but I don’t know how well it scales.


#3

Hey @vanuan!

First things first - I’m opening proper GitHub issues for all the problems you mentioned. Adding more data types is super easy, and we’ll investigate the rest.

Generally speaking we’ve always wanted to work with spending data, but we didn’t have a proper use case.
If you’re up to it, I’d love to have a short call with you to show you around and answer all your questions - especially around our framework for automatically uploading data to openspending (see the datapackage-pipeline framework and where we use it). With that you can do data enrichment, custom processing and much more.

Hoping to hear from you!


#4

I’m wondering if it may be possible to attach Babbage to Amazon RedShift. It wouldn’t be fast (i.e. multi-second responses) but it would certainly scale very well.

Where does the Ukrainian treasury publish this data? Can you share a link?


#5

@pudo It’s http://spending.gov.ua/

See also Entry for Government Spending / Ukraine