Open Spending Data Structure: Ideas and Suggestions

What’s the problem?

Let’s assume the following use cases:

  • As a developer, I want to generate aggregates to drive my visualisations that facet over dimensions in the data.
  • As a developer, I want to generate a search index for the data which is storage efficient (using nested objects).
  • As an analyst, I want to align budget classifications in my data with those used by an international standard and I need to make those semantics explicit in my data.
  • As a data journalist, I want to generate graph representations of entities (suppliers, authorities) in the spending data to check for signs of corruption (cf. http://www.homolova.sk/dh/info.html)
  • As a linked data academic, I want to model the data into RDF using the DataCube ontology so that I can provide a SPARQL thingie (cf. OpenBudgets.eu)

From what I can tell, the data package format does very little to make possible what any of these users are trying to do. Specifically, it would give them information about CSV column data types; all the other stuff is trivially inferred from the source data. It’s a lot of overhead to have this full metadata spec just to get info on types.

Concrete proposal

Here’s what I think a more versatile and precise version of the BDP stuff could look like. This is based on the assumption that it is desirable for the data model to

  • not rely on naming conventions excessively (“Explicit is better than implicit.”)
  • instead, use annotation to express the semantics of the dataset
  • align for budget comparison outside of the actual source dataset
  • keep it simple, don’t consider hierarchies (cofog1… cofog3) for now

So here’s a guided tour:

I want to emphasise that the additional structure is not just valuable for BI/OLAP use cases, but also needed e.g. to generate a meaningful ElasticSearch mapping, or to generate a transactional network graph.

Why column-based metadata will not work for budget alignment

The issue with classification alignment using OSDP is that it has no notion of non-standard dimensions, such as a German budget’s “Hauptfunktion”. If it did, I could annotate such a dimension to say “map this to COFOG”. Instead, OSDP will see some columns (let’s say hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc) and not understand that they form a common thing, so I would have to annotate any or all of them with the spine mapping info. In either case, it ends up being ambiguous.

The OSDP solution to this is naming conventions: I rename my columns from hauptfunktionID to functionalID etc. and by convention this gets picked up. The problem I have with this is that it constitutes a loss of information (i.e. the term “hauptfunktion” has an actual legal meaning beyond functional classification), and it also makes it impossible to represent both the source and aligned classification in the same dataset. As an aside, it also doesn’t seem to support hierarchies (i.e. hauptfunktion, oberfunktion, funktion would have to be reduced to one column set).

The alternative is to define an explicit mapping in which I say that hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc all form different attributes of the same dimension. Then I can say that this dimension should be mapped out to COFOG. That’s what the OLAPpians call a logical model, which I keep hammering on about. If you want to see a data standard focussed around such modelling, I would point you at Google’s DSPL (Dataset Publishing Language; see the DSPL Tutorial on Google Developers).
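
To make the idea of an explicit mapping a bit more tangible, here is a rough Python sketch. Everything in it is an assumption for illustration: the key names (`dimensions`, `attributes`, `key_attribute`, `aligns_with`), the measure column `betrag`, and the helper functions are made up and are not part of OSDP, BDP or DSPL. The only things taken from the discussion above are the three hauptfunktion columns and the intent to align the dimension with COFOG.

```python
# Hypothetical logical model. The structure and key names are illustrative only.
MODEL = {
    "dimensions": {
        "hauptfunktion": {
            # Three physical columns become attributes of ONE dimension.
            "attributes": {
                "id": {"column": "hauptfunktionID"},
                "label": {"column": "hauptfunktionLabel"},
                "description": {"column": "hauptfunktionDesc"},
            },
            "key_attribute": "id",
            # The alignment is stated once, on the dimension, not per column.
            "aligns_with": "cofog",
        }
    },
    # "betrag" is an assumed column name for the amount.
    "measures": {"amount": {"column": "betrag"}},
}


def columns_for(model, dimension):
    """Return the source columns that make up one dimension."""
    attributes = model["dimensions"][dimension]["attributes"]
    return [spec["column"] for spec in attributes.values()]


def dimension_of(model, column):
    """Return the dimension a source column belongs to, or None."""
    for name, dim in model["dimensions"].items():
        if any(spec["column"] == column for spec in dim["attributes"].values()):
            return name
    return None
```

With a structure like this, the question "which columns belong together, and what do they align with?" has exactly one answer, and the original name “hauptfunktion” is kept rather than renamed away.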

If you include this in OSDP, then the information in the datapackage.json would actually be sufficient to construct meaningful OLAP cubes.
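
As a hedged illustration of why that would be sufficient, the sketch below aggregates a measure over a dimension using nothing but a logical model and the rows. The model layout, the `betrag` column, the `aggregate` function and the sample rows are all invented for this example; a real cube engine would do far more, but the lookup of key column and measure column from the model is the essential step.

```python
from collections import defaultdict

# Hypothetical logical model; key names are illustrative, not from any spec.
MODEL = {
    "dimensions": {
        "hauptfunktion": {
            "attributes": {
                "id": {"column": "hauptfunktionID"},
                "label": {"column": "hauptfunktionLabel"},
            },
            "key_attribute": "id",
        }
    },
    "measures": {"amount": {"column": "betrag"}},
}

# Made-up rows, standing in for lines of a source CSV.
ROWS = [
    {"hauptfunktionID": "0", "hauptfunktionLabel": "Example A", "betrag": "100.0"},
    {"hauptfunktionID": "1", "hauptfunktionLabel": "Example B", "betrag": "25.0"},
    {"hauptfunktionID": "0", "hauptfunktionLabel": "Example A", "betrag": "50.0"},
]


def aggregate(rows, model, dimension, measure):
    """Sum a measure per dimension member, resolving columns via the model."""
    dim = model["dimensions"][dimension]
    key_col = dim["attributes"][dim["key_attribute"]]["column"]
    label_col = dim["attributes"]["label"]["column"]
    measure_col = model["measures"][measure]["column"]

    totals = defaultdict(float)
    labels = {}
    for row in rows:
        key = row[key_col]
        totals[key] += float(row[measure_col])
        labels[key] = row[label_col]

    # One result cell per dimension member, sorted by key.
    return [
        {"id": key, "label": labels[key], measure: totals[key]}
        for key in sorted(totals)
    ]


result = aggregate(ROWS, MODEL, "hauptfunktion", "amount")
```

Note that the caller asks for "hauptfunktion" and "amount", never for column names; that indirection is what column-level metadata alone cannot provide.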
