Hi all,
I’m sorry if this is a dumb question, but I’m trying to get started converting some data to the new OS Data Package format. Yes, of course the format is still being defined but I thought it might be a useful concrete step.
Anyway, I can’t see either in the current OSDP spec or the previous Budget Data Package format an explicit description of how to implement links between various levels of aggregation.
I can think of a couple of ways:
A single CSV with the levels of hierarchy as different columns: “Department”, “Programme”, “Project” for instance. Downsides: lots of repeated values, and no easy way to see the aggregated numbers. (Perhaps not so important since the analytics platform will do that…?)
As above, but with additional CSV files for the higher levels: department,csv, programme.csv, project.csv.
But is there then a way to make this hierarchy explicit in the datapackage.json?
Maybe it’s too early to be getting into these specifics?
That’s a very good point. I’ve been managing hierarchies outside of the datastore in the (former) OS satellite site I operate (like this). I’m now thinking about how to pull them into the main domain model of SpenDB, an OpenSpending fork which I develop. The most common method of doing this seems to be to model hierarchies as a specified set of attributes inside a logical dimension - often stored within the same dimension table, even if different levels have very different cardinalities…
To me, the options you highlight are equivalent; I feel they are just a product of the fact that OSDP still doesn’t cleanly separate between the concerns of the physical structure of the data (i.e. the sets of files or database tables used to supply a given dataset), and the semantics of the contained columns (i.e. how do these columns relate to the process of government budgeting/expenditure).
This has become a bit more explicit with the introduction of the mapping structure (which essentially appear to establish “well-known field types”), but the spec still confuses the life out of me. Still, in my opinion, support for OLAP features like hierarchies would become much more feasible if OS fully adopted that model and made dimensions/measures (and not CSV columns) the first-class entities in it’s data model.