Open Spending Data Structure: Ideas and Suggestions

I’m starting this as a general thread for ideas, suggestions, arguments and general chit chat on the data structure in the next iteration of Open Spending.

Context

As we start to flesh out a roadmap for Open Spending, based on the newly proposed architecture, we need a solid basis for how metadata will be defined as part of the overall data model.

At the base of the new architecture is a flat-file data store. More specifically, all data will be stored in CSV files, with metadata in JSON.

The packaging specification we are using is Data Package, and we are designing a Data Package Profile called Open Spending Data Package to meet the specific data and metadata needs of Open Spending.

The Open Spending Data Package (OSDP) spec is currently in draft. The draft can be read here: http://labs.openspending.org/osep/osep-04.html and specific comments on the draft can be made here: [OSEP-04] Open Spending Data Package · Issue #6 · openspending/osep · GitHub

OSDP short-term goals

  • Obviously, to provide a way to describe the data in a flat file format
  • In the first iteration, to at least capture the metadata requirements that allow us to port existing Open Spending data into the new format
  • Iterate and be open to change
    • Based on compelling use cases for financial data that we want to support
    • Particularly, based on challenges we encounter while bootstrapping the new architecture over the next ~6 months

Ideas

Please add to this thread with what you’d like to see in OSDP, and why.

What’s the problem?

Let’s assume the following use cases:

  • As a developer, I want to generate aggregates to drive my visualisations that facet over dimensions in the data.
  • As a developer, I want to generate a search index for the data which is storage efficient (using nested objects).
  • As an analyst, I want to align budget classifications in my data with those used by an international standard and I need to make those semantics explicit in my data.
  • As a data journalist, I want to generate graph representations of entities (suppliers, authorities) in the spending data to check for signs of corruption (cf. http://www.homolova.sk/dh/info.html)
  • As a linked data academic, I want to model the data into RDF using the DataCube ontology so that I can provide a SPARQL thingie (cf. OpenBudgets.eu)

From what I can tell, the Data Package format does very little to make possible what any of these users are trying to do. Specifically, it would give them information on CSV column data types - all the other stuff is trivially inferred from the source data. It’s a lot of overhead to have this full metadata spec just to get info on types.

Concrete proposal

Here’s what I think a more versatile and precise version of the BDP stuff could look like. This is based on the assumption that it is desirable for the data model to

  • not rely on naming conventions excessively (“Explicit is better than implicit.”)
  • instead, use annotation to express the semantics of the dataset
  • align for budget comparison outside of the actual source dataset
  • keep it simple, don’t consider hierarchies (cofog1… cofog3) for now

So here’s a guided tour:

I want to emphasise that the additional structure is not just valuable for BI/OLAP use cases, but also needed e.g. to generate a meaningful ElasticSearch mapping, or to generate a transactional network graph.

Why column-based metadata will not work for budget alignment

The issue with classification alignment using OSDP will be that it doesn’t have the notion of any non-standard dimensions, such as a German budget’s “Hauptfunktion”. I’d want to annotate such a dimension to say “map this up to COFOG”. Instead, OSDP will see some columns (let’s say hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc) and not understand that they form a common thing, so I would have to annotate any or all of them with the spine-mapping info. Either way, it ends up being ambiguous.

The OSDP solution to this is naming conventions: I rename my columns from hauptfunktionID to functionalID etc., and by convention this gets picked up. The problem I have with this is that it constitutes a loss of information (the term “hauptfunktion” has an actual legal meaning beyond functional classification), and it also makes it impossible to represent both the source and the aligned classification in the same dataset. As an aside, it also doesn’t seem to support hierarchies (i.e. hauptfunktion, oberfunktion, funktion would have to be reduced to one column set).

The alternative is to define an explicit mapping in which I say that hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc all form different attributes of the same dimension. Then I can say that this dimension should be mapped out to COFOG. That’s what the OLAPpians call a logical model, which I keep hammering on about. If you want to see a data standard focussed around such modelling, I would point you at Google’s DSPL (DSPL Tutorial  |  Dataset Publishing Language  |  Google Developers).

If you include this in OSDP, then the information in the datapackage.json would actually be sufficient to construct meaningful OLAP cubes.
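For illustration, an explicit dimension declaration along these lines might look something like the following sketch. The property names here (e.g. mapTo) are hypothetical and not part of any current spec:

```json
{
    "dimensions": {
        "hauptfunktion": {
            "attributes": {
                "id": "hauptfunktionID",
                "label": "hauptfunktionLabel",
                "description": "hauptfunktionDesc"
            },
            "mapTo": "cofog"
        }
    }
}
```

The point is that the three physical columns are grouped into one logical dimension, and the alignment target is stated once, on the dimension, rather than ambiguously on individual columns.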


Hi @pudo

Thanks for taking the time to engage with this. There are some important points/ideas here, so let’s talk through it:

schema_proposal in YAML

As described here: A proposed metadata structure for OpenSpending raw data. · GitHub

(I’m just going to go over it all, even if some is obvious)

meta_data

We have conventions for this in Data Package (name, title, etc.), these properties sit directly on the top-level object of a Data Package Descriptor - so we are basically in alignment here.

model

This is great. It ties in with some points @rufuspollock talked about with me this week about the data model being centred around 4 different classes of thing:

  • Entities
  • Transactions
  • Projects
  • Taxonomies

Let’s break it down a bit:

If we just forget about the different dimension scheme for a minute, Data Package has a robust system for declaring meta data on a dimension, and the basic properties of each attribute in a dimension.

So, even if we are talking about dimensions being split over multiple resources (ie: different CSV files in a package), or, in a single resource (one file has the necessary data for different dimensions), we can provide that information on each Data Package Resource (Data Packages - Data Protocols - Open Knowledge Foundation), and more specifically for types/format, on the schema of each resource (JSON Table Schema - Data Protocols - Open Knowledge Foundation).

To demonstrate, your dimension called project could be described on a resource like this:

"resources": [
    {
        "name": "project-data",
        "title": "Project",
        "description": "Project under which funds were released",
        "schema": {
            "fields": [
                {
                    "name": "project_name",
                    "title": "Project name",
                    "type": "string",
                    "format": "default"
                },
                ... and so on
            ]
        }
    }
]

So, all the core attribute information, and also the metadata-type stuff on dimensions like name and title (or label, from the YAML example), can easily be represented in Data Package - this is a prime use case for Data Package. It ties this data logically to the resource(s) it describes.

But what are we missing?

I think, ignoring smaller differences (like having currency as part of a unit object to describe value - which is really great), the main thing missing in OSEP-04, which you are addressing here, has two forms:

  1. The explicit mapping of Resources to dimensions (something that says “this resource is definitely all the Entity data”)
  2. A way to align internal taxonomies with external ones, like here, and ensuring that this alignment is rich (not one single field to another single field)

Let’s address the second point first:

OSEP-04 does not yet deal with such alignment, although it is clearly a goal. Budget Data Package does expect a COFOG mapping, but it exhibits the problems you have described (no explicit way to attach that extra info to the mapping).

So, as far as I see, if we are going to consider adding this to OSEP-04 in the near future, we should discuss a way to do so that would be Data Package friendly.

For the first point:

The openspending.mapping object provides mapping of fields (and, a limited set of them), and, thinking out loud, could probably be used to provide a way to map dimensions <> resources using a similar pattern? I also like the fact that all that attribute info is with the resources (normal Data Package stuff), and this is just responsible for the mapping aspects of dimensions.

So, the points I’m trying to make here (and I hope I’ve understood your proposal well enough):

  • Good work in making dimensions explicit
  • I like the treatment of aligning taxonomies
  • Current spec does already store most of the raw meta data you are proposing
  • Current spec could absorb most of these ideas, likely by expanding the openspending.mapping object, or rethinking it

It’s a hack. I’m not sure how I can convince you not to make BDP that hacky…

What you are proposing is a very logical extension of the column naming layout approach, in which the physical layout of the source data files are used as a carrier for metadata.

To be adopted, that would require all source data to be perfectly normalised by the data producer, i.e. people would give you a flat-file star schema: one fact table file, plus one file for each dimension in the source data. For the German budget, that makes 12 different files. I’m not even clear on how the linkages between these files are performed - I assume via magically named ID columns?

You can claim that this is somehow simple (because it’s all flat files), but that’s just because you’ve hidden all the complexity in implicit rules about file normalisation. People will trip over this if you enforce it, or you will get a weird semi-standard if you don’t enforce it (and allow dimensional data on the fact table).

I would strongly vote for introducing some level of separation between the physical layout of the source data files, and the logical model of the data itself. It’ll be explicit, and much easier to produce (instead of having to write scripts to separate out all of the dimensions).

Finally, even if you keep the resources hack: please think about naming. Everybody in the world calls this dimensions: Google DSPL, SDMX (IMF, EU, WB), the RDF DataCube ontology, the entire commercial data warehousing community.

What value are you adding by breaking with that well-established naming and introducing naming friction between your system and the entire rest of the world?

:).

Let’s say I didn’t override openspending.mapping above, but instead called it dimensions. This is an approximation of what I was suggesting (the rest of the data already sits on the resource or resource.schema object, so I see absolutely no compelling argument for why it should be repeated in a dimensions declaration). Again, I’m just thinking out loud here:

{
    ... an OSDP
    "dimensions": {
        "project": {
            "resource": "the-resource-name",
            "attributes": {
                "name": "the_column_name",
                "id": "the_column_id"
            }
        }
    }
}

An alternative that does not tie the model to the file structure (note: we already use this pattern on openspending.mapping):

{
    ... an OSDP
    "dimensions": {
        "project": {
            "attributes": {
                "name": "the-resource-name/the_column_name",
                "id": "the-other-resource-name/the_column_id"
            }
        }
    }
}

Cool, I like the second notation. We still need some notion of how the JOIN is to be performed, I guess – or would that be part of the mapping?

One thing worth thinking about is dates. The easiest way to model them, I think, is to actually have different parts of the date split into different attributes - which requires some sort of notation to do that.

I’d love to see a demo file that shows all the relevant components and how they tie together - my intuition is that spreading metadata across the resource and mapping sections will be a bit confusing, but I don’t have a good understanding of the overall picture (and apps using this, like cubepress and spendb, could probably transform it into a useful structure).

@rgrp
@trickvi

what do you think about the treatment of dimensions? There are some other good points above, but I think the main thing to consider first is this formalising of the mapping as dimensions.

As you know I’m +1 on the whole approach here - it is, in essence, what the mapping stuff is on the OpenSpending Data Package (we could merge these two convos … - or deprecate one)

I think the detail that @pudo has here is good.

Also, if it was not clear already: we (OpenSpending) are not using the BDP model of requiring the naming of the source files to conform to the proposed ideal model (and likely BDP will be updated to be more like OSDP in the nearish future - I hope!)

I think we can distinguish three things to resolve here:

  • What is the approximate structure of the ideal “model” we map to e.g. key measures plus “objects/dimensions” e.g. projects, entities, …
    • I think what we have as a base is now excellent (e.g. entities, projects, classifications)
  • What is the minimum we require of people and what is the minimum we suggest
  • I think this is what we have or close to:
    • required: i.e. id, amount, date (?)
    • recommended: “to” (recipient / supplier)
  • remember this should be extensible and we can say: here’s all this other stuff you can do too
  • How we actually implement this in the datapackage.json. I would say the approach we have of object types / measures is OK, though I would like some super simple option where people do not need to grok the whole conceptual model. I would also advocate the resource-name/field-name model etc.

I think we are very close now to having something good enough to run ahead on …

@pudo

JOIN: could you explain further please.

Dates: I agree on the modelling point, as part of a fact table or some such, within a service that would provide an API. But that seems to me like an implementation detail that belongs there, and not in OSDP. As far as I can see, all we need to know is the date field, and the schema that describes that field tells us whatever we need to know to implement it that way, via its type + format properties (eg: https://github.com/okfn/goodtables/blob/master/examples/hmt/spend-publishing-schema.json#L18).
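For example, a date field declared in JSON Table Schema carries its type and format alongside the resource's other fields; the field name here is illustrative:

```json
{
    "name": "transaction_date",
    "title": "Transaction date",
    "type": "date",
    "format": "default"
}
```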

Demo: I’ll do that this week, but first let’s discuss a bit more here.

It is still not clear to me what is the proposal for taxonomies/hierarchical information in a data package.

If I have three fields like “Function > SubFunction > SubSubFunction”, how can I put this relationship information in the data package?

This kind of information would be useful for data aggregation. What do you think about using an extra attribute “level” for a field type called “taxonomy”:

    "fields": [
        {
            "name": "function",
            "title": "Function",
            "type": "taxonomy",
            "level": 1
        },
        {
            "name": "subfunction",
            "title": "SubFunction",
            "type": "taxonomy",
            "level": 2
        },
        ...
    ]

@pwalsh: regarding JOINs, I just meant that if you do the resources/dimensions thing, with each dimension in its own CSV source file, you need to define some sort of foreign key relationship with the main data table of the dataset - and I’m curious whether that is by convention (i.e. column name) or by schema (somewhere in the JSON file).

regarding dates: I understand that JTS supports dates, my point was just that in terms of defining a model, I’ve found it cleaner to deal with each date field as three different virtual fields which are then integrated into the model as normal fields. cf. spendb – this is not a theoretical concern, it just works better this way in practice.
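To make the virtual-field idea concrete, here is a minimal Python sketch, assuming ISO-formatted source dates; the attribute names (date_year etc.) are made up for illustration:

```python
from datetime import date

def explode_date(value):
    # Split one ISO date string into three virtual attributes, as
    # described above; the attribute names here are illustrative.
    d = date.fromisoformat(value)
    return {"date_year": d.year, "date_month": d.month, "date_day": d.day}

row = {"id": 1, "amount": 100.0, "date": "2015-01-01"}
# Merge the virtual fields into the row; the physical CSV still
# carries only the single original date column.
row.update(explode_date(row["date"]))
```

The logical model would then treat date_year, date_month and date_day as ordinary fields, while the physical file keeps a single date column.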

@aivuk: cool to have you in this debate :slight_smile: With the function levels, would they have only one column each, or multiple columns associated with each level?

I was thinking in the multiple column case, like in a csv with the columns:

"id","budget","date","description","function","subfunction"
1,100000000,2015-01-01,Primary schools construction,Education,Primary Education

@pudo ok, gotcha on the JOIN. That is covered by FK support in JSON Table Schema. When I do the next PR on OSEP-04 (should be today), I’ll add an example that shows this.

@aivuk it would be good to get an answer to what @pudo asked, and also, if functional classification is flat like this on each budget line, we would need some level mapping as you describe, but probably in the dimensions/mapping object, and not in the schema object of the resource itself.

Another alternative may be to have a resource that describes the taxonomy, and then the budget lines have references (FK) in JSON Table Schema to that resource.

That is the type of thing I am experimenting with here, which is a classification tree, and each budget line would have a FK to this “table”.
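For reference, a foreign key declaration in JSON Table Schema looks roughly like the following; the resource and field names here are invented:

```json
"foreignKeys": [
    {
        "fields": "function_id",
        "reference": {
            "resource": "functions",
            "fields": "id"
        }
    }
]
```

This would sit on the schema of the budget-lines resource, pointing each line's function_id at the id field of the taxonomy resource.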

@pwalsh I understand how the FK table is structured, but the question of whether all hierarchy levels are stored in the same file or in separate files is somewhat tangential.

The more I look into this JTS stuff, the less convinced I am: all of this is really, really going towards designing a logical model of the data but for some ideological reason the BDP/JTS/OSDP thingie wants to keep that arbitrarily tied in with the naming and structure of the underlying tables (i.e. not making a distinction between files and dimensions/facts, not using any established lingo, mixing up the notions of a column definition and a logical field definition, …).

I can’t see any need for it, and it confuses the hell out of me. Please, please reconsider. I can see that you’re now lobbying OpenBudgets.eu to adopt this stuff, which means someone would be stuck with this unholy thing for at least three years. Please then also take the time to make it clean, and don’t just try to enforce JTS because there is a page for it on dataprotocols.org

EDIT: Whoever is making this slide presentation is making all my points for me. Love it :slight_smile: Now you guys just need to buy into it.

@pudo: I’ll update the OSEP-04 proposal as I said (it may not be finished today now, but if not it will be on Sunday). Then, you can pull it apart :).

The combination of a mapping (as per the last example I gave where I called it dimensions to try to make it clearer) and each resource.schema does present a logical model that is extractable from the physical model.

It does so by using pre-existing concepts in Data Package (i.e. JSON Table Schema), which I understand you are not a fan of. I could be wrong, but it seems the main reason you see this as unholy is that some info on your dimensions (the type/format stuff that JTS does) sits on the Resource, and not directly on the mapping or dimensions object, which the OLAP gods look down on unfavourably?

About confusion: if it is confusing, then I’ll try to do a better job of explaining it in the next PR on OSEP-04.

We are not “enforcing” JTS “because” there is a page for it on Data Protocols. We are using it because:

  • it solves a problem - at the very minimum: type hinting for plain text data
  • it is part of Tabular Data Package, which OSDP extends
  • we see value in Data Package as a generic format
  • we can leverage other tooling we have built, and are building, around Data Package

You may also argue against this as a circular dependency. Is there an alternative to Data Package that you consider better? Would you rather we have a completely ad hoc/custom data structure/input format, that has no relation to any existing spec/implementation work for packaging plain text data?

Here are some significant updates to OSEP-04 as a pull request.

Everything needs discussion, and I doubt this is the final version. However, it should give us additional material to talk around.

Of note:

  • Greatly fleshed out the mapping object (was openspending.mapping). This has been significantly influenced by @pudo’s work here, even if it doesn’t fully conform to it.
  • Tried to give more explanation of the distinction between this and BDP, and why this is different
  • Possibly the most controversial and subject to change, I attempted to explicitly flesh out different taxonomies: https://github.com/pwalsh/osep/blob/feature/osep-04-update/osep-04.md#taxonomy
    • I’m not sure if others will see the utility of this
    • I do see the utility, as I have piles of municipal data that employs two co-existing taxonomies at the source - functional and economic. I’m curious for feedback on this on two levels: (i) general utility, and (ii) if it is too complex to introduce to OSDP at this stage
  • Several examples that progressively show more features of the spec: https://github.com/pwalsh/osep/blob/feature/osep-04-update/osep-04.md#examples

Aspects that are not addressed:

  • flat representation of (functional) classification as per @aivuk
  • I haven’t used OLAP terminology. After thinking about it a bit more I was happy enough to stick with mapping and not directly employ OLAP semantics. I’m not 100% convinced I’m right, BTW, and I am taking @pudo’s comments on this seriously, but I would definitely welcome some other voices around this particular issue (eg: @trickvi @aivuk @adam @rufuspollock)

We have some discussion on this here: feature/osep-04-update by pwalsh · Pull Request #14 · openspending/osep · GitHub

I just made another update to OSEP-4: Updates based on latest feedback. · openspending/osep@f802411 · GitHub

Which is now published as the draft: http://labs.openspending.org/osep/osep-04.html

We’ll likely stick with this draft for a little while, and iterate on it as we get data out of the current database and into flat files.

@trickvi @pudo @aivuk @rufuspollock

Three specific questions, or, problems I want to discuss, related to data modeling.

Less concerned right now about details of implementation in OSDP, just more the general direction, so we can then see how to integrate to OSDP.

Denormed functional classification

I have a pattern for normalised functional classification here. But @aivuk raised the issue above of denormed classification, eg: multiple columns in one table that actually represent a tree. I feel less familiar with this use case, and I’m looking for suggestions on how to do a mapping for it (eg: id,date,amount,function-1,function-2,function-3 where presumably 1, 2 and 3 represent levels, and the presence of multiple levels on one line represents a parent-child relation: function-3 is a child of function-2, which is a child of function-1).
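As a way of thinking about it (not a spec proposal), here is a rough Python sketch of reading such denormed level columns back into a parent-child taxonomy; the column and function names are invented:

```python
def denormed_to_tree(rows, level_columns):
    # Collect (label, parent_label) edges from denormalised level
    # columns such as function-1 .. function-3. A missing level ends
    # the chain for that row; the root level has parent None.
    edges = set()
    for row in rows:
        parent = None
        for col in level_columns:
            label = row.get(col)
            if not label:
                break
            edges.add((label, parent))
            parent = label
    # Sort for a stable, readable listing: roots first, then by parent.
    return sorted(edges, key=lambda e: (e[1] or "", e[0]))

rows = [
    {"function-1": "Education", "function-2": "Primary Education"},
    {"function-1": "Education", "function-2": "Secondary Education"},
]
tree = denormed_to_tree(rows, ["function-1", "function-2", "function-3"])
```

Each distinct label ends up as one node, so the flat columns round-trip into the same kind of classification tree a normalised taxonomy resource would hold.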

Representing government entities

We know that government data is a major (the major) target for Open Spending. We also know we want to allow grouping across datasets, for comparison and otherwise. So, we need a structure for saying “this data belongs to place X, and is a [budget] at the [federal|regional] level”. We probably want to make use of something like Open Civic Data division identifiers, and host a database of them (in whatever form) ourselves (which is a service of greater value than just for OpenSpending).

Representing different classification schemes and types

In a previous draft of OSDP, I tried to formalise classifications, following the IMF, around functional, administrative, and economic types (read more about this breakdown). I found this useful as I’m very familiar with municipal spend data in Israel, which is very structured, and has both functional and economic classification built into the regulations around how municipalities must declare budgets.

I’d love to be able to support the Israeli muni data, and thereby, presumably, support a richer variety of classification schemes in spend data from elsewhere. Wondering about general thoughts on this.

A basic example:

code,amount,year
3451.109,10000,2014

Which, with domain knowledge, I can parse into a different format:

functional_code,economic_code,amount,year
3451,109,10000,2014

“3451.109” is the unique identifier, and says for example, “preschool staffing”. But actually, the “109” portion tells me that this is a “salary” type of expense, and I can therefore aggregate across budgets based on “economic classification”, in addition to “functional classification”, with this knowledge.
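That domain-knowledge step could be sketched like this. It is a toy example: the helper name and dict keys are invented, and real parsing rules would come from the municipal regulations:

```python
def split_code(code):
    # Split a combined code like "3451.109" into its functional part
    # (before the dot) and economic part (after the dot).
    functional, economic = code.split(".")
    return {"functional_code": functional, "economic_code": economic}

line = {"code": "3451.109", "amount": 10000, "year": 2014}
# Produce the richer row, with both classifications made explicit.
parsed = {**split_code(line["code"]),
          "amount": line["amount"],
          "year": line["year"]}
```

Once split, the economic_code column can drive aggregation by economic classification across budgets, exactly as described above.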

  • Denormed representation of functional hierarchy: this is very common and used in e.g. UK CRA (for cofog)
  • Representing gov entities: what exactly is the user story we serve here? What can we do if we do this? (I’m not against it or anything: just unclear what this is about)
  • Classification schemes: I think we may want some “typing” for taxonomies / classification schemes, but again I would want to think clearly re the user story (what can we or other users do when we have this)