Open Spending Data Structure: Ideas and Suggestions

As you know I’m +1 on the whole approach here - it is, in essence, what the mapping stuff is on the OpenSpending Data Package (we could merge these two convos … - or deprecate one)

I think the detail that @pudo has here is good.

Also, if it was not clear already: we (openspending) are not using the BDP model of requiring the names of the source files to conform to the proposed ideal model (and BDP will likely be updated to be more like OSDP in the nearish future, I hope!)

I think we can distinguish three things to resolve here:

  • What is the approximate structure of the ideal “model” we map to e.g. key measures plus “objects/dimensions” e.g. projects, entities, …
    • I think what we have as a base is now excellent (e.g. entities, projects, classifications)
  • What is the minimum we require of people and what is the minimum we suggest
    • I think this is what we have or close to:
      • required: i.e. id, amount, date (?)
      • recommended: “to” (recipient / supplier)
    • remember this should be extensible and we can say: here’s all this other stuff you can do too
  • How we actually implement this in the datapackage.json. I would say the approach we have of object types / measures is ok, though I would like some super simple option where people do not need to grok the whole conceptual model. I would also advocate the resource-name/field-name model etc

I think we are very close now to having something good enough to run ahead on …

@pudo

JOIN: could you explain further please.

Dates: I agree on the modeling point, as part of a fact table or some such in a service that would provide an API. But that seems to me like an implementation detail that belongs there, and not in OSDP. As far as I can see, all we need to know is the date field, and the schema that describes that field tells us whatever we need to know to implement it in such a way, via its type + format properties (eg: https://github.com/okfn/goodtables/blob/master/examples/hmt/spend-publishing-schema.json#L18).

Demo: I’ll do that this week, but first let’s discuss a bit more here.

It is still not clear to me what is the proposal for taxonomies/hierarchical information in a data package.

If I have three fields like “Function > SubFunction > SubSubFunction”, how can I put this relationship information in the data package?

This kind of information would be useful for data aggregation. What do you think about using an extra attribute “level” for a field type called “taxonomy”:

          "fields": [
              {
                  "name": "function",
                  "title": "Function",
                  "type": "taxonomy",
                  "level": 1
              },
              {
                  "name": "subfunction",
                  "title": "SubFunction",
                  "type": "taxonomy",
                  "level": 2
              },
              ...
          ]
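
To make the aggregation use case concrete, here is a minimal Python sketch of how a consumer might use such a "level" attribute to roll amounts up to a chosen depth. The "taxonomy" type and "level" attribute are only the proposal above, not part of any published spec, and the `amount` field, the sample rows and the helper names `taxonomy_levels` and `aggregate` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical schema fragment using the proposed "taxonomy" type and "level".
fields = [
    {"name": "function", "title": "Function", "type": "taxonomy", "level": 1},
    {"name": "subfunction", "title": "SubFunction", "type": "taxonomy", "level": 2},
    {"name": "amount", "title": "Amount", "type": "number"},
]

# Made-up rows, already parsed from a flat file.
rows = [
    {"function": "Education", "subfunction": "Primary Education", "amount": 100.0},
    {"function": "Education", "subfunction": "Secondary Education", "amount": 50.0},
    {"function": "Health", "subfunction": "Hospitals", "amount": 70.0},
]


def taxonomy_levels(fields):
    """Return the taxonomy field names ordered from the top level down."""
    taxonomy = [f for f in fields if f.get("type") == "taxonomy"]
    return [f["name"] for f in sorted(taxonomy, key=lambda f: f["level"])]


def aggregate(rows, fields, depth):
    """Sum amounts grouped by the first `depth` taxonomy levels."""
    keys = taxonomy_levels(fields)[:depth]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[key] for key in keys)] += row["amount"]
    return dict(totals)


by_function = aggregate(rows, fields, depth=1)
by_subfunction = aggregate(rows, fields, depth=2)
```

With only one hierarchy per table this is enough; a table carrying two hierarchies (say functional and economic) would presumably need something more than a bare level number to tell them apart.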

@pwalsh: regarding JOINs, I just meant that if you do the resources/dimensions thing with its own CSV source file, you need to define some sort of foreign key relationship with the main data table of the dataset - and I’m curious whether that is by convention (i.e. column name) or by schema (somewhere in the JSON file).

regarding dates: I understand that JTS supports dates, my point was just that in terms of defining a model, I’ve found it cleaner to deal with each date field as three different virtual fields which are then integrated into the model as normal fields. cf. spendb – this is not a theoretical concern, it just works better this way in practice.
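
A rough Python sketch of the "three virtual fields" idea, for illustration only. The `date.year` / `date.month` / `date.day` naming and the `split_date` helper are assumptions made here, not how spendb actually names or builds these fields.

```python
from datetime import date


def split_date(field_name: str, value: str) -> dict:
    """Expand an ISO 8601 date string into year/month/day virtual fields."""
    parsed = date.fromisoformat(value)
    return {
        f"{field_name}.year": parsed.year,
        f"{field_name}.month": parsed.month,
        f"{field_name}.day": parsed.day,
    }


# Made-up fact row with a single physical date column.
row = {"id": "1", "amount": 100000000, "date": "2015-01-01"}
virtual = split_date("date", row["date"])
```

Each virtual field can then be grouped or filtered on like any other field in the model, without the model needing to know about date formats.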

@aivuk: cool to have you in this debate :slight_smile: With the function levels, would they have only one column each, or multiple columns associated with each level?

I was thinking in the multiple column case, like in a csv with the columns:

"id", "budget", "date", "description", "function", "subfunction"
 1, 100000000, 2015-01-01, Primary schools construction, Education, Primary Education

@pudo ok, gotcha on the JOIN. That is covered by FK support in JSON Table Schema. When I do the next PR on OSEP-04 (should be today), I’ll add an example that shows this.

@aivuk it would be good to get an answer to what @pudo asked, and also, if functional classification is flat like this on each budget line, we would need some level mapping as you describe, but probably in the dimensions/mapping object, and not in the schema object of the resource itself.

Another alternative may be to have a resource that describes the taxonomy, and then the budget lines have references (FK) in JSON Table Schema to that resource.

That is the type of thing I am experimenting with here, which is a classification tree, and each budget line would have a FK to this “table”.
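
As an illustration of the shape of this (not the linked experiment itself), a minimal Python sketch: one "table" holds the classification tree, and each budget line carries a key into it. The column names `parent` and `function_id`, the sample data and the `path_to_root` helper are all assumptions made up for the example.

```python
# Hypothetical classification resource: each node points at its parent.
budget_tree = [
    {"id": "1", "title": "Education", "parent": None},
    {"id": "1.1", "title": "Primary Education", "parent": "1"},
    {"id": "1.2", "title": "Secondary Education", "parent": "1"},
]
tree_by_id = {node["id"]: node for node in budget_tree}

# Hypothetical budget lines; function_id plays the role of the foreign key.
budget_lines = [
    {"id": "a", "amount": 100, "function_id": "1.1"},
    {"id": "b", "amount": 50, "function_id": "1.2"},
]


def path_to_root(node_id, tree_by_id):
    """Return the titles from the top of the tree down to the given node."""
    path = []
    while node_id is not None:
        node = tree_by_id[node_id]
        path.append(node["title"])
        node_id = node["parent"]
    return list(reversed(path))


# Resolve every budget line to its full position in the hierarchy.
paths = {line["id"]: path_to_root(line["function_id"], tree_by_id) for line in budget_lines}
```

A lookup that raises `KeyError` here corresponds to a broken foreign key, which is the kind of thing a validator could check before loading.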

@pwalsh I understand how the FK table is structured, but the question of whether all hierarchy levels are stored in the same file or in separate files is somewhat tangential.

The more I look into this JTS stuff, the less convinced I am: all of this is really, really going towards designing a logical model of the data but for some ideological reason the BDP/JTS/OSDP thingie wants to keep that arbitrarily tied in with the naming and structure of the underlying tables (i.e. not making a distinction between files and dimensions/facts, not using any established lingo, mixing up the notions of a column definition and a logical field definition, …).

I can’t see any need for it, and it confuses the hell out of me. Please, please reconsider. I can see that you’re now lobbying OpenBudgets.eu to adopt this stuff, which means someone would be stuck with this unholy thing for at least three years. Please then also take the time to make it clean, and don’t just try to enforce JTS because there is a page for it on dataprotocols.org

EDIT: Whoever is making this slide presentation is making all my points for me. Love it :slight_smile: Now you guys just need to buy into it.

@pudo: I’ll update the OSEP-04 proposal as I said (it may not be finished today now, but if not it will be on Sunday). Then, you can pull it apart :).

The combination of a mapping (as per the last example I gave where I called it dimensions to try to make it clearer) and each resource.schema does present a logical model that is extractable from the physical model.

It does so by using pre-existing concepts in Data Package (i.e.: JSON Table Schema) - which I understand you are not a fan of: I could be wrong but it seems the main reason you see this as unholy is because some info on your dimensions (the type/format stuff that JTS does) is in fact present on the Resource, and not directly on the mapping or dimensions object, which the OLAP gods look down on unfavourably?

About confusion: if it is confusing, then I’ll try to do a better job of explaining it in the next PR on OSEP-04.

We are not “enforcing” JTS “because” there is a page for it on Data Protocols. We are using it because:

  • it solves a problem - at the very minimum: type hinting for plain text data
  • it is part of Tabular Data Package, which OSDP extends
  • we see value in Data Package as a generic format
  • we can leverage other tooling we have built, and are building, around Data Package

You may also argue against this as a circular dependency. Is there an alternative to Data Package that you consider better? Would you rather we have a completely ad hoc/custom data structure/input format, that has no relation to any existing spec/implementation work for packaging plain text data?

Here are some significant updates to OSEP-04 as a pull request.

Everything needs discussion, and I doubt this is the final version. However, it should give us additional material to talk around.

Of note:

  • Greatly fleshed out the mapping object (was openspending.mapping). This has been significantly influenced by @pudo’s work here, even if it doesn’t fully conform to it.
  • Tried to give more explanation of the distinction between this and BDP, and why this is different
  • Possibly the most controversial and subject to change, I attempted to explicitly flesh out different taxonomies: https://github.com/pwalsh/osep/blob/feature/osep-04-update/osep-04.md#taxonomy
    • I’m not sure if others will see the utility of this
    • I do see the utility, as I have piles of municipal data that employs two co-existing taxonomies at the source - functional and economic. I’m curious for feedback on this on two levels: (i) general utility, and (ii) if it is too complex to introduce to OSDP at this stage
  • Several examples that progressively show more features of the spec: https://github.com/pwalsh/osep/blob/feature/osep-04-update/osep-04.md#examples

Aspects that are not addressed:

  • flat representation of (functional) classification as per @aivuk
  • I haven’t used OLAP terminology. After thinking about it a bit more I was happy enough to stick with mapping and not directly employ OLAP semantics. I’m not 100% convinced I’m right, BTW, and I am taking @pudo’s comments on this seriously, but I would definitely welcome some other voices around this particular issue (eg: @trickvi @aivuk @adam @rufuspollock)

We have some discussion on this here: feature/osep-04-update by pwalsh · Pull Request #14 · openspending/osep · GitHub

I just made another update to OSEP-4: Updates based on latest feedback. · openspending/osep@f802411 · GitHub

Which is now published as the draft: http://labs.openspending.org/osep/osep-04.html

We’ll likely stick with this draft for a little while, and iterate on it as we get data out of the current database and into flat files.

@trickvi @pudo @aivuk @rufuspollock

Three specific questions, or, problems I want to discuss, related to data modeling.

Less concerned right now about details of implementation in OSDP, just more the general direction, so we can then see how to integrate to OSDP.

Denormed functional classification

I have a pattern for normalised functional classification here. But @aivuk raised the issue above of denormed classification, eg: multiple columns in one table that actually represent a tree. I feel less familiar with this use case, and I’m looking for suggestions on how to do a mapping for this. (eg: id,date,amount,function-1,function-2,function-3 where presumably 1, 2 and 3 represent levels, and the presence of multiple levels on one line represents a parent-child relation: function-3 is a child of function-2, which is a child of function-1)
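
To make the presumed parent-child reading concrete, a small Python sketch that recovers the tree edges from such a table. The sample rows are invented, and the assumption that adjacent level columns are always parent and child is exactly the presumption above, not something the data guarantees.

```python
import csv
import io

# Made-up sample; the column names follow the function-1/2/3 pattern.
SAMPLE = """id,date,amount,function-1,function-2,function-3
1,2015-01-01,100,Education,Primary Education,School Construction
2,2015-01-01,50,Education,Primary Education,Teacher Salaries
3,2015-01-01,70,Health,Hospitals,Equipment
"""

LEVEL_COLUMNS = ["function-1", "function-2", "function-3"]


def parent_child_edges(rows, level_columns):
    """Collect the (parent, child) pairs implied by adjacent level columns."""
    edges = set()
    for row in rows:
        for parent_col, child_col in zip(level_columns, level_columns[1:]):
            parent, child = row[parent_col], row[child_col]
            if parent and child:  # skip lines that stop at a higher level
                edges.add((parent, child))
    return edges


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
edges = parent_child_edges(rows, LEVEL_COLUMNS)
```

Note that this keys nodes by label alone, so the same label under two different parents would be merged; real data would probably need per-level codes or the full path as the key.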

Representing government entities

We know that a major (the major) type of data targeted by OpenSpending is government data. We also know we want to allow grouping across datasets, for comparison and otherwise. So, we need a structure for saying “this data belongs to place X, is a [budget] at the [federal|regional] level”. We probably want to make use of something like open civic data division identifiers, and host a database of that (in whatever form) ourselves (which is a service of greater value than just for OpenSpending).

Representing different classification schemes and types

In a previous draft of OSDP, I tried to formalise classifications, following the IMF, around functional, administrative, and economic types (read more about this breakdown). I found this useful as I’m very familiar with municipal spend data in Israel, which is very structured, and has both functional and economic classification built in to the regulations around how municipalities must declare budgets.

I’d love to be able to support the Israeli muni data, and thereby presumably, support richer, or a variety, of classification schemes in spend data from elsewhere. Wondering about general thoughts on this.

A basic example:

code,amount,year
3451.109,10000,2014

Which with domain knowledge, I can parse into a different format of:

functional_code,economic_code,amount,year
3451,109,10000,2014

“3451.109” is the unique identifier, and says for example, “preschool staffing”. But actually, the “109” portion tells me that this is a “salary” type of expense, and I can therefore aggregate across budgets based on “economic classification”, in addition to “functional classification”, with this knowledge.
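
A short Python sketch of that parsing step and the aggregation it enables. It assumes every code has the "<functional>.<economic>" layout of the example; the second and third rows and the `split_code` helper are made up for illustration.

```python
from collections import defaultdict


def split_code(code: str) -> dict:
    """Split a composite budget code into functional and economic parts.

    Assumes the "<functional>.<economic>" layout from the example above.
    """
    functional, sep, economic = code.partition(".")
    if not (sep and functional and economic):
        raise ValueError(f"unexpected code format: {code!r}")
    return {"functional_code": functional, "economic_code": economic}


# The first row is the example above; the other two are invented.
rows = [
    {"code": "3451.109", "amount": 10000, "year": 2014},
    {"code": "3452.109", "amount": 5000, "year": 2014},
    {"code": "3451.220", "amount": 2000, "year": 2014},
]

# Aggregate across functional codes by economic classification.
by_economic = defaultdict(int)
for row in rows:
    parts = split_code(row["code"])
    by_economic[parts["economic_code"]] += row["amount"]
```

The domain knowledge lives entirely in `split_code`, which is the part a data package would need some way to declare.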

  • Denormed representation of functional hierarchy: this is very common and used in e.g. UK CRA (for cofog)
  • Representing gov entities: what exactly is the user story we serve here? What can we do if we do this? (I’m not against or anything: just unclear what this is about)
  • Classification schemes: I think we may want some “typing” for taxonomies / classification schemes but again would want to think clearly re user story (what can we or other users do when you have this)

Denormed functional classification

Yes ok, I get that this is common, and I checked out the UK CRA data. The current draft of OSEP-04 can handle one level or, when the data is normalised, a node in a tree (see here). I guess to support this type of denormed hierarchy we will need to introduce both a dedicated new UX pattern for this in the data loader and a new pattern on the attribute group in OSDP. For example, what is currently this:

"function": {
  "id": "budget_tree/id",
  "title": "budget_tree/title",
  "description": "budget_tree/summary",
  "cofog": "budget_tree/cofog_code"
}

In the case of denormed hierarchies (of any type, not just functional classification), might become:

"function": [
  {
    "id": "budget_tree/id-1",
    "title": "budget_tree/title-1",
    "description": "budget_tree/summary-1",
    "cofog": "budget_tree/cofog_code-1"
  },
  {
    "id": "budget_tree/id-2",
    "title": "budget_tree/title-2",
    "description": "budget_tree/summary-2",
    "cofog": "budget_tree/cofog_code-2"
  },
  {
    "id": "budget_tree/id-3",
    "title": "budget_tree/title-3",
    "description": "budget_tree/summary-3",
    "cofog": "budget_tree/cofog_code-3"
  }
]
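
For discussion, a rough Python sketch of how a loader might read such a list-valued attribute group, treating list position as the hierarchy level and each value as a resource-name/field-name reference. The in-memory `resources` dict, the sample values and the `resolve` / `resolve_levels` helpers are assumptions for illustration, not part of OSEP-04.

```python
# Two levels of the list-valued mapping, trimmed to id and title for brevity.
function_mapping = [
    {"id": "budget_tree/id-1", "title": "budget_tree/title-1"},
    {"id": "budget_tree/id-2", "title": "budget_tree/title-2"},
]

# Hypothetical in-memory stand-in for the package's resources (made-up values).
resources = {
    "budget_tree": [
        {
            "id-1": "1",
            "title-1": "Education",
            "id-2": "1.1",
            "title-2": "Primary Education",
        },
    ],
}


def resolve(ref, resources, row_index):
    """Look up a "resource-name/field-name" reference for one row."""
    resource_name, _, field_name = ref.partition("/")
    return resources[resource_name][row_index][field_name]


def resolve_levels(mapping, resources, row_index):
    """Return one dict per hierarchy level, top level first."""
    return [
        {attr: resolve(ref, resources, row_index) for attr, ref in level.items()}
        for level in mapping
    ]


levels = resolve_levels(function_mapping, resources, row_index=0)
```

Relying on list order for the level keeps the mapping small, at the cost of making the order significant.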

Representing government entities

Personas

Data producer: One who has data to load into OpenSpending
Data consumer: One who uses OpenSpending to interact with and discover spend data

User stories

  • As a data producer, I want a structured way to declare if my data is that of a government, including the level of government, and the type of data (eg: a budget), so that my data can be used by data consumers who are specifically interested in government spending, and these users can associate my data with other data packages that have the same properties
  • As a data consumer, I want a way to find data on OpenSpending by filtering/searching by region, place (country), administration type (national, regional), so that I can access data from a particular region/government

Classification schemes and types

  • As a data producer, I want to have a structured way to declare economic classification of my data, in addition to functional classification, so that I can tell unique stories with the data that are not possible via functional classification
  • As a data consumer, I want to look at a budget via economic classification if it exists for that budget, so I can get a cross section overview of things like salary expenditure, or interest rates on existing loans, as a percentage of the total budget.

References:

OSDP has a “direction” attribute, which is the same as “type” in BDP.

Previously, cases have been made for “type” consistency at both resource level (BDP) and package level (OSDP).

Real data often does not match this expectation.

eg:

Also see comments I made on BDP here, and comments from @pudo here.

I’m late to this party, but:

  • As a data consumer, I want to progressively dive into a dataset by aggregating it initially at levels I’m familiar with (federal departments) before discovering the internal structure of those departments.
  • As a data storyteller, I want to use hierarchies in the visualisations that I build.

Example:
https://public.tableau.com/profile/steve.bennett#!/vizhome/AusBudget2015-16/AustralianFederalBudget2015-16


@stevage would you mind adding this as an issue on the Fiscal Data Package issue tracker? I’ll also add user stories there on the thread. I’ve also got some ideas on how we might add it, so it would be good to present it all there in one place.

Done: Hierarchies for organisational structure, functional classifications, ... ? · Issue #55 · openspending/fiscal-data-package · GitHub
