A thought about fiscal data standardisation


Looking through Rufus' slides from GIFT, I wanted to share one thought I had while looking at fiscal data recently that may be relevant to its standardisation. It basically comes down to this:

Any piece of information that is used as metadata to describe a fiscal dataset in one context will be part of the line-item data in another fiscal dataset. Hence, when standardising fiscal data, we should not distinguish between data and metadata.

This probably deserves some explanation. A dataset labelled “Budget of Country X” carries the implicit metadata that it relates to country X. But in another dataset, perhaps for an INGO, “country” will be a column in the data and vary on a per-line basis. The same is true of datasets with names like “Health care spend in X” or “Revenues of the Government of X”: those distinctions, too, appear as in-data columns in other datasets.

This makes me think that trying to distinguish between data and metadata in these cases is not useful. It would be much more consistent to map fiscal datasets onto a common, line-based model that allows for static field values, effectively stating: “All the line items in dataset X are about country Y”. Basically, all metadata fields in the data package would be modelled as dimensions with constant values.
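To make the idea concrete, here is a minimal sketch of what “metadata as constant-value dimensions” could look like. The field names (“country”, “amount”) and the `normalise` helper are illustrative assumptions, not an actual OpenSpending schema:

```python
# Dataset A: "Budget of Country X" -- country is implicit metadata.
dataset_a_metadata = {"country": "X"}
dataset_a_rows = [
    {"amount": 100},
    {"amount": 250},
]

# Dataset B: an INGO dataset -- country varies per line item.
dataset_b_rows = [
    {"country": "X", "amount": 40},
    {"country": "Y", "amount": 60},
]

def normalise(rows, constant_dimensions):
    """Fold constant-valued metadata fields into each line item,
    so every dataset shares the same line-based shape."""
    return [{**constant_dimensions, **row} for row in rows]

# After normalisation, both datasets have a per-line "country" dimension:
print(normalise(dataset_a_rows, dataset_a_metadata))
print(normalise(dataset_b_rows, {}))
```

In this model, dataset A's “country” is simply a dimension that happens to take a single value, rather than a separate kind of information.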

What do people think?


I’m wondering about what this actually means in practice.

Yes, metadata in one case (e.g. a country value) may become a data value in another. However, often we do want to distinguish in a given dataset between data and metadata. This is often driven by practical considerations: e.g. you provide a different editor for metadata than for data, or data is prepared by one group and metadata (at least to some extent) by another.

I’m especially wary of starting to “inline” metadata into the data. Not because there is some huge distinction, but because permissions and control are often different. In particular, whilst data may come from an authoritative source, metadata is often, at least in part, created by others to enable the data to be processed or used better in some way (e.g. automatically aggregated!).


I guess in practice it would mean extracting the value range of the things you want to consider as metadata from the actual data. OpenSpending already does this, I’m just proposing to formalise the process.

In current OS, we have datasets like EU FTS or WB Privatizations which contain data relating to about fifty countries. But nobody can be bothered to transcribe that list into metadata, so the associated country list ends up being wildly inaccurate.

As for metadata being made up: there are two parts to this, I think. One is that we’re asking people for information that doesn’t really exist (such as the “Title” of a dataset) - of course the result is going to be weird. So OS should stop doing that.

The second part is that this of course gets into interpretation (“Who is the body deciding this budget?”). But it does so in the same way that dataset alignment/mapping would (“Is this dimension a functional classification?”), so I think it would be conceptually appropriate to express it in the same layer of abstraction that will eventually carry that information.