Hi @pudo
Thanks for taking the time to engage with this. There are some important points/ideas here, so let’s talk through them:
schema_proposal in YAML
As described here: A proposed metadata structure for OpenSpending raw data. · GitHub
(I’m just going to go over it all, even if some is obvious)
meta_data
We have conventions for this in Data Package (name, title, etc.). These properties sit directly on the top-level object of a Data Package Descriptor, so we are basically in alignment here.
model
This is great. It ties in with some points @rufuspollock talked about with me this week about the data model being centred around 4 different classes of thing:
- Entities
- Transactions
- Projects
- Taxonomies
Let’s break it down a bit:
If we just forget about the different dimension scheme for a minute, Data Package has a robust system for declaring meta data on a dimension, and the basic properties of each attribute in a dimension.
So, whether the dimensions are split over multiple resources (i.e. different CSV files in a package) or live in a single resource (one file has the necessary data for different dimensions), we can provide that information on each Data Package Resource (Data Packages - Data Protocols - Open Knowledge Foundation) and, more specifically for types/formats, on the schema of each resource (JSON Table Schema - Data Protocols - Open Knowledge Foundation).
To demonstrate, your dimension called project could be described on a resource like this:
"resources": [
  {
    "name": "project-data",
    "title": "Project",
    "description": "Project under which funds were released",
    "schema": {
      "fields": [
        {
          "name": "project_name",
          "title": "Project name",
          "type": "string",
          "format": "default"
        },
        ... and so on
      ]
    }
  }
]
So all the core attribute information, and also the metadata on dimensions like name and title (or label, from the YAML example), can easily be represented in Data Package. This is a prime use case for Data Package: it ties this data logically to the resource(s) it describes.
But what are we missing?
I think ignoring smaller differences (like having currency as part of a unit object to describe value - which is really great), the main thing missing in OSEP-04, which you are addressing here, has two forms:
- The explicit mapping of Resources to dimensions (something that says “this resource is definitely all the Entity data”)
- A way to align internal taxonomies with external ones, like here, and ensuring that this alignment is rich (not one single field to another single field)
Let’s address the second point first:
OSEP-04 does not yet deal with such alignment, although it is clearly a goal. Budget Data Package does expect a COFOG mapping, but it exhibits the problems you have described (no explicit way to get that extra info on the mapping).
So, as far as I see, if we are going to consider adding this to OSEP-04 in the near future, we should discuss a way to do so that would be Data Package friendly.
For the first point:
The openspending.mapping object provides a mapping of fields (a limited set of them) and, thinking out loud, could probably be used to map dimensions <> resources using a similar pattern. I also like the fact that all the attribute info stays with the resources (normal Data Package stuff), and this object is only responsible for the mapping aspects of dimensions.
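To make the "thinking out loud" part a bit more concrete, here is a rough sketch in Python of what a dimension <> resource mapping could look like, plus a small check that each mapping points at a real resource and field. Everything here is an assumption for discussion: the "dimensions" key, the "resource" and "label" properties, the "entity-data" resource, the nesting of the mapping under an "openspending" key, and the resolve_dimensions helper are all hypothetical and not part of OSEP-04 or the Data Package spec.

```python
# Hypothetical descriptor: the "dimensions" key under openspending.mapping
# is an illustrative assumption, not something OSEP-04 defines today.
descriptor = {
    "name": "example-spending",
    "resources": [
        {
            "name": "project-data",
            "schema": {"fields": [{"name": "project_name", "type": "string"}]},
        },
        {
            "name": "entity-data",
            "schema": {"fields": [{"name": "entity_name", "type": "string"}]},
        },
    ],
    "openspending": {
        "mapping": {
            "dimensions": {
                # "this resource is definitely all the Project data"
                "project": {"resource": "project-data", "label": "project_name"},
                "entity": {"resource": "entity-data", "label": "entity_name"},
            }
        }
    },
}


def resolve_dimensions(descriptor):
    """Return {dimension name: resource dict} for the hypothetical mapping.

    Raises ValueError if a dimension names a resource that is not in the
    package, or a label field that is not in that resource's schema.
    """
    resources = {r["name"]: r for r in descriptor.get("resources", [])}
    mapping = descriptor.get("openspending", {}).get("mapping", {})
    resolved = {}
    for dimension, spec in mapping.get("dimensions", {}).items():
        resource = resources.get(spec["resource"])
        if resource is None:
            raise ValueError(
                f"dimension {dimension!r} points at unknown resource {spec['resource']!r}"
            )
        field_names = {f["name"] for f in resource["schema"]["fields"]}
        if spec["label"] not in field_names:
            raise ValueError(
                f"dimension {dimension!r} uses unknown field {spec['label']!r}"
            )
        resolved[dimension] = resource
    return resolved


resolved = resolve_dimensions(descriptor)
```

The attribute details (types, formats, titles) stay on each resource's schema, and the mapping only says which resource and field play which role, which is the split I'm suggesting above.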
So, the points I’m trying to make here (and I hope I’ve understood your proposal well enough):
- Good work in making dimensions explicit
- I like the treatment of aligning taxonomies
- Current spec does already store most of the raw meta data you are proposing
- Current spec could absorb most of these ideas, likely by expanding the openspending.mapping object, or rethinking it