Bulk data: how to evaluate if a dataset is complete


#1

Hi all,

For GODI 2016 we are thinking about splitting our question “Is data available in bulk?” into two sub-questions that together fulfil the bulk criterion: 1) Is the dataset complete? 2) Can the dataset easily be downloaded?

The reason is that we would like to foreground the issue that governments might publish patchy or incomplete data that is nonetheless available as a comprehensive download, under an open license, in machine-readable form, etc. In this way we wish to better evaluate whether governments publish datasets in accordance with our key dataset definitions.

We thought of two different ways to measure “data completeness” and I would like to know your opinion about them:

  • We could take our dataset definitions (for instance, public procurement data have to contain tender name, tender status, etc.) and ask submitters to check whether a dataset contains these elements. While this procedure has some caveats (for instance, some of these categories might be named differently in different countries and submitters might not recognize them), it would be fairly easy to define and would be aligned with our dataset requirements.

  • We could ask submitters to give feedback about the data within each category (such as all names contained in tender names, all dates contained in tender status, etc.) and assess whether these data seem to contain everything that belongs to the category (I use “seem to” because we are missing a hard criterion for completeness). While this might be feasible to measure for some datasets (e.g. checking whether the budgets of all national government departments are listed), in other cases it seems almost impossible for a layperson to evaluate, or might be biased by subjective assessments.
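
As a rough illustration of the first option, the attribute check could be sketched in Python as below. This is only a sketch under assumptions: the function name, the alias table, and the sample data are hypothetical and not part of the GODI dataset definitions; only “tender name” and “tender status” come from the example above.

```python
import csv
import io

# Required attributes for the check. Only "tender name" and "tender status"
# are taken from the discussion; a real definition would list more.
REQUIRED = {"tender name", "tender status"}

# Hypothetical alias table for columns that are named differently
# in different countries.
ALIASES = {"title": "tender name", "status": "tender status"}


def missing_attributes(csv_text, required=REQUIRED, aliases=ALIASES):
    """Return the required attributes not found in the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    normalized = set()
    for column in header:
        # Normalize case and underscores before looking up aliases.
        name = column.strip().lower().replace("_", " ")
        normalized.add(aliases.get(name, name))
    return sorted(set(required) - normalized)


sample = "Title,tender_status,amount\nRoad repair,open,1000\n"
print(missing_attributes(sample))
print(missing_attributes("Title,amount\nRoad repair,1000\n"))
```

The check only looks at the header row, so it says whether the required attributes are present, not whether every tender is listed.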

I would love to hear your opinion about this, and maybe you also have other ideas for how we could measure the completeness of a dataset.

All the best
Danny


#2

And I am flagging this topic for our French group - @RouxRC @pzwsk @samgta, who had issues with this last year…


#3

I would be cautious here.

The question of whether governments are publishing the “right” dataset should be kept separate from whether the dataset itself is open.

The bulk criterion is a simple, clear criterion related to openness. It's about being able to get the specific dataset in bulk – not a question of whether it is “complete” (which is not super well-defined).

In general, questions of what data should be published have been dealt with in the dataset definitions (your option 1).

I would also warn against changes that cause issues in comparability across years (e.g. against last year) unless there is a really good reason to do it.


#4

Good point, Rufus. You are right that the bulk criterion is a matter of technical accessibility to a defined set of data (which has to be downloadable as a whole, not split into various files, etc.)

However, the bulk question implies that we anticipate a particular data structure with attributes (columns in a spreadsheet such as tender name, tender status) and tuples (what we want to describe with these attributes, such as the tenders themselves), and that all attributes and tuples should be contained in a comprehensive file (or split into a reasonable number of files for time series or very large datasets). “Bulk access” then describes that the data are put into one data file that can be downloaded at once - but the precondition is that the file contains all attributes (and ideally all tuples).

This precondition is not made explicit in our questions - we are looking for datasets with specific attributes, but we do not explicitly double-check whether these attributes are provided. Submitters may associate this question with “Does the data exist?” or “Is the data available online?”. To avoid confusion we could insert the question: “Does the data contain all required information of our dataset descriptions?” Even if we do not score this question, it might be a condition for bulk access (if not all data are provided, the bulk question cannot apply).

Re: “data completeness”, I’d like to clarify some points. After internal debates, we concluded that “completeness” cannot be globally measured with our means - we can measure whether a dataset contains all required data attributes, but we cannot realistically assess whether all tuples are in a set: in fact this is the biggest challenge for transparency and accountability, and we quickly enter very complicated terrain. Take, for example, public budgeting and how nations treat public-private partnerships fiscally - nations include PPPs differently in their budgets. This leads to differences in when datasets can be considered “complete”.
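
To illustrate why tuples are harder than attributes: a tuple check is only possible where an independent reference list exists, as in the case of national government departments mentioned in the first post. A minimal sketch in Python, where the function name and the department names are hypothetical:

```python
def missing_tuples(published_keys, reference_keys):
    """Return reference entries that do not appear in the published data."""
    # Compare case-insensitively and ignore surrounding whitespace.
    published = {key.strip().lower() for key in published_keys}
    return sorted(
        key for key in reference_keys if key.strip().lower() not in published
    )


# Hypothetical reference list of departments and a published budget file
# that lists only some of them.
reference = ["Health", "Education", "Defence"]
published = ["health", "Defence "]
print(missing_tuples(published, reference))
```

Where no such reference list exists (for instance, all tenders ever issued), there is nothing to compare against, and an empty result would not show that the data is complete.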

Given these methodological difficulties, we should also consider another caveat: it would be misleading terminology if governments and affiliated organisations could say that their data is “complete”. It could send wrong political signals (willingly or unwillingly) if actually incomplete data is labelled as complete, and it is not advisable for us to encourage this.


#5

Not sure I really understand that. Why does bulk presume that the data has a particular structure? It could take pretty much any form – “bulk” access would apply to raw TIFF as well as spreadsheets.

Perhaps you could elaborate a bit more on what you are getting at here in terms of what bulk presupposes.

Well it sort-of is: we make it really clear in the dataset definitions. Now there may be a UX issue that submitters do not adequately check the data, and adding a specific question might help with that – or just a big sign on the web page saying please check the data really contains what we say it should. However, the question there is whether people just don’t think or they don’t do it because it's hard …

Sure, though sometimes people may clearly know the data is incomplete (e.g. the UK gov’s paid-for cadastre dataset explicitly said it only has titles owned by corporate entities, so you would know it was incomplete). It is also good for us to remember perfection is always unattainable :wink: - we are looking for good within the limits of our resources.

Finally, re govs saying their data was complete: where does that come up? All they could say right now is that they comply with the minimal requirements for the Open Data Index (which we will probably up gradually over time).