For GODI 2016 we are thinking about splitting our question “Is data available in bulk?” into two sub-questions that together fulfil the bulk criterion: 1) Is the dataset complete? 2) Can the dataset easily be downloaded?
The reason is that we would like to foreground the issue that governments might publish patchy or incomplete data that is nevertheless available as a comprehensive download, under an open licence, in a machine-readable format, etc. This way we hope to better evaluate whether governments publish datasets in accordance with our key dataset definitions.
We have thought of two different ways to measure “data completeness”, and I would like to know your opinion of them:
1) We could take our dataset definitions (for instance, public procurement data have to contain tender name, tender status, etc.) and ask submitters to check whether a dataset contains these elements. While this procedure has some caveats (for instance, some of these categories might be named differently in different countries and users might not recognise them), it would be fairly easy to define and would be aligned with our dataset requirements.
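To make the first approach concrete, here is a minimal sketch of how such a check could work in principle. The field names (`tender_name`, `tender_status`, `tender_value`) and the sample rows are hypothetical placeholders, not our actual dataset definitions:

```python
# Sketch of approach 1: check whether a dataset contains the elements
# required by a dataset definition. Field names and sample data are
# invented for illustration only.

REQUIRED_FIELDS = {"tender_name", "tender_status", "tender_value"}

def missing_fields(rows):
    """Return required fields that are never filled in anywhere in the dataset."""
    present = set()
    for row in rows:
        present.update(k for k, v in row.items() if v not in (None, ""))
    return REQUIRED_FIELDS - present

sample = [
    {"tender_name": "Road maintenance 2016", "tender_status": "awarded"},
    {"tender_name": "School IT equipment", "tender_status": "open"},
]
print(missing_fields(sample))  # prints {'tender_value'}
```

Of course, a human submitter would do this by inspection rather than by script, and the caveat about differently named categories remains: a purely name-based check would flag a field as missing even when it exists under a local label.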
2) We could ask submitters to give feedback about the data within each category (such as all names contained in tender names, all dates contained in tender status, etc.) and assess whether these data seem to contain everything that belongs to the category (I say “seem to” because we lack a hard criterion for completeness). While this might be feasible to measure for some datasets (e.g. checking whether the budgets of all national government departments are listed), in other cases it seems almost impossible for a layperson to evaluate, or the result might be biased by subjective assessment.
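The second approach is essentially a coverage check against some independent reference list, where such a list exists. A rough sketch of the idea, using an invented list of departments (the names and the dataset are purely hypothetical):

```python
# Sketch of approach 2: estimate completeness by comparing the entities
# present in a dataset against an independent reference list. The
# department names here are invented for illustration.

REFERENCE_DEPARTMENTS = {"Finance", "Health", "Education", "Defence"}

def coverage(listed_departments):
    """Return (fraction of reference entities found, set of missing entities)."""
    found = REFERENCE_DEPARTMENTS & set(listed_departments)
    return len(found) / len(REFERENCE_DEPARTMENTS), REFERENCE_DEPARTMENTS - found

ratio, missing = coverage(["Finance", "Health", "Education"])
print(ratio, missing)  # prints 0.75 {'Defence'}
```

This only works where a trustworthy reference list exists, which is exactly the limitation mentioned above: for most categories no such list is available to a layperson, so the assessment becomes subjective.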
I would love to hear your opinion about this, and perhaps you also have other ideas for how we could measure the completeness of a dataset.
All the best