I see that there is quite an important discussion about whether we should only analyse complete datasets, whether we over-score a partial dataset that is completely open, etc. This is a BIG POINT and we should discuss it here, because our new survey specifically addresses this. Let me explain:
-
with the new index we want to encourage governments to publish all data in one dataset - i.e. in one file containing all the fields/characteristics we want to see in there. This is our reference point - this is why we have dataset definitions, and why we clearly only want to evaluate datasets that meet all our requirements - hence the use of Q5 (which we are considering integrating into Q3).
-
however, there are a lot of cases where these data are not provided in one dataset. To answer @carlos_iglesias_moro's comment (how do Q1 and Q5 relate to each other?): we decided that it would be a radical step to only measure a dataset that contains all our requirements (e.g. a spreadsheet containing the water pollutants of all water sources in one file). To be rigorous we would have to ask "Are all data included in one file?" and, if that is not the case, stop the survey - because we actually only want to analyse the openness of datasets that meet all our requirements.
We decided against this step and will also accept the evaluation of partial datasets. This opens up the two issues discussed by @RouxRC: 1) do we "over-score" datasets if they are only partial (openness vs. "completeness"), and 2) shall we analyse multiple datasets or focus on one partial dataset?
To point 1 - we definitely only want to evaluate our reference dataset (one meeting all of our criteria). If there is no such dataset, we still want to see if there are other datasets we could evaluate - to understand how open these datasets are, to acknowledge first steps taken by governments in the right direction, and to sensitize our submitters to the fact that they are only looking at a partial dataset while still giving them the chance to evaluate it.
But we also want to encourage governments to publish a complete dataset - and therefore we want to explicitly flag something like this in the overall score: "THE SCORE ONLY APPLIES TO A PART OF THE DATA - THE DATASET CANNOT BE REGARDED AS FULLY OPEN". Alternatively we can lower the total score - e.g. subtracting 50% for partial datasets, or something similar (see the sketch below). The point is that we precisely DO NOT want to communicate that a dataset is fully open if it does not even meet our criteria - but we do not want to cut off partial datasets either. The critical point here is how we can incentivize governments to publish complete datasets. The key is to have a clever way of flagging this - and a disclaimer might not be enough if we display a 100% score - so negative scores might be an option here, one that we will consider for our weighting.
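To make the penalty idea concrete, here is a minimal sketch of what such a scoring rule could look like. The question weights, the 50% penalty factor and the `score()` helper are all hypothetical - they only illustrate the mechanics, not an actual weighting decision:

```python
# Hypothetical question weights (NOT the real index weighting,
# just placeholders to illustrate the mechanics).
WEIGHTS = {
    "machine_readable": 30,
    "free_of_charge": 30,
    "bulk_download": 20,
    "open_licence": 20,
}

PARTIAL_PENALTY = 0.5  # assumed 50% deduction for partial datasets


def score(answers, is_partial):
    """Sum the weights of the criteria a single dataset meets.

    A partial dataset gets the penalty applied AND an explicit
    flag, so a high raw score can never read as "fully open".
    """
    raw = sum(WEIGHTS[q] for q, met in answers.items() if met)
    if is_partial:
        return raw * PARTIAL_PENALTY, "SCORE APPLIES TO A PART OF THE DATA ONLY"
    return raw, None


# A partial dataset meeting every openness criterion would show
# 50.0 plus the flag, instead of a misleading 100.
print(score({q: True for q in WEIGHTS}, is_partial=True))
```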
To point 2 - in past editions we allowed submitters to analyse several datasets, but I think it is methodologically problematic to evaluate multiple datasets, because we end up comparing apples and oranges:
In Romania we found several datasets for national statistics - one was free of charge but not machine-readable, another was available in bulk but had to be paid for. In the end the entry got a 100% score, because we added up the partial scores into one overall score. http://index.okfn.org/place/romania/statistics/
The national maps of the UK are not complete, but we still evaluated them as available in bulk - they got a 100% score even though they do not include Northern Ireland. http://index.okfn.org/place/united-kingdom/map/
In Cameroon we found company registers for several types of enterprises - the dataset got a score of 0% because every question was answered with “Unsure”. http://index.okfn.org/place/cameroon/companies/
So we had several cases where partial datasets were treated differently - all leading to different scores. But the case of Romania shows that it does not make sense to add up scores for different datasets, because it makes our evaluation arbitrary again: what if one dataset contains only some characteristics and is free of charge, while the complete dataset has to be paid for? We cannot simply add up their scores, because in the end the message has to be "this specific dataset is open to a certain extent" (see the worked example below).
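As a worked illustration of the Romania case, reusing the hypothetical `score()` helper and `WEIGHTS` from the sketch above (again, purely illustrative numbers):

```python
# Answers for two hypothetical Romanian statistics datasets
# (free-but-not-machine-readable vs. bulk-but-paid).
dataset_a = {"free_of_charge": True, "machine_readable": False,
             "bulk_download": False, "open_licence": True}
dataset_b = {"free_of_charge": False, "machine_readable": True,
             "bulk_download": True, "open_licence": True}

# Scored on its own, neither dataset reaches 100%...
print(score(dataset_a, is_partial=False)[0])  # 50
print(score(dataset_b, is_partial=False)[0])  # 70

# ...but taking the best answer per question across both datasets
# (effectively what adding up partial scores did) yields a 100%
# that no single dataset actually earns.
merged = {q: dataset_a[q] or dataset_b[q] for q in WEIGHTS}
print(score(merged, is_partial=False)[0])     # 100
```

The merged 100% is exactly the apples-and-oranges problem: it describes no dataset that anyone can actually download.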
I agree, @RouxRC, that it makes sense to document alternative datasets. This is also why we use Q2.2: we want to see where datasets can be found. We could repurpose Q2.2 and use it to list alternative datasets - a comment section could be used so that submitters can describe alternative datasets (re: @cmfg) and tell us their rationale for why they only looked at one specific dataset (which should most likely be that this dataset was the one most compliant with Q3).