Reflections on this year's index

+1 to everything Benjamin said.

I would add that, for next year, there should be a better way to evaluate the dataset requirements (qualifying criteria). Quite often a dataset meets all but one of the requirements, but it is still useful to people - definitely more useful than not having the data. So a score of zero, same as if no data existed, seems awfully inaccurate if we intend to measure the usefulness of open data to citizens.

I don’t know the solution to this problem. Perhaps we could count points for each requirement met, or multiply the points obtained from the questions by the percentage of requirements met. Ideas are welcome.
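To make the two ideas a bit more concrete, here is a rough sketch in Python. The function names, the one-point-per-requirement weight, and the numbers in the usage lines (4 requirements, 80 points from the questions) are all made up for illustration; they are not taken from the actual census scoring.

```python
def all_or_nothing(requirements_met, question_points):
    """Current behaviour as described above: zero unless every requirement is met."""
    return question_points if all(requirements_met) else 0


def points_per_requirement(requirements_met, points_each=1):
    """Idea 1: count points for each requirement met (points_each is hypothetical)."""
    return points_each * sum(1 for met in requirements_met if met)


def scaled_by_requirements(requirements_met, question_points):
    """Idea 2: multiply the question points by the fraction of requirements met."""
    fraction = sum(1 for met in requirements_met if met) / len(requirements_met)
    return question_points * fraction


# Hypothetical dataset that meets 3 of 4 requirements and earned 80 question points.
met = [True, True, True, False]
print(all_or_nothing(met, 80))          # 0
print(points_per_requirement(met))      # 3
print(scaled_by_requirements(met, 80))  # 60.0
```

Either way, a dataset that misses a single requirement would no longer score the same as a dataset that does not exist.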

Another issue is the quite frequent case where the data for the same set is split among two or more datasets, often from different publishers, each of which will have different answers to the questions, as mentioned by @b_ooghe. In that case, we arbitrarily chose one as the “main” dataset and mentioned all the others in the comments section.

As for the answers to the questions, we didn’t have time to discuss this case, so our responses were not always consistent. In some cases we chose “no” unless the answer was “yes” for each and every one of the datasets (essentially treating all the datasets in the set as if they were one combined dataset) (SOLUTION 1). In others, like example 2, we answered the questions considering only the dataset chosen as the “main” one and ignored the rest (SOLUTION 2). Other contributors, as I found out later, chose “unsure” as the answer to questions in cases like this (SOLUTION 3).

Example 1: Weather forecast for Brazil has 3 datasets: [A] and [B] from Inmet, [C] from Inpe. A and C are forecasts, B is historical measurements. Only B is machine-readable, but we chose “no” as the answer for the corresponding question because we treated the three datasets as one.

Example 2: Company register for Brazil has 3 datasets: [D] from Receita Federal, [E] from Ministério do Planejamento and [F] from JUCESP. Only E is openly licensed and machine readable, the others are not. Only E and F can be searched for company names and provide company addresses. Only D and F are regularly updated. Only D is complete (E is only for registered suppliers or government contractors, F is only for companies in the state of São Paulo).
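To show how much the three solutions can differ, here is a small Python sketch applied to Example 2. The question keys are my own shorthand, not the census form's wording. For SOLUTION 3 I am assuming “unsure” is given whenever the datasets disagree, which is my reading of what other contributors did. The source does not say which dataset we picked as “main”, so the choice of E below is only for illustration.

```python
# Answers per dataset, taken from Example 2 (True = yes, False = no).
DATASETS = {
    "D": {"openly_licensed": False, "machine_readable": False,
          "searchable": False, "up_to_date": True, "complete": True},
    "E": {"openly_licensed": True, "machine_readable": True,
          "searchable": True, "up_to_date": False, "complete": False},
    "F": {"openly_licensed": False, "machine_readable": False,
          "searchable": True, "up_to_date": True, "complete": False},
}
QUESTIONS = list(DATASETS["D"])


def solution_1(datasets):
    """'yes' only if every dataset in the set answers yes."""
    return {q: "yes" if all(d[q] for d in datasets.values()) else "no"
            for q in QUESTIONS}


def solution_2(datasets, main):
    """Answer from the 'main' dataset only; ignore the rest."""
    return {q: "yes" if datasets[main][q] else "no" for q in QUESTIONS}


def solution_3(datasets):
    """'unsure' when datasets disagree (assumed interpretation)."""
    result = {}
    for q in QUESTIONS:
        answers = {d[q] for d in datasets.values()}
        if len(answers) > 1:
            result[q] = "unsure"
        else:
            result[q] = "yes" if answers.pop() else "no"
    return result


print(solution_1(DATASETS))            # every question: "no"
print(solution_2(DATASETS, main="E"))  # "E" as main is a hypothetical choice
print(solution_3(DATASETS))            # every question: "unsure"
```

For this set, SOLUTION 1 gives “no” to every question and SOLUTION 3 gives “unsure” to every question, while SOLUTION 2 depends entirely on which dataset was picked as “main”.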

What I propose is that we decide on the appropriate solution (SOLUTION 1, 2 or 3) for these situations, then apply it consistently across all countries during the review process. And perhaps, for next year, the census model and forms themselves could take these “multiple datasets in a set” situations into account.

Finally, another improvement for next time would be to spread the word more widely about the start of the census. We only learned about it after the original period had expired and been extended. Perhaps more direct outreach would prove more fruitful: contacting past contributors and government officials directly. The Open Data Barometer, for comparison, started asking directly for government contributions this year, which I believe will surface findings that independent researchers working alone would not have found.