I’m currently researching local OGD initiatives in Italy and France. While discussing the OD Index definition of open data, I suggest that three extra questions could be added:
Whether the machine readable format is also open.
Perhaps this could be a subquestion of the sixth. For example, ten points could be allocated to machine readable datasets and five extra if an open version is available (a rough sketch of this weighting follows the third suggestion below).
Whether data is fit for publication and re-use.
Fit data should be properly anonymized, free of meaning conflict (understandable to people who lack domain knowledge), and published with awareness of adverse consequences that could damage data providers or users, or create pressure to hide data.
I wonder whether the first point is relevant to the datasets examined by the index. I also understand that the third may be quite trivial; perhaps a dataset could be defined as ‘fit to the best of one’s knowledge’.
Whether users (both primary and secondary) can provide feedback.
Feedback could help with mistakes, missing observations, and fitness for re-use. This may be relevant to OGD only, since it may not be in the interest of non-governmental data providers to allow for users’ feedback. Also, one could argue that the participatory aspect of OGD is beyond its definition, as it moves towards open government practices. However, I believe including user-driven feedback can be a determinant of valuable re-use.
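To make the weighting idea in the first suggestion concrete, here is a minimal Python sketch. The point values (10 for machine readability, 5 extra for an open format) and the lists of formats are my own illustrative assumptions, not actual Index methodology.

```python
# Illustrative only: a sketch of the suggested weighting, where a machine
# readable dataset earns 10 points and 5 more if the format is also open.
# The format lists and point values below are assumptions for illustration.

OPEN_FORMATS = {"csv", "json", "xml", "rdf"}          # assumed "open" formats
MACHINE_READABLE_FORMATS = OPEN_FORMATS | {"xls", "xlsx"}

def format_score(fmt: str) -> int:
    """Return the points a dataset would earn for its distribution format."""
    fmt = fmt.lower()
    score = 0
    if fmt in MACHINE_READABLE_FORMATS:
        score += 10                                   # machine readable
        if fmt in OPEN_FORMATS:
            score += 5                                # machine readable *and* open
    return score

print(format_score("xls"))   # 10: machine readable, proprietary format
print(format_score("csv"))   # 15: machine readable and open
print(format_score("pdf"))   # 0: not machine readable
```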
I adapted the nine questions to define optimal open (government) data as complying with 15 characteristics:
Data should exist…
…in digital form…
…publicly available…
…online…
…for free…
…in a machine readable (open) format…
…available in bulk…
…complete with context information (metadata)…
…identified by URIs…
…and linked to other data (LOD)…
…under open license…
…up to date…
…risk-free (to the best of one’s knowledge)…
…with no meaning conflict…
…and allow for users’ feedback.
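For anyone who prefers to see the definition as structured data, here is a rough Python sketch of the 15 characteristics above expressed as a checklist. The field names and the simple unweighted tally are assumptions of mine, purely to show how such an assessment could be recorded.

```python
# Sketch only: the 15 characteristics as a checklist with an unweighted count.
from dataclasses import dataclass, fields

@dataclass
class OpenDataAssessment:
    exists: bool = False
    digital: bool = False
    publicly_available: bool = False
    online: bool = False
    free_of_charge: bool = False
    machine_readable_open_format: bool = False
    available_in_bulk: bool = False
    has_metadata: bool = False
    uses_uris: bool = False
    linked_to_other_data: bool = False
    open_license: bool = False
    up_to_date: bool = False
    risk_free: bool = False          # to the best of one's knowledge
    no_meaning_conflict: bool = False
    allows_user_feedback: bool = False

    def satisfied(self) -> int:
        """Number of the 15 characteristics this dataset meets."""
        return sum(getattr(self, f.name) for f in fields(self))

# Example: a dataset that is online, free and machine readable, but little else.
sample = OpenDataAssessment(exists=True, digital=True, publicly_available=True,
                            online=True, free_of_charge=True,
                            machine_readable_open_format=True)
print(f"{sample.satisfied()}/15 characteristics met")
```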
I would love to hear the community’s opinion on this!
@civicdata @Federico these are good ideas, and it’s important that data re-users understand the answers to these questions. I only came to the Open Data Index late last year, but I thought it was measuring compliance with the Open Definition. Other measurement programs capture the best practices you mention.
Perhaps @mor or someone else who was there at the beginning of the Open Data Index can answer about the original measurement intentions, plans for the future, and the relationship to other measurement initiatives?
Thanks for the suggestions (especially historic - I get lots of requests from universities for this).
@Stephen just replying on this specific point: whilst the Open Data Index does aim to cover the Open Definition criteria, it has always included additional questions, e.g. on timeliness.
More generally in response to this thread: criteria, at least for questions that would go into scoring, would be things like:
the ease with which they could be reliably assessed: we want questions where it is reasonably easy to make an assessment that others would agree with - i.e. we want something objective rather than subjective.
agnosticism about specific formats. Whilst we require machine readability, we don’t insist on or score for e.g. linked open data (using URIs etc.).
On things like openness to feedback: I feel this is a definite question one could ask but not a question one might score (it’s pretty subjective and is not directly related to the openness of the data - it is more about the openness of the process).
“Complete”: I would imagine that “complete” would fall under either the simple “Is it available?” question (if a large part isn’t, then it isn’t available) and/or under “bulk”.
“Historic”: we had something originally about the date first available but found this very hard to reliably assess. Remember, people can add any info they like in the description, but I think historic is tough to assess in a reliable cross-country or cross-city way.
“Accurate”: I can imagine this is one that would easily generate a lot of edit wars - assessing the accuracy of data is very difficult and it’s not clear how you would score it (all reasonably sized databases will have some errors).
To build on top of Rufus’ answer, I will mention two factors:
The index is a crowdsourced effort. Through it, we have learned that it is used as a learning tool. For some people, the index is their first encounter with the open data definition and the concept of open. We still see errors in the machine readable and license answers. While I know we need to improve the definitions we are using, we also need to take this into consideration.
Looking again at the crowdsourced answers, as Rufus said, some of the questions will be hard to assess - how do we know that a dataset is anonymised properly? This can only be answered by privacy experts. How can we know that it has complete context in its metadata?
First and foremost, I see the index as an advocacy tool. I think we can add questions on subjects we want to advocate for, but we should also take into consideration that the global index is a global benchmark, and it is already biased toward developed countries (and it will stay like this for a while; developed countries started open data years before the global south). We try to make it as valid and reliable as we can within our limitations (crowdsourced data is known to be unreliable), but this is not academic research; it is a tool, and in order to use it wisely we need to see what we need to add so that it serves our network. I think that some of the questions here are good for that, and some are missing the point. Taking machine readability as an example again: some government officials still struggle with the concept of machine readable. Taking points off because the format is XLS and not CSV can cause frustration with the work they are doing and actually harm the process.
To conclude, we should always revise our methodology, and some suggestions here are good, but we need to take into consideration the purpose of the tool and its nature.
Let me add a few details/scenarios where the existing categories don’t seem to apply.
Complete
A police department releases crime report data in bulk format, including id, lat, lon, category, address, name, description, etc. But they don’t release white-collar crime, location type, arrest, full report text, or sub-category. So it is bulk, but not everything that would be released if you did a FOIA request.
A planning department releases all building permit data, including all available fields. But they only release Open and Pending permits. Closed permits are not released, so it’s bulk, but not complete.
Historic
Police release the last year of crime data, but don’t release the last 10 years of data, which they also have in their DB but choose not to release. So it’s bulk and complete for the time period, but not historic.
Accurate
Police release bulk, complete, historic crime report data. But there are serious flaws in the data, like horrible geocoding, duplicate unique identifiers, bad date formatting, and incorrect categorization.
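For what it’s worth, the “Accurate” problems above are the kind that basic automated checks can surface. Here is a hedged Python sketch; the column names (id, lat, lon, date) and the expected date format are assumptions based on the crime-report fields mentioned earlier, not a real city schema.

```python
# Sketch of automated accuracy checks: duplicate identifiers, out-of-range
# coordinates, and malformed dates. Column names are assumed for illustration.
import csv
from datetime import datetime

def audit_crime_reports(path: str) -> dict:
    seen_ids = set()
    duplicate_ids = bad_coords = bad_dates = total = 0

    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if row["id"] in seen_ids:
                duplicate_ids += 1            # "unique" identifier reused
            seen_ids.add(row["id"])
            try:
                lat, lon = float(row["lat"]), float(row["lon"])
                if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                    bad_coords += 1           # geocoded outside valid range
            except ValueError:
                bad_coords += 1               # non-numeric coordinates
            try:
                datetime.strptime(row["date"], "%Y-%m-%d")
            except ValueError:
                bad_dates += 1                # inconsistent date formatting
    return {"rows": total, "duplicate_ids": duplicate_ids,
            "bad_coords": bad_coords, "bad_dates": bad_dates}

# Example usage (hypothetical file name):
# print(audit_crime_reports("crime_reports.csv"))
```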
These are all real situations we have in Louisville, KY, that you can see in other cities too. In Louisville, for crime, the city gets 100%, despite these flaws.
So they are not pushed to fix anything (the mayor and staff use this Open Data Index to drive what they work on to improve the score). If the purpose of this tool is to encourage opening more data and advocacy, it’s partially failing because these categories are left out.
The first two categories on the ODI are “Data exists” and “It’s digital.” While that is all good to know, it doesn’t seem to help the public at all since the data is still not released, and maybe a city shouldn’t get ‘points’ for it. If we think there might be too many categories, maybe these two could be removed?
Since we are evaluating the text of the current questions, should we consider adding some of these grading areas for the data? I know it would help with increasing the quality and quantity of data released in Louisville.
@civicdata - We are evaluating the questions for the Global Open Data Index.
For the US Census-related questions, please speak to @hackyourcity and the team at the Sunlight Foundation.