Open Data Quality - The Next Shift In Open Data?


#1

Dear all,

We are pleased to announce our call to better understand data quality. As we say in the intro, “It is a call to recalibrate our attention to the many different elements contributing to the ‘good quality’ of open data, the trade-offs between them and how they support data usability”. We see it as critical to step away from mass data publication towards usable data.

Now we would love to hear your thoughts on this. What are your experiences with open data quality? Which quality issues hinder you from using open data? How do you define these data qualities?

Please do feel free to share your thoughts in this thread.


#2

I think there is a difference between data quality and quality publishing.

If we’re talking about quality publishing, then there are many guides and tools to help, e.g.

If we’re talking about data quality, then there are a few options:

  • allow people to provide feedback on data you’ve published
  • improve data collection and validation rules
  • through data governance, run a data improvement program

Often open data publishing teams are given an export of data from the source system and have no opportunity to improve the data. However, they may be able to provide context, provenance and quality information to help re-users better understand the data.

Oh! and this paper is excellent


#3

This is a really important topic and something we covered a little while back with the ODI by looking at inherent quality issues in Companies House, Land Registry and NHS data. In essence, publishers need more support with resources, standards & tools to get better data out there.

Strong collection standards need to be enforced (e.g. use only valid Country names when registering a business or require GP surgeries to give valid phone numbers to NHS Choices) and tools are needed to automate the extraction from source systems into the ‘open’ standard format (e.g. local authority spending data).

It’s not impossible but it needs effort & some funding. We also need a single set of definitions, standards and the involvement of the community to improve data - some form of open data GitHub.

My main concern is - who should co-ordinate this? As an earlier post mentions, there are guides from W3C, ODI, OKI and probably others, but are they all the same? How do they differ from UK Local Government Association guides? A single entity needs to lead it.


#4

Data should be machine-readable and published in structured formats. For example, CSV files should have a single header row of column titles, and every other row must have the same fields, matching those titles.
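To make the CSV rule concrete, here is a minimal Python sketch of such a check. It is only an illustration, not an established tool: the function name check_csv_shape and the sample data are made up for this example, and it only looks at the header row and the field count of each row.

```python
import csv
import io


def check_csv_shape(text):
    """Return a list of problems: blank or duplicate column titles,
    or data rows whose field count differs from the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    header = rows[0]
    problems = []
    if any(not title.strip() for title in header):
        problems.append("header has a blank column title")
    if len(set(header)) != len(header):
        problems.append("header has duplicate column titles")

    # Row numbers are 1-based and count the header as row 1.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, found {len(row)}"
            )
    return problems


# Made-up sample: the third row is missing a field.
sample = "id,value\n1,10\n2\n"
print(check_csv_shape(sample))
```

A check like this says nothing about whether the values are right, only whether the file has the regular shape that software needs to read it.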


#5

Here are some lightweight standards for Australian local government: http://standards.opencouncildata.org


#6

Absolutely. The common challenge is often having ‘the same’ column titles in different parts of government (etc). However, alongside the format, we need to think about collecting high quality data too.

Standardisation is theoretically simple. Gathering the data in the first place can be the tricky bit (as our ODI study indicated).


#7

Quality (granularity / error margin / updates) also depends on the application, especially when there are important thresholds or cut-off points:

  • for most purposes, it does not really matter if a municipality has 49 950 or 50 050 inhabitants. However, in Belgium it has a huge impact on local politics, because a municipality with 50 001 inhabitants suddenly has the right to appoint more City Council Members, and the mayor and the deputies earn a lot more
  • it probably doesn’t matter if it’s 28 or 31 degrees outside, unless there are labor rules that demand certain actions from employers (or forbid work altogether) when the temperature reaches 30 degrees
  • maps from a month ago are fine for driving to most places, unless you happen to visit a city that completely changed its mobility plan with one-way streets, speed limits and car free zones
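The threshold point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name decision_is_robust is made up, "crossing" is taken to mean strictly exceeding the threshold, and the error margin of 100 is a hypothetical figure, not a real statistic.

```python
def decision_is_robust(measured, error_margin, threshold):
    """True if every value in [measured - margin, measured + margin]
    falls on the same side of the threshold.

    Assumption: a value counts as 'above' only when it strictly
    exceeds the threshold.
    """
    low = measured - error_margin
    high = measured + error_margin
    return high <= threshold or low > threshold


# Hypothetical margin of 100 inhabitants around a 50 000 cut-off.
print(decision_is_robust(49_950, 100, 50_000))  # margin straddles the cut-off
print(decision_is_robust(30_000, 100, 50_000))  # far away from the cut-off
```

The same error margin is harmless in the second case and decisive in the first, which is why an acceptable quality level cannot be fixed without knowing the application.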

#8

There is great work being done in France by la FING on data quality. Within the Infolab program, they produced a methodology to assess the quality of data, with more than 120 criteria to check. The methodology includes a sprint-type event in which users can assess the quality of data in about two hours.

The project is in French, I think it’s worth translating.

You can find the methodology here http://infolabs.io/sprint-qualite

And, if you don’t understand French, you should still have a look at the impressive checklist.


#9

The problem with data quality is, of course, that everyone uses a different definition (as shown above). Furthermore, is it really a core task of the government to raise the quality of a dataset beyond what is needed for the reason the data was created in the first place? When we’re talking about Open Data, the goal is to maximize data adoption by third parties. “Bad” (whatever that may be) quality data might also be interesting to publish. Therefore, in recent work, I did not talk about data quality but about data interoperability: the more interoperable datasets are, the easier it will be for third parties to adopt them in their software systems. If this is of interest, I published something on that here: https://pietercolpaert.be/papers/iet-otd/


#10

Hi Pieter!

Yes, I agree that “bad” data can be sufficient for some uses. I tend to say that a big part of quality depends only on the use. That said, another part of quality relies on the reusability of the data. I find this word more expressive than interoperability, which is more technical and sometimes obscure. In that sense, good metadata is a key factor of data quality: data can be “bad”, but the metadata (documentation) has to say how “bad” it is.
That’s why our methodology focuses on metadata quality, with more than 30 criteria about it.
See http://infolabs.io/sprint-qualite (beta)


#11

Hi Charles,

I really like the more expressive word reusability as well! Let’s use that?


#12

As Pieter identifies, there are lots of different definitions of quality in play here, all depending on what is being assessed (the ‘form’ or ‘contents’ of a dataset) and the context of re-use (Connecting with other datasets for interoperability? Using to get a rough sense of an issue? Using to make operational decisions? etc.)

With the Open Contracting Data Standard we’ve been experimenting with a number of ways to talk about data quality - trying to be clear in our language between the ‘technical validity’ of a dataset against a schema (possible to assess automatically), vs. the ‘conformance’ in use of terms (requiring some human assessment), vs. the completeness of data (possibly split into questions of ‘depth’ and ‘breadth’ - possible to assess in automated ways to some extent) vs. utility (only possible to assess in relation to a particular use case).
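As a rough illustration of which of these checks can be automated, here is a minimal Python sketch. Everything in it is an assumption made for the example: the mini-schema and its field names are hypothetical and are not taken from the Open Contracting Data Standard, and "breadth" and "depth" are given one possible reading (fields used anywhere vs. fields filled per record), which may differ from how OCDS uses the terms.

```python
# Hypothetical mini-schema: field name -> accepted Python type(s).
# Illustrative only; these are NOT real OCDS field names.
SCHEMA = {"id": str, "title": str, "amount": (int, float)}


def is_technically_valid(record):
    """No unknown fields, and every non-null value has the expected type."""
    return all(
        key in SCHEMA and (value is None or isinstance(value, SCHEMA[key]))
        for key, value in record.items()
    )


def breadth(records):
    """Share of schema fields that are filled in at least one record."""
    used = {k for r in records for k, v in r.items() if v is not None}
    return len(used & set(SCHEMA)) / len(SCHEMA)


def depth(records):
    """Average share of schema fields filled per record."""
    if not records:
        return 0.0
    shares = [
        sum(1 for k in SCHEMA if r.get(k) is not None) / len(SCHEMA)
        for r in records
    ]
    return sum(shares) / len(shares)


# Made-up records for illustration.
records = [
    {"id": "a", "title": "Road repair", "amount": None},
    {"id": "b", "title": None, "amount": None},
]
print(all(is_technically_valid(r) for r in records))
print(breadth(records), depth(records))
```

Conformance in the use of terms and utility for a given use case are deliberately left out: as the post says, those need human assessment or a specific use case to judge against.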

I see a lot of practical value in the check-list approach for running quality assessments, particularly if there is a separation between a generic checklist covering the ‘technical re-usability’ of data, and domain-specific checklists born out of a mapping of a broad sample of use-cases, and the particular data requirements they give rise to. That might not support a simple quality score for a dataset, but would support reporting on the extent to which different kinds of economic and civic use-cases are supported by the available data.


#13

Our report examining the usability of Government Open Data may be found at http://phoensight.com/open-data-supply/

These findings were also presented in Dec 2016 at Strata-Hadoop Singapore


#14

Dear All,

Quality is generally understood as fitness for purpose, meaning that something serves a purpose defined by the user in a certain context (see the ISO quality management standards: http://www.iso.org/iso/pub100080.pdf ). Data quality is determined by a number of dimensions, such as accuracy, completeness, consistency and timeliness, each having a precise meaning in relation to the purpose for which the data was generated in the first place and to the intended re-use of the data once opened to outsiders.

However, no one is able to check by subsequent monitoring or measurement whether existing data complies with any of these dimensions. This means that the process that generated the data has to be validated and revalidated to ensure that it can, in fact, perform according to the requirements it is designed to meet.

For this very reason, official statistical agencies consider that they should provide information covering the underlying concepts, variables and classifications used, the methodology of data collection and processing, and indications of the quality of the statistical information. In general, this means sufficient information to enable the user to understand all of the attributes of the statistics, including their limitations, for informed decision-making. See the UN National Quality Assurance Framework: http://unstats.un.org/unsd/dnss/QualityNQAF/nqaf.aspx

Probably one of the first questions is: what is an existing dataset intended to describe?

Sincerely,

Gérard


#16

Hi everyone,

I coordinated the translation of the quality data sprint (http://infolabs.io/sprint-qualite) checklist with my students from the Open Data for Urban Research class at the Urban School of Sciences Po.

Here is the translated checklist of 120+ checkpoints: https://frama.link/dataqualitysprint

This is of course freely reusable as long as you credit la FING and the Urban School of Sciences Po.

Hope this is useful!