Tracking Data Issues: what's the current state of the art?

Has anyone done further work on the topic discussed in the blog post “Tracking Issues with Data the Simple Way” by @rufuspollock, for the case where the data is not hosted on GitHub?

I’d be particularly interested to hear about approaches successfully adopted for public sector open data portals. This article suggests that it is still an important challenge:

These advances aside, Kaehny said the “big criticism” he has left for the city’s progress report revolves around what it doesn’t include: any mention of how the department plans to address issues with the quality of some data sets.

From “inconsistent geospatial data” to “bad metadata” to erroneous data duplication, Kaehny sees issues that the city could address if it had a better process for accepting feedback from the public and reacting to those comments.


Hi there @ewan_klein

Yes, we have done lots of work on Data Quality, mostly via our Frictionless Data specifications and related tools.

In particular, we have a Python library called goodtables that produces detailed, granular quality reports from tabular data, or collections of tabular data (e.g.: we can scan a public CKAN API and build quality reports for every tabular data file published on the instance).

We are about to launch a full continuous data validation service called goodtables.io that makes this trivially easy to do.

The Data Quality Spec has been extracted out of our work on goodtables to enable reuse, and we’ve got a beta version of a Data Quality Dashboard that can display quality results for large sets of data (think: visual interaction with quality results for each tabular data file on a CKAN instance, or any other public data collection).

Hi @pwalsh,

Thanks for the super-quick response! The Data Quality Dashboard looks great, so thanks for that link. However, I’m particularly interested in cases where data may be incorrect for reasons other than being syntactically ill-formed or invalid relative to a schema; for example, a geo-coding could be inaccurate, or a string could contain typos. Some of these errors may be picked up by people with ‘domain knowledge’ but with little or no technical expertise, so easy-to-use reporting mechanisms (plus support for tracking the data steward’s response) would be really helpful.


@ewan_klein I wonder to what extent your issues could be addressed by writing custom checks in Good Tables:

cc: @roll

@danfowler
Yes, custom checks seem pretty relevant. Here is a toy example of reporting an error if there is a Unicode character in the data:
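A minimal sketch of the idea in plain Python, deliberately not using the goodtables API: it flags every cell containing a non-ASCII character and reports a custom error code. The function name, the `non-ascii-found` code, and the error dictionary layout are assumptions made for illustration, and "Unicode character" is interpreted here as "non-ASCII character".

```python
import csv
import io


def find_non_ascii_cells(csv_text):
    """Return one error dict per cell that contains a non-ASCII character."""
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column_number, cell in enumerate(row, start=1):
            if not cell.isascii():
                errors.append({
                    "code": "non-ascii-found",  # hypothetical custom error code
                    "row-number": row_number,
                    "column-number": column_number,
                    "message": (
                        f"Non-ASCII character in row {row_number}, "
                        f"column {column_number}"
                    ),
                })
    return errors


# Hypothetical sample data: the header is row 1, so the flagged cell is in row 3.
sample = "id,name\n1,english\n2,中国人\n"
report = find_non_ascii_cells(sample)
for error in report:
    print(error["code"], error["message"])
```

A domain-specific check would follow the same shape: walk the cells, apply a rule, and emit an error with its own code.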

But of course it could be any other check, with custom error codes related to your domain.

@ewan_klein
If you’re interested I’m ready to help here, or you could create an issue - Issues · frictionlessdata/goodtables-py · GitHub - with a description of the kind of custom check you’re looking for.

@roll and @danfowler, thanks both for the suggestions. I can see how a custom check would give you the basic functionality but I don’t fully grasp how this would be integrated within a user-facing scenario on a platform like CKAN. Do you have a rough idea of what the workflow / dataflow would be?


@ewan_klein yes, this is a good question. I understand what you’re saying now. For a given registry, there should be an easy, user-facing way to flag data “issues”. Good Tables is good for automatically flagging a set of issues with data that the publisher could foresee.

@rufuspollock does the forthcoming Data Package Registry do anything about this?

One way to integrate the results in a user-facing way in CKAN would be to integrate with the ckan issues extension and automatically open issues (this would also allow domain experts to directly inspect data and flag issues):

@ewan_klein I also became aware of DBHub (@justinclift), aimed at managing SQLite databases in the cloud, which does aim to have an “issues” section:

Thanks @danfowler. :smile:

@ewan_klein This is something we’re hoping to address, if not completely solve :wink:, through a new data collaboration platform being worked on. As Dan mentioned, it’s called “DBHub.io”.

We’re still getting the basic bits in place, and the “Issues” section, called “Discussions”, is just a placeholder for the next few weeks. :wink:

The concept for the platform is fairly simple. It’s a way to share data sets using the same model as Git/GitHub. eg forks, merging, + social aspects

There’s a development server online that generally runs the latest code, and you’re quite welcome to play with that… and even attempt busting it if you want… as long as you file bug reports if you manage to. :grinning:

https://dev1.dbhub.io/justinclift/DB4S%20download%20stats.sqlite

Obviously hoping the platform gets good traction, and works out well. We do have one potential advantage, in that it’ll be the “Cloud” component for (i.e. integrated with) a popular SQLite GUI, which generally gets a shade over 1 million downloads for each release.

Fingers crossed, etc. :slight_smile:

Hi @justinclift (and thanks for the pointer, @danfowler)

DBHub.io looks really interesting. As it happens, one of my MSc students, Andreea Pascu, did a lot of work last year on organising and cleaning off-road bike count data from City of Edinburgh Council (her code is here) and her canonical version of the data is in a SQLite DB. There are CSV dumps from her DB published as open data with help from @sally_kerr, but we didn’t find an obvious way of publishing the source DB. So we could upload this to DBHub.io, right?

I think the idea of focussing on a particular use-case, as you have done, is an excellent one: flesh out the key ideas for versioning and issues first with an interested community, and then see whether the framework can be generalised / ported etc.

One question: Is it possible in DBHub.io to associate a specific license with an uploaded DB?

Definitely! With a caveat though. :wink:

The caveat is that the only server we have online at the moment is the “dev1” VM. We wipe the data on there and generally muck around with it (debugging, etc). So, that’d be a very err… temporary place to put data.

When the platform codebase is a bit more developed, we’ll start putting the first real “production” servers online which should (!) be a suitable long term home. :smile:

It will be some time in the next few days (or maybe next week). I need to figure out which licenses to display by default, have some way of indicating deprecated vs recommended, and so on. I’ve already started adding some general server side code for this, but it’s not visible through the UI. @danfowler also pointed to a license discussion on this forum, which is likely relevant and which I still need to read through.

There’s also a GitHub Issue (opened yesterday) about the granularity of licenses DBHub.io will need:

Constructive input/ideas/etc there is welcome. :slight_smile:

@ewan_klein Apologies, we’ve not added the license pieces yet. I’ve been putting time into making the webUI easier to use and more useful in general (eg added pagination, sorting, further cache usage). License pieces will probably be added early next week now.

With Andreea Pascu’s canonical SQLite DB, any idea how big it is? I’ve looked quickly at the CSV dumps and they didn’t appear too huge. She’s very much welcome to upload it to our dev1 server (please? :smile:), as it’ll give us further sample data to test things against. We can make sure we don’t wipe it either.

Licensing wise, we’ve been manually adding that info to the Description area of each page for the databases we’ve uploaded. The Description area supports Markdown, and is changeable through the “Settings” page for each database.

eg: https://dev1.dbhub.io/justinclift/Assembly%20Election%202017.sqlite

@ewan_klein As an observation with the CSV dumps, they’re numbered sequentially from 01 to 48.

Numbers 06, 13, and 16 aren’t present though. Is that expected? :slight_smile:
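A quick way to double-check this kind of gap is to compare the numbers found in the file names against the expected range. The sketch below assumes, purely for illustration, file names of the form `counter_NN.csv`; the real dumps may be named differently.

```python
import re


def missing_numbers(filenames, first, last):
    """Return the numbers in [first, last] that no file name contains."""
    present = set()
    for name in filenames:
        match = re.search(r"\d+", name)  # first run of digits in the name
        if match:
            present.add(int(match.group()))
    return [n for n in range(first, last + 1) if n not in present]


# Hypothetical listing: files 01 to 10 with 04 and 07 absent.
names = [f"counter_{n:02d}.csv" for n in range(1, 11) if n not in (4, 7)]
gaps = missing_numbers(names, first=1, last=10)
print(gaps)
```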

Hi @justinclift,

I’m Prof. Ewan’s MSc student who worked on the Edinburgh Bike Counts project.

Related to your question, I had a look at the csv dumps (here are all of them: /outputFiles): counter ‘06’ seems to be there, however we had no data for ‘13’ and ‘16’. Those were the examples of (more recently installed) counters, for which we didn’t have any data collected - therefore there was no file or analysis to produce. Hope this answers your question.

-Andreea

Thanks @andreeaPascu, you’re right. Sorry, my mistake. :wink:

I’ve just again looked over the ones I downloaded and 06 is there. The missing ones were 13, 16, and 23.

For the 23 one, it’s on GitHub but doesn’t seem to be on data.edinburghopendata.info. That’s probably the one I meant instead of “06” but typo’d due to brain fade or something. :wink:

With the original SQLite database you created, would the project I’m putting time into (dbhub.io) be useful for publishing it? (once our basic features are all in place :innocent:)

Hi @justinclift. No worries :slight_smile:

You are right about number 23 not appearing on the Edinburgh open data portal, and I am not sure why. @ewan_klein do you have any thoughts why number 22 (North Meadow Walk East 1) was added, but 23 (Meadow Walk East 2) was missed out?

Related to dbhub.io, yes, I believe once the platform is ready, we could use it for our db. It is not currently published anywhere and it will be valuable to have it accessible in this format.

-Andreea

Just for the heck of it, imported the CSV files into a database, set the data types manually (ugh), and uploaded it to the development server:

It’s already been helpful in showing up a few things that needed tweaking, but the general layout and responsiveness (on a low end development server) seems ok. :slight_smile:
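For anyone wanting to repeat the CSV import without setting the data types by hand in a GUI, here is a rough sketch using Python's standard `csv` and `sqlite3` modules. The table name, column names, and sample rows are made up for illustration and are not taken from the actual bike count files.

```python
import csv
import io
import sqlite3


def import_csv(conn, table, csv_file, column_types):
    """Create `table` with explicit column types and load the CSV rows into it."""
    reader = csv.DictReader(csv_file)
    columns = ", ".join(
        f'"{name}" {sql_type}' for name, sql_type in column_types.items()
    )
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in column_types)
    # Values arrive as text; SQLite's column affinity converts them on insert.
    rows = ([row[name] for name in column_types] for row in reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to get an uploadable .sqlite file
sample = io.StringIO("date,hour,count\n2016-01-01,8,42\n2016-01-01,9,57\n")
import_csv(
    conn,
    "counter_01",
    sample,
    {"date": "TEXT", "hour": "INTEGER", "count": "INTEGER"},
)
total = conn.execute('SELECT SUM("count") FROM counter_01').fetchone()[0]
print(total)
```

Declaring the types once in a dictionary keeps them consistent across many similar files, which is the tedious part when done manually.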