Tracking Data Issues: what's the current state of the art?

ewan_klein · February 28, 2017, 10:59am

Has anyone done further work on the topic discussed in the blog post “Tracking Issues with Data the Simple Way” by @rufuspollock, for the case where the data is not hosted on GitHub?

I’d be particularly interested to hear about approaches successfully adopted for public sector open data portals. This article suggests that it is still an important challenge:

These advances aside, Kaehny said the “big criticism” he has left for the city’s progress report revolves around what it doesn’t include: any mention of how the department plans to address issues with the quality of some data sets.

From “inconsistent geospatial data” to “bad metadata” to erroneous data duplication, Kaehny sees issues that the city could address if it had a better process for accepting feedback from the public and reacting to those comments.

pwalsh · February 28, 2017, 11:16am

Hi there @ewan_klein

Yes, we have done lots of work on Data Quality, mostly via our Frictionless Data specifications and related tools.

In particular, we have a Python library called goodtables that produces detailed, granular quality reports from tabular data, or collections of tabular data (e.g.: we can scan a public CKAN API and build quality reports for every tabular data file published on the instance).

We are about to launch a full continuous data validation service called goodtables.io that makes this trivially easy to do.

The Data Quality Spec has been extracted out of our work on goodtables to enable reuse, and we’ve got a beta version of a Data Quality Dashboard that can display quality results for large sets of data (think: visual interaction with quality results for each tabular data file on a CKAN instance, or any other public data collection).

ewan_klein · February 28, 2017, 11:47am

Hi @pwalsh,

Thanks for the super-quick response! The Data Quality Dashboard looks great, so thanks for that link. However, I’m particularly interested in cases where data may be incorrect for reasons other than being syntactically ill-formed or invalid relative to a schema; for example, a geo-coding could be inaccurate, or a string could contain typos. Some of these errors may be picked up by people with ‘domain knowledge’ but with little or no technical expertise, so easy-to-use reporting mechanisms (plus support for tracking the data steward’s response) would be really helpful.

danfowler · March 2, 2017, 6:59am

@ewan_klein I wonder to what extent your issues could be addressed by writing custom checks in Good Tables:

cc: @roll

roll · March 2, 2017, 7:56am

@danfowler
Yes, custom checks seem pretty relevant. Here is a toy example that reports an error if there is a Unicode character in the data:

github.com/frictionlessdata/goodtables-py/blob/main/examples/custom_check.py

from pprint import pprint
from goodtables import Inspector, check

@check('unicode-found', type='structure', context='body', after='duplicate-row')
def unicode_found(errors, columns, row_number, state=None):
    for column in columns:
        if len(column) == 4:
            if column['value'] == '中国人':
                message = 'Row {row_number} has unicode in column {column_number}'
                message = message.format(
                    row_number=row_number,
                    column_number=column['column-number'])
                errors.append({
                    'code': 'unicode-found',
                    'message': message,
                    'row-number': row_number,
                    'column-number': column['column-number'],
                })

(This file has been truncated.)

But of course it could be any other checks with custom error codes related to your domain.
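To illustrate what a domain-specific check might look like, here is a minimal sketch in plain Python that does not use the goodtables API at all. It flags rows whose coordinates fall outside an expected bounding box and reports them as error dictionaries shaped like the one in the example above. The function name, the `lat`/`lon` field names, the error code, and the bounding box values are all assumptions made for illustration.

```python
def out_of_bounds_errors(rows, bounds, lat_field="lat", lon_field="lon"):
    """Return one error dict per row whose coordinates lie outside bounds.

    rows is an iterable of dicts; bounds is (min_lat, max_lat, min_lon, max_lon).
    Row numbers start at 2, assuming row 1 of the file is the header.
    """
    min_lat, max_lat, min_lon, max_lon = bounds
    errors = []
    for row_number, row in enumerate(rows, start=2):
        try:
            lat = float(row[lat_field])
            lon = float(row[lon_field])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric coordinates are left to other checks.
            continue
        inside = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
        if not inside:
            errors.append({
                "code": "coordinates-out-of-bounds",  # hypothetical error code
                "message": "Row {} has coordinates ({}, {}) outside the "
                           "expected area".format(row_number, lat, lon),
                "row-number": row_number,
            })
    return errors


# Rough, illustrative bounding box around Edinburgh (not authoritative).
EDINBURGH_BOUNDS = (55.8, 56.1, -3.5, -3.0)
sample_rows = [
    {"site": "A", "lat": "55.95", "lon": "-3.19"},
    {"site": "B", "lat": "5.595", "lon": "-3.19"},  # misplaced decimal point
]
sample_errors = out_of_bounds_errors(sample_rows, EDINBURGH_BOUNDS)
```

A check like this only catches gross geocoding errors; a point that is wrong but still inside the box would need a reviewer with domain knowledge.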

@ewan_klein
If you’re interested, I’m happy to help here, or you could create an issue (Issues · frictionlessdata/goodtables-py · GitHub) with a description of the kind of custom check you’re looking for.

ewan_klein · March 14, 2017, 10:35am

@roll and @danfowler, thanks both for the suggestions. I can see how a custom check would give you the basic functionality but I don’t fully grasp how this would be integrated within a user-facing scenario on a platform like CKAN. Do you have a rough idea of what the workflow / dataflow would be?

danfowler · March 14, 2017, 4:40pm

@ewan_klein yes, this is a good question. I understand what you’re saying now. For a given registry, there should be an easy, user-facing way to flag data “issues”. Good Tables is good for automatically flagging a set of issues with data that the publisher could foresee.

@rufuspollock does the forthcoming Data Package Registry do anything about this?

rufuspollock · March 18, 2017, 6:12am

One way to surface the results to users in CKAN would be to integrate with the CKAN issues extension and automatically open issues (this would also allow domain experts to inspect the data directly and flag issues).
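As a rough sketch of the "automatically open issues" step, the snippet below groups validation errors by error code and builds one issue draft per code. It deliberately stops short of calling CKAN: the `title` and `body` keys are a hypothetical payload shape, not the issues extension's actual API, and the error dictionaries are assumed to carry `code` and `message` keys as in the custom check example earlier in the thread.

```python
from collections import defaultdict


def draft_issues(errors, resource_name):
    """Group error dicts by code and return one issue draft per code."""
    grouped = defaultdict(list)
    for error in errors:
        grouped[error["code"]].append(error)

    drafts = []
    for code in sorted(grouped):
        items = grouped[code]
        drafts.append({
            # Hypothetical payload fields; adapt to the real issue tracker.
            "title": "{}: {} ({})".format(resource_name, code, len(items)),
            "body": "\n".join("- " + item["message"] for item in items),
        })
    return drafts


example_errors = [
    {"code": "unicode-found", "message": "Row 2 has unicode in column 1"},
    {"code": "duplicate-row", "message": "Row 5 duplicates row 4"},
    {"code": "unicode-found", "message": "Row 9 has unicode in column 3"},
]
example_drafts = draft_issues(example_errors, "counts.csv")
```

Grouping by code keeps the number of opened issues small, so a data steward sees one issue per kind of problem rather than one per affected row.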

danfowler · April 18, 2017, 8:51am

@ewan_klein I also became aware of DBHub (@justinclift), aimed at managing SQLite databases in the cloud, which does aim to have an “issues” section:

justinclift · April 18, 2017, 2:52pm

Thanks @danfowler.

@ewan_klein This is something we’re hoping to address, if not completely solve, through a new data collaboration platform being worked on. As Dan mentioned, it’s called “DBHub.io”.

We’re still getting the basic bits in place, and the “Issues” section, called “Discussions”, is just a placeholder for the next few weeks.

The concept for the platform is fairly simple. It’s a way to share data sets using the same model as Git/GitHub, e.g. forks, merging, plus social aspects.

There’s a development server online that generally runs the latest code, and you’re quite welcome to play with that… and even attempt busting it if you want… as long as you file bug reports if you manage to.

https://dev1.dbhub.io/justinclift/DB4S%20download%20stats.sqlite

Obviously we’re hoping the platform gets good traction and works out well. We do have one potential advantage, in that it’ll be the “Cloud” component for (i.e. integrated with) a popular SQLite GUI, which generally gets a shade over 1 million downloads for each release.

Fingers crossed, etc.

ewan_klein · April 19, 2017, 7:20am

Hi @justinclift (and thanks for the pointer, @danfowler)

DBHub.io looks really interesting. As it happens, one of my MSc students, Andreea Pascu, did a lot of work last year on organising and cleaning off-road bike count data from City of Edinburgh Council (her code is here), and her canonical version of the data is in a SQLite DB. There are CSV dumps from her DB published as open data with help from @sally_kerr, but we didn’t find an obvious way of publishing the source DB. So we could upload this to DBHub.io, right?

I think the idea of focussing on a particular use-case, as you have done, is an excellent one: flesh out the key ideas for versioning and issues first with an interested community, and then see whether the framework can be generalised / ported etc.

One question: Is it possible in DBHub.io to associate a specific license with an uploaded DB?

justinclift · April 19, 2017, 10:54am

Definitely! With a caveat though.

The caveat is that the only server we have online at the moment is the “dev1” VM. We wipe the data on there and generally muck around with it (debugging, etc). So, that’d be a very err… temporary place to put data.

When the platform codebase is a bit more developed, we’ll start putting the first real “production” servers online which should (!) be a suitable long term home.

As for licenses, that will be some time in the next few days (or maybe next week). I need to figure out which licenses to display by default, have some way of indicating deprecated vs recommended, and so on. I’ve already started adding some general server side code for this, but it’s not visible through the UI. @danfowler also pointed to a license discussion on this forum which is likely relevant and which I still need to read through.

There’s also a GitHub Issue (opened yesterday) about the granularity of licenses DBHub.io will need:

github.com/sqlitebrowser/dbhub.io

Per version license info

opened 05:34PM - 18 Apr 17 UTC

closed 01:16PM - 01 Jul 17 UTC

justinclift

enhancement

Just realised we'll probably need to have the license information be "per version" instead of "per database". For example, let's say someone publishes 10 versions of a data set over time using licenseA. If they change the license to licenseB for the 11th version, that needs to be just for that one specific version (onwards) and not suddenly be applied to all previous versions. Probably a good thing I haven't written much of the license handling code yet, although it's pretty close to getting focused on now.

Constructive input/ideas/etc there is welcome.
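The "per version" idea in the issue quoted above can be sketched as a small lookup structure: each license change is recorded against the version where it starts, and a lookup returns the most recent change at or before the requested version. This is only an illustration of the concept, not how DBHub.io implements it; the class and method names are invented, and versions are assumed to be plain integers.

```python
from bisect import bisect_right


class LicenseHistory:
    """Track which license applies from a given version onwards."""

    def __init__(self):
        self._versions = []  # sorted versions at which the license changes
        self._licenses = []  # license name starting at the matching version

    def set_from(self, version, license_name):
        """Record that license_name applies from version onwards."""
        idx = bisect_right(self._versions, version)
        if idx and self._versions[idx - 1] == version:
            # A change is already recorded at this version; overwrite it.
            self._licenses[idx - 1] = license_name
        else:
            self._versions.insert(idx, version)
            self._licenses.insert(idx, license_name)

    def license_for(self, version):
        """Return the license in force at version, or None if none is set."""
        idx = bisect_right(self._versions, version)
        return self._licenses[idx - 1] if idx else None


history = LicenseHistory()
history.set_from(1, "licenseA")
history.set_from(11, "licenseB")
```

With this history, versions 1 to 10 resolve to licenseA and version 11 onwards resolves to licenseB, which matches the scenario described in the issue.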

justinclift · April 28, 2017, 5:23pm

@ewan_klein Apologies, we’ve not added the license pieces yet. I’ve been putting time into making the webUI easier to use and more useful in general (eg added pagination, sorting, further cache usage). License pieces will probably be added early next week now.

With Andreea Pascu’s canonical SQLite DB, any idea how big it is? I’ve looked quickly at the CSV dumps and they didn’t appear too huge. She’s very much welcome to upload it to our dev1 server (please?), as it’ll give us further sample data to test things against. We can make sure we don’t wipe it either.

Licensing-wise, we’ve been manually adding that info to the Description area of each page for the databases we’ve uploaded. The Description area supports Markdown, and is changeable through the “Settings” page for each database.

eg: https://dev1.dbhub.io/justinclift/Assembly%20Election%202017.sqlite

justinclift · April 28, 2017, 5:37pm

@ewan_klein As an observation with the CSV dumps, they’re numbered sequentially from 01 to 48.

Numbers 06, 13, and 16 aren’t present though. Is that expected?
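A gap check like this is easy to script. The sketch below assumes each file name begins with its counter number; the example file names are made up, since the real naming scheme of the CSV dumps is not shown here.

```python
import re


def missing_numbers(filenames, first, last):
    """Return the sorted numbers in [first, last] that no file name starts with."""
    present = set()
    for name in filenames:
        match = re.match(r"(\d+)", name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(first, last + 1) if n not in present]


# Hypothetical file names for illustration only.
example_files = ["01_site.csv", "02_site.csv", "04_site.csv"]
gaps = missing_numbers(example_files, 1, 5)
```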

andreeaPascu · May 1, 2017, 12:27pm

Hi @justinclift,

I’m Prof. Ewan’s MSc student who worked on the Edinburgh Bike Counts project.

Related to your question, I had a look at the csv dumps (here are all of them: /outputFiles): counter ‘06’ seems to be there, however we had no data for ‘13’ and ‘16’. Those were the examples of (more recently installed) counters, for which we didn’t have any data collected - therefore there was no file or analysis to produce. Hope this answers your question.

-Andreea

justinclift · May 1, 2017, 2:24pm

Thanks @andreeaPascu, you’re right. Sorry, my mistake.

I’ve just again looked over the ones I downloaded and 06 is there. The missing ones were 13, 16, and 23.

For the 23 one, it’s on GitHub but doesn’t seem to be on data.edinburghopendata.info. That’s probably the one I meant instead of “06” but typo’d due to brain fade or something.

With the original SQLite database you created, would the project I’m putting time into (dbhub.io) be useful for publishing it? (once our basic features are all in place)

andreeaPascu · May 2, 2017, 8:25am

Hi @justinclift. No worries

You are right about number 23 not appearing on the Edinburgh open data portal, and I am not sure why. @ewan_klein do you have any thoughts why number 22 (North Meadow Walk East 1) was added, but 23 (Meadow Walk East 2) was missed out?

Related to dbhub.io, yes, I believe once the platform is ready, we could use it for our db. It is not currently published anywhere and it will be valuable to have it accessible in this format.

-Andreea

justinclift · May 3, 2017, 6:23am

Just for the heck of it, I imported the CSV files into a database, set the data types manually (ugh), and uploaded it to the development server:

https://dev1.dbhub.io/justinclift/Bike%20Counts%20In%20Edinburgh.sqlite

It’s already been helpful in showing up a few things that needed tweaking, but the general layout and responsiveness (on a low end development server) seems ok.
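For anyone wanting to script the CSV-to-SQLite step rather than set types by hand, here is a minimal sketch using only Python's standard `csv` and `sqlite3` modules. It is not the process used for the upload above; the table name, column names, and sample rows are invented for illustration.

```python
import csv
import io
import sqlite3


def import_csv(conn, table, csv_file, column_types):
    """Create table with explicit column types and load rows from csv_file.

    column_types maps CSV header names to SQLite type names, in column order.
    Table and column names are quoted but must come from a trusted source.
    """
    columns = ", ".join(
        '"{}" {}'.format(name, sql_type)
        for name, sql_type in column_types.items()
    )
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))

    placeholders = ", ".join("?" for _ in column_types)
    reader = csv.DictReader(csv_file)
    rows = ([row[name] for name in column_types] for row in reader)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows
    )
    conn.commit()


# Invented sample data standing in for one of the CSV dumps.
sample = io.StringIO("date,count\n2016-01-01,120\n2016-01-02,95\n")
conn = sqlite3.connect(":memory:")
import_csv(conn, "bike_counts", sample, {"date": "TEXT", "count": "INTEGER"})
total = conn.execute('SELECT SUM("count") FROM bike_counts').fetchone()[0]
```

Declaring `count` as INTEGER means SQLite's type affinity stores the numeric-looking CSV strings as integers, so aggregates such as `SUM` behave as expected.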
