Datahub and the UX of open data for non-technical researchers

opendata

#1

Hi all!

I’m a newcomer looking for some orientation in this crazy and exciting world of open data. As a coder just dipping my toes into the civic hacking world, I see so much unrealized potential here. It seems that a bit of elbow grease invested in the UX, for the benefit of non-technical folks, could go a long way toward making open data more accessible; at least that’s what I experienced first-hand at a civic tech hackathon I attended yesterday.

As an outsider, I’m interested in your comments (especially where I’m wrong about things).

It appears to me that the discourse around open data is dominated by publishers and academics, with some overlap. The former have a tendency to dump unstructured data on their sites (sometimes CKAN, sometimes custom); the latter have a tendency to philosophize about metadata schemas and ontologies and curation strategies etc. Meanwhile, third-party users quietly muddle through with CKAN APIs or scraping scripts – not a problem for those who know exactly what dataset they need.

However, a huge potential open data audience of researchers and government policy analysts is left out of the discussion – they don’t have the technical background, they don’t want to sift through dozens of city/state/federal/other sites to find data, and even when they do, it’s challenging for them to add the structural info that their visualization tools need to properly graph the numbers. And those are the people we want most to make good use of the data, no?

Think: government workers with a legal or economics background, needing to quickly understand dozens of datasets from different places as they are trying to make evidence-based policy recommendations. Why, in 2018, can’t they open up their BI software, go “File → Import Data → Search on Datahub”, and have 99% of all of the world’s open datasets at their fingertips? Just like I, as a coder, have been able to type npm install [any-javascript-lib-in-the-world], without having to think about the where and how, since 2010? Quantity is key here, not quality, as long as basic structural info has been extracted.

Socrata has made some inroads here. opendatanetwork.com is the only open-data platform I know that approaches being actually useful for casual consumers looking for specific information. But it’s closed-source. It’s US only. It doesn’t allow others to upload anything. No datapackage.jsons, only raw download or API. And it only has structural info because the Socrata platform seems to force publishers to include it.

Do you think it is at all feasible to turn e.g. datahub.io into the open equivalent of opendatanetwork.com, with the help of volunteer civic hackers? I.e. a public registry of datapackage.json files that are semi-automatically generated and that just link to the actual data, wherever it is hosted? So that one day I will be able to go to my terminal and type data pull zimbabwe-harare-highschool-gradrates-2012, or go to Tableau and go Import Data → Search Datahub → NOAA 11404 Hydrographic Survey? Again, it doesn’t have to be “core data” quality; it just has to work.
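To make "semi-automatically generated" a bit more concrete, here is a rough sketch of the idea, not a description of how datahub.io or any existing tool works. It guesses column types from a CSV sample and emits a minimal descriptor that only links to the data. The function names (infer_type, build_descriptor), the example.org URL and the three sample rows are all made up for illustration; the descriptor layout (name, resources, path, schema.fields) follows my reading of the Frictionless Data Package and Table Schema specs.

```python
import csv
import io
import json


def infer_type(values):
    """Guess a Table Schema type ("integer", "number" or "string") from strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the strictest type first and fall back to looser ones.
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def build_descriptor(name, data_url, csv_text, sample_rows=100):
    """Build a minimal descriptor that points at remotely hosted CSV data."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Only look at the first `sample_rows` rows when guessing types.
    rows = [row for _, row in zip(range(sample_rows), reader)]
    fields = []
    for i, column in enumerate(header):
        values = [row[i] for row in rows if i < len(row)]
        fields.append({"name": column, "type": infer_type(values)})
    return {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": data_url,  # link only; the data stays where it is hosted
                "format": "csv",
                "schema": {"fields": fields},
            }
        ],
    }


# Made-up sample rows, purely to exercise the sketch.
sample = (
    "year,school,grad_rate\n"
    "2012,Example High,0.87\n"
    "2012,Sample Secondary,0.91\n"
)
descriptor = build_descriptor(
    "zimbabwe-harare-highschool-gradrates-2012",
    "https://example.org/gradrates-2012.csv",  # hypothetical URL
    sample,
)
print(json.dumps(descriptor, indent=2))
```

A human would still need to review the guessed types and add titles, licenses and sources, which is where the "semi" in semi-automatic comes in.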

I look forward to your input!

Sebastian


#2

Hi @skosch, it’s great to have some fresh eyes on the open data ecosystem and I agree: there’s much to do. I’ll respond to two aspects of your post:

  1. searching across open data portals
  2. visualising data

Searching

http://search.data.gov.au is a beta in Australia that provides search across open data portals for all levels of Government. It is powered by https://github.com/TerriaJS/magda.

The ODI are doing some research into open data search https://theodi.org/article/exploring-human-data-interaction-challenges-in-data-discovery/

There’s also https://www.google.com/publicdata/directory

I agree open data search is a work in progress.

Visualisation

data.gov allows people to open data in tools directly from the portal https://www.data.gov/meta/open-apps

https://figure.nz in New Zealand creates ready-to-use visualisations

https://frictionlessdata.io/specs/views/ enables visualisations of data packages e.g. https://datahub.io/examples/vega-views-tutorial-topojson

Summary

Much is being done but it will never be finished and can always be better.

Governments, volunteers and other organisations all do their bit to get us a little closer to the frictionless open data ecosystem that we all wish were here today. Over time, we’ll collectively create something awesome. :smile:

Have a think about how you may like to contribute. There are lots of open data projects to choose from and you’re very welcome to contribute to any of the Open Knowledge ones.

I’ve found the community very welcoming, happy to help learners, and I get lots of satisfaction from contributing my little bit.


#3

Hi Stephen! Thanks so much for taking the time to put all those links together for me.

TerriaJS/Magda looks fantastic, and I wouldn’t have found it otherwise. The ODI article hits the nail on the head though: there is a “long tail” of data that’s not accessible or discoverable.

It seems to me that well-intentioned government efforts à la “let’s build a website for searching [country/province/city] datasets” are actually making things worse, because end-user tooling won’t improve unless there is a canonical way to get your data (or at least the structure/metainfo/data link) from a centralized, source-agnostic place that has critical mass. The frictionless data specs (datapackage.json & friends) seem like the perfect starting point to make this happen, in combination with an unaffiliated, open service like datahub (or another one, but I like the name :slight_smile:).

As for visualization: from what I could gather (admittedly, my sample size is small) most analysts and researchers are comfortable with industrial BI tools, whether commercial or open source, so that wheel really doesn’t need reinventing. The “open with apps” thing is neat, though.

I’m wondering what I can do to bring this closer to reality – not as an alternative to what’s out there, but on top of it. It’s a Herculean task, obviously. I’m hoping to discuss this with our Civic Tech folks here in Toronto this week. Meanwhile I’m grateful for anyone else’s insights or links :+1:

Thanks again!