Discussion on data provenance for journalists and NGOs

pudo · August 30, 2014, 2:21pm

One of the big concerns in the use of open data in journalism and activism is properly sourcing the data: can you tell exactly where each field of a highly integrated record comes from? Can you filter out results from unreliable sources which would not stand up in court?

I just wanted to flag the discussions on that topic that we’ve been having on uf6/design and get people’s comments on what other standards and technologies exist that could facilitate this.

Has field-level attribution ever come up around the data package spec discussions?

rufuspollock · September 11, 2014, 8:34am

Provenance definitely comes up frequently, especially in scraping exercises. There has been some thought around this in (Tabular) Data Package, and I’ve just opened an issue on per-cell or per-row annotation with a straw-man suggestion.

pudo · September 11, 2014, 9:05am

That seems like a workable path, but I somehow dislike the somewhat arbitrary split between data and metadata: knowing where a given fact comes from is data as well (and it shouldn’t be easy to change the data but not the metadata…). If you were to create a fully-sourced table, your metadata file is now multiple times as large as your actual data file.

Perhaps it makes more sense to think up a CSV format that holds statements, i.e. to push the problem up a layer in the stack. (And yes, I don’t know what the difference between CSV statements and nQuads is… really been infected).
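To make the "CSV that holds statements" idea concrete, here is a minimal sketch in Python. The column names (`entity`, `property`, `value`, `source`), the sample rows, and the `build_records` helper are all assumptions made up for illustration; nothing here comes from an agreed spec. Each row is one fact carrying its own source, so filtering out unreliable sources is just a row filter applied before the records are assembled.

```python
import csv
import io
from collections import defaultdict

# Hypothetical statement-style CSV: one row per fact, with its source alongside.
STATEMENTS_CSV = """entity,property,value,source
acme-ltd,name,Acme Ltd,https://example.org/registry
acme-ltd,director,Jane Doe,https://example.org/registry
acme-ltd,director,John Roe,https://example.org/forum-post
"""


def build_records(csv_text, trusted_sources=None):
    """Fold statements into {entity: {property: [(value, source), ...]}}.

    If trusted_sources is given, statements from any other source are skipped.
    """
    records = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        if trusted_sources is not None and row["source"] not in trusted_sources:
            continue
        records[row["entity"]][row["property"]].append((row["value"], row["source"]))
    return records


# Every statement, regardless of where it came from.
all_records = build_records(STATEMENTS_CSV)

# Only statements backed by a source we would be willing to defend.
vetted_records = build_records(
    STATEMENTS_CSV, trusted_sources={"https://example.org/registry"}
)
```

Because the source travels in the same row as the value, it cannot drift out of sync with the data the way a separate metadata file might.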

rufuspollock · September 11, 2014, 9:07am

I tend to agree with you: in the questions section of that issue I suggest that we make annotations / notes into a new “data” resource/table.

rufuspollock · September 14, 2014, 11:03am

I note Stefan Urbanek has made some good suggestions in the issue on GitHub and is definitely +1 on keeping the provenance / notes as a separate data file. I think the arguments for that are quite compelling, and I will try to create a straw-man proposal for how to layer this into e.g. Tabular Data Packages …

@pudo any thoughts on the above and especially the suggestion of a separate table / CSV file to record the provenance info?
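As a rough illustration of the separate-table option, the sketch below keeps provenance in its own CSV and looks it up per cell. The layout is an assumption for discussion only, not the straw-man from the issue: a hypothetical `row_id,field,source` table in which an empty `field` means the note applies to the whole row. The `load_provenance` and `source_of` helpers and the sample data are likewise invented.

```python
import csv
import io

# The actual data table.
DATA_CSV = """id,name,budget
1,Ministry of Health,1200
2,Ministry of Education,950
"""

# Hypothetical provenance table: an empty "field" annotates the whole row.
PROVENANCE_CSV = """row_id,field,source
1,,https://example.org/budget-2014.pdf
2,budget,https://example.org/press-release
"""


def load_provenance(csv_text):
    """Index provenance notes by (row_id, field)."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        index[(row["row_id"], row["field"])] = row["source"]
    return index


def source_of(index, row_id, field):
    """Return the cell-level source if present, else the row-level one, else None."""
    return index.get((row_id, field)) or index.get((row_id, ""))


index = load_provenance(PROVENANCE_CSV)
for row in csv.DictReader(io.StringIO(DATA_CSV)):
    for field in ("name", "budget"):
        print(row["id"], field, row[field], source_of(index, row["id"], field))
```

Row-level notes keep the provenance file small when a whole record comes from one document, while per-cell notes remain available for highly integrated records.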
