Discussion on data provenance for journalists and NGOs


#1

One of the big concerns in the use of open data in journalism and activism is properly sourcing the data: can you tell exactly where each field of a highly integrated record comes from? Can you filter out results from unreliable sources which would not stand up in court?

I just wanted to flag the discussions on that topic that we’ve been having on uf6/design and get people’s comments on what other standards and technologies exist that could facilitate this.

Has field-level attribution ever come up around the data package spec discussions?


#2

Provenance definitely comes up frequently, especially in scraping exercises. There has been some thought around this in (Tabular) Data Package, and I’ve just opened an issue on per-cell or per-row annotation with a straw-man suggestion.


#3

That seems like a workable path, but I somehow dislike the somewhat arbitrary split between data and metadata: knowing where a given fact comes from is data as well (and it shouldn’t be easy to change the data but not the metadata…). If you were to create a fully-sourced table, your metadata file would be several times as large as your actual data file.

Perhaps it makes more sense to think up a CSV format that holds statements, i.e. to push the problem up a layer in the stack. (And yes, I don’t know what the difference between CSV statements and N-Quads is… really been infected).
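
To make the "CSV of statements" idea concrete, here is a rough sketch in Python. The four-column layout (entity, property, value, source), the sample rows and the helper names are all assumptions made up for illustration; nothing here comes from an existing spec. Each row is one fact with its own source, so a record can be rebuilt from trusted sources only.

```python
import csv
import io

# Hypothetical statement-per-row layout: every fact carries its own source.
# Column names and sample values are invented for this sketch.
STATEMENTS_CSV = """entity,property,value,source
acme-ltd,name,Acme Ltd,company-register
acme-ltd,director,Jane Doe,company-register
acme-ltd,director,John Roe,anonymous-tip
acme-ltd,country,GB,scraped-website
"""


def load_statements(text):
    """Parse the CSV text into a list of dicts, one per statement."""
    return list(csv.DictReader(io.StringIO(text)))


def build_records(statements, trusted_sources):
    """Assemble per-entity records, keeping only statements from trusted sources.

    Each property maps to a list of (value, source) pairs, so the origin
    of every field stays attached to the value.
    """
    records = {}
    for st in statements:
        if st["source"] not in trusted_sources:
            continue
        props = records.setdefault(st["entity"], {})
        props.setdefault(st["property"], []).append((st["value"], st["source"]))
    return records


statements = load_statements(STATEMENTS_CSV)
records = build_records(statements, trusted_sources={"company-register"})
print(records)
```

The trade-off is that the file is no longer a plain table of records: it gets longer and consumers have to pivot it back, but data and provenance can no longer drift apart.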


#4

I tend to agree with you - in the questions section in that issue I suggest making annotations / notes into a new “data” resource/table.


#5

I note Stefan Urbanek has made some good suggestions in the issue on GitHub and is definitely +1 on keeping the provenance / notes as a separate data file. I think the arguments for that are quite compelling and will try to create a straw-man proposal for how to layer this into e.g. Tabular Data Packages …
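
For illustration only (this is not the straw-man from the issue), a separate provenance table could be as simple as a second CSV keyed on row id and field name. The file contents, the column names `row_id`, `field` and `source`, and both helper functions below are assumptions invented for this sketch:

```python
import csv
import io

# Hypothetical data table; "id" is assumed to be a stable row identifier.
DATA_CSV = """id,name,director
1,Acme Ltd,Jane Doe
2,Globex,John Roe
"""

# Hypothetical provenance table: one row per annotated cell.
PROVENANCE_CSV = """row_id,field,source
1,name,company-register
1,director,company-register
2,name,scraped-website
"""


def index_provenance(text):
    """Map (row_id, field) to the recorded source for that cell."""
    return {
        (row["row_id"], row["field"]): row["source"]
        for row in csv.DictReader(io.StringIO(text))
    }


def annotate(data_text, provenance):
    """Return {row id: {field: (value, source or None)}} for every data cell."""
    annotated = {}
    for row in csv.DictReader(io.StringIO(data_text)):
        annotated[row["id"]] = {
            field: (value, provenance.get((row["id"], field)))
            for field, value in row.items()
            if field != "id"
        }
    return annotated


provenance = index_provenance(PROVENANCE_CSV)
annotated = annotate(DATA_CSV, provenance)
print(annotated["2"])
```

Cells with no entry in the provenance table come back with a source of `None`, which makes unsourced fields easy to spot. The sketch relies on stable row identifiers; without them the two files can silently fall out of step.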

@pudo any thoughts on the above and especially the suggestion of a separate table / CSV file to record the provenance info?