Discussion on data provenance for journalists and NGOs


One of the big concerns in the use of open data in journalism and activism is properly sourcing the data: can you tell exactly where each field of a highly integrated record comes from? Can you filter out results from unreliable sources which would not stand up in court?

I just wanted to flag the discussions on that topic that we’ve been having on uf6/design and get people’s comments on what other standards and technologies exist that could facilitate this.

Has field-level attribution ever come up around the data package spec discussions?


Provenance definitely comes up frequently, especially in scraping exercises. There has been some thought around this in (Tabular) Data Package, and I’ve just opened an issue on per-cell or per-row annotation with a straw-man suggestion.


That seems like a workable path, but I somehow dislike the somewhat arbitrary split between data and metadata: knowing where a given fact comes from is data as well (and it shouldn’t be easy to change the data but not the metadata…). If you were to create a fully sourced table, your metadata file would be several times as large as your actual data file.

Perhaps it makes more sense to think up a CSV format that holds statements, i.e. to push the problem up a layer in the stack. (And yes, I don’t know what the difference between CSV statements and nQuads is… really been infected).
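
To make the "CSV that holds statements" idea a bit more concrete, here is a rough Python sketch. The column names (entity_id, field, value, source), the example.org URLs and the build_records helper are all made up for illustration and are not part of any spec. Each row is one statement about one field, so the source travels with the fact, and filtering out sources you don't trust is just a row filter.

```python
import csv
import io

# Hypothetical statement table: one row per fact, with its source alongside.
# The column names and URLs are assumptions made for this sketch.
STATEMENTS_CSV = """entity_id,field,value,source
c1,name,Acme Ltd,https://example.org/registry
c1,country,GB,https://example.org/registry
c1,director,J. Doe,https://example.org/leak
"""


def build_records(statements_csv, trusted_sources=None):
    """Fold statement rows into one record per entity.

    If trusted_sources is given, statements from any other source are dropped.
    """
    records = {}
    for row in csv.DictReader(io.StringIO(statements_csv)):
        if trusted_sources is not None and row["source"] not in trusted_sources:
            continue
        entity = records.setdefault(row["entity_id"], {})
        # Keep the source next to the value so every field stays attributable.
        entity[row["field"]] = {"value": row["value"], "source": row["source"]}
    return records


all_records = build_records(STATEMENTS_CSV)
court_safe = build_records(
    STATEMENTS_CSV, trusted_sources={"https://example.org/registry"}
)
```

Here court_safe keeps the name and country of c1 but drops the director, because that statement only comes from the source that was left out of trusted_sources.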


I tend to agree with you. In the questions section of that issue I suggest making annotations / notes into a new “data” resource/table.


I note Stefan Urbanek has made some good suggestions in the issue on GitHub and is definitely +1 on keeping the provenance / notes as a separate data file. I think the arguments for that are quite compelling, and I will try to create a straw-man proposal for how to layer this into e.g. Tabular Data Packages …

@pudo any thoughts on the above and especially the suggestion of a separate table / CSV file to record the provenance info?
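
As a starting point for discussion, here is a minimal Python sketch of what a separate provenance table could look like. This is only an assumption about the layout, not the straw-man proposal itself: the provenance file has hypothetical columns row_id, column and source, keyed against an id column in the data file, and the file contents and helper names are invented for the example.

```python
import csv
import io

# Hypothetical data table; the "id" column is assumed to identify each row.
DATA_CSV = """id,name,budget
1,Ministry of Health,1200
2,Ministry of Education,950
"""

# Hypothetical provenance table kept as a separate CSV: one row per sourced cell.
PROVENANCE_CSV = """row_id,column,source
1,name,https://example.org/gazette
1,budget,https://example.org/budget.pdf
2,name,https://example.org/gazette
"""


def load_provenance(provenance_csv):
    """Index provenance rows by (row_id, column) for quick lookup."""
    reader = csv.DictReader(io.StringIO(provenance_csv))
    return {(r["row_id"], r["column"]): r["source"] for r in reader}


def unsourced_cells(data_csv, provenance, key="id"):
    """List (row_id, column) pairs in the data that have no provenance entry."""
    missing = []
    for row in csv.DictReader(io.StringIO(data_csv)):
        for column in row:
            if column == key:
                continue  # the key column identifies the row, it is not a fact
            if (row[key], column) not in provenance:
                missing.append((row[key], column))
    return missing


provenance = load_provenance(PROVENANCE_CSV)
missing = unsourced_cells(DATA_CSV, provenance)
```

With this layout the data file stays untouched, and a check like unsourced_cells shows which cells would not stand up to a sourcing question (here only the budget of row 2).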