One of the big concerns in the use of open data in journalism and activism is properly sourcing the data: can you tell exactly where each field of a highly integrated record comes from? Can you filter out results from unreliable sources which would not stand up in court?
I just wanted to flag the discussions on that topic that we’ve been having on uf6/design and get people’s comments on what other standards and technologies exist that could facilitate this.
Has field-level attribution ever come up around the data package spec discussions?
That seems like a workable path, but I somehow dislike the somewhat arbitrary split between data and metadata: knowing where a given fact comes from is data as well (and it shouldn’t be easy to change the data but not the metadata…). If you were to create a fully-sourced table, your metadata file is now multiple times as large as your actual data file.
Perhaps it makes more sense to think up a CSV format that holds statements, i.e. to push the problem up a layer in the stack. (And yes, I don’t know what the difference between CSV statements and N-Quads is… I’ve really been infected).
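To make the "CSV of statements" idea a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the column names (`entity`, `property`, `value`, `source`), the sample rows, and the helper functions are hypothetical and not part of any existing spec. It only shows that once each row is a single sourced statement, filtering out unreliable sources becomes a plain row filter.

```python
import csv
import io

# Hypothetical statement-per-row layout; column names and rows are made up.
STATEMENTS_CSV = """entity,property,value,source
acme,name,Acme Ltd,company-register
acme,director,Jane Doe,company-register
acme,director,John Roe,anonymous-tip
"""


def load_statements(text):
    """Parse the CSV text into a list of dict rows, one per statement."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_by_source(statements, trusted):
    """Keep only statements whose source is in the trusted set."""
    return [s for s in statements if s["source"] in trusted]


def to_records(statements):
    """Fold statements into one record per entity, keeping each value's source."""
    records = {}
    for s in statements:
        props = records.setdefault(s["entity"], {})
        props.setdefault(s["property"], []).append((s["value"], s["source"]))
    return records


statements = load_statements(STATEMENTS_CSV)
trusted_only = filter_by_source(statements, {"company-register"})
records = to_records(trusted_only)
```

The trade-off is the one already mentioned: the integrated record has to be rebuilt from statements, which is roughly what a quad store does.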
I note Stefan Urbanek has made some good suggestions in the issue on GitHub and is definitely +1 on keeping the provenance / notes as a separate data file. I think the arguments for that are quite compelling and will try and create a straw-man proposal for how to layer this into e.g. Tabular Data Packages …
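For discussion only, a minimal Python sketch of what "provenance as a separate data file" could look like. This is not the straw-man proposal itself and not anything from the Data Package specs: the file layout, the `row_id` / `field` / `source` columns, and the helper names are all hypothetical. It assumes the main table has a primary key (`id` here) that the provenance table can point at.

```python
import csv
import io

# Main data table (hypothetical sample).
DATA_CSV = """id,name,director
acme,Acme Ltd,Jane Doe
"""

# Separate provenance table: one row per (row, field) pair. Layout is assumed.
PROVENANCE_CSV = """row_id,field,source
acme,name,company-register
acme,director,anonymous-tip
"""


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def index_provenance(rows):
    """Map (row_id, field) -> source for quick lookup."""
    return {(r["row_id"], r["field"]): r["source"] for r in rows}


def annotate(data_rows, provenance, key="id"):
    """Pair every non-key cell with its recorded source (None if unrecorded)."""
    annotated = []
    for row in data_rows:
        cells = {
            field: (value, provenance.get((row[key], field)))
            for field, value in row.items()
            if field != key
        }
        annotated.append({key: row[key], "cells": cells})
    return annotated


annotated = annotate(read_rows(DATA_CSV), index_provenance(read_rows(PROVENANCE_CSV)))
```

With this layout the data file stays a normal table, and a fully-sourced table costs one provenance row per cell, which is the size concern raised earlier in the thread.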
@pudo any thoughts on the above and especially the suggestion of a separate table / CSV file to record the provenance info?