What do we mean by "human effort" and "make data usable"?

Wagner_Faria_de_Oliv · December 1, 2016, 2:01pm

Hello all!
As we are a team of researchers working on the Brazilian index, we came up with different reasoning strategies to answer the last question of the survey (“Please provide an assessment of how easily the data are usable without human effort”).

First reasoning
“I think the data is usable because the spreadsheet is easy to work with, which makes my analysis easier, so I put 3.”
“Data in API format requires programming skills, so, for me (not a programmer), I have no idea how to process this data to work with it, so I put 1”.

Second reasoning
“Data in API format is the most machine-processable format: it lets a web service access all the data at once, with no pre-established queries, making the data usable in more ways, so I put 3”.
“Data in spreadsheet format is usable, but not as much as the API, so I put 2”.

Which of these lines of reasoning should I follow? Or is it OK to let different people pursue different strategies?
@tlacoyodefrijol @Mor

Thanks in advance,
Wagner

herrmann · December 6, 2016, 7:27pm

If I may offer yet another line of reasoning, I think that data consumers are expected to have a minimum of data literacy.

In that sense, the “human effort” would be the work of cleaning up data that has quality problems, e.g., removing non-data headers from a spreadsheet, converting dates that are in a non-standard format, or fixing incorrectly marked character encoding. That is, the usual data cleaning jobs. Having to scrape websites or PDFs would also qualify as “human effort”, because of the work of setting up the scrapers. Missing or inadequate documentation (e.g. an API manual, or a data dictionary explaining the meaning of fields in a spreadsheet) also makes data less usable, since it takes effort to figure things out.
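
As a rough illustration of the cleaning jobs listed above, here is a minimal Python sketch using only the standard library. The sample data, the column names, the `clean` function and its parameters are all made up for the example; it assumes a semicolon-separated file that is really Latin-1 encoded, has two title lines above the real header, and uses day/month/year dates.

```python
import csv
from datetime import datetime

# Hypothetical raw export: Latin-1 bytes (often mislabeled as UTF-8),
# with two non-data title lines above the real header row.
RAW_BYTES = (
    "Relatório anual\n"
    "Fonte: exemplo\n"
    "municipio;data;valor\n"
    "São Paulo;01/12/2016;10\n"
    "Brasília;15/11/2016;7\n"
).encode("latin-1")


def clean(raw, encoding="latin-1", skip_rows=2,
          date_column="data", date_format="%d/%m/%Y"):
    """Decode, drop non-data header lines, and normalize dates to ISO 8601."""
    # 1. Fix the character encoding by decoding with the real one.
    text = raw.decode(encoding)
    # 2. Remove the non-data lines that precede the header row.
    lines = text.splitlines()[skip_rows:]
    # 3. Parse the remaining lines; the first one is the header.
    reader = csv.DictReader(lines, delimiter=";")
    rows = []
    for row in reader:
        # 4. Convert the non-standard date into YYYY-MM-DD.
        parsed = datetime.strptime(row[date_column], date_format)
        row[date_column] = parsed.date().isoformat()
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in clean(RAW_BYTES):
        print(row)
```

Every parameter here (the encoding, the number of lines to skip, the date format) is something a person has to discover by inspecting the file, which is the kind of "human effort" the question is about.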

In my opinion, neither spreadsheets nor APIs should be penalized just for being so. Instead, this criterion should reflect the number of hurdles faced by data-literate users in actually making use of the data.

dannylammerhirt · December 13, 2016, 7:52pm

Dear @Wagner_Faria_de_Oliv

This is a great point. Our idea was to test how we can evaluate the International Open Data Charter Principles 3 (Accessible and Usable) and 4 (Comparable and Interoperable).

The principles state that open data should be released in “open formats to ensure that the data is available to the widest range of users to find, access, and use. In many cases, this will include providing data in multiple, standardized formats, so that it can be processed by computers and used by people”.

@herrmann describes very well what we mean by “human effort”, namely all the manual labour you have to put into the data (including writing a scraper, cleaning data, changing values, etc.). This assessment should then be based primarily on your own knowledge and intended use case. We decided to add this question because, in our experience, file formats are often a poor proxy for usability. So this question is in a sense a complement to B8, which evaluates open and machine-readable formats.

Let me know if this answers your questions or if any more feedback would be helpful.

Danny

yurukov · December 18, 2016, 2:41pm

I agree with @herrmann. Data literacy is a must. APIs in my view are the best case of accessibility as they imply that cleanup is not needed. Spreadsheets are often more ambiguous when it comes to the meaning of the presented data.

One more aspect that I’m not clear about is metadata and other documentation. A lot of the data on the open data portal of the Bulgarian government is in CSV format, so it’s available for download and through an API. The documentation, however, is mostly in Bulgarian for now. Most of the 1600 datasets are easy to use, but should they be penalized because the documentation is not in English, French or German? Should the UK data be penalized since its documentation is not translated into Bulgarian, Italian or Spanish?