How to score PDFs?

Krzysztof_Madejski · November 25, 2016, 11:01am

I’m a bit confused about the technical aspect of open data being dropped from the methodology:

The only part of our methodology that is not aligned with the open definition is “Open Machine readable” format. We give a full score to machine-readable formats whose source code is not open, but who are usable with at least one free and open source software in order to emphasise practical openness.

In my opinion PDFs fall into that category, but question B8 doesn’t list PDF as a format. So how should we treat it? Should it only affect the answer to B9 (the reuse aspect)?

Do I get B9 right? I understand that a budget in PDF would get a lower B9 score than legislation in PDF (without copy-block), because of the different reuse scenarios. Am I right?

dannylammerhirt · November 28, 2016, 7:11pm

Dear Krzysztof,

to answer your question about B8: the question evaluates two aspects, whether:

the dataset is in a machine-readable format
the dataset is in a format usable with free/libre/open source software.

Both have to be fulfilled. This means that our list of formats includes only formats that meet both requirements, so a full score goes only to formats that are both open and machine-readable. PDFs do not meet the machine-readability criterion (you can find more info here).
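As a rough illustration of the "both have to be fulfilled" rule, here is a minimal Python sketch. It is not the Index's actual scoring code: the function name `b8_full_score` and the small format table are hypothetical, and the flags in the table are assumptions made for this example, apart from PDF being treated as not machine-readable as stated above.

```python
# Hypothetical format table for illustration only. The real list of
# accepted formats lives in the survey; these entries are assumptions.
FORMATS = {
    "csv": {"machine_readable": True, "usable_with_foss": True},
    "json": {"machine_readable": True, "usable_with_foss": True},
    # PDF can be opened with free software, but it is not machine-readable.
    "pdf": {"machine_readable": False, "usable_with_foss": True},
}


def b8_full_score(file_format: str) -> bool:
    """Return True only if the format meets BOTH B8 requirements."""
    props = FORMATS.get(file_format.lower())
    if props is None:
        # Unknown formats are not on the list, so they get no full score.
        return False
    return props["machine_readable"] and props["usable_with_foss"]


if __name__ == "__main__":
    for fmt in ("csv", "pdf"):
        print(fmt, b8_full_score(fmt))
```

In this sketch a PDF fails B8 even though it passes the second check, which is why it does not appear in the list of formats.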

Please use the comment section on the right side of the question text to tell us whether the data are provided as PDF.

To your second question, about B9: you are right. The usability of a file format is determined by 1) the type of data contained in the file and 2) your use case for the data. For instance, if you submit data on draft bill content and votes on a bill, let us know what your use case is and how easy it is to use the data for this specific case (for instance, whether you need to extract information from a PDF file first). Base your assessment on these two factors. Please document your approach in the comment section; this is very useful information for both our reviewers and us.

Please do not hesitate to send any further questions if anything is unclear!

Anna_Alberts · March 5, 2017, 4:18pm

Hi Danny,

Here is a follow-up question on PDFs, because it is always raining PDFs with budget data. So far, for a budget law that I find, I have answered mostly yes (depending on the entry), B8 empty, and B9 hard.

However, I always answer no to B5, as PDFs are not datasets as such, and we cannot speak of “bulk download”. Or should I answer yes to B5, as it takes a single click to get the PDF?

Cheers, Anna

dannylammerhirt · March 10, 2017, 1:55pm

Hi @Anna_Alberts,

A dataset is considered to be a collection of information whose parts are logically linked to one another (e.g. concentration of air pollutants by region and time, or budget per government department). The format in which this information is presented (PDF, CSV, XML) does not affect whether it counts as a “dataset”.

A dataset can contain all the required data and still not be machine-readable, for instance if you find a budget table in a PDF. If you can download this entire dataset as one PDF file, the dataset counts as downloadable at once. The difference is that the information itself is not necessarily machine-readable and may only be usable with extensive manual effort (see survey question B9).
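To make the separation between the two questions concrete, here is a small Python sketch. Everything in it is hypothetical (the `Submission` class, the field names and the `MACHINE_READABLE` set are assumptions for illustration, not part of the survey tooling); it only shows that "downloadable at once" and "machine-readable" are checked independently.

```python
from dataclasses import dataclass

# Assumed set of machine-readable formats, for illustration only.
MACHINE_READABLE = {"csv", "xml", "json"}


@dataclass
class Submission:
    file_format: str            # e.g. "pdf", "csv"
    single_file_download: bool  # whole dataset obtainable in one download


def downloadable_at_once(s: Submission) -> bool:
    """B5-style check: does not look at the file format at all."""
    return s.single_file_download


def machine_readable(s: Submission) -> bool:
    """B8-style check: depends only on the file format."""
    return s.file_format.lower() in MACHINE_READABLE


# A budget table published as a single PDF file.
budget_pdf = Submission(file_format="pdf", single_file_download=True)
```

Here `budget_pdf` passes the download check but fails the machine-readability check, which matches the case described above.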

Does this answer your question?