Our process of finding and evaluating key datasets

This thread discusses how we searched for datasets during review, and our decision-making process to accept or reject a dataset. It also explains our idea of picking the most representative dataset for evaluation.

This document explains our current approach to selecting datasets.

There have been many opinions and much great feedback on whether or not we should evaluate entire datasets.

Should we be more attentive to the existence of some open data in a country? If so, why? What information is most useful to you? Please do not hesitate to get in touch with us. What do you feel is the most constructive and useful information the index should provide?

@martinsz, @herrmann, @mrybi, @nickmhalliday

4 Likes

Maybe the wrong thread, but what about criteria for dropping a dataset altogether? And/or changing the criteria or making them stricter?

IMHO, the GODI cannot be used to monitor progress over consecutive years (because of the changing datasets and changes of definitions and/or criteria). I think this is even mentioned somewhere in the fine print, and the goal of the GODI for OKFN is probably more about pushing open data (which is laudable), but be aware that the GODI is being used / abused to monitor the progress and impact of government policy.

The yearly EU Data Portal Landscaping benchmark takes a slightly different approach (although this benchmark is more focused on policy / governance, and less on the datasets themselves): new criteria / requirements can - and are - added every year, but the criteria from the previous year are not removed, making it a slightly better tool for monitoring progress.

I’d suggest keeping the datasets from previous years and, if the criteria for a given dataset become stricter, adding additional points.

So e.g. if the criteria for water quality are based upon the availability of X and Y, that’s a maximum score of 20 points; if next year you also want Z, a place still gets 20 points for X and Y, and may get an additional 10 points if Z is also available (maximum score of 30).
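
To make the arithmetic concrete, here is a minimal sketch of such additive scoring in Python (the criteria names and point values are only illustrative, not actual GODI criteria):

```python
# Illustrative sketch of additive scoring: criteria added in later years
# never remove points that the earlier criteria could earn.
WATER_QUALITY_CRITERIA = {
    # criterion: (year introduced, points)
    "X": (2016, 10),
    "Y": (2016, 10),
    "Z": (2017, 10),  # new requirement, only adds points on top
}

def score(available: set) -> tuple:
    """Return (points earned, maximum possible points)."""
    earned = sum(points for name, (_, points) in WATER_QUALITY_CRITERIA.items()
                 if name in available)
    maximum = sum(points for _, points in WATER_QUALITY_CRITERIA.values())
    return earned, maximum

# A place publishing only X and Y still earns its 20 points;
# Z is worth an extra 10, raising the maximum to 30.
print(score({"X", "Y"}))       # (20, 30)
print(score({"X", "Y", "Z"}))  # (30, 30)
```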

7 Likes

About law (National Laws) and draftlegislation (Draft Legislation): an important criterion is the official format. In some countries, like Brazil, a lot of the content in law and draftlegislation was “converted” to structured HTML, but the conversions are unreliable and everything remains unofficial, without evidentiary value. The official version remains only in PDF.

Another big problem is the preservation of evidence in reliable digital preservation repositories: the traditional official gazettes are losing their traditional (reliable) paper evidence, because they now exist only in digital media, and third-world nations are not prepared to ensure the integrity of their digital production. In Brazil we are losing evidence-verifiability with this “digital transition”: there are no backups, no standard cryptographic checksums, etc.


Summarizing: law and draftlegislation are key datasets, but without a better evaluation process we mix informal and formal datasets.

PS: informal in this context means something like “open to chaos and corruption”.

Well, my main concern with this methodological leap is that it makes the data incomparable over the years. IMHO, tracking progress is vital if we consider GODI a way to open discussion about data publishing and evidence for advocacy.

Also, I am not sure if this new approach really tells us much about the actual availability of the data. I would prefer a weighted assessment to have a clearer picture of the state of the data. For example, indicate compliance with each criterion by marking it red/green and also show an overall score. Then you can, at a glance, get a much better idea of what is available. The current approach requires you to read a separate set of instructions to understand the result correctly. How many visitors to the site would actually bother to take a look at it?

You mention that one of the reasons for taking the “all or nothing” approach is specific use cases. But I could not find those examples (in the public GODI interface; I looked at just a couple of underlying documents).

I also think that the usability information can suffer with this approach. In the context of open data, I perceive usability as the potential to reuse the data.

For example, Czech data about weather are scored 45% - the criteria about precipitation etc. are met, but the data are closed from both a legal and a technical point of view and scattered across multiple places on the site. So such data are probably quite helpful for someone who is just checking whether it’s going to rain tomorrow, but pretty much useless for anyone who would wish to incorporate them into his/her own analysis or mobile app.

Czech data about election results are available in a machine-readable format and downloadable in bulk under an open licence, but also in XLS or HTML if non-technical people care to check out detailed results for their village. Currently the dataset is missing the invalid votes count (though there is the number of valid votes and the total sum of votes). But the non-availability of an explicit number of invalid votes scores the whole dataset at 0%. A seemingly tiny detail can turn the value of the dataset upside down (last year’s data with the same criteria got 100%).
So there is a big discrepancy between my grasp of the usability of the data and the score. And this makes me believe that a weighted assessment is much better.

5 Likes

Interesting point @ppkrauss. When looking at our results, do you see any countries to which the problems you mentioned apply? Any specific files that could show these problems? And who else might be able to respond to this? @herrmann, @david.cabo maybe?

Best
Danny

2 Likes

Hi @mrybi,

Thank you so much for your feedback.

Currently we have a data category and try to find data online that is most representative of this data category. So whenever we need to decide what data to look at, we currently prioritise the data category (which data contains all relevant information) over purely legal and technical openness (which data is available as open data). I agree that with this approach we may tend to check publicly accessible information (possibly more likely to be published online), instead of only looking at the open data that is available. This, however, also depends on the data category and the country.

What I take from your proposal, @mrybi, is that we could flip this logic. We have a data description, look for data elements online that meet most open data criteria, and then run the assessment only with this data. We could visualize, next to the final score, how much of the data category we actually assessed as open data. The score would also need to be lowered accordingly. I think this can be a good way forward to form a nuanced image of open data publication (that can flag data publication gaps).

Going with @barthanssens, we would keep the data categories from this year, and potentially add optional new categories. But bear in mind that changing our reference point would change the scores again in the next year.

What do others think?

2 Likes

Hi,

Hope this is the right place to post this for further discussion.

I’d like to revive the debate about whether datasets that require automated registration should be penalised for availability. I believe this should not be the case, given that it’s common practice to require registration for access to realtime data.

For reference, here is the thread I started back in December 2016, when I was working on the submissions:

My argument is that realtime data is inherently more valuable than aggregated data provided on a monthly or even yearly basis. Countries that provide such data should not be penalised in the evaluation for providing better-quality data.

@dannylammerhirt, who reviewed Singapore’s Weather Forecast dataset, acknowledged this in his notes:

There are APIs not just for forecasts but readings of temperature, humidity, precipitation and wind conditions in near-realtime. The high-frequency readings are arguably more useful than forecasts, for users who are trying to analyse weather patterns. Users need to register on the Data.gov.sg Developer Portal for an API key to access the data. Registration is free and users gain immediate access after registration.
from https://index.okfn.org/place/sg/weather/

Any thoughts from others about this?

Best,
Zhaowei

1 Like

Hi @Lin_Zhaowei,

Thank you so much for this - it is indeed an important discussion, because currently we treat registration requirements as closed access (see our position on this topic in the blog post here).

Conflating very different degrees of access control was criticized by some people, and I see their concerns.

However, I would like to open a separate topic for this, if that is OK. Everyone interested should comment on that topic.

1 Like

Well, in general yes :slight_smile: If I am not mistaken, your answer implies that for some countries there are several official sources which can be considered relevant for a data category. From my experience in the Czech Republic, which is quite a small country, for most data categories there is a single official (government-published) data source. But that is probably just a detail.

I think we can still give prominence to the data category, aka the content of the data, as I would call it. But instead of the current all-or-nothing approach (which prevents data with some content gaps from being scored altogether), we can apply a weighted assessment to the content criteria and then also score legal openness, technical accessibility, and reuse potential. So we would still be able to flag gaps that the previous GODI might have missed, but we could avoid rejecting some available data which are used and usable but tend to have some data gaps…
This can, IMHO, also be good for advocacy purposes, because small steps will become visible, which might increase the motivation to improve data quality.

2 Likes

I agree that this is a good way forward. I have another question that I would like the forum to comment on: let’s assume we weight single data elements against one another in our scoring. For example, we only find company names in a registry, and assume this data is fully open. How would you propose to weight this?

If we assess three data elements (company name, address and unique identifiers), should we just score each of them at 33%? So if only one element is available as open data, the dataset would be 33% open?
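
To make the arithmetic explicit, here is a tiny sketch of that equal-weighting idea (the element names come from the company-register example above; everything else is purely illustrative):

```python
# Hypothetical sketch: weight the three data elements equally, so that
# each open element contributes one third of the category score.
ELEMENTS = {"company name", "address", "unique identifier"}

def openness_share(open_elements: set) -> float:
    """Fraction of the category's elements that are published as open data."""
    return len(open_elements & ELEMENTS) / len(ELEMENTS)

# Only company names found as open data -> roughly 33% open.
print(round(openness_share({"company name"}) * 100))  # 33
```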

The reason I’m asking is that we need to justify the criteria this scoring is based on. I encourage you to read our process of defining key datasets.

Here I explain that a major problem with key datasets is finding a reasonable way to assign importance. What data matters more than other data? How would we measure this?

Please do share your ideas with us!

1 Like

Well, there will always be a trade-off to make, and the easiest way to get started is indeed to just say 33% open if only one element out of three is available (you are already doing this with datasets, actually: every dataset is deemed equally important).
Unless there is a very good reason not to do so, but that would open up a lengthy discussion with, IMHO, little difference in the outcome.

3 Likes

I agree that the easiest (and crudest) way is to weight each item equally for a start.

If we want to maintain the categories and give GODI room to add in new categories and even criteria in the future, one possibility is to change the raw scoring from percentages to nominal figures, and to use percentages only for calculating the index. This means that we can change the weights for each item down the road, if needed.

The publication of both the raw nominal score and the final percentages allows users to both 1) compare progress for any given item under a data category, and 2) compare countries/places.
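
As a rough sketch of what that separation could look like (item names and point values are purely illustrative, not actual GODI weights):

```python
# Illustrative sketch: keep raw nominal points per item and derive the
# index percentage from them, so weights can be adjusted later without
# invalidating previously published raw scores.
RAW_POINTS = {"company name": 10, "address": 10, "unique identifier": 10}

def raw_score(open_items: set) -> int:
    """Nominal score: points earned for items published as open data."""
    return sum(points for item, points in RAW_POINTS.items() if item in open_items)

def index_percentage(open_items: set) -> float:
    """Percentage used for the index, derived from the raw score."""
    return 100 * raw_score(open_items) / sum(RAW_POINTS.values())

print(raw_score({"company name"}))                # 10 -> comparable over years
print(round(index_percentage({"company name"})))  # 33 -> used for the index
```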

1 Like

Hi @dannylammerhirt,

When looking at our results, do you see any countries to which the problems you mentioned apply?

Yes, Brazil… And many similar countries with a similar “maturity level in e-government” and a similar level of informality in digital preservation… Perhaps 90% of Latin American countries.

Any specific files that could show these problems?

First, it is important to reinforce the context: this is about digital preservation and the digital integrity of public content, not the usual “digital certification” of authorship. See this handbook page about checksums.

There are 3 or 4 million documents, each document containing a Brazilian full-text norm (government acts such as laws, etc.). The cost of auditing the integrity of millions of digital documents without checksums (by full backups and their full homologation) is very high, and no one is doing it.

Sampling one: Brazil’s “Civil Internet Framework” federal law of 2014, officially named Lei Federal nº 12.965 de 2014 (the link points to the LexML portal, the Brazilian equivalent of the European N-LEX portal).
The full text of the law is available at the online government gazette as the “official PDF” (see here), and at official sites transcribed to HTML, like the presidential page with the HTML full text of the law.

The problems:

  1. The HTML remains an unofficial transcription, without evidentiary value. See the link to the HTML version; at the end there is a red phrase
    “Este texto não substitui o publicado no DOU de 24.4.2014”
    = “This text does not replace the one published in the DOU (Official Gazette of the Union) on 2014-04-24”.

  2. The PDF has no checksum that would allow any citizen to freely check its integrity… So no citizen can prove that there was different content at the official web server in the past, i.e. that the document was changed (see the sketch below).
    And there is no other resource to check integrity: the footnote is only an ID (say “código 00012014042400124”) pointing to the same PDF, which is not a proof of integrity.
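
For illustration only: publishing something as simple as a SHA-256 checksum alongside each gazette PDF would let any citizen verify that the file they downloaded is byte-for-byte identical to the published one. A minimal sketch (the local file name is hypothetical):

```python
# Minimal sketch: compute a SHA-256 checksum of a gazette PDF so readers
# can compare it against a value published alongside the official file.
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local file name for the official gazette PDF of Lei 12.965/2014.
print(sha256_of_file("lei-12965-2014-dou.pdf"))
```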


Yes, hello @herrmann? :wink: and perhaps @wagner_faria_de_oliv

1 Like

Dear all,

Thank you so much for all the great contributions so far. I noted them in a document and we will discuss them internally. In any case, I’m very much advocating for a more nuanced representation of the availability of single data elements, possibly by indicating a percentage of how many data elements are provided as open data.

While we are processing this, please do share any further thoughts you might have.

Best
Danny

1 Like

Thank you, this write-up is very helpful. It does illuminate the challenges in carrying out the review.

My first thought is: how realistic is it to expect all the characteristics to be in one dataset?

For example, with the worked example of draft legislation, how many parliamentary bodies publish one dataset as described?

If there are none, it might suggest that the underlying production processes are standalone and not seen as related. If some full sets already exist, it might suggest that, with some effort, other parliaments could link up their data more effectively.

Making changes costs time and money, so from a data creator’s perspective, evidence that their end users really want all the characteristics in one place, and clearly together, would help encourage parliaments to consider making changes in the future if they rebuild their sites.

On the whole they will prefer a driver from user needs rather than a philosophical good, nice as that might be. Do some bodies already have evidence that their users prefer the current formats? Maybe the debate should be around the evidence of user needs for such datasets?

1 Like