Every Politician is a website that aims to collect basic data about politicians and legislatures around the world. According to the website, it is the world’s richest dataset on politicians, with data on 75,900 politicians from 233 countries. The project is built by mySociety.

I was engaging in conversation about it with @GeorgieBurr in the new member introductions topic, but decided to move the conversation here in order not to pollute that space with a discussion specific to EP.

By “maintained via prompts”, do you mean that the data will be entered manually by people? Not only would that be tiresome for high volumes of data, it would also be error prone, introducing typos. Or can scrapers be introduced into the process somehow? I do not understand the workflow you intend to follow in Wikidata.

I’ve found a page on Wikidata that describes how to get data out of it. It looks like you can use an API to get the data out.

It seems you have not yet provided a link to the entry (or entries) on Wikidata that hold the EP data. Is it just not there yet? Do you plan to continue using GitHub?

Thanks for all the clarifications! EP looks awesome already as it is.


Hi @herrmann,

I’m one of the developers working on the EveryPolitician/Democratic Commons projects, and @georgieburr asked me to explain a bit more of what we’re doing.

Getting information into Wikidata happens in one of two ways - entered manually, or entered and maintained in bulk. We agree that bulk entry is often the best path to take, but the main barrier to overcome is that of licensing - Wikidata requires that information entered be free of licensing constraints (or more specifically that it may be licensed as CC0), but in most of the world the fact that something exists on the web doesn’t guarantee this. In addition, writing scrapers or transforming data can be technically challenging, and by offering a lower barrier to entry we can support and encourage a wider range of users.

To help overcome this we’re encouraging people who have an interest in the data being correct to help maintain it in Wikidata, using a few tools we’ve built to smooth the process. The “prompts” that Georgie refers to are basically a comparison between any CSV source and the data in Wikidata - an example is one which compares current members of the South African National Assembly. Behind the scenes this compares a Wikidata query for all current members with the output of a scraper looking at the official website. By highlighting discrepancies we make it easier to spot missing or incorrect information and correct it, which in turn makes it easier for groups to use Wikidata as their primary data store.
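The comparison behind those prompts can be sketched roughly like this - a minimal illustration in Python, not the actual tool. The function name, the `name` column, and the example names are all made up for the sake of the sketch; the real tool presumably matches on more than bare names (identifiers, term dates and so on):

```python
import csv
import io

def membership_prompts(scraped_csv, wikidata_names):
    """Compare a scraped CSV of current members against the names returned
    by a Wikidata query, and return the discrepancies to prompt editors about."""
    scraped = {row["name"] for row in csv.DictReader(io.StringIO(scraped_csv))}
    wikidata = set(wikidata_names)
    return {
        # On the official site but not in Wikidata: probably needs adding
        "missing_from_wikidata": sorted(scraped - wikidata),
        # In Wikidata but not on the official site: possibly out of date
        "not_on_official_site": sorted(wikidata - scraped),
    }

# Illustrative data only
scraped_csv = "name\nThandi Mokoena\nSipho Dlamini\n"
prompts = membership_prompts(scraped_csv, ["Thandi Mokoena", "Lindiwe Nkosi"])
```

Here `prompts` would flag “Sipho Dlamini” as missing from Wikidata and “Lindiwe Nkosi” as absent from the official site, which is exactly the kind of discrepancy the prompts surface for a human to resolve.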

In many cases we find that local groups have also done excellent work in creating or maintaining pages on Wikidata. Since information in Wikidata is suitably licensed, we can use these pages and the fact that Wikipedia and Wikidata are aware of each other to mass-import information. To aid this we’ve built a tool known as Verification Pages which takes a list of things which are presumed to be true (in CSV format, so this could be the output of a scraper or an existing source) then presents people with statements (such as “Joe Bloggs is a councillor for Foobury”), links to something which could be a source to back this statement (such as the Foobury council website), and then asks if the two match. Where they do, this tool automatically adds the necessary data to Wikidata, meaning large volumes can quickly be checked for veracity and then entered.
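The Verification Pages workflow described above - presumed-true statements, a human yes/no check against a source, and only confirmed statements being written to Wikidata - can be sketched as follows. This is my own simplified rendering, not the tool’s actual code, and the field names are invented for the example:

```python
def verify_statements(statements, confirm):
    """For each presumed-true statement, ask a checker (via `confirm`)
    whether the linked source backs it; return only the confirmed ones,
    which would then be written to Wikidata."""
    to_add = []
    for s in statements:
        prompt = f'{s["person"]} is a {s["position"]} for {s["body"]}'
        if confirm(prompt, s["source_url"]):
            to_add.append(s)
    return to_add

statements = [
    {"person": "Joe Bloggs", "position": "councillor", "body": "Foobury",
     "source_url": "https://example.org/foobury-council"},
    {"person": "Jane Doe", "position": "councillor", "body": "Foobury",
     "source_url": "https://example.org/foobury-council"},
]

# Simulate a checker who confirms only the first statement;
# in the real tool, `confirm` is a person looking at the source page.
confirmed = verify_statements(statements, lambda prompt, url: "Joe" in prompt)
```

The point of the design is that the human only answers “do these two match?”, while the bulk CSV input and the Wikidata write are automated around them.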

Finally, where people already have found or created data which can be licensed as CC0 we’re encouraging them to load it into Wikidata, and then both maintain it and rely on Wikidata as a primary source of data. This neatly brings us on to your second question!

Getting data out of Wikidata is possible in a number of ways. As you mentioned you can use the Wikidata API to easily access information about a specific object (such as a single politician). You can also use the Wikidata Query Service to make SPARQL queries against the data which can return the results in a number of formats (eg current members of the South African National Assembly in JSON).
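To make the two access routes concrete, here is a small sketch that builds request URLs for both: the Wikidata API’s `wbgetentities` action for a single item, and the public SPARQL endpoint for a query. The endpoints and the properties P39 (position held) and P582 (end time) are real; the membership Q-id in the query is illustrative only, and the code just constructs the requests rather than fetching them:

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def entity_url(qid):
    """URL to fetch a single item (e.g. one politician) via the Wikidata API."""
    return WIKIDATA_API + "?" + urlencode(
        {"action": "wbgetentities", "ids": qid, "format": "json"})

def sparql_url(query):
    """URL to run a SPARQL query on the Wikidata Query Service, JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode(
        {"query": query, "format": "json"})

# Current members of a legislature: position held (P39) with no end time (P582).
# The Q-id for the membership position here is a placeholder, not the real one.
query = """
SELECT ?member ?memberLabel WHERE {
  ?member p:P39 ?stmt .
  ?stmt ps:P39 wd:Q21282663 .
  FILTER NOT EXISTS { ?stmt pq:P582 ?end }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
url = sparql_url(query)
```

Fetching either URL (with any HTTP client) returns JSON, which is what makes the Query Service convenient for generating membership lists like the South African example above.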

As far as EveryPolitician goes, we’ve already switched some countries to use Wikidata as their primary source of data. For example, South Africa relies exclusively on Wikidata to decide who to include as a politician. The plan is to move countries to use Wikidata as this type of source as and when the data for those countries is accurate enough - EveryPolitician will then continue to make the data easily available. Even where we don’t use Wikidata as a primary source of information, in many countries EveryPolitician also includes biographical data (such as date of birth, or gender) which is sourced from Wikidata.

Ultimately the intent is that by encouraging people to all use the same source of information, the amount of effort needed to maintain it will be shared amongst all users and similarly all will benefit. We plan to continue generating datasets from the information in Wikidata on a regular basis and storing them in GitHub for all to use with the minimum of effort.

Hopefully this answers your questions - if you want me to dive into a bit more detail on any of it, just let me know!



Thank you very much for the detailed explanation, Nick. All of it seems a very sensible direction for the project moving forward.

Regarding the licensing issue, have you evaluated the legal status of data scraped from official web pages? I suppose it depends on the laws of each country, right? Are you documenting the legal analysis and reasoning for each in a wiki page somewhere?

All the best,

The legal status varies depending on the country, mostly their take on sui generis database rights. Wikimedia have a page on this which neatly sums it up as “it can be difficult”.

We obviously believe that an official web page stating something such as the members of a legislative assembly should be information in the public domain, but whether it is or not is something which would really need a court case to test.

I don’t believe we’re keeping a document somewhere detailing the presumed legal status of each country, but it’s definitely a good idea! I’ll see if we have anything internally (or in people’s heads) I can pull together.



Cool! Good to hear that.

I have some familiarity with the legal landscape of databases in Brazil, which the Wikimedia page you linked covers only very briefly. It is somewhat similar to the United States in requiring a creative step.

So, if you take a list of all of the members of parliament, for example, one could not plausibly argue that producing it involved an “intellectual creation” based on the “selection, organization or arrangement of content”. I think it’s safe to say such a list is in the public domain.

On the other hand, a list of the most effective members of parliament could arguably be protected, as one might claim that the chosen way of measuring effectiveness constitutes an “intellectual creation”. Of course this argument could be challenged, but the legal status of this second example would not be as clear.

Fortunately, what we need for a project like Every Politician are comprehensive lists of all of the politicians, not filtered or arranged in any particular way. So in countries that have similar criteria for the protection of databases, we should be good to go.

Another legal aspect we should be concerned about is personal data protection regulation, such as the GDPR in Europe. Since data about politicians could be considered personal data, even if it is in the public interest, it is important to consider whether that kind of legislation would impose restrictions on processing or distributing it.

Disclaimer: Anything I say here or anywhere else is just my own personal opinion and does not constitute legal advice of any kind.