Towards a data schema for Covid-19 epidemiological and response data

In order to fight the pandemic, it is essential to have consistent data to work with. Frictionless Data can help achieve that consistency by providing the means to automatically detect common errors and prompt for them to be corrected.
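
To make the idea of automatic error detection concrete, here is a minimal sketch in plain Python. It does not use the Frictionless libraries; it only mimics the kind of check that a Table Schema makes possible. The column names (`date`, `city`, `confirmed`), the sample rows and the helper `find_errors` are all hypothetical, chosen for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical schema, loosely modelled on a Table Schema "fields" list.
SCHEMA = [
    {"name": "date", "type": "date"},
    {"name": "city", "type": "string"},
    {"name": "confirmed", "type": "integer", "minimum": 0},
]


def find_errors(csv_text, schema=SCHEMA):
    """Return a list of (row_number, field_name, message) tuples."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in schema:
            name = field["name"]
            value = row.get(name)
            if value is None or value == "":
                errors.append((row_number, name, "missing value"))
                continue
            if field["type"] == "date":
                try:
                    date.fromisoformat(value)
                except ValueError:
                    errors.append((row_number, name, "not an ISO date"))
            elif field["type"] == "integer":
                try:
                    number = int(value)
                except ValueError:
                    errors.append((row_number, name, "not an integer"))
                    continue
                if number < field.get("minimum", number):
                    errors.append((row_number, name, "below minimum"))
    return errors


# Made-up sample data containing typical mistakes.
SAMPLE = """date,city,confirmed
2020-04-01,Example City,12
01/04/2020,Example City,-3
2020-04-03,,abc
"""

for error in find_errors(SAMPLE):
    print(error)
```

Each reported tuple points at a row and column, which is what would let a publisher be prompted to correct the value at the source.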

With that in mind, I have added Tabular Data Packages to, a collaborative platform that has been scraping data from the official State Health Secretariat reports. The data is then made available both as CSV downloads and through an API. It is updated daily and is, AFAIK, the only open data source on Covid-19 in Brazil that has city-level data.

More information about it can be found on this blog post (in Portuguese) and the website.

A recent issue on the project discusses whether we should suggest or recommend that authorities use a specific table schema. That would help not only with collecting information more directly from CSVs instead of scraping PDFs, but also with making data more uniform across states, and therefore more comparable and easier to aggregate.
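
As a starting point for that discussion, a recommended schema could be published as a Tabular Data Package descriptor. The sketch below builds one as a plain Python dictionary and serializes it to JSON. The package name, file path and field names are hypothetical and are not taken from any state's actual reports; only the overall shape (`resources`, `schema`, `fields`, `constraints`) follows the Data Package and Table Schema specifications.

```python
import json

# Hypothetical descriptor: names, path and fields are illustrative only.
descriptor = {
    "name": "covid19-city-cases",
    "profile": "tabular-data-package",
    "resources": [
        {
            "name": "cases",
            "path": "cases.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "date", "type": "date",
                     "constraints": {"required": True}},
                    {"name": "state", "type": "string",
                     "constraints": {"required": True}},
                    {"name": "city", "type": "string"},
                    {"name": "confirmed", "type": "integer",
                     "constraints": {"minimum": 0}},
                    {"name": "deaths", "type": "integer",
                     "constraints": {"minimum": 0}},
                ]
            },
        }
    ],
}

# This string is what would be saved as datapackage.json next to the CSV.
datapackage_json = json.dumps(descriptor, indent=2)
print(datapackage_json)
```

An authority publishing a CSV alongside such a descriptor would let anyone check the file against the declared types and constraints before aggregating it with data from other states.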

The local chapter of Open Knowledge has even been evaluating transparency of states regarding the disclosure of Covid-19 data, but it does not propose a specific schema, nor have the data points required in the “content” part of the score been based on any sort of international “standard” for Covid-19 data.

In fact, each international source I have looked into uses a different schema and has different data. For instance, the Johns Hopkins CSSE dataset does feature daily data on recovered cases, whereas most other international datasets don’t. I don’t think an internationally agreed standard schema for Covid-19 data exists, but it definitely should.

Are you aware of any data standardization efforts for Covid-19 data?


Hi @herrmann! Thanks for starting this important topic. One of our collaborators, Phil Rocca Serra, has started work on a COVID-19 datapackage, which you can see here: Phil has been working with some medical professionals to get their input as well to make sure the datapackage works with actual data from the medical field. We’re hoping to write a blog post about this soon. If you are interested, please leave your feedback or ideas here or in an issue in that repository.


Hi, @lwinfree! It’s cool that Phil is working on a data package related to Covid-19. I took a look at it, and it looks interesting, but I’m not sure if it matches the scope of epidemiological data I was thinking about.

Shouldn’t there be a field for the number of confirmed cases that have not been hospitalized? Shouldn’t there be a field for recovered cases? Maybe we could take a look at the Johns Hopkins University CSSE data and the fields they have there as a frame of reference.

Hi @herrmann, you are right, there are gaps. As @lwinfree pointed out, this is a first pass on deriving a Frictionless definition. I started by basing the definition on WHO updates as available from the CSSE dataset, but also on data released in Europe by Italy and Spain, trying to get the overlap. During the EU Elixir Biohackathon 2020 on Covid-19, I created another package based on the US CDC module that was released around the same time, and for which schema.org pushed out specific properties that could be used for the rdfType attribute. The module itself was a bit surprising, as there was no information about ‘WHO region’ which, for the US, would be states, I suppose. I just pushed that to the master branch of the repo @lwinfree mentioned:

Here, the fields are semantically marked up with schema.org terms but also with the STATO ontology.
Would that be more useful?
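
For readers unfamiliar with this kind of markup: Table Schema lets each field carry an `rdfType` attribute holding a URI for the concept the column represents. A minimal sketch follows, in Python. The field names and the specific URIs are assumptions for illustration only; they are not copied from the repository and should be checked against it and against the schema.org vocabulary before use. The helper `fields_without_rdf_type` is likewise hypothetical.

```python
# Hypothetical field descriptors; names and URIs are illustrative assumptions.
fields = [
    {
        "name": "collection_date",
        "type": "date",
        "rdfType": "https://schema.org/cvdCollectionDate",
    },
    {
        "name": "hospitalized_patients",
        "type": "integer",
        "rdfType": "https://schema.org/cvdNumC19HospPats",
    },
    {"name": "notes", "type": "string"},  # no semantic annotation
]


def fields_without_rdf_type(field_list):
    """List the names of fields that lack a semantic annotation."""
    return [f["name"] for f in field_list if not f.get("rdfType")]


print(fields_without_rdf_type(fields))
```

A check like this could help spot columns that still need to be mapped to a shared vocabulary before two schemas are compared.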

Yes, @proccaserra, I think that is indeed more useful.

Are you aware of any standardization efforts being started regarding this? Could this schema be a starting point for a proposal?

I’ve seen but don’t know much about it - interesting?

This is well outside my field, but I understand the European Union has undertaken or funded significant work in this area:

The European Commission has recently announced a data sharing platform for Covid‑19 research:

The WHO was consulting on data protocols in 2015:

The Open Data Institute has recently been pushing for open data but mostly as advocacy:

I would caution against reinventing wheels without thoroughly researching the current status. HTH R.

Could someone tag this topic “covid19” (I don’t have sufficient status):

Sorry @robbiemorrison. Having trouble working out how to tag the whole topic without having to move all the posts to a new topic so will leave this for now.

@stephenabbottpugh That’s kinda weird. I admin two forums and happily add and modify tags. Perhaps you should check your configurations? Another hint, set the tags visible in the dropdown to a large number, else people think the 20 or so tags they otherwise see are all that exist. Cheers, R.

Might be of relevance: