Country-specific data package register

Hi all,

what are your thoughts about creating country-specific registries for data packages in the spirit of Data Packaged Core Datasets · GitHub? The Dutch register for example would be located at https://github.com/datasets-nl.

A bit of context: I’m very fond of the Frictionless Data initiative and am trying to create/use data packages as much as possible. Currently I’m involved in two projects that deal with geo-information about the natural environment. A single data package is interesting/relevant for both projects. The question is: how do I include the same data package in both projects? Currently I have the “master” data package here: susthacking/data/bestand-veehouderijbedrijven at master · openstate/susthacking · GitHub, which is forked in the other project: open-data/bestand-veehouderijbedrijven at master · FarmHackNL/open-data · GitHub. This is of course suboptimal and a nightmare to keep in sync. Hence the idea to create a central registry.

I’d rather not “pollute” Data Packaged Core Datasets · GitHub with country-specific data and am thinking of creating https://github.com/datasets-nl or https://github.com/datasetsNL to store country-specific data packages.

Including them in other projects should then be done through submodules although I’m not a big fan of them.

Any thoughts on this idea and the format /datasets-nl or /datasetsNL?

Cheers,
Simeon

4 Likes

Good initiative (!)… I think it is important to avoid intersections between dataset projects; ideally the focus is on independent data that is really country-specific. PS: about the label, I vote datasets-nl.


Hum… thinking better… we need some intersection…

Questions in the case of “some intersections allowed”

Let’s see the country-codes.csv example, with standard names in English (official_name_en and name) and an official_name_fr for French… And only these two languages.

At /datasets-br we would want to include an official_name_pt column (pt is the language of BR), so to reduce duplication and intersections we must create a new /datasets-br/country-codes/data/country-codes.csv file with only two columns, ISO3166-1-numeric (as primary key) and official_name_pt… Possible problems:

  • ISO3166-1-numeric alone is not mnemonic, and not as useful as ISO3166-1-Alpha-2… Add one or two more columns as candidate keys?

  • producing a kind of “SQL JOIN” of two CSV files (the country-codes.csv from /datasets-br and the one from the main /datasets) is not so easy for all users; why not copy all the other columns?

  • and what about /datasets-pt? It will use the same translations, but perhaps with some variants (pt, pt-BR and pt-PT are not always the same).
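To make the “SQL JOIN” point concrete, here is a minimal sketch of such a join in Python, using only the standard library. The two CSV snippets are illustrative stand-ins for the real files (the sample rows and the `left_join` helper are assumptions for this example, not part of either repository); the join key is the ISO3166-1-numeric column proposed above.

```python
import csv
import io

# Illustrative stand-ins for the two files; real files would be read with open().
CORE_CSV = """ISO3166-1-numeric,ISO3166-1-Alpha-2,official_name_en
076,BR,Brazil
528,NL,Netherlands
250,FR,France
"""

BR_CSV = """ISO3166-1-numeric,official_name_pt
076,Brasil
528,Países Baixos
"""


def left_join(core_text, extra_text, key="ISO3166-1-numeric"):
    """Keep every core row and add the extra columns where the key matches."""
    extra_reader = csv.DictReader(io.StringIO(extra_text))
    extra_cols = [c for c in extra_reader.fieldnames if c != key]
    # Keys stay strings, so leading zeros such as "076" are preserved.
    extra_by_key = {row[key]: row for row in extra_reader}

    joined = []
    for row in csv.DictReader(io.StringIO(core_text)):
        match = extra_by_key.get(row[key], {})
        for col in extra_cols:
            # Core rows without a translation get an empty cell.
            row[col] = match.get(col, "")
        joined.append(row)
    return joined


rows = left_join(CORE_CSV, BR_CSV)
```

It is only a dozen lines, but it does assume the user can script, which is exactly the concern raised in the second bullet.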

Questions in the case of “some intersections in the curation”

Suppose a big and vibrant community working in /datasets-br, /datasets-nl, etc., all in a “no /datasets intersections allowed” mode, but each one with a big set of people looking for an official_name_X column at /datasets/country-codes

I think this is the “most vibrant” aspect: a new demand, a new pressure on the curatorial organization of the central /datasets, perhaps a kind of “federated democracy” :wink:

In the case of country-codes, and supposing that “join CSV files” is not a problem for users, the federated community can help central /datasets to maintain a new country-codes-names.csv file with all official_name_X columns: this solution seems better than intersection in /datasets-Y projects.

1 Like

Hi Peter, thanks for your thoughts!

I completely agree that there should be as few intersections as possible. Keeping these in sync is difficult and a waste of resources.

My geo background blinded me to the issues you raise as I planned to simply use the geographical dimension as a categorisation metric: /datasets-nl and /datasets-br will only contain datasets located within the country’s borders (however these might be defined… dragons ahead, I realise).

For datasets that are not bound to a single country (like country-codes.csv), I’d argue it’s cleaner to put them in /datasets and ask /datasets-nl contributors to add/maintain additional columns as they see fit.

I’m in favor of keeping everything as simple and as self-contained as possible. There’s a huge user base that can’t script but is able to produce beautiful and insightful visualisations, analyses, articles, and insights. It’s important to give them access to data (while they are learning how to code/do advanced data stuff :slight_smile: )

1 Like

Hi @simeonnedkov and readers, shall we start an initiative in this direction?
A good candidate is the country-specific ISO 3166-2 tables. For instance ISO 3166-2:BR:

  • it is a simple table (columns “Subdivision”, “Name”, and “CurrentLevel”), and easy to maintain.
  • it needs some additional data from local-interest curators about the years of creation/extinction (two more columns)… See my draft table for BR.
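As a rough sketch of how such a table could be described as a Data Package, the Python snippet below builds a minimal `datapackage.json` descriptor. The three existing column names come from the table above; the package name, the resource path, and the two year column names (`creation_year`, `extinction_year`) are hypothetical placeholders, not the names used in the draft table.

```python
import json


def build_descriptor():
    """Build a minimal Data Package descriptor for the ISO 3166-2:BR table."""
    fields = [
        {"name": "Subdivision", "type": "string"},
        {"name": "Name", "type": "string"},
        {"name": "CurrentLevel", "type": "string"},
        # Hypothetical names for the two columns local curators would add.
        {"name": "creation_year", "type": "year"},
        {"name": "extinction_year", "type": "year"},
    ]
    return {
        "name": "iso-3166-2-br",  # hypothetical package name
        "resources": [
            {
                "name": "subdivisions",
                "path": "data/subdivisions.csv",  # hypothetical path
                "schema": {"fields": fields},
            }
        ],
    }


descriptor = build_descriptor()
# Serialised form, ready to be written to datapackage.json.
descriptor_json = json.dumps(descriptor, indent=2)
```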
2 Likes

Hi @mor, you solve all problems here (!),
maybe you can point us to someone who can solve our problem or put it on an agenda?

Brazil now has two simple and useful datasets to add to this proposed “country-specific data package register”,

PS: if there is a green light to continue, we will invite more people to endorse/review, etc.

I think this is more for @danfowler (who really solves all issues here) and @Jobarratt

@ppkrauss - @danfowler will take a look at this tomorrow for you

@simeonnedkov @ppkrauss I really like the idea of having many more Data Package registries on GitHub, but I’m not sold yet on having lots of country-specific ones. I’m sure it might be relatively easy to find yourself in the situation where you have a dataset that you need and are willing and able to maintain, but that may not fit in either datasets-nl or datasets, which would be a real problem :smile:. That being said, datasets-nl seems like a fine place to start testing out the idea for projects in the Netherlands. We can help promote new datasets as you add them and see if other people jump on board. It might be worth coming up with some guidelines for contribution.

@ppkrauss You said:

In the case of country-codes, and supposing that “join CSV files” is not a problem for users, the federated community can help central /datasets to maintain a new country-codes-names.csv file with all official_name_X columns: this solution seems better than intersection in /datasets-Y projects.

I particularly like this idea. I think it counts as a generally useful dataset to add to /datasets/. Perhaps you could create a new issue for it: Sign in to GitHub · GitHub

@ppkrauss You also said:

Brazil now has two simple and useful datasets to add to this proposed “country-specific data package register”,

  • ISO-3166-2-BR-history

As I said above, I’m neutral as to whether you should manage these on /datasets-br/ or /okfn-brasil/, but I do think they both seem important and generally useful! Given that you are managing the ISO-3166-2-BR dataset through Google Sheets, you might be interested in generating a Data Package directly from there. See this post for some background info on it:

Hi @danfowler, thanks for your complete reply. Let’s go through your comments:

…it might be relatively easy to find yourself in the situation where you have a dataset that you need and are willing and able to maintain, but may not fit …

…hum… but I think we know exactly, and it fits. The discussion here shows that it fits.

datasets-nl seems like a fine place to start testing out the idea for projects in the Netherlands…

…hum… not the case: it is not a “sandbox” at this point, it is a mature idea… the discussion here is something like
“hello Open Knowledge, let’s expand the github.com/datasets initiative … boost its application and collaboration…, we need your support and your endorsement!”

… I’m neutral as to whether you should manage these on /datasets-br/ or /okfn-brasil/, but I do think they both seem important and generally useful!

… ok, let’s vote to use the Open Knowledge seal in the initiative! :wink:
But allow me to put it in other words: we need (do you agree?) a new repository in the official Open Knowledge Curated Core Datasets · GitHub, so something like /datasets/br-datasets or /datasets/nl-datasets… let’s fork github.com/okfn-brasil/dataset-cbo to github.com/datasets/br-datasets (!)
PS: after the fork I will move the data to data/cbo, review the Readme, datapackage.json, etc. (it will be a dataset with many sub-datasets).

We started (!) a little project, using the brand and logo of Data Packaged Core Datasets:
(thanks to @Mor and @danfowler!)

Examples:

2 Likes

Great @ppkrauss! Let me know if you need any help with some of the tooling for working with Data Packages. For example, have you seen goodtables.io which allows you to automatically validate Data Packages in a GitHub repo?

Hi @danfowler, thanks!
Goodtables.io is good news (!!) to me :slight_smile:
It seems great (!). I subscribed just now and am waiting for my first test (my next commit) with goodtables.io/datasets-br/state-codes.

PS: do you have any recommendation for versioning at github/datasets? … I suggest using the standard http://semver.org
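To illustrate what semver-style versioning could look like for a dataset, here is a small Python sketch. The mapping in the comments (structural change = major, new data = minor, corrections = patch) is only my assumption for discussion, not an agreed convention, and it handles plain MAJOR.MINOR.PATCH strings only (no pre-release or build suffixes).

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release/build suffixes are out of scope here.
SEMVER_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse(version):
    """Return (major, minor, patch) as integers, so tuples compare numerically."""
    match = SEMVER_RE.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def bump(version, part):
    """Increment one part and reset the lower ones, as semver prescribes."""
    major, minor, patch = parse(version)
    if part == "major":  # assumed: columns renamed or removed
        return f"{major + 1}.0.0"
    if part == "minor":  # assumed: rows or columns added
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # assumed: typo or value corrections
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

Comparing parsed tuples rather than raw strings matters: as strings, "1.10.0" sorts before "1.9.5", while `parse("1.10.0") > parse("1.9.5")` gives the intended order.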

@ppkrauss Great question: @ethanwhite’s team are exploring this for their own Data Package-based tool: Script Versioning Protocol · Issue #908 · weecology/retriever · GitHub

@danfowler - has someone from our team gotten in touch with you yet? I recommended that Andrew check in with you to see if there was any discussion of this. If so, we’d be happy to participate and follow whatever recommendations come out. If not, we’re happy to take a first crack at thinking through what makes sense and report back.

1 Like

Yes, Andrew did! @pwalsh recommended bringing it to the general specs forum: GitHub - frictionlessdata/specs: Technical specifications and guidelines for implementing Frictionless Data.

Great. I’ll let Andrew handle this since he got it started, but just ping if you want me to jump in at some point.

Hi @danfowler or @Jobarratt, this is another technical request… But there are also some issues to be clarified. To continue the Datasets-BR project and to offer better opportunities to country-specific registries in general, it would be useful to have a consistent namespace at Datahub.io.

It would be natural to use, as “reserved authority names”, core suffixed by the country code: datahub.io/core-br, datahub.io/core-nl, etc.

Official core-countrySpecific URL at Datahub

Problem and request

I created https://datahub.io/ppKrauss/br-state-codes from GitHub - datasets-br/state-codes: Brazilian states 2-letter codes (ISO 3166-2:BR), official abbreviations throughout the country's history, only as a proof of concept, and it worked fine. https://datahub.io/ppKrauss is a one-user authority, which is good for me but not good for the BR community. Ideally:

  1. many users (eg. the BR-curators) are administrative authorities;
  2. all core-XX names are reserved for the XX community (where XX is a country code), related to OK-Network chapters or groups.

Example: GitHub - datasets-br/state-codes: Brazilian states 2-letter codes (ISO 3166-2:BR), official abbreviations throughout the country's history exists as a stable dataset in a stable GitHub project (github.com/datasets-br), like all datasets at github.com/datasets. So it would be natural to publish it as https://datahub.io/core-br/state-codes.