A bit of context: I’m very fond of the frictionless data initiative and am trying to create/use data packages as much as possible. Currently I’m involved in two projects that deal with geo-information about the natural environment. A single data package is interesting/relevant for both projects. The question is: how do I include the same data package in both projects? Currently I have the “master” data package here: susthacking/data/bestand-veehouderijbedrijven at master · openstate/susthacking · GitHub that is forked in the other project: open-data/bestand-veehouderijbedrijven at master · FarmHackNL/open-data · GitHub This is of course suboptimal and a nightmare to keep in sync. Hence the idea to create a central registry.
Good initiative (!)… I think it is important to avoid intersections between dataset projects; ideally each one focuses on independent data that is really country-specific. PS: about the label, I vote for datasets-nl.
Hum… thinking better… we need some intersection…
Questions in the case of “some intersections allowed”
Let’s see the country-codes.csv example, with standard names in English (official_name_en and name) and an official_name_fr for French… And only these two languages.
At /datasets-br we would want to include an official_name_pt column (pt is the language of BR), so to reduce duplication and intersections we would have to create a new /datasets-br/country-codes/data/country-codes.csv file with only two columns, ISO3166-1-numeric (as primary key) and official_name_pt… Possible problems:
ISO3166-1-numeric on its own is not mnemonic, and not as useful as ISO3166-1-Alpha-2… Add one or two more columns as candidate keys?
producing a kind of “SQL JOIN” of two CSV files (the country-codes.csv from /datasets-br and the one from the main /datasets) is not so easy for all users, so why not copy all the other columns?
and what about /datasets-pt? It will use the same translations, but perhaps with some variants (pt, pt-BR and pt-PT are not always the same).
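For readers wondering what that “SQL JOIN” of two CSV files would look like in practice, here is a minimal sketch using only the Python standard library. The sample rows are illustrative stand-ins, not the real file contents, and `left_join` is a hypothetical helper name. Note that the numeric code is kept as text so that `076` does not lose its leading zero.

```python
import csv
import io

# Illustrative stand-in for the main /datasets country-codes.csv (not the real contents).
MAIN_CSV = """ISO3166-1-numeric,ISO3166-1-Alpha-2,official_name_en,official_name_fr
076,BR,Brazil,Brésil
528,NL,Netherlands,Pays-Bas
620,PT,Portugal,Portugal
"""

# Illustrative stand-in for the two-column /datasets-br file.
BR_CSV = """ISO3166-1-numeric,official_name_pt
076,Brasil
528,Países Baixos
"""


def left_join(main_text, extra_text, key):
    """Attach the columns of extra_text to every row of main_text, matching on key."""
    extra_reader = csv.DictReader(io.StringIO(extra_text))
    extra_cols = [c for c in extra_reader.fieldnames if c != key]
    extra_by_key = {row[key]: row for row in extra_reader}

    joined = []
    for row in csv.DictReader(io.StringIO(main_text)):
        match = extra_by_key.get(row[key], {})
        merged = dict(row)
        for col in extra_cols:
            # Rows without a translation get an empty cell instead of being dropped.
            merged[col] = match.get(col, "")
        joined.append(merged)
    return joined


rows = left_join(MAIN_CSV, BR_CSV, "ISO3166-1-numeric")
for row in rows:
    print(row["ISO3166-1-Alpha-2"], row["official_name_en"], row["official_name_pt"])
```

It is short, but it does assume the user can run a script, which is exactly the concern raised above.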
Questions in the case of “some intersections in the curation”
Suppose a big and vibrant community working in /datasets-br, /datasets-nl, etc., all in a “no /datasets intersections allowed” mode, but each one with a big set of people looking for an official_name_X column at /datasets/country-codes…
I think this is the “most vibrant” aspect: a new demand, a new pressure on the curatorial organization of the central /datasets, perhaps a kind of “federated democracy”.
In the case of country-codes, and supposing that “join CSV files” is not a problem for users, the federated community can help central /datasets to maintain a new country-codes-names.csv file with all official_name_X columns: this solution seems better than intersection in /datasets-Y projects.
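As a rough illustration of how such a country-codes-names.csv could be assembled from per-community contributions, here is a minimal sketch. The `CONTRIBUTIONS` dictionary, the short names in it, and the `build_names_csv` helper are assumptions made up for the example; the real file would carry the full official names.

```python
import csv
import io

# Hypothetical contributions: one mapping of numeric code -> name per community.
CONTRIBUTIONS = {
    "official_name_pt": {"076": "Brasil", "528": "Países Baixos"},
    "official_name_nl": {"076": "Brazilië", "528": "Nederland", "620": "Portugal"},
}


def build_names_csv(contributions, key="ISO3166-1-numeric"):
    """Merge all official_name_X mappings into one wide CSV, one row per code."""
    columns = sorted(contributions)
    codes = sorted({code for names in contributions.values() for code in names})

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([key] + columns)
    for code in codes:
        # A community that has not translated a country yet leaves the cell empty.
        writer.writerow([code] + [contributions[col].get(code, "") for col in columns])
    return out.getvalue()


names_csv = build_names_csv(CONTRIBUTIONS)
print(names_csv)
```

Each community only touches its own column, so the central file grows without the country-specific registries duplicating each other.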
I completely agree that there should be as few intersections as possible. Keeping these in sync is difficult and a waste of resources.
My geo background blinded me to the issues you raise as I planned to simply use the geographical dimension as a categorisation metric: /datasets-nl and /datasets-br will only contain datasets located within the country’s borders (however these might be defined… dragons ahead, I realise).
For datasets that are not bound to a single country (like country-codes.csv), I’d argue it’s cleaner to put them in /datasets and ask /datasets-nl contributors to add/maintain additional columns as they see fit.
I’m in favor of keeping everything as simple and as self-contained as possible. There’s a huge user base that can’t script but is able to produce beautiful and insightful visualisations, analyses, and articles. It’s important to give them access to data (while they are learning how to code or do advanced data stuff).
Hi @simeonnedkov and readers, let’s start some initiative in this direction?
A good demand is the country-specific ISO 3166-2 tables. For instance ISO 3166-2:BR:
it is a simple table (columns “Subdivision”,“Name”, and “CurrentLevel”), and easy to maintain.
it needs some data additions from local-interest curators, about the years of creation/extinction (two more columns)… See my draft table for BR.
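To make the table shape concrete, here is a minimal sketch of a Data Package descriptor for it, built with the standard library. The three column names come from the list above; the package name, resource path, the two year column names (`creation_year`, `extinction_year`) and the choice of `Subdivision` as primary key are assumptions for illustration only, not what the draft table actually uses.

```python
import json


def build_descriptor():
    """Return a datapackage.json-style descriptor for the ISO 3166-2:BR table."""
    fields = [
        {"name": "Subdivision", "type": "string"},
        {"name": "Name", "type": "string"},
        {"name": "CurrentLevel", "type": "string"},
        # Hypothetical names for the two extra columns added by local curators.
        {"name": "creation_year", "type": "year"},
        {"name": "extinction_year", "type": "year"},
    ]
    return {
        "name": "iso-3166-2-br",  # assumed package name
        "resources": [
            {
                "name": "subdivisions",
                "path": "data/subdivisions.csv",  # assumed location of the CSV
                "schema": {"fields": fields, "primaryKey": "Subdivision"},
            }
        ],
    }


descriptor = build_descriptor()
print(json.dumps(descriptor, indent=2))
```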
@simeonnedkov @ppkrauss I really like the idea of having many more Data Package registries on GitHub, but I’m not sold yet on having lots of country-specific ones. I suspect it would be relatively easy to end up with a dataset that you need and are willing and able to maintain, but that does not fit in either datasets-nl or datasets, which would be a real problem. That being said, datasets-nl seems like a fine place to start testing out the idea for projects in the Netherlands. We can help promote new datasets as you add them and see if other people jump on board. It might be worth coming up with some guidelines for contribution.
In the case of country-codes, and supposing that “join CSV files” is not a problem for users, the federated community can help central /datasets to maintain a new country-codes-names.csv file with all official_name_X columns: this solution seems better than intersection in /datasets-Y projects.
I particularly like this idea. I think it counts as a generally useful dataset to add to /datasets/. Perhaps you could create a new issue for it: Sign in to GitHub · GitHub
As I said above, I’m neutral as to whether you should manage these on /datasets-br/ or /okfn-brasil/, but I do think they both seem important and generally useful! Given that you are managing the ISO-3166-2-BR dataset through Google Sheets, you might be interested in generating a Data Package directly from there. See this post for some background info on it:
Hi @danfowler, thanks for your complete reply. Let’s look at your comments:
…it might be relatively easy to find yourself in the situation where you have a dataset that you need and are willing and able to maintain, but may not fit …
…hum… but I think we know exactly, and it fits. The discussion here shows that it fits.
… datasets-nl seem like a fine place to start testing out the idea for projects in the Netherlands…
…hum… not the case: it is not a “sandbox” at this point, it is a mature idea… the discussion here is something like “hello Open Knowledge, let’s expand the github.com/datasets initiative… boost its application and collaboration… we need your support and your endorsement!”…
… I’m neutral as to whether you should manage these on /datasets-br/ or /okfn-brasil/, but I do think they both seem important and generally useful!
… ok, let’s vote on the use of the Open Knowledge seal in the initiative!
But allow me to put it in other words: we need (do you agree?) a new repository in the official Open Knowledge Curated Core Datasets · GitHub, so something like /datasets/br-datasets or /datasets/nl-datasets… let’s fork github.com/okfn-brasil/dataset-cbo to github.com/datasets/br-datasets (!) PS: after the fork I will move the data to data/cbo, review the Readme, datapackage.json, etc. (it will be a dataset with many sub-datasets).
Great @ppkrauss! Let me know if you need any help with some of the tooling for working with Data Packages. For example, have you seen goodtables.io which allows you to automatically validate Data Packages in a GitHub repo?
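To give a feel for the kind of structural check a validator performs, here is a minimal standard-library sketch. It is not how goodtables works internally and does not use its API; `check_rows` is a hypothetical helper, and the sample rows (including the `state` value for `CurrentLevel`) are illustrative only.

```python
import csv
import io


def check_rows(csv_text, field_names):
    """Return a list of structural problems: wrong header or ragged rows."""
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    if header != field_names:
        errors.append(f"header {header} does not match schema fields {field_names}")
    # Data rows start on line 2, assuming no cell contains a line break.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            errors.append(f"row {line_no}: expected {len(header)} cells, got {len(row)}")
    return errors


FIELDS = ["Subdivision", "Name", "CurrentLevel"]

GOOD_CSV = "Subdivision,Name,CurrentLevel\nBR-SP,São Paulo,state\nBR-RJ,Rio de Janeiro,state\n"
BAD_CSV = "Subdivision,Name,CurrentLevel\nBR-SP,São Paulo\n"

good_errors = check_rows(GOOD_CSV, FIELDS)
bad_errors = check_rows(BAD_CSV, FIELDS)
print(good_errors)
print(bad_errors)
```

Running a check like this on every push is what makes a shared registry maintainable when many curators contribute.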
@danfowler - has someone from our team gotten in touch with you yet? I recommend that Andrew check in with you to see if there was any discussion of this. If so, we’d be happy to participate and follow whatever recommendations come out. If not, we’re happy to take a first crack at thinking through what makes sense and report back.
Hi @danfowler or @Jobarratt, this is another technical request… But there are also some issues to be clarified. To continue the Datasets-BR project and to offer a better opportunity to country-specific registries in general, it would be useful to have a consistent namespace at Datahub.io.
It would be natural to use, as “reserved authority names”, core suffixed by country codes: datahub.io/core-br, datahub.io/core-nl, etc.