When I first dove into the world of data science, I was very impressed by R's features. A lot of work has been done by a very large and active community of statisticians and, more generally, scientists.
One very interesting feature of R is that many packages for data science come with a lot of datasets.
Maybe you have already heard of R's Edgar Anderson’s Iris Data: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html
If not, you have probably come across the Survival of passengers on the Titanic dataset: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html
These are “classical” datasets featured in many data science books and on websites about data science and machine learning (such as Kaggle…)
For a complete list of the datasets available in the datasets package, see https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
Many other R packages provide such datasets, which are great for learning data science and for testing machine learning algorithms (classification…)
R provides a nice data structure to deal with datasets: the data frame (data.frame)
So what’s the problem?
The problem is that a lot of people don’t like R! But they like data science, they like machine learning, …
So, in recent years, many alternatives have emerged:
- DataFrames.jl https://github.com/JuliaStats/DataFrames.jl
So there is at least one tool to “play” with data in most languages?
Yes and no.
Yes, many languages now have a library to deal with data…
but what about the data itself?
These example datasets are tied to one language community (R, Python…)
Some projects have been created to access R datasets from Python.
See for example:
Same for Julia with
But there is definitely room for a project to “liberate” data from these language-specific repositories and advocate for a new language-agnostic method for distributing clean datasets.
Could DataPackage be a solution for this?
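To make the idea a bit more concrete, here is a minimal sketch of what packaging one of these classical datasets could look like. It only uses the Python standard library and assumes the basic Data Package layout (a datapackage.json descriptor with a "resources" list, each resource pointing to a CSV file and carrying a "schema" with typed "fields"). The helper names (build_descriptor, load_rows), the path data/iris.csv, the snake_case column names and the naive type guessing are all hypothetical choices made for this illustration, not part of any existing tool.

```python
import csv
import io
import json

# A few rows in the style of the iris dataset, used here only as sample data.
# The snake_case column names are an assumption made for this sketch.
IRIS_SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def build_descriptor(name, csv_path, csv_text):
    """Build a minimal Data Package style descriptor for one CSV resource.

    Field types are guessed naively from the first data row:
    anything that parses as a float is "number", everything else "string".
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    first_row = next(reader)
    fields = []
    for column, value in zip(header, first_row):
        try:
            float(value)
            field_type = "number"
        except ValueError:
            field_type = "string"
        fields.append({"name": column, "type": field_type})
    return {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": csv_path,  # hypothetical location of the CSV file
                "format": "csv",
                "schema": {"fields": fields},
            }
        ],
    }


def load_rows(descriptor, csv_text):
    """Parse CSV rows, casting each value according to the descriptor's schema."""
    fields = descriptor["resources"][0]["schema"]["fields"]
    casts = {f["name"]: (float if f["type"] == "number" else str) for f in fields}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: casts[key](value) for key, value in row.items()} for row in reader]


descriptor = build_descriptor("iris", "data/iris.csv", IRIS_SAMPLE_CSV)
rows = load_rows(descriptor, IRIS_SAMPLE_CSV)

# This JSON is what would be saved as datapackage.json next to the CSV file.
print(json.dumps(descriptor, indent=2))
```

The point is that the descriptor is plain JSON and the data is plain CSV, so R, Python, Julia or any other language could read the same package with its own dataframe library.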
It would be great if the OKFN community could help with this!
I have set up an organization for this
And a possible roadmap is available
Such a project could bring some visibility to OKFN and the DataPackage format,
and it could serve as a lab for the DataPackage format.
It would be nice to get feedback on the interest in such a project.