Hello,
When I first dived into the world of data science, I was very impressed by R's features. A lot of work has been done by a very large and active community of statisticians and, more generally, scientists.
One very interesting feature of R is that many packages for data science come with a lot of datasets.
Maybe you have already heard of Edgar Anderson's Iris Data (R: Edgar Anderson's Iris Data).
If not, you probably know the Survival of passengers on the Titanic dataset (R: Survival of passengers on the Titanic).
These are "classical" datasets that appear in many data science books and on websites about data science and machine learning (such as Kaggle…).
For a complete list of the datasets available in the datasets package, see R: The R Datasets Package.
Many other R packages provide such datasets, which are great for learning data science and for testing machine learning algorithms (classification…).
R also provides a nice data structure for working with datasets: the data frame.
So what's the problem?
The problem is that a lot of people don't like R, but they do like data science and machine learning.
So, in recent years, many alternatives have emerged:
- For Pythonists
  - Python Pandas http://pandas.pydata.org/
  - xarray http://xarray.pydata.org/ (for data with more than two dimensions)
- For Julia
- …
So is there at least one tool to "play" with data for most languages?
Yes and no.
Yes, many languages now have a library for working with data…
but what about the data itself?
These example datasets are tied to one language community (R, Python…).
Some projects have been created to access R datasets from Python.
See for example:
PyDataset (GitHub - iamaziz/PyDataset: Instant access to many datasets in Python).
The same exists for Julia with
RDatasets.jl (GitHub - JuliaStats/RDatasets.jl: Julia package for loading many of the data sets available in R).
But there is definitely room for a project to "liberate" data from these language-specific repositories and to advocate for a new language-agnostic method for distributing clean datasets.
Could DataPackage be a solution for this?
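To make the idea concrete, here is a minimal sketch of what a language-agnostic dataset could look like: a CSV file plus a `datapackage.json` descriptor listing its resources and field types. It uses only the Python standard library rather than any Data Package tooling. The package name, resource name, file name, sample rows, and the helper functions `write_package`, `read_resource`, and `round_trip` are all hypothetical and exist only for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Illustrative sample rows (hypothetical, not taken from a published dataset).
ROWS = [
    {"sepal_length": "5.1", "species": "setosa"},
    {"sepal_length": "7.0", "species": "versicolor"},
]

# Descriptor in the Data Package layout: a name plus a list of resources,
# each pointing at a file and describing its fields in a schema.
DESCRIPTOR = {
    "name": "example-dataset",
    "resources": [
        {
            "name": "measurements",
            "path": "measurements.csv",
            "schema": {
                "fields": [
                    {"name": "sepal_length", "type": "number"},
                    {"name": "species", "type": "string"},
                ]
            },
        }
    ],
}

# Only the field types used in this sketch are handled.
CASTS = {"number": float, "integer": int, "string": str}


def write_package(folder: Path) -> None:
    """Write the descriptor and its CSV resource into `folder`."""
    resource = DESCRIPTOR["resources"][0]
    names = [field["name"] for field in resource["schema"]["fields"]]
    with open(folder / resource["path"], "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=names)
        writer.writeheader()
        writer.writerows(ROWS)
    (folder / "datapackage.json").write_text(json.dumps(DESCRIPTOR, indent=2), encoding="utf-8")


def read_resource(folder: Path, resource_name: str) -> list:
    """Load one resource, casting each column according to the schema."""
    descriptor = json.loads((folder / "datapackage.json").read_text(encoding="utf-8"))
    resource = next(r for r in descriptor["resources"] if r["name"] == resource_name)
    types = {f["name"]: CASTS[f["type"]] for f in resource["schema"]["fields"]}
    with open(folder / resource["path"], newline="", encoding="utf-8") as f:
        return [
            {name: types[name](value) for name, value in row.items()}
            for row in csv.DictReader(f)
        ]


def round_trip() -> list:
    """Write the package to a temporary folder and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        write_package(Path(tmp))
        return read_resource(Path(tmp), "measurements")


if __name__ == "__main__":
    for row in round_trip():
        print(row)
```

Because the descriptor is plain JSON and the data is plain CSV, the same two files could be read from R, Julia, or any other language without going through a language-specific dataset package.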
It would be great if the OKFN community could help with this!
I have set up an organization for this.
A possible roadmap is also available.
Such a project could give some visibility to OKFN and the DataPackage format,
and it could serve as a lab for the DataPackage format.
It would be nice to get feedback about interest in such a project.
Kind regards