Put R datasets in a reusable, language-agnostic format (such as DataPackage)


#1

Hello,

When I first dived into the world of data science, I was very impressed by R’s features. A lot of work has been done by a very large and active community of statisticians and, more generally, scientists.

One very interesting feature of R is that many packages for data science come with a lot of datasets.

Maybe you have already heard of Edgar Anderson’s Iris Data in R: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html

If not, you have probably come across the Survival of passengers on the Titanic dataset: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html

These are “classical” datasets that appear in many data science books and on websites about data science and machine learning (such as Kaggle…)

For a complete list of the datasets available in the datasets package, see https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

Many other R packages provide similar datasets, which are great for learning data science and for testing machine learning algorithms (classification…)

R provides a nice data structure for dealing with datasets: the data frame.

So what’s the problem?

The problem is that a lot of people don’t like R, but they do like data science, machine learning, …

So, in recent years, many alternatives have emerged.

So is there at least one tool to “play” with data in most languages?

Yes and no.

Yes, many languages now have a library to deal with data…
but what about the data itself?

These example datasets are tied to one language community (R, Python…)

Some projects already exist to access R datasets from Python.

See for example:

PyDataset https://github.com/iamaziz/PyDataset

Same for Julia with

RDatasets.jl https://github.com/johnmyleswhite/RDatasets.jl

But there is definitely room for a project to “liberate” data from these language-specific repositories and to advocate for a new language-agnostic method for distributing clean datasets.

Could DataPackage be a solution for this?
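
To make the idea a bit more concrete, here is a rough sketch (Python standard library only) of what exporting one R dataset as a Data Package could look like: a CSV file plus a `datapackage.json` descriptor next to it. The column names, the two sample rows and the helper names `build_descriptor` and `write_package` are my own assumptions for illustration and do not come from any existing tool; only the descriptor keys (`name`, `resources`, `path`, `schema`, `fields`, `type`) follow the Data Package and Table Schema specs.

```python
import csv
import json
import tempfile
from pathlib import Path

# Illustrative columns and two sample rows in the shape of R's iris dataset.
# The column names are an assumption; a real export would take them from R.
HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
ROWS = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [4.9, 3.0, 1.4, 0.2, "setosa"],
]


def build_descriptor(name, csv_path, header, rows):
    """Build a minimal Data Package descriptor for a single CSV resource.

    Hypothetical helper: column types are guessed very naively
    ("number" if every value is numeric, otherwise "string").
    """
    fields = []
    for i, column in enumerate(header):
        is_number = all(isinstance(row[i], (int, float)) for row in rows)
        fields.append({"name": column, "type": "number" if is_number else "string"})
    return {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": csv_path,  # relative to datapackage.json
                "format": "csv",
                "schema": {"fields": fields},
            }
        ],
    }


def write_package(directory, name, header, rows):
    """Write data/<name>.csv and datapackage.json into `directory`."""
    directory = Path(directory)
    (directory / "data").mkdir(parents=True, exist_ok=True)
    csv_rel = f"data/{name}.csv"
    with open(directory / csv_rel, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    descriptor = build_descriptor(name, csv_rel, header, rows)
    with open(directory / "datapackage.json", "w", encoding="utf-8") as fh:
        json.dump(descriptor, fh, indent=2)
    return descriptor


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(json.dumps(write_package(tmp, "iris", HEADER, ROWS), indent=2))
```

Any language with a CSV reader and a JSON parser could then load the dataset and its column types without going through R, which is the whole point of the proposal.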

It would be great if the OKFN community could help with this!

I have set up an organization for this:
https://github.com/Rdatasets

And a possible roadmap is available

Such a project could bring some visibility to OKFN and the DataPackage format,
and it could serve as a lab for the DataPackage format.

It would be nice to get feedback on whether there is interest in such a project.

Kind regards


#2

Hi, have you seen this? http://okfnlabs.org/blog/2016/07/14/using-data-packages-with-r.html

Not sure if it answers all your questions, but it might provide some input in case you didn’t know about it.

P.S.: R is awesome :heart:


#3

Hi @victornitu,

Thanks for this link.
I was planning to use https://github.com/frictionlessdata/datapackage-r in this R script instead of dpmr
but I’m quite busy these days.

I’d be pleased if you could have a look, as you seem to love R (I’m more of a Python/Pandas lover).

Kind regards


#4

Hey! Have you come across feather https://github.com/wesm/feather? Super cool for reading and writing dataframes in R and Python, though not for long-term archiving of datasets.

I discovered it recently and feather’s become an integral part of my data processing workflow.


#6

Yes we have :slight_smile: - thanks for flagging though!

It does serve slightly different (and complementary) purposes!