Put R datasets in a reusable, language-agnostic format (such as DataPackage)

Hello,

When I first dove into the world of data science, I was very impressed by R's features. A lot of work has been done by a large and active community of statisticians and, more generally, scientists.

One very interesting feature of R is that many packages for data science come with a lot of datasets.

You may already have heard of Edgar Anderson’s Iris Data: R: Edgar Anderson's Iris Data

If not, you have probably heard of the Survival of passengers on the Titanic dataset: R: Survival of passengers on the Titanic

These are “classic” datasets featured in many data science books and on websites about data science and machine learning (such as Kaggle…).

For a complete list of the datasets available in the datasets package, see R: The R Datasets Package

Many other R packages provide such datasets, which are great for learning data science and for testing machine learning algorithms (classification…).

R provides a nice data structure for dealing with datasets: the data frame.

So what’s the problem?

The problem is that a lot of people don’t like R! But they do like data science, they like machine learning, …

So, in recent years, many alternatives have emerged:

So is there at least one tool to “play” with data for most languages?

Yes and no.

Yes, many languages now have a library to deal with data…
but what about the data themselves?

These example datasets are tied to one language community (R, Python…)

Some projects have been created to access R datasets from Python.

See for example:

PyDataset GitHub - iamaziz/PyDataset: Instant access to many datasets in Python.

Same for Julia with

RDatasets.jl GitHub - JuliaStats/RDatasets.jl: Julia package for loading many of the data sets available in R

But there is definitely room for a project to “liberate” data from these language-specific repositories and to advocate for a new language-agnostic method of distributing clean datasets.

Could DataPackage be a solution for this?
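To make the idea a bit more concrete, here is a minimal sketch of what packaging one R dataset could look like, using only the Python standard library. It assumes the usual Data Package layout (a `datapackage.json` descriptor next to a CSV resource described by a Table Schema). The column names, the two sample rows shaped like iris, and the helper names `write_package` and `read_package` are illustrative choices of mine, not something taken from an existing tool.

```python
import csv
import json
import tempfile
from pathlib import Path

# Illustrative columns and rows shaped like R's iris dataset.
# These are placeholders for the sketch, not a full export of the dataset.
FIELDS = [
    ("sepal_length", "number"),
    ("sepal_width", "number"),
    ("petal_length", "number"),
    ("petal_width", "number"),
    ("species", "string"),
]
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
]


def write_package(target_dir, name, fields, rows):
    """Write <name>.csv plus a datapackage.json descriptor (hypothetical helper)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # The data itself is plain CSV with a header row.
    data_path = target / f"{name}.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([field_name for field_name, _ in fields])
        writer.writerows(rows)

    # The descriptor lists the resource and the type of each column.
    descriptor = {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": data_path.name,
                "format": "csv",
                "schema": {
                    "fields": [{"name": n, "type": t} for n, t in fields]
                },
            }
        ],
    }
    descriptor_path = target / "datapackage.json"
    descriptor_path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return descriptor_path


def read_package(descriptor_path):
    """Read the first resource of a package back as a list of dicts (hypothetical helper)."""
    descriptor_path = Path(descriptor_path)
    descriptor = json.loads(descriptor_path.read_text(encoding="utf-8"))
    resource = descriptor["resources"][0]
    data_path = descriptor_path.parent / resource["path"]
    with data_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        descriptor_file = write_package(tmp, "iris", FIELDS, ROWS)
        print(descriptor_file.read_text(encoding="utf-8"))
        print(read_package(descriptor_file))
```

Since the result is only JSON and CSV, any language with a JSON parser and a CSV reader could load it, which is the point of a language-agnostic format. Note that this simple reader returns every value as a string; a real client would use the schema types to convert them.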

It would be great if the OKFN community could help with this!

I have set up an organization for this

And a possible roadmap is available

Such a project could bring some visibility to OKFN and the DataPackage format,
and it could serve as a lab for the DataPackage format.

It would be nice to have feedback about the interest in such a project.

Kind regards


Hi, have you seen this? Using Data Packages with R - Open Knowledge Labs

Not sure if it answers all your questions, but it might provide some input in case you didn’t know about it.

P.S.: R is awesome :heart:


Hi @victornitu,

Thanks for this link.
I was planning to use datapackage-r (GitHub - frictionlessdata/datapackage-r: An R package for working with Data Package.) in this R script instead of dpmr,
but I’m quite busy these days.

I’d be pleased if you could have a look, as you seem to love R (I’m more of a Python/Pandas lover).

Kind regards


Hey! Have you come across feather (GitHub - wesm/feather: Feather: fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow)? It's super cool for reading and writing data frames in R and Python, though not for long-term archiving of datasets.

I discovered it recently and feather’s become an integral part of my data processing workflow.

Yes we have :slight_smile: - thanks for flagging though!

It does serve slightly different (and complementary) purposes!