DataPackage for 3 dimensional arrays (and maybe more)

scls · February 27, 2016, 9:26am

Hello,

I’d like to be able to store this kind of dataset in a DataPackage.

wch/r-source/blob/trunk/src/library/datasets/data/HairEyeColor.R

HairEyeColor <-
array(c(32, 53, 10, 3, 11, 50, 10, 30, 10, 25, 7, 5, 3, 15, 7, 8,
        36, 66, 16, 4,  9, 34,  7, 64,  5, 29, 7, 5, 2, 14, 7, 8),
      dim = c(4, 4, 2),
      dimnames =
      list(Hair = c("Black", "Brown", "Red", "Blond"),
           Eye = c("Brown", "Blue", "Hazel", "Green"),
           Sex = c("Male", "Female")))
           
class(HairEyeColor) <- "table"

It looks like

, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

I wonder if DataPackage is able to store such a dataset (and maybe one with more dimensions).

In Python, the difference is comparable to that between a pandas DataFrame (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) and xarray (N-D labeled arrays and datasets).

Kind regards

scls · February 27, 2016, 9:29am

A related issue: “Can't create datapackage for datasets::HairEyeColor” (Issue #12 in christophergandrud/dpmr on GitHub).

Another “similar” dataset:

wch/r-source/blob/trunk/src/library/datasets/data/Titanic.R

Titanic <-
array(c(  0,   0,  35,   0,
          0,   0,  17,   0,
        118, 154, 387, 670,
          4,  13,  89,   3,
          5,  11,  13,   0,
          1,  13,  14,   0,
         57,  14,  75, 192,
        140,  80,  76,  20),
      dim = c(4, 2, 2, 2),
      dimnames =
      list(Class = c("1st", "2nd", "3rd", "Crew"),
           Sex = c("Male", "Female"),
           Age = c("Child", "Adult"),
           Survived = c("No", "Yes")))
class(Titanic) <- "table"

> Titanic
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20

rufuspollock · February 29, 2016, 3:10pm

This is really useful. I have actually been personally thinking about the Pandas / R dataframe model for a while and recently posted a stub repo about it here (thoughts / issues welcome):

As I understand your examples you have 3 dimensions:

Hair color
Eye color
Sex

You can store this either as a 3-D array (4x4x2) or as normalized data (4 columns: 3 for the dimensions and one for the measure).

So question, I guess, is how would you store this in a tabular data package? The natural way would be to “normalize” …
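A minimal sketch of that normalization in plain Python, assuming the cell values arrive in R's storage order (column-major, first dimension varying fastest). The measure column name `Freq` and the helper `normalize` are choices made for this illustration, not part of any Data Package spec.

```python
from itertools import product

# Cell counts in the order R stores them: column-major, so the first
# dimension (Hair) varies fastest and the last (Sex) slowest.
values = [32, 53, 10, 3, 11, 50, 10, 30, 10, 25, 7, 5, 3, 15, 7, 8,
          36, 66, 16, 4, 9, 34, 7, 64, 5, 29, 7, 5, 2, 14, 7, 8]

dimnames = {
    "Hair": ["Black", "Brown", "Red", "Blond"],
    "Eye": ["Brown", "Blue", "Hazel", "Green"],
    "Sex": ["Male", "Female"],
}


def normalize(values, dimnames, measure="Freq"):
    """Return one dict per cell: a column per dimension plus the measure."""
    names = list(dimnames)
    reversed_names = names[::-1]
    # itertools.product varies its LAST argument fastest, so iterate over the
    # reversed dimensions to reproduce R's first-dimension-fastest order.
    combos = product(*(dimnames[name] for name in reversed_names))
    rows = []
    for combo, value in zip(combos, values):
        labels = dict(zip(reversed_names, combo))
        row = {name: labels[name] for name in names}  # Hair, Eye, Sex order
        row[measure] = value
        rows.append(row)
    return rows


rows = normalize(values, dimnames)
print(rows[0])    # {'Hair': 'Black', 'Eye': 'Brown', 'Sex': 'Male', 'Freq': 32}
print(len(rows))  # 32
```

Each row could then be written out as one CSV line, giving the 4-column table described above.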

BUT you would want some way to store a hint about how to load this into the DataFrame, right? That’s the key question. What you really want is some way to say which column is the “value” and which are the dimensions. For that it is worth looking at this open issue:

github.com/frictionlessdata/specs: “Spec for model/cube” (New Spec), opened by pwalsh, 10:00AM - 30 Dec 15 UTC

Fiscal Data Package has a [`mapping` object](http://fiscal.dataprotocols.org/spe…c/#mapping). This is very very handy for building a logical model out of the physical data sources when appropriate. This logical model can in turn be used to automate visualisations and data loaders, for example. Actually, there is nothing particularly "Fiscal" about this `mapping`: it is simply an OLAP cube implementation with measures and dimensions. I think we could extract out the generic pattern and expose it as a spec for declaring a model/cube mapping for any tabular data package.
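To make the idea concrete, here is a rough sketch of what such a hint could look like on a tabular resource. The `schema.fields` part follows Table Schema; the `cube` key, its `dimensions` and `measures` lists, the file name and the helper `split_fields` are all hypothetical, invented for this example and not taken from the issue or any published spec.

```python
import json

# Table Schema part: standard "fields" with "name" and "type".
# The "cube" key is NOT part of any published spec; it is a hypothetical
# placeholder for the dimension/measure hint discussed in the thread.
descriptor = {
    "name": "hair-eye-color",
    "resources": [{
        "name": "hair-eye-color",
        "path": "hair-eye-color.csv",  # assumed file name
        "schema": {
            "fields": [
                {"name": "Hair", "type": "string"},
                {"name": "Eye", "type": "string"},
                {"name": "Sex", "type": "string"},
                {"name": "Freq", "type": "integer"},
            ]
        },
        "cube": {
            "dimensions": ["Hair", "Eye", "Sex"],
            "measures": ["Freq"],
        },
    }],
}


def split_fields(resource):
    """Use the hypothetical hint to tell index columns from value columns."""
    field_names = [field["name"] for field in resource["schema"]["fields"]]
    hint = resource["cube"]
    unknown = set(hint["dimensions"] + hint["measures"]) - set(field_names)
    if unknown:
        raise ValueError(f"hint refers to missing fields: {sorted(unknown)}")
    return hint["dimensions"], hint["measures"]


dims, measures = split_fields(descriptor["resources"][0])
print(dims, measures)
print(json.dumps(descriptor, indent=2))
```

A loader that understands the hint could use `dims` as the index and `measures` as the values; one that does not would still see an ordinary 4-column table.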

What else would you need for R to be able to load losslessly?

scls · February 29, 2016, 4:21pm

Hi Rufus,

You said

So question, I guess, is how would you store this in a tabular data package? The natural way would be to “normalize” …

“Normalize” is a polite way of saying “flatten” here.

My goal is to store R datasets in a reusable format to be able to load losslessly both using R and Python (and maybe more)

So it’s quite hard to say for now what we need, as my approach is only experimental…

see this R code here

Rdatasets/Rdatasets/blob/master/Rdatasets2dpkg.R

library(R2HTML)

require(dpmr)

#packages = c("datasets", "boot", "KMsurv", "robustbase", "car", "cluster", "COUNT", "Ecdat", "gap", "ggplot2", "HistData", "lattice", "MASS", "plm", "plyr", "pscl", "reshape2", "rpart", "sandwich", "sem",  "survival", "vcd", "Zelig", "HSAUR", "psych", "quantreg", "geepack", "texmex", "multgee", "evir", "lme4")
packages = c("datasets")
# Install only the packages that are not already installed.
# Credits: http://stackoverflow.com/a/9345167/756986
new.packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos="http://cran.rstudio.com")
index = data(package=packages)$results[,c(1,3,4)]
index = data.frame(index, stringsAsFactors=FALSE)
index_out = NULL

# Load packages which store datasets
for (i in packages) {
        library(i, character.only=TRUE)
        print(i)
}

(This file has been truncated.)

HairEyeColor was the first dataset to run into problems when saving to a DataPackage (without flattening), but I will probably find other datasets that cause problems too.

For example, how could we manage a hierarchical index?

scls · March 1, 2016, 8:49am

With xarray we can do:

import xarray

da = xarray.DataArray([[[32, 53, 10, 3],
                        [11, 50, 10, 30],
                        [10, 25, 7, 5],
                        [3, 15, 7, 8]],
                       [[36, 66, 16, 4],
                        [ 9, 34,  7, 64],
                        [ 5, 29, 7, 5],
                        [2, 14, 7, 8]]],
                      # R fills arrays column-major, so in this nested list
                      # the innermost axis is Hair and the middle one is Eye.
                      name='Number', dims=['Sex', 'Eye', 'Hair'],
                      coords=[['Male', 'Female'],
                              ['Brown', 'Blue', 'Hazel', 'Green'],
                              ['Black', 'Brown', 'Red', 'Blond']])

which can be converted to a pandas Series with a hierarchical index:

In [10]: da.to_series()
Out[10]:
Sex     Eye    Hair
Male    Brown  Black    32
               Brown    53
               Red      10
               Blond     3
        Blue   Black    11
               Brown    50
               Red      10
               Blond    30
        Hazel  Black    10
               Brown    25
               Red       7
               Blond     5
        Green  Black     3
               Brown    15
               Red       7
               Blond     8
Female  Brown  Black    36
               Brown    66
               Red      16
               Blond     4
        Blue   Black     9
               Brown    34
               Red       7
               Blond    64
        Hazel  Black     5
               Brown    29
               Red       7
               Blond     5
        Green  Black     2
               Brown    14
               Red       7
               Blond     8
Name: Number, dtype: int64
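A hierarchical index of this kind is essentially the long (normalized) form, so going back to an N-D structure only needs the dimension order and the order of the categories within each dimension. The sketch below does that round trip in plain Python on a 2x2x2 corner of HairEyeColor (values taken from the R printout); the helper `to_nested` and the `Freq` column name are illustrative assumptions.

```python
def to_nested(rows, dims, categories, measure):
    """Rebuild nested lists (outermost level = dims[0]) from long-format rows."""
    def empty(level):
        # Innermost level holds the cells; None marks a combination with no row.
        if level == len(dims) - 1:
            return [None] * len(categories[dims[level]])
        return [empty(level + 1) for _ in categories[dims[level]]]

    cube = empty(0)
    for row in rows:
        cell = cube
        for dim in dims[:-1]:
            cell = cell[categories[dim].index(row[dim])]
        last = dims[-1]
        cell[categories[last].index(row[last])] = row[measure]
    return cube


# Category order has to travel with the data: "Male, Female" and
# "Brown, Blue" are not alphabetical, so the rows alone cannot recover it.
categories = {
    "Sex": ["Male", "Female"],
    "Eye": ["Brown", "Blue"],
    "Hair": ["Black", "Brown"],
}

# Rows are deliberately shuffled: the result does not depend on row order.
rows = [
    {"Sex": "Female", "Eye": "Blue", "Hair": "Brown", "Freq": 34},
    {"Sex": "Male", "Eye": "Brown", "Hair": "Black", "Freq": 32},
    {"Sex": "Male", "Eye": "Blue", "Hair": "Brown", "Freq": 50},
    {"Sex": "Female", "Eye": "Brown", "Hair": "Black", "Freq": 36},
    {"Sex": "Male", "Eye": "Brown", "Hair": "Brown", "Freq": 53},
    {"Sex": "Female", "Eye": "Brown", "Hair": "Brown", "Freq": 66},
    {"Sex": "Male", "Eye": "Blue", "Hair": "Black", "Freq": 11},
    {"Sex": "Female", "Eye": "Blue", "Hair": "Black", "Freq": 9},
]

cube = to_nested(rows, ["Sex", "Eye", "Hair"], categories, "Freq")
print(cube)  # [[[32, 53], [11, 50]], [[36, 66], [9, 34]]]
```

This suggests that a lossless format needs to record at least the dimension order and each dimension's category order, in addition to the flattened rows.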

danfowler · August 2, 2016, 12:00am

My goal is to store R datasets in a reusable format to be able to load losslessly both using R and Python (and maybe more)

Hello @scls, this is a really interesting use case! My very first thought is to experiment with storing this kind of data as an arbitrary JSON object, as inline data in a Data Package. We can explore that and other options, given that we have two in-progress libraries for working with Data Packages in R and Pandas.
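As a rough illustration of the inline-data idea: a resource's `data` property can carry a JSON object directly. The layout inside it here (`dim`, `dimnames`, `values`, mirroring R's `array()` call), the package name and the helper `cell` are assumptions made for this sketch; nothing in the Data Package spec prescribes them.

```python
import json

# The keys "dim", "dimnames" and "values" mirror R's array() call; they are an
# assumed layout for this sketch, not something the Data Package spec defines.
resource = {
    "name": "hair-eye-color",
    "data": {
        "dim": [4, 4, 2],
        # A list keeps the dimension order explicit (JSON objects are unordered).
        "dimnames": [
            {"name": "Hair", "levels": ["Black", "Brown", "Red", "Blond"]},
            {"name": "Eye", "levels": ["Brown", "Blue", "Hazel", "Green"]},
            {"name": "Sex", "levels": ["Male", "Female"]},
        ],
        # Column-major, as in R: Hair varies fastest, Sex slowest.
        "values": [32, 53, 10, 3, 11, 50, 10, 30, 10, 25, 7, 5, 3, 15, 7, 8,
                   36, 66, 16, 4, 9, 34, 7, 64, 5, 29, 7, 5, 2, 14, 7, 8],
    },
}
descriptor = {"name": "r-datasets", "resources": [resource]}  # hypothetical name


def cell(data, **labels):
    """Look up one count by label, using column-major strides."""
    index, stride = 0, 1
    for size, dimension in zip(data["dim"], data["dimnames"]):
        index += dimension["levels"].index(labels[dimension["name"]]) * stride
        stride *= size
    return data["values"][index]


# Round-trip through JSON text to show that nothing is lost on the way.
loaded = json.loads(json.dumps(descriptor))
inline = loaded["resources"][0]["data"]
print(cell(inline, Hair="Blond", Eye="Blue", Sex="Female"))  # 64
```

The trade-off is that generic tabular tooling would not know how to read such an object, so each consumer (R, Python, ...) would need a small loader like `cell`.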

I wanted to point you to two posts introducing these libraries:

danfowler · February 7, 2017, 9:29pm

@scls @rufuspollock have you seen this?

The JSON-stat format is a simple lightweight JSON format for data dissemination. It is based in a cube model that arises from the evidence that the most common form of data dissemination is the tabular form. In this cube model, datasets are organized in dimensions. Dimensions are organized in categories.
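For comparison, here is a sketch of how HairEyeColor might look as a JSON-stat 2.0 dataset, built as a plain Python dict. It relies on my understanding that JSON-stat stores `value` in row-major order (last dimension in `id` varying fastest) and that a category `index` may be given as an array; please check both against the JSON-stat documentation. The `lookup` helper is only for illustration.

```python
import json

# R's vector is column-major for (Hair, Eye, Sex). Listing the dimensions in
# reverse, (Sex, Eye, Hair), makes the very same vector row-major.
dataset = {
    "version": "2.0",
    "class": "dataset",
    "label": "HairEyeColor",
    "id": ["Sex", "Eye", "Hair"],
    "size": [2, 4, 4],
    "dimension": {
        "Sex": {"category": {"index": ["Male", "Female"]}},
        "Eye": {"category": {"index": ["Brown", "Blue", "Hazel", "Green"]}},
        "Hair": {"category": {"index": ["Black", "Brown", "Red", "Blond"]}},
    },
    "value": [32, 53, 10, 3, 11, 50, 10, 30, 10, 25, 7, 5, 3, 15, 7, 8,
              36, 66, 16, 4, 9, 34, 7, 64, 5, 29, 7, 5, 2, 14, 7, 8],
}


def lookup(ds, **labels):
    """Find one value by category label, assuming row-major ordering."""
    index = 0
    for dim_id, size in zip(ds["id"], ds["size"]):
        position = ds["dimension"][dim_id]["category"]["index"].index(labels[dim_id])
        index = index * size + position
    return ds["value"][index]


print(lookup(dataset, Sex="Male", Eye="Blue", Hair="Black"))     # 11
print(lookup(dataset, Sex="Female", Eye="Blue", Hair="Blond"))   # 64
print(len(json.dumps(dataset)))  # size of the serialized dataset in characters
```

Because dimension order and category order are both part of the document, this form keeps the information that a flattened CSV would lose.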
