DataPackage for 3-dimensional arrays (and maybe more)


#1

Hello,

I’d like to be able to store this kind of dataset in a DataPackage.

It looks like this:

, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

I wonder whether a DataPackage is able to store such a dataset (and maybe ones with more dimensions).

In Python, this is comparable to the difference between a pandas DataFrame http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html and xarray http://xarray.pydata.org/en/stable/ (N-D labeled arrays and datasets).

Kind regards


#2

A related issue: https://github.com/christophergandrud/dpmr/issues/12

Another “similar” dataset:

> Titanic
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20

#3

This is really useful. I have actually been personally thinking about the Pandas / R dataframe model for a while and recently posted a stub repo about it here (thoughts / issues welcome):

As I understand your examples you have 3 dimensions:

  • Hair color
  • Eye color
  • Sex

Either you can store this as a 3-D array (4x4x2) or as normalized data (you’d have 4 columns - 3 for dimensions and one for the measure).

So question, I guess, is how would you store this in a tabular data package? The natural way would be to “normalize” …
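As a rough illustration of what “normalizing” means here, the sketch below flattens the 2x4x4 HairEyeColor counts from post #1 into rows with three dimension columns and one measure column, then writes them as CSV. It uses only the Python standard library. The measure column name `Number` is an assumption made for the example, not something a Data Package requires.

```python
import csv
import io
from itertools import product

# Dimension labels, in the order used by the R printout in post #1.
sexes = ["Male", "Female"]
hairs = ["Black", "Brown", "Red", "Blond"]
eyes = ["Brown", "Blue", "Hazel", "Green"]

# counts[sex][hair][eye], copied from the HairEyeColor tables in post #1.
counts = [
    [[32, 11, 10, 3], [53, 50, 25, 15], [10, 10, 7, 7], [3, 30, 5, 8]],
    [[36, 9, 5, 2], [66, 34, 29, 14], [16, 7, 7, 7], [4, 64, 5, 8]],
]


def normalize(counts):
    """Yield one (Sex, Hair, Eye, Number) row per cell of the 3-D array."""
    for (i, sex), (j, hair), (k, eye) in product(
        enumerate(sexes), enumerate(hairs), enumerate(eyes)
    ):
        yield sex, hair, eye, counts[i][j][k]


rows = list(normalize(counts))

# Write the normalized rows as CSV text ("Number" is an assumed column name).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Sex", "Hair", "Eye", "Number"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting table has 32 rows, one per cell, so nothing is lost except the information about which columns are dimensions and which is the measure.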

BUT you would want some way to store the hint about how to load this into the DataFrame, right? That’s the key question. What you really want is some way to say which column is the “value” and which are the dimensions. For that it is worth looking at this open issue:
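One possible way to carry such a hint, sketched below, is to reuse the Table Schema `primaryKey` property: the key fields are read as dimensions and every other field as a measure. This is only a convention assumed for the example, not something the specification defines, and the package name, file name, and `Number` field are hypothetical.

```python
import json

# Hypothetical descriptor. "primaryKey" is a real Table Schema property, but
# reading it as "these fields are the dimensions" is an assumption of this sketch.
descriptor = {
    "name": "hair-eye-color",
    "resources": [{
        "name": "hair-eye-color",
        "path": "hair-eye-color.csv",
        "schema": {
            "fields": [
                {"name": "Sex", "type": "string"},
                {"name": "Hair", "type": "string"},
                {"name": "Eye", "type": "string"},
                {"name": "Number", "type": "integer"},
            ],
            "primaryKey": ["Sex", "Hair", "Eye"],
        },
    }],
}


def split_fields(schema):
    """Return (dimension names, measure names) under the assumed convention."""
    dims = list(schema["primaryKey"])
    measures = [f["name"] for f in schema["fields"] if f["name"] not in dims]
    return dims, measures


def to_cube(rows, dims, measure):
    """Index normalized rows by their tuple of dimension values."""
    return {tuple(row[d] for d in dims): row[measure] for row in rows}


# A few normalized rows, taken from the tables in post #1.
rows = [
    {"Sex": "Male", "Hair": "Black", "Eye": "Brown", "Number": 32},
    {"Sex": "Male", "Hair": "Black", "Eye": "Blue", "Number": 11},
    {"Sex": "Female", "Hair": "Blond", "Eye": "Blue", "Number": 64},
]

schema = descriptor["resources"][0]["schema"]
dims, measures = split_fields(schema)
cube = to_cube(rows, dims, measures[0])
print(json.dumps(descriptor, indent=2))
```

A loader that knows the convention could then rebuild an N-D structure from the flat table without guessing which column holds the values.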

What else would you need for R to be able to load losslessly?


#4

Hi Rufus,

You said

So question, I guess, is how would you store this in a tabular data package? The natural way would be to “normalize” …

“Normalize” is a politically correct way of saying “flatten” here :wink:

My goal is to store R datasets in a reusable format to be able to load losslessly both using R and Python (and maybe more)

So it’s quite hard to say for now what we need, as my approach is only experimental…

See this R code here

HairEyeColor was the first dataset to run into problems when saving to a DataPackage (without flattening),
but I will probably find other datasets that lead to similar problems.

For example, how could we manage a hierarchical index?


#5

With xarray we can do:

import xarray

# Axis order is (Sex, Hair, Eye): each inner row is one hair color,
# holding the counts for Brown, Blue, Hazel and Green eyes.
da = xarray.DataArray([[[32, 11, 10, 3],
                        [53, 50, 25, 15],
                        [10, 10, 7, 7],
                        [3, 30, 5, 8]],
                       [[36, 9, 5, 2],
                        [66, 34, 29, 14],
                        [16, 7, 7, 7],
                        [4, 64, 5, 8]]],
                      name='Number', dims=['Sex', 'Hair', 'Eye'],
                      coords=[['Male', 'Female'],
                              ['Black', 'Brown', 'Red', 'Blond'],
                              ['Brown', 'Blue', 'Hazel', 'Green']])

which can be converted to a pandas Series with a hierarchical index:

In [10]: da.to_series()
Out[10]:
Sex     Hair   Eye
Male    Black  Brown    32
               Blue     11
               Hazel    10
               Green     3
        Brown  Brown    53
               Blue     50
               Hazel    25
               Green    15
        Red    Brown    10
               Blue     10
               Hazel     7
               Green     7
        Blond  Brown     3
               Blue     30
               Hazel     5
               Green     8
Female  Black  Brown    36
               Blue      9
               Hazel     5
               Green     2
        Brown  Brown    66
               Blue     34
               Hazel    29
               Green    14
        Red    Brown    16
               Blue      7
               Hazel     7
               Green     7
        Blond  Brown     4
               Blue     64
               Hazel     5
               Green     8
Name: Number, dtype: int64

#6

My goal is to store R datasets in a reusable format to be able to load losslessly both using R and Python (and maybe more)

Hello @scls, this is a really interesting use case! My very first thought is to experiment with storing this kind of data as an arbitrary JSON object, as inline data in a Data Package. We can explore that and other options given that we have two in-progress libraries for working with Data Packages in R and Pandas.
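To make the inline-data idea concrete, here is a minimal sketch of a descriptor whose resource carries the cube in its `data` property instead of pointing at a file. The layout of the JSON object (`dims`, `coords`, `values`) and the `lookup` helper are assumptions made up for this example; the Data Package specification does not prescribe them.

```python
import json

# Hypothetical layout for the inline object: nested lists plus axis labels.
descriptor = {
    "name": "hair-eye-color",
    "resources": [{
        "name": "hair-eye-color",
        "data": {
            "dims": ["Sex", "Hair", "Eye"],
            "coords": {
                "Sex": ["Male", "Female"],
                "Hair": ["Black", "Brown", "Red", "Blond"],
                "Eye": ["Brown", "Blue", "Hazel", "Green"],
            },
            # values[sex][hair][eye], copied from the tables in post #1.
            "values": [
                [[32, 11, 10, 3], [53, 50, 25, 15], [10, 10, 7, 7], [3, 30, 5, 8]],
                [[36, 9, 5, 2], [66, 34, 29, 14], [16, 7, 7, 7], [4, 64, 5, 8]],
            ],
        },
    }],
}


def lookup(cube, **labels):
    """Return the cell addressed by one label per dimension."""
    cell = cube["values"]
    for dim in cube["dims"]:
        cell = cell[cube["coords"][dim].index(labels[dim])]
    return cell


# Round-trip through JSON text to check that nothing is lost.
text = json.dumps(descriptor)
restored = json.loads(text)
cube = restored["resources"][0]["data"]
print(lookup(cube, Sex="Female", Hair="Blond", Eye="Blue"))
```

Because the axis names and labels travel with the values, a reader in R or Python can rebuild the array without flattening it first. The trade-off is that generic tabular tooling would not understand this object.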

I wanted to point you to two posts introducing these libraries:


#7

@scls @rufuspollock have you seen this?

The JSON-stat format is a simple lightweight JSON format for data dissemination. It is based in a cube model that arises from the evidence that the most common form of data dissemination is the tabular form. In this cube model, datasets are organized in dimensions. Dimensions are organized in categories.
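For comparison, the HairEyeColor data could look roughly like this as a JSON-stat style dataset. The property names (`id`, `size`, `dimension`, `category`, `index`, `value`) follow my understanding of JSON-stat 2.0 and should be checked against the specification; the `label` text and the `cell` helper are invented for the example. Values are stored as one flat list in row-major order, with the last dimension varying fastest.

```python
import json

# A sketch of a JSON-stat style dataset (property names to be verified
# against the JSON-stat specification; the label is hypothetical).
dataset = {
    "version": "2.0",
    "class": "dataset",
    "label": "Hair and eye color by sex",
    "id": ["Sex", "Hair", "Eye"],
    "size": [2, 4, 4],
    "dimension": {
        "Sex": {"category": {"index": ["Male", "Female"]}},
        "Hair": {"category": {"index": ["Black", "Brown", "Red", "Blond"]}},
        "Eye": {"category": {"index": ["Brown", "Blue", "Hazel", "Green"]}},
    },
    # Flat, row-major values: Sex varies slowest, Eye fastest.
    "value": [
        32, 11, 10, 3, 53, 50, 25, 15, 10, 10, 7, 7, 3, 30, 5, 8,
        36, 9, 5, 2, 66, 34, 29, 14, 16, 7, 7, 7, 4, 64, 5, 8,
    ],
}


def cell(dataset, **labels):
    """Return one value by computing its row-major offset from the labels."""
    offset = 0
    for dim, size in zip(dataset["id"], dataset["size"]):
        position = dataset["dimension"][dim]["category"]["index"].index(labels[dim])
        offset = offset * size + position
    return dataset["value"][offset]


print(json.dumps(dataset)[:60])
print(cell(dataset, Sex="Female", Hair="Blond", Eye="Blue"))
```

The same structure extends to the four-dimensional Titanic table by adding entries to `id`, `size`, and `dimension`.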