Something like R's ordered factors or enums as column type?


#1

I am very interested in the tabular data package format. I have done a lot of work in R cleaning up questionnaire data and sharing it with other researchers, and I have always wondered which data formats to use for exchange. My own workflow, as described here, has been to do all the cleanup and wrangling in one R script, serialize the R structure to disk, and load it into a report script in literate R. However, that is obviously not good for interoperability with people who don’t use R, or for future-proofing.

My problem with just emitting a CSV is that you lose some of the information. Two elements that still seem to be missing from the tabular data specification are ordered factors and NA handling. To discuss the first one: I often work with Likert data, and have to encode the fact that columnA is not just an arbitrary string column, but can contain exactly these five values (or NA): Never, Rarely, Sometimes, Frequently, All the time, and that these five have an ordered relationship.

This becomes important for automatic graphing, because I want to make sure that it always orders these values correctly (and not, say, alphabetically). And I’d love to pull in an open data package which I can send straight to ggplot, without worrying about changing string columns to ordered factors. (This could also be part of the verification: if the specification says that a column can only contain these five strings, any other string would be an error.)
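To make the request concrete, here is a minimal sketch in Python of what a consumer could do with such a declaration. The descriptor layout is an assumption for illustration only: the `enum` and `ordered` keys and the `validate` and `sort_by_level` helpers are hypothetical names, not something taken from the specification.

```python
# Hypothetical field descriptor: "enum" and "ordered" are illustrative keys,
# not taken from the published specification.
FIELD = {
    "name": "columnA",
    "type": "string",
    "enum": ["Never", "Rarely", "Sometimes", "Frequently", "All the time"],
    "ordered": True,
}


def validate(values, field):
    """Return the values that are neither missing (None) nor declared in the enum."""
    allowed = set(field["enum"])
    return [v for v in values if v is not None and v not in allowed]


def sort_by_level(values, field):
    """Sort non-missing values by their declared position, not alphabetically."""
    rank = {level: i for i, level in enumerate(field["enum"])}
    return sorted((v for v in values if v is not None), key=rank.__getitem__)


responses = ["Sometimes", "Never", None, "All the time", "Rarely"]
print(validate(responses, FIELD))       # no offending values in this sample
print(sort_by_level(responses, FIELD))  # declared order, "Never" first
print(validate(["Often"], FIELD))       # "Often" is not a declared level
```

The same list of levels serves both purposes raised above: it is the set of allowed values for verification, and its position order is what a plotting layer would use instead of alphabetical sorting.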

Any hope of getting this added to the specification?


#2

Good question! I would like to know more about this as well. How R-specific is it, though? Do SQL systems support something like ordered factors?


#3

I have been looking into this, because I am looking at processing very
large clicklogs, with a lot of factor-style columns (only fifty possible
different URLs, but 22 million entries etc). Postgres for example has
support for enums with ordering, but they have to be declared ahead of time
(it’s possible to add later, but this is a very costly process, and not
something to be taken lightly).

Julia’s DataFrames.jl supports factors:
http://dataframesjl.readthedocs.org/en/latest/pooling.html?highlight=factor
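For readers unfamiliar with the idea, a factor-style (pooled) column stores each distinct value once and keeps only a small integer code per row. The following is a rough Python sketch of that idea, not the DataFrames.jl or Postgres implementation; `PooledColumn` and the sample URLs are hypothetical.

```python
from array import array


class PooledColumn:
    """Store a repetitive string column as a small pool of levels plus integer codes."""

    def __init__(self, levels):
        self.levels = list(levels)
        if len(self.levels) > 256:
            raise ValueError("this sketch uses one byte per entry (at most 256 levels)")
        self._code = {level: i for i, level in enumerate(self.levels)}
        self.codes = array("B")  # one unsigned byte per row

    def append(self, value):
        # Levels are declared ahead of time; an undeclared value raises KeyError.
        self.codes.append(self._code[value])

    def __getitem__(self, i):
        return self.levels[self.codes[i]]

    def __len__(self):
        return len(self.codes)


urls = PooledColumn(["/home", "/about", "/search"])
for hit in ["/home", "/search", "/home", "/about"]:
    urls.append(hit)

print(urls[1], len(urls), urls.codes.itemsize)
```

With fifty distinct URLs, each of the 22 million entries would cost one byte for its code rather than a full string, and the order of `levels` can double as the ordering of the factor.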


#4

This is a really good question.

  • NA handling (do you mean NaN or n/a btw?) - take a look at https://github.com/dataprotocols/dataprotocols/issues/97 and please add your suggestions. I think adding support for this is quite easy.

  • Regarding ordered factors / enums: this again seems a valuable and simple thing to support in JSON Table Schema. Could you open an issue here https://github.com/dataprotocols/dataprotocols/issues/97 with an outline of what you would be looking for (and perhaps a suggested implementation)?

Summary: very likely to get this into the spec (note: our usual approach is to add these as draft and see one or two trial implementations and then finalize).


#5

Has there been any progress on this subject? Except for an issue on missing values, I could not find an issue on factors.

I am asking because for a project I’m working on we use tabular data packages as an exchange format, and we needed to be able to record the categories present in each of the categorical columns. We now have a working solution, but it would be nice if there were a more widely supported solution for this.
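As an illustration of what recording the categories could look like (this is a guess at the general shape, not the solution mentioned above, which is not described here), a short Python sketch can scan a CSV and attach the distinct values of each categorical column to its field entry. The `categories` key, the `collect_categories` helper, and the sample data are all hypothetical.

```python
import csv
import io
import json


def collect_categories(csv_text, categorical_columns, missing=("", "NA")):
    """Return {column: sorted distinct non-missing values} for the named columns."""
    seen = {name: set() for name in categorical_columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in categorical_columns:
            if row[name] not in missing:
                seen[name].add(row[name])
    return {name: sorted(values) for name, values in seen.items()}


sample = "id,region,answer\n1,North,Rarely\n2,South,NA\n3,North,Never\n"
categories = collect_categories(sample, ["region", "answer"])

# "categories" is an illustrative key, not part of the specification.
fields = [
    {"name": name, "type": "string", "categories": values}
    for name, values in categories.items()
]
print(json.dumps(fields, indent=2))
```

Note that the values come out sorted alphabetically; for an ordered factor the intended order cannot be inferred from the data and would still have to be written down by hand.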


#6

Thanks for the prompt. I added an issue here, feel free to add support, details, and specific implementation suggestions.