I am very interested in the tabular data package format. I have done a lot of work in R cleaning up questionnaire data and sharing it with other researchers, and I have always wondered which data format to use for exchange. My own workflow, as described here, has been to do all the cleanup and wrangling in one R script, serialize the R structure to disk, and load it into a report script in literate R. However, that is obviously not good for interoperability with people who don’t use R, or for future-proofing.
My problem with just emitting a CSV is that you lose some of the information. Two elements that still seem to be missing from the tabular data specification are ordered factors and NA handling. To discuss the first one: I often work with Likert data, and have to encode the fact that columnA is not just an arbitrary string column, but can contain exactly these five values (or NA): Never, Rarely, Sometimes, Frequently, All the time, and that these five have an ordered relationship.
This becomes important for automatic graphing, because I want to make sure that it always orders these categories correctly (and not, say, alphabetically). And I’d love to pull in an open data package which I can send straight to ggplot, without worrying about changing string columns to ordered factors. (This could also be part of the verification: if the specification says that the column can only contain these five strings, it would be an error for it to contain any other string.)
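To make the request concrete, here is a rough sketch of what a consumer could do if the allowed values and their order travelled with the data. It is in Python rather than R only to illustrate the non-R side of the exchange. The names `LEVELS`, `NA_STRINGS`, and `parse_ordered` are made up for illustration and are not part of any specification, and treating an empty cell or the string `NA` as missing is just an assumption.

```python
import csv
import io

# Hypothetical column description: the allowed values, listed in their intended order.
LEVELS = ["Never", "Rarely", "Sometimes", "Frequently", "All the time"]
# Assumption: how a missing answer shows up in the CSV.
NA_STRINGS = {"", "NA"}


def parse_ordered(value, levels=LEVELS):
    """Return the rank of value within levels, or None for a missing answer.

    Any other string is rejected, which is the verification step.
    """
    if value in NA_STRINGS:
        return None
    if value not in levels:
        raise ValueError(f"unexpected value {value!r}; allowed: {levels}")
    return levels.index(value)


# A tiny stand-in for the exchanged CSV file.
sample = io.StringIO("columnA\nSometimes\nNever\nNA\nAll the time\nRarely\n")
answers = [row["columnA"] for row in csv.DictReader(sample)]

# Validate every cell and turn it into a rank (None for missing answers).
ranks = [parse_ordered(a) for a in answers]

# Drop missing answers, then compare the two orderings.
present = [a for a, r in zip(answers, ranks) if r is not None]
by_level = sorted(present, key=LEVELS.index)  # order declared with the data
alphabetical = sorted(present)                # what a plain string column gives

print(by_level)
print(alphabetical)
```

With the declared order, "All the time" ends up last; sorted as plain strings it comes first, which is exactly the graphing problem described above.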
Any hope of getting this added to the specification?