This is a question, perhaps a suggestion for Frictionlessdata specs/CSV dialect
… or perhaps only an opportunity to say “hello I am in the CSV hell of the decimal-separator interpretation”.
Typical problem
In my country we use the comma as decimal separator, but in data-interchange formats like JSON and CSV we accept the point as decimal separator, even when humans read the files.
Imagine that I have an originalSpreadsheet with my dataset. I published it as originalSpreadsheet.csv, saving it as CSV with comma as delimiter. This was no special decision; it just seemed natural, following the comma-separated tradition. So it is natural to think something like:
of course, if I am using comma as delimiter, the decimal-separator is point, not comma, because any CSV cell without quotes will be interpreted as a number.
(and numbers will stay as numbers not strings)
This makes sense to me and to others working and testing in the same environment, so we continue to do the same thing…
One year later, somebody reads the originalSpreadsheet.csv file with my country’s locale settings, in MS Excel or Google Sheets… The numeric column is read as integer, not as decimal (for example, 1.234 is taken as 1234, because the point is treated as a thousands separator).
… In the worst scenario, all the effort to use frictionlessData/specs is lost, and the community goes back to using XLS and XLSX as the data-interchange standard (where the information about the decimal separator isn’t lost).
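To make the failure concrete, here is a minimal Python sketch. The file content and the `parse_number` helper are invented for illustration; real spreadsheet applications apply their own locale rules, so this only imitates the effect.

```python
import csv
import io

# Hypothetical content of originalSpreadsheet.csv, written with "." as decimal separator.
CSV_TEXT = "city,area_km2\nA,1.234\nB,12.500\n"


def parse_number(text, decimal_sep, group_sep):
    """Parse a numeric string under an explicit separator convention."""
    # Drop grouping characters first, then normalise the decimal mark to ".".
    normalized = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(normalized)


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# What the writer meant: point is the decimal separator.
as_written = [parse_number(r["area_km2"], ".", ",") for r in rows]

# What a reader with a comma-decimal locale may assume: point is only grouping.
as_misread = [parse_number(r["area_km2"], ",", ".") for r in rows]

print(as_written)
print(as_misread)
```

The same bytes give 1.234 under the writer's convention and 1234.0 under the reader's, and nothing in the file itself says which one is right.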
Suggestion and rationale
The CSV standard is adaptable: there are 9 environment parameters to define the file format (delimiter, doubleQuote, lineTerminator, quoteChar, etc.)… But there is no parameter to say “my decimal-separator is comma”.
There are good standards for “other necessary environment parameters”, such as ISO/IEC 15897 and CLDR, that really do say “my decimal-separator is comma” when I need it as my locale.
But the problem is not the locale of the CSV-reader agent, which must render the layout as the user wants; the problem is the origin of the dataset: “When I generated originalSpreadsheet.csv, what locale was I using on the CSV-writer agent?”.
This information must stay with the CSV description (the JSON metadata with the CSV Dialect descriptor)… And I do not see it in frictionlessData/specs/csv-dialect.
Is it possible to add a decimal-separator parameter to the csv-dialect?
It would also be an enhancement for digital preservation and data provenance.
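A rough sketch of how a reader could use such a parameter, in Python. The `decimalSeparator` key is hypothetical: it is the parameter being proposed here, not something the current CSV Dialect spec defines. The `read_numbers` helper and the sample data are also invented for illustration.

```python
import csv
import io
import json

# Hypothetical dialect descriptor. "decimalSeparator" is NOT in the spec today;
# it stands for the proposed parameter.
DIALECT_JSON = """
{
  "delimiter": ";",
  "doubleQuote": true,
  "decimalSeparator": ","
}
"""

# Hypothetical file written by an agent whose locale uses comma as decimal mark.
CSV_TEXT = "city;area_km2\nA;1,234\nB;12,5\n"


def read_numbers(csv_text, dialect, column):
    """Read one numeric column, using the writer's declared decimal separator."""
    decimal_sep = dialect.get("decimalSeparator", ".")
    reader = csv.DictReader(
        io.StringIO(csv_text),
        delimiter=dialect.get("delimiter", ","),
        doublequote=dialect.get("doubleQuote", True),
    )
    # Normalise the declared decimal mark to "." before converting.
    return [float(row[column].replace(decimal_sep, ".")) for row in reader]


dialect = json.loads(DIALECT_JSON)
values = read_numbers(CSV_TEXT, dialect, "area_km2")
print(values)
```

With the separator declared by the writer, the reader's own locale no longer matters. This sketch ignores thousands grouping, which a real implementation would also have to handle.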
… Another option is to say “Please be consistent in your community”, saving and reading CSV files with the same locale (and making no other “obvious” assumption about decimal-separator interpretation)… But that is an error-prone stance, as experience has shown: the parameter would be welcome for the sake of reliability.