Does the CSV standard need a decimal-separator parameter?

ppkrauss · February 12, 2018, 12:41pm

This is a question, perhaps a suggestion for Frictionlessdata specs/CSV dialect
… or perhaps only an opportunity to say “hello I am in the CSV hell of the decimal-separator interpretation”.

Typical problem

In my country we use the comma as the decimal separator, but in data-interchange formats like JSON and CSV we accept the point as a human-readable decimal separator.

Imagining… I have an originalSpreadsheet with my dataset. I published my dataset as originalSpreadsheet.csv, saving it as CSV with comma as delimiter (no special decision, only because it seems natural, the comma-separated tradition). So it is natural to think something like:

of course, if I am using comma as delimiter, the decimal-separator is point, not comma, because any CSV cell without quotes will be interpreted as a number.
(and numbers will stay as numbers not strings)

This makes sense to me and to some others working and testing in the same environment, so we continue to do the same thing…

One year later somebody reads the originalSpreadsheet.csv file with my country’s locale settings, with MS-Excel or Google-spreadsheet… The numeric column is read as integer, not as decimal.

… In the worst scenario we lose all the effort put into using frictionlessData/specs, and the community goes back to using XLS and XLSX as the data-interchange standard (where the information about the decimal separator isn’t lost).
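
A minimal sketch of the failure described above, in Python. The parse_number helper is hypothetical (it is not part of any Frictionless library); it only mimics a reader that treats one character as the decimal separator and another as the thousands separator.

```python
def parse_number(text, decimal_char=".", group_char=","):
    """Parse text as a number using explicit decimal and grouping characters.

    Hypothetical helper, for illustration only.
    """
    cleaned = text.replace(group_char, "").replace(decimal_char, ".")
    return float(cleaned)


cell = "1.234"  # written by a producer that uses the point as decimal separator

# Reader that assumes the same convention as the writer.
as_written = parse_number(cell, decimal_char=".", group_char=",")

# Reader with a decimal-comma locale: the point is taken as a thousands separator.
as_misread = parse_number(cell, decimal_char=",", group_char=".")

print(as_written)  # 1.234
print(as_misread)  # 1234.0
```

The same bytes give two different values, and nothing in the CSV file itself says which reading the author intended.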

Suggestion and rationale

The CSV standard is adaptable, there are 9 environment parameters to define the file format (delimiter, doubleQuote, lineTerminator, quoteChar, etc.)… But there is no parameter to say “my decimal-separator is comma”.

There are good standards for “other necessary environment parameters”, such as ISO15897 and CLDR, that really do say “my decimal-separator is comma” when I need it as my locale.

But the problem is not the locale of the CSV-reader agent, which must render the layout as the user wants. The problem is the origin of the dataset: “When I generated originalSpreadsheet.csv, what locale was I using on the CSV-write agent?”.
This information must stay with the CSV description (the JSON metadata with the CSV Dialect descriptor)… And I do not see it at frictionlessData/specs/csv-dialect.

Is it possible to add a decimal-separator parameter to the csv-dialect?

It would also be an enhancement for digital preservation and data provenance.

… Another option is to say “Please be consistent in your community”, saving and reading CSV files with the same locale (and no other “obvious” assumption about decimal-separator interpretation)… But that is an error-prone stance, as experience has shown: the parameter would be welcome for the sake of reliability.

Stephen · February 13, 2018, 1:23am

Have you seen decimalChar in Table Schema | Frictionless Standards ?

pwalsh · February 13, 2018, 6:05am

@ppkrauss

The issue you raise is definitely not a CSV Dialect issue. As @Stephen points out, we handle this, but in the Table Schema specification.
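
For readers who have not used it, here is a rough sketch of what decimalChar looks like on a Table Schema number field, and how a reader could apply it. The field names and sample data are invented, and cast_number is a simplified hand-written cast, not the implementation used by any Frictionless library.

```python
import csv
import io

# A Table Schema descriptor as a Python dict. The field names are made up;
# decimalChar and groupChar are properties of the Table Schema "number" type.
schema = {
    "fields": [
        {"name": "city", "type": "string"},
        {"name": "price", "type": "number", "decimalChar": ",", "groupChar": "."},
    ]
}

# Semicolon-delimited sample data written with a decimal comma.
raw = "city;price\nLuanda;1.234,50\nLa Paz;0,75\n"


def cast_number(text, field):
    """Cast a cell to float using the field's decimalChar/groupChar (simplified)."""
    group_char = field.get("groupChar")
    if group_char:
        text = text.replace(group_char, "")
    return float(text.replace(field.get("decimalChar", "."), "."))


fields = {f["name"]: f for f in schema["fields"]}
rows = []
for row in csv.DictReader(io.StringIO(raw), delimiter=";"):
    row["price"] = cast_number(row["price"], fields["price"])
    rows.append(row)

print(rows)
```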

ppkrauss · February 13, 2018, 10:52am

Hi @stephen and @pwalsh,

I agree that decimalChar has exactly the semantics of the parameter that I was looking for,
so it is one piece of the solution to the problem, thanks!

Well, it seems to be a solution when we are using the “CSV + TableSchema” specifications, as in https://datahub.io/core
where each dataset has a complete description in its metadata… A complex and very detailed datapackage.json…

But the context here is “only CSV”… We can imagine a dataset in a context of low maturity, where the only thing available is the CSV file and a very simple JSON, with some or all 8 parameters left as default.
So, for this new issue, thanks to @stephen, I can express the suggestion more formally:

to add the decimalChar parameter in the specs/csv-dialect specification.

(sorry, too verbose, adding more context and drama)

NOTE

It is not for the USA, where there is no problem, nor for Europe, where we are in a high-maturity context. It is for Angola, Azerbaijan, Bolivia, Bosnia, Brazil… All countries with a different reality and “low digital maturity”, with weak enforcement of national standards, etc., which represent ~50% of the countries using Arabic numerals with decimal comma. In those countries we need to show confidence in our choice of an open-data ecosystem, every day.

Viewing, importing and exporting CSV files are critical operations for serious use. Losing a decimal separator is a bug, and with it we lose reliability in the whole open-data ecosystem. To fix the bug, the information about the decimal separator must be “easy to get”: even when expressed in a TableSchema context, it is not obvious to a CSV parser, nor easy to check for the people who will explain how to open the CSV file.

The coherent use of CSV is more difficult for us… And there is a “historical collective trauma” with CSV in the country’s culture.

Formalizing the suggestion

To reuse the decimalChar of specs/table-schema in specs/csv-dialect… Let’s see in more detail how we can add this parameter:

Consider a kind of hierarchy between specs. specs/csv-dialect (CSV) is simpler and more general than specs/table-schema (TableSchema). Only users with a higher maturity level can adopt TableSchema and describe all details of the table…
Consider the hierarchy as in issue #447, so we can imagine inheritance rules for the decimalChar parameter.
Adding decimalChar to specs/csv-dialect. No impact (? perhaps a little for Goodtables), no conflict. If delimiter=decimalChar, use quotes to represent decimal numbers.
Recognizing specs/csv-dialect/decimalChar from TableSchema. No problem: it will be the “default decimalChar” for all columns, and can be redefined for some columns.
In the datapackage.json the dialect object is in the root and the column specifications are in distant branches, resources/schema/fields, where it is easy to inherit the CSV’s decimalChar definition as default.
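
The inheritance rule and the quoting rule above could be sketched roughly as follows. Note that decimalChar inside dialect is only the proposal made in this post, not part of the published CSV Dialect spec, and effective_decimal_char and the field names are hypothetical.

```python
import csv
import io


def effective_decimal_char(dialect, field):
    """Field-level decimalChar wins; else the proposed dialect-level one; else '.'."""
    return field.get("decimalChar") or dialect.get("decimalChar") or "."


# "decimalChar" in the dialect is the proposal from this thread (an assumption),
# not a property of the published CSV Dialect spec.
dialect = {"delimiter": ",", "decimalChar": ","}
fields = [
    {"name": "area", "type": "number"},                       # inherits ","
    {"name": "ratio", "type": "number", "decimalChar": "."},  # overrides with "."
]

resolved = {f["name"]: effective_decimal_char(dialect, f) for f in fields}
print(resolved)

# When delimiter and decimalChar collide, Python's csv module quotes the cell
# automatically under its default QUOTE_MINIMAL policy.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=dialect["delimiter"], lineterminator="\n")
writer.writerow(["area", "ratio"])
writer.writerow(["12,5", "0.8"])
print(buffer.getvalue())
```

With this rule a file that has no Table Schema at all still carries its decimal separator, and a schema only needs to mention the columns that differ from the dialect default.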

Stephen · February 13, 2018, 11:12am

I’m not sure I fully understand, but if you are in a low-maturity context and only working with CSV, then perhaps saving as a Tab Separated Value file could help - Excel supports this.

If making data packages is hard, Data Curator or Data Package Creator could help (Note: Data Curator can save TSV but doesn’t save the CSV dialect yet)

@serahrono what’s your view on Data Package Creator support for CSV Dialects?

EDIT: @ppkrauss Data Curator should have TSV and CSV Dialect support with our next release tomorrow.

Hope that helps.

gcc · July 15, 2019, 7:23am

Is there a way of specifying the separator when defining the resource to be used by CKAN?
