W3C CSV on the Web - how does it relate to Data Packages?

The W3C recently released some recommendations: see the CSV on the Web Working Group Wiki.

As a publisher of open data, I’d like to understand the relationship between Data Packages and the W3C work so I can adopt the good practice of sharing machine- and human-readable information about the structure of the data.

Any advice on simple tools or practices to adopt would also be appreciated.

Obviously I recommend Data Packages and especially Tabular Data Packages over the W3C spec.

Whilst the W3C work was originally based on Tabular Data Package, it has diverged quite substantially and become quite complex. As a result it is no longer compatible with Data Packages.

I made some comments on this in a thread on the lists.okfn.org mailing list back in October. Relevant excerpts, with some additions:

  • Data Package is obviously more generic - it is not just for Tabular Data. So you can create Data Packages for lots of other kinds of data too.
  • Tabular Data Package and the W3C spec have significant similarities because, originally, the W3C spec was heavily based on Tabular Data Package. However, there has been quite a bit of divergence. Some of this is summarized in this issue: Various improvements including greater alignment with Tabular Data Package and JSON Table Schema · Issue #702 · w3c/csvw · GitHub
  • Also, strictly, Tabular Data Package is a spec for publishing tabular data which says: a) publish CSV, and b) describe the general metadata and data metadata using datapackage.json. The W3C spec is about describing CSV that is on the web. However, de facto this is not a large difference, as you can use JSON Table Schema + Tabular Data Package to describe generic CSV.
  • Tabular Data Package is generally somewhat more modular: it consists of 3 small components, each of which can be used on its own: a) the Data Package spec, b) JSON Table Schema, c) the CSV Dialect Description Format (see the sketch just below).
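
To make that modularity concrete, here is a minimal sketch of a datapackage.json for a Tabular Data Package (the file name data.csv and the field names are invented for illustration): the outer object is the Data Package, the schema block is a JSON Table Schema, and the dialect block uses the CSV Dialect Description Format.

{
  "name": "example-package",
  "resources": [
    {
      "path": "data.csv",
      "dialect": {
        "delimiter": ",",
        "doubleQuote": true
      },
      "schema": {
        "fields": [
          {"name": "id", "type": "integer"},
          {"name": "title", "type": "string"}
        ]
      }
    }
  ]
}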

Generally, I would have liked to see convergence here but that hasn’t entirely happened - and at this point likely won’t happen, as the W3C spec is going into lock-down and I think JTS / Tabular Data Package should retain their zen-like simplicity if at all possible (making it super easy for publishers and consumers to use Tabular Data Packages is absolutely key).

The similarities are because the W3C spec was originally directly based on the Tabular Data Package setup and I was an author. Over time quite a bit of change has occurred, a lot of it related to transformation to RDF (which I, personally, think is better served by support outside of the metadata spec) and compatibility with other W3C specs (e.g. the core data type definitions following XSD).

Excerpts from GitHub issue #702, where I commented on the draft spec:

This issue suggests a variety of improvements to the current version of the specification. It is the result of substantial reflection on the current version of the spec. It is something of an “omnibus” issue and could be broken up into more bite-sized chunks.

The various improvements are worthwhile in themselves and also seek greater alignment, where possible, with the Tabular Data Package and JSON Table Schema specifications.

As people know, this spec was originally heavily based on Tabular Data Package. Over time it seems to have drifted somewhat. I would like to suggest various revisions to bring closer alignment. Why do this? If we can converge these specs then:

  • it is possible for tooling to be easily reused between the two.
  • it is possible that there could be complete convergence on all or parts of the spec that would allow for direct reuse and/or merge
  • we reduce confusion amongst the community

In addition, I would note that Tabular Data Package has seen several years of real-world use, and alignment gets the benefits of that experience.

Suggested Changes

  • Rename the tables attribute to resources
    • This aligns with the entire Data Package family of specs and allows for potential extension and reuse of these types of specifications.
    • It would also make the resource attribute on foreignKeys make more sense.
  • Rename tableSchema to simply schema
    • This re-aligns with JSON Table Schema. In addition, schema is simpler and shorter than tableSchema and is equally and sufficiently descriptive for the purposes required (parsimony is always good in specs). A sketch of both renames follows below.
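
To sketch the two renames (a purely schematic fragment, not copied verbatim from either spec; column definitions elided), a table group that currently looks roughly like this:

{
  "tables": [
    {"url": "tree-ops.csv", "tableSchema": {"columns": []}}
  ]
}

would instead read:

{
  "resources": [
    {"url": "tree-ops.csv", "schema": {"columns": []}}
  ]
}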

Tables, Columns etc

  • Remove rowTitles: I am dubious of the need. We should always strive for parsimony.
  • Remove aboutUrl: Its primary purpose is to allow the generation of URI-based identifiers for rows and columns. I think this is of minor importance for most use cases and in most cases could be performed explicitly at the processing stage rather than written into the metadata. As such it brings limited value but adds substantial implementational and cognitive complexity to the specification.
  • Remove valueUrl: Similar reasoning.
  • Rename titles to title and require it to be single-valued. Simplicity. Making this multi-valued makes processing and use more complicated. The main purpose of the title that I can see would be as a label in some kind of display. As such you really want one and only one value. (One objection would be i18n: but this is common across many attribute values and I suggest we address it in other ways, e.g. a @{code} approach, if we need to specify it here.)
  • required: move this down onto a constraints object. This is the approach in JSON Table Schema (see the before/after sketch below).
  • separator: rather than allowing this on column, require it to be part of dialect. Why specially move this up onto the column object? Treat it like every other dialect property.
  • columns vs fields (JTS): this can be resolved by an upgrade in JTS to use the columns naming.
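
For instance, the GID column as it appears in the current spec’s example further below:

{
  "name": "GID",
  "titles": ["GID", "Generic Identifier"],
  "datatype": "string",
  "required": true
}

would, under the titles → title and required → constraints suggestions, become (as in the rewritten example further below; the description property is omitted here for brevity):

{
  "name": "GID",
  "title": "Generic Identifier",
  "datatype": "string",
  "constraints": {
    "required": true
  }
}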

Data types:

Data types are one of the most crucial areas of the specification because they are the core of metadata’s value add (the key thing missing in CSV is types!)

  • Alignment of datatypes between JSON Table Schema and this schema
    • The set of types is currently very close. The main differences, afaict, are:
      • decimal (not in JTS): this could be resolved in JTS - see Decimal as an alias for number in JTS datatypes (?) · Issue #208 · frictionlessdata/specs · GitHub (I note number is also defined in the spec)
      • duration (not in JTS): not sure about this one. It does exist in SQL, so I suggest JTS adds this - Add duration as a type to JTS · Issue #210 · frictionlessdata/specs · GitHub
      • gYear, gMonth, gMonthDay etc (not in JTS): could these be treated as formats on date / datetime? Alternatively they could be added to JTS.
      • hexBinary, QName: do we really want these? Could these be formats on other types?
      • Subtypes of decimal, e.g. unsignedLong, positiveInteger etc: can these be expressed as formats on decimal / number?
      • Subtypes of string, e.g. html, xml: could these be addressed via format on string? (There are many text formats - why special-case these?)
      • geopoint, geojson (in JTS, not in the spec): geodata is so common (even in CSV!) that it would be nice to have some basic support. Plus we would align.
  • datatype: make it single-valued and have all other properties moved out onto a constraints object or a format object
    • datatype is the single most useful thing in the spec. Let’s keep it incredibly simple. By allowing datatypes to be rich objects we make parsing more complicated. Almost all the properties on the datatype can be moved off either into format or a constraints object (see the next item, and the sketch after this list). (The only other two items are @type and @id. @type could just be omitted and @id could just be moved out to sit in parallel - perhaps with a rename.)
  • Move constraints information back into a dedicated constraints attribute
    • constraints are things like minLength, maxLength etc
    • Align with JSON Table Schema
    • Keep constraints (which are primarily for validation) nicely contained and described
    • Aside: I do wonder whether constraints (and associated validation) should be a separate mini-spec (based on experience with JSON Table Schema)
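
As a rough sketch of the datatype simplification (schematic fragments, not taken verbatim from either spec), a rich datatype object such as:

"datatype": {"base": "string", "minLength": 3, "maxLength": 10}

would become a plain datatype with everything else moved out:

"datatype": "string",
"constraints": {"minLength": 3, "maxLength": 10}

Similarly, for the date column in the examples below, "datatype": {"base": "date", "format": "M/d/yyyy"} becomes "datatype": "date" with a separate "format": "M/d/yyyy".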

Other Suggested Changes

  • Keep namespacing to a minimum, at least in our examples for properties (e.g. dc:title vs title). We can definitely allow namespacing - and it comes automatically really - but let’s de-emphasize it. We really want to strive for simplicity, and namespacing works against that (I have to work out what that dc: means …)

  • virtual columns: remove from spec.

    • virtual columns are clearly a processing step and are likely relatively complex. Adding these to the spec brings substantial additional complexity with limited benefit.
    • Already non-normative in the spec. Let’s remove this altogether.
  • Remove explicit support for URI Templates

    • The purpose of these - as I understand it - is primarily to support URI generation in aboutUrl. As above this is of limited value and these add substantial implementational and cognitive complexity to the specification.
  • Keep transformation and conversion hints separate from the main spec

    • There are various points in the spec where transformation / conversion hints are alongside description metadata. For example:
      • The dialect description includes things like skipRows, skipColumns etc
      • suppressOutput in the table description
    • It would be better to clearly delimit this metadata so as to keep the spec conceptually clearer. Having all transformation information in a separate “transformation” property would help here (and we already partly do this by having the transformations property separated out).
  • Move transformation out of the main spec

    • We should consider entirely removing transformation from this spec and having it clearly separated. Transformation is a complex process and one often done in a variety of ways. Providing a specification for it is a substantial matter, and also one that can be separated from providing good descriptive metadata.
  • i18n support: i18n is one of the more complex issues and presents real challenges for trading off expressiveness vs simplicity and ease of use. My sense is that the ultimate JSON-LD route is rather heavy-weight e.g.

    "title": {"@value": "The title of this Table", "@language": "en"}

    I would suggest the earlier JSON-LD approach of appending “@” plus a language code to field names, e.g.

    "title@en": "The title of this table"

    My preference for the latter is that it will be much easier to use and read for most non-expert users. I also think it “degrades” more nicely, in that with the “title@lang” approach you do not have to change the default language value (likely the one most important to users) when you add new languages. That said, I acknowledge that either approach has challenges (e.g. the latter means that parsing property names must be i18n aware).

Example

To illustrate some of the changes, here is the current main example and then a rewritten version:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}

Rewritten version

{
  "url": "tree-ops.csv",
  "title": "Tree Operations",
  "keywords": ["tree", "street", "maintenance"],
  "publisher": {
    "name": "Example Municipality",
    "url": "http://example.org"
  },
  "license": "http://opendefinition.org/licenses/cc-by/",
  "modified": "2010-12-31",
  # note I would suggest always nesting this within a resources array but that's a bigger discussion
  "schema": {
    "columns": [{
      "name": "GID",
      "title": "Generic Identifier",
      "description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "constraints": {
         "required": true
       }
    }, {
      "name": "on_street",
      "title": "On Street",
      "description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "title": "Trim Cycle",
      "description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "title": "Inventory Date",
      "description": "The date of the operation that was performed.",
      "datatype": "date",
      "format": "M/d/yyyy"
    }],
    "primaryKey": "GID",
  }
}

Are you hoping to see Data Packages become the most common unit of data uploaded to CKAN-based data portals? I’m trying to figure out whether that’s a good goal, and whether data.gov.au (for instance) should be heading that way.

If people are uploading data packages, then presumably they’d want all the CKAN-metadata to be embedded inside it…which means (offline?) authoring tools…which seems odd because obviously CKAN already has tools for managing metadata.

Or are they not meant to fit together that way?

I was toying with the idea of providing data package files as a resource to complement the most popular open data on the Queensland CKAN open data portal. Then when I saw the W3C recommendations I wondered if I was going down the right path.

It’s a pity, as Rufus explains above, that time and other constraints didn’t permit the two standards to converge, thus causing confusion.

I was also intrigued by the possibility of releasing a CSV file and making it accessible as RDF (but my main focus was simply explaining the structure of the data).

Some of us discussed the concept of a library of field descriptors to describe common data and their constraints. E.g. Australian postcodes:

{ "name": "Postcode", "type": "string", "title": "Australia Post Postcode", "description": "Australian postal code verification. Australia has 4-digit numeric postal codes with the following state based specific ranges. ACT: 0200-0299 and 2600-2639. NSW: 1000-1999, 2000-2599 and 2640-2914. NT: 0900-0999 and 0800-0899. QLD: 9000-9999 and 4000-4999. SA: 5000-5999. TAS: 7800-7999 and 7000-7499. VIC: 8000-8999 and 3000-3999. WA: 6800-6999 and 6000-6799", "constraints": {"pattern": "^(0[289][0-9]{2})|([1345689][0-9]{3})|(2[0-8][0-9]{2})|(290[0-9])|(291[0-4])|(7[0-4][0-9]{2})|(7[8-9][0-9]{2})$"} }


Regarding current adoption, I am also using and recommending OKI’s tabular-data-package standard. But as for the near future…


Perhaps another approach… We “lost the game”: simplicity is not a value to the W3C. But a jump to the complex standard will not occur either; the W3C’s tabular-data-model shows no sign of becoming a de facto standard within a couple of years.

What is possible now is to advocate a smoother transition process, where each member of the user community can see itself at a stage of a maturity model. OKI could offer an intermediary standard, something like “SIMPLE-tabular-data-model”: a subset of the W3C standard constrained to best practices and the basic necessities of the user community…

There is a well-known source for those “basic necessities”: a kind of mapping from OKI’s tabular-data-package standard to this SIMPLE-tabular-data-model standard.

PS: we can see the same pattern, and the same kind of W3C error, in the FOSS community: the jump from the simple XPath v1 (and XSLT v1) to the complex XPath v2, with no intermediary v1.1 on offer (for e.g. libxml2 to adopt).


In my opinion, a big problem with W3C’s tabular data model is that one cannot process or validate a table in an offline process. You have to dereference some URLs to retrieve the necessary metadata. This problem alone rules out using the CSVW standards for a lot of use cases where connectivity is not guaranteed.

The W3C has a process which seeks to get evidence of two independent and interoperable implementations of a proposed standard before reaching recommendation status. Looking at the CSVW Implementation Report, it seems that csvlint and RDF-tabular pass the compliance tests. So, to the W3C, that is enough evidence of implementation. It does not matter to them that Tabular Data Packages seemingly have more traction in tools and community usage than CSVW.

On the other hand, in the past the W3C abandoned XHTML 2.0, which was also an esoteric and complex proposal but was built inside the W3C, in favor of HTML 5, which is simpler, more objective and pragmatic, but was built by the WHATWG, outside the W3C. The reason was that community usage was overwhelmingly in favor of HTML 5, so at some point the W3C decided it could not ignore it any more and brought HTML 5 into its own standardization process.

Perhaps frictionless data could take a similar approach to the one the WHATWG took and eventually succeed at standardizing Tabular Data Packages at the W3C.


@herrmann, good points, and perhaps your proposal of the frictionless data community taking the role of the WHATWG is a complement to my proposal of a SIMPLE-tabular-data-model standard :wink:

Trying to explain how…


When we talk about “broad standards” such as XML, UTF-8, HTML, CSV, etc., adoption of the standard does not imply use of the “full standard”; we can adopt a kind of “subset of the standard”…

When we talk about open data, there is often an open data publishing framework (e.g. https://data.gov.uk) that acts as the “local authority”, so this authority can oblige publishers to adopt a specific “local standard based on a broad standard”.
This pattern is common:

standard_X + some_constraints_C = local_standard

Examples:

  • an organization’s standard template over the HTML standard (X=HTML, C=template);

  • the SciELO adoption of the JATS standard is described as “JATS plus SciELO style”, resulting in the SPS local standard, so:
    X=JATS, C=SciELO style, local standard = SPS

And when an authority’s local standard is open (has an open license), any other authority can reuse it. In fact, good HTML templates are reused, and the SciELO-Brasil local SPS is reused by SciELO-Chile, SciELO-Spain, etc.

PROPOSAL: to develop, within the frictionless data community, a SIMPLE-tabular-data-model standard (a set of constraints over the W3C tabular-data-model that simplifies it), which would be useful for open data publishers.


PS: regarding the problem of «dereferencing some URLs to retrieve the necessary metadata» that you explained, it is an optional feature of the W3C tabular-data-model standard, and a typical feature to be removed from its simplified version.

To clarify what I meant about offline use, W3C’s tabular data model has a couple of ways to do offline processing of CSV metadata:

  1. “Overriding metadata”, by which a user can specify, e.g. in a command line interface, a metadata file to use to validate the CSV
  2. “Embedded metadata”, by which the metadata is inserted as comment headers at the beginning of the CSV file

Neither of those is practical for offline use.

The first offers no formal association between the data and metadata. Presumably, practitioners might use the same file name and a different extension to link the CSV and the schema files. But that is not part of the standard and you can’t guarantee that data sources will offer data and metadata files named in this way. So offline data processors are left with guesswork trying to locate the schema.

The second offers a strong link, as both data and metadata are provided in the same file. However, most CSV data processing tools probably can’t handle comments in CSV files well and will fail to load a file that uses this approach, thus lowering compatibility.
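
Purely for illustration (this is not copied from the W3C documents, and the data row is invented to match the columns of the tree-ops.csv example above), comment-style embedded metadata of the kind described above might look something like the following; a CSV parser that does not know to skip the # lines would treat them as data rows:

# title,Tree Operations
# publisher,Example Municipality
GID,on_street,species,trim_cycle,inventory_date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010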

Lastly, about the competition between standards, there is also a precedent for having two different W3C standards that do essentially the same thing: see Microdata vs. RDFa. So I see no problem if the W3C eventually accepts both the Tabular Data Model and Tabular Data Packages simultaneously, especially if Tabular Data Packages do gain a lot of traction in practical usage.

It might also be possible to provide metadata using both of the standards simultaneously for the same CSV data, akin to marking up hypertext with both Microdata and RDFa at the same time, but I haven’t really looked into this possibility in detail to see if it is feasible.

Hi all! More than a year has passed… Is there any news, or any statistics, about W3C standard adoption? Nowadays, is it better to ignore it or not?


For me, other interesting questions to continue the discussion here are:

  • Is there a document describing the intersections between the tabular data model standards (W3C and OKFN)?

  • Are there any statistics about datapackage representation (adoption) across CKAN implementations?

  • Are there any estimates of the adoption of (valid) specs/data-package v1.0 at Data Packaged Core Datasets · GitHub and similar repositories?
    (perhaps a sample based on Goodtables usage)


I suspect that if people were using data packages in CKAN they would download GitHub - frictionlessdata/ckanext-datapackager: CKAN extension for importing/exporting Data Packages.

Based on the low number of downloads (6) perhaps adoption isn’t strong. Happy to be corrected.


Hi Stephen, I don’t yet know the level of adoption, but I’d warn that GitHub download stats are pretty meaningless (most users will clone the repository or install from PyPI rather than download from GitHub …)