W3C CSV for the Web - how does it relate to Data Packages?

Obviously I recommend Data Packages and especially Tabular Data Packages over the W3C spec.

Whilst the W3C work was based originally on Tabular Data Packages it has diverged quite substantially and got quite complex. As a result it is no longer compatible with Data Packages.

Some comments on this come from a thread on the lists.okfn.org mailing lists back in October. Relevant excerpts, with some additions:

  • Data Package is obviously more generic - it is not just for Tabular Data. So you can create Data Packages for lots of other kinds of data too.
  • Tabular Data Package and the W3C spec have significant similarities because, originally, the W3C spec was heavily based on Tabular Data Package. However, there has been quite a bit of divergence. Some of this is summarized in this issue: Various improvements including greater alignment with Tabular Data Package and JSON Table Schema (Issue #702, w3c/csvw on GitHub)
  • Also, strictly, Tabular Data Package is a spec for publishing tabular data which says: a) publish CSV b) describe the general metadata and data metadata using datapackage.json. The W3C spec is about describing CSV that is on the web. However, de facto this is not a large difference as you can use JSON Table Schema + Tabular Data Package to describe generic CSV
  • Tabular Data Package is generally somewhat more modular: it consists of 3 small components each of which can be used on their own: a) Data Package spec b) JSON Table Schema c) CSV Dialect Description Format
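To make the modularity concrete, here is a minimal sketch in Python of how the three components nest inside one datapackage.json descriptor. The package name and the two field definitions are made up for illustration; the CSV file name is borrowed from the example further down.

```python
import json

# (b) JSON Table Schema: usable on its own to describe the columns of any CSV.
schema = {
    "fields": [
        {"name": "GID", "type": "string"},
        {"name": "inventory_date", "type": "date"},
    ]
}

# (c) CSV Dialect Description Format: usable on its own to describe CSV layout.
dialect = {"delimiter": ","}

# (a) Data Package: general metadata plus a list of resources.
# Each resource points at a file and can carry a schema and a dialect.
package = {
    "name": "tree-ops",  # hypothetical package name
    "resources": [
        {"path": "tree-ops.csv", "schema": schema, "dialect": dialect}
    ],
}

descriptor_text = json.dumps(package, indent=2)
print(descriptor_text)
```

Each of the three dictionaries is valid by itself, which is the point: a tool that only understands JSON Table Schema can still read `schema` without knowing anything about packages.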

Generally, I would have liked to see convergence here but that hasn’t entirely happened - and at this point likely won’t happen as the W3C spec is going into lock-down and I think JTS / Tabular Data Package should retain their zen-like simplicity if at all possible (making it super easy for publishers and consumers to use Tabular Data Packages is absolutely key).

The similarities are because the W3C spec was originally directly based on the Tabular Data Package setup and I was an author. Over time quite a bit of change has occurred, a lot of it related to transformation to RDF (which I, personally, think is better served by support outside of the metadata spec) and compatibility with other W3C specs (e.g. the core data type definitions following XSD).

Excerpts from github issue 702 where I commented on the draft spec

This issue suggests a variety of improvements to the current version of the specification. It is the result of substantial reflection on the current version of the spec. It is something of an “omnibus” issue and it could be broken up into more bite-sized chunks.

The various improvements are worth making in themselves, and they also seek greater alignment where possible with the Tabular Data Package and JSON Table Schema specifications.

As people know, this spec was originally heavily based on Tabular Data Package. Over time it seems to have drifted somewhat. I would like to suggest various revisions to bring closer alignment. Why do this? If we can converge these specs then:

  • it is possible for tooling to be easily reused between the two.
  • it is possible that there could be complete convergence on all or parts of the spec that would allow for direct reuse and/or merge
  • we reduce confusion amongst the community

In addition, I would note that Tabular Data Package has seen several years of real-world use and alignment gets the benefits of that experience.

Suggested Changes

  • Rename tables attribute to resources
    • This aligns with the entire Data Package family of specs, and allows for potential extension and reuse of these types of specifications.
    • This would also make the resource attribute on foreignKeys make more sense
  • Rename tableSchema to simply schema
    • This re-aligns with JSON Table Schema. In addition, schema is simpler and shorter than tableSchema and is equally and sufficiently descriptive for the purposes required (parsimony is always good in specs)

Tables, Columns etc

  • Remove rowTitles: Dubious of need. We should always strive for parsimony.
  • Remove aboutUrl: Primary purpose is to allow the generation of URI based identifiers for rows and columns. I think this is of minor importance for most use cases and in most cases could be performed explicitly at the processing stage rather than written into the metadata. As such this brings limited value but adds substantial implementational and cognitive complexity to the specification.
  • Remove valueUrl: Similar reasoning
  • Rename titles to title and require it to be single valued. Simplicity. Making this multi-valued makes processing and use more complicated. The main purpose of the title that I could see would be as a label in some kind of display. As such you really want one and only one value. (One objection would be i18n: but this is common across many attribute values and I suggest we address this in other ways e.g. @{code} approach if we need to specify here).
  • required: move this down onto a constraints object. This is the approach in JSON Table Schema
  • separator: rather than allowing this on column, require it to be part of dialect. Why specially move this up onto the column object? Treat it like every other dialect property
  • columns vs fields (JTS): this can be resolved by an upgrade in JTS to use the columns naming

Data types:

Data types are one of the most crucial areas of the specification because they are the core of metadata’s value add (the key thing missing in CSV is types!)

  • Alignment of datatypes between JSON Table Schema and this schema
    • The two sets of types are currently very close. The main differences afaict are:
      • decimal (not in JTS): this could be resolved in JTS - see Decimal as an alias for number in JTS datatypes (?) · Issue #208 · frictionlessdata/specs · GitHub (I note number is also defined in the sc
      • duration (not in JTS): not sure about this one. It does exist in SQL so suggest JTS adds this - Add duration as a type to JTS · Issue #210 · frictionlessdata/specs · GitHub
      • gYear, gMonth, gMonthDay etc (not in JTS): could these be treated as formats on date / datetime? Alternatively could add to JTS.
      • hexBinary, QName: do we really want these? Could these be formats on other types.
      • Subtypes of decimal e.g. unsignedLong, positiveInteger etc: can these be expressed as types on decimal / number?
      • Subtypes of string e.g. html, xml: could these be addressed via format on string (there are many text formats - why special case these?)
      • geopoint, geojson (in JTS not in spec): geodata is so common (even in CSV!) it would be nice to have some basic support. Plus we would align
  • datatype: make it single value and have all other properties moved out onto a constraints object or a format object
    • datatype is the single most useful thing in the spec. Let’s keep it incredibly simple. By allowing datatypes to be rich objects we make parsing more complicated. Almost all the properties on the datatype can be moved off either into format or a constraints object (see next item). (Only other two items are @type and @id. @type could just be omitted and @id could just be moved out to sit in parallel - perhaps with a rename)
  • Move constraints information back into a dedicated constraints attribute
    • constraints are things like minLength, maxLength etc
    • Align with JSON Table Schema
    • Keep constraints (which are primarily for validation) nicely contained and described
    • Aside: I do wonder whether constraints (and associated validation) are a separate mini-spec (based on experience with JSON Table Schema)
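As a rough illustration of why a dedicated constraints object keeps validation nicely contained, here is a minimal sketch in Python. The function name `check_constraints` is hypothetical, and only three constraints (required, minLength, maxLength) are handled; it is not taken from any existing validator.

```python
def check_constraints(value, constraints):
    """Return the names of the constraints violated by one raw CSV cell value."""
    errors = []
    missing = value is None or value == ""
    if constraints.get("required") and missing:
        errors.append("required")
    if missing:
        # Length checks make no sense on an empty cell.
        return errors
    if "minLength" in constraints and len(value) < constraints["minLength"]:
        errors.append("minLength")
    if "maxLength" in constraints and len(value) > constraints["maxLength"]:
        errors.append("maxLength")
    return errors


# A column description in the proposed shape: datatype stays a plain string,
# everything validation-related lives under "constraints".
column = {
    "name": "GID",
    "datatype": "string",
    "constraints": {"required": True, "maxLength": 5},
}

print(check_constraints("", column["constraints"]))
print(check_constraints("1234567", column["constraints"]))
```

Because the validator only ever looks inside `constraints`, a consumer that just wants to cast values can read `datatype` and ignore the rest.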

Other Suggested Changes

  • Keep namespacing to a minimum, at least in our examples for properties (e.g. dc:title vs title). We can definitely allow namespacing - and it comes automatically really - but let’s de-emphasize it. We really want to strive for simplicity and namespacing works against that (I have to work out what that dc: means …)

  • virtual columns: remove from spec.

    • virtual columns are clearly a processing step and are likely relatively complex. Adding these to the spec brings substantial additional complexity with limited benefit.
    • Already non-normative in the spec. Let’s remove this altogether.
  • Remove explicit support for URI Templates

    • The purpose of these - as I understand it - is primarily to support URI generation in aboutUrl. As above this is of limited value and these add substantial implementational and cognitive complexity to the specification.
  • Keep transformation and conversion hints separate from the main spec

    • There are various points in the spec where transformation / conversion hints are alongside description metadata. For example:
      • The dialect description includes things like skipRows, skipColumns etc
      • suppressOutput in the table description
    • It would be better to clearly delimit this metadata so as to keep the spec conceptually clearer. Having all transformation information in a separate “transformation” property would help here (and we already partly do this by having the transformations property separated out).
  • Move transformation out of the main spec

    • We should consider entirely removing transformation from this spec and having it clearly separated. Transformation is a complex process and one often done in a variety of ways. Providing a specification for it is a substantial matter and also one that can be separated from that of providing good descriptive metadata.
  • i18n support: i18n is one of the more complex issues and presents real challenges for trading off expressiveness vs simplicity and ease of use. My sense is that the ultimate JSON-LD route is rather heavy-weight e.g.

    "title": {"@value": "The title of this Table", "@language": "en"}

    I would suggest the earlier JSON-LD approach of appending “@” plus a language code to field names e.g.

    "title@en": "The title of this table"

    My preference for the latter is that it will be much easier to use and read for most non-expert users. I also think it “degrades” more nicely in that in the “title@lang” approach you do not have to change the default language value (likely the one most important to users) when you add new languages. That said, I acknowledge that either approach has challenges (e.g. this latter one means that parsing property values must be i18n aware).
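To show what “i18n aware” parsing might involve under the “title@lang” approach, here is a minimal sketch in Python. The helper name `localized` and the French string are made up for illustration; the lookup rule (try the language-suffixed key, then fall back to the plain key) is my assumption about how a consumer would behave, not something the spec defines.

```python
def localized(obj, prop, lang=None):
    """Return obj["prop@lang"] if present, otherwise the default obj["prop"]."""
    if lang is not None:
        key = f"{prop}@{lang}"
        if key in obj:
            return obj[key]
    # Fall back to the unsuffixed, default-language value.
    return obj.get(prop)


table = {
    "title": "The title of this table",
    "title@fr": "Le titre de ce tableau",  # hypothetical translation
}

print(localized(table, "title", "fr"))
print(localized(table, "title", "de"))
```

Note that adding the French title did not touch the default `title` value, which is the “degrades nicely” property described above.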

Example

To illustrate some of the changes here is the current main example and then a rewritten version:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}

Rewritten version (note: I would suggest always nesting the schema within a resources array, but that's a bigger discussion):

{
  "url": "tree-ops.csv",
  "title": "Tree Operations",
  "keywords": ["tree", "street", "maintenance"],
  "publisher": {
    "name": "Example Municipality",
    "url": "http://example.org"
  },
  "license": "http://opendefinition.org/licenses/cc-by/",
  "modified": "2010-12-31",
  "schema": {
    "columns": [{
      "name": "GID",
      "title": "Generic Identifier",
      "description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "constraints": {
        "required": true
      }
    }, {
      "name": "on_street",
      "title": "On Street",
      "description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "title": "Trim Cycle",
      "description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "title": "Inventory Date",
      "description": "The date of the operation that was performed.",
      "datatype": "date",
      "format": "M/d/yyyy"
    }],
    "primaryKey": "GID"
  }
}
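The column-level part of that rewrite is mechanical enough to sketch in Python. The function name `simplify_column` is hypothetical, and it covers only the changes visible in the two examples: dropping namespace prefixes, titles to a single title, required into constraints, and splitting a datatype object into datatype plus format. When titles is a list it keeps the first entry, which is an arbitrary choice (the rewritten example above keeps the more descriptive second one).

```python
def simplify_column(col):
    """Convert one column description from the current draft shape to the proposed one."""
    out = {}
    for key, value in col.items():
        # Drop namespace prefixes such as "dc:" from property names.
        name = key.split(":", 1)[-1]
        if name == "titles":
            # Single-valued title; taking the first entry is an assumption.
            out["title"] = value[0] if isinstance(value, list) else value
        elif name == "required":
            out.setdefault("constraints", {})["required"] = value
        elif name == "datatype" and isinstance(value, dict):
            # Only "base" and "format" are handled; other properties on the
            # datatype object are ignored in this sketch.
            out["datatype"] = value["base"]
            if "format" in value:
                out["format"] = value["format"]
        else:
            out[name] = value
    return out


inventory_date = {
    "name": "inventory_date",
    "titles": "Inventory Date",
    "dc:description": "The date of the operation that was performed.",
    "datatype": {"base": "date", "format": "M/d/yyyy"},
}

print(simplify_column(inventory_date))
```

Applied to the inventory_date column from the current example, this yields the same column as in the rewritten version.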