W3C CSV for the Web - how does it relate to Data Packages?


#1

The W3C recently released some recommendations: http://www.w3.org/2013/csvw/wiki/Main_Page

As a publisher of open data I’d like to understand the relationship between Data Packages and the W3C work so I can adopt the good practice of sharing machine and human readable information about the structure of the data.

Any advice on simple tools or practices to adopt would also be appreciated.


#2

Obviously I recommend Data Packages and especially Tabular Data Packages over the W3C spec.

Whilst the W3C work was originally based on Tabular Data Packages, it has diverged quite substantially and become quite complex. As a result it is no longer compatible with Data Packages.

Some comments on this were made in this thread back in October: https://lists.okfn.org/pipermail/okfn-labs/2015-October/001660.html Relevant excerpts, with some additions:

  • Data Package is obviously more generic - it is not just for tabular data, so you can create Data Packages for lots of other kinds of data too.
  • Tabular Data Package and the W3C spec have significant similarities because, originally, the W3C spec was heavily based on Tabular Data Package. However, there has been quite a bit of divergence, some of which is summarized in this issue: https://github.com/w3c/csvw/issues/702
  • Also, strictly, Tabular Data Package is a spec for publishing tabular data which says: a) publish CSV, b) describe the general metadata and the data metadata using datapackage.json. The W3C spec is about describing CSV that is on the web. De facto, though, this is not a large difference, as you can use JSON Table Schema + Tabular Data Package to describe generic CSV.
  • Tabular Data Package is generally somewhat more modular: it consists of three small components, each of which can be used on its own: a) the Data Package spec, b) JSON Table Schema, c) the CSV Dialect Description Format (a minimal example combining the three is sketched below).

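To make the modularity concrete, here is a minimal sketch of a single datapackage.json combining the three parts (the package, resource and field names are hypothetical, not from this thread): the top-level keys come from the Data Package spec, the `schema` block from JSON Table Schema, and the `dialect` block from the CSV Dialect Description Format.

```json
{
  "name": "example-package",
  "title": "Example Tabular Data Package",
  "resources": [
    {
      "name": "observations",
      "path": "observations.csv",
      "dialect": {
        "delimiter": ",",
        "header": true
      },
      "schema": {
        "fields": [
          {"name": "date", "type": "date"},
          {"name": "value", "type": "number"}
        ]
      }
    }
  ]
}
```
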
Generally, I would like to have seen convergence here, but that hasn’t entirely happened - and at this point it likely won’t, as the W3C spec is going into lock-down and I think JTS / Tabular Data Package should retain their zen-like simplicity if at all possible (making it super easy for publishers and consumers to use Tabular Data Packages is absolutely key).

The similarities are there because the W3C spec was originally directly based on the Tabular Data Package setup, and I was an author. Over time quite a bit has changed, much of it related to transformation to RDF (which I personally think is better served by support outside the metadata spec) and to compatibility with other W3C specs (e.g. the core data type definitions following XSD).


#3

Are you hoping to see Data Packages become the most common unit of data uploaded to CKAN-based data portals? I’m trying to figure out whether that’s a good goal, and whether data.gov.au (for instance) should be heading that way.

If people are uploading data packages, then presumably they’d want all the CKAN metadata to be embedded inside them… which means (offline?) authoring tools… which seems odd, because CKAN obviously already has tools for managing metadata.

Or are they not meant to fit together that way?


#4

I was toying with the idea of providing data package files as a resource to complement the most popular open data on the Queensland CKAN open data portal. Then when I saw the W3C recommendations I wondered if I was going down the right path.

It’s a pity that, as Rufus explains above, time and other constraints didn’t permit the two standards to converge; the divergence causes confusion.

I was also intrigued by the possibility of releasing a CSV file and making it accessible as RDF (but my main focus was simply explaining the structure of the data).

Some of us discussed the concept of a library of field descriptors to describe common data and their constraints. E.g. Australian postcodes:

{ "name": "Postcode", "type": "string", "title": "Australia Post Postcode", "description": "Australian postal code verification. Australia has 4-digit numeric postal codes with the following state based specific ranges. ACT: 0200-0299 and 2600-2639. NSW: 1000-1999, 2000-2599 and 2640-2914. NT: 0900-0999 and 0800-0899. QLD: 9000-9999 and 4000-4999. SA: 5000-5999. TAS: 7800-7999 and 7000-7499. VIC: 8000-8999 and 3000-3999. WA: 6800-6999 and 6000-6799", "constraints": {"pattern": "^(0[289][0-9]{2})|([1345689][0-9]{3})|(2[0-8][0-9]{2})|(290[0-9])|(291[0-4])|(7[0-4][0-9]{2})|(7[8-9][0-9]{2})$"} }


#5

Regarding adoption today, I am also using and recommending OKI’s tabular-data-package standard. But about the near future…


Perhaps another approach is needed… We “lost the game”: simplicity is not a value for the W3C. But a jump to a complex standard will not happen either; the W3C’s tabular-data-model shows no sign of becoming a de facto standard within a couple of years.

What is possible now is to advocate a smoother transition process, where each member of the user community can place itself at a stage of a maturity model. OKI could offer an intermediate standard, something like a “SIMPLE-tabular-data-model”: a subset of the W3C standard, constrained to best practices and the basic necessities of the user community…

There is a well-known source for those “basic necessities”: a kind of mapping algorithm from OKI’s tabular-data-package standard to this SIMPLE-tabular-data-model standard.

PS: we can see the same pattern, and the same kind of W3C mistake, in the FOSS community: the jump from the simple XPath 1.0 (and XSLT 1.0) to the complex XPath 2.0, with no intermediate v1.1 on offer (for e.g. libxml2 to adopt).


#6

In my opinion, a big problem with W3C’s tabular data model is that one cannot process or validate a table offline. You have to dereference some URLs to retrieve the necessary metadata. This problem alone rules out using the CSVW standards for many use cases where connectivity is not guaranteed.

The W3C has a process which seeks evidence of two independent, interoperable implementations of a proposed standard before it reaches recommendation status. Looking at the CSVW Implementation Report, it seems that csvlint and RDF-tabular pass the compliance tests. So, to the W3C, that is enough evidence of implementation. It does not matter to them that Tabular Data Packages seemingly have gained more traction in tools and community usage than CSVW.

On the other hand, in the past the W3C abandoned XHTML 2.0, which was also an esoteric and complex proposal but was built inside the W3C, in favor of HTML 5, which is simpler, more objective and pragmatic, but was built by the WHATWG, outside the W3C. The reason was that community usage was overwhelmingly in favor of HTML 5, so at some point the W3C decided it could not ignore it any more and brought HTML 5 into its own standardization process.

Perhaps Frictionless Data could take an approach similar to the WHATWG’s and eventually succeed at standardizing Tabular Data Packages at the W3C.


#7

@herrmann, good points, and perhaps your proposal of the Frictionless Data community playing the role of the WHATWG complements my proposal of a SIMPLE-tabular-data-model standard :wink:

Trying to explain how…


When we talk about “broad standards” such as XML, UTF-8, HTML, CSV, etc., adopting the standard does not imply using the “full standard”; we can adopt a kind of “subset of the standard”…

When we talk about open data, there is an open data publishing framework (e.g. https://data.gov.uk ) that acts as the “local authority”, so this authority can oblige publishers to adopt a specific “local standard based on a broad standard”.
This pattern is common:

standard_X + some_constraints_C = local_standard

Examples:

  • an organization’s standard template over the HTML standard (X=HTML, C=template);

  • the SciELO adoption of the JATS standard is described as “JATS plus SciELO style”, resulting in the SPS local standard, so:
    X=JATS, C=SciELO style, local standard = SPS

And when an authority’s local standard is open (has an open license), any other authority can reuse it. In fact, good HTML templates are reused, and the SciELO-Brasil local SPS is reused by SciELO-Chile, SciELO-Spain, etc.

PROPOSAL: to develop, within the Frictionless Data community, a SIMPLE-tabular-data-model standard (a set of constraints over the W3C tabular data model that simplifies it), which would be useful for open data publishers.


PS: regarding the problem you explained of having to «dereference some URLs to retrieve the necessary metadata», that is an optional feature of the W3C tabular data model standard, and exactly the kind of feature to remove in a simplified version.


#8

To clarify what I meant about offline use, W3C’s tabular data model has a couple of ways to do offline processing of CSV metadata:

  1. “Overriding metadata”, by which a user can specify, e.g. in a command-line interface, a metadata file to use to validate the CSV
  2. “Embedded metadata”, by which the metadata is inserted as comment headers at the beginning of the CSV file

Neither of those is practical for offline use.

The first offers no formal association between the data and the metadata. Supposedly, practitioners might use the same file name with a different extension to link the CSV and the schema files. But that is not part of the standard, and you can’t guarantee that data sources will offer data and metadata files named in this way. So offline data processors are left with guesswork trying to locate the schema.
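
For reference, the overriding metadata in case 1 is just a CSVW metadata document supplied out of band; a minimal sketch (the file name and columns are hypothetical, and a real document may need more properties) looks roughly like this:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "postcodes.csv",
  "tableSchema": {
    "columns": [
      {"name": "suburb", "titles": "Suburb", "datatype": "string"},
      {"name": "postcode", "titles": "Postcode", "datatype": {"base": "string", "format": "[0-9]{4}"}}
    ]
  }
}
```

The document points at the CSV via its url property, but, as noted above, nothing in the CSV points back at the document, so an offline processor still has to be told where to find it.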

The second offers a strong link, as both data and metadata are provided in the same file. However, most CSV data processing tools probably can’t handle comments in CSV files well and will fail to load a file that uses this convention, thus lowering compatibility.

Lastly, about the competition of standards, there is also a precedent for having two different W3C standards that do essentially the same thing: see Microdata vs. RDFa. So I see no problem if the W3C eventually accepts both the Tabular Data Model and Tabular Data Packages simultaneously, especially if Tabular Data Packages gain a lot of traction in practical usage.

It might also be possible to provide metadata using both of the standards simultaneously for the same CSV data, akin to marking up hypertext with both Microdata and RDFa at the same time, but I haven’t really looked into this possibility in detail to see if it is feasible.


#9

Hi all! More than a year has passed… Any news, or statistics, about adoption of the W3C standard? Nowadays, is it better to ignore it or not?


For me, there are other interesting questions to continue the discussion here:

  • is there a document describing the intersections between the tabular data model standards (W3C and OKFN)?

  • are there any statistics about Data Package representation (adoption) across CKAN installations?

  • any estimates of the adoption of the (valid) Data Package v1.0 spec at https://github.com/datasets and similar repositories?
    (perhaps a sample via Goodtables usage)


#10

I suspect that if people were using Data Packages in CKAN they would download https://github.com/frictionlessdata/ckanext-datapackager

Based on the low number of downloads (6) perhaps adoption isn’t strong. Happy to be corrected.


#11

Hi Stephen, I don’t yet know the level of adoption, but I’d warn that GitHub download stats are pretty meaningless (most users will clone or install from PyPI, not download from GitHub…)