Thoughts on JSON Table Schema


#1

See initial discussion of this topic here: https://lists.okfn.org/pipermail/data-protocols/2015-July/000101.html

Hi,

First can I say I am a long-time follower and huge fan of the
dataprotocols.org project.

At Snowplow we are thinking of using JSON Table Schema in our Iglu schema
repository system:

First a quick question - I couldn’t find a JSON Schema for the JSON Table
Schema. Has anybody written this yet?

More broadly: I’m not convinced that the current unitary JSON Table Schema
is a viable approach.

Different relational databases have different capabilities - for example, a
valid table definition for Redshift must have SORTKEY and DISTKEY, and
indexes are not supported. This is distinct from Postgres DDL, which in
turn is distinct from BigQuery DDL, Vertica DDL etc.

For me, the value of a JSON Table Schema would be in making table DDL
declarative and composable. To be useful though, it must be possible to
generate valid idiomatic (i.e. database-specific) DDL from a given instance
of a JSON Table Schema.

Based on this, I’m leaning towards a JSON Table Schema which has
database-specific flavors. I think the two options here are:

  1. Create a separate definition document (in JSON Schema) for each
    database that we want to support, or
  2. Create a unitary JSON Table Schema which uses enums of e.g.
    database-specific field-descriptor types to support differences

The downside of the first option is that there is no guaranteed
predictability of schema shape between different database types. The second
option is a little more fiddly but probably more useful long-term.

Does anybody have any thoughts on the above?

Thanks,

Alex


#2

@alexdean these are great questions.

JSON Table Schema

This is a great point. But remember JSON Table Schema is designed to be extensible. So, for example you could add info specific to a particular target in as additional properties on each field value, or you even add a separate section distinct from fields property which is focused on a given target e.g. Redshift. Documenting these extensions would be really useful.

e.g.

schema: {
  fields: [ ... ]
  redshift: {
   ...
  }
} 

So, if I understand your suggestion correctly, I think option 2 is the one we should incline to - and this is something exciting to add!


#3

@alexdean did you make any progress here. We’re probably about to look at pushing Tabular Data Packages to BigQuery and that has a similar use case so I’d be interested in your thoughts and experience here.


#4

We’ve made some good progress on a related aspect - we have updated our Schema Guru tool so that it can generate Redshift CREATE TABLE DDL from JSON Schema: Schema Guru 0.3.0 released for generating Redshift tables from JSON Schemas.

At the moment the transformation is done using an AST built in Scala, and it goes direct from JSON Schema to DDL. But yes we are definitely still interested in an intermediate form of the DDL - some declarative form written in JSON and validating against a given JSON Schema.


#5

Where does your raw JSON Schema come from? I imagine you may want to use JSON Table Schema direct rather than have it as an intermediate step. We’re thinking generally about a JSON Table Schema to SQL translation as we could use that both for standard RDMS and e.g. Redshift. Would you be interested in contributing to this?


#6

Hi Rufus - our JSON Schemas come from Iglu Central. They are not inherently tabular - part of the “magic” of Schema Guru is that it flattens them (within reason) to fit into a database format. So yes, JSON Table Schema or similar would be as an intermediate step (JSON Schema -> tabular JSON -> DDL).

Definitely interested in collaborating with you guys on this. I guess given you are most focused on BigQuery, I’d be really interested to see a variant of JSON Table Schema which could maximally define a BigQuery table (i.e. anything that can go into BigQuery table DDL can be expressed in the JSON Table Schema). We could then do the same for Redshift and see what the commonalities are…


#7

We have no specific orientation to BigQuery - Redshift, BigQuery etc are all equally valid targets and Redshift might actually be better as RDBMS like. We’ll be looking into this near-term both for Redshift and BigQuery.


#8

@alexdean we now have a full JSON Table Schema to SQL python library in operation here: https://github.com/okfn/jts-sql - we may also “web-service-ize” this.

/cc @danfowler


#9

FYI, I wrote two python3 tools (still in alpha version), one for obtaining a JTS (generalized for a database, with table descriptions conforming to JTS) from a live PostgreSQL database (


http://pg-jts.readthedocs.org
) and a second for visualizing the output of the first (

http://jts-erd.readthedocs.org
).


#10

Great work! I’d love to include these in the next Labs newsletter! Would you be interested in doing a write-up?