The Tabular Data Package as the one source for data definition

I’d like to check our planned approach to defining a data package and using that definition as the one source of all definitions from which schemas and other resources can be generated.

We’d like to:

  1. Take an existing tabular data package (the Open Referral tabular data package, which links services, providers, locations, etc.)
  2. Define proposed extensions (annotated as such)
  3. Define constraints to get the “application profile” which says how the package will be used in our situation (which fields to use, which vocabularies are used to populate them, …)
  4. Auto-generate from that one or more entity relation diagrams with colour coding to distinguish the original from extensions and constraints
  5. Autogenerate JSON schemas which define responses to web methods querying the data (e.g. getService)
  6. Autogenerate CSV schemas for tabular (partial) views of the data (e.g. a list of services)
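
As a rough illustration of step 5, the sketch below derives a JSON Schema from one Table Schema resource using plain Python dictionaries. The type mapping, the function name `table_to_json_schema` and the sample `service` resource are assumptions made for this example; they are not taken from Open Referral or from any existing tool.

```python
# Assumed mapping from Table Schema field types to JSON Schema fragments.
# Types not listed here fall through to an unconstrained property.
TYPE_MAP = {
    "string": {"type": "string"},
    "integer": {"type": "integer"},
    "number": {"type": "number"},
    "boolean": {"type": "boolean"},
    "date": {"type": "string", "format": "date"},
    "datetime": {"type": "string", "format": "date-time"},
}


def table_to_json_schema(resource):
    """Build a JSON Schema describing one row of a tabular resource."""
    properties = {}
    required = []
    for field in resource["schema"]["fields"]:
        # Table Schema treats a missing type as "string".
        prop = dict(TYPE_MAP.get(field.get("type", "string"), {}))
        constraints = field.get("constraints", {})
        if "enum" in constraints:
            prop["enum"] = list(constraints["enum"])
        if constraints.get("required"):
            required.append(field["name"])
        properties[field["name"]] = prop
    schema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": resource["name"],
        "type": "object",
        "properties": properties,
    }
    if required:
        schema["required"] = required
    return schema


# Illustrative resource only, not the real Open Referral definition.
service = {
    "name": "service",
    "schema": {
        "fields": [
            {"name": "id", "type": "string", "constraints": {"required": True}},
            {"name": "name", "type": "string", "constraints": {"required": True}},
            {"name": "status", "type": "string",
             "constraints": {"enum": ["active", "inactive"]}},
        ]
    },
}

service_schema = table_to_json_schema(service)
```

A response schema for a method such as getService would probably also need to nest related tables by following foreign keys, which this sketch leaves out.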

Is this logic sound? Are there existing tools that do any of this?

I only know of the jts_ERD tool, which covers a small part of it.

Thanks

Well, the silence is deafening :grinning: but we’ve pressed on anyway, written the code and put it on GitHub.

See the Schemas and Schema generation part of this readme file.

Feedback welcomed.

Hi @MikeThacker! Thanks for the posts and sorry for the delay. I’d love to give you some feedback. To help us understand, could you please give us some context and background on this project and how you are using data packages? And could you please clarify if there are specific things with data packages that we can help you with?
Thanks!

Hello @lwinfree and thanks for your response.

Although I’m trying to design an approach I can use for many projects where we refine a data standard, for this specific project I’m looking for a way to document extensions to the existing OpenReferral data format standard and define an application profile (saying how the standard will be used in a particular scenario).

OpenReferral already has a Tabular Data Package, an Entity Relation Diagram and an API. I think the latter two are manually crafted from the first. I want an automated way of generating the latter two (and more) from the first.

Once that is done, I will use a copy of the Tabular Data Package (with a few more properties added) to define proposed extensions to the existing OpenReferral standard and a further copy to define our application profile (stating which tables/fields to use, enumerations, taxonomies from which to populate values, …).

My colleague and I have made good progress on this as shown in our GitHub Human-Services repository.

“And could you please clarify if there are specific things with data packages that we can help you with?”

Well I’d really just like to know if this is a sensible use of data packages and if anyone can see a flaw in the logic. Essentially I want one single machine-readable source defining a data standard from which I can generate ERD, schemas and human-readable documentation.
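
To make the "one source, many outputs" idea concrete, here is a minimal sketch that turns the foreign keys in a data package descriptor into Graphviz DOT text for an entity relation diagram. The `x-extension` property used for colour coding is a hypothetical annotation invented for this example, and the sample package is illustrative rather than the real Open Referral definition.

```python
def as_list(value):
    """Table Schema allows foreign key fields as a string or a list."""
    return [value] if isinstance(value, str) else list(value)


def package_to_dot(package, extension_colour="lightblue"):
    """Return DOT text with one node per table and one edge per foreign key."""
    lines = ["digraph erd {", "  node [shape=box, style=filled, fillcolor=white];"]
    for resource in package["resources"]:
        name = resource["name"]
        # "x-extension" is a hypothetical annotation marking proposed extensions.
        if resource.get("x-extension"):
            lines.append(f'  "{name}" [fillcolor={extension_colour}];')
        else:
            lines.append(f'  "{name}";')
    for resource in package["resources"]:
        source = resource["name"]
        for fk in resource["schema"].get("foreignKeys", []):
            # An empty reference resource denotes a self-reference.
            target = fk["reference"]["resource"] or source
            label = ", ".join(as_list(fk["fields"]))
            lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


package = {
    "resources": [
        {"name": "service",
         "schema": {"fields": [{"name": "id", "type": "string"}]}},
        {"name": "location",
         "schema": {
             "fields": [{"name": "id", "type": "string"},
                        {"name": "service_id", "type": "string"}],
             "foreignKeys": [{"fields": "service_id",
                              "reference": {"resource": "service", "fields": "id"}}],
         }},
        {"name": "accessibility", "x-extension": True,
         "schema": {"fields": [{"name": "id", "type": "string"}]}},
    ]
}

dot = package_to_dot(package)
```

The resulting text can be rendered with the Graphviz `dot` command; the same descriptor could feed schema and documentation generators in a similar way.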

Thanks very much

Hi @MikeThacker! Thanks for providing more detail. I’ve shared this with the broad FD team.
For now, I’m wondering if you have been working with the OpenReferral team on your project? One of our current Tool Fund grantees is focused on building datapackage support for their Human Services API: https://frictionlessdata.io/articles/open-referral/. Let me know if you’d like to be connected to them - I think there are some synergies between what you are working on.
Thanks,

Hello @lwinfree. Yes, my colleagues and I have been speaking with Greg at Open Referral. We’re using the Tabular Data Package to record our proposed extensions to the Open Referral schema and will more formally submit them if our piloting shows they work.

My post here was more to get feedback on how sensible it is to use a Tabular Data Package definition with extras as the source from which all documentation and schemas are derived.
Thanks

@MikeThacker yes this sounds quite sensible based on a quick read through.

Since my original post, we’ve made good progress using an annotated tabular data package to autogenerate variants of it (different tabular data packages) and then the associated machine-readable resources.

We’ve now concluded that we should keep a pure main tabular data package (without our annotations) with the full data structure, and use separate machine-readable definitions for each “application profile”. An application profile will be a tabular data package that contains a subset of the tables and fields in the main package. It might also change the optional/required setting of some fields.

Is there a standard way of documenting and generating these application profiles, i.e. these views on a full data package? All we’ve done so far is to define Jolt transformations.
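
For illustration only, the kind of derivation described above can be sketched in Python. The profile format here (table name mapped to the fields to keep and the fields to make required) is an assumption made up for this example, not a standard.

```python
import copy


def apply_profile(package, profile):
    """Derive a smaller package containing only the profiled tables and fields."""
    derived = copy.deepcopy(package)  # leave the main package untouched
    kept = []
    for resource in derived["resources"]:
        spec = profile.get(resource["name"])
        if spec is None:
            continue  # table is not part of this profile
        fields = [f for f in resource["schema"]["fields"]
                  if f["name"] in spec["fields"]]
        for field in fields:
            # The profile may tighten optional fields to required.
            if field["name"] in spec.get("required", []):
                field.setdefault("constraints", {})["required"] = True
        resource["schema"]["fields"] = fields
        kept.append(resource)
    kept_names = {r["name"] for r in kept}
    for resource in kept:
        fks = resource["schema"].get("foreignKeys")
        if fks is not None:
            # Drop foreign keys that point at tables removed by the profile.
            resource["schema"]["foreignKeys"] = [
                fk for fk in fks
                if (fk["reference"]["resource"] or resource["name"]) in kept_names
            ]
    derived["resources"] = kept
    return derived


main_package = {
    "resources": [
        {"name": "service", "schema": {"fields": [
            {"name": "id", "type": "string"},
            {"name": "name", "type": "string"},
            {"name": "email", "type": "string"},
        ]}},
        {"name": "review", "schema": {
            "fields": [{"name": "id", "type": "string"},
                       {"name": "service_id", "type": "string"}],
            "foreignKeys": [{"fields": "service_id",
                             "reference": {"resource": "service", "fields": "id"}}],
        }},
    ]
}

profile = {"service": {"fields": ["id", "name"], "required": ["name"]}}
profiled = apply_profile(main_package, profile)
```

This sketch does not check that a retained foreign key still has its own fields after filtering; a fuller version would need to.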

Related: Is there a way of defining extra constraints? E.g. one of two fields must be populated, or there must be at least one record in a one-to-many relationship (i.e. a cardinality of 1:∞).
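
The two example constraints can at least be checked outside the schema. Below is a minimal sketch over rows held as Python dictionaries; the function names, field names and sample data are all hypothetical, and it assumes such rules would be validated separately from the per-field constraints in the schema.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as unpopulated."""
    return value is None or str(value).strip() == ""


def rows_missing_any_of(rows, field_names):
    """Return rows where none of the given fields is populated."""
    return [row for row in rows
            if all(is_blank(row.get(name)) for name in field_names)]


def parents_without_children(parents, children, parent_key, child_fk):
    """Return keys of parent rows that no child row refers to."""
    referenced = {child.get(child_fk) for child in children}
    return [p[parent_key] for p in parents if p[parent_key] not in referenced]


# Illustrative data only.
services = [
    {"id": "s1", "email": "a@example.org", "phone": ""},
    {"id": "s2", "email": "", "phone": None},
    {"id": "s3", "email": "", "phone": "0123"},
]
locations = [
    {"id": "l1", "service_id": "s1"},
    {"id": "l2", "service_id": "s1"},
]

# Rule 1: each service needs an email or a phone number.
no_contact = rows_missing_any_of(services, ["email", "phone"])
# Rule 2: each service needs at least one location.
no_location = parents_without_children(services, locations, "id", "service_id")
```

Here `no_contact` holds the `s2` row and `no_location` holds the keys `s2` and `s3`.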

There’s some more discussion here.

TIA

Hi Mike, thanks very much for these questions. Could I please ask you to repost this to Discord? You are more likely to get community inputs there.

You can also use the Matrix bridge to access the channel.

Thanks!