Data Package Versions


#1

I’m really pleased with the Data Package Version pattern but I think a couple more scenarios need to be added to the pattern and I’d like your thoughts on how the version number should be incremented.

Scenarios

You have published a tabular data package grants.csv v1.0.0. It has a foreign key relationship with another tabular data package codes.csv v2.0.0. The codes.csv data has changed: some codes have been combined and others split.

Is this:

  • a breaking change causing an increment in the MAJOR version number
  • a backwards-compatible change that only needs a change in the MINOR version number?

If the grants.csv data is updated to use the new codes, and the foreign key reference is updated to point to the new codes.csv, is this a MAJOR change because the table schema has been changed?

Look forward to hearing your thoughts :smile:


#2

Nice collection of proposed patterns!

I believe that combining or splitting lines should also be considered a breaking change, causing an increment in the MAJOR version number, because that would make it incompatible with other tables that use it in a foreign key relationship.

About the second question, if grants.csv is updated just to make its foreign keys compatible with the new version of the table which it references, its dependencies declaration should also be updated to state that it depends on the new version of codes.csv, as indicated in the dependencies pattern.

As for whether this should increment grants.csv's MAJOR, MINOR or PATCH version, I’m not sure. If an application uses this table individually, then it should not break, as the update only corrects some of its values for compatibility with a new version of one of its dependencies. On the other hand, if an application makes use of the data in grants.csv and also all of its dependencies, then a MAJOR change to any of its dependencies would also break the application.


#3

Thanks @herrmann. Based on your points:

  • I think codes.csv is a MAJOR change
  • I’m leaning towards grants.csv being a MAJOR change

Interested in hearing thoughts from others…


#4

I’d say both were MAJOR changes.


#5

I’ll draft a PR for the Data Package Version pattern


#6

Proposed change to pattern. Feedback welcome :slightly_smiling_face:

Data Package Version

The Data Package version format follows the Semantic Versioning specification format: MAJOR.MINOR.PATCH

The version numbers, and the way they change, convey meaning about how the data package has been modified from one version to the next.

Specification

Given a Data Package version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible changes, e.g.

  • Change the table schema
  • Change the name of fields, a data resource or a data package
  • Change the data package id
  • Add, remove or re-order fields
  • Change a foreignKey relationship to refer to a different resource

MINOR version when you add data in a backwards-compatible manner, e.g.

  • Add new data to an existing data resource
  • Add a new data resource

PATCH version when you make backwards-compatible fixes, e.g.

  • Corrections to existing data
  • Changes to metadata

Scenarios

  • You are developing your data through public consultation. Start your initial data release at 0.1.0
  • You release your data for the first time. Use version 1.0.0
  • You append last month's data to an existing release. Increment the MINOR version number
  • You append a column to the data. Increment the MAJOR version number
  • You relocate the data to a new URL or path. No change in the version number
  • You change a title, description, or other descriptive metadata. Increment the PATCH version
  • You fix a data entry error by modifying a value. Increment the PATCH version
  • You split a row of data in a foreign key reference table. Increment the MAJOR version number
  • You update the data and schema to refer to a new version of a foreign key reference table. Increment the MAJOR version number
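
For anyone who wants to try the increments above on real version strings, here is a minimal sketch. The `bump` function name is made up for illustration and is not part of the pattern. It assumes a plain MAJOR.MINOR.PATCH string with no pre-release suffix, and it resets the lower numbers to zero as the Semantic Versioning specification requires.

```python
def bump(version: str, level: str) -> str:
    """Return `version` incremented at `level` ('major', 'minor' or 'patch').

    Hypothetical helper: assumes a plain MAJOR.MINOR.PATCH string.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        # Incompatible change: MINOR and PATCH reset to zero.
        return f"{major + 1}.0.0"
    if level == "minor":
        # Backwards-compatible addition: PATCH resets to zero.
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        # Backwards-compatible fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level!r}")


print(bump("1.0.0", "minor"))  # append last month's data -> 1.1.0
print(bump("1.1.0", "patch"))  # fix a data entry error -> 1.1.1
print(bump("1.1.1", "major"))  # append a column -> 2.0.0
```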

#7

That’s a nice way to word it, @Stephen. It’s exactly as we had been discussing.

However, I still have doubts that updating the data to make it compatible with a dependency should necessarily increment the MAJOR version number. I think it kind of contradicts this part of the pattern:

PATCH version when you make backwards-compatible fixes, e.g.

  • Corrections to existing data
  • Changes to metadata

When you combine this pattern with the dependencies pattern, which explicitly models which version of the data it depends on for foreign keys, it seems that an increment in the MAJOR version number of a dependency is already explicit enough. The application can then decide whether or not it will need to use all of its dependencies.

In case it does, and if there is an increment in the MAJOR version number of any of its dependencies, it’s already clear enough that there is a change in the set of ‘the data plus all of its dependencies’ that would break the application.

On the other hand, if the application does not use all of its dependencies (e.g. if it does not need the fields that have foreign keys), the change would not break the application. The application can figure this out by looking at the dependencies and deciding whether or not it needs to dip into them. However, if the situation discussed here causes an increment to the MAJOR version number, the application cannot make use of the data because the versioning system is indicating a breaking change, even though the data would still be usable by the unmodified application.

So, maybe this situation should be labeled as a PATCH change in grants.csv. The specs can then let the application figure out for itself whether or not it needs and makes use of its dependencies, in which case a MAJOR version increment to any of those would be considered a breaking change.
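
To make that concrete, here is a rough sketch of the check an application could run. Everything in it is an assumption for illustration: the `is_breaking` function, the `uses_dependencies` flag, and the simplified dict shape with a `dependencies` mapping of name to exact version are not taken from the dependencies pattern, which may declare versions differently.

```python
def major(version: str) -> int:
    """Return the MAJOR part of a MAJOR.MINOR.PATCH string."""
    return int(version.split(".")[0])


def is_breaking(old: dict, new: dict, uses_dependencies: bool) -> bool:
    """Decide whether moving from `old` to `new` breaks the application.

    `old` and `new` are simplified, hypothetical descriptors such as
    {"version": "1.0.0", "dependencies": {"codes": "2.0.0"}}.
    """
    # A MAJOR change to the package itself always counts as breaking.
    if major(new["version"]) != major(old["version"]):
        return True
    # An application that ignores the dependencies can stop here.
    if not uses_dependencies:
        return False
    # Otherwise a MAJOR change (or removal) of any dependency is breaking.
    for name, old_dep in old["dependencies"].items():
        new_dep = new["dependencies"].get(name)
        if new_dep is None or major(new_dep) != major(old_dep):
            return True
    return False


grants_old = {"version": "1.0.0", "dependencies": {"codes": "2.0.0"}}
grants_new = {"version": "1.0.1", "dependencies": {"codes": "3.0.0"}}

print(is_breaking(grants_old, grants_new, uses_dependencies=False))  # False
print(is_breaking(grants_old, grants_new, uses_dependencies=True))   # True
```

With grants.csv labeled as a PATCH change, the application that only reads grants.csv keeps working, while the one that joins against codes.csv still sees the break through the dependency's MAJOR increment.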


#8

I can see your point: if codes.csv is in the same datapackage.json, then the foreignKeys reference in the schema won’t have changed and hence won’t require a MAJOR version change based on

MAJOR version when you make incompatible changes, e.g.

  • Change the table schema

However, I’m implementing the Foreign Keys to Data Packages pattern, and in that case the foreignKeys reference would change. This would then invoke a MAJOR version change.

Looks like further refinement is needed. Wording changes to the above welcome.

We probably need to cater for:

Perhaps these pattern statements need clarifying:

  • Corrections to existing data (to differentiate between fixing errors and re-coding values)
  • Change the table schema

#9

I’ve tried to be more explicit in the pattern below. What do you think? (I’m expecting some debate on my constraints statements.)

Love to hear from the original pattern contributors @henrykironde @ethanwhite @zhangcandrew @pwalsh @rufuspollock

@herrmann the change below turns your suggested grants.csv PATCH change into a MINOR change
@rufuspollock the change below turns your suggested grants.csv MAJOR change into a MINOR change

(I removed the forum solution indicator until we’re agreed.)

Data Package Version

The Data Package version format follows the Semantic Versioning specification format: MAJOR.MINOR.PATCH

The version numbers, and the way they change, convey meaning about how the data package has been modified from one version to the next.

Given a Data Package version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible changes, e.g.

  • Change the data package, resource or field name or identifier
  • Add, remove or re-order fields
  • Change a field type or format
  • Change a field constraint to be more restrictive
  • Combine, split, delete or change the meaning of data that is referenced by another data resource

MINOR version when you add data or change metadata in a backwards-compatible manner, e.g.

  • Add a new data resource to a data package
  • Add new data to an existing data resource
  • Change a field constraint to be less restrictive
  • Update a reference to another data resource
  • Change data to reflect changes in referenced data

PATCH version when you make backwards-compatible fixes, e.g.

  • Correct errors in existing data
  • Change descriptive metadata properties

Scenarios

  • You are developing your data through public consultation. Start your initial data release at 0.1.0
  • You release your data for the first time. Use version 1.0.0
  • You append last month's data to an existing release. Increment the MINOR version number
  • You append a column to the data. Increment the MAJOR version number
  • You relocate the data to a new URL or path. No change in the version number
  • You change a title, description, or other descriptive metadata. Increment the PATCH version
  • You fix a data entry error by modifying a value. Increment the PATCH version
  • You split a row of data in a foreign key reference table. Increment the MAJOR version number
  • You update the data and schema to refer to a new version of a foreign key reference table. Increment the MINOR version number
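
As a sanity check on the lists above, here is a small sketch that encodes them as a lookup table. The change labels and the `required_increment` function are invented for illustration and are not part of the pattern. It also assumes that a release combining several changes takes the highest level among them, which is the usual Semantic Versioning reading but is not stated in the pattern text.

```python
ORDER = ["patch", "minor", "major"]  # lowest to highest

# Hypothetical labels for the change kinds listed in the pattern.
RULES = {
    "rename_package_resource_or_field": "major",
    "add_remove_or_reorder_fields": "major",
    "change_field_type_or_format": "major",
    "tighten_field_constraint": "major",
    "recode_referenced_data": "major",  # combine, split, delete, change meaning
    "add_resource": "minor",
    "add_data": "minor",
    "loosen_field_constraint": "minor",
    "update_reference": "minor",
    "follow_referenced_data": "minor",
    "correct_data_error": "patch",
    "change_descriptive_metadata": "patch",
}


def required_increment(changes):
    """Return the highest level needed for `changes`, or None if empty."""
    levels = (RULES[change] for change in changes)
    return max(levels, key=ORDER.index, default=None)


# codes.csv: some codes were combined and others split.
print(required_increment(["recode_referenced_data"]))  # major

# grants.csv: data and reference updated to follow the new codes.csv.
print(required_increment(["update_reference", "follow_referenced_data"]))  # minor

# Relocating the data to a new URL involves none of the listed changes.
print(required_increment([]))  # None
```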

#10

It looks good, @Stephen!

I think the main issue we should be concerned with is whether changes are backwards compatible or break compatibility, and this proposal seems sensible to me. The constraint statements you suggest fit well in that line of thought - a change to a field constraint to make it more restrictive does break compatibility, but the other way around does not.
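
Here is a rough sketch of how the constraint rule could be checked, just to illustrate the idea. It only looks at the `minimum`, `maximum`, `minLength` and `maxLength` constraints from Table Schema and assumes their values are plain numbers. The function names are made up, and other constraints such as `enum` or `pattern` would need their own comparison.

```python
LOWER_BOUNDS = ("minimum", "minLength")
UPPER_BOUNDS = ("maximum", "maxLength")


def bound_tightened(key: str, old: dict, new: dict) -> bool:
    """Return True if the bound `key` is stricter in `new` than in `old`."""
    old_value, new_value = old.get(key), new.get(key)
    if new_value is None:
        return False  # bound removed or never set: not stricter
    if old_value is None:
        return True  # bound newly introduced: stricter
    if key in LOWER_BOUNDS:
        return new_value > old_value  # raising a lower bound tightens it
    return new_value < old_value  # lowering an upper bound tightens it


def constraint_increment(old: dict, new: dict):
    """Return 'major', 'minor' or None for a change in field constraints."""
    keys = LOWER_BOUNDS + UPPER_BOUNDS
    if any(bound_tightened(key, old, new) for key in keys):
        return "major"  # more restrictive: existing data may become invalid
    if any(old.get(key) != new.get(key) for key in keys):
        return "minor"  # only loosened: existing data stays valid
    return None  # no change in these constraints


print(constraint_increment({"maxLength": 10}, {"maxLength": 5}))   # major
print(constraint_increment({"maxLength": 10}, {"maxLength": 20}))  # minor
print(constraint_increment({"minimum": 0}, {"minimum": 0}))        # None
```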


#11

This looks good to me. Nice work @Stephen!


#12

PR submitted…