Readme.md practice for data packages

I’ve just read the FAQ for Readme.md files within a data package and have some questions.

The recommendation is to format the Readme as:

Short description of the dataset (the first sentence and first paragraph should be extractable to provide short standalone descriptions)

## Data
Put specific information about the data in a Data section. This can be things like information about the source of the data, the specific structure of the data, missing values etc.

## Preparation
Put information on preparing the data in a Preparation section. In particular, any instructions about how to run any preparation and processing scripts to generate the data should go here.

## License
Put additional information on the permissions and licensing of the data in the Data Package in the License section.  Since licensing information is often not clear from the data producers, the guideline here is to license the Data Package under the Public Domain Dedication and License, and then to add any relevant information or disclaimers regarding the source data.

This seems to repeat many things (schema, license) that would be found in the data package or table schema.

I’m wondering if we could assume that human-readable views of the data package would be provided and other complementary information may be more useful in the readme.md. For example:

  • Why was the dataset created? (reference to legislation if relevant)
  • How was it collected - what events led up to its collection?
  • When was it collected? (Temporal extent)
  • Where was it collected? (Spatial extent name, coordinate reference system, minimum bounding rectangle)
  • Which instruments were used to collect it?
  • What does “null” mean? Unknown, missing or not applicable?
  • Other comments, e.g.
      • error corrections
      • transformations
      • if the data has been aggregated, what level of detail can be expected
      • known caveats or limitations in the data

What information would be most valuable to open data consumers in determining if the data is fit for their purpose?

@Stephen I like the way you think :cool:

You have a very good point! The Data and License sections are superfluous, but it did remind me of a README.md that did provide this information in a human-readable way: https://github.com/cmoa/collection/blob/master/README.md This was presumably generated by a script that automatically pulled from the datapackage.json file and dumped the result into Markdown :thinking:.
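
For illustration, a minimal Python sketch of what such a script might look like. This is not the cmoa script, whose workings are unknown here; the function name `render_readme_sections` and the sample descriptor are made up, and only the `title`, `description`, `resources` and `licenses` properties are read.

```python
def render_readme_sections(descriptor: dict) -> str:
    """Render a few datapackage.json properties as Markdown (hypothetical helper)."""
    title = descriptor.get("title") or descriptor.get("name", "Untitled data package")
    lines = [f"# {title}"]
    if descriptor.get("description"):
        lines += ["", descriptor["description"]]

    lines += ["", "## Data"]
    for resource in descriptor.get("resources", []):
        lines += ["", f"### {resource.get('name', 'unnamed resource')}", ""]
        schema = resource.get("schema")
        # A schema may also be given as a path to a file; only inline schemas are handled.
        fields = schema.get("fields", []) if isinstance(schema, dict) else []
        for field in fields:
            entry = f"- `{field['name']}`"
            if field.get("type"):
                entry += f" ({field['type']})"
            if field.get("description"):
                entry += f": {field['description']}"
            lines.append(entry)

    licenses = descriptor.get("licenses", [])
    if licenses:
        lines += ["", "## License", ""]
        for licence in licenses:
            label = licence.get("title") or licence.get("name", "unknown")
            path = licence.get("path")
            lines.append(f"- [{label}]({path})" if path else f"- {label}")
    return "\n".join(lines) + "\n"


# Made-up descriptor, for illustration only.
sample = {
    "name": "example-package",
    "title": "Example package",
    "description": "A made-up descriptor for illustration.",
    "licenses": [
        {
            "name": "ODC-PDDL-1.0",
            "path": "http://opendatacommons.org/licenses/pddl/",
            "title": "Open Data Commons Public Domain Dedication and License v1.0",
        }
    ],
    "resources": [
        {
            "name": "artworks",
            "path": "artworks.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer", "description": "Record identifier"},
                    {"name": "title", "type": "string"},
                ]
            },
        }
    ],
}

print(render_readme_sections(sample))
```

In practice the descriptor would come from `json.load` on the real datapackage.json rather than an inline dict, and the hand-written parts of the README (provenance, caveats) would be kept separate from the generated ones.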

When was it collected? (Temporal extent)

Believe it or not, as of the current version, this can also be put into a Data Package: Data Package | Frictionless Standards (search for “temporal”)

Everything else is great! In a way, it’s an interesting exercise to list all the things that a user should know before working with a dataset, and then go through them and see which could benefit from being added as structured data to the datapackage.json (e.g. we have temporal but not spatial: why?)

Again, this is also relevant to the refresh of the guides that we have coming soon: Update guides to Frictionless Data specifications v1 · Issue #332 · frictionlessdata/project · GitHub

To explain (and following your intuition): these sections are for explaining those things, not restating them. For example, the license section is not about stating the license but about explaining that choice, especially in the case where you reuse other data.

Regarding the data section: I’ve found that often you want to explain where the data comes from. Obviously you don’t want to repeat stuff, and in the upcoming datahub.io data packagization upgrade (coming soon!) one thing we have done is allow people to embed stuff from their datapackage.json into the README.

Thanks @rufuspollock, so, if I understand correctly:

  • the data package licence would apply to the whole package (i.e. a licence spanning the data resources, datapackage.json and readme.md).
  • if the data resources have different licences to the data package, I could specify this using the optional licenses property in each data resource.
  • if I’ve got tricky nested copyright in a data resource that I need to explain (e.g. the data contains a mix of CC0 and CC BY-SA content from different sources), I could explain that in the Readme.md.

Have I got that right?
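
The first two points can be sketched in Python as follows. This is only an illustration of the reading above, under the assumption that a resource-level `licenses` property takes precedence and that a resource without one falls back to the package-level `licenses`. The function name `effective_licenses` and the sample descriptor are made up.

```python
def effective_licenses(descriptor: dict) -> dict:
    """Map each resource name to the licence names assumed to apply to it.

    Assumption: resource-level `licenses` override the package-level ones;
    a resource without its own `licenses` inherits the package's.
    """
    package_licenses = descriptor.get("licenses", [])
    result = {}
    for resource in descriptor.get("resources", []):
        licences = resource.get("licenses") or package_licenses
        # Fall back to the path when a licence entry has no name.
        result[resource["name"]] = [
            licence.get("name") or licence.get("path") for licence in licences
        ]
    return result


# Made-up descriptor: one resource inherits, one overrides.
sample_package = {
    "name": "example-package",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {"name": "boundaries", "path": "boundaries.csv"},
        {
            "name": "photos",
            "path": "photos.csv",
            "licenses": [{"name": "CC-BY-SA-4.0"}],
        },
    ],
}

for name, licences in effective_licenses(sample_package).items():
    print(name, licences)
```

The third point (mixed licences inside a single resource) has no structured equivalent in this sketch, which is where the Readme.md explanation would come in.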

Is there any particular flavour of markdown preferred for a readme.md file? I note there are a number of variations:

I assume different data package friendly platforms may provide varying levels of support for portraying markdown files.

@Stephen the type of Markdown was left unspecified. There was some discussion about this in the original issue. Should the lack of a preferred markdown type be made more explicit?

e.g.:

The description MUST be Markdown formatted – this also allows for simple plain text as plain text is itself valid Markdown. The specific version (or “flavor”) of Markdown is unspecified.

Proposal: markdown in descriptions · Issue #152 · frictionlessdata/specs · GitHub

I think being explicit about the type of Markdown is good, enabling implementers of:

  • tools to create data packages to only use “tags” that will be rendered correctly on supporting platforms.
  • supporting platforms to have a simpler approach to rendering a defined set of Markdown

I’m leaning towards using CommonMark as a “stricter” implementation of the original Markdown in the planned changes to Comma Chameleon.

My target platforms are:

Maybe you can raise an issue on Issues · frictionlessdata/specs · GitHub with your suggestions?

Done

Here’s my sample text to help a data packager get started providing provenance information. Feedback welcome.