Readme.md practice for data packages

Stephen · June 10, 2017, 5:12am

I’ve just read the FAQ for Readme.md files within a data package and have some questions.

The recommendation is to format the Readme as:

Short description of the dataset (the first sentence and first paragraph should be extractable to provide short standalone descriptions)

## Data
Put specific information about the data in a Data section. This can be things like information about the source of the data, the specific structure of the data, missing values etc.

## Preparation
Put information on preparing the data in a Preparation section. In particular, any instructions about how to run any preparation and processing scripts to generate the data should go here.

## License
Put additional information on the permissions and licensing of the data in the Data Package in the License section.  Since licensing information is often not clear from the data producers, the guideline here is to license the Data Package under the Public Domain Dedication and License, and then to add any relevant information or disclaimers regarding the source data.

This seems to repeat many things (schema, license) that would be found in the data package or table schema.

I’m wondering if we could assume that human-readable views of the data package would be provided and other complementary information may be more useful in the readme.md. For example:

Why was the dataset created? (reference to legislation if relevant)
How was it collected - what events led up to its collection?
When was it collected? (Temporal extent)
Where was it collected? (Spatial extent name, coordinate reference system, minimum bounding rectangle)
Which instruments were used to collect it?
What does “null” mean? Unknown, missing or not applicable?
Other comments e.g.
error corrections
transformations
if the data had been aggregated, what level of detail can be expected
known caveats or limitations in the data

What information would be most valuable to open data consumers in determining if the data is fit for their purpose?

danfowler · June 12, 2017, 9:57am

@Stephen I like the way you think

You have a very good point! The Data and License sections are superfluous, but it did remind me of a README.md that did provide this information in a human-readable way: https://github.com/cmoa/collection/blob/master/README.md This was presumably generated by a script that automatically pulled from the datapackage.json file and dumped the result into Markdown.
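A minimal sketch of what such a script might look like, using only the Python standard library. This is not the script behind the cmoa README; the descriptor contents, the `render_readme` name and the choice of sections are all assumptions made for illustration.

```python
import json

# Hypothetical descriptor, inlined so the sketch runs without a file on disk.
DESCRIPTOR_JSON = """
{
  "name": "example-package",
  "title": "Example Package",
  "description": "A short description of the dataset.",
  "licenses": [
    {"name": "ODC-PDDL-1.0",
     "title": "Open Data Commons Public Domain Dedication and License v1.0"}
  ],
  "resources": [
    {
      "name": "observations",
      "schema": {
        "fields": [
          {"name": "id", "type": "integer", "description": "Row identifier"},
          {"name": "recorded", "type": "date", "description": "Observation date"}
        ]
      }
    }
  ]
}
"""


def render_readme(descriptor):
    """Render a few descriptor properties as Markdown text."""
    title = descriptor.get("title") or descriptor.get("name", "Untitled")
    lines = ["# " + title, ""]

    if descriptor.get("description"):
        lines += [descriptor["description"], ""]

    # One sub-section per resource, with a table of its schema fields.
    lines += ["## Data", ""]
    for resource in descriptor.get("resources", []):
        lines += ["### " + resource.get("name", "unnamed"), ""]
        fields = resource.get("schema", {}).get("fields", [])
        if fields:
            lines += ["| Field | Type | Description |", "| --- | --- | --- |"]
            for field in fields:
                row = (field.get("name", ""), field.get("type", ""),
                       field.get("description", ""))
                lines.append("| " + " | ".join(row) + " |")
            lines.append("")

    # List the package-level licenses, preferring the human-readable title.
    licenses = descriptor.get("licenses", [])
    if licenses:
        lines += ["## License", ""]
        for lic in licenses:
            lines.append("- " + (lic.get("title") or lic.get("name", "")))
        lines.append("")

    return "\n".join(lines)


readme = render_readme(json.loads(DESCRIPTOR_JSON))
print(readme)
```

Generating these sections keeps the README in step with the descriptor, which leaves the hand-written part free for the provenance questions raised above.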

When was it collected? (Temporal extent)

Believe it or not, as of the current version, this can also be put into a Data Package: Data Package | Frictionless Standards (search for “temporal”)

Everything else is great! In a way, it’s an interesting exercise to list all the things that a user should know before working with a dataset, and then go through them and see which things could benefit from being added as structured data to the datapackage.json (e.g. we have temporal but not spatial; why?)

danfowler · June 14, 2017, 8:10am

Again, this is also relevant to the push on a refresh to the guides we have coming soon: Update guides to Frictionless Data specifications v1 · Issue #332 · frictionlessdata/project · GitHub

rufuspollock · June 18, 2017, 6:30pm

To explain (and following your intuition): these sections are for explaining those things, not restating them. For example, the license section is not about stating the license but about explaining that choice, especially in the case where you reuse other data.

Regarding the data section: I’ve found that often you want to explain where the data comes from. Obviously you don’t want to repeat stuff, and in the upcoming datahub.io data packagization upgrade (coming soon!) one thing we have done is allow people to embed stuff from their datapackage.json into the README.

Stephen · June 18, 2017, 7:30pm

Thanks @rufuspollock, so, if I understand correctly:

the data package licence would apply to the whole package (i.e. a licence spanning the data resources, datapackage.json and readme.md).
if the data resources have different licences to the data package, I could specify this using the optional licenses property in each data resource.
if I’ve got tricky nested copyright in a data resource that I need to explain (e.g. the data contains a mix of CC0 and CC BY-SA content from different sources) I could explain that in the Readme.md.

Have I got that right?
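The reading above can be sketched in Python as follows. This is only an illustration of that interpretation: the descriptor contents, the resource names and the `effective_licenses` helper are hypothetical, and the fallback rule (a resource without its own `licenses` uses the package-level ones) is an assumption to check against the specification.

```python
# Hypothetical descriptor: one package-level licence, one resource that
# overrides it and one resource that does not.
descriptor = {
    "name": "example-package",
    "licenses": [{"name": "ODC-PDDL-1.0"}],
    "resources": [
        {"name": "survey", "path": "data/survey.csv"},
        {
            "name": "photos",
            "path": "data/photos.csv",
            "licenses": [{"name": "CC-BY-SA-4.0"}],
        },
    ],
}


def effective_licenses(descriptor, resource_name):
    """Return the licences that apply to the named resource.

    A resource's own `licenses` take precedence; otherwise the
    package-level `licenses` are used (assumed fallback rule).
    """
    for resource in descriptor.get("resources", []):
        if resource.get("name") == resource_name:
            return resource.get("licenses") or descriptor.get("licenses", [])
    raise KeyError(resource_name)


for name in ("survey", "photos"):
    names = [lic["name"] for lic in effective_licenses(descriptor, name)]
    print(name, names)
```

Mixed licensing inside a single resource, as in the third point, has no place in this structure, so it would still need prose in the Readme.md.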

Stephen · July 3, 2017, 8:12am

Is there any particular flavour of markdown preferred for a readme.md file? I note there are a number of variations:

I assume different data package friendly platforms may provide varying levels of support for portraying markdown files.

danfowler · July 4, 2017, 4:08am

@Stephen the type of Markdown was left unspecified. There was some discussion about this in the original issue. Should the lack of a preferred markdown type be made more explicit?

e.g.:

The description MUST be Markdown formatted – this also allows for simple plain text as plain text is itself valid Markdown. The specific version (or “flavor”) of Markdown is unspecified.

Proposal: markdown in descriptions · Issue #152 · frictionlessdata/specs · GitHub

Stephen · July 4, 2017, 10:43am

I think being explicit about the type of Markdown is good, enabling implementers of:

tools to create data packages to only use “tags” that will be rendered correctly on supporting platforms.
supporting platforms to have a simpler approach to rendering a defined set of Markdown

I’m leaning towards using CommonMark as a “stricter” implementation of the original Markdown in the planned changes to Comma Chameleon.

My target platforms are:

CKAN via the Data Packager extension
GitHub via the Comma Chameleon - Octopub integration

danfowler · July 5, 2017, 5:06am

Maybe you can raise an issue on Issues · frictionlessdata/specs · GitHub with your suggestions?

Stephen · July 5, 2017, 5:18am

Done

https://github.com/frictionlessdata/specs/issues/485

Stephen · August 5, 2017, 6:43am

Here’s my sample text to help a data packager get started providing provenance information. Feedback welcome

https://github.com/qcif/data-curator/blob/develop/test/features/tools/sample-provenance-information.md

A short description of the dataset. The first sentence and the first paragraph should be written to provide short standalone descriptions. These descriptions may be used by data platforms and other software to provide a summary of the data.

## Data

### Why was the data created?

- Reference any law or policy that requires you to collect the data.

### When was the data collected?

- On what day or over what duration was the data collected?
- Consider [adding a temporal extent](http://frictionlessdata.io/specs/data-package/#descriptor) to the tabular data resource
- How often will the data be updated?

### Where was the data collected?

- Provide a well known name for the area the data was collected in
- Provide a minimum bounding rectangle to describe the spatial extent if you have not implemented this using constraints
- If location data is included in the data, what is the coordinate reference system?
