I’ve just read the FAQ for Readme.md files within a data package and have some questions.
The recommendation is to format the Readme as:
Short description of the dataset (the first sentence and first paragraph should be extractable to provide short standalone descriptions)
## Data
Put specific information about the data in a Data section. This can be things like information about the source of the data, the specific structure of the data, missing values etc.
## Preparation
Put information on preparing the data in a Preparation section. In particular, any instructions about how to run any preparation and processing scripts to generate the data should go here.
## License
Put additional information on the permissions and licensing of the data in the Data Package in the License section. Since licensing information is often not clear from the data producers, the guideline here is to license the Data Package under the Public Domain Dedication and License, and then to add any relevant information or disclaimers regarding the source data.
This seems to repeat many things (schema, license) that would be found in the data package or table schema.
I’m wondering if we could assume that human-readable views of the data package would be provided and other complementary information may be more useful in the readme.md. For example:
Why was the dataset created? (reference to legislation if relevant)
How was it collected - what events led up to its collection?
When was it collected? (Temporal extent)
Where was it collected? (Spatial extent name, coordinate reference system, minimum bounding rectangle)
Which instruments were used to collect it?
What does “null” mean? Unknown, missing or not applicable?
Other comments e.g.
error corrections
transformations
if the data has been aggregated, what level of detail can be expected
known caveats or limitations in the data
What information would be most valuable to open data consumers in determining if the data is fit for their purpose?
You have a very good point! The Data and License sections are superfluous, but it did remind me of a README.md that provides this information in a human-readable way: https://github.com/cmoa/collection/blob/master/README.md It was presumably generated by a script that automatically pulled from the datapackage.json file and dumped the result into Markdown.
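As a rough illustration of what such a script might look like (this is a guess at the approach, not the code behind the linked README), here is a minimal Python sketch. It assumes the standard `title`, `description`, `resources`, `schema.fields` and `licenses` properties; the function name `render_readme_sections` and the example descriptor are hypothetical.

```python
import json


def render_readme_sections(descriptor: dict) -> str:
    """Render a few datapackage.json properties as Markdown text."""
    title = descriptor.get("title") or descriptor.get("name", "Untitled")
    lines = [f"# {title}", ""]

    if descriptor.get("description"):
        lines += [descriptor["description"], ""]

    # One section per resource, one bullet per Table Schema field.
    for resource in descriptor.get("resources", []):
        lines += [f"## {resource.get('name', 'resource')}", ""]
        for field in resource.get("schema", {}).get("fields", []):
            # Table Schema treats a missing type as "string".
            field_type = field.get("type", "string")
            desc = field.get("description", "")
            lines.append(f"- `{field['name']}` ({field_type}): {desc}".rstrip())
        lines.append("")

    licenses = descriptor.get("licenses", [])
    if licenses:
        lines += ["## License", ""]
        lines += [f"- {lic.get('title') or lic.get('name')}" for lic in licenses]

    return "\n".join(lines).strip() + "\n"


# Hypothetical descriptor, inlined so the sketch runs without a file on disk.
example = json.loads("""
{
  "name": "example-collection",
  "title": "Example Collection",
  "description": "A made-up dataset used only for this sketch.",
  "licenses": [{"name": "ODC-PDDL-1.0", "title": "Public Domain Dedication and License"}],
  "resources": [
    {
      "name": "items",
      "path": "items.csv",
      "schema": {
        "fields": [
          {"name": "id", "type": "integer", "description": "Unique identifier"},
          {"name": "label"}
        ]
      }
    }
  ]
}
""")

print(render_readme_sections(example))
```

In practice you would load the descriptor with `json.load(open("datapackage.json"))` and write the result into a clearly marked part of the README, leaving the hand-written context (why, how, caveats) untouched.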
When was it collected? (Temporal extent)
Believe it or not, as of the current version, this can also be put into a Data Package: see the Data Package page of the Frictionless Standards (search for “temporal”)
Everything else is great! In a way, it’s an interesting exercise to list all the things that a user should know before working with a dataset, and then go through them and see which things could benefit from being added as structured data to the datapackage.json (e.g. we have temporal but not spatial; why?)
To explain (and following your intuition): these sections are for explaining these things, not restating them. For example, the License section is not about stating the license but about explaining that choice, especially in the case where you reuse other data.
Regarding the Data section: I’ve found that you often want to explain where the data comes from. Obviously you don’t want to repeat stuff, and in the upcoming datahub.io data packagization upgrade (coming soon!) one thing we have done is allow people to embed stuff from their datapackage.json into the README.
Thanks @rufuspollock, so, if I understand correctly:
the data package licence would apply to the whole package (i.e. a licence spanning the data resources, data package.json and readme.md).
if the data resources have different licences to the data package, I could specify this using the optional licenses property in each data resource.
if I’ve got tricky nested copyright in a data resource that I need to explain (e.g. the data contains a mix of CC0 and CC BY-SA content from different sources), I could explain that in the Readme.md.
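To make the first two points concrete, here is a small Python sketch of how a consumer might work out which licence applies to each resource, assuming a resource without its own `licenses` property inherits the package-level one. The helper name `effective_licenses`, the resource names and the licence identifiers are illustrative assumptions, not taken from a real package.

```python
def effective_licenses(package: dict) -> dict:
    """Map each resource name to the licence names that apply to it.

    A resource-level `licenses` list takes precedence; otherwise the
    package-level list is used.
    """
    package_licenses = package.get("licenses", [])
    result = {}
    for resource in package.get("resources", []):
        licenses = resource.get("licenses") or package_licenses
        # A licence object carries a `name` and/or a `path`.
        result[resource["name"]] = [
            lic.get("name") or lic.get("path") for lic in licenses
        ]
    return result


# Hypothetical descriptor: one resource inherits, one overrides.
package = {
    "name": "example-package",
    "licenses": [{"name": "ODC-PDDL-1.0"}],
    "resources": [
        {"name": "survey", "path": "survey.csv"},
        {
            "name": "photos",
            "path": "photos.csv",
            "licenses": [{"name": "CC-BY-SA-4.0"}],
        },
    ],
}

print(effective_licenses(package))
```

A mix of licences inside a single resource (the third point) has no structured equivalent in this sketch, which is why the Readme.md remains the place to explain it.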
@Stephen the type of Markdown was left unspecified. There was some discussion about this in the original issue. Should the lack of a preferred markdown type be made more explicit?
e.g.:
The description MUST be Markdown formatted – this also allows for simple plain text as plain text is itself valid Markdown. The specific version (or “flavor”) of Markdown is unspecified.