I've just read the FAQ for Readme.md files within a data package and have some questions.
The recommendation is to format the Readme as:
- Short description of the dataset (the first sentence and first paragraph should be extractable to provide short standalone descriptions)
- Put specific information about the data in a Data section. This can be things like information about the source of the data, the specific structure of the data, missing values etc.
- Put information on preparing the data in a Preparation section. In particular, any instructions about how to run any preparation and processing scripts to generate the data should go here.
- Put additional information on the permissions and licensing of the data in the Data Package in the License section. Since licensing information is often not clear from the data producers, the guideline here is to license the Data Package under the Public Domain Dedication and License, and then to add any relevant information or disclaimers regarding the source data.
This seems to repeat many things (schema, license) that would already be found in the data package descriptor or table schema.
I'm wondering whether we could assume that human-readable views of the data package will be provided, so that the Readme.md can carry complementary information instead. For example:
- Why was the dataset created? (reference to legislation if relevant)
- How was it collected - what events led up to its collection?
- When was it collected? (Temporal extent)
- Where was it collected? (Spatial extent name, coordinate reference system, minimum bounding rectangle)
- Which instruments were used to collect it?
- What does “null” mean? Unknown, missing or not applicable?
- Other comments, e.g.:
  - error corrections
  - if the data has been aggregated, what level of detail can be expected
  - known caveats or limitations in the data
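
To make the suggestion a bit more concrete, here is a rough Python sketch that turns the questions above into a Readme.md skeleton. The section headings and the `readme_skeleton` helper are my own hypothetical names, not anything defined by the Data Package specification or the FAQ.

```python
# Hypothetical section headings derived from the questions above;
# none of these names come from the Data Package specification.
QUESTIONS = [
    ("Purpose", "Why was the dataset created?"),
    ("Collection", "How was it collected?"),
    ("Temporal extent", "When was it collected?"),
    ("Spatial extent", "Where was it collected?"),
    ("Instruments", "Which instruments were used?"),
    ("Null values", "What does null mean: unknown, missing or not applicable?"),
    ("Other comments", "Error corrections, aggregation level, known caveats."),
]


def readme_skeleton(title, description):
    """Return a Markdown README skeleton with one section per question."""
    # Title and short description come first so they stay extractable.
    lines = [f"# {title}", "", description, ""]
    for heading, prompt in QUESTIONS:
        # Each prompt is left as an HTML comment for the author to replace.
        lines += [f"## {heading}", "", f"<!-- {prompt} -->", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(readme_skeleton("Example dataset", "One-sentence description."))
```

The short description is kept at the top, as the FAQ recommends, and everything after it is free-text context that a schema or license field would not normally capture.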
What information would be most valuable to open data consumers in determining if the data is fit for their purpose?