We, at Queensland University of Technology, have built a data publishing platform using CKAN. Its purpose is to give researchers a platform to publish their data, either as standalone sets or as supplementary material for a research publication.
We encountered some questions that needed to be addressed at the business practice/business logic level. Namely: what do we do about data that are subject to embargo conditions? What do we do about data that are not quite ready for consumption by the general public (working data)? And what do we do about data whose authors do not wish to share them publicly (e.g. before journal publication)?
In all of these cases, we would like to publicise the existence of the data sets without necessarily publishing them. CKAN’s public/private settings for data sets are not suitable for this, so we created a new setting that exposes the metadata but restricts downloads to authorised users.
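The rule behind that setting can be sketched in a few lines. This is an illustrative model only, assuming a simple visibility flag and per-dataset authorisation list; the names (`Dataset`, `can_download`, `metadata_only`) are hypothetical and are not CKAN’s actual plugin API.

```python
# Hypothetical sketch of the "metadata visible, download restricted" rule.
# Not CKAN's real data model; the field and function names are illustrative.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Dataset:
    name: str
    visibility: str = "public"   # "public", "private", or our new "metadata_only"
    authorised_users: set = field(default_factory=set)

def can_view_metadata(ds: Dataset, user: Optional[str]) -> bool:
    # Metadata is exposed for both public and metadata-only data sets,
    # so their existence can be publicised without publishing the data.
    if ds.visibility in ("public", "metadata_only"):
        return True
    return user in ds.authorised_users

def can_download(ds: Dataset, user: Optional[str]) -> bool:
    # Downloads stay open for public data sets; anything else
    # requires the user to be on the authorised list.
    if ds.visibility == "public":
        return True
    return user is not None and user in ds.authorised_users
```

Under this rule an anonymous visitor can discover an embargoed data set and read its description, but the download itself is only honoured for authorised users.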
Arguably, this is a step back from the open mandate, since some data sets are no longer accessible publicly (and anonymously). But the alternative is not putting these data sets on CKAN at all. There is also the question of how sustainable our customisation of CKAN will be if it is not picked up by the community and remains “niche”.
The other points you raise are, in my opinion, related to:
A/ how should original (mother) data sets and derivative (daughter) data sets be connected?
B/ where is the information that allows us to derive daughter data sets from mother data sets (instructions/code/rationale) to enable reproducibility?
C/ how should this information be presented so that it can be subject to public scrutiny (verifiability)?
D/ how to connect original data sets with replicated data sets? (i.e. data sets that used the same methods to be collected/derived but rely on different source material.)
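For point A at least, CKAN’s action API does include a `package_relationship_create` call with built-in relationship types such as `derives_from`. A small sketch of building the request body for it, with the derivation instructions (point B) carried in the comment field — the dataset names and the helper function here are made up for illustration:

```python
# Sketch: recording a mother/daughter link via CKAN's dataset relationship
# API (package_relationship_create). The action and the "derives_from" type
# come from CKAN; the dataset names and this helper are illustrative.

import json

def derivation_payload(daughter: str, mother: str, how: str) -> str:
    """Build the JSON body for a package_relationship_create request,
    keeping the derivation instructions in the comment for reproducibility."""
    return json.dumps({
        "subject": daughter,     # the derivative (daughter) data set
        "object": mother,        # the original (mother) data set
        "type": "derives_from",  # one of CKAN's built-in relationship types
        "comment": how,          # pointer to the code/instructions used
    })

body = derivation_payload(
    "daughter-ds", "mother-ds",
    "filtered with the cleaning script published alongside the data set",
)
```

The body would then be POSTed to the `package_relationship_create` action on the CKAN instance by an authorised API user. This covers the linkage itself; points B–D still need the instructions/code to actually exist somewhere citable.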
CKAN on its own is not enough. Linking it to scientific workflow software would be a good step towards answering these questions. In my (limited) experience, scientific workflow tools vary widely in quality and application. Some are very specific to a particular scientific domain or type of application; others are cumbersome or not very user-friendly. At the other extreme are generic tools that require programming experience, such as R or Matlab (or Python, or whatever).
But still, in my opinion, it is probably at the level of human practice that processes need to be developed. Ultimately, if researchers do the right thing and provide the information needed, it does not matter which tools we choose. Unfortunately, this requires a level of education and training that we have not been able to provide so far.