Is there a model for frictionless science?

Starl3n · September 20, 2015, 2:33am

I’ve been working with open government data for the last few years, mostly around the platform capability of CKAN. As I’ve gone further into the various areas concerning data, I’ve inevitably bumped up against project requirements for hosting a platform for research data. While CKAN still needs to mature in this space, it has immediate application in open access catalogues that make both research papers and research datasets available. It is less mature as a research management system or as a tool for managing ‘working data’, but I don’t think it is too much of a stretch to integrate or develop with these extended requirements in mind.

However, when looking at open access I’ve been considering what frictionless science might look like. I’ve been thinking about how to publish the full set of research artifacts needed to replicate and review work undertaken by labs, or to swap out data and reconstitute the research in a new context. That thinking, done with only limited access to end users, has produced the following short list of what might be published as a ‘dataset’ listing of ‘resources’:

Paper - the summary narrative which explains all context for the work
Data - any raw data used to test a hypothesis
Code - any algorithms, open source codebases and configuration details used to work with raw data to produce insights or secondary analysis inputs
Environment - any infrastructure orchestration scripts for automating the replication of data analysis in publicly available cloud environments.
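
To make the idea concrete, here is a minimal sketch of how those four items might be expressed as one dataset with four resources. It assumes the general shape of a CKAN package (a dataset dict with `name`, `title` and a `resources` list whose entries carry `name`, `url` and `format`). The helper name, the resource kinds and the example.org URLs are hypothetical placeholders for illustration, not part of either project mentioned in this post.

```python
# The four resource kinds from the list above (names are illustrative).
RESOURCE_KINDS = ("paper", "data", "code", "environment")


def build_research_dataset(name, title, resources):
    """Build a dict shaped like a CKAN dataset with one resource per kind.

    `resources` maps each kind to a (url, format) pair. A ValueError is
    raised if any of the four kinds is missing, since the point is to
    publish the full set needed to replicate the work.
    """
    missing = [kind for kind in RESOURCE_KINDS if kind not in resources]
    if missing:
        raise ValueError("missing resources: " + ", ".join(missing))
    return {
        "name": name,
        "title": title,
        "resources": [
            {
                "name": kind,
                "url": resources[kind][0],
                "format": resources[kind][1],
            }
            for kind in RESOURCE_KINDS
        ],
    }


# Hypothetical example; every URL below is a placeholder.
example = build_research_dataset(
    "example-study",
    "Example study",
    {
        "paper": ("https://example.org/study/paper.pdf", "PDF"),
        "data": ("https://example.org/study/raw.csv", "CSV"),
        "code": ("https://example.org/study/analysis.zip", "ZIP"),
        "environment": ("https://example.org/study/cluster.yml", "YAML"),
    },
)
```

A dict like this could in principle be serialised to JSON and sent to CKAN's `package_create` action, though a real instance will usually require further fields (an owning organisation, a licence and so on) depending on how it is configured.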

With all that said, I’d like to point those interested in this idea to the following recent work:

Pyramids, pipelines and a can-of-sweave - work done by Florian Mayer for the WA Dept of Parks and Wildlife. I think this is a great example of how to cover the first three points above.

How to build a supercomputer on AWS with spot instances - work done by Link Digital (disclosure - this is my company) as a funded proof of concept thanks to Intel, AWS and the NCI (24th largest supercomputer facility in the world as of this afternoon).

You can review the work via the video demonstrations below. All work is or will shortly be open sourced. I’m very keen to get some feedback on all this from those actively working on open science initiatives.