A tool for collaborating on datapackages


#1

Hello,

I’d like to present a tool that is useful for preparing datapackages: tuttle, a make for data.

When we write scripts to create data, we rarely get them right the first time. How many times have you had to comment out the beginning of a script so that execution jumps directly to the part you just fixed?
With tuttle, you won’t have to. First, it computes only what is necessary: for example, if a file has already been downloaded, it won’t download it again. Second, when you change a line of code, tuttle knows exactly which data must be removed and which part of the code must be run again.
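
To make the idea concrete, here is a minimal Python sketch of that kind of "run only what changed" logic. It is not tuttle's actual implementation: the function names, the state file, and the choice of hashing the step's code are all assumptions made for illustration.

```python
import hashlib
import json
import os


def fingerprint(code):
    """Return a stable hash of a step's source code."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def run_step(output_path, code, action, state_path):
    """Run `action` only if the output is missing or the code changed.

    `action` is a hypothetical callable that writes `output_path`.
    Returns True if the step ran, False if it was skipped.
    """
    state = {}
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            state = json.load(f)

    current = fingerprint(code)
    if os.path.exists(output_path) and state.get(output_path) == current:
        return False  # output exists and was built by this exact code

    # The code changed or the output is missing: drop stale data, then rebuild.
    if os.path.exists(output_path):
        os.remove(output_path)
    action(output_path)

    # Remember which version of the code produced this output.
    state[output_path] = current
    with open(state_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    return True
```

Calling `run_step` twice with the same code does the work once; changing the code string, or deleting the output, makes the next call rebuild it.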

This brings fluidity and repeatability when you work on your own. But it is also very useful for teamwork: when you merge code, retrieve scripts modified by someone else, or use a continuous integration system, you don’t need to wonder which data is no longer valid: tuttle works it out for you.
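
Working out which data is no longer valid amounts to walking the dependency graph downstream from whatever changed. The sketch below shows one way to do that in Python; the file names and the `steps` mapping are hypothetical, and this is an illustration of the principle rather than tuttle's own code.

```python
from collections import deque


def invalidated_outputs(steps, changed):
    """Return every output that depends, directly or not, on a changed step.

    `steps` maps an output name to the list of inputs it is built from.
    `changed` is the set of outputs whose producing code was modified.
    """
    # Reverse the graph: for each resource, which outputs consume it?
    consumers = {}
    for output, inputs in steps.items():
        for resource in inputs:
            consumers.setdefault(resource, []).append(output)

    # Breadth-first walk from the changed outputs to everything downstream.
    invalid = set(changed)
    queue = deque(changed)
    while queue:
        resource = queue.popleft()
        for output in consumers.get(resource, []):
            if output not in invalid:
                invalid.add(output)
                queue.append(output)
    return invalid


# Hypothetical workflow: one chain of three steps plus an unrelated step.
steps = {
    "prices.csv": ["prices.xls"],
    "monthly.csv": ["prices.csv"],
    "report.html": ["monthly.csv"],
    "readme.html": ["readme.md"],
}
print(sorted(invalidated_outputs(steps, {"prices.csv"})))
# → ['monthly.csv', 'prices.csv', 'report.html']
```

The unrelated `readme.html` stays valid, so after a merge only the affected chain would need to be rebuilt.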

Moreover, you can follow the progress of the computation with a report like this: http://stuff.lexman.org/s-and-p-500/scripts/.tuttle/report.html

You can use any language or arbitrary tool (like an xls-to-csv converter or a git command), and tuttle already has built-in support for shell, batch, Python, and SQL for SQLite.

If you’re interested in this fluid way of working on data, tuttle’s tutorial explains in detail how to use it: https://github.com/lexman/tuttle/blob/master/doc/tutorial_musketeers/tutorial.md

You can also have a look at how I translated one of the core packages (s-and-p-500) into a tuttlefile: https://github.com/lexman/s-and-p-500/blob/tuttle/scripts/tuttlefile . It runs every hour on my server, so whenever the xls file changes, tuttle handles the whole datapackage update and even pushes the data back to github.

Tuttle is a tool for collaborating on data the way we collaborate on code, so it might be of interest to the Open Knowledge Foundation.

I hope you enjoy it,

Alexandre


#2

Hi, Alexandre.

This tool looks awesome! I look forward to experimenting with it in the near future.

An interesting feature to add to this tool could be automatically generating machine-readable provenance documentation using the W3C PROV model and one of its standard notations.
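
As a rough illustration of what such an export might contain, here is a small Python sketch that describes a single workflow step in the style of PROV-JSON (a W3C Member Submission). The `ex` prefix, the file names, and the `prov_for_step` helper are all made up for the example; nothing here reflects an existing tuttle feature.

```python
import json


def prov_for_step(inputs, output, step_id):
    """Describe one workflow step as a PROV-JSON-style dictionary."""
    doc = {
        "prefix": {"ex": "http://example.org/"},  # hypothetical namespace
        # Every file, input or output, is a PROV entity.
        "entity": {f"ex:{name}": {} for name in inputs + [output]},
        # The step itself is a PROV activity.
        "activity": {f"ex:{step_id}": {}},
        "used": {},
        "wasGeneratedBy": {
            "_:gen1": {
                "prov:entity": f"ex:{output}",
                "prov:activity": f"ex:{step_id}",
            }
        },
    }
    # One "used" relation per input consumed by the step.
    for i, name in enumerate(inputs, start=1):
        doc["used"][f"_:use{i}"] = {
            "prov:activity": f"ex:{step_id}",
            "prov:entity": f"ex:{name}",
        }
    return doc


print(json.dumps(prov_for_step(["data.xls"], "data.csv", "convert"), indent=2))
```

A full export would presumably emit one such activity per step of the workflow and merge them into a single document.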

Cheers,
Augusto


#3

Hello herrmann,

I had a look at the PROV model, and it seems appropriate for exporting a workflow to one of these formats.
I’ve added a feature request on github to track it: https://github.com/lexman/tuttle/issues/26 . Feel free to add comments or further details there.

By the way, if you experiment with tuttle, please send me feedback :slight_smile:

Alex