I’d like to present a tool useful for preparing datapackages: tuttle, a make for data.
When we write scripts to create data, we rarely get it right the first time. How many times have you had to comment out the beginning of a script so that execution jumps directly to the part you are fixing?
With tuttle, you won’t have to. First, it computes only what is necessary: for example, if a file has already been downloaded, it won’t be downloaded again. Moreover, when you change a line of code, tuttle knows exactly which data must be invalidated and which part of the code must be run again.
This brings fluidity and repeatability when you work on your own. But it is also very useful for teamwork: when you merge code, pull scripts modified by someone else, or use a continuous integration system, you don’t need to wonder which data is no longer valid: tuttle works it out for you.
Moreover, you can follow the progress of the computation in a report like this one: http://stuff.lexman.org/s-and-p-500/scripts/.tuttle/report.html
You can use any language or arbitrary tool (like an xls-to-csv converter or a git command), but tuttle already has built-in support for shell, batch, Python, and SQL for SQLite.
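To give a feel for what this looks like, here is a rough sketch of a tuttlefile with two chained rules. The file names and commands are invented for illustration, and I am writing the syntax from memory, so please check the tutorial below for the exact details. The idea is that each rule declares its outputs on the left of `<-`, its inputs on the right, optionally a processor after `!`, and then the indented code that produces the outputs:

```
# Rule 1: convert a spreadsheet to csv (default shell processor).
# "in2csv" is just an example command, not part of tuttle.
file://prices.csv <- file://prices.xls
    in2csv prices.xls > prices.csv

# Rule 2: post-process the csv with the python processor.
file://prices_clean.csv <- file://prices.csv ! python
    import csv
    with open('prices.csv') as src, open('prices_clean.csv', 'w') as dst:
        for row in csv.reader(src):
            csv.writer(dst).writerow(row)
```

If `prices.xls` hasn’t changed, neither rule runs again; if you edit only the second rule, only `prices_clean.csv` is invalidated and recomputed.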
If you’re interested in this fluent way of working on data, tuttle’s tutorial explains in detail how to use it: https://github.com/lexman/tuttle/blob/master/doc/tutorial_musketeers/tutorial.md .
You can also have a look at how I translated one of the core packages (s-and-p-500) with a tuttlefile: https://github.com/lexman/s-and-p-500/blob/tuttle/scripts/tuttlefile . It runs every hour on my server, so whenever the xls file changes, tuttle handles the whole datapackage update and even pushes the data back to GitHub.
Tuttle is a tool for collaborating on data the way we collaborate on code, so it might be of interest to the Open Knowledge Foundation.
Hope you will enjoy it,