Launching goodtables.io: tell us what you think!

One of the main goals of Frictionless Data is to improve the quality of published data and to make it easier to maintain this quality over time. Building on top of the excellent goodtables Python library, we are launching a free service to provide Continuous Data Validation to everybody:

https://goodtables.io

GoodTables.io builds on all the work that has been done in Frictionless Data specifications and tooling to date. It is designed to integrate with different backends and run validation jobs whenever data is updated. For this first Beta version, we are focusing on data hosted on GitHub repositories and Amazon S3 buckets.

There are a lot of rough edges to polish, and you can see which issues are already on the roadmap on the issue tracker. We’d love to hear your early feedback and learn about your use of the service.

To register a new source, simply log in with GitHub and authorize the application. Once on the dashboard page, click on the “Manage Sources” link in the header:

  • For GitHub repos, click the Synchronize button to get a list of your repos, and then activate the relevant one.
  • For Amazon S3 buckets, enter the access key id and secret key, and the name of the bucket (we currently only support buckets located in the Oregon (us-west-2) region).

Validation jobs should start the next time a commit is pushed to the repo or a file is updated on the bucket.

Feel free to add any comments to this thread. Your feedback is greatly appreciated!

This is a really cool idea, and I’d love to set it up on our data repo, but the permissions required are really expansive: basically full control of everything. That isn’t necessary for CI, and we don’t give that kind of permission to Travis or AppVeyor, so I’m wondering why it is necessary here. Ah… looks like there’s an existing issue on this which I’ll go comment on.

This is Awesome :smile: Kudos!!

@ethanwhite you are absolutely right. We’ve toned down the permissions, see here for details.

I have some data and continuous integration happening at Stephen-Gates/GTFS on GitHub (public transport data in GTFS format with schemas, a data package and tests), so I thought I’d point GoodTables.io at it. Unfortunately I got…

Does the file have to be .csv? The error message wasn’t very helpful.

Thanks @amercader, much appreciated.

Is there support for organizations yet? I granted access to my org, but don’t see any repos for it, just for my account.

The error messages are definitely not very helpful in this case. I’m pretty sure files don’t need to have the .csv extension but I’m not sure what went wrong there. @roll any ideas?

We’ll investigate, thanks for the feedback!

@ethanwhite repos from the orgs you belong to should definitely come up so something is not right. We’ll investigate. Can you try and resync anyway to make sure it wasn’t a glitch?

OK. I tried resyncing and that didn’t help. I opened an issue so we can chat more over there:

Yes, the change in permissions broke this, so it’s definitely a bug, and one that we need to fix ASAP.

This looks like it could be useful for the project I’m putting time into (dev server here), once we start work on our API.

Is there a sample report or demo site or similar?

I’m not seeing anything along those lines on the goodtables.io website atm, though it’s possible I’ve missed something obvious. :smiley:

@Stephen

Does the file have to be .csv?

Yes. For now we support only Tabular Data Packages. specs-v1 will introduce the concept of Tabular Resources, so support for general Data Packages will be added pretty soon.

Hi @justinclift, that’s a good point, we should put an example report on the home page. Here’s an example of a failing job:

http://goodtables.io/github/amercader/car-fuel-and-emissions/jobs/2

Just to be clear, these are CSVs but with a .txt extension, AFAICT

@amercader
That’s correct, so that’s the cause of the format-error. Goodtables expects a tabular format.

Yep, that would help. As a thought from a complete newbie (sorry) to this area of things, is goodtables.io expected to just validate + report on the data (acceptable to me), or is it expecting to have write access and actively muck around with data directly?

From my point of view (a bit of a control freak about things changing my data :grin:), I’m absolutely happy to run validation/checking processes on data that report problems for manual follow-up/correction. Directly changing things itself without oversight, though, is a complete no go. At least, not without some kind of (extensive?) trial period to ensure there are no edge-case bugs which incorrectly change the data.

Asking that because the goodtables.io sign-in for GitHub requests change-level access to my repos. And that’s an absolute “not going to happen”.

I’m not aware of any other CI system that does this, though I’ve only used Jenkins and Travis CI in any depth so that could just be me. :wink:

@justinclift Right now the service should only ask for the following permissions:

  • repo:status: We need this to be able to write the commit statuses (success or failure)

  • admin:repo_hook: We need this to be able to create and remove the necessary webhooks to ping the service whenever there is a commit pushed to the repository.

These don’t allow the service write access to the actual contents of the repo, and if we ever need to do that it will require a separate authorization process by the user.
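
As a rough illustration of what requesting only these two scopes looks like, here is a minimal sketch that builds a GitHub OAuth authorization URL. It is not taken from the goodtables.io code base: the `build_authorize_url` helper and the `EXAMPLE_CLIENT_ID` value are hypothetical, and only the scope names come from the list above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# The two scopes listed above; neither one covers the contents of a repo.
SCOPES = ["repo:status", "admin:repo_hook"]


def build_authorize_url(client_id, scopes=SCOPES):
    """Return a GitHub OAuth authorization URL requesting only `scopes`.

    Hypothetical helper for illustration. GitHub expects the scopes as a
    single space-delimited `scope` query parameter.
    """
    query = urlencode({"client_id": client_id, "scope": " ".join(scopes)})
    return "https://github.com/login/oauth/authorize?" + query


# EXAMPLE_CLIENT_ID is a placeholder, not a real application id.
url = build_authorize_url("EXAMPLE_CLIENT_ID")

# Parse the URL back to confirm exactly which scopes would be requested.
requested = parse_qs(urlparse(url).query)["scope"][0].split()
print(requested)
```

Anyone reviewing the authorization screen can compare the scopes shown there against this short list before approving.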

Does this make sense?

@amercader Spotted a minor bug in the sample report you pointed to.

In the hmmm… “commit heading” up top, there is this text:

    Pushed by Adrià Mercader on branch refs/heads/master (7c765e)

There’s a link on the “Adrià Mercader” text going to https://github.com/Adrià%20Mercader instead of the amercader (Adrià Mercader) GitHub profile. At a guess, the template for the commit header message is just inserting the wrong variable (user full name instead of GitHub account name). Should be pretty easy to fix. :slight_smile:
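
To make the guess concrete, here is a small sketch of how that kind of bug can arise. It is only an assumption about the cause, not the actual goodtables.io template code: the `profile_url` helper and the `name` / `login` field names are hypothetical.

```python
from urllib.parse import quote


def profile_url(account):
    """Build a GitHub profile URL from an account identifier (hypothetical helper)."""
    return "https://github.com/" + quote(account)


# Hypothetical commit author record; the field names are assumptions.
commit_author = {"name": "Adrià Mercader", "login": "amercader"}

# Suspected bug: the display name is used where the account name belongs.
# Browsers show the percent-encoded result as .../Adrià%20Mercader.
broken = profile_url(commit_author["name"])

# Likely fix: build the link from the GitHub account name instead.
fixed = profile_url(commit_author["login"])

print(broken)
print(fixed)
```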

Thanks, it’s probably just my knee jerk reaction of “Aaagh, it’s asking for more than just read-only!”. I’ll investigate the meanings of those permissions later (prob tomorrow), and think it over more then. :slight_smile:

Good spot! Thanks, that should be an easy fix