Launching goodtables.io: tell us what you think!

amercader · May 1, 2017, 11:04pm

One of the main goals of Frictionless Data is to improve the quality of published data and to make it easier to maintain this quality over time. Building on top of the excellent goodtables Python library, we are launching a free service to provide Continuous Data Validation to everybody:

https://goodtables.io

GoodTables.io builds on all the work that has been done in Frictionless Data specifications and tooling to date. It is designed to integrate with different backends and run validation jobs whenever data is updated. For this first Beta version, we are focusing on data hosted on GitHub repositories and Amazon S3 buckets.

There are a lot of rough edges to polish, and you can see which issues are already on the roadmap on the issue tracker. We’d love to hear your early feedback and learn about your use of the service.

To register a new source, simply log in with GitHub and authorize the application. Once on the dashboard page, click on the “Manage Sources” link in the header:

For GitHub repos, click the Synchronize button to get a list of your repos, and then activate the relevant one.
For Amazon S3 buckets, enter the access key id and secret key, and the name of the bucket (we currently only support buckets located in the Oregon (us-west-2) region).

Validation jobs should start the next time a commit is pushed to the repo or a file is updated on the bucket.

Feel free to add any comments to this thread. Your feedback is greatly appreciated!

ethanwhite · May 2, 2017, 12:36am

This is a really cool idea, and I’d love to set it up on our data repo, but the permissions required are really expansive - basically full control of everything. That isn’t necessary for CI, and we don’t provide those kinds of permissions to Travis or AppVeyor, so I’m wondering why they are necessary here. Ah… looks like there’s an existing issue on this which I’ll go comment on.

nikeshbalami · May 2, 2017, 12:37am

This is awesome. Kudos!!

amercader · May 2, 2017, 5:21am

@ethanwhite you are absolutely right. We’ve toned down the permissions, see here for details.

Stephen · May 2, 2017, 12:29pm

I have some data and continuous integration happening at Stephen-Gates/GTFS on GitHub (public transport data in GTFS format with schemas, a data package and tests), so I thought I’d point GoodTables.io at it. Unfortunately I got…

Does the file have to be .csv? Error message wasn’t very helpful.

ethanwhite · May 2, 2017, 1:22pm

Thanks @amercader, much appreciated.

Is there support for organizations yet? I granted access to my org, but don’t see any repos for it, just for my account.

amercader · May 2, 2017, 4:45pm

The error messages are definitely not very helpful in this case. I’m pretty sure files don’t need to have the .csv extension but I’m not sure what went wrong there. @roll any ideas?

We’ll investigate, thanks for the feedback!

amercader · May 2, 2017, 4:47pm

@ethanwhite repos from the orgs you belong to should definitely come up so something is not right. We’ll investigate. Can you try and resync anyway to make sure it wasn’t a glitch?

ethanwhite · May 2, 2017, 8:08pm

OK. I tried resyncing and that didn’t help. I opened an issue so we can chat more over there:

github.com/frictionlessdata/goodtables.io

Repos from organization account not accessible (bug)

Opened by ethanwhite, 05:30PM - 02 May 17 UTC; closed 06:56AM - 05 May 17 UTC

I can't currently see repos from my organization account. The organization is ht…tps://github.com/weecology/ which I authorized it when granting access to goodtables. Third-party access tab for weecology screenshot: ![screenshot from 2017-05-02 13-26-04](https://cloud.githubusercontent.com/assets/744427/25630453/3b119002-2f3b-11e7-94d6-1c7c9ec3e9d7.png) Filtered results on goodtables.io: ![screenshot from 2017-05-02 13-26-29](https://cloud.githubusercontent.com/assets/744427/25630471/48e284ca-2f3b-11e7-9bb9-f7e3350816e5.png) I have tried resyncing without success.

amercader · May 2, 2017, 11:01pm

Yes, the change in permissions broke this, so it’s definitely a bug, and one that we need to fix ASAP.

justinclift · May 3, 2017, 6:48am

This looks like it could be useful for the project I’m putting time into (dev server here), once we start work on our API.

Is there a sample report or demo site or similar?

I’m not seeing anything along those lines on the goodtables.io website atm, though it’s possible I’ve missed something obvious.

roll · May 3, 2017, 1:42pm

@Stephen

Does the file have to be .csv?

Yes. For now we support only Tabular Data Packages. specs-v1 will introduce the concept of Tabular Resources, so support for general Data Packages will be added pretty soon.

amercader · May 3, 2017, 2:49pm

Hi @justinclift, that’s a good point, we should put an example report on the home page. Here’s an example of a failing job:

http://goodtables.io/github/amercader/car-fuel-and-emissions/jobs/2

amercader · May 3, 2017, 2:50pm

Just to be clear, these are CSVs but with a .txt extension, AFAICT

roll · May 3, 2017, 3:07pm

@amercader
That’s correct. So that’s the reason for the format-error: goodtables expects a tabular format.
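
A minimal sketch of why a CSV file saved with a .txt extension could end up as a format-error, assuming the format is inferred from the file extension alone. `TABULAR_EXTENSIONS` and `infer_format` are hypothetical names used for illustration, not goodtables’ actual detection code, and the file names are made up.

```python
import os

# Hypothetical set of extensions treated as tabular; the real list may differ.
TABULAR_EXTENSIONS = {"csv", "tsv", "xls", "xlsx"}


def infer_format(path):
    """Return the lowercase extension if it looks tabular, else None."""
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    return extension if extension in TABULAR_EXTENSIONS else None


# Same kind of content, different extensions:
print(infer_format("stops.csv"))  # recognised as tabular
print(infer_format("stops.txt"))  # not recognised, even if the content is CSV
```

Under that assumption, renaming the files to .csv would be the simplest workaround.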

justinclift · May 3, 2017, 3:14pm

Yep, that would help. As a thought from a complete newbie (sorry) to this area of things, is goodtables.io expected to just validate + report on the data (acceptable to me), or is it expecting to have write access and actively muck around with data directly?

From my point of view (a bit of a control freak about things changing my data), I’m absolutely happy to run validation/checking processes on data that report problems for manual follow-up/correction. Directly changing things itself though, without oversight, is a complete no go. At least, not without some kind of (extensive?) trial period to ensure there are no edge-case bugs which incorrectly change the data.

Asking that because the goodtables.io sign in for GitHub requests change level access to my repos. And that’s an absolute “not going to happen”.

I’m not aware of any other CI system that does this, though I’ve only used Jenkins and Travis CI in any depth so that could just be me.

amercader · May 3, 2017, 4:35pm

@justinclift Right now the service should only ask for the following permissions:

repo:status: We need this to be able to write the commit statuses (success or failure)
admin:repo_hook: We need this to be able to create and remove the necessary webhooks to ping the service whenever there is a commit pushed to the repository.

These don’t allow the service write access to the actual contents of the repo, and if we ever need to do that it will require a separate authorization process by the user.
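
To make the repo:status scope more concrete, here is a rough sketch of the kind of request it permits: creating a commit status through GitHub’s `POST /repos/{owner}/{repo}/statuses/{sha}` endpoint. This is not the goodtables.io implementation; the function name, the context label and the SHA below are made up for illustration, and nothing is actually sent.

```python
import json

GITHUB_API = "https://api.github.com"

# States accepted by GitHub's commit status endpoint.
VALID_STATES = {"error", "failure", "pending", "success"}


def build_commit_status(owner, repo, sha, state, target_url, description):
    """Return the (url, JSON body) pair for a commit status request.

    Nothing is sent here; a real client would POST the body to the URL
    with an OAuth token that carries the repo:status scope.
    """
    if state not in VALID_STATES:
        raise ValueError("unknown state: %s" % state)
    url = "%s/repos/%s/%s/statuses/%s" % (GITHUB_API, owner, repo, sha)
    body = json.dumps({
        "state": state,
        "target_url": target_url,
        "description": description,
        "context": "goodtables.io/validation",  # hypothetical context label
    }, sort_keys=True)
    return url, body


# Placeholder SHA; the real API expects the full commit SHA.
url, body = build_commit_status(
    "amercader", "car-fuel-and-emissions", "0123abc", "failure",
    "http://goodtables.io/github/amercader/car-fuel-and-emissions/jobs/2",
    "Data validation failed",
)
print(url)
print(body)
```

The request only attaches a success or failure marker to a commit; it does not touch the files in the repository.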

Does this make sense?

justinclift · May 3, 2017, 4:36pm

@amercader Spotted a minor bug in the sample report you pointed to.

In the hmmm… “commit heading” up top, there is this text:

Pushed by Adrià Mercader on branch refs/heads/master (7c765e)

There’s a link on the “Adrià Mercader” text going to https://github.com/Adrià%20Mercader instead of https://github.com/amercader. At a guess the template for the commit header message is just inserting the wrong variable (user full name instead of GitHub account name). Should be pretty easy to fix.
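
A small sketch of the suspected mix-up, assuming the profile link is built by appending a single value to https://github.com/. The `profile_url` helper is hypothetical and is not taken from the goodtables.io templates.

```python
from urllib.parse import quote


def profile_url(identifier):
    """Build a GitHub profile link from whatever identifier is passed in."""
    return "https://github.com/" + quote(identifier)


full_name = "Adrià Mercader"  # display name shown in the commit header
login = "amercader"           # GitHub account name

print(profile_url(full_name))  # percent-encoded full name: not a real profile
print(profile_url(login))      # https://github.com/amercader
```

Passing the account name instead of the full name is enough to produce the correct link.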

justinclift · May 3, 2017, 4:39pm

Thanks, it’s probably just my knee jerk reaction of “Aaagh, it’s asking for more than just read-only!”. I’ll investigate the meanings of those permissions later (prob tomorrow), and think it over more then.

amercader · May 3, 2017, 10:32pm

Good spot! Thanks, that should be an easy fix
