Working with community data on GitHub: New Project DataTig

In the spirit of “If you’re not ashamed of your first release, you released too late”, who wants to see my new project? :slight_smile: https://www.datatig.com/

Do you store community data in a GitHub repository - maybe CSV, YAML or Markdown files? We’ll take that and automatically build a nice website for you!

For example, the http://dataportals.org/ data is in a spreadsheet, portals.csv, on the master branch of the okfn/dataportals.org repository on GitHub.

It can now also be browsed at https://www.datatig.com/gh/okfn/dataportals.org/b/master/

This reveals several things:

Hope that’s useful for people!

I’m now looking for other GitHub repositories with community data to test it on to make sure the results are as useful as they can be - let me know if you have anything like that.

Thanks,
James

(This builds on my previous work with community data sets - things like https://opentechcalendar.co.uk/ - so this is a topic I’m keen to discuss with people!)

This reminds me that @rufuspollock suggested that it would be good to merge scotland-data-catalogues.csv (on the master branch of the okfnscot/open-data-scotland repository on GitHub) into the dataportals list and then filter out the Scotland-specific sites to get something like Scotland's Open Data Catalogues. Is this a good idea? Does anyone fancy giving a hand in setting it up?
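A rough sketch of what "merge, then filter" could look like in Python, using only the standard library. The column names (`name`, `url`, `country`) and the sample rows are invented for illustration and are not the real layout of either CSV file; matching on a normalised URL is also just one possible way to spot duplicates.

```python
import csv
import io


def normalise_url(url):
    """Crude normalisation so trailing slashes and case do not hide duplicates."""
    return url.strip().rstrip("/").lower()


def merge_portals(main_rows, extra_rows, key="url"):
    """Merge two lists of row dicts, keeping the first row seen for each URL."""
    merged = {}
    for row in list(main_rows) + list(extra_rows):
        merged.setdefault(normalise_url(row[key]), row)
    return list(merged.values())


def filter_by(rows, column, value):
    """Keep only the rows whose column matches the value (case-insensitive)."""
    return [r for r in rows if r.get(column, "").strip().lower() == value.lower()]


# Hypothetical sample data standing in for the two real CSV files.
MAIN_CSV = """name,url,country
Example Portal A,https://data.example.org/,Brazil
Example Portal B,https://data.example.scot/b,Scotland
"""
EXTRA_CSV = """name,url,country
Example Portal B,https://data.example.scot/b/,Scotland
Example Portal C,https://data.example.scot/c,Scotland
"""

main_rows = list(csv.DictReader(io.StringIO(MAIN_CSV)))
extra_rows = list(csv.DictReader(io.StringIO(EXTRA_CSV)))

merged = merge_portals(main_rows, extra_rows)
scotland = filter_by(merged, "country", "Scotland")
print([row["name"] for row in scotland])
```

In practice the harder part is probably agreeing on shared columns between the two lists, which this sketch simply assumes.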

@ewan_klein it would be great to merge the Scottish list into the data portals list.

Could you open an issue in the okfn/dataportals.org issue tracker on GitHub?

Would you or colleagues be up for helping with the merge if someone was helping from the data portals end?

@jarofgreen great to hear about this project, and there are a lot of connections with the new DataHub work: https://datahub.io/ (which I’m heavily involved in …).

In particular, we already provide showcases of data and have plans to provide APIs in future as part of DataHub.

If you are interested, there’s a chat channel for DataHub where we could talk more: datahubio/chat on Gitter.

@rufuspollock I did see DataHub in my research. The reason I carried on anyway is that I saw them as serving different things - Open Data and “Community Data” being subtly different in how they are created, edited and to what purpose they are put. I will hang out in the Gitter channel when I have time - it would be great to chat!

I’m heavily over-committed at the moment, so would need to identify a volunteer or find money at this end. That said, it would be great to have someone involved on the data portals side in support.

If I was to go to someone in Scottish Government, say, to ask for help, how would I sell the added value of this integration? I can see a benefit if it made it easier for other people to contribute data via the data portals route about Scotland-relevant open data. Is this realistic?

@ewan_klein @rufuspollock I have been assisting with the Data Portals project. I also find myself over-committed at the moment, but have every intention to make time to get caught up with existing issues this week.

I know it’s been frustrating for those who have already made submissions.

Hey, just to say DataTig is still continuing and is still focussing on situations where the community is being asked to crowd source data in a git repository.

I think there is still an interesting use case here that I keep seeing in the real world and while I’ve seen many great data tools go past, I know of very few others in this particular niche.

The links in my original post don’t work any more, but the Python tool is now available as DataTig on PyPI and the docs are on the DataTig documentation site.

I’ve just written a blog post about it: DataTig helps you crowd source data in a git repository

ps. The other tool I’ve seen that caters to this is JKAN - https://jkan.io/ - “A lightweight, backend-free open data portal, powered by Jekyll” - very interesting!

Hi @jarofgreen!

This seems to be an interesting project. It’s a pity I’ve only now seen this post. DataTig does sound interesting from the description, but I wonder how useful it could be to Dataportals.org in its current state.

I have been considering switching to a static site generator for some time, but we would need to be able to generate a map with clusters as we have today.

In the time since you originally posted, we have also been able to add Data Package and Table Schema descriptors to the data, as well as data validation with Frictionless Repository. That should at least catch some of the data quality problems present there. Others still remain, of course, like the examples you mentioned: managing tags, verifying link health, etc.
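To give a feel for what descriptor-driven validation means here, the following is a hand-rolled sketch using only the Python standard library. It is not the Frictionless Framework API, and the column names (`name`, `url`, `tags`) and sample rows are invented for illustration; the descriptor only borrows the general shape of a Table Schema (`fields`, `type`, `format`, `constraints.required`).

```python
import csv
import io
from urllib.parse import urlparse

# A Table Schema style descriptor. The field names are hypothetical.
SCHEMA = {
    "fields": [
        {"name": "name", "type": "string", "constraints": {"required": True}},
        {"name": "url", "type": "string", "format": "uri",
         "constraints": {"required": True}},
        {"name": "tags", "type": "string"},
    ]
}


def check_rows(rows, schema):
    """Return a list of (row_number, field_name, message) problems."""
    problems = []
    # Row 1 of the file is the header, so data rows start at 2.
    for number, row in enumerate(rows, start=2):
        for field in schema["fields"]:
            value = (row.get(field["name"]) or "").strip()
            required = field.get("constraints", {}).get("required", False)
            if not value:
                if required:
                    problems.append((number, field["name"], "missing required value"))
                continue
            if field.get("format") == "uri":
                parsed = urlparse(value)
                if parsed.scheme not in ("http", "https") or not parsed.netloc:
                    problems.append((number, field["name"], "not an http(s) URL"))
    return problems


# Invented sample data with two deliberate problems.
SAMPLE = """name,url,tags
Example Portal,https://data.example.org/,regional
,https://data.example.net/,
Broken Portal,data.example.com,national
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = check_rows(rows, SCHEMA)
for problem in problems:
    print(problem)
```

A check like this only looks at the shape of each value. Link health, as mentioned above, would still need an actual HTTP request per URL.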

I can also see that there is a lot of overlap between DataTig and Frictionless Data. The static site generator part is handled in Frictionless by LiveMark. The data validation part is already very advanced in the Frictionless Framework. I think it would make sense to consider using some of that or collaborating in some way. Frictionless Data also has a dedicated category in this very forum that could be used to discuss this.

As for ingesting the data from Scotland into Dataportals.org, I’m up for helping with that, too, @ewan_klein. I think the first step would be opening an issue as suggested by @rufuspollock above (I just checked and could find no issue about Scotland there yet). Could you do that?

Note that some years ago I compiled a similar list of official Brazilian open government data catalogues, and those have already been added to Dataportals.org.

Hello and thanks for the reply!

“DataTig does sound interesting from the description, but I wonder how useful it could be to Dataportals.org in its current state.”

In its current state? No. DataTig works with repositories where each record is a JSON, YAML or MD file. It doesn’t work with CSVs as source data. That may change if I see a lot of need, but right now I don’t see it as a priority because I see much less use of CSV in real cases.

This also aligns with my personal view: I fundamentally don’t think CSVs are a good model for crowd sourcing data. Edits are more complicated and are much more prone to generate git conflicts. The git history is harder to parse.
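To make the difference concrete, here is a rough sketch of turning a single CSV into one JSON file per record, using only the Python standard library. It is not part of DataTig and does not use its configuration format; the column names, the `portals` directory and the slug-based file names are assumptions made up for this example.

```python
import csv
import io
import json
import re
import tempfile
from pathlib import Path


def slugify(text):
    """Turn a record name into a simple file-name-safe id."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def split_csv_into_records(csv_text, out_dir, id_column="name"):
    """Write one JSON file per CSV row and return the paths written.

    Note: two rows with the same slug would overwrite each other;
    a real migration would need to handle that.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = out_dir / f"{slugify(row[id_column])}.json"
        path.write_text(
            json.dumps(row, indent=2, sort_keys=True) + "\n", encoding="utf-8"
        )
        written.append(path)
    return written


# Invented sample data.
SAMPLE = """name,url
Example Portal,https://data.example.org/
Another Portal,https://data.example.net/
"""

with tempfile.TemporaryDirectory() as tmp:
    paths = split_csv_into_records(SAMPLE, Path(tmp) / "portals")
    names = sorted(p.name for p in paths)
    first = json.loads(
        (Path(tmp) / "portals" / "example-portal.json").read_text(encoding="utf-8")
    )

print(names)
```

With this layout, two contributors editing different records touch different files, so their changes are less likely to conflict than two edits to the same CSV.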

“I can also see that there is a lot of overlap between DataTig and Frictionless Data.”

There do seem to be a lot of interesting things to look at. Adding exports of the data as Frictionless is probably the first thing I’ll do.

However I do think they serve different but important use cases, and I plan to keep on working on the use case of crowd sourcing data and focussing on particular patterns I’ve seen in real world cases.

(I’ve actually used the frictionless package from PyPI before to test data and generate database schemas, and am familiar with Table Schema.)

Maybe we could have a chat sometime? It would be interesting to think about future plans for Dataportals.org too.

ps.

“I have been considering switching to a static site generator for some time, but we would need to be able to generate a map with clusters as we have today.”

I just looked at the current site and the map clustering is done client-side in JavaScript, so could it be switched to a static site generator?

“As for ingesting the data from Scotland into Dataportals.org, I’m up for helping with that, too,”

I’m not sure if the data Ewan links to counts as up to date these days. The people behind https://opendata.scot/ may be interesting to talk to?

Would it be possible to have something similar to this, but with a simple CSV file on GitHub as the database?
So are there database services for science and other open … projects?

I know that there are plenty of DaaS and no-code/low-code services in the commercial field. But I wonder whether there is something for the open data and open science field…?

Up to now I only see GitHub CSV + GitHub Pages and Wikidata as options, but are there more?

And mentioning Wikidata, there surely is also a need for a low-code app/interface building tool that would make it possible to create nice-looking frontend websites for end users instead of using the quite complex query service.

I’m very interested to hear what exists like this currently or maybe in the near future…

(Sorry if this thread seems wrong for this - I’m new and couldn’t find a better place…)

Hi @Johannes, welcome to the forum!

Did you check out Frictionless Livemark, which I linked to in my post above? It allows you to generate sites using CSV files as data sources, so it could be one such tool you’re looking for. Also, it’s being actively maintained.

Thanks, I dug into that.

Regarding your Dataportals website, I think it would be really great to have some “Meta-Tool” for building such portal sites, while offering the flexibility to adapt them to a specific project’s needs (e.g. an institution that wants to build a catalog for only their data; maybe also adding a category like “Apps/Code” based on the data, similar to Kaggle).

It would be great if all these sites use the same data model and have some federation/data sharing between them (e.g. a big portal combines all the data from the sub portals, while preserving a high amount of flexibility for the sub portals).

Doing it the static-site way, with the data (JSON/YAML/etc.) in git (open source of course), plus an API for authentication and (preferably directly in-site) editing and adding of content, is a really great solution, I think. It makes things as barrier-free as possible and allows a high number of projects to use that kind of software. That’s why I would prefer such a design over huge software suites such as CKAN with Geo-Addons.

So maybe some further developed JKAN (with added map feature) or a similar solution as a new project would be a good solution?

I am really interested and waiting for news from the Frictionless project (since they announced that some CMS GUI for Livemark will come) and from Tina CMS (they also announced progress on it this year) - the latter is not such a data-driven product, of course.

I guess some collaboration and synergies on this would be great!

What are your current plans/ideas (for your Dataportals site)?

Those are interesting ideas.

In fact, I have been considering for some time the possibility of revamping both DataPortals.org and PublicBodies.org with static site generators that are updated automatically with GitHub Actions every time the underlying CSV is changed.

I have done something like this in this data visualization, which uses Python, Pandas and Folium to generate the Leaflet.js map visualizations, which are then embedded into a site generated with Markdown and Jekyll. The data is then fetched to build the site again once a day, scheduled with GitHub Actions.

Maybe something similar for these projects could work, and even save on the hosting costs that these sites incur.
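As a very rough illustration of the "rebuild the page from the CSV" step, here is a standard-library-only sketch. It deliberately swaps Pandas, Folium and Jekyll for plain `csv` and string building, so it is not the pipeline described above; the column names and sample rows are invented. In a real setup a script like this would be run by a scheduled or push-triggered CI job and its output written to a file.

```python
import csv
import html
import io


def render_page(rows, title):
    """Build a minimal static HTML page listing each row as a link."""
    items = []
    for row in rows:
        url = html.escape(row["url"], quote=True)
        name = html.escape(row["name"])
        items.append(f'    <li><a href="{url}">{name}</a></li>')
    safe_title = html.escape(title)
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        f'<head><meta charset="utf-8"><title>{safe_title}</title></head>',
        "<body>",
        f"  <h1>{safe_title}</h1>",
        "  <ul>",
        *items,
        "  </ul>",
        "</body>",
        "</html>",
    ]
    return "\n".join(lines) + "\n"


# Invented sample data; the second row checks that escaping works.
SAMPLE = """name,url
Example Portal,https://data.example.org/
Portal <B>,https://data.example.net/?a=1&b=2
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
page = render_page(rows, "Portals")
print(page)
```

Escaping every value with `html.escape` matters here because the CSV is community-edited, so its contents should not be trusted as markup.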
