Penalty for no bulk data not appropriate for realtime or big data

Stephen · May 19, 2015, 9:02am

I think the question for bulk data in the census needs to change. It is not always possible to publish open data in bulk. As pointed out in the Open Data Handbook, publishing bulk data is not practical for realtime or big data.

Can I suggest that the current question is reworded from:

Is the data available in bulk? - Data is available in bulk if the whole dataset can be downloaded easily. It is not available in bulk, if access to the data is through a web page that provides access to only part of the database.

to something like:

Is the data available in bulk or via a real-time feed? - Data is available in bulk if the whole dataset can be downloaded easily. It is not available in bulk, if access to the data is through a web page that provides access to only part of the database. A real-time feed provides access to a subset of a database that changes frequently and is too large to download in bulk.

As an example, in my view, a real-time public transport feed in GTFS-RT format should not be penalised 10 points for not being available in bulk.

What do you think - Should the question be changed?

rufuspollock · May 19, 2015, 1:31pm

I am in two minds here.

Even with “real-time” data or other high frequency data (e.g. weather data) you can provide the data in bulk - though it may not be as timely as an API.

The bulk is essential: i may be able to get real-time twitter feeds by an API but I can’t get the data in bulk and that’s what I’d need to have real freedom with that data.

Having written this I think I would still incline to arguing that people have to provide data in bulk and that that is an essential part of the open definition.

PS: we should delete the item about impracticality in the open data handbook.

Stephen · May 19, 2015, 2:09pm

Fair enough. People do need the option to get the data in bulk to perform analysis over time. I’ll make a suggestion on the open data handbook.

Woah - it was already changed by the time I got there - impressive

civicdata · May 19, 2015, 2:13pm

While I agree that GTFS-RT is not what you would consider ‘bulk’ data, it is dependent on GTFS data, which would be considered bulk (though I doubt anyone would release the RT data without the base data).

Since the Transit page on the census mentions both, then bulk may be appropriate in this case:
http://us-city.census.okfn.org/dataset/transit

A related question is how an agency that only releases GTFS gets the same score as one that releases RT data as well.

Stephen · May 19, 2015, 2:18pm

On Australia’s Regional Open Data Census we have 2 datasets Timetables and Real-time transit, so you could get 190 points if you publish both.

civicdata · May 19, 2015, 2:21pm

Well that makes sense then to possibly redefine the wording ‘bulk’ for only RT data. Ta.

rufuspollock · May 19, 2015, 3:13pm

To be clear: it would always be great to have the API there - and I’m not suggesting that bulk is preferable for most users. However, the bulk stuff is essential for open data: the basic logic is that with bulk data anyone can build an API (perhaps not perfectly real-time, but close) whilst with an API I’m beholden to the provider.

Overall, for this type of data I would definitely be including the links to the APIs in the comments on a particular dataset and perhaps even having it as the primary link (and then listing the bulk sources in the comments).

Stephen · May 20, 2015, 3:09am

One other thought on “bulk data being essential”: we must choose our words carefully. The Open Definition states,

“data should be machine-readable, available in bulk, and provided in an open format”.

“Should”, from RFC 2119, means,

there may be valid reasons to ignore this requirement but the full implications must be understood and carefully weighed before choosing a different course.

So, in my mind, providing bulk data isn’t mandatory but it should be provided unless there is a valid reason not to.

Perhaps the Open Data Handbook definition of bulk, which, in part, states:

The provision of bulk access is a requirement of open data.

should be softened to match the Open Definition.

rufuspollock · May 20, 2015, 12:06pm

I think “should” was not being used RFC style in the OD - more as MUST. Maybe we should update with “MUST”.

That said, your framing provides some useful optionality. Maybe one for the Open Definition council to consider.

mlinksva · May 21, 2015, 10:52pm

Hi, arrived here from bulk is an access matter, form shouldn't be specific to data, machine… by mlinksva · Pull Request #104 · okfn/opendefinition · GitHub

To the extent I’ve contributed, ‘should’ is being used RFC style in the Open Definition, as is ‘must’; the distinction is why both are used. We have an issue asking us to state that explicitly, which I’ll prioritize addressing now.

I’m not all that committed to a ‘should’ regarding bulk, but I’ll note that the OD is binary, ‘should’ being the only way it can express nuance, though the work or license can still be evaluated as ‘open’ even if it does not meet all shoulds. I hadn’t looked at the Open Data Index before, but I see that it has a scored methodology. I don’t see at a glance whether there is a threshold score for which a dataset gets to be counted as open (as 106 of 970 do according to the home page). For example, http://index.okfn.org/place/colombia/legislation/ is said to be ‘75% open’, falling short on machine readability and bulk access. Is that dataset one of the 106?

BTW, I find it somewhat odd that accessibility to the public and for free are considered ‘legal’ aspects of openness by the index.

rufuspollock · May 22, 2015, 10:32am

OK. I’d say machine readable and bulk are supposed to be MUSTs (certainly machine readable is and I’d argue for bulk too).

It is useful to distinguish bulk provision which is an access item and the conditions for a license to be open. Bulk won’t usually be in the license in any way.

Stephen · May 22, 2015, 11:28am

So back to the first question in this topic… Do we want a real-time public transport feed in GTFS-RT format, provided without bulk data, to be classified as Open Data based on the Open Definition?

I don’t mind the answer but in my view the Open Data Definition, Index and Handbook must align in their definitions and how they evaluate if something is open so that Open Knowledge presents a consistent position.

mlinksva · May 23, 2015, 9:12pm

Alright. On 2nd thought I’d argue bulk was already a must as the OD requires access to the work as a whole.

Re oddity of access and price as legal items, I was referring to how they are described in the ODI methodology page. I’ll file an issue in that repo … done.

Very much agree with @Stephen that the various definitions ought to be in alignment.

Stephen · May 24, 2015, 3:55am

Discussion on fixing the Open Data Index methodology wording has been added as a separate topic in this Forum.

The draft of Open Definition 2.1 has been updated to reflect discussion above.

Looping back to the start of this topic, a GTFS-RT feed without a bulk data download would not be classed as Open Data based on the draft Open Definition 2.1 (but please don’t let that stop you publishing because I really appreciate knowing if I’ve missed my bus or it’s just late.)

emmaAkin · May 17, 2016, 2:31pm

It is interesting to see that we could make that distinction. But this makes it a little difficult to convince more providers to make that data open.

jaakkokorhonen · July 9, 2016, 1:05pm

Hey, only now I noticed I missed this discussion. “Available in its entirety — and able to be downloaded “in bulk”” did not go into the Open Definition 2.1.

Is the reason for this that then a data stream would not be Open Data, or was there another reason?

jaakkokorhonen · July 9, 2016, 2:08pm

A GTFS-RT stream would not be Open Data. For the data to be used to the full extent that Open Data is intended for, traffic organisations should provide a yearly dump of realised traffic. In the Helsinki metropolitan area this has been requested of the transit authority, and they have complied by providing a month in a year in bulk, which was what was feasible with the hardware capacity they had at that time.

Having an Open API is indeed a different thing. There is more emphasis on usability and software transparency, precisely because one cannot move data from an API onto one’s own capacity and software. The dynamics are different. For this, we wrote a separate Open API definition in Finnish. http://avoinrajapinta.fi/

Stephen · July 9, 2016, 8:53pm

The Open Definition 2.1 covers the “bulk issue” in section 1.2 Access, “The work must be provided as a whole…”.

So, my interpretation of this is, if you provide openly licensed data only via an API, it is not open data.

(I agree @emmaAkin, this doesn’t help convince providers to publish only via an API but hopefully the data is valuable without having an open data tag.)

As @rufuspollock mentions above… [quote=“rufuspollock, post:2, topic:294”]
bulk is essential: i may be able to get real-time twitter feeds by an API but I can’t get the data in bulk and that’s what I’d need to have real freedom with that data.
[/quote]

Researchers I’ve spoken with also place high value on bulk data. They need it placed near their computing resources to perform complex analysis of the data and how it has changed over time.

@jaakkokorhonen it’s great that Helsinki provided one month of GTFS-RT data in bulk (and I do empathise with their capacity constraints).

Lastly, thanks for your work on defining an Open API - one that is openly defined, assessable, testable but doesn’t necessarily provide open data (e.g. your My Data example). I wonder what we should call an Open API that provides Open Data?

jaakkokorhonen · July 12, 2016, 10:16am

Thank you @Stephen for clarifying this.

If there are native English speakers interested in translating the Open API Definition, I would be happy to set up a web conf. This might also be a nice session agenda item at an event.