I think the question for bulk data in the census needs to change. It is not always possible to publish open data in bulk. As pointed out in the Open Data Handbook, publishing bulk data is not practical for real-time or big data.
Can I suggest that the current question is reworded from:
Is the data available in bulk? - Data is available in bulk if the whole dataset can be downloaded easily. It is not available in bulk, if access to the data is through a web page that provides access to only part of the database.
to something like:
Is the data available in bulk or via a real-time feed? - Data is available in bulk if the whole dataset can be downloaded easily. It is not available in bulk, if access to the data is through a web page that provides access to only part of the database. A real-time feed provides access to a subset of a database that changes frequently and is too large to download in bulk.
As an example, in my view, a real-time public transport feed in GTFS-RT format should not be penalised 10 points for not being available in bulk.
What do you think: should the question be changed?
Even with "real-time" data or other high-frequency data (e.g. weather data) you can provide the data in bulk, though it may not be as timely as an API.
The bulk is essential: I may be able to get real-time Twitter feeds by an API but I can't get the data in bulk, and that's what I'd need to have real freedom with that data.
Having written this, I think I would still be inclined to argue that people have to provide data in bulk and that this is an essential part of the Open Definition.
PS: we should delete the item about impracticality in the Open Data Handbook.
While I agree that GTFS-RT is not what you would consider "bulk" data, it is dependent on GTFS data, which would be considered bulk (though I doubt anyone would release the RT data without the base data).
To be clear: it would always be great to have the API there, and I'm not suggesting that bulk is preferable for most users. However, the bulk stuff is essential for open data: the basic logic is that with bulk data anyone can build an API (perhaps not perfectly real-time, but close), whilst with an API I'm beholden to the provider.
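To illustrate the "with bulk data anyone can build an API" point, here is a minimal sketch. The CSV columns, the sample rows, and the function names `build_index` and `query_stop` are all hypothetical, made up for illustration; they are not taken from GTFS or from any real publisher's dump.

```python
import csv
import io
from collections import defaultdict

# Hypothetical bulk dump: one row per recorded departure.
# Column names and values are invented for this example.
BULK_CSV = """stop_id,route,delay_seconds
S1,10,0
S1,10,120
S2,22,60
"""


def build_index(bulk_text):
    """Read the whole dump once and index every row by stop_id."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(bulk_text)):
        index[row["stop_id"]].append(
            {"route": row["route"], "delay_seconds": int(row["delay_seconds"])}
        )
    return index


def query_stop(index, stop_id):
    """The 'API': return all recorded departures for one stop."""
    return index.get(stop_id, [])


index = build_index(BULK_CSV)
print(query_stop(index, "S1"))
```

The reverse does not hold in general: a per-stop lookup like `query_stop` only ever exposes the part of the database the provider chooses to answer for, which is the dependency described above.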
Overall, for this type of data I would definitely be including the links to the APIs in the comments on a particular dataset and perhaps even having it as the primary link (and then listing the bulk sources in the comments).
there may be valid reasons to ignore this requirement but the full implications must be understood and carefully weighed before choosing a different course.
So, in my mind, providing bulk data isn't mandatory but it should be provided unless there is a valid reason not to.
Perhaps the Open Data Handbook definition of bulk, which, in part, states:
The provision of bulk access is a requirement of open data.
To the extent I've contributed, "should" is being used RFC style in the Open Definition, as is "must"; the distinction is why both are used. We have an issue asking us to state that explicitly, which I'll prioritize addressing now.
I'm not all that committed to a "should" regarding bulk, but I'll note that the OD is binary, "should" being the only way it can express nuance, though the work or license can still be evaluated as "open" even if it does not meet all shoulds. I hadn't looked at the Open Data Index before, but I see that it has a scored methodology. I don't see at a glance whether there is a threshold score for which a dataset gets to be counted as open (as 106 of 970 do according to the home page). For example, http://index.okfn.org/place/colombia/legislation/ is said to be "75% open", falling short on machine readability and bulk access. Is that dataset one of the 106?
BTW, I find it somewhat odd that accessibility to the public and for free are considered "legal" aspects of openness by the index.
OK. I'd say machine readable and bulk are supposed to be MUSTs (certainly machine readable is, and I'd argue for bulk too).
It is useful to distinguish between bulk provision, which is an access item, and the conditions for a license to be open. Bulk won't usually be in the license in any way.
So back to the first question in this topic... Do we want a real-time public transport feed in GTFS-RT format without the provision of bulk data classified as Open Data based on the Open Definition?
I don't mind the answer, but in my view the Open Data Definition, Index and Handbook must align in their definitions and how they evaluate if something is open so that Open Knowledge presents a consistent position.
Alright. On second thought I'd argue bulk was already a must, as the OD requires access to the work as a whole.
Re oddity of access and price as legal items, I was referring to how they are described in the ODI methodology page. I'll file an issue in that repo ... done.
Very much agree with @Stephen that the various definitions ought to be in alignment.
Looping back to the start of this topic, a GTFS-RT feed without a bulk data download would not be classed as Open Data based on the draft Open Definition 2.1 (but please don't let that stop you publishing because I really appreciate knowing if I've missed my bus or it's just late.)
Hey, only now I noticed I missed this discussion. "Available in its entirety - and able to be downloaded 'in bulk'" did not go into the Open Definition 2.1.
Is the reason for this that then a data stream would not be Open Data, or was there another reason?
A GTFS-RT stream would not be Open Data; to be usable at the full scale Open Data is intended for, traffic organisations should provide a yearly dump of realised traffic. In the Helsinki metropolitan area this has been requested of the transit authority, and they have complied by providing one month per year in bulk, which was feasible with the hardware capacity they had at the time.
Having an Open API is indeed a different thing. There is more emphasis on usability and software transparency, exactly because one cannot move data from an API onto one's own capacity and software. The dynamics are different. For this, we wrote a separate Open API definition in Finnish. http://avoinrajapinta.fi/
The Open Definition 2.1 covers the "bulk issue" in section 1.2 Access, "The work must be provided as a whole...".
So, my interpretation of this is, if you provide openly licensed data only via an API, it is not open data.
(I agree @emmaAkin, this doesn't help convince providers to publish only via an API but hopefully the data is valuable without having an open data tag.)
As @rufuspollock mentions above... [quote="rufuspollock, post:2, topic:294"]
bulk is essential: I may be able to get real-time Twitter feeds by an API but I can't get the data in bulk, and that's what I'd need to have real freedom with that data.
[/quote]
Researchers I've spoken with also place high value on bulk data. They need it placed near their computing resources to perform complex analysis of the data and how it has changed over time.
@jaakkokorhonen it's great that Helsinki provided one month of GTFS-RT data in bulk (and I do empathise with their capacity constraints).
Lastly, thanks for your work on defining an Open API - one that is openly defined, assessable, testable but doesn't necessarily provide open data (e.g. your My Data example). I wonder what we should call an Open API that provides Open Data?
If there are native English speakers interested in translating the Open API Definition, I would be happy to set up a web conf. This might also be a nice session agenda in an event.