Geo Data Package


#1

We are pleased to share that the Frictionless Data project has now commissioned research work on the Geo Data Package. This will take place over the next few weeks.

As part of this research work,

  • We would like to understand what approaches and standards, if any, you currently employ in your work with geo data, and where the gaps are.
  • We would also like to hear of potential use cases from users that would find a Geo Data Package solution useful in their work.

We invite you to share your input either as comments under this discussion, or via our Specs Issue Tracker.

Edit: The issue tracker link has now been updated. Please use https://github.com/frictionlessdata/specs/issues


#2

Hi everyone,
I’m the main person working on this - thought I’d introduce myself. I’ve been pretty heavily involved in both open data and maps for the last few years. Particularly relevant things I’ve worked on include nationalmap.gov.au, standards.opencouncildata.org, and CSV-geo-AU, plus Fiscal Data Package.

I’m particularly interested to hear what other people’s experiences using spatial data have been, especially in other countries (I’m in Australia).

The main spatial “standards” I’ve come across are:

  • Shapefile. Widely supported, but crap: 10-character field name limits, multiple files required, potential for any projection (and not necessarily documented), opaque binary format.
  • KML. Pretty widely supported but a really messy format for spatial data, because it incorporates styling, icons, etc.
  • GeoJSON. Support isn’t as good as you’d expect, but it’s a pretty nice standard, and the recent-ish update that removes non-EPSG:4326 projections improves reusability. But it’s still fairly common to see non-compliant GeoJSON, and messes such as points encoded as multipoints.
  • CSV: Super convenient for non-spatial users, but very reliant on conventions to make it work. There’s a ton of data that could be expressed as tabular data joined to geometries defined elsewhere (like administrative boundaries), but few standards for doing that (csv-geo-au being one).
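
As a rough illustration of the last bullet, here is a minimal sketch of joining tabular rows to geometries defined elsewhere. The `region_code` column name, the `BOUNDARIES` lookup and the `LGA001` code are all hypothetical and are not taken from csv-geo-au or any other standard.

```python
import csv
import io

# Hypothetical boundary lookup: region code -> GeoJSON geometry.
# In practice this would be loaded from a separately published boundary file.
BOUNDARIES = {
    "LGA001": {
        "type": "Polygon",
        "coordinates": [[[144.9, -37.9], [145.0, -37.9], [145.0, -37.8],
                         [144.9, -37.8], [144.9, -37.9]]],
    },
}


def join_rows_to_boundaries(csv_text, boundaries, code_column="region_code"):
    """Build GeoJSON features by looking up each row's region code.

    Returns the FeatureCollection plus the list of codes with no boundary.
    """
    features = []
    unmatched = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = row[code_column]
        geometry = boundaries.get(code)
        if geometry is None:
            unmatched.append(code)
            continue
        properties = {k: v for k, v in row.items() if k != code_column}
        features.append(
            {"type": "Feature", "geometry": geometry, "properties": properties}
        )
    return {"type": "FeatureCollection", "features": features}, unmatched
```

The join itself is trivial; the hard part, and the part that needs a convention, is agreeing on which column holds the code and which boundary set the codes refer to.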

And of course the usual issues around metadata: what do the fields mean, how to interpret them, etc.

I’m less familiar with how Data Package works in practice, and what workflows it enables, so I’d really love to hear (as Serah said): pain points around existing spatial standards (or lack thereof), and how a spatial data package can help.


#3

Some resources / discussions that may help contributors to this work:


#4

Wow, there is a lot of backstory here :slight_smile: Thanks so much for providing the links.

A couple of issues I’d like to discuss (i.e., where my opinion maybe differs from that expressed in the Geo “profile” issue):

  1. Should alternative SRSes/CRSes/projections be supported? My feeling is no. If my assumption is right that a purpose of Data Package is to ease the flow of information from experts to non-experts, then we just don’t need projections. We can force the data providers to convert to EPSG:4326, which has worked really well for GeoJSON. This simplifies life for the non-expert consumer enormously.

  2. Should raster data be supported? Again, my feeling is no - at least initially. My experience is that raster data is often enormous, and rarely consumed by non-experts other than for fairly trivial cases like a historical map image overlaid on Google Earth, which doesn’t really have a lot of “data” value. I see a lot of complexity in trying to support this use case, and much less value.

  3. Is CSV with two columns for point data ok? dr-shorthair argued pretty persuasively against it (in the Issue thread), but my own experience has been pretty positive. Two-column format makes intuitive sense to non-spatial people (who are aware of latitude and longitude, and that’s the extent of their knowledge). And it’s very easy to convert from that to other formats when required, especially swapping the order. Dealing with a format like “POINT(144.9 -37.8)” or “(144.9,37.8)” or whatever is, in my experience, harder work, and often error-prone due to the coordinate order issue.

So I’m still inclined to have a Spatial Data Package support:

  • GeoJSON for points, lines, polygons (and multi-X)
  • CSV for points, with “lon” and “lat” columns, expressed in decimal degrees.
  • maaaybe CSV with region-bound columns (administrative or statistical regions)
  • nothing else.
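
To show how little work the two-column convention leaves for the consumer, here is a minimal sketch that converts such a CSV to GeoJSON. The function name is made up for illustration; the only assumptions are the “lon” and “lat” column names proposed above and decimal degrees in EPSG:4326.

```python
import csv
import io


def csv_points_to_geojson(csv_text, lon_column="lon", lat_column="lat"):
    """Convert CSV rows with lon/lat columns into a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lon = float(row[lon_column])
        lat = float(row[lat_column])
        # A cheap sanity check that also catches many swapped-column mistakes.
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            raise ValueError(f"coordinates out of range: lon={lon}, lat={lat}")
        properties = {
            k: v for k, v in row.items() if k not in (lon_column, lat_column)
        }
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}
```

Because the columns are named, the coordinate order is never ambiguous on the CSV side; it only has to be decided once, at the point where the GeoJSON position is written.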

#5

Btw thanks very much Stephen for http://frictionlessdata.io/guides/point-location-data/.

Just wanted to ask if there is any backstory on this “geopoint” concept? I haven’t come across this before.


#6

In my field (ecology) rasters are one of the primary use cases of spatial data. They are regularly used by non-experts to link their biological data to environmental information. Not including them in the standard would prevent us from using the spec in our development. This is why our work in this space has explicitly included rasters. It does make it more complex, but I think it’s a necessary complexity.

In the Data Retriever we use the tabular data spec for handling data that we don’t produce. While this is a slight deviation from the primary intention of the data package it is a feasible use case that allows the large amount of data that does not adopt these standards directly to be accessed using the same tooling as formal data packages. This use case is presumably part of the reason that url is part of the spec as well as path for accessing data. I think the goal of the Geo Data Package should be to facilitate similar usage, in which case supporting non-EPSG:4326 projections is necessary. This adds complexity to associated tooling, but it doesn’t really add complexity to the spec itself. Tooling can always reproject into EPSG:4326 by default to keep things simple for non-experts.


#7

I’m not aware of the back story for geopoint but I noticed @rufuspollock did the last edit in that section of the specification. Perhaps Rufus or @pwalsh can comment on the history of the geopoint type and formats?

Personally I’ve never come across the array or object formats in real-life.


#8

I would like to see the ability to provide a minimum bounding rectangle/polygon as a constraint to support validating if a point is within a boundary.

This is currently only possible using minimum and maximum constraints on lat, lon as numbers in separate columns.
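
A minimal sketch of the two options, for comparison. The `minimum` and `maximum` constraints are existing Table Schema constraints; the bounding-box check is a hypothetical constraint that does not exist in the spec, and the field names and Melbourne-area values are made up for illustration.

```python
# What is possible today: range constraints on two separate number columns.
FIELDS_TODAY = [
    {"name": "lon", "type": "number",
     "constraints": {"minimum": 144.5, "maximum": 145.5}},
    {"name": "lat", "type": "number",
     "constraints": {"minimum": -38.5, "maximum": -37.5}},
]


def point_in_bbox(lon, lat, bbox):
    """Check a point against a hypothetical bounding-box constraint.

    bbox is (min_lon, min_lat, max_lon, max_lat), the same order GeoJSON
    uses for its "bbox" member. Boxes that cross the antimeridian are not
    handled by this sketch.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
```

A rectangle is exactly equivalent to the four min/max values, so the real gain from a dedicated constraint would come with polygons, or with a single geopoint column where there are no separate numbers to constrain.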


#9

Definite :+1: on this one - though I’m not an expert so I don’t know how big a deal coordinate projections are (this was discussed a bit here https://github.com/frictionlessdata/specs/issues/86#issuecomment-143147023)

Ditto - definitely agree.

Again I agree :smile: - I guess my only question is whether we mandate two columns or allow both as we have at the moment. A given “logical” object (a location) split across two physical columns is coming up in other contexts and I think it is very natural.

Note re geopoint

I am the one who introduced this and I borrowed it directly from Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-point.html

Personally I think this may now be overcomplex, especially the various parsing options, and we should have “one way to do things” (or max two :wink: …).
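
For readers who have not met geopoint, here is a minimal sketch of what a parser has to handle. It assumes the three formats as I understand them from the Table Schema spec: `default` (a “lon, lat” string), `array` (a two-item list) and `object` (keys `lon` and `lat`). The function name is hypothetical and this is not the code of any Frictionless library.

```python
def parse_geopoint(value, fmt="default"):
    """Parse a geopoint value into a (lon, lat) tuple of floats."""
    if fmt == "default":
        # A string such as "144.9, -37.8"; float() tolerates the whitespace.
        lon_text, lat_text = value.split(",")
        lon, lat = float(lon_text), float(lat_text)
    elif fmt == "array":
        # Exactly two items, each a number or a numeric string.
        lon, lat = (float(item) for item in value)
    elif fmt == "object":
        lon, lat = float(value["lon"]), float(value["lat"])
    else:
        raise ValueError(f"unknown geopoint format: {fmt}")
    return lon, lat
```

Three branches for one logical value is the complexity being questioned here; two plain number columns need none of them.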


#10

@rufuspollock - would be curious to hear your responses to my points above.


#11

Another thought, we should describe how to provide a spatial extent for the dataset - similar to temporal extent described in http://frictionlessdata.io/specs/data-package/#descriptor


#12

There are some examples of this in Henry’s draft packages, e.g., https://github.com/henrykironde/spatial-packages/blob/master/data/vector_packages/BOUNDARY_ARC.json


#13

Another example of prior art for reference: https://doi.org/10.1016/j.cageo.2016.09.001

“A comprehensive open package format for preservation and distribution of geospatial data and metadata”

X. Pons, J. Masó. Computers & Geosciences, Volume 97, December 2016, Pages 89-97


#14

(Back from holidays). Ok, there’s a philosophical choice here that sort of goes to the heart of what problem data packages in general are trying to solve:

Ethan:

I think the goal of the Geo Data Package should be to facilitate similar usage, in which case supporting non-EPSG:4326 projections is necessary. This adds complexity to associated tooling, but it doesn’t really add complexity to the spec itself. Tooling can always reproject into EPSG:4326 by default to keep things simple for non-experts.

So, in this view, a data package is a wrapper around messy data, and it’s up to the person consuming the data to use tools that can translate it into something clean. The value that Data Package is providing is allowing that process to be automated: the data itself is provided in its original format, but it’s relatively straightforward to build a tool (eg, the Data Retriever) that uses the Data Package metadata to transform the data.

But this particular use case seems a bit of an oddity:

  1. The developer of the tool creates the Data Package metadata
  2. The tool then uses that metadata to transform the data at runtime, invisibly to the end user.

In other words, the Data Package isn’t really serving any interoperability goal here. It could just as easily be a proprietary format (or code), because neither the provider of the data nor the consumer actually interacts with it at all. (I get that there is a benefit in being able to use other Data Package-supporting tools as part of the behind-the-scenes workflow. It’s hard to assess how important that is, in the general case.)

At its extreme, a Spatial Data Package could support everything. But then, nothing would support Spatial Data Package, except possibly one tool…whose job would be to convert the resources indicated, into something more directly consumable, probably GeoJSON.

So, on this point:

Tooling can always reproject into EPSG:4326 by default to keep things simple for non-experts.

I just don’t think this is a great approach, because you could make that statement now - tools exist to convert spatial formats into something more useful to a non-expert. But where is that tooling? How does the non-expert know what tool they need, or how to use it? To me, the value of something like Spatial Data Package is the promise that you’re guaranteed something that is immediately useful. All the work of converting and reprojecting should be done at the upstream end: a consumer should not be exposed to such complexities.


#15

Another thought, we should describe how to provide a spatial extent for the dataset - similar to temporal extent described in http://frictionlessdata.io/specs/data-package/#descriptor

What use cases for extents do you envision? Given that they can be computed automatically from the spatial data (as either a bounding box or a polygon such as a convex hull), is it worth recording separately as metadata?
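
To back up the “computed automatically” claim, here is a minimal sketch that derives a bounding box from a GeoJSON FeatureCollection. The function names are made up; the sketch assumes every feature has a geometry with a `coordinates` member, so it ignores GeometryCollection and null geometries, and it fails on an empty collection.

```python
def iter_positions(coords):
    """Yield [lon, lat, ...] positions from nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: a flat list of numbers.
        yield coords
    else:
        for part in coords:
            yield from iter_positions(part)


def bounding_box(feature_collection):
    """Return (min_lon, min_lat, max_lon, max_lat) for all features."""
    lons, lats = [], []
    for feature in feature_collection["features"]:
        for position in iter_positions(feature["geometry"]["coordinates"]):
            lons.append(position[0])
            lats.append(position[1])
    return (min(lons), min(lats), max(lons), max(lats))
```

The catch is that this requires downloading and parsing the whole dataset, which is the usual argument for also recording the extent as metadata: a catalogue can then filter by area without fetching the data.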


#16

To an expert, they’re not a big deal. To a non-expert, they can be a huge deal. I’ve seen people waste half a day or more just grappling with the task of getting data from one projection to another. The steps go something like:

  1. Spend a while trying to get your map to display, and either nothing happens, or it looks completely wrong.
  2. Eventually identify that the problem is the source projection.
  3. Somehow work out what the source projection is.
  4. Somehow work out what target projection you need.
  5. Repeat steps 3 and 4 for the format of the projections (ie, once you know that Google Web Mercator is what you need, figure out that “EPSG:3857” is the magic string that you need in this particular app…unless it’s a proj4 string, or a projection file, or something else).
  6. Work out where to put those pieces of information, or whether you need some extra piece of software like ogr2ogr, which as a non-spatial-user, you most definitely don’t have installed, and you can’t really believe what an enormous piece of software gdal-bin is to install, and why the hell do you need to install the whole of gdal when you just want this one tiny command line utility anyway, except at this point someone says you should just use QGIS, and now you’ve lost the rest of your afternoon too.
  7. Somewhere you’ll probably trip up on the subtle distinctions between “spatial reference systems”, “coordinate reference systems” and “projections” (and the fact that EPSG:4326 isn’t technically a projection at all)

It’s pretty similar for expert users, except that the above happens a lot faster - minutes rather than hours.

I wrote this (slightly inaccurate) blog post a couple of years ago when I was trying to get my head around it: https://stevebennett.me/2014/07/25/web-map-projections-the-bare-minimum-you-need-to-know/

One of the crazy things (for me) about spatial data is that often the projection is not recorded as metadata. You’re just meant to know, somehow. Shapefiles may, or may not, include the projection information. Dealing with (newer, non-projection-supporting) GeoJSON is such a godsend, because that entire waste of brainpower just goes away.

[In case you’re curious, the one sometimes important limitation of using unprojected data is, somewhat oddly, continental drift. Data stored in a local projection can pin a location to its local tectonic plate. Locations stored with unprojected coordinates represent a point on the surface of the spheroid, and hence over time the physical object being represented may drift from the point, a few centimetres per year.]


#17

Thanks for the detailed response @stevage. Certainly there are always tradeoffs in this kind of decision making and I’m definitely not proposing the “support everything” approach. In this case what I’m suggesting is that projections are fundamental to spatial data and as a result I don’t think that being stored in a non-lat/long projection represents “messy data”.

I’m also not suggesting that a non-expert be required to handle the projections. As I understand it a big part of the Frictionless data approach is providing tooling that loads the data in a consistent and desired way across languages. The analogy is that the user doesn’t need to know how to parse data following the csv standard, it just gets loaded and works. I’m suggesting that the tools for handling frictionless data geo data packages either automatically reproject everything into lat/longs, or take a projection argument and reproject the loaded package into the desired projection (defaulting to EPSG:4326). As you mention, this is an easy (and in most cases one line) task for folks with experience and so I don’t think it would be a big burden to incorporate it into tools that load the packages.
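
A very rough sketch of the “loader reprojects for you” idea, under heavy assumptions: the package would declare its source CRS somewhere (the `source_crs` argument below stands in for that hypothetical metadata), and only EPSG:3857 (Web Mercator) is handled, using the standard spherical inverse formula written out by hand. A real loader would delegate to a projection library such as PROJ rather than hard-code formulas.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def web_mercator_to_lonlat(x, y):
    """Inverse spherical Mercator: metres in EPSG:3857 to degrees lon/lat."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


def load_points(points, source_crs="EPSG:4326"):
    """Return (lon, lat) tuples in EPSG:4326 whatever the declared source CRS.

    Only two CRS codes are supported in this sketch; anything else is rejected
    rather than silently passed through.
    """
    if source_crs == "EPSG:4326":
        return [(x, y) for x, y in points]
    if source_crs == "EPSG:3857":
        return [web_mercator_to_lonlat(x, y) for x, y in points]
    raise ValueError(f"unsupported CRS in this sketch: {source_crs}")
```

The sketch only works if the source CRS is reliably recorded in the package, which is the precondition for any tool-side reprojection.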

I wouldn’t have any issues limiting the projections to a small subset that can be easily reprojected using standard builtin tools, but not supporting core projections in which data is often collected (e.g., UTMs) seems like it will create barriers to getting data placed in these packages in addition to limiting our ability to use this standard for our work.


#18

I think where I might be struggling is not having a clear understanding of what kinds of workflows Data Package (and hence, Spatial Data Package) is meant to support.

My (perhaps naive) assumption was that it goes like this:

  1. Someone publishing data cares about reuse and interoperability, and decides to invest more time and effort than simply dumping some files on the web, so uses some existing tools (or builds their own) to bundle their data into well described Data Packages, and uploads them somewhere.
  2. A casual user who has never heard of Data Package comes across one of these things, and finds the metadata format fairly simple to understand, or can simply access the files inside as expected. By virtue of it being a Spatial Data Package, the files inside are in a very reusable format, life is easy.
  3. Or, a user of some tool that is hooked into the Data Package ecosystem makes the data available to them, and they’re barely aware that it was ever a Data Package to begin with. Perhaps they use R, or some downstream website.

In these cases, I don’t see the benefit of shifting the burden of reprojection etc from the publisher (1) to the casual consumer (2). The publisher is already doing work, and is already familiar with their data: getting them to reproject it and convert it to GeoJSON (for instance) is fairly trivial. Allowing them to publish the data in their local projection leaves an awful lot of friction in the workflow, IMHO. (And I just don’t see the point: projections are not, afaik, inherent to the data in any way, other than the slow drift over time issue I mentioned.)

Can you explain why projections are “fundamental to spatial data”? I’d also be curious what you think about GeoJSON’s total lack of support for anything other than EPSG:4326 - do you see this as seriously compromising GeoJSON’s usefulness?


#19

TLDR; I am not suggesting that the burden of reprojection be shifted to the casual consumer. I am suggesting that it be shifted to the person writing the tool that loads spatial data packages.

My apologies for failing to communicate clearly. Hopefully in addressing your points my perspective will become a little more clear.

It is reasonable to expect some additional investment from data providers, but in my experience every additional hurdle that you put between a data provider and them releasing their data decreases the chance that they will share it at all and increases the chance that if they do they will dump completely uncurated and undocumented data on the web. I think it’s fair to say that this perspective is basically consensus among folks involved in getting scientists to share data in useful ways. E.g., it sits at the core of Dryad’s philosophy on this. One of the key things I like about the frictionless data approach is that it is also relatively low friction on the provider’s part relative to more complex metadata standards.

My impression is that this is not the primary use case based on the emphasis on developing tooling across languages and the use of JSON (instead of something more human readable and writeable, e.g., YAML). That said, I’m currently also a bit confused by the user example of someone who knows nothing about projecting but will be able to actively use spatial data. Making simple maps is great (though it’s not ideal to do even this in EPSG:4326 in most cases), but presumably most folks will want to analyze the data in ways that will require more useful projections. It seems to me that part of frictionless here is “the data easily ends up in the projection I need”.

My impression is that this is the primary use case with packages being developed to allow loading the packages into the appropriate formats in different languages. Hence the effort to develop libraries for lots of languages: http://frictionlessdata.io/software/

I am not suggesting that the burden of reprojection be shifted to the casual consumer. I am suggesting that it be shifted to the person writing the tool that loads spatial data packages.

  1. See my point above about friction for data providers
  2. This isn’t as straightforward with rasters
  3. The question about GeoJSON is a good one. It does in the sense that it doesn’t support raster data. It’s great for default map making on the web, but I see the ease of GeoJSON as being for the tools as much as for the user.

More generally I like thinking about GeoJSON here in the sense of “what will this standard bring to the table that GeoJSON doesn’t already provide.” My impression of the really simple version being discussed at the moment is that it’s basically a little bit of metadata on top of GeoJSON, except storing the data in csv. If that’s the case I guess I’m confused as to why the best approach isn’t just to use GeoJSON (a well established standard) and the package could be a few lines of metadata and a link to a GeoJSON file. That would be valuable, it just wouldn’t be useful for us and so we’d just work with what we’ve already developed for the more general geospatial use case.
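
For concreteness, the “few lines of metadata and a link to a GeoJSON file” version might look roughly like the sketch below. It uses only general Data Package resource properties (`name`, `path`, `format`, `mediatype`); the package name, title and file path are invented, and nothing here is an agreed Geo Data Package profile.

```python
import json

# Hypothetical minimal descriptor: ordinary Data Package metadata wrapped
# around a single GeoJSON file.
descriptor = {
    "name": "example-boundaries",       # made-up package name
    "title": "Example boundary data",   # made-up title
    "resources": [
        {
            "name": "boundaries",
            "path": "data/boundaries.geojson",  # made-up relative path
            "format": "geojson",
            "mediatype": "application/geo+json",
        }
    ],
}

# This is what would be saved as datapackage.json alongside the data.
descriptor_json = json.dumps(descriptor, indent=2)
print(descriptor_json)
```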


#20

I really agree. I think we definitely exclude projections by default :slight_smile: