Geo Data Package

Hi Ethan,
Sorry for the slow reply!

Every additional hurdle that you put between a data provider and them releasing their data decreases the chance that they will share it at all, and increases the chance that, if they do, they will dump completely uncurated and undocumented data on the web.

Yeah, I do get this. (I spent a couple of years working in academic research data sharing, and a couple more in government data sharing). And I do agree there’s a lot of value in improving sub-standard data and republishing it. I’m just not convinced that making Data Package itself serve that purpose is the right move - it directly contradicts the “Simplicity” design goal (and probably the “Focused” and “Web-oriented” ones too), after all. Instead, I think some related spec (“Messy Data Transforms”?) could describe what’s required to turn an existing dataset into a nice, clean Data Package without overly complicating that latter spec.

My impression is that this is not the primary use case, based on the emphasis on developing tooling across languages and the use of JSON (instead of something more human-readable and writeable, e.g., YAML).

Maybe neither of us is super clear on the use cases :slight_smile: My understanding was that humans can access the files inside the data package with their existing tools without necessarily worrying about the contents of the datapackage.json - but it’s there so that other tools can interoperate at a higher level. So, you can open a CSV file in Excel if you’re used to that, but you can also use a special library to open it in Python, and you’ll have the benefit of data types, descriptions etc.

FWIW, I much prefer JSON over YAML for authoring. YAML is actually pretty atrocious due to its incredible complexity (example), and the syntactic choices around lists and objects/hashes. I’ve had to use it a few times for things like Salt, and found it incredibly painful and error-prone, whereas writing JSON is simple to get right. It’s easy to validate a JSON file by eye, whereas doing so with YAML is essentially impossible.

It seems to me that part of frictionless here is “the data easily ends up in the projection I need”.

Yes, I’m not sure what you’re arguing for here. People who have specific projection needs will know how to do the projection. But for non-GIS people, EPSG:4326 is likely to be the right choice, for loading data into a web map, for instance.

My impression of the really simple version being discussed at the moment is that it’s basically a little bit of metadata on top of GeoJSON, except storing the data in CSV. If that’s the case, I guess I’m confused as to why the best approach isn’t just to use GeoJSON (a well-established standard), and the package could be a few lines of metadata and a link to a GeoJSON file.

Why not just GeoJSON instead of CSV? Because it makes point data (and all its attributes) utterly inaccessible to anyone using non-spatial tools such as Excel. Most spatial tools support CSV (although the lack of standardisation is annoying). Zero non-spatial tools support GeoJSON.

And in many cases, “point data” is really just tabular data that happens to have a location. For instance, event permits around the city have lots of interesting data, and location is just one of them. Similarly wildlife sightings. Car accidents. Tweets. etc. (This seems to be less true of line and polygon data, where the geometry is the data.)

What I’m leaning towards:

  1. Point data: publish as Tabular Data Package (CSV) with a bit of extra location metadata, and optionally also include the data as GeoJSON. (A rough sketch follows this list.)
  2. Simple vector data (lines, polygons): publish as Data Package containing a GeoJSON, plus a bit of extra location metadata, and optionally, schema metadata (describing the fields).
  3. Complex spatial data (raster, coverages, multi-layered stuff that doesn’t suit GeoJSON, data where projections are crucial for some reason): not sure. :slight_smile:
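
For concreteness, a datapackage.json for option 1 might look something like this. To be clear, this is purely illustrative: none of the spatial property names here (spatial-profile, locations, lat-lon, fields) are settled; they’re placeholders for whatever the spec ends up defining.

{
  "name": "bike-parking",
  "profile": "tabular-data-package",
  "spatial-profile": "tabular-points",
  "resources": [
    {
      "name": "bike-parking",
      "path": "bike-parking.csv",
      "profile": "tabular-data-resource",
      "locations": [
        {
          "type": "lat-lon",
          "fields": { "latitude": "Latitude", "longitude": "Longitude" }
        }
      ]
    }
  ]
}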

@stevage - no problem. Thanks for engaging in this discussion. At this point I think it’s fair to say that we see different users & use cases as being most important and therefore disagree on the best route forward. I’ll leave you to it and we’ll continue with our use case separately. If you decide you want to generalize more at some point give us a ping and we can chat more about the decisions we made to facilitate this.

Thanks - your input is incredibly valuable (still re-reading some bits), sorry for showing up late to the party and throwing around opinions.

It may be the case that the two kinds of Data Package can and should coexist. For the sake of argument, call your style of DP “messy SDP” (it’s a clean, structured interface around data found in the wild) and mine “tidy SDP” (choices of file format and structure dictated by the standard). Now, many messy SDPs can automatically be converted into tidy SDPs (excluding raster data, for instance). A question to figure out is where that conversion should be done, and which of the two SDPs should be presented publicly (and how to distinguish them).

@stevage what’s the status of this work? Are you planning to circulate a draft for comment?

Hello Stephen, everyone,

We haven’t shared the report by @stevage widely yet, but you can read it here: Spatial Data Package investigation.

We are highly appreciative of everyone’s contributions to the spatial data package discussion over the last few months, and particularly grateful to Steve Bennett for lending his time and expertise to this work.

Have a read and let us know what you think.

How do you plan to capture feedback?

My first questions…

Point datasets

Recommendation for creators:

  1. Point datasets SHOULD be published as Tabular Data Package, in CSV, with locations represented as “Latitude” and “Longitude” columns.
  2. These columns SHOULD be given appropriate types when such are supported in Table Schema.
  3. “locations” and “spatial-profile” metadata SHOULD be included, to indicate that the TDP is also a spatially-enriched dataset, and how to interpret the location information.
  4. For maximum reusability, a GeoJSON version of the data should also be included within the Data Package. (NOTE: It cannot be included as a resource without breaking the rules of Tabular Resource, but can be included in the package nonetheless.)

Currently, Table Schema supports “geopoint” and “geojson” location types, which we do not recommend. We propose individual “latitude” and “longitude” types.

  1. What change do you propose to the Table Schema type and format (both options are sketched below):
    a. two new types, latitude and longitude, or
    b. for type: number, two new formats, latitude and longitude?
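
To make the two options concrete, here’s a rough sketch of each in a Table Schema fields array (neither exists in the spec today, so both are assumptions):

// Option (a): two new types
"fields": [
  { "name": "lat", "type": "latitude" },
  { "name": "lon", "type": "longitude" }
]

// Option (b): two new formats on the existing number type
"fields": [
  { "name": "lat", "type": "number", "format": "latitude" },
  { "name": "lon", "type": "number", "format": "longitude" }
]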

  2. Do you propose to deprecate the types geopoint and geojson and all underlying formats?

Next, a question about spatial-profile.

Location metadata

Given that location information can exist in conjunction with other kinds of data, we recommend that two types of location metadata be included in Data Package descriptors where appropriate:

Package-level “spatial-profile” attribute indicates that the Data Package contains location information, and what sort it is. This makes filtering for location-containing Data Packages easy. (This attribute may be superfluous, in that it can be inferred from attributes on the resources.)

Package-level “spatial-profile” attribute

One of:

{ "spatial-profile": "tabular-points" }
{ "spatial-profile": "simple-vector" }
{ "spatial-profile": "raster" }
{ "spatial-profile": "vector" }

conflicts with…

Point datasets


4. For maximum reusability, a GeoJSON version of the data should also be included within the Data Package. (NOTE: It cannot be included as a resource without breaking the rules of Tabular Resource, but can be included in the package nonetheless.)

If spatial-profile is “one of:” then you can’t have different spatial data resources in the one data package.

I suggest one of:

  • spatial-profile be applied at the data resource level. This will support the discovery of different types of spatial data.
  • spatial-profile at the data package level be an array of profile types (sketched below).
  • spatial-profile at the data package level be boolean (i.e. there’s spatial data inside, but I can’t say what without looking further, e.g. at the data resource format or mediatype).
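
For example, the array variant could be as simple as the following (values taken from the proposed codelist above; just a sketch):

{
  "spatial-profile": ["tabular-points", "simple-vector"]
}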

I really like locations of "type": "lat-lon". :+1:

This enables:

  • any axis order (lat,lon or lon,lat)
  • relationship between lat and lon to be explicit rather than implied by field name or adjacent columns in the table
  • validation of presence of both fields or none (i.e. you can’t have a lon without a lat)
"resources": [{
  "name": "office-locations",
  "profile": "tabular-data-resource",
  "locations": [
    {
      "type": "lat-lon",
      "fields": {
        "latitude": "lat",
        "longitude": "lon"
      }
    }
  ],
  ...
}]  

Should locations metadata that supports linking to boundaries mimic the foreignKeys format? The concepts are similar.


  "locations": [
    {
      // REQUIRED: described above.
      "type": "boundary-id",
      // REQUIRED: the name of the field containing the identifiers.
      "field": "Council ABS ID",
      // REQUIRED: a codelist from a predefined (TBD) set, in hyphenated lower case.
      // Colon (:) indicates a subset which is not always present. Possibilities that make immediate sense:
      // "iso-3166-1:alpha-2": 2-letter country codes
      // "iso-3166-1:alpha-3": 3-letter country codes
      // "iso-3166-2": 5-, 6- or 7-letter character administrative subdivision codes (eg "FR-33")
      // "nuts-1": 1st-level NUTS code for EU (eg "AT3" = Western Austria)
      // "nuts-2": 2nd-level NUTS code for EU (eg "AT33" = Tyrol)
      // "nuts-3": 3nd-level NUTS code for EU (eg "AT332" = Innsbruck)
      // "csv-geo-au": Australian statistical and administrative boundaries as defined by csv-geo-au standard.
      "codelist": "csv-geo-au:lga_code",
      // OPTIONAL: an identifier for the specific version of the boundaries (often a year).
      "version": "2011",
      // OPTIONAL (TBD): local or web path to an actual source of those boundaries, in the absence of a codelist-resolving service.
      // this could also support the (unverified) use case of attributes and boundaries supplied separately in the same DP.
      "geometrypath": "http://..."
    }
  ]
 "foreignKeys": [
          {
            "fields": "state-code"
            "reference": {
              "resource": "state-codes",
              "fields": "code"
            }
          }
        ]

Equivalents:

  • foreignKeys and "type": "boundary-id"
  • fields and field
  • resource and geometrypath + version
  • reference.fields and codelist

You could also take inspiration from the Table Schema: Foreign Keys to Data Packages pattern - [example data package].
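
To illustrate, a locations entry restyled to mimic the foreignKeys shape might look like this (purely a sketch using the equivalents above, not something from the report):

"locations": [
  {
    "type": "boundary-id",
    "fields": "Council ABS ID",
    "reference": {
      "codelist": "csv-geo-au:lga_code",
      "version": "2011",
      "resource": "http://..."
    }
  }
]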

Perhaps I don’t fully understand…

An alternative approach would be to use Table Schema’s “foreignkey” (NOTE: Table Schema | Frictionless Standards) element to link directly to a feature in an external Data Package containing the relevant geometry. This approach has several weaknesses…

Perhaps what you’re saying is that codelist is used to look up some other resource that effectively returns the geometry path, if the codes are shared and managed externally?

I think some fully worked examples are needed.

Geometrypath

I think I’d like to verify the “unverified use case” of “attributes and boundaries supplied separately in the same data package” (if I understand it correctly)…

 // OPTIONAL (TBD): local or web path to an actual source of those boundaries, in the absence of a codelist-resolving service.
 // this could also support the (unverified) use case of attributes and boundaries supplied separately in the same DP.
     "geometrypath": "http://..."

Given a data package with

  • datapackage.json
  • tourists_by_district_2016.csv
  • tourism_districts.geojson (tourism district geometry and district-name attribute)

tourists_by_district_2016.csv

tourism district,visitors 2016
a,10000
b,23000

Should the data package be:

{
  "profile": "data-package",
  "name": "tourists_2016",
  "version": "0.1.0",
  "resources": [
    {
      "path": "tourists_by_district_2016.csv",
      "name": "tourists_by_district_2016",
      "profile": "tabular-data-resource",
      "locations": {
        "type": "boundary_id",
        "field": "tourism district",
        "codelist": "district-name",
        "geometrypath": "tourism_districts.geojson"
      },
      "schema": {
        "fields": [
          {
            "name": "tourism district",
            "type": "string",
            "format": "default",
            "constraints": {
              "required": true,
              "unique": true
            }
          },
          {
            "name": "visitors 2016",
            "type": "integer",
            "format": "default"
          }
        ]
      },
      "primaryKeys": [
        "tourism district"
      ]
    },
    {
      "path": "tourism_districts.geojson",
      "name": "tourism_districts",
      "profile": "data-resource",
      "locations": {
        "type": "geojson"
      },
      "schema": {
        "fields": [
          {
            "name": "district-name",
            "type": "string",
            "format": "default",
            "constraints": {
              "required": true,
              "unique": true
            }
          }
        ]
      },
      "primaryKeys": [
        "district-name"
      ]
    }
  ]
}
  • Is the linking correct?
  • Are the constraints and primaryKeys needed in the geojson schema?

@stevage I’ve just re-read this whole thread and I think you’ve done a great job finding a balance between competing requirements. I think the idea of supporting both “minimal spatial data packages” and “comprehensive spatial data packages” works well.

I’d like to know if @ethanwhite and @henrykironde think that their needs are supported? @rufuspollock I suspect there’s enough “zen like simplicity” here for you?

I have the chance to do some work with a government department on spatial data next month and I’d be keen to test out these ideas with them. They don’t publish any GeoJSON at present. I won’t have access to programmers, but am happy to make data packages compliant with the proposal to test ideas out. I’ve already pencilled these changes into the Data Curator backlog. Let me know if you’re thinking about creating software to test the concepts.

Lastly, I’d still like to add the concept of a polygon constraint for lat-lon locations, e.g. all points must be inside the polygon. I’ll give this some thought and your suggestions are welcome.

Thanks @pwalsh and the Open Knowledge team for commissioning the research.

I’ve documented my thoughts on providing a spatial extent to describe and validate lat-lon point data. This led me to propose some changes to the research paper and to consider the harmonisation of the Frictionless Data language: Spatial Extent for Lat-Lon Locations - HackMD

Feedback very welcome.

This sample/example is missing some commas. I recommend viewing the sample with http://jsoneditoronline.org/
I think there is something I am missing with how you have organised the resources in the example above.

Thanks @henrykironde. Fixed now (I hope!)

The sample supplied looks fine to me.
I still have to give the report a second read, but so far it looks good.

I’ve made a data package for point data representing my suggestions above.

It includes:

  • spatial-profile applied at the data resource level (instead of the data package level)
  • a new Location “name” property to support validation of points within a spatialExtent
  • a new spatialExtent property: a polygon that all points should be within (a rough sketch follows below)
    - potentially unnecessary minimum and maximum constraints on the lat and lon columns, if validation is applied to the spatialExtent; if not, then this acts as a minimum bounding rectangle to validate the point locations

It excludes:

  • using the tabular data package profile, because a convenience GeoJSON copy of the CSV point data is included
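
Roughly, the resource descriptor in that example looks like the sketch below. Property names follow my suggestions above (none of them are in the spec yet), and the polygon coordinates are dummy values:

{
  "name": "office-locations",
  "path": "office-locations.csv",
  "profile": "tabular-data-resource",
  "spatial-profile": "tabular-points",
  "locations": [
    {
      "name": "office location",
      "type": "lat-lon",
      "fields": { "latitude": "lat", "longitude": "lon" },
      "spatialExtent": {
        "type": "Polygon",
        "coordinates": [
          [[140.9, -39.2], [150.0, -39.2], [150.0, -33.9], [140.9, -33.9], [140.9, -39.2]]
        ]
      }
    }
  ]
}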

@Stephen, looks like the page is missing the “data package for point data representing my suggestions” link.

Link fixed. I’ve also added a simple vector example, but with spatial-profile applied at the data resource level (instead of the data package level). Thanks again @henrykironde.

I also attempted Tabular data linked to non-standard boundaries, but struggled with this. Guidance welcome.

And linking to boundaries via a codelist.

…another thought: how do you specify whether point data coordinate pairs are:

  • required/optional - both must exist or both must not exist (one partial encoding is sketched below)
  • must be unique (perhaps use primaryKey instead of a constraint - but that feels like a hack)
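
On the first point, one partial workaround with today’s Table Schema is a required constraint on each of the two fields. That forces both to always be present, but it can’t express “both or neither” - a real pairing rule would need something new:

"fields": [
  { "name": "lat", "type": "number", "constraints": { "required": true } },
  { "name": "lon", "type": "number", "constraints": { "required": true } }
]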

For point data, those should always be required; I think that is what makes this a geo package for vector data. I do not think we need to specify any of the above. I would like to hear what other folks think.