Hi Ethan,
Sorry for the slow reply!
> every additional hurdle that you put between a data provider and them releasing their data decreases the chance that they will share it at all and increases the chance that if they do they will dump completely uncurated and undocumented data on the web
Yeah, I do get this. (I spent a couple of years working in academic research data sharing, and a couple more in government data sharing.) And I do agree there’s a lot of value in improving sub-standard data and republishing it. I’m just not convinced that making Data Package itself serve that purpose is the right move, since it directly contradicts the “Simplicity” design goal (and probably the “Focused” and “Web-oriented” ones too). Instead, I think some related spec (“Messy Data Transforms”?) could describe what’s required to turn an existing dataset into a nice, clean Data Package, without overly complicating the Data Package spec itself.
> My impression is that this is not the primary use case based on the emphasis on developing tooling across languages and the use of JSON (instead of something more human readable and writeable, e.g., YAML)
Maybe neither of us is super clear on the use cases. My understanding was that humans can access the files inside the data package with their existing tools, without necessarily worrying about the contents of the datapackage.json, but it’s there so that other tools can interoperate at a higher level. So, you can open a CSV file in Excel if you’re used to that, but you can also use a special library to open it in Python, and you’ll have the benefit of data types, descriptions, etc.
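To make the "higher level" part concrete, here's a rough sketch of what such a library might do under the hood. It is not the actual Data Package tooling: the descriptor contents, the sample CSV, and the `read_typed` helper are all made up for illustration, and only a handful of field types are handled.

```python
import csv
import io
import json
from datetime import date

# Hypothetical descriptor, in the spirit of a datapackage.json file.
DESCRIPTOR = json.loads("""
{
  "name": "wildlife-sightings",
  "resources": [{
    "path": "sightings.csv",
    "schema": {"fields": [
      {"name": "species", "type": "string"},
      {"name": "count", "type": "integer"},
      {"name": "seen_on", "type": "date"}
    ]}
  }]
}
""")

# Stand-in for the contents of sightings.csv (invented sample data).
CSV_TEXT = "species,count,seen_on\nkoala,2,2016-03-01\nwombat,1,2016-03-04\n"

# Map schema type names to Python casting functions (a small subset only).
CASTERS = {
    "string": str,
    "integer": int,
    "number": float,
    "date": date.fromisoformat,
}


def read_typed(csv_file, schema):
    """Read CSV rows and cast each value using the schema's field types."""
    casters = {f["name"]: CASTERS[f["type"]] for f in schema["fields"]}
    return [
        {name: casters[name](value) for name, value in row.items()}
        for row in csv.DictReader(csv_file)
    ]


rows = read_typed(io.StringIO(CSV_TEXT), DESCRIPTOR["resources"][0]["schema"])
```

Excel users never need to look at the descriptor, while the Python user gets real integers and dates instead of strings.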
FWIW, I much prefer JSON over YAML for authoring. YAML is actually pretty atrocious due to its incredible complexity (example) and its syntactic choices around lists and objects/hashes. I’ve had to use it a few times for things like Salt, and found it incredibly painful and error-prone, whereas writing JSON is simple to get right. It’s easy to validate a JSON file by eye, whereas doing the same with YAML is essentially impossible.
> It seems to me that part of frictionless here is “the data easily ends up in the projection I need”.
Yes, I’m not sure what you’re arguing for here. People who have specific projection needs will know how to do the projection. But for non-GIS people, EPSG:4326 is likely to be the right choice, for instance when loading data into a web map.
> My impression of the really simple version being discussed at the moment is that it’s basically a little bit of metadata on top of GeoJSON, except storing the data in csv. If that’s the case I guess I’m confused as to why the best approach isn’t just to use GeoJSON (a well established standard) and the package could be a few lines of metadata and a link to a GeoJSON file
Why not just GeoJSON instead of CSV? Because it makes point data (and all its attributes) utterly inaccessible to anyone using non-spatial tools such as Excel. Most spatial tools support CSV (although the lack of standardisation is annoying). Zero non-spatial tools support GeoJSON.
And in many cases, “point data” is really just tabular data that happens to have a location. For instance, event permits around the city have lots of interesting data, and location is just one part of it. The same goes for wildlife sightings, car accidents, tweets, etc. (This seems to be less true of line and polygon data, where the geometry is the data.)
What I’m leaning towards:
- Point data: publish as Tabular Data Package (CSV) with a bit of extra location metadata, and optionally also include the data as GeoJSON.
- Simple vector data (lines, polygons): publish as a Data Package containing a GeoJSON file, plus a bit of extra location metadata and, optionally, schema metadata (describing the fields).
- Complex spatial data (raster, coverages, multi-layered stuff that doesn’t suit GeoJSON, data where projections are crucial for some reason): not sure.
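For the point data case, the "optionally also include the data as GeoJSON" step could be generated from the CSV itself. Here's a minimal sketch, assuming a made-up location metadata block: the `latitudeField`, `longitudeField` and `crs` keys are purely hypothetical (nothing like them is specified anywhere yet), and the sample rows are invented.

```python
import csv
import io
import json

# Hypothetical "extra location metadata" for a tabular resource.
# These key names are illustrative only, not part of any spec.
LOCATION = {"latitudeField": "lat", "longitudeField": "lon", "crs": "EPSG:4326"}

# Stand-in for a CSV of event permits (invented sample data).
CSV_TEXT = (
    "event,lat,lon\n"
    "Street fair,-37.8136,144.9631\n"
    "Fun run,-37.8200,144.9800\n"
)


def csv_points_to_geojson(csv_file, location):
    """Turn CSV rows into a GeoJSON FeatureCollection of Point features."""
    lat_field = location["latitudeField"]
    lon_field = location["longitudeField"]
    features = []
    for row in csv.DictReader(csv_file):
        # Remove the coordinate columns; everything left becomes properties.
        lat = float(row.pop(lat_field))
        lon = float(row.pop(lon_field))
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


collection = csv_points_to_geojson(io.StringIO(CSV_TEXT), LOCATION)
geojson_text = json.dumps(collection, indent=2)
```

The CSV stays the primary copy that Excel users can open, and the GeoJSON is just a derived convenience for spatial tools.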