Pilot: Frictionless Data in Archaeology

steko · August 17, 2016, 11:29am

Recently Open Knowledge International have come up with a rather elegant solution for data dissemination called Data Packages. It is basically a simple way to keep data in good old CSV, with the added value of a simple JSON schema file that describes each data field (is it a numeric field for continuous data? for count data? etc.). There is already a lot of documentation on the Data Packages website:

http://frictionlessdata.io/data-packages/

including tutorials by @danfowler explaining how to take advantage of Data packages in many popular scripting and programming languages:

http://frictionlessdata.io/tools/

with the result of avoiding typical problems like strings or other data types imported as factors, etc. From my perspective and first impressions, Data Packages make it way easier to implement a reproducible procedure for my own personal use, but at the same time data sharing is rather immediate, as there is more explicit metadata that can be automatically associated when importing in R (or Jupyter, MATLAB or any other compatible environment actually). There are validation tools written specifically for Data Packages, again based on the simple association of 1 CSV file and 1 JSON file (with possibilities of more complex setups, of course) using the JSON Table Schema.
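
To make the "1 CSV file and 1 JSON file" idea concrete, here is a minimal sketch that uses only the Python standard library rather than the official Data Package tools. The descriptor, the file name finds.csv, the field names and the values are all invented for illustration, and the casting table covers just three of the schema types.

```python
import csv
import io
import json

# Hypothetical descriptor: one CSV resource plus a schema that names
# the type of each field. Names and values are invented for illustration.
DESCRIPTOR = json.loads("""
{
  "name": "example-finds",
  "resources": [
    {
      "path": "finds.csv",
      "schema": {
        "fields": [
          {"name": "find_id", "type": "string"},
          {"name": "sherd_count", "type": "integer"},
          {"name": "weight_g", "type": "number"}
        ]
      }
    }
  ]
}
""")

# Stand-in for the contents of finds.csv, kept inline so the sketch runs as-is.
CSV_TEXT = "find_id,sherd_count,weight_g\n0012,3,41.5\n0013,11,207.0\n"

# Only three schema types are handled here; this is an assumption of the sketch.
CASTERS = {"string": str, "integer": int, "number": float}


def read_typed_rows(csv_text, schema):
    """Read CSV text and cast every value according to the schema."""
    casters = {f["name"]: CASTERS[f["type"]] for f in schema["fields"]}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: casters[name](value) for name, value in row.items()}
        for row in reader
    ]


rows = read_typed_rows(CSV_TEXT, DESCRIPTOR["resources"][0]["schema"])
print(rows[0])
```

Because find_id is declared as a string, the identifier "0012" keeps its leading zeros, while the count and the weight arrive as numbers without any guessing by the importing tool.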

In short, I think many archaeologists may be interested in testing Data packages, especially if you are already using R or Python, and perhaps providing some feedback to the developers to help with your specific use case and any issues you could encounter.

For example, and again, based on my own experience, I think most if not all “supplementary data” for archaeometry studies based on chemical methods could be easily published as Data Packages, and that would result in almost zero-effort aggregation, comparison of newly published data and reproducibility - I know some who care a lot about that!

Find inventories (e.g. from excavation) are another of my pet peeves, where relatively simple information is kept stored in a wide variety of digital formats. One would naively think that standardization by consensus for such basic stuff could be something that previous generations had already solved, but we all know that is not the case. Again, Data Packages are not conceived as a panacea but my personal view is that if it gains momentum there are some practical advantages for the general diffusion of open data.

What kind of feedback are we talking about? Concrete practical feedback provided here on the forum is the easiest way to get started. We are also planning to create a dedicated GitHub repository, e.g. frictionlessdata/pilot-archaeology, where we could keep example datasets, and to hold a series of hangouts for discussing face-to-face and collecting potential user stories. It’s expected that priority will be given to specific needs (that may well turn out to have wider use) rather than ambitious, discipline-wide goals.

We would like to have a first hangout on 21st September.

danfowler · August 19, 2016, 2:21pm

@steko Looking around on your Journal of Open Archaeology Data, I found the following dataset that I think could serve as a nice pilot that can be pulled together relatively quickly in advance of our hangout. It could demonstrate how to package a dataset, and how, once packaged, it can be pushed into various backends, not just SQL, but Pandas, R, and beyond.

The Cultural Evolution of Neolithic Europe. EUROEVOL Dataset

http://openarchaeologydata.metajnl.com/articles/10.5334/joad.42/ (data: The Cultural Evolution of Neolithic Europe. EUROEVOL Dataset - UCL Discovery)

The reasons I like this one:

It is small-ish, so we could push it to, say, GitHub and work in the open
Consists of several tables that relate to each other which can be modeled in a Data Package
Has an existing SQL dump for comparison on e.g. ease-of-use, performance, import into different backends
Has existing metadata about each table prepared already

WDYT?
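
As a rough illustration of how related tables could be modeled, the sketch below declares a foreign key between two resources and checks it in plain Python. The resource names, fields and rows are made up and are not the actual EUROEVOL tables; the foreignKeys layout follows my reading of the Table Schema spec and should be checked against the current version.

```python
# Hypothetical two-table package; the data is inlined so the sketch runs as-is.
descriptor = {
    "name": "example-related-tables",
    "resources": [
        {
            "name": "sites",
            "data": [
                {"site_id": "S1", "site_name": "Site A"},
                {"site_id": "S2", "site_name": "Site B"},
            ],
            "schema": {
                "fields": [
                    {"name": "site_id", "type": "string"},
                    {"name": "site_name", "type": "string"},
                ],
                "primaryKey": "site_id",
            },
        },
        {
            "name": "samples",
            "data": [
                {"sample_id": "A", "site_id": "S1"},
                {"sample_id": "B", "site_id": "S2"},
                {"sample_id": "C", "site_id": "S9"},  # deliberately dangling
            ],
            "schema": {
                "fields": [
                    {"name": "sample_id", "type": "string"},
                    {"name": "site_id", "type": "string"},
                ],
                "foreignKeys": [
                    {
                        "fields": "site_id",
                        "reference": {"resource": "sites", "fields": "site_id"},
                    }
                ],
            },
        },
    ],
}


def dangling_references(package):
    """Return (resource name, value) pairs that point at no row in the target."""
    by_name = {res["name"]: res for res in package["resources"]}
    problems = []
    for res in package["resources"]:
        for fk in res["schema"].get("foreignKeys", []):
            target = by_name[fk["reference"]["resource"]]
            known = {row[fk["reference"]["fields"]] for row in target["data"]}
            for row in res["data"]:
                if row[fk["fields"]] not in known:
                    problems.append((res["name"], row[fk["fields"]]))
    return problems


print(dangling_references(descriptor))
```

The point of the sketch is that the relationship lives in the descriptor, so any tool reading the package can find it without consulting the SQL dump.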

steko · September 10, 2016, 5:31pm

Highlighting this date for everyone.

mtl_zack · September 14, 2016, 8:25pm

Hi, I just thought that I’d throw in my two cents, since I have been looking over the documentation for around a month now and applying the specification to my dataset of archaeological obsidian.

I definitely think that the OKFN data package specifications can be tremendously useful in archaeology. It offers a great way to publish data in a decentralized manner, through GitHub, as supplementary data appended to journal articles, or presented by research labs and independent scholars through institutional repositories or independent websites. When publishing data in a relatively centralized repository such as OpenContext or tDAR, modifying or updating data might not be so easy, but they do provide the expertise that enables proper structuring, schematization, licensing, and accessibility. The data package specification may act as a sort of middle-ground between centralized publication of data, which may be costly and somewhat inflexible, and having no regard for how data are published at all. In essence, I see data packages as a great protocol for enabling and strengthening the localized publication of revise-able datasets that continue to grow and change post-publication. The extensibility of the specification to allow for specialized schemas also suits this ‘cohesive decentralization’ that tends to occur among a disciplinary specialization, which is only strengthened by the consideration of linked open data principles.

steko · September 16, 2016, 9:04pm

That is fantastic! You’re officially the first archaeologist to work with frictionless data!

May I ask how it is working for you and what path you followed (e.g. hand-crafting the datapackage.json, exporting from R, etc.)?

mtl_zack · September 16, 2016, 10:04pm

Well it is reminiscent of the LOD specifications that I’ve tried out (drawing upon the work of Sebastian Heath and Eric Kansa, in particular). I’m not yet finished, still have to construct schemas for my individual resources, and I’ve come to realize that there are much better ways to organize and relate the various resources themselves. This experience has prompted me to consider another major update of the database (strictly speaking it’s a dataset, or a ‘database-in-theory’) - I now realize that I can more optimally re-define the roles of each table and the relationships between them. I think that this is one of the most important ways in which this is valuable - it transforms a series of loosely related tables or datasets into a series of well collected and well-defined resources, rendering the entire package more reminiscent of a proper and responsibly-managed database.

Although you directed me to the R package earlier on, I found it a lot easier to construct the json file manually using the templates available in the documentation. I did the same when working with JSON-LD, which I still find quite confusing. I was able to go through the required, and then recommended keys, and skip over those that don’t really apply. I think that some sort of web interface that guides the user through the creation of the json file would be an ideal solution - it could provide options for the formats pertaining to particular data types, ensure that the document follows both the frictionless data and overall json specs, and also ensure that everything that the author hopes to schematize is complete through the utilization of a visual experience. I could envision an interface reminiscent of Mapbox Studio or Plot.ly. This fits into a broader complaint I have regarding the relative difficulty involved in accessing the semantic web, as either a contributor or a user of available data (this post written earlier this summer captures these concerns pretty bluntly).
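
The kind of check such a guided interface would run can be sketched in a few lines. This is only an illustration, not the official validator: the lists of required and recommended keys are assumptions made for the example and should be compared with the current Data Package specification.

```python
import json

# Assumed key lists for this sketch; verify them against the current spec.
REQUIRED_KEYS = ("name", "resources")
RECOMMENDED_KEYS = ("title", "description", "license")


def review_descriptor(text):
    """Check a hand-written datapackage.json for JSON syntax and missing keys."""
    try:
        descriptor = json.loads(text)
    except json.JSONDecodeError as err:
        return {"errors": ["not valid JSON: " + err.msg], "warnings": []}
    if not isinstance(descriptor, dict):
        return {"errors": ["descriptor must be a JSON object"], "warnings": []}
    errors = [
        "missing required key: " + key
        for key in REQUIRED_KEYS
        if key not in descriptor
    ]
    warnings = [
        "missing recommended key: " + key
        for key in RECOMMENDED_KEYS
        if key not in descriptor
    ]
    return {"errors": errors, "warnings": warnings}


# Hypothetical draft of a hand-crafted descriptor.
draft = '{"name": "example-obsidian", "resources": []}'
report = review_descriptor(draft)
print(report["errors"])
print(report["warnings"])
```

Separating errors from warnings mirrors the manual process described above: go through the required keys first, then the recommended ones, and skip whatever does not apply.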

EDIT: I’ve committed an early draft, check it out on GitHub.

steko · September 21, 2016, 7:18pm

Today we had the first hangout. I’m sharing here below my notes from the call, that I think was very productive, especially with the participation of @mtl_zack. All from @wg_openarchaeology are welcome to comment!

Participants

Dan Fowler, Zack Batist, Stefano Costa

General discussion

A challenge for Frictionless Data is to overcome or resolve the overlap between frictionless data and linked open data (keeping in mind that they are difficult to make mainstream, at least in the field of archaeology).
We agreed that a strong quality of Data Packages is being a middle-ground solution, that lives in a spectrum of possible representations of data and therefore allows for flexibility at the personal level (unlike centralized repositories) without compromising the aspect of interoperability (unlike spaghetti Excel files).

Data Packages and archaeological databases

There are several different kinds of databases. The excavation database is very common but highly unstable and not very usable once it becomes unmaintained. Most of these databases have rather complex structures, so it’s not the best use case to start from.
Looking at common workflows, especially involving spatial data, it seems clear that a frequent option is to do the equivalent of a JOIN between spatial and non-spatial “tables”, and create a new “table” that is used in GIS or statistical packages, separated from the source dataset. Is there any room for documenting Data Packages as an alternative to this inefficient workflow? The role of GeoJSON files is not entirely clear.

How to push for Frictionless Data in archaeology?

We all agree that the first recipients must be journal editors, principal investigators and managers of data repositories.
In particular, data repositories could play a key role if we can convince them that Data Packages make it easier for them to accept data and that their role could be that of allowing for import, export and validation of Data Packages.
Clearly big journals in the field of archaeological science and archaeometry could make a difference in the adoption of Data Packages for the dissemination of supplementary data, regardless of their aversion to Open Access.

Moving forward

We agreed that it is too early to start evangelizing but there needs to be a wider participation in the next calls for a strategy to become effective.
There will be a hangout/call in the week of 12th October, where practical issues can be discussed with regard to the proposed case studies (EUROEVOL, DObsiSS) and their Data Package version. In the meantime, we will spread the word via listservs and other venues of communication.

danfowler · September 22, 2016, 9:50pm

Nice summary @steko ! In the meantime, I will most likely create a new repository for the euroevol dataset (something like frictionlessdata/pilot-archaeology-euroevol), add you to the list of contributors, and create some exploratory issues. We can try to find a middle-ground dataset as well.

By the way, I demonstrated a fairly naive join between tabular and geospatial data using the Data Package libraries here: http://frictionlessdata.io/guides/joining-data-in-python/ . If this is indeed a fairly common use case in archaeology, we can have our third dataset include a mix of geospatial and tabular data and package them together in a standard way.
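
For readers who have not seen this kind of join, here is a minimal standard-library sketch of the general idea. It is not the code from the guide linked above; the site_id key, the coordinates and the counts are invented for illustration.

```python
import csv
import io
import json

# Hypothetical spatial resource: three point features keyed by site_id.
GEOJSON_TEXT = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"site_id": "S1"},
     "geometry": {"type": "Point", "coordinates": [12.5, 41.9]}},
    {"type": "Feature", "properties": {"site_id": "S2"},
     "geometry": {"type": "Point", "coordinates": [9.2, 45.5]}},
    {"type": "Feature", "properties": {"site_id": "S3"},
     "geometry": {"type": "Point", "coordinates": [14.3, 40.8]}}
  ]
}
"""

# Hypothetical tabular resource sharing the same key.
CSV_TEXT = "site_id,sample_count\nS1,14\nS2,3\n"


def join_table_to_features(csv_text, geojson_text, key="site_id"):
    """Copy the columns of matching CSV rows into each feature's properties."""
    table = {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}
    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        match = table.get(feature["properties"][key])
        if match is not None:
            extra = {k: v for k, v in match.items() if k != key}
            feature["properties"].update(extra)
    return collection


joined = join_table_to_features(CSV_TEXT, GEOJSON_TEXT)
for feature in joined["features"]:
    print(feature["properties"])
```

Values copied from the CSV stay as strings in this sketch; a schema in the package descriptor is what would tell a tool to cast sample_count to an integer. Features without a matching row (S3 here) are left unchanged.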
