Does archaeology have a reproducibility crisis?

bmarwick · January 3, 2018, 4:32pm

The first ever archaeology journal article that is fully reproducible (and isn’t written by me)! This entire PLOS One paper by Shannon McPherron is beautifully written in R Markdown: Additional statistical and graphical methods for analyzing site formation processes using artifact orientations

I could successfully reproduce the author’s results. Here are my notes on the attempt:

After downloading the files from Additional statistical and graphical methods for analyzing site formation processes using artifact orientations, we see two problems.

First, the plos-one.csl file is not included. We can easily get it from styles/plos.csl at master · citation-style-language/styles · GitHub, but it’s a minor irritation to wait for the document to knit and then have it stop with an error that this file wasn’t found.
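One way to avoid waiting for the knit only to hit this error is to check for the needed files up front. This is a minimal sketch, not something from the paper: check_required_files() is a hypothetical helper, and the file names in the usage comment are the ones mentioned in this post.

```r
# Hypothetical helper: stop early if any file the Rmd needs is missing,
# instead of finding out part-way through a long knit.
check_required_files <- function(files, dir = ".") {
  paths <- file.path(dir, files)
  missing <- files[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing required file(s): ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Usage, before knitting the Rmd:
# check_required_files(c("plos-one.csl", "Lenoble_and_Bertran_2004.RDS"))
```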

Second is that all files have almost the same name, something like pone.0190195.sNNN.xxx. So the file names mentioned in the article and within the files are not what we have in the download. This requires cross-checking between the file name in the supplement list, and the file description in the Rmd file.

This is a pain because, for example, the note at the top of the Rmd mentions ‘Lenoble_and_Bertran_data.RDS’, which corresponds to ‘S3 File. Lenoble and Bertran (2004) comparative data set. https://doi.org/10.1371/journal.pone.0190195.s003’ in the list of Supporting information in the article, and the downloaded file is called pone.0190195.s003.RDS.

However, later in the Rmd we see a line of code that reads readRDS('Lenoble_and_Bertran_2004.RDS'). This means that the filename we saw in the comment at the top of the Rmd is not correct, and we have to change it (or the code) to make it work.
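Once the cross-checking is done, the renaming itself can be scripted. The sketch below is an illustration under assumptions: restore_file_names() is a hypothetical helper, and only the S3 mapping comes from this post. The other supplementary files would need to be added to the map after checking them by hand.

```r
# Map journal-assigned names to the names the Rmd code expects.
# Only this one mapping is taken from the post; extend as needed.
name_map <- c(
  "pone.0190195.s003.RDS" = "Lenoble_and_Bertran_2004.RDS"
)

# Hypothetical helper: copy each downloaded file to its expected name.
restore_file_names <- function(name_map, dir = ".") {
  from <- file.path(dir, names(name_map))
  to <- file.path(dir, unname(name_map))
  present <- file.exists(from)
  # Copy rather than rename, so the original downloads are kept.
  copied <- file.copy(from[present], to[present], overwrite = FALSE)
  stats::setNames(copied, basename(to[present]))
}

# Usage, from the directory holding the downloaded supplement:
# restore_file_names(name_map)
```

Copying keeps the journal's files untouched, so the mapping can be re-run or corrected later.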

The problem is that the journal changes the filenames after they are submitted with the article. If the files were deposited by the author at a trustworthy repository, such as osf.io, figshare.com, zenodo.org, etc., then the file names would be unaltered, and we wouldn’t have this bother trying to match up files with the names used for them in the code.

So, don’t submit code and data to the journal as supplementary files. The journal messes with your files and makes them harder for others to reuse. Instead, keep control over your work: deposit your materials at a trustworthy repository, and link to those files with a DOI in the text of your article. We’ve written about some of the other reasons why this is a good idea, and what some good repositories are, in https://osf.io/preprints/socarxiv/py4hz/

There is no list of the packages that the code depends on. We have to search the document for require() or library() to see what is required, and install as needed. One solution to this problem is to list all the libraries together at the start of the Rmd. Another, better, solution is to organise this bundle of files as an R package, and then the DESCRIPTION file can serve as a manifest that lists all the dependent packages in one place. We’ve written about this approach in detail in Packaging data analytical work reproducibly using R (and friends) [PeerJ Preprints]
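The search for require() and library() calls can be automated. This is a rough sketch using base R regular expressions. find_packages() is a hypothetical helper and "analysis.Rmd" in the usage comment is a placeholder file name. It will miss packages used only via pkg::fun() and does not ignore commented-out code.

```r
# Hypothetical helper: list package names that appear in library() or
# require() calls within a character vector of source lines.
find_packages <- function(lines) {
  pattern <- "\\b(?:library|require)\\(\\s*[\"']?[A-Za-z][A-Za-z0-9.]*"
  calls <- unlist(regmatches(lines, gregexpr(pattern, lines, perl = TRUE)))
  # Strip the call prefix, leaving only the package name.
  pkgs <- sub("^(?:library|require)\\(\\s*[\"']?", "", calls, perl = TRUE)
  sort(unique(pkgs))
}

# Usage on a file:
# find_packages(readLines("analysis.Rmd"))
```

The resulting names are what would go in the Imports field of a DESCRIPTION file if the files were organised as an R package.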

The excellent citr package is loaded by the Rmd, but it does not need to be, since it is only used interactively while writing to add citations to the document. The reshape package used in the Rmd has been superseded by the tidyr package.

The images do not appear in the PDF, Word or HTML output. They are only generated as output files in the working directory, not embedded in the rendered document. This could be improved by using knitr::include_graphics() in code chunks in the Rmd file (one per figure) to print the images, with captions, in the rendered document.
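A chunk along these lines, one per figure, would embed the image. This is a sketch, not the author's code: embed_figure() is a hypothetical wrapper, and the chunk label, caption, and "figure_1.png" are placeholders for whatever the analysis actually writes out.

```r
# Goes inside an Rmd chunk whose header carries the caption, for example:
#   {r fig-one, echo=FALSE, fig.cap="Caption text here", out.width="100%"}

# Hypothetical wrapper: fail with a clear message if the image was never
# written, then hand the path to knitr to embed in the rendered output.
embed_figure <- function(path) {
  stopifnot(file.exists(path))
  knitr::include_graphics(path)
}

# In the chunk body:
# embed_figure("figure_1.png")
```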

Running the code and knitting the Rmd file takes more than a few minutes. I added caching so I didn’t have to wait each time while troubleshooting the libraries and file names.
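For anyone wanting to do the same, knitr's cache chunk option is one way to get this. The sketch below assumes it is placed in a setup chunk near the top of the Rmd. The requireNamespace() guard is only there so the snippet also runs outside a knitr session.

```r
# Assumed contents of a setup chunk at the top of the Rmd.
if (requireNamespace("knitr", quietly = TRUE)) {
  # Cache the results of every chunk; a chunk is re-run only when its
  # code or chunk options change.
  knitr::opts_chunk$set(cache = TRUE)
}
```

Note that the cache does not notice changes to external data files, so delete the cache directory after swapping or renaming inputs.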

There is no version control or information about the author’s computational environment (i.e. package version numbers). I’ve written about the importance of these details in https://link.springer.com/article/10.1007/s10816-015-9272-9
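A low-effort way to capture package version numbers is to save the output of sessionInfo() alongside the results. This is a minimal sketch: record_session() and the default file name are hypothetical.

```r
# Hypothetical helper: write the R version, platform, and loaded package
# versions to a text file that can be deposited with the code and data.
record_session <- function(path = "session_info.txt") {
  info <- utils::capture.output(utils::sessionInfo())
  writeLines(info, path)
  invisible(info)
}

# Usage, at the end of the Rmd after all packages are loaded:
# record_session()
```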

To conclude, this is a fantastic accomplishment by McPherron, and a great step forward for archaeological science. Although there’s room for improvement (the same is true for my papers!), I hope that McPherron’s paper inspires others to adopt this open science approach in their own work.
