Does archaeology have a reproducibility crisis?


#1

I came across this quite clear article and I thought this group might be interested.

The reproducibility crisis! It’s shaking the very foundations of the
ivory tower. Reportedly the psychology wing is already in rubble.
Medical researchers are having to glue their microscopes to their
benches. But down in our dusty corner of the basement, the
archaeologists don’t appear to have even noticed. Why not?

http://www.joeroe.eu/blog/2016/08/27/does-archaeology-have-a-reproducibility-crisis/

@steko


#2

My recent article in JAMT was cited in that interesting post. The full details of my article are here:

Marwick, B., 2016. Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation. J Archaeol Method Theory, 1-27.

and in bibtex:

@Article{Marwick2016,
author="Marwick, Ben",
title="Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation",
journal="Journal of Archaeological Method and Theory",
year="2016",
pages="1--27",
abstract="The use of computers and complex software is pervasive in archaeology, yet their role in the analytical pipeline is rarely exposed for other researchers to inspect or reuse. This limits the progress of archaeology because researchers cannot easily reproduce each other’s work to verify or extend it. Four general principles of reproducible research that have emerged in other fields are presented. An archaeological case study is described that shows how each principle can be implemented using freely available software. The costs and benefits of implementing reproducible research are assessed. The primary benefit, of sharing data in particular, is increased impact via an increased number of citations. The primary cost is the additional time required to enhance reproducibility, although the exact amount is difficult to quantify.",
issn="1573-7764",
doi="10.1007/s10816-015-9272-9",
url="http://dx.doi.org/10.1007/s10816-015-9272-9"
}

Here’s the publisher’s copy ($):

Here’s the open access copy on the GitHub repo:

Feedback most welcome!


#3

Ben, thank you for sharing your work.

I have tried to follow a similar procedure, if less systematic, in some of my work. I wonder if Docker/VMs are really that necessary in your opinion? These tools, while useful, seem to make things more difficult - after all even vanilla R and Markdown are not straightforward for many, and I’m not entirely convinced a Docker image will be so useful 30 years from now. On the other hand, source code remains extremely readable, and in many cases it can be run as is, at least if not using external libraries.

However, as you wrote in your paper, reproducibility does come with an additional cost, so it could just be that this increased complexity is part (or all) of that cost.


#4

Hi all, I’m glad my post was of some interest and thanks to @steko for letting me know about this forum.

What struck me about the finding that only 39% of psychology studies are replicable was a gut instinct that the figure for archaeology would be much lower (in terms of reproducibility, “replicability” being mostly out of our reach). @bmarwick you mention in your article that basic things like including data are rare, but I wonder just what percentage of recent archaeology papers would be “low” or “not reproducible” according to your scale. Does this group think that there would be any use in trying to estimate that figure, to quantify and raise awareness of the extent of the problem?


#5

this is great, and kudos for ‘reproducing’ and sharing on github. that is optimal! not as optimal, but always a great option: sci-hub. i use it weekly, here’s url of the pdf via sci-hub, just to show its power http://link.springer.com.sci-hub.cc/article/10.1007/s10816-015-9272-9


#6

Thank you for joining the discussion, and for starting it in the first place :triangular_flag_on_post:

To answer your question: yes, it would be extremely useful, particularly now that the pressure towards data sharing as part of “standard” publication processes is increasing:

My first thought is to focus on a narrow field of studies, preferably where there is an established tradition of supplementary material published along with the main paper, e.g. archaeological science (rather broad definition, I know), bearing in mind that the psychology figure of 39% comes from a study where they

conducted replications of 100 experimental and correlational studies published in three psychology journals


#7

Hi Stefano,

Thanks for your comments! Yes, I think that becoming more scientific will initially involve doing more difficult things. But after it becomes familiar, then it becomes easier. Just recently Docker became a lot easier to use because a Windows-native version was released. To me, this seems like the normal rhythm of the history of science, so I’m optimistic!

I agree with you that tooling will change over time, perhaps none of our current tools will be useful in 30 years. I’m not sure how long into the future my work will be reproducible, that’s an interesting question. I figure five years is a reasonable expectation at the moment. But I’m not too bothered about that - my paper is focused on four tool-agnostic principles, in order of priority:

  1. Archive and share data and code
  2. Use scripted analyses
  3. Use version control
  4. Recreate computational environments

We might imagine that archaeological research in general might be 80% reproducible if the first principle is fulfilled universally. If everyone does 1 and 2, then we might get to 90%, and if they all do 1, 2 and 3, we’d get to 95% reproducible, and everyone doing all four principles might get us to 99%. Like this:

| How many principles? | How reproducible is archaeological research? |
|----------------------|----------------------------------------------|
| <1                   | 1%                                           |
| 1                    | 80%                                          |
| 1, 2                 | 90%                                          |
| 1, 2, 3              | 95%                                          |
| 1, 2, 3, 4           | 99%                                          |

So, I feel most strongly about getting everyone to archive and share data and code, and use scripted analyses, if they are not already. The other two principles can come much later, I don’t mind about that. My personal goal is to do the best I can with these principles, given the tools available to me now, and my limited skills and time. I’ve found a few times that having a Docker container has saved me a few days of work. And when new and better tools come along, I will probably transition to those, and bring with me the experience I have so far (I hope this experience will be transferable!).


#8

Hi Joe,

Yes, I think archaeology is similar to or lower than psychology in its rate of reproducibility. Generally I think there is strong interest and support in archaeology for empirical reproducibility, like returning to dig a site 100 years after the previous excavation, or re-examining a famous assemblage in a museum. But computational and statistical reproducibility are not at all discussed, and I think this is a huge problem for archaeology. It makes it difficult for us to trust the work of other researchers, and to adapt and reuse the methods that they develop. It’s an interesting question about the empirical reproducibility of archaeological experiments. I’m not sure that we really depend heavily on experiments to generate knowledge in the same way that biomedical research does. We’re more dependent on observations, like palaeontology, geology, astronomy, etc. So it hardly matters whether or not we can reproduce the results of experiments, such as stone knapping experiments; probably many can’t be reproduced because they’re so dependent on specific conditions. But that’s just a guess, what do you think?

I’ve been keeping a list of papers that include R code here: https://github.com/benmarwick/ctv-archaeology. Most of these should be reproducible, and I’ve reproduced some of them myself, but not attempted all of them (I only try those that interest me!).

Yes, I think there would be a lot of interest in a paper that explored how reproducible archaeology is at the moment. We could sample a few hundred papers in select journals (J. Arch. Sci. would be an excellent candidate) over the last few years, and look for the availability of raw data, sufficient details of statistical methods, code, etc. There are a few similar studies in other fields that we could look at for inspiration. Even an analysis of whether or not data are available would be very interesting. We can combine the results with citation counts to see if there is a relationship between reproducibility and impact. Definitely worth quantifying, good idea!


#9

I think part of the problem is that (as usual!) archaeologists have arrived at issues like data sharing via an idiosyncratic route. Repositories like ADS were probably ahead of the curve, being founded in the 90s, but they grew out of concerns about grey literature and the long-term preservation of excavation archives, rather than reproducibility per se. My impression is that they’re very good for huge volumes of data, i.e. whole excavation reports, but less useful for the little chunks needed for reproducing individual studies. And of course, there’s very little sharing of code, and an underdeveloped tradition of methodological reproducibility in general.

I was probably thinking of an overly broad definition of “experiment”, but what I was getting at is that things like radiocarbon dating, artefact/faunal analysis, biomolecular techniques, etc., all represent empirical observations that can and should be replicated.

I thought JAS would be a good bet too. If there’s some interest here in exploring that project, I can go ahead and compile a list of recent papers. @bmarwick you mentioned similar studies in other fields, could I trouble you for references?


#10

Hi Joe,

I cite a bunch of very interesting studies of data sharing in various disciplines in the second paragraph of the discussion section of my paper above.

For reproducibility proper, there is

@article{BegleyEllis2012,
author = {Begley, C. Glenn and Ellis, Lee M.},
title = {Drug development: Raise standards for preclinical cancer research},
journal = {Nature},
volume = {483},
number = {7391},
pages = {531-533},
note = {10.1038/483531a},
ISSN = {0028-0836},
url = {http://dx.doi.org/10.1038/483531a},
year = {2012},
type = {Journal Article}
}

And I think you know this one already:

@article {aac4716,
author = {{Open Science Collaboration}},
title = {Estimating the reproducibility of psychological science},
volume = {349},
number = {6251},
year = {2015},
doi = {10.1126/science.aac4716},
publisher = {American Association for the Advancement of Science},
URL = {http://science.sciencemag.org/content/349/6251/aac4716},
eprint = {http://science.sciencemag.org/content/349/6251/aac4716.full.pdf},
journal = {Science}
}

In archaeology, I think a study of data sharing would be a great way to raise the visibility of the general issue of reproducibility. I’m thinking of something that could be semi-automated, like writing code for scraping the full text of 1000s of archaeology articles on sciencedirect.com for key words that relate to data sharing. Then we can model relationships with other variables like year of publication, number of co-authors, number of citations, etc. What do you think?
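A minimal sketch of the keyword step in Python, assuming the article full texts have already been obtained as plain strings (the scraping itself, and any publisher access rules, are left out). The phrase list and the function names `data_sharing_hits` and `summarise` are illustrative assumptions, not a validated vocabulary.

```python
import re

# Hypothetical phrases that might signal data sharing; a real study would
# need to validate and extend this list against hand-coded articles.
DATA_SHARING_PATTERNS = [
    r"data (?:are|is) available",
    r"supplementary (?:data|material|information)",
    r"\bdoi\.org/",
    r"\bgithub\.com/",
    r"\b(?:zenodo|figshare|osf\.io)\b",
]


def data_sharing_hits(text):
    """Return the patterns that match anywhere in the article text."""
    return [
        pattern
        for pattern in DATA_SHARING_PATTERNS
        if re.search(pattern, text, flags=re.IGNORECASE)
    ]


def summarise(articles):
    """Map each article id to the number of distinct patterns it matches."""
    return {
        article_id: len(data_sharing_hits(text))
        for article_id, text in articles.items()
    }


if __name__ == "__main__":
    # Two made-up snippets, for illustration only.
    demo = {
        "a": "The raw data are available on Zenodo.",
        "b": "We excavated three trenches and describe the lithic assemblage.",
    }
    print(summarise(demo))
```

The per-article counts could then be joined to year of publication, number of co-authors and citation counts for modelling. A keyword match only suggests that sharing is mentioned, so a hand-checked sample would still be needed to estimate how often the data are actually retrievable.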


#11

Hi Ben - regarding keywords etc, I have a topic model here trained on 20 000 articles (ish) from JSTOR’s DFR, anglophone archaeology journals. Topics 45 and 55 seem closest to a ‘data’ topic, so that might be a handy place to start: http://graeworks.net/digitalarchae/20000/#/topic/45


#12

What I am not seeing mentioned in this discussion is the reproducible results of general artifact analysis, be it prehistoric ceramic, lithic, historic, faunal, botanical, etc. After training 7 people in lithic analysis many years ago, we found that in a given artifact sample those analysts averaged, if I remember correctly, about a 92 to 95 percent reproducibility rate. Trainees in historic artifact analysis and prehistoric ceramic analysis were lower. Even a lone analyst can revisit a group of artifacts a month later and make a few different choices than he/she did initially.
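For anyone wanting to put a number on this kind of inter-analyst consistency, here is a rough Python sketch of simple percent agreement, averaged over every pair of analysts. It assumes each analyst has classified the same artefacts in the same order; the function names and the example categories are made up for illustration, and this is not necessarily how the figure quoted above was calculated.

```python
from itertools import combinations


def percent_agreement(a, b):
    """Share of artefacts that two analysts placed in the same category."""
    if len(a) != len(b) or not a:
        raise ValueError("both analysts must classify the same non-empty set of artefacts")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(classifications):
    """Average percent agreement over every pair of analysts."""
    pairs = list(combinations(classifications.values(), 2))
    if not pairs:
        raise ValueError("need at least two analysts")
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up classifications of five artefacts by three analysts.
    demo = {
        "analyst_1": ["flake", "flake", "core", "tool", "flake"],
        "analyst_2": ["flake", "flake", "core", "flake", "flake"],
        "analyst_3": ["flake", "core", "core", "tool", "flake"],
    }
    print(round(mean_pairwise_agreement(demo), 3))
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a common alternative when categories are very unevenly used.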

Our discipline is one that contains unavoidable subjectivity. Even the greats like Phillip Phillips, James Ford, and James Griffin argued about how to categorize certain Lower Mississippi Valley ceramics and often compromised so they could get on with the analysis. In fact, Phillips went back years later and “corrected” some of those compromises. Concise transmission of methodologies, including statistical formulas and program/app parameters, should be standard practice today and, if not, then the fault lies in the researcher/analyst. However, while “the data” as presented needs to be reproducible, I would be more concerned with the underlying foundation of the data, the artifact analysis itself.


#13

Hi James,

Yes, this is a relevant concern, I agree. In the wider literature this issue is called ‘empirical reproducibility’, a term first coined by Victoria Stodden, who has written widely on the subject, for example at https://www.edge.org/response-detail/25340.

However, I think there is pretty good awareness of the importance of empirical reproducibility in archaeology, as indicated by the ongoing threads of this work in lithic and faunal studies, especially use-wear and residue analysis, for example:

Breslawski, R. P., & Byers, D. A. (2015). Assessing measurement error in paleozoological osteometrics with bison remains. Journal of Archaeological Science, 53, 235-242.

Proffitt, T., & de la Torre, I. (2014). The effect of raw material on inter-analyst variation and analyst accuracy for lithic analysis: a case study from Olduvai Gorge. Journal of Archaeological Science, 45, 270-283.

Sadr, K. (2015). The Impact of Coder Reliability on Reconstructing Archaeological Settlement Patterns from Satellite Imagery: a Case Study from South Africa. Archaeological Prospection.

Lyman, R. L., & VanPool, T. L. (2009). Metric data in archaeology: a study of intra-analyst and inter-analyst variation. American Antiquity, 485-504.

Rots, V., Pirnay, L., Pirson, P., & Baudoux, O. (2006). Blind tests shed light on possibilities and limitations for identifying stone tool prehension and hafting. Journal of Archaeological Science, 33(7), 935-952.

Blumenschine, R. J., Marean, C. W., & Capaldo, S. D. (1996). Blind tests of inter-analyst correspondence and accuracy in the identification of cut marks, percussion marks, and carnivore tooth marks on bone surfaces. Journal of Archaeological Science, 23(4), 493-507.

If we compare the level of concern for empirical reproducibility with the level of concern for computational and statistical reproducibility (as indicated by publications), I think we’re doing better at empirical reproducibility than at the other two.


#14

Thank you! It seems we could benefit from the distinction between reproducibility and replicability that was made by @joeroe above: artefact analysis (from naked-eye classification to all kinds of instrumental, micro-destructive analysis) is certainly replicable, at least in my experience as a ceramic specialist, in that anyone can go back and, as you wrote, “correct” earlier categorizations (this is also extremely time-consuming and generally results in marginal improvements, but that doesn’t matter here). This is replicability: when the same analysis (e.g. categorization) can be performed again, giving possibly different, hopefully better results.

Reproducibility as I understand it is something else: it means that I should be able to re-play your analytical process as-is, obtaining the same results. Therefore computational analysis lends itself very well, since it can be formalized as explicit source code.

“Concise transmission of methodologies, including statistical formulas and program/app parameters, should be standard practice today and, if not, then the fault lies in the researcher/analyst.”

While we all agree that it should be standard practice, as in points 1+2 described by @bmarwick above, my experience tells me that for the majority of published research I deal with, it is not standard practice at all.


#15

Steko,

While certain studies can be replicated, such as in ceramic paste composition and spectrographic analysis, for the most part our empirical observations can never be 100 percent replicable. Additionally, in this fast-paced environment of new discoveries almost daily, many times our analyses are almost passé and have to be revisited prior to publication.

I agree that in many archaeological reports the researchers almost act like standard testing and analytical regimes are proprietary information. Best practice in journal papers and reports, though, is to provide a link to both data and detailed methodology.


#16

@steko, I like your take on the two rep-x-ability terms, there’s been a lot of discussion about those recently in several disciplines.

There was a major article published recently in Science Translational Medicine: “What does research reproducibility mean?” They present reproducibility as “the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results”.

This is distinct from replicability: “which refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.”

They further define some new terms: methods reproducibility, results reproducibility, and inferential reproducibility (I didn’t find these of great relevance).

But it’s worth noting that the definitions in this paper, which are also consistent with a linguistic analysis at the renowned Language Log blog, are totally opposite to those of the ACM, which takes its definitions from the International Vocabulary of Metrology. Here are the ACM definitions:

Reproducibility (Different team, different experimental setup)
The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.

Replicability (Different team, same experimental setup)
The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author’s own artifacts.

These differences in definitions have also been noted by some recent Nature News articles, “Muddled meanings hamper efforts to fix reproducibility crisis” and “1,500 scientists lift the lid on reproducibility”. These report on the general problem of a lack of a common definition of reproducibility, despite a widespread recognition that it’s a problem. So those are helpful to establish and recognise that there is a range of definitions, and situate this discussion more precisely in the existing literature on the topic.

My JAMT paper includes these definitions for archaeologists, which are aligned with how I see the terms used in other social sciences (and not the ACM):

Reproducibility: A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers and data visualizations in a published paper from raw data. Reproducibility does not require independent data collection and instead uses the methods and data collected by the original investigator. https://osf.io/s9tya/

Replicability: A study is replicated when another researcher independently implements the same methods of data collection and analysis with a new data set. http://languagelog.ldc.upenn.edu/nll/?p=21956

The crux of the difference is in ‘independent methods’ and ‘new data set’. So I’d say that two people separately measuring the same lithic assemblage in basically the same way (caliper measurements, etc.) are doing ‘empirical reproducibility’. Of course there will be some difference in their results, but probably not much. A replication study would be if a third researcher went back to the site, collected a new sample of stone artefacts (‘new data set’), and measured them with a 3D scanner (‘independent methods’). This third researcher would have quite different results because of their new data and different methods, but they’d be well-placed to test the substantive anthropological claims that the previous researchers made about the assemblage.

But that’s really getting into the weeds, and as I noted earlier in the thread, archaeologists generally value these kinds of studies and we see people doing their PhD on them and publishing papers. Accessing museum collections for new or re-analysis is routine.

What we’re missing is all the magic between data collection and publication. We have no culture of rep-x-ability for archaeological data analysis, normally it’s a private, even secret, process. Archaeologists rarely share any of the analysis pipeline, and I think this is bad for the discipline, and bad for science in general. New methods often appear in the literature, but any serious user has to reverse engineer the journal article in order to use the new methods for themselves. This is a huge obstacle to innovation and sharing of new methods.

In my perfect world, every archaeology paper would be accompanied by a code & data repository that allows the reader to reproduce the results presented in the paper. This means that when I see a cool plot or useful computation in a paper, I can easily get the code, study the author’s methods in detail, adapt it for my own data, cite their paper, build on it, teach it to my students, and so on. Maybe some day… :rainbow: :slight_smile:

My current template for achieving this in my own work is here: https://github.com/benmarwick/researchcompendium which has a bunch of robust open source software engineering tools working for me to save time and catch errors.


#17

I see that this thread is six months old, but wanted to let you all know that our project (http://www.faims.edu.au/), which has been working on digital archaeology infrastructure for several years, is preparing to organise a reproducibility project for archaeology. I’m working on definitions and approaches now (to be published in an edited volume that Josh Wells at Indiana University South Bend is putting together), and our next publication after that (led by a philosopher of science who is Technical Director for our project) will be a small reproducibility pilot project looking at a limited number of publications (we also decided to start with JAS). We’ve got some resources to throw at it. We will be contacting some of you individually (Ben, Joe) because we’re citing your work in our preliminary studies, but I also wanted to post the general invitation here.


#18

This sounds very exciting! Can’t wait to hear more, I’m intrigued.


#19

A few updates relevant to this topic:

I believe we are also seeing more archaeology journal articles using R and sharing R scripts with the publication, and depositing datasets in trustworthy repositories. Those are two important steps towards improving reproducibility. This increase is just my informal impression, but we’re working on ways to measure this more objectively! Recommending sharing of scripts and data in peer reviews seems to be making a difference.


#20

First ever archaeology journal article that is fully reproducible! (and isn’t written by me) This entire PLOS One paper by Shannon McPherron is beautifully written in R markdown: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190195

I could successfully reproduce the author’s results. Here are my notes on the attempt:

After downloading the files from http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190195, we see two problems.

First, the plos-one.csl file is not included. We can easily get it from https://github.com/citation-style-language/styles/blob/master/plos.csl, but it’s a minor irritation to wait to knit the document and then have it stop with an error that this file wasn’t found.

Second is that all files have almost the same name, something like pone.0190195.sNNN.xxx. So the file names mentioned in the article and within the files are not what we have in the download. This requires cross-checking between the file name in the supplement list, and the file description in the Rmd file.

This is a pain because, for example, the note at the top of the Rmd mentions ‘Lenoble_and_Bertran_data.RDS’, which corresponds to ‘S3 File. Lenoble and Bertran (2004) comparative data set. https://doi.org/10.1371/journal.pone.0190195.s003’ in the list of Supporting information in the article, and the downloaded file is called pone.0190195.s003.RDS.

However, later in the Rmd we see a line of code that reads readRDS('Lenoble_and_Bertran_2004.RDS'). This means that the filename we saw in the comment at the top of the Rmd is not correct, and we have to change it (or the code) to make it work.

The problem is that the journal changed the filenames after they were submitted with the article. If the files were deposited by the author at a trustworthy repository, such as osf.io, figshare.com, zenodo.org, etc., then the file names would be unaltered, and we wouldn’t have this bother trying to match up files with the names used for them in the code.
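As a stopgap for anyone else attempting this, here is a small Python sketch that copies the journal-renamed downloads back to the names the code expects. Only the one pairing noted above is filled in; the `NAME_MAP` and `restore_names` names are my own, and the other supplementary files would need to be added to the mapping by hand after cross-checking the supplement list.

```python
import shutil
from pathlib import Path

# Journal-assigned name -> name the Rmd code reads. Only this one pair is
# taken from the notes above; other supplements need adding by hand.
NAME_MAP = {
    "pone.0190195.s003.RDS": "Lenoble_and_Bertran_2004.RDS",
}


def restore_names(download_dir, name_map=NAME_MAP):
    """Copy each downloaded supplement to the file name used in the code.

    Copies rather than renames, so the original download stays untouched.
    Returns the list of files created.
    """
    download_dir = Path(download_dir)
    created = []
    for journal_name, expected_name in name_map.items():
        source = download_dir / journal_name
        if not source.exists():
            continue  # skip supplements that were not downloaded
        target = download_dir / expected_name
        shutil.copyfile(source, target)
        created.append(target)
    return created


if __name__ == "__main__":
    # Assumes the supplements were downloaded into the current directory.
    for path in restore_names("."):
        print("created", path)
```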

So, don’t submit code and data to the journal as supplementary files. The journal messes with your files and makes them harder for others to reuse. Instead, keep control over your work and deposit your materials at a trustworthy repository, and link to those files with a DOI in the text of your article. We’ve written about some of the other reasons why this is a good idea, and what some good repositories are in saa.org/Portals/0/SAA_Record_Sept_2017_Final_LR.pdf#page=10 and https://osf.io/preprints/socarxiv/py4hz/

There is no list of the packages that the code depends on. We have to search the document for require() or library() to see what is required, and install as needed. One solution to this problem is to list all the libraries together at the start of the Rmd. Another, better, solution is to organise this bundle of files as an R package, and then the DESCRIPTION file can serve as a manifest that lists all the dependent packages in one place. We’ve written about this approach in detail in https://peerj.com/preprints/3192/
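In the meantime, the search for library() and require() calls can be automated. This is a rough Python sketch that scans the text of an Rmd file with a regular expression; the function name `packages_used` is my own, and the pattern is a simplification that will miss packages used only via `pkg::function()` or loaded in other ways.

```python
import re

# Matches library(pkg) and require(pkg), with optional quotes around the name.
PACKAGE_CALL = re.compile(
    r"""\b(?:library|require)\(\s*["']?([A-Za-z][A-Za-z0-9.]*)["']?"""
)


def packages_used(rmd_text):
    """Return the sorted, de-duplicated package names loaded in the text."""
    return sorted(set(PACKAGE_CALL.findall(rmd_text)))


if __name__ == "__main__":
    # A made-up fragment; in practice read the text of the downloaded Rmd file.
    demo = "library(knitr)\nrequire('reshape')\nlibrary(citr)\nlibrary(knitr)\n"
    print(packages_used(demo))
```

The resulting list is a starting point for installing what is missing, and could also seed the dependency fields of a DESCRIPTION file if the files are later reorganised as an R package.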

The excellent citr package is loaded by the Rmd, but it does not need to be, since it is only used interactively while writing to add citations to the document. The reshape package used in the Rmd has been superseded by the tidyr package.

The images do not appear in the PDF, Word or HTML output. They are only generated as output files in the working directory, not embedded in the rendered document. This could be improved by using knitr::include_graphics() in code chunks in the Rmd file (one per figure) to print the images, with captions, in the rendered document.

The time to run the code and knit the Rmd file is more than a few minutes. I added caching so I didn’t have to wait so much each time while I was troubleshooting the libraries and file names.

There is no version control or information about the author’s computational environment (i.e. package version numbers). I’ve written about the importance of these details in https://link.springer.com/article/10.1007/s10816-015-9272-9

To conclude, this is a fantastic accomplishment by McPherron, and a great step forward for archaeological science. Although there’s room for improvement (the same is true for my papers!), I hope that McPherron’s paper inspires others to adopt this open science approach in their own work.