Chapter 3 Replication

In the previous chapters, we introduced experiments, their connection with causal inference, and their role in building psychological theory. In principle, repeated experimental work combined with theory building should yield strong research programs that explain and predict phenomena with increasing scope.

Yet in the last ten years there has been an increasing recognition that this idealized view of science might not be a good description of what we actually see when we look at the psychology literature. Many classic findings may be wrong, or at least overstated. Their statistical tests might not be trustworthy. The actual numbers are even wrong in many papers! And even when experimental findings are “real”, they may not reflect deep psychological generalizations. (And even if they do, they likely don’t reflect generalizations that are true about people in general, only some very specific groups of people. We’ll get to that part later in the book.)

How do we know that all this bad stuff is true? Claims about a literature or field as a whole go beyond the kind of standard paradigmatic science that we were talking about in the previous chapter – instead they are part of a new field called meta-science. Meta-science research is research about research, for example investigating how often findings in a literature can be successfully built on, or trying to figure out how widespread some negative practice is within a sub-field. Meta-science allows us to go beyond one-off anecdotes about particular results or rumors about bad practices.

In order to set the terms of discussion, we need to more precisely describe certain ways in which a scientific finding can be repeated. Figure 3.1 gives us a basic starting point for our definitions. For some claim in a paper, if we can take the same data that were analyzed in that paper, do the same analysis, and get the same result, we call that result reproducible (sometimes, analytically or computationally reproducible). If we can collect new data in the same experiment, do the same analysis, and get the same result, we call that a replication and say that the experiment is replicable. If we do a different analysis with the original dataset, we call this a robustness check, and if the claim passes, we say it is robust. We leave the last quadrant empty because there’s no specific term for it in the literature – the eventual goal is to draw generalizable conclusions but this will require more work than just having a finding that is reproducible and replicable. (You might have observed that a lot of work is being done here by the word “same.” How do we operationalize same-ness for experimental procedures, statistical analyses, or samples? These are difficult questions that we’ll touch on below. Keep in mind that there’s no single answer and so these terms are always going to be helpful guides rather than exact labels.)

Figure 3.1: A terminological framework for meta-science discussions.

In this chapter, we’ll primarily discuss reproducibility and replicability; discussions of robustness and generalizability will be taken up in Chapters 11 and 10 respectively. We’ll start out by reviewing some of the key concepts around reproducibility and replicability as well as the key meta-science findings. This literature suggests that when you read an average psychology paper, your expectation should be that it might not replicate!

We’ll then discuss some of the proposed sources of problems in replicability – especially analytic flexibility and publication bias. We end by taking up the issue of how reproducibility and replicability relate to theory building in psychology. To summarize, our view is that reproducibility and replicability are critical foundations for theory building – they are necessary but not sufficient for good theories.

3.1 Reproducibility

As one of their primary purposes, scientific papers report measurements, statistical results, and more complex analytic findings and visualizations. For these results to be subject to scrutiny, readers and reviewers need to be able to access some aspects of the set of steps from the original raw measures all the way to the final products. For much of the history of the scientific paper, complete verification of the provenance of a particular reported number in a paper was impossible – at best, a reader was presented with a verbal or mathematical description of the computations that were performed on the raw data, and the raw data themselves were not available. (In principle, for many years data have been available “on request,” and professional societies like the American Psychological Association have mandated data sharing for purposes of verification. But in practice data are rarely made available; see Wicherts et al., 2006.) We believe this is untenable, and we provide a longer argument justifying data sharing in Chapter 4 and discuss some of the practicalities of sharing in Chapter 13.

Data sharing is increasing, and we believe this is a very good thing for science as a whole. (We’re focusing on data sharing here, because much experimental research uses relatively straightforward analyses. But the same points apply to code sharing as well! In computational research, the relevant position is nicely summed up by a prescient quote from Buckheit & Donoho (1995): “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”) But because sharing has been relatively limited in the past, the reproducibility of numbers in nearly all published papers cannot be checked.

Reproducibility is desirable for a number of reasons. Without it:

  • Errors in calculation or reporting could lead to disparities between the reported result and the actual result,
  • Vague verbal descriptions of analytic computations could keep readers from understanding the computations that were actually performed,
  • The robustness of data analyses to alternative model specifications cannot be checked, and
  • Synthesizing evidence across studies, a key part of building a cumulative body of scientific knowledge (Chapter 16), is much more difficult.

Of these reasons, error detection and correction is probably the most pressing. But are errors common? There are plenty of individual instances of errors that are corrected in the published literature (e.g., Cesana-Arlotti et al., 2018), and we ourselves have made significant analytic errors (Frank et al., 2013). But these kinds of experiences don’t tell us about the frequency of error, or about the consequences of error for the conclusions that researchers draw. (There is a very interesting discussion of the pernicious role of scientific error on theory building in “The Mismeasure of Man” (Gould et al., 1996). Gould examines research on racial differences in intelligence and documents how scientific errors that supported racial differences were often overlooked. Errors are often caught asymmetrically; we are more motivated to double-check a result that contradicts our biases.) This question about frequency is a meta-scientific question that a variety of researchers have attempted to answer over the years. If errors are frequent, that would suggest a need for changes in our policies and practices to reduce their frequency!

Unfortunately, the lack of data availability creates a problem: it’s hard to figure out if calculations are wrong if you can’t check them in the first place. One meta-scientific research program has taken a clever approach to this issue. In standard American Psychological Association (APA) reporting format, inferential statistics must be reported with three pieces of information: the test statistic, the degrees of freedom for the test, and the \(p\)-value (e.g., \(t(18) = -0.74\), \(p = 0.47\)). Yet these pieces of information are redundant with one another. Thus, reported statistics can be checked for consistency simply by evaluating whether they line up with one another – that is, whether the \(p\)-value recomputed from the \(t\) and degrees of freedom matches the reported value.
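The consistency check described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not the method used in any particular study: the helper names (`t_pdf`, `two_tailed_p`, `is_consistent`), the use of numerical integration in place of a statistics library, and the rounding tolerance are all hypothetical choices made for the example.

```python
import math


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def is_consistent(t: float, df: float, reported_p: float, decimals: int = 2) -> bool:
    """True if the recomputed p rounds to the reported p at `decimals` places."""
    tolerance = 0.5 * 10 ** -decimals + 1e-12
    return abs(two_tailed_p(t, df) - reported_p) <= tolerance


# The example from the text: t(18) = -0.74, p = 0.47
print(round(two_tailed_p(-0.74, 18), 2))
print(is_consistent(-0.74, 18, 0.47))  # reported value matches
print(is_consistent(-0.74, 18, 0.04))  # a hypothetical misreported value does not
```

Comparing at the reported number of decimals matters here: a reported \(p\)-value has usually been rounded, so demanding an exact match would flag many correct results as errors.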

Bakker & Wicherts (2011) performed precisely this analysis on a sample of 281 papers, and found that around 18% of statistical results were incorrectly reported. Even more worrisome, around 15% of articles contained at least one decision error – that is, a case where the error changed the direction of the inference that was made (e.g., from significant to non-significant). (Confirming Gould’s speculation, most of the reporting errors that led to decision errors were in line with the researchers’ own hypotheses.) Nuijten et al. (2016) used an automated method called “statcheck” (now available as a web app and an R package so that you can check your own manuscripts!) to confirm and extend this analysis. They checked more than 250,000 \(p\)-values from psychology papers published in the period 1985–2013 and found that around half of all papers contained at least one incorrect \(p\)-value!

These findings provide a lower bound on the number of errors in the literature and suggest that reproducibility of analyses is likely very important. How reproducible are published findings? While there is probably no general way to check reproducibility across the literature, a group of us conducted some more targeted studies of two journals with open-data policies. Hardwicke et al. (2018) and Hardwicke et al. (2021a) identified datasets with reusable data (because not all datasets were complete and comprehensible) and then downloaded the data and attempted to reproduce the main statistical results from 60 of these articles. This process was incredibly labor-intensive, with articles requiring 5–10 hours of work each. Only about a third of articles were completely reproducible without help from the original authors; around 62% were successfully reproduced after – sometimes extensive – correspondence (Figure 3.3). A good number of the remaining papers appeared to have some irreproducible results – due to typos, missing data, or unclear analytic specifications. (See Artner et al. (2020) for a similar study with a slightly higher reproducibility rate but also a distressingly high rate of decision errors for the primary claims that they assessed.)

Figure 3.3: Analytic reproducibility of results from open-data articles in Cognition and Psychological Science. From Hardwicke et al. (2021).

Transparency is a critical imperative for decreasing the frequency of errors in the published literature. Reporting and computation errors are frequent in the published literature, and the identification of these errors depends on the findings being reproducible. If data are not available, then errors usually cannot be found.

3.2 Replication

Beyond verifying the analyses reported in a paper, we are often interested in understanding whether the measurements can be replicated. To quote from Popper (2005), “the scientifically significant… effect may be defined as that which can be regularly [replicated] by anyone who carries out the appropriate experiment in the way prescribed.”

Replications can be conducted for many reasons (Schmidt, 2009). One goal can be to verify that the results of an existing study can be obtained again if the study is conducted again in exactly the same way, to the best of our abilities. A second goal can be to gain a more precise estimate of the effect of interest by conducting a larger replication study, or combining the results of a replication study with the existing study. A third goal can be to investigate whether an effect will persist when, for example, the experimental manipulation is done in a different, but still theory-consistent, manner. Alternatively, we might want to investigate whether the effect persists in a different population. Such replications are often efforts to “replicate and extend,” and are common both in a sequence of experiments from a single research team and when a new team wants to build on a result from a paper they have read.

Much of the meta-science literature (and attendant debate and discussion) has focused on the first goal of simple verification – so much so that “replication” has become associated with skepticism or even attacks on the foundations of the field. This dynamic is at odds with the role that replication is given in a lot of philosophy of science, where it is assumed to be a typical part of “normal science.”

3.2.1 Conceptual frameworks for replication

The key challenge of replication is invariance – Popper’s stipulation that a replication be conducted “in the way prescribed” in the quote above. That is, what are the features of the world over which a particular observation should be relatively constant, and what are those that are specified as the key ingredients for the effect? Replication is relatively straightforward in the physical and biological sciences, in part because of presupposed theoretical background that allows us to make strong inferences about invariance. If a biologist reports an observation about a particular cell type from an organism, the color of the microscope is presumed not to matter to the observation.

These invariances are far harder to state in psychology, for both the procedure of an experiment and its sample. Procedurally, should the color of the experimental stimulus matter to the measured effect? In some cases yes, in some cases no. (A fascinating study by Baribault et al. (2018) proposes a method for empirically understanding psychological invariances. Treating a subliminal priming effect as their model system, they ran thousands of “micro-experiments” in which small parameters of their experimental procedure were randomly sampled. Randomizing these parameters allowed them to measure their effect of interest while averaging across this irrelevant variation. It turned out that in their case, color did not in fact matter.) Yet the task of postulating how a scientific effect should be invariant to lab procedures pales in comparison to the task of postulating how the effect should be invariant across different human populations! (In some sense, the research program of some branches of the social sciences amounts to an understanding of invariances across human cognition. The search for “universal grammar” in linguistics is a project to find what aspects of grammar are shared across all humans; see Chomsky, 1967.)

A lot is at stake in this discussion. If Dr. Frog publishes a finding with US undergraduates and Dr. Toad then “replicates” the procedure in Germany, to what extent should we be perturbed if the effect is different in magnitude or absent? (Presumably not very much if Dr. Toad gave the original instructions in English instead of in German – that’s another one of these pesky invariances that we are always worrying about!) People have made a number of replication taxonomies to try to quantify the degree of consistency between two experiments.

One influential one is the distinction between direct replications and conceptual replications (Zwaan et al., 2018). (Direct replications also get called exact replications sometimes. We think this term is misleading because similarity between two different experiments is always going to be on a gradient, and where you cut this continuum is always going to be a theory-laden decision. One person’s “exact” is another’s “inexact.”) Direct replications are those that attempt to reproduce all of the salient features of the prior study, up to whatever invariances the experimenters believe are present (e.g., color of the paint, gender of the experimenter, etc.). In contrast, conceptual replications are typically paradigms that attempt to test the same hypothesis via different operationalizations of the manipulation and/or the measure. We follow Zwaan et al. (2018) in thinking that labeling this second type of experiment as “replications” is a little misleading. Rather, they’re alternative tests of the same part of your theory – such tests can be extremely valuable, but they serve a different goal than replication.

3.2.2 The meta-science of replication

In the Reproducibility Project: Psychology (RPP; Open Science Collaboration, 2015), a large-scale effort to replicate 100 published psychology studies, replication teams reported subjectively that 39% of replications were successful, with 36% reporting a significant effect in the same direction as the original. How generalizable is this estimate – and how replicable is psychological research more broadly? Based on the discussion above, we hope we’ve made you skeptical that this is a well-posed question without a lot of additional details. Any answer is going to have to provide details about the scope of this claim, the definition of replication being used, and the metric for replication success. On the other hand, versions of this question have led to a number of empirical studies that help us better understand the scope of replication issues.

Many subsequent empirical studies of replication have focused on particular subfields or journals, with the goal of informing particular field-specific practices or questions. For example, Camerer et al. (2016) largely adopted the methodological choices of RPP, but applied the procedure to all of the between-subject laboratory articles published in two top economics journals in the period 2011–2014. They found a top-line replication rate of 61% of significant effects in the same direction as the original, higher than in RPP but lower than the naive expectation based on their level of statistical power. Another study attempted to replicate all 21 behavioral experiments published in the journals Science and Nature from 2010–2015, finding a replication rate of 62% significant effects (Camerer et al., 2018). (This study was notable because they followed a two-step procedure – after an initial round of replications, they followed up on the failures by consulting with the original authors and pursuing extremely large sample sizes. The resulting estimate thus is less subject to many of the critiques of the original RPP paper.) While these types of studies do not answer all the questions that were raised about RPP, they suggest that replication rates for top experiments are not as high as we’d like them to be, even when greater care is taken with the sampling and individual study protocols.

Other scientists working in the same field can often predict when an experiment will fail to replicate. Dreber et al. (2015) showed that prediction markets (where participants bet small sums of real money on replication outcomes) made fairly accurate estimates of replication success in the aggregate. This result has itself now been replicated several times (e.g., in the Camerer et al., 2018 study described earlier). Maybe even more surprisingly, there’s some evidence that machine learning models trained on the text of papers can predict replication success fairly accurately (Yang et al., 2020). All this points to the possibility of isolating consistent factors that lead to replication success or failure. In the next section we consider what these factors are in more depth.

The meta-science studies reviewed above are remarkably impressive, and provide some clarity on what we should expect from the literature. When this literature is taken together, the chance of a significant finding in a replication study of a generic experiment in social and cognitive psychology is likely somewhere around 56%. Furthermore, the replication effect will be on average 53% as large (Nosek et al., 2021).

On the other hand, they have substantial limitations as well. With relatively few exceptions, the studies chosen for replication used short, computerized tasks that mostly would fall into the categories of social and cognitive psychology. Further, and perhaps most troubling from the perspective of theory development, they tell us only whether a particular experimental effect can be replicated. They tell us almost nothing about whether the construct that the effect was meant to operationalize is in fact real! We’ll return to the difficult issue of how replication and theory construction relate to one another in the final section of this chapter.

Some have called the narrative that emerges from the sum of these meta-science studies the “replication crisis.” We think of it as a major tempering of expectations with respect to the published literature. Your naive expectation might reasonably be that you could read a typical journal article, select an experiment from it, and replicate that experiment in your own research. The upshot of this literature is that you might well be disappointed.

3.3 Causes of replication failure

The general argument of this chapter is that everything is not all right in experimental psychology, and hence that we need to change our methodological practices to avoid negative outcomes like irreproducible papers and unreplicable results. Towards that goal, we have been presenting meta-scientific evidence on reproducibility and replicability. But this evidence has been controversial, to say the least! Do large-scale replication studies like RPP – or for that matter, smaller-scale individual replications of effects like “power posing” – really lead to the conclusion that our methods require changes? Or are there reasons why a lower replication rate is actually consistent with a cumulative, positive vision of psychology? (One line of argument addresses this question through the dynamics of scientific change. There are many versions, but one is given by Wilson et al. (2020). The idea is that progress in psychology consists of a two-step process by which candidate ideas are “screened” for publication by virtue of small, noisy experiments and then “confirmed” by large-scale replications. On this kind of view, it’s business as usual to find that many randomly-selected findings don’t hold up in large-scale replications and so we shouldn’t be distressed by results like those of RPP. The key to progress is finding a small set that do hold up, which will lead to new areas of inquiry. We’re not sure this view is either a good description of current practice or a good normative goal for scientific progress, but we won’t focus on that critique here. Instead, since this book is written for experimenters-in-training, we assume that you do not want your experiment to be a false positive from a noisy screening procedure!)

In RPP and subsequent meta-science studies, original studies with lower \(p\)-values, larger effect sizes, and larger sample sizes were more likely to replicate successfully (Yang et al., 2020). From a theoretical perspective, this result is to be expected, because the \(p\)-value literally captures the probability of the data (or any data “more extreme”) under the null hypothesis of no effect. So a lower \(p\)-value should indicate a lower probability of a spurious result. (In Chapter 6 we will have a lot more to say about \(p < .05\) but for now we’ll mostly just treat it as a particular research outcome.) In some sense, the fundamental question about the replication meta-science literature is why the \(p\)-values aren’t better predictors of replicability! For example, Camerer et al. (2018) compute an expected number of successful replications on the basis of the effects and sample sizes – and their proportion of successful replications is substantially lower than that number. (This calculation, as with most other metrics of replication success, assumes that the underlying population effect is exactly the same for the replication and the original. This is a limitation because, as we describe, there could be unmeasured moderators that could produce genuine substantive differences between the two estimates. Such heterogeneity is not uncommon, even if it is relatively small, in multi-site replications in which heterogeneity can be directly estimated (Ebersole et al., 2020; Olsson-Collentine et al., 2020). As such, the metrics will typically underestimate replication success when there is heterogeneity within pairs (Mathur & VanderWeele, 2020a).)
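To make the idea of an “expected number of successful replications” concrete, here is a rough sketch of how such a number could be computed from effect sizes and sample sizes. It is not the calculation used by Camerer et al. (2018). We assume a simple two-group design analyzed with a \(z\)-test (known unit variance), we assume the true standardized effect equals the value passed in, and the function names and the example studies are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def replication_power(d: float, n_per_group: int, z_crit: float = 1.959964) -> float:
    """P(replication is significant in the original direction) if the true effect is d."""
    se = math.sqrt(2 / n_per_group)  # standard error of a standardized mean difference
    return 1 - normal_cdf(z_crit - d / se)


def expected_successes(studies) -> float:
    """Expected count of significant replications: the sum of per-study power."""
    return sum(replication_power(d, n) for d, n in studies)


# Hypothetical replications as (true effect d, participants per group)
studies = [(0.5, 64), (0.25, 64), (0.8, 30)]
for d, n in studies:
    print(f"d = {d}, n = {n}: power = {replication_power(d, n):.2f}")
print(f"expected successes: {expected_successes(studies):.2f}")
```

Note how sensitive the expectation is to the assumed effect: halving the effect from 0.5 to 0.25 at the same sample size drops power from roughly 0.8 to roughly 0.3. If published effect sizes are inflated, an expectation computed from them will be too optimistic.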

One explanation is that the statistical evidence that is presented in papers often dramatically overstates the true evidence from a study. That’s because of two pervasive and critical issues: analytic flexibility (also known as p-hacking or questionable research practices) and publication bias. (Analytic flexibility, p-hacking, and questionable research practices basically mean the same thing and are not used very precisely in the literature. p-hacking is an informal term that sounds like you are doing something bad. Questionable research practices is a more formal-sounding term that is in principle vague enough to encompass many ethical failings but in practice gets used to talk about p-hacking. And analytic flexibility (or “undisclosed analytic flexibility”), the clunky term we mostly favor, describes the actual practice of trying many different things and then pretending you didn’t. Critically, undisclosed analytic flexibility describes a state of affairs, not a (questionable) intent, so that’s why we like it a bit better.)

Publication bias refers to our relative preference for experiments that “work” over those that do not, where “work” is typically defined as yielding a significant result at \(p<.05\). Because of this preference, it is typically easier to publish such results. Intuitively, this situation biases the literature: it fills up with papers where \(p<.05\). Negative findings will then remain unpublished, living in the proverbial “file drawer” (Rosenthal, 1979). (One estimate is that 96% of non-preregistered papers report positive findings; see Scheel et al., 2021. We’ll have a lot more to say about publication bias in Chapters 11 and 16!) In a literature with a high degree of publication bias, many findings will be spurious because experimenters got lucky and published the study that “worked” even if that success was due to chance variation. In this situation, these spurious findings will not be replicable and so the overall rate of replicability in the literature will be lowered.
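A small simulation can illustrate the file-drawer scenario. The sketch below is a toy model under assumptions of our own: the true effect is exactly zero, each study is a one-sample \(z\)-test with a known standard deviation of 1, only studies that are significant in the predicted direction get “published,” and every published study is then replicated once with the same sample size. All names and parameter values are hypothetical.

```python
import math
import random
import statistics


def one_sample_p(mean: float, n: int) -> float:
    """Two-sided z-test p-value for H0: mean = 0, assuming a known SD of 1."""
    z = mean * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(true_effect: float, n: int, rng: random.Random) -> float:
    """Simulate one study and return its observed mean effect."""
    return statistics.fmean([rng.gauss(true_effect, 1.0) for _ in range(n)])


def simulate_file_drawer(true_effect: float = 0.0, n: int = 20,
                         n_studies: int = 5000, alpha: float = 0.05,
                         seed: int = 1) -> dict:
    rng = random.Random(seed)

    def works(mean: float) -> bool:
        # A study "works" if it is significant in the predicted (positive) direction.
        return mean > 0 and one_sample_p(mean, n) < alpha

    observed = [run_study(true_effect, n, rng) for _ in range(n_studies)]
    published = [m for m in observed if works(m)]  # the rest stay in the file drawer
    replications = [run_study(true_effect, n, rng) for _ in published]
    return {
        "share_published": len(published) / n_studies,
        "mean_published_effect": statistics.fmean(published),
        "replication_rate": sum(works(m) for m in replications) / len(published),
    }


result = simulate_file_drawer()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Even though no effect exists in this toy world, the published studies all report a sizable positive effect, because only large chance fluctuations clear the significance threshold. Their replications succeed only at about the false-positive rate.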

The mathematics of the publication bias scenario strikes some observers as implausible: most psychologists don’t run dozens of studies and report only one out of each group (L. D. Nelson et al., 2018). Instead, a more common scenario is to conduct many different analyses and then report the most successful, creating some of the same effects as publication bias – a promotion of spurious variation – without a file drawer full of failed studies.
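The arithmetic behind this scenario is simple if we make one strong simplifying assumption: that the different analyses are statistically independent and that there is no true effect. Under that assumption (ours, for illustration only), the chance that at least one analysis reaches \(p<.05\) is one minus the chance that none of them do. The function name below is hypothetical.

```python
def familywise_false_positive_rate(n_analyses: int, alpha: float = 0.05) -> float:
    """P(at least one p < alpha) across independent analyses when every null is true."""
    return 1 - (1 - alpha) ** n_analyses


for k in (1, 5, 10, 20):
    print(k, round(familywise_false_positive_rate(k), 3))
# 1 0.05
# 5 0.226
# 10 0.401
# 20 0.642
```

In practice, alternative analyses of the same dataset are correlated, so the inflation is usually smaller than these numbers suggest. The qualitative point still holds: reporting only the best of several analyses pushes the false-positive rate well above the nominal 5%.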

It’s our view that publication bias and its even more pervasive cousin, analytic flexibility, are likely to be key drivers of lower replicability. We admit that the meta-scientific evidence for this hypothesis isn’t unambiguous, but that’s because there’s no sure-fire way to diagnose analytic flexibility in a particular paper since we can almost never reconstruct the precise choices that were made in the data collection and analysis process! On the other hand, it is possible to analyze indicators of publication bias in specific literatures and there are several cases where publication bias diagnostics appear to go hand in hand with replication failure. (Here are two examples. First, in the “power posing” example described above, Simmons & Simonsohn (2017) noted strong evidence of analytic flexibility throughout the literature, leading them to conclude that there was no evidential value in the literature. Second, in the case of “money priming” (incidental exposures to images or text about money that were hypothesized to lead to changes in political attitudes), strong evidence of publication bias (Vadillo et al., 2016) was accompanied by a string of failed replications (D. Rohrer et al., 2015).)

3.4 Replication, reproducibility, theory building, and open science

So, empirical measures of reproducibility and replicability in the experimental psychology literature are low – lower than we might have naively suspected and lower than we want. How do we address these issues? And how do these issues interact with the goal of building theories? In this last section, we discuss the relationship between replication and theory – and the role that open and transparent research practices can play.

3.4.1 Reciprocity between replication and theory

Analytic reproducibility is a prerequisite for theory building because if the twin goals of theories are to explain and to predict experimental measurements, then an error-ridden literature undermines this goal. If some proportion of all numerical values reported in the literature were simple, unintentional typos, this situation would create an extra level of noise – irrelevant random variation – impeding our goal of getting precise enough measurements to distinguish between theories. But in fact, the situation is likely to be worse: errors are much more often in the direction that favors authors’ own hypotheses. Thus, irreproducibility not only decreases our precision, it also increases the bias of the literature, creating obstacles to the fair evaluation of theories with respect to data.

Replicability is also foundational to theory building. Across a wide range of different conceptions of how science works, scientific theories are evaluated with respect to their relationship to the world. They must ground out in specific observations. It may be that some observations are by their nature un-repeatable (e.g., a particular astrophysical event might not be observed again in a human lifetime). But for laboratory sciences – and experimental psychology can be counted among these, to a certain extent at least – the independent and skeptical evaluation of theories requires repeatability of measurements.

Some authors have argued, following the philosopher Heraclitus, that “you can’t step in the same river twice” (McShane & Böckenholt, 2014) – meaning that the circumstances and context of psychological experiments are constantly changing and no observation will be identical to another. This is of course technically true from a philosophical perspective. But that’s where theory comes in! As we discussed above, our theories postulate the invariances that allow us to group together similar observations and generalize across them.

In this sense, replication is critical to theory, but theory is also critical to replication. Without a theory of “what matters” to a particular outcome, we really are stepping into an ever-changing river. But a good theory can concentrate our expectations on a much smaller set of causal relationships, allowing us to make strong predictions about what factors should and shouldn’t matter to experimental outcomes.

3.4.2 Deciding when to replicate to maximize epistemic value

As a scientific community, how much emphasis should we place on replication? In the words of Newell (1973), “you can’t play 20 questions with nature and win”. A series of well-replicated measurements does not itself constitute a theory. Theory construction is its own important activity. We’ve tried to make the case here that a reproducible and replicable literature is a critical foundation for theory building. That doesn’t necessarily mean you have to do replications all the time – that’s only critical if you think you don’t have a very replicable literature and want to check!

More generally, any scientific community needs to trade off between exploring new phenomena and confirming previously reported effects. In a thought-provoking analysis, Oberauer & Lewandowsky (2019) suggest that perhaps replications also aren’t the best test of theoretical hypotheses. In their analysis, if you don’t have a theory then it makes sense to try and discover new phenomena and then to replicate them. If you do have a theory, you should expend your energy in testing new predictions rather than repeating the same test across multiple replications.

Analyses such as this one can provide a guide to our allocation of scientific effort. But our goal in this book is somewhat different. Once you decide to do a particular experiment – replication or otherwise – we assume that you want to maximize its scientific value. Our recommendations about practice are predicated on the assumption that we want the resulting literature to be reproducible and replicable, not that we necessarily want it to consist of replications! There are many concerns that go into whether to replicate – including not only whether you are trying to gather evidence about a particular phenomenon, but also whether you are trying to master techniques and paradigms related to it. As we said at the beginning of this chapter, not all replication is for the purpose of verification.

3.4.3 Open science

The open science movement is a response – really a set of responses – to the challenges of reproducibility and replicability. The open science (and now the broader open scholarship) movement is a broad umbrella (Figure 3.8), but we take open science to be a set of beliefs, research practices, results, and policies that are organized around the central roles of transparency and verifiability in scientific practice.38 The core of this movement is the idea of “nullius in verba” (the motto of the British Royal Society), which roughly means “take no one’s word for it.”39

38 Another part of the open science umbrella involves a democratization of the scientific process through efforts to open access to science. This process involves both the removal of barriers to access to the scientific literature and efforts to remove barriers to scientific training, especially for groups historically underrepresented in the sciences. The hope is that these processes increase both the set of people and the range of perspectives contributing to the scientific project. We view these changes as no less critical than the transparency aspects of the open science movement, though more indirectly related to the current discussion of reproducibility and replicability.

39 At least that’s a reasonable paraphrase, but there’s some interesting discussion about what this quote from Horace really means in a letter by Gould (1991).

Figure 3.8: The broad umbrella of open science (adapted from …).

Transparency initiatives are critical for ensuring reproducibility. As we discussed above, you cannot even evaluate reproducibility in the absence of data sharing. Code sharing can go even further towards helping reproducibility, as code makes the exact computations involved in data analysis much more explicit than the verbal descriptions that are the norm in papers (Hardwicke et al., 2018). Further, as we will discuss in Chapter 13, the set of practices involved in preparing materials for sharing can themselves encourage reproducibility by leading to better organizational practices for research data, materials, and code.
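To make the point about code versus verbal description concrete, here is a minimal sketch in Python. It assumes a hypothetical methods sentence, “we excluded response times more than 2 standard deviations from the mean,” and shows two defensible readings of it. The data values and the function names (`exclude_once`, `exclude_iteratively`) are invented for illustration and do not come from any study discussed in this chapter.

```python
import statistics

def exclude_once(rts, n_sd=2.0):
    """Reading A: compute the mean and SD once, then drop values beyond n_sd SDs."""
    mean = statistics.mean(rts)
    sd = statistics.stdev(rts)  # sample SD (n - 1 denominator)
    return [rt for rt in rts if abs(rt - mean) <= n_sd * sd]

def exclude_iteratively(rts, n_sd=2.0):
    """Reading B: repeat the exclusion until no further values are dropped."""
    kept = list(rts)
    while True:
        trimmed = exclude_once(kept, n_sd)
        if len(trimmed) == len(kept):
            return kept
        kept = trimmed

# Made-up response times in milliseconds (illustrative only).
rts = [420, 450, 470, 480, 500, 510, 530, 560, 900, 1500]

single = exclude_once(rts)
iterative = exclude_iteratively(rts)
print(len(single), statistics.mean(single))
print(len(iterative), statistics.mean(iterative))
```

With these made-up values, the one-pass rule drops one response time and the iterative rule drops two, so the two readings of the same sentence produce different means. Shared code removes this kind of ambiguity because it records exactly which rule was applied.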

Transparency also plays a major role in advancing replicability. This point may not seem obvious at first – why would sharing things openly lead to more replicable experiments? – but it is one of the major theses of this book, so we’ll unpack it a bit. Here are four routes by which transparent practices lead to greater replication rates.

  1. Sharing of experimental materials enables replications to be more methodologically faithful (Chapter 13). As we discussed above, one critique of many replications has been that they differ in key respects from the originals. Sometimes those deviations were purposeful, but in other cases they were simply because the replicators could not use the original experimental materials or scripts. Sharing these, as we encourage you to do, avoids this issue entirely.

  2. Sharing sampling and analysis plans allows replication of key aspects of design and analysis that may not be clear in verbal descriptions, for example exclusion criteria or details of data pre-processing.

  3. Sharing of analytic decision-making via preregistration can lead to a decrease in p-hacking and other questionable research practices (Chapter 11). The strength of statistical evidence in the original study is a predictor of replicability in subsequent studies. If original studies are preregistered, they are more likely to report effects that are not subject to inflation via questionable research practices.

  4. If effects are transparently reported as confirmatory vs. exploratory, subsequent experimenters can make a more informed judgment about which effects are likely to be good targets for replication.

For all of these reasons, we believe that open science practices can play a critical role in increasing reproducibility and replicability.
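The third route above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis from the literature: the sample size, the number of outcome measures, the use of a z-test with a known standard deviation of 1, and all function names are hypothetical choices made for illustration. Every simulated study has a true effect of zero, so any “significant” result is a false positive. A preregistered analysis tests only the planned outcome, while a flexible analysis reports whichever of five independent outcomes reached significance.

```python
import random
from statistics import NormalDist, mean

def p_value(group_a, group_b):
    """Two-sided z-test for a mean difference, assuming equal n and a known SD of 1."""
    n = len(group_a)
    z = (mean(group_a) - mean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_study(rng, n_per_group=20, n_outcomes=5):
    """Return one p-value per outcome measure; the true effect is zero for all."""
    p_values = []
    for _ in range(n_outcomes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        p_values.append(p_value(a, b))
    return p_values

def false_positive_rates(n_studies=2000, alpha=0.05, seed=1):
    """Compare a preregistered analysis with a flexible one over many null studies."""
    rng = random.Random(seed)
    prereg_hits = 0
    flexible_hits = 0
    for _ in range(n_studies):
        p_values = simulate_study(rng)
        prereg_hits += p_values[0] < alpha      # only the planned outcome counts
        flexible_hits += min(p_values) < alpha  # whichever outcome "worked"
    return prereg_hits / n_studies, flexible_hits / n_studies

prereg_rate, flexible_rate = false_positive_rates()
print(f"preregistered: {prereg_rate:.3f}, flexible: {flexible_rate:.3f}")
```

Under these assumptions the preregistered rate should sit near the nominal 5%, while the flexible rate should approach 1 − 0.95^5, or roughly 23%. Findings produced by the flexible strategy are the ones least likely to hold up when someone collects new data.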

3.5 Chapter summary: Replication

So, is there a “replication crisis”? The common meaning of “crisis” is “a difficult time.” The data we reviewed in this chapter suggest that there are real problems in the reproducibility and replicability of the psychology literature. But there’s no evidence that things have gotten worse. If anything, we are optimistic about the changes in practices that have happened in the last ten years. So in that sense, we are not sure that a crisis narrative is warranted.

On the other hand, for Kuhn (1962), the term “crisis” had a special meaning: it is a period of intense uncertainty in a scientific field brought on by the failure of a particular paradigm. A crisis typically heralds a shift in paradigm, in which new approaches and phenomena come to the fore.

In this sense, the replication crisis narrative isn’t mutually exclusive with other crisis narratives, including the “generalizability crisis” (Yarkoni, 2020) and the “theory crisis” (Oberauer & Lewandowsky, 2019). All of these are symptoms of discontent with standard ways of doing business. We share this discontent! We are writing this book to encourage further changes in experimental methods and practices to improve reproducibility and replicability outcomes – many of them driven by the broader set of ideas referred to as “open science.” These changes may not lead to a paradigm shift in the Kuhnian sense, but we hope that they lead to eventual improvements. In that sense, we tend to side with those who have named the “replication crisis” a “credibility revolution” (Vazire, 2018).


Anderson, C., Bahnik, S., Barnett-Cowan, M., Bosco, F., Chandler, J., Chartier, C., & others. (2016). Response to comment on “Estimating the reproducibility of psychological science.” Science, 351(6277), 1037–1037.
Artner, R., Verliefde, T., Steegen, S., Gomes, S., Traets, F., Tuerlinckx, F., & Vanpaemel, W. (2020). The reproducibility of statistical results in psychological research: An investigation using unpublished raw data. Psychological Methods.
Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43(3), 666–678.
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., Van Ravenzwaaij, D., White, C. N., De Boeck, P., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612.
Bench, S. W., Rivera, G. N., Schlegel, R. J., Hicks, J. A., & Lench, H. C. (2017). Does expertise matter in replication? An examination of the Reproducibility Project: Psychology. Journal of Experimental Social Psychology, 68, 181–184.
Buckheit, J. B., & Donoho, D. L. (1995). WaveLab and reproducible research. In Wavelets and statistics (pp. 55–81). Springer.
Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436.
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B. A., Pfeiffer, T., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644.
Carney, D. R., Cuddy, A. J., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363–1368.
Cesana-Arlotti, N., Martín, A., Téglás, E., Vorobyova, L., Cetnarski, R., & Bonatti, L. L. (2018). Erratum for the report “Precursors of logical reasoning in preverbal human infants.” Science, 361(6408).
Chomsky, N. (1967). Aspects of the theory of syntax. MIT Press.
Dominus, S. (2017). When the revolution came for Amy Cuddy. The New York Times Magazine.
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343–15347.
Ebersole, C. R., Mathur, M. B., Baranski, E., Bart-Plange, D.-J., Buttrick, N. R., Chartier, C. R., Corker, K. S., Corley, M., Hartshorne, J. K., IJzerman, H., et al. (2020). Many Labs 5: Testing pre-data-collection peer review as an intervention to increase replicability. Advances in Methods and Practices in Psychological Science, 3(3), 309–331.
Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601.
Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives on Psychological Science, 7, 595–599.
Frank, M. C., Slemmer, J. A., Marcus, G. F., & Johnson, S. P. (2013). “Information from multiple modalities helps 5-month-olds learn abstract rules”: Erratum.
Gelman, A. (2018). Don’t characterize replications as successes or failures. Behavioral and Brain Sciences, 41.
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351(6277), 1037–1037.
Gould, S. J. (1991). Royal shorthand. Science, 251(4990), 142–142.
Gould, S. J. (1996). The mismeasure of man. W. W. Norton & Company.
Hardwicke, T. E., Bohn, M., MacDonald, K., Hembacher, E., Nuijten, M. B., Peloquin, B. N., deMayo, B. E., Long, B., Yoon, E. J., & Frank, M. C. (2021a). Analytic reproducibility in articles receiving open data badges at the journal Psychological Science: An observational study. Royal Society Open Science, 8(1), 201494.
Hardwicke, T. E., Mathur, M. B., MacDonald, K. E., Nilsonne, G., Banks, G. C., Kidwell, M., Mohr, A. H., Clayton, E., Yoon, E. J., Tessler, M. H., Lenne, R. L., Altman, S. K., Long, B., & Frank, M. C. (2018). Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532.
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams Jr, R. B., Alper, S., Aveyard, M., Axt, J. R., Babalola, M. T., Bahník, Š., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490.
Kuhn, T. (1962). The structure of scientific revolutions. University of Chicago Press.
Lewis, M. L., & Frank, M. C. (2016). Understanding the effect of social context on learning: A replication of Xu and Tenenbaum (2007b). Journal of Experimental Psychology: General, 145(9), e72.
Mathur, M. B., & VanderWeele, T. J. (2020a). New statistical metrics for multisite replication projects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(3), 1145–1166.
McShane, B. B., & Böckenholt, U. (2014). You cannot step into the same river twice: When power analyses are optimistic. Perspectives on Psychological Science, 9(6), 612–625.
Nelson, L. D., Simmons, J., & Simonsohn, U. (2018). Psychology’s renaissance. Annual Review of Psychology, 69, 511–534.
Newell, A. (1973). You can’t play 20 questions with nature and win: Projective comments on the papers of this symposium.
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Almenberg, A. D., Fidler, F., Hilgard, J., Kline, M., Nuijten, M. B., et al. (2021). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology.
Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226.
Oberauer, K., & Lewandowsky, S. (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review, 26(5), 1596–1618.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251).
Popper, K. (2005). The logic of scientific discovery. Routledge.
Ramscar, M. (2016). Learning and the replicability of priming effects. Current Opinion in Psychology, 12, 80–84.
Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the robustness of power posing: No effect on hormones and risk tolerance in a large sample of men and women. Psychological Science, 26(5), 653–656.
Rohrer, D., Pashler, H., & Harris, C. R. (2015). Do subtle reminders of money change people’s political views? Journal of Experimental Psychology: General, 144(4), e73.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638.
Scheel, A. M., Schijen, M. R., & Lakens, D. (2021). An excess of positive results: Comparing the standard psychology literature with registered reports. Advances in Methods and Practices in Psychological Science, 4(2), 25152459211007467.
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90–100.
Schwarz, N., & Clore, G. L. (1983). Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. Journal of Personality and Social Psychology, 45(3), 513.
Schwarz, N., & Strack, F. (2014). Does merely going through the same moves make for a “direct” replication? Concepts, contexts, and operationalizations.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2018). False-positive citations. Perspectives on Psychological Science, 13(2), 255–259.
Simmons, J. P., & Simonsohn, U. (2017). Power posing: P-curving the evidence. Psychological Science.
Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26(5), 559–569.
Vadillo, M. A., Hardwicke, T. E., & Shanks, D. R. (2016). Selection bias, vote counting, and money-priming effects: A comment on Rohrer, Pashler, and Harris (2015) and Vohs (2015). Journal of Experimental Psychology: General, 145(5), 655–663.
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences, 113(23), 6454–6459.
Vazire, S. (2018). Implications of the credibility revolution for productivity, creativity, and progress. Perspectives on Psychological Science, 13(4), 411–417.
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726.
Wilson, B. M., Harris, C. R., & Wixted, J. T. (2020). Science is not a signal detection problem. Proceedings of the National Academy of Sciences, 117(11), 5559–5567.
Yang, Y., Youyou, W., & Uzzi, B. (2020). Estimating the deep replicability of scientific findings using human and artificial intelligence. Proceedings of the National Academy of Sciences, 117(20), 10762–10768.
Yarkoni, T. (2020). The generalizability crisis. Behavioral and Brain Sciences, 1–37.
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, B. (2018). Making replication mainstream.