Why do scientists keep looking for a statistically significant result after failing to find one initially?

The Look-elsewhere Effect, explained.

What is the Look-elsewhere Effect?

The Look-elsewhere Effect describes how, when scientists analyze the results of their experiments, results that are apparently statistically significant might actually have arisen by chance. One reason why this might happen is that a researcher has ignored a statistically insignificant result that they found previously, choosing to “look elsewhere”—continuing to search for a significant finding instead of accepting their initial results.

Where this bias occurs

Let’s say your friend David is a medical researcher who is trying to develop a drug that will help people recover from colds more quickly. He runs an experiment where he tests his new treatment, collects a bunch of data, and analyzes it using statistical tests. His analysis does not find any significant effect of the treatment on people’s recovery time.

Initially, David is disappointed—but then he decides that maybe the reason he didn’t find a significant result is that he’s just looking in the wrong place. After running a few different tests, he eventually finds a statistically significant effect: the treatment group reported fewer headache symptoms than the control group. Success!

Individual effects

The look-elsewhere effect is fuelled by cognitive distortions that are common to all people, but it specifically has to do with statistical tests and their interpretation. For this reason, it mostly affects scientists and researchers who are using statistics to try to prove (or disprove) a hypothesis.

Systemic effects

The look-elsewhere effect is a big factor contributing to the replication crisis currently facing many branches of science. Replication is the process of repeating an experiment that has already been done, in order to see whether or not the results will be the same. This is a crucial way of verifying that the machinery of science is working as it should: if a study’s results can’t be repeated, it calls into question the validity of the original findings.

Unfortunately, in recent years, a large share of replication attempts have failed to reproduce the results of the original studies. Although this problem has received the most attention in psychology, there are parallel crises unfolding in several fields, including economics1 and even medicine, where, by some estimates, only 20 to 25 percent of studies replicate perfectly.2 It probably goes without saying that this is a huge problem, impeding scientific progress and also undermining the public’s faith in scientists.

Why it happens

To understand the look-elsewhere effect, we will first need to have a very basic understanding of what it means to have a “statistically significant” finding. When researchers want to test a hypothesis, they’ll typically run an experiment, where they compare the outcomes of different groups—for instance, one group that receives the treatment the researcher is studying, and a control group that just gets a placebo. As long as all other factors are carefully controlled for, if we find that there’s a difference between how these groups fare, then it’s safe to say that the difference was caused by the treatment. Right?

The problem is, even when researchers have controlled for other variables, there is still the possibility that any differences between groups are due to random coincidence. This is because, although we are trying to make generalizations about how a treatment would affect a whole population, we have to test it on a much smaller sample of individuals. If, for some reason, our sample turns out not to be representative of the whole population, then our results would be misleading.

To illustrate, imagine you’re working at an ice cream parlor, where people are allowed to sample the flavors. One day a huge group of people comes in, about one hundred of them, all wanting to sample the mint chocolate chip. Obviously, there are lots of chocolate chips in the mint chocolate chip, but they’re not totally evenly distributed throughout the bucket. So, as you’re giving people their samples, the vast majority of the time, the samples contain some chocolate—but every now and then, an unlucky person gets a sample that’s just mint ice cream, a sample that doesn’t properly represent the flavor.

In science, sampling poses a similar problem: there’s always a chance that our experimental sample, just through bad luck, has characteristics that make its members respond differently to the treatment than the rest of the population would. This means that our findings would be the result of chance (also known as sampling error) and would lead us to the wrong conclusion about our treatment.
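To make sampling error concrete, here is a minimal Python sketch. All of the numbers (a true average recovery time of 7 days, a spread of 2 days, samples of 20 people) are made-up illustrative values, not data from any real study.

```python
# Sampling error in miniature: every sample comes from the SAME population,
# yet the sample averages still wander around the true mean.
import numpy as np

rng = np.random.default_rng(seed=1)
population_mean = 7.0   # hypothetical true average recovery time, in days
sample_size = 20        # a deliberately small experimental group

for i in range(5):
    sample = rng.normal(loc=population_mean, scale=2.0, size=sample_size)
    print(f"Sample {i + 1}: mean recovery = {sample.mean():.2f} days")
```

Some sample means land noticeably above or below 7 days even though the population never changes; that gap is pure sampling error.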

We can never fully escape this problem, but we can try to get around it using statistics. There are many statistical tests available to help scientists judge whether their result is actually significant. In many cases, scientists use statistical tests to calculate a p-value: the probability of obtaining a result at least as extreme as the one observed if there were no real treatment effect and only chance were at work. For example, a p of 0.1 would indicate a 10% chance of seeing such a result by coincidence alone. Researchers in different fields will mutually agree on a p threshold that a result has to cross in order to be considered significant. Often, this line is drawn at 0.05, meaning scientists are agreeing to tolerate no more than a 5% probability that a result was just a coincidence. Spurious “significant” results of this kind are known as alpha errors, or Type I errors.
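The following sketch shows what that 5% tolerance looks like in practice. It simulates many experiments in which the treatment genuinely does nothing (both groups are drawn from the same distribution); the group sizes and distributions are assumptions chosen purely for illustration.

```python
# With no true effect, roughly 5% of experiments still cross p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    control = rng.normal(loc=7.0, scale=2.0, size=30)  # no real difference:
    treated = rng.normal(loc=7.0, scale=2.0, size=30)  # same distribution
    _, p_value = stats.ttest_ind(control, treated)
    if p_value < 0.05:                                 # conventional threshold
        false_positives += 1

print(f"False-positive (Type I) rate: {false_positives / n_experiments:.3f}")
```

The printed rate hovers around 0.05: that is the level of coincidence researchers agree to live with when they adopt the 0.05 threshold.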

With that out of the way, we can get back to the look-elsewhere effect.

More statistical tests, more problems

One of the reasons that the look-elsewhere effect happens is purely mathematical. It is known in statistics as the problem of multiple comparisons. As the name suggests, this problem arises when scientists perform many statistical tests on the same dataset. While this might not seem like it should be an issue, it actually inflates the chances of committing an alpha error.3 The more times a researcher goes looking for a result in the same dataset, the more likely they are to hit on something that looks interesting on the surface but is actually just the result of noise, or random fluctuations in the data.4
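A rough calculation shows how quickly this inflation happens. If each test carries a 5% chance of a false positive and the tests are treated as independent (a simplification, since tests on the same dataset are usually correlated, but the qualitative point holds), the chance of at least one false positive grows like this:

```python
# Family-wise error rate: chance of at least one Type I error across k tests,
# assuming independent tests, each run at the 0.05 significance level.
alpha = 0.05

for k in [1, 5, 10, 20, 50]:
    family_wise_error = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests -> chance of at least one false positive: {family_wise_error:.0%}")

# 1 test   ->  5%
# 10 tests -> ~40%
# 20 tests -> ~64%
# 50 tests -> ~92%
```

Run twenty tests on the same data and a false positive somewhere becomes more likely than not.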

This, in a nutshell, is the statistical explanation for the look-elsewhere effect. However, this doesn’t quite tell the whole story. After all, researchers are trained in statistics—they should know better than to just conduct a bunch of tests willy-nilly. Moreover, there are ways to statistically correct for the problem of multiple comparisons, in cases when it’s really necessary to carry out a lot of different tests.3 So why does this problem persist in scientific research? That answer comes down to unconscious cognitive biases.
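As a sketch of what such a correction looks like, here is the simplest and best-known one, the Bonferroni correction, which divides the significance threshold by the number of tests (other corrections exist, and which one is appropriate depends on the analysis):

```python
# Bonferroni correction: shrink the per-test threshold so the chance of a
# false positive across the whole family of tests stays near 5%.
alpha = 0.05
n_tests = 20
per_test_threshold = alpha / n_tests   # 0.0025 instead of 0.05

family_wise_error = 1 - (1 - per_test_threshold) ** n_tests
print(f"Per-test threshold: {per_test_threshold}")
print(f"Chance of at least one false positive: {family_wise_error:.3f}")  # ~0.049
```

The trade-off is reduced power: with a stricter per-test threshold, genuinely real effects become harder to detect.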

Humans are fallible—even scientists

People are prone to a whole suite of biases and heuristics that distort their thinking. What’s more, unconscious biases are just that: unconscious. Even when we have been taught about the flaws in our own thinking, it is often still very difficult to avoid falling into the same cognitive traps. An even more difficult pill to swallow: this truth applies as much to experts as it does to laypeople. Although many of us tend to see scientists as somehow above making the same errors in judgment as the rest of us, evidence has shown that this is not the case. Even more surprisingly, the formal education that scientists have in statistics doesn’t insulate them from biased reasoning when it comes to estimating probabilities.

One famous demonstration of this fact involves sample sizes. It’s a basic fact in statistics that larger samples give more reliable estimates; smaller samples make it more difficult to detect a real effect and more likely that results reflect sampling error. And yet, research has shown that even highly renowned statisticians sometimes fail to account for sample size.

In a paper titled “Belief in the Law of Small Numbers,” the psychologists Amos Tversky and Daniel Kahneman (the latter of whom would later win the Nobel Prize in economics) had experienced research scientists, including two authors of statistics textbooks, fill out a questionnaire describing hypothetical research scenarios. The experts were asked to choose sample sizes, estimate the risk of failure, and give advice to a hypothetical graduate student conducting the project. The results showed that a large majority of respondents made errors in their judgments because they didn’t pay enough attention to sample size.5

In short, it’s clear that even the most erudite among us are vulnerable to cognitive bias. And on top of our lack of intuition for statistics, there are other biases, such as optimism bias and effort justification, that likely play a role in the look-elsewhere effect.

We are optimistic to a fault

Optimism bias describes how we are generally more oriented towards positivity: we pay more attention to positive information, we remember happy events better than upsetting ones, and we have positive expectations of the people and world around us.6 This “bias” isn’t necessarily a bad thing: on the contrary, our general optimism clearly enhances our wellbeing. Sometimes, however, optimism bias can lead us to suppress negative information, ignoring facts that make us feel bad, in favor of ones that brighten our mood.7 When it comes to the look-elsewhere effect, the determination to seek out positive information might lead some researchers to disregard their initial insignificant results, and keep looking for a more exciting finding.

We hate to see our hard work go to waste

By the time a researcher gets to the analysis stage of an experiment, it’s likely that they’ve invested a considerable amount of time and energy into designing the experiment, acquiring all the necessary materials, and collecting data. Research requires a whole lot of effort, and we never want to feel like our hard work has gone to waste. And when it starts to seem like maybe it was for nothing, we start doing some cognitive gymnastics to avoid having to confront that unpleasant truth. This phenomenon is known as effort justification.

Often, effort justification causes people to ascribe higher value to the object or project that they’ve been hard at work on. In a classic study by Elliot Aronson and Judson Mills, female college students were told that they would be participating in a group discussion about sexuality. However, some of them were first put through an embarrassing initiation process, supposedly in order to prove that they wouldn’t be too uncomfortable to participate in the conversation. The women who had to put in the extra effort later rated the contents of the discussion as more interesting, and their fellow group mates more intelligent, compared to those who hadn’t done the initiation.8

When it comes to the look-elsewhere effect, researchers’ unwillingness to let go of projects that they’ve sunk a lot of effort into might drive them to continue running statistical tests, past the point where they should probably give up. It is difficult to accept it when a hypothesis doesn’t pan out, and many people adopt the attitude that finding any significant result is better than coming away with nothing, even if that result isn’t what they were originally looking for.

Academia’s “rat race”

While flawed human reasoning may lead individuals to fall for the look-elsewhere effect, it is undeniable that there are also many structural forces at play that drive this problem. With the replication crisis still ongoing, many have pointed the finger of blame at the culture of modern academia, where researchers are incentivized to publish as many scholarly papers as they can and new graduates are locked into fierce competition for a dwindling number of jobs. According to a 2013 study, there were only enough academic positions for 12.8% of Ph.D. graduates in the United States to find employment,9 and the problem has only shown signs of worsening since then. This kind of job market puts tremendous pressure on people to perform.

Another issue here has to do with how performance is gauged, and the type of research that is seen as publishable. Generally speaking, only statistically significant results are considered interesting enough to merit publication. As a result, many researchers perceive statistically insignificant results to be “failures”—even though an insignificant result still conveys valuable information. This dynamic motivates scientists to “look elsewhere,” and try to reach statistical significance wherever possible.

Why it is important

The look-elsewhere effect, repeated by many people over many years, can have devastating consequences for individual researchers. The replication crisis has thrown into question the very existence of concepts that many researchers have staked their entire careers on. For example, in a blog post from June 2020, social psychologist and neuroscientist Michael Inzlicht wrote that a central topic in his work, ego depletion (the idea that self-control relies on a limited store of resources), is, as it turns out, “probably not real.”10 This revelation had a huge emotional impact: in his words, it “undid [his] world.”

But the look-elsewhere effect doesn’t just cause trouble for individuals. As a contributing factor to the replication crisis, it has far-reaching implications: on top of impeding scientific progress and leading scientists to incorrect conclusions, it is also damaging to the reputation of science as an institution. In a time when truth feels increasingly difficult to pin down, and conspiracy theories are gaining alarming ground, it is paramount that the public has trust in scientific experts. Unfortunately, that trust is undermined by the shockingly high number of studies that cannot be reproduced: in some branches of psychology, for example, as many as half of all published studies may not be replicable.15

How to avoid it

As we’ve established, it’s difficult to avoid cognitive biases, even when we know that they exist. When it comes to the look-elsewhere effect, however, there are specific steps that researchers can take to guard against improper statistical practices. Many of these practices are becoming increasingly common, as many scientists push for more openness and transparency in their fields. Some broader changes to the culture of science and academia would also likely help solve this problem.

Preregister studies before they happen 

Preregistration involves submitting a research plan to a registry in advance of actually conducting a study. When researchers preregister a study, they commit to a plan not only for carrying out the experiment itself but also for the analysis of the data, declaring which statistical tests they plan to use.11 In effect, this means that researchers take the option of “looking elsewhere” away from themselves. This, in turn, helps keep the rate of Type I errors at its intended level and helps to ensure that published research actually signifies a meaningful finding.

Open up the file drawer

As mentioned above, statistically null results are not generally seen as worthwhile by academics and journal editors. This means that studies that come up with insignificant results are seldom seen by anybody except the researcher(s) who conducted them.

The disregard for null results doesn’t just encourage the look-elsewhere effect (because researchers don’t see any value in their null results); it can also have negative consequences for science as a whole by creating a bias in the published literature. Imagine, for example, that 99 researchers all over the world have run experiments trying to prove the existence of X, and all came up with insignificant results. Those scientists likely wouldn’t share their “failed” projects with anybody. But one day, a 100th researcher runs a similar study, gets a statistically significant result by chance, and publishes it in an academic journal. Because the 99 failed attempts were never published, nobody realizes that this finding is misleading.
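A toy simulation makes the scenario above concrete. The numbers (100 labs, 40 participants per group, a hypothesis that is actually false) are illustrative assumptions:

```python
# The file-drawer scenario: 100 labs test an effect that doesn't exist,
# and only the "significant" results ever reach a journal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
published, filed_away = 0, 0

for lab in range(100):
    control = rng.normal(0.0, 1.0, size=40)  # X has no real effect,
    treated = rng.normal(0.0, 1.0, size=40)  # so both groups are identical
    _, p = stats.ttest_ind(control, treated)
    if p < 0.05:
        published += 1    # a chance finding that makes it into the literature
    else:
        filed_away += 1   # a null result that stays in the file drawer

print(f"Published 'discoveries': {published}; unseen null results: {filed_away}")
```

Typically a handful of the 100 labs end up with a publishable (but spurious) effect, while readers of the literature never learn about the dozens of null results sitting behind it.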

This phenomenon is known as the “file drawer problem,” because papers with statistically insignificant results tend to get tossed in a file drawer and sealed away. By encouraging the publication of these null results, scientists can reduce the incentive to “look elsewhere,” while also helping to ensure that attention and funding are directed towards worthwhile pursuits.12

How it all started

Concerns about replicability started to build in different scientific fields in the early 2000s. In a famous 2005 paper boldly titled “Why most published research findings are false,” Stanford University professor John Ioannidis argued that due to a number of statistical factors, including large numbers of statistical tests and flexibility in design and analysis, a large number of published research papers (he was looking specifically at medical research) were based on Type I errors, and couldn’t be replicated.13

Later, in 2012, a team of researchers surveyed over 2,000 psychologists about their use of questionable research practices and found that 67% of them had engaged in at least one such practice. This includes behaviors such as failing to report all the statistical relationships tested, as is often the case with the look-elsewhere effect.14

Example 1 - The Bible code

In the 1990s, Eliyahu Rips and Doron Witztum, two researchers from the Hebrew University of Jerusalem, published a paper in the journal Statistical Science in which they claimed to have proof that the Book of Genesis contained predictions of the future. In their article, Rips and Witztum demonstrated that if you took every 5th letter in this part of the Bible and put them into a sequence, that letter sequence contained the names, birth dates, and death dates of 32 famous rabbis from throughout Jewish history.16

On its face, this finding seems like it couldn’t possibly be a coincidence: the odds of something like this seem infinitesimally small. And yet, it’s now widely agreed that the “Bible code” is a trick of the look-elsewhere effect. The Book of Genesis is a long text, clocking in at over 38,000 words in English translation. Given the sheer number of letters being analyzed, and the flexibility of the analysis itself (Rips and Witztum could just as easily have looked at every 6th letter, or every 7th, and so on), it would have been more unusual if the researchers hadn’t found some sort of statistically significant pattern.
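A back-of-envelope calculation suggests just how unsurprising such “hidden” matches are. The figures below are rough assumptions chosen for illustration only: roughly 78,000 Hebrew letters in Genesis, an alphabet of 22 equally likely letters (real letter frequencies are not uniform), a 4-letter target name, and skip intervals from 1 to 100 searched in one direction.

```python
# Expected number of chance appearances of one 4-letter name across
# equidistant letter sequences of a long text, under uniform-letter assumptions.
N = 78_000        # approximate number of letters searched (assumption)
alphabet = 22     # letters in the Hebrew alphabet
target_len = 4    # length of the name we're "looking for"

p_match = (1 / alphabet) ** target_len  # chance one placement spells the name

expected_hits = sum((N - (target_len - 1) * d) * p_match for d in range(1, 101))
print(f"Expected chance matches: {expected_hits:.1f}")  # a few dozen
```

Under these rough assumptions, a short name is expected to appear by chance dozens of times somewhere in the text, so finding it “encoded” is not remarkable once you allow yourself enough places to look.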

Example 2 - Looking elsewhere for the Higgs boson

In December 2011, physicists at the CERN Large Hadron Collider believed that they had found evidence of the Higgs boson particle, a foundational but at that point unconfirmed component of the standard model of particle physics. However, this early observation may have been shaped by the look-elsewhere effect: the apparent signal was an excess of events found after searching across a wide range of possible particle masses, which increased the chance that the pattern they’d observed was just the result of random fluctuations.17
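Particle physicists handle this with what they call a trial factor: a p-value that looks impressive at one particular mass is adjusted for all the places the search could have found a bump. The numbers below are illustrative assumptions, not the actual CERN analysis:

```python
# Local vs. global significance: adjusting a p-value for the number of
# (roughly independent) search windows, in the spirit of a trial factor.
local_p = 0.001    # how unlikely the bump looks at one particular mass
n_windows = 100    # assumed number of independent places it could have appeared

global_p = 1 - (1 - local_p) ** n_windows
print(f"Local p-value:  {local_p}")
print(f"Global p-value: {global_p:.3f}")  # ~0.095, nearly 100x larger
```

This kind of correction is one reason particle physics uses its famously strict five-sigma standard before declaring a discovery.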

Summary

What it is

The look-elsewhere effect describes how findings that appear to be significant might actually have arisen purely through chance.

Why it happens

Researchers are driven to continue “looking elsewhere” for a statistically significant result by cognitive biases such as optimism bias and effort justification, as well as by systemic problems in the science community. Mathematically speaking, doing so is bound to increase the chances that any significant relationship that gets detected will actually just be a random coincidence.

Example 1 – The Bible code and the look-elsewhere effect

In the 1990s, researchers believed they had discovered an amazing pattern in the Book of Genesis: the sequence formed by every fifth letter contained the names, birthdays, and death dates of 32 notable rabbis. As miraculous as this seems at first blush, it too is likely just a result of the look-elsewhere effect: with such a large amount of data and so many possible ways to search it, some statistically significant pattern is bound to turn up.

Example 2 – Looking elsewhere for the Higgs boson

In 2011, physicists believed that they had found evidence of the (at that point) elusive Higgs boson particle. In fact, the patterns that they believed indicated the Higgs boson were probably just random fluctuations in their huge dataset.

How to avoid it

Preregistering scientific studies and moving towards publication of statistically insignificant results are two important steps that the scientific community can take to combat the look-elsewhere effect.

References

  1. Camerer, C. F., Dreber, A., Forsell, E., Ho, T. H., Huber, J., Johannesson, M., … & Heikensten, E. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433-1436.
  2. Engber, D. (2016, April 19). Think psychology’s replication crisis is bad? Welcome to the one in medicine. Slate Magazine. https://slate.com/technology/2016/04/biomedicine-facing-a-worse-replication-crisis-than-the-one-plaguing-psychology.html
  3. Goldman, M. (2008). Why is multiple testing a problem? [PDF]. University of California, Berkeley. https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf
  4. Koehrsen, W. (2018, February 7). The misleading effect of noise: The multiple comparisons problem. Medium. https://towardsdatascience.com/the-multiple-comparisons-problem-e5573e8b9578
  5. Kahneman, D. (2011). Thinking, fast and slow. Macmillan.
  6. Ackerman, C. E. (2016, September 1). Pollyanna principle: The psychology of positivity bias. PositivePsychology.com. https://positivepsychology.com/pollyanna-principle/
  7. Lovallo, D., & Kahneman, D. (2003, July). Delusions of success: How optimism undermines executives’ decisions. Harvard Business Review. https://hbr.org/2003/07/delusions-of-success-how-optimism-undermines-executives-decisions
  8. Aronson, E., & Mills, J. (1959). The effect of severity of initiation on liking for a group. The Journal of Abnormal and Social Psychology, 59(2), 177-181. https://doi.org/10.1037/h0047195
  9. Larson, R. C., Ghaffarzadegan, N., & Xue, Y. (2014). Too many PhD graduates or too few academic job openings: The basic reproductive number R0 in academia. Systems Research and Behavioral Science, 31(6), 745-750.
  10. Inzlicht, M. (2020, June 26). The replication crisis is not over. Michael Inzlicht. https://michaelinzlicht.com/getting-better/2020/6/26/the-replication-crisis-is-not-over
  11. Center for Open Science. (n.d.). Preregistration. https://www.cos.io/initiatives/prereg
  12. In praise of replication studies and null results. (2020). Nature, 578, 489-490.
  13. Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
  14. John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532.
  15. Yong, E. (2018, November 19). Psychology’s replication crisis is running out of excuses. The Atlantic. https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/
  16. Flender, S. (2019, July 28). The statistics of the improbable. Medium. https://towardsdatascience.com/the-statistics-of-the-improbable-cec9a754e0ff
  17. Dawid, R. (2015). Higgs discovery and the look elsewhere effect. Philosophy of Science, 82(1), 76-96.