Most of this post is inspired by a lecture on probabilities
by Ellen Evers during a PhD workshop we taught (together with Job van Wolferen
and Anna van ‘t Veer) called ‘How do we know what’s likely to be true’. I’d
heard this lecture before (we taught the same workshop at Eindhoven a year ago)
but now she extended her talk to the probability of observing a mix of
significant and non-significant findings. If this post is useful for you, credit
goes to Ellen Evers.
A few days ago, I sent around some questions on Twitter (thanks
for answering!) and in this blog post, I’d like to explain the answers.
Understanding this is incredibly important and will change the way you look at
sets of studies that contain a mix of significant and non-significant results,
so you want to read until the end. It’s not that difficult, but you probably
want to get a coffee. 42 people answered the questions, and all but 3 worked in
science, anywhere from 1 to 26 years. If you want to do the questions before
reading the explanations below (which I recommend), go here.
I’ll start with the easiest question, and work towards the most difficult one.
Running a single study
I asked: You are
planning a new study. Beforehand, you judge it is equally likely that the
null-hypothesis is true, as that it is false (a uniform prior). You set the
significance level at 0.05 (and pre-register this single confirmatory test to
guarantee the Type 1 error rate). You design the study to have 80% power if
there is a true effect (assume you succeed perfectly). What do you expect is
the most likely outcome of this single study?
The four response options were:
1) It is most likely that you will observe a true positive (i.e., there is an effect, and the observed difference is significant).
2) It is most likely that you will observe a true negative (i.e., there is no effect, and the observed difference is not significant).
3) It is most likely that you will observe a false positive (i.e., there is no effect, but the observed difference is significant).
4) It is most likely that you will observe a false negative (i.e., there is an effect, but the observed difference is not significant).
59% of the people chose the correct answer: It’s most likely
that you’ll observe a true negative. You might be surprised, because the
scenario (5% significance level, 80% power, the null hypothesis (H0) and the
alternative hypothesis (H1) are equally likely to be true) is pretty much the
prototypical experiment. It thus means that a typical experiment (at least when
you think your hypothesis is 50% likely to be true) is most likely not to reject the null-hypothesis (earlier, I wrote 'fail', but in the comments Ron Dotsch correctly points out not rejecting the null can be informative as well).
Let’s break it down slowly.
If you perform a single study, the effect you are examining
is either true or false, and the difference you observe is either significant
or not significant. These four possible outcomes are referred to as true
positives, false positives, true negatives, and false negatives. The percentage of false positives equals the Type 1 error rate (or α, the significance level), and false negatives (or Type 2 errors, β) equal 1 minus the power of the study. When the null hypothesis (H0) and the alternative hypothesis (H1) are a-priori equally
likely, the significance level is 5%, and the study has 80% power, the relative
likelihood of the four possible outcomes of this study before we collect the
data is detailed in the table below.
| | H0 True (A-Priori 50% Likely) | H1 True (A-Priori 50% Likely) |
|---|---|---|
| Significant Finding | False Positive (α): 2.5% | True Positive (1-β): 40% |
| Non-Significant Finding | True Negative (1-α): 47.5% | False Negative (β): 10% |
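The four cell probabilities in the table follow from multiplying the prior by the error rates. A minimal Python sketch (the variable names are mine, not from the post):

```python
# Probabilities of the four outcomes of a single study with a 50% prior
# that H1 is true, a 5% significance level, and 80% power.
prior_h1 = 0.5
alpha = 0.05
power = 0.80

true_positive = prior_h1 * power              # 0.5 * 0.80 = 40%
false_negative = prior_h1 * (1 - power)       # 0.5 * 0.20 = 10%
false_positive = (1 - prior_h1) * alpha       # 0.5 * 0.05 = 2.5%
true_negative = (1 - prior_h1) * (1 - alpha)  # 0.5 * 0.95 = 47.5%

# The true negative (47.5%) is the single most likely outcome.
print(true_positive, false_negative, false_positive, true_negative)
```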
The only way a true positive is most likely (the answer
provided by 24% of the participants) given this a-priori likelihood of H0 is when the power is higher than 1-α, so in this example higher
than 95%. After asking which outcome
was most likely, I asked how likely
this outcome was. In the sample of 42 people who filled out my questionnaire, there were
people who responded intuitively, and those who did the math. Twelve people
correctly reported 47.5%. What’s interesting is that 16 people (more than
one-third) reported a percentage higher than 50%. These people might have
simply ignored the information that the hypothesis was equally likely to be
true as it was to be false (which implies no outcome can be higher than
50%), and intuitively calculated probabilities assuming the effect was true,
while ignoring the probability it was not true. The modal response for people
who had indicated earlier that they thought it was most likely to observe a
true positive also points to this, because they judged it would be 80% probable
that this true positive was observed.
Then I asked:
“Assume you performed the single study
described above, and have observed a statistical difference (p < .05, but you don’t have any
further details about effect sizes, exact p-values,
or the sample size). Simply based on the fact that the study is statistically
significant, how likely do you think it is you observed a significant
difference because you were examining a true effect?”
Eight people (who did the math) answered 94.1%, the correct
answer. All but two people who responded intuitively underestimated the correct
answer (the average answer was 57%). The remaining two answered 95%, which
indicates they might have made the common error to assume that observing a
significant result means it’s 95% likely the effect is true (it’s not, see
Nickerson, 2000). It’s interesting that people who responded intuitively overestimated the a-priori chance of a specific outcome, but then massively underestimated the probability of having
observed a specific outcome if the effect was true. The correct answer is 94.1%
because now that we know we did not observe a non-significant effect, we are
left with the remaining probabilities that the effect is significant. There was
2.5% chance of a Type 1 error, and a 40% chance of a true positive. That means
the probability of observing this positive outcome, if the effect is true, is
40 divided by the total, which is 40+2.5. And 40/(40+2.5)=94.1%. Ioannidis (2005) calls this post-study probability that the effect is true the positive predictive value (PPV; thanks to Marcel van Assen for pointing this out).
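The same arithmetic written out in Python, under the assumptions above (50% prior, α = .05, 80% power):

```python
# Positive predictive value (Ioannidis, 2005): the probability the effect
# is true, given only that the study came out significant.
prior_h1 = 0.5
alpha = 0.05
power = 0.80

p_true_positive = prior_h1 * power         # 40%
p_false_positive = (1 - prior_h1) * alpha  # 2.5%
ppv = p_true_positive / (p_true_positive + p_false_positive)
print(round(ppv * 100, 1))  # 94.1
```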
What happens if you run multiple studies?
Continuing the example as Ellen Evers taught it, I asked people to imagine they performed three of the studies described above, and found that two were significant but one was not. How likely would it be to observe this outcome if the alternative hypothesis is true? All people who did the math gave the answer 38.4%. This is the a-priori likelihood of finding 2 out of 3 studies to be significant with 80% power and a 5% significance level. If the effect is true, there’s an 80% probability of finding an effect, times an 80% probability of finding an effect, times a 20% probability of making a Type 2 error: 0.8*0.8*0.2 = 12.8%. If you calculate the probability for the three ways to get two out of three significant results (S S NS; S NS S; NS S S) you multiply it by 3, and 3*12.8% gives 38.4%. Ellen prefers to focus on the single outcome you have observed, including the specific order in which it was observed.
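These a-priori probabilities can be checked in a few lines of Python (my sketch, not from the post):

```python
from math import comb

power = 0.80
# One specific order, e.g. significant, significant, non-significant:
p_specific_order = power * power * (1 - power)  # 12.8%
# Any 2 out of 3 studies significant (3 possible orders):
p_any_order = comb(3, 2) * p_specific_order     # 38.4%
print(round(p_specific_order, 3), round(p_any_order, 3))  # 0.128 0.384
```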
We therefore also need to know how likely it is to observe
this finding when the null-hypothesis is true. In that case, we would find a
Type 1 error (5%), another Type 1 error (5%), and a true negative (95%), and
0.05*0.05*0.95 = 0.2375%. There are three ways to get this pattern of results,
so if you want the probability of 2 out of 3 significant findings under H0 irrespective of the order, this probability is 0.7125%. That’s not very likely at all.
To answer the question, we need to calculate 12.8/(12.8+0.2375)
(for the specific order in which the results were observed) or 38.4/(38.4+0.7125) (for any 2 out of 3 studies) and both calculations give us 98.18%. Although a-priori it is not extremely likely to observe 2 significant and 1 non-significant finding, after you have observed this outcome, it is more than 98% likely to have observed 2
significant and one non-significant result in three studies when the effect is
true (and thus only 1.82% when the effect is not true).
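Putting the two likelihoods together gives this posterior probability. A Python sketch (because the prior on H1 is 50%, the prior terms cancel out of the ratio):

```python
power, alpha = 0.80, 0.05
# 2 significant + 1 non-significant study, in a specific order:
p_if_h1 = power * power * (1 - power)  # 12.8%
p_if_h0 = alpha * alpha * (1 - alpha)  # 0.2375%
# With a 50/50 prior, the priors cancel from numerator and denominator:
posterior_h1 = p_if_h1 / (p_if_h1 + p_if_h0)
print(round(posterior_h1 * 100, 2))  # 98.18
```

The same 98.18% comes out if you use the any-order probabilities (38.4% and 0.7125%), since the factor of 3 also cancels.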
The probability that, given that you observed a mix of significant and non-significant studies, the effect you observed was true, is important to understand correctly if you do research. In a time where sets of 5 or 6 significant low-powered studies are criticized for being ‘too good to be true’ it’s important that we know when a set of studies with a mix of significant and non-significant studies is ‘too true to be bad’. Ioannidis (2005) briefly mentions you can extend the calculations for multiple studies, but focuses too much on when findings are most likely to be false. What struck me from the lecture Ellen Evers gave, is how likely some sets of studies that include non-significant findings are to be true.
These calculations depend on the power, significance level,
and a-priori likelihood that H0 is true. If Ellen and I ever find the time to work on a follow up to our recent article on Practical Recommendations to Increase the Informational Value of Studies, I would like to discuss these issues in more detail. To interpret whether 1
out of 2 studies is still support for your hypothesis, these values matter a
lot, but to interpret whether 4 out of 6 studies are support for your
hypothesis, they are almost completely irrelevant. This means that one or two
non-significant findings in a larger set of studies do almost nothing to reduce
the likelihood that you were examining a true effect. If you’ve performed three
studies that all worked, and a close replication isn’t significant, don’t get
distracted by looking for moderators, at least until the unexpected result is
replicated.
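To see how little these values matter once most of a larger set of studies is significant, here is a small sketch (the function name `posterior_h1` and the binomial framing are my own rendering of the calculations above, treating studies as independent):

```python
from math import comb

def posterior_h1(k, n, power=0.80, alpha=0.05, prior_h1=0.5):
    """P(H1 | k of n studies significant), with independent studies."""
    p_h1 = comb(n, k) * power**k * (1 - power)**(n - k) * prior_h1
    p_h0 = comb(n, k) * alpha**k * (1 - alpha)**(n - k) * (1 - prior_h1)
    return p_h1 / (p_h1 + p_h0)

# 1 out of 2 significant: the posterior moves a lot with the prior.
# 4 out of 6 significant: the posterior stays near 1 for any reasonable prior.
for prior in (0.2, 0.5, 0.8):
    print(prior,
          round(posterior_h1(1, 2, prior_h1=prior), 3),
          round(posterior_h1(4, 6, prior_h1=prior), 6))
```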
I've taken the spreadsheet Ellen Evers made and shared with the PhD students, and extended it slightly. You can download it here, and use it to perform your own calculations with different levels of power, significance levels, and a-priori likelihoods of H0. On the second tab of the spreadsheet, you can perform these calculations for studies that have different power and significance levels. If you want to start trying out different options immediately, use the online spreadsheet below:
If we want to reduce publication bias, understanding (I mean, really understanding) that sets of studies that include non-significant findings are to be expected when H1 is true is a very important realization. Depending on the number of studies, their power,
significance level, and the a-priori likelihood of the idea you were testing,
it can be no problem to submit a set of studies with mixed significant and
non-significant results for publication. If you do, make sure that the Type 1
error rate is controlled (e.g., by pre-registering your study design).
I want to end with a big thanks to Ellen Evers for
explaining this to me last week, and thanks so much to all of you who answered
my questionnaire about probabilities.