Sidebar to Jakob Nielsen's column Risks of Quantitative Studies, March 2004.

In the main article, I said that "one out of every twenty significant results might be random" if you rely solely on statistical analysis. This is a bit of an oversimplification. Here's the detailed story.

"Statistical significance" refers to the probability that the observed result could have occurred randomly if it has no true underlying effect. This probability is usually referred to as "p" and by convention, p should be smaller than 5% to consider a finding significant. Sometimes researchers insist on stronger significance and want p to be smaller than 1%, or even 0.1%, before they'll accept a finding with wide-reaching consequences, say, for a new blood-pressure medication to be taken by millions of patients.

If we test twenty questions that have no underlying effect at play, we would on average expect one statistical test to come out as "significant" at the 5% level. This doesn't mean that one out of every twenty published studies is wrong. It means that one out of every twenty statistical tests will come out wrong if we simply go fishing for results without founding our study on an understanding of the issues at play.
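
A short simulation (hypothetical, purely for illustration) makes this vivid. Under a true null hypothesis, the p-value is uniformly distributed between 0 and 1, so each no-effect test can be simulated by drawing a random number and calling it "significant" when it falls below 0.05:

```python
import random

random.seed(1)  # any seed; the averages hover around the expected values

def fishing_trip(num_tests=20, alpha=0.05):
    # Run `num_tests` tests of questions with no real underlying effect.
    # Under the null hypothesis, each p-value is uniform on [0, 1].
    return sum(random.random() < alpha for _ in range(num_tests))

trips = [fishing_trip() for _ in range(10_000)]
print(sum(trips) / len(trips))                   # ~1.0 false positive per 20 tests
print(sum(t >= 1 for t in trips) / len(trips))   # ~0.64 of trips net at least one
```

The second number is the sobering one: although each individual test errs only 5% of the time, roughly 64% of such twenty-question fishing trips (1 - 0.95^20) will produce at least one spurious "finding."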

Good researchers start by building their hypotheses on qualitative insights. For example, after having observed how people read online, a researcher might suspect that scannable layouts would make website content easier to read and understand. If you run statistical tests on hypotheses that are likely to be true, your significant findings are less likely to be false.

As a thought experiment, let's assume that a researcher, Dr. Bob, has established 100 hypotheses, of which 80% are true. At the standard 5% significance level, Bob will erroneously accept, on average, one of the 20 false hypotheses. Assuming Bob runs studies with good statistical power, he'll accept most of the 80 true hypotheses, rejecting maybe 10 as insignificant. Bob will then publish 71 conclusions, of which 70 are true and one is false. In other words, only 1.4% of Bob's papers will be bogus.
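
The arithmetic behind this thought experiment can be spelled out directly. The following sketch assumes the 5% significance level and the 87.5% statistical power implied by Bob accepting 70 of his 80 true hypotheses:

```python
hypotheses = 100
true_hyps = 80                        # Bob's groundwork makes 80% of hypotheses true
false_hyps = hypotheses - true_hyps   # 20 false hypotheses

alpha = 0.05       # significance threshold: p < 5%
power = 70 / 80    # fraction of true hypotheses the studies correctly accept

false_positives = false_hyps * alpha   # 20 * 0.05 = 1 bogus acceptance, on average
true_positives = true_hyps * power     # 70 correct acceptances

published = true_positives + false_positives
print(f"published conclusions: {published:.0f}")          # 71
print(f"bogus share: {false_positives / published:.1%}")  # 1.4%
```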

Unfortunately, not all real-world researchers are good enough that 80% of their hypotheses will be correct. And not all studies have sufficient statistical power to confirm 70 out of every 80 correct hypotheses. Thus, the percentage of bogus results in most published quantitative research is higher than in Dr. Bob's case, but we can't determine that percentage exactly because it depends on both researchers' competence and their pre-study insights.