Summary: Explains how the concept of statistical significance relates to the proportion of true and false results that are published in the research literature. Sidebar to Jakob Nielsen's column Risks of Quantitative Studies, March 2004.

In the main article, I said that "one out of every twenty significant results might be random" if you rely solely on statistical analysis. This is a bit of an oversimplification. Here's the detailed story.

"Statistical significance" refers to the probability that the observed result could have occurred randomly if there were no true underlying effect. This probability is usually referred to as p, and by convention, p should be smaller than 5% to consider a finding significant. Sometimes researchers insist on stronger significance and want p to be smaller than 1%, or even 0.1%, before they'll accept a finding with wide-reaching consequences, say, for a new blood-pressure medication to be taken by millions of patients.

If we test twenty questions that have no underlying effect at play, we would on average expect one statistical test to come out as "significant." This doesn't really mean that one out of every twenty published studies is wrong. It means that one out of every twenty statistical tests is wrong if we simply go fishing for results without founding our study on an understanding of the issues at play.

Good researchers start by building their hypotheses on qualitative insights. For example, after having observed how people read online, a researcher might suspect that scannable layouts would make website content easier to read and understand. If you run statistical tests on questions that are likely to be true, your findings are less likely to be false.

As a thought experiment, let's assume that a researcher, Dr. Bob, has established 100 hypotheses, of which 80% are true. At a 5% significance level, Bob will erroneously accept about one of the 20 false hypotheses. Assuming Bob is running a study with good statistical power, he'll accept most of the 80 true hypotheses, rejecting maybe 10 as insignificant. Bob will then publish 71 of his conclusions, of which 70 are true and one is false. In other words, only 1.4% of Bob's papers will be bogus.

Unfortunately, not all real-world researchers are good enough that 80% of their hypotheses will be correct. And not all studies have sufficient statistical power to accept 70 out of every 80 correct hypotheses. Thus, the percentage of bogus results in most published quantitative research is higher, but we can't determine that percentage exactly because it depends both on researchers' competence and their pre-study insights.

Whenever a statistical analysis is performed and the results interpreted, there is always a possibility that the results arose purely by chance (random error). This is an inherent limitation of any statistical analysis and cannot be done away with. In addition, mistakes such as measurement errors may cause the experimenter to misinterpret the results (systematic error). Fortunately, the probability that the results are simply a chance occurrence can be calculated, and a minimum threshold of statistical significance can be set. If this probability falls below the threshold, then we can say the results have a high probability of not being due to chance. Note that the probability is never zero; statistical tests are never 100% certain.
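To make this concrete, here is a minimal sketch in Python of how such a probability (the p-value) might be calculated for a simple comparison of two conversion rates. The counts are hypothetical, and the pooled two-proportion z-test is just one of several tests that could be used:

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using the pooled two-proportion z-test (a normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a difference at least this extreme under the null hypothesis.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical data: 120/1000 conversions on the baseline, 150/1000 on the variation.
p = two_proportion_p_value(120, 1000, 150, 1000)
print(f"p = {p:.4f}")                  # about 0.0496
print("significant at 5%?", p < 0.05)  # True, but only barely
```

With these numbers the p-value lands just under 0.05, so the result would conventionally be called significant, though only by a whisker.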
Threshold levels merely indicate the risk we are willing to take when it comes to accepting or rejecting a particular hypothesis.
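To see that risk in action, here is a small Monte Carlo sketch (all numbers hypothetical). It repeatedly compares two groups drawn from the same distribution, so no true effect exists, and counts how often the test nevertheless comes out "significant" at the 5% level:

```python
import math
import random

random.seed(42)

def one_experiment(n=100):
    """Compare two groups drawn from the SAME distribution: no real effect
    exists, so any 'significant' outcome is a false positive."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2 / n)  # z-test, known sigma = 1
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided p

trials = 10_000
false_positives = sum(one_experiment() < 0.05 for _ in range(trials))
print(f"'significant' results with no true effect: {false_positives / trials:.1%}")
# Prints roughly 5%: about one test in twenty succeeds by chance alone.
```

Roughly one run in twenty is declared significant even though nothing is going on, which is exactly the risk a 5% threshold accepts.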
In terms of the null hypothesis, statistical significance can be understood as the minimum level at which the null hypothesis can be rejected. This means that if the experimenter sets his statistical significance level at 5% and the probability that the results are a chance process (i.e., the p-value) is 3%, then he can reject the null hypothesis. In this case, the experimenter will claim that his results are statistically significant. Some research disciplines are stricter than others, however, and the research design itself may warrant a more or less stringent threshold. In any case, the lower the significance level, the higher the confidence you can have in the result.

What's a Good Significance Level?

Statistically significant results are required for many practical cases of experimentation in various branches of research. The choice of statistical significance level is influenced by a number of parameters and depends on the experiment in question. In most cases, the data follows a normal distribution, which is thankfully also the simplest case; with a standard normal distribution, you can use a threshold level of 0.05 confidently. However, care should always be taken to account for other distributions within the given population. Although 5%, 1%, and 0.1% are common significance levels, it is not clear-cut which level to use in an actual study; it depends on the norms of the field, previous studies, and the amount of evidence needed. However, it is not recommended to use a significance level higher than 5%, because doing so too often leads to Type I errors (false positives).

Be Cautious When Interpreting the P-Value!

It can be very satisfying to work out the p-value after a long experiment, see that it's below the threshold, reject the null hypothesis, and assume the experiment is done and dusted. But the truth is that researchers still need to use care when deciding how to interpret the p-value.
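The mechanical part of the decision is simple. As a sketch, the 3% p-value from the example above clears the 5% threshold but would fail the stricter 1% and 0.1% levels used in more demanding fields:

```python
# The example above: a p-value of 3% measured against common significance levels.
p_value = 0.03

for alpha in (0.05, 0.01, 0.001):
    decision = "reject the null hypothesis" if p_value < alpha else "fail to reject"
    print(f"alpha = {alpha}: {decision}")
# alpha = 0.05 -> reject (significant at the conventional 5% level)
# alpha = 0.01 and 0.001 -> not significant; a stricter field withholds the claim
```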
Statistical significance is a measure of how unusual your experiment results would be if there were actually no difference in performance between your variation and baseline and the discrepancy in lift were due to random chance alone.

What does statistical significance really mean?

Online web owners, marketers, and advertisers have recently become interested in making sure their A/B test experiments (e.g., conversion rate A/B testing, ad copy changes, email subject line tweaks) reach statistical significance before they jump to conclusions. On the Optimizely results page, a significance level of 90% or above is a statement of statistical surprise: something very unusual has happened if, in fact, there is no difference between a variation and the baseline. However, if your experiment fails to meet or exceed your chosen significance threshold, note that this does not prove there is no treatment effect; it only means the experiment did not produce enough evidence for one.

Testing your hypothesis

Statistical significance is most practically used in hypothesis testing. For example, you want to know whether changing the color of a button on your website from red to green will result in more people clicking on it. The claim that the change makes no difference is your “null hypothesis,” represented by your experiment baseline (the current red button). The claim that turning the button green increases clicks is your “alternative hypothesis.” To evaluate the observed difference in a statistical significance test, you will want to pay attention to two outputs: the p-value and the confidence interval.

The p-value can be defined as the likelihood of seeing evidence as strong as or stronger than the observed difference in performance between your variation and baseline, calculated assuming there actually is no difference between them and any observed lift is entirely due to random chance. P-values do not communicate how large or small your effect size is or how important the result might be.

A confidence interval is an estimated range of values that is likely, but not guaranteed, to include the unknown true value for your target population if the experiment were replicated numerous times. An interval is composed of a point estimate (a single value derived from your statistical model of choice) and a margin of error around that point estimate. Best practice is to report confidence intervals alongside your statistical significance results, as they offer information about the observed effect size of your experiment.

Why is statistical significance important for business?

Your metrics and numbers can fluctuate wildly from day to day, and statistical analysis provides a sound mathematical foundation for making business decisions and guarding against false positives. A statistically significant result depends on two key variables: sample size and effect size. Sample size refers to how large the sample for your experiment is: the larger your sample size, the more confident you can be in the result of the experiment (assuming that it is a randomized sample). Effect size refers to the magnitude of the difference between variation and baseline: the larger the underlying effect, the fewer visitors you need to detect it. If you are running tests on a website, the more traffic your site receives, the sooner you will have a large enough data set to determine whether there are statistically significant results. You will run into sampling errors if your sample size is too low.
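As a rough sketch of these ideas, here is how a point estimate, margin of error, and 95% confidence interval for the lift could be computed from hypothetical conversion counts. This uses a plain normal-approximation interval; Optimizely's own calculations differ:

```python
import math

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Point estimate and confidence interval for the lift (difference in
    conversion rates), using the unpooled normal approximation; z = 1.96
    corresponds to 95% confidence."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    point_estimate = p_b - p_a                          # the observed effect size
    margin = z * math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return point_estimate, point_estimate - margin, point_estimate + margin

# Hypothetical test: 120/1000 baseline conversions vs. 150/1000 on the variation.
est, lo, hi = lift_confidence_interval(120, 1000, 150, 1000)
print(f"point estimate: {est:+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
# An interval that excludes 0 agrees with a significant test, and its width
# shows how precisely the effect size has been measured.
```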
Beyond these two factors, a key thing to remember is the importance of randomized sampling. If traffic to a website is split evenly between two pages, but the sampling isn’t random, it can introduce errors due to differences in behavior of the sampled population. For example, if 100 people visit a website and all the men are shown one version of a page and all the women are shown a different version, then a comparison between the two is not possible, even if the traffic is split 50-50, because the difference in demographics could introduce variations in the data. A truly random sample is needed to determine that the result of the experiment is statistically significant. In the pharmaceutical industry, researchers use statistical test results from clinical trials to evaluate new drugs. Research findings from significance testing indicate drug effectiveness, which can drive investor funding and make or break a product.

Get always valid results with Stats Engine

A strict set of guidelines is required to get valid results from experiments run with classical statistics: set a minimum detectable effect and sample size in advance, don’t peek at results, and don’t test too many goals or variations at the same time. These guidelines can be cumbersome and, if not followed carefully, can produce severely distorted and dubious results. Fortunately, you can easily determine the statistical significance of your experiments using Stats Engine, the advanced statistical model built into Optimizely. Stats Engine combines sequential testing and false discovery rate control to give you trustworthy results faster, regardless of sample size. Updating in real time, it computes always-valid inference, letting you test more in less time and make statistically sound decisions as the data arrives. Start running your tests with Optimizely today and be confident in your decisions.
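As a closing illustration of why those classical guidelines exist, here is a small simulation (hypothetical numbers; it does not reproduce Stats Engine's sequential method). It runs an A/A test with no real difference, but checks for significance after every batch of visitors and stops at the first p below 0.05:

```python
import math
import random

random.seed(7)

def peeking_experiment(looks=20, batch=50):
    """Run an A/A test (no real difference) but check for significance after
    every batch of visitors, stopping at the first p < 0.05: the 'peeking'
    mistake that fixed-sample classical statistics forbids."""
    a_sum = b_sum = 0.0
    n = 0
    for _ in range(looks):
        for _ in range(batch):
            a_sum += random.gauss(0, 1)
            b_sum += random.gauss(0, 1)
        n += batch
        z = (b_sum / n - a_sum / n) / math.sqrt(2 / n)
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < 0.05:
            return True   # declared a "winner" that does not exist
    return False

trials = 2_000
rate = sum(peeking_experiment() for _ in range(trials)) / trials
print(f"false positive rate with peeking: {rate:.1%}")  # well above the nominal 5%
```

With twenty peeks, far more than 5% of these no-effect experiments declare a winner; that inflation is the distortion that sequential methods of the kind Stats Engine uses are designed to correct.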