A shift from significance test to hypothesis test through power analysis in medical research
Girish Singh
Department of Basic Principles, Institute of Medical Sciences, Banaras Hindu University, Varanasi - 221005, India
Correspondence Address: Source of Support: None, Conflict of Interest: None PMID: 16679686
Until recently, medical research literature was dominated by Fisher's significance test approach to statistical inference, which concentrates on the probability of type I error, rather than by Neyman-Pearson's hypothesis test, which considers the probabilities of both type I and type II error. Fisher's approach dichotomises results into significant or not significant with a P value. The Neyman-Pearson approach speaks of acceptance or rejection of the null hypothesis. Based on the same theory, these two approaches deal with the same objective and conclude in their own ways. Advances in computing techniques and the availability of statistical software have led to increasing application of power calculations in medical research, and thereby to reporting the results of significance tests in the light of the power of the test as well. The significance test approach, when it incorporates power analysis, contains the essence of the hypothesis test approach. It may be safely argued that the rising application of power analysis in medical research may have initiated a shift from Fisher's significance test to Neyman-Pearson's hypothesis test procedure.
Keywords: Power calculation, type II error, effect size, significance level
The application of inferential statistics in biomedical and other areas has so far led to results being presented as significant or non-significant with P values, which was not the intention of the founders of inferential statistics. P values are unfortunately often misunderstood as the probability that the null hypothesis is true. What a P value actually gives is the strength of evidence against the null hypothesis: the smaller the P value, the stronger the evidence against the null hypothesis. For example, a P value of 0.02 means that, assuming the treatment has no effect and given the sample size, an effect at least as large as the observed effect would be seen in only 2% of studies. Significant findings are generally more likely to be reported than non-significant results, in spite of the fact that examining 20 associations will, on average, give one result that is significant at P = 0.05 by chance alone. Therefore, clinically important differences observed as non-significant in small studies would be ignored, whereas all significant findings may be assumed to result from real treatment effects.
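The arithmetic behind the "20 associations" remark can be sketched as follows. This is a minimal illustration that assumes the 20 tests are independent and that every null hypothesis is true; the function name is hypothetical and not taken from the article.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more 'significant' results when every null hypothesis is true.

    Assumes the tests are independent (an assumption made for this sketch).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# Expected number of false positives among 20 true-null tests at alpha = 0.05
expected_false_positives = 20 * 0.05
print(expected_false_positives)

# Probability that at least one of the 20 tests is significant by chance
print(round(prob_at_least_one_false_positive(20), 3))  # 0.642
```

Under these assumptions, one chance finding is expected on average, and there is roughly a 64% chance of at least one.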
Fisher's versus Neyman-Pearson's approach
There are two approaches to statistical inference, namely Fisher's approach and the Neyman-Pearson approach. Statistical inference is based on deductive probabilities, calculated with mathematical formulae that describe (under certain assumptions) the frequency of all possible experimental outcomes if an experiment is repeated many times. Methods based on this view of probability include the P value, an informal index of the strength of evidence that measures the discrepancy between the data and the null hypothesis (proposed by R.A. Fisher in the 1920s), and a method for choosing between hypotheses, called a hypothesis test, developed in the early 1930s by the mathematical statisticians Jerzy Neyman and Egon Pearson. The latter approach has the appeal that, if we assume an underlying truth, the chances of errors can be calculated with mathematical formulae, deductively and therefore objectively.
Fisher's approach has dominated publications in medical research for a long time. Fisher's significance test concentrates only on the probability of type I error, i.e. α (the probability of rejecting the null hypothesis when it is true), and dichotomises results into significant and non-significant with a P value. Recent developments have led to the reporting of confidence intervals for the difference in addition to the precise P value, without reference to arbitrary thresholds. However, the subjective interpretation inherent in Fisher's approach may lead to neither belief nor disbelief in the null hypothesis if the P value falls around 0.05. This may push the researcher to perform another experiment. Another common mistake is that some researchers interpret the results of a significance test as though they were an indication of effect size. For example, a P value of 0.001 is assumed to reflect a larger effect than a P value of 0.05. This is incorrect because the P value is a function of sample size as well as effect size. Similarly, a non-significant P value does not always indicate that the treatment has been proved ineffective. A major criticism of Fisher's P value is that it is a measure of evidence that does not take into account the size of the observed effect. A small effect in a study with a large sample size can have the same P value as a large effect in a small study.
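The last point can be illustrated with a short sketch. It assumes a two-sided comparison of two equal-sized groups using a normal (z) approximation with standardized effect size d, which is a simplification of the t test rather than the article's own calculation; the function name and the numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided P value for a standardized mean difference (normal approximation)."""
    z = effect_size * sqrt(n_per_group / 2)  # test statistic for two equal groups
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A small effect in a large study ...
p_small_effect = two_sided_p_value(effect_size=0.2, n_per_group=400)
# ... and a large effect in a small study
p_large_effect = two_sided_p_value(effect_size=0.8, n_per_group=25)

print(round(p_small_effect, 4), round(p_large_effect, 4))
```

Both calls give the same test statistic (about 2.83) and therefore the same P value, although the effect sizes differ fourfold.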
Neyman and Pearson's hypothesis tests were designed to replace the subjective view of the strength of evidence against the null hypothesis provided by the P value with an objective, decision-based approach to the results of an experiment. The approach concentrates on α (the probability of type I error, which is the probability of rejecting a correct null hypothesis), β (the probability of type II error, which is the probability of incorrectly accepting the null hypothesis) and power (1 - β, which is the probability of correctly rejecting the null hypothesis). The Neyman-Pearson approach relies on a decision rule, fixed in advance, for interpreting the results, and the result of the analysis is rejection or acceptance of the null hypothesis. The researcher has the option to change the decision rule by specifying the alternative hypothesis, α and β in advance of the experiment. By fixing α and β in advance, the number of mistakes made over many different experiments would be limited.
Bridging the gap: Power analysis
Both approaches have been in use in medical research, though Fisher's approach dominated until recently. Over the last 10 to 15 years the medical literature has exhibited a substantial shift from Fisher's to Neyman-Pearson's approach through power considerations. It may be argued that power analysis bridges the gap between the two approaches. The combination of the hypothesis test and significance test approaches is characterized by setting the type I error rate (almost always 5%) and power (almost always > 80%) before the experiment, then calculating a P value and rejecting the null hypothesis if the P value is less than the preset type I error rate.
Recent literature reviews reveal an increasing practice of interpreting results in the light of power calculations. Interpretation of results obtained with a statistical tool should also take into account the potential (power) of the tool itself. If a test fails to show the desired treatment effect, it should not be assumed that there is no effect; rather, the test may have been too weak to detect an otherwise significant effect.
A statistical test's power is the probability that the test procedure will result in statistical significance when the effect assumed under the alternative hypothesis is present. Power analysis is used to anticipate the likelihood that the study will yield a significant effect, and it is based on the same factors as the significance test itself.
The three factors, namely significance level (α), sample size (n) and effect size, together with power form a closed system: once any three are established, the fourth is completely determined. In any test procedure the aim is to minimise both α and β and thus maximise 1 - β (power).
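The closed system can be made concrete with a small sketch that computes power from the other three quantities. It assumes a two-sided comparison of two equal-sized groups and uses a normal approximation in place of the exact t distribution, so the values are approximate; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # critical value fixed by alpha
    shift = effect_size * sqrt(n_per_group / 2)   # expected test statistic under the alternative
    # Probability that the statistic falls in either rejection region
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Medium effect (0.5), 64 subjects per group, alpha = 0.05
print(round(two_sample_power(0.5, 64), 2))  # about 0.81
```

Fixing α, n and the effect size determines the power; the same relation can be solved for any one of the four quantities when the other three are given.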
The term 'effect size' refers to the magnitude of the effect under the alternative hypothesis. The bigger it is, the easier it will be to find. The nature of the effect size varies from one statistical procedure to another, though its function in power analysis remains the same in all procedures. For example, in the paired t test the effect size is the mean of the differences divided by the standard deviation of the differences, while in the unpaired t test the effect size is the absolute difference between the two group means divided by the pooled standard deviation. The effect size for the correlation coefficient 'r' is simply 'r' itself. Cohen has defined small, medium and large effect sizes for many types of tests that may form useful conventions. For the unpaired t test, the conventional small, medium and large effect sizes are 0.20, 0.50 and 0.80 respectively, whereas for the paired t test these are 0.10, 0.25 and 0.40. Conventional small, medium and large effect sizes in the case of the correlation coefficient are 0.10, 0.30 and 0.50 respectively.
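The two t test effect sizes described above can be sketched as follows. The function names and the small data sets are hypothetical and serve only to illustrate the formulas.

```python
from math import sqrt
from statistics import mean, stdev, variance


def unpaired_effect_size(group_a: list[float], group_b: list[float]) -> float:
    """Absolute difference between group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return abs(mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def paired_effect_size(before: list[float], after: list[float]) -> float:
    """Mean of the paired differences divided by their standard deviation."""
    differences = [a - b for a, b in zip(after, before)]
    return mean(differences) / stdev(differences)


# Illustrative data only
print(round(unpaired_effect_size([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]), 3))
print(round(paired_effect_size([10, 12, 11, 13], [11, 14, 12, 16]), 3))
```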
Power increases when effect size and/or sample size increases. If the significance level increases (for example, from α = 0.05 to α = 0.10), power also increases. For a given α, n and effect size, a one-tailed test is more powerful than a two-tailed test. A test based on two samples of equal size will be more powerful than one based on samples of unequal size with the same total. Power can also be increased by introducing covariates in the study and performing analysis of covariance (ANCOVA).
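These relationships can be checked numerically. The sketch below assumes a two-group comparison with a normal approximation to the test statistic; the function `approx_power` and its parameters are illustrative, not a reference implementation.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(effect_size: float, n1: int, n2: int,
                 alpha: float = 0.05, two_tailed: bool = True) -> float:
    """Approximate power for comparing two group means (normal approximation)."""
    # Expected test statistic under the alternative; largest when n1 == n2 for a fixed total
    shift = effect_size * sqrt(n1 * n2 / (n1 + n2))
    if two_tailed:
        z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
        return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(shift - z_crit)


baseline = approx_power(0.5, 50, 50)
print("baseline (d=0.5, 50 vs 50):", round(baseline, 3))
print("larger effect (d=0.8):     ", round(approx_power(0.8, 50, 50), 3))
print("larger samples (100 vs 100):", round(approx_power(0.5, 100, 100), 3))
print("alpha = 0.10:              ", round(approx_power(0.5, 50, 50, alpha=0.10), 3))
print("one-tailed:                ", round(approx_power(0.5, 50, 50, two_tailed=False), 3))
print("unequal groups (80 vs 20): ", round(approx_power(0.5, 80, 20), 3))
```

Every variation except the unequal allocation gives higher power than the baseline; splitting the same 100 subjects 80 to 20 gives lower power.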
There are three types of power analysis, viz. a priori, post-hoc and compromise. A priori power analysis is carried out during the design stage of the study to determine the sample size for a given significance level, power and effect size. The effect size may be chosen based on substantive knowledge, previous research or conventions. An a priori power analysis avoids wasting time and resources on a study that has little chance of finding a significant effect, and also prevents testing more subjects than are necessary to detect an effect. Post-hoc power analysis is done after a study has been carried out, to determine the power of the test for the given sample size, significance level and observed effect size. This helps in explaining the results of a study that did not find any significant effect. Compromise power analysis may be carried out in situations where pragmatic constraints restrict adherence to the recommendations derived from a priori power analysis. The maximum possible sample size and the effect size are fixed, and the alpha and beta error probabilities are chosen in accordance with their relative seriousness. Obviously, a priori power analysis is more useful than post-hoc analysis, whereas compromise power analysis is more complex, rarely used and controversial. Determination of the power of the test is required when the test results in non-significance. It is also of great value in situations in which statistical significance may bear little relation to clinical significance and a conventional analysis using P values is liable to be misleading.
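A priori and post-hoc calculations can be sketched for the two-group case as follows. The sketch assumes equal group sizes, a two-sided test and a normal approximation, so the sample size it returns may be slightly smaller than the figure a t-based power program would give; the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def a_priori_n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Design stage: subjects needed per group for the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def post_hoc_power(observed_effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """After the study: power for the sample size used and the observed effect size."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = observed_effect_size * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# A priori: medium effect (0.5), alpha = 0.05, power = 0.80
print(a_priori_n_per_group(0.5))  # 63

# Post-hoc: a study that used only 20 subjects per group
print(round(post_hoc_power(0.5, 20), 2))
```

In this illustration a study with 20 subjects per group has power well below 0.80 for a medium effect, which would help explain a non-significant result.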
Further, power analysis also has a role in ethical issues. If a study to test a new drug will have required power with a sample of 50 patients, then it would be inappropriate to use a sample of 100 patients since the other 50 are being unnecessarily put at risk. Similarly, if a study requires 100 patients to yield desirable power, it would be inappropriate to use only 50 because these 50 patients have been put at risk for no reason.
A power analysis program can be used to determine power given the values of sample size, significance level and effect size. If the power is deemed to be inadequate, steps may be taken to increase it. The easy and wide availability of computer software, much of which can be freely downloaded from the internet, has made a priori and post-hoc power calculations popular among researchers in medical and other fields. Advances in computing techniques have broadened the area of investigation and made it possible to pick up, with greater precision, small effects that until recently remained a matter of textbook theory. Longitudinal studies indicate a large increase in the use of more complex statistical methods. As the statistical complexity of published research increases, a deeper understanding of statistics has become vital.
It seems logical to expect that these are some of the reasons for the shift from Fisher's significance testing to Neyman-Pearson's hypothesis testing. It may be safely argued that a significance test approach incorporating power calculations may be regarded as an entry into the hypothesis test approach. This move may be viewed as a byproduct of the rise in power calculation requirements.