A shift from significance test to hypothesis test through power analysis in medical research
Department of Basic Principles, Institute of Medical Sciences, Banaras Hindu University, Varanasi - 221005, India
Medical research literature until recently exhibited substantial dominance of Fisher's significance test approach to statistical inference, which concentrates on the probability of type I error, over the Neyman-Pearson hypothesis test, which considers the probabilities of both type I and type II error. Fisher's approach dichotomises results into significant or not significant on the basis of a P value. The Neyman-Pearson approach speaks of acceptance or rejection of the null hypothesis. Built on the same theory, the two approaches address the same objective and reach conclusions in their own ways. Advances in computing techniques and the availability of statistical software have led to increasing use of power calculations in medical research and, in turn, to reporting the results of significance tests in the light of the power of the test. The significance test approach, when it incorporates power analysis, contains the essence of the hypothesis test approach. It may be safely argued that the rising application of power analysis in medical research may have initiated a shift from Fisher's significance test to the Neyman-Pearson hypothesis test procedure.
How to cite this article:
Singh G. A shift from significance test to hypothesis test through power analysis in medical research. J Postgrad Med 2006;52:148-150
Available from: https://www.jpgmonline.com/text.asp?2006/52/2/148/25167
The application of inferential statistics in biomedical and other areas has so far led to results being presented as significant or non-significant together with P values, which was not the intention of the founders of inferential statistics. The P value is unfortunately often misunderstood as the probability that the null hypothesis is true. What it actually gives is the strength of evidence against the null hypothesis: the smaller the P value, the stronger the evidence. For example, a P value of 0.02 means that, assuming the treatment has no effect and given the sample size, an effect at least as large as the observed effect would be seen in only 2% of studies. Significant findings are generally more likely to be reported than non-significant ones, even though examining 20 associations will, on average, yield one result that is significant at P = 0.05 by chance alone. It may therefore happen that clinically important differences observed in small studies are ignored because they are non-significant, whereas all significant findings are assumed to result from real treatment effects.
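The remark about 20 associations can be made concrete with a short calculation. The sketch below assumes that the tests are independent and that every null hypothesis is true; both are simplifying assumptions introduced here for illustration, and the function names are hypothetical.

```python
# Chance findings when many true null hypotheses are tested at the same level.
# Assumption: the tests are independent and every null hypothesis is true.


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Average number of tests that reach significance purely by chance."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Probability that one or more tests are significant purely by chance."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    print(expected_false_positives(20, 0.05))
    print(round(prob_at_least_one_false_positive(20, 0.05), 3))
```

Under these assumptions the expected count is one chance finding in 20 tests, while the probability of seeing at least one is well above one half, which is why selective reporting of significant results can mislead.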
Fisher's versus Neyman-Pearson's approach
There are two approaches to statistical inference, namely Fisher's approach and the Neyman-Pearson approach. Statistical inference is based on deductive probabilities, calculated with mathematical formulae that describe (under certain assumptions) the frequency of all possible experimental outcomes if an experiment were repeated many times. Methods based on this view of probability include the P value, proposed by R.A. Fisher in the 1920s as an informal index of the discrepancy between the data and the null hypothesis, and a method for choosing between hypotheses, called a hypothesis test, developed in the early 1930s by the mathematical statisticians Jerzy Neyman and Egon Pearson. The latter approach has the appeal that, if we assume an underlying truth, the chances of error can be calculated with mathematical formulae, deductively and therefore objectively.
Fisher's approach has dominated publications in medical research for a long time. Fisher's significance test concentrates only on the probability of type I error, α (the probability of rejecting the null hypothesis when it is true), and dichotomises results into significant and non-significant on the basis of a P value. Recent developments have led to the reporting of confidence intervals for the difference, in addition to the precise P value, without reference to arbitrary thresholds. However, the subjective interpretation inherent in Fisher's approach may lead to neither belief nor disbelief in the null hypothesis if the P value falls around 0.05. This may push the researcher to perform another experiment. A common mistake is that some researchers interpret the result of a significance test as though it were an indication of effect size. For example, a P value of 0.001 is assumed to reflect a larger effect than a P value of 0.05. This is incorrect because the P value is a function of sample size as well as effect size. Similarly, a non-significant P value does not always indicate that the treatment has been proved ineffective. A major criticism of Fisher's P value is that it is a measure of evidence that does not take into account the size of the observed effect. A small effect in a study with a large sample size can have the same P value as a large effect in a small study.
Neyman and Pearson's hypothesis tests were designed to replace the subjective view of the strength of evidence against the null hypothesis provided by the P value with an objective, decision-based approach to the results of an experiment. The approach concentrates on α (the probability of type I error, that is, of rejecting a correct null hypothesis), β (the probability of type II error, that is, of incorrectly accepting the null hypothesis) and power (1 - β, the probability of correctly rejecting the null hypothesis). The Neyman-Pearson approach relies on a decision rule, fixed in advance, for interpreting the results, and the outcome of the analysis is rejection or acceptance of the null hypothesis. The researcher has the option to change the decision rule by specifying the alternative hypothesis, α and β in advance of the experiment. By fixing α and β in advance, the number of mistakes made over many different experiments is limited.
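The dependence of the P value on both sample size and effect size can be illustrated numerically. The following is a minimal sketch, assuming two equal-sized groups, a known common standard deviation and a large-sample z approximation; the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided P value for a standardized mean difference d between two equal groups.

    Assumption: z approximation with a known common standard deviation.
    """
    z = d * sqrt(n_per_group / 2)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # A large effect in a small study and a small effect in a large study
    print(two_sided_p(0.8, 25))
    print(two_sided_p(0.2, 400))
    # The same small effect in a small study
    print(two_sided_p(0.2, 25))
```

In this sketch a standardized difference of 0.8 with 25 subjects per group and a difference of 0.2 with 400 per group give the same test statistic and therefore the same P value, whereas the difference of 0.2 with 25 per group is far from significant.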
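The long-run reading of α and power can be checked by simulation. The sketch below repeats a two-group experiment many times and counts how often a fixed decision rule rejects the null hypothesis. It assumes normally distributed data with a known standard deviation of 1 and a two-sided z test; these choices, the function names and the seed are illustrative assumptions, not part of the original argument.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def reject_null(sample_a, sample_b, alpha):
    """Two-sided z test for a difference in means of two equal-sized groups.

    Assumption: the standard deviation is known and equal to 1 in both groups.
    """
    n = len(sample_a)
    z = (fmean(sample_a) - fmean(sample_b)) / sqrt(2.0 / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def rejection_rate(true_diff, n_per_group, alpha, n_experiments, seed=1):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        a = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        rejections += reject_null(a, b, alpha)
    return rejections / n_experiments


if __name__ == "__main__":
    # Null hypothesis true: the rejection rate estimates the type I error rate.
    print(rejection_rate(0.0, 64, 0.05, 2000))
    # Alternative true (difference of 0.5 SD): the rejection rate estimates power.
    print(rejection_rate(0.5, 64, 0.05, 2000))
```

When the true difference is zero, the rejection rate should settle near the preset α; when a real difference exists, it estimates the power, and one minus that value estimates β.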
Bridging the gap: Power analysis
Both approaches have been in use in medical research, though Fisher's approach dominated until recently. Over the last 10 to 15 years the medical literature has exhibited a substantial shift from Fisher's to the Neyman-Pearson approach through power considerations. It may be argued that power analysis bridges the gap between the two approaches. The combination of the hypothesis test and significance test approaches is characterized by setting the type I error rate (almost always 5%) and power (almost always > 80%) before the experiment, then calculating a P value and rejecting the null hypothesis if the P value is less than the preset type I error rate.
Recent literature reviews reveal an increasing practice of interpreting results in the light of power calculations. The interpretation of results obtained with a statistical tool should also take into account the potential (power) of the tool itself. If a test fails to show the hoped-for treatment effect, it should not be assumed that there is no effect; the test may simply have been too weak to detect an effect that is real.
A statistical test's power is the probability that the test procedure will result in statistical significance when the alternative hypothesis is true. Power analysis is used to anticipate the likelihood that the study will yield a significant effect, and it is based on the same factors as the significance test itself.
The three factors, namely significance level (α), sample size (n) and effect size, together with power form a closed system: once any three are established, the fourth is completely determined. In any test procedure the aim is to minimise both α and β and thus maximise 1 - β (power).
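The closed system can be sketched with four small functions, each solving for one quantity from the other three. This is an illustration under stated assumptions, not a replacement for dedicated power software: it uses a normal approximation to a two-sided comparison of two equal-sized groups, so sample sizes come out slightly smaller than those from exact t-test calculations, and all function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power given effect size, sample size and significance level."""
    z_crit = _N.inv_cdf(1.0 - alpha / 2.0)
    shift = d * sqrt(n_per_group / 2.0)
    return _N.cdf(shift - z_crit) + _N.cdf(-shift - z_crit)


def n_per_group_required(d, power=0.80, alpha=0.05):
    """Approximate sample size per group; the negligible opposite tail is ignored."""
    z_crit = _N.inv_cdf(1.0 - alpha / 2.0)
    z_power = _N.inv_cdf(power)
    return ceil(2.0 * ((z_crit + z_power) / d) ** 2)


def detectable_effect(n_per_group, power=0.80, alpha=0.05):
    """Approximate smallest standardized effect detectable with the given power."""
    z_crit = _N.inv_cdf(1.0 - alpha / 2.0)
    z_power = _N.inv_cdf(power)
    return (z_crit + z_power) * sqrt(2.0 / n_per_group)


def alpha_implied(d, n_per_group, power=0.80):
    """Approximate two-sided significance level implied by the other three values."""
    z_crit = d * sqrt(n_per_group / 2.0) - _N.inv_cdf(power)
    return 2.0 * (1.0 - _N.cdf(z_crit))


if __name__ == "__main__":
    n = n_per_group_required(0.5, power=0.80, alpha=0.05)
    print(n)
    print(power_two_sample(0.5, n, alpha=0.05))
    print(detectable_effect(n, power=0.80, alpha=0.05))
    print(alpha_implied(0.5, n, power=0.80))
```

Feeding the output of one function into the others returns values close to the starting inputs, which is what it means for the four quantities to form a closed system.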
The term 'effect size' refers to the magnitude of the effect under the alternative hypothesis. The bigger it is, the easier it will be to find. The nature of the effect size varies from one statistical procedure to another, though its function in power analysis remains the same in all procedures. For example, in the paired t test the effect size is the mean of the differences divided by the standard deviation of the differences, while in the unpaired t test it is the absolute difference between the two group means divided by the pooled standard deviation. The effect size for the correlation coefficient 'r' is simply 'r' itself. Cohen has defined small, medium and large effect sizes for many types of tests that may form useful conventions. For the unpaired t test, the conventional small, medium and large effect sizes are 0.20, 0.50 and 0.80 respectively, whereas for the paired t test these are 0.10, 0.25 and 0.40. Conventional small, medium and large effect sizes for the correlation coefficient are 0.10, 0.30 and 0.50 respectively.
Power increases when effect size and/or sample size increases. Raising the significance level (for example, from α = 0.05 to α = 0.10) also increases power. For a given α, n and effect size, a one-tailed test is more powerful than a two-tailed test. For a fixed total sample size, a test based on two samples of equal size is more powerful than one based on samples of unequal size. Power can also be increased by introducing covariates into the study and performing analysis of covariance (ANCOVA).
There are three types of power analysis: a priori, post hoc and compromise. A priori power analysis is carried out at the design stage of a study to determine the sample size for a given significance level, power and effect size. The effect size may be chosen on the basis of substantive knowledge, previous research or convention. An a priori power analysis saves the time and resources that would otherwise be spent on a study with little chance of finding a significant effect, and it also prevents testing more subjects than are necessary to detect an effect. Post hoc power analysis is done after a study has been carried out, to determine the power of the test for the given sample size, significance level and observed effect size. This helps in explaining the results of a study that did not find any significant effect. Compromise power analysis may be carried out where pragmatic constraints prevent adherence to the recommendations of an a priori power analysis. The maximum possible sample size and the effect size are fixed, and the α and β error probabilities are chosen in accordance with their relative seriousness. A priori power analysis is clearly more useful than post hoc analysis, whereas compromise power analysis is more complex, rarely used and controversial. Determination of the power of the test is required when the test yields a non-significant result. It is also of great value in situations where statistical significance may bear little relation to clinical significance and a conventional analysis using P values is liable to be misleading.
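The two t-test effect sizes described above can be computed directly from data. The sketch below follows the verbal definitions in the text; the function names and the measurements are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import fmean, stdev, variance


def effect_size_unpaired(group_a, group_b):
    """Absolute difference between group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return abs(fmean(group_a) - fmean(group_b)) / sqrt(pooled_var)


def effect_size_paired(before, after):
    """Mean of the paired differences divided by the standard deviation of the differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return fmean(diffs) / stdev(diffs)


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only
    treated = [5.1, 4.9, 5.6, 5.8, 6.0]
    control = [4.2, 4.8, 5.0, 4.4, 4.6]
    print(effect_size_unpaired(treated, control))

    before = [10, 12, 14, 16]
    after = [11, 14, 15, 18]
    print(effect_size_paired(before, after))
```

Because both quantities are standardized by a standard deviation, they are unit-free and can be compared against conventional benchmarks or carried into a power calculation.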
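Most of these relationships can be verified with a single approximate power function. The sketch below uses a normal approximation for comparing two group means and, for the one-tailed case, assumes the true effect lies in the hypothesized direction. The function name and the example values are hypothetical, and the effect of covariates is not modelled.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution


def approx_power(d, n1, n2, alpha=0.05, two_tailed=True):
    """Approximate power for comparing two means (normal approximation).

    For a one-tailed test the effect is assumed to lie in the hypothesized direction.
    """
    shift = d / sqrt(1.0 / n1 + 1.0 / n2)
    if two_tailed:
        z_crit = _N.inv_cdf(1.0 - alpha / 2.0)
        return _N.cdf(shift - z_crit) + _N.cdf(-shift - z_crit)
    z_crit = _N.inv_cdf(1.0 - alpha)
    return _N.cdf(shift - z_crit)


if __name__ == "__main__":
    print("baseline (d=0.5, 50 vs 50):", approx_power(0.5, 50, 50))
    print("larger effect size:        ", approx_power(0.8, 50, 50))
    print("larger sample size:        ", approx_power(0.5, 80, 80))
    print("alpha raised to 0.10:      ", approx_power(0.5, 50, 50, alpha=0.10))
    print("one-tailed test:           ", approx_power(0.5, 50, 50, two_tailed=False))
    print("unequal groups (20 vs 80): ", approx_power(0.5, 20, 80))
```

Each change in the first four comparisons raises power above the baseline, while splitting the same 100 subjects unequally lowers it.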
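A priori and post hoc analyses amount to solving the power relationship for sample size and for power respectively. Compromise analysis is less familiar, so a minimal sketch may help. It assumes a two-sided comparison of two equal-sized groups under a normal approximation, and it expresses the relative seriousness of the two errors as a ratio β/α; the parameter `beta_alpha_ratio` and the function names are hypothetical choices made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution


def beta_for_alpha(alpha, d, n_per_group):
    """Type II error probability of a two-sided, two-sample test (normal approximation)."""
    z_crit = _N.inv_cdf(1.0 - alpha / 2.0)
    shift = d * sqrt(n_per_group / 2.0)
    power = _N.cdf(shift - z_crit) + _N.cdf(-shift - z_crit)
    return 1.0 - power


def compromise_alpha_beta(d, n_per_group, beta_alpha_ratio=1.0, tol=1e-10):
    """Find alpha and beta such that beta / alpha equals the requested ratio.

    Bisection is used because beta decreases as alpha increases.
    """
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if beta_for_alpha(mid, d, n_per_group) > beta_alpha_ratio * mid:
            lo = mid  # beta still too large relative to alpha: allow a larger alpha
        else:
            hi = mid
    alpha = (lo + hi) / 2.0
    return alpha, beta_for_alpha(alpha, d, n_per_group)


if __name__ == "__main__":
    # Fixed design: effect size 0.5 and at most 30 subjects per group
    print(compromise_alpha_beta(0.5, 30, beta_alpha_ratio=1.0))
    print(compromise_alpha_beta(0.5, 30, beta_alpha_ratio=4.0))
```

With the sample size and effect size fixed, the routine returns the pair of error probabilities that respects the chosen ratio. A ratio of 1 treats the two errors as equally serious, whereas a ratio of 4 tolerates a type II error probability four times the type I error probability.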
Further, power analysis also has a role in ethical issues. If a study to test a new drug will have the required power with a sample of 50 patients, then it would be inappropriate to use a sample of 100 patients, since the other 50 are being unnecessarily put at risk. Similarly, if a study requires 100 patients to yield the desired power, it would be inappropriate to use only 50, because these 50 patients would have been put at risk for no reason.
A power analysis program can be used to determine power given the values of sample size, significance level and effect size. If the power is deemed to be inadequate, steps may be taken to increase it. The easy and wide availability of computer software, much of which can be freely downloaded from the internet, has made a priori and post hoc power calculations popular among researchers in medical and other fields. Advances in computing techniques have broadened the area of investigation, making it possible to pick up a small effect with greater precision, something that remained a part of theory in textbooks until recently. Longitudinal studies indicate a large increase in the use of more complex statistical methods. As the statistical complexity of published research increases, a deeper understanding of statistics has become vital.
It seems logical to expect that these are some of the reasons for the shift from Fisher's significance testing to Neyman-Pearson hypothesis testing. It may be safely argued that the significance test approach, when it incorporates power calculations, may be regarded as an entry into the hypothesis test approach. This move may be viewed as a byproduct of the rise in power calculation requirements.
1. Lehmann EL. The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? J Am Stat Assoc 1993;88:1242-9.
2. Fisher R. Statistical Methods for Research Workers. 13th ed. Hafner: New York; 1958.
3. Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. Philos Trans Roy Soc A 1933;231:289-337.
4. Fisher RA. Statistical methods and scientific inference. Collins Macmillan: London; 1973.
5. Lilford RJ, Braunholtz D. The statistical basis of public policy: a paradigm shift is overdue. Br Med J 1996;313:603-7.
6. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol 1998;51:355-60.
7. Sterne JA, Smith GD. Sifting the evidence - what's wrong with significance tests. Br Med J 2001;322:226-31.
8. Goodman SN. P values, hypothesis tests and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 1993;137:485-96.
9. Moher D, Dulberg CS, Wells GA. Statistical power, sample size and their reporting in randomized controlled trials. JAMA 1994;272:122-4.
10. Cohen J. Statistical Power Analysis for the Behavioural Sciences. 2nd ed. Erlbaum: Hillsdale, NJ; 1988.
11. Erdfelder E. On significance and control of the β error in statistical tests of log-linear models. Zeitschrift fur Sozialpsychologie 1984;15:18-32.
12. Burton PR, Gurrin LC, Campbell MJ. Clinical significance not statistical significance: a simple Bayesian alternative to p values. J Epidemiol Commun Health 1998;52:318-23.
13. Altman DG, Goodman SN. Transfer of technology from statistical journals to the biomedical literature. Past trends and future predictions. JAMA 1994;272:129-32.
14. Goodman SN. Toward evidence-based medical statistics: the p value fallacy. Ann Int Med 1999;130:995-1004.