## The Significance of Significance

### Criticisms of significance testing

1. It tells you the opposite of what you want to know. Significance tests tell you the probability of getting a result at least as ‘extreme’ as the one you have, given that the null hypothesis is true. What you want to know, and this is quite different, is how likely it is that the null hypothesis is true, given the result you have just got. Cohen (1994) describes this as the ‘inverse probability error’, and both he and Carver (1978) give examples to illustrate the fallacy.

2. It is logically nonsensical: the null hypothesis is always false. It is impossible that in the population from which your sample is drawn the two means are exactly equal, or that the correlation is exactly zero. It is nonsense (and certainly not useful) to talk about the ‘truth’ of a null hypothesis which specifies a precise value (usually zero) for some population parameter when the only evidence about it comes from a sample (Cohen, 1994; Thompson, 1996).

3. Its true/false dichotomy inappropriately stresses decision above inference. In most research contexts (as opposed, say, to its use in quality control) it is not appropriate to have to make an all-or-nothing decision about whether to accept or reject a particular null hypothesis. It is absurd to have to conclude one thing if the result of an experiment gives p = 0.051 and the exact opposite if it gives p = 0.049 (Eysenck, Oakes, p26).

4. It leaves out the most important information: the size of the effect. It is not enough to know, as Tukey (1969) has said, ‘if you pull on it, it gets longer’. Scientific advance requires an understanding of how much. Significance tests do not tell us how big the difference was, or how strongly two variables were related. Instead, they say more about how large our sample was (Thompson, 1992). A great deal more information can be extracted from an experiment if the focus is on parameter estimation, rather than hypothesis testing (Simon, 1974).

5. It generates confusion between statistical and substantive significance. The significance (in the true sense) of a result depends on the size of the effect found and whether it can be replicated. ‘Significance’ tests do not measure this, even imprecisely (Oakes, 1986), but are widely presented and interpreted as doing so.

6. It is widely misunderstood. Studies of practitioners’ understanding of significance tests (e.g. Oakes, 1986) suggest that misconceptions are not sporadic but near universal. While this may not be the fault of significance tests, it is an argument against their use.

7. It takes no account of any prior knowledge. Even for the non-Bayesian, there are situations where the automatic output from significance testing must be tempered by prior knowledge (Oakes, 1986, p128; Carver, 1978, p392). Scientific advance proceeds by the accumulation of knowledge, not by results considered in isolation.

8. It is open to easy abuse by selection. The ‘file drawer problem’ (Rosenthal, 1979) describes the over-representation in published work of statistically significant results, leading to overall bias. Research syntheses based on available studies are liable to over-estimate the size of an effect, because those that failed to achieve statistically significant results are less likely to be published. Even within a study it is impossible to know how many ‘non-significant’ relationships have been tested (consciously or not) in order to find the ‘significant’ ones that are presented. The statistical significance of a result depends not just on the data, but on the way such findings were sought.

9. It demands an unscientific asymmetry. Carver (1978) describes the use of significance testing as a ‘corrupt scientific method’.

10. It puts unnecessary restrictions on sample size. A large number of studies with small samples and similar results may provide more evidence about a phenomenon than a single large study, but taken individually none of them may have the power to achieve statistical significance. Even Fisher regarded the 5% level as arbitrary and took the repeated finding of results at this level as a basis for knowledge, rather than any single highly ‘significant’ result (Tukey, 1969). However, because of the orthodoxy of significance testing, these small studies may never be done, having been rejected at the planning stage as having insufficient power.

11. It emphasises random errors at the expense of explanations. Because significance tests (along with other forms of statistical analysis) enable us to sidestep problems of inaccurately measured data (measurement error) and poor methodology (under-specified models) by aggregation with large samples, they may prevent us from adopting the ultimately more profitable strategy of addressing these inadequacies (Savitz, 1993).

12. It requires a number of often unjustified assumptions. The use of statistical tests of significance nearly always depends on making distributional assumptions about the statistic in question, and on the use of strictly random sampling. While distributional assumptions are sometimes acknowledged, and results may be robust to their violation, the assumption of random sampling is often neither (Shaver, 1993). Also, precise p values are highly sensitive to scale transformations or the (generally arbitrary) choice of a particular scale (Cliff, 1993, p497). Significance levels (p values) are often treated as far more accurate than is justified.

13. It perpetuates an adversarial tradition in social science. On almost any issue studies can be found arguing for diametrically opposed conclusions, but a good many of the apparent differences are simply due to sampling error (Hunter and Schmidt, 1996). Significance testing greatly exaggerates these differences, stressing individual results at the expense of an understanding of the relationships among significant variables.
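
Criticisms 1 and 4 can be made concrete with a little arithmetic. The sketch below is an illustration only, not a calculation taken from any of the cited papers: the function names, the effect size of 0.2, and the prior, alpha and power values are all assumptions chosen for the example, and a two-sample z test with known unit variance is used simply to keep the formula short.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_for_effect(d: float, n_per_group: int) -> float:
    """p value of a two-sample z test for an observed standardised
    mean difference d, with n_per_group cases in each group.
    Assumes known unit variance in both groups (a simplification)."""
    z = d * math.sqrt(n_per_group / 2)
    return two_sided_p_from_z(z)


def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(null is true | result is 'significant'), by Bayes' theorem.
    prior_null is an assumed prior probability that the null is true."""
    sig_and_null = alpha * prior_null
    sig_and_alt = power * (1 - prior_null)
    return sig_and_null / (sig_and_null + sig_and_alt)


if __name__ == "__main__":
    # Criticism 4: the same observed effect, with different sample sizes.
    for n in (10, 50, 200, 1000):
        print(f"d = 0.2, n per group = {n:4d}, p = {p_for_effect(0.2, n):.4f}")

    # Criticism 1: P(result | null) is not P(null | result).
    # Hypothetical inputs: 90% of tested nulls true, alpha 0.05, power 0.5.
    print(f"P(null | significant) = {prob_null_given_significant(0.9, 0.05, 0.5):.3f}")
```

With the observed effect held at 0.2 standard deviations, the p value drops below 0.05 purely because each group grows from 50 to 200 cases. Under the assumed prior of 0.9, a ‘significant’ result still leaves the null hypothesis with a probability of about 0.47, which is nowhere near 0.05.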

### Alternatives to significance testing

A number of the critics of significance testing (e.g. Cohen, 1994; Thompson, 1996) suggest alternative ways of interpreting empirical results and allowing for their sampling variability. They include the following:

1. Use better language. The word ‘significant’ should not be used on its own when what is meant is ‘statistically significant’. Better still, report that a particular null hypothesis was rejected.

2. Look at the data. Simple, flexible, informal and largely graphical techniques of exploratory data analysis, such as those described by Tukey (1977), aim to enable data to be interpreted without statistical tests of any kind.

3. Report parameter estimates with confidence intervals. A confidence interval contains all the information in a null hypothesis test, and more. Parameter estimates can often usefully be reported as standardised effect sizes.

4. Replicate results. Only by demonstrating it repeatedly can we guarantee that a particular phenomenon is a reliable finding and not just an accident of sampling. Internal replicability analyses such as cross-validation, the jackknife or bootstrap (Thompson, 1994) provide a means of assessing sample variability.

5. Synthesise the results of multiple studies using meta-analysis. This can provide an overview of findings in which the statistical significance of individual results has no part. Instead, results are pooled to give overall estimates of effect sizes and an understanding of the relationships among different variables.
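
Alternatives 3 to 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended analysis: the scores are invented, the function names are hypothetical, the interval is a simple percentile bootstrap, and the pooling is a fixed-effect inverse-variance average, which is only one of several meta-analytic models.

```python
import random
import statistics

# Invented scores for two groups of 12 (illustrative data only).
treatment = [23, 27, 31, 25, 30, 28, 35, 26, 29, 32, 24, 33]
control = [22, 25, 27, 21, 26, 24, 30, 23, 25, 28, 20, 29]


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, stat=cohens_d, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(a, b).
    Each group is resampled with replacement, independently."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance weighted mean of study effect sizes and its
    standard error, assuming one common effect across studies."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, (1 / sum(weights)) ** 0.5


if __name__ == "__main__":
    d = cohens_d(treatment, control)
    low, high = bootstrap_ci(treatment, control)
    print(f"effect size d = {d:.2f}, 95% bootstrap interval {low:.2f} to {high:.2f}")

    # Hypothetical effect sizes and standard errors from three small studies.
    pooled, se = pool_fixed_effect([0.30, 0.45, 0.25], [0.20, 0.25, 0.22])
    print(f"pooled effect = {pooled:.2f} (standard error {se:.2f})")
```

For these invented scores the effect size is about 1.0 standard deviations. The interval around it, rather than a p value, shows how imprecisely a sample of this size pins that estimate down. The pooled standard error is smaller than that of any single study, which is the sense in which several small studies can together be informative.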

### References

Carver, R. (1978)  ‘The case against statistical significance testing’.  Harvard Educational Review, 48, 378-399.

Cliff, N. (1993) ‘Dominance statistics: ordinal analyses to answer ordinal questions’.  Psychological Bulletin, 114, 3, 494-509.

Cohen, J. (1994) ‘The Earth is Round (p<.05)’.  American Psychologist, 49, 12, 997-1003.

Oakes, M. (1986) Statistical Inference: A commentary for the social and behavioral sciences.  New York:  Wiley.

Rosenthal, R. (1979) ‘The “file drawer problem” and tolerance for null results’.  Psychological Bulletin, 86, 638-641.

Savitz, D.A. (1993) ‘Is statistical significance testing useful in interpreting data?’  Reproductive Toxicology, 7, 2, 95-100.

Simon, H.A. (1974) ‘How big is a chunk?’  Science, 183, 482-488.

Thompson, B. (1992) ‘Two and one-half decades of leadership in measurement and evaluation’.  Journal of Counseling and Development, 70, 434-438.

Thompson, B. (1994) ‘The pivotal role of replication in psychological research:  empirically evaluating the replicability of sample results’.  Journal of Personality, 62, 2, 157-176.

Thompson, B. (1996) ‘AERA editorial policies regarding statistical significance testing:  three suggested reforms’.  Educational Researcher, 25, 2, 26-30.

Tukey, J.W. (1969)  ‘Analyzing data: Sanctification or detective work?’  American Psychologist, 24, 83-91.