1. *It tells you the opposite of what you want to know*. Significance tests tell you the probability of getting a
result as ‘extreme’ as you have, given the null hypothesis. What you want to know – and this is quite different – is
how likely it is that the null hypothesis could be true, given the result you
have just got. Cohen (1994)
describes this as the ‘inverse probability error’ and both he and Carver
(1978) give examples to illustrate the fallacy.
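
The distinction can be made concrete with a small simulation. The sketch below is illustrative only: the prior probability of the null hypothesis (0.5) and the effect size when it is false (0.3) are assumed, not taken from any real study. It shows that even when the false-positive rate of the test is fixed at 5%, the probability that the null hypothesis is true given a 'significant' result can be quite different:

```python
import math
import random

random.seed(1)

def two_sided_p(sample_mean, n, sigma=1.0):
    """Two-sided p-value for H0: mu = 0, with known sigma (z-test)."""
    z = sample_mean * math.sqrt(n) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n, trials = 25, 20000
effect = 0.3                     # assumed effect size when H0 is false
sig_total = sig_from_null = 0
for _ in range(trials):
    null_true = random.random() < 0.5            # assumed prior P(H0) = 0.5
    mu = 0.0 if null_true else effect
    xbar = random.gauss(mu, 1.0 / math.sqrt(n))  # sampling distribution of the mean
    if two_sided_p(xbar, n) < 0.05:
        sig_total += 1
        sig_from_null += null_true

# P('significant' result | H0) is 0.05 by construction, but
# P(H0 | 'significant' result) comes out quite differently:
print(f"P(H0 true | p < 0.05) is roughly {sig_from_null / sig_total:.2f}")
```

Under these assumptions the two probabilities differ by a factor of two or more: the 5% refers to the chance of the data given the null hypothesis, not the chance that the null hypothesis is true given the data.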

2. *It is logically nonsensical: the null hypothesis is always false*.
It is impossible that in the population from which your sample is drawn
the two means are exactly equal, or that the correlation is exactly zero.
It is nonsense (and certainly not useful) to talk about the ‘truth’
of a null hypothesis which specifies a precise value (usually zero) for some
population parameter when the only evidence about it comes from a sample (Cohen,
1994; Thompson, 1996).

3. *Its true/false dichotomy inappropriately stresses decision above inference*.
In most research contexts (as opposed, say, to its use in quality
control) it is not appropriate to have to make an all or nothing decision about
whether to accept or reject a particular null hypothesis. It is absurd to have to conclude one thing if the result of an experiment gives p = 0.051 and the exact opposite if it were 0.049 (Eysenck, Oakes, p26).

4. *It leaves out the most important information: the size of the effect*.
It is not enough to know, as Tukey (1969) has said, ‘if you pull on it,
it gets longer’. Scientific
advance requires an understanding of how much.
Significance tests do not tell us how big the difference was, or how strongly two variables were related. Instead, they say more about how large our sample was (Thompson, 1992). A great deal more information can be extracted from an
experiment if the focus is on parameter estimation, rather than hypothesis
testing (Simon, 1974).
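
A worked illustration of the point about sample size (the standardised effect of d = 0.2 is assumed, and the variance is taken as known so that a simple z-test applies): exactly the same effect yields very different p values at different sample sizes, which is why a p value on its own says little about the size of an effect.

```python
import math

def p_value(d, n):
    """Two-sided p-value for H0: no effect, given a standardised
    effect size d and sample size n (large-sample z-test)."""
    z = d * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

d = 0.2  # the same (small) effect in both hypothetical studies
print(f"n = 50:   p = {p_value(d, 50):.3f}")    # fails to reach 'significance'
print(f"n = 1000: p = {p_value(d, 1000):.2e}")  # highly 'significant'
```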

5. *It generates confusion between statistical and substantive significance*.
The significance (in the true sense) of a result depends on the size of
the effect found and whether it can be replicated.
‘Significance’ tests do not measure this, even imprecisely (Oakes,
1986), but are widely presented and interpreted as doing so.

6. *It is widely misunderstood*.
Studies of practitioners’ understanding of significance tests (e.g.
Oakes, 1986) suggest that misconceptions are not sporadic but near universal.
While this may not be the fault of significance tests, it is an argument
against their use.

7. *It takes no account of any prior knowledge*. Even for the
non-Bayesian, there are situations where the automatic output from significance
testing must be tempered by prior knowledge (Oakes, p128; Carver, p392).
Scientific advance proceeds by the accumulation of knowledge, not by
results considered in isolation.

8. *It is open to easy abuse by selection*. The ‘file drawer
problem’ (Rosenthal, 1979) describes the over-representation in published work
of statistically significant results, leading to overall bias.
Research syntheses based on available studies are liable to over-estimate
the size of an effect, because those that failed to achieve statistically
significant results are less likely to be published.
Even within a study it is impossible to know how many
‘non-significant’ relationships have been tested (consciously or not) in
order to find the ‘significant’ ones that are presented.
The statistical significance of a result depends not just on the data,
but on the way such findings were sought.

9. *It demands an unscientific asymmetry*. Carver (1978) describes the use of significance testing as a ‘corrupt scientific method’.

10. *It puts unnecessary restrictions on sample size*. A large number of studies with small samples and similar
results may provide more evidence about a phenomenon than a single large study,
but taken individually none of them may have the power to achieve statistical
significance. Even Fisher regarded
the 5% level as arbitrary and took the repeated finding of results at this level
as a basis for knowledge, rather than any single highly ‘significant’ result
(Tukey, 1969). However, because of
the orthodoxy of significance testing, these small studies may never be done,
having been rejected at the planning stage as having insufficient power.
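
The point about power can be checked directly. Assuming a two-sided z-test at the 5% level and a standardised effect of 0.3 (both figures hypothetical), no single small study is likely to reach significance, while the same total sample analysed together almost certainly would:

```python
import math

def power(d, n, crit=1.96):
    """Power of a two-sided z-test at alpha = 0.05 against a
    standardised effect d with sample size n."""
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF
    shift = d * math.sqrt(n)
    return (1 - phi(crit - shift)) + phi(-crit - shift)

d = 0.3
print(f"one small study (n = 20):          power = {power(d, 20):.2f}")
print(f"ten such studies pooled (n = 200): power = {power(d, 200):.2f}")
```

Each small study on its own would most likely be dismissed as 'non-significant' (or never run at all), yet collectively they carry nearly decisive evidence.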

11. *It emphasises random errors at the expense of explanations*. Because
significance tests (along with other forms of statistical analysis) enable us to
sidestep problems of inaccurately measured data (measurement error), and poor
methodology (under-specified models) by aggregation with large samples, they may
prevent us from adopting the ultimately more profitable strategy of addressing these inadequacies (Savitz, 1993).

12. *It requires a number of often unjustified assumptions*. The
use of statistical tests of significance nearly always depends on making
distributional assumptions about the statistic in question, and on the use of
strictly random sampling. While distributional assumptions are sometimes acknowledged, and results may be robust to their violation, the assumption of random sampling is often neither acknowledged nor met (Shaver, 1993). Also, precise p values are
highly sensitive to scale transformations or the (generally arbitrary) choice of
a particular scale (Cliff, 1993, p497). Significance levels (p values) are often treated as far more
accurate than is justified.

13. *It perpetuates an adversarial tradition in social science*. On
almost any issue studies can be found arguing for diametrically opposed
conclusions, but a good many of the apparent differences are simply due to
sampling error (Hunter and Schmidt, 1996).
Significance testing greatly exaggerates these differences, stressing
individual results at the expense of an understanding of the relationships among
significant variables.

A number of the critics of significance testing (e.g. Cohen, 1994; Thompson, 1996) make some suggestions of alternative ways of interpreting empirical results and allowing for their sampling variability. They include the following:

1. *Use better language*.
The word ‘significant’ should not be used on its own when what is
meant is ‘statistically significant’. Better
still, report that a particular null hypothesis was rejected.

2. *Look at the data*. Simple,
flexible, informal and largely graphical techniques of exploratory data
analysis, such as those described by Tukey (1977), aim to enable data to be
interpreted without statistical tests of any kind.

3. *Report parameter estimates with confidence intervals*. A confidence interval contains all the information in a null
hypothesis test, and more. Parameter
estimates can often usefully be reported as standardised effect sizes.
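
As a sketch (the estimate and its standard error are invented for illustration), a 95% confidence interval answers the null-hypothesis question for free: it excludes zero exactly when p < 0.05, and unlike the p value it also shows the size and precision of the effect.

```python
import math

def ci_and_p(estimate, se):
    """95% confidence interval for an estimate, and the two-sided
    p-value for H0: parameter = 0 (normal approximation)."""
    lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
    z = estimate / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return (lo, hi), p

(lo, hi), p = ci_and_p(0.40, 0.15)   # hypothetical estimate and standard error
print(f"estimate 0.40, 95% CI [{lo:.2f}, {hi:.2f}], p = {p:.3f}")
# The interval excludes 0, and correspondingly p < 0.05 -- but the
# interval also conveys how large (and how uncertain) the effect is.
```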

4. *Replicate results*.
Only by demonstrating it repeatedly can we guarantee that a particular
phenomenon is a reliable finding and not just an accident of sampling.
Internal replicability analyses such as cross-validation, the jackknife
or bootstrap (Thompson, 1994) provide a means of assessing sample variability.
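
A minimal bootstrap sketch (the data are invented): resampling the observed sample with replacement, many times, gives a direct picture of how much the statistic of interest would vary from sample to sample.

```python
import random
import statistics

random.seed(2)
data = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8, 3.9, 5.5]  # hypothetical observations

# Resample with replacement 5000 times, recomputing the mean each time.
boot_means = sorted(
    statistics.mean(random.choices(data, k=len(data))) for _ in range(5000)
)
lo, hi = boot_means[int(0.025 * 5000)], boot_means[int(0.975 * 5000)]
print(f"sample mean {statistics.mean(data):.2f}, "
      f"bootstrap 95% interval [{lo:.2f}, {hi:.2f}]")
```

The spread of the resampled means is an estimate of the sampling variability of the original mean, obtained without any distributional assumptions or significance test.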

5. *Synthesise the results of multiple studies using meta-analysis*.
This can provide an overview of findings in which the statistical
significance of individual results has no part.
Instead, results are pooled to give overall estimates of effect sizes and
an understanding of the relationships among different variables.
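
A minimal fixed-effect sketch (the effect sizes and standard errors are invented): each study is weighted by the inverse of its variance, and the pooled estimate is more precise than any single study, regardless of whether the individual results were 'significant'.

```python
import math

# Hypothetical standardised effect sizes and their standard errors
studies = [(0.30, 0.20), (0.25, 0.25), (0.45, 0.30), (0.10, 0.22)]

# Fixed-effect (inverse-variance) pooling: weight each study by 1 / se^2.
weights = [1 / se**2 for _, se in studies]
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled effect {pooled:.2f}, 95% CI "
      f"[{pooled - 1.96 * pooled_se:.2f}, {pooled + 1.96 * pooled_se:.2f}]")
```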

Carver, R.
(1978) ‘The case against
statistical significance testing’. *Harvard
Educational Review*, 48, 378-399.

Cliff, N.
(1993) ‘Dominance statistics: ordinal analyses to answer ordinal questions’.
*Psychological Bulletin*, 114, 3,
494-509.

Cohen, J.
(1994) ‘The Earth is Round (p<.05)’.
*American Psychologist*, 49, 12,
997-1003.

Oakes, M.
(1986) *Statistical Inference: A commentary
for the social and behavioral sciences*.
New York: Wiley.

Rosenthal, R.
(1979) ‘The “file drawer problem” and tolerance for null results’.
*Psychological Bulletin*, 86,
638-641.

Savitz, D.A.
(1993) ‘Is statistical significance testing useful in interpreting data?’
*Reproductive Toxicology*, 7, 2,
95-100.

Simon, H.A.
(1974) ‘How big is a chunk?’ *Science*,
183, 482-8.

Thompson, B.
(1992) ‘Two and one-half decades of leadership in measurement and evaluation’.
*Journal of Counseling and Development*, 70, 434-438.

Thompson, B.
(1994) ‘The pivotal role of replication in psychological research:
empirically evaluating the replicability of sample results’.
*Journal of Personality*, 62, 2, 157-176.

Thompson, B.
(1996) ‘AERA editorial policies regarding statistical significance testing:
three suggested reforms’. *Educational
Researcher*, 25, 2, 26-30.

Tukey, J.W.
(1969) ‘Analyzing data:
Sanctification or detective work?’ *American
Psychologist*, 24, 83-91.

Tukey, J.W.
(1977) *Exploratory data analysis*. Reading
MA: Addison-Wesley.

Robert Coe, June 1998