Philip Morris
Statistical Significance and Confidence Intervals
Fields
- Author
- Berry, G.
- Type
- PSCI, PUBLICATION SCIENTIFIC
- BIBL, BIBLIOGRAPHY
- Document File
- 2023512309/2023512515/Ets Issue Binder: Epidemiology
- Site
- R529
- Author (Organization)
- Medical Journal of Australia
- Univ of Sydney
- Master ID
- 2023512310/2514
- 2023512310-2514 Epidemiology and Environmental Tobacco Smoke
- 2023512329-2340 Environmental Tobacco Smoke and Lung Cancer: A Critical Assessment
- 2023512341-2348 What Is the Epidemiologic Evidence for A Passive Smoking - Lung Cancer Association?
- 2023512361-2362
- 2023512364-2440 A Dictionary of Epidemiology
- 2023512442-2514 News & Numbers A Guide to Reporting Statistical Claims and Controversies in Health and Other Fields
This material may be protected by copyright
618 June 9, 1986 Vol. 144 THE MEDICAL JOURNAL OF AUSTRALIA
Statistical significance and confidence intervals
Many papers in the Journal use
statistical methods and one of the aims of the review process is to try
to ensure that appropriate methods have
been used. Often papers report results of
comparative studies that are designed to
answer questions such as whether one
treatment is superior to another for a
particular disease, or whether there is an
association between some form of behaviour
(for example, taking regular exercise or
smoking) and the occurrence of some
disease. Comparative studies are almost
invariably carried out on a sample of
individuals who are chosen from the
population of individuals to whom it is
intended to generalize the results. Data are
collected on the sample in order to make
inferences on the population. Valid
inferences can only be drawn if the sample
is chosen in such a way that it is
representative of the population. Otherwise a bias
could occur; epidemiological methods are
designed to eliminate such biases.
Since the aim of a statistical analysis is to
make inferences, it is paramount to express
whatever inferences can be drawn in the
most informative way. There are several
methods of statistical inference, but the two
that are most commonly used are
significance testing and confidence interval
estimation. The former is well known and
is featured by quoting P values. Many
authors appear to be under the impression
that a profusion of P values is necessary;
regrettably this impression has been bolstered
in the past by editors of biological journals.
Significance testing has its place but, as
mentioned by Healy in 1978, [1] "it is widely
agreed among statisticians (if less so among
the more naive users of statistics) that
significance testing is not the be-all and
end-all of the subject". In this leading article I
would like to discuss the characteristics of
both methods of inference, show that a
confidence interval contains the result of a
significance test, but not vice versa, and
suggest that confidence intervals are the
answers to the more interesting questions
that data can be used to answer.
Any particular study is based on a
particular sample; however, it is useful to
imagine that the study is repeated with a
different sample being selected each time.
These hypothetical studies will give different
results because they contain different
individuals, and individuals vary in any
characteristic because of biological
variability. The differences are termed sampling
variability. It follows then that the results
that are obtained from a particular sample
can only be taken as an approximation to the
actual situation in the whole population.
Statistical methods are concerned with
assessing the degree of approximation and
what may be reasonably inferred, given that
a different sample would have produced a
different result.
The methods are based on the assumption
that it is a matter of chance which particular
subjects are in the sample that is being
studied, and the sampling variability is thus
random variation which is determined by the
laws of probability. Therefore, the inferences
are expressed in terms of probability. The
situation is illustrated below.
Population
    |  - - - - - - sampling variation
Sample data
    |  - - - - - - uncertainty
Inferences on population
Taking a sample from the population
involves sampling variation. As a
consequence of this, inferences from the sample
data back to the population involve
uncertainty.
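The idea of repeating a study on different samples can be made concrete with a small simulation. The Python sketch below is illustrative only: the population of 10 000 blood pressure values, its mean (120 mmHg) and standard deviation (15 mmHg), the sample size of 50 and the random seed are all assumptions chosen for the example, not data from any study.

```python
import random
import statistics

# Hypothetical population: 10,000 systolic blood pressures (mmHg).
# The mean of 120 and SD of 15 are illustrative assumptions only.
rng = random.Random(1986)
population = [rng.gauss(120, 15) for _ in range(10_000)]
population_mean = statistics.fmean(population)


def sample_mean(n):
    """Mean of a simple random sample of n individuals from the population."""
    return statistics.fmean(rng.sample(population, n))


# Repeat the "study" five times, drawing a different sample each time.
means = [sample_mean(50) for _ in range(5)]

print(f"population mean: {population_mean:.1f}")
for i, m in enumerate(means, start=1):
    print(f"study {i}: sample mean = {m:.1f}")
```

Each hypothetical study gives a somewhat different sample mean, and none of them is exactly the population mean; that spread is the sampling variability described above.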
A statistical analysis may be thought of as
asking questions of the data. In an
investigation that compares two groups for the
mean value of, for example, blood pressure
or the prevalence of some disease, three
questions may be posed: Is there a difference
between the groups?; How large is the
difference?; and How accurately is the size
of the difference known?
As expressed, the first question expects the
answer "yes" or "no"; although the answer
cannot be given in precisely these terms, it
is often reduced to two possibilities. The
appropriate methodology is the significance
test. The second question expects a numerical
value to be the answer. This is an estimate
and, as it is a single value, is referred to as
a point estimate. In effect, the third question
asks how reliable this point estimate is; the
answer is a range of values which is referred
to as an interval estimate or a confidence
interval.
These questions represent two approaches
to inference: hypothesis testing and
estimation. Although at first sight they
appear to be quite different, in concept they
have much in common. Both make
inferential statements about the value of a
parameter. (A parameter is an unknown
quantity which partly or wholly characterizes
a population, for example, a mean or a
measure of association.)
The significance test is an appropriate
technique when there is an a priori hypothesis
to test. For the purpose of the statistical test
this hypothesis is expressed in null form -
such as where no difference exists between
groups - and the test evaluates whether the
data are consistent with the null hypothesis.
If the data differ markedly from those which
would be expected under the null hypothesis,
to the extent that the probability of such an
extreme result is low, then it is said that the
result is statistically significant. Probability
is measured on a continuum between 0 and
1, but in significance testing a probability is
considered low if it is less than conventional
values such as 0.05 (5%) or 0.01 (1%). A
significant result is equated with the rejection
of the null hypothesis or the claim of a real
effect. By definition, when the null
hypothesis is true, significant results will
occur by chance with the same relative
frequency as the significance probability.
That is, real effects will be claimed when the
null hypothesis is true; however, the
probability of this error (type I) is determined in
the data analysis.
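The statement that significant results occur by chance with the same relative frequency as the significance probability can be checked by simulation. The sketch below is a minimal illustration under stated assumptions: both groups are drawn from the same normal distribution (so the null hypothesis is true by construction), the standard deviation is treated as known so that a simple z-test applies, and the group size, number of trials and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96 for a two-sided 5% test


def z_test_significant(rng, n=30, sigma=1.0):
    """Two-sample z-test with known sigma.

    Both groups share the same true mean, so any 'significant'
    result here is a type I error.
    """
    a = [rng.gauss(0.0, sigma) for _ in range(n)]
    b = [rng.gauss(0.0, sigma) for _ in range(n)]
    se = sigma * (2 / n) ** 0.5  # standard error of the difference in means
    z = (fmean(a) - fmean(b)) / se
    return abs(z) > Z_CRIT


rng = random.Random(0)
trials = 4000
false_positives = sum(z_test_significant(rng) for _ in range(trials))
type_i_rate = false_positives / trials
print(f"share of 'significant' results under the null: {type_i_rate:.3f}")
```

The observed share should lie close to the chosen 5% level, which is the sense in which the type I error probability is fixed by the analyst when the data are analysed.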
One disadvantage of a significance test is
that it may fail to detect a real effect; that
is, although the null hypothesis is false, the
evidence is not strong enough to reject it. The
probability of this error (type II) can be
controlled at the design stage only, by
appropriate selection of the sample size, and
may be quite large. Thus, the trap of
equating non-significance with no effect
must be avoided; failure to reject the null
hypothesis is not the same as accepting it.
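How the type II error depends on sample size can be sketched with the usual normal approximation for a two-sided, two-sample comparison of means. The function name and the numbers used (a true difference of 5 units, a standard deviation of 15, and the group sizes) are assumptions made for illustration; they do not come from the article.

```python
from statistics import NormalDist


def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    delta: true difference between group means
    sigma: common standard deviation (assumed known)
    n:     number of subjects in each group
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * (2 / n) ** 0.5)  # true difference in SE units
    # Probability that the test statistic lands in either rejection tail.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


for n in (10, 25, 50, 100, 200):
    power = power_two_sample(delta=5, sigma=15, n=n)
    print(f"n per group = {n:3d}: power = {power:.2f}, "
          f"type II error = {1 - power:.2f}")
```

With small groups the type II error is large even though a real difference exists, which is why a non-significant result from a small study cannot be read as evidence of no effect.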
In the approach of confidence interval
estimation no particular hypothesis is
considered; rather the emphasis is on estimating
those values of the parameter with which the
data are consistent. These values form a
range - the confidence interval. The range
is calculated so that there is a high
probability - conventionally 95% or 99% - that
it contains the true value of the parameter.
A significance test is essentially a test of
whether the data are consistent with a
specified parameter value, and the
confidence interval contains those parameter
values with which the data are consistent.
Therefore, a 5% significance test and a 95%
confidence interval contain some
information in common: significance implies that
the null hypothesis value is outside the
confidence interval; non-significance implies that
the null hypothesis value is within the
confidence interval. However, the confidence
interval contains more information because
it is equivalent to performing a significance
test for all values of the parameter, not just
a single value. A confidence interval enables
a reader to see how large the effect may be,
not simply whether it is different from zero.
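The correspondence between a 5% significance test and a 95% confidence interval can be shown directly. The sketch below assumes an estimate that is approximately normally distributed with a known standard error; the estimate (4.0), the standard error (1.5) and the candidate null values are invented for illustration.

```python
from statistics import NormalDist


def confidence_interval(estimate, se, level=0.95):
    """Normal-approximation confidence interval for a parameter."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


def is_significant(estimate, se, null_value, alpha=0.05):
    """Two-sided z-test of the hypothesis parameter == null_value."""
    z = (estimate - null_value) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


# Illustrative numbers (assumed): an estimated difference of 4.0, SE 1.5.
estimate, se = 4.0, 1.5
low, high = confidence_interval(estimate, se)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")

for null_value in (0.0, 1.0, 2.0, 4.0, 6.0, 7.5):
    inside = low <= null_value <= high
    significant = is_significant(estimate, se, null_value)
    print(f"null value {null_value:4.1f}: inside interval = {inside!s:5}, "
          f"significant at 5% = {significant}")
```

Every null value inside the interval gives a non-significant test and every value outside it gives a significant one, so the interval summarizes the outcome of the test for all candidate values at once.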
The limitations of the interpretations that
are provided by a significance test may now
be considered.
The difference is significant. This means
that there is a difference or, in other words,
the size of the difference is not zero. We
know no more than this. The difference may
be large and of great importance or it may
be small and of no practical importance. It
is unsatisfactory that the test provides no way
of distinguishing between these quite
different possibilities.
The difference is not significant. This
means that there is insufficient evidence to
enable us to conclude that there is a
difference. So the difference may well be
zero. But this is not the same as saying that
it is zero. The true difference may be quite
large. Again, it is unsatisfactory that this
possibility is not addressed.
The conclusions that may be drawn from
a significance test are considered to be
incomplete because it is rarely that one is
interested solely in whether a null hypothesis
is or is not true; indeed in many cases it may
be recognized at the outset that the null
hypothesis is unlikely to be true. Rather, the
question is how large is the difference and
is it possibly large enough to be important?
The emphasis is on measuring rather than on
testing. The addition of the concept of an
important difference to that of a null
hypothesis means that there are four possible
interpretations to an analysis: (a) the
difference is significant and large enough to
be of practical importance; (b) the difference
is significant but too small to be of practical
importance; (c) the difference is not
significant but may be large enough to be
important; and (d) the difference is not
significant and also not large enough to be
of practical importance.
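The four interpretations can be read off a confidence interval once a value for the important difference has been chosen. The sketch below is one possible reading, not a rule given in the article: it assumes that a difference "may be important" whenever the interval reaches the chosen threshold in either direction, and the function name and threshold of 10 are hypothetical.

```python
def interpret(low, high, important):
    """Classify a confidence interval (low, high) for a difference.

    `important` is the smallest absolute difference judged to matter;
    choosing it is a medical judgement, not a statistical one.
    Assumption: the difference 'may be important' if the interval
    reaches that threshold in either direction.
    """
    significant = low > 0 or high < 0  # interval excludes zero
    may_be_important = max(abs(low), abs(high)) >= important
    if significant and may_be_important:
        return "(a) significant, may be of practical importance"
    if significant:
        return "(b) significant, too small to be important"
    if may_be_important:
        return "(c) not significant, but may be important: inconclusive"
    return "(d) not significant, and not large enough to be important"


# Hypothetical intervals for a difference in percentage points,
# with an assumed important difference of 10 points.
for interval in [(7.3, 28.9), (0.5, 3.0), (-5.0, 20.0), (-2.0, 3.0)]:
    print(interval, "->", interpret(*interval, important=10))
```

Changing the `important` argument shows how the same interval can be reinterpreted for any other threshold a reader considers appropriate.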
The size of difference that is considered
to be large enough to be important is a
matter for debate, and genuine differences
of opinion may arise. It is a medical, not a
statistical, question, although a medical
statistician who is experienced in the subject
area could contribute to setting a value. The
fact that agreement on a unique value may
be impossible in no way detracts from the
argument. In fact, expressing the results as
a confidence interval enables interpretations
to be made for any particular value that is
considered appropriate.
These possibilities are illustrated in the
Figure where the confidence intervals are
shown. The significant and non-significant
cases are distinguished by the confidence
intervals that exclude or include zero
respectively. The main point is that in each case
the confidence interval gives the range of
possible values for the true difference. Of
particular concern is (c). Here there may be
no true difference or there may be a large,
important difference. In other words the
study is completely inconclusive. Such a
possibility is missed by the simple expression
"not significant" with its lure of equating
this falsely with "no effect". This situation
will arise with a study that is carried out on
too small a sample and this is why good study
design demands attention to sample size to
try to prevent the occurrence of an
inconclusive result. Altman found that it was
common for undue emphasis to be placed on
"negative" findings from small studies, [2]
while Freiman et al. noted that "negative"
trials were often too small to constitute a fair
test of therapies. [3] Similarly, a significance
test will contrast (b) as significant and (d) as
not significant but fails to recognize that they
give essentially the same conclusion - that
any difference is too small to be important.
FIGURE: Confidence intervals showing four possible conclusions in terms of statistical significance
and practical importance. Axis: difference, with the null hypothesis value (0) and the important
difference marked. Intervals: (a) significant, important; (b) significant, not important;
(c) not significant, inconclusive; (d) not significant, true negative result.
As an example, consider some results
which were obtained by Garraway et al. from
a clinical trial for the management of acute
stroke in the elderly. [4] Of 155 patients who
were managed in a stroke unit, 78 were
assessed as independent when they were
discharged from the unit compared with 49
of 152 who were managed in a medical unit.
The simplest analysis shows that the
difference between the success rates of the
two units is significant at the 1% level.
Therefore, a genuine effect has been
established. To appreciate the importance of this
effect the advantage of the stroke unit may
be measured by the difference between the
two units in the percentage of subjects
who were discharged as independent:
50.3% - 32.2% = 18.1%. This is the point
estimate. The accuracy of this estimate is
given by its standard error (5.5) and the 95%
confidence limits (7.3% and 28.9%). Thus,
the gain could be as large as 29% or as small
as 7%.
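The figures in this example can be reproduced with the standard large-sample formulae for a difference between two proportions. The sketch below is one way to do so and is not taken from the article. It uses counts of 78 of 155 and 49 of 152, the counts that correspond to the quoted percentages of 50.3% and 32.2%. The function name is hypothetical, and the use of an unpooled standard error for the interval and a pooled one for the test is an assumed, conventional choice.

```python
from statistics import NormalDist


def two_proportion_summary(x1, n1, x2, n2, level=0.95):
    """Difference between two proportions with SE, CI and z-test P value."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Unpooled standard error, used for the confidence interval.
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    # Pooled standard error under the null hypothesis, used for the test.
    pooled = (x1 + x2) / (n1 + n2)
    se0 = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se0)))
    return {
        "diff": diff,
        "se": se,
        "lower": diff - z_crit * se,
        "upper": diff + z_crit * se,
        "p_value": p_value,
    }


result = two_proportion_summary(78, 155, 49, 152)
print(f"difference: {100 * result['diff']:.1f}%")
print(f"standard error: {100 * result['se']:.1f}")
print(f"95% limits: {100 * result['lower']:.1f}% to {100 * result['upper']:.1f}%")
print(f"significant at the 1% level: {result['p_value'] < 0.01}")
```

The output agrees with the point estimate of 18.1%, the standard error of 5.5 and the limits of 7.3% and 28.9% quoted in the text.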
Recently, Gardner and Altman have
argued against the excessive use of hypothesis
testing and urged a greater use of confidence
intervals. [5] In an appendix to their paper they
give methods to calculate confidence
intervals for the commonly occurring
two-sample comparisons.
In presenting the main results of a study
it is good practice to provide confidence
intervals rather than to restrict the analysis
to significance tests. Only by so doing can
authors give readers sufficient information
for a proper conclusion to be drawn;
otherwise readers have to rely upon the
authors' own interpretation. Therefore,
intending authors are urged to express their
main conclusions in confidence interval form
(possibly with the addition of a significance
test, although strictly that would provide no
extra information). One of the aims of the
Journal's statistical review process will be to
ensure that where possible this is done.
GEOFFREY BERRY
Associate Professor of Biostatistics
School of Public Health and Tropical Medicine
The University of Sydney
1. Healy MJR. Is statistics a science? J R Statist Soc A
1978; 141: 385-393.
2. Altman DG. Statistics in medical journals. Stat Med
1982; 1: 59-71.
3. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR.
The importance of beta, the type II error and sample
size in the design and interpretation of the randomized
control trial. N Engl J Med 1978; 299: 690-694.
4. Garraway WM, Akhtar AJ, Prescott RJ, Hockey L.
Management of acute stroke in the elderly: preliminary
results of a controlled trial. Br Med J 1980; 280:
1040-1043.
5. Gardner MJ, Altman DG. Confidence intervals rather
than P values: estimation rather than hypothesis
testing. Br Med J 1986; 292: 746-750.
