Title: PowerPoint Presentation
1. Principles of statistics and scientific interpretation of statistical tests
Fabio Fusi, Dipartimento di Scienze Biomediche
2. Why do we need statistical calculations?
Goal: to draw the strongest possible conclusions from a limited amount of data.
Two problems:
1. Important differences are often obscured by biological variability and/or experimental imprecision (real differences or random variation?).
2. Overgeneralization.
3. Many kinds of data can be analysed without statistical analysis: differences that are large compared with the scatter. However, in many areas of biology it is difficult to distinguish the signal from the noise created by biological variability (in spite of inbred animals or cloned cells) and imprecise measurements.
4. Statistical calculations extrapolate from sample to population
They make general conclusions from limited amounts of data. Some examples:
Quality control. Political polls. Clinical studies (the sample of patients studied is rarely a random sample of the larger population). Laboratory experiments (extending the terms "sample" and "population" to laboratory experiments is a bit awkward: from the sample data you want to make inferences about the ideal situation).
5. What statistical calculations CAN do
1. Statistical estimation (mean → scatter and size → confidence interval).
2. Statistical hypothesis testing (decide whether an observed difference is likely to be caused by chance). Population A and Population B (no difference): what is the probability that a randomly selected sample A and a randomly selected sample B show a difference as large as, or larger than, the one actually observed? → the P value.
3. Statistical modeling (how well experimental data fit a mathematical model).
6. What statistical calculations CANNOT do
In theory:
1. Define a population.
2. Randomly select a sample of subjects to study.
3. Randomly divide the subjects to receive treatment A or B.
4. Measure a single variable in each subject.
5. Use statistics to make inferences about the distribution of the variable in the population and about the effect of the treatment.
Common problems (for example, designing a study to test a new drug against HIV):
1. The population you really care about is more diverse than the population from which your data were sampled.
2. You collect data from a "convenience sample" rather than a random sample.
3. The measured variable is a proxy for another variable you really care about.
4. Your measurements may be made or recorded incorrectly.
5. You need to combine different kinds of measurements to reach an overall conclusion.
7. Confidence intervals and P values
Statistical analysis of data leads to two kinds of results:
1. Confidence intervals (state a result with a margin of error).
2. P values (compare two or more samples).
The two give complementary information.
8. Confidence intervals
When we know the proportion in one sample, the only way to make a statement about the proportion in the population is to calculate a range of values that brackets the true population proportion. Scientists usually accept a 5% chance that the range will not include the true population value: the 95% confidence interval (CI).
The margin of error depends on the sample size:
Proportion in a sample of 100 subjects: 0.33, 95% CI 0.24-0.42 (confidence limits).
Proportion in a sample of 14 subjects: 0.21, 95% CI 0.05-0.51 (confidence limits).
You can be 95% sure that the 95% CI includes the true population value.
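As an illustration of how such intervals could be computed in practice, here is a minimal Python sketch; the slide does not say which method it used, so the choice of statsmodels' proportion_confint and of the "normal" and "beta" (Clopper-Pearson) methods is mine.

```python
# Sketch: 95% confidence intervals for the two sample proportions on the slide.
from statsmodels.stats.proportion import proportion_confint

# 33 "successes" out of 100 subjects (proportion 0.33).
low, high = proportion_confint(count=33, nobs=100, alpha=0.05, method="normal")
print(f"n=100: proportion = {33/100:.2f}, 95% CI = {low:.2f}-{high:.2f}")

# 3 out of 14 subjects (proportion 0.21); the smaller sample gives a much wider interval.
low, high = proportion_confint(count=3, nobs=14, alpha=0.05, method="beta")
print(f"n=14:  proportion = {3/14:.2f}, 95% CI = {low:.2f}-{high:.2f}")
```

The results should be close to the limits quoted on the slide, although the exact numbers depend on the method chosen.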
9. The interpretation of the CI depends on the following assumptions
1. Random (or representative) sample (for example, the Roosevelt-Landon poll of 1936).
2. Independent observations (not, for example, patients double-counted or coming from the same family).
3. Accurate assessment (data not recorded incorrectly).
4. Assessing an event you really care about (not only severe, but all possible, drug reactions).
10. Measurements vs. fractions or proportions
Working with measurements is more difficult than working with proportions, because you must deal with variability within samples as well as differences between groups.
Sources of variability:
1. Imprecision or experimental error.
2. Biological variability.
3. Blunders.
11. Presenting data
1. To display the scatter of measurements: histograms.
2. To describe the centre of the distribution: calculate the mean/average.
Urinary concentrations of lead in 15 (n = 15) children (µmol/24 h): 0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5, 3.2, 1.7, 1.9, 1.9, 2.2.
Mean: $\bar{y} = \frac{\sum y_i}{N}$. In this case (0.6 + 2.6 + ... + 1.9 + 2.2) / 15 = 22.5 / 15 = 1.5.
Or the median: if N is odd, the median is the value at position (N + 1)/2 of the sorted data; if N is even, the median lies between the values at positions N/2 and N/2 + 1.
12. Presenting data (continued)
3. To describe the spread/scatter of the data: report the lowest and highest values (i.e. the range), the 25th and 75th percentiles (the interquartile range), or calculate the variance
$\sigma^2 = \frac{\sum (Y_i - \mu)^2}{N - 1}$
or the standard deviation (the best possible estimate of the SD of the entire population, as determined from one particular sample)
$SD = \sqrt{\sigma^2}$
(more than half of the observations in a population usually lie within 1 SD on either side of the mean), or the standard error
$SE = \frac{SD}{\sqrt{N}}$.
Thus, when N is small, the standard error will be relatively large. The effect of N rapidly diminishes in large samples, which are that much more likely to be representative of the population as a whole.
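A short Python sketch (my own, not part of the slides) that reproduces these summary statistics for the urinary-lead data above; it assumes only NumPy.

```python
# Sketch: summary statistics for the urinary lead concentrations (µmol/24 h).
import numpy as np

y = np.array([0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2,
              1.5, 3.2, 1.7, 1.9, 1.9, 2.2])
n = len(y)

mean = y.sum() / n                              # Σy / N = 22.5 / 15 = 1.5
median = np.median(y)                           # middle value of the sorted data (N odd)
variance = ((y - mean) ** 2).sum() / (n - 1)    # Σ(yi - mean)² / (N - 1)
sd = np.sqrt(variance)                          # standard deviation
se = sd / np.sqrt(n)                            # standard error of the mean

print(f"mean={mean:.2f}  median={median:.2f}  SD={sd:.2f}  SE={se:.2f}")
```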
13. The Gaussian distribution
A symmetrical, bell-shaped distribution. Variables approximate:
1. a Gaussian distribution when the variation is caused by many independent factors;
2. a bimodal or skewed distribution when the variation is largely due to one factor.
About two thirds of a Gaussian population lie within 1 SD of the mean. About 95% of the values in a Gaussian distribution lie within 2 SDs of the mean.
14. Using statistics to compare two groups of data
- Calculating the P value.
- How sure are we that a difference exists between the two populations?
- Is the difference due to pure chance during sampling, or to a real difference between the two populations?
- The statistical calculation tells us how rarely such a pure coincidence can occur, but not whether it actually occurred.
15. An example: blood pressure in CTF students
Year I: 120, 80, 90, 110, 95 mmHg
Year III: 105, 130, 145, 125, 115 mmHg
P = 0.034
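The slide does not state which test produced P = 0.034, but an unpaired two-tailed t test on these numbers gives a very similar value. A minimal Python sketch (my own) using SciPy:

```python
# Sketch: unpaired two-tailed t test on the blood-pressure data from the slide.
from scipy import stats

year1 = [120, 80, 90, 110, 95]     # mmHg, first-year students
year3 = [105, 130, 145, 125, 115]  # mmHg, third-year students

t, p = stats.ttest_ind(year1, year3)   # assumes Gaussian data and equal variances
print(f"t = {t:.2f}, two-tailed P = {p:.3f}")   # P close to the 0.034 quoted on the slide
```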
16. (No transcript)
17. Biological and clinical implications of the data (1)
- There is a 25 mmHg difference in systolic blood pressure; if that difference were consistent, it could have important clinical implications.
- By contrast, a difference of 25 units in some other variables could be insignificant.
- Statistical calculations cannot tell whether a difference is clinically or scientifically important.
Two possibilities remain:
- First-year and third-year students have an identical distribution of blood pressure, and the observed difference is merely the result of a coincidence.
- The blood pressure of third-year students really is higher than that of first-year students.
18. Biological and clinical implications of the data (2)
What is the probability that the observed difference is the result of mere coincidence? If we assume that first-year and third-year students have an identical distribution of blood pressure, what is the probability that the difference between the means of randomly selected subjects arises by pure chance and is as large as, or larger than, the difference actually observed? That is the P value.
19. How the P value is calculated
- Assume that the individuals (samples) whose blood pressure was measured were randomly selected from, or are representative of, a larger group (the population).
- Assume that the experimental design is free of errors and flaws.
- Hypothesize that the distribution of the data in the two populations is the same (the null hypothesis), as opposed to the experimental or alternative hypothesis.
- Assuming the null hypothesis is true, calculate the probability of observing the various possible results. Several methods exist, depending on the nature of the data and the assumptions made.
- Determine the fraction of those possible results in which the difference between the means is as large as, or larger than, the difference actually observed.
- The answer is the P value.
20. Four difficult aspects of thinking about the P value
1. The null hypothesis versus the experimenter's hypothesis.
2. Researchers find it strange to calculate the theoretical probability distribution of the results of experiments that will never be performed.
3. The origin of the theoretical probability distribution lies beyond the mathematical knowledge of most researchers.
4. The logic runs in a direction that seems intuitively backwards: one observes a sample in order to draw conclusions about the population, whereas the calculation of the P value starts from an assumption about the population (the null hypothesis) to determine the probability of randomly selecting samples with differences as large as the one we observed.
21. Interpretation of the P value (1)
P = 0.034. If the null hypothesis were true, then 3.4% of all possible experiments with a similar number of observations would yield a difference between the mean blood pressures as large as, or larger than, the one actually observed. In other words, if the null hypothesis were true, there is only a 3.4% chance of randomly selecting samples whose difference between mean blood pressures is as large as, or larger than, the one actually observed.
The statistical calculation provides the P value; the interpretation is up to us.
22. Interpretation of the P value (2)
- The two populations have identical mean blood pressures (the observed difference is the consequence of pure chance, a coincidence), or
- The two populations have different mean blood pressures.
The statistical analysis determines the probability of such a coincidence occurring (in our example the probability is 3.4% if there really is no difference between the two populations).
23. INCORRECT interpretation of the P value (1)
P = 0.034 means that, even if the two populations have identical means, 3.4% of experiments conducted like ours will yield a difference at least as large as the one we measured. The temptation is to say: "if there is only a 3.4% probability that the difference I observed is the consequence of pure chance, then there must be a 96.6% probability that it is caused by a real difference."
24. INCORRECT interpretation of the P value (2)
One can only say that, if the null hypothesis were true, then 96.6% of experiments would yield a difference smaller than the one observed, while 3.4% of experiments would yield a difference larger than the one observed.
The calculation of the P value rests on the assumption that the null hypothesis is correct; the P value does not tell us whether that assumption is correct. The question the researcher must answer is whether the result is so unlikely that the null hypothesis should be rejected.
25. One-tailed or two-tailed P values
The two-tailed P value is the probability (assuming the null hypothesis) that random sampling yields a difference at least as large as the one observed, with either group having the larger mean. The one-tailed P value, by contrast, is the probability (assuming the null hypothesis) that random sampling yields a difference at least as large as the one observed, with the group indicated by the experimental hypothesis having the larger mean. Example: blood pressure.
The one-tailed test is appropriate when previously obtained data, physical limitations or common sense tell us that the difference, if there is one at all, can only go in one direction.
26. One-tailed or two-tailed P values (2)
Example: testing whether a new antibiotic damages renal function (the damage is measured as an increase in serum creatinine levels). Can there be only two possibilities?
- The two-tailed P value tests the null hypothesis that the antibiotic does not alter creatinine levels.
- The one-tailed P value tests the null hypothesis that the antibiotic does not increase creatinine levels.
A third possibility cannot be excluded.
27. Choosing the number of tails
Two-tailed P values are used more often than one-tailed values:
- the relationship between P values and CIs is clearer with two tails;
- two-tailed P values are larger and therefore more conservative (many experiments do not fully satisfy all the assumptions on which the statistical calculations are based);
- some tests compare three or more groups (the P value has more than two tails);
- it avoids the awkward situation of observing a large difference between two groups that runs in the opposite direction to the experimental hypothesis.
28. (No transcript)
29. Conclusions
Most statistical tests yield a P value. It is therefore essential to understand what the P value represents and, above all, what it does NOT represent.
30. The concept of statistical significance: testing the hypothesis
The concept of "statistically significant":
- When interpreting data (a pilot experiment on new drugs, a phase III clinical trial, new surgical techniques) it is necessary to reach a conclusion.
Testing the hypothesis:
- assume that the samples were randomly selected from the population;
- adopt the null hypothesis that the distribution of values in the two populations is the same;
- define a threshold value (α = 0.05, the significance level) beyond which the result becomes significant;
- select an appropriate statistical analysis and calculate the P value;
- if P < α, the difference is statistically significant and the null hypothesis is rejected;
- if P > α, the difference is not statistically significant and the null hypothesis is not rejected.
31. Testing the hypothesis
- Scientific (experiments, methods, controls, etc.)
- Statistical (calculation of the P value)
The P value is a concise way of summarizing a researcher's opinion about a set of experimental data.
Example: the brewery warehouse.
32. Advantages and disadvantages of the concept "statistically significant"
Advantages:
- in some cases it is necessary to reach a conclusion from a single experiment, deciding one way if the results are significant and the other way if they are not;
- with some statistical analyses it is difficult, if not impossible, to obtain an exact P value, but it is always possible to determine whether P is greater or less than α;
- concluding that the results are statistically significant is far less ambiguous than stating that "random sampling would produce a difference as large or larger in 3.4% of experiments if the null hypothesis were true".
Disadvantages:
- many researchers stop thinking about the data.
33. An analogy: innocent until proven guilty
There is an analogy between the procedure a courtroom jury must follow to declare a defendant guilty and the one a researcher follows to determine statistical significance.
The jury can never return a verdict of innocence; the researcher can never state that the null hypothesis is true.
34. Type I and Type II errors
Type I: stating that a difference is statistically significant and rejecting the null hypothesis when it is in fact true.
Type II: stating that a difference is not statistically significant and failing to reject the null hypothesis when it is in fact false.
35. Choosing the appropriate value of α
By tradition, α is set equal to 0.05, although it should be dictated by the context of the experiment rather than by tradition.
If α were reduced to values < 0.05: Type I errors become less likely, Type II errors more likely.
If α were increased to values > 0.05: Type I errors become more likely, Type II errors less likely.
36. Weighing the cost of Type I and Type II errors (1)
Adjust the value of α according to the situation.
Example I: screening of a new drug.
If the experiments yield significant results → the investigation will continue.
If the experiments yield non-significant results → the new drug will be shelved.
Cost of a Type I error: a modest amount of further investigation.
Cost of a Type II error: abandoning an effective compound.
α = 0.10 or 0.20
37. Weighing the cost of Type I and Type II errors (2)
Example II: a new antihypertensive drug in phase III (a good therapy for the treatment of hypertension already exists).
If the experiments yield significant results → the new drug will be put on the market.
If the experiments yield non-significant results → the new drug will be shelved.
Cost of a Type I error: patients will be treated with an ineffective drug and, at the same time, deprived of a valid, well-established therapy.
Cost of a Type II error: abandoning a compound useful for treating a disease for which an effective therapy already exists.
α = 0.01
38. Weighing the cost of Type I and Type II errors (3)
Example III: a new drug in phase III (no good therapy yet exists for this disease).
If the experiments yield significant results → the new drug will be put on the market.
If the experiments yield non-significant results → the new drug will be shelved.
Cost of a Type I error: patients will be treated with an ineffective drug rather than with nothing.
Cost of a Type II error: abandoning a compound effective for treating a disease for which no therapy yet exists.
α = 0.10
39. Relationship between α and the P value
- The P value and α are closely related.
- The P value is calculated from the collected data.
- α is fixed in advance on the basis of the consequences of Type I and Type II errors.
- α is the threshold value for P below which the observed difference is defined as statistically significant.
40. Statistical significance vs. scientific importance
A difference is statistically significant when the P value is < α.
There are three possibilities:
- the null hypothesis is true and the difference we observed is purely due to chance; the P value tells us how rare such a coincidence would be;
- the null hypothesis is false (the two populations really are different) and the difference we observed is scientifically or clinically important;
- the null hypothesis is false (the two populations really are different) but the difference we observed is so small that it is not scientifically or clinically important.
Small differences obtained with very large samples must always be interpreted for their importance.
41. How P values are interpreted: i) significant and ii) not significant
THE TERM SIGNIFICANT
The term "statistically significant" has a simple meaning: the P value is less than a preset threshold value, α. The statistical use of the word "significant" has a meaning entirely distinct from its usual meaning. Just because a difference is statistically significant does not mean that it is important or interesting. A statistically significant result may not be scientifically or clinically significant, and a difference that is not significant (in the first experiment) may turn out to be very important.
42. EXTREMELY SIGNIFICANT RESULTS
Intuitively, you'd think that P = 0.004 is more significant than P = 0.04. This is not correct. Once you have set a value for α, a result either is statistically significant or is not statistically significant. Nonetheless, results are often called "very significant" or "extremely significant" when the P value is tiny. When showing P values on graphs, investigators commonly use a "Michelin Guide" scale: P < 0.05 (*, significant), P < 0.01 (**, highly significant), P < 0.001 (***, extremely significant).
43. BORDERLINE P VALUES
If you follow the strict paradigm of statistical hypothesis testing and set α to its conventional value of 0.05, then a P value of 0.049 denotes a statistically significant difference and a P value of 0.051 denotes a not significant difference (the whole point of using the term "statistically significant" is to reach a crisp conclusion from every experiment without exception). When a P value is just slightly greater than α, some scientists refer to the result as "marginally significant" or "almost significant". One way to deal with borderline P values would be to choose between three decisions rather than two: rather than deciding whether a difference is significant or not significant, add a middle category of "inconclusive". This approach is not commonly used.
44. THE TERM NOT SIGNIFICANT
If the P value is greater than the preset value of α, the difference is said to be not significant. This means that the data are not strong enough to persuade you to reject the null hypothesis. Is that a proof that the null hypothesis is true? No: a high P value does not prove the null hypothesis, since concluding that a difference is not statistically significant when the null hypothesis is, in fact, false is called a Type II error. When you read that a result is not significant, don't stop thinking. There are two approaches you can use to evaluate the study: first, look at the confidence interval (CI); second, ask about the power of the study to find a significant difference if it were there.
45. How to choose a statistical test
REVIEW OF AVAILABLE STATISTICAL TESTS
To select the right test, ask yourself two questions:
- What kind of data have you collected?
- What is your goal?
Then refer to Table 37.1.
46. REVIEW OF NONPARAMETRIC TESTS
Choosing the right test to compare measurements is a bit tricky, as you must choose between two families of tests: parametric and nonparametric. Many statistical tests are based on the assumption that the data are sampled from a Gaussian distribution. These tests are referred to as parametric tests (e.g. the t test and analysis of variance). Tests that do not make assumptions about the population distribution are referred to as nonparametric tests. All commonly used nonparametric tests rank the outcome variable from low to high and then analyse the ranks.
47. CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: THE EASY CASES (1)
Choosing between parametric and nonparametric tests is sometimes easy. You should definitely choose a parametric test if you are sure that your data are sampled from a population that follows a Gaussian distribution (at least approximately). You should definitely select a nonparametric test in three situations:
- The outcome is a rank or a score and the population is clearly not Gaussian. Examples include class ranking of students, the visual analogue score for pain (measured on a continuous scale where 0 is no pain and 10 is unbearable pain), and the star scale commonly used by movie and restaurant critics (* is OK, ***** is fantastic).
48. CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: THE EASY CASES (2)
- Some values are "off the scale", that is, too high or too low to measure. Even if the population is Gaussian, it is impossible to analyse such data with a parametric test, since you don't know all of the values. Assign an arbitrary very low value to values too low to measure and an arbitrary very high value to values too high to measure, then perform a nonparametric test. Since the nonparametric test only knows about the relative ranks of the values, it won't matter that you didn't know all the values exactly.
- The data are measurements, and you are sure that the population is not distributed in a Gaussian manner. If the data are not sampled from a Gaussian distribution, consider whether you can transform the values to make the distribution become Gaussian (for example, take the logarithm or the reciprocal of all values), for biological or chemical reasons as well as statistical ones.
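As a sketch of the transformation idea (my own example, not from the slides): log-transform right-skewed values before applying a parametric test.

```python
# Sketch: log-transform skewed (e.g. lognormal) data before a parametric test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.lognormal(mean=0.0, sigma=0.8, size=30)   # skewed sample A
b = rng.lognormal(mean=0.5, sigma=0.8, size=30)   # skewed sample B

# On the raw values the Gaussian assumption is dubious; on the log values it is reasonable.
t_raw, p_raw = stats.ttest_ind(a, b)
t_log, p_log = stats.ttest_ind(np.log(a), np.log(b))
print(f"raw P = {p_raw:.3f}, log-transformed P = {p_log:.3f}")
```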
49. CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: THE HARD CASES
How do you decide whether a sample comes from a Gaussian population?
- With many data points (over a hundred or so), you can look at the distribution of the data and it will be fairly obvious whether the distribution is approximately bell shaped. A formal statistical test can also be used.
- With few data points, it is difficult to tell whether the data are Gaussian by inspection, and the formal test has little power. You should look at previous data as well. Remember, what matters is the distribution of the overall population, not the distribution of your sample.
- Consider the source of the scatter. When the scatter comes from the sum of numerous sources, you expect to find a roughly Gaussian distribution.
When in doubt, some people choose a parametric test (because they aren't sure the Gaussian assumption is violated), and others choose a nonparametric test (because they aren't sure the Gaussian assumption is met).
50. CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: DOES IT MATTER? (1)
The answer depends on sample size.
- Large samples. With data from a non-Gaussian population, parametric tests work well. It is impossible to say how large is large enough; you are probably safe when there are at least two dozen data points in each group.
- Large samples. With data from a Gaussian population, nonparametric tests work well. The P values tend to be a bit too large, but the discrepancy is small: nonparametric tests are only slightly less powerful than parametric tests with large samples.
- Small samples. With data from non-Gaussian populations you can't rely on parametric tests, since the P value may be inaccurate.
- Small samples. With data from a Gaussian population, nonparametric tests lack statistical power and P values tend to be too high.
51. CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: DOES IT MATTER? (2)
Large data sets present no problems. It is usually easy to tell whether the data come from a Gaussian population, but it doesn't really matter, because the nonparametric tests are so powerful and the parametric tests are so robust. Small data sets present a dilemma: it is difficult to tell whether the data come from a Gaussian population, but it matters a lot. The nonparametric tests are not powerful and the parametric tests are not robust.
52. ONE- OR TWO-SIDED P VALUE?
A one-sided P value is appropriate when you can state with certainty (and before collecting any data) that either there will be no difference between the means or the difference will go in a direction you can specify in advance (i.e., you have specified which group will have the larger mean). If you cannot specify the direction of any difference before collecting data, then a two-sided P value is more appropriate. If in doubt, select a two-sided P value. If you select a one-sided test, you should do so before collecting any data, and you need to state the direction of your experimental hypothesis. If the data go the other way, you must be willing to attribute that difference (or association or correlation) to chance, no matter how striking the data. If you would be intrigued, even a little, by data that go in the "wrong" direction, then you should use a two-sided P value. The two-sided P value is twice the one-sided P value.
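A small sketch (mine, not from the slides) showing the relationship between the two; it assumes SciPy 1.6 or later, where ttest_ind accepts an alternative argument.

```python
# Sketch: one-sided vs. two-sided P values for the same data.
from scipy import stats

control = [10.2, 9.8, 10.5, 10.1, 9.9]
treated = [10.9, 11.2, 10.7, 11.5, 10.8]

# Two-sided: a difference in either direction counts.
t, p_two = stats.ttest_ind(control, treated, alternative="two-sided")
# One-sided: only "treated mean is larger" counts (direction chosen before seeing the data).
t, p_one = stats.ttest_ind(control, treated, alternative="less")
print(f"two-sided P = {p_two:.4f}, one-sided P = {p_one:.4f}")  # p_two ≈ 2 × p_one here
```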
53. PAIRED OR UNPAIRED TEST?
(When comparing three or more groups, the term "paired" is not appropriate and the term "repeated measures" is used instead.) Use an unpaired test to compare groups when the individual values are not paired or matched with one another. Select a paired or repeated-measures test when values represent repeated measurements on one subject (before and after an intervention) or measurements on matched subjects. You should select a paired test when values in one group are more closely correlated with a specific value in the other group than with random values in the other group. It is only appropriate to select a paired test when the subjects were matched or paired before the data were collected; you cannot base the pairing on the data you are analysing.
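As a sketch (mine) of the paired case: before/after measurements on the same subjects analysed with a paired t test rather than an unpaired one.

```python
# Sketch: paired t test for before/after measurements on the same subjects.
from scipy import stats

before = [142, 150, 135, 160, 148, 155]   # e.g. systolic BP before treatment (mmHg)
after  = [138, 146, 133, 152, 141, 150]   # same subjects after treatment

t_paired, p_paired = stats.ttest_rel(before, after)      # uses within-subject differences
t_unpaired, p_unpaired = stats.ttest_ind(before, after)  # ignores the pairing
print(f"paired P = {p_paired:.4f}, unpaired P = {p_unpaired:.4f}")
# The paired test is usually more powerful because it removes between-subject variability.
```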
54. REGRESSION OR CORRELATION?
Linear regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate the linear correlation if you measured both X and Y in each subject and wish to quantify how well they are associated. Don't calculate the correlation coefficient (or its confidence interval) if you manipulated the X variable. Calculate a linear regression only if one of the variables (X) is likely to precede or cause the other variable (Y). Definitely choose linear regression if you manipulated the X variable. It makes a big difference which variable is called X and which is called Y, as linear regression calculations are not symmetrical with respect to X and Y: if you swap the two variables, you will obtain a different regression line. In contrast, linear correlation calculations are symmetrical with respect to X and Y: if you swap the labels X and Y, you will still get the same correlation coefficient.
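A minimal sketch (my own data and function choices) contrasting the two calculations in SciPy:

```python
# Sketch: correlation vs. regression on the same (X, Y) data.
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]

r, p = stats.pearsonr(x, y)     # symmetrical: pearsonr(y, x) gives the same r
fit = stats.linregress(x, y)    # not symmetrical: swapping x and y changes the line
print(f"r = {r:.3f}, slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}")
```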
55. When reading research papers that include statistical analyses, keep these points in mind:
LOOK AT THE DATA: the data are important; statistics merely summarize them.
BEWARE OF VERY LARGE AND VERY SMALL SAMPLES: with very large sample sizes, tiny differences will be statistically significant, even if scientifically or clinically trivial; with small sample sizes, large and important differences can be non-significant.
DON'T FOCUS ON AVERAGES, OUTLIERS MAY BE IMPORTANT: the variability in the data reflects real biological diversity.
STATISTICALLY SIGNIFICANT DOES NOT MEAN SCIENTIFICALLY IMPORTANT: P values (and statistical significance) are purely the result of arithmetic; decisions about importance require judgement.
56. P < 0.05 IS NOT SACRED
There really is not much difference between P = 0.045 and P = 0.055! By convention, the first is statistically significant and the second is not, but this is completely arbitrary. A rigid cutoff for significance is useful in some situations, but is not always needed.
DON'T OVERINTERPRET "NOT SIGNIFICANT" RESULTS: if a difference is not statistically significant, it does not mean that the null hypothesis is true. If there is no evidence that A causes B, that does not necessarily mean that A doesn't cause B: it may mean that there are no data, or that all the studies have used few subjects and don't have adequate power to find a difference.
DON'T IGNORE PAIRING: paired (or repeated-measures) experiments are very powerful. The use of matched pairs of subjects (or before and after measurements) controls for many sources of variability.
57. A PRACTICAL EXAMPLE
- Choosing a test to compare two columns
- Choosing a one-way ANOVA analysis to compare three or more columns
58. CHOOSING A TEST TO COMPARE TWO COLUMNS
One can perform:
- Unpaired t test
- Paired t test
- Welch's t test
- Mann-Whitney test
- Wilcoxon test
To choose among these tests, you must answer four questions:
1. Are the data paired?
2. Can you assume sampling from a Gaussian distribution?
3. Can you assume equal variances?
4. One- or two-tailed P value?
59. 1. Are the data paired? (1)
Choose a paired test when the experiment follows one of these designs:
- You measure a variable before and after an intervention in each subject.
- You recruit subjects as pairs, matched for variables such as age, ethnic group or disease severity. One of the pair gets one treatment; the other gets an alternative treatment.
- You run a laboratory experiment several times, each time with a control and a treated preparation handled in parallel.
- You measure a variable in twins, or in child/parent pairs.
60. 1. Are the data paired? (2)
- More generally, you should select a paired test whenever you expect a value in one group to be closer to a particular value in the other group than to a randomly selected value in the other group.
- Ideally, you should decide about pairing before collecting data.
- Certainly the matching should not be based on the variable you are comparing.
- If you are comparing blood pressures in two groups, it is okay to match based on age or zip code, but it is not okay to match based on blood pressure.
61. 2. Can you assume sampling from a Gaussian distribution?
Many statistical tests assume that your data are sampled from a population that follows a Gaussian, bell-shaped distribution. Alternative tests, known as nonparametric tests, make fewer assumptions about the distribution of the data, but are less powerful (especially with small samples). The results of a normality test can be helpful, but not always as helpful as you'd hope.
62. 3. Can you assume equal variances?
The unpaired t test assumes that the data are sampled from two populations with the same variance (and thus the same standard deviation). Use a modification of the t test (developed by Welch) when you are unwilling to make that assumption; do so rarely, and only when you have a good reason.
63. 4. One- or two-tailed P value?
Choose a one-tailed P value only if:
- You predicted which group would have the larger mean before you collected any data.
- If the other group turns out to have the larger mean, you are willing to attribute the difference to coincidence, even if the means are very far apart.
Since those conditions are rarely met, two-tailed P values are usually more appropriate.
64. Summary of tests to compare two columns, based on your answers (see the sketch below):
- Not paired, Gaussian distribution, equal SDs: Unpaired t test
- Not paired, Gaussian distribution, different SDs: Welch's t test
- Paired, Gaussian distribution of differences: Paired t test
- Not paired, not Gaussian: Mann-Whitney test
- Paired, not Gaussian: Wilcoxon test
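The corresponding SciPy calls, as a sketch (the slide only names the tests; the function choices and example data are mine):

```python
# Sketch: the five two-column tests from the table, as SciPy calls.
from scipy import stats

a = [3.1, 2.8, 3.6, 3.3, 2.9, 3.4]
b = [3.8, 3.5, 4.1, 3.9, 3.6, 4.2]

stats.ttest_ind(a, b)                    # unpaired t test (Gaussian, equal SDs)
stats.ttest_ind(a, b, equal_var=False)   # Welch's t test (Gaussian, different SDs)
stats.ttest_rel(a, b)                    # paired t test (Gaussian differences)
stats.mannwhitneyu(a, b)                 # Mann-Whitney test (not paired, not Gaussian)
stats.wilcoxon(a, b)                     # Wilcoxon matched-pairs test (paired, not Gaussian)
```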
65. Critical values of t
The table shows critical values from the t distribution. Choose the row according to the number of degrees of freedom and the column according to the two-tailed value of α. If your test finds a value of t greater than the critical value tabulated here, then your two-tailed P value is less than α.
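Instead of a printed table, the critical value can be computed directly; a minimal sketch (the α and degrees of freedom are arbitrary examples of mine):

```python
# Sketch: critical value of t for a two-tailed alpha.
from scipy import stats

alpha, df = 0.05, 8
t_crit = stats.t.ppf(1 - alpha / 2, df)   # about 2.306 for df = 8
print(f"two-tailed critical t (alpha={alpha}, df={df}) = {t_crit:.3f}")
```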
66. CHOOSING A ONE-WAY ANOVA ANALYSIS TO COMPARE THREE OR MORE COLUMNS
One can perform:
- ordinary one-way ANOVA
- repeated-measures ANOVA
- the nonparametric tests of Kruskal-Wallis and Friedman
To choose among these tests, you must answer three questions:
1. Are the data matched?
2. Can you assume sampling from a Gaussian distribution?
3. Which post test?
67. 1. Are the data matched?
You should choose a repeated-measures test when the experiment used matched subjects. For example:
- you measure a variable in each subject before, during and after an intervention;
- randomized block experiments (each set of subjects is called a block, and you randomly assign treatments within each block), for example:
  - you recruit subjects as matched sets; each subject in the set has the same age, diagnosis and other relevant variables; one subject in the set gets treatment A, another gets treatment B, another gets treatment C, and so on;
  - you run a laboratory experiment several times, each time with a control and several treated preparations handled in parallel.
Ideally, you should decide about matching before collecting data. Certainly the matching should not be based on the variable you are comparing. If you are comparing blood pressures in two groups, it is okay to match based on age or postal code, but it is not okay to match based on blood pressure.
68. 2. Can you assume sampling from a Gaussian distribution?
Many statistical tests assume that your data are sampled from a population that follows a Gaussian, bell-shaped distribution. Alternative tests, known as nonparametric tests, make fewer assumptions about the distribution of the data, but are less powerful (especially with small samples). The results of a normality test can be helpful, but not always as helpful as you'd hope.
69. 3. Which post test?
If you are comparing three or more groups, you may pick a post test to compare pairs of group means. It is not appropriate to repeatedly use a t test to compare various pairs of columns. The options:
- Dunnett: compare all columns vs. the control column.
- Test for linear trend between column mean and column number.
- Bonferroni: compare selected pairs of columns.
- Bonferroni: compare all pairs of columns.
- Tukey: compare all pairs of columns.
- Student-Newman-Keuls: compare all pairs of columns.
70.
- Choose Dunnett's test if one column represents control data and you wish to compare all other columns to that control column but not to each other.
- Choose the test for linear trend if the columns are arranged in a natural order (e.g. dose or time) and you want to test whether there is a trend, so that values increase (or decrease) as you move from left to right across columns.
- Select the Bonferroni test for selected pairs of columns when you only wish to compare certain column pairs. You must select those pairs based on the experimental design and, ideally, should specify the pairs of interest before collecting any data. If you base your decision on the results (e.g. compare the smallest with the largest mean), then you have effectively compared all columns, and it is not appropriate to use the test for selected pairs.
71. To compare all pairs of columns:
- The Bonferroni method is easy to understand, although it is too conservative, leading to P values that are too high and confidence intervals that are too wide. This is a minor concern when you compare only a few columns, but a major problem when you have many columns. Don't use the Bonferroni test with more than five groups.
- The Tukey and Newman-Keuls tests are related, and the rationale for the differences is subtle. The methods are identical when comparing the largest group mean with the smallest; for other comparisons, the Newman-Keuls test yields lower P values. The problem is that it is difficult to articulate exactly which null hypotheses the P values test. For that reason, and because the Newman-Keuls test does not generate confidence intervals, we suggest selecting Tukey's test.
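A sketch of Tukey's test in Python (the library choice and example data are mine; statsmodels provides pairwise_tukeyhsd):

```python
# Sketch: Tukey's test comparing all pairs of columns after a one-way ANOVA.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = np.array([5.1, 4.9, 5.4, 5.0,    # group A
                   6.2, 6.0, 6.5, 6.1,    # group B
                   5.3, 5.5, 5.2, 5.6])   # group C
groups = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result)   # adjusted P values and confidence intervals for each pair of groups
```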
72. Summary of tests to compare three or more columns, based on your answers (see the sketch below):
- Not matched, Gaussian distribution: Ordinary one-way ANOVA
- Matched, Gaussian distribution: Repeated-measures one-way ANOVA
- Not matched, not Gaussian: Kruskal-Wallis test
- Matched, not Gaussian: Friedman test
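The corresponding SciPy calls, as a sketch (function choices and data are mine; repeated-measures ANOVA is not in scipy.stats, but statsmodels offers AnovaRM):

```python
# Sketch: the three-or-more-column tests from the table, as SciPy calls.
from scipy import stats

a = [5.1, 4.9, 5.4, 5.0]
b = [6.2, 6.0, 6.5, 6.1]
c = [5.3, 5.5, 5.2, 5.6]

stats.f_oneway(a, b, c)            # ordinary one-way ANOVA (not matched, Gaussian)
stats.kruskal(a, b, c)             # Kruskal-Wallis test (not matched, not Gaussian)
stats.friedmanchisquare(a, b, c)   # Friedman test (matched, not Gaussian)
```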
73. Critical values of the F distribution (after Fisher, who invented ANOVA)
The table shows critical values from the F distribution. Find the column corresponding to the number of degrees of freedom in the numerator and the row corresponding to the number of degrees of freedom in the denominator; the critical value of F lies at their intersection. If you have obtained a value of F greater than that, then your P value is less than α.
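As with t, the critical value of F can be computed instead of read from a table; a tiny sketch (the degrees of freedom are arbitrary examples of mine):

```python
# Sketch: critical value of F for a given alpha and degrees of freedom.
from scipy import stats

alpha = 0.05
df_num, df_den = 2, 9                              # numerator and denominator df
f_crit = stats.f.ppf(1 - alpha, df_num, df_den)    # about 4.26 for df = (2, 9)
print(f"critical F (alpha={alpha}, df={df_num},{df_den}) = {f_crit:.2f}")
```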
74. Testing for normality
A normality test checks for deviations from the Gaussian distribution (also called the normal distribution) using the Kolmogorov-Smirnov (KS) test, which quantifies the discrepancy between the distribution of the data and an ideal Gaussian distribution: a larger value denotes a larger discrepancy. The KS distance is not informative by itself, but is used to compute a P value. The P value from the normality test answers this question: if you randomly sample from a Gaussian population, what is the probability of obtaining a sample that deviates as much from a Gaussian distribution (or more so) as this sample does? More precisely, the P value answers this question: if the population were really Gaussian, what is the chance that a randomly selected sample of this size would have a KS distance as large as, or larger than, the one observed?
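A sketch of such a test in Python (my own example, reusing the urinary-lead data from slide 11); note that when the mean and SD are estimated from the same sample, the plain KS P value is only approximate.

```python
# Sketch: Kolmogorov-Smirnov test of the urinary-lead data against a Gaussian
# distribution with the sample mean and SD.
import numpy as np
from scipy import stats

y = np.array([0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2,
              1.5, 3.2, 1.7, 1.9, 1.9, 2.2])

ks_distance, p = stats.kstest(y, "norm", args=(y.mean(), y.std(ddof=1)))
print(f"KS distance = {ks_distance:.3f}, P = {p:.3f}")
# A small P value would suggest the population is unlikely to be Gaussian;
# a large P value with only 15 values does not prove that it is (see the next slide).
```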
75. Your interpretation of a normality test depends on the P value and the sample size:
- Small P value, any sample size: the data failed the normality test. You can conclude that the population is unlikely to be Gaussian.
- Large P value, large sample: the data passed the normality test. You can conclude that the population is likely to be Gaussian, or nearly so. How large does the sample have to be? There is no firm answer, but one rule of thumb is that normality tests are only useful when your sample size is a few dozen or more.
- Large P value, small sample: you will be tempted to conclude that the population is Gaussian. Don't do that. A large P value just means that the data are not inconsistent with a Gaussian population; it doesn't exclude the possibility of a non-Gaussian population. Small sample sizes simply don't provide enough data to discriminate between Gaussian and non-Gaussian distributions. You can't conclude much about the distribution of a population if your sample contains fewer than a dozen values.