Title: What is p? What is NHST?
1 Lecture 7
- What is p? What is NHST?
- t tests
- Wilcoxon tests
2 Null Hypothesis Significance Testing (NHST)
- Dominates psychology, so it must be useful!
- Taught as if it is the only method for scientific inference.
- So what do the experts think ...
3
- statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution (Schmidt & Hunter, 1997, p. 37).
- The almost universal reliance on merely refuting the null hypothesis is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology (Meehl, 1978, p. 817).
- Cohen (1994) suggested that Statistical Hypothesis Inference Testing produces a more appropriate acronym.
- What is NHST, what isn't it, and why is it over- and mis-used?
4 What is p the probability of?
- a. p is the probability that the results are due to chance, the probability that the null hypothesis (H0) is true.
- b. p is the probability that the results are not due to chance, the probability that the null hypothesis (H0) is false.
- c. p is the probability of observing results as extreme as (or more extreme than) those observed, if the null hypothesis (H0) is true.
- d. p is the probability that the results would be replicated if the experiment was conducted a second time.
- e. None of these.
- Write your answer here ________
5 What is p the probability of?
- 80%: p is the probability that the results are due to chance, the probability that the null hypothesis (H0) is true.
- 10%: p is the probability that the results are not due to chance, the probability that the null hypothesis (H0) is false.
- 5%: p is the probability of observing results as extreme as (or more extreme than) those observed, if the null hypothesis (H0) is true.
- 0%: p is the probability that the results would be replicated if the experiment was conducted a second time.
- 5%: None of these.
6 What we want it to mean
- Gigerenzer (1993) divides the statistical self into the Super Ego, the Ego and the Id. The Id, our inner urges (presumably eros, not thanatos), desires the p value to be a probability about a hypothesis.
- P(H0|D)
- (some) probability of H0 conditional on the data
- What do the Super Ego and Ego believe a p value is?
7
- When calculating p values, you assume H0 is true. The p value is the probability of having data as extreme as (or more extreme than) those observed, assuming that the null hypothesis is true.
- Let H0 be some hypothesis about the world. p values assume H0. They are P(D|H0).
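A minimal sketch of that definition for a z statistic, assuming a two-sided test and a standard Normal sampling distribution under H0. The function name is illustrative and not from the lecture.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """P(|Z| >= |z|) for a standard Normal Z.

    The probability is computed under the assumption that H0 is true,
    so it is P(D|H0), not P(H0|D).
    """
    return math.erfc(abs(z) / math.sqrt(2))

# The conventional cutoff z = 1.96 corresponds to p of about .05.
p_at_cutoff = two_sided_p_from_z(1.96)
print(round(p_at_cutoff, 4))  # 0.05
```

Nothing in this calculation uses a prior probability for H0, which is why the result cannot be read as the probability that H0 is true.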
9 that means t²
10 Transforming among statistics (for meta-analysis some change everything into r)
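One such transformation, sketched under the assumption that the statistic comes from a t test with df degrees of freedom: the standard conversion r = sqrt(t² / (t² + df)). The function names are illustrative, and the slide's own formulas did not survive in this transcript.

```python
import math

def r_from_t(t: float, df: float) -> float:
    """Convert a t statistic to a correlation-type effect size r (unsigned)."""
    return math.sqrt(t * t / (t * t + df))

def t_from_r(r: float, df: float) -> float:
    """Inverse conversion: recover |t| from r and the degrees of freedom."""
    return r * math.sqrt(df / (1 - r * r))

# Example using the group comparison reported later in the lecture,
# t(531) = 5.45; r squared comes out at about .053.
r = r_from_t(5.45, 531)
print(round(r, 3), round(r * r, 3))
```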
11
- The dichotomous reject/not reject decision making.
- And what is so special about .05???
- NHST gives P(D|H0) instead of P(H0|D).
- NHST tests a point hypothesis.
12
- Zone out if you don't like philosophy
A Mandelbrot-type drawing
13 Is there any way to get from P(D|H0) to P(H0|D)?
It never rains in Southern California
- If X then Y: If SoCal then dry
- Not Y implies not X: Raining implies not SoCal
- If H0 true then z < 1.96 (or p > .05)
- z > 1.96 implies H0 false, and we reject it
- Sounds okay???
14 Statistics, by nature, not precise
- If X then probably Y
- Not Y implies probably not X
- If SoCal then probably dry
- Raining implies probably not SoCal
- If H0 then probably z < 1.96
- z > 1.96 implies probably H0 false
- (and we reject it)
- Is this okay???
- Does the probably make things different???
15
- If SoCal then probably dry
- Raining implies probably not SoCal???
- If you are American, then you are probably not the President.
- If you are the President, then you are probably not American???
- If H0 is true, then probably z < 1.96.
- z > 1.96 implies probably H0 false???
- So all this hassle on hypothesis testing
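The formal route from P(D|H0) to P(H0|D) is Bayes's theorem, shown here as a reminder rather than as part of the original slides. It makes explicit what the "probably" arguments leave out: the prior probability of H0 and the probability of the data when H0 is false.

```latex
P(H_0 \mid D) =
  \frac{P(D \mid H_0)\, P(H_0)}
       {P(D \mid H_0)\, P(H_0) + P(D \mid \lnot H_0)\, P(\lnot H_0)}
```

A small P(D|H0) therefore does not by itself make P(H0|D) small; that also depends on P(H0) and on P(D|not H0).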
16 Lyubov Popova's Portrait of a Philosopher
17 What is the probability of H0 being true?
- In traditional NHST, H0 refers to a single point.
- e.g., absolutely no difference between men and women on some personality measure.
- H0 is always false in the real world (Cohen, 1990, p. 1308).
- As you increase the sample size, even a minute difference will reach statistical significance.
- Statistical significance is not practical significance.
- Statistical significance: a difference was detected.
- Finding a significant p just means you had a big enough sample to detect the effect. A non-significant p just means your sample was too small to detect the effect.
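The sample-size point can be sketched numerically. This assumes two equal-sized groups, a known standard deviation of 1, and a Normal (z) approximation; the effect size of 0.02 standard deviations is a hypothetical value chosen only to be tiny.

```python
import math

def two_group_p(d: float, n_per_group: int) -> float:
    """Two-sided p for an observed standardized mean difference d
    between two groups of n_per_group each (z approximation, sd = 1)."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.02  # hypothetical: 2% of a standard deviation
sample_sizes = (100, 1_000, 10_000, 100_000)
p_values = [two_group_p(tiny_effect, n) for n in sample_sizes]

# The same minute difference goes from clearly non-significant
# to highly significant purely because n grows.
for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>7}: p = {p:.6f}")
```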
18
- But given that H0 is always false, why not just reject it and not bother running the study?
- Answer: No reason, if all you were doing was checking for significance.
- Are there situations where NHST makes more sense?
19 Bending of light during a total eclipse
- H1. Light does not bend (classical view)
- H2. Do not bodies act upon light at a distance, and by their action bend its rays, and is not this action (caeteris paribus) strongest at the least distance? (Newton, Opticks)
- Newtonian special relativity predicts 0.87" bend
- H3. General relativity predicts 1.74" bend
- 29 May 1919: Two British expeditions
- The bugbear of possible systematic error affects all investigations of this kind (Eddington, 1920, p. 116).
20
- Africa (headed by Arthur Eddington)
- poor weather
- 1.61" ± .30", just about sufficient
- Brazil
- good weather
- 1.98" ± .18", practical certainty
- probable accidental errors
- Using these data to judge the success of Einstein's GR was probably premature.
22 Put forward a bold conjecture that data can falsify.
NHST can help to falsify believed hypotheses (like special relativity), and failure to reject other hypotheses makes them appear better.
23 Physics v Psychology (Meehl, 1967) (generalizing a lot here)
- Physicists are usually hoping a model fits the data.
- In psychology, usually the alternative hypothesis is the substantive one, but it is too broadly defined.
- Psychologists are usually hoping to reject the null hypothesis.
24 Consequences of improved methods
- Larger samples, better instruments, etc., all lead to being able to test hypotheses more precisely (i.e., smaller confidence intervals).
- For physics: tougher and tougher tests of substantive hypotheses.
- For psychology: easier and easier to accept substantive hypotheses.
- psychology theories rise and decline, come and go, more as a function of baffled boredom than anything else (Meehl, 1978, p. 807)
25
- I had the expectation when I became a faculty member that anybody with the brains to get a Ph.D., who had taken courses in statistics and logic and the like, could be depended upon to be 95 percent rational, an expectation which was rudely upset by subsequent experience in faculty meetings and committees. While I have mellowed with age ..., I must confess that I have never fully recovered from the shock of realizing that one can become a college professor and not be able to think straight.
- Paul Meehl, in his autobiography
26 NHST
- Null hypothesis significance testing should only be used if there is a serious theory that predicts H0.
- Why test if reliability between raters/tests/etc. is zero?
- Sometimes I hear people saying things like the number of correct answers was not significantly different from zero when the mean was not zero. This shows utter lack of understanding of NHST. It drives me
27 The Great NHST Battles
- Many thought the APA task force (1999) was going to ban p values from psychology journals.
- The arguments against NHST have existed from its inception.
- Some people argue that there are reasons for NHST, although they admit that it is over- and mis-used.
- My own opinion: If NHST is to be used, it should be done in conjunction with other methods. And people should know what it is!
- What alternatives exist?
28
a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn't exist (Cohen, 1997, p. 31)
29 Tell a Story: Abelson's (1995) MAGIC criteria
- Magnitude: report effect sizes,
- Articulation: focus the reader on the effects of importance,
- Generality: state how broadly your results should generalize,
- Interest: make the reader excited (you should be excited),
- Credibility: provide good evidence for your claims.
- (Abelson, 1995)
- The results section should tell a good story. If an article were a murder mystery, the results would reveal whodunit!
31 Avoid dichotomous thinking
- Fisher (1925)
- .1 to .9: certainly no reason to suspect the hypothesis
- .02 to .05: judge significant, though barely so
- below .02: strongly indicated that the hypothesis fails
- below .01: no practical importance whether P is .01 or .000001
- Efron & Tibshirani (1993, p. 204)
- < .10: borderline evidence against H0
- < .05: reasonably strong evidence against H0
- < .025: strong evidence against H0
- < .01: very strong evidence against H0
32 Statistics is an Art
- There are lots of ways to approach most statistics problems. Some are less wrong than others.
- A bad answer to the right question is often better than a good answer to the wrong question.
33 Other Alternatives
- Bayesian (which does give P(H|D), but has its own problems)
- Effect sizes (r values)
- Confidence intervals (including of effect sizes)
- Graphs and more descriptive statistics
- Exploratory data analysis (EDA, Tukey, 1977)
- Instead of p values
- Goodness of fit intervals
- The traditional null hypothesis is virtually always wrong because it is infinitely precise, whereas none of the real-world phenomena it is designed to test can possibly reach that level of precision (Murphy & Myors, 2004, pp. 34-35).
34
- Psychologists should make friends with their data
- (Rosenthal, co-chair of APA task force)
35 Within-subject t test: The confidence interval
- Subtract the value of one variable from another.
- H0: Population difference mean = 0; H0: µ(x1-x2) = 0
- Newton (1998) gave 94 people a hostility questionnaire on arrival at, and at discharge from, Grendon prison in the UK.
- Arrival: the mean was 28.3 with an sd = 8.0.
- Discharge: the mean was 21.6 with an sd = 9.2.
- She subtracted the scores for each person and found the mean of this difference was -6.6 with an sd = 9.0.
36 The 95% CI for the Difference
- The t value is 1.99. This is the critical value for t to reach significance at the 5% level with 93 degrees of freedom.
- CI from -8.4 to -4.8, does not overlap with zero so can reject H0. Write as (-8.4, -4.8) or -6.6 ± 1.8.
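The interval can be reproduced from the summary statistics on the slides (n = 94, mean difference -6.6, sd 9.0, critical t 1.99). This is a sketch of the usual mean ± t × sd/√n calculation; the variable names are illustrative.

```python
import math

n = 94            # number of people (from the slides)
mean_diff = -6.6  # mean of the within-person differences
sd_diff = 9.0     # standard deviation of the differences
t_crit = 1.99     # critical t used on the slide

se = sd_diff / math.sqrt(n)   # standard error of the mean difference
margin = t_crit * se          # half-width of the 95% CI
ci = (mean_diff - margin, mean_diff + margin)

print(round(se, 2), round(margin, 1), tuple(round(x, 1) for x in ci))
# 0.93 1.8 (-8.4, -4.8)
```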
37 Traditionally, the t test is taught differently
- se is the standard error of the difference.
- With a t = -7.11, and 93 degrees of freedom.
- 7.11 surpasses the critical value for both 1% and 5%. t(93) = 7.11, p < .001
- What if the p value is closer to 1% than 5%?
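The t value follows from the same summary statistics, assuming t is the mean difference divided by its standard error (sd/√n). Only the 5% critical value of 1.99 quoted in the lecture is used for the comparison; variable names are illustrative.

```python
import math

n = 94
mean_diff = -6.6
sd_diff = 9.0

se = sd_diff / math.sqrt(n)  # standard error of the difference
t = mean_diff / se           # one-sample t on the differences
df = n - 1

print(f"t({df}) = {t:.2f}")  # t(93) = -7.11
print(abs(t) > 1.99)         # True: beyond the 5% critical value
```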
38 p < .05 or p = .02
- Why hide information from the reader?
- Tufte's data-ink ratio: more information. Don't use all my ink!
39 How big is the difference?
- -6.6 in units of the scale
- Original standard deviations 8-9, so about 3/4 of a standard deviation.
- Standard deviation of difference also about 9, so 3/4
- This is called d, difference/sd.
- Report effect sizes with appropriate units.
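A quick check of the "about 3/4" figure, dividing the mean difference by each of the standard deviations reported in the lecture. Which sd to use as the denominator is a choice the slides leave open, so all three are shown.

```python
mean_diff = -6.6

# Standard deviations reported in the lecture
sd_arrival = 8.0
sd_discharge = 9.2
sd_difference = 9.0

d_by_sd = {
    "arrival sd": abs(mean_diff) / sd_arrival,
    "discharge sd": abs(mean_diff) / sd_discharge,
    "difference sd": abs(mean_diff) / sd_difference,
}

# All three land in the neighbourhood of 3/4 of a standard deviation.
for label, d in d_by_sd.items():
    print(f"d using {label}: {d:.2f}")
```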
40 Assumptions
- Normal distribution of the differences (not the original variables)
- Interval measurement for the difference
- Independent data (so, for example, not a cluster sample)
- Alternative if you do not make the Normal distribution assumption, but do make the others: Wilcoxon signed rank test.
41 What if differences are not Normal
- Ignore and do a t test anyway
- Rank and do a t test
- Do procedures that lessen the effect of outliers
- Do tests of median
- Do Wilcoxon signed-rank test
42 Crime in Sussex
44 What is greater, weapons or sex offenses?
46 Take the differences, and rank them without regard to their sign
There are lots of ties in these data and there are complex equations that slightly improve estimates with ties, so the computer uses these.
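The ranking step can be sketched as follows. This computes only the signed-rank sums, not the p value or the tie corrections the computer applies. It assumes two common conventions: zero differences are dropped, and tied absolute differences share their average rank. The data are hypothetical, not the Sussex crime figures.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def signed_rank_sums(x, y):
    """Return (W+, W-): rank sums of positive and negative differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])   # rank ignoring sign
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus

# Hypothetical paired counts
x = [5, 3, 8, 6, 2]
y = [3, 4, 4, 6, 5]
print(signed_rank_sums(x, y))  # (6.0, 4.0)
```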
47 p = .04969: those outliers were distorting the data/conclusions. Chambers' Prime Directive!
48 Group t test
- Much debate historically.
- How to pool the variance, and whether to.
- Humans usually just take the larger or a weighted average variance. Computers do other things.
- Weighted average
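A minimal sketch of the weighted-average (pooled) variance and the resulting group t statistic, assuming each group's variance is weighted by its degrees of freedom, n - 1. The two groups below are hypothetical illustration data.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Equal-variance two-group t test statistic.

    Returns (t, df, pooled_variance), where the pooled variance is the
    average of the two sample variances weighted by n - 1.
    """
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, n1 + n2 - 2, sp2

# Hypothetical scores for two independent groups
group_a = [10, 12, 9, 11, 13, 10]
group_b = [14, 20, 8, 25, 17, 11, 22]

t_pooled, df_pooled, sp2 = pooled_t(group_a, group_b)
print(round(t_pooled, 2), df_pooled, round(sp2, 2))
```

The pooled variance always falls between the two group variances, which is why pooling is questionable when the variances differ a lot.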
49 Example: Wages and Gender
50 As a regression: wage_i = B0 + B1 female_i + e_i
- minimizing the sum of e_i² yields B1 = difference in means
- H0 is B1 = 0
t(531) = 5.45, p < .001
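The claim that the least-squares slope on a 0/1 dummy equals the difference in group means can be checked directly. The wages below are hypothetical, not the lecture's data set, and the helper function is illustrative.

```python
from statistics import mean

# Hypothetical hourly wages
wages = [12.0, 15.5, 9.0, 11.0, 8.5, 10.0, 7.5]
female = [0, 0, 0, 0, 1, 1, 1]  # dummy variable: 1 = female, 0 = male

def ols_simple(x, y):
    """Least-squares intercept B0 and slope B1 for y = B0 + B1*x + e."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0, b1 = ols_simple(female, wages)
mean_male = mean([w for w, f in zip(wages, female) if f == 0])
mean_female = mean([w for w, f in zip(wages, female) if f == 1])

# B0 is the mean of the group coded 0; B1 is the difference in means.
print(b0, mean_male)
print(b1, mean_female - mean_male)
```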
51 As an ANOVA
F(1,531) = 29.73, p < .001 (F here is the same as t² from the previous page).
η² = 679.4/(679.4 + 12136.4) = .05301 (same as the R² from the previous page),
and (.05301)(531)/(1 - .05301) = 29.73
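The arithmetic on this slide can be verified in a few lines. Reading 679.4 and 12136.4 as the between-group and within-group sums of squares is an assumption based on how they enter the η² formula.

```python
ss_between = 679.4    # assumed: between-groups (effect) sum of squares
ss_within = 12136.4   # assumed: within-groups (residual) sum of squares
df_between, df_within = 1, 531

eta_sq = ss_between / (ss_between + ss_within)
f_from_ss = (ss_between / df_between) / (ss_within / df_within)
f_from_eta = eta_sq * df_within / (1 - eta_sq)

print(round(eta_sq, 5))      # 0.05301
print(round(f_from_ss, 2))   # 29.73
print(round(f_from_eta, 2))  # 29.73
print(round(5.45 ** 2, 2))   # 29.7, i.e. t squared up to rounding of t
```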
52 Or as a "t test"
t(527.38) = 5.57, p < .001
Why are the dfs weird and the t different?
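One likely explanation, stated here as an assumption since the slide does not name the procedure: software often defaults to the unequal-variance (Welch) t test, which does not pool the variances and uses the Welch-Satterthwaite approximation for the degrees of freedom. A sketch with hypothetical data:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Unequal-variance (Welch) t statistic and approximate df."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation: generally not a whole number
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical groups with clearly different spreads
group_a = [10, 12, 9, 11, 13, 10]
group_b = [14, 20, 8, 25, 17, 11, 22]

t_welch, df_welch = welch_t(group_a, group_b)
print(round(t_welch, 2), round(df_welch, 2))
```

The df always lies between the smaller group's n - 1 and the pooled n1 + n2 - 2, which is how a fractional value such as 527.38 can arise in place of 531.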
54 Among the assumptions is that the residuals are normally distributed
55 Mann-Whitney/Wilcoxon
56 Can we just rank the data and run our normal tests?
57 Summary
- Hypothesis testing is a weird scientific ritual. It is arguably appropriate in some situations.
- t tests and Wilcoxon tests are some of the most used hypothesis tests. They are for comparing either the center of a distribution for two groups, or two distributions.
58 Journal
- Exercises 6.3 and 6.8 from First Steps.
- From some of your own data, your supervisor's data, something you find in a paper or on a website, or from whatever, do each of the 4 tests we did today. Use a mix of R and SPSS.