Title: What is p? What is NHST?


1
Lecture 7
  • What is p? What is NHST?
  • t tests
  • Wilcoxon tests

2
Null Hypothesis Significance Testing (NHST)
  • Dominates psychology, so it must be useful!
  • Taught as if it is the only method for
    scientific inference.
  • So what do the experts think ...

3
  • "Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution" (Schmidt & Hunter, 1997, p. 37).
  • "The almost universal reliance on merely refuting the null hypothesis is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology" (Meehl, 1978, p. 817).
  • Cohen (1994) suggested that "Statistical Hypothesis Inference Testing" produces a more appropriate acronym.
  • What is NHST, what isn't it, and why is it over- and mis-used?

4
What is p the probability of?
  • a. p is the probability that the results are due to chance, i.e. the probability that the null hypothesis (H0) is true.
  • b. p is the probability that the results are not due to chance, i.e. the probability that the null hypothesis (H0) is false.
  • c. p is the probability of observing results as extreme as (or more extreme than) those observed, if the null hypothesis (H0) is true.
  • d. p is the probability that the results would be replicated if the experiment were conducted a second time.
  • e. None of these.
  • Write your answer here ________

5
What is p the probability of?
  • 80%  p is the probability that the results are due to chance, i.e. the probability that the null hypothesis (H0) is true.
  • 10%  p is the probability that the results are not due to chance, i.e. the probability that the null hypothesis (H0) is false.
  • 5%   p is the probability of observing results as extreme as (or more extreme than) those observed, if the null hypothesis (H0) is true.
  • 0%   p is the probability that the results would be replicated if the experiment were conducted a second time.
  • 5%   None of these.

6
What we want it to mean
  • Gigerenzer (1993) divides the statistical self into the Super Ego, the Ego, and the Id. The Id, our inner urges (presumably eros, not thanatos), desires the p value to be a probability about a hypothesis.
  • P(H0 | D)
  • (some) probability of H0 conditional on the data
  • What do the Super Ego and Ego believe a p value
    is?

7
  • When calculating p values, you assume H0 is true. The p value is the probability of obtaining data as extreme as (or more extreme than) the data observed, assuming that the null hypothesis is true.
  • Let H0 be some hypothesis about the world.

p values assume H0. They are P(D | H0), as illustrated below.
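A minimal R sketch of this, with made-up numbers (the t statistic and df below are hypothetical, purely for illustration): the p value is computed entirely under the assumption that H0 holds.

    # Two-sided p value for an observed t statistic, assuming H0 is true
    t_obs <- 2.3                       # observed t statistic (hypothetical)
    df    <- 40                        # degrees of freedom (hypothetical)
    p     <- 2 * pt(-abs(t_obs), df)   # P(|T| >= |t_obs|), computed under H0
    p                                  # about 0.027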
8
(No Transcript)
9
that means t²
10
Transforming among statistics (for meta-analysis, some change everything into r; see the sketch below)
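A minimal R sketch of two standard conversions of this kind (not taken from the slides; the d formula assumes equal group sizes):

    # Converting a two-group t statistic into effect sizes
    t_to_r <- function(t, df) t / sqrt(t^2 + df)   # point-biserial r from t
    t_to_d <- function(t, df) 2 * t / sqrt(df)     # Cohen's d from t (equal group sizes)
    d_to_r <- function(d)     d / sqrt(d^2 + 4)    # r from d (equal group sizes)
    t_to_r(5.45, 531)                              # about .23 (the wage example later in the lecture)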
11
  • The dichotomous reject/not reject decision
    making.
  • And what is so special about .05???
  • NHST gives P(D | H0) instead of P(H0 | D).
  • NHST tests a point hypothesis.

12
  • Zone out if you don't like philosophy

A Mandelbrot-type drawing
13
Is there any way to get from P(D | H0) to P(H0 | D)? It never rains in Southern California
  • If X then Y. If SoCal then dry.
  • Not Y implies not X. Raining implies not SoCal.
  • If H0 true then z < 1.96 (or p > .05).
  • z > 1.96 implies H0 false, and we reject it.
  • Sounds okay???

14
Statistics, by nature, not precise
  • If X then probably Y
  • Not Y implies probably not X
  • If SoCal then probably dry
  • Raining implies probably not SoCal
  • If H0 then probably z < 1.96
  • z > 1.96 implies probably H0 false
  • (and we reject it)
  • Is this okay???
  • Does the probably make things different???

15
  • If SoCal then probably dry
  • Raining implies probably not SoCal???
  • If you are American, then you are probably not
    the President.
  • Being the President implies you are probably not American???
  • If H0 is true, then probably z < 1.96.
  • z > 1.96 implies probably H0 false???
  • Hence all this hassle over hypothesis testing.

16
  • Zone back in

Lyubov Popova's Portrait of a Philosopher
17
What is the probability of H0 being true?
  • In traditional NHST, H0 refers to a single point.
  • e.g., absolutely no difference between men and women on some personality measure.
  • H0 is always false in the real world (Cohen, 1990, p. 1308).
  • As you increase the sample size, even a minute difference will reach statistical significance.
  • Statistical significance is not practical significance.
  • Statistical significance just means a difference was detected.
  • Finding a significant p just means you had a big
    enough sample to detect the effect. A
    non-significant p just means your sample was too
    small to detect the effect.

18
  • But given that H0 is always false, why not just
    reject it and not bother running the study?
  • Answer: no reason, if all you were doing was checking whether the result is significant.
  • Are there situations where NHST makes more sense?

19
Bending of light during a total eclipse
  • H1. Light does not bend (classical view).
  • H2. "Do not bodies act upon light at a distance, and by their action bend its rays, and is not this action (caeteris paribus) strongest at the least distance?" (Newton, Opticks)
  • Newtonian special relativity predicts a 0.87" bend.
  • H3. General relativity predicts a 1.74" bend.
  • 29 May 1919 - Two British expeditions.
  • "The bugbear of possible systematic error affects all investigations of this kind" (Eddington, 1920, p. 116).

20
  • Africa (headed by Arthur Eddington)
  • poor weather
  • 1.61" ± 0.30" - "just about sufficient"
  • Brazil
  • good weather
  • 1.98" ± 0.18" - "practical certainty"
  • "probable accidental errors"
  • Using these data to judge the success of Einstein's GR was probably premature.

21
(No Transcript)
22
Put forward a bold conjecture that data can falsify. NHST can help to falsify believed hypotheses (like special relativity), and failure to reject other hypotheses makes them appear better.
23
Physics vs Psychology (Meehl, 1967) (generalizing a lot here)
  • Physicists are usually hoping a model fits the
    data.
  • In psychology, usually the alternative hypothesis
    is the substantive one, but it is too broadly
    defined.
  • Psychologists are usually hoping to reject the
    null hypothesis.

24
Consequences of improved methods
  • Larger samples, better instruments, etc., all lead to being able to test hypotheses more precisely (i.e., smaller confidence intervals).
  • For physics: tougher and tougher tests of substantive hypotheses.
  • For psychology: easier and easier acceptance of substantive hypotheses.
  • Psychology theories "rise and decline, come and go, more as a function of baffled boredom than anything else" (Meehl, 1978, p. 807).

25
  • I had the expectation when I became a faculty
    member that anybody with the brains to get a
    Ph.D., who had taken courses in statistics and
    logic and the like, could be depended upon to be
    95 percent rational, an expectation which was
    rudely upset by subsequent experience in faculty
    meetings and committees. While I have mellowed
    with age ..., I must confess that I have never
    fully recovered from the shock of realizing that
    one can become a college professor and not be
    able to think straight.
  • Paul Meehl in autobiography

26
NHST
  • Null hypothesis significance testing should only
    be used if there is a serious theory that
    predicts H0.
  • Why test if reliability between raters/tests/etc.
    is zero?
  • Sometimes I hear people saying things like "the number of correct answers was not significantly different from zero" when the mean was not zero. This shows an utter lack of understanding of NHST. It drives me ...
27
The Great NHST Battles
  • Many thought the APA task force (1999) was going
    to ban p values from psychology journals.
  • The arguments against NHST have existed from its
    inception.
  • Some people argue that there are reasons for
    NHST, although they admit that it is over- and
    mis-used.
  • My own opinion: if NHST is to be used, it should be done in conjunction with other methods. And people should know what it is!
  • What alternatives exist?

28
"a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn't exist" (Cohen, 1997, p. 31)
29
Tell a Story: Abelson's (1995) MAGIC criteria
  • Magnitude: report effect sizes,
  • Articulation: focus the reader on the effects of importance,
  • Generality: state how broadly your results should generalize,
  • Interest: make the reader excited (you should be excited),
  • Credibility: provide good evidence for your claims.
  • (Abelson, 1995)
  • The results section should tell a good story. If an article were a murder mystery, the results would reveal whodunit!

30
(No Transcript)
31
Avoid dichotomous thinking
  • Fisher (1925):
  • .1 to .9: "certainly no reason to suspect the hypothesis"
  • .02 to .05: "judged significant, though barely so"
  • below .02: "strongly indicated that the hypothesis fails"
  • below .01: no practical importance whether P is .01 or .000001
  • Efron & Tibshirani (1993, p. 204):
  • < .10: borderline evidence against H0
  • < .05: reasonably strong evidence against H0
  • < .025: strong evidence against H0
  • < .01: very strong evidence against H0

32
Statistics is an Art
  • There are lots of ways to approach most
    statistics problems. Some are less wrong than
    others.
  • A bad answer to the right question is often
    better than a good answer to the wrong question.

33
Other Alternatives
  • Bayesian methods (which do give P(H | D), but have their own problems)
  • Effect sizes (e.g., r values)
  • Confidence intervals (including of effect sizes)
  • Graphs and more descriptive statistics
  • Exploratory data analysis (EDA; Tukey, 1977)
  • Instead of p values:
  • Goodness-of-fit intervals
  • "The traditional null hypothesis is virtually always wrong because it is infinitely precise, whereas none of the real-world phenomena it is designed to test can possibly reach that level of precision" (Murphy & Myors, 2004, pp. 34-35).

34
  • Psychologists should make friends with their
    data
  • (Rosenthal, co-chair of APA task force)

35
Within-subject t test: the confidence interval
  • Subtract the value of one variable from the other.
  • H0: the population mean difference is 0, i.e. H0: µ(x1 - x2) = 0.
  • Newton (1998) gave 94 people a hostility questionnaire on arrival at, and discharge from, Grendon prison in the UK.
  • At arrival the mean was 28.3 with an sd of 8.0.
  • At discharge the mean was 21.6 with an sd of 9.2.
  • She subtracted the scores for each person and found the mean of this difference was -6.6 with an sd of 9.0.

36
The 95% CI for the Difference
  • The critical t value (93 df, two-tailed, α = .05) is 1.99; this is the value t must reach.
  • The CI runs from -8.4 to -4.8 and does not include zero, so we can reject H0. Write it as (-8.4, -4.8) or -6.6 ± 1.8 (computed in the sketch below).
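A minimal R sketch of this confidence interval from the summary statistics on the previous slide (mean difference -6.6, sd 9.0, n = 94):

    # 95% CI for the mean difference (summary statistics from Newton, 1998)
    n  <- 94                       # sample size
    m  <- -6.6                     # mean of the differences
    s  <- 9.0                      # sd of the differences
    se <- s / sqrt(n)              # standard error of the mean difference
    tc <- qt(0.975, df = n - 1)    # critical t, about 1.99
    m + c(-1, 1) * tc * se         # about (-8.4, -4.8)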

37
Traditionally the t test is Taught Differently
  • t = (mean difference) / se, where se is the standard error of the difference (sd / √n).
  • This gives t = -7.11, with 93 degrees of freedom.
  • 7.11 surpasses the critical values for both 1% and 5%: t(93) = -7.11, p < .001 (see the sketch below).
  • What if the p value is closer to 1% than to 5%?
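The same numbers via the traditional formula, as a minimal R sketch (with raw data you would simply call t.test(x1, x2, paired = TRUE)):

    # Traditional paired t test from the summary statistics
    n <- 94; m <- -6.6; s <- 9.0
    se     <- s / sqrt(n)
    t_stat <- m / se                      # about -7.11
    2 * pt(-abs(t_stat), df = n - 1)      # two-sided p, far below .001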

38
p < .05 or p = .02?
  • Why hide information from the reader?
  • Tufte's data-ink ratio: more information per unit of ink.

Don't use all my ink!
39
How big is the difference?
  • -6.6 in units of the scale
  • Original standard deviations 8-9, so about 3/4 of
    a standard deviation.
  • The standard deviation of the differences is also about 9, so again about 3/4.
  • This is called d: difference / sd (computed in the sketch below).
  • Report effect sizes with appropriate units.
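The calculation as a one-line R sketch, using the slide's numbers:

    d <- 6.6 / 9.0    # difference / sd of the differences, about 0.73 (roughly 3/4 of an sd)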

40
Assumptions
  • Normal distribution of the differences (not the
    original variables)
  • Interval measurement for the difference
  • Independent data (so, e.g., not a cluster sample)
  • Alternative, if you do not make the Normal distribution assumption but do make the others: the Wilcoxon signed-rank test (see the sketch below).
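A minimal R sketch of both tests on simulated paired data (the real questionnaire data are not reproduced here; the numbers below are made up to loosely match the earlier slide):

    # Simulated paired scores, roughly matching the earlier means and sds
    set.seed(1)
    arrival   <- rnorm(30, mean = 28, sd = 8)
    discharge <- rnorm(30, mean = 22, sd = 9)
    t.test(arrival, discharge, paired = TRUE)        # within-subject t test
    wilcox.test(arrival, discharge, paired = TRUE)   # Wilcoxon signed-rank test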

41
What if the differences are not Normal?
  • Ignore and do a t test anyway
  • Rank and do a t test
  • Do procedures that lessen the effect of outliers
  • Do tests of median
  • Do Wilcoxon signed-rank test

42
Crime in Sussex
43
(No Transcript)
44
Which is greater: weapons or sex offenses?
45
(No Transcript)
46
Take the differences, and rank them without
regard to their sign
There are lots of ties in these data. There are complex equations that slightly improve the estimates when there are ties, and the computer uses these.
47
p = .04969; those outliers were distorting the
data/conclusions. Chamber's Prime Directive!
48
Group t test
  • Much debate historically.
  • How, and whether, to pool the variance.
  • Humans usually just take the larger or a weighted
    average variance. Computers do other things.
  • Weighted average (pooled) variance: see the sketch below.
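A minimal sketch of the pooled ("weighted average") variance, plus the corresponding R calls (the variable and group names here are placeholders, not from the lecture's data):

    # Pooled variance: weights each group's variance by its degrees of freedom
    pooled_var <- function(s1, s2, n1, n2)
      ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
    # With raw data and a two-level grouping factor:
    #   t.test(y ~ group, var.equal = TRUE)   # pooled (Student) t, df = n1 + n2 - 2
    #   t.test(y ~ group)                     # R's default: Welch, no pooling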

49
Example: Wages and Gender
50
As a regression: wage_i = B0 + B1 female_i + e_i
  • Minimizing the sum of the e_i² yields B1 = the difference in means.
  • H0 is B1 = 0.

t(531) = 5.45, p < .001
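A minimal R sketch of this regression (the lecture's data set is not included here; a data frame wages with a numeric wage and a 0/1 female column is assumed):

    fit <- lm(wage ~ female, data = wages)
    summary(fit)    # the t for the female coefficient tests H0: B1 = 0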
51
As an ANOVA
F(1, 531) = 29.73, p < .001 (the F here is the same as t² from the previous page).
η² = 679.4 / (679.4 + 12136.4) = .05301 (the same as the R² from the previous page),
and F = (.05301)(531) / (1 - .05301) = 29.73.
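The same model viewed as an ANOVA, as a minimal R sketch (same assumed wages data frame as above):

    fit <- lm(wage ~ female, data = wages)
    anova(fit)                 # F(1, n - 2); with one predictor, F = t^2
    summary(fit)$r.squared     # equals eta^2 for this one-way design
    # identity used above: F = eta^2 * df_error / (1 - eta^2)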
52
Or as a "t test"
t(527.38) = 5.57, p < .001
Why are the dfs weird and the t different?
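The fractional df come from Welch's unequal-variance t test (the default in R), which does not pool the variances and uses the Welch-Satterthwaite approximation for the degrees of freedom. A minimal sketch with the same assumed wages data frame:

    t.test(wage ~ female, data = wages)                     # Welch: fractional df (e.g. 527.38)
    t.test(wage ~ female, data = wages, var.equal = TRUE)   # Student: df = n1 + n2 - 2 = 531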
53
(No Transcript)
54
Among the assumptions is that the residuals are
normally distributed
55
Mann-Whitney/Wilcoxon
56
Can we just rank the data and run our normal
tests?
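Broadly yes: a t test run on the ranks gives results closely related to (though not identical with) the Mann-Whitney/Wilcoxon rank-sum test. A minimal R sketch with the same assumed wages data frame:

    wilcox.test(wage ~ female, data = wages)      # Mann-Whitney / Wilcoxon rank-sum test
    t.test(rank(wage) ~ female, data = wages)     # t test on the ranks: a close approximation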
57
Summary
  • Hypothesis testing is a weird scientific ritual.
    It is arguably appropriate in some situations.
  • t tests and Wilcoxon tests are among the most used hypothesis tests. They compare either the centers of the distributions of two groups, or the two distributions themselves.

58
Journal
  • Exercises 6.3 and 6.8 from First Steps.
  • From some of your own data, your supervisor's data, something you find in a paper or on a website, or whatever, do each of the 4 tests we did today. Use a mix of R and SPSS.