Title: Statistics for Linguistics
1Statistics for Linguistics
2Variables and levels
- Independent variable
- Dependent variable
The IV must have at least two levels
(conditions).
The DV must allow for at least two different
types of responses.
3Example 1
Subjects are given two types of constructions and are asked to decide whether the given sentence is grammatical.
Construction 1
(1) a. I gave the key him.
    b. I gave the book her.
    c.
Construction 2
(2) a. I gave the key to him.
    b. I gave the book to her.
    c.
4Example 1
IV (two conditions): Construction 1, Construction 2
DV (forced choice task): a. grammatical, b. ungrammatical
5Example 2
Subjects are asked to complete copular sentences with a relative clause. The predicate nominals of the copular clauses belong to three different semantic types: (1) animate/human, (2) inanimate/object, (3) place.
(1) This is the man __
(2) This is the ball __
(3) This is the place __
6Example 2
Subjects' responses can be divided into five different types:
(1) This is the man
    who talked to Jane.
    who I met.
    whom I gave the book.
    to whom she went.
    whose cat died.
7Example 2
IV:
1. This is the man __
2. This is the ball __
3. This is the place __
DV:
a. SUBJ relative clause
b. DO relative clause
c. IO relative clause
d. OBL relative clause
e. GEN relative clause
8Example 2
IV (first sentence frame):
1. This is the man __
2. This is the thing __
3. This is the place __
IV (second sentence frame):
1. I saw the man __
2. I saw the thing __
3. I saw the place __
DV (both frames):
a. SUBJ relative clause
b. DO relative clause
c. IO relative clause
d. OBL relative clause
e. GEN relative clause
9Example 2
        Copular   Transitive
SUBJ    3.5       2.5
DO      3.2       3.8
IO      2.7       3.2
OBL     2.2       0.5
GEN     0.6       0.5
10Example 2
Interaction
No interaction
11Types of variables
- Nominal/categorical data
- Ordinal data
- Interval data
12Types of analysis
- Correlational analysis
- Difference test
13Types of analysis
Correlational test: Pearson's r, Kendall's tau
Difference test: t-test, ANOVA
14Related vs. independent designs
- Within-subjects design (related design, repeated measures design)
- Between-subjects design (unrelated design, independent design)
15Related vs. independent designs
Advantages of a within subject design
- Reduction of inter-individual differences
- Fewer subjects
Disadvantages of a within subject design
- Subjects recognize the purpose of the study.
- Subjects get tired, frustrated, excited.
- Subjects get habituated to the task.
16One sample tests
17Binomial
A linguist has collected a sample of sentences including ditransitive verbs from a corpus. Overall, there are 46 sentences in his sample. In 27 sentences the verb occurs with two NP objects, in 19 sentences the verb occurs with an NP and a PP.
(1) He gives Peter the ball. V NP NP (n = 27)
(2) He gives the ball to Peter. V NP PP (n = 19)
Is the difference in frequency between the two categories significant?
18Binomial
Null hypothesis: The two constructions are equally frequent (suggesting that they are free variants, i.e. there is nothing to explain).
Alternative hypothesis: The two constructions differ in frequency (which must have a reason that needs to be explained).
19Binomial
Variable   Group     Category   N    Observed proportion   Test proportion   Asymptotic significance (2-tailed)
Freq       Group 1   27.00      27   0.59                  0.50              0.302(a)
Freq       Group 2   19.00      19   0.41
Freq       Total                46   1.00
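The significance value in the output above can be approximated by hand. The following is a minimal sketch, assuming a test proportion of 0.5 and a normal approximation with continuity correction for the asymptotic value. The function names are illustrative, and the exact test is added only for comparison.

```python
from math import comb, erfc, sqrt

def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided p-value: sum of all outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))

def binomial_normal_approx(k, n, p=0.5):
    """Two-sided p-value from the normal approximation (continuity-corrected)."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    z = max(0.0, (abs(k - mean) - 0.5) / sd)
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))

# 27 of 46 ditransitive sentences are V NP NP
exact_p = binomial_two_sided(27, 46)
approx_p = binomial_normal_approx(27, 46)
print(f"exact p = {exact_p:.3f}, asymptotic p = {approx_p:.3f}")
```

Both values lie well above .05, so on this sketch the sample gives no grounds to reject the hypothesis that the two constructions are equally frequent.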
20Chi-square goodness-of-fit
A linguist has collected relative clauses from a corpus, which he divided into four types: (1) subject relatives, (2) object relatives, (3) oblique relatives, and (4) genitive relatives. Is the sample difference sufficient to posit a frequency difference between the four groups in the population?
21Chi-square goodness-of-fit
        Subject   Object   Oblique   Genitive   Total
Freq    55        53       39        4          151
23Chi-square goodness-of-fit
        Subject   Object   Oblique   Genitive   Total
Freq    55        53       39        4          151
Exp.    37.75     37.75    37.75     37.75
24Chi-square goodness-of-fit
Null hypothesis: The four types of relative clauses are equally frequent in the true population.
Alternative hypothesis: The four types of relative clauses are not equally frequent in the true population.
25Chi-square goodness-of-fit
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
31Chi-square goodness-of-fit
            Observed   Expected   Difference (residual)   Square
Subject     55         37.75      17.25                   297.56
Object      53         37.75      15.25                   232.56
Oblique     39         37.75      1.25                    1.56
Genitive    4          37.75      -33.75                  1139.06
Sum of squares: 1670
Divided by expected frequency: χ² = 44.25
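The goodness-of-fit calculation can be reproduced in a few lines. This is a minimal sketch, assuming equal expected frequencies under the null hypothesis. The variable names are illustrative, and the critical value is the one the slides list for 3 df at the .05 level.

```python
# Observed frequencies of the four relative clause types
observed = {"subject": 55, "object": 53, "oblique": 39, "genitive": 4}

total = sum(observed.values())
expected = total / len(observed)  # equal frequencies under the null hypothesis

# Sum of (observed - expected)^2 / expected over all categories
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

critical_05 = 7.81  # chi-square table value for 3 df, p = .05
print(f"expected = {expected}, chi-square = {chi_square:.2f}, df = {df}")
print("reject null hypothesis" if chi_square > critical_05 else "retain null hypothesis")
```

Without intermediate rounding the statistic comes out at about 44.26, which agrees with the value on the slides up to rounding.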
32Normal distributions
33Binomial distribution
34Binomial distribution
Bernoulli trial
- two possible outcomes on each trial
- the outcomes are independent of each other
- the probability ratio is constant across trials
35Binomial distribution
One toss: H, T
Two tosses: HH, HT, TH, TT
36Binomial distribution
0 heads: TT
1 head: HT, TH
2 heads: HH
37Binomial distribution
Sample space: HH, HT, TH, TT
Random variable (number of heads): 0, 1, 2
38Binomial distribution
Cumulative outcome   Number of sequences   Probability
0                    1                     0.25
1                    2                     0.50
2                    1                     0.25
Σ P(x) = 1
39
One toss: H, T
Two tosses: HH, HT, TH, TT
Three tosses: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
40Sample space: HHH, HHT, HTH, THH, HTT, THT, TTH, TTT
Random variable: 0 heads, 1 head, 2 heads, 3 heads
Heads     Number of sequences   Probability
0 heads   1                     1 / 8 = 0.125
1 head    3                     3 / 8 = 0.375
2 heads   3                     3 / 8 = 0.375
3 heads   1                     1 / 8 = 0.125
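The step from sample space to random variable to probabilities can be made explicit by enumerating all toss sequences. A minimal sketch, assuming a fair coin so that every sequence is equally likely; the function name is illustrative.

```python
from collections import Counter
from itertools import product

def heads_distribution(n_tosses):
    """Map each possible number of heads to its probability for a fair coin."""
    sample_space = list(product("HT", repeat=n_tosses))  # all H/T sequences
    counts = Counter(seq.count("H") for seq in sample_space)
    return {heads: counts[heads] / len(sample_space) for heads in sorted(counts)}

for n in (2, 3):
    print(n, "tosses:", heads_distribution(n))
```

The probabilities for any number of tosses sum to 1, and adding tosses produces the increasingly bell-shaped binomial distribution.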
41Binomial distribution
42Chi-square distribution
44Chi-square
        .995    .99     .975    .95     .90     .10    .05    .025   .01     .005
1 df
2 df
3 df    0.072   0.115   0.216   0.352   0.584   6.25   7.81   9.35   11.34   12.84
4 df
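45df = 3
Observed chi-square: 44.25
Critical value (p = .05): 7.81
The observed value exceeds the critical value, so the null hypothesis is rejected.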
46Two sample tests
47Chi-square test of independence
              V NP P   V P NP   Total
Spatial       345      17       362 (80.4%)
Non-spatial   76       12       88 (19.6%)
Total         421      29       450
- He pushed the chair away. Spatial
- He turned on the TV. Non-spatial
48Chi-square test of independence
        X    Y    Total
A                 50
B                 50
Total   50   50   100
49Chi-square test of independence
        X    Y    Total
A       25   25   50
B       25   25   50
Total   50   50   100
50Chi-square test of independence
              V NP P   V P NP   Total
Spatial                         362
Non-spatial                     88
Total         421      29       450
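51Chi-square test of independence
Expected frequency
$$E = \frac{X \times Y}{\text{total}}$$
where X is the row total and Y is the column total of the cell.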
52Chi-square test of independence
              V NP P                  V P NP                Total
Spatial       345 (362 × 421 / 450)   17 (362 × 29 / 450)   362
Non-spatial   76 (88 × 421 / 450)     12 (88 × 29 / 450)    88
Total         421                     29                    450
53Chi-square test of independence
Observed frequencies with expected frequencies in parentheses:
              V NP P        V P NP      Total
Spatial       345 (338.7)   17 (23.3)   362
Non-spatial   76 (82.3)     12 (5.7)    88
Total         421           29          450
54Chi-square test of independence
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
60Chi-square
                      Observed   Expected   Difference (residual)   Square   Divided by expected frequency
Spatial, V NP P       345        338.67     6.33                    40.06    0.12
Spatial, V P NP       17         23.33      -6.33                   40.06    1.72
Non-spatial, V NP P   76         82.33      -6.33                   40.06    0.49
Non-spatial, V P NP   12         5.67       6.33                    40.06    7.06
χ² (sum of the last column): 9.39
61Probability distribution
df = (rows - 1) × (columns - 1) = (2 - 1) × (2 - 1) = 1
62Chi-square
        .995    .99     .975    .95     .90     .10     .05     .025    .01     .005
1 df    0.000   0.000   0.001   0.004   0.016   2.706   3.841   5.024   6.635   7.879
2 df
3 df
4 df
63Critical value (1 df, p = .05): 3.841
Observed chi-square: 9.39
The observed value exceeds the critical value, so the null hypothesis of independence is rejected.
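The whole test of independence, from marginal totals to the test statistic, can be sketched as follows. This is an illustrative sketch with assumed variable names. It works from the observed 2 × 2 table of spatial and non-spatial particle verbs and uses unrounded expected frequencies.

```python
# Rows: spatial, non-spatial; columns: V NP P, V P NP
observed = [[345, 17],
            [76, 12]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of a cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_05 = 3.841  # chi-square table value for 1 df, p = .05
print(f"chi-square = {chi_square:.2f}, df = {df}, significant: {chi_square > critical_05}")
```

With unrounded expected frequencies the statistic is about 9.39. Hand calculations that round intermediate values can differ slightly, but the comparison with the critical value of 3.841 comes out the same.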
64McNemar
Do children benefit in acquiring a grammatical construction from hearing that construction frequently? To answer this question, we ask 100 children to repeat a ditransitive sentence of 10 words (Der Mann gibt dem kleinen Jungen einen sehr großen Ballon, 'The man gives the little boy a very big balloon'). All children have to repeat the sentence twice: (1) at the beginning of the study and (2) after a training phase in which they hear similar ditransitive sentences 5 times, seemingly in passing, during a one-hour conversation. Does the training phase affect the children's ability to repeat ditransitive sentences?
65McNemar
- Children who repeat the sentence correctly before and after the training phase. (N = 31)
- Children who repeat the sentence incorrectly before and correctly after. (N = 39)
- Children who repeat the sentence correctly before and incorrectly after. (N = 13)
- Children who repeat the sentence incorrectly before and after. (N = 17)
66McNemar
                   Before: correct   Before: incorrect   Total
After: correct     31                39                  70
After: incorrect   13                17                  30
Total              44                56                  100
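The slides do not show the computation for this table. As a hedged sketch, the standard McNemar statistic uses only the two cells in which children changed their response. The variable names are illustrative, and the p-value relies on the identity P(χ² > x) = erfc(√(x / 2)) for 1 df.

```python
from math import erfc, sqrt

changed_to_correct = 39    # incorrect before training, correct after
changed_to_incorrect = 13  # correct before training, incorrect after

b, c = changed_to_correct, changed_to_incorrect

# McNemar statistic without and with continuity correction (1 df)
chi_square = (b - c) ** 2 / (b + c)
chi_square_cc = (abs(b - c) - 1) ** 2 / (b + c)

# Upper-tail probability of a chi-square variable with 1 df
p_value = erfc(sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, corrected = {chi_square_cc:.2f}, p = {p_value:.5f}")
```

The 48 children whose response did not change (31 + 17) do not enter the statistic, which is what distinguishes this related-samples test from the chi-square test of independence.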
67Extensions of McNemar
Bowker: when one of the two variables has more than two levels (correct, partly correct, incorrect).
Cochran Q: when the subjects are tested not just twice but several times at different points in time.
68Extensions of chi-square
- Configural frequency analysis (CFA)
- Log-linear analysis
69Interval data
72t-test
Design                                              Parametric           Non-parametric
between / independent / unrelated                   Independent t-test   Mann-Whitney U
within / dependent / related / repeated measures    Paired t-test        Wilcoxon
73t-test
24 people were involved in an experiment to
determine whether background noise (e.g. music)
affects short-term memory (recall of words). Half
of the sample was randomly allocated to the NOISE
condition, and half to the NO NOISE condition.
The participants in the NOISE condition tried to
memorize a list of 20 words in two minutes, while
listening to pre-recorded noise through
earphones. The other participants wore earphones
but heard no noise as they attempted to memorize
the words. Immediately after this, they were
tested to see how many words they recalled.
74NOISE (group 1) NO NOISE (group 2)
NOISE (group 1): 5.00 10.00 6.00 6.00 7.00 3.00 6.00 9.00 5.00 10.00 11.00 9.00
NO NOISE (group 2): 15.00 9.00 16.00 15.00 16.00 18.00 17.00 13.00 11.00 12.00 13.00 11.00
75Standard deviation
$$s = \sqrt{\frac{\sum (x_n - \bar{x})^2}{N - 1}}$$
79Squared deviations for X1 (NOISE group)
X1      X1 - mean    d1      d1² (squared residual)
5.00    5 - 7.25     -2.25   5.06
10.00   10 - 7.25    2.75    7.56
6.00    6 - 7.25     -1.25   1.56
6.00    6 - 7.25     -1.25   1.56
7.00    7 - 7.25     -0.25   0.06
3.00    3 - 7.25     -4.25   18.06
6.00    6 - 7.25     -1.25   1.56
9.00    9 - 7.25     1.75    3.06
5.00    5 - 7.25     -2.25   5.06
10.00   10 - 7.25    2.75    7.56
11.00   11 - 7.25    3.75    14.06
9.00    9 - 7.25     1.75    3.06
Σ X1 = 87; mean = 87 / 12 = 7.25
Σ d1² = 68.25
80Variance
68.25 / (12 - 1) = 6.20
81Standard deviation
√(68.25 / (12 - 1)) = 2.49
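The steps of the calculation for the NOISE group (mean, deviations, squared deviations, variance, standard deviation) can be checked with a short script. A minimal sketch with illustrative variable names. The last lines compare the hand-built result with Python's `statistics` module, which also divides by N - 1.

```python
from statistics import mean, stdev, variance

noise = [5, 10, 6, 6, 7, 3, 6, 9, 5, 10, 11, 9]  # words recalled, group 1

m = mean(noise)
deviations = [x - m for x in noise]          # x_n minus the mean
sum_of_squares = sum(d ** 2 for d in deviations)

var = sum_of_squares / (len(noise) - 1)      # sample variance (N - 1)
sd = var ** 0.5                              # sample standard deviation

print(f"mean = {m}, sum of squares = {sum_of_squares}")
print(f"variance = {var:.2f}, standard deviation = {sd:.2f}")

# Cross-check against the standard library
assert abs(var - variance(noise)) < 1e-9
assert abs(sd - stdev(noise)) < 1e-9
```

Working with the unrounded mean avoids the small discrepancies that arise when the mean is rounded to one decimal before squaring.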
82NOISE (group 1) NO NOISE (group 2)
NOISE (group 1): 5.00 10.00 6.00 6.00 7.00 3.00 6.00 9.00 5.00 10.00 11.00 9.00 (within-group variance)
NO NOISE (group 2): 15.00 9.00 16.00 15.00 16.00 18.00 17.00 13.00 11.00 12.00 13.00 11.00 (within-group variance)
Between-group variance: difference between M1 and M2
84t-test
- Interval data
- For small samples (N < 15) the data must be normally distributed.
- Homogeneity of variance (Levene's test)
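The slides do not report the result for the noise experiment. As a hedged sketch, an independent t-test with pooled variance (which presupposes the homogeneity of variance listed above) relates the difference between the group means to the within-group variance. The function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance

noise = [5, 10, 6, 6, 7, 3, 6, 9, 5, 10, 11, 9]
no_noise = [15, 9, 16, 15, 16, 18, 17, 13, 11, 12, 13, 11]

def independent_t(a, b):
    """Student's t for two independent samples, assuming equal variances."""
    n1, n2 = len(a), len(b)
    # Pooled within-group variance, weighted by degrees of freedom
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    standard_error = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / standard_error, n1 + n2 - 2

t, df = independent_t(noise, no_noise)
critical_05 = 2.074  # two-tailed t table value for 22 df, p = .05
print(f"t = {t:.2f}, df = {df}, significant: {abs(t) > critical_05}")
```

The numerator reflects the between-group difference (M1 minus M2) and the denominator the within-group variability, so a large absolute t means the groups differ by more than their internal spread would suggest.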
85One sample t-test
Previous research has shown that English-speaking
children have an MLU of 3.1 at age 32. A
researcher wants to know if SLI children (i.e.
children with a specific language impairment)
have a lower (or higher) MLU at this age. We know
that SLI children have difficulties in processing
morphological units, but it is unclear whether their
MLUs are lower than those of normally developing
children. In order to test this hypothesis, the
researcher collected data from 24 SLI children
aged 31 to 33 and determined the MLU for each
child.
86Child MLU
Child   MLU
1       2.7
2       3.0
3       2.8
4       2.9
5       3.1
6       3.0
7       3.1
8       2.5
9       3.2
10      3.1
11      2.9
12      2.9
13      2.8
14      3.1
15      3.2
16      2.4
17      2.3
18      2.8
19      3.1
20      2.5
21      2.7
22      2.9
23      2.9
24      3.0
87One sample t-test
88Confidence intervals
MLU of 3.1 normally developing children
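The slides do not show the result for the SLI data. As a hedged sketch, the one-sample t-test and the matching 95% confidence interval can be computed from the 24 MLU values and the reference value of 3.1. The variable names are illustrative, and the critical t value is taken from a standard table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

mlu = [2.7, 3.0, 2.8, 2.9, 3.1, 3.0, 3.1, 2.5, 3.2, 3.1, 2.9, 2.9,
       2.8, 3.1, 3.2, 2.4, 2.3, 2.8, 3.1, 2.5, 2.7, 2.9, 2.9, 3.0]
reference_mlu = 3.1  # MLU of normally developing children

n = len(mlu)
sample_mean = mean(mlu)
standard_error = stdev(mlu) / sqrt(n)

# One-sample t: distance of the sample mean from the reference value
t = (sample_mean - reference_mlu) / standard_error
df = n - 1

t_critical = 2.069  # two-tailed t table value for 23 df, p = .05
ci_low = sample_mean - t_critical * standard_error
ci_high = sample_mean + t_critical * standard_error

print(f"mean = {sample_mean:.2f}, t = {t:.2f}, df = {df}")
print(f"95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

If the reference value of 3.1 falls outside the confidence interval, the two-tailed test at the .05 level is significant, so the two slides express the same decision in different forms.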