Title: Statistics for variationists
1. Statistics for variationists
- or: what a linguist needs to know about statistics
Sean Wallis, Survey of English Usage, University College London
s.wallis@ucl.ac.uk
2. Outline
- What is the point of statistics?
- Variationist corpus linguistics
- How inferential statistics works
- Introducing z tests
  - Two types (single-sample and two-sample)
  - How these tests are related to χ²
- Effect size and comparing results of experiments
- Methodological implications for corpus linguistics
3–4. What is the point of statistics?
- Analyse data you already have (observational science)
  - corpus linguistics
- Design new experiments (experimental science)
  - collect new data, add annotation
  - experimental linguistics in the lab
- Try new methods (philosophy of science)
  - pose the right question
- We are going to focus on z and χ² tests (a little maths)
5. What is inferential statistics?
- Suppose we carry out an experiment
  - we toss a coin 10 times and get 5 heads
  - how confident are we in the results?
  - suppose we repeat the experiment: will we get the same result again?
- Inferential statistics is a method of inferring the behaviour of future 'ghost' experiments from one experiment (see the sketch below)
  - we infer from the sample to the population
- Let us consider one type of experiment
  - linguistic alternation experiments
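A minimal sketch (in Python, not part of the original slides) of these 'ghost experiments': repeat the 10-toss coin experiment many times and tabulate how often each number of heads turns up, to see how much a single result varies by chance. The helper name and the number of repeats are mine.

```python
# Hypothetical illustration: repeat the 10-toss coin experiment many times
# and count how often each number of heads occurs.
import random
from collections import Counter

def toss_experiment(n_tosses=10, p_heads=0.5):
    """Number of heads observed in one experiment of n_tosses fair tosses."""
    return sum(random.random() < p_heads for _ in range(n_tosses))

repeats = 10_000
counts = Counter(toss_experiment() for _ in range(repeats))
for heads in range(11):
    print(f"{heads:2d} heads: {counts[heads] / repeats:.3f}")
```

Only around a quarter of the repeats land exactly on 5 heads; the rest scatter around it, which is the Binomial behaviour discussed in the following slides.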
6–7. Alternation experiments
- A variationist corpus paradigm
  - imagine a speaker forming a sentence as a series of decisions/choices. They can
    - add: choose to extend a phrase or clause, or stop
    - select: choose between constructions
  - choices will be constrained
    - grammatically
    - semantically
- Research question
  - within these constraints, what factors influence the particular choice?
8. Alternation experiments
- Laboratory experiment (cued)
  - pose the choice to subjects
  - observe the one they make
  - manipulate different potential influences
- Observational experiment (uncued)
  - observe the choices speakers make when they make them (e.g. in a corpus)
  - extract data for different potential influences
    - sociolinguistic: subdivide data by genre, etc.
    - lexical/grammatical: subdivide data by elements in surrounding context
  - BUT the alternate choice is counterfactual
9. Statistical assumptions
- A random sample taken from the population
  - not always easy to achieve
    - multiple cases from the same text and speakers, etc.
    - may be limited historical data available
  - be careful with data concentrated in a few texts
- The sample is tiny compared to the population
  - this is easy to satisfy in linguistics!
- Observations are free to vary (alternate)
- Repeated sampling tends to form a Binomial distribution around the expected mean
  - this requires slightly more explanation...
10–16. The Binomial distribution
- Repeated sampling tends to form a Binomial distribution around the expected mean P
- Due to chance, some samples will have a higher or lower score
- [Figure sequence: Binomial frequency distributions F over x for N = 1, 4, 8, 12, 16, 20, 24]
17. Binomial ≈ Normal
- The Binomial (discrete) distribution is close to the Normal (continuous) distribution (a small numerical sketch follows below)
- [Figure: Binomial distribution F over x with its Normal approximation]
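A short sketch (not from the slides) tabulating the discrete Binomial distribution for some of the sample sizes N shown in the figures, alongside its Normal approximation with mean NP and standard deviation √(NP(1 − P)); P = 0.5 is assumed, as in the coin example, and the function names are mine.

```python
# Compare Binomial(N, P) probabilities with the Normal approximation.
import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def normal_pdf(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

P = 0.5
for N in (4, 8, 24):
    sd = math.sqrt(N * P * (1 - P))
    print(f"N = {N}")
    for x in range(N + 1):
        print(f"  x = {x:2d}  Binomial = {binomial_pmf(x, N, P):.4f}  "
              f"Normal approx = {normal_pdf(x, N * P, sd):.4f}")
```

The match improves as N grows, which is why the Normal function can stand in for the Binomial in the tests that follow.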
18–20. The central limit theorem
- Any Normal distribution can be defined by only two variables and the Normal function z
  - population mean P
  - standard deviation s = √(P(1 − P) / n)
- With more data in the experiment, s will be smaller
- Divide x by 10 for the probability scale
- 95% of the curve is within roughly 2 standard deviations of the expected mean
  - the correct figure is 1.95996!
  - this is z_α/2, the critical value of z for an error level α of 0.05
- [Figure: Normal curve over p, with 95% of the area between P − z·s and P + z·s and 2.5% in each tail]
21. The single-sample z test...
- Is an observation p > z standard deviations from the expected (population) mean P?
- If yes, p is significantly different from P (a minimal sketch follows below)
- [Figure: Normal curve about P, with the observation p falling outside the interval P ± z·s]
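A minimal sketch of the single-sample z test as just described, using the standard deviation s = √(P(1 − P)/n) and the critical value 1.95996 from the previous slides; the example figures and the function name are invented for illustration.

```python
# Single-sample z test: is the observed proportion p significantly
# different from the expected population proportion P?
import math

def single_sample_z_test(p, P, n, z_crit=1.95996):
    s = math.sqrt(P * (1 - P) / n)   # standard deviation of the expected distribution
    z = (p - P) / s                  # how many standard deviations p is from P
    return z, abs(z) > z_crit        # True => significant at the 0.05 error level

z, significant = single_sample_z_test(p=0.7, P=0.5, n=100)
print(f"z = {z:.2f}, significant = {significant}")
```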
22–25. ...gives us a confidence interval
- P ± z·s is the confidence interval for P
- We want to plot the interval about the observation p
- The interval about p is called the Wilson score interval (Wilson, 1927)
  - this interval is asymmetric
  - it reflects the Normal interval about P: if P is at the upper limit of p, p is at the lower limit of P (Wallis, to appear, a)
- To calculate the lower and upper bounds w– and w+ we use Wilson's score interval formula (reproduced in the sketch below)
- [Figure: Normal interval about P and the asymmetric Wilson interval (w–, w+) about the observation p]
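A sketch of the Wilson (1927) score interval referred to above, written out from the standard published formula; the observation is p out of n cases, with z = 1.95996 for a 0.05 error level, and the example figures are invented.

```python
# Wilson score interval (w-, w+) for an observed proportion p from n cases.
import math

def wilson_interval(p, n, z=1.95996):
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    width = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - width, centre + width

# Example: 2 cases of 'shall' out of 20 modal uses (invented figures).
print(wilson_interval(p=0.1, n=20))
```

Unlike the simple interval p ± z·√(p(1 − p)/n), this interval is asymmetric for skewed p and never strays outside the probability scale, which is what makes it usable for the near-zero and near-one observations circled on the next slide.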
26. Plotting confidence intervals
- Plotting modal shall/will over time (DCPSE)
  - small amounts of data per year
  - highly skewed p in some cases: p = 0 or 1 (circled)
- Confidence intervals identify the degree of certainty in our results (Wallis, to appear, a)
27. Plotting confidence intervals
- Probability of adding successive attributive adjective phrases (AJPs) to an NP in ICE-GB
  - x = number of AJPs
- NPs get longer → adding AJPs is more difficult
- The first two falls are significant, the last is not
28–29. 2 × 1 goodness of fit χ² test
- Same as the single-sample z test for P (z² = χ²)
- Or: the Wilson test for p (by inversion)
- Does the value of a affect p(b)?
  - IV: A = {a, ¬a}; DV: B = {b, ¬b}
- [Figure: the observed p(b | a) compared with the expected P = p(b), using the Normal interval about P and the Wilson interval about p(b | a)]
30. The single-sample z test
- Compares an observation with a given value
  - compare p(b | a) with p(b)
- A goodness of fit test
  - identical to a standard 2 × 1 χ² test (see the sketch below)
- Note that p(b) is given
  - all of the variation is assumed to be in the estimate of p(b | a)
- Could also compare p(b | ¬a) with p(b)
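A small sketch verifying the equivalence claimed above (z² = χ²) for the 2 × 1 goodness of fit test; the frequencies are invented for illustration.

```python
# Check that the single-sample z test and the 2x1 goodness of fit
# chi-square test give the same result (z squared equals chi-square).
import math

n, P = 100, 0.4          # sample size and the given expected proportion p(b)
f = 52                   # invented observed frequency of b (given a)
p = f / n

z = (p - P) / math.sqrt(P * (1 - P) / n)        # single-sample z test

observed = [f, n - f]
expected = [n * P, n * (1 - P)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(z**2, chi2)        # the two statistics are identical (6.0 and 6.0 here)
```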
31. z test for 2 independent proportions
- Method: combine observed values
  - take the difference (subtract): p1 − p2
  - calculate an averaged confidence interval
- [Figure: the two observed distributions O1 and O2 about p1 = p(b | a) and p2 = p(b | ¬a)] (Wallis, to appear, b)
32. z test for 2 independent proportions
- New confidence interval D = O1 − O2
  - standard deviation s' = √(p(1 − p)(1/n1 + 1/n2))
  - p = p(b), the pooled probability
  - compare z·s' with x = p1 − p2 (a minimal sketch follows below)
- [Figure: the difference distribution D, centred on mean x = 0, with the observed difference x compared against z·s'] (Wallis, to appear, b)
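A minimal sketch of the two independent sample z test just outlined: pool the two samples to estimate p = p(b), compute s', and compare the observed difference with z·s'. The frequencies and the function name are invented.

```python
# z test for two independent proportions p1 = f1/n1 and p2 = f2/n2.
import math

def two_sample_z_test(f1, n1, f2, n2, z_crit=1.95996):
    p1, p2 = f1 / n1, f2 / n2
    p = (f1 + f2) / (n1 + n2)                        # pooled probability p(b)
    s = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard deviation s'
    z = (p1 - p2) / s
    return z, abs(z) > z_crit                        # True => significant difference

print(two_sample_z_test(f1=30, n1=100, f2=45, n2=120))
```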
33–35. z test for 2 independent proportions
- Identical to a standard 2 × 2 χ² test
  - so you can use the usual method!
- BUT 2 × 1 and 2 × 2 tests have different purposes
  - 2 × 1 goodness of fit compares a single value a with its superset A
    - assumes only a varies
  - 2 × 2 test compares two values a, ¬a within a set A
    - both values may vary
- Q: Do we need χ²?
- [Figure: the set A = {a, ¬a} (IV), contrasting the goodness of fit χ² (a against A) with the 2 × 2 χ² (a against ¬a)]
36. Larger χ² tests
- χ² is popular because it can be applied to contingency tables with many values
  - r × 1 goodness of fit χ² tests (r ≥ 2)
  - r × c χ² tests for homogeneity (r, c ≥ 2)
- z tests have 1 degree of freedom
  - strength: significance is due to only one source
  - strength: easy to plot values and confidence intervals
  - weakness: multiple values may be unavoidable
- With larger χ² tests, evaluate and simplify (see the sketch below)
  - examine χ² contributions for each row or column
  - focus on alternation: try to test for a speaker choice
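A sketch of the 'evaluate and simplify' step above: compute the per-cell χ² contributions of an r × c contingency table so the main sources of variation can be identified. The table values are invented.

```python
# Per-cell chi-square contributions (O - E)^2 / E for an r x c table.
table = [
    [30, 20, 10],
    [25, 35, 40],
]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        contribution = (observed - expected) ** 2 / expected
        chi2 += contribution
        print(f"cell ({i}, {j}): O = {observed}, E = {expected:.1f}, "
              f"contribution = {contribution:.2f}")
print(f"total chi-square = {chi2:.2f}")
```

Cells with large contributions point to the rows or columns driving the result; rows or columns contributing little can often be merged or dropped to focus the test on a genuine speaker choice.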
37. How big is the effect?
- These tests do not measure the strength of the interaction between two variables
  - they test whether the strength of an interaction is greater than would be expected by chance
  - with lots of data, a tiny change would be significant
- Don't use χ², p or z values to compare two different experiments
  - a result significant at p < 0.01 is not 'better' than one significant at p < 0.05
- There are a number of ways of measuring association strength or effect size
38–39. How big is the effect?
- Percentage swing
  - swing d = p(a | ¬b) − p(a | b)
  - percentage swing d% = d / p(a | b)
  - frequently used ('X increased by 50%')
  - may have confidence intervals on the change
  - can be misleading (+50% then −50% is not zero)
  - one change, not a sequence
  - over one value, not multiple values
- Cramér's φ
  - φ = √(χ² / N) (2 × 2 tables), N = grand total
  - φc = √(χ² / ((k − 1)N)) (r × c tables), k = min(r, c)
  - measures the degree of association of one variable with another (across all values)
- (Both measures are illustrated in the sketch below)
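A sketch of the two effect size measures listed above: absolute and percentage swing for a single value, and Cramér's φ computed from χ². All figures and function names are invented for illustration.

```python
# Effect size measures: percentage swing and Cramer's phi.
import math

def swing(p_before, p_after):
    d = p_after - p_before
    return d, d / p_before                # absolute swing d and percentage swing d%

def cramers_phi(chi2, n, r=2, c=2):
    k = min(r, c)
    return math.sqrt(chi2 / ((k - 1) * n))   # reduces to sqrt(chi2 / N) when r = c = 2

print(swing(0.4, 0.3))                    # a fall of 0.1, i.e. -25%
print(cramers_phi(chi2=6.0, n=200))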
40–42. Comparing experimental results
- Suppose we have two similar experiments
  - how do we test if one result is significantly stronger than another?
- Test swings
  - z test for two samples from different populations
  - use s' = √(s1² + s2²)
  - test |d1(a) − d2(a)| > z·s' (a minimal sketch follows below)
- The same method can be used to compare other z or χ² tests
- [Figure: the two swings d1(a) and d2(a) plotted with confidence intervals on a scale from 0 down to −0.7] (Wallis 2011)
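A minimal sketch of the comparison described above: a z test on the difference between two swings d1(a) and d2(a) from separate experiments, with s' = √(s1² + s2²). The standard deviations of the two swings are assumed to have been obtained already (e.g. from their difference intervals), and all numbers are invented.

```python
# Compare two swings from different experiments: is one significantly
# stronger than the other?
import math

def compare_swings(d1, s1, d2, s2, z_crit=1.95996):
    s = math.sqrt(s1**2 + s2**2)             # combined standard deviation s'
    return abs(d1 - d2) > z_crit * s         # True => the swings differ significantly

print(compare_swings(d1=-0.25, s1=0.04, d2=-0.10, s2=0.05))
```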
43. Modern improvements on z and χ²
- Continuity correction for small n
  - Yates' χ² test
  - errs on the side of caution
  - can also be applied to the Wilson interval
- Newcombe (1998) improves on the 2 × 2 χ² test
  - combines two Wilson score intervals (sketched below)
  - performs better than χ² and log-likelihood (etc.) for low-frequency events or small samples
- However, for corpus linguists, there remains one outstanding problem...
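A sketch of Newcombe's (1998) approach as described above, combining two Wilson score intervals into an interval for the difference p1 − p2. This is my reading of his method, not code from the slides; the Wilson helper repeats the earlier sketch and the example figures are invented.

```python
# Newcombe-style difference interval built from two Wilson score intervals.
import math

def wilson_interval(p, n, z=1.95996):
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    width = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - width, centre + width

def newcombe_difference(p1, n1, p2, n2, z=1.95996):
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper   # if this interval excludes 0, the difference is significant

print(newcombe_difference(p1=0.30, n1=100, p2=0.45, n2=120))
```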
44–46. Experimental design
- Each observation should be free to vary
  - i.e. p can be any value from 0 to 1
- However, many people use these methods incorrectly
  - e.g. citation per million words
  - what does this actually mean?
- Baseline should be choice
  - experimentalists can design choice into the experiment
  - corpus linguists have to infer when speakers had the opportunity to choose, counterfactually
- [Figure: alternative baselines for the same items b1, b2: p(b | words), p(b | VPs), p(b | tensed VPs)]
- (A tiny numerical illustration of the baseline question follows below)
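A tiny numerical illustration (all figures invented) of the baseline point above: the same frequency of shall looks very different expressed per million words and as a proportion of the {shall, will} choice.

```python
# Contrast an exposure-based rate (per million words) with a choice-based
# probability for the same invented frequencies.
shall, will = 120, 880        # hypothetical frequencies in one subcorpus
words = 500_000               # hypothetical word count of that subcorpus

per_million_words = shall / words * 1_000_000
p_choice = shall / (shall + will)   # p(shall | {shall, will}), free to vary from 0 to 1

print(f"shall per million words: {per_million_words:.0f}")
print(f"p(shall | choice of shall or will): {p_choice:.2f}")
```

The per-million-words figure depends on how much people talk at all; only the choice-based probability is free to vary from 0 to 1 as the statistical model assumes.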
47–50. A methodological progression
- Aim
  - investigate change when speakers have a choice
- Four levels of experimental refinement
  1. per million words (pmw), i.e. baseline = all words
  2. select a plausible baseline, e.g. tensed VPs
  3. grammatically restrict data or enumerate cases, e.g. {will, shall}
  4. check each case individually for plausibility of alternation, e.g. 'Ye shall be saved'
51. Conclusions
- The basic idea of these methods is
  - predict future results if the experiment were repeated
  - 'significant' = effect > 0 (e.g. 19 times out of 20)
- Based on the Binomial distribution
  - approximated by the Normal distribution; many uses
    - plotting confidence intervals
    - use goodness of fit or single-sample z tests to compare an observation with an expected baseline
    - use 2 × 2 tests or two independent sample z tests to compare two observed samples
  - when using larger r × c tests, simplify as far as possible to identify the source of variation!
- Take care with small samples / low frequencies
  - use Wilson's and Newcombe's methods instead!
52. Conclusions
- Two methods for measuring the size of an experimental effect
  - absolute or percentage swing
  - Cramér's φ
- You can compare two experiments
- These methods all presume that
  - the observed p is free to vary (the speaker is free to choose)
- If this is not the case, then
  - the statistical model is undermined
  - confidence intervals are too conservative
  - but multiple changes are combined into one
    - e.g. VPs increase while modals decrease
  - so a significant change may not mean what you think!
53. References
- Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
- Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL.
- Wallis, S.A. to appear, a. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics.
- Wallis, S.A. to appear, b. z-squared: The origin and use of χ². Journal of Quantitative Linguistics.
- Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
- NOTE: My statistics papers, more explanation, spreadsheets etc. are published on the corp.ling.stats blog: http://corplingstats.wordpress.com