1
Statistics for variationists

- or - what a linguist needs to know about statistics

Sean Wallis
Survey of English Usage
University College London
s.wallis_at_ucl.ac.uk
2
Outline

What is the point of statistics?
Variationist corpus linguistics
How inferential statistics works
Introducing z tests
Two types (single-sample and two-sample)
How these tests are related to χ²
Effect size and comparing results of experiments
Methodological implications for corpus linguistics

3-4
What is the point of statistics?

Analyse data you already have (observational science)
corpus linguistics
Design new experiments (experimental science)
collect new data, add annotation
experimental linguistics in the lab
Try new methods (philosophy of science)
pose the right question
We are going to focus on z and χ² tests (a little maths)
5
What is inferential statistics?

Suppose we carry out an experiment
We toss a coin 10 times and get 5 heads
How confident are we in the results?
Suppose we repeat the experiment
Will we get the same result again?
Inferential statistics is a method of inferring
the behaviour of future ghost experiments from
one experiment
We infer from the sample to the population
Let us consider one type of experiment
Linguistic alternation experiments

6-7
Alternation experiments

A variationist corpus paradigm
Imagine a speaker forming a sentence as a series of decisions/choices. They can
add: choose to extend a phrase or clause, or stop
select: choose between constructions
Choices will be constrained
grammatically
semantically
Research question:
within these constraints, what factors influence the particular choice?

8
Alternation experiments

Laboratory experiment (cued)
pose the choice to subjects
observe the one they make
manipulate different potential influences
Observational experiment (uncued)
observe the choices speakers make when they make them (e.g. in a corpus)
extract data for different potential influences
sociolinguistic: subdivide data by genre, etc.
lexical/grammatical: subdivide data by elements in surrounding context
BUT the alternate choice is counterfactual

9
Statistical assumptions

A random sample taken from the population
Not always easy to achieve
multiple cases from the same text and speakers,
etc
may be limited historical data available
Be careful with data concentrated in a few texts
The sample is tiny compared to the population
This is easy to satisfy in linguistics!
Observations are free to vary (alternate)
Repeated sampling tends to form a Binomial
distribution around the expected mean
This requires slightly more explanation...

10-16
The Binomial distribution

Repeated sampling tends to form a Binomial distribution around the expected mean P

We toss a coin 10 times, and get 5 heads

Due to chance, some samples will have a higher or lower score

[Figure: distribution of observed scores x around the expected mean P, built up over N = 1, 4, 8, 12, 16, 20 and 24]
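
The coin example can be sketched numerically. The snippet below is only an illustration, assuming a fair coin (P = 0.5) and n = 10 tosses per experiment; the function name binomial_pmf is hypothetical and not part of the talk. It prints the exact Binomial probability of each possible number of heads, which is the shape the repeated samples tend towards.

```python
from math import comb


def binomial_pmf(x: int, n: int, P: float) -> float:
    """Exact probability of x successes in n independent trials with chance P."""
    return comb(n, x) * P**x * (1 - P) ** (n - x)


# Assumed values for illustration: ten tosses of a fair coin.
n, P = 10, 0.5
for x in range(n + 1):
    prob = binomial_pmf(x, n, P)
    # A crude text histogram: one '#' per percentage point of probability.
    print(f"{x:2d} heads  {prob:.4f}  {'#' * round(prob * 100)}")
```

Five heads is the most likely single outcome, but most repeats of the experiment would still give some other score.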
17
Binomial → Normal

The Binomial (discrete) distribution is close to the Normal (continuous) distribution

[Figure: axes F and x]
18-20
The central limit theorem

Any Normal distribution can be defined by only two variables and the Normal function z

→ population mean P
→ standard deviation $s = \sqrt{P(1 - P) / n}$

With more data in the experiment, s will be smaller

Divide x by 10 for probability scale

95% of the curve is within ~2 standard deviations of the expected mean

the correct figure is 1.95996!
= $z_{\alpha/2}$, the critical value of z for an error level α of 0.05.

[Figure: Normal curve of F against p (ticks 0.1, 0.3, 0.5, 0.7), with z·s marked either side of the mean; the central 95% lies between two 2.5% tails]
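
As a rough check of the figures on this slide, the sketch below computes the standard deviation s and the two-tailed critical value of z using only the Python standard library. The function names are hypothetical, and P = 0.5, n = 10 are taken from the coin example purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def binomial_sd(P: float, n: int) -> float:
    """Standard deviation of an observed proportion: sqrt(P(1 - P) / n)."""
    return sqrt(P * (1 - P) / n)


def critical_z(alpha: float = 0.05) -> float:
    """Two-tailed critical value z_(alpha/2) of the standard Normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


P, n = 0.5, 10  # assumed: fair coin, ten tosses
s = binomial_sd(P, n)
z = critical_z(0.05)
print(f"s = {s:.4f}, z = {z:.5f}")
print(f"Normal interval about P: ({P - z * s:.4f}, {P + z * s:.4f})")

# With more data, s shrinks in proportion to 1 / sqrt(n).
print(f"s for n = 1000: {binomial_sd(P, 1000):.4f}")
```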
21
The single-sample z test...

Is an observation p > z standard deviations from the expected (population) mean P?

If yes, p is significantly different from P

[Figure: Normal curve of F against p (ticks 0.1, 0.3, 0.5, 0.7) centred on P, with z·s marked either side and the observation p]
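
A minimal sketch of the single-sample z test follows. The function names are hypothetical, and the figures (35 heads in 100 tosses against an expected P = 0.5) are invented for illustration, not taken from the talk.

```python
from math import sqrt
from statistics import NormalDist


def single_sample_z(p: float, P: float, n: int) -> float:
    """Distance of observation p from the expected P, in standard deviations."""
    s = sqrt(P * (1 - P) / n)
    return (p - P) / s


def is_significant(p: float, P: float, n: int, alpha: float = 0.05) -> bool:
    """True if p lies more than z_(alpha/2) standard deviations from P."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(single_sample_z(p, P, n)) > z_crit


# Hypothetical data: 35 heads observed in 100 tosses, expected P = 0.5.
p, P, n = 35 / 100, 0.5, 100
print(f"z = {single_sample_z(p, P, n):.3f}")
print("significant at 0.05:", is_significant(p, P, n))
```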
22-25
...gives us a confidence interval

P ± z·s is the confidence interval for P
We want to plot the interval about p

The interval about p is called the Wilson score interval

This interval is asymmetric
It reflects the Normal interval about P
If P is at the upper limit of p, p is at the lower limit of P
(Wallis, to appear, a)

To calculate w⁻ and w⁺ we use this formula (Wilson, 1927)
$w^-, w^+ = \left(p + \frac{z^2}{2n} \pm z\sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}}\right) \Big/ \left(1 + \frac{z^2}{n}\right)$, where z is the critical value $z_{\alpha/2}$

[Figure: curve of F against p (ticks 0.1, 0.3, 0.5, 0.7), showing P, the observation p and the interval bounds w⁻ and w⁺ about p]
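
The Wilson score interval can be sketched in a few lines. This uses the standard textbook form of Wilson's (1927) formula; the function name wilson_interval and the example values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval (w-, w+) about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


# Hypothetical observations, each from n = 10 cases.
for p in (0.0, 0.2, 0.5):
    lower, upper = wilson_interval(p, 10)
    print(f"p = {p:.1f}: ({lower:.4f}, {upper:.4f})")
```

Unlike the Normal interval about p, the bounds never fall outside the range 0 to 1, and the interval is asymmetric whenever p is not 0.5.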
26
Plotting confidence intervals

Plotting modal shall/will over time (DCPSE)

Small amounts of data / year
Highly skewed p in some cases
p = 0 or 1 (circled)
Confidence intervals identify the degree of certainty in our results

(Wallis, to appear, a)
27
Plotting confidence intervals

Probability of adding successive attributive adjective phrases (AJPs) to an NP in ICE-GB
x = number of AJPs

NPs get longer → adding AJPs is more difficult
The first two falls are significant, the last is not

28-29
2 × 1 goodness of fit χ² test

Same as single-sample z test for P (z² = χ²)
Does the value of a affect p(b)?
Or Wilson test for p (by inversion)

IV: A = {a, ¬a}; DV: B = {b, ¬b}

[Figure: curve of F against p centred on P = p(b), with z·s either side (or the Wilson bounds w⁻, w⁺ about the observation), and the observations p(b | a) and p(b | ¬a)]
30
The single-sample z test

Compares an observation with a given value
Compare p(b | a) with p(b)
A goodness of fit test
Identical to a standard 2 × 1 χ² test
Note that p(b) is given
All of the variation is assumed to be in the estimate of p(b | a)
Could also compare p(b | ¬a) with p(b)

[Figure: p(b) with p(b | a) and p(b | ¬a)]
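
The claim that the 2 × 1 goodness of fit χ² equals z² can be checked on a small example. The sketch below uses invented counts (35 cases of b out of 100 under condition a, against a given p(b) = 0.5); the function names are hypothetical.

```python
from math import sqrt


def gof_chi_square(observed: list[float], expected: list[float]) -> float:
    """Goodness of fit chi-square: sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def single_sample_z(p: float, P: float, n: int) -> float:
    """Single-sample z score of observation p against the given value P."""
    return (p - P) / sqrt(P * (1 - P) / n)


# Hypothetical data: n = 100 cases under a, 35 of them b; given P = p(b) = 0.5.
n, f_b, P = 100, 35, 0.5
observed = [f_b, n - f_b]        # counts of b and not-b
expected = [n * P, n * (1 - P)]  # counts expected if p(b | a) = p(b)

chi2 = gof_chi_square(observed, expected)
z = single_sample_z(f_b / n, P, n)
print(f"chi-square = {chi2:.4f}, z squared = {z * z:.4f}")
```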
31
z test for 2 independent proportions

Method: combine observed values
take the difference (subtract) |p1 − p2|
calculate an 'averaged' confidence interval

p1 = p(b | a), p2 = p(b | ¬a)
(Wallis, to appear, b)

[Figure: observed distributions O1 and O2 about p1 and p2, F against p]
32
z test for 2 independent proportions

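New confidence interval D = |O1 − O2|
standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}$
$\hat{p} = p(b)$
compare z·s' with x = |p1 − p2|
(Wallis, to appear, b)

[Figure: difference distribution D centred on mean x = 0, with z·s' marked and the observed difference x]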
33-35
z test for 2 independent proportions

Identical to a standard 2 × 2 χ² test
So you can use the usual method!
BUT: 2 × 1 and 2 × 2 tests have different purposes
2 × 1 goodness of fit compares single value a with superset A
assumes only a varies
2 × 2 test compares two values a, ¬a within a set A
both values may vary
Q: Do we need χ²?

IV: A = {a, ¬a}

[Figure: the g.o.f. χ² compares a with A; the 2 × 2 χ² compares a with ¬a]
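
The equivalence between the two-sample z test and the 2 × 2 χ² test can be verified numerically. The sketch below assumes the pooled form of the standard deviation and uses invented frequencies; the function names are hypothetical.

```python
from math import sqrt


def two_proportion_z(f1: int, n1: int, f2: int, n2: int) -> float:
    """z score for the difference between two independent proportions."""
    p1, p2 = f1 / n1, f2 / n2
    p_hat = (f1 + f2) / (n1 + n2)  # pooled estimate, i.e. p(b) overall
    s = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1 - p2) / s


def chi_square_2x2(f1: int, n1: int, f2: int, n2: int) -> float:
    """Standard 2 x 2 chi-square for homogeneity on the same data."""
    table = [[f1, n1 - f1], [f2, n2 - f2]]
    row_totals = [n1, n2]
    col_totals = [f1 + f2, (n1 - f1) + (n2 - f2)]
    total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical data: 30 of 100 cases are b under a, 45 of 120 under not-a.
z = two_proportion_z(30, 100, 45, 120)
chi2 = chi_square_2x2(30, 100, 45, 120)
print(f"z squared = {z * z:.6f}, chi-square = {chi2:.6f}")
```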
36
Larger χ² tests

χ² is popular because it can be applied to contingency tables with many values
r × 1 goodness of fit χ² tests (r ≥ 2)
r × c χ² tests for homogeneity (r, c ≥ 2)
z tests have 1 degree of freedom
strength: significance is due to only one source
strength: easy to plot values and confidence intervals
weakness: multiple values may be unavoidable
With larger χ² tests, evaluate and simplify
Examine χ² contributions for each row or column
Focus on alternation - try to test for a speaker choice

37
How big is the effect?

These tests do not measure the strength of the interaction between two variables
They test whether the strength of an interaction is greater than would be expected by chance
With lots of data, a tiny change would be significant
Don't use χ², p or z values to compare two different experiments
A result significant at p < 0.01 is not 'better' than one significant at p < 0.05
There are a number of ways of measuring 'association strength' or 'effect size'

38-39
How big is the effect?

Percentage swing
swing d = p(a | ¬b) − p(a | b)
% swing d% = d / p(a | b)
frequently used (X increased by 50%)
may have confidence intervals on change
can be misleading (+50% then −50% is not zero)
one change, not sequence
over one value, not multiple values
Cramér's φ
$\phi = \sqrt{\chi^2 / N}$ (2 × 2), N = grand total
$\phi_c = \sqrt{\chi^2 / ((k - 1)N)}$ (r × c), k = min(r, c)
measures degree of association of one variable with another (across all values)
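
Both effect size measures are simple to compute. In the sketch below the function and parameter names are hypothetical (p_b stands for the starting proportion p(a | b), p_not_b for p(a | ¬b)), and the numbers are invented to show why successive percentage swings can mislead.

```python
from math import sqrt


def swing(p_b: float, p_not_b: float) -> float:
    """Absolute swing d: change in the proportion from one condition to the other."""
    return p_not_b - p_b


def percentage_swing(p_b: float, p_not_b: float) -> float:
    """Swing expressed as a fraction of the starting proportion."""
    return swing(p_b, p_not_b) / p_b


def cramers_phi(chi2: float, total: int, rows: int = 2, cols: int = 2) -> float:
    """Cramer's phi: sqrt(chi-square / ((k - 1) N)) with k = min(rows, cols)."""
    k = min(rows, cols)
    return sqrt(chi2 / ((k - 1) * total))


# Hypothetical proportions: +50% followed by -50% does not return to the start.
start, middle, end = 0.2, 0.3, 0.15
print(f"first change:  {percentage_swing(start, middle):+.0%}")
print(f"second change: {percentage_swing(middle, end):+.0%}")
print(f"start = {start}, end = {end}")

# Hypothetical 2 x 2 result: chi-square of 9.0 from N = 100 cases.
print(f"phi = {cramers_phi(9.0, 100):.2f}")
```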
40-42
Comparing experimental results

Suppose we have two similar experiments
How do we test if one result is significantly stronger than another?
Test swings
z test for two samples from different populations
Use $s' = \sqrt{s_1^2 + s_2^2}$
Test |d1(a) − d2(a)| > z·s'
Same method can be used to compare other z or χ² tests
(Wallis 2011)

[Figure: swings d1(a) and d2(a) plotted on a scale from 0 to −0.7]
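
The comparison of two swings can be sketched as follows. The function name is hypothetical, and the swings and standard deviations are invented for illustration; in practice s1 and s2 would come from the two experiments' own data.

```python
from math import sqrt
from statistics import NormalDist


def swings_differ(d1: float, s1: float, d2: float, s2: float,
                  alpha: float = 0.05) -> bool:
    """Test |d1 - d2| > z * s', where s' = sqrt(s1^2 + s2^2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    s_combined = sqrt(s1**2 + s2**2)
    return abs(d1 - d2) > z * s_combined


# Hypothetical swings from two similar experiments, with their standard deviations.
print(swings_differ(-0.6, 0.05, -0.3, 0.08))  # large gap between the swings
print(swings_differ(-0.6, 0.05, -0.5, 0.08))  # small gap between the swings
```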
43
Modern improvements on z and χ²

Continuity correction for small n
Yates' χ² test
errs on side of caution
can also be applied to Wilson interval
Newcombe (1998) improves on 2 × 2 χ² test
combines two Wilson score intervals
performs better than χ² and log-likelihood (etc.) for low-frequency events or small samples
However, for corpus linguists, there remains one outstanding problem...
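
One way to combine two Wilson score intervals is sketched below. It follows the square-and-add form usually attributed to Newcombe (1998), without continuity correction; the function names and example proportions are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval (w-, w+) about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom


def newcombe_wilson(p1: float, n1: int, p2: float, n2: int,
                    alpha: float = 0.05) -> tuple[float, float]:
    """Interval for the difference p1 - p2 built from two Wilson intervals."""
    l1, u1 = wilson_interval(p1, n1, alpha)
    l2, u2 = wilson_interval(p2, n2, alpha)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


# Hypothetical proportions from two independent samples of 100 cases each.
lower, upper = newcombe_wilson(0.6, 100, 0.3, 100)
print(f"difference = 0.30, interval = ({lower:.4f}, {upper:.4f})")
print("significant:", not (lower <= 0.0 <= upper))
```

The difference is treated as significant when the interval excludes zero.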

44-46
Experimental design

Each observation should be free to vary
i.e. p can be any value from 0 to 1
However many people use these methods incorrectly
e.g. citation 'per million words'
what does this actually mean?
Baseline should be choice
Experimentalists can design choice into experiment
Corpus linguists have to infer when speakers had opportunity to choose, counterfactually

[Figure: p(b | words), p(b | VPs) and p(b | tensed VPs) for b1 and b2]
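
The difference between a per-million-word rate and a choice baseline can be shown with invented counts. In the sketch below every number is hypothetical: two periods of one million words each, in which the number of opportunities to choose between shall and will halves but the speakers' preference does not change.

```python
def per_million_words(count: int, words: int) -> float:
    """Frequency of an item per million words of text."""
    return count * 1_000_000 / words


def choice_proportion(count: int, alternatives: int) -> float:
    """Proportion of an item out of all cases where the choice was available."""
    return count / alternatives


# Hypothetical counts: (shall, will, words in the subcorpus) for two periods.
periods = {"period 1": (100, 400, 1_000_000), "period 2": (50, 200, 1_000_000)}

for name, (shall, will, words) in periods.items():
    pmw = per_million_words(shall, words)
    p = choice_proportion(shall, shall + will)
    print(f"{name}: shall = {pmw:.0f} pmw, p(shall | shall or will) = {p:.2f}")
```

The per-million-word rate falls by half, although the proportion of shall out of the cases where a choice was available stays the same: the fall reflects fewer opportunities, not a change in choice.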
47-50
A methodological progression

Aim
investigate change when speakers have a choice
Four levels of experimental refinement

(1) pmw (baseline: words)
(2) select a plausible baseline (e.g. tensed VPs)
(3) grammatically restrict data or enumerate cases (e.g. {will, shall})
(4) check each case individually for plausibility of alternation (e.g. 'Ye shall be saved')
51
Conclusions

The basic idea of these methods is
Predict future results if experiment were repeated
'Significant' = effect > 0 (e.g. 19 times out of 20)
Based on the Binomial distribution
Approximated by Normal distribution: many uses
Plotting confidence intervals
Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
Use 2 × 2 tests or two independent sample z tests to compare two observed samples
When using larger r × c tests, simplify as far as possible to identify the source of variation!
Take care with small samples / low frequencies
Use Wilson and Newcombe's methods instead!

52
Conclusions

Two methods for measuring the size of an experimental effect
absolute or percentage swing
Cramér's φ
You can compare two experiments
These methods all presume that
observed p is free to vary (speaker is free to choose)
If this is not the case then
statistical model is undermined
confidence intervals are too conservative
but multiple changes are combined into one
e.g. VPs increase while modals decrease
so significant change may not mean what you think!

53
References

Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890
Wallis, S.A. 2011. Comparing χ² tests for separability. London: Survey of English Usage, UCL
Wallis, S.A. to appear, a. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics
Wallis, S.A. to appear, b. z-squared: The origin and use of χ². Journal of Quantitative Linguistics
Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212
NOTE: My statistics papers, more explanation, spreadsheets etc. are published on the corp.ling.stats blog: http://corplingstats.wordpress.com