Title: Simple Statistics for Corpus Linguistics
1 Simple Statistics for Corpus Linguistics
Sean Wallis, Survey of English Usage, University College London
s.wallis_at_ucl.ac.uk
2 Outline
- Numbers
- A simple research question
- do women speak or write more than men in ICE-GB?
- p = proportion = probability
- Another research question
- what happens to speakers' use of modal shall vs. will over time?
- the idea of inferential statistics
- plotting confidence intervals
- Concluding remarks
4 Numbers...
- We are used to concepts like these being expressed as numbers
- length (distance, height)
- area
- volume
- temperature
- wealth (income, assets)
- We are going to discuss another concept
- probability
- proportion, percentage
- a simple idea, at the heart of statistics
11 Probability
- Based on another, even simpler, idea
- probability p = x / n
- where
- frequency x (often, f) is
- the number of times something actually happens
- the number of hits in a search
- baseline n is
- the number of times something could happen
- the number of hits
- in a more general search
- in several alternative patterns (alternate forms)
- Probability can range from 0 to 1
- e.g. the probability that the speaker says will instead of shall
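The definition p = x / n can be tried out directly. The following is a minimal sketch in Python; the function name `proportion` and the counts are illustrative assumptions, not figures from any corpus.

```python
def proportion(x, n):
    """Return p = x / n: frequency x out of baseline n."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n

# HYPOTHETICAL counts, for illustration only.
hits_will = 80     # x: number of hits for "will"
hits_shall = 20    # hits for the alternate form "shall"
baseline = hits_will + hits_shall   # n: every case where the choice arises

p_will = proportion(hits_will, baseline)
print(f"p(will | {{will, shall}}) = {p_will:.2f}")
```

Because x can never exceed n, the result always falls in the range 0 to 1.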
16 What can a corpus tell us?
- A corpus is a source of knowledge about language
- corpus
- introspection/observation/elicitation
- controlled laboratory experiment
- computer simulation
How do these differ in what they might tell us?
- A corpus is a sample of language, varying by
- source (e.g. speech vs. writing, age...)
- levels of annotation (e.g. parsing)
- size (number of words)
- sampling method (random sample?)
How does this affect the types of knowledge we might obtain?
21 What can a parsed corpus tell us?
- Three kinds of evidence may be found in a parsed corpus
- Frequency evidence of a particular known rule, structure or linguistic event (How often?)
- Coverage evidence of new rules, etc. (How novel?)
- Interaction evidence of relationships between rules, structures and events (Does X affect Y?)
- Lexical searches may also be made more precise using the grammatical analysis
22 A simple research question
- Let us consider the following question
- Do women speak or write more words than men in the ICE-GB corpus?
- What do you think?
- How might we find out?
26 Let's get some data
- Open ICE-GB with ICECUP
- Text Fragment query for words
- <PUNC,PAUSE>
- counts every word, excluding pauses and punctuation
- Variable query
- TEXT CATEGORY = spoken, written
- Variable query
- SPEAKER GENDER = f, m, <unknown>
Combine these 3 queries
28 ICE-GB gender / written-spoken
- Proportion of words in each category spoken/written by women and men
- The authors of some texts are unspecified
- Some written material may be jointly authored
- female/male ratio varies slightly
- p(female) = words spoken by women / total words (excluding <unknown>)
[Figure: bar chart of p (0 to 1) for the written, spoken and TOTAL categories, each divided into female and male]
30 p = Probability = Proportion
- We asked ourselves the following question
- Do women speak or write more words than men in the ICE-GB corpus?
- To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known)
- The proportion of words produced by women can also be thought of as a probability
- What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known), it would be uttered by a woman?
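The calculation behind this proportion can be sketched as follows. The word counts are HYPOTHETICAL placeholders (the real totals come from combining the three ICECUP queries), and the dictionary layout and the name `p_female` are assumptions made for the example.

```python
# HYPOTHETICAL word counts per text category; not the ICE-GB figures.
counts = {
    "spoken":  {"f": 300, "m": 500, "unknown": 10},
    "written": {"f": 100, "m": 300, "unknown": 40},
}

def p_female(cell):
    """p(female) = words by women / words where the gender is known."""
    known = cell["f"] + cell["m"]   # the baseline excludes <unknown>
    return cell["f"] / known

# Add up the categories to obtain the TOTAL row.
total = {key: sum(cell[key] for cell in counts.values())
         for key in ("f", "m", "unknown")}

for name, cell in {**counts, "TOTAL": total}.items():
    print(f"{name:8s} p(female) = {p_female(cell):.3f}")
```

Read as a probability, each value is the chance that a word picked at random from that category (with known gender) was produced by a woman.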
31 Another research question
- Let us consider the following question
- What happens to modal shall vs. will over time in British English?
- Does shall increase or decrease?
- What do you think?
- How might we find out?
33 Let's get some data
- Open DCPSE with ICECUP
- FTF query for first person declarative shall
- repeat for will
- Corpus Map
- DATE
Do the first set of queries and then drop into Corpus Map
36 Modal shall vs. will over time
- Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) on the vertical axis, from 0.0 (shall = 0%) to 1.0 (shall = 100%), plotted against year from 1955 to 1995 (Aarts et al., 2013)]
Is shall going up or down?
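Each point in such a plot is a proportion calculated for one time period. A minimal sketch, assuming the shall and will frequencies per year have already been read off from Corpus Map; the figures below are HYPOTHETICAL and the names `data` and `p_shall` are illustrative.

```python
# HYPOTHETICAL (year, shall, will) frequencies; not the DCPSE figures.
data = [(1960, 12, 8), (1975, 9, 11), (1991, 6, 14)]

def p_shall(shall, will):
    """p(shall | {shall, will}): shall as a proportion of both forms."""
    n = shall + will                # baseline: every shall/will choice
    if n == 0:
        raise ValueError("no shall/will cases in this period")
    return shall / n

series = [(year, p_shall(s, w)) for year, s, w in data]
for year, p in series:
    print(f"{year}: p = {p:.2f}")
```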
39 Is shall going up or down?
- Whenever we look at change, we must ask ourselves two things
- What is the change relative to?
- Is our observation higher or lower than we might expect?
- In this case we ask
- Does shall decrease relative to shall + will?
- How confident are we in our results?
- Is the change big enough to be reproducible?
43 The sample and the population
- We said that the corpus was a sample
- Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)
- We asked questions about the sample
- The answers were statements of fact
- Now we are asking about British English
- We want to draw an inference
- from the sample (in this case, DCPSE)
- to the population (similarly-sampled BrE utterances)
- This inference is a best guess
- This process is called inferential statistics
45 Basic inferential statistics
- Suppose we carry out an experiment
- We toss a coin 10 times and get 5 heads
- How confident are we in the results?
- Suppose we repeat the experiment
- Will we get the same result again?
- Let's try
- You should have one coin
- Toss it 10 times
- Write down how many heads you get
- Do you all get the same results?
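The classroom exercise can also be simulated. This is a rough sketch, assuming a fair coin and using Python's standard `random` module; the function name `count_heads`, the number of repetitions and the seed are illustrative choices.

```python
import random

def count_heads(tosses=10, rng=random):
    """Toss a fair coin `tosses` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(tosses))

# Assumption: 26 repetitions, as if 26 people each tossed a coin 10 times.
# The seed is fixed only so that the run can be repeated.
rng = random.Random(2013)
results = [count_heads(10, rng) for _ in range(26)]

print(results)
print("distinct results:", sorted(set(results)))
```

Running it with different seeds shows the point of the exercise: the number of heads varies from one repetition to the next.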
46 The Binomial distribution
- Repeated sampling tends to form a Binomial distribution around the expected mean X
- We toss a coin 10 times, and get 5 heads
[Figure: frequency F against x, centred on the expected mean X; N = 1]
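52 The Binomial distribution
- Repeated sampling tends to form a Binomial distribution around the expected mean X
- Due to chance, some samples will have a higher or lower score
[Figure: frequency F against x, centred on the expected mean X; N = 26, built up over successive slides from N = 4, 8, 12, 16 and 20]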
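The shape that repeated sampling tends towards can be calculated exactly. A minimal sketch for the coin example, assuming n = 10 tosses and a fair coin (P = 0.5); `binomial_pmf` is an illustrative name and the text histogram is only a rough visual aid.

```python
from math import comb

def binomial_pmf(x, n, P):
    """Probability of exactly x heads in n tosses when p(head) = P."""
    return comb(n, x) * P**x * (1 - P)**(n - x)

n, P = 10, 0.5
for x in range(n + 1):
    prob = binomial_pmf(x, n, P)
    bar = "#" * round(prob * 100)   # one '#' per percentage point
    print(f"x={x:2d}  {prob:.4f}  {bar}")
```

The most likely outcome is x = 5, but it accounts for only about a quarter of all samples; scores above and below it are common.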
53 The Binomial distribution
- It is helpful to express x as the probability of choosing a head, p, with expected mean P
- p = x / n
- n = max. number of possible heads (10)
- Probabilities are in the range 0 to 1
- percentages (0 to 100%)
[Figure: frequency F against p, centred on the expected mean P]
55 The Binomial distribution
- Take-home point
- A single observation, say x hits (or p as a proportion of n possible hits) in the corpus, is not guaranteed to be correct in the world!
- Estimating the confidence you have in your results is essential
- We want to make predictions about future runs of the same experiment
[Figure: frequency F against p, showing an observation p and the expected mean P]
56 Binomial → Normal
- The Binomial (discrete) distribution is close to the Normal (continuous) distribution
[Figure: frequency F against x]
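How close the two distributions are can be checked numerically. The sketch below assumes the coin example (n = 10, P = 0.5) and works on the count scale, where the Normal curve has mean nP and standard deviation sqrt(nP(1 - P)); it uses `NormalDist` from Python's standard `statistics` module.

```python
from math import comb, sqrt
from statistics import NormalDist

n, P = 10, 0.5                     # the coin-tossing example
mean = n * P                       # expected number of heads
sd = sqrt(n * P * (1 - P))         # standard deviation on the count scale
normal = NormalDist(mean, sd)

def binomial_pmf(x):
    """Probability of exactly x heads in n tosses."""
    return comb(n, x) * P**x * (1 - P)**(n - x)

rows = [(x, binomial_pmf(x), normal.pdf(x)) for x in range(n + 1)]
for x, b, g in rows:
    print(f"x={x:2d}  Binomial={b:.4f}  Normal={g:.4f}")

max_gap = max(abs(b - g) for _, b, g in rows)
print(f"largest difference: {max_gap:.4f}")
```

Even with only 10 tosses the two columns differ by less than 0.01 at every value of x.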
57 The central limit theorem
- Any Normal distribution can be defined by only two variables and the Normal function z
- population mean P
- standard deviation $S = \sqrt{P(1 - P) / n}$
- With more data in the experiment, S will be smaller
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7), with z·S marked on either side of the mean]
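The effect of sample size on S can be seen by evaluating the formula for a few values of n. A minimal sketch, assuming P = 0.5; the function name `std_dev` is illustrative.

```python
from math import sqrt

def std_dev(P, n):
    """S = sqrt(P(1 - P) / n): standard deviation of p about the mean P."""
    return sqrt(P * (1 - P) / n)

for n in (10, 100, 1000):
    print(f"n={n:5d}  S={std_dev(0.5, n):.4f}")
```

Because n sits under the square root, a hundred times more data makes S ten times smaller.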
58 The central limit theorem
- Any Normal distribution can be defined by only two variables and the Normal function z
- population mean P
- standard deviation $S = \sqrt{P(1 - P) / n}$
- 95% of the curve is within ~2 standard deviations of the expected mean
- the correct figure is 1.95996!
- the critical value of z for an error level of 0.05.
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7), with 95% of the area within z·S of the mean and 2.5% in each tail]
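The critical value need not be looked up in a table. The sketch below obtains it from the inverse cumulative Normal function in Python's standard `statistics` module and uses it to compute the Normal interval about P; the coin example (P = 0.5, n = 10) is assumed, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def critical_z(alpha=0.05):
    """Two-tailed critical value of z for error level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def normal_interval(P, n, alpha=0.05):
    """Return (P - z.S, P + z.S), the Normal interval about P."""
    S = sqrt(P * (1 - P) / n)
    z = critical_z(alpha)
    return P - z * S, P + z * S

print(f"z = {critical_z():.5f}")
lo, hi = normal_interval(0.5, 10)
print(f"95% interval about P = 0.5, n = 10: ({lo:.4f}, {hi:.4f})")
```

With alpha = 0.05, 2.5% of the curve lies beyond each end of the interval.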
59 The single-sample z test...
- Is an observation p > z standard deviations from the expected (population) mean P?
- If yes, p is significantly different from P
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7), centred on P, with z·S marked on either side, an observation p, and a label 0.25 on each side]
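The test can be written as a short function. This is a sketch under stated assumptions: the observation p = 0.25, the expected mean P = 0.5 and the sample sizes are chosen for illustration, and `single_sample_z` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

def single_sample_z(p, P, n, alpha=0.05):
    """Return (z score of p about P, True if p differs significantly from P)."""
    S = sqrt(P * (1 - P) / n)                 # standard deviation about P
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_obs = (p - P) / S
    return z_obs, abs(z_obs) > z_crit

# The same observed proportion, tested with two different sample sizes.
for n in (10, 40):
    z_obs, significant = single_sample_z(0.25, 0.5, n)
    print(f"n={n:2d}  z={z_obs:+.3f}  significant: {significant}")
```

The same observed proportion is not significant with 10 cases but is with 40, because S shrinks as n grows.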
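60 ...gives us a confidence interval
- $P \pm z \cdot S$ is the confidence interval for P
- We want to plot the interval about p
[Figure: Normal curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7), centred on P, with z·S marked on either side and a label 0.25 on each side]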
62 ...gives us a confidence interval
- The interval about p is called the Wilson score interval
- This interval reflects the Normal interval about P
- If P is at the upper limit of p, p is at the lower limit of P
(Wallis, 2013)
[Figure: curve of F against p (axis ticks 0.1, 0.3, 0.5, 0.7) showing the observation p, its two Wilson interval bounds (each labelled w) and P]
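The Wilson score interval has a closed form. The sketch below uses the standard formula (Wilson, 1927) without continuity correction; the observation (p = 0.25, n = 10) is HYPOTHETICAL and the names `wilson_interval`, `w_minus` and `w_plus` are illustrative. The last lines check the relationship stated above: placing P at the upper Wilson bound puts the lower limit of the Normal interval about P at p.

```python
from math import sqrt

Z_95 = 1.959964  # critical value of z for an error level of 0.05

def wilson_interval(p, n, z=Z_95):
    """Return (lower, upper) bounds of the Wilson score interval about p."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against tiny floating-point overshoot at p = 0 or 1.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# HYPOTHETICAL observation: p = 0.25 from n = 10 cases.
p, n = 0.25, 10
w_minus, w_plus = wilson_interval(p, n)
print(f"Wilson interval about p = {p}: ({w_minus:.4f}, {w_plus:.4f})")

# If P sits at the upper bound, the Normal interval about P ends at p.
S = sqrt(w_plus * (1 - w_plus) / n)
print(f"P - z.S = {w_plus - Z_95 * S:.4f}")
```

Note that the interval is not symmetric about p: it is wider on the side facing 0.5.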
64 Modal shall vs. will over time
- Simple test
- Compare p for
- all LLC texts in DCPSE (1956-77) with
- all ICE-GB texts (early 1990s)
- We get the following data
- We may plot the probability of shall being selected, with Wilson intervals
- The data may be input in a 2 × 2 chi-square test
- or you can check Wilson intervals
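A 2 × 2 chi-square test on such a table can be sketched as follows. The frequencies are HYPOTHETICAL stand-ins, since the table from the slide is not reproduced here, and `chi_square_2x2` is an illustrative name. The critical value for one degree of freedom is obtained as the square of the critical value of z.

```python
from statistics import NormalDist

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs at least one case")
    return n * (a * d - b * c) ** 2 / denom

# HYPOTHETICAL frequencies; not the DCPSE figures.
llc_shall, llc_will = 60, 40     # earlier sample
ice_shall, ice_will = 40, 60     # later sample

chi2 = chi_square_2x2(llc_shall, llc_will, ice_shall, ice_will)
critical = NormalDist().inv_cdf(0.975) ** 2   # 1 d.f., error level 0.05
print(f"chi-square = {chi2:.3f}, critical value = {critical:.3f}")
print("significant difference:", chi2 > critical)
```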
67 Modal shall vs. will over time
- Plotting modal shall/will over time (DCPSE)
- Small amounts of data / year
- Confidence intervals identify the degree of certainty in our results
- Highly skewed p in some cases
- p = 0 or 1 (circled)
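The skewed cases are where the choice of interval matters most. The sketch below is an added comparison, not taken from the slides: it contrasts a Normal interval centred on the observation p with the Wilson score interval for small HYPOTHETICAL samples, including p = 0 and p = 1. Function names are illustrative.

```python
from math import sqrt

Z_95 = 1.959964  # critical value of z for an error level of 0.05

def normal_about_p(p, n, z=Z_95):
    """Normal interval centred on the observation p (for comparison only)."""
    S = sqrt(p * (1 - p) / n)
    return p - z * S, p + z * S

def wilson_interval(p, n, z=Z_95):
    """Wilson score interval about p, clamped to the range 0 to 1."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# HYPOTHETICAL yearly samples of 5 cases: 0, 2 and 5 instances of shall.
for x, n in [(0, 5), (2, 5), (5, 5)]:
    p = x / n
    n_lo, n_hi = normal_about_p(p, n)
    w_lo, w_hi = wilson_interval(p, n)
    print(f"p={p:.1f}  Normal about p: ({n_lo:.3f}, {n_hi:.3f})  "
          f"Wilson: ({w_lo:.3f}, {w_hi:.3f})")
```

At p = 0 or 1 the Normal interval about p collapses to zero width, which would wrongly suggest complete certainty, and at p = 0.4 it extends below 0. The Wilson interval stays within the range 0 to 1 and remains wide when there is little data.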
68 Modal shall vs. will over time
- Plotting modal shall/will over time (DCPSE)
- Small amounts of data / year
- Confidence intervals identify the degree of certainty in our results
- We can now estimate an approximate downwards curve
(Aarts et al., 2013)
69 Recap
- Whenever we look at change, we must ask ourselves two things
- What is the change relative to?
- Is our observation higher or lower than we might expect?
- In this case we ask
- Does shall decrease relative to shall + will?
- How confident are we in our results?
- Is the change big enough to be reproducible?
70 Conclusions
- An observation is not the actual value
- Repeating the experiment might get different results
- The basic idea of these methods is
- Predict range of future results if experiment was repeated
- Significant = effect > 0 (e.g. 19 times out of 20)
- Based on the Binomial distribution
- Approximated by Normal distribution: many uses
- Plotting confidence intervals
- Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
- Use 2 × 2 tests or two independent sample z tests to compare two observed samples
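A test for two independent samples can be sketched in the same style. This version assumes the common pooled-probability form of the z test; the frequencies are HYPOTHETICAL and `two_sample_z` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, n1, x2, n2):
    """z score for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)            # best estimate of a shared P
    S = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / S

# HYPOTHETICAL frequencies: 60 shall out of 100 cases vs. 40 out of 100.
z_obs = two_sample_z(60, 100, 40, 100)
z_crit = NormalDist().inv_cdf(0.975)          # error level 0.05, two-tailed
print(f"z = {z_obs:.4f}, critical z = {z_crit:.4f}")
print("significant difference:", abs(z_obs) > z_crit)
```

For a 2 × 2 table the square of this z score equals the chi-square statistic, so the two tests reach the same conclusion.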
71 References
- Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2 in Aarts, B., Close, J., Leech, G., and Wallis, S.A. (eds.) The Verb Phrase in English. Cambridge University Press.
- Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20(3), 178-208.
- Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22, 209-212.
- NOTE: Statistics papers, more explanation, spreadsheets etc. are published on the corp.ling.stats blog: http://corplingstats.wordpress.com