Title: Corpus Linguistics and Language Variation
1Corpus Linguistics and Language Variation
- Michael P. Oakes
- University of Sunderland
2Contents
- Introduction to Corpora and Language Variation
- The Chi-Squared Test
- British versus U.S. English
- Social Differentiation in the Use of Vocabulary
- Genre Analysis
- Yule's Distinctiveness Coefficient
- Hierarchical Clustering
- Factor Analysis
- Linguistic Facets and Support Vector Machines
- Computational Stylometry
- Conclusions
3Introduction
- Computer analyses of linguistic variation are
restricted to comparisons of the use of
frequently occurring, objectively countable
linguistic features.
- Not hapax legomena
- Not Semitisms in the Greek New Testament
- A corpus is a large body of (usually electronic)
text, sampled for a purpose.
- What are the differences in the ways that feature
x is used in corpus y and corpus z?
- In order to carry out comparisons, we need
corpora that are matched as far as possible in
every way but one, e.g. same genre, same country,
different gender.
4The LOB Corpus Family
- A well-known family of corpora, based on the
Brown corpus of one million words of American
English (1962).
- The Lancaster-Oslo-Bergen corpus is the British
equivalent.
- Carefully constructed using the same sampling
model (balanced) to carry out studies of
diachronic and synchronic variation.
5
Broad Text Category   Genre                                   Texts in Brown   Texts in LOB
Press                 A Reportage                             44               44
                      B Editorial                             27               27
                      C Reviews                               17               17
General Prose         D Religion                              17               17
                      E Skills, Trades, Hobbies               36               38
                      F Popular Lore                          48               44
                      G Belles Lettres, Biographies, Essays   75               77
                      H Miscellaneous                         30               30
                      J Academic Prose                        80               80
Fiction               K General Fiction                       29               29
                      L Mystery and Detective                 24               24
                      M Science Fiction                       6                6
                      N Adventure and Western                 29               29
                      P Romance and Love Story                29               29
                      R Humour                                9                9
6Diachronic Corpora
- LOB and Brown represent English of the 1960s.
- FLOB and Frown are balanced with respect to LOB
and Brown, designed to reflect English of the
1990s.
- Lancaster1931 Corpus (Leech and Smith, 2005)
- All annotated with a common grammatical tagset
known as C8, but no demographic information.
- Equal 30-year gaps enable researchers to
determine whether linguistic change is speeding
up, constant or slowing down.
- BE06 (Baker, 2006).
7Sources of Linguistic Variation
- Biber (1998) uses the term genre for classes of
texts that are determined on the basis of
external criteria relating to the author's or
speaker's purpose.
- Text types are grouped by similarities in intrinsic
linguistic form, irrespective of their genre
classifications, e.g. press reportage,
biographies and academic prose all have a
narrative linguistic form.
- Register refers to variation in language arising
from the situation it is used in, depending on
such things as interactiveness and who the
addressee is.
- Dialect is defined by association with different
speaker groups, based on region, social group or
other demographic factors.
- Topic, genre and stylistic preference may swamp
language change proper.
8Feature Selection
- An early stage in many text classification tasks
is to decide which features (called attributes
in machine learning applications) should be used
to characterise the texts.
- Words (non-trivial)
- Lemmas and Stems
- Interpretive codes, e.g. AB (abbreviation), SF
(neologism from science fiction), FO (foreign
origin)
- Part-of-Speech and Semantic tags
- E.g. does American English use any of these more
than British English?
9The Chi-Squared (χ²) Test (1)
- Is "mother" more typical of female speech than
male speech?
- Rayson, Leech and Hodges (1997) used the BNC
Conversational Corpus
- Start with a contingency table of observed values
                 Female      Male        Row totals
Mother           627         272         899
Any other word   2,592,825   1,714,161   4,306,986
Column totals    2,593,452   1,714,433   Grand total 4,307,885
10The Chi-Squared (χ²) Test (2)
- Expected values are found using the formula
- $E = \text{row total} \times \text{column total} / \text{grand total}$
                 Female        Male
Mother           541.2         357.8
Any other word   2,592,910.8   1,714,075.2
11The Chi-Squared (χ²) Test (3)
- $\chi^2 = \sum (O - E)^2 / E$
- $\chi^2 = 34.2$
- For one degree of freedom, if χ² > 10.8 we can be
99.9% confident that women really do say "mother"
more than men.
- Contribution of each cell, (O − E)² / E:
                 Female   Male
Mother           13.6     20.6
Any other word   0.0      0.0
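The three chi-squared slides can be retraced in a few lines of Python. This is a minimal sketch using only the observed counts given above; the function name `chi_squared` is illustrative and not taken from the original study.

```python
# Observed counts from the contingency table
# (rows: "mother", any other word; columns: female, male).
observed = [
    [627, 272],
    [2_592_825, 1_714_161],
]

def chi_squared(table):
    """Return (expected, chi2) for a contingency table of raw counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected value = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Sum (O - E)^2 / E over every cell.
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return expected, chi2

expected, chi2 = chi_squared(observed)
print([[round(e, 1) for e in row] for row in expected])
print(round(chi2, 1))
```

Almost all of the statistic comes from the two "mother" cells; the "any other word" cells contribute close to zero because their expected values are so large.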
12Gender Variation in the Use of Vocabulary
Words most characteristic of male speech: fing, er, the, yeah, aye, right, hundred, f, is, of, two, three, a, four, ah, no, number, quid, mate
Words most characteristic of female speech: she, her, said, n't, I, and, to, cos, oh, Christmas, thought, lovely, nice, mm, had, did, going, because, him, really
13Common Pitfalls
- The corpora need not be the same size
- Expected values should be at least 5
- Log-likelihood / G² also requires E ≥ 5
- Express data as counts, not ratios (e.g. words
per million)
- Need for a dispersion measure (thalidomide
example)
- Bonferroni correction for multiple comparisons
reduces Type I errors
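Two of these points can be illustrated with a short sketch. It assumes the standard definition G² = 2 Σ O ln(O / E) and the simplest form of the Bonferroni correction (dividing the significance level by the number of tests); the slides state neither formula, and the function names are illustrative.

```python
import math

# Observed raw counts (not per-million ratios) for "mother" vs any other word.
observed = [
    [627, 272],
    [2_592_825, 1_714_161],
]

def log_likelihood(table):
    """G^2 = 2 * sum(O * ln(O / E)) over all cells with O > 0."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / e)
    return 2 * g2

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests

g2 = log_likelihood(observed)
print(round(g2, 1))
print(bonferroni_alpha(0.05, 100))
```

When many words are tested at once, each individual test has to meet the stricter per-test level returned by `bonferroni_alpha`.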
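14Yule's Distinctiveness Coefficient (Q)
- $Q = (AD - BC) / (AD + BC)$, where A and B are the
top-row cells and C and D the bottom-row cells of
a 2 × 2 table
- Q measures strength and direction of a
relationship
- Q = −1 or +1 for both complete and absolute
relationships
- $\Phi^2 = \chi^2 / \text{sample size}$
Complete relationship
             Brown   LOB
Theatre      0       63
Theater      95      30
Absolute relationship
             Brown   LOB
South-west   0       10
southwest    16      0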
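A minimal sketch applying both measures to the two tables above. The cell order (A, B, C, D) is assumed to run left to right, top row first, and the shortcut form of Φ² for a 2 × 2 table is a standard identity rather than something given on the slide.

```python
def yules_q(a, b, c, d):
    """Yule's Q for the 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / (a * d + b * c)

def phi_squared(a, b, c, d):
    """Phi^2 = chi^2 / N, using the shortcut formula for a 2x2 table."""
    numerator = (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Rows: Theatre, Theater; columns: Brown, LOB.
complete = (0, 63, 95, 30)
# Rows: South-west, southwest; columns: Brown, LOB.
absolute = (0, 10, 16, 0)

for name, table in (("complete", complete), ("absolute", absolute)):
    print(name, yules_q(*table), round(phi_squared(*table), 3))
```

Q is −1 for both tables, whereas Φ² reaches 1 only for the absolute relationship, so Φ² can separate the two cases that Q treats alike.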
15Comparison of Corpora representing English spoken
in 5 Countries (Oakes & Farrow, 2007)
ACE (Australia): Labour rights (unions, unemployed, superannuation)
FLOB (UK): Aristocratic titles (royal, Lord)
Frown (United States): Spelling differences (color, theater), terms for transportation (railroad, highway), diversity (black, gender, white, gay)
Kolhapur (India): Crores (ten millions), lakhs (hundred thousands), caste system (dalit, caste), religion (Hindu, Krishna, temple), function words (the, upto)
Wellington (N Zealand): Sports (rugby), the natural world (bay, beach, cliff).
16Genre Analysis
- Univariate techniques such as chi-squared and
Yule's Q can be used to compare the vocabulary
used in different genres
- Here we will look at multivariate techniques:
hierarchical clustering, factor analysis, support
vector machines.
- Multivariate statistical analysis: when several
variables are observed for each sample unit, e.g.
the frequencies of many different words in a
genre.
17Hierarchical Clustering
- Cluster analysis is a type of automatic
categorisation: similar things (such as related
genres) are brought together, and dissimilar
things are kept apart.
- The starting point for many clustering algorithms
is the similarity matrix, a square table of
numbers showing how much each of the items (such
as texts) to be clustered has in common with
each of the others.
18The ranks of ten common words in four corpora
       LOB   Brown   Carroll   Jones & Sinclair
The    1     1       1         1
Of     2     2       2         6
And    3     3       3         2
To     4     4       5         4
A      5     5       4         3
In     6     6       6         7
That   7     7       8         8
Is     8     8       7         9
Was    9     9       10        10
It     10    10      9         5
19Spearman's Rank Correlation Coefficient
- $C = 1 - \dfrac{6 \sum (R - S)^2}{n (n^2 - 1)}$
- R is rank in LOB, S is rank in Jones & Sinclair,
n = number of words
- $C = 1 - \dfrac{6 \times 50}{10 \times 99} = 0.697$
       R    S    R−S   (R−S)²
The    1    1    0     0
Of     2    6    −4    16
And    3    2    1     1
To     4    4    0     0
A      5    3    2     4
In     6    7    −1    1
That   7    8    −1    1
Is     8    9    −1    1
Was    9    10   −1    1
It     10   5    5     25
                 Σ =   50
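The worked example above can be checked with a short sketch. It implements the rank-difference formula directly, which assumes there are no tied ranks; the function name `spearman` is illustrative.

```python
def spearman(ranks_r, ranks_s):
    """Spearman's rank correlation from two equal-length lists of ranks (no ties)."""
    n = len(ranks_r)
    sum_sq_diff = sum((r - s) ** 2 for r, s in zip(ranks_r, ranks_s))
    return 1 - 6 * sum_sq_diff / (n * (n ** 2 - 1))

# Ranks of: the, of, and, to, a, in, that, is, was, it.
lob = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
jones_sinclair = [1, 6, 2, 4, 3, 7, 8, 9, 10, 5]

print(round(spearman(lob, jones_sinclair), 3))
```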
20Similarity matrix for the four corpora
                   LOB     Brown   Carroll   Jones & Sinclair
LOB                -       1.000   0.964     0.697
Brown              1.000   -       0.964     0.697
Carroll            0.964   0.964   -         0.757
Jones & Sinclair   0.697   0.697   0.757     -
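A sketch of how such a matrix might be built from the table of ranks, computing Spearman's coefficient for every pair of corpora. The variable names are illustrative, and the formula again assumes untied ranks.

```python
from itertools import combinations

# Ranks of: the, of, and, to, a, in, that, is, was, it.
ranks = {
    "LOB": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Brown": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Carroll": [1, 2, 3, 5, 4, 6, 8, 7, 10, 9],
    "Jones & Sinclair": [1, 6, 2, 4, 3, 7, 8, 9, 10, 5],
}

def spearman(ranks_r, ranks_s):
    """Spearman's rank correlation from two lists of untied ranks."""
    n = len(ranks_r)
    sum_sq_diff = sum((r - s) ** 2 for r, s in zip(ranks_r, ranks_s))
    return 1 - 6 * sum_sq_diff / (n * (n ** 2 - 1))

# One entry per unordered pair; the matrix is symmetric.
similarity = {
    (a, b): spearman(ranks[a], ranks[b])
    for a, b in combinations(ranks, 2)
}

for (a, b), value in similarity.items():
    print(f"{a} vs {b}: {value:.3f}")
```

The Carroll versus Jones & Sinclair value is 0.7576 unrounded, so the 0.757 in the matrix appears to be truncated rather than rounded.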
21Production of a Dendrogram by Nearest Neighbour
linkage
[Dendrogram: LOB and Brown join at similarity 1.000; Carroll (C) joins them at 0.964; Jones & Sinclair (SJ) joins last at 0.757.]
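The merge order in this dendrogram can be reproduced with a minimal nearest-neighbour (single-linkage) sketch over the similarity matrix for the four corpora. The function and variable names are illustrative; with nearest-neighbour linkage, the similarity between two clusters is taken to be that of their most similar pair of members.

```python
# Pairwise similarities from the matrix for the four corpora.
similarity = {
    frozenset(("LOB", "Brown")): 1.000,
    frozenset(("LOB", "Carroll")): 0.964,
    frozenset(("Brown", "Carroll")): 0.964,
    frozenset(("LOB", "Jones & Sinclair")): 0.697,
    frozenset(("Brown", "Jones & Sinclair")): 0.697,
    frozenset(("Carroll", "Jones & Sinclair")): 0.757,
}

def single_linkage(items, similarity):
    """Agglomerative clustering; returns a list of (similarity, members) merges."""
    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest neighbour: the most similar pair across the two clusters.
                link = max(
                    similarity[frozenset((a, b))]
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or link > best[0]:
                    best = (link, i, j)
        link, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((link, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = single_linkage(["LOB", "Brown", "Carroll", "Jones & Sinclair"], similarity)
for link, members in merges:
    print(f"{link:.3f}: {members}")
```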
22Dendrogram for 15 LOB Genres using 89 most
frequent words (1)
23Dendrogram for 15 LOB genres (2)
- If our cut-off point is 0.2 difference, then we
get 4 clusters:
- K (general fiction), P (romance and love story),
L (mystery and detective), N (adventure and
western)
- A (press reportage), C (press reviews)
- M (science fiction), R (humour), F (popular
lore), G (belles lettres)
- B (press editorial), D (religion), E (skills,
trades and hobbies), H (government documents), J
(learned and scientific writings).
- Hofland and Johansson (1982): the two major
groups of texts were imaginative and informative
prose, bridged by essayistic prose
24Factor Analysis
- Decathlon analogy: running, jumping and throwing.
- Biber (1988): groups of countable features which
consistently co-occur in texts are said to define
a linguistic dimension.
- Such features are said to have positive loadings
with respect to that dimension, but dimensions
can also be defined by features which are in
complementary distribution, i.e. negatively
loaded.
- Example: at one pole are many pronouns and
contractions, near which lie conversational
texts and panel discussions. At the other pole,
with few pronouns and contractions, are scientific
texts and fiction.
25Factor Analysis Methodology
- The use of computer corpora, such as LOB, which
classify texts by a wide range of genres.
- The use of computer programs to count the
frequencies of linguistic features throughout the
range of genres, e.g. Perl scripts to count the
frequencies of the 50 most common words in LOB.
- Use of factor analysis to determine co-occurrence
relations among the features, e.g. using Matlab.
- Use of the linguist's intuition to interpret the
linguistic dimensions discovered.
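The counting step might look like the sketch below, written in Python rather than the Perl mentioned above. The two sample sentences, the genre labels attached to them and the feature list are all hypothetical stand-ins; a real study would read the LOB texts and use the 50 most common words. The output is a genre-by-feature matrix of relative frequencies, the kind of input a factor analysis package expects.

```python
import re
from collections import Counter

def relative_frequencies(text, features):
    """Relative frequency of each feature word in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in features]

# Hypothetical one-sentence stand-ins for two genres (K: fiction, J: academic).
samples = {
    "K": "She said that he was late and I was not.",
    "J": "The results of the experiment show that the rate of change is constant.",
}
# Hypothetical feature list; the study used the most common words in LOB.
features = ["the", "of", "i", "she", "was"]

# One row per genre, one column per feature.
matrix = {
    genre: relative_frequencies(text, features)
    for genre, text in samples.items()
}

for genre, row in matrix.items():
    print(genre, [round(value, 3) for value in row])
```

Relative rather than raw frequencies are used here so that genres with different amounts of text remain comparable.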
27Main Findings
- The singular personal pronouns (I, you, he, she,
his, him, her) were clustered very closely
together, and between K (general fiction) and L
(mystery and detective), showing that these
pronouns are characteristic of fictional texts
- As for the dendrogram, various genres were
closely related to each other:
- Four types of fiction: K (general), L (mystery
and detective), N (adventure and western) and P
(romance).
- M (science fiction) and R (humour).
- F (popular lore), G (belles lettres), C (press
reviews).
- D (religion), E (skills, trades and hobbies), H
(government docs) and J (science)
- B (press editorial) is the only genre to be
positively loaded on both factors, but its
nearest neighbour is D (religion).
- Only A (press reportage) is differently located
with respect to its neighbours compared with the
dendrogram.
28Support Vector Machine (Santini, 2007)
- Extant web genres, e.g. editorials,
do-it-yourself, mini-guides, biographies
- Variant web genres, on the other hand, have
arisen since the advent of the web: blogs,
e-shops, FAQs, listings (e.g. site maps),
personal home pages, search pages.
- First stage is to count automatically extractable
linguistic features, e.g. function words,
punctuation, POS trigrams, and linguistic facets,
e.g. genre-specific referential vocabulary
(basket, buy, cart, catalogue, checkouts, cost
for e-shops) and the functionality facet (tags such as
<button> and <form>, which indicate an interactive
web page).
- Application to search engines.
29Equation of Hyperplane: $w_0^T x + b_0 = 0$
[Figure: data points (X) and the separating hyperplane.]
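Once a hyperplane has been found, classifying a new text comes down to checking which side of it the feature vector falls on. The sketch below shows only that decision step, not SVM training; the weight vector `w0`, bias `b0` and the two-feature vectors are hypothetical values chosen for illustration.

```python
def decision_value(w, x, b):
    """Evaluate w^T x + b for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Return +1 or -1 according to the side of the hyperplane x lies on."""
    return 1 if decision_value(w, x, b) >= 0 else -1

# Hypothetical hyperplane in a two-feature space.
w0 = [2.0, -1.0]
b0 = -1.0

print(classify(w0, [2.0, 1.0], b0))        # positive side
print(classify(w0, [0.0, 1.0], b0))        # negative side
print(decision_value(w0, [1.0, 1.0], b0))  # a point lying on the hyperplane
```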
30Computational Stylometry
- Writing styles of individual authors
- Forsyth (1999) used χ² to find substring
discriminators for the younger (s, an, whi,
with) and the older Yeats (what, can,
?)
- Holmes (1992) used hierarchical clustering to
compare Mormon Scripture, Joseph Smith's personal
writings, Old Testament sections.
- Holmes, Gordon and Wilson (2001) used Principal
Components Analysis, similar to Factor Analysis,
to determine the authorship of The Heart of a
Soldier
- Popescu and Dinu (2007) used an SVM to decide who
wrote the disputed Federalist Papers (Madison rather
than Hamilton).
31Hierarchical Clustering of Mormon Scripture,
Joseph Smith's personal writings and samples
from Isaiah (Holmes, 1992)
[Dendrogram; leaf order: J1 J2 J3 I1 I3 I2 JB N2 A2 A1 M2 M4 R2 D2 M3 N3 M5 R1 AB D3 L1 D1 N1 M1]
32Principal Components Plot
[Figure: principal components plot of samples from LaSalle Autobiography, LaSalle Letters, George Personal, George War, Harrison, Heart and Inman papers.]
33Conclusions
- Statistical methods can distinguish between
different text types, arising through such
factors as demographic variation, genre
differences, topic differences, or individual
writing styles.
- One difficulty is that each of these differences
can obscure the others.
- Cultural corpora versus balancing text-for-text.
- Search for linguistic features that are good at
identifying one form of linguistic variation,
without being indicative of others.
34Source
- Oakes, M. P. Corpus Linguistics and Language
Variation, in Baker, P. (ed.) Contemporary
Approaches to Corpus Linguistics, Continuum (to
appear).