Corpus Linguistics and Language Variation

1
Corpus Linguistics and Language Variation
  • Michael P. Oakes
  • University of Sunderland

2
Contents
  • Introduction to Corpora and Language Variation
  • The Chi-Squared Test
  • British versus U.S. English
  • Social Differentiation in the Use of Vocabulary
  • Genre Analysis
  • Yule's Distinctiveness Coefficient
  • Hierarchical Clustering
  • Factor Analysis
  • Linguistic Facets and Support Vector Machines
  • Computational Stylometry
  • Conclusions

3
Introduction
  • Computer analyses of linguistic variation are
    restricted to comparisons of the use of
    frequently occurring, objectively countable
    linguistic features.
  • Not hapax legomena
  • Not Semitisms in the Greek New Testament
  • A corpus is a large body of (usually electronic)
    text, sampled for a purpose.
  • What are the differences in the ways that feature
    x is used in corpus y and corpus z?
  • In order to carry out comparisons, we need
    corpora that are matched as far as possible in
    every way but one, e.g. same genre, same country,
    different gender.

4
The LOB Corpus Family
  • A well-known family of corpora, based on the
    Brown corpus of one million words of American
    English (1962).
  • The Lancaster-Oslo-Bergen (LOB) corpus is the
    British equivalent.
  • Carefully constructed using the same sampling
    model (balanced) to carry out studies of
    diachronic and synchronic variation.

5
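| Broad Text Category | Genre | Texts in Brown | Texts in LOB |
|---|---|---|---|
| Press | A Reportage | 44 | 44 |
| Press | B Editorial | 27 | 27 |
| Press | C Reviews | 17 | 17 |
| General Prose | D Religion | 17 | 17 |
| General Prose | E Skills, Trades, Hobbies | 36 | 38 |
| General Prose | F Popular Lore | 48 | 44 |
| General Prose | G Belles Lettres, Biographies, Essays | 75 | 77 |
| General Prose | H Miscellaneous | 30 | 30 |
| General Prose | J Academic Prose | 80 | 80 |
| Fiction | K General Fiction | 29 | 29 |
| Fiction | L Mystery and Detective | 24 | 24 |
| Fiction | M Science Fiction | 6 | 6 |
| Fiction | N Adventure and Western | 29 | 29 |
| Fiction | P Romance and Love Story | 29 | 29 |
| Fiction | R Humour | 9 | 9 |
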
6
Diachronic Corpora
  • LOB and Brown represent English of the 1960s.
  • FLOB and Frown are balanced with respect to LOB
    and Brown, designed to reflect English of the
    1990s.
  • Lancaster1931 Corpus (Leech and Smith, 2005)
  • All annotated with a common grammatical tagset
    known as C8, but no demographic information.
  • Equal 30-year gaps enable researchers to
    determine whether linguistic change is speeding
    up, constant or slowing down.
  • BE06 (Baker, 2006).

7
Sources of Linguistic Variation
  • Biber (1998) uses the term genre for classes of
    texts that are determined on the basis of
    external criteria relating to the author's or
    speaker's purpose.
  • Text types are grouped by similarities in
    intrinsic linguistic form, irrespective of their
    genre classifications, e.g. press reportage,
    biographies and academic prose all have a
    narrative linguistic form.
  • Register refers to variation in language arising
    from the situation it is used in, depending on
    such things as interactiveness and who the
    addressee is.
  • Dialect is defined by association with different
    speaker groups, based on region, social group or
    other demographic factors.
  • Topic, genre and stylistic preference may swamp
    language change proper.

8
Feature Selection
  • An early stage in many text classification tasks
    is to decide which features (called attributes
    in machine learning applications) should be used
    to characterise the texts.
  • Words (non-trivial)
  • Lemmas and Stems
  • Interpretive codes e.g. AB (abbreviation), SF
    (neologism from science fiction), FO (foreign
    origin)
  • Part-of-Speech and Semantic tags
  • E.g. does American English use any of these more
    than British English?

9
The Chi-Squared (χ²) Test (1)
  • Is "mother" more typical of female speech than
    male speech?
  • Rayson, Leech and Hodges (1997) used the BNC
    Conversational Corpus
  • Start with a contingency table of observed values

| | Female | Male | Row totals |
|---|---|---|---|
| Mother | 627 | 272 | 899 |
| Any other word | 2,592,825 | 1,714,161 | 4,306,986 |
| Column totals | 2,593,452 | 1,714,433 | Grand total 4,307,885 |

10
The Chi-Squared (χ²) Test (2)
  • Expected values are found using the formula
  • row total × column total / grand total

| | Female | Male |
|---|---|---|
| Mother | 541.2 | 357.8 |
| Any other word | 2,592,910.8 | 1,714,075.2 |

11
The Chi-Squared (χ²) Test (3)
  • χ² = Σ (O − E)² / E
  • χ² = 34.2
  • For one degree of freedom, if χ² > 10.8 we can be
    99.9% confident that women really do say "mother"
    more than men.

| (O − E)² / E | Female | Male |
|---|---|---|
| Mother | 13.6 | 20.6 |
| Any other word | 0.0 | 0.0 |

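
The three chi-squared steps can be reproduced in a few lines. The following is a minimal sketch in Python, not the method used in the original study; the function name `chi_squared` and the list-of-rows layout are assumptions made for illustration. It recomputes the expected values, the per-cell contributions and the total from the observed counts for "mother".

```python
# Observed counts (rows: word; columns: female, male speech).
observed = [
    [627, 272],              # "mother"
    [2_592_825, 1_714_161],  # any other word
]


def chi_squared(table):
    """Return (chi2, expected, contributions) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected value = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Each cell contributes (O - E)^2 / E to the statistic.
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    chi2 = sum(sum(row) for row in contributions)
    return chi2, expected, contributions


chi2, expected, contributions = chi_squared(observed)
print(round(chi2, 1))                               # 34.2
print([round(e, 1) for e in expected[0]])           # [541.2, 357.8]
print([round(c, 1) for c in contributions[0]])      # [13.6, 20.6]
```

Almost all of the statistic comes from the "mother" row, because the "any other word" cells are so large that their relative deviation from the expected value is negligible.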
12
Gender Variation in the Use of Vocabulary
  • Words most characteristic of male speech:
    f***ing, er, the, yeah, aye, right, hundred, f***,
    is, of, two, three, a, four, ah, no, number, quid,
    mate
  • Words most characteristic of female speech:
    she, her, said, n't, I, and, to, cos, oh,
    Christmas, thought, lovely, nice, mm, had, did,
    going, because, him, really
13
Common Pitfalls
  • The corpora need not be the same size
  • Expected values should be at least 5
  • Log-likelihood / G² also requires E ≥ 5
  • Express data as counts, not ratios (e.g. words
    per million)
  • Need for a dispersion measure (thalidomide
    example)
  • Bonferroni correction for multiple comparisons
    reduces Type I errors
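
Several of these pitfalls can be checked mechanically. The sketch below is an illustration under stated assumptions rather than a prescribed procedure: it uses the standard formula G² = 2 Σ O ln(O / E) for log-likelihood, tests that the smallest expected value is at least 5, and applies a Bonferroni correction by dividing the significance level by the number of comparisons. The function names and the figure of 100 comparisons are hypothetical.

```python
import math


def expected_counts(table):
    """Expected values: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def log_likelihood(table):
    """G2 = 2 * sum(O * ln(O / E)) over all cells, using raw counts."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            if o > 0:  # an empty cell contributes nothing to the sum
                total += o * math.log(o / e)
    return 2 * total


def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_comparisons


# Raw counts for "mother" (female, male), not words per million.
observed = [[627, 272], [2_592_825, 1_714_161]]
smallest_expected = min(min(row) for row in expected_counts(observed))
g2 = log_likelihood(observed)

print(smallest_expected >= 5)   # is the test applicable at all?
print(g2 > 10.83)               # critical value for p = 0.001, 1 d.f.
print(bonferroni_threshold(0.05, 100))
```

Note that the two corpora being compared differ in size here; that is allowed, because the expected values already take the row and column totals into account.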

14
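Yule's Distinctiveness Coefficient (Q)
  • Q = (AD − BC) / (AD + BC), where A and B are the
    top row and C and D the bottom row of a 2 × 2
    table
  • Q measures strength and direction of a
    relationship
  • Q = −1 or +1 for both complete and absolute
    relationships
  • Φ² = χ² / sample size

Complete relationship

| | Brown | LOB |
|---|---|---|
| Theatre | 0 | 63 |
| Theater | 95 | 30 |

Absolute relationship

| | Brown | LOB |
|---|---|---|
| South-west | 0 | 10 |
| southwest | 16 | 0 |
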
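
As a quick check on the two tables, Q can be computed directly. This is a minimal sketch; the function name `yules_q` and the assignment of cells (A, B = first row; C, D = second row; columns Brown, LOB) are assumptions made for illustration.

```python
def yules_q(a, b, c, d):
    """Yule's Q for the 2 x 2 table [[a, b], [c, d]].

    Undefined (division by zero) when both a*d and b*c are zero.
    """
    ad, bc = a * d, b * c
    return (ad - bc) / (ad + bc)


# Complete relationship: one cell is empty (theatre never occurs in Brown).
print(yules_q(0, 63, 95, 30))   # -1.0

# Absolute relationship: both cells on one diagonal are empty.
print(yules_q(0, 10, 16, 0))    # -1.0
```

Q reaches its extreme value in both cases, so it does not by itself distinguish a complete relationship from an absolute one.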
15
Comparison of Corpora representing English spoken
in 5 Countries (Oakes & Farrow, 2007)

| Corpus (country) | Characteristic vocabulary |
|---|---|
| ACE (Australia) | Labour rights (unions, unemployed, superannuation) |
| FLOB (UK) | Aristocratic titles (royal, Lord) |
| Frown (United States) | Spelling differences (color, theater), terms for transportation (railroad, highway), diversity (black, gender, white, gay) |
| Kolhapur (India) | Crores (ten millions), lakhs (hundred thousands), caste system (dalit, caste), religion (Hindu, Krishna, temple), function words (the, upto) |
| Wellington (New Zealand) | Sports (rugby), the natural world (bay, beach, cliff) |

16
Genre Analysis
  • Univariate techniques such as chi-squared and
    Yule's Q can be used to compare the vocabulary
    used in different genres.
  • Here we will look at multivariate techniques:
    hierarchical clustering, factor analysis and
    support vector machines.
  • Multivariate statistical analysis applies when
    several variables are observed for each sample
    unit, e.g. the frequencies of many different
    words in a genre.

17
Hierarchical Clustering
  • Cluster analysis is a type of automatic
    categorisation: similar things (such as related
    genres) are brought together, and dissimilar
    things are kept apart.
  • The starting point for many clustering algorithms
    is the similarity matrix, a square table of
    numbers showing how much each of the items (such
    as texts) to be clustered has in common with
    each of the others.

18
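The ranks of ten common words in four corpora

| Word | LOB | Brown | Carroll | Jones & Sinclair |
|---|---|---|---|---|
| The | 1 | 1 | 1 | 1 |
| Of | 2 | 2 | 2 | 6 |
| And | 3 | 3 | 3 | 2 |
| To | 4 | 4 | 5 | 4 |
| A | 5 | 5 | 4 | 3 |
| In | 6 | 6 | 6 | 7 |
| That | 7 | 7 | 8 | 8 |
| Is | 8 | 8 | 7 | 9 |
| Was | 9 | 9 | 10 | 10 |
| It | 10 | 10 | 9 | 5 |
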
19
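Spearman's Rank Correlation Coefficient
  • C = 1 − 6 Σ (R − S)² / (n (n² − 1))
  • R is rank in LOB, S is rank in Jones & Sinclair,
    n = number of words
  • C = 1 − (6 × 50) / (10 × 99) = 0.697

| Word | R | S | R − S | (R − S)² |
|---|---|---|---|---|
| The | 1 | 1 | 0 | 0 |
| Of | 2 | 6 | −4 | 16 |
| And | 3 | 2 | 1 | 1 |
| To | 4 | 4 | 0 | 0 |
| A | 5 | 3 | 2 | 4 |
| In | 6 | 7 | −1 | 1 |
| That | 7 | 8 | −1 | 1 |
| Is | 8 | 9 | −1 | 1 |
| Was | 9 | 10 | −1 | 1 |
| It | 10 | 5 | 5 | 25 |
| Σ | | | | 50 |
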
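
The worked example can be verified with a short script. This is a sketch only: the function name `spearman` and the dictionary of rank lists are assumptions for illustration, with the ranks copied from the table of ten common words. The formula is valid as written only when there are no tied ranks.

```python
# Ranks of: the, of, and, to, a, in, that, is, was, it.
ranks = {
    "LOB":              [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Brown":            [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Carroll":          [1, 2, 3, 5, 4, 6, 8, 7, 10, 9],
    "Jones & Sinclair": [1, 6, 2, 4, 3, 7, 8, 9, 10, 5],
}


def spearman(r, s):
    """Spearman's rank correlation for two untied rank lists."""
    n = len(r)
    sum_sq_diff = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * sum_sq_diff / (n * (n ** 2 - 1))


print(round(spearman(ranks["LOB"], ranks["Jones & Sinclair"]), 3))  # 0.697
print(round(spearman(ranks["LOB"], ranks["Carroll"]), 3))           # 0.964
print(spearman(ranks["LOB"], ranks["Brown"]))                       # 1.0
```

Applying the same function to every pair of corpora yields a similarity matrix of the kind used as input to clustering.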
20
Similarity matrix for the four corpora

| | LOB | Brown | Carroll | Jones & Sinclair |
|---|---|---|---|---|
| LOB | - | 1.000 | 0.964 | 0.697 |
| Brown | 1.000 | - | 0.964 | 0.697 |
| Carroll | 0.964 | 0.964 | - | 0.757 |
| Jones & Sinclair | 0.697 | 0.697 | 0.757 | - |

21
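Production of a Dendrogram by Nearest Neighbour
linkage
[Figure: dendrogram. LOB and Brown merge at similarity 1.000, Carroll (C) joins them at 0.964, and Jones & Sinclair (SJ) joins last at 0.757.]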
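
The merge order in the dendrogram can be reproduced from the similarity matrix. The following is a minimal sketch of nearest neighbour (single) linkage, written for clarity rather than efficiency; the function name `single_linkage` and the use of `frozenset` keys for corpus pairs are assumptions made for illustration.

```python
# Pairwise similarities taken from the similarity matrix.
similarity = {
    frozenset(("LOB", "Brown")): 1.000,
    frozenset(("LOB", "Carroll")): 0.964,
    frozenset(("LOB", "Jones & Sinclair")): 0.697,
    frozenset(("Brown", "Carroll")): 0.964,
    frozenset(("Brown", "Jones & Sinclair")): 0.697,
    frozenset(("Carroll", "Jones & Sinclair")): 0.757,
}


def single_linkage(items, sim):
    """Repeatedly merge the two most similar clusters.

    The similarity of two clusters is the highest similarity between
    any member of one and any member of the other (nearest neighbour).
    Returns the merges as (cluster_a, cluster_b, similarity) tuples.
    """
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


corpora = ["LOB", "Brown", "Carroll", "Jones & Sinclair"]
for left, right, level in single_linkage(corpora, similarity):
    print(left, "+", right, "at", level)
```

Cutting the resulting tree at a chosen similarity level gives the clusters: a cut between 0.757 and 0.964, for example, separates Jones & Sinclair from the other three corpora.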
22
Dendrogram for 15 LOB Genres using 89 most
frequent words (1)
23
Dendrogram for 15 LOB genres (2)
  • If our cut-off point is 0.2 difference, then we
    get 4 clusters
  • K (general fiction), P (romance and love story),
    L (mystery and detective), N (adventure and
    western)
  • A (press reportage), C (press reviews)
  • M (science fiction), R (humour), F (popular
    lore), G (belles lettres)
  • B (press editorial), D (religion), E (skills,
    trades and hobbies), H (government documents), J
    (learned and scientific writings).
  • Hofland and Johansson (1982): the two major
    groups of texts were imaginative and informative
    prose, bridged by essayistic prose.

24
Factor Analysis
  • Decathlon analogy: running, jumping and throwing.
  • Biber (1988): groups of countable features which
    consistently co-occur in texts are said to define
    a linguistic dimension.
  • Such features are said to have positive loadings
    with respect to that dimension, but dimensions
    can also be defined by features which are in
    complementary distribution, i.e. negatively
    loaded.
  • Example: at one pole are many pronouns and
    contractions, near which lie conversational
    texts and panel discussions. At the other pole,
    with few pronouns and contractions, are
    scientific texts and fiction.

25
Factor Analysis Methodology
  • The use of computer corpora, such as LOB, which
    classify texts by a wide range of genres.
  • The use of computer programs to count the
    frequencies of linguistic features throughout the
    range of genres, e.g. Perl scripts to count the
    frequencies of the 50 most common words in LOB.
  • Use of factor analysis to determine co-occurrence
    relations among the features, e.g. using Matlab.
  • Use of the linguist's intuition to interpret the
    linguistic dimensions discovered.
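
The counting step of this methodology can be sketched briefly. The slide mentions Perl scripts; the example below uses Python instead, and everything in it is an assumption made for illustration: the crude tokeniser, the two toy "genre" texts (invented, not taken from LOB) and the five-word vocabulary standing in for the 50 most common words. It produces one row of relative frequencies per genre, which is the kind of matrix a factor analysis would then take as input.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word tokens (a deliberately crude tokeniser)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def relative_frequencies(text, vocabulary):
    """Frequency of each vocabulary word as a proportion of all tokens."""
    counts = word_frequencies(text)
    total = sum(counts.values())
    return [counts[word] / total for word in vocabulary]


# Hypothetical miniature samples, one per genre.
texts = {
    "fiction": "She said that he was late, and he said that it was the rain.",
    "academic": "The frequency of the feature is a function of the size of the corpus.",
}
vocabulary = ["the", "of", "and", "he", "was"]

feature_matrix = {
    genre: relative_frequencies(text, vocabulary)
    for genre, text in texts.items()
}
for genre, row in feature_matrix.items():
    print(genre, [round(value, 3) for value in row])
```

Relative frequencies are used here so that genres of different sizes are comparable; for chi-squared or log-likelihood the raw counts from `word_frequencies` would be needed instead.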

27
Main Findings
  • The singular personal pronouns (I, you, he, she,
    his, him, her) were clustered very closely
    together, and between K (general fiction) and L
    (mystery and detective), showing that these
    pronouns are characteristic of fictional texts.
  • As in the dendrogram, various genres were
    closely related to each other:
  • Four types of fiction: K (general), L (mystery
    and detective), N (adventure and western) and P
    (romance).
  • M (science fiction) and R (humour).
  • F (popular lore), G (belles lettres), C (press
    reviews).
  • D (religion), E (skills, trades and hobbies), H
    (government docs) and J (science)
  • B (press editorial) is the only genre to be
    positively loaded on both factors, but its
    nearest neighbour is D (religion).
  • Only A (press reportage) is differently located
    with respect to its neighbours compared with the
    dendrogram.

28
Support Vector Machine (Santini, 2007)
  • Extant web genres, e.g. editorials,
    do-it-yourself, mini-guides, biographies.
  • Variant web genres, on the other hand, have
    arisen since the advent of the web: blogs,
    e-shops, FAQs, listings (e.g. site maps),
    personal home pages, search pages.
  • First stage is to count automatically extractable
    linguistic features, e.g. function words,
    punctuation, POS trigrams, and linguistic facets,
    e.g. genre-specific referential vocabulary
    (basket, buy, cart, catalogue, checkouts, cost
    for e-shops) and the functionality facet (tags
    such as <button> and <form>, which indicate an
    interactive web page).
  • Application to search engines.

29
Equation of Hyperplane: w₀ᵀx + b₀ = 0
[Figure: data points (X) and the separating hyperplane.]
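
Once a hyperplane has been found, classifying a new text is a matter of checking which side of it the feature vector falls on. The sketch below shows only that decision step, not the training that finds the maximum-margin hyperplane. The weights, bias and points are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def decision_value(w, x, b):
    """Value of w.x + b; zero means x lies exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Return +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical hyperplane in two dimensions: x1 - x2 = 0.
w, b = [1.0, -1.0], 0.0

print(classify(w, [2.0, 1.0], b))   # 1
print(classify(w, [1.0, 3.0], b))   # -1
```

In genre classification each dimension of x would be one countable feature, such as the frequency of a function word or of a POS trigram.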
30
Computational Stylometry
  • Writing styles of individual authors
  • Forsyth (1999) used χ² to find substring
    discriminators for the younger (s, an, whi,
    with) and the older Yeats (what, can,
    ?)
  • Holmes (1992) used hierarchical clustering to
    compare Mormon Scripture, Joseph Smith's personal
    writings, Old Testament sections.
  • Holmes, Gordon and Wilson (2001) used Principal
    Components Analysis, similar to Factor Analysis,
    to determine the authorship of The Heart of a
    Soldier
  • Popescu and Dinu (2007) used an SVM to decide who
    wrote the Federalist Papers (Madison rather than
    Hamilton).

31
Hierarchical Clustering of Mormon Scripture,
Joseph Smith's personal writings and samples
from Isaiah (Holmes, 1992)
[Figure: dendrogram with leaves J1 J2 J3 I1 I3 I2 JB N2 A2 A1 M2 M4 R2 D2 M3 N3 M5 R1 AB D3 L1 D1 N1 M1.]
32
Principal Components Plot
[Figure legend: LaSalle Autobiography, LaSalle Letters (o), George Personal, George War (x), Harrison, Heart, Inman papers.]
33
Conclusions
  • Statistical methods can distinguish between
    different text types, arising through such
    factors as demographic variation, genre
    differences, topic differences, or individual
    writing styles.
  • One difficulty is that each of these differences
    can obscure the others.
  • Cultural corpora versus balancing text-for-text.
  • Search for linguistic features that are good at
    identifying one form of linguistic variation,
    without being indicative of others.

34
Source
  • Oakes, M. P. Corpus Linguistics and Language
    Variation, in Baker, P. (ed.) Contemporary
    Approaches to Corpus Linguistics, Continuum (to
    appear).