1
Measures from Information Retrieval to Find the
Words which are Characteristic of a Corpus.

Michael Oakes
University of Sunderland, England.

2
Contents

Background and the ICAME disk
Two traditional measures: chi-squared and
G-squared (log-likelihood)
Information Retrieval

3
Looking for discriminating vocabulary

Two classic papers: Kilgarriff (1996), "Which words are particularly characteristic of a text? A survey of statistical approaches", and Yang and Pedersen (1997), "A comparative study on feature selection in text categorization".
Identify discriminants: linguistic features more typical of one form of English than another.
Automatic categorisation of text types: akin to automatic topic, genre and author identification (Souter, 1994).
Vocabulary differences reveal cultural differences (Leech and Fallon, 1992).

4
Leech and Fallon (1992) compared the vocabulary
in Brown and LOB

Linguistic contrasts:
Spelling differences: color / colour
Lexical choice: gasoline / petrol
Proper nouns (Chicago more common in Brown)
Non-linguistic contrasts: indicators of socio-cultural differences between the two countries.

5
Samples of written English on the ICAME CD
6
Number of sections of approx. 2000 words in 5
comparable corpora (1)
7
Number of sections of approx. 2000 words in 5
comparable corpora (1)
8
The chi-squared (X²) test

See Rayson, Leech & Hodges (1997).
Case study: Is the word "lovely" used more often in speech by men or women?
Experiment: In the BNC conversational corpus, men say "lovely" 414 times while women say "lovely" 1214 times.
Statistics: Is this due to chance, or does the use of this word genuinely vary with the gender of the speaker? Use the chi-squared test.
Contingency table of observed values O: see next slide.

9
Contingency table of observed values O
10
Contingency table of expected values E
11
The chi-squared test (2)

Expected frequencies E:
E = row total x column total / grand total
e.g. E(lovely, men) = 1628 x 1714443 / 4307895 = 647.9
See previous table.
X² = Σ (O - E)² / E
Find (O - E)² / E for every box in the table,
e.g. (O - E)² / E for (lovely, men) = (414 - 647.9)² / 647.9 = 84.4.
X² = sum (Σ) over all four boxes = 84.4 + 55.8 + 0.0 + 0.0 = 140.2
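
The calculation above can be sketched in a few lines of Python. This is only an illustration: the contingency tables themselves were not transcribed, so the women's column total and the "all other words" row are assumptions derived here from the totals quoted above (1628, 1714443 and 4307895). The function name is hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson X² for the 2x2 table [[a, b], [c, d]]; also returns expected values."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    grand_total = a + b + c + d
    # E = row total x column total / grand total, for every box
    expected = [[r * col / grand_total for col in col_totals] for r in row_totals]
    x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))
    return x2, expected

lovely_men, lovely_women = 414, 1214
total_men, grand_total = 1714443, 4307895
total_women = grand_total - total_men  # assumption: derived from the quoted totals

x2, expected = chi_squared_2x2(
    lovely_men, lovely_women,
    total_men - lovely_men,      # all other words spoken by men
    total_women - lovely_women,  # all other words spoken by women
)
print(round(expected[0][0], 1))  # expected frequency for (lovely, men)
print(x2)
```

Working from unrounded expected values gives a total slightly different from the sum of the rounded contributions shown on the slide.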

12
G² or Log-Likelihood
13
G² vs. Chi-squared

The chi-squared test is an approximation to the G² test, easier to calculate in the days before PCs and pocket calculators (Wikipedia).
Both can be used to compare corpora of different sizes.
The only restriction is that the expected values must be > 5 (Moore 2004; Rayson et al., 2004).
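
The G² formula itself did not survive in this transcript. A minimal sketch, assuming the standard log-likelihood ratio form G² = 2 Σ O ln(O / E) with the same expected values as the chi-squared test, applied to the "lovely" counts from the earlier case study (the second row and the women's total are derived from the quoted totals, as an assumption):

```python
import math

def g_squared_2x2(a, b, c, d):
    """G² = 2 * sum(O * ln(O / E)) for the 2x2 table [[a, b], [c, d]]."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    grand_total = a + b + c + d
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if observed[i][j] > 0:  # a zero cell contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2

# "lovely": 414 (men) vs 1214 (women); remaining cells are all other words.
g2 = g_squared_2x2(414, 1214, 1714443 - 414, (4307895 - 1714443) - 1214)
print(g2)
```

For this table the two statistics are of similar size, and both are far above the single-test critical value.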

14
The 20 Words Most Typical of New Zealand English
15
Bonferroni Correction

Controls the family-wise error rate, and hence also bounds the False Discovery Rate.
For a single test, X² or G² > 10.83 is significant at the 0.1% level.
In comparing the vocabulary across the five corpora, we effectively perform 101,984 tests because there are 101,984 unique word types across the 5 corpora.
To find the appropriate critical value we divided 0.001 by 101,984 to give an adjusted significance level of 9.805 x 10⁻⁹.
We then identify words with chi-squared contributions > 32.9.
Not more than 0.1% of the words selected in this way will have been incorrectly identified, since the Bonferroni correction is conservative.
We are more interested in ranking than absolute values.
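
The adjusted critical value can be checked with a short sketch. It assumes one degree of freedom (a 2x2 table), for which the chi-squared upper-tail probability is erfc(sqrt(x / 2)); the function names and the bisection search are illustrative choices, not the method used in the study.

```python
import math

def chi2_upper_tail_1df(x):
    """P(X² > x) for a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def critical_value(alpha, lo=0.0, hi=200.0, iterations=100):
    """Find x such that P(X² > x) = alpha, by bisection."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail_1df(mid) > alpha:
            lo = mid  # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2.0

n_tests = 101984               # unique word types across the 5 corpora
alpha_adjusted = 0.001 / n_tests

print(alpha_adjusted)                   # adjusted significance level
print(critical_value(0.001))            # single-test critical value
print(critical_value(alpha_adjusted))   # Bonferroni-adjusted critical value
```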

16
Dispersion

Dispersion measures show how evenly or otherwise
a word is distributed throughout a corpus (Lyne
1985, 1986).
In this study, we should only consider words
which are relatively evenly spread throughout the
corpus.
E.g. thalidomide, ranked 15th most typical of UK,
occurs all 55 times in a single medical article.

17
Juilland's D (1)

Divide the corpus into n contiguous subsections
(we used 5).
Commonwealth was found 31, 8, 32, 88, 5 times
respectively in the Australian corpus.
The standard deviation of the number of times the word is found in each subsection is 29.79, and the mean frequency is 32.8.

18
Juilland's D (2)

To account for the fact that the standard deviation tends to be higher for more frequent words, it is divided by the mean frequency to give the coefficient of variation: V = 29.79 / 32.8 = 0.908
The coefficient of dispersion falls in the range 0 to 1.
D = 1 - V / sqrt(n - 1) = 0.546 for commonwealth
Empirical finding: keep if D > 0.3, range > 3.
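
The commonwealth example can be reproduced with a short sketch. The figures above match the population standard deviation (dividing by n rather than n - 1), so that is what is assumed here; the function name is hypothetical.

```python
import math
import statistics

def juilland_d(counts):
    """Juilland's D for the frequencies of one word in n contiguous subsections."""
    n = len(counts)
    mean = statistics.mean(counts)
    sd = statistics.pstdev(counts)   # population standard deviation (assumption)
    v = sd / mean                    # coefficient of variation V
    return 1.0 - v / math.sqrt(n - 1)

commonwealth = [31, 8, 32, 88, 5]    # Australian corpus, 5 subsections
print(round(statistics.mean(commonwealth), 1))  # 32.8
print(round(juilland_d(commonwealth), 3))       # 0.546
```

A word spread perfectly evenly gives V = 0 and D = 1; a word found in only one subsection gives D = 0.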

19
The Australian list

18 of the top 19 are people and places.
The exception is Commonwealth (of Australia).
Politics: Premier, Senator, Hawke, Whitlam, ALP, Labor, BHP
Employment rights: unions, unemployed, superannuation

20
The British list

People and places
Institutions: NHS, BBC
Politics: Tory, Labour
EC (European Community)
Historical epochs: century, eighteenth
Aristocratic titles: Duke, Lord(s), Prince, Royal

21
The Indian List

People and places
Currency: Rs (rupees)
Numbers: mn (million), crores (ten million), lakhs (hundred thousand).
Function words: the, of, in, upto (single word)
Religion: Buddha (86.0), Buddhism (45.4), divine (150.6), Gita (119.3), God (37.8), Gods (78.6), Goddess (44.4), Hindu (299.5), Hindus (148.1), Karma (61.4), Muslim (151.8), Muslims (42.2), mystic (53.1), Mystics (100.7), pandit (104.4), Saints (35.6), Sikh (80.0), Swami (131.2), temple (248.8), temples (104.2), Vedas (101.4), Vedic (102.9), yoga (97.7).

22
The New Zealand list

Place names
Pakeha (person of European descent)
The natural world: bay, forest, harbour, island(s), landscape.
Rugby

23
The U.S. list

Few people and places
Spelling variants: toward, percent, programs, defense, program, color, behavior, labor, fiber, gray, theater, favorite, favor, colors, organization
Inclusiveness: black, gender, white

24
Measures from Information Retrieval

The main difference from corpus linguistics is that we are interested in the information itself rather than its linguistic style.
Raw frequency with stoplisting
TF.IDF
Deviation from Randomness
Kullback-Leibler Divergence

26
Raw Frequency

Most frequent words in the New Zealand corpus:
the (67355), of (32182), and (28678), to (26552),
a (23558), in (20519), is (10284), was (10081),
it (9814), that (9743), for (9341), I (7844), on
(7629), s (7585), with (7185), as (7027), he
(6716), be (6297), at (5530), by (5207)

27
The Glasgow Stoplist

a, about, above, across, adj, after, again,
against, all, almost, alone, along, also,
although, always, am, among, an, and, another,
any, anybody, anyone, anything, anywhere, apart,
are, around, as, aside, at, away, be, ..., yourself.

28
Raw Frequency with Stoplisting

s (7875), he (6716), you (3838), New (3319), we
(3292), one (3267), my (2078), Zealand (1985),
time (1920), like (1607), me (1602), two (1589),
people (1583), first (1393), now (1285), back
(1208), years (1145), way (1079), work (1041),
and made (1019).
Only "New" and "Zealand" appeared typical of the
corpus of New Zealand English.
This shows the need for more sophisticated
measures.
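
Raw frequency with stoplisting amounts to counting tokens and dropping those on the stoplist. A minimal sketch, using a tiny made-up stoplist and sample sentence (both hypothetical, not the Glasgow stoplist or the corpus data):

```python
import re
from collections import Counter

# Hypothetical miniature stoplist, for illustration only.
STOPLIST = {"the", "of", "and", "to", "a", "in", "is", "was", "it", "that"}

def top_words(text, stoplist, k):
    """Return the k most frequent tokens that are not on the stoplist."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stoplist)
    return counts.most_common(k)

sample = "The Maori name of the bay is in the guide to the bay and the island"
print(top_words(sample, STOPLIST, 1))  # [('bay', 2)]
```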

29
TF.IDF

Takes into account both the frequency of a word
in a corpus (TF, term frequency) and the inverse
of the number of corpora the word appears in
(IDF, inverse document frequency).
The highest scores are given to words which are
common in the corpus we are looking at, but do
not occur in many other corpora.
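
The exact weighting formula is not given on the slide. A minimal sketch, assuming one common variant, TF x ln(N / DF), where N is the number of corpora and DF the number of corpora containing the word; the corpus names and counts below are made up for illustration.

```python
import math

def tf_idf(term, corpus_counts, target):
    """TF.IDF of `term` in corpus `target`, given {corpus: {word: frequency}}."""
    tf = corpus_counts[target].get(term, 0)
    df = sum(1 for counts in corpus_counts.values() if counts.get(term, 0) > 0)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus_counts) / df)

# Hypothetical toy frequencies.
toy = {
    "NZ": {"maori": 10, "the": 100},
    "UK": {"the": 120},
    "US": {"the": 110},
}
print(tf_idf("maori", toy, "NZ"))  # high: frequent here, absent elsewhere
print(tf_idf("the", toy, "NZ"))    # 0.0: occurs in every corpus
```

Under this variant any word found in all the corpora scores zero, however frequent it is.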

30
20 Words in the NZ Corpus with Highest TF.IDF

Maori (1504.8), pakeha (339.5), Auckland (304.4), Otago (180.2), Dunedin (136.8), Waikato (135.1), Christchurch (127.7), Wellington (112.0), Waitangi (107.8), Aotearoa (91.7), Hutt (91.7), Ngati (83.6), Rotorua (75.6), Maoris (74.2), moa (72.4), Te (68.7), NZPA (67.5), marae (65.9), ANZUS (62.7), TVNZ (62.7), Waitaki (59.5) and Invercargill (57.9).
This suggests that TF.IDF is a good measure for finding words typical of a corpus.

31
Deviation from Randomness

One component is the Bose-Einstein probability.
If λ is the mean frequency of term t across all the corpora, the Bose-Einstein probability is the probability that the term occurs exactly f times in one of the corpora.
Words which occur much more often in one corpus than they do on average across the corpora are typical of that corpus, and have low Bose-Einstein probability.

32
Inf1 is the negative of log base 2 of the
Bose-Einstein probability, so words typical of a
corpus will have high Inf1.
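
The probability formula was on a slide that was not transcribed. A minimal sketch, assuming the geometric approximation to Bose-Einstein statistics commonly used in the Deviation from Randomness framework, P(f) = (1 / (1 + m)) x (m / (1 + m))^f, where m is the mean frequency of the term across the corpora. Whether the study used exactly this form is an assumption, and the function names are hypothetical.

```python
import math

def bose_einstein_probability(f, mean_freq):
    """Assumed geometric approximation: probability of exactly f occurrences."""
    return (1.0 / (1.0 + mean_freq)) * (mean_freq / (1.0 + mean_freq)) ** f

def inf1(f, mean_freq):
    """Inf1 = -log2 of the Bose-Einstein probability, computed in log space."""
    return math.log2(1.0 + mean_freq) + f * math.log2((1.0 + mean_freq) / mean_freq)

# A term with mean frequency 1 across the corpora:
print(inf1(0, 1.0))  # 1.0
print(inf1(3, 1.0))  # 4.0
```

The further f rises above the mean, the smaller the probability and the larger Inf1.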
33
The 20 words with highest Inf1 for the corpus of
NZ English were:

Maori (28.66), Auckland (28.52), Pakeha (28.47), Otago (28.46), Wellington (28.16), Dunedin (28.12), Waikato (28.11), Christchurch (28.10), Waitangi (28.11), Maoris (27.85), Aotearoa (27.84), Hutt (27.74), Ngati (27.76), Zealand (27.76), Rotorua (27.67), moa (27.62), NZPA (27.55), Zealanders (27.53), marae (27.52), Te (27.52).
On its own, Inf1 appears to be a good indicator
of which words are typical of a corpus.

34
Kullback-Leibler Divergence and Relevance
Feedback ("more like this")
35
KLD(t)

pR(t) is the number of times that word is found in relevant documents, divided by the total number of words in relevant documents.
pC(t) is the number of times that word is found in the entire document collection, divided by the total number of words in the entire document collection.
µ is a tuning parameter, which worked best when set to 0.5.
Instead of relevant documents we discuss the corpus of interest, and instead of non-relevant documents we have the other comparison corpora.
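
The KLD(t) formula itself was not transcribed. A minimal sketch, assuming the basic per-word form KLD(t) = pR(t) x log2(pR(t) / pC(t)) built from the two probabilities defined above. How µ enters the formula is not stated in this transcript, so it is left out here; the log base, function names and toy counts are also assumptions.

```python
import math

def kld_term(count_r, total_r, count_c, total_c):
    """Per-word contribution pR(t) * log2(pR(t) / pC(t)); zero if the word is absent."""
    if count_r == 0 or count_c == 0:
        return 0.0
    p_r = count_r / total_r   # relative frequency in the corpus of interest
    p_c = count_c / total_c   # relative frequency in the whole collection
    return p_r * math.log2(p_r / p_c)

def kld_total(counts_r, counts_c):
    """Sum of the per-word values over every word in the corpus of interest."""
    total_r = sum(counts_r.values())
    total_c = sum(counts_c.values())
    return sum(kld_term(n, total_r, counts_c.get(w, 0), total_c)
               for w, n in counts_r.items())

# Hypothetical toy counts; the collection includes the corpus of interest.
corpus_of_interest = {"maori": 5, "the": 5}
collection = {"maori": 5, "the": 45, "rupees": 50}
print(kld_term(5, 10, 5, 100))    # "maori": strongly over-represented
print(kld_term(5, 10, 45, 100))   # "the": close to its collection-wide rate
print(kld_total(corpus_of_interest, collection))
```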

36
The 20 highest scoring words for NZ English were:

Zealand (1141), Maori (567), Auckland (359),
Wellington (297), Te (175), Christchurch (148),
Pakeha (128), Canterbury (89), Zealanders (82),
Otago (67), Pacific (57), Rugby (52), Dunedin
(51), Waikato (50), Maoris (48), NZ (44), Bay
(44), Waitangi (40), Aotearoa (34), Hutt (34).
Values in millionths.
All these words appear typical of NZ English.
KLD(t) is a value for a single word. We can add
together the KLD(t) values for every word, to
derive a single value KLD(Dr, Dc) showing the
divergence between relevant documents and
non-relevant documents. It thus gives a measure
of corpus similarity.

37
Information Gain

Whereas the other measures tell us something about the strength of the association between a word and a corpus, IG is a single value for the power of a word to discriminate between corpora.
As an exercise in judging the usefulness of this measure, look at the 20 words in all five corpora with highest IG, and try to guess the corpora they are most typical of:
Zealand (332), Maori (213), India (153), Auckland (130), Australian (104), Wellington (98), Rs (Rupees) (75), Gandhi (73), Pounds (68), Clinton (67), Janata (65), Australia (64), Delhi (54), Singh (54), Queensland (50), Bombay (50), Aboriginal (50), Christchurch (49), pakeha (48), NSW (40). These IG values are in millionths.
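
The IG formula is not shown on the slide. A minimal sketch, assuming the document-level definition from Yang and Pedersen (1997): the entropy of the corpus labels minus the expected entropy once we know whether a section contains the word. Treating the 2000-word sections as the documents is an assumption, and the names and toy counts are hypothetical.

```python
import math

def entropy(counts):
    """Entropy in bits of the distribution given by a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(sections_per_corpus, sections_with_term):
    """IG of a word: H(corpus) - H(corpus | word present or absent)."""
    corpora = list(sections_per_corpus)
    total = sum(sections_per_corpus.values())
    present = [sections_with_term.get(c, 0) for c in corpora]
    absent = [sections_per_corpus[c] - sections_with_term.get(c, 0) for c in corpora]
    p_present = sum(present) / total
    prior = entropy([sections_per_corpus[c] for c in corpora])
    conditional = p_present * entropy(present) + (1 - p_present) * entropy(absent)
    return prior - conditional

sections = {"A": 10, "B": 10}                       # toy corpora of 10 sections each
print(information_gain(sections, {"A": 10}))        # 1.0: perfectly discriminating
print(information_gain(sections, {"A": 5, "B": 5})) # 0.0: no discriminating power
```

Note that the score says how well the word separates the corpora, not which corpus it favours.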

38
Conclusions (1)

In corpus linguistics, interest is mainly in the
language used in corpora, while in information
retrieval we are mainly interested in the
information conveyed by a document.
In IR, function words on a stoplist are
routinely discarded, since these are not related
to the topic of a document, but in CL, such words
tell us a great deal about the grammatical
structures used in a corpus.
The question of which words are characteristic
of a text is common to both IR and CL. A number
of statistical measures are thus relevant to both
fields of study.

39
Conclusions (2)

Our initial results suggest that the IR measures of TF.IDF, Bose-Einstein probability and Kullback-Leibler Divergence with µ = 0.5 are all good measures for finding the words most typical of New Zealand English.
A variant of KLD measures the divergence between two corpora.
Information Gain provides a single score for a word, reflecting its ability to discriminate between corpora.