Title: Measures from Information Retrieval to Find the Words which are Characteristic of a Corpus
1Measures from Information Retrieval to Find the
Words which are Characteristic of a Corpus.
- Michael Oakes
- University of Sunderland, England.
2Contents
- Background and the ICAME disk
- Two traditional measures: chi-squared and
G-squared (log-likelihood)
- Information Retrieval
3Looking for discriminating vocabulary
- Two classic papers:
- Kilgarriff (1996), Which words are particularly
characteristic of a text? A survey of statistical
approaches.
- Yang and Pedersen (1997), A comparative study on
feature selection in text categorization.
- Identify discriminants: linguistic features more
typical of one form of English than another.
- Automatic categorisation of text types is akin to
automatic topic, genre and author identification
(Souter, 1994).
- Vocabulary differences reveal cultural
differences (Leech and Fallon, 1992).
4Leech and Fallon (1992) compared the vocabulary
in Brown and LOB
- Linguistic contrasts
- Spelling differences: color / colour
- Lexical choice: gasoline / petrol
- Proper nouns (Chicago more common in Brown)
- Non-linguistic contrasts: indicators of
socio-cultural differences between the two
countries.
5Samples of written English on the ICAME CD
6Number of sections of approx. 2000 words in 5
comparable corpora (1)
7Number of sections of approx. 2000 words in 5
comparable corpora (1)
8The chi-squared (X²) test
- See Rayson, Leech and Hodges (1997).
- Case study: Is the word lovely used more often
in speech by men or women?
- Experiment: In the BNC conversational corpus, men
say lovely 414 times while women say lovely
1214 times.
- Statistics: Is this due to chance, or does the
use of this word genuinely vary with the gender
of the speaker? Use the chi-squared test.
- Contingency table of observed values O: see next
slide
9Contingency table of observed values O
10Contingency table of expected values E
11The chi-squared test (2)
- Expected frequencies E
- E = row total × column total / grand total
- e.g. E (lovely, men) = 1628 × 1714443 / 4307895 = 647.9
- See previous table
- X² = Σ (O − E)² / E
- Find (O − E)² / E for every box in the table,
- e.g. (O − E)² / E for (lovely, men)
- = (414 − 647.9)² / 647.9 = 84.4.
- X² = sum (Σ) for all four boxes
- = 84.4 + 55.8 + 0.0 + 0.0 = 140.2
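The worked example above can be checked with a short sketch. The counts are those quoted on the slides; the function name chi_squared and the table layout (rows: lovely / all other words, columns: men / women) are illustrative assumptions, not the author's code.

```python
# Counts quoted on the slides: "lovely" by speaker gender in the BNC
# conversational corpus, plus the column total for men and the grand total.
lovely_men, lovely_women = 414, 1214
total_men, grand_total = 1714443, 4307895
total_women = grand_total - total_men

# 2 x 2 contingency table of observed values O.
observed = [
    [lovely_men, lovely_women],
    [total_men - lovely_men, total_women - lovely_women],
]

def chi_squared(table):
    """Sum of (O - E)^2 / E over every cell of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency E
            x2 += (o - e) ** 2 / e
    return x2

x2 = chi_squared(observed)
print(f"X2 = {x2:.1f}")
```

Working from unrounded expected values gives a total slightly different from the hand calculation, which rounds each cell to one decimal place.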
12G² or Log-Likelihood
13G² vs. Chi-squared
- The chi-squared test is an approximation to the
G² test, easier to calculate in the days before
PCs and pocket calculators (Wikipedia)
- Both can be used to compare corpora of different
sizes
- The only restriction is that the expected values
must be > 5 (Moore 2004, Rayson et al., 2004)
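For comparison, here is a sketch of G² on the same lovely table. The transcript does not reproduce the formula, so the standard form G² = 2 Σ O ln(O / E) is assumed; the function name g_squared is illustrative.

```python
import math

def g_squared(table):
    """Log-likelihood statistic, assuming G2 = 2 * sum(O * ln(O / E))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            if o > 0:  # a cell with O = 0 contributes nothing
                total += o * math.log(o / e)
    return 2.0 * total

# Same counts as the chi-squared case study (rows: lovely / other words,
# columns: men / women).
observed = [
    [414, 1214],
    [1714443 - 414, (4307895 - 1714443) - 1214],
]
g2 = g_squared(observed)
print(f"G2 = {g2:.1f}")
```

On this table the two statistics are of the same order, and both are far above the single-test critical value of 10.83.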
14The 20 Words Most Typical of New Zealand English
15Bonferroni Correction
- Controls the family-wise error rate
- For a single test, X² or G² > 10.83 is
significant at the 0.1% level.
- In comparing the vocabulary across the five
corpora, we effectively perform 101,984 tests
because there are 101,984 unique word types
across the 5 corpora.
- To find the appropriate critical value we divided
0.001 by 101,984 to give an adjusted significance
level of 9.805 × 10⁻⁹.
- We then identify words with chi-squared
contributions > 32.9
- Not more than 0.1% of the words selected in this
way will have been incorrectly identified, since
the Bonferroni correction is conservative.
- We are more interested in ranking than absolute
values.
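The critical values quoted above can be reproduced with the standard library alone. This sketch assumes one degree of freedom, for which the chi-squared upper-tail probability is erfc(sqrt(x / 2)); the helper names are hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of chi-squared with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def critical_value(alpha, lo=0.0, hi=200.0):
    """Find x with chi2_sf_1df(x) == alpha by bisection (sf is decreasing)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_1df(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

alpha = 0.001        # 0.1% significance level for a single test
n_tests = 101984     # unique word types across the five corpora
adjusted_alpha = alpha / n_tests  # Bonferroni-adjusted level

single = critical_value(alpha)
adjusted = critical_value(adjusted_alpha)
print(f"adjusted alpha = {adjusted_alpha:.3e}")
print(f"single-test critical value = {single:.2f}")
print(f"Bonferroni critical value  = {adjusted:.1f}")
```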
16Dispersion
- Dispersion measures show how evenly or otherwise
a word is distributed throughout a corpus (Lyne
1985, 1986).
- In this study, we should only consider words
which are relatively evenly spread throughout the
corpus.
- E.g. thalidomide, ranked 15th most typical of UK,
occurs all 55 times in a single medical article.
17Juilland's D (1)
- Divide the corpus into n contiguous subsections
(we used 5).
- Commonwealth was found 31, 8, 32, 88, 5 times
respectively in the Australian corpus.
- The standard deviation of the number of times the
word is found in each subsection is 29.79, and the
mean frequency is 32.8.
18Juilland's D (2)
- To account for the fact that the standard
deviation tends to be higher for more frequent
words, it is divided by the mean frequency to
give the coefficient of variation V = 29.79 /
32.8 = 0.908
- The coefficient of dispersion falls in the range
0 to 1.
- D = 1 − V / sqrt(n − 1) = 0.546 for commonwealth
- Empirical finding: keep if D > 0.3, range > 3.
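The Commonwealth example can be reproduced as follows. The sketch assumes the population standard deviation (dividing by n), since that is what matches the figures on the slide; the function name juilland_d is illustrative.

```python
import math
import statistics

def juilland_d(counts):
    """Juilland's D = 1 - V / sqrt(n - 1), with V = sd / mean."""
    n = len(counts)
    mean = statistics.mean(counts)
    sd = statistics.pstdev(counts)  # population standard deviation (assumption)
    v = sd / mean                   # coefficient of variation
    return 1.0 - v / math.sqrt(n - 1)

# Occurrences of "Commonwealth" in the 5 subsections of the Australian corpus.
counts = [31, 8, 32, 88, 5]
d = juilland_d(counts)
keep = d > 0.3  # the D threshold quoted on the slide
print(f"D = {d:.3f}, keep = {keep}")
```

A word whose occurrences all fall in one of five subsections, such as thalidomide in the British corpus, gets D = 0 under this formula and would be filtered out.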
19The Australian list
- 18 of the top 19 are people and places
- Exception is Commonwealth (of Australia)
- Politics: Premier, Senator, Hawke, Whitlam, ALP,
Labor, BHP
- Employment rights: unions, unemployed,
superannuation
20The British list
- People and places
- Institutions: NHS, BBC
- Politics: Tory, Labour
- EC (European Community)
- Historical epochs: century, eighteenth
- Aristocratic titles: Duke, Lord(s), Prince, Royal
21The Indian List
- People and places
- Currency: Rs (rupees)
- Numbers: mn (million), crores (ten million),
lakhs (one hundred thousand).
- Function words: the, of, in, upto (single word)
- Religion: Buddha (86.0), Buddhism (45.4), divine
(150.6), Gita (119.3), God (37.8), Gods (78.6),
Goddess (44.4), Hindu (299.5), Hindus (148.1),
Karma (61.4), Muslim (151.8), Muslims (42.2),
mystic (53.1), Mystics (100.7), pandit (104.4),
Saints (35.6), Sikh (80.0), Swami (131.2), temple
(248.8), temples (104.2), Vedas (101.4), Vedic
(102.9), yoga (97.7).
22The New Zealand list
- Place names
- Pakeha (person of European descent)
- The natural world: bay, forest, harbour,
island(s), landscape.
- Rugby
23The U.S. list
- Few people and places
- Spelling variants: toward, percent, programs,
defense, program, color, behavior, labor, fiber,
gray, theater, favorite, favor, colors,
organization
- Inclusiveness: black, gender, white
24Measures from Information Retrieval
- The main difference from corpus linguistics is that
we are interested in the information itself
rather than its linguistic style.
- Raw frequency with stoplisting
- TF.IDF
- Deviation from Randomness
- Kullback-Leibler Divergence
26Raw Frequency
- Most frequent words in the New Zealand corpus
- the (67355), of (32182), and (28678), to (26552),
a (23558), in (20519), is (10284), was (10081),
it (9814), that (9743), for (9341), I (7844), on
(7629), s (7585), with (7185), as (7027), he
(6716), be (6297), at (5530), by (5207)
27The Glasgow Stoplist
- a, about, above, across, adj, after, again,
against, all, almost, alone, along, also,
although, always, am, among, an, and, another,
any, anybody, anyone, anything, anywhere, apart,
are, around, as, aside, at, away, be, ..., yourself.
28Raw Frequency with Stoplisting
- s (7875), he (6716), you (3838), New (3319), we
(3292), one (3267), my (2078), Zealand (1985),
time (1920), like (1607), me (1602), two (1589),
people (1583), first (1393), now (1285), back
(1208), years (1145), way (1079), work (1041),
and made (1019)
- Only New and Zealand appeared typical of the
corpus of New Zealand English.
- This shows the need for more sophisticated
measures.
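A minimal sketch of raw frequency with stoplisting. The stoplist here is a tiny hypothetical stand-in for the Glasgow stoplist, and the token list is an invented toy sentence rather than corpus data.

```python
from collections import Counter

# Hypothetical miniature stoplist; the Glasgow stoplist is much longer.
STOPLIST = {"a", "and", "in", "is", "of", "the", "to", "was"}

def top_words(tokens, stoplist, k=20):
    """Return the k most frequent tokens that are not on the stoplist."""
    counts = Counter(t for t in tokens if t.lower() not in stoplist)
    return counts.most_common(k)

# Toy input for illustration only.
tokens = "the Maori of New Zealand and the people of New Zealand".split()
top = top_words(tokens, STOPLIST, k=3)
print(top)
```

Stoplisting removes function words but still ranks purely by frequency, so common content words such as time or people survive, as the list above shows.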
29TF.IDF
- Takes into account both the frequency of a word
in a corpus (TF, term frequency) and the inverse
of the number of corpora the word appears in
(IDF, inverse document frequency).
- The highest scores are given to words which are
common in the corpus we are looking at, but do
not occur in many other corpora.
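The slides do not state which TF.IDF variant was used. The sketch below assumes the common form TF × ln(N / DF), where N is the number of corpora and DF the number of corpora containing the word; the toy corpora and the function name tf_idf are illustrative.

```python
import math
from collections import Counter

def tf_idf(corpora, target):
    """Score each word of the target corpus as TF * ln(N / DF).

    corpora maps a corpus name to its list of tokens. The weighting
    formula is an assumption, not taken from the slides.
    """
    n = len(corpora)
    df = Counter()
    for tokens in corpora.values():
        df.update(set(tokens))  # count each word once per corpus
    tf = Counter(corpora[target])
    return {w: f * math.log(n / df[w]) for w, f in tf.items()}

# Invented toy corpora for illustration only.
corpora = {
    "NZ": "the maori of the hutt and the marae".split(),
    "UK": "the duke of the nhs and the bbc".split(),
    "US": "the color of the program and the theater".split(),
}
scores = tf_idf(corpora, "NZ")
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Under this variant a word found in every corpus scores zero however frequent it is, which is why function words drop out without a stoplist.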
3020 Words in the NZ Corpus with Highest TF.IDF
- Maori (1504.8), pakeha (339.5), Auckland (304.4),
Otago (180.2), Dunedin (136.8), Waikato (135.1),
Christchurch (127.7), Wellington (112.0),
Waitangi (107.8), Aotearoa (91.7), Hutt (91.7),
Ngati (83.6), Rotorua (75.6), Maoris (74.2), moa
(72.4), Te (68.7), NZPA (67.5), marae (65.9),
ANZUS (62.7), TVNZ (62.7), Waitaki (59.5) and
Invercargill (57.9)
- This suggests that TF.IDF is a good measure for
finding words typical of a corpus.
31Deviation from Randomness
- One component is Bose-Einstein probability
- If λ is the mean frequency of term t across all
the corpora, the Bose-Einstein probability is the
probability that a term occurs exactly f times in
one of the corpora
- Words which occur much more often in one corpus
than they do on average across the corpora are
typical of that corpus, and have low
Bose-Einstein probability.
32Inf1 is the negative of log base 2 of the
Bose-Einstein probability, so words typical of a
corpus will have high Inf1
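The transcript does not give the Bose-Einstein formula. The sketch below assumes the geometric approximation commonly used in Deviation from Randomness models, P(f) = (1 / (1 + λ)) · (λ / (1 + λ))^f, and works in logarithms to avoid underflow. The counts are hypothetical, so the output is not expected to match the values reported on the slides.

```python
import math

def inf1(f, lam):
    """-log2 of the Bose-Einstein probability (geometric approximation).

    Assumes P(f) = (1 / (1 + lam)) * (lam / (1 + lam)) ** f, computed
    in log space so that large f does not underflow.
    """
    return math.log2(1.0 + lam) + f * math.log2((1.0 + lam) / lam)

# Hypothetical counts of two words across five corpora.
concentrated = [300, 0, 0, 0, 0]   # all occurrences in the first corpus
even = [60, 60, 60, 60, 60]        # same total, evenly spread

lam_c = sum(concentrated) / len(concentrated)  # mean frequency = 60
lam_e = sum(even) / len(even)                  # mean frequency = 60

inf1_concentrated = inf1(concentrated[0], lam_c)
inf1_even = inf1(even[0], lam_e)
print(round(inf1_concentrated, 2), round(inf1_even, 2))
```

With the same mean frequency, the word concentrated in one corpus receives the higher Inf1, which is the behaviour described above.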
33The 20 words with highest Inf1 for the corpus of
NZ English were
- Maori (28.66), Auckland (28.52), Pakeha (28.47),
Otago (28.46), Wellington (28.16), Dunedin
(28.12), Waikato (28.11), Christchurch (28.10),
Waitangi (28.11), Maoris (27.85), Aotearoa
(27.84), Hutt (27.74), Ngati (27.76), Zealand
(27.76), Rotorua (27.67), moa (27.62), NZPA
(27.55), Zealanders (27.53), marae (27.52), Te
(27.52).
- On its own, Inf1 appears to be a good indicator
of which words are typical of a corpus.
34Kullback-Leibler Divergence and Relevance
Feedback (more like this)
35KLD(t)
- pR(t) is the number of times that word is found
in relevant documents, divided by the total
number of words in relevant documents
- pC(t) is the number of times that word is found in
the entire document collection, divided by the total
number of words in the entire document collection
- µ is a tuning parameter, which worked best when
set to 0.5
- Instead of relevant documents we discuss the
corpus of interest, and instead of non-relevant
documents we have the other comparison corpora.
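The KLD(t) formula itself is missing from the transcript. The sketch below assumes the usual per-term contribution pR(t) · log2(pR(t) / pC(t)) and leaves out µ, because the slides do not say how µ enters the formula. The toy corpora and the function name kld_terms are illustrative.

```python
import math
from collections import Counter

def kld_terms(target_tokens, collection_tokens):
    """Per-word KLD contribution pR(t) * log2(pR(t) / pC(t)).

    The collection must include the target corpus, so pC(t) > 0 for
    every word scored. The tuning parameter mu is deliberately omitted.
    """
    fr = Counter(target_tokens)
    fc = Counter(collection_tokens)
    nr, nc = len(target_tokens), len(collection_tokens)
    scores = {}
    for word, f in fr.items():
        p_r = f / nr            # relative frequency in corpus of interest
        p_c = fc[word] / nc     # relative frequency in whole collection
        scores[word] = p_r * math.log2(p_r / p_c)
    return scores

# Invented toy corpora for illustration only.
nz = "the maori of the hutt and the marae".split()
uk = "the duke of the nhs and the bbc".split()
us = "the color of the program and the theater".split()

scores = kld_terms(nz, nz + uk + us)
total_kld = sum(scores.values())  # single value for the whole corpus
print(sorted(scores, key=scores.get, reverse=True)[:3], round(total_kld, 3))
```

Summing the per-word values gives one divergence figure for the corpus against the collection.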
36The 20 highest scoring words for NZ English were
- Zealand (1141), Maori (567), Auckland (359),
Wellington (297), Te (175), Christchurch (148),
Pakeha (128), Canterbury (89), Zealanders (82),
Otago (67), Pacific (57), Rugby (52), Dunedin
(51), Waikato (50), Maoris (48), NZ (44), Bay
(44), Waitangi (40), Aotearoa (34), Hutt (34).
Values in millionths.
- All these words appear typical of NZ English
- KLD(t) is a value for a single word. We can add
together the KLD(t) values for every word, to
derive a single value KLD(Dr, Dc) showing the
divergence between relevant documents and
non-relevant documents. It thus gives a measure
of corpus similarity.
37Information Gain
- Whereas the other measures tell us something
about the strength of the association between a
word and a corpus, IG is a single value for the
power of a word to discriminate between corpora.
- As an exercise in judging the usefulness of this
measure, look at the 20 words in all five corpora
with highest IG, and try to guess the corpora
they are most typical of:
- Zealand (332), Maori (213), India (153),
Auckland (130), Australian (104), Wellington
(98), Rs (Rupees) (75), Gandhi (73), Pounds (68),
Clinton (67), Janata (65), Australia (64), Delhi
(54), Singh (54), Queensland (50), Bombay (50),
Aboriginal (50), Christchurch (49), pakeha (48),
NSW (40). These IG values are in millionths.
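The slides do not show how IG was computed. The sketch below assumes the document-level definition used in text categorization: the entropy of the corpus labels minus the expected entropy after splitting the 2000-word sections by presence or absence of the word. All section counts are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(with_term, totals):
    """IG of a word over corpora.

    with_term[c] is the number of sections of corpus c containing the
    word; totals[c] is the number of sections in corpus c.
    """
    n = sum(totals)
    n_with = sum(with_term)
    without = [t - w for t, w in zip(totals, with_term)]
    n_without = n - n_with
    return (entropy(totals)
            - (n_with / n) * entropy(with_term)
            - (n_without / n) * entropy(without))

# Hypothetical section counts for five corpora of 500 sections each.
totals = [500, 500, 500, 500, 500]
ig_skewed = information_gain([400, 5, 5, 5, 5], totals)       # one corpus
ig_even = information_gain([100, 100, 100, 100, 100], totals)  # evenly spread
print(round(ig_skewed, 4), round(ig_even, 4))
```

A word spread evenly over the corpora gains nothing, while one concentrated in a single corpus scores highly, without the score saying which corpus it is typical of.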
38Conclusions (1)
- In corpus linguistics, interest is mainly in the
language used in corpora, while in information
retrieval we are mainly interested in the
information conveyed by a document
- In IR, function words on a stoplist are
routinely discarded, since these are not related
to the topic of a document, but in CL, such words
tell us a great deal about the grammatical
structures used in a corpus.
- The question of which words are characteristic
of a text is common to both IR and CL. A number
of statistical measures are thus relevant to both
fields of study.
39Conclusions (2)
- Our initial results suggest that the IR measures
of TF.IDF, Bose-Einstein probability and
Kullback-Leibler Divergence with µ = 0.5 are all
good measures for finding the words most typical
of New Zealand English.
- A variant of KLD measures the divergence between
two corpora
- Information Gain provides a single score for a
word, reflecting its ability to discriminate
between corpora.