Measures from Information Retrieval to Find the Words which are Characteristic of a Corpus

1
Measures from Information Retrieval to Find the
Words which are Characteristic of a Corpus.
  • Michael Oakes
  • University of Sunderland, England.

2
Contents
  • Background and the ICAME disk
  • Two traditional measures: chi-squared and
    G-squared (log-likelihood)
  • Information Retrieval

3
Looking for discriminating vocabulary
  • Two classic papers: Kilgarriff (1996), "Which
    words are particularly characteristic of a text?
    A survey of statistical approaches", and
  • Yang and Pedersen (1997), "A comparative study on
    feature selection in text categorization".
  • Identify discriminants: linguistic features more
    typical of one form of English than another.
  • Automatic categorisation of text types is akin to
    automatic topic, genre and author identification
    (Souter, 1994).
  • Vocabulary differences reveal cultural
    differences (Leech and Fallon, 1992).

4
Leech and Fallon (1992) compared the vocabulary
in Brown and LOB
  • Linguistic contrasts:
  • Spelling differences: color / colour
  • Lexical choice: gasoline / petrol
  • Proper nouns (Chicago more common in Brown)
  • Non-linguistic contrasts: indicators of
    socio-cultural differences between the two
    countries.

5
Samples of written English on the ICAME CD
6
Number of sections of approx. 2000 words in 5
comparable corpora (1)
7
Number of sections of approx. 2000 words in 5
comparable corpora (1)
8
The chi-squared (X²) test
  • See Rayson, Leech & Hodges (1997).
  • Case study: Is the word "lovely" used more often
    in speech by men or by women?
  • Experiment: In the BNC conversational corpus, men
    say "lovely" 414 times while women say "lovely"
    1214 times.
  • Statistics: Is this due to chance, or does the
    use of this word genuinely vary with the gender
    of the speaker? Use the chi-squared test.
  • Contingency table of observed values O: see next
    slide

9
Contingency table of observed values O
10
Contingency table of expected values E
11
The chi-squared test (2)
  • Expected frequencies E:
  • E = row total × column total / grand total
  • e.g. E (lovely, men) = 1628 × 1714443 / 4307895 = 647.9
  • See previous table
  • X² = Σ (O − E)² / E
  • Find (O − E)² / E for every box in the table,
  • e.g. (O − E)² / E for (lovely, men)
  • = (414 − 647.9)² / 647.9 = 84.4.
  • X² = sum (Σ) over all four boxes
  • = 84.4 + 55.8 + 0.0 + 0.0 = 140.2
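
The calculation on this slide can be checked with a short Python sketch. The counts for "lovely" and the totals are the ones quoted above; the second row of the table ("all other words") is an assumption, obtained by subtracting the "lovely" counts from the column totals. The function name `chi_squared` is illustrative only.

```python
# Counts quoted on the slides: "lovely" spoken by men and by women,
# the total number of words spoken by men, and the grand total.
lovely_men, lovely_women = 414, 1214
total_men, grand_total = 1714443, 4307895
total_women = grand_total - total_men

# Assumption: the second row is "all other words", found by subtraction.
observed = [
    [lovely_men, lovely_women],
    [total_men - lovely_men, total_women - lovely_women],
]


def chi_squared(observed):
    """Return expected values, per-cell contributions and the X² total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E = row total x column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # (O - E)^2 / E for every box in the table
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
        for o_row, e_row in zip(observed, expected)
    ]
    total = sum(sum(row) for row in contributions)
    return expected, contributions, total


expected, contributions, chi2 = chi_squared(observed)
print(round(expected[0][0], 1), round(contributions[0][0], 1), round(chi2, 1))
```

Working at full precision gives a total slightly above the slide's 140.2, because the slide adds contributions that were already rounded to one decimal place.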

12
G² or Log-Likelihood
13
G² vs. Chi-squared
  • The chi-squared test is an approximation to the
    G² test, easier to calculate in the days before
    PCs and pocket calculators (Wikipedia)
  • Both can be used to compare corpora of different
    sizes
  • The only restriction is that the expected values
    must be > 5 (Moore 2004, Rayson et al., 2004)
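
The formula from the G² slide did not survive in this transcript. As a hedged sketch, the standard log-likelihood ratio statistic is G² = 2 Σ O ln(O / E), with E computed as for chi-squared. The code below applies it to the "lovely" table; the "all other words" row is again an assumption derived by subtraction from the quoted totals, and `g_squared` is an illustrative name.

```python
import math


def g_squared(observed):
    """Log-likelihood ratio statistic G² = 2 * sum(O * ln(O / E))."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            if o > 0:  # a cell with O = 0 contributes nothing
                total += o * math.log(o / e)
    return 2.0 * total


# "lovely" vs. all other words (assumed row), men vs. women.
observed = [
    [414, 1214],
    [1714443 - 414, (4307895 - 1714443) - 1214],
]
g2 = g_squared(observed)
print(round(g2, 1))
```

For this table G² comes out in the same region as X², which is what one would expect when all expected values are well above 5.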

14
The 20 Words Most Typical of New Zealand English
15
Bonferroni Correction
  • Controls the family-wise error rate
  • For a single test, X² or G² > 10.83 is
    significant at the 0.1% level.
  • In comparing the vocabulary across the five
    corpora, we effectively perform 101,984 tests
    because there are 101,984 unique word types
    across the 5 corpora.
  • To find the appropriate critical value we divided
    0.001 by 101,984 to give an adjusted significance
    level of 9.805 × 10⁻⁹.
  • We then identify words with chi-squared
    contributions > 32.9
  • Not more than 0.1% of the words selected in this
    way will have been incorrectly identified, since
    the Bonferroni correction is conservative.
  • We are more interested in ranking than absolute
    values.
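
The adjusted significance level and the critical value of 32.9 can be reproduced with the standard library alone. This is a sketch under the assumption that the tests have one degree of freedom (as for a 2 × 2 table); for one degree of freedom the chi-squared tail probability is erfc(sqrt(x / 2)). The function names are illustrative.

```python
import math


def chi2_sf_1df(x):
    """P(X² > x) for a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def critical_value_1df(alpha, lo=0.0, hi=200.0):
    """Find x with chi2_sf_1df(x) = alpha by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_1df(mid) > alpha:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


n_tests = 101_984          # unique word types across the five corpora
alpha = 0.001              # the 0.1% level
adjusted_alpha = alpha / n_tests
critical = critical_value_1df(adjusted_alpha)

print(adjusted_alpha)
print(round(critical_value_1df(alpha), 2), round(critical, 1))
```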

16
Dispersion
  • Dispersion measures show how evenly or otherwise
    a word is distributed throughout a corpus (Lyne
    1985, 1986).
  • In this study, we should only consider words
    which are relatively evenly spread throughout the
    corpus.
  • E.g. thalidomide, ranked 15th most typical of UK,
    occurs all 55 times in a single medical article.

17
Juilland's D (1)
  • Divide the corpus into n contiguous subsections
    (we used 5).
  • Commonwealth was found 31, 8, 32, 88, 5 times
    respectively in the Australian corpus.
  • The standard deviation of the number of times the
    word is found in each subsection is 29.79, and the
    mean frequency is 32.8.

18
Juilland's D (2)
  • To account for the fact that the standard
    deviation tends to be higher for more frequent
    words, it is divided by the mean frequency to
    give the coefficient of variation V = 29.79 /
    32.8 = 0.908
  • The coefficient of dispersion falls in the range
    0 to 1.
  • D = 1 − V / sqrt(n − 1) = 0.546 for commonwealth
  • Empirical finding: keep if D > 0.3, range > 3.
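
The worked example for "commonwealth" can be reproduced as follows. This sketch assumes the population standard deviation (dividing by n rather than n − 1), which gives roughly 29.8 for the five counts and so agrees with the figures on the slides; `juilland_d` is an illustrative name.

```python
import math
import statistics


def juilland_d(freqs):
    """Juilland's D = 1 - V / sqrt(n - 1), where V = sd / mean."""
    n = len(freqs)
    mean = statistics.mean(freqs)
    sd = statistics.pstdev(freqs)  # population standard deviation (assumed)
    v = sd / mean                  # coefficient of variation
    return 1 - v / math.sqrt(n - 1)


# Occurrences of "commonwealth" in the 5 subsections of the Australian corpus.
commonwealth = [31, 8, 32, 88, 5]
d = juilland_d(commonwealth)
print(round(d, 3))
```

A word spread perfectly evenly over the subsections has a standard deviation of 0 and therefore D = 1.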

19
The Australian list
  • 18 of the top 19: people and places
  • Exception is Commonwealth (of Australia)
  • Politics: Premier, Senator, Hawke, Whitlam, ALP,
    Labor, BHP
  • Employment rights: unions, unemployed,
    superannuation

20
The British list
  • People and places
  • Institutions: NHS, BBC
  • Politics: Tory, Labour
  • EC (European Community)
  • Historical epochs: century, eighteenth
  • Aristocratic titles: Duke, Lord(s), Prince, Royal

21
The Indian List
  • People and places
  • Currency: Rs (rupees)
  • Numbers: mn (million), crores (ten million),
    lakhs (hundred thousand).
  • Function words: the, of, in, upto (single word)
  • Religion: Buddha (86.0), Buddhism (45.4), divine
    (150.6), Gita (119.3), God (37.8), Gods (78.6),
    Goddess (44.4), Hindu (299.5), Hindus (148.1),
    Karma (61.4), Muslim (151.8), Muslims (42.2),
    mystic (53.1), Mystics (100.7), pandit (104.4),
    Saints (35.6), Sikh (80.0), Swami (131.2), temple
    (248.8), temples (104.2), Vedas (101.4), Vedic
    (102.9), yoga (97.7).

22
The New Zealand list
  • Place names
  • Pakeha (person of European descent)
  • The natural world: bay, forest, harbour,
    island(s), landscape.
  • Rugby

23
The U.S. list
  • Few people and places
  • Spelling variants: toward, percent, programs,
    defense, program, color, behavior, labor, fiber,
    gray, theater, favorite, favor, colors,
    organization
  • Inclusiveness: black, gender, white

24
Measures from Information Retrieval
  • The main difference from corpus linguistics is
    that we are interested in the information itself
    rather than its linguistic style.
  • Raw frequency with stoplisting
  • TF.IDF
  • Deviation from Randomness
  • Kullback-Leibler Divergence

26
Raw Frequency
  • Most frequent words in the New Zealand corpus:
  • the (67355), of (32182), and (28678), to (26552),
    a (23558), in (20519), is (10284), was (10081),
    it (9814), that (9743), for (9341), I (7844), on
    (7629), s (7585), with (7185), as (7027), he
    (6716), be (6297), at (5530), by (5207)

27
The Glasgow Stoplist
  • a, about, above, across, adj, after, again,
    against, all, almost, alone, along, also,
    although, always, am, among, an, and, another,
    any, anybody, anyone, anything, anywhere, apart,
    are, around, as, aside, at, away, be, ..., yourself.

28
Raw Frequency with Stoplisting
  • s (7875), he (6716), you (3838), New (3319), we
    (3292), one (3267), my (2078), Zealand (1985),
    time (1920), like (1607), me (1602), two (1589),
    people (1583), first (1393), now (1285), back
    (1208), years (1145), way (1079), work (1041),
    and made (1019)
  • Only "New" and "Zealand" appeared typical of the
    corpus of New Zealand English.
  • This shows the need for more sophisticated
    measures.
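
Raw frequency with stoplisting amounts to counting tokens and skipping any that appear on the stoplist. The sketch below uses a tiny illustrative subset of stop words and an invented sentence, not the Glasgow stoplist or the corpus itself; `top_words` is an illustrative name.

```python
from collections import Counter

# Tiny illustrative stoplist; the real Glasgow stoplist is much longer.
STOPLIST = {"a", "about", "and", "in", "is", "of", "the", "to"}


def top_words(tokens, stoplist, k=20):
    """Return the k most frequent tokens that are not on the stoplist."""
    counts = Counter(t for t in tokens if t.lower() not in stoplist)
    return counts.most_common(k)


# Invented toy text, for illustration only.
tokens = "the people of New Zealand and the work of the people".split()
ranked = top_words(tokens, STOPLIST, k=3)
print(ranked)
```

Because the stoplist is applied by lower-casing each token, the original capitalisation of words such as "New" is kept in the output.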

29
TF.IDF
  • Takes into account both the frequency of a word
    in a corpus (TF, term frequency) and the inverse
    of the number of corpora the word appears in
    (IDF, inverse document frequency).
  • The highest scores are given to words which are
    common in the corpus we are looking at, but do
    not occur in many other corpora.
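
The slides do not give the exact TF.IDF formula used. A common form, assumed here, is TF × log(N / DF), where N is the number of corpora and DF is the number of corpora containing the word. The corpus names, words and counts below are invented placeholders, and `tf_idf` is an illustrative name.

```python
import math


def tf_idf(word, corpus_name, counts):
    """TF * log(N / DF): an assumed, commonly used form of TF.IDF."""
    tf = counts[corpus_name].get(word, 0)
    df = sum(1 for c in counts.values() if c.get(word, 0) > 0)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(counts) / df)


# Invented toy counts for five hypothetical corpora A-E.
counts = {
    "A": {"w_common": 500, "w_rare": 30},
    "B": {"w_common": 480},
    "C": {"w_common": 510},
    "D": {"w_common": 495},
    "E": {"w_common": 505},
}

print(tf_idf("w_common", "A", counts))  # found in every corpus
print(tf_idf("w_rare", "A", counts))    # found only in corpus A
```

Under this form, a word that occurs in all five corpora scores zero however frequent it is, which is why function words drop out without a stoplist.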

30
20 Words in the NZ Corpus with Highest TF.IDF
  • Maori (1504.8), pakeha (339.5), Auckland (304.4),
    Otago (180.2), Dunedin (136.8), Waikato (135.1),
    Christchurch (127.7), Wellington (112.0),
    Waitangi (107.8), Aotearoa (91.7), Hutt (91.7),
    Ngati (83.6), Rotorua (75.6), Maoris (74.2), moa
    (72.4), Te (68.7), NZPA (67.5), marae (65.9),
    ANZUS (62.7), TVNZ (62.7), Waitaki (59.5) and
    Invercargill (57.9)
  • This suggests that TF.IDF is a good measure for
    finding words typical of a corpus.

31
Deviation from Randomness
  • One component is Bose-Einstein probability
  • If λ is the mean frequency of term t across all
    the corpora, the Bose-Einstein probability is the
    probability that a term occurs exactly f times in
    one of the corpora
  • Words which occur much more often in one corpus
    than they do on average across the corpora are
    typical of that corpus, and have low
    Bose-Einstein probability.

32
Inf1 is the negative of log base 2 of the
Bose-Einstein probability, so words typical of a
corpus will have high Inf1
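
The Bose-Einstein formula itself is missing from this transcript. The sketch below assumes the geometric approximation commonly used in the Deviation from Randomness literature, P(f) = (1 / (1 + λ)) × (λ / (1 + λ))^f, where λ is the mean frequency of the term; it is not confirmed as the form used for the slides, so it should not be expected to reproduce the reported scores. Function names are illustrative.

```python
import math


def bose_einstein_probability(f, lam):
    """Assumed geometric approximation to the Bose-Einstein probability
    that a term with mean frequency lam occurs exactly f times."""
    return (1.0 / (1.0 + lam)) * (lam / (1.0 + lam)) ** f


def inf1(f, lam):
    """Inf1 = -log2 of the Bose-Einstein probability, computed in log
    space so that very small probabilities do not underflow."""
    return -math.log2(1.0 / (1.0 + lam)) - f * math.log2(lam / (1.0 + lam))


# Invented values: a term averaging 0.5 occurrences per corpus.
print(inf1(1, 0.5), inf1(10, 0.5))
```

With the mean held fixed, Inf1 grows as the observed frequency f grows, so a word that is far more frequent in one corpus than on average receives a high score.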
33
The 20 words with highest Inf1 for the corpus of
NZ English were
  • Maori (28.66), Auckland (28.52), Pakeha (28.47),
    Otago (28.46), Wellington (28.16), Dunedin
    (28.12), Waikato (28.11), Christchurch (28.10),
    Waitangi (28.11), Maoris (27.85), Aotearoa
    (27.84), Hutt (27.74), Ngati (27.76), Zealand
    (27.76), Rotorua (27.67), moa (27.62), NZPA
    (27.55), Zealanders (27.53), marae (27.52), Te
    (27.52).
  • On its own, Inf1 appears to be a good indicator
    of which words are typical of a corpus.

34
Kullback-Leibler Divergence and Relevance
Feedback ("more like this")
35
KLD(t)
  • pR(t) is the number of times the word is found
    in relevant documents, divided by the total
    number of words in relevant documents
  • pC(t) is the number of times the word is found in
    the entire document collection, divided by the total
    number of words in the entire document collection
  • µ is a tuning parameter, which worked best when
    set to 0.5
  • Instead of relevant documents we discuss the
    corpus of interest, and instead of non-relevant
    documents we have the other comparison corpora.
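
The KLD(t) formula is not preserved in this transcript, so the role of µ cannot be recovered from it. As a sketch, the code below uses the basic per-term form pR(t) × log(pR(t) / pC(t)) without the µ parameter, with the natural logarithm (an assumption). The word counts are invented, and the function names are illustrative.

```python
import math


def kld_term(count_r, total_r, count_c, total_c):
    """Per-term contribution pR(t) * log(pR(t) / pC(t)); mu is omitted."""
    p_r = count_r / total_r
    p_c = count_c / total_c
    if p_r == 0:
        return 0.0
    return p_r * math.log(p_r / p_c)


def kld_total(counts_r, counts_c):
    """Sum of kld_term over every word in the corpus of interest."""
    total_r = sum(counts_r.values())
    total_c = sum(counts_c.values())
    return sum(
        kld_term(n, total_r, counts_c[w], total_c)
        for w, n in counts_r.items()
    )


# Invented toy counts: corpus of interest vs. the whole collection,
# which includes the corpus of interest.
counts_r = {"w_typical": 40, "w_common": 60}
counts_c = {"w_typical": 50, "w_common": 450}
print(kld_term(40, 100, 50, 500), kld_total(counts_r, counts_c))
```

Because the collection contains the corpus of interest, any word with a non-zero pR(t) also has a non-zero pC(t), so the ratio is always defined.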

36
The 20 highest scoring words for NZ English were
  • Zealand (1141), Maori (567), Auckland (359),
    Wellington (297), Te (175), Christchurch (148),
    Pakeha (128), Canterbury (89), Zealanders (82),
    Otago (67), Pacific (57), Rugby (52), Dunedin
    (51), Waikato (50), Maoris (48), NZ (44), Bay
    (44), Waitangi (40), Aotearoa (34), Hutt (34).
    Values in millionths.
  • All these words appear typical of NZ English.
  • KLD(t) is a value for a single word. We can add
    together the KLD(t) values for every word, to
    derive a single value KLD(Dr, Dc) showing the
    divergence between relevant documents and
    non-relevant documents. It thus gives a measure
    of corpus similarity.

37
Information Gain
  • Whereas the other measures tell us something
    about the strength of the association between a
    word and a corpus, IG is a single value for the
    power of a word to discriminate between corpora.
  • As an exercise in judging the usefulness of this
    measure, look at the 20 words in all five corpora
    with highest IG, and try to guess the corpora
    they are most typical of:
  • Zealand (332), Maori (213), India (153),
    Auckland (130), Australian (104), Wellington
    (98), Rs (Rupees) (75), Gandhi (73), Pounds (68),
    Clinton (67), Janata (65), Australia (64), Delhi
    (54), Singh (54), Queensland (50), Bombay (50),
    Aboriginal (50), Christchurch (49), pakeha (48),
    NSW (40). These IG values are in millionths.
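
The slides do not show how IG was computed. The sketch below assumes the usual text-categorisation definition (as in Yang and Pedersen, 1997): the entropy of the corpus labels minus the expected entropy after splitting the text sections by presence or absence of the word. Treating each section as one document is an assumption, the counts are invented, and the function names are illustrative.

```python
import math


def entropy(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(with_term, without_term):
    """with_term[i] / without_term[i]: number of sections of corpus i
    that do / do not contain the word."""
    n_with, n_without = sum(with_term), sum(without_term)
    n = n_with + n_without
    class_totals = [a + b for a, b in zip(with_term, without_term)]
    return (
        entropy(class_totals)
        - (n_with / n) * entropy(with_term)
        - (n_without / n) * entropy(without_term)
    )


# Invented toy counts for two corpora of 10 sections each.
print(information_gain([10, 0], [0, 10]))  # word found in only one corpus
print(information_gain([5, 5], [5, 5]))    # word spread evenly
```

A word that occurs in every section of one corpus and in no section of the other separates the two perfectly, while an evenly spread word gains nothing. IG does not say which corpus the word is typical of, which is the point of the guessing exercise above.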

38
Conclusions (1)
  • In corpus linguistics, interest is mainly in the
    language used in corpora, while in information
    retrieval we are mainly interested in the
    information conveyed by a document
  • In IR, function words on a stoplist are
    routinely discarded, since these are not related
    to the topic of a document, but in CL, such words
    tell us a great deal about the grammatical
    structures used in a corpus.
  • The question of which words are characteristic
    of a text is common to both IR and CL. A number
    of statistical measures are thus relevant to both
    fields of study.

39
Conclusions (2)
  • Our initial results suggest that the IR measures
    of TF.IDF, Bose-Einstein probability and
    Kullback-Leibler Divergence when µ = 0.5 are all
    good measures for finding the words most typical
    of New Zealand English.
  • A variant of KLD measures the divergence between
    two corpora
  • Information Gain provides a single score for a
    word, reflecting its ability to discriminate
    between corpora.