Title: Stemming
1. Lecture 6: Linguistic Methods for Searching
- Stemming
- Thesaurus
- Online resources
- Automatic construction of thesaurus
2. Outline of Stemming Methods
- Goal of Stemming Process
- Algorithm
- Affix Removal (Porter's Algorithm)
- Dictionary Look-up Stemmers
- Successor Variety
- n-Gram Stemming
- Applications
3. The advantage
- Stemming was originally designed to improve performance by reducing the demand on system resources.
- With the continued significant increase in storage and computing power, the use of stemming for performance reasons is no longer as important.
4. Other Potentials
- It may improve recall.
- There may be an associated decline in precision.
- System designers make their own choice of whether to include stemming.
- Google does not use stemming.
- Hotbot offers word stemming as a user option.
5. Porter Stemming Algorithm
- The Porter algorithm is the most commonly accepted algorithm.
- It is based upon a set of conditions on the stem, suffix and prefix, and associated actions given the condition.
- See, e.g.,
- http://www.tartarus.org/martin/PorterStemmer/
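6. Porter Stemming (Condition)
- m, the measure of a stem, is a function of sequences of vowels (a, e, i, o, u, and y when preceded by a consonant) followed by a consonant.
- A stem has the form [C](VC)^m[V], where C is a sequence of consonants, V is a sequence of vowels, the initial C and final V are optional, and m is the number of times VC repeats.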
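The measure can be computed by mapping each letter of the stem to C or V and counting the places where a vowel is followed by a consonant. The following is a minimal Python sketch, not Porter's reference implementation. The function names `is_consonant` and `measure` are illustrative, and the sketch assumes Porter's convention that y counts as a vowel only when it follows a consonant.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Return True if word[i] is a consonant under Porter's convention."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a vowel only when it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Each run of vowels followed by a run of consonants contains
    # exactly one "VC" boundary, so counting "VC" gives m.
    return pattern.count("VC")


if __name__ == "__main__":
    for word in ["tree", "by", "trouble", "oats", "troubles", "private"]:
        print(word, measure(word))
```

For instance, `tree` maps to CCVV and has m = 0, `trouble` maps to CCVVCCV and has m = 1, and `troubles` maps to CCVVCCVC and has m = 2.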
7. Porter Stemming (Condition)
- *<X>: the stem ends with the letter X
- *v*: the stem contains a vowel
- *d: the stem ends in a double consonant
- *o: the stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
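Each of these conditions is a simple predicate on the stem. The sketch below shows one possible way to write them in Python; the function names are hypothetical, and the consonant test again assumes that y is a vowel only after a consonant.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Return True if word[i] is a consonant (y is a vowel after a consonant)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends in the same consonant twice (e.g. -tt, -ss)."""
    return (
        len(stem) >= 2
        and stem[-1] == stem[-2]
        and is_consonant(stem, len(stem) - 1)
    )


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, last letter not w, x, y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )
```

For example, `ends_cvc("hop")` holds while `ends_cvc("snow")` does not, because the final consonant is w.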
8. Rules
9. Rules (continued)
10. Example
- duplicatable
- duplicat (after rule 4)
- duplicate (after rule 1b1)
- duplic (after rule 3)
11. Dictionary Look-Up Stemmer
- A dictionary contains the pairing of a word and its stem for all the words.
- The structure of the dictionary should be well designed to speed up the search.

TERM         STEM
computer     comput
compute      comput
computation  comput
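A hash table is one structure that keeps the search fast. The sketch below stores the three pairs from the table above in a Python dict; the name `lookup_stem` and the choice to return an unknown word unchanged are assumptions made for illustration.

```python
# Term-to-stem pairs taken from the table above.
STEM_DICTIONARY = {
    "computer": "comput",
    "compute": "comput",
    "computation": "comput",
}


def lookup_stem(word, dictionary=STEM_DICTIONARY):
    """Return the stored stem of `word`.

    Assumption: a word that is missing from the dictionary is
    returned unchanged (in lower case) rather than raising an error.
    """
    key = word.lower()
    return dictionary.get(key, key)


if __name__ == "__main__":
    print(lookup_stem("computation"))  # comput
    print(lookup_stem("thesaurus"))    # thesaurus (not in the dictionary)
```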
12. Successor Variety Stemming
- Hafer and Weiss (1974), "Word segmentation by letter successor varieties", Information Storage and Retrieval 10, 371-385.
- Main idea: determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.
13. Note
- Morpheme: the smallest meaningful part into which a word can be divided
- run-s contains two morphemes
- un-like-ly contains three morphemes
- Phoneme: a unit of the system of sounds in a language
- English has 24 consonant phonemes
14. Overall approach
- Hafer and Weiss use
- letters in place of phonemes
- texts in place of phonemically transcribed utterances
15. Formal Definition
- Let w be a word of length n
- w_i is the length-i prefix of w
- Let D be a collection of words
- D(w_i) is the subset of D containing the terms whose first i letters match w_i exactly
- S(w_i), the successor variety of w_i, is the number of distinct letters that occupy the (i+1)st position of words in D(w_i).
- A test word of length n has n successor varieties S(w_1), S(w_2), ..., S(w_n).
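The definition translates almost line by line into code. The following Python sketch is a direct, unoptimized reading of it; the function names are illustrative, and the small word collection is used only as sample data.

```python
def successor_variety(prefix, words):
    """S(prefix): distinct letters in position len(prefix)+1 of words in D(prefix)."""
    i = len(prefix)
    # D(prefix) holds the words whose first i letters match the prefix.
    # Words of length exactly i have no (i+1)st letter and contribute nothing.
    return len({w[i] for w in words if len(w) > i and w.startswith(prefix)})


def successor_varieties(word, words):
    """Return [S(w_1), S(w_2), ..., S(w_n)] for a test word of length n."""
    return [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]


if __name__ == "__main__":
    D = ["able", "axle", "accident", "ape", "about", "be"]
    print(successor_variety("a", D))     # 4 (b, x, c, p)
    print(successor_varieties("able", D))  # [4, 2, 1, 0]
```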
16. Informal Definition
- The successor variety of a string in a collection D of words is the number of different characters that follow it in D.
- That is, it depends on
- the string
- the collection D of words under consideration
17. An example
- D = {able, axle, accident, ape, about, be}
- The successor variety for
- a: 4 (b, x, c, p)
- ap: 1 (e)
- app: 0
- ab: 2 (l, o)
- b: 1 (e)
- Using a trie, the successor variety of a string is the number of children of the node the string reaches in the trie (a terminal node is treated as having one child).
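The trie view can be sketched with nested Python dicts. This is an illustrative, uncompressed character trie, with hypothetical names `build_trie` and `trie_successor_variety`; the end-of-word marker `"$"` is an assumption of the sketch and must not occur in the words themselves.

```python
END = "$"  # marks that a complete word ends at this node (assumed marker)


def build_trie(words):
    """Build a character trie as nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def trie_successor_variety(trie, prefix):
    """Number of letter children of the node reached by `prefix`.

    A terminal node (a complete word with no continuation) is
    treated as having one child. A prefix absent from the trie gets 0.
    """
    node = trie
    for ch in prefix:
        if ch not in node:
            return 0
        node = node[ch]
    letters = [ch for ch in node if ch != END]
    if letters:
        return len(letters)
    return 1 if END in node else 0


if __name__ == "__main__":
    trie = build_trie(["able", "axle", "accident", "ape", "about", "be"])
    for prefix in ["a", "ap", "app", "ab", "b"]:
        print(prefix, trie_successor_variety(trie, prefix))
```

Once the trie is built, each query costs time proportional to the length of the prefix, independent of the size of D.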
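18. Trie for the corpus of data D
[Figure: trie for D = {able, axle, accident, ape, about, be}]
node 1 (root)
  a -> node 2
    b -> node 3
      l -> able
      o -> about
    x -> axle
    c -> accident
    p -> ape
  b -> be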
19. Segments in Words
- In a large body of text, the successor variety of a substring usually decreases as each character is added, until a segment boundary is reached.
- Consider the following example
- D = {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}
- r: 3 (e, i, o)
- re: 2 (a, d)
- rea: 1 (d)
- read: 3 (a, i, s)
- read is a segment (or stem)
20. Selecting segments of words
- Cutoff method
- a boundary is identified whenever some cutoff value of the successor variety is reached.
- Peak and plateau method
- a segment break is made after a character whose successor variety is larger than that of both the character immediately before it and the character immediately after it.
- Complete word method
- a break is made after a segment if the segment is a complete word in the corpus
- Entropy method
- the cutoff method applied to an entropy defined for words.
21. Peak and Plateau Method
- D = {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}
- r: 3 (e, i, o)
- re: 2 (a, d)
- rea: 1 (d)
- read: 3 (a, i, s)
- reada: 1 (b)
- readab: 1 (l)
- readabl: 1 (e)
- readable: 1 (blank)
- the successor variety of read is 3, larger than that of both rea and reada
22. Peak and Plateau Method
- Input: a document of many terms.
- Output: each term is segmented.
- E.g., the output for readable is read-able
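A minimal Python sketch of peak and plateau segmentation follows. It is one possible reading of the rule, not the authors' code: the function names are illustrative, the end of a word is not counted as a successor, and the first and last characters are never treated as peaks because they lack one of the two neighbours.

```python
def successor_variety(prefix, words):
    """Number of distinct letters that follow `prefix` in `words`."""
    i = len(prefix)
    return len({w[i] for w in words if len(w) > i and w.startswith(prefix)})


def peak_plateau_segments(word, words):
    """Split `word` after every prefix whose successor variety is a strict peak."""
    n = len(word)
    sv = [successor_variety(word[:i], words) for i in range(1, n + 1)]
    segments, start = [], 0
    # sv[i] belongs to the prefix of length i + 1; only interior
    # positions have both a predecessor and a successor to compare with.
    for i in range(1, n - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


if __name__ == "__main__":
    D = ["able", "ape", "beatable", "fixable", "read", "readable",
         "reading", "reads", "red", "rope", "ripe"]
    print("-".join(peak_plateau_segments("readable", D)))  # read-able
```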
23. Stem method of Hafer and Weiss
- Determine the successor varieties of a word
- Use this information to segment the word using one of the previous methods (say, peak and plateau)
- Choose one of the segments as the stem
- if (first segment occurs in < 12 words in the corpus)
- // tests whether the first segment may be a prefix
- first segment is stem
- else
- second segment is stem
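The selection rule can be sketched as below. Several details are assumptions of this sketch rather than statements from the lecture: "occurs in" is read as "is a prefix of", the comparison is a strict "fewer than 12", and a word with a single segment is returned as its own stem. The name `choose_stem` is hypothetical.

```python
def choose_stem(segments, words, limit=12):
    """Pick the stem from a segmented word.

    A first segment that starts many corpus words is probably a
    prefix, so the second segment is chosen in that case.
    """
    first = segments[0]
    if len(segments) == 1:
        return first  # assumption: nothing else to choose from
    # Assumption: "occurs in" means "is a prefix of".
    count = sum(1 for w in words if w.startswith(first))
    if count < limit:
        return first
    return segments[1]


if __name__ == "__main__":
    D = ["able", "ape", "beatable", "fixable", "read", "readable",
         "reading", "reads", "red", "rope", "ripe"]
    print(choose_stem(["read", "able"], D))  # read
```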
24. Stem method of Hafer and Weiss
- Input: a segmented word
- Output: the stem of the word
- For example
- read-able is the input
- read is the output
- // able may be the output instead, depending on what happens in the algorithm
25. Accessor Variety Method in Chinese
- The notion was introduced by Feng, Chen, Zheng, and Deng for Chinese word extraction.
- The idea is similar to successor variety.
- It is used for Chinese text segmentation, since it is difficult to separate words in Chinese text. In comparison, English words are separated by a space symbol in text.
26. Definition: Accessor Variety
- We treat each Chinese character as a letter.
- For each string (a potential word) consisting of several characters, we define the successor variety as in English.
- Symmetrically, we also define a predecessor variety for each string.
- A string is considered a word if it has a large successor variety and a large predecessor variety.
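The two counts can be sketched as follows. This is a simplified illustration, not the published method: it uses unspaced English strings as stand-ins for Chinese text, ignores sentence boundaries, and the names `accessor_varieties` and `looks_like_word` as well as the threshold value are assumptions.

```python
def accessor_varieties(candidate, sentences):
    """Return (predecessor variety, successor variety) of `candidate`."""
    before, after = set(), set()
    for sentence in sentences:
        start = sentence.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                before.add(sentence[start - 1])  # character just before
            if end < len(sentence):
                after.add(sentence[end])         # character just after
            start = sentence.find(candidate, start + 1)
    return len(before), len(after)


def looks_like_word(candidate, sentences, threshold=3):
    """Accept a string only if both varieties reach the (assumed) threshold."""
    return min(accessor_varieties(candidate, sentences)) >= threshold


if __name__ == "__main__":
    text = ["thecatsat", "acatran", "mycatis"]
    print(accessor_varieties("cat", text))  # (3, 3)
    print(accessor_varieties("ca", text))   # (3, 1)
```

Here `cat` is preceded and followed by three different characters, while `ca` is always followed by `t`, so only `cat` passes the test.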
27. Testing Results
- The accessor variety method turns out to be a very simple yet efficient way to recognize Chinese words when combined with some simple grammar rules.
- For details, look at our paper
- http://www.cs.cityu.edu.hk/deng/5286/feng.pdf
28. Word similarity
- N-gram method
- Break a word of length n into (n-1) digrams, each a substring of two consecutive characters of the word.
- Count the number of distinct digrams.
- Let A (B) be the number of distinct digrams in word 1 (2). Let C be the number of distinct digrams shared by word 1 and word 2.
- The similarity of the two words is
- S = 2C/(A+B)
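The formula is short enough to sketch directly with Python sets. The function names `digrams` and `digram_similarity` are illustrative, and returning 0.0 when neither word has a digram is an assumption added to avoid division by zero.

```python
def digrams(word):
    """Return the set of distinct digrams (adjacent letter pairs) of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def digram_similarity(word1, word2):
    """S = 2C / (A + B), with A, B, C as defined above."""
    a, b = digrams(word1), digrams(word2)
    if not a and not b:
        return 0.0  # assumption: words shorter than two letters share nothing
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    print(digram_similarity("statistics", "statistical"))  # 0.8
```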
29. Example of Word similarity
- statistics: st, ta, at, ti, is, st, ti, ic, cs
- its distinct digrams
- at, cs, ic, is, st, ta, ti
- statistical: st, ta, at, ti, is, st, ti, ic, ca, al
- its distinct digrams
- al, at, ca, ic, is, st, ta, ti
- A = 7, B = 8, C = 6
- Similarity = 2 x 6/(7+8) = 12/15 = 4/5 = 80%
- One may build a similarity matrix of all words in a corpus, calculated as above, and complemented by the cutoff value method (an entry is set to 0 if it is less than a certain value, and to 1 otherwise).
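A thresholded similarity matrix of this kind might be built as in the sketch below. The word list, the cutoff of 0.6, and the function names are all made up for illustration.

```python
def digrams(word):
    """Return the set of distinct digrams of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def digram_similarity(word1, word2):
    """S = 2C / (A + B) over distinct digrams."""
    a, b = digrams(word1), digrams(word2)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def binary_similarity_matrix(words, cutoff):
    """Entry [i][j] is 1 if the similarity reaches the cutoff, else 0."""
    return [
        [1 if digram_similarity(u, v) >= cutoff else 0 for v in words]
        for u in words
    ]


if __name__ == "__main__":
    vocabulary = ["statistics", "statistical", "stemming"]  # sample words
    for row in binary_similarity_matrix(vocabulary, cutoff=0.6):
        print(row)
```

With this sample, `statistics` and `statistical` are linked (similarity 0.8), while `stemming` is linked only to itself.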
30. Thesaurus
- Vocabulary control in an information retrieval system
- Thesaurus construction
- Manual construction
- Automatic construction
31. Vocabulary control
- Standard vocabulary for both indexing and searching (for the constructors of the system and the users of the system)
32. Objectives of vocabulary control
- To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials.
- To facilitate the conduct of a comprehensive search on some topic by linking together terms whose meanings are related paradigmatically.
33. Thesaurus
- Not like a common dictionary
- Words with their explanations
- May contain the words of a language
- Or only the words of a specific domain.
- With a lot of other information, especially the relationships between words
- Classification of words in the language
- Word relationships such as synonyms and antonyms
34. On-Line Thesaurus
- http://www.thesaurus.com
- http://www.dictionary.com/
- http://www.cogsci.princeton.edu/wn/
35. Dictionary vs. Thesaurus
Check "information" using http://www.thesaurus.com

Dictionary
- information (ĭn′fər-mā′shən) n.
- Knowledge derived from study, experience, or instruction.
- Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news. See Synonyms at knowledge.
- ......

Thesaurus
- Nouns: information, enlightenment, acquaintance
- Verbs: tell; inform, inform of; acquaint, acquaint with; impart
- Adjectives: informed; communique; reported; published
36. Use of Thesaurus
- To control the terms used in indexing: for a specific domain, only the terms in the thesaurus are used as indexing terms.
- To assist users in forming proper queries through the help information contained in the thesaurus.
37. Construction of Thesaurus
- Stemming can be used to reduce the size of the thesaurus.
- A thesaurus can be constructed either manually or automatically.
38. WordNet: manually constructed
- WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
39. Relations in WordNet
40. Automatic Thesaurus Construction
- A variety of methods can be used in constructing the thesaurus.
- Term similarity can be used for constructing the thesaurus.
41. Complete Term Relation Method
- The term-document relationship can be calculated using a variety of methods
- such as tf-idf
- Term similarity can be calculated based on the term-document relationship
- for example
42. Complete Term Relation Method
Set the threshold to 10
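One common way to derive term similarity from term-document weights is to sum, over all documents, the product of the two terms' weights, and then mark a pair as related when the sum reaches the threshold. The sketch below assumes that measure and a "greater than or equal" comparison. The weights are made-up sample values, not the lecture's table, and the function names are hypothetical.

```python
def term_similarity(weights_a, weights_b):
    """Sum over documents of the product of the two terms' weights."""
    return sum(a * b for a, b in zip(weights_a, weights_b))


def related_pairs(term_weights, threshold):
    """Map each pair of distinct terms to 1 (related) or 0 (not related)."""
    terms = sorted(term_weights)
    return {
        (s, t): int(term_similarity(term_weights[s], term_weights[t]) >= threshold)
        for s in terms
        for t in terms
        if s != t
    }


if __name__ == "__main__":
    # Made-up weights: one row per term, one column per document.
    weights = {
        "T1": [4, 2, 0],
        "T2": [3, 0, 1],
        "T3": [0, 4, 3],
    }
    relation = related_pairs(weights, threshold=10)
    print(relation[("T1", "T2")])  # 1, since 4*3 + 2*0 + 0*1 = 12
    print(relation[("T1", "T3")])  # 0, since 4*0 + 2*4 + 0*3 = 8
```

Groups of mutually related terms can then be read off the resulting 0/1 relation.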
43. Complete Term Relation Method
- Groups
- T1, T3, T4, T6
- T1, T5
- T2, T4, T6
- T2, T6, T8
- T7