Stemming

CS5286 Search Engine Technology and Algorithms / Xiaotie Deng

1
Lecture 6: Linguistic Methods for Searching

Stemming
Thesaurus
Online resources
Automatic construction of thesaurus

2
Outline of Stemming Methods

Goal of Stemming Process
Algorithm
Affix Removal (Porter's Algorithm)
Dictionary Look-up Stemmers
Successor Variety
n-Gram Stemming
Applications

3
The advantage

Stemming was originally designed to improve performance by reducing the demand on system resources.
With the continued significant increase in storage and computing power, using stemming for performance reasons is no longer as important.

4
Other Potentials

It may improve recall.
There may be an associated decline in precision.
System designers make their own choice of whether to include stemming.
At the time of writing, Google does not use stemming.
HotBot offers word stemming as a user option.

5
Porter Stemming Algorithm

The Porter algorithm is the most commonly accepted algorithm.
It is based on a set of conditions on the stem, suffix and prefix, with an associated action for each condition.
See, e.g., http://www.tartarus.org/martin/PorterStemmer/

6
Porter Stemming (Condition)

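m, the measure of a stem, counts the sequences of vowels that are followed by a sequence of consonants.
Every stem has the form [C](VC)^m[V], where C is a sequence of consonants and V is a sequence of vowels (a, e, i, o, u, and y when it follows a consonant). The initial C and the final V are optional, and m is the number of times VC repeats.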
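
The measure can be sketched in a few lines of Python. This is only an illustration of the definition above, not Porter's reference code. The function names are made up for this example, and the treatment of y (a vowel only when it follows a consonant) follows Porter's published definition.

```python
def is_consonant(word, i):
    """Return True if word[i] acts as a consonant (Porter's definition)."""
    letter = word[i]
    if letter in "aeiou":
        return False
    if letter == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC repeats in the [C](VC)^m[V] form of the stem."""
    stem = stem.lower()
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Every vowel-to-consonant transition closes one VC group.
    return pattern.count("VC")


print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
```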

7
Porter Stemming (Condition)

*<X>: the stem ends with the letter X
*v*: the stem contains a vowel
*d: the stem ends in a double consonant
*o: the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
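
The four conditions can be written as small predicates. The sketch below is an illustration only. The function names are invented for this example, and y is again assumed to be a vowel only when it follows a consonant.

```python
def is_consonant(word, i):
    """Return True if word[i] acts as a consonant (Porter's definition)."""
    letter = word[i]
    if letter in "aeiou":
        return False
    if letter == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends in a double consonant, e.g. -tt or -ss."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, last one not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )
```

For instance, `ends_cvc("hop")` holds while `ends_cvc("snow")` does not, because the final consonant of snow is w.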

8
Rules
9
Rules (continued)
10
Example

duplicatable
duplicat (rule 4)
duplicate (rule 1b1)
duplic (rule 3)

11
Dictionary Look-Up Stemmer

A dictionary contains the pairing of each word with its stem, for all the words.
The structure of the dictionary should be well designed to speed up the search.

TERM          STEM
computer      comput
compute       comput
computation   comput
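
A minimal sketch of a look-up stemmer, using a hash table (a Python dict) as one possible fast structure. The names below are hypothetical, and returning an unknown term unchanged is an assumption of this example, not something the slide specifies.

```python
# Term-to-stem pairs taken from the table above.
STEM_DICTIONARY = {
    "computer": "comput",
    "compute": "comput",
    "computation": "comput",
}


def lookup_stem(term, dictionary=STEM_DICTIONARY):
    """Return the stored stem, or the term itself if it is not listed."""
    return dictionary.get(term.lower(), term)


print(lookup_stem("computation"))  # comput
print(lookup_stem("thesaurus"))    # thesaurus (not in the dictionary)
```
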
12
Successor Variety Stemming

Hafer and Weiss (1974), "Word segmentation by letter successor varieties", Information Storage and Retrieval 10, 371-385.
Main idea: determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.

13
Note

Morpheme: the smallest meaningful part into which a word can be divided
run-s contains two morphemes
un-like-ly contains three morphemes
Phoneme: a unit of the system of sounds in a language
English has 24 consonant phonemes

14
Overall approach

Hafer and Weiss use:
letters in place of phonemes
texts in place of phonemically transcribed utterances

15
Formal Definition

Let w be a word of length n.
w_i is the length-i prefix of w.
Let D be a collection of words.
D(w_i) is the subset of D containing the terms whose first i letters match w_i exactly.
S(w_i), the successor variety of w_i, is the number of distinct letters that occupy the (i+1)st position of words in D(w_i).
A test word of length n has n successor varieties: S(w_1), S(w_2), ..., S(w_n).
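
The definition translates almost directly into code. The sketch below is an illustration with invented function names. It uses the sample collection D = {able, axle, accident, ape, about, be} from this lecture's example.

```python
def successor_variety(prefix, words):
    """S(w_i): distinct letters in position i+1 of the words in D(w_i)."""
    i = len(prefix)
    return len({w[i] for w in words if len(w) > i and w.startswith(prefix)})


def successor_varieties(word, words):
    """The n successor varieties S(w_1), ..., S(w_n) of a test word."""
    return [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]


D = ["able", "axle", "accident", "ape", "about", "be"]
print(successor_variety("a", D))     # 4  (b, x, c, p)
print(successor_variety("ab", D))    # 2  (l, o)
print(successor_varieties("able", D))  # [4, 2, 1, 0]
```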

16
Informal Definition

The successor variety of a string in a collection D of words is the number of different characters that follow it in D.
That is, it depends on:
the string
the collection D of words under consideration

17
An example

D = {able, axle, accident, ape, about, be}
The successor variety for
a: 4 (b, x, c, p)
ap: 1 (e)
app: 0
ab: 2 (l, o)
b: 1 (e)
Using a trie, the successor variety of a string is the number of children of the node that the string reaches in the trie (a terminal node is treated as having one child).
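
A trie-based version might look as follows. This is a sketch under assumptions: the trie is built from nested dicts, the end-of-word marker `END` is a device of this example, and a terminal node with no letter children is counted as having one child, following the convention stated above.

```python
END = ""  # hypothetical end-of-word marker; never equal to a real letter


def build_trie(words):
    """Build a trie of nested dicts, one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = {}  # mark that a word ends at this node
    return root


def trie_successor_variety(trie, prefix):
    """Number of children of the node that the prefix reaches."""
    node = trie
    for letter in prefix:
        if letter not in node:
            return 0  # the prefix does not occur in the collection
        node = node[letter]
    children = [key for key in node if key != END]
    if children:
        return len(children)
    return 1 if END in node else 0  # terminal node counts as one child


trie = build_trie(["able", "axle", "accident", "ape", "about", "be"])
print(trie_successor_variety(trie, "a"))    # 4
print(trie_successor_variety(trie, "app"))  # 0
```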

18
Trie for the corpus of data D

root (node 1)
  a (node 2)
    b (node 3)
      l: able
      o: about
    c: accident
    p: ape
    x: axle
  b: be

19
Segmenting Words

In a large body of text, the successor variety of a substring usually decreases as each character is added, until a segment boundary is reached.
Consider the following example:
D = {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}
r: 3 (e, i, o)
re: 2 (a, d)
rea: 1 (d)
read: 3 (a, i, s)
read is a segment (or stem)

20
Selecting segments of words

Cutoff method:
a boundary is identified if some cutoff value is reached.
Peak and plateau method:
a segment break is made after a character whose successor variety is larger than that of both the character immediately before it and the character immediately after it.
Complete word method:
a break is made after a segment if the segment is a complete word in the corpus.
Entropy method:
the cutoff method applied to entropy defined for words.

21
Peak and Plateau Method

D = {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}
r: 3 (e, i, o)
re: 2 (a, d)
rea: 1 (d)
read: 3 (a, i, s)
reada: 1 (b)
readab: 1 (l)
readabl: 1 (e)
readable: 1 (blank)
The successor variety of read is 3, larger than that of both rea (1) and reada (1), so a break is made after read.

22
Peak and Plateau Method

Input: a document of many terms.
Output: each term, segmented.
E.g., the output for readable is read-able
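
The peak test can be sketched as below. The code is an illustration with made-up function names. It assumes the convention from the example that a complete word with no continuation has successor variety 1 (blank), and it only places a break where the variety is strictly larger than both neighbours.

```python
def successor_variety(prefix, words):
    """Successor variety, counting a bare complete word as 1 (blank)."""
    i = len(prefix)
    letters = {w[i] for w in words if len(w) > i and w.startswith(prefix)}
    if letters:
        return len(letters)
    return 1 if prefix in words else 0


def peak_segments(word, words):
    """Split word after every prefix whose variety is a strict peak."""
    sv = [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]
    # sv[i] belongs to the prefix of length i + 1, so the break goes at i + 1.
    cuts = [
        i + 1
        for i in range(1, len(sv) - 1)
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]
    ]
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments


D = ["able", "ape", "beatable", "fixable", "read", "readable",
     "reading", "reads", "red", "rope", "ripe"]
print("-".join(peak_segments("readable", D)))  # read-able
```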

23
Stem method of Hafer and Weiss

Determine the successor varieties of a word.
Use this information to segment the word using one of the previous methods (say, peak and plateau).
Choose one of the segments as the stem:
if (first segment occurs in <= 12 words in the corpus)
    first segment is the stem
else  // the first segment is probably a prefix
    second segment is the stem
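
The selection rule might be coded as follows. This is a sketch under assumptions: the function name is invented, "occurs in a word" is taken to mean that the word begins with the segment, and the threshold of 12 is exposed as a parameter so the rule can be tried on a small collection.

```python
def choose_stem(segments, words, max_words=12):
    """Pick the stem from a segmented word using the frequency rule."""
    first = segments[0]
    if len(segments) == 1:
        return first  # nothing to choose from
    # Assumption: a segment "occurs in" a word if the word starts with it.
    occurrences = sum(1 for w in words if w.startswith(first))
    if occurrences <= max_words:
        return first
    return segments[1]  # a very frequent first segment is likely a prefix


D = ["able", "ape", "beatable", "fixable", "read", "readable",
     "reading", "reads", "red", "rope", "ripe"]
print(choose_stem(["read", "able"], D))               # read
print(choose_stem(["read", "able"], D, max_words=3))  # able
```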

24
Stem method of Hafer and Weiss

Input: a segmented word
Output: the stem of the word
For example:
read-able is the input
read is the output
// able may be the output instead, depending on what happens in the algorithm

25
Accessor Variety Method in Chinese

The notion was introduced by Feng, Chen, Zheng and Deng for Chinese word extraction.
The idea is similar to successor variety.
It is used for Chinese text segmentation, since it is difficult to separate words in Chinese text. In comparison, English words are separated by a space symbol in text.

26
Definition Accessor Variety

We treat each Chinese character as a letter.
For each string (a potential word) consisting of several characters, we define its successor variety as in English.
Symmetrically, we also define a predecessor variety for each string.
A string is considered a word if it has a large successor variety and a large predecessor variety.
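
A minimal sketch of the two counts over a running text. The function names, the threshold, and the sample text are made up for illustration (Latin letters stand in for Chinese characters). Requiring both counts to reach one shared threshold is just one way to read "large".

```python
def accessor_varieties(text, candidate):
    """Return (predecessor variety, successor variety) of candidate in text."""
    predecessors, successors = set(), set()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            predecessors.add(text[start - 1])  # character just before
        if end < len(text):
            successors.add(text[end])  # character just after
        start = text.find(candidate, start + 1)
    return len(predecessors), len(successors)


def looks_like_word(text, candidate, threshold=3):
    """Assumed rule: both varieties must reach the threshold."""
    return min(accessor_varieties(text, candidate)) >= threshold


text = "xabyzabqcabw"  # hypothetical text
print(accessor_varieties(text, "ab"))  # (3, 3)
print(accessor_varieties(text, "a"))   # (3, 1)
```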

27
Testing Results

The accessor variety method turns out to be a very simple yet efficient way to recognize Chinese words when combined with some simple grammar rules.
For details, look at our paper:
http://www.cs.cityu.edu.hk/deng/5286/feng.pdf

28
Word similarity

N-gram method
Break a word of length n into (n-1) digrams, each a substring of two adjacent characters of the word.
Count the number of distinct digrams.
Let A (B) be the number of distinct digrams in word 1 (word 2). Let C be the number of distinct digrams shared by word 1 and word 2.
The similarity of the two words is
S = 2C/(A+B)

29
Example of Word similarity

statistics: st, ta, at, ti, is, st, ti, ic, cs
its distinct digrams: at, cs, ic, is, st, ta, ti
statistical: st, ta, at, ti, is, st, ti, ic, ca, al
its distinct digrams: al, at, ca, ic, is, st, ta, ti
A = 7, B = 8, C = 6
Similarity: S = 2x6/(7+8) = 12/15 = 4/5 = 0.80
One may build a similarity matrix for all words in a corpus, calculated as above, and then apply a cutoff value (set an entry to 0 if it is less than a certain value, and to 1 otherwise).
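
The digram similarity and the thresholded matrix can be sketched as follows. The function names and the cutoff of 0.6 are choices made for this example only.

```python
def digrams(word):
    """The set of distinct two-character substrings of a word."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def digram_similarity(word1, word2):
    """S = 2C/(A+B) over the distinct digrams of the two words."""
    a, b = digrams(word1), digrams(word2)
    if not a and not b:
        return 0.0  # words too short to have any digram
    return 2 * len(a & b) / (len(a) + len(b))


def similarity_matrix(words, cutoff):
    """0/1 matrix: 1 where the similarity reaches the cutoff, else 0."""
    return [
        [1 if digram_similarity(x, y) >= cutoff else 0 for y in words]
        for x in words
    ]


print(digram_similarity("statistics", "statistical"))  # 0.8
print(similarity_matrix(["statistics", "statistical", "read"], 0.6))
# [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```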

30
Thesaurus

Vocabulary control in an information retrieval
system
Thesaurus construction
Manual construction
Automatic construction

31
Vocabulary control

Standard vocabulary for both indexing and
searching (for the constructors of the system and
the users of the system)

32
Objectives of vocabulary control

To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials.
To facilitate the conduct of a comprehensive search on some topic by linking together terms whose meanings are related paradigmatically.

33
Thesaurus

Not like a common dictionary, which lists words with their explanations
May contain all the words in a language
Or may contain only the words in a specific domain
Carries a lot of other information, especially the relationships between words
Classification of the words in the language
Word relationships such as synonyms and antonyms

34
On-Line Thesaurus

http://www.thesaurus.com
http://www.dictionary.com/
http://www.cogsci.princeton.edu/wn/

35
Dictionary vs. Thesaurus

Check "information" using http://www.thesaurus.com

Dictionary:
information (ĭn′fər-mā′shən) n.
Knowledge derived from study, experience, or instruction.
Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news. See Synonyms at knowledge.
......

Thesaurus:
Nouns: information, enlightenment, acquaintance
Verbs: tell; inform, inform of; acquaint, acquaint with; impart
Adjectives: informed; communique; reported; published

36
Use of Thesaurus

To control the terms used in indexing: for a specific domain, use only the terms in the thesaurus as indexing terms.
To assist users in forming proper queries, with the help of the information contained in the thesaurus.

37
Construction of Thesaurus

Stemming can be used to reduce the size of the thesaurus.
A thesaurus can be constructed either manually or automatically.

38
WordNet: manually constructed

WordNet is an online lexical reference system
whose design is inspired by current
psycholinguistic theories of human lexical
memory. English nouns, verbs, adjectives and
adverbs are organized into synonym sets, each
representing one underlying lexical concept.
Different relations link the synonym sets.

39
Relations in WordNet
40
Automatic Thesaurus Construction