Title: Information Retrieval
1Information Retrieval
January 28, 2005
2Course Information
- Instructor: Dragomir R. Radev (radev_at_si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: M 11-12, Th 12-1, or via email
- Course page: http://tangra.si.umich.edu/radev/650/
- Class meets on Fridays, 2:10-4:55 PM, in 409 West Hall
3Arithmetic coding
4Arithmetic coding
- Uses probabilities
- Achieves about 2.5 bits per character, close to optimal
- (Rissanen and Langdon 1979; Witten, Neal, and Cleary 1987)
6Exercise
- Assuming the alphabet consists of a, b, and c, develop arithmetic encodings for the following strings: aaa, aab, aba, baa, abc, cab, cba, bac
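As a starting point for the exercise, the interval narrowing at the heart of arithmetic coding can be sketched as below. The slide does not give symbol probabilities, so the uniform model (1/3 each for a, b, c) and the function name `arithmetic_interval` are assumptions made only for illustration. Any number inside the returned interval identifies the string, given its length.

```python
from fractions import Fraction

def arithmetic_interval(text, probs):
    """Return the [low, high) interval that arithmetic coding assigns to text."""
    # Start of each symbol's sub-interval, taking symbols in sorted order.
    cumulative = {}
    total = Fraction(0)
    for symbol in sorted(probs):
        cumulative[symbol] = total
        total += probs[symbol]

    low, width = Fraction(0), Fraction(1)
    for symbol in text:
        # Narrow the current interval to the sub-interval of this symbol.
        low += width * cumulative[symbol]
        width *= probs[symbol]
    return low, low + width

# Assumed model: all three symbols equally likely.
uniform = {symbol: Fraction(1, 3) for symbol in "abc"}

if __name__ == "__main__":
    for s in ["aaa", "abc"]:
        low, high = arithmetic_interval(s, uniform)
        print(s, low, high)
```

Exact fractions are used instead of floats so that the interval boundaries stay readable; a practical coder would work with fixed-precision integers instead.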
7Stemming
8Goals
- Motivation
- Computer, computers, computerize, computational, computerization
- User, users, using, used
- Representing related words as one token
- Simplify matching
- Reduce storage and computation
- Also known as term conflation
9Methods
- Manual (tables)
- Achievement -> achiev
- Achiever -> achiev
- Etc.
- Affix removal (Harman 1991, Frakes 1992)
- If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
- If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"
- If a word ends in "s" but not "us" or "ss", then "s" -> NULL
- (apply only the first applicable rule)
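The three affix-removal rules above can be sketched in a few lines. The function name `s_stem` is hypothetical, and the sketch assumes one reading of "first applicable rule": a rule whose exception matches is treated as not applicable, so the next rule is tried.

```python
def s_stem(word):
    """Apply the first applicable plural-removal rule to a word."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

if __name__ == "__main__":
    for word in ["ponies", "tables", "computers", "glass", "bus"]:
        print(word, "->", s_stem(word))
```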
10Porter's algorithm (Porter 1980)
- Home page
- http://www.tartarus.org/martin/PorterStemmer
- Reading assignment
- http://www.tartarus.org/martin/PorterStemmer/def.txt
- Consonant-vowel sequences
- CVCV ... C
- CVCV ... V
- VCVC ... C
- VCVC ... V
- Shorthand: [C]VCVC ... [V]
11Porter's algorithm (cont'd)
- [C](VC)^m[V]
- m indicates repetition
- Examples
- m=0: TR, EE, TREE, Y, BY
- m=1: TROUBLE, OATS, TREES, IVY
- m=2: TROUBLES, PRIVATE, OATEN
- Conditions
- *S - the stem ends with S (and similarly for the other letters).
- *v* - the stem contains a vowel.
- *d - the stem ends with a double consonant (e.g. -TT, -SS).
- *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
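The measure m and the stem conditions can be checked with a short sketch. The helper names are hypothetical. The treatment of Y follows Porter's definition: Y is a vowel only when the letter before it is a consonant.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y acts as a vowel only when preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m: the number of vowel-to-consonant transitions in the stem."""
    stem = stem.lower()
    return sum(
        1
        for i in range(1, len(stem))
        if not is_consonant(stem, i - 1) and is_consonant(stem, i)
    )

def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    stem = stem.lower()
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """Condition *d: the stem ends with a double consonant."""
    stem = stem.lower()
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    """Condition *o: the stem ends cvc and the last c is not w, x, or y."""
    stem = stem.lower()
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

if __name__ == "__main__":
    for word in ["TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "OATEN"]:
        print(word, measure(word))
```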
12Step 1a
- SSES -> SS, e.g. caresses -> caress
- IES -> I, e.g. ponies -> poni, ties -> ti
- SS -> SS, e.g. caress -> caress
- S -> NULL, e.g. cats -> cat
Step 1b
- (m>0) EED -> EE, e.g. feed -> feed, agreed -> agree
- (*v*) ED -> NULL, e.g. plastered -> plaster, bled -> bled
- (*v*) ING -> NULL, e.g. motoring -> motor, sing -> sing
Step 1b1: if the second or third of the rules in Step 1b is successful, the following is done
- AT -> ATE, e.g. conflat(ed) -> conflate
- BL -> BLE, e.g. troubl(ed) -> trouble
- IZ -> IZE, e.g. siz(ed) -> size
- (*d and not (*L or *S or *Z)) -> single letter, e.g. hopp(ing) -> hop, tann(ed) -> tan, fall(ing) -> fall, hiss(ing) -> hiss, fizz(ed) -> fizz
- (m=1 and *o) -> E, e.g. fail(ing) -> fail, fil(ing) -> file
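Step 1a has no conditions on the stem, so it is the easiest step to sketch on its own. The function name `step_1a` is hypothetical, and the input is assumed to be a lowercase word. The rules are tested from the longest suffix to the shortest, and only one rule is applied.

```python
def step_1a(word):
    """Apply Step 1a of Porter's algorithm to a lowercase word."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S -> NULL
    return word

if __name__ == "__main__":
    for word in ["caresses", "ponies", "ties", "caress", "cats"]:
        print(word, "->", step_1a(word))
```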
13Step 1c
- (*v*) Y -> I, e.g. happy -> happi, sky -> sky
Step 2
- (m>0) ATIONAL -> ATE, e.g. relational -> relate
- (m>0) TIONAL -> TION, e.g. conditional -> condition, rational -> rational
- (m>0) ENCI -> ENCE, e.g. valenci -> valence
- (m>0) ANCI -> ANCE, e.g. hesitanci -> hesitance
- (m>0) IZER -> IZE, e.g. digitizer -> digitize
- (m>0) ABLI -> ABLE, e.g. conformabli -> conformable
- (m>0) ALLI -> AL, e.g. radicalli -> radical
- (m>0) ENTLI -> ENT, e.g. differentli -> different
- (m>0) ELI -> E, e.g. vileli -> vile
- (m>0) OUSLI -> OUS, e.g. analogousli -> analogous
- (m>0) IZATION -> IZE, e.g. vietnamization -> vietnamize
- (m>0) ATION -> ATE, e.g. predication -> predicate
- (m>0) ATOR -> ATE, e.g. operator -> operate
- (m>0) ALISM -> AL, e.g. feudalism -> feudal
- (m>0) IVENESS -> IVE, e.g. decisiveness -> decisive
- (m>0) FULNESS -> FUL, e.g. hopefulness -> hopeful
- (m>0) OUSNESS -> OUS, e.g. callousness -> callous
- (m>0) ALITI -> AL, e.g. formaliti -> formal
- (m>0) IVITI -> IVE, e.g. sensitiviti -> sensitive
- (m>0) BILITI -> BLE, e.g. sensibiliti -> sensible
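Step 2 lends itself to a table-driven sketch. The names `STEP_2_RULES`, `measure`, and `step_2` are hypothetical. The sketch assumes lowercase input and follows Porter's convention that, within a step, only the rule with the longest matching suffix is considered; if its condition fails, the word is left unchanged.

```python
STEP_2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def measure(stem):
    """Return m in [C](VC)^m[V]: the number of vowel-to-consonant transitions."""
    m = 0
    prev_consonant = None
    for ch in stem:
        if ch in "aeiou":
            consonant = False
        elif ch == "y":
            # Y acts as a vowel only when preceded by a consonant.
            consonant = prev_consonant is not True
        else:
            consonant = True
        if prev_consonant is False and consonant:
            m += 1
        prev_consonant = consonant
    return m

def step_2(word):
    """Apply the Step 2 rule with the longest matching suffix, if (m>0) holds."""
    by_length = sorted(STEP_2_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for suffix, replacement in by_length:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word  # longest match found, but its condition failed
    return word

if __name__ == "__main__":
    for word in ["relational", "conditional", "rational", "digitizer"]:
        print(word, "->", step_2(word))
```

The same table-plus-condition pattern would extend to Steps 3 and 4 by swapping the rule table and the threshold on m.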
14Step 3
- (m>0) ICATE -> IC, e.g. triplicate -> triplic
- (m>0) ATIVE -> NULL, e.g. formative -> form
- (m>0) ALIZE -> AL, e.g. formalize -> formal
- (m>0) ICITI -> IC, e.g. electriciti -> electric
- (m>0) ICAL -> IC, e.g. electrical -> electric
- (m>0) FUL -> NULL, e.g. hopeful -> hope
- (m>0) NESS -> NULL, e.g. goodness -> good
Step 4
- (m>1) AL -> NULL, e.g. revival -> reviv
- (m>1) ANCE -> NULL, e.g. allowance -> allow
- (m>1) ENCE -> NULL, e.g. inference -> infer
- (m>1) ER -> NULL, e.g. airliner -> airlin
- (m>1) IC -> NULL, e.g. gyroscopic -> gyroscop
- (m>1) ABLE -> NULL, e.g. adjustable -> adjust
- (m>1) IBLE -> NULL, e.g. defensible -> defens
- (m>1) ANT -> NULL, e.g. irritant -> irrit
- (m>1) EMENT -> NULL, e.g. replacement -> replac
- (m>1) MENT -> NULL, e.g. adjustment -> adjust
- (m>1) ENT -> NULL, e.g. dependent -> depend
- (m>1 and (*S or *T)) ION -> NULL, e.g. adoption -> adopt
- (m>1) OU -> NULL, e.g. homologou -> homolog
- (m>1) ISM -> NULL, e.g. communism -> commun
- (m>1) ATE -> NULL, e.g. activate -> activ
- (m>1) ITI -> NULL, e.g. angulariti -> angular
- (m>1) OUS -> NULL, e.g. homologous -> homolog
- (m>1) IVE -> NULL, e.g. effective -> effect
- (m>1) IZE -> NULL, e.g. bowdlerize -> bowdler
15Step 5a
- (m>1) E -> NULL, e.g. probate -> probat, rate -> rate
- (m=1 and not *o) E -> NULL, e.g. cease -> ceas
Step 5b
- (m>1 and *d and *L) -> single letter, e.g. controll -> control, roll -> roll
16Porter's algorithm (cont'd)
Example: the word duplicatable
duplicat (rule 4)
duplicate (rule 1b1)
duplic (rule 3)
Another rule in step 4, removing "ic", cannot be applied, since only one rule from each step is allowed to be applied.
cd /clair4/class/ir-w03/tf-idf
./stem.pl computers
computers comput
17Porter's algorithm
18Stemming
- Not always appropriate (e.g., proper names, titles)
- The same applies to casing (e.g., CAT vs. cat)
19String matching
20String matching methods
- Index-based
- Full or approximate
- E.g., theater vs. theatre
21Index-based matching
- Inverted files
- Position-based inverted files
- Block-based inverted files
Positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
Text: This is a text. A text has many words. Words are made from letters.
Index entries: Text 11, 19; Words 33, 40; From 55
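A position-based inverted file for the example sentence can be built with a short sketch. The function name `build_inverted_index` is hypothetical. Positions are 1-based character offsets, as in the example, and words are lowercased so that "words" and "Words" share one entry.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each lowercased word to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        index[match.group().lower()].append(match.start() + 1)
    return dict(index)

text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)

if __name__ == "__main__":
    for word in ["text", "words", "from"]:
        print(word, index[word])
```

A real index would usually skip stopwords such as "is" and "a"; the sketch indexes every word to keep the position arithmetic easy to follow.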
22Inverted index (trie)
- l -> Letters 60
- m -> a -> d -> Made 50
- m -> a -> n -> Many 28
- t -> Text 11, 19
- w -> Words 33, 40
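A trie over the same vocabulary can be sketched with nested dictionaries. The names `trie_insert` and `trie_lookup` and the end marker "$" are hypothetical. Unlike the trie in the figure, which branches only as far as needed to tell words apart, this sketch stores every letter of every word.

```python
END = "$"  # marker key holding the list of positions for a complete word

def trie_insert(root, word, position):
    """Insert one occurrence of word into the trie rooted at root."""
    node = root
    for ch in word.lower():
        node = node.setdefault(ch, {})
    node.setdefault(END, []).append(position)

def trie_lookup(root, word):
    """Return the positions stored for word, or an empty list if absent."""
    node = root
    for ch in word.lower():
        if ch not in node:
            return []
        node = node[ch]
    return node.get(END, [])

root = {}
for word, position in [("letters", 60), ("made", 50), ("many", 28),
                       ("text", 11), ("text", 19),
                       ("words", 33), ("words", 40)]:
    trie_insert(root, word, position)

if __name__ == "__main__":
    print(trie_lookup(root, "text"), trie_lookup(root, "man"))
```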
23Sequential searching
- No indexing structure given
- Given database d and search pattern p.
- Example: find "words" in the earlier example
- Brute force method
- try all possible starting positions
- O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn)
- Typical runtime is actually O(n), given that mismatches are easy to notice
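The brute force method can be sketched directly from the description above. The function name `brute_force_search` is hypothetical, and positions are 0-based here. The inner loop stops at the first mismatch, which is why typical inputs behave closer to O(n) than O(mn).

```python
def brute_force_search(database, pattern):
    """Return every 0-based position where pattern occurs in database."""
    n, m = len(database), len(pattern)
    matches = []
    for start in range(n - m + 1):          # O(n) starting positions
        j = 0
        while j < m and database[start + j] == pattern[j]:
            j += 1                          # up to O(m) comparisons each
        if j == m:
            matches.append(start)
    return matches

text = "This is a text. A text has many words. Words are made from letters."

if __name__ == "__main__":
    print(brute_force_search(text.lower(), "words"))
```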
24Knuth-Morris-Pratt
- Average runtime similar to brute force (BF)
- Worst-case runtime is linear, O(n)
- Idea: reuse knowledge gained from earlier comparisons
- Needs preprocessing of the pattern
25Knuth-Morris-Pratt (cont'd)
- Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
index  0  1  2  3  4  5  6  7
char   A  B  C  D  A  B  D
pos   -1  0  0  0  0  1  2  0
1234567
ABCDABD
ABCDABD
26Knuth-Morris-Pratt (cont'd)
[Figure: five successive alignments of the pattern ABCDABD against the database ABC ABC ABC ABDAB ABCDABCDABDE]
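The preprocessing and search phases can be sketched as follows. The function names `failure_table` and `kmp_search` are hypothetical. The table uses the convention of the example above, with -1 in the first slot and one extra entry past the end of the pattern, so that searching can continue after a full match.

```python
def failure_table(pattern):
    """table[k] = length of the longest proper border of pattern[:k]; table[0] = -1."""
    m = len(pattern)
    table = [0] * (m + 1)
    table[0] = -1
    cnd = 0   # length of the border currently being extended
    pos = 2
    while pos <= m:
        if pattern[pos - 1] == pattern[cnd]:
            cnd += 1
            table[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = table[cnd]      # fall back to a shorter border
        else:
            table[pos] = 0
            pos += 1
    return table

def kmp_search(database, pattern):
    """Return every 0-based position where pattern occurs in database."""
    if not pattern:
        return []
    table = failure_table(pattern)
    matches = []
    i = 0  # index into the database; never moves backwards
    j = 0  # number of pattern characters currently matched
    while i < len(database):
        if database[i] == pattern[j]:
            i += 1
            j += 1
            if j == len(pattern):
                matches.append(i - j)
                j = table[j]
        else:
            j = table[j]          # reuse what is known about the matched prefix
            if j < 0:
                i += 1
                j = 0
    return matches

if __name__ == "__main__":
    print(failure_table("ABCDABD"))
    print(kmp_search("ABC ABC ABC ABDAB ABCDABCDABDE", "ABCDABD"))
```

Because `i` never decreases, the search makes at most about 2n comparisons, which is where the linear worst case comes from.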
27Boyer-Moore
- Used in text editors
- Demos
- http://www-sr.informatik.uni-tuebingen.de/buehler/BM/BM.html
- http://www.blarg.com/doyle/pages/bmi.html
28Other methods
- The Soundex algorithm (Odell and Russell)
- Uses
- spelling correction
- hash function
- non-recoverable
29Word similarity
- Hamming distance - when words are of the same length
- Levenshtein distance - number of edits (insertions, deletions, replacements)
- color --> colour (1)
- survey --> surgery (2)
- com puter --> computer ?
- Longest common subsequence (LCS)
- lcs (survey, surgery) = surey
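The three similarity measures above can be sketched with standard dynamic programming. The function names are hypothetical, and unit cost for every edit is an assumption of this sketch.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # replace x by y (free if equal)
            ))
        previous = current
    return previous[-1]

def lcs(a, b):
    """One longest common subsequence of a and b."""
    # table[i][j] holds an LCS of a[:i] and b[:j].
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + x
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1], key=len)
    return table[-1][-1]

if __name__ == "__main__":
    print(levenshtein("color", "colour"))
    print(levenshtein("survey", "surgery"))
    print(lcs("survey", "surgery"))
```

Under these unit costs, "com puter" to "computer" comes out as a single deletion of the space.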
30The Soundex algorithm
- 1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
- 2. Assign the following numbers to the remaining letters after the first
- b, f, p, v: 1
- c, g, j, k, q, s, x, z: 2
- d, t: 3
- l: 4
- m, n: 5
- r: 6
31The Soundex algorithm
- 3. If two or more letters with the same code were adjacent in the original name, omit all but the first
- 4. Convert to the form LDDD by adding terminal zeros or by dropping rightmost digits
- Examples
- Euler E460, Gauss G200, Hilbert H416, Knuth K530, Lloyd L300
- same as Ellery, Ghosh, Heilbronn, Kant, and Ladd
- Some problems: Rogers and Rodgers, Sinclair and StClair
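The four steps can be sketched as a single pass over the name. The names `CODES` and `soundex` are hypothetical. The sketch follows the steps as stated here, where only letters adjacent in the original name are merged; some Soundex variants also merge equal codes separated by h or w, which this sketch does not do.

```python
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character LDDD Soundex code of a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    # The first letter is kept as is, but its code still counts for adjacency.
    previous_code = CODES.get(letters[0])
    for ch in letters[1:]:
        code = CODES.get(ch)  # None for a, e, h, i, o, u, w, y
        if code is not None and code != previous_code:
            digits.append(code)
        previous_code = code
    # Pad with zeros or truncate to exactly one letter plus three digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]

if __name__ == "__main__":
    for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd",
                 "Rogers", "Rodgers"]:
        print(name, soundex(name))
```

Running it on Rogers and Rodgers gives two different codes, which illustrates the kind of problem listed above.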