Information Retrieval - PowerPoint PPT Presentation

(C) 2003, The University of Michigan.

1
Information Retrieval
January 28, 2005
  • Handout 4

2
Course Information
  • Instructor: Dragomir R. Radev (radev_at_si.umich.edu)
  • Office: 3080, West Hall Connector
  • Phone: (734) 615-5225
  • Office hours: M 11-12, Th 12-1, or via email
  • Course page: http://tangra.si.umich.edu/radev/650/
  • Class meets on Fridays, 2:10-4:55 PM, in 409 West
    Hall

3
Arithmetic coding
4
Arithmetic coding
  • Uses probabilities
  • Achieves about 2.5 bits per character, close to
    optimal
  • (Rissanen and Langdon 1979; Witten, Neal, and
    Cleary 1987)
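
To make "uses probabilities" concrete, here is a minimal sketch of the interval-narrowing idea behind arithmetic coding. It assumes a fixed (static) model in which a, b, and c are equally likely. The slides do not state a model, so `UNIFORM` and the function name `arithmetic_interval` are illustrative only. Any number inside the final interval identifies the string, given its length.

```python
from fractions import Fraction


def arithmetic_interval(text, probs):
    """Return the [low, high) interval assigned to text under a static model."""
    # Cumulative start of each symbol's slice of [0, 1), in dictionary order.
    starts, total = {}, Fraction(0)
    for symbol, p in probs.items():
        starts[symbol] = total
        total += p

    low, width = Fraction(0), Fraction(1)
    for symbol in text:
        # Narrow the current interval to the slice that belongs to this symbol.
        low += width * starts[symbol]
        width *= probs[symbol]
    return low, low + width


# Assumed model: three equally likely symbols.
UNIFORM = {s: Fraction(1, 3) for s in "abc"}

for s in ["aaa", "aab", "abc"]:
    low, high = arithmetic_interval(s, UNIFORM)
    print(s, low, high)
```

A more probable string gets a wider interval, and a wider interval needs fewer bits to single out a number inside it.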

6
Exercise
  • Assuming the alphabet consists of a, b, and c,
    develop arithmetic encodings for the following
    strings: aaa, aab, aba, baa, abc, cab, cba, bac

7
Stemming
8
Goals
  • Motivation
  • Computer, computers, computerize, computational,
    computerization
  • User, users, using, used
  • Representing related words as one token
  • Simplify matching
  • Reduce storage and computation
  • Also known as term conflation

9
Methods
  • Manual (tables)
  • Achievement -> achiev
  • Achiever -> achiev
  • Etc.
  • Affix removal (Harman 1991, Frakes 1992)
  • If a word ends in "ies" but not "eies" or "aies",
    then "ies" -> "y"
  • If a word ends in "es" but not "aes", "ees", or
    "oes", then "es" -> "e"
  • If a word ends in "s" but not "us" or "ss", then
    "s" -> NULL
  • (apply only the first applicable rule)
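
The three affix-removal rules above can be sketched as a short function. This is an illustrative reading, not a reference implementation: the name `s_stem` is made up here, and a word that is excluded from one rule (e.g. by ending in "aies") is assumed to fall through to the next rule.

```python
def s_stem(word):
    """Apply the first applicable plural-stripping rule to a lowercase word."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # ies -> y
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"      # es -> e
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # s -> NULL
    return word


for w in ["ponies", "computers", "users", "glass", "corpus"]:
    print(w, "->", s_stem(w))
```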

10
Porter's algorithm (Porter 1980)
  • Home page
  • http://www.tartarus.org/~martin/PorterStemmer
  • Reading assignment
  • http://www.tartarus.org/~martin/PorterStemmer/def.txt
  • Consonant-vowel sequences
  • CVCV ... C
  • CVCV ... V
  • VCVC ... C
  • VCVC ... V
  • Shorthand: [C]VCVC ... [V]

11
Porter's algorithm (cont'd)
  • [C](VC){m}[V]
  • {m} indicates m repetitions of VC
  • Examples
  • m=0: TR, EE, TREE, Y, BY
  • m=1: TROUBLE, OATS, TREES, IVY
  • m=2: TROUBLES, PRIVATE, OATEN
  • Conditions
  • *S - the stem ends with S (and similarly for the
    other letters).
  • *v* - the stem contains a vowel.
  • *d - the stem ends with a double consonant (e.g.
    -TT, -SS).
  • *o - the stem ends cvc, where the second c is not
    W, X or Y (e.g. -WIL, -HOP).
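
The measure m and the stem conditions above can be computed from a consonant/vowel pattern. The sketch below is one way to do it, with hypothetical helper names. It treats a, e, i, o, u as vowels and counts y as a vowel only when it follows a consonant, which is consistent with the examples on this slide (Y and BY have m=0, IVY has m=1).

```python
def cv_pattern(stem):
    """Map a stem to a string of 'C' and 'V' marks, one per letter."""
    pattern = ""
    for i, ch in enumerate(stem.lower()):
        if ch in "aeiou":
            pattern += "V"
        elif ch == "y" and i > 0 and pattern[i - 1] == "C":
            pattern += "V"          # y after a consonant acts as a vowel
        else:
            pattern += "C"
    return pattern


def measure(stem):
    """m = number of vowel-run to consonant-run transitions."""
    return cv_pattern(stem).count("VC")


def contains_vowel(stem):
    return "V" in cv_pattern(stem)


def ends_double_consonant(stem):
    s = stem.lower()
    return len(s) >= 2 and s[-1] == s[-2] and cv_pattern(s)[-1] == "C"


def ends_cvc(stem):
    s = stem.lower()
    return len(s) >= 3 and cv_pattern(s)[-3:] == "CVC" and s[-1] not in "wxy"


for word in ["TREE", "TROUBLE", "TROUBLES"]:
    print(word, measure(word))
```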

12
Step 1a
  • SSES -> SS: caresses -> caress
  • IES -> I: ponies -> poni, ties -> ti
  • SS -> SS: caress -> caress
  • S -> NULL: cats -> cat
Step 1b
  • (m>0) EED -> EE: feed -> feed, agreed -> agree
  • (*v*) ED -> NULL: plastered -> plaster, bled -> bled
  • (*v*) ING -> NULL: motoring -> motor, sing -> sing
Step 1b1
If the second or third of the rules in Step 1b is
successful, the following is done:
  • AT -> ATE: conflat(ed) -> conflate
  • BL -> BLE: troubl(ed) -> trouble
  • IZ -> IZE: siz(ed) -> size
  • (*d and not (*L or *S or *Z)) -> single letter:
    hopp(ing) -> hop, tann(ed) -> tan, fall(ing) -> fall,
    hiss(ing) -> hiss, fizz(ed) -> fizz
  • (m=1 and *o) -> E: fail(ing) -> fail, fil(ing) -> file
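
Step 1a has no conditions on the stem, so it is the easiest step to sketch. The function below is a minimal illustration (the name `step1a` is made up here). It tries the suffixes in the order listed and applies only the first one that matches.

```python
def step1a(word):
    """Porter Step 1a on a lowercase word: plural-suffix handling."""
    if word.endswith("sses"):
        return word[:-2]        # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]        # IES -> I
    if word.endswith("ss"):
        return word             # SS -> SS (blocks the plain S rule)
    if word.endswith("s"):
        return word[:-1]        # S -> NULL
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1a(w))
```

The SS rule changes nothing itself; it exists so that words such as "caress" are not caught by the final S rule.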
13
Step 1c
  • (*v*) Y -> I: happy -> happi, sky -> sky
Step 2
  • (m>0) ATIONAL -> ATE: relational -> relate
  • (m>0) TIONAL -> TION: conditional -> condition,
    rational -> rational
  • (m>0) ENCI -> ENCE: valenci -> valence
  • (m>0) ANCI -> ANCE: hesitanci -> hesitance
  • (m>0) IZER -> IZE: digitizer -> digitize
  • (m>0) ABLI -> ABLE: conformabli -> conformable
  • (m>0) ALLI -> AL: radicalli -> radical
  • (m>0) ENTLI -> ENT: differentli -> different
  • (m>0) ELI -> E: vileli -> vile
  • (m>0) OUSLI -> OUS: analogousli -> analogous
  • (m>0) IZATION -> IZE: vietnamization -> vietnamize
  • (m>0) ATION -> ATE: predication -> predicate
  • (m>0) ATOR -> ATE: operator -> operate
  • (m>0) ALISM -> AL: feudalism -> feudal
  • (m>0) IVENESS -> IVE: decisiveness -> decisive
  • (m>0) FULNESS -> FUL: hopefulness -> hopeful
  • (m>0) OUSNESS -> OUS: callousness -> callous
  • (m>0) ALITI -> AL: formaliti -> formal
  • (m>0) IVITI -> IVE: sensitiviti -> sensitive
  • (m>0) BILITI -> BLE: sensibiliti -> sensible
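
Step 2 is naturally table-driven: find the matching suffix, check that the remaining stem has m>0, and substitute. The sketch below is illustrative, with hypothetical names. It assumes that only the first matching suffix in the listed order is considered, which matters for "rational": it matches ATIONAL, the stem "r" has m=0, and the word is left unchanged.

```python
STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]


def measure(stem):
    """Porter's m: count of VC transitions; y after a consonant is a vowel."""
    pattern = ""
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "C")
        pattern += "V" if is_vowel else "C"
    return pattern.count("VC")


def step2(word):
    """Apply the first Step 2 rule whose suffix matches a lowercase word."""
    for suffix, replacement in STEP2_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only this suffix is considered; the rule fires if m > 0.
            return stem + replacement if measure(stem) > 0 else word
    return word


for w in ["relational", "conditional", "rational", "vietnamization"]:
    print(w, "->", step2(w))
```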
14
Step 3
  • (m>0) ICATE -> IC: triplicate -> triplic
  • (m>0) ATIVE -> NULL: formative -> form
  • (m>0) ALIZE -> AL: formalize -> formal
  • (m>0) ICITI -> IC: electriciti -> electric
  • (m>0) ICAL -> IC: electrical -> electric
  • (m>0) FUL -> NULL: hopeful -> hope
  • (m>0) NESS -> NULL: goodness -> good
Step 4
  • (m>1) AL -> NULL: revival -> reviv
  • (m>1) ANCE -> NULL: allowance -> allow
  • (m>1) ENCE -> NULL: inference -> infer
  • (m>1) ER -> NULL: airliner -> airlin
  • (m>1) IC -> NULL: gyroscopic -> gyroscop
  • (m>1) ABLE -> NULL: adjustable -> adjust
  • (m>1) IBLE -> NULL: defensible -> defens
  • (m>1) ANT -> NULL: irritant -> irrit
  • (m>1) EMENT -> NULL: replacement -> replac
  • (m>1) MENT -> NULL: adjustment -> adjust
  • (m>1) ENT -> NULL: dependent -> depend
  • (m>1 and (*S or *T)) ION -> NULL: adoption -> adopt
  • (m>1) OU -> NULL: homologou -> homolog
  • (m>1) ISM -> NULL: communism -> commun
  • (m>1) ATE -> NULL: activate -> activ
  • (m>1) ITI -> NULL: angulariti -> angular
  • (m>1) OUS -> NULL: homologous -> homolog
  • (m>1) IVE -> NULL: effective -> effect
  • (m>1) IZE -> NULL: bowdlerize -> bowdler
15
Step 5a
  • (m>1) E -> NULL: probate -> probat, rate -> rate
  • (m=1 and not *o) E -> NULL: cease -> ceas
Step 5b
  • (m>1 and *d and *L) -> single letter:
    controll -> control, roll -> roll
16
Porter's algorithm (cont'd)
Example: the word "duplicatable"
  • duplicat (rule 4)
  • duplicate (rule 1b1)
  • duplic (rule 3)
Another rule in step 4, removing "ic", cannot be
applied, since only one rule from each step is
allowed to be applied.
cd /clair4/class/ir-w03/tf-idf
./stem.pl computers
computers comput
17
Porter's algorithm
18
Stemming
  • Not always appropriate (e.g., proper names,
    titles)
  • The same applies to casing (e.g., CAT vs. cat)

19
String matching
20
String matching methods
  • Index-based
  • Full or approximate
  • E.g., theater vs. theatre

21
Index-based matching
  • Inverted files
  • Position-based inverted files
  • Block-based inverted files

Word start positions:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
Text:
This is a text. A text has many words. Words are made from letters.
Index entries:
  • Text: 11, 19
  • Words: 33, 40
  • From: 55
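
A position-based inverted file for the sample text can be built in a few lines. This is a minimal sketch with assumed conventions: words are runs of letters, matching is case-insensitive, and positions are 1-based character offsets, which reproduces the numbers in the example. The names `build_inverted_index` and `TEXT` are illustrative.

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each lowercase word to the 1-based offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        index[match.group().lower()].append(match.start() + 1)
    return dict(index)


TEXT = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(TEXT)
for word in ["text", "words", "from"]:
    print(word, index[word])
```

A block-based variant would store a block number instead of the exact offset, trading precision for a smaller index.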
22
Inverted index (trie)
root
  l: Letters 60
  m, a
    d: Made 50
    n: Many 28
  t: Text 11, 19
  w: Words 33, 40
23
Sequential searching
  • No indexing structure given
  • Given database d and search pattern p.
  • Example: find "words" in the earlier example
  • Brute force method
  • try all possible starting positions
  • O(n) positions in the database and O(m)
    characters in the pattern so the total worst-case
    runtime is O(mn)
  • Typical runtime is actually O(n) given that
    mismatches are easy to notice
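
The brute-force method can be sketched directly from the description above: try every starting position and compare character by character. The function name and the use of 0-based offsets are choices made for this illustration.

```python
def brute_force_search(database, pattern):
    """Return every 0-based offset at which pattern occurs in database."""
    n, m = len(database), len(pattern)
    matches = []
    for start in range(n - m + 1):          # O(n) candidate positions
        k = 0
        # Up to O(m) comparisons per position; most stop at the first mismatch.
        while k < m and database[start + k] == pattern[k]:
            k += 1
        if k == m:
            matches.append(start)
    return matches


text = "This is a text. A text has many words. Words are made from letters."
print(brute_force_search(text, "text"))
```

The inner loop usually exits after one or two comparisons, which is why typical running time is close to O(n) even though the worst case is O(mn).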

24
Knuth-Morris-Pratt
  • Average runtime similar to brute force (BF)
  • Worst-case runtime is linear: O(n)
  • Idea: reuse knowledge about the characters
    already matched
  • Needs preprocessing of the pattern

25
Knuth-Morris-Pratt (contd)
  • Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)

database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD

index: 0  1  2  3  4  5  6  7
char:  A  B  C  D  A  B  D
pos:  -1  0  0  0  0  1  2  0

1234567
ABCDABD
ABCDABD
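
The "pos" row of the table above can be produced by the pattern-preprocessing step. The sketch below is one common formulation, offered as an illustration rather than the exact routine from the slides: entry i (for i >= 1) is the length of the longest proper prefix of the first i pattern characters that is also a suffix of them, and entry 0 is the sentinel -1. The function name `kmp_table` is made up here.

```python
def kmp_table(pattern):
    """Build the KMP fallback table, with one entry per prefix length 0..m."""
    m = len(pattern)
    table = [0] * (m + 1)
    table[0] = -1                   # sentinel: nothing to fall back to
    pos, cnd = 2, 0                 # cnd = length of the current candidate prefix
    while pos <= m:
        if pattern[pos - 1] == pattern[cnd]:
            cnd += 1                # the candidate prefix grows by one character
            table[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = table[cnd]        # fall back to a shorter candidate
        else:
            table[pos] = 0
            pos += 1
    return table


print(kmp_table("ABCDABD"))
```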
26
Knuth-Morris-Pratt (contd)
[Figure: five successive alignments of the pattern ABCDABD against the database ABC ABC ABC ABDAB ABCDABCDABDE]
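
The search phase can be sketched as follows. The text pointer never moves backwards; on a mismatch, only the number of matched pattern characters is reduced, using a table of prefix/suffix overlap lengths. This is an illustrative implementation with hypothetical names, assuming a non-empty pattern and 0-based offsets.

```python
def overlap_table(pattern):
    """table[i] = longest proper prefix of pattern[:i] that is also its suffix."""
    table = [0] * (len(pattern) + 1)
    table[0] = -1
    k = 0
    for pos in range(2, len(pattern) + 1):
        while k > 0 and pattern[pos - 1] != pattern[k]:
            k = table[k]
        if pattern[pos - 1] == pattern[k]:
            k += 1
        table[pos] = k
    return table


def kmp_search(database, pattern):
    """Return every 0-based offset at which a non-empty pattern occurs."""
    table = overlap_table(pattern)
    matches, k = [], 0              # k = pattern characters matched so far
    for i, ch in enumerate(database):
        while k > 0 and ch != pattern[k]:
            k = table[k]            # reuse what is known about the matched part
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k]            # keep going to find overlapping matches
    return matches


print(kmp_search("ABC ABC ABC ABDAB ABCDABCDABDE", "ABCDABD"))
```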
27
Boyer-Moore
  • Used in text editors
  • Demos
  • http://www-sr.informatik.uni-tuebingen.de/buehler/BM/BM.html
  • http://www.blarg.com/doyle/pages/bmi.html

28
Other methods
  • The Soundex algorithm (Odell and Russell)
  • Uses
  • spelling correction
  • hash function
  • non-recoverable

29
Word similarity
  • Hamming distance - when words are of the same
    length
  • Levenshtein distance - number of edits
    (insertions, deletions, replacements)
  • color -> colour (1)
  • survey -> surgery (2)
  • com puter -> computer ?
  • Longest common subsequence (LCS)
  • lcs(survey, surgery) = surey
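
The three measures above can be sketched with standard dynamic programming. These are textbook formulations written for illustration, and the function names are not from the slides. Levenshtein distance here charges 1 for each insertion, deletion, or replacement.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs words of the same length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # replace or keep
        prev = cur
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    length = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            length[i + 1][j + 1] = (length[i][j] + 1 if x == y
                                    else max(length[i][j + 1], length[i + 1][j]))
    out, i, j = [], len(a), len(b)
    while i > 0 and j > 0:                           # walk back through the table
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(levenshtein("survey", "surgery"), lcs("survey", "surgery"))
```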

30
The Soundex algorithm
  • 1. Retain the first letter of the name, and drop
    all occurrences of a, e, h, i, o, u, w, y in other
    positions
  • 2. Assign the following numbers to the remaining
    letters after the first:
  • b, f, p, v: 1
  • c, g, j, k, q, s, x, z: 2
  • d, t: 3
  • l: 4
  • m, n: 5
  • r: 6

31
The Soundex algorithm
  • 3. If two or more letters with the same code were
    adjacent in the original name, omit all but the
    first
  • 4. Convert to the form LDDD by adding terminal
    zeros or by dropping rightmost digits
  • Examples
  • Euler: E460, Gauss: G200, Hilbert: H416,
    Knuth: K530, Lloyd: L300
  • Same codes as Ellery, Ghosh, Heilbronn, Kant, and
    Ladd, respectively
  • Some problems: Rogers and Rodgers, Sinclair and
    St. Clair
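
The four Soundex steps can be sketched as below. This follows the rules as stated on these slides, where "adjacent" is taken to mean strictly adjacent letters in the original name, including the first letter. Other Soundex variants treat h and w differently, and the names `CODES` and `soundex` are illustrative.

```python
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character LDDD code for a name."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    digits = []
    for i in range(1, len(name)):
        code = CODES.get(name[i])
        if code is None:
            continue                # step 1: a, e, h, i, o, u, w, y are dropped
        if code == CODES.get(name[i - 1]):
            continue                # step 3: same code as the adjacent letter
        digits.append(code)         # step 2: letter replaced by its number
    # Step 4: pad with zeros or truncate to one letter plus three digits.
    return (name[0].upper() + "".join(digits) + "000")[:4]


for n in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Rogers", "Rodgers"]:
    print(n, soundex(n))
```

Because many names share one code, the mapping cannot be inverted, which is what "non-recoverable" refers to on the earlier slide.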