Title: CSCI 7000 Modern Information Retrieval Jim Martin
Today (9/5)
- Review
- Dictionary contents
- Advanced query handling
- Phrases
- Wildcards
- Spelling correction
- First programming assignment
Index: a Dictionary file and a Postings file
Review: Dictionary
- What goes into creating the dictionary?
- Tokenization
- Case folding
- Stemming
- Stop-listing
- Dealing with numbers (and number-like entities)
- Complex morphology
Phrasal queries
- Want to handle queries such as
- "Colorado Buffaloes" as a phrase
- The concept is popular with users: about 10% of web queries are phrasal queries
- Postings that consist of document lists alone are not sufficient to handle phrasal queries
- Two general approaches
- Biword indexing
- Positional indexing
Solution 1: Biword Indexing
- Index every consecutive pair of terms in the text as a phrase
- For example, the text "Friends, Romans, Countrymen" would generate the biwords
- friends romans
- romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now free
- Not really.
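As a sketch, generating biwords from a token stream is a one-liner (the function name is illustrative, not from any particular toolkit):

```python
def biwords(tokens):
    """Pair each token with its successor to form biword dictionary terms."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# → ['friends romans', 'romans countrymen']
```

Each resulting string becomes an ordinary dictionary term with its own postings list.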
Longer Phrasal Queries
- Longer phrases can be broken into Boolean AND queries on the component biwords
- "Colorado Buffaloes at Arizona" becomes
- (Colorado Buffaloes) AND (Buffaloes at) AND (at Arizona)
- Susceptible to Type 1 errors (false positives)
Solution 2: Positional Indexing
- Change our posting content
- Store, for each term, entries of the form
- doc1: position1, position2, ...
- doc2: position1, position2, ...
- etc.
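A minimal in-memory sketch of building postings of that shape (names are mine, purely illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs maps doc_id -> token list; returns term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)  # positions arrive in sorted order
    return index

idx = build_positional_index({1: "to be or not to be".split()})
print(idx["to"][1], idx["be"][1])
# → [0, 4] [1, 5]
```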
9Positional index example
149 4 17, 191, 291, 430, 434 5 363, 367,
Which of docs 1,2,4,5 could contain to be or not
to be?
Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not
- Merge their doc:position lists to enumerate all positions with "to be or not to be"
- to:
- 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
- be:
- 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
- Same general method for proximity searches
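One way to sketch that merge, assuming postings of the form term -> {doc: sorted positions} (a toy layout, not any real engine's): shift each term's positions back by its offset within the phrase and intersect the resulting start positions.

```python
def phrase_match_docs(index, phrase):
    """Return docs where the phrase's terms occur at consecutive positions."""
    terms = phrase.split()
    # Only docs containing every term can match
    docs = set(index.get(terms[0], {}))
    for t in terms[1:]:
        docs &= set(index.get(t, {}))
    hits = []
    for d in sorted(docs):
        starts = set(index[terms[0]][d])
        for i, t in enumerate(terms[1:], start=1):
            starts &= {p - i for p in index[t][d]}  # align to phrase start
        if starts:
            hits.append(d)
    return hits

index = {
    "to": {1: [0, 4], 2: [5]},
    "be": {1: [1, 5], 2: [9]},
}
print(phrase_match_docs(index, "to be"))
# → [1]
```

Proximity search replaces the exact offset alignment with a window test (positions within k of each other).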
Positional index size
- As we'll see, you can compress position values/offsets
- But a positional index still expands the postings storage substantially
- Nevertheless, it is now the standard approach because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system
Rules of thumb
- Positional index size: 35-50% of the volume of the original text
- Caveat: all of this holds for English-like languages
Combination Techniques
- Biwords are faster
- And they cover a large percentage of very frequent (implied) phrasal queries
- "Britney Spears"
- So it can be effective to combine positional indexes with biword indexes for frequent bigrams
Web
Programming Assignment: Part 1
- Download and install Lucene
- How does Lucene handle (by default)
- Case, stemming, and phrasal queries
- Download and index a collection that I will point you at
- How big is the resulting index?
- Terms and size of index
- Return the top N document IDs (hits) from a set of queries I'll provide
Programming Assignment: Part 2
Wild Card Queries
- Two flavors
- Word-based
- Caribb*
- Phrasal
- "Pirates * Caribbean"
- General approach
- Generate a set of new queries from the original
- An operation on the dictionary
- Run those queries in a not-stupid way
Simple Single Wild-card Queries
- Single instance of a *
- * means any string of length 0 or more
- This is not the Kleene *
- mon*: find all docs containing any word beginning with mon
- Index your lexicon on prefixes
- *mon: find words ending in mon
- Maintain a backwards index
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
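A sketch of both lookups over a sorted lexicon (the tiny term list and helper names are mine): prefixes are found by binary search on the forward lexicon, suffixes by the same search on a lexicon of reversed terms.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """All terms starting with `prefix`, via binary search on a sorted list."""
    i = bisect_left(sorted_terms, prefix)
    out = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        out.append(sorted_terms[i])
        i += 1
    return out

lexicon = sorted(["moderate", "monday", "money", "salmon", "sermon"])
reversed_lexicon = sorted(t[::-1] for t in lexicon)  # the backwards index

print(prefix_matches(lexicon, "mon"))                              # mon*
# → ['monday', 'money']
print([t[::-1] for t in prefix_matches(reversed_lexicon, "nom")])  # *mon
# → ['salmon', 'sermon']
```

For a query like pro*cent, intersect the pro* results from the forward index with the *cent results from the backwards index.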
Arbitrary Wildcards
- How can we handle multiple *s in the middle of a query term?
- The solution: transform every wild-card query so that the *s occur at the end
- This gives rise to the Permuterm Index
Permuterm Index
- For term hello, index under
- hello$, ello$h, llo$he, lo$hel, o$hell
- where $ is a special end-of-term symbol
- Example
- Query: hel*o → Rotate: o$hel* → Look up terms with prefix o$hel
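The rotation machinery as a sketch, using `$` as the end symbol (function names are mine):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation is indexed, pointing back to term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_lookup_key(query):
    """For a single-* query: append '$', rotate the * to the end, then drop it."""
    q = query + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star]   # the prefix to look up

print(permuterm_rotations("hello"))
# → ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(permuterm_lookup_key("hel*o"))
# → 'o$hel'
```

The rotation o$hell of hello$ begins with the key o$hel, so the prefix lookup finds hello.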
Permuterm query processing
- Rotate the query so the wild-card goes to the end
- Now use indexed (prefix) lookup as before
- Permuterm problem: it roughly quadruples the lexicon size
- An empirical observation for English
Spelling Correction
- Two primary uses
- Correcting document(s) being indexed
- Retrieving matching documents when the query contains a spelling error
- Two main flavors
- Isolated word
- Check each word on its own for misspelling
- Will not catch typos resulting in correctly spelled words, e.g., from → form
- Context-sensitive
- Look at surrounding words, e.g., "I flew form Heathrow to Narita."
Document correction
- Primarily for OCR'ed documents
- Correction algorithms tuned for this
- Goal: the index (dictionary) contains fewer OCR-induced misspellings
- Can use domain-specific knowledge
- E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing)
Query correction
- Our principal focus here
- E.g., the query "Alanis Morisett"
- We can either
- Retrieve using that spelling
- Retrieve documents indexed by the correct spelling, OR
- Return several suggested alternative queries with the correct spelling
- "Did you mean ...?"
Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come
- Two basic choices for this
- A standard lexicon, such as
- Webster's English Dictionary
- An industry-specific lexicon, hand-maintained
- The lexicon of the indexed corpus
- E.g., all words on the web
- All names, acronyms, etc.
- (Including the misspellings)
Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- What's closest?
- We'll study several alternatives
- Edit distance
- Weighted edit distance
- Character n-gram overlap
Edit distance
- Given two strings S1 and S2, the minimum number of basic operations to convert one to the other
- Basic operations are typically character-level
- Insert
- Delete
- Replace
- E.g., the edit distance from cat to dog is 3.
- Generally found by dynamic programming.
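The standard dynamic program, as a sketch:

```python
def edit_distance(s1, s2):
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s1's first i characters
    for j in range(n + 1):
        d[0][j] = j          # insert all of s2's first j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # replace (free on match)
    return d[m][n]

print(edit_distance("cat", "dog"))    # → 3
print(edit_distance("from", "form"))  # → 2
```

The weighted variant on the next slide replaces the three unit costs with entries from a weight matrix.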
Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
- Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q
- Therefore, replacing m by n is a smaller edit distance than replacing it by q
- (The same ideas are usable for OCR, but with different weights)
- Requires a weight matrix as input
- Modify the dynamic programming to handle weights (Viterbi)
Using edit distances
- Given a query, first enumerate all dictionary terms within a preset (weighted) edit distance
- Then look up the enumerated dictionary terms in the term-document inverted index
Edit distance to all dictionary terms?
- Given a (misspelled) query, do we compute its edit distance to every dictionary term?
- Expensive and slow
- How do we cut down the set of candidate dictionary terms?
- We can use character n-gram overlap for this
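A sketch of that filter using character bigrams and Jaccard overlap (the threshold value is an arbitrary illustrative choice):

```python
def char_ngrams(term, n=2):
    """Set of character n-grams of a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def candidate_terms(query, lexicon, n=2, threshold=0.4):
    """Cheap pre-filter: keep only terms whose n-gram overlap with the
    query is high; run (weighted) edit distance on these few terms only."""
    q = char_ngrams(query, n)
    return [t for t in lexicon if jaccard(q, char_ngrams(t, n)) >= threshold]

print(candidate_terms("bord", ["border", "board", "morbid", "lord"]))
# → ['border', 'board', 'lord']
```

Only the surviving candidates go through the expensive dynamic-programming step.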
Context-sensitive spell correction
- Text: "I flew from Heathrow to Narita."
- Consider the phrase query "flew form Heathrow"
- We'd like to respond
- Did you mean "flew from Heathrow"?
- because no docs matched the original query phrase
Context-sensitive correction
- Need surrounding context to catch this
- Full NLP is too heavyweight for this
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Then try all possible resulting phrases with one word fixed at a time
- flew from heathrow
- fled form heathrow
- flea form heathrow
- etc.
- Suggest the alternative that has lots of hits?
Exercise
- Suppose that for "flew form Heathrow" we have 7 alternatives for flew, 19 for form, and 3 for heathrow.
- How many corrected phrases will we enumerate in this scheme?
General issues in spell correction
- Will enumerate multiple alternatives for "Did you mean"
- Need to figure out which one (or a small number) to present to the user
- Use heuristics
- The alternative hitting the most docs
- Query log analysis and tweaking
- For especially popular, topical queries
- Language modeling
Computational cost
- Spell-correction is computationally expensive
- Avoid running routinely on every query?
- Run only on queries that matched few docs
Next Time
- On to Chapter 4
- Real indexing