Title: Today's Topics
1Today's Topics
- Boolean IR
- Signature files
- Inverted files
- PAT trees
- Suffix arrays
2Boolean IR
- Pre-1970s - dominant industrial model through 1994 (Lexis-Nexis, DIALOG)
- Documents composed of TERMS (words, stems)
- Express result in set-theoretic terms:
A AND B
(A AND B) OR C
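The set-theoretic reading of these expressions can be sketched with Python sets. The toy collection and the helper name `containing` are assumptions made for illustration only.

```python
# Hypothetical toy collection: doc id -> set of terms in that document.
docs = {
    1: {"bayes", "viterbi"},
    2: {"bayes"},
    3: {"baum"},
    4: {"viterbi", "baum"},
}

def containing(term):
    """Return the set of doc ids whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

A = containing("bayes")
B = containing("viterbi")
C = containing("baum")

a_and_b = A & B             # A AND B       -> set intersection
a_and_b_or_c = (A & B) | C  # (A AND B) OR C -> intersection, then union
```

A document is either in the result set or not; nothing in this model orders the results.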
3Boolean Operators
Docs containing term A
A AND B
A OR B
(A AND B) OR C
A AND (NOT B)
Proximity Operators (Extended ANDs)
- Adjacent AND -> "A B", e.g. "Johns Hopkins", "The Who"
- Proximity window -> A w/10 B: A and B within +/- 10 words
- -> A w/sent B: A and B in same sentence
- In general: A and B within +/- K words
4Boolean IR (implementation)
- Bit vectors
- Inverted files (a.k.a. index)
- PAT tree (more powerful index)
[Figure: document bit vectors V1 and V2, one bit per Term i]
Bit vectors are impractical:
-> very sparse (wastefully big)
-> costly to compare
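A minimal sketch of the bit-vector representation, using Python integers as bit strings. The four-word vocabulary and the function names are hypothetical; with a realistic vocabulary almost every bit would be zero, which is the sparsity problem noted above.

```python
# Hypothetical vocabulary; bit i of a vector stands for vocab[i].
vocab = ["baum", "bayes", "viterbi", "hopkins"]

def bit_vector(terms):
    """Encode a set of terms as an integer with one bit per vocabulary word."""
    v = 0
    for i, term in enumerate(vocab):
        if term in terms:
            v |= 1 << i
    return v

def matches_all(doc_vec, query_vec):
    """True if every bit set in the query is also set in the document (AND query)."""
    return doc_vec & query_vec == query_vec

v1 = bit_vector({"bayes", "viterbi"})
v2 = bit_vector({"bayes"})
query = bit_vector({"bayes", "viterbi"})

hits = [name for name, vec in (("V1", v1), ("V2", v2)) if matches_all(vec, query)]
```

Every document vector has to be compared against the query, so the cost grows with both collection size and vocabulary size.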
5Problems with Boolean IR
- Does not effectively support relevance ranking of returned documents
- Base model: expression satisfaction is Boolean
- A document matches the expression or it doesn't
- Extensions to permit ordering, e.g. for (A AND B) OR C:
- Supermatches (5 terms/doc > 3 terms/doc)
- Partial matches (expression incompletely satisfied: give partial credit)
- Importance weighting (10A OR 5B), where the coefficients give weight/importance
6Boolean IR
- Advantages: can directly control search
- Good for precise queries in structured data (e.g. database search or legal index)
- Disadvantages: must directly control search
- Users should be familiar with domain and term space (know what to ask for and exclude)
- Poor at relevance ranking
- Poor at weighted query expansion, user modelling, etc.
7Signature Files
Document bit vector -> mapping function f( ) -> signature (fewer bits)
Superimposed coding: using some mapping/hash function
Problem: several different document bit vectors (i.e. different words) get mapped to the same signature. (Use a stoplist to help keep common words from overwhelming signatures.)
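A minimal sketch of superimposed coding, assuming a 64-bit signature, three bits set per word, and SHA-256 as the mapping function. All three choices, the tiny stoplist, and the function names are illustrative assumptions, not parameters from the slides.

```python
import hashlib

SIG_BITS = 64        # assumed signature width
BITS_PER_WORD = 3    # assumed number of bit positions set per word
STOPLIST = {"the", "of", "and"}  # hypothetical stoplist

def word_signature(word):
    """Map one word to a signature with up to BITS_PER_WORD bits set."""
    sig = 0
    for k in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{k}:{word}".encode()).digest()
        pos = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << pos
    return sig

def doc_signature(words):
    """Superimpose (OR together) the signatures of all non-stoplist words."""
    sig = 0
    for w in words:
        w = w.lower()
        if w not in STOPLIST:
            sig |= word_signature(w)
    return sig

def may_contain(doc_sig, word):
    """True if all of the word's bits are set; can be a false positive."""
    ws = word_signature(word.lower())
    return doc_sig & ws == ws

sig = doc_signature("the johns hopkins university".split())
```

A `True` from `may_contain` only says the document cannot be ruled out; a `False` is definitive.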
8False Drop Problem
- On retrieval, all documents/bit vectors mapped to the same signature are retrieved (returned)
- Only a portion are relevant
- Need to do a secondary validation step to make sure target words actually match
- Prob(False Drop) = Prob(Signature qualifies | Text does not)
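The two-step retrieval (signature filter, then validation against the text) can be sketched as follows. The signature function here is deliberately weak (one bit chosen by word length) so that a false drop is guaranteed to appear; it is a hypothetical stand-in, not a recommended mapping.

```python
def toy_signature(words):
    """Deliberately collision-prone signature: one bit per word, chosen by length."""
    sig = 0
    for w in words:
        sig |= 1 << (len(w) % 8)
    return sig

# Hypothetical three-document collection.
docs = {
    1: ["johns", "hopkins"],
    2: ["bayes", "rule"],
    3: ["viterbi", "decoding"],
}
signatures = {doc_id: toy_signature(ws) for doc_id, ws in docs.items()}

def search(word):
    """Return (signature candidates, candidates confirmed against the text)."""
    q = toy_signature([word])
    candidates = [d for d, s in signatures.items() if s & q == q]
    confirmed = [d for d in candidates if word in docs[d]]
    return candidates, confirmed

candidates, confirmed = search("bayes")

# Empirical false-drop rate: among docs that do NOT contain the word,
# the fraction whose signature nevertheless qualifies.
non_matching = [d for d in docs if "bayes" not in docs[d]]
false_drop_rate = sum(d in candidates for d in non_matching) / len(non_matching)
```

Here document 1 is a false drop: "johns" and "bayes" have the same length, so their toy signatures collide.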
9Efficiency Problem
- Testing for signature match may require linear
scan through all document signatures
10Vertical Partitioning
- Improves sig1, sig2 comparison speed,
- but still requires O(N) linear search of all signatures
- Options:
- Bit-sliced onto different devices for parallel comparison
- AND together matches on each segment
[Figure: sig1 and sig2 compared segment by segment; the per-segment results are ANDed -> result]
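A minimal sketch of segment-wise comparison, assuming 64-bit signatures cut into four 16-bit segments (both sizes are assumptions). The segments are compared sequentially here; in the scheme described above each segment could live on a different device and be compared in parallel.

```python
SEG_BITS = 16   # assumed segment width
N_SEGS = 4      # assumed number of segments (64-bit signatures)

def segments(sig):
    """Split a signature into N_SEGS segments, lowest-order segment first."""
    mask = (1 << SEG_BITS) - 1
    return [(sig >> (k * SEG_BITS)) & mask for k in range(N_SEGS)]

def segment_match(doc_seg, query_seg):
    """True if the query's bits in this segment are all present in the document's."""
    return doc_seg & query_seg == query_seg

def signature_match(doc_sig, query_sig):
    """AND together the per-segment match results."""
    per_segment = [
        segment_match(d, q)
        for d, q in zip(segments(doc_sig), segments(query_sig))
    ]
    return all(per_segment)

doc_sig = 0xF0F0_0000_0000_0001
hit = signature_match(doc_sig, 0x8000_0000_0000_0001)
miss = signature_match(doc_sig, 0x0001_0000_0000_0000)
```

The result is the same as testing the whole signature at once; only the work per comparison is split up, so the scan over all N signatures remains.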
11Horizontal Partitioning
- Goal: avoid sequential scanning of the signature file
Input signature -> hash function or index yielding specific candidates to try -> signature database
12Inverted Files
Documents: 14 15 16 17 ... 37 38 39 40
Index (term -> documents):
Baum -> 14 39 156
Bayes -> 39 45 156 290
Viterbi -> 41 86 156 217
13Inverted Files
- Very efficient for single word queries
- Just enumerate documents pointed to by index
- O(|A|) = O(S_A)
- Efficient for ORs
- Just enumerate both lists and remove duplicates: O(S_A + S_B)
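A minimal sketch of single-term lookup and OR over sorted posting lists. The posting lists and the names `lookup` and `or_merge` are illustrative assumptions.

```python
# Hypothetical inverted index: term -> sorted list of doc ids.
index = {
    "baum": [14, 39, 156],
    "bayes": [39, 45, 156, 290],
    "viterbi": [41, 86, 156, 217],
}

def lookup(term):
    """Single-word query: just return the term's posting list, O(S_A)."""
    return index.get(term, [])

def or_merge(a, b):
    """Union of two sorted lists without duplicates, O(S_A + S_B)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # same doc in both lists: emit once
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])           # at most one of these tails is non-empty
    out.extend(b[j:])
    return out

baum_or_bayes = or_merge(lookup("baum"), lookup("bayes"))
```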
14ANDs using Inverted Files
(meet search)
Method 1
Index for Bayes (A, pointer i): 14 39 156 227 319
Index for Viterbi (B, pointer j): 39 45 58 96 156 208
O(S_A + S_B): same as OR, but smaller output
- Begin with two pointers (i, j) at the start of the lists in the index (A, B)
- if A[i] = B[j], write A[i] to output and advance both i and j
- if A[i] < B[j], i++, else j++
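The two-pointer AND can be sketched as below, run on the Bayes and Viterbi lists from the slide. The function name `and_merge` is an assumption.

```python
def and_merge(a, b):
    """Intersect two sorted posting lists with two pointers, O(S_A + S_B)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:    # doc appears in both lists
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:   # advance whichever pointer is behind
            i += 1
        else:
            j += 1
    return out

bayes = [14, 39, 156, 227, 319]
viterbi = [39, 45, 58, 96, 156, 208]
both = and_merge(bayes, viterbi)
```

The loop stops as soon as either list is exhausted, since no further matches are possible.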
15ANDs using Inverted Files
Method 2: useful if one index is much smaller than the other (S_A << S_B)
A (Johns): 39 227
B (Hopkins): 1 5 25 28 39 45 58 96 156
For all members A[i] of the smaller index, do bsearch(A[i], B), i.e. a binary search into the larger index
A AND B AND C: order by smaller list pairwise
Cost: S_A log2(S_B); can achieve S_A log log(S_B)
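A minimal sketch of Method 2 using the standard-library `bisect` module for the binary search, plus the smallest-list-first ordering for multi-term ANDs. The function names and the third list `c` are assumptions; `and_all` expects at least one list.

```python
from bisect import bisect_left

def and_bsearch(small, large):
    """For each member of the smaller list, binary-search the larger one.

    Cost is about S_A * log2(S_B) comparisons.
    """
    out = []
    for x in small:
        k = bisect_left(large, x)
        if k < len(large) and large[k] == x:
            out.append(x)
    return out

def and_all(lists):
    """A AND B AND C ...: intersect pairwise, smallest list first."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for nxt in ordered[1:]:
        result = and_bsearch(result, nxt)
    return result

johns = [39, 227]
hopkins = [1, 5, 25, 28, 39, 45, 58, 96, 156]
c = [5, 39, 45]  # hypothetical third posting list

johns_and_hopkins = and_bsearch(johns, hopkins)
all_three = and_all([hopkins, c, johns])
```

Starting from the smallest list keeps every intermediate result small, so each later binary-search pass has few keys to look up.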
16Proximity Search
Document-level indexes are not adequate
Option 1: index to corpus position offset
[Figure: corpus text "Anthony Johns Hopkins ..." spanning Doc 1, Doc 2, Doc 3, ..., Doc i, with index pointers into word positions]
- Before: match if ptrA = ptrB
- Now: "A B" match if ptrA = ptrB - 1
- A w/10 B match if |ptrA - ptrB| <= 10
Size of corpus ~ size of index
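A minimal sketch of adjacency and window matching over a position-level index. The toy token sequence and the function names are assumptions; positions are word offsets into the corpus.

```python
from collections import defaultdict

# Hypothetical toy corpus, already tokenized and lower-cased.
tokens = "the johns hopkins university is named after johns hopkins".split()

def build_positions(tokens):
    """Position-level index: word -> sorted list of word offsets."""
    idx = defaultdict(list)
    for pos, word in enumerate(tokens):
        idx[word].append(pos)
    return idx

def adjacent(idx, a, b):
    """Positions of A where B follows immediately (ptrA = ptrB - 1)."""
    b_positions = set(idx.get(b, []))
    return [p for p in idx.get(a, []) if p + 1 in b_positions]

def within(idx, a, b, k):
    """All (ptrA, ptrB) pairs with |ptrA - ptrB| <= k (A w/k B)."""
    return [
        (pa, pb)
        for pa in idx.get(a, [])
        for pb in idx.get(b, [])
        if abs(pa - pb) <= k
    ]

idx = build_positions(tokens)
phrase_hits = adjacent(idx, "johns", "hopkins")
window_hits = within(idx, "the", "hopkins", 3)
```

`within` compares every pair of positions for clarity; since both lists are sorted, a two-pointer sweep would avoid the quadratic scan.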
17Variations 1
Don't index function words
[Figure: index over the word list; "Johns" is indexed, "The" is crossed out (X); corpus text "The Johns Hopkins"]
- Do linear match search in corpus
- Savings of 50% on index size
- Potential speed improvement, given data access costs
18Variations 2 Multilevel Indexes
Position level
Doc level
Johns Hopkins
- Supports parallel search
- May have paging cost advantage
- Cost: large index
- N + dV (V = avg. doc vocab size)
[Figure: doc-level and position-level index entries pointing into the corpus "Anthony Johns Hopkins ..."]
19Interpolation Search
Useful when data are numeric and uniformly distributed
- # of cells in index: 100
- Values range from 0 to 1000
- Goal: looking for the value 211
- Binary search: begin looking at cell 50
- Interpolation search: better guess for 1st cell to examine?
Index B (cell i -> value):
cell:  17  18  19  20  21  22  23 ...  48  49  50  51 ... 100
value: 174 195 211 226 230 231 246 ... 483 496 521 526 ... 995
20- Binary Search
- Bsearch(low, high, key)
-   mid = (high + low) / 2
-   If (key = A[mid])
-     return mid
-   Else if (key < A[mid])
-     return Bsearch(low, mid-1, key)
-   Else
-     return Bsearch(mid+1, high, key)
- Interpolation Search
- Isearch(low, high, key)
-   mid = best estimate of pos
-   mid = low + (high - low) * (expected % of way through range)
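Both searches can be sketched iteratively as below. Taking the "expected % of way through range" as (key - A[low]) / (A[high] - A[low]) is the usual linear-interpolation estimate and is an assumption here, as are the function names and the synthetic uniform data. Each function also returns the list of cells it probed.

```python
def bsearch(a, key):
    """Binary search; returns (index or -1, list of probed cells)."""
    low, high = 0, len(a) - 1
    probes = []
    while low <= high:
        mid = (low + high) // 2
        probes.append(mid)
        if a[mid] == key:
            return mid, probes
        if key < a[mid]:
            high = mid - 1
        else:
            low = mid + 1
    return -1, probes

def isearch(a, key):
    """Interpolation search; returns (index or -1, list of probed cells)."""
    low, high = 0, len(a) - 1
    probes = []
    while low <= high and a[low] <= key <= a[high]:
        if a[high] == a[low]:
            mid = low  # flat range: avoid division by zero
        else:
            # Estimate the position from where key falls in the value range.
            mid = low + (high - low) * (key - a[low]) // (a[high] - a[low])
        probes.append(mid)
        if a[mid] == key:
            return mid, probes
        if key < a[mid]:
            high = mid - 1
        else:
            low = mid + 1
    return -1, probes

# Synthetic, perfectly uniform data: 100 cells with values 0, 10, ..., 990.
cells = [10 * i for i in range(100)]
b_idx, b_probes = bsearch(cells, 210)
i_idx, i_probes = isearch(cells, 210)
```

On perfectly uniform data the first interpolation guess lands on the target; on skewed data the estimate can be poor and binary search may do better.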
21Comparison
Typical sequence of cells tested:
- Binary search: 50, 25, 12, 18, 22, 21, 19
- Interpolation search: 21, 19 -> go directly to expected region
- Interpolation search cost: log log (N)
22Cost of Computing Inverted Index
- Simple: (word, position) pairs and sort -> N log N (N = corpus size)
- If N >> memory size:
- Tokenize (words -> integers)
- Create histogram
- Allocate space in index
- Do multipass (K-pass) through corpus, only adding tokens in K bins
23K-pass Indexing
[Figure: index divided into blocks for W1, W2, W3, W4; Block 1 is filled on pass K = 1, the next block on pass K = 2]
Time: KN + 1, but a big win over N log N on paging
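One possible reading of the K-pass scheme, sketched in memory: tokenize, build a histogram, allocate one contiguous slot range per word, then make K passes, each filling only the slots of words whose ids fall in that pass's bin. The bin assignment by id range and all names are assumptions; a real implementation would re-read the corpus from disk on each pass and keep only the active block in memory.

```python
from collections import Counter

def kpass_index(tokens, k):
    """Build a position index in k passes over the token stream (k >= 1)."""
    # Tokenize: words -> integer ids.
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    ids = [vocab[w] for w in tokens]

    # Histogram: how many slots each word needs.
    counts = Counter(ids)

    # Allocate space: starting offset of each word's slot range.
    starts, total = {}, 0
    for wid in range(len(vocab)):
        starts[wid] = total
        total += counts[wid]
    index = [None] * total
    fill = dict(starts)  # next free slot per word

    # K passes: pass p only adds tokens whose id falls in bin p.
    bin_size = -(-len(vocab) // k)  # ceiling division
    for p in range(k):
        lo, hi = p * bin_size, (p + 1) * bin_size
        for pos, wid in enumerate(ids):
            if lo <= wid < hi:
                index[fill[wid]] = pos
                fill[wid] += 1
    return vocab, starts, counts, index

def postings(word, vocab, starts, counts, index):
    """Positions of `word`, read from its contiguous slot range."""
    wid = vocab[word]
    return index[starts[wid]:starts[wid] + counts[wid]]

tokens = "the johns hopkins university is named after johns hopkins".split()
vocab, starts, counts, index = kpass_index(tokens, 3)
johns_positions = postings("johns", vocab, starts, counts, index)
```

Each pass writes into one contiguous region of the index, which is where the paging advantage would come from.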
24Vector Models for IR
- Gerard Salton, Cornell
- (Salton & Lesk, 68)
- (Salton, 71)
- (Salton & McGill, 83)
- SMART System
- Chris Buckley, Cornell
- -> Current keeper of the flame
SMART: Salton's Magical Automatic Retrieval Tool (?)
25Vector Models for IR
Boolean Model
[Figure: Boolean bit vectors for Doc V1 and Doc V2; each position is a term (word, stem, or special compound)]
SMART Vector Model (one weight per Term i)
Doc V1: 1.0 3.5 4.6 0.1 0.0 0.0
Doc V2: 0.0 0.0 0.0 0.1 4.0 0.0
SMART vectors are composed of real-valued term weights, NOT simply Boolean term present or not
26Example
Terms: Comput, C, Sparc, genome, Biolog, protein, Compiler, DNA
Doc V1: 3 5 4 1 0 1 0 0
Doc V2: 1 0 0 0 5 3 1 4
Doc V3: 2 8 0 1 0 1 0 0
- Issues
- How are weights determined?
- Simple options: (1) raw freq., (2) weighted by region (titles, keywords)
- Which terms to include? Stoplists
- Stem or not?
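The simplest weighting option, raw frequency with a stoplist and no stemming, can be sketched as follows. The stoplist, the term list, and the sample text are hypothetical.

```python
from collections import Counter

STOPLIST = {"the", "of", "a", "is"}  # hypothetical stoplist

def raw_freq_vector(text, terms):
    """Weight of each term = number of times it occurs in the text."""
    counts = Counter(w for w in text.lower().split() if w not in STOPLIST)
    return [counts[t] for t in terms]

terms = ["compiler", "dna", "protein"]
vec = raw_freq_vector("The compiler is a compiler of DNA", terms)
```

Without stemming, "compiler" and "compilers" would be counted as different terms; region weighting would multiply counts from titles or keywords by a larger factor.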
27Queries and Documents share same vector representation
[Figure: document vectors D1, D2, D3 and query vector Q in the same term space]
Given query D_Q -> map to vector V_Q and find the document D_i for which sim(V_i, V_Q) is greatest
28Similarity Functions
- Many other options available (Dice, Jaccard)
- Cosine similarity is self-normalizing
V1 = (100, 200, 300, 50)
V2 = (1, 2, 3, 0.5)
V3 = (10, 20, 30, 5)
[Figure: vectors D2, Q, D3]
Can use arbitrary integer values (don't need to be probabilities)
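A minimal sketch of cosine similarity, cos(u, v) = (u . v) / (|u| |v|), showing the self-normalizing property on the three vectors above and a simple best-match lookup. The document vectors reuse the slide 26 example; the query vector and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

# Same direction, different lengths: cosine treats them as identical.
V1 = [100, 200, 300, 50]
V2 = [1, 2, 3, 0.5]
V3 = [10, 20, 30, 5]

def best_match(query, docs):
    """Name of the document whose vector has the greatest cosine with the query."""
    return max(docs, key=lambda name: cosine(docs[name], query))

docs = {
    "D1": [3, 5, 4, 1, 0, 1, 0, 0],
    "D2": [1, 0, 0, 0, 5, 3, 1, 4],
    "D3": [2, 8, 0, 1, 0, 1, 0, 0],
}
query = [0, 0, 0, 0, 1, 1, 0, 1]  # hypothetical query vector
winner = best_match(query, docs)
```

Because the vector lengths divide out, scaling all weights of a document (for example, a longer document with proportionally more occurrences) does not change its score.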