Today's Topics



Slides: 29

Provided by: andre9

Learn more at: https://www.cs.jhu.edu


Transcript and Presenter's Notes


1
Today's Topics

Boolean IR
Signature files
Inverted files
PAT trees
Suffix arrays

2
Boolean IR
- Pre-1970s. Dominant industrial model through 1994 (Lexis-Nexis, DIALOG)

Documents composed of TERMS (words, stems)
Express result in set-theoretic terms

A AND B
(A AND B) OR C
3
Boolean Operators
[Figure: docs containing term A; A AND B; A OR B; (A AND B) OR C; A AND (NOT B)]
Proximity Operators (Extended ANDs)
Adjacent AND → "A B", e.g. "Johns Hopkins", "The Who"
Proximity window (in +/- K words) → A w/10 B: A and B within +/- 10 words
→ A w/sent B: A and B in same sentence
4
Boolean IR (implementation)

Bit vectors
Inverted files (a.k.a. index)
PAT tree (more powerful index)

[Figure: document bit vectors V1, V2, indexed by term i]
Bit vectors: impractical → very sparse (wastefully big) → costly to compare
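A minimal sketch of Boolean retrieval over bit vectors, assuming a tiny hypothetical collection (the documents and the helper names `build_bit_vectors` and `matching_docs` are illustrative, not from the slides). Here each term gets one bit per document, stored in a Python integer, so the Boolean operators become bitwise operations.

```python
# Hypothetical toy collection; document IDs are the list positions 0..3.
docs = [
    "bayes viterbi baum",
    "bayes hopkins",
    "viterbi hopkins johns",
    "johns hopkins",
]

def build_bit_vectors(docs):
    """Map each term to an int whose bit d is set if document d contains it."""
    vectors = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            vectors[term] = vectors.get(term, 0) | (1 << doc_id)
    return vectors

def matching_docs(bits, n_docs):
    """Decode a result bit vector back into a sorted list of document IDs."""
    return [d for d in range(n_docs) if (bits >> d) & 1]

vectors = build_bit_vectors(docs)
n = len(docs)
all_docs = (1 << n) - 1  # mask so that NOT stays within the collection

a, b, c = vectors["bayes"], vectors["hopkins"], vectors["viterbi"]
and_result = matching_docs(a & b, n)                    # A AND B
or_result = matching_docs(a | b, n)                     # A OR B
and_or_result = matching_docs((a & b) | c, n)           # (A AND B) OR C
and_not_result = matching_docs(a & (all_docs & ~b), n)  # A AND (NOT B)
```

With a realistic vocabulary most bits are zero, which is the sparsity problem the slide points out.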
5
Problems with Boolean IR

Does not effectively support relevance ranking of
returned documents
Base model: expression satisfaction is Boolean
A document matches the expression or it doesn't
Extension to permit ordering: (A AND B) OR C
Supermatches (5 terms/doc > 3 terms/doc)
Partial matches
(expression incompletely satisfied: give
partial credit)
Importance weighting (10A OR 5B),
where 10 and 5 are the weight/importance of each term
6
Boolean IR

Advantages: Can directly control search
Good for precise queries in structured data
(e.g. database search or legal index)
Disadvantages: Must directly control search
Users should be familiar with domain and term
space (know what to ask for and exclude)
Poor at relevance ranking
Poor at weighted query expansion, user modelling,
etc.

7
Signature Files
Superimposed coding: using some mapping/hash
function
Document bit vector → mapping function f( ) → signature
(fewer bits)
Problem: several different document bit
vectors (i.e. different words)
get mapped to the same signature. (Use a stoplist to
help keep common words from overwhelming
signatures.)
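A minimal sketch of superimposed coding, assuming MD5 as the mapping function f( ), a 16-bit signature, two bits per word, and a three-word stoplist. All of these choices and the function names are illustrative assumptions; the slides do not specify them.

```python
import hashlib

SIG_BITS = 16       # signature width (assumed; far fewer bits than a full term vector)
BITS_PER_WORD = 2   # number of bits each word may set (assumed)
STOPLIST = {"the", "of", "a"}  # hypothetical stoplist

def word_mask(word):
    """Hash a word to a mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % SIG_BITS)
    return mask

def signature(text):
    """Superimpose (OR together) the masks of all non-stoplist words."""
    sig = 0
    for word in text.lower().split():
        if word not in STOPLIST:
            sig |= word_mask(word)
    return sig

def may_contain(sig, word):
    """True if every bit of the word's mask is set in the signature."""
    mask = word_mask(word)
    return (sig & mask) == mask

sig = signature("the johns hopkins university")
```

`may_contain(sig, "hopkins")` is always true here, but a true result for some other word only means the document is a candidate, since different words can set the same bits.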
8
False Drop Problem

On retrieval, all documents/bit vectors mapped to
the same signature are retrieved (returned)
Only a portion are relevant
Need to do a secondary validation step to make sure
target words actually match
Prob(False Drop) = Prob(Signature qualifies &
Text does not)
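The two-step retrieval can be sketched as follows, assuming a toy one-bit-per-word mapping based on CRC32 and a deliberately small 8-bit signature so that collisions are plausible. The collection and all names are hypothetical.

```python
import zlib

SIG_BITS = 8  # deliberately tiny (assumed) so that collisions are likely

def word_mask(word):
    """Toy mapping f(): each word sets one signature bit chosen by CRC32."""
    return 1 << (zlib.crc32(word.encode("utf-8")) % SIG_BITS)

def signature(text):
    """OR together the masks of all words in the text."""
    sig = 0
    for word in text.split():
        sig |= word_mask(word)
    return sig

def search(docs, signatures, word):
    """Return (candidates, hits, false_drops) for a single-word query."""
    mask = word_mask(word)
    # Step 1: linear scan of the signatures for qualifying documents.
    candidates = [i for i, sig in enumerate(signatures) if (sig & mask) == mask]
    # Step 2: secondary validation against the actual text.
    hits = [i for i in candidates if word in docs[i].split()]
    false_drops = [i for i in candidates if i not in hits]
    return candidates, hits, false_drops

docs = [
    "baum welch training",
    "viterbi decoding",
    "bayes rule",
    "viterbi and bayes",
    "johns hopkins",
]
signatures = [signature(text) for text in docs]
candidates, hits, false_drops = search(docs, signatures, "viterbi")
```

Signatures never miss a true match, so `hits` is exactly the set of documents containing the word; `false_drops` holds whatever extra candidates the signature test let through.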

9
Efficiency Problem

Testing for signature match may require linear
scan through all document signatures

10
Vertical Partitioning

Improves sig1, sig2 comparison speed,
but still requires O(N) linear search of all
signatures
Options
- Bit-slice signatures onto different devices for parallel
comparison
- AND together matches on each segment
[Figure: segments of sig1 and sig2 compared in parallel; per-segment results are ANDed → result]
11
Horizontal Partitioning

Goal: avoid sequential scanning of the signature
file

Input signature → hash function or index yielding specific
candidates to try → signature database
12
Inverted Files

Like an index to a book

Documents: 14 15 16 17 ... 37 38 39 40 ...
Index (term → documents):
Baum    → 14 39 156
Bayes   → 39 45 156 290
Viterbi → 41 86 156 217
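A minimal sketch of building a document-level inverted file, assuming whitespace tokenization and a hypothetical four-document collection (the document texts are made up; only the idea of term → document list comes from the slide).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents keyed by document ID.
docs = {
    14: "Baum Welch training",
    39: "Baum and Bayes",
    41: "Viterbi decoding",
    156: "Baum Bayes Viterbi",
}
index = build_inverted_index(docs)
baum_docs = index["baum"]        # single-word query: just read the list
```

Keeping each list sorted by document ID is what makes the merge-based OR and AND operations on later slides possible.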
13
Inverted Files

Very efficient for single-word queries
Just enumerate documents pointed to by index
O(|A|) = O(S_A), where S_A is the size of A's index list
Efficient for ORs
Just enumerate both lists and remove duplicates
O(S_A + S_B)

14
ANDs using Inverted Files
(meet search)
Method 1
Index for Bayes (A, pointer i):   14 39 156 227 319
Index for Viterbi (B, pointer j): 39 45 58 96 156 208
O(S_A + S_B): same as OR, but smaller output

Begin with two pointers (i, j), one on each index
list (A, B)
if A[i] = B[j], write A[i] to output and advance both i and j
if A[i] < B[j], i++, else j++
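The two-pointer merge can be sketched as below, using the Bayes and Viterbi lists from the slide. The function name `intersect_merge` is illustrative.

```python
def intersect_merge(a, b):
    """AND two sorted posting lists with one pointer per list: O(S_A + S_B)."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # document appears in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1            # advance the pointer at the smaller ID
        else:
            j += 1
    return out

bayes = [14, 39, 156, 227, 319]
viterbi = [39, 45, 58, 96, 156, 208]
result = intersect_merge(bayes, viterbi)
```

For these two lists the result is the documents 39 and 156.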

15
ANDs using Inverted Files
Method 2: Useful if one index is much smaller than
the other (S_A << S_B)
A (Johns):   39 227
B (Hopkins): 1 5 25 28 39 45 58 96 156
For all members of A: bsearch(A[i], B) (do a
binary search into the larger index for each member
of the smaller index)
A AND B AND C: order by smaller list, pairwise
Cost: S_A log2(S_B); can achieve S_A log log(S_B)
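A sketch of Method 2 using Python's standard `bisect` module for the binary search, with the Johns and Hopkins lists from the slide. The helper names and the third list in the usage line are hypothetical.

```python
from bisect import bisect_left

def intersect_bsearch(small, large):
    """AND by binary-searching each member of the smaller list in the larger one."""
    out = []
    for doc_id in small:
        pos = bisect_left(large, doc_id)
        if pos < len(large) and large[pos] == doc_id:
            out.append(doc_id)
    return out

def intersect_all(lists):
    """A AND B AND C ...: combine pairwise, starting from the smallest lists."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_bsearch(result, other)
    return result

johns = [39, 227]
hopkins = [1, 5, 25, 28, 39, 45, 58, 96, 156]
result = intersect_bsearch(johns, hopkins)
three_way = intersect_all([hopkins, johns, [39, 40, 41]])
```

Starting from the smallest list keeps every intermediate result small, so each later binary-search pass has few keys to look up.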
16
Proximity Search
Document-level indexes are not adequate
Option 1: index to corpus position (offset)
Before: match if ptrA = ptrB
Now: "A B" match if ptrA = ptrB - 1
A w/10 B match if |ptrA - ptrB| <= 10
[Figure: index entries for Anthony, Johns, Hopkins pointing to position offsets in Doc 1, Doc 2, Doc 3, ..., Doc i]
Size of corpus ≈ size of index
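A minimal sketch of a position-level index and the two proximity tests, assuming positions are stored as (document, word offset) pairs and using a hypothetical three-document corpus. The `within` check is written as a naive pairwise comparison for clarity rather than speed.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: list of texts. Returns term -> list of (doc_id, word_offset)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset, term in enumerate(text.lower().split()):
            index[term].append((doc_id, offset))
    return index

def adjacent(index, a, b):
    """Docs where a is immediately followed by b (ptrA = ptrB - 1)."""
    positions_b = set(index.get(b, []))
    return sorted({d for d, p in index.get(a, []) if (d, p + 1) in positions_b})

def within(index, a, b, k):
    """Docs where a and b occur within +/- k words (|ptrA - ptrB| <= k)."""
    hits = set()
    for doc_a, pos_a in index.get(a, []):
        for doc_b, pos_b in index.get(b, []):
            if doc_a == doc_b and abs(pos_a - pos_b) <= k:
                hits.add(doc_a)
    return sorted(hits)

# Hypothetical corpus; document IDs are the list positions.
docs = [
    "anthony johns hopkins",
    "johns and anthony visited hopkins",
    "hopkins anthony",
]
index = build_positional_index(docs)
phrase_docs = adjacent(index, "johns", "hopkins")
window_docs = within(index, "johns", "hopkins", 10)
```

Every token in the corpus contributes one index entry, which is why the index grows to roughly the size of the corpus.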
17
Variations 1
Don't index function words
[Figure: index built from a wordlist; "Johns" is indexed, "The" is crossed out (X); corpus text "The Johns Hopkins"]

Do linear match search in corpus
Savings: ~50% of index size
Potential speed improvement
given data access costs

18
Variations 2: Multilevel Indexes
Levels: position level, doc level
[Figure: doc-level entry for "Johns Hopkins" above position-level indexes over documents containing Anthony, Johns, Hopkins]

Supports parallel
search
May have paging
cost advantage
Cost: large index
N dV (label: avg. doc/vocab size)
19
Interpolation Search
Useful when data are numeric and uniformly
distributed
# of cells in index: 100
Values range from 0 to 1000
Goal: looking for the value 211
Binary search: begin looking at cell 50
Interpolation search: better guess for
1st cell to examine?
Index B (cell i → value):
cell:   17  18  19  20  21  22  23 ...  48  49  50  51 ... 100
value: 174 195 211 226 230 231 246 ... 483 496 521 526 ... 995
20

Binary Search
Bsearch(low, high, key)
  If (low > high)
    return not found
  mid = (high + low) / 2
  If (key = A[mid])
    return mid
  Else if (key < A[mid])
    return Bsearch(low, mid-1, key)
  Else
    return Bsearch(mid+1, high, key)

Interpolation Search
Isearch(low, high, key)
  mid = best estimate of pos
  mid = low + (high - low) * (expected % of way
  through range)
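Both searches can be sketched iteratively as below. This assumes a sorted list of integers and uses integer arithmetic for the interpolation estimate; the perfectly uniform data (0, 10, ..., 1000) is a hypothetical best case, not the slide's table. Each function also returns the sequence of cells it probed.

```python
def binary_search(a, key):
    """Return (index or -1, list of probed cells)."""
    low, high, probes = 0, len(a) - 1, []
    while low <= high:
        mid = (low + high) // 2
        probes.append(mid)
        if a[mid] == key:
            return mid, probes
        if key < a[mid]:
            high = mid - 1
        else:
            low = mid + 1
    return -1, probes

def interpolation_search(a, key):
    """Like binary search, but guess mid from where key falls in the value range."""
    low, high, probes = 0, len(a) - 1, []
    while low <= high and a[low] <= key <= a[high]:
        if a[high] == a[low]:
            mid = low
        else:
            # low + (high - low) * (expected fraction of the way through the range)
            mid = low + (high - low) * (key - a[low]) // (a[high] - a[low])
        probes.append(mid)
        if a[mid] == key:
            return mid, probes
        if a[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1, probes

values = [i * 10 for i in range(101)]  # 101 cells, values 0..1000, perfectly uniform
b_pos, b_probes = binary_search(values, 210)
i_pos, i_probes = interpolation_search(values, 210)
```

On this idealized data the interpolation estimate lands on the target cell at the first probe, while binary search needs several halvings; real data that is only roughly uniform would need a few corrective probes.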
21
Comparison
Typical sequence of cells tested
Binary Search: 50, 25, 12, 18, 22, 21, 19.
Interpolation Search: 21, 19. → go directly to
expected region
Expected cost of interpolation search: O(log log N)
22
Cost of Computing Inverted Index

Simple:
(word, position) pairs
and sort
Cost: N log N (N = corpus size)
If N >> memory size:
Tokenize (words → integers)
Create histogram
Allocate space in index
Do multipass (K-pass) through corpus, only adding
tokens in K bins
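The steps above can be sketched as follows. This is one possible reading of K-pass indexing, under stated assumptions: the corpus is a list of words, bins are equal-sized ranges of token IDs, and the index is a flat array of positions. The function names and data layout are illustrative, not taken from the slides.

```python
def k_pass_index(words, k):
    """Build a positional inverted index in k passes over the corpus."""
    # Tokenize: map each distinct word to an integer ID.
    vocab = {}
    tokens = [vocab.setdefault(w, len(vocab)) for w in words]

    # Histogram: how many positions each token ID will need.
    counts = [0] * len(vocab)
    for t in tokens:
        counts[t] += 1

    # Allocate: starts[t] is where token t's block begins in the flat index.
    starts, total = [], 0
    for c in counts:
        starts.append(total)
        total += c
    index = [None] * total
    fill = list(starts)  # next free slot for each token

    # K passes: pass p only adds tokens whose ID falls in bin p, so each
    # pass writes to one contiguous region of the index.
    bin_size = -(-len(vocab) // k)  # ceiling division
    for p in range(k):
        lo, hi = p * bin_size, (p + 1) * bin_size
        for position, t in enumerate(tokens):
            if lo <= t < hi:
                index[fill[t]] = position
                fill[t] += 1
    return vocab, starts, counts, index

def postings(word, vocab, starts, counts, index):
    """Positions of a word, read from its preallocated block."""
    t = vocab[word]
    return index[starts[t]:starts[t] + counts[t]]

words = "the johns hopkins the who the".split()
built = k_pass_index(words, 2)
the_positions = postings("the", *built)
```

The result is the same for any K; the point of more passes is that each pass touches only a slice of the index that can fit in memory.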
23
K-pass Indexing
[Figure: index divided into blocks for W1 W2 W3 W4; Block 1 is filled on pass K = 1, the next block on pass K = 2]
Time: KN 1
But big win over N log N on paging
24
Vector Models for IR

Gerard Salton, Cornell
(Salton & Lesk, 68)
(Salton, 71)
(Salton & McGill, 83)
SMART System
Chris Buckley, Cornell
→ Current keeper of the flame

SMART: Salton's Magical Automatic Retrieval Tool (?)
25
Vector Models for IR
Boolean Model
[Figure: bit vectors for Doc V1 and Doc V2; terms may be words, stems, or special compounds]
SMART Vector Model
Term i
Doc V1: 1.0 3.5 4.6 0.1 0.0 0.0
Doc V2: 0.0 0.0 0.0 0.1 4.0 0.0
SMART vectors are composed of real-valued term
weights, NOT simply Boolean term present or not
26
Example
Terms: Comput, C, Sparc, genome, Bilog, protein, Compiler, DNA
Doc V1: 3 5 4 1 0 1 0 0
Doc V2: 1 0 0 0 5 3 1 4
Doc V3: 2 8 0 1 0 1 0 0

Issues
How are weights determined?
(simple options:
(1) raw freq.
(2) weighted by region: titles, keywords)
Which terms to include? Stoplists
Stem or not?

27
Queries and documents share the same vector
representation
[Figure: document vectors D1, D2, D3 and query vector Q in the same space]
Given query D_Q → map to vector V_Q and find the
document D_i for which sim(V_i, V_Q) is greatest
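A minimal sketch of the retrieval step, assuming cosine similarity as the sim function (the slides have not yet named one, so this is one common choice, not necessarily the one intended). The document vectors are the three from the example slide; the query vector is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

doc_vectors = {
    "V1": [3, 5, 4, 1, 0, 1, 0, 0],
    "V2": [1, 0, 0, 0, 5, 3, 1, 4],
    "V3": [2, 8, 0, 1, 0, 1, 0, 0],
}
query = [1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical query: weight on the first two terms

scores = {name: cosine(vec, query) for name, vec in doc_vectors.items()}
best = max(scores, key=scores.get)  # document whose vector is most similar to the query
```

Because cosine normalizes by vector length, a long document with large raw weights does not win simply by being long.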
28
Similarity Functions