Transcript and Presenter's Notes

Title: Today's Topics


1
Today's Topics
  • Boolean IR
  • Signature files
  • Inverted files
  • PAT trees
  • Suffix arrays

2
Boolean IR
- Pre-1970s; the dominant industrial model
through 1994 (Lexis-Nexis, DIALOG)
  • Documents composed of TERMS (words, stems)
  • Express queries and results in set-theoretic terms

A AND B
(A AND B) OR C
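To make the set-theoretic view concrete, here is a minimal Python sketch
(not from the slides; the documents and terms are made up) that evaluates
expressions like the two above over documents represented as term sets:

    # Documents as sets of TERMS; Boolean queries become set predicates.
    docs = {
        1: {"johns", "hopkins", "baum"},
        2: {"bayes", "viterbi"},
        3: {"johns", "hopkins", "viterbi"},
    }

    def matching(predicate):
        """Return the ids of documents whose term set satisfies the predicate."""
        return {d for d, terms in docs.items() if predicate(terms)}

    # A AND B, with A = "johns", B = "hopkins"
    print(matching(lambda t: "johns" in t and "hopkins" in t))   # {1, 3}

    # (A AND B) OR C, with C = "bayes"
    print(matching(lambda t: ("johns" in t and "hopkins" in t)
                             or "bayes" in t))                   # {1, 2, 3}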
3
Boolean Operators
Set operators over the docs containing term A, term B, ...:
  A AND B, A OR B, (A AND B) OR C, A AND (NOT B)

Proximity operators (extended ANDs):
  Adjacent AND → "A B", e.g. "Johns Hopkins", "The Who"
  Proximity window (within +/- K words) → A w/10 B: A and B within +/- 10 words
  Same sentence → A w/sent B: A and B in the same sentence
4
Boolean IR (implementation)
  • Bit vectors
  • Inverted files (a.k.a. index)
  • PAT tree (a more powerful index)

[Figure: document bit vectors V1, V2 with one bit per Term_i]
Bit vectors are impractical → very sparse (wastefully big) → costly to
compare
5
Problems with Boolean IR
  • Does not effectively support relevance ranking of
    returned documents
  • Base model: expression satisfaction is Boolean
    (a document matches the expression or it doesn't)
  • Extensions to permit ordering of matches for (A AND B) OR C:
    • Supermatches (5 terms/doc > 3 terms/doc)
    • Partial matches (expression incompletely satisfied →
      give partial credit)
    • Importance weighting (10·A OR 5·B, where 10 and 5 are
      weight/importance values)
6
Boolean IR
  • Advantages: can directly control the search
    • Good for precise queries on structured data
      (e.g. database search or a legal index)
  • Disadvantages: must directly control the search
    • Users should be familiar with the domain and term
      space (know what to ask for and exclude)
    • Poor at relevance ranking
    • Poor at weighted query expansion, user modelling,
      etc.

7
Signature Files
Superimposed coding: a document bit vector is mapped, by some
mapping/hash function f( ), to a signature with fewer bits.

Problem: several different document bit vectors (i.e. different words)
get mapped to the same signature. (Use a stoplist to help keep common
words from overwhelming the signatures.)
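A minimal sketch of superimposed coding, assuming a simple hash-based
mapping function; the signature width, bits per word, and names are
illustrative, not taken from the slides:

    import hashlib

    SIG_BITS = 64       # signature width (illustrative)
    BITS_PER_WORD = 3   # how many bits each word turns on (illustrative)

    def word_bits(word):
        # Map one word to a few bit positions: the mapping/hash function f().
        h = hashlib.md5(word.encode()).digest()
        return {h[i] % SIG_BITS for i in range(BITS_PER_WORD)}

    def signature(words):
        # Superimpose (OR together) the bits of all the document's words
        # (a stoplist would be applied before this step).
        sig = 0
        for w in words:
            for b in word_bits(w):
                sig |= 1 << b
        return sig

    def may_contain(doc_sig, query_word):
        # The signature "qualifies" if all the query word's bits are set;
        # this can still be a false drop, so the text must be verified.
        return all(doc_sig >> b & 1 for b in word_bits(query_word))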
8
False Drop Problem
  • On retrieval, all documents/bit vectors mapped to
    the same signature are retrieved (returned)
  • Only a portion are relevant
  • Need to do a secondary validation step to make sure
    the target words actually match
  • Prob(False Drop) = Prob(Signature qualifies &
    Text does not)

9
Efficiency Problem
  • Testing for signature match may require linear
    scan through all document signatures

10
Vertical Partitioning
  • Improves sig1 vs. sig2 comparison speed,
  • but still requires an O(N) linear search of all
    signatures
  • Option: bit-slice the signatures onto different devices
    for parallel comparison, then AND together the matches
    from each segment → result
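One way to read the bit-sliced option is to store each signature bit
position as its own slice across all documents, so a query only touches
the slices for its set bits and ANDs them together. A hypothetical sketch
(the names and integer-bitmask representation are illustrative):

    # slices[b] is an integer whose i-th bit records whether document i's
    # signature has bit b set; slices could live on different devices.
    def build_slices(signatures, sig_bits):
        slices = [0] * sig_bits
        for doc_id, sig in enumerate(signatures):
            for b in range(sig_bits):
                if sig >> b & 1:
                    slices[b] |= 1 << doc_id
        return slices

    def candidate_docs(slices, query_sig, sig_bits, n_docs):
        # AND together the slices for every bit set in the query signature;
        # the surviving documents are candidates (false drops still possible).
        result = (1 << n_docs) - 1
        for b in range(sig_bits):
            if query_sig >> b & 1:
                result &= slices[b]
        return [i for i in range(n_docs) if result >> i & 1]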
11
Horizontal Partitioning
  • Goal: avoid sequential scanning of the signature
    file

[Figure: an input signature is run through a hash function or index
over the signature database, yielding specific candidates to try]
12
Inverted Files
  • Like an index to a book

[Figure: an index maps each term (Baum, Bayes, Viterbi) to the list of
document numbers it occurs in, e.g. 14 39 156 / 39 45 156 290 /
41 86 156 217, pointing into documents 14 15 16 17 37 38 39 40 ...]
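A minimal sketch of building such an index, assuming the documents are
already tokenized (the helper name and data are illustrative):

    from collections import defaultdict

    def build_inverted_index(docs):
        """docs: {doc_id: list of terms} -> {term: sorted list of doc_ids}."""
        index = defaultdict(set)
        for doc_id, terms in docs.items():
            for term in terms:
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    # build_inverted_index({14: ["baum", "bayes"], 39: ["baum", "viterbi"]})
    # -> {"baum": [14, 39], "bayes": [14], "viterbi": [39]}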
13
Inverted Files
  • Very efficient for single-word queries
    • Just enumerate the documents pointed to by the index:
      O(|A|) = O(S_A)
  • Efficient for ORs
    • Just enumerate both lists and remove duplicates:
      O(S_A + S_B), as sketched below
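A sketch of the OR case as just described: merge two sorted postings
lists and drop duplicates, in O(S_A + S_B) time (names are illustrative):

    def postings_or(A, B):
        """Union of two sorted postings lists."""
        i = j = 0
        out = []
        while i < len(A) or j < len(B):
            if j == len(B) or (i < len(A) and A[i] < B[j]):
                out.append(A[i]); i += 1
            elif i == len(A) or B[j] < A[i]:
                out.append(B[j]); j += 1
            else:                       # same doc in both lists: keep it once
                out.append(A[i]); i += 1; j += 1
        return out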

14
ANDs using Inverted Files
(meet search)
Method 1: merge with two pointers (see the sketch below)

  Index for Bayes (A_i):   14 39 156 227 319
  Index for Viterbi (B_j): 39 45 58 96 156 208

Cost: O(S_A + S_B), the same as OR, but with smaller output
  • Begin with two pointers (i, j), one per index (A, B)
  • if A_i == B_j, write A_i to the output (and advance both)
  • if A_i < B_j, advance i; else advance j
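A sketch of Method 1 as described in the bullets above, walking the two
pointers over the sorted postings lists (the example lists are the ones
from the slide):

    def postings_and_merge(A, B):
        """Intersect two sorted postings lists in O(len(A) + len(B))."""
        i = j = 0
        out = []
        while i < len(A) and j < len(B):
            if A[i] == B[j]:
                out.append(A[i]); i += 1; j += 1
            elif A[i] < B[j]:
                i += 1
            else:
                j += 1
        return out

    # postings_and_merge([14, 39, 156, 227, 319],
    #                    [39, 45, 58, 96, 156, 208])  ->  [39, 156]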

15
ANDs using Inverted Files
Method 2: useful if one index is much smaller than
the other (S_A << S_B)

  A_i ("Johns"):   39 227
  B_j ("Hopkins"): 1 5 25 28 39 45 58 96 156

For all members A_i of the smaller index, do bsearch(A_i, B): a binary
search into the larger index.
For A AND B AND C: order by the smaller lists and intersect pairwise.
Cost: S_A · log2(S_B); can achieve S_A · log log(S_B)
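A sketch of Method 2: for each member of the smaller list, binary-search
the larger one (Python's bisect stands in for the slide's bsearch), giving
roughly S_A · log2(S_B) comparisons:

    from bisect import bisect_left

    def postings_and_bsearch(A, B):
        """Intersect sorted lists where len(A) << len(B)."""
        out = []
        for a in A:
            pos = bisect_left(B, a)      # binary search into the larger index
            if pos < len(B) and B[pos] == a:
                out.append(a)
        return out

    # postings_and_bsearch([39, 227],
    #                      [1, 5, 25, 28, 39, 45, 58, 96, 156])  ->  [39]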
16
Proximity Search
Document-level indexes are not adequate.
Option 1: index the corpus by position offset rather than by document.

  Before: match if ptrA = ptrB (pointers are document numbers)
  Now:    "A B" (adjacent) matches if ptrA = ptrB - 1
          A w/10 B matches if |ptrA - ptrB| <= 10

[Figure: position-offset postings for "Anthony", "Johns", "Hopkins"
across Doc 1, Doc 2, Doc 3, ..., Doc i]

Size of index ≈ size of corpus
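A sketch of Option 1, assuming each term's postings are sorted corpus
position offsets (the helper name and example usage are illustrative):

    def within_k(pos_a, pos_b, k):
        """Return (pA, pB) position pairs with |pA - pB| <= k."""
        out = []
        j = 0
        for pa in pos_a:
            # skip B positions that are too far to the left of pa
            while j < len(pos_b) and pos_b[j] < pa - k:
                j += 1
            t = j
            while t < len(pos_b) and pos_b[t] <= pa + k:
                out.append((pa, pos_b[t]))
                t += 1
        return out

    # A w/10 B:        within_k(pos_a, pos_b, 10)
    # adjacency "A B": keep only the pairs where pA == pB - 1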
17
Variations 1
Don't index function words

[Figure: the index/wordlist points to "Johns" but not to "The"
(crossed out) in the text "The Johns Hopkins"]

  • Do a linear match search in the corpus for the unindexed words
  • savings of roughly 50% in index size
  • potential speed improvement,
    given data-access costs
18
Variations 2 Multilevel Indexes
[Figure: a document-level index for "Johns Hopkins" points into
position-level indexes within each document, e.g. "Anthony Johns
Hopkins", "Johns Hopkins Anthony", "Hopkins Anthony"]

  • Supports parallel search
  • May have a paging cost advantage
  • Cost: a large index, roughly N + d·V
    (d, V: avg. document / vocabulary size)
19
Interpolation Search
Useful when data are numeric and uniformly distributed.

  # of cells in the index = 100; values range from 0 to 1000
  Goal: we are looking for the value 211
  Binary search: begin by looking at cell 50
  Interpolation search: can we make a better guess for the
  1st cell to examine?

[Figure: index cells 17-23 hold the values 174 195 211 226 230 231 246;
cells 48-51 hold 483 496 521 526; cell 100 holds 995]
20
  • Binary Search
    Bsearch(low, high, key)
      mid = (low + high) / 2
      if (key == A[mid])
        return mid
      else if (key < A[mid])
        Bsearch(low, mid - 1, key)
      else
        Bsearch(mid + 1, high, key)

  • Interpolation Search
    Isearch(low, high, key)
      mid = best estimate of position
          = low + (high - low) · (expected fraction of the way
            through the value range)
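A runnable version of interpolation search following the pseudocode above
(iterative rather than recursive; the names are illustrative):

    def isearch(A, key):
        """Interpolation search in a sorted list A; returns an index or -1."""
        low, high = 0, len(A) - 1
        while low <= high and A[low] <= key <= A[high]:
            if A[high] == A[low]:               # avoid dividing by zero
                break
            # estimate how far through the value range the key lies
            frac = (key - A[low]) / (A[high] - A[low])
            mid = low + int(frac * (high - low))
            if A[mid] == key:
                return mid
            if A[mid] < key:
                low = mid + 1
            else:
                high = mid - 1
        return low if low <= high and A[low] == key else -1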
21
Comparison
Typical sequence of cells tested:
  Binary search:        50, 25, 12, 18, 22, 21, 19, ...
  Interpolation search: 21, 19, ... → goes directly to the
  expected region
Expected cost: O(log log N)
22
Cost of Computing Inverted Index
  • Simple approach: emit (word, position) pairs and sort them
    → for corpus size N, the sort costs N log N
  • If N >> memory size:
    • Tokenize (words → integers)
    • Create a histogram of token counts
    • Allocate space in the index
    • Do a multipass (K-pass) scan through the corpus, on each
      pass only adding the tokens that fall in that pass's bin
23
K-pass Indexing
[Figure: the index space for words W1 W2 W3 W4 is filled block by block;
Block 1 is written on pass K = 1, the next block on pass K = 2, ...]

Time: K·N, but a big win over N log N on paging
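A sketch of the K-pass idea under the assumptions above: a histogram pass
sizes the index, then each of K passes writes postings only for the tokens
assigned to that pass's bin, keeping index writes localized (the bin
assignment and names are illustrative):

    from collections import Counter

    def k_pass_index(corpus_tokens, k):
        """corpus_tokens: list of (token, position); returns {token: [positions]}."""
        counts = Counter(tok for tok, _ in corpus_tokens)        # histogram pass
        index = {tok: [None] * n for tok, n in counts.items()}   # allocate space
        fill = {tok: 0 for tok in counts}
        tokens = sorted(counts)
        if not tokens:
            return {}
        bin_of = {tok: i * k // len(tokens) for i, tok in enumerate(tokens)}
        for pass_no in range(k):                  # K passes over the corpus
            for tok, pos in corpus_tokens:
                if bin_of[tok] == pass_no:        # only this pass's bin is written
                    index[tok][fill[tok]] = pos
                    fill[tok] += 1
        return index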
24
Vector Models for IR
  • Gerald Salton, Cornell
    • (Salton & Lesk, 68)
    • (Salton, 71)
    • (Salton & McGill, 83)
  • SMART System
    • Chris Buckley, Cornell
      → current keeper of the flame

SMART = Salton's magical automatic retrieval tool (?)
25
Vector Models for IR
Boolean model: each document vector (Doc V1, Doc V2) has a Boolean
entry per term (word, stem, special compounds).

SMART vector model: each document vector has a real-valued weight
per Term_i, e.g.
  Doc V1: 1.0 3.5 4.6 0.1 0.0 0.0
  Doc V2: 0.0 0.0 0.0 0.1 4.0 0.0

SMART vectors are composed of real-valued term weights, NOT simply
Boolean "term present or not".
26
Example
Terms:   Comput  C  Sparc  Compiler  genome  Biolog.  protein  DNA
Doc V1:    3     5    4       1         0       1        0      0
Doc V2:    1     0    0       0         5       3        1      4
Doc V3:    2     8    0       1         0       1        0      0

  • Issues
    • How are weights determined?
      (simple options: (1) raw freq.;
      (2) weighted by region, titles, keywords)
    • Which terms to include? Stoplists
    • Stem or not?

27
QUERIES and documents share the same vector representation

[Figure: a query Q plotted in the same vector space as documents
D1, D2, D3]

Given a query D_Q → map it to a vector V_Q and find the document D_i
for which sim(V_i, V_Q) is greatest
28
Similarity Functions
  • Many other options available (Dice, Jaccard)
  • Cosine similarity is self-normalizing

[Figure: V1 = (100, 200, 300, 50), V2 = (1, 2, 3, 0.5), and
V3 = (10, 20, 30, 5) all point in the same direction, so each has the
same cosine similarity to a query Q]

Can use arbitrary integer values (they don't need to be probabilities)
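A sketch of ranking by cosine similarity; because cosine normalizes away
vector length, V1, V2, and V3 from the slide all score identically against
any query (the query vector Q below is illustrative):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    V1 = [100, 200, 300, 50]
    V2 = [1, 2, 3, 0.5]
    V3 = [10, 20, 30, 5]
    Q  = [0, 1, 1, 0]        # illustrative query vector

    # All three documents get the same score: cosine is self-normalizing.
    print([round(cosine(Q, v), 4) for v in (V1, V2, V3)])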