Title: Efficient Approximate Search on String Collections

1
Efficient Approximate Search on String Collections

Marios Hadjieleftheriou

Chen Li
2
Outline

Part 1
Motivation and preliminaries
Inverted list based algorithms
Part 2
Gram signature algorithms
Length normalized algorithms
Selectivity estimation
Conclusion and future directions

3
Web Search

Errors in queries
Errors in data
Bring query and meaningful results closer together

Actual queries gathered by Google
http://www.google.com/jobs/britney.html
4
Record Linkage
Tables R and S
Similarity functions:
Edit distance
Jaccard
Cosine
5
Document Cleaning
Should be Niels Bohr
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope
6
Demos

http://directory.uci.edu/
http://psearch.ics.uci.edu/advanced/
http://psearch.ics.uci.edu/

7
State of the art: Oracle 10g and older

Supported by Oracle Text
CREATE TABLE engdict(word VARCHAR(20), len INT);
Create preferences for text indexing:
begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_MATCH', 'ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_SCORE', '0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_NUMRESULTS', '5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'SUBSTRING_INDEX', 'TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'STEMMER', 'ENGLISH');
end;
/
CREATE INDEX fuzzy_stem_subst_idx ON engdict (word)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage:
SELECT * FROM engdict
WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters
Katherine versus Catherine

8
Microsoft SQL Server [CGG05]

Data cleaning tools available in SQL Server 2005
Part of Integration Services
Supports fuzzy lookups
Uses data flow pipeline of transformations
Similarity function: tokens with TF/IDF scores

9
Lucene

Using Levenshtein Distance (Edit Distance).
Example: roam~0.8
Prefix filtering followed by a scan (Efficiency?)

10
Problem Formulation
Find strings similar to a given string

Performance is important!
10 ms per query: 100 queries per second (QPS)
5 ms per query: 200 QPS

11
Similarity Functions

"Similar to":
a domain-specific function
returns a similarity value between two strings
Examples:
Edit distance
Hamming distance
Jaccard similarity
Soundex
TF/IDF, BM25, DICE
See [KSS06] for an excellent survey

12
Edit Distance

A widely used metric to define string similarity
ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
s1 = Tom Hanks
s2 = Ton Hank
ed(s1, s2) = 2
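The definition above can be checked with the textbook dynamic-programming recurrence. The following is a minimal sketch, not code from the presentation; the function name edit_distance is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(s2) + 1))  # distance from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

One substitution (m to n) and one deletion (the final s) account for the distance of 2 in the slide's example.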

13
Outline

Motivation and preliminaries
Inverted list based algorithms
List-merging algorithms
VGRAM
List-compression techniques
Gram signature algorithms
Length normalized algorithms
Selectivity estimation
Conclusion and future directions

14
q-grams of strings
u n i v e r s a l
2-grams: un, ni, iv, ve, er, rs, sa, al
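As a small illustration of the decomposition, the sketch below slides a window of length q over the string. The helper name qgrams is an assumption made for this example, and no padding characters are added at the string boundaries.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All substrings of length q, in order of their starting position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
```

A string of length n yields n - q + 1 grams, so "universal" has 8 2-grams.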
15
Effect of edit operations on grams
Fixed length q
u n i v e r s a l

k operations could affect at most k * q grams

16
q-gram inverted lists
17
Searching using inverted lists

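Query: shtick, ED(shtick, ?) <= 1

2-grams of the query: sh, ht, ti, ic, ck
[Figure: the inverted lists of the query's 2-grams (ti, ic, ck shown) are merged to find candidate strings]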
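The search procedure can be sketched end to end: index every string by its q-grams, count how many query grams each string shares, keep strings that reach the count threshold, and verify the survivors with the edit distance. This is a minimal sketch under stated assumptions: the four-string collection is hypothetical, all function names are invented here, and repeated grams are distinguished by an occurrence number so that counts behave like multiset intersections.

```python
from collections import Counter, defaultdict


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def gram_bag(s, q):
    """q-grams tagged with an occurrence number, e.g. ('aa', 0), ('aa', 1)."""
    seen, bag = Counter(), []
    for i in range(len(s) - q + 1):
        g = s[i:i + q]
        bag.append((g, seen[g]))
        seen[g] += 1
    return bag


def build_index(strings, q):
    """Map each tagged gram to the ascending list of string ids containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for key in gram_bag(s, q):
            index[key].append(sid)
    return index


def search(query, strings, index, q, k):
    threshold = (len(query) - q + 1) - k * q  # min # of common grams
    if threshold <= 0:
        candidates = range(len(strings))      # count filter cannot prune
    else:
        counts = Counter()
        for key in gram_bag(query, q):
            for sid in index.get(key, ()):
                counts[sid] += 1
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    return [strings[sid] for sid in candidates
            if edit_distance(query, strings[sid]) <= k]


strings = ["stick", "stich", "such", "stuck"]  # hypothetical collection
index = build_index(strings, q=2)
print(search("shtick", strings, index, q=2, k=1))
```

With q = 2 and k = 1 the threshold for "shtick" is 5 - 2 = 3, so only strings sharing at least three of its 2-grams are verified.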
18
T-occurrence Problem
Merge the inverted lists (each sorted in ascending order)
Find elements whose # of occurrences >= T
19
Example

Inverted lists (one per line):
1 3 5 10 13
10 13 15
5 7 13
13
15
Result for T = 4: 13
20
List-Merging Algorithms
HeapMerger, MergeOpt [SK04]
ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
21
Heap-based Algorithm
Push the first element of each list to a min-heap

Count the # of occurrences of each element using the heap
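A heap-based merge might look like the sketch below. It assumes every list is sorted in ascending order without duplicates; the function name heap_merge and the tuple layout (value, list index, position) are choices made for this example.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    # Heap entries: (current element, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every list whose current element equals the smallest element.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the skipping algorithms on the later slides try to avoid.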
22
MergeOpt Algorithm [SK04]
Long lists (the T-1 longest): binary search
Short lists: merge
23
Example of MergeOpt
1 3 5 10 13
10 13 15
5 7 13
13
15
Long lists: 3
Short lists: 2
Count threshold T = 4
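The split into long and short lists can be sketched as follows. This is a simplified illustration of the idea rather than the algorithm of [SK04] as published: the T-1 longest lists are set aside, the remaining lists are merged, and each element seen on a short list is looked up in the long lists by binary search. An element that appears only on long lists occurs at most T-1 times and can never qualify. The name merge_opt is an assumption.

```python
import heapq
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Elements occurring on at least T sorted lists (requires T >= 1)."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    # Merge the short lists and count occurrences of each element.
    counts = Counter(heapq.merge(*short_lists))

    result = []
    for x in sorted(counts):
        total = counts[x]
        for lst in long_lists:  # binary search in each long list
            pos = bisect_left(lst, x)
            if pos < len(lst) and lst[pos] == x:
                total += 1
        if total >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))  # [13]
```

On the slide's example only 13 and 15 come out of the two short lists, so just two elements are probed in the three long lists.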
24
ScanCount
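Array of counters indexed by string id (# of occurrences)
Increment by 1 for each occurrence found while scanning the lists
Inverted lists (one per line):
1 3 5 10 13
10 13 15
5 7 13
13
15
Count threshold T = 4
Final counts: id 13 -> 4 (result!), id 15 -> 2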
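ScanCount is simple enough to sketch in a few lines. The version below assumes string ids are integers in the range 0 to num_strings - 1 and that no list contains the same id twice; both the function name and the num_strings parameter are introduced here for illustration.

```python
def scan_count(lists, T, num_strings):
    """Scan every list once, counting occurrences in an array indexed by id."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # reported exactly once, when T is reached
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_strings=16))  # [13]
```

No heap and no sorted order are needed, at the price of an array as large as the collection.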
25
List-Merging Algorithms
HeapMerger, MergeOpt [SK04]
ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
26
MergeSkip algorithm [BK02, LLL08]
Pop T-1 elements from the min-heap

Jump in each popped list to the smallest element greater than or equal to the new top of the heap
27
Example of MergeSkip
[Figure: MergeSkip on five inverted lists with a min-heap of the lists' current elements; popped lists jump ahead to the new top of the heap]
Count threshold T = 4
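The skipping step can be sketched as below. This is an illustrative reading of the pop-and-jump idea, not the published code: when the smallest element occurs on fewer than T lists, T-1 heap entries in total are popped, and each popped list jumps by binary search to its first element that is not smaller than the new heap top. Anything skipped lies on at most T-1 lists. The name merge_skip is an assumption, and lists are assumed sorted without duplicates.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Elements occurring on at least T sorted lists (requires T >= 1)."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T active lists: nothing can qualify
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:  # advance each list by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Not enough occurrences: pop until T-1 entries have been removed.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break
        target = heap[0][0]
        for _, i, pos in popped:  # jump to the first element >= target
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))  # [13]
```

On these lists the first round pops 1, 5 and 10, sees 13 on top of the heap, and moves all three popped lists straight to 13.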
28
DivideSkip Algorithm [LLL08]
Long lists: binary search
Short lists: MergeSkip
29
How many lists are treated as long lists?
Short lists: merge
Long lists: lookup
A good balance in the tradeoff:
# of long lists = T / (µ log M + 1)
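The formula can be turned into a small sketch. Several points are assumptions made here and not stated on the slide: M is taken to be the length of the longest inverted list, the logarithm is base 2, the result is rounded down and capped at T-1, and a plain counter stands in for MergeSkip on the short lists. The function names are invented for this example.

```python
import math
from bisect import bisect_left
from collections import Counter


def num_long_lists(T, M, mu):
    """L = T / (mu * log M + 1), rounded down and capped at T - 1."""
    L = int(T / (mu * math.log2(M) + 1))
    return max(0, min(L, T - 1))


def divide_skip(lists, T, mu=0.01):
    lists = [lst for lst in lists if lst]
    if not lists:
        return []
    M = max(len(lst) for lst in lists)  # assumed meaning of M
    L = num_long_lists(T, M, mu)
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:L], by_length[L:]

    # Stand-in for MergeSkip: count occurrences on the short lists.
    counts = Counter(x for lst in short_lists for x in lst)

    result = []
    for x, c in counts.items():
        if c < T - L:  # cannot reach T even if found on every long list
            continue
        for lst in long_lists:
            pos = bisect_left(lst, x)
            if pos < len(lst) and lst[pos] == x:
                c += 1
        if c >= T:
            result.append(x)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, T=4, mu=0.01), divide_skip(lists, T=4, mu=1.0))
```

A small µ pushes almost T lists into the long group (few merges, many lookups), while a large µ keeps most lists in the merge.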
30
Length Filtering
s: length 10
t: length 19
ed(s, t) <= 2?
Pruned by length only!
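The length filter follows from the fact that each edit operation changes the length of a string by at most one. A minimal sketch, with a function name chosen here:

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """Strings within edit distance k differ in length by at most k."""
    return abs(len_s - len_t) <= k


print(passes_length_filter(10, 19, 2))  # False
```

Lengths 10 and 19 differ by 9, so the pair is rejected for k = 2 without looking at a single gram.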
31
Positional Filtering
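ed(s, t) <= 2
s contains (ab, 1)
t contains (ab, 12)
Positions 1 and 12 differ by more than 2, so these two occurrences of ab cannot be matched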
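A positional gram pairs a gram with its starting position, and two occurrences may correspond only if their positions are within k of each other. The sketch below uses 0-based positions and helper names invented for this example.

```python
def positional_qgrams(s: str, q: int = 2) -> list[tuple[str, int]]:
    """(gram, starting position) pairs, positions counted from 0."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def may_correspond(g1: tuple[str, int], g2: tuple[str, int], k: int) -> bool:
    """Same gram, and positions shifted by at most k edit operations."""
    return g1[0] == g2[0] and abs(g1[1] - g2[1]) <= k


print(may_correspond(("ab", 1), ("ab", 12), k=2))  # False
```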
32
Filter tree [LLL08]
Length level
Gram level

Position level
Inverted list
33
Surprising experimental results (DBLP)
Adding position filter could increase running time
34
Filters fragment inverted lists
Applying filters: several smaller merges instead of one
Saving: reduced list size
Cost: tree traversal, more merging
35
Outline

Motivation and preliminaries
Inverted list based algorithms
List-merging algorithms
VGRAM [LWY07, YWL08]
List-compression techniques
Gram signature algorithms
Length normalized algorithms
Selectivity estimation
Conclusion and future directions

36
2-grams -> 3-grams?

Query: shtick, ED(shtick, ?) <= 1

3-grams of the query: sht, hti, tic, ick
# of common grams >= 1
37
Observation 1: dilemma of choosing q

Increasing q causes:
Longer grams -> shorter lists
Smaller # of common grams of similar strings

38
Observation 2: skewed distributions of gram frequencies

DBLP: 276,699 article titles
Popular 5-grams: ation (> 114K times), tions, ystem, catio

39
VGRAM: Main idea

Grams with variable lengths (between qmin and qmax)
zebra
ze (123)
corrasion
co (5213), cor (859), corr (171)
Advantages
Reduce index size
Reduce running time
Adoptable by many algorithms

40
Challenges

Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and their
gram-set similarity?
Adopting VGRAM in existing algorithms?

41
Challenge 1: String -> variable-length grams?

Fixed-length 2-grams

u n i v e r s a l

Variable-length grams

u n i v e r s a l
42
Representing gram dictionary as a trie
Gram dictionary: ni, ivr, sal, uni, vers
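One way to decompose a string with such a dictionary is sketched below. It is an illustration under assumptions, not the presenters' code: a Python set replaces the trie, qmin = 2 and qmax = 4 are assumed for the example, positions are 0-based, the longest dictionary gram starting at each position is taken (falling back to the qmin-gram), and a gram is dropped when it lies entirely inside a gram produced earlier.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Variable-length grams of s as (start position, gram) pairs."""
    grams = []
    max_end = 0  # right end of the furthest-reaching gram produced so far
    for p in range(len(s) - qmin + 1):
        gram = None
        # Longest dictionary gram starting at p, lengths qmax down to qmin.
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        if gram is None:
            gram = s[p:p + qmin]  # no match: fall back to the qmin-gram
        end = p + len(gram)
        if end > max_end:         # otherwise subsumed by an earlier gram
            grams.append((p, gram))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", dictionary, qmin=2, qmax=4))
```

For "universal" this produces four grams (uni, iv, vers, sal) where fixed 2-grams would produce eight.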
43
Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]
st -> 0, 1, 3; sti -> 0, 1; stu -> 3; stic -> 0, 1; stuc -> 3
Gram trie with frequencies
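Step 1 amounts to counting every substring whose length lies between qmin and qmax. The sketch below uses a Counter where the slide uses a trie. The four strings are hypothetical, chosen so that st occurs in strings 0, 1 and 3; reading the slide's numbers as string ids is an assumption.

```python
from collections import Counter


def gram_frequencies(strings, qmin, qmax):
    """Number of occurrences of every gram with qmin <= length <= qmax."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq


strings = ["stick", "stich", "such", "stuck"]  # hypothetical collection
freq = gram_frequencies(strings, qmin=2, qmax=4)
print(freq["st"], freq["sti"], freq["stu"], freq["stic"], freq["stuc"])
```

A trie stores the same counts while sharing common prefixes, which is what makes the pruning in Step 2 convenient.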
44
Step 2: selecting grams

Pruning trie using a frequency threshold F (e.g., 2)

45
Step 2: selecting grams (cont)
Threshold T = 2
46
Final gram dictionary
A cost-based approach to choosing a gram dictionary [YWL08]
47
Challenge 3: Effect of edit operations on grams
Fixed length q
u n i v e r s a l

k operations could affect at most k * q grams

48
Deletion affects variable-length grams
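Deletion at position i
Affected: only grams overlapping the window [i - qmax + 1, i + qmax - 1]
Not affected: grams entirely to the left or to the right of this window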
49
Grams affected by a deletion
[Figure: deletion at position i in universal; which grams inside the window [i - qmax + 1, i + qmax - 1] are affected?]
50
Grams affected by a deletion (cont)
[Figure: deletion at position i; the window [i - qmax + 1, i + qmax - 1] is checked against a trie of grams and a trie of reversed grams]
51
# of grams affected by each operation
String: universal
Deletion/substitution at each character: u 1, n 1, i 2, v 2, e 2, r 1, s 2, a 1, l 1
Insertion at each gap (before u through after l): 0, 1, 1, 1, 2, 1, 1, 1, 1, 0
52
Max # of grams affected by k operations
Vector of s: <2, 4, 6, 8, 9>
With 2 edit operations, at most 4 grams can be affected

Called NAG vector (# of affected grams)
Precomputed and stored
Dynamic programming to compute tight bounds [YWL08]

53
Summary of VGRAM index
54
Challenge 4: adopting VGRAM

Easily adoptable by many algorithms
Basic interfaces
String s -> grams
Strings s1, s2 such that ed(s1, s2) <= k -> min # of their common grams

55
Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l

If ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q

Variable lengths: # of grams of s1 - NAG(s1, k)
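Both bounds are one-line computations once the inputs are known. In the sketch below the function names are invented, the NAG vector is passed in as a precomputed list whose entry k-1 holds NAG(s, k), and negative bounds are clamped to 0. The one-entry vector [2] used for shtick is inferred from the slide's example (three grams, lower bound 1) rather than computed.

```python
def lower_bound_fixed(length: int, q: int, k: int) -> int:
    """(|s1| - q + 1) - k * q, clamped at 0."""
    return max(0, (length - q + 1) - k * q)


def lower_bound_vgram(num_grams: int, nag: list[int], k: int) -> int:
    """# of grams of s1 minus NAG(s1, k), clamped at 0; needs 1 <= k <= len(nag)."""
    return max(0, num_grams - nag[k - 1])


print(lower_bound_fixed(len("shtick"), q=2, k=1))  # 3
print(lower_bound_fixed(len("shtick"), q=3, k=1))  # 1
print(lower_bound_vgram(3, [2], k=1))              # 1
```

A bound of 0 means the count filter cannot prune anything and every string has to be verified.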
56
Example algorithm using inverted lists

Query: shtick, ED(shtick, ?) <= 1

2-4 grams: sh, ht, tick (lower bound = 1)
2-grams: sh, ht, ti, ic, ck (lower bound = 3)
57
Outline

Motivation and preliminaries
Inverted list based algorithms
List-merging algorithms
VGRAM
List-compression techniques [BJL09]
Gram signature algorithms
Length normalized algorithms
Selectivity estimation
Conclusion and future directions

58
Motivation

Inverted index very large
IR: lossless compression (delta encoding, mostly disk-based)
Decompression overhead
Difficult to tune compression ratio
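For context, delta encoding stores the gaps between consecutive ids of a sorted list instead of the ids themselves, so the values are small and compress well with a variable-length code. The sketch below shows only the gap transformation, with function names chosen here; the byte-level coding that normally follows is omitted.

```python
from itertools import accumulate


def delta_encode(ids: list[int]) -> list[int]:
    """Gaps between consecutive ids; the first gap is measured from 0."""
    return [b - a for a, b in zip([0] + ids[:-1], ids)]


def delta_decode(gaps: list[int]) -> list[int]:
    """Running sums of the gaps restore the original ids."""
    return list(accumulate(gaps))


print(delta_encode([1, 2, 4, 5, 6]))  # [1, 1, 2, 1, 1]
```

Decoding has to walk the list from the start, which is the decompression overhead and the obstacle to binary search mentioned on the slide.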

59
Solution

Two lossy-compression techniques
Queries become faster
Flexibility to choose space / time tradeoff

60
Approach 1: Discarding Lists
2-grams: tf, vi, ir, ef, rv, ne, un, in
[Figure: the inverted lists of some of these grams are discarded; only the remaining lists are stored]
61
Approach 2: Combining Lists
2-grams: tf, vi, ir, ef, rv, ne, un, in
[Figure: the inverted lists of several grams are combined into one shared list]
62
Technical challenges