Title: Efficient Approximate Search on String Collections
1Efficient Approximate Search on String Collections
Chen Li
2Outline
- Part 1
- Motivation and preliminaries
- Inverted list based algorithms
- Part 2
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
3Web Search
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
Actual queries gathered by Google
http://www.google.com/jobs/britney.html
4Record Linkage
R
S
- Edit distance
- Jaccard
- Cosine
Record linkage
5Document Cleaning
Should be Niels Bohr
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope
6Demos
- http://directory.uci.edu/
- http://psearch.ics.uci.edu/advanced/
- http://psearch.ics.uci.edu/
7State-of-the-art Oracle 10g and older
- Supported by Oracle Text
- CREATE TABLE engdict(word VARCHAR(20), len INT);
- Create preferences for text indexing
  begin
    ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
  end;
  /
- CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
  INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
- Usage
- SELECT * FROM engdict
  WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
- Limitation: cannot handle errors in the first letters
- Katherine versus Catherine
8Microsoft SQL Server CGG05
- Data cleaning tools available in SQL Server 2005
- Part of Integration Services
- Supports fuzzy lookups
- Uses data flow pipeline of transformations
- Similarity function tokens with TF/IDF scores
9Lucene
- Using Levenshtein Distance (Edit Distance).
- Example: roam~0.8
- Prefix filtering followed by a scan (Efficiency?)
10Problem Formulation
Find strings similar to a given string
- Performance is important!
- 10 ms per query -> 100 queries per second (QPS)
- 5 ms per query -> 200 QPS
11Similarity Functions
- Similar to
- a domain-specific function
- returns a similarity value between two strings
- Examples
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, DICE
- See KSS06 for an excellent survey
12Edit Distance
- A widely used metric to define string similarity
- ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example
- s1 = Tom Hanks
- s2 = Ton Hank
- ed(s1,s2) = 2
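The definition above can be sketched as the standard dynamic program over prefixes. This is a minimal illustration assuming unit cost for each operation; the function name `edit_distance` is chosen here and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

Only two rows of the table are kept, so memory is proportional to the length of s2.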
13Outline
- Motivation and preliminaries
- Inverted list based algorithms
- List-merging algorithms
- VGRAM
- List-compression techniques
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
14q-grams of strings
u n i v e r s a l
2-grams
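A minimal sketch of q-gram extraction, assuming plain overlapping substrings with no padding characters at the string boundaries; the helper name `qgrams` is an illustration only.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams under this convention.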
15Edit operations effect on grams
Fixed length q
u n i v e r s a l
- k operations could affect k * q grams
16q-gram inverted lists
17Searching using inverted lists
- Query: shtick, ED(shtick, ?) <= 1
2-grams of the query: sh ht ti ic ck
Inverted lists shown: ic, ck, ti
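One way to picture this lookup step is a dictionary from each 2-gram to the ascending list of ids of the strings that contain it. The sketch below is an assumption-laden illustration: the four-string collection, `build_index`, and `query_lists` are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):  # each string id appears once per list
            index[g].append(sid)
    return index


def query_lists(index, query, q=2):
    """Inverted lists of the query's grams (empty list for an unseen gram)."""
    return [index.get(g, []) for g in qgrams(query, q)]


strings = ["stick", "shtick", "stock", "trick"]  # hypothetical collection
index = build_index(strings)
lists = query_lists(index, "shtick")
```

The lists returned for the query grams are the input to the list-merging step that follows.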
18T-occurrence Problem
Merge the inverted lists (ids in ascending order)
Find elements whose # of occurrences >= T
19Example
1 3 5 10 13
10 13 15
5 7 13
13
15
Result 13
20List-Merging Algorithms
HeapMerger
MergeOpt
SK04
LLL08, BK02
ScanCount
MergeSkip
DivideSkip
21Heap-based Algorithm
Push to heap
Min-heap
Count the # of occurrences of each element using a heap
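A minimal sketch of heap-based merging, assuming every inverted list is strictly ascending. The function name `heap_merge` and the tuple layout on the heap are choices made for this illustration.

```python
import heapq


def heap_merge(lists, T):
    """Return the ids that occur on at least T of the ascending lists."""
    # Heap entries: (current id, list number, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier entry equal to the smallest id and count them.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the skipping algorithms try to avoid.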
22MergeOpt Algorithm SK04
Long lists: the T-1 longest lists, probed by binary search
Short lists: the remaining lists
23Example of MergeOpt
1 3 5 10 13
10 13 15
5 7 13
13
15
Long lists: 3
Short lists: 2
Count threshold T = 4
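The split into long and short lists can be sketched as follows. This is a simplified illustration, not the SK04 implementation: it counts occurrences on the short lists with a `Counter` instead of a heap, then verifies each candidate on the T-1 longest lists by binary search. Any id on at least T lists must appear on some short list, because only T-1 lists are long.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Ids occurring on at least T lists; long lists are only probed, never scanned."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for x in sorted(counts):
        c = counts[x]
        for lst in long_lists:
            pos = bisect_left(lst, x)  # binary search on an ascending list
            if pos < len(lst) and lst[pos] == x:
                c += 1
        if c >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```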
24ScanCount
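Array of counters: string id -> # of occurrences
Scan every inverted list and increment the counter of each string id by 1
Inverted lists:
1 3 5 10 13
10 13 15
5 7 13
13
15
Count threshold T = 4
Counters after the scan: 13 -> 4 (result!), 14 -> 0, 15 -> 2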
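A minimal ScanCount sketch, assuming string ids are integers in the range 0 to `num_strings - 1` so that a plain array can hold the counters; the parameter name `num_strings` is an assumption of this example.

```python
def scan_count(lists, T, num_strings):
    """Scan every list once, bumping a per-id counter; report ids reaching T."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # reported exactly once, when T is first reached
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, 16))  # [13]
```

No heap and no sorting of list heads is needed, at the price of one counter per string in the collection.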
25List-Merging Algorithms
HeapMerger
MergeOpt
SK04
LLL08, BK02
ScanCount
MergeSkip
DivideSkip
26MergeSkip algorithm BK02, LLL08
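Min-heap over the frontier elements of the lists
Pop T-1 elements from the heap
Jump: in each popped list, move to the first element greater than or equal to the new heap top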
27Example of MergeSkip
1
minHeap
10
5
13
15
1 3 5 10
10 15
5 7
13
15
13
13
Jump
17
17
15
15
Count threshold T = 4
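The skipping idea can be sketched as below. This is an illustrative reading of MergeSkip, not the authors' code: when the smallest id occurs on fewer than T frontiers, T-1 heap entries are popped in total and each popped list jumps (by binary search) to its first element that is at least the new heap top. Ids skipped this way can sit on at most T-1 lists, so none of them can be a result.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Ids on at least T ascending lists, skipping elements that cannot qualify."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists cannot produce a result
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:  # advance each matching list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            while len(popped) < T - 1:  # pop T-1 entries in total
                popped.append(heapq.heappop(heap))
            target = heap[0][0]  # new heap top
            for _, i, pos in popped:  # jump to the first element >= target
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```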
28DivideSkip Algorithm LLL08
Long lists: binary search
Short lists: MergeSkip
29How many lists are treated as long lists?
Short Lists
Long Lists
Merge
Lookup
?
A good balance in the tradeoff: # of long lists = T / (µ log M + 1)
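The formula # of long lists = T / (µ log M + 1) can be evaluated as in this small sketch. Several details are assumptions of the example rather than statements from the slides: M is taken to be the length of the longest inverted list, the logarithm is base 2, the result is rounded down, and it is capped at T-1 so that some short lists remain.

```python
import math


def num_long_lists(T: int, longest_len: int, mu: float) -> int:
    """L = T / (mu * log2(M) + 1), rounded down and capped at T - 1 (assumed)."""
    L = int(T / (mu * math.log2(longest_len) + 1))
    return max(0, min(L, T - 1))


print(num_long_lists(10, 1024, 0.1))  # 5
```

A larger µ makes lookups on long lists look more expensive, so fewer lists are treated as long.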
30 Length Filtering
s: length 10
t: length 19
Ed(s,t) <= 2?
By length only!
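The length filter rests on a simple fact: each edit operation changes the length by at most one, so two strings within edit distance k differ in length by at most k. A minimal sketch, with a function name chosen for this example:

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """ed(s, t) <= k is only possible if the lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


s, t = "a" * 10, "b" * 19  # lengths 10 and 19, as on the slide
print(passes_length_filter(s, t, 2))  # False
```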
31Positional Filtering
Ed(s,t) <= 2
s: (ab,1)
t: (ab,12)
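A sketch of the positional test, assuming positional grams are (gram, position) pairs and that two occurrences of the same gram can only correspond to each other when their positions differ by at most k. The helper name is hypothetical.

```python
def positional_match(gram_s, gram_t, k):
    """Same gram text, and start positions within k of each other."""
    (g1, p1), (g2, p2) = gram_s, gram_t
    return g1 == g2 and abs(p1 - p2) <= k


print(positional_match(("ab", 1), ("ab", 12), 2))  # False
```

With k = 2, the gram ab at position 1 in s cannot be matched with ab at position 12 in t, so it does not count as a common gram.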
32Filter tree LLL08
Length level
Gram level
Position level
Inverted list
33Surprising experimental results (DBLP)
Adding position filter could increase running time
34Filters fragment inverted lists
Merge
Merge
Merge
Merge
Applying filters
Saving: reduced list size
Cost:
- Tree traversal
- More merging
35Outline
- Motivation and preliminaries
- Inverted list based algorithms
- List-merging algorithms
- VGRAM LWY07,YWL08
- List-compression techniques
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
36 2-grams -> 3-grams?
- Query: shtick, ED(shtick, ?) <= 1
3-grams of the query: sht hti tic ick
# of common grams >= 1
37Observation 1 dilemma of choosing q
- Increasing q causes
- Longer grams -> shorter lists
- Smaller # of common grams of similar strings
38Observation 2: skew distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K times), tions, ystem, catio
39VGRAM Main idea
- Grams with variable lengths (between qmin and qmax)
- zebra
- ze(123)
- corrasion
- co(5213), cor(859), corr(171)
- Advantages
- Reduce index size
- Reduce running time
- Adoptable by many algorithms
40Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
41Challenge 1: String -> variable-length grams?
u n i v e r s a l
u n i v e r s a l
42Representing gram dictionary as a trie
ni ivr sal uni vers
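One plausible way to decompose a string with such a dictionary is sketched below. It is an illustration under assumptions, not the authors' exact procedure: at each start position it takes the longest dictionary gram (falling back to the qmin-gram when none matches) and drops any gram that is positionally covered by a gram already emitted. A Python set stands in for the trie, and qmin = 2, qmax = 4 are assumed values.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Variable-length positional grams of s as (start index, gram) pairs."""
    result = []
    covered_end = -1  # largest end index of any gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:  # longest match first
                gram = s[p:p + length]
                break
        end = p + len(gram) - 1
        if end > covered_end:  # skip grams subsumed by an earlier gram
            result.append((p, gram))
            covered_end = end
    return result


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgrams("universal", dictionary, 2, 4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Here ni is dropped because it lies inside uni, and iv is produced by the qmin fallback since no dictionary gram starts at that position.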
43Challenge 2 Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]
st -> 0, 1, 3; sti -> 0, 1; stu -> 3; stic -> 0, 1; stuc -> 3
Gram trie with frequencies
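Step 1 can be sketched with a flat counter in place of the trie. This simplification is an assumption of the example: a trie would share prefixes between grams such as st, sti, and stic, while the counter below stores each gram separately. The two sample strings are hypothetical.

```python
from collections import Counter


def gram_frequencies(strings, qmin, qmax):
    """Count every occurrence of every gram whose length lies in [qmin, qmax]."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq


freq = gram_frequencies(["stick", "stuck"], 2, 4)  # hypothetical data
print(freq["st"], freq["sti"], freq["stuc"])  # 2 1 1
```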
44Step 2 selecting grams
- Pruning trie using a frequency threshold F (e.g., 2)
45Step 2 selecting grams (cont)
Threshold T = 2
46Final gram dictionary
A cost-based approach to choosing a gram dictionary YWL08
47Challenge 3 Edit operations effect on grams
Fixed length q
u n i v e r s a l
- k operations could affect k * q grams
48Deletion affects variable-length grams
Deletion at position i
- Affected: only grams within positions [i - qmax + 1, i + qmax - 1]
- Not affected: grams outside this range, on either side
49Grams affected by a deletion
Deletion at position i: which grams within [i - qmax + 1, i + qmax - 1] are affected?
Example: a deletion in u n i v e r s a l
50Grams affected by a deletion (cont)
Deletion at position i: which grams within [i - qmax + 1, i + qmax - 1] are affected?
Trie of grams
Trie of reversed grams
51# of grams affected by each operation
Insertion at a gap (_), deletion/substitution at a character:
_ u _ n _ i _ v _ e _ r _ s _ a _ l _
0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
52Max # of grams affected by k operations
Vector of s: <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
- Called NAG vector (# of affected grams)
- Precomputed and stored
- Dynamic programming to compute tight bounds YWL08
53Summary of VGRAM index
54Challenge 4 adopting VGRAM
- Easily adoptable by many algorithms
- Basic interfaces
- String s -> grams
- String s1, s2 such that ed(s1,s2) <= k -> min # of their common grams
55Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l
- If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
Variable lengths: # of grams of s1 - NAG(s1,k)
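Both bounds are simple arithmetic, sketched here for concreteness. The fixed-length bound is (|s1| - q + 1) - k * q; the variable-length bound subtracts the k-th entry of the NAG vector from the number of grams of s1. The function names, the 1-based indexing of the vector by k, and the gram count of 9 in the usage line are assumptions of this example.

```python
def fixed_q_lower_bound(len_s1: int, q: int, k: int) -> int:
    """(|s1| - q + 1) grams in total, each edit operation destroys at most q."""
    return (len_s1 - q + 1) - k * q


def vgram_lower_bound(num_grams: int, nag: list[int], k: int) -> int:
    """nag[k - 1] = max # of grams of s1 affected by k edit operations."""
    return num_grams - nag[k - 1]


print(fixed_q_lower_bound(len("shtick"), 2, 1))  # 3
print(fixed_q_lower_bound(len("shtick"), 3, 1))  # 1
print(vgram_lower_bound(9, [2, 4, 6, 8, 9], 2))  # 5, for a hypothetical 9-gram string
```

A bound of zero or less means the count filter cannot prune anything for that k.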
56Example algorithm using inverted lists
- Query: shtick, ED(shtick, ?) <= 1
2-grams: lower bound = 3
2-4 grams (sh ht tick): lower bound = 1
57Outline
- Motivation and preliminaries
- Inverted list based algorithms
- List-merging algorithms
- VGRAM
- List-compression techniques BJL09
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
58Motivation
- Inverted index very large
- IR: lossless compression (delta encoding, mostly disk-based)
- Decompression overhead
- Difficult to tune compression ratio
59Solution
- Two lossy-compression techniques
- Queries become faster
- Flexibility to choose space / time tradeoff
60Approach 1 Discarding Lists
tf
vi
ir
ef
rv
ne
un
in
2-grams
1 2 4 5 6
5 9
1 5
7 9
5 6 9
Inverted Lists
Discarded
61Approach 2 Combining Lists
tf
vi
ir
ef
rv
ne
un
in
2-grams
1 2 4 5 6
1 3 4 5 7 9
7 9
6 9
1 2 3 9
5 6 9
Inverted Lists
Combined
62Technical challenges
- Effect on list-merging algorithms
- How to choose lists to discard/merge
63Outline (end of part 1)
- Part 1
- Motivation and preliminaries
- Inverted list based algorithms
- List-merging algorithms
- VGRAM LWY07,YWL08
- List-compression techniques
- Part 2
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions