1
Efficient Approximate Search on String Collections
  • Marios Hadjieleftheriou
  • Chen Li

2
Outline
  • Part 1
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Part 2
  • Gram signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

3
Web Search
  • Errors in queries
  • Errors in data
  • Bring query and meaningful results closer together

Actual queries gathered by Google:
http://www.google.com/jobs/britney.html
4
Record Linkage
R
S
  • Edit distance
  • Jaccard
  • Cosine

Record linkage
5
Document Cleaning
Should be Niels Bohr
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope
6
Demos
  • http://directory.uci.edu/
  • http://psearch.ics.uci.edu/advanced/
  • http://psearch.ics.uci.edu/

7
State of the art: Oracle 10g and older
  • Supported by Oracle Text
  • CREATE TABLE engdict(word VARCHAR(20), len INT);
  • Create preferences for text indexing:
  • begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_MATCH', 'ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_SCORE', '0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_NUMRESULTS', '5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'SUBSTRING_INDEX', 'TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'STEMMER', 'ENGLISH');
    end;
    /
  • CREATE INDEX fuzzy_stem_subst_idx ON engdict (word)
    INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
  • Usage:
  • SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
  • Limitation: cannot handle errors in the first letters
  • Katherine versus Catherine

8
Microsoft SQL Server [CGG05]
  • Data cleaning tools available in SQL Server 2005
  • Part of Integration Services
  • Supports fuzzy lookups
  • Uses data flow pipeline of transformations
  • Similarity function: tokens with TF/IDF scores

9
Lucene
  • Using Levenshtein Distance (Edit Distance).
  • Example: roam~0.8
  • Prefix filtering followed by a scan (Efficiency?)

10
Problem Formulation
Find strings similar to a given string
  • Performance is important!
  • 10 ms per query: 100 queries per second (QPS)
  • 5 ms per query: 200 QPS

11
Similarity Functions
  • Similar to
  • a domain-specific function
  • returns a similarity value between two strings
  • Examples
  • Edit distance
  • Hamming distance
  • Jaccard similarity
  • Soundex
  • TF/IDF, BM25, DICE
  • See [KSS06] for an excellent survey

12
Edit Distance
  • A widely used metric to define string similarity
  • ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
  • Example
  • s1 = Tom Hanks
  • s2 = Ton Hank
  • ed(s1,s2) = 2
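
A minimal sketch of how edit distance can be computed, using the textbook dynamic-programming recurrence (the slides do not prescribe an implementation; the function name `edit_distance` is chosen here for illustration):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, keeping only one row of the DP table at a time."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations in the slide's example are the substitution m -> n and the deletion of the final s.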

13
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • List-merging algorithms
  • VGRAM
  • List-compression techniques
  • Gram signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

14
q-grams of strings
u n i v e r s a l
2-grams
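
As a small illustration, the q-grams of a string can be produced with a sliding window. This sketch assumes plain grams without padding characters or positions; the helper name `qgrams` is hypothetical:

```python
from typing import List


def qgrams(s: str, q: int = 2) -> List[str]:
    """All substrings of length q, from left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, which is the quantity used in the count bounds later in the deck.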
15
Edit operations effect on grams
Fixed length q
u n i v e r s a l
  • k operations could affect k * q grams

16
q-gram inverted lists
17
Searching using inverted lists
  • Query: shtick, ED(shtick, ?) ≤ 1

2-grams of the query: sh ht ti ic ck
Inverted lists shown: ti ic ck
18
T-occurrence Problem
Merge
Ascending order
Find elements whose # of occurrences ≥ T
19
Example
  • T = 4

1 3 5 10 13
10 13 15
5 7 13
13
15
Result: 13
20
List-Merging Algorithms
[SK04]: HeapMerger, MergeOpt
[LLL08, BK02]: ScanCount, MergeSkip, DivideSkip
21
Heap-based Algorithm
Push to heap
Min-heap

Count # of occurrences of each element using a heap
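
A minimal sketch of the heap-based merge, assuming each inverted list is sorted in ascending order and contains no duplicate ids. The function name `heap_merge` and the tuple layout are illustrative choices, not taken from the slides:

```python
import heapq
from typing import List


def heap_merge(lists: List[List[int]], T: int) -> List[int]:
    """Return the ids that occur on at least T of the sorted lists."""
    # Heap entries: (current id, list index, position within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier equal to the smallest id and count them.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.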
22
MergeOpt Algorithm [SK04]
Long lists (T-1): binary search
Short lists
23
Example of MergeOpt
1 3 5 10 13
10 13 15
5 7 13
13
15
Long lists: 3
Short lists: 2
Count threshold T = 4
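
The idea can be sketched as follows: the T-1 longest lists are set aside, the short lists are merged to obtain candidates, and each candidate is then looked up in the long lists by binary search. This is a simplified sketch under the assumption of sorted, duplicate-free lists; it omits the early-termination refinements of [SK04], and the name `merge_opt` is illustrative:

```python
import bisect
import heapq
from typing import Dict, List


def merge_opt(lists: List[List[int]], T: int) -> List[int]:
    """Ids occurring on at least T lists, probing the T-1 longest lists lazily."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]

    # An answer must appear on at least one short list, because the
    # T-1 long lists alone can contribute at most T-1 occurrences.
    counts: Dict[int, int] = {}
    for sid in heapq.merge(*short_lists):
        counts[sid] = counts.get(sid, 0) + 1

    result = []
    for sid, count in counts.items():
        for lst in long_lists:
            pos = bisect.bisect_left(lst, sid)  # binary search in a long list
            if pos < len(lst) and lst[pos] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

With T = 4 the three longest lists are treated as long and the two singleton lists as short, matching the slide's example.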
24
ScanCount
Array of counts indexed by string id (# of occurrences); increment by 1 for each id scanned
1 3 5 10 13
10 13 15
5 7 13
13
15
Count of string id 13 reaches 4: result!
Count threshold T = 4
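
A minimal sketch of ScanCount, assuming string ids are integers in the range 0 to `num_strings` - 1 and that no id repeats within a list (the parameter name `num_strings` is introduced here for illustration):

```python
from typing import List


def scan_count(lists: List[List[int]], T: int, num_strings: int) -> List[int]:
    """Scan every list once, keeping one counter per string id."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report an id the moment it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, 16))  # [13]
```

No heap and no sorted order are needed during the scan; the price is a counter array as large as the collection.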
25
List-Merging Algorithms
[SK04]: HeapMerger, MergeOpt
[LLL08, BK02]: ScanCount, MergeSkip, DivideSkip
26
MergeSkip algorithm [BK02, LLL08]
Min-heap: pop T-1 lists
Jump: each popped list skips to its first element greater than or equal to the new top of the heap
27
Example of MergeSkip
[Figure: min-heap over the frontiers of the example inverted lists; popped lists jump ahead to the top of the heap]
Count threshold T = 4
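
A sketch of the skipping idea, assuming sorted, duplicate-free lists. When the smallest frontier occurs fewer than T times, T-1 frontiers are popped in total and each popped list jumps (by binary search) to its first element not smaller than the new top of the heap. The name `merge_skip` and the bookkeeping arrays are illustrative, not the authors' code:

```python
import bisect
import heapq
from typing import List


def merge_skip(lists: List[List[int]], T: int) -> List[int]:
    """Ids occurring on at least T lists, skipping ids that cannot qualify."""
    pos = [0] * len(lists)  # current frontier position of each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists cannot yield a result
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
        else:
            # Pop more frontiers until T-1 lists have been popped in total.
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            target = heap[0][0]
            # Ids below target lie on at most T-1 lists, so skip them.
            for i in popped:
                pos[i] = bisect.bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On this input the first round pops the frontiers 1, 5 and 10, and all three lists jump directly to 13.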
28
DivideSkip Algorithm [LLL08]
Long lists: binary search
Short lists: MergeSkip
29
How many lists are treated as long lists?
Short lists: merge
Long lists: lookup
A good balance in the tradeoff: # of long lists = T / (µ log M + 1)
30
Length Filtering
Ed(s,t) ≤ 2
s: length 10
t: length 19
Filtered by length only!
31
Positional Filtering
Ed(s,t) ≤ 2
s: (ab,1)
t: (ab,12)
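
Both filters reduce to a simple comparison against the edit-distance threshold k. The sketch below uses hypothetical helper names and checks the two slide examples (lengths 10 versus 19, and gram positions 1 versus 12, with k = 2):

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """Strings within edit distance k differ in length by at most k."""
    return abs(len_s - len_t) <= k


def passes_position_filter(pos_s: int, pos_t: int, k: int) -> bool:
    """A shared gram can only be matched if its positions differ by at most k."""
    return abs(pos_s - pos_t) <= k


print(passes_length_filter(10, 19, 2))   # False: pruned by length alone
print(passes_position_filter(1, 12, 2))  # False: (ab,1) and (ab,12) do not match
```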
32
Filter tree [LLL08]
Length level
Gram level

Position level
Inverted list
33
Surprising experimental results (DBLP)
Adding position filter could increase running time
34
Filters fragment inverted lists
Merge
Merge
Merge
Merge
Applying filters
Saving: reduced list size. Cost: tree traversal, more merging
35
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • List-merging algorithms
  • VGRAM [LWY07, YWL08]
  • List-compression techniques
  • Gram signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

36
2-grams -> 3-grams?
  • Query: shtick, ED(shtick, ?) ≤ 1

3-grams of the query: sht hti tic ick
Inverted lists shown: tic ick
# of common grams ≥ 1
37
Observation 1: dilemma of choosing q
  • Increasing q causes:
  • Longer grams → shorter lists
  • Smaller # of common grams of similar strings

38
Observation 2: skewed distributions of gram frequencies
  • DBLP: 276,699 article titles
  • Popular 5-grams: ation (>114K times), tions, ystem, catio

39
VGRAM: Main idea
  • Grams with variable lengths (between qmin and qmax)
  • zebra
  • ze (123)
  • corrasion
  • co (5213), cor (859), corr (171)
  • Advantages
  • Reduce index size
  • Reduce running time
  • Adoptable by many algorithms

40
Challenges
  • Generating variable-length grams?
  • Constructing a high-quality gram dictionary?
  • Relationship between string similarity and their
    gram-set similarity?
  • Adopting VGRAM in existing algorithms?

41
Challenge 1: String → variable-length grams?
  • Fixed-length 2-grams

u n i v e r s a l
  • Variable-length grams

u n i v e r s a l
42
Representing gram dictionary as a trie
Grams in the dictionary: ni, ivr, sal, uni, vers
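
One way to decompose a string with such a dictionary is sketched below. It assumes the rule described for VGRAM in [LWY07] as recalled here: at each position take the longest dictionary gram (length between qmin and qmax), fall back to the qmin-gram if none matches, and drop any gram that is fully contained in a previously chosen one. A Python set stands in for the trie, positions are 0-based, and the function name `vgrams` is illustrative:

```python
from typing import List, Set, Tuple


def vgrams(s: str, dictionary: Set[str], qmin: int, qmax: int) -> List[Tuple[int, str]]:
    """Variable-length positional grams of s as (start position, gram) pairs."""
    grams = []
    covered_end = -1  # right-most index covered by a gram chosen so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]  # longest match wins
                break
        end = p + len(gram) - 1
        if end > covered_end:  # keep only grams not subsumed by an earlier one
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgrams("universal", dictionary, 2, 4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

With fixed 2-grams the same string produces eight grams; here it produces four.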
43
Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]
st → {0, 1, 3}, sti → {0, 1}, stu → {3}, stic → {0, 1}, stuc → {3}
Gram trie with frequencies
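
Step 1 can be sketched with a dictionary in place of the trie. The four input strings below are an assumption: they are not visible in the transcript, but they reproduce the string-id sets shown on the slide. The function name `collect_grams` is illustrative:

```python
from collections import defaultdict
from typing import Dict, List, Set


def collect_grams(strings: List[str], qmin: int, qmax: int) -> Dict[str, Set[int]]:
    """Map every gram of length qmin..qmax to the ids of strings containing it."""
    ids: Dict[str, Set[int]] = defaultdict(set)
    for sid, s in enumerate(strings):
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                ids[s[i:i + q]].add(sid)
    return ids


strings = ["stick", "stich", "such", "stuck"]  # assumed example collection
index = collect_grams(strings, 2, 4)
for gram in ["st", "sti", "stu", "stic", "stuc"]:
    print(gram, sorted(index[gram]))
```

The size of each set plays the role of the gram's frequency when the trie is pruned in Step 2.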
44
Step 2: selecting grams
  • Pruning trie using a frequency threshold F (e.g., 2)

45
Step 2: selecting grams (cont)
Threshold T = 2
46
Final gram dictionary
A cost-based approach to choosing a gram dictionary [YWL08]
47
Challenge 3: Edit operations' effect on grams
Fixed length q
u n i v e r s a l
  • k operations could affect k * q grams

48
Deletion affects variable-length grams
Deletion at position i
Affected: grams within positions [i - qmax + 1, i + qmax - 1]
Not affected: grams entirely to the left or right of that range
49
Grams affected by a deletion
Deletion at position i
u n i v e r s a l
Affected? Only grams within positions [i - qmax + 1, i + qmax - 1] need to be checked
50
Grams affected by a deletion (cont)
Deletion at position i
Affected? Check grams within positions [i - qmax + 1, i + qmax - 1] using:
Trie of grams
Trie of reversed grams
51
# of grams affected by each operation
Deletion/substitution at a letter; insertion at a gap (_)
0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
_ u _ n _ i _ v _ e _ r _ s _ a _ l _
52
Max # of grams affected by k operations
Vector of s: <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
  • Called NAG vector (# of affected grams)
  • Precomputed and stored
  • Dynamic programming to compute tight bounds [YWL08]

53
Summary of VGRAM index
54
Challenge 4: adopting VGRAM
  • Easily adoptable by many algorithms
  • Basic interfaces
  • String s → grams
  • Strings s1, s2 such that ed(s1,s2) ≤ k → min # of their common grams

55
Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l
  • If ed(s1,s2) ≤ k, then their # of common grams ≥ (|s1| - q + 1) - k * q

Variable lengths: # of grams of s1 - NAG(s1,k)
56
Example algorithm using inverted lists
  • Query: shtick, ED(shtick, ?) ≤ 1

2-grams: lower bound = 3
2-4 grams: sh ht tick, lower bound = 1
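
The two lower bounds can be checked with a few lines of arithmetic. In this sketch the NAG vector is passed as a list whose entry k-1 holds the maximum number of grams affected by k edit operations; the value NAG(shtick, 1) = 2 is inferred from the slide's bound (3 grams, lower bound 1) rather than stated in the source, and both function names are illustrative:

```python
from typing import List


def bound_fixed(length: int, q: int, k: int) -> int:
    """Fixed-length grams: (|s| - q + 1) - k * q."""
    return (length - q + 1) - k * q


def bound_vgram(num_grams: int, nag: List[int], k: int) -> int:
    """Variable-length grams: (# of grams of s) - NAG(s, k)."""
    return num_grams - nag[k - 1]


print(bound_fixed(len("shtick"), 2, 1))  # 3  (2-grams)
print(bound_fixed(len("shtick"), 3, 1))  # 1  (3-grams)
print(bound_vgram(3, [2], 1))            # 1  (grams sh, ht, tick; assumed NAG = 2)
```

A bound of zero or less means the count filter cannot prune anything for that query.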
57
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • List-merging algorithms
  • VGRAM
  • List-compression techniques [BJL09]
  • Gram signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

58
Motivation
  • Inverted index: very large
  • IR: lossless compression (delta encoding, mostly disk-based)
  • Decompression overhead
  • Difficult to tune compression ratio

59
Solution
  • Two lossy-compression techniques
  • Queries become faster
  • Flexibility to choose space / time tradeoff

60
Approach 1: Discarding Lists
2-grams: tf vi ir ef rv ne un in ...
Inverted lists (some discarded):
1 2 4 5 6
5 9
1 5
7 9
5 6 9
61
Approach 2: Combining Lists
2-grams: tf vi ir ef rv ne un in ...
Inverted lists (some combined):
1 2 4 5 6
1 3 4 5 7 9
7 9
6 9
1 2 3 9
5 6 9
62
Technical challenges
  • Effect on list-merging algorithms
  • How to choose lists to discard/merge

63
Outline (end of part 1)
  • Part 1
  • Motivation and preliminaries
  • Inverted list based algorithms
  • List-merging algorithms
  • VGRAM [LWY07, YWL08]
  • List-compression techniques
  • Part 2
  • Gram signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions