Efficient Approximate Search on String Collections, Part II
1
Efficient Approximate Search on String Collections, Part II
  • Marios Hadjieleftheriou
  • Chen Li
2
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram Signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

3
N-Gram Signatures
  • Use string signatures that upper bound similarity
  • Use signatures as a filtering step
  • Properties
  • The signature has to be small
  • Signature verification must be fast
  • False positives are acceptable (removed during verification); false negatives are not
  • Signatures have to be indexable

4
Known signatures
  • Minhash
  • Jaccard, Edit distance
  • Prefix filter (CGK06)
  • Jaccard, Edit distance
  • PartEnum (AGK06)
  • Hamming, Jaccard, Edit distance
  • LSH (GIM99)
  • Jaccard, Edit distance
  • Mismatch filter (XWL08)
  • Edit distance

5
Prefix Filter
  • Bit vectors
  • Mismatch vector
  • s matches 6 of q's n-grams, is missing 2, and has 2 extra
  • If |s ∩ q| ≥ 6, then for every s' ⊆ s with |s'| ≥ 3, s' ∩ q ≠ ∅
  • For at least k matches, |s'| = l - k + 1

6
Using Prefixes
  • Take a random permutation of the n-gram universe
  • Take prefixes from both sets
  • |s'| = |q'| = 3: if |s ∩ q| ≥ 6 then s' ∩ q' ≠ ∅
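The prefix idea above can be sketched in a few lines, assuming unweighted n-gram sets, a hypothetical gram universe, and the standard prefix-filter bound |prefix| = |set| - T + 1:

```python
import random

def prefix(gram_set, perm, keep):
    # keep the `keep` grams that come first under the global permutation order
    return set(sorted(gram_set, key=lambda g: perm[g])[:keep])

universe = [f"g{i}" for i in range(20)]   # hypothetical n-gram universe
order = random.Random(42).sample(universe, len(universe))
perm = {g: rank for rank, g in enumerate(order)}

s = set(universe[:8])      # 8 n-grams
q = set(universe[2:10])    # shares 6 n-grams with s
T = 6                      # required number of matches
ps = prefix(s, perm, len(s) - T + 1)   # 3 grams
pq = prefix(q, perm, len(q) - T + 1)   # 3 grams
assert len(s & q) >= T
assert ps & pq  # the prefixes are guaranteed to intersect
```

The guarantee holds for any permutation: two 8-gram sets sharing at least 6 grams cannot have disjoint 3-gram prefixes under a common ordering.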

7
Prefix Filter for Weighted Sets
  • For example
  • Order n-grams by weight (new coordinate space)
  • Query: w(q ∩ s) = Σ_{i ∈ q∩s} w_i ≥ t
  • Keep a prefix s' s.t. w(s') ≥ w(s) - α
  • Best case: w((q \ q') ∩ (s \ s')) = α
  • Hence, we need w(q' ∩ s') ≥ t - α

w1 ≥ w2 ≥ … ≥ w14
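A minimal sketch of the weighted prefix construction: drop the lightest n-grams as long as the dropped weight stays within α, so w(s') ≥ w(s) - α (the grams and weights below are made up):

```python
def weighted_prefix(grams, w, alpha):
    # keep the heaviest grams so that the dropped suffix weighs at most alpha
    ordered = sorted(grams, key=lambda g: w[g])   # lightest first
    kept, dropped = [], 0.0
    for g in ordered:
        if dropped + w[g] <= alpha:
            dropped += w[g]    # safe to drop: total dropped weight stays <= alpha
        else:
            kept.append(g)
    return set(kept)

w = {"ab": 5.0, "bc": 3.0, "cd": 1.5, "de": 0.4, "ef": 0.5}   # made-up weights
alpha = 1.0
ps = weighted_prefix(set(w), w, alpha)
assert ps == {"ab", "bc", "cd"}                  # "de" and "ef" (0.9 total) dropped
assert sum(w[g] for g in set(w) - ps) <= alpha   # w(s') >= w(s) - alpha
```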
8
Prefix Filter Properties
  • The larger we make α, the smaller the prefix
  • The larger we make α, the smaller the range of thresholds we can support
  • Because t ≥ α; otherwise t - α is negative.
  • We need to pre-specify a minimum t
  • Can apply to Jaccard, Edit Distance, IDF

9
Other Signatures
  • Minhash (still to come)
  • PartEnum
  • Upper bounds Hamming distance
  • Selects multiple subsets instead of one prefix
  • Larger signature, but stronger guarantee
  • LSH
  • Probabilistic with guarantees
  • Based on hashing
  • Mismatch filter
  • Uses positional mismatching n-grams within the prefix to attain a lower bound on Edit Distance

10
Signature Indexing
  • Straightforward solution
  • Create an inverted index on signature n-grams
  • Merge inverted lists to compute signature intersections
  • For a given string q
  • Access only the lists of n-grams in q
  • Find strings s with w(q ∩ s) ≥ t - α

11
The Inverted Signature Hashtable (CCGX08)
  • Maintain a signature vector for every n-gram
  • Consider prefix signatures for simplicity
  • For each n-gram, keep a co-occurrence list: the n-grams that appear together with it in some signature
  • Hash all n-grams (h: n-gram → [0, m))
  • Convert the co-occurrence lists to bit-vectors of size m

12
Example
13
Using the Hashtable
  • Let the list of n-gram "at" correspond to bit-vector 100011
  • Then there exists a string s s.t. "at" ∈ s and s also contains some n-grams that hash to 0, 1, or 5
  • Given query q
  • Construct the query signature matrix
  • Consider only solid sub-matrices P: r ⊆ q, p ⊆ q
  • We need to look only at r ⊆ q such that w(r) ≥ t - α and w(p) ≥ t

14
Verification
  • How do we find which strings correspond to a given sub-matrix?
  • Create an inverted index on string n-grams
  • Examine only the lists of n-grams in r, and strings with w(s) ≥ t
  • Remember that r ⊆ q
  • Can be used with other signatures as well
  • Can be used with other signatures as well

15
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram Signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

16
Length Normalized Measures
  • What is normalization?
  • Normalize similarity scores by the lengths of the strings.
  • Can result in more meaningful matches.
  • Can use L0 (i.e., the length of the string), L1, L2, etc.
  • For example, L2
  • Let w2(s) = Σ_{t ∈ s} w(t)²
  • Weights can be IDF, unary, language model, etc.
  • ‖s‖2 = w2(s)^{1/2}
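A small illustration of L2-normalized scoring with unary weights; the `ngrams` helper and the example strings are assumptions, not from the deck:

```python
import math

def ngrams(s, n=3):
    # the set of character n-grams of s
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def l2(grams, w):
    # L2 length: ||s||_2 = sqrt(sum of squared weights)
    return math.sqrt(sum(w(g) ** 2 for g in grams))

def score(q, s, w=lambda g: 1.0):
    # S(q, s) = w2(q ∩ s) / (||q||_2 ||s||_2), here with unary weights
    qg, sg = ngrams(q), ngrams(s)
    return sum(w(g) ** 2 for g in qg & sg) / (l2(qg, w) * l2(sg, w))

assert abs(score("AT&T Labs", "AT&T Labs") - 1.0) < 1e-9   # exact match scores 1
assert 0.0 < score("Dark Knight", "Dark Night") < 1.0      # near match scores below 1
```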

17
The L2-Length Filter (HCKS08)
  • Why L2?
  • For almost exact matches.
  • Two strings match only if
  • They have very similar n-gram sets, and hence L2
    lengths
  • The extra n-grams have truly insignificant
    weights in aggregate (hence, resulting in similar
    L2 lengths).

18
Example
  • AT&T Labs Research → L2 = 100
  • AT&T Labs Research → L2 = 95
  • AT&T Labs → L2 = 70
  • What if Research happened to be very popular and had a small weight?
  • The Dark Knight → L2 = 75
  • Dark Night → L2 = 72

19
Why L2 (continued)
  • Tight L2-based length filtering results in very efficient pruning.
  • L2 yields scores bounded within [0, 1]
  • 1 means a truly perfect match.
  • Easier to interpret scores.
  • L0 and L1 do not have the same properties
  • Scores are bounded only by the largest string length in the database.
  • For L0, an exact match can have a smaller score than a non-exact match!

20
Example
  • q = {"ATT", "TT ", "T L", "LAB", "ABS"} → ‖q‖0 = 5
  • s1 = {"ATT"} → ‖s1‖0 = 1
  • s2 = q → ‖s2‖0 = 5
  • S(q, s1) = Σ w(q ∩ s1) / (‖q‖0 ‖s1‖0) = 10/5 = 2
  • S(q, s2) = Σ w(q ∩ s2) / (‖q‖0 ‖s2‖0) = 40/25 < 2

21
Problems
  • L2 normalization poses challenges.
  • For example
  • S(q, s) = w2(q ∩ s) / (‖q‖2 ‖s‖2)
  • The prefix filter cannot be applied.
  • What is the minimum prefix weight α?
  • Its value depends both on ‖s‖2 and ‖q‖2.
  • But ‖q‖2 is unknown at index construction time

22
Important L2 Properties
  • Length filtering
  • For S(q, s) ≥ t
  • t·‖q‖2 ≤ ‖s‖2 ≤ ‖q‖2 / t
  • We are only looking for strings within these lengths.
  • Proof in paper
  • Monotonicity
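The length-filtering bound above can be sketched as follows, with made-up L2 lengths:

```python
def length_filter(norms, qnorm, t):
    # keep only ids whose L2 length lies in [t * ||q||, ||q|| / t]
    lo, hi = t * qnorm, qnorm / t
    return {sid for sid, n in norms.items() if lo <= n <= hi}

norms = {"s1": 10.0, "s2": 7.5, "s3": 14.0, "s4": 20.0}   # made-up L2 lengths
cands = length_filter(norms, qnorm=10.0, t=0.7)
assert cands == {"s1", "s2", "s3"}   # bounds are [7.0, ~14.29]; s4 is pruned
```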

23
Monotonicity
  • Let s = {t1, t2, …, tm}.
  • Let pw(s, t) = w(t) / ‖s‖2 (the partial weight of s)
  • Then S(q, s) = Σ_{t ∈ q∩s} w(t)² / (‖q‖2 ‖s‖2) = Σ_{t ∈ q∩s} pw(s, t) · pw(q, t)
  • If pw(s, t) > pw(r, t), then
  • w(t)/‖s‖2 > w(t)/‖r‖2 → ‖s‖2 < ‖r‖2
  • Hence, for any other t'
  • w(t')/‖s‖2 > w(t')/‖r‖2 → pw(s, t') > pw(r, t')

24
Indexing
  • Use inverted lists sorted by pw()
  • pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) →
  • ‖0‖2 < ‖4‖2 < ‖1‖2 < ‖2‖2
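Building such lists can be sketched with unary weights: each n-gram's list is sorted by ascending L2 length, which is the same as descending partial weight (the strings and ids are made up):

```python
import math
from collections import defaultdict

def build_index(strings, w=lambda g: 1.0, n=3):
    grams = {sid: {s[i:i + n] for i in range(len(s) - n + 1)}
             for sid, s in strings.items()}
    norm = {sid: math.sqrt(sum(w(g) ** 2 for g in gs)) for sid, gs in grams.items()}
    index = defaultdict(list)
    for sid, gs in grams.items():
        for g in gs:
            index[g].append(sid)
    for g in index:
        # descending pw(s, g) = w(g)/||s||_2, i.e., ascending L2 length
        index[g].sort(key=lambda sid: norm[sid])
    return index, norm

strings = {0: "lucas", 1: "lucasfilm", 2: "luc"}
index, norm = build_index(strings)
assert index["luc"] == [2, 0, 1]   # shortest L2 length (highest pw) first
```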

25
L2 Length Filter
  • Given q and t, and using length filtering
  • We examine only a small fraction of the lists

26
Monotonicity
  • If string 1 has been seen already, then string 4 is not in the rest of the list

27
Other Improvements
  • Use properties of weighting scheme
  • Scan high weight lists first
  • Prune according to string length and maximum
    potential score
  • Ignore low weight lists altogether

28
Conclusion
  • Concepts can be extended easily for
  • BM25
  • Weighted Jaccard
  • DICE
  • IDF
  • Take-away message
  • Properties of the similarity/distance function can play a big role in designing very fast indexes.
  • L2 is super fast for almost exact matches

29
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram signature algorithms
  • Length-normalized measures
  • Selectivity estimation
  • Conclusion and future directions

30
The Problem
  • Estimate the number of strings with
  • Edit distance smaller than k from query q
  • Cosine similarity higher than t to query q
  • Jaccard, Hamming, etc
  • Issues
  • Estimation accuracy
  • Size of estimator
  • Cost of estimation

31
Motivation
  • Query optimization
  • Selectivity of query predicates
  • Need to support selectivity of approximate string
    predicates
  • Visualization/Querying
  • Expected result set size helps with visualization
  • Result set size important for remote query
    processing

32
Flavors
  • Edit distance
  • Based on clustering (JL05)
  • Based on min-hash (MBKS07)
  • Based on wild-card n-grams (LNS07)
  • Cosine similarity
  • Based on sampling (HYKS08)

33
Selectivity Estimation for Edit Distance
  • Problem
  • Given query string q
  • Estimate the number of strings s ∈ D
  • Such that ed(q, s) ≤ d

34
Sepia - Clustering (JL05, JLV08)
  • Partition strings using clustering
  • Enables pruning of whole clusters
  • Store per-cluster histograms
  • Number of strings within edit distance 0, 1, …, d from the cluster center
  • Compute global dataset statistics
  • Use a training query set to compute the frequency of strings within edit distance 0, 1, …, d from each query

35
Edit Vectors
  • Edit distance is not discriminative
  • Use edit vectors instead
  • 3D space vs 1D space

[Figure: cluster Ci with center pi, containing Luciano, Lucia, and Lukas; query q = Lucas; edit vectors such as <2,0,0>, <1,1,1>, <1,1,0>, with edit distances 2 and 3]
36
Visually
[Figure: per-cluster histograms and the Global Table]
37
Selectivity Estimation
  • Use the triangle inequality
  • Compute edit vector v(q,pi) for all clusters i
  • If |v(q,pi)| > ri + d, disregard cluster Ci

[Figure: query q at distance greater than ri + d from cluster center pi]
38
Selectivity Estimation
  • Use the triangle inequality
  • Compute edit vector v(q,pi) for all clusters i
  • If |v(q,pi)| > ri + d, disregard cluster Ci
  • For all entries in the frequency table
  • If |v(q,pi)| + |v(pi,s)| ≤ d, then ed(q,s) ≤ d for all such s
  • If |v(q,pi)| - |v(pi,s)| > d, ignore these strings
  • Else use the global table
  • Look up entry <v(q,pi), v(pi,s), d> in the global table
  • Use the estimated fraction of strings
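The cluster-pruning step can be sketched as follows, using a textbook Levenshtein implementation and scalar distances in place of edit vectors (the cluster contents are made up):

```python
def ed(a, b):
    # textbook dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# made-up clusters: center pi and radius ri
clusters = {"C1": ("lucas", 1), "C2": ("martin", 2)}
q, d = "lukas", 1
# triangle inequality: if ed(q, pi) > ri + d, no string in Ci can qualify
survivors = [c for c, (p, r) in clusters.items() if ed(q, p) <= r + d]
assert survivors == ["C1"]
```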

39
Example
  • d = 3
  • v(q,p1) = <1,1,0>, v(p1,s) = <1,0,2>
  • Global lookup: <1,1,0>, <1,0,2>, 3
  • Fraction is 25% × 7 = 1.75
  • Iterate through F1, and add up the contributions

Global Table
40
Cons
  • Hard to maintain if clusters start drifting
  • Hard to find a good number of clusters
  • Space/time tradeoffs
  • Needs training to construct a good dataset statistics table

41
VSol - minhash (MBKS07)
  • Solution based on minhash
  • minhash is used to
  • Estimate the size of a set |s|
  • Estimate the resemblance of two sets
  • I.e., estimate J = |s1 ∩ s2| / |s1 ∪ s2|
  • Estimate the size of the union |s1 ∪ s2|
  • Hence, estimate the size of the intersection
  • |s1 ∩ s2| = J(s1, s2) · |s1 ∪ s2|

42
Minhash
  • Given a set s = {t1, …, tm}
  • Use independent hash functions h1, …, hk
  • hi: n-gram → [0, 1]
  • Hash the elements of s, k times
  • Keep the k elements that hashed to the smallest value each time
  • We reduced set s from m to k elements
  • Denote the minhash signature with ŝ

43
How to use minhash
  • Given two signatures q̂, ŝ
  • J(q, s) ≈ Σ_{1≤i≤k} I[q̂i = ŝi] / k
  • |s| ≈ (k / Σ_{1≤i≤k} ŝi) - 1
  • The union's signature: (q ∪ s)^i = min(q̂i, ŝi)
  • Hence
  • |q ∪ s| ≈ (k / Σ_{1≤i≤k} (q ∪ s)^i) - 1
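A rough, self-contained sketch of these estimators; the hash functions are simulated with seeded PRNGs, which is illustrative only, not a proper hash family:

```python
import random

K = 64  # signature length

def h(i, x):
    # the i-th "hash function", simulated with a seeded PRNG (illustration only)
    return random.Random(f"{i}:{x}").random()

def signature(s):
    return [min(h(i, x) for x in s) for i in range(K)]

def jaccard_est(a, b):
    return sum(x == y for x, y in zip(a, b)) / K

def union_sig(a, b):
    # min per coordinate gives the signature of the union
    return [min(x, y) for x, y in zip(a, b)]

def size_est(sig):
    # |s| is roughly k / (sum of the k minima) - 1
    return K / sum(sig) - 1

q = set(range(0, 80))
s = set(range(20, 100))
sq, ss = signature(q), signature(s)
# true Jaccard is 60/100 = 0.6; true union size is 100
assert abs(jaccard_est(sq, ss) - 0.6) < 0.3
assert abs(size_est(union_sig(sq, ss)) - 100) < 50
```

With k = 64 the estimates are coarse (standard error around 1/√k); real deployments tune k against the accuracy target.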

44
VSol Estimator
  • Construct one inverted list per n-gram in D
  • The lists are our sets
  • Compute a minhash signature for each list

45
Selectivity Estimation
  • Use the edit distance length filter
  • If ed(q, s) ≤ d, then q and s share at least L = |s| - 1 - n(d-1) n-grams
  • Given query q = {t1, …, tm}
  • The answer is the size of the union of all non-empty L-intersections (binomial coefficient: m choose L)
  • We can estimate the sizes of the L-intersections using minhash signatures
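A toy sketch, assuming |s| in the slide's bound counts s's n-grams (which reproduces the d = 2, n = 3 → L = 6 example for a string with 10 n-grams), plus a brute-force union of L-intersections on made-up lists:

```python
from itertools import combinations

def count_filter_L(num_grams, n, d):
    # the slide's bound: share at least L = |s| - 1 - n*(d - 1) n-grams,
    # reading |s| as the number of n-grams of s (an assumption here)
    return num_grams - 1 - n * (d - 1)

assert count_filter_L(10, 3, 2) == 6   # matches the d = 2, n = 3 example

# brute-force union of all L-intersections on tiny made-up lists (L = 2)
lists = [{1, 2}, {2, 3}, {2, 4}, {3, 4}]
answer = set()
for combo in combinations(lists, 2):
    answer |= set.intersection(*combo)
assert answer == {2, 3, 4}   # id 1 appears in only one list, so it is pruned
```

VSol replaces this exponential enumeration with minhash-based estimates of the intersection sizes.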

46
Example
  • d = 2, n = 3 → L = 6
  • Look at all 6-intersections of the inverted lists
  • ∪ over i1, …, i6 ∈ [1, 10] of (t_i1 ∩ t_i2 ∩ … ∩ t_i6)
  • There are (10 choose 6) such terms

Inverted list
47
The m-L Similarity
  • Can be computed efficiently using minhashes
  • Answer:
  • ρ ≈ Σ_{1≤j≤k} I[∃ i1, …, iL : t̂i1[j] = … = t̂iL[j]] / k
  • A ≈ ρ · |t1 ∪ … ∪ tm|
  • Proof very similar to the proof for minhashes

48
Cons
  • Will overestimate results
  • Many L-intersections will share strings
  • Edit distance length filter is loose

49
OptEQ - wild-card n-grams (LNS07)
  • Use extended n-grams
  • Introduce the wild-card symbol "?"
  • E.g., "ab?" can be
  • aba, abb, abc, …
  • Build an extended n-gram table
  • Extract all 1-grams, 2-grams, …, n-grams
  • Generalize to extended 2-grams, …, n-grams
  • Maintain an extended n-gram/frequency hashtable

50
Example
51
Query Expansion (Replacements only)
  • Given query q = abcd
  • d = 2
  • And replacements only
  • Base strings
  • ??cd, ?b?d, ?bc?, a??d, a?c?, ab??
  • Query answer
  • S1 = {s ∈ D : s matches ??cd}, S2 = …
  • A = |S1 ∪ S2 ∪ S3 ∪ S4 ∪ S5 ∪ S6|
  • = Σ_{1≤n≤6} (-1)^{n-1} Σ |S_i1 ∩ … ∩ S_in|
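The base strings and the exact answer can be sketched on a made-up collection; the brute-force matching below stands in for the n-gram-table estimation:

```python
from itertools import combinations

def base_patterns(q, r):
    # all ways to wildcard r of q's positions (replacements only)
    pats = []
    for pos in combinations(range(len(q)), r):
        chars = list(q)
        for i in pos:
            chars[i] = "?"
        pats.append("".join(chars))
    return pats

def matches(pattern, s):
    return len(s) == len(pattern) and all(p in ("?", c) for p, c in zip(pattern, s))

pats = base_patterns("abcd", 2)
assert pats == ["??cd", "?b?d", "?bc?", "a??d", "a?c?", "ab??"]  # C(4,2) = 6

# a hypothetical tiny collection; the answer is |S1 ∪ ... ∪ S6|
D = ["abcd", "axcd", "xycd", "abzz", "wxyz"]
answer = {s for s in D if any(matches(p, s) for p in pats)}
assert answer == {"abcd", "axcd", "xycd", "abzz"}
```

OptEQ evaluates the same inclusion-exclusion sum from the extended n-gram frequency table instead of scanning D.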

52
Replacement Intersection Lattice
  • A = Σ_{1≤n≤6} (-1)^{n-1} Σ |S_i1 ∩ … ∩ S_in|
  • Need to evaluate the sizes of all 2-intersections, 3-intersections, …, 6-intersections
  • Then use the n-gram table to compute the sum A
  • Exponential number of intersections
  • But ... there is a well-defined structure

53
Replacement Lattice
  • Build the replacement lattice
  • Many intersections are empty
  • Others produce the same results
  • We need to count everything only once

[Figure: lattice levels with 2, 1, and 0 wild-cards]
54
General Formulas
  • Similar reasoning for
  • r replacements
  • d deletions
  • Other combinations difficult
  • Multiple insertions
  • Combinations of insertions/replacements
  • But we can generate the corresponding lattice
    algorithmically!
  • Expensive but possible

55
BasicEQ
  • Partition strings by length
  • Query q with length l
  • Possible matching strings have lengths in [l-d, l+d]
  • For k = l-d to l+d
  • Find all combinations of i insertions, e deletions, and r replacements with i + e + r = d and l + i - e = k
  • If (i, e, r) is a special case, use the formula
  • Else generate the lattice incrementally
  • Start from the query base strings (easy to generate)
  • Begin with 2-intersections and build from there

56
OptEQ
  • Details are cumbersome
  • Left for homework
  • Various optimizations are possible to reduce complexity

57
Cons
  • Fairly complicated implementation
  • Expensive
  • Works for small edit distance only

58
Hashed Sampling (HYKS08)
  • Used to estimate selectivity of TF/IDF, BM25,
    DICE (vector space model)
  • Main idea
  • Take a sample of the inverted index
  • But do it intelligently to improve variance

59
Example
  • Take a sample of the inverted index

60
Example (Cont.)
  • But do it intelligently to improve variance

61
Construction
  • Draw samples deterministically
  • Use a hash function h: id → [0, 100)
  • Keep the ids that hash to values smaller than s
  • Invariant
  • If a given id is sampled in one list, it will always be sampled in all other lists that contain it
  • S(q, s) can be computed directly from the sample
  • No need to store complete sets in the sample
  • No need for extra I/O to compute scores
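A minimal sketch of the consistent-sampling invariant, using a deterministic hash of string ids (the toy index is made up):

```python
import hashlib

def bucket(sid, buckets=100):
    # deterministic hash of a string id into [0, 100)
    return int(hashlib.md5(str(sid).encode()).hexdigest(), 16) % buckets

def sample_index(index, pct):
    # keep an id iff its hash is below pct -- the SAME decision in every list
    return {g: [sid for sid in ids if bucket(sid) < pct] for g, ids in index.items()}

index = {"abc": [1, 2, 3, 4], "bcd": [2, 3, 5], "cde": [3, 4, 6]}  # made up
sample = sample_index(index, 50)
# invariant: an id sampled in one list is sampled in every list containing it
sampled = {sid for ids in sample.values() for sid in ids}
for g, ids in index.items():
    for sid in ids:
        assert (sid in sample[g]) == (sid in sampled)
```

Because the decision depends only on the id, the sampled lists for any query jointly form a complete s% sample of the relevant strings.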

62
Selectivity Estimation
  • The union of arbitrary list samples is an s% sample
  • Given query q = {t1, …, tm}
  • A = As · |t1 ∪ … ∪ tm| / |ts1 ∪ … ∪ tsm|
  • As is the query answer size computed from the sample
  • The fraction is the actual scale-up factor
  • But there are duplicates in these unions!
  • We need to know
  • The number of distinct ids in t1 ∪ … ∪ tm
  • The number of distinct ids in ts1 ∪ … ∪ tsm

63
Count Distinct
  • Distinct |ts1 ∪ … ∪ tsm| is easy
  • Scan the sampled lists
  • Distinct |t1 ∪ … ∪ tm| is hard
  • Scanning the lists is the same as computing the exact answer to the query naively
  • We are lucky
  • Each list sample doubles as a k-minimum-value estimator by construction!
  • We can use the list samples to estimate the distinct |t1 ∪ … ∪ tm|

64
The k-Minimum Value Synopsis
  • Used to estimate the distinct size of arbitrary set unions (same purpose as the FM sketch)
  • Take a hash function h: N → [0, 100]
  • Hash each element of the set
  • The r-th smallest hash value is an unbiased estimator of the distinct count
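A sketch of the k-minimum-value estimate, using the common (k - 1)/v form where v is the k-th smallest hash value (details vary across KMV variants):

```python
import hashlib

def h01(x):
    # deterministic hash into [0, 1)
    return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) / 16 ** 40

def kmv_estimate(hashes, k):
    # with v the k-th smallest hash value, (k - 1) / v estimates the
    # number of distinct elements (one common form of the KMV estimator)
    v = sorted(hashes)[k - 1]
    return (k - 1) / v

hashes = [h01(i) for i in range(1000)]   # 1000 distinct elements
est = kmv_estimate(hashes, k=64)
assert 500 < est < 2000   # rough estimate of 1000
```

To estimate a union's distinct count, merge the synopses by keeping the k smallest values across all of them and apply the same formula.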

65
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

66
Future Directions
  • Result ranking
  • In practice, we need to run multiple types of searches
  • Need to identify the best results
  • Diversity of query results
  • Some queries have multiple meanings
  • E.g., Jaguar
  • Updates
  • Incremental maintenance

67
References
  • [AGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. VLDB 2006
  • [BJL09] Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu. Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009
  • [HCKS08] Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. Fast Indexes and Algorithms for Set Similarity Selection Queries. ICDE 2008
  • [HYKS08] Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB 2008
  • [JL05] Liang Jin, Chen Li. Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. VLDB 2005
  • [KSS06] Nick Koudas, Sunita Sarawagi, Divesh Srivastava. Record Linkage: Similarity Measures and Algorithms. SIGMOD 2006
  • [LLL08] Chen Li, Jiaheng Lu, Yiming Lu. Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008
  • [LNS07] Hongrae Lee, Raymond T. Ng, Kyuseok Shim. Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. VLDB 2007
  • [LWY07] Chen Li, Bin Wang, Xiaochun Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007
  • [MBKS07] Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. Estimating the Selectivity of Approximate String Queries. ACM TODS 2007
  • [XWL08] Chuan Xiao, Wei Wang, Xuemin Lin. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. PVLDB 2008
68
References
  • [XWL08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008
  • [YWL08] Xiaochun Yang, Bin Wang, Chen Li. Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. SIGMOD 2008
  • [JLV08] L. Jin, C. Li, R. Vernica. SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. VLDB Journal 2008
  • [CGK06] S. Chaudhuri, V. Ganti, R. Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006
  • [CCGX08] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin. An Efficient Filter for Approximate Membership Checking. SIGMOD 2008
  • [SK04] Sunita Sarawagi, Alok Kirpal. Efficient Set Joins on Similarity Predicates. SIGMOD 2004
  • [BK02] Jérémy Barbay, Claire Kenyon. Adaptive Intersection and t-Threshold Problems. SODA 2002
  • [CGG05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. Data Cleaning in Microsoft SQL Server 2005. SIGMOD 2005