Efficient Approximate Search on String Collections, Part II
1
Efficient Approximate Search on String Collections, Part II
  • Marios Hadjieleftheriou
  • Chen Li
2
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram Signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

3
N-Gram Signatures
  • Use string signatures that upper bound similarity
  • Use signatures as a filtering step
  • Properties
  • The signature has to be small
  • Signature verification must be fast
  • False positives are acceptable (removed during verification); false negatives are not
  • Signatures have to be indexable

4
Known signatures
  • Minhash
  • Jaccard, Edit distance
  • Prefix filter (CGK06)
  • Jaccard, Edit distance
  • PartEnum (AGK06)
  • Hamming, Jaccard, Edit distance
  • LSH (GIM99)
  • Jaccard, Edit distance
  • Mismatch filter (XWL08)
  • Edit distance

5
Prefix Filter
  • Bit vectors
  • Mismatch vector
  • s matches 6 of q's n-grams, is missing 2, and has 2 extra
  • If |s ∩ q| ≥ 6, then for every s' ⊆ s with |s'| ≥ 3, s' ∩ q ≠ ∅
  • For at least k matches, |s'| = l - k + 1

6
Using Prefixes
  • Take a random permutation of the n-gram universe
  • Take prefixes from both sets
  • |s'| = |q'| = 3: if |s ∩ q| ≥ 6 then s' ∩ q' ≠ ∅
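The prefix idea above can be sketched in a few lines, assuming unweighted n-gram sets, a hypothetical gram universe, and the standard prefix-filter bound |prefix| = |set| - T + 1:

```python
import random

def prefix(gram_set, perm, keep):
    # keep the `keep` grams that come first under the global permutation order
    return set(sorted(gram_set, key=lambda g: perm[g])[:keep])

universe = [f"g{i}" for i in range(20)]   # hypothetical n-gram universe
order = random.Random(42).sample(universe, len(universe))
perm = {g: rank for rank, g in enumerate(order)}

s = set(universe[:8])      # 8 n-grams
q = set(universe[2:10])    # shares 6 n-grams with s
T = 6                      # required number of matches
ps = prefix(s, perm, len(s) - T + 1)   # 3 grams
pq = prefix(q, perm, len(q) - T + 1)   # 3 grams
assert len(s & q) >= T
assert ps & pq  # the prefixes are guaranteed to intersect
```

The guarantee holds for any permutation: two 8-gram sets sharing at least 6 grams cannot have disjoint 3-gram prefixes under a common ordering.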

7
Prefix Filter for Weighted Sets
  • For example
  • Order n-grams by weight (new coordinate space)
  • Query: w(q ∩ s) = Σ_{i ∈ q∩s} w_i ≥ t
  • Keep a prefix s' s.t. w(s') ≥ w(s) - α
  • Best case: w((q \ q') ∩ (s \ s')) = α
  • Hence, we need w(q' ∩ s') ≥ t - α

w1 ≥ w2 ≥ … ≥ w14
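A minimal sketch of the weighted prefix construction: drop the lightest n-grams as long as the dropped weight stays within α, so w(s') ≥ w(s) - α (the grams and weights below are made up):

```python
def weighted_prefix(grams, w, alpha):
    # keep the heaviest grams so that the dropped suffix weighs at most alpha
    ordered = sorted(grams, key=lambda g: w[g])   # lightest first
    kept, dropped = [], 0.0
    for g in ordered:
        if dropped + w[g] <= alpha:
            dropped += w[g]    # safe to drop: total dropped weight stays <= alpha
        else:
            kept.append(g)
    return set(kept)

w = {"ab": 5.0, "bc": 3.0, "cd": 1.5, "de": 0.4, "ef": 0.5}   # made-up weights
alpha = 1.0
ps = weighted_prefix(set(w), w, alpha)
assert ps == {"ab", "bc", "cd"}                  # "de" and "ef" (0.9 total) dropped
assert sum(w[g] for g in set(w) - ps) <= alpha   # w(s') >= w(s) - alpha
```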
8
Prefix Filter Properties
  • The larger we make α, the smaller the prefix
  • The larger we make α, the smaller the range of thresholds we can support
  • Because t ≥ α; otherwise t - α is negative.
  • We need to pre-specify a minimum t
  • Can apply to Jaccard, Edit Distance, IDF

9
Other Signatures
  • Minhash (still to come)
  • PartEnum
  • Upper bounds Hamming distance
  • Selects multiple subsets instead of one prefix
  • Larger signature, but stronger guarantee
  • LSH
  • Probabilistic with guarantees
  • Based on hashing
  • Mismatch filter
  • Uses positional mismatching n-grams within the prefix to attain a lower bound on Edit Distance

10
Signature Indexing
  • Straightforward solution
  • Create an inverted index on signature n-grams
  • Merge inverted lists to compute signature intersections
  • For a given string q
  • Access only the lists of n-grams in q
  • Find strings s with w(q ∩ s) ≥ t - α

11
The Inverted Signature Hashtable (CCGX08)
  • Maintain a signature vector for every n-gram
  • Consider prefix signatures for simplicity
  • For each n-gram, keep a co-occurrence list: the n-grams that appear together with it in some signature
  • Hash all n-grams (h: n-gram → [0, m))
  • Convert the co-occurrence lists to bit-vectors of size m

12
Example
13
Using the Hashtable
  • Let the list of n-gram "at" correspond to bit-vector 100011
  • Then there exists a string s s.t. "at" ∈ s and s also contains some n-grams that hash to 0, 1, or 5
  • Given query q
  • Construct the query signature matrix
  • Consider only solid sub-matrices P: r ⊆ q, p ⊆ q
  • We need to look only at r ⊆ q such that w(r) ≥ t - α and w(p) ≥ t

14
Verification
  • How do we find which strings correspond to a given sub-matrix?
  • Create an inverted index on string n-grams
  • Examine only the lists of n-grams in r, and strings with w(s) ≥ t
  • Remember that r ⊆ q
  • Can be used with other signatures as well
  • Can be used with other signatures as well

15
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram Signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

16
Length Normalized Measures
  • What is normalization?
  • Normalize similarity scores by the lengths of the strings.
  • Can result in more meaningful matches.
  • Can use L0 (i.e., the length of the string), L1, L2, etc.
  • For example, L2
  • Let w2(s) = Σ_{t ∈ s} w(t)²
  • Weights can be IDF, unary, language model, etc.
  • ‖s‖2 = w2(s)^{1/2}
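A small illustration of L2-normalized scoring with unary weights; the `ngrams` helper and the example strings are assumptions, not from the deck:

```python
import math

def ngrams(s, n=3):
    # the set of character n-grams of s
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def l2(grams, w):
    # L2 length: ||s||_2 = sqrt(sum of squared weights)
    return math.sqrt(sum(w(g) ** 2 for g in grams))

def score(q, s, w=lambda g: 1.0):
    # S(q, s) = w2(q ∩ s) / (||q||_2 ||s||_2), here with unary weights
    qg, sg = ngrams(q), ngrams(s)
    return sum(w(g) ** 2 for g in qg & sg) / (l2(qg, w) * l2(sg, w))

assert abs(score("AT&T Labs", "AT&T Labs") - 1.0) < 1e-9   # exact match scores 1
assert 0.0 < score("Dark Knight", "Dark Night") < 1.0      # near match scores below 1
```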

17
The L2-Length Filter (HCKS08)
  • Why L2?
  • For almost exact matches.
  • Two strings match only if
  • They have very similar n-gram sets, and hence L2
    lengths
  • The extra n-grams have truly insignificant
    weights in aggregate (hence, resulting in similar
    L2 lengths).

18
Example
  • AT&T Labs Research → L2 = 100
  • AT&T Labs Research → L2 = 95
  • AT&T Labs → L2 = 70
  • What if Research happened to be very popular and had a small weight?
  • The Dark Knight → L2 = 75
  • Dark Night → L2 = 72

19
Why L2 (continued)
  • Tight L2-based length filtering results in very efficient pruning.
  • L2 yields scores bounded within [0, 1]
  • 1 means a truly perfect match.
  • Easier to interpret scores.
  • L0 and L1 do not have the same properties
  • Scores are bounded only by the largest string length in the database.
  • For L0, an exact match can have a smaller score than a non-exact match!

20
Example
  • q = {"ATT", "TT ", "T L", "LAB", "ABS"} → ‖q‖0 = 5
  • s1 = {"ATT"} → ‖s1‖0 = 1
  • s2 = q → ‖s2‖0 = 5
  • S(q, s1) = Σ w(q ∩ s1) / (‖q‖0 ‖s1‖0) = 10/5 = 2
  • S(q, s2) = Σ w(q ∩ s2) / (‖q‖0 ‖s2‖0) = 40/25 < 2

21
Problems
  • L2 normalization poses challenges.
  • For example
  • S(q, s) = w2(q ∩ s) / (‖q‖2 ‖s‖2)
  • The prefix filter cannot be applied.
  • What is the minimum prefix weight α?
  • Its value depends both on ‖s‖2 and ‖q‖2.
  • But ‖q‖2 is unknown at index construction time

22
Important L2 Properties
  • Length filtering
  • For S(q, s) ≥ t
  • t·‖q‖2 ≤ ‖s‖2 ≤ ‖q‖2 / t
  • We are only looking for strings within these lengths.
  • Proof in paper
  • Monotonicity
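The length-filtering bound above can be sketched as follows, with made-up L2 lengths:

```python
def length_filter(norms, qnorm, t):
    # keep only ids whose L2 length lies in [t * ||q||, ||q|| / t]
    lo, hi = t * qnorm, qnorm / t
    return {sid for sid, n in norms.items() if lo <= n <= hi}

norms = {"s1": 10.0, "s2": 7.5, "s3": 14.0, "s4": 20.0}   # made-up L2 lengths
cands = length_filter(norms, qnorm=10.0, t=0.7)
assert cands == {"s1", "s2", "s3"}   # bounds are [7.0, ~14.29]; s4 is pruned
```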

23
Monotonicity
  • Let s = {t1, t2, …, tm}.
  • Let pw(s, t) = w(t) / ‖s‖2 (the partial weight of s)
  • Then S(q, s) = Σ_{t ∈ q∩s} w(t)² / (‖q‖2 ‖s‖2) = Σ_{t ∈ q∩s} pw(s, t) · pw(q, t)
  • If pw(s, t) > pw(r, t), then
  • w(t)/‖s‖2 > w(t)/‖r‖2 → ‖s‖2 < ‖r‖2
  • Hence, for any other t'
  • w(t')/‖s‖2 > w(t')/‖r‖2 → pw(s, t') > pw(r, t')

24
Indexing
  • Use inverted lists sorted by pw()
  • pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) →
  • ‖0‖2 < ‖4‖2 < ‖1‖2 < ‖2‖2
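Building such lists can be sketched with unary weights: each n-gram's list is sorted by ascending L2 length, which is the same as descending partial weight (the strings and ids are made up):

```python
import math
from collections import defaultdict

def build_index(strings, w=lambda g: 1.0, n=3):
    grams = {sid: {s[i:i + n] for i in range(len(s) - n + 1)}
             for sid, s in strings.items()}
    norm = {sid: math.sqrt(sum(w(g) ** 2 for g in gs)) for sid, gs in grams.items()}
    index = defaultdict(list)
    for sid, gs in grams.items():
        for g in gs:
            index[g].append(sid)
    for g in index:
        # descending pw(s, g) = w(g)/||s||_2, i.e., ascending L2 length
        index[g].sort(key=lambda sid: norm[sid])
    return index, norm

strings = {0: "lucas", 1: "lucasfilm", 2: "luc"}
index, norm = build_index(strings)
assert index["luc"] == [2, 0, 1]   # shortest L2 length (highest pw) first
```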

25
L2 Length Filter
  • Given q and t, and using length filtering
  • We examine only a small fraction of the lists

26
Monotonicity
  • If string 1 has been seen already, then string 4 is not in the rest of the list

27
Other Improvements
  • Use properties of weighting scheme
  • Scan high weight lists first
  • Prune according to string length and maximum
    potential score
  • Ignore low weight lists altogether

28
Conclusion
  • Concepts can be extended easily for
  • BM25
  • Weighted Jaccard
  • DICE
  • IDF
  • Take-away message
  • Properties of the similarity/distance function can play a big role in designing very fast indexes.
  • L2 is super fast for almost exact matches

29
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram signature algorithms
  • Length-normalized measures
  • Selectivity estimation
  • Conclusion and future directions

30
The Problem
  • Estimate the number of strings with
  • Edit distance smaller than k from query q
  • Cosine similarity higher than t to query q
  • Jaccard, Hamming, etc
  • Issues
  • Estimation accuracy
  • Size of estimator
  • Cost of estimation

31
Motivation
  • Query optimization
  • Selectivity of query predicates
  • Need to support selectivity of approximate string
    predicates
  • Visualization/Querying
  • Expected result set size helps with visualization
  • Result set size important for remote query
    processing

32
Flavors
  • Edit distance
  • Based on clustering (JL05)
  • Based on min-hash (MBKS07)
  • Based on wild-card n-grams (LNS07)
  • Cosine similarity
  • Based on sampling (HYKS08)

33
Selectivity Estimation for Edit Distance
  • Problem
  • Given query string q
  • Estimate the number of strings s ∈ D
  • Such that ed(q, s) ≤ d

34
Sepia - Clustering (JL05, JLV08)
  • Partition strings using clustering
  • Enables pruning of whole clusters
  • Store per-cluster histograms
  • Number of strings within edit distance 0, 1, …, d from the cluster center
  • Compute global dataset statistics
  • Use a training query set to compute the frequency of strings within edit distance 0, 1, …, d from each query

35
Edit Vectors
  • Edit distance is not discriminative
  • Use edit vectors instead
  • 3D space vs 1D space

[Figure: cluster Ci with center pi, containing Luciano, Lucia, and Lukas; query q = Lucas; edit vectors such as <2,0,0>, <1,1,1>, <1,1,0>, with edit distances 2 and 3]
36
Visually
[Figure: per-cluster histograms and the Global Table]
37
Selectivity Estimation
  • Use the triangle inequality
  • Compute edit vector v(q,pi) for all clusters i
  • If |v(q,pi)| > ri + d, disregard cluster Ci

[Figure: query q at distance greater than ri + d from cluster center pi]
38
Selectivity Estimation
  • Use the triangle inequality
  • Compute edit vector v(q,pi) for all clusters i
  • If |v(q,pi)| > ri + d, disregard cluster Ci
  • For all entries in the frequency table
  • If |v(q,pi)| + |v(pi,s)| ≤ d, then ed(q,s) ≤ d for all such s
  • If |v(q,pi)| - |v(pi,s)| > d, ignore these strings
  • Else use the global table
  • Look up entry <v(q,pi), v(pi,s), d> in the global table
  • Use the estimated fraction of strings
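The cluster-pruning step can be sketched as follows, using a textbook Levenshtein implementation and scalar distances in place of edit vectors (the cluster contents are made up):

```python
def ed(a, b):
    # textbook dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# made-up clusters: center pi and radius ri
clusters = {"C1": ("lucas", 1), "C2": ("martin", 2)}
q, d = "lukas", 1
# triangle inequality: if ed(q, pi) > ri + d, no string in Ci can qualify
survivors = [c for c, (p, r) in clusters.items() if ed(q, p) <= r + d]
assert survivors == ["C1"]
```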

39
Example
  • d = 3
  • v(q,p1) = <1,1,0>, v(p1,s) = <1,0,2>
  • Global lookup: <1,1,0>, <1,0,2>, 3
  • Fraction is 25% × 7 = 1.75
  • Iterate through F1, and add up the contributions

Global Table
40
Cons
  • Hard to maintain if clusters start drifting
  • Hard to find a good number of clusters
  • Space/time tradeoffs
  • Needs training to construct a good dataset statistics table

41
VSol - minhash (MBKS07)
  • Solution based on minhash
  • minhash is used to
  • Estimate the size of a set |s|
  • Estimate the resemblance of two sets
  • I.e., estimate J = |s1 ∩ s2| / |s1 ∪ s2|
  • Estimate the size of the union |s1 ∪ s2|
  • Hence, estimate the size of the intersection
  • |s1 ∩ s2| = J(s1, s2) · |s1 ∪ s2|

42
Minhash
  • Given a set s = {t1, …, tm}
  • Use independent hash functions h1, …, hk
  • hi: n-gram → [0, 1]
  • Hash the elements of s, k times
  • Keep the k elements that hashed to the smallest value each time
  • We reduced set s from m to k elements
  • Denote the minhash signature with ŝ

43
How to use minhash
  • Given two signatures q̂, ŝ
  • J(q, s) ≈ Σ_{1≤i≤k} I[q̂i = ŝi] / k
  • |s| ≈ (k / Σ_{1≤i≤k} ŝi) - 1
  • The union's signature: (q ∪ s)^i = min(q̂i, ŝi)
  • Hence
  • |q ∪ s| ≈ (k / Σ_{1≤i≤k} (q ∪ s)^i) - 1
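A rough, self-contained sketch of these estimators; the hash functions are simulated with seeded PRNGs, which is illustrative only, not a proper hash family:

```python
import random

K = 64  # signature length

def h(i, x):
    # the i-th "hash function", simulated with a seeded PRNG (illustration only)
    return random.Random(f"{i}:{x}").random()

def signature(s):
    return [min(h(i, x) for x in s) for i in range(K)]

def jaccard_est(a, b):
    return sum(x == y for x, y in zip(a, b)) / K

def union_sig(a, b):
    # min per coordinate gives the signature of the union
    return [min(x, y) for x, y in zip(a, b)]

def size_est(sig):
    # |s| is roughly k / (sum of the k minima) - 1
    return K / sum(sig) - 1

q = set(range(0, 80))
s = set(range(20, 100))
sq, ss = signature(q), signature(s)
# true Jaccard is 60/100 = 0.6; true union size is 100
assert abs(jaccard_est(sq, ss) - 0.6) < 0.3
assert abs(size_est(union_sig(sq, ss)) - 100) < 50
```

With k = 64 the estimates are coarse (standard error around 1/√k); real deployments tune k against the accuracy target.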

44
VSol Estimator
  • Construct one inverted list per n-gram in D
  • The lists are our sets
  • Compute a minhash signature for each list

45
Selectivity Estimation
  • Use the edit distance length filter
  • If ed(q, s) ≤ d, then q and s share at least L = |s| - 1 - n(d-1) n-grams
  • Given query q = {t1, …, tm}
  • The answer is the size of the union of all non-empty L-intersections (binomial coefficient: m choose L)
  • We can estimate the sizes of the L-intersections using minhash signatures
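A toy sketch, assuming |s| in the slide's bound counts s's n-grams (which reproduces the d = 2, n = 3 → L = 6 example for a string with 10 n-grams), plus a brute-force union of L-intersections on made-up lists:

```python
from itertools import combinations

def count_filter_L(num_grams, n, d):
    # the slide's bound: share at least L = |s| - 1 - n*(d - 1) n-grams,
    # reading |s| as the number of n-grams of s (an assumption here)
    return num_grams - 1 - n * (d - 1)

assert count_filter_L(10, 3, 2) == 6   # matches the d = 2, n = 3 example

# brute-force union of all L-intersections on tiny made-up lists (L = 2)
lists = [{1, 2}, {2, 3}, {2, 4}, {3, 4}]
answer = set()
for combo in combinations(lists, 2):
    answer |= set.intersection(*combo)
assert answer == {2, 3, 4}   # id 1 appears in only one list, so it is pruned
```

VSol replaces this exponential enumeration with minhash-based estimates of the intersection sizes.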

46
Example
  • d = 2, n = 3 → L = 6
  • Look at all 6-intersections of the inverted lists
  • ∪ over i1, …, i6 ∈ [1, 10] of (t_i1 ∩ t_i2 ∩ … ∩ t_i6)
  • There are (10 choose 6) such terms

Inverted list
47
The m-L Similarity
  • Can be computed efficiently using minhashes
  • Answer:
  • ρ ≈ Σ_{1≤j≤k} I[∃ i1, …, iL : t̂i1[j] = … = t̂iL[j]] / k
  • A ≈ ρ · |t1 ∪ … ∪ tm|
  • Proof very similar to the proof for minhashes

48
Cons
  • Will overestimate results
  • Many L-intersections will share strings
  • Edit distance length filter is loose

49
OptEQ - wild-card n-grams (LNS07)
  • Use extended n-grams
  • Introduce the wild-card symbol "?"
  • E.g., "ab?" can be
  • aba, abb, abc, …
  • Build an extended n-gram table
  • Extract all 1-grams, 2-grams, …, n-grams
  • Generalize to extended 2-grams, …, n-grams
  • Maintain an extended n-gram/frequency hashtable

50
Example
51
Query Expansion (Replacements only)
  • Given query q = abcd
  • d = 2
  • And replacements only
  • Base strings
  • ??cd, ?b?d, ?bc?, a??d, a?c?, ab??
  • Query answer
  • S1 = {s ∈ D : s matches ??cd}, S2 = …
  • A = |S1 ∪ S2 ∪ S3 ∪ S4 ∪ S5 ∪ S6|
  • = Σ_{1≤n≤6} (-1)^{n-1} Σ |S_i1 ∩ … ∩ S_in|
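The base strings and the exact answer can be sketched on a made-up collection; the brute-force matching below stands in for the n-gram-table estimation:

```python
from itertools import combinations

def base_patterns(q, r):
    # all ways to wildcard r of q's positions (replacements only)
    pats = []
    for pos in combinations(range(len(q)), r):
        chars = list(q)
        for i in pos:
            chars[i] = "?"
        pats.append("".join(chars))
    return pats

def matches(pattern, s):
    return len(s) == len(pattern) and all(p in ("?", c) for p, c in zip(pattern, s))

pats = base_patterns("abcd", 2)
assert pats == ["??cd", "?b?d", "?bc?", "a??d", "a?c?", "ab??"]  # C(4,2) = 6

# a hypothetical tiny collection; the answer is |S1 ∪ ... ∪ S6|
D = ["abcd", "axcd", "xycd", "abzz", "wxyz"]
answer = {s for s in D if any(matches(p, s) for p in pats)}
assert answer == {"abcd", "axcd", "xycd", "abzz"}
```

OptEQ evaluates the same inclusion-exclusion sum from the extended n-gram frequency table instead of scanning D.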

52
Replacement Intersection Lattice
  • A = Σ_{1≤n≤6} (-1)^{n-1} Σ |S_i1 ∩ … ∩ S_in|
  • Need to evaluate the sizes of all 2-intersections, 3-intersections, …, 6-intersections
  • Then use the n-gram table to compute the sum A
  • Exponential number of intersections
  • But ... there is a well-defined structure

53
Replacement Lattice
  • Build the replacement lattice
  • Many intersections are empty
  • Others produce the same results
  • We need to count everything only once

[Figure: lattice levels with 2, 1, and 0 wild-cards]
54
General Formulas
  • Similar reasoning for
  • r replacements
  • d deletions
  • Other combinations difficult
  • Multiple insertions
  • Combinations of insertions/replacements
  • But we can generate the corresponding lattice
    algorithmically!
  • Expensive but possible

55
BasicEQ
  • Partition strings by length
  • Query q with length l
  • Possible matching strings have lengths in [l-d, l+d]
  • For k = l-d to l+d
  • Find all combinations of i insertions, e deletions, and r replacements with i + e + r = d and l + i - e = k
  • If (i, e, r) is a special case, use the formula
  • Else generate the lattice incrementally
  • Start from the query base strings (easy to generate)
  • Begin with 2-intersections and build from there

56
OptEQ
  • Details are cumbersome
  • Left for homework
  • Various optimizations are possible to reduce complexity

57
Cons
  • Fairly complicated implementation
  • Expensive
  • Works for small edit distance only

58
Hashed Sampling (HYKS08)
  • Used to estimate selectivity of TF/IDF, BM25,
    DICE (vector space model)
  • Main idea
  • Take a sample of the inverted index
  • But do it intelligently to improve variance

59
Example
  • Take a sample of the inverted index

60
Example (Cont.)
  • But do it intelligently to improve variance

61
Construction
  • Draw samples deterministically
  • Use a hash function h: id → [0, 100)
  • Keep the ids that hash to values smaller than s
  • Invariant
  • If a given id is sampled in one list, it will always be sampled in all other lists that contain it
  • S(q, s) can be computed directly from the sample
  • No need to store complete sets in the sample
  • No need for extra I/O to compute scores
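A minimal sketch of the consistent-sampling invariant, using a deterministic hash of string ids (the toy index is made up):

```python
import hashlib

def bucket(sid, buckets=100):
    # deterministic hash of a string id into [0, 100)
    return int(hashlib.md5(str(sid).encode()).hexdigest(), 16) % buckets

def sample_index(index, pct):
    # keep an id iff its hash is below pct -- the SAME decision in every list
    return {g: [sid for sid in ids if bucket(sid) < pct] for g, ids in index.items()}

index = {"abc": [1, 2, 3, 4], "bcd": [2, 3, 5], "cde": [3, 4, 6]}  # made up
sample = sample_index(index, 50)
# invariant: an id sampled in one list is sampled in every list containing it
sampled = {sid for ids in sample.values() for sid in ids}
for g, ids in index.items():
    for sid in ids:
        assert (sid in sample[g]) == (sid in sampled)
```

Because the decision depends only on the id, the sampled lists for any query jointly form a complete s% sample of the relevant strings.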

62
Selectivity Estimation
  • The union of arbitrary list samples is an s% sample
  • Given query q = {t1, …, tm}
  • A = As · |t1 ∪ … ∪ tm| / |ts1 ∪ … ∪ tsm|
  • As is the query answer size computed from the sample
  • The fraction is the actual scale-up factor
  • But there are duplicates in these unions!
  • We need to know
  • The number of distinct ids in t1 ∪ … ∪ tm
  • The number of distinct ids in ts1 ∪ … ∪ tsm

63
Count Distinct
  • Distinct |ts1 ∪ … ∪ tsm| is easy
  • Scan the sampled lists
  • Distinct |t1 ∪ … ∪ tm| is hard
  • Scanning the lists is the same as computing the exact answer to the query naively
  • We are lucky
  • Each list sample doubles as a k-minimum-value estimator by construction!
  • We can use the list samples to estimate the distinct |t1 ∪ … ∪ tm|

64
The k-Minimum Value Synopsis
  • Used to estimate the distinct size of arbitrary set unions (same purpose as the FM sketch)
  • Take a hash function h: N → [0, 100]
  • Hash each element of the set
  • The r-th smallest hash value is an unbiased estimator of the distinct count
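A sketch of the k-minimum-value estimate, using the common (k - 1)/v form where v is the k-th smallest hash value (details vary across KMV variants):

```python
import hashlib

def h01(x):
    # deterministic hash into [0, 1)
    return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) / 16 ** 40

def kmv_estimate(hashes, k):
    # with v the k-th smallest hash value, (k - 1) / v estimates the
    # number of distinct elements (one common form of the KMV estimator)
    v = sorted(hashes)[k - 1]
    return (k - 1) / v

hashes = [h01(i) for i in range(1000)]   # 1000 distinct elements
est = kmv_estimate(hashes, k=64)
assert 500 < est < 2000   # rough estimate of 1000
```

To estimate a union's distinct count, merge the synopses by keeping the k smallest values across all of them and apply the same formula.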

65
Outline
  • Motivation and preliminaries
  • Inverted list based algorithms
  • Gram signature algorithms
  • Length normalized algorithms
  • Selectivity estimation
  • Conclusion and future directions

66
Future Directions
  • Result ranking
  • In practice, we need to run multiple types of searches
  • Need to identify the best results
  • Diversity of query results
  • Some queries have multiple meanings
  • E.g., Jaguar
  • Updates
  • Incremental maintenance

67
References
  • [AGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. VLDB 2006
  • [BJL09] Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu. Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009
  • [HCKS08] Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. Fast Indexes and Algorithms for Set Similarity Selection Queries. ICDE 2008
  • [HYKS08] Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB 2008
  • [JL05] Liang Jin, Chen Li. Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. VLDB 2005
  • [KSS06] Nick Koudas, Sunita Sarawagi, Divesh Srivastava. Record Linkage: Similarity Measures and Algorithms. SIGMOD 2006
  • [LLL08] Chen Li, Jiaheng Lu, Yiming Lu. Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008
  • [LNS07] Hongrae Lee, Raymond T. Ng, Kyuseok Shim. Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. VLDB 2007
  • [LWY07] Chen Li, Bin Wang, Xiaochun Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007
  • [MBKS07] Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. Estimating the Selectivity of Approximate String Queries. ACM TODS 2007
  • [XWL08] Chuan Xiao, Wei Wang, Xuemin Lin. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. PVLDB 2008
68
References
  • [XWL08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008
  • [YWL08] Xiaochun Yang, Bin Wang, Chen Li. Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. SIGMOD 2008
  • [JLV08] L. Jin, C. Li, R. Vernica. SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. VLDB Journal 2008
  • [CGK06] S. Chaudhuri, V. Ganti, R. Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006
  • [CCGX08] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin. An Efficient Filter for Approximate Membership Checking. SIGMOD 2008
  • [SK04] Sunita Sarawagi, Alok Kirpal. Efficient Set Joins on Similarity Predicates. SIGMOD 2004
  • [BK02] Jérémy Barbay, Claire Kenyon. Adaptive Intersection and t-Threshold Problems. SODA 2002
  • [CGG05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. Data Cleaning in Microsoft SQL Server 2005. SIGMOD 2005