1
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
  • Alexander Behm¹, Shengyue Ji¹, Chen Li¹, Jiaheng Lu²
  • ¹University of California, Irvine
  • ²Renmin University of China

2
Overview
  • Motivation & Preliminaries
  • Approach 1: Discarding Lists
  • Approach 2: Combining Lists
  • Experiments & Conclusion

3
Motivation: Data Cleaning
(Figure: a Wikipedia caption that should clearly read "Niels Bohr")
  • Real-world data is dirty
  • Typos
  • Inconsistent representations (PO Box vs. P.O. Box)
  • Approximately check against a clean dictionary

Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
4
Motivation: Record Linkage
We want to link records belonging to the same entity.
No exact match!
The same entity may have similar representations:
Arnold Schwarzeneger versus Arnold Schwarzenegger
Forrest Whittaker versus Forest Whittacker
5
Motivation: Query Relaxation
  • Errors in queries
  • Errors in data
  • Bring query and meaningful results closer together

Actual queries gathered by Google:
http://www.google.com/jobs/britney.html
6
What is Approximate String Search?
String Collection (People): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger
Queries against the collection: find all entries similar to "Forrest Whitaker", similar to "Arnold Schwarzenegger", similar to "Brittany Spears"
  • What do we mean by "similar to"?
  • Edit Distance (sketched below)
  • Jaccard Similarity
  • Cosine Similarity
  • Dice
  • Etc.

The "similar to" predicate can help our described applications! How can we support these types of queries efficiently?
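To make the first measure concrete, here is a minimal sketch of the standard dynamic-programming edit distance (the function name and the example call are illustrative, not from the slides):

    def edit_distance(s: str, t: str) -> int:
        # Standard dynamic programming, keeping only the previous row.
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i] + [0] * len(t)
            for j, ct in enumerate(t, 1):
                cost = 0 if cs == ct else 1
                curr[j] = min(prev[j] + 1,        # deletion
                              curr[j - 1] + 1,    # insertion
                              prev[j - 1] + cost) # substitution
            prev = curr
        return prev[-1]

    # "Forest Whittacker" is within a small edit distance of "Forrest Whitaker".
    print(edit_distance("Forest Whittacker", "Forrest Whitaker"))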
7
Approximate Query Answering
Main Idea: Use q-grams as signatures for a string.
Example: sliding a window of length 2 over "irvine" gives the 2-grams ir, rv, vi, in, ne.
Intuition: Similar strings share a certain number of grams.
An inverted index on grams supports finding all data strings sharing enough grams with a query, as sketched below.
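A minimal sketch of gram extraction and the inverted index described above (real implementations typically use compact integer gram IDs and keep extra statistics; the names here are illustrative):

    from collections import defaultdict

    def grams(s: str, q: int = 2):
        # Slide a window of length q over the string.
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    def build_inverted_index(strings, q=2):
        # Map each q-gram to the list of stringIDs containing it.
        index = defaultdict(list)
        for string_id, s in enumerate(strings):
            for g in sorted(set(grams(s, q))):
                index[g].append(string_id)
        return index

    print(grams("irvine"))  # ['ir', 'rv', 'vi', 'in', 'ne']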
8
Approximate Query Example
Query "irvine", Edit Distance 1; 2-grams: ir, rv, vi, in, ne
Look up the grams in the inverted index.
(Figure: inverted lists of stringIDs for the 2-grams tf, vi, ir, ef, rv, ne, un, in)
Candidates: 1, 5, 9. These may contain false positives, so we still need to compute the real similarity.
Each edit operation can destroy at most q grams.
Answers must share at least T = 5 - 1 * 2 = 3 grams.
T-Occurrence problem: find the elements occurring at least T = 3 times among the inverted lists.
This is called list-merging; T is called the merging-threshold.
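A minimal ScanCount-style sketch of the T-occurrence (list-merging) step; the inverted lists below are illustrative stand-ins consistent with the example above:

    def merge_lists(inverted_lists, T):
        # ScanCount: count how often each stringID occurs across the lists
        # and keep the ones occurring at least T times.
        counts = {}
        for lst in inverted_lists:
            for sid in lst:
                counts[sid] = counts.get(sid, 0) + 1
        return sorted(sid for sid, c in counts.items() if c >= T)

    # Query "irvine", edit distance 1, q = 2: 5 query grams, so T = 5 - 1 * 2 = 3.
    lists = {
        'ir': [1, 3, 4, 5, 7, 9],
        'rv': [1, 2, 3, 9],
        'vi': [1, 2, 4, 5, 6],
        'in': [5, 6, 9],
        'ne': [7, 9],
    }
    print(merge_lists(lists.values(), T=3))  # [1, 5, 9]; candidates still need verification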
9
Motivation: Compression
Index-Size Estimation: each string of length |s| produces |s| - q + 1 grams, and for each gram we add one element (a 4-byte uint) to its inverted list. With ASCII encoding, the index is roughly 4x as large as the original data!
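A back-of-the-envelope sketch of this estimate (the 4-byte-per-entry and ASCII assumptions are from the slide; the string lengths are illustrative):

    def estimated_index_bytes(string_lengths, q=3, bytes_per_entry=4):
        # Each string of length s contributes (s - q + 1) grams, and every gram
        # occurrence adds one stringID (a 4-byte uint) to an inverted list.
        return sum(max(s - q + 1, 0) * bytes_per_entry for s in string_lengths)

    # 1000 strings of length 20: ~72,000 bytes of index vs. ~20,000 bytes of ASCII data,
    # i.e. close to the "4x the original data" figure above.
    print(estimated_index_bytes([20] * 1000, q=3))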
  • Inverted index can be very large compared to
    source data
  • May need to fit in memory for fast query
    processing
  • Can we compress the index to fit into a space
    budget?

10
Motivation: Related Work
  • The IR community developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting)
  • They mainly use delta representation plus packing
  • If inverted lists are in memory, these techniques always impose decompression overhead
  • Difficult to tune the compression ratio
  • How can we overcome these limitations in our setting?

11
This Paper
  • We developed two lossy compression techniques
  • We still answer queries exactly
  • The index can fit into a space budget (space constraint)
  • Queries can become faster on the compressed indexes
  • Flexibility to choose the space / time tradeoff
  • Existing list-merging algorithms can be re-used (even with compression-specific optimizations)

12
Overview
  • Motivation & Preliminaries
  • Approach 1: Discarding Lists
  • Approach 2: Combining Lists
  • Experiments & Conclusion

13
Approach 1: Discarding Lists
BEFORE: every 2-gram (tf, vi, ir, ef, rv, ne, un, in, ...) has an inverted list of stringIDs.
AFTER: the lists of several grams are discarded entirely, leaving "holes" in the index.
14
Effects on Queries
  • Need to decrease the merging-threshold T
  • Lower T → more false positives to post-process
  • If T drops to 0 we must "panic": scan the entire collection and compute the true similarities
  • Surprisingly, query processing time can decrease because there are fewer lists to consider

15
Query "shanghai", Edit Distance 1; 3-grams: sha, han, ang, ngh, gha, hai
Hole grams (lists discarded): han, ngh, hai. Regular grams: sha, ang, gha.
Merging-threshold without holes: T = #grams - ed * q = 6 - 1 * 3 = 3
Basis: each edit operation can destroy at most q = 3 grams.
Naïve new merging-threshold: T' = T - #holes = 0 → Panic!
Can we really destroy at most q = 3 non-hole grams with each edit operation?
Deleting "a" or deleting "g" destroys at most 2 non-hole grams, so one edit operation can destroy at most 2 of them here.
New merging-threshold: T' = 1. We use dynamic programming to compute this tighter T'; a simplified sketch follows.
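The paper computes the tighter threshold with dynamic programming; the following is only a simplified, conservative sketch of the underlying idea (each edit operation can destroy only the non-hole grams overlapping the edited position), not the paper's exact algorithm:

    def tighter_threshold(query, q, hole_grams, ed):
        # Gram starting at character i covers positions i .. i + q - 1.
        grams = [query[i:i + q] for i in range(len(query) - q + 1)]
        non_hole = [i for i, g in enumerate(grams) if g not in hole_grams]
        # A single edit operation destroys at most the non-hole grams
        # overlapping one character position.
        per_edit = max(
            (sum(1 for i in non_hole if i <= pos <= i + q - 1)
             for pos in range(len(query))),
            default=0,
        )
        return len(non_hole) - ed * per_edit

    # "shanghai", q = 3, one edit, with the lists of han, ngh, hai discarded:
    print(tighter_threshold("shanghai", 3, {"han", "ngh", "hai"}, ed=1))  # 1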
16
Choosing Lists to Discard
  • One extreme: the query is entirely unaffected
  • Other extreme: the query becomes a panic
  • A good choice of lists depends on the query workload
  • Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible
  • How can we make a reasonable choice efficiently?

17
Choosing Lists to Discard
Input: memory constraint, inverted lists L, query workload W
Output: lists to discard, D

DiscardLists:
  while memory constraint not satisfied:
    for each list in L:
      Δt      = estimateImpact(list, W)   # estimated change in query time
      benefit = list.size()               # memory reclaimed
    discard = choose a list based on the Δt values and benefits
    add discard to D
    remove discard from L

How can we do this efficiently? Perhaps incrementally?
Times needed: List-Merging Time, Post-Processing Time, Panic Time.
What exactly should we minimize? benefit / cost? Cost only? We could ignore benefit.
18
Choosing Lists to Discard
Estimating Query Times With Holes:
List-Merging Time: a cost function whose parameters are decided offline with linear regression
Post-Processing Time: #candidates * average compute-similarity time
Panic Time: #strings * average compute-similarity time
#candidates depends on T, the data distribution, and the number of holes
Incremental-ScanCount Algorithm
Example: before discarding a list, T = 3 and the ScanCount array over stringIDs 0-9 is
[2, 0, 3, 3, 2, 4, 0, 0, 1, 0], giving 3 candidates (stringIDs 2, 3, 5).
Discarding the list containing stringIDs {2, 3, 4, 8} decrements those counts, and the threshold drops to T - 1 = 2.
The new counts [2, 0, 2, 2, 1, 4, 0, 0, 0, 0] give 4 candidates (stringIDs 0, 2, 3, 5).
Many more ways to improve the speed of DiscardLists; this is just one example.
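A minimal sketch of the incremental ScanCount idea above, using this slide's example; the threshold drop of exactly one mirrors the example (in general the tighter threshold from the dynamic program would be used):

    def discard_and_recount(counts, discarded_list, T):
        # Instead of recomputing ScanCount from scratch, decrement the counts
        # of the stringIDs on the discarded list and lower the threshold.
        for sid in discarded_list:
            counts[sid] -= 1
        new_T = T - 1
        candidates = [sid for sid, c in enumerate(counts) if c >= new_T]
        return new_T, candidates

    counts = [2, 0, 3, 3, 2, 4, 0, 0, 1, 0]   # T = 3 -> candidates 2, 3, 5
    T, cands = discard_and_recount(counts, [2, 3, 4, 8], T=3)
    print(T, cands)                            # 2 [0, 2, 3, 5]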
19
Overview
  • Motivation & Preliminaries
  • Approach 1: Discarding Lists
  • Approach 2: Combining Lists
  • Experiments & Conclusion

20
Approach 2: Combining Lists
BEFORE: each 2-gram (tf, vi, ir, ef, rv, ne, un, in, ...) has its own inverted list of stringIDs.
AFTER: correlated grams share a single combined list (the union of their original lists).
Intuition: Combine correlated lists.
21
Effects on Queries
  • The merging-threshold T is unchanged (no new panics)
  • Lists become longer
  • More time to traverse the lists
  • More false positives

List-Merging Optimization: 3-grams sha, han, ang, ngh, gha, hai
Traverse each physical (combined) list only once; the count of a stringID on a physical list is increased by the list's reference count (refcount) instead of by 1.
(Figure: one combined list with refcount 2, another with refcount 3.)
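A small sketch of this optimization (the combined lists and names below are illustrative): when several query grams map to the same physical list, that list is traversed once and its reference count is added instead of 1.

    from collections import Counter, defaultdict

    def merge_combined_lists(query_grams, gram_to_list_id, lists, T):
        # Several grams may share one physical (combined) list.
        refcount = Counter(gram_to_list_id[g] for g in query_grams)
        counts = defaultdict(int)
        for list_id, rc in refcount.items():
            for sid in lists[list_id]:   # traverse each physical list only once
                counts[sid] += rc        # increase by refcount instead of 1
        return sorted(sid for sid, c in counts.items() if c >= T)

    # Illustrative: han, ngh, hai share the combined list "A".
    lists = {"A": [1, 3, 9], "B": [1, 2], "C": [2, 9], "D": [9]}
    gram_to_list_id = {"han": "A", "ngh": "A", "hai": "A",
                       "sha": "B", "ang": "C", "gha": "D"}
    print(merge_combined_lists(["sha", "han", "ang", "ngh", "gha", "hai"],
                               gram_to_list_id, lists, T=3))  # [1, 3, 9]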
22
Choosing Lists to Combine
  • Discovering candidate gram pairs
  • Frequent (q+1)-grams → correlated adjacent q-grams (see the sketch after this list)
  • Using Locality-Sensitive Hashing (LSH)
  • Selecting candidate pairs to combine
  • Based on the estimated cost on the query workload
  • Similar to DiscardLists
  • Uses a different incremental ScanCount algorithm
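The first discovery heuristic can be sketched directly: every (q+1)-gram contains a pair of adjacent q-grams that always co-occur, so frequent (q+1)-grams point to correlated gram pairs (the threshold and names are illustrative; the LSH-based discovery is not shown):

    from collections import Counter

    def candidate_pairs(strings, q=3, min_count=2):
        # Each (q+1)-gram g contains the adjacent q-grams g[:q] and g[1:].
        counter = Counter()
        for s in strings:
            for i in range(len(s) - q):
                g = s[i:i + q + 1]
                counter[(g[:q], g[1:])] += 1
        return [pair for pair, c in counter.items() if c >= min_count]

    print(candidate_pairs(["shanghai", "shanghainese"], q=3))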

23
Overview
  • Motivation & Preliminaries
  • Approach 1: Discarding Lists
  • Approach 2: Combining Lists
  • Experiments & Conclusion

24
Experiments
  • Datasets
  • Google WebCorpus (word grams)
  • IMDB Actors
  • Queries picked from the dataset, Zipf-distributed
  • q = 3, Edit Distance = 2
  • Overview
  • Performance of flavors of DiscardLists & CombineLists
  • Scalability with increasing index size
  • Comparison with an IR compression technique
  • Comparison with VGRAM
  • What if the workload changes from the training workload?

25
Experiments
(Charts for DiscardLists and CombineLists: runtime decreases!)
26
Experiments
Comparison with an IR compression technique
(Charts: compressed vs. uncompressed index.)
27
Experiments
Comparison with the variable-length gram technique, VGRAM
(Charts: compressed vs. uncompressed index.)
28
Future Work
  • DiscardLists, CombineLists, and IR compression could be combined
  • When considering a filter tree: global vs. local decisions
  • How to minimize the impact on performance if the workload changes

29
Conclusion
  • We developed two lossy compression techniques
  • We still answer queries exactly
  • The index can fit into a space budget (space constraint)
  • Queries can become faster on the compressed indexes
  • Flexibility to choose the space / time tradeoff
  • Existing list-merging algorithms can be re-used (even with compression-specific optimizations)

30
More Experiments
What if the workload changes from the training
workload?
31
More Experiments
What if the workload changes from the training
workload?