Title: Similarity join problem with Pass-Join-K using Hadoop

1
Similarity join problem with Pass-Join-K using
Hadoop

---BY Yu Haiyang

2
Outline

Background
The introduction of Pass-Join-K
Combining Pass-Join-K with Hadoop

3
Background

Similarity join: find all similar pairs from two sets.
Data cleaning
Query relaxation
Spell checking

"PO BOX 23, Main St." vs. "P.O. Box 23, Main St"
"information" vs. "imformation"
4
Background

How to define similarity?
Jaccard distance
Cosine distance
Edit distance

2016-6-19
http://datamining.xmu.edu.cn
5
Background

Edit distance
The minimum number of edit operations
(insertion, deletion, and substitution) to
transform one string to another.

Bod -> Body: insertion
Baby -> Body: substitution
6
Background

How does edit distance compare with the other two?
Accuracy: abcdefg vs. gfedcba
Verification time: O(m+n) -> O(mn)
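
The accuracy point can be illustrated with a minimal sketch. It assumes single characters as tokens, which the slide does not specify; the function names are hypothetical. Under that assumption both order-insensitive measures cannot tell a string from its reverse.

```python
from collections import Counter
from math import sqrt


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over the sets of characters (assumed tokens)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def cosine_distance(a: str, b: str) -> float:
    """1 - cosine similarity of the character-count vectors (assumed tokens)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

For abcdefg and gfedcba both distances come out as zero (up to floating-point rounding for the cosine), although the strings differ in almost every position. Edit distance is sensitive to character order, at the price of the O(mn) verification cost.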

7
Background

Find similar pairs
We have two string sets: one is {vldb, sigmod, ...}, the other is {pvldb, icde, ...}.
Find some candidate pairs, and then verify these pairs.

<vldb, pvldb>, <vldb, icde>, <vldb, ...>, <sigmod, pvldb>, <sigmod, icde>, ...
<vldb, pvldb>: Yes
<vldb, icde>: No
8
Background

So we have to:
Find candidate pairs. There are O(N^2) of them if we do not prune any.
Verify these pairs, at O(mn) per pair.

9
Introduction of Pass-Join-K

Some obvious pruning techniques
Length-based: threshold = 2, <ab, abcee>
Shift-based: <abcd, cdef>

a b c d
c d e f
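
The length-based check is small enough to sketch directly. This covers only the length filter, not the shift-based one, and the function name is hypothetical.

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Keep the pair only if the length gap is within the threshold.

    Every insertion or deletion changes the length by exactly one and a
    substitution leaves it unchanged, so ed(r, s) >= abs(len(r) - len(s)).
    """
    return abs(len(r) - len(s)) <= tau
```

With threshold 2, the pair <ab, abcee> has a length gap of 3 and is pruned without computing any edit distance.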
10
Introduction of Pass-Join-K

Partition-based pruning technique
Suppose the threshold tau = 2 and k = 2, and we have
the pair <abcdefghijk, abdefghk>.

abc def ghi jk
ab def gh k
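
The reasoning behind this filter is a pigeonhole argument: tau edit operations can damage at most tau of the tau + k segments, so any pair within distance tau must still share at least k intact segments. The sketch below is a loose version of that check, assuming a segment may match anywhere in s (positions are ignored); the function names are hypothetical.

```python
def count_matching_segments(segments: list[str], s: str) -> int:
    """Number of segments of r that occur somewhere in s as a substring."""
    return sum(1 for seg in segments if seg in s)


def survives_partition_filter(segments: list[str], s: str, k: int) -> bool:
    """A pair can only be within the threshold if at least k segments match."""
    return count_matching_segments(segments, s) >= k
```

For the segments abc, def, ghi, jk and s = abdefghk, only def matches, so with k = 2 the pair is pruned.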
11
Introduction of Pass-Join-K

Partition Scheme
We have seen that the longer the segments are,
the harder they are to match.
So we break the string into tau + k parts as evenly as possible: each
part has length ⌊length/(tau + k)⌋ or
⌊length/(tau + k)⌋ + 1.

12
Introduction of Pass-Join-K

Partition Scheme
So we break the string into tau + k parts as evenly as possible: each
part has length ⌊length/(tau + k)⌋ or
⌊length/(tau + k)⌋ + 1.

abc def ghi jk
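
A minimal sketch of this even partition follows. It assumes the longer parts come first, which is what the example abc def ghi jk shows, and that the string has at least tau + k characters; the function name is hypothetical.

```python
def even_partition(r: str, tau: int, k: int) -> list[str]:
    """Split r into tau + k contiguous segments whose lengths differ by at most 1.

    Assumption: the first (len(r) mod (tau + k)) segments get the extra character.
    """
    parts = tau + k
    base, extra = divmod(len(r), parts)
    segments = []
    start = 0
    for i in range(parts):
        length = base + 1 if i < extra else base
        segments.append(r[start:start + length])
        start += length
    return segments
```

For abcdefghijk with tau + k = 4, the length 11 splits as 3 + 3 + 3 + 2, which reproduces the segments on the slide.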
13
Introduction of Pass-Join-K

Partition Scheme
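r = abcdefghijk, s = abdefghk

[Figure: inverted index tree for strings of length 11 (L11). Segment nodes 1 to 4 hold abc, def, ghi, and jk, and each points to r. The substring def is shown next to the tree.]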
14
Introduction of Pass-Join-K

Substring Selection
Here we suppose tau = 3 and k = 1.

abc def ghi jk
a b d e f g h k
15
Introduction of Pass-Join-K

Substring Selection
Here we suppose tau = 3 and k = 1.

abc def ghi jk
a b d e f g h k
16
Introduction of Pass-Join-K

Substring Selection
Here we suppose tau = 3 and k = 1.

abc def ghi jk
a b d e f gh k
17
Introduction of Pass-Join-K

Substring Selection
Here we suppose tau = 3 and k = 1.

abc def ghi jk
abd efg hk
18
Introduction of Pass-Join-K

Substring Selection
Here we suppose tau = 3 and k = 1.

abc def ghi jk
a b d e f g h k
19
Introduction of Pass-Join-K

Substring Selection
So what we do is reduce the number of
substrings. For more pruning techniques, please read
our paper on Pass-Join-K.
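
One simple way to limit the substrings is a shift window: if a segment of r survives within tau edits, its copy in s can start at most tau positions away from where it starts in r. The sketch below implements only this basic window, as an assumption; it is not the tighter selection rule from the paper. The function name and the 0-based positions are hypothetical choices.

```python
def select_substrings(s: str, seg_start: int, seg_len: int, tau: int) -> list[tuple[int, str]]:
    """Substrings of s with length seg_len that start within tau of seg_start.

    Positions are 0-based. Returns (start, substring) pairs.
    """
    lo = max(0, seg_start - tau)
    hi = min(len(s) - seg_len, seg_start + tau)
    return [(p, s[p:p + seg_len]) for p in range(lo, hi + 1)]
```

For s = abdefghk and the segment jk of r (start 9, length 2) with tau = 3, only the substring hk at position 6 is selected, instead of all seven substrings of length 2.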

20
Introduction of Pass-Join-K

Verification
DP (dynamic programming)
D(m, n) = min(D(m, n-1) + 1, D(m-1, n) + 1, D(m-1, n-1) + flag),
where flag = 1 when s[m] != r[n] and flag = 0 otherwise; s and r are both
strings.
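
The recurrence can be sketched as a standard two-row dynamic program, with base cases D(i, 0) = i and D(0, j) = j. This is the plain O(mn) version without any threshold-based early termination; the function name is hypothetical.

```python
def edit_distance(s: str, r: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into r."""
    m, n = len(s), len(r)
    prev = list(range(n + 1))          # row 0: D(0, j) = j
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # column 0: D(i, 0) = i
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1,     # D(i, j-1) + 1
                          prev[j] + 1,         # D(i-1, j) + 1
                          prev[j - 1] + flag)  # D(i-1, j-1) + flag
        prev = curr
    return prev[n]
```

Only two rows are kept at a time, so memory is O(n) while the running time stays O(mn).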

21
Introduction of Pass-Join-K

Verification
Here we suppose tau = 3 and k = 1.

abc def ghi jk
def e f g h k
tau_left = 3
tau_right = 3 - 3 = 0
22
Combining Pass-Join-K with Hadoop

Inverted index tree in Hadoop
(abc,1,11,r) (def,2,11,r) (ghi,3,11,r) (jk,4,11,r)

[Figure: inverted index tree for L11. Segment nodes 1 to 4 hold abc, def, ghi, and jk, and each points to r.]
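
As a rough illustration of how such records could be produced and grouped, the sketch below uses plain Python to stand in for the map and shuffle steps. It makes no Hadoop API calls, and the function names and the grouping key (substring, segment number, length) are assumptions made for illustration.

```python
from collections import defaultdict


def index_records(string_id: str, segments: list[str]) -> list[tuple[str, int, int, str]]:
    """Emit one (segment, segmentNumber, stringLength, id) record per segment."""
    length = sum(len(seg) for seg in segments)
    return [(seg, i, length, string_id) for i, seg in enumerate(segments, start=1)]


def shuffle(records: list[tuple[str, int, int, str]]) -> dict[tuple[str, int, int], list[str]]:
    """Group ids by (substring, segmentNumber, length), as a shuffle step would."""
    groups: dict[tuple[str, int, int], list[str]] = defaultdict(list)
    for substring, seg_no, length, string_id in records:
        groups[(substring, seg_no, length)].append(string_id)
    return dict(groups)
```

Calling index_records("r", ["abc", "def", "ghi", "jk"]) yields the four records listed on the slide. Any record from another string that carries the same key lands in the same group, which is where candidate pairs come from.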
23
Combining Pass-Join-K with Hadoop

Substrings in Hadoop
Suppose tau = 3, k = 1, and s = abdefghk,
length(s) = 8. We have to generate some records
such as (a,1,5,s), (a,2,6,s), (a,3,7,s), (ab,1,8,s), ...,
(ab,1,11,s), ...

24
Combining Pass-Join-K with Hadoop

Substrings in Hadoop
Suppose tau = 3, k = 1, and s = abdefghk,
length(s) = 8. We have to generate more than
2 * tau * (tau + k) * m records, where m is the average
number of substrings per segment, such as
(a,1,5,s), (a,1,6,s), (a,1,7,s), (ab,1,8,s), ..., (ab,1,11,s), ...

25
Combining Pass-Join-K with Hadoop

Data flows in Hadoop

26
Combining Pass-Join-K with Hadoop

How to improve the performance?
We know that as k increases, the number of pairs we
need to verify decreases.
But as k increases, the number of records emitted in the Mapper phase
grows by a factor of more than (tau + k + 1)/(tau + k).

27
Combining Pass-Join-K with Hadoop

Here we have two ways to improve our algorithm.
Use a dataset in which the number of candidate pairs
is large enough, or make tau large enough.
Reduce the data generated in the
Mapper phase.

28
Combining Pass-Join-K with Hadoop

Decrease the data flows

29
Combining Pass-Join-K with Hadoop

Decrease the data flows
The inverted index record is formatted as
(substring, segmentNumber, LengthInf, Id, flag).
Each record's size is length(substring) + 4 * sizeof(int),
and the substring can sometimes be very long.
Hash(substring) -> integer, so the record size becomes
5 * sizeof(int).
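
A minimal sketch of the fixed-size record follows. The slides do not say which hash function is used, so CRC32 from the standard library stands in here as an assumption; treating Id and flag as plain integers and the name pack_record are also assumptions.

```python
import struct
import zlib


def pack_record(substring: str, segment_number: int, length_inf: int,
                string_id: int, flag: int) -> bytes:
    """Pack a record as five 4-byte integers, replacing the substring by its hash."""
    key = zlib.crc32(substring.encode("utf-8"))  # deterministic unsigned 32-bit value
    # "<I4i": little-endian, one unsigned int for the hash, four signed ints.
    return struct.pack("<I4i", key, segment_number, length_inf, string_id, flag)
```

Every packed record is 20 bytes regardless of how long the substring is. Two different substrings can collide on the same hash value; a collision can only add extra candidate pairs, which the edit-distance verification then rejects.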

30
Combining Pass-Join-K with Hadoop

Decrease the data flows
A substring generates several similar records,
such as (a,1,5,s), (a,1,6,s), (a,1,7,s).
Each substring generates tau + k similar
records, so we combine them into one, for example
(a,1,5,7,s). This reduces (tau + k) * 4 * sizeof(int)
to 5 * sizeof(int).
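
The merging step can be sketched as follows, assuming records of the form (substring, segmentNumber, length, id) and collapsing runs of consecutive lengths into a (low, high) range. The function name is hypothetical, and a gap in the lengths starts a new range rather than being papered over.

```python
from itertools import groupby


def merge_length_ranges(records):
    """Collapse records that differ only in length into (substring, seg, lo, hi, id)."""
    ordered = sorted(records, key=lambda rec: (rec[0], rec[1], rec[3], rec[2]))
    merged = []
    for (substring, seg_no, string_id), group in groupby(
            ordered, key=lambda rec: (rec[0], rec[1], rec[3])):
        lengths = [rec[2] for rec in group]  # already sorted ascending
        lo = hi = lengths[0]
        for length in lengths[1:]:
            if length <= hi + 1:
                hi = max(hi, length)         # extend the current run
            else:
                merged.append((substring, seg_no, lo, hi, string_id))
                lo = hi = length             # gap found: start a new run
        merged.append((substring, seg_no, lo, hi, string_id))
    return merged
```

Applied to (a,1,5,s), (a,1,6,s), (a,1,7,s), this returns the single record (a,1,5,7,s) from the slide.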

31
Combining Pass-Join-K with Hadoop

Decrease the data flows
So by using the two steps we have just seen, we
have reduced (length(substring) + 4 * sizeof(int)) * (tau + k)
to 5 * sizeof(int).

32
Thanks for your patience

Email: yhycai_at_gmail.com
