Title: Similarity join problem with Pass-Join-K using Hadoop
1 Similarity join problem with Pass-Join-K using Hadoop
2 Outline
- Background 
- The introduction of Pass-Join-K 
- Combining Pass-Join-K with Hadoop
3 Background
- Similarity join: find all similar pairs from two sets.
- Data cleaning
- Query relaxation
- Spellchecking
Examples: "PO BOX 23, Main St." vs. "P.O. Box 23, Main St"; "information" vs. "imformation"
4 Background
- How to define similarity? 
- Jaccard distance 
- Cosine distance 
- Edit distance
2016-6-19
http//datamining.xmu.edu.cn
4/32 
5 Background
- Edit distance
- The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another.
Bod -> Body (insertion)
Baby -> Body (substitution)
6 Background
- How does edit distance compare with the other two?
- Accuracy: e.g. abcdefg vs. gfedcba (same characters, different order)
- Verification time: O(m + n) -> O(mn)
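To illustrate the accuracy point, here is a minimal sketch, assuming Jaccard distance is computed over the set of characters of each string (the slides do not say which tokenization is used, and `jaccard_distance` is a name chosen here). The two example strings contain exactly the same characters, so a set-based measure cannot tell them apart even though the order is fully reversed.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance over character sets (the tokenization is an assumption)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Same characters in reverse order: the set-based distance is 0.
print(jaccard_distance("abcdefg", "gfedcba"))
```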
7 Background
- Find similar pairs
- We have two string sets: one is {vldb, sigmod, ...}, the other is {pvldb, icde, ...}.
- Find some candidate pairs, and then verify these pairs.
<vldb, pvldb>, <vldb, icde>, <vldb, ...>, <sigmod, pvldb>, <sigmod, icde>, ...
<vldb, pvldb>: Yes
<vldb, icde>: No
8 Background
- So we have to
- Find candidate pairs. There are O(N^2) of them if we do not prune any pairs.
- Verify these pairs, each in O(mn) time.
9 Introduction of Pass-Join-K
- Some obvious pruning techniques
- Length-based: threshold = 2, <ab, abcee>
- Shift-based: <abcd, cdef>
a b c d
c d e f
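The length-based rule can be sketched as follows. Each edit operation changes the length of a string by at most one, so a pair whose lengths differ by more than the threshold cannot be within it. The function name `passes_length_filter` is chosen here for illustration.

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Keep the pair only if the length difference does not exceed tau."""
    return abs(len(r) - len(s)) <= tau


# <ab, abcee> with threshold 2: the lengths differ by 3, so the pair is pruned.
print(passes_length_filter("ab", "abcee", 2))
```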
10 Introduction of Pass-Join-K
- Partition-based pruning technique
- We suppose the threshold tau = 2 and K = 2, and we have a pair <abcdefghijk, abdefghk>
abc def ghi jk
ab def gh k
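A minimal sketch of the pigeonhole idea behind this slide: if r is cut into tau + K disjoint segments, each edit operation can destroy at most one of them, so a string within distance tau must still contain at least K segments unchanged. The helper names are chosen here, and the check is simplified: it ignores where in s a segment occurs.

```python
def count_matching_segments(segments: list[str], s: str) -> int:
    """Count segments of r that occur somewhere in s (positions are ignored)."""
    return sum(1 for seg in segments if seg in s)


def may_be_similar(segments: list[str], s: str, k: int) -> bool:
    """Pigeonhole filter: a pair survives only if at least k segments match."""
    return count_matching_segments(segments, s) >= k


r_segments = ["abc", "def", "ghi", "jk"]  # tau + K = 4 segments of r
print(count_matching_segments(r_segments, "abdefghk"))
print(may_be_similar(r_segments, "abdefghk", k=2))
```

Only def survives in s, which is fewer than K = 2 matching segments, so the pair is pruned without computing the edit distance.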
11 Introduction of Pass-Join-K
- Partition Scheme
- We have seen that the longer the substrings are, the harder they are to match.
- So we break the string into tau + k parts, each of length floor(length/(tau + k)) or floor(length/(tau + k)) + 1.
12 Introduction of Pass-Join-K
- Partition Scheme
- So we break the string into tau + k parts, each of length floor(length/(tau + k)) or floor(length/(tau + k)) + 1.
abc def ghi jk
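The even partition can be sketched as below. Placing the longer segments first is an assumption read off the slide's example (abc def ghi jk); the function name `partition` is chosen here.

```python
def partition(text: str, num_segments: int) -> list[str]:
    """Split text into num_segments contiguous parts whose lengths differ by at most 1."""
    base, extra = divmod(len(text), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments get one additional character (assumption
        # taken from the slide's example, where the shorter segment comes last).
        length = base + 1 if i < extra else base
        segments.append(text[start:start + length])
        start += length
    return segments


# tau + k = 4 segments for a string of length 11.
print(partition("abcdefghijk", 4))
```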
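13 Introduction of Pass-Join-K
- Partition Scheme
- r = abcdefghijk, s = abdefghk
[Figure: inverted index L11 (strings of length 11) with four segment lists: 1: abc -> r, 2: def -> r, 3: ghi -> r, 4: jk -> r; the segment def is called out.]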
14 Introduction of Pass-Join-K
- Substring Selection
- Here we suppose tau = 3 and k = 1
r: abc def ghi jk
s: a b d e f g h k
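As a rough illustration of substring selection, the sketch below uses only a simple shift-based window: a segment that starts at position p in r can only match a substring of s that starts within tau positions of p. This is a simplification, and Pass-Join-K's actual selection rules prune more than this; `select_substrings` and its parameters are names chosen here.

```python
def select_substrings(s: str, seg_start: int, seg_len: int, tau: int) -> list[tuple[int, str]]:
    """Substrings of s with length seg_len whose start is within tau of seg_start."""
    first = max(0, seg_start - tau)
    last = min(len(s) - seg_len, seg_start + tau)
    return [(p, s[p:p + seg_len]) for p in range(first, last + 1)]


# Segment "def" of r = abcdefghijk starts at index 3 (0-based) and has length 3.
for start, sub in select_substrings("abdefghk", seg_start=3, seg_len=3, tau=3):
    print(start, sub)
```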
16 Introduction of Pass-Join-K
- Substring Selection
- Here we suppose tau = 3 and k = 1
r: abc def ghi jk
s: a b d e f gh k
17 Introduction of Pass-Join-K
- Substring Selection
- Here we suppose tau = 3 and k = 1
r: abc def ghi jk
s: abd efg hk
19 Introduction of Pass-Join-K
- Substring Selection
- So what we do is reduce the number of substrings. For more pruning techniques, please read our paper on Pass-Join-K.
20 Introduction of Pass-Join-K
- Verification
- DP (dynamic programming)
- $D(m,n) = \min(D(m,n-1)+1,\ D(m-1,n)+1,\ D(m-1,n-1)+\mathrm{flag})$, where $\mathrm{flag} = 1$ when $s_m \neq r_n$ and $0$ otherwise; $s$ and $r$ are both strings.
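A minimal sketch of the verification step, using the standard minimum-based Levenshtein recurrence with a full (m + 1) x (n + 1) table. It is a plain textbook version, not the optimized verification from the paper; the function name `edit_distance` is chosen here.

```python
def edit_distance(s: str, r: str) -> int:
    """Levenshtein distance via the standard DP recurrence, in O(mn) time."""
    m, n = len(s), len(r)
    # D[i][j] is the edit distance between s[:i] and r[:j].
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1, D[i - 1][j] + 1, D[i - 1][j - 1] + flag)
    return D[m][n]


print(edit_distance("Bod", "Body"))
print(edit_distance("abcdefghijk", "abdefghk"))
```

For the running example the distance is 3 (delete c, i and j), so the pair passes a threshold of tau = 3 but not tau = 2.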
21 Introduction of Pass-Join-K
- Verification
- Here we suppose tau = 3 and k = 1
abc def ghi jk
def e f g h k
Tau_left = 3
Tau_right = 3 - 3 = 0
22 Combining Pass-Join-K with Hadoop
- Inverted index tree in Hadoop
- (abc, 1, 11, r), (def, 2, 11, r), (ghi, 3, 11, r), (jk, 4, 11, r)
[Figure: inverted index L11 with four segment lists: 1: abc -> r, 2: def -> r, 3: ghi -> r, 4: jk -> r]
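The index records listed on this slide can be produced by a small helper like the one below. It is a plain Python sketch of what a map function might emit, not Hadoop code; `index_records` and its arguments are hypothetical names.

```python
def index_records(string_id: str, segments: list[str]) -> list[tuple[str, int, int, str]]:
    """Emit one (segment, segmentNumber, stringLength, id) record per segment."""
    total_length = sum(len(seg) for seg in segments)
    return [(seg, i, total_length, string_id) for i, seg in enumerate(segments, start=1)]


for record in index_records("r", ["abc", "def", "ghi", "jk"]):
    print(record)
```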
23 Combining Pass-Join-K with Hadoop
- Substrings in Hadoop
- Suppose tau = 3, k = 1, and s = abdefghk, length(s) = 8. We have to generate records such as (a,1,5,s), (a,2,6,s), (a,3,7,s), (ab,1,8,s), ..., (ab,1,11,s), ...
24 Combining Pass-Join-K with Hadoop
- Substrings in Hadoop
- Suppose tau = 3, k = 1, and s = abdefghk, length(s) = 8. We have to generate more than 2 * tau * (tau + k) * m records, where m is the average number of substrings selected for each segment, such as (a,1,5,s), (a,1,6,s), (a,1,7,s), (ab,1,8,s), ..., (ab,1,11,s), ...
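How such records might meet in the reduce phase can be simulated in memory. The sketch below groups index records (from r) and substring records (from s) by the key (substring, segmentNumber, length) and pairs up the ids that share a key. It only mimics the shuffle and reduce steps in plain Python and does not use the Hadoop API; the probe records are a hand-picked illustrative subset, and `candidate_pairs` is a name chosen here.

```python
from collections import defaultdict


def candidate_pairs(index_records, probe_records):
    """Group records by (substring, segmentNumber, length) and pair up the ids."""
    groups = defaultdict(lambda: {"index": set(), "probe": set()})
    for sub, seg_no, length, string_id in index_records:
        groups[(sub, seg_no, length)]["index"].add(string_id)
    for sub, seg_no, length, string_id in probe_records:
        groups[(sub, seg_no, length)]["probe"].add(string_id)
    pairs = set()
    for group in groups.values():
        # Mimics a reducer: every indexed id meets every probing id under the same key.
        for r_id in group["index"]:
            for s_id in group["probe"]:
                pairs.add((r_id, s_id))
    return pairs


index = [("abc", 1, 11, "r"), ("def", 2, 11, "r"), ("ghi", 3, 11, "r"), ("jk", 4, 11, "r")]
probe = [("ab", 1, 11, "s"), ("def", 2, 11, "s"), ("gh", 3, 11, "s")]  # illustrative subset
print(candidate_pairs(index, probe))
```

With k = 1 a single shared key is enough to make (r, s) a candidate, which is then passed on to verification.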
26 Combining Pass-Join-K with Hadoop
- How to improve the performance?
- We know that as k increases, the number of pairs we need to verify decreases.
- As k increases, more than (tau + k + 1)/(tau + k) times as many records must be transferred in the Mapper phase.
27 Combining Pass-Join-K with Hadoop
- Here we have 2 ways to improve our algorithm.
- Find a dataset in which the number of candidate pairs is large enough, or make tau large enough.
- Decrease the data generated in the Mapper phase.
29 Combining Pass-Join-K with Hadoop
- Decrease the data flow
- The inverted index record is formatted as (substring, segmentNumber, LengthInf, Id, flag).
- Each record's length is length(substring) + 4 * sizeof(int), and the substring can sometimes be very long.
- Hash(substring) -> integer, so the record length becomes 5 * sizeof(int).
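A minimal sketch of the hashing step, assuming 4-byte integers and CRC32 as the hash function (the slides do not name one, so `zlib.crc32` is a stand-in; `pack_record` and the sample field values are hypothetical). Distinct substrings can collide under any hash, which would only add extra candidate pairs for verification to reject.

```python
import struct
import zlib


def pack_record(substring: str, segment_number: int, length_inf: int,
                string_id: int, flag: int) -> bytes:
    """Replace the substring by a 32-bit hash and pack the record as 5 integers."""
    substring_hash = zlib.crc32(substring.encode("utf-8"))  # unsigned 32-bit value
    return struct.pack("<Iiiii", substring_hash, segment_number, length_inf, string_id, flag)


short = pack_record("def", 2, 11, 7, 0)
long_ = pack_record("a" * 1000, 2, 11, 7, 0)
# Both records have the same fixed size, whatever the substring length.
print(len(short), len(long_))
```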
30 Combining Pass-Join-K with Hadoop
- Decrease the data flow
- A substring generates several similar records, such as (a,1,5,s), (a,1,6,s), (a,1,7,s).
- Each substring generates tau + k similar records, so we combine them into one, for example (a,1,5,7,s). This reduces (tau + k) * 4 * sizeof(int) to 5 * sizeof(int).
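Combining the similar records can be sketched as follows: records that differ only in the length field are merged into one record carrying a length range. Handling gaps by emitting one record per consecutive run is an assumption made here, and `combine_lengths` is a hypothetical name.

```python
from collections import defaultdict


def combine_lengths(records):
    """Merge (substring, segNo, length, id) records that differ only in length
    into (substring, segNo, minLength, maxLength, id), one per consecutive run."""
    lengths_by_key = defaultdict(list)
    for sub, seg_no, length, string_id in records:
        lengths_by_key[(sub, seg_no, string_id)].append(length)

    combined = []
    for (sub, seg_no, string_id), lengths in lengths_by_key.items():
        lengths = sorted(set(lengths))
        run_start = prev = lengths[0]
        for length in lengths[1:]:
            if length != prev + 1:  # gap: close the current run
                combined.append((sub, seg_no, run_start, prev, string_id))
                run_start = length
            prev = length
        combined.append((sub, seg_no, run_start, prev, string_id))
    return combined


print(combine_lengths([("a", 1, 5, "s"), ("a", 1, 6, "s"), ("a", 1, 7, "s")]))
```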
31 Combining Pass-Join-K with Hadoop
- Decrease the data flow
- So by using the two steps we have seen, we reduce (length(substring) + 4 * sizeof(int)) * (tau + k) to 5 * sizeof(int).
32 Thanks for your patience
- Email: yhycai_at_gmail.com