Similarity join problem with Pass-Join-K using Hadoop


Title: Similarity join problem with Pass-Join-K using Hadoop


1
Similarity join problem with Pass-Join-K using Hadoop
  • By Yu Haiyang

2
Outline
  • Background
  • The introduction of Pass-Join-K
  • Combining Pass-Join-K with Hadoop

3
Background
  • Similarity join: find all similar pairs from two
    sets.
  • Data cleaning
  • Query relaxation
  • Spellchecking

Examples: "PO BOX 23, Main St." vs. "P.O. Box 23, Main St"; "information" vs. "imformation"
4
Background
  • How to define similarity?
  • Jaccard distance
  • Cosine distance
  • Edit distance

2016-6-19
http://datamining.xmu.edu.cn
5
Background
  • Edit distance
  • The minimum number of edit operations
    (insertion, deletion, and substitution) to
    transform one string to another.

Bod -> Body (insertion)
Baby -> Body (substitution)
6
Background
  • How does edit distance compare with the other
    two?
  • Accuracy: edit distance is sensitive to character order, e.g. abcdefg vs. gfedcba
  • Verification time: O(m+n) -> O(mn)
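The accuracy point can be illustrated with a small sketch. It assumes that the Jaccard distance is taken over the sets of characters of each string (the slide does not say how strings are tokenized) and uses the standard Levenshtein recurrence; the function names are illustrative only.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the character sets of two strings (assumed tokenization)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling row: O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Same characters, reversed order: the set-based measure sees no difference,
# while the edit distance does.
print(jaccard_distance("abcdefg", "gfedcba"))
print(edit_distance("abcdefg", "gfedcba"))
```

Under this tokenization the two strings are indistinguishable to Jaccard, which is the kind of case where the more expensive edit distance pays off.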

7
Background
  • Find similar pairs
  • We have two string sets: one is {vldb, sigmod, ...},
    the other is {pvldb, icde, ...}.
  • Find some candidate pairs, and then verify these
    pairs.

Candidate pairs: <vldb, pvldb>, <vldb, icde>, <vldb, ...>, <sigmod, pvldb>, <sigmod, icde>, ...
<vldb, pvldb>: Yes
<vldb, icde>: No
8
Background
  • So we have to:
  • Find candidate pairs. There are O(N^2) of them if we do
    not prune any.
  • Verify these pairs, at O(mn) per pair.
9
Introduction of Pass-Join-K
  • Some obvious pruning techniques
  • Length-based: threshold = 2, <ab, abcee> (the lengths differ by 3 > 2, so the pair is pruned)
  • Shift-based: <abcd, cdef>

a b c d
c d e f
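A minimal sketch of the length-based check, assuming the threshold is an edit-distance bound tau. The function name is hypothetical; the shift-based check is not covered here.

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Return True if the pair survives the length filter.

    Each edit operation changes the length by at most one, so two strings
    whose lengths differ by more than tau cannot be within edit distance tau.
    """
    return abs(len(r) - len(s)) <= tau


print(passes_length_filter("ab", "abcee", 2))    # lengths 2 and 5
print(passes_length_filter("vldb", "pvldb", 2))  # lengths 4 and 5
```

The check costs O(1) per pair, so it is usually applied before any substring matching or verification.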
10
Introduction of Pass-Join-K
  • Partition-based pruning technique
  • We suppose the threshold tau = 2 and k = 2, and we have
    a pair <abcdefghijk, abdefghk>

abc def ghi jk
ab def gh k
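The example above can be checked with a short sketch. It relies on the usual pigeonhole argument, stated here as an assumption rather than quoted from the slides: each edit operation can damage at most one segment, so if r is split into tau+k segments and the pair is within tau edits, at least k segments of r must still occur in s. The sketch tests plain substring containment and ignores positions, so it is a weaker filter than a position-aware one; the function names are illustrative.

```python
from typing import Sequence


def count_matching_segments(segments: Sequence[str], s: str) -> int:
    """Number of segments of r that occur somewhere in s as a substring."""
    return sum(1 for seg in segments if seg in s)


def passes_partition_filter(segments: Sequence[str], s: str, k: int) -> bool:
    """Keep the pair only if at least k segments of r occur in s."""
    return count_matching_segments(segments, s) >= k


# tau = 2, k = 2: r = abcdefghijk is split into tau + k = 4 segments.
r_segments = ["abc", "def", "ghi", "jk"]
s = "abdefghk"
print(count_matching_segments(r_segments, s))       # only "def" occurs in s
print(passes_partition_filter(r_segments, s, k=2))  # fewer than k matches
```

Here only one segment matches, so with k = 2 the pair is discarded without computing the edit distance.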
11
Introduction of Pass-Join-K
  • Partition Scheme
  • We have seen that the longer the substrings are,
    the harder they are to match.
  • So we break the string into tau+k parts, each of
    length ⌊length/(tau+k)⌋ or
    ⌊length/(tau+k)⌋+1.

12
Introduction of Pass-Join-K
  • Partition Scheme
  • So we break the string into tau+k parts, each of
    length ⌊length/(tau+k)⌋ or
    ⌊length/(tau+k)⌋+1.

abc def ghi jk
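The partition scheme can be sketched as follows. Placing the longer segments first is an assumption chosen so that the output matches the "abc def ghi jk" example; the slides do not state the order explicitly, and the function name is illustrative.

```python
from typing import List


def even_partition(text: str, parts: int) -> List[str]:
    """Split text into `parts` consecutive segments whose lengths differ by at most one.

    Assumption: the first (len(text) % parts) segments get the extra character.
    If len(text) < parts, some trailing segments are empty.
    """
    base, extra = divmod(len(text), parts)
    segments = []
    start = 0
    for i in range(parts):
        seg_len = base + 1 if i < extra else base
        segments.append(text[start:start + seg_len])
        start += seg_len
    return segments


# tau + k = 4 segments for a string of length 11: lengths 3, 3, 3, 2.
print(even_partition("abcdefghijk", 4))
```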
13
Introduction of Pass-Join-K
  • Partition Scheme
  • r = abcdefghijk, s = abdefghk

[Figure: inverted index L11 with entries 1: abc, 2: def, 3: ghi, 4: jk, each pointing to r; label: def]
14
Introduction of Pass-Join-K
  • Substring Selection
  • Here we suppose tau = 3 and k = 1

abc def ghi jk
a b d e f g h k
15
Introduction of Pass-Join-K
  • Substring Selection
  • Here we suppose tau = 3 and k = 1

abc def ghi jk
a b d e f g h k
16
Introduction of Pass-Join-K
  • Substring Selection
  • Here we suppose tau = 3 and k = 1

abc def ghi jk
a b d e f gh k
17
Introduction of Pass-Join-K
  • Substring Selection
  • Here we suppose tau = 3 and k = 1

abc def ghi jk
abd efg hk
18
Introduction of Pass-Join-K
  • Substring Selection
  • Here we suppose tau = 3 and k = 1

abc def ghi jk
a b d e f g h k
19
Introduction of Pass-Join-K
  • Substring Selection
  • So what we do is reduce the number of
    substrings. For more pruning techniques, please read
    our paper on Pass-Join-K.
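One simple way to reduce the number of substrings is a shift window, sketched below. This is an assumption for illustration and not necessarily the selection rule used in Pass-Join-K: if a segment of r starting at position p survives at most tau edits, its copy in s can start no further than tau positions away from p. Positions are 0-based and the function name is hypothetical.

```python
from typing import List, Tuple


def select_substrings(s: str, seg_start: int, seg_len: int, tau: int) -> List[Tuple[int, str]]:
    """Substrings of s with length seg_len whose start lies within tau of seg_start."""
    lo = max(0, seg_start - tau)
    hi = min(len(s) - seg_len, seg_start + tau)
    # range() is empty when hi < lo, i.e. when no position qualifies.
    return [(p, s[p:p + seg_len]) for p in range(lo, hi + 1)]


s = "abdefghk"
# Segments of r = abc|def|ghi|jk start at 0, 3, 6, 9 (tau = 3, k = 1).
print(select_substrings(s, seg_start=3, seg_len=3, tau=3))  # candidates for "def"
print(select_substrings(s, seg_start=9, seg_len=2, tau=3))  # candidates for "jk"
```

For the last segment only one of the seven length-2 substrings of s remains, which is the kind of reduction the slide refers to.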

20
Introduction of Pass-Join-K
  • Verification
  • DP (dynamic programming)
  • $D(m,n) = \min\bigl(D(m,n-1)+1,\ D(m-1,n)+1,\ D(m-1,n-1)+\mathrm{flag}\bigr)$,
    where flag = 1 when $s_m \neq r_n$ and 0 otherwise; s and r are
    the two strings.
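The recurrence can be turned into a threshold check as in the sketch below. The early exit is an addition for illustration, not something stated on the slide: the minimum of a DP row never decreases from one row to the next, so once every entry of a row exceeds tau the pair can be rejected. The function name is hypothetical.

```python
def within_edit_distance(r: str, s: str, tau: int) -> bool:
    """Return True if the edit distance between r and s is at most tau."""
    if abs(len(r) - len(s)) > tau:
        return False
    # prev[j] holds D(i-1, j); the first row is D(0, j) = j.
    prev = list(range(len(s) + 1))
    for i in range(1, len(r) + 1):
        curr = [i]  # D(i, 0) = i
        for j in range(1, len(s) + 1):
            flag = 0 if r[i - 1] == s[j - 1] else 1
            curr.append(min(curr[j - 1] + 1,      # D(i, j-1) + 1
                            prev[j] + 1,          # D(i-1, j) + 1
                            prev[j - 1] + flag))  # D(i-1, j-1) + flag
        if min(curr) > tau:
            return False  # no later row can get back under the threshold
        prev = curr
    return prev[-1] <= tau


print(within_edit_distance("vldb", "pvldb", 1))
print(within_edit_distance("vldb", "icde", 2))
print(within_edit_distance("abcdefghijk", "abdefghk", 3))
```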

21
Introduction of Pass-Join-K
  • Verification
  • Here we suppose tau = 3 and k = 1

abc def ghi jk
def e f g h k
tau_left = 3
tau_right = 3 - 3 = 0
22
Combining Pass-Join-K with Hadoop
  • Inverted index tree in Hadoop
  • (abc,1,11,r), (def,2,11,r), (ghi,3,11,r),
    (jk,4,11,r)

[Figure: inverted index L11 with entries 1: abc, 2: def, 3: ghi, 4: jk, each pointing to r]
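A possible way to produce these index records is sketched below. The tuple layout (segment, segmentNumber, length, id) follows the records on the slide; the function name and the choice to place longer segments first are assumptions made so that the output reproduces the example.

```python
from typing import List, Tuple


def index_records(string_id: str, text: str, tau: int, k: int) -> List[Tuple[str, int, int, str]]:
    """Emit one (segment, segment_number, length, id) record per segment of text."""
    parts = tau + k
    base, extra = divmod(len(text), parts)
    records = []
    start = 0
    for seg_no in range(1, parts + 1):
        # Assumption: the first `extra` segments are one character longer.
        seg_len = base + 1 if seg_no <= extra else base
        records.append((text[start:start + seg_len], seg_no, len(text), string_id))
        start += seg_len
    return records


for record in index_records("r", "abcdefghijk", tau=3, k=1):
    print(record)
```

In a MapReduce job, a mapper could emit the first three fields as the key, so that index records and probing substrings with the same key meet in the same reducer; that keying is also an assumption here.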
23
Combining Pass-Join-K with Hadoop
  • Substrings in Hadoop
  • Suppose tau = 3, k = 1, and s = abdefghk,
    length(s) = 8. We have to generate records
    such as (a,1,5,s), (a,2,6,s), (a,3,7,s), (ab,1,8,s), ...,
    (ab,1,11,s), ...

24
Combining Pass-Join-K with Hadoop
  • Substrings in Hadoop
  • Suppose tau = 3, k = 1, and s = abdefghk,
    length(s) = 8. We have to generate more than
    2*tau*(tau+k)*m records, where m is the average
    number of substrings per segment, such as
    (a,1,5,s), (a,1,6,s), (a,1,7,s), (ab,1,8,s), ..., (ab,1,11,s), ...
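The record generation can be sketched as below, under several assumptions that are not spelled out on the slides: s probes every target length from length(s)-tau to length(s)+tau, each target length is partitioned evenly into tau+k segments, and substrings are selected with a plain shift window of tau. The sample records on this slide are consistent with a layout in which the shorter segments come first, whereas the "abc def ghi jk" example places longer segments first, so the hypothetical `long_first` flag exposes both layouts.

```python
from typing import List, Tuple


def segment_layout(length: int, parts: int, long_first: bool = True) -> List[Tuple[int, int]]:
    """(start, segment_length) pairs of an even partition of a string of `length`."""
    base, extra = divmod(length, parts)
    layout, start = [], 0
    for i in range(parts):
        longer = i < extra if long_first else i >= parts - extra
        seg_len = base + 1 if longer else base
        layout.append((start, seg_len))
        start += seg_len
    return layout


def probe_records(string_id: str, s: str, tau: int, k: int,
                  long_first: bool = True) -> List[Tuple[str, int, int, str]]:
    """Emit (substring, segment_number, target_length, id) records for s."""
    records = []
    for target_len in range(max(1, len(s) - tau), len(s) + tau + 1):
        layout = segment_layout(target_len, tau + k, long_first)
        for seg_no, (seg_start, seg_len) in enumerate(layout, start=1):
            if seg_len == 0:
                continue  # target length shorter than the number of segments
            lo = max(0, seg_start - tau)
            hi = min(len(s) - seg_len, seg_start + tau)
            for p in range(lo, hi + 1):
                records.append((s[p:p + seg_len], seg_no, target_len, string_id))
    return records


short_first = probe_records("s", "abdefghk", tau=3, k=1, long_first=False)
print(("a", 1, 5, "s") in short_first, ("ab", 1, 11, "s") in short_first)

long_first = probe_records("s", "abdefghk", tau=3, k=1, long_first=True)
print(("def", 2, 11, "s") in long_first)  # would meet the index record (def,2,11,r)
```

The number of records grows with both the number of target lengths and the number of segments, which is why the later slides focus on shrinking what the mappers emit.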

25
Combining Pass-Join-K with Hadoop
  • Data flows in Hadoop

26
Combining Pass-Join-K with Hadoop
  • How to improve the performance ?
  • We know that as k increases, the number of pairs we
    need to verify decreases.
  • As k increases, the number of records that must be
    transferred in the Mapper phase grows by a factor of
    more than (tau+k+1)/(tau+k).

27
Combining Pass-Join-K with Hadoop
  • Here we have 2 ways to improve our algorithm.
  • Find a dataset in which the number of candidate pairs
    is large enough, or make tau large enough.
  • Decrease the amount of data generated in the
    Mapper phase.

28
Combining Pass-Join-K with Hadoop
  • Decrease the data flows

29
Combining Pass-Join-K with Hadoop
  • Decrease the data flows
  • The inverted index record is formulated as
    (substring, segmentNumber, LengthInf, Id, flag)
  • Each record's length is length(substring) + 4*sizeof(int),
    and the substring can sometimes be very long.
  • Hash(substring) -> integer, then the record length is
    5*sizeof(int)
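A minimal sketch of the hashing step follows. The slides do not name a hash function; `zlib.crc32` is used here only as an example of a hash that is stable across processes (Python's built-in `hash` for strings is randomized per process by default, which would not work across mappers and reducers). It also assumes that Id and flag are already integers, and the function names are illustrative.

```python
import zlib
from typing import Tuple


def hash_key(substring: str) -> int:
    """Map a substring to an unsigned 32-bit integer."""
    return zlib.crc32(substring.encode("utf-8"))


def compact_record(record: Tuple[str, int, int, int, int]) -> Tuple[int, int, int, int, int]:
    """Replace the substring of (substring, segmentNumber, LengthInf, Id, flag) by its hash."""
    substring, seg_no, length_inf, string_id, flag = record
    return (hash_key(substring), seg_no, length_inf, string_id, flag)


print(compact_record(("abc", 1, 11, 7, 0)))
```

Two different substrings can collide on the same integer. As long as the final verification compares the original strings, a collision only adds a candidate pair that is later rejected.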

30
Combining Pass-Join-K with Hadoop
  • Decrease the data flows
  • A substring generates several similar records,
    such as (a,1,5,s), (a,1,6,s), (a,1,7,s)
  • Each substring generates tau+k similar
    records, so we combine them as, for example,
    (a,1,5,7,s). This reduces (tau+k)*4*sizeof(int)
    to 5*sizeof(int).
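The combining step can be sketched as follows: records that differ only in the length field are merged into one record carrying a length range. The function name is hypothetical, and splitting non-consecutive lengths into separate ranges is a choice made for this sketch.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def combine_length_ranges(
    records: Iterable[Tuple[str, int, int, str]]
) -> List[Tuple[str, int, int, int, str]]:
    """Merge (substring, seg_no, length, id) records with consecutive lengths
    into (substring, seg_no, min_length, max_length, id) records."""
    lengths_by_key = defaultdict(list)
    for substring, seg_no, length, string_id in records:
        lengths_by_key[(substring, seg_no, string_id)].append(length)

    combined = []
    for (substring, seg_no, string_id), lengths in lengths_by_key.items():
        lengths = sorted(set(lengths))
        run_start = prev = lengths[0]
        for length in lengths[1:]:
            if length != prev + 1:  # gap: close the current run
                combined.append((substring, seg_no, run_start, prev, string_id))
                run_start = length
            prev = length
        combined.append((substring, seg_no, run_start, prev, string_id))
    return combined


print(combine_length_ranges([("a", 1, 5, "s"), ("a", 1, 6, "s"), ("a", 1, 7, "s")]))
```

On the receiving side, a range record (a,1,5,7,s) has to be matched against index entries of lengths 5, 6 and 7, so the saving in transferred data is traded for a little extra work when probing.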

31
Combining Pass-Join-K with Hadoop
  • Decrease the data flows
  • So by using the two steps we have just seen, we
    have reduced (length(substring) + 4*sizeof(int)) *
    (tau+k) to 5*sizeof(int)

32
Thanks for your patience
  • Email: yhycai_at_gmail.com