Two-pass algorithms based on hashing presentation

1
Two-pass algorithms based on hashing

Main idea
Instead of sorted sublists, create partitions,
based on hashing.
The second pass creates the result from the partitions
using one-pass algorithms.

2
Creating partitions

Partitions (buckets) are created based on all
attributes of the relation except for grouping
and join, where the partitions are based on the
grouping and join-attributes respectively.
Why bucketize? Tuples with matching values end
up in the same bucket.
Initialize M-1 buckets using M-1 empty buffers
FOR each block b of relation R DO
    read block b into the M-th buffer
    FOR each tuple t in b DO
        IF the buffer for bucket h(t) has no room for t THEN
            copy the buffer to disk
            initialize a new empty block in that buffer
        ENDIF
        copy t to the buffer for bucket h(t)
    ENDFOR
ENDFOR
FOR each bucket DO
    IF the buffer for this bucket is not empty THEN
        write the buffer to disk
    ENDIF
ENDFOR
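
The partitioning loop above can be sketched in Python as follows. This is a minimal illustration, not a storage-engine implementation: the names `partition`, `block_size`, and `key` are assumptions introduced here, "disk" is simulated by an in-memory dictionary, and Python's built-in `hash` stands in for the hash function h.

```python
from collections import defaultdict

def partition(blocks, num_buffers, block_size, key=lambda t: t):
    """Hash-partition a relation into num_buffers - 1 buckets.

    blocks      -- iterable of blocks, each a list of tuples
    num_buffers -- M, the number of memory buffers available
    block_size  -- tuples per block (hypothetical parameter)
    key         -- extracts the attributes to hash on (hypothetical)
    Returns a dict: bucket number -> list of blocks written to "disk".
    """
    num_buckets = num_buffers - 1
    buffers = [[] for _ in range(num_buckets)]  # one output buffer per bucket
    disk = defaultdict(list)                    # stand-in for bucket files

    for block in blocks:                        # block sits in the M-th buffer
        for t in block:
            i = hash(key(t)) % num_buckets
            if len(buffers[i]) == block_size:   # no room: flush, start a new block
                disk[i].append(buffers[i])
                buffers[i] = []
            buffers[i].append(t)                # t is copied in either case

    for i, buf in enumerate(buffers):           # write out partially filled buffers
        if buf:
            disk[i].append(buf)
    return disk
```

Note that the tuple is appended whether or not a flush happened; only the flush itself is conditional.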

3
Partition hash-join

Pass 1: create partitions R_1, ..., R_{M-1} of R and
S_1, ..., S_{M-1} of S, based on the join attributes
(the same hash function for both R and S)
Pass 2: for each pair R_i, S_i
compute R_i ⋈ S_i
using the one-pass method.
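
As a rough sketch of the two passes, assuming both relations are plain Python lists of tuples and ignoring block-level I/O entirely. The function names and the `r_key`/`s_key` parameters are hypothetical, chosen for this illustration.

```python
from collections import defaultdict

def hash_partition(tuples, num_buckets, key):
    """Pass 1 helper: distribute tuples over buckets by hashing the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(key(t)) % num_buckets].append(t)
    return buckets

def partition_hash_join(R, S, num_buckets, r_key, s_key):
    """Join R and S on r_key(r) == s_key(s) using partitioning."""
    # Pass 1: partition both relations with the same hash function.
    r_parts = hash_partition(R, num_buckets, r_key)
    s_parts = hash_partition(S, num_buckets, s_key)

    result = []
    # Pass 2: one-pass join of each bucket pair. Matching tuples can only
    # meet inside the same pair because they hash to the same bucket number.
    for r_i, s_i in zip(r_parts, s_parts):
        table = defaultdict(list)          # in-memory hash table on R_i
        for r in r_i:
            table[r_key(r)].append(r)
        for s in s_i:                      # stream S_i past the table
            for r in table.get(s_key(s), []):
                result.append(r + s)       # concatenation keeps both key copies
    return result
```

For example, `partition_hash_join([(1, 'x')], [(1, 'r')], 4, lambda r: r[0], lambda s: s[0])` yields `[(1, 'x', 1, 'r')]`.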

4
Example

B(R) = 1000 blocks
B(S) = 500 blocks
Memory available: M = 101 blocks
R ⋈ S on common attribute C
Pass 1: use 100 buckets
Read R
Hash
Write buckets (each R bucket is about 1000/100 = 10 blocks)

Same for S, except the partition size is 5 blocks

Pass 2:
Read one R bucket
Build an in-memory hash table on it
Read the corresponding S bucket block by block.

[Figure: S, R buckets, and memory]
In general: cost = 3(B(R) + B(S)) disk I/Os
Requirement: min(B(R), B(S)) ≤ M^2

Cost
Bucketize
Read + write R
Read + write S
Join
Read R
Read S
Total cost = 3 × (1000 + 500) = 4500 disk I/Os
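
The arithmetic of the example can be checked with a small helper. This is only a sketch: `hash_join_cost` and its dictionary keys are names made up here, bucket sizes assume the hash function spreads tuples evenly, and the feasibility test uses the approximate requirement min(B(R), B(S)) ≤ M^2.

```python
import math

def hash_join_cost(b_r, b_s, m):
    """Estimate I/O cost of a two-pass hash join (uniform hashing assumed)."""
    num_buckets = m - 1                  # one buffer is reserved for input
    bucketize = 2 * (b_r + b_s)          # pass 1: read and write R and S
    join = b_r + b_s                     # pass 2: read every bucket once
    return {
        "buckets": num_buckets,
        "r_bucket_blocks": math.ceil(b_r / num_buckets),
        "s_bucket_blocks": math.ceil(b_s / num_buckets),
        "feasible": min(b_r, b_s) <= m * m,   # approximate requirement
        "total_io": bucketize + join,
    }

cost = hash_join_cost(1000, 500, 101)
print(cost["total_io"])   # 4500
```

With the numbers from the example, the helper reports 100 buckets, R buckets of 10 blocks, S buckets of 5 blocks, and 4500 disk I/Os in total.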

6
Hash-based duplicate elimination

Pass 1: create partitions by hashing on all
attributes
Pass 2: for each partition, use the one-pass
method for duplicate elimination
Cost: 3B(R) disk I/Os
Requirement: B(R) ≤ M(M-1)
(B(R)/(M-1) is the approximate size of one
bucket)
i.e. the requirement is approximately B(R) ≤ M^2
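
A minimal in-memory sketch of the two passes, assuming tuples are hashable Python tuples. The name `hash_distinct` is hypothetical, and a Python `set` plays the role of the one-pass method's in-memory structure.

```python
def hash_distinct(tuples, num_buckets):
    """Remove duplicates by partitioning on all attributes, then
    deduplicating each bucket independently."""
    # Pass 1: hash on the whole tuple, so identical tuples share a bucket.
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(t) % num_buckets].append(t)

    # Pass 2: one-pass duplicate elimination per bucket. Only one bucket's
    # distinct tuples need to be held in memory at a time.
    result = []
    for bucket in buckets:
        seen = set()
        for t in bucket:
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result
```

Because duplicates always land in the same bucket, no comparison across buckets is needed.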

7
Hash-based grouping and aggregation

Pass 1: create partitions by hashing on the grouping
attributes
Pass 2: for each partition, use the one-pass method.
Cost: 3B(R) disk I/Os. Requirement: B(R) ≤ M^2
More exactly the requirement is
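
The two passes for grouping can be sketched like this, using SUM as the aggregate. The function name `hash_group_sum` and the `group_key`/`value` extractor parameters are assumptions for this illustration; other aggregates would keep different per-group state.

```python
def hash_group_sum(tuples, num_buckets, group_key, value):
    """Compute SUM(value) per group via hash partitioning."""
    # Pass 1: hash on the grouping attributes only, so every tuple of a
    # group ends up in the same bucket.
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(group_key(t)) % num_buckets].append(t)

    # Pass 2: one-pass aggregation per bucket. Memory holds one running
    # total per group in the current bucket, not the bucket's tuples.
    result = {}
    for bucket in buckets:
        totals = {}
        for t in bucket:
            k = group_key(t)
            totals[k] = totals.get(k, 0) + value(t)
        result.update(totals)            # groups never span two buckets
    return result
```

For instance, grouping `[('a', 1), ('b', 2), ('a', 3)]` on the first field and summing the second gives `{'a': 4, 'b': 2}`.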

8
Hash-based set union

Pass 1: create partitions R_1, ..., R_{M-1} of R and
S_1, ..., S_{M-1} of S (with the same hash function)
Pass 2: for each pair R_i, S_i compute R_i ∪ S_i
using the one-pass method.
Cost: 3(B(R) + B(S)) disk I/Os
Requirement: min(B(R), B(S)) ≤ M^2
Similar algorithms for intersection and
difference (set and bag versions)
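
A sketch of the set-union case under the same simplifications as before (relations as Python lists of hashable tuples, no real disk I/O). `hash_set_union` is a hypothetical name; intersection and difference would change only the per-pair step in pass 2.

```python
def hash_set_union(R, S, num_buckets):
    """Set union of R and S via hash partitioning on all attributes."""
    def partition(rel):
        buckets = [[] for _ in range(num_buckets)]
        for t in rel:
            buckets[hash(t) % num_buckets].append(t)
        return buckets

    # Pass 1: same hash function for both relations.
    r_parts, s_parts = partition(R), partition(S)

    # Pass 2: one-pass set union of each pair (R_i, S_i).
    result = []
    for r_i, s_i in zip(r_parts, s_parts):
        seen = set()
        for t in s_i + r_i:
            if t not in seen:            # emit each distinct tuple once
                seen.add(t)
                result.append(t)
    return result
```

Equal tuples from R and S hash to the same bucket number, so checking within each pair is enough to remove all duplicates.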

9
Summary of hash-based methods
Operators    Approx. M required    Disk I/O
γ, δ         Sqrt(B)               3B
∪, ∩, −      Sqrt(B(S))            3(B(R) + B(S))
⋈            Sqrt(B(S))            3(B(R) + B(S))

Assuming B(S) < B(R)

10
Sort vs. Hash based algorithms

Hash-based algorithms have a size requirement
that depends only on the smaller of the two
arguments rather than on the sum of the argument
sizes, as for sort-based algorithms.
Sort-based algorithms allow us to produce the
result in sorted order and take advantage of that
sort later. The result can be used in another
sort-based algorithm later.
Hash-based algorithms depend on the buckets being
of nearly equal size. What about a join with very
few distinct values for the join attribute?
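
The skew problem can be illustrated by counting bucket sizes for two synthetic key distributions. This is only a sketch: `bucket_sizes` is a name introduced here, sizes are counted in tuples rather than blocks, and the data is made up for the demonstration.

```python
from collections import Counter

def bucket_sizes(keys, num_buckets):
    """Return the number of tuples that land in each bucket."""
    counts = Counter(hash(k) % num_buckets for k in keys)
    return [counts.get(i, 0) for i in range(num_buckets)]

# 1500 tuples, 100 buckets, many distinct join values: even spread.
uniform = bucket_sizes(range(1500), 100)
print(max(uniform))                      # 15

# Same number of tuples, but only 3 distinct join values.
skewed = bucket_sizes([i % 3 for i in range(1500)], 100)
print(max(skewed))                       # 500
print(sum(1 for s in skewed if s > 0))   # 3 non-empty buckets
```

With only three distinct values, at most three buckets are ever used, and each is far larger than the average size the memory requirement assumes.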
