Two-pass algorithms based on hashing

Provided by: thomo150
1
Two-pass algorithms based on hashing
  • Main idea: instead of sorted sublists, the first pass
    creates partitions based on hashing.
  • The second pass creates the result from the partitions
    using one-pass algorithms.

2
Creating partitions
  • Partitions (buckets) are created by hashing on all
    attributes of the relation, except for grouping
    and join, where the hash is computed on the
    grouping attributes and the join attributes respectively.
  • Why bucketize? Tuples with matching values end
    up in the same bucket.
  • Initialize M-1 buckets using M-1 empty buffers
  • FOR each block b of relation R DO
    • read block b into the M-th buffer
    • FOR each tuple t in b DO
      • IF the buffer for bucket h(t) has no room for t THEN
        • copy the buffer to disk
        • initialize a new empty block in that buffer
      • ENDIF
      • copy t to the buffer for bucket h(t)
    • ENDFOR
  • ENDFOR
  • FOR each bucket DO
    • IF the buffer for this bucket is not empty THEN
      • write the buffer to disk
    • ENDIF
  • ENDFOR
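
The partitioning loop above can be sketched in Python. This is only an illustration under assumptions that are not in the slides: the "disk" is simulated by a dictionary, a block is a plain list of tuples, and the names `partition`, `num_buckets`, `block_capacity` and `key` are hypothetical.

```python
from collections import defaultdict

def partition(blocks, num_buckets, block_capacity, key=lambda t: t):
    """Pass 1 sketch: hash-partition the tuples of a relation into buckets.

    blocks         -- iterable of lists of tuples; each list stands for one disk block
    num_buckets    -- number of buckets (M-1 in the slides)
    block_capacity -- how many tuples fit in one block (assumed constant)
    key            -- function extracting the attributes that are hashed
    Returns a dict: bucket number -> list of blocks "written to disk".
    """
    buffers = [[] for _ in range(num_buckets)]  # one in-memory buffer per bucket
    disk = defaultdict(list)                    # simulated disk, not real I/O

    for block in blocks:                        # the block read into the M-th buffer
        for t in block:
            i = hash(key(t)) % num_buckets      # plays the role of h(t)
            if len(buffers[i]) >= block_capacity:
                disk[i].append(buffers[i])      # buffer full: copy it to disk
                buffers[i] = []                 # start a new empty block
            buffers[i].append(t)                # always executed, full or not

    for i, buf in enumerate(buffers):           # flush the partially filled buffers
        if buf:
            disk[i].append(buf)
    return disk
```

For example, `partition([[(1, 'a'), (2, 'b')], [(1, 'a'), (3, 'c')]], 2, 2, key=lambda t: t[0])` places both tuples with key 1 in the same bucket, which is the property the second pass relies on.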

3
Partition hash-join
  • Pass 1: create partitions R_1, ..., R_{M-1} of R and
    S_1, ..., S_{M-1} of S, based on the join attributes
    (the same hash function for both R and S)
  • Pass 2: for each pair R_i, S_i, compute R_i ⋈ S_i
    using the one-pass method.
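
A minimal in-memory sketch of the two passes for the join, assuming relations are lists of Python tuples and ignoring real block I/O. The function names and the `r_key` / `s_key` parameters are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

def partition_by_key(tuples, key, num_buckets):
    """Pass 1: place each tuple in the bucket chosen by hashing its join key."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[hash(key(t)) % num_buckets].append(t)
    return buckets

def partition_hash_join(r, s, r_key, s_key, num_buckets):
    """Join R and S by partitioning both with the same hash function."""
    r_parts = partition_by_key(r, r_key, num_buckets)
    s_parts = partition_by_key(s, s_key, num_buckets)

    result = []
    for i in range(num_buckets):
        # Pass 2, one-pass join of the pair (R_i, S_i):
        # build a hash table on the R_i bucket ...
        table = defaultdict(list)
        for rt in r_parts.get(i, []):
            table[r_key(rt)].append(rt)
        # ... and probe it with every tuple of the S_i bucket.
        for st in s_parts.get(i, []):
            for rt in table.get(s_key(st), []):
                result.append((rt, st))
    return result
```

Because both relations use the same hash function, tuples that join can only meet inside the same pair of buckets, so no pair (R_i, S_j) with i ≠ j ever has to be examined.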

4
Example
  • B(R) = 1000 blocks
  • B(S) = 500 blocks
  • Memory available: M = 101 blocks
  • R ⋈ S on common attribute C
  • Pass 1: use 100 buckets
    • Read R
    • Hash
    • Write buckets (each R partition is about 1000/100 = 10 blocks)
  • Same for S, except the partition size is about 5 blocks

5
  • Pass 2 (for each pair of corresponding buckets)
    • Read one R bucket
    • Build an in-memory hash table on it
    • Read the corresponding S bucket block by block and
      probe the hash table.

[Figure: S and R buckets, Memory]

In general: Cost = 3(B(R) + B(S)) disk I/Os.
Requirement: min(B(R), B(S)) ≤ M^2
  • Cost
    • Bucketize
      • Read + write R
      • Read + write S
    • Join
      • Read R
      • Read S
  • Total cost = 3 × (1000 + 500) = 4500 disk I/Os
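
The arithmetic of this example can be checked with a few lines of Python. The helper names are hypothetical, and the memory test uses the approximate requirement min(B(R), B(S)) ≤ M^2 rather than an exact bound.

```python
def hash_join_cost(b_r, b_s):
    """Disk I/Os of the two-pass hash join, ignoring the final output."""
    bucketize = 2 * (b_r + b_s)  # read and write every block of R and S once
    join = b_r + b_s             # read every bucket of R and S once more
    return bucketize + join      # = 3 * (B(R) + B(S))

def two_passes_suffice(b_r, b_s, m):
    """Approximate memory requirement: min(B(R), B(S)) <= M^2."""
    return min(b_r, b_s) <= m * m

print(hash_join_cost(1000, 500))           # 4500
print(two_passes_suffice(1000, 500, 101))  # True
```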

6
Hash-based duplicate elimination
  • Pass 1: create partitions by hashing on all
    attributes
  • Pass 2: for each partition, use the one-pass
    method for duplicate elimination
  • Cost: 3B(R) disk I/Os
  • Requirement: B(R) ≤ M(M-1)
  • (B(R)/(M-1) is the approximate size of one
    bucket, and each bucket must fit in the M buffers
    for the one-pass method to work)
  • i.e. the requirement is approximately B(R) ≤ M^2
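
As a rough sketch, duplicate elimination by hashing might look as follows in Python. It assumes tuples are hashable Python tuples and keeps everything in memory; `hash_distinct` and `num_buckets` are hypothetical names.

```python
from collections import defaultdict

def hash_distinct(tuples, num_buckets):
    """Two-pass duplicate elimination, simulated in memory."""
    # Pass 1: hash on ALL attributes, so identical tuples share a bucket.
    buckets = defaultdict(list)
    for t in tuples:
        buckets[hash(t) % num_buckets].append(t)

    # Pass 2: one-pass duplicate elimination inside each bucket.
    result = []
    for i in sorted(buckets):
        seen = set()               # only has to hold one bucket's distinct tuples
        for t in buckets[i]:
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result
```

Duplicates can never end up in different buckets, so the per-bucket results can simply be concatenated.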

7
Hash-based grouping and aggregation
  • Pass 1: create partitions by hashing on the grouping
    attributes
  • Pass 2: for each partition, use the one-pass method.
  • Cost: 3B(R) disk I/Os. Requirement: approximately B(R) ≤ M^2
  • More exactly, the requirement is [formula not preserved in the transcript]
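
To illustrate the two passes for grouping, here is a small Python sketch that computes a SUM per group. The choice of SUM, the function name and the `group_key` / `value` parameters are assumptions made for the example; the slides do not fix a particular aggregate.

```python
from collections import defaultdict

def hash_group_sum(tuples, group_key, value, num_buckets):
    """Two-pass grouping with a SUM aggregate, simulated in memory."""
    # Pass 1: hash on the grouping attributes only.
    buckets = defaultdict(list)
    for t in tuples:
        buckets[hash(group_key(t)) % num_buckets].append(t)

    # Pass 2: one-pass aggregation per bucket; memory holds one
    # accumulator per group that occurs in the current bucket.
    result = {}
    for bucket in buckets.values():
        sums = {}
        for t in bucket:
            g = group_key(t)
            sums[g] = sums.get(g, 0) + value(t)
        result.update(sums)  # a group lives in exactly one bucket, so no clashes
    return result
```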

8
Hash-based set union
  • Pass 1: create partitions R_1, ..., R_{M-1} of R and
    S_1, ..., S_{M-1} of S (with the same hash function)
  • Pass 2: for each pair R_i, S_i, compute R_i ∪ S_i
    using the one-pass method.
  • Cost: 3(B(R) + B(S)) disk I/Os
  • Requirement: min(B(R), B(S)) ≤ M^2
  • Similar algorithms for intersection and
    difference (set and bag versions)
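
A minimal sketch of the set-union case in Python, again simulated in memory. It assumes the elements are hashable values; `_partition` and `hash_set_union` are hypothetical helper names.

```python
from collections import defaultdict

def _partition(items, num_buckets):
    """Pass 1: distribute items over buckets by hashing the whole item."""
    buckets = defaultdict(list)
    for x in items:
        buckets[hash(x) % num_buckets].append(x)
    return buckets

def hash_set_union(r, s, num_buckets):
    """Set union of R and S: the result contains no duplicates."""
    r_parts = _partition(r, num_buckets)
    s_parts = _partition(s, num_buckets)

    result = []
    for i in range(num_buckets):
        # Pass 2: hold the distinct items of S_i in memory and output them ...
        seen = set(s_parts.get(i, []))
        result.extend(seen)
        # ... then stream R_i and output only items not seen so far.
        for x in r_parts.get(i, []):
            if x not in seen:
                seen.add(x)
                result.append(x)
    return result
```

Intersection and difference follow the same pattern and differ only in which items of R_i are emitted after looking them up in the table built for S_i.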

9
Summary of hash-based methods
Operators    Approx. M required    Disk I/O
δ, γ         sqrt(B)               3B
∪, ∩, −      sqrt(B(S))            3(B(R) + B(S))
⋈            sqrt(B(S))            3(B(R) + B(S))

Assuming B(S) < B(R)

10
Sort vs. Hash based algorithms
  • Hash-based algorithms have a size requirement
    that depends only on the smaller of the two
    arguments rather than on the sum of the argument
    sizes, as for sort-based algorithms.
  • Sort-based algorithms produce the result in
    sorted order, which can be exploited later: the
    result can be fed into another sort-based
    algorithm without sorting it again.
  • Hash-based algorithms depend on the buckets being
    of nearly equal size. What about a join with
    very few distinct values for the join attribute?
    Then a few buckets receive most of the tuples
    and may not fit in memory.
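
The last point can be made concrete with a small Python experiment that counts how many keys land in each bucket. The key sets (10,000 distinct integers versus only 3 distinct values) and the name `bucket_sizes` are made up for illustration.

```python
from collections import Counter

def bucket_sizes(keys, num_buckets):
    """Number of keys that hash to each bucket."""
    counts = Counter(hash(k) % num_buckets for k in keys)
    return [counts.get(i, 0) for i in range(num_buckets)]

# Many distinct join values: the keys spread over all 100 buckets.
many_values = bucket_sizes(range(10_000), 100)

# Only three distinct join values: at most three buckets are used at all,
# so each non-empty bucket holds roughly a third of the relation.
few_values = bucket_sizes([k % 3 for k in range(10_000)], 100)

print(max(many_values), max(few_values))
```

No choice of hash function helps in the second case, because tuples with equal join values must go to the same bucket.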