Greedy Algo. for Selecting a Join Order - PowerPoint PPT Presentation

1 / 12
About This Presentation
Title:

Greedy Algo. for Selecting a Join Order

Description:

Greedy Algo. for Selecting a Join Order ... BASIS: Start with the pair of relations whose estimated join size is smallest. The join of these relations becomes ... – PowerPoint PPT presentation

Number of Views:43
Avg rating:3.0/5.0
Slides: 13
Provided by: scie232
Category:
Tags: algo | greedy | join | order | selecting

less

Transcript and Presenter's Notes

Title: Greedy Algo. for Selecting a Join Order


1
Greedy Algo. for Selecting a Join Order
  • The "greediness" is based on the idea that we
    want to keep the intermediate relations as small
    as possible at each level of the tree.
  • BASIS Start with the pair of relations whose
    estimated join size is smallest. The join of
    these relations becomes the current tree.
  • INDUCTION Find, among all those relations not
    yet included in the current tree, the relation
    that, when joined with the current tree, yields
    the relation of smallest estimated size.
  • The new current tree has the old current tree
    as its left argument and the selected relation
    as its right argument.

2
Example
  • The basis step is to find the pair of relations
    that have the smallest join.
  • This honor goes to the join T ?? U, with a cost
    of 1000. Thus, T ?? U is the "current tree."
  • We now consider whether to join R or S into the
    tree next.
  • We compare the sizes of (T ?? U) ?? R and (T ??
    U)??S.
  • The latter, with a size of 2000 is better than
    the former, with a size of 10,000. Thus, we pick
    as the new current tree (T ?? U)??S.
  • Now there is no choice we must join R at the
    last step, leaving us with a total cost of 3000,
    the sum of the sizes of the two intermediate
    relations.

3
Pipelining Versus Materialization
  • Materialization The naive way to execute a query
    plan is to order the operations appropriately
  • An operation is not performed until the
    argument(s) below it have been performed, and
  • Store the result of each operation on disk until
    it is needed by another operation. This strategy
    is
  • Pipelining A more subtle, and generally more
    efficient, way to execute a query plan is to
    interleave the execution of several operations.
  • The tuples produced by one operation are passed
    directly to the operation that uses it, without
    ever storing the intermediate tuples on disk.
  • Typically is implemented by a network of
    iterators.

4
Pipelining Example (I)
  • Let us consider physical query plans for the
    expression
  • (R(w,x) ?? S(x,y)) ?? U(y,z)
  • Assumptions
  • R occupies 5000 blocks S and U each occupy
    10,000 blocks.
  • The intermediate result R ?? S occupies k blocks
    for some k.
  • We can estimate k, based on the number of
    x-values in R and S and the size of (w,x,y)
    tuples compared to the (w,x) tuples of R and the
    (x,y) tuples of S.
  • However, we want to see what happens as k varies,
    so we leave this constant open.
  • Both joins will be implemented as hash-joins,
    either one-pass or two-pass, depending on k.
  • There are 101 buffers available.

5
Example (II)
6
Example (III)
  • First, consider the join R ?? S. Neither relation
    fits in main memory, so we need a two-pass
    hash-join.
  • If the smaller relation R is partitioned into the
    maximum-possible 100 buckets on the first pass,
    then each bucket for R occupies 50 blocks.
  • If R's buckets have 50 blocks, then the second
    pass of the hash-join R ?? S uses 51 buffers,
    leaving 50 buffers to use for the join of the
    result of R ?? S with U.
  • Now,suppose that k ? 49 that is, the result of R
    ?? S occupies at most 49 blocks.
  • Then we can pipeline the result of R ?? S into 49
    buffers, organize them for lookup as a hash
    table, and we have one buffer left to read each
    block of T in turn.
  • We may thus execute the second join as a one-pass
    join.
  • The total number of disk I/O's is
  • 45,000 to perform the two-pass hash join of R and
    S.
  • 10,000 to read U in the one-pass hash-join of (R
    ?? S) ?? U.
  • The total is 55,000 disk I/O's.

7
Example (IV)
  • Now, suppose k gt 49, but k lt 5000. We can still
    pipeline the result R ?? S, but we need to use
    another strategy, in which this relation is
    joined with U in a 50-bucket, two-pass hash-join.
  • Before we start on R ?? S, we hash U into 50
    buckets of 200 blocks.
  • Next, we perform a two-pass hash join of R and S
    using 51 buckets as before, but as each tuple of
    R ?? S is generated, we place it in one of the 50
    remaining buffers that is used to help form the
    50 buckets for the join of R??S with U.
  • Finally, we join R ?? S with U bucket by bucket.
    Since k lt 5000, the buckets of R ?? S will be of
    size at most 100 blocks, so this join is
    feasible.
  • The fact that buckets of U are of size 200
    blocks is not a problem.
  • We are using buckets of R ?? S as the build
    relation and buckets of U as the probe relation
    in the one-pass joins of buckets.
  • The number of disk I/O's for this pipelined join
    is
  • a) 20,000 to read U and write its tuples into
    buckets.
  • b) 45,000 to perform the two-pass hash-join R ??
    S.
  • c) k to write out the buckets of R ?? S.
  • d) k 10,000 to read the buckets of R ?? S and
    U in the final join.
  • The total cost is thus 75,000 2k.

8
Example (V)
  • Last, let us consider what happens when k gt 5000.
    Now, we cannot perform a two-pass join in the 50
    buffers available if the result of R ?? S is
    pipelined. So,
  • Compute R ?? S using a two-pass hash join and
    materialize the result on disk.
  • Join R ?? S with U, also using a two-pass
    hash-join.
  • Note that since B(U) 10,000, we can perform a
    two-pass hash-join using 100 buckets regardless
    of how large k is. Technically, U should appear
    as the left argument of its join if we decide to
    make U the build relation for the hash join.
  • Number of disk I/O's for this plan is
  • 45,000 for the two-pass join of R and S.
  • k to store R ?? S on disk.
  • 30,000 3k for the two-pass hash-join of U with
    R ?? S.
  • Total cost is thus 75,000 4k

9
Example (VI)
10
Notation for Physical Plans
It can also be SortScan(R,L) if a sort based join
is preferred.
11
Notation for Physical Plans
12
Notation for Physical Plans
Write a Comment
User Comments (0)
About PowerShow.com