Evaluation of Relational Operations: Other Operations - PowerPoint PPT Presentation

About This Presentation
Title:

Evaluation of Relational Operations: Other Operations

Description:

Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections 14.1-14.3; 14.5-14.7) – PowerPoint PPT presentation

Number of Views:30
Avg rating:3.0/5.0
Slides: 13
Provided by: RaghuRa104
Category:

less

Transcript and Presenter's Notes

Title: Evaluation of Relational Operations: Other Operations


1
Evaluation of Relational Operations Other
Operations
  • Chapter 14 Ramakrishnan Gehrke
  • (Sections 14.1-14.3 14.5-14.7)

2
What you will learn from this lecture
  • General selections.
  • Union/intersection.
  • Group-by.

3
Simple Selections
SELECT FROM Songs S WHERE S.sname C
  • Of the form
  • Size of result approximated as size of R
    reduction factor
  • reduction factor typically estimated with the
    uniformity assumption
  • or estimated with the help of a histogram will
    discuss soon.
  • With no index, unsorted (worst case) ? table
    scan only option cost M (pages of R).
  • With an index on selection attribute Use index
    to find qualifying data entries, then retrieve
    corresponding data records. (Hash index useful
    only for equality selections.)
  • What about a sorted indexless relation? Is binary
    search always superior to table scan?

4
Using an Index for Selections
  • Cost depends on qualifying tuples, and
    clustering.
  • Cost of finding qualifying data entries
    (typically small) plus cost of retrieving records
    (could be large w/o clustering).
  • For example, assuming uniform distribution of
    snames(!), about 1/26 of tuples qualify (20
    pages, 1600 tuples). With a clustered index,
    cost is about 20 I/Os if unclustered, upto 1600
    I/Os! assumes S has 500 pages.
  • Important refinement for unclustered indexes
  • 1. Find qualifying data entries.
  • 2. Sort the rids of the data records to be
    retrieved (or, if the records are not stored in
    rid-order, sort the page addresses).
  • 3. Fetch rids in order. This ensures that each
    data page is looked at just once (though of
    such pages likely to be higher than with
    clustering).

5
General Selection Conditions
  • (timelt8/9/04 AND uid123) OR rating5 OR sid3
  • Such selection conditions are first converted to
    conjunctive normal form (CNF) (timelt8/9/04 OR
    rating5 OR sid3 ) AND (uid123 OR rating5 OR
    sid3).
  • We only discuss the case with no disjunctions
    (most common in practice!)
  • disjunctions may or may not be easy
  • e.g., (timelt8/9/04 OR uid123)
  • index on uid only may still warrant a scan
  • e.g., (timelt8/9/04 OR uid123) AND sid 3
  • index on sid helps tremendously
  • - (time 8/9/04 OR uid123). Can you think of
    a clever strategy if given index on time and on
    uid?

6
Two Approaches to General Selections
  • First approach Find the most selective access
    path, retrieve tuples using it, and apply any
    remaining conditions that dont match the index
  • Most selective access path An index probe or
    index scan or table scan that we estimate will
    require the fewest page I/Os.
  • e.g., timelt8/9/04 AND rating5 AND sid3.
    Suppose we have indexes (Btree) on time and
    (hash) on ltrating, sidgt. ? 2 options
  • option 1 use the B tree index on time (check
    rating5 and sid3 for each retrieved tuple)
  • option 2 use the hash index on ltrating, sidgt
    (check timelt8/9/94)
  • Conditions that match this index reduce the
    number of tuples retrieved other conditions are
    used to discard some retrieved tuples, but do not
    affect number of tuples/pages fetched.

7
Intersection of Rids
  • Second approach (if we have 2 or more matching
    indexes)
  • Get sets of rids of data records using each
    matching index.
  • Then intersect these sets of rids
  • Retrieve the records and apply any remaining
    conditions.
  • e.g., timelt8/9/04 AND rating5 AND sid3
  • suppose we have a B tree index on time and an
    index on sid
  • retrieve rids of records satisfying timelt8/9/04
    using the first
  • retrieve rids of records satisfying sid3 using
    the second
  • intersect the rids
  • finally, retrieve records and check rating5
  • Howd you implement intersection efficiently?

8
The Projection Operation
SELECT DISTINCT R.sid, R.uid FROM
Ratings R
  • An approach based on sorting
  • Modify Phase 1 of external sort to eliminate
    unwanted fields. Thus, sorted sublists (runs)
    are produced, but tuples in SSLs are smaller than
    input tuples. (Size ratio depends on and size
    of fields dropped.)
  • Modify merging passes to eliminate duplicates.
    Thus, number of result tuples smaller than input.
    (Difference depends on of duplicates.)
  • What should we sort on?

9
Projection Based on Hashing
  • Partitioning phase Read R using one input
    buffer. For each tuple, discard unwanted fields,
    apply hash function h1 to choose one of B-1
    output buckets/buffers.
  • Result is B-1 partitions (of tuples with no
    unwanted fields). Any two tuples from different
    partitions guaranteed to be distinct.
  • Duplicate elimination phase For each partition,
    read it and build an in-memory hash table, using
    hash fn h2 (ltgt h1) on all fields, while
    discarding duplicates.
  • If partition does not fit in memory, can apply
    hash-based projection algorithm recursively to
    this partition. (but what do you project on in
    recursive invocation(s)?)
  • Note the similarity to hash join.

10
Set Operations
  • Intersection and cross-product special cases of
    join.
  • Union sorting based and hash based
  • Sorting based approach
  • Sort both relations (on combination of all
    attributes).
  • Scan sorted relations and merge them.
  • Alternative Merge SSLs from Phase 1 for both
    relations.
  • An advantage result is sorted.
  • Hash based approach to union
  • Partition R and S using hash function h.
  • For each S-partition, build in-memory hash table
    (using h2), scan corr. R-partition and add tuples
    to table while discarding duplicates.
  • What is the buffer requirement for this?

11
Aggregate Operations (AVG, MIN, etc.)
  • Without group-by
  • In general, requires scanning the relation.
  • Given index whose search key includes all
    attributes in the SELECT WHERE clauses, may
    suffice to do index scan. (remember what index
    scan is?)
  • With group-by
  • Sort on group-by attributes, then scan relation
    and compute aggregate for each group. (Can
    improve upon this by combining sorting and
    aggregate computation. In this case, the I/O cost
    is just that of sorting.)
  • Similar approach based on hashing on group-by
    attributes.
  • Maintain running info. for each group.

12
Summary
  • A virtue of relational DBMSs queries are
    composed of a few basic operators the
    implementation of these operators can be
    carefully tuned (and it is important to do
    this!).
  • Indeed relational algebra ? gold standard for
    algebra design simple ops sophisticated ones
    derivable by composition e.g., join, division.
  • Many alternative implementation techniques for
    each operator no universally superior technique
    for most operators.
  • Must consider available alternatives for each
    operation in a query and choose best one based on
    system statistics, etc. This is part of the
    broader task of optimizing a query composed of
    several ops.
Write a Comment
User Comments (0)
About PowerShow.com