ICS 214A: Database Management Systems - PowerPoint PPT Presentation

1 / 39
About This Presentation
Title:

ICS 214A: Database Management Systems

Description:

Read write records, scan relations. Get page containing tuples. Read/write file pages ... simple to implement using a scan. relation scan. index scan ... – PowerPoint PPT presentation

Number of Views:20
Avg rating:3.0/5.0
Slides: 40
Provided by: che7
Category:

less

Transcript and Presenter's Notes

Title: ICS 214A: Database Management Systems


1
ICS 214A Database Management Systems
  • Part III Query Processing and Optimization
  • Professor Sharad Mehrotra

2
Big picture
We are here!
SQL statements
Access plan
Query Processor
optimizer
Read write records, scan relations
Get page containing tuples
Indexes
Record-oriented file system
Buffer manager
Basic file system
Read/write file pages
Hardware
3
Query Processor Optimizer
  • Overview
  • Query processor
  • Query optimizer

4
Overview query processing steps
  • Parsing
  • ? Validation
  • ? Optimization
  • ? Execution

5
Parsing
  • SQL --gt Parse Tree
  • Good old lex and yacc
  • Detect and reject syntax errors

6
Example SQL query
  • Find the movies with stars born in 1960
  • SELECT title
  • FROM StarsIn
  • WHERE starName IN (
  • SELECT name
  • FROM MovieStar
  • WHERE birthdate LIKE 1960
  • )

7
Example Parse Tree
ltQuerygt
ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltTuplegt IN ltQuerygt
title StarsIn
ltAttributegt ( ltQuerygt )
starName ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltAttributegt LIKE ltPatterngt
name MovieStar
birthDate 1960
8
Validation
  • Validator parse tree --gt initial logical plan
  • Detect reject semantic errors
  • Nonexistent tables/views/columns
  • Insufficient access priveleges
  • Type mismatch
  • E.g., AVG(name), name GPA, Student UNION Enroll
  • Also
  • Expand
  • Expand view definitions
  • Validator uses information in system catalogs for
    all the above

9
Example Initial Logical Query Plan
?title
  • NOTE
  • Represented as a query tree though sometimes a
    DAG if it has common subexpressions
  • Usually common subexpressions evaluating
    separately and stored in temporary file
  • Effectively query plan corresponds to a tree
  • Not all systems use relational
  • Algebra for representing logical
  • Query plans. DB2 uses the query Graph model

?starNamename
?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
10
Optimizer
  • Given an initial Logical Plan generate a physical
    plan that is Optimal with respect to one or
    more performance metrics.
  • Performance Metrics
  • I/O, CPU, Communication, Power Consumption, Cache
    misses, of internet queries, . (depends upon
    domain)

11
Query Optimizer
  • Many logical plans are equivalent
  • Databases support typically many ways to
    implement physical operators
  • Query optimization is a difficult search problem
    of identifying an optimal plan.
  • Issues
  • Search space of plans
  • Which plans does the optimizer consider
  • Cost estimation
  • how to assign cost to a plan
  • Enumeration algorithm
  • Search space

12
Optimizer use rewrite rules to generate better
logical plans
?title
starNamename
StarsIn ?name
?birthdate LIKE 1960
MovieStar
13
Optimizers may consider only subset of plans
  • Different optimizers may consider different types
    of trees.

B
Left deep tree
Bushy tree
  • These trees are different since inputs to binary
    operator are not symmetric.

Right deep tree
14
Multiple physical plans may exist for a given
logical plan

Hash join
SEQ scan
index scan
StarsIn MovieStar
15
Logical vs Physical Plan
  • Logical Plan
  • Physical Plan

Nested Loop
Merge Join
File scan R3
Sort
Sort
File scan R1
File scan R2
16
Query Processor
  • Implements a bunch of algorithms that form the
    building blocks of the physical plan.
  • Unary operators
  • Tuple level select, project (can be evaluated on
    tuple directly)
  • Table level sort, hash, aggregation, group by.
  • binary operators
  • join, union, intersection, difference.
  • Important Issue
  • Query processor architecture
  • how is the execution of different physical
    operators synchronized and how is data passed
    form one operator to another.

17
Example
R(A,B,C), S(C,D,E)
  • Select B,D
  • From R,S
  • Where R.A c ? S.E 2 ? R.CS.C

18
R A B C S C D E a 1 10 10 x 2 b 1 20 2
0 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50
y 3
19
Plan I
  • ?B,D
  • sR.Ac? S.E2 ? R.CS.C
  • x
  • R S

OR ?B,D sR.Ac? S.E2 ? R.C S.C (R x S)
20
R x S R.A R.B R.C S.C S.D S.E a 1 10 10
x 2 a 1 10 20 y 2 . .
C 2 10 10 x 2 . .
21
Plan II
  • ?B,D
  • sR.A c sS.E 2
  • R S

natural join
22
R S A B C s (R) s(S) C D
E a 1 10 A B C C D E 10
x 2 b 1 20 c 2 10 10 x 2 20 y
2 c 2 10 20 y 2 30 z 2 d 2
35 30 z 2 40 x 1 e 3 45
50 y 3
23
Plan III
  • Use R.A and S.C Indexes
  • (1) Use R.A index to select R tuples with R.A
    c
  • (2) For each R.C value found, use S.C index to
    find matching tuples
  • (3) Eliminate S tuples S.E ? 2
  • (4) Join matching R,S tuples, project B,D
    attributes and place in result

24
R S A B C C D E a 1
10 10 x 2 b 1 20 20
y 2 c 2 10 30 z 2 d 2 35
40 x 1 e 3 45
50 y 3
A
C
I1
I2
25
Query Processor physical level
  • Implements a bunch of algorithms that form the
    building blocks of the physical plan.
  • Unary operators
  • sort, hash, select, project, aggregation, group
    by.
  • binary operators
  • join, union, intersection, difference.
  • Important Issue
  • Query processor architecture
  • how is the execution of different physical
    operators synchronized and how is data passed
    form one operator to another?

26
Query Processor Architecture
  • Consider a simple join operator and query
    consisting of 2 joins. How should data be passed
    form one join to another?
  • Create temporary files.
  • Create a process for each operator and use IPC
    (e.g, shared memory, pipes).
  • OS does scheduling of these processes.

27
Query Processor Architecture (cont)
  • Take a query plan and convert as much of it as
    possible into a block that can be executed and
    one single iterative program with loops and other
    control structures
  • Implement operators as iterators with open,
    next, close as a single process easy to
    parallelize

28
Iterators
  • Demand-driven evaluation of query tree.
  • Operators exchange data in units such as records
  • Each operator supports the following interfaces
  • open
  • next
  • close
  • open() at top of tree results in cascade of opens
    down the tree.
  • An operator getting a next() call may recursively
    make next() calls from within to produce its next
    answer.
  • close() at top of tree results in cascade of
    close down the tree

29
Benefits of Iterators
  • Query executed as one process
  • Items produced one at a time
  • No temporary files
  • amenable to parallelization

30
Sample Iterators
  • Print
  • Open()
  • open input
  • Next()
  • call next on input format the item to screen
  • Close()
  • close input

31
Sample Iterators (cont)
  • Scan
  • open
  • open file
  • next
  • read next item
  • close
  • close file

32
Sample Iterators (cont)
  • Sort (merge sort)
  • open
  • open input
  • build all initial run files calling next on
    input
  • close input
  • merge run files
  • next
  • determine next output item read new item from
    correct run file
  • close
  • destroy remaining run files

33
Operators in Relational Model
  • Tuple _at_ a time unary (?,?)
  • can be evaluated immediately
  • tuple at a time without looking at entire
    relation
  • Full-relation unary operators
  • S - duplicate removal
  • VL- aggregations, grouping
  • E - sorting
  • Binary operators

34
Operator Implementation
  • Tuple _at_ a time operator
  • simple to implement using a scan
  • relation scan
  • index scan
  • Implementation of other operators can be
    classified according to the following
  • sort based
  • hash based
  • nested loop based
  • Furthermore, each of the above may or may not
    exploit indices.

35
Sorting
  • If data lt M, no problem
  • read into memory and sort using your favorite
    method. (usually quick sort)
  • what if data does not fit memory?
  • Uses external sort
  • Abstract description of external sort
  • divide input into runs R1,,Rn which are
    themselves sorted (run generation)
  • merge runs into sorted output (merging)
  • Notice 2 different sort algorithms used.
  • 1. For in memory sorting
  • 2. For managing subsets of data on disk.

36
Run Generation
  • Simple idea
  • read enough data as fits memory
  • use MM sorting
  • output run
  • Replacement Selection
  • organize memory as priority queue
  • initially fill memory with items and organize as
    priority queue
  • pop smallest entry and put in run
  • immediately replace entry in PQ by a new item
  • if new value gt largest outputted so far, the new
    value would be outputted in same run.

37
Replacement Selection
  • Advantages
  • larger size runs, hence merge can be done in
    fewer phases
  • Disadvantages
  • makes memory management more complex
  • data read from disk in blocks. Should we copy it
    to a PQ and store separately (copy overhead)
  • else 1/2 buffer would be wasted since records
    already outputted to run

38
Merging Runs
  • Read size C cluster from each run so as to fill M
  • Merge them and store in output
  • Continue recursively until only 1 run remaining

39
Complexity of Sort
  • Let W of initial runs, (W R/M if
    quicksort, else R/2M 1 if replacement
    selection used.
  • C cluster size
  • F fan in of merge
  • of levels of Merge
  • Total I/O cost
  • cost of run creation L 2 R/C times I/O
    cost of reading C
  • What should C be?
  • Smaller value of C allows for max. of runs to
    be merged in 1 iteration.
  • Minimizes the number of phases and hence the
    amount of data read and written to disk
  • larger value of Cgt more sequential I/O per
    iteration.

40
Choosing Cluster Size (usually large cluster size
with small fan-in is good)
  • Let R 51200KB, M 160KB
  • Num. Of runs 51200/160 320
  • disk latency seek time 25ms, transfer rate
    2ms/4KB
  • If C 16KB
  • F 160/16 - 1 9
  • number of merge levels, L log_9(320) 3
  • reading 16KB takes 25 2 4 33ms
  • total time 2 L R/C 33 ms 2 3 3200
    33/60000 approx. 10.56 minutes.
  • If C 4KB
  • F 160 /4 - 1 39
  • numbe of merge levels, L log_39(320) 2
  • reading 4KB takes 25 4 29 ms
  • total time 2 2 29 3200 4 /60000
    24.74 minutes

41
Optimizing Merge
  • What if initial number of runs W is not power of
    F?
  • Given 12 runs, and fan-in of 10
  • cost of strategy 1 12 10 2 24
  • cost of strategy 2 12 3 15
  • In the first merge, do not merge runs
    aggressively as many as possible. Instead, just
    merge enough to ensure subsequent merges will use
    full fan-in F
  • At any stage, merge as many runs as possible
    using the smallest run files available

42
Implementing Sort Based Algorithms
  • Sort-based Unary operators (duplicate
    elimination, aggregation)
  • naive sort first then compile
    aggregate/eliminate duplicates
  • smart merge aggregation/duplicate elimination
    with run generation

43
Duplicate Elimination using Sort
  • Remove duplicates during run generation
  • duplicates never permitted in runs that are
    created.
  • This results in shorter runs ( a run size never
    exceeds the output size).
  • Hence cost can significantly reduce.
  • E.g., in extreme case if the file completely
    consists of duplicates, we only pay cost of
    reading it once and do not pay overhead of run
    generation, run storage, etc.

44
Aggregation using Sort
  • sort based on group-by criteria.
  • For different aggregate queries, different
    information maintained per group.
  • E.g., for min query, in the run created, we do
    not maintain duplicates, rather maintain a min
    for a given run. During merging, the min of the
    new run is the minimum of the values associated
    with the min for groups with individual runs.

45
Implementing Sort-Based Operators
  • Binary operators (join, union (set-based
    semantics), intersection, difference, semi-join)
  • naïve first sort the individual relations and
    then compute the binary operation.
  • Smart join the last merge steps of the sort on
    both relations and compute the binary operation
    in that step.

46
Sort based Union Operator (R union S)
  • divide R and S into runs using quicksort and sort
    runs individually
  • Let F be the fan-in of sorting
  • Merge runs of R and S until the remaining runs of
    R and S taken together is less than or equal to
    F.
  • Read each remaining run of R and S into memory
    and compute union
  • same as duplicate removal from these runs taken
    together.
  • Similar algorithms can easily be devised for
    sort-based intersection, and set difference

47
Merge Join
  • Similar to sort-based implementation of other
    operators
  • Divide R and S into runs and merge runs until
    there are just enough runs of R and S left over
    such that these runs can be merged together in
    memory.
  • Merging runs of R and S for join slightly
    different compared to other binary operators such
    as set difference, union since both inputs do not
    advance simultaneously in presence of duplicates.
  • E.g., say R and S both have duplicate entries for
    a join value v 5.
  • Runs of R are advanced a tuple at a time, and the
    corresponding entries in the runs of S is looked
    up.
  • It is possible we may have to read pages
    containing duplicates (I.e., v 5 entries) in S
    repeatedly.

48
Next
  • Continuation of Query Processing
  • Hashing and hash based algorithms
  • Nested loop based algorithms
  • Index based algorithms
  • Query Optimization
Write a Comment
User Comments (0)
About PowerShow.com