Title: Query Execution Ch'15
1Query ExecutionCh.15
1
2P.Q.
3True Or False?
- An index on a file speeds up selections on the
search key field(s) ?
3
4Which ?
- Search key any subset of the fields of a
relation? - Search key a subset of is Relation's Primary
key(minimal set of fields that uniquely identify
a record in a relation)?
5(Review) Bad hash functions?
- int h(chars) return 0
- int h(chars) return rnd()
- int h(chars) return s0s1...
- int
- h(chars) return secure_cipher_key(s)
6(Review) Bad hash functions?
- int h(chars) return 0 SINGLE BUCKET
- int h(chars) return rnd() CANT RE-FIND
- int h(chars) return s0s1... BETTER but
poor distribution - int
- h(chars) return encrypt_hash(s) GOOD
distribution but TOO EXPENSIVE
7Linear or Extensible Hash or both?
- In memory array size must be 2i
- Does not use overflow blocks?
- Uses a msb bit-flip trick?
- Uses a subset of generated hash bit code?
- Array entries refer to disk blocks
- Splits bucket where new record is inserted?
8Linear or Extensible Hash or both?
- In memory array size must be 2i Extensible
- Does not use overflow blocks? Extensible
- Uses a msb bit-flip trick? Linear
- Uses a subset of hash bits? Both
- Array entries refer to disk blocks Both
- Splits bucket where new record is inserted? Ext.
9 10T/F
- Most Eskimoes have fridges?
- Thomas Edison, who invented the lightbulb, was
afraid of the dark? - King Charles I was executed 1649 after the
English Civil War led by Oliver Cromwell? - http//www.quiz-zone.co.uk/quizrounds/020317trueor
false/questions.html
11All True
- Most Eskimoes have fridges?
- Thomas Edison, who invented the lightbulb, was
afraid of the dark? - King Charles I was executed 1649 after the
English Civil War led by Oliver Cromwell? - http//www.quiz-zone.co.uk/quizrounds/020317trueor
false/questions.html - http//www.historylearningsite.co.uk/CharlesI_exec
ution.htm
12Query Execution
12
13Outline (Formal)
- Logical/physical operators
- Cost parameters and sorting
- One-pass algorithms
- Nested-loop joins
- Two-pass algorithms
-
13
14Outline (InFormal)
- Logical/physical operators
- Which streets vs. postal delivery route
- Cost parameters and sorting
- "How long?!" "I only apples from small to large"
- One-pass algorithms
- "Hello Tuple... Goodbye tuple!"
- Nested-loop joins
- "Everybody plays everybody else"
- Two-pass algorithms
- "Bite of chunks at a time" When ONCE is not
enough -
14
15 Query Execution
Query or update
User/ Application
Query compiler
Query execution plan
Execution engine
Record, index requests
Index/record mgr.
Page commands
Buffer manager
Read/write pages
Storage manager
storage
15
16Logical v.s. Physical Operators
- Logical operators
- what they do
- e.g., union, selection, project, join, grouping
- Physical operators
- how they do it
- e.g., nested loop join, sort-merge join, hash
join, index join
16
17Query Execution Plans
buyer
?
Cityurbana
phonegt5430000
- Query Plan
- logical tree
- implementation choice at every node
- scheduling of operations.
Buyername
(Simple Nested Loops)
Person
Purchase
(Table scan)
(Index scan)
Some operators are from relational algebra, and
others (e.g., scan, group) are not.
17
18How do We Combine Operations?
- The iterator model. Each operation is implemented
by 3 functions - Open sets up the data structures and performs
initializations - GetNext returns the the next tuple of the
result. - Close ends the operations. Cleans up the data
structures.
18
19How do We Combine Operations?
- The iterator model. Enables pipelining!
- Contrast with data-driven materialize model.
- "pipelined evaluation" vs. "materialized
evaluation"
19
20Cost Parameters
- Cost parameters
- M number of blocks that fit in main memory
- B(R) number of blocks holding R
- T(R) number of tuples in R
- V(R,a) number of distinct values of the
attribute a - Estimating the cost
- Important in optimization (next lecture)
- Compute I/O cost only
- We compute the cost to read the tables
- We dont compute the cost to write the result
(because pipelining)
20
21Example Sorting cost?
- Two pass multi-way merge sort
- Step 1
- Read M blocks at a time, sort, write
- Result have runs of length M on disk
- Step 2
- Merge M-1 at a time, write to disk
- Result have runs of length M(M-1) M2
- Cost nnnn B(R) ?
21
22Example Sorting cost?
- Two pass multi-way merge sort
- Step 1
- Read M blocks at a time, sort, write
- Result have runs of length M on disk
- Step 2
- Merge M-1 at a time, write to disk
- Result have runs of length M(M-1)?M2
- Cost 3B(R), Assumption B(R) ? M2
22
23SCAN
- Read entire contents of a table
- (Review Section 15.1.1-5 p.716)
24SCAN Scanning Tables
- -) The table is clustered (i.e. blocks consists
only of records from this table) - Table-scan if we know where the blocks are
- Index-scan if we have a sparse index to find the
blocks - -(
- The table is unclustered (e.g. its records are
placed on blocks with other tables) - May need one read for each record
24
25Cost of the Scan Operator
- Clustered relation
- Table scan B(R)
- Index scan B(R)
- Unclustered relation
- T(R)
- what if we want to sort as well?
25
26Cost of the Scan Operator
- Clustered relation
- Table scan B(R) to sort 3B(R)
- Index scan B(R) to sort B(R) or 3B(R)
- Unclustered relation
- T(R) to sort T(R) 2B(R)
26
27One pass algorithm
27
28One pass algorithmexample operators?
28
29One-pass Algorithms
- Selection ?(R), projection ?(R)
- Both are tuple-at-a-Time algorithms
- Cost B(R)
29
30One-pass Algorithms
- Duplicate elimination ?(R)
- Need to keep a dictionary in memory
- balanced search tree
- hash table
- etc
- Cost B(R)
- Assumption B(?(R)) lt M
30
31One-pass Algorithms
- Grouping ?city, sum(price) (R)
- Need to keep a dictionary in memory
- Also store the sum(price) for each city
- Cost B(R)
- Assumption number of cities fits in memory
31
32One-pass Algorithms
- Binary operations R n S, R U S, R S
- Assumption min(B(R), B(S)) lt M
- Scan one table first, then the next, eliminate
duplicates - Cost B(R)B(S)
32
33Nested loop join (15.3)
33
34Nested Loop Joins
- Tuple-based nested loop R S
- for each tuple r in R do
- for each tuple s in S do
- if r and s join then output (r,s)
- Cost T(R) T(S), sometimes T(R) B(S)
34
35Nested Loop Joins
- Block-based Nested Loop Join
- for each (M-1) blocks bs of S do
- for each block br of R do
- for each tuple s in bs do
- for each tuple r in br do
- if r and s join then
output(r,s)
35
36Nested Loop Joins
R S
Join Result
Hash table for block of S (k lt B-1 pages)
. . .
. . .
Input buffer for R
Output buffer
36
37Nested Loop Joins
- Block-based Nested Loop Join
- Cost
- Read S once cost B(S)
- Outer loop runs B(S)/(M-1) times, and each time
need to read R costs B(S)B(R)/(M-1) - Total cost B(S) B(S)B(R)/(M-1)
- Notice it is better to iterate over the smaller
relation first
37
38Two pass algorithm
38
39Two-Pass Algorithms Based on Sorting
- Duplicate elimination ?(R)
- Simple idea sort first, then eliminate
duplicates - Step 1 sort runs of size M, write
- Cost 2B(R)
- Step 2 merge M-1 runs, but include each tuple
only once - Cost B(R)
- Total cost 3B(R), Assumption B(R) lt M2
39
40Two-Pass Algorithms Based on Sorting
- Grouping ?city, sum(price) (R)
- Same as before sort, then compute the sum(price)
for each group - As before compute sum(price) during the merge
phase. - Total cost 3B(R)
- Assumption B(R) lt M2
40
41Two-Pass Algorithms Based on Sorting
- Binary operations R n S, R U S, R S
- Idea sort R, sort S, then do the right thing
- A closer look
- Step 1 split R into runs of size M, then split S
into runs of size M. Cost 2B(R) 2B(S) - Step 2 merge M/2 runs from R merge M/2 runs
from S ouput a tuple on a case by cases basis - Total cost 3B(R)3B(S)
- Assumption B(R)B(S)lt M2
41
42Two-Pass Algorithms Based on Sorting
- Join R S
- Start by sorting both R and S on the join
attribute - Cost 4B(R)4B(S) (because need to write to
disk) - Read both relations in sorted order, match tuples
- Cost B(R)B(S)
- Difficulty many tuples in R may match many in S
- If at least one set of tuples fits in M, we are
OK - Otherwise need nested loop, higher cost
- Total cost 5B(R)5B(S)
- Assumption B(R) lt M2, B(S) lt M2
42
43Two-Pass Algorithms Based on Sorting
- Join R S
- If the number of tuples in R matching those in S
is small (or vice versa) we can compute the join
during the merge phase - Total cost 3B(R)3B(S)
- Assumption B(R) B(S) lt M2
43
44Two Pass Algorithms Based on Hashing
- Idea partition a relation R into buckets, on
disk - Each bucket has size approx. B(R)/M
- Does each bucket fit in main memory ?
- Yes if B(R)/M lt M, i.e. B(R) lt M2
1
2
B(R)
44
45Hash Based Algorithms for ?
- Recall ?(R) ??duplicate elimination
- Step 1. Partition R into buckets
- Step 2. Apply ? to each bucket (may read in main
memory) - Cost 3B(R)
- AssumptionB(R) lt M2
45
46Hash Based Algorithms for ?
- Recall ?(R) ??grouping and aggregation
- Step 1. Partition R into buckets
- Step 2. Apply ? to each bucket (may read in main
memory) - Cost 3B(R)
- AssumptionB(R) lt M2
46
47Hash-based Join
- R S
- Recall the main memory hash-based join
- Scan S, build buckets in main memory
- Then scan R and join
47
48Partitioned Hash Join
- R S
- Step 1
- Hash S into M buckets
- send all buckets to disk
- Step 2
- Hash R into M buckets
- Send all buckets to disk
- Step 3
- Join every pair of buckets
48
49PartitionedHash-Join
- Partition both relations using hash fn h R
tuples in partition i will only match S tuples in
partition i.
50Partitioned Hash Join
- Cost 3B(R) 3B(S)
- Assumption min(B(R), B(S)) lt M2
50
51Hybrid Hash Join Algorithm
- When we have more memory B(S) ltlt M2
- Partition S into k buckets
- But keep first bucket S1 in memory, k-1 buckets
to disk - Partition R into k buckets
- First bucket R1 is joined immediately with S1
- Other k-1 buckets go to disk
- Finally, join k-1 pairs of buckets
- (R2,S2), (R3,S3), , (Rk,Sk)
51
52Hybrid Join Algorithm
- How big should we choose k ?
- Average bucket size for S is B(S)/k
- Need to fit B(S)/k (k-1) blocks in memory
- B(S)/k (k-1) lt M
- k slightly smaller than B(S)/M
52
53Hybrid Join Algorithm
- How many I/Os ?
- Recall cost of partitioned hash join
- 3B(R) 3B(S)
- Now we save 2 disk operations for one bucket
- Recall there are k buckets
- Hence we save 2/k(B(R) B(S))
- Cost (3-2/k)(B(R) B(S))
(3-2M/B(S))(B(R) B(S))
53
54Indexed Based Algorithms
- In a clustered index all tuples with the same
value of the key are clustered on as few blocks
as possible
54
55Index Based Selection
- Selection on equality ?av(R)
- Clustered index on a cost B(R)/V(R,a)
- Unclustered index on a cost T(R)/V(R,a)
55
56Index Based Selection
- Example B(R) 2000, T(R) 100,000, V(R, a)
20, compute the cost of ?av(R) - Cost of table scan
- If R is clustered B(R) 2000 I/Os
- If R is unclustered T(R) 100,000 I/Os
- Cost of index based selection
- If index is clustered B(R)/V(R,a) 100
- If index is unclustered T(R)/V(R,a) 5000
- Notice when V(R,a) is small, then unclustered
index is useless
56
57Index Based Join
- R S
- Assume S has an index on the join attribute
- Iterate over R, for each tuple fetch
corresponding tuple(s) from S - Assume R is clustered. Cost
- If index is clustered B(R) T(R)B(S)/V(S,a)
- If index is unclustered B(R) T(R)T(S)/V(S,a)
57
58Index Based Join
- Assume both R and S have a sorted index (B tree)
on the join attribute - Then perform a merge join (called zig-zag join)
- Cost B(R) B(S)
58