Title: Review for Midterm 2
1Review for Midterm 2
2Midterm 2
- Scheduled for April 30th
- 4 papers
- Variant indexes.
- Access path selection.
- Overview of query optimization.
- Mining Association Rules.
- Paper on cache management is not included because
it was covered by your project. - Midterm 2 is worth 35 of your grade.
3Variant Indexes
- A read-mostly database that is updated
infrequently. - Complex indexes to speedup queries.
- Focuses on physical designs to enhance
performance.
4Example Data Warehouse
- Key Observations
- A handful of products, a PROD table with tens of
rows. - Many millions of rows for SALES tables.
SALES
TIME
Cid Pid Day Amt dollar_cost Unit_sales
PROD
Day Week Month Year Holliday Weekday
Pid Name Size Weight Package_type
5A B-Tree on Major Holidays
- A B-tree index on different holidays of the
SALES table.
B-tree Leaf page
(Pres day, (1,2), (1, 3), (1, 4), (2,1), .
Value List
Joe, Big Mac, Lab day,
Jane, Happy Meal, Pres day,
Shideh, Happy Meal, Pres day,
Mary, Fries, Pres day,
Kam, Happy Meal, Pres day,
Harry, Big Mac, Pres day,
Henry, Big Mac, Pres day,
Bob, Big Mac, Pres day,
6A B-Tree on Major Holidays
- A B-tree index on different holidays of the
SALES table.
B-tree Leaf page
(Pres day, (1,2), (1, 3), (1, 4), (2,1), .
Value List
RID List
Joe, Big Mac, Lab day,
Jane, Happy Meal, Pres day,
Shideh, Happy Meal, Pres day,
Mary, Fries, Pres day,
Kam, Happy Meal, Pres day,
Harry, Big Mac, Pres day,
Henry, Big Mac, Pres day,
Bob, Big Mac, Pres day,
7Conjunctive Queries
- Count number of Big Mac Sales on Presidents
Day assuming a B-tree on product (pid) and day
of SALES - With RID-Lists
- Get the Value-List for Big Mac using the
B-tree, obtain RID-List1. - Get the Value-List for Presidents Day using
the B-tree, obtain RID-List2. - Compute set-intersect of RID-List1 and RID-List2
- Count the number of RIDs in the intersection set.
- Is there a better way?
- Yes, use bit-maps and logical bit-wise operands.
8Bitmap Indexes
- Use a bitmap to represent the existence of a
record with a certain attribute value. - Example If a record has the indexed attribute
value Big Mac then its corresponding entry in
the bitmap is set to one. Otherwise, it is a
zero.
9A Bitmap
- A Bitmap B is defined on T as a sequence of M
bits. - For each row r with row number j that has the
property P, we set bit j in B to one all other
bits are set to zero. - Assuming fix sized disk pages that hold p
records, RID of record j is (j/p, jp). Page is
j/p, slot number is jp.
Pres Day, 0100001100111111110000011001..
Record 0
10A Bitmap
- A Bitmap B is defined on T as a sequence of M
bits. - For each row r with row number j that has the
property P, we set bit j in B to one all other
bits are set to zero. - Assuming fix sized disk pages that hold p
records, RID of record j is (j/p, jp). Page is
j/p, slot number is jp.
Pres Day, 0100001100111111110000011001..
Record 1
11A Bitmap
- A Bitmap B is defined on T as a sequence of M
bits. - For each row r with row number j that has the
property P, we set bit j in B to one all other
bits are set to zero. - Assuming fix sized disk pages that hold p
records, RID of record j is (j/p, jp). Page is
j/p, slot number is jp.
Pres Day, 0100001100111111110000011001..
Record 2
12A B-Tree on Major Holidays
- A B-tree index on different holidays of the
SALES table.
B-tree Leaf page
(Pres day, 01111111.
Joe, Big Mac, Lab day,
Jane, Happy Meal, Pres day,
Shideh, Happy Meal, Pres day,
Mary, Fries, Pres day,
Kam, Happy Meal, Pres day,
Harry, Big Mac, Pres day,
Henry, Big Mac, Pres day,
Bob, Big Mac, Pres day,
13Conjunctive Queries
- Count number of Big Mac Sales on Presidents
Day assuming a B-tree on product (pid) and day
of SALES - With RID
- Get the Value-List for Big Mac using the
B-tree, obtain RID-List1. - Get the Value-List for Presidents Day using
the B-tree, obtain RID-List2. - Compute set-intersect of RID-List1 and RID-List2
- Count the number of RIDs in the intersection set.
- With bit maps
- Get the Value-List for Big Mac using the
B-tree, obtain bit-map1. - Get the Value-List for Presidents Day using
the B-tree, obtain bit-map2. - Recall Existence Bitmap (EBM) identify rows that
exist. - Let RES logical AND of bit-map1, bit-map2, and
EBM. - Count the number of bits set to one to identify
how many Big Macs were sold on Presidents Day.
14Variant Indexes
- Midterm 2 ignores
- MEDIAN, N-TILE, Column-Product as aggregates.
- Section 5.
15Access Path Selection
- Formulates a cost prediction for each access
plan, using the following cost formula - COST Page fetches W (RSI Calls)
- W is an adjustable weighting factor between I/O
and CPU. - RSI calls is an approximation for CPU
utilization. - Assumptions
- WHERE tree is considered to be in conjunctive
normal form, - Every disjunct is called a boolean factor.
16How?
- Enumerating the different execution plans,
- Estimate the cost of performing each plan,
- Pick the cheapest plan.
- Definition of cost is as follows
- COST Page fetches W (RSI Calls)
17Clustered B-Tree
- A B-tree on the gpa attribute
3.6
(3.7, (3, 1))
(3.9, (4,1))
(2.3, (1, 1))
(3, (2,1))
(3.8, (3,2))
(3.9, (4,2))
(2.5, (1,2))
(3.1, (2,2))
(3.8, (3,3))
(4, (4,3))
(2.8, (1,3))
(3.2, (2,3)
(3.8, (3,4))
(2.8, (1,4))
(4, (4,4))
(3.5, (2,4))
Bob, 21, 3.7, CS
Mary, 24, 3, ECE
Chris, 22, 3.9, CS
Chad, 28, 2.3, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
James, 24, 3.1, ME
Chang, 18, 2.5, CS
Kane, 19, 3.8, ME
Lam, 22, 2.8, ME
Louis, 32, 4, LS
Tom, 20, 3.2, EE
Martha, 29, 3.8, CS
Pat, 19, 2.8, EE
Leila, 20, 3.5, LS
Shideh, 16, 4, CS
18Non-Clustered B-Tree
- A random I/O for every qualifying record
3.6
(3.7, (1, 1))
(3.9, (4,1))
(2.3, (4, 2))
(3, (1,2))
(3.8, (3,2))
(3.9, (2,4))
(2.5, (2,3))
(3.1, (3,3))
(3.8, (2,1))
(4, (3,1))
(2.8, (2,2))
(3.2, (1,3)
(3.8, (1,4))
(2.8, (3,4))
(4, (4,4))
(3.5, (4,3))
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chris, 22, 3.9, CS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
Shideh, 16, 4, CS
19Questions
- How are relations and segments related? If slide
4, you state segments may contain more than one
relation, and then the next bullet says at most
one relation per segment.. What is going on?
20Questions
- How are relations and segments related? If slide
4, you state segments may contain more than one
relation, and then the next bullet says at most
one relation per segment.. What is going on? - Best clarified with an example
- Segment 1 may contain the Emp, Dept, and Revenues
tables/relations. - The Emp relation can be assigned to Segment 1
only. It may NOT be assigned to both Segments 1
and 2.
21Questions
- NINDX(I), the number of pages in index I.
- TCARD(T), the number of pages in the segment that
hold tuples of relation T.
- The cost of retrieving a range of records from a
clustered B-tree Should not this be
(depth_of_B-tree F(pred) TCARD) or something
related to the depth of the tree rather than
NINDX, since you only have to navigate through to
the leaf nodes once and do a record scan once
youve reached the correct leaf node?
22Question
- NINDX(I), the number of pages in index I.
- TCARD(T), the number of pages in the segment that
hold tuples of relation T.
- Should not the cost for a non-clustered B-tree
also involve the depth of the tree rather than
NINDX?
23Question
- NINDX(I), the number of pages in index I.
- TCARD(T), the number of pages in the segment that
hold tuples of relation T.
- Should not the cost for a non-clustered B-tree
also involve the depth of the tree rather than
NINDX? - It should include the depth of the tree.
- NINDX must be included because the leaf pages of
the B-tree must be visited for the qualifying
records.
24Questions
- NINDX(I), the number of pages in index I.
- TCARD(T), the number of pages in the segment that
hold tuples of relation T.
- Why is the cost of a merge-scan NINDX(R)
NINDX(S) rather than the sum of segment scans of
R and S, since you still have to visit every page
that contains tuples of R and S?
25Questions
- NINDX(I), the number of pages in index I.
- TCARD(T), the number of pages in the segment that
hold tuples of relation T.
- Why is the cost of a merge-scan NINDX(R)
NINDX(S) rather than the sum of segment scans of
R and S, since you still have to visit every page
that contains tuples of R and S? - The merge-scan employs the sorted order of the
entries in the leaf pages of the B-tree index
structures.
26Overview of Query Optimization
- Extends discussion to
- Correlation queries, use of outer-join to
flatten nested queries. - Raises more questions than providing answers.
- A good starting point for
- A practitioner who wants to build an optimizer
for a relational DBMS. - A Ph.D. student interested in writing a
dissertation in the area of query optimization
techniques.
27Mining Association Rules
- Objective Discover association Rule over basket
data. - Motivation valuable for cross-marketing and
attached mailing applications. - Example 98 of customers who purchase tires and
auto accessories also get automotive services
done. - Key contributions
- Fast algorithms
- Apriori, AprioriTid, and AprioriHybrid
- Pay attention to terminology, definitions, and
the general framework.
28Database Systems
- This course has introduced you to
- A storage manager and its use, BDB.
- Spatial indexing, R-Trees.
- Parallel DBMS.
- Alternative technologies for applications that do
not require ACID transactions. - Google FS, MapReduce, etc.
- Optimization techniques for relational DBMSs.
- Knowledge Discovery.
- Efficient query processing techniques.
- RAID and Use of flash memory in enterprises
- Steve Kleimans lecture.
29Database Systems
- Many important topics remain
- Data mining, Data cubes, Data visualization
techniques. - Papers are from 1990s Neumerous follow-on
papers!