Title: Database Management Systems Session 9
1. Database Management Systems Session 9
- Instructor: Vinnie Costa (vcosta_at_optonline.net)
2. The $100 Laptop Moves Closer to Reality
- CNET News.com, 9/28: Nicholas Negroponte, the co-founder of the Media Lab at the Massachusetts Institute of Technology, detailed specifications for a $100 windup-powered laptop targeted at children in developing nations. "This is the most important thing I have ever done in my life," Negroponte said during a presentation at Technology Review's Emerging Technologies Conference at MIT. "Reception has been incredible. The idea is simple. It's an education project, not a laptop project. If we can make education better--particularly primary and secondary schools--it will be a better world." The laptops would save on software costs by running the open-source Linux operating system rather than Windows, and reduce the need for costly base stations by participating in "mesh" networks and connecting to Wi-Fi wireless networks.
- MIT's One Laptop Per Child initiative seeks to produce between 5 million and 15 million $100 laptops within a year. The MIT team plans to continually push the cost of the laptops down by incorporating low-power enhancements such as "electronic ink" displays. - CNET News.com, 9/28
3. The Rest Of DBMS In A Nutshell
4. Good Side Of Technology
- Nicholas Negroponte (born 1943) is a Greek-American computer scientist best known as the founder and director of the Massachusetts Institute of Technology's Media Lab.
- Born the son of a Greek ship owner on New York City's Upper East Side, Nicholas Negroponte studied at MIT, where as a graduate student he specialized in the field of computer-aided design. He joined the faculty of MIT in 1966. He is the brother of the United States Director of National Intelligence, John Negroponte.
- He was a founder of WiReD magazine and has been an "angel investor" for over 40 start-ups, including three in China. Professor Negroponte helped to establish, and serves as chairman of, the 2B1 Foundation, an organization dedicated to bringing computer access to children in the most remote and poorest parts of the world.
5. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
6. Overview of Storage and Indexing
7. Data on External Storage
- Disks: can retrieve a random page at fixed cost
- But reading several consecutive pages is much cheaper than reading them in random order
- Tapes: can only read pages in sequence
- Cheaper than disks; used for archival storage
- File organization: method of arranging a file of records on external storage.
- Record id (rid) is sufficient to physically locate a record
- Indexes are data structures that allow us to find the record ids of records with given values in index search key fields
- Architecture: the buffer manager stages pages from external storage to the main memory buffer pool. File and index layers make calls to the buffer manager.
8. Alternative File Organizations
- Many alternatives exist, each ideal for some situations, and not so good in others:
- Heap (random order) files: suitable when typical access is a file scan retrieving all records.
- Sorted files: best if records must be retrieved in some order, or only a range of records is needed.
- Indexes: data structures to organize records via trees or hashing.
- Like sorted files, they speed up searches for a subset of records, based on values in certain (search key) fields
- Updates are much faster than in sorted files.
9. Indexes
- An index on a file speeds up selections on the search key fields for the index
- Any subset of the fields of a relation can be the search key for an index on the relation
- Search key is not the same as key (a minimal set of fields that uniquely identify a record in a relation)
- An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k
- Given data entry k*, we can find the record with key k in at most one disk I/O
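As a concrete sketch (table and index names are made up for illustration; CREATE INDEX is supported by most DBMSs, though it is not part of the SQL-92 standard):

CREATE TABLE Emp (ssn INTEGER PRIMARY KEY, ename CHAR(30), age INTEGER, sal REAL);
CREATE INDEX emp_age_idx ON Emp (age);
-- data entries for search key <age> let this query avoid a full file scan:
SELECT * FROM Emp WHERE age = 30;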
10. B+ Tree Indexes
- Leaf pages contain data entries, and are chained (prev & next)
- Non-leaf pages have index entries; only used to direct searches

[Figure: non-leaf pages above leaf pages (sorted by search key); each non-leaf page holds an index entry sequence <P0, K1, P1, K2, P2, ..., Km, Pm>, where the Ki are keys and the Pi are page pointers]
11. Example B+ Tree
Note how data entries in the leaf level are sorted.

[Figure: B+ tree with root entry 17; entries < 17 go to the left subtree, entries > 17 to the right. Left child <5, 13> points to leaves (2, 3), (5, 7, 8), (14, 16); right child <27, 30> points to leaves (22, 24), (27, 29), (33, 34), (38, 39).]

- Find 28? 29? All > 15 and < 30?
- Insert/delete: find the data entry in a leaf, then change it. Need to adjust the parent sometimes.
- And the change sometimes bubbles up the tree
12. Hash-Based Indexes
- Good for equality selections.
- Index is a collection of buckets
- Bucket = primary page plus zero or more overflow pages
- Buckets contain data entries.
- Hashing function h: h(r) = bucket in which (the data entry for) record r belongs. h looks at the search key fields of r.
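A minimal sketch, assuming PostgreSQL syntax for the hash index (support and syntax vary by DBMS; the names are hypothetical):

CREATE INDEX emp_ssn_hash ON Emp USING HASH (ssn);
-- equality selection: h(ssn) locates the right bucket directly
SELECT * FROM Emp WHERE ssn = 123456789;
-- a hash index cannot help a range selection such as ssn > 123456789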
13. Index Classification
- Primary vs. secondary: if the search key contains the primary key, it is called a primary index.
- Unique index: search key contains a candidate key.
- Clustered vs. unclustered: if the order of data records is the same as, or close to, the order of data entries, it is called a clustered index.
- A file can be clustered on at most one search key.
- Cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!
14. Clustered vs. Unclustered Index
- Suppose that the data records are stored in a Heap file
- To build a clustered index, first sort the Heap file (with some free space on each page for future inserts).
- Overflow pages may be needed for inserts. (Thus, the order of data recs is close to, but not identical to, the sort order.)

[Figure: in the CLUSTERED case, index entries direct the search to data entries whose order matches the data records in the data file; in the UNCLUSTERED case, data entries point to data records scattered throughout the file]
15. Cost Model for Our Analysis
- We ignore CPU costs, for simplicity:
- B: the number of data pages
- R: number of records per page
- D: (average) time to read or write a disk page
- Measuring the number of page I/Os ignores gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated.
- Average-case analysis; based on several simplistic assumptions.
- Good enough to show the overall trends!
16. Comparing File Organizations
- Heap files (random order; insert at eof)
- Sorted files, sorted on <age, sal>
- Clustered B+ tree file, search key <age, sal>
- Heap file with unclustered B+ tree index on search key <age, sal>
- Heap file with unclustered hash index on search key <age, sal>
17. Operations to Compare
- Scan: fetch all records from disk
- Equality search
- Range selection
- Insert a record
- Delete a record
18. Understanding the Workload
- For each query in the workload:
- Which relations does it access?
- Which attributes are retrieved?
- Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
- For each update in the workload:
- Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
- The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.
19. Choice of Indexes
- What indexes should we create?
- Which relations should have indexes? What
field(s) should be the search key? Should we
build several indexes? - For each index, what kind of an index should it
be? - Clustered? Hash/tree?
20. Choice of Indexes (Contd.)
- One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
- Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans!
- For now, we discuss simple 1-table queries.
- Before creating an index, we must also consider the impact on updates in the workload!
- Trade-off: indexes can make queries go faster, updates slower. They require disk space, too.
21. Index Selection Guidelines
- Attributes in the WHERE clause are candidates for index keys.
- Exact match condition suggests hash index.
- Range query suggests tree index.
- Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.
- Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
- Order of attributes is important for range queries.
- Such indexes can sometimes enable index-only strategies for important queries (see the sketch below).
- For index-only strategies, clustering is not important!
- Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.
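For example (a hypothetical sketch; whether the optimizer actually chooses an index-only plan depends on the DBMS):

CREATE INDEX emp_dno_sal_idx ON Emp (dno, sal);
-- the multi-attribute key matches both conditions:
SELECT * FROM Emp WHERE dno = 10 AND sal > 50000;
-- index-only strategy: every attribute the query needs is in the index,
-- so the data records themselves are never fetched:
SELECT E.dno, AVG(E.sal) FROM Emp E GROUP BY E.dno;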
22. Examples of Clustered Indexes

SELECT E.dno FROM Emp E WHERE E.age > 40

- B+ tree index on E.age can be used to get qualifying rows
- How selective is the condition?
- Is the index clustered?
- Consider the GROUP BY query:
- If many tuples have E.age > 10, using the E.age index and sorting the retrieved rows may be costly
- Clustered E.dno index may be better!
- Equality queries and duplicates:
- Clustering on E.hobby helps!

SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age > 10 GROUP BY E.dno

SELECT E.dno FROM Emp E WHERE E.hobby = 'Stamps'
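How clustering is declared varies by system; two illustrative forms (index names are hypothetical):

CREATE CLUSTERED INDEX emp_dno_cl ON Emp (dno);   -- Microsoft SQL Server: table kept organized by this index
CLUSTER Emp USING emp_dno_idx;                    -- PostgreSQL: one-time physical reordering on an existing index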
23. Indexes with Composite Search Keys
- Composite search keys: search on a combination of fields.
- Equality query: every field value is equal to a constant value. E.g., w.r.t. a <sal, age> index:
- age = 20 and sal = 75
- Range query: some field value is not a constant. E.g.:
- age = 20 and sal > 10
- Data entries in the index are sorted by search key to support range queries.
- Lexicographic order, or
- Spatial order.
Examples of composite key indexes using lexicographic order:

Data records sorted by name:
  name  age  sal
  bob   12   10
  cal   11   80
  joe   12   20
  sue   13   75

Data entries in index sorted by <age, sal>: (11,80) (12,10) (12,20) (13,75)
Data entries sorted by <age>: 11, 12, 12, 13
Data entries in index sorted by <sal, age>: (10,12) (20,12) (75,13) (80,11)
Data entries sorted by <sal>: 10, 20, 75, 80
24. Composite Search Keys
- To retrieve Emp records with age = 30 AND sal = 4000, an index on <age, sal> would be better than an index on age or an index on sal
- Choice of index key is orthogonal to clustering etc.
- If the condition is 20 < age < 30 AND 3000 < sal < 5000:
- Clustered tree index on <age, sal> or <sal, age> is best
- If the condition is age = 30 AND 3000 < sal < 5000:
- Clustered <age, sal> index much better than <sal, age> index!
25. Summary
- Many alternative file organizations exist, each appropriate in some situation.
- If selection queries are frequent, sorting the file or building an index is important.
- Hash-based indexes are only good for equality search.
- Sorted files and tree-based indexes are best for range search; also good for equality search. (Files are rarely kept sorted in practice; a B+ tree index is better.)
- An index is a collection of data entries plus a way to quickly find entries with given key values.
26. Summary (Contd.)
- Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
- Can have several indexes on a given file of data records, each with a different search key.
- Indexes can be classified as clustered vs.
unclustered, primary vs. secondary. Differences
have important consequences for
utility/performance.
27. Summary (Contd.)
- Understanding the nature of the workload for the
application, and the performance goals, is
essential to developing a good design. - What are the important queries and updates? What
attributes/relations are involved? - Indexes must be chosen to speed up important
queries (and perhaps some updates!). - Index maintenance overhead on updates to key
fields. - Choose indexes that can help many queries, if
possible. - Build indexes to support index-only strategies.
- Clustering is an important decision; only one index on a given relation can be clustered!
- Order of fields in a composite index key can be important.
28. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
29. Overview of Query Evaluation
30. Some Common Techniques
- Algorithms for evaluating relational operators use some simple ideas extensively:
- Indexing: can use WHERE conditions to retrieve a small set of tuples (selections, joins)
- Iteration: sometimes, faster to scan all tuples even if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.)
- Partitioning: by using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.
Watch for these techniques as we discuss query
evaluation!
31. Statistics and Catalogs
- Need information about the relations and indexes involved. Catalogs typically contain at least:
- # tuples (NTuples) and # pages (NPages) for each relation
- # distinct key values (NKeys) and NPages for each index
- Index height, low/high key values (Low/High) for each tree index
- Catalogs are updated periodically
- Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency is ok.
- More detailed information (e.g., histograms of the values in some field) is sometimes stored.
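For instance, in PostgreSQL (one concrete system; catalog names differ elsewhere), statistics can be refreshed and inspected like this:

ANALYZE Sailors;  -- recompute optimizer statistics for the table
SELECT relname, reltuples, relpages FROM pg_class WHERE relname = 'sailors';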
32. Highlights of System R Optimizer
- Impact:
- Most widely used currently; works well for < 10 joins.
- Cost estimation: approximate art at best
- Statistics, maintained in system catalogs, are used to estimate cost of operations and result sizes.
- Considers combination of CPU and I/O costs.
- Plan space: too large, must be pruned.
- Only the space of left-deep plans is considered
- Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
33. Cost Estimation
- For each plan considered, must estimate cost:
- Must estimate cost of each operation in the plan tree.
- Depends on input cardinalities.
- We've already discussed how to estimate the cost of operations (sequential scan, index scan,
joins, etc.) - Must also estimate size of result for each
operation in tree! - Use information about the input relations.
- For selections and joins, assume independence of
predicates.
34. Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

- Similar to the old schema; rname added for variations.
- Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
35. Motivating Example

SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5

[Figure: the equivalent RA tree and a plan that joins Reserves and Sailors first, then applies the selections and the projection on the fly]

- Cost: 500 + 500*1000 I/Os (simple nested loops join, with Sailors as the outer relation)
- By no means the worst plan!
- Misses several opportunities: selections could have been pushed earlier, no use is made of any available indexes, etc.
- Goal of optimization: to find more efficient plans that compute the same answer.
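Most systems let you inspect the plan the optimizer actually picks; e.g., in PostgreSQL or MySQL (output format varies by system):

EXPLAIN SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5;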
36. Alternative Plans 1 (No Indexes)
- Main difference: push selects.
- With 5 buffers, cost of plan:
- Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution).
- Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
- Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)
- Total: 3560 page I/Os.
- If we use a BNL join, join cost = 10 + 4*250, total cost = 2770.
- If we push projections, T1 has only sid, T2 only sid and sname:
- T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.
37. Alternative Plans 2 (With Indexes)

[Plan: on-the-fly projection on sname, above an on-the-fly selection rating > 5, above an index nested loops join on sid = sid with pipelining; the outer is the selection bid = 100 on Reserves, evaluated with a hash index and not written to a temp; the inner is Sailors]

- With a clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
- INL with pipelining (outer is not materialized).
- Projecting out unnecessary fields from the outer doesn't help.
- Join column sid is a key for Sailors.
- At most one matching tuple, so an unclustered index on sid is OK.
- The decision not to push rating > 5 before the join is based on the availability of the sid index on Sailors.
- Cost: selection of Reserves tuples (10 I/Os); for each, must get the matching Sailors tuple (1000*1.2); total 1210 I/Os.
38. Summary
- There are several alternative evaluation
algorithms for each relational operator - A query is evaluated by converting it to a tree
of operators and evaluating the operators in the
tree - Must understand query optimization in order to
fully understand the performance impact of a
given database design (relations, indexes) on a
workload (set of queries)
- Two parts to optimizing a query:
- Consider a set of alternative plans.
- Must prune search space; typically, left-deep plans only.
- Must estimate cost of each plan that is considered.
- Must estimate size of result and cost for each plan node.
- Key issues: statistics, indexes, operator implementations.
39. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
40. Schema Refinement and Normal Forms
41. The Evils of Redundancy
- Redundancy is at the root of several problems associated with relational schemas:
- redundant storage, insert/delete/update anomalies
- Integrity constraints, in particular functional dependencies, can be used to identify schemas with such problems and to suggest refinements
- Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD)
- Decomposition should be used judiciously:
- Is there reason to decompose a relation?
- What problems (if any) does the decomposition
cause?
42. Functional Dependencies (FDs)
- A functional dependency X → Y holds over relation R if, for every allowable instance r of R:
- for all t1 ∈ r, t2 ∈ r: πX(t1) = πX(t2) implies πY(t1) = πY(t2)
- i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.)
- An FD is a statement about all allowable relations.
- Must be identified based on semantics of application.
- Given some allowable instance r1 of R, we can check if it violates some FD f, but we cannot tell if f holds over R!
- K is a candidate key for R means that K → R
- However, K → R does not require K to be minimal!
43. Example: Constraints on Entity Set
- Consider the relation obtained from Hourly_Emps:
- Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)
- Notation: we will denote this relation schema by listing the attributes: SNLRWH
- This is really the set of attributes {S,N,L,R,W,H}.
- Sometimes, we will refer to all attributes of a relation by using the relation name. (e.g., Hourly_Emps for SNLRWH)
- Some FDs on Hourly_Emps:
- ssn is the key: S → SNLRWH
- rating determines hrly_wages: R → W
44. Example (Contd.)

[Figure: ER diagram of the Hourly_Emps entity set with attributes ssn, name, lot, rating, hourly_wages, hours_worked]
45. Example (Contd.)

[Tables: Hourly_Emps2 (ssn, name, lot, rating, hours_worked) and Wages (rating, hourly_wages)]

- Problems due to R → W:
- Update anomaly: can we change W in just the 1st tuple of SNLRWH?
- Insertion anomaly: what if we want to insert an employee and don't know the hourly wage for his rating?
- Deletion anomaly: if we delete all employees with rating 5, we lose the information about the wage for rating 5!

Will 2 smaller tables be better?
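A sketch of the two smaller tables in SQL (column types are assumed, since the slide gives only attribute names):

CREATE TABLE Wages (rating INTEGER PRIMARY KEY, hourly_wages REAL);
CREATE TABLE Hourly_Emps2 (ssn CHAR(11) PRIMARY KEY, name CHAR(30), lot INTEGER,
    rating INTEGER REFERENCES Wages(rating), hours_worked REAL);
-- R → W now holds trivially: rating is the key of Wages, so each rating's
-- wage is stored exactly once, and the three anomalies disappear.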
46. Normal Forms
- Returning to the issue of schema refinement, the first question to ask is: is any refinement needed?
- If a relation is in a certain normal form (BCNF, 3NF, etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.
- Role of FDs in detecting redundancy:
- Consider a relation R with 3 attributes, ABC.
- No FDs hold: there is no redundancy here.
- Given A → B: several tuples could have the same A value, and if so, they'll all have the same B value (see the instance sketched below)!
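For instance, a hypothetical instance of ABC with A → B:

  A   B   C
  a1  b1  c1
  a1  b1  c2
  a2  b2  c3

The pair (a1, b1) is recorded twice; updating B in one of those tuples but not the other would violate A → B.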
47. Boyce-Codd Normal Form (BCNF)
- In other words, a relation R is in BCNF if the only non-trivial FDs that hold over R are key constraints
- No dependency in R can be predicted using FDs alone
- If we are shown two tuples that agree upon the X value, we cannot infer the A value in one tuple from the A value in the other
- If the example relation is in BCNF, the 2 tuples must be identical (since X is a key)

[Figure: a relation with a KEY attribute and nonkey attributes attr1, attr2, ..., attrk]

BCNF ensures that no redundancy can be detected using FD information alone. It is the most desirable normal form!
48. Third Normal Form (3NF)
- If R is in BCNF, it's obviously in 3NF.
- If R is in 3NF, some redundancy is possible. It
is a compromise, used when BCNF not achievable
(e.g., no good decomp, or performance
considerations). - Lossless-join, dependency-preserving
decomposition of R into a collection of 3NF
relations is always possible - Thus, 3NF is indeed a compromise relative to
BCNF.
49. Decomposition of a Relation Scheme
- Suppose that relation R contains attributes A1
... An. A decomposition of R consists of
replacing R by two or more relations such that - Each new relation scheme contains a subset of the
attributes of R (and no attributes that do not
appear in R), and - Every attribute of R appears as an attribute of
one of the new relations. - Intuitively, decomposing R means we will store
instances of the relation schemes produced by the
decomposition, instead of instances of R. - E.g., Can decompose SNLRWH into SNLRH and RW.
50. Example Decomposition
- Decompositions should be used only when needed.
- SNLRWH has FDs S → SNLRWH and R → W
- Second FD causes violation of 3NF: W values
repeatedly associated with R values. Easiest way
to fix this is to create a relation RW to store
these associations, and to remove W from the main
schema - i.e., we decompose SNLRWH into SNLRH and RW
- The information to be stored consists of SNLRWH
tuples. If we just store the projections of
these tuples onto SNLRH and RW, are there any
potential problems that we should be aware of?
51. Problems with Decompositions
- There are three potential problems to consider:
- Some queries become more expensive.
- e.g., How much did sailor Joe earn? (salary = W*H)
- Given instances of the decomposed relations, we
may not be able to reconstruct the corresponding
instance of the original relation! - Fortunately, not in the SNLRWH example.
- Checking some dependencies may require joining
the instances of the decomposed relations. - Fortunately, not in the SNLRWH example.
- Tradeoff: must consider these issues vs. redundancy.
52. Lossless Join Decompositions
- Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F:
- πX(r) ⋈ πY(r) = r
- It is always true that r ⊆ πX(r) ⋈ πY(r)
- In general, the other direction does not hold! If it does, the decomposition is lossless-join.
- Definition extended to decomposition into 3 or more relations in a straightforward way.
- It is essential that all decompositions used to deal with redundancy be lossless! (Avoids Problem (2).)
53. More on Lossless Join
- The decomposition of R into X and Y is lossless-join w.r.t. F if and only if the closure of F contains:
- X ∩ Y → X, or
- X ∩ Y → Y
- In particular, the decomposition of R into UV and R - V is lossless-join if U → V holds over R.
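In SQL terms, losslessness means joining the projections gives back exactly the original instance; for the Hourly_Emps decomposition (a sketch, using the table names from the earlier example):

SELECT E.ssn, E.name, E.lot, E.rating, W.hourly_wages, E.hours_worked
FROM Hourly_Emps2 E, Wages W
WHERE E.rating = W.rating;
-- lossless because rating = X ∩ Y is the key of Wages, i.e., X ∩ Y → Y holds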
54. Dependency Preserving Decomposition
- Dependency preserving decomposition (Intuitive)
- If R is decomposed into X, Y and Z, and we
enforce the FDs that hold on X, on Y and on Z,
then all FDs that were given to hold on R must
also hold. (Avoids Problem (3).)
55. Summary of Normalization
- 1NF (Eliminate Repeating Groups): make a separate table for each set of related attributes, and give each table a primary key
- 2NF (Eliminate Redundant Data): if an attribute depends on only part of a multi-valued key, remove it to a separate table
- 3NF (Eliminate Columns Not Dependent On Key): if attributes do not contribute to a description of the key, remove them to a separate table
- BCNF (Boyce-Codd Normal Form): if there are non-trivial dependencies between candidate key attributes, separate them out into distinct tables

http://www.datamodel.org/NormalizationRules.html
56. Refining an ER Diagram

[Figure: before and after versions of the ER diagram; note that Employees and Works_In are mapped to a single relation, Workers]

- 1st diagram translated: Workers(S,N,L,D,S), Departments(D,M,B)
- Lots associated with workers.
- Suppose all workers in a dept are assigned the same lot: D → L
- Redundancy fixed by: Workers2(S,N,D,S), Dept_Lots(D,L)
- Can fine-tune this: Workers2(S,N,D,S), Departments(D,M,B,L)
57. Summary of Schema Refinement
- If a relation is in BCNF, it is free of
redundancies that can be detected using FDs.
Thus, trying to ensure that all relations are in
BCNF is a good heuristic. - If a relation is not in BCNF, we can try to
decompose it into a collection of BCNF relations. - Must consider whether all FDs are preserved. If
a lossless-join, dependency preserving
decomposition into BCNF is not possible (or
unsuitable, given typical queries), should
consider decomposition into 3NF. - Decompositions should be carried out and/or
re-examined while keeping performance
requirements in mind.
58. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
59. Physical Database Design
60. Overview
- After ER design, schema refinement, and the
definition of views, we have the conceptual and
external schemas for our database. - The next step is to choose indexes, make
clustering decisions, and to refine the
conceptual and external schemas (if necessary) to
meet performance goals. - We must begin by understanding the workload
- The most important queries and how often they
arise. - The most important updates and how often they
arise. - The desired performance for these queries and
updates.
61. Decisions to Make
- What indexes should we create?
- Which relations should have indexes? What
field(s) should be the search key? Should we
build several indexes? - For each index, what kind of an index should it
be? - Clustered? Hash/tree?
- Should we make changes to the conceptual schema?
- Consider alternative normalized schemas?
(Remember, there are many choices in decomposing
into BCNF, etc.) - Should we undo some decomposition steps and
settle for a lower normal form?
(Denormalization.) - Horizontal partitioning, replication, views ...
62. Index Selection for Joins
- When considering a join condition:
- Hash index on inner is very good for Index Nested Loops
- Should be clustered if join column is not a key for inner, and inner tuples need to be retrieved
- Clustered B+ tree on join column(s) is good for Sort-Merge
63. Example 1

SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname = 'Toy' AND E.dno = D.dno

- Hash index on D.dname supports the 'Toy' selection (declared in the sketch below).
- Given this, an index on D.dno is not needed.
- Hash index on E.dno allows us to get matching (inner) Emp rows for each selected (outer) Dept row
- What if the WHERE clause included ... AND E.age = 25?
- Could retrieve Emp rows using an index on E.age, then join with Dept rows satisfying the dname selection. Comparable to the strategy that used the E.dno index.
- So, if the E.age index is already created, this query provides much less motivation for adding an E.dno index.
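The indexes this plan assumes could be declared as follows (hash-index syntax per PostgreSQL, as one example; index names are hypothetical):

CREATE INDEX dept_dname_idx ON Dept USING HASH (dname);  -- for the 'Toy' selection
CREATE INDEX emp_dno_idx ON Emp USING HASH (dno);        -- inner index for INL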
64. Example 2

SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE E.sal BETWEEN 10000 AND 20000 AND E.hobby = 'Stamps' AND E.dno = D.dno

(The sal condition is a range selection; the hobby condition is an equality selection.)

- Clearly, Emp should be the outer relation.
- Suggests that we build a hash index on D.dno.
- What index should we build on Emp?
- B+ tree on E.sal could be used, OR an index on E.hobby could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions.
- As a rule of thumb, equality selections are more selective than range selections.
- As both examples indicate, our choice of indexes
is guided by the plan(s) that we expect an
optimizer to consider for a query. Have to
understand optimizers!
65. Clustering and Joins

SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname = 'Toy' AND E.dno = D.dno

- Clustering is especially important when accessing inner rows in INL.
- Should make the index on E.dno clustered.
- Suppose that the WHERE clause is instead:
- WHERE E.hobby = 'Stamps' AND E.dno = D.dno
- If many employees collect stamps, Sort-Merge join may be worth considering. A clustered index on D.dno would help.
- Summary: clustering is useful whenever many rows are to be retrieved.
66. Tuning the Conceptual Schema
- The choice of conceptual schema should be guided by the workload, in addition to redundancy issues:
- We may settle for a 3NF schema rather than BCNF
- Workload may influence the choice we make in decomposing a relation into 3NF or BCNF
- We may further decompose a BCNF schema!
- We might denormalize (i.e., undo a decomposition step), or we might add fields to a relation.
- If such changes are made after a database is in use, this is called schema evolution; we might want to mask some of these changes from applications by defining views.
67. Tuning Queries and Views
- If a query runs slower than expected, check if an index needs to be re-built, or if statistics are too old.
- Sometimes, the DBMS may not be executing the plan you had in mind. Common areas of weakness:
- Selections involving null values
- Selections involving arithmetic or string
expressions - Selections involving OR conditions.
- Lack of evaluation features like index-only
strategies or certain join methods or poor size
estimation - Check the plan that is being used! Then adjust
the choice of indexes or rewrite the query/view
68. More Guidelines for Query Tuning
- Minimize the use of DISTINCT: don't need it if duplicates are acceptable, or if the answer contains a key.
- Minimize the use of GROUP BY and HAVING:

SELECT MIN (E.age) FROM Employee E GROUP BY E.dno HAVING E.dno = 102

SELECT MIN (E.age) FROM Employee E WHERE E.dno = 102

(The second version avoids grouping the entire table only to discard every group but dno = 102.)
69. Summary
- Database design consists of several tasks: requirements analysis, conceptual design, schema refinement, physical design, and tuning.
- In general, have to go back and forth between
these tasks to refine a database design, and
decisions in one task can influence the choices
in another task. - Understanding the nature of the workload for the
application, and the performance goals, is
essential to developing a good design - What are the important queries and updates? What
attributes/relations are involved?
70. Summary
- The conceptual schema should be refined by considering performance criteria and workload:
- May choose 3NF or lower normal form over BCNF.
- May choose among alternative decompositions into
BCNF (or 3NF) based upon the workload. - May denormalize, or undo some decompositions.
- May decompose a BCNF relation further!
71. Summary (Contd.)
- Over time, indexes have to be fine-tuned (dropped, created, re-built, ...) for performance.
- Should determine the plan used by the system, and adjust the choice of indexes appropriately.
- The system may still not find a good plan:
- Only left-deep plans are considered!
- Null values, arithmetic conditions, string expressions, the use of ORs, etc. can confuse an optimizer.
- So, we may have to rewrite the query/view:
- Avoid nested queries, temporary relations, complex conditions, and operations like DISTINCT and GROUP BY.
72. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
73. Security and Authorization
74. Introduction to DB Security
- Secrecy: users should not be able to see things they are not supposed to
- E.g., a student can't see other students' grades
- Integrity: users should not be able to modify things they are not supposed to
- E.g., only instructors can assign grades
- Availability: users should be able to see and modify things they are allowed to
75. Access Controls
- A security policy specifies who is authorized to do what.
- A security mechanism allows us to enforce a chosen security policy.
- Two main mechanisms at the DBMS level:
- Discretionary access control - privileges
- Mandatory access control - system policies (root)
76. Discretionary Access Control
- Based on the concept of access rights or privileges for objects (tables and views), and mechanisms for giving users privileges (and revoking privileges).
- Creator of a table or a view automatically gets all privileges on it.
- DBMS keeps track of who subsequently gains and loses privileges, and ensures that only requests from users who have the necessary privileges (at the time the request is issued) are allowed.
77. GRANT Command

GRANT privileges ON object TO users WITH GRANT OPTION

- The following privileges can be specified:
- SELECT: can read all columns (including those added later via the ALTER TABLE command)
- INSERT(col-name): can insert rows with non-null or non-default values in this column
- INSERT means the same right with respect to all columns
- DELETE: can delete rows
- REFERENCES(col-name): can define foreign keys (in other tables) that refer to this column
- If a user has a privilege with the GRANT OPTION, he can pass the privilege on to other users (with or without passing on the GRANT OPTION)
- Only the owner can execute CREATE, ALTER, and DROP
78. GRANT and REVOKE of Privileges
- GRANT INSERT, SELECT ON Sailors TO Horatio
- Horatio can query Sailors or insert rows into it
- GRANT DELETE ON Sailors TO Yuppy WITH GRANT OPTION
- Yuppy can delete rows, and also authorize others to do so
- GRANT UPDATE (rating) ON Sailors TO Dustin
- Dustin can update (only) the rating field of Sailors rows
- GRANT SELECT ON ActiveSailors TO Guppy, Yuppy
- This does NOT allow the 'uppies' to query Sailors directly!
- REVOKE
- When a privilege is revoked from X, it is also revoked from all users who got it solely from X
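For example (SQL-92 REVOKE; CASCADE also removes privileges that were derived solely from the revoked one):

REVOKE DELETE ON Sailors FROM Yuppy CASCADE;
-- anyone to whom Yuppy had granted DELETE (and who has no other
-- source for it) loses the privilege as well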
79. GRANT/REVOKE on Views
- If the creator of a view loses the SELECT
privilege on an underlying table, the view is
dropped! - If the creator of a view loses a privilege held
with the grant option on an underlying table,
(s)he loses the privilege on the view as well so
do users who were granted that privilege on the
view!
80. Views and Security
- Views can be used to present necessary
information (or a summary), while hiding details
in underlying relation(s) - Given ActiveSailors, but not Sailors or Reserves,
we can find sailors who have a reservation, but
not the bids of boats that have been reserved - Creator of view has a privilege on the view if
(s)he has the privilege on all underlying tables. - Together with GRANT/REVOKE commands, views are a
very powerful access control tool
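A sketch of this pattern (the view definition is assumed, since the slides do not show how ActiveSailors is defined):

CREATE VIEW ActiveSailors (name, age, day) AS
    SELECT S.sname, S.age, R.day
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid;
GRANT SELECT ON ActiveSailors TO Guppy;
-- Guppy can see which sailors have reservations, but cannot query
-- Sailors or Reserves directly, and never sees bid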
81. Role-Based Authorization
- In SQL-92, privileges are actually assigned to authorization ids, which can denote a single user or a group of users
- In SQL:1999 (and in many current systems), privileges are assigned to roles
- Roles can then be granted to users and to other roles
- Reflects how real organizations work
- Illustrates how standards often catch up with de facto standards embodied in popular systems
82. Security to the Level of a Field!
- Can create a view that only returns one field of
one row - Then grant access to that view accordingly.
- Allows for arbitrary granularity of control, but
- Clumsy to specify, though this can be hidden
under a good UI - Performance is unacceptable if we need to define
field-granularity access frequently (Too many
view creations and look-ups)
83. Internet-Oriented Security
- Key issues: user authentication and trust.
- When the DB must be accessed from a secure location, password-based schemes are usually adequate.
- For access over an external network, trust is hard to achieve:
- If someone with Sam's credit card wants to buy from you, how can you be sure it is not someone who stole his card?
- How can Sam be sure that the screen for entering his credit card information is indeed yours, and not some rogue site spoofing you (to steal such information)? How can he be sure that sensitive information is not sniffed while it is being sent over the network to you?
- Encryption is a technique used to address these issues.
84. Encryption
- Masks data for secure transmission or storage:
- Encrypt(data, encryption key) = encrypted data
- Decrypt(encrypted data, decryption key) = original data
- Without the decryption key, the encrypted data is meaningless gibberish
- Symmetric encryption:
- Encryption key = decryption key; all authorized users know the decryption key (a weakness).
- DES, used since 1977, has a 56-bit key; AES has a 128-bit (optionally, 192-bit or 256-bit) key
- Public-key encryption: each user has two keys:
- User's public encryption key: known to all
- Private decryption key: known only to this user
- Used in the RSA scheme
85. RSA Public-Key Encryption
- Let the data be an integer I
- Choose a large (>> I) integer L = p * q
- p, q are large, say 1024-bit, distinct prime numbers
- Encryption: choose a random number e, 1 < e < L, that is relatively prime to (p-1) * (q-1)
- Encrypted data S = I^e mod L
- Decryption key d: chosen so that
- d * e = 1 mod ((p-1) * (q-1))
- We can then show that I = S^d mod L
- It turns out that the roles of e and d can be reversed, so they are simply called the public and private keys
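A toy worked example (deliberately tiny numbers, for illustration only): take p = 5, q = 11, so L = 55 and (p-1)*(q-1) = 40. Choose e = 3, which is relatively prime to 40; then d = 27, since 3 * 27 = 81 = 2*40 + 1. For data I = 7: S = 7^3 mod 55 = 343 mod 55 = 13, and decryption gives 13^27 mod 55 = 7, recovering I.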
86. Certifying Servers: SSL, SET
- If Amazon distributes their public key, Sam's browser will encrypt his order using it.
- So, only Amazon can decipher the order, since no one else has Amazon's private key.
- But how can Sam (or his browser) know that the public key for Amazon is genuine? The SSL protocol covers this.
- Amazon contracts with, say, Verisign, to issue a certificate <Verisign, Amazon, amazon.com, public-key>
- This certificate is stored in encrypted form, encrypted with Verisign's private key, known only to Verisign
- Verisign's public key is known to all browsers, which can therefore decrypt the certificate and obtain Amazon's public key, and be confident that it is genuine.
- The browser then generates a temporary session key, encodes it using Amazon's public key, and sends it to Amazon.
- All subsequent msgs between the browser and Amazon are encoded using symmetric encryption (e.g., DES), which is more efficient than public-key encryption.
- What if Sam doesn't trust Amazon with his credit card information?
- Secure Electronic Transaction protocol: 3-way communication between Amazon, Sam, and a trusted server, e.g., Visa.
87. Authenticating Users
- Amazon can simply use password authentication, i.e., ask Sam to log into his Amazon account.
- Done after SSL is used to establish a session key, so that the transmission of the password is secure!
- Amazon is still at risk if Sam's card is stolen and his password is hacked. Business risk.
- Digital signatures:
- Sam encrypts the order using his private key, then encrypts the result using Amazon's public key.
- Amazon decrypts the msg with their private key, and then decrypts the result using Sam's public key, which yields the original order!
- Exploits the interchangeability of public/private keys for encryption/decryption.
- Now, no one can forge Sam's order, and Sam cannot claim that someone else forged the order.
88. Summary
- Three main security objectives: secrecy, integrity, availability.
- The DB admin is responsible for overall security.
- Designs the security policy, maintains an audit trail, or history of users' accesses to the DB.
- Two main approaches to DBMS security: discretionary and mandatory access control
- Discretionary control is based on the notion of privileges
- Mandatory control is based on the notion of security classes
89. Final Exam
- Due Next Saturday, Oct 15
Last Class
- Class Begins at 9AM!!!
- We'll Cover XML Data
90. See, Dark At The End Of The Tunnel, But Almost There!!!