Title: Database Management Systems Session 9
1. Database Management Systems Session 9
- Instructor: Vinnie Costa (vcosta_at_optonline.net)
2. The $100 Laptop Moves Closer to Reality
- CNET News.com, 9/28: Nicholas Negroponte, the co-founder of the Media Lab at the Massachusetts Institute of Technology, detailed specifications for a $100 windup-powered laptop targeted at children in developing nations. "This is the most important thing I have ever done in my life," Negroponte said during a presentation at Technology Review's Emerging Technologies Conference at MIT. "Reception has been incredible. The idea is simple. It's an education project, not a laptop project. If we can make education better--particularly primary and secondary schools--it will be a better world." The laptops would save on software costs by running the open-source Linux operating system rather than Windows, and reduce the need for costly base stations by participating in "mesh" networks and connecting to Wi-Fi wireless networks.
- MIT's One Laptop Per Child initiative seeks to produce between 5 million and 15 million $100 laptops within a year. The MIT team plans to continually push the cost of the laptops down by incorporating low-power enhancements such as "electronic ink" displays. - CNET News.com, 9/28
3. The Rest Of DBMS In A Nutshell
4. Good Side Of Technology
- Nicholas Negroponte (born 1943) is a Greek-American computer scientist best known as the founder and director of the Massachusetts Institute of Technology's Media Lab.
- Born the son of a Greek ship owner on New York City's Upper East Side, Nicholas Negroponte studied at MIT, where as a graduate student he specialized in the field of computer-aided design. He joined the faculty of MIT in 1966. He is the brother of the United States Director of National Intelligence, John Negroponte.
- He was a founder of WiReD magazine and has been an "angel investor" for over 40 start-ups, including three in China. Professor Negroponte helped to establish, and serves as chairman of, the 2B1 Foundation, an organization dedicated to bringing computer access to children in the most remote and poorest parts of the world.
5. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
6. Overview of Storage and Indexing
7. Data on External Storage
- Disks: can retrieve a random page at fixed cost
- But reading several consecutive pages is much cheaper than reading them in random order
- Tapes: can only read pages in sequence
- Cheaper than disks; used for archival storage
- File organization: method of arranging a file of records on external storage.
- Record id (rid) is sufficient to physically locate a record
- Indexes are data structures that allow us to find the record ids of records with given values in index search key fields
- Architecture: the buffer manager stages pages from external storage to the main memory buffer pool. File and index layers make calls to the buffer manager.
8. Alternative File Organizations
- Many alternatives exist, each ideal for some situations, and not so good in others:
- Heap (random order) files: suitable when typical access is a file scan retrieving all records.
- Sorted files: best if records must be retrieved in some order, or only a range of records is needed.
- Indexes: data structures to organize records via trees or hashing.
- Like sorted files, they speed up searches for a subset of records, based on values in certain (search key) fields
- Updates are much faster than in sorted files.
9. Indexes
- An index on a file speeds up selections on the search key fields for the index
- Any subset of the fields of a relation can be the search key for an index on the relation
- Search key is not the same as key (a minimal set of fields that uniquely identify a record in a relation)
- An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k
- Given data entry k*, we can find the record with key k in at most one disk I/O
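As a concrete sketch (table and index names are made up for illustration; CREATE INDEX is supported by most DBMSs, though it is not part of the SQL-92 standard):

CREATE TABLE Emp (ssn INTEGER PRIMARY KEY, ename CHAR(30), age INTEGER, sal REAL);
CREATE INDEX emp_age_idx ON Emp (age);
-- data entries for search key <age> let this query avoid a full file scan:
SELECT * FROM Emp WHERE age = 30;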
10. B+ Tree Indexes
- Leaf pages contain data entries, and are chained (prev & next)
- Non-leaf pages have index entries; only used to direct searches

[Figure: non-leaf pages above leaf pages (sorted by search key); each non-leaf page holds an index entry sequence <P0, K1, P1, K2, P2, ..., Km, Pm>, where the Ki are keys and the Pi are page pointers]
11. Example B+ Tree
Note how data entries in the leaf level are sorted.

[Figure: B+ tree with root entry 17; entries < 17 go to the left subtree, entries > 17 to the right. Left child <5, 13> points to leaves (2, 3), (5, 7, 8), (14, 16); right child <27, 30> points to leaves (22, 24), (27, 29), (33, 34), (38, 39).]

- Find 28? 29? All > 15 and < 30?
- Insert/delete: find the data entry in a leaf, then change it. Need to adjust the parent sometimes.
- And the change sometimes bubbles up the tree
12. Hash-Based Indexes
- Good for equality selections.
- Index is a collection of buckets
- Bucket = primary page plus zero or more overflow pages
- Buckets contain data entries.
- Hashing function h: h(r) = bucket in which (the data entry for) record r belongs. h looks at the search key fields of r.
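A minimal sketch, assuming PostgreSQL syntax for the hash index (support and syntax vary by DBMS; the names are hypothetical):

CREATE INDEX emp_ssn_hash ON Emp USING HASH (ssn);
-- equality selection: h(ssn) locates the right bucket directly
SELECT * FROM Emp WHERE ssn = 123456789;
-- a hash index cannot help a range selection such as ssn > 123456789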
13. Index Classification
- Primary vs. secondary: if the search key contains the primary key, it is called a primary index.
- Unique index: search key contains a candidate key.
- Clustered vs. unclustered: if the order of data records is the same as, or close to, the order of data entries, it is called a clustered index.
- A file can be clustered on at most one search key.
- Cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!
14. Clustered vs. Unclustered Index
- Suppose that the data records are stored in a Heap file
- To build a clustered index, first sort the Heap file (with some free space on each page for future inserts).
- Overflow pages may be needed for inserts. (Thus, the order of data recs is close to, but not identical to, the sort order.)

[Figure: in the CLUSTERED case, index entries direct the search to data entries whose order matches the data records in the data file; in the UNCLUSTERED case, data entries point to data records scattered throughout the file]
15. Cost Model for Our Analysis
- We ignore CPU costs, for simplicity:
- B: the number of data pages
- R: number of records per page
- D: (average) time to read or write a disk page
- Measuring the number of page I/Os ignores gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated.
- Average-case analysis; based on several simplistic assumptions.
- Good enough to show the overall trends!
16. Comparing File Organizations
- Heap files (random order; insert at eof)
- Sorted files, sorted on <age, sal>
- Clustered B+ tree file, search key <age, sal>
- Heap file with unclustered B+ tree index on search key <age, sal>
- Heap file with unclustered hash index on search key <age, sal>
17. Operations to Compare
- Scan: fetch all records from disk
- Equality search
- Range selection
- Insert a record
- Delete a record
18. Understanding the Workload
- For each query in the workload:
- Which relations does it access?
- Which attributes are retrieved?
- Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
- For each update in the workload:
- Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
- The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.
19. Choice of Indexes
- What indexes should we create?
- Which relations should have indexes? What
field(s) should be the search key? Should we
build several indexes? - For each index, what kind of an index should it
be? - Clustered? Hash/tree?
20. Choice of Indexes (Contd.)
- One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
- Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans!
- For now, we discuss simple 1-table queries.
- Before creating an index, we must also consider the impact on updates in the workload!
- Trade-off: indexes can make queries go faster, updates slower. They require disk space, too.
21. Index Selection Guidelines
- Attributes in the WHERE clause are candidates for index keys.
- Exact match condition suggests hash index.
- Range query suggests tree index.
- Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.
- Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
- Order of attributes is important for range queries.
- Such indexes can sometimes enable index-only strategies for important queries (see the sketch below).
- For index-only strategies, clustering is not important!
- Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.
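For example (a hypothetical sketch; whether the optimizer actually chooses an index-only plan depends on the DBMS):

CREATE INDEX emp_dno_sal_idx ON Emp (dno, sal);
-- the multi-attribute key matches both conditions:
SELECT * FROM Emp WHERE dno = 10 AND sal > 50000;
-- index-only strategy: every attribute the query needs is in the index,
-- so the data records themselves are never fetched:
SELECT E.dno, AVG(E.sal) FROM Emp E GROUP BY E.dno;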
22. Examples of Clustered Indexes

SELECT E.dno FROM Emp E WHERE E.age > 40

- B+ tree index on E.age can be used to get qualifying rows
- How selective is the condition?
- Is the index clustered?
- Consider the GROUP BY query:
- If many tuples have E.age > 10, using the E.age index and sorting the retrieved rows may be costly
- Clustered E.dno index may be better!
- Equality queries and duplicates:
- Clustering on E.hobby helps!

SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age > 10 GROUP BY E.dno

SELECT E.dno FROM Emp E WHERE E.hobby = 'Stamps'
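How clustering is declared varies by system; two illustrative forms (index names are hypothetical):

CREATE CLUSTERED INDEX emp_dno_cl ON Emp (dno);   -- Microsoft SQL Server: table kept organized by this index
CLUSTER Emp USING emp_dno_idx;                    -- PostgreSQL: one-time physical reordering on an existing index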
23. Indexes with Composite Search Keys
- Composite search keys: search on a combination of fields.
- Equality query: every field value is equal to a constant value. E.g., w.r.t. a <sal, age> index:
- age = 20 and sal = 75
- Range query: some field value is not a constant. E.g.:
- age = 20 and sal > 10
- Data entries in the index are sorted by search key to support range queries.
- Lexicographic order, or
- Spatial order.
Examples of composite key indexes using lexicographic order:

Data records sorted by name:
  name  age  sal
  bob   12   10
  cal   11   80
  joe   12   20
  sue   13   75

Data entries in index sorted by <age, sal>: (11,80) (12,10) (12,20) (13,75)
Data entries sorted by <age>: 11, 12, 12, 13
Data entries in index sorted by <sal, age>: (10,12) (20,12) (75,13) (80,11)
Data entries sorted by <sal>: 10, 20, 75, 80
24. Composite Search Keys
- To retrieve Emp records with age = 30 AND sal = 4000, an index on <age, sal> would be better than an index on age or an index on sal
- Choice of index key is orthogonal to clustering etc.
- If the condition is 20 < age < 30 AND 3000 < sal < 5000:
- Clustered tree index on <age, sal> or <sal, age> is best
- If the condition is age = 30 AND 3000 < sal < 5000:
- Clustered <age, sal> index much better than <sal, age> index!
25. Summary
- Many alternative file organizations exist, each appropriate in some situation.
- If selection queries are frequent, sorting the file or building an index is important.
- Hash-based indexes are only good for equality search.
- Sorted files and tree-based indexes are best for range search; also good for equality search. (Files are rarely kept sorted in practice; a B+ tree index is better.)
- An index is a collection of data entries plus a way to quickly find entries with given key values.
26. Summary (Contd.)
- Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
- Can have several indexes on a given file of data records, each with a different search key.
- Indexes can be classified as clustered vs.
unclustered, primary vs. secondary. Differences
have important consequences for
utility/performance.
27. Summary (Contd.)
- Understanding the nature of the workload for the
application, and the performance goals, is
essential to developing a good design. - What are the important queries and updates? What
attributes/relations are involved? - Indexes must be chosen to speed up important
queries (and perhaps some updates!). - Index maintenance overhead on updates to key
fields. - Choose indexes that can help many queries, if
possible. - Build indexes to support index-only strategies.
- Clustering is an important decision; only one index on a given relation can be clustered!
- Order of fields in a composite index key can be important.
28. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
29. Overview of Query Evaluation
30. Some Common Techniques
- Algorithms for evaluating relational operators use some simple ideas extensively:
- Indexing: can use WHERE conditions to retrieve a small set of tuples (selections, joins)
- Iteration: sometimes, faster to scan all tuples even if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.)
- Partitioning: by using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.
Watch for these techniques as we discuss query
evaluation!
31. Statistics and Catalogs
- Need information about the relations and indexes involved. Catalogs typically contain at least:
- # tuples (NTuples) and # pages (NPages) for each relation
- # distinct key values (NKeys) and NPages for each index
- Index height, low/high key values (Low/High) for each tree index
- Catalogs are updated periodically
- Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency is ok.
- More detailed information (e.g., histograms of the values in some field) is sometimes stored.
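For instance, in PostgreSQL (one concrete system; catalog names differ elsewhere), statistics can be refreshed and inspected like this:

ANALYZE Sailors;  -- recompute optimizer statistics for the table
SELECT relname, reltuples, relpages FROM pg_class WHERE relname = 'sailors';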
32. Highlights of System R Optimizer
- Impact:
- Most widely used currently; works well for < 10 joins.
- Cost estimation: approximate art at best
- Statistics, maintained in system catalogs, are used to estimate cost of operations and result sizes.
- Considers combination of CPU and I/O costs.
- Plan space: too large, must be pruned.
- Only the space of left-deep plans is considered
- Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
33. Cost Estimation
- For each plan considered, must estimate cost:
- Must estimate cost of each operation in the plan tree.
- Depends on input cardinalities.
- We've already discussed how to estimate the cost of operations (sequential scan, index scan,
joins, etc.) - Must also estimate size of result for each
operation in tree! - Use information about the input relations.
- For selections and joins, assume independence of
predicates.
34. Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

- Similar to the old schema; rname added for variations.
- Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
35. Motivating Example

SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5

[Figure: the equivalent RA tree and a plan that joins Reserves and Sailors first, then applies the selections and the projection on the fly]

- Cost: 500 + 500*1000 I/Os (simple nested loops join, with Sailors as the outer relation)
- By no means the worst plan!
- Misses several opportunities: selections could have been pushed earlier, no use is made of any available indexes, etc.
- Goal of optimization: to find more efficient plans that compute the same answer.
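Most systems let you inspect the plan the optimizer actually picks; e.g., in PostgreSQL or MySQL (output format varies by system):

EXPLAIN SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5;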
36. Alternative Plans 1 (No Indexes)
- Main difference: push selects.
- With 5 buffers, cost of plan:
- Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution).
- Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
- Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)
- Total: 3560 page I/Os.
- If we use a BNL join, join cost = 10 + 4*250, total cost = 2770.
- If we push projections, T1 has only sid, T2 only sid and sname:
- T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.
37. Alternative Plans 2 (With Indexes)

[Plan: on-the-fly projection on sname, above an on-the-fly selection rating > 5, above an index nested loops join on sid = sid with pipelining; the outer is the selection bid = 100 on Reserves, evaluated with a hash index and not written to a temp; the inner is Sailors]

- With a clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
- INL with pipelining (outer is not materialized).
- Projecting out unnecessary fields from the outer doesn't help.
- Join column sid is a key for Sailors.
- At most one matching tuple, so an unclustered index on sid is OK.
- The decision not to push rating > 5 before the join is based on the availability of the sid index on Sailors.
- Cost: selection of Reserves tuples (10 I/Os); for each, must get the matching Sailors tuple (1000*1.2); total 1210 I/Os.
38. Summary
- There are several alternative evaluation
algorithms for each relational operator - A query is evaluated by converting it to a tree
of operators and evaluating the operators in the
tree - Must understand query optimization in order to
fully understand the performance impact of a
given database design (relations, indexes) on a
workload (set of queries)
- Two parts to optimizing a query:
- Consider a set of alternative plans.
- Must prune search space; typically, left-deep plans only.
- Must estimate cost of each plan that is considered.
- Must estimate size of result and cost for each plan node.
- Key issues: statistics, indexes, operator implementations.
39. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
40. Schema Refinement and Normal Forms
41. The Evils of Redundancy
- Redundancy is at the root of several problems associated with relational schemas:
- redundant storage, insert/delete/update anomalies
- Integrity constraints, in particular functional dependencies, can be used to identify schemas with such problems and to suggest refinements
- Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD)
- Decomposition should be used judiciously:
- Is there reason to decompose a relation?
- What problems (if any) does the decomposition
cause?
42. Functional Dependencies (FDs)
- A functional dependency X → Y holds over relation R if, for every allowable instance r of R:
- for all t1 ∈ r, t2 ∈ r: πX(t1) = πX(t2) implies πY(t1) = πY(t2)
- i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.)
- An FD is a statement about all allowable relations.
- Must be identified based on semantics of application.
- Given some allowable instance r1 of R, we can check if it violates some FD f, but we cannot tell if f holds over R!
- K is a candidate key for R means that K → R
- However, K → R does not require K to be minimal!
43. Example: Constraints on Entity Set
- Consider the relation obtained from Hourly_Emps:
- Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)
- Notation: we will denote this relation schema by listing the attributes: SNLRWH
- This is really the set of attributes {S,N,L,R,W,H}.
- Sometimes, we will refer to all attributes of a relation by using the relation name. (e.g., Hourly_Emps for SNLRWH)
- Some FDs on Hourly_Emps:
- ssn is the key: S → SNLRWH
- rating determines hrly_wages: R → W
44. Example (Contd.)

[Figure: ER diagram of the Hourly_Emps entity set with attributes ssn, name, lot, rating, hourly_wages, hours_worked]
45. Example (Contd.)

[Tables: Hourly_Emps2 (ssn, name, lot, rating, hours_worked) and Wages (rating, hourly_wages)]

- Problems due to R → W:
- Update anomaly: can we change W in just the 1st tuple of SNLRWH?
- Insertion anomaly: what if we want to insert an employee and don't know the hourly wage for his rating?
- Deletion anomaly: if we delete all employees with rating 5, we lose the information about the wage for rating 5!

Will 2 smaller tables be better?
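A sketch of the two smaller tables in SQL (column types are assumed, since the slide gives only attribute names):

CREATE TABLE Wages (rating INTEGER PRIMARY KEY, hourly_wages REAL);
CREATE TABLE Hourly_Emps2 (ssn CHAR(11) PRIMARY KEY, name CHAR(30), lot INTEGER,
    rating INTEGER REFERENCES Wages(rating), hours_worked REAL);
-- R → W now holds trivially: rating is the key of Wages, so each rating's
-- wage is stored exactly once, and the three anomalies disappear.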
46. Normal Forms
- Returning to the issue of schema refinement, the first question to ask is: is any refinement needed?
- If a relation is in a certain normal form (BCNF, 3NF, etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.
- Role of FDs in detecting redundancy:
- Consider a relation R with 3 attributes, ABC.
- No FDs hold: there is no redundancy here.
- Given A → B: several tuples could have the same A value, and if so, they'll all have the same B value (see the instance sketched below)!
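For instance, a hypothetical instance of ABC with A → B:

  A   B   C
  a1  b1  c1
  a1  b1  c2
  a2  b2  c3

The pair (a1, b1) is recorded twice; updating B in one of those tuples but not the other would violate A → B.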
47. Boyce-Codd Normal Form (BCNF)
- In other words, a relation R is in BCNF if the only non-trivial FDs that hold over R are key constraints
- No dependency in R can be predicted using FDs alone
- If we are shown two tuples that agree upon the X value, we cannot infer the A value in one tuple from the A value in the other
- If the example relation is in BCNF, the 2 tuples must be identical (since X is a key)

[Figure: a relation with a KEY attribute and nonkey attributes attr1, attr2, ..., attrk]

BCNF ensures that no redundancy can be detected using FD information alone. It is the most desirable normal form!
48. Third Normal Form (3NF)
- If R is in BCNF, it's obviously in 3NF.
- If R is in 3NF, some redundancy is possible. It
is a compromise, used when BCNF not achievable
(e.g., no good decomp, or performance
considerations). - Lossless-join, dependency-preserving
decomposition of R into a collection of 3NF
relations is always possible - Thus, 3NF is indeed a compromise relative to
BCNF.
49. Decomposition of a Relation Scheme
- Suppose that relation R contains attributes A1
... An. A decomposition of R consists of
replacing R by two or more relations such that - Each new relation scheme contains a subset of the
attributes of R (and no attributes that do not
appear in R), and - Every attribute of R appears as an attribute of
one of the new relations. - Intuitively, decomposing R means we will store
instances of the relation schemes produced by the
decomposition, instead of instances of R. - E.g., Can decompose SNLRWH into SNLRH and RW.
50. Example Decomposition
- Decompositions should be used only when needed.
- SNLRWH has FDs S → SNLRWH and R → W
- Second FD causes violation of 3NF: W values
repeatedly associated with R values. Easiest way
to fix this is to create a relation RW to store
these associations, and to remove W from the main
schema - i.e., we decompose SNLRWH into SNLRH and RW
- The information to be stored consists of SNLRWH
tuples. If we just store the projections of
these tuples onto SNLRH and RW, are there any
potential problems that we should be aware of?
51. Problems with Decompositions
- There are three potential problems to consider:
- Some queries become more expensive.
- e.g., How much did sailor Joe earn? (salary = W*H)
- Given instances of the decomposed relations, we
may not be able to reconstruct the corresponding
instance of the original relation! - Fortunately, not in the SNLRWH example.
- Checking some dependencies may require joining
the instances of the decomposed relations. - Fortunately, not in the SNLRWH example.
- Tradeoff: must consider these issues vs. redundancy.
52. Lossless Join Decompositions
- Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F:
- πX(r) ⋈ πY(r) = r
- It is always true that r ⊆ πX(r) ⋈ πY(r)
- In general, the other direction does not hold! If it does, the decomposition is lossless-join.
- Definition extended to decomposition into 3 or more relations in a straightforward way.
- It is essential that all decompositions used to deal with redundancy be lossless! (Avoids Problem (2).)
53. More on Lossless Join
- The decomposition of R into X and Y is lossless-join w.r.t. F if and only if the closure of F contains:
- X ∩ Y → X, or
- X ∩ Y → Y
- In particular, the decomposition of R into UV and R - V is lossless-join if U → V holds over R.
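In SQL terms, losslessness means joining the projections gives back exactly the original instance; for the Hourly_Emps decomposition (a sketch, using the table names from the earlier example):

SELECT E.ssn, E.name, E.lot, E.rating, W.hourly_wages, E.hours_worked
FROM Hourly_Emps2 E, Wages W
WHERE E.rating = W.rating;
-- lossless because rating = X ∩ Y is the key of Wages, i.e., X ∩ Y → Y holds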
54. Dependency Preserving Decomposition
- Dependency preserving decomposition (Intuitive)
- If R is decomposed into X, Y and Z, and we
enforce the FDs that hold on X, on Y and on Z,
then all FDs that were given to hold on R must
also hold. (Avoids Problem (3).)
55. Summary of Normalization
- 1NF (Eliminate Repeating Groups): make a separate table for each set of related attributes, and give each table a primary key
- 2NF (Eliminate Redundant Data): if an attribute depends on only part of a multi-valued key, remove it to a separate table
- 3NF (Eliminate Columns Not Dependent On Key): if attributes do not contribute to a description of the key, remove them to a separate table
- BCNF (Boyce-Codd Normal Form): if there are non-trivial dependencies between candidate key attributes, separate them out into distinct tables

http://www.datamodel.org/NormalizationRules.html
56. Refining an ER Diagram

[Figure: before and after versions of the ER diagram; note that Employees and Works_In are mapped to a single relation, Workers]

- 1st diagram translated: Workers(S,N,L,D,S), Departments(D,M,B)
- Lots associated with workers.
- Suppose all workers in a dept are assigned the same lot: D → L
- Redundancy fixed by: Workers2(S,N,D,S), Dept_Lots(D,L)
- Can fine-tune this: Workers2(S,N,D,S), Departments(D,M,B,L)
57. Summary of Schema Refinement
- If a relation is in BCNF, it is free of
redundancies that can be detected using FDs.
Thus, trying to ensure that all relations are in
BCNF is a good heuristic. - If a relation is not in BCNF, we can try to
decompose it into a collection of BCNF relations. - Must consider whether all FDs are preserved. If
a lossless-join, dependency preserving
decomposition into BCNF is not possible (or
unsuitable, given typical queries), should
consider decomposition into 3NF. - Decompositions should be carried out and/or
re-examined while keeping performance
requirements in mind.
58. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
59. Physical Database Design
60. Overview
- After ER design, schema refinement, and the
definition of views, we have the conceptual and
external schemas for our database. - The next step is to choose indexes, make
clustering decisions, and to refine the
conceptual and external schemas (if necessary) to
meet performance goals. - We must begin by understanding the workload
- The most important queries and how often they
arise. - The most important updates and how often they
arise. - The desired performance for these queries and
updates.
61. Decisions to Make
- What indexes should we create?
- Which relations should have indexes? What
field(s) should be the search key? Should we
build several indexes? - For each index, what kind of an index should it
be? - Clustered? Hash/tree?
- Should we make changes to the conceptual schema?
- Consider alternative normalized schemas?
(Remember, there are many choices in decomposing
into BCNF, etc.) - Should we undo some decomposition steps and
settle for a lower normal form?
(Denormalization.) - Horizontal partitioning, replication, views ...
62. Index Selection for Joins
- When considering a join condition:
- Hash index on inner is very good for Index Nested Loops
- Should be clustered if join column is not a key for inner, and inner tuples need to be retrieved
- Clustered B+ tree on join column(s) is good for Sort-Merge
63. Example 1

SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname = 'Toy' AND E.dno = D.dno

- Hash index on D.dname supports the 'Toy' selection (declared in the sketch below).
- Given this, an index on D.dno is not needed.
- Hash index on E.dno allows us to get matching (inner) Emp rows for each selected (outer) Dept row
- What if the WHERE clause included ... AND E.age = 25?
- Could retrieve Emp rows using an index on E.age, then join with Dept rows satisfying the dname selection. Comparable to the strategy that used the E.dno index.
- So, if the E.age index is already created, this query provides much less motivation for adding an E.dno index.
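The indexes this plan assumes could be declared as follows (hash-index syntax per PostgreSQL, as one example; index names are hypothetical):

CREATE INDEX dept_dname_idx ON Dept USING HASH (dname);  -- for the 'Toy' selection
CREATE INDEX emp_dno_idx ON Emp USING HASH (dno);        -- inner index for INL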
64. Example 2

SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE E.sal BETWEEN 10000 AND 20000 AND E.hobby = 'Stamps' AND E.dno = D.dno

(The sal condition is a range selection; the hobby condition is an equality selection.)

- Clearly, Emp should be the outer relation.
- Suggests that we build a hash index on D.dno.
- What index should we build on Emp?
- B+ tree on E.sal could be used, OR an index on E.hobby could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions.
- As a rule of thumb, equality selections are more selective than range selections.
- As both examples indicate, our choice of indexes
is guided by the plan(s) that we expect an
optimizer to consider for a query. Have to
understand optimizers!
65. Clustering and Joins

SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname = 'Toy' AND E.dno = D.dno

- Clustering is especially important when accessing inner rows in INL.
- Should make the index on E.dno clustered.
- Suppose that the WHERE clause is instead:
- WHERE E.hobby = 'Stamps' AND E.dno = D.dno
- If many employees collect stamps, Sort-Merge join may be worth considering. A clustered index on D.dno would help.
- Summary: clustering is useful whenever many rows are to be retrieved.
66. Tuning the Conceptual Schema
- The choice of conceptual schema should be guided by the workload, in addition to redundancy issues:
- We may settle for a 3NF schema rather than BCNF
- Workload may influence the choice we make in decomposing a relation into 3NF or BCNF
- We may further decompose a BCNF schema!
- We might denormalize (i.e., undo a decomposition step), or we might add fields to a relation.
- If such changes are made after a database is in use, this is called schema evolution; we might want to mask some of these changes from applications by defining views.
67. Tuning Queries and Views
- If a query runs slower than expected, check if an index needs to be re-built, or if statistics are too old.
- Sometimes, the DBMS may not be executing the plan you had in mind. Common areas of weakness:
- Selections involving null values
- Selections involving arithmetic or string
expressions - Selections involving OR conditions.
- Lack of evaluation features like index-only
strategies or certain join methods or poor size
estimation - Check the plan that is being used! Then adjust
the choice of indexes or rewrite the query/view
68. More Guidelines for Query Tuning
- Minimize the use of DISTINCT: don't need it if duplicates are acceptable, or if the answer contains a key.
- Minimize the use of GROUP BY and HAVING:

SELECT MIN (E.age) FROM Employee E GROUP BY E.dno HAVING E.dno = 102

SELECT MIN (E.age) FROM Employee E WHERE E.dno = 102

(The second version avoids grouping the entire table only to discard every group but dno = 102.)
69. Summary
- Database design consists of several tasks: requirements analysis, conceptual design, schema refinement, physical design, and tuning.
- In general, have to go back and forth between
these tasks to refine a database design, and
decisions in one task can influence the choices
in another task. - Understanding the nature of the workload for the
application, and the performance goals, is
essential to developing a good design - What are the important queries and updates? What
attributes/relations are involved?
70. Summary
- The conceptual schema should be refined by considering performance criteria and workload:
- May choose 3NF or lower normal form over BCNF.
- May choose among alternative decompositions into
BCNF (or 3NF) based upon the workload. - May denormalize, or undo some decompositions.
- May decompose a BCNF relation further!
71. Summary (Contd.)
- Over time, indexes have to be fine-tuned (dropped, created, re-built, ...) for performance.
- Should determine the plan used by the system, and adjust the choice of indexes appropriately.
- The system may still not find a good plan:
- Only left-deep plans are considered!
- Null values, arithmetic conditions, string expressions, the use of ORs, etc. can confuse an optimizer.
- So, we may have to rewrite the query/view:
- Avoid nested queries, temporary relations, complex conditions, and operations like DISTINCT and GROUP BY.
72. Lecture Overview
- Overview Of Storage and Indexing (Chap 8)
- Overview Of Query Evaluation (Chap 12)
- Schema Refinement and Normal Forms (Chap 19)
- Physical Database Design (Chap 20)
- Security and Authorization (Chap 21)
73. Security and Authorization
74. Introduction to DB Security
- Secrecy: users should not be able to see things they are not supposed to
- E.g., a student can't see other students' grades
- Integrity: users should not be able to modify things they are not supposed to
- E.g., only instructors can assign grades
- Availability: users should be able to see and modify things they are allowed to
75. Access Controls
- A security policy specifies who is authorized to do what.
- A security mechanism allows us to enforce a chosen security policy.
- Two main mechanisms at the DBMS level:
- Discretionary access control - privileges
- Mandatory access control - system policies (root)
76. Discretionary Access Control
- Based on the concept of access rights or privileges for objects (tables and views), and mechanisms for giving users privileges (and revoking privileges).
- Creator of a table or a view automatically gets all privileges on it.
- DBMS keeps track of who subsequently gains and loses privileges, and ensures that only requests from users who have the necessary privileges (at the time the request is issued) are allowed.
77. GRANT Command

GRANT privileges ON object TO users WITH GRANT OPTION

- The following privileges can be specified:
- SELECT: can read all columns (including those added later via the ALTER TABLE command)
- INSERT(col-name): can insert rows with non-null or non-default values in this column
- INSERT means the same right with respect to all columns
- DELETE: can delete rows
- REFERENCES(col-name): can define foreign keys (in other tables) that refer to this column
- If a user has a privilege with the GRANT OPTION, he can pass the privilege on to other users (with or without passing on the GRANT OPTION)
- Only the owner can execute CREATE, ALTER, and DROP
78. GRANT and REVOKE of Privileges
- GRANT INSERT, SELECT ON Sailors TO Horatio
- Horatio can query Sailors or insert rows into it
- GRANT DELETE ON Sailors TO Yuppy WITH GRANT OPTION
- Yuppy can delete rows, and also authorize others to do so
- GRANT UPDATE (rating) ON Sailors TO Dustin
- Dustin can update (only) the rating field of Sailors rows
- GRANT SELECT ON ActiveSailors TO Guppy, Yuppy
- This does NOT allow the 'uppies' to query Sailors directly!
- REVOKE
- When a privilege is revoked from X, it is also revoked from all users who got it solely from X
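For example (SQL-92 REVOKE; CASCADE also removes privileges that were derived solely from the revoked one):

REVOKE DELETE ON Sailors FROM Yuppy CASCADE;
-- anyone to whom Yuppy had granted DELETE (and who has no other
-- source for it) loses the privilege as well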
79. GRANT/REVOKE on Views
- If the creator of a view loses the SELECT
privilege on an underlying table, the view is
dropped! - If the creator of a view loses a privilege held
with the grant option on an underlying table,
(s)he loses the privilege on the view as well so
do users who were granted that privilege on the
view!
80. Views and Security
- Views can be used to present necessary
information (or a summary), while hiding details
in underlying relation(s) - Given ActiveSailors, but not Sailors or Reserves,
we can find sailors who have a reservation, but
not the bids of boats that have been reserved - Creator of view has a privilege on the view if
(s)he has the privilege on all underlying tables. - Together with GRANT/REVOKE commands, views are a
very powerful access control tool
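A sketch of this pattern (the view definition is assumed, since the slides do not show how ActiveSailors is defined):

CREATE VIEW ActiveSailors (name, age, day) AS
    SELECT S.sname, S.age, R.day
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid;
GRANT SELECT ON ActiveSailors TO Guppy;
-- Guppy can see which sailors have reservations, but cannot query
-- Sailors or Reserves directly, and never sees bid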
81. Role-Based Authorization
- In SQL-92, privileges are actually assigned to authorization ids, which can denote a single user or a group of users
- In SQL:1999 (and in many current systems), privileges are assigned to roles
- Roles can then be granted to users and to other roles
- Reflects how real organizations work
- Illustrates how standards often catch up with de facto standards embodied in popular systems
82. Security to the Level of a Field!
- Can create a view that only returns one field of
one row - Then grant access to that view accordingly.
- Allows for arbitrary granularity of control, but
- Clumsy to specify, though this can be hidden
under a good UI - Performance is unacceptable if we need to define
field-granularity access frequently (Too many
view creations and look-ups)
83. Internet-Oriented Security
- Key issues: user authentication and trust.
- When the DB must be accessed from a secure location, password-based schemes are usually adequate.
- For access over an external network, trust is hard to achieve:
- If someone with Sam's credit card wants to buy from you, how can you be sure it is not someone who stole his card?
- How can Sam be sure that the screen for entering his credit card information is indeed yours, and not some rogue site spoofing you (to steal such information)? How can he be sure that sensitive information is not sniffed while it is being sent over the network to you?
- Encryption is a technique used to address these issues.
84. Encryption
- Masks data for secure transmission or storage:
- Encrypt(data, encryption key) = encrypted data
- Decrypt(encrypted data, decryption key) = original data
- Without the decryption key, the encrypted data is meaningless gibberish
- Symmetric encryption:
- Encryption key = decryption key; all authorized users know the decryption key (a weakness).
- DES, used since 1977, has a 56-bit key; AES has a 128-bit (optionally, 192-bit or 256-bit) key
- Public-key encryption: each user has two keys:
- User's public encryption key: known to all
- Private decryption key: known only to this user
- Used in the RSA scheme
85. RSA Public-Key Encryption
- Let the data be an integer I
- Choose a large (>> I) integer L = p * q
- p, q are large, say 1024-bit, distinct prime numbers
- Encryption: choose a random number e, 1 < e < L, that is relatively prime to (p-1) * (q-1)
- Encrypted data S = I^e mod L
- Decryption key d: chosen so that
- d * e = 1 mod ((p-1) * (q-1))
- We can then show that I = S^d mod L
- It turns out that the roles of e and d can be reversed, so they are simply called the public and private keys
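A toy worked example (deliberately tiny numbers, for illustration only): take p = 5, q = 11, so L = 55 and (p-1)*(q-1) = 40. Choose e = 3, which is relatively prime to 40; then d = 27, since 3 * 27 = 81 = 2*40 + 1. For data I = 7: S = 7^3 mod 55 = 343 mod 55 = 13, and decryption gives 13^27 mod 55 = 7, recovering I.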
86. Certifying Servers: SSL, SET
- If Amazon distributes their public key, Sam's browser will encrypt his order using it.
- So, only Amazon can decipher the order, since no one else has Amazon's private key.
- But how can Sam (or his browser) know that the public key for Amazon is genuine? The SSL protocol covers this.
- Amazon contracts with, say, Verisign, to issue a certificate <Verisign, Amazon, amazon.com, public-key>
- This certificate is stored in encrypted form, encrypted with Verisign's private key, known only to Verisign
- Verisign's public key is known to all browsers, which can therefore decrypt the certificate and obtain Amazon's public key, and be confident that it is genuine.
- The browser then generates a temporary session key, encodes it using Amazon's public key, and sends it to Amazon.
- All subsequent msgs between the browser and Amazon are encoded using symmetric encryption (e.g., DES), which is more efficient than public-key encryption.
- What if Sam doesn't trust Amazon with his credit card information?
- Secure Electronic Transaction protocol: 3-way communication between Amazon, Sam, and a trusted server, e.g., Visa.
87. Authenticating Users
- Amazon can simply use password authentication, i.e., ask Sam to log into his Amazon account.
- Done after SSL is used to establish a session key, so that the transmission of the password is secure!
- Amazon is still at risk if Sam's card is stolen and his password is hacked. Business risk.
- Digital signatures:
- Sam encrypts the order using his private key, then encrypts the result using Amazon's public key.
- Amazon decrypts the msg with their private key, and then decrypts the result using Sam's public key, which yields the original order!
- Exploits the interchangeability of public/private keys for encryption/decryption.
- Now, no one can forge Sam's order, and Sam cannot claim that someone else forged the order.
88. Summary
- Three main security objectives: secrecy, integrity, availability.
- The DB admin is responsible for overall security.
- Designs the security policy, maintains an audit trail, or history of users' accesses to the DB.
- Two main approaches to DBMS security: discretionary and mandatory access control
- Discretionary control is based on the notion of privileges
- Mandatory control is based on the notion of security classes
89. Final Exam
- Due Next Saturday, Oct 15
Last Class
- Class Begins at 9AM!!!
- We'll Cover XML Data
90. See, Dark At The End Of The Tunnel, But Almost There!!!