Database Management Systems Session 9 - PowerPoint PPT Presentation

1 / 90
About This Presentation
Title:

Database Management Systems Session 9

Description:

... by participating in 'mesh' networks and connecting to Wi-Fi wireless networks. ... To build clustered index, first sort the Heap file (with some free space on each ... – PowerPoint PPT presentation

Number of Views:27
Avg rating:3.0/5.0
Slides: 91
Provided by: vincen85
Learn more at: https://cs.hofstra.edu
Category:

less

Transcript and Presenter's Notes

Title: Database Management Systems Session 9


1
Database Management Systems Session 9
  • Instructor Vinnie Costavcosta_at_optonline.net

2
The 100 Laptop Moves Closer to Reality
  • CNET News.com, 9/28 Nicholas Negroponte, the
    co-founder of the Media Lab at the Massachusetts
    Institute of Technology, detailed specifications
    for a 100 windup-powered laptop targeted at
    children in developing nations."This is the
    most important thing I have ever done in my
    life," Negroponte said during a presentation at
    Technology Review's Emerging Technologies
    Conference at MIT. "Reception has been
    incredible. The idea is simple. It's an education
    project, not a laptop project. If we can make
    education better--particularly primary and
    secondary schools--it will be a better
    world.The laptops would save on software costs
    by running the open-source Linux operating system
    rather than Windows, and reduce the need for
    costly base stations by participating in "mesh"
    networks and connecting to Wi-Fi wireless
    networks.
  • MITs Laptop Per Child initiative, which seeks
    to produce between 5 million and 15 million 100
    laptops within a year. The MIT team plans to
    continually push the cost of the laptops down by
    incorporating low-power enhancements such as
    "electronic ink" displays.
  • CNET News.com, 9/28

3
The Rest Of DBMS In A Nutshell
  • Session 9

4
Good Side Of Technology
  • Nicholas Negroponte (born 1943) is a
    Greek-American computer scientist best known as
    founder and director of Massachusetts Institute
    of Technology's Media Lab.
  • Born the son of a Greek ship owner on New York
    City's Upper East Side, Nicholas Negroponte
    studied at MIT, where as a graduate student he
    specialized in the field of computer-aided
    design. He joined the faculty of MIT in 1966. He
    is the brother of United States Director of
    National Intelligence, John Negroponte.
  • He was a founder of WiReD magazine and has been
    an "angel investor" for over 40 start-ups,
    including three in China. Professor Negroponte
    helped to establish, and serves as chairman of,
    the 2B1 Foundation, an organization dedicated to
    bringing computer access to children in the most
    remote and poorest parts of the world

5
Lecture Overview
  • Overview Of Storage and Indexing (Chap 8)
  • Overview Of Query Evaluation (Chap 12)
  • Schema Refinement and Normal Forms (Chap 19)
  • Physical Database Design (Chap 20)
  • Security and Authorization (Chap 21)

6
Overview of Storage and Indexing
  • Chapter 8

7
Data on External Storage
  • Disks Can retrieve random page at fixed cost
  • But reading several consecutive pages is much
    cheaper than reading them in random order
  • Tapes Can only read pages in sequence
  • Cheaper than disks used for archival storage
  • File organization Method of arranging a file of
    records on external storage.
  • Record id (rid) is sufficient to physically
    locate record
  • Indexes are data structures that allow us to find
    the record ids of records with given values in
    index search key fields
  • Architecture Buffer manager stages pages from
    external storage to main memory buffer pool. File
    and index layers make calls to the buffer manager.

8
Alternative File Organizations
  • Many alternatives exist, each ideal for some
    situations, and not so good in others
  • Heap (random order) files - Suitable when
    typical access is a file scan retrieving all
    records.
  • Sorted Files - Best if records must be retrieved
    in some order, or only a range of records is
    needed.
  • Indexes - Data structures to organize records via
    trees or hashing.
  • Like sorted files, they speed up searches for a
    subset of records, based on values in certain
    (search key) fields
  • Updates are much faster than in sorted files.

9
Indexes
  • An index on a file speeds up selections on the
    search key fields for the index
  • Any subset of the fields of a relation can be the
    search key for an index on the relation
  • Search key is not the same as key (minimal set of
    fields that uniquely identify a record in a
    relation)
  • An index contains a collection of data entries,
    and supports efficient retrieval of all data
    entries k with a given key value k
  • Given data entry k, we can find record with key
    k in at most one disk I/O

10
B Tree Indexes
Non-leaf
Pages
Leaf
Pages (Sorted by search key)
  • Leaf pages contain data entries, and are chained
    (prev next)
  • Non-leaf pages have index entries only used to
    direct searches

index entry
P
K
P
K
P
P
K
m
0
1
2
1
m
2
11
Example B Tree
Note how data entries in leaf level are sorted
Root
17
Entries lt 17
Entries gt 17
27
30
13
5
2
3
39
38
7
5
8
22
24
27
29
14
16
33
34
  • Find 28? 29? All gt 15 and lt 30
  • Insert/delete Find data entry in leaf, then
    change it. Need to adjust parent sometimes.
  • And change sometimes bubbles up the tree

12
Hash-Based Indexes
  • Good for equality selections.
  • Index is a collection of buckets
  • Bucket primary page plus zero or more overflow
    pages
  • Buckets contain data entries.
  • Hashing function h - h(r) bucket in which (data
    entry for) record r belongs. h looks at the
    search key fields of r.

13
Index Classification
  • Primary vs secondary - If search key contains
    primary key, then called primary index.
  • Unique index Search key contains a candidate
    key.
  • Clustered vs unclustered - If order of data
    records is the same as, or close to, order of
    data entries, then called clustered index.
  • A file can be clustered on at most one search
    key.
  • Cost of retrieving data records through index
    varies greatly based on whether index is
    clustered or not!

14
Clustered vs. Unclustered Index
  • Suppose that the data records are stored in a
    Heap file
  • To build clustered index, first sort the Heap
    file (with some free space on each page for
    future inserts).
  • Overflow pages may be needed for inserts. (Thus,
    order of data recs is close to, but not
    identical to, the sort order.)

Index entries
CLUSTERED
direct search for
UNCLUSTERED
data entries
Data entries
Data entries
(Index File)
(Data file)
Data Records
Data Records
15
Cost Model for Our Analysis
  • We ignore CPU costs, for simplicity
  • B The number of data pages
  • R Number of records per page
  • D (Average) time to read or write disk page
  • Measuring number of page I/Os ignores gains of
    pre-fetching a sequence of pages thus, even I/O
    cost is only approximated.
  • Average-case analysis based on several
    simplistic assumptions.
  • Good enough to show the overall trends!

16
Comparing File Organizations
  • Heap files (random order insert at eof)
  • Sorted files, sorted on ltage, salgt
  • Clustered B tree file, search key ltage, salgt
  • Heap file with unclustered B tree index on
    search key ltage, salgt
  • Heap file with unclustered hash index on search
    key ltage, salgt

17
Operations to Compare
  • Scan Fetch all records from disk
  • Equality search
  • Range selection
  • Insert a record
  • Delete a record

18
Understanding the Workload
  • For each query in the workload
  • Which relations does it access?
  • Which attributes are retrieved?
  • Which attributes are involved in selection/join
    conditions? How selective are these conditions
    likely to be?
  • For each update in the workload
  • Which attributes are involved in selection/join
    conditions? How selective are these conditions
    likely to be?
  • The type of update (INSERT/DELETE/UPDATE), and
    the attributes that are affected.

19
Choice of Indexes
  • What indexes should we create?
  • Which relations should have indexes? What
    field(s) should be the search key? Should we
    build several indexes?
  • For each index, what kind of an index should it
    be?
  • Clustered? Hash/tree?

20
Choice of Indexes (Contd.)
  • One approach Consider the most important queries
    in turn. Consider the best plan using the
    current indexes, and see if a better plan is
    possible with an additional index. If so, create
    it.
  • Obviously, this implies that we must understand
    how a DBMS evaluates queries and creates query
    evaluation plans!
  • For now, we discuss simple 1-table queries.
  • Before creating an index, must also consider the
    impact on updates in the workload!
  • Trade-off Indexes can make queries go faster,
    updates slower. Require disk space, too.

21
Index Selection Guidelines
  • Attributes in WHERE clause are candidates for
    index keys.
  • Exact match condition suggests hash index.
  • Range query suggests tree index.
  • Clustering is especially useful for range
    queries can also help on equality queries if
    there are many duplicates.
  • Multi-attribute search keys should be considered
    when a WHERE clause contains several conditions.
  • Order of attributes is important for range
    queries.
  • Such indexes can sometimes enable index-only
    strategies for important queries.
  • For index-only strategies, clustering is not
    important!
  • Try to choose indexes that benefit as many
    queries as possible. Since only one index can be
    clustered per relation, choose it based on
    important queries that would benefit the most
    from clustering.

22
Examples of Clustered Indexes
SELECT E.dno FROM Emp E WHERE E.agegt40
  • B tree index on E.age can be used to get
    qualifying rows
  • How selective is the condition?
  • Is the index clustered?
  • Consider the GROUP BY query
  • If many tuples have E.age gt 10, using E.age index
    and sorting the retrieved rows may be costly
  • Clustered E.dno index may be better!
  • Equality queries and duplicates
  • Clustering on E.hobby helps!

SELECT E.dno, COUNT () FROM Emp E WHERE
E.agegt10 GROUP BY E.dno
SELECT E.dno FROM Emp E WHERE E.hobbyStamps
23
Indexes with Composite Search Keys
  • Composite Search Keys - Search on a combination
    of fields.
  • Equality query - Every field value is equal to a
    constant value. E.g. wrt ltsal,agegt index
  • age20 and sal 75
  • Range query - Some field value is not a constant.
    E.g.
  • age20 and sal gt 10
  • Data entries in index sorted by search key to
    support range queries.
  • Lexicographic order, or
  • Spatial order.

Examples of composite key indexes using
lexicographic order.
11,80
11
12
12,10
name
age
sal
12,20
12
bob
10
12
13,75
13
cal
80
11
ltage, salgt
ltagegt
joe
12
20
sue
13
75
10,12
10
20
20,12
Data records sorted by name
75,13
75
80,11
80
ltsal, agegt
ltsalgt
Data entries in index sorted by ltsal,agegt
Data entries sorted by ltsalgt
24
Composite Search Keys
  • To retrieve Emp records with age30 AND sal4000,
    an index on ltage,salgt would be better than an
    index on age or an index on sal
  • Choice of index key orthogonal to clustering etc.
  • If condition is 20ltagelt30 AND 3000ltsallt5000
  • Clustered tree index on ltage,salgt or ltsal,agegt is
    best
  • If condition is age30 AND 3000ltsallt5000
  • Clustered ltage,salgt index much better than
    ltsal,agegt index!

25
Summary
  • Many alternative file organizations exist, each
    appropriate in some situation.
  • If selection queries are frequent, sorting the
    file or building an index is important.
  • Hash-based indexes only good for equality search
  • Sorted files and tree-based indexes best for
    range search also good for equality search.
    (Files rarely kept sorted in practice B tree
    index is better.)
  • Index is a collection of data entries plus a way
    to quickly find entries with given key values.

26
Summary (Contd.)
  • Data entries can be actual data records, ltkey,
    ridgt pairs, or ltkey, rid-listgt pairs.
  • Can have several indexes on a given file of data
    records, each with a different search key
  • Indexes can be classified as clustered vs.
    unclustered, primary vs. secondary. Differences
    have important consequences for
    utility/performance.

27
Summary (Contd.)
  • Understanding the nature of the workload for the
    application, and the performance goals, is
    essential to developing a good design.
  • What are the important queries and updates? What
    attributes/relations are involved?
  • Indexes must be chosen to speed up important
    queries (and perhaps some updates!).
  • Index maintenance overhead on updates to key
    fields.
  • Choose indexes that can help many queries, if
    possible.
  • Build indexes to support index-only strategies.
  • Clustering is an important decision only one
    index on a given relation can be clustered!
  • Order of fields in composite index key can be
    important.

28
Lecture Overview
  • Overview Of Storage and Indexing (Chap 8)
  • Overview Of Query Evaluation (Chap 12)
  • Schema Refinement and Normal Forms (Chap 19)
  • Physical Database Design (Chap 20)
  • Security and Authorization (Chap 21)

29
Overview of Query Evaluation
  • Chapter 12

30
Some Common Techniques
  • Algorithms for evaluating relational operators
    use some simple ideas extensively
  • Indexing Can use WHERE conditions to retrieve
    small set of tuples (selections, joins)
  • Iteration Sometimes, faster to scan all tuples
    even if there is an index. (And sometimes, we can
    scan the data entries in an index instead of the
    table itself.)
  • Partitioning By using sorting or hashing, we can
    partition the input tuples and replace an
    expensive operation by similar operations on
    smaller inputs.

Watch for these techniques as we discuss query
evaluation!
31
Statistics and Catalogs
  • Need information about the relations and indexes
    involved. Catalogs typically contain at least
  • tuples (NTuples) and pages (NPages) for each
    relation
  • distinct key values (NKeys) and NPages for each
    index
  • Index height, low/high key values (Low/High) for
    each tree index
  • Catalogs updated periodically
  • Updating whenever data changes is too expensive
    lots of approximation anyway, so slight
    inconsistency ok.
  • More detailed information (e.g., histograms of
    the values in some field) are sometimes stored.

32
Highlights of System R Optimizer
  • Impact
  • Most widely used currently works well for lt 10
    joins.
  • Cost estimation - Approximate art at best
  • Statistics, maintained in system catalogs, used
    to estimate cost of operations and result sizes.
  • Considers combination of CPU and I/O costs.
  • Plan Space - Too large, must be pruned.
  • Only the space of left-deep plans is considered
  • Left-deep plans allow output of each operator to
    be pipelined into the next operator without
    storing it in a temporary relation.

33
Cost Estimation
  • For each plan considered, must estimate cost
  • Must estimate cost of each operation in plan
    tree.
  • Depends on input cardinalities.
  • Weve already discussed how to estimate the cost
    of operations (sequential scan, index scan,
    joins, etc.)
  • Must also estimate size of result for each
    operation in tree!
  • Use information about the input relations.
  • For selections and joins, assume independence of
    predicates.

34
Schema for Examples
Sailors (sid integer, sname string, rating
integer, age real) Reserves (sid integer, bid
integer, day dates, rname string)
  • Similar to old schema rname added for
    variations.
  • Reserves
  • Each tuple is 40 bytes long, 100 tuples per
    page, 1000 pages.
  • Sailors
  • Each tuple is 50 bytes long, 80 tuples per page,
    500 pages.

35
Motivating Example
RA Tree
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
  • Cost 5005001000 I/Os
  • By no means the worst plan!
  • Misses several opportunities selections could
    have been pushed earlier, no use is made of any
    available indexes, etc.
  • Goal of optimization To find more efficient
    plans that compute the same answer.

Plan
36
Alternative Plans 1 (No Indexes)
  • Main difference push selects.
  • With 5 buffers, cost of plan
  • Scan Reserves (1000) write temp T1 (10 pages,
    if we have 100 boats, uniform distribution).
  • Scan Sailors (500) write temp T2 (250 pages, if
    we have 10 ratings).
  • Sort T1 (2210), sort T2 (23250), merge
    (10250)
  • Total 3560 page I/Os.
  • If we used BNL join, join cost 104250, total
    cost 2770.
  • If we push projections, T1 has only sid, T2
    only sid and sname
  • T1 fits in 3 pages, cost of BNL drops to under
    250 pages, total lt 2000.

37
Alternative Plans 2With Indexes
(On-the-fly)
sname
(On-the-fly)
rating gt 5
  • With clustered index on bid of Reserves, we get
    100,000/100 1000 tuples on 1000/100 10
    pages.
  • INL with pipelining (outer is not materialized).

(Index Nested Loops,
with pipelining )
sidsid
(Use hash
Sailors
bid100
index do
not write
result to
temp)
Reserves
  • Projecting out unnecessary fields from outer
    doesnt help.
  • Join column sid is a key for Sailors.
  • At most one matching tuple, unclustered index on
    sid OK.
  • Decision not to push ratinggt5 before the join
    is based on
  • availability of sid index on Sailors.
  • Cost Selection of Reserves tuples (10 I/Os)
    for each,
  • must get matching Sailors tuple (10001.2)
    total 1210 I/Os.

38
Summary
  • There are several alternative evaluation
    algorithms for each relational operator
  • A query is evaluated by converting it to a tree
    of operators and evaluating the operators in the
    tree
  • Must understand query optimization in order to
    fully understand the performance impact of a
    given database design (relations, indexes) on a
    workload (set of queries)
  • Two parts to optimizing a query
  • Consider a set of alternative plans.
  • Must prune search space typically, left-deep
    plans only.
  • Must estimate cost of each plan that is
    considered.
  • Must estimate size of result and cost for each
    plan node.
  • Key issues Statistics, indexes, operator
    implementations.

39
Lecture Overview
  • Overview Of Storage and Indexing (Chap 8)
  • Overview Of Query Evaluation (Chap 12)
  • Schema Refinement and Normal Forms (Chap 19)
  • Physical Database Design (Chap 20)
  • Security and Authorization (Chap 21)

40
Schema Refinement and Normal Forms
  • Chapter 19

41
The Evils of Redundancy
  • Redundancy is at the root of several problems
    associated with relational schemas
  • redundant storage, insert/delete/update anomalies
  • Integrity constraints, in particular functional
    dependencies, can be used to identify schemas
    with such problems and to suggest refinements
  • Main refinement technique decomposition
    (replacing ABCD with, say, AB and BCD, or ACD and
    ABD)
  • Decomposition should be used judiciously
  • Is there reason to decompose a relation?
  • What problems (if any) does the decomposition
    cause?

42
Functional Dependencies (FDs)
  • A functional dependency X Y holds over
    relation R if, for every allowable instance r of
    R
  • t1 r, t2 r, (t1) (t2)
    implies (t1) (t2)
  • i.e., given two tuples in r, if the X values
    agree, then the Y values must also agree. (X and
    Y are sets of attributes.)
  • An FD is a statement about all allowable
    relations.
  • Must be identified based on semantics of
    application.
  • Given some allowable instance r1 of R, we can
    check if it violates some FD f, but we cannot
    tell if f holds over R!
  • K is a candidate key for R means that K R
  • However, K R does not require K to be
    minimal!

43
Example Constraints on Entity Set
  • Consider relation obtained from Hourly_Emps
  • Hourly_Emps (ssn, name, lot, rating, hrly_wages,
    hrs_worked)
  • Notation We will denote this relation schema by
    listing the attributes SNLRWH
  • This is really the set of attributes
    S,N,L,R,W,H.
  • Sometimes, we will refer to all attributes of a
    relation by using the relation name. (e.g.,
    Hourly_Emps for SNLRWH)
  • Some FDs on Hourly_Emps
  • ssn is the key S SNLRWH
  • rating determines hrly_wages R W

44
Example (Contd.)
lot
name
rating
ssn
hourly_wages
hours_worked
45
Example (Contd.)
Wages
Hourly_Emps2
  • Problems due to R W
  • Update anomaly Can we change W in
    just the 1st tuple of SNLRWH?
  • Insertion anomaly What if we want to insert an
    employee and dont know the hourly wage for his
    rating?
  • Deletion anomaly If we delete all employees with
    rating 5, we lose the information about the wage
    for rating 5!

Will 2 smaller tables be better?
46
Normal Forms
  • Returning to the issue of schema refinement, the
    first question to ask, Is any refinement
    needed?
  • If a relation is in a certain normal form (BCNF,
    3NF etc.), it is known that certain kinds of
    problems are avoided/minimized. This can be used
    to help us decide whether decomposing the
    relation will help.
  • Role of FDs in detecting redundancy
  • Consider a relation R with 3 attributes, ABC.
  • No FDs hold There is no redundancy here.
  • Given A B Several tuples could have the
    same A value, and if so, theyll all have the
    same B value!

47
Boyce-Codd Normal Form (BCNF)
  • In other words, a relation, R, is in BCNF if the
    only non-trivial FDs that hold over R are key
    constraints
  • No dependency in R that can be predicted using
    FDs alone
  • If we are shown two tuples that agree upon the X
    value, we cannot infer the A value in one tuple
    from the A value in the other
  • If example relation is in BCNF, the 2 tuples must
    be identical (since X is a key)

Nonkey attrk
Nonkey attr2
KEY
Nonkey attr1
BNCF ensures that no redundancy can be detected
using FD information alone. It is the most
desirable normal form!
48
Third Normal Form (3NF)
  • If R is in BCNF, its obviously in 3NF.
  • If R is in 3NF, some redundancy is possible. It
    is a compromise, used when BCNF not achievable
    (e.g., no good decomp, or performance
    considerations).
  • Lossless-join, dependency-preserving
    decomposition of R into a collection of 3NF
    relations is always possible
  • Thus, 3NF is indeed a compromise relative to
    BCNF.

49
Decomposition of a Relation Scheme
  • Suppose that relation R contains attributes A1
    ... An. A decomposition of R consists of
    replacing R by two or more relations such that
  • Each new relation scheme contains a subset of the
    attributes of R (and no attributes that do not
    appear in R), and
  • Every attribute of R appears as an attribute of
    one of the new relations.
  • Intuitively, decomposing R means we will store
    instances of the relation schemes produced by the
    decomposition, instead of instances of R.
  • E.g., Can decompose SNLRWH into SNLRH and RW.

50
Example Decomposition
  • Decompositions should be used only when needed.
  • SNLRWH has FDs S SNLRWH and R W
  • Second FD causes violation of 3NF W values
    repeatedly associated with R values. Easiest way
    to fix this is to create a relation RW to store
    these associations, and to remove W from the main
    schema
  • i.e., we decompose SNLRWH into SNLRH and RW
  • The information to be stored consists of SNLRWH
    tuples. If we just store the projections of
    these tuples onto SNLRH and RW, are there any
    potential problems that we should be aware of?

51
Problems with Decompositions
  • There are three potential problems to consider
  • Some queries become more expensive.
  • e.g., How much did sailor Joe earn? (salary
    WH)
  • Given instances of the decomposed relations, we
    may not be able to reconstruct the corresponding
    instance of the original relation!
  • Fortunately, not in the SNLRWH example.
  • Checking some dependencies may require joining
    the instances of the decomposed relations.
  • Fortunately, not in the SNLRWH example.
  • Tradeoff Must consider these issues vs.
    redundancy.

52
Lossless Join Decompositions
  • Decomposition of R into X and Y is lossless-join
    w.r.t. a set of FDs F if, for every instance r
    that satisfies F
  • (r) (r) r
  • It is always true that r (r)
    (r)
  • In general, the other direction does not hold!
    If it does, the decomposition is lossless-join.
  • Definition extended to decomposition into 3 or
    more relations in a straightforward way.
  • It is essential that all decompositions used to
    deal with redundancy be lossless! (Avoids
    Problem (2).)

53
More on Lossless Join
  • The decomposition of R into X and Y is
    lossless-join wrt F if and only if the closure
    of F contains
  • X Y X, or
  • X Y Y
  • In particular, the decomposition of R into
    UV and R - V is lossless-join if U V
    holds over R.

54
Dependency Preserving Decomposition
  • Dependency preserving decomposition (Intuitive)
  • If R is decomposed into X, Y and Z, and we
    enforce the FDs that hold on X, on Y and on Z,
    then all FDs that were given to hold on R must
    also hold. (Avoids Problem (3).)

55
Summary of Normalization
  • 1NF Eliminate Repeating Groups - Make a separate
    table for each set of related attributes, and
    give each table a primary key
  • 2NF Eliminate Redundant Data - If an attribute
    depends on only part of a multi-valued key,
    remove it to a separate table
  • 3NF Eliminate Columns Not Dependent On Key - If
    attributes do not contribute to a description of
    the key, remove them to a separate table
  • BCNF Boyce-Codd Normal Form - If there are
    non-trivial dependencies between candidate key
    attributes, separate them out into distinct tables

http//www.datamodel.org/NormalizationRules.html
56
Refining an ER Diagram
Before
  • 1st diagram translated
    Workers(S,N,L,D,S) Departments(D,M,B)
  • Lots associated with workers.
  • Suppose all workers in a dept are assigned the
    same lot D L
  • Redundancy fixed by Workers2(S,N,D,S)
    Dept_Lots(D,L)
  • Can fine-tune this Workers2(S,N,D,S)
    Departments(D,M,B,L)

Note that Employees and Works_In are mapped to a
single relation, Workers
After
57
Summary of Schema Refinement
  • If a relation is in BCNF, it is free of
    redundancies that can be detected using FDs.
    Thus, trying to ensure that all relations are in
    BCNF is a good heuristic.
  • If a relation is not in BCNF, we can try to
    decompose it into a collection of BCNF relations.
  • Must consider whether all FDs are preserved. If
    a lossless-join, dependency preserving
    decomposition into BCNF is not possible (or
    unsuitable, given typical queries), should
    consider decomposition into 3NF.
  • Decompositions should be carried out and/or
    re-examined while keeping performance
    requirements in mind.

58
Lecture Overview
  • Overview Of Storage and Indexing (Chap 8)
  • Overview Of Query Evaluation (Chap 12)
  • Schema Refinement and Normal Forms (Chap 19)
  • Physical Database Design (Chap 20)
  • Security and Authorization (Chap 21)

59
Physical Database Design
  • Chapter 20

60
Overview
  • After ER design, schema refinement, and the
    definition of views, we have the conceptual and
    external schemas for our database.
  • The next step is to choose indexes, make
    clustering decisions, and to refine the
    conceptual and external schemas (if necessary) to
    meet performance goals.
  • We must begin by understanding the workload
  • The most important queries and how often they
    arise.
  • The most important updates and how often they
    arise.
  • The desired performance for these queries and
    updates.

61
Decisions to Make
  • What indexes should we create?
  • Which relations should have indexes? What
    field(s) should be the search key? Should we
    build several indexes?
  • For each index, what kind of an index should it
    be?
  • Clustered? Hash/tree?
  • Should we make changes to the conceptual schema?
  • Consider alternative normalized schemas?
    (Remember, there are many choices in decomposing
    into BCNF, etc.)
  • Should we undo some decomposition steps and
    settle for a lower normal form?
    (Denormalization.)
  • Horizontal partitioning, replication, views ...

62
Index Selection for Joins
  • When considering a join condition
  • Hash index on inner is very good for Index Nested
    Loops
  • Should be clustered if join column is not key for
    inner, and inner tuples need to be retrieved
  • Clustered B tree on join column(s) good for
    Sort-Merge

63
Example 1
SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE
D.dnameToy AND E.dnoD.dno
  • Hash index on D.dname supports Toy selection.
  • Given this, index on D.dno is not needed
  • Hash index on E.dno allows us to get matching
    (inner) Emp rows for each selected (outer) Dept
    row
  • What if WHERE included ... AND E.age25
    ?
  • Could retrieve Emp rows using index on E.age,
    then join with Dept rows satisfying dname
    selection. Comparable to strategy that used
    E.dno index.
  • So, if E.age index is already created, this query
    provides much less motivation for adding an E.dno
    index.

64
Example 2
SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE
E.sal BETWEEN 10000 AND 20000 AND
E.hobbyStamps AND E.dnoD.dno
Range selection
  • Clearly, Emp should be the outer relation.
  • Suggests that we build a hash index on D.dno
  • What index should we build on Emp?
  • B tree on E.sal could be used, OR an index on
    E.hobby could be used. Only one of these is
    needed, and which is better depends upon the
    selectivity of the conditions.
  • As a rule of thumb, equality selections more
    selective than range selections.
  • As both examples indicate, our choice of indexes
    is guided by the plan(s) that we expect an
    optimizer to consider for a query. Have to
    understand optimizers!

65
Clustering and Joins
SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE
D.dnameToy AND E.dnoD.dno
  • Clustering is especially important when accessing
    inner rows in INL.
  • Should make index on E.dno clustered
  • Suppose that the WHERE clause is instead
  • WHERE E.hobbyStamps AND E.dnoD.dno
  • If many employees collect stamps, Sort-Merge join
    may be worth considering. A clustered index on
    D.dno would help.
  • Summary Clustering is useful whenever many rows
    are to be retrieved.

66
Tuning the Conceptual Schema
  • The choice of conceptual schema should be guided
    by the workload, in addition to redundancy
    issues
  • We may settle for a 3NF schema rather than BCNF
  • Workload may influence the choice we make in
    decomposing a relation into 3NF or BCNF
  • We may further decompose a BCNF schema!
  • We might denormalize (i.e., undo a decomposition
    step), or we might add fields to a relation.
  • If such changes are made after a database is in
    use, called schema evolution might want to mask
    some of these changes from applications by
    defining views

67
Tuning Queries and Views
  • If a query runs slower than expected, check if an
    index needs to be re-built, or if statistics are
    too old
  • Sometimes, the DBMS may not be executing the plan
    you had in mind. Common areas of weakness
  • Selections involving null values
  • Selections involving arithmetic or string
    expressions
  • Selections involving OR conditions.
  • Lack of evaluation features like index-only
    strategies or certain join methods or poor size
    estimation
  • Check the plan that is being used! Then adjust
    the choice of indexes or rewrite the query/view

68
More Guidelines for Query Tuning
  • Minimize the use of DISTINCT dont need it if
    duplicates are acceptable, or if answer contains
    a key.
  • Minimize the use of GROUP BY and HAVING

SELECT MIN (E.age) FROM Employee E GROUP BY
E.dno HAVING E.dno102
SELECT MIN (E.age) FROM Employee E WHERE
E.dno102
69
Summary
  • Database design consists of several tasks
    requirements analysis, conceptual design, schema
    refinement, physical design and tuning.
  • In general, have to go back and forth between
    these tasks to refine a database design, and
    decisions in one task can influence the choices
    in another task.
  • Understanding the nature of the workload for the
    application, and the performance goals, is
    essential to developing a good design
  • What are the important queries and updates? What
    attributes/relations are involved?

70
Summary
  • The conceptual schema should be refined by
    considering performance criteria and workload
  • May choose 3NF or lower normal form over BCNF.
  • May choose among alternative decompositions into
    BCNF (or 3NF) based upon the workload.
  • May denormalize, or undo some decompositions.
  • May decompose a BCNF relation further!

71
Summary (Contd.)
  • Over time, indexes have to be fine-tuned
    (dropped, created, re-built, ...) for performance
  • Should determine the plan used by the system, and
    adjust the choice of indexes appropriately
  • System may still not find a good plan
  • Only left-deep plans considered!
  • Null values, arithmetic conditions, string
    expressions, the use of ORs, etc. can confuse an
    optimizer
  • So, may have to rewrite the query/view
  • Avoid nested queries, temporary relations,
    complex conditions, and operations like DISTINCT
    and GROUP BY

72
Lecture Overview
  • Overview Of Storage and Indexing (Chap 8)
  • Overview Of Query Evaluation (Chap 12)
  • Schema Refinement and Normal Forms (Chap 19)
  • Physical Database Design (Chap 20)
  • Security and Authorization (Chap 21)

73
Security and Authorization
  • Chapter 21

74
Introduction to DB Security
  • Secrecy - Users should not be able to see things
    they are not supposed to
  • E.g., A student cant see other students grades
  • Integrity - Users should not be able to modify
    things they are not supposed to
  • E.g., Only instructors can assign grades
  • Availability - Users should be able to see and
    modify things they are allowed to

75
Access Controls
  • A security policy specifies who is authorized to
    do what
  • A security mechanism allows us to enforce a
    chosen security policy.
  • Two main mechanisms at the DBMS level
  • Discretionary access control - privileges
  • Mandatory access control system policies (root)

76
Discretionary Access Control
  • Based on the concept of access rights or
    privileges for objects (tables and views), and
    mechanisms for giving users privileges (and
    revoking privileges).
  • Creator of a table or a view automatically gets
    all privileges on it.
  • DMBS keeps track of who subsequently gains and
    loses privileges, and ensures that only requests
    from users who have the necessary privileges (at
    the time the request is issued) are allowed.

77
GRANT Command
GRANT privileges ON object TO users WITH GRANT
OPTION
  • The following privileges can be specified
  • SELECT Can read all columns (including those
    added later via ALTER TABLE command)
  • INSERT(col-name) Can insert rows with non-null
    or non-default values in this column
  • INSERT means same right with respect to all
    columns
  • DELETE Can delete rows
  • REFERENCES (col-name) Can define foreign keys
    (in other tables) that refer to this column
  • If a user has a privilege with the GRANT OPTION,
    can pass privilege on to other users (with or
    without passing on the GRANT OPTION)
  • Only owner can execute CREATE, ALTER, and DROP

78
GRANT and REVOKE of Privileges
  • GRANT INSERT, SELECT ON Sailors TO Horatio
  • Horatio can query Sailors or insert rows into it
  • GRANT DELETE ON Sailors TO Yuppy WITH GRANT
    OPTION
  • Yuppy can delete rows, and also authorize others
    to do so
  • GRANT UPDATE (rating) ON Sailors TO Dustin
  • Dustin can update (only) the rating field of
    Sailors rows
  • GRANT SELECT ON ActiveSailors TO Guppy, Yuppy
  • This does NOT allow the uppies to query Sailors
    directly!
  • REVOKE
  • When a privilege is revoked from X, it is also
    revoked from all users who got it solely from X

79
GRANT/REVOKE on Views
  • If the creator of a view loses the SELECT
    privilege on an underlying table, the view is
    dropped!
  • If the creator of a view loses a privilege held
    with the grant option on an underlying table,
    (s)he loses the privilege on the view as well so
    do users who were granted that privilege on the
    view!

80
Views and Security
  • Views can be used to present necessary
    information (or a summary), while hiding details
    in underlying relation(s)
  • Given ActiveSailors, but not Sailors or Reserves,
    we can find sailors who have a reservation, but
    not the bids of boats that have been reserved
  • Creator of view has a privilege on the view if
    (s)he has the privilege on all underlying tables.
  • Together with GRANT/REVOKE commands, views are a
    very powerful access control tool

81
Role-Based Authorization
  • In SQL-92, privileges are actually assigned to
    authorization ids, which can denote a single user
    or a group of users
  • In SQL1999 (and in many current systems),
    privileges are assigned to roles
  • Roles can then be granted to users and to other
    roles
  • Reflects how real organizations work
  • Illustrates how standards often catch up with de
    facto standards embodied in popular systems

82
Security to the Level of a Field!
  • Can create a view that only returns one field of
    one row
  • Then grant access to that view accordingly.
  • Allows for arbitrary granularity of control, but
  • Clumsy to specify, though this can be hidden
    under a good UI
  • Performance is unacceptable if we need to define
    field-granularity access frequently (Too many
    view creations and look-ups)

83
Internet-Oriented Security
  • Key Issues User authentication and trust
  • When DB must be accessed from a secure location,
    password-based schemes are usually adequate.
  • For access over an external network, trust is
    hard to achieve
  • If someone with Sams credit card wants to buy
    from you, how can you be sure it is not someone
    who stole his card?
  • How can Sam be sure that the screen for entering
    his credit card information is indeed yours, and
    not some rogue site spoofing you (to steal such
    information)? How can he be sure that sensitive
    information is not sniffed while it is being
    sent over the network to you?
  • Encryption is a technique used to address these
    issues.

84
Encryption
  • Masks data for secure transmission or storage
  • Encrypt(data, encryption key) encrypted data
  • Decrypt(encrypted data, decryption key)
    original data
  • Without decryption key, the encrypted data is
    meaningless gibberish
  • Symmetric Encryption
  • Encryption key decryption key all authorized
    users know decryption key (a weakness).
  • DES, used since 1977, has 56-bit key AES has
    128-bit (optionally, 192-bit or 256-bit) key
  • Public-Key Encryption - Each user has two keys
  • Users public encryption key - Known to all
  • Private decryption key - Known only to this user
  • Used in RSA scheme

85
RSA Public-Key Encryption
  • Let the data be an integer I
  • Choose a large (gtgt I) integer L p q
  • p, q are large, say 1024-bit, distinct prime
    numbers
  • Encryption - Choose a random number 1 lt e lt L
    that is relatively prime to (p-1) (q-1)
  • Encrypted data S I e mod L
  • Decryption key d - Chosen so that
  • d e 1 mod ((p-1) (q-1))
  • We can then show that I S d mod L
  • It turns out that the roles of e and d can be
    reversed so they are simply called the public
    and private keys

86
Certifying Servers SSL, SET
  • If Amazon distributes their public key, Sams
    browser will encrypt his order using it.
  • So, only Amazon can decipher the order, since no
    one else has Amazons private key.
  • But how can Sam (or his browser) know that the
    public key for Amazon is genuine? The SSL
    protocol covers this.
  • Amazon contracts with, say, Verisign, to issue a
    certificate ltVerisign, Amazon, amazon.com,
    public-keygt
  • This certificate is stored in encrypted form,
    encrypted with Verisigns private key, known only
    to Verisign
  • Verisigns public key is known to all browsers,
    which can therefore decrypt the certificate and
    obtain Amazons public key, and be confident that
    it is genuine.
  • The browser then generates a temporary session
    key, encodes it using Amazons public key, and
    sends it to Amazon.
  • All subsequent msgs between the browser and
    Amazon are encoded using symmetric encryption
    (e.g., DES), which is more efficient than
    public-key encryption.
  • What if Sam doesnt trust Amazon with his credit
    card information?
  • Secure Electronic Transaction protocol 3-way
    communication between Amazon, Sam, and a trusted
    server, e.g., Visa.

87
Authenticating Users
  • Amazon can simply use password authentication,
    i.e., ask Sam to log into his Amazon account
  • Done after SSL is used to establish a session
    key, so that the transmission of the password is
    secure!
  • Amazon is still at risk if Sams card is stolen
    and his password is hacked. Business risk
  • Digital Signatures
  • Sam encrypts the order using his private key,
    then encrypts the result using Amazons public
    key
  • Amazon decrypts the msg with their private key,
    and then decrypts the result using Sams public
    key, which yields the original order!
  • Exploits interchangeability of public/private
    keys for encryption/decryption
  • Now, no one can forge Sams order, and Sam cannot
    claim that someone else forged the order.

88
Summary
  • Three main security objectives secrecy,
    integrity, availability
  • DB admin is responsible for overall security.
  • Designs security policy, maintains an audit
    trail, or history of users accesses to DB.
  • Two main approaches to DBMS security
    discretionary and mandatory access control
  • Discretionary control based on notion of
    privileges
  • Mandatory control based on notion of security
    classes

89
Final Exam
  • Due Next Saturday, Oct 15

Last Class
  • Class Begins at 9AM!!!
  • Well Cover XML Data

90
See Dark At The End Of The Tunnel
But Almost There!!!
Write a Comment
User Comments (0)
About PowerShow.com