Title: CS157B - Fall 2004
Entity-Relationship Data Model
- Entity Sets
- a collection of similar entities
- Attributes
- properties of the entities in an entity set
- Relationships
- connection among two or more entity sets
- Constraints
- Keys: uniquely identify an entity within its entity set
- Single-value constraints: requirement that the value in a certain context be unique
- Referential integrity constraints: existence test
- Domain constraints: range test
- General constraints: data set constraints
E/R - Cont.
- Constraints Are Part of the Schema
- Should be part of the DB design
- Must be carefully constructed
- Can be troublesome if major redesign is required
- Subclasses
- Special "isa" relationship
- Similar to OO's definition
- Weak Entity Sets
- Some of its key attributes may belong to another entity set
- Needs an additional qualifier
The Relational Data Model
- Relation
- Represented by a two-dimensional table
- Attribute
- Represented as columns in a table
- Schema
- The name of a relation and the set of attributes of this relation
- A database schema consists of one or more relation schemas
- Tuple
- A row in the table
- Domain
- Each attribute of a relation has an elementary type; thus each component of a tuple must be atomic
Relational - Cont
- Converting Subclass Structures to Relations
- E/R-Style Conversion
- Object-Oriented Approach
- Null Values
- Functional Dependencies (FD)
- A unique-value constraint in a relational schema design
- Keys of Relations
- A(1), ..., A(n) is a key of relation R if
- no two distinct tuples agree on all of A(1), ..., A(n)
- no proper subset of A(1), ..., A(n) determines all the other attributes of R, i.e. the key must be minimal
- If there is more than one key, then designate one of the keys as the primary key
- Superkey: a set of attributes that contains a key
Relational - Cont
- Design of Relational Database Schemas
- Anomalies
- Redundancy: the same information repeated in many tuples
- Update Anomalies: updating one tuple does not cause other copies of the same information to be updated
- Deletion Anomalies: losing other information as a side effect when a set of values becomes empty
- Decomposing Relations
- Splitting the attributes of R to make the schemas of two new relations
- Boyce-Codd Normal Form (BCNF)
- The left side of every nontrivial FD must be a superkey
- Recursively decompose into smaller relations until they all satisfy the BCNF rule (see the sketch below)
- Must be able to join the decomposed relations back into the original tuples
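As a sketch of the decomposition idea (the table and attribute names here are illustrative, not from the slides): suppose a single relation MovieInfo(title, year, length, studioName, studioAddr) has the FD studioName -> studioAddr, whose left side is not a superkey. A BCNF decomposition splits it into two relations, which could be declared as:

    create table Studio1 (
        studioName char(30) primary key,   -- left side of the violating FD
        studioAddr varchar(255)
    );

    create table Movie1 (
        title      varchar(100),
        year       int,
        length     int,
        studioName char(30),               -- joins the two relations back together
        primary key (title, year)
    );

Joining Movie1 and Studio1 on studioName recovers the original tuples without repeating the address information in every movie tuple.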
Relational - Cont
- Third Normal Form (3NF)
- A relation R is in 3NF if whenever A(1)...A(n) -> B is a nontrivial FD, either A(1)...A(n) is a superkey, or B is a member of some key
- May allow minimal redundancy in the end
- Multivalued Dependencies (MVD)
- A statement that two sets of attributes in a relation have sets of values that appear in all possible combinations
- Example of MVD in p.118 figure 3.29
- No BCNF violation
- Fourth Normal Form (4NF)
- A relation R is in 4NF if whenever A(1)...A(n) ->-> B(1)...B(m) is a nontrivial MVD, A(1)...A(n) is a superkey
- Can be decomposed similarly to the BCNF decomposition algorithm
Other Data Models
- Review of Object-Oriented Concepts
- Type System
- Record structures
- Collection types
- Reference types
- Classes and Objects
- Contain attributes and methods
- Object Identity
- OID
- Methods
- Functions within a class
- Hierarchies
- Superclass and subclass relationship
- Introduction to ODL
- Object-Oriented Design
- Three main properties: attributes, relationships, and methods
Other Data Models - Cont
- Class Declarations
- e.g. class <name> <list of properties>
- Attributes in ODL
- can be simple or complex type
- Relationships in ODL
- e.g. in Movie: relationship Set<Star> stars
- Inverse Relationships
- e.g. in Star: relationship Set<Movie> starredIn
- Methods in ODL
- signatures
- Types in ODL
- Atomic types
- integer
- float
- character
- character string
- boolean
- enumerations
Other Data Models - Cont
- Structure types
- e.g. class name
- Structures - similar to a tuple of values
- Collection types
- Set - distinct unordered elements
- Bag - unordered elements
- List - ordered elements
- Array - e.g. Array<char,10>
- Dictionary - e.g. Dictionary(Keytype, Rangetype)
- Additional ODL Concepts
- Multiway Relationships in ODL
- ODL supports only binary relationships
- Use several binary, many-one relationships instead
- Subclasses in ODL
- Inherits all the properties of its superclass
Other Data Models - Cont
- Multiple Inheritance in ODL
- Needs to resolve name conflicts from multiple superclasses
- Keys in ODL
- Optional because of the existence of an OID
- Can be declared with one or more attributes
- ODL Designs to Relational Designs
- Many issues
- Possibly no key in ODL
- The relational model can't handle structure and collection types directly
- Converting an ODL type constructor can lead to a BCNF violation
- Representing ODL Relationships
- One relation for each pair of inverse relationships
Other Data Models - Cont
- Object-Relational Model
- O/R Features
- Structured types for attributes
- Methods
- Identifiers for tuples (like OID)
- References
- Compromise Between Pure OO and Relational
- Semistructured Data
- "Schemaless" - the schema is attached to the data
itself - Represented in a collection of nodes
- Interior node describes the data with a label
- Leaf node contains the data of any atomic type
- Legacy-database problem
Relational Algebra
- Algebra of Relational Operations
- Bags rather than sets can be more efficient depending on the operation, such as a union of two relations which contain duplicates
- Components
- Variables that stand for relations, and constants, which are finite relations
- Expressions of relational algebra are referred to as queries
- Set Operations on Relations
- R ∪ S, the union of R and S (distinct)
- R ∩ S, the intersection of R and S (distinct)
- R - S, the difference of R and S, is the set of elements that are in R but not in S; note that R - S is different from S - R
- The attributes must be in the same order and the attribute types must be the same
Relational Algebra - Cont
- Projection
- The projection operator is used to produce from a relation R a new relation that has only some of R's columns
- e.g. Views
- Selection
- The selection operator produces a new relation with a subset of R's tuples; the tuples in the resulting relation are those that satisfy some condition C that involves the attributes of R
- e.g. SQL select statement
- Cartesian Product
- A cross-product of two sets R and S
- e.g. R x S: if R has 2 tuples and S has 3 tuples, the result has 6 tuples
- Natural Joins
- Use matching attributes
Relational Algebra - Cont
- Theta-Joins
- Join two relations with a condition denoted by C
- Combining Operations to Form Queries
- Multiple queries can be combined into a complex query
- e.g. AND, OR, (), ...
- Renaming
- Control the names of the attributes used for relations that are constructed by applying relational-algebra operations
- Dependent and Independent Operations
- Some relational expressions can be "rewritten" as a different expression (see the sketch below)
- e.g. R ∩ S = R - (R - S)
- e.g. Glass half-empty or half-full?
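A minimal SQL sketch of the rewriting rule R ∩ S = R - (R - S); here R and S stand for any two tables with identical schemas (the names are placeholders):

    -- Both queries return the same set of rows:
    select * from R
    intersect
    select * from S;

    select * from R
    except
    (select * from R
     except
     select * from S);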
Relational Algebra - Cont
- Extended Operators of Relational Algebra
- Duplicate Elimination
- This operator converts a bag to a set
- e.g. SQL keyword DISTINCT
- Aggregation Operators
- SUM(), AVG(), MIN(), MAX(), COUNT()
- Grouping
- This operator allows us to group a relation and/or aggregate some columns
- Extended Projection
- This allows expressions as part of an attribute list
- Sorting Operator
- This operator is anomalous, in that it is the only operator in our relational algebra whose result is a list of tuples, rather than a set
- Outerjoin
- Takes care of dangling tuples, which are padded with a special "null" symbol
SQL
- Simple Queries in SQL
- select A(...) from R where C
- Projection & Selection in SQL
- select title, length from Movie where studioName = 'Disney' and year = 1990
- String Comparison
- Bit Strings (bit data)
- e.g. B'011' or X'7ff'
- Lexicographic order
- e.g. 'A' < 'B', 'a' < 'b', 'a' > 'B'?
- Depends on encoding scheme or standard
- (e.g. ASCII, UTF8)
- LIKE keyword
- "s like p" denotes where s is a string and p is a
pattern - Special character
- Dates and Times
- DATE '2002-02-04'
- TIME '19:00:30.5'
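A small, hedged illustration of these comparisons using the Movie table from the examples above ('%' matches any string, '_' matches a single character; releaseDate is an assumed column added only for the date example):

    select title
    from Movie
    where title like 'Star ____';            -- matches e.g. 'Star Wars'

    select title
    from Movie
    where releaseDate = DATE '2002-02-04';   -- releaseDate is not part of the
                                             -- lecture's Movie schema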
SQL - Cont
- Null Values
- 3 cases: unknown, inapplicable, withheld
- NOT a constant
- Test expression for IS NULL
- Truth-Value "unknown"
- Pitfalls regarding nulls
- e.g. select * from Movie
- where length <= 120 or length > 120
- Ordering the Output
- ORDER BY ltlist of attributesgt
- Can be ASC or DESC, default is ASC
- Queries with > 1 Relation
- e.g. select name from Movie, MovieExec where title = 'Star Wars' and producerC = cert
- Disambiguating Attributes
- select MovieStar.name, MovieExec.name from MovieStar, MovieExec where MovieStar.address = MovieExec.address
SQL - Cont
- UNION, INTERSECT, EXCEPT Keywords
- Same logic as the set operators ∪, ∩, and -
- Subqueries
- Can return a single constant or relations in the WHERE clause
- Can have relations appear in the FROM clauses
- Scalar Value
- An atomic value that can appear as one component of a tuple (e.g. constant, attribute)
- e.g. select name from MovieExec where cert = (select producerC from Movie where title = 'Star Wars')
- Conditions Involving Relations
- If R is a relation, then EXISTS R is a condition that is true if R is not empty
SQL - Cont
- s IN R is true if s is equal to one of the values in R; s NOT IN R is true if s is equal to no value in R (see the sketch below for examples)
- s > ALL R is true if s is greater than every value in unary relation R
- s > ANY R is true if s is greater than at least one value in unary relation R
- EXISTS, ALL, ANY operators can be negated by putting NOT in front of the entire expression
- Subqueries in FROM Clauses
- Can substitute an R in the FROM clause with a subquery
- e.g. select name from MovieExec, (select producerC from Movie, StarsIn where title = movieTitle and year = movieYear and starName = 'Harrison Ford')
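A hedged sketch of these subquery conditions, reusing the Movie and MovieExec tables from the earlier examples (cert is assumed to be MovieExec's key, as in those examples):

    -- IN: producers of 'Star Wars'
    select name from MovieExec
    where cert in (select producerC from Movie where title = 'Star Wars');

    -- EXISTS: the same query as a correlated subquery
    select name from MovieExec E
    where exists (select * from Movie
                  where title = 'Star Wars' and producerC = E.cert);

    -- ALL: executives whose net worth is at least as large as everyone else's
    select name from MovieExec
    where netWorth >= all (select netWorth from MovieExec);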
SQL - Cont
- Cross Joins
- Known as Cartesian product or just product
- e.g. Movie CROSS JOIN StarsIn
- Natural Joins
- The join condition is that all pairs of attributes from the two relations having a common name are equated, and there are no other conditions
- One of each pair of equated attributes is projected out
- Outerjoins
- e.g. MovieStar NATURAL FULL OUTER JOIN MovieExec
- Full-Relation Operations
- Eliminating Duplicates
- Use the DISTINCT keyword in SELECT
- Performance consideration
SQL - Cont
- Duplicates in Unions, Intersections, and Differences
- By default, these operations convert bags to sets
- Use the keyword ALL after UNION, INTERSECT, or EXCEPT to prevent the elimination of duplicates
- Grouping and Aggregation in SQL
- Use the special GROUP BY clause
- Aggregation Operators
- SUM, AVG, MIN, MAX are used by applying them to a scalar-valued expression, typically a column name, in a SELECT clause
- COUNT(*) counts all the tuples in the relation constructed from the FROM clause and WHERE clause of the query
- HAVING Clause
- Used in conjunction with GROUP BY to narrow the aggregated list (see the sketch below)
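A minimal sketch of grouping and aggregation over the Movie table used in the earlier examples (the threshold of 10 movies is arbitrary):

    -- Total running time and number of movies per studio,
    -- keeping only studios with more than 10 movies
    select studioName, sum(length) as totalLength, count(*) as numMovies
    from Movie
    group by studioName
    having count(*) > 10;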
SQL - Cont
- Database Modifications (DML)
- Insertion
- Insert into R(A1...An) values (V1,...Vn)
- e.g. insert into Studio(name) values('S1')
- Can insert multiple tuples with subquery
- e.g. insert into Studio(name) select distinct studioName from Movie where studioName not in (select name from Studio)
- Deletion
- Delete from R where <condition>
- Can delete multiple tuples with 1 delete statement depending on <condition>
- Updates
- Update R set <new-value assignment> where <condition>
- Can update multiple tuples with 1 update statement depending on <condition> (see the sketch below)
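A hedged sketch of deletion and update, again over the Movie and MovieExec tables from the earlier examples (the particular conditions are made up):

    -- Remove all pre-1930 movies; the condition decides how many tuples go
    delete from Movie
    where year < 1930;

    -- Give a 10% raise in net worth to the producers of 'Star Wars'
    update MovieExec
    set netWorth = netWorth * 1.1
    where cert in (select producerC from Movie where title = 'Star Wars');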
SQL - Cont
- DDL in SQL
- Data Types
- Character strings, fixed or variable length
- Bit strings, fixed or variable length
- Boolean: true, false, unknown
- Integer (or int), shortint
- Floating-point numbers
- e.g. decimal(n,d) where n is the total number of digits and d is the number of digits to the right of the decimal point
- 1234.56 can be described as decimal(6,2)
- Dates and times can be represented by the data types DATE and TIME respectively
- Table Declarations
- Use the keywords CREATE TABLE followed by the relation name and the list of attributes and their types
- e.g. create table MovieStar(name char(30), address varchar(256), gender char(1), birthdate DATE)
SQL - Cont
- Modifying Relation Schemas
- Drop table MovieStar
- Alter table MovieStar add phone char(16)
- Alter table MovieStar drop birthdate
- Default Values
- Use the DEFAULT keyword to set default values for a column
- e.g. alter table MovieStar add phone char(16) default 'unlisted'
- Indexes
- Allow faster access to data
- e.g. create index YearIndex on Movie(year)
- Can be one or more attributes
- e.g. create index KeyIndex on Movie(title,year)
- Delete the index using drop index statement
- Selection of Indexes
- Selection (query) performance vs. insert/update/delete performance
SQL - Cont
- View Definitions
- A view does not contain any physical data
- "virtual relation"
- Declaring Views
- Create view <view-name> as <view-definition> (see the sketch below)
- Querying Views
- Use the normal select syntax with a view name in place of the table name
- Renaming Attributes
- Can map an attribute name from the base table to a new name in a view definition
- Modifying Views
- Updatable views are useful in special cases; selective inserts, updates, and deletes
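A minimal view sketch over the Movie table from the earlier examples (the view name and the renamed attributes are illustrative):

    create view DisneyMovies(movieTitle, movieYear) as
        select title, year
        from Movie
        where studioName = 'Disney';

    -- queried like an ordinary table
    select movieTitle from DisneyMovies where movieYear = 1990;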
Constraints & Triggers
- Keys & Foreign Keys
- Primary Keys
- Each relation can have only one primary key
- Primary key attribute(s) can not be NULL
- Two ways to specify the primary key
- 1) create table MovieStar(
- name char(30) primary key,
- address varchar(255),
- gender char(1),
- birthdate date)
- 2) create table MovieStar(
- name char(30),
- address varchar(255),
- gender char(1),
- birthdate date,
- primary key(name, birthdate))
- Unique Keys
- Each relation can have more than one unique key
- Declared the same way as primary key
Constraints & Triggers - Cont
- Enforcing Key Constraints
- During insertion or update to the relation
- Foreign-Key
- The "referenced" attribute(s) must be declared
unique or the primary key for their relation it
must not have a NULL value - create table Studio(
- name char(30) primary key,
- address varchar(255),
- presC int references MovieExec(cert)
- )
- create table Studio(
- name char(30) primary key,
- address varchar(255),
- presC int,
- foreign key (presC) references MovieExec(cert)
- )
Constraints & Triggers - Cont
- Maintaining Referential Integrity
- Reject Violating Modifications (Default)
- Insert or update a Studio tuple whose presC value is not NULL and is not the cert component of any MovieExec tuple
- Delete a MovieExec tuple whose cert component appears as the presC component of one or more Studio tuples
- Update a MovieExec tuple's cert value when the old cert is the value of presC of some movie studio in Studio
- Cascade Policy
- When deleting the MovieExec tuple for the president of a studio, the referencing tuple is also deleted from Studio
- When changing the cert value of a MovieExec tuple from c1 to c2, if there was some Studio tuple with c1 as the value of its presC component, then that presC component is updated to the value c2
Constraints & Triggers - Cont
- Set-Null Policy
- Can handle the delete and update problem by setting presC to NULL
- e.g. create table Studio (
- name char(30) primary key,
- address varchar(255),
- presC int references MovieExec(cert)
- on delete set null
- on update cascade
- )
- Deferring Checking of Constraints
- Do selective insert to default the presC to null
- Insert tuple into MovieExec with new cert
- Update the Studio tuple with matching presC
- Use the keywords DEFERRABLE and DEFERRED to delay the checking until the whole transaction is "committed" (see the sketch below)
- Reverse the DEFERRED case with the keyword IMMEDIATE
Constraints & Triggers - Cont
- Constraints on Attributes and Tuples
- Not-Null Constraints
- Use the NOT NULL keywords in the create table statement for any attribute
- Attribute-Based Constraints
- Use the CHECK keyword in create table statement
- Limit the value for an attribute
- e.g. gender char(1) check (gender in ('F','M'))
- Tuple-Based Constraints
- Use the CHECK keyword in create table statement
- Can be composed of a complex expression over multiple attributes (see the sketch below)
- Constraint Modification
- Naming Constraints
- In order to change, it must have a name
- Use the CONSTRAINT keyword
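A hedged sketch of a named, tuple-based CHECK constraint on the MovieStar table from the earlier DDL examples (the rule itself is just an illustration):

    create table MovieStar (
        name      char(30) primary key,
        address   varchar(255),
        gender    char(1),
        birthdate date,
        constraint GenderOrName
            check (gender = 'F' or name not like 'Ms.%')
    );

    -- Because the constraint is named, it can later be dropped or re-added:
    -- alter table MovieStar drop constraint GenderOrName;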
Constraints & Triggers - Cont
- Altering Constraints on Tables
- Can use ALTER TABLE to add or drop a constraint
- Can use SET CONSTRAINTS to set it to deferred or immediate
- Schema-Level Constraints and Triggers
- Assertions (General Constraints)
- A boolean-valued SQL expression that must be true at all times
- create assertion <name> check (<condition>)
- e.g.
- create assertion RichPres check (not exists (select * from Studio, MovieExec where presC = cert AND netWorth < 10000000))
- Event-Condition-Action Rules (ECA Rules)
- Triggers are awakened by certain events
- The "action" is performed only if the condition C is true
Constraints & Triggers - Cont
- Triggers in SQL
- create trigger NetWorthTrigger
- after update of netWorth ON MovieExec
- referencing
- old row as OTuple,
- new row as NTuple
- for each row
- when (OTuple.netWorth > NTuple.netWorth)
- update MovieExec
- set netWorth = OTuple.netWorth
- where cert = NTuple.cert
- Default is "for each statement"
- Besides update, can use insert and delete
- Action can be "before" or "after" the event
- Use BEGIN...END for multiple statements
- Instead-Of Triggers
- Not part of SQL-99
- Replace the event with newly defined operations (see the sketch below)
- Very powerful when used on a view
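A hedged sketch of an instead-of trigger on a view (this follows the extension discussed in the course text rather than SQL-99, and exact syntax varies by DBMS): an insert into a view of Paramount movies is redirected to the underlying Movie table.

    create view ParamountMovie as
        select title, year from Movie where studioName = 'Paramount';

    create trigger ParamountInsert
    instead of insert on ParamountMovie
    referencing new row as NewRow
    for each row
        insert into Movie(title, year, studioName)
        values (NewRow.title, NewRow.year, 'Paramount');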
System Aspects of SQL
- SQL Programming Environment
- Compilation flow: host language + embedded SQL -> preprocessor -> host language + function calls -> host-language compiler (linked with the SQL library) -> object-code program
- Impedance Mismatch Problem
- Different data models between SQL statements and programming languages
- SQL/Host Language Interface
- Use the EXEC SQL keywords in front of an SQL statement
- Use shared (host) variables to pass values to and from SQL statements
- Check SQLSTATE for SQL errors
System Aspects of SQL - Cont
- The DECLARE Section and Its Usage
- Shared variables are declared between two embedded SQL statements.
- e.g.
- EXEC SQL BEGIN DECLARE SECTION;
- char studioName[50], studioAddr[256];
- char SQLSTATE[6];
- EXEC SQL END DECLARE SECTION;
- A shared variable can be used within an SQL statement by placing a colon in front of it.
- e.g.
- EXEC SQL INSERT INTO
- Studio(name, address)
- VALUES (:studioName, :studioAddr);
- Single-Row Select Statement
- e.g.
- EXEC SQL SELECT netWorth
- INTO :presNetWorth
- FROM Studio, MovieExec
- WHERE presC = cert AND
- Studio.name = :studioName;
System Aspects of SQL - Cont
- Cursors
- Allow programs to "fetch" multiple rows from a relation
- Here are the steps for using a cursor
- EXEC SQL DECLARE <cursor> CURSOR FOR <query>
- EXEC SQL OPEN <cursor>
- EXEC SQL FETCH FROM <cursor> INTO <list-of-variables>
- If SQLSTATE is "02000", then close the cursor; otherwise fetch the next row
- EXEC SQL CLOSE <cursor>
- Row Modification with Cursor
- Use the WHERE CURRENT OF keywords
- e.g.
- EXEC SQL DELETE FROM MovieExec
- WHERE CURRENT OF execCursor
- EXEC SQL UPDATE MovieExec
- SET netWorth = 2 * netWorth
- WHERE CURRENT OF execCursor
System Aspects of SQL - Cont
- Concurrent Update of Tuple
- Use the keywords INSENSITIVE CURSOR to ignore new changes which may affect the current cursor
- Use the keywords FOR READ ONLY to signal that this cursor does not allow any modification
- Scrollable Cursors
- Allow a set of movements within a cursor
- Dynamic SQL
- Flexibility to enter SQL statement at run time
- Use EXEC SQL EXECUTE IMMEDIATE or
- ( EXEC SQL PREPARE ... and
- EXEC SQL EXECUTE ... )
- e.g.
- EXEC SQL BEGIN DECLARE SECTION;
- char *query;
- EXEC SQL END DECLARE SECTION;
- /* Allocate memory pointed to by query
- and fill in the SQL statement */
- EXEC SQL EXECUTE IMMEDIATE :query;
System Aspects of SQL - Cont
- Procedures Stored in the Schema
- Persistent Stored Modules (PSM)
- Can build modules to handle complex computations which cannot be expressed in SQL alone
- PSM Functions & Procedures
- CREATE PROCEDURE <name> (<params>)
- local declarations
- procedure body
- Procedure parameters can be input-only, output-only, or both
- CREATE FUNCTION <name> (<params>) RETURNS <type>
- local declarations
- function body
- Function parameters can only be input, as PSM forbids side-effects in functions
System Aspects of SQL - Cont
- Statements in PSM
- Call statement
- CALL <proc name> (<arg list>)
- e.g. EXEC SQL CALL Foo(:x, 3)
- RETURN <expression>
- DECLARE <name> <type>
- SET <variable> = <expression>
- BEGIN ... END
- IF <condition> THEN
- <statement list>
- ELSEIF <condition> THEN
- <statement list>
- ELSEIF
- ...
- ELSE <statement list>
- END IF
- SELECT <attr> INTO <var> FROM <table>
- WHERE <condition>
- LOOP <statement list> END LOOP
System Aspects of SQL - Cont
- FOR <loop name> AS <cursor name> CURSOR FOR
- <query>
- DO
- <statement list>
- END FOR
- Also supports WHILE and REPEAT loops
- Exception Handlers in PSM
- DECLARE <where to go> HANDLER FOR <condition list> <statement>
- <where to go> can be
- CONTINUE - execute the handler statement and then execute the next statement after the one which caused the exception
- EXIT - execute the handler statement and then control leaves the BEGIN...END block in which the handler is declared
- UNDO - same as EXIT except that any changes to the DB or local variables that were made by the statements of the block are "undone"
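A minimal SQL/PSM sketch tying a few of these statements together (the procedure name and logic are illustrative, not from the slides):

    CREATE PROCEDURE RenameStudio(
        IN oldName CHAR(30),
        IN newName CHAR(30))
    BEGIN
        DECLARE cnt INT;
        -- check that the studio actually exists before renaming it
        SELECT COUNT(*) INTO cnt FROM Studio WHERE name = oldName;
        IF cnt > 0 THEN
            UPDATE Studio SET name = newName WHERE name = oldName;
        END IF;
    END;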
System Aspects of SQL - Cont
- SQL Environment
- Schema
- A collection of tables, views, assertions, triggers, PSM modules, etc.
- CREATE SCHEMA <name> <declarations>
- Use SET SCHEMA to change the current schema
- Catalog
- A collection of schemas
- CREATE CATALOG <catalog name>
- Use SET CATALOG to change the current catalog
- Cluster
- A collection of catalogs
- Can be viewed as the set of all catalogs accessible to a user
- Client/Server
- The client and server can be on different machines or on the same machine
System Aspects of SQL - Cont
- Connection
- CONNECT TO <server name> AS <connection name> AUTHORIZATION <name and password>
- SET CONNECTION <name>
- DISCONNECT <name>
- Call-Level Interface (CLI)
- In C, each CLI program must include sqlcli.h, which contains all the function, structure, constant, and type definitions
- 4 kinds of records: SQLHENV, SQLHDBC, SQLHSTMT, and SQLHDESC
- Use SQLAllocHandle(hType, hIn, hOut)
- Processing Statements
- Use SQLPrepare(sh, st, sl) and SQLExecute(sh)
- or use SQLExecDirect(sh, st, sl)
- Use SQLFetch(sh) to fetch tuples from a query result
System Aspects of SQL - Cont
- Use SQLBindCol(sh, colNo, colType, pVar, varSize, varInfo) for column binding
- Can use SQLGetData(...) in place of SQLBindCol(...) to extract data from a query
- Passing Parameters to a Query
- e.g.
- SQLPrepare(myStmt, "INSERT INTO Studio(name, address) VALUES (?, ?)", SQL_NTS)
- SQLBindParameter(myStmt, 1, ..., studioName, ...)
- SQLBindParameter(myStmt, 2, ..., studioAddr, ...)
- SQLExecute(myStmt)
- Transactions in SQL
- Serializability
- Multiple selects followed by multiple updates to the same tuple, e.g. chooseSeat()
- Use locks to handle this problem
System Aspects of SQL - Cont
- Atomicity
- A single user transaction may have multiple updates to different tables, e.g. a transfer from account A to account B (see the sketch below)
- Only "commit" after all the changes are made
- Transaction
- A collection of one or more operations on the database that must be executed atomically
- Use START TRANSACTION to begin
- Use SQL COMMIT to commit
- Use SQL ROLLBACK to abort and undo the changes, restoring the state from prior to the start of the transaction
- Can set the transaction to READ ONLY
- Dirty Reads (Uncommitted Reads)
- Data reads that see "dirty" or uncommitted values
- Isolation Levels
- Serializable, uncommitted read, committed read, and repeatable read
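A hedged sketch of an atomic transfer transaction (the Accounts table and its columns are made-up placeholders echoing the account A / account B example above):

    set transaction isolation level serializable;
    start transaction;
    update Accounts set balance = balance - 100 where acctNo = 123;
    update Accounts set balance = balance + 100 where acctNo = 456;
    commit;
    -- on an application error, issue ROLLBACK instead of COMMIT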
System Aspects of SQL - Cont
- Security & User Authorization in SQL
- Privileges
- SQL defines nine types of privileges: SELECT, INSERT, DELETE, UPDATE, REFERENCES, USAGE, TRIGGER, EXECUTE, and UNDER
- Authorization Checking
- First at connect time
- Second at statement time
- Additional checks with modules
- Grant & Revoke (see the sketch below)
- GRANT <privilege list> ON <database element> TO <user list>
- Allows other users to perform certain actions
- REVOKE <privilege list> ON <database element> FROM <user list>
- Disallows a previously granted privilege
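A minimal sketch of granting and revoking privileges on the Studio table from the earlier examples (the user names are placeholders):

    grant select, insert on Studio to alice;
    grant select on Studio to bob with grant option;  -- bob may pass SELECT on
    revoke insert on Studio from alice;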
Data Storage
- Megatron 2002 Database System
- Store relation in ASCII text file
- Store the schema also in ASCII file
- Obvious problems
- Tuple layout on disk is not flexible; any small change may shuffle the whole file
- Searching is expensive; must read the whole file
- Query processing is by brute force; nested loops examine all possibilities
- No memory buffering; every query requires direct access to disk
- No concurrency control
- No reliability, e.g. no crash recovery
- The Memory Hierarchy
- Cache Memory
- Fast access to and from the processor or I/O controller
Data Storage - Cont
- Main Memory
- Random access (RAM)
- Both OS and applications reside in RAM
- Virtual Memory
- Allows each application to have its own private memory space, which is mapped to physical memory (RAM) or to disk
- A page is the block of memory moved between main memory and disk
- Secondary Storage
- Much slower than main memory
- Two types of disk I/O
- A disk read means moving a block from disk to main memory
- A disk write means moving a block from main memory to disk
- Most DBMSs will manage disk blocks themselves, rather than relying on the OS file manager
Data Storage - Cont
- Volatile and Nonvolatile Storage
- Main memory is typically volatile; when the power is off, the content is gone
- Flash memory is nonvolatile, but it is very expensive and currently not used as main memory
- An alternative is to use a "RAM disk" combined with a battery backup for the power supply
- Disks
- Disk Components
- Heads, platters (2 surfaces each), cylinders, tracks, sectors, gaps
- Disk Controller
- Controls the movement of the disk head(s) to a specific track and performs reads and writes
- Transfers data to and from main memory
Data Storage - Cont
- Effective Use of Secondary Storage
- CS studies of algorithms often assume that the data are always in main memory; this is not a valid assumption for a DBMS
- I/O Model of Computation
- Dominance of I/O cost
- If a block needs to be moved between disk and main memory, the time taken to perform the read/write is much larger than the time for manipulating that data in main memory; thus the I/O time is a good approximation of the total time
- Similar to Big O notation for algorithm study
- Sorting Data in Secondary Storage
- If we need to sort 1.64 billion bytes and a disk block holds 16,384 bytes, then 100,000 blocks are required to read each tuple once from disk
- Quicksort is one of the fastest algorithms, but it assumes all entries are in memory
Data Storage - Cont
- Two-Phase, Multiway Merge-Sort (TPMMS)
- Consists of 2 phases
- Phase 1: sort main-memory-sized pieces of the data, so every record is part of a sorted list that just fits in the available main memory; the result is a set of sorted sublists on disk, which we merge in the next phase
- Phase 2: merge all the sorted sublists into a single sorted list
- Example
- If we have 100 MB of main memory and 16,384-byte blocks and are sorting 1.64 billion bytes, we can fit 6,400 blocks at a time in main memory; thus phase 1 produces 16 sorted sublists
- If we merge only two sublists at a time, each block needs 8 more disk I/O's (4 merge passes, each reading and writing the block)
- The better approach is to read the first block from each of the sorted sublists into main-memory buffers, repeatedly move the smallest remaining element to an output buffer, and flush/reload buffers when necessary
Data Storage - Cont
- Accelerating Access to Secondary Storage
- The TPMMS example in 11.4.4 assumed that data was stored on a single disk and that blocks were chosen randomly
- There is certainly room for improvement with the following methods, each with its advantages and disadvantages
- Cylinder-Based Organization
- Can reduce disk block access time in phase one of TPMMS by more than 95%
- Excellent for applications in which only one process accesses the disk and block reads are grouped in logical sequences
- Not useful when reading random blocks
- Multiple Disks
- Improve both grouped and random access performance
- Accesses to the same disk still cannot be performed in parallel
- Can be expensive, since a single large disk is usually more cost-effective than multiple smaller disks with the same total capacity
Data Storage - Cont
- Mirroring
- Reduce access time for read/write requests
- Built-in fault tolerance for all applications
- Must pay for 2 disks for the capacity of only 1
- Elevator Algorithm
- Reduce read/write access time when the requested blocks are random
- The average delay for each request can be high in any high-traffic system
- Prefetching/Double Buffering
- Greatly improve access when the needed blocks are known in advance or grouped together
- Require extra main-memory buffers
- No help when accesses are random
- Disk Failures
- Intermittent Failures
- An attempt to read or write a sector fails, but succeeds after n retries
Data Storage - Cont
- Checksum
- Widely used method to detect media errors
- Use a collection of bits to calculate a fixed value; when the recalculation fails to match, a media error is the likely cause
- Detects errors but does not fix them
- Stable Storage
- Similar to disk mirroring, except that it is achieved at the software/application level
- Keep an extra "delta" copy of the data to guard against media errors and possible data corruption caused by power failure
- Disk Crash Recovery
- Redundant Arrays of Independent Disks (RAID)
- Redundancy Technique - Mirroring
- Known as RAID level 1
- When one of the disks fails, the other "mirror" disk becomes the main disk
Data Storage - Cont
- Parity Blocks
- Known as RAID level 4
- Uses only 1 redundant disk no matter how many data disks it supports
- Utilizes the modulo-2 sum for parity checks
- Too many data disks can cause the redundant disk to perform poorly, since a write to any of the n data disks causes the check-sum bits to change
- RAID 5
- Improves the RAID 4 approach by spreading the redundant-disk workload across all the disks
- Multiple Disk Crash
- RAID 6 uses error-correcting codes such as Hamming codes
- Uses a combination of data and redundant disks; the code determines how many of each are required to survive concurrent failures of several disks
Representing Data Elements
- Data Elements and Fields
- Relational Database Elements
- Since a relation is a set of tuples, and tuples are similar to a record/structure in C or C++, we may imagine that each tuple will be stored on disk as a record.
- Objects
- An object is like a tuple, with its instance variables as attributes.
- Data Elements
- INTEGER
- Usually 2 or 4 bytes long
- FLOAT
- Usually 4 or 8 bytes long
- CHAR(n)
- Fixed length denoted by n
Representing Data Elements - Cont.
- VARCHAR(n)
- Variable length with n as the maximum
- Two ways to represent varchar
- Length plus content
- Null-terminated string
- Dates and Times
- Date is usually represented as char(n)
- Time is represented as varchar(n) because of the support for fractions of a second
- Bits
- Can pack 8 bits into a byte; use an 8-bit boundary, meaning the length is rounded up to the next whole byte
- Enumerated Types
- Using a byte to represent each value, thus up to 256 different values
Representing Data Elements - Cont.
- Records
- Building Fixed-Length Records
- Can concatenate the fields to form a record
- Be aware of 4- and 8-byte boundaries, depending on the HW and OS; therefore data must be organized accordingly
- Record Headers
- Also known as the record descriptor
- Information about the record such as length, timestamp, record id, record type, etc.
- Packing Fixed-Length Records into Blocks
- Using a block header followed by multiple records
Representing Data Elements - Cont.
- Representing Block and Record Addresses
- Client-Server Systems
- The server's data lives in a database address space. The address space can refer to blocks and possibly to offsets within the block.
- Physical Addresses
- Byte strings that determine the location within the secondary storage system where the block or record can be found, with information such as hostname, cylinder number, track number, block number, and the offset of the record from the beginning of the block.
- Logical Addresses
- Can be viewed as a flat model where all the records are in a logical sequence.
- Use a mapping table to map logical to physical addresses.
Representing Data Elements - Cont.
- Logical and Structured Addresses
- Why logical addressing?
- Movement of data can be done by changing the logical-to-physical mapping table rather than moving the actual data itself.
- Structured addressing
- Using a key value and the physical address of a block, a record can easily be located
- Can be viewed as a form of "hashing"
- Fast lookup if each record is fixed-length
- Offset table
- Keeping an offset table as part of a block header can handle variable-length records with fast lookup.
- Allows easy movement of data, which is one of the main advantages of logical addressing.
Representing Data Elements - Cont.
- Pointer Swizzling
- It means to translate embedded pointers from secondary storage (database addresses) to main memory (virtual addresses).
- A pointer usually consists of a bit indicating whether the pointer is currently a database address or a (swizzled) memory address, plus the actual database or memory pointer.
- Automatic swizzling means that when we load a block, all its pointers and addresses are put into the translation table if they are not already there.
- Another approach is to translate a pointer only when it is being used.
- The programmer can also control pointer swizzling by using look-ahead logic (e.g. prefetch).
Representing Data Elements - Cont.
- Pinned Records and Blocks
- A block in memory is pinned if it cannot be written back to disk safely.
- Can view this as a constraint or dependency from pointers in another block.
- Variable-Length Data and Records
- Records with Variable-Length Fields
- Must keep the lengths and offsets of the variable-length fields.
- One simple but effective method is to put all the fixed-length fields at the beginning, followed by the variable-length fields.
- Records with Repeating Fields
- Use the same method as above, but the repeating fixed-length fields can be moved to another block.
Representing Data Elements - Cont.
- Variable-Format Records
- In certain situations, records may not have a fixed schema. The fields or their order are not completely determined by the relation.
- Use tagged fields to handle such cases.
- We store the attribute or field name, the type of the field, and the length of the field.
- Very similar to the SQLDA definition and the bind-in and bind-out operations of SQL
- Spanned Records
- When a record can't fit into a block and must be broken up into multiple blocks, it is called spanned.
- Spanned records require extra header information to keep track of their fragments.
Representing Data Elements - Cont.
- Binary Large Objects (BLOBS)
- Can hold large audio and image files
- Stored in a sequence of blocks, but can also be striped across multiple disks for faster retrieval
- Retrieval is usually done in small chunks
- Record Modifications
- Insertion
- If order is not important, then just add the new record to the end of the free space within a block.
- If order is important, then we must slide the data to fit the new record in place. In this case, the offset-table implementation can help reduce the actual movement of data.
- If the block is full, then either a) find space on a "nearby" block and keep a forwarding address, or b) use a chain of overflow blocks.
Representing Data Elements - Cont.
- Deletion
- When a record is deleted, the system must reclaim its space.
- If using an offset table, the block should consolidate the free space into a central location
- An alternative is to keep track of a linked list of free space (e.g. a freelist).
- When the record has spanned or overflowed to a different block (nearby or overflow), we may need to "reorganize" all the blocks.
- To avoid dangling pointers, we can replace deleted records with tombstones (dummy records indicating a dead end).
- Update
- If the record is fixed-length, then there is no effect on the storage layout.
- Otherwise we have the same problems as insertion or deletion of variable-length fields.
Index Structures
- Indexes on Sequential Files
- Usually have a data file and an index file. The data file is a sorted file. The index file contains only keys and pointers into the data file.
- Sequential Files
- A file which contains records sorted by the keys defined by its index.
- Dense Indexes
- Every key from the data file is represented in the index.
- An index entry is small compared to a record entry. Thus we may be able to keep the index contents in memory, rather than reading from the index file.
- Since the keys are sorted, we can use binary search to find a key K. Search time is about log n (base 2).
- Sparse Indexes
- Hold only a subset of the entries a dense index would hold
- Use much less space but have slower search time.
Index Structures - Cont.
- Multiple Levels of Index
- If the index itself covers many blocks, then a binary search will still need many disk I/O's to get to the correct record. We can solve this problem by putting an index on the index.
- The outer level of the index is sparser than the inner level.
- Indexes With Duplicate Search Keys
- If the index is not unique, we can have multiple records with the same search key.
- An efficient approach is to have only one record in the dense index for each search key K, then find the record within the sorted sub-list.
- Managing Indexes During Inserts, Updates, and Deletes
- Recall that Chapter 12 discussed different methods to handle inserts, updates, and deletes of fixed- or variable-length records; similar logic applies to index files as well.
Index Structures - Cont.
- Here is a list of actions on the sequential file which affect the index file:
- Action                          Dense    Sparse
- ------------------------------  -------  ---------
- Create empty overflow block     none     none
- Delete empty overflow block     none     none
- Create empty sequential block   none     insert
- Delete empty sequential block   none     delete
- Insert record                   insert   update(?)
- Delete record                   delete   update(?)
- Slide record                    update   update(?)
- Empty overflow blocks have no effect because a sparse index keeps track of only the primary blocks, not overflow blocks.
- Inserts, updates, and deletes may or may not cause the sparse index to change, depending on whether the affected key was in the sparse index or whether any record was slid.
Index Structures - Cont.
- Secondary Index
- This is a dense index, usually with duplicates.
- Does not require the underlying file to be sorted on the search key, as a primary index does.
- The keys in the secondary index are sorted.
- The pointers in one index block can go to many different data blocks.
- A clustered file structure can help manage the many-one relationship. An example of this is the number of columns or indexes within a table.
- Indirection in Secondary Indexes
- Use buckets in this index scheme to avoid duplicates at the higher level.
- e.g. SELECT title FROM Movie WHERE studioName = 'Disney' AND year = 1995
- If studioName has a primary index and year has a secondary index, then the number of tuples which satisfy both conditions will be reduced significantly.
Index Structures - Cont.
- Document Retrieval and Inverted Indexes
- The WWW has brought many new requirements for document retrieval and pattern matching. This results in newer and better search engines as time passes.
- e.g. search all the documents which contain the words "database", "programming", and "implementation"
- A relational view of the document search
- A document may be thought of as a tuple in a relation Doc. Each attribute/word can be represented as a bit that is set to true if the document contains at least one match. Use a secondary index on each of the attributes of Doc, but only keep entries whose search-key value is TRUE. Instead of creating a separate index for each attribute, the indexes are combined into one, called an inverted index.
- The inverted index points us to bucket entries, where we can "join" the lists of pointers.