Title: CS157B - Fall 2004
Entity-Relationship Data Model
- Entity Sets
- a collection of similar entities
- Attributes
- properties of the entities in an entity set
- Relationships
- connection among two or more entity sets
- Constraints
- Keys: uniquely identify an entity within its entity set
- Single-value constraints: requirement that the value in a certain context be unique
- Referential integrity constraints: existence test
- Domain constraints: range test
- General constraints: data set constraints
E/R - Cont.
- Constraints Are Part of the Schema
- Should be part of the DB design
- Must be carefully constructed
- Can be troublesome if major redesign is required
- Subclasses
- Special "isa" relationship
- Similar to OO's definition
- Weak Entity Sets
- Some of its key attributes may belong to another entity set
- Needs an additional qualifier
The Relational Data Model
- Relation
- Represented by a two-dimensional table
- Attribute
- Represented as columns in a table
- Schema
- The name of a relation and the set of attributes of this relation
- A database schema consists of one or more relation schemas
- Tuple
- A row in the table
- Domain
- Each attribute of a relation has an elementary type; thus each component of a tuple must be atomic
Relational - Cont
- Converting Subclass Structures to Relations
- E/R-Style Conversion
- Object-Oriented Approach
- Null Values
- Functional Dependencies (FD)
- A unique-value constraint in a relational schema design
- Keys of Relations
- A(1), ..., A(n) is a key of relation R if
- no two distinct tuples agree on all of A(1), ..., A(n)
- no proper subset of A(1), ..., A(n) determines all the other attributes of R, i.e. the key must be minimal
- If there is more than one key, then designate one of the keys as the primary key
- Superkey: a set of attributes that contains a key
Relational - Cont
- Design of Relational Database Schemas
- Anomalies
- Redundancy: the same information repeated in many tuples
- Update Anomalies: updating one tuple does not cause other copies of the same information to be updated
- Deletion Anomalies: losing other information as a side effect when a set of values becomes empty
- Decomposing Relations
- Splitting the attributes of R to make the schemas of two new relations
- Boyce-Codd Normal Form (BCNF)
- The left side of every nontrivial FD must be a superkey
- Recursively decompose into smaller relations until they all satisfy the BCNF rule (see the sketch below)
- Must be able to join the decomposed relations back into the original tuples
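As a sketch of the decomposition idea (the table and attribute names here are illustrative, not from the slides): suppose a single relation MovieInfo(title, year, length, studioName, studioAddr) has the FD studioName -> studioAddr, whose left side is not a superkey. A BCNF decomposition splits it into two relations, which could be declared as:

    create table Studio1 (
        studioName char(30) primary key,   -- left side of the violating FD
        studioAddr varchar(255)
    );

    create table Movie1 (
        title      varchar(100),
        year       int,
        length     int,
        studioName char(30),               -- joins the two relations back together
        primary key (title, year)
    );

Joining Movie1 and Studio1 on studioName recovers the original tuples without repeating the address information in every movie tuple.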
Relational - Cont
- Third Normal Form (3NF)
- A relation R is in 3NF if whenever A(1)...A(n) -> B is a nontrivial FD, either A(1)...A(n) is a superkey, or B is a member of some key
- May allow minimal redundancy in the end
- Multivalued Dependencies (MVD)
- A statement that two sets of attributes in a relation have sets of values that appear in all possible combinations
- Example of MVD in p.118 figure 3.29
- No BCNF violation
- Fourth Normal Form (4NF)
- A relation R is in 4NF if whenever A(1)...A(n) ->-> B(1)...B(m) is a nontrivial MVD, A(1)...A(n) is a superkey
- Can be decomposed similarly to the BCNF decomposition algorithm
Other Data Models
- Review of Object-Oriented Concepts
- Type System
- Record structures
- Collection types
- Reference types
- Classes and Objects
- Contain attributes and methods
- Object Identity
- OID
- Methods
- Functions within a class
- Hierarchies
- Superclass and subclass relationship
- Introduction to ODL
- Object-Oriented Design
- Three main properties: attributes, relationships, and methods
Other Data Models - Cont
- Class Declarations
- e.g. class <name> <list of properties>
- Attributes in ODL
- can be simple or complex type
- Relationships in ODL
- e.g. in Movie: relationship Set<Star> stars
- Inverse Relationships
- e.g. in Star: relationship Set<Movie> starredIn
- Methods in ODL
- signatures
- Types in ODL
- Atomic types
- integer
- float
- character
- character string
- boolean
- enumerations
Other Data Models - Cont
- Structure types
- e.g. class name
- Structures - similar to a tuple of values
- Collection types
- Set - distinct unordered elements
- Bag - unordered elements
- List - ordered elements
- Array - e.g. Array<char,10>
- Dictionary - e.g. Dictionary(Keytype, Rangetype)
- Additional ODL Concepts
- Multiway Relationships in ODL
- ODL supports only binary relationships
- Use several binary, many-one relationships instead
- Subclasses in ODL
- Inherits all the properties of its superclass
Other Data Models - Cont
- Multiple Inheritance in ODL
- Needs to resolve name conflicts from multiple superclasses
- Keys in ODL
- Optional because of the existence of an OID
- Can be declared with one or more attributes
- ODL Designs to Relational Designs
- Many issues
- Possibly no key in ODL
- The relational model can't handle structure and collection types directly
- Converting an ODL type constructor can lead to a BCNF violation
- Representing ODL Relationships
- One relation for each pair of inverse relationships
Other Data Models - Cont
- Object-Relational Model
- O/R Features
- Structured types for attributes
- Methods
- Identifiers for tuples (like OID)
- References
- Compromise Between Pure OO and Relational
- Semistructured Data
- "Schemaless" - the schema is attached to the data
itself - Represented in a collection of nodes
- Interior node describes the data with a label
- Leaf node contains the data of any atomic type
- Legacy-database problem
Relational Algebra
- Algebra of Relational Operations
- Bags rather than sets can be more efficient depending on the operation, such as a union of two relations which contain duplicates
- Components
- Variables that stand for relations, and constants, which are finite relations
- Expressions of relational algebra are referred to as queries
- Set Operations on Relations
- R ∪ S, the union of R and S (distinct)
- R ∩ S, the intersection of R and S (distinct)
- R - S, the difference of R and S, is the set of elements that are in R but not in S; note that R - S is different from S - R
- The attributes must be in the same order and the attribute types must be the same
Relational Algebra - Cont
- Projection
- The projection operator is used to produce from a relation R a new relation that has only some of R's columns
- e.g. Views
- Selection
- The selection operator produces a new relation with a subset of R's tuples; the tuples in the resulting relation are those that satisfy some condition C that involves the attributes of R
- e.g. SQL select statement
- Cartesian Product
- A cross-product of two sets R and S
- e.g. R x S: if R has 2 tuples and S has 3 tuples, the result has 6 tuples
- Natural Joins
- Use matching attributes
Relational Algebra - Cont
- Theta-Joins
- Join two relations with a condition denoted by C
- Combining Operations to Form Queries
- Multiple queries can be combined into a complex query
- e.g. AND, OR, (), ...
- Renaming
- Control the names of the attributes used for relations that are constructed by applying relational-algebra operations
- Dependent and Independent Operations
- Some relational expressions can be "rewritten" as a different expression (see the sketch below)
- e.g. R ∩ S = R - (R - S)
- e.g. Glass half-empty or half-full?
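A minimal SQL sketch of the rewriting rule R ∩ S = R - (R - S); here R and S stand for any two tables with identical schemas (the names are placeholders):

    -- Both queries return the same set of rows:
    select * from R
    intersect
    select * from S;

    select * from R
    except
    (select * from R
     except
     select * from S);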
Relational Algebra - Cont
- Extended Operators of Relational Algebra
- Duplicate Elimination
- This operator converts a bag to a set
- e.g. SQL keyword DISTINCT
- Aggregation Operators
- SUM(), AVG(), MIN(), MAX(), COUNT()
- Grouping
- This operator allows us to group a relation and/or aggregate some columns
- Extended Projection
- This allows expressions as part of an attribute list
- Sorting Operator
- This operator is anomalous, in that it is the only operator in our relational algebra whose result is a list of tuples, rather than a set
- Outerjoin
- Takes care of dangling tuples, which are padded with a special "null" symbol
SQL
- Simple Queries in SQL
- select A(...) from R where C
- Projection & Selection in SQL
- select title, length from Movie where studioName = 'Disney' and year = 1990
- String Comparison
- Bit Strings (bit data)
- e.g. B'011' or X'7ff'
- Lexicographic order
- e.g. 'A' < 'B', 'a' < 'b', 'a' > 'B'?
- Depends on encoding scheme or standard
- (e.g. ASCII, UTF8)
- LIKE keyword
- "s like p" denotes where s is a string and p is a
pattern - Special character
- Dates and Times
- DATE '2002-02-04'
- TIME '19:00:30.5'
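A small, hedged illustration of these comparisons using the Movie table from the examples above ('%' matches any string, '_' matches a single character; releaseDate is an assumed column added only for the date example):

    select title
    from Movie
    where title like 'Star ____';            -- matches e.g. 'Star Wars'

    select title
    from Movie
    where releaseDate = DATE '2002-02-04';   -- releaseDate is not part of the
                                             -- lecture's Movie schema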
SQL - Cont
- Null Values
- 3 cases: unknown, inapplicable, withheld
- NOT a constant
- Test expression for IS NULL
- Truth-Value "unknown"
- Pitfalls regarding nulls
- e.g. select * from Movie
- where length <= 120 or length > 120
- Ordering the Output
- ORDER BY ltlist of attributesgt
- Can be ASC or DESC, default is ASC
- Queries with > 1 Relation
- e.g. select name from Movie, MovieExec where title = 'Star Wars' and producerC = cert
- Disambiguating Attributes
- select MovieStar.name, MovieExec.name from MovieStar, MovieExec where MovieStar.address = MovieExec.address
SQL - Cont
- UNION, INTERSECT, EXCEPT Keywords
- Same logic as the set operators ∪, ∩, and -
- Subqueries
- Can return a single constant or relations in the WHERE clause
- Can have relations appear in the FROM clauses
- Scalar Value
- An atomic value that can appear as one component of a tuple (e.g. constant, attribute)
- e.g. select name from MovieExec where cert = (select producerC from Movie where title = 'Star Wars')
- Conditions Involving Relations
- If R is a relation, then EXISTS R is a condition that is true if R is not empty
SQL - Cont
- s IN R is true if s is equal to one of the values in R; s NOT IN R is true if s is equal to no value in R (see the sketch below for examples)
- s > ALL R is true if s is greater than every value in unary relation R
- s > ANY R is true if s is greater than at least one value in unary relation R
- EXISTS, ALL, ANY operators can be negated by putting NOT in front of the entire expression
- Subqueries in FROM Clauses
- Can substitute an R in the FROM clause with a subquery
- e.g. select name from MovieExec, (select producerC from Movie, StarsIn where title = movieTitle and year = movieYear and starName = 'Harrison Ford')
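A hedged sketch of these subquery conditions, reusing the Movie and MovieExec tables from the earlier examples (cert is assumed to be MovieExec's key, as in those examples):

    -- IN: producers of 'Star Wars'
    select name from MovieExec
    where cert in (select producerC from Movie where title = 'Star Wars');

    -- EXISTS: the same query as a correlated subquery
    select name from MovieExec E
    where exists (select * from Movie
                  where title = 'Star Wars' and producerC = E.cert);

    -- ALL: executives whose net worth is at least as large as everyone else's
    select name from MovieExec
    where netWorth >= all (select netWorth from MovieExec);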
SQL - Cont
- Cross Joins
- Known as Cartesian product or just product
- e.g. Movie CROSS JOIN StarsIn
- Natural Joins
- The join condition is that all pairs of attributes from the two relations having a common name are equated, and there are no other conditions
- One of each pair of equated attributes is projected out
- Outerjoins
- e.g. MovieStar NATURAL FULL OUTER JOIN MovieExec
- Full-Relation Operations
- Eliminating Duplicates
- Use the DISTINCT keyword in SELECT
- Performance consideration
SQL - Cont
- Duplicates in Unions, Intersections, and Differences
- By default, these operations convert bags to sets
- Use the keyword ALL after UNION, INTERSECT, or EXCEPT to prevent the elimination of duplicates
- Grouping and Aggregation in SQL
- Use the special GROUP BY clause
- Aggregation Operators
- SUM, AVG, MIN, MAX are used by applying them to a scalar-valued expression, typically a column name, in a SELECT clause
- COUNT(*) counts all the tuples in the relation constructed from the FROM clause and WHERE clause of the query
- HAVING Clause
- Used in conjunction with GROUP BY to narrow the aggregated list (see the sketch below)
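A minimal sketch of grouping and aggregation over the Movie table used in the earlier examples (the threshold of 10 movies is arbitrary):

    -- Total running time and number of movies per studio,
    -- keeping only studios with more than 10 movies
    select studioName, sum(length) as totalLength, count(*) as numMovies
    from Movie
    group by studioName
    having count(*) > 10;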
SQL - Cont
- Database Modifications (DML)
- Insertion
- Insert into R(A1...An) values (V1,...Vn)
- e.g. insert into Studio(name) values('S1')
- Can insert multiple tuples with subquery
- e.g. insert into Studio(name) select distinct studioName from Movie where studioName not in (select name from Studio)
- Deletion
- Delete from R where <condition>
- Can delete multiple tuples with 1 delete statement depending on <condition>
- Updates
- Update R set <new-value assignment> where <condition>
- Can update multiple tuples with 1 update statement depending on <condition> (see the sketch below)
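A hedged sketch of deletion and update, again over the Movie and MovieExec tables from the earlier examples (the particular conditions are made up):

    -- Remove all pre-1930 movies; the condition decides how many tuples go
    delete from Movie
    where year < 1930;

    -- Give a 10% raise in net worth to the producers of 'Star Wars'
    update MovieExec
    set netWorth = netWorth * 1.1
    where cert in (select producerC from Movie where title = 'Star Wars');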
SQL - Cont
- DDL in SQL
- Data Types
- Character strings, fixed or variable length
- Bit strings, fixed or variable length
- Boolean: true, false, unknown
- Integer (or int), shortint
- Floating-point numbers
- e.g. decimal(n,d) where n is the total number of digits and d is the number of digits to the right of the decimal point
- 1234.56 can be described as decimal(6,2)
- Dates and times can be represented by the data types DATE and TIME respectively
- Table Declarations
- Use the keywords CREATE TABLE followed by the relation name and the list of attributes and their types
- e.g. create table MovieStar(name char(30), address varchar(256), gender char(1), birthdate DATE)
SQL - Cont
- Modifying Relation Schemas
- Drop table MovieStar
- Alter table MovieStar add phone char(16)
- Alter table MovieStar drop birthdate
- Default Values
- Use the DEFAULT keyword to set default values for a column
- e.g. alter table MovieStar add phone char(16) default 'unlisted'
- Indexes
- Allow faster access to data
- e.g. create index YearIndex on Movie(year)
- Can be one or more attributes
- e.g. create index KeyIndex on Movie(title,year)
- Delete the index using drop index statement
- Selection of Indexes
- Selection (query) performance vs. insert/update/delete performance
SQL - Cont
- View Definitions
- A view does not contain any physical data
- "virtual relation"
- Declaring Views
- Create view <view-name> as <view-definition> (see the sketch below)
- Querying Views
- Use the normal select syntax with a view name in place of the table name
- Renaming Attributes
- Can map an attribute name from the base table to a new name in a view definition
- Modifying Views
- Updatable views are useful in special cases; selective inserts, updates, and deletes
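A minimal view sketch over the Movie table from the earlier examples (the view name and the renamed attributes are illustrative):

    create view DisneyMovies(movieTitle, movieYear) as
        select title, year
        from Movie
        where studioName = 'Disney';

    -- queried like an ordinary table
    select movieTitle from DisneyMovies where movieYear = 1990;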
Constraints & Triggers
- Keys & Foreign Keys
- Primary Keys
- Each relation can have only one primary key
- Primary key attribute(s) can not be NULL
- Two ways to specify the primary key
- 1) create table MovieStar(
- name char(30) primary key,
- address varchar(255),
- gender char(1),
- birthdate date)
- 2) create table MovieStar(
- name char(30),
- address varchar(255),
- gender char(1),
- birthdate date,
- primary key(name, birthdate))
- Unique Keys
- Each relation can have more than one unique key
- Declared the same way as primary key
Constraints & Triggers - Cont
- Enforcing Key Constraints
- During insertion or update to the relation
- Foreign-Key
- The "referenced" attribute(s) must be declared
unique or the primary key for their relation it
must not have a NULL value - create table Studio(
- name char(30) primary key,
- address varchar(255),
- presC int references MovieExec(cert)
- )
- create table Studio(
- name char(30) primary key,
- address varchar(255),
- presC int,
- foreign key (presC) references MovieExec(cert)
- )
Constraints & Triggers - Cont
- Maintaining Referential Integrity
- Reject Violating Modifications (Default)
- Insert or update a Studio tuple whose presC value is not NULL and is not the cert component of any MovieExec tuple
- Delete a MovieExec tuple whose cert component appears as the presC component of one or more Studio tuples
- Update a MovieExec tuple's cert value when the old cert is the value of presC of some movie studio in Studio
- Cascade Policy
- When deleting the MovieExec tuple for the president of a studio, the referencing tuple is also deleted from Studio
- When changing the cert value of a MovieExec tuple from c1 to c2, if there was some Studio tuple with c1 as the value of its presC component, then that presC component is updated to the value c2
Constraints & Triggers - Cont
- Set-Null Policy
- Can handle the delete and update problem by setting presC to NULL
- e.g. create table Studio (
- name char(30) primary key,
- address varchar(255),
- presC int references MovieExec(cert)
- on delete set null
- on update cascade
- )
- Deferring Checking of Constraints
- Do selective insert to default the presC to null
- Insert tuple into MovieExec with new cert
- Update the Studio tuple with matching presC
- Use the keywords DEFERRABLE and DEFERRED to delay the checking until the whole transaction is "committed" (see the sketch below)
- Reverse the DEFERRED case with the keyword IMMEDIATE
Constraints & Triggers - Cont
- Constraints on Attributes and Tuples
- Not-Null Constraints
- Use the NOT NULL keywords in the create table statement for any attribute
- Attribute-Based Constraints
- Use the CHECK keyword in create table statement
- Limit the value for an attribute
- e.g. gender char(1) check (gender in ('F','M'))
- Tuple-Based Constraints
- Use the CHECK keyword in create table statement
- Can be composed of a complex expression over multiple attributes (see the sketch below)
- Constraint Modification
- Naming Constraints
- In order to change, it must have a name
- Use the CONSTRAINT keyword
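A hedged sketch of a named, tuple-based CHECK constraint on the MovieStar table from the earlier DDL examples (the rule itself is just an illustration):

    create table MovieStar (
        name      char(30) primary key,
        address   varchar(255),
        gender    char(1),
        birthdate date,
        constraint GenderOrName
            check (gender = 'F' or name not like 'Ms.%')
    );

    -- Because the constraint is named, it can later be dropped or re-added:
    -- alter table MovieStar drop constraint GenderOrName;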
Constraints & Triggers - Cont
- Altering Constraints on Tables
- Can use ALTER TABLE to add or drop a constraint
- Can use SET CONSTRAINTS to set it to deferred or immediate
- Schema-Level Constraints and Triggers
- Assertions (General Constraints)
- A boolean-valued SQL expression that must be true at all times
- create assertion <name> check (<condition>)
- e.g.
- create assertion RichPres check (not exists (select * from Studio, MovieExec where presC = cert AND netWorth < 10000000))
- Event-Condition-Action Rules (ECA Rules)
- Triggers are awakened by certain events
- The "action" is performed only if the condition C is true
Constraints & Triggers - Cont
- Triggers in SQL
- create trigger NetWorthTrigger
- after update of netWorth ON MovieExec
- referencing
- old row as OTuple,
- new row as NTuple
- for each row
- when (OTuple.netWorth > NTuple.netWorth)
- update MovieExec
- set netWorth = OTuple.netWorth
- where cert = NTuple.cert
- Default is "for each statement"
- Besides update, can use insert and delete
- Action can be "before" or "after" the event
- Use BEGIN...END for multiple statements
- Instead-Of Triggers
- Not part of SQL-99
- Replace the event with newly defined operations (see the sketch below)
- Very powerful when used on a view
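A hedged sketch of an instead-of trigger on a view (this follows the extension discussed in the course text rather than SQL-99, and exact syntax varies by DBMS): an insert into a view of Paramount movies is redirected to the underlying Movie table.

    create view ParamountMovie as
        select title, year from Movie where studioName = 'Paramount';

    create trigger ParamountInsert
    instead of insert on ParamountMovie
    referencing new row as NewRow
    for each row
        insert into Movie(title, year, studioName)
        values (NewRow.title, NewRow.year, 'Paramount');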
System Aspects of SQL
- SQL Programming Environment
- Compilation flow: host language + embedded SQL -> preprocessor -> host language + function calls -> host-language compiler (linked with the SQL library) -> object-code program
- Impedance Mismatch Problem
- Different data models between SQL statements and programming languages
- SQL/Host Language Interface
- Use the EXEC SQL keywords in front of an SQL statement
- Use shared (host) variables to pass values to and from SQL statements
- Check SQLSTATE for SQL errors
System Aspects of SQL - Cont
- The DECLARE Section and Its Usage
- Shared variables are declared between two embedded SQL statements.
- e.g.
- EXEC SQL BEGIN DECLARE SECTION;
- char studioName[50], studioAddr[256];
- char SQLSTATE[6];
- EXEC SQL END DECLARE SECTION;
- A shared variable can be used within an SQL statement by placing a colon in front of it.
- e.g.
- EXEC SQL INSERT INTO
- Studio(name, address)
- VALUES (:studioName, :studioAddr);
- Single-Row Select Statement
- e.g.
- EXEC SQL SELECT netWorth
- INTO :presNetWorth
- FROM Studio, MovieExec
- WHERE presC = cert AND
- Studio.name = :studioName;
System Aspects of SQL - Cont
- Cursors
- Allow programs to "fetch" multiple rows from a relation
- Here are the steps for using a cursor
- EXEC SQL DECLARE <cursor> CURSOR FOR <query>
- EXEC SQL OPEN <cursor>
- EXEC SQL FETCH FROM <cursor> INTO <list-of-variables>
- If SQLSTATE is "02000", then close the cursor; otherwise fetch the next row
- EXEC SQL CLOSE <cursor>
- Row Modification with Cursor
- Use the WHERE CURRENT OF keywords
- e.g.
- EXEC SQL DELETE FROM MovieExec
- WHERE CURRENT OF execCursor
- EXEC SQL UPDATE MovieExec
- SET netWorth = 2 * netWorth
- WHERE CURRENT OF execCursor
System Aspects of SQL - Cont
- Concurrent Update of Tuple
- Use the keywords INSENSITIVE CURSOR to ignore new changes which may affect the current cursor
- Use the keywords FOR READ ONLY to signal that this cursor does not allow any modification
- Scrollable Cursors
- Allow a set of movements within a cursor
- Dynamic SQL
- Flexibility to enter SQL statement at run time
- Use EXEC SQL EXECUTE IMMEDIATE or
- ( EXEC SQL PREPARE ... and
- EXEC SQL EXECUTE ... )
- e.g.
- EXEC SQL BEGIN DECLARE SECTION;
- char *query;
- EXEC SQL END DECLARE SECTION;
- /* Allocate memory pointed to by query
- and fill in the SQL statement */
- EXEC SQL EXECUTE IMMEDIATE :query;
System Aspects of SQL - Cont
- Procedures Stored in the Schema
- Persistent Stored Modules (PSM)
- Can build modules to handle complex computations which cannot be expressed in SQL alone
- PSM Functions & Procedures
- CREATE PROCEDURE <name> (<params>)
- local declarations
- procedure body
- Procedure parameters can be input-only, output-only, or both
- CREATE FUNCTION <name> (<params>) RETURNS <type>
- local declarations
- function body
- Function parameters can only be input, as PSM forbids side-effects in functions
System Aspects of SQL - Cont
- Statements in PSM
- Call statement
- CALL <proc name> (<arg list>)
- e.g. EXEC SQL CALL Foo(:x, 3)
- RETURN <expression>
- DECLARE <name> <type>
- SET <variable> = <expression>
- BEGIN ... END
- IF <condition> THEN
- <statement list>
- ELSEIF <condition> THEN
- <statement list>
- ELSEIF
- ...
- ELSE <statement list>
- END IF
- SELECT <attr> INTO <var> FROM <table>
- WHERE <condition>
- LOOP <statement list> END LOOP
System Aspects of SQL - Cont
- FOR <loop name> AS <cursor name> CURSOR FOR
- <query>
- DO
- <statement list>
- END FOR
- Also supports WHILE and REPEAT loops
- Exception Handlers in PSM
- DECLARE <where to go> HANDLER FOR <condition list> <statement>
- <where to go> can be
- CONTINUE - execute the handler statement and then execute the next statement after the one which caused the exception
- EXIT - execute the handler statement and then control leaves the BEGIN...END block in which the handler is declared
- UNDO - same as EXIT except that any changes to the DB or local variables that were made by the statements of the block are "undone"
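A minimal SQL/PSM sketch tying a few of these statements together (the procedure name and logic are illustrative, not from the slides):

    CREATE PROCEDURE RenameStudio(
        IN oldName CHAR(30),
        IN newName CHAR(30))
    BEGIN
        DECLARE cnt INT;
        -- check that the studio actually exists before renaming it
        SELECT COUNT(*) INTO cnt FROM Studio WHERE name = oldName;
        IF cnt > 0 THEN
            UPDATE Studio SET name = newName WHERE name = oldName;
        END IF;
    END;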
System Aspects of SQL - Cont
- SQL Environment
- Schema
- A collection of tables, views, assertions, triggers, PSM modules, etc.
- CREATE SCHEMA <name> <declarations>
- Use SET SCHEMA to change the current schema
- Catalog
- A collection of schemas
- CREATE CATALOG <catalog name>
- Use SET CATALOG to change the current catalog
- Cluster
- A collection of catalogs
- Can be viewed as the set of all catalogs accessible to a user
- Client/Server
- The client and server can be on different machines or on the same machine
System Aspects of SQL - Cont
- Connection
- CONNECT TO <server name> AS <connection name> AUTHORIZATION <name and password>
- SET CONNECTION <name>
- DISCONNECT <name>
- Call-Level Interface (CLI)
- In C, each CLI program must include sqlcli.h, which contains all the function, structure, constant, and type definitions
- 4 kinds of records: SQLHENV, SQLHDBC, SQLHSTMT, and SQLHDESC
- Use SQLAllocHandle(hType, hIn, hOut)
- Processing Statements
- Use SQLPrepare(sh, st, sl) and SQLExecute(sh)
- or use SQLExecDirect(sh, st, sl)
- Use SQLFetch(sh) to fetch tuples from a query result
System Aspects of SQL - Cont
- Use SQLBindCol(sh, colNo, colType, pVar, varSize, varInfo) for column binding
- Can use SQLGetData(...) in place of SQLBindCol(...) to extract data from a query
- Passing Parameters to a Query
- e.g.
- SQLPrepare(myStmt, "INSERT INTO Studio(name, address) VALUES (?, ?)", SQL_NTS)
- SQLBindParameter(myStmt, 1, ..., studioName, ...)
- SQLBindParameter(myStmt, 2, ..., studioAddr, ...)
- SQLExecute(myStmt)
- Transactions in SQL
- Serializability
- Multiple selects followed by multiple updates to the same tuple, e.g. chooseSeat()
- Use locks to handle this problem
System Aspects of SQL - Cont
- Atomicity
- A single user transaction may have multiple updates to different tables, e.g. a transfer from account A to account B (see the sketch below)
- Only "commit" after all the changes are made
- Transaction
- A collection of one or more operations on the database that must be executed atomically
- Use START TRANSACTION to begin
- Use SQL COMMIT to commit
- Use SQL ROLLBACK to abort and undo the changes, restoring the state from prior to the start of the transaction
- Can set the transaction to READ ONLY
- Dirty Reads (Uncommitted Reads)
- Data reads that see "dirty" or uncommitted values
- Isolation Levels
- Serializable, uncommitted read, committed read, and repeatable read
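A hedged sketch of an atomic transfer transaction (the Accounts table and its columns are made-up placeholders echoing the account A / account B example above):

    set transaction isolation level serializable;
    start transaction;
    update Accounts set balance = balance - 100 where acctNo = 123;
    update Accounts set balance = balance + 100 where acctNo = 456;
    commit;
    -- on an application error, issue ROLLBACK instead of COMMIT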
System Aspects of SQL - Cont
- Security & User Authorization in SQL
- Privileges
- SQL defines nine types of privileges: SELECT, INSERT, DELETE, UPDATE, REFERENCES, USAGE, TRIGGER, EXECUTE, and UNDER
- Authorization Checking
- First at connect time
- Second at statement time
- Additional checks with modules
- Grant & Revoke (see the sketch below)
- GRANT <privilege list> ON <database element> TO <user list>
- Allows other users to perform certain actions
- REVOKE <privilege list> ON <database element> FROM <user list>
- Disallows a previously granted privilege
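A minimal sketch of granting and revoking privileges on the Studio table from the earlier examples (the user names are placeholders):

    grant select, insert on Studio to alice;
    grant select on Studio to bob with grant option;  -- bob may pass SELECT on
    revoke insert on Studio from alice;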
Data Storage
- Megatron 2002 Database System
- Store relation in ASCII text file
- Store the schema also in ASCII file
- Obvious problems
- Tuple layout on disk is not flexible; any small change may shuffle the whole file
- Searching is expensive; must read the whole file
- Query processing is by brute force; nested loops examine all possibilities
- No memory buffering; every query requires direct access to disk
- No concurrency control
- No reliability, e.g. no crash recovery
- The Memory Hierarchy
- Cache Memory
- Fast access to and from the processor or I/O controller
Data Storage - Cont
- Main Memory
- Random access (RAM)
- Both OS and applications reside in RAM
- Virtual Memory
- Allows each application to have its own private memory space, which is mapped to physical memory (RAM) or to disk
- A page is the block of memory moved between main memory and disk
- Secondary Storage
- Much slower than main memory
- Two types of disk I/O
- A disk read means moving a block from disk to main memory
- A disk write means moving a block from main memory to disk
- Most DBMSs will manage disk blocks themselves, rather than relying on the OS file manager
Data Storage - Cont
- Volatile and Nonvolatile Storage
- Main memory is typically volatile; when the power is off, the content is gone
- Flash memory is nonvolatile, but it is very expensive and currently not used as main memory
- An alternative is to use a "RAM disk" combined with a battery backup for the power supply
- Disks
- Disk Components
- Heads, platters (2 surfaces each), cylinders, tracks, sectors, gaps
- Disk Controller
- Controls the movement of the disk head(s) to a specific track and performs reads and writes
- Transfers data to and from main memory
Data Storage - Cont
- Effective Use of Secondary Storage
- CS studies of algorithms often assume that the data are always in main memory; this is not a valid assumption for a DBMS
- I/O Model of Computation
- Dominance of I/O cost
- If a block needs to be moved between disk and main memory, the time taken to perform the read/write is much larger than the time for manipulating that data in main memory; thus the I/O time is a good approximation of the total time
- Similar to Big O notation for algorithm study
- Sorting Data in Secondary Storage
- If we need to sort 1.64 billion bytes and a disk block holds 16,384 bytes, then 100,000 blocks are required to read each tuple once from disk
- Quicksort is one of the fastest algorithms, but it assumes all entries are in memory
Data Storage - Cont
- Two-Phase, Multiway Merge-Sort (TPMMS)
- Consists of 2 phases
- Phase 1: sort main-memory-sized pieces of the data, so every record is part of a sorted list that just fits in the available main memory; the result is a set of sorted sublists on disk, which we merge in the next phase
- Phase 2: merge all the sorted sublists into a single sorted list
- Example
- If we have 100 MB of main memory and 16,384-byte blocks and are sorting 1.64 billion bytes, we can fit 6,400 blocks at a time in main memory; thus phase 1 produces 16 sorted sublists
- If we merge only two sublists at a time, each block needs 8 more disk I/O's (4 merge passes, each reading and writing the block)
- The better approach is to read the first block from each of the sorted sublists into main-memory buffers, repeatedly move the smallest remaining element to an output buffer, and flush/reload buffers when necessary
Data Storage - Cont
- Accelerating Access to Secondary Storage
- The TPMMS example in 11.4.4 assumed that data was stored on a single disk and that blocks were chosen randomly
- There is certainly room for improvement with the following methods, each with its advantages and disadvantages
- Cylinder-Based Organization
- Can reduce disk block access time in phase one of TPMMS by more than 95%
- Excellent for applications in which only one process accesses the disk and block reads are grouped in logical sequences
- Not useful when reading random blocks
- Multiple Disks
- Improve both grouped and random access performance
- Accesses to the same disk still cannot be performed in parallel
- Can be expensive, since a single large disk is usually more cost-effective than multiple smaller disks with the same total capacity
Data Storage - Cont
- Mirroring
- Reduce access time for read/write requests
- Built-in fault tolerance for all applications
- Must pay for 2 disks for the capacity of only 1
- Elevator Algorithm
- Reduce read/write access time when the requested blocks are random
- The average delay for each request can be high in any high-traffic system
- Prefetching/Double Buffering
- Greatly improve access when the needed blocks are known in advance or grouped together
- Require extra main-memory buffers
- No help when accesses are random
- Disk Failures
- Intermittent Failures
- An attempt to read or write a sector fails, but succeeds after n retries
Data Storage - Cont
- Checksum
- Widely used method to detect media errors
- Use a collection of bits to calculate a fixed value; when the recalculation fails to match, a media error is the likely cause
- Detects errors but does not fix them
- Stable Storage
- Similar to disk mirroring, except that it is achieved at the software/application level
- Keep an extra "delta" copy of the data to guard against media errors and possible data corruption caused by power failure
- Disk Crash Recovery
- Redundant Arrays of Independent Disks (RAID)
- Redundancy Technique - Mirroring
- Known as RAID level 1
- When one of the disks fails, the other "mirror" disk becomes the main disk
Data Storage - Cont
- Parity Blocks
- Known as RAID level 4
- Uses only 1 redundant disk no matter how many data disks it supports
- Utilizes the modulo-2 sum for parity checks
- Too many data disks can cause the redundant disk to perform poorly, since a write to any of the n data disks causes the check-sum bits to change
- RAID 5
- Improves the RAID 4 approach by spreading the redundant-disk workload across all the disks
- Multiple Disk Crash
- RAID 6 uses error-correcting codes such as Hamming codes
- Uses a combination of data and redundant disks; the code determines how many of each are required to survive concurrent failures of several disks
Representing Data Elements
- Data Elements and Fields
- Relational Database Elements
- Since a relation is a set of tuples, and tuples are similar to a record/structure in C or C++, we may imagine that each tuple will be stored on disk as a record.
- Objects
- An object is like a tuple, with its instance variables as attributes.
- Data Elements
- INTEGER
- Usually 2 or 4 bytes long
- FLOAT
- Usually 4 or 8 bytes long
- CHAR(n)
- Fixed length denoted by n
Representing Data Elements - Cont.
- VARCHAR(n)
- Variable length with n as the maximum
- Two ways to represent varchar
- Length plus content
- Null-terminated string
- Dates and Times
- Date is usually represented as char(n)
- Time is represented as varchar(n) because of the support for fractions of a second
- Bits
- Can pack 8 bits into a byte; use an 8-bit boundary, meaning the length is rounded up to the next whole byte
- Enumerated Types
- Using a byte to represent each value, thus up to 256 different values
Representing Data Elements - Cont.
- Records
- Building Fixed-Length Records
- Can concatenate the fields to form a record
- Be aware of 4- and 8-byte boundaries, depending on the HW and OS; therefore data must be organized accordingly
- Record Headers
- Also known as the record descriptor
- Information about the record such as length, timestamp, record id, record type, etc.
- Packing Fixed-Length Records into Blocks
- Using a block header followed by multiple records
Representing Data Elements - Cont.
- Representing Block and Record Addresses
- Client-Server Systems
- The server's data lives in a database address space. The address space can refer to blocks and possibly to offsets within the block.
- Physical Addresses
- Byte strings that determine the location within the secondary storage system where the block or record can be found, with information such as hostname, cylinder number, track number, block number, and the offset of the record from the beginning of the block.
- Logical Addresses
- Can be viewed as a flat model where all the records are in a logical sequence.
- Use a mapping table to map logical to physical addresses.
Representing Data Elements - Cont.
- Logical and Structured Addresses
- Why logical addressing?
- Movement of data can be done by changing the logical-to-physical mapping table rather than moving the actual data itself.
- Structured addressing
- Using a key value and the physical address of a block, a record can easily be located
- Can be viewed as a form of "hashing"
- Fast lookup if each record is fixed-length
- Offset table
- Keeping an offset table as part of a block header can handle variable-length records with fast lookup.
- Allows easy movement of data, which is one of the main advantages of logical addressing.
Representing Data Elements - Cont.
- Pointer Swizzling
- It means to translate embedded pointers from secondary storage (database addresses) to main memory (virtual addresses).
- A pointer usually consists of a bit indicating whether the pointer is currently a database address or a (swizzled) memory address, plus the actual database or memory pointer.
- Automatic swizzling means that when we load a block, all its pointers and addresses are put into the translation table if they are not already there.
- Another approach is to translate a pointer only when it is being used.
- The programmer can also control pointer swizzling by using look-ahead logic (e.g. prefetch).
Representing Data Elements - Cont.
- Pinned Records and Blocks
- A block in memory is pinned if it cannot be written back to disk safely.
- Can view this as a constraint or dependency from pointers in another block.
- Variable-Length Data and Records
- Records with Variable-Length Fields
- Must keep the lengths and offsets of the variable-length fields.
- One simple but effective method is to put all the fixed-length fields at the beginning, followed by the variable-length fields.
- Records with Repeating Fields
- Use the same method as above, but the repeating fixed-length fields can be moved to another block.
Representing Data Elements - Cont.
- Variable-Format Records
- In certain situations, records may not have a fixed schema. The fields or their order are not completely determined by the relation.
- Use tagged fields to handle such cases.
- We store the attribute or field name, the type of the field, and the length of the field.
- Very similar to the SQLDA definition and the bind-in and bind-out operations of SQL
- Spanned Records
- When a record can't fit into a block and must be broken up into multiple blocks, it is called spanned.
- Spanned records require extra header information to keep track of their fragments.
Representing Data Elements - Cont.
- Binary Large Objects (BLOBS)
- Can hold large audio and image files
- Stored in a sequence of blocks, but can also be striped across multiple disks for faster retrieval
- Retrieval is usually done in small chunks
- Record Modifications
- Insertion
- If order is not important, then just add the new record to the end of the free space within a block.
- If order is important, then we must slide the data to fit the new record in place. In this case, the offset-table implementation can help reduce the actual movement of data.
- If the block is full, then either a) find space on a "nearby" block and keep a forwarding address, or b) use a chain of overflow blocks.
Representing Data Elements - Cont.
- Deletion
- When a record is deleted, the system must reclaim its space.
- If using an offset table, the block should consolidate the free space into a central location
- An alternative is to keep track of a linked list of free space (e.g. a freelist).
- When the record has spanned or overflowed to a different block (nearby or overflow), we may need to "reorganize" all the blocks.
- To avoid dangling pointers, we can replace deleted records with tombstones (dummy records indicating a dead end).
- Update
- If the record is fixed-length, then there is no effect on the storage layout.
- Otherwise we have the same problems as insertion or deletion of variable-length fields.
Index Structures
- Indexes on Sequential Files
- Usually have a data file and an index file. The data file is a sorted file. The index file contains only keys and pointers into the data file.
- Sequential Files
- A file which contains records sorted by the keys defined by its index.
- Dense Indexes
- Every key from the data file is represented in the index.
- An index entry is small compared to a record entry. Thus we may be able to keep the index contents in memory, rather than reading from the index file.
- Since the keys are sorted, we can use binary search to find a key K. Search time is about log n (base 2).
- Sparse Indexes
- Hold only a subset of the entries a dense index would hold
- Use much less space but have slower search time.
Index Structures - Cont.
- Multiple Levels of Index
- If the index itself covers many blocks, then a binary search will still need many disk I/O's to get to the correct record. We can solve this problem by putting an index on the index.
- The outer level of the index is sparser than the inner level.
- Indexes With Duplicate Search Keys
- If the index is not unique, we can have multiple records with the same search key.
- An efficient approach is to have only one record in the dense index for each search key K, then find the record within the sorted sub-list.
- Managing Indexes During Inserts, Updates, and Deletes
- Recall that Chapter 12 discussed different methods to handle inserts, updates, and deletes of fixed- or variable-length records; similar logic applies to index files as well.
Index Structures - Cont.
- Here is a list of actions on the sequential file which affect the index file:
- Action                          Dense    Sparse
- ------------------------------  -------  ---------
- Create empty overflow block     none     none
- Delete empty overflow block     none     none
- Create empty sequential block   none     insert
- Delete empty sequential block   none     delete
- Insert record                   insert   update(?)
- Delete record                   delete   update(?)
- Slide record                    update   update(?)
- Empty overflow blocks have no effect because a sparse index keeps track of only the primary blocks, not overflow blocks.
- Inserts, updates, and deletes may or may not cause the sparse index to change, depending on whether the affected key was in the sparse index or whether any record was slid.
Index Structures - Cont.
- Secondary Index
- This is a dense index, usually with duplicates.
- Does not require the underlying file to be sorted on the search key, as a primary index does.
- The keys in the secondary index are sorted.
- The pointers in one index block can go to many different data blocks.
- A clustered file structure can help manage the many-one relationship. An example of this is the number of columns or indexes within a table.
- Indirection in Secondary Indexes
- Use buckets in this index scheme to avoid duplicates at the higher level.
- e.g. SELECT title FROM Movie WHERE studioName = 'Disney' AND year = 1995
- If studioName has a primary index and year has a secondary index, then the number of tuples which satisfy both conditions will be reduced significantly.
Index Structures - Cont.
- Document Retrieval and Inverted Indexes
- The WWW has brought many new requirements for document retrieval and pattern matching. This results in newer and better search engines as time passes.
- e.g. search all the documents which contain the words "database", "programming", and "implementation"
- A relational view of the document search
- A document may be thought of as a tuple in a relation Doc. Each attribute/word can be represented as a bit that is set to true if the document contains at least one match. Use a secondary index on each of the attributes of Doc, but only keep entries whose search-key value is TRUE. Instead of creating a separate index for each attribute, the indexes are combined into one, called an inverted index.
- The inverted index points us to bucket entries, where we can "join" the lists of pointers.