Title: CG171 - Database Implementation and Development: Lecture 1
Slide 1: CG171 - Database Implementation and Development, Lecture 1
Slide 2: Relationship to other modules
- Prerequisite: CG084/CG085 Relational Databases Theory and Practice
- Possible follow-up module: CM036 Advanced Databases
Slide 3: CG171 Part 1 (NR)
- Introduction (10%)
- Database Design vs. Information System Design
- Database Design Methodologies and CASE tools
- Physical Database Implementation Fundamentals (30%)
- Memory structure
- Data placement
- Composite data types
- Access methods
- Materialized queries
Slide 4: CG171 Part 2 (mainly AA)
- Physical Database Design (30%)
- Relational
- Object-Relational
- Object-Oriented
- Database Programming and Development (30%)
- External programming of composite data structures
- Programming stored procedures, classes, and packages
- Creating and maintaining physical database structures
Slide 5: Aims of module
- To provide essentials of database system implementation.
- To develop students' ability in designing and implementing physical database structures.
- To extend students' ability in developing information systems based on databases.
Slide 6: On completion of module
- You should be able to
- Explain and discuss the fundamentals of a database system implementation.
- Transform a logical database design into a physical one for a target DBMS.
- Evaluate effects of physical design decisions on performance and complexity of a database.
- Connect to a database and manipulate the data using a programming language (e.g. Java, C, VB) and procedural extensions to SQL (e.g. PL/SQL, Informix 4GL)
Slide 7: Module Delivery/Assessment
- Delivery
- lectures
- seminar/practical sessions
- Assessment
- examination
- group-based assignment
Slide 8: Assignment
- Will assess the students' ability to
- evaluate design alternatives,
- formulate physical design,
- implement the physical design,
- program connectivity and data manipulation using programming languages and SQL procedural extensions.
Slide 9: Reading
- Main texts
- Ramakrishnan, R. and Gehrke, J. Database Management Systems, 2nd Edition, McGraw-Hill, 2000.
- Connolly, T. and Begg, C. Database Systems, 3rd ed., Pearson, 2002.
- Others
- Muller, Robert J. Database Design for Smarties: Using UML for Data Modeling. Morgan Kaufmann Publishers, 1999.
- Oracle Concepts, Oracle Corporation.
- Oracle Application Developer's Guide, Oracle Corporation.
- Oracle Java Documentation, Oracle Corporation.
Slide 10: Basic Problem in Physical Access 1
- Long times to fetch data cause
- Slow response
- Low throughput
- User impatience
- Loss of concentration by user
- Logical design may be elegant, but for efficiency it must be coupled with effective physical design
Slide 11: Basic Problem in Physical Access 2
- Assumption may be
- A row of data is retrieved in one disk access
- This may be very far from the case
- Possible to design databases in which many thousands of disk accesses are required to retrieve an item
Slide 12: ATM Example 1
- You're in the rain outside an ATM
- You want to withdraw £50
- You type in your needs
- The system first checks your balance
- How wet do you get before the £50 arrives?
Slide 13: ATM Example 2
- Say your account number is 168234
- The table holds 4,500,000 records, each holding an account number and a balance
- The system searches serially (each record in turn)
- Might need to search 4,500,000 records if very unlucky
- Or 1 record if very lucky
- On average 2,250,000 records
- With 100 records/block this gives a range of 1-45,000 disk accesses (average 22,500)
- Each disk access is ~10 msecs (10^-2 secs)
- So elapsed time ranges from 10^-2 secs to 450 secs
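The back-of-envelope arithmetic above can be checked with a short script; the figures (4,500,000 records, 100 records/block, ~10 msecs per access) are the slide's own assumptions:

```python
# Serial-search cost for the ATM example, using the slide's figures.
records = 4_500_000
records_per_block = 100
disk_access_secs = 0.010                 # ~10 msecs per disk access

blocks = records // records_per_block    # 45,000 blocks in the file
best = 1 * disk_access_secs              # very lucky: record in the first block
worst = blocks * disk_access_secs        # very unlucky: read the whole file
average = (blocks / 2) * disk_access_secs

print(blocks, best, average, worst)      # 45000, 0.01 s, 225.0 s, 450.0 s
```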
Slide 14: ATM Example 3
- This is the time for just one operation
- Also need to record
- Amount withdrawn
- Transaction details
- New customer balance
- There are also times for network transmission.
- Customer will usually get very wet!
Slide 15: ATM Example 4
- So how can the customer keep dry?
- Could hold all of the data in the main memory of the computer
- Time for accessing a location here is ~60 nanoseconds (60 x 10^-9 secs)
- So could search all 4,500,000 records quickly
- One access/record would take only 270 msecs (0.27 secs) for the whole file
- Customer will keep very dry even with other accesses.
Slide 16: Clever disk access needed
- Problem: main memory is not available in the volume necessary for databases such as these (but it is increasing all the time)
- So need a clever disk access system
- Access methods
- often relate primary key to storage address
- Indexes
- Sorted lists of search values with their locations
- Clustering
- Place linked data close together
Slide 17: ATM Example 5
- Now put data on disk
- Assume B-tree organisation
- All records found in < 5 disk accesses
- Time is from 10 msecs (10 x 10^-3 secs) to 50 msecs (50 x 10^-3 secs)
- Even with other tasks
- Customer keeps very dry!
Slide 18: Database Design
- After producing a logical design with elegant maintainable structures
- Need to do physical design to make it run fast.
- Performance is often more important in database applications than in more general information system design
- Emphasis on number of transactions per second (throughput)
Slide 19: Database Design Methodologies
- Produce default storage schema
- May be adequate for small applications
- For large applications, much further tuning required
- Physical design is the technique
- Concepts: memory (main/disk), target disk architecture, blocks, access methods, indexing, clustering.
Slide 20: Seminar week 1
- Using the web and other resources
- Find out more about how disks work
- What do the terms seek time, rotational delay, transfer time, platter, track and cylinder mean?
- For two different manufacturers' disk-packs (say IBM mainframe, Microsoft PC) find typical magnitudes for the above features.
- Also identify typical main memory sizes for the corresponding computers.
- Suggest how the characteristics affect physical design of databases
Slide 21: Lecture 2 - File Access Principles and Access Methods
Slide 22: Aims
- Fast retrieval is usually taken as < 5 disk accesses
- Since a disk access is very long compared to other access times, the number of disk accesses is often used as an indicator of performance
- Fast placement, within 5 disk accesses
- Insertion of data, which may be in the middle of the file, not at the end
- Deleting data, by actual removal or tombstone
- Updating data, including primary key and other data
Slide 23: Retrieval/Placement
- Distinguish between actions involving primary and secondary keys
- Primary key is that determined by normalisation
- May be single or multiple attributes
- Only one per table
- Secondary keys
- Again may be single or multiple attributes
- Many per table
- Include attributes other than the primary key
- Complications such as candidate keys are omitted in this part of the course
Slide 24: Access Method
- An access method is the software responsible for storage and retrieval of data on disk
- Handles page I/O between disk and main memory
- A page is a unit of storage on disk
- Pages may be blocked so that many are retrieved in a single disk access
- Many different access methods exist
- Each has a particular technique for relating a primary key value to a page number
Slide 25: Processing of Data
- All processing by software is done in main memory
- Blocks of pages are moved
- from disk to main memory for retrieval by the user
- from main memory to disk for storage by the user
- The access method drives the retrieval/storage process
Slide 26: Cost Model
- Identify the cost of each retrieval/storage operation
- D: access time to disk to read/write a page
- seek time (time to move the head to the required cylinder)
- rotational delay (time to rotate the disk once the head is over the required track)
- transfer time (time to transfer data from disk to main memory)
- Typical value for D: 15 milliseconds (msecs), or 15 x 10^-3 secs
Slide 27: Other times
- C: average time to process a record in main memory, 100 nanoseconds (nsecs) = 100 x 10^-9 secs.
- R: number of records/page
- B: number of pages in file
- Note that D > C by roughly 10^5 times
Slide 28: Access Method I: the Heap
- Records (tuples) are held on file
- in no particular order
- with no indexing
- that is, in a heap (the Unix default file type)
- Strategy
- Insertions usually at end
- Deletions are marked by tombstones
- Searching is exhaustive from start to finish until the required record is found
Slide 29: The Heap Cost Model 1
- Cost of complete scan: B(D + RC)
- For each page, one disk access D and processing of R records taking C each.
- If R = 1000 and B = 1000 (file contains 1,000,000 records)
- Then cost = 1000 x (0.015 + (1000 x 10^-7)) = 1000 x (0.0150 + 0.0001) = 1000 x 0.0151 = 15.1 secs
- Cost of finding a particular record: B(D + RC)/2 (scan half the file on average) = 7.55 secs
- Cost of finding a range of records, e.g. student names beginning with "Sm": B(D + RC) (must search the whole file) = 15.1 secs
Slide 30: The Heap Cost Model 2
- Insertion: 2D + C
- Fetch last page (D), process record (C), write last page back again (D).
- Assumes
- all insertions at end
- system can fetch last page in one disk access
- Cost = (2 x 0.015) + 10^-7 ≈ 0.030 secs
Slide 31: The Heap Cost Model 3
- Deletions: B(D + RC)/2 + C + D
- Find the particular record (scan half the file: B(D + RC)/2), process the record on the page (C), write the page back (D).
- Record will be flagged as deleted (tombstone)
- Record will still occupy space
- If reclaiming space, potentially need to read/write many more pages
- Cost = 7.550 + 10^-7 + 0.015 ≈ 7.565 secs
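The heap cost formulas from the last three slides can be collected into one short script, using the slide's figures (D = 15 msecs, C = 100 nsecs, R = B = 1000):

```python
# Heap cost model with the slide's figures:
# D = disk access time, C = per-record CPU time, R = records/page, B = pages.
D, C, R, B = 0.015, 1e-7, 1000, 1000

scan = B * (D + R * C)       # read every page, process every record
find_one = scan / 2          # on average, scan half the file
find_range = scan            # no ordering, so the whole file must be scanned
insert = 2 * D + C           # fetch last page, add record, write it back
delete = find_one + C + D    # find the record, flag it, write the page back

print(scan, find_one, insert, delete)
```

Note how the disk term B x D dominates: the CPU term B x R x C contributes only 0.1 secs of the 15.1-second scan.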
Slide 32: Pros and Cons of Heaps
- Pros
- Cost effective where many records are processed in a single scan (can process 1,000,000 records in 15.1 secs)
- Simple access method to write and maintain
- Cons
- Very expensive for searching for single records in large files (1 record could take 15.1 secs to find)
- Expensive for operations like sorting, as there is no inherent order
Slide 33: Usage of Heaps
- Where much of the file is to be processed
- Batch mode (collect search and update requests into batches)
- Reports
- Statistical summaries
- Program source files
- Files which occupy a small number of pages
Slide 34: Access Method II: the Sorted File (Sequential File)
- Records (tuples) are held on file
- in order of the primary key
- with no indexing
- that is, in a sorted heap
- Strategy
- Insertions go in a particular page to maintain order
- Deletions are physically removed
- Searching may be exhaustive from start to finish until the required record is found, but can use binary search
Slide 35: Sorted File Cost Model 1
- Cost of complete scan: B(D + RC)
- Same as heap
- If R = 1000 and B = 1000 (file contains 1,000,000 records)
- Then cost = 1000 x (0.015 + (1000 x 10^-7)) = 1000 x (0.0150 + 0.0001) = 1000 x 0.0151 = 15.1 secs
- Cost of finding a particular record
- Can do binary search (binary chop)
- Go to the middle of the file: is the key sought lower or higher than the value here?
- If lower, go to 1/4 of the way through the file
- If higher, go to 3/4 of the way through the file
- And so on until found
Slide 36: Sorted File Cost Model 1
- Cost is D log2(B) + C log2(R) = (0.015 x 10) + (10^-7 x 10) ≈ 0.15 secs
- Much faster than with the heap, where it was 7.55 secs
- Cost of finding a range of records, e.g. student names beginning with "Sm"
- Approximately the same as the search for a particular record
- D log2(B) + C log2(R) = (0.015 x 10) + (10^-7 x 10) ≈ 0.15 secs
- Find the position in the file, then retrieve all records from the page found
- Works well only if the sort key is the same as the search key
- May need to retrieve further pages if the result is large
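The binary-search cost above (log2 of 1000 is just under 10) works out like this, again with the slide's figures:

```python
import math

# Binary-search cost on the sorted file: D * log2(B) disk accesses to
# locate the page, then C * log2(R) to binary-search within the page.
D, C, R, B = 0.015, 1e-7, 1000, 1000

search = D * math.log2(B) + C * math.log2(R)
print(round(search, 3))   # ≈ 0.149 secs, versus 7.55 secs for the heap
```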
Slide 37: Lecture 3 - CG171 -- Hashing
Slide 38: Sorted File Cost Model 2
- Insertion: D log2(B) + C log2(R) + 2(0.5B(D + RC))
- Difficult
- Find the page where the record must be held in sequence; search time is D log2(B) + C log2(R)
- Insert the record on the page in sort sequence
- Then shuffle records on this page to make room for the insertion
- Fetch all later pages to shuffle records further, requiring RC secs of processing on 0.5B pages and D secs/page to fetch
- The 0.5 comes from the assumption that half the file follows the insertion point on average
- Total for the last two steps (4-5) is 0.5B(D + RC)
Slide 39: Sorted File Cost Model 2
- Write all pages back again, so again incur the cost as in steps 4-5: 0.5B(D + RC)
- Cost = (0.015 x 10) + (10^-7 x 10) + 2(0.5 x 1000 x (0.015 + (1000 x 10^-7))) ≈ 0.15 + 15.1 ≈ 15.25 secs (long!)
- Deletions
- Cost is the same as for insertions
- Do not have the option of leaving deleted records flagged
- Records are shuffled upwards
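The insertion cost above is dominated by reading and rewriting half the file, as this restatement of the slide's arithmetic shows:

```python
import math

# Sorted-file insertion cost: find the page by binary search, then read
# and rewrite (on average) half the file to shuffle later records along.
D, C, R, B = 0.015, 1e-7, 1000, 1000

find = D * math.log2(B) + C * math.log2(R)   # ≈ 0.15 secs
shuffle = 2 * (0.5 * B * (D + R * C))        # read half the file, write it back
insert = find + shuffle
print(round(insert, 2))   # ≈ 15.25 secs
```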
Slide 40: Pros and Cons of Sorted Files
- Pros
- Cost effective where many records are processed in a single scan (can process 1,000,000 records in 15.1 secs)
- Simple access method to write and maintain
- Fairly fast for searches on the sort key (10-15 disk accesses)
- Cons
- Very expensive for insertions and deletions (except at end), as much shuffling is required
Slide 41: Usage of Sorted Files
- Where much of the file is to be processed
- Batch mode (collect search and update requests into batches)
- Reports
- Statistical summaries
- Program source files
- Searching (on the sort key) in files which occupy a relatively small number of pages
Slide 42: Hashing
- One of the big two access methods
- Very fast potentially
- One disk access only in the ideal situation
- Used in many database and more general information systems where speed is vital
- Many variants to cope with certain problems
Slide 43: Meaning of "Hash"
- Definition 3: to cut into many small pieces; mince (often fol. by "up").
- Example: He chopped up some garlic.
- Synonyms: dice, mince (1), hash (1)
- Similar words: chip, cut up, cut (vt), carve, crumble, cube, divide
- From http://www.wordsmyth.net/live/home.php?content=wotw/2001.0521/wotw_esl
Slide 44: Hash Function
- Takes the key value of the record to be stored
- Applies some function (often including a chop), delivering an integer
- This integer is a page number on the disk, so
- input is a key
- output is a page number
Slide 45: Simple Example
- Have
- B = 10 (ten pages for the disk file)
- R = 2,000 (2,000 records/page)
- Keys: S12, S27, S30, S42
- Apply the function "chop" to the keys, giving
- 12, 27, 30, 42 (the initial letter is discarded)
Slide 46: Simple Example
- Then take the remainder of dividing the chopped key by 10.
- Why divide?
- Gives an integer remainder
- Why 10?
- Output numbers run from 0 to 9
- 10 possible outputs correspond with the 10 pages for storage
- In this case, the numbers returned are
- 2, 7, 0, 2
Slide 47: Disk File (hash table)
Page | Records (only keys shown)
0 S30
1
2 S12, S42
3
4
5
6
7 S27
8
9
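The slide's toy hash scheme is a few lines of code; the function below reproduces the table above (chop the leading letter, take the remainder mod 10):

```python
# The slide's toy hash function: discard the leading letter, then take
# the remainder on division by 10, giving one of pages 0-9.
def hash_page(key, pages=10):
    return int(key[1:]) % pages

# Build the hash table from the slide's four keys.
table = {page: [] for page in range(10)}
for key in ["S12", "S27", "S30", "S42"]:
    table[hash_page(key)].append(key)

print({p: ks for p, ks in table.items() if ks})
# {0: ['S30'], 2: ['S12', 'S42'], 7: ['S27']}
```

Note that S12 and S42 collide on page 2, which is why pages must hold many records (R = 2,000 here).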
Slide 48: Retrieval
- Say the user looks for the record with key S42
- Apply the hash function to the key
- Discard the initial letter, divide by 10, take the remainder
- Gives 2
- Transfer page 2 to a buffer in main memory
- Search the buffer looking for the record with key S42
Slide 49: Cost Model
- Retrieval of a particular record
- D + 0.5RC (one disk access, plus the time taken to search half a page for the required record)
- 0.015 + (0.5 x 2000 x 10^-7) = 0.0151 secs (very fast)
- Insertion of a record
- Fetch page (D), modify in main memory (C), write back to disk (D)
- 0.015 + 10^-7 + 0.015 ≈ 0.0300 secs
- Deletions: same as insertions
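The same arithmetic in script form, with the hash example's figures (note R = 2,000 here, not 1,000 as in the heap and sorted-file examples):

```python
# Hashing cost model with the slide's figures.
D, C, R = 0.015, 1e-7, 2000

retrieve = D + 0.5 * R * C   # one disk access, then search half the page
modify = 2 * D + C           # fetch the page, change it in memory, write it back

print(retrieve, modify)      # ≈ 0.0151 secs and ≈ 0.0300 secs
```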
Slide 50: Effectiveness
- Looks very good
- Searches in one disk access
- Insertions and deletions in two disk accesses
- So
- Searching is faster than heap and sorted
- Insertions and deletions are similar to heap, much faster than sorted
Slide 51: Minor Problem
- Complete scan
- Normally do not fill pages, to leave space for new records to be inserted
- 80% initial loading
- So records occupy 1.25 times the number of pages needed if densely packed
- 1.25B(D + RC) = 1.25 x 10 x (0.015 + (2000 x 10^-7)) ≈ 0.19 secs (against 0.152 if packed densely)
Slide 52: Larger Problems
- A scan for a group of records, say S31-S37, will be very slow
- Each record will be in a different page, not in the same page.
- So 7 disk accesses, instead of 1 with a sorted file (once the page holding S31-S37 is located).
Slide 53: Larger Problems
- What happens if a page becomes full?
- This could happen if
- The hash function is poorly chosen, e.g. all keys end in 0 and the hash function is the remainder on division by 10
- All records go in the same page
- There are simply too many records for the initial space allocated to the file
Slide 54: Overflow Area
- Have extra pages in an overflow area to hold records
- Mark overflowed pages in the main disk area
- Retrieval now may take 2 disk accesses, to search the expected page and the overflow page.
- If there is overflow on the overflow page, retrieval may take 3 disk accesses.
- Insertions may also be slow: collisions on already-full pages.
- Performance degrades
Slide 55: At Intervals in Static Hashing
- The Database Administrator's lost weekend
- He/she comes in
- Closes the system down to external use
- Runs a utility expanding the number of pages and re-hashing all records into the new space
Slide 56: Alternatives to Static Hashing
- Automatic adjustment: Dynamic Hashing
- Extendible Hashing
- Has an index (directory) to pages which adjusts to the number of records in the system
- Linear Hashing
- Has a family of hash functions
Slide 57: Pros and Cons of Hashing
- Pros
- Very fast for searches on the search key (may be 1 disk access)
- Very fast for insertions and deletions (often 2 disk accesses)
- No indexes to search/maintain (in most variants)
- Cons
- Slightly slower than sorted files for a scan of the complete file
- Requires periodic off-line maintenance in static hashing, as pages become full and collisions occur
- Poor for range searches
Slide 58: Usage of Hashing
- Applications involving
- Searching (on key) in files of any size for single records: very fast
- Insertions and deletions of single records
- So typical in On-line Transaction Processing (OLTP)
Slide 59: Lecture 4
Slide 60: Indexed Sequential
- Can see that (static) hashing is
- very fast for single-record retrieval
- very slow for fetching ranges
- So not an all-round access method
- What we want is an access method which has
- Sequential access AND
- Indexed access
Slide 61: Concept of Indexes
- An auxiliary storage structure
- possibly a separate file
- or a partition within the data file
- May be loaded into main memory if not too large
- Comprises in general a set of pairs of values, each pair being
- a key value
- a pointer to a page (page number or page address -- perhaps cylinder and track number)
Slide 62: Concept of Indexes
- Selection of key values may be
- dense -- every key value is included in the index
- sparse -- only one key value per page is included in the index
- Key values may be
- primary key (attributes comprising the primary key) -- primary index
- secondary key (all other attributes) -- secondary index
Slide 63: Concept of Indexes
- Primary and secondary keys may be
- simple (single attribute)
- composite (multiple attributes)
- Order of values in the file compared to those in the index may be
- clustered (the same order or very close to it)
- unclustered (random order)
Slide 64: Concept of Indexes
- Shape of index may be
- flat (all entries at one level)
- tree-structured (the index has a tree structure)
- As updates take place, the index may be
- static (does not adjust automatically)
- dynamic (adjusts automatically)
Slide 65: Use of Indexes
- Indexes are extensively used in
- access methods
- database performance optimisation in joins and other operations
Slide 66: ISAM
- Indexed Sequential Access Method
- Has records in sequence, as in a sorted file
- Has records indexed at intervals
- in a separate data structure, as pairs of values
- perhaps the highest key found on each page
- and a page number
- In effect a sorted file with records indexed at intervals
Slide 67: ISAM
- In terms of previous terminology, ISAM is
- tree-structured
- sparse
- clustered
- static
- either primary or secondary keys
- either simple or composite keys
Slide 68: ISAM Node
- Each node in the index tree is of the form P0, K1, P1, K2, P2, ... (P is a pointer, K is a key value)
- In each node there is always one more pointer than key value
Slide 69: Sorted file is underneath the index
- (Diagram: an index node pointing to data pages 0, 1 and 2)
- Indicates that page 0 contains records with keys < 20, page 1 contains records with keys from 20 to 49, and page 2 contains records with keys from 50 upwards
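The sparse index in the example can be sketched in a few lines; the page boundaries (20 and 50) are the slide's, and the binary search over boundary keys stands in for the index lookup:

```python
import bisect

# Sketch of the slide's sparse ISAM-style index: page 0 holds keys
# below 20, page 1 holds keys 20-49, page 2 holds keys from 50 upwards.
boundaries = [20, 50]   # lowest key stored on pages 1 and 2

def page_for(key):
    # The number of boundaries <= key gives the data page to fetch.
    return bisect.bisect_right(boundaries, key)

print([page_for(k) for k in (7, 20, 49, 99)])   # [0, 1, 1, 2]
```

A real ISAM index holds one such (key, page) list per index node, forming a tree; the idea per node is the same.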
Slide 70: Lecture 5
Slide 71: Verdict on ISAM
- Static index
- Unchanged by data changes
- Becomes out of date
- May have intense local growth
- Non-random addition of key values
- So need overflow pages
- Requiring extra disk accesses for search/update
Slide 72: Verdict on ISAM
- Performance
- Much depends on where the index is held
- Main memory makes searches very fast
- On disk, a deep tree will involve a number of accesses
- Good for range and whole-file processing
- Just process the sequential set
- Good for single-record retrieval
- If the index is in main memory (not always possible)
- And there are not many overflow records
- Concurrency -- no locking problems on the index, as it is not updated
Slide 73: Lost weekend
- Once the index is too much out of step with the data
- The DBA schedules a weekend slot for
- Taking the database offline
- A dump of the database
- A re-load into the new ISAM format
- Problems of a static approach
Slide 74: Usage
- Where the following three conditions are satisfied
- data is fairly static
- sequential access is needed (whole file and ranges)
- single-record (indexed) access is required
- If condition 1 is not met, use a B-tree
Slide 75: B-trees
- What does the B stand for?
- Balanced, Bushy or Bayer (apparently not clear)
- Balanced means the distance from the root to a leaf node is the same for all of the tree
- Bushy is a gardening term meaning all strands are of similar length
- Bayer is a person who wrote a seminal paper on them
Slide 76: B-Trees
- There are some variants on B-trees.
- We deal here with B+-trees, where all data entries are held in the leaf nodes of the tree
- The two terms are interchangeable for our purposes.
- B-trees are dynamic index structures
- The tree automatically adjusts as the data changes
Slide 77: B-tree
- A B-tree is a
- Multiway tree (fan-out or branching factor > 2; a binary tree has fan-out 2) with
- Efficient self-balancing operations
- Minimum node occupancy is 50%
- Typical node occupancy can be greater than this, but initially keep it around 60%
Slide 78: B-tree Diagram: index + sequence set
- (Diagram: a B-tree of order d = 2, with the index set -- internal structure as in the root -- above, and the data entries in the leaf pages below)
Slide 79: Structure
- Index set (tree, other than leaf pages) is sparse
- Contains key-pointer pairs (as in ISAM)
- Not all keys included
- Data set (leaf pages) is dense
- All entries included
- Data held is key + non-key data
- Pages are connected by doubly linked lists
- Can navigate in either direction
- Pages are ideally in sequence, but this is not a rule
- No child pointers are held at this level
Slide 80: Parameters
- d = 2 says that the order of the tree is 2
- Each non-terminal node contains between d and 2d index entries (except the root node: 1 to 2d entries)
- Each leaf node contains between d and 2d data entries
- So the tree shown can hold 2, 3 or 4 index values in each node
- How many index pointers?
- Format is as for ISAM, so always one more pointer than value: 3, 4 or 5 pointers per node.
Slide 81: Capacity of Tree
- d = 2
- One root node can hold 2d = 4 entries
- Next level: 5 pointers from the root, each node holding a maximum of 4 entries, gives 20 entries in 5 nodes
- Next level: 5 pointers from each of the 5 nodes above, each node holding a maximum of 4 records, gives 100 records in the leaf nodes
- In practice will not pack as closely
- But with d = 2, 3 levels potentially address 100 records
- If all held on disk, 3 disk accesses
Slide 82: Capacity of Tree
- High branching factors (fan-out) increase capacity dramatically (see seminar).
- So a tree structure with a high branching factor can be kept to a small number of levels
- Bushy trees (heavily pruned!) mean fewer disk accesses
- The root node at least will normally be held in main memory
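The capacity calculation from the previous slide generalises to any order and height; the function below reproduces the slide's d = 2, 3-level figure and shows how quickly capacity grows with fan-out:

```python
# Maximum leaf capacity of a B-tree of order d with a given number of
# levels: each node holds up to 2d entries, so has up to 2d + 1 children.
def max_leaf_records(d, levels):
    fan_out = 2 * d + 1
    leaves = fan_out ** (levels - 1)   # nodes at the leaf level
    return leaves * 2 * d              # each leaf holds up to 2d records

print(max_leaf_records(2, 3))     # 100, as on the slide
print(max_leaf_records(100, 3))   # over 8 million with a realistic fan-out
```

With disk-sized pages, d is typically in the hundreds, which is why three or four levels suffice for very large files.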
Slide 83: Example of a B-tree
- (Diagram of a B-tree of order d = 2)
Slide 84: Search Times
- Always go to a leaf node (data entries) to retrieve the actual data.
- Root node in main memory
- Cost: (T-1)(D + 0.5RC) for a single record
- T = height of tree
- If d = 2, then R = 4 (max); cost = (3-1)(0.015 + (0.5 x 4 x 10^-7)) = 2 x 0.0150002 = 0.0300004 secs
Slide 85: Insertions
- First add into the sequence set
- Search for the node
- If there is space, add to the node
- Leave the index untouched
- If there is no space
- Try re-distributing records in sibling nodes
- A sibling node is an adjacent node with the same parent
- Or split the node and share the entries amongst the two nodes
- This will involve alteration to the index
- Insertions tend to push index entries up the tree
- If splits occur all the way up and the root node overflows, then the height of the tree T increases by one.
Slide 86: Deletions
- First delete from the sequence set
- Search for the node
- Delete the record from the node
- If the node still has at least d entries
- Leave the index untouched
- If the node has fewer than d entries
- Try re-distributing records in sibling nodes
- A sibling node is an adjacent node with the same parent
- Or merge sibling nodes
- This will involve alteration to the index
- Deletions tend to push index entries down the tree
- If merges occur all the way up and the root node underflows, then T decreases by one.