Title: CG171 - Database Implementation and Development: Lecture 1
Slide 1: CG171 - Database Implementation and Development, Lecture 1
Slide 2: Relationship to other modules
- Prerequisite: CG084/CG085 Relational Databases Theory and Practice
- Possible follow-up module: CM036 Advanced Databases
Slide 3: CG171 Part 1 (NR)
- Introduction (10%)
- Database Design vs. Information System Design
- Database Design Methodologies and CASE tools
- Physical Database Implementation Fundamentals (30%)
- Memory structure
- Data placement
- Composite data types
- Access methods
- Materialized queries
Slide 4: CG171 Part 2 (mainly AA)
- Physical Database Design (30%)
- Relational
- Object-Relational
- Object-Oriented
- Database Programming and Development (30%)
- External programming of composite data structures
- Programming stored procedures, classes, and packages
- Creating and maintaining physical database structures
Slide 5: Aims of module
- To provide essentials of database system implementation.
- To develop students' ability in designing and implementing physical database structures.
- To extend students' ability in developing information systems based on databases.
Slide 6: On completion of module
- You should be able to
- Explain and discuss the fundamentals of a database system implementation.
- Transform a logical database design into a physical one for a target DBMS.
- Evaluate effects of physical design decisions on performance and complexity of a database.
- Connect to a database and manipulate the data using a programming language (e.g. Java, C, VB) and procedural extensions to SQL (e.g. PL/SQL, Informix 4GL)
Slide 7: Module Delivery/Assessment
- Delivery
- lectures
- seminar/practical sessions
- Assessment
- examination
- group-based assignment
Slide 8: Assignment
- Will assess the students' ability to
- evaluate design alternatives,
- formulate physical design,
- implement the physical design,
- program connectivity and data manipulation using programming languages and SQL procedural extensions.
Slide 9: Reading
- Main texts
- Ramakrishnan, R. and Gehrke, J. Database Management Systems, 2nd Edition, McGraw-Hill, 2000.
- Connolly, T. and Begg, C. Database Systems, 3rd ed., Pearson, 2002.
- Others
- Muller, Robert J. Database Design for Smarties: Using UML for Data Modeling. Morgan Kaufmann Publishers, 1999.
- Oracle Concepts, Oracle Corporation.
- Oracle Application Developer's Guide, Oracle Corporation.
- Oracle Java Documentation, Oracle Corporation.
Slide 10: Basic Problem in Physical Access 1
- Long times to fetch data cause
- Slow response
- Low throughput
- User impatience
- Loss of concentration by user
- Logical design may be elegant, but for efficiency it must be coupled with effective physical design
Slide 11: Basic Problem in Physical Access 2
- Assumption may be
- A row of data is retrieved in one disk access
- This may be very far from the case
- Possible to design databases in which many thousands of disk accesses are required to retrieve an item
Slide 12: ATM Example 1
- You're in the rain outside an ATM
- You want to withdraw £50
- You type in your needs
- The system first checks your balance
- How wet do you get before the £50 arrives?
Slide 13: ATM Example 2
- Say your account number is 168234
- The table holds 4,500,000 records, each holding an account number and a balance
- The system searches serially (each record in turn)
- Might need to search 4,500,000 records if very unlucky
- Or 1 record if very lucky
- On average 2,250,000 records
- With 100 records/block this gives a range of 1-45,000 disk accesses (average 22,500)
- Each disk access is ~10 msecs (10^-2 secs)
- So elapsed time ranges from 10^-2 secs to 450 secs
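The back-of-envelope arithmetic above can be checked with a short script; the figures (4,500,000 records, 100 records/block, ~10 msecs per access) are the slide's own assumptions:

```python
# Serial-search cost for the ATM example, using the slide's figures.
records = 4_500_000
records_per_block = 100
disk_access_secs = 0.010                 # ~10 msecs per disk access

blocks = records // records_per_block    # 45,000 blocks in the file
best = 1 * disk_access_secs              # very lucky: record in the first block
worst = blocks * disk_access_secs        # very unlucky: read the whole file
average = (blocks / 2) * disk_access_secs

print(blocks, best, average, worst)      # 45000, 0.01 s, 225.0 s, 450.0 s
```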
Slide 14: ATM Example 3
- This is the time for just one operation
- Also need to record
- Amount withdrawn
- Transaction details
- New customer balance
- There are also times for network transmission.
- Customer will usually get very wet!
Slide 15: ATM Example 4
- So how can the customer keep dry?
- Could hold all of the data in the main memory of the computer
- Time for accessing a location here is ~60 nanoseconds (60 x 10^-9 secs)
- So could search all 4,500,000 records quickly
- One access/record would take only 270 msecs (0.27 secs) for the whole file
- Customer will keep very dry even with other accesses.
Slide 16: Clever disk access needed
- Problem: main memory is not available in the volume necessary for databases such as these (but it is increasing all the time)
- So need a clever disk access system
- Access methods
- often relate primary key to storage address
- Indexes
- Sorted lists of search values with their locations
- Clustering
- Place linked data close together
Slide 17: ATM Example 5
- Now put data on disk
- Assume B-tree organisation
- All records found in < 5 disk accesses
- Time is from 10 msecs (10 x 10^-3 secs) to 50 msecs (50 x 10^-3 secs)
- Even with other tasks
- Customer keeps very dry!
Slide 18: Database Design
- After producing a logical design with elegant maintainable structures
- Need to do physical design to make it run fast.
- Performance is often more important in database applications than in more general information system design
- Emphasis on number of transactions per second (throughput)
Slide 19: Database Design Methodologies
- Produce default storage schema
- May be adequate for small applications
- For large applications, much further tuning required
- Physical design is the technique
- Concepts: memory (main/disk), target disk architecture, blocks, access methods, indexing, clustering.
Slide 20: Seminar week 1
- Using the web and other resources
- Find out more about how disks work
- What do the terms seek time, rotational delay, transfer time, platter, track and cylinder mean?
- For two different manufacturers' disk-packs (say IBM mainframe, Microsoft PC) find typical magnitudes for the above features.
- Also identify typical main memory sizes for the corresponding computers.
- Suggest how the characteristics affect physical design of databases
Slide 21: Lecture 2 - File Access Principles and Access Methods
Slide 22: Aims
- Fast retrieval is usually taken as < 5 disk accesses
- Since a disk access is very long compared to other access times, the number of disk accesses is often used as an indicator of performance
- Fast placement, within 5 disk accesses
- Insertion of data, which may be in the middle of the file, not at the end
- Deleting data, by actual removal or tombstone
- Updating data, including primary key and other data
Slide 23: Retrieval/Placement
- Distinguish between actions involving primary and secondary keys
- Primary key is that determined by normalisation
- May be single or multiple attributes
- Only one per table
- Secondary keys
- Again may be single or multiple attributes
- Many per table
- Include attributes other than the primary key
- Complications such as candidate keys are omitted in this part of the course
Slide 24: Access Method
- An access method is the software responsible for storage and retrieval of data on disk
- Handles page I/O between disk and main memory
- A page is a unit of storage on disk
- Pages may be blocked so that many are retrieved in a single disk access
- Many different access methods exist
- Each has a particular technique for relating a primary key value to a page number
Slide 25: Processing of Data
- All processing by software is done in main memory
- Blocks of pages are moved
- from disk to main memory for retrieval by the user
- from main memory to disk for storage by the user
- The access method drives the retrieval/storage process
Slide 26: Cost Model
- Identify the cost of each retrieval/storage operation
- D: access time to disk to read/write a page
- seek time (time to move the head to the required cylinder)
- rotational delay (time to rotate the disk once the head is over the required track)
- transfer time (time to transfer data from disk to main memory)
- Typical value for D: 15 milliseconds (msecs), or 15 x 10^-3 secs
Slide 27: Other times
- C: average time to process a record in main memory, 100 nanoseconds (nsecs) = 100 x 10^-9 secs.
- R: number of records/page
- B: number of pages in file
- Note that D > C by roughly 10^5 times
Slide 28: Access Method I: the Heap
- Records (tuples) are held on file
- in no particular order
- with no indexing
- that is, in a heap (the Unix default file type)
- Strategy
- Insertions usually at end
- Deletions are marked by tombstones
- Searching is exhaustive from start to finish until the required record is found
Slide 29: The Heap Cost Model 1
- Cost of complete scan: B(D + RC)
- For each page, one disk access D and processing of R records taking C each.
- If R = 1000 and B = 1000 (file contains 1,000,000 records)
- Then cost = 1000 x (0.015 + (1000 x 10^-7)) = 1000 x (0.0150 + 0.0001) = 1000 x 0.0151 = 15.1 secs
- Cost of finding a particular record: B(D + RC)/2 (scan half the file on average) = 7.55 secs
- Cost of finding a range of records, e.g. student names beginning with "Sm": B(D + RC) (must search the whole file) = 15.1 secs
Slide 30: The Heap Cost Model 2
- Insertion: 2D + C
- Fetch last page (D), process record (C), write last page back again (D).
- Assumes
- all insertions at end
- system can fetch last page in one disk access
- Cost = (2 x 0.015) + 10^-7 ≈ 0.030 secs
Slide 31: The Heap Cost Model 3
- Deletions: B(D + RC)/2 + C + D
- Find the particular record (scan half the file: B(D + RC)/2), process the record on the page (C), write the page back (D).
- Record will be flagged as deleted (tombstone)
- Record will still occupy space
- If reclaiming space, potentially need to read/write many more pages
- Cost = 7.550 + 10^-7 + 0.015 ≈ 7.565 secs
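The heap cost formulas from the last three slides can be collected into one short script, using the slide's figures (D = 15 msecs, C = 100 nsecs, R = B = 1000):

```python
# Heap cost model with the slide's figures:
# D = disk access time, C = per-record CPU time, R = records/page, B = pages.
D, C, R, B = 0.015, 1e-7, 1000, 1000

scan = B * (D + R * C)       # read every page, process every record
find_one = scan / 2          # on average, scan half the file
find_range = scan            # no ordering, so the whole file must be scanned
insert = 2 * D + C           # fetch last page, add record, write it back
delete = find_one + C + D    # find the record, flag it, write the page back

print(scan, find_one, insert, delete)
```

Note how the disk term B x D dominates: the CPU term B x R x C contributes only 0.1 secs of the 15.1-second scan.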
Slide 32: Pros and Cons of Heaps
- Pros
- Cost effective where many records are processed in a single scan (can process 1,000,000 records in 15.1 secs)
- Simple access method to write and maintain
- Cons
- Very expensive for searching for single records in large files (1 record could take 15.1 secs to find)
- Expensive for operations like sorting, as there is no inherent order
Slide 33: Usage of Heaps
- Where much of the file is to be processed
- Batch mode (collect search and update requests into batches)
- Reports
- Statistical summaries
- Program source files
- Files which occupy a small number of pages
Slide 34: Access Method II: the Sorted File (Sequential File)
- Records (tuples) are held on file
- in order of the primary key
- with no indexing
- that is, in a sorted heap
- Strategy
- Insertions go in a particular page to maintain order
- Deletions are physically removed
- Searching may be exhaustive from start to finish until the required record is found, but can use binary search
Slide 35: Sorted File Cost Model 1
- Cost of complete scan: B(D + RC)
- Same as heap
- If R = 1000 and B = 1000 (file contains 1,000,000 records)
- Then cost = 1000 x (0.015 + (1000 x 10^-7)) = 1000 x (0.0150 + 0.0001) = 1000 x 0.0151 = 15.1 secs
- Cost of finding a particular record
- Can do binary search (binary chop)
- Go to the middle of the file: is the key sought lower or higher than the value here?
- If lower, go to 1/4 of the way through the file
- If higher, go to 3/4 of the way through the file
- And so on until found
Slide 36: Sorted File Cost Model 1
- Cost is D log2(B) + C log2(R) = (0.015 x 10) + (10^-7 x 10) ≈ 0.15 secs
- Much faster than with the heap, where it was 7.55 secs
- Cost of finding a range of records, e.g. student names beginning with "Sm"
- Approximately the same as the search for a particular record
- D log2(B) + C log2(R) = (0.015 x 10) + (10^-7 x 10) ≈ 0.15 secs
- Find the position in the file, then retrieve all records from the page found
- Works well only if the sort key is the same as the search key
- May need to retrieve further pages if the result is large
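The binary-search cost above (log2 of 1000 is just under 10) works out like this, again with the slide's figures:

```python
import math

# Binary-search cost on the sorted file: D * log2(B) disk accesses to
# locate the page, then C * log2(R) to binary-search within the page.
D, C, R, B = 0.015, 1e-7, 1000, 1000

search = D * math.log2(B) + C * math.log2(R)
print(round(search, 3))   # ≈ 0.149 secs, versus 7.55 secs for the heap
```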
Slide 37: Lecture 3 - CG171 -- Hashing
Slide 38: Sorted File Cost Model 2
- Insertion: D log2(B) + C log2(R) + 2(0.5B(D + RC))
- Difficult
- Find the page where the record must be held in sequence; search time is D log2(B) + C log2(R)
- Insert the record on the page in sort sequence
- Then shuffle records on this page to make room for the insertion
- Fetch all later pages to shuffle records further, requiring RC secs of processing on 0.5B pages and D secs/page to fetch
- The 0.5 comes from the assumption that half the file follows the insertion point on average
- Total for the last two steps (4-5) is 0.5B(D + RC)
Slide 39: Sorted File Cost Model 2
- Write all pages back again, so again incur the cost as in steps 4-5: 0.5B(D + RC)
- Cost = (0.015 x 10) + (10^-7 x 10) + 2(0.5 x 1000 x (0.015 + (1000 x 10^-7))) ≈ 0.15 + 15.1 ≈ 15.25 secs (long!)
- Deletions
- Cost is the same as for insertions
- Do not have the option of leaving deleted records flagged
- Records are shuffled upwards
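The insertion cost above is dominated by reading and rewriting half the file, as this restatement of the slide's arithmetic shows:

```python
import math

# Sorted-file insertion cost: find the page by binary search, then read
# and rewrite (on average) half the file to shuffle later records along.
D, C, R, B = 0.015, 1e-7, 1000, 1000

find = D * math.log2(B) + C * math.log2(R)   # ≈ 0.15 secs
shuffle = 2 * (0.5 * B * (D + R * C))        # read half the file, write it back
insert = find + shuffle
print(round(insert, 2))   # ≈ 15.25 secs
```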
Slide 40: Pros and Cons of Sorted Files
- Pros
- Cost effective where many records are processed in a single scan (can process 1,000,000 records in 15.1 secs)
- Simple access method to write and maintain
- Fairly fast for searches on the sort key (10-15 disk accesses)
- Cons
- Very expensive for insertions and deletions (except at end), as much shuffling is required
Slide 41: Usage of Sorted Files
- Where much of the file is to be processed
- Batch mode (collect search and update requests into batches)
- Reports
- Statistical summaries
- Program source files
- Searching (on the sort key) in files which occupy a relatively small number of pages
Slide 42: Hashing
- One of the big two access methods
- Very fast potentially
- One disk access only in the ideal situation
- Used in many database and more general information systems where speed is vital
- Many variants to cope with certain problems
Slide 43: Meaning of "Hash"
- Definition 3: to cut into many small pieces; mince (often fol. by "up").
- Example: He chopped up some garlic.
- Synonyms: dice, mince (1), hash (1)
- Similar words: chip, cut up, cut (vt), carve, crumble, cube, divide
- From http://www.wordsmyth.net/live/home.php?content=wotw/2001.0521/wotw_esl
Slide 44: Hash Function
- Takes the key value of the record to be stored
- Applies some function (often including a chop), delivering an integer
- This integer is a page number on the disk, so
- input is a key
- output is a page number
Slide 45: Simple Example
- Have
- B = 10 (ten pages for the disk file)
- R = 2,000 (2,000 records/page)
- Keys: S12, S27, S30, S42
- Apply the function "chop" to the keys, giving
- 12, 27, 30, 42 (the initial letter is discarded)
Slide 46: Simple Example
- Then take the remainder of dividing the chopped key by 10.
- Why divide?
- Gives an integer remainder
- Why 10?
- Output numbers run from 0 to 9
- 10 possible outputs correspond with the 10 pages for storage
- In this case, the numbers returned are
- 2, 7, 0, 2
Slide 47: Disk File (hash table)
Page | Records (only keys shown)
0 S30
1
2 S12, S42
3
4
5
6
7 S27
8
9
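The slide's toy hash scheme is a few lines of code; the function below reproduces the table above (chop the leading letter, take the remainder mod 10):

```python
# The slide's toy hash function: discard the leading letter, then take
# the remainder on division by 10, giving one of pages 0-9.
def hash_page(key, pages=10):
    return int(key[1:]) % pages

# Build the hash table from the slide's four keys.
table = {page: [] for page in range(10)}
for key in ["S12", "S27", "S30", "S42"]:
    table[hash_page(key)].append(key)

print({p: ks for p, ks in table.items() if ks})
# {0: ['S30'], 2: ['S12', 'S42'], 7: ['S27']}
```

Note that S12 and S42 collide on page 2, which is why pages must hold many records (R = 2,000 here).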
Slide 48: Retrieval
- Say the user looks for the record with key S42
- Apply the hash function to the key
- Discard the initial letter, divide by 10, take the remainder
- Gives 2
- Transfer page 2 to a buffer in main memory
- Search the buffer looking for the record with key S42
Slide 49: Cost Model
- Retrieval of a particular record
- D + 0.5RC (one disk access, plus the time taken to search half a page for the required record)
- 0.015 + (0.5 x 2000 x 10^-7) = 0.0151 secs (very fast)
- Insertion of a record
- Fetch page (D), modify in main memory (C), write back to disk (D)
- 0.015 + 10^-7 + 0.015 ≈ 0.0300 secs
- Deletions: same as insertions
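The same arithmetic in script form, with the hash example's figures (note R = 2,000 here, not 1,000 as in the heap and sorted-file examples):

```python
# Hashing cost model with the slide's figures.
D, C, R = 0.015, 1e-7, 2000

retrieve = D + 0.5 * R * C   # one disk access, then search half the page
modify = 2 * D + C           # fetch the page, change it in memory, write it back

print(retrieve, modify)      # ≈ 0.0151 secs and ≈ 0.0300 secs
```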
Slide 50: Effectiveness
- Looks very good
- Searches in one disk access
- Insertions and deletions in two disk accesses
- So
- Searching is faster than heap and sorted
- Insertions and deletions are similar to heap, much faster than sorted
Slide 51: Minor Problem
- Complete scan
- Normally do not fill pages, to leave space for new records to be inserted
- 80% initial loading
- So records occupy 1.25 times the number of pages needed if densely packed
- 1.25B(D + RC) = 1.25 x 10 x (0.015 + (2000 x 10^-7)) ≈ 0.19 secs (against 0.152 if packed densely)
Slide 52: Larger Problems
- A scan for a group of records, say S31-S37, will be very slow
- Each record will be in a different page, not in the same page.
- So 7 disk accesses, instead of 1 with a sorted file (once the page holding S31-S37 is located).
Slide 53: Larger Problems
- What happens if a page becomes full?
- This could happen if
- The hash function is poorly chosen, e.g. all keys end in 0 and the hash function is the remainder on division by 10
- All records go in the same page
- There are simply too many records for the initial space allocated to the file
Slide 54: Overflow Area
- Have extra pages in an overflow area to hold records
- Mark overflowed pages in the main disk area
- Retrieval now may take 2 disk accesses, to search the expected page and the overflow page.
- If there is overflow on the overflow page, retrieval may take 3 disk accesses.
- Insertions may also be slow: collisions on already-full pages.
- Performance degrades
Slide 55: At Intervals in Static Hashing
- The Database Administrator's lost weekend
- He/she comes in
- Closes the system down to external use
- Runs a utility expanding the number of pages and re-hashing all records into the new space
Slide 56: Alternatives to Static Hashing
- Automatic adjustment: Dynamic Hashing
- Extendible Hashing
- Has an index (directory) to pages which adjusts to the number of records in the system
- Linear Hashing
- Has a family of hash functions
Slide 57: Pros and Cons of Hashing
- Pros
- Very fast for searches on the search key (may be 1 disk access)
- Very fast for insertions and deletions (often 2 disk accesses)
- No indexes to search/maintain (in most variants)
- Cons
- Slightly slower than sorted files for a scan of the complete file
- Requires periodic off-line maintenance in static hashing, as pages become full and collisions occur
- Poor for range searches
Slide 58: Usage of Hashing
- Applications involving
- Searching (on key) in files of any size for single records: very fast
- Insertions and deletions of single records
- So typical in On-line Transaction Processing (OLTP)
Slide 59: Lecture 4
Slide 60: Indexed Sequential
- Can see that (static) hashing is
- very fast for single-record retrieval
- very slow for fetching ranges
- So not an all-round access method
- What we want is an access method which has
- Sequential access AND
- Indexed access
Slide 61: Concept of Indexes
- An auxiliary storage structure
- possibly a separate file
- or a partition within the data file
- May be loaded into main memory if not too large
- Comprises in general a set of pairs of values, each pair being
- a key value
- a pointer to a page (page number or page address -- perhaps cylinder and track number)
Slide 62: Concept of Indexes
- Selection of key values may be
- dense -- every key value is included in the index
- sparse -- only one key value per page is included in the index
- Key values may be
- primary key (attributes comprising the primary key) -- primary index
- secondary key (all other attributes) -- secondary index
Slide 63: Concept of Indexes
- Primary and secondary keys may be
- simple (single attribute)
- composite (multiple attributes)
- Order of values in the file compared to those in the index may be
- clustered (the same order or very close to it)
- unclustered (random order)
Slide 64: Concept of Indexes
- Shape of index may be
- flat (all entries at one level)
- tree-structured (the index has a tree structure)
- As updates take place, the index may be
- static (does not adjust automatically)
- dynamic (adjusts automatically)
Slide 65: Use of Indexes
- Indexes are extensively used in
- access methods
- database performance optimisation in joins and other operations
Slide 66: ISAM
- Indexed Sequential Access Method
- Has records in sequence, as in a sorted file
- Has records indexed at intervals
- in a separate data structure, as pairs of values
- perhaps the highest key found on each page
- and a page number
- In effect a sorted file with records indexed at intervals
Slide 67: ISAM
- In terms of previous terminology, ISAM is
- tree-structured
- sparse
- clustered
- static
- either primary or secondary keys
- either simple or composite keys
Slide 68: ISAM Node
- Each node in the index tree is of the form P0, K1, P1, K2, P2, ... (P is a pointer, K is a key value)
- In each node there is always one more pointer than key value
Slide 69: Sorted file is underneath the index
- (Diagram: an index node pointing to data pages 0, 1 and 2)
- Indicates that page 0 contains records with keys < 20, page 1 contains records with keys from 20 to 49, and page 2 contains records with keys from 50 upwards
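The sparse index in the example can be sketched in a few lines; the page boundaries (20 and 50) are the slide's, and the binary search over boundary keys stands in for the index lookup:

```python
import bisect

# Sketch of the slide's sparse ISAM-style index: page 0 holds keys
# below 20, page 1 holds keys 20-49, page 2 holds keys from 50 upwards.
boundaries = [20, 50]   # lowest key stored on pages 1 and 2

def page_for(key):
    # The number of boundaries <= key gives the data page to fetch.
    return bisect.bisect_right(boundaries, key)

print([page_for(k) for k in (7, 20, 49, 99)])   # [0, 1, 1, 2]
```

A real ISAM index holds one such (key, page) list per index node, forming a tree; the idea per node is the same.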
Slide 70: Lecture 5
Slide 71: Verdict on ISAM
- Static index
- Unchanged by data changes
- Becomes out of date
- May have intense local growth
- Non-random addition of key values
- So need overflow pages
- Requiring extra disk accesses for search/update
Slide 72: Verdict on ISAM
- Performance
- Much depends on where the index is held
- Main memory makes searches very fast
- On disk, a deep tree will involve a number of accesses
- Good for range and whole-file processing
- Just process the sequential set
- Good for single-record retrieval
- If the index is in main memory (not always possible)
- And there are not many overflow records
- Concurrency -- no locking problems on the index, as it is not updated
Slide 73: Lost weekend
- Once the index is too much out of step with the data
- The DBA schedules a weekend slot for
- Taking the database offline
- A dump of the database
- A re-load into the new ISAM format
- Problems of a static approach
Slide 74: Usage
- Where the following three conditions are satisfied
- data is fairly static
- sequential access is needed (whole file and ranges)
- single-record (indexed) access is required
- If condition 1 is not met, use a B-tree
Slide 75: B-trees
- What does the B stand for?
- Balanced, Bushy or Bayer (apparently not clear)
- Balanced means the distance from the root to a leaf node is the same for all of the tree
- Bushy is a gardening term meaning all strands are of similar length
- Bayer is a person who wrote a seminal paper on them
Slide 76: B-Trees
- There are some variants on B-trees.
- We deal here with B+-trees, where all data entries are held in the leaf nodes of the tree
- The two terms are interchangeable for our purposes.
- B-trees are dynamic index structures
- The tree automatically adjusts as the data changes
Slide 77: B-tree
- A B-tree is a
- Multiway tree (fan-out or branching factor > 2; a binary tree has fan-out 2) with
- Efficient self-balancing operations
- Minimum node occupancy is 50%
- Typical node occupancy can be greater than this, but initially keep it around 60%
Slide 78: B-tree Diagram: index + sequence set
- (Diagram: a B-tree of order d = 2, with the index set -- internal structure as in the root -- above, and the data entries in the leaf pages below)
Slide 79: Structure
- Index set (tree, other than leaf pages) is sparse
- Contains key-pointer pairs (as in ISAM)
- Not all keys included
- Data set (leaf pages) is dense
- All entries included
- Data held is key + non-key data
- Pages are connected by doubly linked lists
- Can navigate in either direction
- Pages are ideally in sequence, but this is not a rule
- No child pointers are held at this level
Slide 80: Parameters
- d = 2 says that the order of the tree is 2
- Each non-terminal node contains between d and 2d index entries (except the root node: 1 to 2d entries)
- Each leaf node contains between d and 2d data entries
- So the tree shown can hold 2, 3 or 4 index values in each node
- How many index pointers?
- Format is as for ISAM, so always one more pointer than value: 3, 4 or 5 pointers per node.
Slide 81: Capacity of Tree
- d = 2
- One root node can hold 2d = 4 entries
- Next level: 5 pointers from the root, each node holding a maximum of 4 entries, gives 20 entries in 5 nodes
- Next level: 5 pointers from each of the 5 nodes above, each node holding a maximum of 4 records, gives 100 records in the leaf nodes
- In practice will not pack as closely
- But with d = 2, 3 levels potentially address 100 records
- If all held on disk, 3 disk accesses
Slide 82: Capacity of Tree
- High branching factors (fan-out) increase capacity dramatically (see seminar).
- So a tree structure with a high branching factor can be kept to a small number of levels
- Bushy trees (heavily pruned!) mean fewer disk accesses
- The root node at least will normally be held in main memory
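The capacity calculation from the previous slide generalises to any order and height; the function below reproduces the slide's d = 2, 3-level figure and shows how quickly capacity grows with fan-out:

```python
# Maximum leaf capacity of a B-tree of order d with a given number of
# levels: each node holds up to 2d entries, so has up to 2d + 1 children.
def max_leaf_records(d, levels):
    fan_out = 2 * d + 1
    leaves = fan_out ** (levels - 1)   # nodes at the leaf level
    return leaves * 2 * d              # each leaf holds up to 2d records

print(max_leaf_records(2, 3))     # 100, as on the slide
print(max_leaf_records(100, 3))   # over 8 million with a realistic fan-out
```

With disk-sized pages, d is typically in the hundreds, which is why three or four levels suffice for very large files.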
Slide 83: Example of a B-tree
- (Diagram of a B-tree of order d = 2)
Slide 84: Search Times
- Always go to a leaf node (data entries) to retrieve the actual data.
- Root node in main memory
- Cost: (T-1)(D + 0.5RC) for a single record
- T = height of tree
- If d = 2, then R = 4 (max); cost = (3-1)(0.015 + (0.5 x 4 x 10^-7)) = 2 x 0.0150002 = 0.0300004 secs
Slide 85: Insertions
- First add into the sequence set
- Search for the node
- If there is space, add to the node
- Leave the index untouched
- If there is no space
- Try re-distributing records in sibling nodes
- A sibling node is an adjacent node with the same parent
- Or split the node and share the entries amongst the two nodes
- This will involve alteration to the index
- Insertions tend to push index entries up the tree
- If splits occur all the way up and the root node overflows, then the height of the tree T increases by one.
Slide 86: Deletions
- First delete from the sequence set
- Search for the node
- Delete the record from the node
- If the node still has at least d entries
- Leave the index untouched
- If the node has fewer than d entries
- Try re-distributing records in sibling nodes
- A sibling node is an adjacent node with the same parent
- Or merge sibling nodes
- This will involve alteration to the index
- Deletions tend to push index entries down the tree
- If merges occur all the way up and the root node underflows, then T decreases by one.