CG171 - Database Implementation and Development: Lecture 1 - PowerPoint PPT Presentation

1 / 86
About This Presentation
Title:

CG171 - Database Implementation and Development: Lecture 1

Description:

Title: PowerPoint Presentation Author: CGNR1 Last modified by: CGNR1 Created Date: 1/24/2003 2:41:47 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Number of Views:227
Avg rating:3.0/5.0
Slides: 87
Provided by: CGN6
Category:

less

Transcript and Presenter's Notes

Title: CG171 - Database Implementation and Development: Lecture 1


1
CG171 - Database Implementation and Development
Lecture 1
  • Nick Rossiter

2
Relationship to other modules
  • Prerequisite
  • CG084/CG085 Relational Databases Theory and
    Practice
  • Possible follow-up module
  • CM036 Advanced Databases

3
CG171 Part 1 (NR)
  • Introduction (10 )
  • Database Design Vs. Information System Design
  • Database Design Methodologies and CASE tools
  • Physical Database Implementation Fundamentals (30
    )
  • Memory structure
  • Data placement
  • Composite Data types
  • Access methods
  • Materialized queries

4
CG171 Part 2 (mainly AA)
  • Physical Database Design (30 )
  • Relational
  • Object-Relational
  • Object-Oriented
  • Database Programming and Development (30 )
  • External programming of composite data structures
  • Programming stored procedures, classes, and
    packages
  • Creating and maintaining physical database
    structures

5
Aims of module
  • To provide essentials of database system
    implementation.
  • To develop students' ability in designing and
    implementing physical database structures.
  • To extend students' ability in developing
    information systems based on databases.

6
On completion of module
  • You should be able to
  • Explain and discuss the fundamentals of a
    database system implementation.
  • Transform a logical database design into a
    physical one for a target DBMS.
  • Evaluate effects of physical design decisions on
    performance and complexity of a database.
  • Connect to a database and manipulate the data
    using a programming language (e.g. Java, C, VB)
    and procedural extensions to SQL (e.g. PL/SQL,
    Informix 4GL)

7
Module Delivery/Assessment
  • Delivery
  • lectures
  • seminar/practical sessions
  • Assessment
  • examination
  • group-based assignment

8
Assignment
  • Will assess the students' ability to
  • evaluate design alternatives,
  • formulate physical design,
  • implement the physical design,
  • program connectivity and data manipulation using
    programming languages and SQL procedural
    extensions.

9
Reading
  • Main texts
  • Ramakrishnan, R and Gehrke, J. Database
    Management Systems, McGraw Hill, 2nd Edition,
    2000.
  • Connolly, T and Begg, C. Database Systems, 3rd
    ed. Pearson, 2002.
  • Others
  • Robert J. Muller. Database Design for Smarties
    using UML for Data Modeling. Morgan Kaufmann
    Publishers, 1999.
  • Oracle Concepts, Oracle Corporation.
  • Oracle Application Developer's Guide, Oracle
    Corporation.
  • Oracle Java Documentation, Oracle Corporation.

10
Basic Problem in Physical Access 1
  • Long times to fetch data cause
  • Slow response
  • Low throughput
  • User impatience
  • Loss of concentration by user
  • Logical design may be elegant but must be coupled
    for efficiency with
  • effective physical design

11
Basic Problem in Physical Access 2
  • Assumption may be
  • A row of data is retrieved in one disk access
  • This may be very far from the case
  • Possible to design databases in which many 1,000s
    of disk accesses are required to retrieve an item

12
ATM Example 1
  • Youre in the rain outside an ATM
  • You want to withdraw 50
  • You type in your needs
  • The system first checks your balance
  • How wet do you get before the 50 arrives?

13
ATM Example 2
  • Say your account number is 168234
  • The table holds 4,500,000 records, each holding
    an account number and a balance
  • The system searches serially (each record in
    turn)
  • Might need to search 4,500,000 records if very
    unlucky
  • Or 1 record if very lucky
  • On average 2,250,000 records
  • With 100 records/block this gives range 1-45,000
    disk accesses (average 22,500)
  • Each disk access c10 msecs (10-2 secs)
  • So minimum elapsed time from 10-2 to 450 secs

14
ATM Example 3
  • This is time for just one operation
  • Also need to record
  • Amount withdrawn
  • Transaction details
  • New customer balance
  • There are also times for network transmission.
  • Customer will usually get very wet!

15
ATM Example 4
  • So how can customer keep dry?
  • Could hold all of data in main memory of computer
  • Time for accessing a location here is c60
    nanoseconds (60 x 10-9 secs)
  • So could search all 4,500,000 records quickly
  • One access/record would take only 27msecs (27 x
    10-3 secs) for the whole file
  • Customer will keep very dry even with other
    accesses.

16
Clever disk access needed
  • Problem is main memory is not available in volume
    necessary for databases such as these (but
    increasing all the time)
  • So need clever disk access system
  • Access methods
  • often relate primary key to storage address
  • Indexes
  • Sorted lists of search values with their
    locations
  • Clustering
  • Place linked data close together

17
ATM Example 5
  • Now put data on disk
  • Assume B-tree organisation
  • All records found in lt5 disk accesses
  • Time is from 10 msecs (10 x 10-3 secs) to 50
    msecs (50 x 10-3 secs)
  • Even with other tasks
  • Customer keeps very dry!

18
Database Design
  • After producing logical design with elegant
    maintainable structures
  • Need to do physical design to make it run fast.
  • Performance is often more important in database
    applications than in more general information
    system design
  • Emphasis on number of transactions per second
    (throughput)

19
Database Design Methodologies
  • Produce default storage schema
  • May be adequate for small applications
  • For large applications, much further tuning
    required
  • Physical design is the technique
  • Concepts memory (main/disk), target disk
    architecture, blocks, access methods, indexing,
    clustering.

20
Seminar week 1
  • Using the web and other resources
  • Find out more about how disks work
  • What do the terms seek time, rotational delay,
    transfer time, platter, track and cylinder mean?
  • For two different manufacturers disk-packs (say
    IBM mainframe, Microsoft PC) find typical
    magnitudes for the above features.
  • Also identify typical main memory sizes for the
    corresponding computers.
  • Suggest how the characteristics affect physical
    design of databases

21
Lecture 2 File Access Principles and Access
Methods
  • Nick Rossiter

22
Aims
  • Fast retrieval usually taken as lt 5 disk
    accesses
  • Since disk access is very long compared to other
    access times, number of disk accesses is often
    used as indicator of performance
  • Fast placement within 5 disk accesses
  • Insertion of data, may be in middle of file not
    at end
  • Deleting data, actual removal or tombstone
  • Updating data, including primary key and other
    data

23
Retrieval/Placement
  • Distinguish between actions involving primary and
    secondary keys
  • Primary key is that determined by normalisation
  • May be single or multiple attributes
  • Only one per table
  • Secondary keys
  • Again may be single or multiple attributes
  • Many per table
  • Include attributes other than the primary key
  • Complications such as candidate keys are omitted
    in this part of the course

24
Access Method
  • An access method is the software responsible for
    storage and retrieval of data on disk
  • Handles page I/O between disk and main memory
  • A page is a unit of storage on disk
  • Pages may be blocked so that many are retrieved
    in a single disk access
  • Many different access methods exist
  • Each has a particular technique for relating a
    primary key value to a page number

25
Processing of Data
  • All processing by software is done in main memory
  • Blocks of pages are moved
  • from disk to main memory for retrieval by user
  • from main memory to disk for storage by user
  • Access method drives the retrieval/storage
    process

26
Cost Model
  • Identify cost of each retrieval/storage operation
  • Access time to disk to read/write page D
  • seek time (time to move head to required
    cylinder)
  • rotational delay (time to rotate disk once the
    head is over the required track)
  • transfer time (time to transfer data from disk
    to main memory)
  • Typical value for D 15 milliseconds (msecs) or
  • 15 x 10-3 secs

27
Other times
  • C average time to process a record in main
    memory 100 nanoseconds (nsecs) 100 x 10-9
    secs.
  • R number of records/page
  • B number of pages in file
  • Note that D gt C by roughly 105 times

28
Access Method I the Heap
  • Records (tuples) are held on file
  • in no particular order
  • with no indexing
  • that is in a heap unix default file type
  • Strategy
  • Insertions usually at end
  • Deletions are marked by tombstones
  • Searching is exhaustive from start to finish
    until required record found

29
The Heap Cost Model 1
  • Cost of complete scan B(DRC)
  • For each page, one disk access D and process of R
    records taking C each.
  • If R1000, B1000 (file contains 1,000,000
    records)
  • Then cost 1000(0.015(100010-7))
    1000(0.01500.0001) 1000(0.0151) 15.1 secs
  • Cost for finding particular record B(DRC)/2
    (scan half file on average) 7.55 secs
  • Cost for finding a range of records e.g. student
    names beginning with sm B(DRC) (must search
    whole file) 15.1 secs

30
The Heap Cost Model 2
  • Insertion 2D C
  • Fetch last page (D), process record (C ), write
    last page back again (D).
  • Assumes
  • all insertions at end
  • system can fetch last page in one disk access
  • Cost (20.015)10-7 ? 0.030 secs

31
The Heap Cost Model 3
  • Deletions B(DRC)/2 C D
  • Find particular record (scan half file -
    B(DRC)/2), process record on page (C ), write
    page back (D).
  • Record will be flagged as deleted (tombstone)
  • Record will still occupy space
  • If reclaim space need potentially to read/write
    many more pages
  • Cost 7.550 10-7 0.015 ? 7.565 secs

32
Pros and Cons of Heaps
  • Pros
  • Cost effective where many records processed in a
    single scan (can process 1,000,000 records in
    15.1 secs)
  • Simple access method to write and maintain
  • Cons
  • Very expensive for searching for single records
    in large files (1 record could take 15.1 secs to
    find)
  • Expensive for operations like sorting as no
    inherent order

33
Usage of Heaps
  • Where much of the file is to be processed
  • Batch mode (collect search and update requests
    into batches)
  • Reports
  • Statistical summaries
  • Program source files
  • Files which occupy a small number of pages

34
Access Method II the Sorted File (Sequential
File)
  • Records (tuples) are held on file
  • in order of the primary key
  • with no indexing
  • that is in a sorted heap
  • Strategy
  • Insertions in a particular page to maintain order
  • Deletions are physically removed
  • Searching may be exhaustive from start to finish
    until required record found but can use binary
    search

35
Sorted File Cost Model 1
  • Cost of complete scan B(DRC)
  • Same as heap
  • If R1000, B1000 (file contains 1000000 records)
  • Then cost 1000(0.015(100010-7))
    1000(0.01500.0001) 1000(0.0151) 15.1 secs
  • Cost for finding particular record
  • Can do binary search (binary chop)
  • Go to middle of file is key sought lower or
    higher than value here?
  • If lower, go to 1/4 of way through file
  • If higher, go to 3/4 of way through file
  • And so on until found

36
Sorted File Cost Model 1
  • Cost is D log 2B C log 2R 0.01510
  • 10-710 ? 0.15 secs
  • Much faster than with heap where 7.55 secs
  • Cost for finding a range of records e.g. student
    names beginning with sm
  • Approximately same as for search for a particular
    record
  • D log 2B C log 2R 0.01510 10-710 ? 0.15
    secs
  • Find position in file then retrieve all records
    from the page found
  • Works well only if sort key is same as search key
  • May need to retrieve further pages if result
    large

37
Lecture 3 CG171 --Hashing
  • Nick Rossiter

38
Sorted File Cost Model 2
  • Insertion D log 2B C log 2R 2(0.5B(DRC))
  • Difficult
  • Find page where record must be held in sequence
    search time is D log 2B C log 2R
  • Insert record on page in sort sequence
  • Then shuffle records on this page to make room
    for insertion
  • Fetch all later pages to shuffle records further
    requiring RC secs processing on 0.5B pages and D
    secs/page fetch
  • 0.5 from assumption that half the file needs to
    be searched on average
  • Total for last two steps (4-5) is 0.5B(DRC)

39
Sorted File Cost Model 2
  • Write all pages back again. So again incur the
    cost as in steps 4-5 0.5B(DRC)
  • Cost 0.01510 10-710
  • 2(0.5 1000(0.015 1000 10-7 ))
  • ? 0.15 15.1 ? 15.25 secs (long!)
  • Deletions
  • Cost is the same as for insertions
  • Do not have option of leaving deleted records
    flagged
  • Records are shuffled upwards

40
Pros and Cons of Sorted Files
  • Pros
  • Cost effective where many records processed in a
    single scan (can process 1,000,000 records in
    15.1 secs)
  • Simple access method to write and maintain
  • Fairly fast for searches on search key (10-15
    disk accesses)
  • Cons
  • Very expensive for insertions and deletions
    (except at end) as much shuffling required

41
Usage of Sorted Files
  • Where much of the file is to be processed
  • Batch mode (collect search and update requests
    into batches)
  • Reports
  • Statistical summaries
  • Program source files
  • Searching (on sorted key) in files which occupy a
    relatively small number of pages

42
Hashing
  • One of the big two Access Methods
  • Very fast potentially
  • One disk access only in ideal situation
  • Used in many database and more general
    information systems
  • where speed is vital
  • Many variants to cope with certain problems

43
Meaning of Hash
  • Definition 3. to cut into many small pieces
    mince (often fol. by up).
  • Example He chopped up some garlic.
  • Synonyms dice , mince (1) , hash (1)
  • Similar Words chip1 , cut up cut (vt) ,
    carve , crumble , cube1 , divide
  • From http//www.wordsmyth.net/live/home.php?conten
    twotw/2001.0521/wotw_esl

44
Hash Function
  • Takes key value of record to be stored
  • Applies some function (often including a chop)
    delivering an integer
  • This integer is a page number on the disk. So
  • input is key
  • output is a page number

45
Simple Example
  • Have
  • B10 (ten pages for disk file)
  • R2,000 (2,000 records/page)
  • Keys S12, S27, S30, S42
  • Apply function chop to keys giving
  • 12, 27, 30, 42 so that initial letter is
    discarded

46
Simple Example
  • Then take remainder of dividing chopped key by
    10.
  • Why divide?
  • Gives integer remainder
  • Why 10?
  • Output numbers from 0 9
  • 10 possible outputs corresponds with 10 pages for
    storage
  • In this case, numbers returned are
  • 2, 7, 0, 2

47
Disk File hash table
Page Records (only keys shown)
0 S30
1
2 S12, S42
3
4
5
6
7 S27
8
9
48
Retrieval
  • Say user looks for record with key S42
  • Apply hash function to key
  • Discard initial letter, divide by 10, take
    remainder
  • Gives 2
  • Transfer page 2 to buffer in main memory
  • Search buffer looking for record with key S42

49
Cost Model
  • Retrieval of a particular record
  • D0.5RC (one disk access time taken to search
    half a page for the required record)
  • 0.015(0.5200010-7) 0.0151 secs (very fast)
  • Insertion of a record
  • Fetch page (D) Modify in main memory (C )
    Write back to disk (D)
  • 0.01510-70.015 ? 0.0300
  • Deletions same as insertions

50
Effectiveness
  • Looks very good
  • Searches in one disk access
  • Insertions and deletions in two disk accesses
  • So
  • Searching faster than heap and sorted
  • Insertions and deletions similar to heaped, much
    faster than sorted

51
Minor Problem
  • Complete scan
  • Normally do not fill pages to leave space for new
    records to be inserted
  • 80 initially loading
  • So records occupy 1.25 times number of pages if
    densely packed
  • 1.25B(DRC) 1.2510(0.015200010-7) ? 0.189
    secs (against 0.152 if packed densely)

52
Larger Problems
  • Scan for groups of records say S31-S37 will be
    very slow
  • Each record will be in different page, not in
    same page.
  • So 7 disk accesses instead of 1 with sorted file
    (once page located holding S31-S37).

53
Larger Problems
  • What happens if page becomes full?
  • This could happen if
  • Hash function poorly chosen e.g. all keys end in
    0 and hash function is a number divided by 10
  • All records go in same page
  • Simply too many records for initial space
    allocated to file

54
Overflow Area
  • Have extra pages in an overflow area to hold
    records
  • Mark overflowed pages in main disk area
  • Retrieval now may take 2 disk accesses to search
    expected page and overflow page.
  • If have overflow on overflow page, may take 3
    disk accesses for retrieval.
  • Insertions may also be slow collisions on
    already full pages.
  • Performance degrades

55
At Intervals in Static Hashing
  • The Data Base Administrators lost weekend
  • He/she comes in
  • Closes system down to external use
  • Runs a utility expanding number of pages and
    re-hashing all records into the new space

56
Alternatives to Static Hashing
  • Automatic adjustment Dynamic Hashing
  • Extendible Hashing
  • Have index (directory) to pages which adjusts to
    number of records in the system
  • Linear Hashing
  • Have family of hash functions

57
Pros and Cons of Hashing
  • Pros
  • Very fast for searches on search key (may be 1
    disk access)
  • Very fast for insertions and deletions (often 2
    disk accesses)
  • No indexes to search/maintain (in most variants)
  • Cons
  • Slightly slower than sorted files for scan of
    complete file
  • Requires periodic off-line maintenance in static
    hashing as pages become full and collisions occur
  • Poor for range searches

58
Usage of Hashing
  • Applications involving
  • Searching (on key) in files of any size for
    single records very fast
  • Insertions and deletions of single records
  • So typical in On-line Transaction Processing
    (OLTP)

59
Lecture 4
  • Nick Rossiter

60
Indexed Sequential
  • Can see that (static) hashing is
  • very fast for single-record retrieval
  • very slow for fetching ranges
  • So not an all-round access method
  • What we want is an access method which has
  • Sequential access AND
  • Indexed access

61
Concept of Indexes
  • An auxiliary storage structure
  • possibly a separate file
  • or a partition within the data file
  • May be loaded into main memory if not too large
  • Comprises in general a set of pairs of values,
    each pair
  • key value
  • pointer to a page (page number or page address --
    perhaps cylinder and track number)

62
Concept of Indexes
  • Selection of key values may be
  • dense -- every key value is included in the index
  • sparse -- only one key value per page is included
    in the index
  • Key values may be
  • primary key (attributes comprising primary key)
    -- primary index
  • secondary key (all other attributes) -- secondary
    index

63
Concept of Indexes
  • Primary and secondary keys may be
  • simple (single attribute)
  • composite (multiple attributes)
  • Order of values in the file compared to those in
    the index may be
  • clustered (the same order or very close to it)
  • unclustered (random order)

64
Concept of Indexes
  • Shape of index may be
  • flat (all entries at one level)
  • tree-structured (the index has a tree structure)
  • As updates take place, index may be
  • static (does not adjust automatically)
  • dynamic (adjusts automatically)

65
Use of Indexes
  • Indexes are extensively used in
  • access methods
  • database performance optimisation in joins and
    other operations

66
ISAM
  • Indexed Sequential Access Method
  • Has records in sequence as in sorted file
  • Has records indexed at intervals
  • in separate data structure as pairs of values
  • perhaps highest key found on each page
  • page number
  • In effect a sorted file with records indexed at
    intervals

67
ISAM
  • In terms of previous terminology, ISAM is
  • tree-structured
  • sparse
  • clustered
  • static
  • either primary or secondary keys
  • either simple or composite keys

68
ISAM Node
  • Each node in the index tree is of the form

P is a pointer K is a key value in each node
always one more pointer than key value
69
Sorted file is underneath Index
Index
Page 0
Data
Page 1
Page 2
Indicates that page 0 contains records with keys
lt 20 page 1 contains records with keys from 20
to 49 page 2 contains records with keys from 50
upwards
70
Lecture 5
  • Nick Rossiter

71
Verdict on ISAM
  • Static index
  • Unchanged by data changes
  • Becomes out of date
  • May have intense local growth
  • Non-random addition of key values
  • So need overflow pages
  • Require extra disk accessing for search/update

72
Verdict on ISAM
  • Performance
  • Much depends on where index is held
  • Main memory makes searches very fast
  • On disk deep tree will involve a number of
    accesses
  • Good for range and whole file processing
  • Just process sequential set
  • Good for single record retrieval
  • If index in main memory (not always possible)
  • Not many overflow records
  • Concurrency -- No locking problems on index as it
    is not updated

73
Lost weekend
  • Once index is too much out of step with data
  • DBA schedules a weekend slot for
  • Taking database offline
  • Dump of database
  • Re-load into new ISAM format
  • Problems of a static approach

74
Usage
  • Where the following three conditions are
    satisfied
  • data is fairly static
  • sequential access is needed (whole file and
    ranges)
  • single-record (indexed) access is required
  • If condition 1 is not met, use a B-tree

75
B-trees
  • What does the B stand for?
  • Balanced, Bushy or Bayer (apparently not clear)
  • Balanced means distance from root to leaf node is
    same for all of tree
  • Bushy is a gardening term meaning all strands of
    similar length
  • Bayer is a person who wrote a seminal paper on
    them

76
B-Trees
  • There are some variants on B-trees.
  • We deal here with B-trees where all data entries
    are held in leaf nodes of tree
  • The two terms are interchangeable for our
    purposes.
  • B-trees are dynamic index structures
  • The tree automatically adjusts as the data changes

77
B-tree
  • A B-tree is a
  • Multiway tree (fan-out or branching factor gt 2,
    binary tree has fan-out 2) with
  • Efficient self-balancing operations
  • Minimum node occupancy is 50
  • Typical node occupancy can be greater than this
    but initially keep it around 60

78
B-tree Diagram indexsequence set
d 2
Index set
Internal structure as in root
  • Sequence Set

The data entries
79
Structure
  • Index set (tree, other than leaf pages) is sparse
  • Contains key-pointer pairs (as in ISAM)
  • Not all keys included
  • Data set (leaf pages) is dense
  • All entries included
  • Data held is key non-key data
  • Pages are connected by double linked lists
  • Can navigate in either direction
  • Pages are ideally in sequence but this is not a
    rule
  • No pointers are held at this level

80
Parameters
  • d 2 says that the order of the tree is 2
  • Each non-terminal node contains between d and 2d
    index entries (except root node 12d entries)
  • Each leaf node contains between d and 2d data
    entries
  • So tree shown can hold 2, 3 or 4 index values in
    each node
  • How many index pointers?
  • Format is as for ISAM so always one more pointer
    than value. So 3, 4 or 5 pointers per node.

81
Capacity of Tree
  • d 2
  • One root node can hold 2d 4 records
  • Next level 5 pointers from root, each node
    holds maximum 4 records 20 records in 5 nodes
  • Next level 5 pointers from each of the 5 nodes
    above, each node maximum 4 records 100 records
    in leaf nodes
  • In practice will not pack as closely
  • But d2, 3-levels potentially addresses 100
    records
  • If all held on disk, 3 disk accesses

82
Capacity of Tree
  • High branching factors (fan-out) increase
    capacity dramatically (see seminar).
  • So tree structure with high branching factor can
    be kept to a small number of levels
  • Bushy trees (heavily pruned!) mean less disk
    accesses
  • Root node at least will normally be held in main
    memory

83
Example of a B-tree
Order (d) 2
84
Search Times
  • Always go to leaf node (data entries) to retrieve
    actual data.
  • Root node in main memory
  • Cost (T-1)(D0.5RC) for single record
  • T height of tree
  • If d2, then R4 (max), cost (3-1)(0.015(0.54
    10-7))
  • 20.0150002 0.0300004 secs

85
Insertions
  • First add into sequence set
  • Search for node
  • If space add to node
  • Leave index untouched
  • If no space
  • Try re-distributing records in sibling nodes
  • Sibling node is adjacent node with same parent
  • Or split node and share entries amongst the two
    nodes
  • Will involve alteration to index
  • Insertions tend to push index entries up the tree
  • If splits all the way up and root node overflows,
    then height of tree T increases by one.

86
Deletions
  • First delete from sequence set
  • Search for node
  • Delete record from node
  • If node still has at least d entries
  • Leave index untouched
  • If node has less than d entries
  • Try re-distributing records in sibling nodes
  • Sibling node is adjacent node with same parent
  • Or merge sibling nodes
  • Will involve alteration to index
  • Deletions tend to push index entries down the
    tree
  • If merges all the way up and root node
    underflows, then T decreases by one.
Write a Comment
User Comments (0)
About PowerShow.com