CS 728 Advanced Database Systems Chapter 16 - PowerPoint PPT Presentation

About This Presentation
Title:

CS 728 Advanced Database Systems Chapter 16

Description:

CS 728 Advanced Database Systems Chapter 16 Database File Organization: Unordered, Ordered, and Hashed Files of Records Disk I/O Model of Computation Disk I/O is ... – PowerPoint PPT presentation

Number of Views:177
Avg rating:3.0/5.0
Slides: 53
Provided by: eduj1
Category:

less

Transcript and Presenter's Notes

Title: CS 728 Advanced Database Systems Chapter 16


1
CS 728 Advanced Database Systems Chapter 16
  • Database File Organization
  • Unordered, Ordered, and Hashed Files of Records

2
Disk I/O Model of Computation
  • Disk I/O is equivalent to one read or write
    operation of a single block
  • It is very expensive compared with what is likely
    to be done once the block gets in main memory
  • one random disk I/O about 1,000,000 machine
    instructions in terms of time
  • Cost for computation that requires secondary
    storage is computed only by disk I/Os.

3
Disk Storage Devices
  • Preferred secondary storage device for high
    storage capacity and low cost.
  • Data stored as magnetized areas on magnetic disk
    surfaces.
  • A disk pack contains several magnetic disks
    connected to a rotating spindle.
  • Disks are divided into concentric circular tracks
    on each disk surface.
  • Track capacities vary typically from 4 to 50
    Kbytes or more

4
Disk Storage Devices (cont.)
  • A track is divided into smaller blocks or sectors
  • because it usually contains a large amount of
    information
  • The division of a track into sectors is
    hard-coded on the disk surface and cannot be
    changed.
  • One type of sector organization calls a portion
    of a track that subtends a fixed angle at the
    center as a sector.
  • A track is divided into blocks.
  • The block size B is fixed for each system.
  • Typical block sizes range from B512 bytes to
    B4096 bytes.
  • Whole blocks are transferred between disk and
    main memory for processing.

5
Disk Storage Devices (cont.)
6
Disk Storage Devices (cont.)
  • A read-write head moves to the track that
    contains the block to be transferred.
  • Disk rotation moves the block under the
    read-write head for reading or writing.
  • A physical disk block (hardware) address consists
    of
  • a cylinder number (imaginary collection of tracks
    of same radius from all recorded surfaces)
  • the track number or surface number (within the
    cylinder)
  • and block number (within track).
  • Reading or writing a disk block is time consuming
    because of the seek time s and rotational delay
    (latency) rd.
  • Double buffering can be used to speed up the
    transfer of contiguous disk blocks.

7
Disk Storage Devices (cont.)
8
Typical Disk Parameters
9
Pages and Blocks
  • Data files decomposed into pages (blocks)
  • fixed size piece of contiguous information in the
    file
  • sizes range from 512 bytes to several kilobytes
  • block is the smallest unit for transferring data
    between the main memory and the disk.
  • Address of a page (block)
  • (cylinder, track (within cylinder), sector
    (within track)

10
Pages and Blocks
11
Page I/O
  • Page I/O --- one page I/O is the cost (or time
    needed) to transfer one page of data between the
    memory and the disk.
  • The cost of a (random) page I/O
  • seek time rotational delay block transfer
    time
  • Seek time
  • time needed to position read/write head on
    correct track.
  • Rotational delay (latency)
  • time needed to rotate the beginning of page under
    read/write head.
  • Block transfer time
  • time needed to transfer data in the page/block.

12
Page I/O
  • Average rotational delay (rd)
  • rd ½ (1/p) min (601000)/(2p) msec
  • OR
  • rd ½ cost of 1 revolution
  • ½ (601000/p) msec
  • where
  • p is speed of disk rotation (how many revolutions
    per minute - rpm)
  • Example
  • Speed of disk rotatioon is p 3600 rpm
  • 60 revolutions/sec
  • 1 rev. 16.66 msec. (1 second 1000 msec)
  • rd 8.33 ms

13
Page I/O
  • Transfer rate (tr)
  • tr track size / cost of one revolution
  • track size / (601000/p) in msec
  • Bulk transfer rate (btr)
  • btr (B/(BG)) tr bytes/msec
  • Where B is the block size in bytes
  • G is interblock gap size in bytes
  • Block transfer time (btt)
  • btt B / tr not taking into acount G
  • btt B / btr taking into acount G

14
Page I/O
  • Example
  • Track size 50 KB and p 3600 rpm
  • Block size B 3KB 3000 bytes
  • tr (501000)/(601000/3600) 3000 bytes/msec
  • btt B / tr 3000/3000 1 msec

15
Page I/O
  • Average time for reading/writing n consecutive
    pages that are in the same track or cylinder s
    rd n btt
  • Average time for reading/writing consecutively n
    noncontigues pages/blocks that are in the same
    cylinder s n (rd btt)

16
An Example
  • A hard disk specifications
  • 4 platters, 8 Surfaces, 3.5 Inch diameter
  • 213 8192 tracks/surface
  • 28 256 sectors/track
  • 29 512 bytes/sector
  • Average seek time s 25 ms
  • Rotation rate rd 3600 rpm 60 rps
  • 1 rev. 16.66 msec
  • Transfer rate
  • tr 1 KB in 0.117 ms
  • tr 1 KB in 0.130 ms with gap

17
An Example
  • What is the total capacity of this disk
  • 8 GB (82132829233)
  • How many bytes does one track hold?
  • 256 sectors/track512 bytes/sector 128KB
  • How many blocks per track?
  • one block 4096 bytes 8 sectors (4096/512)
  • 256/8 32 blocks/track

18
An Example
  • How long does it take to access one block?
  • One block 4096 bytes
  • 8 sectors 4096/512
  • Rotation rate r
  • 1 rev. 16.66 msec.
  • Time to access 1 sector (s r/2
    tr/(secters/KB)
  • 25 (16.66/2) .117/2 33.3885 ms.
  • time to access 1 block
  • time to access the first sector of the block
    time to access the subsequent 7 sectors.

19
An Example
  • T 25 (16.66/2) (0.117/2) 1 (0.13/2) 7
  • 33.3885 0.455 ms 33.8435ms
  • Compare to one sector access time 33.3885 ms

20
Buffering
  • A buffer
  • is a contiguous reserved area in main memory
    available for storage of copies of disk blocks.
  • to speed up the processes.
  • For a read command
  • the block from disk is copied into the buffer.
  • For a write command
  • the contents of the buffer are copied into the
    disk.

21
Accessing Data Through RAM Buffer
22
File Organization
  • The database is stored as a collection of files.
  • Each file is a sequence of records.
  • A record is a sequence of fields.
  • Records are stored on disk blocks.
  • A file can have fixed-length records or
    variable-length records.

23
File Organization
  • Fixed length records
  • Each record is of fixed length. Pad with spaces.
  • Variable length records
  • different records in the file have different
    sizes.
  • Arise in database systems in several ways
  • different record types in a file.
  • same record type with (variable-length fields,
    repeating field, or optional fields)

24
File Organization
25
Fixed-Length Records
  • Insertion
  • Store record i starting from byte n ? (i 1),
    where n is the size of each record.
  • Deletion of record i
  • Packed format
  • move records i 1, . . ., n to i, . . . , n 1
  • OR
  • move record n to i
  • Unpacked format (do not move records, but)
  • link all free records on a free list
  • OR
  • Use bitmap vector

26
Free Lists
  • Store the address of the first deleted record in
    the file header.
  • Use this first record to store the address of the
    second deleted record, and so on.

27
Page Formats Fixed Length Records
  • Record id ltpage id, slot gt.

28
Variable-Length Records Representation
  • Byte-String representation
  • Attach an end-of-record (?) control character to
    the end of each record
  • Difficulty with deletion and growth
  • Slotted-page header contains
  • number of record entries
  • location and size of each record
  • end of free space in the block

29
Slotted Page Structure
  • Records can be moved around within a page to keep
    them contiguous with no empty space between them
  • entry in the header must be updated.
  • Pointers should not point directly to record -
    instead they should point to the entry for the
    record in header.

30
Fixed-Length Representation
  • Reserved Space
  • can use fixed-length records of a known maximum
    length
  • unused space in shorter records filled with a
    null or end-of-record symbol.

31
Fixed-Length Representation
  • List Representation by Pointers
  • A variable-length record is represented by a list
    of fixed-length records, chained together via
    pointers.
  • Can be used even if the maximum record length is
    not known

32
Fixed-Length Representation
  • Disadvantage space is wasted in all records
    except the first in a a chain.
  • Solution is to allow two kinds of block in file
  • Anchor block contains the first records of chain
  • Overflow block contains records other than those
    that are the first records of chairs.

33
Blocking Factor
  • Blocking Factor (bfr) - the number of records
    that can fit into a single block.
  • bfr ?B/R?
  • B Block size in bytes
  • R Record size in bytes
  • Example
  • Record size R 100 bytes
  • Block Size B 2,000 bytes
  • Thus the blocking factor bfr 2000/100 20
  • The number of blocks b needed to store a file of
    r records
  • b ?r/bfr? blocks

34
Spanned Unspanned Records
  • A block is the unit of data transfer between disk
    and memory.
  • Unspanned records
  • A record is found in one and only one block.
  • records do not span across block boundaries.
  • Used with fixed-length records having B ? R
  • Spanned records
  • Records are allowed to span across block
    boundaries.
  • Used with variable-length records having R ? B
  • In variable-length records, either organization
    can be used.

35
Placing File Records on Disk
  • A file header or file descriptor contains
    information about a file (e.g., the disk address,
    record format descriptions, etc.)

36
Allocating File Blocks on Disk
  • The physical disk blocks that are allocated to
    hold the records of a file can be contiguous,
    linked, or indexed.
  • In contiguous allocation, the file blocks are
    allocated to consecutive disk blocks.
  • In linked allocation, each file block contains a
    pointer to the next file block.
  • In indexed allocation, one or more index blocks
    contain pointers to the actual file blocks.

37
Organization of Records in Files
  • Heap/Unordered/Pile File Organization
  • a record can be placed anywhere in the file where
    there is space, or at the end
  • for full file scans or frequent updates
  • Data unordered (unsorted)
  • Sorted/Ordered File Organization
  • store records sorted in order, based on the value
    of the search key of each record
  • Need external sort or an index to keep sorted
  • Hashing File Organization
  • a hash function computed on some attribute of
    each record
  • the result specifies in which block of the file
    the record should be placed

38
Heap File Organization
  • Records are placed in the file in the order in
    which they are inserted. Such an organization is
    called a heap file.
  • Insertion is at the end
  • takes constant time O(1) (very efficient)
  • Searching
  • requires a linear search (expensive)
  • Deleting
  • requires a search, then delete
  • Select, Update and Delete
  • take b/2 time (linear time) in average
  • b is the number of blocks

39
Heap File Organization
  • For a file of unordered fixed-length records
    using unspanned blocks and contiguous allocation,
    it is straightforward to access any record by its
    position in the file.
  • If the records are numbered 0,1,2, , r-1 and
  • The records in each block are numbered 0,1,2, ,
    f-1, where f is the blocking factor
  • The the i-th record of the file is located in
  • Block ?i/f? and in the
  • (i mod f)-th record in that block

40
Heap File Organization
  • A Heap file allows us to retrieve records
  • by specifying the rid, or
  • by scanning all records sequentially
  • Accessing a record by its position does not help
    locate a record based on a search condition.

41
File Stored as a Heap File
666666 MGT123 F1994 4.0 123456
CS305 S1996 4.0 page 0 987654
CS305 F1995 2.0 717171 CS315
S1997 4.0 666666 EE101 S1998
3.0 page 1 765432 MAT123 S1996
2.0 515151 EE101 F1995
3.0 234567 CS305 S1999 4.0

page 2 878787 MGT123 S1996
3.0
42
Sequential File Organization
  • Suitable for applications that require sequential
    processing of the entire file
  • The records in the file are ordered by a
    search-key

43
Files of Ordered Records
  • Some blocks of an ordered (sequential) file of
    EMPLOYEE records with NAME as the ordering key
    field.

44
File Stored as a Sorted File
111111 MGT123 F1994 4.0 111111
CS305 S1996 4.0 page 0 123456
CS305 F1995 2.0 123456 CS315
S1997 4.0 123456 EE101 S1998
3.0 page 1 232323 MAT123 S1996
2.0 234567 EE101 F1995
3.0 234567 CS305 S1999 4.0

page 2 313131 MGT123 S1996
3.0
45
Sequential File Organization
  • Insertion is expensive
  • records must be inserted in the correct order
  • locate the position where the record is to be
    inserted
  • if there is free space insert there
  • if no free space insert the record in an overflow
    block
  • In either case, pointer chain must be updated
  • Insert takes lg(b) plus the time to re-organize
    records.
  • b is the number of blocks
  • Deletion
  • use pointer chains
  • Searching
  • very efficient (Binary search)
  • This requires lg(b) on the average

46
Sequential File Organization
47
Average Access Times
  • The following table shows the average access time
    to access a specific record for a given type of
    file

48
Hashed Files
  • Hashing for disk files is called External Hashing
  • The file blocks are divided into M equal-sized
    buckets, numbered bucket0, bucket1, ...,
    bucketM-1.
  • Typically, a bucket corresponds to one (or a
    fixed number of) disk block.
  • One of the file fields is designated to be the
    hash key of the file.
  • The record with hash key value K is stored in
    bucket i, where ih(K), and h is the hashing
    function.
  • Search is very efficient on the hash key.
  • Collisions occur when a new record hashes to a
    bucket that is already full.
  • An overflow file is kept for storing such
    records.
  • Overflow records that hash to each bucket can be
    linked together.

49
Hashed Files (cont.)
  • There are numerous methods for collision
    resolution, including the following
  • Open addressing Proceeding from the occupied
    position specified by the hash address, the
    program checks the subsequent positions in order
    until an unused (empty) position is found.
  • Chaining For this method, various overflow
    locations are kept, usually by extending the
    array with a number of overflow positions. In
    addition, a pointer field is added to each record
    location. A collision is resolved by placing the
    new record in an unused overflow location and
    setting the pointer of the occupied hash address
    location to the address of that overflow
    location.
  • Multiple hashing The program applies a second
    hash function if the first results in a
    collision. If another collision results, the
    program uses open addressing or applies a third
    hash function and then uses open addressing if
    necessary.

50
Hashed Files (cont.)
51
Hashed Files (cont.)
  • To reduce overflow records, a hash file is
    typically kept 70-80 full.
  • The hash function h should distribute the records
    uniformly among the buckets
  • Otherwise, search time will be increased because
    many overflow records will exist.
  • Main disadvantages of static external hashing
  • Fixed number of buckets M is a problem if the
    number of records in the file grows or shrinks.
  • Ordered access on the hash key is quite
    inefficient (requires sorting the records).

52
Hashed Files - Overflow Handling
Write a Comment
User Comments (0)
About PowerShow.com