CS 728 Advanced Database Systems Chapter 16 presentation

About This Presentation

Transcript and Presenter's Notes

Title: CS 728 Advanced Database Systems Chapter 16

1
CS 728 Advanced Database Systems Chapter 16

Database File Organization
Unordered, Ordered, and Hashed Files of Records

2
Disk I/O Model of Computation

Disk I/O is equivalent to one read or write
operation of a single block
It is very expensive compared with what is likely
to be done once the block gets in main memory
one random disk I/O about 1,000,000 machine
instructions in terms of time
Cost for computation that requires secondary
storage is computed only by disk I/Os.

3
Disk Storage Devices

Preferred secondary storage device for high
storage capacity and low cost.
Data stored as magnetized areas on magnetic disk
surfaces.
A disk pack contains several magnetic disks
connected to a rotating spindle.
Disks are divided into concentric circular tracks
on each disk surface.
Track capacities vary typically from 4 to 50
Kbytes or more

4
Disk Storage Devices (cont.)

A track is divided into smaller blocks or sectors
because it usually contains a large amount of
information
The division of a track into sectors is
hard-coded on the disk surface and cannot be
changed.
One type of sector organization calls a portion
of a track that subtends a fixed angle at the
center as a sector.
A track is divided into blocks.
The block size B is fixed for each system.
Typical block sizes range from B512 bytes to
B4096 bytes.
Whole blocks are transferred between disk and
main memory for processing.

5
Disk Storage Devices (cont.)
6
Disk Storage Devices (cont.)

A read-write head moves to the track that
contains the block to be transferred.
Disk rotation moves the block under the
read-write head for reading or writing.
A physical disk block (hardware) address consists
of
a cylinder number (imaginary collection of tracks
of same radius from all recorded surfaces)
the track number or surface number (within the
cylinder)
and block number (within track).
Reading or writing a disk block is time consuming
because of the seek time s and rotational delay
(latency) rd.
Double buffering can be used to speed up the
transfer of contiguous disk blocks.

7
Disk Storage Devices (cont.)
8
Typical Disk Parameters
9
Pages and Blocks

Data files decomposed into pages (blocks)
fixed size piece of contiguous information in the
file
sizes range from 512 bytes to several kilobytes
block is the smallest unit for transferring data
between the main memory and the disk.
Address of a page (block)
(cylinder, track (within cylinder), sector
(within track)

10
Pages and Blocks
11
Page I/O

Page I/O --- one page I/O is the cost (or time
needed) to transfer one page of data between the
memory and the disk.
The cost of a (random) page I/O
seek time rotational delay block transfer
time
Seek time
time needed to position read/write head on
correct track.
Rotational delay (latency)
time needed to rotate the beginning of page under
read/write head.
Block transfer time
time needed to transfer data in the page/block.

12
Page I/O

Average rotational delay (rd)
rd ½ (1/p) min (601000)/(2p) msec
OR
rd ½ cost of 1 revolution
½ (601000/p) msec
where
p is speed of disk rotation (how many revolutions
per minute - rpm)
Example
Speed of disk rotatioon is p 3600 rpm
60 revolutions/sec
1 rev. 16.66 msec. (1 second 1000 msec)
rd 8.33 ms

13
Page I/O

Transfer rate (tr)
tr track size / cost of one revolution
track size / (601000/p) in msec
Bulk transfer rate (btr)
btr (B/(BG)) tr bytes/msec
Where B is the block size in bytes
G is interblock gap size in bytes
Block transfer time (btt)
btt B / tr not taking into acount G
btt B / btr taking into acount G

14
Page I/O

Example
Track size 50 KB and p 3600 rpm
Block size B 3KB 3000 bytes
tr (501000)/(601000/3600) 3000 bytes/msec
btt B / tr 3000/3000 1 msec

15
Page I/O

Average time for reading/writing n consecutive
pages that are in the same track or cylinder s
rd n btt
Average time for reading/writing consecutively n
noncontigues pages/blocks that are in the same
cylinder s n (rd btt)

16
An Example

A hard disk specifications
4 platters, 8 Surfaces, 3.5 Inch diameter
213 8192 tracks/surface
28 256 sectors/track
29 512 bytes/sector
Average seek time s 25 ms
Rotation rate rd 3600 rpm 60 rps
1 rev. 16.66 msec
Transfer rate
tr 1 KB in 0.117 ms
tr 1 KB in 0.130 ms with gap

17
An Example

What is the total capacity of this disk
8 GB (82132829233)
How many bytes does one track hold?
256 sectors/track512 bytes/sector 128KB
How many blocks per track?
one block 4096 bytes 8 sectors (4096/512)
256/8 32 blocks/track

18
An Example

How long does it take to access one block?
One block 4096 bytes
8 sectors 4096/512
Rotation rate r
1 rev. 16.66 msec.
Time to access 1 sector (s r/2
tr/(secters/KB)
25 (16.66/2) .117/2 33.3885 ms.
time to access 1 block
time to access the first sector of the block
time to access the subsequent 7 sectors.

19
An Example

T 25 (16.66/2) (0.117/2) 1 (0.13/2) 7
33.3885 0.455 ms 33.8435ms
Compare to one sector access time 33.3885 ms

20
Buffering

A buffer
is a contiguous reserved area in main memory
available for storage of copies of disk blocks.
to speed up the processes.
For a read command
the block from disk is copied into the buffer.
For a write command
the contents of the buffer are copied into the
disk.

21
Accessing Data Through RAM Buffer
22
File Organization

The database is stored as a collection of files.
Each file is a sequence of records.
A record is a sequence of fields.
Records are stored on disk blocks.
A file can have fixed-length records or
variable-length records.

23
File Organization

Fixed length records
Each record is of fixed length. Pad with spaces.
Variable length records
different records in the file have different
sizes.
Arise in database systems in several ways
different record types in a file.
same record type with (variable-length fields,
repeating field, or optional fields)

24
File Organization
25
Fixed-Length Records

Insertion
Store record i starting from byte n ? (i 1),
where n is the size of each record.
Deletion of record i
Packed format
move records i 1, . . ., n to i, . . . , n 1
OR
move record n to i
Unpacked format (do not move records, but)
link all free records on a free list
OR
Use bitmap vector

26
Free Lists

Store the address of the first deleted record in
the file header.
Use this first record to store the address of the
second deleted record, and so on.

27
Page Formats Fixed Length Records

Record id ltpage id, slot gt.

28
Variable-Length Records Representation

Byte-String representation
Attach an end-of-record (?) control character to
the end of each record
Difficulty with deletion and growth
Slotted-page header contains
number of record entries
location and size of each record
end of free space in the block

29
Slotted Page Structure

Records can be moved around within a page to keep
them contiguous with no empty space between them
entry in the header must be updated.
Pointers should not point directly to record -
instead they should point to the entry for the
record in header.

30
Fixed-Length Representation

Reserved Space
can use fixed-length records of a known maximum
length
unused space in shorter records filled with a
null or end-of-record symbol.

31
Fixed-Length Representation

List Representation by Pointers
A variable-length record is represented by a list
of fixed-length records, chained together via
pointers.
Can be used even if the maximum record length is
not known

32
Fixed-Length Representation

Disadvantage space is wasted in all records
except the first in a a chain.
Solution is to allow two kinds of block in file
Anchor block contains the first records of chain
Overflow block contains records other than those
that are the first records of chairs.

33
Blocking Factor

Blocking Factor (bfr) - the number of records
that can fit into a single block.
bfr ?B/R?
B Block size in bytes
R Record size in bytes
Example
Record size R 100 bytes
Block Size B 2,000 bytes
Thus the blocking factor bfr 2000/100 20
The number of blocks b needed to store a file of
r records
b ?r/bfr? blocks

34
Spanned Unspanned Records

A block is the unit of data transfer between disk
and memory.
Unspanned records
A record is found in one and only one block.
records do not span across block boundaries.
Used with fixed-length records having B ? R
Spanned records
Records are allowed to span across block
boundaries.
Used with variable-length records having R ? B
In variable-length records, either organization
can be used.

35
Placing File Records on Disk

A file header or file descriptor contains
information about a file (e.g., the disk address,
record format descriptions, etc.)

36
Allocating File Blocks on Disk

The physical disk blocks that are allocated to
hold the records of a file can be contiguous,
linked, or indexed.
In contiguous allocation, the file blocks are
allocated to consecutive disk blocks.
In linked allocation, each file block contains a
pointer to the next file block.
In indexed allocation, one or more index blocks
contain pointers to the actual file blocks.

37
Organization of Records in Files

Heap/Unordered/Pile File Organization
a record can be placed anywhere in the file where
there is space, or at the end
for full file scans or frequent updates
Data unordered (unsorted)
Sorted/Ordered File Organization
store records sorted in order, based on the value
of the search key of each record
Need external sort or an index to keep sorted
Hashing File Organization
a hash function computed on some attribute of
each record
the result specifies in which block of the file
the record should be placed

38
Heap File Organization

Records are placed in the file in the order in
which they are inserted. Such an organization is
called a heap file.
Insertion is at the end
takes constant time O(1) (very efficient)
Searching
requires a linear search (expensive)
Deleting
requires a search, then delete
Select, Update and Delete
take b/2 time (linear time) in average
b is the number of blocks

39
Heap File Organization

For a file of unordered fixed-length records
using unspanned blocks and contiguous allocation,
it is straightforward to access any record by its
position in the file.
If the records are numbered 0,1,2, , r-1 and
The records in each block are numbered 0,1,2, ,
f-1, where f is the blocking factor
The the i-th record of the file is located in
Block ?i/f? and in the
(i mod f)-th record in that block

40
Heap File Organization

A Heap file allows us to retrieve records
by specifying the rid, or
by scanning all records sequentially
Accessing a record by its position does not help
locate a record based on a search condition.

41
File Stored as a Heap File
666666 MGT123 F1994 4.0 123456
CS305 S1996 4.0 page 0 987654
CS305 F1995 2.0 717171 CS315
S1997 4.0 666666 EE101 S1998
3.0 page 1 765432 MAT123 S1996
2.0 515151 EE101 F1995
3.0 234567 CS305 S1999 4.0

page 2 878787 MGT123 S1996
3.0
42
Sequential File Organization

Suitable for applications that require sequential
processing of the entire file
The records in the file are ordered by a
search-key

43
Files of Ordered Records

Some blocks of an ordered (sequential) file of
EMPLOYEE records with NAME as the ordering key
field.

44
File Stored as a Sorted File
111111 MGT123 F1994 4.0 111111
CS305 S1996 4.0 page 0 123456
CS305 F1995 2.0 123456 CS315
S1997 4.0 123456 EE101 S1998
3.0 page 1 232323 MAT123 S1996
2.0 234567 EE101 F1995
3.0 234567 CS305 S1999 4.0

page 2 313131 MGT123 S1996
3.0
45
Sequential File Organization

Insertion is expensive
records must be inserted in the correct order
locate the position where the record is to be
inserted
if there is free space insert there
if no free space insert the record in an overflow
block
In either case, pointer chain must be updated
Insert takes lg(b) plus the time to re-organize
records.
b is the number of blocks
Deletion
use pointer chains
Searching
very efficient (Binary search)
This requires lg(b) on the average

46
Sequential File Organization
47
Average Access Times

The following table shows the average access time
to access a specific record for a given type of
file

48
Hashed Files

Hashing for disk files is called External Hashing
The file blocks are divided into M equal-sized
buckets, numbered bucket0, bucket1, ...,
bucketM-1.
Typically, a bucket corresponds to one (or a
fixed number of) disk block.
One of the file fields is designated to be the
hash key of the file.
The record with hash key value K is stored in
bucket i, where ih(K), and h is the hashing
function.
Search is very efficient on the hash key.
Collisions occur when a new record hashes to a
bucket that is already full.
An overflow file is kept for storing such
records.
Overflow records that hash to each bucket can be
linked together.

49
Hashed Files (cont.)

There are numerous methods for collision
resolution, including the following
Open addressing Proceeding from the occupied
position specified by the hash address, the
program checks the subsequent positions in order
until an unused (empty) position is found.
Chaining For this method, various overflow
locations are kept, usually by extending the
array with a number of overflow positions. In
addition, a pointer field is added to each record
location. A collision is resolved by placing the
new record in an unused overflow location and
setting the pointer of the occupied hash address
location to the address of that overflow
location.
Multiple hashing The program applies a second
hash function if the first results in a
collision. If another collision results, the
program uses open addressing or applies a third
hash function and then uses open addressing if
necessary.

50
Hashed Files (cont.)
51
Hashed Files (cont.)

To reduce overflow records, a hash file is
typically kept 70-80 full.
The hash function h should distribute the records
uniformly among the buckets
Otherwise, search time will be increased because
many overflow records will exist.
Main disadvantages of static external hashing
Fixed number of buckets M is a problem if the
number of records in the file grows or shrinks.
Ordered access on the hash key is quite
inefficient (requires sorting the records).

52
Hashed Files - Overflow Handling

Write a Comment

User Comments (0)

About PowerShow.com

CS 728 Advanced Database Systems Chapter 16 PowerPoint PPT Presentation