Title: Storing Data: Disks and Files
1. Storing Data: Disks and Files
"Yea, from the table of my memory I'll wipe away
all trivial fond records." -- Shakespeare, Hamlet
2. Disks and Files
- DBMS stores information on (hard) disks.
- This has major implications for DBMS design!
  - READ: transfer data from disk to main memory (RAM).
  - WRITE: transfer data from RAM to disk.
  - Both are high-cost operations relative to in-memory operations, so they must be planned carefully!
3. Why Not Store Everything in Main Memory?
- Costs too much. $1000 will buy you either 128MB of RAM or 7.5GB of disk today.
- Main memory is volatile. We want data to be saved between runs. (Obviously!)
- Typical storage hierarchy:
  - Main memory (RAM) for currently used data.
  - Disk for the main database (secondary storage).
  - Tapes for archiving older versions of the data (tertiary storage).
4. Disks
- Secondary storage device of choice.
- Main advantage over tapes: random access vs. sequential.
- Data is stored and retrieved in units called disk blocks or pages.
- Unlike RAM, time to retrieve a disk page varies depending upon location on disk.
  - Therefore, relative placement of pages on disk has a major impact on DBMS performance!
5. Components of a Disk
[Figure: disk anatomy -- spindle, platters, disk heads on an arm assembly, tracks divided into sectors]
- The platters spin (say, 90 rps).
- The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!).
- Only one head reads/writes at any one time.
- Block size is a multiple of sector size (which is fixed).
6. Accessing a Disk Page
- Time to access (read/write) a disk block:
  - seek time (moving arms to position disk head on track)
  - rotational delay (waiting for block to rotate under head)
  - transfer time (actually moving data to/from disk surface)
- Seek time and rotational delay dominate.
  - Seek time varies from about 1 to 20 msec.
  - Rotational delay varies from 0 to 10 msec.
  - Transfer rate is about 1 msec per 4KB page.
- Key to lower I/O cost: reduce seek/rotation delays! Hardware vs. software solutions?
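A back-of-the-envelope model using the slide's ballpark figures (assumed midpoints, not real device specs) shows why reducing seek/rotation delays matters: sequential pages pay those delays once, random pages pay them on every access.

```python
# Rough disk access-time model: seek + rotational delay + transfer.
# The constants are the slide's ballpark numbers, taken as averages.
AVG_SEEK_MS = 10.0         # seek time: ~1-20 msec, midpoint
AVG_ROTATION_MS = 5.0      # rotational delay: 0-10 msec, average is half a rotation
TRANSFER_MS_PER_PAGE = 1.0 # transfer: ~1 msec per 4KB page

def access_time_ms(num_pages: int, sequential: bool) -> float:
    """Estimate I/O time; sequential pages pay seek + rotation only once."""
    if sequential:
        return AVG_SEEK_MS + AVG_ROTATION_MS + num_pages * TRANSFER_MS_PER_PAGE
    return num_pages * (AVG_SEEK_MS + AVG_ROTATION_MS + TRANSFER_MS_PER_PAGE)

print(access_time_ms(10, sequential=True))   # 25.0 ms
print(access_time_ms(10, sequential=False))  # 160.0 ms
```

Under this model a 10-page sequential read is over 6x cheaper than 10 random reads, which is exactly the win that sequential page placement and pre-fetching exploit.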
7. Arranging Pages on Disk
- "Next" block concept:
  - blocks on same track, followed by
  - blocks on same cylinder, followed by
  - blocks on adjacent cylinder.
- Blocks in a file should be arranged sequentially on disk (by "next"), to minimize seek and rotational delay.
- For a sequential scan, pre-fetching several pages at a time is a big win!
8. RAID
- Disk Array: arrangement of several disks that gives the abstraction of a single, large disk.
- Goals: increase performance and reliability.
- Two main techniques:
  - Data striping: data is partitioned; the size of a partition is called the striping unit. Partitions of equal size are distributed over several disks.
  - Redundancy: more disks -> more failures. Redundant information allows reconstruction of data if a disk fails.
9. RAID Levels
- Level 0: No redundancy.
- Level 1: Mirrored (two identical copies)
  - Each disk has a mirror image (check disk).
  - Parallel reads; a write involves two disks.
  - Maximum transfer rate = transfer rate of one disk.
- Level 0+1: Striping and Mirroring
  - Parallel reads; a write involves two disks.
  - Maximum transfer rate = aggregate bandwidth.
10. RAID Levels (Contd.)
- Level 2: Error-Correcting Codes
  - Striping unit: one bit.
  - Redundancy scheme: Hamming code.
  - Keeps more redundant information than necessary.
11. RAID Levels (Contd.)
- Level 3: Bit-Interleaved Parity
  - Striping unit: one bit. One check disk.
  - Each read and write request involves all disks; the disk array can process one request at a time.
- Level 4: Block-Interleaved Parity
  - Striping unit: one disk block. One check disk.
  - Parallel reads possible for small requests; large requests can utilize full bandwidth.
  - Writes involve the modified block and the check disk.
- Level 5: Block-Interleaved Distributed Parity
  - Similar to RAID Level 4, but parity blocks are distributed over all disks.
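The parity used by levels 3-5 is plain XOR, so a sketch fits in a few lines. This toy stripe of four data blocks is hypothetical (real RAID 5 rotates the parity block across disks), but it shows why one failed disk can always be rebuilt:

```python
# XOR parity over one stripe: parity = XOR of all data blocks.
# Losing any single block, XOR-ing the survivors with the parity recovers it.
def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # 4 data blocks on 4 disks
parity = xor_blocks(*stripe)                    # stored on the check disk

# Disk holding stripe[2] fails: rebuild its block from the rest plus parity.
rebuilt = xor_blocks(stripe[0], stripe[1], stripe[3], parity)
assert rebuilt == b"CCCC"
```

Note the write cost the slides mention: updating one data block also requires recomputing and rewriting the parity block, which is why small writes are expensive in levels 3-5.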
12. Disk Space Management
- Lowest layer of DBMS software; manages space on disk.
- Supports the concept of a page as the unit of data:
  - size of page chosen as size of disk block,
  - pages stored as disk blocks,
  - so reading/writing a page can be done in one disk I/O.
- Higher levels call upon this layer to:
  - allocate/de-allocate a page,
  - read/write a page.
- A request for a sequence of pages should be satisfied by allocating the pages sequentially on disk! Higher levels don't need to know how this is done, or how free space is managed.
13. Practice: Chapter 7
- What is the most important difference between a disk and a tape?
  - Tapes are sequential devices; disks support direct access to a desired page.
- Explain the terms seek time, rotational delay, and transfer time.
  - Time to access (read/write) a disk block:
    - seek time: time to move the disk heads to the track on which the desired block is located.
    - rotational delay: time spent waiting for the block to rotate under the disk head; it is the time required for half a rotation on average, and is usually less than seek time.
    - transfer time: time to actually read/write the data once the head is positioned, i.e. the time for the disk to rotate over the block.
14. Practice: Chapter 7 (Cont.)
- Both disks and main memory support direct access to any desired location (page). On average, main memory accesses are faster, of course. What is the other important difference (from the perspective of the time required to access a desired page)?
  - The time to access a disk page is not constant; it depends on the location of the data. Accessing some data might be much faster than accessing other data. This differs from main memory, where access time is uniform for most computer systems.
- If you have a large file that is frequently scanned sequentially, explain how you would store the pages in the file on a disk.
  - The pages in the file should be stored sequentially on disk. We should put two logically adjacent pages as close as possible. In decreasing order of closeness, they could be on the same track, the same cylinder, or an adjacent cylinder.
16. Practice: Chapter 7 (Cont.)
- Consider a disk with sector size 512 bytes, 2,000 tracks per surface, 50 sectors per track, 5 double-sided platters, and average seek time of 10 msec.
- 1. What is the capacity of a track in bytes?
  - >> Bytes/track = bytes/sector x sectors/track = 512 x 50 = 25K
- 2. What is the capacity of each surface?
  - >> Bytes/surface = bytes/track x tracks/surface = 25K x 2,000 = 50,000K
- 3. What is the capacity of the disk?
  - >> Bytes/disk = bytes/surface x surfaces/disk = 50,000K x 10 = 500,000K
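These capacity figures can be checked mechanically. The sketch below assumes K = 1,024 bytes, which makes the slide's numbers exact (512 x 50 = 25,600 bytes = 25K):

```python
# Re-deriving the disk-capacity answers: 512-byte sectors, 50 sectors/track,
# 2,000 tracks/surface, 5 double-sided platters => 10 surfaces. K = 1,024 bytes.
K = 1024
bytes_per_track = 512 * 50                  # 25,600 bytes = 25K
bytes_per_surface = bytes_per_track * 2000  # 50,000K
bytes_per_disk = bytes_per_surface * 10     # 500,000K (roughly 488 MB)

assert bytes_per_track == 25 * K
assert bytes_per_surface == 50_000 * K
assert bytes_per_disk == 500_000 * K
```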
17. Practice: Chapter 7 (Cont.)
- Consider a disk with sector size 512 bytes, 2,000 tracks per surface, 50 sectors per track, 5 double-sided platters, and average seek time of 10 msec.
- 4. How many cylinders does the disk have?
  - >> The number of cylinders is the same as the number of tracks on each platter, which is 2,000.
- 5. Give examples of valid block sizes. Is 256 bytes a valid block size? 2,048? 51,200?
  - >> Block size should be a multiple of the sector size: 256 is not a valid block size, but 2,048 and 51,200 are.
18. Practice: Chapter 7 (Cont.)
- Consider a disk with sector size 512 bytes, 2,000 tracks per surface, 50 sectors per track, 5 double-sided platters, and average seek time of 10 msec.
- 6. If the disk platters rotate at 5,400 rpm (revolutions per minute), what is the maximum rotational delay?
  - >> If the disk platters rotate at 5,400 rpm, the time required for one rotation, which is the maximum rotational delay, is (1/5,400) x 60 = 0.011 seconds. The average rotational delay is half of the rotation time, 0.006 seconds.
19. Practice: Chapter 7 (Cont.)
- Consider a disk with sector size 512 bytes, 2,000 tracks per surface, 50 sectors per track, 5 double-sided platters, and average seek time of 10 msec.
- 7. Assuming that one track of data can be transferred per revolution, what is the transfer rate?
  - >> The capacity of a track is 25K bytes. Since one track of data can be transferred per revolution, the transfer rate is 25K / 0.011 = 2,250K bytes per second.
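Questions 6 and 7 can be verified the same way; carrying the exact rotation time (rather than the rounded 0.011 s) gives the 2,250K figure precisely:

```python
# Rotational delay and transfer rate for the 5,400 rpm disk.
rpm = 5400
max_rotational_delay_s = 60 / rpm                    # one full rotation: ~0.0111 s
avg_rotational_delay_s = max_rotational_delay_s / 2  # ~0.0056 s, rounds to 0.006

track_bytes = 512 * 50                               # 25K bytes per track
transfer_rate = track_bytes / max_rotational_delay_s # one track per revolution
print(round(transfer_rate / 1024))                   # 2250 (K bytes per second)
```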
20. Buffer Management in a DBMS
[Figure: buffer pool of frames in main memory; page requests arrive from higher levels; disk pages are read into free frames; choice of frame is dictated by the replacement policy]
- Data must be in RAM for the DBMS to operate on it!
- A table of <frame, pageid> pairs is maintained.
21. When a Page is Requested ...
- If the requested page is not in the pool:
  - Choose a frame for replacement.
  - If the frame is dirty, write it to disk.
  - Read the requested page into the chosen frame.
- Pin the page and return its address.
- This requires two variables per frame in the buffer manager:
  - pin_count (initially set to zero for each frame)
  - dirty (initially off for each frame)
- If requests can be predicted (e.g., sequential scans), pages can be pre-fetched several pages at a time!
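The request path above can be sketched in a few dozen lines. This is a minimal illustration, not a real DBMS API; the class and method names are hypothetical, and "disk" is just a dict of pages:

```python
# Toy buffer manager: frame table, pin_count, dirty bit, write-back on replacement.
class Frame:
    def __init__(self):
        self.page_id = None
        self.data = None
        self.pin_count = 0    # initially set to zero for each frame
        self.dirty = False    # initially off for each frame

class BufferManager:
    def __init__(self, num_frames, disk):
        self.frames = [Frame() for _ in range(num_frames)]
        self.disk = disk      # stand-in disk: page_id -> bytes
        self.table = {}       # the <frame, pageid> table: page_id -> frame

    def request_page(self, page_id):
        frame = self.table.get(page_id)
        if frame is None:                         # requested page not in pool
            frame = self._choose_victim()         # choose a frame for replacement
            if frame.dirty:                       # if frame is dirty, write it to disk
                self.disk[frame.page_id] = frame.data
            if frame.page_id is not None:
                del self.table[frame.page_id]
            frame.data = self.disk[page_id]       # read requested page into frame
            frame.page_id, frame.dirty = page_id, False
            self.table[page_id] = frame
        frame.pin_count += 1                      # pin the page and return it
        return frame

    def unpin(self, page_id, dirty):
        frame = self.table[page_id]
        frame.pin_count -= 1
        frame.dirty = frame.dirty or dirty        # requestor reports modification

    def _choose_victim(self):
        for f in self.frames:                     # simplest policy: first unpinned frame
            if f.pin_count == 0:
                return f
        raise RuntimeError("all pages pinned")
```

`_choose_victim` is where a real replacement policy (LRU, Clock, MRU) would plug in; the linear scan here is only the simplest placeholder.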
22. More on Buffer Management
- Requestor of a page must unpin it, and indicate whether the page has been modified:
  - the dirty bit is used for this.
- A page in the pool may be requested many times:
  - a pin count is used. A page is a candidate for replacement iff pin count = 0.
- CC & recovery may entail additional I/O when a frame is chosen for replacement. (Write-Ahead Log protocol; more later.)
23. Buffer Replacement Policy
- Frame is chosen for replacement by a replacement policy:
  - Least-recently-used (LRU), Clock, MRU, etc.
- Policy can have a big impact on # of I/Os; depends on the access pattern.
- Sequential flooding: nasty situation caused by LRU + repeated sequential scans.
  - # buffer frames < # pages in file means each page request causes an I/O. MRU is much better in this situation (but not in all situations, of course).
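Sequential flooding is easy to demonstrate with a small cache simulation (a sketch; the pool is modeled as an ordered dict tracking recency, not real frames):

```python
# Repeated sequential scans of an 11-page file through a 10-frame pool.
# Under LRU every single request misses; MRU misses only a few times per scan.
from collections import OrderedDict

def scan_misses(num_frames, num_pages, num_scans, policy):
    pool = OrderedDict()                       # page -> None; order tracks recency
    misses = 0
    for _ in range(num_scans):
        for page in range(num_pages):
            if page in pool:
                pool.move_to_end(page)         # hit: update recency
            else:
                misses += 1
                if len(pool) == num_frames:    # pool full: evict per policy
                    if policy == "LRU":
                        pool.popitem(last=False)   # evict least recently used
                    else:                          # "MRU"
                        pool.popitem(last=True)    # evict most recently used
                pool[page] = None
    return misses

print(scan_misses(10, 11, 5, "LRU"))   # 55: all 5 x 11 requests miss
print(scan_misses(10, 11, 5, "MRU"))   # far fewer misses
```

With one more page than there are frames, LRU evicts exactly the page the scan will need next, so every request is an I/O; MRU sacrifices one recently-read page and keeps the rest resident.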
24. DBMS vs. OS File System
- OS does disk space & buffer mgmt: why not let the OS manage these tasks?
- Differences in OS support: portability issues.
- DBMS can predict access patterns in typical DB operations,
  - because most page references are generated by higher-level operations (e.g. sequential scans or implementations of relational algebra operators),
  - and therefore it can adjust the replacement policy and pre-fetch pages based on those access patterns.
- Buffer management in a DBMS requires the ability to:
  - pin a page in the buffer pool, and force a page to disk (important for implementing CC & recovery).
25. Unordered (Heap) Files
- Simplest file structure; contains records in no particular order.
- Only guarantee: one can retrieve all records in the file by repeated requests for the next record.
- As the file grows and shrinks, disk pages are allocated and de-allocated.
- Every record in the file has a unique record id, rid, and every page in the file is of the same size.
- To support record-level operations, we must:
  - keep track of the pages in a file,
  - keep track of free space on pages,
  - keep track of the records on a page.
- Supported operations: create and destroy files; insert, delete, and get a record with a given rid; scan all records.
- There are many alternatives for keeping track of this.
26. Heap File Implemented as a List
[Figure: header page pointing to a doubly linked list of full pages and a doubly linked list of pages with free space]
- The <header page id, heap file name> pair must be stored someplace; the DBMS maintains a table of such pairs.
- Each page contains 2 pointers plus data.
27. Heap File Using a Page Directory
- The entry for a page can include the number of free bytes on the page.
- The directory is a collection of pages; a linked list implementation is just one alternative.
- The directory is much smaller than a linked list of all heap file pages!
28. Indexes
- A heap file allows us to retrieve records:
  - by specifying the rid, or
  - by scanning all records sequentially.
- Sometimes, we want to retrieve records by specifying the values in one or more fields, e.g.,
  - Find all students in the CS department
  - Find all students with a gpa > 3
  - Find all books by Asimov (-> index by Author)
  - Find "Foundation" (-> index by Title)
- Indexes are file structures that enable us to answer such value-based queries efficiently.
- Implementation: just another kind of file, containing records that "direct traffic" on requests for data records.
29. Indexes (Cont.)
- Each index has an associated search key:
  - a collection of one or more fields of the file of records for which we are building the index;
  - any subset of the fields can be a search key.
- We sometimes refer to the file of records as the indexed file.
- Each index is designed to speed up equality or range selections on its search key.
  - E.g., to build an index to improve the efficiency of queries about employees of a given age -> build the index on the age attribute of the employee dataset.
- Records stored in the index file (called entries, vs. data records) allow us to find data records with a given search key value:
  - <age, rid>, where rid identifies a data record.
30. Summary
- Disks provide cheap, non-volatile storage.
  - Unit of transfer from disk into main memory is called a block or page.
  - Blocks are arranged on tracks on several platters.
  - Random access, but cost depends on the location of the page on disk; important to arrange data sequentially to minimize seek and rotation delays.
- Buffer manager brings pages into RAM.
  - Page stays in RAM until released by requestor.
  - Written to disk when frame is chosen for replacement (which is sometime after the requestor releases the page).
  - Choice of frame to replace is based on the replacement policy.
  - Tries to pre-fetch several pages at a time.
31. Summary (Contd.)
- DBMS vs. OS file support:
  - DBMS needs features not found in many OSs, e.g., forcing a page to disk, controlling the order of page writes to disk, files spanning disks, ability to control pre-fetching and page replacement policy based on predictable access patterns, etc.
- Variable-length record format with a field offset directory offers support for direct access to the i-th field and null values.
- Slotted page format supports variable-length records and allows records to move on the page.
32. Summary (Contd.)
- File layer keeps track of pages in a file, and supports the abstraction of a collection of records.
- Pages with free space are identified using a linked list or directory structure (similar to how pages in a file are kept track of).
- Indexes support efficient retrieval of records based on the values in some fields.
- Catalog relations store information about relations, indexes, and views. (Information that is common to all records in a given collection.)
33. Practice: Chapter 7 (Cont.)
- Consider a disk with sector size 512 bytes, 2,000 tracks per surface, 50 sectors per track, 5 double-sided platters, and average seek time of 10 msec., and suppose a block size of 1,024 bytes is chosen. Suppose that a file containing 100,000 records of 100 bytes each is to be stored on such a disk, and that no record is allowed to span two blocks.
- How many records fit onto a block?
  - >> block size / bytes per record = 1,024 / 100 = 10. We can have at most 10 records in a block.
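Since no record may span two blocks, records per block is floor division, and the file-size follow-up (not asked on the slide, but implied by the setup) is a ceiling division:

```python
# Records per block and blocks needed for the 100,000-record file.
block_size, record_size, num_records = 1024, 100, 100_000
records_per_block = block_size // record_size         # floor: 10 records/block
wasted_per_block = block_size % record_size           # 24 bytes lost per block
blocks_needed = -(-num_records // records_per_block)  # ceiling division: 10,000
print(records_per_block, blocks_needed)               # 10 10000
```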
34. Chapter 7 Practice (Cont.)
- Explain what the buffer manager must do to process a read request for a page. What happens if the requested page is in the pool but not pinned?
- When a page is requested, the buffer manager does the following:
  - The buffer pool is checked to see if it contains the requested page. If the page is not in the pool, it is brought in as follows:
    - A frame is chosen for replacement, using the replacement policy.
    - If the frame chosen for replacement is dirty (has been written to), it is flushed (the page it contains is written out to disk).
    - The requested page is read into the frame chosen for replacement.
  - The requested page is pinned (the pin_count of its frame is incremented), and its address is returned to the requestor.
- Note that if the page is not pinned, it could be removed from the buffer pool even if it is actually needed in main memory.
35. Chapter 7 Practice (Cont.)
- When does a buffer manager write a page to disk?
  - If a page in the buffer pool is chosen to be replaced and this page is dirty, the buffer manager must write the page to disk; this is called flushing the page to disk.
  - Sometimes the buffer manager can also force a page to disk for recovery-related purposes (to ensure that log records corresponding to a modified page are written to disk before the modified page itself is written to disk).
36. Chapter 7 Practice (Cont.)
- What does it mean to say that a page is pinned in the buffer pool? Who is responsible for pinning pages? Who is responsible for unpinning pages?
  - Pinning a page means the pin_count of its frame is incremented. Pinning a page guarantees to higher-level DBMS software that the page will not be removed from the buffer pool by the buffer manager. That is, another file page will not be read into the frame containing this page until it is unpinned by this requestor.
  - It is the buffer manager's responsibility to pin a page.
  - It is the responsibility of the requestor of that page to tell the buffer manager to unpin the page.
37. Chapter 7 Practice (Cont.)
- When a page in the buffer pool is modified, how does the DBMS ensure that this change is propagated to disk? (Explain the role of the buffer manager as well as the modifier of the page.)
  - The modifier of the page tells the buffer manager that the page is modified by setting the dirty bit of the page.
  - The buffer manager flushes the page to disk when necessary.
38. Chapter 7 Practice (Cont.)
- What happens if there is a page request when all pages in the buffer pool are dirty?
  - If there are some unpinned pages, the buffer manager chooses one using the replacement policy, flushes this page, and then replaces it with the requested page.
  - If there are no unpinned pages, the buffer manager has to wait until an unpinned page is available (or signal an error condition to the page requestor).
39. Chapter 7 Practice (Cont.)
- What is sequential flooding of the buffer pool?
  - Some DB operations (e.g. certain implementations of the join relational algebra operator) require repeated sequential scans of a relation. Suppose that there are 10 frames available in the buffer pool, and the file to be scanned has 11 or more pages (i.e. at least one more than the number of available frames in the buffer pool). Using LRU, every scan of the file will result in reading in every page of the file! In this situation, called sequential flooding, LRU is the worst possible replacement strategy.
40. Chapter 7 Practice (Cont.)
- Name an important capability of a DBMS buffer manager that is not supported by a typical OS's buffer manager.
  - Pinning a page to prevent it from being replaced.
  - Ability to explicitly force a single page to disk.
- Explain the term prefetching. Why is it important?
  - Because most page references in a DBMS environment are made with a known reference pattern, the buffer manager can anticipate the next several page requests and fetch the corresponding pages into memory before the pages are requested. This is prefetching.
  - Benefits: 1) pages are available in the buffer pool when they are requested; 2) reading in a contiguous block of pages is much faster than reading the same pages at different times in response to distinct requests.