Title: Advanced Database System
1. Advanced Database System
- CS 641
- Lecture 4
- Jan 24th, 2008
2. Accelerating access to secondary storage
- Place blocks that are accessed together on the same cylinder.
- Divide the data among several smaller disks rather than one large disk.
- Mirror a disk.
- Use a disk scheduling algorithm.
- Prefetch blocks into main memory.
3. Organizing data by cylinders
- Objective: reduce seek time.
- Method: analyze application behavior, and put data that is likely to be accessed together on a single cylinder or on adjacent cylinders.
- Thus, if we read all the blocks on a single track or cylinder consecutively, we can neglect all but the first seek time and the first rotational latency.
4. Example
- Megatron 747.
- The average transfer time, seek time, and rotational latency are 0.25 ms, 6.46 ms, and 4.17 ms, respectively.
- Sorting 10,000,000 records takes 74 minutes with TPMMS (Two-Phase, Multiway Merge-Sort).
- Each cylinder stores 8 MB (512 blocks).
- We store the data on 100,000/512 ≈ 196 cylinders.
- We must read 100 MB / 8 MB ≈ 13 different cylinders to fill main memory once.
5. Example (Cont.)
- The total time to fill main memory once:
- 6.46 ms for one average seek
- 12 ms for 12 one-cylinder seeks
- 1.60 s for 6400 blocks (0.25 ms per block)
- We need to fill main memory 16 times. That takes about 1.6 × 16 = 25.6 s.
- Reading and writing together take 2 × 25.6 s = 51.2 s.
- However, this mechanism cannot help phase 2, because there we read/write one block at a time. Phase 2 still needs 37 minutes. (The arithmetic is checked in the sketch below.)
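To make the arithmetic above easy to check, here is a minimal Python sketch. The 100 MB memory size and 16 KB block size behind the 6400-block figure are assumptions consistent with the slides, and the 1 ms one-cylinder seek is inferred from the 12 ms figure.

```python
# Megatron 747 figures from the slides above.
AVG_SEEK_MS = 6.46       # one average seek
ONE_CYL_SEEK_MS = 1.0    # assumed: 12 ms / 12 one-cylinder seeks
TRANSFER_MS = 0.25       # transfer time per block
BLOCKS_PER_FILL = 6400   # assumed: 100 MB memory / 16 KB blocks
CYLS_PER_FILL = 13       # 100 MB / 8 MB per cylinder, rounded up
FILLS = 16               # main memory is filled 16 times in phase 1

# One fill = one average seek + 12 one-cylinder seeks + pure transfer.
fill_ms = AVG_SEEK_MS + (CYLS_PER_FILL - 1) * ONE_CYL_SEEK_MS \
          + BLOCKS_PER_FILL * TRANSFER_MS
print(f"one fill: {fill_ms / 1000:.2f} s")   # ~1.62 s; slides round to 1.6 s

read_s = FILLS * fill_ms / 1000
print(f"all reads: {read_s:.1f} s")          # ~25.9 s (25.6 s with the rounding)
print(f"read + write: {2 * read_s:.1f} s")   # ~51.8 s (51.2 s with the rounding)
```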
6. Using multiple disks
- With a single disk, only one head can read/write data at any given time.
- With multiple disks, we can read/write data on different disks at the same time.
7. Example
- Replace the Megatron 747 with 4 Megatron 737s.
- Divide the given records among the 4 disks. The time to fill 100 MB of main memory is then 1.6/4 = 0.4 s.
- The entire phase 1 takes 51.2/4 = 12.8 s.
- In phase 2, TPMMS must be modified to take advantage of the 4 disks:
- Comparison starts as soon as the first element of a new block appears in main memory.
- The speedup is in the 2-3× range.
8. Mirroring disks
- Have two or more disks hold identical copies of the data.
- Mirroring is used for reliability, but it can also speed up access to the data.
- For example, with four copies the system is guaranteed to be able to retrieve 4 blocks at once (reading in parallel).
- Mirroring can speed up reads, but not writes.
9Disk scheduling
- Useful to reduce the access latency for many
small processes that each access a few blocks. - Definition In disk I/O, seek time and rotational
latency are very important. Since all disk
requests are linked in queues, reduce them
causing the system to speed up. Disk Scheduling
Algorithms are used to reduce the total seek time
and rotational latency of all requests.
10. Shortest Seek Time First (SSTF)
11. Elevator
12. Example
13. Elevator
14. FCFS
15. Comments
- The elevator algorithm improves throughput further as the average number of requests waiting for the disk increases; a small sketch of its scan logic follows below.
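Slides 10-14 showed these schedulers only as figures, so here is a minimal sketch of the elevator (SCAN) ordering in Python. The request model (bare cylinder numbers, head initially sweeping upward) is an illustrative assumption, not something from the slides.

```python
def elevator_order(head: int, requests: list[int]) -> list[int]:
    """Order in which an elevator (SCAN) scheduler visits the requested
    cylinders: sweep upward from the head position, then sweep back down."""
    pending = sorted(requests)
    up = [c for c in pending if c >= head]         # served on the way up
    down = [c for c in pending if c < head][::-1]  # served on the way back
    return up + down

# Head at cylinder 8000 of a 16,384-cylinder disk:
print(elevator_order(8000, [16000, 100, 9000, 4000, 12000]))
# -> [9000, 12000, 16000, 4000, 100]
```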
16. Example
- Assume a Megatron 747 with 16,384 cylinders.
- With 1000 pending I/O requests:
- Average time per request: 1 + 0.25 + 4.17 = 5.42 ms.
- Compare this to random-order access (10.88 ms per request).
- The 1000 I/Os finish in 5.42 s; the average delay to satisfy a request is half of that, 2.71 s.
- With 32,768 pending I/O requests:
- Average time per request: 1/2 + 0.25 + (1/2)(2/3) × 8.33 = 3.53 ms.
- The 32,768 I/Os finish in 116 s; the average delay to satisfy a request is 58 s. (Both averages are checked in the sketch below.)
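A small Python check of these averages; the 1 ms and 0.5 ms seek figures are the short-seek approximations implied by the request densities above.

```python
transfer = 0.25       # ms per block
rot_latency = 4.17    # ms, average rotational latency
rotation = 8.33       # ms per full rotation

# 1000 pending requests: ~16 cylinders apart, so each seek costs
# roughly the 1 ms minimum seek time.
avg_1000 = 1.0 + transfer + rot_latency
print(f"{avg_1000:.2f} ms")                      # 5.42 ms per request
print(f"{1000 * avg_1000 / 1000:.2f} s total")   # 5.42 s; avg delay ~2.71 s

# 32,768 pending requests: ~2 per cylinder, so only half the requests
# need a 1 ms seek, and rotational cost drops to (1/2)(2/3) of a rotation.
avg_32768 = 0.5 * 1.0 + transfer + 0.5 * (2 / 3) * rotation
print(f"{avg_32768:.2f} ms")                      # ~3.53 ms per request
print(f"{32768 * avg_32768 / 1000:.0f} s total")  # ~116 s; avg delay ~58 s
```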
17. Prefetching (double buffering)
- Load blocks into main memory before they are needed.
- Advantage: disk I/Os can be scheduled better.
- We gain a speedup in block access.
18. Example
- In the previous example, if two buffers are given to each sorted sublist (in phase 2), then when the records in one block are used up, we can switch to the other block without delay.
- However, if those 100,000 blocks are accessed randomly, we do not gain much benefit. To solve that:
- Store the sorted sublists on whole tracks or cylinders.
- Read whole tracks or whole cylinders at a time.
19. Another Example
- In the previous example, give each sublist 2 track-sized buffers; this needs 16 MB of main memory.
- Time to read a whole track: 6.46 + 8.33 = 14.79 ms.
- We have 196 cylinders (3136 tracks).
- The time to read them all: 3136 × 14.79 ms ≈ 46.4 s.
- With 2 cylinder-sized (16-track) buffers per sublist, we need 256 MB of main memory.
- Time to read a whole cylinder: 6.46 + 16 × 8.33 ≈ 140 ms.
- The time to read them all: 196 × 140 ms = 27.44 s. (See the check below.)
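The buffer-size/time trade-off above, checked in Python (196 cylinders of 16 tracks each, per the Megatron 747 figures):

```python
avg_seek = 6.46   # ms
rotation = 8.33   # ms for one full rotation (one track read)
cylinders, tracks_per_cyl = 196, 16

track_read = avg_seek + rotation                 # 14.79 ms per track
cyl_read = avg_seek + tracks_per_cyl * rotation  # ~139.7 ms per cylinder

print(f"track buffers:    {cylinders * tracks_per_cyl * track_read / 1000:.1f} s")
# ~46.4 s
print(f"cylinder buffers: {cylinders * cyl_read / 1000:.1f} s")
# ~27.4 s (27.44 s when the cylinder read is rounded to 140 ms)
```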
20. Two different applications
- A) The regular situation, where blocks can be read and written in a sequence that can be predicted in advance, and only one process is using the disk (TPMMS phase 1).
- B) A collection of short processes that execute in parallel, share the same disk, and cannot be predicted in advance (TPMMS phase 2).
21. Cylinder-based organization
- Advantage: excellent for A).
- Disadvantage: no help for B).
22. Multiple disks
- Advantage: increases read/write access rates for both applications.
- Problem: reads/writes to the same disk cannot be satisfied at the same time.
- Disadvantage: cost.
23. Mirroring
- Advantages: increases read rates for both applications; improves fault tolerance for both applications.
- Disadvantage: we pay for two or more disks but get the capacity of only one.
24. Disk scheduling (elevator algorithm)
- Advantage: reduces the average time to read/write when block accesses are unpredictable.
- Problem: it is effective only when there are many pending I/O requests, and in that case the average delay for each process is already high.
25. Prefetching
- Advantage: speeds up access when the needed blocks are known in advance but the timing of requests is data-dependent.
- Disadvantage: requires extra main-memory buffers; no help when accesses are random.
26. Exercise 11.5.2
- Suppose we use two Megatron 747 disks as mirrors. However, instead of allowing any block to be read from either disk, we keep the head of the first disk in the inner half of the cylinders, and the head of the second disk in the outer half of the cylinders. Assuming read requests are on random tracks:
- What is the average rate at which this system can read blocks?
- How does this rate compare to the unrestricted case?
- What disadvantages do you foresee for this system?
27. Disk failure
- Intermittent failure: an attempt to read/write a sector fails, but repeated tries succeed.
- Media decay: one or more bits are permanently corrupted, so the corresponding sector is damaged.
- Write failure: an attempt to write a sector fails.
- Disk crash: the entire disk becomes suddenly and permanently unreadable.
28. Techniques
- Parity checks: detect intermittent failures.
- Stable storage: prevents damage from media decay and write failures.
- RAID: copes with disk crashes.
- Basic idea: use additional storage to keep redundant information.
29. A useful model
- Disk sectors are ordinarily stored with some redundant bits.
- A read returns a pair (w, s), where w is the data in the sector that was read and s is a status bit telling whether or not the read was successful.
30. Checksum
- Additional bits are set depending on the values of the data bits stored in the sector.
- A simple form is based on the parity of all the bits in the sector.
- If there is an odd number of 1s among the bits, they have odd parity, and their parity bit is 1.
- If there is an even number of 1s among the bits, they have even parity, and their parity bit is 0.
- Thus, the number of 1s among a collection of bits and their parity bit is always even.
31. Example
- 01101000 has parity bit 1 and will be stored as 011010001.
- 11101110 has parity bit 0 and will be stored as 111011100.
- If one of these bits fails, the error can be detected; if more than one bit is corrupted, there is a 50% chance the error will not be detected.
- More parity bits can be used to detect more errors.
- In general, if n independent bits are used as a checksum, the chance of missing an error is only 1/2^n. (A small parity sketch follows.)
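A minimal Python sketch of the parity scheme on these two slides:

```python
def parity_bit(bits: str) -> str:
    """Parity bit chosen so the total number of 1s is even."""
    return '1' if bits.count('1') % 2 == 1 else '0'

def store(bits: str) -> str:
    return bits + parity_bit(bits)

def check(stored: str) -> bool:
    """A stored sector passes the check only if its 1-count is even."""
    return stored.count('1') % 2 == 0

print(store('01101000'))   # 011010001, as on the slide
print(store('11101110'))   # 111011100
print(check('011010001'))  # True
print(check('011010000'))  # False: a single-bit error is detected
print(check('101010001'))  # True: this double-bit error goes unnoticed
```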
32. Stable storage
- Problem: a checksum can detect an error, but it cannot fix the error. For critical applications, such as banks and airlines, this is not enough.
- Stable storage: pair the sectors, with each pair representing one sector's contents, X.
- We call the two copies XL and XR.
- If a read returns (w, good) for either XL or XR, then w is the true value of X.
33. Writing policy
- Write the value of X into XL. Check that the written value has status good; otherwise, repeat the write. If the write still does not succeed after several tries, a media failure has been detected, so substitute a spare sector for XL.
- Repeat for XR.
34. Reading policy
- To obtain the value of X, read XL. If its status is bad, repeat the read several times. If the status eventually becomes good, take that value as X.
- If XL cannot be read, repeat the process with XR. (A small simulation of both policies follows below.)
35. Error-handling capabilities
- Media failure:
- Only if both copies fail can we not get the value of X; otherwise, we can read X.
- Write failure: if a system failure (e.g., a power outage) occurs while writing X, the failure can be detected:
- If the failure occurred as we were writing XL, the status of XL becomes bad, but XR still has the old value of X.
- If the failure occurred after we wrote XL, the status of XL is good, so we rewrite XR with the value in XL.
36. Exercises
37. Recovery from disk crashes
- The most common strategy is RAID (Redundant Arrays of Independent Disks).
38. The failure model for disks
- Mean time to failure (MTTF): the length of time by which 50% of a population of disks will have failed catastrophically.
- For modern disks, about 10 years.
39. A survival rate curve for disks
40. Mechanism
- One or more disks hold the data (data disks), and
- one or more additional disks hold information that is completely determined by the contents of the data disks (redundant disks).
- If either a data disk or a redundant disk crashes, the other disks can be used to restore the failed disk, so no information is permanently lost.
41. Mirroring (RAID 1)
- Mirror each data disk with a redundant disk.
- Data is lost only when both disks fail simultaneously.
- Example:
- A disk has a 10% probability of failing in any given year.
- Replacing a failed disk takes 3 hours (1/2920 of a year).
- The probability that the mirror disk fails during the replacement is 10% × 1/2920 = 1/29,200.
- One of the two disks fails once every 5 years on average.
- The mean rate of data loss is therefore (1/5) × (1/29,200) = 1/146,000, i.e., once per 146,000 years. (Checked below.)
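Checking the mean-time-to-data-loss arithmetic in Python:

```python
p_fail_per_year = 0.10           # each disk fails with probability 10%/year
replace_years = 3 / (24 * 365)   # 3-hour replacement = 1/2920 of a year

# Probability the mirror also fails inside the replacement window:
p_second = p_fail_per_year * replace_years            # 1/29,200
# One of the two disks fails every 1/(2 * 0.10) = 5 years:
first_failures_per_year = 2 * p_fail_per_year         # 1/5
data_loss_per_year = first_failures_per_year * p_second
print(f"once per {1 / data_loss_per_year:,.0f} years")  # once per 146,000 years
```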
42. Parity blocks (RAID 4)
- The mirroring approach uses as many redundant disks as data disks.
- RAID 4 uses only 1 redundant disk, no matter how many data disks there are.
- Mechanism: assume all disks are identical; on the redundant disk, the ith block consists of parity checks for the ith blocks of all the data disks.
43. Example
- 3 data disks (1, 2, 3) and a redundant disk (4).
- Disk 1: 11110000
- Disk 2: 10101010
- Disk 3: 00111000
- Then, on the redundant disk:
- Disk 4: 01100010 (verified in the sketch below)
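The parity block is the bitwise modulo-2 sum (XOR) of the data blocks; a short Python check of this example:

```python
def parity_block(*blocks: str) -> str:
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    x = 0
    for b in blocks:
        x ^= int(b, 2)
    return format(x, f'0{len(blocks[0])}b')

d1, d2, d3 = '11110000', '10101010', '00111000'
d4 = parity_block(d1, d2, d3)
print(d4)                        # 01100010: the redundant disk 4

# Any single disk is the XOR of all the others -- the basis of the
# reading trick (slides 44-45) and failure recovery (slides 48-49):
print(parity_block(d2, d3, d4))  # 11110000: disk 1 reconstructed
```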
44. Reading example
- Suppose we are reading a block from disk 1 when another request comes in to read a different block of the same disk.
- Ordinarily, we would have to wait for the first request to finish.
- However, we can instead compute the requested block by reading the corresponding blocks from the other disks.
45. Reading example (Cont.)
- Disk 2: 10101010
- Disk 3: 00111000
- Disk 4: 01100010
- Taking the modulo-2 sum, we get:
- Disk 1: 11110000
46. Writing
- When we write a new block on a data disk, we must also update the corresponding block of the redundant disk.
- With n data disks, a naive approach needs n + 1 disk I/Os.
- A better approach looks only at the old and new versions of the data block being written:
- Read the old value of the block to be changed.
- Read the corresponding block of the redundant disk.
- Write the new data block.
- Recalculate and write the block of the redundant disk.
47. Writing example
- Disk 1: 11110000
- Disk 2: 10101010
- Disk 3: 00111000
- Disk 4: 01100010
- Now change the block on disk 2 to 11001100.
- Thus, the values in positions 2, 3, 6, and 7 have changed.
- For the block on the redundant disk, take the modulo-2 sum of 01100110 and 01100010, giving 00000100.
- Only 4 disk I/Os are needed. (See the sketch below.)
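A sketch of this read-modify-write update in Python, using the blocks from the slide:

```python
def xor(a: str, b: str) -> str:
    return format(int(a, 2) ^ int(b, 2), f'0{len(a)}b')

old_d2, old_redundant = '10101010', '01100010'
new_d2 = '11001100'

# 4 I/Os: read the old data block, read the redundant block,
# write the new data block, write the updated redundant block.
change_mask = xor(old_d2, new_d2)          # 01100110: positions 2,3,6,7 flip
print(xor(old_redundant, change_mask))     # 00000100, as on the slide
```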
48. Failure recovery
- If the redundant disk fails, replace it with a new disk and recompute the redundant blocks.
- If one of the data disks fails, replace it with a new disk and recompute its data from the other disks.
- Rule: the bit in any position is the modulo-2 sum of all the bits in the corresponding positions on all the other disks.
49. Example
- If disk 2 fails:
- Disk 1: 11110000
- Disk 2: ????????
- Disk 3: 00111000
- Disk 4: 01100010
- Take the modulo-2 sum of each column:
- Disk 2: 10101010
50. An improvement: RAID 5
- Problem with RAID 4:
- With n data disks, the number of writes to the redundant disk is n times the average number of writes to any one data disk.
- To solve this, RAID 5 treats each disk as the redundant disk for some of the blocks.
51. Example
52. Example
- N = 4 (4 disks).
- Each disk is the redundant disk for 1/4 of the blocks and the target data disk for 1/3 of the remaining writes, so the average share of writes involving each disk is 1/4 + (3/4)(1/3) = 1/2.
53. Coping with multiple disk crashes (RAID 6)
- RAID 4 and 5 cannot deal with multiple simultaneous crashes.
- To deal with two simultaneous crashes, a Hamming code (an error-correcting code) is used.
- In the following discussion:
- 4 data disks (1, 2, 3, 4) and 3 redundant disks (5, 6, 7).
54. Redundancy pattern
55. Rules
- The bits of disk 5 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 3.
- The bits of disk 6 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 4.
- The bits of disk 7 are the modulo-2 sum of the corresponding bits of disks 1, 3, and 4.
56. Reading
- We can read data from any data disk normally; the redundant disks can be ignored.
57. Writing
- The idea is similar to RAID 4, but now several redundant disks may be involved.
58. Example
- Disk 1: 11110000
- Disk 2: 10101010
- Disk 3: 00111000
- Disk 4: 01000001
- Disk 5: 01100010
- Disk 6: 00011011
- Disk 7: 10001001
- Rewrite disk 2 to be 00001111; positions 1, 3, 6, and 8 change.
- Since disks 5 and 6 are computed using bits of disk 2, we must modify them accordingly:
- Disk 5: 11000111
- Disk 6: 10111110
59. Failure recovery (2 simultaneous disk crashes)
- Disk 1: 11110000
- Disk 2: ????????
- Disk 3: 00111000
- Disk 4: 01000001
- Disk 5: ????????
- Disk 6: 10111110
- Disk 7: 10001001
- Since disk 6 is computed from disks 1, 2, and 4, we can recover the data on disk 2 from disks 1, 4, and 6:
- Disk 2: 00001111
- Disk 5 is computed from disks 1, 2, and 3:
- Disk 5: 11000111 (both recoveries are checked in the sketch below)
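The same recovery, checked in Python using the rules of slide 55:

```python
def xor(*blocks: str) -> str:
    x = 0
    for b in blocks:
        x ^= int(b, 2)
    return format(x, f'0{len(blocks[0])}b')

d1, d3, d4 = '11110000', '00111000', '01000001'
d6 = '10111110'

# Disk 6 = disk 1 XOR disk 2 XOR disk 4, so the lost disk 2 is
# disk 1 XOR disk 4 XOR disk 6:
d2 = xor(d1, d4, d6)
print(d2)               # 00001111

# Disk 5 = disk 1 XOR disk 2 XOR disk 3, recomputable once disk 2 is back:
print(xor(d1, d2, d3))  # 11000111
```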
60. Exercises
- 11.7.1
- 11.7.4 a)
- 11.7.5 a)
- 11.7.6
- 11.7.8 a)
61. Summary
- Memory Hierarchy
- Tertiary storage
- Disk/secondary storage
- Blocks, sectors, cylinders, etc.
- Disk access time
- TPMMS
- Speed up disk access
- Disk failure model
- Checksum
- Stable storage
- RAID