Title: ECE 6160: Advanced Computer Networks Disk Arrays
1ECE 6160 Advanced Computer Networks: Disk Arrays
- Instructor: Dr. Xubin (Ben) He
- Email: Hexb_at_tntech.edu
- Tel: 931-372-3462
- Course web: http://www.ece.tntech.edu/hexb/616f05
2Prev
3Rotational Media
Disk geometry: platter, track, sector, cylinder, head, arm
Access time = seek time + rotational delay + transfer time + overhead
- Seek time: 5-15 milliseconds to move the disk arm and settle on a cylinder
- Rotational delay: 8 milliseconds for a full rotation at 7200 RPM, so an average delay of about 4 ms
- Transfer time: 1 millisecond for an 8KB block at 8 MB/s
4Disk Operations
- Seek: move the head to the track
- Rotation: wait for the sector to pass under the head
- Transfer: move data to/from the disk
- Overhead
- Controller delay
- Queuing delay
Access time = seek time + rotational delay + transfer time + overhead (see the sketch below)
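To make the formula concrete, here is a minimal sketch (mine, not the lecture's; the function name and defaults are illustrative) that plugs the example numbers from the previous slide into the access-time equation.

```python
# Access time = seek time + rotational delay + transfer time + overhead,
# using the lecture's example numbers: ~10 ms seek, 7200 RPM, 8 KB at 8 MB/s.

def access_time_ms(seek_ms, rpm, block_kb, bandwidth_mb_s, overhead_ms=0.0):
    """Estimate the time for one disk request, in milliseconds."""
    full_rotation_ms = 60_000.0 / rpm            # one revolution (~8.3 ms at 7200 RPM)
    rotational_delay_ms = full_rotation_ms / 2   # on average, wait half a revolution
    transfer_ms = (block_kb / 1024.0) / bandwidth_mb_s * 1000.0
    return seek_ms + rotational_delay_ms + transfer_ms + overhead_ms

print(round(access_time_ms(seek_ms=10, rpm=7200, block_kb=8, bandwidth_mb_s=8), 1))
# -> 15.1 ms, dominated by the mechanical seek and rotation, not the transfer
```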
5Improving disk performance.
- Use large sectors to improve bandwidth
- Use track caches and read ahead
- Read entire track into on-controller cache
- Exploit locality (improves both latency and BW)
- Design file systems to maximize locality
- Allocate files sequentially on disks (exploit the track cache)
- Locate similar files in the same cylinder (reduce seeks)
- Locate similar files in nearby cylinders (reduce seek distance)
- Pack bits closer together to improve transfer rate and density
- Use a collection of small disks to form a large, high-performance one --> disk array
- Striping data across multiple disks allows parallel I/O, thus improving performance
6Use Arrays of Small Disks?
- Katz and Patterson asked in 1987: can smaller disks be used to close the performance gap between disks and CPUs?
Conventional: 4 disk designs (14", 10", 5.25", 3.5"), spanning high end to low end
Disk array: 1 disk design (3.5")
7Replace Small Number of Large Disks with Large
Number of Small Disks! (1988 Disks)
              Capacity     Volume       Power   Data Rate   I/O Rate      MTTF       Cost
IBM 3390K     20 GBytes    97 cu. ft.   3 KW    15 MB/s     600 I/Os/s    250 KHrs   250K
IBM 3.5" 0061 320 MBytes   0.1 cu. ft.  11 W    1.5 MB/s    55 I/Os/s     50 KHrs    2K
x70 (array)   23 GBytes    11 cu. ft.   1 KW    110 MB/s    3900 I/Os/s   ??? Hrs    150K
Array vs. 3390K: 9X (volume), 3X (power), 8X (data rate), 6X (I/O rate)
Disk Arrays have potential for large data and I/O
rates, high MB per cu. ft., high MB per KW, but
what about reliability?
8Array Reliability
- MTTF (Mean Time To Failure): the average time that a non-repairable component will operate before experiencing a failure.
- Reliability of N disks = reliability of 1 disk / N (see the sketch below)
- 50,000 hours / 70 disks = ~700 hours
- Disk system MTTF drops from 6 years to 1 month!
- Arrays without redundancy are too unreliable to be useful!
- Solution: redundancy.
9Redundant Arrays of (Inexpensive) Disks
- Replicate data over several disks so that no data will be lost if one disk fails.
- Redundancy yields high data availability
- Availability: service is still provided to the user, even if some components have failed
- Disks will still fail
- Contents are reconstructed from data redundantly stored in the array
- Capacity penalty to store redundant info
- Bandwidth penalty to update redundant info
11Levels of RAID
- The original RAID paper described five categories (RAID levels 1-5). (Patterson et al., A Case for Redundant Arrays of Inexpensive Disks (RAID), ACM SIGMOD, 1988)
- Disk striping with no redundancy is now called RAID 0 or JBOD (Just a Bunch Of Disks).
- Other kinds have been proposed in the literature: Level 6 (P+Q redundancy), Level 10, RAID 53, etc.
- Except for RAID 0, all RAID levels trade disk capacity for reliability, and the extra reliability makes parallelism a practical way to improve performance.
12RAID 0 Nonredundant (JBOD)
- High I/O performance.
- Data is not saved redundantly.
- A single copy of the data is striped across multiple disks (see the sketch below).
- Low cost.
- Lack of redundancy.
- Least reliable: a single disk failure leads to data loss.
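A minimal sketch (the function name and fixed-size striping units are illustrative) of how RAID 0 maps a logical block to a disk and an offset, so consecutive blocks land on different disks and can be accessed in parallel.

```python
def raid0_map(logical_block, num_disks):
    """Return (disk index, block offset on that disk) under simple striping."""
    disk = logical_block % num_disks        # round-robin placement across disks
    offset = logical_block // num_disks     # which stripe (row) the block falls in
    return disk, offset

# With 4 disks, logical blocks 0..7 spread across all disks in two stripes:
for b in range(8):
    print(b, raid0_map(b, num_disks=4))
# blocks 0-3 -> disks 0-3 at offset 0; blocks 4-7 -> disks 0-3 at offset 1
```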
13Redundant Arrays of Inexpensive Disks: RAID 1 Disk Mirroring/Shadowing
Each disk is fully duplicated onto its mirror; a disk and its mirror form a recovery group.
- Very high availability can be achieved
- Bandwidth sacrifice on writes: one logical write = two physical writes (see the sketch below)
- Reads may be optimized: serve each read from the copy with the shorter queue and seek time
- Most expensive solution: 100% capacity overhead
Targeted for high I/O rate, high availability environments
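A tiny sketch (my own illustration, not the lecture's code) of the RAID 1 trade-off: every logical write is duplicated to both copies, while a read can be served by whichever mirror looks less busy.

```python
class Raid1:
    def __init__(self):
        self.disks = [{}, {}]          # two mirrored disks: block number -> data
        self.queue_len = [0, 0]        # stand-in for per-disk queue lengths

    def write(self, block, data):
        for d in self.disks:           # one logical write = two physical writes
            d[block] = data

    def read(self, block):
        # pick the mirror with the shorter queue (the read optimization above)
        i = 0 if self.queue_len[0] <= self.queue_len[1] else 1
        return self.disks[i][block]

r = Raid1()
r.write(7, b"hello")
print(r.read(7))                       # b'hello', served from either copy
```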
14RAID 2 Memory-Style ECC
Data Disks
Multiple ECC Disks and a Parity Disk
- Multiple disks record ECC information to determine which disk is at fault
- A parity disk is then used to reconstruct corrupted or lost data
- Needs log2(number of disks) redundant disks
15RAID 3 Bit (Byte) Interleaved Parity
- Only need one parity disk
- Write/Read accesses all disks
- Only one request can be serviced at a time
- Easy to implement
- Provides high bandwidth but not high I/O rates
Targeted for high-bandwidth applications: multimedia, image processing
16RAID 3
- The parity (XOR) sum is computed across the recovery group to protect against hard disk failures and stored on the P disk (see the sketch after this list)
- Logically a single high-capacity, high-transfer-rate disk: good for large transfers
- Wider arrays reduce capacity costs, but decrease availability
- 12.5% capacity cost for parity in this configuration
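A small sketch (disk contents and names are illustrative) of the parity idea: the P disk stores the XOR of the data disks, so any single failed disk can be rebuilt from the survivors plus parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data_disks = [b"\x10\x20", b"\x0f\x01", b"\xaa\x55"]   # three data disks, 2 bytes each
parity = xor_blocks(data_disks)                        # stored on the P disk

# Disk 1 fails: rebuild its contents from the surviving disks plus parity
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])
assert rebuilt == data_disks[1]
print(rebuilt)                                         # b'\x0f\x01'
```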
Inspiration for RAID 4
- RAID 3 relies on the parity disk to discover errors on a read
- But every sector has an error detection field
- Rely on the error detection field to catch errors on read, not on the parity disk
- Allows independent reads to different disks simultaneously
17RAID 4 Block Interleaved Parity
- Blocks are the striping units
- Allows parallel access by multiple I/O requests, hence high I/O rates
- Doing multiple small reads is now faster than before (small read requests can be restricted to a single disk)
- Large writes (full stripe) update the parity as
  P = d0 ⊕ d1 ⊕ d2 ⊕ d3
- Small writes (e.g., a write to d0) update the parity as (see the sketch below)
  P = d0 ⊕ d1 ⊕ d2 ⊕ d3
  P' = d0' ⊕ d1 ⊕ d2 ⊕ d3 = P ⊕ d0 ⊕ d0'
- However, writes are still very slow since the parity disk is the bottleneck.
18Problems of Disk Arrays: Small Writes (read-modify-write procedure)
RAID-5 Small Write Algorithm: 1 logical write = 2 physical reads + 2 physical writes
(Figure: to overwrite D0 in the stripe D0 D1 D2 D3 P, (1) read the old data D0 and (2) read the old parity P, XOR both with the new data D0' to form the new parity P', then (3) write D0' and (4) write P', yielding the new stripe D0' D1 D2 D3 P'.)
19Inspiration for RAID 5
- RAID 4 works well for small reads
- Small writes (a write to one disk):
- Option 1: read the other data disks, create the new sum, and write it to the parity disk
- Option 2: since P holds the old sum, compare the old data to the new data and add the difference to P
- Small writes are limited by the parity disk: writes to D0 and D5 both also write to the P disk. The parity disk must be updated on every write operation!
20Redundant Arrays of Inexpensive Disks: RAID 5 High I/O Rate Interleaved Parity
Increasing logical disk addresses; the parity block rotates across the disk columns:
  Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
  D0       D1       D2       D3       P
  D4       D5       D6       P        D7
  D8       D9       P        D10      D11
  D12      P        D13      D14      D15
  P        D16      D17      D18      D19
  D20      D21      D22      D23      P
  ...      ...      ...      ...      ...
- Independent writes are possible because of the interleaved parity
- Example: writes to D0 and D5 use disks 0, 1, 3, 4, so they can proceed independently
21RAID 5 Block Interleaved Distributed-Parity
Left Symmetric Distribution
- Parity disk = (block number / 4) mod 5 (see the sketch below)
- Eliminates the parity-disk bottleneck of RAID 4
- Best small-read, large-read, and large-write performance
- Can correct any single self-identifying failure
- Small logical writes take two physical reads and two physical writes
- Recovery requires reading all non-failed disks
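A sketch of the rotated-parity mapping for the 5-disk layout shown two slides earlier (assumption: disks are numbered 0-4 left to right, so the first stripe's parity sits on the rightmost disk and rotates leftward, which is what "(block number / 4) mod 5" selects under that numbering).

```python
def raid5_map(block, num_disks=5):
    """Return (stripe row, parity column, data column) for one logical data block."""
    data_per_stripe = num_disks - 1
    stripe = block // data_per_stripe
    parity_col = (num_disks - 1) - (stripe % num_disks)       # parity rotates right to left
    offset = block % data_per_stripe
    data_col = offset if offset < parity_col else offset + 1  # data skips the parity column
    return stripe, parity_col, data_col

for b in (0, 7, 16):
    print(b, raid5_map(b))
# block 0  -> stripe 0, parity on disk 4, data on disk 0   (row: D0 D1 D2 D3 P)
# block 7  -> stripe 1, parity on disk 3, data on disk 4   (row: D4 D5 D6 P D7)
# block 16 -> stripe 4, parity on disk 0, data on disk 1   (row: P D16 D17 D18 D19)
```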
22RAID 6 P+Q Redundancy
- An extension to RAID 5, but with two-dimensional (P+Q) parity.
- Each parity stripe has a P parity block and a Q parity block (Reed-Solomon codes).
- Has extremely high data fault tolerance and can sustain multiple simultaneous drive failures
- Rarely implemented
For more information, see the paper A Tutorial on Reed-Solomon Coding for Fault Tolerance in RAID-like Systems
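A rough sketch (my own illustration, not the cited paper's algorithm) of what the second parity adds: P is the plain XOR of the data bytes, while Q weights each byte by a distinct constant in the Galois field GF(2^8), so two simultaneous failures still leave a solvable system. The generator and reduction polynomial here are conventional choices; recovery itself is omitted.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8), reducing by a standard polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def pq_parity(data_bytes):
    """P and Q parity for one byte position taken from each data disk."""
    p, q = 0, 0
    coeff = 1                          # g**i for disk i, with generator g = 2
    for d in data_bytes:
        p ^= d                         # ordinary RAID 5 style parity
        q ^= gf_mul(coeff, d)          # Reed-Solomon style weighted parity
        coeff = gf_mul(coeff, 2)
    return p, q

print(pq_parity([0x10, 0x20, 0x30, 0x40]))   # distinct P and Q values: (64, 170)
```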
23Comparison of RAID Levels (N disks, each with
capacity of C)
24Implementation Considerations
- Avoiding Stale Data
- Regenerating Parity after a System Crash
- Operating with a Failed Disk
- Orthogonal RAID
- Striping Unit Size
- Other RAID Improvement Techniques
25Avoiding Stale Data
- Maintain a bit-vector to indicate the validity of each logical sector (see the sketch below).
- Avoid reading stale data
- When a disk fails, the corresponding logical sectors must be marked invalid before any read access.
- Avoid creating stale data
- When an invalid sector has been reconstructed onto a spare disk, the corresponding logical sectors must be marked valid before any write access.
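A minimal sketch (class and method names are mine) of the validity bit-vector: sectors are marked invalid before reads are allowed after a failure, and marked valid again only after the data has been rebuilt on the spare.

```python
class ValidityMap:
    def __init__(self, num_sectors):
        self.valid = bytearray(b"\x01" * num_sectors)   # 1 = valid, 0 = stale/invalid

    def disk_failed(self, sectors):
        for s in sectors:              # mark invalid BEFORE any further read access
            self.valid[s] = 0

    def reconstructed(self, sector):
        self.valid[sector] = 1         # mark valid only AFTER the spare holds good data

    def can_read(self, sector):
        return self.valid[sector] == 1

v = ValidityMap(8)
v.disk_failed([2, 3])
print(v.can_read(2), v.can_read(5))    # False True
v.reconstructed(2)
print(v.can_read(2))                   # True
```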
26Regenerating Parity after a System Crash
- Hardware RAID system
- Before servicing any write request, the corresponding parity sectors must be marked inconsistent.
- When bringing the system up after a crash, all inconsistent parity sectors must be regenerated.
- Periodically mark parity sectors as consistent to avoid having to regenerate a large number of parity sectors after each crash.
- Software RAID system
- A simple solution
- Mark the corresponding parity sectors as inconsistent before each write operation, and mark them consistent after the write operation.
- A more practical solution
- Maintain a most-recently-used pool on stable storage that keeps track of a fixed number of inconsistent parity sectors.
27Operating with a Failed Disk
- A disk array operating with a failed disk can potentially lose data in the event of a system crash. Therefore, some form of logging is needed on every write operation to prevent the loss. Two elegant methods:
- Demand reconstruction
- Requires stand-by (spare) disks
- Any write access to a parity stripe with an invalid sector immediately triggers reconstruction of the appropriate data onto the spare disk.
- Parity sparing
- Does not need stand-by disks, but requires additional metadata
- Use spares to make smaller disk arrays
- Smaller arrays mean higher reliability and faster reconstruction
- On a disk failure, merge the smaller arrays into a larger one
- For more information, please see the paper Failure Evaluation of Disk Array Organization
28Orthogonal RAID
(Figure: Option 1 vs. Option 2 for arranging each error correction group across the disk strings/controllers.)
29Striping Unit in RAID 5
- S = optimal striping unit size
- N = the number of disks
- S increases as N increases for read-intensive workloads
- S decreases as N increases for write-intensive workloads
- S is independent of N for an unspecified mix of reads and writes
- Recommended striping unit size:
- S = ½ × average disk positioning time × disk transfer rate
For more information, see the paper
P.M. Chen, Striping in a RAID Level 5 Disk Array, ACM 1995
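A quick sketch (the example numbers are mine, not the paper's) of the rule of thumb S = ½ × average positioning time × transfer rate: a disk that spends about 12 ms positioning (seek plus rotation) and transfers 8 MB/s would get a striping unit of roughly 49 KB.

```python
def striping_unit_kb(avg_positioning_ms, transfer_rate_mb_s):
    """Recommended striping unit: half the data the disk could move while positioning."""
    bytes_per_ms = transfer_rate_mb_s * 1024 * 1024 / 1000.0
    return 0.5 * avg_positioning_ms * bytes_per_ms / 1024.0

print(round(striping_unit_kb(avg_positioning_ms=12, transfer_rate_mb_s=8)))  # -> 49 (KB)
```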
30Other RAID Improvement Techniques
- Improving the small-write performance of RAID 5
- Buffering and caching
- Floating Parity
- Shortens the read-modify-write of the parity update to nearly a single disk access time on average
- Basic idea: the new parity block is written on the rotationally nearest unallocated block following the old parity block
- Declustered Parity
- Distributes the increased load caused by disk failures uniformly over all disks
- Basic idea: construct a multiple-RAID system with overlapping parity groups
31Other RAIDs
- HP AutoRAID
- AFRAID
- RAPID
- SwiftRAID
- TickerTAIP
- SMDA
32Berkeley History RAID-I
- RAID-I (1989)
- Consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks, and specialized disk striping software
- Today RAID is a $19 billion industry; 80% of non-PC disks are sold in RAIDs
33RAID Techniques: goal was performance, popularity due to reliability of storage
- Disk Mirroring/Shadowing (RAID 1): each disk is fully duplicated onto its "shadow"; logical write = two physical writes; 100% capacity overhead
- Parity Data Bandwidth Array (RAID 3): parity computed horizontally; logically a single high-data-bandwidth disk
- High I/O Rate Parity Array (RAID 5): interleaved parity blocks; independent reads and writes; logical write = 2 reads + 2 writes
(Figure: mirrored disks hold identical bit patterns; the RAID 3 parity row is the bit-wise XOR of the data rows.)
34RAID 0 Striped Disk Array without Fault
Tolerance
RAID Level 0 requires a minimum of 2 drives to
implement