Transcript and Presenter's Notes

Title: STORAGE AND I/O


1
STORAGE AND I/O
  • Jehan-François Pâris
  • jfparis@uh.edu

2
Chapter Organization
  • Availability and Reliability
  • Technology review
  • Solid-state storage devices
  • I/O Operations
  • Reliable Arrays of Inexpensive Disks

3
DEPENDABILITY
4
Reliability and Availability
  • Reliability
  • Probability R(t) that system will be up at time
    t if it was up at time t = 0
  • Availability
  • Fraction of time the system is up
  • Reliability and availability do not measure the
    same thing!

5
Which matters?
  • It depends
  • Reliability for real-time systems
  • Flight control
  • Process control,
  • Availability for many other applications
  • DSL service
  • File server, web server,

6
MTTF, MTTR and MTBF
  • MTTF is mean time to failure
  • MTTR is mean time to repair
  • 1/MTTF is the failure rate λ
  • MTBF, the mean time between failures, is
  • MTBF = MTTF + MTTR

7
Reliability
  • As a first approximation R(t) = exp(-t/MTTF)
  • Not true if failure rate varies over time
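A minimal Python sketch of this model (the helper name reliability is ours, not from the slides):

  import math

  def reliability(t, mttf):
      # Probability of still being up at time t, assuming a constant failure rate
      return math.exp(-t / mttf)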

8
Availability
  • Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF
  • MTTR is very important
  • A good MTTR requires that we detect the
    failure quickly

9
The nine notation
  • Availability is often expressed in "nines"
  • 99 percent is two nines
  • 99.9 percent is three nines
  • Formula is -log10(1 - A)
  • Example -log10(1 - 0.999) = -log10(10^-3) = 3
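These formulas translate into a short Python sketch (the helper names are ours, not part of the slides):

  import math

  def mtbf(mttf, mttr):
      return mttf + mttr

  def availability(mttf, mttr):
      # Fraction of time the system is up: MTTF / (MTTF + MTTR) = MTTF / MTBF
      return mttf / mtbf(mttf, mttr)

  def nines(a):
      # Number of "nines" of an availability value, e.g. 0.999 -> 3
      return -math.log10(1 - a)

  print(nines(0.999))   # ≈ 3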

10
Example
  • A server crashes on the average once a month
  • When this happens, it takes 12 hours to reboot it
  • What is the server availability ?

11
Solution
  • MTBF = 30 days
  • MTTR = 12 hours = ½ day
  • MTTF = 29 ½ days
  • Availability is 29.5/30 ≈ 98.3 percent

12
Keep in mind
  • A 99 percent availability is not as great as we
    might think
  • One hour down every 100 hours
  • Fifteen minutes down every 24 hours

13
Example
  • A disk drive has a MTTF of 20 years.
  • What is the probability that the data it contains
    will not be lost over a period of five years?
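The slides do not show the solution to this example; assuming the exponential model R(t) = exp(-t/MTTF) from above, a quick check gives roughly a 78 percent chance:

  import math

  # One disk, MTTF = 20 years, period of 5 years
  print(math.exp(-5 / 20))   # ≈ 0.78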

14
Example
  • A disk farm contains 100 disks whose MTTF is 20
    years.
  • What is the probability that no data will be
    lost over a period of five years?

15
Solution
  • The aggregate failure rate of the disk farm is
  • 100 × 1/20 = 5 failures/year
  • The mean time to failure of the farm is
  • 1/5 year
  • We apply the formula
  • R(t) = exp(-t/MTTF) = exp(-5 × 5) = exp(-25) ≈ 1.4 × 10^-11
  • Almost zero chance!

16
TECHNOLOGY OVERVIEW
17
Disk drives
  • See previous chapter
  • Recall that the disk access time is the sum of
  • The disk seek time (to get to the right track)
  • The disk rotational latency
  • The actual transfer time

18
Flash drives
  • Widely used in flash drives, most MP3 players and
    some small portable computers
  • Similar technology to EEPROM
  • Two technologies: NOR and NAND

19
What about flash?
  • Widely used in flash drives, most MP3 players and
    some small portable computers
  • Several important limitations
  • Limited write bandwidth
  • Must erase a whole block of data before
    overwriting it
  • Limited endurance
  • 10,000 to 100,000 write cycles

20
Storage Class Memories
  • Solid-state storage
  • Non-volatile
  • Much faster than conventional disks
  • Numerous proposals
  • Ferro-electric RAM (FRAM)
  • Magneto-resistive RAM (MRAM)
  • Phase-Change Memories (PCM)

21
Phase-Change Memories
No moving parts
A data cell
Crossbar organization
22
Phase-Change Memories
  • Cells contain a chalcogenide material that has
    two states
  • Amorphous with high electrical resistivity
  • Crystalline with low electrical resistivity
  • Quickly cooling material from above fusion point
    leaves it in amorphous state
  • Slowly cooling material from above
    crystallization point leaves it in crystalline
    state

23
Projections
  • Target date 2012
  • Access time 100 ns
  • Data Rate 200-1000 MB/s
  • Write Endurance 10^9 write cycles
  • Read Endurance no upper limit
  • Capacity 16 GB
  • Capacity growth > 40 percent per year
  • MTTF 10-50 million hours
  • Cost < $2/GB

24
Interesting Issues (I)
  • Disks will remain much cheaper than SCM for some
    time
  • Could use SCMs as intermediary level between main
    memory and disks

Main memory
SCM
Disk
25
A last comment
  • The technology is still experimental
  • Not sure when it will come to the market
  • Might even never come to the market

26
Interesting Issues (II)
  • Rather narrow gap between SCM access times and
    main memory access times
  • Main memory and SCM will interact
  • As the L3 cache interacts with the main memory
  • Not as the main memory now interacts with the disk

27
RAID Arrays
28
Today's Motivation
  • We use RAID today for
  • Increasing disk throughput by allowing parallel
    access
  • Eliminating the need to make disk backups
  • Disks are too big to be backed up in an efficient
    fashion

29
RAID LEVEL 0
  • No replication
  • Advantages
  • Simple to implement
  • No overhead
  • Disadvantage
  • If the array has n disks, its failure rate is n
    times the failure rate of a single disk

30
RAID levels 0 and 1
Mirrors
RAID level 1
31
RAID LEVEL 1
  • Mirroring
  • Two copies of each disk block
  • Advantages
  • Simple to implement
  • Fault-tolerant
  • Disadvantage
  • Requires twice the disk capacity of normal file
    systems

32
RAID LEVEL 2
  • Instead of duplicating the data blocks we use an
    error correction code
  • Very bad idea because disk drives either work
    correctly or do not work at all
  • Only possible errors are omission errors
  • We need an omission correction code
  • A parity bit is enough to correct a single
    omission

33
RAID levels 2 and 3
Check disks
RAID level 2
Parity disk
RAID level 3
34
RAID LEVEL 3
  • Requires N + 1 disk drives
  • N drives contain data (1/N of each data block)
  • Block bk now partitioned into N fragments
    bk,1, bk,2, ..., bk,N
  • Parity drive contains exclusive or of these N
    fragments
  • pk = bk,1 ⊕ bk,2 ⊕ ... ⊕ bk,N
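A minimal sketch of the parity computation, using Python byte strings to stand in for the N fragments (the data values are made up for illustration):

  from functools import reduce

  # N toy fragments of one block (N = 3 here)
  fragments = [b"\x0f\x3c", b"\xf0\xc3", b"\xaa\x55"]

  # Parity fragment: byte-wise XOR of all N fragments
  parity = bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*fragments))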

35
How does parity work?
  • Truth table for XOR (same as parity)

A B A⊕B
0 0 0
0 1 1
1 0 1
1 1 0
36
Recovering from a disk failure
  • Small RAID level 3 array with data disks D0 and
    D1 and parity disk P can tolerate failure of
    either D0 or D1

D0  D1  P   D1⊕P (= D0)   D0⊕P (= D1)
0   0   0   0             0
0   1   1   0             1
1   0   1   1             0
1   1   0   1             1
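Continuing the toy example above, a lost disk can be rebuilt by XORing the survivor with the parity disk (values are illustrative only):

  from functools import reduce

  def xor_bytes(*chunks):
      # Byte-wise XOR of equal-length byte strings
      return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*chunks))

  d0, d1 = b"\x0f\x3c", b"\xf0\xc3"
  p = xor_bytes(d0, d1)          # parity disk

  # If D0 fails, D1 XOR P recovers it (and symmetrically for D1)
  assert xor_bytes(d1, p) == d0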
37
How RAID level 3 works (I)
  • Assume we have N + 1 disks
  • Each block is partitioned into N equal chunks

N = 4 in the example
38
How RAID level 3 works (II)
  • XOR data chunks to compute the parity chunk
  • Each chunk is written into a separate disk

Parity
39
How RAID level 3 works (III)
  • Each read/write involves all disks in RAID array
  • Cannot do two or more reads/writes in parallel
  • Performance of array not better than that of a
    single disk

40
RAID LEVEL 4 (I)
  • Requires N + 1 disk drives
  • N drives contain data
  • Individual blocks, not chunks
  • Blocks with same disk address form a stripe

(Diagram: the blocks at the same address x on each data disk form a stripe, XORed together for parity)
41
RAID LEVEL 4 (II)
  • Parity drive contains exclusive or of the N
    blocks in stripe
  • pk = bk ⊕ bk+1 ⊕ ... ⊕ bk+N-1
  • Parity block now reflects contents of several
    blocks!
  • Can now do parallel reads/writes

42
RAID levels 4 and 5
Bottleneck
RAID level 4
RAID level 5
43
RAID LEVEL 5
  • Single parity drive of RAID level 4 is involved
    in every write
  • Will limit parallelism
  • RAID-5 distributes the parity blocks among the
    N + 1 drives
  • Much better

44
The small write problem
  • Specific to RAID 5
  • Happens when we want to update a single block
  • Block belongs to a stripe
  • How can we compute the new value of the parity
    block?

(Diagram: a stripe with data blocks bk, bk+1, bk+2, ... and its parity block pk)
45
First solution
  • Read values of N-1 other blocks in stripe
  • Recompute
  • pk = bk ⊕ bk+1 ⊕ ... ⊕ bk+N-1
  • Solution requires
  • N-1 reads
  • 2 writes (new block and new parity block)
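As a sketch, this first solution recomputes the parity from scratch (block values and helper names are illustrative only):

  def xor_bytes(a, b):
      return bytes(x ^ y for x, y in zip(a, b))

  def parity_full_recompute(new_block, other_blocks):
      # N-1 reads of the other blocks in the stripe, then XOR them all together
      p = new_block
      for b in other_blocks:
          p = xor_bytes(p, b)
      return p   # caller then writes the new block and this new parity block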

46
Second solution
  • Assume we want to update block bm
  • Read old values of bm and parity block pk
  • Compute
  • new pk = new bm ⊕ old bm ⊕ old pk
  • Solution requires
  • 2 reads (old values of block and parity block)
  • 2 writes (new block and new parity block)
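This read-modify-write shortcut can be checked with made-up one-byte blocks; it produces the same parity as a full recompute:

  def xor_bytes(a, b):
      return bytes(x ^ y for x, y in zip(a, b))

  old_b, other = b"\x12", b"\x34"          # two data blocks of a stripe
  old_p = xor_bytes(old_b, other)          # current parity block
  new_b = b"\x5a"                          # new value of the first block

  # new pk = new bm XOR old bm XOR old pk
  new_p = xor_bytes(xor_bytes(new_b, old_b), old_p)

  assert new_p == xor_bytes(new_b, other)  # matches recomputing from scratch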

47
RAID level 6 (I)
  • Not part of the original proposal
  • Two check disks
  • Tolerates two disk failures
  • More complex updates

48
RAID level 6 (II)
  • Has become more popular as disks become
  • Bigger
  • More vulnerable to irrecoverable read errors
  • Most frequent cause for RAID level 5 array
    failures is
  • Irrecoverable read error occurring while
    contents of a failed disk are reconstituted

49
RAID level 6 (III)
  • Typical array size is 12 disks
  • Space overhead is 2/12 ≈ 16.7 percent
  • Sole real issue is cost of small writes
  • Three reads and three writes
  • Read old value of block being updated, old parity
    block P, old parity block Q
  • Write new value of block being updated, new
    parity block P, new parity block Q

50
CONCLUSION (II)
  • Low cost of disk drives made RAID level 1
    attractive for small installations
  • Otherwise pick
  • RAID level 5 for higher parallelism
  • RAID level 6 for higher protection
  • Can tolerate one disk failure and irrecoverable
    read errors

51
A review question
  • Consider an array consisting of four 750 GB disks
  • What is the storage capacity of the array if we
    organize it
  • As a RAID level 0 array?
  • As a RAID level 1 array?
  • As a RAID level 5 array?

52
The answers
  • Consider an array consisting of four 750 GB disks
  • What is the storage capacity of the array if we
    organize it
  • As a RAID level 0 array? 3 TB
  • As a RAID level 1 array? 1.5 TB
  • As a RAID level 5 array? 2.25 TB
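A quick sketch of the arithmetic behind these answers (the function name is ours):

  def usable_capacity_tb(n_disks, disk_size_tb, level):
      if level == 0:
          return n_disks * disk_size_tb         # no redundancy
      if level == 1:
          return n_disks * disk_size_tb / 2     # everything mirrored
      if level == 5:
          return (n_disks - 1) * disk_size_tb   # one disk's worth of parity

  print([usable_capacity_tb(4, 0.75, lvl) for lvl in (0, 1, 5)])   # [3.0, 1.5, 2.25]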

53
CONNECTING I/O DEVICES
54
Busses
  • Connecting computer subsystems with each other
    was traditionally done through busses
  • A bus is a shared communication link connecting
    multiple devices
  • Transmit several bits at a time
  • Parallel buses

55
Busses
56
Examples
  • Processor-memory busses
  • Connect CPU with memory modules
  • Short and high-speed
  • I/O busses
  • Longer
  • Wide range of data bandwidths
  • Connect to memory through processor-memory bus or
    backplane bus

57
Standards
  • Firewire
  • For external use
  • 63 devices per channel
  • 4 signal lines
  • 400 Mb/s or 800 Mb/s
  • Up to 4.5 m

58
Standards
  • USB 2.0
  • For external use
  • 127 devices per channel
  • 2 signal lines
  • 1.5 Mb/s (Low Speed), 12 Mb/s (Full Speed) and
    480 Mb/s (Hi Speed)
  • Up to 5 m

59
Standards
  • USB 3.0
  • For external use
  • Adds a 5 Gb/s transfer rate (Super Speed)
  • Maximum distance is still 5m

60
Standards
  • PCI Express
  • For internal use
  • 1 device per channel
  • 2 signal lines per "lane"
  • Multiples of 250 MB/s
  • 1x, 2x, 4x, 8x, 16x and 32x
  • Up to 0.5 m

61
Standards
  • Serial ATA
  • For internal use
  • Connects cheap disks to computer
  • 1 device per channel
  • 4 data lines
  • 300 MB/s
  • Up to 1 m

62
Standards
  • Serial Attached SCSI (SAS)
  • For external use
  • 4 devices per channel
  • 4 data lines
  • 300 MB/s
  • Up to 8 m

63
Synchronous busses
  • Include a clock in the control lines
  • Bus protocols expressed in actions to be taken at
    each clock pulse
  • Have very simple protocols
  • Disadvantages
  • All bus devices must run at same clock rate
  • Due to clock skew issues, cannot be both fast
    and long

64
Asynchronous busses
  • Have no clock
  • Can accommodate a wide variety of devices
  • Have no clock skew issues
  • Require a handshaking protocol before any
    transmission
  • Implemented with extra control lines

65
Advantages of busses
  • Cheap
  • One bus can link many devices
  • Flexible
  • Can add devices

66
Disadvantages of busses
  • Shared devices
  • can become bottlenecks
  • Hard to run many parallel lines at high clock
    speeds

67
New trend
  • Away from parallel shared buses
  • Towards serial point-to-point switched
    interconnections
  • Serial
  • One bit at a time
  • Point-to-point
  • Each line links a specific device to another
    specific device

68
x86 bus organization
  • Processor connects to peripherals through two
    chips (bridges)
  • North Bridge
  • South Bridge

69
x86 bus organization
North Bridge
South Bridge
70
North bridge
  • Essentially a DMA controller
  • Lets disk controller access main memory w/o any
    intervention of the CPU
  • Connects CPU to
  • Main memory
  • Optional graphics card
  • South Bridge

71
South Bridge
  • Connects North bridge to a wide variety of I/O
    busses

72
Communicating with I/O devices
  • Two solutions
  • Memory-mapped I/O
  • Special I/O instructions

73
Memory mapped I/O
  • A portion of the address space reserved for I/O
    operations
  • Writes to any of these addresses are interpreted
    as I/O commands
  • Reading from these addresses gives access to
  • Error bit
  • I/O completion bit
  • Data being read

74
Memory mapped I/O
  • User processes cannot access these addresses
  • Only the kernel
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

75
Dedicated I/O instructions
  • Privileged instructions that cannot be executed
    by user processes
  • Only the kernel
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

76
Polling
  • Simplest way for an I/O device to communicate
    with the CPU
  • CPU periodically checks the status of pending I/O
    operations
  • High CPU overhead

77
I/O completion interrupts
  • Notify the CPU that an I/O operation has
    completed
  • Allows the CPU to do something else while waiting
    for the completion of an I/O operation
  • Multiprogramming
  • I/O completion interrupts are processed by CPU
    between instructions
  • No internal instruction state to save

78
Interrupts levels
  • See previous chapter

79
Direct memory access
  • DMA
  • Lets disk controller access main memory w/o any
    intervention of the CPU

80
DMA and virtual memory
  • A single DMA transfer may cross page boundaries
    with
  • One page being in main memory
  • One missing page

81
Solutions
  • Make DMA work with virtual addresses
  • Issue is then dealt with by the virtual memory
    subsystem
  • Break DMA transfers crossing page boundaries into
    chains of transfers that do not cross page
    boundaries

83
An Example
(Diagram: a DMA transfer that crosses a page boundary is broken into two smaller DMA transfers)
84
DMA and cache hierarchy
  • Three approaches for handling temporary
    inconsistencies between caches and main memory

85
Solutions
  • Run all DMA accesses through the cache
  • Bad solution
  • Have the OS selectively
  • Invalidate affected cache entries when
    performing a read
  • Force an immediate flush of dirty cache entries
    when performing a write
  • Have specific hardware do the same

86
Benchmarking I/O
87
Benchmarks
  • Specific benchmarks for
  • Transaction processing
  • Emphasis on speed and graceful recovery from
    failures
  • Atomic transactions
  • All-or-nothing behavior

88
An important observation
  • Very difficult to operate a disk subsystem at a
    reasonable fraction of its maximum throughput
  • Unless we sequentially access very large ranges
    of data
  • 512 KB and more

89
Major fallacies
  • Since rated MTTFs of disk drives exceed one
    million hours, a disk can last more than 100 years
  • MTTF expresses the failure rate during the disk's
    actual lifetime
  • Disk failure rates in the field match the MTTFs
    mentioned in the manufacturers' literature
  • They are up to ten times higher

90
Major fallacies
  • Neglecting to do end-to-end checks
  • Using magnetic tapes to back up disks
  • Tape formats can quickly become obsolete
  • Disk bit densities have grown much faster than
    tape data densities.

91
Can you read these?
(Images of obsolete storage media that can no longer be read)
92
But you can still read this