Transcript and Presenter's Notes

Title: Storage Systems


1
Storage Systems
2
Memory Hierarchy III: I/O System
  • Boring, but important
  • I/O has been the orphan of computer architecture
  • Most widely used performance measure: CPU time
  • By definition ignores I/O
  • Second-class citizenship: "peripheral" applied to I/O devices
  • Common sense: response time is a better measure
  • I/O is a big part of it!
  • The customer who pays for it cares, even if the computer designer does not

[Memory hierarchy figure: Registers, Instruction Cache (I) / Data Cache (D), L2 Cache, L3 Cache, Memory, Disk (swap)]
3
I/O (Disk) Performance
  • Who cares? You do
  • Remember Amdahl's Law
  • Want fast disk access (fast swap, fast file reads)
  • I/O performance
  • Bandwidth: I/Os per second (IOPS)
  • Latency: response time
  • Is I/O (disk) latency important? Why not just context-switch?
  • Context-switching requires more memory
  • Context-switching requires jobs to context-switch to
  • Context-switching annoys users (productivity = f(1/response time))

4
I/O Device Characteristics
  • Type
  • Input: read only
  • Output: write only
  • Storage: both
  • Partner
  • Human
  • Machine
  • Data rate
  • Peak transfer rate

5
Disk Parameters
  • 1-20 platters (data on both sides)
  • Magnetic iron-oxide coating
  • 1 read/write head per side
  • 500-2,500 tracks per platter
  • 32-128 sectors per track
  • Sometimes fewer on inside tracks
  • 512-2048 bytes per sector
  • Usually fixed length
  • Data + ECC (parity) + gap per sector
  • 4-24 GB total
  • 3,000-10,000 RPM

[Disk diagram: platters, read/write heads, spindle, tracks, sectors, R/W cache, controller]
6
Disk Performance
  • Response time (service time): t_disk = t_seek + t_rotation + t_transfer + t_controller + t_queuing
  • t_seek (seek time): move the head to the track
  • t_rotation (rotational latency): wait for the sector to come around
  • Average t_rotation = 0.5 / RPS, where RPS = RPM / 60
  • t_transfer (transfer time): read the disk
  • rate_transfer = bytes/sector × sectors/track × RPS
  • t_transfer = bytes transferred / rate_transfer
  • t_controller (controller delay): wait for the controller to do its thing
  • t_queuing (queuing delay): wait for older requests to finish

7
Example Seagate
  • Cheetah 73LP
  • Model Number: ST373405LW
  • Capacity: 73.4 GB
  • Speed: 10,000 rpm
  • Seek time: 5.1 ms avg
  • Interface: Ultra160 SCSI
  • Suggested Resale Price: $980.00

8
Disk Performance Example
  • Parameters
  • 3600 RPM → 60 RPS
  • Avg seek time 9 ms
  • 100 sectors per track, 512 bytes per sector
  • Controller + queuing delays: 1 ms
  • Q: average time to read 1 sector?
  • rate_transfer = 100 sectors/track × 512 B/sector × 60 RPS ≈ 3 MB/s (24 Mb/s)
  • t_transfer = 512 B / rate_transfer ≈ 0.16 ms
  • t_rotation = 0.5 / 60 RPS ≈ 8.3 ms
  • t_disk = 9 ms + 8.3 ms + 0.2 ms + 1 ms = 18.5 ms
  • t_transfer is only a small component!!
  • End of story? No! t_queuing is not fixed (gets longer with more requests)
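
As a quick check of the arithmetic above, here is a minimal Python sketch of the service-time model; the parameter values are the ones assumed in this example, not those of any particular drive.

    # Disk service time: t_disk = t_seek + t_rotation + t_transfer + t_controller/queuing
    RPM = 3600
    RPS = RPM / 60                      # 60 revolutions per second
    sectors_per_track = 100
    bytes_per_sector = 512
    t_seek = 9e-3                       # average seek time (s)
    t_ctrl_queue = 1e-3                 # controller + queuing delay (s)

    rate_transfer = sectors_per_track * bytes_per_sector * RPS   # ~3 MB/s (24 Mb/s)
    t_rotation = 0.5 / RPS                                       # half a revolution on average
    t_transfer = bytes_per_sector / rate_transfer                # time to read one sector

    t_disk = t_seek + t_rotation + t_transfer + t_ctrl_queue
    print(f"t_disk = {t_disk * 1e3:.1f} ms")                     # ~18.5 ms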

9
Disk Performance Queuing Theory
  • I/O is a queuing system
  • Equilibrium: rate_arrival = rate_departure
  • Total time: t_system = t_queue + t_server
  • rate_arrival × t_system = length_system (Little's Law)
  • utilization_server = t_server × rate_arrival
  • The important result (derivation in H&P):
  • t_system = t_server / (1 - utilization_server)
  • If the server is highly utilized, t_system gets VERY HIGH
  • Lesson: keep utilization low (below 75%)
  • Q: what is the new t_disk if the disk is 50% utilized?
  • t_disk_new = t_disk_old / (1 - 0.5) = 37 ms
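
A short sketch of that result; t_server is taken from the previous example, and the utilization values are made up to show the trend.

    # t_system = t_server / (1 - utilization): response time explodes as the
    # disk approaches full utilization.
    t_server = 18.5e-3                          # disk service time (s)

    for utilization in (0.10, 0.50, 0.75, 0.90, 0.95):
        t_system = t_server / (1.0 - utilization)
        print(f"utilization {utilization:.0%}: t_system = {t_system * 1e3:6.1f} ms")
    # 50% utilization -> ~37 ms; 95% utilization -> ~370 ms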

10
Disk Usage Models
  • Data mining / supercomputing
  • Large files, sequential reads
  • Raw data transfer rate is most important
  • Transaction processing
  • Large files, but random access, many small requests
  • IOPS is most important
  • Time-sharing filesystems
  • Small files, sequential accesses, potential for file caching
  • IOPS is most important
  • Must design the disk (I/O) system based on the target workload
  • Use disk benchmarks (they exist)

11
Disk Alternatives
  • Solid state disk (SSD)
  • DRAM + battery backup with a standard disk interface
  • Fast: no seek time, no rotation time, fast transfer rate
  • Expensive
  • FLASH memory
  • Fast: no seek time, no rotation time, fast transfer rate
  • Non-volatile
  • Slow: bulk erase before write
  • Wears out over time
  • Optical disks (CDs)
  • Cheap if write-once, expensive if write-multiple
  • Slow

12
Extensions to Conventional Disks
  • Increasing density: more sensitive heads, finer control
  • Increases cost
  • Fixed head: one head per track
  • Seek time eliminated
  • Low track density
  • Parallel transfer: simultaneous read from multiple platters
  • Difficulty in locking onto different tracks on multiple surfaces
  • Lower-cost alternatives possible (disk arrays)

13
More Extensions to Conventional Disks
  • Disk caches: disk-controller RAM buffers data
  • Fast writes: RAM acts as a write buffer
  • Better utilization of the host-to-device path
  • A high miss rate increases request latency
  • Disk scheduling: schedule requests to reduce latency
  • e.g., schedule the request with the shortest seek time
  • e.g., elevator algorithm for seeks (head sweeps back and forth)
  • Works best for unlikely cases (long queues)

14
Disk Arrays
  • Collection of individual disks (D = # of disks)
  • Distribute data across disks
  • Access in parallel for higher bandwidth (IOPS)
  • Issue: data distribution → load balancing
  • e.g., 3 disks, 3 files (A, B, and C), each 2 sectors long

[Figure: placement of sectors A0-A1, B0-B1, C0-C1 on the 3 disks when undistributed (each file on one disk), coarse-grain striped (whole sectors placed round-robin across the disks), and fine-grain striped (each sector spread across all disks)]
15
Disk Arrays Stripe Width
  • Fine-grain striping
  • Stripe width (D) evenly divides the smallest accessible unit of data (a sector)
  • Only one request served at a time
  • Perfect load balance
  • Effective transfer rate approximately D times better than a single disk
  • Access time can go up unless disks are synchronized (disk skew)
  • Coarse-grain striping
  • Data transfer parallelism for large requests
  • Concurrency for small requests (several small requests at once)
  • Statistical load balance
  • Must consider the workload to determine stripe width (see the sketch below)
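
As an illustration of coarse-grain striping, the following sketch maps a logical block number to a (disk, block-on-disk) pair; the disk count and stripe-unit size are arbitrary example values.

    # Round-robin (coarse-grain) striping: consecutive stripe units go to
    # consecutive disks.  stripe_unit = 1 degenerates to fine-grain striping.
    def locate(logical_block, num_disks=3, stripe_unit=4):
        stripe, offset = divmod(logical_block, stripe_unit)
        disk = stripe % num_disks                            # which disk holds this unit
        block_on_disk = (stripe // num_disks) * stripe_unit + offset
        return disk, block_on_disk

    for lb in range(12):
        print(lb, locate(lb))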

16
Disk Redundancy and RAIDs
  • Disk failures are a significant fraction of all hardware failures
  • Electrical failures are rare; mechanical failures are more common
  • Striping increases the number of files touched by a failure
  • Fix with replication and/or parity protection
  • RAID: redundant array of inexpensive disks
  • Arrays of cheap disks provide high performance and dependability
  • MTTF is high and MTTR is low → redundancy can increase reliability significantly

17
Reliability, Availability, and Dependability
  • Reliability and availability are measures of dependability
  • Reliability: a measure of continuous service accomplishment
  • MTTF = Mean Time To Failure
  • Rate of failure = 1 / MTTF
  • Service interruption is measured as MTTR = Mean Time To Repair
  • If a collection of modules have exponentially distributed lifetimes,
    the overall failure rate is the sum of the failure rates of the modules
  • Availability: a measure of service accomplishment
  • Availability = MTTF / (MTTF + MTTR)
  • MTBF = Mean Time Between Failures = MTTF + MTTR
  • Widely used
  • MTTF is often more appropriate

18
Array Reliability
Reliability of N disks = Reliability of 1 disk / N
= 1,200,000 hours / 100 disks = 12,000 hours
1 year = 365 × 24 ≈ 8,760 hours
Disk system MTTF drops from about 140 years to about 1.5 years!
Arrays (without redundancy) are too unreliable to be useful!
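The same arithmetic as a small sketch, using the slide's example numbers (1,200,000-hour disk MTTF, 100 disks) plus an assumed 24-hour MTTR for the availability line.

    # With exponentially distributed lifetimes, failure rates add:
    # MTTF_array = MTTF_disk / N for an array with no redundancy.
    mttf_disk = 1_200_000            # hours
    n_disks = 100
    hours_per_year = 365 * 24        # ~8,760

    mttf_array = mttf_disk / n_disks
    print(f"1 disk:    {mttf_disk / hours_per_year:5.1f} years MTTF")   # ~137 years
    print(f"100 disks: {mttf_array / hours_per_year:5.1f} years MTTF")  # ~1.4 years

    # Availability = MTTF / (MTTF + MTTR)
    mttr = 24                        # hours (assumed repair time)
    print(f"array availability: {mttf_array / (mttf_array + mttr):.4f}")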
19
Redundant Arrays of Independent Disks
Files are "striped" across multiple spindles.
Redundancy yields high data availability.
  • Disks will fail!
  • Contents reconstructed from data redundantly stored in the array
  • Capacity penalty to store it
  • Bandwidth penalty to update it

Techniques: mirroring/shadowing (high cost) and parity.
20
RAID Levels
  • 6 levels of RAID, depending on redundancy/concurrency (D = # of data disks, C = # of check disks)
  • Level 0: nonredundant striping (C = 0); widely used
  • Level 1: full mirroring (C = D)
  • Level 2: memory-style ECC (e.g., D = 8, C = 4); not used
  • Level 3: bit-interleaved parity (e.g., D = 8, C = 1)
  • Level 4: block-interleaved parity (e.g., D = 8, C = 1)
  • Level 5: block-interleaved distributed parity (e.g., D = 8, C = 1); most widely used
  • Level 6: two-dimensional error bits (e.g., D = 8, C = 2); not presently available

21
RAID1 Disk Mirroring/Shadowing
  • Each disk is fully duplicated onto its "shadow" in a recovery group → high availability
  • Bandwidth sacrifice on write: a logical write → two physical writes
  • Reads may be optimized
  • Most expensive solution: 100% capacity overhead
  • Targeted for high-I/O-rate, high-availability environments
  • Probability of failure (assuming 24-hour MTTR): 24 / (1.2 × 10^6 × 1.2 × 10^6) ≈ 6.9 × 10^-13 → ~170,000,000 years
22
RAID 3 Parity Disk
[Figure: a logical record (10010011 11001101 10010011 ...) is bit-striped across the data disks as physical records, with the parity disk P computed across them]
  • Parity is computed across the recovery group to protect against hard disk failures
  • 33% capacity cost for parity in this configuration
  • Wider arrays reduce the capacity cost, but decrease expected availability and increase reconstruction time
  • Arms logically synchronized, spindles rotationally synchronized: logically a single high-capacity, high-transfer-rate disk
  • Targeted for high-bandwidth applications: scientific computing, image processing
23
RAID 3 Write Update
RAID-3 small write algorithm: 1 logical write → 3 reads + 2 writes.
[Figure: the new data D0' is XORed with the other data blocks D1, D2, D3 (read from disk) to produce the new parity P'; D0' and P' are then written. Involves all the disks.]
24
RAID 4 Write Update
RAID-5 small write algorithm: 1 logical write → 2 reads + 2 writes.
[Figure: read the old data D0 and old parity P; XOR the old data with the new data D0', then XOR with P to produce the new parity P'; write D0' and P'.]
  • Involves just two disks
  • Fewer read/write ops
  • Increasing the size of the parity group increases the savings
  • Bottleneck: the parity disk is updated on every write
  • Spread the parity info across all disks → RAID 5

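A minimal sketch of the small-write parity update just described; the block contents are invented one-byte examples.

    # RAID-4/5 small write: new parity = old parity XOR old data XOR new data,
    # so only the data disk and the parity disk are touched (2 reads + 2 writes).
    def new_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
        return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

    d0_old = bytes([0b10010011])
    d0_new = bytes([0b01100110])
    p_old  = bytes([0b00111100])     # parity over D0..D3 before the write

    p_new = new_parity(d0_old, d0_new, p_old)
    print(f"{p_new[0]:08b}")         # still equals D0' ^ D1 ^ D2 ^ D3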
25
RAID 5 High I/O Rate Parity
[Figure: data blocks D0-D23 laid out in stripe units across the disk columns, with logical disk addresses increasing across the array and the parity block P for each stripe rotated onto a different disk]
A logical write becomes four physical I/Os.
Multiple writes can occur simultaneously as long as the stripe units are not located on the same disks.
Targeted for mixed applications.
26
Subsystem Organization
[Figure: host → host adapter (interface to host, DMA) → array controller (control, buffering, parity logic, cache) → single-board disk controllers (physical device control) → disks]
No application modifications; no reduction of host performance.
27
System Availability Orthogonal RAIDs
Fault-tolerant scheme to protect against string
faults as well as disk faults
28
System Level Availability
Fully redundant: no single point of failure.
[Figure: duplicated hosts, I/O controllers, and cache/array controllers, with the disks organized into recovery groups reachable over either path]
With duplicated paths, higher performance when there are no failures.
29
Basic Computer Structure
[Diagram: CPU and Memory on the Memory Bus (System Bus), connected through a Bridge to the I/O Bus, which hosts devices such as a NIC]
30
A Typical PC Bus Structure
31
PC Bus View
[Figure: the Processor/Memory Bus bridges to the PCI Bus, which in turn bridges to slower I/O Buses]
32
I/O and Memory Buses
Memory buses are designed for speed (usually a custom design); I/O buses are designed for compatibility and low cost (usually an industry standard).
33
Buses
                 Network           Channel           Backplane
Connects         Machines          Devices           Chips
Distance         > 1000 m          10-100 m          0.1 m
Bandwidth        10-1000 Mb/s      40-1000 Mb/s      320-2000 Mb/s
Latency          high (> 1 ms)     medium            low (nanosecs)
Reliability      low               medium            high
Error handling   Extensive CRC     Byte Parity       Byte Parity

Network/channel: message-based, narrow pathways, distributed arbitration.
Backplane: memory-mapped, wide pathways, centralized arbitration.
34
What Defines A Bus?
Transaction Protocol
Timing and Signaling Specification
Bunch of Wires
Electrical Specification
Physical / Mechanical Characteristics (the connectors)
  • Glue that interfaces computer system components

35
Bus Issues
  • Clocking: is the bus clocked?
  • Synchronous: clocked, little logic → fast (includes a clock in the control lines)
  • All devices need to run at the same clock rate
  • To avoid clock skew, buses cannot be long if they are fast
  • Asynchronous: no clock, use handshaking instead → slow
  • Can accommodate a wide range of devices
  • Can be lengthened without worrying about clock skew
  • Switching: when is control of the bus acquired and released?
  • Atomic: bus held until the request completes → slow
  • Split-transaction: bus free between request and reply → fast

36
More Bus Issues
  • Arbitration: deciding who gets the bus next
  • Chaos is avoided by a master-slave arrangement
  • Processor is the only bus master → CPU involved in every transaction
  • Overlap arbitration for the next master with the current transfer
  • Daisy chain: closer devices have priority → simple but slow
  • Distributed: wired-OR, low-priority back-off → medium
  • Other issues
  • Split data/address lines, width, burst transfer

[Figure: daisy-chain arbitration; the bus arbiter's grant line passes from Device 1 (highest priority) through Device 2 to Device N (lowest priority); devices share the request and release lines. Order: request, grant, release.]
37
Asynchronous Handshake Write Operation
[Timing diagram: Address, Data, Read, Request, and Acknowledge lines over times t0-t5; the master asserts the address and data, then the next address after the handshake completes]
  • t0: Master has obtained control and asserts address, direction (not read), and data; waits a specified amount of time for slaves to decode the target
  • t1: Master asserts the request line
  • t2: Slave asserts ack, indicating data received
  • t3: Master releases req
  • t4: Slave releases ack
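
A toy model of this four-phase handshake, with Python threading events standing in for the Request/Acknowledge wires and a dict standing in for the address/data lines (all names are illustrative, not real bus timing).

    import threading

    req_set, req_clr, ack_set = threading.Event(), threading.Event(), threading.Event()
    bus = {}                                    # stands in for the address/data lines

    def master():
        bus["addr"], bus["data"] = 0x40, 0xAB   # t0: drive address, direction, data
        req_set.set()                           # t1: assert Request
        ack_set.wait()                          # t2: slave has latched the data
        req_clr.set()                           # t3: release Request

    def slave():
        req_set.wait()                          # see Request asserted
        print(f"slave latched {bus['data']:#x} at address {bus['addr']:#x}")
        ack_set.set()                           # t2: assert Acknowledge
        req_clr.wait()                          # wait for Request to drop
        # t4: release Acknowledge; the bus is free for the next transfer

    t = threading.Thread(target=slave)
    t.start()
    master()
    t.join()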

38
Asynchronous Handshake Read Operation
[Timing diagram: Address, Data, Read, Req, and Ack lines over times t0-t5; the master asserts the address, the slave drives the data, then the master moves on to the next address]
  • t0: Master has obtained control and asserts address and direction; waits a specified amount of time for slaves to decode the target
  • t1: Master asserts the request line
  • t2: Slave asserts ack, indicating it is ready to transmit data
  • t3: Master releases req, data received
  • t4: Slave releases ack

39
Example PCI Read/Write Transactions
  • All signals sampled on the rising edge
  • Centralized parallel arbitration
  • Overlapped with the previous transaction
  • All transfers are (unlimited) bursts
  • Address phase starts by asserting FRAME
  • Next cycle, the initiator asserts the command and address
  • Data transfers happen when:
  • IRDY asserted by the master when ready to transfer data
  • TRDY asserted by the target when ready to transfer data
  • Transfer occurs when both are asserted on a rising edge
  • FRAME is de-asserted when the master intends to complete only one more data transfer

40
Example PCI Read Transaction
A turn-around cycle is required on any signal driven by more than one agent.
41
Who Does I/O?
  • Main CPU (programmed I/O)
  • Explicitly executes all I/O operations
  • High overhead, potential cache pollution
  • Memory-mapped I/O
  • Physical addresses are set apart (no real memory there!)
  • When the processor sees these addresses, it routes the access to the I/O processor
  • I/O Processor (IOP or channel processor)
  • Special or general processor dedicated to I/O operations
  • Fast
  • May be overkill; cache coherence problems
  • DMAC (direct memory access controller)
  • Can transfer data to/from memory given a start address
  • Fast, usually simple
  • Still may have coherence problems; must be on the memory bus

42
Programmed I/O vs. DMA
  • Programmed I/O is OK for sending commands, receiving status, and communicating small amounts of data
  • Inefficient for large amounts of data
  • Keeps the CPU busy during the transfer
  • Programmed I/O → memory operations → slow
  • Direct Memory Access
  • Devices read/write directly from/to memory
  • Memory → device: typically initiated by the CPU
  • Device → memory: can be initiated by either the device or the CPU

43
Programmed I/O vs. DMA
[Figure: three CPU-memory-interconnect configurations showing the data path for programmed I/O (through the CPU), DMA, and DMA device → memory transfers]
44
Six Steps to Perform DMA Transfer
45
Communicating with I/O Processors
  • Not an issue if the main CPU performs I/O by itself
  • I/O control: how to initialize the DMAC/IOP?
  • Memory-mapped, VM-protected addresses
  • Privileged I/O instructions
  • I/O completion: how does the CPU know the DMAC/IOP is finished?
  • Polling: periodically check a status bit → slow
  • Interrupt: I/O completion interrupts the CPU → fast
  • Q: Do the DMAC/IOP use physical or virtual addresses?
  • Physical: simpler, but can only transfer 1 page at a time. Why?
  • Virtual: more powerful, but the DMAC/IOP needs a TLB

46
Polling Vs. Interrupts
  • Polling
  • Busy-wait cycle to wait for I/O from a device
  • Inefficient unless the device is very fast
  • Interrupts
  • CPU interrupt-request line triggered by an I/O device
  • Interrupt handler receives interrupts
  • Maskable to ignore or delay some interrupts
  • Interrupt vector to dispatch the interrupt to the correct handler
  • Based on priority
  • Some are non-maskable
  • The interrupt mechanism is also used for exceptions

47
Example Interrupt Vs. DMA
  • 1000 transfers at 1000 bytes each
  • Interrupt mechanism
  • 1000 interrupts @ 2 µs per interrupt
  • 1000 interrupt services @ 98 µs each
  • → 0.1 CPU sec
  • Device transfer rate 10 MB/s → 1000 bytes take 100 µs
  • 1000 transfers × 100 µs = 0.1 CPU seconds
  • Total: 0.2 CPU seconds (50% overhead due to interrupt handling)
  • DMA
  • 1 DMA setup sequence @ 50 µs
  • 1 interrupt @ 2 µs
  • 1 interrupt service sequence @ 48 µs
  • Total: 0.0001 seconds of CPU time
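
The same accounting in a few lines of Python, using the per-event costs quoted on this slide:

    us = 1e-6
    n_transfers = 1000

    # Interrupt-driven: one interrupt (2 us) + one service routine (98 us) per transfer,
    # plus the 100 us the CPU spends on each 1000-byte transfer at 10 MB/s.
    interrupt_cpu = n_transfers * (2 + 98) * us        # 0.1 s
    transfer_cpu  = n_transfers * 100 * us             # 0.1 s
    print(f"interrupt-driven: {interrupt_cpu + transfer_cpu:.4f} CPU seconds")   # 0.2000

    # DMA: one setup (50 us) + one completion interrupt (2 us) + one service (48 us), total.
    dma_cpu = (50 + 2 + 48) * us
    print(f"DMA: {dma_cpu:.4f} CPU seconds")           # 0.0001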

48
Other Issues/Trends
  • Block servers versus filers: where is the file illusion maintained?
  • Traditional answer: the server
  • Access storage as disk blocks and maintain metadata
  • Use a file cache
  • Alternative: the disk subsystem maintains the file abstraction
  • Server uses a file system protocol to communicate
  • Ex: Network File System (NFS) for UNIX, Common Internet File System (CIFS) for Windows
  • Referred to as Network Attached Storage (NAS) devices
  • Switches replacing buses
  • Moore's Law is driving the cost of switch components down
  • Replace the bus with point-to-point links and switches
  • Switched networks provide higher aggregate bandwidth
  • Point-to-point links can be longer
  • Ex: InfiniBand delivers 2-24 Gbps and a link can go up to 17 m (PCI: 0.5 m)

49
I/O System Example
  • Given
  • 500 MIPS CPU
  • 16-byte-wide, 100 ns memory system
  • 10,000 instructions per I/O
  • 16 KB per I/O
  • 200 MB/s I/O bus, with room for 20 SCSI-2 controllers
  • SCSI-2 strings: 20 MB/s with 15 disks per bus
  • SCSI-2: 1 ms overhead per I/O
  • 7,200 RPM (120 RPS), 8 ms avg seek, 6 MB/s transfer disks
  • 200 GB total storage
  • Q: Choose 2 GB or 8 GB disks for maximum IOPS?
  • How to arrange disks and controllers?

Similar example in the book on page 744
50
I/O System Example (contd)
  • Step 1: Calculate CPU, memory, and I/O bus peak IOPS
  • CPU: 500 MIPS / (10,000 instructions/IO) = 50,000 IOPS
  • Memory: (16 bytes / 100 ns) / (16 KB/IO) = 10,000 IOPS
  • I/O bus: (200 MB/s) / (16 KB/IO) = 12,500 IOPS
  • Memory bus is the bottleneck with 10,000 IOPS!
  • Step 2: Calculate disk IOPS
  • t_disk = 8 ms + 0.5/(120 RPS) + 16 KB/(6 MB/s) ≈ 15 ms
  • Disk: 1 / 15 ms ≈ 67 IOPS
  • 8 GB disks → need 25 → 25 × 67 IOPS = 1,675 IOPS
  • 2 GB disks → need 100 → 100 × 67 IOPS = 6,700 IOPS
  • 100 × 2 GB disks with 6,700 IOPS are the new bottleneck!
  • Answer I: 100 × 2 GB disks!

51
I/O System Example (contd)
  • Step 3: Calculate SCSI-2 controller peak IOPS
  • t_SCSI-2 = 1 ms + 16 KB / (20 MB/s) ≈ 1.8 ms
  • SCSI-2: 1 / 1.8 ms ≈ 556 IOPS
  • Step 4: How many disks per controller?
  • 556 IOPS / 67 IOPS ≈ 8 disks per controller
  • Step 5: How many controllers?
  • 100 disks / (8 disks/controller) ≈ 13 controllers
  • Answer II: 13 controllers, 8 disks each
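
The whole sizing exercise condensed into a sketch (decimal KB/MB, as on the slides):

    KB, MB = 1e3, 1e6

    # Step 1: peak IOPS of CPU, memory, and I/O bus
    cpu_iops = 500e6 / 10_000                    # 50,000
    mem_iops = (16 / 100e-9) / (16 * KB)         # 10,000
    bus_iops = 200 * MB / (16 * KB)              # 12,500

    # Step 2: per-disk IOPS (seek + rotation + transfer)
    t_disk = 8e-3 + 0.5 / 120 + 16 * KB / (6 * MB)    # ~15 ms
    disk_iops = 1 / t_disk                            # ~67

    # Steps 3-5: SCSI-2 controllers
    t_scsi = 1e-3 + 16 * KB / (20 * MB)               # ~1.8 ms
    scsi_iops = 1 / t_scsi                            # ~556
    n_disks = 100                                     # 100 x 2 GB = 200 GB
    disks_per_ctrl = int(scsi_iops // disk_iops)      # 8
    n_ctrl = -(-n_disks // disks_per_ctrl)            # ceil(100 / 8) = 13

    print(round(disk_iops * n_disks), disks_per_ctrl, n_ctrl)   # ~6,700 IOPS, 8, 13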

52
Summary
  • Disks
  • Parameters, performance, RAID
  • Buses
  • I/O vs. memory
  • I/O system architecture
  • CPU vs. DMAC vs. IOP