Title: Chapter 7 Storage Systems
Outline
- Introduction
- Types of Storage Devices
- Buses Connecting I/O Devices to CPU/Memory
- Reliability, Availability, and Dependability
- RAID Redundant Arrays of Inexpensive Disks
- Errors and Failures in Real Systems
- I/O Performance Measures
- A Little Queuing Theory
- Benchmarks of Storage Performance and Availability
- Crosscutting Issues
- Designing an I/O System
7.1 Introduction
Motivation: Who Cares About I/O?
- CPU performance: 2x every 18 months
- I/O performance limited by mechanical delays (disk I/O): < 10% per year (I/Os per sec or MB per sec)
- Amdahl's Law: system speed-up is limited by the slowest part!
- 10% I/O, 10x faster CPU → 5x overall performance (lose 50%)
- 10% I/O, 100x faster CPU → 10x overall performance (lose 90%)
- I/O bottleneck
- Diminishing fraction of time in CPU
- Diminishing value of faster CPUs
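The two speedup figures above follow directly from Amdahl's Law; a minimal sketch (the function name is ours):

```python
def overall_speedup(io_fraction, cpu_speedup):
    """Amdahl's Law: only the CPU portion (1 - io_fraction) is accelerated."""
    return 1.0 / (io_fraction + (1.0 - io_fraction) / cpu_speedup)

print(round(overall_speedup(0.10, 10), 2))   # → 5.26 (about 5x, half lost)
print(round(overall_speedup(0.10, 100), 2))  # → 9.17 (about 10x, 90% lost)
```

No matter how fast the CPU gets, the overall speedup can never exceed 1/io_fraction = 10x here.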
Position of I/O in Computer Architecture: Past
- An orphan in the architecture domain
- I/O meant the non-processor and memory stuff
- Disk, tape, LAN, WAN, etc.
- Performance was not a major concern
- Devices characterized as
- Extraneous, non-priority, infrequently used, slow
- Exception is swap area of disk
- Part of the memory hierarchy
- Hence part of system performance, but you're hosed if you use it often
Position of I/O in Computer Architecture: Now
- Trends: I/O is the bottleneck
- Communication is frequent
- Voice response transaction systems, real-time video
- Multimedia expectations
- Even standard networks come in gigabit/sec flavors
- For multi-computers
- Result
- Significant focus on system bus performance
- Common bridge to the memory system and the I/O systems
- Critical performance component for SMP server platforms
System vs. CPU Performance
- Care about the speed at which user jobs get done
- Throughput: how many jobs per unit time (system view)
- Latency: how quick for a single job (user view)
- Response time: time between when a command is issued and results appear (user view)
- CPU performance is the main factor when
- The job mix fits in memory → there are very few page faults
- I/O performance is the main factor when
- The job is too big for memory: paging is dominant
- The job reads/writes/creates a lot of unexpected files
- OLTP, decision support: databases
- And then there are graphics and specialty I/O devices
System Performance
- Depends on many factors in the worst case
- CPU
- Compiler
- Operating System
- Cache
- Main Memory
- Memory-IO bus
- I/O controller or channel
- I/O drivers and interrupt handlers
- I/O devices there are many types
- Level of autonomous behavior
- Amount of internal buffer capacity
- Device-specific parameters for latency and throughput
I/O Systems
[Figure: processor with cache connected by a memory-I/O bus (the two buses may be the same or different) to main memory and to I/O controllers for graphics, disks, and a network; controllers signal the processor with interrupts]
Keys to a Balanced System
- It's all about overlap: I/O vs. CPU
- Time_workload = Time_CPU + Time_I/O - Time_overlap
- Consider the benefit of just speeding up one component
- Amdahl's Law (see P4 as well)
- Latency vs. Throughput
I/O System Design Considerations
- Depends on type of I/O device
- Size, bandwidth, and type of transaction
- Frequency of transaction
- Defer vs. do now
- Appropriate memory bus utilization
- What should the controller do
- Programmed I/O
- Interrupt vs. polled
- Priority or not
- DMA
- Buffering issues - what happens on over-run
- Protection
- Validation
Types of I/O Devices
- Behavior
- Read, Write, Both
- Once, multiple
- Size of average transaction
- Bandwidth
- Latency
- Partner: the speed-of-the-slowest-link theory
- Human operated (interactive or not)
- Machine operated (local or remote)
Some I/O Device Characteristics
Is I/O Important?
- Depends on your application
- Business: disks for file-system I/O
- Graphics: graphics cards or special co-processors
- Parallelism: the communications fabric
- Our focus: mainline uniprocessing
- Storage subsystems (Chapter 7)
- Networks (Chapter 8)
- Noteworthy Point
- The traditional orphan
- But now often viewed more as a front line topic
7.2 Types of Storage Devices
Magnetic Disks
- Two important roles
- Long-term, non-volatile storage: file system and OS
- Lowest level of the memory hierarchy
- Most of virtual memory is physically resident on disk
- Long viewed as a bottleneck
- Mechanical system → slow
- Hence they seem an easy target for improved technology
- Disk improvements w.r.t. density have done better than Moore's Law
Disks are organized into platters, tracks, and sectors
- Platters: 1-12, with 2 sides each
- Tracks: 5,000-30,000 per surface
- Sectors: 100-500 per track
- A sector is the smallest unit that can be read or written
Physical Organization Options
- Platters: one or many
- Density: fixed or variable
- (Do all tracks have the same number of sectors?)
- Organization: sectors, cylinders, and tracks
- Actuators: 1 or more
- Heads: 1 per track or 1 per actuator
- Access: seek time vs. rotational latency
- Seek time is related to distance, but not linearly
- Typical rotation: 3,600 to 15,000 RPM
- Diameter: 1.0 to 3.5 inches
Typical Physical Organization
- Multiple platters
- Metal disks covered with magnetic recording material on both sides
- Single actuator (since actuators are expensive)
- Single R/W head per arm
- One arm per surface
- All heads therefore over the same cylinder
- Fixed sector size
- Variable-density encoding
- Disk controller: usually a built-in processor plus buffering
Characteristics of Three Magnetic Disks of 2000
Anatomy of a Read Access
- Steps
- Memory-mapped I/O over the bus to the controller
- Controller starts the access
- Seek + rotational-latency wait
- Sector is read and buffered (validity checked)
- Controller says ready, or DMAs to main memory and then says ready
Access Time
- Access time components
- Seek time: time to move the arm over the proper track
- Very non-linear: acceleration and deceleration times complicate it
- Rotational latency (delay): time for the requested sector to rotate under the head (on average, 0.5 rotations)
- Transfer time: time to transfer a block of bits (typically a sector) under the read/write head
- Controller overhead: the overhead the controller imposes in performing an I/O access
- Queuing delay: time spent waiting for the disk to become free
Access Time Example
- Assumptions: average seek time = 5 ms; transfer rate = 40 MB/sec; 10,000 RPM; controller overhead = 0.1 ms; no queuing delay
- What is the average time to read or write a 512-byte sector?
- Answer: 5 ms (seek) + 3 ms (half a rotation at 10,000 RPM) + about 0.013 ms (512 bytes at 40 MB/sec) + 0.1 ms (controller) ≈ 8.1 ms
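The answer can be checked mechanically from the stated assumptions (variable names are ours):

```python
# Average time to read/write a 512-byte sector, per the slide's assumptions.
seek_ms = 5.0                              # average seek time
rotation_ms = 0.5 / (10_000 / 60) * 1000   # half a rotation at 10,000 RPM
transfer_ms = 512 / 40e6 * 1000            # 512 bytes at 40 MB/sec
controller_ms = 0.1                        # controller overhead

total = seek_ms + rotation_ms + transfer_ms + controller_ms
print(round(total, 2))  # → 8.11
```

Note how seek and rotation dominate: the transfer itself is two orders of magnitude smaller.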
Cost vs. Performance
- Large-diameter drives have much more data over which to amortize the cost of electronics → lowest cost per GB
- Higher sales volume → lower manufacturing cost
- The 3.5-inch drive, the largest surviving drive in 2001, also has the highest sales volume, so it unquestionably has the best price per GB
Future of Magnetic Disks
- Areal density (bits per unit area) is the common improvement metric
- Trends
- Until 1988: 29% improvement per year
- 1988-1996: 60% per year
- 1997-2001: 100% per year
- 2001
- 20 billion bits per square inch
- 60 billion bits per square inch demonstrated in labs
Disk Price Trends by Capacity
Disk Price Trends: Dollars per MB
Cost vs. Access Time for SRAM, DRAM, and Magnetic Disk
Disk Alternatives
- Optical disks
- Optical compact disks (CD): 0.65 GB
- Digital video discs / digital versatile disks (DVD): 4.7 GB x 2 sides
- Rewritable CD (CD-RW) and write-once CD (CD-R)
- Rewritable DVD (DVD-RAM) and write-once DVD (DVD-R)
- Robotic tape storage
- Optical jukeboxes
- Tapes: DAT, DLT
- Flash memory
- Good for embedded systems
- Non-volatile storage and rewritable ROM
7.3 Buses Connecting I/O Devices to CPU/Memory
I/O Connection Issues
Connecting the CPU to the I/O device world
- Shared communication link between subsystems
- Typical choice is a bus
- Advantages
- Shares a common set of wires and protocols → low cost
- Often based on a standard (PCI, SCSI, etc.) → portability and versatility
- Disadvantages
- Poor performance
- Multiple devices imply arbitration and therefore contention
- Can be a bottleneck
I/O Connection Issues: Multiple Buses
- I/O bus
- Lengthy
- Many types of connected devices
- Wide range in device bandwidth
- Follows a bus standard
- Accepts devices varying in latency and bandwidth capabilities
- CPU-memory bus
- Short
- High speed
- Matched to the memory system to maximize CPU-memory bandwidth
- Knows all the types of devices that must connect together
Typical Bus: Synchronous Read Transaction
Bus Design Decisions
- Other things to standardize as well
- Connectors
- Voltage and current levels
- Physical encoding of control signals
- Protocols for good citizenship
Bus Design Decisions (Cont.)
- Bus master: a device that can initiate a R/W transaction
- Multiple masters: multiple CPUs and I/O devices initiate bus transactions
- Multiple bus masters need arbitration (fixed priority or random)
- Split transactions for multiple masters
- Use packets for the full transaction (does not hold the bus)
- A read transaction is broken into read-request and memory-reply transactions
- Makes the bus available for other masters while the data is read/written from/to the specified address
- Transactions must be tagged
- Higher bandwidth, but also higher latency
Split Transaction Bus
Bus Design Decisions (Cont.)
- Clocking: synchronous vs. asynchronous
- Synchronous
- Include a clock in the control lines, and a fixed protocol for address and data relative to the clock
- Fast and inexpensive (little or no logic to determine what's next)
- Everything on the bus must run at the same clock rate
- Short length (due to clock skew)
- Used for CPU-memory buses
- Asynchronous
- Easier to connect a wide variety of devices, and to lengthen the bus
- Scales better with technological changes
- Used for I/O buses
Synchronous or Asynchronous?
Standards
- The Good
- Lets computer and I/O-device designers work independently
- Provides a path for second-party (e.g. cheaper) competition
- The Bad
- Standards can become major performance anchors
- They inhibit change
- How to create a standard
- Bottom-up
- A company tries to get a standards committee to approve its latest philosophy in hopes that it will get the jump on the others (e.g. S-bus, PC-AT bus, ...)
- De facto standards
- Top-down
- Design by committee (PCI, SCSI, ...)
Some Sample I/O Bus Designs
Some Sample Serial I/O Buses
Often used in embedded computers
CPU-Memory Buses Found in 2001 Servers
Crossbar Switch
Connecting the I/O Bus
- To main memory
- The I/O bus and the CPU-memory bus may be the same
- I/O commands on the bus could interfere with the CPU's memory accesses
- Since cache misses are rare, this does not tend to stall the CPU
- The problem is lack of coherency
- Currently, we consider this case
- To cache
- Access
- Memory-mapped I/O or distinct instructions (I/O opcodes)
- Interrupts vs. polling
- DMA or not
- Autonomous control allows overlap and latency hiding
- However, there is a cost impact
A Typical Interface of I/O Devices and an I/O Bus to the CPU-Memory Bus
Processor Interface Issues
- Processor interface
- Interrupts
- Memory-mapped I/O
- I/O control structures
- Polling
- Interrupts
- DMA
- I/O controllers
- I/O processors
- Capacity, access time, bandwidth
- Interconnections
- Buses
I/O Controller
[Figure: the CPU sends commands and I/O addresses to the controller, and receives ready/done/error status and interrupts in return]
Memory-Mapped I/O
Some portions of the memory address space are assigned to I/O devices. Reads and writes to those addresses cause data transfers.
Programmed I/O
- Polling
- The I/O module performs the action on behalf of the processor
- But the I/O module does not interrupt the CPU when the I/O is done
- The processor is kept busy checking the status of the I/O module
- Not an efficient way to use the CPU unless the device is very fast!
- Byte by byte
Interrupt-Driven I/O
- The processor is interrupted when the I/O module is ready to exchange data
- The processor is free to do other work
- No needless waiting
- Still consumes a lot of processor time, because every word read or written passes through the processor and requires an interrupt
- One interrupt per byte
Direct Memory Access (DMA)
- The CPU issues a request to a DMA module (a separate module, or incorporated into the I/O module)
- The DMA module transfers a block of data directly to or from memory (without going through the CPU)
- An interrupt is sent when the transfer is complete
- Only one interrupt per block, rather than one interrupt per byte
- The CPU is involved only at the beginning and end of the transfer
- The CPU is free to perform other tasks during the data transfer
Input/Output Processors
7.4 Reliability, Availability, and Dependability
Dependability, Faults, Errors, and Failures
- Computer system dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other system(s) interacting with this system's users. Each module also has an ideal specified behavior, where a service specification is an agreed description of the expected behavior. A system failure occurs when the actual behavior deviates from the specified behavior. The failure occurred because of an error, a defect in that module. The cause of an error is a fault. When a fault occurs, it creates a latent error, which becomes effective when it is activated; when the error actually affects the delivered service, a failure occurs. The time between the occurrence of an error and the resulting failure is the error latency. Thus, an error is the manifestation in the system of a fault, and a failure is the manifestation on the service of an error.
Faults, Errors, and Failures
- A fault creates one or more latent errors
- The properties of errors are
- A latent error becomes effective once activated
- An error may cycle between its latent and effective states
- An effective error often propagates from one component to another, thereby creating new errors
- A component failure occurs when the error affects the delivered service
- These properties are recursive and apply to any component
Examples of Faults, Errors, and Failures
- Example 1
- A programming mistake: a fault
- The consequence is an error (or latent error)
- Upon activation, the error becomes effective
- When this effective error produces erroneous data that affect the delivered service, a failure occurs
- Example 2
- An alpha particle hits a DRAM → fault
- It changes the memory → latent error
- The affected memory word is read → effective error
- The effective error produces erroneous data that affect the delivered service → failure (if ECC had corrected the error, a failure would not have occurred)
Service Accomplishment and Interruption
- Service accomplishment: service is delivered as specified
- Service interruption: delivered service is different from the specified service
- Transitions between these two states are caused by failures or restorations
Measuring Reliability and Availability
- Reliability: a measure of continuous service accomplishment from a reference initial instant
- Mean time to failure (MTTF)
- The reciprocal of MTTF is a rate of failures
- Service interruption is measured as mean time to repair (MTTR)
- Availability: a measure of service accomplishment with respect to the alternation between the two states above
- Measured as MTTF / (MTTF + MTTR)
- Mean time between failures (MTBF) = MTTF + MTTR
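The availability formula is a one-liner; a minimal sketch (the 24-hour MTTR below is an illustrative assumption, not from the text):

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the service is in the accomplishment state."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A hypothetical module: 1,000,000-hour MTTF, repaired in 24 hours on average.
print(availability(1_000_000, 24))
```

Note that shrinking MTTR improves availability just as effectively as stretching MTTF.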
Example
- A disk subsystem
- 10 disks, each rated at 1,000,000-hour MTTF
- 1 SCSI controller, 500,000-hour MTTF
- 1 power supply, 200,000-hour MTTF
- 1 fan, 200,000-hour MTTF
- 1 SCSI cable, 1,000,000-hour MTTF
- Component lifetimes are exponentially distributed (a component's age does not affect its probability of failure), and failures are independent
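Because exponential, independent lifetimes mean component failure rates simply add, the subsystem MTTF follows directly from the slide's ratings; a sketch:

```python
# Failure rates (per hour) add for exponentially distributed, independent parts.
rate_per_hour = (
    10 / 1_000_000    # 10 disks
    + 1 / 500_000     # SCSI controller
    + 1 / 200_000     # power supply
    + 1 / 200_000     # fan
    + 1 / 1_000_000   # SCSI cable
)                     # = 23 failures per 1,000,000 hours

mttf = 1 / rate_per_hour
print(round(mttf))  # → 43478 hours, roughly 5 years
```

The whole subsystem fails far more often than any single component: the weakest parts (fan and power supply) dominate the sum.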
Causes of Faults
- Hardware faults: devices that fail
- Design faults: faults in software (usually) and hardware design (occasionally)
- Operation faults: mistakes by operations and maintenance personnel
- Environmental faults: fire, flood, earthquake, power failure, and sabotage
Classification of Faults
- Transient faults: exist for a limited time and are not recurring
- Intermittent faults: cause a system to oscillate between faulty and fault-free operation
- Permanent faults: do not correct themselves with the passing of time
Reliability Improvements
- Fault avoidance: how to prevent, by construction, fault occurrence
- Fault tolerance: how to provide, by redundancy, service complying with the service specification in spite of faults that have occurred or are occurring
- Error removal: how to minimize, by verification, the presence of latent errors
- Error forecasting: how to estimate, by evaluation, the presence, creation, and consequences of errors
7.5 RAID: Redundant Arrays of Inexpensive Disks
Three Important Aspects of File Systems
- Reliability: is anything broken?
- Redundancy is the main hack for increasing reliability
- Availability: is the system still available to the user?
- When a single point of failure occurs, is the rest of the system still usable?
- ECC and various correction schemes help (but cannot improve reliability)
- Data integrity
- You must know exactly what is lost when something goes wrong
Disk Arrays
- Multiple arms improve throughput, but not necessarily latency
- Striping
- Spreading data over multiple disks
- Reliability
- General metric: N devices have 1/N the reliability of one
- Rule of thumb: the MTTF of a disk is about 5 years
- Hence we need to add redundant disks to compensate
- MTTR: mean time to repair (or replace), hours for disks
- If MTTR is small, then the array's MTTF can be pushed out significantly with a fairly small redundancy factor
Data Striping
- Bit-level striping: split the bits of each byte across multiple disks
- The number of disks can be a multiple of 8, or can divide 8
- Block-level striping: blocks of a file are striped across multiple disks; with n disks, block i goes to disk (i mod n) + 1
- Every disk participates in every access
- The number of I/Os per second is the same as for a single disk
- The amount of data per second is improved
- Provides high data-transfer rates, but does not improve reliability
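The block-to-disk mapping above can be sketched directly (the function name is ours):

```python
def disk_for_block(i, n):
    """Block-level striping: with n disks, block i goes to disk (i mod n) + 1."""
    return (i % n) + 1

# Blocks 0..7 over 4 disks cycle round-robin through disks 1..4.
print([disk_for_block(i, 4) for i in range(8)])  # → [1, 2, 3, 4, 1, 2, 3, 4]
```

A sequential read of n consecutive blocks therefore touches every disk exactly once, which is where the bandwidth gain comes from.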
Redundant Arrays of Disks
- Files are "striped" across multiple disks
- Availability is improved by adding redundant disks
- If a single disk fails, the lost information can be reconstructed from the redundant information
- Capacity penalty to store the redundant information
- Bandwidth penalty to update it
- RAID
- Redundant Arrays of Inexpensive Disks
- Redundant Arrays of Independent Disks
RAID Levels, Reliability, Overhead
Redundant information
RAID Levels 0-1
- RAID 0: no redundancy (just block striping)
- Cheap, but unable to withstand even a single failure
- RAID 1: mirroring
- Each disk is fully duplicated onto its "shadow"
- Files are written to both; if one fails, flag it and get the data from the mirror
- Reads may be optimized: use the disk delivering the data first
- Bandwidth sacrifice on write: one logical write = two physical writes
- Most expensive solution: 100% capacity overhead
- Targeted for high-I/O-rate, high-availability environments
- RAID 0+1: stripe first, then mirror the stripe
- RAID 1+0: mirror first, then stripe the mirror
RAID Levels 2 and 3
- RAID 2: memory-style ECC
- Cuts down the number of additional disks
- The actual number of redundant disks depends on the correction model
- RAID 2 is not used in practice
- RAID 3: bit-interleaved parity
- Reduces the cost of higher availability to 1/N (N = number of disks)
- Uses one additional redundant disk to hold parity information
- Bit interleaving allows corrupted data to be reconstructed
- Interesting trade-off between increased time to recover from a failure and cost reduction due to decreased redundancy
- Parity = sum of the corresponding disk blocks (modulo 2)
- Hence all disks must be accessed on a write: a potential bottleneck
- Targeted for high-bandwidth applications: scientific computing, image processing
RAID Level 3: Parity Disk (Cont.)
Logical record: 10010011 11001101 10010011 . . .
Striped physical records:
1 0 0 1 0 0 1 1
1 1 0 0 1 1 0 1
1 0 0 1 0 0 1 1
0 0 1 1 0 0 0 0
P: parity disk
25% capacity cost for parity in this configuration (1/N)
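Parity as a modulo-2 sum is just bitwise XOR, so XOR-ing the parity with the surviving blocks recovers a lost block; a minimal demonstration using the slide's bit patterns:

```python
# Parity is the bitwise XOR (modulo-2 sum) of the data blocks.
d1, d2, d3 = 0b10010011, 0b11001101, 0b10010011
parity = d1 ^ d2 ^ d3

# Pretend disk 2 failed: its contents equal parity XOR the surviving blocks.
recovered_d2 = parity ^ d1 ^ d3
assert recovered_d2 == d2
print(f"{recovered_d2:08b}")  # → 11001101
```

This works because XOR is its own inverse: x ^ x = 0, so every surviving term cancels out of the parity, leaving only the missing block.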
RAID Levels 4, 5, and 6
- RAID 4: block-interleaved parity
- Similar idea to RAID 3, but the sum is on a per-block basis
- Hence only the parity disk and the target disk need be accessed
- A problem remains with concurrent writes, since the parity disk bottlenecks
- RAID 5: block-interleaved distributed parity
- Parity blocks are interleaved and distributed over all the disks
- Hence parity blocks no longer reside on the same disk
- The probability of write collisions on a single drive is reduced
- Hence higher performance in the consecutive-write situation
- RAID 6
- Similar to RAID 5, but stores extra redundant information to guard against multiple disk failures
RAID 4 and 5 Illustration
RAID 4
RAID 5
- Targeted for mixed applications
- A logical write becomes four physical I/Os
Small Write Update on RAID 3
Small Write Update on RAID 4/5
- RAID 5 small-write algorithm: 1 logical write = 2 physical reads + 2 physical writes
- (1. Read) old data D0; (2. Read) old parity P
- XOR: new parity P' = old data D0 XOR new data D0' XOR old parity P
- (3. Write) new data D0'; (4. Write) new parity P'
- Data blocks D1, D2, D3 are not accessed
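The small-write shortcut can be checked against a full parity recomputation with arbitrary byte values (the values below are illustrative):

```python
# RAID 5 small write: 2 reads (old data, old parity) + 2 writes (new data,
# new parity), with new parity = old parity XOR old data XOR new data.
old_d0, new_d0 = 0b10010011, 0b01100110
d1, d2, d3 = 0b11001101, 0b00111100, 0b01010101

old_parity = old_d0 ^ d1 ^ d2 ^ d3          # what the parity disk held
new_parity = old_parity ^ old_d0 ^ new_d0   # the shortcut: no need to read d1-d3

# The shortcut matches recomputing parity over all the data blocks.
assert new_parity == new_d0 ^ d1 ^ d2 ^ d3
```

This is why only four physical I/Os are needed: the untouched blocks cancel out of the XOR, so they never have to be read.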
7.6 Errors and Failures in Real Systems
Examples
- Berkeley's Tertiary Disk
- Tandem
- VAX
- FCC
Berkeley's Tertiary Disk
- 18 months of operation
- The SCSI backplane, SCSI cables, and Ethernet cables were no more reliable than the data disks
7.7 I/O Performance Measures
I/O Performance Measures
- Some similarities with CPU performance measures
- Bandwidth: 100% utilization is maximum throughput
- Latency: often called response time in the I/O world
- Some unique measures
- Diversity: which device types can be connected to the system
- Capacity: how many devices, and how much storage on each unit
- The usual relationship between bandwidth and latency holds
Latency vs. Throughput
- Response time (latency): the time a task takes from the moment it is placed in the buffer until the server finishes the task
- Throughput: the average number of tasks completed by the server over a time period
- Knee of the curve (latency vs. throughput): the region where a little more throughput costs much longer response time, or a little shorter response time costs much lower throughput
Response time = Queue + Device service time
Latency vs. Throughput
[Figure: latency rises sharply as throughput approaches its maximum]
Transaction Model
- In an interactive environment, faster response time is important
- Impact of inherently long latency
- Transaction time is the sum of 3 components
- Entry time: time it takes the user (usually a human) to enter a command
- System response time: from command entry to response out
- Think time: the user's reaction time between the response and the next entry
The Impact of Reducing Response Time
Transaction Time Oddity
- As system response time goes down
- Think time goes down even more
- One could conclude
- That system performance magnifies human talent
- OR that with a fast system, less thinking is necessary
- OR that with a fast system, less thinking is done
7.8 A Little Queuing Theory
Introduction
- Helps calculate response time and throughput
- More interested in the long-term, steady state than in startup →
- Number of tasks entering the system = number of tasks leaving the system
- Little's Law
- Mean number of tasks in system = arrival rate x mean response time
- Applies to any system in equilibrium, as long as nothing inside the black box is creating or destroying tasks
Little's Law
- Mean number of tasks in system = arrival rate x mean response time
- Observe a system for Time_observe
- Number of tasks completed during Time_observe is Number_tasks
- Sum of the times each task spends in the system = Time_accumulated
- Mean number of tasks in system = Time_accumulated / Time_observe
- Mean response time = Time_accumulated / Number_tasks
- Arrival rate = Number_tasks / Time_observe
- Hence Time_accumulated / Time_observe = (Number_tasks / Time_observe) x (Time_accumulated / Number_tasks), which is Little's Law
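Little's Law can be verified on a small made-up trace of arrival and departure times (all numbers below are illustrative):

```python
# Tasks observed over a 10-second window: (arrival_time, departure_time).
tasks = [(0.0, 2.0), (1.0, 2.5), (3.0, 6.0), (5.0, 9.5), (8.0, 10.0)]
time_observe = 10.0

time_accumulated = sum(dep - arr for arr, dep in tasks)  # total time in system
mean_in_system = time_accumulated / time_observe
mean_response = time_accumulated / len(tasks)
arrival_rate = len(tasks) / time_observe

# Little's Law: mean number in system = arrival rate x mean response time.
assert abs(mean_in_system - arrival_rate * mean_response) < 1e-12
print(mean_in_system)  # → 1.3
```

The identity holds for any trace because both sides are just Time_accumulated / Time_observe rearranged; no distributional assumption is needed.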
Queuing Theory Notation
- Queuing models assume a state of equilibrium: input rate = output rate
- Notation
- Time_server: average time to service a task
- Service rate = 1 / Time_server
- Time_queue: average time per task in the queue
- Time_system: response time, the average time per task in the system
- Time_system = Time_server + Time_queue
- Arrival rate: average number of arriving tasks per second
- Length_server: average number of tasks in service
- Length_queue: average number of tasks in the queue
- Length_system: average number of tasks in the system
- Length_system = Length_server + Length_queue
- Server utilization = arrival rate / service rate (between 0 and 1 at equilibrium)
- Little's Law → Length_system = arrival rate x Time_system
Example
- An I/O system with a single disk
- 10 I/O requests per second; average time to service a request = 50 ms
- Arrival rate = 10 IOPS; service rate = 1/(50 ms) = 20 IOPS
- Server utilization = 10/20 = 0.5
- Length_queue = arrival rate x Time_queue
- Length_server = arrival rate x Time_server
- Now suppose the average time to satisfy a disk request is 50 ms and the arrival rate is 200 IOPS
- Length_server = arrival rate x Time_server = 200 x 0.05 = 10
Response Time
- Service-time completions vs. waiting time for a busy server: a randomly arriving task joins a queue of arbitrary length when the server is busy, and otherwise is serviced immediately (assume unlimited-length queues)
- A single-server queue: the combination of a servicing facility that accommodates 1 task at a time (the server) and a waiting area (the queue), together called a system
- Time_queue (assuming a FIFO queue)
- Time_queue = Length_queue x Time_server + M
- M = mean time to complete service of the current task when a new task arrives, if the server is busy
- A new task can arrive at any instant
- Use the distribution of a random variable: histogram? curve?
- M is also called the Average Residual Service Time (ARST)
Response Time (Cont.)
- The server spends a variable amount of time with tasks
- Weighted mean m1 = (f1 x T1 + f2 x T2 + ... + fn x Tn) / F, where F = f1 + f2 + ...
- Variance = (f1 x T1^2 + f2 x T2^2 + ... + fn x Tn^2) / F - m1^2
- Must keep track of the unit of measure (100 ms^2 vs. 0.1 s^2)
- Squared coefficient of variance: C = variance / m1^2
- A unitless measure (100 ms^2 vs. 0.1 s^2 no longer matters)
- Three distributions
- Exponential distribution: C = 1; most tasks short relative to the average, a few long; 90% < 2.3 x average, 63% < average
- Hypoexponential distribution: C < 1; most close to the average; C = 0.5 → 90% < 2.0 x average, only 57% < average
- Hyperexponential distribution: C > 1; further from the average; C = 2.0 → 90% < 2.8 x average, 69% < average
- ARST = 0.5 x weighted mean time x (1 + C)
Characteristics of Three Distributions
Memoryless: C does not vary over time and does not consider the past history of events
Time_queue
- Derive Time_queue in terms of Time_server, server utilization, and C
- Time_queue = Length_queue x Time_server + ARST x server utilization
- Time_queue = (arrival rate x Time_queue) x Time_server + (0.5 x Time_server x (1 + C)) x server utilization
- Time_queue = Time_queue x server utilization + (0.5 x Time_server x (1 + C)) x server utilization
- Time_queue = Time_server x (1 + C) x server utilization / (2 x (1 - server utilization))
- For the exponential distribution, C = 1.0 →
- Time_queue = Time_server x server utilization / (1 - server utilization)
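The derived formula and its exponential special case can be sketched as two small functions (the names are ours):

```python
def time_queue_mg1(time_server, utilization, c_squared):
    """General form: Ts * (1 + C) * u / (2 * (1 - u))."""
    return time_server * (1 + c_squared) * utilization / (2 * (1 - utilization))

def time_queue_mm1(time_server, utilization):
    """Exponential service (C = 1): Ts * u / (1 - u)."""
    return time_server * utilization / (1 - utilization)

# With C = 1 the general form collapses to the M/M/1 form.
assert time_queue_mg1(20.0, 0.2, 1.0) == time_queue_mm1(20.0, 0.2)
print(time_queue_mm1(20.0, 0.2))  # → 5.0 (ms, for a 20 ms service time)
```

The 1/(1 - utilization) factor is what makes queuing time explode as a device nears saturation.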
Queuing Theory
- Predicts the approximate behavior of random variables
- Makes a sharp distinction between past events (arithmetic measurements) and future events (mathematical predictions)
- In computer systems, the future relies on the past → arithmetic measurements and mathematical predictions (distributions) are blurred
- Queuing model assumptions → M/G/1
- Equilibrium system
- Exponential inter-arrival time (the time between two successive task arrivals), i.e. a fixed arrival rate
- Unlimited sources of requests (infinite population model)
- Unlimited queue length, and a FIFO queue
- The server starts on the next task immediately after finishing the prior one
- All tasks must be completed
- One server
M/G/1 and M/M/1
- M/G/1 queue
- M: exponentially random request arrivals (C = 1)
- M stands for memoryless or Markovian
- G: general service distribution (no restrictions)
- 1: one server
- M/M/1 queue
- The service distribution is exponential as well (C = 1)
- Why is the exponential distribution used so often in queuing theory?
- A collection of many arbitrary distributions acts like an exponential distribution (and a computer system comprises many interacting components)
- Simpler math
Example
- A processor sends 10 disk I/Os per second; requests for service are exponentially distributed; average disk service time = 20 ms
- On average, how utilized is the disk?
- What is the average time spent in the queue?
- What is the 90th percentile of the queuing time?
- What is the number of requests in the queue?
- What is the average response time for a disk request?
- Answers
- Arrival rate = 10 IOPS; service rate = 1/0.02 = 50 IOPS
- Server utilization = 10/50 = 0.2
- Time_queue = Time_server x server utilization / (1 - server utilization) = 20 x 0.2 / (1 - 0.2) = 20 x 0.25 = 5 ms
- 90th percentile of the queuing time = 2.3 x 5 = 11.5 ms (exponential: 90% < 2.3 x average)
- Length_queue = arrival rate x Time_queue = 10 x 0.005 = 0.05
- Average response time = 5 + 20 = 25 ms
- Length_system = arrival rate x Time_system = 10 x 0.025 = 0.25
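The answers above can be reproduced numerically (variable names are ours):

```python
# M/M/1 numbers for the 10-IOPS disk with a 20 ms average service time.
arrival_rate = 10.0      # I/Os per second
time_server = 0.020      # seconds

utilization = arrival_rate * time_server             # = arrival rate / service rate
time_queue = time_server * utilization / (1 - utilization)
time_system = time_queue + time_server

print(round(utilization, 3))                  # → 0.2
print(round(time_queue * 1000, 2))            # → 5.0   (ms in the queue)
print(round(2.3 * time_queue * 1000, 2))      # → 11.5  (ms, 90th percentile)
print(round(arrival_rate * time_queue, 3))    # → 0.05  (requests in the queue)
print(round(arrival_rate * time_system, 3))   # → 0.25  (tasks in the system)
```

Note how Little's Law is used twice at the end: once on the queue alone and once on the whole system.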
7.9 Benchmarks of Storage Performance and Availability
Transaction Processing (TP) Benchmarks
- TP: database applications, OLTP
- Concerned with I/O rate (the number of disk accesses per second)
- Started with an anonymous gang of 24 members in 1985
- DebitCredit benchmark: simulates bank tellers and has as its bottom line the number of debit/credit transactions per second (TPS)
- Tighter, more standard benchmark versions
- TPC-A, TPC-B
- TPC-C: complex query processing, a more accurate model of a real bank, which models credit analysis for loans
- TPC-D, TPC-H, TPC-R, TPC-W
- Must also report the cost per TPS
- Hence the machine configuration is considered
TP Benchmarks
TP Benchmark: DebitCredit
- Disk I/O is random reads and writes of 100-byte records, along with occasional sequential writes
- 2-10 disk I/Os per transaction
- 5,000-20,000 CPU instructions per disk I/O
- Performance relies on
- The efficiency of the TP software
- How many disk accesses can be avoided by keeping information in main memory (cache)!!! → wrong for measuring disk I/O
- Peak TPS
- Restriction: 90% of transactions must have < 2 sec response time
- For TPS to increase, the number of tellers and the size of the account file must also increase (more TPS requires more users)
- This ensures that the benchmark really measures disk I/O (not cache)
Relationship Among TPS, Tellers, and Account File Size
The data set generally must scale in size as the throughput increases
SPEC System-Level File Server (SFS) Benchmark
- SPECsfs: system-level file server benchmark
- 1990 agreement by 7 vendors to evaluate NFS performance
- Mix of file reads, writes, and file operations
- Writes: 50% done in 8-KB blocks, 50% in partial blocks (1, 2, or 4 KB)
- Reads: 85% full-block, 15% partial-block
- Scales the size of the file system according to the reported throughput
- For every 100 NFS operations per second, the capacity must increase by 1 GB
- Limits average response time, e.g. to 40 ms
- Does not normalize for different configurations
- Retired in June 2001 due to bugs
SPECsfs
[Figure: overall response time (ms) vs. throughput for SPECsfs results, including an unfair configuration]
SPECWeb
- Benchmark for evaluating the performance of WWW servers
- The SPECWeb99 workload simulates accesses to a Web service provider supporting home pages for several organizations
- For each home page, nine files in each of four classes
- Less than 1 KB (small icons): 35% of activity
- 1-10 KB: 50% of activity
- 10-100 KB: 14% of activity
- 100 KB-1 MB (large documents and images): 1% of activity
- SPECWeb99 results in 2000 for Dell computers
- Large memory is used as a file cache to reduce disk I/O
- Shows the impact of Web server software and OS
SPECWeb99 Results for Dell
Examples of Benchmarks of Dependability and Availability
- TPC-C has a dependability requirement: the system must handle a single disk failure
- Brown and Patterson [2000]
- Focus on the effectiveness of fault tolerance in systems
- Availability can be measured by examining the variations in system QOS metrics over time as faults are injected into the system
- The initial experiment injected a single disk fault
- Software RAID by Linux, Solaris, and Windows 2000
- Reconstructs data onto a hot-spare disk
- A disk emulator injects the faults
- SPECWeb99 workload
Availability Benchmark for Software RAID
(Red Hat 6.0)
(Solaris 7)
Availability Benchmark for Software RAID (Cont.)
(Windows 2000)
Availability Benchmark for Software RAID (Cont.)
- The longer the reconstruction (the larger the MTTR), the lower the availability
- Increased reconstruction speed implies decreased application performance
- Linux vs. Solaris and Windows 2000
- RAID reconstruction
- Linux and Solaris initiate reconstruction automatically
- Windows 2000 initiates reconstruction manually, via operators
- Managing transient faults
- Linux is paranoid about transient faults
- Solaris and Windows ignore most transient faults
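The trade-off on this slide follows from the standard steady-state availability formula; the MTTF and reconstruction times below are illustrative numbers, not measurements from the benchmark:

```python
def availability(mttf_hours, mttr_hours):
    # Steady-state availability = MTTF / (MTTF + MTTR).
    # Slower RAID reconstruction means a larger effective MTTR and lower
    # availability; reconstructing faster instead steals bandwidth
    # from the running workload.
    return mttf_hours / (mttf_hours + mttr_hours)

for mttr in (1, 10, 100):  # hypothetical reconstruction times, in hours
    print(f"MTTR {mttr:>3} h -> availability {availability(50_000, mttr):.6f}")
```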
1107.10 Crosscutting Issues Interface to OS
111I/O Interface to the OS
- The OS controls which I/O techniques implemented by the HW will actually be used
- Early UNIX "head wedge"
- 16-bit controllers could only transfer 64KB at a time
- Later controllers moved to 32-bit devices
- And are optimized for much larger blocks
- UNIX, however, did not want to distinguish between them, so it kept the 64KB bias
- A new I/O controller designed to efficiently transfer 1MB files would never see more than 64KB at a time under early UNIX
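The 64KB ceiling follows directly from a 16-bit byte count; a minimal sketch (the helper name is made up):

```python
# A 16-bit length field can express at most 2**16 byte counts,
# capping a single transfer at 64 KB no matter how large the file is.
MAX_TRANSFER = 2 ** 16  # 65,536 bytes

def transfers_needed(file_bytes, cap=MAX_TRANSFER):
    # Controller operations early UNIX would issue for one file,
    # given the 64 KB per-transfer ceiling (ceiling division).
    return -(-file_bytes // cap)

# A 1 MB file takes 16 separate 64 KB transfers.
print(transfers_needed(1 << 20))
```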
112Cache Problems -- Stale Data
- 3 potential copies of data: cache, memory, and disk
- Stale data: the CPU or the I/O system could modify one copy without updating the other copies
- Where should the I/O system connect to the computer?
- CPU cache: no stale-data problem
- All I/O devices and the CPU see the most accurate data
- Requires the cache hierarchy to maintain multi-level inclusion
- Disadvantages
- Lost CPU performance: all I/O data goes through the cache, but little of it is ever referenced
- Arbitration between the CPU and I/O for access to the cache
- Memory: the stale-data problem occurs
113Connect I/O to Cache
114Cache-Coherence Problem
[Figure: cache-coherence problem on output (stale A) and on input (stale B')]
115Stale Data Problem
- I/O sees stale data on output because the memory copy is not up to date
- Write-through cache: OK
- Write-back cache
- The OS flushes the data to make sure they are not in the cache before output
- The HW checks cache tags to see if the data are in the cache, and interacts with the cache only if the output tries to use in-cache data
- The CPU sees stale data in the cache on input after I/O has updated memory
- The OS guarantees the input data area cannot possibly be in the cache
- The OS flushes the data to make sure they are not in the cache before input
- The HW checks tags during an input and invalidates the data on a conflict
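The write-back output case can be sketched with a toy model, assuming made-up addresses and values: the newest copy lives only in the cache, so a DMA read of memory returns stale data until the OS flushes:

```python
# Toy model of the stale-data problem on output with a write-back cache.
memory = {"A": "old"}
cache = {}

def cpu_write(addr, value):
    cache[addr] = value                 # write-back: only the cache is updated

def os_flush(addr):
    if addr in cache:
        memory[addr] = cache.pop(addr)  # write the dirty block to memory

def dma_output(addr):
    return memory[addr]                 # I/O reads memory, bypassing the cache

cpu_write("A", "new")
stale = dma_output("A")   # "old" -- I/O sees stale data
os_flush("A")
fresh = dma_output("A")   # "new" -- flushing before output fixes it
print(stale, fresh)
```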
116DMA and Virtual Memory
- 2 types of addresses: virtual (VA) and physical (PA)
- Physically addressed I/O causes problems for DMA
- A block larger than a page will likely not fall on consecutive physical page numbers
- What happens if the OS evicts (victimizes) a page while a DMA is in progress?
- Solutions
- Pin the pages in memory (do not allow them to be replaced)
- Have the OS copy user data into the kernel address space and then transfer between the kernel address space and the I/O space
117Virtual DMA
- DMA uses VAs that are mapped to PAs during the DMA
- The DMA buffer is sequential in virtual memory but can be scattered in physical memory
- Virtual addresses provide protection from other processes
- With virtual DMA, the OS updates the address tables of a DMA if a process is moved
- Virtual DMA requires a register in the DMA controller for each page to be transferred, holding the protection bits and the physical page corresponding to each virtual page
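The per-page registers described above amount to a scatter-gather list; the sketch below builds one descriptor (physical address, byte count) per page touched, using a made-up page table and a 4KB page size:

```python
PAGE = 4096  # assumed page size

# Hypothetical page table: virtual page number -> physical page number.
# The buffer is contiguous in virtual memory but scattered physically.
page_table = {0: 7, 1: 3, 2: 12}

def build_sg_list(vaddr, length):
    # One DMA descriptor per page touched, mirroring what a
    # virtual-DMA controller's per-page registers would hold.
    descs = []
    while length > 0:
        vpn, offset = divmod(vaddr, PAGE)
        chunk = min(length, PAGE - offset)
        descs.append((page_table[vpn] * PAGE + offset, chunk))
        vaddr += chunk
        length -= chunk
    return descs

# Three descriptors covering two pages' worth of data starting at offset 100.
print(build_sg_list(100, 2 * PAGE))
```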
118Virtual DMA Illustration
1197.11 Designing an I/O System
120I/O Design Complexities
- Huge variety of I/O devices
- Latency
- Bandwidth
- Block size
- Expansion is a must: longer buses, larger power supplies and cabinets
- Balanced performance and cost
- Yet another n-dimensional conflicting-constraint problem
- Yep, it's NP-hard just like all the rest
- Experience plays a big role, since the solutions are heuristic
1217 Basic I/O Design Steps
- List types of I/O devices and buses to be
supported - List physical requirements of I/O devices
- Volume, power, bus slots, expansion slots or
cabinets, ... - List cost of each device and associated
controller - List the reliability of each I/O device
- Record CPU resource demands - e.g. cycles
- Start, support, and complete I/O operation
- Stalls due to I/O waits
- Overhead - e.g. cache flushes and context
switches - List memory and bus bandwidth demands
- Assess the performance of different ways to
organize I/O devices - Of course youll need to get into queuing theory
to get it right
122An Example
- Impact on the CPU of reading a disk page directly into the cache
- Assumptions
- 16KB page, 64-byte cache blocks
- Addresses of the new page are not in the cache
- The CPU will not access data in the new page
- 95% of the displaced cache blocks will be read again (and miss)
- Write-back cache, 50% of blocks are dirty
- I/O buffers a full cache block before writing to the cache
- Accesses and misses are spread uniformly over all cache blocks
- No other interference between CPU and I/O for the cache slots
- 15,000 misses per 1 million clock cycles when there is no I/O
- Miss penalty 30 CCs, plus 30 more CCs to write back a dirty block
- 1 page is brought in every 1 million clock cycles
123An Example (Cont.)
- Each page fills 16,384/64 = 256 cache blocks
- Writing the dirty displaced blocks: 0.5 × 256 = 128 blocks, at 30 CCs each
- 95% of the 256 blocks (about 244) are referenced again and miss
- All of them are dirty and will need to be written back when replaced
- These cost 244 × (30 + 30) more CCs
- In total: 128 × 30 + 244 × 60 = 18,480 more CCs, on a base of 1,000,000 + 15,000 × 30 + 7,500 × 30 = 1,675,000 CCs
- About a 1% decrease in performance
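The arithmetic above can be checked directly; every number comes from the stated assumptions:

```python
blocks = 16 * 1024 // 64            # a 16KB page fills 256 cache blocks
dirty_flush = (blocks // 2) * 30    # 128 dirty displaced blocks: 3,840 CCs
re_miss = 244 * (30 + 30)           # ~95% of 256 blocks re-referenced, miss,
                                    # and are written back again: 14,640 CCs
extra = dirty_flush + re_miss       # 18,480 extra CCs per page brought in

# Baseline: CPU cycles plus miss and dirty write-back penalties without I/O.
base = 1_000_000 + 15_000 * 30 + 7_500 * 30   # 1,675,000 CCs
print(f"slowdown: {extra / base:.1%}")        # about 1%
```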
124Five More Examples
- Naive cost-performance design and evaluation
- Availability of the first example
- Response time of the first example
- More realistic cost-performance design and evaluation
- More realistic design for availability and its evaluation