Title: I/O Subsystem, Chapter 8
1. I/O Subsystem: Chapter 8
N. Guydosh 4/28/04
2. Introduction
- Amazing variation of characteristics and behaviors
- Characteristics largely driven by technology
- Not as elegant as processors or memory systems
- Traditionally the study of I/O took a back seat to processors and memory
- An unfortunate situation, because a computer system is useless without I/O, and Amdahl's law tells us that ultimately I/O is the performance bottleneck. See example in section 8.1.
Typical I/O configuration
Fig. 8.1
3. I/O Performance Metrics
- A point of confusion: in I/O systems, KB, MB, etc. are traditionally powers of 10 (1,000 and 1,000,000 bytes), but in memory/processor systems these are powers of 2 (1,024 and 1,048,576)
- For simplicity, let's ignore the small difference and use only one base, say 2
- Supercomputer I/O benchmarks
  - Typically for check-pointing; the machine wants maximum bytes/sec on output
- Transaction processing (TP)
  - Response time and throughput important
  - Lots of small I/O events, thus the number of disk accesses per second is more important than bytes/sec
  - Reliability very important
- File system I/O benchmarks
  - These exercise the I/O system with I/O commands; examples for UNIX: MakeDir, Copy, ScanDir (traverse directory tree), ReadAll (scan every byte in every file once), Make (compiling and linking)
4. Types and Characteristics of I/O Devices
- Again, diversity is the problem here
- Devices differ significantly in:
  - Behavior
  - Partner: purely machine interfaced or human interfaced
  - Data rate: ranges from a few bytes/sec to 10s of millions of bytes/sec
- See text for descriptions of various devices commonly in use
- Disk access time calculation
  - See book on disk organization
  - Components of access time:
    - Average seek time: move head to desired track
    - Rotational latency: wait for sector to get to head (0.5 rotation at the disk's rotation rate, on average)
    - Transfer time: time to read or write a sector
    - Sometimes queuing time is included: waiting for a request to get serviced
- Disk density and size affect performance and usefulness
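The access-time components listed above can be added up directly. A minimal sketch; the drive parameters (8 ms seek, 7200 RPM, 512-byte sectors, 50 MB/sec media rate) are invented for illustration, not from the text:

```python
# Disk access time = seek + rotational latency + transfer (+ queuing).
# All parameter values below are hypothetical.

def disk_access_ms(seek_ms=8.0, rpm=7200, sector_bytes=512,
                   transfer_mb_s=50.0, queue_ms=0.0):
    rotational_ms = 0.5 / (rpm / 60) * 1000        # half a rotation on average
    transfer_ms = sector_bytes / (transfer_mb_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms + queue_ms

print(round(disk_access_ms(), 2))  # ~12.18 ms; seek and rotation dominate
```

Note how small the transfer time of a single sector is compared to the mechanical delays, which is why the slide's later examples focus on seek plus rotational latency.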
5. Connecting the System: Busses
- A bus connects subsystems together
  - Connects processor, memory, and I/O devices together
- Consists of a set of wires with control logic and a well-defined protocol for using the bus
  - Protocol is implemented in hardware
- A standard bus design was a prime factor in the success of the personal computer
  - Purchase a base system and grow it by adding off-the-shelf components
- Historically a very chaotic aspect of the computer industry
  - The bus wars ... PCI wins, Microchannel loses!
- Busses are a key factor in the overall performance of a computer system
6. Connecting the System: Busses (cont.)
- Some bus tradeoffs
  - Advantage: flexibility in adding new devices and peripherals
  - Disadvantage: a serially reusable resource => only one communication at a time (a bottleneck)
- Two performance goals
  - High bandwidth (data rate, MB/sec)
  - Low latency
- Bus consists of a set of data lines and control lines
  - Data lines include address and raw data
- Because the bus is shared, we need a protocol to decide who uses it next
- Bus transaction (send address, receive or send data)
  - Terminology is from the point of view of memory (confusing!)
  - Input: writes data to memory from I/O
  - Output: reads data from memory to I/O
- See examples in figs. 8.7, 8.8
7. Connecting the System: Busses (cont.)
[Fig. 8.7: Output operation: data from memory is output to the device. Steps: memory read command -> address on bus -> access data in memory -> data ready response -> write to disk.]
8. Connecting the System: Busses (cont.)
[Fig. 8.8: Input operation: data to memory is input from the device. Steps: write register command -> address on bus -> data on bus -> read from disk.]
9. Types of Busses
- Backplane (motherboard) bus
  - Interconnects backplane components
  - Plug-in feature
  - Typical standard busses (ISA, AT, PCI ...)
  - Connects to other busses
- Processor-memory bus
  - Usually proprietary
  - High speed
  - Direct connection of processor to memory, with links to other busses
- I/O bus
  - Typically does not connect directly to memory
  - Usually bridges to the backplane or processor-memory bus
  - Examples: SCSI, IDE, EIDE
10. Types of Busses (cont.)
- A lot of functional overlap in the above 3 types of busses
- Can put memory directly on the backplane bus
- Logic is needed to interconnect busses (bridge chips)
  - Ex: backplane to I/O bus
- A system may have a single backplane bus
  - Ex: old PCs (ISA/AT)
- See fig 8.9, p. 659 for examples =>
11. Types of Busses: Example
[Fig. 8.9: Single backplane (older PCs). Processor/memory bus as the main bus; could be a PCI backplane in modern computers. All 3 types of busses utilized here. Ex: proprietary (old IBM?); ex: EIDE bus in a PC; ex: PCI.]
12. Synchronous vs. Asynchronous Busses
- Synchronous
  - Bus includes a clock line among the control lines
  - Protocol is not very data dependent
  - Protocol tightly coupled to clock
    - Highly synchronized with clock
    - Completely clock driven
    - Only asynchronous part is the generation of commands or requests
  - Lines must be short due to clock skew
  - A model for this type of bus is an FSM
  - Disadvantages: all devices on the bus must run at the same clock speed; lines must be short due to the clock skew problem
  - Advantage: can have high performance in special applications such as processor-memory bussing
  - Sometimes used for the processor-memory bus
13. Synchronous vs. Asynchronous Busses (cont.)
- Asynchronous
  - Very little clock dependency
  - Event driven
  - Keeps in step via handshaking (see example in figure 8.10)
  - Very versatile
    - Bus can be arbitrarily long
    - Common for standard busses
    - Ex: SBus (Sun), Microchannel, PCI
    - Can even connect busses/devices using different clocks
  - Disadvantage: lower performance due to handshaking overhead
  - A model for this type of bus is a pair of interacting FSMs
  - See fig 8.11, p. 664; see the performance analysis on pp. 662-663, based on figure 8.10
14. Handshaking on an Asynchronous Bus
Operation: data from memory to device. Initially the device raises ReadReq and puts the address on the data lines.
1. Memory sees ReadReq, reads the address from the data bus, and raises Ack.
2. The I/O device sees the Ack line high and releases ReadReq and the data lines.
3. Memory sees ReadReq low and drops the Ack line to acknowledge the ReadReq signal.
4. Memory puts the data on the data lines and asserts DataRdy.
5. The I/O device sees DataRdy, reads the data, and signals Ack.
6. Memory sees Ack, drops DataRdy, and releases the data lines.
7. The I/O device sees DataRdy drop and drops the Ack line.
Note: the bus is bi-directional.
Question: what happens if an Ack fails to get issued?
Color coding: colored signals are from the device; black signals are from memory.
Fig. 8.10
15. FSM Model of an Asynchronous Bus (based on the example in fig. 8.10)
The numbers in each state correspond to the numbered steps in fig. 8.10.
Fig. 8.11
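The pair of interacting FSMs in fig. 8.11 can be modeled as two Python generators that share wire-like variables and yield control between handshake steps. This is a toy sketch, not code from the text; signal and function names mirror fig. 8.10:

```python
# Two interacting FSMs for the seven-step asynchronous handshake.

class Bus:
    def __init__(self):
        self.read_req = False   # device requests a read
        self.data_rdy = False   # memory has put data on the lines
        self.ack = False        # handshake acknowledge
        self.lines = None       # shared, bi-directional address/data lines

def device_fsm(bus, address):
    bus.lines, bus.read_req = address, True    # initial: address + ReadReq
    yield
    while not bus.ack: yield                   # wait for memory's Ack (step 1)
    bus.read_req, bus.lines = False, None      # step 2: release ReadReq, lines
    yield
    while not bus.data_rdy: yield              # wait for DataRdy (step 4)
    data, bus.ack = bus.lines, True            # step 5: read data, raise Ack
    yield
    while bus.data_rdy: yield                  # wait for step 6
    bus.ack = False                            # step 7: drop Ack; done
    return data

def memory_fsm(bus, mem):
    while not bus.read_req: yield
    addr, bus.ack = bus.lines, True            # step 1: latch address, Ack
    yield
    while bus.read_req: yield
    bus.ack = False                            # step 3: ack the ReadReq drop
    yield
    bus.lines, bus.data_rdy = mem[addr], True  # step 4: drive data, DataRdy
    yield
    while not bus.ack: yield
    bus.data_rdy, bus.lines = False, None      # step 6: drop DataRdy, release
    yield

def run(address, mem):
    bus = Bus()
    dev, m = device_fsm(bus, address), memory_fsm(bus, mem)
    while True:                                # interleave the two machines
        try:
            next(dev)
        except StopIteration as done:
            return done.value
        next(m)

print(run(0x10, {0x10: 42}))  # 42
```

Each `yield` is a point where the machine waits to observe the other side's signals, which is exactly the event-driven behavior the slide attributes to asynchronous busses.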
16. An Example (pp. 662-663)
- Referring to the example in fig 8.10, we will compare the asynchronous bandwidth (BW) with a synchronous approach
- Asynchronous
  - 40 ns per handshake (one of the 7 steps)
- Synchronous
  - Clock cycle 50 ns
  - Each bus transmission takes one clock cycle
- Both schemes: 32-bit data bus and one-word reads from a 200 ns memory
- Synchronous
  - Send address to memory 50 ns, read memory 200 ns, send data to device 50 ns, for a total time of 300 ns
  - BW = 4 bytes/300 ns = 13.3 MB/sec
- Asynchronous
  - Can overlap steps 2, 3, and 4 with the memory access time
  - Step 1: 40 ns
  - Steps 2, 3, 4: max(3×40 ns, 200 ns) = 200 ns (steps 2, 3, 4 hidden by memory access)
  - Steps 5, 6, 7: 3×40 = 120 ns
  - BW = 4 bytes/(40+200+120) ns = 11.1 MB/sec
- Observation: synchronous is only 20% faster, due to the overlap in handshaking
- Comment: asynchronous is usually preferred because it is more technology independent and more versatile in handling different device speeds
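The two bandwidth figures above can be checked with a few lines of arithmetic (same assumptions as the slide: 32-bit bus, 200 ns memory, 50 ns synchronous clock, 40 ns per handshake step):

```python
# Synchronous vs. asynchronous one-word read bandwidth (pp. 662-663).
word = 4                               # bytes per transfer

sync_ns = 50 + 200 + 50                # send address + read memory + send data
bw_sync = word / sync_ns * 1000        # bytes/ns * 1000 = MB/sec

step = 40
async_ns = step + max(3 * step, 200) + 3 * step  # steps 2-4 hide under memory
bw_async = word / async_ns * 1000

print(round(bw_sync, 1), round(bw_async, 1))  # 13.3 11.1
```

The `max(...)` term is where the overlap pays off: the three middle handshake steps cost nothing extra because the 200 ns memory access dominates them.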
17. An Example (pp. 665-666): The Effect of Block Size on Synchronous Bus Bandwidth
- Bus description
  - Two cases to consider: a memory bus system supporting access of 4-word blocks (case 1) and 16-word blocks (case 2), where a word is 32 bits in each case
  - 64-bit (2-word) synchronous bus clocked at 200 MHz (5 ns/cycle); each 64-bit transfer takes 1 clock cycle; 1 clock cycle is needed to send the initial address
  - Two idle clock cycles are needed between bus operations; the bus is assumed to be idle before an access
  - A memory access for the first 4 words is 200 ns (40 cycles), and each additional set of 4 words is 20 ns (4 cycles)
  - Assume that a bus transfer of the most recently read data and a read of the next 4 words can be overlapped
  - Summary: memory is accessed 4 words at a time, but data must be sent over the bus in 2-word shots (2 cycles per 4 words) since the bus is only 2 words wide
- Find the sustained bandwidth, latency (transfer time of 256 words), and bus transactions/sec for a read of 256 words in the two cases: 4-word blocks and 16-word blocks. Note: interpret a bus transaction as transferring a (4- or 16-word) block.
18. An Example (pp. 665-666): Case 1, 4-word Block Transfers
- 1 clock cycle to send the address of the block to memory
- The 200 MHz bus has a 5 ns period (5 ns/cycle); memory access time (for the first and only 4 words) is 200 ns; cycles to read memory = (memory access time)/(clock cycle time) = 200 ns/5 ns = 40 cycles
- 2 clock cycles to send the data from memory, since we transfer 64 bits = 2 words per cycle and a block is 4 words
- 2 idle cycles between this transfer and the next
- Note: no overlap here, because the entire block is transferred in one access. Overlap occurs only within a block requiring multiple accesses, as in case 2 (next).
- Total number of cycles for a block = 1 + 40 + 2 + 2 = 45 cycles. Reading 256 words results in 256/4 = 64 blocks (transactions); thus 45×64 = 2880 cycles are needed for the transfer. Latency = 2880 cycles × 5 ns/cycle = 14,400 ns. Bus transactions/sec = 64/14,400 ns = 4.44M transactions/sec. BW = (256×4) bytes/14,400 ns = 71.11 MB/sec.
19. An Example (pp. 665-666): Case 2, 16-word Block Transfers
- Timing for a 1-block (16-word) transfer: the first 4-word access is essentially case 1
- Note: a 16-word block is read in four 4-word shots, thus there will be overlap: each later 4-word memory read (4 cycles) is hidden under the 2 transfer + 2 idle cycles of the previous shot.
- Total = 1 + 40 + 4×(2+2) = 57 cycles (was 45 for a 4-word block)
- Number of transactions (blocks) needed = 256/16 = 16 transactions (was 64 for a 4-word block). Total transfer time = 57×16 = 912 cycles (was 2880 for a 4-word block). Latency = 912 cycles × 5 ns/cycle = 4560 ns (was 14,400 ns for a 4-word block). Transactions/sec = 16/4560 ns = 3.51M transactions/sec (was 4.44M for a 4-word block). BW = (256×4) bytes/4560 ns = 224.56 MB/sec (was 71.11 for a 4-word block).
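Both cases above follow the same cycle-counting formula, so they can be checked together. A sketch under the slide's assumptions (64-bit bus, 5 ns/cycle, 40 cycles for the first 4-word group, later groups overlapped with the 2+2 transfer/idle cycles):

```python
# Block-size effect on synchronous bus bandwidth (pp. 665-666).

def read_256(block_words, cycle_ns=5, total_words=256):
    groups = block_words // 4              # memory delivers 4 words at a time
    # 1 address cycle + 40 cycles for the first group + (2 send + 2 idle)
    # per group; later 4-cycle memory reads hide under those 4 cycles.
    cycles_per_block = 1 + 40 + groups * (2 + 2)
    blocks = total_words // block_words
    latency_ns = blocks * cycles_per_block * cycle_ns
    bw_mb_s = total_words * 4 / latency_ns * 1000
    return cycles_per_block, latency_ns, bw_mb_s

print(read_256(4))    # 45 cycles/block, 14400 ns, ~71.1 MB/sec
print(read_256(16))   # 57 cycles/block, 4560 ns, ~224.6 MB/sec
```

The larger block amortizes the one-time address and first-access cost over four times as much data, which is where the roughly 3x bandwidth gain comes from.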
20. Controlling Bus Access
- Only one master on the bus at a time
- Bus control: the bus master
  - Controls access to the bus
  - Initiates and controls all bus requests
- Slave
  - Never generates its own requests
  - Responds to read and write requests
- Processor is always a master
- Memory is usually a slave
- Having a single bus master could create a bottleneck
  - Processor would be involved with every bus transaction
- See fig 8.12 for an example
21. Bus Control With a Single Master
The disk makes a request to the processor for a data transfer from memory to disk.
The processor responds by asserting the read request line to memory.
The processor acks to the disk that the request is being processed. The disk now places the desired address on the bus.
Fig. 8.12
22. Controlling Bus Access: Multiple Masters
- Bus arbitration: deciding which master gets control of the bus (p. 669)
- A chip (arbiter) decides which device gets the bus next
- Typically each device has a dedicated line to the arbiter for requests
- The arbiter will eventually issue a grant (on a separate line to the device)
- The device is now master, uses the bus, and then signals the arbiter when it is done with the bus
- Devices have priorities
  - The bus arbiter may invoke a fairness rule for a low-priority device which is waiting
- Arbitration time is overhead and should be overlapped with bus transfers whenever possible - maybe use physically separate lines for arbitration
23. Arbitration Schemes (p. 670)
- Daisy chain
  - Chain from high- to low-priority devices
  - A device making a request takes the grant but does not pass it on; the grant is passed on only by non-requesting devices
  - No fairness; possible starvation
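The daisy-chain behavior above is simple enough to simulate. A hedged sketch (the function and its interface are invented for illustration): the grant enters at the highest-priority device and the first requester along the chain keeps it.

```python
# Daisy-chain arbitration: grant propagates from device 0 (highest priority)
# down the chain; the first requesting device takes it and does not pass it on.

def daisy_chain(requesting, n_devices):
    """Return the index of the device that wins the bus, or None."""
    for dev in range(n_devices):      # walk the chain in priority order
        if dev in requesting:
            return dev                # requester consumes the grant
        # a non-requesting device passes the grant to the next device
    return None

# Device 1 always beats device 5, no matter how long 5 has been waiting:
# this is the starvation risk the slide mentions.
print(daisy_chain({1, 5}, 8))  # 1
```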
24. Arbitration Schemes (p. 670)
- Centralized, parallel
  - Multiple request lines; the chosen device becomes master; requires a central arbiter, a potential bottleneck
  - Used by PCI
- Distributed arbitration by self-selection
  - Multiple request lines
  - Requesters place an ID code on the bus; by examining the bus, each device can determine priority
  - No need for a central arbiter; needs more lines for requests; ex: NuBus (Apple/Mac)
- Distributed arbitration by collision detection
  - Free-for-all: request the bus at will
  - A collision detector then resolves who gets it
  - Ethernet uses this
25. I/O to Memory, Processor, OS Interfaces
- Questions (p. 673)
  - How do I/O requests transform to device commands and get transferred to a device?
  - How are data transfers between device and memory done?
  - What is the role of the operating system?
- The OS
  - Device drivers operate in kernel/supervisory mode
  - Performs interrupt handling and DMA services
  - Functions: commands to I/O
  - Respond to I/O signals ... some are interrupts
  - Control data transfer ... buffers, DMA, other algorithms, control priorities
26. Commands to I/O Devices
- Two basic approaches
  - Direct I/O (programmed I/O, or PIO)
  - Memory-mapped I/O
- PIO
  - Special I/O instructions: in/out for Intel
  - The address associated with in/out is put on the address bus, but the op-code context causes the I/O interface to be accessed ... usually registers, which cause I/O activity
  - The address is an I/O port
- Memory mapped => see next
27. Commands to I/O Devices (cont.)
- Memory mapped
  - A certain portion of the address space is reserved for I/O devices
  - A program communicates with a device the same way it does with memory: memory instructions are used
  - If the address is in the device space range, the device controller responds with the appropriate commands to the device ... read/write
  - User programs are not allowed to access memory-mapped I/O space
  - The address used by the instruction encodes both device identity and type of data transmission
- Memory mapped is usually faster than PIO because DMA is available
28. I/O-Processor Communication: Polling / Memory Mapped
- Polling is the simplest way for I/O to communicate with the processor
- Periodically check status bits to see what to do next; the I/O device posts status in a special register, e.g., "I am busy"
- The processor continually checks the status, using either PIO or memory-mapped I/O
- Wastes a lot of processor time, because processors are faster than I/O devices
  - Many of the polls occur when the waited-for event has not yet happened
- OK for slow devices such as a mouse
- Under OS control, polls can be limited to periods only when the device is active, thus allowing polling even for faster devices: cheap I/O!
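A status-register polling loop of the kind described above can be sketched in a few lines. Everything here is invented for illustration (the device, the BUSY bit layout, the register names); a real driver would read a memory-mapped or port address instead of a Python object:

```python
# Polling sketch: spin on a device "status register" until the BUSY bit
# clears, then read the data register. The polls counted in the loop are
# the wasted processor work the slide is talking about.

BUSY = 0x1  # assumed status-bit layout

class FakeDevice:
    """Stand-in for a memory-mapped device that stays busy for a few reads."""
    def __init__(self):
        self.status, self.data, self._countdown = BUSY, 0xAB, 3
    def read_status(self):
        self._countdown -= 1
        if self._countdown == 0:
            self.status &= ~BUSY          # device finishes its operation
        return self.status

def poll_read(dev):
    polls = 0
    while dev.read_status() & BUSY:       # busy-wait: pure overhead
        polls += 1
    return dev.data, polls

data, polls = poll_read(FakeDevice())
print(hex(data), polls)   # the data arrives only after 2 wasted polls
```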
29. I/O Examples
- Examples for slow, medium, and high speed devices. Determine the impact of polling overhead for 3 devices. Assume 400 clock cycles per poll and a 500 MHz clock. In all cases no data can be missed.
- Example 1: a mouse polled 30 times/sec
  - Cycles/sec for polling = 30 polls/sec × 400 cyc/poll = 12,000 cyc/sec
  - % of processor cycles consumed = 12,000/500 MHz = 0.002%. Negligible impact on performance.
- Example 2: a floppy disk
  - Transfers data to the processor in 16-bit (2-byte) units
  - and has a data rate of 50 KB/sec
  - Polling rate = (50 KB/sec)/(2 bytes/poll) = 25K polls/sec. Cycles/sec for polling = 25K polls/sec × 400 cyc/poll = 10^7 cyc/sec. % of processor cycles consumed = (10^7 cyc/sec)/500 MHz = 2%. Still tolerable.
30. I/O Examples (cont.)
- Example 3: a hard drive. Transfers data in four-word chunks; transfer rate is 4 MB/sec. Must poll at the data rate in 4-word chunks: (4 MB/sec)/(16 bytes/xfr), or a polling rate of 250K polls/sec. Cycles/sec for polling = (250K polls/sec) × (400 cyc/poll) = 10^8 cyc/sec. % of processor cycles consumed = (10^8 cyc/sec)/500 MHz = 20%.
- 1/5 of the processor would be used in polling the disk! Not acceptable.
- The bottom line: polling works OK for low-speed devices but not for high-speed devices.
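The three polling examples share one formula, so they can be recomputed together (400 cycles/poll, 500 MHz clock, as above; the slide rounds the mouse figure to 0.002%):

```python
# Polling overhead as a percentage of the CPU (pp. 676-677 examples).

def polling_overhead(polls_per_sec, cycles_per_poll=400, clock_hz=500e6):
    """Fraction of CPU consumed by polling, in percent."""
    return polls_per_sec * cycles_per_poll / clock_hz * 100

mouse  = polling_overhead(30)          # 30 polls/sec
floppy = polling_overhead(50e3 / 2)    # 50 KB/sec, 2 bytes per poll
disk   = polling_overhead(4e6 / 16)    # 4 MB/sec, 16 bytes per poll
print(mouse, floppy, disk)             # ~0.002%, 2%, 20%
```

The overhead scales linearly with the required polling rate, which is exactly why the hard drive, polled 250,000 times a second, is the case that breaks.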
31. Interrupt-driven I/O
- The problem with simple polling is that it must be done even when nothing is happening, during a waiting period
- When CPU processing is needed for an I/O event, the processor is interrupted
- Interrupts are asynchronous
  - Not associated with any particular instruction
  - Allows instruction completion (compare with exceptions in chapter 5)
- An interrupt must convey further information, such as the identity of the device and its priority
- Convey this additional information by using vectored interrupts or a cause register
32. Interrupt Scheme
The granularity of an interrupt is a single machine instruction. The check for pending interrupts and the processing of interrupts are done between instructions being executed, i.e., the current instruction is completed before a pending interrupt is processed.
33. Overhead for Interrupt-driven I/O
- Using the previous example of a hard drive (p. 676)
  - Data transfers in 4-word chunks; transfer rate of 4 MB/sec
  - Assume the overhead for each transfer, including the interrupt, is 500 clock cycles
  - Find the % of the processor consumed if the hard drive is only transferring data (causing CPU interaction) 5% of the time
- Answer
  - The interrupt rate for a busy disk would be the same as the previous polling rate, to match the transfer rate: (250K interrupts/sec) × 500 cycles/interrupt = 125×10^6 cyc/sec
  - % of processor consumed during a transfer = 125×10^6/500 MHz = 25%. Assuming the disk is transferring data 5% of the time, the % of processor consumed (average) = 25% × 5% = 1.25%
  - No overhead when the disk is not actually transferring data: an improvement over polling.
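The interrupt-overhead arithmetic above is worth checking, since the 5% duty-cycle factor is the whole point of the example:

```python
# Interrupt-driven version of the disk example: 500 cycles per transfer
# (including the interrupt), 500 MHz CPU, disk busy only 5% of the time.

rate = 4e6 / 16                  # 250K transfers/sec while the disk is busy
busy_cost = rate * 500 / 500e6   # fraction of CPU while transferring
avg_cost = busy_cost * 0.05      # disk only transfers 5% of the time
print(busy_cost * 100, avg_cost * 100)  # ~25% while busy, ~1.25% on average
```

Compare with polling: the 20% polling cost is paid all the time, while the 25% interrupt cost is paid only while data is actually moving.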
34. DMA I/O
- Polling and interrupt-driven I/O are best with lower bandwidth devices, where cost is more of a factor.
- Both polling and interrupt-driven I/O put the burden of moving data and managing the transfer on the CPU.
- Even though the processor may continue processing during an I/O access, it ultimately must move the I/O data from the device when the data becomes available, or perhaps from some I/O buffer to main memory.
- In our previous example of an interrupt-driven hard disk, even though the CPU does not have to wait for every I/O event to complete, it would still consume 25% of the CPU cycles while the disk is transferring data. See p. 680.
- Interrupt-driven I/O for high bandwidth devices can be greatly improved if we make a device controller transfer data directly to memory without involving the processor: DMA (Direct Memory Access).
35. DMA I/O (cont.)
- DMA is a specialized processor that transfers data between memory and an I/O device while the CPU goes on with other tasks.
- The DMA controller is external to the CPU and must act as a bus master.
- The CPU first sets up the DMA registers with a memory address and the number of bytes to be transferred.
  - To the requesting program, this may be seen as setting up a control block in memory.
- DMA is frequently part of the controller for a device.
- Interrupts are still used with DMA, but only to inform the processor that the I/O transfer is complete or an error occurred.
- DMA is a form of multiprocessing or parallel processing; not a new idea: IBM channels for mainframes in the '60s.
  - Channels are programmable (with channel control words), whereas DMA is generally not programmable.
36. DMA I/O: How It Works
- Three steps of DMA
  - Processor sets up DMA: device ID, operation, source/destination, number of bytes to transfer
  - DMA controller arbitrates for the bus: supplies the correct commands to the device, source, destination, etc., then lets the data rip. Fancy buffering may be used ... ping/pong buffers. May be multi-channeled.
  - Interrupt the processor on completion of DMA or on an error
- DMA can still contend with the processor for memory and the bus.
- Problem: cycle stealing. When there is bus/memory contention because the CPU needs a memory word during a DMA transfer, DMA wins out and the CPU pauses instruction execution for a memory cycle (the cycle was stolen).
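The three steps above can be sketched as a control block the CPU fills in, which the controller then executes. This is purely illustrative: the field names and the callback-style "interrupt" are invented, not any real controller's layout.

```python
# Hypothetical DMA control block and transfer, mirroring the three steps:
# (1) CPU fills in the block, (2) controller moves the data,
# (3) controller signals completion via an interrupt.

from dataclasses import dataclass

@dataclass
class DmaControlBlock:
    device_id: int
    op: str            # "read" (device -> memory) or "write"
    mem_addr: int      # destination/source address in main memory
    nbytes: int

def dma_transfer(cb, memory, device_data, on_done):
    """Steps 2-3: move the bytes, then raise the completion 'interrupt'."""
    if cb.op == "read":
        memory[cb.mem_addr:cb.mem_addr + cb.nbytes] = device_data[:cb.nbytes]
    on_done("complete")                   # step 3: completion interrupt

mem = bytearray(16)
cb = DmaControlBlock(device_id=3, op="read", mem_addr=4, nbytes=4)
dma_transfer(cb, mem, b"\xde\xad\xbe\xef", lambda s: print("irq:", s))
print(mem[4:8])   # the four transferred bytes now sit in "main memory"
```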
37. Overhead Using DMA
- Again use the previous disk example on page 676.
- Assume the initial setup of DMA takes 1000 CPU cycles
- Assume interrupt handling for DMA completion takes 500 CPU cycles
- The hard drive has a transfer rate of 4 MB/sec and uses DMA
- The average transfer size from disk is 8 KB
- What % of the 500 MHz CPU is consumed if the disk is actively transferring 100% of the time? Ignore any bus contention between the CPU and the DMA controller.
- Answer: Each DMA transfer takes 8 KB/(4 MB/sec) = 0.002 sec/xfr. When the disk is constantly transferring, this costs (1000 + 500) cyc/xfr / 0.002 sec/xfr = 750,000 clock cyc/sec. Since the CPU runs at 500 MHz, the % of processor consumed = (750,000 cyc/sec)/500 MHz = 0.0015, or about 0.2%.
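The DMA arithmetic is a one-liner to verify (note the exact fraction is 0.15%, which the slide rounds up to about 0.2%):

```python
# DMA version of the disk example: 1000 setup + 500 completion cycles
# per 8 KB transfer at 4 MB/sec, 500 MHz CPU, disk busy 100% of the time.

xfer_sec = 8e3 / 4e6                     # 0.002 sec per 8 KB transfer
cycles_per_sec = (1000 + 500) / xfer_sec
cpu_fraction = cycles_per_sec / 500e6
print(cycles_per_sec, cpu_fraction * 100)  # ~750,000 cyc/sec, ~0.15% of CPU
```

Against 20% for polling and 25%-while-busy for interrupts, DMA reduces the CPU cost of the same disk traffic by two orders of magnitude.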
38. DMA: Virtual vs. Physical Addressing (p. 683)
- In a VM system, should DMA use virtual addresses or physical addresses? This topic in the book is at best flaky; here is my take on it.
- If virtual addresses are used
  - Contiguous pages in VM may not be contiguous in physical memory.
  - A DMA request is made by specifying the virtual address of the starting point of the data and the number of bytes to be transferred.
  - The DMA unit will have to translate VA to PA for all reads/writes to/from memory: a performance problem. In practice the address translation may be done by the OS, which provides DMA with the physical addresses: a scatter/gather operation. Fancy DMA controllers may be able to chain a series of pages for a single request spanning more than one page; the OS provides a list of physical page frame addresses corresponding to the multi-page DMA block in VM. Or: restrict DMA block sizes to integral pages and translate only the starting address.
- If physical addresses are used, they may not be contiguous in virtual memory if a page boundary is crossed. Must constrain all DMA transfers to stay within a single page, or requests must be for a page at a time.
- Also, the OS must be savvy enough not to relocate pages in the target/source region during a DMA transfer.
39. DMA Memory Coherency
- DMA in memory/cached systems
  - Without DMA, all memory access is through address translation and the cache
  - With DMA, data is transferred to/from main memory, bypassing the cache => coherency problem
  - DMA reads/writes go to main memory
    - There is no cache between the DMA controller and main memory
  - The value of a memory location seen by DMA and by the CPU may differ
    - If DMA writes into main memory at a location for which there are corresponding blocks in the cache, the cached data seen by the CPU will be obsolete.
    - If the cache is write-back, and DMA reads a value directly from main memory before the cache does a write back (due to lazy write backs), then the value read by DMA will be obsolete. Remember there is a possibility that DMA will take priority over the CPU in accessing memory - to its disadvantage.
- Possible solutions: see next =>
40. DMA Memory Coherency (cont.)
- Some solutions (see pp. 683-684)
  - Route all I/O activity through the cache
    - Performance hit and may be costly
    - May flush out good data needed by the processor
    - ... I/O data may not be that critical to the processor at the time it arrives; the working set may be messed up.
  - The OS selectively invalidates the cache for an I/O->memory operation, or forces a write back for an I/O read from memory (a memory->I/O operation), called cache flushing. (There may be some read/write terminology confusion here!) Some HW support is needed here.
  - A hardware mechanism to selectively flush (or invalidate) cache entries. This is a common mechanism in multiprocessor systems where there are many caches for a common main memory (the MP cache coherency problem). The same technique works for I/O; after all, DMA is a form of multiprocessing.
41. Designing an I/O System: The Problem
- Specifications for a system
  - CPU: maximum instruction rate 300 MIPS; average number of CPU instructions per I/O in the OS: 50,000
  - Bandwidth of the memory backplane bus: 100 MB/sec
  - SCSI-2 controllers with a transfer rate of 20 MB/sec; the SCSI bus on each controller can accommodate up to 7 disks
  - Disk drives: read/write bandwidth of 5 MB/sec, average seek + rotational latency of 10 ms
- The workload this system must support
  - 64 KB reads, sequential on a track
  - The user program needs 100,000 instructions per I/O operation. This is distinct from the instructions in the OS.
- The problem: find the maximum sustainable I/O rate and the number of disks and SCSI controllers required. Assume that reads can always be done on an idle disk if one exists; ignore disk conflicts.
42. Designing an I/O System: The Solution
- Strategy: there are two fixed components in the system, the memory bus and the CPU. Find the I/O rate that each component can sustain and determine which of these is the bottleneck.
- Each I/O takes 100,000 user instructions and 50,000 OS instructions. Max I/O rate for the CPU = (instruction rate)/(instructions per I/O) = (300×10^6)/((50+100)×10^3) = 2000 I/Os per sec
- Each I/O transfers 64 KB, thus max I/O rate of the backplane bus = (bus BW)/(bytes per I/O) = (100×10^6)/(64×10^3) = 1562 I/Os per sec
- The bus is the bottleneck: design the system to support the bus performance of 1562 I/Os per sec.
- Number of disks needed to accommodate 1562 I/Os per sec: time per I/O at the disk = seek/rotational latency + transfer time = 10 ms + 64 KB/(5 MB/sec) = 22.8 ms. Thus each disk can complete 1/22.8 ms = 43.9 I/Os per sec. To saturate the bus, we need (1562 I/Os per sec)/(43.9 I/Os per sec) = 36 disks.
- How many SCSI busses is this? Required transfer rate per disk = xfr size/xfr time = 64 KB/22.8 ms = 2.74 MB/sec. Assume we can use all the SCSI bus BW. We can place SCSI BW/xfr rate per disk = (20 MB/sec)/(2.74 MB/sec) = 7.3 => 7 disks on each SCSI bus. Note: a SCSI bus can support a max of 7 disks. For 36 disks we need 36/7 = 5.14 => 6 buses.
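The whole design calculation above fits in a short script (decimal KB/MB throughout, as in the slide; tiny rounding differences from the slide's 2.74 MB/sec per-disk figure do not change the disk or bus counts):

```python
# Design-example arithmetic: find the bottleneck, then size disks and busses.
import math

ios_cpu = 300e6 / (50e3 + 100e3)      # CPU limit: 2000 I/Os per sec
ios_bus = 100e6 / 64e3                # bus limit: 1562.5 I/Os per sec
ios = min(ios_cpu, ios_bus)           # the bus is the bottleneck

io_time = 0.010 + 64e3 / 5e6          # seek+rotation + transfer = 22.8 ms
per_disk = 1 / io_time                # ~43.9 I/Os per sec per disk
disks = math.ceil(ios / per_disk)     # disks needed to saturate the bus

disk_rate = 64e3 / io_time            # sustained bytes/sec per disk
per_bus = min(7, int(20e6 // disk_rate))  # SCSI caps at 7 disks per bus
busses = math.ceil(disks / per_bus)
print(ios, disks, busses)             # 1562.5 36 6
```

The structure mirrors the strategy bullet: compute each fixed component's sustainable rate, take the minimum, then back out the device counts from that bottleneck.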