Title: THIRD REVIEW SESSION
1THIRD REVIEW SESSION
- Jehan-François Pâris
- May 5, 2010
2MATERIALS (I)
- Memory hierarchies
- Caches
- Virtual memory
- Protection
- Virtual machines
- Cache consistency
3MATERIALS (II)
- I/O Operations
- More about disks
- I/O operation implementation
- Busses
- Memory-mapped I/O
- Specific I/O instructions
- RAID organizations
4MATERIALS (III)
- Parallel Architectures
- Shared memory multiprocessors
- Computer clusters
- Hardware multithreading
- SISD, SIMD, MIMD,
- Roofline performance model
5CACHING AND VIRTUAL MEMORY
6Common objective
- Make a combination of
- Small, fast and expensive memory
- Large, slow and cheap memory
- look like
- A single large and fast memory
- Fetch policy is fetch on demand
7Questions to ask
- What are the transfer units?
- How are they placed in the faster memory?
- How are they accessed?
- How do we handle misses?
- How do we implement writes?
- and more generally
- Are these tasks performed by the hardware or the
OS?
8Transfer units
- Blocks or pages containing 2^n bytes
- Always properly aligned
- If a block or a page contains 2^n bytes, the n
LSBs of its start address will be all zeroes
9Examples
- If block size is 4 words,
- Corresponds to 16 = 2^4 bytes
- 4 LSBs of block address will be all zeroes
- If page size is 4 KB
- Corresponds to 4 × 2^10 = 2^12 bytes
- 12 LSBs of page address will be all zeroes
- Remaining bits of the address form the page number
10Examples
Page size 4 KB
32-bit address of first byte in page:
XXXXXXXXXXXXXXXXXXXX <12 zeroes>
Address of any byte in the page: 20-bit page number followed by 12-bit offset
XXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY
11Consequence
- In a 32-bit architecture,
- We identify a block or a page of size 2^n bytes by the (32 - n) MSBs of its address
- These bits will be called
- Tag (caches)
- Page number (virtual memory)
12Placement policy
- Two extremes
- Each block can only occupy a fixed location in the faster memory: direct mapping (many caches)
- Each page can occupy any location in the faster memory: full associativity (virtual memory)
13Direct mapping
- Assume
- Cache has 2^m entries
- Block size is 2^n bytes
- a is the block address (with its n LSBs removed)
- The block will be placed at cache position
- a mod 2^m
14Consequence
- The tag identifying the cache block will be the start address of the block with its n + m LSBs removed
- The original n LSBs because they are known to be all zeroes
- The next m LSBs because they are equal to a mod 2^m
15Consequence
Block start address
→ remove the n zero LSBs and m additional LSBs (given by a mod 2^m)
→ Tag
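- A minimal C sketch of this address split (the block and index widths are assumed values for illustration, not taken from the slides):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BITS 3    /* 2^3 = 8-byte blocks, as in the next slide */
    #define INDEX_BITS 10   /* 2^10 cache entries (assumed cache size)   */

    int main(void) {
        uint32_t addr   = 0x12345678;                                      /* example address */
        uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);                 /* n LSBs           */
        uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); /* a mod 2^m        */
        uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);               /* remaining MSBs   */
        printf("offset=%u index=%u tag=0x%x\n",
               (unsigned) offset, (unsigned) index, (unsigned) tag);
        return 0;
    }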
16A cache whose block size is 8 bytes
(Figure: each cache entry holds a tag and the 8-byte block contents)
17Fully associative solution
Page Frame
0 4
1 7
2 27
3 44
4 5
- Used in virtual memory systems
- Each page can occupy any free page frame in main memory
- Use a page table
- Without the redundant first column (the page number is the index into the table)
18Solutions with limited associativity
- A cache of size 2^m with associativity level k lets a given block occupy any of k possible locations in the cache
- Implementation looks very much like k caches of size 2^m/k put together
- All possible cache locations for a block have the same position a mod (2^m/k) in each of the smaller caches
19A set-associative cache with k = 2
20Accessing an entry
- In a cache, use hardware to compute the possible cache position(s) for the block containing the data
- a mod 2^m for a cache using direct mapping
- a mod (2^m/k) for a cache of associativity level k
- Then check whether the cache entry is valid using its valid bit
21Accessing an entry
- In a VM system, hardware checks the TLB to find the frame containing a given page number
- TLB entries contain
- A page number (tag)
- A frame number
- A valid bit
- A dirty bit
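- A minimal C sketch of such an entry (field widths assume 4 KB pages and 32-bit addresses; a real TLB packs these bits in hardware):

    #include <stdint.h>

    /* One TLB entry, with the fields listed above. */
    struct tlb_entry {
        uint32_t page_number;   /* tag: 20-bit virtual page number            */
        uint32_t frame_number;  /* 20-bit physical page frame                 */
        uint8_t  valid;         /* 1 if the mapping is valid                  */
        uint8_t  dirty;         /* 1 if the page must be saved when expelled  */
    };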
22Accessing an entry
- The valid bit indicates if the mapping is valid
- The dirty bit indicates whether we need to save the page contents when we expel it
23Accessing an entry
- If the page mapping is not in the TLB, we must consult the page table and update the TLB
- Can be done by hardware or software
24Realization
25Handling cache misses
- Cache hardware fetches missing block
- Often overwriting an existing entry
- Which one?
- The one that occupies the same location, if the cache uses direct mapping
- One of those that occupy the same location (set), if the cache is set-associative
26Handling cache misses
- Before expelling a cache entry, we must
- Check its dirty bit
- Save its contents if dirty bit is on.
27Handling page faults
- OS fetches missing page
- Often overwriting an existing page
- Which one?
- One that was not recently used
- Selected by page replacement policy
28Handling page faults
- Before expelling a page, we must
- Check its dirty bit
- Save its contents if dirty bit is on.
29Handling writes (I)
- Two ways to handle writes
- Write through
- Each write updates both the cache and the main memory
- Write back
- Writes are not propagated to the main memory until the updated word is expelled from the cache
30Handling writes (II)
(Figure: the CPU writes into the cache; the cache updates RAM later)
31Pros and cons
- Write through
- Ensures that memory is always up to date
- Expelled cache entries can be overwritten
- Write back
- Faster writes
- Complicates cache expulsion procedure
- Must write back cache entries that have been
modified in the cache
32A better write through (I)
- Add a small buffer to speed up the write performance of write-through caches
- At least four words
- Holds modified data until they are written into main memory
- Cache can proceed as soon as the data are written into the write buffer
33A better write through (II)
(Figure: CPU → Cache → Write buffer → RAM)
34Designing RAM to support caches
- RAM connected to CPU through a "bus"
- Clock rate much slower than CPU clock rate
- Assume that a RAM access takes
- 1 bus clock cycle to send the address
- 15 bus clock cycles to initiate a read
- 1 bus clock cycle to send a word of data
35Designing RAM to support caches
- Assume
- Cache block size is 4 words
- One-word bank of DRAM
- Fetching a cache block would take
- 1 + 4×15 + 4×1 = 65 bus clock cycles
- Transfer rate is 0.25 byte/bus cycle
- Awful!
36Designing RAM to support caches
- Could
- Have an interleaved memory organization
- Four one-word banks of DRAM
- A 32-bit bus
(Figure: four one-word RAM banks, banks 0 to 3, connected to a 32-bit bus)
37Designing RAM to support caches
- Can do the 4 accesses in parallel
- Must still transmit the block 32 bits by 32 bits
- Fetching a cache block would take
- 1 + 15 + 4×1 = 20 bus clock cycles
- Transfer rate is 0.80 byte/bus cycle
- Even better
- Much cheaper than having a 64-bit bus
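- A small C sketch reproducing the two cycle counts above (1 cycle for the address, 15 cycles per DRAM access, 1 cycle per 32-bit word, 4-word blocks; all values from these slides):

    #include <stdio.h>

    int main(void) {
        int words = 4;
        int one_bank    = 1 + words * 15 + words * 1;  /* sequential accesses: 65 cycles    */
        int interleaved = 1 + 15 + words * 1;          /* 4 banks accessed in parallel: 20  */
        printf("one bank:    %d cycles, %.2f bytes/cycle\n", one_bank, 16.0 / one_bank);
        printf("interleaved: %d cycles, %.2f bytes/cycle\n", interleaved, 16.0 / interleaved);
        return 0;
    }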
38PERFORMANCE ISSUES
39Memory stalls
- Can divide CPU time into
- N_EXEC clock cycles spent executing instructions
- N_MEM_STALLS cycles spent waiting for memory accesses
- We have
- CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE
40Memory stalls
- We assume that
- cache access times can be neglected
- most CPU cycles spent waiting for memory accesses
are caused by cache misses
41Global impact
- We have
- N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate × Cache miss penalty
- and also
- N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) × Cache miss penalty
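- A minimal C sketch of this formula, using the miss rates and penalty from the example on the next slide:

    #include <stdio.h>

    int main(void) {
        double i_miss_rate = 0.02, d_miss_rate = 0.05;     /* instruction and data cache miss rates */
        double mem_refs_per_instr = 0.40, miss_penalty = 100.0;
        double stalls_per_instr =
            i_miss_rate * miss_penalty +                     /* instruction fetches */
            mem_refs_per_instr * d_miss_rate * miss_penalty; /* data accesses       */
        printf("stall cycles per instruction: %.1f\n", stalls_per_instr);  /* 4.0 */
        return 0;
    }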
42Example
- Miss rate of the instruction cache is 2 percent
- Miss rate of the data cache is 5 percent
- In the absence of memory stalls, each instruction would take 2 cycles
- Miss penalty is 100 cycles
- 40 percent of instructions access the main memory
- How many cycles are lost due to cache misses?
43Solution (I)
- Impact of instruction cache misses
- 0.02 × 100 = 2 cycles/instruction
- Impact of data cache misses
- 0.40 × 0.05 × 100 = 2 cycles/instruction
- Total impact of cache misses
- 2 + 2 = 4 cycles/instruction
44Solution (II)
- Average number of cycles per instruction
- 2 + 4 = 6 cycles/instruction
- Fraction of time wasted
- 4/6 = 67 percent
45Average memory access time
- Some authors call it AMAT
- T_AVERAGE = T_CACHE + f × T_MISS
- where f is the cache miss rate
- Times can be expressed
- In nanoseconds
- In number of cycles
46Example
- A cache has a hit rate of 96 percent
- Accessing data
- In the cache requires one cycle
- In the memory requires 100 cycles
- What is the average memory access time?
47Solution
- Miss rate = 1 - Hit rate = 0.04
- Applying the formula
- T_AVERAGE = 1 + 0.04 × 100 = 5 cycles
48In other words
It's the miss rate, stupid!
49Improving cache hit rate
- Two complementary techniques
- Using set-associative caches
- Must check the tags of all blocks with the same index value
- Slower
- Have fewer collisions
- Fewer misses
- Use a cache hierarchy
50A cache hierarchy
- Topmost cache
- Optimized for speed, not miss rate
- Rather small
- Uses a small block size
- As we go down the hierarchy
- Cache sizes increase
- Block sizes increase
- Cache associativity level increases
51Example
- Cache miss rate per instruction is 3 percent
- In the absence of memory stalls, each instruction would take one cycle
- Cache miss penalty is 100 ns
- Clock rate is 4 GHz
- How many cycles are lost due to cache misses?
52Solution (I)
- Duration of clock cycle
- 1/(4 GHz) = 0.25 × 10^-9 s = 0.25 ns
- Cache miss penalty
- 100 ns = 400 cycles
- Total impact of cache misses
- 0.03 × 400 = 12 cycles/instruction
53Solution (II)
- Average number of cycles per instruction
- 1 + 12 = 13 cycles/instruction
- Fraction of time wasted
- 12/13 = 92 percent
A very good case for hardware multithreading
54Example (cont'd)
- How much faster would the processor be if we added an L2 cache that
- Has a 5 ns access time
- Would reduce the miss rate to main memory to one percent?
55Solution (I)
- L2 cache access time
- 5 ns = 20 cycles
- Impact of cache misses per instruction
- L1 cache misses + L2 cache misses = 0.03 × 20 + 0.01 × 400 = 0.6 + 4.0 = 4.6 cycles/instruction
- Average number of cycles per instruction
- 1 + 4.6 = 5.6 cycles/instruction
56Solution (II)
- Fraction of time wasted
- 4.6/5.6 = 82 percent
- CPU speedup
- 13/5.6 ≈ 2.3
57Problem
- Redo the second part of the example assuming that the secondary cache
- Has a 3 ns access time
- Can reduce the miss rate to main memory to one percent
58Solution
- Fraction of time wasted
- 86 percent
- CPU speedup
- 1.22
The new L2 cache with a lower access time but a higher miss rate performs much worse than the first L2 cache
59Example
- A virtual memory has a page fault rate of 10^-4 faults per memory access
- Accessing data
- In the memory requires 100 ns
- On disk requires 5 ms
- What is the average memory access time?
- T_avg = 100 ns + 10^-4 × 5 ms = 600 ns
60The cost of a page fault
- Let
- T_m be the main memory access time
- T_d the disk access time
- f the page fault rate
- T_a the average access time of the VM
- T_a = (1 - f) T_m + f (T_m + T_d) = T_m + f T_d
61Example
- Assume T_m = 50 ns and T_d = 5 ms

f      Mean memory access time
10^-3  50 ns + 5 ms/10^3 = 5,050 ns
10^-4  50 ns + 5 ms/10^4 = 550 ns
10^-5  50 ns + 5 ms/10^5 = 100 ns
10^-6  50 ns + 5 ms/10^6 = 55 ns
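- A small C sketch of the T_a = T_m + f × T_d formula, reproducing the table above with the slide's values:

    #include <stdio.h>

    int main(void) {
        double t_mem_ns  = 50.0;
        double t_disk_ns = 5e6;   /* 5 ms expressed in ns */
        double rates[] = { 1e-3, 1e-4, 1e-5, 1e-6 };
        for (int i = 0; i < 4; i++)
            printf("f = %.0e  ->  T_a = %.0f ns\n",
                   rates[i], t_mem_ns + rates[i] * t_disk_ns);
        return 0;
    }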
62In other words
It's the page fault rate, stupid!
63Locality principle (I)
- A process that would access its pages in a
totally unpredictable fashion would perform very
poorly in a VM system unless all its pages are in
main memory
64Locality principle (II)
- Process P randomly accesses a very large array consisting of n pages
- If m of these n pages are in main memory, the page fault frequency of the process will be (n - m)/n
- Must switch to another algorithm
65 First problem
- A virtual memory system has
- 32 bit addresses
- 4 KB pages
- What are the sizes of the
- Page number field?
- Offset field?
66Solution (I)
- Step 1: Convert the page size to a power of 2: 4 KB = 2^12 B
- Step 2: The exponent is the length of the offset field
67Solution (II)
- Step 3: Size of page number field = Address size - Offset size
- Here 32 - 12 = 20 bits
- 12 bits for the offset and 20 bits for the page number
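- A minimal C sketch of this split (the example address is arbitrary; the widths follow the 4 KB page size above):

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 12  /* 4 KB pages = 2^12 bytes */

    int main(void) {
        uint32_t vaddr  = 0xCAFEBABE;                          /* arbitrary example address */
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);   /* 12 LSBs                   */
        uint32_t page   = vaddr >> OFFSET_BITS;                /* 20 MSBs                   */
        printf("page number = 0x%05x, offset = 0x%03x\n",
               (unsigned) page, (unsigned) offset);
        return 0;
    }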
68MEMORY PROTECTION
69Objective
- Unless we have an isolated single-user system, we must prevent users from
- Accessing
- Deleting
- Modifying
- the address spaces of other processes, including
the kernel
70Memory protection (I)
- VM ensures that processes cannot access page frames that are not referenced in their page table
- Can refine control by distinguishing among
- Read access
- Write access
- Execute access
- Must also prevent processes from modifying their
own page tables
71Dual-mode CPU
- Require a dual-mode CPU
- Two CPU modes
- Privileged mode (or executive mode) that allows the CPU to execute all instructions
- User mode that allows the CPU to execute only safe, unprivileged instructions
- The state of the CPU is determined by a special bit
72Switching between states
- User mode will be the default mode for all programs
- Only the kernel can run in supervisor mode
- Switching from user mode to supervisor mode is done through an interrupt
- Safe because the jump address is at a well-defined location in main memory
73Memory protection (II)
- Has additional advantages
- Prevents programs from corrupting the address spaces of other programs
- Prevents programs from crashing the kernel
- Not true for device drivers, which are inside the kernel
- Required part of any multiprogramming system
74INTEGRATING CACHES AND VM
75The problem
- In a VM system, each byte of memory has two addresses
- A virtual address
- A physical address
- Should cache tags contain virtual addresses or
physical addresses?
76Discussion
- Using virtual addresses
- Directly available
- Bypass TLB
- Cache entries specific to a given address space
- Must flush caches when the OS selects another
process
- Using physical addresses
- Must first access the TLB
- Cache entries not specific to a given address space
- Do not have to flush caches when the OS selects another process
77The best solution
- Let the cache use physical addresses
- No need to flush the cache at each context switch
- TLB access delay is tolerable
78VIRTUAL MACHINES
79Key idea
- Let different operating systems run at the same time on a single computer
- Windows, Linux and Mac OS
- A real-time OS and a conventional OS
- A production OS and a new OS being tested
80How it is done
- A hypervisor (VM monitor) defines two or more virtual machines
- Each virtual machine has
- Its own virtual CPU
- Its own virtual physical memory
- Its own virtual disk(s)
81Two virtual machines
(Figure: the two VM kernels and their processes run in user mode; only the hypervisor runs in privileged mode)
82Translating a block address
(Figure: the VM kernel asks to access block x, y of its virtual disk; the hypervisor translates this to block v, w of the actual disk and performs the access there)
83Handling I/Os
- Difficult task because
- Wide variety of devices
- Some devices may be shared among several VMs
- Printers
- Shared disk partition
- Want to let Linux and Windows access the same
files
84Virtual Memory Issues
- Each VM kernel manages its own memory
- Its page tables map program virtual addresses into pseudo-physical addresses
- It treats these addresses as physical addresses
85The dilemma
(Figure: for user process A, the VM kernel believes that page 735 is stored in page frame 435, while the hypervisor knows it really sits in page frame 993 of the actual RAM)
86The solution (I)
- Address translation must remain fast!
- The hypervisor lets each VM kernel manage its own page tables but does not use them
- They contain bogus mappings!
- It maintains instead its own shadow page tables with the correct mappings
- Used to handle TLB misses
87The solution (II)
- To keep its shadow page tables up to date, the hypervisor must track any changes made by the VM kernels
- It marks the VM kernels' page tables read-only, so any update traps to the hypervisor
88Nastiest Issue
- The whole VM approach assumes that a kernel executing in user mode will behave exactly like a kernel executing in privileged mode
- Not true for all architectures!
- Example: the Intel x86 pop flags (POPF) instruction behaves differently in user mode instead of trapping
89Solutions
- Modify the instruction set and eliminate instructions like POPF
- IBM redesigned the instruction set of their 360 series for the 370 series
- Mask it through clever software
- Dynamic "binary translation" when direct execution of the code could not work (VMware)
90CACHE CONSISTENCY
91The problem
- Specific to architectures with
- Several processors sharing the same main memory
- Multicore architectures
- Each core/processor has its own private cache
- A must for performance
- Happens when same data are present in two or more
private caches
92An example (I)
(Figure: two CPUs with private caches share variable x in RAM)
93An example (II)
- One CPU increments x
- The other CPU still assumes x = 0
94An example
- Both CPUs must apply the two updates in the same order
- One update sets x to 1
- The other resets x to 1
95Rules
- Whenever a process accesses a variable, it always gets the value stored by the processor that updated that variable last, if the updates are sufficiently separated in time
- A processor accessing a variable sees all updates applied to that variable in the same order
- No compromise is possible here
96A realization Snoopy caches
- All caches are linked to the main memory through a shared bus
- All caches observe the writes performed by other caches
- When a cache notices that another cache performs a write on a memory location that it holds, it invalidates the corresponding cache block
97An example (I)
- One CPU fetches x = 2 from RAM
98An example (II)
- The other CPU also fetches x from RAM
99An example (III)
- One of the CPUs resets x to 0
100An example (IV)
- That CPU performs the write-through to RAM
- The other CPU detects the write-through and invalidates its copy of x
101An example (V)
- When the CPU with the invalidated copy wants to access x, its cache gets the correct value from RAM
102A last correctness condition
- Caches cannot reorder their memory updates
- The cache-to-RAM write buffer must be FIFO
- First in, first out
104Miscellaneous fallacies
- Segmented address spaces
- Address is a segment number plus an offset within the segment
- Programmers hate them
- Ignoring virtual memory behavior when accessing large two-dimensional arrays
- Believing that you can virtualize any CPU architecture
105DEPENDABILITY
106Reliability and Availability
- Reliability
- Probability R(t) that the system will be up at time t if it was up at time t = 0
- Availability
- Reliability and availability do not measure the
same thing!
107MTTF, MTTR and MTBF
- MTTF is the mean time to failure
- MTTR is the mean time to repair
- 1/MTTF is the failure rate λ
- MTBF, the mean time between failures, is
- MTBF = MTTF + MTTR
108Reliability
- As a first approximation, R(t) = exp(-t/MTTF)
- Not true if the failure rate varies over time
109Availability
- Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF
- MTTR is very important
110Example
- A server crashes on the average once a month
- When this happens, it takes six hours to reboot
it - What is the server availability ?
111Solution
- MTBF = 30 days
- MTTR = 6 hours = 1/4 day
- MTTF = 29 3/4 days
- Availability is 29.75/30 ≈ 99.2 percent
112Example
- A disk drive has a MTTF of 20 years.
- What is the probability that the data it contains
will not be lost over a period of five years?
113Example
- A disk farm contains 100 disks whose MTTF is 20
years. - What is the probability that no data will be
lost over a period of five years?
114Solution
- The aggregate failure rate of the disk farm is
- 100 × 1/20 = 5 failures per year
- The mean time to failure of the farm is
- 1/5 year
- We apply the formula
- R(t) = exp(-t/MTTF) = exp(-5 × 5) = exp(-25) ≈ 1.4 × 10^-11
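- A small C sketch of this computation (values from the example; link with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double disks = 100.0, disk_mttf_years = 20.0, t_years = 5.0;
        double farm_failure_rate = disks / disk_mttf_years;  /* 5 failures/year */
        double farm_mttf = 1.0 / farm_failure_rate;          /* 0.2 year        */
        double reliability = exp(-t_years / farm_mttf);      /* about 1.4e-11   */
        printf("R(%.0f years) = %.2e\n", t_years, reliability);
        return 0;
    }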
115RAID Arrays
116Today's Motivation
- We use RAID today for
- Increasing disk throughput by allowing parallel access
- Eliminating the need to make disk backups
- Disks are too big to be backed up in an efficient fashion
117RAID LEVEL 0
- No replication
- Advantages
- Simple to implement
- No overhead
- Disadvantage
- If the array has n disks, its failure rate is n times the failure rate of a single disk
118RAID levels 0 and 1
(Figure: a RAID level 0 array, and the RAID level 1 organization in which each disk has a mirror)
119RAID LEVEL 1
- Mirroring
- Two copies of each disk block
- Advantages
- Simple to implement
- Fault-tolerant
- Disadvantage
- Requires twice the disk capacity of normal file
systems
120RAID LEVEL 2
- Instead of duplicating the data blocks, we use an error correction code
- Very bad idea because disk drives either work correctly or do not work at all
- The only possible errors are omission errors
- We need an omission correction code
- A parity bit is enough to correct a single omission
121RAID levels 2 and 3
(Figure: RAID level 2 uses several check disks; RAID level 3 uses a single parity disk)
122RAID LEVEL 3
- Requires N + 1 disk drives
- N drives contain data (1/N of each data block)
- Block b_k is now partitioned into N fragments b_k,1, b_k,2, ..., b_k,N
- The parity drive contains the exclusive or (XOR) of these N fragments
- p_k = b_k,1 ⊕ b_k,2 ⊕ ... ⊕ b_k,N
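- A minimal C sketch of this parity computation (the fragment count and size are illustrative; running the same loop over the surviving fragments plus the parity rebuilds a lost fragment):

    #include <stddef.h>
    #include <stdint.h>

    #define NDATA 4        /* number of data fragments, assumed */
    #define FRAG_BYTES 512 /* fragment size, assumed            */

    /* parity[i] = data[0][i] ^ data[1][i] ^ ... ^ data[NDATA-1][i] */
    void xor_parity(const uint8_t data[NDATA][FRAG_BYTES],
                    uint8_t parity[FRAG_BYTES]) {
        for (size_t i = 0; i < FRAG_BYTES; i++) {
            uint8_t p = 0;
            for (int d = 0; d < NDATA; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }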
123How parity works?
- Truth table for XOR (same as parity)

A  B  A ⊕ B
0  0    0
0  1    1
1  0    1
1  1    0
124Recovering from a disk failure
- A small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate the failure of either D0 or D1

D0  D1  P  |  D1 ⊕ P (= D0)  D0 ⊕ P (= D1)
0   0   0  |       0               0
0   1   1  |       0               1
1   0   1  |       1               0
1   1   0  |       1               1
125How RAID level 3 works (I)
- Assume we have N + 1 disks
- Each block is partitioned into N equal chunks
- N = 4 in the example
126How RAID level 3 works (II)
- XOR the data chunks to compute the parity chunk
- Each chunk is written onto a separate disk
127How RAID level 3 works (III)
- Each read/write involves all disks in the RAID array
- Cannot do two or more reads/writes in parallel
- Performance of the array is no better than that of a single disk
128RAID LEVEL 4 (I)
- Requires N + 1 disk drives
- N drives contain data
- Individual blocks, not chunks
- Blocks with the same disk address form a stripe
129RAID LEVEL 4 (II)
- The parity drive contains the exclusive or of the N blocks in each stripe
- p_k = b_k ⊕ b_k+1 ⊕ ... ⊕ b_k+N-1
- The parity block now reflects the contents of several blocks!
- Can now do parallel reads/writes
130RAID levels 4 and 5
(Figure: in RAID level 4 the single parity disk is a bottleneck; RAID level 5 spreads the parity blocks over all drives)
131RAID LEVEL 5
- The single parity drive of RAID level 4 is involved in every write
- Will limit parallelism
- RAID level 5 distributes the parity blocks among the N + 1 drives
- Much better
132The small write problem
- Specific to RAID level 5
- Happens when we want to update a single block
- The block belongs to a stripe
- How can we compute the new value of the parity block p_k of that stripe?
133First solution
- Read the values of the N - 1 other blocks in the stripe
- Recompute
- p_k = b_k ⊕ b_k+1 ⊕ ... ⊕ b_k+N-1
- Solution requires
- N - 1 reads
- 2 writes (new block and new parity block)
134Second solution
- Assume we want to update block b_m
- Read the old values of b_m and of the parity block p_k
- Compute
- new p_k = new b_m ⊕ old b_m ⊕ old p_k
- Solution requires
- 2 reads (old values of the block and of the parity block)
- 2 writes (new block and new parity block)
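- A minimal C sketch of this small-write parity update (the block size is illustrative):

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_BYTES 4096  /* illustrative block size */

    /* new parity = new data ^ old data ^ old parity, byte by byte */
    void update_parity(const uint8_t new_block[BLOCK_BYTES],
                       const uint8_t old_block[BLOCK_BYTES],
                       uint8_t parity[BLOCK_BYTES]) {
        for (size_t i = 0; i < BLOCK_BYTES; i++)
            parity[i] ^= new_block[i] ^ old_block[i];
    }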
135RAID level 6 (I)
- Not part of the original proposal
- Two check disks
- Tolerates two disk failures
- More complex updates
136RAID level 6 (II)
- Has become more popular as disks are becoming
- Bigger
- More vulnerable to irrecoverable read errors
- The most frequent cause of RAID level 5 array failures is an irrecoverable read error occurring while the contents of a failed disk are being reconstituted
137CONNECTING I/O DEVICES
138Busses
- Connecting computer subsystems with each other was traditionally done through busses
- A bus is a shared communication link connecting multiple devices
- Transmits several bits at a time
- Parallel busses
139Busses
140Examples
- Processor-memory busses
- Connect CPU with memory modules
- Short and high-speed
- I/O busses
- Longer
- Wide range of data bandwidths
- Connect to memory through the processor-memory bus or a backplane bus
141Synchronous busses
- Include a clock in the control lines
- Bus protocols are expressed as actions to be taken at each clock pulse
- Have very simple protocols
- Disadvantages
- All bus devices must run at same clock rate
- Due to clock skew issues, cannot be both fast
and long
142Asynchronous busses
- Have no clock
- Can accommodate a wide variety of devices
- Have no clock skew issues
- Require a handshaking protocol before any transmission
- Implemented with extra control lines
143Advantages of busses
- Cheap
- One bus can link many devices
- Flexible
- Can add devices
144Disadvantages of busses
- Shared devices can become bottlenecks
- Hard to run many parallel lines at high clock
speeds
145New trend
- Away from parallel shared buses
- Towards serial point-to-point switched interconnections
- Serial
- One bit at a time
- Point-to-point
- Each line links a specific device to another
specific device
146x86 bus organization
- The processor connects to peripherals through two chips (bridges)
- North Bridge
- South Bridge
147x86 bus organization
North Bridge
South Bridge
148North bridge
- Essentially a DMA controller
- Lets the disk controller access main memory without any intervention of the CPU
- Connects the CPU to
- Main memory
- Optional graphics card
- South Bridge
149South Bridge
- Connects North bridge to a wide variety of I/O
busses
150Communicating with I/O devices
- Two solutions
- Memory-mapped I/O
- Special I/O instructions
151Memory mapped I/O
- A portion of the address space is reserved for I/O operations
- Writes to any of these addresses are interpreted as I/O commands
- Reading from these addresses gives access to
- Error bit
- I/O completion bit
- Data being read
152Memory mapped I/O
- User processes cannot access these addresses
- Only the kernel
- Prevents user processes from accessing the disk
in an uncontrolled fashion
153 Dedicated I/O instructions
- Privileged instructions that cannot be executed by user processes
- Only the kernel can issue them
- Prevents user processes from accessing the disk in an uncontrolled fashion
154Polling
- Simplest way for an I/O device to communicate with the CPU
- The CPU periodically checks the status of pending I/O operations
- High CPU overhead
155I/O completion interrupts
- Notify the CPU that an I/O operation has completed
- Allows the CPU to do something else while waiting for the completion of an I/O operation
- Multiprogramming
- I/O completion interrupts are processed by the CPU between instructions
- No internal instruction state to save
156Interrupt levels
157Direct memory access
- DMA
- Lets disk controller access main memory w/o any
intervention of the CPU
158DMA and virtual memory
- A single DMA transfer may cross a page boundary, with
- One page being in main memory
- The other page missing
159Solutions
- Make DMA work with virtual addresses
- The issue is then dealt with by the virtual memory subsystem
- Break DMA transfers crossing page boundaries into chains of transfers that do not cross page boundaries, as sketched below
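- A minimal C sketch of the second solution (dma_split is a hypothetical helper; it only prints the chunks it would issue, assuming the 4 KB pages used earlier):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u  /* 4 KB pages, as in the earlier examples */

    /* Split one DMA request into chunks that never cross a page boundary. */
    void dma_split(uint32_t addr, uint32_t len) {
        while (len > 0) {
            uint32_t room  = PAGE_SIZE - (addr % PAGE_SIZE); /* bytes left in this page */
            uint32_t chunk = len < room ? len : room;
            printf("DMA chunk: addr=0x%x len=%u\n", (unsigned) addr, (unsigned) chunk);
            addr += chunk;
            len  -= chunk;
        }
    }

    int main(void) {
        dma_split(0x1F00, 8192);  /* a transfer that crosses page boundaries */
        return 0;
    }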
160An Example
(Figure: one DMA transfer spanning a page boundary is broken into two smaller DMA transfers)
161DMA and cache hierarchy
- Three approaches for handling temporary
inconsistencies between caches and main memory
162Solutions
- Routing all DMA accesses through the cache
- Bad solution
- Have the OS selectively
- Invalidate affected cache entries when performing a read
- Force an immediate flush of dirty cache entries when performing a write
- Have specific hardware do the same
163Benchmarking I/O
164Benchmarks
- Specific benchmarks for
- Transaction processing
- Emphasis on speed and graceful recovery from failures
- Atomic transactions
- All-or-nothing behavior
165An important observation
- It is very difficult to operate a disk subsystem at a reasonable fraction of its maximum throughput
- Unless we sequentially access very large ranges of data
- 512 KB and more
166Major fallacies
- Since the rated MTTFs of disk drives exceed one million hours, disks can last more than 100 years
- MTTF expresses the failure rate during the disk's actual lifetime
- Disk failure rates in the field match the MTTFs mentioned in the manufacturers' literature
- They are up to ten times higher
167Major fallacies
- Neglecting to do end-to-end checks
- Using magnetic tapes to back up disks
- Tape formats can quickly become obsolete
- Disk bit densities have grown much faster than tape data densities
168WRITING PARALLEL PROGRAMS
169Overview
- Some problems are embarrassingly parallel
- Many computer graphics tasks
- Brute force searches in cryptography or password guessing
- Much more difficult for other applications
- Communication overhead among sub-tasks
- Amdahl's law
- Balancing the load
170Amdahl's Law
- Assume a sequential process takes
- t_p seconds to perform operations that could be performed in parallel
- t_s seconds to perform purely sequential operations
- The maximum speedup will be
- (t_p + t_s)/t_s
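- A small C sketch of this bound (the times are illustrative; the n-processor formula (t_p + t_s)/(t_p/n + t_s) is the usual Amdahl extension, not stated on the slide):

    #include <stdio.h>

    int main(void) {
        double t_p = 95.0, t_s = 5.0;   /* illustrative times in seconds */
        double max_speedup = (t_p + t_s) / t_s;
        for (int n = 2; n <= 64; n *= 2)
            printf("n=%2d  speedup=%.2f\n", n, (t_p + t_s) / (t_p / n + t_s));
        printf("maximum speedup = %.1f\n", max_speedup);
        return 0;
    }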
171Balancing the load
- Must ensure that the workload is equally divided among all the processors
- The worst case is when one of the processors does much more work than all the others
172A last issue
- Humans like to address issues one after the other
- We have meeting agendas
- We do not like to be interrupted
- We write sequential programs
173MULTI PROCESSOR ORGANIZATIONS
174Shared memory multiprocessors
(Figure: processors connected to shared RAM and I/O through an interconnection network)
175Shared memory multiprocessor
- Can offer
- Uniform memory access to all processors (UMA)
- Easiest to program
- Non-uniform memory access to all processors (NUMA)
- Can scale up to larger sizes
- Offers faster access to nearby memory
176Computer clusters
(Figure: independent computers, each with its own memory, linked by an interconnection network)
177Computer clusters
- Very easy to assemble
- Can take advantage of high-speed LANs
- Gigabit Ethernet, Myrinet,
- Data exchanges must be done through message passing
178HARDWARE MULTITHREADING
179General idea
- Let the processor switch to another thread of computation while the current one is stalled
- Motivation
- Increased cost of cache misses
180Implementation
- Entirely controlled by the hardware
- Unlike multiprogramming
- Requires a processor capable of
- Keeping track of the state of each thread
- One set of registers, including the PC, for each concurrent thread
- Quickly switching among concurrent threads
181Approaches
- Fine-grained multithreading
- Switches between threads for each instruction
- Provides highest throughputs
- Slows down execution of individual threads
182Approaches
- Coarse-grained multithreading
- Switches between threads whenever a long stall is detected
- Easier to implement
- Cannot eliminate all stalls
183Approaches
- Simultaneous multi-threading
- Takes advantage of the ability of modern hardware to perform different tasks in parallel for instructions from different threads
- Best solution
184ALPHABET SOUP
185Classification
- SISD
- Single instruction, single data
- Conventional uniprocessor architecture
- MIMD
- Multiple instructions, multiple data
- Conventional multiprocessor architecture
186Classification
- SIMD
- Single instruction, multiple data
- Perform the same operation on a set of similar data
- Think of adding two vectors
- for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];
187PERFORMANCE ISSUES
188Roofline model
- Takes into account
- Memory bandwidth
- Floating-point performance
- Introduces arithmetic intensity
- Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
- Measured in FLOPs/byte
189Roofline model
- Attainable GFLOPs/s = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
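- A minimal C sketch of this formula (the peak bandwidth and peak floating-point numbers are assumed values, not from the slides):

    #include <stdio.h>

    /* Attainable GFLOP/s = min(peak BW * arithmetic intensity, peak FP performance) */
    static double roofline(double ai, double peak_bw, double peak_flops) {
        double mem_bound = peak_bw * ai;
        return mem_bound < peak_flops ? mem_bound : peak_flops;
    }

    int main(void) {
        double peak_bw = 16.0;      /* GB/s, assumed      */
        double peak_flops = 64.0;   /* GFLOP/s, assumed   */
        for (double ai = 0.5; ai <= 16.0; ai *= 2)
            printf("AI=%4.1f FLOPs/byte -> %5.1f GFLOP/s\n",
                   ai, roofline(ai, peak_bw, peak_flops));
        return 0;
    }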
190Roofline model
(Figure: the roofline plot; at low arithmetic intensity floating-point performance is limited by memory bandwidth, and at high arithmetic intensity it is capped by peak floating-point performance)