THIRD REVIEW SESSION - PowerPoint PPT Presentation

About This Presentation
Title:

THIRD REVIEW SESSION

Description:

Title: COSC3330/6308 Computer Architecture Author: Jehan-François Pâris Last modified by: Jehan-François Pâris Created Date: 8/29/2001 4:04:21 AM – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: THIRD REVIEW SESSION


1
THIRD REVIEW SESSION
  • Jehan-François Pâris
  • May 5, 2010

2
MATERIALS (I)
  • Memory hierarchies
  • Caches
  • Virtual memory
  • Protection
  • Virtual machines
  • Cache consistency

3
MATERIALS (II)
  • I/O Operations
  • More about disks
  • I/O operation implementation
  • Busses
  • Memory-mapped I/O
  • Specific I/O instructions
  • RAID organizations

4
MATERIALS (III)
  • Parallel Architectures
  • Shared memory multiprocessors
  • Computer clusters
  • Hardware multithreading
  • SISD, SIMD, MIMD, ...
  • Roofline performance model

5
CACHING AND VIRTUAL MEMORY
6
Common objective
  • Make a combination of
  • Small, fast and expensive memory
  • Large, slow and cheap memory
  • look like
  • A single large and fast memory
  • Fetch policy is fetch on demand

7
Questions to ask
  • What are the transfer units?
  • How are they placed in the faster memory?
  • How are they accessed?
  • How do we handle misses?
  • How do we implement writes?
  • and more generally
  • Are these tasks performed by the hardware or the
    OS?

8
Transfer units
  • Blocks or pages containing 2^n bytes
  • Always properly aligned
  • If a block or a page contains 2^n bytes, the n
    LSBs of its start address will be all zeroes

9
Examples
  • If block size is 4 words,
  • It corresponds to 16 = 2^4 bytes
  • The 4 LSBs of the block address will be all zeroes
  • If page size is 4 KB
  • It corresponds to 2^2 × 2^10 = 2^12 bytes
  • The 12 LSBs of the page address will be all zeroes
  • The remaining bits of the address form the page number

10
Examples
Page size is 4 KB. The 32-bit address of the first byte in a page has the form
XXXXXXXXXXXXXXXXXXXX<12 zeroes>
Any in-page address splits into a 20-bit page number (the X bits) and a 12-bit
offset (the Y bits): XXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY
11
Consequence
  • In a 32-bit architecture,
  • We identify a block or a page of size 2^n bytes by
    the 32 - n MSBs of its address
  • Will be called
  • Tag
  • Page number

12
Placement policy
  • Two extremes
  • Each block can only occupy a fixed address in the
    faster memory
  • Direct mapping (many caches)
  • Each page can occupy any address in the faster
    memory
  • Fully associative (virtual memory)

13
Direct mapping
  • Assume
  • The cache has 2^m entries
  • The block size is 2^n bytes
  • a is the block address (with its n LSBs removed)
  • The block will be placed at cache position
  • a mod 2^m

14
Consequence
  • The tag identifying the cache block will be the
    start address of the block with its n + m LSBs
    removed
  • the original n LSBs because they are known to be
    all zeroes
  • the next m LSBs because they are equal to a mod 2^m
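As an illustration (not from the slides), here is a minimal C sketch of this address split for a direct-mapped cache; the block size (2^n with n = 4) and cache size (2^m with m = 10) are assumed values:

    #include <stdio.h>
    #include <stdint.h>

    #define N_BITS 4   /* block size = 2^4 = 16 bytes (assumed for illustration) */
    #define M_BITS 10  /* cache has 2^10 entries (assumed for illustration)      */

    int main(void) {
        uint32_t addr  = 0x12345678;                 /* a 32-bit byte address         */
        uint32_t a     = addr >> N_BITS;             /* block address, n LSBs removed */
        uint32_t index = a & ((1u << M_BITS) - 1);   /* a mod 2^m: cache position     */
        uint32_t tag   = a >> M_BITS;                /* remaining 32 - n - m MSBs     */
        printf("index = %u, tag = 0x%x\n", (unsigned) index, (unsigned) tag);
        return 0;
    }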

15
Consequence
Block start address → drop the n LSBs (all zeroes) → drop the m additional
LSBs given by a mod 2^m → Tag
16
A cache whose block size is 8 bytes
[Figure: cache entries shown as a Tag column and a Contents column]
17
Fully associative solution
Page   Frame
  0      4
  1      7
  2     27
  3     44
  4      5
  • Used in virtual memory systems
  • Each page can occupy any free page frame in main
    memory
  • Use a page table
  • Without redundant first column

18
Solutions with limited associativity
  • A cache of size 2^m with associativity level k
    lets a given block occupy any of k possible
    locations in the cache
  • Implementation looks very much like k caches of
    size 2^m/k put together
  • All possible cache locations for a block have the
    same position a mod (2^m/k) in each of the smaller
    caches

19
A set-associative cache with k = 2
20
Accessing an entry
  • In a cache, use hardware to compute the
    possible cache position for the block containing
    the data
  • a mod 2^m for a cache using direct mapping
  • a mod (2^m/k) for a cache of associativity level k
  • Then check whether the cache entry is valid using
    its valid bit

21
Accessing an entry
  • In a VM system, hardware checks the TLB to find
    the frame containing a given page number
  • TLB entries contain
  • A page number (tag)
  • A frame number
  • A valid bit
  • A dirty bit

22
Accessing an entry
  • The valid bit indicates if the mapping is valid
  • The dirty bit indicates whether we need to save
    the page contents when we expel it

23
Accessing an entry
  • If page mapping is not in the TLB, must consult
    the page table and update the TLB
  • Can be done by hardware or software

24
Realization
25
Handling cache misses
  • Cache hardware fetches missing block
  • Often overwriting an existing entry
  • Which one?
  • The one that occupies the same location if the
    cache uses direct mapping
  • One of those that occupy the same location if the
    cache is set-associative

26
Handling cache misses
  • Before expelling a cache entry, we must
  • Check its dirty bit
  • Save its contents if dirty bit is on.

27
Handling page faults
  • OS fetches missing page
  • Often overwriting an existing page
  • Which one?
  • One that was not recently used
  • Selected by page replacement policy

28
Handling page faults
  • Before expelling a page, we must
  • Check its dirty bit
  • Save its contents if dirty bit is on.

29
Handling writes (I)
  • Two ways to handle writes
  • Write through
  • Each write updates both the cache and the main
    memory
  • Write back
  • Writes are not propagated to the main memory
    until the updated word is expelled from the cache

30
Handling writes (II)
  • Write through
  • Write back

[Diagram: with write through, each CPU write updates both the cache and the
RAM; with write back, the RAM is only updated later]
31
Pros and cons
  • Write through
  • Ensures that memory is always up to date
  • Expelled cache entries can be overwritten
  • Write back
  • Faster writes
  • Complicates cache expulsion procedure
  • Must write back cache entries that have been
    modified in the cache

32
A better write through (I)
  • Add a small buffer to speed up write performance
    of write-through caches
  • At least four words
  • Holds modified data until they are written into
    main memory
  • Cache can proceed as soon as data are written
    into the write buffer

33
A better write through (II)
  • Write through
  • Better write through

[Diagram: plain write through goes CPU → Cache → RAM; the better write through
inserts a write buffer between the cache and the RAM]
34
Designing RAM to support caches
  • RAM connected to CPU through a "bus"
  • Clock rate much slower than CPU clock rate
  • Assume that a RAM access takes
  • 1 bus clock cycle to send the address
  • 15 bus clock cycles to initiate a read
  • 1 bus clock cycle to send a word of data

35
Designing RAM to support caches
  • Assume
  • Cache block size is 4 words
  • One-word bank of DRAM
  • Fetching a cache block would take
  • 1 + 4×15 + 4×1 = 65 bus clock cycles
  • Transfer rate is 0.25 byte/bus cycle
  • Awful!

36
Designing RAM to support caches
  • Could
  • Have an interleaved memory organization
  • Four one-word banks of DRAM
  • A 32-bit bus

[Diagram: a 32-bit bus connecting RAM banks 0, 1, 2 and 3]
37
Designing RAM to support caches
  • Can do the 4 accesses in parallel
  • Must still transmit the block 32 bits by 32 bits
  • Fetching a cache block would take
  • 1 + 15 + 4×1 = 20 bus clock cycles
  • Transfer rate is 0.80 byte/bus cycle
  • Even better
  • Much cheaper than having a 64-bit bus
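A small C sketch of these two cycle counts, using the timing assumptions from the previous slides (1 cycle to send the address, 15 to initiate a read, 1 per word transferred, 4-word blocks):

    #include <stdio.h>

    int main(void) {
        int words = 4;                      /* cache block size in words      */
        int t_addr = 1, t_read = 15, t_word = 1;

        /* One one-word DRAM bank: every word pays the full read latency. */
        int one_bank = t_addr + words * t_read + words * t_word;        /* 65 */

        /* Four interleaved banks: reads overlap, words still go one by one. */
        int interleaved = t_addr + t_read + words * t_word;             /* 20 */

        printf("one bank: %d cycles, interleaved: %d cycles\n",
               one_bank, interleaved);
        return 0;
    }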

38
PERFORMANCE ISSUES
39
Memory stalls
  • Can divide CPU time into
  • NEXEC clock cycles spent executing instructions
  • NMEM_STALLS cycles spent waiting for memory
    accesses
  • We have
  • CPU time = (NEXEC + NMEM_STALLS) × TCYCLE

40
Memory stalls
  • We assume that
  • cache access times can be neglected
  • most CPU cycles spent waiting for memory accesses
    are caused by cache misses

41
Global impact
  • We have
  • NMEM_STALLS = NMEM_ACCESSES × Cache miss rate
    × Cache miss penalty
  • and also
  • NMEM_STALLS = NINSTRUCTIONS × (NMISSES/instruction)
    × Cache miss penalty
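A minimal C sketch of this stall formula; the parameter values simply reproduce the example worked on the next slides:

    #include <stdio.h>

    int main(void) {
        double miss_penalty = 100.0;   /* cycles                                */
        double i_miss_rate  = 0.02;    /* instruction-cache misses/instruction  */
        double d_miss_rate  = 0.05;    /* data-cache miss rate                  */
        double mem_fraction = 0.40;    /* fraction of instructions touching memory */

        /* stall cycles per instruction = misses per instruction x miss penalty */
        double stalls = (i_miss_rate + mem_fraction * d_miss_rate) * miss_penalty;
        printf("memory stall cycles per instruction: %.1f\n", stalls);   /* 4.0 */
        return 0;
    }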

42
Example
  • Miss rate of instruction cache is 2 percent
  • Miss rate of data cache is 5 percent
  • In the absence of memory stalls, each instruction
    would take 2 cycles
  • Miss penalty is 100 cycles
  • 40 percent of instructions access the main memory
  • How many cycles are lost due to cache misses?

43
Solution (I)
  • Impact of instruction cache misses
  • 0.02 × 100 = 2 cycles/instruction
  • Impact of data cache misses
  • 0.40 × 0.05 × 100 = 2 cycles/instruction
  • Total impact of cache misses
  • 2 + 2 = 4 cycles/instruction

44
Solution (II)
  • Average number of cycles per instruction
  • 2 + 4 = 6 cycles/instruction
  • Fraction of time wasted
  • 4/6 ≈ 67 percent

45
Average memory access time
  • Some authors call it AMAT
  • TAVERAGE = TCACHE + f × TMISS
  • where f is the cache miss rate
  • Times can be expressed
  • In nanoseconds
  • In number of cycles
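A one-line C helper for this formula (times in cycles; the sample numbers anticipate the example that follows):

    #include <stdio.h>

    /* T_average = T_cache + f x T_miss */
    static double amat(double t_cache, double miss_rate, double t_miss) {
        return t_cache + miss_rate * t_miss;
    }

    int main(void) {
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.04, 100.0));   /* 5.0 cycles */
        return 0;
    }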

46
Example
  • A cache has a hit rate of 96 percent
  • Accessing data
  • In the cache requires one cycle
  • In the memory requires 100 cycles
  • What is the average memory access time?

47
Solution
  • Miss rate = 1 - Hit rate = 0.04
  • Applying the formula
  • TAVERAGE = 1 + 0.04 × 100 = 5 cycles

48
In other words
It's the miss rate, stupid!
49
Improving cache hit rate
  • Two complementary techniques
  • Using set-associative caches
  • Must check tags of all blocks with the same index
    values
  • Slower
  • Have fewer collisions
  • Fewer misses
  • Use a cache hierarchy

50
A cache hierarchy
  • Topmost cache
  • Optimized for speed, not miss rate
  • Rather small
  • Uses a small block size
  • As we go down the hierarchy
  • Cache sizes increase
  • Block sizes increase
  • Cache associativity level increases

51
Example
  • Cache miss rate per instruction is 3 percent
  • In the absence of memory stalls, each instruction
    would take one cycle
  • Cache miss penalty is 100 ns
  • Clock rate is 4 GHz
  • How many cycles are lost due to cache misses?

52
Solution (I)
  • Duration of clock cycle
  • 1/(4 GHz) = 0.25 × 10^-9 s = 0.25 ns
  • Cache miss penalty
  • 100 ns = 400 cycles
  • Total impact of cache misses
  • 0.03 × 400 = 12 cycles/instruction

53
Solution (II)
  • Average number of cycles per instruction
  • 1 + 12 = 13 cycles/instruction
  • Fraction of time wasted
  • 12/13 ≈ 92 percent

A very good case for hardware multithreading
54
Example (cont'd)
  • How much faster would the processor be if we added
    an L2 cache that
  • Has a 5 ns access time
  • Would reduce the miss rate to main memory to one
    percent?

55
Solution (I)
  • L2 cache access time
  • 5 ns = 20 cycles
  • Impact of cache misses per instruction
  • L1 cache misses + L2 cache misses =
    0.03 × 20 + 0.01 × 400 = 0.6 + 4.0 = 4.6
    cycles/instruction
  • Average number of cycles per instruction
  • 1 + 4.6 = 5.6 cycles/instruction

56
Solution (II)
  • Fraction of time wasted
  • 4.6/5.6 ≈ 82 percent
  • CPU speedup
  • 13/4.6 ≈ 2.83

57
Problem
  • Redo the second part of the example assuming that
    the secondary cache
  • Has a 3 ns access time
  • Can reduce miss rate to main memory to one
    percent?

58
Solution
  • Fraction of time wasted
  • 86 percent
  • CPU speedup
  • 1.22

The new L2 cache with a lower access time but a
higher miss rate performs much worse than the first
L2 cache
59
Example
  • A virtual memory has a page fault rate of 10^-4
    faults per memory access
  • Accessing data
  • In the memory requires 100 ns
  • On disk requires 5 ms
  • What is the average memory access time?
  • Tavg = 100 ns + 10^-4 × 5 ms = 600 ns

60
The cost of a page fault
  • Let
  • Tm be the main memory access time
  • Td the disk access time
  • f the page fault rate
  • Ta the average access time of the VM
  • Ta = (1 - f) Tm + f (Tm + Td) = Tm + f Td

61
Example
  • Assume Tm = 50 ns and Td = 5 ms

f        Mean memory access time
10^-3    50 ns + 5 ms/10^3 = 5,050 ns
10^-4    50 ns + 5 ms/10^4 = 550 ns
10^-5    50 ns + 5 ms/10^5 = 100 ns
10^-6    50 ns + 5 ms/10^6 = 55 ns
62
In other words
It's the page fault rate, stupid!
63
Locality principle (I)
  • A process that would access its pages in a
    totally unpredictable fashion would perform very
    poorly in a VM system unless all its pages are in
    main memory

64
Locality principle (II)
  • Process P accesses randomly a very large array
    consisting of n pages
  • If m of these n pages are in main memory, the
    page fault frequency of the process will be
    (n - m)/n
  • Must switch to another algorithm

65
First problem
  • A virtual memory system has
  • 32-bit addresses
  • 4 KB pages
  • What are the sizes of the
  • Page number field?
  • Offset field?

66
Solution (I)
  • Step 1: Convert the page size to a power of 2:
    4 KB = 2^12 B
  • Step 2: The exponent is the length of the offset
    field

67
Solution (II)
  • Step 3: Size of page number field = Address size
    - Offset size. Here 32 - 12 = 20 bits

12 bits for the offset and 20 bits for the page
number
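A minimal C sketch of this split for 4 KB pages on a 32-bit address (the sample address is arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 12   /* 4 KB pages: 2^12 bytes */

    int main(void) {
        uint32_t vaddr  = 0xCAFEBABE;
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);  /* low 12 bits  */
        uint32_t page   = vaddr >> OFFSET_BITS;               /* high 20 bits */
        printf("page number = 0x%05x, offset = 0x%03x\n",
               (unsigned) page, (unsigned) offset);
        return 0;
    }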
68
MEMORY PROTECTION
69
Objective
  • Unless we have an isolated single-user system, we
    must prevent users from
  • Accessing
  • Deleting
  • Modifying
  • the address spaces of other processes, including
    the kernel

70
Memory protection (I)
  • VM ensures that processes cannot access page
    frames that are not referenced in their page
    table.
  • Can refine control by distinguishing among
  • Read access
  • Write access
  • Execute access
  • Must also prevent processes from modifying their
    own page tables

71
Dual-mode CPU
  • Require a dual-mode CPU
  • Two CPU modes
  • Privileged mode or executive mode that allows
    CPU to execute all instructions
  • User mode that allows CPU to execute only safe
    unprivileged instructions
  • State of CPU is determined by a special bit

72
Switching between states
  • User mode will be the default mode for all
    programs
  • Only the kernel can run in supervisor mode
  • Switching from user mode to supervisor mode is
    done through an interrupt
  • Safe because the jump address is at a
    well-defined location in main memory

73
Memory protection (II)
  • Has additional advantages
  • Prevents programs from corrupting address spaces
    of other programs
  • Prevents programs from crashing the kernel
  • Not true for device drivers which are inside the
    kernel
  • Required part of any multiprogramming system

74
INTEGRATING CACHES AND VM
75
The problem
  • In a VM system, each byte of memory has two
    addresses
  • A virtual address
  • A physical address
  • Should cache tags contain virtual addresses or
    physical addresses?

76
Discussion
  • Using virtual addresses
  • Directly available
  • Bypass TLB
  • Cache entries specific to a given address space
  • Must flush caches when the OS selects another
    process
  • Using physical addresses
  • Must first access the TLB
  • Cache entries not specific to a given address
    space
  • Do not have to flush caches when the OS selects
    another process

77
The best solution
  • Let the cache use physical addresses
  • No need to flush the cache at each context switch
  • TLB access delay is tolerable

78
VIRTUAL MACHINES
79
Key idea
  • Let different operating systems run at the same
    time on a single computer
  • Windows, Linux and Mac OS
  • A real-time OS and a conventional OS
  • A production OS and a new OS being tested

80
How it is done
  • A hypervisor /VM monitor defines two or more
    virtual machines
  • Each virtual machine has
  • Its own virtual CPU
  • Its own virtual physical memory
  • Its own virtual disk(s)

81
Two virtual machines
[Diagram: two virtual machines run in user mode on top of the hypervisor,
which runs in privileged mode]
82
Translating a block address
[Diagram: the VM kernel asks to access block x, y of its virtual disk; the
hypervisor translates this into an access to block v, w of the actual disk]
83
Handling I/Os
  • Difficult task because
  • Wide variety of devices
  • Some devices may be shared among several VMs
  • Printers
  • Shared disk partition
  • Want to let Linux and Windows access the same
    files

84
Virtual Memory Issues
  • Each VM kernel manages its own memory
  • Its page tables map program virtual addresses
    into pseudo-physical addresses
  • It treats these addresses as physical addresses

85
The dilemma
[Diagram: the VM kernel believes page 735 of user process A is stored in page
frame 435; the hypervisor maps that to page frame 993 of the actual RAM]
86
The solution (I)
  • Address translation must remain fast!
  • The hypervisor lets each VM kernel manage its own
    page tables but does not use them
  • They contain bogus mappings!
  • It maintains instead its own shadow page tables
    with the correct mappings
  • Used to handle TLB misses

87
The solution (II)
  • To keep its shadow page tables up to date,
    hypervisor must track any changes made by the VM
    kernels
  • Mark page tables read-only

88
Nastiest Issue
  • The whole VM approach assumes that a kernel
    executing in user mode will behave exactly like a
    kernel executing in privileged mode
  • Not true for all architectures!
  • Intel x86 Pop flags (POPF) instruction

89
Solutions
  • Modify the instruction set and eliminate
    instructions like POPF
  • IBM redesigned the instruction set of their 360
    series for the 370 series
  • Mask it through clever software
  • Dynamic "binary translation" when direct
    execution of code could not work(VMWare)

90
CACHE CONSISTENCY
91
The problem
  • Specific to architectures with
  • Several processors sharing the same main memory
  • Multicore architectures
  • Each core/processor has its own private cache
  • A must for performance
  • Happens when same data are present in two or more
    private caches

92
An example (I)
RAM
93
An example (II)
Increments x
Still assumes x = 0
RAM
94
An example
Both CPUs must apply the two updates in the same
order
Sets x to 1
Resets x to 1
RAM
95
Rules
  • Whenever a processor accesses a variable, it
    always gets the value stored by the processor
    that updated that variable last, if the updates
    are sufficiently separated in time
  • A processor accessing a variable sees all updates
    applied to that variable in the same order
  • No compromise is possible here

96
A realization Snoopy caches
  • All caches are linked to the main memory through
    a shared bus
  • All caches observe the writes performed by other
    caches
  • When a cache notices that another cache performs
    a write on a memory location that it has in its
    cache, it invalidates the corresponding cache
    block

97
An example (I)
Fetches x = 2
RAM
98
An example (II)
Also fetches x
RAM
99
An example (III)
Resets x to 0
RAM
100
An example (IV)
Performs write-through
Detects write-through and invalidates its copy
of x
RAM
101
An example (IV)
When the CPU wants to access x, the cache gets the
correct value from RAM
RAM
102
A last correctness condition
  • Caches cannot reorder their memory updates
  • The cache-to-RAM buffer must be FIFO
  • First in first out

103
Miscellaneous fallacies
  • Segmented address spaces
  • Address is segment number + offset in segment
  • Programmers hate them
  • Ignoring virtual memory behavior when accessing
    large two-dimensional arrays

104
Miscellaneous fallacies
  • Segmented address spaces
  • Address is segment number + offset in segment
  • Programmers hate them
  • Ignoring virtual memory behavior when accessing
    large two-dimensional arrays
  • Believing that you can virtualize any CPU
    architecture

105
DEPENDABILITY
106
Reliability and Availability
  • Reliability
  • Probability R(t) that the system will be up at
    time t if it was up at time t = 0
  • Availability
  • Fraction of time the system is up
  • Reliability and availability do not measure the
    same thing!

107
MTTF, MTTR and MTBF
  • MTTF is the mean time to failure
  • MTTR is the mean time to repair
  • 1/MTTF is the failure rate λ
  • MTBF, the mean time between failures, is
  • MTBF = MTTF + MTTR

108
Reliability
  • As a first approximation R(t) = exp(-t/MTTF)
  • Not true if failure rate varies over time

109
Availability
  • Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF
  • MTTR is very important
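A small C helper for this definition; the sample numbers reproduce the server example that follows:

    #include <stdio.h>

    /* availability = MTTF / (MTTF + MTTR) = MTTF / MTBF */
    static double availability(double mttf, double mttr) {
        return mttf / (mttf + mttr);
    }

    int main(void) {
        /* server example: MTBF = 30 days, MTTR = 0.5 day, so MTTF = 29.5 days */
        printf("availability = %.1f%%\n", 100.0 * availability(29.5, 0.5));
        return 0;
    }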

110
Example
  • A server crashes on the average once a month
  • When this happens, it takes six hours to reboot
    it
  • What is the server availability ?

111
Solution
  • MTBF = 30 days
  • MTTR = ½ day
  • MTTF = 29 ½ days
  • Availability is 29.5/30 = 98.3%

112
Example
  • A disk drive has a MTTF of 20 years.
  • What is the probability that the data it contains
    will not be lost over a period of five years?

113
Example
  • A disk farm contains 100 disks whose MTTF is 20
    years.
  • What is the probability that no data will be
    lost over a period of five years?

114
Solution
  • The aggregate failure rate of the disk farm is
  • 100 × 1/20 = 5 failures per year
  • The mean time to failure of the farm is
  • 1/5 year
  • We apply the formula
  • R(t) = exp(-t/MTTF) = exp(-5 × 5) = exp(-25) ≈ 1.4 × 10^-11
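The same computation as a C sketch, assuming the exponential reliability model R(t) = exp(-t/MTTF) introduced above:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double disk_mttf = 20.0;                  /* years                     */
        int    n_disks   = 100;
        double farm_mttf = disk_mttf / n_disks;   /* 0.2 year                  */
        double t         = 5.0;                   /* years                     */

        double r = exp(-t / farm_mttf);           /* probability of no data loss */
        printf("R(5 years) = %.2e\n", r);         /* about 1.4e-11               */
        return 0;
    }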

115
RAID Arrays
116
Today's Motivation
  • We use RAID today for
  • Increasing disk throughput by allowing parallel
    access
  • Eliminating the need to make disk backups
  • Disks are too big to be backed up in an efficient
    fashion

117
RAID LEVEL 0
  • No replication
  • Advantages
  • Simple to implement
  • No overhead
  • Disadvantage
  • If the array has n disks, its failure rate is n
    times the failure rate of a single disk

118
RAID levels 0 and 1
Mirrors
RAID level 1
119
RAID LEVEL 1
  • Mirroring
  • Two copies of each disk block
  • Advantages
  • Simple to implement
  • Fault-tolerant
  • Disadvantage
  • Requires twice the disk capacity of normal file
    systems

120
RAID LEVEL 2
  • Instead of duplicating the data blocks we use an
    error correction code
  • Very bad idea because disk drives either work
    correctly or do not work at all
  • Only possible errors are omission errors
  • We need an omission correction code
  • A parity bit is enough to correct a single
    omission

121
RAID levels 2 and 3
Check disks
RAID level 2
Parity disk
RAID level 3
122
RAID LEVEL 3
  • Requires N + 1 disk drives
  • N drives contain data (1/N of each data block)
  • Block bk now partitioned into N fragments
    bk,1, bk,2, ... bk,N
  • Parity drive contains exclusive or of these N
    fragments
  • pk = bk,1 ⊕ bk,2 ⊕ ... ⊕ bk,N

123
How parity works
  • Truth table for XOR (same as parity)

A  B  A⊕B
0  0   0
0  1   1
1  0   1
1  1   0
124
Recovering from a disk failure
  • Small RAID level 3 array with data disks D0 and
    D1 and parity disk P can tolerate failure of
    either D0 or D1

D0  D1  P  |  D1⊕P (= D0)  D0⊕P (= D1)
 0   0  0  |       0             0
 0   1  1  |       0             1
 1   0  1  |       1             0
 1   1  0  |       1             1
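A small C sketch of this recovery: a lost disk is rebuilt as the XOR of the surviving disks (a single byte stands in for a whole block, purely for illustration):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t d0 = 0xA5, d1 = 0x3C;
        uint8_t p  = d0 ^ d1;          /* parity written by the array */

        uint8_t lost_d0 = d1 ^ p;      /* rebuild D0 from D1 and P    */
        uint8_t lost_d1 = d0 ^ p;      /* rebuild D1 from D0 and P    */

        printf("D0 rebuilt: 0x%02X (was 0x%02X)\n", lost_d0, d0);
        printf("D1 rebuilt: 0x%02X (was 0x%02X)\n", lost_d1, d1);
        return 0;
    }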
125
How RAID level 3 works (I)
  • Assume we have N + 1 disks
  • Each block is partitioned into N equal chunks

N = 4 in example
126
How RAID level 3 works (II)
  • XOR data chunks to compute the parity chunk
  • Each chunk is written into a separate disk

Parity
127
How RAID level 3 works (III)
  • Each read/write involves all disks in RAID array
  • Cannot do two or more reads/writes in parallel
  • Performance of the array is not better than that
    of a single disk

128
RAID LEVEL 4 (I)
  • Requires N + 1 disk drives
  • N drives contain data
  • Individual blocks, not chunks
  • Blocks with same disk address form a stripe

[Diagram: the blocks at the same disk address x on the N data drives form a
stripe; the parity drive stores their XOR]
129
RAID LEVEL 4 (II)
  • Parity drive contains exclusive or of the N
    blocks in stripe
  • pk = bk ⊕ bk+1 ⊕ ... ⊕ bk+N-1
  • Parity block now reflects contents of several
    blocks!
  • Can now do parallel reads/writes

130
RAID levels 4 and 5
Bottleneck
RAID level 4
RAID level 5
131
RAID LEVEL 5
  • Single parity drive of RAID level 4 is involved
    in every write
  • Will limit parallelism
  • RAID-5 distributes the parity blocks among the
    N + 1 drives
  • Much better

132
The small write problem
  • Specific to RAID 5
  • Happens when we want to update a single block
  • Block belongs to a stripe
  • How can we compute the new value of the parity
    block?

[Diagram: a stripe containing data blocks bk, bk+1, bk+2, ... and its parity
block pk]
133
First solution
  • Read values of the N - 1 other blocks in stripe
  • Recompute
  • pk = bk ⊕ bk+1 ⊕ ... ⊕ bk+N-1
  • Solution requires
  • N - 1 reads
  • 2 writes (new block and new parity block)

134
Second solution
  • Assume we want to update block bm
  • Read old values of bm and parity block pk
  • Compute
  • new pk = new bm ⊕ old bm ⊕ old pk
  • Solution requires
  • 2 reads (old values of block and parity block)
  • 2 writes (new block and new parity block)
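The second solution as a C sketch (again one byte stands in for a whole block; the values are arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t old_bm = 0x0F, old_pk = 0x55;   /* old block and old parity */
        uint8_t new_bm = 0xF0;                  /* data we want to write    */

        /* new parity = new block XOR old block XOR old parity */
        uint8_t new_pk = new_bm ^ old_bm ^ old_pk;

        printf("new parity block: 0x%02X\n", new_pk);
        return 0;
    }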

135
RAID level 6 (I)
  • Not part of the original proposal
  • Two check disks
  • Tolerates two disk failures
  • More complex updates

136
RAID level 6 (II)
  • Has become more popular as disks are becoming
  • Bigger
  • More vulnerable to irrecoverable read errors
  • Most frequent cause for RAID level 5 array
    failures is
  • Irrecoverable read error occurring while
    contents of a failed disk are reconstituted

137
CONNECTING I/O DEVICES
138
Busses
  • Connecting computer subsystems with each other
    was traditionally done through busses
  • A bus is a shared communication link connecting
    multiple devices
  • Transmit several bits at a time
  • Parallel buses

139
Busses
140
Examples
  • Processor-memory busses
  • Connect CPU with memory modules
  • Short and high-speed
  • I/O busses
  • Longer
  • Wide range of data bandwidths
  • Connect to memory through a processor-memory bus
    or a backplane bus

141
Synchronous busses
  • Include a clock in the control lines
  • Bus protocols expressed in actions to be taken at
    each clock pulse
  • Have very simple protocols
  • Disadvantages
  • All bus devices must run at same clock rate
  • Due to clock skew issues, cannot be both fast
    and long

142
Asynchronous busses
  • Have no clock
  • Can accommodate a wide variety of devices
  • Have no clock skew issues
  • Require a handshaking protocol before any
    transmission
  • Implemented with extra control lines

143
Advantages of busses
  • Cheap
  • One bus can link many devices
  • Flexible
  • Can add devices

144
Disadvantages of busses
  • Shared devices
  • can become bottlenecks
  • Hard to run many parallel lines at high clock
    speeds

145
New trend
  • Away from parallel shared buses
  • Towards serial point-to-point switched
    interconnections
  • Serial
  • One bit at a time
  • Point-to-point
  • Each line links a specific device to another
    specific device

146
x86 bus organization
  • Processor connects to peripherals through two
    chips (bridges)
  • North Bridge
  • South Bridge

147
x86 bus organization
North Bridge
South Bridge
148
North bridge
  • Essentially a DMA controller
  • Lets disk controller access main memory w/o any
    intervention of the CPU
  • Connects CPU to
  • Main memory
  • Optional graphics card
  • South Bridge

149
South Bridge
  • Connects North bridge to a wide variety of I/O
    busses

150
Communicating with I/O devices
  • Two solutions
  • Memory-mapped I/O
  • Special I/O instructions

151
Memory mapped I/O
  • A portion of the address space reserved for I/O
    operations
  • Writes to any of these addresses are interpreted
    as I/O commands
  • Reading from these addresses gives access to
  • Error bit
  • I/O completion bit
  • Data being read

152
Memory mapped I/O
  • User processes cannot access these addresses
  • Only the kernel
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

153
Dedicated I/O instructions
  • Privileged instructions that cannot be executed
    by user processes
  • Only the kernel can issue them
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

154
Polling
  • Simplest way for an I/O device to communicate
    with the CPU
  • CPU periodically checks the status of pending I/O
    operations
  • High CPU overhead

155
I/O completion interrupts
  • Notify the CPU that an I/O operation has
    completed
  • Allows the CPU to do something else while waiting
    for the completion of an I/O operation
  • Multiprogramming
  • I/O completion interrupts are processed by CPU
    between instructions
  • No internal instruction state to save

156
Interrupts levels
  • See previous chapter

157
Direct memory access
  • DMA
  • Lets disk controller access main memory w/o any
    intervention of the CPU

158
DMA and virtual memory
  • A single DMA transfer may cross page boundaries
    with
  • One page being in main memory
  • One missing page

159
Solutions
  • Make DMA work with virtual addresses
  • The issue is then dealt with by the virtual memory
    subsystem
  • Break DMA transfers crossing page boundaries into
    chains of transfers that do not cross page
    boundaries

160
An Example
[Diagram: a DMA transfer crossing a page boundary is broken into two DMA
transfers that do not]
161
DMA and cache hierarchy
  • Three approaches for handling temporary
    inconsistencies between caches and main memory

162
Solutions
  • Routing all DMA accesses through the cache
  • Bad solution
  • Have the OS selectively
  • Invalidate affected cache entries when
    performing a read
  • Force an immediate flush of dirty cache entries
    when performing a write
  • Have specific hardware do the same

163
Benchmarking I/O
164
Benchmarks
  • Specific benchmarks for
  • Transaction processing
  • Emphasis on speed and graceful recovery from
    failures
  • Atomic transactions
  • All or nothing behavior

165
An important observation
  • Very difficult to operate a disk subsystem at a
    reasonable fraction of its maximum throughput
  • Unless we access sequentially very large ranges
    of data
  • 512 KB and more

166
Major fallacies
  • Since rated MTTFs of disk drives exceed one
    million hours, disks can last more than 100 years
  • MTTF expresses the failure rate during the disk's
    actual lifetime
  • Disk failure rates in the field match the MTTFs
    mentioned in the manufacturers' literature
  • They are up to ten times higher

167
Major fallacies
  • Neglecting to do end-to-end checks
  • Using magnetic tapes to back up disks
  • Tape formats can quickly become obsolete
  • Disk bit densities have grown much faster than
    tape data densities.

168
WRITING PARALLEL PROGRAMS
169
Overview
  • Some problems are embarrassingly parallel
  • Many computer graphics tasks
  • Brute force searches in cryptography or password
    guessing
  • Much more difficult for other applications
  • Communication overhead among sub-tasks
  • Amdahl's law
  • Balancing the load

170
Amdahl's Law
  • Assume a sequential process takes
  • tp seconds to perform operations that could be
    performed in parallel
  • ts seconds to perform purely sequential
    operations
  • The maximum speedup will be
  • (tp + ts)/ts
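A one-line C helper that evaluates this ratio (the sample workload split is made up for illustration):

    #include <stdio.h>

    /* maximum speedup = (tp + ts) / ts */
    static double max_speedup(double tp, double ts) {
        return (tp + ts) / ts;
    }

    int main(void) {
        /* e.g. 90 s parallelizable work, 10 s purely sequential (assumed numbers) */
        printf("max speedup = %.1fx\n", max_speedup(90.0, 10.0));   /* 10.0x */
        return 0;
    }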

171
Balancing the load
  • Must ensure that workload is equally divided
    among all the processors
  • Worst case is when one of the processors does
    much more work than all others

172
A last issue
  • Humans like to address issues one after the
    other
  • We have meeting agendas
  • We do not like to be interrupted
  • We write sequential programs

173
MULTI PROCESSOR ORGANIZATIONS
174
Shared memory multiprocessors

Interconnection network
RAM
I/O
175
Shared memory multiprocessor
  • Can offer
  • Uniform memory access to all processors (UMA)
  • Easiest to program
  • Non-uniform memory access to all
    processors (NUMA)
  • Can scale up to larger sizes
  • Offer faster access to nearby memory

176
Computer clusters

Interconnection network
177
Computer clusters
  • Very easy to assemble
  • Can take advantage of high-speed LANs
  • Gigabit Ethernet, Myrinet, ...
  • Data exchanges must be done through message
    passing

178
HARDWARE MULTITHREADING
179
General idea
  • Let the processor switch to another thread of
    computation while the current one is stalled
  • Motivation
  • Increased cost of cache misses

180
Implementation
  • Entirely controlled by the hardware
  • Unlike multiprogramming
  • Requires a processor capable of
  • Keeping track of the state of each thread
  • One set of registers, including the PC, for each
    concurrent thread
  • Quickly switching among concurrent threads

181
Approaches
  • Fine-grained multithreading
  • Switches between threads for each instruction
  • Provides highest throughputs
  • Slows down execution of individual threads

182
Approaches
  • Coarse-grained multithreading
  • Switches between threads whenever a long stall
    is detected
  • Easier to implement
  • Cannot eliminate all stalls

183
Approaches
  • Simultaneous multi-threading
  • Takes advantage of the ability of modern
    hardware to execute instructions from different
    threads in parallel
  • Best solution

184
ALPHABET SOUP
185
Classification
  • SISD
  • Single instruction, single data
  • Conventional uniprocessor architecture
  • MIMD
  • Multiple instructions, multiple data
  • Conventional multiprocessor architecture

186
Classification
  • SIMD
  • Single instruction, multiple data
  • Perform same operations on a set of similar data
  • Think of adding two vectors
  • for (i = 0; i < VECSIZE; i++) sum[i] = a[i] +
    b[i];

187
PERFORMANCE ISSUES
188
Roofline model
  • Takes into account
  • Memory bandwidth
  • Floating-point performance
  • Introduces arithmetic intensity
  • Total number of floating point operations in a
    program divided by total number of bytes
    transferred to main memory
  • Measured in FLOPS/byte

189
Roofline model
  • Attainable GFLOP/s = Min(Peak Memory
    BW × Arithmetic Intensity, Peak
    Floating-Point Performance)
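A minimal C sketch of this bound; the peak bandwidth and peak floating-point figures are assumed values, not taken from the slides:

    #include <stdio.h>

    static double min2(double a, double b) { return a < b ? a : b; }

    int main(void) {
        double peak_bw    = 25.6;    /* GB/s, assumed    */
        double peak_flops = 102.4;   /* GFLOP/s, assumed */

        /* sweep arithmetic intensity (FLOPs per byte moved to/from memory) */
        for (double ai = 0.5; ai <= 16.0; ai *= 2.0) {
            double attainable = min2(peak_bw * ai, peak_flops);
            printf("AI = %5.1f FLOPs/byte -> %6.1f GFLOP/s\n", ai, attainable);
        }
        return 0;
    }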

190
Roofline model
Peak floating-point performance
Floating-point performance is limited by memory
bandwidth