1
  • Multiprocessors (multicores)

ECE 411 Spring 2009
2
Outline
  • Multiprocessing intro
  • Classifying Multiprocessors
  • Synchronization

3
Multiprocessors
4
Motivation
  • Using every possible technique to speedup
    single-processor systems
  • If I have N processors, shouldn't I be able to
    get N times as much work done?
  • Anyone who's ever done a group project (MP3!)
    will tell you why it's not this simple
  • If I have many programs to run, I can use
    multiple processors to run them at the same time
  • Multiprocessing for throughput
  • But can I use N processors to run one program in
    1/Nth the time?
  • Making this work is the holy grail of parallel
    processing

5
Processor Taxonomy (Flynn's)
  • Single Instruction, Single Data (SISD)
  • Basic 1-wide CPU
  • Single Instruction, Multiple Data (SIMD)
  • Vector processors, some multimedia, array
    processors
  • Multiple Instruction, Single Data (MISD)
  • Not practical
  • Multiple Instruction, Multiple Data (MIMD)
  • Most multiprocessors, superscalar CPUs
  • More recent addition
  • Single Program, Multiple Data (SPMD)
  • Processors run the same program, but don't have
    to stay in lockstep

6
Multiprocessor Performance
7
Multiprocessor Performance
  • Ideal speedup of N on N processors
  • Work evenly divided among the processors with no
    overhead
  • More typical speedups are sub-linear (< N)
  • Communication overhead
  • Synchronization overhead
  • Load balancing
  • Occasionally, see superlinear (> N) speedup
  • Indicates that parallel program or computer
    system is more efficient than the base program or
    system

8
How Not to Lie With Multiprocessor Performance
  • Basic rule: be fair to the uniprocessor
  • Compare multiprocessor against the best possible
    version of the uniprocessor program
  • Do a good job on the uniprocessor program, use
    optimizing compiler, etc.
  • Make sure uniprocessor is running an efficient
    algorithm for uniprocessors (may require two
    versions of the program)
  • Use the same input data for all versions
  • Much easier to get speedup of N if you increase
    work by factor of N
  • Some performance measures explicitly cover rate
    of work, in which case increasing data size is ok

9
Classifying Multiprocessors
  • Can think about categorizing multiprocessors
    based on three questions
  • How do the processors exchange data?
  • How are the processors connected to each other?
  • How is memory organized?

10
How do Processors Exchange Data?
  • Two major alternatives
  • Message Passing
  • Programs execute explicit operations (sends,
    receives) to transfer data from one processor to
    another
  • Can be more efficient, because can send data
    before it is needed
  • Often viewed as harder to program, particularly
    for irregular applications
  • Shared Memory
  • System maintains one view of memory, programs on
    one processor see the results of writes by
    programs on other processors
  • Data generally not sent until program tries to
    access it
  • Often viewed as easier to program
  • Generally accepted that this is true for just
    getting code working. Less clear if you consider
    effort for high performance

11
Message-Passing Example
  • Processor 1
      send(a, processor2);
      c = 0;
      for (i = 0; i < 100; i++)
        c += a[i];
      send(c, processor2);
  • Processor 2
      receive(a);
      d = 0;
      for (i = 100; i < 200; i++)
        d += a[i];
      receive(c);
      d += c;
      printf("Sum is %d\n", d);

12
Shared-Memory Example
  • Processor 1
      c = 0;
      for (i = 0; i < 100; i++)
        c += a[i];
  • Processor 2
      d = 0;
      for (j = 100; j < 200; j++)
        d += a[j];
      /* wait for first processor to be done */
      d += c;
      printf("Sum is %d\n", d);
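  • A minimal runnable pthreads rendering of the same computation (a
    sketch added for this transcript, not from the original slides;
    the names are illustrative). The commented "wait" becomes a flag
    guarded by a mutex and condition variable:

      #include <pthread.h>
      #include <stdio.h>

      int a[200];                     /* shared array */
      int c = 0, done1 = 0;
      pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
      pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

      void *proc1(void *arg) {        /* "Processor 1" */
          for (int i = 0; i < 100; i++)
              c += a[i];
          pthread_mutex_lock(&m);
          done1 = 1;                  /* first processor is done */
          pthread_cond_signal(&cv);
          pthread_mutex_unlock(&m);
          return NULL;
      }

      void *proc2(void *arg) {        /* "Processor 2" */
          int d = 0;
          for (int j = 100; j < 200; j++)
              d += a[j];
          pthread_mutex_lock(&m);     /* wait for first processor */
          while (!done1)
              pthread_cond_wait(&cv, &m);
          pthread_mutex_unlock(&m);
          d += c;
          printf("Sum is %d\n", d);
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          for (int i = 0; i < 200; i++) a[i] = i;
          pthread_create(&t1, NULL, proc1, NULL);
          pthread_create(&t2, NULL, proc2, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }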

13
How are the Processors Connected?
  • Shared Bus
  • Simple
  • All processors see every communication
  • Problem: bandwidth doesn't scale as the number of
    processors increases
  • May actually go down because of wire length,
    loading
  • Network
  • Many different types of network
  • Generally consists of a set of switches that
    connect subsets of the processors
  • Can increase bandwidth as number of processors
    increases
  • Lots of complexity/wire/latency tradeoffs in
    network design

14
How is Memory Organized?
15
Centralized vs. Distributed Memory
  • Centralized Memory
  • Conceptually Simple
  • Integrates well with shared memory
  • Bandwidth doesn't scale with number of processors
  • Note can still have caches on each processor
    with centralized memory
  • Creates cache coherence problem that we'll talk
    about next time
  • Distributed Memory
  • Integrates well with message-passing model
  • Bandwidth scales with number of processors

16
Multiprocessor Parallelism
  • Origin: time-sharing
  • Threads, tasks, multithreading, multitasking
  • Instruction-level parallelism: independent
    instructions within a single thread executing
    simultaneously.
  • Thread-level parallelism: independent parts of
    an application executing simultaneously.
  • Terminology: processes, tasks, threads, and their
    OS counterparts.

17
Synchronization
  • Sometimes, we need to ensure that events on
    different processors happen in a particular order
    to get the correct result from programs.
  • Example: weather simulation
  • Synchronization refers to both the process of
    ensuring ordering and the technique used to
    enforce order

18
Synchronization in Message-Passing Systems
  • The send-receive paradigm enforces ordering because
    a receive() operation doesn't complete until the
    data from the matching send() arrives.
  • In many cases, getting the set of sends and
    receives right to transfer the data required by
    the program on each processor provides all of the
    synchronization you need
  • In some cases, need to add extra send-receive
    pairs to enforce ordering
  • Example: keeping a producer thread from getting
    too far ahead of a consumer (see the sketch below)
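  • A minimal runnable MPI sketch of that producer-consumer throttling
    (added for this transcript; the block size and tags are
    illustrative). The consumer's ack is the extra send-receive pair
    that keeps the producer from running ahead:

      #include <mpi.h>

      int main(int argc, char **argv) {
          int rank, buf[100], ack = 0;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          for (int step = 0; step < 10; step++) {
              if (rank == 0) {                      /* producer */
                  for (int i = 0; i < 100; i++)
                      buf[i] = step * 100 + i;      /* produce a block */
                  MPI_Send(buf, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
                  /* extra receive: wait for the consumer's ack so we
                     never get more than one block ahead */
                  MPI_Recv(&ack, 1, MPI_INT, 1, 1, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
              } else if (rank == 1) {               /* consumer */
                  MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  /* ... consume the block, then ack (the extra send) */
                  MPI_Send(&ack, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
              }
          }
          MPI_Finalize();
          return 0;
      }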

19
Synchronization in Shared-Memory Systems
  • Synchronization is much more of an issue on
    shared-memory systems than on message-passing
    systems
  • Example: if processor P1 does a store to
    address 37, and P2 does a load from address 37,
    does P2 see the value that P1 wrote?
  • Depends on whether the store or the load happens
    first
  • Need synchronization to enforce the ordering the
    programmer wants
  • Shared-Memory Model: we will assume strong
    consistency, which means that
  • On any processor, memory operations happen in
    program order (or at least return the same result
    as if they did)
  • Across all processors, the set of memory
    operations gives the same result as if they
    executed in some sequential order

20
Synchronization
  • Two basic primitives
  • Mutual exclusion (lock)
  • Only one processor can acquire a lock at any time
  • Example: shared counter
      lock(lockvar);
      a = a + 1;
      release(lockvar);
  • Barrier
  • When processor hits a barrier, stops until all
    processors reach the barrier
  • Example: weather simulation typically divides
    time into discrete steps, executes a barrier at
    the end of each step so that no processor
    simulates the next step until all are done with
    the current one (see the sketch below).
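  • A minimal sketch of that time-step barrier (added for this
    transcript, not from the original slides) using POSIX barriers;
    the names are illustrative:

      #include <pthread.h>

      #define NPROCS 4
      /* initialized once before the worker threads start:
         pthread_barrier_init(&step_barrier, NULL, NPROCS); */
      pthread_barrier_t step_barrier;

      void simulate(int nsteps) {
          for (int step = 0; step < nsteps; step++) {
              /* ... simulate this thread's region for one step ... */
              /* no thread starts step+1 until all NPROCS arrive here */
              pthread_barrier_wait(&step_barrier);
          }
      }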

21
Implementing Locks
  • It is possible to implement a lock with just load
    and store operations, but it is very inefficient
  • Locks become much more efficient if the processor
    provides some sort of atomic read-modify-write
    operation
  • Example: test-and-set
  • Single instruction that reads the value of a
    memory location, writes a new value to that
    location, and returns the old value to a
    destination register
  • Key feature is that no other operation can read
    the memory location between the time the
    test-and-set reads it and the time that the new
    value is written

22
Implementing Locks With Test-and-Set
      void lock(lockvar) {
        while (test-and-set(lockvar, 1) != 0) ;   /* spin */
      }
      void unlock(lockvar) {
        lockvar = 0;
      }
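  • For comparison, a runnable C11 version (an addition for this
    transcript, not on the original slide): atomic_flag_test_and_set
    is exactly such an atomic read-modify-write:

      #include <stdatomic.h>

      atomic_flag lockvar = ATOMIC_FLAG_INIT;

      void lock(void) {
          while (atomic_flag_test_and_set(&lockvar))
              ;                        /* spin while the old value was 1 */
      }

      void unlock(void) {
          atomic_flag_clear(&lockvar); /* store 0, releasing the lock */
      }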

23
Executing Multiple Threads
  • What is a thread?
  • Loop iterations, function calls, basic blocks,
    external functions, ...
  • How do we implement threads?
  • Thread-level parallelism
  • Synchronization
  • Multiprocessors
  • Explicit multithreading
  • Implicit multithreading
  • Redundant multithreading
  • Summary

24
Thread-level Parallelism
  • Reduces effectiveness of temporal and spatial
    locality

25
Thread-level Parallelism
  • Parallelism limited by sharing
  • Amdahl's law (see the worked example below)
  • Access to shared state must be serialized
  • Serial portion limits parallel speedup
  • Many important applications share (lots of) state
  • Relational databases (transaction processing):
    GBs of shared state
  • Even completely independent processes share
    virtualized hardware, hence must synchronize
    access
  • Access to shared state/shared variables
  • Must occur in a predictable, repeatable manner
  • Otherwise, chaos results
  • Architecture must provide primitives for
    serializing access to shared state
  • Multithreading may reduce the effectiveness of
    both spatial and temporal locality
  • temporal: the same piece of code takes longer to
    execute (context switching)
  • spatial: multiple active threads make the working
    set much larger
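  • A worked example of the Amdahl's-law limit mentioned above (a
    sketch added for this transcript):

      #include <stdio.h>

      /* Amdahl's law: speedup = 1 / (s + (1 - s) / N),
         where s is the serial fraction and N the processor count. */
      double amdahl(double s, int n) {
          return 1.0 / (s + (1.0 - s) / n);
      }

      int main(void) {
          /* a 5% serial portion caps 16 processors at about 9.1x */
          printf("N=16, s=0.05: speedup = %.2f\n", amdahl(0.05, 16));
          return 0;
      }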

26
Synchronization: Memory Consistency (A = 0)
  [Figure: starting from A = 0, different interleavings of the
  processors' updates produce different final values; the slide shows
  outcomes A = 3, A = 4, A = 4, and A = 1]
27
Some Other Synchronization Primitives
  • Only one is necessary
  • Intel's Itanium supports all three.

28
Synchronization Examples
  • All three provide the same semantic guarantee
  • Initial value of A = 0
  • Final value of A = 4 in ALL cases
  • (b) uses additional lock variable AL to protect
    critical section with a spin lock
  • This is the most common synchronization method in
    modern multithreaded applications

29
Multiprocessor Systems: the four key abstractions
  • Fully shared memory
  • All processors have equivalent view of all of
    memory
  • Uniform (Unit) latency
  • All memory requests satisfied in a single cycle
  • Lack of contention
  • A processor's memory references are never slowed
    down by other processors' memory references
  • Instantaneous propagation of writes
  • Any write operation (by any processor) is
    instantaneously visible to all processors.

30
Cache Coherence
  • Simple to build, but long memory latencies and
    lots of contention for the memory bus

31
Cache Coherence
  • Reduced average memory latency
  • Less contention for the bus
  • Problem: what happens when both caches have
    copies of the same address?

32
Snooping Caches
33
Implementing Cache Coherence
  • Snooping implementation
  • Origins in shared-bus systems
  • All CPUs could observe all other CPUs' requests on
    the bus, hence "snooping"
  • Bus Read, Bus Write, Bus Upgrade
  • React appropriately to snooped commands
  • Invalidate shared copies
  • Provide up-to-date copies of dirty lines
  • Flush (writeback) to memory, or
  • Direct intervention (modified intervention or
    dirty miss)
  • Snooping suffers from
  • Scalability: shared buses not practical
  • Ordering of requests without a shared bus
  • Lots of recent and on-going work on scaling
    snoop-based systems

34
Snooping Caches
  • Each cache watches all of the transactions on the
    memory bus
  • If another processor requests a copy of a line in
    your cache, then you handle the request by
    sending the line over the bus and adjusting the
    state in your cache appropriately
  • Main memory provides data if no cache does
  • If your copy is shared, it stays shared if the
    other processor wanted to read the data, and
    becomes invalid if the other processor wanted to
    write the data
  • If your copy is exclusive or modified, it becomes
    shared if another processor wants to read the
    data, and invalid if another processor wants to
    write

35
Cache Coherence
  • Basic problem If we have multiple copies of a
    memory address, need to keep those copies
    coherent (the same)
  • Writes on one processor must become visible to
    all
  • One solution would be to broadcast all writes to
    every processor
  • Lots of wasted bus bandwidth -- other processors
    may not have copies of a given address
  • Need to know when to send values to other
    processors

36
Invalidation-Based Cache Coherence
  • Basic idea: it's OK if multiple processors have
    copies of a memory address so long as none of
    them are writing to it
  • Allow multiple processors to have read-only
    (shared) copies of a cache line
  • If a processor wants to write a cache line, it
    must acquire a writable (exclusive) copy of the
    line
  • If a processor has a writable copy of a line, no
    other processor may have a copy of the line
  • Requesting an exclusive copy of a line requires
    that all other processors invalidate their copy
    of the line
  • Alternative approach: update-based cache
    coherence
  • Many processors can write to a line, have to send
    written values to all processors with copies of
    the line

37
Update vs. Invalidation Protocols
  • Coherent Shared Memory
  • All processors see the effects of others' writes
  • How/when writes are propagated
  • Determined by the coherence protocol

38
MESI: An Invalidate Protocol
39
Illinois (MESI) Protocol
  • In a processor's cache, each line can be in one
    of four states
  • Modified (this processor has the only copy, line
    is writable and readable, line has been written
    since fetched)
  • Exclusive (this processor has the only copy, line
    is writable and readable, line has not been
    written since fetched)
  • Shared (this processor and others have copies,
    line can be read but not written)
  • Invalid (this processor has no copy of the line,
    line cannot be written or read)
  • Each tag in the cache records the state of its
    line, similar to how uniprocessor caches track
    valid/invalid and dirty/clean
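  • The per-line state machine can be sketched in C as below (an
    illustration added for this transcript; a real protocol also
    issues the matching bus transactions, as slide 34 describes):

      typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

      /* local read miss: fetch the line; shared if anyone else has it */
      mesi_t on_local_read(mesi_t s, int others_have_copy) {
          if (s == INVALID)
              return others_have_copy ? SHARED : EXCLUSIVE;
          return s;                   /* M, E, S all satisfy reads */
      }

      /* local write: from S or I, must first invalidate other copies
         (bus upgrade / read-exclusive); the line ends up MODIFIED */
      mesi_t on_local_write(mesi_t s) {
          return MODIFIED;
      }

      /* another processor reads: supply data if dirty, drop to shared */
      mesi_t on_snoop_read(mesi_t s) {
          return (s == INVALID) ? INVALID : SHARED;
      }

      /* another processor writes: invalidate our copy */
      mesi_t on_snoop_write(mesi_t s) {
          return INVALID;
      }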

40
An Invalidate Protocol: The Illinois (MESI) Protocol
41
Illinois Protocol on Snooping Cache System
42
Limitations of Snooping Cache
  • Bus bandwidth and bandwidth to the main memory
    don't increase as the number of processors goes up
  • As number of processors goes up, one of these two
    factors eventually becomes the bottleneck
  • Worse than that, bus bandwidth will actually go
    down as number of processors increases due to
    electrical effects
  • Network made up of point-to-point connections can
    be much faster for large machines

43
Distributed Shared Memory
  • Each processor has a cache and a main memory
    attached to it, and communicates with the other
    processors over a network
  • The shared address space is divided up among the
    processors, and each processor becomes the home
    node for the data stored in its main memory
  • Home nodes keep a directory of the state of each
    line they are responsible for and who has copies
    of the line
  • When processors try to access a line they don't
    have a copy of, or try to write a line they have
    a shared copy of, they send a message to the home
    node of the line requesting it
  • Home node sends messages to all sharing nodes
    telling them how the state of their copies needs
    to change
  • Processor with the most up-to-date copy of the
    line sends it back to the requesting processor
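  • A hypothetical sketch of the home node's per-line directory state
    (the field names are illustrative, not from the slides):

      #include <stdint.h>

      #define MAX_NODES 64

      typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } dir_state_t;

      typedef struct {
          dir_state_t state;
          uint64_t    sharers;   /* bit i set => node i holds a copy */
          int         owner;     /* valid when state == EXCLUSIVE_DIRTY */
      } dir_entry_t;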

44
UMA vs. NUMA
  [Figure: Centralized memory (UMA) - all processors reach one shared
  memory through a common network; Distributed memory (NUMA) - each
  processor has its own local memory, all connected by the network]
45
Distributed Shared-Memory Machine
46
Consistency vs. Coherence
  • Protocol vs. implementation
  • The memory consistency model tells you when the
    results of a memory operation on one processor
    will be visible on another processor
  • The cache-coherence protocol tells you how the
    memory system implements the memory consistency
    model

47
Strong (Sequential) Consistency
  • A multiprocessor system is sequentially
    consistent if
  • The result of any execution is the same as if all
    of the operations on all the processors were
    executed in some (unspecified) sequential order
  • Atomic execution of memory operations
  • On any processor, the result of any execution is
    the same as if all of the operations on that
    processor were executed in program order
  • Intuitively, this model gives the same result as
    if a set of in-order processors shared a single
    memory system
  • Relaxed consistency models generally require the
    programmer to specify when writes by one
    processor become visible on other processors
  • Reduces communication traffic, but increases
    programming effort

48
Consistency Example: Dekker's Algorithm
      /* Processor 1 */            /* Processor 2 */
      while (1) {                  while (1) {
        Flag1 = 1;                   Flag2 = 1;
        if (Flag2 == 0)              if (Flag1 == 0)
          return;  /* have lock */     return;  /* have lock */
        Flag1 = 0;                   Flag2 = 0;
      }                            }
  • On a system with strong consistency, this
    implements locks without a read-modify-write
    operation
  • Very inefficient, though, particularly as the
    number of processors grows
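  • A runnable C11 rendering of Processor 1's side (added for this
    transcript): the default memory_order_seq_cst keeps the load of
    the other flag from being reordered ahead of the store to our own
    flag, which is exactly the strong consistency the algorithm needs:

      #include <stdatomic.h>

      atomic_int Flag1, Flag2;        /* both start at 0 */

      void lock_p1(void) {            /* Processor 2 swaps the flags */
          while (1) {
              atomic_store(&Flag1, 1);        /* seq_cst store */
              if (atomic_load(&Flag2) == 0)   /* seq_cst load */
                  return;                     /* have lock */
              atomic_store(&Flag1, 0);        /* back off and retry */
          }
      }

      void unlock_p1(void) {
          atomic_store(&Flag1, 0);
      }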

49
Implementing Cache Coherence
  • Directory implementation
  • Extra bits stored in memory (directory) record
    state of line
  • Memory controller maintains coherence based on
    the current state
  • Other CPUs' commands are not snooped; instead
  • Directory forwards relevant commands
  • Powerful filtering effect: only observe commands
    that you need to observe
  • Meanwhile, bandwidth at directory scales by
    adding memory controllers as you increase size of
    the system
  • Leads to very scalable designs (100s to 1000s of
    CPUs)
  • Directory shortcomings
  • Indirection through directory has latency penalty
  • If a shared line is dirty in another CPU's cache,
    the directory must forward the request, adding latency
  • This can severely impact performance of
    applications with heavy sharing (e.g. relational
    databases)

50
Memory Consistency: Dijkstra's 2-way Mutual
Exclusion
  • How are memory references from different
    processors interleaved?
  • If this is not well-specified, synchronization
    becomes difficult or even impossible
  • ISA must specify consistency model
  • Common example: using Dekker's algorithm for
    synchronization
  • If load reordered ahead of store (as we assume
    for a baseline OOO CPU)
  • Both Proc0 and Proc1 enter critical section,
    since both observe that the other's lock variable
    (A/B) is not set
  • If consistency model allows loads to execute
    ahead of stores, Dekker's algorithm no longer
    works
  • Common ISAs allow this: IA-32, PowerPC, SPARC,
    Alpha

51
Sequential Consistency [Lamport 1979]
  • Processors treated as if they are interleaved
    processes on a single time-shared CPU
  • All references must fit into a total global order
    or interleaving that does not violate any CPU's
    program order
  • Otherwise sequential consistency is not maintained
  • Now Dekker's algorithm will work
  • Appears to preclude any OOO memory references
  • Hence precludes any real benefit from OOO CPUs

52
High-Performance Sequential Consistency
  • Coherent caches isolate CPUs if no sharing is
    occurring
  • Absence of coherence activity means CPU is free
    to reorder references
  • Still have to order references with respect to
    misses and other coherence activity (snoops)
  • Key: use speculation
  • Reorder references speculatively
  • Track which addresses were touched speculatively
  • Force replay (in order execution) of such
    references that collide with coherence activity
    (snoops)

53
High-Performance Sequential Consistency
  • Load queue records all speculative loads
  • Bus writes/upgrades are checked against LQ
  • Any matching load gets marked for replay
  • At commit, loads are checked and replayed if
    necessary
  • Results in machine flush, since load-dependent
    ops must also replay
  • Practically, conflicts are rare, so expensive
    flush is OK

54
Relaxed Consistency Models
  • Key insight: only synchronization references need
    to be ordered
  • Hence, relax memory ordering for all references
  • Enable high-performance out-of-order
    implementation
  • Require programmer to label synchronization
    references
  • Hardware must carefully order these labeled
    references
  • All other references can be performed out of
    order
  • Labeling schemes
  • Explicit synchronization ops (acquire/release)
  • Memory fence or memory barrier ops
  • All preceding ops must finish before following
    ones begin
  • Often barrier ops cause a pipeline drain in modern
    out-of-order machines
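  • A minimal C11 sketch of such labeling (added for this transcript):
    only the flag is a labeled synchronization reference; the ordinary
    data stays freely reorderable except where the acquire/release
    pair constrains it:

      #include <stdatomic.h>

      int payload;                    /* ordinary, unlabeled data */
      atomic_int ready;               /* labeled synchronization variable */

      void producer(int v) {
          payload = v;
          atomic_store_explicit(&ready, 1, memory_order_release);
      }

      int consumer(void) {
          while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
              ;                       /* spin until the release is visible */
          return payload;             /* guaranteed to see v */
      }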

55
Coherent Memory Interface Example
56
Split Transaction Buses
  • Packet switched vs. circuit switched
  • Release bus after request issued
  • Allow multiple concurrent requests to overlap
    memory latency
  • More complicated control, arbitration for bus
  • Much better throughput

57
Explicitly Multithreaded Processors
  • Many approaches for executing multiple threads on
    a single die
  • Mix-and-match: IBM Power5 (CMP + SMT)

58
IBM Power4: Example CMP
59
Coarse-grained Multithreading
  • Low-overhead approach for improving processor
    throughput
  • Also known as switch-on-event
  • Long history: Denelcor HEP
  • Commercialized in IBM Northstar, Pulsar
  • Rumored in Sun Rock, Niagara

60
SMT Resource Sharing
61
Implicitly Multithreaded Processors
  • Goal: speed up execution of a single thread
  • Implicitly break program up into multiple smaller
    threads, execute them in parallel
  • Parallelize loop iterations across multiple
    processing units
  • Usually, exploit control independence in some
    fashion
  • Many challenges
  • Maintain data dependences (RAW, WAR, WAW) for
    registers
  • Maintain precise state for exception handling
  • Maintain memory dependences (RAW/WAR/WAW)
  • Maintain memory consistency model
  • Not really addressed in any of the literature
  • Active area of research
  • Only a subset is covered here, in a superficial
    manner

62
Sources of Control Independence
63
Implicit Multithreading Approaches
64
Executing The Same Thread
  • Why execute the same thread twice?
  • Detect faults
  • Better performance
  • Prefetch, resolve branches

65
Fault Detection
  • AR/SMT [Rotenberg 1999]
  • Use second SMT thread to execute program twice
  • Compare results to check for hard and soft errors
    (faults)
  • DIVA [Austin 1999]
  • Use simple check-processor at commit
  • Re-execute all ops in order
  • Possibly relax the main processor's correctness
    constraints and safety margins to improve
    performance
  • Lower voltage, higher frequency, etc.
  • Lots of other variations proposed in more recent
    work

66
Speculative Pre-execution
  • Idea: create a runahead or future thread that
    helps the main trailing thread
  • Advantage: the speculative future thread has no
    correctness requirement
  • Slipstream processors [Rotenberg 2000]
  • Construct speculative, stripped-down version
    (future thread)
  • Let it run ahead and prefetch
  • Speculative precomputation [Roth 2001; Zilles
    2002; Collins et al. 2001]
  • Construct backward dataflow slice for problematic
    instructions (mispredicted branches, cache
    misses)
  • Pre-execute this slice of the program
  • Resolve branches, prefetch data
  • Implemented in Intel production compiler,
    reflected in Intel Pentium 4 SPEC results