Title: Multiprocessors
1 Multiprocessors (Multicores)
ECE 411 Spring 2009
2 Outline
- Multiprocessing intro
- Classifying Multiprocessors
- Synchronization
3 Multiprocessors
4 Motivation
- Using every possible technique to speed up single-processor systems
- If I have N processors, shouldn't I be able to get N times as much work done?
- Anyone who's ever done a group project (MP3!) will tell you why it's not this simple
- If I have many programs to run, I can use multiple processors to run them at the same time
- Multiprocessing for throughput
- But can I use N processors to run one program in 1/Nth the time?
- Making this work is the holy grail of parallel processing
5 Processor Taxonomy (Flynn's)
- Single Instruction, Single Data (SISD)
- Basic 1-wide CPU
- Single Instruction, Multiple Data (SIMD)
- Vector processors, some multimedia extensions, array processors
- Multiple Instruction, Single Data (MISD)
- Not practical
- Multiple Instruction, Multiple Data (MIMD)
- Most multiprocessors, superscalar CPUs
- More recent addition
- Single Program, Multiple Data (SPMD)
- Processors run the same program, but don't have to stay in lockstep
6 Multiprocessor Performance
7 Multiprocessor Performance
- Ideal speedup of N on N processors
- Work evenly divided among the processors with no overhead
- More typical speedups are sub-linear (< N; a worked example follows this list)
- Communication overhead
- Synchronization overhead
- Load balancing
- Occasionally, see superlinear (> N) speedup
- Indicates that the parallel program or computer system is more efficient than the base program or system
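A quick worked example to make the terms concrete (the numbers are made up for illustration): if the best uniprocessor version of a program runs in 100 seconds and the 8-processor version runs in 20 seconds, the speedup is 100 / 20 = 5. That is sub-linear (5 < 8), and the parallel efficiency is 5 / 8 = 62.5%.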
8 How Not to Lie With Multiprocessor Performance
- Basic rule: be fair to the uniprocessor
- Compare the multiprocessor against the best possible version of the uniprocessor program
- Do a good job on the uniprocessor program: use an optimizing compiler, etc.
- Make sure the uniprocessor is running an efficient algorithm for uniprocessors (may require two versions of the program)
- Use the same input data for all versions
- Much easier to get a speedup of N if you increase the work by a factor of N
- Some performance measures explicitly cover rate of work, in which case increasing the data size is OK
9 Classifying Multiprocessors
- Can think about categorizing multiprocessors based on three questions
- How do the processors exchange data?
- How are the processors connected to each other?
- How is memory organized?
10 How Do Processors Exchange Data?
- Two major alternatives
- Message Passing
- Programs execute explicit operations (sends, receives) to transfer data from one processor to another
- Can be more efficient, because data can be sent before it is needed
- Often viewed as harder to program, particularly for irregular applications
- Shared Memory
- The system maintains one view of memory; programs on one processor see the results of writes by programs on other processors
- Data is generally not sent until a program tries to access it
- Often viewed as easier to program
- Generally accepted that this is true for just getting code working; less clear if you consider the effort needed for high performance
11 Message-Passing Example
- Processor 1
- send(a, processor2);
- c = 0;
- for (i = 0; i < 100; i++)
-   c += a[i];
- send(c, processor2);
- Processor 2
- receive(a);
- d = 0;
- for (i = 100; i < 200; i++)
-   d += a[i];
- receive(c);
- d += c;
- printf("Sum is %d\n", d);
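For concreteness, here is a compilable sketch of the same pattern written with MPI, one common message-passing library (MPI is not mentioned on the slide; the array contents and the use of two ranks are illustrative). Rank 0 plays the role of Processor 1 and rank 1 the role of Processor 2.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, a[200], i;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {                          /* "Processor 1" */
          for (i = 0; i < 200; i++) a[i] = i;   /* sample data */
          MPI_Send(a, 200, MPI_INT, 1, 0, MPI_COMM_WORLD);
          int c = 0;
          for (i = 0; i < 100; i++) c += a[i];
          MPI_Send(&c, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {                   /* "Processor 2" */
          int c, d = 0;
          MPI_Recv(a, 200, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          for (i = 100; i < 200; i++) d += a[i];
          MPI_Recv(&c, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          d += c;
          printf("Sum is %d\n", d);
      }
      MPI_Finalize();
      return 0;
  }

Run under an MPI launcher with two processes (e.g. mpirun -np 2); the ordering comes from the fact that each receive blocks until its matching send arrives.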
12 Shared-Memory Example
- Processor 1
- c = 0;
- for (i = 0; i < 100; i++)
-   c += a[i];
- Processor 2
- d = 0;
- for (j = 100; j < 200; j++)
-   d += a[j];
- /* wait for first processor to be done */
- d += c;
- printf("Sum is %d\n", d);
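The same computation as a compilable sketch using POSIX threads; the slide does not name a threading API, so pthreads, the sample data, and the use of pthread_join to stand in for "wait for first processor to be done" are my choices.

  #include <pthread.h>
  #include <stdio.h>

  #define N 200

  static int a[N];
  static int c;                     /* partial sum produced by the helper thread */

  static void *sum_low_half(void *arg) {
      int sum = 0;
      for (int i = 0; i < 100; i++)
          sum += a[i];
      c = sum;                      /* result becomes visible through shared memory */
      return NULL;
  }

  int main(void) {
      pthread_t t1;
      int d = 0;

      for (int i = 0; i < N; i++)   /* fill the shared array with sample data */
          a[i] = i;

      pthread_create(&t1, NULL, sum_low_half, NULL);

      for (int j = 100; j < 200; j++)
          d += a[j];

      pthread_join(t1, NULL);       /* "wait for first processor to be done" */
      d += c;
      printf("Sum is %d\n", d);
      return 0;
  }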
13 How Are the Processors Connected?
- Shared Bus
- Simple
- All processors see every communication
- Problem: bandwidth doesn't scale as the number of processors increases
- May actually go down because of wire length and loading
- Network
- Many different types of networks
- Generally consists of a set of switches that connect subsets of the processors
- Can increase bandwidth as the number of processors increases
- Lots of complexity/wire/latency tradeoffs in network design
14 How Is Memory Organized?
15 Centralized vs. Distributed Memory
- Centralized Memory
- Conceptually simple
- Integrates well with shared memory
- Bandwidth doesn't scale with the number of processors
- Note: can still have caches on each processor with centralized memory
- Creates the cache-coherence problem that we'll talk about next time
- Distributed Memory
- Integrates well with the message-passing model
- Bandwidth scales with the number of processors
16 Multiprocessor Parallelism
- Origin: time-sharing
- Threads, tasks, multithreading, multitasking
- Instruction-level parallelism: independent instructions within a single thread executing simultaneously
- Thread-level parallelism: independent parts of an application executing simultaneously
- Terminology: processes, tasks, threads, and their OS counterparts
17 Synchronization
- Sometimes we need to ensure that events on different processors happen in a particular order to get the correct result from programs
- Example: weather simulation
- Synchronization refers both to the process of ensuring ordering and to the techniques used to enforce order
18 Synchronization in Message-Passing Systems
- The send-receive paradigm enforces ordering because a receive() operation doesn't complete until the data from the matching send() arrives
- In many cases, getting the set of sends and receives right to transfer the data required by the program on each processor provides all of the synchronization you need
- In some cases, need to add extra send-receive pairs to enforce ordering
- Example: keeping a producer thread from getting too far ahead of a consumer
19 Synchronization in Shared-Memory Systems
- Synchronization is much more of an issue on shared-memory systems than on message-passing systems
- Example: if processor P1 does a store to address 37, and P2 does a load from address 37, does P2 see the value that P1 wrote?
- Depends on whether the store or the load happens first
- Need synchronization to enforce the ordering the programmer wants
- Shared-memory model: we will assume strong consistency, which means that
- On any processor, memory operations happen in program order (or at least return the same results as if they did)
- Across all processors, the set of memory operations gives the same result as if they executed in some sequential order
20 Synchronization
- Two basic primitives (see the sketch after this list)
- Mutual exclusion (lock)
- Only one processor can acquire a lock at any time
- Example: shared counter
- lock(lockvar);
- a = a + 1;
- release(lockvar);
- Barrier
- When a processor hits a barrier, it stops until all processors reach the barrier
- Example: weather simulations typically divide time into discrete steps and execute a barrier at the end of each step, so that no processor simulates the next step until all are done with the current one
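Both primitives map directly onto POSIX threads. The sketch below is a minimal illustration (the thread count, step count, and shared counter are made up), not the weather code itself: a mutex guards the shared counter, and a barrier separates the simulated time steps.

  #include <pthread.h>
  #include <stdio.h>

  #define NUM_THREADS 4
  #define NUM_STEPS   3

  static pthread_mutex_t lockvar = PTHREAD_MUTEX_INITIALIZER;
  static pthread_barrier_t step_barrier;
  static int a;                           /* shared counter */

  static void *worker(void *arg) {
      for (int step = 0; step < NUM_STEPS; step++) {
          pthread_mutex_lock(&lockvar);   /* mutual exclusion */
          a = a + 1;
          pthread_mutex_unlock(&lockvar);

          /* no thread starts the next step until all finish this one */
          pthread_barrier_wait(&step_barrier);
      }
      return NULL;
  }

  int main(void) {
      pthread_t t[NUM_THREADS];
      pthread_barrier_init(&step_barrier, NULL, NUM_THREADS);
      for (int i = 0; i < NUM_THREADS; i++)
          pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < NUM_THREADS; i++)
          pthread_join(t[i], NULL);
      printf("Counter = %d\n", a);        /* NUM_THREADS * NUM_STEPS = 12 */
      pthread_barrier_destroy(&step_barrier);
      return 0;
  }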
21 Implementing Locks
- It is possible to implement a lock with just load and store operations, but it is very inefficient
- Locks become much more efficient if the processor provides some sort of atomic read-modify-write operation
- Example: test-and-set
- A single instruction that reads the value of a memory location, writes a new value to that location, and returns the old value in a destination register
- The key feature is that no other operation can read the memory location between the time the test-and-set reads it and the time the new value is written
22 Implementing Locks With Test-and-Set
- void lock(lockvar) {
-   while (test-and-set(lockvar, 1) != 0)
-     ;   /* spin until the old value is 0, i.e., we acquired the lock */
- }
- void unlock(lockvar) {
-   lockvar = 0;
- }
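On a machine without a literal test-and-set instruction, the same spin lock can be sketched with C11 atomics, where atomic_exchange plays the test-and-set role (this mapping is my illustration, not from the slides):

  #include <stdatomic.h>

  typedef atomic_int spinlock_t;

  static void lock(spinlock_t *lockvar) {
      /* atomically write 1 and get the old value; keep trying while it was 1 */
      while (atomic_exchange(lockvar, 1) != 0)
          ;                         /* spin */
  }

  static void unlock(spinlock_t *lockvar) {
      atomic_store(lockvar, 0);
  }

Real spin locks usually add a test-and-test-and-set loop or backoff so the spinning processor does not hammer the bus, which ties into the coherence discussion later in the deck.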
23 Executing Multiple Threads
- What is a thread?
- Loop iterations, function calls, basic blocks, external functions, ...
- How do we implement threads?
- Thread-level parallelism
- Synchronization
- Multiprocessors
- Explicit multithreading
- Implicit multithreading
- Redundant multithreading
- Summary
24 Thread-level Parallelism
- Reduces effectiveness of temporal and spatial locality
25 Thread-level Parallelism
- Parallelism limited by sharing
- Amdahl's law (see the formula after this list)
- Access to shared state must be serialized
- Serial portion limits parallel speedup
- Many important applications share (lots of) state
- Relational databases (transaction processing): GBs of shared state
- Even completely independent processes share virtualized hardware, hence must synchronize access
- Access to shared state/shared variables
- Must occur in a predictable, repeatable manner
- Otherwise, chaos results
- Architecture must provide primitives for serializing access to shared state
- Multithreading may reduce the effectiveness of both spatial and temporal locality
- Temporal: the same piece of code takes longer to execute (context switching)
- Spatial: multiple active threads make the working set much larger
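As a reminder, the usual statement of Amdahl's law (standard material, not spelled out on the slide): if a fraction f of the work parallelizes perfectly across N processors and the remaining 1 - f is serial, then

  Speedup(N) = 1 / ((1 - f) + f / N)

so even with unlimited processors the speedup is capped at 1 / (1 - f); for example, f = 0.9 caps the speedup at 10 no matter how many processors are used.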
26 Synchronization: Memory Consistency (A = 0)
- (Figure: code fragments on two processors updating a shared variable A; the values 3, 4, 4, and 1 appear in the original figure)
27 Some Other Synchronization Primitives
- Only one is necessary
- Intel's Itanium supports all three
28 Synchronization Examples
- All three provide the same semantic guarantee
- Initial value of A = 0
- Final value of A = 4 in ALL cases
- (b) uses an additional lock variable AL to protect the critical section with a spin lock
- This is the most common synchronization method in modern multithreaded applications
29 Multiprocessor Systems: The Four Key Abstractions
- Fully shared memory
- All processors have an equivalent view of all of memory
- Uniform (unit) latency
- All memory requests are satisfied in a single cycle
- Lack of contention
- A processor's memory references are never slowed down by other processors' memory references
- Instantaneous propagation of writes
- Any write operation (by any processor) is instantaneously visible to all processors
30 Cache Coherence
- Simple to build, but long memory latencies and lots of contention for the memory bus
31 Cache Coherence
- Reduced average memory latency
- Less contention for the bus
- Problem: what happens when both caches have copies of the same address?
32 Snooping Caches
33 Implementing Cache Coherence
- Snooping implementation
- Origins in shared-bus systems
- All CPUs could observe all other CPUs' requests on the bus, hence "snooping"
- Bus Read, Bus Write, Bus Upgrade
- React appropriately to snooped commands
- Invalidate shared copies
- Provide up-to-date copies of dirty lines
- Flush (writeback) to memory, or
- Direct intervention (modified intervention or dirty miss)
- Snooping suffers from
- Scalability: shared busses are not practical
- Ordering of requests without a shared bus
- Lots of recent and ongoing work on scaling snoop-based systems
34 Snooping Caches
- Each cache watches all of the transactions on the memory bus
- If another processor requests a copy of a line in your cache, you handle the request by sending the line over the bus and adjusting the state in your cache appropriately
- Main memory provides the data if no cache does
- If your copy is shared, it stays shared if the other processor wanted to read the data, and becomes invalid if the other processor wanted to write the data
- If your copy is exclusive or modified, it becomes shared if another processor wants to read the data, and invalid if another processor wants to write
35 Cache Coherence
- Basic problem: if we have multiple copies of a memory address, we need to keep those copies coherent (the same)
- Writes on one processor must become visible to all
- One solution would be to broadcast all writes to every processor
- Lots of wasted bus bandwidth -- other processors may not have copies of a given address
- Need to know when to send values to other processors
36 Invalidation-Based Cache Coherence
- Basic idea: it's OK if multiple processors have copies of a memory address, as long as none of them are writing to it
- Allow multiple processors to have read-only (shared) copies of a cache line
- If a processor wants to write a cache line, it must acquire a writable (exclusive) copy of the line
- If a processor has a writable copy of a line, no other processor may have a copy of the line
- Requesting an exclusive copy of a line requires that all other processors invalidate their copies of the line
- Alternative approach: update-based cache coherence
- Many processors can write to a line, but they have to send the written values to all processors with copies of the line
37 Update vs. Invalidation Protocols
- Coherent shared memory
- All processors see the effects of others' writes
- How/when writes are propagated
- Determined by the coherence protocol
38 MESI: An Invalidate Protocol
39 Illinois (MESI) Protocol
- In a processor's cache, each line can be in one of four states (a small code sketch of the transitions follows this list)
- Modified (this processor has the only copy; the line is writable and readable; the line has been written since it was fetched)
- Exclusive (this processor has the only copy; the line is writable and readable; the line has not been written since it was fetched)
- Shared (this processor and others have copies; the line can be read but not written)
- Invalid (this processor has no copy of the line; the line cannot be written or read)
- Each tag in the cache records the state of its line, similar to how uniprocessor caches track valid/invalid and dirty/clean
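Putting the four states together with the snoop reactions from the "Snooping Caches" slide, the sketch below shows how one cache's state for a line changes. It is an illustrative simplification (the event names, the others_have_copy hint, and the omission of the bus requests a local write must first issue are my own shorthand), not the full Illinois protocol.

  /* One cache's MESI state transitions for a single line. */
  typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

  typedef enum {
      LOCAL_READ,      /* this CPU loads from the line            */
      LOCAL_WRITE,     /* this CPU stores to the line             */
      SNOOP_BUS_READ,  /* another CPU issues a read for the line  */
      SNOOP_BUS_WRITE  /* another CPU wants to write the line     */
  } event_t;

  /* others_have_copy: on a read miss, did any other cache report a copy? */
  mesi_t next_state(mesi_t cur, event_t ev, int others_have_copy) {
      switch (ev) {
      case LOCAL_READ:
          if (cur == INVALID)              /* read miss: fetch the line */
              return others_have_copy ? SHARED : EXCLUSIVE;
          return cur;                      /* M, E, S unchanged */
      case LOCAL_WRITE:
          /* I and S must first gain exclusivity on the bus (read-exclusive
             or upgrade), invalidating other copies; then the line is dirty */
          return MODIFIED;
      case SNOOP_BUS_READ:
          if (cur == MODIFIED || cur == EXCLUSIVE)
              return SHARED;               /* supply data if dirty, then downgrade */
          return cur;                      /* S stays S, I stays I */
      case SNOOP_BUS_WRITE:
          return INVALID;                  /* another writer: our copy is stale */
      }
      return cur;
  }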
40 An Invalidate Protocol: The Illinois (MESI) Protocol
41 Illinois Protocol on a Snooping Cache System
42 Limitations of Snooping Caches
- Bus bandwidth and bandwidth to main memory don't increase as the number of processors goes up
- As the number of processors goes up, one of these two factors eventually becomes the bottleneck
- Worse than that, bus bandwidth will actually go down as the number of processors increases, due to electrical effects
- A network made up of point-to-point connections can be much faster for large machines
43 Distributed Shared Memory
- Each processor has a cache and a main memory attached to it, and communicates with the other processors over a network
- The shared address space is divided up among the processors, and each processor becomes the home node for the data stored in its main memory
- Home nodes keep a directory of the state of each line they are responsible for and of who has copies of the line
- When processors try to access a line they don't have a copy of, or try to write a line they have only a shared copy of, they send a message to the line's home node requesting it
- The home node sends messages to all sharing nodes telling them how the state of their copies needs to change
- The processor with the most up-to-date copy of the line sends it back to the requesting processor
44 UMA vs. NUMA
- (Figure: under centralized memory (UMA), all processors reach one shared memory through the network; under distributed memory (NUMA), each processor has its own local memory, and the processor-memory nodes are connected by the network)
45 Distributed Shared-Memory Machine
46 Consistency vs. Coherence
- Protocol vs. implementation
- The memory consistency model tells you when the results of a memory operation on one processor will be visible on another processor
- The cache-coherence protocol tells you how the memory system implements the memory consistency model
47 Strong (Sequential) Consistency
- A multiprocessor system is sequentially consistent if
- The result of any execution is the same as if all of the operations on all the processors were executed in some (unspecified) sequential order
- Atomic execution of memory operations
- On any processor, the result of any execution is the same as if all of the operations on that processor were executed in program order
- Intuitively, this model gives the same result as if a set of in-order processors shared a single memory system
- Relaxed consistency models generally require the programmer to specify when writes by one processor become visible on other processors
- Reduces communication traffic, but increases programming effort
48 Consistency Example: Dekker's Algorithm
- Processor 1                      Processor 2
- while (1) {                      while (1) {
-   Flag1 = 1;                       Flag2 = 1;
-   if (Flag2 == 0) {                if (Flag1 == 0) {
-     /* Have lock */                  /* Have lock */
-     return;                          return;
-   }                                }
-   Flag1 = 0;                       Flag2 = 0;
- }                                }
- On a system with strong consistency, this implements locks without a read-modify-write operation
- Very inefficient, though, particularly as the number of processors grows
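Below is a compilable sketch of the Processor 1 side using C11 atomics (the function names and the use of C11 are my choice, not the slide's). The default memory_order_seq_cst ordering keeps the store to Flag1 ahead of the load of Flag2 on every processor, which is exactly the strong-consistency assumption this lock relies on; Processor 2 is symmetric with the flags swapped.

  #include <stdatomic.h>

  static atomic_int Flag1, Flag2;       /* both initially 0 */

  void lock_p1(void) {
      while (1) {
          atomic_store(&Flag1, 1);      /* announce our intent */
          if (atomic_load(&Flag2) == 0)
              return;                   /* have the lock */
          atomic_store(&Flag1, 0);      /* back off and retry */
      }
  }

  void unlock_p1(void) {
      atomic_store(&Flag1, 0);
  }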
49 Implementing Cache Coherence
- Directory implementation
- Extra bits stored in memory (the directory) record the state of each line (sketched after this list)
- The memory controller maintains coherence based on the current state
- Other CPUs' commands are not snooped; instead
- The directory forwards relevant commands
- Powerful filtering effect: only observe the commands that you need to observe
- Meanwhile, bandwidth at the directory scales by adding memory controllers as you increase the size of the system
- Leads to very scalable designs (100s to 1000s of CPUs)
- Directory shortcomings
- Indirection through the directory has a latency penalty
- If a shared line is dirty in another CPU's cache, the directory must forward the request, adding latency
- This can severely impact the performance of applications with heavy sharing (e.g. relational databases)
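To make the "extra bits in memory" concrete, a directory entry is often little more than a state plus a list of sharers. The struct below is an illustrative sketch (the field names and the bit-vector representation are assumptions, not taken from the slides):

  #include <stdint.h>

  enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

  /* One directory entry per memory line at that line's home node. */
  struct dir_entry {
      enum dir_state state;    /* who may hold the line, and how        */
      uint64_t       sharers;  /* bit i set => CPU i has a copy (<=64 CPUs) */
  };

On a write request, the home node reads this entry, sends invalidations to every CPU whose bit is set, and only then grants the exclusive copy.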
50 Memory Consistency: Dijkstra's 2-Way Mutual Exclusion
- How are memory references from different processors interleaved?
- If this is not well specified, synchronization becomes difficult or even impossible
- The ISA must specify a consistency model
- Common example: using Dekker's algorithm for synchronization
- If the load is reordered ahead of the store (as we assume for a baseline OOO CPU)
- Both Proc0 and Proc1 enter the critical section, since both observe that the other's lock variable (A/B) is not set
- If the consistency model allows loads to execute ahead of stores, Dekker's algorithm no longer works
- Common ISAs allow this: IA-32, PowerPC, SPARC, Alpha
51 Sequential Consistency [Lamport 1979]
- Processors are treated as if they were interleaved processes on a single time-shared CPU
- All references must fit into a total global order, or interleaving, that does not violate any CPU's program order
- Otherwise sequential consistency is not maintained
- Now Dekker's algorithm will work
- Appears to preclude any OOO memory references
- Hence precludes any real benefit from OOO CPUs
52 High-Performance Sequential Consistency
- Coherent caches isolate CPUs if no sharing is occurring
- Absence of coherence activity means the CPU is free to reorder references
- Still have to order references with respect to misses and other coherence activity (snoops)
- Key: use speculation
- Reorder references speculatively
- Track which addresses were touched speculatively
- Force replay (in-order execution) of such references that collide with coherence activity (snoops)
53 High-Performance Sequential Consistency
- The load queue records all speculative loads
- Bus writes/upgrades are checked against the LQ
- Any matching load gets marked for replay
- At commit, loads are checked and replayed if necessary
- Results in a machine flush, since load-dependent ops must also replay
- Practically, conflicts are rare, so the expensive flush is OK
54 Relaxed Consistency Models
- Key insight: only synchronization references need to be ordered
- Hence, relax memory ordering for all references
- Enables a high-performance out-of-order implementation
- Requires the programmer to label synchronization references
- Hardware must carefully order these labeled references
- All other references can be performed out of order
- Labeling schemes
- Explicit synchronization ops (acquire/release; see the example after this list)
- Memory fence or memory barrier ops
- All preceding ops must finish before following ones begin
- Barrier ops often cause a pipeline drain in a modern out-of-order machine
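A small compilable illustration (the names and values are made up) of labeling only the synchronization reference, here with C11 acquire/release atomics: the ordinary data accesses carry no ordering of their own, while the labeled flag orders them across threads.

  #include <stdatomic.h>

  static int data;                 /* ordinary (unlabeled) shared data   */
  static atomic_int ready;         /* labeled synchronization variable   */

  /* Producer: the release store orders the data write before the flag. */
  void producer(void) {
      data = 42;
      atomic_store_explicit(&ready, 1, memory_order_release);
  }

  /* Consumer: the acquire load orders the flag read before the data read. */
  int consumer(void) {
      while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
          ;                        /* spin until the producer is done */
      return data;                 /* guaranteed to see 42 */
  }

This is exactly the labeling the slide describes: only ready is ordered by the hardware, and all other references are free to execute out of order.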
55 Coherent Memory Interface Example
56 Split Transaction Buses
- Packet switched vs. circuit switched
- Release the bus after a request is issued
- Allows multiple concurrent requests to overlap memory latency
- More complicated control and arbitration for the bus
- Much better throughput
57 Explicitly Multithreaded Processors
- Many approaches for executing multiple threads on a single die
- Mix-and-match: the IBM Power5 combines CMP and SMT
58 IBM Power4: Example CMP
59 Coarse-Grained Multithreading
- Low-overhead approach for improving processor throughput
- Also known as switch-on-event multithreading
- Long history: Denelcor HEP
- Commercialized in IBM Northstar, Pulsar
- Rumored in Sun Rock, Niagara
60 SMT Resource Sharing
61 Implicitly Multithreaded Processors
- Goal: speed up execution of a single thread
- Implicitly break the program up into multiple smaller threads and execute them in parallel
- Parallelize loop iterations across multiple processing units
- Usually, exploit control independence in some fashion
- Many challenges
- Maintain data dependences (RAW, WAR, WAW) for registers
- Maintain precise state for exception handling
- Maintain memory dependences (RAW/WAR/WAW)
- Maintain the memory consistency model
- Not really addressed in any of the literature
- Active area of research
- Only a subset is covered here, in a superficial manner
62 Sources of Control Independence
63 Implicit Multithreading Approaches
64 Executing the Same Thread
- Why execute the same thread twice?
- Detect faults
- Better performance
- Prefetch, resolve branches
65 Fault Detection
- AR/SMT [Rotenberg 1999]
- Use a second SMT thread to execute the program twice
- Compare results to check for hard and soft errors (faults)
- DIVA [Austin 1999]
- Use a simple checker processor at commit
- Re-execute all ops in order
- Possibly relax the main processor's correctness constraints and safety margins to improve performance
- Lower voltage, higher frequency, etc.
- Lots of other variations proposed in more recent work
66 Speculative Pre-execution
- Idea: create a runahead or "future" thread that helps the main trailing thread
- Advantage: the speculative future thread has no correctness requirement
- Slipstream processors [Rotenberg 2000]
- Construct a speculative, stripped-down version of the program (the future thread)
- Let it run ahead and prefetch
- Speculative precomputation [Roth 2001, Zilles 2002, Collins et al. 2001]
- Construct the backward dataflow slice for problematic instructions (mispredicted branches, cache misses)
- Pre-execute this slice of the program
- Resolve branches, prefetch data
- Implemented in the Intel production compiler, reflected in Intel Pentium 4 SPEC results