Title: ECE 1747: Parallel Programming
1ECE 1747 Parallel Programming
- Basics of Parallel Architectures
- Shared-Memory Machines
2Two Parallel Architectures
- Shared memory machines.
- Distributed memory machines.
3Shared Memory Logical View
[Figure: processors proc1, proc2, proc3, ..., procN all access a single shared memory space.]
4Shared Memory Machines
- Small number of processors: shared memory with coherent caches (SMP).
- Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
5SMPs
- 2- or 4-processor PCs are now commodity.
- Good price/performance ratio.
- Memory is sometimes a bottleneck (see later).
- Typical price (8-node): 20-40k.
6Physical Implementation
[Figure: shared memory attached to a bus; caches cache1, cache2, cache3, ..., cacheN sit on the bus, one in front of each processor proc1, proc2, proc3, ..., procN.]
7Shared Memory Machines
- Small number of processors: shared memory with coherent caches (SMP).
- Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
8CC-NUMA Physical Implementation
[Figure: each node has a processor (proc1 ... procN), a cache (cache1 ... cacheN), and a local memory (mem1 ... memN); the nodes are joined by an interconnect.]
9Caches in Multiprocessors
- Suffer from the coherence problem
- same line appears in two or more caches
- one processor writes word in line
- other processors now can read stale data
- Leads to need for a coherence protocol
- avoids coherence problems
- Many exist, will just look at simple one.
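The stale-read scenario above can be sketched with a toy model. This is a minimal illustration, assuming two write-back caches with no coherence protocol at all; the `Cache` class and `stale_read_demo` function are hypothetical names introduced only for this sketch.

```python
# Toy model: two private caches in front of one memory, with NO coherence protocol.
# All names here are illustrative, not taken from a real simulator.

class Cache:
    def __init__(self, memory):
        self.memory = memory      # shared backing store (dict: address -> value)
        self.lines = {}           # private cached copies (address -> value)

    def read(self, addr):
        # On a miss, fetch from memory; on a hit, return the cached copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-back cache: update only the local copy; nobody else is told.
        self.lines[addr] = value


def stale_read_demo():
    memory = {"x": 0}
    c0, c1 = Cache(memory), Cache(memory)
    c0.read("x")          # both caches now hold x = 0
    c1.read("x")
    c0.write("x", 1)      # processor 0 writes its cached copy
    return c0.read("x"), c1.read("x")


print(stale_read_demo())  # (1, 0): processor 1 still sees the stale 0
```

A coherence protocol exists to make the second value impossible once the write has become visible.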
10What is coherence?
- What does it mean to be shared?
- Intuitively, read last value written.
- Notion is not well-defined in a system without a
global clock.
11The Notion of "last written" in a Multi-processor System
[Figure: timelines for four processors P0, P1, P2, P3; two of them issue w(x) and two issue r(x).]
12The Notion of "last written" in a Single-machine System
[Figure: a single timeline on which the operations w(x), w(x), r(x), r(x) are totally ordered.]
13Coherence: a Clean Definition
- Is achieved by referring back to the single-machine case.
- Called sequential consistency.
14Sequential Consistency (SC)
- Memory is sequentially consistent if and only if
it behaves as if the processors were executing
in a time-shared fashion on a single machine.
15Returning to our Example
[Figure: timelines for four processors P0, P1, P2, P3; two of them issue w(x) and two issue r(x).]
16Another Way of Defining SC
- All memory references of a single process execute in program order.
- All writes are globally ordered.
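One way to explore what SC allows is to enumerate every interleaving that respects each process's program order and run it against a single memory. The sketch below does this for an assumed two-process litmus test (P0: w(x,1); r(y) and P1: w(y,1); r(x)); this program and the helper names `interleavings` and `run` are illustrative choices, not the examples from the slides.

```python
# Enumerate every sequentially consistent execution of a small two-process
# program. The program below is an assumed litmus test, not one from the slides.

def interleavings(a, b):
    """Yield all merges of sequences a and b that keep each one's program order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving against a single shared memory (x = y = 0)."""
    mem = {"x": 0, "y": 0}
    reads = {}
    for proc, op, var, val in schedule:
        if op == "w":
            mem[var] = val
        else:
            reads[(proc, var)] = mem[var]
    # Result: (value P0 read from y, value P1 read from x)
    return reads[("P0", "y")], reads[("P1", "x")]


P0 = [("P0", "w", "x", 1), ("P0", "r", "y", None)]
P1 = [("P1", "w", "y", 1), ("P1", "r", "x", None)]

outcomes = sorted({run(s) for s in interleavings(P0, P1)})
print(outcomes)  # [(0, 1), (1, 0), (1, 1)]
```

Under these assumptions the outcome (0, 0) never appears: in any single global order, at least one write precedes the other process's read.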
17SC Example 1
Initial values of x,y are 0.
w(x,1)
w(y,1)
r(x)
r(y)
What are possible final values?
18SC Example 2
w(x,1)
w(y,1)
r(y)
r(x)
19SC Example 3
w(x,1)
w(y,1)
r(y)
r(x)
20SC Example 4
r(x)
w(x,1)
w(x,2)
r(x)
21Implementation
- Many ways of implementing SC.
- In fact, implementations sometimes enforce stronger conditions than SC.
- Will look at a simple one: the MSI protocol.
22Physical Implementation
[Figure: shared memory attached to a bus; caches cache1, cache2, cache3, ..., cacheN sit on the bus, one in front of each processor proc1, proc2, proc3, ..., procN.]
23Fundamental Assumption
- The bus is a reliable, ordered broadcast bus.
- Every message sent by a processor is received by all other processors in the same order.
- Also called a snooping bus
- Processors (or caches) snoop on the bus.
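The ordered-broadcast assumption can be pictured with a small model. This is only a sketch, assuming the bus handles one transaction at a time; the `Bus` and `Snooper` classes are hypothetical names for illustration.

```python
# Toy ordered broadcast bus: every attached snooper sees every message from
# the others, in the single order in which the bus carried them.

class Bus:
    def __init__(self):
        self.snoopers = []

    def attach(self, snooper):
        self.snoopers.append(snooper)

    def broadcast(self, sender, message):
        # One transaction at a time: this call is the serialization point.
        for s in self.snoopers:
            if s is not sender:
                s.snoop(sender.name, message)


class Snooper:
    def __init__(self, name, bus):
        self.name = name
        self.seen = []            # bus traffic observed, in arrival order
        bus.attach(self)

    def snoop(self, source, message):
        self.seen.append((source, message))


bus = Bus()
c0, c1, c2 = (Snooper(f"cache{i}", bus) for i in range(3))
bus.broadcast(c0, "BuRd x")
bus.broadcast(c1, "BuRdX x")
print(c2.seen)  # [('cache0', 'BuRd x'), ('cache1', 'BuRdX x')]
```

Because every snooper observes the same sequence, all caches can agree on the order of writes without a global clock.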
24States of a Cache Line
- Invalid
- Shared
- read-only, one of many cached copies
- Modified
- read-write, sole valid copy
25Processor Transactions
- processor read(x)
- processor write(x)
26Bus Transactions
- bus read(x)
- asks for copy with no intent to modify
- bus read-exclusive(x)
- asks for copy with intent to modify
27State Diagram Step 0
States: I (Invalid), S (Shared), M (Modified). Each following step adds one transition, labeled event/bus action; "-" means no bus action.
28State Diagram Step 1
- I to S: PrRd/BuRd
29State Diagram Step 2
- S to S: PrRd/-
30State Diagram Step 3
- I to M: PrWr/BuRdX
31State Diagram Step 4
- S to M: PrWr/BuRdX
32State Diagram Step 5
- M to M: PrWr/-
33State Diagram Step 6
- M to S: BuRd/Flush
34State Diagram Step 7
- S to S: BuRd/-
35State Diagram Step 8
- S to I: BuRdX/-
36State Diagram Step 9
- M to I: BuRdX/Flush
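The complete diagram can be written down as a transition table and exercised on a short trace. This is a minimal sketch of the usual three-state MSI protocol for a single cache line, using the event names from the slide labels. It assumes that a read hitting a line in M is silent (PrRd/-) and that bus events seen in state I are ignored; the `MSI` table and `step` function are names chosen for this illustration.

```python
# Table-driven sketch of the three-state MSI protocol for ONE cache line.
# (current state, event) -> (next state, action placed on the bus)
MSI = {
    ("I", "PrRd"):  ("S", "BuRd"),
    ("I", "PrWr"):  ("M", "BuRdX"),
    ("S", "PrRd"):  ("S", None),
    ("S", "PrWr"):  ("M", "BuRdX"),
    ("S", "BuRd"):  ("S", None),
    ("S", "BuRdX"): ("I", None),
    ("M", "PrRd"):  ("M", None),      # assumed: read hit in M needs no bus action
    ("M", "PrWr"):  ("M", None),
    ("M", "BuRd"):  ("S", "Flush"),
    ("M", "BuRdX"): ("I", "Flush"),
}


def step(states, cpu, op):
    """Apply processor event op ('PrRd' or 'PrWr') issued by cache number cpu."""
    states = list(states)
    states[cpu], bus = MSI[(states[cpu], op)]
    flushed = False
    if bus is not None:
        # Every other cache snoops the bus transaction.
        for other in range(len(states)):
            if other != cpu:
                # Bus events seen in state I leave the line invalid.
                nxt, action = MSI.get((states[other], bus), (states[other], None))
                states[other] = nxt
                flushed = flushed or action == "Flush"
    return states, bus, flushed


states = ["I", "I"]
for cpu, op in [(0, "PrRd"), (1, "PrRd"), (0, "PrWr"), (1, "PrRd"), (1, "PrWr")]:
    states, bus, flushed = step(states, cpu, op)
    print(f"P{cpu} {op}: states={states} bus={bus} flush={flushed}")
```

In the trace, the fourth operation (P1 reads while P0 holds the line in M) is the one that forces a Flush, and the last write leaves P1 as the sole holder in M.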
37In Reality
- Most machines use a slightly more complicated protocol (4 states instead of 3).
- See architecture books (MESI protocol).
38Problem: False Sharing
- Occurs when two or more processors access different data in the same cache line, and at least one of them writes.
- Leads to ping-pong effect.
39False Sharing Example (1 of 3)
- for( i=0; i<n; i++ )
- a[i] = b[i];
- Let's assume we parallelize the code:
- p = 2
- element of a takes 4 words
- cache line has 32 words
40False Sharing Example (2 of 3)
[Figure: one cache line holding a[0] through a[7]; each element is marked as written by processor 0 or written by processor 1.]
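The layout can be checked with a little arithmetic: 32 words per line divided by 4 words per element gives 8 elements per line. The sketch below maps each element to its cache line and to the processor that writes it. It assumes the array starts on a line boundary and that iterations are dealt out cyclically (even i to processor 0, odd i to processor 1); the block distribution is added only as an assumed contrast, and all function names are hypothetical.

```python
# Which processor writes which element, and which cache line holds it?
# Sizes come from the slide; the cyclic and block distributions are assumptions.

WORDS_PER_ELEMENT = 4
WORDS_PER_LINE = 32
ELEMENTS_PER_LINE = WORDS_PER_LINE // WORDS_PER_ELEMENT   # 8


def writers_per_line(n, p, owner):
    """Map each cache line index to the set of processors that write into it."""
    lines = {}
    for i in range(n):
        lines.setdefault(i // ELEMENTS_PER_LINE, set()).add(owner(i, n, p))
    return lines


def cyclic(i, n, p):
    return i % p                 # a[0] -> P0, a[1] -> P1, a[2] -> P0, ...


def block(i, n, p):
    return i // (n // p)         # first half -> P0, second half -> P1 (n divisible by p)


n, p = 16, 2
print(writers_per_line(n, p, cyclic))  # {0: {0, 1}, 1: {0, 1}}
print(writers_per_line(n, p, block))   # {0: {0}, 1: {1}}
```

With the cyclic assignment every line has two writers, which is the false-sharing situation; with the block assignment and these sizes, each line has a single writer.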
41False Sharing Example (3 of 3)
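[Figure: P0 writes a[0], a[2], a[4], ... and P1 writes a[1], a[3], a[5], ...; invalidations (inv) and line data travel back and forth between the two caches.]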
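The ping-pong effect can be quantified with a very small model. This sketch tracks only which processor currently owns the line in Modified state and counts how often a write has to take the line away from the other cache. It assumes the worst case in which the two processors' writes alternate strictly; real timing varies. The function name `count_transfers` is hypothetical.

```python
# Count ownership transfers of ONE cache line when writers alternate.
# Simplified model: only the current owner (M state) of the line is tracked.

def count_transfers(write_order):
    """Return how many writes must first take the line away from another cache."""
    owner = None          # processor holding the line in Modified state, if any
    transfers = 0
    for cpu in write_order:
        if owner is not None and owner != cpu:
            transfers += 1        # BuRdX: old owner flushes and invalidates
        owner = cpu
    return transfers


# a[0]..a[7] share one line; P0 writes even elements, P1 writes odd ones.
interleaved = [i % 2 for i in range(8)]      # P0, P1, P0, P1, ...
one_at_a_time = [0, 0, 0, 0, 1, 1, 1, 1]     # P0 finishes before P1 starts

print(count_transfers(interleaved))     # 7
print(count_transfers(one_at_a_time))   # 1
```

The same eight writes cost seven line transfers when interleaved and one when each processor's writes are grouped, even though no element is ever shared.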
42Summary
- Sequential consistency.
- Bus-based coherence protocols.
- False sharing.