ECE 1747: Parallel Programming

Provided by: CITI

Learn more at: https://www.eecg.toronto.edu

1
ECE 1747 Parallel Programming

Basics of Parallel Architectures
Shared-Memory Machines

2
Two Parallel Architectures

Shared memory machines.
Distributed memory machines.

3
Shared Memory Logical View

Figure: processors proc1, proc2, proc3, ..., procN all attached to a single shared memory space.

4
Shared Memory Machines

Small number of processors: shared memory with coherent caches (SMP).
Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

5
SMPs

2- or 4-processor PCs are now commodity.
Good price/performance ratio.
Memory is sometimes a bottleneck (see later).
Typical price (8-node): 20-40k.

6
Physical Implementation

Figure: processors proc1, proc2, proc3, ..., procN, each with its own cache (cache1 ... cacheN); the caches are connected by a bus to the shared memory.

7
Shared Memory Machines

Small number of processors: shared memory with coherent caches (SMP).
Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

8
CC-NUMA Physical Implementation

Figure: processors proc1, proc2, proc3, ..., procN, each with its own cache (cache1 ... cacheN) and its own memory (mem1 ... memN), joined by an interconnect.

9
Caches in Multiprocessors

Suffer from the coherence problem:
  same line appears in two or more caches
  one processor writes a word in the line
  other processors can now read stale data
Leads to the need for a coherence protocol:
  avoids coherence problems
Many exist, will just look at a simple one.
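
The stale-read scenario above can be reproduced with a toy model. This is a minimal sketch, assuming write-back caches with no coherence protocol at all; the `Cache` class and the address name `"x"` are hypothetical and only serve the illustration.

```python
class Cache:
    """Toy write-back cache with NO coherence protocol (deliberately broken)."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (a dict)
        self.lines = {}        # this cache's private copies

    def read(self, addr):
        # On a miss, fetch the line from memory; on a hit, use the local copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # The write stays in the local copy; nobody else is told about it.
        self.lines[addr] = value


memory = {"x": 0}
c0, c1 = Cache(memory), Cache(memory)

c0.read("x")            # the same line now appears in two caches
c1.read("x")
c0.write("x", 1)        # one processor writes a word in the line
stale = c1.read("x")    # the other processor still sees the old value
```

Here `stale` holds the old value because nothing invalidated the copy in `c1`; a coherence protocol exists to rule this out.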

10
What is coherence?

What does it mean to be shared?
Intuitively, read last value written.
Notion is not well-defined in a system without a
global clock.

11
The Notion of "last written" in a Multi-processor System

Figure: processors P0 to P3, with reads r(x) and writes w(x) to the same location x issued on different processors.

12
The Notion of "last written" in a Single-machine System

Figure: a single timeline containing the operations w(x), w(x), r(x), r(x).

13
Coherence: a Clean Definition

Is achieved by referring back to the single
machine case.
Called sequential consistency.

14
Sequential Consistency (SC)

Memory is sequentially consistent if and only if
it behaves as if the processors were executing
in a time-shared fashion on a single machine.

15
Returning to our Example

Figure: processors P0 to P3, with reads r(x) and writes w(x) to the same location x issued on different processors.

16
Another Way of Defining SC

All memory references of a single process execute
in program order.
All writes are globally ordered.
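
One way to make the definition concrete is to enumerate every interleaving that keeps each process in program order and run it against a single memory. The sketch below does that for a hypothetical two-process program (P0: w(x,1); r(y) and P1: w(y,1); r(x)); this layout is an assumption chosen for illustration and is not necessarily the one used in the examples on the following slides.

```python
from itertools import combinations


def interleavings(p0, p1):
    """Yield every merge of p0 and p1 that keeps each list in program order."""
    n = len(p0) + len(p1)
    for slots in combinations(range(n), len(p0)):
        chosen = set(slots)            # positions taken by p0's operations
        it0, it1 = iter(p0), iter(p1)
        yield [next(it0) if i in chosen else next(it1) for i in range(n)]


def run(schedule):
    """Execute one interleaving on a single memory; return the values read."""
    mem = {"x": 0, "y": 0}             # initial values are 0
    reads = {}
    for op, var, arg in schedule:
        if op == "w":
            mem[var] = arg             # arg is the value written
        else:
            reads[arg] = mem[var]      # arg names the result register
    return reads


# Hypothetical program (assumed layout, see the text above).
P0 = [("w", "x", 1), ("r", "y", "r0")]
P1 = [("w", "y", 1), ("r", "x", "r1")]

# Every (r0, r1) pair that some sequentially consistent execution can produce.
outcomes = {(r["r0"], r["r1"]) for r in map(run, interleavings(P0, P1))}
```

Under these assumptions the pair (0, 0) never appears in `outcomes`: for both reads to return 0, each read would have to precede the other process's write, which contradicts program order.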

17
SC Example 1
Initial values of x,y are 0.
Operations shown: w(x,1), w(y,1), r(x), r(y)
What are possible final values?
18
SC Example 2

Operations shown: w(x,1), w(y,1), r(y), r(x)

19
SC Example 3

Operations shown: w(x,1), w(y,1), r(y), r(x)

20
SC Example 4

Operations shown: r(x), w(x,1), w(x,2), r(x)

21
Implementation

Many ways of implementing SC.
In fact, implementations sometimes enforce stronger conditions.
Will look at a simple one: the MSI protocol.

22
Physical Implementation

Figure: processors proc1, proc2, proc3, ..., procN, each with its own cache (cache1 ... cacheN); the caches are connected by a bus to the shared memory.

23
Fundamental Assumption

The bus is a reliable, ordered broadcast bus.
  Every message sent by a processor is received by all other processors in the same order.
Also called a snooping bus.
  Processors (or caches) snoop on the bus.

24
States of a Cache Line

Invalid
Shared: read-only, one of many cached copies
Modified: read-write, sole valid copy

25
Processor Transactions

processor read(x)
processor write(x)

26
Bus Transactions

bus read(x): asks for a copy with no intent to modify
bus read-exclusive(x): asks for a copy with intent to modify

27
State Diagram Step 0

Figure: the three states I, S, M, with no transitions yet.

28
State Diagram Step 1

Figure: states I, S, M. Transitions so far: PrRd/BuRd.

29
State Diagram Step 2

Figure: states I, S, M. Transitions so far: PrRd/BuRd, PrRd/-.

30
State Diagram Step 3

Figure: states I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX.

31
State Diagram Step 4

Figure: states I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (two arcs).

32
State Diagram Step 5

Figure: states I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (two arcs), PrWr/-.

33
State Diagram Step 6

Figure: states I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (two arcs), PrWr/-, BuRd/Flush.

34
State Diagram Step 7

Figure: states I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (two arcs), PrWr/-, BuRd/Flush, BuRd/-.

35
State Diagram Step 8

Figure: states I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (two arcs), PrWr/-, BuRd/Flush, BuRd/-, BuRdX/-.

36
State Diagram Step 9

Figure: states I, S, M. Complete set of transitions: PrRd/BuRd, PrRd/-, PrWr/BuRdX (two arcs), PrWr/-, BuRd/Flush, BuRd/-, BuRdX/-, BuRdX/Flush.

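
The transition labels in the state diagram can be read as a lookup table from (state, event) to (next state, action). The sketch below uses the textbook MSI transitions; the source and target state of each arc are an assumption here, since the arrows did not survive in the transcript, and the read hit in state M and the ignored bus events in state I are added for completeness. The `access` helper is hypothetical and models one cache line on an ordered broadcast bus.

```python
# (current state, event) -> (next state, action); "-" means no action.
MSI = {
    # Processor-side events
    ("I", "PrRd"):  ("S", "BuRd"),
    ("I", "PrWr"):  ("M", "BuRdX"),
    ("S", "PrRd"):  ("S", "-"),
    ("S", "PrWr"):  ("M", "BuRdX"),
    ("M", "PrRd"):  ("M", "-"),      # assumed: read hit in M, not in the labels
    ("M", "PrWr"):  ("M", "-"),
    # Bus-side (snooped) events
    ("I", "BuRd"):  ("I", "-"),      # assumed: an invalid line ignores the bus
    ("I", "BuRdX"): ("I", "-"),
    ("S", "BuRd"):  ("S", "-"),
    ("S", "BuRdX"): ("I", "-"),
    ("M", "BuRd"):  ("S", "Flush"),
    ("M", "BuRdX"): ("I", "Flush"),
}


def access(states, cpu, op):
    """Apply processor event `op` on cache `cpu`; return the bus traffic.

    `states` is a list with one MSI state per cache, all for the same line.
    """
    next_state, bus = MSI[(states[cpu], op)]
    traffic = []
    if bus != "-":
        traffic.append(bus)
        # Every other cache snoops the broadcast and reacts to it.
        for other in range(len(states)):
            if other != cpu:
                states[other], reply = MSI[(states[other], bus)]
                if reply != "-":
                    traffic.append(reply)
    states[cpu] = next_state
    return traffic
```

For example, starting from `states = ["I", "I"]`, a read on cache 0 followed by a write on cache 1 leaves cache 0 in I and cache 1 in M, because the snooped BuRdX invalidates the shared copy.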
37
In Reality

Most machines use a slightly more complicated
protocol (4 states instead of 3).
See architecture books (MESI protocol).

38
Problem: False Sharing

Occurs when two or more processors access
different data in same cache line, and at least
one of them writes.
Leads to ping-pong effect.

39
False Sharing Example (1 of 3)

for( i=0; i<n; i++ )
  a[i] = b[i];
Let's assume we parallelize the code:
  p = 2
  each element of a takes 4 words
  a cache line has 32 words

40
False Sharing Example (2 of 3)

Figure: one cache line holding a0, a1, a2, a3, a4, a5, a6, a7; some elements are marked "Written by processor 0", the others "Written by processor 1".

41
False Sharing Example (3 of 3)

Figure: P0 writes a0, a2, a4, ...; P1 writes a1, a3, a5, ...; "inv" and "data" messages travel back and forth between the two caches.

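
The example's numbers (4 words per element, 32 words per line, so 8 elements per line) can be checked with a small calculation. This is a rough sketch, assuming the array starts on a cache-line boundary; the function name and the `cyclic` flag are hypothetical. It lists the cache lines that more than one processor writes, for a cyclic assignment of iterations (as in the example) and for a block assignment.

```python
def lines_with_false_sharing(n, p, elem_words=4, line_words=32, cyclic=True):
    """Return the indices of cache lines written by more than one processor.

    Assumes a[0] is aligned to the start of a cache line.
    """
    elems_per_line = line_words // elem_words   # 8 with the example's numbers
    block = -(-n // p)                          # ceil(n / p) iterations each
    writers = {}                                # line index -> set of writers
    for i in range(n):
        owner = i % p if cyclic else i // block
        writers.setdefault(i // elems_per_line, set()).add(owner)
    return sorted(line for line, who in writers.items() if len(who) > 1)


cyclic_lines = lines_with_false_sharing(16, 2, cyclic=True)
block_lines = lines_with_false_sharing(16, 2, cyclic=False)
```

With n = 16 and p = 2, the cyclic assignment makes both lines shared, while the block assignment gives each processor its own line. A block boundary that falls inside a line (for example n = 12) still leaves that one line shared.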
42
Summary