ECE 1747: Parallel Programming

Slides: 43
Provided by: CITI

1
ECE 1747: Parallel Programming
  • Basics of Parallel Architectures
  • Shared-Memory Machines

2
Two Parallel Architectures
  • Shared memory machines.
  • Distributed memory machines.

3
Shared Memory Logical View
[Figure: processors proc1, proc2, proc3, ..., procN all attached to a single shared memory space.]
4
Shared Memory Machines
  • Small number of processors: shared memory with
    coherent caches (SMP).
  • Larger number of processors: distributed shared
    memory with coherent caches (CC-NUMA).

5
SMPs
  • 2- or 4-processor PCs are now commodity.
  • Good price/performance ratio.
  • Memory is sometimes a bottleneck (see later).
  • Typical price (8-node): 20-40k.

6
Physical Implementation
[Figure: shared memory attached to a bus; caches cache1, cache2, cache3, ..., cacheN sit between the bus and processors proc1, proc2, proc3, ..., procN.]
7
Shared Memory Machines
  • Small number of processors: shared memory with
    coherent caches (SMP).
  • Larger number of processors: distributed shared
    memory with coherent caches (CC-NUMA).

8
CC-NUMA Physical Implementation
[Figure: processors proc1 ... procN with caches cache1 ... cacheN and memories mem1 ... memN, joined by an interconnect.]
9
Caches in Multiprocessors
  • Suffer from the coherence problem:
      • same line appears in two or more caches
      • one processor writes a word in the line
      • other processors can now read stale data
  • Leads to the need for a coherence protocol:
      • avoids coherence problems
  • Many exist; we will just look at a simple one.
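
The stale-read scenario in the bullets above can be made concrete with a toy model. This is only a sketch under stated assumptions: the struct `toy_system` and the functions `toy_read`, `toy_write`, and `stale_read_demo` are hypothetical names invented here, and the model holds a single word with no coherence protocol at all.

```c
#include <stdbool.h>

/* Hypothetical toy model: one memory word x and one private cached copy
   per processor, with NO coherence protocol between the caches. */
#define NPROC 2

struct toy_system {
    int  memory;         /* the word x in shared memory */
    int  cached[NPROC];  /* each processor's cached copy of x */
    bool valid[NPROC];   /* does this cache currently hold x? */
};

/* Read x: hit in the local cache if present, otherwise fetch from memory. */
int toy_read(struct toy_system *s, int p) {
    if (!s->valid[p]) {
        s->cached[p] = s->memory;
        s->valid[p] = true;
    }
    return s->cached[p];
}

/* Write x: update the local copy and memory, but notify no other cache. */
void toy_write(struct toy_system *s, int p, int value) {
    s->cached[p] = value;
    s->valid[p] = true;
    s->memory = value;
}

/* Returns what P1 reads after P0's write. */
int stale_read_demo(void) {
    struct toy_system s = { .memory = 0, .valid = { false, false } };
    toy_read(&s, 0);         /* line is now in cache 0 */
    toy_read(&s, 1);         /* same line is now in cache 1 */
    toy_write(&s, 0, 1);     /* P0 writes the word */
    return toy_read(&s, 1);  /* P1 hits its old copy */
}
```

In this model `stale_read_demo()` returns the old value 0 even though memory already holds 1, which is exactly the situation a coherence protocol is meant to rule out.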

10
What is coherence?
  • What does it mean to be shared?
  • Intuitively, read last value written.
  • Notion is not well-defined in a system without a
    global clock.

11
The Notion of "last written" in a Multi-processor System
[Figure: timelines for processors P0, P1, P2, P3 carrying two writes w(x) and two reads r(x) of the same location.]
12
The Notion of "last written" in a Single-machine System
[Figure: a single timeline carrying the operations w(x), w(x), r(x), r(x).]
13
Coherence: a Clean Definition
  • Is achieved by referring back to the single
    machine case.
  • Called sequential consistency.

14
Sequential Consistency (SC)
  • Memory is sequentially consistent if and only if
    it behaves as if the processors were executing
    in a time-shared fashion on a single machine.

15
Returning to our Example
[Figure: timelines for processors P0, P1, P2, P3 carrying two writes w(x) and two reads r(x) of the same location.]
16
Another Way of Defining SC
  • All memory references of a single process execute
    in program order.
  • All writes are globally ordered.
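
One way to make the definition tangible is to enumerate every interleaving that respects program order and record which results can occur. The sketch below assumes a hypothetical two-process test that is not taken from the slides: P0 runs w(x,1) then r(y), P1 runs w(y,1) then r(x), with x and y initially 0. The names `sc_outcomes` and `count_bits` are made up for this illustration.

```c
#include <stdbool.h>

/* Number of set bits in m. */
static int count_bits(int m) {
    int c = 0;
    while (m) { c += m & 1; m >>= 1; }
    return c;
}

/* Hypothetical test (an assumption, not from the slides):
     P0: x = 1; r0 = y;      P1: y = 1; r1 = x;     initially x = y = 0.
   After the call, seen[r0][r1] is true iff some interleaving that keeps
   each process's program order produces that pair of read values. */
void sc_outcomes(bool seen[2][2]) {
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            seen[a][b] = false;

    /* Bit k of mask says which process takes global step k (1 = P1).
       Each process has two operations, so exactly two bits are set. */
    for (int mask = 0; mask < 16; mask++) {
        if (count_bits(mask) != 2) continue;
        int x = 0, y = 0, r0 = 0, r1 = 0;
        int pc0 = 0, pc1 = 0;            /* per-process program counters */
        for (int step = 0; step < 4; step++) {
            if ((mask >> step) & 1) {    /* P1's next operation */
                if (pc1 == 0) y = 1; else r1 = x;
                pc1++;
            } else {                     /* P0's next operation */
                if (pc0 == 0) x = 1; else r0 = y;
                pc0++;
            }
        }
        seen[r0][r1] = true;
    }
}
```

For this assumed test, the enumeration finds (r0, r1) = (0,1), (1,0), and (1,1), but never (0,0): both reads returning 0 would need each read to precede the other process's write, which contradicts program order.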

17
SC Example 1
Initial values of x, y are 0.
[Figure: process timelines carrying the operations w(x,1), w(y,1), r(x), r(y).]
What are the possible final values?
18
SC Example 2
[Figure: process timelines carrying the operations w(x,1), w(y,1), r(y), r(x).]
19
SC Example 3
[Figure: process timelines carrying the operations w(x,1), w(y,1), r(y), r(x).]
20
SC Example 4
[Figure: process timelines carrying the operations r(x), w(x,1), w(x,2), r(x).]
21
Implementation
  • Many ways of implementing SC.
  • In fact, implementations sometimes enforce stronger conditions.
  • Will look at a simple one: the MSI protocol.

22
Physical Implementation
[Figure: shared memory attached to a bus; caches cache1, cache2, cache3, ..., cacheN sit between the bus and processors proc1, proc2, proc3, ..., procN.]
23
Fundamental Assumption
  • The bus is a reliable, ordered broadcast bus:
      • every message sent by a processor is received by
        all other processors in the same order.
  • Also called a snooping bus:
      • processors (or caches) snoop on the bus.

24
States of a Cache Line
  • Invalid
  • Shared: read-only, one of many cached copies
  • Modified: read-write, sole valid copy

25
Processor Transactions
  • processor read(x)
  • processor write(x)

26
Bus Transactions
  • bus read(x): asks for a copy with no intent to modify
  • bus read-exclusive(x): asks for a copy with intent to modify

27
State Diagram Step 0
[Figure: state diagram with three states, I, S, and M, and no transitions yet.]
28
State Diagram Step 1
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd.]
29
State Diagram Step 2
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd, PrRd/-.]
30
State Diagram Step 3
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX.]
31
State Diagram Step 4
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (on two edges).]
32
State Diagram Step 5
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (on two edges), PrWr/-.]
33
State Diagram Step 6
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (on two edges), PrWr/-, BuRd/Flush.]
34
State Diagram Step 7
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (on two edges), PrWr/-, BuRd/Flush, BuRd/-.]
35
State Diagram Step 8
[Figure: state diagram over I, S, M. Transitions so far: PrRd/BuRd, PrRd/-, PrWr/BuRdX (on two edges), PrWr/-, BuRd/Flush, BuRd/-, BuRdX/-.]
36
State Diagram Step 9
[Figure: complete state diagram over I, S, M. Transitions: PrRd/BuRd, PrRd/-, PrWr/BuRdX (on two edges), PrWr/-, BuRd/Flush, BuRd/-, BuRdX/-, BuRdX/Flush.]
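
The transition labels built up over the preceding steps can be written as two small functions, one for processor requests and one for snooped bus traffic. This is a minimal sketch that assumes the textbook MSI edge assignment (the transcript keeps the labels but not which states each arrow connects), and it treats a processor read in M as a silent hit. The names `line_state`, `proc_op`, `bus_op`, `msi_processor`, and `msi_snoop` are hypothetical.

```c
typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { PR_RD, PR_WR } proc_op;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_op;

/* Processor side: apply a local read or write to one cache line and
   return the bus transaction the cache must issue (BUS_NONE for a hit). */
bus_op msi_processor(line_state *state, proc_op op) {
    switch (*state) {
    case INVALID:
        if (op == PR_RD) { *state = SHARED; return BUS_RD; }  /* PrRd/BuRd  */
        *state = MODIFIED; return BUS_RDX;                    /* PrWr/BuRdX */
    case SHARED:
        if (op == PR_RD) return BUS_NONE;                     /* PrRd/-     */
        *state = MODIFIED; return BUS_RDX;                    /* PrWr/BuRdX */
    case MODIFIED:
        return BUS_NONE;                                      /* PrWr/- and read hit */
    }
    return BUS_NONE;
}

/* Snoop side: react to a transaction issued by another cache.
   Returns 1 if this cache must flush its modified copy, else 0. */
int msi_snoop(line_state *state, bus_op seen) {
    int flush = (*state == MODIFIED && seen != BUS_NONE);     /* */
    if (seen == BUS_RDX)
        *state = INVALID;                  /* BuRdX/- from S, BuRdX/Flush from M */
    else if (seen == BUS_RD && *state == MODIFIED)
        *state = SHARED;                   /* BuRd/Flush; BuRd/- leaves S as is  */
    return flush;
}
```

A usage pattern would be: the requesting cache calls `msi_processor`, and if the result is not `BUS_NONE`, every other cache calls `msi_snoop` with that transaction, relying on the ordered broadcast bus to deliver it to all of them in the same order.
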
37
In Reality
  • Most machines use a slightly more complicated
    protocol (4 states instead of 3).
  • See architecture books (MESI protocol).

38
Problem: False Sharing
  • Occurs when two or more processors access
    different data in same cache line, and at least
    one of them writes.
  • Leads to ping-pong effect.

39
False Sharing Example (1 of 3)
  • for( i=0; i<n; i++ )
  •     a[i] = b[i];
  • Let's assume we parallelize the code:
      • p = 2
      • each element of a takes 4 words
      • a cache line has 32 words
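
As a sketch of what the parallelized loop might look like, the function below gives each processor every p-th element. The cyclic split is an assumption chosen because it makes neighbouring elements belong to different processors; `elem_t` and `copy_cyclic` are hypothetical names, and thread creation is left out so that only the work division is shown.

```c
#include <stddef.h>

/* One array element occupying 4 words, as in the example
   (assuming a word is one int). */
typedef struct { int w[4]; } elem_t;

/* Share of the loop executed by processor `me` out of `p` processors:
   elements me, me+p, me+2p, ...  (cyclic distribution, an assumption). */
void copy_cyclic(elem_t *a, const elem_t *b, size_t n, size_t me, size_t p) {
    for (size_t i = me; i < n; i += p)
        a[i] = b[i];
}
```

With p = 2, processor 0 would call `copy_cyclic(a, b, n, 0, 2)` and processor 1 would call `copy_cyclic(a, b, n, 1, 2)`. Since 32 / 4 = 8 elements fit in one cache line, both processors end up writing into the same lines.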

40
False Sharing Example (2 of 3)
[Figure: one cache line holding a0, a1, a2, a3, a4, a5, a6, a7; elements are marked as written by processor 0 or written by processor 1.]
41
False Sharing Example (3 of 3)
[Figure: P0 writes a0, a2, a4, ...; P1 writes a1, a3, a5; invalidation (inv) and data messages travel between P0 and P1.]
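
The ping-pong effect can be estimated with a small counting model. This sketch rests on several assumptions that are not in the slides: writes are issued in index order, the array starts on a cache-line boundary, and a line changes hands whenever two consecutive writes to it come from different processors. The block distribution is included only for comparison. `line_of` and `count_transfers` are hypothetical names.

```c
#define WORDS_PER_ELEM 4    /* from the example: element of a takes 4 words */
#define WORDS_PER_LINE 32   /* from the example: cache line has 32 words    */

/* Index of the cache line holding a[i], assuming a[0] is line-aligned. */
static int line_of(int i) {
    return (i * WORDS_PER_ELEM) / WORDS_PER_LINE;
}

/* Count how often a cache line changes writer when a[0..n-1] are written
   in index order by p processors.
   cyclic != 0: a[i] is written by processor i % p.
   cyclic == 0: a[i] is written by processor i / ceil(n/p) (block split). */
int count_transfers(int n, int p, int cyclic) {
    int transfers = 0;
    int last_line = -1, last_writer = -1;
    int chunk = (n + p - 1) / p;
    for (int i = 0; i < n; i++) {
        int writer = cyclic ? i % p : i / chunk;
        int line = line_of(i);
        if (line == last_line && writer != last_writer)
            transfers++;            /* previous writer's copy is invalidated */
        last_line = line;
        last_writer = writer;
    }
    return transfers;
}
```

Under these assumptions, `count_transfers(16, 2, 1)` gives 14 for the cyclic split, while `count_transfers(16, 2, 0)` gives 0 for the block split, because each processor's 8 elements then fill exactly one line.
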
42
Summary
  • Sequential consistency.
  • Bus-based coherence protocols.
  • False sharing.