Title: Shared Memory MIMD Architectures (Sima, Fountain and Kacsuk, Chapter 18)
2. Design choices
- Types of shared memory
  - Physically shared memory
  - Virtual (or distributed) shared memory
- Scalability issues
- Organisation of memory
- Design of interconnection network
- Cache coherence protocols
3. Design space of shared memory computers
- Single address space memory access
  - Physical shared memory: UMA
  - Virtual shared memory: NUMA, CC-NUMA, COMA
- Interconnection scheme
  - Shared path: single bus based; multiple bus based (bus multiplication, grid of buses, hierarchical system)
  - Switching network: crossbar; multistage network (Omega, Banyan, Benes)
- Cache coherency: hardware based, software based
4. Classification of dynamic interconnection networks
- Enable temporary connection of any two components of a multiprocessor
5. Buses
- Very limited scalability
  - Typically 3-5 processors unless special techniques (e.g. TDM) are used
- Can be expanded significantly by
  - Using private memory
  - Using coherent cache memory
  - Using multiple buses
6. Structure of a single bus multiprocessor (no caches)
[Figure: processors P1..Pk, memories M1..Mn and I/O units share address, data, control and interrupt lines plus bus exchange lines, governed by a bus arbiter and control logic.]
7. Locking or multiplexing the bus
- Two main approaches
- Locking and holding
  - Acquire the bus
  - Send out address and/or data
  - Wait for data (read), or wait for the write to complete
  - Release the bus
- Multiplexing
  - Acquire a bus time slot
  - Send address and/or data
  - Come back for the data n cycles later (read), or keep going (write)
8. Memory write on locked bus
[Timing diagram: processors P1-P4 each hold the bus for a full memory cycle, so the four writes complete at times 4, 8, 12 and 16.]
9. Memory write on multiplexed buses
- Note: this assumes different memory banks
[Timing diagram: the bus cycles of P1-P4 are interleaved with the memory cycles, so the four writes complete by time 7.]
10. Memory read on locked bus
[Timing diagram for P1-P3: each read holds the bus through three phases. Phase 1: the address bus is used. Phase 2: the bus is idle while memory is accessed. Phase 3: the data bus is used.]
11. Memory read on multiplexed bus
[Timing diagram for P1-P3, with the same three read phases; the idle phase 2 of one read overlaps with the bus phases of the others, shortening the total time.]
12. Memory read on split-transaction bus
- The next transfer is started before the last one has completed!
- Needs special associative hardware
[Timing diagram for P1-P3, with the three read phases of successive transactions overlapped on the bus.]
13. Arbiter logic
- Because the bus is a shared resource, masters must arbitrate for access
- The arbiter may be
- Centralised
  - A central unit examines all requests
- Decentralised
  - Logic is split amongst the bus masters
  - Scalable: each new master adds more logic
14. Design space of arbiter logic
- Organization: centralized or distributed
- Bus allocation policy: fixed priority, rotating, round robin, least recently used, first come first served
- Handling of requests: fixed priority or rotating
- Handling of grants: fixed priority or rotating
15-19. Centralized arbitration with independent requests and grants
[Figure, shown in five animation steps: masters 1..N each have a private request line (R1..RN) to and grant line (G1..GN) from a central bus arbiter, plus a shared bus-busy line. Steps: masters request the bus; one is granted; the successful master claims the bus; the bus is released.]
20-25. Daisy-chained bus arbitration scheme
[Figure, shown in six animation steps: a shared bus-request and bus-busy line feed a central bus arbiter; the grant signal propagates from master 1 towards master N along a daisy chain (Grant 1, G2, ..., GN). Steps: masters request the bus; the bus grant is generated; the grant is not propagated past the first requesting master; that master claims the bus; the bus is released.]
26. Decentralized rotating arbiter with independent requests and grants
- Problems with the previous design
  - Lack of fairness
  - Waiting whilst the grant signal propagates
- Rotating priority solves the lack of fairness
  - The logical first master is not the same as the physical first
[Figure: each master 1..N has its own arbiter; request (R), grant (G) and priority (P) lines link the arbiters in a ring, with a shared bus-busy line.]
27. Multiple buses
- Increase bandwidth by adding additional resources
- The bus is the limiting factor
28. 1-dimensional multiple bus multiprocessor
- Each processor is connected to all buses
- Each memory is connected to all buses
- A processor chooses a bus dynamically
- Load can be spread across the buses
[Figure: buses B1..Bb connect processors P1..Pn to memories M1..Mm.]
29. 2- and 3-dimensional bus systems
[Figure: grids of buses connecting processor (P) and memory (M) nodes.]
30. 2-dimensional bus design
- Can support specialised access patterns, e.g. a climate model
  - Access to local data
  - Access to data at the same latitude
  - Access to data at the same longitude
31. Figure 18.11: Structure of a b-of-m arbiter
[Figure: a state register (s1, e1 .. sm, em) feeds arbiters 1..m; each arbiter i takes request Ri and produces grant Gi and bus-allocation signal BAi via control logic C1..Cm.]
32. Cluster bus architecture
- Hierarchy of buses
- Arbitrarily large networks
- Cache coherence becomes very difficult
[Figure: a global bus (Nanobus) connects clusters 1..8 through uniform interconnection cards; each cluster is a Multimax with its own cluster bus (Nanobus), a uniform cluster cache and uniform interconnection cards.]
33. Switching networks

34-36. View of a crossbar network
- The crossbar allows any processor to connect with any memory
- As long as there is no contention for a memory, the network is non-blocking
[Figure: an n x n grid of switches (S) connecting processors P1..Pn to memories M1..Mn.]
37. Detailed structure of a crossbar network
[Figure: each crosspoint switch carries control, address and data bus paths between a processor's bus-bus connection unit (BBCU) and memory Mi, with an arbiter resolving contention.]
38. Multistage interconnection networks
- Cannot directly connect every processor to every memory
- Use crossbar switches as components to build a larger network
- The minimum number of stages is logarithmic
- Single path between any pair of ports
  - No fault tolerance
  - Blocking (if an intermediate switch is in use)
39. Omega network topology
- Built from 2 x 2 crossbar switch components
- The 8 x 8 network is built as a butterfly of these switches
- Unique path from one port to another
- Log depth
- Switch settings include straight through, upper broadcast and lower broadcast
[Figure: 8 x 8 omega network connecting inputs 000..111 to outputs 000..111 in three stages.]
40. Omega network topology
- Some configurations are non-blocking
- e.g. the reversal permutation: 0->7, 1->6, 2->5, 3->4, 4->3, 5->2, 6->1, 7->0
[Figure: switch settings realising the reversal permutation without conflicts.]
41. Broadcast in the omega network
[Figure: one input reaches all outputs 000..111 using the broadcast switch settings at each stage.]
42. Blocking in an omega network
[Figure: the requests (0->5, ..., 6->4, ...) conflict at an intermediate switch, so one of them is blocked.]
43. Multistage network properties
44. Hot-spot saturation in a blocking omega network
[Figure: P2->M4 is active; this blocks P7->M4, which in turn blocks P1->M5, which in turn blocks P5->M7.]
45. Hotspots in omega networks
- In a shared memory machine there are two sources of contention
  - Memory units
  - Switch elements
- Certain access patterns can repeatedly block each other even though they address different memory units
- Message combining can solve these problems
  - The switch element buffers the requests (e.g. several "Read 100" requests are merged into one)
  - The memory only sees one request
46. Structure of a combining switch
- Introduced on the NYU Ultracomputer
[Figure: each direction of the switch (Proc(i) to Mem(k), Proc(j) to Mem(l)) has a combining queue, a non-combining queue and a wait buffer for holding combined requests until the reply returns.]
47-50. Cache coherence
- Cache coherence problems
  - Sharing of writable data
  - Process migration
  - I/O activity
[Figures, shown over four slides: processors with private caches over shared memory; a write by one processor (Write 100,100), a migrated process, or an I/O read (Read 100) can each leave a cache holding a stale copy.]
51. Classification of data structures
- Read only
  - Never causes cache coherence problems
- Shared writable
  - The main source of cache coherence problems
- Private writable data
  - Causes problems with process migration
- Solutions
  - Hardware based protocols
  - Software based protocols
52. Design space of hardware-based cache coherence protocols

53. Design space of hardware-based cache coherence protocols (cont.)
54. Write-through memory update policy
- Memory is always updated on a write
- Intuitively easier to keep caches coherent
[Figure: Pi stores D1; both its cache and memory now hold D1, while Pj's cache still holds the old value D.]
55. Write-back memory update policy
- Data is only written back to memory when the block is flushed
- The processor can do many writes before the flush
[Figure: Pi stores D1 in its cache; memory still holds the old value D until the flush.]
56. Write-update cache coherence policy
- When a processor writes a variable, it updates all copies in the other processors' caches
[Figure: Pi stores D1 and broadcasts Update(D1); the copies in Pj and Pk become D1.]
57. Write-invalidate cache coherence policy
- When a processor writes a variable, it invalidates the copies in all other caches
- This makes one processor the owner
[Figure: Pi stores D1 and broadcasts Invalidate(addr(D)); the copies in Pj and Pk become invalid data.]
58. Snoopy protocols
- If the interconnection network supports broadcasting (cheaply) then a snoopy policy is effective
  - Every cache watches every transaction to memory
  - Works for buses
- If broadcast is not efficient
  - Use a directory based scheme
  - Keeps track of where cache blocks are located
59. Snoopy write-update protocol
- Possible cache block states (used to support the cache coherence protocol)
- Valid-exclusive
  - The only copy of this cache block; cache and memory are consistent
- Shared
  - Several copies of this cache block exist
- Dirty
  - The only copy, but cache and memory are inconsistent
60. Read miss logic
- The snoopy cache controller broadcasts a Read-Blk command on the bus
- If there are shared copies
  - The block is delivered by a cache holding a copy
- If there is a dirty copy
  - It is supplied and flushed to main memory
  - All copies become shared
- If a valid-exclusive copy exists
  - The copy is supplied and all copies become shared
- If no cache holds a copy
  - Memory supplies the data
  - The block becomes valid-exclusive
61. Snoopy update: read miss
[Figure: Pi loads D, broadcasting Read-blk(addr(D)); Pj's copy is supplied and both copies end up shared.]
62. Write hit logic
- If the block is valid-exclusive or dirty
  - The write is performed locally
  - The new state is dirty
- If the block is shared
  - Broadcast an update-block command on the bus
  - All copies (including memory) are updated
  - The state remains shared
63. Snoopy update: write hit (exclusive)
[Figure: Pi writes D1 locally to its valid-exclusive copy; the block becomes dirty while memory still holds D.]
64. Snoopy update: write hit (shared)
[Figure: Pi writes D1 to a shared block and broadcasts Write(addr(D)); the copies in the other caches and in memory are updated, and all copies stay shared.]
65. Write miss
- If only memory contains a copy
  - Memory is updated
  - The requesting cache is loaded with the data, state valid-exclusive
- If shared copies are available
  - All copies (including the memory one) are updated
  - The requesting cache is loaded with the data, state shared
- If a dirty or valid-exclusive copy exists
  - The other blocks are updated
  - Memory is updated
  - The requesting cache is loaded with the data, state shared
66. Snoopy update: write miss
[Figure: Pi writes D1, broadcasting Write(addr(D)); memory and any other copies are updated.]
67. State transition graph for snoopy update
- The cache responds to
  - P-Read and P-Write from the processor
  - Read-Blk, Write-Blk and Update-Blk from the bus
[Figure: transitions between Valid-exclusive, Shared and Dirty; e.g. a bus Read-Blk moves Valid-exclusive or Dirty to Shared, a local P-Write moves Valid-exclusive to Dirty, and events on Shared leave it Shared.]
68. Structure of the snoopy cache controller
- The snoopy controller needs to operate at bus speed
[Figure: in each PE the cache directory is shared by two controllers: the cache controller serving the processor and the snoopy controller watching the bus through an interface.]
69. Directory schemes
- Directory schemes only send consistency commands to those caches holding a valid copy of the shared block
- Designed for systems where snooping is not possible
- Three main approaches
- Full map directory
  - Each entry points to all caches
  - The entry indicates whether the block is present in remote caches
  - Not efficient for large systems
- Limited directory
  - Entries only point to a subset of the caches
  - Works because programs tend not to share a variable with all processors
  - Otherwise the same information as in the full map
- Chained directory
  - Directory entries form a linked list
  - Scalable: processors can be added without increasing the directory width
70-71. Chained directory scheme
[Figures: the directory entry for block X in shared memory points to the head of a chain; each cache holding X stores a pointer to the next cache in the chain, with CT marking the chain terminator. A Read X by a new processor prepends its cache to the chain.]
72. Scalable Coherent Interface (SCI)
- A concrete example of a chained directory
- An IEEE standard
- Defines an interface to the interconnection network, not any particular interconnection network
- The interface is point to point
  - Well suited to networks like the Convex Exemplar
  - Simple, unidirectional ring
- Designed for building scalable shared memory machines
73. Structure of sharing-lists in the SCI
- Operations are defined for
  - Creation
  - Insertion
  - Deletion
  - Reduction to a single node
[Figure: the memory entry holds mstate, forw_id and the data (64 bits); each cache line in nodes i, j and k holds forw_id, back_id, cstate, mem_id and the data, forming a doubly linked sharing list.]
74. Insertion in a sharing-list
[Figure: node i sends a prepend request to the current head, node j; the responses make node i the new head of the list.]
75. Messages for deletion
[Figure: the deleting node sends messages (1, 2) to its neighbours so their forward and backward pointers are updated to point past it.]
76. Structure of the sharing-list after deletion
[Figure: node j has been removed; the list now links memory, node i and node k directly.]
77. Hierarchical cache coherence
[Figure: a two-level cache hierarchy (C2x over C1x over C10 caches and processors) beneath main memory; a write to X propagates up through buses B10 and B20, and Invalidate commands travel down the other branches that hold a copy of X.]
78. Software based coherence
- Software approaches rely on compiler assistance
- The compiler identifies different classes of variables
  - Read-only
  - Read-only for any number of processes and read-write for one process
  - Read-write for one process
  - Read-write for any number of processes
- Once identified (by static analysis), each class is handled differently
79. Software based cache coherence
- Read-only variables
  - Can be cached at any time
- Read-only for any number of processes and read-write for one process
  - Can only be cached on the writing processor
- Read-write for one process
  - Cache only on that processor
- Read-write for many processes
  - Cannot be cached at all
- Clearly the analysis needs accurate information in order to limit the performance hit
80. Classification of software-based cache coherence protocols
81. Invalidation
- Can invalidate the entire cache
  - A single hardware mechanism clears all the valid bits
  - Very conservative!
- Selective invalidation
  - Invalidate before critical sections
  - Understand parallel for-loops and invalidate accordingly
  - Still needs hardware support to clear effectively
[Figure: a cache with key, data and valid (v) bits per line.]
82. Using knowledge of critical regions

    Secure_lock()
    Invalidate_cache()
    ... variables in here can be used without worrying about any other processes ...
    Flush_cache()
    Release_lock()
83. Using knowledge of parallel loops
- Knowledge about the loop lets Par For (i = 0; i < 100; i++) be split:
  - Processor 0: Par For (i = 0; i < 50; i++)
  - Processor 1: Par For (i = 50; i < 100; i++)
84. Selective invalidation schemes
- Add a change bit to the cache block status
  - Set the change bit to true on a write
  - On a read of a changed block, invalidate and reload
- Add a timestamp to the cache block
  - A clock is associated with each data structure
  - Update the timestamp in the cache when the block changes
  - The timestamp in the block can be compared with the current timestamp
- Add a version number
  - Similar to the clock scheme
85. Synchronization and event ordering
- Mutual exclusion is required in many parallel algorithms
  - Monitors
  - Semaphores
- All high level schemes are based on low level synchronization tools
- Atomic test-and-set is common in shared memory multiprocessors
- It needs to take account of the cache
  - Minimum traffic generated while waiting
  - Low latency release of a waiting processor
  - Low latency acquisition of a free lock
- These typically work well on small bus based machines
86. Synchronization with test-and-set
- Lock variable states: OPEN, CLOSED
- Acquire lock:
  - char lock;
  - while (exchange(&lock, CLOSED) == CLOSED) ;
- Release lock:
  - lock = OPEN;
87. Cache states after Pi successfully executes test-and-set on lock
[Figure: after Pi's exchange, Ci holds the lock block (closed) in state dirty; the other copies are invalid.]
88. Bus commands when Pj executes test-and-set on lock, and cache states after
[Figure: (1) Cj issues Read-Blk(lock); (2) Ci supplies Block(lock), still closed; (3) Pj's exchange broadcasts Invalidate(lock). Cj's copy becomes dirty and Ci's becomes invalid.]
89. Cache states after Pk executes test-and-set on lock
[Figure: the lock block (closed) migrates again: Ck's copy becomes dirty and the other copies become invalid.]
90. Busy waiting with cache coherence
- An indivisible test-and-set instruction requires write access to the lock
- This causes the processor doing the test-and-set to acquire the variable in its cache, invalidating all other copies
- When multiple processors spin on the lock
  - Each one tries to acquire the variable in its cache
  - This causes cache thrashing
- Instead use a snooping lock
  - Spin on a plain test without the indivisible test-and-set
  - Only attempt the exchange once the lock is OPEN
91. Efficient algorithm for locking
-     while (exchange(&lock, CLOSED) == CLOSED)
-         while (lock == CLOSED) ;
- The first while will claim the lock if it is OPEN, and lock it
- But if it is already CLOSED, control transfers to the second loop
  - This continuously reads the lock from the local cache
  - No bus traffic during this phase
- When the lock becomes OPEN, try the test-and-set again
92. Test and test-and-set
- Even more efficient to test the lock before trying to set it:
-     for (;;) {
-         while (lock == CLOSED) ;
-         if (exchange(&lock, CLOSED) != CLOSED)
-             break;
-     }
- Introduces extra latency for unused locks
93. Lock implementation on scalable multiprocessors
- The NYU Ultracomputer and IBM RP3 implemented fetch-and-add
- Fetch-and-add is an atomic operation
- All memory modules are augmented with an adder circuit
-     int fetch_and_add(int *x, int a)
-     {
-         int temp = *x;
-         *x = *x + a;
-         return temp;
-     }
94. Example of fetch-and-add
- Suppose we want to implement the parallel loop
-     DOALL N = 1 to 1000
-         <loop body using N>
-     ENDDO
- and want to allocate iterations to processors dynamically:
-     N = 0;
-     i = fetch_and_add(&N, 1);
-     while (i < 1000) {
-         loop_body(i);
-         i = fetch_and_add(&N, 1);
-     }
- Regardless of how many processors execute the loop, each processor will get a different value of i
95. Fetch-and-add
- Fetch-and-add automatically allocates loop indexes in this example
- But location N becomes a hotspot
- The combining network described before will not work correctly without modification
  - It returns the same value from every combined read
- Change each switch element so that it can implement the fetch-and-add operation
  - This gives a distributed operation without hotspots
96Forward propagation of fetch-and-add
M0
P0 F A (N,1)
1
M1
P0 F A (N,1)
F A (N,2)
M2
P0 F A (N,1)
1
2
M3
P0 F A (N,1)
F A (N,2)
F A (N,4)
M4
P0 F A (N,1)
1
M5
P0 F A (N,1)
F A (N,2)
F A (N,8) M6 returns N1 N becomes 9
P0 F A (N,1)
F A (N,4)
1
2
4
P0 F A (N,1)
F A (N,2)
97Back propagation of fetch-and-add
M0
P0 1
1
M1
P0 5
11
1
M2
P0 3
5
1
M3
P0 7
51
12
3
1
M4
3
P0 2
M5
P0 6
31
5
7
M6
1
5
P0 4
5
7
P0 8
M7
71
14
52
98. Event ordering in cache coherent systems
- Programmers usually assume sequential consistency
- Consider
  - P1: ... store A ...
  - P2: ... load A, store B ...
  - P3: ... load B, load A
- P3 expects the values of A and B to be consistent
- BUT this may not occur, because the invalidation messages from P1 may reach P2 before they reach P3
- This occurs only because of the caches
99. Figure 18.35: Classification of shared writable variable accesses
100. Figure 18.36 (a): weak consistency
[Diagram: Request(L1) ... load/store operations ... Release(L1), then Request(L2) ... load/store operations ... Release(L2); the synchronisation points strictly order the accesses between them.]
101. Figure 18.36 (b): release consistency
[Diagram: requests and releases on L1, L2 and L3 may overlap; load/store operations are ordered only with respect to their own request/release pair.]
102. Figure 18.36 (c): load and store operations can be executed in any order
103. A quick tour of some UMA machines

104. Some real UMA machines
105. Structure of the Hector machine (NUMA)
[Figure: stations attach to local rings through station controllers; inter-ring interfaces connect the local rings to a global ring. Continued on the next slide.]
106. Structure of the Hector machine (cont.)
[Figure: within a station, a station bus interface connects the station bus to processor modules (processor, cache, memory) and an I/O module whose adaptor serves display, ethernet and disk.]
107. Structure of the Cray T3D system (NUMA)
[Figure: the T3D is attached to a Cray Y-MP host; I/O clusters connect tape drives, workstations, disks and networks.]
108. Design space of CC-NUMA machines
109. Structure of the Wisconsin multicube machine
[Figure: a grid of processor-cache pairs connected by row and column buses, with the memories attached to the column buses. P = processor, C = cache, M = memory.]
110. Mechanism of reading block X in state modified
[Figure: P00 issues Read X; read(X) propagates along the row and column buses to the cache holding X in state modified, which supplies value(X) and issues Mem-write(X, unmodified) so that memory is updated and the state becomes unmodified.]
111. The Stanford Dash interconnection network
[Figure: clusters 11, 12, 13 and 21, 22, 23 connected in a grid.]
112. Structure of a cluster
[Figure: processors Pi with caches Ci share a bus with memory, a directory and intercluster interface, and an I/O interface.]
113. Processor level
[Figure: Load X hits in the processor's own cache Ci. Access time: 1 clock.]
114. Local cluster level
[Figure: Load X misses in the local cache but is supplied by another cache in the same cluster. Access time: 30 clocks.]
115. Home cluster level
[Figure: Load X in the local cluster C1i is sent over the interconnection network to the home cluster C1j, whose memory supplies X. Access time: 100 clocks.]
116. Remote cluster level
[Figure: Load X in the local cluster C1i is forwarded via the directory logic (DL) towards the cluster holding the block. Access time: 135 clocks. Continued on the next slide.]
117. Remote cluster level (cont.)
[Figure: the home cluster C1m's directory forwards a Read-Req to the remote cluster C1k holding X dirty; the remote cache answers with a Read-Rply to the requester and a Sharing-Writeback to the home memory.]
118. Structure of the Dash directory
[Figure: the directory controller (DC) board and reply controller (RC) board sit between the cluster address/control and data buses and the X/Y-dimension request and reply routers; further components include a pseudo-CPU (PCPU), a performance monitor, remote cache status/bus retry logic and arbitration masks.]
119. Sequence of actions in a store operation requiring remote service
[Figure: Pi's Store X issues a Read-Exclusive request through the local directory logic DLi (1); invalidation requests (Inv-Req, 3) go out to sharing clusters, the exclusive reply returns the block (4), and invalidation acknowledgements (Inv-Ack, 5) complete the operation. DL = directory logic. Continued on the next slide.]
120. Sequence of actions in a store operation requiring remote service (cont.)
[Figure: the home cluster C1m, where X is shared, answers the Read-Ex request (2) with a Read-Ex reply and forwards Inv-Req messages (3) to sharing clusters such as C1k. DL = directory logic.]
121. Structure of the FLASH machine
[Figure: each node contains a microprocessor with a 2nd-level cache, DRAM, and the MAGIC controller connecting to the network and I/O.]
122. Structure of the MAGIC node controller
[Figure: messages from the processor, network and I/O pass through a message split unit separating headers from data; the data transfer logic moves the message data while the control pipeline (with access to memory) processes the headers, and the results are recombined and sent back out.]
123. Structure of the control micropipeline
[Figure: the inbox gathers requests from the processor interface (PI), network interface (NI), I/O, software queue head and memory control; the protocol processor, with its own MAGIC data and instruction caches, handles them and passes the results to the outbox.]
124. Convex Exemplar architecture
[Figure: a hypernode contains 8 CPUs, each with a 2 Mb cache, four cache/memory control units with 512 Mb memory each, four agents and an I/O subsystem, joined by a 5x5 crossbar (1.25 Gbytes/sec); up to 16 hypernodes are connected by Scalable Coherent Interface rings (600 Mbyte/sec each).]
125. Parallel matrix multiply code for a cache-coherent machine

    global c(idim, idim), a(idim, idim)
    global b(idim, idim), nCPUs
    private i, j, k, itid
    call spawn ( nCPUs )
    do j = 1, idim
      if ( mod(j, nCPUs) .eq. itid ) then
        do i = 1, idim
          c(i, j) = 0.0
          do k = 1, idim
            c(i, j) = c(i, j) + a(i, k) * b(k, j)
          enddo
        enddo
      endif
    enddo
    call join
126. Parallel matrix multiply code for a non-cache-coherent machine

    global c(idim, idim), a(idim, idim)
    global b(idim, idim), nCPUs
    private i, j, k, itid, tmp
    semaphore is(idim, idim)
    call spawn ( nCPUs )
    do j = 1, idim
      if ( mod(j, nCPUs) .eq. itid ) then
        do i = 1, idim
          tmp = 0.0
          do k = 1, idim
            call flush (a(i, k))
            call flush (b(k, j))
            tmp = tmp + a(i, k) * b(k, j)
          enddo

127. Parallel matrix multiply code for a non-cache-coherent machine (cont.)

          call lock (c(i, j), is(i, j))
          c(i, j) = tmp
          call flush (c(i, j))
          call unlock (c(i, j), is(i, j))
        enddo
      endif
    enddo
    call join
128. Structure of the basic Data Diffusion Machine (DDM)
[Figure: processors attach to attraction memories (state + data memory, output buffer, and a controller implementing the above and below protocols); the attraction memories share the DDM bus, governed by arbitration/selection logic and a top protocol.]
129. State transition diagram of the attraction memory protocol
- Notation: in-transaction/out-transaction; P = from processor, N = from network
- States: E = Exclusive, I = Invalid, R = Reading, RW = Reading-and-waiting, S = Shared, W = Waiting
[Figure: e.g. a Pread in I issues an Nread and moves to R; Ndata moves R to S; a Pwrite in S issues an Nerase and moves to W; Nexclusive moves W to E; local Pread/Pwrite leave E unchanged.]
130. Structure of the directory unit
[Figure: a directory (state memory only, no data) sits between a higher and a lower DDM bus, with input and output buffers on both sides and a controller implementing the above and below protocols.]
131. Write race in the hierarchical DDM
- Notation: I = Invalid, W = Waiting, S = Shared, P = processor, AM = attraction memory, D = directory
[Figure: two processors race to write item X; both send erase requests up the hierarchy; the top node acknowledges the winner's request with an exclusive acknowledge, while the loser's request is turned into an erase of its own copy.]
132. Read operations by processor B
- A column for each local cache (A, B, C, ...); a row for each cache subpage
- States: EO = exclusive owner, NO = non-exclusive owner, C = copy
[Table: a read by B of a subpage held exclusively elsewhere demotes the exclusive owner to non-exclusive owner (EO -> NO) and gives B a copy (C).]
133. Write operations by processor C
[Table: a write by C invalidates the copies and non-exclusive owners in the other caches (C -> I, NO -> I) and makes C the exclusive owner (EO).]
134. The hierarchical structure of the Kendall Square Research (KSR1) machine (COMA)
[Figure: ring 0 (ALLCACHE group 0) connects processors, each with a local cache and local cache directory; ring 0 directories connect the ring 0s to ring 1 (ALLCACHE group 1). A request travels from a requester to a responder within a ring, or up through the hierarchy when the data is further away.]
135. The convergence of scalable MIMD computers
- 1st generation: distributed memory - hypercube (store and forward); shared memory - multistage network (no cache consistency) when scalable, shared bus (snoopy cache) at small size
- 2nd generation: mesh (wormhole routing); NUMA (no cache consistency)
- 3rd generation: processor + communication processor + router; CC-NUMA and COMA (cluster concept)
- 4th generation: multi-threaded computers - multi-threaded processor + communication processor + router + cache directory