Title: Shared Memory MIMD Architectures (Sima, Fountain and Kacsuk, Chapter 18)
2. Design choices
- Types of shared memory
  - Physically shared memory
  - Virtual (or distributed) shared memory
- Scalability issues
- Organisation of memory
- Design of interconnection network
- Cache coherence protocols
3. Design space of shared memory computers
- Single address space memory access
  - Physical shared memory: UMA
  - Virtual shared memory: NUMA, CC-NUMA, COMA
- Interconnection scheme
  - Shared path: single bus based; multiple bus based (bus multiplication, grid of buses, hierarchical system)
  - Switching network: crossbar; multistage network (Omega, Banyan, Benes)
- Cache coherency: hardware based, software based
4. Classification of dynamic interconnection networks
- Enable temporary connection of any two components of a multiprocessor
5. Buses
- Very limited scalability
  - Typically 3-5 processors unless special techniques (e.g. TDM) are used
- Can be expanded significantly by
  - Using private memory
  - Using coherent cache memory
  - Using multiple buses
6. Structure of a single bus multiprocessor (no caches)
[Figure: processors P1..Pk, memories M1..Mn and I/O units share address, data, control and interrupt lines plus bus exchange lines, governed by a bus arbiter and control logic.]
7. Locking or multiplexing the bus
- Two main approaches
- Locking and holding
  - Acquire the bus
  - Send out address and/or data
  - Wait for data (read), or wait for the write to complete
  - Release the bus
- Multiplexing
  - Acquire a bus time slot
  - Send address and/or data
  - Come back for the data n cycles later (read), or keep going (write)
8. Memory write on locked bus
[Timing diagram: processors P1-P4 each hold the bus for a full memory cycle, so the four writes complete at times 4, 8, 12 and 16.]
9. Memory write on multiplexed buses
- Note: this assumes different memory banks
[Timing diagram: the bus cycles of P1-P4 are interleaved with the memory cycles, so the four writes complete by time 7.]
10. Memory read on locked bus
[Timing diagram for P1-P3: each read holds the bus through three phases. Phase 1: the address bus is used. Phase 2: the bus is idle while memory is accessed. Phase 3: the data bus is used.]
11. Memory read on multiplexed bus
[Timing diagram for P1-P3, with the same three read phases; the idle phase 2 of one read overlaps with the bus phases of the others, shortening the total time.]
12. Memory read on split-transaction bus
- The next transfer is started before the last one has completed!
- Needs special associative hardware
[Timing diagram for P1-P3, with the three read phases of successive transactions overlapped on the bus.]
13. Arbiter logic
- Because the bus is a shared resource, masters must arbitrate for access
- The arbiter may be
- Centralised
  - A central unit examines all requests
- Decentralised
  - Logic is split amongst the bus masters
  - Scalable: each new master adds more logic
14. Design space of arbiter logic
- Organization: centralized or distributed
- Bus allocation policy: fixed priority, rotating, round robin, least recently used, first come first served
- Handling of requests: fixed priority or rotating
- Handling of grants: fixed priority or rotating
15-19. Centralized arbitration with independent requests and grants
[Figure, shown in five animation steps: masters 1..N each have a private request line (R1..RN) to and grant line (G1..GN) from a central bus arbiter, plus a shared bus-busy line. Steps: masters request the bus; one is granted; the successful master claims the bus; the bus is released.]
20-25. Daisy-chained bus arbitration scheme
[Figure, shown in six animation steps: a shared bus-request and bus-busy line feed a central bus arbiter; the grant signal propagates from master 1 towards master N along a daisy chain (Grant 1, G2, ..., GN). Steps: masters request the bus; the bus grant is generated; the grant is not propagated past the first requesting master; that master claims the bus; the bus is released.]
26. Decentralized rotating arbiter with independent requests and grants
- Problems with the previous design
  - Lack of fairness
  - Waiting whilst the grant signal propagates
- Rotating priority solves the lack of fairness
  - The logical first master is not the same as the physical first
[Figure: each master 1..N has its own arbiter; request (R), grant (G) and priority (P) lines link the arbiters in a ring, with a shared bus-busy line.]
27. Multiple buses
- Increase bandwidth by adding additional resources
- The bus is the limiting factor
28. 1-dimensional multiple bus multiprocessor
- Each processor is connected to all buses
- Each memory is connected to all buses
- A processor chooses a bus dynamically
- Load can be spread across the buses
[Figure: buses B1..Bb connect processors P1..Pn to memories M1..Mm.]
29. 2- and 3-dimensional bus systems
[Figure: grids of buses connecting processor (P) and memory (M) nodes.]
30. 2-dimensional bus design
- Can support specialised access patterns, e.g. a climate model
  - Access to local data
  - Access to data at the same latitude
  - Access to data at the same longitude
31. Figure 18.11: Structure of a b-of-m arbiter
[Figure: a state register (s1, e1 .. sm, em) feeds arbiters 1..m; each arbiter i takes request Ri and produces grant Gi and bus-allocation signal BAi via control logic C1..Cm.]
32. Cluster bus architecture
- Hierarchy of buses
- Arbitrarily large networks
- Cache coherence becomes very difficult
[Figure: a global bus (Nanobus) connects clusters 1..8 through uniform interconnection cards; each cluster is a Multimax with its own cluster bus (Nanobus), a uniform cluster cache and uniform interconnection cards.]
33. Switching networks

34-36. View of a crossbar network
- The crossbar allows any processor to connect with any memory
- As long as there is no contention for a memory, the network is non-blocking
[Figure: an n x n grid of switches (S) connecting processors P1..Pn to memories M1..Mn.]
37. Detailed structure of a crossbar network
[Figure: each crosspoint switch carries control, address and data bus paths between a processor's bus-bus connection unit (BBCU) and memory Mi, with an arbiter resolving contention.]
38. Multistage interconnection networks
- Cannot directly connect every processor to every memory
- Use crossbar switches as components to build a larger network
- The minimum number of stages is logarithmic
- Single path between any pair of ports
  - No fault tolerance
  - Blocking (if an intermediate switch is in use)
39. Omega network topology
- Built from 2 x 2 crossbar switch components
- The 8 x 8 network is built as a butterfly of these switches
- Unique path from one port to another
- Log depth
- Switch settings include straight through, upper broadcast and lower broadcast
[Figure: 8 x 8 omega network connecting inputs 000..111 to outputs 000..111 in three stages.]
40. Omega network topology
- Some configurations are non-blocking
- e.g. the reversal permutation: 0->7, 1->6, 2->5, 3->4, 4->3, 5->2, 6->1, 7->0
[Figure: switch settings realising the reversal permutation without conflicts.]
41. Broadcast in the omega network
[Figure: one input reaches all outputs 000..111 using the broadcast switch settings at each stage.]
42. Blocking in an omega network
[Figure: the requests (0->5, ..., 6->4, ...) conflict at an intermediate switch, so one of them is blocked.]
43. Multistage network properties
44. Hot-spot saturation in a blocking omega network
[Figure: P2->M4 is active; this blocks P7->M4, which in turn blocks P1->M5, which in turn blocks P5->M7.]
45. Hotspots in omega networks
- In a shared memory machine there are two sources of contention
  - Memory units
  - Switch elements
- Certain access patterns can repeatedly block each other even though they address different memory units
- Message combining can solve these problems
  - The switch element buffers the requests (e.g. several "Read 100" requests are merged into one)
  - The memory only sees one request
46. Structure of a combining switch
- Introduced on the NYU Ultracomputer
[Figure: each direction of the switch (Proc(i) to Mem(k), Proc(j) to Mem(l)) has a combining queue, a non-combining queue and a wait buffer for holding combined requests until the reply returns.]
47-50. Cache coherence
- Cache coherence problems
  - Sharing of writable data
  - Process migration
  - I/O activity
[Figures, shown over four slides: processors with private caches over shared memory; a write by one processor (Write 100,100), a migrated process, or an I/O read (Read 100) can each leave a cache holding a stale copy.]
51. Classification of data structures
- Read only
  - Never causes cache coherence problems
- Shared writable
  - The main source of cache coherence problems
- Private writable data
  - Causes problems with process migration
- Solutions
  - Hardware based protocols
  - Software based protocols
52. Design space of hardware-based cache coherence protocols

53. Design space of hardware-based cache coherence protocols (cont.)
54. Write-through memory update policy
- Memory is always updated on a write
- Intuitively easier to keep caches coherent
[Figure: Pi stores D1; both its cache and memory now hold D1, while Pj's cache still holds the old value D.]
55. Write-back memory update policy
- Data is only written back to memory when the block is flushed
- The processor can do many writes before the flush
[Figure: Pi stores D1 in its cache; memory still holds the old value D until the flush.]
56. Write-update cache coherence policy
- When a processor writes a variable, it updates all copies in the other processors' caches
[Figure: Pi stores D1 and broadcasts Update(D1); the copies in Pj and Pk become D1.]
57. Write-invalidate cache coherence policy
- When a processor writes a variable, it invalidates the copies in all other caches
- This makes one processor the owner
[Figure: Pi stores D1 and broadcasts Invalidate(addr(D)); the copies in Pj and Pk become invalid data.]
58. Snoopy protocols
- If the interconnection network supports broadcasting (cheaply) then a snoopy policy is effective
  - Every cache watches every transaction to memory
  - Works for buses
- If broadcast is not efficient
  - Use a directory based scheme
  - Keeps track of where cache blocks are located
59. Snoopy write-update protocol
- Possible cache block states (used to support the cache coherence protocol)
- Valid-exclusive
  - The only copy of this cache block; cache and memory are consistent
- Shared
  - Several copies of this cache block exist
- Dirty
  - The only copy, but cache and memory are inconsistent
60. Read miss logic
- The snoopy cache controller broadcasts a Read-Blk command on the bus
- If there are shared copies
  - The block is delivered by a cache holding a copy
- If there is a dirty copy
  - It is supplied and flushed to main memory
  - All copies become shared
- If a valid-exclusive copy exists
  - The copy is supplied and all copies become shared
- If no cache holds a copy
  - Memory supplies the data
  - The block becomes valid-exclusive
61. Snoopy update: read miss
[Figure: Pi loads D, broadcasting Read-blk(addr(D)); Pj's copy is supplied and both copies end up shared.]
62. Write hit logic
- If the block is valid-exclusive or dirty
  - The write is performed locally
  - The new state is dirty
- If the block is shared
  - Broadcast an update-block command on the bus
  - All copies (including memory) are updated
  - The state remains shared
63. Snoopy update: write hit (exclusive)
[Figure: Pi writes D1 locally to its valid-exclusive copy; the block becomes dirty while memory still holds D.]
64. Snoopy update: write hit (shared)
[Figure: Pi writes D1 to a shared block and broadcasts Write(addr(D)); the copies in the other caches and in memory are updated, and all copies stay shared.]
65. Write miss
- If only memory contains a copy
  - Memory is updated
  - The requesting cache is loaded with the data, state valid-exclusive
- If shared copies are available
  - All copies (including the memory one) are updated
  - The requesting cache is loaded with the data, state shared
- If a dirty or valid-exclusive copy exists
  - The other blocks are updated
  - Memory is updated
  - The requesting cache is loaded with the data, state shared
66. Snoopy update: write miss
[Figure: Pi writes D1, broadcasting Write(addr(D)); memory and any other copies are updated.]
67. State transition graph for snoopy update
- The cache responds to
  - P-Read and P-Write from the processor
  - Read-Blk, Write-Blk and Update-Blk from the bus
[Figure: transitions between Valid-exclusive, Shared and Dirty; e.g. a bus Read-Blk moves Valid-exclusive or Dirty to Shared, a local P-Write moves Valid-exclusive to Dirty, and events on Shared leave it Shared.]
68. Structure of the snoopy cache controller
- The snoopy controller needs to operate at bus speed
[Figure: in each PE the cache directory is shared by two controllers: the cache controller serving the processor and the snoopy controller watching the bus through an interface.]
69. Directory schemes
- Directory schemes only send consistency commands to those caches holding a valid copy of the shared block
- Designed for systems where snooping is not possible
- Three main approaches
- Full map directory
  - Each entry points to all caches
  - The entry indicates whether the block is present in remote caches
  - Not efficient for large systems
- Limited directory
  - Entries only point to a subset of the caches
  - Works because programs tend not to share a variable with all processors
  - Otherwise the same information as in the full map
- Chained directory
  - Directory entries form a linked list
  - Scalable: processors can be added without increasing the directory width
70-71. Chained directory scheme
[Figures: the directory entry for block X in shared memory points to the head of a chain; each cache holding X stores a pointer to the next cache in the chain, with CT marking the chain terminator. A Read X by a new processor prepends its cache to the chain.]
72. Scalable Coherent Interface (SCI)
- A concrete example of a chained directory
- An IEEE standard
- Defines an interface to the interconnection network, not any particular interconnection network
- The interface is point to point
  - Well suited to networks like the Convex Exemplar
  - Simple, unidirectional ring
- Designed for building scalable shared memory machines
73. Structure of sharing-lists in the SCI
- Operations are defined for
  - Creation
  - Insertion
  - Deletion
  - Reduction to a single node
[Figure: the memory entry holds mstate, forw_id and the data (64 bits); each cache line in nodes i, j and k holds forw_id, back_id, cstate, mem_id and the data, forming a doubly linked sharing list.]
74. Insertion in a sharing-list
[Figure: node i sends a prepend request to the current head, node j; the responses make node i the new head of the list.]
75. Messages for deletion
[Figure: the deleting node sends messages (1, 2) to its neighbours so their forward and backward pointers are updated to point past it.]
76. Structure of the sharing-list after deletion
[Figure: node j has been removed; the list now links memory, node i and node k directly.]
77. Hierarchical cache coherence
[Figure: a two-level cache hierarchy (C2x over C1x over C10 caches and processors) beneath main memory; a write to X propagates up through buses B10 and B20, and Invalidate commands travel down the other branches that hold a copy of X.]
78. Software based coherence
- Software approaches rely on compiler assistance
- The compiler identifies different classes of variables
  - Read-only
  - Read-only for any number of processes and read-write for one process
  - Read-write for one process
  - Read-write for any number of processes
- Once identified (by static analysis), each class is handled differently
79. Software based cache coherence
- Read-only variables
  - Can be cached at any time
- Read-only for any number of processes and read-write for one process
  - Can only be cached on the writing processor
- Read-write for one process
  - Cache only on that processor
- Read-write for many processes
  - Cannot be cached at all
- Clearly the analysis needs accurate information in order to limit the performance hit
80. Classification of software-based cache coherence protocols
81. Invalidation
- Can invalidate the entire cache
  - A single hardware mechanism clears all the valid bits
  - Very conservative!
- Selective invalidation
  - Invalidate before critical sections
  - Understand parallel for-loops and invalidate accordingly
  - Still needs hardware support to clear effectively
[Figure: a cache with key, data and valid (v) bits per line.]
82. Using knowledge of critical regions

    Secure_lock()
    Invalidate_cache()
    ... variables in here can be used without worrying about any other processes ...
    Flush_cache()
    Release_lock()
83. Using knowledge of parallel loops
- Knowledge about the loop lets Par For (i = 0; i < 100; i++) be split:
  - Processor 0: Par For (i = 0; i < 50; i++)
  - Processor 1: Par For (i = 50; i < 100; i++)
84. Selective invalidation schemes
- Add a change bit to the cache block status
  - Set the change bit to true on a write
  - On a read of a changed block, invalidate and reload
- Add a timestamp to the cache block
  - A clock is associated with each data structure
  - Update the timestamp in the cache when the block changes
  - The timestamp in the block can be compared with the current timestamp
- Add a version number
  - Similar to the clock scheme
85. Synchronization and event ordering
- Mutual exclusion is required in many parallel algorithms
  - Monitors
  - Semaphores
- All high level schemes are based on low level synchronization tools
- Atomic test-and-set is common in shared memory multiprocessors
- It needs to take account of the cache
  - Minimum traffic generated while waiting
  - Low latency release of a waiting processor
  - Low latency acquisition of a free lock
- These typically work well on small bus based machines
86. Synchronization with test-and-set
- Lock variable states: OPEN, CLOSED
- Acquire lock:
  - char lock;
  - while (exchange(&lock, CLOSED) == CLOSED) ;
- Release lock:
  - lock = OPEN;
87. Cache states after Pi successfully executes test-and-set on lock
[Figure: after Pi's exchange, Ci holds the lock block (closed) in state dirty; the other copies are invalid.]
88. Bus commands when Pj executes test-and-set on lock, and cache states after
[Figure: (1) Cj issues Read-Blk(lock); (2) Ci supplies Block(lock), still closed; (3) Pj's exchange broadcasts Invalidate(lock). Cj's copy becomes dirty and Ci's becomes invalid.]
89. Cache states after Pk executes test-and-set on lock
[Figure: the lock block (closed) migrates again: Ck's copy becomes dirty and the other copies become invalid.]
90. Busy waiting with cache coherence
- An indivisible test-and-set instruction requires write access to the lock
- This causes the processor doing the test-and-set to acquire the variable in its cache, invalidating all other copies
- When multiple processors spin on the lock
  - Each one tries to acquire the variable in its cache
  - This causes cache thrashing
- Instead use a snooping lock
  - Spin on a plain test without the indivisible test-and-set
  - Only attempt the exchange once the lock is OPEN
91. Efficient algorithm for locking
-     while (exchange(&lock, CLOSED) == CLOSED)
-         while (lock == CLOSED) ;
- The first while will claim the lock if it is OPEN, and lock it
- But if it is already CLOSED, control transfers to the second loop
  - This continuously reads the lock from the local cache
  - No bus traffic during this phase
- When the lock becomes OPEN, try the test-and-set again
92. Test and test-and-set
- Even more efficient to test the lock before trying to set it:
-     for (;;) {
-         while (lock == CLOSED) ;
-         if (exchange(&lock, CLOSED) != CLOSED)
-             break;
-     }
- Introduces extra latency for unused locks
93. Lock implementation on scalable multiprocessors
- The NYU Ultracomputer and IBM RP3 implemented fetch-and-add
- Fetch-and-add is an atomic operation
- All memory modules are augmented with an adder circuit
-     int fetch_and_add(int *x, int a)
-     {
-         int temp = *x;
-         *x = *x + a;
-         return temp;
-     }
94. Example of fetch-and-add
- Suppose we want to implement the parallel loop
-     DOALL N = 1 to 1000
-         <loop body using N>
-     ENDDO
- and want to allocate iterations to processors dynamically:
-     N = 0;
-     i = fetch_and_add(&N, 1);
-     while (i < 1000) {
-         loop_body(i);
-         i = fetch_and_add(&N, 1);
-     }
- Regardless of how many processors execute the loop, each processor will get a different value of i
95. Fetch-and-add
- Fetch-and-add automatically allocates loop indexes in this example
- But location N becomes a hotspot
- The combining network described before will not work correctly without modification
  - It returns the same value from every combined read
- Change each switch element so that it can implement the fetch-and-add operation
  - This gives a distributed operation without hotspots
96Forward propagation of fetch-and-add
M0
P0 F A (N,1)
1
M1
P0 F A (N,1)
F A (N,2)
M2
P0 F A (N,1)
1
2
M3
P0 F A (N,1)
F A (N,2)
F A (N,4)
M4
P0 F A (N,1)
1
M5
P0 F A (N,1)
F A (N,2)
F A (N,8) M6 returns N1 N becomes 9
P0 F A (N,1)
F A (N,4)
1
2
4
P0 F A (N,1)
F A (N,2)
97Back propagation of fetch-and-add
M0
P0 1
1
M1
P0 5
11
1
M2
P0 3
5
1
M3
P0 7
51
12
3
1
M4
3
P0 2
M5
P0 6
31
5
7
M6
1
5
P0 4
5
7
P0 8
M7
71
14
52
98. Event ordering in cache coherent systems
- Programmers usually assume sequential consistency
- Consider
  - P1: ... store A ...
  - P2: ... load A, store B ...
  - P3: ... load B, load A
- P3 expects the values of A and B to be consistent
- BUT this may not occur, because the invalidation messages from P1 may reach P2 before they reach P3
- This occurs only because of the caches
99. Figure 18.35: Classification of shared writable variable accesses
100. Figure 18.36 (a): weak consistency
[Diagram: Request(L1) ... load/store operations ... Release(L1), then Request(L2) ... load/store operations ... Release(L2); the synchronisation points strictly order the accesses between them.]
101. Figure 18.36 (b): release consistency
[Diagram: requests and releases on L1, L2 and L3 may overlap; load/store operations are ordered only with respect to their own request/release pair.]
102. Figure 18.36 (c): load and store operations can be executed in any order
103. A quick tour of some UMA machines

104. Some real UMA machines
105. Structure of the Hector machine (NUMA)
[Figure: stations attach to local rings through station controllers; inter-ring interfaces connect the local rings to a global ring. Continued on the next slide.]
106. Structure of the Hector machine (cont.)
[Figure: within a station, a station bus interface connects the station bus to processor modules (processor, cache, memory) and an I/O module whose adaptor serves display, ethernet and disk.]
107. Structure of the Cray T3D system (NUMA)
[Figure: the T3D is attached to a Cray Y-MP host; I/O clusters connect tape drives, workstations, disks and networks.]
108. Design space of CC-NUMA machines
109. Structure of the Wisconsin multicube machine
[Figure: a grid of processor-cache pairs connected by row and column buses, with the memories attached to the column buses. P = processor, C = cache, M = memory.]
110. Mechanism of reading block X in state modified
[Figure: P00 issues Read X; read(X) propagates along the row and column buses to the cache holding X in state modified, which supplies value(X) and issues Mem-write(X, unmodified) so that memory is updated and the state becomes unmodified.]
111. The Stanford Dash interconnection network
[Figure: clusters 11, 12, 13 and 21, 22, 23 connected in a grid.]
112. Structure of a cluster
[Figure: processors Pi with caches Ci share a bus with memory, a directory and intercluster interface, and an I/O interface.]
113. Processor level
[Figure: Load X hits in the processor's own cache Ci. Access time: 1 clock.]
114. Local cluster level
[Figure: Load X misses in the local cache but is supplied by another cache in the same cluster. Access time: 30 clocks.]
115. Home cluster level
[Figure: Load X in the local cluster C1i is sent over the interconnection network to the home cluster C1j, whose memory supplies X. Access time: 100 clocks.]
116. Remote cluster level
[Figure: Load X in the local cluster C1i is forwarded via the directory logic (DL) towards the cluster holding the block. Access time: 135 clocks. Continued on the next slide.]
117. Remote cluster level (cont.)
[Figure: the home cluster C1m's directory forwards a Read-Req to the remote cluster C1k holding X dirty; the remote cache answers with a Read-Rply to the requester and a Sharing-Writeback to the home memory.]
118. Structure of the Dash directory
[Figure: the directory controller (DC) board and reply controller (RC) board sit between the cluster address/control and data buses and the X/Y-dimension request and reply routers; further components include a pseudo-CPU (PCPU), a performance monitor, remote cache status/bus retry logic and arbitration masks.]
119. Sequence of actions in a store operation requiring remote service
[Figure: Pi's Store X issues a Read-Exclusive request through the local directory logic DLi (1); invalidation requests (Inv-Req, 3) go out to sharing clusters, the exclusive reply returns the block (4), and invalidation acknowledgements (Inv-Ack, 5) complete the operation. DL = directory logic. Continued on the next slide.]
120. Sequence of actions in a store operation requiring remote service (cont.)
[Figure: the home cluster C1m, where X is shared, answers the Read-Ex request (2) with a Read-Ex reply and forwards Inv-Req messages (3) to sharing clusters such as C1k. DL = directory logic.]
121. Structure of the FLASH machine
[Figure: each node contains a microprocessor with a 2nd-level cache, DRAM, and the MAGIC controller connecting to the network and I/O.]
122. Structure of the MAGIC node controller
[Figure: messages from the processor, network and I/O pass through a message split unit separating headers from data; the data transfer logic moves the message data while the control pipeline (with access to memory) processes the headers, and the results are recombined and sent back out.]
123. Structure of the control micropipeline
[Figure: the inbox gathers requests from the processor interface (PI), network interface (NI), I/O, software queue head and memory control; the protocol processor, with its own MAGIC data and instruction caches, handles them and passes the results to the outbox.]
124. Convex Exemplar architecture
[Figure: a hypernode contains 8 CPUs, each with a 2 Mb cache, four cache/memory control units with 512 Mb memory each, four agents and an I/O subsystem, joined by a 5x5 crossbar (1.25 Gbytes/sec); up to 16 hypernodes are connected by Scalable Coherent Interface rings (600 Mbyte/sec each).]
125. Parallel matrix multiply code for a cache-coherent machine

    global c(idim, idim), a(idim, idim)
    global b(idim, idim), nCPUs
    private i, j, k, itid
    call spawn ( nCPUs )
    do j = 1, idim
      if ( mod(j, nCPUs) .eq. itid ) then
        do i = 1, idim
          c(i, j) = 0.0
          do k = 1, idim
            c(i, j) = c(i, j) + a(i, k) * b(k, j)
          enddo
        enddo
      endif
    enddo
    call join
126. Parallel matrix multiply code for a non-cache-coherent machine

    global c(idim, idim), a(idim, idim)
    global b(idim, idim), nCPUs
    private i, j, k, itid, tmp
    semaphore is(idim, idim)
    call spawn ( nCPUs )
    do j = 1, idim
      if ( mod(j, nCPUs) .eq. itid ) then
        do i = 1, idim
          tmp = 0.0
          do k = 1, idim
            call flush (a(i, k))
            call flush (b(k, j))
            tmp = tmp + a(i, k) * b(k, j)
          enddo

127. Parallel matrix multiply code for a non-cache-coherent machine (cont.)

          call lock (c(i, j), is(i, j))
          c(i, j) = tmp
          call flush (c(i, j))
          call unlock (c(i, j), is(i, j))
        enddo
      endif
    enddo
    call join
128. Structure of the basic Data Diffusion Machine (DDM)
[Figure: processors attach to attraction memories (state + data memory, output buffer, and a controller implementing the above and below protocols); the attraction memories share the DDM bus, governed by arbitration/selection logic and a top protocol.]
129. State transition diagram of the attraction memory protocol
- Notation: in-transaction/out-transaction; P = from processor, N = from network
- States: E = Exclusive, I = Invalid, R = Reading, RW = Reading-and-waiting, S = Shared, W = Waiting
[Figure: e.g. a Pread in I issues an Nread and moves to R; Ndata moves R to S; a Pwrite in S issues an Nerase and moves to W; Nexclusive moves W to E; local Pread/Pwrite leave E unchanged.]
130. Structure of the directory unit
[Figure: a directory (state memory only, no data) sits between a higher and a lower DDM bus, with input and output buffers on both sides and a controller implementing the above and below protocols.]
131. Write race in the hierarchical DDM
- Notation: I = Invalid, W = Waiting, S = Shared, P = processor, AM = attraction memory, D = directory
[Figure: two processors race to write item X; both send erase requests up the hierarchy; the top node acknowledges the winner's request with an exclusive acknowledge, while the loser's request is turned into an erase of its own copy.]
132. Read operations by processor B
- A column for each local cache (A, B, C, ...); a row for each cache subpage
- States: EO = exclusive owner, NO = non-exclusive owner, C = copy
[Table: a read by B of a subpage held exclusively elsewhere demotes the exclusive owner to non-exclusive owner (EO -> NO) and gives B a copy (C).]
133. Write operations by processor C
[Table: a write by C invalidates the copies and non-exclusive owners in the other caches (C -> I, NO -> I) and makes C the exclusive owner (EO).]
134. The hierarchical structure of the Kendall Square Research (KSR1) machine (COMA)
[Figure: ring 0 (ALLCACHE group 0) connects processors, each with a local cache and local cache directory; ring 0 directories connect the ring 0s to ring 1 (ALLCACHE group 1). A request travels from a requester to a responder within a ring, or up through the hierarchy when the data is further away.]
135. The convergence of scalable MIMD computers
- 1st generation: distributed memory - hypercube (store and forward); shared memory - multistage network (no cache consistency) when scalable, shared bus (snoopy cache) at small size
- 2nd generation: mesh (wormhole routing); NUMA (no cache consistency)
- 3rd generation: processor + communication processor + router; CC-NUMA and COMA (cluster concept)
- 4th generation: multi-threaded computers - multi-threaded processor + communication processor + router + cache directory