1
Shared Memory MIMD Architectures
Sima, Fountain and Kacsuk, Chapter 18
  • CSE462

2
Design choices
  • Types of shared memory
  • Physically shared memory
  • Virtual (or distributed) shared memory
  • Scalability issues
  • Organisation of memory
  • Design of interconnection network
  • Cache coherence protocols

3
Design space of shared memory computers
Shared memory computers
  • Single address space memory access
    • Physical shared memory: UMA
    • Virtual shared memory: NUMA, CC-NUMA, COMA
  • Interconnection scheme
    • Shared path
      • Single bus based
      • Multiple bus based: bus multiplication, grid of buses, hierarchical system
    • Switching network
      • Crossbar
      • Multistage network: Omega, Banyan, Benes
  • Cache coherency
    • Hardware based
    • Software based
4
Classification of dynamic interconnection networks
  • Enable temporary connection of any two components
    of a multiprocessor

5
Buses
  • Very limited scalability
  • Typically 3-5 processors unless special
    techniques (e.g. TDM) are used
  • Can be expanded significantly by using
  • Private memory
  • Coherent cache memory
  • Multiple buses

6
Structure of a single bus multiprocessor
(no caches)
[Figure: processors P1..Pk and memory modules M1..Mn share a single bus (address, data, control, interrupt and bus exchange lines) together with I/O modules I/O 1..I/O M; a bus arbiter and control logic governs access to the bus]
7
Locking or multiplexing the bus
  • Two main approaches
  • Locking and holding
  • Acquire the bus
  • Send out address and/or data
  • Wait for data (read), wait for write to complete
  • Release the bus
  • Multiplexing
  • Acquire bus time slot
  • Send address and/or data
  • Come back for data n cycles later (read), or keep
    going if write
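As a rough worked illustration (the numbers are assumed for the example, not taken from the book): if a bus transfer takes one cycle and a memory access three cycles, a locked read occupies the bus for five cycles even though it drives it for only two (address out, data back), so at most one access completes every five cycles; a multiplexed or split-transaction bus can slot other requests into the three idle cycles, which is what the following timing diagrams illustrate.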

8
Memory write on locked bus
[Timing diagram: processors P1..P4 writing to memory on a locked bus; each write holds the bus for the whole bus-plus-memory cycle, so the writes are fully serialized (completing at times 4, 8, 12 and 16)]
9
Memory write on multiplexed buses
Note: this assumes the writes go to different memory banks.
[Timing diagram: on a multiplexed bus the bus cycles and memory cycles of successive writes overlap, so the same four writes finish by time 7 instead of 16]
10
Memory read on locked bus
[Timing diagram: processors P1..P3 reading on a locked bus; each read holds the bus through Phase 1 (address bus used), Phase 2 (bus not used while the memory accesses the data) and Phase 3 (data bus used), so the reads are completely serialized]
11
Memory read on multiplexed bus
[Timing diagram: on a multiplexed bus the idle Phase 2 of one read is overlapped with the address and data phases of other reads, so the three reads complete far sooner than on a locked bus]
12
Memory read on split-transaction bus
Next transfer started before last one completed!
Needs special associative hardware to match returning data with the outstanding request.
[Timing diagram: on a split-transaction bus the address phase of a later read starts before earlier reads have returned their data, pipelining Phases 1-3 across processors P1..P3]
13
Arbiter Logic
  • Because the bus is a shared resource, masters
    must arbitrate for access
  • Arbiter may be
  • Centralised
  • Central unit which looks at all requests
  • Decentralised.
  • Logic is split amongst bus masters
  • Scalable
  • Each new master adds more logic

14
Design space of arbiter logic
Arbiter logics (design space)
  • Organization: centralized, distributed
  • Bus allocation policy: fixed priority, rotating priority, round robin, least recently used, first come first served
  • Handling of requests: fixed priority, rotating
  • Handling of grants: fixed priority, rotating
15
Centralized arbitration with independent requests
and grants
[Figure: masters 1..N are each connected to the central bus arbiter by a private request line (R1..RN) and grant line (G1..GN); a shared bus-busy line indicates when the bus is in use]
16-19
Centralized arbitration with independent requests
and grants (operation, shown over four slides on the same figure)
  1. Masters request the bus on their request lines
  2. The central arbiter grants the bus to one of them
  3. The successful master claims the bus and asserts bus busy
  4. The bus is released
20-25
Daisy-chained bus arbitration scheme (operation, shown over six slides on the same figure)
[Figure: masters 1..N share a single bus-request line and a bus-busy line; the central bus arbiter's grant signal is daisy-chained through the masters (Grant 1, G2, ..., GN)]
  1. Masters request the bus on the common request line
  2. The arbiter generates the bus grant
  3. A requesting master absorbs the grant; it is not propagated further down the chain
  4. That master claims the bus and asserts bus busy
  5. The bus is released
26
Decentralized rotating arbiter with independent
requests and grants
  • Problems with previous design
  • lack of fairness
  • Wait whilst grant signal propagates
  • Rotating priority solves lack of fairness
  • Logical first not same as physical first

[Figure: each master has its own arbiter with request (R), grant (G) and priority (P) lines; the priority lines are chained from arbiter to arbiter so that the highest priority rotates among the masters, and a shared bus-busy line completes the scheme. R = Request, G = Grant, P = Priority]
27
Multiple buses
  • Increase bandwidth by adding additional resources
  • Bus is limiting factor

28
1-dimension multiple bus multiprocessor
  • Each processor connected to all buses
  • Each memory connected to all buses
  • Processor chooses bus dynamically
  • Load can be spread across buses

[Figure: buses B1..Bb with processors P1..Pn and memories M1..Mm; every processor and every memory is attached to every bus]
29
2 and 3 dimensional bus system
[Figure: processors (P) and memory units (M) attached to a 2-dimensional grid of buses; the 3-dimensional case extends the same idea]
30
2 Dimensional bus design
  • Can support specialised access patterns
  • e.g. Climate model
  • Access to local data
  • Access to data in same latitude
  • Access to data in same longitude

31
Figure 18.11 Structure of a b-of-m arbiter
[Figure: the b-of-m arbiter consists of arbiters 1..m coupled through a state register (s1/e1 .. sm/em) and control lines C1..Cm; arbiter i receives bus request Ri and produces grant Gi and bus-allocation signal BAi]
32
Cluster bus architecture
  • Hierarchy of buses
  • Arbitrary large networks
  • Cache coherence becomes very difficult

[Figure: clusters 1..8 (each a Multimax with a uniform cluster cache) attach through uniform interconnection cards to their cluster buses (Nanobus), which connect via further interconnection cards to a global bus (Nanobus)]
33
Switching Networks
34-36
View of a crossbar network
  • A crossbar allows any processor to connect to any memory
  • As long as there is no contention for the same memory, the network is non-blocking

[Figure, shown over three slides: processors P1..Pn along one side and memories M1..Mn along the other, with a switch (S) at every crosspoint; simultaneous connections to different memories use disjoint switches]
37
Detailed structure of a crossbar network
[Figure: each processor Pi connects through control, address and data buses and a BBCU to the crossbar; at each crosspoint an arbiter and a switch connect the processor's lines to the corresponding memory Mi]
38
Multi-stage interconnection networks
  • Cannot directly connect processor to memory
  • Use cross-bar switches as components to build
    larger network
  • Minimum number of stages is logarithmic
  • Single path
  • No fault tolerance
  • Blocking (if intermediate switch in use)

39
Omega network topology
  • Built from 2 x 2 crossbar switch components
  • The figure shows an 8 x 8 network
  • Unique path from any input port to any output port
  • Logarithmic depth

[Figure: an 8 x 8 omega network of 2 x 2 switches connecting inputs 000..111 to outputs 000..111; each switch can be set straight through, crossed, upper broadcast or lower broadcast]
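Routing in an omega network is self-routing by destination tag: at stage s a 2 x 2 switch forwards the message to its upper output if bit (n-1-s) of the destination address is 0 and to its lower output if it is 1. The following small C sketch (my own illustration; the function name and the printed trace are invented for the example) walks a message through an n-stage network this way:

    #include <stdio.h>

    /* Trace a message from input 'src' to output 'dst' through an omega
     * network with n_stages stages of 2x2 switches (2^n_stages ports).
     * Before each stage the lines are permuted by a perfect shuffle
     * (rotate-left of the line number); the switch output is then chosen
     * by the corresponding destination bit, most significant bit first. */
    static void omega_route(unsigned src, unsigned dst, unsigned n_stages)
    {
        unsigned mask = (1u << n_stages) - 1;
        unsigned pos = src;                      /* line the message is currently on */
        for (unsigned s = 0; s < n_stages; ++s) {
            pos = ((pos << 1) | (pos >> (n_stages - 1))) & mask;  /* perfect shuffle */
            unsigned bit = (dst >> (n_stages - 1 - s)) & 1u;      /* routing bit     */
            pos = (pos & ~1u) | bit;             /* 0 = upper output, 1 = lower output */
            printf("stage %u: switch %u, %s output\n", s, pos >> 1,
                   bit ? "lower" : "upper");
        }
        /* after the last stage, pos == dst */
    }

    int main(void)
    {
        omega_route(0u, 5u, 3u);                 /* route 000 -> 101 in the 8x8 network */
        return 0;
    }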
40
Omega network topology
  • Some configurations are non blocking
  • e.g. reversal

[Figure: the reversal permutation 0->7, 1->6, 2->5, 3->4, 4->3, 5->2, 6->1, 7->0 is routed through the omega network with no switch conflicts]
41
Broadcast in the omega network
[Figure: a broadcast from one input to all eight outputs, using the upper- and lower-broadcast settings of the 2 x 2 switches]
42
Blocking in an omega network
[Figure: the requests 0->5 and 6->4 need the same output of a middle-stage switch, so one of them is blocked even though they address different memory units]
43
Multistage Network Properties
44
Hot-spot saturation in a blocking omega network
[Figure: processors P0..P7 and memories M0..M7 connected by a blocking omega network]
P2->M4 active => P7->M4 blocked => P1->M5 blocked => P5->M7 blocked
45
Hotspots in Omega networks
  • In shared memory machine two sorts of contention
  • Memory unit
  • Switch elements
  • Certain access patterns can repeatedly block each
    other even though addressing different memory
    units
  • Message combining can solve these problems
  • Switch element buffers request
  • Memory only sees one request

[Figure: several processors issue Read 100; the requests are combined in the network so that memory sees only one]
46
Structure of a combining switch
  • Introduced on NYU Ultracomputer

[Figure: each direction through a combining switch contains a combining queue, a non-combining queue and a wait buffer between the processor ports Proc(i), Proc(j) and the memory ports Mem(k), Mem(l); combinable requests are merged in the combining queue and remembered in the wait buffer until the reply returns]
47-50
Cache Coherence
  • Cache coherence problems
  • Sharing of writable data
  • Process migration
  • I/O activity

[Figures, shown over four slides: processor/cache/memory modules on a shared bus. A block cached by several processors becomes inconsistent when one of them writes it (Write 100,100), when a process migrates to another processor, or when an I/O device accesses memory directly (IO Read 100)]
51
Classification of data structures
  • Read only
  • Never cause cache coherence problems
  • Shared writable
  • Main source of cache coherence problems
  • Private writable data
  • Causes problems with process migration
  • Solutions
  • Hardware based protocols
  • Software based protocols

52
Design space of hardware-based cache coherence
protocols
53
Design space of hardware-based cache coherence
protocols (cont.)
54
Write-through memory update policy
  • Memory always updated on a write
  • Intuitively easier to keep caches coherent

[Figure: Pi stores D1; with write-through both Pi's cache and memory now hold D1, while Pj's cache still holds the old value D]
55
Write-back memory update policy
  • Data only written back to memory when flushed
  • Processor can do many writes before flushed

[Figure: Pi stores D1; with write-back only Pi's cache holds D1, while memory and Pj's cache still hold the old value D]
56
Write-update cache coherence policy
  • When a processor writes a variable, updates all
    other copies in other processors

[Figure: Pi stores D1 and broadcasts Update(D1); the copies in Pj's and Pk's caches are updated, so every cache holds D1]
57
Write-invalidate cache coherence policy
  • When a processor writes a variable, it invalidates
    the copies in all other caches
  • Makes the writing processor the owner

[Figure: Pi stores D1 and broadcasts Invalidate(addr(D)); the copies in Pj's and Pk's caches become invalid and Pi holds the only valid copy]
58
Snoopy Protocols
  • If interconnection network supports broadcasting
    (cheaply) then a snoopy policy is effective
  • Every cache watches every transaction to memory
  • Works for buses
  • If broadcast is not efficient
  • Directory based scheme
  • Keeps track of where cache blocks are located

59
Snoopy write update protocol
  • Possible cache block states
  • Used to support cache coherence protocol
  • Valid-exclusive
  • Only copy of this cache block. Cache and memory
    are consistent
  • Shared
  • Several copies of this cache block
  • Dirty
  • Only copy but cache and memory are inconsistent

60
Read Miss logic
  • Snoopy cache controller broadcasts a Read-Blk
    command on the bus
  • If there are shared copies
  • Delivered by cache with copy
  • If dirty copies
  • It is supplied and flushed to main memory.
  • All copies become shared.
  • If a valid-exclusive copy exists
  • Copy supplied and all become shared
  • If no cache copy
  • Memory supplies data
  • Becomes valid exclusive

61
Snoopy Update - Read miss
[Figure: Pj's load of D misses and issues Read-Blk(addr(D)) on the bus; the block is supplied to Pj and the copies end up in the shared state]
62
Write hit logic
  • If block is valid-exclusive or dirty
  • Write is performed locally
  • New state is dirty
  • If block is shared
  • Broadcast update block on bus
  • All copies (including memory) update
  • Status remain shared.

63
Snoopy Update - Write hit on a valid-exclusive block
[Figure: the block is valid-exclusive in Pi's cache; Pi's write is performed locally, with no bus traffic, and the block's state becomes dirty]
64
Snoopy Update - Write hit on a shared block
[Figure: the block is shared; Pi's write broadcasts Write(addr(D)) on the bus, the copies in Pj's cache and in memory are updated, and all copies remain shared]
65
Write miss
  • If only memory contains a copy
  • Memory is updated
  • Requesting cache is loaded with the data, state valid-exclusive
  • If shared copies are available
  • All copies (including the memory one) are updated
  • Requesting cache is loaded with the data, state shared
  • If a dirty or valid-exclusive copy exists
  • The other copy is updated
  • Memory is updated
  • Requesting cache is loaded with the data, state shared

66
Snoopy Update - Write miss
[Figure: Pi's write misses and broadcasts Write(addr(D)); memory (and any other copies) are updated and Pi's cache is loaded with the block]
67
State transition graph for snoopy update
  • Cache responds to
  • P-READs, P-WRITEs from the Processor and
  • READ-BLK, WRITE-BLK from the Bus

[State transition graph:
  Valid-exclusive: P-Read keeps the state; P-Write makes it Dirty; an observed Read-Blk/Write-Blk makes it Shared
  Shared: P-Read, P-Write (with an update broadcast) and observed Read-Blk/Write-Blk/Update-Blk keep it Shared
  Dirty: P-Read/P-Write keep it Dirty; an observed Read-Blk/Write-Blk makes it Shared (with a flush)]
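As a hedged illustration of these rules (my own C sketch, not code from the book; the type and function names are invented), the transitions described on slides 60-65 can be written down directly:

    #include <stdio.h>

    typedef enum { INVALID, VALID_EXCLUSIVE, SHARED, DIRTY } BlockState;

    /* Read miss (slide 60): the state of the newly loaded block depends on
     * whether another cache already holds a copy. */
    BlockState on_read_miss(int another_cache_has_copy)
    {
        return another_cache_has_copy ? SHARED : VALID_EXCLUSIVE;
    }

    /* Write hit (slide 62): valid-exclusive and dirty blocks are written
     * locally and become dirty; shared blocks broadcast an update and stay
     * shared. */
    BlockState on_processor_write_hit(BlockState s)
    {
        switch (s) {
        case VALID_EXCLUSIVE:
        case DIRTY:
            return DIRTY;              /* written locally, no bus traffic   */
        case SHARED:
            /* broadcast Update-Blk so the other copies and memory follow   */
            return SHARED;
        default:
            return s;                  /* a write miss is handled separately */
        }
    }

    /* Bus snoop: an observed Read-Blk makes an exclusive or dirty copy
     * shared (a dirty copy is supplied and flushed to memory). */
    BlockState on_bus_read_blk(BlockState s)
    {
        return (s == DIRTY || s == VALID_EXCLUSIVE) ? SHARED : s;
    }

    int main(void)
    {
        BlockState s = on_read_miss(0);     /* -> VALID_EXCLUSIVE */
        s = on_processor_write_hit(s);      /* -> DIRTY           */
        s = on_bus_read_blk(s);             /* -> SHARED          */
        printf("final state: %d\n", s);
        return 0;
    }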
68
Structure of the snoopy cache controller
[Figure: each processing element PEi contains the processor, its cache, a cache controller serving processor requests and a snoopy cache controller watching the bus; both controllers share the cache directory, and the PEs and memory connect to the common address and data buses]
  • Snoopy controller needs to operate at bus speed

69
Directory Schemes
  • Directory schemes send consistency commands only
    to the caches that hold a valid copy of the shared block
  • Designed for systems where snooping is not
    possible
  • Three main approaches
  • Full map directory
  • Each entry points to all caches
  • Entry indicates whether block is present in
    remote caches
  • Not efficient for large systems
  • Limited directory
  • Only point to subset of the caches
  • Works because tend not to share a variable with
    all processors
  • Same information as in full map
  • Chained directory
  • Directory entries form a linked list
  • Scalable: processors can be added without
    increasing the directory width

70-71
Chained directory scheme
[Figures, shown over two slides: the directory entry for block X in shared memory points to one cache holding X; each cache entry for X points to the next cache in the chain, and the last entry carries a chain terminator (CT). When another PE reads X, its cache is linked into the chain and the directory is updated]
72
Scalable Coherence Interface
  • Concrete example of chained directory
  • IEEE Standard
  • Defines
  • Interface to interconnection network
  • Not any particular interconnection network
  • Interface
  • Point to point
  • Well suited to networks like Convex Exemplar
  • Simple, uni-directional ring
  • Designed for building scalable shared memory
    machines

73
Structure of sharing-lists in the SCI
  • Operations defined for
  • Creation
  • Insertion
  • Deletion
  • Reduction to single node

[Figure: memory holds mstate and a forw_id pointer to the head of the sharing list; the cache line in each of nodes i, j and k holds cstate, forw_id, back_id, mem_id and the data, forming a doubly linked list across the nodes]
74
Insertion in a sharing-list
[Figure: node i sends a prepend request to memory; the responses make node i the new head of the sharing list, pointing at the previous head, node j]
75
Messages for deletion
[Figure: node j removes itself from the sharing list by sending pointer-update messages (1, 2) to its neighbours, so that node i's forward pointer and node k's backward pointer bypass node j]
76
Structure of the sharing-list after deletion
[Figure: after the deletion the sharing list links memory, node i and node k only]
77
Hierarchical cache coherence
[Figure: a two-level cache/bus hierarchy: processors with first-level caches (C10, C11) sit on buses B10..B12, which connect through second-level caches (C20..C22) to the top-level bus B20 and main memory. A write to X propagates up the hierarchy, and invalidations are sent down the branches that hold other copies of X]
78
Software Based Coherence
  • Software approaches rely on compiler assistance
  • Identify different classes of variables
  • Read-only
  • Read-only for any number of processors and
    read-write for one process
  • Read-write for one process
  • Read-write for any number of processes
  • Once identified (by static analysis), handled
    differently

79
Software based cache coherence
  • Read only variables
  • Can be cached any time
  • Read only for any number and read-write for one
    process
  • Can only be cached on writing processor
  • Read-write for one process
  • Cache only on that processor
  • Read-write for many processes
  • Cannot be cached at all
  • Clearly need accurate information in order to
    limit performance hit

80
Classification of software-based cache coherence
protocols
81
Invalidation
  • Can invalidate the entire cache
  • Single hardware mechanism for clearing valid bits
  • Very conservative!
  • Selective invalidation
  • Invalidate before critical sections
  • Understand parallel for-loop and invalidate
  • Still needs hardware support to clear effectively

[Figure: each cache entry holds a key, the data and a valid bit (v); invalidating the whole cache simply clears all valid bits]
82
Using knowledge of critical regions
Secure_lock()
Invalidate_cache()
  ... variables in here can be used without worrying about any other process ...
Flush_cache()
Release_lock()
83
Using knowledge of parallel loops
Par For (i = 0; i < 100; i++) is split into
  Par For (i = 0; i < 50; i++)      - Processor 0
  Par For (i = 50; i < 100; i++)    - Processor 1
Knowledge about the loop bounds lets the compiler divide the iterations (and the corresponding invalidations) between the processors
84
Selective invalidation schemes
  • Add change bit to cache block status
  • Set change bit to true
  • If read on block then invalidate and reload
  • Add timestamp to cache block
  • Clock associated with a data structure
  • Update timestamp in cache when block changed
  • Can compare timestamp in block with current
    timestamp
  • Adding version number
  • Similar to clock scheme
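The timestamp idea can be sketched in a few lines of C (my reading of the bullet points above, with invented names and simplified details): each shared data structure carries a clock that is advanced at synchronization points, each cache block remembers the clock value current when it was filled, and a cached copy is trusted only if its timestamp is not older than the structure's clock.

    #include <stdbool.h>

    typedef struct { unsigned clock; } SharedStructure;           /* one per data structure */
    typedef struct { unsigned timestamp; bool valid; } CacheBlock;

    /* advance the structure's clock at a synchronization point
     * (e.g. at the end of a parallel loop that may have written it) */
    void sync_point(SharedStructure *s) { s->clock++; }

    /* record when this cache obtained or updated the block */
    void fill_block(CacheBlock *b, const SharedStructure *s)
    {
        b->timestamp = s->clock;
        b->valid = true;
    }

    /* a read may use the cached copy only if it is not stale */
    bool cached_copy_usable(const CacheBlock *b, const SharedStructure *s)
    {
        return b->valid && b->timestamp >= s->clock;
    }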

85
Synchronization Event Ordering
  • Mutual exclusion required in many parallel
    algorithms
  • Monitors
  • Semaphores
  • All high-level schemes are based on low-level
    synchronization tools
  • Atomic test-and-set common in shared memory
    multiprocessor
  • Needs to take account of cache
  • Minimum traffic generated while waiting
  • Low latency release of a waiting processor
  • Low latency acquisition of a free lock
  • Typically work well on small bus based machines

86
Synchronization with test-and-set
  • Lock variable
  • Open
  • Closed
  • Acquire lock
  •   char lock;
  •   while (exchange(lock, CLOSED) == CLOSED) ;
  • Release lock
  •   lock = OPEN;

87
Cache states after Pi successfully executed
testset on lock
[Figure: Pi's exchange loads the lock into Ci and writes it; the lock is now dirty in Ci, and the copy in memory (and any other cache) is stale/invalid]
88
Bus commands when Pj executes testset on lock
and cache states after
[Figure: Pj's exchange causes (1) Read-Blk(lock) on the bus, (2) the block (lock, closed) is supplied to Cj, and (3) Pj's write broadcasts Invalidate(lock); the lock is now dirty in Cj and invalid in Ci]
89
Cache states after Pk executed testset on lock
[Figure: Pk's exchange repeats the same bus sequence; the (closed) lock migrates to Ck as dirty and the copies in Ci and Cj are invalidated]
90
Busy waiting with cache coherence
  • An indivisible test-and-set instruction requires
    write access to the lock
  • It causes the processor doing the test-and-set to acquire
    the variable in its cache, invalidating all other copies
  • When multiple processors spin on the lock
  • Each one tries to acquire the variable in its cache
  • This causes cache thrashing
  • Instead use a snooping lock
  • Spin on a plain test without the indivisible test-and-set
  • Only attempt the exchange once the lock is seen OPEN

91
Efficient algorithm for locking
  • while (exchange(lock, CLOSED) == CLOSED)
  •     while (lock == CLOSED) ;
  • The first while claims the lock and closes it if it is OPEN
  • But if it is already CLOSED, control passes to the second
    loop
  • That loop continuously reads the lock from the cache
  • No bus traffic during this phase
  • When the lock becomes OPEN, try the test-and-set again

92
Test and test-and-set
  • Even more efficient to test the lock before
    trying to set it
  • for (;;) {
  •     while (lock == CLOSED) ;
  •     if (exchange(lock, CLOSED) != CLOSED)
  •         break;
  •   }
  • Introduces extra latency for unused locks
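A hedged, concrete version of this test-and-test-and-set loop using C11 atomics (my own sketch; spin_lock/spin_unlock and the OPEN/CLOSED encoding are illustrative, not from the book):

    #include <stdatomic.h>

    /* 0 = OPEN, 1 = CLOSED */
    typedef atomic_int spinlock_t;

    static void spin_lock(spinlock_t *lock)
    {
        for (;;) {
            /* test: spin on ordinary cached reads, generating no bus writes */
            while (atomic_load_explicit(lock, memory_order_relaxed) == 1)
                ;
            /* test-and-set: one atomic exchange to try to claim the lock */
            if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
                break;                 /* it was OPEN, we now own it */
        }
    }

    static void spin_unlock(spinlock_t *lock)
    {
        atomic_store_explicit(lock, 0, memory_order_release);   /* lock = OPEN */
    }

A caller brackets its critical section with spin_lock(&l) and spin_unlock(&l); while the lock is held by someone else, waiting processors spin in their own caches and only touch the bus when the release invalidates or updates their copies.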

93
Lock implementation on scalable multiprocessors
  • New York Ultracomputer and IBM RP3 implemented
  • Fetch-and-add
  • Fetch-and-add
  • Atomic operation
  • All memory modules augmented with adder circuit
  • fetch-and-add(x, a)
  •   int x, a;
  •   { int temp;
  •     temp = x;
  •     x = x + a;
  •     return (temp); }

94
Example of fetch-and-add
  • Suppose we want to implement parallel loop
  • DOALL N = 1 to 1000
  •   <loop body using N>
  • ENDDO
  • Suppose we want to allocate iterations to processors
    dynamically
  • N = 0
  • i = fetch-and-add(N, 1)
  • while (i < 1000) {
  •   loop_body(i)
  •   i = fetch-and-add(N, 1)
  • }
  • Regardless of how many processors execute the
    loop
  • Each processor will get a different value of i.
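The same self-scheduling idiom can be sketched with C11 atomics and pthreads (my own illustration, not code from the book; atomic_fetch_add plays the role of fetch-and-add and all names are invented):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITERATIONS 1000
    #define NTHREADS   4

    static atomic_int next_index;                 /* the shared counter N */

    static void loop_body(int i) { (void)i; /* ... work for iteration i ... */ }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int i = atomic_fetch_add(&next_index, 1);   /* grab the next iteration */
            if (i >= ITERATIONS)
                break;
            loop_body(i);              /* every thread sees distinct values of i */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int k = 0; k < NTHREADS; ++k)
            pthread_create(&t[k], NULL, worker, NULL);
        for (int k = 0; k < NTHREADS; ++k)
            pthread_join(t[k], NULL);
        printf("all %d iterations done\n", ITERATIONS);
        return 0;
    }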

95
Fetch-and-add
  • Fetch-and-add automatically allocates loop
    indexes in this example
  • But location N becomes a hotspot.
  • Combining network described before will not work
    correctly without modification.
  • Same value is returned from a read operation
  • Change each switch element so it can implement
    the fetch and add operation.
  • Distributed operation without hotspots

96
Forward propagation of fetch-and-add
[Figure: fetch-and-add requests travelling towards memory are combined at each switch stage: pairs of F&A(N,1) requests are merged into F&A(N,2), pairs of F&A(N,2) into F&A(N,4), and so on, so memory receives a single F&A(N,8). In the example N = 1 beforehand; memory returns 1 and N becomes 9]
97
Back propagation of fetch-and-add
[Figure: as the reply propagates back through the switches, each switch returns the value it received to one of the two requests it combined and that value plus the recorded partial sum to the other, so the eight processors obtain the eight distinct values 1..8]
98
Event ordering in cache coherent systems
  • Programmers usually assume sequential consistency
  • Consider
  • P1: ... store A, ...
  • P2: ... load A, store B, ...
  • P3: ... load B, load A
  • P3 expects the values of A and B to be consistent: if it
    sees the new value of B it should also see the new value of A
  • BUT
  • May not occur because invalidation messages from
    P1 may reach P2 before P3.
  • This occurs only because of cache
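The example can be written as three threads (an illustrative sketch of the hazard, not code from the book; it deliberately uses plain, non-atomic shared variables, so a sufficiently weak memory system, or the compiler itself, may let thread 3 see the new B but the old A):

    #include <pthread.h>
    #include <stdio.h>

    static int A, B;                              /* shared, deliberately not atomic */

    static void *p1(void *x) { (void)x; A = 1;             return NULL; }
    static void *p2(void *x) { (void)x; if (A == 1) B = 1; return NULL; }
    static void *p3(void *x)
    {
        (void)x;
        int b = B;                                /* load B ...             */
        int a = A;                                /* ... then load A        */
        if (b == 1 && a == 0)                     /* impossible under       */
            printf("saw new B but old A\n");      /* sequential consistency */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2, t3;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_create(&t3, NULL, p3, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
        return 0;
    }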

99
Figure 18.35 classification of shared writable
variable accesses
100
Figure 18.36 (a) weak consistency
[Figure 18.36(a): operation order under weak consistency:
  Request(L1); Load/Store ... Load/Store; Release(L1); Request(L2); Load/Store ... Load/Store; Release(L2)]
101
Figure 18.36 (b) release consistency
[Figure 18.36(b): under release consistency, sections guarded by different locks may overlap:
  Request(L1); Load/Store ... Load/Store; Request(L2); Load/Store ... Load/Store; Request(L3); Release(L1); Load/Store ... Load/Store; Release(L2); Release(L3)]
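A loose modern analogy (my own example, not from the book): C11 acquire/release atomics order ordinary loads and stores only with respect to the synchronization operations, much as release consistency only constrains accesses around Request and Release:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static int shared_data;

    void critical_update(int v)
    {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                                                     /* Request(L) */
        shared_data = v;                                          /* Load/Store */
        atomic_flag_clear_explicit(&lock, memory_order_release);  /* Release(L) */
    }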
102
Figure 18.36 (c) load and store operations can be
executed in any order
Load/Store ... Load/Store (no ordering constraints)
103
A quick tour of some UMA machines
104
Some real UMA machines
105
Structure of the Hector machine - NUMA
[Figure: stations, each behind a station controller, are attached to local rings; inter-ring interfaces join the local rings to a global ring. Continued on the next slide]
106
Structure of the Hector machine (cont.)
[Figure: inside a station, processor modules and an I/O module sit on the station bus behind the station controller; a processor module contains the processor with its cache, memory and a station bus interface, and the I/O module has adaptors for a display, ethernet and disks]
107
Structure of the Cray T3D system NUMA
[Figure: the T3D's processor network is attached through I/O clusters to a Cray Y-MP host and to tape drives, disks, workstations and external networks]
108
Design space of CC-NUMA machines
109
Structure of the Wisconsin multicube machine
[Figure: a 4 x 4 grid of processor-plus-cache nodes; every node sits on one row bus and one column bus, and a memory module is attached to each column bus. P = Processor, C = Cache, M = Memory]
110
Mechanism of reading block X in state modified
[Figure: P00 issues read(X); the request travels over the row and column buses to the cache that holds X in state modified, which returns value(X) to the requester and writes the block back to memory with Mem-write(X, unmodified)]
111
The Stanford Dash interconnection network
[Figure: DASH clusters (11, 12, 13, 21, 22, 23) connected by a two-dimensional mesh]
112
Structure of a cluster
[Figure: a DASH cluster: processors with caches (Pi/Ci) and memory on a shared bus, plus the directory and inter-cluster interface and an I/O interface]
113
Processor level
[Figure: Load X hits in the processor's own cache. Access time: 1 clock]
114
Local cluster level
[Figure: Load X misses in the local cache and is satisfied within the local cluster over the cluster bus. Access time: 30 clocks]
115
Home cluster level
[Figure: Load X misses in the local cluster C1i; the request crosses the interconnection network to the home cluster C1j, whose memory supplies X. Access time: 100 clocks]
116
Remote cluster level
[Figure, part 1 of 2: Load X misses in the local cluster C1i (1) and the request is sent through the directory logic (DLi, DLj) towards the home cluster; the reply eventually returns to the requester (5). DL = directory logic. Access time: 135 clocks. Continued on the next slide]
117
Remote cluster level (cont.)
[Figure, part 2 of 2: the home cluster C1m finds X dirty in a remote cluster C1k, so it forwards a Read-Req there (2, 3); the remote cache supplies the block with a Read-Rply to the requester (4) and sends a Sharing-Writeback to the home memory. DL = directory logic. Access time: 135 clocks]
118
Structure of the DASH directory
[Figure: the directory hardware comprises a directory controller (DC) board and a reply controller (RC) board, a pseudo-CPU (PCPU) and a performance monitor; they connect to the cluster address/control and data buses and to request and reply mesh routers in the X and Y dimensions]
119
Sequence of actions in a store operation
requiring remote service
[Figure, part 1 of 2: Pi executes Store X; a read-exclusive request travels via the directory logic (DLi, DLj) towards the home cluster, and the requesting cluster later receives the read-exclusive data and an Inv-Ack. DL = directory logic. Continued on the next slide]
120
Sequence of actions in a store operation
requiring remote service (cont.)
[Figure, part 2 of 2: the home cluster C1m finds X in state shared; it returns a Read-Ex Rply to the requester and sends Inv-Req messages to the sharing cluster C1k, whose cache invalidates its copy. DL = directory logic]
121
Structure of the FLASH machine
[Figure: a FLASH node: a microprocessor with its 2nd-level cache and DRAM connected through the MAGIC node controller to the network and I/O]
122
Structure of the MAGIC node controller
[Figure: messages arriving from the processor, the network and I/O pass through a message split unit that separates headers from data; headers go to the control pipeline, data to the data transfer logic (which also reaches memory), and outgoing messages are reassembled and sent back to the processor, network and I/O ports]
123
Structure of the control micropipeline
[Figure: in the control pipeline, headers from the processor interface (PI), network interface (NI) and I/O enter the inbox (with a software queue head in memory control), are handled by the protocol processor using its own MAGIC instruction and data caches, and leave through the outbox back to the PI, NI and I/O]
124
Convex exemplar architecture
[Figure: one Exemplar hypernode: CPU1..CPU8, each with a 2 MB cache, connect through agents to a 5 x 5 crossbar (1.25 Gbytes/sec); four cache/memory controllers, each with 512 MB of memory, and an I/O subsystem also attach to the crossbar. Hypernodes 1..16 are linked by Scalable Coherent Interface rings (600 Mbyte/sec each)]
125
Parallel matrix multiply code for cache-coherent
machine
      global c(idim, idim), a(idim, idim)
      global b(idim, idim), nCPUs
      private i, j, k, itid
      call spawn ( nCPUs )
      do j = 1, idim
        if ( mod(j, nCPUs) .eq. itid ) then
          do i = 1, idim
            c(i, j) = 0.0
            do k = 1, idim
              c(i, j) = c(i, j) + a(i, k) * b(k, j)
            enddo
          enddo
        endif
      enddo
      call join
126
Parallel matrix multiply code for
non-cache-coherent machine
      global c(idim, idim), a(idim, idim)
      global b(idim, idim), nCPUs
      private i, j, k, itid, tmp
      semaphore is(idim, idim)
      call spawn ( nCPUs )
      do j = 1, idim
        if ( mod(j, nCPUs) .eq. itid ) then
          do i = 1, idim
            tmp = 0.0
            do k = 1, idim
              call flush (a(i, k))
              call flush (b(k, j))
              tmp = tmp + a(i, k) * b(k, j)
            enddo
127
Parallel matrix multiply code for
non-cache-coherent machine (cont.)
            call lock (c(i, j), is(i, j))
            c(i, j) = tmp
            call flush (c(i, j))
            call unlock (c(i, j), is(i, j))
          enddo
        endif
      enddo
      call join
128
Structure of the basic Data Diffusion Machine
(DDM)
[Figure: the basic DDM: processors share an attraction memory (state + data memory, a controller implementing an 'above' and a 'below' protocol, and an output buffer); the attraction memories are connected by the DDM bus, at the top of which sit arbitration/selection and the top protocol]
129
State transition diagram of the attraction memory
protocol
[State transition diagram over the attraction-memory states I, E, S, R, W and RW, with arcs labelled in-transaction/out-transaction such as Pread/Nread, Ndata, Nerase, Nread/Ndata, Pwrite/Nerase, Ndata/Nerase, Nexclusive and Nerase/Nread.
Notation: in-transaction/out-transaction; P = from processor, N = from network; E = Exclusive, I = Invalid, R = Reading, RW = Reading-and-waiting, S = Shared, W = Waiting]
130
Structure of the directory unit
[Figure: a directory unit sits between a lower and a higher DDM bus; it contains the directory (state memory), a controller with 'above' and 'below' protocols, and input and output buffers towards both buses]
131
Write race in the hierarchical DDM
[Figure: processors Pi and Pj both try to write a block that Pk holds shared; their erase requests race up the hierarchy of attraction memories (AM) and directories (D), leaving waiting (W) states behind. The request that reaches the top first wins and receives the exclusive acknowledge; the loser's erase request is turned back and re-issued.
I = Invalid, W = Waiting, S = Shared, P = Processor, AM = Attraction Memory, D = Directory]
132
Read operations by processor B
[Figure: a matrix with a column for each local cache (A, B, C, ...) and a row for each cache subpage; entries record the state of each copy (EO = exclusive owner, NO = non-exclusive owner, C = copy). After B's reads, B's column gains copies and some former EO entries become NO]
133
Write operations by processor C
[Figure: the same matrix after write operations by processor C: the written subpages become exclusively owned (EO) in C's column, and the copies in the other caches are invalidated (entries marked CI, NOI, EOI)]
134
The hierarchical structure of the Kendall Square
Research (KSR1) machine - COMA
[Figure: the KSR1 hierarchy: processors, each with a local cache and local cache directory, sit on Ring 0 (ALLCACHE Group 0); Ring 0 directories connect the Ring 0 rings to Ring 1 (ALLCACHE Group 1). Requester 1 / Responder 1 show a request satisfied within a ring; Requester 2 / Responder 2 show one resolved through the Ring 0 directory at the next level]
135
The convergence of scalable MIMD computers
[Figure: the convergence of scalable MIMD computers.
  1st generation: distributed memory - hypercube (store-and-forward routing); shared memory - multistage network (no cache consistency) for scalable machines and a shared bus with snoopy caches for small ones
  2nd generation: distributed memory - mesh (wormhole routing); shared memory - NUMA (no cache consistency)
  3rd generation: distributed memory - processor + communication processor + router; shared memory - CC-NUMA and COMA (cluster concept)
  4th generation: multi-threaded computers - multi-threaded processor + communication processor + router + cache directory]