Title: Scalable Multiprocessors III
1. Scalable Multiprocessors (III)
2. Spectrum of Designs
- None: physical bit stream
  - blind, physical DMA (nCUBE, iPSC, ...)
- User/System
  - User-level port (CM-5, *T)
  - User-level handler (J-Machine, Monsoon, ...)
- Dedicated processor
  - Message passing, remote virtual address
  - Processing, translation (Paragon, Meiko CS-2)
- Global physical address
  - Processor + memory controller (RP3, BBN, T3D)
- Cache-to-cache
  - Cache controller (Dash, KSR, Flash)
Increasing HW support, specialization, intrusiveness, performance (???)
3. Dedicated Message Processing
- Without binding the interpretation in the hardware design: interpretation is done by software
- Off-loads protocol processing to the communication processor (CP)
- Can support a global address space
4. Without Specialized Hardware Design
- General-purpose processor performs arbitrary output processing (at system level)
- General-purpose processor interprets incoming network transactions (at system level)
- User processor and message processor share memory
- Message processor to message processor: via system network transaction
5. Levels of Network Transaction
[Figure: user and system levels of a network transaction -- user information plus a destination travels from memory through the NI on the source node, across the network, to the NI and memory of the destination node.]
- User processor stores cmd / msg / data into a shared output queue
  - must still check for output queue full (or make it elastic)
- Communication assists make the transaction happen
  - checking, translation, scheduling, transport, interpretation
- Effect observed in the destination address space and/or as events
- Protocol divided between the two layers
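The first bullet's queue discipline can be sketched in C. This is a minimal single-producer sketch, not any machine's actual interface; all names (`oq_t`, `oq_post`) and sizes are invented for illustration:

```c
/* Sketch of a user-level shared output queue in the spirit of slide 5.
 * The user processor advances tail; the communication assist advances head. */
#include <stdint.h>
#include <string.h>

#define OQ_SLOTS 8            /* power of two: cheap modulo */

typedef struct { uint32_t dest; uint32_t len; uint8_t data[64]; } oq_entry_t;

typedef struct {
    volatile uint32_t head;   /* advanced by the communication assist */
    volatile uint32_t tail;   /* advanced by the user processor */
    oq_entry_t slot[OQ_SLOTS];
} oq_t;

/* Returns 0 on success, -1 if the queue is full (caller must retry or back off). */
int oq_post(oq_t *q, uint32_t dest, const void *msg, uint32_t len)
{
    if (q->tail - q->head == OQ_SLOTS)   /* must still check for queue full */
        return -1;
    oq_entry_t *e = &q->slot[q->tail % OQ_SLOTS];
    e->dest = dest;
    e->len  = len;
    memcpy(e->data, msg, len);
    q->tail++;                /* makes the entry visible to the assist */
    return 0;
}
```

The unsigned `tail - head` comparison stays correct across counter wraparound, which is why the indices are free-running rather than wrapped on each update.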
6. Example: Intel Paragon
[Figure: Paragon node and network organization -- i860XP compute and message processors (50 MHz, 16 KB 4-way cache, 32 B blocks, MESI) on a 400 MB/s memory bus; sDMA/rDMA engines; NI with 2048 B variable-data packets delimited by EOP; 175 MB/s duplex links into the network; service network and I/O nodes with attached devices.]
7. User-Level Abstraction
[Figure: each process sees an input queue (IQ), an output queue (OQ), and its virtual address space (VAS).]
- Any user process can post a transaction for any other in its protection domain
  - the communication layer moves OQ_src -> IQ_dest
  - may involve indirection: VAS_src -> VAS_dest
8. Basic Implementation Costs: Scalar
[Figure: 10.5 µs end-to-end path for a 7-word message -- P -> CP -> MP -> Net -> MP -> CP -> P, with per-stage costs of roughly 1.5-2 µs each, passing through registers, cache, user OQ/IQ, and the net FIFO (individual accesses on the order of 250 ns and 40 ns).]
- Cache-to-cache transfer (quad-word ops)
  - producer and consumer cache misses/hits -> bus transactions
- To NI FIFO: read status, check, write, ...
- From NI FIFO: read status, check, dispatch, read, read, ...
9. Virtual DMA -> Virtual DMA
- Send MP segments the transfer into 8 KB pages and does VA -> PA translation
- Receive MP reassembles, and does dispatch and VA -> PA translation per page
10. Single-Page Transfer Rate
- Effective buffer size: 3232
- Actual buffer size: 2048
11. Case Study: Meiko CS-2 Concept
- Asymmetric CP
- Circuit-switched network transaction
  - source-destination circuit held open for request and response
  - limited command set executed directly on the NI
- Dedicated communication processor for each step in the flow
12. Case Study: Meiko CS-2 Organization
[Figure: Elan microprocessor organization -- processors issue transactions (write_block, cmd); a thread processor with a RISC instruction set runs 64-K non-preemptive threads (50-µs limit) that construct arbitrary network transactions; DMA engine, reply thread, and output protocol connect the memory to the network.]
13. Spectrum of Designs
- None: physical bit stream
  - blind, physical DMA (nCUBE, iPSC, ...)
- User/System
  - User-level port (CM-5, *T)
  - User-level handler (J-Machine, Monsoon, ...)
- Dedicated processor
  - Message passing, remote virtual address
  - Processing, translation (Paragon, Meiko CS-2)
- Global physical address
  - Processor + memory controller (RP3, BBN, T3D)
- Cache-to-cache
  - Cache controller (Dash, KSR, Flash)
Increasing HW support, specialization, intrusiveness, performance (???)
14. Shared Physical Address Space
[Figure: pseudo-memory module and pseudo-processor -- on the source, output processing turns a memory access into a network request (Data, Tag, Rrsp, Src); the communication assist at the destination parses it and performs the memory access through P/MMU/Mem; the response travels back and input processing completes the read.]
- NI emulates a memory controller at the source
- NI emulates a processor at the destination
- Must be deadlock free
15. Case Study: Cray T3D
- 300 MB/s network links
- A shell of support circuitry embodies the parallel processing capability
- Remote memory operations are encoded in the address
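The last bullet's idea can be shown with a toy address encoding. This is illustration only: the T3D's actual annex/segment mechanism is more involved, and the field widths below are invented, not the T3D's:

```c
/* Sketch of "remote memory operations encoded in the address" (slide 15):
 * pack a destination PE number into the upper bits of an address so an
 * ordinary load/store can name remote memory. Bit layout is assumed. */
#include <stdint.h>

#define PE_SHIFT 32u                    /* assumed: PE number above bit 31 */

static inline uint64_t remote_addr(uint32_t pe, uint32_t local_off)
{
    return ((uint64_t)pe << PE_SHIFT) | local_off;
}

static inline uint32_t addr_pe(uint64_t a)     { return (uint32_t)(a >> PE_SHIFT); }
static inline uint32_t addr_offset(uint64_t a) { return (uint32_t)a; }
```

With this scheme the support circuitry can route a memory reference by inspecting the high bits, without any explicit send/receive call by the processor.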
16. Case Study: Cray T3D
- No L2 cache
- Local memory access: 155 ns (23 cycles), vs. 300 ns on a DEC Alpha workstation
- Single blocking remote write: 900 ns, plus annex setup and address arithmetic
- Special support for synchronization
  - dedicated network supports global-OR and global-AND operations
  - atomic swap and fetch&inc
- User-level message queue
  - involves a remote kernel trap
  - enqueue 25 µs, invoking 75 µs
- Small messages
  - using fetch-and-inc
  - enqueue 3 µs, dequeue 1.5 µs
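The fast small-message path above can be sketched with a fetch&inc-based queue: the sender claims a slot with one atomic fetch-and-increment instead of trapping into the remote kernel. This is a single-node stand-in using C11 atomics (all names invented); on the T3D the fetch&inc would be a remote hardware operation:

```c
/* Sketch of slide 16's fetch&inc small-message queue. */
#include <stdatomic.h>
#include <stdint.h>

#define QSLOTS 16

typedef struct {
    atomic_uint tail;                 /* fetch&inc target */
    uint32_t    slot[QSLOTS];
    atomic_uint full[QSLOTS];         /* slot-valid flags for the consumer */
} miniq_t;

void miniq_enqueue(miniq_t *q, uint32_t msg)
{
    unsigned i = atomic_fetch_add(&q->tail, 1) % QSLOTS; /* claim a slot */
    q->slot[i] = msg;
    atomic_store(&q->full[i], 1);     /* publish the slot to the dequeuer */
}

/* Polls slot `head`; returns 1 and fills *out if a message is present. */
int miniq_dequeue(miniq_t *q, unsigned head, uint32_t *out)
{
    unsigned i = head % QSLOTS;
    if (!atomic_load(&q->full[i])) return 0;
    *out = q->slot[i];
    atomic_store(&q->full[i], 0);
    return 1;
}
```

Because slot reservation is one atomic operation, multiple senders can enqueue concurrently without locks, which is what drops the enqueue cost from 25 µs to about 3 µs on the T3D.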
17. Clusters and NOW
- Cluster
  - collections of complete computers with dedicated interconnects
- Types of clusters
  - Older systems
    - Availability clusters
    - Multiprogramming clusters: VAX VMS Clusters
  - New systems, mainly used as parallel machines
    - High-performance clusters: Beowulf
    - Load-leveling clusters: Mosix
    - Web-service clusters: Linux Virtual Server
    - Storage clusters: GFS and OpenGFS
    - Database clusters: Oracle Parallel Server
    - High-availability clusters: FailSafe, Heartbeat
    - SSI clusters: OpenSSI cluster project
18. Technology Breakthrough
- Scalable, low-latency interconnects
- Traditional local area networks
  - shared bus: Ethernet
  - ring: Token Ring and FDDI
- Scalable bandwidth
  - Switch-based local area networks
    - HiPPI switches, FDDI switches, and Fibre Channel
    - ATM
    - Fast Ethernet and Gigabit Ethernet
  - System area networks
    - ServerNet (Tandem Corp.)
    - Myrinet
      - switch: 8 ports at 160 MB/s each
    - InfiniBand
19. Issues
- Communication abstractions
  - TCP/IP
  - Active Messages: user-level network transactions
  - Reflective memory: Shrimp project
  - VIA: led by Intel, Microsoft, and Compaq
- Hardware support for the communication assist
  - memory bus vs. I/O bus
- Node architecture
  - single processor vs. SMP
20. Case Study: NOW
- General-purpose processor embedded in the NIC
21. Reflective Memory
[Figure: nodes i, j, and k map transmit (T0-T3) and receive (R0-R3) regions of their virtual address spaces onto physical pages; a write to a transmit region on one node appears, via the I/O system, in the matching receive region on the other nodes.]
- Writes to a local transmit region are reflected to the remote receive region
- A form of memory-based message passing
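A toy model of the reflection semantics (invented API): a store into a local transmit region is propagated to every node that maps a matching receive region. Real systems such as DEC Memory Channel do this in the interconnect hardware via page mappings; here it is just a loop over node-local copies:

```c
/* Toy model of slide 21's reflective memory. */
#include <stdint.h>

#define NODES 3
#define REGION_WORDS 8

static uint32_t region[NODES][REGION_WORDS];  /* one copy per node */

/* A write to a mapped word is "reflected" into every node's copy. */
void reflect_write(int word, uint32_t val)
{
    for (int n = 0; n < NODES; n++)
        region[n][word] = val;   /* local write + remote reflections */
}
```

The key property this models is that the writer performs one ordinary store; all propagation is hidden behind the memory mapping, with no explicit receive operation on the remote nodes.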
22. Case Study: DEC Memory Channel
- PCT: page control table
23. Implications for Parallel Software and Synchronization
24. Communication Performance
- Microbenchmarks
  - the basic network transactions on a user-to-user basis
  - Active Messages
  - shared address space
  - standard MPI message passing
- Application level
  - read Section 7.8.4 in the textbook
25. Message Time Breakdown
[Figure: time of a message -- source processor overhead, source communication assist occupancy, observed network latency, destination communication assist occupancy, and destination processor overhead add up to the total communication latency.]
- The end-to-end message time = round-trip time / 2
  - measured by the source processor (no global clock)
- Overhead cannot be used for useful computation
- Latency can potentially be masked by other useful work
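The decomposition above can be written as a small additive model. The five components follow the breakdown in the figure caption; the concrete numbers below are placeholders, not measurements from any of the machines discussed:

```c
/* Back-of-envelope model of slide 25: end-to-end time = send overhead
 * + communication-assist occupancy on each side + network delay
 * + receive overhead. */
typedef struct {
    double o_send, ca_src, net, ca_dst, o_recv;   /* all in microseconds */
} msg_time_t;

double total_latency_us(msg_time_t t)
{
    return t.o_send + t.ca_src + t.net + t.ca_dst + t.o_recv;
}
```

Only `o_send` and `o_recv` consume processor cycles; the three middle terms are latency that a program can in principle overlap with other useful work.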
26. Message Time Comparison
[Figure: bar chart of per-message communication time (microseconds, 0-14) for CM-5, Paragon, Meiko CS-2, T3D, and NOW Ultra, split into sending-side processing overhead (O), latency (L), and receiving-side processing overhead (O), shown per pipelined message and per 1/2 request-response operation; annotations mark "accessing system memory" and "uncached read over I/O bus".]
- One-way Active Message time (five-word message)
27. Performance Analysis
- Send overhead
  - CM-5: uncached writes of data and uncached reads of NI status
  - Paragon: bus-based cache-coherence protocol within the node
  - Meiko: a pointer is enqueued in the NI with a single swap instruction, but swap is very slow
  - -> the cost of uncached operations, synchronization, and misses is critical to communication performance
- Receive overhead
  - cache-to-cache transfer: Paragon
  - uncached transfer: CM-5, CS-2 -- faster than cache-to-cache
  - NOW: uncached read over the I/O bus
28. Performance Analysis (cont'd)
- Latency
  - CA occupancy, channel occupancy, network delay
  - CM-5: 20 MB/s links
    - channel occupancy
  - Paragon: 175 MB/s
    - CP occupancy
  - Meiko: 40 MB/s
    - accessing system memory from the CP
29. SAS Time Comparison
- Performance of a remote read