Title: Scalable Multiprocessors III
1. Scalable Multiprocessors (III)
2. Spectrum of Designs
- None: physical bit stream
  - blind, physical DMA (nCUBE, iPSC, ...)
- User/System
  - User-level port (CM-5, *T)
  - User-level handler (J-Machine, Monsoon, ...)
- Dedicated processor
  - Message passing, remote virtual address
  - Processing, translation (Paragon, Meiko CS-2)
- Global physical address
  - Processor + memory controller (RP3, BBN, T3D)
- Cache-to-cache
  - Cache controller (Dash, KSR, Flash)
Increasing HW support, specialization, intrusiveness, performance (???)
3. Dedicated Message Processing
- Without binding the interpretation in the hardware design: interpretation is done by software
- Off-loads protocol processing to the communication processor (CP)
- Can support a global address space
4. Without Specialized Hardware Design
- General-purpose processor performs arbitrary output processing (at system level)
- General-purpose processor interprets incoming network transactions (at system level)
- User processor and message processor share memory
- Message processor to message processor: via system network transaction
5. Levels of Network Transaction
[Figure: user and system levels of a network transaction -- user information plus a destination travels from memory through the NI on the source node, across the network, to the NI and memory of the destination node.]
- User processor stores cmd / msg / data into a shared output queue
  - must still check for output queue full (or make it elastic)
- Communication assists make the transaction happen
  - checking, translation, scheduling, transport, interpretation
- Effect observed in the destination address space and/or as events
- Protocol divided between the two layers
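The first bullet's queue discipline can be sketched in C. This is a minimal single-producer sketch, not any machine's actual interface; all names (`oq_t`, `oq_post`) and sizes are invented for illustration:

```c
/* Sketch of a user-level shared output queue in the spirit of slide 5.
 * The user processor advances tail; the communication assist advances head. */
#include <stdint.h>
#include <string.h>

#define OQ_SLOTS 8            /* power of two: cheap modulo */

typedef struct { uint32_t dest; uint32_t len; uint8_t data[64]; } oq_entry_t;

typedef struct {
    volatile uint32_t head;   /* advanced by the communication assist */
    volatile uint32_t tail;   /* advanced by the user processor */
    oq_entry_t slot[OQ_SLOTS];
} oq_t;

/* Returns 0 on success, -1 if the queue is full (caller must retry or back off). */
int oq_post(oq_t *q, uint32_t dest, const void *msg, uint32_t len)
{
    if (q->tail - q->head == OQ_SLOTS)   /* must still check for queue full */
        return -1;
    oq_entry_t *e = &q->slot[q->tail % OQ_SLOTS];
    e->dest = dest;
    e->len  = len;
    memcpy(e->data, msg, len);
    q->tail++;                /* makes the entry visible to the assist */
    return 0;
}
```

The unsigned `tail - head` comparison stays correct across counter wraparound, which is why the indices are free-running rather than wrapped on each update.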
6. Example: Intel Paragon
[Figure: Paragon node and network organization -- i860XP compute and message processors (50 MHz, 16 KB 4-way cache, 32 B blocks, MESI) on a 400 MB/s memory bus; sDMA/rDMA engines; NI with 2048 B variable-data packets delimited by EOP; 175 MB/s duplex links into the network; service network and I/O nodes with attached devices.]
7. User-Level Abstraction
[Figure: each process sees an input queue (IQ), an output queue (OQ), and its virtual address space (VAS).]
- Any user process can post a transaction for any other in its protection domain
  - the communication layer moves OQ_src -> IQ_dest
  - may involve indirection: VAS_src -> VAS_dest
8. Basic Implementation Costs: Scalar
[Figure: 10.5 µs end-to-end path for a 7-word message -- P -> CP -> MP -> Net -> MP -> CP -> P, with per-stage costs of roughly 1.5-2 µs each, passing through registers, cache, user OQ/IQ, and the net FIFO (individual accesses on the order of 250 ns and 40 ns).]
- Cache-to-cache transfer (quad-word ops)
  - producer and consumer cache misses/hits -> bus transactions
- To NI FIFO: read status, check, write, ...
- From NI FIFO: read status, check, dispatch, read, read, ...
9. Virtual DMA -> Virtual DMA
- Send MP segments the transfer into 8 KB pages and does VA -> PA translation
- Receive MP reassembles, and does dispatch and VA -> PA translation per page
10. Single-Page Transfer Rate
- Effective buffer size: 3232
- Actual buffer size: 2048
11. Case Study: Meiko CS-2 Concept
- Asymmetric CP
- Circuit-switched network transaction
  - source-destination circuit held open for request and response
  - limited command set executed directly on the NI
- Dedicated communication processor for each step in the flow
12. Case Study: Meiko CS-2 Organization
[Figure: Elan microprocessor organization -- processors issue transactions (write_block, cmd); a thread processor with a RISC instruction set runs 64-K non-preemptive threads (50-µs limit) that construct arbitrary network transactions; DMA engine, reply thread, and output protocol connect the memory to the network.]
13. Spectrum of Designs
- None: physical bit stream
  - blind, physical DMA (nCUBE, iPSC, ...)
- User/System
  - User-level port (CM-5, *T)
  - User-level handler (J-Machine, Monsoon, ...)
- Dedicated processor
  - Message passing, remote virtual address
  - Processing, translation (Paragon, Meiko CS-2)
- Global physical address
  - Processor + memory controller (RP3, BBN, T3D)
- Cache-to-cache
  - Cache controller (Dash, KSR, Flash)
Increasing HW support, specialization, intrusiveness, performance (???)
14. Shared Physical Address Space
[Figure: pseudo-memory module and pseudo-processor -- on the source, output processing turns a memory access into a network request (Data, Tag, Rrsp, Src); the communication assist at the destination parses it and performs the memory access through P/MMU/Mem; the response travels back and input processing completes the read.]
- NI emulates a memory controller at the source
- NI emulates a processor at the destination
- Must be deadlock free
15. Case Study: Cray T3D
- 300 MB/s network links
- A shell of support circuitry embodies the parallel processing capability
- Remote memory operations are encoded in the address
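The last bullet's idea can be shown with a toy address encoding. This is illustration only: the T3D's actual annex/segment mechanism is more involved, and the field widths below are invented, not the T3D's:

```c
/* Sketch of "remote memory operations encoded in the address" (slide 15):
 * pack a destination PE number into the upper bits of an address so an
 * ordinary load/store can name remote memory. Bit layout is assumed. */
#include <stdint.h>

#define PE_SHIFT 32u                    /* assumed: PE number above bit 31 */

static inline uint64_t remote_addr(uint32_t pe, uint32_t local_off)
{
    return ((uint64_t)pe << PE_SHIFT) | local_off;
}

static inline uint32_t addr_pe(uint64_t a)     { return (uint32_t)(a >> PE_SHIFT); }
static inline uint32_t addr_offset(uint64_t a) { return (uint32_t)a; }
```

With this scheme the support circuitry can route a memory reference by inspecting the high bits, without any explicit send/receive call by the processor.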
16. Case Study: Cray T3D
- No L2 cache
- Local memory access: 155 ns (23 cycles), vs. 300 ns on a DEC Alpha workstation
- Single blocking remote write: 900 ns, plus annex setup and address arithmetic
- Special support for synchronization
  - dedicated network supports global-OR and global-AND operations
  - atomic swap and fetch&inc
- User-level message queue
  - involves a remote kernel trap
  - enqueue 25 µs, invoking 75 µs
- Small messages
  - using fetch-and-inc
  - enqueue 3 µs, dequeue 1.5 µs
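The fast small-message path above can be sketched with a fetch&inc-based queue: the sender claims a slot with one atomic fetch-and-increment instead of trapping into the remote kernel. This is a single-node stand-in using C11 atomics (all names invented); on the T3D the fetch&inc would be a remote hardware operation:

```c
/* Sketch of slide 16's fetch&inc small-message queue. */
#include <stdatomic.h>
#include <stdint.h>

#define QSLOTS 16

typedef struct {
    atomic_uint tail;                 /* fetch&inc target */
    uint32_t    slot[QSLOTS];
    atomic_uint full[QSLOTS];         /* slot-valid flags for the consumer */
} miniq_t;

void miniq_enqueue(miniq_t *q, uint32_t msg)
{
    unsigned i = atomic_fetch_add(&q->tail, 1) % QSLOTS; /* claim a slot */
    q->slot[i] = msg;
    atomic_store(&q->full[i], 1);     /* publish the slot to the dequeuer */
}

/* Polls slot `head`; returns 1 and fills *out if a message is present. */
int miniq_dequeue(miniq_t *q, unsigned head, uint32_t *out)
{
    unsigned i = head % QSLOTS;
    if (!atomic_load(&q->full[i])) return 0;
    *out = q->slot[i];
    atomic_store(&q->full[i], 0);
    return 1;
}
```

Because slot reservation is one atomic operation, multiple senders can enqueue concurrently without locks, which is what drops the enqueue cost from 25 µs to about 3 µs on the T3D.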
17. Clusters and NOW
- Cluster
  - collections of complete computers with dedicated interconnects
- Types of clusters
  - Older systems
    - Availability clusters
    - Multiprogramming clusters: VAX VMS Clusters
  - New systems, mainly used as parallel machines
    - High-performance clusters: Beowulf
    - Load-leveling clusters: Mosix
    - Web-service clusters: Linux Virtual Server
    - Storage clusters: GFS and OpenGFS
    - Database clusters: Oracle Parallel Server
    - High-availability clusters: FailSafe, Heartbeat
    - SSI clusters: OpenSSI cluster project
18. Technology Breakthrough
- Scalable, low-latency interconnects
- Traditional local area networks
  - shared bus: Ethernet
  - ring: Token Ring and FDDI
- Scalable bandwidth
  - Switch-based local area networks
    - HiPPI switches, FDDI switches, and Fibre Channel
    - ATM
    - Fast Ethernet and Gigabit Ethernet
  - System area networks
    - ServerNet (Tandem Corp.)
    - Myrinet
      - switch: 8 ports at 160 MB/s each
    - InfiniBand
19. Issues
- Communication abstractions
  - TCP/IP
  - Active Messages: user-level network transactions
  - Reflective memory: Shrimp project
  - VIA: led by Intel, Microsoft, and Compaq
- Hardware support for the communication assist
  - memory bus vs. I/O bus
- Node architecture
  - single processor vs. SMP
20. Case Study: NOW
- General-purpose processor embedded in the NIC
21. Reflective Memory
[Figure: nodes i, j, and k map transmit (T0-T3) and receive (R0-R3) regions of their virtual address spaces onto physical pages; a write to a transmit region on one node appears, via the I/O system, in the matching receive region on the other nodes.]
- Writes to a local transmit region are reflected to the remote receive region
- A form of memory-based message passing
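A toy model of the reflection semantics (invented API): a store into a local transmit region is propagated to every node that maps a matching receive region. Real systems such as DEC Memory Channel do this in the interconnect hardware via page mappings; here it is just a loop over node-local copies:

```c
/* Toy model of slide 21's reflective memory. */
#include <stdint.h>

#define NODES 3
#define REGION_WORDS 8

static uint32_t region[NODES][REGION_WORDS];  /* one copy per node */

/* A write to a mapped word is "reflected" into every node's copy. */
void reflect_write(int word, uint32_t val)
{
    for (int n = 0; n < NODES; n++)
        region[n][word] = val;   /* local write + remote reflections */
}
```

The key property this models is that the writer performs one ordinary store; all propagation is hidden behind the memory mapping, with no explicit receive operation on the remote nodes.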
22. Case Study: DEC Memory Channel
- PCT: page control table
23. Implications for Parallel Software and Synchronization
24. Communication Performance
- Microbenchmarks
  - the basic network transactions on a user-to-user basis
  - Active Messages
  - shared address space
  - standard MPI message passing
- Application level
  - read Section 7.8.4 in the textbook
25. Message Time Breakdown
[Figure: time of a message -- source processor overhead, source communication assist occupancy, observed network latency, destination communication assist occupancy, and destination processor overhead add up to the total communication latency.]
- The end-to-end message time = round-trip time / 2
  - measured by the source processor (no global clock)
- Overhead cannot be used for useful computation
- Latency can potentially be masked by other useful work
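The decomposition above can be written as a small additive model. The five components follow the breakdown in the figure caption; the concrete numbers below are placeholders, not measurements from any of the machines discussed:

```c
/* Back-of-envelope model of slide 25: end-to-end time = send overhead
 * + communication-assist occupancy on each side + network delay
 * + receive overhead. */
typedef struct {
    double o_send, ca_src, net, ca_dst, o_recv;   /* all in microseconds */
} msg_time_t;

double total_latency_us(msg_time_t t)
{
    return t.o_send + t.ca_src + t.net + t.ca_dst + t.o_recv;
}
```

Only `o_send` and `o_recv` consume processor cycles; the three middle terms are latency that a program can in principle overlap with other useful work.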
26. Message Time Comparison
[Figure: bar chart of per-message communication time (microseconds, 0-14) for CM-5, Paragon, Meiko CS-2, T3D, and NOW Ultra, split into sending-side processing overhead (O), latency (L), and receiving-side processing overhead (O), shown per pipelined message and per 1/2 request-response operation; annotations mark "accessing system memory" and "uncached read over I/O bus".]
- One-way Active Message time (five-word message)
27. Performance Analysis
- Send overhead
  - CM-5: uncached writes of data and uncached reads of NI status
  - Paragon: bus-based cache-coherence protocol within the node
  - Meiko: a pointer is enqueued in the NI with a single swap instruction, but swap is very slow
  - -> the cost of uncached operations, synchronization, and misses is critical to communication performance
- Receive overhead
  - cache-to-cache transfer: Paragon
  - uncached transfer: CM-5, CS-2 -- faster than cache-to-cache
  - NOW: uncached read over the I/O bus
28. Performance Analysis (cont'd)
- Latency
  - CA occupancy, channel occupancy, network delay
  - CM-5: 20 MB/s links
    - channel occupancy
  - Paragon: 175 MB/s
    - CP occupancy
  - Meiko: 40 MB/s
    - accessing system memory from the CP
29. SAS Time Comparison
- Performance of a remote read