Transcript and Presenter's Notes

Title: Scalable Multiprocessors III


1
Scalable Multiprocessors (III)
2
Spectrum of Designs
  • None: physical bit stream
  • blind, physical DMA: nCUBE, iPSC, . . .
  • User/System
  • user-level port: CM-5, *T
  • user-level handler: J-Machine, Monsoon, . . .
  • Dedicated processor
  • message passing, remote virtual address
  • processing, translation: Paragon, Meiko CS-2
  • Global physical address
  • proc + memory controller: RP3, BBN, T3D
  • Cache-to-cache
  • cache controller: Dash, KSR, Flash

Increasing HW Support, Specialization,
Intrusiveness, Performance (???)
3
Dedicated Message Processing
  • Interpretation is not bound into the hardware
    design
  • Interpretation is done by software
  • Protocol processing is off-loaded to the CP
  • Can support a global address space

4
Without Specialized Hardware Design
  • General-purpose processor performs arbitrary
    output processing (at system level)
  • General-purpose processor interprets incoming
    network transactions (at system level)
  • User processor ↔ msg processor: share memory
  • Msg processor ↔ msg processor: via system network
    transactions

5
Levels of Network Transaction
[Figure: levels of a network transaction. Two nodes, each with Mem, NI, and processor P plus message processor MP; the user level supplies the user information and dest field of the message, and the system level carries it between nodes.]
  • User processor stores cmd / msg / data into the
    shared output queue
  • must still check for output queue full (or make the
    queue elastic); see the sketch below
  • Communication assists make the transaction happen
  • checking, translation, scheduling, transport,
    interpretation
  • Effect observed on destination address space
    and/or events
  • Protocol divided between the two layers
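A minimal sketch of the shared output queue described above: a single-producer ring written by the user processor and drained by the message processor. All names (oq_t, oq_post, OQ_ENTRIES) are hypothetical, not from any of the machines listed.

/* Shared output queue sketch (hypothetical layout). */
#include <stdint.h>
#include <stdbool.h>

#define OQ_ENTRIES 128u              /* queue depth (power of two) */

typedef struct {
    uint32_t dest;                   /* destination node / process */
    uint32_t cmd;                    /* command / message type     */
    uint64_t data[4];                /* small inline payload       */
} oq_entry_t;

typedef struct {
    volatile uint32_t head;          /* advanced by the MP         */
    volatile uint32_t tail;          /* advanced by the user proc  */
    oq_entry_t entry[OQ_ENTRIES];
} oq_t;

/* Returns false when the queue is full: the "must still check for
 * output queue full" step.  The caller can spin, back off, or make
 * the queue elastic by spilling to memory. */
static bool oq_post(oq_t *oq, const oq_entry_t *e)
{
    uint32_t tail = oq->tail;
    if (tail - oq->head == OQ_ENTRIES)      /* full? */
        return false;
    oq->entry[tail % OQ_ENTRIES] = *e;      /* store cmd/msg/data  */
    __sync_synchronize();                   /* payload before tail */
    oq->tail = tail + 1;                    /* publish to the MP   */
    return true;
}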

6
Example Intel Paragon
[Figure: Intel Paragon organization. Each node pairs an i860 XP compute processor with an i860 XP message processor (50 MHz, 16-KB 4-way caches, 32-B blocks, MESI) on a 400-MB/s memory bus, with the NI and send/receive DMA engines (sDMA, rDMA); network links run at 175 MB/s full duplex; a service network attaches I/O nodes and devices; packets carry a route (rte), up to 2048 B of data, and an EOP marker, and are handled by an MP handler on arrival.]
7
User Level Abstraction
[Figure: user-level abstraction. Each process exposes an input queue (IQ) and an output queue (OQ) within its virtual address space (VAS).]
  • Any user process can post a transaction for any
    other in the protection domain
  • communication layer moves OQ_src → IQ_dest
  • may involve indirection: VAS_src → VAS_dest

8
Basic Implementation Costs Scalar
[Figure: basic implementation cost of a scalar (7-word) transfer: user OQ in cache, through CP and MP, across the net FIFO and network, then MP, CP, and user IQ on the other side; stage costs of roughly 1.5 to 2 µs each add up to about 10.5 µs end to end; data moves through registers, with bus timing labels of 250 ns and 40 ns.]
  • Cache-to-cache transfer (quad-word ops)
  • producer/consumer cache misses and hits → bus
    transactions
  • to NI FIFO: read status, chk, write, . . .
  • from NI FIFO: read status, chk, dispatch, read,
    read, . . . (sketched below)
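The per-word NI FIFO discipline itemized above (read status, check, then write or read) might look like the following sketch; the register layout and flag names are assumptions, not the real interface.

/* NI FIFO polling sketch (hypothetical register layout). */
#include <stdint.h>

#define NI_TX_READY 0x1u     /* space available in the output FIFO */
#define NI_RX_AVAIL 0x2u     /* word waiting in the input FIFO     */

typedef struct {
    volatile uint32_t status;
    volatile uint32_t tx_fifo;
    volatile uint32_t rx_fifo;
} ni_regs_t;

static void ni_send_word(ni_regs_t *ni, uint32_t w)
{
    while (!(ni->status & NI_TX_READY))   /* read status, chk      */
        ;                                 /* each poll is a bus op */
    ni->tx_fifo = w;                      /* write                 */
}

static uint32_t ni_recv_word(ni_regs_t *ni)
{
    while (!(ni->status & NI_RX_AVAIL))   /* read status, chk */
        ;
    return ni->rx_fifo;                   /* read             */
}

Every iteration is an uncached access that becomes a bus transaction, which is why these steps dominate the 10.5-µs scalar cost above.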

9
Virtual DMA → Virtual DMA
  • Send MP segments the message into 8-KB pages and
    does VA → PA translation (see the sketch below)
  • Recv MP reassembles, does dispatch and VA → PA
    translation per page
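A sketch of the send-MP loop under this scheme: chop the virtual buffer at 8-KB page boundaries and translate each piece before DMA, since the physical pages need not be contiguous. translate_va() and dma_send() are hypothetical stand-ins for the MP's page-table lookup and its sDMA engine.

/* Send-side virtual-DMA segmentation sketch. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192u

uint64_t translate_va(const void *va);              /* VA → PA */
void dma_send(int dest, uint64_t pa, size_t len);   /* sDMA    */

static void mp_send(int dest, const char *va, size_t len)
{
    while (len > 0) {
        size_t off   = (uintptr_t)va & (PAGE_SIZE - 1);
        size_t chunk = PAGE_SIZE - off;     /* to next page edge */
        if (chunk > len)
            chunk = len;
        dma_send(dest, translate_va(va), chunk);
        va  += chunk;
        len -= chunk;
    }
}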

10
Single Page Transfer Rate
Effective buffer size: 3232
Actual buffer size: 2048
11
Case Study Meiko CS2 Concept
  • Asymmetric CP
  • Circuit-switched network transaction
  • source-dest circuit held open for request-response
  • limited cmd set executed directly on the NI
  • Dedicated communication processor for each step
    in the flow

12
Case Study Meiko CS2 Organization
[Figure: Meiko CS-2 organization. The elan network processor sits between the node processors and the network; it combines an output-protocol engine, a DMA engine, and a reply thread with a RISC instruction set running up to 64-K nonpreemptive threads, which construct arbitrary network transactions. Processors issue transactions such as write_block and commands (50-µs limit).]
13
Spectrum of Designs
  • None: physical bit stream
  • blind, physical DMA: nCUBE, iPSC, . . .
  • User/System
  • user-level port: CM-5, *T
  • user-level handler: J-Machine, Monsoon, . . .
  • Dedicated processor
  • message passing, remote virtual address
  • processing, translation: Paragon, Meiko CS-2
  • Global physical address
  • proc + memory controller: RP3, BBN, T3D
  • Cache-to-cache
  • cache controller: Dash, KSR, Flash

Increasing HW Support, Specialization,
Intrusiveness, Performance (???)
14
Shared Physical Address Space
[Figure: shared physical address space. The source communication assist acts as a pseudo-memory module (output processing: memory access, response); the destination assist acts as a pseudo-processor (input processing: parse, complete read). Each node holds P, MMU, and Mem; request/response transactions carry src, tag, and data fields.]
  • NI emulates a memory controller at the source
  • NI emulates a processor at the destination
  • must be deadlock-free (see the sketch below)
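A sketch of the request/response pair behind a remote read in such a machine: the source NI, acting as a pseudo-memory controller, emits a read request; the destination NI, acting as a pseudo-processor, performs the local access and returns the data. The transaction format and helper functions are illustrative, not any specific machine's protocol.

/* Remote-read request/response sketch (illustrative format). */
#include <stdint.h>

enum { REQ_READ = 1, RSP_DATA = 2 };

typedef struct {
    uint8_t  type;       /* REQ_READ or RSP_DATA                 */
    uint8_t  src;        /* requesting node, routes the response */
    uint16_t tag;        /* matches response to the request      */
    uint64_t addr;       /* global physical address              */
    uint64_t data;       /* valid in RSP_DATA only               */
} txn_t;

void net_send(uint8_t dest, const txn_t *t);    /* hypothetical */
uint64_t local_mem_read(uint64_t addr);         /* hypothetical */

/* Destination-side input processing: parse, access memory,
 * respond.  Responses must use a path that cannot block behind
 * incoming requests, or the request-response cycle can deadlock. */
void ni_handle(uint8_t self, const txn_t *t)
{
    if (t->type == REQ_READ) {
        txn_t rsp = { RSP_DATA, self, t->tag, t->addr,
                      local_mem_read(t->addr) };
        net_send(t->src, &rsp);
    }
}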

15
Case Study Cray T3D
[Figure: Cray T3D node; 300 MB/s network links]
  • a shell of support circuitry around an off-the-shelf
    microprocessor embodies the parallel processing
    capability
  • remote memory operations encoded in the address

16
Case Study Cray T3D
  • No L2 cache
  • local memory access 155 ns (23 cycles), vs. 300 ns on
    a DEC Alpha workstation
  • Single blocking remote write 900 ns, plus annex setup
    and address arithmetic
  • Special support for synchronization
  • dedicated network to support global-OR and
    global-AND operations
  • atomic swap and fetch&inc
  • user-level message queue
  • involves a remote kernel trap
  • enqueue 25 µs, invoking 75 µs
  • Small messages
  • using fetch&inc (see the sketch below)
  • enqueue 3 µs, dequeue 1.5 µs
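A sketch of the fetch&inc fast path for small messages: an atomic fetch-and-increment claims a slot index in a queue living in the receiver's memory, and the sender then writes the message into that slot. fetch_inc() and remote_write() stand in for the T3D's hardware primitives; queue_tail and queue[] live on the receiving node and are shown as externs only for illustration.

/* Small-message enqueue via fetch&inc (illustrative layout). */
#include <stdint.h>

#define QSLOTS 1024u

uint64_t fetch_inc(int node, uint64_t *remote_ctr);   /* atomic   */
void remote_write(int node, void *remote_dst,
                  const void *src, uint64_t len);     /* blocking */

typedef struct { uint64_t word[4]; } small_msg_t;

extern uint64_t    queue_tail;        /* counter on receiver node */
extern small_msg_t queue[QSLOTS];     /* queue in receiver memory */

static void enqueue_small(int node, const small_msg_t *m)
{
    uint64_t slot = fetch_inc(node, &queue_tail) % QSLOTS;
    remote_write(node, &queue[slot], m, sizeof *m);
}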

17
Clusters and NOW
  • Cluster
  • collections of complete computers with dedicated
    interconnects
  • Types of clusters
  • Older systems
  • Availability clusters
  • Multiprogramming clusters: VAX VMS clusters
  • New systems, mainly used as parallel machines
  • High-performance clusters: Beowulf
  • Load-leveling clusters: MOSIX
  • Web-service clusters: Linux Virtual Server
  • Storage clusters: GFS and OpenGFS
  • Database clusters: Oracle Parallel Server
  • High-availability clusters: FailSafe, Heartbeat
  • SSI clusters: OpenSSI cluster project

18
Technology breakthrough
  • Scalable, low-latency interconnects
  • Traditional local area networks
  • shared bus: Ethernet
  • ring: token ring and FDDI
  • Scalable bandwidth
  • Switch-based local area networks
  • HiPPI switches, FDDI switches, and Fibre Channel
  • ATM
  • Fast Ethernet and Gigabit Ethernet
  • System area networks
  • ServerNet (Tandem Corp.)
  • Myrinet: switch with 8 ports at 160 MB/s each
  • InfiniBand

19
Issues
  • Communication abstractions
  • TCP/IP
  • Active Messages: user-level network transactions
  • reflective memory: SHRIMP project
  • VIA: led by Intel, Microsoft, and Compaq
  • Hardware support for the communication assist
  • memory bus vs. I/O bus
  • Node architecture
  • single processor vs. SMP

20
Case Study NOW
  • General purpose processor embedded in NIC

21
Reflective Memory
[Figure: reflective memory. Transmit regions (T0 . . T3) and receive regions (R0 . . R3) are mapped into the virtual and physical address spaces of nodes i, j, and k; a write to a transmit page goes out through the I/O interface and lands in the corresponding receive page on the remote node.]
  • Writes to a local transmit region are reflected to
    the remote receive region
  • a form of memory-based message passing (see the
    sketch below)
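A sketch of reflective-memory semantics as seen by software: ordinary stores into a locally mapped transmit region are picked up by the NI and propagated to the matching receive region on the remote node, so sending looks like a memcpy. map_tx_region() is a hypothetical setup call, not the Memory Channel or SHRIMP API.

/* Reflective-memory send sketch (hypothetical mapping call). */
#include <stddef.h>
#include <string.h>

void *map_tx_region(int remote_node, size_t len);  /* hypothetical */

static void reflect_send(int node, const void *msg, size_t len)
{
    /* Done once at setup in real use; shown inline for brevity. */
    volatile char *tx = map_tx_region(node, len);

    /* Plain stores; the NI reflects each write to the remote
     * receive region, where the receiver polls or takes an
     * interrupt. */
    memcpy((void *)tx, msg, len);
}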

22
Case Study DEC Memory Channel
PCT: page control table
  • See also SHRIMP

23
Implications for Parallel Software and
Synchronization
24
Communication Performance
  • Microbenchmarks
  • the basic network transactions on a user-to-user
    basis
  • Active Messages
  • shared address space
  • standard MPI message passing
  • Application level
  • read Section 7.8.4 in the textbook

25
Message Time Breakdown
[Figure: breakdown of the time of a message. Total communication latency spans source processor, communication assist, network, communication assist, and destination processor; the middle span is the observed network latency.]
  • The end-to-end message time = round-trip time / 2
  • measured by the source processor (no global clock)
  • Overhead cannot be used for useful computation
  • Latency can potentially be masked by other useful
    work (see the formula below)
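In symbols, this is just a restatement of the breakdown above, with O_send and O_recv the per-end processing overheads and L the observed network latency:

\[
  T_{\text{msg}} \;=\; O_{\text{send}} + L + O_{\text{recv}}
  \;\approx\; \tfrac{1}{2}\, T_{\text{round-trip}}
\]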

26
Message Time Comparison
[Figure: communication time per message, in microseconds (0 to 14), split into sending-side processing overhead (O), latency (L), and receiving-side processing overhead (O) for CM-5, Paragon, Meiko, T3D, and NOW Ultra, shown both for a pipelined message sequence and for request-response operations; annotations mark time spent accessing system memory and an uncached read over the I/O bus.]
  • One-way Active Message time (five-word message)

27
Performance Analysis
  • Send overhead
  • CM-5: uncached writes of data and uncached reads
    of NI status
  • Paragon: bus-based cache-coherence protocol within
    the node
  • Meiko: a pointer is enqueued in the NI with a single
    swap instruction
  • swap is very slow
  • → the cost of uncached operations, synchronization,
    and misses is critical to communication performance
  • Receive overhead
  • cache-to-cache transfer: Paragon
  • uncached transfer: CM-5, CS-2 (faster than
    cache-to-cache)
  • NOW: uncached read over the I/O bus

28
Performance Analysis (cont'd)
  • Latency
  • CA occupancy, channel occupancy, network delay
  • CM-5 (20 MB/s links): channel occupancy
  • Paragon (175 MB/s): CP occupancy
  • Meiko (40 MB/s): accessing system memory from the CP

29
SAS Time Comparison
  • Performance of a remote read