Send and Receive Based Message-Passing for SCMP

Send and Receive Based Message-Passing for SCMP. Charles W. Lewis, Jr. Thesis Defense, Virginia Tech, April 28th, 2004. This presentation introduces the SCMP architecture ...

1
Send and Receive Based Message-Passing for SCMP
  • Charles W. Lewis, Jr.
  • Thesis Defense
  • Virginia Tech
  • April 28th, 2004

2
This presentation introduces the SCMP
architecture, discusses problems with the current
SCMP message-passing system, and focuses on the
design and performance of a new SCMP
message-passing system.
1. Overview of SCMP
2. Original Message-Passing System
3. New Message-Passing System
4. Performance Comparisons
3
Problems with current design trends motivate the
SCMP concept.
  • As transistor sizes shrink, so do communication
    wires. This leads to higher cross-chip
    communication latencies.
  • ILP faces diminishing returns.
  • Large and complex uni-processors require
    extensive amounts of design and verification.

4
SCMP provides PLP through replication.
  • Up to 64 identical nodes on-chip
  • Replicated nodes reduce complexity
  • 2-D network eliminates cross-chip wires

SCMP Network with 64 Nodes
5
SCMP provides TLP through multi-thread hardware
support.
  • Up to 16 threads
  • Round-robin thread scheduling by hardware
  • On every node:
    • 4-stage RISC pipeline
    • 8MB memory
    • Networking hardware

SCMP Node
6
The original messaging system has two message
types.
Thread Message
H | T | Payload
1 | 0 | X, Y, THREAD, 1
0 | 0 | Address
0 | 0 | Register Data
. | . | ...
0 | 1 | Register Data

Data Message
H | T | Payload
1 | 0 | X, Y, DATA, Stride
0 | 0 | Address
0 | 0 | Data
. | . | ...
0 | 1 | Data

Because they contain handling information, these
message formats borrow from the Active-Messages
message-passing system.
7
Network uses wormhole and dimension-order routing.
  • Every router multiplexes virtual channel buffers
    over physical channels.
  • Head flits claim virtual channel resources as
    they travel
  • If one message blocks, other messages may still
    continue as long as enough virtual channels are
    free.
  • Messages move along X axis, then Y axis
  • Tail flits release virtual channel resources as
    they travel.
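
The X-then-Y rule above can be sketched in a few lines. This is only an illustration, assuming nodes are addressed by hypothetical (x, y) mesh coordinates; the function name and the coordinate scheme are not from the presentation, and virtual channels are not modeled.

```python
def dimension_order_route(src, dst):
    """Return the (x, y) hops from src to dst, moving along X first, then Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Correct the X coordinate completely before touching Y.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Only then move along the Y axis.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(dimension_order_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every message finishes its X hops before it makes any Y hop, no message ever turns from the Y axis back onto the X axis.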

8
Dimension-order routing is deadlock free as long
as messages eventually drain.
  • Even with VCs, the network can still deadlock if
    messages don't drain.
  • If all contexts are consumed, thread messages
    block at NIU
  • Threads may not release until a data message is
    received
  • Data messages must not be stopped by congested
    thread messages
  • Data messages must have a separate path through
    network.

[Figure: Router with West and East ports and separate Thread VCs and Data VCs]
9
The NIU bears most of the messaging load.
[Figure: NIU block diagram. Labels: Thread Buffer, Data Buffer, Context 1, Context 2, Injection Channel, To Router, From Router, Receive Buffer, Ejection Channel, Memory]
10
Messages are built through assembly instructions.
Instruction | Arguments                         | Description
sendh       | d_node, type, d_address, d_stride | send a header flit
send        | data                              | send one data flit
send2       | data, data                        | send two data flits
sende       | data                              | send one data flit and end message
send2e      | data, data                        | send two data flits and end message
sendm       | l_address, l_stride, count        | send data block from memory
sendme      | l_address, l_stride, count        | send data block from memory and end message
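
As a rough model of how these instructions compose a message, the sketch below collects flits as (H, T, payload) tuples in the layout of the original data message. The `MessageBuilder` class, the 8 x 8 node numbering, and the choice to emit the address flit from `sendh` are all assumptions made for illustration, not details from the presentation.

```python
GRID_WIDTH = 8  # assumption: 64 nodes arranged as an 8 x 8 mesh


class MessageBuilder:
    """Toy model of one context's outgoing message (hypothetical helper)."""

    def __init__(self):
        self.flits = []      # each flit is (H, T, payload)
        self.is_open = False

    def sendh(self, d_node, msg_type, d_address, d_stride):
        if self.is_open:
            raise RuntimeError("previous message has not been ended")
        x, y = d_node % GRID_WIDTH, d_node // GRID_WIDTH
        self.flits.append((1, 0, (x, y, msg_type, d_stride)))  # header flit
        self.flits.append((0, 0, d_address))                   # address flit
        self.is_open = True

    def send(self, data):
        self._emit(data, tail=0)

    def sende(self, data):
        self._emit(data, tail=1)  # the tail bit ends the message

    def _emit(self, data, tail):
        if not self.is_open:
            raise RuntimeError("no open message: call sendh first")
        self.flits.append((0, tail, data))
        if tail:
            self.is_open = False


builder = MessageBuilder()
builder.sendh(9, "DATA", 0x100, 1)
builder.send(11)
builder.sende(22)
print(builder.flits)
# [(1, 0, (1, 1, 'DATA', 1)), (0, 0, 256), (0, 0, 11), (0, 1, 22)]
```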
11
The thread library facilitates thread messages.
Operation    | Arguments               | Description
createThread | int dst_node            | create a thread on dst_node
             | void(addr)()            |
             | void(callback)()        |
parExecute   | int num_nodes           | create threads on num_nodes nodes
             | void(addr)()            |
getBlock     | unsigned int node_id    | request a block of values from node node_id
             | char dst_addr           |
             | unsigned int dst_stride |
             | char src_addr           |
             | unsigned int src_offset |
             | unsigned int src_stride |
             | unsigned int num_words  |
12
The send library facilitates data messages.
Operation          | Arguments       | Description
sendDataIntValue   | int dst_node    | send an integer to dst_node
                   | int dst_addr    |
                   | int value       |
sendDataFloatValue | int dst_node    | send a double to dst_node
                   | double dst_addr |
                   | double value    |
sendDataBlock      | int dst_node    | send a block of values from memory to dst_node (blocking)
                   | int dst_addr    |
                   | int dst_stride  |
                   | int src_addr    |
                   | int src_stride  |
                   | int num_words   |
sendDataBlockNB    | int dst_node    | send a block of values from memory to dst_node (non-blocking)
                   | int dst_addr    |
                   | int dst_stride  |
                   | int src_addr    |
                   | int src_stride  |
                   | int num_words   |
13
The original message-passing system uses requests
and replies.
  • Node A requires data held by Node B
  • Node A creates a thread on Node B
  • New thread on Node B sends data to Node A
  • New thread on Node B sends SYNC message when done

[Figure: A -> B: Thread; B -> A: Data, Sync]
14
Dynamic memory is a problem.
  • Request thread on node B must know:
    • Source Address
    • Source Stride
    • Destination Address
    • Destination Stride
    • Number of Values to Send
  • How can Node A know the source address and stride
    if Node B allocates the buffer dynamically?
  • Program must contain global pointers

H | T | Payload
1 | 0 | X, Y, DATA, Stride
0 | 0 | Address
0 | 0 | Data
. | . | ...
0 | 1 | Data
15
In-order delivery of messages is a problem.
  • SCMP network does not guarantee in-order delivery
    of messages
  • SYNC message may reach Node A before data message
  • Node A will read bad values from memory

[Figure: B -> A: Data, Sync]
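
The race described on this slide can be reproduced with a toy model. The message tuples, the address 0x100, and the function name below are hypothetical; the sketch only shows that the value Node A reads depends on which message arrives first.

```python
def node_a_reads(arrival_order):
    """Return what Node A reads from its buffer when the SYNC message arrives."""
    memory = {0x100: None}  # destination buffer on Node A, not yet written
    value_read = None
    for message in arrival_order:
        if message[0] == "DATA":
            _, address, value = message
            memory[address] = value
        elif message[0] == "SYNC":
            # Node A treats SYNC as proof that the data has already landed.
            value_read = memory[0x100]
    return value_read


data = ("DATA", 0x100, 42)
sync = ("SYNC",)
print(node_a_reads([data, sync]))  # 42
print(node_a_reads([sync, data]))  # None: SYNC overtook the data
```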
16
Request threads and finite thread contexts are a
problem.
[Figure: a node's Contexts, its NIU, and incoming Thread messages]
  • If a node holds highly demanded data, request
    threads may consume all of its contexts
  • Additional thread messages will block in the
    network

17
Send-and-Receive message-passing eliminates all
of these problems.
  • A thread must execute a receive before data will
    be accepted
    • Don't need request messages
  • Messages are identified abstractly
    • Don't need global pointers
  • Completion notification occurs locally
    • Don't need SYNC messages

18
Rendezvous mode uses an RTS/CTS handshake.
  • Node B holds data required by Node A
  • Node B sends Node A an RTS message when send is
    executed
  • After receive is executed Node A sends Node B a
    CTS message
  • Node B sends data after receiving CTS

[Figure: B -> A: RTS; A -> B: CTS; B -> A: Data]
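
A minimal sketch of the handshake, assuming a single transfer from Node B to Node A and tracking only Node A's receive table entry. The function name and log strings are made up for illustration; the state names follow the receive table entry states used in this presentation.

```python
def rendezvous(call_order):
    """Return (network messages, final receive state) for one B-to-A transfer.

    call_order lists the calls in the order they execute: "send" on
    Node B and "receive" on Node A.
    """
    log = []
    receive_state = "Empty"  # Node A's receive table entry
    for call in call_order:
        if call == "send":
            log.append("B->A RTS")
            if receive_state == "In Use":        # receive already posted
                log.append("A->B CTS")
                receive_state = "In Progress"
            else:
                receive_state = "RTS Received"   # remember the RTS for later
        elif call == "receive":
            if receive_state == "RTS Received":  # sender is already waiting
                log.append("A->B CTS")
                receive_state = "In Progress"
            else:
                receive_state = "In Use"
    if receive_state == "In Progress":
        log.append("B->A Data")  # data moves only after the CTS
        receive_state = "Complete"
    return log, receive_state


print(rendezvous(["send", "receive"]))
# (['B->A RTS', 'A->B CTS', 'B->A Data'], 'Complete')
```

Either call order produces the same three messages; if the receive never executes, the transfer stops after the RTS.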
19
Ready mode foregoes the handshake to reduce
message latency.
  • Node B holds data required by Node A
  • Node B sends data when send is executed
  • User must ensure that receive has executed on
    Node A

[Figure: B -> A: Data]
20
The implementation centers around two tables.
Send Table Entry
Bits  | 33-2 | 1-0
Field | id   | state

Receive Table Entry
Bits  | 83-50 | 49-29   | 28-13  | 12-7   | 6-3     | 2-0
Field | id    | address | stride | r_node | r_cntxt | state
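
The bit positions listed for the two entries can be exercised with a small pack/unpack sketch. The helper names and the dictionary representation are assumptions; only the field boundaries come from the slide.

```python
# Field layouts as (high bit, low bit), taken from the bit positions above.
SEND_FIELDS = {"id": (33, 2), "state": (1, 0)}
RECEIVE_FIELDS = {
    "id": (83, 50),
    "address": (49, 29),
    "stride": (28, 13),
    "r_node": (12, 7),
    "r_cntxt": (6, 3),
    "state": (2, 0),
}


def pack(fields, values):
    """Pack a dict of field values into a single integer table entry."""
    word = 0
    for name, (high, low) in fields.items():
        width = high - low + 1
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word


def unpack(fields, word):
    """Split an integer table entry back into its named fields."""
    return {
        name: (word >> low) & ((1 << (high - low + 1)) - 1)
        for name, (high, low) in fields.items()
    }


entry = {"id": 7, "address": 0x1000, "stride": 1,
         "r_node": 63, "r_cntxt": 15, "state": 0b011}
word = pack(RECEIVE_FIELDS, entry)
print(unpack(RECEIVE_FIELDS, word) == entry)  # True
```

The 6-bit r_node and 4-bit r_cntxt fields match the 64-node and 16-thread limits given earlier in the presentation.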
21
Send Table Entries may be in 4 states, and
Receive Table Entries may be in 5 states.
Send Table Entry States
Value | State
00    | Empty
01    | In Use
10    | In Progress
11    | Complete

Receive Table Entry States
Value | State
000   | Empty
001   | In Use
010   | In Progress
011   | RTS Received
10X   | NOT USED
110   | NOT USED
111   | Complete
22
The new messaging system has four message types.
Thread Message
H | T | Payload
1 | 0 | X, Y, THREAD
1 | 1 | Handler Address
0 | 0 | Register Data
. | . | ...
0 | 1 | Register Data

Data Message
H | T | Payload
1 | 0 | X, Y, DATA
1 | 1 | Message ID
0 | 0 | Data
. | . | ...
0 | 1 | Data

RTS Message
H | T | Payload
1 | 0 | X, Y, RTS, cntxt, node
0 | 1 | Message ID

CTS Message
H | T | Payload
1 | 0 | X, Y, CTS, cntxt
0 | 1 | Message ID
23
The NIU now contains a data queue for every
context.
[Figure: NIU block diagram. Labels: Thread Buffer, Data Buffer, Context 1, Context 2, RTS Buffer, CTS Buffer, Injection Channel, To Router, From Router, Receive Buffer, Ejection Channel, Memory]
24
Only five new instructions and one modified
instruction are needed.
Instruction | Arguments                   | Description
sendh       | d_node, type, d_address     | send a header flit
send        | data                        | send one data flit
send2       | data, data                  | send two data flits
sende       | data                        | send one data flit and end message
send2e      | data, data                  | send two data flits and end message
sendm       | l_address, l_stride, count  | send data block from memory
sendme      | l_address, l_stride, count  | send data block from memory and end message
ldss        | r, message_id               | poll send operation status
ldsr        | r, message_id               | poll receive operation status
str         | message_id, address, stride | store a receive to table
rms         | message_id                  | clear a send operation
rmr         | message_id                  | clear a receive operation
25
The thread library remains nearly the same.
Operation    | Arguments               | Description
createThread | int dst_node            | create a thread on dst_node
             | void(addr)()            |
             | void(callback)()        |
parExecute   | int num_nodes           | create threads on num_nodes nodes
             | void(addr)()            |
getBlock     | unsigned int node_id    | request a block of values from node node_id
             | char dst_addr           |
             | unsigned int dst_stride |
             | char src_addr           |
             | unsigned int src_offset |
             | unsigned int src_stride |
             | message_id              |
             | unsigned int num_words  |
26
The new send library is more familiar.
Operation     | Arguments      | Description
SCMPSendInt   | int dst_node   | send an integer to dst_node
              | int message_id |
              | int value      |
SCMPSendFloat | int dst_node   | send a double to dst_node
              | int message_id |
              | double value   |
SCMPSend      | int dst_node   | send a block of values from memory to dst_node (blocking)
              | int message_id |
              | int address    |
              | int stride     |
              | int num_words  |
SCMPSendNB    | int dst_node   | send a block of values from memory to dst_node (non-blocking)
              | int message_id |
              | int address    |
              | int stride     |
              | int num_words  |
SCMPPollSend  | int message_id | poll status of send operation
SCMPWaitSend  | int message_id | suspend until message sends
SCMPClearSend | int message_id | clear send operation
27
The receive library is all new.
Operation        | Arguments      | Description
SCMPReceive      | int message_id | receive a message and store it at address (blocking)
                 | int address    |
                 | int stride     |
SCMPReceiveNB    | int message_id | receive a message and store it at address (non-blocking)
                 | int address    |
                 | int stride     |
SCMPPollReceive  | int message_id | poll status of receive operation
SCMPWaitReceive  | int message_id | suspend until message arrives
SCMPClearReceive | int message_id | clear receive operation
28
Rendezvous Mode Operation at the Sender
[Flowchart; extracted labels: sendh; CTS Message Arrives; No Entry?; SUSPEND; Queue Head And Tag; Queue Waiting; Create Entry->In Use; Send RTS; Entry->In Progress; Head Flit @ Queue Head; Tail Flit Not Sent; Send Flit; Entry->Complete; ERROR]
29
Rendezvous Mode Operation at the Receiver
[Flowchart; extracted labels: RTS Message Arrives; Data Message Arrives; str; No Entry; In Use; In Progress; RTS Rcvd; Record RTS; Entry->RTS Rcvd; Block RTS; Record str; Entry->In Use; Send CTS; Entry->In Progress; Block Data; Tail Flit Not Stored; Store Flit; Entry->Complete; DISCARD; SUSPEND]
30
RTS and CTS Messages also need separate VC paths.
  • RTS messages can block in the network.
  • For a given RTS message to leave the network, RTS
    messages ahead of it must be satisfied:
    • CTS message to source
    • Data message back
  • RTS and CTS messages have their own VC paths.

[Figure: Router with West and East ports and separate Thread, Data, RTS, and CTS VCs]
31
Ready Mode Operation at the Sender
[Flowchart; extracted labels: sendh; No Entry?; SUSPEND; Queue Head And Tag; Create Entry->In Use; Head Flit @ Queue Head; Entry->In Progress; Tail Flit Not Sent; Send Flit; Entry->Complete; ERROR]
32
Ready Mode Operation at the Receiver
33
Stressmark testing was used to verify that
performance was not hurt.
  • DIS Stressmark Suite
    • Neighborhood Stressmark
    • Matrix Stressmark
    • Transitive Closure Stressmark
  • LU Factorization Stressmark

34
The neighborhood stressmark measures image
texture.
  • Every node owns a portion of the total rows
  • Every row owns complete sum and difference
    histograms
  • Each node determines, and requests, the pairs
    for pixels in its rows
  • Each node fills in sum and difference histogram
  • Histograms are shared
  • Each node manages only a portion of each
    histogram
  • Only the correct portion is sent to a node

35
Queues with 16 flits perform best.
36
The new system outperforms the old under the
neighborhood stressmark.
37
Matrix stressmark solves a linear system of
equations using the Conjugate Gradient Method.
  • Additional vectors r and p used for intermediate
    steps
  • Every node has:
    • Rows of A
    • Elements of b and r
    • Complete x and p
  • After each iteration p must be globally
    redistributed
    • Share with columns
    • Share with rows
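
For reference, a single-node sketch of the Conjugate Gradient Method using the same names (A, b, x, r, p). It is the textbook serial algorithm for a symmetric positive definite A, not the distributed SCMP stressmark code; the helper names, tolerance, and iteration limit are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tolerance=1e-20, max_iterations=1000):
    """Solve A x = b for symmetric positive definite A."""
    x = [0.0] * len(b)
    r = [float(bi) for bi in b]  # residual b - A x, with x = 0
    p = r[:]                     # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iterations):
        if rs_old < tolerance:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # The whole of p changes every iteration, which is why the
        # distributed version must redistribute it globally.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(x)  # close to [1/11, 7/11]
```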

38
The new system provides marginal improvement over
the original under the matrix stressmark.
39
The transitive closure stressmark solves the
all-pairs shortest-path problem.
  • Floyd-Warshall Algorithm
  • Adjacency Matrix: $D_{ij}$
  • Iterative Improvements: $D_{ij} = \min(D_{ij},\, D_{ik} + D_{kj})$
  • Each node owns sub-block of adjacency matrix
  • Each node needs portion of row k
  • Each node needs portion of column k
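
A serial sketch of the Floyd-Warshall update, in which each pass over k replaces D[i][j] with the smaller of D[i][j] and D[i][k] + D[k][j]. The example graph is made up, and the partitioning of the matrix across nodes is not modeled.

```python
INF = float("inf")


def floyd_warshall(dist):
    """Return all-pairs shortest path lengths for a square distance matrix."""
    n = len(dist)
    d = [row[:] for row in dist]  # work on a copy
    for k in range(n):
        # Pass k needs row k and column k, which is what each node
        # must obtain in the distributed version.
        for i in range(n):
            for j in range(n):
                through_k = d[i][k] + d[k][j]
                if through_k < d[i][j]:
                    d[i][j] = through_k
    return d


graph = [
    [0, 3, INF],
    [INF, 0, 1],
    [2, INF, 0],
]
print(floyd_warshall(graph))
# [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```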

40
The new system provides marginal improvement over
the original under the transitive closure
stressmark.
41
The LU factorization stressmark is used by linear
system solvers.
  • Factors matrix into a lower triangular matrix and
    an upper triangular matrix.
  • Matrix is divided into blocks
  • Pivot block is factored
  • Pivot column and row blocks are divided by pivot.
  • Inner active matrix blocks are modified by the
    pivot row and column blocks.

[Figure: blocked matrix showing the Pivot, Pivot Row, Pivot Column, and Inner Active Matrix]
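
The steps above can be sketched element by element, treating every block as a single matrix entry. This is a simplification: it performs no row exchanges, it is serial, and in this scalar form only the pivot column is divided by the pivot while the pivot row is left unchanged. The function name is hypothetical.

```python
def lu_factor(matrix):
    """Factor a square matrix into (lower, upper) with a unit-diagonal lower."""
    n = len(matrix)
    a = [[float(value) for value in row] for row in matrix]
    for k in range(n):
        pivot = a[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no row exchanges")
        # Divide the pivot column by the pivot.
        for i in range(k + 1, n):
            a[i][k] /= pivot
        # Update the inner active matrix using the pivot row and column.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    lower = [[a[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    upper = [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return lower, upper


lower, upper = lu_factor([[4, 3], [6, 3]])
print(lower)  # [[1.0, 0.0], [1.5, 1.0]]
print(upper)  # [[4.0, 3.0], [0.0, -1.5]]
```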
42
The new system outperforms the original under
the LU factorization stressmark.
43
Send-and-Receive Messaging for SCMP is
worthwhile.
  • Fixes Problems With Original SCMP Messaging
    System
    • Global Buffer Pointers
    • Races between Data and SYNC messages
    • Request Thread Storms
  • Programming Model is more familiar
  • Performance is better

Questions?