Title: CSE 661 PAPER PRESENTATION
1. CSE 661 PAPER PRESENTATION
- ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR, by D. Wentzlaff et al.
- Presented by SALAMI, Hamza Onoruoiza (g201002240)
2. OUTLINE OF PRESENTATION
- Introduction
- Tile64 Architecture
- Interconnect Hardware
- Network Uses
- Network to Tile Interface
- Receive-Side Hardware Demultiplexing
- Protection
- Shared Memory Communication and Ordering
- Interconnect Software
- Communication Interfaces
- Applications
- Conclusion
3. INTRODUCTION
- The Tile Processor's five on-chip 2D mesh networks differ from:
  - the traditional bus-based scheme, which requires a global broadcast and hence does not scale beyond 8-16 cores
  - the 1D ring, which does not scale because its bisection bandwidth is constant
- The mesh can support few or many processors with minimal changes to the network structure
4. TILE64 ARCHITECTURE
- 2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh
- 1 GHz clock, 3-way VLIW: 64 tiles x 3 ops x 1 GHz = 192 billion 32-bit instructions/sec
- 4.8 MB of distributed cache; per-tile TLB
- Supports DMA and virtual memory
- Tiles may run independent OSes, or be combined to run a multiprocessor OS such as SMP Linux
- Shared memory: cores directly access other cores' caches through the on-chip interconnects
5. TILE64 ARCHITECTURE (2)
- Off-chip memory BW: 200 Gbps
- I/O BW: 40 Gbps
6. TILE64 ARCHITECTURE (3)
- Courtesy: http://www.tilera.com/products/processors/TILE64
7. INTERCONNECT HARDWARE
- 5 low-latency mesh networks
- Each network connects a tile in five directions: north, south, east, west, and the tile's processor
- Each link is made of two 32-bit unidirectional links (one in each direction)
8. INTERCONNECT HARDWARE (2)
- 1.28 Tb/s of bandwidth in and out of a single tile (5 networks x 4 neighbor directions x 2 x 32-bit links x 1 GHz = 1.28 Tb/s)
9. NETWORK USES
- 4 dynamic networks:
  - The packet header contains the destination (x, y) coordinates and the packet length (up to 128 words); a sketch of such a header follows this list
  - Flow controlled, with reliable delivery
  - UDN: low-latency communication between userland processes without OS intervention
  - IDN: direct communication with I/O devices
  - MDN: communication with off-chip memory
  - TDN: direct tile-to-tile transfers; requests travel over the TDN, responses over the MDN
- 1 static network (STN):
  - Carries streams of data instead of packets
  - A route is set up first, then streams are sent over it (circuit switched)
  - Also a userland network
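To make the header format concrete, here is a minimal C sketch of a dynamic-network packet header carrying only the fields named above. The bit layout and field widths are assumptions for illustration; the slides do not specify the hardware encoding.

```c
#include <stdint.h>

typedef struct {
    uint8_t dest_x;   /* destination tile's x coordinate in the 8x8 mesh */
    uint8_t dest_y;   /* destination tile's y coordinate                 */
    uint8_t length;   /* payload length in 32-bit words (up to 128)      */
} dyn_packet_header_t;

/* Pack the header into a single 32-bit word, since one header word
 * precedes the payload on the dynamic networks. Field positions are
 * hypothetical. */
static inline uint32_t pack_header(dyn_packet_header_t h)
{
    return ((uint32_t)h.dest_x << 24) |
           ((uint32_t)h.dest_y << 16) |
           (uint32_t)h.length;
}
```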
10. LOGICAL VS. PHYSICAL NETWORKS
- 5 physically independent networks
- Lots of free nearest-neighbor on-chip wiring makes this affordable
- Buffer space accounts for about 60% of each network's area, yet each network occupies only about 1.1% of the tile's area
- A more reliable on-chip network means less buffering is needed to manage link failures
11NETWORK TO TILE INTERFACE
- Tiles have register access to on-chip networks.
Instructions can read/write from/to UDN, IDN or
STN. - MDN and UDN used indirectly on cache miss
- Register-mapped network access is provided
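As a rough illustration of register-mapped access, the sketch below wraps a network send and receive in C. The register name udn0 and the assembly syntax are illustrative assumptions, not taken from the slides; the point is that on this architecture a network send or receive compiles to a single register move.

```c
#include <stdint.h>

/* Illustrative only: "udn0" stands in for a register-mapped UDN port. */
static inline void udn_send_word(uint32_t word)
{
    /* Writing the network register injects the word into the UDN. */
    asm volatile ("move udn0, %0" : : "r"(word));
}

static inline uint32_t udn_receive_word(void)
{
    uint32_t word;
    /* Reading the network register dequeues the next word,
     * stalling if the queue is empty. */
    asm volatile ("move %0, udn0" : "=r"(word));
    return word;
}
```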
12. RECEIVE-SIDE HARDWARE DEMULTIPLEXING
- Each message carries a tag word (sending node, stream number, message type); a tag-packing sketch follows this list
- The receiving hardware demultiplexes the message into the appropriate queue using the tag
- On a tag miss, the data is sent to a catch-all queue and an interrupt is raised
- The UDN has 4 demux queues plus one catch-all queue
- The IDN has 2 demux queues plus one catch-all queue
- 128 words of receive-side buffering per tile
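A minimal sketch of packing the three tag fields named above into one word. The field widths are assumptions for illustration; the slides do not give the hardware encoding.

```c
#include <stdint.h>

static inline uint32_t make_tag(uint32_t sender,   /* sending node id */
                                uint32_t stream,   /* stream number   */
                                uint32_t type)     /* message type    */
{
    /* Hypothetical layout: 16 bits of sender, 8 of stream, 8 of type. */
    return (sender << 16) | (stream << 8) | type;
}
```

The receive hardware compares an arriving tag against the tags programmed into its demux queues; on a match the payload is steered into that queue, otherwise it lands in the catch-all queue and raises an interrupt.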
13. RECEIVE-SIDE HARDWARE DEMULTIPLEXING (2)
14. PROTECTION
- The Tile Architecture implements a Multicore Hardwall (MH)
- The MH protects the UDN, IDN, and STN links
- Standard memory protection mechanisms are used for the MDN and TDN
- The MH blocks any attempt to send traffic over a hardwalled link and signals an interrupt to system software
- Protection is implemented on outbound links
15. SHARED MEMORY COMMUNICATION AND ORDERING
- On-chip distributed shared cache
- Data can be retrieved from:
  - the local cache
  - the home tile (request sent through the TDN); each address is cached at a single home tile, where coherency is maintained
  - main memory
- No ordering is guaranteed between the networks and shared memory
- Memory fence instructions are used to enforce ordering (see the sketch below)
16. INTERCONNECT SOFTWARE
- The C-based iLib library provides communication primitives implemented via the UDN
- Lightweight socket-like streaming channels for streaming algorithms
- An MPI-like message-passing interface for ad hoc messaging
17. COMMUNICATION INTERFACES
- iLib sockets:
  - Long-lived FIFO point-to-point connection between two processes
  - Good for producer-consumer relationships
  - Multiple senders to one receiver is possible: good for forwarding results to a single node for aggregation
- Raw channels: low overhead; use only as much space as is available in the hardware buffer (see the sketch below)
- Buffered channels: higher overhead, but the channel buffer can be virtualized in memory
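Below is a hedged sketch of the raw-channel style of communication. The handle type and function names (ilib_rawchan_*) are illustrative stand-ins modeled on the description above, not confirmed iLib signatures.

```c
#include <stdint.h>

typedef int rawchan_t;  /* hypothetical channel handle */

/* Hypothetical prototypes; the real iLib API may differ. */
rawchan_t ilib_rawchan_open_sender(int channel_id);
rawchan_t ilib_rawchan_open_receiver(int channel_id);
void      ilib_rawchan_send(rawchan_t ch, uint32_t word);
uint32_t  ilib_rawchan_receive(rawchan_t ch);

void producer(void)
{
    rawchan_t ch = ilib_rawchan_open_sender(0);
    for (uint32_t i = 0; i < 1024; i++)
        ilib_rawchan_send(ch, i);   /* blocks only when the receiver's
                                       hardware buffer is full */
}

void consumer(void)
{
    rawchan_t ch = ilib_rawchan_open_receiver(0);
    uint32_t sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += ilib_rawchan_receive(ch);
    (void)sum;
}
```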
18. COMMUNICATION INTERFACES (2)
- Message-passing API:
  - Similar to MPI
  - Messages can be sent from any node to any other node at any time
  - No need to establish connections
- Implementation (a usage sketch follows this list):
  - Sender: sends a packet containing the message key and size
  - Receiver: the catch-all queue interrupts the processor
    - If a message with this key is expected, a packet is sent back to the sender to begin the transfer
    - Otherwise the notification is saved; a later ilib_msg_receive() with the same key sends a packet that interrupts the sender to begin the transfer
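A usage-level sketch of this API. Only ilib_msg_receive() is named in the slides; ilib_msg_send() and both parameter lists are assumptions modeled on MPI-style key matching.

```c
#include <stddef.h>

/* Hypothetical prototypes; the real signatures may differ. */
void ilib_msg_send(int dest_rank, int key, const void *buf, size_t size);
void ilib_msg_receive(int key, void *buf, size_t size);

#define KEY_RESULT 42            /* arbitrary illustrative key */

void worker(void)
{
    double partial[256] = {0};
    /* ... compute partial results ... */
    ilib_msg_send(0, KEY_RESULT, partial, sizeof(partial));
}

void aggregator(void)
{
    double partial[256];
    /* Blocks until a message with a matching key arrives; the
     * notify-then-transfer rendezvous described above happens
     * underneath, driven by the catch-all queue interrupt. */
    ilib_msg_receive(KEY_RESULT, partial, sizeof(partial));
}
```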
19. COMMUNICATION INTERFACES (3)
20. COMMUNICATION INTERFACES (4)
- The UDN's maximum BW is 4 bytes/cycle (one 32-bit word per cycle)
- Raw channels reach 3.93 bytes/cycle; the overhead is the header word and tag word (for a maximum-length packet, 128 payload words plus 2 overhead words give roughly 4 x 128/130, about 3.9 bytes/cycle)
- Buffered channels add the overhead of memory reads/writes
- Message passing adds the overhead of interrupting the receiving tile
21. COMMUNICATION INTERFACES (5)
22. APPLICATIONS
- Corner turn:
  - Reorganizes a distributed array from one dimension to another (a sketch follows this list)
  - Every core sends data to every other core
- Important factors:
  - The network used for data distribution (TDN using shared memory, or UDN using raw channels)
  - The network used for tile synchronization (STN or UDN)
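A minimal sketch of the all-to-all exchange at the heart of a corner turn, reusing the hypothetical raw-channel helpers from the earlier sketch. A real iLib version would also synchronize the tiles (over the STN or UDN) and order the exchanges so the hardware buffers never fill; block sizes and channel ids here are illustrative.

```c
#include <stdint.h>

typedef int rawchan_t;                       /* from the earlier sketch */
void     ilib_rawchan_send(rawchan_t ch, uint32_t word);
uint32_t ilib_rawchan_receive(rawchan_t ch);

#define NCORES 8   /* illustrative: one row of the 8x8 mesh */

/* Each core starts holding the blocks of one dimension (block_in) and
 * must end holding the blocks of the other (block_out), so every core
 * exchanges exactly one block with every other core. */
void corner_turn(int my_rank, const uint32_t block_in[NCORES],
                 uint32_t block_out[NCORES])
{
    for (int peer = 0; peer < NCORES; peer++) {
        if (peer == my_rank) {
            block_out[my_rank] = block_in[my_rank];  /* local copy */
            continue;
        }
        /* One raw channel per peer pair, identified here by rank. */
        ilib_rawchan_send(peer, block_in[peer]);
        block_out[peer] = ilib_rawchan_receive(peer);
    }
}
```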
23. APPLICATIONS (2)
- Raw channels with STN synchronization: best performance; raw channels have minimal overhead, and the STN ensures synchronization messages don't interfere with data
- Raw channels with UDN synchronization: the UDN carries both data and synchronization messages, so extra overhead data is needed to distinguish between the two
- Shared memory: simpler to program, but each word of user data incurs four extra words to manage the network and avoid deadlock
24. APPLICATIONS (3)
- Dot product:
  - Pairwise element multiplication, followed by addition of all the products (a sketch follows this list)
  - Measured on a 65,536-element dot product
- Shared memory does not scale: it has higher communication overhead
- From 2 to 4 tiles, speedup is superlinear because the dataset comes to fit entirely in the tiles' L2 caches
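A sketch of the parallel structure, reusing the hypothetical ilib_msg_* prototypes from the message-passing sketch: each tile reduces its own slice locally, then tile 0 aggregates the partial sums. Keys and the even division of work are illustrative.

```c
#include <stddef.h>

/* Hypothetical prototypes from the message-passing sketch above. */
void ilib_msg_send(int dest_rank, int key, const void *buf, size_t size);
void ilib_msg_receive(int key, void *buf, size_t size);

#define TOTAL 65536              /* elements, as in the slides */

float dot_product(int rank, int ntiles, const float *a, const float *b)
{
    int chunk = TOTAL / ntiles;
    int base  = rank * chunk;

    /* Local multiply-accumulate over this tile's slice. */
    float partial = 0.0f;
    for (int i = base; i < base + chunk; i++)
        partial += a[i] * b[i];

    if (rank != 0) {             /* workers ship partial sums to tile 0 */
        ilib_msg_send(0, /* key = */ rank, &partial, sizeof(partial));
        return 0.0f;
    }

    float sum = partial;         /* tile 0 aggregates */
    for (int r = 1; r < ntiles; r++) {
        float p;
        ilib_msg_receive(/* key = */ r, &p, sizeof(p));
        sum += p;
    }
    return sum;
}
```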
25. CONCLUSION
- The Tile Processor uses an unconventional architecture to achieve high on-chip communication BW
- Effective use of this BW is possible due to the synergy between the hardware architecture and the software APIs (iLib)