Title: CSE 661 PAPER PRESENTATION
1. CSE 661 PAPER PRESENTATION
- ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR, by D. Wentzlaff et al.
- Presented by SALAMI, Hamza Onoruoiza (g201002240)
2. OUTLINE OF PRESENTATION
- Introduction
- Tile64 Architecture
- Interconnect Hardware
- Network Uses
- Network to Tile Interface
- Receive-Side Hardware Demultiplexing
- Protection
- Shared Memory Communication and Ordering
- Interconnect Software
- Communication Interfaces
- Applications
- Conclusion
3. INTRODUCTION
- The Tile Processor's five on-chip 2D mesh networks differ from:
  - the traditional bus-based scheme, which requires a global broadcast and hence does not scale beyond 8-16 cores
  - the 1D ring, which does not scale because its bisection bandwidth is constant
- The mesh can support few or many processors with minimal changes to the network structure
4. TILE64 ARCHITECTURE
- 2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh
- 1 GHz clock, 3-way VLIW: 64 tiles x 3 ops x 1 GHz = 192 billion 32-bit instructions/sec
- 4.8 MB of distributed cache; per-tile TLB
- Supports DMA and virtual memory
- Tiles may run independent OSes, or be combined to run a multiprocessor OS such as SMP Linux
- Shared memory: cores directly access other cores' caches through the on-chip interconnects
5. TILE64 ARCHITECTURE (2)
- Off-chip memory BW: 200 Gbps
- I/O BW: 40 Gbps
6. TILE64 ARCHITECTURE (3)
- Courtesy: http://www.tilera.com/products/processors/TILE64
7. INTERCONNECT HARDWARE
- 5 low-latency mesh networks
- Each network connects a tile in five directions: north, south, east, west, and the tile's processor
- Each link is made of two 32-bit unidirectional links (one in each direction)
8. INTERCONNECT HARDWARE (2)
- 1.28 Tb/s of bandwidth in and out of a single tile (5 networks x 4 neighbor directions x 2 x 32-bit links x 1 GHz = 1.28 Tb/s)
9. NETWORK USES
- 4 dynamic networks:
  - The packet header contains the destination (x, y) coordinates and the packet length (up to 128 words); a sketch of such a header follows this list
  - Flow controlled, with reliable delivery
  - UDN: low-latency communication between userland processes without OS intervention
  - IDN: direct communication with I/O devices
  - MDN: communication with off-chip memory
  - TDN: direct tile-to-tile transfers; requests travel over the TDN, responses over the MDN
- 1 static network (STN):
  - Carries streams of data instead of packets
  - A route is set up first, then streams are sent over it (circuit switched)
  - Also a userland network
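To make the header format concrete, here is a minimal C sketch of a dynamic-network packet header carrying only the fields named above. The bit layout and field widths are assumptions for illustration; the slides do not specify the hardware encoding.

```c
#include <stdint.h>

typedef struct {
    uint8_t dest_x;   /* destination tile's x coordinate in the 8x8 mesh */
    uint8_t dest_y;   /* destination tile's y coordinate                 */
    uint8_t length;   /* payload length in 32-bit words (up to 128)      */
} dyn_packet_header_t;

/* Pack the header into a single 32-bit word, since one header word
 * precedes the payload on the dynamic networks. Field positions are
 * hypothetical. */
static inline uint32_t pack_header(dyn_packet_header_t h)
{
    return ((uint32_t)h.dest_x << 24) |
           ((uint32_t)h.dest_y << 16) |
           (uint32_t)h.length;
}
```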
10. LOGICAL VS. PHYSICAL NETWORKS
- 5 physically independent networks
- Lots of free nearest-neighbor on-chip wiring makes this affordable
- Buffer space accounts for about 60% of each network's area, yet each network occupies only about 1.1% of the tile's area
- A more reliable on-chip network means less buffering is needed to manage link failures
11NETWORK TO TILE INTERFACE
- Tiles have register access to on-chip networks.
Instructions can read/write from/to UDN, IDN or
STN. - MDN and UDN used indirectly on cache miss
- Register-mapped network access is provided
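As a rough illustration of register-mapped access, the sketch below wraps a network send and receive in C. The register name udn0 and the assembly syntax are illustrative assumptions, not taken from the slides; the point is that on this architecture a network send or receive compiles to a single register move.

```c
#include <stdint.h>

/* Illustrative only: "udn0" stands in for a register-mapped UDN port. */
static inline void udn_send_word(uint32_t word)
{
    /* Writing the network register injects the word into the UDN. */
    asm volatile ("move udn0, %0" : : "r"(word));
}

static inline uint32_t udn_receive_word(void)
{
    uint32_t word;
    /* Reading the network register dequeues the next word,
     * stalling if the queue is empty. */
    asm volatile ("move %0, udn0" : "=r"(word));
    return word;
}
```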
12. RECEIVE-SIDE HARDWARE DEMULTIPLEXING
- Each message carries a tag word (sending node, stream number, message type); a tag-packing sketch follows this list
- The receiving hardware demultiplexes the message into the appropriate queue using the tag
- On a tag miss, the data is sent to a catch-all queue and an interrupt is raised
- The UDN has 4 demux queues plus one catch-all queue
- The IDN has 2 demux queues plus one catch-all queue
- 128 words of receive-side buffering per tile
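A minimal sketch of packing the three tag fields named above into one word. The field widths are assumptions for illustration; the slides do not give the hardware encoding.

```c
#include <stdint.h>

static inline uint32_t make_tag(uint32_t sender,   /* sending node id */
                                uint32_t stream,   /* stream number   */
                                uint32_t type)     /* message type    */
{
    /* Hypothetical layout: 16 bits of sender, 8 of stream, 8 of type. */
    return (sender << 16) | (stream << 8) | type;
}
```

The receive hardware compares an arriving tag against the tags programmed into its demux queues; on a match the payload is steered into that queue, otherwise it lands in the catch-all queue and raises an interrupt.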
13. RECEIVE-SIDE HARDWARE DEMULTIPLEXING (2)
14. PROTECTION
- The Tile Architecture implements a Multicore Hardwall (MH)
- The MH protects the UDN, IDN, and STN links
- Standard memory protection mechanisms are used for the MDN and TDN
- The MH blocks any attempt to send traffic over a hardwalled link and signals an interrupt to system software
- Protection is implemented on outbound links
15. SHARED MEMORY COMMUNICATION AND ORDERING
- On-chip distributed shared cache
- Data can be retrieved from:
  - the local cache
  - the home tile (request sent through the TDN); each address is cached at a single home tile, where coherency is maintained
  - main memory
- No ordering is guaranteed between the networks and shared memory
- Memory fence instructions are used to enforce ordering (see the sketch below)
16. INTERCONNECT SOFTWARE
- The C-based iLib library provides communication primitives implemented via the UDN
- Lightweight socket-like streaming channels for streaming algorithms
- An MPI-like message-passing interface for ad hoc messaging
17. COMMUNICATION INTERFACES
- iLib sockets:
  - Long-lived FIFO point-to-point connection between two processes
  - Good for producer-consumer relationships
  - Multiple senders to one receiver is possible: good for forwarding results to a single node for aggregation
- Raw channels: low overhead; use only as much space as is available in the hardware buffer (see the sketch below)
- Buffered channels: higher overhead, but the channel buffer can be virtualized in memory
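Below is a hedged sketch of the raw-channel style of communication. The handle type and function names (ilib_rawchan_*) are illustrative stand-ins modeled on the description above, not confirmed iLib signatures.

```c
#include <stdint.h>

typedef int rawchan_t;  /* hypothetical channel handle */

/* Hypothetical prototypes; the real iLib API may differ. */
rawchan_t ilib_rawchan_open_sender(int channel_id);
rawchan_t ilib_rawchan_open_receiver(int channel_id);
void      ilib_rawchan_send(rawchan_t ch, uint32_t word);
uint32_t  ilib_rawchan_receive(rawchan_t ch);

void producer(void)
{
    rawchan_t ch = ilib_rawchan_open_sender(0);
    for (uint32_t i = 0; i < 1024; i++)
        ilib_rawchan_send(ch, i);   /* blocks only when the receiver's
                                       hardware buffer is full */
}

void consumer(void)
{
    rawchan_t ch = ilib_rawchan_open_receiver(0);
    uint32_t sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += ilib_rawchan_receive(ch);
    (void)sum;
}
```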
18. COMMUNICATION INTERFACES (2)
- Message-passing API:
  - Similar to MPI
  - Messages can be sent from any node to any other node at any time
  - No need to establish connections
- Implementation (a usage sketch follows this list):
  - Sender: sends a packet containing the message key and size
  - Receiver: the catch-all queue interrupts the processor
    - If a message with this key is expected, a packet is sent back to the sender to begin the transfer
    - Otherwise the notification is saved; a later ilib_msg_receive() with the same key sends a packet that interrupts the sender to begin the transfer
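A usage-level sketch of this API. Only ilib_msg_receive() is named in the slides; ilib_msg_send() and both parameter lists are assumptions modeled on MPI-style key matching.

```c
#include <stddef.h>

/* Hypothetical prototypes; the real signatures may differ. */
void ilib_msg_send(int dest_rank, int key, const void *buf, size_t size);
void ilib_msg_receive(int key, void *buf, size_t size);

#define KEY_RESULT 42            /* arbitrary illustrative key */

void worker(void)
{
    double partial[256] = {0};
    /* ... compute partial results ... */
    ilib_msg_send(0, KEY_RESULT, partial, sizeof(partial));
}

void aggregator(void)
{
    double partial[256];
    /* Blocks until a message with a matching key arrives; the
     * notify-then-transfer rendezvous described above happens
     * underneath, driven by the catch-all queue interrupt. */
    ilib_msg_receive(KEY_RESULT, partial, sizeof(partial));
}
```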
19. COMMUNICATION INTERFACES (3)
20. COMMUNICATION INTERFACES (4)
- The UDN's maximum BW is 4 bytes/cycle (one 32-bit word per cycle)
- Raw channels reach 3.93 bytes/cycle; the overhead is the header word and tag word (for a maximum-length packet, 128 payload words plus 2 overhead words give roughly 4 x 128/130, about 3.9 bytes/cycle)
- Buffered channels add the overhead of memory reads/writes
- Message passing adds the overhead of interrupting the receiving tile
21. COMMUNICATION INTERFACES (5)
22. APPLICATIONS
- Corner turn:
  - Reorganizes a distributed array from one dimension to another (a sketch follows this list)
  - Every core sends data to every other core
- Important factors:
  - The network used for data distribution (TDN using shared memory, or UDN using raw channels)
  - The network used for tile synchronization (STN or UDN)
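A minimal sketch of the all-to-all exchange at the heart of a corner turn, reusing the hypothetical raw-channel helpers from the earlier sketch. A real iLib version would also synchronize the tiles (over the STN or UDN) and order the exchanges so the hardware buffers never fill; block sizes and channel ids here are illustrative.

```c
#include <stdint.h>

typedef int rawchan_t;                       /* from the earlier sketch */
void     ilib_rawchan_send(rawchan_t ch, uint32_t word);
uint32_t ilib_rawchan_receive(rawchan_t ch);

#define NCORES 8   /* illustrative: one row of the 8x8 mesh */

/* Each core starts holding the blocks of one dimension (block_in) and
 * must end holding the blocks of the other (block_out), so every core
 * exchanges exactly one block with every other core. */
void corner_turn(int my_rank, const uint32_t block_in[NCORES],
                 uint32_t block_out[NCORES])
{
    for (int peer = 0; peer < NCORES; peer++) {
        if (peer == my_rank) {
            block_out[my_rank] = block_in[my_rank];  /* local copy */
            continue;
        }
        /* One raw channel per peer pair, identified here by rank. */
        ilib_rawchan_send(peer, block_in[peer]);
        block_out[peer] = ilib_rawchan_receive(peer);
    }
}
```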
23. APPLICATIONS (2)
- Raw channels with STN synchronization: best performance; raw channels have minimal overhead, and the STN ensures synchronization messages don't interfere with data
- Raw channels with UDN synchronization: the UDN carries both data and synchronization messages, so extra overhead data is needed to distinguish between the two
- Shared memory: simpler to program, but each word of user data incurs four extra words to manage the network and avoid deadlock
24. APPLICATIONS (3)
- Dot product:
  - Pairwise element multiplication, followed by addition of all the products (a sketch follows this list)
  - Measured on a 65,536-element dot product
- Shared memory does not scale: it has higher communication overhead
- From 2 to 4 tiles, speedup is superlinear because the dataset comes to fit entirely in the tiles' L2 caches
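A sketch of the parallel structure, reusing the hypothetical ilib_msg_* prototypes from the message-passing sketch: each tile reduces its own slice locally, then tile 0 aggregates the partial sums. Keys and the even division of work are illustrative.

```c
#include <stddef.h>

/* Hypothetical prototypes from the message-passing sketch above. */
void ilib_msg_send(int dest_rank, int key, const void *buf, size_t size);
void ilib_msg_receive(int key, void *buf, size_t size);

#define TOTAL 65536              /* elements, as in the slides */

float dot_product(int rank, int ntiles, const float *a, const float *b)
{
    int chunk = TOTAL / ntiles;
    int base  = rank * chunk;

    /* Local multiply-accumulate over this tile's slice. */
    float partial = 0.0f;
    for (int i = base; i < base + chunk; i++)
        partial += a[i] * b[i];

    if (rank != 0) {             /* workers ship partial sums to tile 0 */
        ilib_msg_send(0, /* key = */ rank, &partial, sizeof(partial));
        return 0.0f;
    }

    float sum = partial;         /* tile 0 aggregates */
    for (int r = 1; r < ntiles; r++) {
        float p;
        ilib_msg_receive(/* key = */ r, &p, sizeof(p));
        sum += p;
    }
    return sum;
}
```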
25. CONCLUSION
- The Tile Processor uses an unconventional architecture to achieve high on-chip communication BW
- Effective use of this BW is possible due to the synergy between the hardware architecture and the software APIs (iLib)