1
CSE 661 PAPER PRESENTATION
  • ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE
    PROCESSOR
  • By D. Wentzlaff et al.

Presented By SALAMI, Hamza Onoruoiza g201002240
2
OUTLINE OF PRESENTATION
  • Introduction
  • Tile64 Architecture
  • Interconnect Hardware
  • Network Uses
  • Network to Tile Interface
  • Receive-side Hardware Demultiplexing
  • Protection
  • Shared Memory Communication and Ordering
  • Interconnect Software
  • Communication Interfaces
  • Applications
  • Conclusion

3
INTRODUCTION
  • The Tile Processor's five on-chip 2D mesh networks
    differ from:
  • a traditional bus-based scheme, which requires a
    global broadcast and hence does not scale beyond
    roughly 8-16 cores
  • a 1D ring, which does not scale because its bisection
    BW is constant (a 2D mesh's bisection BW grows with
    the number of cores)
  • A mesh supports few or many processors with minimal
    changes to the network structure

4
TILE64 ARCHITECTURE
  • 2D grid of 64 identical compute elements (tiles)
    arranged in an 8 x 8 mesh
  • 1 GHz clock, 3-way VLIW: 64 tiles x 3 ops/cycle x
    1 GHz = 192 billion 32-bit instructions/sec
  • 4.8 MB distributed cache, per-tile TLB
  • Supports DMA and virtual memory
  • Tiles may run independent OSs, or may be combined to
    run a multiprocessor OS such as SMP Linux
  • Shared memory: cores directly access other cores'
    caches through the on-chip interconnects

5
TILE64 ARCHITECTURE (2)
Off-chip memory BW: 200 Gbps; I/O BW: 40 Gbps
6
TILE64 ARCHITECTURE (3)
Courtesy: http://www.tilera.com/products/processors/TILE64
7
INTERCONNECT HARDWARE
  • 5 low latency mesh networks
  • Each network connects a tile in five directions:
    north, south, east, west, and the tile's processor
  • Each connection consists of two 32-bit unidirectional
    links

8
INTERCONNECT HARDWARE (2)
1.28 Tbps of BW in and out of a single tile
9
NETWORK USES
  • 4 dynamic networks
  • packet header contains the destination's (x, y)
    coordinates and the packet length (up to 128 words);
    a header sketch follows this list
  • flow controlled, with reliable delivery
  • UDN: low-latency communication between userland
    processes without OS intervention
  • IDN: direct communication with I/O devices
  • MDN: communication with off-chip memory
  • TDN: direct tile-to-tile transfers; requests travel
    over the TDN, responses over the MDN
  • 1 static network (STN)
  • streams of data instead of packets
  • a route is set up first, then streams are sent
    (circuit switched)
  • also a userland network
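The slides give only the header contents (destination coordinates and length), not the encoding. A minimal sketch of what a dynamic-network header word could look like, assuming a single 32-bit word with packed fields (field names and widths are illustrative, not the actual TILE64 format):

    #include <stdint.h>

    /* Hypothetical 32-bit dynamic-network header. */
    typedef struct {
        uint32_t dest_x : 4;    /* destination column in the 8x8 mesh */
        uint32_t dest_y : 4;    /* destination row */
        uint32_t length : 8;    /* payload length in words (up to 128) */
        uint32_t unused : 16;   /* reserved in this sketch */
    } dyn_header_t;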

10
LOGICAL VS. PHYSICAL NETWORKS
  • 5 physically independent networks
  • Plentiful free nearest-neighbor on-chip wiring makes
    extra physical networks cheap
  • Buffer space takes about 60% of a network's area,
    while each whole network occupies only about 1.1% of
    the tile's area
  • On-chip networks are more reliable than off-chip
    ones, so less buffering is needed to manage link
    failures

11
NETWORK TO TILE INTERFACE
  • Tiles have register access to the on-chip networks:
    instructions can read/write directly from/to the UDN,
    IDN, or STN
  • The MDN and TDN are used indirectly, e.g., when a
    memory request misses in the cache
  • Register-mapped network access means a send or
    receive is just a register operand (see the sketch
    below)
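A minimal sketch of what register-mapped access looks like from software. The network-register name "udn0" and the use of GCC inline asm are assumptions for illustration; on the real hardware any instruction can name a network register as an operand:

    /* Sending is a register write; receiving is a register read. */
    static inline void udn_send_word(unsigned int w) {
        __asm__ volatile ("move udn0, %0" : : "r"(w));  /* write = send */
    }

    static inline unsigned int udn_receive_word(void) {
        unsigned int w;
        __asm__ volatile ("move %0, udn0" : "=r"(w));   /* read = receive */
        return w;
    }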

12
RECEIVE-SIDE HARDWARE DEMULTIPLEXING
  • Each message carries a tag word (sending node, stream
    number, message type)
  • Receiving hardware demultiplexes an arriving message
    into the appropriate queue by matching its tag (see
    the sketch below)
  • On a tag miss, the data is sent to a catch-all queue
    and an interrupt is raised
  • UDN has 4 demux queues plus one catch-all queue
  • IDN has 2 demux queues plus one catch-all queue
  • 128 words of receive-side buffering per tile
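A minimal sketch of the receive-side demux decision, modeled in C for clarity. The queue counts follow the slide; the per-queue tag registers and the raise_interrupt() helper are assumptions about how the hardware is programmed:

    #define UDN_DEMUX_QUEUES 4

    typedef struct {
        unsigned int tag;           /* tag this queue accepts */
        /* ... buffered words ... */
    } demux_queue_t;

    demux_queue_t queues[UDN_DEMUX_QUEUES];
    demux_queue_t catch_all;

    extern void raise_interrupt(void);  /* hypothetical helper */

    demux_queue_t *route_message(unsigned int tag) {
        for (int q = 0; q < UDN_DEMUX_QUEUES; q++)
            if (queues[q].tag == tag)
                return &queues[q];      /* tag hit */
        raise_interrupt();              /* tag miss: notify software */
        return &catch_all;
    }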

13
RECEIVE-SIDE HARDWARE DEMULTIPLEXING (2)
14
PROTECTION
  • The Tile Architecture implements a Multicore Hardwall
    (MH)
  • MH protects the UDN, IDN, and STN links
  • Standard memory-protection mechanisms are used for
    the MDN and TDN
  • MH blocks any attempt to send traffic over a
    hardwalled link, then signals an interrupt to system
    software
  • Protection is implemented on outbound links

15
SHARED MEMORY COMMUNICATION AND ORDERING
  • On-chip distributed shared cache
  • Data can be retrieved from:
  • the local cache
  • the home tile (request sent over the TDN); shared
    data is cached only at its home tile, where coherency
    is maintained
  • main memory
  • There is no guaranteed ordering between the networks
    and shared memory
  • Memory fence instructions are used to enforce
    ordering (see the sketch below)
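A minimal sketch of why the fence matters, assuming a memory_fence() helper standing in for the hardware fence instruction (the placeholder body uses GCC's __sync_synchronize() builtin) and the udn_send_word() routine sketched earlier:

    extern void udn_send_word(unsigned int w);  /* earlier sketch */

    static inline void memory_fence(void) {
        __sync_synchronize();  /* placeholder for the fence instruction */
    }

    volatile int shared_result;  /* lives in shared memory */

    void publish(int value) {
        shared_result = value;  /* store travels over the memory networks */
        memory_fence();         /* wait until the store is visible... */
        udn_send_word(1);       /* ...before signaling over the UDN,
                                   which is not ordered w.r.t. memory */
    }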

16
INTERCONNECT SOFTWARE
  • The C-based iLib library provides communication
    primitives implemented via the UDN
  • Lightweight socket-like streaming channels for
    streaming algorithms
  • An MPI-like message-passing interface for ad hoc
    messaging

17
COMMUNICATION INTERFACES
  • iLib sockets
  • long-lived FIFO point-to-point connection between two
    processes
  • good for producer-consumer relationships
  • multiple senders to one receiver are possible; good
    for forwarding results to a single node for
    aggregation
  • Raw channels: low overhead; use as much space as is
    available in the hardware buffer (see the sketch
    below)
  • Buffered channels: higher overhead, but the buffering
    can be virtualized in memory
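A minimal producer-consumer sketch over a raw channel. The ilib_rawchan_* names and signatures below are assumptions in the spirit of iLib, not its documented API:

    #include <stdint.h>

    /* Hypothetical iLib-style raw-channel interface. */
    typedef struct rawchan rawchan_t;
    extern rawchan_t *ilib_rawchan_open_sender(int channel_id);
    extern rawchan_t *ilib_rawchan_open_receiver(int channel_id);
    extern void ilib_rawchan_send(rawchan_t *ch, uint32_t word);
    extern uint32_t ilib_rawchan_receive(rawchan_t *ch);

    void producer(void) {
        rawchan_t *ch = ilib_rawchan_open_sender(0);
        for (uint32_t i = 0; i < 1024; i++)
            ilib_rawchan_send(ch, i);   /* blocks if the buffer is full */
    }

    void consumer(void) {
        rawchan_t *ch = ilib_rawchan_open_receiver(0);
        for (int i = 0; i < 1024; i++) {
            uint32_t w = ilib_rawchan_receive(ch);
            (void)w;  /* ... process w ... */
        }
    }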

18
COMMUNICATION INTERFACES (2)
  • Message-passing API
  • similar to MPI
  • messages can be sent from any node to any other node
    at any time
  • no need to establish connections
  • Implementation (sketched after this list):
  • sender: sends a packet containing the message key and
    size
  • the receiver's catch-all queue interrupts its
    processor
  • if the receiver is expecting a message with this key,
    it sends a packet back to the sender to begin the
    transfer
  • otherwise, it saves the notification
  • on a later ilib_msg_receive() with the same key, it
    sends a packet that interrupts the sender to begin
    the transfer
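A minimal sketch of the API from the application's point of view. Only ilib_msg_receive() and the key-based rendezvous are named in the slides; the parameter lists below are assumptions modeled on MPI:

    /* Hypothetical signatures. */
    extern void ilib_msg_send(int dest, int key,
                              const void *buf, int size);
    extern void ilib_msg_receive(int src, int key,
                                 void *buf, int size);

    #define RESULT_KEY 42  /* arbitrary message key */

    void send_results(int receiver, int *data, int n) {
        /* announces (key, size); the bulk transfer starts once the
           receiver posts a matching receive */
        ilib_msg_send(receiver, RESULT_KEY, data, n * (int)sizeof(int));
    }

    void get_results(int sender, int *buf, int n) {
        /* matches on the same key, triggering the transfer if the
           sender's notification was already saved */
        ilib_msg_receive(sender, RESULT_KEY, buf, n * (int)sizeof(int));
    }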

19
COMMUNICATION INTERFACES (3)
20
COMMUNICATION INTERFACES (4)
  • The UDN's maximum BW is 4 bytes/cycle (one 32-bit
    word per cycle)
  • Raw channels reach 3.93 bytes/cycle; the small loss
    is overhead from the header word and tag word
  • Buffered channels add the overhead of memory
    reads/writes
  • Message passing adds the overhead of interrupting the
    receiving tile

21
COMMUNICATION INTERFACES (5)
22
APPLICATIONS
  • Corner turn
  • reorganizes a distributed array from one dimension to
    the other (a distributed transpose); a sketch follows
    this list
  • each core sends data to every other core
  • Important factors:
  • the network used for data distribution (TDN using
    shared memory, or UDN using raw channels)
  • the network used for tile synchronization (STN or
    UDN)
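A minimal sketch of the all-to-all exchange at the heart of a corner turn. The send_block()/recv_block()/barrier() helpers are hypothetical stand-ins for the channel and synchronization primitives discussed above; pairing sends and receives by rank order avoids deadlock with blocking calls:

    typedef struct { int words[256]; } block_t;  /* illustrative size */

    extern void send_block(int peer, const block_t *b);  /* hypothetical */
    extern void recv_block(int peer, block_t *b);        /* hypothetical */
    extern void barrier(void);            /* over the STN or the UDN */

    void corner_turn(int me, int num_tiles, block_t local[]) {
        for (int peer = 0; peer < num_tiles; peer++) {
            if (peer == me) continue;
            if (me < peer) {              /* lower rank sends first */
                send_block(peer, &local[peer]);
                recv_block(peer, &local[peer]);
            } else {                      /* higher rank receives first */
                recv_block(peer, &local[peer]);
                send_block(peer, &local[peer]);
            }
        }
        barrier();  /* all tiles finish the exchange before computing */
    }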

23
APPLICATIONS (2)
  • Raw channels with STN synchronization: best
    performance; raw channels have minimal overhead, and
    the STN keeps synchronization messages from
    interfering with data
  • Raw channels with UDN synchronization: the UDN
    carries both data and synchronization messages, so
    extra overhead is needed to distinguish between the
    two
  • Shared memory: simpler to program, but each word of
    user data incurs four extra words to manage the
    network and avoid deadlock

24
APPLICATIONS (3)
  • Dot product
  • pairwise element multiplication, followed by addition
    of all the products (a sketch follows this list)
  • a 65,536-element dot product is measured
  • shared memory is not scalable here: higher
    communication overhead
  • from 2 to 4 tiles, speedup is superlinear because the
    dataset then fits completely into the tiles' combined
    L2 caches
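A minimal sketch of the per-tile computation with a message-based reduction, reusing the hypothetical iLib-style signatures from the earlier sketch (the partitioning and the gather-to-tile-0 reduction are illustrative, not the paper's exact code):

    extern void ilib_msg_send(int dest, int key,
                              const void *buf, int size);
    extern void ilib_msg_receive(int src, int key,
                                 void *buf, int size);

    #define DOT_KEY 7  /* arbitrary message key for the reduction */

    /* Returns the full dot product on tile 0; other tiles return 0. */
    float dot_product(int me, int num_tiles,
                      const float *x, const float *y, int n) {
        int chunk = n / num_tiles;          /* e.g. 65536 / num_tiles */
        float partial = 0.0f;
        for (int i = me * chunk; i < (me + 1) * chunk; i++)
            partial += x[i] * y[i];         /* multiply, then accumulate */

        if (me != 0) {                      /* send partial to tile 0 */
            ilib_msg_send(0, DOT_KEY, &partial, (int)sizeof partial);
            return 0.0f;
        }
        float sum = partial, p;
        for (int src = 1; src < num_tiles; src++) {
            ilib_msg_receive(src, DOT_KEY, &p, (int)sizeof p);
            sum += p;                       /* add all the partials */
        }
        return sum;
    }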

25
CONCLUSION
  • The Tile Processor uses an unconventional
    architecture to achieve high on-chip communication BW
  • Effective use of that BW is possible due to the
    synergy between the hardware architecture and the
    software APIs (iLib)