Title: Digital Space

1
Digital Space
Anant Agarwal MIT and Tilera Corporation
2
Arecibo
3
Stages of Reality
4
Virtual reality
Simulator reality
Prototype reality
Product reality
5
The Opportunity
1996
20MIPS cpu in 1987
Few thousand gates
6
The Opportunity
The billion transistor chip of 2007
7
How to Fritter Away Opportunity
Caches
Control
100-ported RegFile and RR
More resolution buffers, control
the x1786? does not scale
1/10 ns
8
Take Inspiration from ASICs
mem

Lots of ALUs, lots of registers, lots of local
memories: huge on-chip parallelism, but with a
slower clock
Custom-routed, short wires optimized for specific
applications

Fast, low power, area efficient. But not
programmable
9
Our Early Raw Proposal

Got parallelism?

But how to build programmable, yet custom, wires?
10
A digital wire
11
Static Router
A static router!
Compiler
Application
12
Replace Wires with Routed Networks
13
50-Ported Register File -> Distributed Registers
Gigantic 50 ported register file
14
50-Ported Register File -> Distributed Registers
Gigantic 50 ported register file
15
Distributed Registers Routed Network
Distributed register file
Called NURA (ASPLOS 1998)
16
16-Way ALU Clump -> Distributed ALUs
RF
Bypass Net
17
Distributed ALUs, Routed Bypass Network
R
Scalar Operand Network (SON) TPDS 2005
18
Mongo Cache -> Distributed Cache
Gigantic 10 ported cache
19
Distributing the Cache
20
Distributed Shared Cache

R
Like DSM (distributed shared memory), cache is
distributed. But, unlike NUCA, caches are local
to processors, not far away
ISCA 1999
21
Tiled Multicore Architecture

R
22
E.g., Operand Routing in 16-way Superscalar

RF
>>
Bypass Net
Source: Taylor, ISCA 2004
23
Operand Routing in a Tiled Architecture

>>

R
ALU
>>
24
Tiled Multicore

Scales to large numbers of cores
Modular: design, lay out, and verify 1 tile
Power efficient (MIT-CSAIL-TR-2008-066)
Short wires: CV^2f
Chandrakasan effect: CV^2f
Dynamic and compiler scheduled routing
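
The CV^2f entries on this slide read as the standard dynamic-power relation P = C * V^2 * f. A minimal sketch of that arithmetic follows. All capacitance, voltage and frequency values are made up for illustration (none come from the talk), and the assumption that half the clock permits a lower supply voltage is the usual low-power argument, not a measured figure.

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Dynamic switching power, P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz

# Illustrative numbers only; not taken from the talk.
C = 1e-9       # switched capacitance per core (F)
V_FAST = 1.2   # supply voltage assumed necessary at the full clock (V)
V_SLOW = 0.9   # assumed lower voltage that suffices at half the clock (V)
F = 1e9        # full clock (Hz)

one_fast = dynamic_power(C, V_FAST, F)          # one core at f
two_slow = 2 * dynamic_power(C, V_SLOW, F / 2)  # two cores at f/2, same peak throughput
ratio = two_slow / one_fast

print(f"one core at f:    {one_fast:.3f} W")
print(f"two cores at f/2: {two_slow:.3f} W")
print(f"power ratio:      {ratio:.4f}")
```

With these assumed values the parallel configuration draws roughly 56% of the power; the saving comes entirely from the V^2 term, since C * f is unchanged.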

25

A Prototype Tiled Architecture: The Raw
Microprocessor
Billion-transistor IEEE Computer issue,
'97; www.cag.csail.mit.edu/raw
The Raw Chip
Tile
26
Virtual reality
Simulator reality
Prototype reality
Product reality
27
Scalar Operand Transport in Raw
Goal: flow-controlled, in-order delivery of
operands
fadd r5, r3, r24
fmul r24, r3, r4
route P->E, N->S
route W->P, S->N
software controlled crossbar
software controlled crossbar
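
A rough sketch of the operand transport on this slide, under stated assumptions: r24 is treated as a network-mapped register (a write sends to the tile's switch, a read receives from it), each switch route moves one operand between FIFOs, and a route stalls when its source is empty. The `Tile` class, `route` helper, `link` queue and the register values are all hypothetical names and numbers chosen for illustration; this is not the Raw hardware interface.

```python
from collections import deque

class Tile:
    """A tile: a processor register file plus FIFOs to and from its switch."""
    def __init__(self):
        self.regs = {}
        self.to_switch = deque()    # filled when the processor writes r24
        self.from_switch = deque()  # drained when the processor reads r24

def route(src, dst):
    """One switch route: move one operand, or stall if none has arrived."""
    if not src:
        return False           # flow control: nothing to forward yet
    dst.append(src.popleft())  # FIFO order keeps delivery in order
    return True

tile0, tile1 = Tile(), Tile()
link = deque()                 # wire from tile0's east port to tile1's west port
tile0.regs.update(r3=2.0, r4=3.0)
tile1.regs.update(r3=1.5)

# Issuing W->P before any data exists simply stalls.
stalled = not route(link, tile1.from_switch)

# tile0 processor: fmul r24, r3, r4  (result goes to the network)
tile0.to_switch.append(tile0.regs["r3"] * tile0.regs["r4"])
route(tile0.to_switch, link)          # tile0 switch: route P->E
route(link, tile1.from_switch)        # tile1 switch: route W->P
# tile1 processor: fadd r5, r3, r24  (r24 is read from the network)
tile1.regs["r5"] = tile1.regs["r3"] + tile1.from_switch.popleft()

print(stalled, tile1.regs["r5"])
```

The N->S and S->N routes on the slide would be further `route` calls on other port queues; they are left out to keep the sketch short.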
28
RawCC: Distributed ILP Compilation (DILP)
C
tmp0 = (seed*3+2)/2
tmp1 = seed*v1+2
tmp2 = seed*v2+2
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
v3 = tmp3 - v2
Place, Route, Schedule
Partitioning
[Figure: dataflow graph of the renamed code, repeated for each compilation stage. Nodes:]
seed.0 = seed
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
tmp0 = tmp0.1
v1.2 = v1
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
tmp1 = tmp1.3
v2.4 = v2
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
tmp2 = tmp2.5
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
tmp3 = tmp3.6
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
v2 = v2.7
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v1 = v1.8
v0.9 = tmp0.1 - v1.8
v0 = v0.9
v3.10 = tmp3.6 - v2.7
v3 = v3.10
Black arrows: operand communication over SON
ASPLOS 1998
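
To give a feel for the "Place, Route, Schedule" step, here is a small sketch that spreads the slide's dataflow nodes over several tiles with a generic greedy list scheduler. It is not RawCC's algorithm. The dependence list is a reading of the slide's node labels with the copy instructions left out, and the one-cycle instruction latency, the one-cycle cost of sending an operand to another tile, and the `schedule` function itself are assumptions made for illustration.

```python
# (result, operands) for each compute node, in a valid dependence order.
# Names follow the slide's labels; seed.0, v1.2 and v2.4 are live-in values.
INSTRS = [
    ("pval1", ["seed.0"]),
    ("pval0", ["pval1"]),
    ("tmp0.1", ["pval0"]),
    ("pval2", ["seed.0", "v1.2"]),
    ("tmp1.3", ["pval2"]),
    ("pval3", ["seed.0", "v2.4"]),
    ("tmp2.5", ["pval3"]),
    ("pval5", ["seed.0"]),
    ("pval4", ["pval5"]),
    ("tmp3.6", ["pval4"]),
    ("pval6", ["tmp1.3", "tmp2.5"]),
    ("v2.7", ["pval6"]),
    ("pval7", ["tmp1.3", "tmp2.5"]),
    ("v1.8", ["pval7"]),
    ("v0.9", ["tmp0.1", "v1.8"]),
    ("v3.10", ["tmp3.6", "v2.7"]),
]

def schedule(instrs, n_tiles, comm_latency=1):
    """Greedy placement: put each instruction on the tile where it can start earliest.

    Assumes 1-cycle instructions, live-ins available everywhere at cycle 0,
    and `comm_latency` extra cycles when an operand comes from another tile.
    """
    tile_free = [0] * n_tiles  # first free cycle on each tile
    done = {}                  # result name -> (finish cycle, tile)
    for name, operands in instrs:
        best_start, best_tile = None, None
        for tile in range(n_tiles):
            start = tile_free[tile]
            for op in operands:
                if op in done:  # live-ins are not in `done`
                    finish, src_tile = done[op]
                    hop = comm_latency if src_tile != tile else 0
                    start = max(start, finish + hop)
            if best_start is None or start < best_start:
                best_start, best_tile = start, tile
        done[name] = (best_start + 1, best_tile)
        tile_free[best_tile] = best_start + 1
    makespan = max(finish for finish, _ in done.values())
    placement = {name: tile for name, (_, tile) in done.items()}
    return makespan, placement

one_tile, _ = schedule(INSTRS, 1)
four_tiles, placement = schedule(INSTRS, 4)
print("cycles on 1 tile: ", one_tile)
print("cycles on 4 tiles:", four_tiles)
```

Every operand that crosses tiles in `placement` corresponds to one of the slide's black arrows, that is, a value carried over the scalar operand network.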
29
Virtual reality
Simulator reality
Prototype reality
Product reality
30
A Tiled Processor Architecture Prototype: the Raw
Microprocessor
Michael Taylor, Walter Lee, Jason Miller, David
Wentzlaff, Ian Bratt, Ben Greenwald, Henry
Hoffmann, Paul Johnson, Jason Kim, James
Psota, Arvind Saraf, Nathan Shnidman, Volker
Strumpen, Matt Frank, Rajeev Barua, Elliot
Waingold, Jonathan Babb, Sri Devabhaktuni, Saman
Amarasinghe, Anant Agarwal
October 02
31
Raw Die Photo
IBM .18 micron process, 16 tiles, 425MHz, 18
Watts (vpenta)
ISCA 2004
32
Raw Motherboard
33
Raw Ideas and Decisions: What Worked, What Did
Not

Build a complete prototype system
Simple processor with single issue cores
FPGA logic block in each tile
Distributed ILP and static network
Static network for streaming
Multiple types of computation: ILP, streams,
TLP, server
PC in every tile

34
Why Build?

Compiler (Amarasinghe), OS and runtimes (ISI),
apps (ISI, Lincoln Labs, Durand) folks will not
work with you unless you are serious about
building hardware
Need motivation to build software tools
(compilers, runtimes, debugging, visualization);
many challenges here
Run large data sets (simulation takes forever
even with 100 servers!)
Many hard problems show up or are better
understood after you begin building (how to
maintain ordering for distributed ILP, slack for
streaming codes)
Have to solve hard problems: no magic!
The more radical the idea, the more important it
is to build
World will only trust end-to-end results since it
is too hard to dive into details and understand
all assumptions
Would you believe this: "Prof. John Bull has
demonstrated a simulation prototype of a 64-way
issue out-of-order superscalar"?
Cycle simulator became cycle accurate simulator
only after HW got precisely defined
Don't bother to commercialize unless you have a
working prototype
Total network power: a few percent for real apps
Aug 2003 ISLPED, Kim et al., "Energy
characterization of a tiled architecture
processor with on-chip networks"
MIT-CSAIL-TR-2008-066, "Energy scalability of
on-chip interconnection networks in multicore
architectures"
Network power is a few percent in Raw for real
apps; however, it is 36% only for a highly
contrived synthetic sequence meant to toggle
every network wire

35
Raw Ideas and Decisions: What Worked, What Did
Not
Yes

Build a complete prototype system
Simple processor, single issue
FPGA logic block in each tile
Distributed ILP
Static network for streaming
Multiple types of computation: ILP, streams,
TLP, server
PC in every tile

1GHz, 2-way, inorder in 2016
No
Yes 02, No 06, Yes 14
Yes
Yes
36
Raw Ideas and Decisions: Streaming
Interconnect Support
route P->E, N->S
route W->P, S->N
software controlled crossbar
software controlled crossbar
37
Streaming in Tilera's Tile Processor

Streaming done over dynamic interconnect with
stream demuxing (AsTrO SDS)
Automatic demultiplexing of streams into
registers
Number of streams is virtualized

38
Virtual reality
Simulator reality
Prototype reality
39
Why Do We Care? Markets Demanding More Performance

Wireless Networks
Demand for high throughput: more channels
Fast-moving standards: LTE, services
Networking market
Demand for high performance: 10Gbps
Demand for more services, intelligence
Digital Multimedia market
Demand for high performance: H.264 HD
Demand for more services: VoD, transcode

Base Station
GGSN
Switches
Security Appliances
Routers
Video Conferencing
Cable Broadcast
and with power efficiency and programming ease
40
Tilera's TILEPro64 Processor
Multicore Performance (90nm)
Power Efficiency
I/O and Memory Bandwidth
Programming
Tile64, Hotchips 2007; Tile64, Microprocessor
Report, Nov 2007
41
Tile Processor Block Diagram: A Complete System on
a Chip
DDR2 Memory Controller 0
PCIe 0 MAC PHY
Serdes
UART, HPI JTAG, I2C, SPI
GbE 0
GbE 1
Flexible IO
Flexible IO
PCIe 1 MAC PHY
Serdes
DDR2 Memory Controller 3
42
Tile Processor NoC
Tiles

5 independent non-blocking networks
64 switches per network
1 Terabit/sec per Tile
Each network switch directly and independently
connected to tiles
One hop per clock on all networks
I/O write example
Memory write example
Tile to Tile access example
All accesses can be performed simultaneously on
non-blocking networks
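
With 64 switches per network and one hop per clock, the latency between two tiles is set by the number of hops. A minimal sketch follows, assuming the 64 switches form an 8 x 8 grid, tiles are numbered row by row, and packets travel along X first and then Y (dimension-ordered routing). The numbering scheme and the `coords` and `xy_route` helpers are hypothetical and used only for illustration.

```python
MESH_DIM = 8  # 64 switches per network, assumed to be an 8 x 8 grid

def coords(tile_id, dim=MESH_DIM):
    """Map a row-major tile id to (x, y) grid coordinates."""
    return tile_id % dim, tile_id // dim

def xy_route(src, dst, dim=MESH_DIM):
    """Tile ids visited from src to dst, moving along X first, then Y."""
    x, y = coords(src, dim)
    dst_x, dst_y = coords(dst, dim)
    path = [src]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * dim + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * dim + x)
    return path

path = xy_route(0, 63)  # corner to opposite corner
hops = len(path) - 1    # at one hop per clock, also the cycle count in transit
print(path)
print("hops:", hops)
```

The corner-to-corner case is the worst case on such a grid; neighbouring tiles are a single hop apart.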

UDN
STN
IDN
MDN
TDN
IEEE Micro Sep 2007
43
Multicore Hardwall Implementation, or Protection
and Interconnects
44
Product Reality Differences

Market forces
Need crisper answer to "who cares"
SMP Linux programming with pthreads: fully cache
coherent
C API approach to streaming vs. new language
(StreamIt) in Raw
Special instructions for video, networking
Floating point needed in research project, but
not in product for embedded market
Lessons from Raw
E.g., Dynamic network for streams
HW instruction cache
Protected interconnects
More substantial engineering
3-way VLIW CPU, subword arithmetic
Engineering for clock speed and power efficiency
Completeness: I/O interfaces on chip, complete
system chip. Just add DRAM for system
Support for virtual memory, 2D DMA
Runs SMP Linux (can run multiple OSes
simultaneously)

45
Simulator reality
Prototype reality
Product reality
46
What Does the Future Look Like?
Corollary of Moore's law: number of cores will
double every 18 months

Year       '02   '05   '08   '11   '14
Research    16    64   256  1024  4096
Industry     4    16    64   256  1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a
self-respecting OS!)
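
Doubling every 18 months means a factor of four every three years. A small sketch of the projection, starting from 16 cores (research) and 4 cores (industry) in 2002 as read from the slide's chart labels; the `projected_cores` function is a hypothetical helper written for this illustration.

```python
def projected_cores(start_cores, start_year, year, doubling_months=18):
    """Core count if the number of cores doubles every `doubling_months`."""
    months = (year - start_year) * 12
    return start_cores * 2 ** (months / doubling_months)

YEARS = list(range(2002, 2015, 3))  # 2002, 2005, 2008, 2011, 2014

research = [round(projected_cores(16, 2002, y)) for y in YEARS]
industry = [round(projected_cores(4, 2002, y)) for y in YEARS]

print("Year    ", YEARS)
print("Research", research)
print("Industry", industry)
```

Under this rule the industry line reaches 1024 cores in 2014, which is the slide's "1K cores by 2014".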
47
Vision for the Future

The core is the logic gate of the 21st century

48
Research Challenges for 1K Cores

4-16 cores not interesting. Industry is there.
University must focus on 1K cores. Everything
will change!
Can we use 4 cores to get 2X through DILP?
Remember cores will be 1GHz and simple! What is
the interconnect?
How should we program 1K cores? Can interconnect
help with programming?
Locality and reliability WILL matter for 1K
cores. Spatial view of multicore?
Can we add architectural support for programming
ease? E.g., suppose I told you cores are free.
Can you discover mechanisms to make programming
easier?
What is the right grain size for a core?
How must our computational models change in the
face of small memories per core?
How to feed the beast? I/O and external memory
bandwidth
Can we assume perfect reliability any longer?

49
ATAC Architecture
Electrical Mesh Interconnect (EMesh)
Optical Broadcast WDM Interconnect
Proc. BARC Jan 2007, MIT-CSAIL-TR-2009-018
50
Research Challenges for 1K Cores

4-16 cores not interesting. Industry is there.
University must focus on 1K cores. Everything
will change!
Can we use 4 cores to get 2X through DILP? What
is the interconnect?
How should we program 1K cores? Can interconnect
help with programming?
Locality and reliability WILL matter for 1K
cores. Spatial view of multicore?
Can we add architectural support for programming
ease? E.g., suppose I told you cores are free.
Can you discover mechanisms to make programming
easier?
What is the right grain size for a core?
How must our computational models change in the
face of small memories per core?
How to feed the beast? I/O and external memory
bandwidth
Can we assume perfect reliability any longer?

51
FOS: Factored Operating System
OS cores collaborate, inspired by distributed
internet services model
The key idea: space sharing replaces time
sharing
FS
FS
Need new page
User App
I/O
FS
File System

Today: user app and OS kernel thrash each other
in a core's cache
User/OS time sharing is inefficient
Angstrom OS assumes abstracted space model. OS
services bound to distinct cores, separate from
user cores. OS service cores collaborate to
achieve best resource management
User/OS space sharing is efficient

OS Review 2008
52
Research Challenges for 1K Cores

4-16 cores not interesting. Industry is there.
University must focus on 1K cores. Everything
will change!
Can we use 4 cores to get 2X through DILP? What
is the interconnect?
How should we program 1K cores? Can interconnect
help with programming?
Locality and reliability WILL matter for 1K
cores. Spatial view of multicore?
Can we add architectural support for programming
ease? E.g., suppose I told you cores are free.
Can you discover mechanisms to make programming
easier?
What is the right grain size for a core?
How must our computational models change in the
face of small memories per core?
How to feed the beast? I/O and external memory
bandwidth
Can we assume perfect reliability any longer?

53
The following are trademarks of Tilera
Corporation: Tilera, the Tilera Logo, Tile
Processor, TILE64, Embedding Multicore, Multicore
Development Environment, Gentle Slope
Programming, iLib, iMesh and Multicore Hardwall.
All other trademarks and/or registered trademarks
are the property of their respective owners.
