Title: Digital Space
1 Digital Space
Anant Agarwal, MIT and Tilera Corporation
2 Arecibo
3 Stages of Reality
4 Simulator reality
Prototype reality
Product reality
Virtual reality
5 The Opportunity
1996
20 MIPS CPU in 1987
Few thousand gates
6 The Opportunity
The billion transistor chip of 2007
7 How to Fritter Away Opportunity
Caches
Control
100 ported RegFil and RR
More resolution buffers, control
the x1786? does not scale
1/10 ns
8 Take Inspiration from ASICs
- Lots of ALUs, lots of registers, lots of local memories: huge on-chip parallelism, but with a slower clock
- Custom-routed, short wires optimized for specific applications
Fast, low power, area efficient. But not programmable.
9 Our Early Raw Proposal
But how to build programmable, yet custom, wires?
10 A digital wire
11 Static Router
A static router!
Compiler
Application
12 Replace Wires with Routed Networks
13 50-Ported Register File → Distributed Registers
Gigantic 50-ported register file
14 50-Ported Register File → Distributed Registers
Gigantic 50-ported register file
15 Distributed Registers, Routed Network
Distributed register file
Called NURA (ASPLOS 1998)
16 16-Way ALU Clump → Distributed ALUs
RF
Bypass Net
17 Distributed ALUs, Routed Bypass Network
R
Scalar Operand Network (SON), TPDS 2005
18 Mongo Cache → Distributed Cache
Gigantic 10-ported cache
19 Distributing the Cache
20 Distributed Shared Cache
R
Like DSM (distributed shared memory), the cache is distributed. But, unlike NUCA, caches are local to processors, not far away.
ISCA 1999
21 Tiled Multicore Architecture
R
22 E.g., Operand Routing in 16-way Superscalar
RF
>>
Bypass Net
Source: Taylor, ISCA 2004
23 Operand Routing in a Tiled Architecture
>>
R
ALU
>>
24 Tiled Multicore
- Scales to large numbers of cores
- Modular: design, lay out, and verify 1 tile
- Power efficient (MIT-CSAIL-TR-2008-066)
- Short wires: CV^2f
- Chandrakasan effect: CV^2f
- Dynamic and compiler scheduled routing
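The two CV^2f bullets refer to the usual dynamic-power relation P = C * V^2 * f. The sketch below is an illustration only: the capacitance, voltage, and frequency values are hypothetical, and the assumption that a halved clock allows the supply to drop from 1.2 V to 0.8 V is made up for the example. It shows why shorter wires (lower C) and more, slower cores (lower V and f) both reduce power.

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Switching power of CMOS logic: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz

# Hypothetical baseline: 1 nF switched capacitance, 1.2 V, 1 GHz.
C, V, F = 1e-9, 1.2, 1e9
one_fast_core = dynamic_power(C, V, F)

# Two cores at half the clock give the same aggregate cycle rate.
# Assumption for illustration: the slower clock permits a 0.8 V supply.
V_LOW = 0.8
two_slow_cores = 2 * dynamic_power(C, V_LOW, F / 2)

# Shorter wires: assume half the capacitance is switched per cycle.
short_wire_core = dynamic_power(C / 2, V, F)

print(f"one core, 1 GHz, 1.2 V:      {one_fast_core:.2f} W")
print(f"two cores, 500 MHz, 0.8 V:   {two_slow_cores:.2f} W")
print(f"one core, half the wire cap: {short_wire_core:.2f} W")
```

With these made-up numbers, the two slow cores deliver the same number of cycles per second as the fast core at well under half the power, because voltage enters the formula squared.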
25 A Prototype Tiled Architecture: The Raw Microprocessor
Billion transistor IEEE Computer issue, 97
www.cag.csail.mit.edu/raw
The Raw Chip
Tile
26 Virtual reality
Simulator reality
Prototype reality
Product reality
27 Scalar Operand Transport in Raw
Goal: flow-controlled, in-order delivery of operands
fadd r5, r3, r24
fmul r24, r3, r4
route P->E, N->S
route W->P, S->N
software controlled crossbar
software controlled crossbar
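The instruction and route pairs on this slide can be read as one operand travelling between neighbouring tiles. The sketch below is a toy model, not the Raw hardware. It assumes that the fmul runs on the left tile, that the fadd runs on the right tile, and that r24 is a register mapped onto the network port. The `Link` class, the `route` helper, and the register values are hypothetical names and numbers introduced only for illustration.

```python
from collections import deque

class Link:
    """A bounded FIFO between two ports: flow controlled, in-order."""
    def __init__(self, capacity=4):
        self.fifo = deque()
        self.capacity = capacity

    def can_send(self):
        return len(self.fifo) < self.capacity

    def send(self, word):
        if not self.can_send():
            raise RuntimeError("link full: sender must stall")
        self.fifo.append(word)

    def can_recv(self):
        return bool(self.fifo)

    def recv(self):
        return self.fifo.popleft()

def route(src, dst):
    """One switch instruction: forward a word if one is waiting and dst has room."""
    if src.can_recv() and dst.can_send():
        dst.send(src.recv())
        return True
    return False  # nothing moved; the switch would stall and retry

proc_to_switch0 = Link()    # left tile: processor -> its switch (port P)
switch0_to_switch1 = Link() # left switch east port -> right switch west port
switch1_to_proc = Link()    # right tile: switch -> its processor (port P)

regs_left = {"r3": 2.0, "r4": 5.0}
regs_right = {"r3": 1.5}

# Left tile:  fmul r24, r3, r4   (writing r24 injects the result into the network)
proc_to_switch0.send(regs_left["r3"] * regs_left["r4"])
# Left switch:  route P->E
route(proc_to_switch0, switch0_to_switch1)
# Right switch: route W->P
route(switch0_to_switch1, switch1_to_proc)
# Right tile: fadd r5, r3, r24   (reading r24 pops the operand from the network)
regs_right["r5"] = regs_right["r3"] + switch1_to_proc.recv()

print(regs_right["r5"])  # 11.5
```

Because every link is a FIFO and a route only fires when a word is present and the destination has room, operands arrive in order and no sender overruns a receiver, which is the stated goal of the slide.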
28 RawCC: Distributed ILP Compilation (DILP)
C
tmp0 = (seed*3+2)/2
tmp1 = seed*v1+2
tmp2 = seed*v2+2
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
v3 = tmp3 - v2
Partitioning
Place, Route, Schedule
[Figure: dataflow graph of the block above (nodes such as seed.0, pval0 to pval7, tmp0.1, v0.9), shown partitioned across tiles and then placed, routed, and scheduled]
Black arrows: operand communication over SON
ASPLOS 1998
29 Virtual reality
Simulator reality
Prototype reality
Product reality
30 A Tiled Processor Architecture Prototype: the Raw Microprocessor
Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt,
Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota,
Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Rajeev Barua,
Elliot Waingold, Jonathan Babb, Sri Devabhaktuni, Saman Amarasinghe,
Anant Agarwal
October 02
31 Raw Die Photo
IBM .18 micron process, 16 tiles, 425MHz, 18 Watts (vpenta)
ISCA 2004
32 Raw Motherboard
33 Raw Ideas and Decisions: What Worked, What Did Not
- Build a complete prototype system
- Simple processor with single issue cores
- FPGA logic block in each tile
- Distributed ILP and static network
- Static network for streaming
- Multiple types of computation: ILP, streams, TLP, server
- PC in every tile
34 Why Build?
- Compiler (Amarasinghe), OS and runtimes (ISI), and apps (ISI, Lincoln Labs, Durand) folks will not work with you unless you are serious about building hardware
- Need motivation to build software tools: compilers, runtimes, debugging, visualization. Many challenges here
- Run large data sets (simulation takes forever even with 100 servers!)
- Many hard problems show up or are better understood after you begin building (how to maintain ordering for distributed ILP, slack for streaming codes)
- Have to solve hard problems. No magic!
- The more radical the idea, the more important it is to build
- World will only trust end-to-end results, since it is too hard to dive into details and understand all assumptions
- Would you believe this: Prof. John Bull has demonstrated a simulation prototype of a 64-way issue out-of-order superscalar?
- Cycle simulator became cycle-accurate simulator only after HW got precisely defined
- Don't bother to commercialize unless you have a working prototype
- Total network power is a few percent for real apps
Aug 2003 ISLPED, Kim et al., Energy characterization of a tiled architecture processor with on-chip networks
MIT-CSAIL-TR-2008-066, Energy scalability of on-chip interconnection networks in multicore architectures
- Network power is a few percent in Raw for real apps; however, it is 36% only for a highly contrived synthetic sequence meant to toggle every network wire
35 Raw Ideas and Decisions: What Worked, What Did Not
Yes
- Build a complete prototype system
- Simple processor, single issue
- FPGA logic block in each tile
- Distributed ILP
- Static network for streaming
- Multiple types of computation: ILP, streams, TLP, server
- PC in every tile
1GHz, 2-way, inorder in 2016
No
Yes 02, No 06, Yes 14
Yes
Yes
36 Raw Ideas and Decisions: Streaming Interconnect Support
route P->E, N->S
route W->P, S->N
software controlled crossbar
software controlled crossbar
37 Streaming in Tilera's Tile Processor
- Streaming done over dynamic interconnect with stream demuxing (AsTrO SDS)
- Automatic demultiplexing of streams into registers
- Number of streams is virtualized
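The idea of demultiplexing several logical streams that share one dynamic network can be sketched in software. This is an illustration only and says nothing about how the Tile Processor implements it in hardware. It assumes each arriving packet carries a stream identifier, and the function name `demux` and the sample packets are hypothetical.

```python
from collections import defaultdict, deque

def demux(packets):
    """Sort (stream_id, word) packets from one shared port into per-stream queues.

    Order is preserved within each stream. Queues are created on demand,
    so the number of streams is not fixed in advance.
    """
    queues = defaultdict(deque)
    for stream_id, word in packets:
        queues[stream_id].append(word)
    return queues

# Two logical streams interleaved on the same physical network.
arrivals = [(7, "a0"), (3, "b0"), (7, "a1"), (3, "b1"), (7, "a2")]
queues = demux(arrivals)

print(list(queues[7]))  # ['a0', 'a1', 'a2']
print(list(queues[3]))  # ['b0', 'b1']
```

Creating queues on demand is the software analogue of virtualizing the number of streams: the consumer names a stream and reads from it without knowing how many others share the network.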
38 Virtual reality
Simulator reality
Prototype reality
39 Why Do We Care? Markets Demanding More Performance
- Wireless Networks
  - Demand for high throughput: more channels
  - Fast-moving standards: LTE, services
- Networking market
  - Demand for high performance: 10Gbps
  - Demand for more services, intelligence
- Digital Multimedia market
  - Demand for high performance: H.264 HD
  - Demand for more services: VoD, transcode
Base Station
GGSN
Switches
Security Appliances
Routers
Video Conferencing
Cable Broadcast
and with power efficiency and programming ease
40 Tilera's TILEPro64 Processor
Multicore Performance (90nm)
Power Efficiency
I/O and Memory Bandwidth
Programming
Tile64, Hotchips 2007
Tile64, Microprocessor Report, Nov 2007
41 Tile Processor Block Diagram: A Complete System on a Chip
DDR2 Memory Controller 0
PCIe 0 MAC PHY
Serdes
UART, HPI JTAG, I2C, SPI
GbE 0
GbE 1
Flexible IO
Flexible IO
PCIe 1 MAC PHY
Serdes
DDR2 Memory Controller 3
42 Tile Processor NoC
Tiles
- 5 independent non-blocking networks
- 64 switches per network
- 1 Terabit/sec per Tile
- Each network switch directly and independently connected to tiles
- One hop per clock on all networks
- I/O write example
- Memory write example
- Tile to Tile access example
- All accesses can be performed simultaneously on non-blocking networks
UDN
STN
IDN
MDN
TDN
IEEE Micro Sep 2007
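The "64 switches per network" and "one hop per clock" figures allow a rough tile-to-tile latency estimate. The sketch below assumes the 64 switches form an 8x8 mesh and that messages use dimension-ordered (X first, then Y) routing. Neither assumption is stated on the slide, and the function names are hypothetical. Injection and ejection overheads are ignored.

```python
def tile_coords(tile_id, width=8):
    """Map a tile number to (x, y) on a width-by-width grid, row-major."""
    return tile_id % width, tile_id // width

def xy_route(src, dst, width=8):
    """Return the tile ids visited when routing X first, then Y."""
    x, y = tile_coords(src, width)
    dst_x, dst_y = tile_coords(dst, width)
    path = [src]
    while x != dst_x:              # travel along the row first
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:              # then along the column
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path

# Corner to corner is the worst case on the mesh.
path = xy_route(0, 63)
hops = len(path) - 1
print(hops)  # 14 hops, so about 14 clocks at one hop per clock
```

Under these assumptions the hop count is the Manhattan distance between the two tiles, so neighbouring tiles communicate in a single clock while the far corners are 14 clocks apart.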
43 Multicore Hardwall Implementation, or Protection and Interconnects
44 Product Reality Differences
- Market forces
  - Need crisper answer to who cares
  - SMP Linux programming with pthreads, fully cache coherent
  - C API approach to streaming vs. new language (StreamIt) in Raw
  - Special instructions for video, networking
  - Floating point needed in research project, but not in product for embedded market
- Lessons from Raw
  - E.g., dynamic network for streams
  - HW instruction cache
  - Protected interconnects
- More substantial engineering
  - 3-way VLIW CPU, subword arithmetic
  - Engineering for clock speed and power efficiency
  - Completeness: I/O interfaces on chip, complete system chip. Just add DRAM for system
  - Support for virtual memory, 2D DMA
  - Runs SMP Linux (can run multiple OSes simultaneously)
45 Simulator reality
Prototype reality
Product reality
46 What Does the Future Look Like?
Corollary of Moore's law: number of cores will double every 18 months
Year      02    05    08    11    14
Research  16    64    256   1024  4096
Industry  4     16    64    256   1024
1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self
respecting OS!)
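The projection on this slide follows directly from the 18-month doubling rule. The short check below uses the slide's 2002 starting points (16 cores in research, 4 in industry). The function name `projected_cores` is a hypothetical helper introduced for the example.

```python
def projected_cores(base_cores, base_year, year, doubling_months=18):
    """Core count if the number of cores doubles every `doubling_months`."""
    periods = (year - base_year) * 12 / doubling_months
    return round(base_cores * 2 ** periods)

years = [2002, 2005, 2008, 2011, 2014]
research = [projected_cores(16, 2002, y) for y in years]
industry = [projected_cores(4, 2002, y) for y in years]

print(research)  # [16, 64, 256, 1024, 4096]
print(industry)  # [4, 16, 64, 256, 1024]
```

Doubling every 18 months is the same as quadrupling every three years, which is why each column of the chart is four times the previous one.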
47 Vision for the Future
- The core is the logic gate of the 21st century
48 Research Challenges for 1K Cores
- 4-16 cores not interesting. Industry is there. University must focus on 1K cores. Everything will change!
- Can we use 4 cores to get 2X through DILP? Remember cores will be 1GHz and simple! What is the interconnect?
- How should we program 1K cores? Can interconnect help with programming?
- Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
- Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
- What is the right grain size for a core?
- How must our computational models change in the face of small memories per core?
- How to feed the beast? I/O and external memory bandwidth
- Can we assume perfect reliability any longer?
49 ATAC Architecture
Electrical Mesh Interconnect (EMesh)
Optical Broadcast WDM Interconnect
Proc. BARC, Jan 2007; MIT-CSAIL-TR-2009-018
50 Research Challenges for 1K Cores
- 4-16 cores not interesting. Industry is there. University must focus on 1K cores. Everything will change!
- Can we use 4 cores to get 2X through DILP? What is the interconnect?
- How should we program 1K cores? Can interconnect help with programming?
- Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
- Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
- What is the right grain size for a core?
- How must our computational models change in the face of small memories per core?
- How to feed the beast? I/O and external memory bandwidth
- Can we assume perfect reliability any longer?
51 FOS: Factored Operating System
OS cores collaborate, inspired by distributed internet services model
The key idea: space sharing replaces time sharing
[Figure labels: User App, Need new page, I/O, FS (File System)]
- Today: user app and OS kernel thrash each other in a core's cache
- User/OS time sharing is inefficient
- Angstrom OS assumes abstracted space model. OS services bound to distinct cores, separate from user cores. OS service cores collaborate to achieve best resource management
- User/OS space sharing is efficient
OS Review 2008
52 Research Challenges for 1K Cores
- 4-16 cores not interesting. Industry is there. University must focus on 1K cores. Everything will change!
- Can we use 4 cores to get 2X through DILP? What is the interconnect?
- How should we program 1K cores? Can interconnect help with programming?
- Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
- Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
- What is the right grain size for a core?
- How must our computational models change in the face of small memories per core?
- How to feed the beast? I/O and external memory bandwidth
- Can we assume perfect reliability any longer?
53 The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.