Transcript and Presenter's Notes

Title: Digital Space


1
Digital Space
Anant Agarwal MIT and Tilera Corporation
2
Arecibo
3
Stages of Reality
4
Virtual reality
Simulator reality
Prototype reality
Product reality
5
The Opportunity
1996
20 MIPS CPU in 1987
Few thousand gates
6
The Opportunity
The billion transistor chip of 2007
7
How to Fritter Away Opportunity
Caches
Control
100-ported RegFile and RR
More resolution buffers, control
the x1786? does not scale
1/10 ns
8
Take Inspiration from ASICs
mem
  • Lots of ALUs, lots of registers, lots of local
    memories: huge on-chip parallelism, but with a
    slower clock
  • Custom-routed, short wires optimized for specific
    applications

Fast, low power, area efficient. But not
programmable.
9
Our Early Raw Proposal
  • Got parallelism?

But how to build programmable, yet custom, wires?
10
A digital wire
11
Static Router
A static router!
Compiler
Application
12
Replace Wires with Routed Networks
13
50-Ported Register File → Distributed Registers
Gigantic 50-ported register file
14
50-Ported Register File → Distributed Registers
Gigantic 50-ported register file
15
Distributed Registers, Routed Network
Distributed register file
Called NURA (ASPLOS 1998)
16
16-Way ALU Clump → Distributed ALUs
RF
Bypass Net
17
Distributed ALUs, Routed Bypass Network
R
Scalar Operand Network (SON), TPDS 2005
18
Mongo Cache → Distributed Cache
Gigantic 10-ported cache
19
Distributing the Cache
20
Distributed Shared Cache

R
Like DSM (distributed shared memory), the cache is
distributed. But, unlike NUCA, caches are local
to processors, not far away.
ISCA 1999
21
Tiled Multicore Architecture

R
22
E.g., Operand Routing in 16-way Superscalar

RF
>>
Bypass Net
Source: Taylor, ISCA 2004
23
Operand Routing in a Tiled Architecture

>>

R
ALU
>>
24
Tiled Multicore
  • Scales to large numbers of cores
  • Modular: design, layout, and verify 1 tile
  • Power efficient (MIT-CSAIL-TR-2008-066)
  • Short wires (CV²f)
  • Chandrakasan effect (CV²f)
  • Dynamic and compiler-scheduled routing

25

A Prototype Tiled Architecture: The Raw
Microprocessor
Billion-transistor IEEE Computer issue, '97
www.cag.csail.mit.edu/raw
The Raw Chip
Tile
26
Virtual reality
Simulator reality
Prototype reality
Product reality
27
Scalar Operand Transport in Raw
Goal: flow-controlled, in-order delivery of
operands
fadd r5, r3, r24
fmul r24, r3, r4
route P->E, N->S
route W->P, S->N
software controlled crossbar
software controlled crossbar
28
RawCC Distributed ILP Compilation (DILP)
C source (arithmetic operators reconstructed from
the compiler's instruction listing, e.g.
pval1 = seed.0*3.0, pval0 = pval1+2.0,
tmp0.1 = pval0/2.0):
  tmp0 = (seed*3 + 2)/2;
  tmp1 = seed*v1 + 2;
  tmp2 = seed*v2 + 2;
  tmp3 = (seed*6 + 2)/3;
  v2 = (tmp1 - tmp3)*5;
  v1 = (tmp1 + tmp2)*3;
  v0 = tmp0 - v1;
  v3 = tmp3 - v2;
Partitioning, then Place, Route, Schedule
[Figure: the compiler renames values into SSA form
(seed.0, pval0..pval7, tmp0.1..tmp3.6,
v0.9..v3.10), partitions the instruction graph
across tiles, and places, routes, and schedules
it.]
Black arrows: operand communication over the SON
ASPLOS 1998
29
Virtual reality
Simulator reality
Prototype reality
Product reality
30
A Tiled Processor Architecture Prototype: the Raw
Microprocessor
Michael Taylor, Walter Lee, Jason Miller, David
Wentzlaff, Ian Bratt, Ben Greenwald, Henry
Hoffmann, Paul Johnson, Jason Kim, James
Psota, Arvind Saraf, Nathan Shnidman, Volker
Strumpen, Matt Frank, Rajeev Barua, Elliot
Waingold, Jonathan Babb, Sri Devabhaktuni, Saman
Amarasinghe, Anant Agarwal
October '02
31
Raw Die Photo
IBM 0.18-micron process, 16 tiles, 425 MHz, 18
W (vpenta)
ISCA 2004
32
Raw Motherboard
33
Raw Ideas and Decisions: What Worked, What Did
Not
  • Build a complete prototype system
  • Simple processor with single issue cores
  • FPGA logic block in each tile
  • Distributed ILP and static network
  • Static network for streaming
  • Multiple types of computation: ILP, streams,
    TLP, server
  • PC in every tile

34
Why Build?
  • Compiler (Amarasinghe), OS and runtimes (ISI),
    and apps (ISI, Lincoln Labs, Durand) folks will not
    work with you unless you are serious about
    building hardware
  • Need motivation to build software tools --
    compilers, runtimes, debugging, visualization;
    many challenges here
  • Run large data sets (simulation takes forever,
    even with 100 servers!)
  • Many hard problems show up or are better
    understood after you begin building (how to
    maintain ordering for distributed ILP, slack for
    streaming codes)
  • Have to solve hard problems; no magic!
  • The more radical the idea, the more important it
    is to build
  • The world will only trust end-to-end results, since
    it is too hard to dive into details and understand
    all assumptions
  • Would you believe this: "Prof. John Bull has
    demonstrated a simulation prototype of a 64-way
    issue out-of-order superscalar"?
  • The cycle simulator became a cycle-accurate
    simulator only after the HW got precisely defined
  • Don't bother to commercialize unless you have a
    working prototype
  • Total network power: a few percent for real apps
    (Aug 2003 ISLPED, Kim et al., "Energy
    characterization of a tiled architecture
    processor with on-chip networks";
    MIT-CSAIL-TR-2008-066, "Energy scalability of
    on-chip interconnection networks in multicore
    architectures")
  • Network power is a few percent in Raw for real
    apps; however, it is 36% only for a highly
    contrived synthetic sequence meant to toggle
    every network wire

35
Raw Ideas and Decisions: What Worked, What Did
Not
Yes
  • Build a complete prototype system
  • Simple processor, single issue
  • FPGA logic block in each tile
  • Distributed ILP
  • Static network for streaming
  • Multiple types of computation: ILP, streams,
    TLP, server
  • PC in every tile

1 GHz, 2-way, in-order in 2016
No
Yes '02, No '06, Yes '14
Yes
Yes
36
Raw Ideas and Decisions: Streaming
Interconnect Support
route P->E, N->S
route W->P, S->N
software controlled crossbar
software controlled crossbar
37
Streaming in Tilera's Tile Processor
  • Streaming done over dynamic interconnect with
    stream demuxing (AsTrO SDS)
  • Automatic demultiplexing of streams into
    registers
  • Number of streams is virtualized

38
Virtual reality
Simulator reality
Prototype reality
39
Why Do We Care? Markets Demanding More Performance
  • Wireless networks market
  • Demand for high throughput, more channels
  • Fast-moving standards (LTE), services
  • Networking market
  • Demand for high performance (10 Gbps)
  • Demand for more services, intelligence
  • Digital multimedia market
  • Demand for high performance (H.264 HD)
  • Demand for more services (VoD, transcode)

Base Station
GGSN
Switches
Security Appliances
Routers
Video Conferencing
Cable Broadcast
and with power efficiency and programming ease
40
Tilera's TILEPro64 Processor
Multicore Performance (90nm)
Power Efficiency
I/O and Memory Bandwidth
Programming
Tile64, Hot Chips 2007; Tile64, Microprocessor
Report, Nov 2007
41
Tile Processor Block Diagram: A Complete System on
a Chip
DDR2 Memory Controller 0
PCIe 0 MAC PHY
Serdes
UART, HPI JTAG, I2C, SPI
GbE 0
GbE 1
Flexible IO
Flexible IO
PCIe 1 MAC PHY
Serdes
DDR2 Memory Controller 3
42
Tile Processor NoC
Tiles
  • 5 independent non-blocking networks
  • 64 switches per network
  • 1 Terabit/sec per Tile
  • Each network switch directly and independently
    connected to tiles
  • One hop per clock on all networks
  • I/O write example
  • Memory write example
  • Tile to Tile access example
  • All accesses can be performed simultaneously on
    non-blocking networks

UDN
STN
IDN
MDN
TDN
IEEE Micro Sep 2007
43
Multicore Hardwall Implementation, or Protection
and Interconnects
44
Product Reality Differences
  • Market forces
  • Need crisper answer to "who cares?"
  • SMP Linux programming with pthreads, fully cache
    coherent
  • C API approach to streaming vs. a new language
    (StreamIt) in Raw
  • Special instructions for video, networking
  • Floating point needed in research project, but
    not in product for embedded market
  • Lessons from Raw
  • E.g., dynamic network for streams
  • HW instruction cache
  • Protected interconnects
  • More substantial engineering
  • 3-way VLIW CPU, subword arithmetic
  • Engineering for clock speed and power efficiency
  • Completeness: I/O interfaces on chip; a complete
    system chip. Just add DRAM for a system
  • Support for virtual memory, 2D DMA
  • Runs SMP Linux (can run multiple OSes
    simultaneously)

45
Simulator reality
Prototype reality
Product reality
46
What Does the Future Look Like?
Corollary of Moore's law: the number of cores will
double every 18 months

Year       '02   '05   '08   '11   '14
Research    16    64   256  1024  4096
Industry     4    16    64   256  1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a
self-respecting OS!)
47
Vision for the Future
  • The core is the logic gate of the 21st century

48
Research Challenges for 1K Cores
  • 4-16 cores are not interesting; industry is there.
    Universities must focus on 1K cores. Everything
    will change!
  • Can we use 4 cores to get 2X through DILP?
    (Remember, cores will be 1 GHz and simple!) What
    is the interconnect?
  • How should we program 1K cores? Can interconnect
    help with programming?
  • Locality and reliability WILL matter for 1K
    cores. Spatial view of multicore?
  • Can we add architectural support for programming
    ease? E.g., suppose I told you cores are free.
    Can you discover mechanisms to make programming
    easier?
  • What is the right grain size for a core?
  • How must our computational models change in the
    face of small memories per core?
  • How to feed the beast? I/O and external memory
    bandwidth
  • Can we assume perfect reliability any longer?

49
ATAC Architecture
Electrical Mesh Interconnect (EMesh)
Optical Broadcast WDM Interconnect
Proc. BARC Jan 2007, MIT-CSAIL-TR-2009-018
50
Research Challenges for 1K Cores
  • 4-16 cores are not interesting; industry is there.
    Universities must focus on 1K cores. Everything
    will change!
  • Can we use 4 cores to get 2X through DILP? What
    is the interconnect?
  • How should we program 1K cores? Can interconnect
    help with programming?
  • Locality and reliability WILL matter for 1K
    cores. Spatial view of multicore?
  • Can we add architectural support for programming
    ease? E.g., suppose I told you cores are free.
    Can you discover mechanisms to make programming
    easier?
  • What is the right grain size for a core?
  • How must our computational models change in the
    face of small memories per core?
  • How to feed the beast? I/O and external memory
    bandwidth
  • Can we assume perfect reliability any longer?

51
FOS: Factored Operating System
OS cores collaborate, inspired by the distributed
internet services model
The key idea: space sharing replaces time
sharing
[Figure: user app, I/O, and file-system (FS)
service cores]
  • Today, user app and OS kernel thrash each other
    in a core's cache
  • User/OS time sharing is inefficient
  • Angstrom OS assumes an abstracted space model: OS
    services are bound to distinct cores, separate from
    user cores; OS service cores collaborate to
    achieve the best resource management
  • User/OS space sharing is efficient

OS Review 2008
52
Research Challenges for 1K Cores
  • 4-16 cores are not interesting; industry is there.
    Universities must focus on 1K cores. Everything
    will change!
  • Can we use 4 cores to get 2X through DILP? What
    is the interconnect?
  • How should we program 1K cores? Can interconnect
    help with programming?
  • Locality and reliability WILL matter for 1K
    cores. Spatial view of multicore?
  • Can we add architectural support for programming
    ease? E.g., suppose I told you cores are free.
    Can you discover mechanisms to make programming
    easier?
  • What is the right grain size for a core?
  • How must our computational models change in the
    face of small memories per core?
  • How to feed the beast? I/O and external memory
    bandwidth
  • Can we assume perfect reliability any longer?

53
The following are trademarks of Tilera
Corporation: Tilera, the Tilera Logo, Tile
Processor, TILE64, Embedding Multicore, Multicore
Development Environment, Gentle Slope
Programming, iLib, iMesh and Multicore Hardwall.
All other trademarks and/or registered trademarks
are the property of their respective owners.