Title: Evaluating The Raw Microprocessor: Scalability and Versatility
1Evaluating The Raw Microprocessor: Scalability and Versatility
Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. M.I.T.
2Could processors be even more general purpose?
General Purpose Microprocessor: SPEC, Office
Custom Chip: Video/3D Graphics, Network, Encryption, Wireless/Cell Phone, Digital Camera, MP3 Player, Automotive
A square inch of silicon gets more powerful every generation.
Why can custom chips run these apps?
3Custom Chips: Efficient Extraction of Parallelism
                        Custom Chip           GP Micro
Parallel operators      10s, 100s or 1000s    3-8
Parallel memory ports   10s or 100s           2
Parallel I/O ops        10s or 100s           1
Customized placement and routing of operators and operands:
- High locality
- Minimum control
- Operands routed over wires, not through register files
=> Area and power efficient
But not general purpose! Can't run GCC.
4The Raw Goal
Create an architecture that:
- Scales to 100s-1000s of functional units and memory ports by exploiting custom-chip-like features, in particular application-specific routing of operands
- while being general purpose: runs ILP-based sequential programs and supports standard general-purpose abstractions such as context switching, caching, and instruction virtualization
IEEE Micro, Billion Transistor Issue, 1997
5Un-buildable Super-Wide Issue GP
6Area and Frequency Scalability Problems
[Figure labels: N ALUs, RF, Bypass Net, N^3. Ex: Itanium 2]
Without modification, frequency decreases linearly or worse.
7Operand Routing is Global
[Figure labels: RF, >>, Bypass Net]
8Idea: Exploit Locality
[Figure labels: RF, Bypass Net]
9Idea: Exploit Locality
[Figure label: RF]
10Replace the crossbar with a point-to-point, pipelined, routed network.
[Figure label: RF]
11Replace the crossbar with a point-to-point, pipelined, routed network.
[Figure labels: RF, >>]
12Operand Transport Scaling - Bandwidth and Area
               Un-pipelined crossbar   Point-to-Point Routed Mesh Network
ALUs           N                       N
Bisection BW   N^(1/2)                 N^(1/2)
Local BW       N^(1/2)                 N
Area           N^2                     N
Scales as 2-D VLSI
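The growth terms in the table above can be tabulated for concrete ALU counts. The following is a minimal sketch, assuming the table entries are pure asymptotic terms with all constant factors dropped; the function names (`crossbar_scaling`, `mesh_scaling`, `area_ratio`) are hypothetical and chosen only for illustration.

```python
import math


def crossbar_scaling(n):
    """Growth terms for an un-pipelined crossbar with n ALUs (constants dropped)."""
    return {"bisection_bw": math.sqrt(n), "local_bw": math.sqrt(n), "area": n ** 2}


def mesh_scaling(n):
    """Growth terms for a point-to-point routed mesh with n ALUs (constants dropped)."""
    return {"bisection_bw": math.sqrt(n), "local_bw": n, "area": n}


def area_ratio(n):
    """How many times larger the crossbar area term is than the mesh area term."""
    return crossbar_scaling(n)["area"] / mesh_scaling(n)["area"]


if __name__ == "__main__":
    # ALU counts taken from the talk's own examples (16, 64, 1,024).
    for n in (16, 64, 1024):
        xbar, mesh = crossbar_scaling(n), mesh_scaling(n)
        print(f"N={n}: crossbar area ~{xbar['area']}, mesh area ~{mesh['area']}, "
              f"ratio ~{area_ratio(n):.0f}x, mesh local BW ~{mesh['local_bw']}")
```

Because constants are ignored, the numbers only indicate how fast the gap widens as N grows, not the actual area of either design.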
13Operand Transport Scaling - Latency
Time for an operand to travel between instructions mapped to different ALUs.
                            Un-pipelined crossbar   Point-to-Point Routed Mesh Network
Non-local Placement         N                       N^(1/2)
Locality Driven Placement   N                       1
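The mesh column of the latency table can be illustrated by counting hops on a square mesh. This is a sketch under stated assumptions: tiles are numbered in row-major order, operands follow shortest (Manhattan) paths, and hop count stands in for latency. The helper names are hypothetical and not taken from the Raw tools.

```python
import math
from itertools import product


def mesh_hops(a, b, side):
    """Manhattan hop count between tiles a and b (row-major ids) on a side x side mesh."""
    ay, ax = divmod(a, side)
    by, bx = divmod(b, side)
    return abs(ax - bx) + abs(ay - by)


def average_hops(n):
    """Mean hop count over all ordered pairs of distinct tiles (non-local placement)."""
    side = math.isqrt(n)
    assert side * side == n, "sketch assumes a square mesh"
    pairs = [(a, b) for a, b in product(range(n), repeat=2) if a != b]
    return sum(mesh_hops(a, b, side) for a, b in pairs) / len(pairs)


def worst_case_hops(n):
    """Corner-to-corner distance, which grows as N^(1/2)."""
    return 2 * (math.isqrt(n) - 1)


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(f"N={n}: average hops {average_hops(n):.2f}, worst case {worst_case_hops(n)}, "
              f"neighbor-to-neighbor {mesh_hops(0, 1, math.isqrt(n))}")
```

With locality-driven placement, communicating instructions sit on adjacent tiles, so the hop count stays at 1 regardless of N, while randomly placed pairs pay a distance that grows with the mesh side length.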
14Distribute the Register File
[Figure label: RF]
15 SCALABLE
16More Scalability Problems
Control
Unified Load/Store Queue
17Distribute the rest.
ISCA '99
18Tiles!
19Tiles!
20Tiled Processor Architectures
- composed of a replicated tile
- all signals registered at tile boundaries
- NO global signals
- wire delay problem much easier
- easy scalability story
Easier to tune the frequency
Easier to verify
Easier to do the physical design
21Raw Compute Internals
23Evaluation of Raw
- holistic approach
- design a complete architecture
- design and build the processor and enclosing system
- build the compilers
- used the chip in real systems
- head-to-head versus Intel chip in same litho generation
24Raw
180 nm ASIC (IBM SA-27E)
16 tiles
Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V
Frequency competitive with IBM-implemented PowerPCs in same process.
18 W (vpenta)
Critical path: single-ported 32 KB SRAM, 14-bit mux, flip-flop
25Raw Chips
October 2002
26Raw motherboard
Support Chipset implemented in FPGA (vs.
custom ASICs for P3)
27Comparison to Pentium 3
Honest: Self-comparisons hide architectural and compiler inefficiency.
People can now compare to P3 and by extension to Raw.
What's hard: Normalizations between processors are very tricky, especially academic projects versus industry.
- ASIC cannot attain the same frequencies.
Our solution:
- Pick closest Intel processor implementation
- Don't scale any numbers in any way.
28Parameter                     IBM SA-27E (Raw)   Intel P858 (P3)   Favors
Litho                           180 nm             180 nm            -
Metal Layers                    Cu 6               Al 6              Raw
Wire sizing                     No                 Yes               Intel
Dielectric k                    4.1                3.55              Intel
FO1 Delay                       23 ps              11 ps             Intel
Design Style                    Std Cell ASIC      Full custom       Intel
Voltage Tweak                   0                  10                Intel
Initial Freq (MHz)              425                500-733           -
Presumed Ave. Chip Freq (MHz)   425                600               -
Pins                            1100               190               Raw
Die Area                        331 mm^2           106 mm^2          Raw
29Methodology - HW
Intel: Pentium III Coppermine 600 MHz; Dell Precision 410, stocked with 2-2-2 PC100 DRAM
Raw: Validated cycle-accurate simulator
- Matches RTL for Raw chip to the precise cycle for all 200,000 lines of test code
Simulator used so we could:
- Normalize motherboard DRAM timings
- Replace (research) software i-caching system with conventional hardware i-cache
30Methodology - SW
When applicable:
- normalize compiler
  P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse
  Raw: gcc 3.3 -O3 (non-parallelizing)
- normalize stdio/stdlib
  P3, Raw: Newlib 1.9.0 w/ Deionizer
P3:
- Intel Performance Primitives
- LAPACK/BLAS with SSE for linear algebra routines
Raw:
- rawcc - home brew parallelizing compiler
- StreamIt - home brew parallelizing compiler
- gcc 3.3 snippets inline assembly for some parallel apps
31Performance Survey
32Sources of Speedup vs. P3 or 1 Tile
Factor                         Approx. Upper Bound on Speedup
Tile Parallelism               16x
Streaming I/O Bandwidth        60x
Streaming v. cache thrashing   15x
33Future Work: Raw supercomputing fabric
Emulator of a 1K-tile Raw chip, circa 2010
Ultimate test of scaling
34Related Work: AsTrO Taxonomy
Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?
Transport (Static/Dynamic): Are operand routes predetermined?
Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
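The three yes/no questions of the taxonomy can be captured as a small data structure. This is only an illustrative sketch: the class names and the three-letter code (for example "SDS") are conventions assumed here, and no specific processor is classified, since the slide text does not spell out each design's position.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Policy(Enum):
    STATIC = "S"    # decided ahead of time (predetermined)
    DYNAMIC = "D"   # decided at run time


@dataclass(frozen=True)
class AsTrO:
    assignment: Policy  # Is instruction assignment to ALUs predetermined?
    transport: Policy   # Are operand routes predetermined?
    ordering: Policy    # Is per-node instruction execution order predetermined?

    def code(self):
        """Three-letter shorthand, one letter per axis (hypothetical notation)."""
        return self.assignment.value + self.transport.value + self.ordering.value


def all_classes():
    """Enumerate every point in the taxonomy (2 x 2 x 2 = 8 combinations)."""
    return [AsTrO(a, t, o) for a, t, o in product(Policy, repeat=3)]


if __name__ == "__main__":
    for cls in all_classes():
        print(cls.code(), cls)
```

Treating each axis as an independent two-valued field makes it explicit that the taxonomy defines eight design points, into which the distributed microprocessors on the following slide are sorted.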
35How Raw relates to other distributed microprocessors using AsTrO taxonomy
[Figure: tree that branches on Assignment, Transport, and Ordering, each Static or Dynamic. Leaf labels: GRID '01, WaveScalar '03, OOO-Superscalar, RawDyn '00, Raw '97, Scale '04, ILDP '00]
36Conclusions
- VLSI scalable microprocessors are possible.
- Constant factors are beginning to give way to asymptotics:
  - 16 ALU Raw: Oct 2002
  - 64 ALU Raw: Now
  - 1,024 ALU Raw: 2010
  - 32,768 ALU Raw: If Moore's Law makes it to 2 nm
- There is an opportunity to make processors more versatile, i.e., steal applications from custom chips.
- Tiled Processor Architectures are a promising approach and merit further research.
37
38Embedded system: 1020 Element Microphone Array