Evaluating The Raw Microprocessor: Scalability and Versatility

1
Evaluating The Raw Microprocessor: Scalability
and Versatility
Michael Taylor, Walter Lee, Jason Miller, David
Wentzlaff, Ian Bratt, Ben Greenwald, Henry
Hoffmann, Paul Johnson, Jason Kim, James Psota,
Arvind Saraf, Nathan Shnidman, Volker Strumpen,
Matt Frank, Saman Amarasinghe, and Anant
Agarwal. M.I.T.
2
Could processors be even more general purpose?
SPEC, Office
General Purpose Microprocessor
Custom Chip
Video/3D Graphics, Network, Encryption, Wireless/Cell
Phone, Digital Camera, MP3 Player, Automotive
Square inch of silicon: gets more powerful every
generation
Why can custom chips run these apps?
3
Custom Chips: Efficient Extraction
of Parallelism
                        Custom Chip          GP Micro
Parallel operators      10s, 100s or 1000s   3-8
Parallel memory ports   10s or 100s          2
Parallel I/O ops        10s or 100s          1
Customized placement and routing of operators and
operands:
- High locality
- Minimum control
- Operands routed over wires, not through register
  files
=> Area and power efficient
But not general purpose! Can't run GCC.
4
The Raw Goal
Create an architecture that:
- Scales to 100s-1000s of functional units and memory ports
  by exploiting custom-chip-like features,
  in particular application-specific routing of operands
while being general purpose:
- Runs ILP-based sequential programs
- Supports standard general-purpose abstractions, like
  context switching, caching and instruction virtualization
IEEE Micro, Billion Transistor Issue, 1997
5
Un-buildable Super-Wide Issue GP
6
Area and Frequency Scalability Problems
N^3
N ALUs
RF
Bypass Net
Ex: Itanium 2
Without modification, frequency decreases linearly or
worse.
7
Operand Routing is Global

RF
>>
Bypass Net
8
Idea Exploit Locality
RF
Bypass Net
9
Idea Exploit Locality
RF
10
Replace the crossbar with a point-to-point,
pipelined, routed network.
RF
11
Replace the crossbar with a point-to-point,
pipelined, routed network.

RF
>>
12
Operand Transport Scaling: Bandwidth and Area
                Un-pipelined crossbar   Point-to-Point Routed Mesh Network
ALUs            N                       N
Bisection BW    N^(1/2)                 N^(1/2)
Local BW        N^(1/2)                 N
Area            N^2                     N
Scales as 2-D VLSI
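The scaling terms in the table can be tabulated for concrete ALU counts. The following is a minimal Python sketch, assuming the table entries are pure asymptotic terms with all constant factors set to 1; the function names `crossbar`, `mesh` and `area_ratio` are illustrative and do not come from the slides.

```python
import math


def crossbar(n):
    """Scaling terms for an un-pipelined crossbar with n ALUs (constants dropped)."""
    return {
        "bisection_bw": math.sqrt(n),
        "local_bw": math.sqrt(n),
        "area": float(n ** 2),
    }


def mesh(n):
    """Scaling terms for a point-to-point routed mesh with n ALUs (constants dropped)."""
    return {
        "bisection_bw": math.sqrt(n),
        "local_bw": float(n),
        "area": float(n),
    }


def area_ratio(n):
    """How many times more area the crossbar term is than the mesh term."""
    return crossbar(n)["area"] / mesh(n)["area"]


if __name__ == "__main__":
    # ALU counts taken from the Conclusions slide.
    for n in (16, 64, 1024):
        print(n, crossbar(n), mesh(n), area_ratio(n))
```

Under these assumptions the area gap between the two designs grows linearly with N, while bisection bandwidth is the same for both.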
13
Operand Transport Scaling - Latency
Time for operand to travel between instructions
mapped to different ALUs.
                            Un-pipelined crossbar   Point-to-Point Routed Mesh Network
Non-local Placement         N                       N^(1/2)
Locality Driven Placement   N                       1
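One way to see where the mesh entries come from is to count hops on a square grid of tiles. This is a sketch only, assuming a sqrt(N) by sqrt(N) mesh in which an operand moves one tile per hop along a Manhattan path; the helper names `mesh_hops` and `worst_case_mesh_hops` are hypothetical.

```python
import math


def mesh_hops(src, dst):
    """Manhattan distance in hops between two (row, col) tile coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def worst_case_mesh_hops(n):
    """Corner-to-corner hop count on a square mesh of n tiles."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes n is a perfect square")
    return 2 * (side - 1)


if __name__ == "__main__":
    # Non-local placement: hop count grows roughly as N^(1/2).
    for n in (16, 64, 1024):
        print(n, worst_case_mesh_hops(n))
    # Locality-driven placement: producer and consumer on adjacent tiles.
    print(mesh_hops((0, 0), (0, 1)))
```

With locality-driven placement the dependent instruction sits on a neighboring tile, so the hop count stays at 1 regardless of N.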
14
Distribute the Register File
RF
15
SCALABLE
16
More Scalability Problems
Control
Unified Load/Store Queue
17
Distribute the rest.
ISCA '99
18
Tiles!
19
Tiles!
20
Tiled Processor Architectures
- composed of a replicated tile
- all signals registered at tile boundaries
- NO global signals
- wire delay problem much easier
- easy scalability story
Easier to Tune the Frequency
Easier to Verify
Easier to do the Physical Design
21
Raw Compute Internals
22
(No Transcript)
23
Evaluation of Raw
- holistic approach
- design a complete architecture
- design and build the processor and enclosing system
- build the compilers
- used the chip in real systems
- head-to-head versus Intel chip in same litho generation
24
Raw
180 nm ASIC (IBM SA-27E)
16 tiles
Core Frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V
Frequency competitive with IBM-implemented PowerPCs in same process.
18 W (vpenta)
Critical Path: Single-Ported 32 KB SRAM, 14-bit Mux, Flip Flop
25
Raw Chips
October 2002
26
Raw motherboard
Support Chipset implemented in FPGA (vs.
custom ASICs for P3)
27
Comparison to Pentium 3
Honest
Self-comparisons hide architectural and compiler
inefficiency.
People can now compare to P3 and by extension to
Raw.
What's hard: Normalization between processors
is very tricky, especially for academic projects
versus industry.
- An ASIC cannot attain the same frequencies.
Our solution:
- Pick the closest Intel processor implementation
- Don't scale any numbers in any way.
28
Parameter                       IBM SA-27E (Raw)   Intel P858 (P3)   Favors
Litho                           180 nm             180 nm            -
Metal Layers                    Cu 6               Al 6              Raw
Wire sizing                     No                 Yes               Intel
Dielectric k                    4.1                3.55              Intel
FO1 Delay                       23 ps              11 ps             Intel
Design Style                    Std Cell ASIC      Full custom       Intel
Voltage Tweak                   0                  10                Intel
Initial Freq (MHz)              425                500-733           -
Presumed Ave. Chip Freq (MHz)   425                600               -
Pins                            1100               190               Raw
Die Area                        331 mm^2           106 mm^2          Raw
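Because the two chips run at different clock rates, a cycle-count comparison and a wall-clock comparison give different answers. The sketch below converts cycle counts to time using the presumed average frequencies from the table (425 MHz for Raw, 600 MHz for the P3). The cycle counts passed in are hypothetical inputs, and the function name `speedup_by_time` is an assumption for illustration, not something taken from the slides.

```python
RAW_HZ = 425e6  # presumed average Raw frequency, from the table
P3_HZ = 600e6   # presumed average P3 frequency, from the table


def speedup_by_time(raw_cycles, p3_cycles, raw_hz=RAW_HZ, p3_hz=P3_HZ):
    """Speedup of Raw over the P3 in wall-clock time for one workload.

    raw_cycles and p3_cycles are measured cycle counts supplied by the caller.
    """
    raw_seconds = raw_cycles / raw_hz
    p3_seconds = p3_cycles / p3_hz
    return p3_seconds / raw_seconds


if __name__ == "__main__":
    # Hypothetical workload taking the same number of cycles on both chips:
    # Raw is slower in time purely because of its lower clock rate.
    print(speedup_by_time(1_000_000, 1_000_000))
```

With equal cycle counts the time-based speedup is 425/600, so Raw has to win by more than that margin in cycles to come out ahead in time.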
29
Methodology - HW
Intel: Pentium III Coppermine 600 MHz, Dell
Precision 410, stocked with 2-2-2 PC100 DRAM
Raw: Validated Cycle-Accurate Simulator
- Matches RTL for Raw Chip to the precise cycle
  for all 200,000 lines of test code
Simulator used so we could:
- Normalize motherboard DRAM timings
- Replace (research) software i-caching system
  with conventional hardware i-cache.
30
Methodology - SW
When applicable:
- normalize compiler
  P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse
  Raw: gcc 3.3 -O3 (non parallelizing)
- normalize stdio/stdlib
  P3 and Raw: Newlib 1.9.0
w/ Deionizer
  P3: Intel Performance Primitives,
      LAPACK/BLAS with SSE for linear algebra routines
  Raw: rawcc - home brew parallelizing compiler
       StreamIt - home brew parallelizing compiler
       gcc 3.3 snippets and inline assembly for some parallel apps
31
Performance Survey
32
Sources of Speedup vs. P3 or 1 Tile
Factor                          Approx. Upper Bound on Speedup
Tile Parallelism                16x
Streaming I/O Bandwidth         60x
Streaming vs. cache thrashing   15x
33
Future Work: Raw supercomputing fabric
Emulator of a 1K-tile Raw chip, circa 2010
Ultimate test of scaling
34
Related Work: AsTrO Taxonomy
Assignment (Static/Dynamic):
Is instruction assignment to ALUs
predetermined?
Transport (Static/Dynamic):
Are operand routes predetermined?
Ordering (Static/Dynamic):
Is the execution order of instructions
assigned to a node predetermined?
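The three questions above each have a Static or Dynamic answer, so the taxonomy spans eight classes. A minimal Python sketch of that space follows; the type names `Binding` and `AsTrO` and the three-letter codes such as "SSS" are illustrative assumptions, and no specific processor is classified here.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Binding(Enum):
    """Whether a decision is fixed ahead of time or made at run time."""
    STATIC = "S"
    DYNAMIC = "D"


@dataclass(frozen=True)
class AsTrO:
    assignment: Binding  # is instruction assignment to ALUs predetermined?
    transport: Binding   # are operand routes predetermined?
    ordering: Binding    # is execution order on a node predetermined?

    def code(self):
        """Three-letter shorthand, e.g. 'SDS' (hypothetical notation)."""
        return self.assignment.value + self.transport.value + self.ordering.value


def all_classes():
    """Enumerate every combination of the three Static/Dynamic choices."""
    return [AsTrO(*combo) for combo in product(Binding, repeat=3)]


if __name__ == "__main__":
    for cls in all_classes():
        print(cls.code())
```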
35
How Raw relates to other distributed
microprocessors using the AsTrO taxonomy
[Figure: tree of Static/Dynamic choices for Assignment, Transport and Ordering, with the following processors at its leaves]
GRID '01, WaveScalar '03
OOO Superscalar
RawDyn '00
Raw '97, Scale '04
ILDP '00
36
Conclusions
  • VLSI Scalable microprocessors are possible.
  • Constant factors are beginning to give way to
    asymptotics:
      - 16 ALU Raw: Oct 2002
      - 64 ALU Raw: Now
      - 1,024 ALU Raw: 2010
      - 32,768 ALU Raw: if Moore's Law makes it
        to 2 nm
  • There is an opportunity to make processors more
    versatile, i.e., steal applications from custom
    chips.
  • Tiled Processor Architectures are a promising
    approach and merit further research.

37




38
Embedded system: 1020 Element Microphone Array