Title: Mihai Budiu
1Spatial Computation: Computing without General-Purpose Processors
- Mihai Budiu
- mihaib_at_cs.cmu.edu
- Carnegie Mellon University
Presentation at
May 17, 2004
2Spatial Computation
Spatial Computation
- A computation model based on
- application-specific hardware
- no interpretation
- minimal resource sharing
3Research Scope
Object: future architectures
Tool: compilers
Evaluation: simulators
4Three Spatial Computation Projects
Application-Specific Hardware (ASH)
nanoFabrics
- virtual reconfigurable hardware
C
Compiler
reconfigurable hardware
5Main Results of My Research (1)
- Developed DIL compiler
- Completely replaces CAD tool-chain
- 700 times faster than commercial tools
- New optimizations (BitValue, place-and-route)
- Streaming kernels execute 20-300 times faster than on a µP
6Main Results of My Research (2)
- nanoFabrics
  - Identified strengths & limitations of nanodevices
  - Proposed new reconfigurable architecture
- HLL → HW compilation for spatial computation
  - Studied first-order properties of spatial computation
7Main Results of My Research (3)
Application-Specific Hardware (ASH): compiler-synthesized architecture
- Fast prototyping: automatic from ANSI C → HW
- High performance: sustained > 0.8 GOPS (180nm)
- Low power: energy/op 100-1000 times better than µP
8Related Work
Nanotechnology
Dataflow machines
Asynchronous circuits
High-level synthesis
Embedded systems
Reconfigurable computing
Computer architecture
Compilation
9Outline
- Research overview
- Problems of current architectures
- Compiling Application-Specific Hardware
- ASH Evaluation
- New compiler optimizations
- Conclusions
10Resources
Intel
- We do not worry about not having hardware resources
- We worry about being able to use hardware resources
11Complexity
Cannot rely on global signals
(clock is a global signal)
12Instruction-Set Architecture
Software
ISA
Hardware
13Our Proposal
- ASH addresses these problems
- ASH is not a panacea
- ASH complementary to CPU
14What's New?
- Investigate new computational model
- Source is full ANSI C
- Result is asynchronous circuit
- Build spatial dataflow hardware
- No resource limitations
- New compiler algorithms
- End-to-end results
- C to structural Verilog in seconds
- high performance results
- excellent power efficiency
black box
15Outline
- Research overview
- Problems of current architectures
- CASH: Compiling ASH
- program representation
- compiling C programs
- ASH Evaluation
- New compiler optimizations
- Conclusions
16Application-Specific Hardware
C program
Compiler
Dataflow IR
Reconfigurable/custom hw
17Application-Specific Hardware
Soft
C program
Compiler
Dataflow IR
SW backend
Machine code
CPU predication
18Key Intermediate Representation
Traditionally
Our IR
- SSA + predication + speculation
- Uniform for scalars and memory
- Explicitly encodes may-depend
- Executable
- Precise semantics
- Dataflow IR
- Close to asynchronous target
may-dep.
CFG
...
def-use
19Computation = Dataflow
Programs
Circuits
a
7
x = a & 7; ... y = x >> 2;
2
x
>>
- Operations ⇒ functional units
- Variables ⇒ wires
- No interpretation
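
As a rough illustration of the mapping above, the following C sketch models each operation as its own dedicated function (a stand-in for a functional unit) and each variable as a value handed directly from producer to consumer (a stand-in for a wire). The names `fu_and`, `fu_shr` and `circuit` are hypothetical, and reading the first operation as a bitwise AND is an assumption, since the operator did not survive in the slide text.

```c
#include <assert.h>
#include <stdint.h>

/* One function per operation: a software stand-in for a dedicated
   functional unit. All names are illustrative only. */
static uint32_t fu_and(uint32_t a, uint32_t b) { return a & b; }

static uint32_t fu_shr(uint32_t a, unsigned n)
{
    assert(n < 32);          /* shifting by >= the width is undefined in C */
    return a >> n;
}

/* The "circuit" for: x = a & 7; y = x >> 2;
   x plays the role of the wire between the two units. */
uint32_t circuit(uint32_t a)
{
    uint32_t x = fu_and(a, 7u);   /* wire x */
    return fu_shr(x, 2u);         /* output y */
}
```

For example, `circuit(29)` masks 29 down to 5 and shifts it to 1. There is no instruction fetch or decode step anywhere in this model, which is the sense of "no interpretation".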
20Basic Computation
latch
data
ack
valid
21Asynchronous Computation
data
ack
valid
1
22Distributed Control Logic
ack
rdy
-
short, local wires
asynchronous control
23Outline
- Research overview
- Problems of current architectures
- CASH: Compiling ASH
- program representation
- compiling C programs
- ASH Evaluation
- New compiler optimizations
- Conclusions
24MUX Forward Branches
x
b
0
if (x > 0) y = -x; else y = b*x;
-
>
!
f
y
critical path
Conditionals ⇒ Speculation
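
The idea of turning a conditional into speculation can be sketched in C as follows: both arms are computed unconditionally and a multiplexer (here the `?:` operator) selects one result. This is only a software analogy under stated assumptions: the function names are hypothetical, the multiplication in the else arm is assumed, and 64-bit results are used so that neither arm can overflow when evaluated speculatively.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: only the selected arm is evaluated. */
int64_t y_branch(int32_t x, int32_t b)
{
    if (x > 0)
        return -(int64_t)x;
    else
        return (int64_t)b * x;
}

/* Speculative version: both arms are evaluated, then a mux picks one.
   This is safe here because neither arm has side effects. */
int64_t y_mux(int32_t x, int32_t b)
{
    int     pred     = x > 0;            /* predicate        */
    int64_t then_val = -(int64_t)x;      /* speculated arm 1 */
    int64_t else_val = (int64_t)b * x;   /* speculated arm 2 */
    return pred ? then_val : else_val;   /* multiplexer      */
}

/* Hypothetical helper: checks that both versions agree on one input. */
void check_equivalent(int32_t x, int32_t b)
{
    assert(y_branch(x, b) == y_mux(x, b));
}
```

In hardware the two arms run in parallel, so the cost of speculation is area and energy rather than time; the mux output is ready once the predicate and the selected value are ready.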
25Control Flow ⇒ Data Flow
data
f
Merge (label)
data
data
predicate
Gateway
26Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
pipelining
27Predication and Side-Effects
addr
token
to memory
Load
pred
data
token
28Memory Access
LD
Monolithic Memory
pipelined arbitrated network
ST
LD
local communication
global structures
Future work: fragment this!
related work
complexity
29CASH Optimizations
- SSA-based optimizations
  - unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining
- Memory optimizations
  - dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling
- Boolean optimizations
  - Espresso CAD tool, bitwidth analysis
30Outline
- Research overview
- Problems of current architectures
- Compiling ASH
- Evaluation: CASH vs. clocked designs
- New compiler optimizations
- Conclusions
31Evaluating ASH
Mediabench kernels (1 hot function/benchmark)
C
CASH core
Verilog back-end
Synopsys, Cadence P/R
180nm std. cell library, 2V
1999 technology
ModelSim (Verilog simulation)
performance numbers
Mem
ASIC
32ASH Area
P4 217
minimal RISC core
normalized area
33ASH vs 600MHz CPU (0.18 µm)
34Bottleneck Memory Protocol
LD
Memory
ST
35Power
Xeon cache 67000
µP 4000
DSP 110
36Energy Efficiency
Dedicated hardware
ASH media kernels
Asynchronous µP
General-purpose DSP
Microprocessors
Energy Efficiency [Operations/nJ]
37Outline
- Research overview
- Nanotechnology and architecture
- Compiling ASH
- ASH Evaluation
- New compiler optimizations
- BitValue dataflow analysis
- Optimizing memory accesses
- SIDE: static instantiation, dynamic evaluation
- Conclusions
38Detecting Constant Bits
a
b = a >> 4
b
0000
39Detecting Useless Bits
Don't care bits
a
XXXX
b = a >> 4
b
40BitValue Dataflow Analysis
a
XXXX
b = a >> 4
b
0000
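
The two facts shown on these slides for a single right shift can be sketched with bit masks. This is not the BitValue algorithm as implemented in the compiler, only a minimal illustration for one operation, assuming 32-bit unsigned values and a logical shift; the function names are hypothetical. The forward direction finds bits of `b` that are constant zero, and the backward direction finds bits of `a` that no consumer can observe (don't cares).

```c
#include <assert.h>
#include <stdint.h>

/* Forward pass for b = a >> n.
   Input:  mask of bits of a known to be 0.
   Output: mask of bits of b known to be 0.
   The n vacated high bits of b are always 0. */
uint32_t shr_known_zero(uint32_t a_known_zero, unsigned n)
{
    assert(n < 32);
    uint32_t vacated = (uint32_t)~(UINT32_MAX >> n);
    return (a_known_zero >> n) | vacated;
}

/* Backward pass for b = a >> n.
   Input:  mask of bits of b that some consumer needs.
   Output: mask of bits of a that are needed.
   The low n bits of a are shifted out, so they are never needed. */
uint32_t shr_needed(uint32_t b_needed, unsigned n)
{
    assert(n < 32);
    return b_needed << n;
}
```

With `n = 4` and nothing known about `a`, `shr_known_zero(0, 4)` reports the top four bits of `b` as constant zero, and `shr_needed(0xFFFFFFFF, 4)` reports the bottom four bits of `a` as unneeded, matching the `0000` and `XXXX` annotations above.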
41BitValue on C Programs
useless int arithmetic
27
Mediabench
SpecInt95
SpecInt2K
42Outline
- ...
- New compiler optimizations
- BitValue dataflow analysis
- Memory access optimization
- Static Instantiation, Dynamic Evaluation
- Conclusions
43Meaning of Token Edges
p
p
q
q
- Maybe dependent
- No intervening memory operation
Token graph is maintained transitively reduced
44Dead Code Elimination
(false)
p
45≈ PRE
(p1)
(p2)
(p1 ∨ p2)
...p
...p
...p
This corresponds in the CFG to lifting the load
to a basic block dominating the original loads
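
A software analogy of this transformation, under the assumption that no store to the same address occurs between the two loads: two loads of the same location guarded by predicates p1 and p2 are replaced by one load guarded by p1 ∨ p2. The names `mem_reads`, `read_mem`, `loads_separate` and `loads_merged` are hypothetical, and the counter exists only to make the saving visible.

```c
#include <assert.h>

int mem_reads = 0;   /* counts simulated memory reads */

static int read_mem(const int *p)
{
    assert(p);
    mem_reads++;
    return *p;
}

/* Before: each use performs its own predicated load. */
void loads_separate(const int *p, int p1, int p2, int *a, int *b)
{
    if (p1) *a = read_mem(p);
    if (p2) *b = read_mem(p);
}

/* After: a single load guarded by (p1 || p2) feeds both uses. */
void loads_merged(const int *p, int p1, int p2, int *a, int *b)
{
    int t = 0;
    if (p1 || p2) t = read_mem(p);
    if (p1) *a = t;
    if (p2) *b = t;
}
```

When both predicates hold, `loads_separate` reads memory twice and `loads_merged` reads it once; when neither holds, no read is issued in either version.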
46Register Promotion
(p1)
p
(p2 ∧ ¬p1)
(p2)
p
Load is executed only if store is not
47Register Promotion (2)
(p1)
p
(p1)
p
(p2 ∧ ¬p1)
(false)
p
(p2)
p
- When p2 ⇒ p1 the load becomes dead...
- ...i.e., when store dominates load in CFG
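
The predicate manipulation on the two register promotion slides can be sketched in C. The sketch assumes a store guarded by p1 followed by a load of the same address guarded by p2, with no other memory operation in between; the function and variable names are hypothetical. After promotion, the load runs only when the store did not, and the stored value is forwarded otherwise.

```c
#include <assert.h>

int promo_loads = 0;   /* counts simulated memory reads */

static int load_word(const int *p)
{
    assert(p);
    promo_loads++;
    return *p;
}

/* Original: store under p1, then load under p2. Returns -1 if p2 is false. */
int store_then_load(int *p, int p1, int p2, int v)
{
    int r = -1;
    if (p1) *p = v;
    if (p2) r = load_word(p);
    return r;
}

/* Promoted: the load is guarded by (p2 && !p1); when the store
   executed, its value v is forwarded instead of being re-read. */
int store_then_load_promoted(int *p, int p1, int p2, int v)
{
    int r = -1;
    int t = v;                          /* forwarded value, used when p1 holds */
    if (p1) *p = v;
    if (p2 && !p1) t = load_word(p);    /* load only if the store did not run */
    if (p2) r = t;
    return r;
}
```

If p2 implies p1, the guard `p2 && !p1` is always false, so the load never executes and can be removed as dead code.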
48Outline
- ...
- New compiler optimizations
- BitValue dataflow analysis
- Memory access optimization
- A SIDE dish: dataflow analysis
- Static Instantiation, Dynamic Evaluation
- Conclusions
49Availability Dataflow Analysis
y = a*b;
...
if (x) ...
... = a*b;
50Dataflow Analysis Is Conservative
if (x) { ... y = a*b; }
...
... = a*b;
y?
51Static Instantiation, Dynamic Evaluation
flag = false;
if (x) { ... y = a*b; flag = true; }
...
... = flag ? y : a*b;
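
A runnable sketch of the idea, assuming the shared expression is a multiplication (the operator is not legible in the slide text) and using hypothetical names throughout. A static availability analysis must give up here because the product is computed on only one path; the transformed version instead records at run time, in `flag`, whether the earlier result exists and reuses it when it does.

```c
#include <assert.h>
#include <stdbool.h>

int mul_count = 0;   /* counts evaluations of the shared expression */

static int mul(int a, int b)
{
    mul_count++;
    return a * b;
}

/* Original: the product is recomputed at the second use
   whether or not the first computation ran. */
int side_original(bool x, int a, int b, int *y)
{
    assert(y);
    if (x) *y = mul(a, b);
    return mul(a, b);
}

/* Transformed: flag tracks dynamically whether *y already
   holds the product; recompute only when it does not. */
int side_transformed(bool x, int a, int b, int *y)
{
    assert(y);
    bool flag = false;
    if (x) {
        *y = mul(a, b);
        flag = true;
    }
    return flag ? *y : mul(a, b);
}
```

On the path where `x` is true the transformed version evaluates the product once instead of twice; on the other path the cost is unchanged apart from the flag test.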
52SIDE Register Promotion Effect
Loads
reduction
Stores
53Outline
- Research overview
- Problems of current architectures
- Compiling ASH
- ASH Evaluation
- New compiler optimizations
- Future work & conclusions
54Future Work
- Optimizations for area/speed/power
- Memory partitioning
- Concurrency
- Compiler-guided layout
- Explore extensible ISAs
- Hybridization with superscalar mechanisms
- Reconfigurable hardware support for ASH
- Formal verification
55Grand Vision: Certified Circuit Generation
- Translation validation input output
- Preserve input properties
- e.g., C programs cannot deadlock
- e.g., type-safe programs cannot crash
- Debug, test, verify only at source-level
How far can you go?
HLL
IR
IRopt
Verilog
gates
layout
formally validated
56Conclusions
Spatial computation strengths
Feature                  Advantages
No interpretation        Energy efficiency, speed
Spatial layout           Short wires, no contention
Asynchronous             Low power, scalable
Distributed              No global signals
Automatic compilation    Design productivity, no ISA
57Backup Slides
- Reconfigurable hardware
- Critical paths
- Software pipelining
- Control logic
- More on PipeRench
- ASH vs ...
- ASH weaknesses
- Exceptions
- Research methodology
- Normalized area
- Why C?
- Splitting memory
- More performance
- Recursive calls
- Nanotech and architecture
58Reconfigurable Hardware
59Main RH Ingredient: RAM Cell
data in
0
control
Switch controlled by a 1-bit RAM cell
60Pipeline Stage
ackout
C
rdyin
rdyout
ackin
D
Reg
datain
dataout
61Critical Paths
x
b
0
if (x > 0) y = -x; else y = b*x;
-
>
!
y
62Lenient Operations
x
b
0
if (x > 0) y = -x; else y = b*x;
!
y
Solves the problem of unbalanced paths
63Pipelining
i
1
100
<
pipelined multiplier (8 stages)
sum
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
step 1
64Pipelining
i
1
100
<
sum
step 2
65Pipelining
i
1
100
<
sum
step 3
66Pipelining
i
1
100
<
sum
step 4
67Pipelining
i
1
100
i=1
<
i=0
sum
step 5
68Pipelining
i
1
100
i=1
<
i=0
sum
step 6
69Pipelining
i
1
100
<
sum
step 7
70Pipelining
i
1
100
critical path
<
Predicate ack edge is on the critical path.
sum
71Pipeline balancing
i
1
100
<
decoupling FIFO
sum
step 7
72Pipeline balancing
i
1
100
<
critical path
i's loop
decoupling FIFO
sum
sum's loop
73Process: 0.18 µm, 6 Al metal layers
Area: 49 mm²
Clock: 60MHz I/O, 120MHz internal
Power: < 4W
Stripes: 16 physical, 256 virtual
- Compiler functional on first silicon
- Licensed by two companies
74Hardware Virtualization
Page out
compute
compute
compute
configure
Page in
Hardware
Overlap configuration with computation.
Configuration
75PipeRench Hardware
ALU
ALU
data flow
Interconnection Network
Register
ALU
ALU
Interconnection Network
Register
ALU
ALU
76Mapping Computation
>>
<<
concat
bit-shuffling
<<
substr
Network used for computation
77Compiler-Controlled Clock
Register
Register
Network
Network
Register
Register
Slow clock
Fast clock
78Time-Multiplexing Wires
1
2
1
2
?
4
3
4
3
One channel available for two wires
compute in even cycles
compute in odd cycles
79Compilation Times (sec on PII/400)
80Compilation Speed (PII/400)
81Placed Circuit Utilization
82PipeRench Performance
Speed-up vs. 300MHz UltraSparc
83PipeRench Compiler Role
- Classical optimizations
- Partial evaluation
- Data width inference (type inference)
- Module generation (macro expansion)
- Placement (VLIW scheduling)
- Routing (irregular register allocation)
- Network link multiplexing (spilling)
- Clock-cycle management
- Technology mapping (instruction selection)
- Code generation
84HLL to HW
High-level Synthesis        Behavioral HDL         Synchronous Hardware
Reconfigurable Computing    C subsets              Hardware configuration
(spatial computation)
Asynchronous circuits       Concurrent Language    Asynchronous Hardware
Prior work
This research
85CASH vs High-Level Synthesis
- CASH is the only existing tool to translate complete ANSI C to hardware
- CASH generates asynchronous circuits
- CASH does not treat C as an HDL
- no annotations required
- no reactivity model
- does not handle non-C, e.g., concurrency
86ASH Weaknesses
- Low efficiency for low-ILP code
- Does not adapt at runtime
- Monolithic memory
- Resource waste
- Not flexible
- No support for exceptions
87ASH Weaknesses (2)
- Both branch and join not free
- Static dataflow (no re-issue of same instr)
- Memory is far
- Fully static
- No branch prediction
- No dynamic unrolling
- No register renaming
- Calls/returns not lenient
88Branch Prediction
i
1
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
<
exception
!
89Research Methodology
Y (e.g., cost)
reasonable limits
state-of-the-art
X (e.g., power)
Constraint Space
90Exceptions
- Strictly speaking, C has no exceptions
- In practice hard to accommodate exceptions in hardware implementations
- An advantage of software flexibility: PC is single point of execution control
CPU
ASH
Low ILP computation OS VM exceptions
High-ILP computation
Memory
91Why C
- Huge installed base
- Embedded specifications written in C
- Small and simple language
- Can leverage existing tools
- Simpler compiler
- Techniques generally applicable
- Not a toy language
92Performance
93Parallelism Profile
94Normalized Area
95Memory Partitioning
- MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00
- Stanford SpC: Semeria DAC '01, TVLSI '02
- Berkeley CCured: Necula POPL '02
- Illinois FlexRAM: Fraguella PPoPP '03
- Hand-annotations: pragma
96Memory Complexity
RAM
LSQ
addr
data
97Recursion
save live values
recursive call
restore live values
stack
98Nanotechnology and Architecture
99Nanotechnology Implications
new architectures new compilers
new devices new manufacturing
100CAEN
- Study computer architecture implications of
Chemically-Assembled Electronic Nanotechnology
101No Complex Irregular Structures
102Regular Substrate
Control
10^11 gates
103High Defect Rate
104Paradigm Shift
defects
Configuration
Executable
Complex fixed chip Program
Dense, regular structure Configuration
105New Computer Architecture
CMOS                      Self-assembled circuits
Transistor                New molecular devices
Custom hardware           Reconfigurable hardware
Yield (defect) control    Defect tolerance through reconfiguration
Synchronous circuits      Asynchronous computation
Microprocessors           App-specific hardware + CPU
106Exploiting Nanotechnology
- Nanotechnology
- cheap
- high-density
- low-power
- unreliable
- Reconfigurable Computing
- defect tolerant
- high performance
- low density
- Computer architecture
- vast body of knowledge
- expensive
- high-power
107Research Convergence
Chemically-assembled electronic nanotechnology
Systems research issues
108Venues