Title: CASH: Compiling Application-Specific Hardware
1 CASH: Compiling Application-Specific Hardware
- Mihai Budiu
- ST Microelectronics, June 11, 2003
2 CPU Problems
- Design Complexity
- Power
- Global Signals
- Limited issue window ⇒ limited ILP
3 Communication vs. Computation
[Diagram: wire delay vs. gate delay; labels: wire, gate, 5ps, 20ps]
Power consumption on wires is also dominant
4 Global Communication
[Diagram: instruction unit, network, register file]
5 Our Approach: ASH (Application-Specific Hardware)
6 1) Unroll Pipeline
[Diagram: instruction unit; three networks; three register files]
7 Resource Binding Time
[Diagram: CPU vs. ASH; labels: 1., 2., Programs]
8 2) Specialize Pipeline
Fixed program
[Diagram: instruction unit; three networks; three register files]
9 2) Specialize Pipeline: Functional Units
Fixed program
[Diagram: instruction unit; three networks; three register files]
10 2) Specialize Pipeline: Interconnection Network
Fixed program
[Diagram: instruction unit; three register files]
11 2) Specialize Pipeline: Register Files
Fixed program
[Diagram labels: Instruction unit, 0, 1]
12 2) Specialize Pipeline: Shrink Wires
Fixed program
[Diagram labels: Instruction unit, 0, 1]
13 2) Specialize Pipeline: No Instruction Fetch, Decode, Issue
[Diagram labels: 0, 1]
14 Loops and Memory
Spatial Computation
0
1
LSQ
To memory
15 Outline
- Introduction: spatial computation
- ASH vs CPU
- CASH: Compiling for ASH
- ASH at run-time
- Conclusions
16 Proposed Architecture
CPU
ASH
Low-ILP computation + OS + VM
High-ILP computation
Memory
17 Computational Bandwidth
CPU
ASH
18 Registers
Unbounded
R0 R1 ... R31
ASH
CPU
19 Register Bandwidth
Fixed
Unbounded
R1 R2 R3 W1 W2
CPU
ASH
20 Parallelism
In-order
HLL program
Compiler
Fetch
Decode
Dispatch
Execute
Commit
Limited by window
Circuit
ASH
CPU
Compiler's window is unbounded
21 ASH vs CPU: Summary
- Late resource binding ⇒ match application needs
- No centralized structures ⇒ fast, local communication
- Inexpensive, large bandwidth: computation, registers
22 Outline
- Introduction
- ASH vs CPU
- CASH: Compiling for ASH
- ASH at run-time
- Conclusions
23 Our Solution
- General: applicable to today's software
- Automatic: compiler-driven
- RISC approach
- Scalable: with technology, hardware, prog size
- Parallelism: exploit application parallelism (bit-level, ILP, pipeline, loop-level)
24 Application-Specific Hardware
C program
Dataflow IR
Compiler
dataflow machine
Reconfigurable/custom hw
25 Intermediate Representation
Traditionally
Our proposal
- SSA + predication
- Uniform for scalars and memory
- Explicitly encode may-depend
- Summarize control-flow
- Executable
may-dep.
CFG
...
def-use
26 New
- Entire C applications
- Dynamically scheduled circuits
- Custom dataflow machines
- - application-specific
- - direct execution (no interpretation)
27 CASH: Compiling for ASH
C Program
RH
Circuits
Memory partitioning
Interconnection net
28 Asynchronous Computation
data
ack
data valid
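The data / data valid / ack signalling on this slide can be modelled in software as a one-slot channel between two operations. This is a minimal sketch, assuming the producer may only place a new value after the consumer has acknowledged the previous one; the type `channel_t` and the functions `channel_send` and `channel_recv` are hypothetical names for illustration, not part of CASH.

```c
#include <assert.h>
#include <stdbool.h>

/* One-slot channel between a producer and a consumer operation.
 * Hypothetical software model, for illustration only. */
typedef struct {
    int  data;    /* the "data" wires                        */
    bool valid;   /* "data valid": slot holds an unread value */
} channel_t;

/* Producer side: succeeds only if the previous value was acknowledged. */
bool channel_send(channel_t *c, int value)
{
    assert(c);
    if (c->valid)
        return false;     /* consumer has not acked yet: producer stalls */
    c->data  = value;
    c->valid = true;
    return true;
}

/* Consumer side: taking the value plays the role of the "ack". */
bool channel_recv(channel_t *c, int *out)
{
    assert(c && out);
    if (!c->valid)
        return false;     /* nothing to consume yet */
    *out     = c->data;
    c->valid = false;     /* ack: frees the slot for the producer */
    return true;
}
```

A second `channel_send` before the matching `channel_recv` returns `false`, which corresponds to the producer waiting for the ack.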
29 Distributed Control Logic
ack
rdy
-
30 Outline
- Introduction
- ASH vs CPU
- CASH: Compiling for ASH (some details)
- ASH at run-time
- Conclusions
31 Forward Branches
x
b
0
if (x > 0) y = -x; else y = b*x;
-
gt
!
y
Decoded mux
Conditionals ⇒ Speculation
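The conversion of this conditional into speculation can be sketched in plain C: both arms are computed unconditionally, and a mux with one predicate per data input (a "decoded" mux) selects the result. This is a minimal sketch, not the compiler's output. The names `mux2_decoded` and `forward_branch` are hypothetical, and the `b * x` arm assumes the multiplication reading of the slide's conditional.

```c
#include <assert.h>
#include <stdbool.h>

/* Decoded mux: every data input carries its own predicate, and exactly
 * one predicate is expected to be true. (Hypothetical helper.) */
static int mux2_decoded(bool p0, int d0, bool p1, int d1)
{
    assert(p0 != p1);          /* predicates must be one-hot */
    return p0 ? d0 : d1;
}

/* Branch-free form of: if (x > 0) y = -x; else y = b * x;
 * Assumes neither -x nor b * x overflows, since both are always evaluated. */
int forward_branch(int x, int b)
{
    bool p     = x > 0;        /* predicate (the ">" node)           */
    int  neg   = -x;           /* "then" arm, computed speculatively */
    int  scale = b * x;        /* "else" arm, computed speculatively */
    return mux2_decoded(p, neg, !p, scale);
}
```

Neither arm has side effects, so evaluating both is safe; only the mux depends on the predicate.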
32 Control Flow ⇒ Data Flow
data
Merge
data
data
predicate
Gateway
33 Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
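The loop on this slide can be rewritten so that the pieces of its dataflow form are visible in the source: initial values enter through merge points, the loop predicate is computed explicitly, and the result leaves the loop when the predicate becomes false. This is a hedged sketch of that reading. The function name `sum_squares_dataflow` and the parameter `n` are introduced for illustration (the slide uses the constant 100), and the `i * i` body assumes the multiplication shown on the later pipelining slides.

```c
#include <assert.h>
#include <stdbool.h>

int sum_squares_dataflow(int n)
{
    assert(n >= 0 && n <= 1000);  /* keeps sum within int range */
    int i = 0, sum = 0;           /* merge points start with the initial values */
    for (;;) {
        bool p = i < n;           /* loop predicate, recomputed each iteration */
        if (!p)
            break;                /* predicate false: sum leaves the loop */
        sum = sum + i * i;        /* the accumulator cycle */
        i   = i + 1;              /* the induction-variable cycle */
    }
    return sum;
}
```

The two assignments in the body form two separate cycles in the dataflow graph, one for `i` and one for `sum`.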
34 Read-write Sets
Memory
p if (x) q else r
35 Token Edges
Memory
p if (x) q else r
36 Tokens in Hardware
addr
token
pred
LSQ
Load
Memory
data
token
37 Outline
- Introduction
- ASH vs CPU
- Compiling for ASH
- ASH at run-time
- Conclusions
38 Critical Paths
x
b
0
if (x > 0) y = -x; else y = b*x;
-
gt
!
y
39 Lenient Operations
x
b
0
if (x > 0) y = -x; else y = b*x;
!
y
Solve the problem of unbalanced paths
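A lenient operation produces its output as soon as enough of its inputs have arrived to determine the result, instead of waiting for all of them. The sketch below models this with a value that is either ready or still pending. It is an illustrative software model under that assumption; the type `signal_t` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A value that may not have arrived yet. (Hypothetical type.) */
typedef struct { bool ready; int value; } signal_t;

static signal_t known(int v)  { signal_t s = { true, v };  return s; }
static signal_t pending(void) { signal_t s = { false, 0 }; return s; }

/* Strict AND: waits for both inputs. */
signal_t and_strict(signal_t a, signal_t b)
{
    if (a.ready && b.ready)
        return known(a.value && b.value);
    return pending();
}

/* Lenient AND: a single known 0 already decides the output. */
signal_t and_lenient(signal_t a, signal_t b)
{
    if (a.ready && !a.value) return known(0);
    if (b.ready && !b.value) return known(0);
    if (a.ready && b.ready)  return known(1);
    return pending();
}

/* Lenient decoded mux: fires once the selected input and its predicate
 * are known, without waiting for the input that is not selected. */
signal_t mux_lenient(signal_t p0, signal_t d0, signal_t p1, signal_t d1)
{
    assert(!(p0.ready && p1.ready && p0.value && p1.value)); /* one-hot */
    if (p0.ready && p0.value && d0.ready) return d0;
    if (p1.ready && p1.value && d1.ready) return d1;
    return pending();
}
```

In the slide's example, a lenient mux can emit `y` from the fast arm while the slower arm is still computing, which is how the unbalanced paths stop hurting.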
40 Pipelining
i
1
100
lt
pipelined multiplier (8 stages)
sum
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
cycle1
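41 Pipelining
[Diagram: loop dataflow graph, cycle 2]
42 Pipelining
[Diagram: loop dataflow graph, cycle 3]
43 Pipelining
[Diagram: loop dataflow graph, cycle 4]
44 Pipelining
[Diagram: loop dataflow graph, cycle 5]
45 Pipelining
[Diagram: loop dataflow graph, cycle 6; labels: i=1, i=0]
46 Pipelining
[Diagram: loop dataflow graph, cycle 7]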
47 Pipelining
i
1
100
critical path
lt
Predicate ack edge is on the critical path.
sum
48 Pipelining
i
1
100
lt
decoupling FIFO
sum
cycle7
49 Pipelining
i
1
100
lt
critical path
i's loop
decoupling FIFO
sum
sum's loop
50 FIFO Impact
i
1
100
lt
Pipe  FIFO  Cycles
N     0     903
N     1     903
Y     0     653
Y     1     474
Y     2     408
Y     3     408
decoupling FIFO
sum
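The cycle counts on this slide can be turned into speedups relative to the first row (Pipe = N, FIFO = 0). The sketch below copies the counts from the table and computes the ratio. The struct, the array, and the function name `speedup_vs_baseline` are hypothetical, and the "Pipe" column is treated only as a Y/N flag without further interpretation.

```c
#include <assert.h>

/* One row of the table; cycle counts are copied from the slide. */
typedef struct {
    int pipelined;   /* "Pipe" column: 1 = Y, 0 = N */
    int fifo_depth;  /* "FIFO" column               */
    int cycles;      /* "Cycles" column             */
} fifo_row;

static const fifo_row rows[] = {
    {0, 0, 903}, {0, 1, 903},
    {1, 0, 653}, {1, 1, 474}, {1, 2, 408}, {1, 3, 408},
};
enum { N_ROWS = sizeof rows / sizeof rows[0] };

/* Speedup of a configuration relative to the first row (N, 0). */
double speedup_vs_baseline(int row)
{
    assert(row >= 0 && row < N_ROWS);
    return (double)rows[0].cycles / (double)rows[row].cycles;
}
```

With these numbers, the Y/0 row is about 1.38x faster than the baseline, Y/1 about 1.91x, and Y/2 about 2.21x. The Y/3 row has the same cycle count as Y/2.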
51 Dataflow Loop Pipelining
- Related to software pipelining
- Copes with unknown latencies
- control-flow
- memory accesses
- Does not require parallelization
- Applicable to memory accesses as well
52 Performance of Selected Kernels
[Chart: Mediabench kernels mpeg2_d, gsm_e, gsm_d, g721_d, mpeg2_e, pegwit_e, g721_e, jpeg_d, pegwit_d, jpeg_e, adpcm_e, adpcm_d; value labels: 25, 17/12, 19/16, 10.5]
53 OpenDIVX IDCT, Normalized Running Time
54 OpenDIVX IDCT, Sustained IPC
includes speculative ops
no data
55 Full Dataflow Speed
wrong!
- ASH runs at full dataflow speed, so CPU cannot do any better (if compilers equally good)
- CPU weapons
- - speculation (branch prediction)
- - centralized memory access
56 Muxes Speculation Squash
x
b
0
if (x > 0) y = -x; else y = b*x;
-
gt
!
y
57 Control-Flow Speculation
i
1
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
lt
exception
!
58 Summary
- ANSI C automatically translated to dynamically-scheduled hardware circuits
- EPIC-based compilation technology (predication, speculation, hyperblocks)
- Novel specific optimizations
- - leniency, ack-on-critical path, loop decoupling, bitwidth
- ASH does not naturally support control-speculation (aka branch prediction)
59 Conclusions
- ASH: compiler-synthesized hw from HLL
- Exposes program parallelism
- ILP and loop pipelining
- Dataflow techniques applied to hardware
- Impressive performance on data-intensive
kernels
60 Backup Slides
- Compiler structure
- Predication + speculation
- Procedure calls
- Evaluation model
- Program partitioning
- Status/resources
- Control logic
61 Compiler Structure
Pegasus
SUIF
C/FORTRAN
- CSE
- dead-code
- PRE
- induction variables
- strength reduction
- loop-invariant lift
- reassociation
- memory optimization
- constant propagation
- constant folding
- unreachable code
- register promotion
- inlining
- unrolling
- call-graph
- pointer analysis
- live var. analysis
- CFG construction
- unreachable code
- build hyperblocks
- ctrl dominance
- path predicates
call-graph
C circuit simulation
Verilog
62 Hyperblocks
Procedure
63 Predication
hyperblock
if (!p) .......
p
!p
if (p) .......
q
q
64 Speculation
if (!p) ......
if (!p) ......
ops w/ side-effects
q
q
65 Computing Predicates
s
t
b
- Correct for irreducible graphs
- Correct even when speculatively computed
- Boolean operations are lenient
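One way to compute the predicate of a block such as b from its predecessors s and t is sketched below. It assumes the usual path-predicate recurrence: a block is active if some predecessor is active and the branch on the connecting edge is taken. The slide shows only the nodes s, t, and b, so the formula, the function name `block_predicate`, and the parameter names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Predicate of a block with n_preds predecessors:
 *   P(b) = OR over k of ( P(pred_k) AND edge_k_taken )
 * pred_active[k] is the predecessor's own predicate;
 * edge_taken[k] is the branch condition on the edge into the block. */
bool block_predicate(const bool pred_active[], const bool edge_taken[],
                     int n_preds)
{
    assert(n_preds > 0);
    bool p = false;
    for (int k = 0; k < n_preds; k++)
        p = p || (pred_active[k] && edge_taken[k]);
    return p;
}
```

For the slide's graph, the call would pass the predicates of s and t together with the conditions on the edges s→b and t→b. Because only AND and OR are involved, each operation can be lenient.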
66 Procedure calls
network
Extract args
args
call P
result
caller
ret
Procedure P
67 Recursion
save live values
recursive call
local stack
restore live values
hyperblock
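The save / call / restore sequence on this slide can be illustrated in C with an explicit stack. In ordinary C the compiler already preserves `n` across the call, so the explicit stack here only mirrors what the slide describes for hardware, where live values would otherwise be overwritten by the recursive activation. This is a sketch under that assumption; `local_stack`, `STACK_MAX`, `stack_depth`, and `fact` are hypothetical names.

```c
#include <assert.h>

#define STACK_MAX 64

static int local_stack[STACK_MAX];  /* stack holding saved live values */
static int sp = 0;                  /* index of the next free slot     */

/* Number of values currently saved (0 when no call is in flight). */
int stack_depth(void) { return sp; }

int fact(int n)
{
    if (n <= 1)
        return 1;
    assert(sp < STACK_MAX);
    local_stack[sp++] = n;      /* save live values    */
    int r = fact(n - 1);        /* recursive call      */
    n = local_stack[--sp];      /* restore live values */
    return n * r;
}
```

After any call returns, every save has been matched by a restore, so `stack_depth()` is back to 0.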
68 How Performance Is Evaluated
C
Mem
L2 1/4M
L1 8K
LSQ
2
limited BW (2 words/c)
Unlimited ILP
8
72
69 Simulation Parameters
- Compared to 4-wide OOO SimpleScalar
- Same operation latencies
- Same cache hierarchy
- No measurements in library functions/OS
- 3-cycle multiply, 20 cycle divide
70 Unit of Partitioning: Procedure
Program call-graph
recursive
leaves
library
71 Peering
Program
a( ) b( ) b( ) c( ) c( ) d( ) d( )
a
CPU
ASH
b
c
d
72 RPC
RH
CPU
a
b
b
c
c
d
d
73 Status
- Handle all C constructs except
- - longjmp
- - exit
- - alloca
- - varargs
- Generate coarse C simulation of circuits
- Preliminary Verilog back-end available
- - no FP, procedure calls
- - uses a standard cell library
- - generates inefficient memory interface
74 How Many Resources?
- Using a back-of-the-envelope calculation
- Estimated SpecINT95 and Mediabench
- Average < 100 bit-operations/SLOC
- Routing resources harder to estimate
75 Control Logic
rdyin
C
C
ackin
D
rdyout
ackout
D
datain
dataout
Reg