CASH: Compiling Application-Specific Hardware

Learn more at: http://www.cs.cmu.edu

1
CASH: Compiling Application-Specific Hardware

Mihai Budiu
ST Microelectronics, June 11, 2003

2
CPU Problems

Design Complexity
Power
Global Signals
Limited issue window => limited ILP

3
Communication vs. Computation
[Figure: wire delay vs. gate delay; values shown: 5ps, 20ps]
Power consumption on wires is also dominant
4
Global Communication
[Figure labels: Instruction unit, Network, Reg]
5
Our Approach: ASH (Application-Specific Hardware)
6
1) Unroll Pipeline
[Figure labels: Instruction unit; Network and Reg, each repeated three times]
7
Resource Binding Time

[Figure: two numbered steps (1., 2.) shown for CPU and for ASH; "Programs" is one of the steps in each]
8
2) Specialize Pipeline
Fixed program
[Figure labels: Instruction unit; Network and Reg, each repeated three times]
9
2) Specialize Pipeline: Functional Units
Fixed program
[Figure labels: Instruction unit; Network and Reg, each repeated three times]
10
2) Specialize Pipeline: Interconnection Network
Fixed program
[Figure labels: Instruction unit; Reg repeated three times]
11
2) Specialize Pipeline: Register Files
Fixed program
[Figure labels: Instruction unit, 0, 1]
12
2) Specialize Pipeline: Shrink Wires
Fixed program
[Figure labels: Instruction unit, 0, 1]
13
2) Specialize Pipeline: No Instruction Fetch, Decode, Issue
[Figure labels: 0, 1]
14
Loops and Memory
Spatial Computation
[Figure labels: 0, 1, LSQ, To memory]
15
Outline

Introduction: spatial computation
ASH vs CPU
CASH: Compiling for ASH
ASH at run-time
Conclusions

16
Proposed Architecture
CPU
ASH
Low-ILP computation + OS + VM
High-ILP computation
Memory
17
Computational Bandwidth

FU clock freq

CPU
ASH
18
Registers

CPU: fixed (R0 R1 ... R31)
ASH: unbounded
19
Register Bandwidth
CPU: fixed (R1 R2 R3 W1 W2)
ASH: unbounded
20
Parallelism
In-order
HLL program
Compiler
Fetch
Decode
Dispatch
Execute
Commit
Limited by window
Circuit
ASH
CPU
Compiler's window is unbounded
21
ASH vs CPU: summary

Late resource binding => match application needs
No centralized structures => fast, local communication
Inexpensive large bandwidth: computation, registers

22
Outline

Introduction
ASH vs CPU
CASH: Compiling for ASH
ASH at run-time
Conclusions

23
Our Solution
General: applicable to today's software
Automatic: compiler-driven, RISC approach
Scalable: with technology, hardware, and program size
Parallelism: exploit application parallelism (bit-level, ILP, pipeline, loop-level)
24
Application-Specific Hardware
C program
Dataflow IR
Compiler
dataflow machine
Reconfigurable/custom hw
25
Intermediate Representation
Traditionally
Our proposal

SSA + predication
Uniform for scalars and memory
Explicitly encode may-depend
Summarize control-flow
Executable

may-dep.
CFG
...
def-use
26
New

Entire C applications
Dynamically scheduled circuits
Custom dataflow machines
- application-specific
- direct execution (no interpretation)

27
CASH: Compiling for ASH
C Program
RH
Circuits
Memory partitioning
Interconnection net
28
Asynchronous Computation

data
ack
data valid
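The three signals named on this slide (data, data valid, ack) suggest a handshake between a producer and a consumer node. The sketch below is one minimal way to model that in software; the `channel_t` type and both function names are hypothetical, and a single-entry channel is an assumption, not something the slide states.

```c
#include <stdbool.h>

/* Hypothetical one-place channel between a producer and a consumer node. */
typedef struct {
    int  data;   /* payload wire */
    bool valid;  /* "data valid": producer has placed a value on the wire */
} channel_t;

/* Producer side: succeeds only once the previous value has been acked. */
bool channel_send(channel_t *ch, int value) {
    if (ch->valid) return false;   /* consumer has not acked yet: stall */
    ch->data  = value;
    ch->valid = true;
    return true;
}

/* Consumer side: takes the value and acks, which frees the channel. */
bool channel_receive(channel_t *ch, int *out) {
    if (!ch->valid) return false;  /* nothing to consume yet: stall */
    *out = ch->data;
    ch->valid = false;             /* clearing valid models the "ack" */
    return true;
}
```

Neither side needs a global clock or a central scheduler: each one proceeds only when its neighbor's signal allows it.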
29
Distributed Control Logic
ack
rdy

-
30
Outline

Introduction
ASH vs CPU
CASH: Compiling for ASH (some details)
ASH at run-time
Conclusions

31
Forward Branches
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; operators -, >, !; output y through a decoded mux]
Conditionals => Speculation
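A software analogue of this slide's conditional, written as a sketch: both arms are computed unconditionally (speculated) and a decoded mux picks the result. The function names are hypothetical, and returning 0 when no predicate is set is an arbitrary choice made only to keep the function total.

```c
#include <stdbool.h>

/* Decoded mux: one predicate per data input; at most one predicate is true. */
int decoded_mux(bool p0, int d0, bool p1, int d1) {
    if (p0) return d0;
    if (p1) return d1;
    return 0;  /* no input selected (does not occur in the use below) */
}

/* Same result as: if (x > 0) y = -x; else y = b*x; */
int branch_as_dataflow(int x, int b) {
    int  neg  = -x;      /* "then" arm, computed speculatively */
    int  prod = b * x;   /* "else" arm, computed speculatively */
    bool gt   = x > 0;   /* predicate */
    return decoded_mux(gt, neg, !gt, prod);
}
```

Both arms are free of side effects, so evaluating the arm that is not selected is harmless; that is what makes the speculation safe here.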
32
Control Flow => Data Flow
[Figure labels: Merge (data in, data out); Gateway (data, predicate)]
33
Loops

int sum = 0, i;
for (i = 0; i < 100; i++)
    sum += i*i;
return sum;
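The loop on this slide can be read as a circuit built from the Merge and Gateway operators of the previous slide. The sketch below emulates that reading in software, assuming the loop body accumulates i*i as the later multiplier slides suggest. The `token_t` type and all function names are hypothetical, and the sequential `for (;;)` only stands in for hardware that would run these nodes concurrently.

```c
#include <stdbool.h>

/* A value travelling on a dataflow edge; present == false means "no token". */
typedef struct { bool present; int value; } token_t;

/* Gateway: forwards the data token only when the predicate is true. */
token_t gateway(token_t data, bool predicate) {
    token_t out = { data.present && predicate, data.value };
    return out;
}

/* Merge: forwards whichever input carries a token (inputs are exclusive). */
token_t merge(token_t a, token_t b) {
    return a.present ? a : b;
}

/* sum of i*i for i in [0, n), expressed with merge/gateway nodes. */
int sum_of_squares_dataflow(int n) {
    token_t i_init = { true, 0 },  sum_init = { true, 0 };
    token_t i_back = { false, 0 }, sum_back = { false, 0 };
    for (;;) {
        token_t i   = merge(i_back, i_init);      /* loop entry or back edge */
        token_t sum = merge(sum_back, sum_init);
        i_init.present = sum_init.present = false; /* entry tokens consumed */

        bool cont = i.value < n;                   /* loop predicate */
        token_t sum_exit = gateway(sum, !cont);    /* exit edge */
        if (sum_exit.present) return sum_exit.value;

        token_t i_body   = gateway(i, cont);       /* back edges */
        token_t sum_body = gateway(sum, cont);
        sum_back.present = true;
        sum_back.value   = sum_body.value + i_body.value * i_body.value;
        i_back.present   = true;
        i_back.value     = i_body.value + 1;
    }
}
```

With n = 100 this returns the same value as the C loop on the slide.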

34
Read-write Sets
Memory
p if (x) q else r
35
Token Edges
Memory
p if (x) q else r
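One way to picture how read-write sets lead to token edges: give every memory operation a set of locations it may read and a set it may write, and add a token edge between two operations only when those sets can conflict. The bitmask encoding, the `rw_set_t` type and `needs_token_edge` are assumptions made for this sketch; the slides do not say how the compiler represents the sets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical read-write set: bit k set means "may touch abstract location k". */
typedef struct {
    uint32_t reads;
    uint32_t writes;
} rw_set_t;

/* Two operations must be ordered by a token edge if they may touch a common
 * location and at least one of them writes it. Two reads never need ordering. */
bool needs_token_edge(rw_set_t first, rw_set_t second) {
    bool write_then_any  = (first.writes & (second.reads | second.writes)) != 0;
    bool read_then_write = (first.reads & second.writes) != 0;
    return write_then_any || read_then_write;
}
```

Operations with no token edge between them are free to reach memory in any order, which is where the extra memory parallelism comes from.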
36
Tokens in Hardware
[Figure labels: Load with inputs addr, token, pred; LSQ; Memory; outputs data, token]
37
Outline

Introduction
ASH vs CPU
Compiling for ASH
ASH at run-time
Conclusions

38
Critical Paths
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; operators -, >, !; output y]
39
Lenient Operations
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; operator !; output y]
Solve the problem of unbalanced paths
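A lenient operation produces its output as soon as the inputs that have arrived are enough to determine it, without waiting for the rest. The sketch below models this with a three-valued signal; the `lval_t` and `input_t` types and the function names are hypothetical.

```c
#include <stdbool.h>

/* Three-valued signal: the value may simply not have arrived yet. */
typedef enum { L_UNKNOWN, L_FALSE, L_TRUE } lval_t;

/* Lenient AND: a single false input decides the result immediately. */
lval_t lenient_and(lval_t a, lval_t b) {
    if (a == L_FALSE || b == L_FALSE) return L_FALSE;
    if (a == L_TRUE && b == L_TRUE)   return L_TRUE;
    return L_UNKNOWN;  /* still waiting for an input that matters */
}

/* A data input that may or may not have arrived. */
typedef struct { bool ready; int value; } input_t;

/* Lenient mux: once the predicate is known, only the selected input matters. */
input_t lenient_mux(lval_t pred, input_t if_true, input_t if_false) {
    input_t waiting = { false, 0 };
    if (pred == L_TRUE)  return if_true;
    if (pred == L_FALSE) return if_false;
    return waiting;
}
```

For the conditional on this slide, the mux can emit -x as soon as x > 0 is known, even if the slower b*x arm has not finished.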
40
Pipelining
[Figure: loop circuit with nodes i, 1, 100, <, pipelined multiplier (8 stages), sum; cycle 1]

int sum = 0, i;
for (i = 0; i < 100; i++)
    sum += i*i;
return sum;
41
Pipelining
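[Figure: the same loop circuit at cycle 2]
42
Pipelining
[Figure: the same loop circuit at cycle 3]
43
Pipelining
[Figure: the same loop circuit at cycle 4]
44
Pipelining
[Figure: the same loop circuit at cycle 5]
45
Pipelining
[Figure: the same loop circuit at cycle 6; extra labels: i=1, i=0]
46
Pipelining
[Figure: the same loop circuit at cycle 7]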
47
Pipelining
[Figure: the same loop circuit with the critical path highlighted]
Predicate ack edge is on the critical path.

48
Pipelining
[Figure: the same loop circuit with a decoupling FIFO added; cycle 7]
49
Pipelining
[Figure: the loop circuit split by the decoupling FIFO into i's loop and sum's loop; the critical path is marked]

50
FIFO Impact
[Figure: loop circuit with decoupling FIFO]
Pipe  FIFO  Cycles
N     0     903
N     1     903
Y     0     653
Y     1     474
Y     2     408
Y     3     408
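The decoupling idea can be illustrated functionally: a small FIFO sits between the loop that generates i and the loop that accumulates the sum, so the generator may run ahead until the FIFO fills. The sketch below only shows the structure. It does not model cycle timing and makes no attempt to reproduce the cycle counts on this slide. `FIFO_CAPACITY`, the `fifo_t` type and all function names are hypothetical.

```c
#include <stdbool.h>

#define FIFO_CAPACITY 4  /* assumed depth, chosen only for the example */

typedef struct {
    int buf[FIFO_CAPACITY];
    int head;   /* index of the oldest element */
    int count;  /* number of stored elements */
} fifo_t;

bool fifo_push(fifo_t *f, int v) {
    if (f->count == FIFO_CAPACITY) return false;  /* full: producer stalls */
    f->buf[(f->head + f->count) % FIFO_CAPACITY] = v;
    f->count++;
    return true;
}

bool fifo_pop(fifo_t *f, int *out) {
    if (f->count == 0) return false;              /* empty: consumer stalls */
    *out = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return true;
}

/* sum of i*i for i in [0, n), with i's loop and sum's loop decoupled. */
int sum_squares_decoupled(int n) {
    fifo_t f = { {0}, 0, 0 };
    int i = 0, consumed = 0, sum = 0;
    while (consumed < n) {
        /* i's loop: runs ahead for as long as the FIFO has room */
        while (i < n && fifo_push(&f, i)) i++;
        /* sum's loop: takes one value per step (stands in for the slow multiplier) */
        int v;
        if (fifo_pop(&f, &v)) { sum += v * v; consumed++; }
    }
    return sum;
}
```

The result is the same as the undecoupled loop; only the coupling between the two loops changes, which matches the table's pattern of cycles dropping as FIFO depth grows and then leveling off.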

51
Dataflow Loop Pipelining

Related to software pipelining
Copes with unknown latencies
control-flow
memory accesses
Does not require parallelization
Applicable to memory accesses as well

52
Performance of Selected Kernels
[Figure: bar chart over Mediabench kernels mpeg2_d, gsm_e, gsm_d, g721_d, mpeg2_e, pegwit_e, g721_e, jpeg_d, pegwit_d, jpeg_e, adpcm_e, adpcm_d; bar annotations: 25, 17/12, 19/16, 10.5]
53
OpenDIVX IDCT, Normalized Running Time
54
OpenDIVX IDCT, Sustained IPC
includes speculative ops
no data
55
Full Dataflow Speed
wrong!

ASH runs at full dataflow speed, so CPU cannot do any better (if compilers equally good)

CPU weapons
speculation (branch prediction)
centralized memory access

56
Muxes Speculation Squash
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; operators -, >, !; output y]
57
Control-Flow Speculation
for (i = 0; i < N; i++) {
    ...
    if (exception) break;
}
[Figure labels: i, 1, <, exception, !]

58
Summary

ANSI C automatically translated to dynamically-scheduled hardware circuits
EPIC-based compilation technology (predication, speculation, hyperblocks)
Novel specific optimizations: leniency, ack-on-critical-path, loop decoupling, bitwidth
ASH does not naturally support control speculation (aka branch prediction)

59
Conclusions

ASH: compiler-synthesized hw from HLL
Exposes program parallelism
ILP and loop pipelining
Dataflow techniques applied to hardware
Impressive performance on data-intensive kernels

60
Backup Slides

Compiler structure
Predication and speculation
Procedure calls
Evaluation model
Program partitioning
Status/resources
Control logic

61
Compiler Structure
Pegasus
SUIF
C/FORTRAN

CSE
dead-code
PRE
induction variables
strength reduction
loop-invariant lift
reassociation
memory optimization
constant propagation
constant folding
unreachable code
register promotion

inlining
unrolling
call-graph
pointer analysis
live var. analysis
CFG construction
unreachable code
build hyperblocks
ctrl dominance
path predicates

call-graph
C circuit simulation
Verilog
62
Hyperblocks
Procedure
63
Predication
hyperblock
if (!p) .......
p
!p
if (p) .......
q
q
64
Speculation
if (!p) ......
if (!p) ......
ops w/ side-effects
q
q
65
Computing Predicates
s
t
b

Correct for irreducible graphs
Correct even when speculatively computed
Boolean operations are lenient
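The figure for this slide did not survive extraction, so the sketch below uses a common definition of a path predicate as an assumption: a block b executes if some predecessor executed and the branch from that predecessor to b was taken. The function and parameter names are hypothetical.

```c
#include <stdbool.h>

/* Assumed definition:
 *   p(b) = OR over predecessors s of ( p(s) AND edge s->b taken )
 * pred_executes[k] is p(s_k); edge_taken[k] is the branch condition s_k -> b. */
bool block_predicate(const bool *pred_executes, const bool *edge_taken, int n_preds) {
    bool p = false;
    for (int k = 0; k < n_preds; k++)
        p = p || (pred_executes[k] && edge_taken[k]);
    return p;
}
```

For the two predecessors s and t named on the slide, this reduces to (p(s) AND s->b) OR (p(t) AND t->b). Since it uses only AND and OR, it can be built from lenient Boolean operators.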

66
Procedure calls
network
Extract args
args
call P
result
caller
ret
Procedure P
67
Recursion
save live values
recursive call
local stack
restore live values
hyperblock
68
How Performance Is Evaluated
[Figure labels: C, Mem, L2 1/4M, L1 8K, LSQ, limited BW (2 words/cycle), Unlimited ILP; numeric labels: 2, 8, 72]
69
Simulation Parameters

Compared to 4-wide OOO SimpleScalar
Same operation latencies
Same cache hierarchy
No measurements in library functions/OS
3-cycle multiply, 20-cycle divide

70
Unit of Partitioning: Procedure
Program call-graph
recursive
leaves
library
71
Peering
Program
a( ) b( ) b( ) c( ) c( ) d( ) d( )
a
CPU
ASH
b
c
d
72
RPC
RH
CPU
a
b
b
c
c
d
d
73
Status