1
Spatial Computation: Computing without
General-Purpose Processors

Mihai Budiu
mihaib_at_cs.cmu.edu
Carnegie Mellon University

Presentation at
May 17, 2004
2
Spatial Computation

A computation model based on
application-specific hardware
no interpretation
minimal resource sharing

3
Research Scope
Object: future architectures
Tool: compilers
Evaluation: simulators
4
Three Spatial Computation Projects
Application-Specific Hardware (ASH)
nanoFabrics

virtual reconfigurable hardware

C
Compiler
reconfigurable hardware
5
Main Results of My Research (1)

Developed DIL compiler
Completely replaces CAD tool-chain
700 times faster than commercial tools
New optimizations (BitValue, place-and-route)
Streaming kernels execute 20-300 times faster
than on a µP

6
Main Results of My Research (2)

nanoFabrics
Identified strengths and limitations of
nanodevices
Proposed new reconfigurable architecture
HLL → HW compilation for spatial computation
Studied first-order properties of spatial
computation

7
Main Results of My Research (3)
Application-Specific Hardware (ASH): compiler-synthesized architecture

Fast prototyping: automatic from ANSI C → HW
High performance: sustained > 0.8 GOPS (180nm)
Low power: energy/op 100-1000× better than µP

8
Related Work
Nanotechnology
Dataflow machines
Asynchronous circuits
High-level synthesis
Embedded systems
Reconfigurable computing
Computer architecture
Compilation
9
Outline

Research overview
Problems of current architectures
Compiling Application-Specific Hardware
ASH Evaluation
New compiler optimizations
Conclusions

1000
Performance
10
Resources
Intel

We do not worry about not having hardware
resources
We worry about being able to use hardware
resources

11
Complexity
Cannot rely on global signals
(clock is a global signal)
12
Instruction-Set Architecture
Software
ISA
Hardware
13
Our Proposal

ASH addresses these problems
ASH is not a panacea
ASH complementary to CPU

14
What's New?

Investigate new computational model
Source is full ANSI C
Result is asynchronous circuit
Build spatial dataflow hardware
No resource limitations
New compiler algorithms
End-to-end results
C to structural Verilog in seconds
high performance results
excellent power efficiency

black box
15
Outline

Research overview
Problems of current architectures
CASH: Compiling ASH
program representation
compiling C programs
ASH Evaluation
New compiler optimizations
Conclusions

16
Application-Specific Hardware
C program
Compiler
Dataflow IR
Reconfigurable/custom hw
17
Application-Specific Hardware
Soft
C program
Compiler
Dataflow IR
SW backend
Machine code
CPU predication
18
Key Intermediate Representation
Traditionally
Our IR

SSA + predication + speculation
Uniform for scalars and memory
Explicitly encodes may-depend
Executable
Precise semantics
Dataflow IR
Close to asynchronous target

may-dep.
CFG
...
def-use
19
Computation = Dataflow
Programs
Circuits
a
7
x = a & 7; ... y = x >> 2;

2
x
>>

Operations ⇒ functional units
Variables ⇒ wires
No interpretation
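
A minimal C sketch of this mapping, assuming the slide's example reads x = a & 7; y = x >> 2 (the first operator is lost in this transcript, so `&` is an assumption). Each function stands for one dedicated functional unit and each local variable for a wire; the names `and_unit`, `shr_unit`, and `circuit` are illustrative only.

```c
#include <stdint.h>

/* One function per operation: in spatial computation each operation gets
   its own functional unit instead of sharing a general-purpose ALU. */
static uint32_t and_unit(uint32_t a, uint32_t b) { return a & b; }
static uint32_t shr_unit(uint32_t a, uint32_t n) { return a >> n; }

/* The "circuit": values flow along wires from producer to consumer.
   No instruction is fetched or decoded at any point. */
uint32_t circuit(uint32_t a) {
    uint32_t x = and_unit(a, 7);   /* wire x: output of the AND unit   */
    uint32_t y = shr_unit(x, 2);   /* wire y: output of the shift unit */
    return y;
}
```

Calling `circuit(13)` pushes one input through both units in sequence, which is how a single data item would travel through the hardware.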

20
Basic Computation

latch
data
ack
valid
21
Asynchronous Computation

data
ack
valid
1
22
Distributed Control Logic
ack
rdy

-
short, local wires
asynchronous control
23
Outline

Research overview
Problems of current architectures
CASH: Compiling ASH
program representation
compiling C programs
ASH Evaluation
New compiler optimizations
Conclusions

24
MUX: Forward Branches
x
b
0
if (x > 0) y = -x; else y = b*x;

-
>
!
f
y
critical path
Conditionals ⇒ Speculation
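
The conversion of a forward branch into a multiplexor can be sketched in C as follows. This assumes the slide's conditional reads if (x > 0) y = -x; else y = b*x; (the multiplication sign is lost in this transcript). The helper names `mux` and `speculate` are illustrative and not part of CASH.

```c
#include <stdint.h>

/* 2-input multiplexor: forwards t when the predicate holds, else f. */
static int32_t mux(int pred, int32_t t, int32_t f) { return pred ? t : f; }

/* Branch-free form of: if (x > 0) y = -x; else y = b * x; */
int32_t speculate(int32_t x, int32_t b) {
    int     p      = x > 0;   /* predicate computed by the comparator  */
    int32_t y_then = -x;      /* both sides are evaluated speculatively */
    int32_t y_else = b * x;   /* the multiply is the slower of the two  */
    return mux(p, y_then, y_else);
}
```

Both arms are side-effect free, so evaluating them unconditionally is safe; only the multiplexor depends on the predicate.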
25
Control Flow ⇒ Data Flow
data
f
Merge (label)
data
data
predicate
Gateway
26
Loops

int sum=0, i;
for (i=0; i < 100; i++)
sum += i*i;
return sum;

pipelining
27
Predication and Side-Effects
addr
token
to memory
Load
pred
data
token
28
Memory Access
LD
Monolithic Memory
pipelined arbitrated network
ST
LD
local communication
global structures
Future work: fragment this!
related work
complexity
29
CASH Optimizations

SSA-based optimizations
unreachable/dead code, gcse, strength reduction,
loop-invariant code motion, software pipelining,
reassociation, algebraic simplifications,
induction variable optimizations, loop unrolling,
inlining
Memory optimizations
dependence & alias analysis, register promotion,
redundant load/store elimination, memory access
pipelining, loop decoupling
Boolean optimizations
Espresso CAD tool, bitwidth analysis

30
Outline

Research overview
Problems of current architectures
Compiling ASH
Evaluation: CASH vs. clocked designs
New compiler optimizations
Conclusions

31
Evaluating ASH
Mediabench kernels (1 hot function/benchmark)
C
CASH core
Verilog back-end
Synopsys, Cadence P/R
180nm std. cell library, 2V
1999 technology
ModelSim (Verilog simulation)
performance numbers
Mem
ASIC
32
ASH Area
P4 217
minimal RISC core
normalized area
33
ASH vs 600MHz CPU (.18 µm)
34
Bottleneck: Memory Protocol
LD
Memory
ST
35
Power
Xeon cache 67000
µP 4000
DSP 110
36
Energy Efficiency
Dedicated hardware
ASH media kernels
Asynchronous µP
General-purpose DSP
Microprocessors
Energy Efficiency [Operations/nJ]
37
Outline

Research overview
Nanotechnology and architecture
Compiling ASH
ASH Evaluation
New compiler optimizations
BitValue dataflow analysis
Optimizing memory accesses
SIDE: static instantiation, dynamic
evaluation
Conclusions

38
Detecting Constant Bits
a
b = a >> 4
b
0000
39
Detecting Useless Bits
Don't-care bits
a
XXXX
b = a >> 4
b
40
BitValue Dataflow Analysis
a
XXXX
b = a >> 4
b
0000
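
The two directions of the analysis can be sketched in C for the shift on these slides. This is a simplified illustration and not the BitValue implementation: it assumes 32-bit unsigned values, a constant shift amount, and a two-mask encoding of known bits. All type and function names are hypothetical.

```c
#include <stdint.h>

/* Forward facts: which bits are known constants.
   A bit set in neither mask is unknown. */
typedef struct {
    uint32_t zero;   /* bit i set => bit i is known to be 0 */
    uint32_t one;    /* bit i set => bit i is known to be 1 */
} bitvalue;

/* Forward transfer for b = a >> k (logical shift, 0 < k < 32):
   known bits move down by k and the top k bits become constant 0. */
bitvalue shr_forward(bitvalue a, unsigned k) {
    bitvalue b;
    uint32_t shifted_in = ~(UINT32_MAX >> k);   /* mask of the top k bits */
    b.zero = (a.zero >> k) | shifted_in;
    b.one  = a.one >> k;
    return b;
}

/* Backward transfer: given the bits of b that some consumer needs,
   return the bits of a that matter. The low k bits of a never reach b,
   so they are don't-cares. */
uint32_t shr_backward(uint32_t needed_b, unsigned k) {
    return needed_b << k;
}
```

For k = 4 and a fully unknown input, the forward pass marks the top four bits of b as constant zero and the backward pass marks the low four bits of a as useless, matching the 0000 and XXXX annotations above.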
41
BitValue on C Programs
useless int arithmetic
27
Mediabench
SpecInt95
SpecInt2K
42
Outline

...
New compiler optimizations
BitValue dataflow analysis
Memory access optimization
Static Instantiation, Dynamic Evaluation
Conclusions

43
Meaning of Token Edges
p
p
q
q

Maybe dependent
No intervening memory operation

Independent

Token graph is maintained transitively reduced
44
Dead Code Elimination
(false)
p
45
≈ PRE
(p1)
(p2)
(p1 ∨ p2)
...p
...p
...p
This corresponds in the CFG to lifting the load
to a basic block dominating the original loads
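
A hedged C illustration of this transformation: two loads of the same address, guarded by predicates p1 and p2 and with no intervening store, are replaced by one load guarded by their disjunction. The counter `load_count` and the helper `load_if` exist only to make the saved memory access visible; none of these names come from the compiler.

```c
#include <stdint.h>
#include <stdbool.h>

static int load_count;   /* number of memory reads performed (illustration) */

/* Predicated load: reads memory only when pred is true. */
static int32_t load_if(bool pred, const int32_t *p) {
    if (!pred) return 0;
    load_count++;
    return *p;
}

/* Before: the same location is loaded twice, under p1 and under p2. */
int32_t before_pre(bool p1, bool p2, const int32_t *p) {
    int32_t a = load_if(p1, p);
    int32_t b = load_if(p2, p);
    return a + b;
}

/* After: a single load under (p1 || p2) feeds both consumers. */
int32_t after_pre(bool p1, bool p2, const int32_t *p) {
    int32_t v = load_if(p1 || p2, p);
    int32_t a = p1 ? v : 0;
    int32_t b = p2 ? v : 0;
    return a + b;
}
```

Both versions return the same value for every combination of predicates; the merged version reads memory at most once.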
46
Register Promotion
(p1)
p
(p2 ∧ ¬p1)
(p2)
p
Load is executed only if store is not
47
Register Promotion (2)
(p1)
p
(p1)
p
(p2 ∧ ¬p1)
(false)
p
(p2)
p

When p2 ⇒ p1 the load becomes dead...
...i.e., when store dominates load in CFG
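
The effect can be sketched in C, assuming a store to p under predicate p1 followed by a load from p under predicate p2. The load's predicate is strengthened so it runs only when the store did not, and the stored value is forwarded otherwise. The function name and the `loads` counter are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

static int loads;   /* number of real memory reads (illustration only) */

/* Store *p = v under p1, then load *p under p2, after register promotion. */
int32_t store_then_load(bool p1, bool p2, int32_t *p, int32_t v) {
    if (p1) *p = v;                  /* the store keeps its predicate p1 */

    bool    load_pred = p2 && !p1;   /* load runs only if the store did not */
    int32_t loaded = 0;
    if (load_pred) { loaded = *p; loads++; }

    /* Multiplexor: forward the stored value when the store executed. */
    int32_t result = p1 ? v : loaded;
    return p2 ? result : 0;          /* value is only consumed under p2 */
}
```

If p2 implies p1, then `load_pred` is always false and the load can be removed by dead code elimination.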

48
Outline

...
New compiler optimizations
BitValue dataflow analysis
Memory access optimization
A SIDE dish (dataflow analysis): Static
Instantiation, Dynamic Evaluation
Conclusions

49
Availability Dataflow Analysis
y = a*b; ... if (x) { ... } ... = a*b;

50
Dataflow Analysis Is Conservative
if (x) { ... y = a*b; } ... ... = a*b;
y?
51
Static Instantiation, Dynamic Evaluation
flag = false; if (x) { ... y = a*b;
flag = true; } ... ... = flag ? y : a*b;
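
A runnable C rendering of this slide, assuming the elided expression is a*b and that neither a nor b changes between the two uses. The flag records at run time whether the value is available, which a static availability analysis cannot know. The `mults` counter and the function names are illustrative only.

```c
#include <stdbool.h>

static int mults;   /* number of multiplications actually performed */

static int mul(int a, int b) { mults++; return a * b; }

int side_example(bool x, int a, int b) {
    bool flag = false;   /* dynamic "a*b is available in y" bit */
    int  y = 0;

    if (x) {
        y = mul(a, b);   /* first computation, on one path only */
        flag = true;
    }

    /* Later use: reuse y when it was computed, otherwise recompute. */
    return flag ? y : mul(a, b);
}
```

On either path the product is computed exactly once, whereas the conservative static analysis would have to recompute it after the if whenever x is true.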
52
SIDE Register Promotion Effect
Loads
reduction
Stores
53
Outline

Research overview
Problems of current architectures
Compiling ASH
ASH Evaluation
New compiler optimizations
Future work & conclusions

54
Future Work

Optimizations for area/speed/power
Memory partitioning
Concurrency
Compiler-guided layout
Explore extensible ISAs
Hybridization with superscalar mechanisms
Reconfigurable hardware support for ASH
Formal verification

55
Grand Vision: Certified Circuit Generation

Translation validation: input ≡ output
Preserve input properties
e.g., C programs cannot deadlock
e.g., type-safe programs cannot crash
Debug, test, verify only at source-level

How far can you go?
HLL
IR
IRopt
Verilog
gates
layout
formally validated
56
Conclusions
Spatial computation strengths
Feature                  Advantages
No interpretation        Energy efficiency, speed
Spatial layout           Short wires, no contention
Asynchronous             Low power, scalable
Distributed              No global signals
Automatic compilation    Design productivity, no ISA
57
Backup Slides

Reconfigurable hardware
Critical paths
Software pipelining
Control logic
More on PipeRench
ASH vs ...
ASH weaknesses
Exceptions
Research methodology
Normalized area
Why C?
Splitting memory
More performance
Recursive calls
Nanotech and architecture

58
Reconfigurable Hardware
59
Main RH Ingredient: RAM Cell
data in
0
control
Switch controlled by a 1-bit RAM cell
60
Pipeline Stage
ackout
C
rdyin
rdyout
ackin

D
Reg
datain
dataout
61
Critical Paths
x
b
0
if (x > 0) y = -x; else y = b*x;

-
>
!
y
62
Lenient Operations
x
b
0
if (x > 0) y = -x; else y = b*x;

!
y
Solves the problem of unbalanced paths
63
Pipelining
i
1

100

<
pipelined multiplier (8 stages)
sum

int sum=0, i;
for (i=0; i < 100; i++)
sum += i*i;
return sum;

step 1
64
Pipelining
i
1

100

<
sum

step 2
65
Pipelining
i
1

100

<
sum

step 3
66
Pipelining
i
1

100

<
sum

step 4
67
Pipelining
i
1

100
i1
<
i0
sum

step 5
68
Pipelining
i
1

100

i1
<
i0
sum

step 6
69
Pipelining
i
1

100

<
sum

step 7
70
Pipelining
i
1

100

critical path
<
Predicate ack edge is on the critical path.
sum

71
Pipeline balancing
i
1

100

<
decoupling FIFO
sum

step 7
72
Pipeline balancing
i
1

100

<
critical path
i's loop
decoupling FIFO
sum
sum's loop

73
Process: 0.18 µm, 6 Al metal layers
Area: 49 mm²
Clock: 60MHz I/O, 120MHz internal
Power: < 4W
Stripes: 16 physical, 256 virtual

Compiler functional on first silicon
Licensed by two companies

74
Hardware Virtualization
Page out
compute
compute
compute
configure
Page in
Hardware
Overlap configuration with computation.
Configuration
75
PipeRench Hardware
ALU
ALU
data flow
Interconnection Network
Register
ALU
ALU
Interconnection Network
Register
ALU
ALU
76
Mapping Computation

>>
<<
concat
bit-shuffling
<<
substr
Network used for computation

77
Compiler-Controlled Clock
Register
Register

Network
Network
Register
Register

Slow clock
Fast clock
78
Time-Multiplexing Wires
1
2

1
2
?
4
3
4
3
One channel available for two wires
compute in even cycles
compute in odd cycles
79
Compilation Times (sec on PII/400)
80
Compilation Speed (PII/400)
81
Placed Circuit Utilization
82
PipeRench Performance
Speed-up vs. 300MHz UltraSparc
83
PipeRench Compiler Role

Classical optimizations
Partial evaluation
Data width inference (type inference)
Module generation (macro expansion)
Placement (VLIW scheduling)
Routing (irregular register allocation)
Network link multiplexing (spilling)
Clock-cycle management
Technology mapping ( instruction selection)
Code generation

84
HLL to HW
High-level Synthesis       Behavioral HDL        Synchronous Hardware
Reconfigurable Computing   C subsets             Hardware configuration (spatial computation)
Asynchronous circuits      Concurrent Language   Asynchronous Hardware
Prior work
This research
85
CASH vs High-Level Synthesis

CASH the only existing tool to translate
complete ANSI C to hardware
CASH generates asynchronous circuits
CASH does not treat C as an HDL
no annotations required
no reactivity model
does not handle non-C, e.g., concurrency

86
ASH Weaknesses

Low efficiency for low-ILP code
Does not adapt at runtime
Monolithic memory
Resource waste
Not flexible
No support for exceptions

87
ASH Weaknesses (2)

Both branch and join not free
Static dataflow (no re-issue of same instr)
Memory is far
Fully static
No branch prediction
No dynamic unrolling
No register renaming
Calls/returns not lenient

88
Branch Prediction
i
1

for (i=0; i < N; i++) {
...
if (exception) break;
}

<
exception
!

89
Research Methodology
Y (e.g., cost)
reasonable limits
state-of-the-art
X (e.g., power)
Constraint Space
90
Exceptions

Strictly speaking, C has no exceptions
In practice: hard to accommodate exceptions in
hardware implementations
An advantage of software flexibility: PC is
single point of execution control

CPU
ASH
Low-ILP computation + OS + VM + exceptions
High-ILP computation

Memory
91
Why C?

Huge installed base
Embedded specifications written in C
Small and simple language
Can leverage existing tools
Simpler compiler
Techniques generally applicable
Not a toy language

92
Performance
93
Parallelism Profile
94
Normalized Area
95
Memory Partitioning

MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00
Stanford SpC: Semeria DAC '01, TVLSI '02
Berkeley CCured: Necula POPL '02
Illinois FlexRAM: Fraguela PPoPP '03
Hand-annotations: #pragma

96
Memory Complexity
RAM
LSQ
addr
data
97
Recursion
save live values
recursive call
restore live values
stack
98
Nanotechnology and Architecture
99
Nanotechnology Implications
new architectures new compilers
new devices new manufacturing
100
CAEN

Study computer architecture implications of
Chemically-Assembled Electronic Nanotechnology

101
No Complex Irregular Structures
102
Regular Substrate
Control
10^11 gates
103
High Defect Rate
104
Paradigm Shift
defects
Configuration
Executable
Complex fixed chip Program
Dense, regular structure Configuration
105
New Computer Architecture
CMOS                     Self-assembled circuits
Transistor               New molecular devices
Custom hardware          Reconfigurable hardware
Yield (defect) control   Defect tolerance through reconfiguration
Synchronous circuits     Asynchronous computation
Microprocessors          App-specific Hardware + CPU
106
Exploiting Nanotechnology

Nanotechnology
cheap
high-density
low-power
unreliable

Reconfigurable
Computing
defect tolerant
high performance
low density

Computer architecture
vast body of knowledge
expensive
high-power

107
Research Convergence
Chemically-assembled electronic nanotechnology
Systems research issues
108
Venues
