Mihai Budiu - PowerPoint PPT Presentation

Learn more at: http://www.cs.cmu.edu
1
Spatial Computation: Computing without General-Purpose Processors
  • Mihai Budiu
  • mihaib_at_cs.cmu.edu
  • Carnegie Mellon University

Presentation at
May 17, 2004
2
Spatial Computation
  • A computation model based on
  • application-specific hardware
  • no interpretation
  • minimal resource sharing

3
Research Scope
Object: future architectures
Tool: compilers
Evaluation: simulators
4
Three Spatial Computation Projects
Application-Specific Hardware (ASH)
nanoFabrics
  • virtual reconfigurable hardware

C
Compiler
reconfigurable hardware
5
Main Results of My Research (1)
  • Developed DIL compiler
  • Completely replaces CAD tool-chain
  • 700 times faster than commercial tools
  • New optimizations (BitValue, place-and-route)
  • Streaming kernels execute 20-300 times faster
    than on µP

6
Main Results of My Research (2)
  • nanoFabrics
  • Identified strengths & limitations of nanodevices
  • Proposed new reconfigurable architecture
  • HLL → HW compilation for spatial computation
  • Studied first-order properties of spatial computation

7
Main Results of My Research (3)
Application-Specific Hardware (ASH): compiler-synthesized architecture
  • Fast prototyping: automatic from ANSI C → HW
  • High performance: sustained > 0.8 GOPS at 180nm
  • Low power: energy/op 100-1000x better than µP

8
Related Work
Nanotechnology
Dataflow machines
Asynchronous circuits
High-level synthesis
Embedded systems
Reconfigurable computing
Computer architecture
Compilation
9
Outline
  • Research overview
  • Problems of current architectures
  • Compiling Application-Specific Hardware
  • ASH Evaluation
  • New compiler optimizations
  • Conclusions

1000
Performance
10
Resources
Intel
  • We do not worry about not having hardware
    resources
  • We worry about being able to use hardware
    resources

11
Complexity
Cannot rely on global signals
(clock is a global signal)
12
Instruction-Set Architecture
Software
ISA
Hardware
13
Our Proposal
  • ASH addresses these problems
  • ASH is not a panacea
  • ASH complementary to CPU

14
What's New?
  • Investigate new computational model
  • Source is full ANSI C
  • Result is asynchronous circuit
  • Build spatial dataflow hardware
  • No resource limitations
  • New compiler algorithms
  • End-to-end results
  • C to structural Verilog in seconds
  • high performance results
  • excellent power efficiency

black box
15
Outline
  • Research overview
  • Problems of current architectures
  • CASH: Compiling ASH
  • program representation
  • compiling C programs
  • ASH Evaluation
  • New compiler optimizations
  • Conclusions

16
Application-Specific Hardware
C program
Compiler
Dataflow IR
Reconfigurable/custom hw
17
Application-Specific Hardware
Soft
C program
Compiler
Dataflow IR
SW backend
Machine code
CPU predication
18
Key Intermediate Representation
Traditionally
Our IR
  • SSA + predication + speculation
  • Uniform for scalars and memory
  • Explicitly encodes may-depend
  • Executable
  • Precise semantics
  • Dataflow IR
  • Close to asynchronous target

may-dep.
CFG
...
def-use
19
Computation Dataflow
Programs
Circuits
a
7
x = a & 7; ... y = x >> 2;

2
x
>>
  • Operations ⇒ functional units
  • Variables ⇒ wires
  • No interpretation
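
A minimal sketch of the mapping above, assuming the slide's fragment reads `x = a & 7; y = x >> 2` (the `&` is an assumption where the transcript lost the operator). Each helper function stands in for one functional unit and each local variable for one wire; the names `unit_and`, `unit_shr` and `spatial_y` are illustrative only.

```c
#include <stdint.h>

/* Each helper stands in for one functional unit (names are illustrative). */
static uint32_t unit_and(uint32_t a, uint32_t b) { return a & b; }
static uint32_t unit_shr(uint32_t a, uint32_t n) { return a >> n; }

/* x = a & 7; y = x >> 2;  each local variable plays the role of a wire. */
uint32_t spatial_y(uint32_t a)
{
    uint32_t x = unit_and(a, 7u);   /* wire x */
    uint32_t y = unit_shr(x, 2u);   /* wire y */
    return y;
}
```

Nothing here is fetched or decoded: the structure of the calls is fixed, which is the sense in which the slide says "no interpretation".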

20
Basic Computation

latch
data
ack
valid
21
Asynchronous Computation

data
ack
valid
1
22
Distributed Control Logic
ack
rdy

-
short, local wires
asynchronous control
23
Outline
  • Research overview
  • Problems of current architectures
  • CASH: Compiling ASH
  • program representation
  • compiling C programs
  • ASH Evaluation
  • New compiler optimizations
  • Conclusions

24
MUX Forward Branches
x
b
0
if (x > 0) y = -x; else y = b*x;

-
>
!
f
y
critical path
Conditionals ⇒ Speculation
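
A hedged sketch of what turning a forward branch into a multiplexer could look like in C, reading the slide's conditional as `if (x > 0) y = -x; else y = b*x;`. The function names are hypothetical, and the sketch assumes inputs small enough that neither `-x` nor `b*x` overflows.

```c
#include <stdint.h>

/* Branching form from the slide: only one side is computed. */
int32_t branchy(int32_t x, int32_t b)
{
    int32_t y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Speculative form: both sides are computed, then a mux selects one.
   Assumes neither -x nor b*x overflows for the given inputs. */
int32_t speculated(int32_t x, int32_t b)
{
    int32_t neg  = -x;         /* side 1, computed unconditionally */
    int32_t prod = b * x;      /* side 2, computed unconditionally */
    int     pred = x > 0;      /* predicate feeding the mux */
    return pred ? neg : prod;  /* mux */
}
```

Both functions return the same value for the same inputs; the second simply exposes the two sides and the predicate as independent operations that can run in parallel.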
25
Control Flow ⇒ Data Flow
data
f
Merge (label)
data
data
predicate
Gateway
26
Loops
  • int sum=0, i;
  • for (i=0; i < 100; i++)
  • sum += i*i;
  • return sum;

pipelining
27
Predication and Side-Effects
addr
token
to memory
Load
pred
data
token
28
Memory Access
LD
Monolithic Memory
pipelined arbitrated network
ST
LD
local communication
global structures
Future work: fragment this!
related work
complexity
29
CASH Optimizations
  • SSA-based optimizations
  • unreachable/dead code, gcse, strength reduction,
    loop-invariant code motion, software pipelining,
    reassociation, algebraic simplifications,
    induction variable optimizations, loop unrolling,
    inlining
  • Memory optimizations
  • dependence & alias analysis, register promotion,
    redundant load/store elimination, memory access
    pipelining, loop decoupling
  • Boolean optimizations
  • Espresso CAD tool, bitwidth analysis

30
Outline
  • Research overview
  • Problems of current architectures
  • Compiling ASH
  • Evaluation: CASH vs. clocked designs
  • New compiler optimizations
  • Conclusions

31
Evaluating ASH
Mediabench kernels (1 hot function/benchmark)
C
CASH core
Verilog back-end
Synopsys, Cadence P/R
180nm std. cell library, 2V
1999 technology
ModelSim (Verilog simulation)
performance numbers
Mem
ASIC
32
ASH Area
P4 217
minimal RISC core
normalized area
33
ASH vs 600MHz CPU, 0.18 µm
34
Bottleneck: Memory Protocol
LD
Memory
ST
35
Power
Xeon cache 67000
µP 4000
DSP 110
36
Energy Efficiency
Dedicated hardware
ASH media kernels
Asynchronous µP
General-purpose DSP
Microprocessors
[Chart axis: Energy Efficiency, Operations/nJ]
37
Outline
  • Research overview
  • Nanotechnology and architecture
  • Compiling ASH
  • ASH Evaluation
  • New compiler optimizations
  • BitValue dataflow analysis
  • Optimizing memory accesses
  • SIDE: static instantiation, dynamic evaluation
  • Conclusions

38
Detecting Constant Bits
a
b = a >> 4
b
0000
39
Detecting Useless Bits
Don't-care bits
a
XXXX
b = a >> 4
b
40
BitValue Dataflow Analysis
a
XXXX
b = a >> 4
b
0000
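
The two directions shown on the preceding slides (constant bits flowing forward, don't-care bits flowing backward) can be sketched for a single logical right shift. This is only an illustration of the idea on 32-bit unsigned values, not the BitValue implementation; the mask representation and the function names are assumptions.

```c
#include <stdint.h>

/* Forward: which bits of b = a >> n are known to be 0.
   known_zero_a marks bits of a already known to be 0.
   Assumes a is an unsigned 32-bit value and 0 <= n < 32. */
uint32_t shr_known_zero(uint32_t known_zero_a, unsigned n)
{
    uint32_t shifted_in = ~(UINT32_MAX >> n);   /* the top n bits of b */
    return (known_zero_a >> n) | shifted_in;
}

/* Backward: which bits of a cannot affect any needed bit of b.
   dont_care_b marks bits of b that no consumer needs.
   Assumes 0 <= n < 32. */
uint32_t shr_dont_care(uint32_t dont_care_b, unsigned n)
{
    uint32_t dropped = (UINT32_C(1) << n) - 1u; /* the low n bits of a */
    return (dont_care_b << n) | dropped;
}
```

For `b = a >> 4` with nothing known about `a`, the forward function reports the top four bits of `b` as constant zero (the slide's `0000`), and the backward function reports the low four bits of `a` as don't-care (the slide's `XXXX`).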
41
BitValue on C Programs
useless int arithmetic
27
Mediabench
SpecInt95
SpecInt2K
42
Outline
  • ...
  • New compiler optimizations
  • BitValue dataflow analysis
  • Memory access optimization
  • Static Instantiation, Dynamic Evaluation
  • Conclusions

43
Meaning of Token Edges
p
p
q
q
  • Maybe dependent
  • No intervening memory operation
  • Independent

Token graph is maintained transitively reduced
44
Dead Code Elimination
(false)
p
45
≈ PRE
(p1)
(p2)
(p1 ∨ p2)
...p
...p
...p
This corresponds in the CFG to lifting the load
to a basic block dominating the original loads
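
As a rough illustration of this transformation, the sketch below merges two loads of the same address, guarded by predicates p1 and p2, into one load guarded by p1 || p2. It assumes no store to `*p` can occur between the two original loads; the function names and the load counter are hypothetical and exist only to make the effect visible.

```c
#include <stdbool.h>

/* Before: two loads of *p, guarded by p1 and p2. Returns the number of loads. */
int loads_before(const int *p, bool p1, bool p2, int *v1, int *v2)
{
    int n = 0;
    if (p1) { *v1 = *p; n++; }
    if (p2) { *v2 = *p; n++; }
    return n;
}

/* After: one load guarded by (p1 || p2); both consumers share the value.
   Valid only if nothing writes *p between the two original loads. */
int loads_after(const int *p, bool p1, bool p2, int *v1, int *v2)
{
    int n = 0;
    if (p1 || p2) {
        int t = *p;          /* the single merged load */
        n++;
        if (p1) *v1 = t;
        if (p2) *v2 = t;
    }
    return n;
}
```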
46
Register Promotion
(p1)
p
(p2 ∧ ¬p1)
(p2)
p
Load is executed only if store is not
47
Register Promotion (2)
(p1)
p
(p1)
p
(p2 ∧ ¬p1)
(false)
p
(p2)
p
  • When p2 ⇒ p1 the load becomes dead...
  • ...i.e., when store dominates load in CFG
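
A small sketch of the predicate rewrite on these two slides, under the assumption that the store has predicate p1, the load has predicate p2, and both access the same address. The function name and the `loads` counter are illustrative, not part of CASH.

```c
#include <stdbool.h>

/* Original:  if (p1) *p = v;   if (p2) r = *p;
   Rewritten: the load runs only when the store did not (p2 && !p1);
   otherwise the stored value is forwarded directly. */
int promoted(int *p, int v, bool p1, bool p2, int *loads)
{
    int r = 0;
    if (p1) *p = v;              /* store, predicate p1 */
    if (p2 && !p1) {             /* load, predicate p2 AND NOT p1 */
        r = *p;
        (*loads)++;
    } else if (p2) {
        r = v;                   /* value forwarded from the store */
    }
    return r;
}
```

When p2 implies p1, the condition `p2 && !p1` is always false, so the load never executes, which matches the slide's remark that the load becomes dead.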

48
Outline
  • ...
  • New compiler optimizations
  • BitValue dataflow analysis
  • Memory access optimization
  • A SIDE dish: dataflow analysis with Static
    Instantiation, Dynamic Evaluation
  • Conclusions

49
Availability Dataflow Analysis
y = a*b; ... if (x) { ... } ... = a*b;
  • y

50
Dataflow Analysis Is Conservative
if (x) { ... y = a*b; } ... ... = a*b;
y?
51
Static Instantiation, Dynamic Evaluation
flag = false; if (x) { ... y = a*b;
flag = true; } ... ... = flag ? y : a*b;
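
The fragment above can be made runnable as follows. This is a sketch under the assumption that `flag = true` sits inside the `if (x)` body; the multiplication counter and all function names are hypothetical and only serve to show when `a*b` is recomputed.

```c
#include <stdbool.h>

static int mults;   /* counts multiplications, for illustration only */

static int mul(int a, int b) { mults++; return a * b; }

int side_example(bool x, int a, int b)
{
    bool flag = false;        /* statically instantiated availability bit */
    int  y = 0;
    if (x) {
        y = mul(a, b);
        flag = true;          /* a*b is now available in y */
    }
    /* dynamic evaluation: reuse y when it is available, else recompute */
    return flag ? y : mul(a, b);
}

int mult_count(void) { return mults; }
```

On either path exactly one multiplication is performed, whereas a purely static availability analysis would have to keep the second `a*b` on both paths.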
52
SIDE Register Promotion Effect
Loads
reduction
Stores
53
Outline
  • Research overview
  • Problems of current architectures
  • Compiling ASH
  • ASH Evaluation
  • New compiler optimizations
  • Future work & conclusions

54
Future Work
  • Optimizations for area/speed/power
  • Memory partitioning
  • Concurrency
  • Compiler-guided layout
  • Explore extensible ISAs
  • Hybridization with superscalar mechanisms
  • Reconfigurable hardware support for ASH
  • Formal verification

55
Grand Vision: Certified Circuit Generation
  • Translation validation input output
  • Preserve input properties
  • e.g., C programs cannot deadlock
  • e.g., type-safe programs cannot crash
  • Debug, test, verify only at source-level

How far can you go?
HLL
IR
IRopt
Verilog
gates
layout
formally validated
56
Conclusions
Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Design productivity, no ISA
57
Backup Slides
  • Reconfigurable hardware
  • Critical paths
  • Software pipelining
  • Control logic
  • More on PipeRench
  • ASH vs ...
  • ASH weaknesses
  • Exceptions
  • Research methodology
  • Normalized area
  • Why C?
  • Splitting memory
  • More performance
  • Recursive calls
  • Nanotech and architecture

58
Reconfigurable Hardware
59
Main RH Ingredient RAM Cell
data in
0
control
Switch controlled by a 1-bit RAM cell
60
Pipeline Stage
ackout
C
rdyin
rdyout
ackin

D
Reg
datain
dataout
61
Critical Paths
x
b
0
if (x > 0) y = -x; else y = b*x;

-
gt
!
y
62
Lenient Operations
x
b
0
if (x > 0) y = -x; else y = b*x;

!
y
Solves the problem of unbalanced paths
63
Pipelining
i
1

100

lt
pipelined multiplier (8 stages)
sum
  • int sum=0, i;
  • for (i=0; i < 100; i++)
  • sum += i*i;
  • return sum;


step 1
64
Pipelining
i
1

100

lt
sum

step 2
65
Pipelining
i
1

100

lt
sum

step 3
66
Pipelining
i
1

100

lt
sum

step 4
67
Pipelining
i
1

100
i1
lt
i0
sum

step 5
68
Pipelining
i
1

100

i1
lt
i0
sum

step 6
69
Pipelining
i
1

100

lt
sum

step 7
70
Pipelining
i
1

100

critical path
lt
Predicate ack edge is on the critical path.
sum

71
Pipeline balancing
i
1

100

lt
decoupling FIFO
sum

step 7
72
Pipeline balancing
i
1

100

lt
critical path
i's loop
decoupling FIFO
sum
sum's loop

73
Process: 0.18 µm, 6 Al metal layers
Area: 49 mm²
Clock: 60MHz I/O, 120MHz internal
Power: < 4W
Stripes: 16 physical, 256 virtual
  • Compiler functional on first silicon
  • Licensed by two companies

74
Hardware Virtualization
Page out
compute
compute
compute
configure
Page in
Hardware
Overlap configuration with computation.
Configuration
75
PipeRench Hardware
ALU
ALU
data flow
Interconnection Network
Register
ALU
ALU
Interconnection Network
Register
ALU
ALU
76
Mapping Computation


gtgt
ltlt
concat
bit-shuffling
ltlt
substr
Network used for computation

77
Compiler-Controlled Clock
Register
Register




Network
Network
Register
Register




Slow clock
Fast clock
78
Time-Multiplexing Wires
1
2

1
2
?
4
3
4
3
One channel available for two wires
compute in even cycles
compute in odd cycles
79
Compilation Times (sec on PII/400)
80
Compilation Speed (PII/400)
81
Placed Circuit Utilization
82
PipeRench Performance
Speed-up vs. 300MHz UltraSparc
83
PipeRench Compiler Role
  • Classical optimizations
  • Partial evaluation
  • Data width inference (type inference)
  • Module generation (macro expansion)
  • Placement (VLIW scheduling)
  • Routing (irregular register allocation)
  • Network link multiplexing (spilling)
  • Clock-cycle management
  • Technology mapping ( instruction selection)
  • Code generation

84
HLL to HW
High-level Synthesis       Behavioral HDL        Synchronous Hardware
Reconfigurable Computing   C subsets             Hardware configuration (spatial computation)
Asynchronous circuits      Concurrent Language   Asynchronous Hardware
Prior work
This research
85
CASH vs High-Level Synthesis
  • CASH the only existing tool to translate
    complete ANSI C to hardware
  • CASH generates asynchronous circuits
  • CASH does not treat C as an HDL
  • no annotations required
  • no reactivity model
  • does not handle non-C, e.g., concurrency

86
ASH Weaknesses
  • Low efficiency for low-ILP code
  • Does not adapt at runtime
  • Monolithic memory
  • Resource waste
  • Not flexible
  • No support for exceptions

87
ASH Weaknesses (2)
  • Both branch and join not free
  • Static dataflow (no re-issue of same instr)
  • Memory is far
  • Fully static
  • No branch prediction
  • No dynamic unrolling
  • No register renaming
  • Calls/returns not lenient

88
Branch Prediction
i
1
  • for (i=0; i < N; i++)
  • ...
  • if (exception) break;

lt
exception
!

89
Research Methodology
Y (e.g., cost)
reasonable limits
state-of-the-art
X (e.g., power)
Constraint Space
90
Exceptions
  • Strictly speaking, C has no exceptions
  • In practice hard to accommodate exceptions in
    hardware implementations
  • An advantage of software flexibility: the PC is a
    single point of execution control

CPU
ASH
Low-ILP computation, OS, VM, exceptions
High-ILP computation

Memory
91
Why C
  • Huge installed base
  • Embedded specifications written in C
  • Small and simple language
  • Can leverage existing tools
  • Simpler compiler
  • Techniques generally applicable
  • Not a toy language

92
Performance
93
Parallelism Profile
94
Normalized Area
95
Memory Partitioning
  • MIT RAW project: Babb FCCM '99, Barua HiPC '00,
    Lee ASPLOS '00
  • Stanford SpC: Semeria DAC '01, TVLSI '02
  • Berkeley CCured: Necula POPL '02
  • Illinois FlexRAM: Fraguela PPoPP '03
  • Hand annotations: #pragma

96
Memory Complexity
RAM
LSQ
addr
data
97
Recursion
save live values
recursive call
restore live values
stack
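
One way to picture the save/call/restore sequence is a recursive function rewritten with an explicit stack. The sketch below computes 1 + 2 + ... + n that way; it is only an analogy in C, not the circuit CASH generates, and `STACK_MAX` and `sum_to` are hypothetical names.

```c
#include <stddef.h>

enum { STACK_MAX = 64 };

/* sum(n) = n + sum(n-1), sum(0) = 0, with the recursion made explicit.
   Assumes n <= STACK_MAX. */
unsigned sum_to(unsigned n)
{
    unsigned stack[STACK_MAX];
    size_t sp = 0;

    /* "call" phase: save the live value n before each recursive call */
    while (n > 0 && sp < STACK_MAX)
        stack[sp++] = n--;

    unsigned result = 0;             /* value returned by the base case */

    /* "return" phase: restore each saved n and finish its pending work */
    while (sp > 0)
        result += stack[--sp];

    return result;
}
```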
98
Nanotechnology and Architecture
99
Nanotechnology Implications
new architectures new compilers
new devices new manufacturing
100
CAEN
  • Study computer architecture implications of
    Chemically-Assembled Electronic Nanotechnology

101
No Complex Irregular Structures
102
Regular Substrate
Control
10^11 gates
103
High Defect Rate
104
Paradigm Shift
defects
Configuration
Executable
Complex fixed chip Program
Dense, regular structure Configuration
105
New Computer Architecture
CMOS                     Self-assembled circuits
Transistor               New molecular devices
Custom hardware          Reconfigurable hardware
Yield (defect) control   Defect tolerance through reconfiguration
Synchronous circuits     Asynchronous computation
Microprocessors          App-specific Hardware + CPU
106
Exploiting Nanotechnology
  • Nanotechnology
  • cheap
  • high-density
  • low-power
  • unreliable
  • Reconfigurable Computing
  • defect tolerant
  • high performance
  • low density

  • Computer architecture
  • vast body of knowledge
  • expensive
  • high-power

107
Research Convergence
Chemically-assembled electronic nanotechnology
Systems research issues
108
Venues