Title: Thesis idea evaluation - Automatic configuration of ASIP cores
Thesis idea evaluation - Automatic configuration of ASIP cores
- by Shobana Padmanabhan
- June 23, 2004
Introduction
- ASIP: a (parameterized) embedded soft core
- In between custom and general-purpose designs
- E.g. ArcCores, HP, Tensilica, LEON
- Advantages
  - Better application performance than a generic processor
  - Reuse of existing components
  - Lower cost compared to custom processors
- Goal is to get the fastest (minimum) runtime
Methodology considerations
- Customize per application domain, not per app
- Base architecture + customizations
- Customizations
  - Increased number of functional units, registers, memory accesses in parallel, depth of pipeline; possibly new instructions
- Avoid exhaustive simulation
  - The number of configurations is exponential
  - Simulating large data sets would be prohibitively time consuming
- Constraints
  - Limited FPGA area (cost, power constraints)
- Architectural parameters are not independent
Methodology considerations
- Evaluation of proposed methodology
  - Compare the resulting configuration and runtime with hand-optimized configurations of benchmarks
Approach 1 - Compiler-directed
- "Compiler-directed customization of ASIP cores" by Gupta (UMD), Ko (Cornell), Barua (UMD)
  - for the methodology
- "Processor evaluation in an embedded systems design environment" by Gupta, Sharma, Balakrishna (IIT Delhi), Malik (Princeton)
  - for details of the processor description language and architectural parameters
- "Predicting performance potential of modern DSPs" and "Retargetable estimation scheme for DSP architecture selection" by Ghazal, Newton, Rabaey (UC Berkeley)
  - use more advanced processor features and compiler optimizations
Methodology basic idea
- Start with a basic architecture
- Estimate application performance
- Now vary the architecture (< chip area) and find the best runtime
- To avoid (exhaustive) simulation
  - Estimate runtime for a given configuration
  - Use a profiler
  - When the configuration changes, re-compile but do not re-run
  - Change configuration, check area, and infer the new runtime
    - by using statistical data on the inter-dependence of parameters
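The loop above can be sketched as follows. This is a toy illustration of the idea, not the paper's tool: the block frequencies, per-block cycle estimates, and area numbers are all invented for the example.

```python
# Toy sketch of the exploration loop: profile the application once, then for
# each candidate configuration estimate runtime analytically (recompile +
# reschedule) instead of re-simulating. All numbers below are invented.

# Profile: basic-block execution frequencies, collected once.
BLOCK_FREQ = {"b0": 1000, "b1": 200, "b2": 50}

# Scheduler-predicted cycles per basic block, per configuration (toy values).
CYCLES = {
    "base":      {"b0": 4, "b1": 10, "b2": 6},
    "add_mac":   {"b0": 4, "b1": 7,  "b2": 6},
    "dual_port": {"b0": 3, "b1": 9,  "b2": 6},
}

# Estimated gate area per configuration (toy values).
AREA = {"base": 100, "add_mac": 130, "dual_port": 160}

def estimated_runtime(cfg):
    """Profile-weighted sum of per-block cycle estimates (no simulation)."""
    return sum(BLOCK_FREQ[b] * c for b, c in CYCLES[cfg].items())

def explore(area_budget):
    """Pick the configuration with the lowest estimated runtime within budget."""
    feasible = [c for c in AREA if AREA[c] <= area_budget]
    return min(feasible, key=estimated_runtime)

print(explore(area_budget=140))  # "add_mac": "dual_port" exceeds the budget
```

Note how the area constraint changes the answer: with a larger budget (e.g. 200), the faster but bigger `dual_port` configuration wins instead.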
Approach
[Block diagram, built up over several slides: the App feeds a Profiler and a Retargetable performance estimator; these, together with the base architecture, the space of proposed parameters, and the area estimates / budget, drive an Architecture exploration engine that outputs the optimal architectural parameters.]
Performance estimator
- runtime = Σ (profile-collected basic block frequency) × (scheduler-predicted runtime of that block)
- Basic blocks obtained by
  - converting to an internal format (Stanford University IF, which provides libraries to extract such info)
- Execution frequency of each basic block obtained by
  - a compiler-inserted instruction that increments a global variable for each basic block
- Number of clock cycles
  - A scheduler schedules each basic block to derive its execution time on the processor (taking into account all parameters)
  - A processor description is needed for this, and a language was developed (context-free grammar)
- The scheduler combines this time with the frequencies of the basic blocks to estimate the overall runtime
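The profiling scheme can be illustrated with a small sketch: a counter increment stands in for the compiler-inserted instruction at each basic-block entry, and the counts then weight the scheduler's per-block cycle estimates. The block names and cycle numbers are invented for the example.

```python
# Toy illustration of profile-based runtime estimation: the "compiler" inserts
# a counter increment at the top of every basic block, then block frequencies
# weight the scheduler-predicted cycles per block.

from collections import Counter

block_count = Counter()  # the "global variables" incremented by instrumentation

def bb(name):
    """Stand-in for the compiler-inserted increment at a basic-block entry."""
    block_count[name] += 1

def instrumented_program(n):
    bb("entry")
    total = 0
    for i in range(n):
        bb("loop_body")
        total += i
    bb("exit")
    return total

instrumented_program(10)

# Combine frequencies with scheduler-predicted cycles per block (toy numbers).
cycles_per_block = {"entry": 2, "loop_body": 5, "exit": 1}
runtime = sum(block_count[b] * cycles_per_block[b] for b in cycles_per_block)
print(runtime)  # 2*1 + 5*10 + 1*1 = 53
```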
Performance estimation, more formally
- Derive a runtime vs. parameter curve for each parameter (just recompile for every parameter)
- Runtime = Σ (profile-collected basic block frequency) × (scheduler-predicted runtime of that block)
- Runtime_function(pi) = (runtime for pi) / (base runtime)
Area estimation, formally
- Obtain an area vs. parameter curve for every parameter
- Area_function(pi) = additional gate area for pi
Retargetable performance estimator
- Profiler
  - Computes execution frequencies of each basic block
  - A compiler-inserted instruction increments a global variable for this
- Data flow graph builder, for scheduling
  - A directed acyclic graph for a basic block captures all dependencies (blocks execute in sequence; within a block, operations may execute in parallel)
  - Priority of an operation is based on the height of that operation in the dependency graph
- Fine-grain scheduler estimates the number of clock cycles, taking into account the different architecture parameters
  - Schedules each basic block to derive its execution time on the processor
  - Combines this with the frequencies to estimate the overall runtime
  - List scheduling is a greedy method that chooses the next instruction in the DAG in order of priority (longer critical paths have higher priority)
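A minimal list scheduler in the spirit described above might look like this. It is a sketch, not the paper's scheduler: the DAG, the issue-slot count, and the uniform one-cycle latency are invented simplifications.

```python
# Minimal list scheduling over a basic-block DAG: priority = height in the
# dependency graph (longer critical path first); each cycle issues up to
# `slots` ready operations. Toy DAG; all operations take one cycle.

# op -> list of ops it depends on.
DEPS = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}

def heights(deps):
    """Height of each op: 1 + max height over its successors in the DAG."""
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    h = {}
    def height(op):
        if op not in h:
            h[op] = 1 + max((height(s) for s in succs[op]), default=0)
        return h[op]
    for op in deps:
        height(op)
    return h

def list_schedule(deps, slots):
    """Greedy: each cycle, issue up to `slots` ready ops, highest height first."""
    h = heights(deps)
    done, schedule = set(), []
    while len(done) < len(deps):
        ready = [op for op in deps
                 if op not in done and all(p in done for p in deps[op])]
        issue = sorted(ready, key=lambda op: -h[op])[:slots]
        schedule.append(issue)
        done.update(issue)
    return schedule

print(len(list_schedule(DEPS, slots=2)))  # 3 cycles with 2 issue slots
print(len(list_schedule(DEPS, slots=1)))  # 5 cycles with 1 issue slot
```

This is exactly the knob the estimator turns: changing an architecture parameter (here, the number of issue slots) changes the schedule length, and hence the predicted cycles for the block, without any simulation.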
Retargetable performance estimator
- Assumptions
  - All operations operate on operands in registers
  - Address computations for array instructions are carried out by inserting explicit address computation instructions
The processor description language
- Can express most embedded VLIW processors
  - Functional units in the data path, with their operations and corresponding latencies / delays
  - Constraints in terms of operation slots / slot restrictions
  - Number of registers, write buses, ports in memory
  - Delay of branch operations
  - Concurrent load/store operations
  - Final operation delay = (delay of functional unit) + (delay of operation)
Architecture exploration engine
- Chooses optimal parameter values: a constrained optimization problem
  - Sum of all area_functions < area_budget
- If parameters were independent, pred_runtime = product of runtime for every parameter
- Since they are not, pred_runtime = (product of runtime for every parameter) / dependence_constant(p1, …, pn)
  - where dependence_constant is:
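The prediction formula can be sketched directly. The 0.939 runtime factors and the 0.95 dependence constant are the figures quoted later in this deck for the dual-ported / pipelined memory combination; the area numbers and budget are invented for illustration.

```python
# Sketch of pred_runtime = (product of runtime_functions) / dependence_constant,
# subject to the area budget. Runtime factors and the 0.95 constant are from
# the deck's memory-parallelism example; area figures are invented.

import math

runtime_function = {"dual_port": 0.939, "pipelined_mem": 0.939}  # normalized
area_function    = {"dual_port": 40, "pipelined_mem": 25}        # extra gates (toy)
AREA_BUDGET = 80

def pred_runtime(params, dependence_constant, base_runtime=1.0):
    """Predicted runtime for a combination of parameters, area-checked."""
    assert sum(area_function[p] for p in params) < AREA_BUDGET
    gain = math.prod(runtime_function[p] for p in params)
    return base_runtime * gain / dependence_constant

# The dependence constant discounts the combined gain, since both parameters
# target the same memory parallelism.
print(round(pred_runtime(["dual_port", "pipelined_mem"], 0.95), 3))  # 0.928
```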
Interdependence of parameters
- dependence_constant is a heuristic, for every combination of parameters, that adjusts the gain for that combination
- obtained by a one-time, exhaustive simulation of standard benchmarks for a combination of parameters
- Dependence_constant(p1, …, pn)
  - = 1 when pi = basei for all i
  - = 1 when pj ≠ basej and pi = basei for all i ≠ j
  - = (product of all runtime_functions) / (actual runtime for that combination), otherwise
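The piecewise definition above can be rendered as a small function. The runtime factors (0.939) and the measured combined runtime (0.927) are the figures quoted later in this deck; the parameter encoding is an invented illustration.

```python
# The dependence constant, per the piecewise definition: 1 when at most one
# parameter deviates from its base value; otherwise the ratio of the
# independent prediction to the benchmark-measured runtime for that combo.

import math

def dependence_constant(params, base, runtime_function, actual_runtime):
    """params, base: {name: value}. actual_runtime keyed by the changed set."""
    changed = [p for p in params if params[p] != base[p]]
    if len(changed) <= 1:
        return 1.0  # base config, or a single-parameter change
    predicted = math.prod(runtime_function[p] for p in changed)
    return predicted / actual_runtime[frozenset(changed)]

# Figures from the deck's memory-parallelism example:
rf = {"dual_port": 0.939, "pipelined_mem": 0.939}
actual = {frozenset({"dual_port", "pipelined_mem"}): 0.927}
base = {"dual_port": 0, "pipelined_mem": 0}

print(round(dependence_constant({"dual_port": 1, "pipelined_mem": 1},
                                base, rf, actual), 2))  # 0.95
```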
Evaluated parameters
- On the Philips TriMedia VLIW processor
  - Presence or absence of a MAC unit
  - HW/SW floating point
  - Single- or dual-ported memory for parallel memory operations
  - Pipelined or non-pipelined memory unit
Other customizable parameters
- Register file size
- Number of architectural clusters
- Number and nature of functional units
- Presence of an address generation unit
- Optimized special operations
- Multi-operation patterns
- Memory data packing/unpacking support
- Memory addressing support
- Control-flow support
- Loop-level optimizations
- Loop-level optimized patterns
- Loop vectorization
- Architecture-independent optimization
For DSP applications
- Functional unit composition
- Ignore cache misses, branch mis-predictions, separation of register files (or functional unit banks), register allocation conflicts
- Register casting, if data-dependency interlocks exist in the architecture
Performance gain from INDIVIDUAL parameters
- Runtime_function for each benchmark / application, for each of the chosen parameters: MAC, FPU, dual-ported memory, pipelined memory
Figure from Gupta et al.
Performance gain from COMBINED parameters
- Runtime_function for each benchmark / application, for selected combinations of the chosen parameters
Figure from Gupta et al.
Dependence constants for the combinations
Figure from Gupta et al.
(DSP) FFT benchmark
Figure from Gupta et al.
Results
- Performance estimation error: 2.5%
- Recommended configuration same as hand-optimized
Profile: use app parameters to eliminate a processor or processor configuration
App parameters → relevant processor features
- Average block size → (acceptable) branch penalty
- Number of multiply-accumulate operations → MAC unit
- Ratio of address computation instructions to data computation instructions → separate address generation ALU
- Ratio of I/O instructions to total instructions → memory bandwidth requirements
- Average arc length in the data flow graph → total number of registers
- Unconstrained ASAP scheduler results → operation concurrency and a lower bound on performance
- Assumptions
  - In the average block size module, instructions associated with condition-code evaluation of conditional structures and loops are ignored
  - Each array instruction contributes to the total twice the number of its dimensions
  - Array accesses are assumed to point to data in memory
Related work
- Related work evaluates exhaustively or in isolation; no cost-area analysis
- Commercial soft cores
  - User optimizes instruction set, addressing modes, sizes of internal memory banks; tool estimates area
- Gong et al.
  - Performance analyzer evaluates machine parallelism, number of buses, connectivity, memory ports; does not account for dependency
- Ghazal et al.
  - Predict runtime for advanced processor features and compiler optimizations such as optimized special operations, memory addressing support, control-flow support, and loop-level optimization support
- Gupta et al.
  - Analyze the application to select a processor; no quantification of features; performance estimation through exhaustive simulation
- Kuulusa et al., Herbert et al., Shackleford et al.
  - Tools for architecture exploration by exhaustive search; evaluate instruction extensions
- Custom-fit processors
  - Also exhaustive search, but targets a VLIW architecture: changeable memory sizes, register sizes, kinds and latencies of functional units, and clustered machines; speedup/cost graphs are derived for all combinations, yielding Pareto points
Other related papers
- Kuulusa et al., Herbert et al., Shackleford et al. evaluate extensions to the instruction set
- "Managing multi-configuration hardware via dynamic working set analysis" by Dhodapkar, Smith (Wisconsin)
- "Reconfigurable custom computing as a supercomputer replacement" by Milne (University of South Australia)
Discussion
Dependence constant
- A dependence_constant close to 1 does not imply independence, because
  - the set of parameters the constant models may impact only a fraction of the instructions in a program
- E.g. dual-ported memory and pipelined memory both target memory parallelism
  - Results showed that they were highly dependent
  - Gain from both: 7.3%
  - Gain from each: 6.1%
  - Dependence constant = (0.939 × 0.939) / 0.927 ≈ 0.95
  - Close to one despite the parameters being highly dependent
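An Amdahl-style check makes the point concrete: when the parameters only touch a fraction f of the runtime, the dependence constant stays near 1 even if their gains are almost fully redundant. The fraction and speedup factors below are invented toy numbers, chosen to land near the deck's figures.

```python
# Toy check: parameters that affect only a fraction f of the runtime keep the
# dependence constant near 1 even under heavy dependence. Numbers invented.

def runtime_fn(f, speed):
    """Normalized runtime when only fraction f is sped up by factor `speed`."""
    return 1 - f + f * speed

f = 0.2                       # memory-bound fraction of runtime (toy)
each = runtime_fn(f, 0.7)     # either parameter alone
both = runtime_fn(f, 0.64)    # both together: gains largely redundant
const = each * each / both    # dependence constant for the pair

print(round(each, 3), round(both, 3), round(const, 3))  # 0.94 0.928 0.952
```

Despite the two parameters sharing almost all of their gain, the constant lands around 0.95, mirroring the 0.939 / 0.927 / 0.95 figures above.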
3. Processor evaluation for ASIP changes - Gupta et al., IIT Delhi
- Modifications may include
  - New instructions; increased number of functional units, registers, memory accesses in parallel, depth of pipeline
- Convert the application and the processor architecture to intermediate representations (so that they can be evaluated)
  - Application: Stanford Univ. Intermediate Format, which provides libraries to extract info
    - E.g. the average number of instructions that can be executed in parallel; this indicates concurrency in the app and, with performance specifications, becomes a requirement for the processor
  - Architectural characteristics: custom description format
- The estimator now estimates code size and runtime
  - The system partitioner can now estimate s/w performance
Overview
- Static profile info analysis, using a math model and no running
- Describe the processor using a processor description language (context-free grammar)
- Feed this description to the instruction scheduler in the compiler for scheduling, i.e. performance estimation
- By predicting performance for a configuration with one CPU enhancement, infer the other configurations using dependence constants
  - (statistical data on the inter-dependence of the given parameters)
- Parameters (Philips TriMedia VLIW processor)
  - Include or exclude MAC
  - HW/SW floating point
  - Single- or dual-ported memory for parallel memory operations
  - Pipelined or non-pipelined memory unit
- Performance estimation error: 2.5%
- Recommended configuration same as brute-force exhaustive simulation
Methodology
- Performance analyzer
  - Avoid exhaustive simulation, as the number of configurations is exponential
  - Simulating large data sets would be prohibitively time consuming
- Architecture optimizer
  - FPGA / chip constraints: resources, area (cost, power constraints)
  - Architectural parameters are not independent
- Evaluation of proposed methodology
  - Compare the resulting configuration and runtime with a hand-optimized configuration