Title: Thesis idea evaluation - Automatic configuration of ASIP cores
Thesis idea evaluation - Automatic configuration of ASIP cores
- by Shobana Padmanabhan
- June 23, 2004
Introduction
- ASIP: a (parameterized) embedded soft core
- In between custom and general-purpose designs
- E.g. ArcCores, HP, Tensilica, LEON
- Advantages
  - Better application performance than a generic processor
  - Reuse of existing components
  - Lower cost compared to custom processors
- Goal is to get the fastest (minimum) runtime
Methodology considerations
- Customize per application domain, not per app
- Base architecture + customizations
- Customizations
  - Increased number of functional units, registers, memory accesses in parallel, depth of pipeline; possibly new instructions
- Avoid exhaustive simulation
  - The number of configurations is exponential
  - Simulating large data sets would be prohibitively time consuming
- Constraints
  - Limited FPGA area (cost, power constraints)
- Architectural parameters are not independent
Methodology considerations
- Evaluation of proposed methodology
  - Compare the resulting configuration and runtime with hand-optimized configurations of benchmarks
Approach 1 - Compiler-directed
- "Compiler-directed customization of ASIP cores" by Gupta (UMD), Ko (Cornell), Barua (UMD)
  - for the methodology
- "Processor evaluation in an embedded systems design environment" by Gupta, Sharma, Balakrishna (IIT Delhi), Malik (Princeton)
  - for details of the processor description language and architectural parameters
- "Predicting performance potential of modern DSPs" and "Retargetable estimation scheme for DSP architecture selection" by Ghazal, Newton, Rabaey (UC Berkeley)
  - use more advanced processor features and compiler optimizations
Methodology basic idea
- Start with a basic architecture
- Estimate application performance
- Now vary the architecture (< chip area) and find the best runtime
- To avoid (exhaustive) simulation
  - Estimate runtime for a given configuration
  - Use a profiler
  - When the configuration changes, re-compile but do not re-run
  - Change configuration, check area, and infer the new runtime
    - by using statistical data on the inter-dependence of parameters
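The loop above can be sketched as follows. This is a toy illustration of the idea, not the paper's tool: the block frequencies, per-block cycle estimates, and area numbers are all invented for the example.

```python
# Toy sketch of the exploration loop: profile the application once, then for
# each candidate configuration estimate runtime analytically (recompile +
# reschedule) instead of re-simulating. All numbers below are invented.

# Profile: basic-block execution frequencies, collected once.
BLOCK_FREQ = {"b0": 1000, "b1": 200, "b2": 50}

# Scheduler-predicted cycles per basic block, per configuration (toy values).
CYCLES = {
    "base":      {"b0": 4, "b1": 10, "b2": 6},
    "add_mac":   {"b0": 4, "b1": 7,  "b2": 6},
    "dual_port": {"b0": 3, "b1": 9,  "b2": 6},
}

# Estimated gate area per configuration (toy values).
AREA = {"base": 100, "add_mac": 130, "dual_port": 160}

def estimated_runtime(cfg):
    """Profile-weighted sum of per-block cycle estimates (no simulation)."""
    return sum(BLOCK_FREQ[b] * c for b, c in CYCLES[cfg].items())

def explore(area_budget):
    """Pick the configuration with the lowest estimated runtime within budget."""
    feasible = [c for c in AREA if AREA[c] <= area_budget]
    return min(feasible, key=estimated_runtime)

print(explore(area_budget=140))  # "add_mac": "dual_port" exceeds the budget
```

Note how the area constraint changes the answer: with a larger budget (e.g. 200), the faster but bigger `dual_port` configuration wins instead.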
Approach
[Block diagram, built up over several slides: the App feeds a Profiler and a Retargetable performance estimator; these, together with the base architecture, the space of proposed parameters, and the area estimates / budget, drive an Architecture exploration engine that outputs the optimal architectural parameters.]
Performance estimator
- runtime = Σ (profile-collected basic block frequency) × (scheduler-predicted runtime of that block)
- Basic blocks obtained by
  - converting to an internal format (Stanford University IF, which provides libraries to extract such info)
- Execution frequency of each basic block obtained by
  - a compiler-inserted instruction that increments a global variable for each basic block
- Number of clock cycles
  - A scheduler schedules each basic block to derive its execution time on the processor (taking into account all parameters)
  - A processor description is needed for this, and a language was developed (context-free grammar)
- The scheduler combines this time with the frequencies of the basic blocks to estimate the overall runtime
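The profiling scheme can be illustrated with a small sketch: a counter increment stands in for the compiler-inserted instruction at each basic-block entry, and the counts then weight the scheduler's per-block cycle estimates. The block names and cycle numbers are invented for the example.

```python
# Toy illustration of profile-based runtime estimation: the "compiler" inserts
# a counter increment at the top of every basic block, then block frequencies
# weight the scheduler-predicted cycles per block.

from collections import Counter

block_count = Counter()  # the "global variables" incremented by instrumentation

def bb(name):
    """Stand-in for the compiler-inserted increment at a basic-block entry."""
    block_count[name] += 1

def instrumented_program(n):
    bb("entry")
    total = 0
    for i in range(n):
        bb("loop_body")
        total += i
    bb("exit")
    return total

instrumented_program(10)

# Combine frequencies with scheduler-predicted cycles per block (toy numbers).
cycles_per_block = {"entry": 2, "loop_body": 5, "exit": 1}
runtime = sum(block_count[b] * cycles_per_block[b] for b in cycles_per_block)
print(runtime)  # 2*1 + 5*10 + 1*1 = 53
```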
Performance estimation, more formally
- Derive a runtime vs. parameter curve for each parameter (just recompile for every parameter)
- Runtime = Σ (profile-collected basic block frequency) × (scheduler-predicted runtime of that block)
- Runtime_function(pi) = (runtime for pi) / (base runtime)
Area estimation, formally
- Obtain an area vs. parameter curve for every parameter
- Area_function(pi) = additional gate area for pi
Retargetable performance estimator
- Profiler
  - Computes execution frequencies of each basic block
  - A compiler-inserted instruction increments a global variable for this
- Data flow graph builder, for scheduling
  - A directed acyclic graph for a basic block captures all dependencies (blocks execute in sequence; within a block, operations may execute in parallel)
  - Priority of an operation is based on the height of that operation in the dependency graph
- Fine-grain scheduler estimates the number of clock cycles, taking into account the different architecture parameters
  - Schedules each basic block to derive its execution time on the processor
  - Combines this with the frequencies to estimate the overall runtime
  - List scheduling is a greedy method that chooses the next instruction in the DAG in order of priority (longer critical paths have higher priority)
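A minimal list scheduler in the spirit described above might look like this. It is a sketch, not the paper's scheduler: the DAG, the issue-slot count, and the uniform one-cycle latency are invented simplifications.

```python
# Minimal list scheduling over a basic-block DAG: priority = height in the
# dependency graph (longer critical path first); each cycle issues up to
# `slots` ready operations. Toy DAG; all operations take one cycle.

# op -> list of ops it depends on.
DEPS = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}

def heights(deps):
    """Height of each op: 1 + max height over its successors in the DAG."""
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    h = {}
    def height(op):
        if op not in h:
            h[op] = 1 + max((height(s) for s in succs[op]), default=0)
        return h[op]
    for op in deps:
        height(op)
    return h

def list_schedule(deps, slots):
    """Greedy: each cycle, issue up to `slots` ready ops, highest height first."""
    h = heights(deps)
    done, schedule = set(), []
    while len(done) < len(deps):
        ready = [op for op in deps
                 if op not in done and all(p in done for p in deps[op])]
        issue = sorted(ready, key=lambda op: -h[op])[:slots]
        schedule.append(issue)
        done.update(issue)
    return schedule

print(len(list_schedule(DEPS, slots=2)))  # 3 cycles with 2 issue slots
print(len(list_schedule(DEPS, slots=1)))  # 5 cycles with 1 issue slot
```

This is exactly the knob the estimator turns: changing an architecture parameter (here, the number of issue slots) changes the schedule length, and hence the predicted cycles for the block, without any simulation.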
Retargetable performance estimator
- Assumptions
  - All operations operate on operands in registers
  - Address computations for array instructions are carried out by inserting explicit address computation instructions
The processor description language
- Can express most embedded VLIW processors
  - Functional units in the data path, with their operations and corresponding latencies / delays
  - Constraints in terms of operation slots / slot restrictions
  - Number of registers, write buses, ports in memory
  - Delay of branch operations
  - Concurrent load/store operations
  - Final operation delay = (delay of functional unit) + (delay of operation)
Architecture exploration engine
- Chooses optimal parameter values: a constrained optimization problem
  - Sum of all area_functions < area_budget
- If parameters were independent, pred_runtime = product of runtime for every parameter
- Since they are not, pred_runtime = (product of runtime for every parameter) / dependence_constant(p1, …, pn)
  - where dependence_constant is:
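The prediction formula can be sketched directly. The 0.939 runtime factors and the 0.95 dependence constant are the figures quoted later in this deck for the dual-ported / pipelined memory combination; the area numbers and budget are invented for illustration.

```python
# Sketch of pred_runtime = (product of runtime_functions) / dependence_constant,
# subject to the area budget. Runtime factors and the 0.95 constant are from
# the deck's memory-parallelism example; area figures are invented.

import math

runtime_function = {"dual_port": 0.939, "pipelined_mem": 0.939}  # normalized
area_function    = {"dual_port": 40, "pipelined_mem": 25}        # extra gates (toy)
AREA_BUDGET = 80

def pred_runtime(params, dependence_constant, base_runtime=1.0):
    """Predicted runtime for a combination of parameters, area-checked."""
    assert sum(area_function[p] for p in params) < AREA_BUDGET
    gain = math.prod(runtime_function[p] for p in params)
    return base_runtime * gain / dependence_constant

# The dependence constant discounts the combined gain, since both parameters
# target the same memory parallelism.
print(round(pred_runtime(["dual_port", "pipelined_mem"], 0.95), 3))  # 0.928
```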
Interdependence of parameters
- dependence_constant is a heuristic, for every combination of parameters, that adjusts the gain for that combination
- obtained by a one-time, exhaustive simulation of standard benchmarks for a combination of parameters
- Dependence_constant(p1, …, pn)
  - = 1 when pi = basei for all i
  - = 1 when pj ≠ basej and pi = basei for all i ≠ j
  - = (product of all runtime_functions) / (actual runtime for that combination), otherwise
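The piecewise definition above can be rendered as a small function. The runtime factors (0.939) and the measured combined runtime (0.927) are the figures quoted later in this deck; the parameter encoding is an invented illustration.

```python
# The dependence constant, per the piecewise definition: 1 when at most one
# parameter deviates from its base value; otherwise the ratio of the
# independent prediction to the benchmark-measured runtime for that combo.

import math

def dependence_constant(params, base, runtime_function, actual_runtime):
    """params, base: {name: value}. actual_runtime keyed by the changed set."""
    changed = [p for p in params if params[p] != base[p]]
    if len(changed) <= 1:
        return 1.0  # base config, or a single-parameter change
    predicted = math.prod(runtime_function[p] for p in changed)
    return predicted / actual_runtime[frozenset(changed)]

# Figures from the deck's memory-parallelism example:
rf = {"dual_port": 0.939, "pipelined_mem": 0.939}
actual = {frozenset({"dual_port", "pipelined_mem"}): 0.927}
base = {"dual_port": 0, "pipelined_mem": 0}

print(round(dependence_constant({"dual_port": 1, "pipelined_mem": 1},
                                base, rf, actual), 2))  # 0.95
```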
Evaluated parameters
- On the Philips TriMedia VLIW processor
  - Presence or absence of a MAC unit
  - HW/SW floating point
  - Single- or dual-ported memory for parallel memory operations
  - Pipelined or non-pipelined memory unit
Other customizable parameters
- Register file size
- Number of architectural clusters
- Number and nature of functional units
- Presence of an address generation unit
- Optimized special operations
- Multi-operation patterns
- Memory data packing/unpacking support
- Memory addressing support
- Control-flow support
- Loop-level optimizations
- Loop-level optimized patterns
- Loop vectorization
- Architecture-independent optimization
For DSP applications
- Functional unit composition
- Ignore cache misses, branch mis-predictions, separation of register files (or functional unit banks), register allocation conflicts
- Register casting, if data-dependency interlocks exist in the architecture
Performance gain from INDIVIDUAL parameters
- Runtime_function for each benchmark / application, for each of the chosen parameters: MAC, FPU, dual-ported memory, pipelined memory
Figure from Gupta et al.
Performance gain from COMBINED parameters
- Runtime_function for each benchmark / application, for selected combinations of the chosen parameters
Figure from Gupta et al.
Dependence constants for the combinations
Figure from Gupta et al.
(DSP) FFT benchmark
Figure from Gupta et al.
Results
- Performance estimation error: 2.5%
- Recommended configuration same as hand-optimized
Profile: use app parameters to eliminate a processor or processor configuration
App parameters → relevant processor features
- Average block size → (acceptable) branch penalty
- Number of multiply-accumulate operations → MAC unit
- Ratio of address computation instructions to data computation instructions → separate address generation ALU
- Ratio of I/O instructions to total instructions → memory bandwidth requirements
- Average arc length in the data flow graph → total number of registers
- Unconstrained ASAP scheduler results → operation concurrency and a lower bound on performance
- Assumptions
  - In the average block size module, instructions associated with condition-code evaluation of conditional structures and loops are ignored
  - Each array instruction contributes to the total twice the number of its dimensions
  - Array accesses are assumed to point to data in memory
Related work
- Related work evaluates exhaustively or in isolation; no cost-area analysis
- Commercial soft cores
  - User optimizes instruction set, addressing modes, sizes of internal memory banks; tool estimates area
- Gong et al.
  - Performance analyzer evaluates machine parallelism, number of buses, connectivity, memory ports; does not account for dependency
- Ghazal et al.
  - Predict runtime for advanced processor features and compiler optimizations such as optimized special operations, memory addressing support, control-flow support, and loop-level optimization support
- Gupta et al.
  - Analyze the application to select a processor; no quantification of features; performance estimation through exhaustive simulation
- Kuulusa et al., Herbert et al., Shackleford et al.
  - Tools for architecture exploration by exhaustive search; evaluate instruction extensions
- Custom-fit processors
  - Also exhaustive search, but targets a VLIW architecture: changeable memory sizes, register sizes, kinds and latencies of functional units, and clustered machines; speedup/cost graphs are derived for all combinations, yielding Pareto points
Other related papers
- Kuulusa et al., Herbert et al., Shackleford et al. evaluate extensions to the instruction set
- "Managing multi-configuration hardware via dynamic working set analysis" by Dhodapkar, Smith (Wisconsin)
- "Reconfigurable custom computing as a supercomputer replacement" by Milne (University of South Australia)
Discussion
Dependence constant
- A dependence_constant close to 1 does not imply independence, because
  - the set of parameters the constant models may impact only a fraction of the instructions in a program
- E.g. dual-ported memory and pipelined memory both target memory parallelism
  - Results showed that they were highly dependent
  - Gain from both: 7.3%
  - Gain from each: 6.1%
  - Dependence constant = (0.939 × 0.939) / 0.927 ≈ 0.95
  - Close to one despite the parameters being highly dependent
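An Amdahl-style check makes the point concrete: when the parameters only touch a fraction f of the runtime, the dependence constant stays near 1 even if their gains are almost fully redundant. The fraction and speedup factors below are invented toy numbers, chosen to land near the deck's figures.

```python
# Toy check: parameters that affect only a fraction f of the runtime keep the
# dependence constant near 1 even under heavy dependence. Numbers invented.

def runtime_fn(f, speed):
    """Normalized runtime when only fraction f is sped up by factor `speed`."""
    return 1 - f + f * speed

f = 0.2                       # memory-bound fraction of runtime (toy)
each = runtime_fn(f, 0.7)     # either parameter alone
both = runtime_fn(f, 0.64)    # both together: gains largely redundant
const = each * each / both    # dependence constant for the pair

print(round(each, 3), round(both, 3), round(const, 3))  # 0.94 0.928 0.952
```

Despite the two parameters sharing almost all of their gain, the constant lands around 0.95, mirroring the 0.939 / 0.927 / 0.95 figures above.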
3. Processor evaluation for ASIP changes - Gupta et al., IIT Delhi
- Modifications may include
  - New instructions; increased number of functional units, registers, memory accesses in parallel, depth of pipeline
- Convert the application and the processor architecture to intermediate representations (so that they can be evaluated)
  - Application: Stanford Univ. Intermediate Format, which provides libraries to extract info
    - E.g. the average number of instructions that can be executed in parallel; this indicates concurrency in the app and, with performance specifications, becomes a requirement for the processor
  - Architectural characteristics: custom description format
- The estimator now estimates code size and runtime
  - The system partitioner can now estimate s/w performance
Overview
- Static profile info analysis, using a math model and no running
- Describe the processor using a processor description language (context-free grammar)
- Feed this description to the instruction scheduler in the compiler for scheduling, i.e. performance estimation
- By predicting performance for a configuration with one CPU enhancement, infer the other configurations using dependence constants
  - (statistical data on the inter-dependence of the given parameters)
- Parameters (Philips TriMedia VLIW processor)
  - Include or exclude MAC
  - HW/SW floating point
  - Single- or dual-ported memory for parallel memory operations
  - Pipelined or non-pipelined memory unit
- Performance estimation error: 2.5%
- Recommended configuration same as brute-force exhaustive simulation
Methodology
- Performance analyzer
  - Avoid exhaustive simulation, as the number of configurations is exponential
  - Simulating large data sets would be prohibitively time consuming
- Architecture optimizer
  - FPGA / chip constraints: resources, area (cost, power constraints)
  - Architectural parameters are not independent
- Evaluation of proposed methodology
  - Compare the resulting configuration and runtime with a hand-optimized configuration