Thesis idea evaluation - Automatic configuration of ASIP cores


1
Thesis idea evaluation - Automatic configuration
of ASIP cores
  • by
  • Shobana Padmanabhan
  • June 23, 2004

2
Introduction
  • ASIP (parameterized) embedded soft core
  • In between custom and general-purpose designs
  • E.g. ArcCores, HP, Tensilica, LEON
  • Advantages
  • Better application performance than a generic
    processor
  • Reuse existing components
  • Lower cost compared to custom processors
  • Goal is the fastest execution, i.e. minimum
    runtime

7
Methodology considerations
  • Customize per application domain, not app
  • Base architecture customizations
  • Customizations
  • Increased # of functional units, registers,
    memory accesses in parallel, depth of pipeline,
    possibly new instructions
  • Avoid exhaustive simulation
  • As the number of configurations is exponential
  • Simulating large data sets would be prohibitively
    time consuming
  • Constraints
  • FPGA limited area (cost, power constraints)
  • Architectural parameters are not independent

8
Methodology considerations
  • Evaluation of proposed methodology
  • Compare the resulting configuration and runtime
    with hand-optimized configuration of benchmarks

9
Approach 1 - Compiler directed
  • Compiler-directed customization of ASIP cores
  • by Gupta (UMD), Ko (Cornell), Barua (UMD)
  • for the methodology
  • Processor evaluation in an embedded systems
    design environment
  • by Gupta, Sharma, Balakrishna (IIT Delhi), Malik
    (Princeton)
  • for details of Processor description language and
    architectural parameters
  • Predicting performance potential of modern DSPs,
  • Retargetable estimation scheme for DSP
    architecture selection
  • by Ghazal, Newton, Rabaey UC Berkeley
  • use more advanced processor features and compiler
    optimizations

10
Methodology basic idea
  • Start with basic architecture
  • Estimate application performance
  • Now, vary architecture (< chip area) and find
    the best runtime
  • To avoid (exhaustive) simulation
  • Estimate runtime for a given configuration
  • Use a profiler
  • When the configuration changes, re-compile and
    not re-run
  • Change configuration, check area and infer new
    runtime
  • By using statistical data on inter-dependence of
    parameters

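The search loop sketched on this slide can be illustrated as follows. This is a minimal Python sketch; the parameter space and the area/runtime models (`PARAM_SPACE`, `estimate_area`, `estimate_runtime`) are made-up stand-ins for the paper's estimators, not the authors' tool:

```python
from itertools import product

# Hypothetical parameter space: each parameter with candidate values.
PARAM_SPACE = {
    "functional_units": [1, 2, 4],
    "registers": [16, 32, 64],
    "memory_ports": [1, 2],
}

def estimate_area(config):
    # Stand-in area model: weighted sum of parameter values.
    return (10 * config["functional_units"]
            + config["registers"]
            + 20 * config["memory_ports"])

def estimate_runtime(config):
    # Stand-in runtime model: fewer resources -> more cycles.
    return (1000.0 / (config["functional_units"] * config["memory_ports"])
            + 500.0 / config["registers"])

def explore(area_budget):
    """Pick the configuration with the best estimated runtime
    among those that fit the area budget (no simulation)."""
    best = None
    for values in product(*PARAM_SPACE.values()):
        config = dict(zip(PARAM_SPACE.keys(), values))
        if estimate_area(config) > area_budget:
            continue  # violates the chip-area constraint
        runtime = estimate_runtime(config)
        if best is None or runtime < best[1]:
            best = (config, runtime)
    return best
```

The point of the sketch is the shape of the loop: every candidate is evaluated with a cheap estimator rather than a simulation, and area acts as a hard constraint while runtime is the objective.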
17
Approach (block diagram)
  • Components: App, Profiler, Retargetable
    performance estimator, Architecture exploration
    engine
  • Inputs: base arch / space of proposed parameters;
    area estimates / budget
  • Output: optimal architectural parameters

22
Performance estimator
  • runtime = Σ (profile-collected basic block
    frequency) × (scheduler-predicted runtime of
    that block)
  • Basic blocks (i.e. sets of instructions that can
    be executed in parallel) identified by
  • converting to an internal format (Stanford
    University IF, which provides libraries to
    extract such info)
  • Execution frequencies of each basic block by
  • A compiler-inserted instruction increments a
    global variable for each basic block
  • Number of clock cycles
  • A scheduler schedules each basic block to derive
    execution time on the processor (taking into
    account all parameters)
  • A processor description is needed for this and a
    language was developed (context free grammar)
  • Scheduler combines this time, with frequencies of
    basic blocks, to estimate overall runtime

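The estimate above, profiled block frequencies times scheduler-predicted block runtimes, is a weighted sum. A minimal sketch with made-up frequencies and cycle counts:

```python
# Execution frequency of each basic block, as collected by the
# compiler-inserted counters (one global counter per block).
# Block names and numbers are hypothetical.
block_freq = {"B0": 1, "B1": 1000, "B2": 999, "B3": 1}

# Scheduler-predicted cycles per single execution of each block,
# for one candidate configuration (hypothetical numbers).
block_cycles = {"B0": 12, "B1": 7, "B2": 4, "B3": 9}

# Overall runtime estimate: sum over blocks of frequency * cycles.
runtime = sum(block_freq[b] * block_cycles[b] for b in block_freq)
```

Changing the configuration changes only `block_cycles` (re-run the scheduler), while `block_freq` is collected once, which is what lets the methodology avoid re-simulation.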
23
Performance estimation, more formally
  • Derive runtime vs. parameter curve for each
    parameter (just recompile for every param)
  • Runtime = Σ (profile-collected basic block
    frequency) × (scheduler-predicted runtime of
    that block)
  • Runtime_function(pi) = (runtime for pi) / (base
    runtime)

25
Area estimation, formally
  • Obtain area vs. parameter curve for every
    parameter
  • Area_function(pi) = additional gate area for pi

29
Retargetable performance estimator
  • Profiler
  • Computes execution frequencies of each basic
    block
  • A compiler-inserted instruction increments a
    global variable for this
  • Data flow graph builder, for scheduling
  • Directed acyclic graph for a basic block
    captures all dependencies (blocks execute in
    sequence; within a block, in parallel)
  • Priority of operation, based on height of that
    operation in dependency graph
  • Fine-grain scheduler estimates # of clock cycles
    by taking into account different architecture
    parameters
  • Schedules each basic block to derive execution
    time on the processor
  • Combines this with frequencies to estimate
    overall runtime
  • List scheduling is a greedy method that chooses
    next instruction in DAG in order of their
    priority (longer critical paths have higher
    priority)

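The list scheduling described above can be sketched as follows: priorities come from each operation's height in the dependence DAG (longer critical path, higher priority), and the scheduler greedily fills a fixed number of issue slots per cycle. This is an illustrative Python sketch, not the authors' scheduler; it assumes unit-latency operations, whereas the real scheduler accounts for per-operation latencies and delays:

```python
def height(dag, op, memo=None):
    """Priority = height of op in the dependence DAG
    (length of the longest path from op to a leaf)."""
    if memo is None:
        memo = {}
    if op not in memo:
        succs = dag.get(op, [])
        memo[op] = 1 + max((height(dag, s, memo) for s in succs), default=0)
    return memo[op]

def list_schedule(dag, preds, num_slots):
    """Greedy list scheduling of one basic block.
    dag: op -> list of successor ops; preds: op -> set of predecessors.
    Returns the number of cycles (all ops take 1 cycle here)."""
    done, cycles = set(), 0
    remaining = set(preds)
    while remaining:
        # Ops whose predecessors have all completed are ready.
        ready = [op for op in remaining if preds[op] <= done]
        # Issue up to num_slots ops, highest priority first.
        ready.sort(key=lambda op: height(dag, op), reverse=True)
        issued = ready[:num_slots]
        done |= set(issued)
        remaining -= set(issued)
        cycles += 1
    return cycles

# Diamond dependency: a -> {b, c} -> d (hypothetical block).
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
```

With 2 issue slots the diamond schedules in 3 cycles (a; b,c; d); with 1 slot it takes 4, which is how the predicted cycle count becomes sensitive to the number of parallel functional units.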
30
Retargetable performance estimator
  • Assumptions
  • All operations operate on operands in registers
  • Address computation of an array instruction is
    carried out by insertion of explicit address
    computation instructions

31
The processor description language
  • Can express most embedded VLIW processors
  • Functional units in data path, w/ their
    operations, corresponding latencies, delays
  • Constraints in terms of operation slots / slot
    restrictions
  • Number of registers, write buses, ports in memory
  • Delay of branch operations
  • Concurrent load/ store operations
  • Final operation delay = (delay of functional
    unit) + (delay of operation)

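A processor description in the spirit of the language above might look like the following. This is a hypothetical, simplified format sketched as a Python dict (the actual language in the paper is a context-free grammar and covers more detail); all unit names, latencies, and counts are made up:

```python
# Hypothetical description capturing the items on the slide:
# functional units with operations, latencies and delays,
# issue-slot restrictions, registers, write buses, memory
# ports (concurrent load/store), and branch delay.
TRIMEDIA_LIKE = {
    "functional_units": {
        "alu0": {"ops": {"add": 1, "sub": 1, "and": 1}, "delay": 0},
        "alu1": {"ops": {"add": 1, "sub": 1}, "delay": 0},
        "mac":  {"ops": {"mul": 2, "mac": 3}, "delay": 1},
        "mem":  {"ops": {"load": 3, "store": 1}, "delay": 0},
    },
    "issue_slots": 5,
    # Slot restrictions: which units may issue in which slot.
    "slot_map": {0: ["alu0"], 1: ["alu1"], 2: ["mac"],
                 3: ["mem"], 4: ["mem"]},
    "registers": 128,
    "write_buses": 5,
    "memory_ports": 2,   # allows concurrent load/store
    "branch_delay": 3,
}

def operation_delay(desc, unit, op):
    # Final operation delay =
    #   (delay of functional unit) + (delay of operation)
    fu = desc["functional_units"][unit]
    return fu["delay"] + fu["ops"][op]
```

The scheduler would consume such a description to decide which operations can issue together and how long each takes, which is what makes the estimator retargetable.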
34
Architecture exploration engine
  • Chooses optimal parameter values: a constrained
    optimization problem
  • Sum of all area_functions < area_budget
  • If parameters are independent, pred_runtime =
    product of runtime_function for every parameter
  • Since they are not, pred_runtime =
  • (product of runtime_function for every parameter)
    / dependence_constant(p1, …, pn)
  • where dependence_constant is

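The combination rule above, multiply the individual runtime ratios and then divide by a measured dependence constant, can be sketched as follows. The numbers mirror the dual-ported/pipelined-memory example discussed later in the deck (6.1% gain each, 7.3% combined):

```python
def pred_runtime(runtime_functions, dependence_constant):
    """Predicted runtime ratio for a combination of parameters:
    (product of per-parameter runtime ratios) / dependence_constant."""
    product = 1.0
    for r in runtime_functions:
        product *= r
    return product / dependence_constant

# Each parameter alone gives a 6.1% gain (ratio 0.939); the measured
# combined gain is 7.3% (ratio 0.927), so the dependence constant is
# 0.939 * 0.939 / 0.927, about 0.95.
dc = (0.939 * 0.939) / 0.927
ratio = pred_runtime([0.939, 0.939], dc)  # recovers the measured 0.927
```

By construction the constant folds the measured interaction back into the product, so a constant near 1 means the naive product was already close to the measured combined runtime.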
37
Interdependence of parameters
  • dependence_constant is a heuristic for every
    combo of parameters that adjusts the gain for
    that combo
  • obtained by one-time, exhaustive simulation of
    standard benchmarks, for a combo of parameters
  • Dependence_constant(p1, …, pn)
  • = 1 if pi = basei for all i
  • = 1 if pj ≠ basej for one j and pi = basei for
    all i ≠ j
  • = (product of all runtime_functions) /
    (actual_runtime for that combo), otherwise

38
Evaluated parameters
  • On Philips TriMedia VLIW processor
  • Presence or absence of MAC
  • HW/ SW floating point
  • Single or dual-ported memory for parallel memory
    operations
  • Pipelined or non-pipelined memory unit

39
Other customizable parameters
  • Register file size
  • Number of architectural clusters
  • Number and nature of functional units
  • Presence of an address generation unit
  • Optimized special operations
  • Multi-operation patterns
  • Memory data packing/ unpacking support
  • Memory addressing support
  • Control-flow support
  • Loop-level optimizations
  • Loop-level optimized patterns
  • Loop vectorization
  • Architecture-independent optimization

40
For DSP applications
  • Functional unit composition
  • Ignore cache misses, branch mis-predictions,
    separation of register files (or functional unit
    banks), register allocation conflicts
  • Register casting, if data-dependency interlocks
    exist in the architecture

41
Performance gain from INDIVIDUAL parameters
  • Runtime_function for each benchmark application,
    for each of the chosen parameters: MAC, FPU,
    dual-ported memory, pipelined memory

Figure from Gupta et al.
42
Performance gain from COMBINED parameters
  • Runtime_function for each benchmark application,
    for a selected combination of the chosen
    parameters

Figure from Gupta et al.
43
Dependence constants for the combinations
Figure from Gupta et al.
44
(DSP) FFT benchmark
Figure from Gupta et al.
45
Results
  • Performance estimation error: 2.5%
  • Recommended configuration same as hand-optimized

46
Use profiled app parameters to eliminate a
processor or processor configuration
47
App parameters and the processor features they indicate
  • Average block size
  • (Acceptable) branch penalty
  • # of multiply-accumulate operations
  • MAC
  • Ratio of address computation instructions to data
    computation instructions
  • Separate address generation ALU
  • Ratio of I/O instructions to total instructions
  • Memory bandwidth requirements
  • Average arc length in the data flow graph
  • Total # of registers
  • Unconstrained ASAP scheduler results
  • Operation concurrency and lower bound on
    performance
  • Assumptions
  • In the average-block-size module, instructions
    associated with condition code evaluation of
    conditional structures and loops are ignored
  • Each array instruction contributes to the total
    by twice the # of dimensions
  • Array accesses are assumed to point to data in
    memory

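Several of these application parameters can be computed directly from a profile. A minimal sketch, assuming a hypothetical per-block instruction listing (the block contents and frequencies are made up):

```python
# Hypothetical profile: per basic block, its instruction
# opcodes and execution frequency.
blocks = {
    "B0": {"ops": ["load", "mac", "mac", "add", "store"], "freq": 1000},
    "B1": {"ops": ["add", "sub", "branch"], "freq": 10},
}

# Total dynamic instruction count.
total = sum(len(b["ops"]) * b["freq"] for b in blocks.values())

# Average (dynamic) basic block size -> acceptable branch penalty.
avg_block_size = total / sum(b["freq"] for b in blocks.values())

# Fraction of multiply-accumulate operations -> include a MAC unit?
mac_ratio = sum(b["ops"].count("mac") * b["freq"]
                for b in blocks.values()) / total
```

The other ratios on the slide (address vs. data computation, I/O vs. total instructions) follow the same pattern: count the relevant opcodes, weight by block frequency, divide by the total.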
48
Related work
  • Related work evaluates exhaustively or in
    isolation; no cost-area analysis
  • Commercial soft cores
  • User optimizes instruction set, addressing modes,
    sizes of internal memory banks; tool estimates
    area
  • Gong et al.
  • Performance analyzer evaluates machine
    parallelism, number of buses, connectivity,
    memory ports; does not account for dependency
  • Ghazal et al.
  • Predict runtime for advanced processor features
    and compiler optimizations such as optimized
    special operations, memory addressing support,
    control-flow support, loop-level optimization
    support
  • Gupta et al.
  • Analyze application to select processor; no
    quantification of features; performance
    estimation thru exhaustive simulation
  • Kuulusa et al., Herbert et al., Shackleford
    et al.
  • Tools for architecture exploration by exhaustive
    search; evaluate instruction extensions
  • Custom-fit processors
  • Also exhaustive search, but targets a VLIW
    architecture: changeable memory sizes, register
    sizes, kinds and latencies of functional units,
    and clustered machines; speedup/cost graphs are
    derived for all combinations, yielding Pareto
    points

49
Other related papers
  • Kuulusa et al., Herbert et al., Shackleford
    et al. evaluate extensions to instruction set
  • Managing multi-configuration hardware via dynamic
    working set analysis
  • By Dhodapkar, Smith, Wisc
  • Reconfigurable custom computing as a
    supercomputer replacement
  • By Milne, University of South Australia

50
Discussion
51
Dependence constant
  • Dependence_constant close to 1 does not imply
    independence because
  • The set of parameters the constant is modeling
    impacts only a fraction of instructions in a
    program
  • E.g. dual-ported memory and pipelined memory both
    target memory parallelism
  • Results showed they were highly dependent
  • Gain from both: 7.3%
  • Gain from each: 6.1%
  • Dependence constant = (0.939 × 0.939)/(0.927) ≈
    0.95
  • Close to one despite parameters being highly
    dependent

52
3. Processor evaluation for ASIP changes (Gupta et
al., IIT Delhi)
  • Modifications may include
  • New instructions, increased # of functional
    units, registers, memory accesses in parallel,
    depth of pipeline
  • Convert application and processor architecture to
    intermediate representation (so that they can be
    evaluated)
  • Application: Stanford Univ. Intermediate Format,
    which provides libraries to extract info
  • E.g. average number of instructions that can be
    executed in parallel; this indicates concurrency
    in the app and, with performance specifications,
    becomes a requirement for the processor
  • Architectural characteristics: custom description
    format
  • Estimator now estimates code size and runtime
  • System partitioner can now estimate s/w
    performance

53
Overview
  • Static analysis of profile info, using a math
    model; no running of the application
  • Describe processor using a processor description
    language (context-free grammar)
  • Feed this description to the instruction
    scheduler in the compiler for scheduling, i.e.
    performance estimation
  • By predicting performance for configuration w/
    one CPU enhancement, infer for other
    configurations using dependence constants
  • (statistical data on inter-dependence of the
    given parameters)
  • Parameters (Philips TriMedia VLIW processor)
  • Include or exclude MAC
  • HW/ SW floating point
  • Single or dual-ported memory for parallel memory
    operations
  • Pipelined or non-pipelined memory unit
  • Performance estimation error: 2.5%
  • Recommended configuration same as brute force
    exhaustive simulation

54
Methodology
  • Performance analyzer
  • Avoid exhaustive simulation, as the number of
    configurations is exponential
  • Simulating large data sets would be prohibitively
    time consuming
  • Architecture optimizer
  • FPGA/chip constraints: resources, area (cost,
    power constraints)
  • Architectural parameters are not independent
  • Evaluation of proposed methodology
  • Compare the resulting configuration and runtime
    with hand-optimized configuration