Title: Predicting Performance Potential of Modern DSPs
1. Predicting Performance Potential of Modern DSPs
- Naji S. Ghazal, Richard Newton, Jan Rabaey
- Department of Electrical Engineering and Computer Sciences, University of California at Berkeley
- http://www-cad.eecs.berkeley.edu/naji/Research/
2. Current Challenges in DSP Design
- Too Many Choices in DSP Architecture
  - Diverging architecture styles [Bier98]
  - New User-Configurable Instruction Sets (e.g. CARMEL DSP)
- Insufficient High-level Development Support
  - DSP Compilers still cannot automatically exploit architectures' crucial optimizations
  - Challenges:
    - Statically unknown control flow
    - Identifying supported arithmetic contexts
    - Identifying memory addressing sequences (e.g. streams)
- Growing need for Tools to explore and predict the true potential of DSP architectural choices for a particular application
  - Ever more crucial, and yet getting harder
3. Key Opportunity for Improvement
- Widely used Practices and Syntactic Styles can be leveraged
  - Most code is in well-behaved Loops
  - Stream Accesses are in (or convertible to) the format: Array[constant-coefficient * Loop-Index + constant-offset]
  - Auxiliary Fixed-point Arithmetic Operations are usually identifiable (see the C sketch below)
    - e.g. x += (((a << SCALE) * b) + (1 << RND)) >> TRUNC
    - -> x += a * b (MAC)
- Predictions of run-time behavior, with aggressive exposure of potential optimization opportunities, are possible
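A minimal C sketch of this idea, under assumed (illustrative) SCALE/RND/TRUNC values: the explicitly coded fixed-point expression versus the single operation a DSP with scaling, rounding, and truncation modes can execute.

```c
#include <stdint.h>

/* Illustrative shift amounts; real values are application-specific. */
#define SCALE 1
#define RND   14
#define TRUNC 15

/* How portable C code typically spells out a fixed-point multiply-
 * accumulate: scale, multiply, add a rounding constant, truncate.   */
int32_t mac_explicit(int32_t x, int16_t a, int16_t b)
{
    return x + ((((int32_t)a << SCALE) * b + (1 << RND)) >> TRUNC);
}

/* What the estimator assumes a DSP with scaling, rounding and
 * truncation modes enabled executes as a single MAC instruction:
 * the shifts and the rounding add fold into the arithmetic mode.    */
int32_t mac_ideal(int32_t x, int16_t a, int16_t b)
{
    return x + (int32_t)a * b;   /* x += a * b */
}
```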
4. Approach: Retargetable Estimation
Inputs: Application (Generic C Code), Parameterized Architecture Model
Flow:
- SUIF Compiler Front-end -> Intermediate Format (SUIF)
- Profiler
- Translation & Annotation -> Architecture-specific SUIF
  - Translation of SUIF instructions -> Architecture-Compatible instrs.
  - Optimized Computation Patterns
  - High-level Optimization Features
- Cycle-level Estimation, using:
  - Address Generation Conditions
  - Functional Unit Usage/Ordering Rules
  - Instruction Set Attributes
Output: Estimate and Profile, annotated with chosen optimizations and ranked bottlenecks
5. Parameterized Architecture Model
- Special Optimization Features
  - Optimized Special Operations
    - Supported Arithmetic Modes (Scaling, Rounding, Truncating, Saturation)
    - Multi-operation Hyper-Patterns
      - Arithmetic (e.g. dual-MAC, complex-MUL)
      - Packed (stream-related) Operations
    - Memory Pack/Unpack Support
  - Memory Addressing Support
    - Auto-update Mode Ranges
    - Hardware Circular Addressing (number and size of Circular buffers)
  - Control Flow Optimization Support
    - Simple If-Conversion
    - Looping Support
    - Loop Vectorization/Packing
- Functional Unit Composition
  - Functional Unit (FU) Types
  - FU Usage Limits
  - FU Latencies, Throughputs
  - FU-to-FU Constraints
- Instruction Set
  - List of Instruction Types (ITypes)
    - default types: Add/Sub, ALU (gen.), Mul, MAC, Load, Store, Branch
  - Max number of Instructions per Cycle
  - Instructions' FU Usage
  - Operand Handling Rules
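As a rough illustration only, the parameters listed above could be captured in a machine-readable model along these lines; every type and field name here is hypothetical, not the tool's actual model format.

```c
/* Hypothetical sketch of the parameterized architecture model in C. */
enum fu_type   { FU_ALU, FU_MAC, FU_LOAD, FU_STORE, FU_BRANCH, FU_PREFETCH };
enum insn_type { IT_ADD_SUB, IT_ALU_GEN, IT_MUL, IT_MAC,
                 IT_LOAD, IT_STORE, IT_BRANCH };

struct fu_entry {
    enum fu_type type;
    int limit;        /* how many of this FU may be used per cycle */
    int latency;      /* cycles until the result is available      */
    int throughput;   /* issue interval in cycles                  */
};

struct arch_model {
    /* Instruction set */
    int max_issue_per_cycle;
    /* Functional unit composition */
    struct fu_entry fus[8];
    int num_fus;
    /* Memory addressing support */
    int auto_update_min, auto_update_max;   /* post-inc/dec offset range  */
    int num_circular_buffers;
    /* Control-flow / loop optimization support */
    int zero_overhead_loops;                /* hardware loop counters     */
    int vectorization_degree;               /* e.g. 2 for 2x packed loops */
    /* Arithmetic modes */
    unsigned scaling : 1, rounding : 1, truncation : 1, saturation : 1;
};
```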
6. Targeting the Model -- Example
- Target Processor: LSI401, 4-way Superscalar DSP
  - Salient Features: Double-Loads, Dual MAC, Packed-Add/Sub
- Estimation Example (written out as C below):
  - for (i = 0; i < N; i++) {
  -   x += (*ptr * b[i]) >> 16;
  -   y[i] = *ptr++;
  - }
- Multi-Operation Patterns
  - Dual-MAC: x = x +/- a*b +/- c*d
- Arithmetic Support
  - Truncation: ON
- Loop Vectorization
  - Degree: 2
  - Vectorization ITypes: Add/Sub, Mul, MAC, Ld, Str
-> Per-iteration instructions: Ld, Ld, MAC, Str
-> Loop is 2x Vectorizable: Ld2, Ld2, Dual-MAC, Str2; loop count N -> N/2
-> Truncation supported
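The example loop above, written out as compilable C. The >>16 truncation and the access pattern come from the slide; the concrete operand types (16-bit data, 32-bit accumulator) and the exact placement of the pointer increment are assumptions made for illustration.

```c
#include <stdint.h>

/* The slide-6 estimation example as a self-contained C kernel. */
int32_t kernel(const int16_t *ptr, const int16_t *b, int16_t *y, int n)
{
    int32_t x = 0;
    for (int i = 0; i < n; i++) {
        x += ((int32_t)*ptr * b[i]) >> 16;  /* MAC with truncation */
        y[i] = *ptr++;                      /* streaming copy      */
    }
    return x;
}

/* Per-iteration mapping inferred for the LSI401:
 *   scalar:    Ld, Ld, MAC, Str              (per iteration)
 *   2x packed: Ld2, Ld2, Dual-MAC, Str2      (iteration count N -> N/2)
 * The >>16 truncation folds into the MAC's arithmetic mode.          */
```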
7. Results: Comparison with Optimizing Compiler
Inaccuracy in Predicting Hand-Optimized Performance of DSP Kernels:
- DSP 1: TI C6201
  - Avg. Error for Estimator: 4.2%
  - Avg. Error for Compiler: 220%
  - [Chart: Distribution of Percent Error (frequency vs. percent error, 100-700%)]
- DSP 2: LSI401
  - Avg. Error for Estimator: 3.5%
  - Avg. Error for Compiler: 310%
  - [Chart: Distribution of Percent Error (frequency vs. percent error, 100-700%)]
- The latest compilers fall short of hand-optimized performance, substantially so even for DSP kernels
8. Why Does Retargetable Estimation Work Well?
- The Machine Description Method has sufficient detail
  - Captures the main Instructions, Functional Unit constraints, and pipelining effects
  - Uses well-tested abstractions of features in today's DSPs
- Estimation encompasses the optimization features with the heaviest impact on performance
  - Includes crucial optimizing compiler technology
  - Targets common DSP styles/semantics and characteristics
  - Reaches DSP-oriented, Loop-level and computation-context-level optimizations, accounting for their effect on performance
9. Conclusions
With Predictive Analysis and an Architecture Model capturing and profiling the differentiating features of DSPs:
- Quick quantitative evaluation of architectural tradeoffs
- Guidance to shorten the design development cycle
- Quick comparison of different versions of an algorithm on a given architecture
- What differentiates this approach:
  - No need for generating assembly code and simulating
  - Rapid Retargetability
  - Can be applied to other metrics, e.g. power, memory usage
10. Architecture Selection: Ever More Challenging
Today's tools for this task face many challenges:
- Difficult to retarget
- Still lagging optimization technology
- Low reliability
- No development support
- Limited to supported DSPs
- Expensive!
11. Background: The DSP Domain
- DSP Applications have attributes different from GP applications, beyond Generic Instruction-Level Parallelism:
  - High Computation Regularity
  - Predictable Data Access Patterns (e.g. sequential, circular access)
  - High, Well-structured Data Parallelism
- DSP Processors (old and new) leverage these attributes, using:
  - Special Complex/Parallel Instructions
  - Variable Arithmetic Mode Support
  - Specialized Memory addressing and control-flow support
12. Example Model for LSI401 (formerly ZSP16401)
- LSI401: 4-way Superscalar RISC-based DSP with DSP-oriented features
- Optimized Operations
  - Multi-operation Patterns (SUIF Expression trees)
    - MAC: x += ? * ?
    - Load2: pair of x[i], x[i+/-1] in the same basic block
    - MAC2: x += x[i]*y[i] + x[i+/-1]*y[i+/-1]
    - Add2/Sub2
  - Special Arithmetic modes: Saturation, Rounding
- Memory Addressing Features
  - Address Generation Cost: 1 cycle if not a register, or if not an array access with offset in -8..7
  - Hardware Circular Addressing: 2 circular buffers
- Control Flow Optimization Features
  - Looping Support: FOR Loop cost 0 (2 counters)
  - Loop Vectorization/Packing (instructions eligible for 2x parallelization: Add, Sub, Ld, Str, MAC)
FU Types (6)   Limit  Latency  Throughput
ALU            2      1        1
MAC            1      1        1
Prefetch-Lds   2      0        1
Load           1      1        1
Store          1      1        1
Branch         1      1        1
FU-to-FU Latency: Str -> Ld, 1 cycle min.
Instruction Types (12): Ld, Pref-Ld, Str, Branch, Add, Sub, ALU (gen.), Mul, MAC, Add2, Sub2, MAC2
Max Instructions issued in Parallel: 4
Instrs' FU Usage: 16- and 32-bit 12x6 tables
Operand Handling Rules: Padd/MACs allow 3 operands per instr. (the rest allow 2, the default)
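A first-order sketch of how the FU limits and the 4-wide issue limit in the table above could drive a cycle estimate for a basic block. This is illustrative only: the actual estimator also applies FU latencies, FU-to-FU constraints (e.g. Str -> Ld, 1 cycle min.), and operand-handling rules.

```c
#include <stdio.h>

/* Per-cycle FU limits from the LSI401 table above. */
enum fu { ALU, MAC, PREF_LD, LOAD, STORE, BRANCH, NUM_FU };
static const int fu_limit[NUM_FU] = { 2, 1, 2, 1, 1, 1 };
static const int max_issue = 4;

/* Greedily pack instructions, in program order, into cycles subject to
 * the issue width and per-FU limits; return the cycle count.           */
int estimate_cycles(const enum fu *block, int n)
{
    int cycles = 0, i = 0;
    while (i < n) {
        int used[NUM_FU] = { 0 }, issued = 0;
        while (i < n && issued < max_issue &&
               used[block[i]] < fu_limit[block[i]]) {
            used[block[i]]++;
            issued++;
            i++;
        }
        cycles++;
    }
    return cycles;
}

int main(void)
{
    /* Scalar body of the slide-6 loop: Ld, Ld, MAC, Str. */
    enum fu body[] = { LOAD, LOAD, MAC, STORE };
    printf("estimated cycles per iteration: %d\n", estimate_cycles(body, 4));
    return 0;
}
```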
13. Assumptions in the Architecture Model
Assumptions that are not restrictive in the DSP domain can be used to simplify the model:
- No Register Allocation Conflicts (though register casting is performed)
  - Variable lifetimes are usually short
- No Cache Misses/Extended Memory Latencies
  - High execution-time locality and data access predictability
- Perfect Branch Prediction
  - Predictable Control Flow
- Auto-update Address Modes (post-incr., post-decr.) Available
14. Results: Estimated vs. Hand-Optimized Cycle Count
15. Determining Level of Confidence of Estimation
- Empirical Approach: Correlation to Benchmark Results
  - Establish Benchmark sets with high confidence (target different application types)
  - Correlate the Application to the Benchmark Set => determine how similar it is to the benchmarks [Potkonjak98]
  - The Application Estimate's Confidence Level (Accuracy) is a function of its similarity to the Benchmarks and their Confidence Level
16. Determining Distance from a Benchmark Set
Quantitative method based on [Potkonjak98] (sketched in C below):
- Characterize a Benchmark set numerically by measuring some run-time properties
  - Potkonjak used CPI, Cache hit rate, Bus utilization, ALU issue rate, ...
- Obtain Averages for each property over all benchmarks
- Characterize the New application similarly
- Add the Application to the Benchmark Set and obtain NEW Averages
- The Distance of the New application from the Benchmark Set is a function of the differences between the Old and New averages
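A minimal C sketch of the procedure above. The property set and the normalized absolute-difference distance used here are illustrative assumptions; the actual metric follows [Potkonjak98].

```c
#include <math.h>
#include <stdio.h>

#define NUM_PROPS 4   /* e.g. CPI, cache hit rate, bus utilization, ALU issue rate */

/* Average each run-time property over a benchmark set. */
static void property_averages(const double bench[][NUM_PROPS], int n,
                              double avg[NUM_PROPS])
{
    for (int p = 0; p < NUM_PROPS; p++) {
        avg[p] = 0.0;
        for (int b = 0; b < n; b++)
            avg[p] += bench[b][p];
        avg[p] /= n;
    }
}

/* Distance = function of the differences between the old averages
 * (benchmarks only) and the new averages (benchmarks + application).
 * The normalized sum of absolute differences is an illustrative choice. */
double distance_from_set(const double bench[][NUM_PROPS], int n,
                         const double app[NUM_PROPS])
{
    double old_avg[NUM_PROPS], new_avg, d = 0.0;
    property_averages(bench, n, old_avg);
    for (int p = 0; p < NUM_PROPS; p++) {
        new_avg = (old_avg[p] * n + app[p]) / (n + 1);
        d += fabs(new_avg - old_avg[p]) / (fabs(old_avg[p]) + 1e-12);
    }
    return d;
}

int main(void)
{
    /* Illustrative property values only. */
    const double bench[2][NUM_PROPS] = { { 1.2, 0.95, 0.40, 0.8 },
                                         { 1.5, 0.90, 0.55, 0.7 } };
    const double app[NUM_PROPS]      =   { 1.3, 0.97, 0.35, 0.9 };
    printf("distance: %f\n", distance_from_set(bench, 2, app));
    return 0;
}
```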
17. Architecture Selection Criteria -- System Viewpoint
- Performance and Power constraints MUST be met
  - Using Benchmark result lookup is increasingly unreliable
- Other Criteria
  - Chip Cost
  - Presence of Peripherals
  - Vendor/Development Support
  - Data Formats Supported
  - (These can be encapsulated in the Framework directly as known data, used to rank the final candidate architectures)
  - Memory Cost (can be estimated using the Retargetable Estimation method)
- Composite Metrics should be used: Cost/Energy-Efficiency
18. Retargetable Compiler Architecture Model
- Example: ISDL (Instruction Set Description Language) [Devadas98]
- The Language consists of:
  - Instruction Word Format
  - Storage Resources (names and sizes of Memory and Register files)
  - Instruction Set (with RTL description of operation)
  - Constraints (grouping rules for parallel instructions)
  - Optimization Hints (e.g. Branch Prediction Hints, Delay Slot Use)
- Emphasis on Code Generation for VLIW Embedded Processors
19. Characteristics of Conventional DSP Processor Architectures
- Highly Non-orthogonal Data Paths
- Restricted/specialized Parallelism
- Specialized Support for Control Flow and Addressing
- Special, small Register Files
- Complex Instruction Set (emphasis on High Code Density)
- High Memory Bandwidth (usually at least two words/cycle)
- Difficult to program and compile for
[Datapath diagram: Data Bus and Address Bus connecting G.P. Registers, Address Registers, MULT, ALU, Accumulator, and AGU]
20. New Trends in DSP Processor Architectures
- Diversification
  - Enhanced Conventional DSPs (more specialized parallelism)
  - VLIW (deeply pipelined) / Superscalar (dynamically scheduled)
  - DSP-enhanced General-Purpose Processors / Embedded Processors
- More Parallelism, but harder to track instruction behavior
- Exploitation of Computation Locality (e.g. data pre-fetching)
- Data Paths with User-configurable Extensions (e.g. Siemens CLIW ISA)
- Still as difficult, if not more so, to program optimally
- Software Development Tools are becoming more crucial
21. Capturing Multi-operation Hyper-Patterns
- Patterns are described as expression trees whose nodes are assigned not one value, but lists of possible values (see the sketch below)
[Expression-tree diagrams: "Dual-MAC (MAC2) for ZSP16401" and "SIMD Pattern (PADD, PSUB) for ZSP16401". Node value lists include: operator {+, -}; operand kinds {Array, CONST} and {Array, double Var} (an operand can be reached through a variable); array references X[i], Y[i], X[0], Y[0]; index offsets {1, -1}.]
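A hedged C sketch of the idea behind these diagrams: each pattern node carries small lists of acceptable operators, operand kinds, and index offsets instead of a single value, and matching an expression tree means checking membership at every node. All names and fields here are hypothetical, not the tool's actual data structures.

```c
/* Hypothetical hyper-pattern node with lists of acceptable values. */
enum opnd_kind { OPND_ARRAY, OPND_CONST, OPND_VAR };

struct pattern_node {
    const char *ops;                 /* acceptable operators, e.g. "+-"      */
    enum opnd_kind kinds[3];         /* acceptable operand kinds at a leaf   */
    int num_kinds;
    int offsets[2];                  /* acceptable index offsets vs. loop index */
    int num_offsets;
    struct pattern_node *left, *right;  /* children, NULL at leaves          */
};

/* Leaf of a MAC2-style pattern: an array reference with offset +1 or -1
 * relative to the loop index (operand may also be reached via a variable). */
static struct pattern_node mac2_leaf = {
    .ops = "", .kinds = { OPND_ARRAY, OPND_VAR }, .num_kinds = 2,
    .offsets = { +1, -1 }, .num_offsets = 2, .left = 0, .right = 0
};

/* An interior +/- node: either operator is acceptable.  (Children are shown
 * as leaves only to keep the sketch short.)                                 */
static struct pattern_node addsub_node = {
    .ops = "+-", .num_kinds = 0, .num_offsets = 0,
    .left = &mac2_leaf, .right = &mac2_leaf
};
```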