Title: Synthesis of Custom Processors based on Extensible Platforms
1. Synthesis of Custom Processors based on Extensible Platforms
- Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha
- Dept. of Electrical Engineering, Princeton University
- NEC Laboratories America, Inc.
2. Outline
- SoC design constraints
- Background
- Previous work in ASIP design
- Xtensa platform
- Manual custom instruction generation procedure
- Automatic custom instruction generation flow
- Experimental results
- Conclusions
3. SoC Design Constraints
- Time to market
- Cost
- Performance
- Power
- Cost-performance trade-off
- Flexibility
4. Comparison of Different Approaches
Table: ASIC, ASIP, and GPP rated on time to market, cost, performance, power, cost-performance, and flexibility.
Rating scale: from very good to very bad.
5. Flexibility vs. Energy Efficiency
6. Previous Work in ASIP Design
- ASIP architectures and overall design methodologies
  - Huang, 1994, Adams, 1996, Fisher, 1999, Kucukcakar, 1999
- Application-specific instruction set selection
  - Choi, 1999, Gschwind, 1999, Arnold, 1999
- Low power ASIP design
  - Kalambur, 1997, Dougherty, 1999, Ishihara, 2000, Sami, 2001
- Commercial offerings
  - Xtensa, ARCtangent, Jazz, SP-5flex, Carmel
7. Xtensa Architecture
Block diagram. Source: www.tensilica.com
Blocks shown:
- TRACE Port, JTAG Tap Control, On Chip Debug
- Instruction Memory or Cache Tags, Instruction Address, Instruction
- Align and Decode, Branch Logic Instruction Fetch
- Interrupt Control, Exception Support, Memory Protection Unit
- Processor Interface, Processor Controls, Write Buffer
- Window Register File, Coprocessor Register File
- ALU Address Generation, MAC 16
- Data Memory or Cache Tags, Data Address, Data
- Coprocessor Execution Units, Designer Defined Instruction Execution Unit
- Timers 1 to n, Special Function Register Access
- Data Address Watch 0 to n, Instruction Address Watch 0 to n
Legend categories: Base ISA Feature, Configurable Function, Optional Function, Configurable Optional Function, Extensible
8. Xtensa Processor Design Flow
Flow diagram. Source: www.tensilica.com
Labels shown:
- Processor Configuration Inputs
- Designer-Defined Instruction Descriptions
- Configuration File
- Configured GNU C/C++ Compiler
- Configured Processor HDL
- Configured GNU Assembler/Disassembler
- Configured Instruction Set Simulator/Emulator
- Area, Power and Timing Estimation
- Application Source Code
- Sample Application Data
- Optimized Hardware
- Optimized Software
Legend categories: Generator Output, Internal Database, Design data, Use of Generated Data
9. Manual Custom Instruction Generation Procedure
- Identify potential new instructions: profile, read source code
- Describe custom instructions: understand source code
- Insert custom instructions: rewrite source code
- Verify functional correctness
- Overall: slow and error-prone
10. Contributions of Our Work
- Automatic custom instruction selection
  - Application program to extensible processors with custom instructions
- Features
  - Efficient design space search
  - Use accurate information from instruction set simulator and synthesis
  - Bridge the gap between automatically synthesized and manually designed architectures
11. Automatic Custom Instruction Generation Flow
12. Automatic Custom Instruction Generation Flow
13. Example Illustration of Template Generation
14. Example Illustration of Template Generation
15. Example Illustration of Template Generation
16. Example Illustration of Template Generation
17. Example Illustration of Template Generation
18. Key Observations for Pruning
- The higher the weight of the template, the higher the potential for improvement (Amdahl's law)
- Scope for optimization determined by computation: number of cycles needed for executing the template
- Scope for optimization determined by read/write port limitations: additional cycles needed for extra reading/writing of input/output variables
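The first observation can be made concrete with Amdahl's law. The following is a minimal sketch, not taken from the slides: the function name `overall_speedup` and the example weights and local speedup are hypothetical, chosen only to show how the template weight bounds the whole-program gain.

```python
def overall_speedup(weight: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when a fraction `weight` of the
    execution time is accelerated by a factor `local_speedup`."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be a fraction in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - weight) + weight / local_speedup)


# Hypothetical templates: identical local speedup, different weights.
for weight in (0.05, 0.30, 0.60):
    speedup = overall_speedup(weight, 4.0)
    print(f"weight={weight:.2f} -> overall speedup {speedup:.2f}x")
```

With the same local speedup, the template with the larger weight always yields the larger overall speedup, which is why weight is a natural first ranking signal.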
19. Pruning Algorithm
- Ranking criterion
  - OriginalTime: fraction of the total execution time of the original program spent in the template (weight)
  - In, Out: number of inputs and outputs of the template, respectively
  - α, β: number of inputs/outputs encoded in the instruction
  - ?: number of cycles needed for executing the template
- Higher priority means greater potential for speedup
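The slide's ranking formula itself is not present in the text, so the sketch below does not reproduce it. It only illustrates how the listed quantities could interact, under assumptions that are not from the source: each input or output beyond the encoded operands costs one extra cycle, the custom instruction itself takes one cycle, the cycle count refers to the template's software execution, and the defaults `alpha=2`, `beta=1` are placeholders. `Template`, `extra_port_cycles`, and `hypothetical_priority` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str
    original_time: float  # weight: fraction of total execution time
    n_in: int             # In: number of template inputs
    n_out: int            # Out: number of template outputs
    cycles: int           # assumed: cycles to execute the template in software


def extra_port_cycles(t: Template, alpha: int, beta: int) -> int:
    """Assumed cost model: one extra cycle per input/output that does not
    fit in the alpha inputs / beta outputs encoded in the instruction."""
    return max(0, t.n_in - alpha) + max(0, t.n_out - beta)


def hypothetical_priority(t: Template, alpha: int = 2, beta: int = 1) -> float:
    """Hypothetical score (NOT the slide's formula): weight times the
    fraction of the template's cycles that the custom instruction saves."""
    custom_cycles = 1 + extra_port_cycles(t, alpha, beta)
    saved = max(0, t.cycles - custom_cycles)
    return t.original_time * saved / t.cycles


fits = Template("fits_ports", 0.4, n_in=2, n_out=1, cycles=4)
spills = Template("needs_extra_inputs", 0.4, n_in=4, n_out=1, cycles=4)
for t in sorted((fits, spills), key=hypothetical_priority, reverse=True):
    print(t.name, round(hypothetical_priority(t), 3))
```

Two templates with equal weight and equal software cycle count rank differently once the port limitation is charged, which matches the third pruning observation in spirit.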
20. Template Generation with Pruning
Figure: ranked pool of seed templates and template set, threshold 0.1. Priorities shown: 10.51, 7.92, 4.05, 2.13.
21. Template Generation with Pruning
Figure: ranked pool of seed templates and template set, threshold 0.1, highest priority marked. Priorities shown: 12.73, 1.18, 16.35.
22. Template Generation with Pruning
Figure: ranked pool of seed templates and template set, threshold 0.1, highest priority marked. Priorities shown: 12.73, 16.35.
23. Template Generation with Pruning
Figure: ranked pool of seed templates and template set, threshold 0.1, highest priority marked. Priorities shown: 12.73, 16.35.
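The priority values on slides 21 and 22 are consistent with a relative cutoff: 1.18 disappears when the highest priority is 16.35 and the threshold is 0.1, since 1.18 < 0.1 × 16.35. Reading the threshold as a ratio of the highest priority is an inference from these numbers, not something the slides state. A minimal sketch under that assumption, with `prune` as a hypothetical name:

```python
def prune(priorities, threshold_ratio=0.1):
    """Keep only candidates whose priority is at least `threshold_ratio`
    times the highest priority in the pool, ranked from high to low."""
    if not priorities:
        return []
    cutoff = threshold_ratio * max(priorities)
    return sorted((p for p in priorities if p >= cutoff), reverse=True)


print(prune([12.73, 1.18, 16.35]))        # 1.18 falls below 1.635 and is dropped
print(prune([10.51, 7.92, 4.05, 2.13]))   # all values exceed 1.051 and are kept
```

A larger threshold ratio prunes more aggressively and leaves fewer templates to explore, which is the trade-off a templates-versus-threshold plot would show.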
24. No. of Templates vs. Threshold Ratio
25. Automatic Custom Instruction Generation Flow
26. Automatic Custom Instruction Generation Flow (Contd.)
27. Automatic Custom Instruction Generation Flow (Contd.)
28. Custom Instruction Insertion
- Care must be taken to insert custom instructions into appropriate places without affecting the program's functional correctness
- If custom instructions need extra inputs (outputs), care must be taken to select appropriate variables to write to (read from) user-defined registers
29. Example Illustration of Custom Instruction Insertion
30. Example Illustration of Custom Instruction Insertion (Contd.)
(a) Original code:
    ....
    offset = t + 1;
    for (i = 0; i < 100; i++) {
        j = ....;
        result = offset + i + j;
        ....
    }
(b) Code with the custom instruction inserted:
    ....
    offset = t + 1;
    WUR(offset, 0);
    for (i = 0; i < 100; i++) {
        j = ....;
        result = CustomInstr(i, j);
        ....
    }
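The transformation in (a) and (b) can be checked with a small software model. This is a sketch under stated assumptions: the operators in the extracted fragment were lost, so the loop body is assumed to compute `result = offset + i + j`; the user-defined register file is modeled as a dictionary; `custom_instr` is a stand-in for `CustomInstr`; and `j = 2 * i` is a placeholder for the elided computation of `j`.

```python
user_regs = {}  # model of the user-defined registers


def WUR(value, index):
    """Model of writing `value` into user-defined register `index`."""
    user_regs[index] = value


def custom_instr(i, j):
    """Stand-in for the custom instruction: two operands encoded in the
    instruction, plus one extra input read from user register 0."""
    return user_regs[0] + i + j


def original(t, n=100):
    offset = t + 1
    results = []
    for i in range(n):
        j = 2 * i  # placeholder for the elided computation of j
        results.append(offset + i + j)
    return results


def rewritten(t, n=100):
    offset = t + 1
    WUR(offset, 0)  # loop-invariant extra input, written once before the loop
    results = []
    for i in range(n):
        j = 2 * i  # same placeholder as in original()
        results.append(custom_instr(i, j))
    return results


print(original(5) == rewritten(5))
```

Because `offset` does not change inside the loop, writing it to the user register once before the loop keeps the results identical while leaving only two operands for the instruction encoding.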
31. Automatic Custom Instruction Generation Flow
32. Custom Instruction Combination Selection: Problem Statement
- Given a set of non-overlapping custom instructions, each available in several versions, select one version of each instruction such that performance is maximized while the total area stays under a given threshold
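The problem statement has the shape of a multiple-choice knapsack problem. The sketch below solves small instances by exhaustive enumeration; it is not the flow used in the presentation, and the instruction names, area figures, and gain figures are hypothetical. Each version is an `(area, gain)` pair, and a zero-area, zero-gain version stands for leaving that instruction out.

```python
from itertools import product


def select_versions(instructions, area_budget):
    """instructions: dict mapping an instruction name to a list of
    (area, gain) versions. Picks exactly one version per instruction.
    Returns (best_gain, {name: version_index}), or None if no combination
    fits within `area_budget`."""
    names = list(instructions)
    best = None
    for combo in product(*(range(len(instructions[n])) for n in names)):
        area = sum(instructions[n][v][0] for n, v in zip(names, combo))
        if area > area_budget:
            continue  # violates the area constraint
        gain = sum(instructions[n][v][1] for n, v in zip(names, combo))
        if best is None or gain > best[0]:
            best = (gain, dict(zip(names, combo)))
    return best


# Hypothetical candidates: (area, gain) for each version.
candidates = {
    "mac": [(0, 0), (10, 50), (25, 80)],
    "shift_add": [(0, 0), (8, 30), (20, 45)],
}
print(select_versions(candidates, area_budget=30))
```

Enumeration grows exponentially with the number of instructions, so it only serves to make the objective and the constraint explicit; a practical flow would need a heuristic or dynamic-programming search.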
33. Custom Instruction Combination Selection: Flow Chart
34. Automatic Custom Instruction Generation Flow
35. Experimental Methodology
Tool flow diagram. Labels shown:
- C Program
- Aristotle
- Automatic Custom Instruction Generation
- Xtensa GNU Profiler
- Xtensa TIE Compiler
- Modified C program
- Cross Compiler
- Tensilica Processor Generator
- ISS
- Synopsys Design Compiler
- Sente Wattwatcher
Reported metrics: Execution Cycles, Power, Area, Clock Period
36. Experimental Results (Contd.)
Average:
- Performance improvement: 3.4X
- Energy reduction: 3.2X
- Energy-delay reduction: 12.6X
- Area increase: 1.8
37. Conclusions
- Automatic custom instruction synthesis for ASIPs
  - Template generation/selection
  - Custom instruction insertion
  - Custom instruction combination selection
- Experimental results
  - 3.4X average performance improvement
  - 12.6X average energy-delay reduction