Title: The Role Of ASIP
1The Role Of ASIP In Programmable Platforms
2Outline
- Using ASIP: a new design paradigm
- EEMBC: a case study
- Designing ASIPs using Xtensa and TIE
- Addressing the needs of platforms
- ASIP computing capabilities
- ASIP communication capabilities
- Challenges
3A short story of a design paradigm shift
4Once upon a time
How do I solve the encryption problem?
5Data Encryption Standard (DES)
- Initial step
  - (R, L) = Initial_permutation(Din64)
- Iterate 16 times
  - Key generation
    - (C, D) = PC1(k)
    - n = rotate_amount (function of iteration count)
    - C = rotate_right(C, n)
    - D = rotate_right(D, n)
    - K = PC2(D, C)
  - Encryption
    - R[i+1] = L[i] ⊕ Permutation(S_Box(K ⊕ Expansion(R[i])))
    - L[i+1] = R[i]
- Final step
  - Dout64 = Final_permutation(L, R)
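The round structure above (new R is the old L XORed with a function of R and the round key, new L is the old R) can be sketched in C. This is only an illustration of the Feistel structure: `toy_round`, `feistel`, and the key array are hypothetical names, and `toy_round` is a stand-in, not the DES Permutation/S_Box/Expansion chain. The initial and final permutations and the key schedule are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROUNDS 16

/* Hypothetical stand-in for Permutation(S_Box(K ^ Expansion(R))).
 * It is NOT the DES round function. A Feistel network never inverts
 * this function, so any deterministic function of (R, K) will do. */
static uint32_t toy_round(uint32_t r, uint32_t k)
{
    uint32_t x = r ^ k;
    x = (x << 5) | (x >> 27);            /* rotate left by 5 */
    return x * 0x9E3779B1u + k;
}

/* Run 16 rounds over a 64-bit block held as two 32-bit halves.
 * With decrypt != 0 the round keys are used in reverse order. */
static void feistel(uint32_t *l, uint32_t *r,
                    const uint32_t keys[ROUNDS], int decrypt)
{
    assert(l != NULL && r != NULL && keys != NULL);
    uint32_t L = *l, R = *r;
    for (int i = 0; i < ROUNDS; i++) {
        uint32_t k = keys[decrypt ? ROUNDS - 1 - i : i];
        uint32_t next_r = L ^ toy_round(R, k);   /* R[i+1] = L[i] ^ f(R[i], K) */
        L = R;                                   /* L[i+1] = R[i] */
        R = next_r;
    }
    /* Swap the halves on output so the same routine undoes itself. */
    *l = R;
    *r = L;
}
```

Calling `feistel` once with `decrypt = 0` and then again with `decrypt = 1` on the result returns the original block, which is why one datapath can serve both directions.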
6The SW engineer very proudly presented
static unsigned permute(
    unsigned char *table,
    int n,
    unsigned hi,
    unsigned lo)
{
    int ib, ob;
    unsigned out = 0;
    /* build the result one output bit at a time */
    for (ob = 0; ob < n; ob++) {
        ib = table[ob] - 1;   /* table entries are 1-based bit numbers */
        if (ib >= 32) {
            if (hi & (1u << (ib - 32))) out |= 1u << ob;
        } else {
            if (lo & (1u << ib)) out |= 1u << ob;
        }
    }
    return out;
}
This code is fast
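A small usage sketch may make the cost of the table-driven approach concrete: the loop does one table lookup, one test, and one OR per output bit. The table `REVERSE8` and the helper `reverse_low_byte` are hypothetical and exist only for this illustration, and the code assumes `unsigned` is 32 bits wide.

```c
#include <assert.h>

/* Bit permutation in the style of the slide: output bit ob takes the
 * input bit numbered table[ob] (1-based) from the 64-bit pair hi:lo. */
static unsigned permute(const unsigned char *table, int n,
                        unsigned hi, unsigned lo)
{
    unsigned out = 0;
    assert(n >= 0 && n <= 32);           /* out holds at most 32 bits */
    for (int ob = 0; ob < n; ob++) {
        int ib = table[ob] - 1;
        unsigned src = (ib >= 32) ? (hi >> (ib - 32)) : (lo >> ib);
        if (src & 1u)
            out |= 1u << ob;
    }
    return out;
}

/* Hypothetical 8-entry table that reverses the bit order of a byte. */
static const unsigned char REVERSE8[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };

static unsigned reverse_low_byte(unsigned lo)
{
    return permute(REVERSE8, 8, 0u, lo);
}
```

In hardware the same permutation is just wiring, which is the gap the next slide's remark points at.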
7The HW engineer laughed
200 cycles? I can do it in 1!!!
8The HW engineer presented
I'll show you how fast it can be
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, P Permutation, Final Permutation, State Machine]
9The SW engineer laughed
I can change this in 1 minute, can you?
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, P Permutation, Final Permutation, State Machine]
10Realizing that they each had something the other wanted
If only I didn't have to design the controller
If only I had just the instructions I need
11They decided to work together
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, P Permutation, Final Permutation, State Machine]
12and improved the SW solution by 70x
Encryption
SETKEY(K_hi, K_lo);
for (;;) {
    /* read data */
    SETDATA(D_hi, D_lo);
    DES(ENCRYPT1); DES(ENCRYPT1);
    DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT1); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    /* write encrypted data */
}
Decryption
SETKEY(K_hi, K_lo);
for (;;) {
    /* read encrypted data */
    SETDATA(D_hi, D_lo);
    DES(DECRYPT1); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT1);
    DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT1); DES(DECRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    /* write data */
}
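The order of the 1 and 2 variants in the listings follows a pattern that is easy to check in C. The sketch below assumes that the suffix of ENCRYPT1/ENCRYPT2 is the number of bit positions the 28-bit key halves are rotated in that round; the slides do not state this, and the names `ENC_ROT`, `rotr28`, `run_schedule`, and `reverse_schedule` are hypothetical. Under that assumption the sixteen amounts sum to 28, so the key halves return to their starting value after each block and SETKEY is needed only once, outside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASK28 0x0FFFFFFFu

/* Rotation amount per round, read off the ENCRYPT1/ENCRYPT2 order
 * (assumption: the suffix is the number of bit positions rotated). */
static const int ENC_ROT[16] = { 1, 1, 2, 2, 2, 2, 2, 2,
                                 1, 2, 2, 2, 2, 2, 2, 1 };

/* Rotate a 28-bit key half right by n bits. */
static uint32_t rotr28(uint32_t v, int n)
{
    assert(n >= 0);
    v &= MASK28;
    n %= 28;
    if (n == 0)
        return v;
    return ((v >> n) | (v << (28 - n))) & MASK28;
}

/* Sum of all per-round rotation amounts. */
static int total_rotation(const int sched[16])
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += sched[i];
    return sum;
}

/* Apply the whole schedule to one key half. */
static uint32_t run_schedule(uint32_t half, const int sched[16])
{
    for (int i = 0; i < 16; i++)
        half = rotr28(half, sched[i]);
    return half;
}

/* The decryption listing uses the same amounts in reverse order. */
static void reverse_schedule(const int in[16], int out[16])
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < 16; i++)
        out[i] = in[15 - i];
}
```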
13When the boss asked how,the SW engineer said
[Diagram: SW Solution with Registers, Control, Memory (Program), Datapath]
SW scorecard: Correct ✓, Efficient ✗
14and the HW engineer said
[Diagram: HW Solution]
HW scorecard: Correct ✗, Efficient ✓
15Together, they had the best of both worlds
[Diagram: ASIP between SW Solutions and HW Solutions; blocks: Registers, Control, Memory (Program), Storage, FSM, Datapath]
ASIP scorecard (SW, HW): Correct ✓, Efficient ✓
16The boss was very happy
[Chart: Optimality/integration (e.g. mW) against Flexibility/modularity (e.g. time-to-market); points: special hardware, ASIP, traditional processors (SW); each axis gap marked Δ 10x]
ASIP: Use application-specific datapath for computation
17And they worked together happily ever after
18Outline
- Using ASIP: a new design paradigm
- EEMBC: a case study
- Designing ASIPs using Xtensa and TIE
- Addressing the needs of platforms
- ASIP computing capabilities
- ASIP communication capabilities
- Challenges
19What Is EEMBC?
- EDN Embedded Microprocessor Benchmark Consortium
- Pronounced "Embassy"
- Non-profit consortium, funded by over 40 members
- Including ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba, Tensilica, and more
- Objective: Provide independently certified benchmark scores relevant to deeply embedded processor applications
- Independent laboratory recreates and certifies all benchmark results - no tricks
20EEMBC Benchmark Suites
- Five different benchmark suites
- Consumer
- Networking
- Telecom
- Automotive
- Office Automation
- Each suite comprises a range (five to sixteen) of benchmarks representative of that product category
- Example: Consumer
- Image compression, image filtering, color conversion
21Two Metrics: Out-of-box vs. Optimized
- Out-of-Box
- Benchmark C code, no manual code optimization, no assembly coding
- Optimized, or "Full-Fury"
- Conventional Processors
- Laboriously hand-tuned assembly code
- Rewriting C code to fit the architecture for VLIW or SIMD machines
- Changing Code to Fit the Processor
- Xtensa
- Optimized processor using Xtensa processor generator and TIE Compiler
- Changing Processor to Fit the Application!!
22Xtensa Optimization Process
- Step 1: Configure processor via generator GUI
- Compile C code, evaluate results
- Modify configuration as needed
- "Out of Box" results measurement taken here
- Step 2: Profile code, add TIE
- Step 3: Optimize code to utilize TIE instructions
- "Optimized" results measured on final hardware configuration
Same Path Used by Tensilica Customers!
23Optimized Xtensa Configurations for EEMBC
OUT-OF-BOX: Configured Xtensa (using GUI click-box options), unmodified C code
OPTIMIZED: Configured Xtensa plus TIE gates/instructions, C-code optimizations

Configuration   Out-of-box (200MHz)                                 Optimized (200MHz)
Consumer        25000 base gates + 37600 config. gates (62.6K)      127K total gates (62.6K + 64.1K TIE)
Network         25000 base gates + 25000 config. gates (50K)        59K total gates (50K + 9.2K TIE)
Telecom         25000 base gates + 37000 config. gates              180K total gates (incl. VECTRA and 18K TIE)

Illustrations conceptual; see EEMBC report for full details
24EEMBC Consumer Benchmark
[Bar chart: Consumermark by processor; highlighted bars: Optimized Xtensa, Out-of-box Xtensa]
25EEMBC Consumer Benchmark
[Bar chart: Consumermark / MHz by processor; highlighted bars: Optimized Xtensa, Out-of-box Xtensa]
26EEMBC Networking Benchmark
[Bar chart: Netmark by processor; highlighted bars: AMD K6, Optimized Xtensa, Out-of-box Xtensa]
27EEMBC Networking Benchmark
[Bar chart: Netmark / MHz by processor; highlighted bars: Optimized Xtensa, Out-of-box Xtensa, AMD K6]
28EEMBC Telecom Benchmark
[Bar chart: Telemark by processor; highlighted bars: BOPS 2x2, Optimized Xtensa, Out-of-box Xtensa]
29EEMBC Telecom Benchmark
[Bar chart: Telemark / MHz by processor; highlighted bars: BOPS 2x2, Optimized Xtensa, Out-of-box Xtensa; labeled value: 1.67]
30Outline
- Using ASIP: a new design paradigm
- EEMBC: a case study
- Designing ASIPs using Xtensa and TIE
- Addressing the needs of platforms
- ASIP computing capabilities
- ASIP communication capabilities
- Challenges
31ASIP Generation Flow
- Inputs: select processor options; describe new instructions
- Xtensa Processor Generator
- Outputs, in minutes:
- Tailored, synthesizable HDL uP core (ALU, I/O, Timer, Pipe, Cache, MMU, Register File)
- Optimizing C/C++ Compiler
- Cycle-accurate Simulator
- Assembler
- Linker
- C/C++/asm/inst Debugger
- RTOS
32Tensilica Instruction Extension (TIE) Lang.
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
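A plain C model may help read the PMAC description: two 40-bit accumulators, each updated with the product of one pair of 16-bit halves of the two source registers. The sketch assumes the operators in the semantic block are add and multiply and treats the halves as unsigned; the type `pmac_state` and the function `pmac` are hypothetical names for this model, not part of TIE or of any Tensilica API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Both accumulators are declared as 40-bit states. */
#define ACC_MASK ((UINT64_C(1) << 40) - 1)

typedef struct {
    uint64_t acc1;   /* models state ACC1 (40 bits used) */
    uint64_t acc2;   /* models state ACC2 (40 bits used) */
} pmac_state;

/* Software model of one PMAC issue (assumed unsigned arithmetic):
 *   ACC1 += ars[15:0]  * art[15:0]
 *   ACC2 += ars[31:16] * art[31:16]
 * with both sums wrapping at 40 bits. */
static void pmac(pmac_state *st, uint32_t ars, uint32_t art)
{
    assert(st != NULL);
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (art >> 16);
    st->acc1 = (st->acc1 + lo) & ACC_MASK;
    st->acc2 = (st->acc2 + hi) & ACC_MASK;
}
```

The model needs two multiplies, two adds, and several shifts and masks per call; the TIE version performs the same work as a single instruction.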
33Outline
- Using ASIP: a new design paradigm
- EEMBC: a case study
- Designing ASIPs using Xtensa and TIE
- Addressing the needs of platforms
- ASIP computing capabilities
- ASIP communication capabilities
- Challenges
34Sample platforms
Vitesse PRISM IQ2000
Intel IXP1200
Motorola C-Port CDP C-5
PMC-Sierra VoIP Gateway
35Observations
- Heterogeneous processing elements
- General purpose processors
- Micro-controllers
- Dedicated blocks
- Heterogeneous communication links
- Bandwidth
- Latency
- Hardware overhead
- Communication overhead
36Two Legs Of Platform Design
Platform Designer
Processing Element Design
Communication Design
37Outline
- Using ASIP: a new design paradigm
- EEMBC: a case study
- Designing ASIPs using Xtensa and TIE
- Addressing the needs of platforms
- ASIP computing capabilities
- ASIP communication capabilities
- Challenges
38ASIP requirements
- Match the performance of hard-wired logic
- Offer variety of performance/cost points
- Easy to design
- Easy to use
39Fixed Processors Cannot Replace ASIC
[Block diagram: Decoder, RF0, Source, FU0, Control, Result]
40Adding Customized Function Units to Break Temporal Bottleneck
[Block diagram: Decoder, RF0, Source routing, FU0, FU1, FU2, FU3, Control, Result routing]
41Example of Customized Functional Unit
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
42Effectiveness of Customized Functional Unit
- Requirements
- Performance: similar
- Cost: similar
- Ease of design: similar
- TIE: assign ACC1 = ACC1 + ars[15:0] * art[15:0];
- Ease of use: much easier
- C: PMAC(x, y);
43Adding Processor States to Break Spatial Bottleneck
[Block diagram: Decoder, S1, S0, Source routing, Control, Result routing]
44Example of Processor States
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
45Effectiveness of Processor States
- Requirements
- Performance: better
- Especially when used with pipelined functional units
- Cost: higher due to pipelined implementation
- Ease of design: very simple
- state ACC1 40
- Ease of use: very easy
- PMAC(x, y); /* implicitly using the states */
- x = R_ACC1_Lo(); W_ACC1_Hi(y);
46Sharing States Using Register Files
[Block diagram: Decoder, S1, S0, Source routing, Control, Result routing]
47Example of a Register File
regfile RF24 24 16 r
operand vs s {RF24[s]}
operand vt t {RF24[t]}
operand vr r {RF24[r]}
iclass rrr {average} {out vr, in vs, in vt}
reference average {
    wire [8:0] t2 = vs[23:16] + vt[23:16];
    wire [8:0] t1 = vs[15:8] + vt[15:8];
    wire [8:0] t0 = vs[7:0] + vt[7:0];
    assign vr = {t2[8:1], t1[8:1], t0[8:1]};
}
ctype rgb 24 32 RF24
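The reference semantics of the average instruction can be mirrored in C as a check on what the 9-bit wires do: each 8-bit field is added with one extra bit of headroom and the sum is then halved by dropping the lowest bit. The sketch below is a software model only; the typedef `rgb` stands in for the 24-bit ctype, and reading the three fields as red, green, and blue channels is an assumption based on that name.

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for the 24-bit ctype: three 8-bit fields at
 * bits [23:16], [15:8], and [7:0]. */
typedef uint32_t rgb;

/* Per-field average, truncating toward zero like t[8:1] in the
 * reference block. */
static rgb average(rgb vs, rgb vt)
{
    assert((vs >> 24) == 0 && (vt >> 24) == 0);   /* values fit in 24 bits */
    uint32_t t2 = ((vs >> 16) & 0xFFu) + ((vt >> 16) & 0xFFu);
    uint32_t t1 = ((vs >> 8) & 0xFFu) + ((vt >> 8) & 0xFFu);
    uint32_t t0 = (vs & 0xFFu) + (vt & 0xFFu);
    return ((t2 >> 1) << 16) | ((t1 >> 1) << 8) | (t0 >> 1);
}
```

Keeping each sum in a wider temporary is what prevents a carry from one field leaking into the next, the same job the 9-bit wires do in the TIE description.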
48Crossing the HW/SW Boundary
- Working with typed data
- rgb x, y, z; /* C code */
- Letting C-Compiler allocate the registers
- z = average(x, y); /* assembly: average v1, v4, v6 */
- Letting C-Compiler spill the registers
- Letting C-Compiler convert to/from other types
- yuv a, b;
- b = average(a, y);
- Auto saved/restored on context switching
49Effectiveness of Register File
- Requirements
- Performance: better
- Especially when used with pipelined functional units
- Cost: higher due to pipelined implementation
- Ease of design: very simple
- regfile RF24 24 16 r
- Ease of use: very easy
- rgb x, y, z;
- z = average(x, y);
50Multi-cycle Instructions
[Block diagram: Decoder, Source routing, Control, Result routing]
51Example of a Multi-cycle Instruction
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
[Pipeline diagram: ars, art, ACC1, ACC2]
52Effectiveness of Multi-cycle Instructions
- Requirements
- Performance: usually better
- difficult in hard-wired logic
- Cost: higher due to bypass and interlock logic
- Ease of design: very simple
- use arr 3
- Ease of use: very easy and optimized by C Compiler
- C: t = sat_mult(x, y); z = sat_add(z, t); t2 = sat_mult(x2, y2);
- Assembly: sat_mult s3, s1, s2; sat_mult s6, s5, s4; sat_add s7, s7, s3
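The reordering shown in the assembly can be checked with a small C model: the second multiply does not depend on the first, so it can be issued while the first result is still in flight, and the add that needs that result moves last. The semantics of `sat_mult` and `sat_add` are not given on the slides; the sketch assumes 32-bit signed saturating arithmetic, and `source_order`, `scheduled_order`, and `self_check` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the 32-bit signed range. */
static int32_t saturate(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Assumed semantics: 32-bit signed saturating add and multiply. */
static int32_t sat_add(int32_t a, int32_t b)  { return saturate((int64_t)a + b); }
static int32_t sat_mult(int32_t a, int32_t b) { return saturate((int64_t)a * b); }

/* Statement order as written in the C source on the slide. */
static int32_t source_order(int32_t z, int32_t x, int32_t y,
                            int32_t x2, int32_t y2, int32_t *t2)
{
    int32_t t = sat_mult(x, y);
    z = sat_add(z, t);
    *t2 = sat_mult(x2, y2);
    return z;
}

/* Order of the generated assembly: the independent multiply is
 * hoisted above the add that waits for the first multiply. */
static int32_t scheduled_order(int32_t z, int32_t x, int32_t y,
                               int32_t x2, int32_t y2, int32_t *t2)
{
    int32_t t = sat_mult(x, y);
    *t2 = sat_mult(x2, y2);
    return sat_add(z, t);
}

/* Both orders must give the same results. */
static void self_check(void)
{
    int32_t a, b;
    assert(source_order(10, 3, 4, 5, 6, &a) == scheduled_order(10, 3, 4, 5, 6, &b));
    assert(a == b);
}
```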
53Replacing the State Machine
[Block diagram: program, Decoder, Source routing, Control, Result routing]
54Effectiveness of Control Programming
- Requirements
- Performance: comparable
- 0-overhead loop, branch prediction, scheduling
- Cost: comparable
- Ease of design: very simple
- reference BT , assign BranchTarget
- Ease of use: very easy
- while
- for
- if then else
- switch
- goto
- function call
55Short Summary of ASIP Computing Capability
- ASIP
- Performance: comparable
- Cost: higher due to pipelined implementation
- Ease of design: easy using Xtensa/TIE
- Ease of use: very easy using optimizing compiler
56Meet the Communication Requirements
Platform Designer
Processing Element Design
Communication Design
57Ways for ASIP to Communicate
[Block diagram: ASIP with Functional Units, Load/Store Units, I-RAM, D-RAM, I-Cache, D-Cache, External Interface, and Interrupt input; Processor Interface (PIF) connecting to MEM and Device]
58Communicate Via PIF and Shared Memory
- Pros
- Simple
- Low cost
- Standard
- Cons
- Long latency
- Limited by PIF width
- Resource contention
- Polling
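The polling cost listed among the cons can be pictured with a one-word mailbox in shared memory: the sender sets a flag after writing the data, and the receiver has to keep reading the flag until it changes. This is a minimal single-threaded sketch with hypothetical names (`mailbox`, `mailbox_try_send`, `mailbox_try_recv`); a real multi-processor system would also need the memory barriers and cache handling of the target platform, which are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-word mailbox placed in memory that both sides can reach. */
typedef struct {
    volatile uint32_t full;   /* 1 while data holds an unread word */
    volatile uint32_t data;
} mailbox;

/* Sender side: fails if the previous word has not been consumed. */
static bool mailbox_try_send(mailbox *m, uint32_t word)
{
    assert(m != NULL);
    if (m->full)
        return false;
    m->data = word;
    m->full = 1;              /* publish only after the data is written */
    return true;
}

/* Receiver side: one polling attempt; callers loop until it succeeds. */
static bool mailbox_try_recv(mailbox *m, uint32_t *word)
{
    assert(m != NULL && word != NULL);
    if (!m->full)
        return false;
    *word = m->data;
    m->full = 0;              /* hand the slot back to the sender */
    return true;
}
```

Every failed `mailbox_try_recv` is a wasted trip over the shared interface, which is where the latency and contention entries in the list come from.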
59Communicate Via Interrupts
- Pros
- Simple
- Low cost
- Standard
- Event driven
- Cons
- Very low bandwidth
60Communicate Via Dual-ported Local Memory
- Pros
- Fast
- Cons
- High cost
- Special programming
- Limited bandwidth
61Communicate Via Local Memory Port
- Pros
- Configurable
- Low latency
- Low cost
- Cons
- Non-standard
- Limited bandwidth
- Special programming
- External HW design
- Exposed to ASIP pipeline
62Communicate Via Processor States
- Pros
- Highly configurable
- Low latency
- Low cost
- High bandwidth
- Cons
- Non-standard
- Special programming
- One-way
- Restricted to level signal
- External HW design
63Communicate Via Instructions
- Pros
- Highly configurable
- No latency
- Very low cost
- High bandwidth
- Cons
- Non-standard
- Special programming
- Restricted to edge signal
- External HW design
- Exposed to ASIP pipeline
64Outline
- Using ASIP: a new design paradigm
- EEMBC: a case study
- Designing ASIPs using Xtensa and TIE
- Addressing the needs of platforms
- ASIP computing capabilities
- ASIP communication capabilities
- Challenges
65ASIP Challenges
- Balance computation and communication
- Performance, cost, power
- Choose the right instructions
- Flexibility, product longevity, ease of programming
- Let HW engineers design ASIP
- No FSMs!
- Let SW engineers design ASIP
- Efficient functional units!
- Support variety of communication
- Separation of platform designs and system designs