The Role Of ASIP - PowerPoint PPT Presentation

Slides: 66
Provided by: berniero8

Title: The Role Of ASIP


1
The Role Of ASIP In Programmable Platforms
2
Outline
  • Using ASIP: a new design paradigm
  • EEMBC: a case study
  • Designing ASIP using Xtensa and TIE
  • Addressing the needs of platforms
  • ASIP computing capabilities
  • ASIP communication capabilities
  • Challenges

3
A short story of a design paradigm shift
4
Once upon a time
How do I solve the encryption problem?
5
Data Encryption Standard (DES)
  • Initial step
      • (R, L) ← Initial_permutation(Din64)
  • Iterate 16 times
      • Key generation
          • (C, D) ← PC1(k)
          • n ← rotate_amount (function of iteration count)
          • C ← rotate_right(C, n)
          • D ← rotate_right(D, n)
          • K ← PC2(D, C)
      • Encryption
          • R_{i+1} ← L_i ⊕ Permutation(S_Box(K ⊕ Expansion(R_i)))
          • L_{i+1} ← R_i
  • Final step
      • Dout64 ← Final_permutation(L, R)
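The "Iterate 16 times" step above is a Feistel ladder: each round computes the new right half from the old left half and a round function, then the halves swap roles. The sketch below shows only that structure. The round function `toy_round` and the key array are placeholders invented for illustration; they are not the DES expansion, S-boxes, or key schedule, and the initial and final permutations are omitted.

```c
#include <stdint.h>

#define ROUNDS 16

/* Placeholder round function (assumption): NOT the DES f-function. */
static uint32_t toy_round(uint32_t r, uint32_t k)
{
    uint32_t x = r ^ k;
    return ((x << 3) | (x >> 29)) + 0x9E3779B9u;
}

/* Run one 64-bit block through the 16-round ladder.
 * keys[i] is the round key for round i; decryption walks the keys backwards. */
static uint64_t feistel(uint64_t block, const uint32_t keys[ROUNDS], int decrypt)
{
    uint32_t l = (uint32_t)(block >> 32);
    uint32_t r = (uint32_t)block;

    for (int i = 0; i < ROUNDS; i++) {
        uint32_t k = keys[decrypt ? ROUNDS - 1 - i : i];
        uint32_t next_r = l ^ toy_round(r, k); /* R_{i+1} = L_i xor f(R_i, K) */
        l = r;                                 /* L_{i+1} = R_i */
        r = next_r;
    }
    /* Swap the halves on output so the same routine inverts itself. */
    return ((uint64_t)r << 32) | l;
}
```

Calling `feistel(c, keys, 1)` on a ciphertext `c` produced by `feistel(p, keys, 0)` returns `p`, whatever round function is plugged in. That property is why the hardware and software versions in the story can share one datapath for encryption and decryption.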

6
The SW engineer very proudly presented
    static unsigned permute(
        unsigned char *table,
        int n,
        unsigned hi,
        unsigned lo)
    {
        int ib, ob;
        unsigned out = 0;
        /* table[ob] is the 1-based source bit for output bit ob */
        for (ob = 0; ob < n; ob++) {
            ib = table[ob] - 1;
            if (ib >= 32) {
                if (hi & (1u << (ib - 32))) out |= 1u << ob;
            } else {
                if (lo & (1u << ib)) out |= 1u << ob;
            }
        }
        return out;
    }
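To make the table convention concrete, here is a compilable version of the routine with a small usage check. It assumes, as the slide's `- 1` suggests, that table entries are 1-based source bit positions, with positions 1 to 32 taken from `lo` and 33 to 64 from `hi`. The tables `rev8` and `pick_hi` are made up for illustration and are not DES tables.

```c
#include <stdint.h>

/* Bit permutation driven by a table of 1-based source positions. */
static unsigned permute(const unsigned char *table, int n,
                        unsigned hi, unsigned lo)
{
    unsigned out = 0;
    for (int ob = 0; ob < n; ob++) {
        int ib = table[ob] - 1;            /* convert to 0-based */
        if (ib >= 32) {
            if (hi & (1u << (ib - 32))) out |= 1u << ob;
        } else {
            if (lo & (1u << ib)) out |= 1u << ob;
        }
    }
    return out;
}

/* Hypothetical tables: reverse the low 8 bits; pick the lowest bit of hi. */
static const unsigned char rev8[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
static const unsigned char pick_hi[1] = { 33 };
```

With these tables, `permute(rev8, 8, 0, 0xB4)` yields `0x2D`, the bit-reversed byte. The loop does a table lookup, a test, and a conditional OR for every output bit, which is where a software permutation spends its cycles and why a wired permutation, which is only routing, is so attractive.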

This code is fast
7
The HW engineer laughed
200 cycles? I can do it in 1!!!
  • Initial step
      • (R, L) ← Initial_permutation(Din64)
  • Iterate 16 times
      • Key generation
          • (C, D) ← PC1(k)
          • n ← rotate_amount (function of iteration count)
          • C ← rotate_right(C, n)
          • D ← rotate_right(D, n)
          • K ← PC2(D, C)
      • Encryption
          • R_{i+1} ← L_i ⊕ Permutation(S_Box(K ⊕ Expansion(R_i)))
          • L_{i+1} ← R_i
  • Final step
      • Dout64 ← Final_permutation(L, R)
8
The HW engineer presented
I'll show you how fast it can be
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, State Machine, P Permutation, Final Permutation]
9
The SW engineer laughed
I can change this in 1 minute, can you?
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, State Machine, P Permutation, Final Permutation]
10
Realizing that they each had something the other wanted
If only I didn't have to design the controller
If only I had just the instruction I need
11
They decided to work together
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, State Machine, P Permutation, Final Permutation]
12
and improved the SW solution by 70x
Encryption
    SETKEY(K_hi, K_lo);
    for (;;) {
        /* read data */
        SETDATA(D_hi, D_lo);
        DES(ENCRYPT1); DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT1);
        E_hi = GETDATA(hi); E_lo = GETDATA(lo);
        /* write encrypted data */
    }
Decryption
    SETKEY(K_hi, K_lo);
    for (;;) {
        /* read encrypted data */
        SETDATA(D_hi, D_lo);
        DES(DECRYPT1); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1); DES(DECRYPT1);
        E_hi = GETDATA(hi); E_lo = GETDATA(lo);
        /* write data */
    }
13
When the boss asked how, the SW engineer said
SW Solution
[Diagram labels: Registers, Control, Memory (Program), Datapath]
  • SW: Correct: yes; Efficient: no
14
and the HW engineer said
HW Solution
  • HW: Correct: no; Efficient: yes
15
Together, they had the best of both worlds
ASIP
[Diagram labels: SW Solutions, HW Solutions, Registers, Control, Memory (Program), Storage, FSM, Datapath]
  • SW: Correct: yes
  • HW: Efficient: yes
16
The boss was very happy
[Chart: vertical axis "Optimality/integration (e.g. mW, )"; horizontal axis "Flexibility/modularity (e.g. time-to-market)". Points: special hardware, ASIP, traditional processors (SW). A gap of Δ 10x is marked on each axis.]
ASIP: use an application-specific datapath for computation
17
And they worked together happily ever after
18
Outline
  • Using ASIP: a new design paradigm
  • EEMBC: a case study
  • Designing ASIPs using Xtensa and TIE
  • Addressing the needs of platforms
  • ASIP computing capabilities
  • ASIP communication capabilities
  • Challenges

19
What Is EEMBC?
  • EDN Embedded Microprocessor Benchmark Consortium
      • Pronounced "Embassy"
  • Non-profit consortium, funded by over 40 members
      • Including ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba, Tensilica, and more
  • Objective: provide independently certified benchmark scores relevant to deeply embedded processor applications
  • Independent laboratory recreates and certifies all benchmark results (no tricks)

20
EEMBC Benchmark Suites
  • Five different benchmark suites
      • Consumer
      • Networking
      • Telecom
      • Automotive
      • Office Automation
  • Each suite comprises a range (five to sixteen) of benchmarks representative of that product category
  • Example: Consumer
      • Image compression, image filtering, color conversion

21
Two Metrics: Out-of-box vs. Optimized
  • Out-of-Box
      • Benchmark C code, no manual code optimization, no assembly coding
  • Optimized, or Full-Fury
      • Conventional processors
          • Laboriously hand-tuned assembly code
          • Rewriting C code to fit the architecture for VLIW or SIMD machines
          • Changing code to fit the processor
      • Xtensa
          • Optimized processor using Xtensa processor generator and TIE Compiler
          • Changing processor to fit the application!

22
Xtensa Optimization Process
  • Step 1: Configure processor via generator GUI
      • Compile C code, evaluate results
      • Modify configuration as needed
      • Out-of-box results measurement taken here
  • Step 2: Profile code, add TIE
  • Step 3: Optimize code to utilize TIE instructions
      • Optimized results measured on final hardware configuration

Same Path Used by Tensilica Customers!
23
Optimized Xtensa Configurations for EEMBC
  • Out-of-box: configured Xtensa (using GUI click-box options), unmodified C code
  • Optimized: configured Xtensa plus TIE gates and instructions, and C-code optimizations
Consumer configuration
  • Out-of-box: 25,000 base gates + 37,600 configuration gates, 200 MHz
  • Optimized: 127K total gates (62.6K base and configuration + 64.1K TIE), 200 MHz
Network configuration
  • Out-of-box: 25,000 base gates + 25,000 configuration gates, 200 MHz
  • Optimized: 59K total gates (50K base and configuration + 9.2K TIE), 200 MHz
Telecom configuration
  • Out-of-box: 25,000 base gates + 37,000 configuration gates, 200 MHz
  • Optimized: 180K total gates (including VECTRA and 18K TIE), 200 MHz
Illustrations are conceptual; see the EEMBC report for full details.
24
EEMBC Consumer Benchmark
[Chart: Consumermark by processor, with Optimized Xtensa and Out-of-box Xtensa highlighted]
25
EEMBC Consumer Benchmark
[Chart: Consumermark / MHz by processor, with Optimized Xtensa and Out-of-box Xtensa highlighted]
26
EEMBC Networking Benchmark
[Chart: Netmark by processor, with AMD K6, Optimized Xtensa, and Out-of-box Xtensa highlighted]
27
EEMBC Networking Benchmark
[Chart: Netmark / MHz by processor, with Optimized Xtensa, Out-of-box Xtensa, and AMD K6 highlighted]
28
EEMBC Telecom Benchmark
[Chart: Telemark by processor, with BOPS 2x2, Optimized Xtensa, and Out-of-box Xtensa highlighted]
29
EEMBC Telecom Benchmark
[Chart: Telemark / MHz by processor, with BOPS 2x2, Optimized Xtensa, and Out-of-box Xtensa highlighted; labeled value: 1.67]
30
Outline
  • Using ASIP: a new design paradigm
  • EEMBC: a case study
  • Designing ASIPs using Xtensa and TIE
  • Addressing the needs of platforms
  • ASIP computing capabilities
  • ASIP communication capabilities
  • Challenges

31
ASIP Generation Flow
[Diagram: the Xtensa Processor Generator takes two inputs and produces a core plus its software tools, in minutes]
Inputs
  • Select processor options (ALU, I/O, Timer, Pipe, Cache, MMU, Register File)
  • Describe new instructions
Outputs
  • Tailored, synthesizable HDL uP core
  • Optimizing C/C++ Compiler
  • Cycle-accurate Simulator
  • Assembler
  • Linker
  • C/C++/asm/inst Debugger
  • RTOS
32
Tensilica Instruction Extension (TIE) Lang.
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }
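A plain C reference model can help check what the PMAC description computes before any hardware exists. The sketch below is an assumption-laden reading of the slide: it treats both 16-bit halves as unsigned, keeps the two accumulators in a struct named `pmac_state` (a name invented here), and wraps each accumulator at 40 bits to match `state ACC1 40`. Signedness and overflow behavior are not stated on the slide.

```c
#include <stdint.h>

#define ACC_MASK ((UINT64_C(1) << 40) - 1)  /* accumulators are 40 bits wide */

/* Hypothetical container for the two processor states. */
typedef struct {
    uint64_t acc1;
    uint64_t acc2;
} pmac_state;

/* Packed multiply-accumulate: low halves feed ACC1, high halves feed ACC2. */
static void pmac(pmac_state *s, uint32_t ars, uint32_t art)
{
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (art >> 16);

    s->acc1 = (s->acc1 + lo) & ACC_MASK;
    s->acc2 = (s->acc2 + hi) & ACC_MASK;
}
```

For example, starting from zeroed accumulators, `pmac(&s, 0x00030002, 0x00050004)` leaves `acc1 == 8` and `acc2 == 15`. The software model needs two multiplies, two adds, and the masking; the custom instruction does all of it in one issue slot.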

33
Outline
  • Using ASIP: a new design paradigm
  • EEMBC: a case study
  • Designing ASIP using Xtensa and TIE
  • Addressing the needs of platforms
  • ASIP computing capabilities
  • ASIP communication capabilities
  • Challenges

34
Sample platforms
Vitesse PRISM IQ2000
Intel IXP1200
Motorola C-Port CDP C-5
PMC-Sierra VoIP Gateway
35
Observations
  • Heterogeneous processing elements
  • General purpose processors
  • Micro-controllers
  • Dedicated blocks
  • Heterogeneous communication links
  • Bandwidth
  • Latency
  • Hardware overhead
  • Communication overhead

36
Two Legs Of Platform Design
Platform Designer
Processing Element Design
Communication Design
37
Outline
  • Using ASIP: a new design paradigm
  • EEMBC: a case study
  • Designing ASIP using Xtensa and TIE
  • Addressing the needs of platforms
  • ASIP computing capabilities
  • ASIP communication capabilities
  • Challenges

38
ASIP requirements
  • Match the performance of hard-wired logic
  • Offer variety of performance/cost points
  • Easy to design
  • Easy to use

39
Fixed Processors Cannot Replace ASIC
[Datapath diagram labels: Decoder, Control, RF0, Source, FU0, Result]
40
Adding Customized Function Units to Break
Temporal Bottleneck
[Datapath diagram labels: Decoder, Control, RF0, Source routing, FU0, FU1, FU2, FU3, Result routing]
41
Example of Customized Functional Unit
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }
42
Effectiveness of Customized Functional Unit
  • Requirements
      • Performance: similar
      • Cost: similar
      • Ease of design: similar
          • TIE: assign ACC1 = ACC1 + ars[15:0] * art[15:0];
      • Ease of use: much easier
          • C: PMAC(x, y);

43
Adding Processor States to Break Spatial
Bottleneck
[Datapath diagram labels: Decoder, Control, S0, S1, Source routing, Result routing]
44
Example of Processor States
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }
45
Effectiveness of Processor States
  • Requirements
      • Performance: better
          • Especially when used with pipelined functional units
      • Cost: higher due to pipelined implementation
      • Ease of design: very simple
          • state ACC1 40
      • Ease of use: very easy
          • PMAC(x, y); /* implicitly using the states */
          • x = R_ACC1_Lo(); W_ACC1_Hi(y);

46
Sharing States Using Register Files
[Datapath diagram labels: Decoder, Control, S0, S1, Source routing, Result routing]
47
Example of a Register File
    regfile RF24 24 16 r
    operand vs s {RF24[s]}
    operand vt t {RF24[t]}
    operand vr r {RF24[r]}
    iclass rrr {average} {out vr, in vs, in vt}
    reference average {
        wire [8:0] t2 = vs[23:16] + vt[23:16];
        wire [8:0] t1 = vs[15:8] + vt[15:8];
        wire [8:0] t0 = vs[7:0] + vt[7:0];
        assign vr = {t2[8:1], t1[8:1], t0[8:1]};
    }
    ctype rgb 24 32 RF24
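The `average` instruction above adds each 8-bit channel into a 9-bit wire and keeps bits 8 down to 1, which is a per-channel mean that cannot overflow into the neighboring channel. A C model of the same arithmetic is sketched below. It assumes the 24-bit `rgb` value sits in the low bits of a 32-bit word; the `typedef` stands in for the `ctype` declaration and is not how the generated compiler represents the type.

```c
#include <stdint.h>

typedef uint32_t rgb;  /* assumption: low 24 bits hold three 8-bit channels */

/* Per-channel average of two packed RGB values, truncating toward zero. */
static rgb average(rgb vs, rgb vt)
{
    uint32_t t2 = ((vs >> 16) & 0xFFu) + ((vt >> 16) & 0xFFu); /* 9-bit sum */
    uint32_t t1 = ((vs >> 8) & 0xFFu) + ((vt >> 8) & 0xFFu);
    uint32_t t0 = (vs & 0xFFu) + (vt & 0xFFu);

    /* Keep bits [8:1] of each sum, as in the TIE concatenation. */
    return ((t2 >> 1) << 16) | ((t1 >> 1) << 8) | (t0 >> 1);
}
```

`average(0xFF0010, 0x010020)` gives `0x800018`: the top channel sums to 0x100 and halves to 0x80 without disturbing the others. On a fixed 32-bit datapath this takes about a dozen shift, mask, and add operations; with the register file and instruction it is one.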
48
Crossing the HW/SW Boundary
  • Working with typed data
  • rgb x, y, z / C code /
  • Letting C-Compiler allocate the registers
  • z average(x, y) / assembly average v1, v4,
    v6 /
  • Letting C-Compiler spill the registers
  • Letting C-Compiler convert to/from other types
  • yuv a, b
  • b average (a, y)
  • Auto saved/restored on context switching

49
Effectiveness of Register File
  • Requirements
      • Performance: better
          • Especially when used with pipelined functional units
      • Cost: higher due to pipelined implementation
      • Ease of design: very simple
          • regfile RF24 24 16 r
      • Ease of use: very easy
          • rgb x, y, z;
          • z = average(x, y);

50
Multi-cycle Instructions
[Datapath diagram labels: Decoder, Control, Source routing, Result routing]
51
Example of a Multi-cycle Instruction
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }
[Pipeline diagram labels: ars, art, ACC1, ACC2]
52
Effectiveness of Multi-cycle Instructions
  • Requirements
      • Performance: usually better
          • difficult in hard-wired logic
      • Cost: higher due to bypass and interlock logic
      • Ease of design: very simple
          • use arr 3
      • Ease of use: very easy and optimized by C Compiler

C code:
    t = sat_mult(x, y);
    z = sat_add(z, t);
    t2 = sat_mult(x2, y2);
Scheduled assembly:
    sat_mult s3, s1, s2
    sat_mult s6, s5, s4
    sat_add s7, s7, s3
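The slide does not define `sat_mult` and `sat_add`. Assuming they are 32-bit signed saturating operations, which is one common meaning of the names, a C model might look like the sketch below. The helper `saturate` and the operand widths are assumptions made for illustration.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate into the signed 32-bit range. */
static int32_t saturate(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Assumed semantics: the result sticks at the range limit instead of wrapping. */
static int32_t sat_add(int32_t a, int32_t b)
{
    return saturate((int64_t)a + b);
}

static int32_t sat_mult(int32_t a, int32_t b)
{
    return saturate((int64_t)a * b);
}
```

In the slide's example the second `sat_mult` does not depend on the first, so the compiler can move it between the first multiply and the `sat_add` that consumes its result. If the multiply takes more than one cycle, that independent instruction fills what would otherwise be a stall, which is the point of the scheduled assembly shown above.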
53
Replacing the State Machine
[Datapath diagram labels: program, Decoder, Control, Source routing, Result routing]
54
Effectiveness of Control Programming
  • Requirements
      • Performance: comparable
          • 0-overhead loop, branch prediction, scheduling
      • Cost: comparable
      • Ease of design: very simple
          • reference BT , assign BranchTarget
      • Ease of use: very easy
          • while
          • for
          • if then else
          • switch
          • goto
          • function call

55
Short Summary of ASIP Computing Capability
  • ASIP
      • Performance: comparable
      • Cost: higher due to pipelined implementation
      • Ease of design: easy using Xtensa/TIE
      • Ease of use: very easy using optimizing compiler

56
Meet the Communication Requirements
Platform Designer
Processing Element Design
Communication Design
57
Ways for ASIP to Communicate
[Diagram: an ASIP containing Functional Units, Load/Store Units, and an External Interface, with I-RAM, D-RAM, I-Cache, D-Cache, the Processor Interface (PIF), and an Interrupt input; MEM and Device sit outside the ASIP]
58
Communicate Via PIF and Shared Memory
  • Pros
  • Simple
  • Low cost
  • Standard
  • Cons
  • Long latency
  • Limited by PIF width
  • Resource contention
  • Polling
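The "Polling" drawback is easiest to see in code. Below is a minimal sketch of a one-word mailbox in shared memory, with one side writing and the other side checking a flag. The `mailbox` layout and function names are invented for illustration. On a real multiprocessor platform the structure would need to live in an uncached or coherent region and the accesses would need ordering guarantees, none of which this single-threaded sketch provides.

```c
#include <stdint.h>

/* Hypothetical mailbox placed in memory both processors can reach. */
typedef struct {
    volatile uint32_t ready;  /* 0 = empty, 1 = full */
    volatile uint32_t data;
} mailbox;

/* Producer side: returns 1 if the word was posted, 0 if the box is still full. */
static int mailbox_send(mailbox *m, uint32_t value)
{
    if (m->ready) return 0;
    m->data = value;
    m->ready = 1;
    return 1;
}

/* Consumer side: returns 1 and fills *value if a word was waiting, else 0. */
static int mailbox_poll(mailbox *m, uint32_t *value)
{
    if (!m->ready) return 0;
    *value = m->data;
    m->ready = 0;
    return 1;
}
```

Every call to `mailbox_poll` that returns 0 is a wasted trip across the PIF, and each transfer costs at least two bus accesses per side. That overhead is what the latency, contention, and polling entries in the list refer to.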

59
Communicate Via Interrupts
  • Pros
  • Simple
  • Low cost
  • Standard
  • Event driven
  • Cons
  • Very low bandwidth

60
Communicate Via Dual-ported Local Memory
  • Pros
  • Fast
  • Cons
  • High cost
  • Special programming
  • Limited bandwidth

61
Communicate Via Local Memory Port
  • Pros
  • Configurable
  • Low latency
  • Low cost
  • Cons
  • Non-standard
  • Limited bandwidth
  • Special programming
  • External HW design
  • Exposed to ASIP pipeline

62
Communicate Via Processor States
  • Pros
  • Highly configurable
  • Low latency
  • Low cost
  • High bandwidth
  • Cons
  • Non-standard
  • Special programming
  • One-way
  • Restricted to level signal
  • External HW design

63
Communicate Via Instructions
  • Pros
  • Highly configurable
  • No latency
  • Very low cost
  • High bandwidth
  • Cons
  • Non-standard
  • Special programming
  • Restricted to edge signal
  • External HW design
  • Exposed to ASIP pipeline

64
Outline
  • Using ASIP: a new design paradigm
  • EEMBC: a case study
  • Designing ASIP using Xtensa and TIE
  • Addressing the needs of platforms
  • ASIP computing capabilities
  • ASIP communication capabilities
  • Challenges

65
ASIP Challenges
  • Balance computation and communication
  • Performance, cost, power
  • Choose the right instructions
  • Flexibility, product longevity, ease of
    programming
  • Let HW engineers design ASIP
  • No FSMs!
  • Let SW engineers design ASIP
  • Efficient functional units!
  • Support variety of communication
  • Separation of platform designs and system designs