Title: Platform-based Design
1Platform-based Design
2DSP
Programmable CPU
Programmable DSP
Application specific instruction set processor
(ASIP)
Application specific processor
3[Figure: efficiency (high, medium, low) against flexibility (low, medium, high); architectures ordered from most efficient to most flexible: ASIC, ASIP, DSP, then GP proc and FPGA]
4Programmable CPU cores
- introduction
- architecture of the MIPS core
- discussed as an example
- pipelining
- application examples
- software issues
- comparison between different CPU cores
- towards application specific architectures
- discussion
5Introduction
- rationale: general-purpose -> large market
- consequence: often handcrafted design, optimised for clock rate
- problem: fast changes in the IC process technology
- examples (embedded):
  - MIPS (first one, licensing the instruction set architecture)
  - ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, also licensing the micro-architecture as hard or soft IP)
  - derivatives from general-purpose CPUs: Intel, NEC, Hitachi, National, PowerPC
6Introduction
Instruction set architectures
implicit operands
explicit operands
7Architecture of the MIPS core
Hennessy Patterson
8MIPS instruction formats (32 bits)
Hennessy Patterson
- op: operation of the instruction
- rs, rt, rd: source and destination registers
- shamt: shift amount
- funct: operation of the instruction (part 2)
- imm: for program constants
- addr: target address of a jump
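The field names listed above can be pulled out of a 32-bit instruction word with shifts and masks. The sketch below assumes the standard MIPS32 field widths (6-bit op, 5-bit rs/rt/rd/shamt, 6-bit funct, 16-bit imm, 26-bit addr); the function name `decode` and the dictionary layout are illustrative only.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields.

    All fields are extracted regardless of format; which ones are
    meaningful depends on whether the word is R-, I- or J-type.
    """
    word &= 0xFFFFFFFF
    return {
        "op":    (word >> 26) & 0x3F,   # bits 31..26
        "rs":    (word >> 21) & 0x1F,   # bits 25..21
        "rt":    (word >> 16) & 0x1F,   # bits 20..16
        "rd":    (word >> 11) & 0x1F,   # bits 15..11 (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6  (R-type)
        "funct": word & 0x3F,           # bits 5..0   (R-type)
        "imm":   word & 0xFFFF,         # bits 15..0  (I-type)
        "addr":  word & 0x3FFFFFF,      # bits 25..0  (J-type)
    }

# R-type example: add with rd = 8, rs = 9, rt = 10
fields = decode(0x012A4020)
```

For this word the decoder yields op 0, rs 9, rt 10, rd 8 and funct 0x20, the R-type add.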
9Example 1 R - type add instruction
Hennessy Patterson
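10Critical path R-type operation
Hennessy Patterson
[Figure: single-cycle datapath showing the critical path of an R-type operation: PC (Clk), instruction address, instruction memory, instruction fields Rd, Rt, Rs (5 bits each) and Imm (16 bits), register file (Rw, Ra, Rb; 32 32-bit registers; Clk), data memory (data address, data in, data out; 32 bits; Clk)]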
11Example 2 I-type load word
Hennessy Patterson
- lw rs, rt, imm16
- mem[PC]
- addr <- R[rs] + ext(imm16)
- R[rt] <- mem[addr]
- PC <- PC + 4
12Example 3 I-type branch
Hennessy Patterson
- beq rs, rt, imm16
- mem[PC]
- cond <- R[rs] - R[rt]
- if (cond == 0)
-   PC <- PC + 4 + (ext(imm16) * 4)
- else
-   PC <- PC + 4
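The register-transfer steps of the load word and branch examples can be executed directly. This is a minimal sketch, assuming that ext() is a sign extension and that memory is modelled as a Python dictionary keyed by byte address; the helper names `sign_extend16`, `step_lw` and `step_beq` are hypothetical.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit immediate as a signed two's complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def step_lw(pc, regs, mem, rs, rt, imm16):
    """lw: addr <- R[rs] + ext(imm16); R[rt] <- mem[addr]; PC <- PC + 4."""
    addr = regs[rs] + sign_extend16(imm16)
    regs[rt] = mem[addr]
    return pc + 4

def step_beq(pc, regs, rs, rt, imm16):
    """beq: branch relative to PC + 4 when R[rs] - R[rt] is zero."""
    cond = regs[rs] - regs[rt]
    if cond == 0:
        return pc + 4 + sign_extend16(imm16) * 4
    return pc + 4

regs = [0] * 32
regs[1] = 100
mem = {96: 42}                        # one word stored at address 96
pc = step_lw(0, regs, mem, rs=1, rt=2, imm16=0xFFFC)   # offset -4
pc_taken = step_beq(pc, regs, rs=3, rt=4, imm16=3)     # R[3] == R[4]
pc_not_taken = step_beq(pc, regs, rs=1, rt=2, imm16=3) # 100 != 42
```

The branch offset is counted in instructions, which is why the extended immediate is multiplied by 4 before it is added to PC + 4.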
13Example 3 I-type branch
Hennessy Patterson
[Figure: branch datapath: register file (Rw, Ra, Rb; 32 32-bit registers; Clk) addressed by Rs, Rt and Rd (5 bits each), BusA, BusB and BusW (32 bits), extender for Imm 16, ALU producing Zero, next address logic and PC feeding the instruction memory; control signals RegDst, RegWr, Branch, ALUctr, ALUSrc, ExtOp]
14Example 3 I-type branch
Hennessy Patterson
[Figure: next address logic: 30-bit PC (Clk) supplying Addr<31:2> to the instruction memory, with Addr<1:0> = 00; increment by 1; SignExt of Imm 16 (Instruction<15:0>) to 30 bits; selection controlled by Branch and Zero; Instruction<31:0>]
15Architecture of the MIPS core
- problem: long critical path
  - defined by the slowest instruction (load)
- solution? pipelining
  - break the instruction into smaller steps
  - all steps have about the same critical path
16Pipelining lw instructions
Hennessy Patterson
[Figure: three consecutive lw instructions over cycles 1 to 7, each passing through Ifetch, RF read, ALU, dmem, RF write]
- One instruction enters the pipeline every clock cycle
- One instruction leaves the pipeline every clock cycle
- -> CPI = 1 (Cycles per Instruction)
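The CPI claim can be checked with a small count. The sketch assumes an ideal pipeline without stalls, in which the first instruction needs one cycle per stage and every later instruction completes one cycle after its predecessor; the function names are illustrative.

```python
def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Cycles needed by an ideal pipeline: fill time plus one per instruction."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

def cpi(n_instructions: int, n_stages: int = 5) -> float:
    """Average cycles per instruction for a run of n_instructions."""
    return pipelined_cycles(n_instructions, n_stages) / n_instructions

three_loads = pipelined_cycles(3)   # the three lw instructions of the figure
long_run = cpi(1000)
```

Three instructions take 7 cycles, as in the lw diagram, and for long instruction streams the average approaches 1.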
17Pipelining lw instructions
[Figure: pipeline stages I, R, A, M, W with the instructions and data present in each stage during the current CPU cycle]
18Four stages of R-type instruction
Hennessy Patterson
[Figure: an R-type instruction (e.g. ADD) over cycles 1 to 4: Ifetch, RF read, ALU, RF write]
19Pipelining lw and R-type instructions
Hennessy Patterson
[Figure: an lw (Ifetch, RF read, ALU, dmem, RF write) followed by an add (Ifetch, RF read, ALU, RF write) over cycles 1 to 7]
20Solution: stretch R-type to 5 stages
Hennessy Patterson
[Figure: R-type instruction through Ifetch, RF read, ALU, dmem, RF write, with a dummy op (noop) in the dmem stage]
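21[Figure: five-stage pipelined datapath (Ifetch, Reg/dec, exec, mem, wr): next PC logic (+4, branch, flags), program memory, register file (Ra, Rb, Rw, Di; Rs, Rt, Rd; BusA, BusB), extender for Imm16, data memory (adr, Din, Dout); control signals RegWr, RegDst, ALUSrc, ExtOp, ALUop, MemWr, MemtoReg]
Hennessy Patterson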
22Data dependencies R-type instructions
Hennessy Patterson
[Figure: five consecutive pipelined instructions that each involve R1]
23Data dependencies R-type instructions
Hennessy Patterson
[Figure: five consecutive pipelined instructions that each involve R1]
Solution: bypasses
24Bypasses
Hennessy Patterson
[Figure: pipelined datapath with bypass paths; labelled blocks: adr, Data mem]
25Data dependencies load instruction
Hennessy Patterson
[Figure: an lw that loads R1, followed by three instructions that involve R1]
26Data dependencies load instruction
Hennessy Patterson
[Figure: an lw that loads R1, followed by three instructions that involve R1]
Bypass is no solution for the instruction directly after the lw
27Data dependencies load instruction
Hennessy Patterson
[Figure: an lw that loads R1 and the following instructions that involve R1, drawn with the pipeline stages IM, RF, DM, RF]
Solution: pipeline interlock detects a data hazard and stalls the pipeline until the hazard is cleared
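The interlock can be sketched as a check on each instruction entering the pipeline. The model below is a simplification under stated assumptions: bypasses exist for every other case, so only an instruction that directly follows an lw and reads the loaded register needs a stall, and one stall cycle is enough. The `Instr` tuple and `insert_stalls` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[int]      # destination register, None if there is none
    srcs: Tuple[int, ...]    # source registers

def insert_stalls(program):
    """Insert a stall after each lw whose result is used by the next instruction."""
    out = []
    for ins in program:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "lw" and prev.dest in ins.srcs:
            out.append(Instr("stall", None, ()))   # hazard: hold the pipeline
        out.append(ins)
    return out

program = [
    Instr("lw", 1, (2,)),      # R1 <- mem[R2 + ...]
    Instr("add", 3, (1, 4)),   # uses R1 directly after the load
    Instr("sub", 5, (1, 6)),   # uses R1 one instruction later
]
scheduled = [ins.op for ins in insert_stalls(program)]
```

Only the add is delayed; by the time the sub needs R1 the loaded value is available.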
28Application examples (1)
29Application examples (1)
19 instructions per tap!!
30Application examples (2)
Bit level operations finite field arithmetic
10 instructions!! Very simple in hardware
31Application examples (2)
Bit level operations DES example
32Application examples (2)
Bit level operations A5 example (GSM encryption)
33Application examples conclusions
- CPUs offer flexibility, but
- not efficient in performance
- not efficient in code size
- not efficient in power consumption
34Power Consumption in microprocessors
- Power consumption is (becoming) the limiting factor in processor design
- Solution in direction of:
  - Hardware acceleration
  - Instruction Level Parallelism instead of clock speed
  - Code size efficiency
source: ISSCC 2001, Patrick Gelsinger, Intel
35Amdahl's law
- Impact of an improvement on the execution time of a program depends on 2 parameters:
  - f: fraction of the original computation time that is affected by the improvement
  - s: speedup factor (local)
- exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
- speedup_overall = exec_time_old / exec_time_new = 1 / ((1 - f) + f / s)
- if s >> 1 then speedup_overall ≈ 1 / (1 - f)
- Example: 40% of the program can be executed 10x faster: speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
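The formula is easy to evaluate for other values of f and s. A minimal sketch, with `overall_speedup` as an illustrative name:

```python
def overall_speedup(f: float, s: float) -> float:
    """Amdahl's law: f is the affected fraction, s the local speedup."""
    return 1.0 / ((1.0 - f) + f / s)

example = overall_speedup(0.4, 10)      # the example from the slide
limit = overall_speedup(0.4, 1e12)      # s >> 1: bounded by 1 / (1 - f)
```

Even with an arbitrarily large local speedup, accelerating 40% of the program cannot give more than 1 / 0.6, about 1.67 overall.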
36Conclusions
- Programmable CPU cores are important for the control parts of the application.
- They are well supported with tools to support the development of end-user software (vs. deeply embedded sw).
- Keep it Simple heuristic (RISC vs. CISC)
  - Make frequent cases fast and rare cases correct.
  - Regular (orthogonal) instruction set
  - No special features that match a high level language construct.
  - At least 16 registers to ease register allocation.
- Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores which are optimised for performance).
37Programmable Digital Signal Processors
- instruction level parallelism (ILP)
- hardware support for loop control
- attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
- difficult to compare architectures
  - e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation can be included or forgotten
  - benchmarking (Berkeley Design Technology Inc (BDTi)) (compare to SpecInt benchmarks for CPUs)
38Outline
- architectures for programmable DSPs
- multiplier-accumulator
- modified Harvard architecture
- extension with an ALU (decision making)
- controller architectures
- examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
- examples: C6 and TM
39Goal 1 cycle per iteration
- position ACR (1 or 2)
- adder/subtractor
- extra pipelines
- asymmetric inputs
- multi-precision
- Modifications
- extra inputs/outputs
40DSP data types
- not every signal requires 32 bits
- 2 types of DSP: floating point and integer
- advantages FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
- disadvantage FP: cost (area, speed, power)
- wanted: type of output of an operation = type of input (because both stored in RAM)
  - no problem for FP, but for integer
  - integer multiplication doubles the number of bits: n x n -> 2n
- What about fractional numbers?
41DSP data types
- integer and fractional numbers are a special case of fixed point
- fix<p,q> (ART Designer, SystemC)
[Figure: fix<8,3> example with p = 8 bits in total, of which q = 3 are fractional; bit pattern 11101101 = -19/8 = -2.375; bit weights -2^4, 2^3, 2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3 (negative weight on the msb: 2's complement); scale factor 1/8; quantization error]
- Same ALU handles fix<8,1>, fix<8,2>, fix<8,3>, ...
- if q = 0 then integer, e.g. int<8,0>; if q = p-1 then fractional, e.g. int<8,7>
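The fix<p,q> interpretation amounts to reading the p-bit pattern as a 2's complement integer and scaling it by 2^-q. The sketch below illustrates this; the function names are hypothetical, and the saturation in `float_to_fix` is an assumption rather than something the slide specifies.

```python
def fix_to_float(raw: int, p: int, q: int) -> float:
    """Value of a p-bit two's complement pattern with q fractional bits."""
    raw &= (1 << p) - 1
    if raw & (1 << (p - 1)):        # msb set: it carries the negative weight
        raw -= 1 << p
    return raw / (1 << q)           # scale factor 2**-q

def float_to_fix(value: float, p: int, q: int) -> int:
    """Quantize to fix<p,q>, rounding to nearest and saturating (assumed)."""
    scaled = round(value * (1 << q))
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    scaled = max(lo, min(hi, scaled))
    return scaled & ((1 << p) - 1)

pattern = 0b11101101
as_fix_8_3 = fix_to_float(pattern, 8, 3)   # the slide's example
as_integer = fix_to_float(pattern, 8, 0)   # q = 0: integer
as_fraction = fix_to_float(pattern, 8, 7)  # q = p - 1: fractional
```

The same bit pattern is -2.375, -19 or -19/128 depending only on q, which is why one ALU can serve all of these formats.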
42DSP data types
- continue (after multiplication) with msb only
  - represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
- more efficient solution: continue with msb + lsb
  - sum-of-product operations generate accumulative noise at 32nd vs. 16th bit
- Still overflow for addition: overflow bits
  - double precision accumulator
  - extra overflow bits
  - shift, round, truncate unit
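The difference between continuing with the msb only and accumulating at double precision can be shown with 16-bit fractional values. This is a sketch under assumptions: operands are fix<16,15> integers, Python's unbounded integers stand in for an accumulator with extra overflow bits, and the final step rounds and saturates to 16 bits. Both function names are illustrative.

```python
def mac_double_precision(coeffs, samples):
    """Accumulate full-width products, then shift, round and saturate once."""
    acc = 0                                  # holds products with 30 fractional bits
    for c, x in zip(coeffs, samples):
        acc += c * x
    result = (acc + (1 << 14)) >> 15         # round, then drop 15 fractional bits
    return max(-32768, min(32767, result))   # saturate to the 16-bit range

def mac_msb_only(coeffs, samples):
    """Truncate every product to 16 bits before adding: noise at the 16th bit."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += (c * x) >> 15
    return max(-32768, min(32767, acc))

coeffs = [3, 3, 3, 3]
samples = [16384, 16384, 16384, 16384]       # 0.5 in fix<16,15>
precise = mac_double_precision(coeffs, samples)
truncated = mac_msb_only(coeffs, samples)
saturated = mac_double_precision([32767, 32767], [32767, 32767])
```

Each product is worth 1.5 lsb, so the exact sum is 6 lsb; truncating per product loses a third of it, while the double precision accumulator keeps it.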
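44Prog/data memory
[Figure: Von Neumann (sequential), Harvard and Modified Harvard memory organisations, each feeding an execution unit (EXU); memory blocks labelled prog mem. and data mem., and prog mem., data mem. 1 and data mem. 2]
Σ c(i) x(i)
Goal: 1 cycle per iteration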
45[Figure: RAM_A and RAM_B feeding the MAC]
46[Figure: FIR filter with delay line x1 ... x5, four Z^-1 delay elements, coefficients c1 ... c5 and output y; a time loop around the filter loop over i]
Σ ci xi
1 cycle/tap?
How to update the delay line?
47Solution 2: indirect addressing
- use of a pointer to mark the begin of the delay line
- update the pointer instead of moving the data
- problem: trashing of the whole memory
- solution: modulo addressing
- need for a register to store the pointer
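A pointer-based delay line with modulo addressing can be sketched as follows. The class `DelayLine` and its methods are illustrative names; `start` plays the role of the pointer register, and the modulo on every address keeps the pointer inside the buffer instead of letting it walk through the whole memory.

```python
class DelayLine:
    """FIR delay line: the data stays in place, only the pointer moves."""

    def __init__(self, size: int):
        self.buf = [0] * size
        self.start = 0                      # pointer to the newest sample

    def push(self, sample):
        # Step the pointer back (modulo the size) and overwrite the oldest sample.
        self.start = (self.start - 1) % len(self.buf)
        self.buf[self.start] = sample

    def fir(self, coeffs):
        # coeffs[0] weights the newest sample, coeffs[1] the one before, ...
        n = len(self.buf)
        return sum(c * self.buf[(self.start + i) % n]
                   for i, c in enumerate(coeffs))

line = DelayLine(3)
for sample in (1, 2, 3):
    line.push(sample)
y_first = line.fir([1, 10, 100])    # 3*1 + 2*10 + 1*100
line.push(4)                        # overwrites the oldest sample (1)
y_second = line.fir([1, 10, 100])   # 4*1 + 3*10 + 2*100
```

Each new sample costs one memory write and one pointer update, independent of the number of taps.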
48ACU architecture and Instruction set
[Figure: address computation unit (ACU) with registers A and S; output to RAM]
Instruction  Output  reg A  reg S
Read_A       A       A      S
Read_S       S       A      S
incA         A+1     A+1    S
decA         A-1     A-1    S
Step         A+S     A+S    S
Inc_step     S+1     A      S+1
Modulo: 16 = 10 000, 23 = 10 111 (mask, hold)
Modulo can be implemented as a mask operation if the size is 2^k
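For a buffer of size 2^k the modulo reduces to holding the upper address bits and masking the lower k bits. The sketch below assumes that the buffer starts at an address that is a multiple of its size, as in the 16 to 23 example; `circular_inc` is a hypothetical name.

```python
def circular_inc(addr: int, size: int) -> int:
    """Increment addr inside an aligned circular buffer of size 2**k."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    mask = size - 1
    # Hold the upper bits (buffer base), wrap only the lower k bits.
    return (addr & ~mask) | ((addr + 1) & mask)

addr = 16                       # 10 000: first address of the buffer
visited = []
for _ in range(9):
    visited.append(addr)
    addr = circular_inc(addr, 8)
```

After address 23 (10 111) the lower three bits wrap to 000 while the upper bits stay at 10, so the sequence returns to 16.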
49Addressing modes
- register:    ADD R4, R3       R[R4] <- R[R4] + R[R3]
- immediate:   ADD R4, 3        R[R4] <- R[R4] + 3
- direct:      ADD R4, (100)    R[R4] <- R[R4] + Mem[100]
- indirect:    ADD R4, (R3)     R[R4] <- R[R4] + Mem[R[R3]]
- w. inc/dec:  ADD R4, (R3)+    R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + 1
- indexed:     ADD R4, (R3+R2)  R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + R[R2]
- Remarks
  - direct for static data
  - indirect for arrays
  - inc/dec for stepping through arrays, e.g. Σ x(n)
  - index for stepping through arrays, e.g. Σ x(2n)
50Addressing modes extra for DSP
- 8 ARs (address or auxiliary register) available
- extra indirect modes:
  - ARn, post inc/dec by 1, circular
  - ARn with AR0, post inc/dec by AR0, circular
  - ARn with AR0 and B, post inc/dec by AR0, bit reverse
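Bit-reversed addressing produces the access order that radix-2 FFT data shuffling needs. The sketch below only computes the resulting index sequence in software; a DSP address unit generates it while stepping the address register. `bit_reverse` is an illustrative name.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lsb to the other end
        index >>= 1
    return result

# Access order for an 8-point FFT (3 address bits)
order = [bit_reverse(i, 3) for i in range(8)]
```

For 8 points the order is 0, 4, 2, 6, 1, 5, 3, 7, and applying the reversal twice gives back the original index.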
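51[Figure: DSP controller and datapath: PC (+1, Stack, Reset, Interrupt address), Program Memory, IR and Control Bus; ACU_A and ACU_B with address registers AR_A and AR_B; RAM_A and RAM_B with data registers DR_A and DR_B; MAC, ALU and Rfile]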
52first solution
Σ c(i) x(i)
[Figure: schedule of resources against time (cc); not shown: coefficient RAM + ACU]
6 clock cycles/sample; limit: pipelines in the controller
53Loopfolding (software pipelining)
54Loopfolding (software pipelining)
Σ c(i) x(i)
Pre- and postamble; 4 clock cycles/sample
55Hardware support for loop control
Σ c(i) x(i)
1 clock cycle/sample: repeat instruction and repeat block
56TMS320C5000
[Figure: TMS320C5000 datapath: T register, 17 x 17 multiplier, 40-bit adder, 40-bit ALU, 40-bit accumulators A and B, barrel shifter, sign control, multiplexers, fractional mode, COMP, TRN, TC, MSW/LSW select, ZERO, SAT, ROUND; buses labelled C, D, E, P]
57Motorola 56K family
[Figure: Motorola 56K block diagram: address ALU with X, Y and P address buses (16 bits) and external address switch; X memory and Y memory (each 256-by-24-bit RAM and 256-by-24-bit ROM); 2,048-by-24-bit program memory ROM; X, Y, P and global data buses (24 bits) with internal and external data-bus switches; data ALU with 24-by-24-bit multiplier-accumulator producing a 56-bit result; program controller (clock, interrupt); on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control; I/O ports; further labels: 2 bits, 3 bits, 7 bits]
58R.E.A.L.
[Figure: R.E.A.L. block diagram: program control unit with program memory and instruction decoder (96-bit instructions); 16-bit buses for X data, Y data and Z data; registers X, Y0, Y1; two 16-by-16-bit multipliers (P0, P1) with scale; two 40-bit arithmetic-logic units with shift and saturation; four 40-bit accumulators; saturation/scale]
59[Figure: compiler flow]
source
Front end: lexical analysis, syntax analysis, semantic analysis
Intermediate machine independent representation
Code generation: code selection, register allocation, scheduling (1 instr // ops; order of instr)
code
60Intermediate machine independent representation
BBi
BBk
BBj
a
b
c
d
c
t1 a b t2 c d t3 t1 c out
t2 t3
t1
t2
t3
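61Code selection example
[Figure: ADSP (Analog Devices) datapath: d memory and p memory; ALU with x and y inputs fed from registers ax, ay and af, result register ar; MAC with x and y inputs fed from registers mx, my and mf, result register mr]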
62Example of code selection: covering of intermediate representation with RTPs
mx dmem
my pmem
ax dmem
ay pmem
a
b
c
d
mr dmem
c
ar ax ay
3
t1
t2
2
Mr mr (mx my)
t3
1
my ar
mr mr my
63- Problems
- local decisions which have a global impact
- phase coupling example
- asap schedule
- maximal freedom for scheduling
- code selection during scheduling
- register allocation comes afterwards
- can lead to infeasible solutions
64phase coupling discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice: programming in assembler
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
Efficiency: exploit instruction level parallelism (ILP)
Compilation: systematic positioning of registers and regular interconnect
-> VLIW: Very Long Instruction Word
65- Will embedded CPUs and DSPs converge ?
- Converging forces
- both include a hardware multiplier
- trend in DSPs towards caches and RTK
- trend in DSPs towards C/C++
- common trend towards VLIW
- Diverging forces
- deeply embedded code (DSP) vs. end-user SW (CPU)
- different RTKs
- SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
- Conclusions VLIW
- good balance between hw and sw
- between efficiency (ILP) and cost
- fundamental problems code size, interruptability