Title: Ch. 4: Programmable Digital Signal Processors
1. Ch. 4: Programmable Digital Signal Processors
- instruction-level parallelism (ILP)
- hardware support for loop control
- attention to high-level data types, e.g. arrays and delay lines (vs. scalars for CPUs)
- architectures are difficult to compare
  - e.g. for an FFT: DIT, DIF, radix 2/4, loop unrolling, scaling, shuffling and initialisation can be included or forgotten
  - benchmarking: Berkeley Design Technology Inc. (BDTi) (compare to the SPECint benchmarks for CPUs)
2. Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
- examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
  - examples: C6 and TM
3. Goal: 1 cycle per iteration
- position ACR (1 or 2)
- adder/subtractor
- extra pipelines
- asymmetric inputs
- multi-precision
- Modifications
- extra inputs/outputs
4. DSP data types
- not every signal requires 32 bits
- 2 types of DSP: floating point (FP) and integer
- advantage of FP: most specs are in FP (conversion to integer is time consuming since the behaviour may change)
- disadvantage of FP: cost (area, speed, power)
- wanted: type of the output of an operation = type of the input (because both are stored in RAM)
  - no problem for FP, but a problem for integer
  - integer multiplication doubles the number of bits: n x n => 2n
- What about fractional numbers?
5. DSP data types
- integer and fractional numbers are a special case of fixed point
- fix<p,q> (ART Designer, SystemC): p bits in total, q of them fractional
[Figure: the fix<8,3> word 1 1 1 0 1 1 0 1 with bit weights -2^4, 2^3, 2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3. The msb has a negative weight (2's complement), the scale factor is 1/8, the value is -19/8 = -2.375, and the lsb sets the quantization error.]
The same ALU handles fix<8,1>, fix<8,2>, fix<8,3>, ...
If q = 0 the number is an integer, e.g. int<8,0>; if q = p-1 it is fractional, e.g. int<8,7>.
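The interpretation of a fixed-point word can be sketched in C as follows. This is only an illustration: the helper name `fix_to_double` and the choice to pass the raw word as a `uint32_t` are assumptions of the sketch, not part of ART Designer or SystemC.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low p bits of `raw` as a two's-complement fixed-point
 * word with p bits in total and q fractional bits (scale factor 2^-q).
 * Hypothetical helper, for illustration only. */
double fix_to_double(uint32_t raw, unsigned p, unsigned q)
{
    assert(p >= 1 && p <= 31 && q <= p);
    uint32_t mask = (1u << p) - 1u;
    int64_t value = (int64_t)(raw & mask);
    if (value & ((int64_t)1 << (p - 1))) {
        value -= (int64_t)1 << p;   /* the msb carries the negative weight */
    }
    return (double)value / (double)((int64_t)1 << q);
}
```

For the word 1110 1101 (0xED), reading it with 3 fractional bits gives -19/8 = -2.375, while reading it with 0 fractional bits gives -19: the bits and the ALU are the same, only the scale factor differs.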
6. DSP data types
[Figure: fixed-point multiplication. -19/8 as int<8,3> (1 1 1 0 1 1 0 1) times 97/16 as int<8,4> (0 1 1 0 0 0 0 1) gives -1843/128 in 16 bits (1 1 1 1 1 0 0 0 1 1 0 0 1 1 0 1).]
Some processors (C54) have special instructions for fractional numbers (and a symmetric number domain, +/- 2^(n-1)).
  s x x x  times  s y y y  =  s s z z z z z z
  => s z z z z z z 0  if FRCT = 1
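A minimal sketch of the fractional multiplication, assuming 16-bit operands with 1 sign bit and 15 fractional bits. The function names and the clamping of the single overflowing case (-1 times -1) are assumptions of this sketch, not a description of the C54 instruction set.

```c
#include <stdint.h>

/* Multiply two 16-bit fractional numbers. The raw 32-bit product has two
 * sign bits (s s z ... z); with frct != 0 it is shifted left by one
 * (s z ... z 0) so that its upper half is again a 16-bit fractional number. */
int32_t mul_fract16(int16_t x, int16_t y, int frct)
{
    int64_t product = (int64_t)x * (int64_t)y;
    if (frct) {
        product *= 2;
    }
    if (product > INT32_MAX) {
        product = INT32_MAX;   /* only reached for -1 * -1; assumption of the sketch */
    }
    return (int32_t)product;
}

/* Continue with the most significant 16 bits only (rounds downwards). */
int16_t high_word(int32_t product)
{
    int32_t hi = product / 65536;
    if (product % 65536 < 0) {
        hi -= 1;
    }
    return (int16_t)hi;
}
```

With 16384 representing 0.5, `high_word(mul_fract16(16384, 16384, 1))` is 8192, i.e. 0.25 in the same format; without the shift the upper half would be 4096 and every product would silently lose one bit of scale.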
7. DSP data types
- continue (after multiplication) with the msb part only
  - represents the limit of the accuracy of the result (which cannot be larger than the accuracy of the inputs)
- more efficient solution: continue with msb and lsb parts
  - sum-of-product operations generate accumulative noise at the 32nd vs. the 16th bit
- still overflow for addition: overflow bits
  - double-precision accumulator
  - extra overflow bits
  - shift, round, truncate unit
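9. Rounding, value truncation, magnitude truncation
[Figure: quantizer characteristics xQ versus x for rounding, value truncation and magnitude truncation.]
input            mode                   added    result
111.11 (-0.25)   rounding               000.1    000 (0)
111.11 (-0.25)   value truncation                111 (-1)
111.11 (-0.25)   magnitude truncation   001.0    000 (0)
111.01 (-0.75)   rounding               000.1    111 (-1)
111.01 (-0.75)   value truncation                111 (-1)
111.01 (-0.75)   magnitude truncation   001.0    000 (0)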
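The three quantization modes can be compared with a small C sketch that drops q fractional bits from a two's-complement value. The function names are hypothetical, and the code works on plain integers rather than on any particular shift/round/truncate unit.

```c
#include <assert.h>
#include <stdint.h>

/* Value truncation: drop the fractional bits, i.e. round towards minus infinity. */
int32_t truncate_value(int32_t raw, unsigned q)
{
    assert(q >= 1 && q <= 30);
    int32_t scale = (int32_t)1 << q;
    int32_t result = raw / scale;     /* C division rounds towards zero */
    if (raw % scale < 0) {
        result -= 1;                  /* correct negative inputs downwards */
    }
    return result;
}

/* Magnitude truncation: round towards zero. */
int32_t truncate_magnitude(int32_t raw, unsigned q)
{
    assert(q >= 1 && q <= 30);
    return raw / ((int32_t)1 << q);
}

/* Rounding: add half an lsb of the result, then truncate the value. */
int32_t round_nearest(int32_t raw, unsigned q)
{
    assert(q >= 1 && q <= 30);
    int64_t scale = (int64_t)1 << q;
    int64_t biased = (int64_t)raw + scale / 2;
    int64_t result = biased / scale;
    if (biased % scale < 0) {
        result -= 1;
    }
    return (int32_t)result;
}
```

With q = 2, the raw value -1 is -0.25 and the raw value -3 is -0.75. Rounding gives 0 and -1, value truncation gives -1 for both, and magnitude truncation gives 0 for both.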
10. [Figure: overflow characteristics: zeroing, saturation, sawtooth]
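As a sketch of the three overflow characteristics, the C functions below bring a wider intermediate result back to 16 bits. The names and the 16-bit word length are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Saturation: clip to the nearest representable value. */
int16_t overflow_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Zeroing: an out-of-range result is replaced by 0. */
int16_t overflow_zero(int32_t v)
{
    return (v > INT16_MAX || v < INT16_MIN) ? 0 : (int16_t)v;
}

/* Sawtooth: two's-complement wrap-around, i.e. keep the low 16 bits. */
int16_t overflow_wrap(int32_t v)
{
    int32_t low = (int32_t)((uint32_t)v & 0xFFFFu);   /* 0 .. 65535 */
    return (int16_t)(low >= 32768 ? low - 65536 : low);
}
```

For the input 40000 the three functions return 32767, 0 and -25536 respectively; only the sawtooth result changes sign, which is why saturation is usually preferred in signal paths.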
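11. Prog/data memory
[Figure: three ways to connect an execution unit (EXU) to memory. Von Neumann (sequential): one memory for program and data. Harvard: separate program memory and data memory. Modified Harvard: program memory plus data memories 1 and 2.]
Σ c(i) x(i)
Goal: 1 cycle per iteration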
12. [Figure: MAC fed from two data memories, RAM_A and RAM_B]
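13. Time loop and filter loop
[Figure: 5-tap FIR filter with delay line x1 ... x5 (four z^-1 elements), coefficients c1 ... c5 and output y.]
- time loop: one pass per sample
- filter loop over i: Σ ci xi, 1 cycle/tap?
- How to update the delay line?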
14. Solution 1: block move in memory
- 2 possibilities
  - complete move after every output sample is calculated: read and write the data twice
  - move after the read of every datum separately: write the data twice
- need for a special instruction (TMS320)
15. Solution 2: indirect addressing
- use of a pointer to mark the beginning of the delay line
- update the pointer instead of moving the data
- problem: trashing of the whole memory
- solution: modulo addressing
- need for a register to store the pointer
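A minimal C sketch of the pointer-based delay line with modulo addressing. The structure `fir_state`, the function `fir_step` and the length `TAPS` are hypothetical names introduced here; the 64-bit accumulator stands in for an accumulator with overflow bits.

```c
#include <stdint.h>

#define TAPS 5   /* hypothetical filter length */

typedef struct {
    int16_t  line[TAPS];   /* delay line; the samples never move in memory */
    unsigned newest;       /* pointer: index of the most recent sample */
} fir_state;

/* Push one input sample and return the sum over i of c[i] * x[n - i]. */
int64_t fir_step(fir_state *s, const int16_t c[TAPS], int16_t x)
{
    /* Update the pointer instead of moving the data. The modulo keeps the
     * pointer inside the delay line, so the rest of memory is not trashed. */
    s->newest = (s->newest + 1u) % TAPS;
    s->line[s->newest] = x;

    int64_t acc = 0;
    unsigned idx = s->newest;
    for (unsigned i = 0; i < TAPS; i++) {
        acc += (int32_t)c[i] * s->line[idx];
        idx = (idx + TAPS - 1u) % TAPS;   /* one sample back, modulo TAPS */
    }
    return acc;
}
```

Feeding a unit impulse followed by zeros returns the coefficients one by one, which is a quick check that the pointer update replaces the block move correctly. On a DSP the two modulo operations are done by the address unit in parallel with the multiply-accumulate, so they cost no extra cycles.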
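16. IIR filter
[Figure: IIR filter with input x, output y, a feedback delay line y1 ... y5 (four z^-1 elements) and coefficients c1 ... c4. Next to it the memory map of y1 ... y5, with the pointer marking the start of the delay line.]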
17. 2 filters
[Figure: memory maps of two delay lines inside the time loop. pntr 1 walks x1 ... x5 in modulo range 1; pntr 2 walks y1 ... y5 in modulo range 2.]
for i = 1..itaps: Σ c(i) x(i)
for j = 1..jtaps: Σ d(j) y(j)
2 memory segments => 1 segment
18. Mapping strategy
[Figure: memory map with pntr 1 and one modulo range holding y1, y2, x1/y3, x2, x3.]
- Mapping strategy
  - define positions in RAM
    - constraint: variables that form a delay line are placed in consecutive locations
  - find a schedule
    - example: c1 -> c2 -> c3 -> c4 -> c5
  - define ACU instructions
19. [Figure: filter with delay line x1 ... x8 (seven z^-1 elements), coefficients c1 ... c8 and two outputs, yo and ye]
20. ACU architecture and instruction set
[Figure: ACU with registers A and S, a modulo unit and an output to RAM.]
Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1
Modulo: 16 = 10 000, 23 = 10 111 (upper bits: hold; lower bits: mask)
Modulo can be implemented as a mask operation if the size is 2^k.
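The mask form of the modulo operation can be sketched in C as below. The function name `modulo_step` is hypothetical, and the sketch assumes that the buffer starts at an address that is a multiple of its size, as in the range 16 ... 23.

```c
#include <assert.h>
#include <stdint.h>

/* Add `step` to `addr` inside a circular buffer of `size` words.
 * `size` must be a power of two and the buffer must be aligned to it.
 * The upper address bits are held; only the lower k bits are updated. */
uint32_t modulo_step(uint32_t addr, int32_t step, uint32_t size)
{
    assert(size != 0 && (size & (size - 1u)) == 0);    /* size == 2^k */
    uint32_t mask = size - 1u;
    uint32_t held = addr & ~mask;                       /* e.g. 10 000 */
    uint32_t moved = (addr + (uint32_t)step) & mask;    /* wraps in the low bits */
    return held | moved;
}
```

With a size of 8, stepping from 23 by +1 gives 16 and stepping from 16 by -1 gives 23. No comparison or subtraction is needed, which is why power-of-two buffer sizes are cheap in hardware.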
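21. Mapping example
[Figure: memory map with the modulo range spanning addresses 16 ... 23, holding y1, y2, x1/y3, x2, x3; pntr marks the start of the delay line.]
Assume initialisation: A = pointer = 17, S = -2
read_A (17), incA (18), incA (19), incA (20), incA (21), step (19), dec (18): prepares the new pointer for the next iteration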
22. Addressing modes
- register:    ADD R4, R3        R[R4] <- R[R4] + R[R3]
- immediate:   ADD R4, 3         R[R4] <- R[R4] + 3
- direct:      ADD R4, (100)     R[R4] <- R[R4] + Mem[100]
- indirect:    ADD R4, (R3)      R[R4] <- R[R4] + Mem[R[R3]]
- w. inc/dec:  ADD R4, (R3)+     R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + 1
- indexed:     ADD R4, (R3+R2)   R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + R[R2]
- Remarks
  - direct for static data
  - indirect for arrays
  - inc/dec for stepping through arrays, e.g. Σ x[n]
  - index for stepping through arrays, e.g. Σ x[2n]
23. Addressing modes: extra for DSP
- 8 ARs (address or auxiliary registers) available
- extra indirect modes
  - circular: ARn, post inc/dec by 1 (circular)
  - circular: ARn with AR0, post inc/dec by AR0 (circular)
  - bit reverse: ARn with AR0 (B), post inc/dec by AR0 (bit reversed)
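The access order produced by bit-reversed addressing can be illustrated with a short C sketch. It only computes the reversed index in software; the function name `bit_reverse` is hypothetical, and the hardware mode itself works on the address registers rather than through a loop like this.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`. For an N = 2^bits point
 * radix-2 FFT this maps natural order to bit-reversed order. */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    assert(bits <= 32);
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point transform (3 bits) the indices 0, 1, 2, 3, 4, 5, 6, 7 map to 0, 4, 2, 6, 1, 5, 3, 7. A dedicated addressing mode produces this sequence for free, whereas in software the shuffling costs a loop per access.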
24. Incorporation of an ALU
- regular data-flow algorithms => MAC
  - filtering, correlation, windowing etc.
- decision making => ALU
  - sorting filters (e.g. median filters)
  - interpolation (e.g. sqrt)
  - absolute value calculation
  - logarithmic conversion
  - finite field arithmetic (e.g. Galois field)
  - Viterbi
  - VLC, VLD
  - division
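25. [Figure: processor architecture. Controller: PC (with +1 incrementer, stack, reset and interrupt address inputs), program memory, IR and control bus. Address generation: ACU_A and ACU_B with address registers AR_A and AR_B. Data: RAM_A and RAM_B with data registers DR_A and DR_B. Datapath: MAC, ALU and register file (Rfile).]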
26. Bus-oriented instruction encoding
Format   Fields
00       ACU A, B | ALU | SX | SY | DX | DY | RF
01       ACU A, B | MULT | SX | SY | DX | DY | RF
10       ACU A, B | Imm. data | DX | DY | RF
11       ACU A, B | Next address | BR | Cond
27. First solution
Σ c(i) x(i)
[Figure: schedule of resources versus time (cc). Not shown: coefficient RAM and ACU.]
6 clock cycles/sample; limit: pipelines in the controller
28. Loop folding (software pipelining)
29. Loop folding (software pipelining)
Σ c(i) x(i)
Pre- and postamble; 4 clock cycles/sample
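The idea of loop folding can be sketched in C on the sum of products: the fetch for tap i is folded into the same loop body as the multiply-accumulate for tap i-1, which creates a preamble and a postamble. The function name `sop_folded` is hypothetical, and C only shows the reordering, not the parallel issue that the hardware provides.

```c
#include <stdint.h>

/* Sum of products with the fetch stage folded over the MAC stage. */
int64_t sop_folded(const int16_t *c, const int16_t *x, int taps)
{
    if (taps <= 0) {
        return 0;
    }
    int64_t acc = 0;

    /* Preamble: fetch the operands of the first tap. */
    int16_t cr = c[0];
    int16_t xr = x[0];

    /* Folded body: multiply-accumulate tap i-1 while fetching tap i. */
    for (int i = 1; i < taps; i++) {
        int32_t product = (int32_t)cr * xr;
        cr = c[i];
        xr = x[i];
        acc += product;
    }

    /* Postamble: the last fetched operands still have to be accumulated. */
    acc += (int32_t)cr * xr;
    return acc;
}
```

The result is identical to the straightforward loop, but the operations inside the body no longer depend on each other, so a machine with separate memory and MAC resources can issue them in the same cycle.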
30. Hardware support for loop control
Σ c(i) x(i)
1 clock cycle/sample: repeat instruction and repeat block
31. Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
- examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
  - examples: C6 and TM
32. TMS320C5000
[Figure: C5000 datapath. A 17 x 17 multiplier fed from the T register and the data buses through sign control and multiplexers, with a fractional mode; a 40-bit ALU; a 40-bit adder; two 40-bit accumulators A and B; a barrel shifter; and COMP, MSW/LSW select, TRN, TC, ZERO, SAT and ROUND units.]
33. Motorola 56K family
[Figure: block diagram. Address ALU driving the X, Y and P address buses (16 bits) through an external address switch; X memory and Y memory, each 256-by-24-bit RAM plus 256-by-24-bit ROM; 2,048-by-24-bit program memory ROM; X, Y, P and global data buses (24 bits) with internal and external data-bus switches; data ALU with a 24-by-24-bit multiplier-accumulator producing a 56-bit result; program controller with clock and interrupt inputs; on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control) and I/O ports.]
34. R.E.A.L.
[Figure: R.E.A.L. DSP core. Program control unit with instruction decoder and program memory (Z data), 96-bit instructions; 16-bit buses for X data, Y data and Z data; operand registers X, Y0 and Y1; two 16-by-16-bit multipliers with products P0 and P1 and scaling; two 40-bit arithmetic-logic units; shift, saturation and saturation/scale units; four 40-bit accumulators.]
35. RD16021 DSP
36. Instruction cycle counts for BDTi benchmarks
Benchmark parameters: 16 taps, 40 samples, 8 biquads
38. Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
- examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
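39. [Figure: compiler flow. Source -> front end (lexical analysis, syntax analysis, semantic analysis) -> intermediate machine-independent representation -> code generation (code selection, register allocation, scheduling) -> code.]
Scheduling decides which ops go in parallel (//) into 1 instruction, and the order of the instructions.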
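40. Intermediate machine-independent representation
[Figure: control-flow graph of basic blocks BBi, BBj and BBk; data-flow graph of one basic block with inputs a, b, c, d and intermediate values t1, t2, t3.]
t1 <- (a, b); t2 <- (c, d); t3 <- (t1, c); out <- (t2, t3)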
41. Code selection
A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]
Notation: ar <- (ar | ax) op (ay | af) means ar <- ar op ay, or ar <- ar op af, or ar <- ax op ay, or ar <- ax op af.
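42. Code selection example
[Figure: ADSP (Analog Devices) datapath. Data memory (d memory) and program memory (p memory) feed the ALU input registers ax, ay, af and the MAC input registers mx, my, mf over the x and y paths; the ALU writes result register ar and the MAC writes result register mr.]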
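43. Examples of RTPs on the ADSP-210 datapath
[Figure: RTP trees. MAC patterns take their operands from (ar | mr | mx) and (my | mf), optionally combined with mr, and write mr or mf; ALU patterns take their operands from (mr | ar | ax) and (ay | af) and write ar or af.]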
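44. Example of code selection: covering of the intermediate representation with RTPs
[Figure: the data-flow graph with inputs a, b, c, d and values t1, t2, t3, covered by three RTPs numbered 1, 2, 3. Register transfers shown: mx <- dmem, my <- pmem, ax <- dmem, ay <- pmem, mr <- dmem (c), ar <- (ax, ay), mr <- (mr, mx, my), my <- ar, mr <- (mr, my).]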
45. Problems
- local decisions which have a global impact
- phase coupling, example:
  - ASAP schedule
  - maximal freedom for scheduling
  - code selection during scheduling
  - register allocation comes afterwards
  - can lead to infeasible solutions
46. Phase coupling: example 1
[Figure: (a) data-flow graph with operations 1 ... 4; (b) datapath with alu1, alu2 and registers R1, R2, R3.]
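47. Phase coupling: example 2 [Mesman]
[Figure: values u and v with their producers Pu, Pv and consumers Cu, Cv, drawn twice: without extra constraints, and with the extra ordering that applies if u and v share the same register.]
Example of coupling between scheduling and register allocation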
48. Phase coupling: discussion [Mesman]
[Figure: traditional (heuristic) code generation takes the application and the constraints and ends in an "OK?" check (yes/no); the feasible space is only part of the design space seen by the code generator.]
Phase coupling is difficult because of the many constraints originating from irregular interconnect, special-purpose registers and non-orthogonal microcode.
49. Phase coupling: discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice: programming in assembler.
Solutions:
1. Solve code generation for DSPs.
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler.
  - efficiency: exploit instruction-level parallelism (ILP)
  - compilation: systematic positioning of registers and regular interconnect
  => VLIW (Very Long Instruction Word)
50. Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
- examples: TI, Motorola, Philips
- code generation
- recent developments: VLIW (Very Long Instruction Word)
  - principles
  - central register file, example: TM
  - clustered VLIW, example: C6
  - subword parallelism or SIMD
51. VLIW principles
- multiple parallel FUs, possibly different and pipelined
- pipelining is exposed to the compiler
  - no interlock mechanism
- load-store architecture
  - all operands are fetched from/stored in register files, possibly multi-ported
- each FU can receive an instruction every clock cycle
- one instruction = many RISC instructions
  - each RISC instruction = one issue slot
- no dependencies between different RISC instructions
  - orthogonal microcode
  - compiler friendly
52. VLIW architecture
[Figure: one register file shared by exec units 1 ... 25; each exec unit is driven by its own issue slot (1 ... 25) of the instruction, which carries the R/W addresses and the operation.]
- long instruction words, e.g. (3 x 7 + 4) x 25 = 625 bits
- many ports on the register file, e.g. 75
53. VLIW architecture: central register file
[Figure: one register file shared by exec units 1 ... 9, which are grouped under issue slots 1 ... 3.]
54. TM1000 DSPCPU
[Figure: instruction cache (32 kB) and PC feed an instruction register with 5 issue slots; the issue slots drive the exec units, which share a register file (128 regs, 32 bit, 15 ports) and a data cache (16 kB).]
Exec units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
55. TriMedia TM32A processor
0.18 micron; area 16.9 mm2; 200 MHz (typ); 1.4 W; 7 mW/MHz (MIPS: 0.9 mW/MHz)
56. Synthesised RF area (CMOS18, 64 bit)
Area, delay and power dissipation grow more than linearly with the number of ports.
57. VLIW architecture: clustered register files
[Figure: three clusters, each with its own register file, two exec units and a copy unit (register file 1: exec units 1 and 2; register file 2: exec units 3 and 4; register file 3: exec units 5 and 6).]
58. VLIW architecture: clustered register files
[Figure: register file 1 with FMUL and FADD units, register files 2 and 3 with IMUL and IADD units. Example operations: FMUL r1,r2,r3; IADD r1,r2,r3; IMUL r1,r2,r3.]
59. VLIW architecture: clustered register files
[Figure: two clusters with register files I0 and I1 and three functional units each. Operations shown per unit: IADD_01, IMOV_01 and IADD_10, IMOV_10 (FU00, FU10); IADD_00, LAND_00 and IADD_11, LAND_10 (FU01); IMUL_00, SHFT_00 and IMUL_10, SHFT_10 (FU02).]
60. VLIW architecture: clustered register files, discussion
- performance loss (more instructions) compared to a central register file (due to the extra cycle for a copy)
  - 15-20% for 2 clusters
  - 20-30% for 4 clusters
- limited scalability
  - not too many clusters
  - not too many registers within each cluster (too many RF ports)
- copy ops are added in the compiler
  - the graph changes during scheduling
61. TMS320C62x VelociTI (fixed point)
[Figure: two datapaths, each with a register file 0-15 (32 bits) and four units with dst, src1 and src2 ports (plus src_up and dst_up): L1/L2 (int add, logical, bit count), S1/S2 (int add, logical, bit manipulation, shift, constant, branch), M1/M2 (int mult, 16 -> 32), D1/D2 (int add, load/store). The D units drive the store/load address buses; store/load data and load data buses connect the register files to memory.]
62. VelociTI principles
- parallelism (fetch-decode-execute) (max 8 issue slots)
- pipeline critical sections (ALU 1 cc, mult 2 cc, 200 MHz)
- RISC (simple, atomic, independent instructions)
- performance comes from the compiler (pipelining, unrolling)
- load-store
- orthogonal (2 identical datapaths, add on 6 units)
- deterministic (no interlock)
- conditional instructions (guarding)
- instruction packing
63. [Figure: instruction packing for eight instructions A ... H on an 8-slot machine.
Classical encoding fetches many nops (n): a fully serial program needs eight words with one useful slot each, a mixed serial/parallel program (A B C, then D E, then F, then G H) needs four words, and a fully parallel program needs one word.
VelociTI encoding stores only A ... H plus one bit per instruction: 0 0 0 0 0 0 0 0 (fully serial), 1 1 0 1 0 0 1 0 (mixed serial/parallel), 1 1 1 1 1 1 1 0 (fully parallel).]
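A small C sketch of how such per-instruction bits can be decoded. It assumes that a bit value of 1 means "the next instruction issues in the same cycle as this one", which is the reading under which the three bit patterns match the fully serial, mixed and fully parallel programs. The function name is hypothetical.

```c
#include <stddef.h>

/* Count the groups of instructions that issue together, given one
 * parallel bit per instruction: p[i] != 0 means instruction i+1 issues
 * in the same cycle as instruction i (assumption of this sketch). */
size_t count_execute_packets(const unsigned char *p, size_t n)
{
    size_t packets = 0;
    for (size_t i = 0; i < n; i++) {
        /* An instruction closes its group when its bit is 0, or when it
         * is the last instruction that was fetched. */
        if (p[i] == 0 || i + 1 == n) {
            packets++;
        }
    }
    return packets;
}
```

The pattern 1 1 0 1 0 0 1 0 yields 4 groups (A B C, D E, F, G H), so the program takes 4 issue cycles but still occupies only eight instruction slots in memory, with no nops stored.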
64. Instruction cycle counts for BDTi benchmarks
65. Subword parallelism (custom operators in TM)
[Figure: the 1st and 2nd input operands are each split into byte3 ... byte0; four copies of op combine corresponding bytes into byte3 ... byte0 of the output operand.]
32 bits = 4 bytes are processed independently
Ex. op = +, -, min, max => quadumin, quadumax, ...
66. Subword parallelism (custom operators in TM)
+ faster execution
- rewrite effort (e.g. different types for in- and outputs)
Typical example: graphics (4 x 32-bit floating point)
67. Subword parallelism: MPEG example
    for (i = 0; i < 64; i++) {
        temp = ((back[i] + forward[i] + 1) >> 1) + idct[i];
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i] = temp;
    }
Remark: simple example without inter-loop dependencies
68. MPEG example: loop unrolled four times
    for (i = 0; i < 64; i += 4) {
        temp = ((back[i+0] + forward[i+0] + 1) >> 1) + idct[i+0];
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+0] = temp;

        temp = ((back[i+1] + forward[i+1] + 1) >> 1) + idct[i+1];
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+1] = temp;

        temp = ((back[i+2] + forward[i+2] + 1) >> 1) + idct[i+2];
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+2] = temp;

        temp = ((back[i+3] + forward[i+3] + 1) >> 1) + idct[i+3];
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+3] = temp;
    }
69. MPEG example: operations regrouped per type (loop body)
    temp0 = ((back[i+0] + forward[i+0] + 1) >> 1);
    temp1 = ((back[i+1] + forward[i+1] + 1) >> 1);
    temp2 = ((back[i+2] + forward[i+2] + 1) >> 1);
    temp3 = ((back[i+3] + forward[i+3] + 1) >> 1);

    temp0 += idct[i+0]; if (temp0 > 255) temp0 = 255; else if (temp0 < 0) temp0 = 0;
    temp1 += idct[i+1]; if (temp1 > 255) temp1 = 255; else if (temp1 < 0) temp1 = 0;
    temp2 += idct[i+2]; if (temp2 > 255) temp2 = 255; else if (temp2 < 0) temp2 = 0;
    temp3 += idct[i+3]; if (temp3 > 255) temp3 = 255; else if (temp3 < 0) temp3 = 0;

    destination[i+0] = temp0;
    destination[i+1] = temp1;
    destination[i+2] = temp2;
    destination[i+3] = temp3;
The four averages => quadavg; the four clipped additions => dspuquadaddui
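What the two regrouped steps compute on packed words can be emulated in portable C. The functions below are a sketch only: their names are hypothetical, their behaviour is inferred from the scalar MPEG code (rounded average, addition clipped to 0 ... 255) rather than taken from the TriMedia operation definitions, and the second one assumes that the idct values fit in a signed byte.

```c
#include <stdint.h>

/* Byte-wise rounded average of two words that each hold four unsigned bytes. */
uint32_t quad_average(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        out |= ((x + y + 1u) >> 1) << (8 * lane);
    }
    return out;
}

/* Byte-wise sum of four unsigned bytes in `a` and four signed bytes in `b`,
 * with every sum clipped to 0 .. 255. */
uint32_t quad_add_clipped(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        int32_t x = (int32_t)((a >> (8 * lane)) & 0xFFu);
        int32_t y = (int32_t)((b >> (8 * lane)) & 0xFFu);
        if (y >= 128) {
            y -= 256;                 /* reinterpret the byte as signed */
        }
        int32_t sum = x + y;
        if (sum > 255) sum = 255;
        else if (sum < 0) sum = 0;
        out |= (uint32_t)sum << (8 * lane);
    }
    return out;
}
```

With both helpers, one pass of the unrolled loop body reduces to `quad_add_clipped(quad_average(back4, forward4), idct4)` on three packed words (`back4`, `forward4` and `idct4` being hypothetical 32-bit loads of four consecutive elements). On hardware with subword operations each call is a single operation instead of a four-iteration loop.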
70. Will embedded CPUs and DSPs converge?
- Converging forces
  - both include a hardware multiplier
  - trend in DSPs towards caches and RTKs
  - trend in DSPs towards C/C++
  - common trend towards VLIW
- Diverging forces
  - deeply embedded code (DSP) vs. end-user SW (CPU)
  - different RTKs
    - SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
- Conclusions: VLIW
  - good balance between hw and sw
  - between efficiency (ILP) and cost
  - fundamental problems: code size, interruptability