Ch. 4: Programmable Digital Signal Processors

Slides: 71

Provided by: abc774

1
Ch. 4 Programmable Digital Signal Processors

instruction level parallelism (ILP)
hardware support for loop control
attention for high-level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
difficult to compare architectures
e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation can be included or forgotten
benchmarking: Berkeley Design Technology Inc. (BDTi) (compare to SPECint benchmarks for CPUs)

2
Outline

architectures for programmable DSPs
  multiplier-accumulator
  modified Harvard architecture
  extension with an ALU (decision making)
  controller architectures
  examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
  examples: C6 and TM

3
Goal: 1 cycle per iteration

Modifications:
position of the ACR (1 or 2)
adder/subtractor
extra pipelines
asymmetric inputs
multi-precision
extra inputs/outputs

4
DSP data types

not every signal requires 32 bits
2 types of DSP: floating point (FP) and integer
advantage of FP: most specs are in FP (conversion to integer is time consuming since the behaviour may change)
disadvantage of FP: cost (area, speed, power)
wanted: type of output of an operation = type of input (because both are stored in RAM)
no problem for FP, but a problem for integer: integer multiplication doubles the number of bits, n x n -> 2n
What about fractional numbers?

5
DSP data types

integer and fractional numbers are a special case of fixed point
fix<p,q> (ART Designer, SystemC): p bits in total, q of them fractional

Example: fix<8,3>, scale factor 1/8
weight:  -2^4  2^3  2^2  2^1  2^0  2^-1  2^-2  2^-3
bit:       1    1    1    0    1    1     0     1     = -19/8 = -2.375
negative weight on the msb: 2's complement
the finite number of fractional bits gives a quantization error

The same ALU handles fix<8,1>, fix<8,2>, fix<8,3>, ...
if q = 0 then integer, e.g. int<8,0>; if q = p-1 then fractional, e.g. int<8,7>
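The weighting on this slide can be checked with a few lines of C. This is a minimal sketch, assuming the raw word is held in an int8_t; the helper name fix8_to_double is hypothetical and does not come from ART Designer or SystemC.

```c
#include <stdint.h>

/* Interpret an 8-bit two's-complement word as fix<8,q>:
   the raw integer is scaled by 2^-q (q = number of fractional bits). */
double fix8_to_double(int8_t raw, int q)
{
    return (double)raw / (double)(1 << q);
}
```

The bit pattern 1110 1101 is the integer -19, so with q = 3 the call fix8_to_double(-19, 3) gives -2.375, and with q = 0 the same bits read as the integer -19.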
6
DSP data types
Int<8,3>:   1110 1101            = -19/8
Int<8,4>:   0110 0001            = 97/16
product:    1111 1000 1100 1101  = -1843/128  (16 bits, 3 + 4 = 7 fractional bits)

Some processors (C54) have special instructions for fractional numbers (and a symmetric number domain -2^(n-1) ... 2^(n-1))
s x x x  *  s y y y  =  s s z z z z z z
                     -> s z z z z z z 0   if FRCT = 1
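The two multiplications on this slide can be sketched in C as follows. The function names are hypothetical, and saturating the single overflow case (-1 times -1 in fractional mode) is an assumption of this sketch, not something the slide specifies.

```c
#include <stdint.h>

/* fix<8,qa> times fix<8,qb> gives fix<16,qa+qb>: the raw integers multiply,
   the numbers of fractional bits add. */
int16_t fix8_mul(int8_t a, int8_t b)
{
    return (int16_t)((int16_t)a * (int16_t)b);
}

/* Fractional (Q15) multiply with the extra left shift that removes the
   duplicated sign bit: s s z..z becomes s z..z 0.
   Assumption: the one overflowing case (-1 * -1) is saturated. */
int32_t q15_mul_frct(int16_t a, int16_t b)
{
    int64_t p = (int64_t)a * (int64_t)b * 2;   /* multiply by 2 = shift left by 1 */
    if (p > INT32_MAX)
        p = INT32_MAX;
    return (int32_t)p;
}
```

With the slide's operands, fix8_mul(-19, 97) returns -1843, which is -1843/128 when read with 7 fractional bits. In fractional mode 0.5 times 0.5 (0x4000 times 0x4000) gives 0x20000000, i.e. 0.25 in Q31.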
7
DSP data types

after multiplication, continue with the msb part only:
  represents the limit of the accuracy of the result (cannot be larger than the accuracy of the inputs)
more efficient solution: continue with msb + lsb
  sum-of-product operations then generate accumulated noise at the 32nd instead of the 16th bit
still overflow for addition -> overflow bits
  double precision accumulator
  extra overflow bits
  shift, round, truncate unit
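A minimal C sketch of this datapath: products are accumulated at full precision in a wide accumulator and only the final result is rounded and saturated back to 16 bits. The int64_t accumulator stands in for a double-precision accumulator with extra overflow bits; the function names and the Q15 format are assumptions of the sketch.

```c
#include <stdint.h>

/* floor(v / 2^k), written without right-shifting a negative value */
static int64_t floor_div_pow2(int64_t v, int k)
{
    int64_t d = (int64_t)1 << k;
    int64_t q = v / d;            /* C division truncates toward zero */
    if (v % d < 0)
        q -= 1;
    return q;
}

/* Q15 sum of products: accumulate msb + lsb, round and saturate once at the end. */
int16_t dot_q15(const int16_t *c, const int16_t *x, int n)
{
    int64_t acc = 0;                               /* wide accumulator with guard bits */
    for (int i = 0; i < n; i++)
        acc += (int32_t)c[i] * (int32_t)x[i];      /* Q30 products, nothing discarded */
    acc = floor_div_pow2(acc + (1 << 14), 15);     /* round to Q15 */
    if (acc > INT16_MAX) acc = INT16_MAX;          /* saturate */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

Because rounding happens once, after the loop, the quantization noise enters below the 16th bit of the result instead of being added at every tap.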

8
(No Transcript)
9
[Figure: quantization characteristics xQ versus x for rounding, value truncation and magnitude truncation]

x                  rounding (+000.1)   value truncation   magnitude truncation (+001.)
111.11 = -0.25     000 = 0             111 = -1           000 = 0
111.01 = -0.75     111 = -1            111 = -1           000 = 0
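The three quantization modes can be written out in C. In this sketch x is a raw fixed-point integer with q fractional bits (q >= 1) and the result is the integer part after quantization; the function names are hypothetical.

```c
/* floor(x / 2^q), written without right-shifting a negative value */
static int floor_shift(int x, int q)
{
    int d = 1 << q;
    int r = x / d;                /* C division truncates toward zero */
    if (x % d < 0)
        r -= 1;
    return r;
}

/* value truncation: drop the fractional bits (rounds toward minus infinity) */
int quant_value_trunc(int x, int q)     { return floor_shift(x, q); }

/* rounding: add half an output LSB, then drop the fractional bits */
int quant_round(int x, int q)           { return floor_shift(x + (1 << (q - 1)), q); }

/* magnitude truncation: round toward zero */
int quant_magnitude_trunc(int x, int q) { return x / (1 << q); }
```

With q = 2, the raw value -1 is -0.25 and -3 is -0.75, which reproduces the six results in the table of this slide.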
10
zeroing
saturation
sawtooth
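The three overflow characteristics named on this slide can be sketched in C for a two's-complement word of a given width. The function names are hypothetical and the sketch assumes 2 <= bits <= 32.

```c
#include <stdint.h>

/* saturation: clamp to the largest or smallest representable value */
int32_t ovf_saturate(int64_t v, int bits)
{
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;
    if (v > max) return (int32_t)max;
    if (v < min) return (int32_t)min;
    return (int32_t)v;
}

/* zeroing: output 0 whenever the value does not fit */
int32_t ovf_zero(int64_t v, int bits)
{
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;
    return (v > max || v < min) ? 0 : (int32_t)v;
}

/* sawtooth: plain two's-complement wrap-around */
int32_t ovf_sawtooth(int64_t v, int bits)
{
    int64_t range = (int64_t)1 << bits;
    int64_t min = -(range / 2);
    int64_t r = (v - min) % range;
    if (r < 0)
        r += range;
    return (int32_t)(r + min);
}
```

For an 8-bit word the input 130 gives 127 with saturation, 0 with zeroing and -126 with the sawtooth characteristic.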
11
Von Neumann (sequential): one prog/data memory, EXU
Harvard: prog mem., data mem., EXU
Modified Harvard: prog mem., data mem. 1, data mem. 2, EXU
Σ c(i)·x(i)
Goal: 1 cycle per iteration
12
RAM_A
RAM_B
MAC
13
time loop
  filter loop i: Σ ci·xi
1 cycle/tap?
How to update the delay line?
[Figure: 5-tap FIR filter with delay line x1 ... x5 (four z^-1 elements), coefficients c1 ... c5 and output y]
14
Solution 1: block move in memory

2 possibilities:
complete move after every output sample is calculated
  read and write the data twice
move after the read of every datum separately
  write the data twice
  need for a special instruction (TMS320)
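The second possibility, moving each datum right after it has been read, looks as follows in C. This is a minimal sketch; the function name and the array layout (delay[0] holds the newest sample) are assumptions.

```c
#include <stdint.h>

/* One FIR output sample. Each delay-line entry is moved one place up
   immediately after it has been used in the multiply-accumulate. */
int32_t fir_blockmove(int16_t x_new, int16_t *delay, const int16_t *c, int taps)
{
    int32_t acc = 0;
    delay[0] = x_new;                        /* slot 0 was freed by the previous call */
    for (int i = taps - 1; i >= 0; i--) {    /* oldest sample first */
        acc += (int32_t)c[i] * delay[i];
        if (i + 1 < taps)
            delay[i + 1] = delay[i];         /* move after read */
    }
    return acc;
}
```

Feeding an impulse (1, 0, 0) into a filter with coefficients {1, 2, 3} returns 1, 2, 3 on successive calls. In plain C the move costs an extra memory write per tap, which is why a single-cycle tap needs hardware support for it.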

15
Solution 2: indirect addressing

use a pointer to mark the beginning of the delay line
update the pointer instead of moving the data
problem: trashing of the whole memory
solution: modulo addressing
need for a register to store the pointer
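The pointer-based delay line with modulo addressing can be sketched in C as a circular buffer. The buffer size is chosen as a power of two so that the modulo reduces to a mask; the type and function names are hypothetical, and taps must not exceed DL_SIZE.

```c
#include <stdint.h>

#define DL_SIZE 8u                 /* power of two, so modulo is a mask */
#define DL_MASK (DL_SIZE - 1u)

typedef struct {
    int16_t  buf[DL_SIZE];
    unsigned head;                 /* index of the newest sample */
} delay_line;

/* One FIR output sample: the pointer moves, the data stay in place. */
int32_t fir_circular(delay_line *d, const int16_t *c, unsigned taps, int16_t x_new)
{
    int32_t acc = 0;
    d->head = (d->head - 1u) & DL_MASK;      /* update the pointer, modulo the buffer */
    d->buf[d->head] = x_new;                 /* overwrite the oldest sample */
    for (unsigned i = 0; i < taps; i++)
        acc += (int32_t)c[i] * d->buf[(d->head + i) & DL_MASK];
    return acc;
}
```

Only one memory write per output sample remains, and the mask keeps the pointer inside the buffer, so the delay line can no longer walk through the rest of the memory.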

16
IIR filter
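[Figure: IIR filter with input x, output y, delay line y1 ... y5 (four z^-1 elements) and coefficients c1 ... c4, next to its memory map: y1 ... y5 stored in consecutive locations, addressed through a pointer]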
17
2 filters inside the time loop:
for i = 1..itaps: Σ c(i)·x(i)    (pntr 1, delay line x1 ... x5, modulo range 1)
for j = 1..jtaps: Σ d(j)·y(j)    (pntr 2, delay line y1 ... y5, modulo range 2)
2 memory segments -> 1 segment (a single modulo range)
18
Mapping strategy
[Figure: memory map with pntr 1 and consecutive locations y1, y2, x1/y3, x2, x3 inside one modulo range]

define positions in RAM
  constraint: variables that form a delay line go in consecutive places
find a schedule
  example: c1 -> c2 -> c3 -> c4 -> c5
define ACU instructions

19

[Figure: filter with delay line x1 ... x8 (seven z^-1 elements), coefficients c1 ... c8 and two outputs yo and ye]
20
ACU architecture and Instruction set
Registers: A (address), S (step)

instruction   output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1

Modulo range 16 = 10 000 ... 23 = 10 111: mask the low bits, hold the upper bits
Modulo can be implemented as a mask operation if the size is 2^k
output to RAM
21
Mapping example
address   content
16
17        y1   <- pntr
18        y2
19        x1/y3
20        x2
21        x3
22
23
modulo range: 16 ... 23

Assume initialisation: A = pointer = 17, S = -2
read_A -> 17, incA -> 18, incA -> 19, incA -> 20, incA -> 21, step -> 19, dec -> 18 (prepare new pointer for next iteration)
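The address sequence of this example can be replayed with a small C model of the ACU. This is a sketch under stated assumptions: the struct layout and function names are hypothetical, and the modulo is implemented as a mask on the low bits while the upper address bits are held.

```c
typedef struct {
    unsigned a;      /* address register A */
    int      s;      /* step register S */
    unsigned mask;   /* low bits that wrap; the bits above are held */
} acu;

/* Add delta to A inside the modulo range: only the masked bits change. */
static unsigned acu_wrap(const acu *u, int delta)
{
    return (u->a & ~u->mask) | ((u->a + (unsigned)delta) & u->mask);
}

unsigned acu_read_a(acu *u) { return u->a; }
unsigned acu_inc_a(acu *u)  { u->a = acu_wrap(u, 1);    return u->a; }
unsigned acu_dec_a(acu *u)  { u->a = acu_wrap(u, -1);   return u->a; }
unsigned acu_step(acu *u)   { u->a = acu_wrap(u, u->s); return u->a; }
```

Starting from A = 17, S = -2 and mask = 7 (range 16 ... 23), the calls read_A, incA four times, step, decA produce 17, 18, 19, 20, 21, 19, 18. An incA at address 23 wraps to 16.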
22
Addressing modes

mode         instruction         operation
register     ADD R4, R3          R[R4] <- R[R4] + R[R3]
immediate    ADD R4, 3           R[R4] <- R[R4] + 3
direct       ADD R4, (100)       R[R4] <- R[R4] + Mem[100]
indirect     ADD R4, (R3)        R[R4] <- R[R4] + Mem[R[R3]]
w. inc/dec   ADD R4, (R3)+       R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + 1
indexed      ADD R4, (R3+R2)     R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + R[R2]

Remarks:
direct for static data
indirect for arrays
inc/dec for stepping through arrays, e.g. Σ x[n]
index for stepping through arrays, e.g. Σ x[2n]

23
Addressing modes extra for DSP

8 ARs (address or auxiliary registers) available
extra indirect modes:
circular: ARn ± 1, post inc/dec by 1, circular
circular: ARn ± AR0, post inc/dec by AR0, circular
bit reverse: ARn ± AR0 (B), post inc/dec by AR0, bit reversed
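Bit-reversed addressing produces the index order that an in-place radix-2 FFT needs for its input or output shuffle. A minimal C sketch of the index mapping follows; the function name is hypothetical, and on the DSP the reversal is done by the address unit rather than by a loop.

```c
/* Reverse the lowest `bits` bits of index i (0 < bits <= 31). */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i to the low end of r */
        i >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 address bits) the indices 0 ... 7 map to 0, 4, 2, 6, 1, 5, 3, 7.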

24
Incorporation of an ALU

regular data-flow algorithms -> MAC
  filtering, correlation, windowing etc.
decision making -> ALU
  sorting filters (e.g. median filters)
  interpolation (e.g. sqrt)
  absolute value calculation
  logarithmic conversion
  finite field arithmetic (e.g. Galois field)
  Viterbi
  VLC, VLD
  division

25
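[Block diagram: controller with PC (+1), stack, reset and interrupt address, program memory and IR driving the control bus; two address units ACU_A and ACU_B with address registers AR_A and AR_B; data memories RAM_A and RAM_B with data registers DR_A and DR_B; datapath with MAC, ALU and register file (Rfile)]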
26
Bus-oriented instruction encoding
type   fields
00     ACU A B | ALU  | SX | SY | DX | DY | RF
01     ACU A B | MULT | SX | SY | DX | DY | RF
10     ACU A B | Imm. data | DX | DY | RF
11     ACU A B | Next address | BR | Cond
27
first solution
Σ c(i)·x(i)
Not shown: coefficient RAM + ACU
6 clock cycles/sample; limit: pipelines in the controller
[Schedule diagram: resources versus time (cc)]
28
Loopfolding (software pipelining)
29
Loopfolding (software pipelining)
Σ c(i)·x(i)
Pre- and postamble; 4 clock cycles/sample
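Loop folding can be illustrated at source level. In this C sketch the loads for iteration i are issued while the product of iteration i-1 is computed, which creates a preamble (first loads) and a postamble (last multiply-accumulate). The function name is hypothetical, and a C compiler is free to reorder this code, so it only illustrates the structure of the folded schedule.

```c
#include <stdint.h>

/* Sum of products with a folded (software pipelined) loop body. */
int32_t dot_folded(const int16_t *c, const int16_t *x, int n)
{
    int32_t acc = 0;
    if (n <= 0)
        return 0;
    int16_t cr = c[0], xr = x[0];           /* preamble: loads of iteration 0 */
    for (int i = 1; i < n; i++) {
        int32_t p = (int32_t)cr * xr;       /* multiply operands loaded one iteration earlier */
        cr = c[i];                          /* loads for the next multiply */
        xr = x[i];
        acc += p;
    }
    acc += (int32_t)cr * xr;                /* postamble: last multiply-accumulate */
    return acc;
}
```

The result is identical to the straightforward loop; only the order of operations changes so that load and multiply of different iterations can share a cycle.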
30
hardware support for loop control
Σ c(i)·x(i)
1 clock cycle/sample: repeat instruction and repeat block
31
Outline

architectures for programmable DSPs
  multiplier-accumulator
  modified Harvard architecture
  extension with an ALU (decision making)
  controller architectures
  examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
  examples: C6 and TM

32
TMS320C5000
[Block diagram: buses C, D, E, P; T register; multiplier (17 x 17) with sign control and fractional mode; ALU (40 bits); accumulators A (40 bits) and B (40 bits); barrel shifter; adder (40 bits); multiplexers (MUX); COMP, MSW/LSW select, TRN, TC; ZERO, SAT, ROUND]
33
Motorola 56K family
[Block diagram]
address buses (16 bits): P address, Y address, X address; external address switch
address ALU
X memory: 256-by-24-bit RAM, 256-by-24-bit ROM
Y memory: 256-by-24-bit RAM, 256-by-24-bit ROM
program memory: 2,048-by-24-bit ROM
data buses (24 bits): X data, Y data, P data, global data; internal data-bus switch, external data-bus switch
data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control
program controller, clock, interrupt
I/O ports: 24 bits, 2 bits, 3 bits, 7 bits
34

R.E.A.L.
[Block diagram]
program control unit, instruction decoder
program memory (Z data), 96-bit instructions
16-bit buses for X data, Y data and Z data
inputs X, Y0, Y1
two 16-by-16-bit multipliers, products P0 and P1, each followed by scale
two 40-bit arithmetic-logic units, shift, saturation
four 40-bit accumulators
saturation/scale on the output
35
RD16021 DSP
36
Instruction cycle counts for BDTi benchmarks
16 taps 40 samples 8 biquads
37
(No Transcript)
38
Outline

architectures for programmable DSPs
  multiplier-accumulator
  modified Harvard architecture
  extension with an ALU (decision making)
  controller architectures
  examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)

39
source
-> front end: lexical analysis, syntax analysis, semantic analysis
-> intermediate machine-independent representation
-> code generation: code selection, register allocation, scheduling (1 instr = parallel ops; order of instr)
-> code
40
Intermediate machine independent representation
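[Figure: control-flow graph of basic blocks BBi, BBj, BBk; inside a block, a data-flow graph that computes t1 from a and b, t2 from c and d, t3 from t1 and c, and out from t2 and t3]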
41
Code selection
A register transfer pattern (RTP) for a given datapath is any RT operation (read, combinatorial logic, write) which can be executed on the datapath. [Leupers]
Notation: "ar <- (ar | ax) op (ay | af)" stands for any of
  ar <- ar op ay
  ar <- ar op af
  ar <- ax op ay
  ar <- ax op af
42
Code selection example
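ADSP (Analog Devices)
[Figure: datapath fed from d memory and p memory; ALU with input registers ax, ay, af and result register ar; MAC with input registers mx, my, mf and result register mr]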
43
Examples of RTPs on the ADSP-210 datapath
[Figure: RTP trees for the ALU and the MAC, with sources drawn from ar, mr, mx, my, mf, ax, ay, af and destinations ar, af, mr, mf]
44
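Example of code selection: covering of the intermediate representation with RTPs
[Figure: the data-flow graph with inputs a, b, c, d and values t1, t2, t3 covered by three patterns (1, 2, 3); operands are loaded with mx <- dmem, my <- pmem, ax <- dmem, ay <- pmem, mr <- dmem; the ALU pattern combines ax and ay into ar, the MAC patterns combine mr with mx and my, with my <- ar passing the ALU result to the MAC]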
45

Problems
local decisions which have a global impact
phase coupling, example:
  ASAP schedule
  maximal freedom for scheduling
  code selection during scheduling
  register allocation comes afterwards
  can lead to infeasible solutions

46
phase coupling example 1
[Figure, panels (a) and (b): operations 1 to 4, functional units alu1 and alu2, registers R1, R2, R3]
47
phase coupling example 2 [Mesman]
[Figure: values u and v with producers Pu, Pv and consumers Cu, Cv, drawn with and without the extra ordering that applies if u and v share the same register]
Example of coupling between scheduling and register allocation
48
phase coupling discussion [Mesman]
[Figure: traditional (heuristic) code generation takes the application and searches the design space seen by the code generator; the result is then checked against the constraints (OK? yes/no), and the feasible space is only part of that design space]
Phase coupling is difficult because of the many constraints originating from irregular interconnect, special-purpose registers and non-orthogonal microcode.
49
phase coupling discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice: programming in assembler
Solutions:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
   efficiency: exploit instruction level parallelism (ILP)
   compilation: systematic positioning of registers and regular interconnect
   -> VLIW (Very Long Instruction Word)
50
Outline

architectures for programmable DSPs
  multiplier-accumulator
  modified Harvard architecture
  extension with an ALU (decision making)
  controller architectures
  examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
  principles
  central register file, example: TM
  clustered VLIW, example: C6
  subword parallelism or SIMD

51
VLIW principles

multiple parallel FUs, possibly different and pipelined
pipelining is exposed to the compiler
  no interlock mechanism
load-store architecture
  all operands fetched from/stored in register files, possibly multi-ported
each FU can receive an instruction every clock cycle
  one instruction = many RISC instructions
  each RISC instruction = one issue slot
no dependencies between different RISC instructions
  orthogonal microcode
  compiler friendly

52
VLIW architecture
[Figure: one register file feeding exec units 1 ... 25; the instruction carries the register read/write addresses for issue slots 1 ... 25]

long instruction words, e.g. (3*7+4)*25 = 625 bits
many ports on the register file, e.g. 75

53
VLIW architecture central Register File
[Figure: central register file shared by exec units 1 ... 9, driven by issue slots 1 ... 3]
54
TM1000 DSPCPU
Register file (128 regs, 32 bit, 15 ports)
Exec units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
Data cache (16 kB)
Instruction register (5 issue slots)
PC
Instruction cache (32 kB)
55
TriMedia TM32A processor
0.18 micron, area 16.9 mm2, 200 MHz (typ), 1.4 W, 7 mW/MHz (MIPS: 0.9 mW/MHz)
56
Synthesised RF area (CMOS18, 64 bit)
Area, speed and power dissipation scale worse than linearly with the number of ports
57
VLIW architecture clustered Register Files
Register file 1
Register file 2
Register file 3
Exec unit 1
Exec unit 2
copy unit
Exec unit 3
Exec unit 4
copy unit
Exec unit 5
Exec unit 6
copy unit
58
VLIW architecture clustered Register Files
REGISTER FILE 1
REGISTER FILE 2
REGISTER FILE 3
FMUL FADD
IMUL IADD
IMUL IADD
FMUL r1,r2,r3
IADD r1,r2,r3
IMUL r1,r2,r3
59
VLIW architecture clustered Register Files
FU00
FU10
REGISTER FILE I0
IADD_01 IMOV_01
REGISTER FILE I1
IADD_10 IMOV_10
FU01
FU01
IADD_00 LAND_00
IADD_11 LAND_10
FU02
FU02
IMUL_00 SHFT_00
IMUL_10 SHFT_10
60
VLIW architecture clustered Register Files
Discussion

performance loss (more instructions) compared to a central register file (due to the extra cycle for a copy)
  15-20% for 2 clusters
  20-30% for 4 clusters
limited scalability
  not too many clusters
  not too many registers within each cluster (too many RF ports)
copy operations must be added by the compiler
  the graph changes during scheduling

61
TMS320C62x VelociTI (fixed point)
[Figure: two data paths, each with a register file 0-15 (32 bits) and four units: L1, S1, M1, D1 and L2, S2, M2, D2; every unit has operands dst, src1, src2 (plus src_up, dst_up); load data, store/load data and store/load address paths]
S units: int add, logical, bit manip, shift, constant, branch
L units: int add, logical, bit count
D units: int add, load/store
M units: int mult (16 -> 32)
62
VelociTI principles

parallelism (fetch-decode-execute) (max 8 issue slots)
pipeline critical sections (ALU 1 cc, mult 2 cc, 200 MHz)
RISC (simple, atomic, independent instructions)
performance comes from the compiler (pipelining, unroll)
load-store
orthogonal (2 identical DPs, add on 6 units)
deterministic (no interlock)
conditional instructions (guarding)
instruction packing

63
Classical encoding: fetching many nops
[Figure: instructions A ... H placed in 8-slot instruction words padded with nops (n), for three schedules: fully serial, mixed serial/parallel, fully parallel]

VelociTI encoding
instruction            A B C D E F G H
fully serial           0 0 0 0 0 0 0 0
mixed serial/parallel  1 1 0 1 0 0 1 0
fully parallel         1 1 1 1 1 1 1 0
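The packed encoding stores one extra bit per instruction instead of nops. A minimal C sketch of how such a bit string splits a fetch packet into groups that execute together follows, assuming that a 1 means "the next instruction executes in parallel with this one"; the function name and the array-of-int representation are hypothetical.

```c
#include <stddef.h>

/* Split a fetch packet of n instructions into execute packets.
   pbit[i] == 1: instruction i+1 runs in parallel with instruction i.
   sizes[] receives the number of instructions per execute packet
   (it needs room for n entries); the return value is the number of
   execute packets. */
size_t split_fetch_packet(const int *pbit, size_t n, size_t *sizes)
{
    size_t count = 0, len = 0;
    for (size_t i = 0; i < n; i++) {
        len++;
        if (pbit[i] == 0 || i == n - 1) {   /* the parallel chain ends here */
            sizes[count++] = len;
            len = 0;
        }
    }
    return count;
}
```

For the bit string 1 1 0 1 0 0 1 0 this yields four execute packets of 3, 2, 1 and 2 instructions (A B C, D E, F, G H), so the eight instructions take four cycles to issue without a single nop being fetched.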
64
Instruction cycle counts for BDTi benchmarks
65
Subword parallelism
(custom operators in TM)
[Figure: 1st and 2nd input operand, each split into byte3 ... byte0; four independent op units; output operand byte3 ... byte0]
32 bits = 4 bytes are processed independently
Ex. +, -, min, max -> quadumin, quadumax, ...
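What a subword operation computes can be emulated lane by lane in plain C. The sketch below models an unsigned minimum on four packed bytes; the name quad_umin is hypothetical, and the loop only models the result, whereas the hardware handles all four lanes in one operation.

```c
#include <stdint.h>

/* Per-byte unsigned minimum of two packed 32-bit words. */
uint32_t quad_umin(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        r |= (x < y ? x : y) << (8 * lane);    /* lanes never affect each other */
    }
    return r;
}
```

For example, quad_umin(0x10FF0380, 0x20010490) gives 0x10010380: each byte of the result is the smaller of the two input bytes in the same position.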
66
Subword parallelism
(custom operators in TM)
advantage: faster execution
disadvantage: rewrite effort (e.g. different types for in- and outputs)
Typical example: graphics (4 x 32 bit floating point)
67
Subword parallelism
MPEG example
for (i = 0; i < 64; i++) {
    temp = ((back[i] + forward[i] + 1) >> 1) + idct[i];
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i] = temp;
}
Remark: simple example without inter-loop dependencies
68
/* unrolled by 4 */
for (i = 0; i < 64; i += 4) {
    temp = ((back[i+0] + forward[i+0] + 1) >> 1) + idct[i+0];
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+0] = temp;
    temp = ((back[i+1] + forward[i+1] + 1) >> 1) + idct[i+1];
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+1] = temp;
    temp = ((back[i+2] + forward[i+2] + 1) >> 1) + idct[i+2];
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+2] = temp;
    temp = ((back[i+3] + forward[i+3] + 1) >> 1) + idct[i+3];
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+3] = temp;
}
69
/* loop body regrouped: four averages -> quadavg */
temp0 = ((back[i+0] + forward[i+0] + 1) >> 1);
temp1 = ((back[i+1] + forward[i+1] + 1) >> 1);
temp2 = ((back[i+2] + forward[i+2] + 1) >> 1);
temp3 = ((back[i+3] + forward[i+3] + 1) >> 1);
/* four clipped additions -> dspuquadaddui */
temp0 += idct[i+0];
if (temp0 > 255) temp0 = 255; else if (temp0 < 0) temp0 = 0;
temp1 += idct[i+1];
if (temp1 > 255) temp1 = 255; else if (temp1 < 0) temp1 = 0;
temp2 += idct[i+2];
if (temp2 > 255) temp2 = 255; else if (temp2 < 0) temp2 = 0;
temp3 += idct[i+3];
if (temp3 > 255) temp3 = 255; else if (temp3 < 0) temp3 = 0;
destination[i+0] = temp0;
destination[i+1] = temp1;
destination[i+2] = temp2;
destination[i+3] = temp3;
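The two groups of four operations can be emulated on packed 32-bit words in plain C. This sketch only models what the slide assigns to quadavg and dspuquadaddui; the function names quad_avg and quad_add_clip are hypothetical, and it assumes that each idct value fits in a signed byte, which the slide does not state.

```c
#include <stdint.h>

/* Rounded average of four packed unsigned bytes: (x + y + 1) >> 1 per lane. */
uint32_t quad_avg(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        r |= ((x + y + 1u) >> 1) << (8 * lane);
    }
    return r;
}

/* Add four packed signed bytes (s) to four packed unsigned bytes (u)
   and clip every lane to 0..255. */
uint32_t quad_add_clip(uint32_t u, uint32_t s)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        int x = (int)((u >> (8 * lane)) & 0xFFu);
        int y = (int)((s >> (8 * lane)) & 0xFFu);
        if (y > 127)
            y -= 256;                        /* reinterpret the byte as signed */
        int t = x + y;
        if (t > 255) t = 255;
        else if (t < 0) t = 0;
        r |= (uint32_t)t << (8 * lane);
    }
    return r;
}
```

With four pixels packed per word, one loop iteration becomes quad_add_clip(quad_avg(back4, forward4), idct4), where back4, forward4 and idct4 are hypothetical packed operands: two operations replace the sixteen scalar statements of the regrouped loop body.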

70