Ch. 4: Programmable Digital Signal Processors

Provided by: abc774

1
Ch. 4 Programmable Digital Signal Processors
  • instruction level parallelism (ILP)
  • hardware support for loop control
  • attention for high-level data types, e.g. arrays,
    delay lines (vs. scalars for CPUs)
  • difficult to compare architectures
  • e.g. DIT, DIF, radix 2/4 FFT: loop unrolling, scaling,
    shuffling and initialisation can be included or
    forgotten
  • benchmarking: Berkeley Design Technology Inc.
    (BDTi)
  • (compare to SPECint benchmarks for CPUs)

2
Outline
  • architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
  • examples: TI, Motorola, Philips
  • code generation
  • recent developments: VLIW (Very Long Instruction
    Word)
  • examples: C6 and TM

3
Goal: 1 cycle per iteration
  • position ACR (1 or 2)
  • adder/subtractor
  • extra pipelines
  • asymmetric inputs
  • multi-precision
  • Modifications
  • extra inputs/outputs

4
DSP data types
  • not every signal requires 32 bits
  • 2 types of DSP: floating point (FP) and integer
  • advantage of FP: most specs are in FP
    (conversion to integer is time consuming since the
    behaviour may change)
  • disadvantage of FP: cost (area, speed, power)
  • wanted: type of output of an operation = type
    of input (because both are stored in RAM)
  • no problem for FP, but a problem for integer:
    integer multiplication doubles the number of
    bits, n x n => 2n
  • What about fractional numbers ?

5
DSP data types
  • integer and fractional numbers are a special
    case of fixed point
  • fix <p,q> (ART Designer, SystemC)

[Figure: bit layout of a fix <8,3> word, p = 8 bits in total, q = 3 fractional bits. Bit pattern 1 1 1 0 1 . 1 0 1 = -19/8 = -2.375. Bit weights from msb to lsb: -2^4, 2^3, 2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3. Scale factor 1/8. The msb has a negative weight (2s complement); the lsb sets the quantization error.]
Same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
If q = 0 then integer, e.g. int <8,0>; if q = p-1 then
fractional, e.g. int <8,7>
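The fix <p,q> interpretation can be sketched in a few lines of C. This is only an illustration: the helper name fix_value is hypothetical and not part of ART Designer or SystemC, and the result is returned as a double purely for readability.

```c
#include <stdint.h>

/* Value of a p-bit 2s complement pattern with q fractional bits
   (the fix <p,q> convention; helper name is illustrative only). */
double fix_value(uint32_t bits, unsigned p, unsigned q)
{
    /* keep the p low bits of the pattern */
    uint32_t mask = (p >= 32) ? 0xFFFFFFFFu : ((1u << p) - 1u);
    uint32_t raw  = bits & mask;

    /* the msb carries the negative weight -2^(p-1) */
    int64_t value = (int64_t)raw;
    if (p > 0 && ((raw >> (p - 1)) & 1u))
        value -= (int64_t)1 << p;

    /* scale factor 2^-q */
    return (double)value / (double)((uint64_t)1 << q);
}
```

With the pattern of the figure, fix_value(0xED, 8, 3) gives -2.375, while the same bits read with q = 0 give the integer -19. The hardware is identical in both cases; only the position of the binary point differs.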
6
DSP data types
[Figure: fixed-point multiplication example. Int <8,3>: 1 1 1 0 1 1 0 1 = -19/8. Int <8,4>: 0 1 1 0 0 0 0 1 = 97/16. 16-bit product with 7 fractional bits: 1 1 1 1 1 0 0 0 1 1 0 0 1 1 0 1 = -1843/128.]
Some processors (C54) have special instructions
for fractional numbers (and a symmetric number
domain -(2^(n-1) - 1) ... 2^(n-1) - 1)
s x x x  x  s y y y  =  s s z z z z z z
                     => s z z z z z z 0   if FRCT = 1
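The effect of the fractional mode can be illustrated with a small C model. This is a sketch under assumptions: the function name mul16 and the frct flag are hypothetical, and saturating the single overflow case is an assumed behaviour, not a statement about what the C54 does.

```c
#include <stdint.h>

/* 16 x 16 => 32 bit multiply. With frct != 0 the product is shifted left by
   one bit, which removes the duplicated sign bit (Q15 x Q15 => Q31). */
int32_t mul16(int16_t a, int16_t b, int frct)
{
    int64_t p = (int64_t)a * (int64_t)b;   /* s s z ... z */
    if (frct) {
        p *= 2;                            /* s z ... z 0 */
        if (p > INT32_MAX)                 /* only (-1) x (-1) gets here */
            p = INT32_MAX;                 /* saturate (assumed behaviour) */
    }
    return (int32_t)p;
}
```

For example, 0.5 x 0.5 in Q15 is mul16(16384, 16384, 1) = 0x20000000, which is 0.25 in Q31. The one overflow case, (-1) x (-1), is why a symmetric number domain is attractive: without the most negative value the shifted product always fits.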
7
DSP data types
  • continue (after multiplication) with the msb half only
  • represents the limit of the accuracy of the
    result
  • (cannot be larger than the accuracy of the
    inputs)
  • more efficient solution:
  • continue with msb + lsb
  • sum-of-product operations then generate accumulated
    noise at the 32nd
  • vs. the 16th bit
  • still overflow for addition: overflow bits
  • double-precision accumulator
  • extra overflow bits
  • shift, round, truncate unit
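The accumulator structure listed above can be sketched as follows. The code is a model, not a description of a specific device: int64_t stands in for a double-precision accumulator with extra overflow bits, and the function name mac_q15 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products of Q15 data kept at full precision. Rounding and
   saturation happen once, at the end, instead of after every product. */
int16_t mac_q15(const int16_t *c, const int16_t *x, size_t n)
{
    int64_t acc = 0;                       /* wide accumulator with overflow bits */
    for (size_t i = 0; i < n; ++i)
        acc += (int64_t)c[i] * x[i];       /* full-precision Q30 products */

    acc += (int64_t)1 << 14;               /* round: add half an output lsb */

    /* Q30 => Q15 by floor division (portable for negative values) */
    int64_t d = (int64_t)1 << 15;
    int64_t y = acc / d;
    if (acc % d != 0 && acc < 0)
        y -= 1;

    if (y > INT16_MAX) y = INT16_MAX;      /* saturate to the output word */
    if (y < INT16_MIN) y = INT16_MIN;
    return (int16_t)y;
}
```

Three products of 0.5 x 0.5 sum to 0.75 (24576 in Q15). Three products of values close to 1 exceed a 32-bit sum, which is exactly the case the extra overflow bits are there for.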

8
(No Transcript)
9
[Figure: quantization characteristics xQ versus x for rounding, value truncation and magnitude truncation]
Method                 x = 111.11 (-0.25)            x = 111.01 (-0.75)
rounding               + 000.1, keep 000 => 0        + 000.1, keep 111 => -1
value truncation       keep 111 => -1                keep 111 => -1
magnitude truncation   + 001.00, keep 000 => 0       + 001.00, keep 000 => 0
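The three quantization methods can be written down compactly in C. The function names are hypothetical; x is the raw integer and f the number of fractional bits to drop, so 111.11 corresponds to x = -1 with f = 2.

```c
#include <stdint.h>

/* Floor division by 2^f without relying on right shifts of negative values. */
static int32_t floor_div_pow2(int32_t x, unsigned f)
{
    int32_t d = (int32_t)1 << f;
    int32_t q = x / d;                 /* C division truncates toward zero */
    if (x % d != 0 && x < 0)
        q -= 1;
    return q;
}

/* rounding: add half an lsb of the result, then drop the fraction */
int32_t q_round(int32_t x, unsigned f)
{
    return (f == 0) ? x : floor_div_pow2(x + ((int32_t)1 << (f - 1)), f);
}

/* value truncation: drop the fraction bits (rounds toward minus infinity) */
int32_t q_value_trunc(int32_t x, unsigned f)
{
    return floor_div_pow2(x, f);
}

/* magnitude truncation: rounds toward zero */
int32_t q_magnitude_trunc(int32_t x, unsigned f)
{
    return x / ((int32_t)1 << f);
}
```

For x = -0.25 these return 0, -1 and 0; for x = -0.75 they return -1, -1 and 0, matching the table.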
10
[Figure: overflow characteristics: zeroing, saturation, sawtooth]
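A minimal C sketch of the three overflow characteristics, reducing a 32-bit value to a 16-bit word. The function names are hypothetical; sawtooth is modelled as plain 2s complement wrap-around.

```c
#include <stdint.h>

/* zeroing: an out-of-range value is replaced by 0 */
int16_t ovf_zero(int32_t x)
{
    return (x > INT16_MAX || x < INT16_MIN) ? 0 : (int16_t)x;
}

/* saturation: an out-of-range value is clipped to the nearest limit */
int16_t ovf_saturate(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* sawtooth: wrap-around, i.e. keep the 16 low bits as a 2s complement word */
int16_t ovf_sawtooth(int32_t x)
{
    uint32_t u = ((uint32_t)x + 0x8000u) & 0xFFFFu;
    return (int16_t)((int32_t)u - 0x8000);
}
```

For the input 40000 the three functions return 0, 32767 and -25536. In-range values pass through all three unchanged.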
11
Prog/data memory
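[Figure: three memory organisations feeding the execution unit (EXU). Von Neumann (sequential): one memory for program and data. Harvard: separate prog mem. and data mem. Modified Harvard: prog mem., data mem. 1 and data mem. 2.]
Σ c(i) * x(i)
Goal: 1 cycle per iteration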
12
RAM_A
RAM_B
MAC
13
time loop
1 cycle/tap ?
Σ ci * xi
filter loop i
How to update the delay line ?
[Figure: 5-tap FIR filter. Delay line x1 ... x5 built from four Z-1 elements, coefficients c1 ... c5, products summed into the output y.]
14
Solution 1: block move in memory
  • 2 possibilities
  • complete move after every output sample is
    calculated
  • read and write the data twice
  • move after the read of every datum separately
  • write the data twice
  • need for a special instruction (TMS320)
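The second possibility (move each datum right after it has been read) can be sketched in C as below. The function name fir_blockmove and the data layout (delay[0] holds the newest sample) are assumptions made for the illustration; on a DSP the move and the multiply-accumulate would be one special instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* One output sample of a FIR filter whose delay line is shifted in memory.
   After the call, delay[i] holds x(n-i). */
int32_t fir_blockmove(int16_t *delay, const int16_t *coef, size_t taps,
                      int16_t x_new)
{
    if (taps == 0)
        return 0;

    int32_t acc = 0;
    /* walk from the oldest tap to the newest: move, then multiply-accumulate */
    for (size_t i = taps - 1; i > 0; --i) {
        delay[i] = delay[i - 1];               /* data move */
        acc += (int32_t)coef[i] * delay[i];    /* MAC on the moved datum */
    }
    delay[0] = x_new;                          /* insert the new sample */
    acc += (int32_t)coef[0] * delay[0];
    return acc;
}
```

Feeding a unit impulse returns the coefficients one by one, which is a quick check that the delay line moves in the right direction.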

15
Solution 2: indirect addressing
  • use of a pointer to mark the beginning of the delay
    line
  • update the pointer instead of moving the data
  • problem: trashing of the whole memory
  • solution: modulo addressing
  • need for a register to store the pointer
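A pointer-based delay line with modulo addressing might look as follows in C. The struct and function names are hypothetical; the % operator plays the role that the address unit plays in hardware.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int16_t *buf;    /* storage for the delay line */
    size_t   len;    /* number of taps = modulo range */
    size_t   head;   /* pointer to the newest sample */
} delay_line;

/* One output sample: the data stay in place, only the pointer moves. */
int32_t fir_circular(delay_line *d, const int16_t *coef, int16_t x_new)
{
    if (d->len == 0)
        return 0;

    /* step the pointer back by one, modulo the buffer length */
    d->head = (d->head + d->len - 1) % d->len;
    d->buf[d->head] = x_new;                   /* overwrite the oldest sample */

    int32_t acc = 0;
    size_t idx = d->head;
    for (size_t i = 0; i < d->len; ++i) {
        acc += (int32_t)coef[i] * d->buf[idx];
        idx = (idx + 1) % d->len;              /* modulo addressing */
    }
    return acc;
}
```

Each sample is written once per time loop iteration, compared with the repeated writes of the block move solution. Without the modulo operation the pointer would keep walking through memory.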

16
IIR filter
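[Figure: IIR filter with input x, output y, a delay line y1 ... y5 built from four Z-1 elements and coefficients c1 ... c4, shown next to its memory map: y1 ... y5 in consecutive locations, addressed through a pointer.]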
17
2 filters
[Figure: memory map of two delay lines inside the time loop. pntr 1 addresses x1 ... x5 in modulo range 1; pntr 2 addresses y1 ... y5 in modulo range 2.]
for i = 1..itaps: Σ c(i) * x(i)
for j = 1..jtaps: Σ d(j) * y(j)
2 memory segments => 1 segment
18
Mapping strategy
[Figure: memory map with pntr 1 and one modulo range holding y1, y2, x1/y3, x2, x3]
  • Mapping strategy
  • define positions in RAM
  • constraint: variables that form a delay line go in
    consecutive places
  • find a schedule
  • example: c1 > c2 > c3 > c4 > c5
  • define ACU instructions

19

[Figure: delay line x1 ... x8 built from seven Z-1 elements and feeding two sums: output yo uses coefficients c1, c3, c5, c7 and output ye uses coefficients c2, c4, c6, c8.]
20
ACU architecture and Instruction set
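[Figure: ACU with address register A and step register S; the output goes to the RAM]
Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1
Modulo: range 16 = 10 000 ... 23 = 10 111; the upper bits hold, the lower bits are masked
Modulo can be implemented as a mask operation
if the size is 2^k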
21
Mapping example
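[Figure: memory map of addresses 16 ... 23 (the modulo range) holding y1, y2, x1/y3, x2 and x3; pntr marks the start of the delay line.]
Assume initialisation: A = pointer = 17, S = -2
read_A: 17, incA: 18, incA: 19, incA: 20, incA: 21, step: 19,
dec: 18 (prepare new pointer for next iteration)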
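The address sequence above can be reproduced with a small C model of the ACU. The struct layout and function names are hypothetical; the modulo is implemented as a mask, which assumes a range of size 2^k aligned on a multiple of 2^k (here 16 ... 23, mask 7).

```c
#include <stdint.h>

typedef struct {
    uint32_t a;      /* address register A */
    int32_t  s;      /* step register S */
    uint32_t mask;   /* 2^k - 1: the low bits wrap, the upper bits hold */
} acu_t;

/* Combine the held upper bits of A with the masked low bits of the new address. */
static uint32_t acu_wrap(const acu_t *u, uint32_t next)
{
    return (u->a & ~u->mask) | (next & u->mask);
}

uint32_t acu_read_a(const acu_t *u) { return u->a; }
uint32_t acu_inc_a(acu_t *u) { u->a = acu_wrap(u, u->a + 1u); return u->a; }
uint32_t acu_dec_a(acu_t *u) { u->a = acu_wrap(u, u->a - 1u); return u->a; }
uint32_t acu_step(acu_t *u)  { u->a = acu_wrap(u, u->a + (uint32_t)u->s); return u->a; }
```

Starting from A = 17 and S = -2, the calls read, inc, inc, inc, inc, step, dec produce 17, 18, 19, 20, 21, 19, 18. An increment at address 23 wraps to 16 instead of leaving the modulo range.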
22
Addressing modes
  • register:    ADD R4, R3       R[R4] = R[R4] + R[R3]
  • immediate:   ADD R4, 3        R[R4] = R[R4] + 3
  • direct:      ADD R4, (100)    R[R4] = R[R4] + Mem[100]
  • indirect:    ADD R4, (R3)     R[R4] = R[R4] + Mem[R[R3]]
  • w. inc/dec:  ADD R4, (R3)+    R[R4] = R[R4] + Mem[R[R3]]
                                  R[R3] = R[R3] + 1
  • indexed:     ADD R4, (R3+R2)  R[R4] = R[R4] + Mem[R[R3]]
                                  R[R3] = R[R3] + R[R2]
  • Remarks
  • direct for static data
  • indirect for arrays
  • inc/dec for stepping through arrays, e.g. Σ x[n]
  • index for stepping through arrays, e.g. Σ x[2n]

23
Addressing modes: extra for DSP
  • 8 ARs (address or auxiliary registers) available
  • extra indirect modes
  • circular: ARn, post inc/dec by 1, circular
  • circular: ARn +/- AR0, post inc/dec by AR0, circular
  • bit reverse: ARn +/- AR0 B, post inc/dec by AR0,
    bit reversed
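Bit-reversed post-increment is the mode used to step through FFT data. A common way to describe it is an addition whose carry propagates from msb to lsb; the sketch below models that in C. The function name brev_next is hypothetical, and half (= N/2 for an N-point buffer) plays the role of the index register.

```c
/* Next address in bit-reversed order for a buffer of N = 2 * half entries.
   Equivalent to adding `half` with the carry running from msb to lsb. */
unsigned brev_next(unsigned addr, unsigned half)
{
    unsigned bit = half;
    /* clear set bits from the top down (reverse carry propagation) ... */
    while (bit != 0 && (addr & bit) != 0) {
        addr ^= bit;
        bit >>= 1;
    }
    /* ... and set the first cleared bit that is found */
    return addr | bit;
}
```

For N = 8 (half = 4), starting at 0, repeated calls visit 0, 4, 2, 6, 1, 5, 3, 7 and then return to 0.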

24
Incorporation of an ALU
  • regular data-flow algorithms => MAC
  • filtering, correlation, windowing, etc.
  • decision making => ALU
  • sorting filters (e.g. median filters)
  • interpolation (e.g. sqrt)
  • absolute value calculation
  • logarithmic conversion
  • finite field arithmetic (e.g. Galois field)
  • Viterbi
  • VLC, VLD
  • division

25
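[Figure: processor block diagram. Controller: PC, stack, reset and interrupt address inputs, program memory and instruction register (IR) driving the control bus. Datapath: ACU_A and ACU_B with address registers AR_A and AR_B, RAM_A and RAM_B with data registers DR_A and DR_B, MAC, ALU and register file (Rfile).]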
26
Bus-oriented instruction encoding
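[Table: four instruction formats, selected by a 2-bit type field]
Type 00: ACU A, B | ALU | SX | SY | DX | DY | RF
Type 01: ACU A, B | MULT | SX | SY | DX | DY | RF
Type 10: ACU A, B | Imm. data | DX | DY | RF
Type 11: ACU A, B | Next address | BR | Cond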
27
first solution
Σ c(i) * x(i)
[Figure: schedule of the resources versus time (cc). Not shown: coefficient RAM and ACU.]
6 clock cycles/sample; limit: pipelines in the
controller
28
Loopfolding (software pipelining)
29
Loopfolding (software pipelining)
Σ c(i) * x(i)
Pre- and postamble; 4 clock cycles/sample
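Loop folding can be illustrated on the sum of products in C. This is a two-stage sketch only (load overlapped with multiply-accumulate); the function name dot_folded is hypothetical and the cycle counts of the slide are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products with the loads of iteration i+1 folded into iteration i. */
int32_t dot_folded(const int16_t *c, const int16_t *x, size_t n)
{
    if (n == 0)
        return 0;

    int32_t acc = 0;
    int16_t cr = c[0];                    /* preamble: first loads */
    int16_t xr = x[0];

    for (size_t i = 1; i < n; ++i) {      /* folded loop body */
        int32_t p = (int32_t)cr * xr;     /* multiply data of iteration i-1 */
        cr = c[i];                        /* load data of iteration i */
        xr = x[i];
        acc += p;
    }

    acc += (int32_t)cr * xr;              /* postamble: last multiply */
    return acc;
}
```

On a processor with parallel resources the multiply and the loads inside the loop body are independent and can share one instruction; the price is the extra code before and after the loop.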
30
hardware support for loop control
Σ c(i) * x(i)
1 clock cycle/sample: repeat instruction and
repeat block
31
Outline
  • architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
  • examples: TI, Motorola, Philips
  • code generation
  • recent developments: VLIW (Very Long Instruction
    Word)
  • examples: C6 and TM

32
TMS320C5000
[Figure: TMS320C5000 datapath. T register; 17 x 17 multiplier with fractional mode; 40-bit ALU; 40-bit adder; 40-bit accumulators A and B; barrel shifter; sign control on the operand inputs; multiplexers selecting among the A, B, C, D, E, P and T sources; COMP, MSW/LSW select, TRN and TC; ZERO, SAT and ROUND on the result path.]
33
Motorola 56K family
[Figure: block diagram]
Address ALU; 16-bit address buses: X address, Y address, P address; external address switch
X memory: 256-by-24-bit RAM, 256-by-24-bit ROM
Y memory: 256-by-24-bit RAM, 256-by-24-bit ROM
Program memory: 2,048-by-24-bit ROM
24-bit data buses: X data, Y data, P data, global data; internal data-bus switch; external data-bus switch
Data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
Program controller (clock, interrupt)
On-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control
I/O ports (2 bits, 3 bits, 7 bits)
34

R.E.A.L.
[Figure: block diagram]
Program control unit, instruction decoder, 96-bit instructions
Program memory (Z data)
Two 16-by-16 bit multipliers (inputs X, Y0, Y1; products P0, P1; scale)
Two 40 bit arithmetic-logic units (shift, saturation)
Four 40 bit accumulators (saturation/scale)
16-bit buses for X data, Y data and Z data
35
RD16021 DSP
36
Instruction cycle counts for BDTi benchmarks
16 taps 40 samples 8 biquads
37
(No Transcript)
38
Outline
  • architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
  • examples: TI, Motorola, Philips
  • code generation
  • recent developments: VLIW (Very Long Instruction
    Word)

39
[Figure: compiler flow]
source
Front end: lexical analysis, syntax analysis, semantic analysis
Intermediate machine independent representation
Code generation: code selection, register allocation, scheduling (1 instr = // ops; order of instr)
code
40
Intermediate machine independent representation
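[Figure: control-flow graph of basic blocks BBi, BBj, BBk; one basic block is expanded into a data-flow graph with inputs a, b, c, d, intermediate values t1, t2, t3 and output out.]
Result: operands
t1: a, b
t2: c, d
t3: t1, c
out: t2, t3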
41
Code selection
Register transfer pattern (RTP): for a given
datapath, an RTP is any RT operation (read -
combinatorial logic - write) which can be
executed on the datapath. [Leupers]
Notation: ar = ar | ax + ay | af means
ar = ar + ay, or ar = ar + af,
or ar = ax + ay, or ar = ax + af
42
Code selection example
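ADSP (Analog Devices)
[Figure: datapath. d memory and p memory feed the input registers ax, ay, af of the ALU and mx, my, mf of the MAC; each unit has an x and a y input; the ALU writes ar, the MAC writes mr.]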
43
Examples of RTPs on the ADSP-210 datapath
[Figure: example RTP trees. Multiplier operands are taken from (ar | mr | mx) and (my | mf); ALU operands are taken from (mr | ar | ax) and (ay | af); results are written to mr | mf or to ar | af.]
44
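Example of code selection: covering of
intermediate representation with RTPs
[Figure: the data-flow graph (inputs a, b, c, d; values t1, t2, t3) covered by three numbered RTPs. Loads: mx = dmem, my = pmem, ax = dmem, ay = pmem, mr = dmem. Pattern 3 uses ar, ax, ay; pattern 2 uses mr, mr, (mx, my); pattern 1 uses my, ar and mr, mr, my.]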
45
  • Problems
  • local decisions which have a global impact
  • phase coupling, example:
  • ASAP schedule
  • maximal freedom for scheduling
  • code selection during scheduling
  • register allocation comes afterwards
  • can lead to infeasible solutions

46
phase coupling example 1
[Figure: (a) data-flow graph with operations 1 ... 4; (b) mapping onto alu1 and alu2 with registers R1, R2 and R3.]
47
phase coupling example 2
[Mesman]
[Figure: producers Pu, Pv and consumers Cu, Cv of two values u and v, drawn without and with the extra ordering that applies if u and v share the same register.]
Example of coupling between scheduling and
register allocation
48
phase coupling discussion
[Mesman]
[Figure: traditional (heuristic) code generation. The application and the constraints enter the code generator; the design space seen by the code generator is checked against the feasible space (OK ? yes / no).]
Phase coupling is difficult because of many
constraints originating from irregular
interconnect, special purpose registers and
non-orthogonal microcode.
49
phase coupling discussion
It is very difficult, almost impossible, to
develop robust and efficient DSP compilers.
Current DSP practice: programming in assembler.
Solution:
1. Solve code generation for DSPs.
2. Step back and rethink the architecture: develop
an architecture which is still efficient but
also a good model for building a compiler.
Efficiency: exploit instruction level
parallelism (ILP).
Compilation: systematic positioning of registers
and regular interconnect.
=> VLIW (Very Long Instruction Word)
50
Outline
  • architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
  • examples: TI, Motorola, Philips
  • code generation
  • recent developments: VLIW (Very Long Instruction
    Word)
  • principles
  • central register file, example: TM
  • clustered VLIW, example: C6
  • subword parallelism or SIMD

51
VLIW principles
  • multiple parallel FUs, possibly different and
    pipelined
  • pipelining is exposed to the compiler
  • no interlock mechanism
  • load-store architecture
  • all operands fetched from/stored in register
    files, possibly multi-ported
  • each FU can receive an instruction every clock
    cycle
  • one instruction = many RISC instructions
  • each RISC instruction = one issue slot
  • no dependencies between different RISC
    instructions
  • orthogonal microcode
  • compiler friendly

52
VLIW architecture
[Figure: one register file serving exec units 1 ... 25; each exec unit is driven by its own issue slot (1 ... 25), which holds the read/write addresses (RW addr.) and the instruction.]
  • long instruction words, e.g. (3*7+4)*25 = 625 bits
  • many ports on the register file, e.g. 3*25 = 75

53
VLIW architecture central Register File
[Figure: central register file with exec units 1 ... 9 and issue slots 1 ... 3]
54
TM1000 DSPCPU
[Figure: TM1000 DSPCPU block diagram]
Register file (128 regs, 32 bit, 15 ports)
Exec units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2
DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP
compare, 1 FP div/sqrt
Data cache (16 kB)
Instruction register (5 issue slots)
PC
Instruction cache (32 kB)
55
TriMedia TM32A processor
0.18 micron; area 16.9 mm2; 200 MHz (typ); 1.4 W;
7 mW/MHz (MIPS: 0.9 mW/MHz)
56
Synthesised RF area (CMOS18, 64 bit)
Area, delay and power dissipation grow more than
linearly with the number of ports
57
VLIW architecture clustered Register Files
[Figure: three clusters, each with its own register file (1 ... 3), two exec units and a copy unit]
58
VLIW architecture clustered Register Files
REGISTER FILE 1
REGISTER FILE 2
REGISTER FILE 3
FMUL FADD
IMUL IADD
IMUL IADD
FMUL r1,r2,r3
IADD r1,r2,r3
IMUL r1,r2,r3
59
VLIW architecture clustered Register Files
FU00
FU10
REGISTER FILE I0
IADD_01 IMOV_01
REGISTER FILE I1
IADD_10 IMOV_10
FU01
FU01
IADD_00 LAND_00
IADD_11 LAND_10
FU02
FU02
IMUL_00 SHFT_00
IMUL_10 SHFT_10
60
VLIW architecture clustered Register Files
Discussion
  • performance loss (more instructions) compared to
    a central Register File (due to the extra cycle
    for a copy)
  • 15-20 % for 2 clusters
  • 20-30 % for 4 clusters
  • limited scalability
  • not too many clusters
  • not too many registers within each cluster (too
    many RF ports)
  • copy ops are added by the compiler
  • graph changes during scheduling

61
TMS320C62x VelociTI (fixed point)
[Figure: two data paths, each with a register file 0-15 (32 bits) and four functional units with dst, src1 and src2 ports. L1/L2: int add, logical, bit count. S1/S2: int add, logical, bit manipulation, shift, constant, branch. M1/M2: int mult (16 x 16 => 32). D1/D2: int add, load/store. Store/load data and store/load address paths connect the data paths to memory.]
62
VelociTI principles
  • parallelism (fetch-decode-execute) (max 8 issue
    slots)
  • pipeline critical sections (ALU 1 cc, mult 2 cc,
    200 MHz)
  • RISC (simple, atomic, independent instructions)
  • performance comes from compiler (pipelining,
    unroll)
  • load-store
  • orthogonal (2 identical DP, add on 6 units)
  • deterministic (no interlock)
  • conditional instructions (guarding)
  • instruction packing

63
Instruction packing
[Figure: classical encoding versus VelociTI encoding for eight operations A ... H. Classical encoding: every instruction word has a field for each issue slot, so many nops (n) are fetched, whether execution is fully serial, mixed serial/parallel or fully parallel.]
VelociTI encoding: only A ... H are stored, each with one bit that tells whether the next operation executes in parallel
Case                    A B C D E F G H
Fully serial            0 0 0 0 0 0 0 0
Mixed serial/parallel   1 1 0 1 0 0 1 0
Fully parallel          1 1 1 1 1 1 1 0
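The parallel bits of the figure can be decoded with a few lines of C. This is a sketch of the grouping rule only; the function name split_execute_packets and the array representation are assumptions, and real instruction words are not parsed here.

```c
#include <stddef.h>

/* Groups n operations into sets that issue in the same cycle.
   pbits[i] != 0 means operation i+1 issues together with operation i.
   The start index of every group is written to starts[]; the return value
   is the number of groups, i.e. the number of issue cycles. */
size_t split_execute_packets(const int *pbits, size_t n, size_t *starts)
{
    size_t count = 0;
    int new_group = 1;
    for (size_t i = 0; i < n; ++i) {
        if (new_group)
            starts[count++] = i;
        new_group = !pbits[i];        /* a 0 bit closes the current group */
    }
    return count;
}
```

For the mixed case 1 1 0 1 0 0 1 0 the function reports four groups starting at A, D, F and G, that is {A, B, C}, {D, E}, {F} and {G, H}. No nops are stored for the unused issue slots.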
64
Instruction cycle counts for BDTi benchmarks
65
Subword parallelism
(custom operators in TM)
[Figure: the 1st and 2nd input operands are each split into byte3 ... byte0; four identical operators (op) work on corresponding bytes and produce byte3 ... byte0 of the output operand.]
32 bits = 4 bytes are processed independently
Ex.: +, -, min, max => quadumin, quadumax, ...
66
Subword parallelism
(custom operators in TM)
+ faster execution
- rewrite effort (e.g. different types for in- and outputs)
Typical example: graphics (4 x 32-bit floating point)
67
Subword parallelism
MPEG example
for (i = 0; i < 64; i++) {
    temp = ((back(i) + forward(i) + 1) >> 1) + idct(i);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i] = temp;
}
Remark: simple example without interloop
dependencies
68
for (i = 0; i < 64; i += 4) {
    temp = ((back(i+0) + forward(i+0) + 1) >> 1) + idct(i+0);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+0] = temp;
    temp = ((back(i+1) + forward(i+1) + 1) >> 1) + idct(i+1);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+1] = temp;
    temp = ((back(i+2) + forward(i+2) + 1) >> 1) + idct(i+2);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+2] = temp;
    temp = ((back(i+3) + forward(i+3) + 1) >> 1) + idct(i+3);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+3] = temp;
}
69
temp0 = ((back(i+0) + forward(i+0) + 1) >> 1);
temp1 = ((back(i+1) + forward(i+1) + 1) >> 1);
temp2 = ((back(i+2) + forward(i+2) + 1) >> 1);
temp3 = ((back(i+3) + forward(i+3) + 1) >> 1);
=> quadavg
temp0 += idct(i+0);
if (temp0 > 255) temp0 = 255; else if (temp0 < 0) temp0 = 0;
temp1 += idct(i+1);
if (temp1 > 255) temp1 = 255; else if (temp1 < 0) temp1 = 0;
temp2 += idct(i+2);
if (temp2 > 255) temp2 = 255; else if (temp2 < 0) temp2 = 0;
temp3 += idct(i+3);
if (temp3 > 255) temp3 = 255; else if (temp3 < 0) temp3 = 0;
=> dspuquadaddui
destination[i+0] = temp0;
destination[i+1] = temp1;
destination[i+2] = temp2;
destination[i+3] = temp3;
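What the two custom operators compute on packed bytes can be emulated in portable C. This is a sketch under assumptions: the function names quad_avg and quad_add_clip are hypothetical stand-ins, the semantics are inferred from the regrouped code above (rounded average of unsigned bytes; unsigned byte plus signed byte, clipped to 0 ... 255), and the correction term is assumed to fit in a signed byte.

```c
#include <stdint.h>

/* Byte-wise rounded average of four packed unsigned bytes. */
uint32_t quad_avg(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int k = 0; k < 4; ++k) {
        uint32_t x = (a >> (8 * k)) & 0xFFu;
        uint32_t y = (b >> (8 * k)) & 0xFFu;
        r |= ((x + y + 1u) >> 1) << (8 * k);   /* at most 255, fits a byte */
    }
    return r;
}

/* Byte-wise sum of an unsigned byte and a signed byte, clipped to 0 ... 255. */
uint32_t quad_add_clip(uint32_t u, uint32_t s)
{
    uint32_t r = 0;
    for (int k = 0; k < 4; ++k) {
        int32_t x = (int32_t)((u >> (8 * k)) & 0xFFu);
        int32_t y = (int32_t)((s >> (8 * k)) & 0xFFu);
        if (y > 127)
            y -= 256;                          /* reinterpret as signed byte */
        int32_t t = x + y;
        if (t > 255) t = 255;
        else if (t < 0) t = 0;
        r |= (uint32_t)t << (8 * k);
    }
    return r;
}
```

With these two helpers, four iterations of the original loop collapse into quad_add_clip(quad_avg(back4, forward4), idct4) on packed 32-bit words, where back4, forward4 and idct4 are hypothetical names for the packed inputs. On a processor with subword operators each helper would be a single operation rather than a loop.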

70
  • Will embedded CPUs and DSPs converge ?
  • Converging forces
  • both include a hardware multiplier
  • trend in DSPs towards caches and RTK
  • trend in DSPs towards C/C++
  • common trend towards VLIW
  • Diverging forces
  • deeply embedded code (DSP) vs. end-user SW (CPU)
  • different RTKs
  • SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
  • Conclusions: VLIW
  • good balance between hw and sw
  • between efficiency (ILP) and cost
  • fundamental problems: code size, interruptibility