Title: DSP: an introduction
1. DSP: an introduction
- Why a DSP?
- Characteristics of a DSP
- Some commercial DSPs
2. Why a DSP?
- It's easy: we want an architecture optimized for Digital Signal Processing
- Some versions are further optimized for specific applications
  - e.g. very low power consumption for mobile phones
3. What is the difference between a DSP and a general purpose processor? (1/4)
- Memory architecture and bus
- The first processors (in the 1940s) had a Harvard architecture: separate memories for program and data
- But it's complex -> soon replaced by the Von Neumann architecture: no real difference between program and data (an instruction has two fields: operation and data)
- Problem: the processor cannot access instructions and data simultaneously
- To improve performance: Harvard architecture again!
- In particular
  - separate memories and buses for program and data
  - possibly, another separate bus for the DMA
4. What is the difference between a DSP and a general purpose processor? (2/4)
- A DSP is often used to realize a linear filter
- The convolution integral is actually a sum:
  $y_n = \sum_i x_{n-i} h_i$
  - if the number of terms is finite: FIR filter (finite impulse response),
  - otherwise: IIR (infinite impulse response),
  - which can be realized using two finite sums:
  $y_n = \sum_{i \ge 0} x_{n-i} b_i + \sum_{i \ge 1} y_{n-i} a_i$
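The two filter types can be sketched in a few lines of plain Python. This is only an illustration of the sums, not DSP code: the function names `fir` and `iir` are made up for this example, samples before the start of the input are assumed to be zero, and the feedback terms are added (many textbooks subtract them, which only flips the sign of the `a` coefficients).

```python
def fir(x, h):
    """FIR filter: y[n] = sum_i x[n-i] * h[i], with x[k] = 0 for k < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, hi in enumerate(h):
            if n - i >= 0:
                acc += x[n - i] * hi
        y.append(acc)
    return y


def iir(x, b, a):
    """IIR filter built from two finite sums.

    y[n] = sum_{i>=0} x[n-i] * b[i] + sum_{i>=1} y[n-i] * a[i-1]
    (sign convention assumed for this sketch: feedback terms are added).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, bi in enumerate(b):            # feed-forward sum over inputs
            if n - i >= 0:
                acc += x[n - i] * bi
        for i, ai in enumerate(a, start=1):   # feedback sum over past outputs
            if n - i >= 0:
                acc += y[n - i] * ai
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print(fir(impulse, [0.5, 0.25]))   # response is zero after len(h) samples
print(iir(impulse, [1.0], [0.5]))  # response keeps decaying: 1, 0.5, 0.25, ...
```

Feeding an impulse shows the difference: the FIR output dies out after as many samples as there are coefficients, while the IIR output never reaches exactly zero in principle.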
5. What is the difference between a DSP and a general purpose processor? (3/4)
- A common operation in a FIR or IIR filter is A = B*C + D
- a hardware multiplier (introduced in DSPs in the 1970s) is needed
- multiply and accumulate in only one clock cycle: MAC instruction.
- Actually, the MAC is in a loop
- H/W for address generation (the access to memory is not random) and zero overhead loops
  - autoincrement, circular addressing
- H/W saturation
- Instructions to perform a division quickly
- Bit reversal for FFT
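What the address-generation, saturation and bit-reversal hardware does can be imitated in software. The sketch below is a plain Python model, not the behaviour of any specific DSP: the names `CircularFir`, `saturate16` and `bit_reverse` are invented for this example, samples and coefficients are assumed to be integers, and saturation is applied once to the final accumulator (real devices differ in accumulator width and in when they saturate).

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(value):
    """Clamp to the signed 16-bit range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, value))


class CircularFir:
    """FIR filter whose delay line is a circular buffer (no data is moved)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0] * len(self.coeffs)
        self.pos = 0  # index of the newest sample

    def step(self, sample):
        n = len(self.delay)
        self.pos = (self.pos + 1) % n      # circular auto-increment
        self.delay[self.pos] = sample      # overwrite the oldest sample
        acc = 0
        idx = self.pos
        for c in self.coeffs:              # one multiply-accumulate per tap
            acc += c * self.delay[idx]
            idx = (idx - 1) % n            # circular auto-decrement
        return saturate16(acc)


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (FFT data reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


f = CircularFir([1, 2, 3])
print([f.step(s) for s in [1, 0, 0, 0]])      # impulse response of the filter
print([bit_reverse(i, 3) for i in range(8)])  # access order for an 8-point FFT
```

On a DSP the modulo arithmetic, the loop counter and the clamping cost no extra instructions, which is why the inner loop can run at one MAC per clock cycle.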
6. What is the difference between a DSP and a general purpose processor? (4/4)
- Often, data are 16- or 8-bit wide (e.g., audio or images)
- a 32-bit ALU can be split into two 16-bit ALUs or four 8-bit ALUs,
  - -> 2 or 4 operations in parallel
- several ALUs which work in parallel
- fixed point ALUs, or 16-bit ALUs, to reduce power consumption and costs
- optimized versions
  - costs for consumer applications
  - power for mobile applications
  - for specific applications, e.g. electric motor control
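The idea of a split ALU can be modelled as follows: two 16-bit values are packed in one 32-bit word, and the addition is done lane by lane so that a carry out of the low half never reaches the high half. The helper names `pack16`, `unpack16` and `add_2x16` are hypothetical, and the lanes wrap around on overflow in this sketch (a real DSP might saturate instead).

```python
MASK16 = 0xFFFF


def pack16(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16(word):
    """Return the (high, low) 16-bit lanes of a 32-bit word."""
    return (word >> 16) & MASK16, word & MASK16


def add_2x16(a, b):
    """Two independent 16-bit additions on one 32-bit word.

    The carry out of the low lane is discarded, not propagated.
    """
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


a = pack16(1, 0xFFFF)
b = pack16(2, 1)
print(unpack16(add_2x16(a, b)))         # lanes stay independent
print(unpack16((a + b) & 0xFFFFFFFF))   # plain 32-bit add: carry leaks into the high lane
```

In hardware this amounts to cutting the carry chain at the lane boundary, so the same adder delivers two (or four) results per cycle.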
7. Example: C30 (Texas Instruments, 1982)
8. Example: FIR filter using a C30
9. Note: several of these characteristics, which were born on DSPs, have been ported to general purpose processors
- E.g. the cache in the Pentium processor is Harvard-like
10. Another example: several units working in parallel, and splittable ALUs (see the MMX extensions) in the Pentium 4 processor
11. Pipeline
- Example of a 4-stage pipeline (TI C30)
- each instruction is executed in 4 clock cycles, but (normally) can be issued just 1 cycle after the previous one (data are needed only 3 cycles later)
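The timing of an ideal 4-stage pipeline (no stalls) can be tabulated with a short script. The stage letters F, D, R, E (fetch, decode, read, execute) and the function names are assumptions made for this sketch.

```python
STAGES = ["F", "D", "R", "E"]  # assumed stage names: fetch, decode, read, execute


def pipelined_cycles(n_instructions, stages=len(STAGES)):
    """Cycles needed when a new instruction enters the pipeline every cycle."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def pipeline_table(n_instructions):
    """One row per instruction; each column is a clock cycle ('.' = not in pipeline)."""
    total = pipelined_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = []
        for cycle in range(total):
            stage = cycle - i  # instruction i enters the pipeline at cycle i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else ".")
        rows.append("".join(row))
    return rows


for row in pipeline_table(3):
    print(row)
print(pipelined_cycles(100), "cycles pipelined vs", 4 * 100, "cycles unpipelined")
```

Each instruction still takes 4 cycles from fetch to execute, but once the pipeline is full one instruction completes per cycle.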
12. Pipeline: branch (e.g. on the C30)
- Standard branch: the pipeline is flushed to correctly handle the PC -> 4 cycles
- Delayed branch: the pipeline is not flushed, and the 3 following instructions are loaded before modifying the PC
  - -> only 1 cycle needed!

        BRD   label   ; delayed branch
        MPYF          ; executed
        ADDF          ; executed
        SUBF          ; executed
        AND           ; not executed
label   MPYF          ; fetched after SUBF
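The order in which instructions execute around a delayed branch can be reproduced with a toy interpreter. This is a sketch of the semantics only (three delay slots, operands ignored), not a model of the real C30; the function `run`, the tuple layout of the program and the plain `BR` mnemonic used for the standard branch are assumptions of this example.

```python
def run(program, delay_slots=3, max_steps=100):
    """Execute a toy program and return the mnemonics in execution order.

    program: list of (label, mnemonic, branch_target) tuples.
    'BR' jumps immediately; 'BRD' jumps after `delay_slots` more instructions.
    """
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    trace = []
    pc = 0
    pending = None  # (slots_left, destination) of a delayed branch in flight
    while pc < len(program) and len(trace) < max_steps:
        _, op, target = program[pc]
        trace.append(op)
        pc += 1
        if pending is not None:
            slots, dest = pending
            slots -= 1
            if slots == 0:
                pc = dest          # the PC is modified only now
                pending = None
            else:
                pending = (slots, dest)
        elif op == "BRD":
            pending = (delay_slots, labels[target])
        elif op == "BR":
            pc = labels[target]
    return trace


program = [
    (None, "BRD", "label"),
    (None, "MPYF", None),
    (None, "ADDF", None),
    (None, "SUBF", None),
    (None, "AND", None),
    ("label", "MPYF", None),
]
print(run(program))  # AND never appears in the trace
```

The three instructions after `BRD` do useful work in the cycles that a standard branch would lose to the pipeline flush, which is why the programmer (or compiler) tries to fill them.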
13. Architectures
- In order to exploit the instruction level parallelism (ILP): two possible architectures
- Superscalar: the parallelism is dynamically managed by the hardware
- Very Long Instruction Word (VLIW): the parallelism is statically managed by the compiler
- What is the problem?
- Dependences in data or control can generate conflicts
  - on data (an instruction needs the result of a previous instruction, but the result is not ready yet), or
  - on control (conditional jump, but the condition is not ready yet)
  - -> pipeline stall
14. Superscalar
- The analysis of the independent instructions is dynamically done by hardware (which is complex!)
- The sequence of instructions can be executed out-of-order; then, the completion of the instructions (commit) is done in-order to correctly update the state of the CPU
15. VLIW
- Very Long Instruction Word (VLIW): the parallelism is statically managed by the compiler
- The analysis of independent instructions is statically realized during the compilation phase
  - the instructions which can be executed in parallel are assembled into long instructions and sent to the various functional units in-order
- Convenient solution for DSP programs (fixed-length loops, few conditional operations); less convenient for general purpose applications
- Simpler hardware! But a specific compilation for each platform is needed
- Deterministic behaviour -> exact computation of execution times
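The compile-time packing can be illustrated with a deliberately simple greedy scheduler: it walks the instructions in program order and starts a new long instruction whenever the next one depends on something already in the current bundle, or the bundle is full. Real VLIW compilers are far more sophisticated (they reorder instructions and account for functional-unit types and latencies); the `Instr` tuple, the function `pack_vliw` and the register names are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction format: a name, one destination, some source registers.
Instr = namedtuple("Instr", "name dst srcs")


def pack_vliw(instrs, width=4):
    """Greedy in-order packing of instructions into bundles of at most `width`."""
    bundles = []
    current, written, read = [], set(), set()
    for ins in instrs:
        conflict = (
            any(s in written for s in ins.srcs)  # needs a result from this bundle
            or ins.dst in written                # two writes to the same register
            or ins.dst in read                   # overwrites a value still being read
        )
        if current and (conflict or len(current) == width):
            bundles.append(current)              # close the current long instruction
            current, written, read = [], set(), set()
        current.append(ins.name)
        written.add(ins.dst)
        read.update(ins.srcs)
    if current:
        bundles.append(current)
    return bundles


program = [
    Instr("mul1", "r1", ("a0", "b0")),
    Instr("mul2", "r2", ("a1", "b1")),
    Instr("add1", "r3", ("r1", "r2")),
    Instr("mul3", "r4", ("a2", "b2")),
    Instr("add2", "r5", ("r3", "r4")),
]
for bundle in pack_vliw(program):
    print(bundle)
```

Because the bundles are fixed at compile time, the number of long instructions gives the execution time directly (assuming one bundle per cycle), which is the deterministic behaviour mentioned above.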