Title: Analog Devices TigerSHARC
1Analog Devices TigerSHARC DSP Family
- Presented by: Mike Lee and Mike Demcoe
- Date: April 8th, 2002
2TigerSHARC Architectural Overview
- High performance, 128-bit successor to the ADSP-2106x SHARC family
- ADSP-TS101S, the newest TigerSHARC DSP, operates at 250MHz!
- Multiple computational units
- Two compute blocks, each containing a register file, ALU, multiplier, and shifter.
- Two additional integer ALUs
- Two hardware loop counter registers
- Can execute up to four independent 32-bit instructions at a time
- Or, eight 16-bit instructions
- Very wide word widths for high precision arithmetic
- Designed to be used in a multiple processor environment
3TigerSHARC Architecture Overview (cont)
- BTB (Branch Target Buffer) as a means of alleviating issues with the deep pipeline
- 32-instruction, 4-way set-associative cache
- User controlled Branch Prediction
- Three 128-bit blocks of memory which provide access to a program and two data operands without causing instruction/data conflicts.
- Load-store, Harvard architecture, like SHARC.
- Native support for complex number instructions
4The TS101S Architecture
5Details of Multiple Compute Blocks
- Two computational units, each containing:
- Register file: Multi-ported to allow multiple accesses to registers in a single clock cycle
- General purpose registers!
- Contains 32 words, each word being 32 bits in length.
- ALU: Fixed-point and floating point
- Multiplier: Fixed-point and floating point
- Also features MAC (multiply-and-accumulate) capabilities
- Shifter: Standard logical and arithmetic shifts as well as bit manipulation
6The TS101S Pipeline
7Pipelines and Instruction Related Information
- ADSP-21061
- Three stage pipeline
- 20ns instruction cycle
- SISD but can put instructions in parallel
- ADSP-TS101S
- Eight stage pipeline with IAB
- 4ns instruction cycle
- MIMD and can also put instructions in parallel
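The two instruction cycle times above translate directly into clock rates. A minimal sketch of that arithmetic follows; the function name `clock_rate_mhz` is purely illustrative, and the comparison covers raw cycle time only, not the extra work each cycle can do through parallel issue.

```python
def clock_rate_mhz(cycle_ns):
    """Clock rate in MHz for a given instruction cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


sharc_mhz = clock_rate_mhz(20)   # ADSP-21061: 20 ns instruction cycle
tiger_mhz = clock_rate_mhz(4)    # ADSP-TS101S: 4 ns instruction cycle
cycle_speedup = tiger_mhz / sharc_mhz

print(sharc_mhz, tiger_mhz, cycle_speedup)
```

The 4 ns figure agrees with the 250MHz clock quoted for the ADSP-TS101S earlier in the deck.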
8Loops, Branching and Timers
- ADSP-21061
- Zero-overhead hardware loop support
- Delayed Branching
- One timer
- ADSP-TS101S
- Little support for zero-overhead hardware loops
- 32-entry 4-way associative BTB cache with Branch prediction
- Two timers
9Memory and Buses
- ADSP-21061
- 1 Mbit dual ported SRAM
- Shared by three buses (PM, DM, I/O)
- PM and DM share a port while the I/O receives its own
- ADSP-TS101S
- 6 Mbit of SRAM (Quad Ported??)
- User defined partitions
- Each block is accessed by one 128-bit bus
10Multiplication and other Nifty Tricks
- ADSP-21061
- MAC instructions (MRF and MRB)
- Various precision output (32, 40, or 80 bit)
- ADSP-TS101S
- Each compute block has its own set of MAC registers
- Eight 16-bit MACs with 40-bit accumulation, or two 32-bit MACs with 80-bit accumulation
- Complex number MAC instructions
- 128-bit accelerator
- Trellis decoding (8 Trellis butterflies per cycle)
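To make the complex MAC idea concrete, here is a minimal Python sketch of what one complex multiply-and-accumulate computes. It models only the arithmetic (four real multiplies plus the accumulation), not the TigerSHARC instruction, its register names, or its accumulator widths; `complex_mac` is an illustrative name.

```python
def complex_mac(acc, a, b):
    """Return acc + a*b, with each complex value given as a (real, imag) tuple."""
    ar, ai = a
    br, bi = b
    return (acc[0] + ar * br - ai * bi,   # real part of the product
            acc[1] + ar * bi + ai * br)   # imaginary part of the product


acc = (0, 0)
for a, b in [((1, 2), (3, 4)), ((5, -1), (2, 2))]:
    acc = complex_mac(acc, a, b)

print(acc)
```

Without hardware support, each step costs four multiplies and several adds in separate instructions, which is why a single complex MAC instruction matters for the FFT and for complex filters.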
11Data Address Generation
- ADSP-21061
- 2 data address generation units (DAGs)
- 8 circular buffers per DAG
- ADSP-TS101S
- 2 data address generation units (IALUs)
- 4 circular buffers per IALU
- Both support modulo arithmetic, bit reversal
addressing, and post and pre-modify instructions
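Circular buffering with post-modify can be modelled in a few lines. The sketch below is a software illustration under assumed names (`CircularBuffer`, `post_modify_write`); on the DSPs the wrap-around is done by the address generation hardware at no extra cost.

```python
class CircularBuffer:
    """Fixed-length buffer whose index wraps around using modulo arithmetic."""

    def __init__(self, length):
        self.data = [0] * length
        self.index = 0

    def post_modify_write(self, value, step=1):
        """Write at the current index, then advance the index modulo the length."""
        self.data[self.index] = value
        self.index = (self.index + step) % len(self.data)


buf = CircularBuffer(4)
for sample in range(1, 7):      # six writes into a four-entry buffer
    buf.post_modify_write(sample)

print(buf.data, buf.index)
```

The two oldest samples are overwritten by the fifth and sixth writes, which is the behaviour a filter delay line needs.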
12Ease of Use
- ADSP-21061
- Easy to use
- Algebraic instruction set
- Visual DSP environment
- ADSP-TS101S
- Similar to 21061, but now you have to consider 2 compute blocks
- ADI suggests leaving parallelization to their optimizing compiler
- Visual DSP environment
13Specific DSP Algorithms and the TigerSHARC
- In ENEL515 (and/or related articles) we've studied the FIR, IIR, and FFT algorithms
- TigerSHARC has a massively parallel architecture that is tailored to performing these algorithms.
14FIR Filter Characteristics
- Think back (or forward, depending on how much you've procrastinated) to Lab 3.
- FIR Characteristics
- Simple, long loop
- Repetitive calculations (multiply, then add!)
- Access to an array of coefficients, and an array of delay-line values
- Few data dependency issues during the calculation of a single output
- For a filter of length N, require N multiplications and N adds to obtain a single output value.
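The characteristics above can be sketched as a plain reference FIR in Python. This is not the Lab 3 code; the function names and the list-based delay line are assumptions made for illustration.

```python
def fir_output(coeffs, delay_line):
    """One output value: N multiplies and N adds for an N-tap filter."""
    acc = 0
    for c, x in zip(coeffs, delay_line):
        acc += c * x            # multiply, then add
    return acc


def fir_filter(coeffs, samples):
    """Run the filter over a sample stream, newest sample first in the delay line."""
    delay_line = [0] * len(coeffs)
    out = []
    for s in samples:
        delay_line = [s] + delay_line[:-1]   # shift in the new sample
        out.append(fir_output(coeffs, delay_line))
    return out


impulse_response = fir_filter([1, 2, 3], [1, 0, 0, 0])
print(impulse_response)
```

Feeding in a unit impulse returns the coefficients themselves, a quick sanity check for any FIR implementation.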
15TigerSHARC and the FIR Filter
- The general idea is: Divide and conquer!
- Take a filter of size N and split it into two groups of N/2
- Utilize the TigerSHARC's multiple computational units and MAC instructions to perform the algorithm in ½ the time (plus some overhead)
- Two hardware loop counters to simultaneously control the two new N/2 size FIR loops with no overhead!
- Can do all of the following SIMULTANEOUSLY!
- Fetch two operands (one coefficient, one delay line value) from two separate memory banks
- Fetch the next instruction
- Perform arithmetic operations on the PREVIOUS operands!
- Unlike SHARC, instruction/data clashes are non-existent due to the numerous bus paths linking computational units to memory space
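The divide-and-conquer split can be illustrated with a small Python sketch. The two partial sums below run one after the other; on the TigerSHARC the idea is that each half would run on its own compute block at the same time. The names `mac_half`, `acc_x` and `acc_y` are illustrative only.

```python
def mac_half(coeffs, delay_line):
    """Multiply-accumulate over one half of the taps."""
    acc = 0
    for c, x in zip(coeffs, delay_line):
        acc += c * x
    return acc


def fir_output_split(coeffs, delay_line):
    """One FIR output computed as two N/2 partial sums plus a final add."""
    half = len(coeffs) // 2
    acc_x = mac_half(coeffs[:half], delay_line[:half])   # first compute block
    acc_y = mac_half(coeffs[half:], delay_line[half:])   # second compute block
    return acc_x + acc_y                                 # the "some overhead"


result = fir_output_split([1, 2, 3, 4], [5, 6, 7, 8])
print(result)
```

The final add that combines the two accumulators is part of the overhead that keeps the speed-up slightly below a factor of two.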
16TigerSHARC and the FIR Filter (continued.)
- 8-cycle-deep pipeline
- Stalls are expensive...
- Branch Target Buffer reduces performance loss that results from branching in a deeply pipelined processor
- The long loop characteristic of the FIR filter algorithm allows us to keep the 8-cycle-deep pipeline full
- Full pipeline means fast algorithm
- FIR Filter algorithms rely heavily on data sets that are aligned in memory
- Post-increment is your friend
- TigerSHARC Quad Data Accesses: Supply four aligned words to one compute block or two aligned words to each compute block.
17Example Instructions
- X/Y Conditional Compute
- if xALE do, R0R1R2
- Condition codes:
- AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1.
- A = Adder, M = Multiplier, S = Shifter
- Memory Addressing
- Indirect post-modify with update, register offset
- YR20J1J2
- Indirect post-modify with update, 8-bit immediate offset
- QK10xF8XYR30
- Indirect pre-modify no update, register offset
- J32LK1K2
- Indirect pre-modify no update, immediate offset
- YR32LK10x0003333
- Complex Quad 16-bit Fixed Point Multiplication Instructions
- XYXY MRa Rm Rn (UICCRJ)
- XYXY RsRsdMRa, MRa Rm Rn (UICJ)
18FIR Code Example
19TigerSHARC and the IIR Filter
- Short, simple loop characteristic
- Means loop overhead is more of a concern
- Means keeping the pipeline full is tougher!
- Time to unroll the loop, although ADI says to let VisualDSP do it for you.
- Again, split up the calculations on an N-tap IIR filter into two N/2 sets operating simultaneously
- Idea: One computational block does feedforward calculations, one does feedback!
- Complex numbers commonly required
- Hardware support for complex MAC in TigerSHARC
- Again, Quad Data Access comes in handy for aligned data
- Post-increment is still your friend
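The feedforward/feedback split can be sketched with a direct form I IIR in Python. This is an illustrative model, not TigerSHARC code: the function name, the coefficient lists `b` and `a`, and the assumption that `a[0]` equals 1 are all choices made for the example.

```python
def iir_filter(b, a, samples):
    """Direct form I IIR filter; a[0] is assumed to be 1."""
    x_hist = [0] * len(b)          # past inputs, newest first
    y_hist = [0] * (len(a) - 1)    # past outputs, newest first
    out = []
    for s in samples:
        x_hist = ([s] + x_hist)[:len(b)]
        # The two sums are independent of each other, so each could be
        # handed to its own compute block.
        feedforward = sum(bk * xk for bk, xk in zip(b, x_hist))
        feedback = sum(ak * yk for ak, yk in zip(a[1:], y_hist))
        y = feedforward - feedback
        y_hist = ([y] + y_hist)[:len(a) - 1]
        out.append(y)
    return out


response = iir_filter([1], [1, -0.5], [1, 0, 0, 0])
print(response)
```

Each output still depends on the previous outputs, so unlike the FIR case the loop carries a data dependency from one iteration to the next.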
20TigerSHARC and the FFT
- Does not use the same MAC modes that IIR and FIR filters do.
- Requires more complicated addressing modes
- Example: Bit reverse addressing
- Found on both SHARC and TigerSHARC
- Difficult to split onto separate computational units and even more difficult to split amongst distributed processors
- Requires large arrays of complex variables and fixed coefficients
- Hardware complex number MAC comes in handy again!
- Large arrays of aligned data: Quad Data access again!
- Requires HIGH-PRECISION arithmetic
- Luckily we have 64-bit fixed point arithmetic and 40-bit extended floating point arithmetic.
- 80-bit MAC precision
- FFT requires many intermediate values
- 32 GP registers in a single computational block
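Bit-reversed addressing, mentioned above, reorders the data of a radix-2 FFT. A minimal software sketch follows; the function names are illustrative, and the transform length is assumed to be a power of two. On SHARC and TigerSHARC the address generators produce this sequence in hardware.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit across
        index >>= 1
    return result


def bit_reversed_order(n):
    """Bit-reversed index sequence for an n-point FFT (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


order = bit_reversed_order(8)
print(order)
```

Doing this in software costs a loop per address, which is why hardware support for the addressing mode is worth having.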
21 http://www.analog.com/technology/dsp/Sharc/benchmarks.html
22 http://www.analog.com/technology/dsp/TigerSHARC/benchmarks.html
4ns Instruction Cycle
23Conclusion
- TigerSHARC has a very SHARC-like architecture, except it's MUCH more complex.
- Highly optimized for parallelism
- Major features: Complex number support, multiple computational units, high instruction throughput, wider buses.
- Performs DSP algorithms including FIR, IIR, FFT significantly faster than SHARC!
25Note from Dr. Smith
- Information on the Burg algorithm is available outside ICT536. It is essentially an FIR filter used for prediction (i.e. what FIR coefficients are needed so that the filtered signal is "white noise").
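The whitening idea in the note can be illustrated with a toy example. The sketch below is NOT the Burg algorithm: it assumes the signal model (first-order autoregressive with coefficient 0.5) is already known and only shows that the matching FIR prediction-error filter returns the original excitation. All names and values are hypothetical.

```python
def ar1_signal(a, excitation):
    """Generate x[n] = a * x[n-1] + e[n] from an excitation sequence e."""
    x, prev = [], 0.0
    for e in excitation:
        prev = a * prev + e
        x.append(prev)
    return x


def prediction_error(coeffs, signal):
    """FIR filter: sum over k of coeffs[k] * signal[n-k], zero initial state."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * signal[n - k]
        out.append(acc)
    return out


excitation = [1.0, 0.0, -1.0, 2.0]       # stands in for the "white" input
x = ar1_signal(0.5, excitation)
residual = prediction_error([1.0, -0.5], x)
print(residual)
```

An estimation method such as Burg's has to find those FIR coefficients from the signal alone; here they are simply given.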