Analog Devices TigerSHARC - PowerPoint PPT Presentation

Title: Analog Devices TigerSHARC


1
Analog Devices TigerSHARC DSP Family
  • Presented by: Mike Lee and Mike Demcoe
  • Date: April 8th, 2002

2
TigerSHARC Architectural Overview
  • High performance, 128-bit successor to the
    ADSP-2106x SHARC family
  • ADSP-TS101S, the newest TigerSHARC DSP, operates
    at 250MHz!
  • Multiple computational units
  • Two compute blocks, each containing a register
    file, ALU, multiplier, and shifter.
  • Two additional integer ALUs
  • Two hardware loop counter registers
  • Can execute up to four independent 32-bit
    instructions at a time
  • Or, eight 16-bit instructions
  • Very wide word widths for high precision
    arithmetic
  • Designed to be used in a multiple processor
    environment

3
TigerSHARC Architecture Overview (cont)
  • BTB (Branch Target Buffer) as a means of
    alleviating issues with the deep pipeline
  • 32-instruction, 4-way set-associative cache
  • User controlled Branch Prediction
  • Three memory blocks, each 128 bits wide, which
    provide access to an instruction and two data
    operands without causing instruction/data
    conflicts.
  • Load-store, Harvard architecture, like SHARC.
  • Native support for complex number instructions

4
The TS101S Architecture
5
Details of Multiple Compute Blocks
  • Two computational units, each containing:
  • Register file: multi-ported to allow multiple
    accesses to registers in a single clock cycle
  • General purpose registers!
  • Contains 32 words, each word being 32 bits in
    length.
  • ALU: fixed-point and floating-point
  • Multiplier: fixed-point and floating-point
  • Also features MAC (multiply-and-accumulate)
    capabilities
  • Shifter: standard logical and arithmetic shifts
    as well as bit manipulation

6
The TS101S Pipeline
7
Pipelines and Instruction Related Information
  • ADSP-21061
  • Three stage pipeline
  • 20ns instruction cycle
  • SISD but can put instructions in parallel
  • ADSP-TS101S
  • Eight stage pipeline with IAB
  • 4ns instruction cycle
  • MIMD and can also put instructions in parallel
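
The cycle times above follow from the clock rate, since period = 1 / frequency. A minimal C sketch of that arithmetic, assuming the 250 MHz figure from the overview slide. The function names are hypothetical, and the 50 MHz value for the ADSP-21061 is only what a 20 ns cycle implies.

```c
#include <assert.h>

/* Convert a clock rate in MHz to an instruction cycle time in ns. */
double cycle_time_ns(double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return 1000.0 / clock_mhz;   /* 1 / (MHz * 1e6) seconds, expressed in ns */
}

/* Peak instruction rate (millions of instructions per second) when
 * instr_per_cycle instructions issue every cycle. This is an upper
 * bound only; stalls and dependencies lower the real figure. */
double peak_mips(double clock_mhz, unsigned instr_per_cycle)
{
    assert(clock_mhz > 0.0);
    return clock_mhz * (double)instr_per_cycle;
}
```

For example, cycle_time_ns(250.0) gives the 4 ns cycle quoted for the ADSP-TS101S, and four 32-bit instructions per cycle at 250 MHz is a peak of 1000 million instructions per second.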

8
Loops, Branching and Timers
  • ADSP-21061
  • Zero-overhead hardware loop support
  • Delayed Branching
  • One timer
  • ADSP-TS101S
  • Little support for zero-overhead hardware loops
  • 32-entry 4-way associative BTB cache with Branch
    prediction
  • Two timers

9
Memory and Buses
  • ADSP-21061
  • 1 Mbit dual ported SRAM
  • Shared by three buses (PM, DM, I/O)
  • PM and DM share a port while the I/O receives
    its own
  • ADSP-TS101S
  • 6 Mbit of SRAM (Quad Ported??)
  • User defined partitions
  • Each block is accessed by one 128-bit bus

10
Multiplication and other Nifty Tricks
  • ADSP-21061
  • MAC instructions (MRF and MRB)
  • Various precision output (32, 40, or 80 bit)
  • ADSP-TS101S
  • Each compute block has its own set of MAC
    registers
  • Eight 16-bit MACs with 40-bit accumulation, or
    two 32-bit MACs with 80-bit accumulation
  • Complex number MAC instructions
  • 128-bit accelerator
  • Trellis decoding (8 Trellis butterflies per cycle)

11
Data Address Generation
  • ADSP-21061
  • 2 data address generation units (DAGs)
  • 8 circular buffers per DAG
  • ADSP-TS101S
  • 2 data address generation units (IALU)
  • 4 circular buffers per IALU
  • Both support modulo arithmetic, bit reversal
    addressing, and post- and pre-modify instructions
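
What a circular buffer with post-modify addressing buys you can be sketched in C as follows. The struct and function names are hypothetical, and the wrap-around is done in software here, whereas the DAG or IALU hardware does it for free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of one circular buffer:
 * base address, length, and current index. */
typedef struct {
    float *base;
    size_t length;
    size_t index;
} circ_buf;

/* Post-modify read: return the value at the current index, then
 * advance the index by `modifier` with modulo wrap-around. */
float circ_read_postmod(circ_buf *cb, size_t modifier)
{
    assert(cb != NULL && cb->base != NULL && cb->length > 0);
    float value = cb->base[cb->index];
    cb->index = (cb->index + modifier) % cb->length;
    return value;
}
```

With a three-element buffer {10, 20, 30} and a modifier of 2, successive reads return 10, 30, 20 and the index is back at 0, with no explicit bounds check in the caller's loop.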

12
Ease of Use
  • ADSP-21061
  • Easy to use
  • Algebraic instruction set
  • Visual DSP environment
  • ADSP-TS101S
  • Similar to 21061 but now have to consider 2
    compute blocks
  • ADI suggests leaving parallelization to their
    optimizing compiler
  • Visual DSP environment

13
Specific DSP Algorithms and the TigerSHARC
  • In ENEL515 (and/or related articles) we've
    studied the FIR, IIR, and FFT algorithms
  • TigerSHARC has a massively parallel architecture
    that is tailored to performing these algorithms.

14
FIR Filter Characteristics
  • Think back (or forward, depending on how much
    you've procrastinated) to Lab 3.
  • FIR Characteristics
  • Simple, long loop
  • Repetitive calculations (multiply, then add!)
  • Access to an array of coefficients, and an array
    of delay-line values
  • Few data dependency issues during the calculation
    of a single output
  • For a filter of length N, require N
    multiplications and N adds to obtain a single
    output value.
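
The characteristics above amount to one multiply and one add per tap. A minimal reference version in C, assuming float data and a delay line stored newest-first; the function name and array layout are illustrative choices, not taken from the lab code:

```c
#include <assert.h>
#include <stddef.h>

/* Compute one FIR output: y = sum over k of coeffs[k] * delay[k].
 * Assumed layout: delay[0] is the newest sample, delay[n-1] the oldest. */
float fir_output(const float *coeffs, const float *delay, size_t n)
{
    assert(coeffs != NULL && delay != NULL);
    float acc = 0.0f;
    for (size_t k = 0; k < n; ++k) {
        acc += coeffs[k] * delay[k];   /* one multiply, one add per tap */
    }
    return acc;
}
```

Each iteration depends on the previous one only through the accumulator, which is the "few data dependency issues" property that makes the loop easy to parallelize.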

15
TigerSHARC and the FIR Filter
  • The general idea is: divide and conquer!
  • Take a filter of size N and split it into two
    groups of N/2
  • Utilize the TigerSHARC's multiple computational
    units and MAC instructions to perform the
    algorithm in ½ the time (plus some overhead)
  • Two hardware loop counters to simultaneously
    control the two new N/2 size FIR loops with no
    overhead!
  • Can do all of the following SIMULTANEOUSLY!
  • Fetch two operands (one coefficient, one delay
    line value) from two separate memory banks
  • Fetch the next instruction
  • Perform arithmetic operations on the PREVIOUS
    operands!
  • Unlike SHARC, instruction/data clashes are
    nonexistent due to the numerous bus paths
    linking computational units to memory space
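
The divide-and-conquer split can be modelled in plain C with two independent accumulators, one per half of the filter. This is only a sketch of the data partitioning: the function name is hypothetical, the two accumulators stand in for the two compute blocks, and ordinary C gives no guarantee that the two MACs really run in the same cycle.

```c
#include <assert.h>
#include <stddef.h>

/* One FIR output computed as two N/2-tap partial sums.
 * acc_x and acc_y never depend on each other inside the loop,
 * so the two multiply-accumulates per iteration are independent. */
float fir_output_split(const float *coeffs, const float *delay, size_t n)
{
    assert(coeffs != NULL && delay != NULL);
    assert(n % 2 == 0);                /* sketch assumes an even tap count */
    const size_t half = n / 2;
    float acc_x = 0.0f;                /* stands in for compute block X */
    float acc_y = 0.0f;                /* stands in for compute block Y */
    for (size_t k = 0; k < half; ++k) {
        acc_x += coeffs[k] * delay[k];
        acc_y += coeffs[half + k] * delay[half + k];
    }
    return acc_x + acc_y;              /* the "plus some overhead" step */
}
```

The loop runs N/2 times instead of N, and the final addition of the two partial sums is the extra work that the split introduces.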

16
TigerSHARC and the FIR Filter (continued.)
  • 8-cycle-deep pipeline
  • Stalls are expensive.
  • Branch Target Buffer reduces performance loss
    that results from branching in a deeply pipelined
    processor
  • The long loop characteristic of the FIR filter
    algorithm allows us to keep the 8-cycle-deep
    pipeline full
  • Full pipeline means fast algorithm
  • FIR Filter algorithms rely heavily on data sets
    that are aligned in memory
  • Post-increment is your friend
  • TigerSHARC Quad Data Accesses: supply four
    aligned words to one compute block or two aligned
    words to each compute block.
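
A small C sketch of what "aligned" means for a quad access, assuming word addressing (one address per 32-bit word), so that a quad-word boundary is any address that is a multiple of four. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if a word address sits on a quad-word (4-word) boundary. */
int is_quad_aligned(uint32_t word_address)
{
    return (word_address % 4u) == 0u;
}

/* Round a word address up to the next quad-word boundary,
 * e.g. to decide where to place the start of a coefficient array. */
uint32_t quad_align_up(uint32_t word_address)
{
    assert(word_address <= UINT32_MAX - 3u);   /* avoid wrap-around */
    return (word_address + 3u) & ~(uint32_t)3u;
}
```

Placing both the coefficient array and the delay line on such boundaries is what lets a single access fetch four words at once.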

17
Example Instructions
  • X/Y Conditional Compute
  • if xALE do, R0R1R2
  • Condition codes:
  • AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0,
    SF1.
  • A: Adder, M: Multiplier, S: Shifter
  • Memory Addressing
  • Indirect post-modify with update, register
    offset
  • YR20J1J2
  • Indirect post-modify with update, 8-bit immediate
    offset
  • QK10xF8XYR30
  • Indirect pre-modify no update, register offset
  • J32LK1K2
  • Indirect pre-modify no update, immediate offset
  • YR32LK10x0003333
  • Complex Quad 16-bit Fixed Point Multiplication
    Instructions
  • XYXY MRa Rm Rn (UICCRJ)
  • XYXY RsRsdMRa, MRa Rm Rn
    (UICJ)

18
FIR Code Example
19
TigerSHARC and the IIR Filter
  • Short, simple loop characteristic
  • Means loop overhead is more of a concern
  • Means keeping the pipeline full is tougher!
  • Time to unroll the loop, although ADI says to let
    VisualDSP do it for you.
  • Again, split up the calculations on an N-tap IIR
    filter into two N/2 sets operating simultaneously
  • Idea: one computational block does feedforward
    calculations, one does feedback!
  • Complex numbers commonly required
  • Hardware support for complex MAC in TigerSHARC
  • Again, Quad Data Access comes in handy for
    aligned data
  • Post-increment is still your friend
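
The feedforward/feedback split can be sketched in C for a direct-form I filter, y[n] = sum of b[k] x[n-k] minus sum of a[k] y[n-k]. The function name, argument layout, and sign convention are assumptions made for illustration; the two sums are independent, which is what allows one compute block to take each.

```c
#include <assert.h>
#include <stddef.h>

/* One IIR output in direct form I.
 * Assumed layout: x_hist[0] = x[n], x_hist[1] = x[n-1], ...
 *                 y_hist[0] = y[n-1], y_hist[1] = y[n-2], ...
 * a[] holds the feedback coefficients a1, a2, ... (a0 taken as 1). */
float iir_output(const float *b, const float *x_hist, size_t nb,
                 const float *a, const float *y_hist, size_t na)
{
    assert(b != NULL && x_hist != NULL && a != NULL && y_hist != NULL);
    float ff = 0.0f;   /* feedforward sum: could run on one compute block */
    float fb = 0.0f;   /* feedback sum: could run on the other */
    for (size_t k = 0; k < nb; ++k) {
        ff += b[k] * x_hist[k];
    }
    for (size_t k = 0; k < na; ++k) {
        fb += a[k] * y_hist[k];
    }
    return ff - fb;
}
```

Unlike the FIR case, the result feeds back into y_hist before the next output can be computed, which is one reason the short IIR loop is harder to keep the pipeline full with.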

20
TigerSHARC and the FFT
  • Does not use the same MAC modes that IIR and FIR
    filters do.
  • Requires more complicated addressing modes
  • Example: bit-reverse addressing
  • Found on both SHARC and TigerSHARC
  • Difficult to split onto separate computational
    units and even more difficult to split amongst
    distributed processors
  • Requires large arrays of complex variables and
    fixed coefficients
  • Hardware complex number MAC comes in handy again!
  • Large arrays of aligned data: Quad Data Access
    again!
  • Requires HIGH-PRECISION arithmetic
  • Luckily we have 64-bit fixed point arithmetic and
    40-bit extended floating point arithmetic.
  • 80-bit MAC precision
  • FFT Requires many intermediate values
  • 32 GP registers in a single computational block
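
Bit-reverse addressing is done in hardware on both processors; as a reference for what it computes, here is a software sketch in C. The function name is hypothetical, and `bits` is log2 of the FFT length.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of `index`.
 * For an 8-point FFT (bits = 3): 1 (001) -> 4 (100), 3 (011) -> 6 (110). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    assert(bits > 0 && bits <= 16);
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; ++i) {
        reversed = (reversed << 1) | (index & 1u);   /* move LSB into result */
        index >>= 1;
    }
    return reversed;
}
```

A radix-2 FFT leaves its output (or needs its input) in this permuted order, so doing the reversal inside the address generator removes a separate reordering pass.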

21
http://www.analog.com/technology/dsp/Sharc/benchmarks.html
22
http://www.analog.com/technology/dsp/TigerSHARC/benchmarks.html
4ns Instruction Cycle
23
Conclusion
  • TigerSHARC has a very SHARC-like architecture,
    except it's MUCH more complex.
  • Highly optimized for parallelism
  • Major features: complex number support, multiple
    computational units, high instruction throughput,
    wider buses.
  • Performs DSP algorithms including FIR, IIR, FFT
    significantly faster than SHARC!

24
References
  • 1. http://www.analog.com/productSelection/pdf/ADSP-21061_L_b.pdf
  • 2. http://products.analog.com/products/info.asp?product=ADSP-TS101-S
  • 3. http://www.analog.com/technology/dsp/TigerSHARC/backgrounder.html
  • 4. http://www.analog.com/library/dspManuals/Tigersharc_hardware.html
  • 5. http://www.analog.com/library/dspManuals/Tigersharc_instruction.html
  • 6. http://www.btid.com/procsum/tsfloat.htm
  • 7. http://www.analog.com/library/applicationNotes/dsp/tigerSharc/EE-147.pdf
  • 8. http://www.analog.com/technology/dsp/TigerSHARC/architecture.html
  • 9. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsintr.pdf
    (2-182 - 2-188)
  • 10. ADSP-2106x SHARC User's Manual, Second
    Edition
  • 11. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsin_flw.pdf
    (3-9 - 3-16)

25
Note from Dr. Smith
  • Information on Burg algorithm outside ICT536. It
    is essentially an FIR filter used for prediction
    (i.e. what FIR coefficients are needed so that
    the filtered signal is "white noise")