Title: The SHARC
1. The SHARC
- Super Harvard Architecture Computer
2. The SHARC
- Developed by Analog Devices.
- Optimized for demanding DSP and imaging applications.
- 32-bit floating point, with 40-bit extended floating-point capabilities.
- Large on-chip memory.
- Ideal for scalable multi-processing applications.
3. Harvard Architecture
- Program memory can store data.
- Able to simultaneously read or write data at one location and get instructions from another place in memory.
- 2 buses:
  - Data memory bus.
  - Program bus.
- Either two separate memories or a single dual-port memory.
4. Super Harvard Architecture
- Many processors employ a Harvard architecture by having two separate memories or caches integrated into the processor chip.
- The SHARC is unique in that its internal memory is capable of holding a large program as well as a large amount of data. This is what makes it SUPER!!!
5. DSP
- Digital Signal Processor.
- High-speed, low-overhead data movement and rapid computations required.
- Usually has a small on-board ROM, RAM and single-cycle multiply.
- Designed to run single-line, serial-in, serial-out signal processing applications very fast.
6. DSP Computations
- The inner product of two vectors is a common computation for determining energy or correlation.
- The following C code is an example:
    for (n = 0; n < length; n++)
        result += x[n] * y[n];
- The processor with the lowest instruction time for this loop will have the best performance.
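The loop on this slide can be wrapped into a complete function. This is a minimal sketch in portable C, not SHARC-specific code; the function name inner_product is an assumption made for illustration, while x, y, length and result follow the slide.

```c
#include <stddef.h>

/* Inner (dot) product of two vectors of the same length.
 * inner_product is a hypothetical name; x, y, length and result
 * mirror the variables used in the loop on the slide. */
float inner_product(const float *x, const float *y, size_t length)
{
    float result = 0.0f;
    for (size_t n = 0; n < length; n++) {
        result += x[n] * y[n];   /* one multiply and one accumulate per element */
    }
    return result;
}
```

Calling it with the same vector twice gives the energy of that vector; calling it with two different vectors gives their correlation at zero lag.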
7. SHARC DSP
- The SHARC incorporates features aimed at optimizing such loops:
  - High-speed floating-point capability.
  - Extended floating point.
- These features are DSP specific.
  - When applied to a non-DSP application, performance may not be optimal.
8. Floating Point and Extended Floating Point
- The SHARC supports floating-point, extended floating-point and non-floating-point (fixed-point) formats.
- No additional clock cycles for floating-point computations.
- Data is automatically truncated or zero-padded when moved between 32-bit memory and internal registers.
- Not accurate enough for scientific algorithms, but gives an excellent signal-to-noise ratio.
9. SHARC's Internal Memory
- Makes the SHARC unique.
- Size:
  - Allows many complex functions to be performed on-chip, eliminating the need to move data between internal and external memory.
  - Memory size is significantly larger than that of most other high-speed computational devices.
- Dual-block, dual-port:
  - Optimizes the Harvard architecture by allowing the fetch of instructions while performing data memory accesses.
10. Multiply and Accumulate Instructions on the SHARC
- Like most DSPs, the SHARC is able to compute a product and add the product to a running total in a single clock cycle.
- The SHARC's "super instruction" can multiply and accumulate while adding, subtracting, or averaging data in two other registers.
- These instructions give the SHARC its 120-megaflop rating.
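As a rough illustration of the work such an instruction does in one cycle, the sketch below models a single step in plain C: a multiply-accumulate plus the sum and difference of two other values. It is a model of the arithmetic only, not SHARC assembly; the type mac_step_result and the function mac_step are hypothetical names, and the averaging variant mentioned above is not modelled.

```c
/* Results of one modelled "multifunction" step (hypothetical type). */
typedef struct {
    float acc;   /* running total after the multiply-accumulate */
    float sum;   /* a + b, computed alongside the multiply */
    float diff;  /* a - b, computed alongside the multiply */
} mac_step_result;

/* Model one step: accumulate x * y into acc while also forming
 * the sum and difference of two other operands a and b. */
mac_step_result mac_step(float acc, float x, float y, float a, float b)
{
    mac_step_result r;
    r.acc  = acc + x * y;
    r.sum  = a + b;
    r.diff = a - b;
    return r;
}
```

On a general-purpose processor these three results would typically take several instructions; the point of the slide is that the SHARC issues them together.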
11. Zero Overhead Looping on the SHARC
- A single instruction outside the loop performs loop set-up, informing the SHARC that a loop is approaching.
- The instruction also includes the iteration count and termination condition.
- This causes the pipeline to remain full during loop execution and also allows the termination condition to be tested in parallel.
12. DAGs on the SHARC
- Data Address Generators (DAGs) are integer computation units that manage index registers.
- They allow the SHARC to fetch a value and update the index value.
- If the updated value exceeds a limit, the DAG adjusts the index so that it wraps.
- This occurs in the same clock cycle as the read or write.
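The fetch-then-update behaviour with wrap-around can be sketched in C as follows. This is a software model under stated assumptions, not a description of the actual hardware registers: the struct dag_regs, its field names and the function dag_post_modify are hypothetical, and the step (modify) is assumed to be non-negative and no larger than the buffer length.

```c
#include <stddef.h>

/* Hypothetical model of one DAG register set: base and length describe
 * the circular buffer, index is the current address, modify is the step. */
typedef struct {
    size_t base;
    size_t length;
    size_t index;
    size_t modify;
} dag_regs;

/* Return the address to use for this access, then advance the index,
 * wrapping it back into [base, base + length) if it passes the end.
 * Assumes modify <= length. */
size_t dag_post_modify(dag_regs *d)
{
    size_t address = d->index;
    size_t next = d->index + d->modify;
    if (next >= d->base + d->length) {
        next -= d->length;          /* wrap past the end of the buffer */
    }
    d->index = next;
    return address;
}
```

In software this costs an add, a compare and a conditional subtract per access; the slide's point is that the DAG does the equivalent work in the same cycle as the memory access.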
13. DAG Capabilities
- Circular buffering:
  - Rather than actually moving data in and out of a vector, circular buffers are used.
  - By updating the index modulo the buffer length, the oldest entry can be conveniently replaced by the newest entry.
- Bit-reverse addressing:
  - The bit pattern of a vector index is reversed.
  - Done automatically by the SHARC.
  - Required for the Fast Fourier Transform (FFT), which is often critical to DSP applications.
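Bit-reverse addressing can be illustrated in software. The sketch below reverses the low bits of an index, which is the reordering an FFT of size 2^bits needs; the function name bit_reverse and its parameters are assumptions for illustration, and on the SHARC the equivalent work is done by the address generator rather than by a loop like this.

```c
/* Reverse the lowest `bits` bits of `index`
 * (for example, bits = 3 for an 8-point FFT). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);  /* move the lowest bit across */
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point transform, index 1 (binary 001) maps to 4 (binary 100) and index 3 (binary 011) maps to 6 (binary 110).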
14. SHARC DSP
- What makes the SHARC unique?
  - It also has some features not directly related to optimizing numeric computations:
    - Pipelining.
    - Handling branches.
- Why has this not emerged sooner?
  - Technology has only recently become available to make it economical to integrate general single computing devices.
15. SHARC's Pipeline
- 3 stages:
  - Instruction fetch.
  - Decode.
  - Execution.
- It takes three clock cycles for an instruction to propagate through the pipeline.
- The processor's execution speed is one instruction per clock cycle, even though each instruction requires three clock cycles.
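The difference between latency (three cycles per instruction) and throughput (one instruction per cycle) can be checked with a small calculation. The sketch below assumes an ideal pipeline with no stalls or branches; the function name pipeline_cycles is hypothetical.

```c
/* Cycles needed to run `instructions` back to back through a pipeline
 * with `stages` stages, assuming no stalls and stages >= 1.
 * The first instruction takes `stages` cycles; each later one
 * completes one cycle after its predecessor. */
unsigned long pipeline_cycles(unsigned long instructions, unsigned stages)
{
    if (instructions == 0) {
        return 0;
    }
    return instructions + (stages - 1);
}
```

With three stages, 100 instructions take 102 cycles rather than 300, so the average cost approaches one cycle per instruction as the sequence gets longer.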
16. SHARC's Handling of Branches: Delayed Branching
- When a branch instruction is encountered, the two instructions that have already been loaded and decoded are executed before the branch.
- This keeps the pipeline full and avoids junking those two instructions and reloading the pipeline.
- Beneficial in situations such as loops of only a few instructions, where the ratio of wasted clock cycles to instructions is significant.
17. SHARC's Handling of Branches: Non-delayed Branching
- Traditional branching.
- If the code cannot be reordered to use delayed branching, non-delayed branching saves space.
- Uses only one word of storage.
- However, it takes three cycles because the pipeline gets reloaded.
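The trade-off between the two branch types can be estimated with a simple cycle count. The sketch below is a rough model under stated assumptions, not a measurement: one instruction issues per cycle, a non-delayed branch costs three cycles as stated above, a delayed branch costs one cycle, and the two instructions executed after a delayed branch are useful work already counted in the loop body. The names loop_cycles and the two constants are hypothetical.

```c
/* Assumed branch costs in cycles (illustrative constants). */
enum {
    BRANCH_CYCLES_NON_DELAYED = 3,  /* branch plus two cycles to reload the pipeline */
    BRANCH_CYCLES_DELAYED     = 1   /* the next two slots execute useful instructions */
};

/* Estimated cycles for a loop of `body_instructions` useful instructions
 * closed by one branch and run `iterations` times. With a delayed branch,
 * two of the body instructions are assumed to execute after the branch. */
unsigned long loop_cycles(unsigned long body_instructions,
                          unsigned long iterations,
                          int delayed)
{
    unsigned long branch = delayed ? BRANCH_CYCLES_DELAYED
                                   : BRANCH_CYCLES_NON_DELAYED;
    return iterations * (body_instructions + branch);
}
```

For a four-instruction body run 1000 times, the model gives 7000 cycles with non-delayed branches and 5000 with delayed branches, which shows why short loops benefit most.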
18. Multi-processing
- The SHARC is uniquely equipped for multi-processing.
- Link ports provide very powerful multi-processing capabilities.
- Two main program models, depending on the application.
- Adapts well to different multi-processing architectures.
19. Multi-processing: SHARC Links
- The SHARC has 6 link ports that can transport data at rates up to 40 Mbytes/sec.
- Links are designed for point-to-point connections.
- Data can be transmitted in either direction but not both simultaneously.
20. Multi-processing Program Model: MIMD
- Multiple instruction, multiple data.
- Good for applications that require multiple instruction threads to execute concurrently.
- Processors operate individually.
- Each processor executes different code.
- Typically used for image reconstruction and multi-channel DSP.
21. Multi-processing Program Model: SIMD
- Single instruction, multiple data.
- Works best when all processors execute identical instruction sequences.
- Does not require overhead for inter-processor synchronization.
- Typically used for synthetic aperture radar and automatic target recognition.
22. Multi-processing Architectures: Cluster Design
- Groups of up to 6 SHARCs in a cluster.
- Most common design for joining multiple SHARCs.
- All processors, global I/O and global memory are connected to a common cluster bus.
- Each SHARC can drive the bus.
23. Multi-processing Architectures: Mesh Design
- All SHARCs are joined by their link ports and connected to a common bus.
- In SIMD mode, a single master SHARC drives the bus.
- In MIMD mode, a mesh architecture cannot function if the data is larger than the available on-chip memory.
- Scales advantageously over a wider range of applications.
24. Summary of what makes the SHARC Super
- It performs excellently for DSP applications.
- Employs a Harvard architecture with very large on-chip memory.
- Respectable megaflop rating.
- Its multiprocessing capabilities.
25. How optimal is the SHARC for non-DSP Applications?
- It is obviously geared for DSP applications.
- While it may fare better than other processors, it is still behind those which are designed specifically for non-DSP applications.
26. Sources
- www.alacron.com/news/tp_mimd_simd.htm
- www.analog.com
- www.cs.seas.gwu.edu/cs339/cs339-lecture2.pdf
- www.ixthos.aa.psiweb.com/technical/notes_articles/articles