An Asynchronous Array of Simple Processors for DSP Applications - PowerPoint PPT Presentation

1 / 27

About This Presentation

Title:

An Asynchronous Array of Simple Processors for DSP Applications

Description:

Suitable for future fabrication technologies. Performance & Energy efficiency ... suited for future fabrication technologies. Acknowledgments. Funding. Intel ... – PowerPoint PPT presentation

Number of Views:92

Avg rating:3.0/5.0

Slides: 28

Provided by: eceUc7

Learn more at: http://www.ece.ucdavis.edu

Category:

more less

Transcript and Presenter's Notes

Title: An Asynchronous Array of Simple Processors for DSP Applications

1
An Asynchronous Array of Simple Processors for
DSP Applications

Zhiyi Yu, Michael Meeuwsen, Ryan Apperson,
Omar Sattari, Michael Lai, Jeremy Webb, Eric
Work,
Tinoosh Mohsenin, Mandeep Singh, Bevan Baas
VLSI Computation Lab, ECE Department
University of California, Davis, USA

2
Outline

Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

3
Target Applications

Computationally intensive DSP and scientific apps
Key components in many systems
Require high performance
Limited power budgets
Require innovations in architecture and circuit
design

4
Project Objectives

Programming flexibility
High performance
Throughput
Latency
High energy efficiency
Suitable for future fabrication technologies

ASIC
AsAP
FPGA
Performance Energy efficiency
Prog. DSP
Programming flexibility
5
Outline

Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

6
Key Features of the AsAP Processor
Chip multiprocessor
High performance
Small memory simple processor
High energy efficiency
Globally asynchronous locally synchronous (GALS)
Technology scalability
Nearest neighbor communication
7
High Performance Through Chip Multiprocessor

Increasing the clock frequency is challenging
Parallelism is more promising
Instruction level parallelism (VLIW, Superscalar)
Data level parallelism (SIMD)
Task level parallelism

Deeper pipeline
Higher clock frequency
Increased design complexity lower energy
efficiency
8
Task Level Parallelism

Well suited for DSP applications

Memory
Proc.
Proc.2
Proc.3
Proc.1

A
Task1 Task2 Task3
Task1
Task2
Task3
B
A
B
C
C
Improves performance and potentially reduces
memory size

Widely available in many DSP applications

training
pilots
in
out
scram coding
inter- leave
mod. map
loadfft inter- leave
IFFT
win- dow
up- samp filter
scale clip
up- samp filter
802.11a/g wireless LAN (54 Mbps, 5 GHz) baseband
transmit path
9
Memory Size in Modern Processors

Memory occupies much of the area in modern
processors this reduces area available for the
core and consumes large amounts of power
Area and energy dissipation can be dramatically
decreased with smaller memories

Area breakdown
other
core
mem
TI_C64x Itanium SPARC BlGe/L

ISSCC 02, 05
10
Small Memory Requirements for DSP Tasks

The memory required for common DSP tasks is quite
small
Several hundred words of memory are sufficient
for many DSP tasks

11
GALS Clocking Style

The challenge of globally synchronous systems
Design difficulty when using high clock
frequencies, long clock wires caused by large
chip sizes, and large circuit parameter
variations
High clock power consumption and lack of
flexibility to independently control clock
frequencies
How about totally asynchronous design style
Lack of EDA tool support
Design complexity and circuit overhead
The GALS compromise
Synchronous blocks each operating in their own
independent clock domain

12
Wires and On Chip Communication

Global wires are a concern
Their length doesnt shrink with technology
scalingassuming the chip size remains the same
The ratio of global wire delay to gate delay
doubles each generation
Methods to avoid global wires
Network on chip (NOC)
Local communication
Nearest neighbor communication

nearest
local
13
Outline

Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

14
AsAP Block Diagram

GALS array of identical processors
Each processor is a reduced complexity
programmable DSP with small memories
Each processor can receive data from any two
neighbors and send data to any of its four
neighbors

IMem 64 words
OSC
FIFO 0 32 words
ALU MAC Control
Output
FIFO 1 32 words
Static config
DMem 128 words
Dynamic config
15
Single Processor Architecture

54 32-bit instructions, among which only
Bit-Reverse is algorithm specific
9-stage pipeline

IFetch
Decode
EXE 1
EXE 2
EXE 3
Mem. Read
Src Select
Result Select
Write Back
FIFO 0 RD
Proc Output
Bypass
PC
FIFO 1 RD
Data Mem Write
ALU
Data Mem Read
De- code
Inst. Mem
Multiply Accumulator
A C C

Addr. Gens.
DC Mem Write
DC Mem Read
16
Programmable Clock Oscillator

Standard cell based
Configurable frequency
Delay tunable stages using 7 parallel tri-state
inverters
5 or 9 stage selection
1 to 128 clock divider
SR latch logic enables clean OSC halt
Results
1.66 MHz 702 MHz
Max gap 0.08 MHz(1.66 500 MHz)

Clock divider
clk

...

...

...

...
stage_sel
reset
halt
17
Inter-processor Communication

Each processor contains two dual-clock FIFOs
Rd/Wr in separate clock domains
Gray coded Rd/Wr address across clock domains
No FIFO failures in several weeks of testing
with multiple procs

east
Read side
Write side
north
SRAM
west
south
Writelogic
Readlogic
east
north
addr
addr
C P U
west
Binary Gray Sync.
south
FIFO 0
east
north
west
FIFO 1
south
Processor
18
Advantages of theAsAP Clocking System

Simplified clock tree design
The maximum span is lt 1 mm in 0.18 µm
Scalable easy to add processors
Improved energy efficiency
Clock halts in 9 cycles (processor dissipates
leakage power only) and restores in lt 1 cycle
according to work availability
53 and 65 power savings for two applications
Independent clock and voltage scaling (individual
processor voltage scaling not implemented in this
version)

19
Physical Design

0.18 µm TSMC
Standard cell
Design flow
Completely synthesized, except oscillator
Macro memory blocks used for four main memories
Completely auto placed and routed
To simplify physical design, clock gating not
implemented in this version

20
Chip Micrograph of the 6 x 6 Array
Transistors 1 Proc
230,000 Chip 8.5
millionMax speed 475 MHz _at_ 1.8 V Area
1 Proc 0.66 mm² Chip
32.1 mm² Power (1 Proc _at_
1.8V, 475 MHz) Typical application 32
mW Typical 100 active 84 mW Worst
case 144 mW Power (1 Proc _at_ 0.9V,
116 MHz) Typical application 2.4 mW
IMem
DMem
5.68 mm
FIFOs
OSC
Single Processor
810 µm
810 µm
5.65 mm
21
Outline

Project objectives
Key features of the AsAP processor
Design of the AsAP processor
Results

22
Area Evaluation

Most of AsAPs area is for the core (66)
Each processor requires a very small area more
than 20x smaller than others

72
30
Processor area (mm²)
29
Area breakdown
8.3
7
20x
0.34
TI RAW Fujitsu BlGe/ AsAP
C64x 4-VLIW L
TI CELL/ Fujitsu RAW
ARM AsAP C64x SPE 4-VLIW
All scaled to 0.13 µm ISSCC 05, 00 ISCA04
CMPON96
core
comm
mem
23
Power and Performance
Power / Clock frequency / Scale (mW/MHz)
All scaled to 0.13 µm Assume 2 ops/cycle
for CELL/SPE and 3.3 ops/cycle for TI C64x
Note word widths and workload not factored
ISSCC 05, 00 ISCA04 CMPON96
Higher performance area, and Lower estimated
energy
Peak performance density (MOPS/mm²)
24
JPEG Core Encoder Implementation

9 processors
Fully functional on chip
224 mW _at_ 300 MHz
1400 clock cycles for each 8x8 block
Similar performance and 11x lower energy
dissipation than8-way VLIW TI C62x

output
input
Huffman
Lv-shift 1-DCT
DC in Huffm
AC in Huffm
Trans in DCT
DC in Huffm
AC in Huffm
Quant. Zigzag
Zigzag
1-DCT
8x8 DCT
MICRO 02, ISCAS 02
25
802.11a/802.11g Wireless Transmitter
Implementation
Data bits
Conv. Code
Inter- leave 1
Pad
Punc

22 processors
Fully functional on chip
407 mW _at_ 300 MHz 30 of 54 Mb/s
Code unscheduled and lightly optimized
10x performance and 35x 75x lower energy
dissipation than 8-way VLIW TI C62x

Inter- leave 2
Scram
Train
Pilot Insert
Mod. Map
IFFT Mem
IFFT BR
IFFT Output
GI/ Wind.
IFFT BF
GI/ Wind.
IFFT Mem
IFFT BF
FIR
FIR
To D/A converter
IFFT BF
IFFT Mem
Output Sync
IFFT
SIPS 04 ICC 02
26
Summary