An Asynchronous Array of Simple Processors for DSP Applications - PowerPoint PPT Presentation

1 / 27
About This Presentation
Title:

An Asynchronous Array of Simple Processors for DSP Applications

Description:

Suitable for future fabrication technologies. Performance & Energy efficiency ... suited for future fabrication technologies. Acknowledgments. Funding. Intel ... – PowerPoint PPT presentation

Number of Views:92
Avg rating:3.0/5.0
Slides: 28
Provided by: eceUc7
Category:

less

Transcript and Presenter's Notes

Title: An Asynchronous Array of Simple Processors for DSP Applications


1
An Asynchronous Array of Simple Processors for
DSP Applications
  • Zhiyi Yu, Michael Meeuwsen, Ryan Apperson,
  • Omar Sattari, Michael Lai, Jeremy Webb, Eric
    Work,
  • Tinoosh Mohsenin, Mandeep Singh, Bevan Baas
  • VLSI Computation Lab, ECE Department
  • University of California, Davis, USA

2
Outline
  • Project objectives
  • Key features of the AsAP processor
  • Design of the AsAP processor
  • Results

3
Target Applications
  • Computationally intensive DSP and scientific apps
  • Key components in many systems
  • Require high performance
  • Limited power budgets
  • Require innovations in architecture and circuit
    design

4
Project Objectives
  • Programming flexibility
  • High performance
  • Throughput
  • Latency
  • High energy efficiency
  • Suitable for future fabrication technologies

ASIC
AsAP
FPGA
Performance Energy efficiency
Prog. DSP
Programming flexibility
5
Outline
  • Project objectives
  • Key features of the AsAP processor
  • Design of the AsAP processor
  • Results

6
Key Features of the AsAP Processor
Chip multiprocessor
High performance
Small memory simple processor
High energy efficiency
Globally asynchronous locally synchronous (GALS)
Technology scalability
Nearest neighbor communication
7
High Performance Through Chip Multiprocessor
  • Increasing the clock frequency is challenging
  • Parallelism is more promising
  • Instruction level parallelism (VLIW, Superscalar)
  • Data level parallelism (SIMD)
  • Task level parallelism

Deeper pipeline
Higher clock frequency
Increased design complexity lower energy
efficiency
8
Task Level Parallelism
  • Well suited for DSP applications

Memory
Proc.
Proc.2
Proc.3
Proc.1

A
Task1 Task2 Task3
Task1
Task2
Task3
B
A
B
C
C
Improves performance and potentially reduces
memory size

  • Widely available in many DSP applications

training
pilots
in
out
scram coding
inter- leave
mod. map
loadfft inter- leave
IFFT
win- dow
up- samp filter
scale clip
up- samp filter
802.11a/g wireless LAN (54 Mbps, 5 GHz) baseband
transmit path
9
Memory Size in Modern Processors
  • Memory occupies much of the area in modern
    processors this reduces area available for the
    core and consumes large amounts of power
  • Area and energy dissipation can be dramatically
    decreased with smaller memories

Area breakdown
other
core
mem
TI_C64x Itanium SPARC BlGe/L

ISSCC 02, 05
10
Small Memory Requirements for DSP Tasks
  • The memory required for common DSP tasks is quite
    small
  • Several hundred words of memory are sufficient
    for many DSP tasks

11
GALS Clocking Style
  • The challenge of globally synchronous systems
  • Design difficulty when using high clock
    frequencies, long clock wires caused by large
    chip sizes, and large circuit parameter
    variations
  • High clock power consumption and lack of
    flexibility to independently control clock
    frequencies
  • How about totally asynchronous design style
  • Lack of EDA tool support
  • Design complexity and circuit overhead
  • The GALS compromise
  • Synchronous blocks each operating in their own
    independent clock domain

12
Wires and On Chip Communication
  • Global wires are a concern
  • Their length doesnt shrink with technology
    scalingassuming the chip size remains the same
  • The ratio of global wire delay to gate delay
    doubles each generation
  • Methods to avoid global wires
  • Network on chip (NOC)
  • Local communication
  • Nearest neighbor communication

nearest
local
13
Outline
  • Project objectives
  • Key features of the AsAP processor
  • Design of the AsAP processor
  • Results

14
AsAP Block Diagram
  • GALS array of identical processors
  • Each processor is a reduced complexity
    programmable DSP with small memories
  • Each processor can receive data from any two
    neighbors and send data to any of its four
    neighbors

IMem 64 words
OSC
FIFO 0 32 words
ALU MAC Control
Output
FIFO 1 32 words
Static config
DMem 128 words
Dynamic config
15
Single Processor Architecture
  • 54 32-bit instructions, among which only
    Bit-Reverse is algorithm specific
  • 9-stage pipeline

IFetch
Decode
EXE 1
EXE 2
EXE 3
Mem. Read
Src Select
Result Select
Write Back
FIFO 0 RD
Proc Output
Bypass
PC
FIFO 1 RD
Data Mem Write
ALU
Data Mem Read
De- code
Inst. Mem
Multiply Accumulator
A C C


Addr. Gens.
DC Mem Write
DC Mem Read
16
Programmable Clock Oscillator
  • Standard cell based
  • Configurable frequency
  • Delay tunable stages using 7 parallel tri-state
    inverters
  • 5 or 9 stage selection
  • 1 to 128 clock divider
  • SR latch logic enables clean OSC halt
  • Results
  • 1.66 MHz 702 MHz
  • Max gap 0.08 MHz(1.66 500 MHz)

Clock divider
clk





...

...

...

...
stage_sel
reset
halt
17
Inter-processor Communication
  • Each processor contains two dual-clock FIFOs
  • Rd/Wr in separate clock domains
  • Gray coded Rd/Wr address across clock domains
  • No FIFO failures in several weeks of testing
    with multiple procs

east
Read side
Write side
north
SRAM
west
south
Writelogic
Readlogic
east
north
addr
addr
C P U
west
Binary Gray Sync.
south
FIFO 0
east
north
west
FIFO 1
south
Processor
18
Advantages of theAsAP Clocking System
  • Simplified clock tree design
  • The maximum span is lt 1 mm in 0.18 µm
  • Scalable easy to add processors
  • Improved energy efficiency
  • Clock halts in 9 cycles (processor dissipates
    leakage power only) and restores in lt 1 cycle
    according to work availability
  • 53 and 65 power savings for two applications
  • Independent clock and voltage scaling (individual
    processor voltage scaling not implemented in this
    version)

19
Physical Design
  • 0.18 µm TSMC
  • Standard cell
  • Design flow
  • Completely synthesized, except oscillator
  • Macro memory blocks used for four main memories
  • Completely auto placed and routed
  • To simplify physical design, clock gating not
    implemented in this version

20
Chip Micrograph of the 6 x 6 Array
Transistors 1 Proc
230,000 Chip 8.5
millionMax speed 475 MHz _at_ 1.8 V Area
1 Proc 0.66 mm² Chip
32.1 mm² Power (1 Proc _at_
1.8V, 475 MHz) Typical application 32
mW Typical 100 active 84 mW Worst
case 144 mW Power (1 Proc _at_ 0.9V,
116 MHz) Typical application 2.4 mW
IMem
DMem
5.68 mm
FIFOs
OSC
Single Processor
810 µm
810 µm
5.65 mm
21
Outline
  • Project objectives
  • Key features of the AsAP processor
  • Design of the AsAP processor
  • Results

22
Area Evaluation
  • Most of AsAPs area is for the core (66)
  • Each processor requires a very small area more
    than 20x smaller than others

72
30
Processor area (mm²)
29
Area breakdown
8.3
7
20x
0.34
TI RAW Fujitsu BlGe/ AsAP
C64x 4-VLIW L
TI CELL/ Fujitsu RAW
ARM AsAP C64x SPE 4-VLIW
All scaled to 0.13 µm ISSCC 05, 00 ISCA04
CMPON96
core
comm
mem
23
Power and Performance
Power / Clock frequency / Scale (mW/MHz)
All scaled to 0.13 µm Assume 2 ops/cycle
for CELL/SPE and 3.3 ops/cycle for TI C64x
Note word widths and workload not factored
ISSCC 05, 00 ISCA04 CMPON96
Higher performance area, and Lower estimated
energy
Peak performance density (MOPS/mm²)
24
JPEG Core Encoder Implementation
  • 9 processors
  • Fully functional on chip
  • 224 mW _at_ 300 MHz
  • 1400 clock cycles for each 8x8 block
  • Similar performance and 11x lower energy
    dissipation than8-way VLIW TI C62x

output
input
Huffman
Lv-shift 1-DCT
DC in Huffm
AC in Huffm
Trans in DCT
DC in Huffm
AC in Huffm
Quant. Zigzag
Zigzag
1-DCT
8x8 DCT
MICRO 02, ISCAS 02
25
802.11a/802.11g Wireless Transmitter
Implementation
Data bits
Conv. Code
Inter- leave 1
Pad
Punc
  • 22 processors
  • Fully functional on chip
  • 407 mW _at_ 300 MHz 30 of 54 Mb/s
  • Code unscheduled and lightly optimized
  • 10x performance and 35x 75x lower energy
    dissipation than 8-way VLIW TI C62x

Inter- leave 2
Scram
Train
Pilot Insert
Mod. Map
IFFT Mem
IFFT BR
IFFT Output
GI/ Wind.
IFFT BF
GI/ Wind.
IFFT Mem
IFFT BF
FIR
FIR
To D/A converter
IFFT BF
IFFT Mem
Output Sync
IFFT
SIPS 04 ICC 02
26
Summary
  • Scalable programmable processing array
  • Many processors on a single chip
  • Reduced complexity processors with small memories
  • GALS clocking style
  • Nearest neighbor communication
  • Results
  • 0.18 µm, 475 MHz _at_ 1.8 V
  • 32 mW application power
  • 84 mW 100 active power
  • 2.4 mW application power _at_ 116 MHz and 0.9 V
  • High performance density 475 MOPS in 0.66 mm²
  • Well suited for future fabrication technologies

27
Acknowledgments
  • Funding
  • Intel Corporation
  • UC Micro
  • NSF Grant No. 0430090
  • MOSIS
  • UCD Faculty Research Grant
  • Special Thanks
  • R. Krishnamurthy, M. Anders, S. Mathew, S.
    Muroor, W. Li, C. Chen, D. Truong
Write a Comment
User Comments (0)
About PowerShow.com