Title: A System Solution for High-Performance, Low-Power SDR
1. A System Solution for High-Performance, Low-Power SDR
- Yuan Lin¹, Hyunseok Lee¹, Yoav Harel¹, Mark Woh¹, Scott Mahlke¹, Trevor Mudge¹, and Krisztian Flautner²
- ¹Advanced Computer Architecture Laboratory, University of Michigan
- ²ARM, Ltd.
2. SDR Design Challenges
- Hardware design challenges
  - High computational throughput (40 Gops)
  - Low power consumption (200 mW)
  - Meet real-time requirements
- DSP programming support
  - System-level development
    - Inter-algorithm communication
  - Algorithm-level development
    - Efficient DSP representations
3. SDR Benchmark Design Analysis
4. W-CDMA Protocol (2 Mbps)
5. W-CDMA Characteristics
- Plenty of vector parallelism
- 8- to 16-bit DSP algorithms
- Multiplication is not dominant
- No floating-point operation
- Small instruction/data memory
- Has periodic real-time tasks
6. 802.11a Protocol (24 Mbps)
7. 802.11a Characteristics
- Similar to W-CDMA
  - Plenty of vector parallelism
  - No floating-point operation
  - Small instruction/data memory
- Different from W-CDMA
  - Mostly 16-bit DSP algorithms
  - Multiplication is more dominant
  - No periodic real-time tasks
8. SDR Processor Architecture Design
9. System Architecture Design Tradeoffs
10. System Architecture Design Tradeoffs
Number of processing elements x SIMD width for W-CDMA 2 Mbps (51.2 GOP/s); 90 nm, 1 V @ 400 MHz
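As a rough sanity check on this tradeoff, the required throughput can be converted into operations per clock cycle and then split across processing elements. The sketch below is illustrative only: it assumes every SIMD lane completes one operation per cycle and ignores scalar work and utilization losses, and the function names are hypothetical.

```python
def required_lanes(throughput_ops_per_s, clock_hz):
    """Operations that must complete every cycle (rounded up)."""
    return -(-throughput_ops_per_s // clock_hz)  # ceiling division


def pe_simd_options(total_lanes, pe_counts=(1, 2, 4, 8, 16)):
    """Pair each PE count with the SIMD width needed to reach total_lanes."""
    return [(n_pe, -(-total_lanes // n_pe)) for n_pe in pe_counts]


# Figures taken from the slide: 51.2 GOP/s at a 400 MHz clock.
lanes = required_lanes(51_200_000_000, 400_000_000)
for n_pe, width in pe_simd_options(lanes):
    print(f"{n_pe:2d} PEs x {width:3d}-wide SIMD = {n_pe * width} ops/cycle")
```

Under these assumptions the workload needs 128 operations per cycle, and 4 PEs with 32 lanes each is one of the combinations that reaches it.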
11. Our SDR System Architecture Design
- Scalable system design
  - Standardized SoC interface
  - System interface supports multiple (potentially) heterogeneous PEs and memories
- For W-CDMA and 802.11a
  - 4 homogeneous processing elements (PEs)
  - Dual pipelines: scalar pipeline and SIMD pipeline
  - Local scratchpad memory (no data cache)
  - Global scratchpad memory (64 KB)
  - Controller: ARM general-purpose processor
12. System Architecture Design
13. PE Design (Area < 1 mm², Power < 50 mW)
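14. Mapping DSP Algorithms: Filters
[Figure: filter tap with input In, delay z^-1, coefficient b, and output Out; 4-tap FIR filter with coefficients b0 to b3 and three z^-1 delays, mapped to the SIMD instruction sequence below]
spread Vin, Sin
shift z, z, up
mac z, Vin, Sin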
15. Mapping DSP Algorithm: Filter
spread Vin, Sin
shift z, z, up
mac z, Vin, Sin
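The three-instruction sequence can be read as a transposed-form FIR filter in which each SIMD lane holds one partial sum. The sketch below models that reading with plain Python lists. The operand semantics are assumptions, not taken from the slides: `spread` is taken to broadcast a scalar sample to every lane, `shift ... up` to move each lane's value to the next lane, and `mac` to multiply-accumulate against a per-lane coefficient vector.

```python
def spread(scalar, width):
    """Broadcast one scalar input sample to every SIMD lane (assumed meaning)."""
    return [scalar] * width


def shift_up(vec):
    """Move each lane's value to the next lane; lane 0 is filled with zero."""
    return [0] + vec[:-1]


def mac(acc, a, b):
    """Lane-wise multiply-accumulate: acc[i] + a[i] * b[i]."""
    return [z + x * y for z, x, y in zip(acc, a, b)]


def fir_simd(samples, coeffs):
    """Transposed-form FIR: one spread/shift/mac round per input sample."""
    width = len(coeffs)
    lane_coeffs = coeffs[::-1]  # lane i holds b[N-1-i], so the last lane ends on b[0]
    z = [0] * width
    out = []
    for s_in in samples:
        v_in = spread(s_in, width)
        z = shift_up(z)
        z = mac(z, v_in, lane_coeffs)
        out.append(z[-1])  # the completed output sits in the last lane
    return out


def fir_reference(samples, coeffs):
    """Direct-form FIR, y[n] = sum_k b[k] * x[n-k], used only as a cross-check."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]
```

With this arrangement each new sample costs one round of the three operations regardless of the number of taps, as long as the taps fit in the SIMD width.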
16. Efficient Design
- Wide SIMD width
- Small register file with minimum ports
- Small memories
- Narrow system bus
- Datapath optimized for 8 bits
- Vector shuffle reduces memory ports
17. Processing Element (PE) Design
- Scalar pipelines
  - 16-bit data path
- SIMD pipeline
  - 8-bit data path
  - 32x8 SIMD ALU
- Software-controlled local scratchpad memory
  - 4 KB scalar memory
  - 4 KB SIMD memory
- Inter-PE communication through DMA
20. 802.11a PE Mapping
21. Power Results
- Configuration
  - 4 PEs, 1 ARM (Cortex-M3) controller
  - Global scratchpad memory (64 KB)
  - 90 nm (1 V @ 400 MHz)
  - Synthesized conservatively
22. Area Results
23. SDR Programming Language Support
24. Software Development Flow
25. SPEX (Signal Processing EXtension)
- Implemented as a library extension to C
- System-level development
  - Support concurrent DSP kernel function definitions
  - Channel variables for inter-kernel communications
- Algorithm-level development
  - Native vector and matrix variables
  - Explicit DSP variable attribute definition
  - Native vector and matrix operations
26. SPEX Overview
27. SPEX Example Code: Viterbi ACS
Concurrent DSP kernel definitions
    void acs(void) {
        /* variable declaration */
        saturated char<64> metrics1, metrics2;
        saturated char<64> states;
        saturated char<64> t1, t2;
        while (!viterbi.stop()) {
            /* receiving data from BMC */
            metrics1 = bmc_to_acs.receive();
            metrics2 = bmc_to_acs.receive();
            /* add */
            metrics1 += states;
            metrics2 += states;
            /* compare and select */
            t1 = (metrics1(0,2,62), metrics2(0,2,62));
            t2 = (metrics1(1,2,63), metrics2(1,2,63));
            states(t1<t2) = t1;
            states(t1>t2) = t2;
            /* sending data to TB */
            acs_to_tb.send(states);
        }
    }
28. SPEX Example Code: Viterbi ACS
Native SIMD variable definition with explicit attributes
- SPEX variables support:
  1. Saturated/overflow
  2. Various variable bit-widths
  3. Vectors and matrices
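The saturated/overflow attribute matters because the two behaviors give very different results once a sum leaves the representable range. The sketch below contrasts them for lane-wise addition. It assumes, for illustration, that a SPEX `char` lane is a signed 8-bit value; the helper names are hypothetical and are not part of SPEX.

```python
INT8_MIN, INT8_MAX = -128, 127  # assumed range of one signed 8-bit lane


def saturating_add(a, b):
    """Lane-wise add that clamps each result to the signed 8-bit range."""
    return [max(INT8_MIN, min(INT8_MAX, x + y)) for x, y in zip(a, b)]


def wrapping_add(a, b):
    """Lane-wise add that wraps around on overflow (two's complement)."""
    return [((x + y + 128) % 256) - 128 for x, y in zip(a, b)]


a = [100, -100, 5]
b = [100, -100, 5]
print(saturating_add(a, b))  # out-of-range lanes stay pinned at the limits
print(wrapping_add(a, b))    # out-of-range lanes wrap and change sign
```

For accumulated path metrics such as those in the ACS kernel, clamping keeps a large metric large, whereas wraparound would flip its sign.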
29. SPEX Example Code: Viterbi ACS
Inter-kernel communication through channel operations
- Channel types:
  1. FIFO queue
  2. Broadcast queue
  3. Sync/control channel
  4. Random-read FIFO queue
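The FIFO queue channel can be pictured as a bounded queue connecting two kernels that run concurrently. The sketch below models only that one channel type using Python threads. The `send` and `receive` method names follow the example code; the `FifoChannel` class, the `STOP` sentinel, and the two kernels are hypothetical stand-ins and do not reflect how SPEX is implemented.

```python
import queue
import threading


class FifoChannel:
    """Bounded FIFO channel: send blocks when full, receive blocks when empty."""

    def __init__(self, capacity=4):
        self._q = queue.Queue(maxsize=capacity)

    def send(self, item):
        self._q.put(item)

    def receive(self):
        return self._q.get()


STOP = object()  # hypothetical end-of-stream marker, not a SPEX feature


def producer_kernel(out_ch, blocks):
    """Stand-in upstream kernel: forwards each data block, then signals the end."""
    for block in blocks:
        out_ch.send(block)
    out_ch.send(STOP)


def consumer_kernel(in_ch, results):
    """Stand-in downstream kernel: reduces each received block to its sum."""
    while True:
        block = in_ch.receive()
        if block is STOP:
            break
        results.append(sum(block))


def run_pipeline(blocks):
    """Run both kernels concurrently, connected by one FIFO channel."""
    channel = FifoChannel()
    results = []
    threads = [
        threading.Thread(target=producer_kernel, args=(channel, blocks)),
        threading.Thread(target=consumer_kernel, args=(channel, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the queue preserves order and blocks on both ends, neither kernel needs to know how fast the other one runs.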
30. SPEX Example Code: Viterbi ACS
- SPEX vector operations (Matlab-like C code) support:
  1. SIMD arithmetic operations
  2. SIMD permutation
  3. SIMD predication
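One iteration of the ACS kernel uses all three kinds of vector operation, which can be modeled on plain Python lists. The sketch below rests on two assumptions about the notation: `v(first, step, last)` is read as a strided selection that includes `last`, and `(a, b)` is read as concatenation. Saturation is left out to keep the example short, and the function names are hypothetical.

```python
def select_stride(vec, first, step, last):
    """Assumed meaning of vec(first, step, last): first, first+step, ..., last."""
    return vec[first:last + 1:step]


def acs_step(states, metrics1, metrics2):
    """One add-compare-select iteration on equal-length, even-sized vectors."""
    n = len(states)

    # SIMD arithmetic: add the current state metrics to both branch metrics.
    m1 = [m + s for m, s in zip(metrics1, states)]
    m2 = [m + s for m, s in zip(metrics2, states)]

    # SIMD permutation: gather even lanes into t1 and odd lanes into t2.
    t1 = select_stride(m1, 0, 2, n - 2) + select_stride(m2, 0, 2, n - 2)
    t2 = select_stride(m1, 1, 2, n - 1) + select_stride(m2, 1, 2, n - 1)

    # SIMD predication: per lane, keep the smaller candidate. Lanes where the
    # candidates tie keep their old value, mirroring the two masked assignments.
    new_states = list(states)
    for i, (a, b) in enumerate(zip(t1, t2)):
        if a < b:
            new_states[i] = a
        elif a > b:
            new_states[i] = b
    return new_states
```

For 64-element vectors the strided selections become (0,2,62) and (1,2,63), the index triples that appear in the example code.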
31. Summary
- Hardware and software solutions for SDR
- Hardware
  - 4 dual-issue asymmetric SIMD processing elements
  - Consumes 200-300 mW at 90 nm
  - Meets the performance requirements for W-CDMA and 802.11a
- Software
  - SPEX provides efficient DSP algorithm and system implementation
32. Questions?