Transcript and Presenter's Notes

Title: Programmable processors for wireless base-stations

1
Programmable processors for wireless base-stations

Sridhar Rajagopal
(sridhar@rice.edu)
December 16, 2003

2
Wireless rates vs. clock rates
[Figure labels: clock rates 200 MHz and 4 GHz; wireless data rates 9.6 Kbps, 1 Mbps, 2-10 Mbps, 54-100 Mbps]

Need to process 100X more bits per clock cycle today than in 1996

3
Base-stations need horsepower
Sophisticated signal processing for multiple users
Need 100-1000s of arithmetic operations to process 1 bit
Base-stations require > 100 ALUs
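
The ALU requirement follows from a simple rate calculation. The sketch below is an illustration, not the thesis's own estimate: it assumes every ALU completes one arithmetic operation per clock cycle, and the function name `min_alus` and the operations-per-bit figures in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound on the number of ALUs needed to keep up with the workload,
 * assuming (hypothetically) one arithmetic operation per ALU per cycle. */
uint64_t min_alus(uint64_t users, uint64_t bits_per_sec_per_user,
                  uint64_t ops_per_bit, uint64_t clock_hz)
{
    assert(clock_hz > 0);
    uint64_t ops_per_sec = users * bits_per_sec_per_user * ops_per_bit;
    /* Ceiling division: a fractional ALU still costs a whole one. */
    return (ops_per_sec + clock_hz - 1) / clock_hz;
}
```

For instance, `min_alus(32, 1000000, 4000, 1000000000ULL)` models 32 users at 1 Mbps each with an assumed 4000 operations per bit on a 1 GHz clock; under those assumptions the bound exceeds 100 ALUs.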
4
Programmable architectures

Wireless algorithm kernels
Well known, ASIC mapping well-studied
Processors getting more powerful every year
Historic trend: ASICs → programmable
Can we design a fully programmable wireless
system?

5
Thesis addresses the following problem

Design programmable processors for wireless base-stations with 100s of ALUs
(a) map wireless algorithms on these processors
(b) power-efficient (adapt resources to needs)
(c) decide ALUs, clock frequency

How much programmable? As programmable as possible

6
Choice: multi-processors

Single processors won't do
ILP, subword parallelism not sufficient
Register file explosion with increasing ALUs
Multiprocessors
Data parallelism in wireless systems
Data-parallel/SIMD/vector processors appropriate
Exploit ILP, MMX, DP

7
Thesis contributions

(a) Mapping algorithms on data-parallel processors
designing data-parallel algorithms
tradeoffs between packing, ALU utilization and memory
reduced inter-cluster communication network
(b) Improve power efficiency
adapting compute resources to workload variations
varying voltage and frequency to real-time
requirements
(c) Design exploration between ALUs and clock
frequency to minimize power consumption
fast real-time performance prediction

8
Outline

Background
Wireless systems
Data-parallel (Stream) processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

9
Wireless workloads
System                         2G               3G                          4G
Users                          32               32                          32
Data rates                     16 Kbps/user     128 Kbps/user               1 Mbps/user
Algorithms                     Single-user      Multi-user                  MIMO
Estimation                     Correlator       Max. likelihood             Chip equalizer
Detection                      Matched filter   Interference cancellation   Matched filter
Decoding                       Viterbi          Viterbi                     LDPC
Theoretical min ALUs @ 1 GHz   > 2              > 20                        > 200
Time                           1996             2003                        ?
10
Key kernels studied for wireless

FFT Media processing
QRD Media processing
Outer product updates
Matrix-vector operations
Matrix-matrix operations
Matrix transpose
Viterbi decoding
LDPC decoding (in progress)

11
Characteristics of wireless

Compute-bound
Finite precision
Limited temporal data reuse
Streaming data
Data parallelism
Static, deterministic, regular workloads
Limited control flow

12
Parallelism levels in wireless

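#define N 1024
int i, a[N], b[N], sum[N];        // 32 bits
short int c[N], d[N], diff[N];    // 16 bits packed
for (i = 0; i < N; i++) {
    sum[i] = a[i] + b[i];         // 32-bit add
    diff[i] = c[i] - d[i];        // 16-bit subtract, a candidate for packing
}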
Instruction Level Parallelism (ILP) - DSP
Subword Parallelism (MMX) - DSP
Data Parallelism (DP) - Vector Processor
DP can decrease by increasing ILP and MMX
Example: loop unrolling

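
The trade between DP, ILP and subword parallelism can be illustrated in portable C. This is a minimal sketch, not code from the thesis: `add_unrolled4` unrolls the loop by four so that each iteration exposes four independent adds (more ILP, a quarter as many data-parallel iterations), and `sub_packed16x2` emulates a packed 16-bit subtract on two lanes held in one 32-bit word. Both function names are hypothetical, and a real DSP would use its own packed instructions rather than this bit trick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ILP: four independent adds per iteration.
 * In this sketch n must be a multiple of 4. */
void add_unrolled4(const int32_t *a, const int32_t *b, int32_t *sum, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        sum[i]     = a[i]     + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
        sum[i + 2] = a[i + 2] + b[i + 2];
        sum[i + 3] = a[i + 3] + b[i + 3];
    }
}

/* Subword parallelism: x and y each hold two 16-bit lanes.
 * Returns the lane-wise difference x - y (modulo 2^16 per lane).
 * Setting the top bit of each lane of x blocks borrows from crossing
 * lanes; the final XOR restores the correct top bit of each lane. */
uint32_t sub_packed16x2(uint32_t x, uint32_t y)
{
    const uint32_t H = 0x80008000u;  /* top bit of each 16-bit lane */
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}
```

With packing, the 1024 16-bit subtractions of the loop above need only 512 word operations, which is the sense in which increasing MMX reduces the remaining DP.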
13
Stream processors: multi-cluster DSPs
[Figure labels: Memory, Stream Register File (SRF), ILP, MMX, DP, VLIW DSP (1 cluster)]

Adapt clusters to DP
Identical clusters, same operations
Power-down unused FUs, clusters
14
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Reduced inter-cluster communication network
Power efficiency
Design exploration
Broad impact and future work

15
Patterns in inter-cluster comm

Inter-cluster comm network is fully connected
Structure in access patterns can be exploited
Broadcasting
Matrix-vector multiplication, matrix-matrix
multiplication, outer product updates
Odd-even grouping
Transpose, Packing, Viterbi decoding

16
Viterbi needs odd-even grouping

Exploiting Viterbi DP
Odd-even grouping of trellis states
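
Why the trellis calls for odd-even grouping can be seen in one add-compare-select (ACS) step. The sketch below assumes a common state numbering for shift-register codes, which the slides do not spell out: new states j and j + S/2 are both reached from old states 2j and 2j+1. The function `acs_step` and the flat branch-metric arrays `bm0` and `bm1` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One ACS step over n_states trellis states.
 * pm[s]  : path metric of old state s
 * bm0[s] : branch metric for leaving old state s with input bit 0
 * bm1[s] : branch metric for leaving old state s with input bit 1
 * Assumed numbering: old states 2j and 2j+1 feed new states j and j + n_states/2. */
void acs_step(const uint32_t *pm, const uint32_t *bm0, const uint32_t *bm1,
              uint32_t *new_pm, uint8_t *decision, size_t n_states)
{
    assert(n_states >= 2 && n_states % 2 == 0);
    size_t half = n_states / 2;
    /* Every j is independent of the others: this loop is the data parallelism. */
    for (size_t j = 0; j < half; ++j) {
        for (unsigned u = 0; u < 2; ++u) {
            const uint32_t *bm = u ? bm1 : bm0;
            uint32_t from_even = pm[2 * j]     + bm[2 * j];
            uint32_t from_odd  = pm[2 * j + 1] + bm[2 * j + 1];
            size_t t = j + u * half;
            decision[t] = (uint8_t)(from_odd < from_even); /* 1: survivor is the odd state */
            new_pm[t]   = decision[t] ? from_odd : from_even;
        }
    }
}
```

Each butterfly reads an even-indexed and an odd-indexed metric together, so a cluster that computes new state j needs both old metrics locally; grouping evens and odds brings them to the right cluster.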

17
Performance of Viterbi decoding
Ideal C64x DSP (w/o co-proc) needs 200 MHz for
real-time
18
Odd-even grouping

Packing
If odd-even data packed in same cluster and
precision doubles
Odd-even grouping required for bringing data to
right cluster
Not always beneficial for performance
Matrix transpose
Better done in ALUs than in memory
Shown to have an order-of-magnitude better
performance
Done in ALUs as repeated odd-even groupings

19
Transpose uses odd-even grouping
20
Odd-even grouping
0 1 2 3 4 5 6 7 → 0 2 4 6 1 3 5 7
Inter-cluster communication
Entire chip length
Limits clock frequency
Limits scaling
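
The grouping pattern, and the earlier remark that a transpose can be done as repeated odd-even groupings, can be sketched in plain C. This is a sequential illustration under stated assumptions, not the stream-processor kernel: the matrix is stored row-major, the column count is a power of two, and the names `odd_even_group` and `transpose_by_grouping` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* out = even-indexed elements of in, followed by the odd-indexed ones.
 * n must be even; in and out must not overlap. */
void odd_even_group(const int *in, int *out, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        out[i]         = in[2 * i];
        out[n / 2 + i] = in[2 * i + 1];
    }
}

/* Transpose a rows x cols row-major matrix held in m, using log2(cols)
 * odd-even groupings. cols must be a power of two; tmp is scratch space
 * for rows * cols ints. The result is a cols x rows row-major matrix in m. */
void transpose_by_grouping(int *m, int *tmp, size_t rows, size_t cols)
{
    assert(cols > 0 && (cols & (cols - 1)) == 0);
    size_t n = rows * cols;
    for (size_t c = cols; c > 1; c /= 2) {
        odd_even_group(m, tmp, n);
        memcpy(m, tmp, n * sizeof *m);
    }
}
```

Each grouping moves the lowest bit of the column index to the top of the element's position, so after log2(cols) passes the column index has become the row index.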
21
A reduced inter-cluster comm network
only nearest neighbor interconnections
22
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

23
Flexibility needed in workloads
Billions of computations per second needed
Workload variation from 1 GOPs for 4 users, constraint length 7 Viterbi, to 23 GOPs for 32 users, constraint length 9 Viterbi
24
DP changes with users
25
Data is not in the right banks

4 → 2 clusters
Data not in the right SRF banks
Overhead in bringing data to the right banks
Via memory
Via inter-cluster communication network

26
Adapting clusters to Data Parallelism
Turned off using voltage gating to eliminate static and dynamic power dissipation
[Figure: SRF, adaptive multiplexer network and clusters (C) in four configurations: no reconfiguration, 4 → 2 reconfiguration, 4 → 1 reconfiguration, all clusters off]
27
Cluster utilization variation
Cluster utilization variation on a 32-cluster processor, plotted against cluster index
(32, 9) = 32 users, constraint length 9 Viterbi
28
Frequency variation
29
Operation

Dynamic Voltage-Frequency scaling when system
changes significantly
Users, data rates
Coarse time scale (when system changes)
Turn off clusters
when parallelism changes
Finer time scale (once every 1000 cycles) (di/dt
effects)
Memory operations
Exceed real-time requirements

30
Power: voltage gating and scaling
Power can change from 12.38 W to 300 mW (40x
savings) depending on workload changes
31
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

32
Deciding ALUs vs. clock frequency

No independent variables
Clusters, ALUs, frequency, voltage (c,a,m,f)
Trade-offs exist
How to find the right combination for lowest power?

33
Static design exploration
also helps in quickly predicting real-time
performance
34
Sensitivity analysis important

We have a capacitance model [Khailany2003]
All equations not exact
Need to see how variations affect solutions

35
Design exploration methodology

3 types of parallelism: ILP, MMX, DP
For best performance (power)
Maximize the use of all
Maximize ILP and MMX at expense of DP
Loop unrolling, packing
Schedule on sufficient number of adders/multipliers
If DP remains, set clusters = DP
No other way to exploit that parallelism

36
Setting clusters, adders, multipliers

If sufficient DP, linear decrease in frequency
with clusters
Set clusters depending on DP and execution time
estimate
To find adders and multipliers,
Let compiler schedule algorithm workloads across
different numbers of adders and multipliers and
let it find execution time
Put all numbers in power equation
Compare increase in capacitance due to added ALUs
and clusters with benefits in execution time
Choose the solution that minimizes the power

37
Design exploration for clusters (c)
[Figure axes: DP vs. time]

For sufficiently large numbers of adders, multipliers per cluster
Explore Algorithm 1: 32 clusters
Explore Algorithm 2: 64 clusters
Explore Algorithm 3: 64 clusters
Explore Algorithm 4: 16 clusters

38
Clusters: frequency and power
32 clusters at frequency 836.692 MHz (p = 1)
64 clusters at frequency 543.444 MHz (p = 2)
64 clusters at frequency 543.444 MHz (p = 3)
3G workload
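
The selection step of the exploration (put the numbers in the power equation, then choose the minimum) can be sketched as follows. Several things here are assumptions rather than the thesis model: p is read as the exponent in a relative power model of capacitance times frequency to the power p, the relative capacitance values are invented placeholders (1.0 for 32 clusters, 2.0 for 64), and the names `design_point`, `relative_power` and `pick_min_power` are hypothetical. Only the two frequencies come from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate design. rel_capacitance is a hypothetical relative value,
 * not taken from the capacitance model cited in the slides. */
typedef struct {
    int    clusters;
    double freq_mhz;         /* clock needed to meet real time */
    double rel_capacitance;
} design_point;

/* Relative power under the assumed model P ~ C * f^p. */
double relative_power(const design_point *d, int p)
{
    double power = d->rel_capacitance;
    for (int k = 0; k < p; ++k)
        power *= d->freq_mhz;
    return power;
}

/* Index of the candidate with the lowest relative power; n must be >= 1. */
size_t pick_min_power(const design_point *pts, size_t n, int p)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (relative_power(&pts[i], p) < relative_power(&pts[best], p))
            best = i;
    return best;
}
```

With candidates `{32, 836.692, 1.0}` and `{64, 543.444, 2.0}`, this sketch selects 32 clusters for p = 1 and 64 clusters for p = 2 and p = 3. The point is the shape of the trade-off: extra clusters add capacitance but lower the required frequency, and the stronger the dependence of power on frequency, the more the wider design pays off.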
39
ALU utilization with frequency
3G workload
Relation between ALU utilization and power
minimization?
40
Choice of adders and multipliers
(?, f^p)    Optimal adders   Optimal multipliers   ALU/cluster power   Cluster/total power
(0.01, 1)   2                1                     30                  61
(0.01, 2)   2                1                     30                  61
(0.01, 3)   3                1                     25                  58
(0.1, 1)    2                1                     52                  69
(0.1, 2)    2                1                     52                  69
(0.1, 3)    3                1                     51                  68
(1, 1)      1                1                     86                  89
(1, 2)      2                2                     84                  87
(1, 3)      2                2                     84                  87
41
Exploration results

Final design conclusion
Clusters: 64
Multipliers/cluster: 1
Multiplier utilization: 62
Adders/cluster: 3
Adder utilization: 55
Real-time frequency: 568.68 MHz for 128 Kbps/user
Exploration done in seconds.

42
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

43
Broader impact

Results not specific to base-stations
High performance, low power system designs
Concepts can be extended to handsets
Mux network applicable to all SIMD processors
Power efficiency in scientific computing
Results 2, 3 applicable to all stream
applications
Design and power efficiency
Multimedia, MPEG,

44
Future work

Don't believe the model is the reality
Fabrication needed to verify concepts
Cycle accurate simulator
Extrapolating models for power
LDPC decoding (in progress)
Sparse matrix requires permutations over large
data
Indexed SRF may help
3G requires 1 GHz at 128 Kbps/user
4G equalization at 1 Mbps breaks down (expected)

45
Options for higher performance

Multi-threading (ILP, MMX, DP, MT)
Schedule other kernels on unused clusters
Additional microcontroller and issue logic
complexity
Pipelining (ILP, MMX, DP, MT, PP)
Standard way of improving performance
Inter-processor communication overhead
Load-balancing difficult
min(t1, t2, ...) instead of min(t1 + t2 + ...)
Software tools need to catch up with hardware

46
Need for new architectures, definitions and
benchmarks

Road ends - conventional architectures [Agarwal2000]
Wide range of architectures: DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable
Difficult to compare and contrast
Need new definitions that allow comparisons
Wireless workloads
Typically ASIC designs
SPEC benchmark needed for programmable designs

47
Conclusions

Utilizing 100-1000s of ALUs per clock cycle and mapping algorithms is not easy in programmable architectures
Data parallel algorithms need to be designed and
mapped
Power efficiency needs to be provided
Design exploration needed to decide ALUs to meet
real-time constraints
My thesis lays the initial foundations
