Title: Programmable processors for wireless base-stations
1Programmable processors for wireless base-stations
- Sridhar Rajagopal
- (sridhar_at_rice.edu)
- December 16, 2003
2Wireless rates vs. clock rates
[Figure: clock rates (200 MHz, 4 GHz) versus wireless data rates (9.6 Kbps, 1 Mbps, 2-10 Mbps, 54-100 Mbps)]
- Need to process 100X more bits per clock cycle today than in 1996
3Base-stations need horsepower
- Sophisticated signal processing for multiple users
- Need 100-1000s of arithmetic operations to process 1 bit
- Base-stations require > 100 ALUs
4Programmable architectures
- Wireless algorithm kernels
- Well known, ASIC mapping well-studied
- Processors getting more powerful every year
- Historic trend: ASICs → Programmable
- Can we design a fully programmable wireless
system?
5Thesis addresses the following problem
- Design programmable processors for wireless base-stations with 100s of ALUs
  - (a) map wireless algorithms on these processors
  - (b) power-efficient (adapt resources to needs)
  - (c) decide ALUs, clock frequency
- How much programmable? As programmable as possible
6Choice: Multi-processors
- Single processors won't do
- ILP, subword parallelism not sufficient
- Register file explosion with increasing ALUs
- Multiprocessors
- Data parallelism in wireless systems
- Data-parallel/SIMD/vector processors appropriate
- Exploit ILP, MMX, DP
7Thesis contributions
- (a) Mapping algorithms on data-parallel processors
  - designing data-parallel algorithms
  - tradeoffs between packing, ALU utilization and memory
  - reduced inter-cluster communication network
- (b) Improve power efficiency
  - adapting compute resources to workload variations
  - varying voltage and frequency to real-time requirements
- (c) Design exploration between ALUs and clock frequency to minimize power consumption
  - fast real-time performance prediction
8Outline
- Background
- Wireless systems
- Data-parallel (Stream) processors
- Mapping algorithms to stream processors
- Power efficiency
- Design exploration
- Broad impact and future work
9Wireless workloads
System                        2G              3G                         4G
Time                          1996            2003                       ?
Users                         32              32                         32
Data rates                    16 Kbps/user    128 Kbps/user              1 Mbps/user
Algorithms                    Single-user     Multi-user                 MIMO
Estimation                    Correlator      Max. likelihood            Chip equalizer
Detection                     Matched filter  Interference cancellation  Matched filter
Decoding                      Viterbi         Viterbi                    LDPC
Theoretical min ALUs @ 1 GHz  > 2             > 20                       > 200
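A lower bound of this kind can be estimated by dividing the required operations per second by the clock rate. The sketch below is only an illustration: the function name `min_alus` and the operations-per-bit value passed to it are assumptions, not figures from the thesis, and it assumes each ALU completes one operation per cycle.

```c
/* Lower bound on the number of ALUs, assuming every ALU completes one
 * operation per clock cycle. ops_per_bit is an assumed workload parameter
 * supplied by the caller, not a number taken from the slides. */
unsigned long min_alus(unsigned users, double bits_per_sec_per_user,
                       double ops_per_bit, double clock_hz)
{
    double alus = users * bits_per_sec_per_user * ops_per_bit / clock_hz;
    unsigned long whole = (unsigned long)alus;          /* truncates toward zero */
    return ((double)whole < alus) ? whole + 1 : whole;  /* round up */
}
```

For example, 32 users at 128 Kbps/user with an assumed 1000 operations per bit on a 1 GHz clock gives `min_alus(32, 128e3, 1000.0, 1e9) == 5`.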
10Key kernels studied for wireless
- FFT (media processing)
- QRD (media processing)
- Outer product updates
- Matrix-vector operations
- Matrix-matrix operations
- Matrix transpose
- Viterbi decoding
- LDPC decoding (in progress)
11Characteristics of wireless
- Compute-bound
- Finite precision
- Limited temporal data reuse
- Streaming data
- Data parallelism
- Static, deterministic, regular workloads
- Limited control flow
12Parallelism levels in wireless
    int i, a[N], b[N], sum[N];        // 32 bits
    short int c[N], d[N], diff[N];    // 16 bits, packed
    for (i = 0; i < 1024; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }
- Instruction Level Parallelism (ILP): DSP
- Subword Parallelism (MMX): DSP
- Data Parallelism (DP): Vector Processor
- DP can decrease by increasing ILP and MMX
  - Example: loop unrolling
[Figure labels: DP, ILP, MMX]
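The trade between DP and ILP/MMX can be made concrete by unrolling the loop above by two. This is a minimal sketch: the function names and the fixed-width types are chosen here for illustration, and whether the two 16-bit subtracts actually issue as one packed operation depends on the target compiler and ALUs.

```c
#include <stdint.h>

#define N 1024

/* Original loop: N independent iterations, so DP = N. */
void add_sub(const int32_t *a, const int32_t *b, int32_t *sum,
             const int16_t *c, const int16_t *d, int16_t *diff)
{
    for (int i = 0; i < N; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = (int16_t)(c[i] - d[i]);
    }
}

/* Unrolled by 2: each iteration now holds two independent 32-bit adds (ILP)
 * and two 16-bit subtracts that a subword ALU could execute as one packed
 * operation (MMX). Only N/2 iterations remain to spread across clusters. */
void add_sub_unrolled(const int32_t *a, const int32_t *b, int32_t *sum,
                      const int16_t *c, const int16_t *d, int16_t *diff)
{
    for (int i = 0; i < N; i += 2) {
        sum[i]      = a[i]     + b[i];
        sum[i + 1]  = a[i + 1] + b[i + 1];
        diff[i]     = (int16_t)(c[i]     - d[i]);
        diff[i + 1] = (int16_t)(c[i + 1] - d[i + 1]);
    }
}
```

Both versions compute the same results; the unrolled one exposes more work per iteration and halves the iteration count, which is the sense in which DP decreases as ILP and MMX increase.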
13Stream Processors: multi-cluster DSPs
[Figure: memory, Stream Register File (SRF) and clusters; ILP and MMX within a cluster, DP across clusters; a VLIW DSP corresponds to 1 cluster]
- Adapt clusters to DP
- Identical clusters, same operations
- Power-down unused FUs, clusters
14Outline
- Background
- Wireless systems
- Stream processors
- Mapping algorithms to stream processors
- Reduced inter-cluster communication network
- Power efficiency
- Design exploration
- Broad impact and future work
15Patterns in inter-cluster comm
- Inter-cluster comm network fully connected
- Structure in access patterns can be exploited
- Broadcasting
  - Matrix-vector multiplication, matrix-matrix multiplication, outer product updates
- Odd-even grouping
  - Transpose, packing, Viterbi decoding
16Viterbi needs odd-even grouping
- Exploiting Viterbi DP
- Odd-even grouping of trellis states
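One way to see why the trellis calls for odd-even grouping is a single add-compare-select step. The sketch below assumes a state numbering in which the next state is `(s >> 1) | (bit << (K-2))`, so that new states `j` and `j + HALF` share the predecessors `2j` and `2j + 1`. The numbering, the 8-state size, and the names `acs_step` and `bm` are illustrative assumptions and may differ from the thesis.

```c
#define NUM_STATES 8              /* 2^(K-1) states; K = 4 chosen for illustration */
#define HALF (NUM_STATES / 2)

/* Split v into its even-indexed and odd-indexed elements. */
static void odd_even_split(const int *v, int *even, int *odd)
{
    for (int j = 0; j < HALF; j++) {
        even[j] = v[2 * j];
        odd[j]  = v[2 * j + 1];
    }
}

/* One add-compare-select step over all states.
 * bm[s][bit] is the caller-supplied branch metric for leaving old state s
 * with input bit `bit`; old_pm and new_pm are path metrics. */
void acs_step(const int old_pm[NUM_STATES], int bm[NUM_STATES][2],
              int new_pm[NUM_STATES])
{
    int even[HALF], odd[HALF];
    odd_even_split(old_pm, even, odd);       /* gather predecessors 2j and 2j+1 */

    for (int j = 0; j < HALF; j++) {         /* HALF independent butterflies */
        for (int bit = 0; bit < 2; bit++) {
            int from_even = even[j] + bm[2 * j][bit];
            int from_odd  = odd[j]  + bm[2 * j + 1][bit];
            new_pm[j + bit * HALF] = from_even < from_odd ? from_even : from_odd;
        }
    }
}
```

After the split, butterfly `j` touches only element `j` of each half, so the outer loop has no cross-iteration dependences and can be spread across clusters.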
17Performance of Viterbi decoding
Ideal C64x DSP (without co-processor) needs 200 MHz for real-time
18Odd-even grouping
- Packing
  - If odd-even data packed in same cluster and precision doubles
  - Odd-even grouping required for bringing data to right cluster
  - Not always beneficial for performance
- Matrix transpose
  - Better done in ALUs than in memory
  - Shown to have an order-of-magnitude better performance
  - Done in ALUs as repeated odd-even groupings
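The claim that a transpose reduces to repeated odd-even groupings can be checked with a short sequential sketch. It assumes a row-major matrix whose dimensions are powers of two; the function names and the scratch buffer are illustrative, and on a stream processor each grouping would be carried out through inter-cluster communication rather than a copy loop.

```c
#include <string.h>

/* Odd-even grouping: even-indexed elements first, then odd-indexed ones.
 * n must be even; in and out must not overlap. */
void odd_even_group(const int *in, int *out, size_t n)
{
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {
        out[i]        = in[2 * i];
        out[half + i] = in[2 * i + 1];
    }
}

/* Transpose a row-major rows x cols matrix stored in m, where rows and cols
 * are powers of two, by applying log2(cols) odd-even groupings to the flat
 * array. Each grouping rotates the index bits right by one, so log2(cols)
 * groupings swap the row and column fields of every index.
 * scratch must hold rows * cols elements. */
void transpose_by_grouping(int *m, int *scratch, size_t rows, size_t cols)
{
    size_t n = rows * cols;
    for (size_t c = cols; c > 1; c /= 2) {
        odd_even_group(m, scratch, n);
        memcpy(m, scratch, n * sizeof *m);
    }
}
```

A 4 x 4 matrix needs two groupings and a 2 x 4 matrix also needs two, since the count depends only on the number of columns.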
19Transpose uses odd-even grouping
20Odd-even grouping
0 1 2 3 4 5 6 7 → 0 2 4 6 1 3 5 7
Inter-cluster communication:
- Spans the entire chip length
- Limits clock frequency
- Limits scaling
21A reduced inter-cluster comm network
only nearest neighbor interconnections
22Outline
- Background
- Wireless systems
- Stream processors
- Mapping algorithms to stream processors
- Power efficiency
- Design exploration
- Broad impact and future work
23Flexibility needed in workloads
- Billions of computations per second needed
- Workload variation: from 1 GOPs for 4 users, constraint length 7 Viterbi, to 23 GOPs for 32 users, constraint length 9 Viterbi
24DP changes with users
25Data is not in the right banks
- 4 → 2 clusters
- Data not in the right SRF banks
- Overhead in bringing data to the right banks
  - Via memory
  - Via inter-cluster communication network
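The bank mismatch can be illustrated by counting how many stream elements sit in a bank other than the one their new cluster reads from. This sketch assumes a simple striped layout in which element `i` lives in bank `i mod banks`; the layout and the names `bank_of` and `misplaced` are assumptions for illustration only.

```c
/* Bank (or cluster) that owns element i when a stream is striped
 * round-robin across `count` banks. Assumed layout: i mod count. */
unsigned bank_of(unsigned i, unsigned count)
{
    return i % count;
}

/* Of the first n elements, how many are stored in a bank that differs from
 * the cluster that should process them once only `active` clusters remain. */
unsigned misplaced(unsigned n, unsigned banks, unsigned active)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++) {
        if (bank_of(i, banks) != bank_of(i, active))
            count++;
    }
    return count;
}
```

Under this layout, going from 4 to 2 clusters leaves half of the elements in the wrong bank (`misplaced(8, 4, 2) == 4`), and going from 4 to 1 leaves three quarters (`misplaced(8, 4, 1) == 6`).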
26Adapting clusters to Data Parallelism
[Figure: SRF connected to the clusters (C) through an adaptive multiplexer network; configurations shown: no reconfiguration, 4 → 2 reconfiguration, 4 → 1 reconfiguration, all clusters off]
- Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation
27Cluster utilization variation
[Figure: cluster utilization versus cluster index on a 32-cluster processor; (32, 9) denotes 32 users, constraint length 9 Viterbi]
28Frequency variation
29Operation
- Dynamic voltage-frequency scaling when system changes significantly
  - Users, data rates
  - Coarse time scale (when system changes)
- Turn off clusters
  - When parallelism changes
  - Finer time scale (once every 1000 cycles) (di/dt effects)
  - Memory operations
  - Exceed real-time requirements
30Power: Voltage Gating and Scaling
Power can change from 12.38 W to 300 mW (40x savings) depending on workload changes
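The way gating and scaling compound can be seen from the standard dynamic power relation P = C V^2 f applied per active cluster. The sketch below is a first-order illustration only: the function name and the per-cluster capacitance parameter are assumptions, static power is ignored, and the 12.38 W and 300 mW figures come from the thesis's own model, not from this formula.

```c
/* First-order dynamic power of the cluster array: P = clusters * C * V^2 * f.
 * cap_farads is an assumed switched capacitance per cluster; static
 * (leakage) power is not modeled here. */
double dynamic_power_watts(unsigned active_clusters, double cap_farads,
                           double volts, double freq_hz)
{
    return active_clusters * cap_farads * volts * volts * freq_hz;
}
```

Under this relation, turning off half the clusters while also halving voltage and frequency reduces dynamic power by a factor of 2 x 4 x 2 = 16.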
31Outline
- Background
- Wireless systems
- Stream processors
- Mapping algorithms to stream processors
- Power efficiency
- Design exploration
- Broad impact and future work
32Deciding ALUs vs. clock frequency
- No independent variables
  - Clusters, ALUs, frequency, voltage (c, a, m, f)
- Trade-offs exist
  - How to find the right combination for lowest power?
33Static design exploration
Also helps in quickly predicting real-time performance
34Sensitivity analysis important
- We have a capacitance model [Khailany2003]
- Equations are not all exact
- Need to see how variations affect solutions
35Design exploration methodology
- 3 types of parallelism: ILP, MMX, DP
- For best performance (power)
  - Maximize the use of all
- Maximize ILP and MMX at expense of DP
  - Loop unrolling, packing
  - Schedule on sufficient number of adders/multipliers
- If DP remains, set clusters = DP
  - No other way to exploit that parallelism
36Setting clusters, adders, multipliers
- If sufficient DP, linear decrease in frequency with clusters
  - Set clusters depending on DP and execution time estimate
- To find adders and multipliers,
  - Let compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find execution time
- Put all numbers in power equation
  - Compare increase in capacitance due to added ALUs and clusters with benefits in execution time
- Choose the solution that minimizes the power
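The selection step above can be sketched as a search over candidate configurations. Everything here is an illustrative assumption: the struct fields, the function names, and in particular the power model, which takes relative power as capacitance times f^p, where f is the lowest frequency that meets the deadline and p is an exponent of 1, 2 or 3. The thesis's actual power equation may differ.

```c
#include <stddef.h>

/* One candidate design point. cycles would come from compiling the workload
 * for this configuration; capacitance from a capacitance model. */
struct design {
    unsigned clusters, adders, multipliers;
    double cycles;       /* cycles needed per real-time deadline */
    double capacitance;  /* relative switched capacitance of the configuration */
};

/* Assumed model: relative power = capacitance * f^p, with f = cycles / deadline. */
static double relative_power(const struct design *d, double deadline_s, int p)
{
    double f = d->cycles / deadline_s;
    double fp = 1.0;
    for (int k = 0; k < p; k++)
        fp *= f;
    return d->capacitance * fp;
}

/* Index of the candidate with the lowest modeled power. n must be > 0. */
size_t best_design(const struct design *cands, size_t n,
                   double deadline_s, int p)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (relative_power(&cands[i], deadline_s, p) <
            relative_power(&cands[best], deadline_s, p))
            best = i;
    }
    return best;
}
```

With made-up numbers, a 32-cluster point (800 cycles, capacitance 1.0) beats a 64-cluster point (500 cycles, capacitance 1.8) for p = 1, while the 64-cluster point wins for p = 2 and p = 3, because a stronger dependence of power on frequency rewards the extra hardware.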
37Design exploration for clusters (c)
[Figure: DP versus time]
- For a sufficiently large number of adders and multipliers per cluster:
  - Explore Algorithm 1: 32 clusters
  - Explore Algorithm 2: 64 clusters
  - Explore Algorithm 3: 64 clusters
  - Explore Algorithm 4: 16 clusters
38Clusters frequency and power
3G workload:
- 32 clusters at frequency 836.692 MHz (p = 1)
- 64 clusters at frequency 543.444 MHz (p = 2)
- 64 clusters at frequency 543.444 MHz (p = 3)
39ALU utilization with frequency
3G workload
Relation between ALU utilization and power minimization?
40Choice of adders and multipliers
(?,fp)     Optimal adders  Optimal multipliers  ALU/Cluster power (%)  Cluster/Total power (%)
(0.01,1)   2               1                    30                     61
(0.01,2)   2               1                    30                     61
(0.01,3)   3               1                    25                     58
(0.1,1)    2               1                    52                     69
(0.1,2)    2               1                    52                     69
(0.1,3)    3               1                    51                     68
(1,1)      1               1                    86                     89
(1,2)      2               2                    84                     87
(1,3)      2               2                    84                     87
41Exploration results
- Final design conclusion
  - Clusters: 64
  - Multipliers/cluster: 1
  - Multiplier utilization: 62%
  - Adders/cluster: 3
  - Adder utilization: 55%
  - Real-time frequency: 568.68 MHz for 128 Kbps/user
- Exploration done in seconds
42Outline
- Background
- Wireless systems
- Stream processors
- Mapping algorithms to stream processors
- Power efficiency
- Design exploration
- Broad impact and future work
43Broader impact
- Results not specific to base-stations
- High performance, low power system designs
- Concepts can be extended to handsets
- Mux network applicable to all SIMD processors
- Power efficiency in scientific computing
- Results 2, 3 applicable to all stream applications
  - Design and power efficiency
  - Multimedia, MPEG, ...
44Future work
- Don't believe the model is the reality
  - Fabrication needed to verify concepts
  - Cycle accurate simulator
  - Extrapolating models for power
- LDPC decoding (in progress)
  - Sparse matrix requires permutations over large data
  - Indexed SRF may help
- 3G requires 1 GHz at 128 Kbps/user
- 4G equalization at 1 Mbps breaks down (expected)
45Options for higher performance
- Multi-threading (ILP, MMX, DP, MT)
  - Schedule other kernels on unused clusters
  - Additional microcontroller and issue logic complexity
- Pipelining (ILP, MMX, DP, MT, PP)
  - Standard way of improving performance
  - Inter-processor communication overhead
  - Load-balancing difficult
  - min(t1, t2, ...) instead of min(t1 + t2 + ...)
- Software tools need to catch up with hardware
46Need for new architectures, definitions and benchmarks
- Road ends for conventional architectures [Agarwal2000]
- Wide range of architectures: DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable
  - Difficult to compare and contrast
  - Need new definitions that allow comparisons
- Wireless workloads
  - Typically ASIC designs
  - SPEC benchmark needed for programmable designs
47Conclusions
- Utilizing 100-1000s ALUs/clock cycle and mapping algorithms not easy in programmable architectures
- Data parallel algorithms need to be designed and mapped
- Power efficiency needs to be provided
- Design exploration needed to decide ALUs to meet real-time constraints
- My thesis lays the initial foundations