Title: DSPs in emerging wireless systems
1. DSPs in emerging wireless systems
2. Motivation
- Software solutions are becoming important in the physical layer
- Multi-standard systems
- Algorithms tailored to the environment, SNR, etc.
- Flexible parameters for spreading and coding
- Computational requirements exceed the real-time capability of current-generation DSPs by more than 2 orders of magnitude
3. Current approaches
- HW/SW co-design
- Maximize programmability in DSPs
- Complex tasks on co-processors
- Example: TI C6416 with Viterbi and Turbo co-processors
- How is this going to scale in 4G?
- Keep on adding co-processors?
4. Our approach
- With increasing computational demands, the DSP's role is reduced to controlling co-processors
- The final system becomes as inflexible as a traditional ASIC design
- We are investigating Scalable Wireless Application-specific Processors (SWAPs)
- Identify bottlenecks in these architectures and the gap with respect to ASICs
- Investigate solutions to bridge the gap
5. Scalable Wireless Application-specific Processors
- Multi-cluster, stream-based architecture based on the Imagine media processor from Stanford
- Why a streaming processor:
- General-purpose processor (GPP) architectures are not well suited to media and wireless workloads
- Streaming processors have been shown to work well for media kernels such as FFT and FIR
- Media and communication algorithms are similar
- Media architectures are popular -> wireless architectures?
6. Scalable architectures
7. Programming model
- Kernels: computation

    KERNEL example1(istream<int> a,
                    istream<int> b,
                    ostream<int> c)
    {
        loop_stream(a) {
            int ai, bi, ci;
            a >> ai;                  // read one element from each input stream
            b >> bi;
            ci = ai * 2 + bi * 3;
            c << ci;                  // write one element to the output stream
        }
    }

- Streams: communication

    void main()
    {
        Stream<int> a(256);
        Stream<int> b(256);
        Stream<int> c(256);
        Stream<int> d(1024);
        ...
        example1(a, b, c);            // c is produced by example1 ...
        example2(c, d);               // ... and consumed by example2
        ...
    }
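The kernel/stream split above can be mimicked in plain Python to make the data flow concrete. This is only a behavioural sketch, not Imagine code: streams are modelled as Python iterables, and `example2` is hypothetical because the slide does not define it (it is assumed here to emit four outputs per input, which would explain why `d` holds 1024 elements while `c` holds 256).

```python
from typing import Iterable, Iterator, List


def example1(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Model of the example1 kernel: one output element per input pair."""
    for ai, bi in zip(a, b):          # plays the role of loop_stream(a)
        yield ai * 2 + bi * 3         # corresponds to c << ci


def example2(c: Iterable[int], repeat: int = 4) -> Iterator[int]:
    """Hypothetical second kernel (not defined on the slide).

    Assumed to emit `repeat` outputs per input so that a 256-element
    stream c fills a 1024-element stream d.
    """
    for ci in c:
        for k in range(repeat):
            yield ci + k


def run_pipeline(n: int = 256) -> List[int]:
    """Stream-level program: kernels are chained through streams."""
    a = list(range(n))                # stands in for Stream<int> a(256)
    b = [1] * n                       # stands in for Stream<int> b(256)
    c = example1(a, b)                # intermediate stream, never leaves the pipeline
    d = list(example2(c))             # stands in for Stream<int> d(1024)
    return d
```

The point of the model is that `c` is passed directly from one kernel to the next, so any data re-ordering needed between kernels has to happen on that intermediate stream.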
8. Architecture evaluation
- Benchmark kernels currently used: matrix-vector multiplication, FFT, Viterbi
- Kernel-level benchmarks were sufficient for ASIC solutions
- For programmable architectures, the interaction between kernels must also be investigated
- Data may need to be re-ordered between kernels
9. Rice benchmark for wireless systems
- Investigate a chain of multiuser estimation, multiuser detection, and Viterbi decoding algorithms
10. Bottlenecks in multi-cluster architectures
- Packed data (subword parallelism)
- Not always good to pack data
- Matrix transposes (interleaving)
- Doing the transpose in the ALUs may be cheaper and lower power
- Cannot be avoided in packed matrices
- Viterbi: shuffling of path metrics and survivor states using register exchange
- Register exchange is needed for parallel computations
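To illustrate the register-exchange bullet, here is a minimal sketch of a hard-decision Viterbi decoder that keeps one survivor register per state and copies registers between states at every trellis step. The code parameters are assumptions chosen for brevity (rate 1/2, constraint length 3, generators 7 and 5 octal); the slides do not say which code the benchmark uses.

```python
from typing import List, Tuple

INF = float("inf")


def branch_output(state: int, u: int) -> Tuple[int, int]:
    """Encoder output for input bit u from a 2-bit state (b1 = newest bit)."""
    b1, b2 = state >> 1, state & 1
    return (u ^ b1 ^ b2, u ^ b2)          # generators 111 and 101


def encode(bits: List[int]) -> List[Tuple[int, int]]:
    state, out = 0, []
    for u in bits:
        out.append(branch_output(state, u))
        state = (u << 1) | (state >> 1)   # shift the new bit into the state
    return out


def viterbi_register_exchange(received: List[Tuple[int, int]]) -> List[int]:
    metrics = [0, INF, INF, INF]          # the encoder starts in state 0
    registers: List[List[int]] = [[] for _ in range(4)]
    for r in received:
        new_metrics = [INF] * 4
        new_registers: List[List[int]] = [[] for _ in range(4)]
        for ns in range(4):
            u = ns >> 1                   # input bit that leads into state ns
            best_metric, best_prev = INF, ((ns & 1) << 1)
            for x in (0, 1):              # the two predecessor states of ns
                p = ((ns & 1) << 1) | x
                e = branch_output(p, u)
                m = metrics[p] + (e[0] != r[0]) + (e[1] != r[1])
                if m < best_metric:       # add-compare-select
                    best_metric, best_prev = m, p
            new_metrics[ns] = best_metric
            # Register exchange: copy the winning predecessor's whole
            # survivor register into the new state and append the bit.
            new_registers[ns] = registers[best_prev] + [u]
        metrics, registers = new_metrics, new_registers
    best_state = min(range(4), key=lambda s: metrics[s])
    return registers[best_state]
```

If each state (or group of states) lives in a different cluster, the register copy in the inner loop becomes inter-cluster communication at every step, which is the shuffling the slide refers to.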
11. DSP comparisons
12. Packing in multi-cluster architectures

    Kernel (in, out)
    {
        half2 a;                  // packed a
        int p, q;
        in >> a;
        p = mul_low(a, a);
        q = mul_high(a, a);
        out << p << q;
    }
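The cost of packing can be seen in a small Python model of this kernel. The semantics are assumptions, since the slide does not define them: `half2` is taken to be two signed 16-bit values in one 32-bit word, `mul_low` multiplies the low halves, and `mul_high` multiplies the high halves, each returning a full-width product. The helpers `pack_half2` and `unpack_half2` are hypothetical names introduced only for this sketch.

```python
import struct
from typing import List, Tuple


def pack_half2(lo: int, hi: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word (lo in bits 0..15)."""
    return struct.unpack("<I", struct.pack("<hh", lo, hi))[0]


def unpack_half2(word: int) -> Tuple[int, int]:
    """Inverse of pack_half2: returns (lo, hi) as signed 16-bit values."""
    lo, hi = struct.unpack("<hh", struct.pack("<I", word))
    return lo, hi


def mul_low(x: int, y: int) -> int:
    """Assumed semantics: full-width product of the low 16-bit halves."""
    return unpack_half2(x)[0] * unpack_half2(y)[0]


def mul_high(x: int, y: int) -> int:
    """Assumed semantics: full-width product of the high 16-bit halves."""
    return unpack_half2(x)[1] * unpack_half2(y)[1]


def square_kernel(in_stream: List[int]) -> List[int]:
    """Model of the slide's kernel: each packed word yields two output words."""
    out: List[int] = []
    for a in in_stream:
        out.append(mul_low(a, a))     # p
        out.append(mul_high(a, a))    # q
    return out
```

Under these assumptions every packed input word expands into two output words, because a 16-bit by 16-bit product no longer fits in 16 bits. The output stream is therefore twice as long as the input and its elements are no longer packed, which is one reason packing does not always pay off.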
13. Matrix transpose in memory
14. Matrix transpose in kernel
15. Data re-ordering for Viterbi
16. Performance loss due to re-ordering data for parallelism
Speedup per cluster added 0.5 due to parallelizing Viterbi trellis
17. Communication pattern
- All data re-arrangement problems share a common communication pattern
- Odd-even permutation of the data
- Investigating solutions to this problem that bridge the gap between multi-cluster and single-cluster systems
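As a concrete reading of the odd-even permutation, the sketch below shows the permutation and its inverse on a flat list, and checks that transposing a 2 x N matrix stored row by row is the same re-arrangement. It is an illustration of the pattern only, with function names chosen for this example, and it assumes an even number of elements.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def odd_even_split(data: Sequence[T]) -> List[T]:
    """Move even-indexed elements to the front and odd-indexed ones to the back."""
    return list(data[0::2]) + list(data[1::2])


def interleave(data: Sequence[T]) -> List[T]:
    """Inverse permutation: interleave the first half with the second half."""
    half = len(data) // 2
    out: List[T] = []
    for x, y in zip(data[:half], data[half:]):
        out.extend((x, y))
    return out


def transpose_2xn(flat: Sequence[T]) -> List[T]:
    """Transpose a 2 x N row-major matrix into N x 2 row-major order."""
    n = len(flat) // 2
    rows = [flat[:n], flat[n:]]
    return [rows[r][c] for c in range(n) for r in range(2)]
```

For example, `odd_even_split(range(8))` gives `[0, 2, 4, 6, 1, 3, 5, 7]`, and `interleave` undoes it. The transpose of a 2 x N matrix produces exactly the `interleave` ordering, which is consistent with the slide's claim that the re-ordering problems share one pattern.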