Title: Parallelization of Turbo Codec and Performance Analysis
1. Parallelization of Turbo Codec and Performance Analysis
Jung Soo Oh, Sang Moon Lee and Beongjo Kim
Samsung Advanced Institute of Technology and Samsung Electronics
2. Abstract
- Turbo Codec
  - Simulation methodology for the standard forward error correction scheme in IMT2000 systems (UMTS and CDMA2000)
- Parallelized Turbo Codec
  - Master and slave parallelization mechanism
  - Uneven frame distribution
  - Contributed to
    - obtaining reliable simulation results within a short period of time
    - participating in the IMT2000 standardization forum, thanks to the very high throughput
3. Turbo Codes
- Forward error correction scheme for data service over
  - Forward/Reverse Supplemental Channels (F/R-SCH) in CDMA2000 (IMT2000)
  - Uplink/Downlink Transport Channels (TrCH) in UMTS
- Developing and implementing an efficient, cost-effective Turbo codec requires a huge amount of simulation time.
  - => We parallelized the Turbo Codec program.
4. Concept
[Figure: system concept diagram. Labels: Forward Link, Multimedia terminal, BS, Network, MSC, Hand Portable Telephone, Searcher (Sync), Turbo Codec. Data rates shown: 2 Mbps; 64, 128, 384 Kbps; 8, 64 Kbps.]
5. Key Factors
- Evaluation factors for Turbo Codec performance
  - BER (Bit Error Rate)
  - FER (Frame Error Rate)
- The Turbo Codec operates in frame mode, so it generates output codewords frame by frame.
- In the computational simulation, a random number generator produces an arbitrary number of frames.
- The Turbo Codec simulator adjusts its parameters toward more appropriate values by analyzing the BER/FER of the independent frame transmissions.
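As a small illustration of the two evaluation factors, the sketch below counts bit and frame errors over a handful of frames. The function name `error_rates` and the sample frames are made up for this example; a frame is counted as erroneous if at least one of its bits differs.

```python
def error_rates(sent_frames, received_frames):
    """Return (BER, FER) for two equal-length lists of equal-length bit frames."""
    total_bits = 0
    bit_errors = 0
    frame_errors = 0
    for sent, received in zip(sent_frames, received_frames):
        errors_in_frame = sum(s != r for s, r in zip(sent, received))
        bit_errors += errors_in_frame
        total_bits += len(sent)
        if errors_in_frame > 0:
            frame_errors += 1  # one or more wrong bits makes a frame error
    ber = bit_errors / total_bits
    fer = frame_errors / len(sent_frames)
    return ber, fer


# Hypothetical data: three 4-bit frames, the second one has two wrong bits.
sent = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
received = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
ber, fer = error_rates(sent, received)
print(f"BER = {ber:.4f}, FER = {fer:.4f}")
```

Here 2 of 12 bits and 1 of 3 frames are in error.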
6. Main computational algorithm
Set environmental variables and allocate memory
for (frame_number = 0; ; frame_number++)
    make_turbo_frame()       /* frame generation with a random number */
    turbo_encoder()
    channel()                /* a random number is used */
    turbo_decoder()
    extract_turbo_frame()
    if (sum of total bit errors > given maximum number of bit errors) then break
get global BER and FER

The maximum number of bit errors is assigned before the program runs, so the number of transmitted frames is not known in advance.
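The serial loop can be sketched in Python as follows. This is not the Turbo Codec itself: the encoder and decoder are replaced by a 3x repetition code with majority-vote decoding, and the channel by a binary symmetric channel, purely so that the example runs. `FRAME_BITS`, `FLIP_PROB` and all function names are assumptions of this sketch.

```python
import random

FRAME_BITS = 64    # assumed frame length
FLIP_PROB = 0.05   # assumed bit-flip probability of the stand-in channel


def make_frame(rng):
    # Frame generation with random numbers.
    return [rng.randint(0, 1) for _ in range(FRAME_BITS)]


def encode(frame):
    # Stand-in for turbo_encoder(): repeat every bit three times.
    return [bit for bit in frame for _ in range(3)]


def channel(codeword, rng):
    # Stand-in channel: flip each bit independently with FLIP_PROB.
    return [bit ^ (rng.random() < FLIP_PROB) for bit in codeword]


def decode(received):
    # Stand-in for turbo_decoder(): majority vote over each group of three.
    return [int(sum(received[i:i + 3]) >= 2) for i in range(0, len(received), 3)]


def simulate(max_bit_errors, seed=0):
    rng = random.Random(seed)
    bit_errors = frame_errors = frames = 0
    # Stop once the accumulated bit errors exceed the preset maximum,
    # so the number of frames is only known after the run.
    while bit_errors <= max_bit_errors:
        sent = make_frame(rng)
        decoded = decode(channel(encode(sent), rng))
        errors = sum(s != d for s, d in zip(sent, decoded))
        bit_errors += errors
        frame_errors += 1 if errors else 0
        frames += 1
    ber = bit_errors / (frames * FRAME_BITS)
    fer = frame_errors / frames
    return frames, ber, fer


frames, ber, fer = simulate(max_bit_errors=200)
print(f"frames = {frames}, BER = {ber:.5f}, FER = {fer:.3f}")
```

Changing the seed changes the number of frames needed to reach the same error limit, which is the property that makes the work hard to split evenly in advance.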
7. First simple approach
- In each processor
  - Even allocation of the maximum number of bit errors
  - Different seeds for random number generation
- For example
  - Maximum number of bit errors = 1000
  - Number of processors = 4
  - Maximum number of bit errors / Number of processors = 250
  - seed = rank of each processor
- Unbalanced loading problem
  - The number of frames each processor will produce, transmit and analyze cannot be predicted.
  - More frame transmissions can end up on certain processors.
- It is simple but not useful.
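The unbalanced loading can be illustrated with a toy model, using the numbers from the example above (1000 bit errors, 4 processors, seed = rank). The per-frame error model (`frame_bits`, `bit_error_prob`) is a hypothetical stand-in for the real codec, and the processors are simulated one after another rather than run in parallel.

```python
import random


def frames_until_budget(error_budget, seed, frame_bits=64, bit_error_prob=0.01):
    """Count the frames one processor sends before exceeding its error budget."""
    rng = random.Random(seed)
    bit_errors = frames = 0
    while bit_errors <= error_budget:
        # Hypothetical model: each bit of a frame is wrong with bit_error_prob.
        bit_errors += sum(rng.random() < bit_error_prob for _ in range(frame_bits))
        frames += 1
    return frames


max_bit_errors = 1000
num_processors = 4
budget = max_bit_errors // num_processors   # 250 bit errors per processor

# seed = rank of each processor
frames = [frames_until_budget(budget, seed=rank) for rank in range(num_processors)]

# Processors that reach their budget early wait for the slowest one.
slowest = max(frames)
idle = [slowest - f for f in frames]
for rank in range(num_processors):
    print(f"processor {rank}: {frames[rank]} frames, idle for {idle[rank]} frame times")
```

Each processor stops after a different number of frames, so the ones that finish early sit idle until the last one reaches its limit.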
8. First Simple Approach (example)
[Figure: timelines of the number of transmitted frames.
- Serial program: if (sum of total bit errors > 1000) then terminate. The number of transmitted frames is unpredictable before the simulation starts.
- Perfect parallelization with 4 processors.
- Parallel program, processors 0 to 3: if (local bit errors > 250) then terminate. Processors that finish early idle until the program terminating point.]
9. Master-Slave 1
Set environmental variables and allocate memory
for (frame_number = 0; ; frame_number++)
    random number generation                                         [Master]
    make_turbo_frame()       /* frame generation with a random number */   [Slave]
    turbo_encoder()                                                  [Slave]
    channel()                /* a random number is used */           [Slave]
    turbo_decoder()                                                  [Slave]
    extract_turbo_frame()                                            [Slave]
    if (sum of total bit errors > given maximum number of bit errors) then break   [Master]
get global BER and FER                                               [Master]
10. Master-Slave 1
- Slave nodes
  - establish the transmission channels and transmit frames.
- Master node
  - generates the random numbers for all slave processors.
  - sums all of the bit errors from the slave nodes.
  - determines when frame transmission terminates.
- Analysis of the master's random number generation
  - Advantage
    - Eliminates overlapping random numbers among the processors
  - Disadvantage
    - Increases the message communication between master and slave nodes
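The message pattern of this scheme can be sketched with a thread pool standing in for the slave nodes. Every name here (`slave_frame`, `master`) and the per-frame error model are hypothetical; the original program ran on message-passing hardware, and Python threads do not give a real speedup for this CPU-bound loop. The sketch also checks for termination once per round of frames rather than after every frame.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def slave_frame(frame_seed, frame_bits=64, bit_error_prob=0.01):
    """Slave work for one frame: use the number sent by the master, report bit errors."""
    rng = random.Random(frame_seed)
    # Hypothetical stand-in for encode -> channel -> decode -> compare.
    return sum(rng.random() < bit_error_prob for _ in range(frame_bits))


def master(max_bit_errors, num_slaves, seed=0):
    master_rng = random.Random(seed)   # the only random source in the system
    total_errors = frames = 0
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        while total_errors <= max_bit_errors:
            # One "random number message" per slave per frame ...
            seeds = [master_rng.getrandbits(32) for _ in range(num_slaves)]
            # ... and one "bit error message" back from each slave.
            for errors in pool.map(slave_frame, seeds):
                total_errors += errors
                frames += 1
    return frames, total_errors


frames, total_errors = master(max_bit_errors=200, num_slaves=4)
print(f"{frames} frames transmitted, {total_errors} bit errors")
```

Because every frame needs a number from the master first, each slave makes a round trip to the master per frame, which is the communication cost listed as the disadvantage.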
11. Master-Slave 1 (message communication)
[Figure: message communication between the master and the slave nodes. Labels: master, BER message, Random number message.]
12. Master-Slave 1
- Intel Paragon (75 Mflops/node, 256 nodes in total), rate 1/4
13. Master-Slave 1
- Performance analysis
  - Better performance than the first simple approach
- But ...
  - Performance does not scale linearly
    - Uneven frame distribution to each slave
    - No better remedy!
  - Slave node idling time
    - Slaves wait for random numbers from the master node.
    - It degrades the performance.
    - Try to cut down the idling time!
14. Master-Slave 2
- Goal: minimize the idle time in slave nodes
- Modification
  - Slave nodes generate random numbers by themselves
- Analysis of the slaves' random number generation
  - Advantage
    - Reduces the message communication between master and slave nodes
    - Minimizes the idle time in slave nodes
  - Disadvantage
    - Random number redundancy problem among slave nodes
- Solving the redundancy problem
  - A random number generator with a sufficiently long period
  - period = 2^256
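A minimal sketch of the modified scheme, with each slave owning its generator and the master only collecting error counts. The `Slave` class, the seeding rule and the error model are assumptions of this example, and the slaves are called in turn instead of running in parallel. Python's `random.Random` is a Mersenne Twister with period 2^19937 - 1, not the generator used in the slides; distinct seeds do not strictly prove that the streams never overlap, but with such a long period an overlap within a simulation run is very unlikely.

```python
import random


class Slave:
    def __init__(self, rank, base_seed=12345):
        # Each slave seeds its own generator from its rank (hypothetical rule).
        self.rng = random.Random(base_seed * 1000 + rank)

    def run_batch(self, frames, frame_bits=64, bit_error_prob=0.01):
        """Transmit a batch of frames locally and return the bit error count."""
        return sum(self.rng.random() < bit_error_prob
                   for _ in range(frames * frame_bits))


def master(max_bit_errors, num_slaves, batch=10):
    slaves = [Slave(rank) for rank in range(num_slaves)]
    total_errors = frames = messages = 0
    while total_errors <= max_bit_errors:
        for slave in slaves:
            # Only one message per batch: the slave's bit error count.
            total_errors += slave.run_batch(batch)
            frames += batch
            messages += 1
    return frames, total_errors, messages


frames, total_errors, messages = master(max_bit_errors=200, num_slaves=4)
print(f"{frames} frames, {total_errors} bit errors, {messages} messages")
```

No random number messages are sent at all, and the number of error reports drops by the batch size, so slaves no longer wait on the master between frames.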
15. Master-Slave 2
[Figure: Master-Slave 2 diagram. Label: master.]
16. Master-Slave 2
- The idling time of slave nodes becomes negligible.
- This modified master-slave method is applied to the parallelized Turbo Codec used to build a core chip.
- Intel Paragon (75 Mflops/node, 256 nodes in total), rate 1/3
17. Experimental Result (Condition)
- Parallelized Turbo Codec
  - the improved master-slave method
  - 1.5 dB
  - maximum number of bit errors = 200
- Platforms at Samsung Advanced Institute of Technology (SAIT)
  - Intel Paragon
  - HP Exemplar
  - Linux clusters
    - Intel CPU
    - Alpha CPU
18. Intel Paragon (1995.10 to 1999.11)
- 19.2 Gflops peak performance, 256 nodes (MPP)
- Main platform for developing the parallelized Turbo Codec
- Low CPU clock speed, but on the order of 10^2 parallel processing nodes
[Figure: performance chart comparing Master-Slave 2 and Master-Slave 1.]
19. HP Exemplar (1998.5 to 2001.6)
- 51.2 Gflops peak performance, 64 nodes (CC-NUMA)
- Parallel performance
  - only up to 8 nodes
  - the lowest parallel performance results
[Figure annotation: 2.5 times faster]
20. Alpha Linux Cluster 1
- LX board (21164) / 533 MHz CPU
- 4 or 8 nodes
- CPML (Compaq Portable Math Library) for efficient performance
21. Alpha Linux Cluster 2
- UP board (21264) / 667 MHz CPU
- 8 nodes
- the fastest system
22. Intel Linux Cluster
- 4 CPUs, Intel Pentium II 450 MHz
- GNU gcc compiler, suited to Intel CPUs
- General PC cluster
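23. Proprietary
dB = 1.5, rate = 1/3, error_max = 200
[Figure: bar chart of execution time (axis "time", ticks 0 to 7) for 1 CPU, 4 CPU and 8 CPU. Platforms: HP Exemplar, Alpha cluster (533 MHz LX), Intel cluster. Compiler/library labels: HP Developed; EGCS; CPML + EGCS; CPML + Compaq C; EGCS. Bar value labels: 645, 335, 357, 240, 254, 258, 220, 212, 123, 126, 143, 047, 043.]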
24. Turbo Codec
dB = 1.5, rate = 1/3, error_max = 200
25. Performance Analysis
- The parallel scalability is not linear.
  - The number of frames in the slave nodes is uneven and unpredictable.
  - This does not affect the parallel simulation.
- Throughput
  - The biggest advantage for parameter studies: using several nodes shortens the computation time.
- The parallelized Turbo Codec has an invisible linear scalability.
26. Invisible Scalability
- Analysis of a more reliable parallel efficiency
  - Analysis of the computing time per frame
- Global execution time of every processor
  - = sum over processors of (the execution time of each processor)
  - = (the execution time of the parallelized program) x (the number of CPUs)
- Time consumed to transmit a frame
  - = (the global execution time of every processor) / (the total number of transmitted frames)
- From the analysis of a number of parallelized simulations:
  - => the computing time per frame is equal in every simulation.
  - => the time ratio over the total transmitted frames is scalable.
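The per-frame metric defined above is a one-line calculation. The run figures below are invented purely to show the arithmetic; they are not measurements from the slides.

```python
def time_per_frame(wall_time_sec, num_cpus, total_frames):
    """Global execution time of all processors divided by total transmitted frames."""
    global_time = wall_time_sec * num_cpus   # execution time x number of CPUs
    return global_time / total_frames


# Hypothetical runs: (wall-clock seconds, CPUs, total frames transmitted)
runs = [(100.0, 1, 400), (27.0, 4, 432), (14.0, 8, 448)]
for wall, cpus, frames in runs:
    print(f"{cpus} CPU(s): {time_per_frame(wall, cpus, frames):.3f} sec per frame")
```

In this made-up data the wall-clock speedup on 4 CPUs is 100 / 27, which is less than 4, yet the time per frame is 0.25 sec in every run, because the parallel runs also transmitted more frames. This is the sense in which the scalability is "invisible" in raw execution time.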
27. UP vs LX (computing time per frame)
dB = 1.5, rate = 1/3, error_max = 200
<computing time per frame>
533 MHz (LX): 0.287 sec, 0.063 sec, 0.027 sec
667 MHz (UP): 0.212 sec, 0.05 sec, 0.025 sec
28. Conclusion
- Parallelized Turbo Codec simulation algorithm
  - remarkable parallel efficiency and higher throughput
- Most outstanding contributions of the parallelized Turbo Codec
  - reducing the computational simulation time in the design process of the core chip
  - allowing optimization of parameters in a shorter time
  - allowing analysis in the range over 2 dB
- Samsung Electronics participates in the IMT2000 standardization forum with a large amount of simulation data based on this parallel Turbo Codec simulation algorithm.