Title: LYU0703 Parallel Distributed Programming on PS3
1. LYU0703 Parallel Distributed Programming on PS3
Department of Computer Science and Engineering,
CUHK 2007-2008 Final Year Project Presentation
(1st term)
- Huang Hiu Fung 05700512
- Wong Chung Hoi 05596742
- Supervised by Prof. Michael R. Lyu
2. Agenda
- Background Information
- Architecture of PlayStation 3
- Principles of Parallel Programming
- Optimization of the ADVISER program
  1. Sequential Approach
  2. Parallel Approach
- Conclusion
- Future Work
- Q&A
3. Background Information
- Limitations of the single-core processor:
- Memory Access Latency
- Wire Delays
- Power Consumption
4. Power Consumption
P = C × V² × f
where P = power, C = capacitance, V = voltage, f = processor frequency (cycles per second)
5. Development of Multi-Core Processors
6. Development of Multi-Core Processors
- Reduce power consumption: use multiple cores with low frequency instead of one core with high frequency
- Efficient processing of multiple tasks: divide the computation work and execute it among the cores concurrently
7. Project Objectives
- Need for parallel programming to optimize computation-intensive applications
- Study the features of parallel programming; compare the sequential and parallel approaches
- Optimize an application, showing great improvement through parallel programming
8. Architecture of PlayStation 3 (PS3)
- A multi-core machine produced by Sony, built around the Cell Broadband Engine
- Strong computation power
- An open platform for other applications and development
9. Cell Broadband Engine (Cell BE)
- PPE: Power Processor Element
- SPE: Synergistic Processor Element
- EIB: Element Interconnect Bus
10. Power Processor Element (PPE)
- Based on the 64-bit PowerPC architecture
- General-purpose operation
- Designed for control-intensive work
- Controls I/O to main memory and other devices through the OS
- Controls all 8 SPEs
11. Synergistic Processor Element (SPE)
- Designed to provide computation performance
- The SPU performs the allocated task
- The LS (Local Store) is the only memory
- The MFC controls data transfer
- 8 SPEs in total in the Cell
- Only 6 accessible: 1 reserved for system software, 1 disabled
12. Element Interconnect Bus (EIB)
- Internal communication bus inside the Cell
- Connects the different elements: PPE, SPEs, memory controller
13. Principles of Parallel Programming

Parallel algorithm                     | Serial algorithm
multiple processing units              | single processing unit
communication overhead                 | no communication overhead
higher complexity in code              | straightforward code
must ensure load balance between PUs   | everything is done by one CPU
14. Concept of Load Balance
- Distribute data evenly
- The total runtime depends on the busiest processing element
- Computation time is wasted on idling processing elements
15. Methods of Parallelism
- Data parallelism
- Task parallelism
16. Parallel Architecture
Flynn's taxonomy:

              | Single Instruction | Multiple Instruction
Single Data   | SISD               | MISD
Multiple Data | SIMD               | MIMD
17. SISD
- Traditional Computer
- von Neumann model
18. SIMD
- Same instruction on all data
- Data parallelism
- SIMD intrinsic function
19. MISD
- No well-known system
- Mentioned for completeness
20. MIMD
- Different instructions on different data
- Task parallelism
- Further broken down into:
  - Shared Memory System
  - Distributed Memory System
21. Shared Memory System
- Access a central memory for data
- On PS3: achieved by the MFC issuing DMA commands (see the sketch below)
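As a hedged illustration of how an SPE reaches main memory, here is a minimal SPU-side sketch of an MFC DMA transfer; the function name, buffer size and tag are illustrative, not taken from the ADVISER code:

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK_BYTES 4096   /* illustrative transfer size, a multiple of 16 bytes */

/* Local Store buffer; DMA buffers must be 16-byte (ideally 128-byte) aligned */
static char local_buf[CHUNK_BYTES] __attribute__((aligned(128)));

/* Pull one chunk from an effective address in main memory into the
 * Local Store, then block until the transfer has completed. */
static void fetch_chunk(uint64_t ea)
{
    const unsigned int tag = 0;                      /* DMA tag group (0..31) */

    mfc_get(local_buf, ea, CHUNK_BYTES, tag, 0, 0);  /* enqueue a DMA "get"   */
    mfc_write_tag_mask(1 << tag);                    /* select the tag group  */
    mfc_read_tag_status_all();                       /* wait for completion   */
}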
22. Distributed Memory System
- Each PE has its own memory
- On PS3: each SPE has a 256 KB Local Store
- PS3 is a hybrid shared-distributed memory system
23. ADVISER
- Compares 2 video clips
- Generates meaningful data (in the form of numbers) for the frames of the video
- Compares frames and looks for the most similar ones
- Locates the similar segment, which consists of a series of very similar frames
24. Input
- 2 folders: Target and Repository
- hl3 file: a vector of 1024 double-precision values

Input                | No. of hl3 files
Target directory     | 5473
Repository directory | 7547
25. Processing
- hl3 file: a vector of 1024 double-precision values
- Compute the similarity between File P and File Q
- The smaller the difference value, the more similar the files (see the sketch below)
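A minimal sketch of the per-file comparison, assuming the difference value is the sum of squared element differences (the subtract/multiply/add pattern shown in the SIMD pseudocode later in this deck); the function and constant names are illustrative:

#define HL3_LEN 1024   /* values per hl3 file */

/* Difference between file P and file Q: the smaller the returned value,
 * the more similar the two frames are. */
double hl3_difference(const double *p, const double *q)
{
    double diff = 0.0;
    for (int i = 0; i < HL3_LEN; i++) {
        double d = p[i] - q[i];
        diff += d * d;          /* accumulate squared difference */
    }
    return diff;
}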
26. Output
- M = number of Target files, N = number of Repository files
- O(M × N) comparisons
- Computation time: 633 sec
- Flash demo
- Sample output format:
  target hl3 1: most-matching repository file A, difference value ??
  target hl3 2: most-matching repository file B, difference value ??
  target hl3 3: most-matching repository file C, difference value ??
27. Parallel Version
- Data parallelism
- Split the data evenly across the 6 SPEs (see the sketch below)
- Computation time with 6 SPEs: 330 sec
- Flash demo
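A hedged sketch of the PPU-side split, assuming the target files are divided into equal contiguous ranges and one libspe2 context is run per SPE in its own thread; the embedded SPU program handle adviser_spu and the argument struct are illustrative:

#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES    6      /* only 6 SPEs are accessible on the PS3 */
#define NUM_TARGETS 5473   /* hl3 files in the Target directory */

extern spe_program_handle_t adviser_spu;   /* embedded SPU program (illustrative) */

typedef struct {                           /* work range handed to one SPE */
    unsigned int first, count;
} spu_arg_t;

static spu_arg_t args[NUM_SPES] __attribute__((aligned(16)));

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;

    spe_program_load(ctx, &adviser_spu);
    spe_context_run(ctx, &entry, 0, arg, NULL, NULL);   /* blocks until the SPE exits */
    spe_context_destroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_SPES];
    unsigned int chunk = (NUM_TARGETS + NUM_SPES - 1) / NUM_SPES;

    for (int s = 0; s < NUM_SPES; s++) {    /* even, contiguous split of the targets */
        args[s].first = s * chunk;
        args[s].count = (s == NUM_SPES - 1) ? NUM_TARGETS - args[s].first : chunk;
        pthread_create(&tid[s], NULL, run_spe, &args[s]);
    }
    for (int s = 0; s < NUM_SPES; s++)
        pthread_join(tid[s], NULL);
    return 0;
}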
28. Parallel Version
- Expected speed-up: 6X
- Actual speed-up: 2X
- The PC, the PPU and an SPE all run at different speeds:
- Computation time with the CPU (PC): 633 sec
- Computation time with 1 SPE: 1928 sec
- Computation time with the PPU: 3119 sec
- CPU > SPE > PPU
29. Time Attack
- SIMD intrinsic functions
- Changing the data type
- Double buffering
- Parallel read
- Distributing jobs to the idling PPE
- SIMD on the loop counter
- Loop unrolling
30. SIMD Intrinsic Functions
- Addition, subtraction, multiplication, etc.
- Operate on 128-bit registers
- Data type: double (64 bits), i.e. 2 values per register
- Speed-up: 2X (see the sketch below)
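A hedged SPU sketch of the same per-file comparison using the generic SIMD intrinsics on vector double: each 128-bit register holds two 64-bit values, which is where the roughly 2X speed-up comes from. Names are illustrative:

#include <spu_intrinsics.h>

#define HL3_LEN 1024                 /* doubles per hl3 file */
#define VECS    (HL3_LEN / 2)        /* 2 doubles per 128-bit register */

/* Sum of squared differences, two elements at a time. */
double hl3_difference_simd(const vec_double2 *p, const vec_double2 *q)
{
    vec_double2 diff = spu_splats(0.0);

    for (int i = 0; i < VECS; i++) {
        vec_double2 t = spu_sub(p[i], q[i]);   /* element-wise subtraction       */
        diff = spu_madd(t, t, diff);           /* fused multiply-add: t*t + diff */
    }
    /* horizontal sum of the two partial results */
    return spu_extract(diff, 0) + spu_extract(diff, 1);
}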
31. Changing Data Type to int
- Precision is not important here
- The major speed-up comes from the SIMD intrinsics
- Data type: int (32 bits)
- Total speed-up: 4X
- Computation time: 71 sec
32. Changing Data Type to float
- The SPE is designed for high-precision computation
- There is no intrinsic for the int data type at all
- Data type: float (32 bits)
- Saves data-conversion time
- Speed-up of about 30%
- Computation time: 49 sec
33. Double Buffering
- Saves communication time between the MFC and the SPU
- 2 buffers: one prefetching while the other is being processed
- The program is not heavy in communication, so only a minor speed-up (see the sketch below)
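A hedged SPU-side sketch of the double-buffering pattern, assuming data are streamed from main memory in fixed-size chunks; the chunk size, tags and names are illustrative:

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 4096   /* bytes per DMA transfer (illustrative) */

static char buf[2][CHUNK] __attribute__((aligned(128)));

static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

/* Stream nchunks chunks starting at effective address ea: while chunk n
 * is being processed by the SPU, chunk n+1 is already in flight on the MFC. */
static void stream(uint64_t ea, int nchunks, void (*process)(const char *, int))
{
    int cur = 0;

    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prefetch first chunk */
    for (int n = 0; n < nchunks; n++) {
        int nxt = cur ^ 1;
        if (n + 1 < nchunks)                            /* prefetch next chunk  */
            mfc_get(buf[nxt], ea + (uint64_t)(n + 1) * CHUNK, CHUNK, nxt, 0, 0);
        wait_tag(cur);                                  /* wait for current     */
        process(buf[cur], CHUNK);                       /* compute on it        */
        cur = nxt;                                      /* swap buffers         */
    }
}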
34. Parallel Reading of All Files
- Read Target and Repository concurrently
- Share the file-reading job among the SPEs
- Did not improve as predicted; even slower
- Reason: the hard disk cannot handle concurrent requests
- A failed attempt
35. Distributing Jobs to the Idling PPE
- The PPE's current job: read files, distribute files, collect results
- Use its stall time to do some computation
- But the PPE has relatively low computation power
- No significant improvement
- Increases program complexity
- Abandoned this approach
36. Applying SIMD to the Loop Counter
- Major computation power is consumed in this loop:
  1. Initialize i = 0, diff = (0, 0, 0, 0).
  2. For i < (number of float values in a file / number of floats packed in a register):
     A. temp = SIMD subtraction on vector i of the Target and Repository files.
     B. diff = SIMD addition(SIMD multiplication(temp, temp), diff).
  3. i = i + 1.
  4. Loop back to 2.
37. Applying SIMD to the Loop Counter
- Try to optimize step 3
- Apply SIMD to the loop counter itself
- Addition and comparison operations are reduced by 8 times
38. Applying SIMD to the Loop Counter
1. Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
2. For i0 < (number of float values in a file / number of floats packed in a register):
   - temp = SIMD subtraction on vector i0 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - temp = SIMD subtraction on vector i1 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - temp = SIMD subtraction on vector i2 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - temp = SIMD subtraction on vector i3 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - temp = SIMD subtraction on vector i4 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - temp = SIMD subtraction on vector i5 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - temp = SIMD subtraction on vector i6 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - temp = SIMD subtraction on vector i7 of the Target and Repository files.
   - diff = SIMD addition(SIMD multiplication(temp, temp), diff).
   - i = SIMD addition(i, (8, 8, 8, 8, 8, 8, 8, 8)).
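A hedged C rendering of the pseudocode above: the eight loop indices are kept in one SIMD register, so the scalar increment and comparison run only once per eight vector bodies. It assumes the number of vectors is a multiple of 8 (256 for a 1024-float hl3 file); the function name is illustrative:

#include <spu_intrinsics.h>

/* Sum of squared differences with the loop counter held in a SIMD register. */
float ssd_vector_counter(const vec_float4 *p, const vec_float4 *q, unsigned nvec)
{
    vec_float4  diff = spu_splats(0.0f);
    vec_ushort8 i    = {0, 1, 2, 3, 4, 5, 6, 7};        /* eight indices      */
    const vec_ushort8 step = spu_splats((unsigned short)8);

    /* one body per index held in the counter register */
#define BODY(k) do { \
        vec_float4 t = spu_sub(p[spu_extract(i, (k))], q[spu_extract(i, (k))]); \
        diff = spu_madd(t, t, diff); \
    } while (0)

    while (spu_extract(i, 0) < nvec) {     /* compare only element i0           */
        BODY(0); BODY(1); BODY(2); BODY(3);
        BODY(4); BODY(5); BODY(6); BODY(7);
        i = spu_add(i, step);              /* advance all eight indices at once */
    }
#undef BODY

    return spu_extract(diff, 0) + spu_extract(diff, 1) +
           spu_extract(diff, 2) + spu_extract(diff, 3);
}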
39. Result of the parallel PS3 version with SIMD, float input, and SIMD loop counter

No. of SPUs used         | 1   | 2   | 3  | 4  | 5  | 6
Read input time (sec)    | 4   | 5   | 3  | 4  | 4  | 4
Total elapsed time (sec) | 286 | 146 | 97 | 75 | 60 | 51
Net elapsed time (sec)   | 282 | 141 | 94 | 71 | 56 | 47
40. Result of the parallel PS3 version with SIMD, float input, and SIMD loop counter
41. Result of the parallel PS3 version with SIMD, float input, and SIMD loop counter
- Little improvement (about 4%)
- Shows the possibility of faster performance through further loop unrolling
- The best performance becomes 47 sec
42. Loop Unrolling
- Proved that optimizing the loop can improve performance
- Completely unroll the loop
- A more obvious speed-up
43. Result of the parallel PS3 version with SIMD, float input, and loop unrolling

No. of SPUs used         | 1   | 2  | 3  | 4  | 5  | 6
Read input time (sec)    | 3   | 4  | 3  | 3  | 4  | 3
Total elapsed time (sec) | 159 | 82 | 55 | 42 | 35 | 30
Net elapsed time (sec)   | 156 | 78 | 52 | 39 | 31 | 27
44. Result of the parallel PS3 version with SIMD, float input, and loop unrolling
45. Result of the parallel PS3 version with SIMD, float input, and loop unrolling
- 45% faster
- The ultimate best performance becomes 27 sec
46. Conclusion of Optimization
- PC version: 633 sec
- PS3 with 1 SPU (i.e. the sequential version on PS3): 1928 sec
- Final optimized version on PS3: 27 sec
  - 23 times faster than the PC version
  - 71 times faster than the sequential version on PS3
47. Conclusion of Optimization
48. Future Work
- Port the whole ADVISER application to PlayStation 3
- Optimization throughout the whole application
49. Q&A
50. The End