LYU0703 Parallel Distributed Programming on PS3

Transcript and Presenter's Notes

1
LYU0703: Parallel Distributed Programming on PS3
Department of Computer Science and Engineering, CUHK
2007-2008 Final Year Project Presentation (1st term)
  • Huang Hiu Fung 05700512
  • Wong Chung Hoi 05596742
  • Supervised by Prof. Michael R. Lyu

2
Agenda
  • Background Information
  • Architecture of PlayStation3
  • Principles of Parallel Programming
  • Optimization of the ADVISER program:
    1. Sequential Approach
    2. Parallel Approach
  • Conclusion
  • Future Work
  • Q&A

3
Background Information
  • Limitations of single-core processors:
  • Memory Access Latency
  • Wire Delays
  • Power Consumption

4
Power Consumption
P = C × V² × F
where P = power, C = capacitance, V = voltage, F = processor frequency (cycles per second)
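A quick worked illustration of the formula (the voltage figure below is an assumption chosen only to show the shape of the trade-off, not a measured value): replacing one core at frequency F and voltage V with two cores at F/2 that can run at 0.6V keeps the total cycles per second unchanged but cuts power to roughly a third:

    P_dual = 2 × C × (0.6V)² × (F/2) = 0.36 × C × V² × F ≈ P_single / 3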
5
Development of Multi-Core Processor
6
Development of Multi-Core Processor
  • Reduce power consumption: use multiple cores at low frequency instead of one core at high frequency
  • Efficient processing of multiple tasks: divide the computation work and execute it among the cores concurrently

7
Project Objectives
  • Parallel programming is needed to optimize computation-intensive applications
  • Study the features of parallel programming; compare the sequential and parallel approaches
  • Optimize an application, demonstrating a large improvement through parallel programming

8
Architecture of PlayStation3 (PS3)
  • A multi-core machine produced by Sony, built around the Cell Broadband Engine
  • Strong computation power
  • An open platform for other applications and development

9
Cell Broadband Engine (Cell BE)
PPE: Power Processor Element
SPE: Synergistic Processor Element
EIB: Element Interconnect Bus
10
Power Processor Element (PPE)
  • Based on the 64-bit PowerPC architecture
  • General-purpose operation
  • Designed for control-intensive work
  • Controls I/O of main memory and other devices through the OS
  • Controls all 8 SPEs

11
Synergistic Processor Element (SPE)
  • Designed to provide computation performance
  • SPU performs the allocated task
  • LS (Local Store): the only memory on the SPE
  • MFC controls data transfer
  • 8 SPEs in the Cell in total
  • Only 6 accessible: 1 reserved for system software, 1 disabled

12
Element Interconnect Bus (EIB)
  • Internal communication bus inside the Cell
  • Connects the different elements: PPE, SPEs, memory controller

13
Principles of Parallel Programming

Parallel algorithm                   | Serial algorithm
multiple processing units            | single processing unit
communication overhead               | no communication overhead
higher complexity in code            | straightforward code
must ensure load balance between PUs | everything is done by one CPU
14
Concept of Load Balance
  • Distribute data evenly
  • Total runtime depends on the busiest processing element
  • Computation time is wasted on idling processing elements

15
Methods of parallelism
  • Data parallelism
  • Task parallelism

16
Parallel Architecture
Flynn's taxonomy:

              | Single Instruction | Multiple Instruction
Single Data   | SISD               | MISD
Multiple Data | SIMD               | MIMD
17
SISD
  • Traditional Computer
  • von Neumann model

18
SIMD
  • Same instruction applied to all data
  • Data parallelism
  • SIMD intrinsic functions

19
MISD
  • No well-known system
  • Mentioned for completeness

20
MIMD
  • Different instructions on different data
  • Task parallelism
  • Further broken down into:
    • Shared Memory System
    • Distributed Memory System

21
Shared Memory System
  • Processing elements access a central memory for data
  • PS3: achieved by the MFC issuing DMA commands (see the sketch below)
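A minimal sketch of what "MFC issuing a DMA command" looks like in SPE code, using the standard Cell SDK MFC calls (the buffer name and sizes are illustrative):

    #include <spu_mfcio.h>

    /* Local Store buffer; DMA targets must be 16-byte aligned
       (128-byte alignment gives the best transfer performance). */
    volatile float buf[1024] __attribute__((aligned(128)));

    /* Pull 'size' bytes from effective address 'ea' in main memory
       into the Local Store, then block until the transfer completes. */
    void fetch_chunk(unsigned long long ea, unsigned int size)
    {
        unsigned int tag = 0;               /* DMA tag group (0..31) */
        mfc_get(buf, ea, size, tag, 0, 0);  /* start the DMA "get"   */
        mfc_write_tag_mask(1 << tag);       /* select this tag group */
        mfc_read_tag_status_all();          /* wait for completion   */
    }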

22
Distributed Memory System
  • Each PE has its own memory
  • PS3: each SPE has a 256KB Local Store
  • PS3 is a hybrid shared-distributed memory system

23
ADVISER
  • Compares 2 video clips
  • Generates meaningful data (in the form of numbers) for the frames of each video
  • Compares frames, looking for the most similar ones
  • Locates the similar segment, which consists of a series of very similar frames

24
Input
  • 2 folders: Repository and Target
  • Each hl3 file: a vector of 1024 double-precision values

Input                | No. of hl3 files
Target directory     | 5473
Repository directory | 7547
25
Processing
  • Each hl3 file: a vector of 1024 double-precision values
  • For a target file P and a repository file Q, similarity = sum over the 1024 values of (P[i] − Q[i])²
  • The smaller the value, the more similar the files (see the sketch below)
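A minimal scalar sketch of this comparison (the function name is illustrative; the sum-of-squared-differences form follows from the subtract/multiply/accumulate steps on the later SIMD slides):

    /* Similarity between two hl3 feature vectors: sum of squared
       differences over the 1024 values; smaller means more similar. */
    double similarity(const double *p, const double *q)
    {
        double diff = 0.0;
        int i;
        for (i = 0; i < 1024; i++) {
            double d = p[i] - q[i];
            diff += d * d;
        }
        return diff;
    }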

26
Output
  • M target files, N repository files
  • O(M × N) comparisons
  • Computation time: 633 sec
  • Flash demo

target hl3 1: most matching repository file A, difference value ??
target hl3 2: most matching repository file B, difference value ??
target hl3 3: most matching repository file C, difference value ??
27
Parallel Version
  • Data parallelism
  • Split the data evenly among the 6 SPEs (see the sketch below)
  • Computation time with 6 SPEs: 330 sec
  • Flash demo
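A minimal sketch of the even split (names are illustrative): each SPE gets a contiguous slice of the M target files, with any remainder spread one file at a time so no SPE is more than one file busier than another:

    #define NUM_SPES 6

    /* Compute the slice [start, start+count) of m files for one SPE. */
    void slice_for_spe(int spe, int m, int *start, int *count)
    {
        int base  = m / NUM_SPES;   /* files every SPE gets            */
        int extra = m % NUM_SPES;   /* first 'extra' SPEs get one more */
        *count = base + (spe < extra ? 1 : 0);
        *start = spe * base + (spe < extra ? spe : extra);
    }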

28
Parallel Version
  • Expected speed up: 6X
  • Actual speed up: 2X
  • The PC, the PPU and the SPEs all run at different speeds
  • Computation time with CPU: 633 sec
  • Computation time with 1 SPE: 1928 sec
  • Computation time with PPU: 3119 sec
  • Speed: CPU > SPE > PPU

29
Time Attack
  • SIMD intrinsic function
  • Changing data type
  • Double Buffering
  • Parallel Read
  • Distributing Job to idling PPE
  • SIMD on loop counter
  • Loop unrolling

30
SIMD intrinsic function
  • Addition, subtraction,
  • multiplication, etc.
  • Operates on 128 bits
  • registers
  • Date type double (64 bits)
  • Speed up 2X
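A minimal sketch of one comparison step with these intrinsics (from the SPU language extensions): with 64-bit doubles a 128-bit register holds only 2 lanes, which is consistent with the gain topping out around 2X:

    #include <spu_intrinsics.h>

    /* One SIMD step of the comparison: 2 subtractions and
       2 multiply-adds per instruction with vector double. */
    vector double ssd_step(vector double p, vector double q,
                           vector double diff)
    {
        vector double t = spu_sub(p, q);  /* p - q, 2 lanes at once */
        return spu_madd(t, t, diff);      /* diff + t*t, 2 lanes    */
    }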

31
Changing Data Type to int
  • Precision not important
  • Major speed up comes from SIMD intrinsics
  • Data type: int (32 bits)
  • Total speed up: 4X
  • Computation time: 71 sec

32
Changing Data Type to float
  • The SPE is designed for high-precision computation
  • No intrinsics for the int data type at all
  • Data type: float (32 bits)
  • Saves data conversion time
  • Speed up by 30%
  • Computation time: 49 sec

33
Double buffering
  • Saves communication time between the MFC and the SPU
  • 2 buffers: prefetching into one while processing the other (see the sketch below)
  • This workload is not communication-heavy, so only a minor speed up
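A minimal double-buffering sketch (CHUNK, process and the addressing are illustrative): while the SPU processes one buffer, the MFC prefetches the next chunk into the other, with one DMA tag per buffer:

    #include <spu_mfcio.h>

    #define CHUNK 4096
    volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(volatile char *data, int size); /* illustrative */

    void stream(unsigned long long ea, int nchunks)
    {
        int n, cur = 0;
        mfc_get(buf[0], ea, CHUNK, 0, 0, 0);       /* fetch first chunk  */
        for (n = 0; n < nchunks; n++) {
            int next = cur ^ 1;
            if (n + 1 < nchunks)                   /* prefetch chunk n+1 */
                mfc_get(buf[next],
                        ea + (unsigned long long)(n + 1) * CHUNK,
                        CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);          /* wait for chunk n   */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);              /* compute on chunk n */
            cur = next;
        }
    }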

34
Parallel Reading for All Files
  • Read Target and Repository concurrently
  • Share the file-reading job among the SPEs
  • Did not improve as predicted; it was even slower
  • Reason: the hard disk cannot handle concurrent requests
  • A failed attempt

35
Distributing Job to Idling PPE
  • The PPE's current job: read files, distribute files, collect results
  • Use its stall time to do some computation
  • The PPE's computation power is relatively low
  • No significant improvement
  • Increases program complexity
  • Approach abandoned

36
Applying SIMD for Loop Counter
  • Major computation power is consumed in this loop (a C sketch follows the list):
    1. Initialize i = 0, diff = (0, 0, 0, 0).
    2. For i < (number of floats in a file / number of floats packed in a register):
       A. temp = SIMD subtraction on vector i of the Target and Repository files.
       B. diff = SIMD addition(SIMD multiplication(temp, temp), diff).
    3. i = i + 1.
    4. Loop back to 2.
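A minimal sketch of this loop in SPU C, with the float data type the earlier slides settled on (1024 floats per file, 4 per 128-bit register, so 256 iterations):

    #include <spu_intrinsics.h>

    /* SIMD sum of squared differences over one hl3 file pair. */
    float ssd_simd(const vector float *p, const vector float *q)
    {
        vector float diff = spu_splats(0.0f);
        int i;
        for (i = 0; i < 1024 / 4; i++) {          /* steps 2-4 */
            vector float t = spu_sub(p[i], q[i]); /* step A    */
            diff = spu_madd(t, t, diff);          /* step B: diff += t*t */
        }
        /* fold the 4 partial sums held in diff's lanes */
        return spu_extract(diff, 0) + spu_extract(diff, 1)
             + spu_extract(diff, 2) + spu_extract(diff, 3);
    }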

37
Applying SIMD for Loop Counter
  • Try to optimize step 3
  • Apply SIMD to the loop counter itself
  • Addition and comparison operations are reduced 8-fold

38
Applying SIMD for Loop Counter
  • Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
  • For i[0] < (number of floats in a file / number of floats packed in a register), repeat for each of the 8 lanes k = 0 .. 7:
    temp = SIMD subtraction on vector i[k] of the Target and Repository files.
    diff = SIMD addition(SIMD multiplication(temp, temp), diff).
  • i = SIMD addition(i, (8, 8, 8, 8, 8, 8, 8, 8)) (see the C sketch below).
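A minimal C sketch of the same idea (illustrative: the slide keeps the counter itself in a SIMD register, while this sketch steps a scalar counter by 8, which gives the same 8-fold reduction in loop overhead):

    #include <spu_intrinsics.h>

    /* 8x-unrolled SIMD sum of squared differences. */
    float ssd_unrolled(const vector float *p, const vector float *q)
    {
        vector float diff = spu_splats(0.0f);
        int i;
        for (i = 0; i < 1024 / 4; i += 8) {
            /* 8 explicit subtract / multiply-add pairs per pass */
            vector float t0 = spu_sub(p[i+0], q[i+0]);
            vector float t1 = spu_sub(p[i+1], q[i+1]);
            vector float t2 = spu_sub(p[i+2], q[i+2]);
            vector float t3 = spu_sub(p[i+3], q[i+3]);
            vector float t4 = spu_sub(p[i+4], q[i+4]);
            vector float t5 = spu_sub(p[i+5], q[i+5]);
            vector float t6 = spu_sub(p[i+6], q[i+6]);
            vector float t7 = spu_sub(p[i+7], q[i+7]);
            diff = spu_madd(t0, t0, diff);
            diff = spu_madd(t1, t1, diff);
            diff = spu_madd(t2, t2, diff);
            diff = spu_madd(t3, t3, diff);
            diff = spu_madd(t4, t4, diff);
            diff = spu_madd(t5, t5, diff);
            diff = spu_madd(t6, t6, diff);
            diff = spu_madd(t7, t7, diff);
        }
        return spu_extract(diff, 0) + spu_extract(diff, 1)
             + spu_extract(diff, 2) + spu_extract(diff, 3);
    }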

39
Results of the parallel PS3 version (SIMD, float input, SIMD loop counter)

No. of SPUs used         | 1   | 2   | 3   | 4   | 5   | 6
Read input time (sec)    | 4   | 5   | 3   | 4   | 4   | 4
Total elapsed time (sec) | 286 | 146 | 97  | 75  | 60  | 51
Net elapsed time (sec)   | 282 | 141 | 94  | 71  | 56  | 47
40
Results of the parallel PS3 version (SIMD, float input, SIMD loop counter)
41
Results of the parallel PS3 version (SIMD, float input, SIMD loop counter)
  • Little improvement (about 4%)
  • Shows the possibility of faster performance through further loop unrolling
  • The best time becomes 47 sec

42
Loop Unrolling
  • Proved that optimizing the loop can improve performance
  • Unroll the loop completely
  • A more obvious speed up

43
Results of the parallel PS3 version (SIMD, float input, loop unrolling)

No. of SPUs used         | 1   | 2   | 3   | 4   | 5   | 6
Read input time (sec)    | 3   | 4   | 3   | 3   | 4   | 3
Total elapsed time (sec) | 159 | 82  | 55  | 42  | 35  | 30
Net elapsed time (sec)   | 156 | 78  | 52  | 39  | 31  | 27
44
Results of the parallel PS3 version (SIMD, float input, loop unrolling)
45
Results of the parallel PS3 version (SIMD, float input, loop unrolling)
  • 45% faster
  • The ultimate best time becomes 27 sec

46
Conclusion of Optimization
  • PC version: 633 sec
  • PS3 with 1 SPU (i.e. the sequential version on PS3): 1928 sec
  • Final optimized PS3 version: 27 sec
    23 times faster than the PC version; 71 times faster than the sequential version on PS3

47
Conclusion of Optimization
48
Future Work
  • Port the whole ADVISER application to PlayStation 3
  • Optimization throughout the whole application

49
Q&A
50
The End