LYU0703 Parallel Distributed Programming on PS3 - PowerPoint PPT Presentation

1
LYU0703 Parallel Distributed Programming on PS3
Department of Computer Science and Engineering,
CUHK 2007-2008 Final Year Project Presentation
(1st term)
  • Huang Hiu Fung 05700512
  • Wong Chung Hoi 05596742
  • Supervised by Prof. Michael R. Lyu

2
Agenda
  • Background Information
  • Architecture of PlayStation3
  • Principles of Parallel Programming
  • Optimization of the ADVISER program
    1. Sequential Approach
    2. Parallel Approach
  • Conclusion
  • Future Works
  • Q&A

3
Background Information
  • Limitation of single-core processor
  • Memory Access Latency
  • Wire Delays
  • Power Consumption

4
Power Consumption
P = C × V² × F
P: power, C: capacitance, V: voltage, F: processor frequency (cycles per second)
5
Development of Multi-Core Processor
6
Development of Multi-Core Processor
  • Reduce power consumption: use multiple low-frequency
    cores instead of one high-frequency core
  • Efficient processing of multiple tasks: divide
    the computation work and execute it among the cores
    concurrently
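
As a worked illustration of the power argument, the sketch below evaluates the dynamic power relation P = C × V² × F for one core at full frequency and for two cores at half frequency. The numeric values, and the assumption that supply voltage can be lowered in proportion to frequency, are illustrative and not taken from the slides.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power of one core: P = C * V^2 * F."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical values chosen only for illustration.
C = 1.0e-9   # farads
V = 1.2      # volts
F = 3.0e9    # cycles per second

single_core = dynamic_power(C, V, F)

# Two cores, each at half the frequency. Assumption: voltage is
# scaled down in proportion to frequency (an idealization).
dual_core = 2 * dynamic_power(C, V / 2, F / 2)

ratio = dual_core / single_core
print(f"dual-core power is {ratio:.2f} of single-core power")
```

Under these idealized assumptions the two slower cores offer the same aggregate cycle rate for a quarter of the power; real chips cannot scale voltage this freely, so the actual saving is smaller.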

7
Project Objectives
  • Show the need for parallel programming to optimize
    computation-intensive applications
  • Study the features of parallel programming and compare
    the sequential and parallel approaches
  • Optimize an application, showing great
    improvement by parallel programming

8
Architecture of PlayStation3 (PS3)
  • A multi-core machine produced by Sony, with the
    Cell Broadband Engine
  • Strong Computation Power
  • Open platform for other applications and
    development

9
Cell Broadband Engine (Cell BE)
PPE: Power Processor Element
SPE: Synergistic Processor Element
EIB: Element Interconnect Bus
10
Power Processor Element (PPE)
  • Based on the 64-bit PowerPC architecture
  • General-purpose operation
  • Designed for control-intensive tasks
  • Controls I/O of main memory and other devices through
    the OS
  • Controls all 8 SPEs

11
Synergistic Processor Element (SPE)
  • Designed to provide computation performance
  • SPU (Synergistic Processor Unit): performs the allocated task
  • LS (Local Store): the only memory
  • MFC (Memory Flow Controller): controls data transfer
  • 8 SPEs in total in the Cell
  • Only 6 accessible
  • 1 reserved for system software, 1 disabled

12
Element Interconnect Bus (EIB)
  • Internal communication bus inside Cell
  • Connects the different elements: PPE, SPEs, memory
    controller

13
Principles of Parallel Programming
Parallel algorithm                   | Serial algorithm
multiple processing units            | single processing unit
communication overhead               | no communication overhead
higher complexity in code            | straightforward code
must ensure load balance between PUs | everything is done by the CPU
14
Concept of Load Balance
  • Distribute data evenly
  • Total runtime depends on the busiest processing element
  • Computation time is wasted on idle processing elements
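
The load-balance point can be made concrete with a small sketch. It assumes every work item costs the same time on every processing element; the workloads below are made up for illustration.

```python
def total_runtime(work_per_pe, seconds_per_item=1.0):
    """All PEs run concurrently, so the busiest one sets the runtime."""
    return max(work_per_pe) * seconds_per_item


def idle_time(work_per_pe, seconds_per_item=1.0):
    """Total time the other PEs spend waiting for the busiest one."""
    finish = total_runtime(work_per_pe, seconds_per_item)
    return sum(finish - items * seconds_per_item for items in work_per_pe)


# Hypothetical distributions of 600 work items over 6 processing elements.
balanced = [100, 100, 100, 100, 100, 100]
skewed = [350, 50, 50, 50, 50, 50]

print(total_runtime(balanced), idle_time(balanced))
print(total_runtime(skewed), idle_time(skewed))
```

Both distributions do the same total work, but the skewed one takes 3.5 times longer because five elements sit idle while the sixth finishes.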

15
Methods of Parallelism
  • Data parallelism
  • Task parallelism

16
Parallel Architecture
Flynn's taxonomy
              | Single Instruction | Multiple Instruction
Single Data   | SISD               | MISD
Multiple Data | SIMD               | MIMD
17
SISD
  • Traditional Computer
  • von Neumann model

18
SIMD
  • Same instruction on all data
  • Data parallelism
  • SIMD intrinsic function

19
MISD
  • No well-known system
  • Mentioned for completeness

20
MIMD
  • Different instructions on different data
  • Task parallelism
  • Further broken down into
  • Shared Memory System
  • Distributed Memory System

21
Shared Memory System
  • Access to central memory for data
  • PS3: achieved by the MFC issuing DMA commands

22
Distributed Memory System
  • Each PE has its own memory
  • PS3: each SPE has a 256KB Local Store
  • PS3 is a hybrid shared-distributed memory system

23
ADVISER
  • Comparing 2 video clips
  • Generating meaningful data (in form of numbers)
    of frames from the video
  • Comparing and looking for the most similar frames
  • Locating the similar segment, which consists of a
    series of very similar frames

24
Input
  • 2 folders: Repository and Target
  • hl3 file: a vector of 1024 double-precision values

Input                | No. of hl3 files
Target directory     | 5473
Repository directory | 7547
25
Processing
  • hl3 file: a vector of 1024 double-precision values
  • File P
  • File Q
  • Similarity: the smaller the value, the better the match
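
The similarity formula itself did not survive in this transcript. The loop pseudocode later in the deck accumulates squared element-wise differences, so the sketch below assumes a sum-of-squared-differences measure; the function name is hypothetical and the exact formula on the slide may differ (for example, by a square root).

```python
def difference(p, q):
    """Sum of squared element-wise differences between two vectors.

    Assumed measure: 0.0 means identical vectors, and smaller
    values mean more similar frames.
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Tiny made-up vectors; a real hl3 vector holds 1024 values.
print(difference([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(difference([1.0, 2.0], [2.0, 4.0]))
```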

26
Output
  • M Target files, N Repository files
  • Complexity: O(M × N)
  • Computation time: 633 sec
  • Flash demo

target hl3 1: best-matching repository file A, difference value ??
target hl3 2: best-matching repository file B, difference value ??
target hl3 3: best-matching repository file C, difference value ??
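
The O(M × N) cost comes from comparing every Target vector with every Repository vector. A minimal sketch of that exhaustive search follows; the squared-difference measure and all names are assumptions made for illustration.

```python
def best_matches(targets, repository):
    """For each target vector, find the repository vector with the
    smallest difference value. Cost: len(targets) * len(repository)
    vector comparisons."""
    results = []
    for t_index, target in enumerate(targets):
        best_index, best_diff = None, float("inf")
        for r_index, candidate in enumerate(repository):
            # Assumed measure: sum of squared differences.
            diff = sum((a - b) ** 2 for a, b in zip(target, candidate))
            if diff < best_diff:
                best_index, best_diff = r_index, diff
        results.append((t_index, best_index, best_diff))
    return results


# Made-up 2-element vectors standing in for 1024-element hl3 vectors.
targets = [[0.0, 0.0], [5.0, 5.0]]
repository = [[4.0, 5.0], [0.0, 1.0], [9.0, 9.0]]
matches = best_matches(targets, repository)
print(matches)
```

With 5473 Target and 7547 Repository files this amounts to roughly 41 million vector comparisons, which is why the inner loop is the focus of the later optimizations.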
27
Parallel Version
  • Data parallelism
  • Split the data evenly among 6 SPEs
  • Computation time for 6 SPEs: 330 sec
  • Flash demo
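
One way to split the work evenly is to give each SPE a contiguous index range whose size differs by at most one item. The slides do not say which file set was partitioned; the sketch assumes the 5473 Target files are divided, and the function name is hypothetical.

```python
def split_evenly(n_items, n_workers):
    """Return (start, end) index ranges, end exclusive, whose sizes
    differ by at most one item."""
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional item.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


ranges = split_evenly(5473, 6)
for spe, (start, end) in enumerate(ranges):
    print(f"SPE {spe}: files {start}..{end - 1} ({end - start} files)")
```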

28
Parallel Version
  • Expected speed-up: 6X
  • Actual speed-up: 2X
  • The PC CPU, the PPU, and the SPE all run at different speeds
  • Computation time with CPU: 633 sec
  • Computation time with 1 SPE: 1928 sec
  • Computation time with PPU: 3119 sec
  • CPU > SPE > PPU
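
The gap between the expected and actual speed-up depends on the baseline. The arithmetic below uses only the timings quoted on these slides (330 sec for 6 SPEs, 633 sec for the PC CPU, 1928 sec for 1 SPE).

```python
def speedup(baseline_sec, optimized_sec):
    """How many times faster the optimized run is than the baseline."""
    return baseline_sec / optimized_sec


six_spe = 330      # sec, 6 SPEs
pc_cpu = 633       # sec, PC CPU
one_spe = 1928     # sec, 1 SPE

vs_pc = speedup(pc_cpu, six_spe)
vs_one_spe = speedup(one_spe, six_spe)
print(f"vs PC CPU: {vs_pc:.2f}X")
print(f"vs 1 SPE:  {vs_one_spe:.2f}X")
```

Measured against a single SPE the six SPEs scale almost linearly; the 2X figure arises because one SPE is about three times slower than the PC CPU on this unoptimized code.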

29
Time Attack
  • SIMD intrinsic function
  • Changing data type
  • Double Buffering
  • Parallel Read
  • Distributing Job to idling PPE
  • SIMD on loop counter
  • Loop unrolling

30
SIMD intrinsic function
  • Addition, subtraction, multiplication, etc.
  • Operates on 128-bit registers
  • Data type: double (64 bits)
  • Speed-up: 2X
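
The 2X figure follows from how many values fit into one 128-bit register. A short calculation of the lane count per data type, which is also the ideal SIMD speed-up if everything else stays equal:

```python
REGISTER_BITS = 128


def lanes(element_bits):
    """Number of elements of the given width packed in one register."""
    return REGISTER_BITS // element_bits


for name, bits in [("double", 64), ("float", 32), ("int", 32)]:
    print(f"{name}: {lanes(bits)} values per register")
```

Two doubles per register gives the 2X speed-up above; a 32-bit type packs four values, which is the motivation for changing the data type.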

31
Changing Data Type to int
  • Precision is not important
  • Major speed-up from SIMD intrinsics
  • Data type: int (32 bits)
  • Total speed-up: 4X
  • Computation time: 71 sec

32
Changing Data Type to float
  • SPE is specialized for high-precision computation
  • No intrinsics for the int data type at all
  • Data type: float (32 bits)
  • Saves data conversion time
  • Speed-up of 30%
  • Computation time: 49 sec

33
Double buffering
  • Saves communication time between MFC and SPU
  • 2 buffers: one for prefetching, one for processing
  • The application is not communication-heavy
  • Minor speed-up
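
The double-buffering idea can be sketched without Cell hardware: while one buffer is being processed, the next one is already being fetched. In the sketch below a background thread stands in for the MFC's DMA transfer; `fetch`, `process`, and the data are hypothetical and only illustrate the overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(chunks, index):
    """Stand-in for a DMA transfer of one chunk into a local buffer."""
    return list(chunks[index])


def process(buffer):
    """Stand-in for the SPU computation on one buffer."""
    return sum(x * x for x in buffer)


def double_buffered_sum(chunks):
    if not chunks:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, chunks, 0)      # prefetch first buffer
        for index in range(len(chunks)):
            current = pending.result()               # wait for the transfer
            if index + 1 < len(chunks):
                # Start fetching the next buffer before processing this one.
                pending = pool.submit(fetch, chunks, index + 1)
            total += process(current)
    return total


print(double_buffered_sum([[1, 2], [3], [4, 5]]))
```

The gain is limited to the transfer time that can be hidden behind computation, which is consistent with the minor speed-up reported for a compute-bound program.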

34
Parallel Reading for All Files
  • Read Target and Repository concurrently
  • Share file reading job among SPEs
  • No improvement as predicted; even slower
  • Reason: the hard disk cannot handle concurrent
    requests
  • Failed attempt

35
Distributing Job to Idling PPE
  • PPE's current job: read files, distribute files,
    collect results
  • Use stall time to do some computation
  • Relatively low computation power of the PPE
  • No significant improvement
  • Increases program complexity
  • Approach abandoned

36
Applying SIMD for Loop Counter
  • Major computation power is consumed in the following loop:
  1. Initialize i = 0, diff = (0, 0, 0, 0).
  2. For i < (number of float numbers in a file) /
     (number of floats packed in a register):
     A. temp = SIMD subtraction on vector i in Target and
        Repository file.
     B. diff = SIMD addition(SIMD multiplication(temp, temp), diff).
  3. i = i + 1.
  4. Loop back to 2.
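
The loop above can be emulated in plain Python by treating a 4-element list as one 128-bit register of floats. This is a sketch of the control flow only: the `simd_*` helpers are hypothetical stand-ins for SPU intrinsics, and the final reduction of the four lanes to one number is an assumption.

```python
LANES = 4  # floats packed in one 128-bit register


def simd_sub(a, b):
    return [x - y for x, y in zip(a, b)]


def simd_mul(a, b):
    return [x * y for x, y in zip(a, b)]


def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]


def squared_difference(target, repository):
    if len(target) != len(repository) or len(target) % LANES != 0:
        raise ValueError("inputs must have equal length, a multiple of 4")
    i = 0
    diff = [0.0] * LANES                                   # step 1
    while i < len(target) // LANES:                        # step 2
        t = target[i * LANES:(i + 1) * LANES]
        r = repository[i * LANES:(i + 1) * LANES]
        temp = simd_sub(t, r)                              # step 2A
        diff = simd_add(simd_mul(temp, temp), diff)        # step 2B
        i = i + 1                                          # step 3
    return sum(diff)  # assumed: lanes are summed at the end


target = [float(v) for v in range(1, 9)]
repository = [0.0] * 8
print(squared_difference(target, repository))
```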

37
Applying SIMD for Loop Counter
  • Try to optimize step 3
  • Apply SIMD to the loop counter
  • Addition and comparison operations on the loop counter
    are reduced by a factor of 8

38
Applying SIMD for Loop Counter
  1. Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
  2. For i[0] < (number of float numbers in a file) /
     (number of floats packed in a register):
     temp = SIMD subtraction on vector i[0] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
     temp = SIMD subtraction on vector i[1] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
     temp = SIMD subtraction on vector i[2] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
     temp = SIMD subtraction on vector i[3] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
     temp = SIMD subtraction on vector i[4] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
     temp = SIMD subtraction on vector i[5] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
     temp = SIMD subtraction on vector i[6] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
     temp = SIMD subtraction on vector i[7] in Target and Repository file.
     diff = SIMD addition(SIMD multiplication(temp, temp), diff).
  3. i = SIMD addition(i, (8, 8, 8, 8, 8, 8, 8, 8)).
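
To show why the counter work drops by a factor of 8, the sketch below compares a rolled loop with one whose body handles 8 vectors per iteration. It is a simplification: the slide keeps the 8 indices in a SIMD register, whereas this version advances a scalar counter by 8. All names are hypothetical, and lists of 4 floats stand in for registers.

```python
LANES = 4    # floats per emulated register
UNROLL = 8   # vectors handled per loop iteration


def accumulate(diff, target, repository, v):
    """diff += (target_v - repository_v)^2 for vector number v."""
    lo = v * LANES
    temp = [t - r for t, r in zip(target[lo:lo + LANES],
                                  repository[lo:lo + LANES])]
    return [d + x * x for d, x in zip(diff, temp)]


def rolled(target, repository):
    diff, i, updates = [0.0] * LANES, 0, 0
    while i < len(target) // LANES:
        diff = accumulate(diff, target, repository, i)
        i += 1
        updates += 1          # one counter update per vector
    return sum(diff), updates


def unrolled(target, repository):
    # Assumes the number of vectors is a multiple of UNROLL.
    diff, i, updates = [0.0] * LANES, 0, 0
    while i < len(target) // LANES:
        diff = accumulate(diff, target, repository, i)
        diff = accumulate(diff, target, repository, i + 1)
        diff = accumulate(diff, target, repository, i + 2)
        diff = accumulate(diff, target, repository, i + 3)
        diff = accumulate(diff, target, repository, i + 4)
        diff = accumulate(diff, target, repository, i + 5)
        diff = accumulate(diff, target, repository, i + 6)
        diff = accumulate(diff, target, repository, i + 7)
        i += UNROLL
        updates += 1          # one counter update per 8 vectors
    return sum(diff), updates


target = [float(k % 7) for k in range(1024)]
repository = [float(k % 3) for k in range(1024)]
print(rolled(target, repository))
print(unrolled(target, repository))
```

For a 1024-float vector the rolled loop updates and tests its counter 256 times, the unrolled one 32 times, while both produce the same difference value.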

39
Result of the parallel, with SIMD, float input,
SIMD for loop counter PS3 version
No. of SPU used          | 1   | 2   | 3  | 4  | 5  | 6
Read input time (sec)    | 4   | 5   | 3  | 4  | 4  | 4
Total Elapsed time (sec) | 286 | 146 | 97 | 75 | 60 | 51
Net Elapsed time (sec)   | 282 | 141 | 94 | 71 | 56 | 47
40
Result of the parallel, with SIMD, float input,
SIMD for loop counter PS3 version
41
Result of the parallel, with SIMD, float input,
SIMD for loop counter PS3 version
  • Little improvement (about 4%).
  • Shows the possibility of faster performance
    through further loop unrolling.
  • The best performance becomes 47 sec.

42
Loop Unrolling
  • Proved that optimizing the loop can improve
    performance
  • Complete loop unrolling
  • More obvious speed-up

43
Result of the parallel, with SIMD, float input,
loop unrolling PS3 version
No. of SPU used          | 1   | 2  | 3  | 4  | 5  | 6
Read input time (sec)    | 3   | 4  | 3  | 3  | 4  | 3
Total Elapsed time (sec) | 159 | 82 | 55 | 42 | 35 | 30
Net Elapsed time (sec)   | 156 | 78 | 52 | 39 | 31 | 27
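
As a reading aid for the table, the snippet below derives the speed-up and parallel efficiency from the Net Elapsed times, taking the 1-SPU run as the baseline. Only the timings come from the slide; the efficiency definition (speed-up divided by number of SPUs) is the usual one and is added here for illustration.

```python
# Net Elapsed time (sec) per number of SPUs, from the table.
net_elapsed = {1: 156, 2: 78, 3: 52, 4: 39, 5: 31, 6: 27}


def scaling(times):
    """Map SPU count -> (speed-up vs 1 SPU, efficiency per SPU)."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


result = scaling(net_elapsed)
for n, (speedup, efficiency) in result.items():
    print(f"{n} SPU: {speedup:.2f}X, efficiency {efficiency:.0%}")
```

The net computation scales almost linearly up to 6 SPUs, so the roughly constant file-reading time becomes a growing share of the total elapsed time.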
44
Result of the parallel, with SIMD, float input,
loop unrolling PS3 version
45
Result of the parallel, with SIMD, float input,
loop unrolling PS3 version
  • 45% faster
  • Ultimate best performance becomes 27 sec

46
Conclusion of Optimization
  • PC version: 633 sec
  • PS3 with 1 SPU (i.e. sequential version on
    PS3): 1928 sec
  • Final optimized version on PS3: 27 sec
  • 23 times faster than the PC version
  • 71 times faster than the sequential version on PS3

47
Conclusion of Optimization
48
Future Works
  • Port the whole ADVISER application to
    PlayStation3
  • Optimization throughout the whole application

49
Q&A
50
The End