CSCI-4320/6360: Parallel Programming - PowerPoint PPT Presentation

Provided by: DaveHol

Transcript and Presenter's Notes

Title: CSCI-4320/6360: Parallel Programming


1
CSCI-4320/6360 Parallel Programming & Computing
AE 215, Tues./Fri. 12-1:20 p.m.
Introduction, Syllabus & Prelims
  • Prof. Chris Carothers
  • Computer Science Department
  • Lally 306
  • Office Hrs: Tuesdays, 1:30-3:30 p.m.
  • chrisc@cs.rpi.edu
  • www.cs.rpi.edu/chrisc/COURSES/PARALLEL/SPRING-2009

2
Course Prereqs
  • Some programming experience in Fortran, C, C++
  • Java is great but not for HPC
  • You'll have a choice to do your assignments in C,
    C++ or Fortran, subject to the language support of
    the programming paradigm..
  • Assume you've never touched a parallel or
    distributed computer..
  • If you have MPI experience, great.. it will help
    you, but it is not necessary
  • If you love to write software
  • Both practice and theory are presented, but there
    is a strong focus on getting your programs to
    work

3
Course Optional Textbook
  • Introduction to Parallel Computing, by Grama,
    Gupta, Karypis and Kumar
  • Make sure you have the 2nd edition!
  • Available online through the Pearson/Addison Wesley
    publisher, or Amazon.com, etc.
  • Written in 2003, so somewhat out of date

4
Course Topics
  • Prelims & Motivation
  • Memory Hierarchy
  • CPU Organization
  • Parallel Architectures
  • Message Passing/SMP/Vector-SIMD
  • Communications Networks
  • Basic Communications Operations
  • MPI Programming
  • Principles of Parallel Algorithm Design
  • Thread Programming
  • Pthreads
  • OpenMP

5
Course Topics (cont.)
  • Parallel Debuggers
  • Parallel Operating Systems
  • Parallel Filesystems
  • Parallel Algorithms
  • Matrix Algorithms
  • Search
  • Graph
  • Other Programming Paradigms
  • MapReduce
  • Transactional Memory
  • Fault Tolerance
  • Applications (Guest Lectures)
  • Computational Fluid Dynamics
  • Mesh Adaptivity
  • Parallel Discrete-Event Simulation

6
Course Grading Criteria
  • You must read the lecture slides, and find papers
    on that topic
  • FOR EACH CLASS:
  • Find an academic paper written in 2003 or later
    that relates to the class topic.
  • You will write a 1 to 2 page paper that summarizes
    the class and the paper, and states any questions
    you might have.
  • If you're a grad student, you must find and
    review 2 papers!
  • What's it worth?
  • 1 grade point per class, up to 25 points total
  • There are 27/28 lectures, so you can pick 3 to
    miss
  • 4 programming assignments worth 10 pts each
  • MPI, Pthreads, OpenMP, Parallel I/O
  • Parallel Computing Research Project worth 35 pts
  • Yes, that's right - no mid-term or final exam
  • May sound good, but when it's 4 a.m. and your
    parallel program doesn't work and you've spent
    the past 30 hours debugging it, an exam doesn't
    sound so bad
  • For a course like this, you'll need to manage your
    time well.. do a little each day and don't get
    behind on the assignments or projects!

7
Where to find papers
  • Use Google Scholar or the ACM and IEEE Digital
    Libraries.
  • You can access ACM or IEEE publications online
    from any on-campus RPI computer system
  • Publications to look for are:
  • Any IEEE, ACM, SIAM parallel computing conf. or
    journal
  • A few to consider are:
  • Super Computing (SC), IPDPS, ICPP, IEEE TPDS,
    IEEE TOC, ACM TOPLAS, ACM TOCS, JPDC, Concurrency:
    Practice and Experience, Cluster Computing, SPAA,
    HPCA, HPDC, Parallel Computing, IBM Journal of R&D
  • Try to find the most recent paper relating to a
    particular topic
  • Don't re-use a paper, even if it potentially
    covers multiple topics..
  • Summary/Reviews are due the next lecture w/o
    exception.
  • Include a copy of your paper(s) with your
    summary/review, and a reference citation of the
    paper and where you found it.

8
To Make A Fast Parallel Computer You Need a
Faster Serial Computer.. well, sorta
  • Review of:
  • Instructions
  • Instruction processing..
  • Put it together.. why the heck do we care about or
    need a parallel computer?
  • i.e., they are really cool pieces of technology,
    but can they really do anything useful besides
    compute Pi to a few billion more digits?

9
Processor Instruction Sets
  • In general, a computer needs a few different
    kinds of instructions:
  • mathematical and logical operations
  • data movement (access memory)
  • jumping to new places in memory
  • if the right conditions hold.
  • I/O (sometimes treated as data movement)
  • All these instructions involve using registers to
    store data as close as possible to the CPU
  • E.g. $t0, $s0 in MIPS or eax, ebx in x86

10
a = (b+c) - (d+e)
$s0 = a, $s1 = b, $s2 = c, $s3 = d, $s4 = e
  • add $t0, $s1, $s2   # $t0 = b+c
  • add $t1, $s3, $s4   # $t1 = d+e
  • sub $s0, $t0, $t1   # a = $t0 - $t1

11
lw $destreg, const($addrreg)
Load Word:
  const - a number
  $addrreg - name of the register to get the base address from
  $destreg - name of the register to put the value in
Effective address = (contents of $addrreg) + const
12
Array Example: a = b + c[8]
$s0 = a, $s1 = b, $s2 = base address of c
  • lw $t0, 8($s2)     # $t0 = c[8]
  • add $s0, $s1, $t0  # $s0 = $s1 + $t0
  • (yeah, this is not quite right - offsets are in
    bytes, so c[8] really needs offset 32)

14
sw $srcreg, const($addrreg)
Store Word:
  const - a number
  $addrreg - name of the register to get the base address from
  $srcreg - name of the register to get the value from
Effective address = (contents of $addrreg) + const
15
Example: sw $s0, 4($s3)
  • If $s3 has the value 100, this will copy the word
    in register $s0 to memory location 104.
  • Memory[104] <- $s0
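The base-plus-offset addressing of lw/sw can be mirrored in C by treating memory as a byte array. This is an illustrative sketch only - the function names and the flat byte-array memory model are assumptions, not MIPS semantics:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative C analogy for MIPS lw/sw: the effective address is
 * (base register) + constant offset, and offsets count bytes. */
int32_t load_word(const uint8_t *mem, uint32_t base, uint32_t offset) {
    int32_t v;
    memcpy(&v, mem + base + offset, sizeof v);  /* lw: read 4 bytes at base+offset */
    return v;
}

void store_word(uint8_t *mem, uint32_t base, uint32_t offset, int32_t value) {
    memcpy(mem + base + offset, &value, sizeof value);  /* sw: write 4 bytes */
}
```

With base = 100 and offset = 4, store_word writes to "address" 104, matching the sw $s0, 4($s3) example above.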


19
Instruction formats
32 bits
This format is used for many MIPS instructions
that involve calculations on values already in
registers, e.g. add $t0, $s0, $s1
20
How are instructions processed?
  • In the simple case:
  • Fetch instruction from memory
  • Decode it (read the op code, and use registers
    based on what instruction the op code specifies)
  • Execute the instruction
  • Write back any results to register or memory
  • Complex case:
  • Pipeline: overlap instruction processing
  • Superscalar: multi-instruction issue per clock
    cycle..
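The simple fetch-decode-execute-writeback cycle can be sketched in C over a hypothetical toy instruction set - the opcodes, the Instr struct, and run() are illustrative inventions, not real MIPS encodings:

```c
#include <stdint.h>

/* Hypothetical toy ISA for illustration only (not real MIPS encodings). */
enum { OP_ADD, OP_SUB, OP_HALT };

typedef struct { uint8_t op, dst, src1, src2; } Instr;

/* The simple case: fetch, decode, execute, write back, repeat.
 * Returns the final value of register 0 when OP_HALT is reached. */
int32_t run(const Instr *imem, int32_t *regs) {
    int pc = 0;
    for (;;) {
        Instr in = imem[pc++];   /* fetch (and advance the PC) */
        switch (in.op) {         /* decode the op code */
        case OP_ADD:             /* execute, then write back to a register */
            regs[in.dst] = regs[in.src1] + regs[in.src2]; break;
        case OP_SUB:
            regs[in.dst] = regs[in.src1] - regs[in.src2]; break;
        case OP_HALT:
            return regs[0];
        }
    }
}
```

For example, the a = (b+c) - (d+e) program from the earlier slide maps onto two adds, a subtract, and a halt.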

21
Simple (relative term) CPU Multicycle Datapath &
Control
22
Simple (yeah right!) Instruction Processing FSM!
23
Pipeline Processing w/ Laundry
  • While the first load is drying, put the second
    load in the washing machine.
  • When the first load is being folded and the
    second load is in the dryer, put the third load
    in the washing machine.
  • NOTE: unrealistic scenario for CS students, as
    most only own 1 load of clothes

24
(No Transcript)
25
Pipelined DP w/ signals
26
Pipelined Instruction.. But wait, we've got
dependencies!
27
Pipeline w/ Forwarding Values
28
Where Forwarding Fails.. must stall
29
How Stalls Are Inserted
30
What about those crazy branches?
Problem: if the branch is taken, the PC goes to addr
72, but we don't know that until after 3 other
instructions are processed
31
Dynamic Branch Prediction
  • From the phrase "There is no such thing as a
    typical program", it follows that programs will
    branch in different ways, and so there is no
    one-size-fits-all branch algorithm.
  • Alt approach: keep a history (1 bit) on each
    branch instruction and see if it was last taken
    or not.
  • Implementation: a branch prediction buffer or
    branch history table.
  • Index based on the lower part of the branch
    address
  • A single bit indicates if the branch at that
    address was last taken or not (1 or 0)
  • But single-bit predictors tend to lack
    sufficient history

32
Solution: 2-bit Branch Predictor
Must be wrong twice before changing the
prediction. Learns whether the branch is more
biased towards taken or not taken.
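The 2-bit scheme can be sketched in C as a table of saturating counters indexed by the low bits of the branch address - the table size, function names, and modulo indexing here are illustrative choices:

```c
#include <stdint.h>

#define TABLE_SIZE 256  /* entries indexed by low bits of the branch address */

/* 2-bit saturating counters: 0-1 predict not-taken, 2-3 predict taken.
 * All counters start at 0 (strongly not-taken). */
static uint8_t table[TABLE_SIZE];

/* Predict: counter values 2 and 3 mean "taken". */
int predict(uint32_t branch_addr) {
    return table[branch_addr % TABLE_SIZE] >= 2;
}

/* Update: saturate at 0 and 3, so a single anomalous outcome cannot
 * flip a well-established prediction - it must be wrong twice. */
void update(uint32_t branch_addr, int taken) {
    uint8_t *c = &table[branch_addr % TABLE_SIZE];
    if (taken && *c < 3)
        (*c)++;
    if (!taken && *c > 0)
        (*c)--;
}
```

Starting from strongly not-taken, two taken outcomes are needed before predict() flips to taken, and once strongly taken, one not-taken outcome does not flip it back.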
33
Even more performance
  • Ultimately we want greater and greater
    Instruction Level Parallelism (ILP)
  • How?
  • Multiple instruction issue.
  • Results in CPIs less than one.
  • Here, instructions are grouped into "issue
    slots".
  • So, we usually talk about IPC (instructions per
    cycle)
  • Static: uses the compiler to assist with grouping
    instructions and hazard resolution. The compiler
    MUST remove ALL hazards.
  • Dynamic (i.e., superscalar): hardware creates the
    instruction schedule based on dynamically
    detected hazards

34
Example Static 2-issue Datapath
  • Additions:
  • 32 bits from instr. mem
  • Two read ports, 1 write port on the reg file
  • 1 more ALU (the top one handles address calc)

35
Ex. 2-Issue Code Schedule
  • Loop: lw $t0, 0($s1)        # $t0 = array element
  •       addu $t0, $t0, $s2    # add scalar in $s2
  •       sw $t0, 0($s1)        # store result
  •       addi $s1, $s1, -4     # dec pointer
  •       bne $s1, $zero, Loop  # branch if $s1 != 0

It takes 4 clock cycles for 5 instructions, or an
IPC of 1.25
36
More Performance: Loop Unrolling
  • Technique where multiple copies of the loop body
    are made.
  • Make more ILP available by removing dependencies.
  • How? Compiler introduces additional registers via
    register renaming.
  • This removes "name" or "anti" dependences,
  • where an instruction order is purely a
    consequence of the reuse of a register and not a
    real data dependence.
  • No data values flow between one pair and the next
    pair
  • Let's assume we unroll a block of 4 iterations
    of the loop..
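In C terms, the transformation looks roughly like this sketch of the slide's loop (adding a scalar to each array element); the function name, and the assumption that n is a multiple of 4, are illustrative:

```c
/* 4-way unrolled loop, mirroring the slide's "add scalar to each
 * element" loop. Assumes n is a multiple of 4. */
void add_scalar_unrolled(int *a, int n, int scalar) {
    for (int i = 0; i < n; i += 4) {
        /* Four distinct temporaries play the role of renamed registers:
         * reusing one temporary would create an anti-dependence between
         * iterations; separate ones leave four independent
         * load/add/store chains the scheduler can overlap. */
        int t0 = a[i]     + scalar;
        int t1 = a[i + 1] + scalar;
        int t2 = a[i + 2] + scalar;
        int t3 = a[i + 3] + scalar;
        a[i]     = t0;
        a[i + 1] = t1;
        a[i + 2] = t2;
        a[i + 3] = t3;
    }
}
```

No data values flow between t0..t3, so the four chains are independent, which is exactly the extra ILP the unrolled schedule exploits.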

37
Loop Unrolling Schedule
Now, it takes 8 clock cycles for 14 instructions,
or an IPC of 1.75!! This is a 40% performance boost!
38
Dynamic Scheduled Pipeline
39
Intel P4 Dynamic Pipeline: looks like a cluster
.. just much, much smaller
40
Summary of Pipeline Technology
We've exhausted this!! IPC just won't go much
higher. Why??
41
More Speed 'til it Hertz!
  • So, if no more ILP is available, why not increase
    the clock frequency?
  • E.g. why don't we have 100 GHz processors today?
  • ANSWER: POWER & HEAT!!
  • With current CMOS technology, power needs grow
    polynomially with a linear increase in clock
    speed.
  • Power leads to heat, which will ultimately turn
    your CPU into a heap of melted silicon!

42
(No Transcript)
43
CPU Power Consumption
Typically, 100 watts is the magic limit..
44
Where do we go from here? (actually, we've
arrived @ here!)
  • Current Industry Trend: Multi-core CPUs
  • Typically lower clock rate (i.e., < 3 GHz)
  • 2, 4 and now 8 cores in a single socket package
  • Because of smaller VLSI design processes (e.g. <
    45 nm), they can reduce power & heat..
  • Potential for large, lucrative contracts in
    turning old dusty sequential codes into
    multi-core capable ones
  • Salesman: "here's your new $200 CPU; oh, BTW,
    you'll need this million-dollar consulting
    contract to port your code to take advantage of
    those extra cores!"
  • Best business model since the mainframe!
  • More cores require greater and greater
    exploitation of available parallelism in an
    application, which gets harder and harder as you
    scale to more processors..
  • Due to cost, we'll force in-house development of
    a talent pool..
  • You could be that talent pool
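As a taste of the explicit parallelism that multi-core CPUs demand, here is a minimal Pthreads sketch that splits an array sum across two threads - the array contents, sizes, and names are illustrative:

```c
#include <pthread.h>

/* Minimal Pthreads sketch: each thread sums half of an array.
 * Sizes and data are illustrative. */
#define N 8
#define NTHREADS 2

static int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
static long partial[NTHREADS];  /* one result slot per thread */

static void *worker(void *arg) {
    long id = (long)arg;
    int chunk = N / NTHREADS;
    for (int i = (int)id * chunk; i < ((int)id + 1) * chunk; i++)
        partial[id] += data[i];  /* each thread writes only its own slot: no lock needed */
    return NULL;
}

long parallel_sum(void) {
    pthread_t tid[NTHREADS];
    long total = 0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);   /* wait, then combine partial sums */
        total += partial[t];
    }
    return total;
}
```

Even in this toy, the programmer - not the hardware - decides how to split the work and how the partial results combine, which is the skill the rest of the course develops.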

45
Examples Multicore CPUs
  • Brief listing of the recently released new 45 nm
    processors, based on Intel's site (Processor Model
    - Cache - Clock Speed - Front Side Bus)
  • Desktop Dual Core
  • E8500 - 6 MB L2 - 3.16 GHz - 1333 MHz
  • E8400 - 6 MB L2 - 3.00 GHz - 1333 MHz
  • E8300 - 6 MB L2 - 2.66 GHz - 1333 MHz
  • Laptop Dual Core
  • T9500 - 6 MB L2 - 2.60 GHz - 800 MHz
  • T9300 - 6 MB L2 - 2.50 GHz - 800 MHz
  • T8300 - 3 MB L2 - 2.40 GHz - 800 MHz
  • T8100 - 3 MB L2 - 2.10 GHz - 800 MHz
  • Desktop Quad Core
  • Q9550 - 12MB L2 - 2.83 GHz - 1333 MHz
  • Q9450 - 12MB L2 - 2.66 GHz - 1333 MHz
  • Q9300 - 6MB L2 - 2.50 GHz - 1333 MHz
  • Desktop Extreme Series
  • QX9650 - 12 MB L2 - 3 GHz - 1333 MHz
  • Note: Intel's new 45nm Penryn-based Core 2 Duo
    and Core 2 Extreme processors were released on
    January 6, 2008. The new processors launch within
    a 35W thermal envelope.

46
Nov. 2008: TOP 5 Supercomputers (www.top500.org)
  • DOE/LANL RoadRunner: IBM QS22/Opteron cluster
    w/ PowerXCell 8i 3.2 GHz, 129,600 procs, 1105
    TFlops
  • ORNL Jaguar: Cray XT5, 2.3 GHz Opterons, 150,152
    procs, 1059 TFlops
  • NASA Pleiades: SGI Altix ICE, Xeon 3.0/2.66
    GHz, 51,200 procs, 487 TFlops
  • DOE/LLNL: IBM Blue Gene/L, 700 MHz PPC 440,
    212,992 procs, 478 TFlops
  • ANL Intrepid: IBM Blue Gene/P, 850 MHz PPC
    450, 163,840 procs, 450 TFlops

RPI/CCNI is currently #34 with "Fen": an IBM Blue
Gene/L, 32,768 procs, 73 TFlops.
47
(No Transcript)
48
(No Transcript)
49
What are SCs used for??
  • Can you say "fever for the flavor"..
  • Yes, Pringles used an SC to model the airflow of
    chips as they entered The Can..
  • Improved overall yield of good chips in The
    Can, and fewer chips on the floor

50
Patient Specific Vascular Surgical Planning
  • Virtual flow facility for patient-specific
    surgical planning
  • High quality patient-specific flow simulations
    needed quickly
  • Image patient, create model, run adaptive flow
    simulation
  • Simulation on massively parallel computers
  • Cost: only $600 on a 32K Blue Gene/L vs. $50K for
    a repeat open heart surgery

51
Summary
  • Current uni-core speed has peaked
  • No more ILP to exploit
  • Can't make CPU cores any faster w/ current CMOS
    technology
  • Must go massively parallel in order to increase
    IPC (instructions per clock cycle).
  • The only way for a large application to go really
    fast is to use lots and lots of processors..
  • Today's systems have 10s of thousands of
    processors
  • By 2012, systems will emerge w/ > 200K to 1
    million processors w/ 10 PFlops of compute power!
    (e.g. Blue Waters @ UIUC)