ME964 High Performance Computing for Engineering Applications - PowerPoint PPT Presentation

1 / 48
About This Presentation
Title:

ME964 High Performance Computing for Engineering Applications

Description:

I will be in Milwaukee on Th, Mik will be the lecturer - tracing the evolution ... On grid computing (the Condor project) On the N-body problem. Midterm exam on 11/20 ... – PowerPoint PPT presentation

Number of Views:47
Avg rating:3.0/5.0
Slides: 49
Provided by: sbel3
Learn more at: https://sbel.wisc.edu
Category:

less

Transcript and Presenter's Notes

Title: ME964 High Performance Computing for Engineering Applications


1
ME964High Performance Computing for Engineering
Applications
  • Parallel Programming Patterns
  • Discussion of Midterm Project
  • Oct. 21, 2008

2
Before we get started
  • Last Time
  • Parallel Programming patterns
  • The Finding Concurrency Design Space
  • The Algorithm Structure Design Space
  • Today
  • Parallel Programming patterns
  • Discuss the Supporting Structures Design Space
  • Discussion of the Midterm Project
  • Assigned today, due on 11/18 or 11/09 (based on
    level of difficulty)
  • Other issues
  • I will be in Milwaukee on Th, Mik will be the
    lecturer - tracing the evolution of the hardware
    support for the graphics pipeline

2
3
Key Parallel Programming Steps
  1. Find the concurrency in the problem
  2. Structure the algorithm so that concurrency can
    be exploited
  3. Implement the algorithm in a suitable programming
    environment
  4. Execute and tune the performance of the code on a
    parallel system

3
4
Whats Comes Next?
Focus on thisfor a while
4
5
Before We Dive In
  • You hover in this design space when you ponder
    whether an MPI or an SPMD or a event driven
    simulation environment is what it takes to
    optimally design your algorithm
  • We are not going to focus on this process too
    much since its clear that we are stuck with SPMD
  • All that well be doing is to compare SPMD with
    other parallel programming paradigms to
    understand when SPMD is good and when it is bad
  • Also, we will not cover the Implementation
    Mechanisms design space since we dont need to
    wrestle with how to spawn and join threads, how
    to do synchronization, etc.
  • This would be Chapter 6 of the book

5
6
Supporting Structures Program and Data Models
Program Models
Data Models
SPMD
Shared Data
Master/Worker
Shared Queue
Loop Parallelism
Distributed Array
Fork/Join
These are not necessarily mutually exclusive.
6
7
Program Structuring Patterns
  • SPMD (Single Program, Multiple Data)
  • All PEs (Processor Elements) execute the same
    program in parallel
  • Each PE has its own data
  • Each PE uses a unique ID to access its portion of
    data
  • Different PE can follow different paths through
    the same code
  • This is essentially the CUDA Grid model
  • Master/Worker
  • Loop Parallelism
  • Fork/Join

7
8
Program Structuring Patterns(the rest of the
pack)
  • Master/Worker
  • A Master thread sets up a pool of worker threads
    and a bag of tasks
  • Workers execute concurrently, removing tasks
    until done
  • Loop Parallelism
  • Loop iterations execute in parallel
  • FORTRAN do-all (truly parallel), do-across (with
    dependence)
  • Very common in OpenMP
  • Fork/Join
  • Most general way of creation of threads (the
    POSIX standard)
  • Can be regarded as a very low level approach in
    which you use the OS to manage parallelism

8
9
Review Algorithm Structure
Algorithm Design
Organize by Task
Organize by Data
Organize by Data Flow
Linear
Recursive
Linear
Recursive
Regular
Irregular
Task Parallelism
Divide and Conquer
Geometric Decomposition
Recursive Data
Pipeline
Event Driven
9
10
Algorithm Structures vs. Supporting Structures
Task Parallel. Divide/Conquer Geometric Decomp. Recursive Data Pipeline Event-based
SPMD ???? ??? ???? ?? ??? ??
Loop Parallel ???? ?? ???
Master/ Worker ???? ?? ? ? ? ?
Fork/ Join ?? ???? ?? ???? ????
Four smilies is the best (Source Mattson, et al.)
10
11
Supporting Structures vs. Program Models
OpenMP MPI CUDA
SPMD ??? ???? ????
Loop Parallel ???? ?
Master/ Slave ?? ???
Fork/Join ???
11
12
More on SPMD
  • Dominant coding style of scalable parallel
    computing
  • MPI code is mostly developed in SPMD style
  • Many OpenMP code is also in SPMD (next to loop
    parallelism)
  • Particularly suitable for algorithms based on
    data parallelism, geometric decomposition, divide
    and conquer.
  • Main advantage
  • Tasks and their interactions visible in one piece
    of source code, no need to correlated multiple
    sources

SPMD is by far the most commonly used pattern for
structuring parallel programs.
12
13
Typical SPMD Program Phases
  • Initialize
  • Establish localized data structure and
    communication channels
  • Obtain a unique identifier
  • Each thread acquires a unique identifier,
    typically in the range from 0 to N-1, where N is
    the number of threads.
  • Both OpenMP and CUDA have built-in support for
    this.
  • Distribute Data
  • Decompose global data into chunks and localize
    them, or
  • Sharing/replicating major data structure using
    thread ID to associate subset of the data to
    threads
  • Run the core computation
  • More details in next slide
  • Finalize
  • Reconcile global data structure, prepare for the
    next major iteration

13
14
Core Computation Phase
  • Thread IDs are used to differentiate behavior of
    threads
  • Use thread ID in loop index calculations to split
    loop iterations among threads
  • Use conditions based on thread ID to branch to
    their specific actions

Both can have very different performance results
and code complexity depending on the way they are
done.
14
15
A Simple Example
  • Assume
  • The computation being parallelized has 1,000,000
    iterations.
  • Sequential code

Num_steps 1000000 for (i0 ilt num_steps,
i)
15
16
SPMD Code Version 1
  • Assign a chunk of iterations to each thread
  • The last thread also finishes up the remainder
    iterations

num_steps 1000000 i_start my_id
(num_steps/num_threads) i_end i_start
(num_steps/num_threads) if (my_id
(num_threads-1)) i_end num_steps for (i
i_start i lt i_end i) . Reconciliation of
results across threads if necessary.
16
17
Problems with Version 1
  • The last thread executes more iterations than
    others
  • The number of extra iterations is up to the total
    number of threads 1
  • This is not a big problem when the number of
    threads is small
  • When there are thousands of threads, this can
    create serious load imbalance problems.
  • Also, the extra if statement is a typical source
    of branch divergence in CUDA programs.

17
18
SPMD Code Version 2
  • Assign one more iterations to some of the threads

int rem num_steps num_threads i_start
my_id (num_steps/num_threads) i_end i_start
(num_steps/num_threads) if (rem ! 0) if
(my_id lt rem) i_start my_id i_end
(my_id 1) else i_start rem i_end
rem .
Less load imbalance More branch divergence.
18
19
SPMD Code Version 3
  • Use cyclic distribution of iteration

num_steps 1000000 for (i my_id i lt
num_steps i num_threads) .
Less load imbalance Loop branch divergence in
the last Warp Data padding further eliminates
divergence.
19
20
Comparing the Three Versions
Version 1
ID0
ID1
ID3
ID2
Version 2
ID0
ID1
ID3
ID2
Version 3
ID0
ID1
ID3
ID2
Padded version1 1 may be best for some data
access patterns.
20
21
Data Styles
  • Shared Data
  • All threads share a major data structure
  • This is what CUDA supports
  • Shared Queue
  • All threads see a thread safe queue that
    maintains ordering of data communication
  • Very relevant in conjunction with the
    Master/Worker scenarios
  • Distributed Array
  • Decomposed and distributed among threads
  • Limited support in CUDA Shared Memory
  • Typically an issue in NUMA layouts

21
22
End Parallel Programming Patterns Begin
Overview Reminder of Semester
22
23
The Reminder of the Semester
  • There will be one more lecture on CUDA (in
    November)
  • There will be three lectures on MPI
  • There will be about four to five guest lectures
  • On massively parallel hardware design
  • On grid computing (the Condor project)
  • On the N-body problem
  • Midterm exam on 11/20
  • Midterm Project (assigned today, due on 11/17 or
    sooner)
  • Ill explain why sooner shortly
  • Final Project (due on last Friday, Dec. 19, at
    1159 PM)

23
24
Midterm Project
  • The topic Produce a Collision Detection engine
  • Deal with spheres or ellipsoids
  • There will be two levels of difficulty
  • The entry level
  • You implement a simple brute force approach
  • Deal only with spheres of various radii
  • Due Date November 9, 1159 PM
  • The advanced level
  • Apply a sophisticated design to produce a
    parallel algorithm that is very efficient
  • Deal with ellipsoids
  • Note that spheres are a particular case of
    ellipsoids
  • Due Date November 17, 1159 PM
  • Most likely make this engine a part of your Final
    Project
  • Independent of the level embraced, youll be
    presenting your work on November 18, during
    regular class hours
  • Youll have 10 minutes to make a PowerPoint
    presentation. The presentations are part of the
    project (to be submitted for grading)

24
25
Final Project
  • Work on something related to your research
  • I can assist you with selecting a topic
  • I will help you with the design
  • There is a default Final Project, in case you
    cant choose

26
The Final Project Further Comments
  • There is a connection between the Midterm and
    Final Projects
  • It is up to you to decide what you want to work
    on for the Final Project
  • However, if you dont know what to work on, Ill
    assign you a Granular Dynamics problem (this is
    the default Final Project)
  • An important part of this problem is the
    collision detection stage

Granular Dynamics Problem
26
27
The Final Project Concluding Remarks
  • To conclude, the default Final Project will look
    like this

Final Project Midterm Project Some Extra
Headache
Granular Material Dynamics
  • Keep in mind
  • If you plan to do the default final project,
    think ahead and implement a collision detection
    engine that is suitable to a scenario where the
    bodies slightly change their location each time
    you have to do the contact detection
  • This is because in granular flow the particles
    move in space
  • Maybe you can run a cheaper collision detection
    after an initial exhaustive one
  • Youll need to look at problems with millions of
    colliding bodies

27
28
End Overview Reminder of Semester Begin
Midterm Project
28
29
The Collision Problem(The Midterm Project)
  • Problem Statement
  • A large set of spheres or ellipsoids is given to
    you
  • Provide collision state information for each pair
    of bodies that are colliding
  • The bodies that you are dealing with are of
    spherical shape
  • Spheres are of arbitrary radius R, where Rlow lt R
    lt Rhigh
  • The bound values Rlow and Rhigh are given to you
  • Their location is given to you
  • NOTE The ellipsoid case is similar (skipped here
    the discussion)
  • You will have to validate your results against
    Bullet, which is an open source game
    development tool.

29
30
Proposed Workflow
The proposed workflow below will be captured in a
Visual Studio Solution (with three Projects)
that will be provided to you (zipped file). You
are responsible for the part in red font below.
  • Generate an initial distribution of the spheres
    (part of Project 1)
  • Load data (part of Project 2)
  • Run collision detection on the device (part of
    Project 2)
  • Save results in a file (part of Project 2)
  • Run collision detection using the Bullet library
    (part of Project 3)
  • The utility will tell you if you passed or
    failed (part of Project 3)

30
31
Project template
  • Solution contains three projects
  • First project is used to generate the input data
    (distribution of the spheres along with their
    radius)
  • Second project loads input data and checks for
    collisions
  • This is the project you modify
  • The main driver calls a function that invokes the
    collision detection kernel
  • Pretty much like youve had to do in your HW
  • Third project does the Bullet part followed up by
    the validation
  • Uses C function notation
  • Main file uses C classes and objects
  • Allows C and C code to work together

31
32
Spheres Possible Contact Cases
  • Convention
  • Color code for object A bluish
  • Color code for object B greenish
  • Sphere contacts sphere
  • One contact point
  • Normal - from center of B to center of A
  • Sphere interpenetrates sphere
  • Two contact points
  • Points on objects furthest in
  • Normal - from center of B to center of A

A
B
A
B
32
33
Possible Contact Cases (Contd.)
  • Sphere exactly on top of sphere
  • Two points returned
  • Points on surface that intersect the x-axis
  • Normal is on x axis
  • Sphere inside sphere
  • Two points returned
  • Points on surface that intersect the x-axis
  • Normal is on x axis
  • Spheres are not in contact
  • Nothing to report, ignore

B
A
A
B
B
A
33
34
Bullet Physics Library
  • Bullet is a cross platform open source physics
    library.
  • Can simulate many types of objects
  • Rigid bodies (concave, convex, triangular mesh,
    etc.)
  • Cloth, soft bodies
  • Supports joints and constraints
  • Geared towards game development.
  • Speed vs. accuracy comes on the side of speed
  • Bullets collision detection will be used to
    verify your results

34
35
Screenshot, Bullet
35
36
Relevant Variables
  • All variables retuned per contact by Bullet
  • btVector3 localPointA
  • btVector3 localPointB
  • btVector3 positionWorldOnB
  • btVector3 positionWorldOnA  
  • btVector3 normalWorldOnB
  • btScalar distance1
  • btVector3 lateralFrictionDir1
  • btVector3 lateralFrictionDir2
  • int partId0
  • int partId1
  • int index0
  • int index1
  • btScalar combinedFriction
  • btScalar combinedRestitution
  • void  userPersistentData
  • btScalar appliedImpulse
  • bool lateralFrictionInitialized
  • int lifeTime
  • Variables you need to return and calculate
  • btVector3 positionWorldOnA
  • btVector3 positionWorldOnB
  • btVector3 normalWorldOnB
  • int objectIdA (note this is the Id associated
    with A)
  • int objectIdB (note this is the Id associated
    with B)
  • Additional parameters for Final Project
  • localPointA
  • localPointB
  • btVector3 lateralFrictionDir1
  • btVector3 lateralFrictionDir2
  • distance1

36
37
Variable Description
  • positionWorldOnB (float3) Absolute Collision
    location on object A
  • positionWorldOnA (float3) Absolute Collision
    location on object B
  • normalWorldOnB (float3) Collision Normal Vector
    from B to A
  • objectIdA (int) Object A ID
  • objectIdB (int) Object B ID (Id starts from 0)
  • Other parameters that need to be implemented for
    Final Project (if you take this route)
  • localPointA (float3) Collision point w/ respect
    to center of object A
  • localPointB (float3) Collision point w/ respect
    to center of object B
  • distance1 (float) Distance between collision
    points
  • lateralFrictionDir1 (float3) Vector orthogonal
    to normal and Dir2
  • lateralFrictionDir2 (float3) Vector orthogonal
    to normal and Dir1

37
38
The Passed/Failed Issue
  • The comparison is going to be made based on
  • The contact location expressed in global
    reference frame
  • There will be an epsilon value that controls this
    test
  • The contact normal
  • Your normal and Bullets normal should be within
    10
  • For comparison purposes, your contacts should be
    ordered such that
  • I still owe you two things
  • How are body A and body B chosen?
  • What to do with this collision case

38
39
Common Sense Advice
  • Donald Knuth (1974) Premature optimization is
    the root of all evil
  • Get it to run first, add the bells and whistles
    later on
  • Embrace a Crawl Walk Run approach

39
40
Collision Detection Some Possible Approaches
  • Brute force simplest and slowest
  • Ok to implement, unless you want to make this
    your final project, in which case youll have to
    implement one of the other three methods below
  • It relies on an exhaustive search N(N-1)/2 tests
    (given N bodies)
  • Baraff sorting based (the method is sometimes
    called culling
  • Harada rely on voxel and texture, not
    recommended
  • Scott Le Grand spatial subdivision method

40
41
References
  • The methods of Scott Le Grand and Harada are
    explained in great detail in GPU Gems 3 (you can
    borrow it from me)
  • I also have the following references (found them
    on the web, contact me if you need them)

Naga, K. G., R. Stephane, C. L. Ming, and M.
Dinesh, 2003 CULLIDE interactive collision
detection between complex models in large
environments using graphics hardware. Proceedings
of the ACM SIGGRAPH/EUROGRAPHICS conference on
Graphics hardware, Eurographics Association.
Baraff, D., 1997 An Introduction to Physically
Based Modeling Rigid Body Simulation
IINonpenetration Constraints. SIGGRAPH Course
Notes.
41
42
Brute Force
  • Iterate over all combinations of object pairs and
    perform distance checks on them
  • Store pairs that are colliding into list and
    compute required information
  • Iterate over each pair and calculate additional
    contact data.

42
43
Baraff Sorting axis
p6
p5
p3
p4
p1
p2
b3
b6
e6
e1
b5
b2
e3
b4
e5
e4
e2
b1
  • Every time an object p begins on the axis add p
    to stack
  • Every time an object p ends on the axis remove p
    from stack
  • Each time an object is added to stack compare it
    to those still on stack
  • p1 check with p3 and p6
  • p5 check with p3
  • An object is colliding with another if they
    overlap on all three axes.
  • Multi GPU relevant yes, it can speed up the
    problem by simultaneously culling based on more
    axes

43
44
Example Why this is tricky at times
(The issue of false positives)
44
45
Harada , GPU Gems 3
  • Mostly meant to deal with spheres, and of the
    same radii
  • Voxel collision method
  • Voxel 3d pixel
  • Spheres are used to represent other objects
  • Fill objects space using spheres
  • Use many 2D textures to store data.
  • Textures make up 3d grid
  • Optimize by using texture memory
  • Easier to coalesce reads and writes

45
46
Harada (Contd.)
  • Each voxel in texture contains up to 4 object
    ids
  • Stored in RGBA values
  • Objects and cells need to be sized so that this
    is true
  • Constant radius is preferred but some variation
    can exist
  • Too much variation could cause instances with gt4
    objects in one cells space
  • Test each cell with its 26 neighbors
  • Iterate over all cells in this way

46
47
Scott Le Grand, GPU Gems 3
  • Spatial subdivision method
  • Divide space into three dimensional grid
  • Collision state of spheres cannot be modified by
    different threads
  • Cells numbered by processing order (1, 2,)
  • First all 1 cells are processed, then 2 cells
    and so on
  • Insures that two threads do not overwrite
    collision data for one sphere
  • Spheres get assigned a home cell id and
    overlapping phantom cell ids
  • Array gets sorted by cell number and type

47
48
Scott Le Grand (Contd.)
  • From this list determine which cells have
    multiple objects inside
  • These cells will be checked for collisions
  • Task is simplified because cell id list is
    already sorted by cell number
  • By checking for a change in cell id the number of
    objects in that cell can be determined.
  • If cell only has phantom objects it will not be
    checked
  • Phantom objects - objects whose centroid is not
    in that cell but still intersects it
  • Cell id must have at least one home id and gt1
    objects to be checked
  • Multi GPU relevant yes, it can increase the size
    of the problem being addressed

48
Write a Comment
User Comments (0)
About PowerShow.com