Title: Novel and Alternative Parallel Programming Paradigms
1 Novel and Alternative Parallel Programming Paradigms
- Laxmikant Kale
- CS433
- Spring 2000
2 Parallel Programming models
- We studied
  - MPI/message passing, shared memory, Charm/shared objects, loop-parallel OpenMP
- Other languages/paradigms
  - Loop parallelism on distributed memory machines: HPF
  - Linda, Cid, Chant
  - Several others
- Acceptance barrier
- I will assign reading assignments
  - papers on the above languages, available on the web
  - Pointers on course web page soon
3 High Performance Fortran
- Loop parallelism (mostly explicit) on distributed memory machines
- Arrays are the primary data structure (1- or multi-dimensional)
- How do we decide which data lives where?
  - Provide distribute and align primitives
  - distribute A block, cyclic (notation differs)
  - Align B with A: same distribution
- Who does which part of the loop iteration?
  - Owner computes: the PE that owns A(I,J) executes the assignments to A(I,J)
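The distribute and owner-computes ideas above can be sketched in a few lines. This is an illustrative Python model, not HPF syntax: the function names and the ceiling-division block size are assumptions made for the example.

```python
# Hypothetical sketch of HPF-style data distribution: which PE owns
# element i of an N-element array, and which loop iterations a PE runs.

def block_owner(i, n, p):
    """BLOCK distribution: contiguous chunks of ceil(n/p) elements per PE."""
    chunk = (n + p - 1) // p          # ceiling division
    return i // chunk

def cyclic_owner(i, n, p):
    """CYCLIC distribution: elements dealt round-robin across PEs."""
    return i % p

def my_iterations(me, n, p, owner):
    """Owner-computes rule: PE `me` runs only the iterations whose
    left-hand-side element it owns."""
    return [i for i in range(n) if owner(i, n, p) == me]

print(block_owner(5, 16, 4))                 # N=16, P=4: blocks of 4 -> PE 1
print(cyclic_owner(5, 16, 4))                # 5 mod 4 -> PE 1
print(my_iterations(0, 8, 4, cyclic_owner))  # PE 0 gets iterations [0, 4]
```

Aligning B with A then just means using the same owner function for both arrays, so corresponding elements are co-located.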
4 Linda
- Shared tuple space
  - Specialization of shared memory
- Operations
  - read, in, out, eval
  - Pattern matching: in(2, ?x) finds a tuple matching (2, *), reads its value into x, and removes the tuple
- Tuple analysis
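The tuple-space operations can be sketched as follows. This is a minimal single-process model (method names `out`, `rd`, `in_` echo Linda's, with `None` standing in for a formal/wildcard); real Linda is concurrent, and `in` blocks until a matching tuple appears.

```python
# Toy sketch of a Linda tuple space (illustrative, not a real Linda runtime).

class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, *t):
        """Add a tuple to the space."""
        self.tuples.append(tuple(t))

    def _match(self, pattern, t):
        # None in the pattern is a formal (wildcard) that matches anything.
        return len(pattern) == len(t) and all(
            p is None or p == v for p, v in zip(pattern, t))

    def rd(self, *pattern):
        """Read a matching tuple, leaving it in the space."""
        for t in self.tuples:
            if self._match(pattern, t):
                return t
        return None

    def in_(self, *pattern):
        """Read a matching tuple and remove it from the space."""
        for t in self.tuples:
            if self._match(pattern, t):
                self.tuples.remove(t)
                return t
        return None

ts = TupleSpace()
ts.out(2, 99)
print(ts.in_(2, None))   # matches (2, 99), binds the formal, removes the tuple
print(ts.rd(2, None))    # the tuple is gone -> None
```

`eval` would additionally spawn a computation whose result becomes a tuple; that part is omitted here.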
5 Cid
- Derived from Id, a data-flow language
- Basic constructs
  - threads
    - create new threads
    - wait for data from other threads
- User-level vs. system-level threads
  - What is a thread? stack, PC, ...
  - Preemptive vs. non-preemptive
6 Cid
- Multiple threads on each processor
  - Benefits: adaptive overlap
  - Need a scheduler: use the OS scheduler?
- All threads on one PE share an address space
- Thread mapping
  - At creation time, one may ask the system to map it to a PE
  - No migration after a thread starts running
- Global pointers
  - Threads on different processors can exchange data via these
  - (In addition to fork/join data exchange)
7 Cid
- Global pointers
  - register any C structure as a global object (to get a global ID)
  - get operation fetches a local copy of a given object
    - in read or write mode
  - asynchronous gets are also supported
    - get doesn't wait for the data to arrive
- HPF-style global arrays
- Grainsize control
  - Especially for tree-structured computations
  - Create a thread if other processors are idle (for example)
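The grainsize-control idea, fork a thread only when it is worth it, can be sketched on a toy tree computation. This is illustrative Python, not Cid: a simple depth cutoff stands in for the real "are other processors idle?" check, and all names are made up for the example.

```python
# Sketch of Cid-style grainsize control on a tree-structured computation.
from concurrent.futures import ThreadPoolExecutor

POOL = ThreadPoolExecutor(max_workers=4)
CUTOFF = 3   # below this depth, fork; past it, run sequentially
             # (a stand-in for "fork only if processors are idle")

def fib(n, depth=0):
    if n < 2:
        return n
    if depth < CUTOFF:
        # Fork: one child becomes a new thread, the other runs inline.
        child = POOL.submit(fib, n - 1, depth + 1)
        right = fib(n - 2, depth + 1)
        return child.result() + right     # join: wait for the child thread
    # Coarse grain: plain sequential recursion, no thread-creation overhead.
    return fib(n - 1, depth + 1) + fib(n - 2, depth + 1)

print(fib(15))  # 610
```

The cutoff keeps grains coarse enough that thread-creation cost does not swamp the useful work, which is exactly the concern the slide raises for tree computations.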
8 Chant
- Threads that send messages to each other
  - Message passing can be MPI style
- User-level threads
- A simple implementation in Charm is available
9 CRL
- Cache coherence techniques with software-only support
- release consistency
  - get(Read/Write, data), work on data, release(data)
  - get makes a local copy
- data-exchange protocols underneath provide the (simplified) consistency
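The get/work/release cycle can be modeled in a few lines. This is a toy single-process sketch with invented names (`Region`, `Copy`), not CRL's actual API; it only illustrates the release-consistency contract that writes become visible at release time.

```python
# Toy model of software release consistency: get a local copy of a region,
# work on it, and publish writes back to the home node only on release.

class Region:
    def __init__(self, data):
        self.home = list(data)        # authoritative copy at the home node

    def get(self, mode):
        assert mode in ("read", "write")
        return Copy(self, list(self.home), mode)   # get makes a local copy

class Copy:
    def __init__(self, region, data, mode):
        self.region, self.data, self.mode = region, data, mode

    def release(self):
        if self.mode == "write":
            # Release consistency: the modification becomes visible
            # to others only at this point.
            self.region.home = list(self.data)
        self.data = None              # the local copy is no longer valid

r = Region([0, 0, 0])
c = r.get("write")
c.data[1] = 42                        # work on the local copy
print(r.home[1])                      # home still sees 0 before release
c.release()
print(r.home[1])                      # after release, home sees 42
```

In the real system, the "data-exchange protocols underneath" move these copies between nodes; here the home copy is just a Python list.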
10 Multi-paradigm interoperability
- Which one of these paradigms is the best?
  - Depends on the application, algorithm, or module
  - Doesn't matter anyway, as we must use MPI (OpenMP): acceptance barrier
- Idea
  - allow multiple modules to be written in different paradigms
- Difficulty
  - Each paradigm has its own view of how to schedule processors
  - It comes down to the scheduler
- Solution: have a common scheduler
11 Converse
- Common scheduler
- Components for easily implementing new paradigms
  - User-level threads
    - separates the 3 functions of a thread package
  - message-passing support
  - Futures (origin: Halstead's MultiLisp)
    - What is a future? data, ready-or-not; the caller blocks on access
- Several other features
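The future abstraction named above, a placeholder for data that may not be ready yet, where the caller blocks on access, has a direct analogue in Python's standard library, shown here as an illustration (the `slow_square` function and timings are made up):

```python
# A future in the Halstead/MultiLisp sense: submit returns immediately with
# a placeholder; touching the value before it is ready blocks the caller.
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    time.sleep(0.1)      # stand-in for a long computation
    return x * x

with ThreadPoolExecutor() as pool:
    fut = pool.submit(slow_square, 7)   # returns at once: data not ready yet
    value = fut.result()                # caller blocks here until it is ready

print(value)   # 49
```

The "ready-or-not" state is queryable without blocking via `fut.done()`, which is what lets a scheduler run other work instead of waiting.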
12 Other models
14 Object-based load balancing
- Load balancing is a resource management problem
- Two sources of imbalance
  - Intrinsic: application-induced
  - External: environment-induced
15 Object-based load balancing
- Application-induced imbalances
  - Abrupt, but infrequent, or
  - Slow, cumulative
  - rarely frequent, large changes
- Principle of persistence
  - Extension of the principle of locality
  - The behavior of objects, including computational load and communication patterns, tends to persist over time
- We have implemented strategies that exploit this automatically!
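One way such a strategy can exploit persistence: treat the loads measured in the last phase as a prediction for the next phase, and greedily rebalance on them. The sketch below is illustrative (the greedy heaviest-first heuristic and all names are assumptions for the example, not Charm's actual balancer):

```python
# Measurement-based rebalancing sketch: by the principle of persistence,
# last phase's measured object loads predict the next phase, so assign
# objects (heaviest first) to the currently least-loaded PE.
import heapq

def greedy_rebalance(object_loads, num_pes):
    heap = [(0.0, pe) for pe in range(num_pes)]   # (accumulated load, pe)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        pe_load, pe = heapq.heappop(heap)         # least-loaded PE so far
        assignment[obj] = pe
        heapq.heappush(heap, (pe_load + load, pe))
    return assignment

measured = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0, "F": 1.0}
print(greedy_rebalance(measured, 2))   # both PEs end up with load 8.0
```

Migratable objects are what make this possible: because work lives in objects rather than being pinned to processors, the runtime can act on the new assignment.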
17 Crack propagation example
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions were obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle
18 Cross-approach comparison
- MPI-F90 original
- Charm framework (all C)
- F90 Charm library
19 Load balancer in action
20 Cluster handling intrusion
22 Applying to other languages
- Need
- MPI on Charm
  - threaded MPI: multiple threads run on each PE
    - threads can be migrated!
    - Uses the load balancer framework
  - Non-threaded irecv/waitall library
    - More work, but more efficient
- Currently, rocket simulation program components (rocflo, rocsolid) are being ported via this approach
23 What next?
- Timeshared parallel clusters
  - Web submission via appspector, and extension to faucets
- New applications
  - CSE simulations
  - Operations Research
  - Biological problems
- New applications??
- More info: http://charm.cs.uiuc.edu, http://www.ks.uiuc.edu
24 Using Global Loads
- Idea
  - For even a moderately large number of processors, collecting a vector of loads (one entry per PE) is not much more expensive than collecting the total (per-message cost dominates)
- How can we use this vector without creating a serial bottleneck?
  - Each processor knows whether it is overloaded compared with the average
  - It also knows which PEs are underloaded
  - But we need an algorithm that allows each processor to decide whom to send work to, without global coordination beyond getting the vector
- Insight: everyone has the same vector
- Also, an assumption: there are sufficiently many fine-grained work pieces
25 Global vector scheme, contd.
- Global algorithm, if we were able to make the decision centrally:

receiver = nextUnderLoaded(0);
for (i = 0; i < P; i++)
  if (load[i] > average)
    assign excess work to receiver,
    advancing receiver to the next
    underloaded PE as needed

To make it a distributed algorithm: run the same algorithm on each processor, except ignore any reassignment that doesn't involve me.
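The scheme above can be made concrete as follows. This is a sketch under the slide's assumptions (everyone holds the same load vector, and work is divisible into fine grains); the function names are invented for the example.

```python
# Global-vector load balancing sketch: every PE holds the same load vector,
# runs the same deterministic pass over it, and keeps only the transfers
# that involve itself -- so no further coordination is needed.

def plan_transfers(load):
    """The 'central' plan, computed identically (and redundantly) on every PE."""
    p = len(load)
    avg = sum(load) / p
    load = list(load)                        # work on a local copy
    under = [i for i in range(p) if load[i] < avg]
    transfers = []                           # (src, dst, amount)
    u = 0                                    # index of current receiver
    for i in range(p):
        while load[i] > avg and u < len(under):
            dst = under[u]
            amount = min(load[i] - avg, avg - load[dst])
            transfers.append((i, dst, amount))
            load[i] -= amount
            load[dst] += amount
            if load[dst] >= avg:             # receiver is full: advance
                u += 1
    return transfers

def my_transfers(me, load):
    """Distributed version: each PE filters the common plan down to its part."""
    return [t for t in plan_transfers(load) if me in (t[0], t[1])]

load = [8.0, 2.0, 6.0, 4.0]                  # average = 5.0
print(plan_transfers(load))                  # [(0, 1, 3.0), (2, 3, 1.0)]
print(my_transfers(1, load))                 # PE 1 only receives from PE 0
```

Because the pass is deterministic and the input vector is identical everywhere, every PE derives the same plan, which is exactly the insight the slide states.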
26 Tree-structured computations
- Examples
  - Divide-and-conquer
  - State-space search
  - Game-tree search
  - Bidirectional search
  - Branch-and-bound
- Issues
  - Grainsize control
  - Dynamic load balancing
  - Prioritization
27 State Space Search
- Definition
  - start state, operators, goal state (implicit/explicit)
  - Either search for a goal state or for a path leading to one
- If we are looking for all solutions
  - same as divide-and-conquer, except no backward communication
- Search for any solution
  - Use the same algorithm as above?
  - Problems: inconsistent and not monotonically increasing speedups
28 State Space Search
- Using priorities
  - bitvector priorities
  - Let the root have priority 0
  - Priority of a child: the parent's priority with my rank appended
- (Diagram: a root with priority p, whose children get priorities formed by appending their ranks, e.g. p0, p1, p2, p3)
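The append-my-rank rule can be sketched directly. Here tuples of ranks stand in for bitvectors (an assumption for readability); the point is that lexicographic comparison of these priorities orders nodes the same way a left-to-right depth-first traversal would.

```python
# Bitvector-priority sketch: the root has the empty priority, and child k
# appends its rank k to the parent's priority. Lexicographically smaller
# priority = more urgent, steering the parallel search leftward.

def child_priority(parent_prio, rank):
    return parent_prio + (rank,)

root = ()
kids = [child_priority(root, k) for k in range(3)]     # (0,), (1,), (2,)
grandkid = child_priority(kids[1], 0)                  # (1, 0)

# A prioritized queue would expand the leftmost-deepest work first:
frontier = sorted([kids[2], grandkid, kids[0]])
print(frontier)    # [(0,), (1, 0), (2,)]
```

Note that a child's priority is always greater than its parent's but less than the parent's next sibling's, which is what makes the ordering consistent with sequential depth-first search.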
29 Effect of Prioritization
- Let us consider shared-memory machines, for simplicity
- Search is directed to the left part of the tree
- Memory usage: let B be the branching factor of the tree, D its depth
  - O(DBP) nodes in the queue at a time
  - With a stack: O(DPB)
- Consistent and monotonic speedups
30 Need prioritized load balancing
- On non-shared-memory machines?
- Centralized solution
  - Memory bottleneck too!
- Fully distributed solutions
- Hierarchical solution
  - Token idea
31 Bidirectional Search
- The goal state is explicitly known, and operators can be inverted
- Sequential
- Parallel?
32 Game tree search
- Tricky problem
- alpha-beta, negamax
33 Scalability
- The program should scale up to use a large number of processors
- But what does that mean?
  - An individual simulation isn't truly scalable
- Better definition of scalability
  - If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
34 Isoefficiency
- Quantifies scalability
  - How much increase in problem size is needed to retain the same efficiency on a larger machine?
- Efficiency = Seq. Time / (P × Parallel Time)
  - parallel time = computation + communication + idle
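A small worked example of the definition above. The overhead model (a per-PE cost added to the parallel time) and all the timing numbers are made up for illustration; only the efficiency formula itself comes from the slide.

```python
# Efficiency E = T_seq / (P * T_par), with a toy parallel-time model
# T_par = T_seq / P + overhead, where overhead grows with P.

def efficiency(t_seq, p, t_par):
    return t_seq / (p * t_par)

def t_par(t_seq, p, overhead_per_pe=0.1):     # assumed overhead model
    return t_seq / p + overhead_per_pe * p

e_small  = efficiency(100.0, 10, t_par(100.0, 10))    # ~0.909
e_big    = efficiency(100.0, 40, t_par(100.0, 40))    # ~0.385: same problem, 4x PEs
e_scaled = efficiency(1600.0, 40, t_par(1600.0, 40))  # ~0.909 again

print(round(e_small, 3), round(e_big, 3), round(e_scaled, 3))
```

Keeping the same problem on 4× the processors drops efficiency sharply, while growing the problem 16× restores it; the isoefficiency function captures exactly how much that required growth is for a given machine size.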