1
Novel and Alternative Parallel Programming
Paradigms
  • Laxmikant Kale
  • CS433
  • Spring 2000

2
Parallel Programming models
  • We studied
  • MPI/message passing, shared memory,
    Charm/shared objects
  • Loop-parallel OpenMP
  • Other languages/paradigms
  • Loop parallelism on distributed memory machines:
    HPF
  • Linda, Cid, Chant
  • Several others
  • Acceptance barrier
  • I will assign reading assignments
  • papers on the above languages, available on the
    web.
  • Pointers on course web page soon.

3
High Performance Fortran
  • Loop parallelism (mostly explicit) on distributed
    memory machines
  • Arrays are the primary data structure (1 or
    multi-dimensional)
  • How to decide which data lives where?
  • Provide distribute and align primitives
  • distribute A(block, cyclic) (notation differs)
  • Align B with A: same distribution
  • Who does which part of the loop iteration?
  • Owner computes
  • A(I,J) = E: the owner of A(I,J) evaluates E
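The owner-computes rule can be sketched outside HPF syntax; below is a minimal C++ illustration, assuming a 1-D block distribution (blockOwner, ownerComputesLoop and the rest are hypothetical names used for illustration, not HPF constructs).

#include <iostream>
#include <vector>

// Hypothetical sketch of the owner-computes rule for a 1-D block
// distribution: processor p owns the contiguous block of indices
// [p*blockSize, (p+1)*blockSize). Each processor executes only the
// iterations whose left-hand-side element it owns.
int blockOwner(int i, int N, int P) {
    int blockSize = (N + P - 1) / P;               // ceiling(N / P)
    return i / blockSize;
}

void ownerComputesLoop(std::vector<double>& A, const std::vector<double>& B,
                       int myRank, int P) {
    int N = static_cast<int>(A.size());
    for (int i = 0; i < N; ++i) {
        if (blockOwner(i, N, P) != myRank) continue;   // not my element
        // The owner of A[i] evaluates the right-hand side; if B[i] lived on
        // another processor, the compiler/runtime would insert the fetch.
        A[i] = 2.0 * B[i];
    }
}

int main() {
    const int N = 8, P = 4;
    std::vector<double> A(N, 0.0), B(N, 1.0);
    for (int rank = 0; rank < P; ++rank)           // simulate the P owners in turn
        ownerComputesLoop(A, B, rank, P);
    for (double x : A) std::cout << x << ' ';
    std::cout << '\n';
    return 0;
}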

4
Linda
  • Shared tuple space
  • Specialization of shared memory
  • Operations
  • read, in, out, eval
  • Pattern matching: in(2, ?x) reads a matching tuple,
    binding x, and removes the tuple (see the sketch
    after this list)
  • Tuple analysis
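A minimal single-process C++ sketch of the tuple-space operations above; the TupleSpace class and its fixed (tag, value) tuple shape are illustrative stand-ins, and a real Linda space is shared across processes with rd/in blocking until a match appears.

#include <iostream>
#include <list>
#include <optional>
#include <string>

// Hypothetical sketch of Linda's tuple-space operations. A std::list
// stands in for the shared tuple space; tuples are fixed to (tag, value).
struct Tuple { std::string tag; int value; };

class TupleSpace {
    std::list<Tuple> space;
public:
    // out: deposit a tuple into the space.
    void out(const Tuple& t) { space.push_back(t); }

    // rd: find a tuple matching the tag, bind the formal, leave it in place.
    std::optional<int> rd(const std::string& tag) {
        for (const auto& t : space)
            if (t.tag == tag) return t.value;
        return std::nullopt;   // a real rd() would block until a match appears
    }

    // in: like rd, but the matching tuple is removed from the space.
    std::optional<int> in(const std::string& tag) {
        for (auto it = space.begin(); it != space.end(); ++it)
            if (it->tag == tag) { int v = it->value; space.erase(it); return v; }
        return std::nullopt;
    }
};

int main() {
    TupleSpace ts;
    ts.out({"x", 2});                                          // out("x", 2)
    if (auto v = ts.rd("x")) std::cout << "rd -> " << *v << "\n";  // tuple stays
    if (auto v = ts.in("x")) std::cout << "in -> " << *v << "\n";  // tuple removed
    if (!ts.in("x")) std::cout << "in -> no matching tuple left\n";
    return 0;
}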

5
Cid
  • Derived from Id, a data-flow language
  • Basic constructs
  • threads
  • create new threads
  • wait for data from other threads
  • User level vs. system level thread
  • What is a thread? stack, PC, ..
  • Preemptive vs non-preemptive

6
Cid
  • Multiple threads on each processor
  • Benefits: adaptive overlap
  • Need a scheduler: use the OS scheduler?
  • All threads on one PE share address space
  • Thread mapping
  • At creation time, one may ask the system to map
    it to a PE
  • No migration after a thread starts running
  • Global pointers
  • Threads on different processors can exchange data
    via these
  • (In addition to fork/join data exchange)

7
Cid
  • Global pointers
  • register any C structure as a global object (to
    get a globalID)
  • get operation gets a local copy of a given
    object
  • in read or write mode
  • asynchronous gets are also supported
  • get doesn't wait for data to arrive
  • HPF-style global arrays
  • Grainsize control
  • Especially for tree-structured computations
  • Create a thread if other processors are idle
    (for example); see the sketch below
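A minimal C++ sketch of this kind of grainsize control, using std::async and a size cutoff as a stand-in for the "other processors are idle" test; the cutoff and names are assumptions for illustration, not Cid's API.

#include <future>
#include <iostream>

// Hypothetical sketch of grainsize control in a tree-structured computation:
// fork a new thread only while the subproblem is large; below the cutoff,
// stay sequential. (A size cutoff stands in for "other processors are idle".)
const long long CUTOFF = 1 << 16;

long long treeSum(long long lo, long long hi) {        // sums lo .. hi-1
    if (hi - lo == 1) return lo;
    long long mid = lo + (hi - lo) / 2;
    if (hi - lo > CUTOFF) {
        // Coarse grain: worth creating a new thread for one half.
        auto left = std::async(std::launch::async, treeSum, lo, mid);
        long long right = treeSum(mid, hi);
        return left.get() + right;
    }
    // Fine grain: thread-creation overhead would dominate, so recurse locally.
    return treeSum(lo, mid) + treeSum(mid, hi);
}

int main() {
    long long n = 1 << 20;
    std::cout << treeSum(0, n) << '\n';                // expect n*(n-1)/2
    return 0;
}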

8
Chant
  • Threads that send messages to each other
  • Message passing can be MPI style
  • User level threads
  • Simple implementation in Charm is available

9
CRL
  • Cache coherence techniques with software-only
    support
  • release consistency
  • get(Read/Write, data), work on data,
    release(data)
  • get makes a local copy
  • data-exchange protocols underneath provide the
    (simplified) consistency
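A minimal C++ sketch of the get/work/release usage pattern above; the Region type and the get/release signatures are illustrative stand-ins, not CRL's actual interface.

#include <cstring>
#include <iostream>

// Hypothetical sketch of the get/work/release pattern. get() returns a
// local copy of the shared region; the underlying data-exchange protocol
// only has to make writes visible at release().
enum class Mode { Read, Write };

struct Region {
    double data[4] = {0, 0, 0, 0};                     // the "home" copy
};

void get(Region& r, Mode m, double* localCopy) {
    std::memcpy(localCopy, r.data, sizeof r.data);     // fetch a local copy
    (void)m;                                           // mode selects the protocol
}

void release(Region& r, Mode m, const double* localCopy) {
    if (m == Mode::Write)                              // writes flushed at release
        std::memcpy(r.data, localCopy, sizeof r.data);
}

int main() {
    Region shared;
    double local[4];

    get(shared, Mode::Write, local);                   // get(Write, data)
    for (double& x : local) x = 3.14;                  // work on the local copy
    release(shared, Mode::Write, local);               // release(data): now visible

    std::cout << shared.data[0] << "\n";
    return 0;
}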

10
Multi-paradigm interoperability
  • Which one of these paradigms is the best?
  • Depends on the application, algorithm or module
  • Doesn't matter anyway, as we must use MPI
    (OpenMP)
  • acceptance barrier
  • Idea
  • allow multiple modules to be written in different
    paradigms
  • Difficulty
  • Each paradigm has its own view of how to schedule
    processors
  • Comes down to scheduler
  • Solution: have a common scheduler

11
Converse
  • Common scheduler
  • Components for easily implementing new paradigms
  • User level threads
  • separates 3 functions of a thread package
  • message passing support
  • Futures (origin: Halstead's MultiLisp); see the
    sketch after this list
  • What is a future?
  • data, ready-or-not, caller blocks on access
  • Several other features
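A minimal C++ sketch of the future idea, using std::promise/std::future as stand-ins for the futures provided by Converse.

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// A future is (data, ready-or-not); touching the value blocks the caller
// until a producer fills it in.
int main() {
    std::promise<int> slot;
    std::future<int> fut = slot.get_future();

    std::thread producer([&slot] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        slot.set_value(42);                    // fill the future: it becomes "ready"
    });

    std::cout << "value = " << fut.get() << "\n";   // blocks until ready
    producer.join();
    return 0;
}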

12
Other models
13
(No Transcript)
14
Object based load balancing
  • Load balancing is a resource management problem
  • Two sources of imbalances
  • Intrinsic application-induced
  • External environment induced

15
Object based load balancing
  • Application induced imbalances
  • Abrupt, but infrequent, or
  • Slow, cumulative
  • Rarely: frequent, large changes
  • Principle of persistence
  • Extension of principle of locality
  • Behavior, including computational load and
    communication patterns, of objects tend to
    persist over time
  • We have implemented strategies that exploit this
    automatically!

16
(No Transcript)
17
Crack propagation example
Decomposition into 16 chunks (left) and 128
chunks, 8 for each PE (right). The middle area
contains cohesive elements. Both decompositions
obtained using Metis. Pictures: S. Breitenfeld
and P. Geubelle.
18
Cross-approach comparison
MPI-F90 original
Charm++ framework (all C++)
F90 + Charm library
19
Load balancer in action
20
Cluster handling intrusion
21
(No Transcript)
22
Applying to other languages
  • Need
  • MPI on Charm
  • Threaded MPI: multiple threads run on each PE
  • threads can be migrated!
  • Uses the load balancer framework
  • Non-threaded irecv/waitall library
  • More work, but more efficient
  • Currently, rocket simulation program components
  • rocflo and rocsolid are being ported via this
    approach

23
What next?
  • Timeshared parallel clusters
  • Web submission via appspector, and extension to
    faucets
  • New applications
  • CSE simulations
  • Operations Research
  • Biological problems
  • New applications??
  • More info: http://charm.cs.uiuc.edu,
  • http://www.ks.uiuc.edu

24
Using Global Loads
  • Idea
  • For even a moderately large number of processors,
    collecting a vector of loads (one per PE) is not
    much more expensive than collecting the total
    (per-message cost dominates)
  • How can we use this vector without creating
    serial bottleneck?
  • Each processor knows if it is overloaded compared
    with the average
  • Also knows which PEs are underloaded
  • But need an algorithm that allows each processor
    to decide whom to send work to without global
    coordination, beyond getting the vector
  • Insight: everyone has the same vector
  • Also, assumption: there are sufficiently
    fine-grained work pieces

25
Global vector scheme (contd.)
  • Global algorithm, if we were able to make the
    decision centrally:

receiver = nextUnderLoaded(0)
for (I = 0; I < P; I++)
  if (load[I] > average)
    assign excess work to receiver, advancing receiver
    to the next as needed

To make a distributed algorithm: run the same
algorithm on each processor! Except ignore any
reassignment that doesn't involve me.
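A minimal C++ sketch of that distributed version, assuming unit-sized fine-grained work pieces; the names (balance, nextUnderLoaded) are made up for illustration. Every PE runs the identical pass over the shared load vector and acts only on the transfers that name it.

#include <algorithm>
#include <iostream>
#include <vector>

// Sketch of the global-vector scheme: every PE holds the same load vector,
// runs the same deterministic pass over it, and acts only on the transfers
// that involve itself.
void balance(const std::vector<int>& load, int myRank) {
    int P = static_cast<int>(load.size());
    int total = 0;
    for (int l : load) total += l;
    int average = total / P;

    std::vector<int> current = load;           // identical copy on every PE
    auto nextUnderLoaded = [&](int from) {
        while (from < P && current[from] >= average) ++from;
        return from;
    };

    int receiver = nextUnderLoaded(0);
    for (int i = 0; i < P && receiver < P; ++i) {
        while (current[i] > average && receiver < P) {
            int moved = std::min(current[i] - average,
                                 average - current[receiver]);
            current[i] -= moved;
            current[receiver] += moved;
            // Every PE computes the same (i, receiver, moved) triples;
            // only the two endpoints actually ship or accept the work.
            if (i == myRank)
                std::cout << "PE " << myRank << ": send " << moved
                          << " units to PE " << receiver << "\n";
            if (receiver == myRank)
                std::cout << "PE " << myRank << ": expect " << moved
                          << " units from PE " << i << "\n";
            if (current[receiver] >= average)
                receiver = nextUnderLoaded(receiver + 1);
        }
    }
}

int main() {
    std::vector<int> load = {9, 1, 5, 1};      // same vector on every PE
    for (int pe = 0; pe < 4; ++pe) balance(load, pe);
    return 0;
}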
26
Tree structured computations
  • Examples
  • Divide-and-conquer
  • State-space search
  • Game-tree search
  • Bidirectional search
  • Branch-and-bound
  • Issues
  • Grainsize control
  • Dynamic Load Balancing
  • Prioritization

27
State Space Search
  • Definition
  • start state, operators, goal-state
    (implicit/explicit)
  • Either search for goal state or for a path
    leading to one
  • If we are looking for all solutions
  • same as divide and conquer, except no backward
    communication
  • Search for any solution
  • Use the same algorithm as above?
  • Problems: inconsistent and not monotonically
    increasing speedups

28
State Space Search
  • Using priorities
  • bitvector priorities
  • Let root have 0 prio
  • Prio of child: parent's prio + my rank (appended
    as bits); see the sketch below

(Diagram: a root with priority p and its children with
derived priorities p01, p02, p03.)
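A minimal C++ sketch of bitvector priorities, assuming a branching factor of 4 (two bits appended per level); the function names are illustrative. Smaller lexicographic value means higher priority, so the leftmost parts of the tree are explored first.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch: the root has the empty priority, and a child's
// priority is the parent's bit string with the child's rank appended.
std::string childPriority(const std::string& parent, int rank, int bitsPerLevel) {
    std::string p = parent;
    for (int b = bitsPerLevel - 1; b >= 0; --b)
        p.push_back(((rank >> b) & 1) ? '1' : '0');    // append rank bits
    return p;
}

int main() {
    const int B = 4, bits = 2;                 // branching factor 4 -> 2 bits
    std::string root = "";                     // root priority: empty bitvector
    std::vector<std::string> children;
    for (int r = 0; r < B; ++r)
        children.push_back(childPriority(root, r, bits));

    for (int r = 0; r < B; ++r)
        std::cout << "child " << r << " priority: " << children[r] << "\n";
    // A grandchild of child 2: its priority extends child 2's bit string.
    std::cout << "grandchild: " << childPriority(children[2], 1, bits) << "\n";
    return 0;
}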
29
Effect of Prioritization
  • Let us consider shared memory machines for
    simplicity
  • Search directed to left part of the tree
  • Memory usage: let B be the branching factor of the
    tree, D its depth
  • O(DBP) nodes in the queue at a time
  • With a stack: O(DPB)
  • Consistent and monotonic speedups

30
Need prioritized load balancing
  • On non-shared-memory machines?
  • Centralized solution
  • Memory bottleneck too!
  • Fully distributed solutions
  • Hierarchical solution
  • Token idea

31
Bidirectional Search
  • Goal state is explicitly known and operators can
    be inverted
  • Sequential
  • Parallel?

32
Game tree search
  • Tricky problem
  • alpha-beta, negamax

33
Scalability
  • The program should scale up to use a large number
    of processors.
  • But what does that mean?
  • An individual simulation isn't truly scalable
  • Better definition of scalability
  • If I double the number of processors, I should
    be able to retain parallel efficiency by
    increasing the problem size

34
Isoefficiency
  • Quantify scalability
  • How much increase in problem size is needed to
    retain the same efficiency on a larger machine?
  • Efficiency = Seq. Time / (P × Parallel Time)
  • Parallel time = computation + communication + idle
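A small worked illustration of these definitions in C++; the timings below are made up solely to show how growing the problem size can restore efficiency when P doubles.

#include <iostream>

// Efficiency = T_seq / (P * T_par), with T_par = computation +
// communication + idle time.
double efficiency(double tSeq, int P, double comp, double comm, double idle) {
    double tPar = comp + comm + idle;
    return tSeq / (P * tPar);
}

int main() {
    // Problem size N on P=8: 100s sequential; 12s + 0.5s + 0.5s parallel.
    std::cout << "P=8,  size N : "
              << efficiency(100.0, 8, 12.0, 0.5, 0.5) << "\n";   // ~0.96
    // Same N on P=16: computation halves, overhead does not.
    std::cout << "P=16, size N : "
              << efficiency(100.0, 16, 6.0, 0.5, 0.5) << "\n";   // ~0.89
    // Larger problem (2N) on P=16 restores the efficiency.
    std::cout << "P=16, size 2N: "
              << efficiency(200.0, 16, 12.0, 0.5, 0.5) << "\n";  // ~0.96
    return 0;
}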