Real-Time Load Balancing of Parallel Applications - PowerPoint PPT Presentation (Transcript)

1
Real-Time Load Balancing of Parallel Applications
  • ECE696b
  • Yeliang Zhang

2
Agenda
  • Introduction
  • Parallel paradigms
  • Performance analysis
  • Real time load balancing project
  • Other research work example
  • Future work

3
What is Parallel Computing?
  • Using more than one computer at the same time to solve a problem, or using a computer that has more than one processor working simultaneously (a parallel computer).
  • The same program can be run on different machines at the same time (SPMD)
  • Different programs can be run on different machines at the same time (MPMD)

4
Why is it interesting?
  • Uses computer capability efficiently
  • Solves problems that would take a single-CPU machine months or years
  • Provides redundancy for certain applications

5
Continued
  • Limits of single-CPU computing
  • Available memory
  • Performance
  • Parallel computing allows us to
  • Solve problems that don't fit in a single CPU's memory space
  • Solve problems that can't be solved in a reasonable time
  • We can run
  • Larger problems
  • Faster

6
One Application Example
  • Weather Modeling and Forecasting
  • Consider a region of 3000 x 3000 miles with a height of 11 miles.
  • For modeling, partition it into segments of 0.1 x 0.1 x 0.1 cubic miles, giving about 10^11 segments.
  • Take a 2-day period with parameters computed every 30 min. Assume each segment update takes 100 instructions, so a single update of the whole domain takes about 10^13 instructions, and the two days require roughly 10^15 instructions in total.
  • On a serial computer executing 10^10 instructions/sec this takes about 10^5 seconds, roughly 28 hours, to predict the next 48 hours!
  • Now take 1000 processors, each capable of 10^8 instructions/sec. Each processor handles 10^8 segments, so over the 2 days it executes about 10^12 instructions. The calculation is done in about 3 hours!
  • Currently all major weather forecast centers (US, Europe, Asia) have supercomputers with 1000s of processors.
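
A minimal C sketch that simply reproduces the slide's back-of-the-envelope arithmetic; all constants are the ones stated above, nothing here is measured:

    #include <stdio.h>

    int main(void) {
        /* Domain: 3000 x 3000 miles, 11 miles high, 0.1-mile cubic segments */
        double segments = (3000.0 / 0.1) * (3000.0 / 0.1) * (11.0 / 0.1);   /* ~1e11 */
        double updates  = 2.0 * 24.0 * 2.0;            /* 2 days, one update every 30 min */
        double total    = segments * 100.0 * updates;  /* 100 instructions per segment: ~1e15 */

        double serial_hours   = total / 1e10 / 3600.0;            /* one CPU at 1e10 instr/s  */
        double parallel_hours = (total / 1000.0) / 1e8 / 3600.0;  /* 1000 CPUs at 1e8 instr/s */

        printf("total instructions: %.1e\n", total);
        printf("serial:   %.0f hours\n", serial_hours);    /* ~26 h exactly, ~28 h with the rounded 1e15 */
        printf("parallel: %.0f hours\n", parallel_hours);  /* about 3 hours */
        return 0;
    }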

7
Some Other Application
  • Database inquiry
  • Simulation super star explosion
  • Fluid dynamic calculation
  • Cosmic microwave data analysis
  • Ocean modeling
  • Genetic research

8
Types of Parallelism: Two Extremes
  • Data parallel
  • Each processor performs the same task on
    different data
  • Example - grid problems
  • Task parallel
  • Each processor performs a different task
  • Example - signal processing
  • Most applications fall somewhere on the continuum
    between these two extremes

9
Basics: Data Parallelism
  • Data parallelism exploits the concurrency that derives from applying the same operation to multiple elements of a data structure
  • Ex: Add 2 to all elements of an array
  • Ex: Increase the salary of all employees with 5 years of service
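
For instance, the "add 2 to all elements" case is a one-directive data-parallel loop in C with OpenMP; the array size is an arbitrary illustrative choice:

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];      /* zero-initialized array */

        /* Same operation applied to every element; iterations are split across threads */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] += 2.0;

        printf("a[0] = %.1f, a[N-1] = %.1f\n", a[0], a[N - 1]);
        return 0;
    }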

10
Typical Task Parallel Application
  • N tasks, if not overlapped, can be run on N processors
  • [Figure: an application decomposed into Task 1, Task 2, ..., Task n]
11
Limits of Parallel Computing
  • Theoretical Upper Limits
  • Amdahl's Law
  • Practical Limits
  • Load balancing
  • Non-computational sections
  • Other Considerations
  • Sometimes the code needs to be rewritten

12
Amdahl's Law
  • Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
  • Effect of multiple processors on run time: tn = (fs + fp / N) t1
  • Effect of multiple processors on speedup: S = t1 / tn = 1 / (fs + fp / N), evaluated in the sketch below
  • Where
  • fs = serial fraction of code
  • fp = parallel fraction of code (fs + fp = 1)
  • N = number of processors
  • tn = time to run on N processors
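
A minimal sketch in C evaluating these formulas; the 90% parallel fraction is an arbitrary illustrative value, not a number from the slides:

    #include <stdio.h>

    /* Amdahl's Law: speedup S = 1 / (fs + fp / N), with fs + fp = 1 */
    static double amdahl_speedup(double fp, int n) {
        return 1.0 / ((1.0 - fp) + fp / n);
    }

    int main(void) {
        double fp = 0.90;                         /* assumed parallel fraction */
        int n[] = {1, 2, 4, 8, 16, 64, 1024};
        for (int i = 0; i < 7; i++)
            printf("N = %4d  speedup = %5.2f\n", n[i], amdahl_speedup(fp, n[i]));
        return 0;
    }

Even with 90% of the code parallel, the speedup never exceeds 1/fs = 10, which is the strict limit referred to above.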

13
Practical Limits: Amdahl's Law vs. Reality
  • Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.

14
Practical Limits: Amdahl's Law vs. Reality
  • In reality, Amdahl's Law is limited by many things:
  • Communications
  • I/O
  • Load balancing
  • Scheduling (shared processors or memory)

15
Other Considerations
  • Writing effective parallel applications is difficult!
  • Load balance is important
  • Communication can limit parallel efficiency
  • Serial time can dominate
  • Is it worth your time to rewrite your
    application?
  • Do the CPU requirements justify parallelization?
  • Will the code be used just once?

16
Sources of Parallel Overhead
  • Interprocessor communication: Time to transfer data between processors is usually the most significant source of parallel processing overhead.
  • Load imbalance: In some parallel applications it is impossible to equally distribute the subtask workload to each processor. So at some point all but one processor might be done and waiting for one processor to complete.
  • Extra computation: Sometimes the best sequential algorithm is not easily parallelizable, and one is forced to use a parallel algorithm based on a poorer but easily parallelizable sequential algorithm. Sometimes repetitive work is done on each of the N processors instead of send/recv, which leads to extra computation.

17
Parallel Program Performance Touchstone
  • Execution time is the principal measure of performance

18
Programming Parallel Computers
  • Programming single-processor systems is
    (relatively) easy due to
  • single thread of execution
  • single address space
  • Programming shared memory systems can benefit
    from the single address space
  • Programming distributed memory systems is the most difficult, due to multiple address spaces and the need to access remote data
  • Both parallel systems (shared memory and
    distributed memory) offer ability to perform
    independent operations on different data (MIMD)
    and implement task parallelism
  • Both can be programmed in a data parallel, SPMD
    fashion

19
Single Program, Multiple Data (SPMD)
  • SPMD is the dominant programming model for shared and distributed memory machines.
  • One source code is written
  • Code can have conditional execution based on which processor is executing the copy (illustrated below)
  • All copies of the code are started simultaneously and communicate and synchronize with each other periodically
  • MPMD is more general, and possible in hardware, but no system/programming software enables it
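
A minimal SPMD sketch in C with MPI: one source file, all copies start together, and conditional execution on the process rank gives copies different roles (the printed "roles" are purely illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            printf("rank 0 of %d: coordinating\n", size);   /* conditional on rank */
        else
            printf("rank %d of %d: computing my share\n", rank, size);

        MPI_Barrier(MPI_COMM_WORLD);   /* periodic synchronization point */
        MPI_Finalize();
        return 0;
    }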

20
Shared Memory vs. Distributed Memory
  • Tools can be developed to make any system appear to be a different kind of system
  • Distributed memory systems can be programmed as if they have shared memory, and vice versa
  • Such tools do not produce the most efficient code, but might enable portability
  • HOWEVER, the most natural way to program any machine is to use tools and languages that express the algorithm explicitly for the architecture.

21
Shared Memory Programming: OpenMP
  • Shared memory systems have a single address
    space
  • applications can be developed in which loop
    iterations (with no dependencies) are executed by
    different processors
  • shared memory codes are mostly data parallel,
    SPMD kinds of codes
  • OpenMP is the new standard for shared memory
    programming (compiler directives)
  • Vendors offer native compiler directives

22
Accessing Shared Variables
  • If multiple processors want to write to a shared
    variable at the same time there may be conflicts
  • Processors 1 and 2 both (sketched below)
  • read X
  • compute X + 1
  • write X
  • Programmer, language, and/or architecture must
    provide ways of resolving conflicts
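
A minimal sketch in C with OpenMP of the read/compute/write conflict described above, and one language-level way to resolve it (an atomic update; a critical section or a reduction would also work):

    #include <stdio.h>

    int main(void) {
        int x = 0, y = 0;

        #pragma omp parallel for
        for (int i = 0; i < 100000; i++)
            x = x + 1;            /* unsynchronized read-modify-write: updates can be lost */

        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            #pragma omp atomic
            y = y + 1;            /* the read-modify-write is made indivisible */
        }

        printf("racy x = %d, atomic y = %d (expected 100000)\n", x, y);
        return 0;
    }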

23
OpenMP Example Parallel loop
  • !$OMP PARALLEL DO
  •   do i = 1, 128
  •     b(i) = a(i) + c(i)
  •   end do
  • !$OMP END PARALLEL DO
  • The first directive specifies that the loop
    immediately following should be executed in
    parallel. The second directive specifies the end
    of the parallel section (optional).
  • For codes that spend the majority of their time
    executing the content of simple loops, the
    PARALLEL DO directive can result in significant
    parallel performance.

24
MPI Basics
  • What is MPI?
  • A message-passing library specification
  • Extended message-passing model
  • Not a language or compiler specification
  • Not a specific implementation or product
  • Designed to permit the development of parallel
    software libraries
  • Designed to provide access to advanced parallel
    hardware for
  • End users
  • Library writers
  • Tool developers
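
A minimal message-passing sketch in C against the MPI specification, sending one integer from rank 0 to rank 1; the payload value is arbitrary, and the program should be run with at least two processes:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int payload = 42;                     /* illustrative data */
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload;
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", payload);
        }

        MPI_Finalize();
        return 0;
    }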

25
Features of MPI
  • General
  • Communications combine context and group for
    message security
  • Thread safety
  • Point-to-point communication
  • Structured buffers and derived datatypes, heterogeneity
  • Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to a fast protocol), buffered
  • Collective
  • Both built-in and user-defined collective
    operations.
  • Large number of data movement routines.
  • Subgroups defined directly or by topology
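
As a sketch of the built-in collective operations, here is a global sum with MPI_Reduce; each rank contributes its own rank number, and rank 0 (an arbitrary choice) receives the result:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = rank;           /* each process's contribution */
        int total = 0;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, total);

        MPI_Finalize();
        return 0;
    }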

26
Performance Analysis
  • Performance analysis process includes
  • Data collection
  • Data transformation
  • Data visualization

27
Data Collection Techniques
  • Profile
  • Record the amount of time spent in different
    parts of a program
  • Counters
  • Record either frequencies of events or cumulative
    times
  • Event Traces
  • Record each occurrence of various specified events
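
A minimal self-instrumentation sketch in C illustrating the profile and counter styles of collection, using MPI's wall-clock timer; the timed region is a placeholder loop:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        long   events = 0;            /* counter: cumulative frequency of an event */
        double t0 = MPI_Wtime();      /* profile: time spent in this program region */

        double sum = 0.0;
        for (long i = 0; i < 10000000L; i++) {   /* placeholder work */
            sum += (double)i;
            events++;
        }

        double elapsed = MPI_Wtime() - t0;
        printf("region: %.3f s, %ld events, sum = %.0f\n", elapsed, events, sum);

        MPI_Finalize();
        return 0;
    }

An event trace would additionally log a timestamped record for each occurrence rather than only the totals.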

28
Performance Analysis Tool
  • ParaGraph
  • A portable trace analysis and visualization package developed at Oak Ridge National Laboratory for MPI programs
  • Upshot
  • A trace analysis and visualization package developed at Argonne National Laboratory for MPI programs
  • SvPablo
  • Provides a variety of mechanisms for collecting,
    transforming, and visualizing data and is
    designed to be extensible, so that the programmer
    can incorporate new data formats, data collection
    mechanisms, data reduction modules and displays

29
Load Balance
  • Static load balance
  • The task and data distribution are determined at compile time
  • Not optimal, because application behavior is data dependent
  • Dynamic load balance
  • Work is assigned to nodes at runtime (a sketch follows below)
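
A minimal dynamic load balancing sketch in C with MPI in the master/worker style: work is assigned to nodes at runtime, and each worker asks for the next task as soon as it finishes, so faster nodes automatically receive more tasks. The task count and the placeholder "work" are illustrative; it assumes at least 2 processes and more tasks than workers:

    #include <stdio.h>
    #include <mpi.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                          /* master: hands out tasks on demand */
            int next = 0, result;
            MPI_Status st;
            for (int w = 1; w < size; w++) {      /* seed every worker with one task */
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            }
            for (int done = 0; done < NTASKS; done++) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &st);
                int tag = (next < NTASKS) ? TAG_WORK : TAG_STOP;   /* more work, or stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
                if (tag == TAG_WORK) next++;
            }
        } else {                                  /* worker: receive, compute, return */
            int task, result;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                result = task * task;             /* placeholder work */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }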

30
Load balance for heterogeneous tasks
  • Load balance for heterogeneous tasks is difficult
  • Different tasks have different costs
  • Data dependencies between tasks can be very
    complex
  • Consider data dependencies when doing load
    balancing

31
General load balance architecture (research at Carnegie Mellon Univ.)
  • Used for dynamic load balancing and applied to heterogeneous applications

32
General load balance architecture (continued)
  • Global load balancer
  • Includes a set of simple load balancing
    strategies for each of the task types
  • Manages the interaction between the different
    task types and their load balancers.

33
Tasks with different dependency types
34
Explanation of the general load balancer architecture
  • Task scheduler
  • Collects status information from the nodes and
    issues task migration instructions based on this
    information
  • Task scheduler supports three load balancing
    policies for homogeneous tasks

35
Why Real-Time Application Monitoring Is Important
  • To gain high performance, a distributed and parallel application needs
  • Acquisition and use of substantial amounts of information about programs, about the systems on which they are running, and about specific program runs
  • This information is difficult to predict accurately prior to a program's execution
  • Ex: Experimentation must be conducted to determine the performance effects of a program's load on processors and communication links, or of a program's usage of certain operating system facilities

36
PRAGMA: An Infrastructure for Runtime Management of Grid Applications (U of A)
  • The overall goal of PRAGMA is to
  • Realize a next-generation adaptive runtime infrastructure capable of reactively and proactively managing and optimizing application execution
  • Gather current system and application state, system behavior and application performance in real time
  • Provide network control based on agent technology

37
Key challenges addressed by PRAGMA
  • Formulation of predictive performance functions
  • Mechanisms for application state monitoring and characterization
  • Design and deployment of an active control
    network combining application sensors and
    actuators

38
Performance Function
  • The performance function hierarchically combines analytical, experimental and empirical performance models
  • The performance function is used along with current system/network state information to predict the application's performance

39
Identifying Performance Function
  • 1. Identify the attributes that can accurately
    express and quantify the operation and
    performance of a resource
  • 2. Use experimental and analytical techniques to
    obtain the performance function
  • 3. Compose the component performance functions to generate an overall performance function

40
Performance function example
  • The performance function models and analyzes a simple network system
  • Two computers (PC1 and PC2) connected through an Ethernet switch
  • PC1 performs a matrix multiplication and sends the result to PC2 through the switch
  • The same for PC2
  • We want to find the performance function to analyze the response time (delay) of the whole application

41
Performance function example (continued)
  • Attribute
  • Data size
  • The performance function determines the application response time with respect to this attribute
  • Measure the task processing time in terms of data size and feed it to a neural network

42
Performance function example (continued)
  • A_j, b_j, c_j, d_i are constants and D is the data size
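
The performance function itself appeared as an equation image on the original slide and is not reproduced here. Purely as an illustration of how such a fitted PF is used at runtime, the sketch below assumes a simple polynomial-in-D form with invented coefficients; the actual PRAGMA function and its constants may differ:

    #include <stdio.h>

    /* Hypothetical performance function: predicted response time as a
       polynomial in the data size D. The coefficients a, b, c, d stand in
       for the fitted constants A_j, b_j, c_j, d_i mentioned on the slide;
       their values here are invented for illustration only. */
    static double pf_response_time(double D) {
        const double a = 1.0e-9, b = 2.0e-6, c = 3.0e-4, d = 0.01;
        return a * D * D * D + b * D * D + c * D + d;   /* seconds (assumed units) */
    }

    int main(void) {
        for (double D = 100.0; D <= 1000.0; D += 300.0)
            printf("D = %6.0f  predicted delay = %.4f s\n", D, pf_response_time(D));
        return 0;
    }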

43
PRAGMA components
  • System characterization and abstraction component
  • Abstracts the current state of the underlying computational environment and predicts its behavior
  • Application characterization component
  • Abstracts the AMR application in terms of its communication and computational requirements

44
PRAGMA components (continued)
  • Active network control
  • Sensor
  • Actuator
  • Management/policy agents for adaptive runtime control
  • Policy base
  • A programmable database of adaptation policies used by agents to drive the overall adaptation process

45
Adaptive Mesh Refinement Basics
  • Concentrating computational effort on appropriate regions
  • Tracking regions in the domain that require additional resolution by overlaying finer grids over these regions
  • Refinement proceeds recursively

46
AMR Basics (continued)
47
System Characterization and Abstraction
  • Objective
  • Monitor, abstract and characterize the current
    state of the underlying computational environment
  • Use this information to drive the predictive
    performance functions and models that can
    estimate its performance in the near future

48
Block diagram of the system model
49
Agent-based runtime adaptation
  • The underlying mechanism for adaptive run-time management of grid applications is realized by an active control network of sensors, actuators and management agents

50
Agent-based runtime management architecture
51
Sensors and actuators for active adaptation
  • Sensors and actuators embedded within the
    application and/or system software
  • Define the adaptation interface and implement the
    mechanics of adaptation
  • Sensors and actuators can be deployed directly within the application's computational data structures

52
Adaptation Policy knowledge-base
  • Adaptation policy base maintains policies used by
    the management agents to drive decision-making
    during runtime application management and
    adaptation
  • The knowledge base is programmable
  • The knowledge base provides an interface for agents to formulate partial queries and perform fuzzy reasoning

53
Results of using Pragma architecture
  • Our experiments are on RM3D (Richtmyer-Meshkov
    CFD kernel)

54
PF results
  • We generate performance functions on two different platforms:
  • IBM SP (Seaborg), located at the National Energy Research Scientific Computing Center
  • Linux Beowulf cluster (Discover), located at New Jersey State University

55
PF on IBM SP
  • We obtain two PFs:
  • PF for small loads (< 30,000 work units)
  • PF for high loads (> 30,000 work units)

56
IBM SP PF coefficient
57
PF on Linux Beowulf
  • A single PF is generated on the Linux Beowulf cluster
  • Coefficients

58
Accuracy of PF on IBM SP
59
Accuracy of PF on Linux Beowulf Cluster
60
Execution Time Gain (Beowulf)
  • Self-optimizing performance gain for a 4-processor cluster

61
Execution Time Gain (continued)
  • Self-optimizing performance gain for an 8-processor cluster with problem size 64x64x32

62
Other Univ. Work Example
  • Load balancing on different nodes needs to be exploited more
  • According to research at Northwestern Univ., perfect load balancing might not be optimal (Multilevel Spectral Bisection application)

# of procs    Exec. Time (Balanced)    Exec. Time (Imbalanced, imbalance in %)
16            2808.28 ms               2653.07 ms (5.2%)
32            1501.97 ms               1491.57 ms (7.4%)
64            854.06 ms                854.843 ms (10.6%)
63
Lesson from this research
  • Allowing some load imbalance can provide significant reductions in parallel execution time
  • As the number of processors increases for a fixed-size problem, the amount of load imbalance that can be exploited generally increases

64
Future Work
  • Generate context-aware, self-configuring, self-adapting and self-optimizing components
  • Provide a vGrid infrastructure that delivers autonomic middleware services

65
Questions?