Transcript and Presenter's Notes

Title: Uncertainty in Large-Scale Computing


1
Uncertainty in Large-Scale Computing
  • Travis Desell

2
Overview
  • Motivation
  • Uncertainty?
  • ARM (Autonomic Runtime Manager)
  • Active Harmony

3
Extensible Terascale Facility (ETF)
RPI
4
iVDGL: International Virtual Data Grid Laboratory
www.ivdgl.org
5
CERN: World's Largest Computing Grid
http://goc02.grid-support.ac.uk/googlemaps/lcg.html
www.cern.ch
6
PlanetLab (November 2006)
718 Nodes at 315 Sites
7
Map of Rensselaer Grid Clusters
[Map image with cluster labels: Nanotech, Multiscale, Bioscience Cluster, CS/WCL, Multipurpose Cluster, CS, CCNI Cluster]
8
Areas of Uncertainty
  • Application
    • Non-determinism
  • Environment
    • Dynamic resource availability
    • Unknown target environments
    • OS/Hardware effects
    • Faults
  • Interaction
    • Competition
    • Distribution
    • Unknown performance on different architectures

9
Overview of ARM
  • Autonomic Runtime Manager (ARM)
  • What is it for?
  • What is it?
  • Reconfiguration Methodology
  • What can it do?
  • Graph Partitioning
  • Natural Region Partitioning

10
Adaptive Distributed Applications
  • Non-deterministic
  • Application divided into computational regions
  • Regional computational complexity can change
    dramatically in both space and time
  • Example: Wildfire Simulation

11
ARM (Autonomic Runtime Manager)
12
When to reconfigure?
  • Imbalance Ratio (IR)
  • Reconfigure if IR > threshold (sketch below)
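
The deck does not give ARM's exact formula for the Imbalance Ratio, so the following minimal Python sketch assumes a common definition: the relative gap between the busiest processor's load and the mean load. The 0.25 threshold is likewise an assumed example value.

  # Hedged sketch of the reconfiguration trigger; the IR formula and
  # threshold below are assumptions, not ARM's published definitions.

  def imbalance_ratio(loads):
      """(max load - mean load) / mean load."""
      mean = sum(loads) / len(loads)
      return (max(loads) - mean) / mean

  def should_reconfigure(loads, threshold=0.25):
      """Trigger repartitioning when the imbalance exceeds the threshold."""
      return imbalance_ratio(loads) > threshold

  # One overloaded processor pushes IR to about 0.8, past the threshold.
  print(should_reconfigure([10.0, 11.0, 9.5, 25.0]))  # True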

13
Natural Region Partitioning
  • Calculate the Processor Computational Load (PCL)
  • Processor Allocation Factor (PAF) is a ratio of how fast each processor computes cells
  • Processor Load Ratio (PLR) is a normalization of the PAF
  • Application Computational Workload (ACW) is the total workload of the application (sketch below)
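
The slide gives these quantities only in words, so the sketch below is an assumed reading of the arithmetic: PAF is measured as cells computed per second, PLR normalizes PAF across processors, and each processor's target PCL is its PLR share of the ACW.

  # Minimal sketch of the load-allocation arithmetic implied above;
  # the exact formulas are assumptions based on the slide's definitions.

  def target_loads(cells_per_second, acw):
      """Split the Application Computational Workload (ACW) across
      processors in proportion to measured speed."""
      paf = cells_per_second                # Processor Allocation Factor
      plr = [p / sum(paf) for p in paf]     # Processor Load Ratio (normalized PAF)
      return [r * acw for r in plr]         # per-processor PCL targets

  # A processor twice as fast receives twice the workload.
  print(target_loads([100.0, 200.0, 100.0], acw=8000.0))
  # [2000.0, 4000.0, 2000.0]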

14
Graph Partitioning
  • Wildfire simulation domain represented as a graph G(V,E)
  • Vertices are cells
  • Edges connect neighboring cells
  • Vertices weighted by computational complexity
  • Burning > Unburned
  • Graph is partitioned to create cuts with similar vertex weight and minimal inter-cut edges (sketch below)
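
The sketch below illustrates this representation but is not ARM's code: the 10:1 burning/unburned weights are invented, and a simple greedy balancer stands in for the multilevel partitioner (e.g. METIS) a real system would use to also minimize the edge cut.

  # Build the weighted grid graph and balance vertex weight greedily.

  def grid_graph(nrows, ncols, burning):
      """Vertices are cells, weighted by computational complexity
      (burning > unburned); edges connect 4-neighboring cells."""
      weights = {(r, c): 10.0 if (r, c) in burning else 1.0
                 for r in range(nrows) for c in range(ncols)}
      edges = [((r, c), (r + dr, c + dc))
               for r in range(nrows) for c in range(ncols)
               for dr, dc in ((0, 1), (1, 0))
               if r + dr < nrows and c + dc < ncols]
      return weights, edges

  def greedy_partition(weights, nparts):
      """Assign heaviest cells first to the currently lightest part so
      total vertex weight stays balanced (edge cut ignored for brevity)."""
      loads, part = [0.0] * nparts, {}
      for cell in sorted(weights, key=weights.get, reverse=True):
          p = loads.index(min(loads))
          part[cell] = p
          loads[p] += weights[cell]
      return part, loads

  weights, edges = grid_graph(4, 4, burning={(0, 0), (0, 1)})
  part, loads = greedy_partition(weights, nparts=2)
  print(loads)  # roughly equal per-part weight: [17.0, 17.0]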

15
Results
16
Results (Continued)
17
Overview of Active Harmony
  • Active Harmony
  • What is it?
  • Reconfiguration Methodology
  • Case Studies
  • PETSc
  • POP
  • GS2

18
Active Harmony
  • Previously used for runtime performance tuning in dynamic environments.
  • Developers specify tunable parameters that can modify the application at runtime.
  • This work focuses on using Active Harmony off-line, as opposed to on-line (a toy sketch of the idea follows).
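
A toy sketch of the tunable-parameter idea, in Python; this is not Active Harmony's actual API, and the parameter names, values, and timing hook below are invented for illustration.

  import time

  # Developer-declared tunables: parameter name -> allowed values.
  TUNABLES = {
      "block_size": [50, 100, 150, 200],
      "num_threads": [1, 2, 4, 8],
  }

  def measure(config):
      """One short, representative run under the given configuration,
      returning wall-clock time (faked here with a sleep)."""
      start = time.perf_counter()
      time.sleep(0.001 * config["block_size"] / config["num_threads"])
      return time.perf_counter() - start

  # Off-line use, as in this work: candidate configurations are
  # evaluated between runs rather than inside a live application.
  print(measure({"block_size": 100, "num_threads": 4}))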

19
Active Harmony Architecture
20
Reconfiguration Methodology
  • Representative short runs used to provide
    measurements of application performance for
    different tuned parameters.
  • Uses an algorithm based on the Neader-Mead
    simplex method.
  • Parameters are treated as independent dimensions.
  • Modified because parameters are non-continuous.
  • Iteratively tunes application performance by
    repeatedly modifying parameters and converging
    and an optimum.
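
A simplified illustration of such a modified simplex search over discrete parameters; it is not Active Harmony's actual implementation, and the synthetic cost function stands in for the measured short runs described above.

  import random

  def snap(point, candidates):
      """Round each coordinate to the nearest allowed discrete value,
      since the parameters are non-continuous."""
      return tuple(min(c, key=lambda v: abs(v - x))
                   for x, c in zip(point, candidates))

  def tune(cost, candidates, iterations=30):
      dims = len(candidates)
      # Initial simplex: dims + 1 random snapped points.
      simplex = [snap([random.choice(c) for c in candidates], candidates)
                 for _ in range(dims + 1)]
      for _ in range(iterations):
          simplex.sort(key=cost)                  # best point first
          worst = simplex[-1]
          centroid = [sum(p[i] for p in simplex[:-1]) / dims
                      for i in range(dims)]
          # Reflect the worst point through the centroid, then snap.
          reflected = snap([2 * centroid[i] - worst[i]
                            for i in range(dims)], candidates)
          if cost(reflected) < cost(worst):
              simplex[-1] = reflected             # accept the reflection
          else:                                   # otherwise shrink toward best
              best = simplex[0]
              simplex = [snap([(p[i] + best[i]) / 2 for i in range(dims)],
                              candidates) for p in simplex]
      return min(simplex, key=cost)

  # Synthetic cost surface; typically converges to (150, 8).
  block, threads = [50, 100, 150, 200], [1, 2, 4, 8]
  print(tune(lambda p: abs(p[0] - 150) + 10.0 / p[1], [block, threads]))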

21
Case Study - PETSc
  • PETSc: Portable, Extensible Toolkit for Scientific Computation
  • Suite of data structures and routines for
    scalable (parallel) solution of scientific
    applications based on partial differential
    equations
  • Uses MPI
  • Used Active Harmony on two PETSc examples
  • SLES: a linear equation solver using matrix decomposition
  • 2-D driven cavity problem

22
Results - SLES
  • Ran a 50x50 matrix decomposition with 4 homogeneous and 4 heterogeneous processors (see above).
  • Ran a 21,025 x 21,025 matrix decomposition on 32 processors, resulting in an 18% performance speedup after tuning.
  • Also ran a 90,601 x 90,601 matrix decomposition; the full search space is O(10^100). After 120 iterations, tuning resulted in a 15-20% performance improvement.

23
Results: 2-DCP
  • Uses the PETSc SNES non-linear equation solver.
  • The tunable parameter is how many grid points are distributed to each node (the default is an equal distribution).
  • Above shows results from a small problem: 2,500 grid points and 4 nodes.
  • Ran with 40,000 grid points on 32 processors; the speedup was 11.5% over the default partitioning. The search space was O(10^36).

24
Case Study - POP
  • POP: Parallel Ocean Program
  • Developed at Los Alamos National Laboratory.
  • Used by the Community Climate System Model (CCSM)
    as the ocean component.
  • Solves three-dimensional primitive equations for
    fluid motions on a sphere.

25
Results - POP
  • Problem size: 3600x2400 grid points divided into 480 blocks (one per processor).
  • The default configuration uses 180x100 blocks.
  • Run on 16-way SMP nodes.
  • No single block size is best for all topologies.
  • Execution time was reduced by up to 15% by tuning the block size.
  • Tuning other parameters (20 performance-related parameters with 2-4 possible values each) on an 8-node, 4-processor cluster resulted in a 12.7% speedup after 12 iterations, and a best speedup of 16.7% after 27 iterations.
  • The first set of x-axis labels is processing nodes; the second set is the best block size.

26
Case Study: GS2
  • Physics application used to study low-frequency
    turbulence in magnetized plasma.
  • Simulation involves billions of mesh points.
  • Primary tuning parameter was data layout.

27
Results: GS2
  • Results from NERSC Seaborg (8 16-processor
    nodes).
  • Active Harmony reduced execution time from 55.06s to 16.25s (3.4x speedup) without collision mode, and from 71.08s to 31.55s (2.3x speedup) in collision mode.
  • Used benchmarked runs of 10 iterations (as
    opposed to typical runs of 1000 iterations).

28
Results: GS2 (2)
  • The search space of GS2 is O(10^5). To test Active Harmony, systematic sampling was used to evaluate O(10^4) configurations (see left).
  • The best performance sampled was 125.8s; only 2% of the sampled configurations resulted in < 200s.
  • Active Harmony's result was in the top 5% of sampled configurations.
  • Also tuned other performance-related parameters in addition to data layout; the total speedup was a 5.1x improvement.