Title: Automatic Online Performance Analysis for the Grid
1Automatic Online Performance Analysis for the Grid
- Michael Gerndt
- Technische Universität München
- gerndt_at_in.tum.de
2Grid Computing
- Grids
- enable communities (virtual organizations) to
share geographically distributed resources as
they pursue common goals -- assuming the absence of
- central location,
- central control,
- omniscience,
- existing trust relationships.
- Globus Tutorial
- Major differences from parallel systems
- Dynamic system of resources
- Large number of diverse systems
- Sharing of resources
- Transparent resource allocation
3Requirements for Grid Performance Analysis
- Two ways to attack the question
- Scenarios
- Application types
4Scenarios for Performance Monitoring and Analysis
- Post-mortem application analysis
- Self-tuning applications
- Grid scheduling
- Grid management
- GGF performance working group, DataGrid,
CrossGrid
5Self-Tuning Applications
- Chris submits job
- Application adapts to assigned resources
- Application starts
- Application monitors performance and adapts to
resource changes
- Requires
- Integration of system and application monitoring
- On-the-fly performance analysis
- API for accessing monitor data (if PA by application)
- Performance model and interface to steer adaptation (if PA and tuning decision by external component)
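The monitor-and-adapt cycle on this slide can be sketched as a small loop. This is a minimal illustration under assumptions, not a GrADS or Globus interface: `run_adaptive`, `read_iteration_seconds`, the worker-count knob, and the thresholds are all hypothetical names chosen for the example. The callback stands in for the monitor API that the application would query.

```python
from typing import Callable, List


def run_adaptive(read_iteration_seconds: Callable[[int], float],
                 iterations: int, workers: int, max_workers: int,
                 target_seconds: float) -> List[int]:
    """Run `iterations` steps and return the worker count used in each step.

    read_iteration_seconds(workers) is a hypothetical stand-in for the
    monitor API: it returns the measured time of one iteration when the
    application runs with that many workers.
    """
    history: List[int] = []
    for _ in range(iterations):
        history.append(workers)
        measured = read_iteration_seconds(workers)
        if measured > target_seconds and workers < max_workers:
            workers += 1          # too slow: use one more worker
        elif measured < 0.5 * target_seconds and workers > 1:
            workers -= 1          # well below target: release a worker
    return history


# Toy monitor: iteration time shrinks as workers are added.
print(run_adaptive(lambda w: 8.0 / w, iterations=6, workers=1,
                   max_workers=8, target_seconds=2.5))
```

In this toy run the application grows from one to four workers and then stays there, because four workers meet the target without falling far enough below it to trigger a release.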
6Post-Mortem Application Analysis
- George submits job to the Grid
- Job is executed on some resources
- George receives performance data
- George analyzes performance
- Requires
- either resources with known performance characteristics (QoS)
- or system-level information to assess performance data
- scalability of performance tools
- Focus will be on interacting components
7Grid-Scheduling
- Gloria determines performance critical application properties
- She specifies a performance model
- Grid scheduler selects resources
- Application is started
- Requires
- PA of the grid application
- Possibly benchmarking the application
- Access to current performance capabilities of resources
- Even better: access to predicted capabilities
8Grid-Management
- George reports that he has seen bad performance for a week.
- The helpdesk runs the Grid performance analysis software.
- Periodic saturation of connections is detected.
- Requires
- PA of historical system information
- Needs to be done in a distributed fashion
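One way to picture the analysis of historical system information is a check for connections that saturate at the same time every day. The sketch below is an assumption-laden illustration, not the helpdesk software from the scenario: the sample layout `(day, hour_of_day, utilization)`, the function name `saturated_hours`, and both thresholds are hypothetical. Because it looks at one connection at a time, it could be run close to where the data is stored, in line with the distributed requirement.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def saturated_hours(samples: Iterable[Tuple[int, int, float]],
                    threshold: float = 0.9,
                    min_fraction: float = 0.8) -> List[int]:
    """Return the hours of the day in which a link is saturated on most days.

    samples: (day, hour_of_day, utilization) with utilization in 0..1.
    An hour is reported when at least `min_fraction` of its samples
    reach `threshold`.
    """
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for _day, hour, utilization in samples:
        totals[hour] += 1
        if utilization >= threshold:
            hits[hour] += 1
    return sorted(h for h in totals if hits[h] / totals[h] >= min_fraction)


# One week of hourly samples; the link is busy every day at 14:00.
week = [(d, h, 0.95 if h == 14 else 0.30)
        for d in range(7) for h in range(24)]
print(saturated_hours(week))  # prints [14]
```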
9Application Types
- Remote site access
- Parameter studies
- Workflow applications
- Metacomputing applications
- Data-intensive applications
10Deployment of PA Tools
11 Requirements for Grid PA Tools
- Scalability to large systems
- Multiple HPC systems or tons of historic system data
- Integration of application- and system-level info
- Tuning for intersite communication, improving resource allocation, dynamic adaptation, post-mortem clarification
- Online analysis
- Only way to handle performance data set size
- Needed for dynamic tuning
- Automatic analysis
- Needed for dynamic tuning, inspection of large
historic data sets, online analysis, model
generation of applications
12New Aspect of Performance Analysis
- Transparent resource allocation
- Dynamism in resource availability
- Even larger and geographically dispersed systems
- Approaches in the following projects
- Damien
- DataGrid
- CrossGrid
- GrADS
13Managing Dynamism: The GrADS Approach
- GrADS (Grid Application Development Software)
- Funded by National Science Foundation, started 2000
- Goal
- Provide application development technologies that make it easy to construct and execute applications with reliable and often high performance in the constantly-changing environment of the Grid.
- Major techniques to handle transparency and dynamism
- Dynamic configuration to available resources (configurable object programs)
- Performance contracts and dynamic reconfiguration
14GrADS Software Architecture
[Figure: GrADS software architecture, divided into a Program Preparation System and an Execution Environment. Labeled elements: source application, P S E, whole program compiler, libraries, software components, configurable object program, scheduler/service negotiator, dynamic optimizer, realtime performance monitor, Grid runtime system (Globus). Labeled connections: negotiation, performance feedback.]
15Configurable Object Programs
- Integrated mapping strategy and cost model
- Performance enhanced by context-dependent variants
- Context includes potential execution platforms
- Dynamic Optimizer performs final binding
- Implements mapping strategy
- Chooses machine-specific variants
- Inserts sensors and actuators
- Performs final compilation and optimization
16Performance Contracts
- A performance contract specifies the measurable performance of a grid application.
- Given
- set of resources,
- capabilities of resources,
- problem parameters
- the application will
- achieve a specified, measurable performance
17Creation of Performance Contracts
[Figure: creation of performance contracts. Labeled elements: Program; Developer, Compiler, Measurements; Performance Model; MDS; Resource Broker; NWS; Resource Assignment; Performance Contract.]
18History-Based Contracts
- Resources given by broker
- Capabilities of resources given by
- Measurements of this code on those resources
- Possibly scaled by the Network Weather Service
- e.g. Flops/second and Bytes/second
- Problem parameters
- Given by the input data set
- Application intrinsic parameters
- Independent of execution platform
- Measurements of this code with same problem parameters
- e.g. floating point operation count, message count, message bytes count
- Measurable Performance Prediction
- Combining application parameters and resource capabilities
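As a rough illustration of how application intrinsic parameters and resource capabilities might be combined into a measurable prediction, the sketch below divides counts by rates and adds the results. The additive model (compute time plus communication time, no overlap) is an assumption made for the example, not the model used in GrADS; `predict_seconds`, the two dataclasses, and the `nws_scale` factor are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicationParameters:
    """Platform-independent counts measured in an earlier run."""
    flop_count: float
    message_bytes: float


@dataclass(frozen=True)
class ResourceCapabilities:
    """Rates measured for this code on the assigned resources."""
    flops_per_second: float
    bytes_per_second: float


def predict_seconds(app: ApplicationParameters,
                    res: ResourceCapabilities,
                    nws_scale: float = 1.0) -> float:
    """Predict run time as compute time plus communication time.

    nws_scale (hypothetical) scales the historic rates to current
    conditions, e.g. 0.5 if only half the measured capability is available.
    """
    if nws_scale <= 0:
        raise ValueError("nws_scale must be positive")
    compute = app.flop_count / (res.flops_per_second * nws_scale)
    communicate = app.message_bytes / (res.bytes_per_second * nws_scale)
    return compute + communicate


app = ApplicationParameters(flop_count=2e9, message_bytes=1e8)
res = ResourceCapabilities(flops_per_second=1e9, bytes_per_second=1e8)
print(predict_seconds(app, res))                 # 2 s compute + 1 s communication
print(predict_seconds(app, res, nws_scale=0.5))  # same work at half the capability
```

The predicted time is the kind of measurable quantity a contract could state and a monitor could later check.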
19Application and System Space Signature
- Application Signature
- trajectory of values through N-dimensional metric space
- one trajectory per process
- e.g. one point per iteration
- e.g. metric iterations/flop
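A signature of this kind can be represented as a list of points, one per iteration, and compared against an expected trajectory. The sketch below assumes Euclidean distance as the comparison and a point-by-point pairing of the two trajectories; both choices, and the name `max_deviation`, are illustrative assumptions rather than something stated on the slide.

```python
import math
from typing import List, Sequence

Point = Sequence[float]          # one point in N-dimensional metric space


def max_deviation(expected: List[Point], observed: List[Point]) -> float:
    """Largest Euclidean distance between corresponding trajectory points."""
    if len(expected) != len(observed):
        raise ValueError("trajectories must have the same number of points")
    return max((math.dist(e, o) for e, o in zip(expected, observed)),
               default=0.0)


# Two iterations of one process in a 2-dimensional metric space.
expected = [(1.0, 2.0), (1.0, 2.0)]
observed = [(1.0, 2.0), (4.0, 6.0)]
print(max_deviation(expected, observed))  # prints 5.0
```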
20Verification of Performance Contracts
[Figure: verification of performance contracts. Labeled elements: Execution, Sensor Data, Contract Monitor, Violation detection, Fault detection, Rescheduling, Steer Dynamic Optimizer.]
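Violation detection by a contract monitor could look roughly like the following. The rule used here (a sample violates the contract when it falls more than a tolerance below the predicted rate, and a violation is declared only after several consecutive violating samples) is an assumption chosen for the sketch; `first_violation`, `tolerance`, and `patience` are hypothetical names.

```python
from typing import Iterable


def first_violation(predicted_rate: float, measured_rates: Iterable[float],
                    tolerance: float = 0.2, patience: int = 3) -> int:
    """Return the index of the sample at which a violation is declared, or -1.

    A sample violates the contract when it falls more than `tolerance`
    below `predicted_rate`; a violation is declared after `patience`
    consecutive violating samples, so single outliers are ignored.
    """
    streak = 0
    for index, rate in enumerate(measured_rates):
        if rate < predicted_rate * (1.0 - tolerance):
            streak += 1
            if streak >= patience:
                return index
        else:
            streak = 0
    return -1


# Sensor data against a contract of 100 units/s: one outlier, then a real drop.
print(first_violation(100.0, [100, 70, 95, 70, 75, 60, 90]))  # prints 5
```

Requiring several consecutive violations is one simple way to avoid rescheduling on a transient dip; the returned index marks the point where rescheduling or steering of the dynamic optimizer would be triggered.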
21Peridot
- Goal
- Develop a scalable automatic performance analysis system
- Main target system Hitachi SR8000
- Partners
- Leibniz Computer Center
- Research Center Jülich
- Technical University Dresden
- Technical University of Munich
22Hierarchy of analysis agents
- Agents are autonomous but cooperate
- Agents are responsible for components
- Whole system
- Nodes
- Processes
- Work distribution based on ASL specification
- Performance data are processed by leaf nodes
- Reducing communication in analysis hierarchy
- Cooperation is done via higher level information
- Talk by Karl Fürlinger, Session 02.2 on Friday
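The idea that leaf agents process raw performance data and pass only higher-level information upward can be sketched with a small tree of agents. This is an illustrative structure, not the Peridot implementation or its ASL-based work distribution: the `Agent` and `Summary` classes and the choice of summary fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Summary:
    """Higher-level information passed up the hierarchy."""
    samples: int
    total_seconds: float
    worst_seconds: float


@dataclass
class Agent:
    name: str
    children: List["Agent"] = field(default_factory=list)
    raw_seconds: List[float] = field(default_factory=list)  # used by leaves

    def analyze(self) -> Summary:
        if not self.children:
            # Leaf agent: reduce raw measurements locally.
            data = self.raw_seconds
            return Summary(len(data), sum(data), max(data, default=0.0))
        # Inner agent: combine the summaries of its children only.
        parts = [child.analyze() for child in self.children]
        return Summary(sum(p.samples for p in parts),
                       sum(p.total_seconds for p in parts),
                       max(p.worst_seconds for p in parts))


# Whole system -> node -> processes, mirroring the component levels above.
p0 = Agent("process-0", raw_seconds=[1.0, 2.0])
p1 = Agent("process-1", raw_seconds=[4.0])
system = Agent("system", children=[Agent("node-0", children=[p0, p1])])
print(system.analyze())
```

Only three numbers per agent travel up the tree, regardless of how many raw samples each leaf holds, which is the communication saving the slide refers to.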
24Summary
- Scalability to large systems
- Multiple HPC systems or tons of historic system data
- Integration of application- and system-level info
- Tuning for intersite communication, improving resource allocation, dynamic adaptation, post-mortem clarification
- Online analysis
- Only way to handle performance data set size
- Needed for dynamic tuning
- Automatic analysis
- Needed for dynamic tuning, inspection of large
historic data sets, online analysis, model
generation of applications