Title: Grid performance, grid benchmarks, grid metrics
1Grid performance, grid benchmarks, grid metrics
- Zsolt Németh
- MTA SZTAKI Computer and Automation Research Institute
- zsnemeth_at_sztaki.hu
- http://www.lpds.sztaki.hu/zsnemeth
2Outline
- What is the grid?
- What is grid performance?
- Are benchmarks useful?
- How can grid metrics be defined?
3What is the grid?
4Distributed applications
- A set of cooperative processes
5Distributed applications
- Processes require resources
Printer
Network
Memory
CPU
Database
Storage
Libraries
I/O devices
6Distributed applications
- Resources can be found on computational nodes
Network
Printer
CPU
Storage
Mapping
Memory
Database
I/O devices
Libraries
CPU
7Distributed applications
Application Cooperative processes
- Process control?
- Security?
- Naming?
- Communication?
- Input / output?
- File access?
Physical layer Computational nodes
8Distributed applications
Application Cooperative processes
- Virtual machine
- Process control ?
- Security ?
- Naming ?
- Communication ?
- Input / output ?
- File access ?
Physical layer Computational nodes
9Conventional distributed environments and grids
- Distributed resources are virtually unified by a software layer
- A virtual machine is introduced between the application and the physical layer
- Provides a single system image to the application
- Types
- Conventional (PVM, some implementations of MPI)
- Grid (Globus, Legion)
10Conventional distributed environments and grids
- What is the essential difference?
11Conventional distributed environments and grids
12Conventional distributed environments and grids
13Conventional distributed environments and grids
14Conventional distributed environments and grids
- How is the virtual machine built up?
- What does execution mean?
- What is the semantics of execution?
15Description of grid
- flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources (The Anatomy of the Grid)
- single, seamless, computational environment in which cycles, communication and data are shared (Legion: The Next Step Toward a Nationwide Virtual Computer)
- wide-area environment that transparently consists of workstations, personal computers, graphic rendering engines, supercomputers and nontraditional devices (Legion - A View from 50,000 Feet)
- collection of geographically separated resources connected by a high speed network, a software layer which transforms a collection of independent resources into a single, coherent virtual machine (Metacomputing: What's in it for me?)
16Conventional environments
- Processes
- Have resource requests
- Mapping
- Processes are mapped onto nodes
- Resource assignment is implicit
Physical level
17Grid
- Processes
- Have resource requirements
- Mapping
- Assign nodes to resources?
Physical layer
18Grid: the resource abstraction
- Processes
- Have resource needs
Physical layer
19Grid: the user abstraction
- Processes
- Belong to a user
- User of the virtual machine is authorised to use the constituting resources
- Have no login access to the node the resource belongs to
- Physical layer
- Local, physical users (user accounts)
20The grid abstraction
- Semantically the grid is nothing but abstraction
- Resource abstraction
- Physical resources can be assigned to virtual resource needs (matched by properties)
- Grid provides a mapping between virtual and physical resources
- User abstraction
- User of the physical machine may be different from the user of the virtual machine
- Grid provides a temporal mapping between virtual and physical users
21Conventional distributed environments and grids
Smith 4 nodes
Smith, 4 CPU, memory, storage
Smith 1 CPU
smith_at_n1.edu
smith_at_n1.edu
default_at_foo.com
griduser_at_mynode.hu
smith_at_n2.edu
22Grid performance
23What is grid performance at all?
- Performance of grid infrastructure or performance of grid application?
- Traditionally performance is
- Speed
- Throughput
- Bandwidth, etc.
- Using grids
- Quantitative reasons
- Qualitative reasons: QoS
- Economic aspects
24Grid performance analysis scenarios
- Resource brokering: evaluate the performance of a given resource to see if it is appropriate for a certain job
- At runtime: check if a resource can maintain an acceptable/required performance
- At runtime: check if a job can evolve according to checkpoints
- Find obvious idling/waiting spots
- Find bad communication patterns
- Find serious performance skew
- Post mortem: see if brokering strategy was correct
- Etc.
25What is grid performance at all?
26What is grid performance at all?
- supercomputer
- task is done in 20 minutes
- cluster
- task is done in 12 hours
27What is grid performance at all?
- supercomputer
- task is done in 20 minutes
- available tomorrow night
- cluster
- task is done in 12 hours
- available now
28What is grid performance at all?
- supercomputer
- task is done in 20 minutes
- available tomorrow night
- costs 200/hour
- cluster
- task is done in 12 hours
- available now
- costs 15/hour
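The trade-off between the two resources can be made concrete with a little arithmetic. The following is only a sketch under stated assumptions: the `Offer` class and its field names are hypothetical, the 30-hour wait is an assumed stand-in for "available tomorrow night", and cost is assumed to be billed per hour of run time only.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """One resource offer; all field names are hypothetical."""
    name: str
    run_hours: float       # execution time of the task
    wait_hours: float      # time until the resource becomes available
    price_per_hour: float  # price in the (unspecified) currency unit of the slide

    @property
    def turnaround_hours(self) -> float:
        # Time from submission until the result is ready.
        return self.wait_hours + self.run_hours

    @property
    def cost(self) -> float:
        # Assumption: only the run time is billed.
        return self.run_hours * self.price_per_hour


# Run times and prices come from the slide; the 30-hour wait is an assumed value.
supercomputer = Offer("supercomputer", run_hours=20 / 60, wait_hours=30.0, price_per_hour=200.0)
cluster = Offer("cluster", run_hours=12.0, wait_hours=0.0, price_per_hour=15.0)

for offer in (supercomputer, cluster):
    print(f"{offer.name}: ready after {offer.turnaround_hours:.1f} h, cost {offer.cost:.2f}")
```

Under these assumptions the supercomputer is the cheaper choice (about 66.67 against 180), while the cluster delivers the result sooner. Which one "performs better" depends on what the user values.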
29What is grid performance at all?
- Grid is about resource sharing
- What is the benefit of sharing
- acceptable for resource owners
- acceptable for resource users
- Speed, bandwidth, capacity, etc. is just one aspect
- Properness, fairness, effectiveness of assignment of processes to resources
30Grid performance
Performance?
31Grid performance
Performance?
Virtual layer
Physical layer
Measurement
32Grid performance
Performance?
Virtual layer
Physical layer
Measurement
33Interaction of application and the infrastructure
- Performance = application perf. × infrastructure perf.
- Signature model (Pablo group)
- Application signature
- e.g. instructions/FLOP
- Scaling factor (capabilities of the resources)
- e.g. FLOPs/second
- Execution signature
- application signature × scaling factor
- e.g. instructions/second = instructions/FLOP × FLOPs/second
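As a worked illustration of the signature model, the sketch below multiplies an application signature by a scaling factor to obtain an execution signature. The function name and the numeric values are hypothetical and chosen only to show how the units combine; they are not taken from the Pablo group's tools.

```python
def execution_signature(application_signature: float, scaling_factor: float) -> float:
    """Combine an application signature with a resource scaling factor.

    application_signature: e.g. instructions per FLOP
    scaling_factor:        e.g. FLOPs per second delivered by the resource
    returns:               e.g. instructions per second
    """
    return application_signature * scaling_factor


# Hypothetical values, chosen only for illustration.
instructions_per_flop = 2.5
flops_per_second = 4.0e9

rate = execution_signature(instructions_per_flop, flops_per_second)
print(f"{rate:.3e} instructions/second")
```

The same application signature paired with a different scaling factor gives the execution signature expected on a different resource.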
34Possible performance problems in grids
- All that may occur in a distributed application
- Plus
- Effectiveness of resource brokering
- Synchronous availability of resources
- Resources may change during execution
- Various local policies
- Shared use of resources
- Higher costs of some activities
- The corresponding symptoms must be characterised
35Grid performance metrics
- Abstract representation of measurable quantities
- M = R1 × R2 × ... × Rn
- Usual metrics
- Speedup, efficiency
- Load, queue length, etc.
- Such strict values are not characteristic in grids
- Cannot be interpreted
- Cannot be compared
- New metrics
- Local metrics and grid metrics
- Symbolic description / metrics
36Processing monitoring information
- Trace data reduction
- Proportional to time t, processes P, metrics dimension n
- Statistical clustering (reducing P)
- Similar temporal behaviours are classified
- Questionable whether it works for grids
- Representative processes are recorded for each class
- Statistical projection pursuit (reducing n)
- reduces the dimension by identifying significant metrics
- Sampling frequency (reducing t)
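To get a feeling for how much the three reductions can save, the sketch below simply counts recorded values as t × P × n. All numbers are hypothetical, and the reductions are represented only by their assumed outcome (how many processes, metrics and samples remain), not by an implementation of clustering or projection pursuit.

```python
def trace_volume(samples: int, processes: int, metrics: int) -> int:
    """Number of recorded values: proportional to t * P * n."""
    return samples * processes * metrics


# Hypothetical monitoring run.
full = trace_volume(samples=3600, processes=128, metrics=20)

# Assumed outcome of the three reductions:
#   clustering keeps 8 representative processes (reducing P),
#   projection pursuit keeps 5 significant metrics (reducing n),
#   sampling keeps every 10th sample (reducing t).
reduced = trace_volume(samples=3600 // 10, processes=8, metrics=5)

print(full, reduced, full // reduced)
```

Because the volume is a product, the reduction factors of the three techniques multiply.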
37Performance tuning, optimisation
- The execution cannot be reproduced
- Post-mortem optimisation is not viable
- On-line steering is necessary, though hard to realise
- Sensors and actuators
- Application and implementation dependent
- E.g. Autopilot, Falcon
- Average behaviour of applications can be improved
- Post-mortem tuning of the infrastructure (if possible)
- Brokering decisions
- Supporting services
38Grid benchmarking
39Grid performance, resource performance
- The traditional way: benchmarking
- As suggested by GGF-GBRG
40Running benchmarks
- Benchmarks are executed on a virtual machine
41Running benchmarks
- Benchmarks are executed on a virtual machine
- The virtual machine may change (composed of
different resources) from run to run
42Running benchmarks
- Benchmarks are executed on a virtual machine
- The virtual machine may change (composed of different resources) from run to run
- Benchmark result is representative of one particular virtual machine
43Running benchmarks
- Benchmarks are executed on a virtual machine
- The virtual machine may change (composed of different resources) from run to run
- Benchmark result is representative of one particular virtual machine
- What can it show about the entire grid?
- What can it show about a certain resource?
44Grid benchmarking
Measurement
Performance?
Virtual layer
Physical layer
45Grid metrics
46Local metrics
- Load averages, CPU user, system, idle percentages, network bandwidth, cache hit ratio, available memory, page faults, etc.
- Performance is a trajectory in a multi-dimensional space
- Cannot be compared
- Cannot be interpreted
- processes 55.2, user 70, system 0, idle 30
- underloaded 64-CPU system
- processes 55.2, user 70, system 30, idle 0
- 64-CPU system, serious overheads
- processes 72.8, user 99, system 1, idle 0
- slightly overloaded 64-CPU system
- processes 4.1, user 99, system 1, idle 0
- seriously overloaded 1-CPU system
- Fine details are even more complex to evaluate
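The examples above show why a raw process count cannot be compared across machines. A minimal sketch of one possible normalisation, dividing the count by the number of CPUs, follows. The function names and the thresholds are illustrative assumptions, not values from the slides.

```python
def load_per_cpu(runnable_processes: float, cpus: int) -> float:
    """Average number of runnable processes per CPU."""
    return runnable_processes / cpus


def describe(runnable_processes: float, cpus: int) -> str:
    # The thresholds are illustrative assumptions, not values from the slides.
    ratio = load_per_cpu(runnable_processes, cpus)
    if ratio < 1.0:
        return "underloaded"
    if ratio < 1.5:
        return "slightly overloaded"
    return "seriously overloaded"


# (processes, CPUs) pairs taken from the examples above.
for processes, cpus in [(55.2, 64), (72.8, 64), (4.1, 1)]:
    print(processes, cpus, round(load_per_cpu(processes, cpus), 2), describe(processes, cpus))
```

Even this normalisation is incomplete: two of the examples have the same process count on the same machine and differ only in the user/system split, which the ratio alone does not capture.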
47Local metrics, global (grid) metrics
- Local metrics are transformed into some globally understandable performance figures
- What are the dimensions?
- What is the transformation?
48Global metrics
- MIPS, MFLOPS, Gbit/s, etc.
- Comparable, interpretable
- Most users have no idea about the computing power they really require
- These are usually nominal and not actual values
- Too general a characterisation: fine details are hidden
49Benchmark metrics
- Benchmarks are for comparing computer systems
- A well selected benchmark set
- sensitive to different factors: CPU intensive, communication intensive, I/O intensive jobs
- able to show fine details: cache behaviour, floating point capabilities, etc.
- able to show behaviour at different levels: instruction, loop, procedure, application
- These figures can only be obtained actively: this requires time and resources
50Benchmark metrics
- Given a local database with local and benchmark performance records
- get the local performance figures
- low cost: OS functionality
- look up the database for benchmark performance
- there may not be a record for the actual local performance
- symbolic (fuzzy) interpolation
- the actual benchmark figures can be estimated
- actual execution of benchmarks is costly if not impossible
- Estimated benchmark figures give a characterisation of the system in a comparable and interpretable way
- Sounds reasonable but not enough
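The lookup step can be sketched as follows. Note the substitution: plain linear interpolation between neighbouring records stands in here for the symbolic (fuzzy) interpolation named on the slide, and the record values, the single load dimension and the function name are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical records, sorted by load: (local load average, measured benchmark score).
RECORDS = [(0.0, 100.0), (1.0, 80.0), (2.0, 50.0), (4.0, 20.0)]


def estimate_benchmark(load: float, records=RECORDS) -> float:
    """Estimate a benchmark score for a load value that may have no record."""
    loads = [r[0] for r in records]
    # Outside the recorded range: fall back to the nearest record.
    if load <= loads[0]:
        return records[0][1]
    if load >= loads[-1]:
        return records[-1][1]
    i = bisect_left(loads, load)
    (x0, y0), (x1, y1) = records[i - 1], records[i]
    # Linear interpolation between the two neighbouring records.
    return y0 + (y1 - y0) * (load - x0) / (x1 - x0)


print(estimate_benchmark(1.5))
```

Reading the local load is cheap, so an estimate like this can be refreshed often without rerunning the benchmark itself. A real database would have several local metrics per record rather than a single load value.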
51Benchmark metrics
- Benchmarks may show actual execution performance but it is not enough
- Real-life experiments: execution time may show no correlation to actual load
- start every job and suffer resource starvation
- wait until resources are available and start specific jobs
- Resource management policy must be taken into consideration
52Job startup times
- corona.iif.hu, SUN Ultra Enterprise 10000, 64 CPU
- Sun Grid Engine
- Time between submission and actual start
- 1 processor job: within 1 minute
- 2 processor job: mostly within 1 minute
- 4 processor job: 2-3 hours
- 8 processor job: 1-2 days
- 9 processor job: 1-2 days
- 16 processor job: 2-3 days
- 25 processor job: > 4-5 days
- See online
- http://www.lpds.sztaki.hu/zsnemeth/apart/statistics/statistics.shtml
53Resource performance characterisation
- Execution phase: resource performance can be characterised in the space of benchmark metrics
- analyse the relationship between local metrics and benchmark results
- find the principal components
- Waiting phase: a stochastic model
- find the parameters of the distribution
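For the waiting phase, fitting a stochastic model means estimating distribution parameters from observed queue waiting times. The slides do not say which distribution is used, so the sketch below assumes an exponential distribution purely for illustration. The sample values and function names are hypothetical.

```python
from statistics import mean


def fit_exponential_rate(wait_hours):
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    return 1.0 / mean(wait_hours)


def expected_wait(rate: float) -> float:
    """Mean of an exponential distribution with the given rate."""
    return 1.0 / rate


# Hypothetical queue waiting times (hours) observed for one job class.
observed = [2.0, 3.0, 2.5, 2.5]

rate = fit_exponential_rate(observed)
print(rate, expected_wait(rate))
```

A separate fit per job class would be a natural extension, since waiting times clearly depend on the number of processors requested.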
54Resource performance characterisation
- These parameters (?i, ?i, t1, t2, tn) can be distributed in an information system
- Interpretable: the stochastic model and the benchmark set give an appropriate framework
- Comparable: figures have the same meaning within this framework
55Ongoing work
- Exploring the statistical properties of benchmarks and system parameters
- Intensive benchmark experiments
- Getting the most out of figures
- Principal component analysis: which figures are really meaningful
- Testing the stability of statistic data
- http://www.lpds.sztaki.hu/zsnemeth/apart/statistics/statistics.shtml
- Exploring how benchmark results can be estimated from past measurements
- Database management
- Symbolic interpolation
56Conclusion
- A semantic definition for grids
- the presence of user and resource abstraction
- Grid performance has a more complex meaning
- Resource abstraction requires abstraction in the performance characterisation, too
- separation of local (physical) and global (virtual) metrics
- benchmarking is not viable
- but benchmarks can serve as metrics
- Experiments with resource characterisation