Title: A Mobile-Agent-Based Performance-Monitoring System at RHIC
1A Mobile-Agent-Based Performance-Monitoring
System at RHIC
2Overview
- Motivation for a new monitoring system
- Design of the Instrumentation system
- Use of mobile agents (mobile programs vs. remote
procedures)
- How it works, what it does and doesn't do
- Practical experiences with a test instrument
- What works well and what doesn't
- Future enhancements
3Monitoring System Purpose
- The system should
- Provide performance monitoring at service-level
- End-to-end tests yielding mixed information on
the functioning of several services
- Track performance changes during configuration
changes
- Monitor current health of system
- Provide some error-tracking/reporting
capabilities
- Be a tool for administrators and experimenters
- It will not
- Provide detailed system information for fault
diagnosis (system-specific, vendor-supplied tools
already exist)
4Desired Features of the System
- View / compare past and current measurements
- Inspect correlations between metrics
- Allow variation of sampling rate
- Automatically execute scheduled measurements
- Perform measurements on demand at shorter
intervals
- Perform OS-independent measurements
- Use a small fraction of available resources
5Components of the System
- Instruments which perform measurements
- Centralized database of Instruments (code) and
time-stamped results
- Allows simple addition of new metrics
- Allows previously run tests to be reproduced
- Mechanism for remote execution of Instruments
- IBM Aglets mobile-agent system
(http://www.trl.ibm.co.jp/aglets)
(Diagram labels: parameters, code, monitor, sequence of measurements)
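As a rough illustration of the centralized database described on this slide, the sketch below stores Instruments (code) and time-stamped results in two tables. The slides do not name the database product or its layout, so the use of SQLite and every table, column and function name here are assumptions made for the example.

```python
import sqlite3


def create_store(conn):
    """Create a hypothetical two-table layout: instruments and results."""
    conn.executescript("""
        CREATE TABLE instrument (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL,
            code TEXT NOT NULL            -- Instrument code (or a reference to it)
        );
        CREATE TABLE result (
            id            INTEGER PRIMARY KEY,
            instrument_id INTEGER NOT NULL REFERENCES instrument(id),
            target_host   TEXT NOT NULL,
            parameters    TEXT NOT NULL,  -- parameter set used for this run
            measured_at   REAL NOT NULL,  -- Unix timestamp of the measurement
            value         REAL NOT NULL
        );
    """)


def add_instrument(conn, name, code):
    """Register a new metric; returns its row id."""
    cur = conn.execute(
        "INSERT INTO instrument (name, code) VALUES (?, ?)", (name, code))
    return cur.lastrowid


def record_result(conn, instrument_id, host, parameters, measured_at, value):
    """Store one time-stamped measurement."""
    conn.execute(
        "INSERT INTO result (instrument_id, target_host, parameters,"
        " measured_at, value) VALUES (?, ?, ?, ?, ?)",
        (instrument_id, host, parameters, measured_at, value))


def history(conn, name, host):
    """Return (timestamp, value) pairs for one instrument on one host, oldest first."""
    rows = conn.execute(
        "SELECT r.measured_at, r.value FROM result r"
        " JOIN instrument i ON i.id = r.instrument_id"
        " WHERE i.name = ? AND r.target_host = ?"
        " ORDER BY r.measured_at", (name, host))
    return rows.fetchall()
```

Keeping the code and the parameter set next to each result is what would let a previously run test be reproduced; adding a metric is then one new row in the instrument table.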
6Mobile Agents vs. RPC
(Diagram, remote procedure call: User's system, Search request, Remote system, Local search utility, Dataset to search)
- A pre-defined procedure on remote host executes
and returns result
(Diagram, mobile agent: User's system, Search request, Remote system, Local search utility, Dataset to search)
- Daemon on remote host accepts agent and allows
execution
- Increased network load for large agents
7Advantages of Mobile Agents
- Metrics can be defined at any time, and
implemented on the central host
- Performance is measured on the relevant host
- Aglets system is Java-based, providing
platform-independent execution
- Sophisticated security model exists for
restricting actions of the agents
8Use of Mobile Agents In Monitoring
- Simplest approach, Single-Remote-Host, was
implemented for initial configuration
- Waiting between tests is done on central server
for reliability
(Diagram: Itinerary approach and Single-Remote-Host approach, each panel showing a central server and several target hosts)
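The difference between the two mobility patterns can be made concrete by counting agent transfers. The sketch below assumes one agent per target host in the Single-Remote-Host approach (out and back) and a single agent visiting every host in turn in the Itinerary approach; the function names and these assumptions are illustrative, not taken from the slides.

```python
def transfers_at_central_server(num_hosts, pattern):
    """Count agent transfers that touch the central server in one round."""
    if num_hosts < 1:
        raise ValueError("need at least one target host")
    if pattern == "single":
        # Each agent leaves the central server and returns to it.
        return 2 * num_hosts
    if pattern == "itinerary":
        # One departure, one return, regardless of the number of hosts.
        return 2
    raise ValueError("pattern must be 'single' or 'itinerary'")


def total_hops(num_hosts, pattern):
    """Count all network transfers made by agents in one round."""
    if num_hosts < 1:
        raise ValueError("need at least one target host")
    if pattern == "single":
        return 2 * num_hosts
    if pattern == "itinerary":
        # central -> host 1 -> ... -> host N -> central
        return num_hosts + 1
    raise ValueError("pattern must be 'single' or 'itinerary'")
```

Under these assumptions, six target hosts give 12 transfers at the central server for Single-Remote-Host but only 2 for an itinerary, at the cost of the agent carrying results along its route.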
9Anatomy of an Instrument
The code defining a specific implementation of an
Instrument is about 30 lines
(Class diagram with two "Inherits from" relationships)
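One way a specific Instrument can stay this small is for a shared base class to handle parameters, time-stamping and result packaging, leaving only the measurement itself to the subclass. The sketch below is in Python rather than the Java used by Aglets, and the class and method names are hypothetical; it illustrates the inheritance idea only, not the actual class hierarchy.

```python
import time
from abc import ABC, abstractmethod


class Instrument(ABC):
    """Hypothetical base class: holds parameters and packages results."""

    def __init__(self, parameters):
        self.parameters = dict(parameters)

    @abstractmethod
    def measure(self):
        """Perform one measurement and return a number."""

    def run(self, clock=time.time):
        """Take one measurement and return it with a timestamp."""
        return {
            "timestamp": clock(),
            "parameters": self.parameters,
            "value": self.measure(),
        }


class SleepLatency(Instrument):
    """Toy instrument: measures how long a requested sleep actually takes."""

    def measure(self):
        start = time.perf_counter()
        time.sleep(self.parameters["seconds"])
        return time.perf_counter() - start
```

A new metric then only needs to override `measure`; for example, `SleepLatency({"seconds": 0.01}).run()` returns a dictionary with the timestamp, the parameters and the measured value.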
10Test Instrument File Access
- NFS access time (write) used as test of concept
- File size, location (file-system) are passed as
parameters in database (specified at run-time)
- Measurements are started by automated process as
specified by Schedule table in database
- Tested access to one file-system on several
client computers
- Linux (PIII) system with NFSv2, 1KB blocksize
- Linux (PIII) system with NFSv2, 8KB blocksize
- Linux (PIII) system with NFSv3
- Solaris system with NFSv3
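A write-timing measurement of this kind can be sketched as follows: write a file of a given size into a given directory and report the elapsed time. This is a minimal Python sketch, not the test Instrument itself; `time_file_write` and its parameters are hypothetical, and `block_kb` is only the size of each application-level write, not the NFS mount block size listed above.

```python
import os
import tempfile
import time


def time_file_write(directory, size_kb, block_kb=8):
    """Write size_kb kilobytes to a new file in `directory`; return elapsed seconds."""
    block = b"\0" * (block_kb * 1024)
    remaining = size_kb * 1024
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            while remaining > 0:
                chunk = block[:remaining]
                f.write(chunk)
                remaining -= len(chunk)
            f.flush()
            # Ask the OS to push the data to the file system rather than
            # leaving it in the client-side cache.
            os.fsync(f.fileno())
        # The file is closed at this point, so the close is included in the timing.
        return time.perf_counter() - start
    finally:
        os.remove(path)
```

Pointing `directory` at an NFS-mounted path would time the write as seen by that client; the temporary file is removed after each measurement.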
11Report Generation Tool
- Sample tests are carried out automatically by a
Scheduler Aglet
- Reports are requested via an HTML form. Users
specify a test type, parameter set and target
host. A Perl CGI script queries the database and
plots results using Gnuplot.
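The scheduling step can be sketched as a function that compares each schedule entry with the time of its last run and returns the tests that are due. The layout of the Schedule table is not given in the slides, so the field names (`test`, `host`, `interval_s`) and the function name below are assumptions for illustration.

```python
def due_tests(schedule, last_run, now):
    """Return (test, host) pairs whose sampling interval has elapsed.

    schedule: list of dicts with keys 'test', 'host', 'interval_s'
    last_run: dict mapping (test, host) to the time of the previous run
    now:      current time, in the same units as last_run
    """
    due = []
    for entry in schedule:
        key = (entry["test"], entry["host"])
        previous = last_run.get(key)
        # Never-run tests are due immediately.
        if previous is None or now - previous >= entry["interval_s"]:
            due.append(key)
    return due
```

Changing `interval_s` for one entry varies the sampling rate of that test without touching the others.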
12Sample Report for File access
Results indicate server load, client config
(Plot annotations: nightly backups, weekly de-frag)
13Problems With the Mobile Agents
- Transfer interrupted when several agents move to
/ from the same host within about 1-2 sec
- Small size of Aglets currently used (about 15KB)
cannot explain the effective dead-time
- The failure is presented to the Aglet as a
refusal (can detect, wait and retry)
- Congestion at central host can be relieved by
following a circuit before returning (multiple
hosts)
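The "detect, wait and retry" workaround can be sketched as a retry loop around the transfer. The version below is plain Python, not the Aglets API: `TransferRefused` and `dispatch_with_retry` are hypothetical names, and the exponential backoff with random jitter is an assumption added so that agents refused at the same moment do not all retry together.

```python
import random
import time


class TransferRefused(Exception):
    """Hypothetical stand-in for the refusal an agent sees when a transfer fails."""


def dispatch_with_retry(dispatch, max_attempts=5, base_delay=1.0,
                        sleep=time.sleep, rng=random.random):
    """Call dispatch(); on refusal wait and retry, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return dispatch()
        except TransferRefused:
            if attempt == max_attempts:
                raise
            # Double the wait after each refusal and scale it by a random
            # factor between 0.5 and 1.5 to spread out competing agents.
            delay = base_delay * 2 ** (attempt - 1) * (0.5 + rng())
            sleep(delay)
```

With a base delay on the order of the observed 1-2 second dead-time, the first retry would normally fall outside the window in which the refusal occurred.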
14Future System Development
- Solve transfer interruption problem
- Development of other mobility patterns
- NFS read-access may be tested by writing on one
host and timing a read on a different host (to
avoid caching)
- Use of itinerary can ease network congestion at
the central server
- A tracking / error-reporting system is being
developed, and will be connected to a paging
system
15Summary
- Initial implementation is proving useful
- Mobile agent architecture adds design work but
eases implementation, adds flexibility
- Transfer interruption is causing scalability
problems, but these are not insurmountable
- Plan to have expanded system running before
data-taking begins
16Questions...
Richard Ibbotson, BNL ibbotson_at_bnl.gov
Thanks to
David Stampf, BNL; Tom Throwe, BNL; Bruce Gibbard, BNL