Grid Monitoring Futures with Globus
  • Jennifer M. Schopf
  • Argonne National Lab
  • April 2003

My Definitions
  • Grid
  • Shared resources
  • Coordinated problem solving
  • Multiple sites (multiple institutions)
  • Monitoring
  • Discovery
  • Registry service
  • Contains descriptions of data that is available
  • Expression of data
  • Access to sensors, archives, etc.

What do I mean by Grid monitoring?
  • Different levels of monitoring needed
  • Application specific
  • Node level
  • Cluster/site Level
  • Grid level
  • Grid level monitoring concerns data
  • Shared between administrative domains
  • For use by multiple people
  • (think scalability)

Grid Monitoring Does Not Include
  • All the data about every node of every site
  • Years of utilization logs to use for planning
    next hardware purchase
  • Low-level application progress details for a
    single user
  • Application debugging data (except perhaps
    notification of a failure of a heartbeat)
  • Point-to-point sharing of all data over all sites

Overview of This Talk
  • Evaluation of information infrastructures
  • Globus Toolkit MDS2, R-GMA, Hawkeye
  • Insights into performance issues
  • (publication at HPDC 2003)
  • What monitoring and discovery could be
  • Next-generation information architecture
  • Open Grid Services Architecture mechanisms
  • Integrated monitoring discovery arch for GT3

Performance and the Grid
  • It's not enough to use the Grid; it has to
    perform. Otherwise, why bother?
  • First prototypes rarely consider performance
    (tradeoff with development time)
  • MDS1: centralized LDAP
  • MDS2: decentralized LDAP
  • MDS3: decentralized Grid service
  • Often performance is simply not known

Globus Monitoring and Discovery Service (MDS2)
  • Part of Globus Toolkit, compatible with other
  • Used most often for resource selection
  • aid user/agent to identify host(s) on which to
    run an application
  • Standard mechanism for publishing and discovery
  • Decentralized, hierarchical structure
  • Soft-state protocols
  • Caching
  • Grid Security Infrastructure credentials
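
To make the soft-state idea concrete, here is a minimal sketch, not MDS2 code: a registration is only valid for a limited lifetime and disappears unless the registrant refreshes it. The class name SoftStateRegistry, its methods, and the injectable clock are all assumptions made for illustration.

```python
import time


class SoftStateRegistry:
    """Hypothetical registry whose entries vanish unless refreshed in time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable clock, so the sketch is testable
        self._expiry = {}        # resource name -> absolute expiry time

    def register(self, name, ttl_seconds):
        # Registering again simply pushes the expiry time forward (a refresh).
        self._expiry[name] = self._clock() + ttl_seconds

    def live_entries(self):
        now = self._clock()
        # Drop every entry that was not refreshed before its lifetime ran out.
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)


if __name__ == "__main__":
    registry = SoftStateRegistry()
    registry.register("hostA", ttl_seconds=30)
    print(registry.live_entries())
```

The design point is that nothing has to be explicitly deregistered: a resource that crashes or leaves simply stops refreshing and ages out.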

MDS2 Architecture
Relational Grid Monitoring Architecture (R-GMA)
  • Implementation of the Grid Monitoring
    Architecture (GMA) defined within the Global Grid
    Forum (GGF)
  • Three components
  • Consumers
  • Producers
  • Registry
  • GMA as defined currently does not specify the
    protocols or the underlying data model to be used

GGF Grid Monitoring Architecture
R-GMA
  • Monitoring used in the EU Datagrid Project
  • Steve Fisher, RAL, and James Magowan, IBM-UK
  • Based on the relational data model
  • Used Java Servlet technologies
  • Focus on notification of events
  • User can subscribe to a flow of data with
    specific properties directly from a data source

R-GMA Architecture
Hawkeye
  • Developed by Condor Group
  • Focus: automatic problem detection
  • Underlying infrastructure builds on the Condor
    and ClassAd technologies
  • Condor ClassAd Language to identify resources in
    a pool
  • ClassAd Matchmaking to execute jobs based on
    attribute values of resources to identify
    problems in a pool

Hawkeye Architecture
Comparing Information Systems
Some Architecture Considerations
  • Similar functional components
  • Grid-wide for MDS2 and R-GMA; pool for Hawkeye
  • Global schema
  • Different use cases will lead to different
  • GIIS for decentralized registry; no standard
    protocol to distribute multiple R-GMA registries
  • R-GMA meant for streaming data, currently used
    for network data; Hawkeye and MDS2 for single queries
  • Push vs Pull
  • MDS2 is PULL only
  • R-GMA allows push and pull
  • Hawkeye allows triggers (push model)
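
The push versus pull distinction can be sketched in a few lines. This is an illustrative toy, not the interface of MDS2, R-GMA, or Hawkeye; the Sensor class and its method names are assumptions.

```python
class Sensor:
    """Hypothetical data source that supports both access styles."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def query(self):
        # Pull: the consumer asks whenever it wants the current value.
        return self._value

    def subscribe(self, callback):
        # Push: the consumer registers once and is called on every change.
        self._subscribers.append(callback)

    def update(self, value):
        self._value = value
        for callback in self._subscribers:
            callback(value)


if __name__ == "__main__":
    sensor = Sensor(0.25)
    received = []
    sensor.subscribe(received.append)   # push-style consumer
    sensor.update(0.5)
    sensor.update(0.75)
    print(sensor.query())               # pull-style consumer
    print(received)
```

With pull, the cost lands on the server each time someone asks; with push, the server does work only when the data changes, which is why streaming use cases tend to favor it.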

  • How many users can query an information server at
    a time?
  • How many users can query a directory server?
  • How does an information server scale with the
    amount of data in it?
  • How does an aggregator scale with the number of
    information servers registered to it?

  • Lucky cluster at Argonne
  • 7 nodes, each has two 1133 MHz Intel PIII CPUs
    (with a 512 KB cache) and 512 MB main memory
  • Users simulated at the UC nodes
  • 20 P3 Linux nodes, mostly 1.1 GHz
  • R-GMA has an issue with the shared file system,
    so we also simulated users on Lucky nodes
  • All figures are 10 minute averages
  • Queries happening with a one second wait between
    each query (think synchronous send with a 1
    second wait)

  • Throughput
  • Number of requests processed per second
  • Response time
  • Average amount of time (in sec) to handle a request
  • Load
  • percentage of CPU cycles spent in user mode and
    system mode, recorded by Ganglia
  • High when running a small number of compute-intensive apps
  • Load1
  • average number of processes in the ready queue
    waiting to run, 1 minute average, from Ganglia
  • High when a large number of apps are blocking on I/O

Performance of Information Servers vs. Number of Users
Experiment 1 Summary
  • Caching can significantly improve performance of
    the information server
  • Particularly desirable if one wishes the server
    to scale well with an increasing number of users
  • When setting up an information server, care
    should be taken to make sure the server is on a
    well-connected machine
  • Network behavior plays a larger role than
  • If this is not an option, thought should be given
    to duplicating the server if more than 200 users
    are expected to query it

Directory Server Scalability
Experiment 2 Summary
  • Because of the network contention issues, the
    placement of a directory server on a highly
    connected machine will play a large role in the
    scalability as the number of users grows
  • Significant loads are seen even with only a few
    users, so it will be important that this service
    be run on a dedicated machine, or that it be
    duplicated as the number of users grows

Information Service Throughput vs. Num. of
Information Collectors
Experiment 3 Summary
  • Too many information collectors is a performance
  • Caching data helps
  • Alternatively, register to more instances of
    information servers with each handling a subset
    of the collectors
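
One simple way to give each information server a subset of the collectors is a round-robin split. The function below is only an illustration; partition_collectors is a hypothetical name, and a real deployment might instead group collectors by site or by data type.

```python
def partition_collectors(collectors, num_servers):
    """Split collectors round-robin into one group per information server."""
    if num_servers < 1:
        raise ValueError("need at least one information server")
    groups = [[] for _ in range(num_servers)]
    for index, collector in enumerate(collectors):
        groups[index % num_servers].append(collector)
    return groups


if __name__ == "__main__":
    names = ["collector%d" % i for i in range(5)]
    for server_id, group in enumerate(partition_collectors(names, 2)):
        print(server_id, group)
```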

Overall Results
  • Performance can be a matter of deployment
  • Effect of background load
  • Effect of network bandwidth
  • Performance can be affected by underlying
  • LDAP/Java strengths and weaknesses
  • Performance can be improved using standard
  • Caching, multi-threading, etc.

So what could monitoring be?
  • Basic functionality
  • Push and pull (subscription and notification)
  • Aggregation and Caching
  • More information available
  • More higher-level services
  • Triggers like Hawkeye
  • Visualization of archive data like Ganglia
  • Plug and Play
  • Well defined protocols, interfaces and schemas
  • Performance considerations
  • Easy searching
  • Keep load off of clients

  • Evaluation of information infrastructures
  • Globus Toolkit MDS2, R-GMA, Hawkeye
  • Throughput, response time, load
  • Insights into performance issues
  • What monitoring and discovery could be
  • Next-generation information architecture
  • Open Grid Services Architecture mechanisms
  • Integrated monitoring discovery arch for GT3

Open Grid Services Architecture (OGSA)
  • Defines standard interfaces and behaviors for
    distributed system integration, especially
  • Standard XML-based service information model
  • Standard interfaces for push and pull mode access
    to service data
  • Notification and subscription

Key OGSI concept - serviceData
  • Every service has its own service data
  • OGSA has a common mechanism to expose a service
    instance's state data to service requestors for
    query, update, and change notification
  • Monitoring data is baked right in
  • Service-level concept, not host-level concept

  • Every Grid Service can expose internal state as
    serviceData elements
  • An XML element of arbitrary complexity
  • Each service has a serviceData set
  • The collection of serviceData Elements (SDEs)
  • Example: state of a host is exposed as an SDE by
  • Similar to MDS2 GRIS functionality, but in each
    service (rather than once per host)

Example: Reliable File Transfer Service
  • Figure labels: File Transfer, Internal State,
    Data transfer operations

MDS3 Monitoring and Discovery System
  • Consists of various components
  • Core functionality
  • Information providers
  • Higher level services
  • Clients

Core Functionality
  • XPath support
  • XPath is a language that describes a way to
    locate and process items in XML docs by using an
    addressing syntax based on a path through the
    document's logical structure or hierarchy
  • Xindice support (native XML database)
  • Registry support
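
A small example may help show what a path-based query over XML service data looks like. This sketch uses Python's standard xml.etree.ElementTree, which implements only a subset of XPath, and the element names (serviceDataSet, host, cpuLoad, freeDiskMB) are made up for illustration rather than taken from any GT3 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical service data; the element and attribute names are assumptions.
SERVICE_DATA = """
<serviceDataSet>
  <host name="lucky1"><cpuLoad>0.20</cpuLoad><freeDiskMB>5120</freeDiskMB></host>
  <host name="lucky2"><cpuLoad>0.95</cpuLoad><freeDiskMB>300</freeDiskMB></host>
</serviceDataSet>
"""


def select_text(xml_text, path):
    """Return the text of every element matched by a path expression."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]


if __name__ == "__main__":
    # A path through the hierarchy, narrowed by an attribute predicate.
    print(select_text(SERVICE_DATA, "./host[@name='lucky2']/cpuLoad"))
    # The same path without the predicate matches every host.
    print(select_text(SERVICE_DATA, "./host/freeDiskMB"))
```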

Schema Issues
  • Need to keep track of service data schema
  • Avoid conflicts
  • Find the data more easily
  • Should really have unified naming approach
  • All of the tools are schema-agnostic, but
    interoperability needs a well-understood common

MDS3 Information Providers in June Release
  • All the data currently in core MDS2
  • Full data in the GLUE schema for compute elements
  • Ganglia information provider for cluster data
    will also be available from Ganglia folks (with
  • Service data from RFT, RLS, GRAM
  • GT2 to GT3 work
  • GridFTP server data
  • Software version and path data
  • Documentation for translating your GT2
    information provider to a GT3 information provider

MDS3 Higher Level Products
  • Higher-level services can perform actions on
    service data collected from other services
  • Part of this functionality can be provided by a
    set of building blocks provided
  • Provider interface: GRIS-style API for writing
    information providers
  • Service Data Aggregator: set up subscriptions to
    data for other services, and publish it as a
    single data stream
  • Hierarchy Builder: allow for hierarchy of

MDS3 Index Server
  • Simplest higher-level service is the caching
    index service
  • Much like the GIIS in MDS2
  • Will have configurability like a GIIS hierarchy
  • Will also have PHP-style scripts, much as
    available today

Clients currently in GT3
  • findServiceData command line client
  • Same functionality as grid-info-search
  • C bindings
  • Core C bindings provide findServiceData C
  • findServiceData command line client gives an
    example of using it to parse out information (in
    this case, registry contents)

Service Data Browser
  • GUI client to display service data from any
  • Extensible for data-specific visualization
  • A version was released with GT3 alpha 
  • http//

Comparing Information Systems
Is this enough?
  • No!
  • Many places where additional help developing MDS3
    is needed

We Need More Basic Information
  • Interfaces to other sources of data
  • GPT data
  • Other monitoring systems
  • Others?
  • Service data from other components
  • Every service has service data
  • Will need to interface on schema

We Will Need More GUIs and Clients
  • Additional GUI visualizers may be implemented to
    display service data specific to a particular
    port type (as part of service data browser)
  • Additional client interfaces, possibly
  • Integration into current portals, brokers

We Need More Higher Level Services
  • We have a couple planned
  • Archiving service
  • Trigger template

Post-3.0 release: Archiving Service
  • Will allow subscription to service data
  • Logging in a flexible way
  • Well defined interfaces for mining
  • Open questions
  • Best way to store time-series of arbitrary XML?
  • Best way to query this archive?
  • Link to OGSA-DAI?
  • Link to other archivers?

Post-3.0 release: Trigger Template
  • Will provide a template to allow subscription to
    data, reasoning about that data, and a course of
    action to take place
  • Essentially, a gateway service between OGSA
    Notifications and some other notification
    framework, with filtering of notifications
  • Example: subscribe to disk space information,
    send mail to sys admin when it reaches 90% full
  • Needed: trigger template and several small
    examples of common triggers, and documentation
    for how users could extend them or write new ones
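
The subscribe, reason, act pattern can be sketched as follows. This is not the planned MDS3 template; Trigger, disk_nearly_full, and the notification fields (host, used_bytes, total_bytes) are hypothetical, and a list append stands in for sending mail.

```python
class Trigger:
    """Hypothetical trigger: a condition to test and an action to run."""

    def __init__(self, condition, action):
        self._condition = condition
        self._action = action

    def notify(self, data):
        """Handle one notification; return True if the action fired."""
        if self._condition(data):
            self._action(data)
            return True
        return False


def disk_nearly_full(data, threshold_percent=90):
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return data["used_bytes"] * 100 >= threshold_percent * data["total_bytes"]


if __name__ == "__main__":
    outbox = []   # stand-in for a real mail gateway

    def mail_sysadmin(data):
        outbox.append("disk on %s is at least 90%% full" % data["host"])

    trigger = Trigger(disk_nearly_full, mail_sysadmin)
    trigger.notify({"host": "lucky1", "used_bytes": 50, "total_bytes": 100})
    trigger.notify({"host": "lucky2", "used_bytes": 95, "total_bytes": 100})
    print(outbox)
```

Keeping the condition and the action as separate callables is what would let users extend a common trigger or write a new one without touching the subscription logic.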

Other Possible Higher Level Services
  • Site Validation Service
  • Job Tracking Service
  • Interfacing to Netlogger?

We Need Security
  • Need I say more?

  • Current monitoring systems
  • Insights into performance issues
  • What we really want for monitoring and discovery
    is a combination of all the current systems
  • Next-generation information architecture
  • Open Grid Services Architecture mechanisms
  • MDS3 plans
  • Additional work needed!

  • Testbed/Experiment support and comments
  • John Mcgee, ISI; James Magowan, IBM-UK; Alain Roy
    and Nick LeRoy at University of Wisconsin,
    Madison; Scott Gose and Charles Bacon, ANL; Steve
    Fisher, RAL; Brian Tierney and Dan Gunter, LBNL.
  • This work was supported in part by the
    Mathematical, Information, and Computational
    Sciences Division subprogram of the Office of
    Advanced Scientific Computing Research, U.S.
    Department of Energy, under contract
    W-31-109-Eng-38. This work was also supported by
    DOESG SciDAC Grant, iVDGL from NSF, and others.

Additional Information
  • MDS3 technology coordinators
  • Ben Clifford
  • Jennifer Schopf
  • Zhang, Freschl and Schopf, A Performance Study
    of Monitoring and Information Services for
    Distributed Systems, to appear in HPDC 2003
  • http//
  • MDS-3 information
  • Soon at

Extra Slides
Why Information Infrastructure?
  • Distributed, often complex, performance-critical
    nature of Grid apps demands tools for
  • Discovering available resources
  • Discovering available sensors
  • Integrating information from multiple sources
  • Archiving and replaying historical information
  • These and other functions are provided by an
    information infrastructure
  • Many projects are concerned with design,
    deployment, evaluation, and application

