Title: Grid Monitoring Futures with Globus
1Grid Monitoring Futures with Globus
- Jennifer M. Schopf
- Argonne National Lab
- April 2003
2My Definitions
- Grid
- Shared resources
- Coordinated problem solving
- Multiple sites (multiple institutions)
- Monitoring
- Discovery
- Registry service
- Contains descriptions of data that is available
- Expression of data
- Access to sensors, archives, etc.
3What do I mean by Grid monitoring?
- Different levels of monitoring needed
- Application specific
- Node level
- Cluster/site Level
- Grid level
- Grid level monitoring concerns data
- Shared between administrative domains
- For use by multiple people
- (think scalability)
4Grid Monitoring Does Not Include
- All the data about every node of every site
- Years of utilization logs to use for planning next hardware purchase
- Low-level application progress details for a single user
- Application debugging data (except perhaps notification of a failure of a heartbeat)
- Point-to-point sharing of all data over all sites
5Overview of This Talk
- Evaluation of information infrastructures
- Globus Toolkit MDS2, R-GMA, Hawkeye
- Insights into performance issues
- (publication at HPDC 2003)
- What monitoring and discovery could be
- Next-generation information architecture
- Open Grid Services Architecture mechanisms
- Integrated monitoring discovery arch for GT3
6Performance and the Grid
- It's not enough to use the Grid; it has to perform. Otherwise, why bother?
- First prototypes rarely consider performance (tradeoff with development time)
- MDS1: centralized LDAP
- MDS2: decentralized LDAP
- MDS3: decentralized Grid service
- Often performance is simply not known
7Globus Monitoring and Discovery Service (MDS2)
- Part of Globus Toolkit, compatible with other elements
- Used most often for resource selection
- Aid user/agent to identify host(s) on which to run an application
- Standard mechanism for publishing and discovery
- Decentralized, hierarchical structure
- Soft-state protocols
- Caching
- Grid Security Infrastructure credentials
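A minimal sketch of the soft-state idea listed above: a registration stays visible only while it keeps being refreshed. This is an illustration, not MDS2 code; the class name `SoftStateRegistry`, the `ttl` parameter, and the injectable `clock` are assumptions made for the example.

```python
import time


class SoftStateRegistry:
    """Hypothetical registry in which entries expire unless refreshed."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl          # seconds a registration stays valid
        self.clock = clock      # injectable so the sketch can be tested
        self._entries = {}      # name -> (description, expiry time)

    def register(self, name, description):
        # Registering again under the same name refreshes the expiry.
        self._entries[name] = (description, self.clock() + self.ttl)

    def lookup(self):
        # Drop expired entries, then return whatever is still live.
        now = self.clock()
        self._entries = {n: e for n, e in self._entries.items() if e[1] > now}
        return {n: e[0] for n, e in self._entries.items()}
```

A resource that stops refreshing simply disappears from `lookup()` after `ttl` seconds, so no explicit deregistration message is needed.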
8MDS2 Architecture
9Relational Grid Monitoring Architecture (R-GMA)
- Implementation of the Grid Monitoring Architecture (GMA) defined within the Global Grid Forum (GGF)
- Three components
- Consumers
- Producers
- Registry
- GMA as defined currently does not specify the
protocols or the underlying data model to be
used.
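The three GMA roles can be sketched as follows. Since GMA does not fix protocols or a data model, everything here (class names, method names, the dictionary of readings) is a hypothetical in-process stand-in, meant only to show that the registry is used for discovery while data flows directly from producer to consumer.

```python
class Registry:
    """Hypothetical directory mapping an event type to its producers."""

    def __init__(self):
        self._producers = {}

    def register(self, event_type, producer):
        self._producers.setdefault(event_type, []).append(producer)

    def find(self, event_type):
        return list(self._producers.get(event_type, []))


class Producer:
    """Hypothetical data source that answers queries for its readings."""

    def __init__(self, name, readings):
        self.name = name
        self._readings = readings

    def query(self, event_type):
        return self._readings.get(event_type)


class Consumer:
    """Hypothetical client: discover via the registry, then ask producers."""

    def __init__(self, registry):
        self.registry = registry

    def collect(self, event_type):
        # The registry only says who has the data; the data itself
        # comes straight from each producer.
        producers = self.registry.find(event_type)
        return {p.name: p.query(event_type) for p in producers}
```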
10GGF Grid Monitoring Architecture
11R-GMA
- Monitoring used in the EU Datagrid Project
- Steve Fisher, RAL, and James Magowan, IBM-UK
- Based on the relational data model
- Used Java Servlet technologies
- Focus on notification of events
- User can subscribe to a flow of data with
specific properties directly from a data source
12R-GMA Architecture
13Hawkeye
- Developed by Condor Group
- Focus on automatic problem detection
- Underlying infrastructure builds on the Condor and ClassAd technologies
- Condor ClassAd Language to identify resources in a pool
- ClassAd Matchmaking to execute jobs based on attribute values of resources to identify problems in a pool
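The matchmaking idea can be illustrated with a rough Python analogy. This is not ClassAd syntax and not Hawkeye code: resources are plain dictionaries, the trigger's requirement is an ordinary Python function, and the attribute names are invented for the example.

```python
def match(trigger, resource_ads):
    """Return the names of resources whose attributes satisfy the trigger."""
    requirement = trigger["requirements"]
    return [ad["name"] for ad in resource_ads if requirement(ad)]


# Hypothetical resource descriptions, one per machine in a pool.
pool = [
    {"name": "node1", "load_avg": 0.3, "disk_free_mb": 12000},
    {"name": "node2", "load_avg": 7.5, "disk_free_mb": 9000},
    {"name": "node3", "load_avg": 0.1, "disk_free_mb": 50},
]

# Hypothetical trigger: flag machines that are overloaded or nearly full.
problem_trigger = {
    "requirements": lambda ad: ad["load_avg"] > 5.0 or ad["disk_free_mb"] < 100,
}

problem_nodes = match(problem_trigger, pool)
```

In this analogy, a matched resource is one on which a corrective or reporting job would be run.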
14Hawkeye Architecture
15Comparing Information Systems
16Some Architecture Considerations
- Similar functional components
- Grid-wide for MDS2, R-GMA; pool for Hawkeye
- Global schema
- Different use cases will lead to different strengths
- GIIS for decentralized registry; no standard protocol to distribute multiple R-GMA registries
- R-GMA meant for streaming data (currently used for network data); Hawkeye and MDS2 for single queries
- Push vs. pull
- MDS2 is pull only
- R-GMA allows push and pull
- Hawkeye allows triggers (push model)
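The push/pull distinction can be sketched in a few lines. The `Sensor` class and its methods are hypothetical and do not correspond to any of the three systems' APIs: `read()` models a pull (the client asks when it wants data), and `subscribe()` models a push (the source calls the client when data changes).

```python
class Sensor:
    """Hypothetical data source supporting both access models."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def read(self):
        # Pull: the client decides when to ask.
        return self._value

    def subscribe(self, callback):
        # Push: the client registers interest once.
        self._subscribers.append(callback)

    def update(self, value):
        # Each change is delivered to every subscriber.
        self._value = value
        for callback in self._subscribers:
            callback(value)
```

With pull, a client that polls rarely can miss intermediate values; with push, the source does the work of delivering every change.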
17Experiments
- How many users can query an information server at a time?
- How many users can query a directory server?
- How does an information server scale with the amount of data in it?
- How does an aggregator scale with the number of information servers registered to it?
18Testbed
- Lucky cluster at Argonne
- 7 nodes, each has two 1133 MHz Intel PIII CPUs (with a 512 KB cache) and 512 MB main memory
- Users simulated at the UC nodes
- 20 P3 Linux nodes, mostly 1.1 GHz
- R-GMA has an issue with the shared file system, so we also simulated users on Lucky nodes
- All figures are 10 minute averages
- Queries happening with a one second wait between each query (think synchronous send with a 1 second wait)
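The query pattern described above (a synchronous query followed by a one-second wait) might look like the loop below. This is a sketch of the pattern only, not the scripts used in the study; the function name and the injectable `sleep` and `clock` parameters are assumptions made for the example.

```python
import time


def simulate_user(query, n_queries, wait=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Issue n_queries synchronous queries, pausing `wait` seconds after each.

    `query` is any callable that blocks until the server has answered.
    Returns the observed response time of every query, in seconds.
    """
    latencies = []
    for _ in range(n_queries):
        start = clock()
        query()                          # blocks until the reply arrives
        latencies.append(clock() - start)
        sleep(wait)                      # think time before the next query
    return latencies
```

Running many such loops concurrently (for example one per thread or process) would approximate many simultaneous users.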
19Metrics
- Throughput
- Number of requests processed per second
- Response time
- Average amount of time (in sec) to handle a request
- Load
- Percentage of CPU cycles spent in user mode and system mode, recorded by Ganglia
- High when running a small number of compute-intensive apps
- Load1
- Average number of processes in the ready queue waiting to run, 1 minute average, from Ganglia
- High when a large number of apps are blocking on I/O
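The first two metrics can be computed from a simple request log. The sketch below assumes a hypothetical log of `(start, end)` timestamps for completed requests within one measurement window; the load metrics come from Ganglia and are not reproduced here.

```python
def summarize(requests, window_seconds):
    """Compute throughput and mean response time for one window.

    requests: list of (start, end) timestamps, in seconds, for
              requests completed inside the window.
    """
    if not requests:
        return {"throughput": 0.0, "response_time": None}
    total_latency = sum(end - start for start, end in requests)
    return {
        # requests processed per second
        "throughput": len(requests) / window_seconds,
        # average seconds to handle one request
        "response_time": total_latency / len(requests),
    }
```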
20Performance of Information Servers vs. Number of
Users
21Experiment 1 Summary
- Caching can significantly improve performance of the information server
- Particularly desirable if one wishes the server to scale well with an increasing number of users
- When setting up an information server, care should be taken to make sure the server is on a well-connected machine
- Network behavior plays a larger role than expected
- If this is not an option, thought should be given to duplicating the server if more than 200 users are expected to query it
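Why caching helps with many users can be seen in a small time-to-live cache placed in front of an information provider. This is a generic sketch, not the caching code of any of the systems studied; `CachingServer`, `provider`, and `ttl` are names invented for the example.

```python
import time


class CachingServer:
    """Hypothetical server that reuses a provider's answer for ttl seconds."""

    def __init__(self, provider, ttl, clock=time.monotonic):
        self.provider = provider      # callable that gathers fresh data
        self.ttl = ttl
        self.clock = clock
        self.provider_calls = 0       # how often fresh data was gathered
        self._cached = None
        self._expires = None          # None means nothing cached yet

    def query(self):
        now = self.clock()
        if self._expires is None or now >= self._expires:
            # Cache miss: pay the cost of running the provider once.
            self._cached = self.provider()
            self._expires = now + self.ttl
            self.provider_calls += 1
        return self._cached
```

However many users query within one `ttl` interval, the provider runs once, so the cost per query is dominated by the lookup and the network rather than by data collection.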
22Directory Server Scalability
23Experiment 2 Summary
- Because of the network contention issues, the placement of a directory server on a highly connected machine will play a large role in the scalability as the number of users grows
- Significant loads are seen even with only a few users, so it will be important that this service be run on a dedicated machine, or that it be duplicated as the number of users grows
24Information Service Throughput vs. Num. of
Information Collectors
25Experiment 3 Summary
- Too many information collectors is a performance bottleneck
- Caching data helps
- Alternatively, register to more instances of information servers, with each handling a subset of the collectors
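One simple way to give each information server a subset of the collectors is a round-robin split, sketched below. The function name and the round-robin policy are assumptions for illustration; the talk does not prescribe how the subsets should be chosen.

```python
def partition(collectors, n_servers):
    """Assign collectors to n_servers groups in round-robin order."""
    groups = [[] for _ in range(n_servers)]
    for i, collector in enumerate(collectors):
        groups[i % n_servers].append(collector)
    return groups
```

Each group would then be registered to its own information server instance, keeping the number of collectors per server bounded.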
26Overall Results
- Performance can be a matter of deployment
- Effect of background load
- Effect of network bandwidth
- Performance can be affected by underlying infrastructure
- LDAP/Java strengths and weaknesses
- Performance can be improved using standard techniques
- Caching, multi-threading, etc.
27So what could monitoring be?
- Basic functionality
- Push and pull (subscription and notification)
- Aggregation and Caching
- More information available
- More higher-level services
- Triggers like Hawkeye
- Viz of archive data like Ganglia
- Plug and Play
- Well defined protocols, interfaces and schemas
- Performance considerations
- Easy searching
- Keep load off of clients
28Topics
- Evaluation of information infrastructures
- Globus Toolkit MDS2, R-GMA, Hawkeye
- Throughput, response time, load
- Insights into performance issues
- What monitoring and discovery could be
- Next-generation information architecture
- Open Grid Services Architecture mechanisms
- Integrated monitoring discovery arch for GT3
29Open Grid Services Architecture (OGSA)
- Defines standard interfaces and behaviors for distributed system integration, especially
- Standard XML-based service information model
- Standard interfaces for push and pull mode access to service data
- Notification and subscription
30Key OGSI concept - serviceData
- Every service has its own service data
- OGSA has a common mechanism to expose a service instance's state data to service requestors for query, update and change notification
- Monitoring data is baked right in
- Service-level concept, not host-level concept
31serviceData
- Every Grid Service can expose internal state as serviceData elements
- An XML element of arbitrary complexity
- Each service has a serviceData set
- The collection of serviceData Elements (SDEs)
- Example: state of a host is exposed as an SDE by GRAM.
- Similar to MDS2 GRIS functionality, but in each service (rather than once per host)
32Example: Reliable File Transfer Service
File Transfer
Internal State
Data transfer operations
33MDS3 Monitoring and Discovery System
- Consists of various components
- Core functionality
- Information providers
- Higher level services
- Clients
34Core Functionality
- XPath support
- XPath is a language that describes a way to locate and process items in XML docs by using an addressing syntax based on a path through the document's logical structure or hierarchy
- Xindice support: native XML database
- Registry support
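As an illustration of path-based addressing, the sketch below queries a small XML document with Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. The document layout (element and attribute names such as `serviceDataSet`, `Host`, `MemoryMB`) is invented for the example and is not the actual GT3 service data schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical service data; real GT3 schemas and namespaces differ.
DOC = """
<serviceDataSet>
  <serviceData name="Host">
    <Host name="lucky1"><CPUCount>2</CPUCount><MemoryMB>512</MemoryMB></Host>
  </serviceData>
  <serviceData name="Host">
    <Host name="lucky2"><CPUCount>2</CPUCount><MemoryMB>512</MemoryMB></Host>
  </serviceData>
</serviceDataSet>
""".strip()

root = ET.fromstring(DOC)

# Path through the hierarchy: every Host under any serviceData element.
host_names = [h.get("name") for h in root.findall("./serviceData/Host")]

# Predicate on an attribute: memory of one specific host.
memory_mb = int(root.find(".//Host[@name='lucky2']/MemoryMB").text)
```

The same two ideas, a path through the hierarchy and a predicate that filters nodes, carry over to full XPath implementations.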
35Schema Issues
- Need to keep track of service data schema
- Avoid conflicts
- Find the data more easily
- Should really have a unified naming approach
- All of the tools are schema-agnostic, but interoperability needs a well-understood common language
36MDS3 Information Providers in June Release
- All the data currently in core MDS2
- Full data in the GLUE schema for compute elements (CE)
- Ganglia information provider for cluster data will also be available from Ganglia folks (with luck)
- Service data from RFT, RLS, GRAM
- GT2 to GT3 work
- GridFTP server data
- Software version and path data
- Documentation for translating your GT2 information provider to a GT3 information provider
37MDS3 Higher Level Products
- Higher-level services can perform actions on service data collected from other services
- Part of this functionality can be supplied by a set of provided building blocks
- Provider interface: GRIS-style API for writing information providers
- Service Data Aggregator: sets up subscriptions to data for other services, and publishes it as a single data stream
- Hierarchy Builder: allows for a hierarchy of aggregators
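The aggregator and hierarchy building blocks can be sketched as below. This is an in-process illustration with invented class names, not the MDS3 interfaces: an aggregator subscribes to several services and republishes everything as one stream, and because it is itself a service, another aggregator can subscribe to it to form a hierarchy.

```python
class Service:
    """Hypothetical service that pushes data to its subscribers."""

    def __init__(self, name):
        self.name = name
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, data):
        for callback in self._subscribers:
            callback(self.name, data)


class Aggregator(Service):
    """Hypothetical aggregator: many subscriptions in, one stream out."""

    def __init__(self, name, sources):
        super().__init__(name)
        for source in sources:
            source.subscribe(self._forward)

    def _forward(self, source_name, data):
        # Tag each item with its origin and republish it on our own stream.
        self.publish({"source": source_name, "data": data})
```

A client then needs a single subscription to the aggregator instead of one subscription per underlying service.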
38MDS3 Index Server
- Simplest higher-level service is the caching index service
- Much like the GIIS in MDS2
- Will have configurability like a GIIS hierarchy
- Will also have PHP-style scripts, much as available today
40Clients currently in GT3
- findServiceData command line client
- Same functionality as grid-info-search
- C bindings
- Core C bindings provide findServiceData C function
- findServiceData command line client gives an example of using it to parse out information (in this case, registry contents)
41Service Data Browser
- GUI client to display service data from any service
- Extensible for data-specific visualization
- A version was released with GT3 alpha
- http://www.globus.org/ogsa/releases/alpha/docs/infosvcs/sdbquickstart.html
42Comparing Information Systems
43Is this enough?
- No!
- Many places where additional help developing MDS3
is needed
44We Need More Basic Information
- Interfaces to other sources of data
- GPT data
- Other monitoring systems
- Others?
- Service data from other components
- Every service has service data
- OGSA-DAI
- Will need to interface on schema
45We Will Need More GUIs and Clients
- Additional GUI visualizers may be implemented to display service data specific to a particular port type (as part of service data browser)
- Additional client interfaces, possibly
- Integration into current portals, brokers
46We Need More Higher Level Services
- We have a couple planned
- Archiving service
- Trigger template
47Post-3.0 release: Archiving Service
- Will allow subscription to service data
- Logging in a flexible way
- Well defined interfaces for mining
- Open questions
- Best way to store time-series of arbitrary XML?
- Best way to query this archive?
- Link to OGSA-DAI?
- Link to other archivers?
48Post-3.0 release: Trigger Template
- Will provide a template to allow subscription to data, reasoning about that data, and a course of action to take place
- Essentially, a gateway service between OGSA Notifications and some other notification framework, with filtering of notifications
- Example: subscribe to disk space information, send mail to sys admin when it reaches 90% full
- Needed: trigger template and several small examples of common triggers, and documentation for how users could extend them or write new ones.
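The disk space example can be sketched as a condition plus an action attached to incoming notifications. This is a hypothetical shape for such a template, not the planned MDS3 design: the `Trigger` class, the `disk_used_fraction` field, and the stand-in for sending mail are all assumptions.

```python
class Trigger:
    """Hypothetical trigger: run `action` when `condition` holds."""

    def __init__(self, condition, action):
        self.condition = condition    # reasoning about the data
        self.action = action          # course of action to take

    def on_notification(self, data):
        # Called once per notification from the subscribed service.
        if self.condition(data):
            self.action(data)


# Stand-in for sending mail: collect the messages in a list.
outbox = []


def mail_sysadmin(data):
    outbox.append("Disk on %s is %d%% full"
                  % (data["host"], round(data["disk_used_fraction"] * 100)))


disk_trigger = Trigger(
    condition=lambda d: d["disk_used_fraction"] >= 0.90,
    action=mail_sysadmin,
)

disk_trigger.on_notification({"host": "lucky1", "disk_used_fraction": 0.42})
disk_trigger.on_notification({"host": "lucky2", "disk_used_fraction": 0.93})
```

Users could extend such a template by supplying a different condition or action, which is the kind of extension the documentation mentioned above would need to cover.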
49Other Possible Higher Level Services
- Site Validation Service
- Job Tracking Service
- Interfacing to Netlogger?
50We Need Security
51Summary
- Current monitoring systems
- Insights into performance issues
- What we really want for monitoring and discovery is a combination of all the current systems
- Next-generation information architecture
- Open Grid Services Architecture mechanisms
- MDS3 plans
- Additional work needed!
52Thanks
- Testbed/Experiment support and comments
- John Mcgee, ISI; James Magowan, IBM-UK; Alain Roy and Nick LeRoy at University of Wisconsin, Madison; Scott Gose and Charles Bacon, ANL; Steve Fisher, RAL; Brian Tierney and Dan Gunter, LBNL.
- This work was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under contract W-31-109-Eng-38. This work was also supported by DOESG SciDAC Grant, iVDGL from NSF, and others.
53Additional Information
- MDS3 technology coordinators
- Ben Clifford (benc_at_isi.edu)
- Jennifer Schopf (jms_at_mcs.anl.gov)
- Zhang, Freschl and Schopf, "A Performance Study of Monitoring and Information Services for Distributed Systems", to appear in HPDC 2003
- http://people.cs.uchicago.edu/hai/hpdcv25.doc
- MDS-3 information
- Soon at www.globus.org/mds
54Extra Slides
55Why Information Infrastructure?
- Distributed, often complex, performance-critical nature of Grid apps demands tools for
- Discovering available resources
- Discovering available sensors
- Integrating information from multiple sources
- Archiving and replaying historical information
- These and other functions are provided by an information infrastructure
- Many projects are concerned with design, deployment, evaluation, and application
56Performance of GIS Information Servers vs. Number
of Users
57Performance of GIS Information Servers vs. Number
of Users
58Performance of GIS Information Servers vs. Number
of Users
59Performance of GIS Information Servers vs. Number
of Users
60Directory Server Scalability
61Directory Server Scalability
62Directory Server Scalability
63Directory Server Scalability