Title: Autonomous Configuration of Grid Monitoring Systems

1. Autonomous Configuration of Grid Monitoring Systems
- Kenichiro Shirose (1), Satoshi Matsuoka (1,3), Hidemoto Nakada (1,2), Hirotaka Ogawa (2)
- (1) Tokyo Institute of Technology
- (2) National Institute of Advanced Industrial Science and Technology
- (3) National Institute of Informatics
2. Grid Monitoring System (1/2)
- In a practical Grid deployment, monitoring is a must at all levels:
- Resource brokering
- Accounting and auditing
- System administration
- User feedback
3. Grid Monitoring System (2/2)
- Monitoring system components are fundamentally distributed and subject to faults and reconfigurations
- Components are heavily mutually dependent
- Too many monitors → the probing effect becomes a potential problem
[Figure: a data request program querying data collect programs, each gathering measurements from sensors]
4. Goal
- Design and implementation of an autonomically managed Grid monitoring system
- Propose a framework for autonomic management of Grid monitoring systems
- Implement a prototype that autonomously configures and tolerates faults of NWS sensors
5. Grid Monitoring System Architecture
- Examples: NWS [Wolski et al.], MDS [Globus Alliance], R-GMA [EU DataGrid Project], Hawkeye [Condor Project]
- These systems consist of the common components of the Grid Monitoring Architecture [GGF, 2002]:
- Directory Service: supports publication and discovery of component information
- Producer: retrieves data from sources and makes them available to others
- Consumer: receives data from Producers and processes them
- Sources of data: collect the data
- (We employ NWS as the target, but the results are generalizable)
6. Overview of NWS
- NWS [Wolski 99] measures CPU and network availability on the Grid and forecasts their future values
- Nameserver: manages information on NWS components (clients inquire of the Nameserver)
- Memoryhost: gathers data from sensors (clients request monitored data from it)
- Sensor programs run on each machine and send their data to a Memoryhost
7. NWS clique-based network performance measurement
- Eliminates the O(n²) traffic pressure of all-pairs end-to-end network measurements
- Sensors on representative nodes measure end-to-end performance between cliques
- The measured value between the representing nodes is returned as an approximation for any pair of nodes drawn from the respective clique pair
[Figure: two cliques, each with a representing node; the representatives measure inter-clique performance]
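The clique approximation reduces to a simple lookup. A minimal sketch, assuming dict-based structures (`clique_of`, `rep_of`, `measured`) that are illustrative and not part of the NWS API:

```python
def approx_performance(a, b, clique_of, rep_of, measured):
    """Look up an approximate network measurement between nodes a and b.

    clique_of: node -> clique id
    rep_of:    clique id -> representing node of that clique
    measured:  sorted (x, y) pair -> measured value between x and y
    (All three structures are illustrative, not the NWS API.)
    """
    def key(x, y):
        return tuple(sorted((x, y)))

    ca, cb = clique_of[a], clique_of[b]
    if ca == cb:
        # within a clique, sensors measure every pair directly
        return measured[key(a, b)]
    # across cliques, the value between the representing nodes
    # stands in for any pair drawn from the two cliques
    return measured[key(rep_of[ca], rep_of[cb])]
```

With c cliques of k nodes each, only O(k²) intra-clique pairs plus O(c²) representative pairs are ever measured, instead of O((ck)²).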
8. Requirements of autonomic management on the Grid
- Must be aware of the correct configuration and fault-recovery tactics, based on various information including its own probing (especially the status of nodes and the network topology)
- Additional requirements:
- Applicability: support multiple, existing systems
- Scalability: scale to a large number of nodes
- Autonomy: manage with little or no user intervention
- Extensibility: possibility to incorporate various autonomic, self-management features
9. Four Steps of Autonomic Management
- Loop with a given time interval:
- Step 1: Forecast network topology; check status of nodes and processes
- Step 2: Form node groups (reconfiguring groups when nodes are added, removed, etc.)
- Step 3: Decide configuration (for all nodes at startup; for halted components at recovery)
- Step 4: Start up the components on assigned nodes
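The four-step loop above might be sketched as follows; the callables are placeholders for the prototype's real probing, grouping, configuration, and startup actions, and the `rounds` parameter is an illustrative stand-in for the real loop, which runs indefinitely:

```python
import time

def management_loop(nodes, probe, form_groups, configure, start,
                    rounds, interval=0.0):
    """Run the four management steps `rounds` times, pausing `interval`
    seconds between iterations (the real prototype loops forever)."""
    for _ in range(rounds):
        status = probe(nodes)             # step 1: topology and process status
        groups = form_groups(status)      # step 2: form node groups
        plan = configure(groups, status)  # step 3: decide the configuration
        start(plan)                       # step 4: start components on nodes
        time.sleep(interval)
```

Keeping the steps as injected callables also matches the deck's Applicability requirement: another monitoring system could supply its own implementations of the same four steps.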
10. Implementation of an Autonomic Grid Monitoring System Prototype
- Supports autonomic configuration, execution, and recovery of NWS components
- Input: a list of nodes (with attribute information)
- Four action steps:
- Measure RTT between nodes and collect the PIDs of NWS components
- Form node groups based on RTT
- (Re)configure the components
- Execute the components
11. RTT Measurement
- Each node runs an initialization script to measure RTT to the others and to collect the PIDs of the NWS components on the machine
- The management node executes ping and ps systematically, in parallel, in an n-by-n fashion
[Figure: management node dispatching initialization scripts; resource nodes exchange ICMP probes and return RTT and PID data]
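A minimal sketch of the per-node measurement, assuming Linux-style `ping` summary output and `ps -o pid,comm`-style listings; the NWS component names matched here are assumptions about the deployed binaries:

```python
import re
import subprocess

def parse_ping_rtt(ping_output):
    """Extract the average RTT (ms) from a Linux ping summary line,
    e.g. 'rtt min/avg/max/mdev = 0.123/0.456/0.789/0.100 ms'."""
    m = re.search(r"= [\d.]+/([\d.]+)/", ping_output)
    return float(m.group(1)) if m else None

def rtt_via_ping(host, count=1, timeout=5):
    """Measure RTT to host with the system ping (illustrative helper)."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True,
                         timeout=timeout).stdout
    return parse_ping_rtt(out)

def nws_pids(ps_output):
    """Pick PIDs of NWS components out of `ps -o pid,comm`-style output.
    The component names are assumptions about the installed binaries."""
    pids = {}
    for line in ps_output.splitlines():
        for comp in ("nws_nameserver", "nws_memory", "nws_sensor"):
            if comp in line:
                pids[comp] = int(line.split()[0])
    return pids
```

Each node would report its `rtt_via_ping` results and `nws_pids` dictionary back to the management node, which assembles the full n-by-n RTT matrix.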
12. Forming node groups
- For each node, find the most proximal node based on RTT
- Initialize: H is the set of all nodes, i = 1, m is an element of H
- Until H has no elements:
- If m ∈ H (it does not yet belong to any group): move m from H to Gi; m ← the most proximal node to m
- Else if i = 1 or m ∈ Gi (it belongs to the newest group): i ← i + 1; Gi ← ∅; m ← an element of H
- Else (m ∈ Gj for some j ∈ {1, 2, …, i−1}): merge Gi into Gj; Gi ← ∅; m ← an element of H
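One way to realize the grouping procedure above in code. This is a sketch: picking the smallest remaining node as the next element, and the tie-breaking inside `nearest`, are assumptions the slides leave open.

```python
def form_groups(nodes, rtt):
    """Group nodes by chains of most-proximal neighbours (slide 12).

    rtt[a][b] is the round-trip time from a to b.
    """
    def nearest(m):
        # most proximal node to m; ties broken by node order (assumption)
        return min((n for n in nodes if n != m), key=lambda n: rtt[m][n])

    H = set(nodes)
    groups = [set()]          # groups[i] plays the role of G_{i+1}
    i = 0
    m = min(H)                # "an element of H" (deterministic choice)
    while True:
        if m in H:            # m belongs to no group yet: extend the chain
            H.remove(m)
            groups[i].add(m)
            m = nearest(m)
        elif m in groups[i]:  # chain closed inside the newest group
            if not H:
                break
            groups.append(set())
            i += 1
            m = min(H)
        else:                 # m sits in an older group G_j: merge
            j = next(k for k in range(i) if m in groups[k])
            groups[j] |= groups[i]
            groups[i] = set()
            if not H:
                break
            m = min(H)
    return [g for g in groups if g]
```

On the six-node example of slides 30-35 (where 5 and 6, and 1 and 4, are mutual nearest neighbours, 3 points to 2, and 2 points to 5), this yields the two groups {2, 3, 5, 6} and {1, 4}.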
13. Configuration of components and execution
- Decide the nodes on which the NWS Nameserver and Memoryhosts will run:
- Nameserver: the node with the most connectivity to other nodes and the minimum average RTT from them
- Memoryhost: in each group, the node that is the most proximal node for the largest number of group members
- These nodes also serve as the clique representatives
- Create a script that executes the respective NWS components on the respective nodes
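The two placement rules can be sketched as follows, assuming `rtt` is the dict-of-dicts RTT matrix from the measurement step; the tie-breaking by node order is an assumption, not stated in the slides:

```python
def choose_nameserver(nodes, rtt):
    """Pick the Nameserver node: reachable by the most peers, then the
    minimum average RTT from those peers (sketch of slide 13's rule)."""
    def score(n):
        peers = [rtt[p][n] for p in nodes if p != n and n in rtt.get(p, {})]
        if not peers:
            return (0, float("inf"))      # unreachable node: worst score
        return (-len(peers), sum(peers) / len(peers))
    return min(sorted(nodes), key=score)

def choose_memoryhost(group, rtt):
    """Pick a group's Memoryhost: the node that is the most proximal
    node for the largest number of members of its group."""
    members = sorted(group)
    def nearest(m):
        return min((n for n in members if n != m), key=lambda n: rtt[m][n])
    votes = {n: 0 for n in members}
    for m in members:
        votes[nearest(m)] += 1
    return max(members, key=lambda n: votes[n])
```

The chosen nodes then double as the clique representatives for the inter-group NWS measurements of slide 7.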
14. Fault Recovery (1/2)
- Status of the nodes and components:
- Nodes: whether data collection executed properly
- Components: whether they are running
- The current prototype handles two types of faults in NWS:
- The node itself is active, but a component is not → restart the component
- The node executing the Nameserver or a Memoryhost is inactive → select an alternative node
15. Fault Recovery (2/2)
- When an alternative node for the Nameserver or a Memoryhost is selected, the other Sensorhosts and Memoryhosts must be notified of the change
- For a Memoryhost: select another node in the same (clique) group; sensors in the group are restarted with the new configuration
- For the Nameserver: an appropriate node must be selected globally; all components in the system are restarted with the new configuration
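The recovery policy of these two slides can be sketched as a small decision function; the action names and the shapes of `status` and `roles` are illustrative, not taken from the prototype:

```python
def plan_recovery(status, roles):
    """Map observed faults to recovery actions (slides 14-15, sketched).

    status: node -> {"node_alive": bool, "component_running": bool}
    roles:  node -> "nameserver" | "memoryhost" | "sensor"
    """
    actions = []
    for node, s in status.items():
        if not s["node_alive"]:
            # a dead Nameserver/Memoryhost node needs a replacement,
            # and dependent components must pick up the new configuration
            if roles.get(node) == "nameserver":
                actions.append(("elect_nameserver_globally", node))
                actions.append(("restart_all_components", node))
            elif roles.get(node) == "memoryhost":
                actions.append(("elect_memoryhost_in_group", node))
                actions.append(("restart_group_sensors", node))
        elif not s["component_running"]:
            # node is up but its component died: just restart it
            actions.append(("restart_component", node))
    return actions
```

Note the asymmetry the slides describe: a Memoryhost fault only restarts the sensors of one clique, while a Nameserver fault restarts every component in the system.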
16. Evaluation Environment
- Installed our prototype on the Titech Campus Grid at Tokyo Institute of Technology
- 15 dedicated PC clusters on campus (over 800 processors), in production usage
- Each cluster connects to SuperTITANET, a multi-gigabit campus network backbone
- Node spec:
- CPU: PentiumIII 1.4GHz × 2
- Memory: 512MB-1GB
- NIC: 100Base-T (WAN, per node) and Myrinet (cluster-local LAN)
- We use 1 node of each cluster (the SCore control node) for this evaluation
17. Experiments
- NWS autonomic configuration: quality and time required
- Fault recovery: Memoryhost fault scenario
18. RTT measurement of the test configuration
- RTT within a campus: under 0.450 ms; RTT between campuses: over 0.450 ms
[Figure: RTT map of the Oookayama and Suzukakedai campuses, about 30 km apart; legend: under 0.450 ms, 0.450 ms to 1000 ms, over 1000 ms]
19. Result of autonomic NWS configuration
- The clusters were appropriately grouped into two groups, one for each campus
[Figure: placement of the NWS Nameserver, Memoryhosts, and Sensorhosts across the Oookayama and Suzukakedai campuses, with data flows from sensors to Memoryhosts and the inter-clique network performance measurement]
20. Configuration Time
- Configuration time is O(N) (N = number of nodes)
- Most of the time is spent executing ping or the NWS components
21. Configuration time
- Configuration time is O(N) (N = number of machines), due to sequential execution of RTT measurement and NWS components
- With parallel execution, we could reduce this to O(1) (independent of the number of machines)
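The proposed parallelization can be sketched with a thread pool; `probe` is a placeholder for the per-host ping or NWS startup command. Wall-clock time then approaches that of the slowest single probe rather than the sum over all hosts:

```python
from concurrent.futures import ThreadPoolExecutor

def measure_all(hosts, probe, max_workers=64):
    """Run `probe` against every host concurrently and collect the
    results into a dict (a sketch of the O(1)-time proposal; `probe`
    stands in for the real ping/NWS commands)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves host order, so zip pairs hosts with results
        return dict(zip(hosts, pool.map(probe, hosts)))
```

Strictly, the dispatch itself still does O(N) work, but since each probe mostly waits on the network, the elapsed time becomes essentially independent of the number of machines once enough workers are available.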
22. Fault Recovery: case of Memoryhost (1)
[Figure: a clique with its NWS Nameserver, Memoryhost, and Sensorhosts; data flows from sensors to the Memoryhost]
- When the Memoryhost node goes down, the management feature detects it
23. Fault Recovery: case of Memoryhost (2)
[Figure: the same clique after recovery]
- A new Memoryhost starts running, and all Sensorhosts in the same clique are restarted
24. Problems of Our Prototype Implementation
- Single point of failure in the system: some components are centrally managed
- Solutions: replication of the management function; distributed management algorithms
- Excessive time required for data collection and component execution
- Solutions: O(1) pinging; parallel execution of each command
25. Conclusion
- Proposed autonomic management of Grid monitoring systems, consisting of grouping of machines, configuration, and halt recovery
- Implemented the autonomic management feature for NWS
- On a testbed of 15 PC clusters, initial configuration completes in 2 minutes
26. Future Work
- Eliminate the single point of failure: distributed management architecture
- Support Grid monitoring systems in general
- Adapt to large-scale environments: evaluation with a larger set of machines in a WAN environment
27. First and Second Steps
- Forecast network topology; check status of nodes and processes
- Using network performance metrics, check whether nodes and monitoring components are accessible
- Form node groups
- Use a proper algorithm for forming groups
- For new nodes, edit and reform the group membership
28. Third and Fourth Steps
- Decide configuration
- In initial configuration: for all machines
- In recovery configuration: for halted components and the other components related to them
- Start up the components on assigned nodes
- Execute the processes and register the components' information in the Information Service
29. Data Collection Step (1/2)
- Executes ping and ps systematically, in parallel, in an n-by-n fashion
- Each node runs a script to measure RTT to the others and to collect the PIDs of the NWS components on the machine
- One node gathers all the data generated by the resource nodes
30. Example (1/5)
[Figure: six ungrouped nodes, labeled 1-6]
31. Example (2/5)
- A node (6) was selected and moved to G1; the most proximal node from it still belongs to H
[Figure: node 6 in G1; the remaining nodes still in H]
32. Example (3/5)
- The most proximal node from 5 was 6, already in G1, so a new element, node 3, was selected; node 2 (the most proximal node from 3) belongs to H, so 2 will belong to G2
[Figure: G1 = {5, 6}; G2 forming around nodes 3 and 2]
33. Example (4/5)
- The most proximal node from 2 was 5, which is in G1, so G2 is merged into G1
[Figure: G1 = {2, 3, 5, 6}]
34. Example (5/5)
- A new element, node 4, was selected; the most proximal node from it is 1, which joins G2
[Figure: G1 = {2, 3, 5, 6}; G2 = {4, 1}]
35. Forming Result
- The most proximal node from 1 was 4, already in G2; H is the empty set, so forming is over
[Figure: final groups {2, 3, 5, 6} and {1, 4}]
36. Fault Recovery: case of Nameserver (1)
[Figure: the NWS Nameserver node goes down; Memoryhosts and Sensorhosts with data flows from sensors to Memoryhosts]
37. Fault Recovery: case of Nameserver (2)
[Figure: the same system after recovery]
- A new Nameserver starts running, and all other components are restarted
38. Time for fault recovery
- For 7 sites, Nameserver-down case:
- Reconfiguration: 1 second
- Restart: 37 seconds
- (Data collection time still to be measured)
39. Sorting Result