Autonomous Configuration of Grid Monitoring Systems - PowerPoint PPT Presentation

1
Autonomous Configuration of Grid Monitoring
Systems
  • Kenichiro Shirose1, Satoshi Matsuoka1,3,
    Hidemoto Nakada1,2, Hirotaka Ogawa2
  • 1 Tokyo Institute of Technology
  • 2 National Institute of Advanced Industrial
    Science and Technology
  • 3 National Institute of Informatics

2
Grid Monitoring System (1/2)
  • In a practical Grid deployment, monitoring is a must at all levels:
  • Resource Brokering
  • Accounting and Auditing
  • System Administration
  • User feedback

3
Grid Monitoring System (2/2)
  • Monitoring system components are fundamentally
    distributed and subject to faults and
    reconfigurations
  • Components are heavily mutually dependent
  • Too many monitors → the probing effect becomes a
    potential problem

[Figure: a data request program querying data collect programs, each gathering data from multiple sensors]
4
Goal
  • Design and implementation of an autonomically
    managed Grid monitoring system
  • Propose a framework for autonomic management of
    Grid monitoring systems
  • Implement a prototype that autonomously
    configures and tolerates faults of NWS sensors

5
Grid Monitoring System Architecture
  • Existing systems: NWS [Rich et al.], MDS [Globus
    Alliance], R-GMA [EU DataGrid Project], Hawkeye
    [Condor Project]
  • All consist of common components (Grid Monitoring
    Architecture, GGF 2002):
  • Directory Service: supports publication &
    discovery of component info
  • Producer: retrieves data from sources & makes it
    available to others
  • Consumer: receives data from Producers &
    processes it
  • Sources of data: collect the data
  • (We employ NWS as the target, but the results are
    generalizable)
6
Overview of NWS
  • NWS [Wolski '99] measures CPU and network
    performance on the Grid and forecasts their
    future values

Nameserver: manages info on NWS components (clients
inquire of the Nameserver and request monitored data)
Memoryhost: gathers data from sensors
Sensor programs run on each machine and send data
to the Memoryhost
7
NWS clique-based network performance measurement
  • Eliminates the n² traffic pressure of end-to-end
    network measurements
  • Sensors on representative nodes measure
    end-to-end performance between cliques
  • The measured value between the representing nodes
    is returned as an approximation for any pair of
    nodes from the respective clique pair

[Figure: two cliques, each with a representing node]
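The clique approximation above can be sketched as a simple lookup. This is a minimal illustration, not NWS's actual API; the dictionaries (`clique_of`, `rep`, `intra_rtt`, `inter_rtt`) are hypothetical data structures assumed for the sketch:

```python
def approx_rtt(a, b, clique_of, rep, intra_rtt, inter_rtt):
    """Approximate end-to-end RTT between any two nodes via clique representatives.
    intra_rtt[(x, y)]: measured inside a clique;
    inter_rtt[(r1, r2)]: measured between representing nodes."""
    ca, cb = clique_of[a], clique_of[b]
    if ca == cb:
        return intra_rtt[(a, b)]          # same clique: a direct measurement exists
    return inter_rtt[(rep[ca], rep[cb])]  # different cliques: representative-pair value
```

Only the representing nodes measure across cliques, so total measurement traffic stays far below the full n² of pairwise probing.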
8
Requirement of autonomic Management on the Grid
  • Needs to be aware of correct configuration &
    fault-recovery tactics based on various
    information, including its own probing (especially
    the status of nodes and the network topology)
  • Additional requirements:
  • Applicability: support multiple, existing
    systems
  • Scalability: scale to a large number of nodes
  • Autonomy: manage with little or no user
    intervention
  • Extensibility: possibility to incorporate various
    autonomic, self-management features

9
Four Steps of autonomic Management
  1. Data collection: forecast network topology &
     check status of nodes and processes (loop with a
     given time interval)
  2. Form node groups (reconfiguring groups when
     nodes are added, removed, etc.)
  3. Decide configuration (for all nodes at startup;
     for halted components at recovery)
  4. Start up the components on assigned nodes
10
Implementation of Autonomic Grid Monitoring
System Prototype
  • Support autonomic configuration, execution and
    recovery of NWS components
  • Input a list of nodes (with attribute info)
  • Four action steps
  • Measure RTT between nodes and collect the PIDs of
    NWS components
  • Form node groups based on RTT
  • (Re)configure the components
  • Execute the components

11
RTT Measurement
Each node runs an initialization script to measure
RTT to the others and collect the PIDs of the NWS
components on the machine. The management node
executes ping & ps systematically in parallel in an
n-by-n fashion.
[Figure: management node dispatching the initialization script to resources; ICMP probes; RTT & PID data returned]
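The per-node measurement step can be sketched as below. This is a minimal sketch, not the authors' script; it assumes a Linux iputils `ping` whose summary line has the form `rtt min/avg/max/mdev = …`:

```python
import subprocess

def parse_avg_rtt(ping_output):
    """Extract the average RTT (ms) from an iputils ping summary."""
    for line in ping_output.splitlines():
        if "min/avg/max" in line:                 # e.g. "rtt min/avg/max/mdev = a/b/c/d ms"
            return float(line.split("=")[1].split("/")[1])
    return float("inf")                           # no summary line: host unreachable

def measure_rtt(host, count=3):
    """Ping a host and return the average RTT in milliseconds."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return parse_avg_rtt(out)
```

An unreachable host maps to infinite RTT, which later steps can treat as "no connectivity".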
12
Forming node groups
  • For each node, decide the most proximal node
    based on RTT
  • Initialize: H is the set of all nodes, i = 1, m is
    an element of H
  • Until H has no elements:
  • If m is an element of H (it doesn't belong to any
    group): move m from H to Gi; m becomes the most
    proximal node from m
  • Else if i = 1 or m is an element of Gi (it
    belongs to the newest group): i = i + 1, Gi
    becomes the empty set, m becomes an element of H
  • Else (m is an element of Gj, j = 1, 2, …, i−1):
    merge Gi into Gj, make Gi empty, and let m be
    an element of H
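The procedure above can be sketched in Python. This is a minimal sketch, not the authors' code; `nearest` maps each node to its most proximal node by RTT, and the order in which new elements are drawn from H is illustrative:

```python
def form_groups(nodes, nearest):
    """Group nodes by following most-proximal-node chains.
    nearest[n] is the node with minimum RTT from n."""
    H = list(nodes)           # ungrouped nodes, in selection order
    groups = [set()]          # groups[-1] is the newest group G_i
    m = H[0]
    while H:
        if m in H:
            # m is still ungrouped: move it into the newest group
            H.remove(m)
            groups[-1].add(m)
            m = nearest[m]
        elif m in groups[-1]:
            # the proximity chain closed on the newest group: start a new one
            groups.append(set())
            m = H[0]
        else:
            # m fell into an earlier group G_j: merge the newest group into it
            Gj = next(g for g in groups if m in g)
            Gj |= groups[-1]
            groups[-1] = set()
            m = H[0]
    return [g for g in groups if g]
```

Running this on the six-node example from the backup slides (chains 6→5→6, 3→2→5, 4→1→4) yields the two groups {2, 3, 5, 6} and {1, 4}.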

13
Configuration of components and execution
  • Decide the nodes on which the NWS Nameserver &
    Memoryhosts will run
  • Nameserver: the node with the most connectivity
    to other nodes and the minimum average RTT from
    other nodes
  • Memoryhost: the node which is the most proximal
    from the largest number of nodes in each group
  • These nodes will serve as clique representatives
  • Create a script to execute the respective NWS
    components on the respective nodes
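The two placement rules can be sketched as follows. This is an illustrative interpretation, not the authors' code; `rtt` and `nearest` are the hypothetical data structures from the earlier steps:

```python
from collections import Counter

def choose_nameserver(nodes, rtt):
    """Pick the Nameserver node: most connectivity first, then minimum
    average RTT. rtt[(a, b)] is the RTT from a to b, inf if unreachable."""
    def key(n):
        reach = [rtt[(o, n)] for o in nodes if o != n and rtt[(o, n)] != float("inf")]
        avg = sum(reach) / len(reach) if reach else float("inf")
        return (-len(reach), avg)     # maximize connectivity, minimize average RTT
    return min(nodes, key=key)

def choose_memoryhost(group, nearest):
    """Pick the Memoryhost: the node that is the most proximal node
    for the largest number of nodes in its group."""
    votes = Counter(nearest[n] for n in group if nearest[n] in group)
    return votes.most_common(1)[0][0] if votes else next(iter(group))
```

Each group's Memoryhost doubles as the clique's representing node for network measurement.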

14
Fault Recovery(1/2)
  • Status of the nodes & components:
  • Nodes: whether data collection executed properly
  • Components: whether they are running
  • The current prototype handles two types of
    faults in NWS:
  • The node itself is active, but a component is not →
    restart the component
  • The node executing the Nameserver or a Memoryhost is
    inactive → select an alternative node

15
Fault Recovery (2/2)
  • When an alternative node for the Nameserver or a
    Memoryhost is selected, the other Sensorhosts &
    Memoryhosts must be notified of the change
  • For a Memoryhost: select another node in the same
    (clique) group; sensors in the group will be
    restarted with the new configuration
  • For the Nameserver: an appropriate node must
    be selected globally; all components in the system
    will be restarted with the new configuration
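The two fault cases can be summarized as a small decision routine. A minimal sketch under assumed status dictionaries (`roles`, `node_alive`, `proc_running` are hypothetical names, not the prototype's API):

```python
def recovery_actions(roles, node_alive, proc_running):
    """Decide actions for the two fault cases the prototype handles.
    roles: node -> component name; node_alive / proc_running: node -> bool."""
    actions = []
    for node, comp in roles.items():
        if node_alive[node] and not proc_running[node]:
            # case 1: node is up but the component died -> restart it in place
            actions.append(("restart", comp, node))
        elif not node_alive[node] and comp in ("nameserver", "memoryhost"):
            # case 2: the node itself is down -> pick an alternative node
            actions.append(("relocate", comp, node))
    return actions
```

A "relocate" of a Memoryhost triggers restarts within its clique only, whereas relocating the Nameserver restarts every component.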

16
Evaluation Environment
  • Installed our prototype on the Titech Campus Grid
    @ Tokyo Institute of Technology
  • 15 dedicated PC clusters on campus (over 800
    processors), in production usage
  • Each cluster connects to SuperTITANET, a
    multi-gigabit campus network backbone
  • Node spec:
  • CPU: Pentium III 1.4 GHz × 2
  • Memory: 512 MB–1 GB
  • NIC: 100Base-T (WAN, per node) & Myrinet
    (cluster-local LAN)
  • We use 1 node of each cluster (the SCore control
    node) for this evaluation

17
Experiments
  • NWS autonomic configuration quality and time
    required
  • Fault Recovery
  • Memoryhost fault scenario

18
RTT measurement of the test configuration
RTT within a campus: under 0.450 ms; RTT between
campuses: over 0.450 ms
[Figure: measured RTTs categorized as under 0.450 ms / 0.450 ms–1000 ms / over 1000 ms, between the Oookayama and Suzukakedai campuses, about 30 km apart]
19
Result of autonomic NWS configuration
The clusters were appropriately grouped into two
groups, one for each campus
[Figure: NWS nameserver, memoryhosts, sensorhosts, data flow from sensors to memoryhosts, and network performance measurement links across the Oookayama and Suzukakedai campuses]
20
Configuration Time
Configuration time is O(N) (N = number of nodes).
Most of the time is spent on the execution of ping
or NWS components.
21
Configuration time
  • Configuration time is O(N) (N = number of machines)
  • Due to sequential execution of RTT measurement &
    NWS components
  • By parallel execution, we could reduce this to
    O(1) (independent of the number of machines)

22
Fault Recovery case of Memoryhost (1)
[Figure: NWS Nameserver, Memoryhosts, Sensorhosts, and data flow from sensors to Memoryhosts]
When the Memoryhost node goes down, the management
feature detects it
23
Fault Recovery case of Memoryhost (2)
[Figure: NWS Nameserver, Memoryhosts, Sensorhosts, and data flow from sensors to Memoryhosts]
The new Memoryhost starts running, and all
Sensorhosts in the same clique are restarted.
24
Problem of Our Prototype Implementation
  • Single point of failure in the system: some
    components are centrally managed
  • Solutions: replication of the management function;
    distributed management algorithms
  • Excessive time required for data collection and
    execution of components
  • Solutions: O(1) pinging; parallel execution of
    each command

25
Conclusion
  • Proposed autonomic management of Grid monitoring
    systems
  • Consists of grouping of machines, configuration,
    and fault recovery
  • Implemented an autonomic management feature for NWS
  • On a testbed (15 PC clusters), initial
    configuration completes in 2 min.

26
Future Work
  • Eliminate single point of failure
  • Distributed management architecture
  • Support for Grid monitoring systems in general
  • Adaptation to large-scale environments
  • Evaluation with a larger set of machines in a WAN
    environment

27
First and Second Step
  • Forecast network topology & check status of nodes
    and processes
  • Using network performance metrics, check whether
    nodes & monitoring components are accessible
  • Form node groups
  • Using a proper algorithm for group forming
  • For new nodes, editing & re-forming group
    membership

28
Third and Fourth Step
  • Decide configuration
  • In the initial configuration, for all machines
  • In a recovery configuration, for halted components
    and other components related to them
  • Start up the components on assigned nodes
  • Execute the processes and register the components'
    information in the Information Services

29
Data Collection Step (1/2)
  • Executes ping & ps systematically in parallel in
    an n-by-n fashion
  • Each node runs the scripts to measure RTT to the
    others and collect the PIDs of the NWS components
    on the machine
  • One node gathers all the data generated by the
    resource nodes

30
Example (1/5)
[Figure: six nodes, numbered 1–6, before grouping]
31
Example (2/5)
A node (6) was selected; the most proximal node from
it (5) still belongs to H, so it is moved into the
same group.
[Figure: nodes 5 and 6 form G1]
32
Example(3/5)
The most proximal node from 6 was 5 (already
grouped), so a new element, 3, was selected. Node 2
(the most proximal node from 3) belongs to H, so 2
joins G2.
[Figure: G1 = {5, 6}, G2 = {2, 3}]
33
Example(4/5)
The most proximal node from 2 was 5, so G2 is
merged into G1.
[Figure: G1 = {2, 3, 5, 6}]
34
Example(5/5)
A new element, 4, was selected; the most proximal
node from it is 1, so 1 joins the new group.
[Figure: nodes 1 and 4 form the new group]
35
Forming Result
The most proximal node from 1 was 4, and H is now
the empty set, so forming is over.
[Figure: final groups {2, 3, 5, 6} and {1, 4}]
36
Fault Recovery case of Nameserver
[Figure: NWS nameserver, memoryhosts, sensorhosts, and data flow from sensors to memoryhosts]
37
Fault Recovery case of Nameserver
[Figure: NWS nameserver, memoryhosts, sensorhosts, and data flow from sensors to memoryhosts]
The new Nameserver starts running, and all other
components are restarted.
38
Time for fault recovery
  • With 7 sites, in the nameserver-down case:
  • Reconfiguration: 1 second
  • Restart: 37 seconds
  • (Data collection time: to be measured)

39
Sorting Result