Title: Implementing the monitoring system for the Grid application traffic
1Implementing the monitoring system for the Grid
application traffic
- Tai M. Chung
- School of Information Communication Engineering, Sungkyunkwan Univ.
- 300 Cheoncheon-dong, Jangan-gu, Suwon-si, Gyeonggi-do, Korea
- Tel 82-31-290-7131, Fax 82-31-299-6673
- tmchung_at_ece.skku.ac.kr
2Contents
- Objective of Research
- Activities
- Plan & Result
- Analysis of Monitoring Methods
- Grid Application Measurement Factors
- Implementation of the monitoring system
3Objective of Research
- Research of grid network applications measurement methods
  - Kernel level monitoring
  - Application level monitoring
- Implementation of grid application monitoring systems
  - Design of grid application monitoring systems
  - Define the metrics for grid network applications
  - Implementation of metrics for grid network applications
  - Implementation of UI for grid application monitoring systems
- Research of standardization for grid application traffic monitoring
  - Suggestion of standard methods to measure the grid application traffic
  - Contribution to the global grid application monitoring activity for standardization
4Activities
Analysis & Preparation for Implementation
- A Study on the Methodology of the Grid Application Traffic Monitoring
- Analysis of the metrics for the grid application traffic
- Analysis of the Performance Measurement Mechanism
  - On the Kernel Level
  - On the Application Level
- A Selection of the Performance Measurement Mechanisms and Analysis of the Metrics for Grid application traffic
- A Design of the Grid Application Traffic Monitoring System and Web-based Management System
Implementation
- Implementation of the Grid Application Performance Measurement Module
- Implementation of the Grid Service Interface
- Implementation of the Web-based GUI for the Grid Application Traffic Monitoring
Test & Debugging
Standardization Activity
- Research on Standardization for Grid Application Traffic Monitoring and Basic Survey for the Application Performance Tuning
5Plan & Result
6Analysis of Monitoring Methods
Which is the Better One?
- User level network monitoring using Libpcap
- Kernel level network monitoring
[Figure: a Monitoring Tool requests monitoring information (CPU and memory usage) and the Kernel provides it]
7User Level Network Monitoring Using Libpcap
- What is Libpcap?
  - The Packet Capture library provides a high-level interface to packet capture systems
  - TCP header information, IP header information, UDP header information
- Merits and demerits of Libpcap
  - Merits: easy to develop platform-independent network monitoring applications
  - Demerits: packet loss can occur under heavy network load
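The kind of header information mentioned above can be illustrated without Libpcap itself. The following is a minimal sketch in Python, assuming the raw bytes of an IPv4 header are already available (for example from a capture buffer). `parse_ipv4_header` is a hypothetical helper introduced for illustration and is not part of Libpcap.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (hypothetical helper)."""
    if len(raw) < 20:
        raise ValueError("an IPv4 header needs at least 20 bytes")
    # Network byte order: version/IHL, TOS, total length, id, flags/fragment,
    # TTL, protocol, checksum, source address, destination address.
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL is counted in 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,                   # 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hand-built sample header (addresses are made up for the example).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("192.168.0.1"),
                     socket.inet_aton("10.0.0.2"))
header = parse_ipv4_header(sample)
```

A real capture tool would receive such bytes from the capture library and would also have to skip the link-layer header first.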
8Kernel Level network monitoring (1/2)
- Monitoring the /proc filesystem (in Linux)
  - The proc filesystem gives system users easy access to the various data structures that the kernel maintains
  - Kernel tuning can be achieved by simply modifying files in the /proc filesystem
  - Uses the kernel's network parameters in the /proc filesystem
  - Information on the network sockets opened by each application can be obtained
- Using LibKVM (Kernel Virtual Memory access library)
  - kvm_open, kvm_nlist, kvm_read, etc.
  - Can be used to access kernel data structures directly through the /dev/kmem device
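As an illustration of reading socket information from the /proc filesystem, here is a minimal sketch that decodes text in the format of Linux's `/proc/net/tcp`. It assumes IPv4 entries and a little-endian host (such as x86). The function names and the sample line are made up for the example.

```python
import socket
import struct

def decode_proc_addr(field: str) -> tuple:
    """Decode an 'ADDR:PORT' hex field from /proc/net/tcp (IPv4 only)."""
    hex_ip, hex_port = field.split(":")
    # Assumption: the address was printed on a little-endian host.
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)

def parse_proc_net_tcp(text: str) -> list:
    """Return one dict per socket listed in /proc/net/tcp-style text."""
    sockets = []
    for line in text.strip().splitlines()[1:]:  # first line is the column header
        cols = line.split()
        sockets.append({
            "local": decode_proc_addr(cols[1]),
            "remote": decode_proc_addr(cols[2]),
            "state": int(cols[3], 16),   # 0x0A is LISTEN
            "inode": int(cols[9]),
        })
    return sockets

sample = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0
"""
entries = parse_proc_net_tcp(sample)
```

On a live system the text would come from `open("/proc/net/tcp").read()`. Matching the inode against the `socket:[inode]` links under `/proc/<pid>/fd` is one way to attribute a socket to an application.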
9Kernel Level network monitoring (2/2)
- Network related modules in kernel
  - Netfilter layer
    - Framework inside the Linux 2.4.x kernel that enables packet filtering, network address translation (NAT), and other packet mangling
    - Stateful packet filtering (connection tracking)
    - All kinds of network address translation
    - Large number of additional features as patches
    - With the ip_conntrack module, it is possible to monitor network information per connection
- Merits and demerits of kernel level network monitoring
  - Merits: less system load and monitoring latency than using Libpcap
  - Demerits: patching the kernel is a somewhat difficult and dangerous job
10Grid Application Measurement Factors
- Grid network measurement parameter
- Bandwidth
- Delay
- Jitter
- Loss
- Grid Application Performance Characteristic
- Reliability
- Availability
- Survivability
- Closeness
GRID Application
11Grid network measurement parameter
- Bandwidth
  - Bandwidth is defined most generally as data per unit time
  - Available bandwidth: the maximum amount of data per time unit that a hop or path can provide given the current utilization
- Delay
  - The time between when the first part of an object passes an observational position and the time the last part of that object or related object passes
- Jitter
  - The variation in the one-way delay of packets
  - Important in sizing playout buffers for applications requiring regular delivery of packets
- Loss
  - One-way Loss
  - Roundtrip Loss
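To make the jitter and loss parameters concrete, here is a minimal sketch. It assumes jitter is summarized as the mean absolute difference between consecutive one-way delays, which is only one of several common estimators and is not necessarily the one used by the monitoring system. Function names and sample values are illustrative.

```python
from statistics import fmean

def mean_delay_variation(delays_ms):
    """Mean absolute difference between consecutive one-way delays."""
    if len(delays_ms) < 2:
        raise ValueError("need at least two delay samples")
    return fmean(abs(b - a) for a, b in zip(delays_ms, delays_ms[1:]))

def loss_ratio(sent, received):
    """Fraction of sent packets that never arrived."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - received) / sent

delays = [20.0, 22.0, 19.0, 25.0]      # made-up one-way delays in milliseconds
jitter = mean_delay_variation(delays)  # differences are 2, 3 and 6 ms
loss = loss_ratio(100, 97)
```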
GRID Network
12Grid Application Performance Characteristic
- Grid performance events required
  - System info: CPU, available memory, available free storage, network utilization, how many clients can connect, failure rate, available disk
  - Data information: data type, size, current location
  - Data access info: read/write frequency, duration, size, user info
  - Network info: bandwidth, latency, RTT, packet loss
- Grid Application Performance Characteristic
  - Reliability
  - Availability
  - Survivability
  - Closeness
GRID Application
13Reliability (1/3)
- The probability that a system functions properly and continuously over a fixed time period
- The reliability of a grid network is the probability that a grid application program, which runs on multiple processing elements and needs to communicate with other processing elements, executes successfully
- The reliability varies according to the conditions of the network (retransmission rate, tcpListenDrop rate), the accessibility of the network, and the TCP packet loss rate
- Conditions of network
  - Retransmission occurs when ACKs are not received fast enough, typically because of bad network hardware or a congested route
  - Retransmission rate = tcpRetransBytes / tcpOutDataBytes
  - ListenDrop counts the number of times that a connection was dropped due to a full listen queue
  - ListenDrop rate = tcpListenDrop / t
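A retransmission rate of this kind could be derived from kernel counters as sketched below. This is an assumption-laden illustration: the counter names in the slide are byte counters, while the sketch uses the segment counters `RetransSegs` and `OutSegs` that Linux exposes in `/proc/net/snmp` as a stand-in. The sample text is made up.

```python
def parse_proc_net_snmp(text: str) -> dict:
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    stats = {}
    lines = text.strip().splitlines()
    # The file comes in pairs: a line of counter names, then a line of values.
    for names_line, values_line in zip(lines[0::2], lines[1::2]):
        proto, names = names_line.split(":", 1)
        _, values = values_line.split(":", 1)
        stats[proto] = dict(zip(names.split(), map(int, values.split())))
    return stats

def retransmission_rate(tcp: dict) -> float:
    """Retransmitted segments divided by all sent segments (stand-in metric)."""
    return tcp["RetransSegs"] / tcp["OutSegs"] if tcp["OutSegs"] else 0.0

sample = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 50 20 3 1 4 10000 8000 40 0 5
"""
tcp_stats = parse_proc_net_snmp(sample)["Tcp"]
rate = retransmission_rate(tcp_stats)
```

Because the counters are cumulative since boot, a monitor would normally take two samples and compute the rate over the difference.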
14Reliability (2/3)
- Accessibility of network
  - Network accessibility is defined as the measure of the capacity of a location to be reached by, or to reach, different locations
  - Accessibility rate = (ICMP input destination unreachable + ICMP output destination unreachable) / (ICMP input packets + ICMP output packets)
- TCP Loss
  - Packet loss describes an error condition in which data packets appear to be transmitted correctly at one end of a connection, but never arrive at the other
  - TCPLoss rate = TCPLoss / TCP packets
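The two ratios above can be written out as a small worked example. The sketch assumes the missing operators in the accessibility formula are sums of the input and output counters. The counter values are invented for illustration.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when nothing has been counted yet."""
    return numerator / denominator if denominator else 0.0

def accessibility_rate(in_unreach, out_unreach, in_msgs, out_msgs):
    """ICMP destination-unreachable messages over all ICMP messages."""
    return ratio(in_unreach + out_unreach, in_msgs + out_msgs)

def tcp_loss_rate(lost_packets, tcp_packets):
    """Lost TCP packets over all TCP packets."""
    return ratio(lost_packets, tcp_packets)

acc = accessibility_rate(in_unreach=3, out_unreach=1, in_msgs=150, out_msgs=50)
loss = tcp_loss_rate(lost_packets=12, tcp_packets=4000)
```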
15Reliability (3/3) - Reliability Measurement
1. Identify, from the GT3, the available servers of the grid network on which to run the grid application
2. Calculate the network accessibility for the servers used in the grid network to run the grid application
3. Calculate the TCP loss for all the servers used in the grid network
4. Calculate the network condition in the grid network
5. Calculate the system failure rate $\lambda$ and the system repair rate $\mu = 1 - \lambda$, where $\lambda$ is computed from the accessibility rate, the network condition rate, and the TCP loss rate
6. Calculate the grid service reliability on each server: $R = \mu / (\mu + \lambda)$
7. Let the grid service reliability on server $i$ be $R_i$. The mean and the variance of the grid service reliability are then calculated over the servers, for $0 < i \le m$, where $m$ is the number of servers
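The last steps can be sketched numerically as follows. The per-server failure rates are hypothetical inputs, since the way the three component rates are combined is not recoverable from the slide. The population form of the variance is assumed.

```python
from statistics import fmean, pvariance

def service_reliability(failure_rate: float) -> float:
    """R = mu / (mu + lambda) with mu = 1 - lambda, as stated on the slide."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure rate must lie in [0, 1]")
    repair_rate = 1.0 - failure_rate
    return repair_rate / (repair_rate + failure_rate)

# Hypothetical failure rates for m = 3 servers.
failure_rates = [0.10, 0.20, 0.30]
reliabilities = [service_reliability(lam) for lam in failure_rates]

mean_r = fmean(reliabilities)
var_r = pvariance(reliabilities)  # assumption: population variance (divide by m)
```

Since $\mu + \lambda = 1$ under this definition, the reliability of each server reduces to $\mu$ itself.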
16Availability (1/2)
- Network availability means the ability to recover rapidly in case of network failure
- On detection of three duplicate ACKs, packet loss is assumed and the sender halves the congestion window size
- If congestion occurs, let $T_f$ be the time at which the congestion occurs and $T_t$ the time at which the window size recovers to its value before the congestion
- $MTTR = T_t - T_f$
- $MTTF$ = the execution time of the grid application $- MTTR$
- $\lambda$ = congestion occurrence rate $= 1/MTTF$
- $\mu$ = congestion repair rate $= 1/MTTR$
- Network availability $= MTTF / (MTTF + MTTR)$
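A short worked example of the availability computation follows. It reads MTTR as the difference between the recovery time and the congestion time, and MTTF as the execution time minus MTTR. Both readings are assumptions about operators lost from the slide. All times are invented and given in seconds.

```python
def network_availability(execution_time, t_recovered, t_congested):
    """Availability = MTTF / (MTTF + MTTR) for a single congestion event."""
    mttr = t_recovered - t_congested  # time needed to regain the old window size
    mttf = execution_time - mttr      # remaining, congestion-free run time
    if mttr < 0 or mttf < 0:
        raise ValueError("inconsistent time values")
    return mttf / (mttf + mttr)

# A 100 s run with congestion at t = 10 s and recovery at t = 12 s.
availability = network_availability(execution_time=100.0,
                                    t_recovered=12.0,
                                    t_congested=10.0)
```

With these definitions the denominator equals the execution time, so the result is the fraction of the run that was not spent recovering.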
17Availability (2/2)
18Survivability (1/5)
- Capability to provide a prescribed level of QoS for existing services after a given number of failures occur within the network
- Property of a network to be resilient to failure
- Used to describe the available performance of a network after a failure
- Measures the degree of functionality remaining in a system after a failure, and consists of evaluating metrics that quantify network performance during failure as well as normal operation
- Monitoring Data Needs for Survivability Guarantee
  - Determine the optimal resources for an application job
  - Applications could use monitoring data to adapt themselves to the current situation
- Fault detection and analysis
  - Monitoring data is used to determine faults in system components and applications
  - Monitoring data could also be used to find the cause of the faults
19Survivability (2/5)
- Survivability is enhanced by
  - Security techniques where applicable
  - Redundancy, diversity, general trust validation
  - Automated recovery support
- Strategies for Survivability Guarantee
  - Failure Events: technology failures, operational activities, procedural errors, traffic overloads, environmental incidents
  - Prevention Strategies: design centering, software modularity, physical/GUI design, traffic robustness, environmental robustness, site location/integrity
  - Mitigation/Masking Strategies: network restoration, network protection, hardware duplication, software fault tolerance, link/site diversity, provisioning, configurable parameters
[Figure: failure events, prevention strategies, and mitigation/masking strategies drawn between the Network and the Network Service View]
20Survivability (3/5) - Survivability Measurement Factors
- RTV (Residual Traffic Volume)
  - $RTV = t_v / t_n$, where $t_n$ is the traffic volume before failure and $t_v$ is the traffic volume after failure
- NPR (Network Path Protection Ratio)
  - Path protection ratio, with $w_i$ the capacity of path $i$ and $k_i$ the possible alternate path capacity
  - capacity (bits) = bandwidth (bits/sec) × round-trip time (sec)
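The residual traffic volume and the path capacity can be worked through as below. The sketch assumes the capacity line is a product of bandwidth and round-trip time. The path protection ratio itself is left out because its formula is not recoverable from the slide. All values are invented.

```python
def residual_traffic_volume(volume_after, volume_before):
    """RTV = traffic volume after failure / traffic volume before failure."""
    if volume_before <= 0:
        raise ValueError("traffic volume before failure must be positive")
    return volume_after / volume_before

def path_capacity_bits(bandwidth_bps, rtt_sec):
    """Capacity in bits, taken as bandwidth (bits/sec) times round-trip time (sec)."""
    return bandwidth_bps * rtt_sec

rtv = residual_traffic_volume(volume_after=80.0, volume_before=100.0)
capacity = path_capacity_bits(bandwidth_bps=100e6, rtt_sec=0.02)
```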
21Survivability (4/5)
- Resource Reallocation Mechanism After Survivability Assessment
  - 1. Monitoring Resource Creation
  - 2. Performance Measurement Data Collection for Resource
  - 3. Survivability Assessment
  - 5. Resource Reallocation Accomplishment
  - 6. Survivability Assessment Result Collection
  - 7. Survivability Assessment Result Reporting
  - 8. Node Path Change Request
[Figure: components shown are the Registry, the Grid Application Execution Nodes (Path Change), and the Grid Application Execution Environment (OS/HW/Storage etc.)]
22Survivability (5/5)
- Reallocation Algorithm for Survivability
  - TPU = average ProcessorUsage()
  - TPU = Essential ProcessUsage()
  - GPU = General ProcessUsage()
[Flowchart: Survivability Assessment Resource Reallocation Algorithm. Boxes: Grid Application Resource Utilization Measurement; Recovery Resource Reallocation; Service Resource Recovery, Available Compromise; Resource Utilization Re-measurement; Forced Exit of Service, Available Compromise. Yes/No decisions: Total Utilization Datum Excess; Essential Service Utilization Datum Excess; Essential Service Utilization Increase]
23Network Closeness (1/2)
- A measure of the degree to which a node is adjacent to or can reach others in a network
- Closeness is usually measured by the number of steps it takes to reach others
- Network closeness is based on path-capacity measurements and hop counts
- Closeness Measurement Factors
  - Round Trip Time
  - Packet loss frequency
  - Throughput
24Network Closeness (2/2)
- Validity Assessment for Closeness
  - r: Round Trip Time, Rmax: max RTT, ploss: packet loss frequency, th: throughput, thmax: maximum throughput
  - a: weight in the interval [0, 1]
- Closeness Measurement Data Dependence
  - RTT and throughput factors
    - The closer a is to 1, the more Closeness depends on throughput
    - The closer a is to 0, the more Closeness depends on RTT
25Implementation of the monitoring system
- Development Environment
- Design spec. for Linux kernel based Information Collector
  - Kernel based network information gathering mechanism
  - Information Gathering Mechanism
  - Components of Information Collector
  - Information Collector daemon
- Design spec. for GA-NMS Web Service
  - Example of GA-NMS protocol
  - Service Architecture
- Examples of Implementation
26Development Environment
- Hardware platform
  - CPU: Intel Pentium III 600MHz
  - Memory: 192MB
  - Disk: 6.1GB, 3.1GB
- Operating System
  - REDHAT Linux 7.3
  - Kernel 2.4.19
- Running Environment
  - UNIX C
    - Kernel Module
    - Information Collector Module
  - JAVA (Jakarta Tomcat, WSDP (Web Services Developer Pack))
    - Information Provider Web Service
27Design spec. for Linux kernel based Information
Collector (1/4)
Kernel based network information gathering
mechanism
28Design spec. for Linux kernel based Information Collector (2/4) - Information Gathering Mechanism
- 1. Hooking
  - Replaces the existing protocol stack logic, which gathers only coarse network information, with logic that gathers detailed network information (e.g., end-to-end bandwidth)
- 2-1. Information gathering using a kernel module
  - Gathers information from the protocol stack hooking layer
  - The protocol stack hooking layer hooks each protocol stack and stores network related information after processing it into a user-readable format
- 2-2. Information gathering using the kernel memory interface
  - Not completely supported on Linux
  - Common Unix environments provide an interface through which the user can access kernel data (e.g., the Kernel Virtual Memory interface library (KVM))
- 3. Data accumulating
  - The kernel module stores data in a filesystem that can be read at user level
  - Using ProcFS (Process information pseudo-Filesystem) avoids the load that a real filesystem would incur
- 4. Information processing
  - A user application reads network monitoring parameters from ProcFS and processes them into network parameters for Grid applications
29Design spec. for Linux kernel based Information Collector (3/4) - Components of Information Collector
- Protocol stack hooking layer
  - It uses the Netfilter layer, which is supported on Linux kernels 2.4.X to 2.6.X
  - The Netfilter layer supports hooking into the protocol stack through user-supplied functions. It does not modify the protocol stack code, so it can process the information that the kernel uses without modifying the original kernel data
- Kernel module
  - It is based on the ip_conntrack kernel module supplied by the Netfilter layer
  - Some code is added and modified to gather and process user-specific network parameters in detail
- Information Collector daemon
  - It is a daemon that processes the network related information in the ProcFS
  - It encodes the gathered information using an XML schema and sends it to the Web Service application
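The daemon's read-and-encode step might look like the following sketch. It is an illustration only: it assumes input lines in the format of the stock `/proc/net/ip_conntrack` file, because the output format of the modified module is not given, and the XML element names are hypothetical rather than the actual GA-NMS schema.

```python
import xml.etree.ElementTree as ET

def parse_conntrack_line(line: str) -> dict:
    """Extract protocol and key=value fields from one ip_conntrack line."""
    tokens = line.split()
    entry = {"protocol": tokens[0]}
    for token in tokens:
        if "=" in token:
            key, value = token.split("=", 1)
            # Keys repeat for the reply direction; keep the original direction.
            entry.setdefault(key, value)
    return entry

def to_xml(entries) -> str:
    """Encode connection entries as XML (element names are hypothetical)."""
    root = ET.Element("connections")
    for e in entries:
        conn = ET.SubElement(root, "connection", protocol=e["protocol"])
        for key in ("src", "dst", "sport", "dport"):
            ET.SubElement(conn, key).text = e.get(key, "")
    return ET.tostring(root, encoding="unicode")

line = ("tcp      6 431999 ESTABLISHED src=192.168.0.1 dst=10.0.0.2 "
        "sport=1025 dport=80 src=10.0.0.2 dst=192.168.0.1 sport=80 "
        "dport=1025 [ASSURED] use=1")
entry = parse_conntrack_line(line)
document = to_xml([entry])
```

A daemon built this way would read the ProcFS file periodically and hand the resulting document to the messaging layer.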
30Design spec. for Linux kernel based Information
Collector (4/4)
Information Collector daemon
31Design spec. for GA-NMS Web Service (1/3)
- Definition of Service
  - Grid Application Network Monitoring Service (GA-NMS) supplies network monitoring parameters that are useful for Grid applications in the Grid network
- Messaging Protocol
  - It uses XML (eXtensible Markup Language) and SOAP (Simple Object Access Protocol) to communicate with each service
- Service Platform Specification
  - Service Platform: JAVA WSDP (Web Services Developer Pack), JAXM (Java API for XML Messaging) / JAVA
  - Information Collector: Linux kernel module / C language
  - Site Platform: Tomcat, Globus Toolkit 3.0 / JAVA (JSP)
32Design spec. for GA-NMS Web Service (2/3)
Example of GA-NMS protocol
33Design spec. for GA-NMS Web Service (3/3)
Service Architecture
34Examples of Implementation
Main View
Statistics View