Implementing the monitoring system for the Grid application traffic

Provided by: gridfor

Transcript and Presenter's Notes

1
Implementing the monitoring system for the Grid
application traffic

Tai M. Chung
School of Information Communication
Engineering,
Sungkyunkwan Univ.
300 Cheoncheon-dong, Jangan-gu, Suwon-si,
Gyeonggi-do, Korea.
Tel 82-31-290-7131, Fax 82-31-299-6673
tmchung_at_ece.skku.ac.kr

2
Contents

Objective of Research
Activities
Plan & Result
Analysis of Monitoring Methods
Grid Application Measurement Factors
Implementation of the monitoring system

3
Objective of Research

Research of grid network applications
measurement methods
Kernel level monitoring
Application level monitoring
Implementation of grid application monitoring
systems
Design of grid application monitoring systems
Define the metrics for grid network applications
Implementation of metrics for grid network
applications
Implementation of UI for grid application
monitoring systems
Research of standardization for grid application
traffic monitoring
Suggestion of standard methods to measure the
grid application traffic
Contribution to the global grid application
monitoring activity for standardization

4
Activities
Analysis & Preparation for Implementation
A Study on the Methodology of the Grid
Application Traffic Monitoring
Analysis of the metrics for the grid application
traffic
Analysis of the Performance Measurement Mechanism
On the Kernel Level
On the Application Level
A Selection of the Performance Measurement
Mechanisms and Analysis of the Metrics for Grid
application traffic
A Design of the Grid Application Traffic
Monitoring System and Web-based Management System
Implementation
Implementation of the Grid Application
Performance Measurement Module
Implementation of the Grid Service Interface
Implementation of the Web-based GUI for the
Grid Application Traffic Monitoring
Test & Debugging
Test & Debugging
Research on Standardization of Grid
Application Traffic Monitoring and a Basic Survey
for Application Performance Tuning
Standardization Activity
5
Plan & Result
6
Analysis of Monitoring Methods
Which is the Better One?
Monitoring Tool
Request (Monitoring Information)
Usage CPU, Memory
Provide

User level network monitoring using Libpcap

Kernel

Kernel Level network monitoring

7
User Level Network Monitoring Using Libpcap

What is Libpcap?
The Packet Capture library provides a high-level
interface to packet capture systems
It gives access to TCP header, IP header, and
UDP header information
Merits and demerits of Libpcap
Merits: easy to develop platform-independent
network monitoring applications
Demerits: packet loss can occur under heavy
network load
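
As an illustration of the header information mentioned above, the following is a minimal sketch of decoding IPv4 and TCP header fields from the raw bytes that a capture library hands to the application. It does not call Libpcap itself; the function name `parse_ipv4_tcp` and the returned field names are hypothetical, and the packet is assumed to start at the IP header (no link-layer header).

```python
import socket
import struct

def parse_ipv4_tcp(packet: bytes) -> dict:
    """Extract a few IPv4 and TCP header fields from a raw IP packet."""
    # The fixed part of the IPv4 header is 20 bytes, in network byte order.
    (ver_ihl, _tos, total_len, _ident, _frag,
     ttl, proto, _csum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = (ver_ihl & 0x0F) * 4  # IP header length in bytes
    info = {
        "version": ver_ihl >> 4,
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }
    if proto == 6:  # protocol number 6 is TCP
        # First 12 bytes of the TCP header: ports, sequence and ack numbers.
        sport, dport, seq, ack = struct.unpack("!HHII", packet[ihl:ihl + 12])
        info.update(sport=sport, dport=dport, seq=seq, ack=ack)
    return info
```

A monitoring tool would typically call such a function from its per-packet callback and aggregate the results per connection.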

8
Kernel Level network monitoring (1/2)

Monitoring the /proc filesystem (in Linux)
The /proc filesystem gives system users easy
access to the various data structures that the
kernel maintains
Kernel tuning can be achieved simply by
modifying files in the /proc filesystem
Using the kernel's network parameters in the
/proc filesystem
Information on the network sockets opened by each
application can be obtained
Using LibKVM (Kernel Virtual Memory access
library)
kvm_open, kvm_nlist, kvm_read, etc. can be used
to access kernel data structures directly
through the /dev/kmem device
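
The /proc approach can be sketched as follows: a user-level program reads the text of /proc/net/tcp and decodes the socket table. The function names are hypothetical, and the sketch assumes a little-endian host (such as x86), where the kernel prints IPv4 addresses as host-order hexadecimal words.

```python
import socket
import struct

def decode_addr(hex_addr: str) -> tuple:
    """Decode an 'ADDR:PORT' pair as printed in /proc/net/tcp."""
    ip_hex, port_hex = hex_addr.split(":")
    # Assumption: little-endian host, so the hex word is byte-swapped.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)

def parse_proc_net_tcp(text: str) -> list:
    """Return one dict per TCP socket listed in /proc/net/tcp content."""
    sockets = []
    for line in text.splitlines()[1:]:  # skip the column header row
        fields = line.split()
        if len(fields) < 10:
            continue
        sockets.append({
            "local": decode_addr(fields[1]),
            "remote": decode_addr(fields[2]),
            "state": int(fields[3], 16),   # e.g. 0x0A is LISTEN
            "inode": int(fields[9]),       # socket inode number
        })
    return sockets
```

On a Linux host the input would come from `open("/proc/net/tcp").read()`; matching the inode against the entries in /proc/<pid>/fd is one way to attribute a socket to an application.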

9
Kernel Level network monitoring (2/2)

Network-related modules in the kernel
Netfilter layer
A framework inside the Linux 2.4.x kernel that
enables packet filtering, network address
translation (NAT), and other packet mangling
Stateful packet filtering (connection tracking)
All kinds of network address translation
A large number of additional features as patches
Using the ip_conntrack module, it is possible
to monitor network information per connection
Merits and demerits of kernel-level network
monitoring
Merits: less system load and lower monitoring
latency than with Libpcap
Demerits: patching the kernel is a somewhat
difficult and risky job
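
Per-connection information from ip_conntrack is exposed as text lines in /proc/net/ip_conntrack on 2.4-era kernels. The sketch below parses one such line; the function name and the layout of the returned dictionary are hypothetical, and the sample line in the usage note is illustrative rather than taken from the presentation.

```python
def parse_conntrack_line(line: str) -> dict:
    """Parse one connection-tracking entry into a dictionary."""
    tokens = line.split()
    entry = {"protocol": tokens[0], "original": {}, "reply": {}, "flags": []}
    for tok in tokens[1:]:
        if tok.startswith("[") and tok.endswith("]"):
            entry["flags"].append(tok.strip("[]"))   # e.g. [UNREPLIED]
        elif "=" in tok:
            key, value = tok.split("=", 1)
            if key in ("src", "dst", "sport", "dport"):
                # The first occurrence of each key describes the original
                # direction, the second one the reply direction.
                side = "original" if key not in entry["original"] else "reply"
                entry[side][key] = value
            else:
                entry[key] = value                   # e.g. use=2
        elif tok.isupper():
            entry["state"] = tok                     # e.g. ESTABLISHED
    return entry
```

For example, a line such as `tcp 6 117 SYN_SENT src=192.168.1.6 dst=192.168.1.9 sport=32775 dport=22 [UNREPLIED] src=192.168.1.9 dst=192.168.1.6 sport=22 dport=32775 use=2` yields the state, both address tuples, and the flag list.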

10
Grid Application Measurement Factors

Grid network measurement parameter

Bandwidth
Delay
Jitter
Loss

Grid Application Performance Characteristic

Reliability
Availability
Survivability
Closeness

GRID Application
11
Grid network measurement parameter

Bandwidth
Bandwidth is defined most generally as data per
unit time
Available bandwidth: the maximum amount of data
per time unit that a hop or path can provide
given the current utilization
Delay
The time between when the first part of an
object passes an observational position and the
time the last part of that object or related
object passes
Jitter
The variation in the one-way delay of packets
Important in sizing playout buffers for
applications requiring regular delivery of
packets
Loss
One-way Loss
Roundtrip Loss
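
The four parameters above can be computed from simple measurements, as in the following sketch. The function names are hypothetical, and jitter is taken here as the mean absolute difference between consecutive one-way delays, which is only one of several common definitions.

```python
from statistics import mean

def bandwidth_bps(bytes_transferred: int, seconds: float) -> float:
    """Data per unit time, expressed in bits per second."""
    return bytes_transferred * 8 / seconds

def jitter(one_way_delays: list) -> float:
    """Mean absolute variation between consecutive one-way delays."""
    diffs = [abs(b - a) for a, b in zip(one_way_delays, one_way_delays[1:])]
    return mean(diffs) if diffs else 0.0

def loss_rate(sent: int, received: int) -> float:
    """Fraction of packets that were sent but never received."""
    return (sent - received) / sent if sent else 0.0
```

Delay itself is the measured input here: each entry of `one_way_delays` is one packet's delay in whatever unit the measurement tool reports.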

GRID Network
12
Grid Application Performance Characteristic

Grid Performance events required
System info
cpu, available memory, available free storage,
network utilization, how many clients can
connect, failure rate, available disk
Data information
data type, size, current location
Data access info
read/write frequency, duration, size, user info
Network info
Bandwidth, latency, RTT, Packet Loss
Grid Application Performance Characteristic
Reliability
Availability
Survivability
Closeness

GRID Application
13
Reliability (1/3)

The probability that a system functions properly
and continuously over a fixed time period
The reliability of a grid network is the
probability that a grid application program, which
runs on multiple processing elements and needs to
communicate with other processing elements,
executes successfully
Reliability varies with the condition of the
network (retransmission rate, tcpListenDrop
rate), the accessibility of the network, and the
TCP packet loss rate
Conditions of network
Retransmission occurs when ACKs are not received
fast enough, typically because of bad network
hardware or a congested route
Retransmission rate = tcpRetransBytes /
tcpOutDataBytes
ListenDrop counts the number of times that a
connection was dropped due to a full listen queue
ListenDrop rate = tcpListenDrop / t
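
The two network-condition rates can be sketched as below. The counter names follow the slide; taking the difference between two counter snapshots and treating t as the elapsed time in seconds are assumptions of this sketch, not something the presentation states.

```python
def network_condition(before: dict, after: dict, interval_sec: float) -> dict:
    """Compute retransmission and listen-drop rates between two snapshots."""
    def delta(name: str) -> int:
        # Counters are cumulative, so use the growth over the interval.
        return after[name] - before[name]

    out_bytes = delta("tcpOutDataBytes")
    return {
        # Retransmission rate = tcpRetransBytes / tcpOutDataBytes
        "retransmission_rate": (delta("tcpRetransBytes") / out_bytes
                                if out_bytes else 0.0),
        # ListenDrop rate = tcpListenDrop / t (t assumed to be seconds)
        "listen_drop_rate": delta("tcpListenDrop") / interval_sec,
    }
```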

14
Reliability (2/3)

Accessibility of network
Network accessibility is defined as the measure
of the capacity of a location to be reached by,
or to reach, different locations
Accessibility rate = (ICMP input destination
unreachable + ICMP output destination
unreachable) / (ICMP input packets + ICMP output
packets)
TCP Loss
Packet loss describes an error condition in which
data packets appear to be transmitted correctly
at one end of a connection, but never arrive at
the other
TCPLoss rate = TCPLoss / TCP packets

15
Reliability (3/3) - Reliability Measurement
Identify the available servers of the grid network
on which to run the grid application, using GT3
Step 1
Calculate the network accessibility of the
servers used in the grid network to run the grid
application
Step 2
Calculate the TCP loss for all the servers used
in the grid network
Step 3
Calculate the network condition in the grid network
Step 5
Calculate the system failure rate λ and the system
repair rate μ = 1 - λ (λ is derived from the
accessibility rate, network condition rate, and
TCP loss rate)
Step 6
Calculate the grid service reliability of each
server (Reliability = μ/(μ + λ))
Step 7
Let the grid service reliability of server i be
R_i. Then we can calculate the mean and the
variance of the grid service reliability over the
servers (0 < i ≤ m, where i indexes the servers
and m is the number of servers)
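
The last steps can be sketched as follows. The transcript does not preserve how the failure rate is combined from the three component rates, nor the exact mean and variance formulas, so this sketch takes each server's failure rate as given and assumes the arithmetic mean and the population variance; the function names are hypothetical.

```python
from statistics import mean, pvariance

def server_reliability(failure_rate: float) -> float:
    """Reliability = mu / (mu + lambda), with repair rate mu = 1 - lambda."""
    repair_rate = 1.0 - failure_rate
    return repair_rate / (repair_rate + failure_rate)

def reliability_summary(failure_rates: list) -> tuple:
    """Mean and (population) variance of reliability over all servers."""
    values = [server_reliability(lam) for lam in failure_rates]
    return mean(values), pvariance(values)
```

Because μ = 1 - λ makes the denominator equal to 1, the per-server reliability reduces to 1 - λ under these definitions.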
16
Availability (1/2)

Network availability means the ability to recover
rapidly from a network failure
On detection of three duplicate ACKs, packet loss
is assumed and the sender halves its congestion
window size
If congestion occurs, let Tf be the time at which
congestion occurs, and let Tt be the time at
which the window size recovers to its value
before the congestion
MTTR = Tt - Tf
MTTF = the execution time of the grid application
- MTTR
λ: congestion occurrence rate = 1/MTTF
μ: congestion repair rate = 1/MTTR
Network availability = MTTF/(MTTF + MTTR)
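
A minimal sketch of the availability calculation, assuming that MTTR is the difference between the recovery time Tt and the congestion time Tf, and that MTTF is the application's execution time minus MTTR. The function and parameter names are hypothetical.

```python
def network_availability(t_congestion: float, t_recovered: float,
                         execution_time: float) -> dict:
    """Availability of the network during one grid application run."""
    mttr = t_recovered - t_congestion   # time spent recovering the window
    mttf = execution_time - mttr        # remaining, congestion-free time
    return {
        "mttr": mttr,
        "mttf": mttf,
        "availability": mttf / (mttf + mttr),
    }
```

For a 100-second run in which congestion starts at 10 s and the window recovers at 12 s, this gives MTTR = 2 s, MTTF = 98 s, and an availability of 0.98.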

17
Availability (2/2)
18
Survivability (1/5)

Capability to provide a prescribed level of QoS
for existing services after a given number of
failures occur within the network
Property of a network to be resilient to failure
Used to describe the available performance of a
network after a failure
Measures the degree of functionality remaining in
a system after a failure and consists of
evaluating metrics which quantify network
performance during failure as well as normal
operation
Monitoring Data Needs for Survivability Guarantee
determine the optimal resources for an
application job
applications could use monitoring data to adapt
themselves to the current situation
Fault detection and analysis
monitoring data is used to determine faults in
system components and applications
monitoring data could also be used to find the
cause of the faults

19
Survivability (2/5)

Survivability is enhanced by
Security techniques where applicable
Redundancy, diversity, general trust validation
Automated recovery support
Strategies for Survivability Guarantee

Network Service View

Network Restoration
Network Protection
Hardware Duplication
Software Fault Tolerance
Link/Site Diversity
Provisioning
Configurable parameters

Mitigation/Masking Strategies

Design Centering
Software Modularity
Physical/GUI Design
Traffic Robustness
Environmental Robustness
Site Location/Integrity

Prevention Strategies
Network

Technology Failures
Operational activities
Procedural errors
Traffic overloads
Environmental incidents

Failure Events
20
Survivability (3/5) - Survivability Measurement
Factors

RTV (Residual Traffic Volume)
NPR (Network Path Protection Ratio)

RTV = tv/tn
tn: traffic volume before failure
tv: traffic volume after failure
Path Protection Ratio
wi: capacity of path i
ki: capacity of the possible alternate path
capacity (bits) = bandwidth (bits/sec) × round-trip
time (sec)
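
The residual traffic volume and the path capacity defined above can be sketched as below. The function names are hypothetical; the path protection ratio is left out because its formula is not preserved in the transcript.

```python
def residual_traffic_volume(tv: float, tn: float) -> float:
    """RTV = tv / tn: share of pre-failure traffic that survives a failure."""
    return tv / tn

def path_capacity_bits(bandwidth_bps: float, rtt_sec: float) -> float:
    """Capacity (bits) = bandwidth (bits/sec) x round-trip time (sec)."""
    return bandwidth_bps * rtt_sec
```

For instance, if 1000 units of traffic were carried before a failure and 750 afterwards, RTV is 0.75.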
21
Survivability (4/5)

Resource Reallocation Mechanism After
Survivability Assessment

1. Monitoring Resource Creation
2. Performance Measurement Data Collection for
Resource
3. Survivability Assessment
5. Resource Reallocation Accomplishment
6. Survivability Assessment Result Collection
7. Survivability Assessment Result Reporting
8. Node Path Change Request
Registry
Grid Application Execution Nodes Path Change
Grid Application Execution Environment
(OS/HW/Storage etc.)
22
Survivability (5/5)

Reallocation Algorithm for Survivability
TPU average ProcessorUsage ()
TPUEssential ProcessUsage()
GPUGeneral ProcessUsage()

Survivability Assessment Resource Reallocation
Algorithm (flowchart)
Grid Application Resource Utilization Measurement
Decision (Yes/No): Total Utilization Datum Excess
Decision (Yes/No): Essential Service Utilization
Datum Excess
Recovery Resource Reallocation
Service Resource Recovery Available Compromise
Resource Utilization Re-measurement
Decision (Yes/No): Essential Service Utilization
Increase
Forced Exit of Service Available Compromise
23
Network Closeness (1/2)

A measure of the degree to which a node is
adjacent to or can reach others in a network.
Closeness is usually measured by the number of
steps it takes to reach others.
Network closeness is based on path-capacity
measurements and hop counts.
Closeness Measurement Factors
Round Trip Time
Packet loss frequency
Throughput

24
Network Closeness (2/2)

Validity Assessment for Closeness

r: round-trip time, Rmax: maximum RTT, ploss:
packet loss frequency, th: throughput, thmax:
maximum throughput

α: a weight in the interval [0, 1]
Closeness Measurement Data Dependence on the
RTT and throughput factors
- the closer α is to 1, the more Closeness
depends on throughput
- the closer α is to 0, the more Closeness
depends on RTT
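
The closeness formula itself is not preserved in the transcript. The sketch below is therefore only a hypothetical form that reproduces the stated behaviour of the weight (throughput dominates near 1, RTT dominates near 0); the function name, the normalization of each factor, and the way packet loss scales the result are all assumptions.

```python
def closeness(r: float, r_max: float, p_loss: float,
              th: float, th_max: float, alpha: float) -> float:
    """Hypothetical closeness score in [0, 1]; higher means closer."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in the interval [0, 1]")
    rtt_score = 1.0 - r / r_max   # short RTT gives a score near 1
    th_score = th / th_max        # high throughput gives a score near 1
    weighted = alpha * th_score + (1.0 - alpha) * rtt_score
    return (1.0 - p_loss) * weighted  # assumed: loss scales the score down
```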

25
Implementation of the monitoring system

Development Environment
Design spec. for Linux kernel based Information
Collector
Kernel based network information gathering
mechanism
Information Gathering Mechanism
Components of Information Collector
Information Collector daemon
Design spec. for GA-NMS Web Service
Example of GA-NMS protocol
Service Architecture
Examples of Implementation

26
Development Environment

Hardware platform
CPU: Intel Pentium III 600MHz
Memory: 192MB
Disk: 6.1GB, 3.1GB

Operating System
Red Hat Linux 7.3
Kernel 2.4.19

Running Environment
UNIX C
Kernel Module
Information Collector Module
Java (Jakarta Tomcat, WSDP (Web Services
Developer Pack))
Information Provider Web Service

27
Design spec. for Linux kernel based Information
Collector (1/4)
Kernel based network information gathering
mechanism
28
Design spec. for Linux kernel based Information
Collector (2/4) - Information Gathering Mechanism

1. Hooking
It replaces the existing protocol stack logic,
which gathers network-related information only in
the abstract, with logic that gathers
network-related information in detail (e.g.
end-to-end bandwidth)
2-1. Information gathering using a kernel module
It gathers information from the protocol stack
hooking layer
The protocol stack hooking layer hooks each
protocol stack and stores network-related
information after processing it into a
user-readable format
2-2. Information gathering using the kernel memory
interface
Not completely supported on Linux
Common Unix environments provide an interface
through which users can access kernel data
(e.g. the Kernel Virtual Memory (KVM) interface
library)
3. Data accumulating
The kernel module stores data in a filesystem
that can be read at user level
By using ProcFS (process information
pseudo-filesystem) we can avoid the load that a
real filesystem would incur
4. Information processing
A user application reads network monitoring
parameters from ProcFS and processes them into
network parameters for Grid applications

29
Design spec. for Linux kernel based Information
Collector (3/4) - Components of Information
Collector

Protocol stack hooking layer
It uses the Netfilter layer, which is supported
on Linux kernels 2.4.x to 2.6.x.
The Netfilter layer supports hooking into the
protocol stack through user-supplied functions.
It does not modify the protocol stack code, so it
can process the information the kernel uses
without modifying the original kernel data
Kernel module
It is based on the ip_conntrack kernel module
supplied by the Netfilter layer.
Some code is added and modified to gather and
process user-specific network parameters in detail
Information Collector daemon
It is a daemon that processes the network-related
information in ProcFS
It encodes the gathered information with an XML
schema and sends it to the Web Service application
30
Design spec. for Linux kernel based Information
Collector (4/4)
Information Collector daemon
31
Design spec. for GA-NMS Web Service (1/3)

Definition of Service
Grid Application Network Monitoring Service
(GA-NMS) supplies network monitoring parameters
that are useful for Grid Applications in the Grid
network
Messaging Protocol
It uses XML (eXtensible Markup Language) and SOAP
(Simple Object Access Protocol) to communicate
with each service
Service Platform Specification
Service Platform
JAVA WSDP (Web Services Developer Pack) JAXM
(Java API for XML Messaging) / JAVA
Information Collector
Linux Kernel module / C Language
Site Platform
Tomcat, Globus Toolkit 3.0 / JAVA (JSP)

32
Design spec. for GA-NMS Web Service (2/3)
Example of GA-NMS protocol
33
Design spec. for GA-NMS Web Service (3/3)
Service Architecture
34
Examples of Implementation
Main View
Statistics View
