Title: AFFAIR: A Flexible Fabric and Application Information Recorder
1. AFFAIR: A Flexible Fabric and Application Information Recorder
- Tome Anticic (1), Ruzica Piskac (2), Vedran Sego (2)
- For the ALICE collaboration
- 1) Rudjer Boskovic Institute, Zagreb, Croatia; 2) PMF, Zagreb, Croatia
2. Outline
- What is/will be monitored
- Requirements
- Tools used in AFFAIR
- AFFAIR components
- DIM/SMI
- Round robin databases
- ROOT
- AFFAIR Collectors
- AFFAIR Monitor
- AFFAIR web
- Status of AFFAIR / performance achieved
- Future work
3. ALICE DAQ
[Diagram: ALICE DAQ architecture, with CASTOR/ROOT]
4. Why a monitoring program?
- Long-term monitoring for the final ALICE DAQ
- Now: ALICE Data Challenges
- Testing of ALICE DAQ and mass storage to be ready for LHC
- Need to monitor system performance of 100s (1000s) of nodes
- Need to monitor the ALICE DAQ software (DATE) and hardware performance
5. Requirements
- Need down to 10 (or even fewer) second updates
- Reliable and robust
- Easy to maintain, easy to set up/install
- Should be as invisible as possible
- No growing (or better yet, no) logfiles on monitored nodes
- Not CPU intensive
- Not network intensive
- Flexible: new monitoring parameters/software should be easy to add/adjust
- Web access to processed, real-time data in the form of graphs, histograms, ...
- Scalable: should work equally well for 10 as for 1000 computers
- All monitored data should be permanently stored for offline analysis
- Wide-area transparency
6. AFFAIR architecture
[Diagram: several Data Collectors send data to a monitoring station, which provides control, data storage, and a web interface]
7. Tools?
- Previous ADC monitoring used DATE's info loggers to gather and process data (P. Saiz, K. Schossmaier)
- Worked, but needed a more general and flexible tool
- Analysis of open-source tools showed none completely fulfilled the requirements
- Combined a number of tools and added our own:
  - DIM
  - SMI
  - ROOT
  - rrdtool
  - sysstat
  - Apache/PHP
8. AFFAIR overview at ADC
[Diagram: AFFAIR Collectors for DATE (LDC, GDC, ...) and for the system (CPU, IO, ...) publish data over DIM; SMI (finite state machine) and Run Control supply control, parameters, and configuration; the AFFAIR monitor writes the data to round robin databases (rrdtool, ROOT); the AFFAIR plotter (ROOT) produces graphs served by the AFFAIR web interface (php)]
9. Why not SNMP?
- Not simple at all!
- Needs root intervention to get started
- At least 2 times the number of calls, and the application is busy/wasting time during calls
[Diagram: SNMP polling. For every value, the client sends a Request, waits while the server is busy, and only then receives the Data]
- Each variable monitored (CPU, network IO, ...) requires separate calls, with all the overhead
- Use of SNMP limits one to just the standard monitored parameters, so any specialized ones can't be monitored
- Especially true for applications
10. How does DIM work
- Data transferred asynchronously, interrupt-driven
- Half as many calls
- Parallelism: the client can do other things while the server is busy
[Diagram: DIM publish/subscribe. The client sends one Request at startup; after that the server pushes Data updates while both client and server stay busy doing their own jobs]
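The request-once, push-thereafter pattern above can be sketched in a few lines. This is not the real DIM API, just an illustration of why a single registration replaces the repeated request/reply round trips of polling:

```python
# Sketch of the DIM-style publish/subscribe pattern described above.
# NOT the real DIM library API: the class and method names are invented
# to illustrate one registration replacing repeated request/reply calls.

class Server:
    """Publishes a named service; remembers subscribers and pushes updates."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        # The client's single "Request (at startup)".
        self.subscribers.append(callback)

    def publish(self, value):
        # The server pushes data whenever it has an update; the client
        # never polls, so the number of calls is halved.
        for callback in self.subscribers:
            callback(value)

class Client:
    def __init__(self, server):
        self.received = []
        server.subscribe(self.received.append)  # one request, at startup

server = Server()
client = Client(server)
server.publish(100)   # interrupt-driven updates: no further requests
server.publish(111)
print(client.received)  # [100, 111]
```

Between updates the client is free to do its own work, which is the parallelism the slide refers to.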
11. DIM in practice
[Diagram: a DIM name server mediates between client and server]
- If the client or the server goes down and comes up again, the connection is re-established automatically
12. AFFAIR Collector program: DATE
[Diagram: the AFFAIR Collector, built on the AFFAIR DIM/SMI library, runs an endless loop with a 10 sec period]
13. AFFAIR Collector program: System
[Diagram: the AFFAIR Collector, built on the AFFAIR DIM/SMI library, runs an endless loop with a 10 sec period]
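As an illustration of what the System collector's 10-second loop computes, here is a CPU-usage calculation from cumulative counters, in the style of sysstat reading /proc/stat. This is an assumed sketch of the method, not AFFAIR's actual code, and the numbers are synthetic:

```python
# Sketch (assumption, not AFFAIR's code) of a System collector metric:
# CPU usage derived from the difference of cumulative /proc/stat-style
# jiffy counters (user, nice, system, idle) between two samples taken
# one 10-second period apart, as tools like sysstat do.

def cpu_usage(prev, curr):
    """Percentage of non-idle time between two (user, nice, system, idle)
    cumulative-counter samples."""
    busy = sum(curr[:3]) - sum(prev[:3])
    total = sum(curr) - sum(prev)
    return 100.0 * busy / total if total else 0.0

# Two samples taken 10 s apart (synthetic numbers):
sample1 = (1000, 0, 500, 8500)   # user, nice, system, idle
sample2 = (1300, 0, 600, 9100)
print(cpu_usage(sample1, sample2))  # 40.0
```

In the real collector the resulting value would then be published over DIM each period.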
14. AFFAIR monitor
[Diagram: the AFFAIR Monitor, built on the DIM library]
15. Round robin databases
- Very fast and efficient data storage mechanism
- Works with a fixed amount of data (fixed time depth)
- Works with a pointer to the current element
[Diagram: circular buffer of slots Time 1 ... Time 5, with writes wrapping around]
- Each data set (LDC, GDC, system info) for each machine has its own rrd
- Each is created so that it holds:
  - 10 sec info for the last 1 hour (360 rows deep)
  - 1 min info for the last 6 hours (360 rows deep)
  - 4 min info for the last 24 hours (360 rows deep)
  - ... for the last 6 hours, 1 month, 3 months, 1 year
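The fixed-depth, wrapping-pointer idea above can be shown in a minimal sketch (illustration only, not rrdtool itself):

```python
# Minimal sketch of round robin storage: a fixed-depth archive with a
# pointer to the current element, so old data is overwritten in place
# and the database never grows. (Illustration only, not rrdtool.)

class RoundRobinArchive:
    def __init__(self, rows):
        self.rows = rows
        self.data = [None] * rows   # fixed amount of data: fixed time depth
        self.ptr = 0                # pointer to the current element

    def store(self, value):
        self.data[self.ptr] = value
        self.ptr = (self.ptr + 1) % self.rows   # wrap around

# A 4-row archive receiving 5 samples: "Time 1" is overwritten by "Time 5".
rra = RoundRobinArchive(4)
for t in range(1, 6):
    rra.store("Time %d" % t)
print(rra.data)  # ['Time 5', 'Time 2', 'Time 3', 'Time 4']
```

This is why the monitored nodes never accumulate growing logfiles: each archive's disk footprint is fixed at creation time.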
16. Round robin databases II
- For each time bin the data gets resampled (correctly interpolated) to keep it in fixed intervals
- e.g. if the rrd stores in 10 second intervals:
  - T = 10, value 100: stored as 100 at time 10
  - T = 21, value 111: stored as 110 at time 20
- Data consolidation:
  - For each time period, rrdtool finds the average, maxima, and minima (each requires a separate row)
  - These are very useful for graphs
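The resampling and consolidation steps can be sketched as follows, reproducing the slide's own numbers. This is an illustration of the idea; rrdtool's actual algorithm handles more cases (unknown values, heartbeats):

```python
# Sketch of the resampling described above: samples arriving at arbitrary
# times are linearly interpolated onto the fixed 10-second grid, then
# consolidated into average/min/max per coarser time bin.
# (Illustration only, not rrdtool's full algorithm.)

def interpolate(t0, v0, t1, v1, t):
    """Linear interpolation of the value at grid time t,
    given samples (t0, v0) and (t1, v1)."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# The slide's example: (T=10, 100) then (T=21, 111) -> 110 at time 20.
print(interpolate(10, 100, 21, 111, 20))  # 110.0

def consolidate(values):
    """One consolidated value per function, as rrdtool keeps them
    in separate rows."""
    return {"avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values)}

print(consolidate([100.0, 110.0, 130.0]))
```

The average is what the plots draw as full lines, and the maxima as dashed lines, as shown in the graph examples later.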
17. AFFAIR plotter
[Diagram: several rrds (rrd 1, rrd 2, rrd 3) feed the AFFAIR ROOT plotter]
- ROOT used to generate, for each of the last hour/6 hours/day/...:
  - eps graphs for each node
  - aggregate plots (e.g. total throughput in/out for GDC/LDC/network)
  - superimposed plots (e.g. GDC throughput for all machines on one plot)
- Put in permanent storage as ROOT files, for later detailed analysis
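The data handling behind the two plot types can be sketched as below: an aggregate plot sums the per-node series point by point, while a superimposed plot keeps each node's series separate. The node names and numbers are synthetic, and this illustrates only the data preparation, not the ROOT drawing calls:

```python
# Sketch of the aggregation step before drawing (synthetic data;
# node names are invented for illustration).

# Per-node rates (kB/sec) at three consecutive time bins:
rates = {
    "gdc01": [120.0, 130.0, 125.0],
    "gdc02": [110.0, 115.0, 120.0],
}

# Aggregate plot: one summed "total throughput" series for all nodes.
total = [sum(samples) for samples in zip(*rates.values())]
print(total)  # [230.0, 245.0, 245.0]

# Superimposed plot: every node's own series on one set of axes.
for node, series in sorted(rates.items()):
    print(node, series)
```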
18. Graph configuration
- All graphs created using one configuration file
- Completely defines units / labels / whether graphs are aggregated / whether graphs are superimposed
- Thus no code intervention needed to create the plots
- New monitored variables can be added and configured very easily
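A configuration file like the one described might be read as sketched below. The field names and file format here are entirely invented for illustration; the slides do not show AFFAIR's actual format:

```python
# Hypothetical sketch of a one-file graph configuration: each line
# defines a variable's units and whether it is aggregated and/or
# superimposed, so new variables need no code changes.
# The format and field names are invented, not AFFAIR's real ones.

CONFIG = """\
# variable      units    aggregate  superimpose
gdc_rate        kB/sec   yes        yes
recorded_evts   events   no         yes
cpu_usage       percent  no         no
"""

def parse_config(text):
    graphs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, units, agg, sup = line.split()
        graphs[name] = {"units": units,
                        "aggregate": agg == "yes",
                        "superimpose": sup == "yes"}
    return graphs

graphs = parse_config(CONFIG)
print(graphs["gdc_rate"])
```

Adding a monitored variable is then a one-line change to the file, which is the point the slide makes.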
19. Graph examples
[Graph: rates (kB/sec) for the last 24 hours for some GDC nodes]
- Full lines: average values
- Dashed lines: max values
[Graph: rates (kB/sec) for the last 7 days for some GDC nodes]
20. Graph examples II
[Graph: total rates (kB/sec) for the last hour for all GDC nodes; an aggregate plot, calculated by AFFAIR]
[Graph: recorded events for the last 7 days for some GDC nodes]
21. Graph examples III
- System performance for an individual computer
- This format is also defined using the configuration file
[Graph: CPU usage for the last day for one computer (tbed0005)]
22. Web interface
- Web interface written using PHP/JavaScript
- Completely automatically generated
- New variables and monitored sets automatically reflected in plots
- http://pcaldz02.cern.ch:8080
- Clicks for last hour / 6 hours / day
- On click, converts eps to png
23. Web interface II
- A click will generate plots for the machine and lead you to its page
24. AFFAIR performance
- Successfully monitoring 100 computers in 10 sec intervals during the ADC for the last 2 months
- Delay between data received and graphs generated down to 1-5 minutes
- 10000s of plots generated every few minutes
- Monitoring is done using two dual-CPU 1 GHz machines, connected with NFS
- No showstoppers encountered
- Many problems found and solved during the run
- Some sporadic small problems remaining: occasional improper shutdown of Collector nodes, occasionally garbled graphs, occasionally dying processes
25. AFFAIR performance II
- In the testing phase, monitored all lxshare machines in 1 second intervals
- Proved very useful in developing DATE: establishing DATE's performance, and finding the source of, and solution to, problems
- No reason to believe the system cannot scale to 1000s of nodes
- All CPU/disk intensive operations can trivially be spread across a number of nodes
- Main possible limitation is the continuous generation of graphs for all individual computers (10000 every few minutes), but this is being taken care of (see next slide)
- AFFAIR is quite general: easy to add additional applications/variables to monitor
26. To do / near future (weeks/months)
- Make graph generation more efficient: a factor 2-5 with some coding changes
  - Graphs will then have a delay under 1 minute
- Much better web interface:
  - Buttons for the type of plot, not just the time period (also automatic)
  - Page with a table of the latest (last 10 seconds) numerical performance values
  - More graphs, better graphs, prettier graphs
- Enable graph generation for individual computers only when clicked
- Have a releasable version of the code
  - Documentation, user manual
- Signals / status of computers and applications
  - E.g. disk full / CPU too high / events not incrementing
  - Color-code the status for each computer/application on the web page
27. To do / long term
- Have an AFFAIR control interface to manage it (Kylix?)
- Option to store data not in fixed intervals, where data consolidation is not wanted
  - e.g. event size / trigger type
- Add a lot more graph options
- Option to store as an SQL database
- Add option for a varying number of monitored variables
  - e.g. free space in all partitions
- Detailed offline analysis code
- Add more AFFAIR Collectors for DATE
  - Detector Data Optical Link
  - Switches (might need to incorporate SNMP)
  - Mass storage