AFFAIR: A Flexible Fabric and Application Information Recorder

1
AFFAIR: A Flexible Fabric and Application
Information Recorder
  • Tome Anticic (1), Ruzica Piskac (2), Vedran Sego (2)
  • For the ALICE collaboration

(1) Rudjer Boskovic Institute, Zagreb, Croatia
(2) PMF, Zagreb, Croatia
2
Outline
  • What is/will be monitored
  • Requirements
  • Tools used in AFFAIR
  • AFFAIR components
  • DIM/SMI
  • Round robin databases
  • ROOT
  • AFFAIR Collectors
  • AFFAIR Monitor
  • AFFAIR web
  • Status of AFFAIR / performance achieved
  • Future work

3
ALICE DAQ
[Diagram of the ALICE DAQ; the only surviving label is CASTOR/ROOT]
4
Why a monitoring program?
  • Long-term: monitoring for the final ALICE DAQ
  • Now: ALICE Data Challenges (ADC)
  • Testing of ALICE DAQ and mass storage, to be ready
    for LHC
  • Need to monitor system performance of 100s
    (1000s) of nodes
  • Need to monitor the ALICE DAQ software (DATE) and
    hardware performance

5
Requirements
  • Updates needed every 10 seconds or even less
  • Reliable and robust
  • Easy to maintain, easy to set up/install
  • Should be as invisible as possible
  • No growing log files (or better yet, none at all)
    on monitored nodes
  • Not CPU intensive
  • Not network intensive
  • Flexible: new monitoring parameters/software
    should be easy to add/adjust
  • Web access to processed, real-time data in the
    form of graphs, histograms, ...
  • Scalable: should work equally well for 10 as for
    1000 computers
  • All monitored data should be permanently stored
    for offline analysis
  • Wide area transparency

6
AFFAIR architecture
[Diagram: four Data Collectors, a monitoring station, data storage,
and a web interface, linked by Control and DATA flows]
7
Tools?
  • Previous ADC monitoring used DATE's info loggers
    to gather and process data (P. Saiz, K.
    Schossmaier)
  • Worked, but a more general and flexible tool is
    needed
  • Analysis of open source tools showed that none
    completely fulfilled the requirements
  • Combined a number of tools and added our own code:
    • DIM
    • SMI
    • ROOT
    • rrdtool
    • Sysstat
    • Apache/php

8
AFFAIR overview at ADC
[Diagram. Components shown: SMI (finite state machine), Run Control,
AFFAIR Collector-DATE (LDC, GDC, ...), AFFAIR Collector-System (CPU, IO, ...),
AFFAIR monitor, database (rrdtool, ROOT), AFFAIR plotter (ROOT),
AFFAIR web (php), DIM. Flows are labelled "DATA" and "Control, parameters,
configuration". The figure also carries the number 100.]
9
Why not SNMP?
  • Not simple at all!
  • Need root intervention to get started
  • At least twice the number of calls, and the
    application is busy (wasting time) during calls

[Diagram: for each value the Client sends a Request and waits while the
Server is busy; the Server then returns Data. The cycle repeats for every
request.]
  • Each variable monitored (CPU, network IO, ...)
    requires separate calls, with all the overhead
  • Use of SNMP limits one to just the standard monitored
    parameters, so any specialized ones cannot be
    monitored
    • Especially true for applications

10
How does DIM work
  • Data transferred asynchronously, interrupt driven
  • Half as many calls
  • Parallelism: the client can do other things while
    the server is busy

[Diagram: the Client sends a single Request at startup. Afterwards Client
and Server are each busy doing their own job while the Server sends Data
to the Client repeatedly.]
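The "half as many calls" claim can be checked with simple counting. The sketch below is only an illustration of the two diagrams, not DIM or SNMP code: it assumes that polling costs one request plus one reply per value, and that a subscription costs one request per variable at startup plus one data message per update. The function names are hypothetical.

```python
def polling_messages(variables: int, updates: int) -> int:
    """Request/response polling: every value costs a request and a reply."""
    return 2 * variables * updates


def subscription_messages(variables: int, updates: int) -> int:
    """Subscription: one request per variable at startup, then data only."""
    return variables + variables * updates


# Example: 5 monitored values, one hour of 10 sec updates (360 updates).
print(polling_messages(5, 360))       # 3600
print(subscription_messages(5, 360))  # 1805
```

Under these assumptions the subscription model approaches half the message count as the number of updates grows, and the client never blocks waiting for a reply.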
11
DIM in practice
  • Client/Server

[Diagram: Client and Name server]
  • If a client or server goes down and comes back up,
    the connection is re-established automatically

12
AFFAIR Collector program: DATE
[Diagram: AFFAIR Collector built on the AFFAIR DIM/SMI library,
running an endless loop with a 10 sec period]
13
AFFAIR Collector program: System
[Diagram: AFFAIR Collector built on the AFFAIR DIM/SMI library,
running an endless loop with a 10 sec period]
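A collector of this kind is essentially a fixed-period loop that reads a sample and hands it to a publishing layer. The following is a minimal sketch under stated assumptions, not the AFFAIR code: `read_sample` and `publish` are hypothetical callables (in AFFAIR the publishing side would be the DIM/SMI library), and the `iterations`, `clock`, and `sleep` parameters exist only so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Dict, Optional

Sample = Dict[str, float]


def run_collector(
    read_sample: Callable[[], Sample],
    publish: Callable[[Sample], None],
    period: float = 10.0,
    iterations: Optional[int] = None,  # None means loop forever
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Read and publish one sample per period, without accumulating drift."""
    next_tick = clock()
    done = 0
    while iterations is None or done < iterations:
        publish(read_sample())
        done += 1
        # Schedule against absolute tick times so that the time spent
        # reading and publishing does not stretch the period.
        next_tick += period
        delay = next_tick - clock()
        if delay > 0:
            sleep(delay)
```

For example, `run_collector(lambda: {"load1": os.getloadavg()[0]}, print)` would print the one-minute load average every 10 seconds on a Unix system (with `import os`).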
14
AFFAIR monitor
[Diagram: AFFAIR Monitor built on the DIM library]
15
Round robin databases
  • Very fast and efficient data storage mechanism
  • Works with a fixed amount of data (fixed time
    depth)
  • Works with a pointer to the current element

[Diagram: ring of slots Time 1 to Time 5, with a pointer to the current
element]
  • Each data set (LDC, GDC, system info) for each
    machine has its own rrd
  • Each is created so that it stores:
    • 10 sec info for the last 1 hour (360 rows deep)
    • 1 min info for the last 6 hours (360 rows deep)
    • 4 min info for the last 24 hours (360 rows deep)
    • ... and likewise for last 6 hours, 1 month, 3 months, 1 year
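The fixed depth and the pointer to the current element can be illustrated with a small ring buffer. This is a sketch of the general round robin idea only, not of rrdtool's file format; the class and method names are hypothetical.

```python
from typing import List, Optional


class RoundRobinArchive:
    """Fixed-size archive: each new row overwrites the oldest one."""

    def __init__(self, rows: int, step_seconds: int) -> None:
        self.rows = rows
        self.step_seconds = step_seconds
        self._data: List[Optional[float]] = [None] * rows
        self._current = -1  # index of the most recently written row

    def append(self, value: float) -> None:
        # Advance the pointer, wrapping around at the end of the buffer.
        self._current = (self._current + 1) % self.rows
        self._data[self._current] = value

    def depth_seconds(self) -> int:
        """Time span covered by a full archive."""
        return self.rows * self.step_seconds

    def latest_first(self) -> List[float]:
        """Stored values, newest first, skipping rows never written."""
        ordered = [self._data[(self._current - i) % self.rows]
                   for i in range(self.rows)]
        return [v for v in ordered if v is not None]


# The three archives listed above, each 360 rows deep:
for step in (10, 60, 240):
    print(step, RoundRobinArchive(360, step).depth_seconds() / 3600, "hours")
```

The loop confirms the arithmetic of the list: 360 rows at 10 sec, 1 min, and 4 min cover 1, 6, and 24 hours. Storage never grows, because writes only move the pointer.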

16
Round robin databases II
  • For each time bin the data is resampled
    (correctly interpolated) so that it stays on fixed
    intervals
  • E.g. if the rrd stores in 10 second intervals:
    • T = 10, value 100: stored as 100 at time 10
    • T = 21, value 111: stored as 110 at time 20
  • Data consolidation
    • For each time period rrdtool finds the average,
      maximum, and minimum (each requires a separate row)
    • These are very useful for graphs
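The numbers in the resampling example follow from linear interpolation between the two samples, and consolidation is a per-bucket average, maximum, and minimum. The sketch below reproduces that arithmetic in plain Python. It is an assumption that plain linear interpolation is an adequate model here; rrdtool's internal algorithm is not shown on the slide and may differ in detail. Function names are hypothetical.

```python
from typing import Dict, List, Tuple


def interpolate_to_grid(t0: int, v0: float, t1: int, v1: float,
                        step: int) -> List[Tuple[int, float]]:
    """Linearly interpolate between two samples onto grid times in (t0, t1]."""
    points = []
    first = (t0 // step + 1) * step  # first grid time strictly after t0
    for t in range(first, t1 + 1, step):
        value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        points.append((t, value))
    return points


def consolidate(values: List[float], per_row: int) -> List[Dict[str, float]]:
    """Reduce each full group of `per_row` fine-grained values to one row."""
    rows = []
    for i in range(0, len(values) - per_row + 1, per_row):
        chunk = values[i:i + per_row]
        rows.append({
            "average": sum(chunk) / len(chunk),
            "max": max(chunk),
            "min": min(chunk),
        })
    return rows


# Sample at T = 10 with value 100, next sample at T = 21 with value 111:
print(interpolate_to_grid(10, 100.0, 21, 111.0, 10))  # [(20, 110.0)]

# Six 10 sec values consolidated into one 1 min row:
print(consolidate([100.0, 110.0, 120.0, 130.0, 140.0, 150.0], 6))
```

The first call gives 110 at time 20, matching the example above. The second shows why keeping the maximum alongside the average is useful for graphs: short peaks survive even when the average smooths them out.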

17
AFFAIR plotter
[Diagram: rrd 1, rrd 2, rrd 3 feeding AFFAIR ROOT]
  • ROOT is used to generate, for each period (last
    hour / 6 hours / day / ...):
    • eps graphs for each node
    • aggregate plots
      (e.g. total throughput in/out for
      GDC/LDC/network)
    • superimposed plots (e.g. GDC throughput
      for all machines on one plot)

Data is put in permanent storage as ROOT files, for
later detailed analysis
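The difference between an aggregate plot and a superimposed plot is easy to state in code. This is a plain Python sketch of the data preparation only (no ROOT calls), assuming every node's series is already on the same time bins; the node names and function names are hypothetical.

```python
from typing import Dict, List


def aggregate(series_by_node: Dict[str, List[float]]) -> List[float]:
    """One curve: the per-node series summed bin by bin."""
    return [sum(values) for values in zip(*series_by_node.values())]


def superimpose(series_by_node: Dict[str, List[float]],
                nodes: List[str]) -> Dict[str, List[float]]:
    """Several curves on one plot: one series per selected node."""
    return {node: series_by_node[node] for node in nodes}


rates = {"gdc01": [100.0, 120.0, 90.0], "gdc02": [80.0, 60.0, 110.0]}
print(aggregate(rates))             # [180.0, 180.0, 200.0]
print(superimpose(rates, ["gdc01"]))
```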
18
Graph configuration
  • All graphs are created using one configuration file
  • It completely defines the units, the labels, and
    whether graphs are aggregated or superimposed
  • Thus no code intervention is needed to create the
    plots
  • New monitored variables can be added and
    configured very easily
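The slides do not show the configuration file itself, so the format below is purely an assumption: an INI-style file with one section per monitored variable. The sketch only illustrates how a single file can carry units, labels, and the aggregate/superimpose switches so that adding a variable needs no code change. The section names, keys, and `load_graph_specs` are hypothetical.

```python
import configparser
from typing import Dict

# Hypothetical configuration; the real AFFAIR format is not shown in the talk.
EXAMPLE_CONFIG = """
[gdc_rate]
label = GDC throughput
units = kB/sec
aggregate = yes
superimpose = yes

[cpu_user]
label = CPU user time
units = percent
aggregate = no
superimpose = no
"""


def load_graph_specs(text: str) -> Dict[str, dict]:
    """Parse one section per monitored variable into a plain dictionary."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    specs = {}
    for name in parser.sections():
        section = parser[name]
        specs[name] = {
            "label": section.get("label", name),
            "units": section.get("units", ""),
            "aggregate": section.getboolean("aggregate", fallback=False),
            "superimpose": section.getboolean("superimpose", fallback=False),
        }
    return specs


for name, spec in load_graph_specs(EXAMPLE_CONFIG).items():
    print(name, spec)
```

With such a layout, a new monitored variable would be one new section in the file, and the plotting code would simply iterate over whatever sections it finds.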

19
Graph examples
  • GDC performance

Rates (kB/sec) for last 24 hours for some GDC
nodes
  • Full lines: average
  • Dashed lines: max values

Rates (kB/sec) for last 7 days for some GDC nodes
20
Graph examples II
  • GDC performance

Total Rates (kB/sec) for last hour for all GDC
nodes
Aggregate plot, calculated by AFFAIR
Recorded events for last 7 days for some GDC
nodes
21
Graph examples III
  • System performance for an individual computer
  • This format also defined using the configuration
    file

CPU usage for last day for one computer (tbed0005)
22
Web interface
  • Web interface written using PHP/JavaScript
  • Completely automatically generated
  • New variables and monitored sets are automatically
    reflected in the plots

http://pcaldz02.cern.ch:8080
  • Click to select last hour / 6 hours / day
  • On click, the eps graph is converted to png

23
Web interface II
A click will generate plots for the machine and
lead you to its page
24
AFFAIR performance
  • Successfully monitoring 100 computers in 10 sec
    intervals during ADC for the last 2 months
  • Delay between receiving data and generating graphs
    is down to 1-5 minutes
  • 10000s of plots generated every few minutes
  • Monitoring is done using two dual CPU 1 GHz
    machines, connected with NFS
  • No showstoppers encountered
  • Many problems found and solved during the run
  • Some sporadic small problems remain:
    occasional improper shutdown of Collector nodes,
    graphs occasionally garbled, processes
    occasionally die.

25
AFFAIR performance II
  • In the testing phase, monitored all lxshare machines
    in 1 second intervals
  • Proved very useful in developing DATE, measuring
    DATE performance, and finding the source and solution
    of problems
  • No reason to believe the system cannot scale to
    1000s of nodes
  • All CPU/disk intensive operations can trivially
    be spread across a number of nodes
  • The main possible limitation is continuous generation
    of graphs for all individual computers (10000
    every few minutes), but this is being taken care of
    (see next slide)
  • AFFAIR is quite general: easy to add additional
    applications/variables to monitor

26
To do/near future (weeks/month)
  • Make graph generation more efficient: a factor of 2-5
    with some coding changes
    • Graphs will have a delay under 1 minute
  • Much better web interface
    • Buttons for type of plots, not just time periods
      (also automatic)
    • Page with a table of the latest (last 10 seconds)
      numerical performance values
    • More graphs, better graphs, prettier graphs
    • Enable graph generation for individual computers
      only when clicked
  • Have a releasable version of the code
  • Documentation, user manual
  • Signals / status of computers/applications
    • E.g. disk full / CPU too high / events not
      incrementing
    • Color code the status for each computer/application
      on the web page
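The planned status signals amount to a few threshold checks per node. Since this is listed as future work, everything below is an assumption made for illustration: the thresholds, the three colors, the severity ordering, and the function name are hypothetical and not part of AFFAIR.

```python
def node_status(disk_used_fraction: float,
                cpu_load_fraction: float,
                events_now: int,
                events_before: int,
                disk_limit: float = 0.95,  # assumed threshold
                cpu_limit: float = 0.90    # assumed threshold
                ) -> str:
    """Map the latest values of one node to a color for the web page."""
    # A full disk or a stalled event counter is treated as the most severe.
    if disk_used_fraction >= disk_limit or events_now <= events_before:
        return "red"
    if cpu_load_fraction >= cpu_limit:
        return "yellow"
    return "green"


print(node_status(0.40, 0.30, events_now=1200, events_before=1100))  # green
print(node_status(0.40, 0.97, events_now=1200, events_before=1100))  # yellow
print(node_status(0.99, 0.30, events_now=1200, events_before=1100))  # red
```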

27
To do/long term
  • Have an AFFAIR control interface to manage it
    (Kylix?)
  • Option to store data not in fixed intervals,
    where data consolidation is not wanted
    • e.g. event size / trigger type
  • Add a lot more graph options
  • Option to store as SQL database
  • Add option to have a varying number of variables
    monitored
    • e.g. free space in all partitions
  • Detailed offline analysis code
  • Add more AFFAIR Collectors for DATE
    • Detector Data Optical Link
    • Switches (might need to incorporate SNMP)
    • Mass storage