Title: Lemon Monitoring
1. Lemon Monitoring
- Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
- CERN-IT/FIO-FS
- LCG Operations Workshop
- Bologna, 24-26 May 2005
2. Outline
- Lemon
- Structure and design
- How it works, deployment
- Use cases, web interface
- Installation and setup
- Summary
3. Lemon: LHC Era Monitoring
- Lemon is a set of tools for monitoring the status and performance of computers
- Distributed monitoring system scalable to 10k nodes
- Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
- Facilitates early error detection and problem prevention
- Executes corrective actions and sends notifications
- Provides persistent storage of the monitoring data
- Offers a framework for creating further monitoring sensors
- Site-independent functionality
- Link: http://cern.ch/lemon
- Part of the ELFms tool suite: http://cern.ch/elfms
4. Lemon Use
- It is used inside and outside CERN by
  - System administrators, service managers, cluster responsibles
  - Developers and service/data challenges
  - Managers and general users
- Deployments outside CERN
  - EDG testbeds
  - Accelerator (AB) department at CERN
  - CMS online
  - GridICE
  - BARC India (development partner)
5. Lemon architecture
6. Components
- Lemon is a typical client/server application with the following components
- MSA: Monitoring Sensor Agent (Lemon Agent)
  - Daemon on a client machine that spawns multiple Monitoring Sensors to measure data at defined intervals and sends the data to the Monitoring Repository
- MS: Monitoring Sensor
  - Uses standard C, Perl API: it is easy to write your own sensor
  - Several sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security, and alarms (260 metrics in total)
- MR: Monitoring Repository
  - Server application that receives samples and processes/validates them
  - Stores the full monitoring history data
  - Two implementations: flat files or Oracle DB based
- LRF: Lemon RRD Framework
  - Pre-processes data into RRD files and creates cluster summaries
  - These are used for web graphics
  - Provides service and cluster overview in its web displays
- LAG: Lemon Alarm Gateway
  - Generic gateway for alarms (in development)
  - Gateways to MonALISA and GridICE exist
7. Lemon at CERN
- Lemon monitors about 2200 computers in 100 clusters
- On average it collects about 70 metrics from each host
- Integrated with the Sure alarm system
- Collecting about 1.5 GB/day
- LEAF (LHC-Era Automated Fabric) for high-level intervention scheduling
(Diagram labels: Node, Configuration Management, Node Management)
- Configuration
  - Derived from the Quattor Configuration Database (CDB)
  - Individual configuration per cluster/host
  - Hierarchical structure
- Alarm system
  - Sure: legacy system receiving alarms from Lemon
  - Integration with the new LASER system (LHC alarm system) via LAG is ongoing
8. Web interface
- Cluster view displays accumulated statistics and status for all machines in the cluster
- Host view gives an overview of the host status with basic metrics
- Other views available
  - Rack view
  - Hardware type view
  - Other views can be added; user-defined views are being worked on
- With the newest version (to be released soon)
  - Generic entry page displaying a status overview of the key services
  - Configurable views
- In development: database services monitoring with a database-specific view
9. Use(ful) case
Reboot occurrence history graph
- Kernel upgrade
  - Kernel version is measured at boot of the machine
  - Automatic tools for upgrading the kernel on a cluster retrieve information from Lemon and schedule a reboot of a machine based on this information
  - Web interface allows monitoring of the progress
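The kernel upgrade use case can be illustrated with a minimal Python sketch. It is not the actual CERN tooling: the host names, kernel version strings, and the idea of rebooting in fixed-size batches are assumptions made for illustration, and the measured versions are hard-coded here rather than retrieved from Lemon.

```python
from typing import Dict, List


def hosts_needing_reboot(running: Dict[str, str], target: str) -> List[str]:
    """Return hosts whose last measured kernel differs from the target, sorted by name."""
    return sorted(host for host, kernel in running.items() if kernel != target)


def reboot_batches(hosts: List[str], batch_size: int) -> List[List[str]]:
    """Split hosts into batches so that only part of a cluster reboots at once.

    Batching is an assumption of this sketch; the slides only say that reboots
    are scheduled based on the monitored kernel version.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Hypothetical kernel versions as measured at boot (stand-in for monitoring data).
measured = {
    "node001": "2.4.21-20",
    "node002": "2.4.21-27",
    "node003": "2.4.21-20",
    "node004": "2.4.21-20",
}

pending = hosts_needing_reboot(measured, target="2.4.21-27")
batches = reboot_batches(pending, batch_size=2)
print(batches)
```

Progress can then be followed by re-running the selection after each batch: hosts drop out of the pending list once their measured kernel matches the target.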
10. Computer Center display
- Lemon Web Interface can be interfaced with a Computer Center database of objects (racks, silos, ...)
- Provides search of objects as well as listing
- Interfaced through an XML-defined geometry of the computer center
- Generic design that can be used anywhere
<?xml version="1.0" ?>
<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <DOORS R="0" G="255" B="0">
      <DOOR X="63" Y="39" LX="64" LY="39" />
      <DOOR X="34" Y="0" LX="36" LY="0" />
    </DOORS>
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
    <WALLS R="0" G="0" B="0">
      <WALL X="0" Y="0" LX="0" LY="60" />
      <WALL X="0" Y="0" LX="76" LY="0" />
    </WALLS>
    <STEPS R="255" G="163" B="0">
      <STEP X="47" Y="36" LX="52" LY="37" />
      <STEP X="47" Y="37" LX="52" LY="38" />
    </STEPS>
  </ROOM>
</CC>
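As an illustration of how such a geometry file could be consumed, here is a small Python sketch that lists the racks of each room using the standard library XML parser. The element and attribute names are taken from the sample on this slide; the function name and the returned dictionary layout are assumptions of this sketch, and the meaning of X, Y, LX, LY is not stated in the slides, so the values are only read, not interpreted.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the slide's sample geometry (racks only).
GEOMETRY = """<?xml version="1.0" ?>
<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
  </ROOM>
</CC>
"""


def list_racks(xml_text):
    """Return one dict per RACK element, tagged with the ID of its room."""
    root = ET.fromstring(xml_text)
    racks = []
    for room in root.iter("ROOM"):
        for rack in room.iter("RACK"):
            racks.append({
                "room": room.get("ID"),
                "id": rack.get("ID"),
                "x": int(rack.get("X")),
                "y": int(rack.get("Y")),
                "lx": int(rack.get("LX")),
                "ly": int(rack.get("LY")),
                "planned": rack.get("PLANNED") == "1",
            })
    return racks


racks = list_racks(GEOMETRY)
for rack in racks:
    print(rack["room"], rack["id"], rack["x"], rack["y"])
```

A search by rack ID, as offered by the web interface, then reduces to filtering this list on the "id" key.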
11. Service challenges, GRID VOs
- Lemon allows for
- Virtual clusters
  - clusters defined on request by service managers
  - or defined by scripts and updated dynamically on demand
  - or defined for a specific purpose
  - Examples: Alice MDC, network challenges, ...
- Clusters defined dynamically
  - Example: hosts running GRID jobs on the batch cluster belonging to the given Virtual Organization
  - Hooks in Lemon for defining any dynamic grouping of hosts
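The dynamic grouping idea can be sketched in a few lines of Python. This is not Lemon's hook interface: the function name, the input format (pairs of host and Virtual Organization for currently running jobs), and the host names are all hypothetical.

```python
from collections import defaultdict


def clusters_by_vo(running_jobs):
    """Group hosts into one dynamic cluster per Virtual Organization.

    running_jobs is an iterable of (host, vo) pairs, one per running job.
    A host running jobs for several VOs appears in several clusters.
    """
    clusters = defaultdict(set)
    for host, vo in running_jobs:
        clusters[vo].add(host)
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Hypothetical snapshot of jobs on a batch cluster.
jobs = [
    ("lxb0001", "alice"),
    ("lxb0002", "cms"),
    ("lxb0001", "cms"),
    ("lxb0003", "alice"),
]

dynamic_clusters = clusters_by_vo(jobs)
print(dynamic_clusters)
```

Because the grouping is recomputed from the current job list, cluster membership follows the batch system without any manual reconfiguration.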
12. Automatic recovery actions and Alarms
- Alarm Sensor
  - For defined values of measured metrics, an actuator is called with a predefined action
  - An example: ssh daemon dead, action: /sbin/service sshd start
  - Definition: metric X, field Y <op> reference value Z => call actuator
  - <op> can be =, <, >, regexp, range, etc.
  - If success, log only; else call the action up to max times
  - Each occurrence is logged in the Monitoring Repository
  - Already about 70 predefined alarms with automatic recovery actions
  - After the first month of deployment it reduced the number of problem tickets by half
- Correlation engine (CMDaemon)
  - Allows global correlations, and in the future client/server alarms and recovery actions
- Lemon Alarm Gateway (LAG)
  - Lemon's LAG can be used to feed alarms into arbitrary alarm systems (under development)
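The alarm definition (metric X, field Y, operator, reference value Z, actuator) can be made concrete with a short Python sketch. It is only one reading of the slide, not Lemon's implementation: the class and function names, the metric name, and the retry semantics (call the actuator again until it succeeds or max attempts is reached, logging every attempt) are assumptions. The actuator is a stub; a real one might run /sbin/service sshd start.

```python
import operator
import re
from dataclasses import dataclass

# Comparison operators named on the slide, mapped to Python callables.
OPERATORS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "regexp": lambda value, pattern: re.search(pattern, str(value)) is not None,
    "range": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


@dataclass
class AlarmRule:
    metric: str            # metric X
    field: str             # field Y within a sample of that metric
    op: str                # key into OPERATORS
    reference: object      # reference value Z
    max_attempts: int = 3  # upper bound on actuator calls

    def triggered(self, sample):
        """True if the sample's field matches the alarm condition."""
        return OPERATORS[self.op](sample[self.field], self.reference)


def handle(rule, sample, actuator, log):
    """Run the actuator while it fails, up to max_attempts, logging each attempt."""
    if not rule.triggered(sample):
        return False
    for attempt in range(1, rule.max_attempts + 1):
        succeeded = actuator()
        log.append((rule.metric, attempt, succeeded))
        if succeeded:
            break
    return True


calls = []


def restart_sshd_stub():
    """Stand-in actuator: fails on the first call, succeeds on the second."""
    calls.append(1)
    return len(calls) >= 2


rule = AlarmRule(metric="sshd_alive", field="running", op="=", reference=0)
log = []
fired = handle(rule, {"running": 0}, restart_sshd_stub, log)
print(fired, log)
```

Keeping the operators in a lookup table means a new comparison can be added without touching the rule or the retry loop.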
13. Installation and setup (I)
- Lemon installation consists of three steps
  - Server installation
  - Client installation
  - Web interface installation
- 1. Server installation
  - Install the edg-fabricMonitoring-server rpm (flat file server)
  - Configure the receiving port in /etc/edg-fmon-server.conf
  - Start the server daemon
- 2. Client installation
  - Install the edg-fabricMonitoring-agent rpm (comes with a default metric configuration)
  - Configure the server and its port in /etc/edg-fmon-agent.conf
  - Start the client daemon on all monitored hosts
14. Installation and setup (II)
- 3. Web interface installation
  - Install and start an Apache server (with PHP) on your server
  - Install the rrdtool and lrf (Lemon RRD Framework) rpms
  - Configure your clusters in the clusters.conf file and start the lemonmrd daemon
- Drink Champagne: you have Lemon up and running! :-)
- You can do all this on your laptop!
- Possible additional components
  - Computer center synoptic view through an XML file
  - Problem tracking system integration (through a PHP plug-in to your DB/application)
  - Quattor CDB configuration view through CDB XML profiles
  - Oracle-based Repository (for very large installations, with high scalability and increased functionality)
  - Other, new components are easy to add
- View detailed instructions at http://cern.ch/lemon/doc/installation/installation.html
15. Summary
- Lemon provides monitoring information about the farms in Computer Centers (or your laptop).
- Lemon provides a framework for recovery actions and alarms.
- Lemon is easy to install (and it is easy to add your own metrics and visualize them).
- It is flexible with respect to your needs: you can add clusters and views, and specify your own definitions of virtual and dynamic clusters.
- It has been a useful tool for general performance monitoring and also for system administrators debugging problems.
- For more information, see http://cern.ch/lemon