Title: Online Monitoring with MonALISA
1Online Monitoring with MonALISA
- Dan Protopopescu
- Glasgow, UK
2MonALISA
- Is a distributed service able to
  - collect any type of information from different systems
  - analyze this information in real time
  - take automated decisions and perform actions based on it
  - optimize work flows in complex environments
- Read more at
  - http://monalisa.caltech.edu
3Uses
- Monitoring distributed computing, e.g. Grids
- Optimizing flow in complex systems (VRVS, optical cable networks)
- ALICE also uses ML for monitoring online reconstruction
- Some benchmark figures for the service
  - 800k monitored parameters at 50k updates/second
  - > 10k running (AliEn) jobs monitored simultaneously
  - > 100 WAN links
- We are proposing ML as a high-level monitoring and possibly control system, alongside (or on top of) existing slow-control systems such as EPICS, PVSS, etc.
4Advantages
- MonALISA is simple to install, configure and use
- ApMon APIs are available in C, C++, Java, Python and Perl
- ROOT plugin allows macros to send data directly to MonALISA
- Can easily interface with (or sit on top of) any existing or future slow-controls subsystem (EPICS, PVSS)
- Data is stored in a standard PgSQL (or MySQL) database that can be accessed by other applications, independently of ML
- Automatic data summarizing
- Several data repositories (and hence DBs) can exist (local and remote)
- Easy access via WebService (WS) from service and/or repository
- Fully supported by the development team; work is being done in this direction
5Capabilities
- Based on monitored information, actions can be taken in
  - ML Service
  - ML Repository
- Actions can be triggered by
  - Values above/below given thresholds
  - Absence/presence of values
  - Correlations between several values
- Possible action types
  - External command
  - Plain event logging
  - Annotation of repository charts, RSS feeds
  - Email
  - Instant messaging
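The three trigger types listed above (thresholds, absence/presence, correlations) can be sketched as small predicates over a snapshot of the latest values. This is a minimal illustration only: every name here (`Rule`, `above`, `absent`, `both`, `evaluate`, and the parameter names) is hypothetical and is not the MonALISA actions API, and the "action" is reduced to appending a label to a log.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical model: parameter name -> latest value (None or missing = absent).
Snapshot = Dict[str, Optional[float]]
Condition = Callable[[Snapshot], bool]


def above(name: str, limit: float) -> Condition:
    """Threshold trigger: value is present and exceeds the limit."""
    return lambda s: s.get(name) is not None and s[name] > limit


def absent(name: str) -> Condition:
    """Absence trigger: no value has been reported for this parameter."""
    return lambda s: s.get(name) is None


def both(a: Condition, b: Condition) -> Condition:
    """Correlation trigger: fires only when two conditions hold together."""
    return lambda s: a(s) and b(s)


@dataclass
class Rule:
    label: str
    condition: Condition
    action: Callable[[str], None]  # e.g. log, e-mail, external command


def evaluate(rules: List[Rule], snapshot: Snapshot) -> List[str]:
    """Run the action of every rule whose condition holds; return their labels."""
    fired = []
    for rule in rules:
        if rule.condition(snapshot):
            rule.action(rule.label)
            fired.append(rule.label)
    return fired


log: List[str] = []  # stands in for "plain event logging"
rules = [
    Rule("rate too high", above("tagger_rate", 1000.0), log.append),
    Rule("heartbeat missing", absent("heartbeat"), log.append),
    Rule("high rate while idle",
         both(above("tagger_rate", 1000.0), absent("beam_on")), log.append),
]
fired = evaluate(rules, {"tagger_rate": 1500.0, "heartbeat": 1.0})
```

Keeping conditions and actions as separate callables means the same rule list could be evaluated in a service (local values) or in a repository (aggregated values), which is the split the slide describes.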
6Components
[Diagram: component overview. Labels: GUI, LUS/Proxies, Web Server, Service (x2), Repository, ApMon (x4), "Actions based on local information", "Actions based on aggregated information", "Quick actions"]
7Service setup
ML Service setup
wget http://nuclear.gla.ac.uk/protopop/ML/MonaLisa.tar.gz
tar -zxvf MonaLisa.tar.gz
cd MonaLisa/
./install.sh
cd ../MonaLisa/Service/CMD/
./MLD start
8Repository setup
ML Repository setup
wget http://nuclear.gla.ac.uk/protopop/ML/MLrepository.tgz
tar -zxvf MLrepository.tgz
# configure it
cd MLrepository
./start.sh
9ApMon setup
ApMon setup
wget http://nuclear.gla.ac.uk/protopop/ML/ApMon_perl.tar.gz
tar -xzvf ApMon_perl.tar.gz
cd ApMon_perl
# create your script, say mysend.pl
perl mysend.pl
10Simple monitoring script
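monalisa@glasgow> cat mysend.pl
use ApMon;
# destination host:port, with ApMon's own system monitoring switched off
my $apm = new ApMon({"glasgow.jlab.org:8884" => {"sys_monitoring" => 0, "general_info" => 0}});
my @pair;
while (1) {    # loop forever
    # get values from somewhere
    @pair = getmypar("pspec_logic_ai_0");
    # cluster "Detector", node "MOR", then the name/value pair(s)
    $apm->sendParameters("Detector", "MOR", @pair);
    sleep(20);
}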
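Since ApMon is also available for Python, the same read-and-send loop can be sketched in that language. To stay self-contained, the sketch below does not import ApMon: the sender is passed in as a plain callable, and `read_parameters` is a hypothetical stand-in for the slide's `getmypar`. Binding `send` to the Python ApMon object's `sendParameters` method is an assumption that is not exercised here.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Sender = Callable[[str, str, Dict[str, float]], None]


def read_parameters() -> Dict[str, float]:
    """Hypothetical data source; returns parameter name -> current value."""
    return {"pspec_logic_ai_0": 42.0}


def monitor(send: Sender,
            read: Callable[[], Dict[str, float]] = read_parameters,
            period_s: float = 20.0,
            iterations: Optional[int] = None,
            sleep: Callable[[float], None] = time.sleep) -> int:
    """Send one reading per period; loop forever when iterations is None."""
    sent = 0
    while iterations is None or sent < iterations:
        # Cluster "Detector" and node "MOR" follow the slide's example.
        send("Detector", "MOR", read())
        sent += 1
        if iterations is None or sent < iterations:
            sleep(period_s)
    return sent


# Dry run: collect what would be sent instead of contacting a service.
outbox: List[Tuple[str, str, Dict[str, float]]] = []
count = monitor(lambda cluster, node, params: outbox.append((cluster, node, params)),
                iterations=3, sleep=lambda seconds: None)
```

Injecting the sender and the sleep function keeps the loop testable without a running ML Service.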
11Time history
Time history example
monalisa@glasgow> cat mor.properties
page=hist
Farms=JlabML
Clusters=Detector
Nodes=MOR
Functions=pspec_logic_ai_0
ylabel=Tagger rate
title=MOR
annotation.groups=2
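The chart definition above selects one farm, cluster, node and function to plot. As a rough sketch of how such a file could be read by another tool, the snippet below parses it into a dictionary. It assumes the file uses one `key=value` pair per line (the separators are not visible in the slide text); `parse_properties` is a hypothetical helper, not part of MonALISA.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


# Content of mor.properties as given on the slide, assuming '=' separators.
MOR_PROPERTIES = """\
page=hist
Farms=JlabML
Clusters=Detector
Nodes=MOR
Functions=pspec_logic_ai_0
ylabel=Tagger rate
title=MOR
annotation.groups=2
"""

props = parse_properties(MOR_PROPERTIES)
# The Clusters/Nodes/Functions keys match the arguments of the sending script.
series = (props["Farms"], props["Clusters"], props["Nodes"], props["Functions"])
```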
12Web interface
13Java GUI
14Application control
- ML Clients
  - TCP-based subscribe mechanism: serialized, compressed objects with optional encryption
- ML Proxies
  - Application commands are encrypted
- ML Services
  - Standard and/or user sensors and/or application modules
[Diagram labels: Your custom Java client, GUI client, ML Repository, Your custom view, Key, LUS, Keystore, ML Service, Your mon module, Your app module, App MonC, ApMon, Your application, bash, Your Application]
15Alert-based Actions
- MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage
- ALICE production job queue is automatically kept full by automatic resubmission. Trigger: threshold on the number of aliprod waiting jobs
- Administrators are kept up to date on the status of services. Trigger: presence/absence of monitored information. Notification via instant messaging, RSS feeds, toolbar alerts, etc.
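The queue-refill example above amounts to a threshold rule that also decides how many jobs to resubmit. A minimal sketch of that decision follows; the function name, the target queue depth and the batch limit are all hypothetical values chosen for illustration, not ALICE settings.

```python
def jobs_to_submit(waiting: int, target: int, batch_limit: int) -> int:
    """Return how many jobs to resubmit so the waiting queue approaches target.

    waiting     -- number of jobs currently waiting (the monitored value)
    target      -- desired queue depth (the threshold)
    batch_limit -- maximum number of jobs to submit in one action
    """
    if waiting >= target:
        return 0  # threshold not crossed, no action
    return min(target - waiting, batch_limit)


# Illustrative numbers only.
refill_large_gap = jobs_to_submit(waiting=120, target=200, batch_limit=50)
refill_small_gap = jobs_to_submit(waiting=180, target=200, batch_limit=50)
refill_none = jobs_to_submit(waiting=250, target=200, batch_limit=50)
```

Capping each action with a batch limit is one way to avoid flooding the queue when the monitored value is briefly stale; whether the production system does this is not stated in the slides.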
16Summary
- MonALISA is a very promising tool for online experiment monitoring and for interfacing with a variety of slow-control subsystems; GlueX is seriously considering ML for this task
- Easy to configure, understand and use
- Experience from Grid monitoring and more
- Support from the developers' group for implementation of new modules/features
- Online experiment monitoring tests at CLAS@JLab were recently carried out; a demo repository is at http://mlr1.gla.ac.uk:7002
17More examples / Extras
18Integrated Pie Charts
19History Plots, Annotations
20AliEn Services Monitoring
- AliEn services
- Periodically checked
- PID check and SOAP call
- Simple functional tests
- SE space usage
- Efficiency
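A PID check of the kind mentioned above can be sketched on POSIX systems by sending signal 0, which tests for the process without affecting it. This is a generic illustration, not the AliEn implementation; `pid_alive` is a hypothetical helper, and the approach should not be used on Windows, where `os.kill` behaves differently.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


this_process_alive = pid_alive(os.getpid())
```

A PID check only shows that the process exists; a functional probe such as the SOAP call listed on the slide is still needed to show that the service responds.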
21Job Network Traffic Monitoring
- Based on the xrootd transfer from every job
- Aggregated statistics for
  - Sites (incoming, outgoing, site to site, internal)
  - Storage Elements (incoming, outgoing)
- Of
  - Read and written files
  - Transferred MB/s
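The per-site aggregation described above can be sketched as a single pass over per-job transfer records. The record layout (`Transfer`), the function name and the site names are hypothetical; the real system derives these figures from xrootd transfer reports, whose format is not given in the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Transfer(NamedTuple):
    """One hypothetical per-job transfer record."""
    src_site: str
    dst_site: str
    megabytes: float


def aggregate(transfers: Iterable[Transfer]):
    """Sum traffic per site (incoming/outgoing/internal) and per site pair."""
    per_site: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"incoming": 0.0, "outgoing": 0.0, "internal": 0.0})
    site_to_site: Dict[Tuple[str, str], float] = defaultdict(float)
    for t in transfers:
        if t.src_site == t.dst_site:
            per_site[t.src_site]["internal"] += t.megabytes
        else:
            per_site[t.src_site]["outgoing"] += t.megabytes
            per_site[t.dst_site]["incoming"] += t.megabytes
            site_to_site[(t.src_site, t.dst_site)] += t.megabytes
    return dict(per_site), dict(site_to_site)


# Made-up records for illustration only.
transfers = [
    Transfer("SiteA", "SiteB", 100.0),
    Transfer("SiteB", "SiteA", 40.0),
    Transfer("SiteA", "SiteA", 10.0),
    Transfer("SiteA", "SiteB", 25.0),
]
per_site, links = aggregate(transfers)
```

Dividing each total by the length of the reporting interval would give the MB/s rates named on the slide.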
22Individual Job Tracking
- Based on AliEn shell cmds.
- top, ps, spy, jobinfo, masterjob
- Using the GUI ML Client
- Status, resource usage, per job
23Head Node Monitoring
- Machine parameters, real-time and history: load, memory and swap usage, processes, sockets
24MonALISA in AliEn
- The MonALISA framework has been used as the primary monitoring tool for the ALICE Grid since 2004
- Presently the system is used for monitoring all (identified) services, jobs and network parameters necessary for Grid operation and debugging
- The number of concurrently monitored and stored parameters today is 300,000 in 75 ML Services
- The add-on tools for automatic event notification allow for more efficient reaction to problems
- The framework's design and flexibility meet all requirements for a monitoring system
- The accumulated information makes it possible to construct and implement automated decision-making algorithms, further increasing the efficiency of Grid operations