MonALISA framework

1
Monitoring of a distributed computing system: the Grid AliEn@CERN
Marco MEONI
Master's Degree, 19/12/2005
2
Content
  • Grid Concepts and Grid Monitoring
  • MonALISA Adaptations and Extensions
  • PDC04 Monitoring and Results
  • Conclusions and Outlook

http://cern.ch/mmeoni/thesis/eng.pdf
3
Section I
Grid Concepts and Grid Monitoring
4
ALICE experiment at CERN LHC
1) Heavy nuclei and protons are brought into collision
2) Secondary particles are produced in the collision
3) These particles are recorded by the ALICE detector
4) Particle properties (trajectories, momentum, type) are reconstructed by the AliRoot software
5) ALICE physicists analyse the data and search for physics signals of interest
5
Grid Computing
  • Grid Computing definition:
  • coordinated use of large sets of heterogeneous, geographically distributed resources to allow high-performance computation
  • The AliEn system:
  • - pull rather than push architecture: the scheduling service does not need to know the status of all resources in the Grid, since the resources advertise themselves
  • - robust and fault tolerant: resources can come and go at any point in time
  • - interfaces to other Grid flavours, allowing for rapid expansion of the size of the computing resources, transparently for the end user

6
Grid Monitoring
  • GMA Architecture
  • R-GMA: an example implementation
  • Jini (Sun) provides the technical basis

7
MonALISA framework
  • Distributed monitoring service system using
    JINI/JAVA and WSDL/SOAP technologies
  • Each MonALISA server acts as a dynamic service
    system and provides the functionality to be
    discovered and used by any other services or
    clients that require such information

8
Section II
MonALISA Adaptations and Extensions
9
MonALISA Adaptations
A Web Repository as a front-end for production monitoring
  • Stores a history view of the monitored data
  • Displays the data in a variety of predefined histograms and other visualisation formats
  • Simple interfaces to user code: custom consumers, configuration modules, user-defined charts, distributions

Farms monitoring
  • A user-written Java class interfaces MonALISA with a bash script that monitors the site (see the sketch after the diagram below)

[Diagram: on the remote farm, a monitoring script collects data from the CE and WNs; a Java interface class passes the monitored data to the MonALISA framework, which feeds the Web Repository together with other Grid resources and user code.]
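A minimal sketch of how such an interface class could work, in Java: it runs a hypothetical monitor.sh script (a stand-in for the real site-monitoring bash script) and parses key=value output. The actual MonALISA module API is not reproduced here.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical illustration: runs a site-monitoring bash script and
    // parses its key=value output, the way an interface class would
    // before handing results to the MonALISA framework.
    public class FarmMonitor {
        public Map<String, Double> collect() throws Exception {
            Map<String, Double> values = new HashMap<>();
            Process p = new ProcessBuilder("bash", "monitor.sh").start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    String[] kv = line.split("=", 2);   // e.g. "load1=0.42"
                    if (kv.length == 2) {
                        values.put(kv[0].trim(), Double.parseDouble(kv[1].trim()));
                    }
                }
            }
            p.waitFor();
            return values;  // published as monitored parameters of the farm
        }
    }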
10
Repository Setup
A Web Repository as a front-end for monitoring
  • Keeps full history of monitored data
  • Shows data in a moltitude of histograms
  • Added new presentation formats to provide a full
    set (gauges, distributions)
  • Simple interfaces to user code custom
    consumers, custom tasks

Installation and Maintenance
  • Packages installation (Tomcat, MySQL)
  • Configuration of main servlets for ALICE VO
  • Setup of scripts for startup/shutdown/backup
  • All the produced plots have been built and
    customized as from as many configuration files
  • SQL, parameters, colors, type
  • cumulative or averaged behaviour
  • smooth, fluctuations
  • user time intervals
  • many others

11
AliEn Jobs Monitoring
  • Centralized or distributed?
  • AliEn native APIs to retrieve job status
    snapshots

[Diagram: AliEn job state machine. A submitted job goes through INSERTING (Error_I) into the AliEn TQ as WAITING, then ASSIGNED (Error_A); at the CE it is QUEUED (Error_S), STARTED (Error_E) and RUNNING (Error_R) on a WN, turning ZOMBIE if stalled for over 1h; it then passes VALIDATION (Error_V, Error_VT, Error_VN, leading to FAILED) and SAVING (Error_SV, with a 3h limit) before reaching DONE. Snapshots are served through Tomcat JSP/servlets.]
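Rendered as code, the same state machine could look like the following minimal Java sketch; the states and error labels are read off the diagram above, not taken from AliEn's actual sources.

    // Illustrative only: AliEn job states as shown in the diagram,
    // each with the error state reached when that step fails.
    public enum JobState {
        INSERTING("Error_I"),
        WAITING(null),            // queued in the AliEn TQ
        ASSIGNED("Error_A"),
        QUEUED("Error_S"),        // at the CE
        STARTED("Error_E"),
        RUNNING("Error_R"),       // on a WN; stalled >1h -> ZOMBIE
        VALIDATION("Error_V"),    // also Error_VT, Error_VN -> FAILED
        SAVING("Error_SV"),       // subject to the 3h limit
        DONE(null);

        private final String errorState;
        JobState(String errorState) { this.errorState = errorState; }
        public String errorState() { return errorState; }
    }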
12
Repository Database(s)
Data Collecting
  • 7 GB of performance information, 24.5M records
  • During the Data Challenge, data from 2k monitored parameters arrive every 2-3 minutes

[Diagram: incoming values pass through a FIFO averaging process into three granularities (1 min, 10 min, 100 min), keeping 60 bins for each piece of basic information.]
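A minimal sketch of such a fixed-size averaging buffer in Java, assuming one 60-bin FIFO per parameter and per granularity as in the diagram; the real repository code is not shown here. Keeping only 60 bins per granularity bounds the storage per parameter while still covering three time scales.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative 60-bin FIFO: one instance per parameter and per
    // granularity (1 min, 10 min, 100 min). Old bins are evicted as
    // new averages arrive, so storage per parameter stays constant.
    public class AveragingFifo {
        private static final int MAX_BINS = 60;
        private final Deque<Double> bins = new ArrayDeque<>();

        public void add(double average) {
            if (bins.size() == MAX_BINS) {
                bins.removeFirst();   // drop the oldest bin (FIFO)
            }
            bins.addLast(average);
        }

        // The coarser series is fed by averaging the finer one,
        // e.g. ten 1-min bins collapse into one 10-min bin.
        public double mean() {
            return bins.stream().mapToDouble(Double::doubleValue)
                       .average().orElse(0.0);
        }
    }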
Data sources and consumers for data collecting, Grid monitoring and Grid analysis:
  • ROOT
  • CARROT
  • MonALISA Agents
  • Repository Web Services
  • AliEn API
  • LCG Interface
  • WNs monitoring (UDP)
  • Web Repository
13
Web Repository
  • Storage and monitoring tools of the Data Challenge: running parameters, task completion and resource status

14
Visualisation Formats
  • Menu
  • Statistics and real-time tabulated views
  • CE load factors and task completion
  • Stacked bars
  • Running history
  • Snapshots and pie charts
15
Monitored parameters
  • 2k parameters and 24.5M records with 1-minute granularity
  • Analysis of the collected data allows for improvement of the Grid performance

[Chart: breakdown of the monitored parameters: 1868 basic parameters, plus derived classes.]
16
MonALISA Extensions
  • Job monitoring of Grid users
  • Application Monitoring (ApMon) at WNs
  • Repository Web Services
  • Parsing the output of AliEn commands (ps a, jobinfo jobid, ps X -st)
  • Scanning of job JDLs
  • Results presented in the same web front-end
  • ApMon is a set of flexible APIs that can be used by any application to send monitoring information to MonALISA services via UDP datagrams (see the sketch after this list)
  • Allows for data aggregation and scaling of the monitoring system
  • Developed a light monitoring C++ class to include within the Process Monitor payload
  • Alternative to ApMon for Web repository purposes: it does not need MonALISA agents and stores data directly into the repository DB
  • Used to monitor network traffic through the FTP servers of ALICE at CERN
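A minimal sketch of the transport idea behind ApMon, in plain Java: one monitoring value shipped as a UDP datagram. The string payload, host name and parameter names are invented for illustration; real ApMon encodes parameters in a binary format and has its own API.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    // Illustration only: ships a "cluster/node/parameter=value" string
    // to a monitoring service over UDP, the transport ApMon uses.
    // Real ApMon uses a binary encoding, not this ad-hoc string.
    public class UdpMonitorSender {
        public static void main(String[] args) throws Exception {
            String payload = "ALICE_CERN/wn042/cpu_load=0.73";  // hypothetical names
            byte[] data = payload.getBytes(StandardCharsets.UTF_8);
            try (DatagramSocket socket = new DatagramSocket()) {
                DatagramPacket packet = new DatagramPacket(
                        data, data.length,
                        InetAddress.getByName("ml-service.example.org"),  // assumed host
                        8884);                                            // assumed port
                socket.send(packet);   // fire-and-forget
            }
        }
    }

UDP's fire-and-forget nature is what lets the monitoring system scale: a lost datagram costs one sample, never a blocked application on the WN.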

17
MonALISA Extensions
  • Distributions as a basis for analysis
  • First attempt at Grid performance tuning based on real monitored data
  • Use of ROOT and Carrot features
  • Cache system to optimize the requests
[Diagram: a ROOT histogram server process acts as a central cache behind Apache: (1) a ROOT/Carrot histogram client asks for a histogram over HTTP; (2) the server queries the MonALISA Repository for NEW data only; (3) the repository sends the NEW data; (4) the server returns the resulting object/file to the client.]
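A minimal sketch of the incremental-cache idea in Java, assuming a hypothetical queryNewData(name, since) repository call; the real server is a ROOT process, and its merge logic is elided here.

    import java.util.HashMap;
    import java.util.Map;

    // Illustration of the caching idea: per histogram, remember the
    // timestamp of the last fetch and ask the repository only for
    // rows newer than that, instead of rebuilding from scratch.
    public class HistogramCache {
        static class Entry {
            long lastUpdate;          // newest data point already cached
            double[] bins;            // cached histogram contents
        }

        private final Map<String, Entry> cache = new HashMap<>();

        public double[] get(String name) {
            Entry e = cache.computeIfAbsent(name, k -> new Entry());
            double[] fresh = queryNewData(name, e.lastUpdate);  // steps 2/3
            merge(e, fresh);                                    // fold new rows in
            e.lastUpdate = System.currentTimeMillis();
            return e.bins;                                      // step 4
        }

        // Hypothetical repository call and merge logic, stubbed out.
        private double[] queryNewData(String name, long since) { return new double[0]; }
        private void merge(Entry e, double[] fresh) { if (e.bins == null) e.bins = fresh; }
    }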
18
Section III
PDC04 Monitoring and Results
19
PDC04
  • Purpose: test and validate the ALICE Offline computing model
  • Produce and analyse 10% of the data sample collected in a standard data-taking year
  • Use the complete set of off-line software: AliEn, AliROOT, LCG, Proof and, in Phase 3, the ARDA user analysis prototype
  • Structure: logically divided into three phases
  • Phase 1 - production of underlying PbPb events with different centralities (impact parameters); production of pp events
  • Phase 2 - mixing of signal events with different physics content into the underlying PbPb events
  • Phase 3 - distributed analysis

20
PDC04 Phase 1
  • Task - simulate the data flow in reverse: events are produced at remote centres and stored in the CERN MSS

21
Total CPU profile
  • Aiming for continuous running, not always possible due to resource constraints

[Chart: total number of jobs running in parallel; 18 computing centres participating.]
  • Start 10/03, end 29/05 (58 days active)
  • Maximum number of jobs running in parallel: 1450
  • Average during the active period: 430

22
Efficiency
  • Calculation principle: jobs are submitted only once (a worked example follows the list)
  • Efficiency measures:
  • successfully done jobs / all submitted jobs
  • Error (CE)-free jobs / all submitted jobs
  • Error (AliROOT)-free jobs / all submitted jobs
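As a worked example, with hypothetical round numbers rather than actual PDC04 figures: if 2000 jobs were submitted and 1500 of them ended in DONE, the first measure would be

    \text{success rate} = \frac{\text{successfully done jobs}}{\text{all submitted jobs}} = \frac{1500}{2000} = 75\%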
23
Phase 1 of PDC04 Statistics
24
PDC04 Phase 2
  • Task - simulate the event reconstruction and
    remote event storage

[Diagram: the central servers handle master job submission, the Job Optimizer (splitting each master job into N sub-jobs), the RB, the File catalogue, process monitoring and control, and the SE. Underlying event input files are read from CERN CASTOR; sub-jobs are dispatched to CEs both under direct AliEn control and through the AliEn-LCG interface and its RB. Each job zips its output files into an archive stored on a local SE as the primary copy, with a backup copy in CERN CASTOR; files are registered in the AliEn FC (for an LCG SE, an LCG LFN mapped to an AliEn PFN) via edg(lcg) copy and register.]
25
Individual sites CPU contribution
  • Start 01/07, end 26/09 (88 days active)
  • As in the 1st phase, general equilibrium in CPU contribution
  • Under direct AliEn control: 17 CEs, each with an SE
  • CERN-LCG encompasses the LCG resources worldwide (also with local/close SEs)

26
Sites occupancy
  • Outside CERN, sites such as Bari, Catania and JINR have generally run at maximum capacity

27
Phase 2 Statistics and Failures
28
PDC04 Phase 3
  • Task: user data analysis

[Diagram: distributed analysis flow. A user job (spanning many events) issues a File Catalogue query for the data set (ESDs, other); the Job Optimizer groups the input files by SE location and splits the job into sub-jobs 1..n (sketched below); the Job Broker submits each sub-job to the CE with the closest SE; every CE/SE pair processes its share and writes an output file; a file merging job finally combines the output files into the job output.]
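A minimal sketch of the splitting step in Java, with hypothetical types and field names: input files are grouped by the SE that hosts them, yielding one sub-job per storage element.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustration of the Job Optimizer's splitting idea: one sub-job
    // per storage element, so the Job Broker can send each sub-job to
    // the CE with the closest SE. Types and fields are hypothetical.
    public class JobSplitter {
        record InputFile(String lfn, String storageElement) {}
        record SubJob(String storageElement, List<String> lfns) {}

        public static List<SubJob> split(List<InputFile> dataSet) {
            Map<String, List<String>> bySE = new HashMap<>();
            for (InputFile f : dataSet) {
                bySE.computeIfAbsent(f.storageElement(), se -> new ArrayList<>())
                    .add(f.lfn());
            }
            List<SubJob> subJobs = new ArrayList<>();
            bySE.forEach((se, lfns) -> subJobs.add(new SubJob(se, lfns)));
            return subJobs;   // each sub-job's output is merged afterwards
        }
    }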
29
Analysis
  • Start September 2004, end January 2005
  • Distribution charts built on top of the ROOT environment using the Carrot web interface
  • Distribution of the number of running jobs
  • - mainly depends on the number of jobs waiting in the TQ and on the availability of free CPUs at the remote CEs
  • Occupancy versus the number of queued jobs
  • - occupancy increases as more jobs wait in the local batch queue, reaching saturation at around 60 queued jobs

30
Section IV
Conclusions and Outlook
31
Lessons from PDC04
  • User jobs have been running for 9 months using AliEn
  • MonALISA has provided a flexible and complete monitoring framework, successfully adapted to the needs of the Data Challenge
  • MonALISA has given the expected results for performance tuning and workload balancing
  • Step-by-step approach, from resource tuning to resource optimization
  • MonALISA has been able to gather, store, plot, sort and group a large variety of monitored parameters, both basic and derived, in a rich set of presentation formats
  • The Repository has been the only source of historical information, and its modular architecture has made possible the development of a variety of custom modules (800 lines of fundamental source code and 3k lines for service tasks)
  • PDC04 has been a real example of successful Grid interoperability, interfacing AliEn and LCG and proving the scalability of the AliEn design
  • The usage of MonALISA in ALICE has been documented in an article for the Computing in High Energy and Nuclear Physics (CHEP) 2004 conference, Interlaken, Switzerland
  • Unprecedented experience in developing and improving a monitoring framework on top of a real, functioning Grid, massively testing the software technologies involved
  • The framework is easy to extend, and components can be replaced with equivalent ones following technical needs or strategic choices

32
Credits
  • Dott. F. Carminati, L. Betev, P. Buncic and all colleagues in ALICE
  • for the enthusiasm they transmitted during this work
  • The MonALISA team
  • collaborative whenever I needed help