Title: Online Performance Monitoring of the Third ALICE Data Challenge
1. Online Performance Monitoring of the Third ALICE Data Challenge
- W. Carena (1), R. Divia (1), P. Saiz (2), K. Schossmaier (1), A. Vascotto (1), P. Vande Vyvre (1)
- (1) CERN EP-AID, (2) CERN EP-AIP
- NEC2001, Varna, Bulgaria, 12-18 September 2001
2. Contents
- ALICE Data Challenges
- Testbed infrastructure
- Monitoring system
- Performance results
- Conclusions
3. ALICE Data Acquisition (final system)
- Readout: up to 20 GB/s into the Local Data Concentrators (LDC), 300 nodes
- Event building: up to 2.5 GB/s into the Global Data Collectors (GDC), 100 nodes
- Mass storage: up to 1.25 GB/s into the CASTOR system
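As a quick sanity check on the figures above, the implied per-node rates can be worked out directly. The sketch below (plain C) assumes the load spreads evenly across the nodes, an assumption made here purely for illustration:

    /* Back-of-envelope per-node rates implied by the stage figures
     * above, assuming an even spread of the load across the nodes. */
    #include <stdio.h>

    int main(void)
    {
        printf("readout:        %.1f MB/s per LDC\n", 20000.0 / 300); /* ~66.7 */
        printf("event building: %.1f MB/s per GDC\n",  2500.0 / 100); /* 25.0  */
        printf("mass storage:   %.2f GB/s aggregate\n", 1.25);
        return 0;
    }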
4. ALICE Data Challenges
- What? Put together components to demonstrate the feasibility, reliability, and performance of our present prototypes.
- Where? The ALICE common testbed uses the hardware of the common CERN LHC testbed.
- When? This exercise is repeated every year on a progressively enlarged testbed.
- Who? A joint effort between the ALICE online and offline groups and two groups of the CERN IT division.
- Schedule: ADC I, March 1999; ADC II, March-April 2000; ADC III, January-March 2001; ADC IV, 2nd half of 2002?
5. Goals of the ADC III
- Performance, scalability, and stability of the system (10% of the final system)
- 300 MB/s event-building bandwidth
- 100 MB/s over the full chain during a week
- 80 TB into the mass storage system
- Online monitoring tools
6. ADC III Testbed Hardware
- Farm: 80 standard PCs, dual PIII @ 800 MHz, Fast and Gigabit Ethernet, Linux kernel 2.2.17
- Network: 6 switches from three manufacturers, copper and fiber media, Fast and Gigabit Ethernet
- Disks: 8 disk servers, dual PIII @ 700 MHz, 20 IDE data disks, 750 GB mirrored
- Tapes: 3 HP NetServers, 12 tape drives, 1000 cartridges of 60 GB capacity, 10 MB/s bandwidth
7. ADC III Monitoring
- Minimum requirements (captured by the record sketch after this slide):
  - LDC/GDC throughput (individual and aggregate)
  - Data volume (individual and aggregate)
  - CPU load (user and system)
  - Identification: time stamp, run number
  - Plots accessible on the Web
- Online monitoring tools:
  - PEM (Performance and Exception Monitoring) from CERN IT-PDP: was not ready for ADC III
  - Fabric monitoring developed by CERN IT-PDP
  - ROOT I/O measures mass storage throughput
  - CASTOR measures disk/tape/pool statistics
  - DATESTAT prototype developed by EP-AID and EP-AIP
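The minimum requirements above translate naturally into one record per node and sampling interval. Below is a minimal sketch of such a record in C; the field names are illustrative and do not reproduce the actual DATESTAT format:

    #include <time.h>

    /* Hypothetical per-sample monitoring record covering the minimum
     * requirements: throughput, volume, CPU load, and identification. */
    struct daq_sample {
        time_t timestamp;      /* identification: time stamp          */
        int    run_number;     /* identification: run number          */
        char   node[32];       /* LDC or GDC host name                */
        double throughput_mbs; /* per-node throughput in MB/s         */
        double volume_gb;      /* data volume accumulated in this run */
        double cpu_user;       /* CPU load, user part                 */
        double cpu_sys;        /* CPU load, system part               */
    };

The aggregate values can then be derived server-side by summing the per-node records of one sampling interval.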
8. Fabric Monitoring
- Collect CPU, network I/O, and swap statistics
- Send UDP packets to a central server (a minimal sender sketch follows)
- Display current status and history using Tcl/Tk scripts
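A minimal sketch of the sender side in C, assuming one plain-text sample per datagram; the server address, port, and message format are placeholders, not the actual IT-PDP configuration:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP socket */
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in srv;
        memset(&srv, 0, sizeof srv);
        srv.sin_family      = AF_INET;
        srv.sin_port        = htons(9870);              /* placeholder port   */
        srv.sin_addr.s_addr = inet_addr("192.0.2.1");   /* placeholder server */

        /* One sample per datagram: host, CPU, network I/O, swap. */
        char msg[256];
        snprintf(msg, sizeof msg, "pcdaq01 cpu=12.5/3.1 net=8.4MB/s swap=0");

        if (sendto(s, msg, strlen(msg), 0,
                   (struct sockaddr *)&srv, sizeof srv) < 0)
            perror("sendto");
        close(s);
        return 0;
    }

UDP keeps the per-sample overhead low; an occasional lost datagram only costs one point in the history plots.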
9. ROOT I/O Monitoring
- Measure the aggregate throughput to the mass storage system
- Collect measurements in a MySQL database (a recording sketch follows)
- Display history and histograms using ROOT on Web pages
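A sketch of how one such measurement could be recorded through the MySQL C API; the table, columns, and credentials are invented for illustration (build with gcc ... -lmysqlclient):

    #include <stdio.h>
    #include <mysql/mysql.h>

    int main(void)
    {
        MYSQL *db = mysql_init(NULL);
        if (!mysql_real_connect(db, "alicedb.cern.ch", "monitor", "secret",
                                "statistics", 0, NULL, 0)) {
            fprintf(stderr, "connect: %s\n", mysql_error(db));
            return 1;
        }
        /* One row per measurement: timestamp plus aggregate MB/s. */
        if (mysql_query(db,
                "INSERT INTO rootio_throughput (t, mbytes_per_s) "
                "VALUES (NOW(), 86.0)"))
            fprintf(stderr, "insert: %s\n", mysql_error(db));
        mysql_close(db);
        return 0;
    }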
10. DATESTAT Architecture
- DATE v3.7 runs on the LDC and GDC nodes
- dateStat.c on each node gathers figures from top and DAQCONTROL and reports them through the DATE Info Logger
- Log files (200 KB/hour/node) are condensed by a Perl script into statistics files
- A gnuplot script plots the statistics files, and a C program loads them into a MySQL database
- A gnuplot/CGI script publishes the results at http://alicedb.cern.ch/statistics
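The slides do not show dateStat.c itself, so here is a standalone approximation of its sampling step: it reads the cumulative CPU counters from /proc/stat (present on the Linux 2.2 kernels of the testbed) and emits one log line per minute; the output format is invented:

    #include <stdio.h>
    #include <unistd.h>
    #include <time.h>

    int main(void)
    {
        unsigned long user, nice, sys, idle;
        for (;;) {
            FILE *f = fopen("/proc/stat", "r");
            if (!f) { perror("/proc/stat"); return 1; }
            /* First line holds cumulative jiffies: user nice system idle. */
            if (fscanf(f, "cpu %lu %lu %lu %lu", &user, &nice, &sys, &idle) == 4)
                printf("%ld cpu user=%lu sys=%lu idle=%lu\n",
                       (long)time(NULL), user, sys, idle);
            fclose(f);
            fflush(stdout); /* keep the log file current */
            sleep(60);      /* one sample per minute     */
        }
    }

The downstream Perl script would then turn such cumulative counters into the rates shown in the plots.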
11. Selected DATESTAT Results
- Result 1: DATE standalone run, equal subevent size
- Result 2: Dependence on subevent size
- Result 3: Dependence on the number of LDCs/GDCs
- Result 4: Full chain, ALICE-like subevents
12. Result 1/1
- DATE standalone
- 11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours
13. Result 1/2
- DATE standalone
- 11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours
14. Result 1/3
- DATE standalone
- 11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours
15. Result 2
- DATE standalone
- 13 LDC x 13 GDC nodes, 50-60 KB subevents, 1.1 hours
16. Result 3
- Dependence on the number of LDCs/GDCs
- DATE standalone
- Gigabit Ethernet
- max. 30 MB/s per LDC
- max. 60 MB/s per GDC
17. Result 4/1
- Full chain
- 20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
18. Result 4/2
- Full chain
- 20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
19. Result 4/3
- Full chain
- 20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
- LDC load: 0.8 user, 2.7 sys
- LDC rate: 1.1 MB/s (60 KB subevents, Fast Ethernet)
20. Grand Total
- Maximum throughput in DATE: 556 MB/s for symmetric traffic, 350 MB/s for ALICE-like traffic
- Maximum throughput in the full chain: 120 MB/s without migration, 86 MB/s with migration
- Maximum volume per run: 54 TB with DATE standalone, 23.6 TB with the full chain
- Total volume through DATE: at least 500 TB
- Total volume through the full chain: 110 TB
- Maximum duration per run: 86 hours
- Maximum events per run: 21 x 10^6
- Maximum subevent size: 9 MB
- Maximum number of nodes: 20 x 15
- Number of runs: 2200
21. Summary
- Most of the ADC III goals were achieved:
  - PC/Linux platforms are stable and reliable
  - Ethernet technology is reliable and scalable
  - DATE standalone is running well
  - The full chain needs to be analyzed further
  - The next ALICE Data Challenge comes in the 2nd half of 2002
- Online performance monitoring:
  - The DATESTAT prototype performed well
  - It helped to spot bottlenecks in the DAQ system
  - The team in Zagreb is re-designing and re-engineering the DATESTAT prototype
22. Future Work
- Polling agent (see the sketch after this list):
  - obtain performance data from all components
  - keep the agent simple, uniform, and extendable
  - support several platforms (UNIX, application software)
- Transport/Storage:
  - use communication with low overhead
  - maintain a common format in a central database
- Processing:
  - apply efficient algorithms to filter and correlate logged data
  - store performance results permanently in a database
- Visualization:
  - use a common GUI (Web-based, ROOT objects)
  - provide different views (levels, time scale, color codes)
  - generate plots, histograms, reports, e-mail, ... automatically
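A minimal sketch of the polling-agent idea, with each monitored component behind a uniform probe function so that new platforms can be added without touching the core loop; all probe names and values here are invented for illustration:

    #include <stdio.h>
    #include <unistd.h>

    typedef int (*probe_fn)(char *buf, int len);

    /* Dummy probes standing in for real component queries. */
    static int probe_cpu(char *buf, int len)
    { return snprintf(buf, len, "user=1.2 sys=0.4"); }

    static int probe_net(char *buf, int len)
    { return snprintf(buf, len, "in=8.4MB/s out=7.9MB/s"); }

    static struct { const char *name; probe_fn fn; } probes[] = {
        { "cpu", probe_cpu },
        { "net", probe_net },   /* extend by adding probes here */
    };

    int main(void)
    {
        char buf[128];
        unsigned i;
        for (;;) {
            for (i = 0; i < sizeof probes / sizeof probes[0]; i++) {
                probes[i].fn(buf, sizeof buf);
                /* Hand each sample to the transport layer; printed here. */
                printf("%s %s\n", probes[i].name, buf);
            }
            sleep(60);
        }
    }

Keeping every component behind the same probe_fn signature is one way to satisfy the "simple, uniform, and extendable" requirement above.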