1
Online Performance Monitoring of the Third ALICE
Data Challenge
  • W. Carena (1), R. Divia (1), P. Saiz (2), K. Schossmaier (1),
    A. Vascotto (1), P. Vande Vyvre (1)
  • (1) CERN EP-AID, (2) CERN EP-AIP
  • NEC2001
  • Varna, Bulgaria
  • 12-18 September 2001

2
Contents
  • ALICE Data Challenges
  • Testbed infrastructure
  • Monitoring system
  • Performance results
  • Conclusions

3
ALICE Data Acquisition
  • Data flow of the final system:
  • ALICE detectors → up to 20 GB/s
  • Local Data Concentrators (LDC), Readout, 300 nodes → up to 2.5 GB/s
  • Global Data Collectors (GDC), Event Building, 100 nodes → up to 1.25 GB/s
  • CASTOR System (Mass Storage System)
4
ALICE Data Challenges
  • What? Put together components to demonstrate the
    feasibility, reliability and performance of our
    present prototypes.
  • Where? The ALICE common testbed uses the hardware
    of the common CERN LHC testbed.
  • When? This exercise is repeated every year by
    progressively enlarging the testbed.
  • Who? Joined effort between the ALICE online and
    offline group, and two groups of the CERN IT
    division.

  • ADC I: March 1999
  • ADC II: March-April 2000
  • ADC III: January-March 2001
  • ADC IV: 2nd half of 2002 (?)
5
Goals of the ADC III
  • Performance, scalability, and stability of the
    system (10% of the final system)
  • 300 MB/s event building bandwidth
  • 100 MB/s over the full chain during a week
  • 80 TB into the mass storage system
  • Online monitoring tools

6
ADC III Testbed Hardware
  • Farm: 80 standard PCs, dual PIII @ 800 MHz, Fast and Gigabit Ethernet, Linux kernel 2.2.17
  • Network: 6 switches from 3 manufacturers, copper and fiber media, Fast and Gigabit Ethernet
  • Disks: 8 disk servers (dual PIII @ 700 MHz), 20 IDE data disks, 750 GB mirrored, 3 HP NetServers
  • Tapes: 12 tape drives, 1000 cartridges, 60 GB capacity, 10 MB/s bandwidth
7
ADC III Monitoring
  • Minimum requirements:
  • LDC/GDC throughput (individual and aggregate)
  • Data volume (individual and aggregate)
  • CPU load (user and system)
  • Identification: time stamp, run number
  • Plots accessible on the Web
  • Online monitoring tools:
  • PEM (Performance and Exception Monitoring) from
    CERN IT-PDP (was not ready for ADC III)
  • Fabric monitoring: developed by CERN IT-PDP
  • ROOT I/O: measures mass storage throughput
  • CASTOR: measures disk/tape/pool statistics
  • DATESTAT: prototype development by EP-AID, EP-AIP

8
Fabric Monitoring
  • Collect CPU, network I/O, and swap statistics
  • Send UDP packets to a server
  • Display current status and history using Tcl/Tk
    scripts
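
To illustrate the collect-and-send pattern above, here is a minimal sketch in C; the server address, port, sampling period, and message format are assumptions for illustration only, not the actual Fabric Monitoring protocol.

/* Sketch of the collect-and-send pattern: read a CPU figure from /proc
 * and push it to a monitoring server via UDP.
 * The server address, port, period and message format are assumptions. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(9000);                      /* assumed server port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* assumed server IP */

    for (;;) {
        double load1 = 0.0;
        FILE *f = fopen("/proc/loadavg", "r");
        if (f) {
            fscanf(f, "%lf", &load1);                /* 1-minute load average */
            fclose(f);
        }
        char host[64] = "unknown";
        gethostname(host, sizeof(host));
        char msg[128];
        int n = snprintf(msg, sizeof(msg), "%s load=%.2f", host, load1);
        sendto(sock, msg, n, 0, (struct sockaddr *)&srv, sizeof(srv));
        sleep(10);                                   /* assumed sampling period */
    }
    return 0;
}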

9
ROOT I/O Monitoring
  • Measures aggregate throughput to mass storage
    system
  • Collect measurements in a MySQL database
  • Display history and histograms using ROOT on Web
    pages
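
A minimal sketch of the "collect measurements in a MySQL database" step, written against the MySQL C API; the host name, credentials, database and table names are assumptions and do not reflect the actual ADC III schema.

/* Sketch: insert one aggregate-throughput sample into MySQL.
 * Host, credentials, database and table names are assumptions. */
#include <stdio.h>
#include <mysql/mysql.h>

static int record_throughput(double mbytes_per_s)
{
    MYSQL *conn = mysql_init(NULL);
    if (!mysql_real_connect(conn, "alicedb.cern.ch", "monitor", "secret",
                            "rootio", 0, NULL, 0)) {
        fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return -1;
    }
    char query[256];
    snprintf(query, sizeof(query),
             "INSERT INTO throughput (stamp, mb_per_s) VALUES (NOW(), %f)",
             mbytes_per_s);
    if (mysql_query(conn, query) != 0)
        fprintf(stderr, "insert failed: %s\n", mysql_error(conn));
    mysql_close(conn);
    return 0;
}

int main(void)
{
    return record_throughput(120.0);   /* example value, MB/s */
}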

10
DATESTAT Architecture
  • DATE v3.7: each LDC and GDC node runs dateStat.c (fed by top and DAQCONTROL)
  • DATE Info Logger → log files (200 KB/hour/node)
  • Perl script → statistics files → gnuplot script
  • C program → MySQL database → gnuplot/CGI script
  • Plots published at http://alicedb.cern.ch/statistics
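
A minimal sketch of the per-node collector idea behind dateStat.c: sample CPU counters periodically and append timestamped records to a small log file. The /proc source, file path, record format, and period are assumptions; the real program is fed by top and DAQCONTROL and logs through the DATE Info Logger.

/* Sketch of the dateStat.c idea: periodically sample CPU counters and
 * append a timestamped record to a per-node log file. File path, record
 * format and sampling period are assumptions; the real program reads
 * from top/DAQCONTROL and logs through the DATE Info Logger. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    unsigned long prev_user = 0, prev_sys = 0, prev_idle = 0;
    for (;;) {
        unsigned long user = 0, nicev = 0, sys = 0, idle = 0;
        FILE *f = fopen("/proc/stat", "r");
        if (f) {
            fscanf(f, "cpu %lu %lu %lu %lu", &user, &nicev, &sys, &idle);
            fclose(f);
        }
        unsigned long du = user - prev_user, ds = sys - prev_sys;
        unsigned long di = idle - prev_idle, tot = du + ds + di;
        prev_user = user; prev_sys = sys; prev_idle = idle;

        FILE *log = fopen("/tmp/datestat.log", "a");  /* assumed log path */
        if (log && tot > 0)
            fprintf(log, "%ld user=%.1f%% sys=%.1f%%\n", (long)time(NULL),
                    100.0 * du / tot, 100.0 * ds / tot);
        if (log) fclose(log);
        sleep(60);   /* one sample per minute keeps the log file small */
    }
    return 0;
}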
11
Selected DATESTAT Results
  • Result 1: DATE standalone run, equal subevent size
  • Result 2: Dependence on subevent size
  • Result 3: Dependence on the number of LDCs/GDCs
  • Result 4: Full chain, ALICE-like subevents

12
Result 1/1
  • DATE standalone
  • 11LDCx11GDC nodes, 420...440 KB subevents, 18
    hours

13
Result 1/2
  • DATE standalone
  • 11LDCx11GDC nodes, 420...440 KB subevents, 18
    hours
  • LDC load: 12% user, 27% sys
  • LDC rate: 27.1 MB/s

14
Result 1/3
  • DATE standalone
  • 11LDCx11GDC nodes, 420...440 KB subevents, 18
    hours
  • GDC load: 1% user, 37% sys
  • GDC rate: 27.7 MB/s

15
Result 2
  • DATE standalone
  • 13LDCx13GDC nodes, 5060 KB subevents, 1.1 hours

16
Result 3
  • Dependence on LDC/GDC
  • DATE standalone
  • Gigabit Ethernet
  • max. 30 MB/s per LDC
  • max. 60 MB/s per GDC

17
Result 4/1
  • Full chain
  • 20LDCx13GDC nodes, ALICE-like subevents, 59 hours

18
Result 4/2
  • Full chain
  • 20LDCx13GDC nodes, ALICE-like subevents, 59 hours
  • GDC load: 6% user, 23% sys
  • GDC rate: 6.8 MB/s

19
Result 4/3
  • Full chain
  • 20LDCx13GDC nodes, ALICE-like subevents, 59 hours
  • LDC load: 0.8% user, 2.7% sys
  • LDC rate: 1.1 MB/s (60 KB subevents, Fast Ethernet)

20
Grand Total
  • Maximum throughput in DATE: 556 MB/s for symmetric traffic, 350 MB/s for ALICE-like traffic
  • Maximum throughput in full chain: 120 MB/s without migration, 86 MB/s with migration
  • Maximum volume per run: 54 TB with DATE standalone, 23.6 TB with full chain
  • Total volume through DATE: at least 500 TB
  • Total volume through full chain: 110 TB
  • Maximum duration per run: 86 hours
  • Maximum events per run: 21E6
  • Maximum subevent size: 9 MB
  • Maximum number of nodes: 20x15
  • Number of runs: 2200

21
Summary
  • Most of the ADC III goals were achieved:
  • PC/Linux platforms are stable and reliable
  • Ethernet technology is reliable and scalable
  • DATE standalone is running well
  • The full chain needs further analysis
  • The next ALICE Data Challenge is planned for the 2nd half of 2002
  • Online performance monitoring:
  • The DATESTAT prototype performed well
  • It helped to spot bottlenecks in the DAQ system
  • The team in Zagreb is re-designing and re-engineering the DATESTAT prototype

22
Future Work
  • Polling agent (see the sketch after this list)
  • obtain performance data from all components
  • keep the agent simple, uniform, and extendable
  • support several platforms (UNIX, application software)
  • Transport/Storage
  • use communication with low overhead
  • maintain a common format in a central database
  • Processing
  • apply efficient algorithms to filter and correlate logged data
  • store performance results permanently in a database
  • Visualization
  • use a common GUI (Web-based, ROOT objects)
  • provide different views (levels, time scales, color codes)
  • automatically generate plots, histograms, reports, e-mail, ...
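
A minimal sketch of such a polling agent, assuming a table of probe functions and a plain-text output line per probe; the probe set, cycle length, and output format are illustrative assumptions, not the planned implementation.

/* Sketch of a simple, uniform, extendable polling agent: each data
 * source is a probe function registered in one table, and the agent
 * polls them all on a fixed cycle. Probe names, output format and the
 * cycle length are assumptions. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

typedef double (*probe_fn)(void);

static double probe_load(void)            /* 1-minute load average */
{
    double load = 0.0;
    FILE *f = fopen("/proc/loadavg", "r");
    if (f) { fscanf(f, "%lf", &load); fclose(f); }
    return load;
}

static double probe_uptime(void)          /* seconds since boot */
{
    double up = 0.0;
    FILE *f = fopen("/proc/uptime", "r");
    if (f) { fscanf(f, "%lf", &up); fclose(f); }
    return up;
}

static const struct { const char *name; probe_fn fn; } probes[] = {
    { "load",   probe_load },
    { "uptime", probe_uptime },
};

int main(void)
{
    for (;;) {
        /* One line per probe; a transport layer would ship these to
         * the central database in a common format. */
        for (size_t i = 0; i < sizeof(probes) / sizeof(probes[0]); i++)
            printf("%ld %s %.2f\n", (long)time(NULL),
                   probes[i].name, probes[i].fn());
        fflush(stdout);
        sleep(30);                        /* assumed polling cycle */
    }
    return 0;
}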