1
ALICE Computing Model
  • F.Carminati
  • BNL Seminar
  • March 21, 2005

2
Offline framework
  • AliRoot in development since 1998
  • Entirely based on ROOT
  • Used since the detector TDRs for all ALICE
    studies
  • Two packages to install (ROOT and AliRoot)
  • Plus MCs
  • Ported on most common architectures
  • Linux IA32, IA64 and AMD, Mac OS X, Digital
    Tru64, SunOS
  • Distributed development
  • Over 50 developers and a single CVS repository
  • 2/3 of the code developed outside CERN
  • Tight integration with DAQ (data recorder) and
    HLT (same code-base)
  • Wide use of abstract interfaces for modularity
    (see the sketch after this list)
  • Restricted subset of C++ used for maximum
    portability
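
As referenced above, the abstract-interface approach is what makes the Virtual MC possible: detector code is written against an abstract transport interface and the concrete engine (Geant3, Geant4 or FLUKA) is chosen at run time. A minimal sketch of the pattern, with illustrative class names rather than the real TVirtualMC API:

  // Minimal sketch of the abstract-interface pattern (illustrative names,
  // not the actual TVirtualMC API): detector code depends only on the
  // abstract base, so the transport engine can be swapped at run time.
  #include <iostream>
  #include <memory>

  class VirtualTransport {             // abstract interface
   public:
    virtual ~VirtualTransport() = default;
    virtual void ProcessEvent() = 0;   // transport one event
  };

  class Geant3Transport : public VirtualTransport {
   public:
    void ProcessEvent() override { std::cout << "Geant3 transport\n"; }
  };

  class FlukaTransport : public VirtualTransport {
   public:
    void ProcessEvent() override { std::cout << "FLUKA transport\n"; }
  };

  int main() {
    // The same user code runs unchanged with either engine.
    std::unique_ptr<VirtualTransport> mc = std::make_unique<FlukaTransport>();
    mc->ProcessEvent();
  }

Swapping the engine then requires no change to the detector code, which is how the same AliRoot geometry and scoring can be exercised with Geant3, Geant4 or FLUKA.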

3
AliRoot layout
Layout diagram: the STEER framework core with AliSimulation and
AliReconstruction; detector modules (ITS, TPC, TRD, TOF, PHOS, EMCAL,
ZDC, RICH, PMD, CRT, FMD, MUON, START, STRUCT); event generators
(EVGEN, PYTHIA6, HIJING, MEVSIM, ISAJET, PDF); transport engines behind
the Virtual MC (G3, G4, FLUKA); analysis packages (ESD, AliAnalysis,
HBTAN, HBTP, RALICE); all built on ROOT, with AliEn/gLite for the Grid.
ROOT
4
Software management
  • Regular release schedule
  • Major release every six months, minor release
    (tag) every month
  • Emphasis on delivering production code
  • Corrections, protections, code cleaning, geometry
  • Nightly production of UML diagrams, code listings,
    coding rule violations, builds and tests; single
    repository with all the code
  • No version management software (we have only two
    packages!)
  • Advanced code tools under development
    (collaboration with IRST/Trento)
  • Smell detection (already under testing)
  • Aspect oriented programming tools
  • Automated genetic testing

5
ALICE Detector Construction Database (DCDB)
  • Specifically designed to aid detector
    construction in a distributed environment
  • Sub-detector groups around the world work
    independently
  • All data are collected in a central repository and
    used to track components moving from one
    sub-detector group to another and during the
    integration and operation phases at CERN
  • Multitude of user interfaces
  • WEB-based for humans
  • LabView, XML for laboratory equipment and other
    sources
  • ROOT for visualisation
  • In production since 2002
  • A very ambitious project with important spin-offs
  • Cable Database
  • Calibration Database

6
The Virtual MC
7
TGeo modeller
8
Results
Comparison plots: Geant3 vs. FLUKA, HMPID, 5 GeV pions
9
ITS SPD Cluster Size (PRELIMINARY!)
10
Reconstruction strategy
  • Main challenge - reconstruction in the high-flux
    environment (occupancy in the TPC up to 40%)
    requires a new approach to tracking
  • Basic principle: maximum information approach
  • Use everything you can, you will get the best
  • Algorithms and data structures optimized for fast
    access and usage of all relevant information
  • Localize relevant information
  • Keep this information until it is needed

11
Tracking strategy: primary tracks
  • Incremental process (sketched after this list)
  • Forward propagation towards the vertex: TPC → ITS
  • Back propagation: ITS → TPC → TRD → TOF
  • Refit inward: TOF → TRD → TPC → ITS
  • Continuous seeding
  • Track segment finding in all detectors
  • Combinatorial tracking in ITS
  • Weighted two-track χ² calculated
  • Effective probability of cluster sharing
  • Probability for secondary particles not to cross
    a given layer
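
As referenced above, a schematic rendering of the three propagation passes (illustrative types and method names, not the actual AliRoot tracker classes):

  // Schematic of the three tracking passes described above
  // (illustrative types and names, not the actual AliRoot tracker API).
  #include <cstddef>
  #include <vector>

  struct Track { /* Kalman state: parameters, covariance, chi2, clusters */ };

  struct DetectorTracker {
    virtual ~DetectorTracker() = default;
    // Each pass propagates tracks through the detector, attaches clusters
    // and updates the Kalman state; trackers may also create new seeds.
    virtual void Clusters2Tracks(std::vector<Track>& t) = 0;  // towards the vertex
    virtual void PropagateBack  (std::vector<Track>& t) = 0;  // away from the vertex
    virtual void RefitInward    (std::vector<Track>& t) = 0;  // final inward refit
  };

  // Detectors ordered from the innermost outwards: ITS, TPC, TRD, TOF.
  void ReconstructPrimaries(std::vector<DetectorTracker*>& det,
                            std::vector<Track>& tracks) {
    // 1) Forward propagation towards the vertex: TPC -> ITS
    //    (assumes det[0] = ITS, det[1] = TPC)
    for (int i = 1; i >= 0; --i) det[i]->Clusters2Tracks(tracks);
    // 2) Back propagation: ITS -> TPC -> TRD -> TOF
    for (std::size_t i = 0; i < det.size(); ++i) det[i]->PropagateBack(tracks);
    // 3) Refit inward: TOF -> TRD -> TPC -> ITS
    for (int i = static_cast<int>(det.size()) - 1; i >= 0; --i)
      det[i]->RefitInward(tracks);
  }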

12
Tracking & PID
Plots: TPC alone and ITS+TPC+TOF+TRD combined
  • PIV 3 GHz (dN/dy = 6000)
  • TPC tracking: 40 s
  • TPC kink finder: 10 s
  • ITS tracking: 40 s
  • TRD tracking: 200 s

13
Condition and alignment
  • Heterogeneous information sources are
    periodically polled
  • ROOT files with condition information are created
    (see the sketch after this list)
  • These files are published on the Grid and
    distributed as needed by the Grid DMS
  • Files contain validity information and are
    identified via DMS metadata
  • No need for a distributed DBMS
  • Reuse of the existing Grid services
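
As referenced above, a minimal sketch of the idea (hypothetical object and naming convention; the real condition-data interface is richer): a calibration object is written to a plain ROOT file, and its validity range is carried both in the file name and in the metadata used when the file is registered in the Grid file catalogue.

  // Minimal sketch: write a condition object to a ROOT file; validity is
  // encoded here in the file name and would also be attached as catalogue
  // metadata when the file is published (hypothetical naming convention).
  #include "TFile.h"
  #include "TNamed.h"
  #include "TString.h"

  void WritePedestals(int firstRun, int lastRun) {
    // Stand-in for a real calibration object (e.g. a pedestal container).
    TNamed pedestals("TPCPedestals", "TPC pedestal values");

    // Validity range encoded in the name; the same range goes into the
    // DMS metadata so jobs can ask for "the file valid for run N".
    TString fname = TString::Format("TPC_Pedestals_Run%d_%d.root",
                                    firstRun, lastRun);
    TFile f(fname, "RECREATE");
    pedestals.Write();
    f.Close();
    // The file would then be registered in the Grid file catalogue with
    // its validity interval as metadata.
  }

Because validity is ordinary catalogue metadata, the Grid DMS can serve the right file for a given run without any distributed DBMS, which is the point of this slide.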

14
External relations and DB connectivity
Diagram (relations between DBs not final, not all shown): the AliRoot
calibration classes are connected through APIs to the DCS, DAQ, ECS,
Trigger, HLT and DCDB, and to the AliEn→gLite metadata file store;
physics data files, calibration files and calibration procedures flow
through these APIs.
From the URs: source, volume, granularity, update frequency, access
pattern, runtime environment and dependencies.
Call for URs sent to the subdetectors.
API = Application Program Interface
15
Metadata
  • MetaData are essential for the selection of
    events
  • We hope to be able to use the Grid file catalogue
    for one part of the MetaData
  • During the Data Challenge we used the AliEn file
    catalogue for storing part of the MetaData
  • However, these are file-level MetaData
  • We will also need event-level MetaData
  • This can simply be the TAG catalogue with
    externalisable references (see the sketch after
    this list)
  • We are discussing this subject with STAR
  • We will take a decision soon
  • We would prefer that the Grid scenario be clearer
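
As referenced above, a minimal sketch of what an event-level TAG with an externalisable reference could look like (the branch layout is an illustrative assumption, not the final ALICE tag format): a small ROOT tree holds per-event selection variables plus the LFN of the file and the entry number where the full ESD lives.

  // Minimal sketch of an event-level TAG tree with an "externalisable
  // reference" back to the full ESD (illustrative branch layout).
  #include "TFile.h"
  #include "TTree.h"

  void BuildTagTree() {
    TFile out("EventTags.root", "RECREATE");
    TTree tags("T", "event-level tags");

    // Selection variables...
    Int_t   run = 0, nTracks = 0;
    Float_t maxPt = 0.f;
    // ...plus the externalisable reference: which file and which entry.
    Char_t  esdLFN[256] = "";
    Int_t   esdEntry = -1;

    tags.Branch("run",      &run,      "run/I");
    tags.Branch("nTracks",  &nTracks,  "nTracks/I");
    tags.Branch("maxPt",    &maxPt,    "maxPt/F");
    tags.Branch("esdLFN",    esdLFN,   "esdLFN/C");
    tags.Branch("esdEntry", &esdEntry, "esdEntry/I");

    // Fill one dummy entry for illustration; a real producer would loop
    // over the reconstructed events.
    tags.Fill();
    tags.Write();
    out.Close();
  }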

16
ALICE CDCs
17
Use of HLT for monitoring in CDCs
Data-flow diagram: AliRoot simulation produces digits for the LDCs; the
event builder (GDC) assembles the raw data; the HLT algorithms produce
ESD and monitoring histograms; alimdc writes ROOT files to CASTOR and
registers them in AliEn.
18
ALICE Physics Data Challenges
19
PDC04 schema
Schema: AliEn job control and data transfer; production of RAW at Tier1
and Tier2 centres, shipment of RAW to CERN, reconstruction of RAW in
all T1s, analysis.
20
Phase 2 principle
Mixed signal
21
Simplified view of the ALICE Grid with AliEn
  • ALICE VO central services: File Catalogue, user
    authentication, workload management and job
    submission, configuration, job monitoring, Central
    Task Queue, accounting, storage volume manager
  • AliEn site services: Computing Element, data
    transfer, local scheduler, disk and MSS, Storage
    Element, Cluster Monitor
  • Existing site components integrated with the
    ALICE VO site services
22
Site services
  • Unobtrusive: runs entirely in user space
  • Single user account
  • All authentication already assured by central
    services
  • Tuned to the existing site configuration:
    supports various schedulers and storage solutions
  • Running on many Linux flavours and platforms
    (IA32, IA64, Opteron)
  • Automatic software installation and updates (both
    service and application)
  • Scalable and modular: different services can be
    run on different nodes (in front of or behind
    firewalls) to preserve site security and
    integrity

Diagram: CERN firewall solution for large-volume file transfers. Only
high ports (50K-55K) are used for parallel file transport, through
load-balanced file transfer nodes (on HTAR) on the CERN intranet,
keeping AliEn data transfer separate from the other AliEn services.
23
Log files and application software storage: 1 TB
SATA disk server
24
Phase 2 job structure
  • Task - simulate the event reconstruction and
    remote event storage

Diagram: the central servers handle master job submission, the Job
Optimizer (N sub-jobs), the RB, the file catalogue, process monitoring
and control, and the SE. Sub-jobs run on AliEn CEs and, through the
AliEn-LCG interface with its own RB, on LCG CEs. Underlying event input
files are read from the CERN CASTOR store of underlying events
(completed Sep. 2004). Output files are zip-archived and stored at the
local SEs (primary copy), registered in the AliEn FC (for LCG SEs:
LCG LFN = AliEn PFN, via edg(lcg) copy&register), with a backup copy
at CERN CASTOR.
25
Production history
  • Statistics
  • 400 000 jobs, 6 hours/job, 750 MSi2K hours
  • 9M entries in the AliEn file catalogue
  • 4M physical files at 20 AliEn SEs in centres
    world-wide
  • 30 TB stored at CERN CASTOR
  • 10 TB stored at remote AliEn SEs + 10 TB backup
    at CERN
  • 200 TB network transfer CERN → remote computing
    centres
  • AliEn efficiency observed 90%
  • LCG observed efficiency 60% (see GAG document)
  • ALICE repository: history of the entire DC
  • 1 000 monitored parameters
  • Running, completed processes
  • Job status and error conditions
  • Network traffic
  • Site status, central services monitoring
  • ...
  • 7 GB of data
  • 24 million records with 1 minute granularity
    analysed to improve GRID performance

26
Job repartition
  • Jobs (AliEn/LCG): Phase 1 - 75%/25%, Phase 2 -
    89%/11%
  • More operational sites were added to the ALICE
    GRID as the PDC progressed

Phase 1
Phase 2
  • 17 permanent sites (33 in total) under direct
    AliEn control, plus additional resources through
    GRID federation (LCG)

27
Summary of PDC04
  • Computing resources
  • It took some effort to tune the resources at
    the remote computing centres
  • The centres' response was very positive: more
    CPU and storage capacity was made available
    during the PDC
  • Middleware
  • AliEn proved to be fully capable of executing
    high-complexity jobs and controlling large
    amounts of resources
  • Functionality for Phase 3 has been demonstrated,
    but cannot be used
  • LCG MW proved adequate for Phase 1, but not for
    Phase 2, nor in a competitive environment
  • It cannot provide the additional functionality
    needed for Phase 3
  • ALICE computing model validation
  • AliRoot: all parts of the code successfully
    tested
  • Computing elements configuration
  • Need for a high-functionality MSS shown
  • Phase 2 distributed data storage schema proved
    robust and fast
  • Data Analysis could not be tested

28
Development of Analysis
  • Analysis Object Data designed for efficiency
  • Contain only data needed for a particular
    analysis
  • Analysis à la PAW
  • ROOT + at most a small library (see the sketch
    after this list)
  • Work on the distributed infrastructure has been
    done by the ARDA project
  • Batch analysis infrastructure
  • Prototype published at the end of 2004 with AliEn
  • Interactive analysis infrastructure
  • Demonstration performed at the end of 2004 with
    AliEn→gLite
  • Physics working groups are just starting now, so
    timing is right to receive requirements and
    feedback
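
As referenced above, a minimal sketch of the "analysis à la PAW" style: a plain ROOT macro chains the AOD files and fills a histogram using nothing beyond ROOT itself (the tree name "aodTree" and the flat "pt" branch are illustrative assumptions, not the actual ALICE AOD layout).

  // "Analysis à la PAW" sketch: chain AOD files and fill a histogram
  // using only ROOT (illustrative tree/branch names).
  #include "TCanvas.h"
  #include "TChain.h"
  #include "TH1F.h"

  void SimpleAnalysis() {
    TChain chain("aodTree");          // illustrative tree name
    chain.Add("AliAOD_*.root");       // local or Grid-staged AOD files

    Float_t pt = 0.f;                 // illustrative flat branch
    chain.SetBranchAddress("pt", &pt);

    TH1F hPt("hPt", "track p_{T};p_{T} (GeV/c);entries", 100, 0., 10.);
    const Long64_t n = chain.GetEntries();
    for (Long64_t i = 0; i < n; ++i) {
      chain.GetEntry(i);
      hPt.Fill(pt);
    }

    TCanvas c("c", "pt");
    hPt.Draw();
    c.SaveAs("pt.pdf");
  }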

29
LCG
Site A
Site B
PROOF SLAVE SERVERS
PROOF SLAVE SERVERS
Proofd
Rootd
Forward Proxy
Forward Proxy
New Elements
Optional Site Gateway
Only outgoing connectivity
Site
Slave ports mirrored on Master host
Proofd Startup
Slave Registration / Booking-DB
Grid Service Interfaces
TGrid UI/Queue UI
Master Setup
Grid Access Control Service
Grid/Root Authentication
Standard Proof Session
Grid File/Metadata Catalogue
Master
Booking Request with logical file names
Client retrieves list of logical files (LFN + MSN)
Grid-Middleware independent PROOF Setup
Client
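
The diagram above describes a Grid-middleware independent PROOF setup. For the interactive case, a standard PROOF session on top of such a setup could look roughly like the sketch below (assuming a ROOT build with PROOF enabled; the master name, file URLs, tree name and selector are placeholders, and the file list would normally come from the Grid file/metadata catalogue shown above).

  // Rough sketch of a standard PROOF session (assumes a ROOT build with
  // PROOF enabled; master, file names and selector are placeholders).
  #include "TChain.h"
  #include "TProof.h"

  void RunProofAnalysis() {
    // Connect to the PROOF master set up through the Grid services.
    TProof *proof = TProof::Open("master.example.org");
    if (!proof) return;

    // Chain built from the logical file names returned by the
    // Grid file/metadata catalogue (placeholder entries here).
    TChain chain("esdTree");
    chain.Add("root://se.example.org//data/AliESDs_001.root");
    chain.Add("root://se.example.org//data/AliESDs_002.root");

    // Process the chain on the PROOF slaves in parallel with a
    // user-supplied TSelector.
    chain.SetProof();
    chain.Process("MySelector.C+");
  }
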
30
Grid situation
  • History
  • Jan 04: AliEn developers are hired by EGEE and
    start working on the new MW
  • May 04: a prototype derived from AliEn is
    offered to pilot users (ARDA, Biomed...) under the
    gLite name
  • Dec 04: the four experiments ask for this
    prototype to be deployed on a larger preproduction
    service and be part of the EGEE release
  • Jan 05: this is vetoed at management level;
    AliEn will not be common software
  • Current situation
  • EGEE has vaguely promised to provide the same
    functionality as the AliEn-derived MW
  • But with at least a 2-4 month delay on top of
    the one already accumulated
  • But even this will be just the beginning of the
    story: the different components will have to be
    field-tested in a real environment; it took four
    years for AliEn
  • All experiments have their own middleware
  • Ours is not maintained because our developers
    have been hired by EGEE
  • EGEE has formally vetoed any further work on
    AliEn or AliEn-derived software
  • LCG has allowed some support for ALICE but the
    situation is far from being clear

31
ALICE computing model
  • For pp, similar to the other experiments
  • Quasi-online data distribution and first
    reconstruction at T0
  • Further reconstruction passes at T1s
  • For AA, a different model
  • Calibration, alignment and pilot reconstructions
    during data taking
  • Data distribution and first reconstruction at T0
    during the four months after the AA run (shutdown)
  • Second and third passes distributed at the T1s
  • For safety, one copy of RAW at T0 and a second one
    distributed among all T1s
  • T0: first-pass reconstruction, storage of one
    copy of RAW, calibration data and first-pass
    ESDs
  • T1: subsequent reconstructions and scheduled
    analysis, storage of the second collective copy
    of RAW and one copy of all data to be safely kept
    (including simulation), disk replicas of ESDs
    and AODs
  • T2: simulation and end-user analysis, disk
    replicas of ESDs and AODs
  • Very difficult to estimate network load

32
ALICE requirements on MiddleWare
  • One of the main uncertainties of the ALICE
    computing model comes from the Grid component
  • ALICE has developed its computing model assuming
    that MW with the quality and functionality that
    AliEn would have reached two years from now will
    be deployable on the LCG computing infrastructure
  • If not, we will still analyse the data (!), but
  • Less efficiency → more computers → more time and
    money
  • More people for production → more money
  • To elaborate an alternative model we should know:
  • The functionality of the MW developed by EGEE
  • The support we can count on from LCG
  • Our political margin of manoeuvre

33
Possible strategy
  • If
  • Basic services from LCG/EGEE MW can be trusted at
    some level
  • We can get some support to port the higher
    functionality MW onto these services
  • We have a solution
  • If a) above is not true but if
  • We have support for deploying the ARDA-tested
    AliEn-derived gLite
  • We do not have a political veto
  • We still have a solution
  • Otherwise we are in trouble

34
ALICE Offline Timeline
35
Main parameters
36
Processing pattern
37
Conclusions
  • ALICE has made a number of technical choices for
    the Computing framework since 1998 that have been
    validated by experience
  • The Offline development is on schedule, although
    contingency is scarce
  • Collaboration between physicists and computer
    scientists is excellent
  • Tight integration with ROOT allows a fast
    prototyping and development cycle
  • AliEn goes a long way in providing a GRID
    solution adapted to HEP needs
  • However its evolution into a common project has
    been stopped
  • This is probably the largest single risk factor
    for ALICE computing
  • Some ALICE-developed solutions have a high
    potential to be adopted by other experiments and
    indeed are becoming common solutions