Grid Computing in CMS: Transcript and Presenter's Notes

1
Grid Computing in CMS
  • José M. Hernández
  • CIEMAT, Madrid
  • HEP2005 International Europhysics Conference
    on High Energy Physics

2
Outline
  • Architecture of CMS Computing System
    • Data and Workload Management Systems
    • Baseline capabilities and functionalities
  • Key CMS Grid Systems
    • Data Transfer and Placement System (PhEDEx)
    • Monte Carlo production on the Grid
    • Data analysis on the Grid
  • Stress Tests of CMS Computing System
    • CMS computing challenges and LCG service challenges
    • Develop computing system iteratively

3
CMS Computing Model
  • Distributed model for computing in CMS
    • Copes with the computing requirements for storage, processing
      and analysis of the data provided by the LHC
    • Computing resources are geographically distributed,
      interconnected via high-throughput networks and operated by
      means of Grid software
  • CMS Computing TDR just released (June 2005)

4
Tiered Architecture
  • Tier-0
    • Accepts data from the DAQ
    • Prompt reconstruction
    • Archives data and distributes them to Tier-1s
  • Tier-1s
    • Real data archiving
    • Re-processing
    • Calibration
    • Skimming and other data-intensive analysis tasks
    • MC data archiving
  • Tier-2s
    • Data analysis
    • MC simulation
    • Import datasets from Tier-1s and export MC data (the tier
      roles and flows are modelled in the sketch after this list)
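
To make the division of labour concrete, here is a toy model of the
tier roles and managed data flows as plain Python data. The tier names
and tasks come from the slide; the code structure itself is purely
illustrative.

    # Toy model of the CMS tiered architecture described above. Tier
    # names and tasks follow the slide; the structure is illustrative.
    TIERS = {
        "Tier-0": ["accept data from DAQ", "prompt reconstruction",
                   "archive and distribute to Tier-1s"],
        "Tier-1": ["real data archiving", "re-processing", "calibration",
                   "skimming and data-intensive analysis", "MC archiving"],
        "Tier-2": ["data analysis", "MC simulation"],
    }

    # Managed flows (cf. the PhEDEx transfer topology on a later slide).
    FLOWS = {("Tier-0", "Tier-1"),   # raw/reconstructed data distribution
             ("Tier-1", "Tier-2"),   # dataset import for analysis
             ("Tier-2", "Tier-1")}   # export of produced MC data

    def allowed(src, dst):
        """True if the model permits a managed transfer src -> dst."""
        return (src, dst) in FLOWS

    assert allowed("Tier-2", "Tier-1")      # MC export to a Tier-1
    assert not allowed("Tier-0", "Tier-2")  # no direct Tier-0/Tier-2 flow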

5
Workload and Data Management Systems
  • Design philosophy
    • Use Grid services as much as possible, complemented by
      CMS-specific services
    • Baseline system with minimal functionality for first physics
  • Keep it simple!
    • Optimize for the common case
    • Optimize for read access (most data is write-once, read-many)
    • Optimize for organized bulk processing, but without limiting
      the single user
  • Decouple parts of the system
    • Minimize job dependencies
    • Site-local information stays site-local
  • Use explicit data placement (see the sketch after this list)
    • Data does not move around in response to job submission
    • All data is placed at a site through explicit CMS policy
  • Grid interoperability (LCG and OSG)
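
A minimal sketch of the explicit-placement principle, assuming nothing
about the real CMS tools: placement is a standalone policy decision
recorded ahead of time, and job submission merely reads it. All names
below are hypothetical.

    # Sketch of "explicit data placement": where data lives is decided
    # by CMS policy, never as a side effect of job submission. All
    # names here are hypothetical.
    placement = {}  # file block -> hosting sites, filled only by policy

    def place_block(block, sites):
        """Record a policy decision to host a block at the given sites."""
        placement.setdefault(block, []).extend(sites)

    def sites_for_job(block):
        """Job submission only reads the placement table, never writes it."""
        return placement.get(block, [])

    place_block("/TTbar/Spring05/RAW#0001", ["T1_FNAL", "T1_CNAF"])
    print(sites_for_job("/TTbar/Spring05/RAW#0001"))  # jobs go to the data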

6
WMS and DMS Services Overview
  • No global file replica catalogue
    • Track and replicate data with a granularity of file blocks
  • Data Bookkeeping System
    • What data exist?
  • Data Location Service
    • Where are data located?
  • Local file catalogue (the full discovery chain is sketched after
    this list)
  • Data access and storage
    • SRM and POSIX I/O
  • Data transfer and placement system
  • Rely on Grid workload management
    • Reliability, performance, monitoring
    • Hierarchical task queue in the future
  • Grid and CMS-specific job monitoring and bookkeeping
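
The three catalogue services above chain naturally: dataset to file
blocks (bookkeeping), block to sites (location), then logical to
physical file name at the chosen site. The sketch below shows that
chain with invented function names and stand-in data; it is not the
actual DBS/DLS interface.

    # Hypothetical sketch of the data discovery chain: dataset -> file
    # blocks (Data Bookkeeping System), block -> sites (Data Location
    # Service), then the site-local catalogue maps logical file names
    # to physical ones. Names and return values are stand-ins.
    def dbs_blocks(dataset):
        """Data Bookkeeping System: which blocks make up this dataset?"""
        return ["%s#block%d" % (dataset, i) for i in range(2)]

    def dls_sites(block):
        """Data Location Service: which sites host this block?"""
        return ["T1_FNAL", "T2_Madrid"]

    def local_pfn(site, lfn):
        """Site-local catalogue: logical -> physical file name.
        Site-local information stays site-local (previous slide)."""
        return "srm://%s/store%s" % (site, lfn)

    for block in dbs_blocks("/TTbar/Spring05/RECO"):
        for site in dls_sites(block):
            print(site, local_pfn(site, "/TTbar/Spring05/RECO/f0.root"))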

[Figure: the services above grouped into the DMS and the WMS]
7
Data Transfer and Placement System
  • PhEDEx (Physics Experiment Data Export)
    • Large-scale dataset replica management system
  • Managed data flow following a transfer topology
    (Tier-0 → Tier-1 → Tier-2)
    • Routed multi-hop transfers; routing agents determine the best
      route
    • Reliable point-to-point transfers built on top of unreliable
      Grid transfer tools
  • Set of quasi-independent, asynchronous software agents posting
    messages on a central blackboard (see the sketch after this list)
  • Nodes subscribe to data allocated from other nodes
    • Enables distribution management at the dataset level rather
      than at the file level
    • Implements the experiment's policy on data placement
    • Allows prioritization and scheduling
  • In production for more than a year
    • Managing transfers of several TB/day
    • 100 TB known to PhEDEx, 200 TB total replicated
    • Running at CERN, 7 Tier-1s and 10 Tier-2s
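
A toy illustration of the agent-and-blackboard pattern named above,
assuming the blackboard is simply a shared task table: independent
agents poll for transfer tasks, perform one reliable hop, and post the
updated state back. This sketches the pattern only; it is not PhEDEx
code.

    # Toy sketch of the blackboard pattern: quasi-independent agents
    # poll a central table for transfer tasks, perform one hop of a
    # routed multi-hop transfer, and post the new state back.
    blackboard = [{"file": "f1.root", "hop": 0, "state": "routed",
                   "route": ["T0_CERN", "T1_CNAF", "T2_Bari"]}]

    def transfer_agent():
        """Advance each routed transfer by one reliable hop."""
        for task in blackboard:
            if task["state"] == "routed":
                src = task["route"][task["hop"]]
                dst = task["route"][task["hop"] + 1]
                print("copying %s: %s -> %s" % (task["file"], src, dst))
                task["hop"] += 1
                if task["hop"] == len(task["route"]) - 1:
                    task["state"] = "done"  # reached its destination

    # In PhEDEx many such agents run asynchronously; here we just loop.
    while any(t["state"] != "done" for t in blackboard):
        transfer_agent()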

8
PhEDEx (figure slide)
9
MC Production on the Grid
  • US Open Science Grid (OSG) and LHC Computing Grid (LCG)
  • McRunjob tool for running CMS production jobs (preparation,
    submission, stage-in, execution, stage-out, cleanup; this step
    chain is sketched after the list)
    • Developed by FNAL with contributions from other CMS people
    • Highly configurable and flexible
    • Interfaced to all Grids and to local farm production
  • Different production steps (generation, simulation, digitization
    and reconstruction) currently run separately
  • Production on both Grids since 2003
    • Production on LCG recently extended to run all steps; now
      ramping up
    • Local farm production moving to the Grid
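
A generic sketch of that production-job life cycle, with entirely
hypothetical phase bodies; it only shows how the six phases chain and
how the four production steps each make a separate pass, as the slide
describes.

    # Generic sketch of the production-job life cycle listed above.
    # The phase bodies are placeholders, not McRunjob internals.
    def prepare(job):   print("write job configuration for", job["id"])
    def submit(job):    print("submit", job["id"], "to the resource broker")
    def stage_in(job):  print("copy input to the worker node for", job["id"])
    def execute(job):   print("run the", job["step"], "step for", job["id"])
    def stage_out(job): print("copy output to storage for", job["id"])
    def cleanup(job):   print("remove the scratch area for", job["id"])

    PHASES = [prepare, submit, stage_in, execute, stage_out, cleanup]

    # Generation, simulation, digitization and reconstruction currently
    # run as separate jobs, so each is its own pass through the chain.
    for step in ["generation", "simulation", "digitization", "reconstruction"]:
        job = {"id": "ttbar_%s_001" % step, "step": step}
        for phase in PHASES:
            phase(job)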

10
LCG Production Workflow
  • Quasi-real-time job monitoring (BOSS)
  • Experiment software normally pre-installed at sites

11
MC Production on the Grid
[Plots: production statistics on LCG and OSG]
  • Several thousand CPUs available on both Grids
  • A few million events produced per month
  • Around 70-90% job efficiency; the issue is reliability
    • System issues (hardware failures, NFS access, full disks, site
      misconfiguration)
    • Software installation problems
    • Grid services, stage-in and stage-out of files, LCG catalogue
      instability
  • Running with a site whitelist, at the cost of increased manpower
  • Instrument the job wrapper to cope with known instabilities
    (a retry sketch follows this list)
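
Instrumenting the wrapper largely means retrying the fragile
operations (stage-in, stage-out, catalogue calls) and recording
diagnostics. A minimal sketch, assuming a simple retry policy around a
grid copy command; the host and paths are placeholders.

    # Minimal sketch of an instrumented job wrapper: retry fragile
    # grid operations and log diagnostics for later triage. The retry
    # policy is an assumption; hosts and paths are placeholders.
    import subprocess
    import time

    def run_with_retries(cmd, attempts=3, pause=30.0):
        """Run cmd, retrying on failure; log every failed attempt."""
        for attempt in range(1, attempts + 1):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0:
                return True
            print("attempt %d/%d failed (rc=%d): %s"
                  % (attempt, attempts, result.returncode,
                     result.stderr.strip()))
            time.sleep(pause)
        return False

    # Guard the stage-out, one of the failure modes listed above,
    # here shown with globus-url-copy, a grid copy tool of this era.
    if not run_with_retries(["globus-url-copy", "file:///tmp/out.root",
                             "gsiftp://se.example.org/store/out.root"]):
        raise SystemExit("stage-out failed after retries")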

12
Data Analysis on the Grid
  • Data samples for the CMS Physics TDR distributed across Tier-1
    sites (80 million events)
  • End-to-end analysis via the LCG Grid
    • Simple analysis scenario where data are pre-located and jobs
      are sent to the data
  • CMS Remote Analysis Builder (CRAB) tool for job preparation,
    submission, execution and basic monitoring (the flow is sketched
    after this list)
  • Several tens of users and hundreds of jobs per day
    • 100K jobs submitted
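
A sketch of that send-jobs-to-the-data flow, with invented function
names and stand-in numbers; it is not the CRAB interface, just the
scenario the slide describes: split the analysis of a pre-located
dataset into jobs and submit each one to a site that hosts the data.

    # Hypothetical sketch of the CRAB-style flow: data are pre-located,
    # so the tool splits the analysis into jobs and sends them to the
    # data. Function names and numbers are invented for illustration.
    def split_into_jobs(dataset, events_per_job):
        """Preparation: split the analysis of a dataset into grid jobs."""
        total_events = 1000000  # stand-in for a bookkeeping lookup
        return [{"dataset": dataset, "first": i * events_per_job,
                 "n": events_per_job}
                for i in range(total_events // events_per_job)]

    def submit(job):
        """Submission: jobs go to a site hosting the data, not vice versa."""
        site = "T1_CNAF"        # stand-in for a data location lookup
        print("events %d-%d -> %s"
              % (job["first"], job["first"] + job["n"] - 1, site))
        return "job-%d" % job["first"]

    handles = [submit(j) for j in split_into_jobs("/TTbar/Spring05/RECO",
                                                  250000)]
    print(len(handles), "jobs submitted")  # handles feed basic monitoring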

13
Tests of CMS Computing System
  • It is crucial to test prototypes of Grid resources and services
    at increasing scale and complexity so that they can become
    production services
    • Problems and missing components are identified and addressed
    • Iterative process in computing system development
  • Scheduled CMS computing challenges and LCG service challenges
  • CMS Data Challenge 2004
    • Tier-0 reconstruction at 25 Hz and distribution to Tier-1s for
      real-time analysis
    • Put in place the CMS data transfer and placement system
      (PhEDEx)
    • First large-scale test of the Grid WMS (real-time analysis)
    • Problems identified: small file sizes (transfer, mass
      storage), slow central replica and metadata LCG catalogue,
      lack of a reliable file transfer system in LCG
  • CMS Computing, Software and Analysis challenge (summer 2006)
    • Full test of the CMS computing system
  • LCG Service Challenges
    • SC3 (Sept-Dec 2005): test of all experiment services except
      analysis
    • SC4 (from April 2006): test of all computing services

14
Experience in CMS Grid Computing
  • Basic Grid infrastructure and services in place
    • The Grid works, but the issue is reliability
  • A lot still to be done
    • VO policy and priorities not yet fully implemented in the Grid
      WMS and DMS
    • Lack of dynamic behaviour (e.g. rescheduling) in the WMS and
      DMS
    • As a consequence, a custom data transfer and placement system
      was implemented in the CMS DMS, and a hierarchical task queue
      is planned for the CMS WMS
    • Primitive and high-latency job monitoring
    • Accounting only recently implemented
    • Reduce Grid overheads in the WMS and in monitoring
  • Putting effort into integration is crucial; working with sites

15
Summary
  • CMS has adopted a distributed computing model which makes use of
    Grid technologies
  • Production CMS services on the Grid are in place
    • Data management and workload management systems
    • Data transfer and placement system
    • Monte Carlo production
    • Data analysis
  • Steady increase in scale and complexity
  • Basic Grid infrastructure and services in place, but reliability
    and stability are the problems
  • A lot of work ahead for Grid software providers and the CMS
    computing team