1 The LHC Computing Grid Project: Challenges, Status and Plans
LCG
- May 2004
- Jürgen Knobloch
- IT Department, CERN
- This presentation is available at http://cern.ch/lcg/presentations/LCG_JK_EU_Visit.ppt
2 CERN
Large Hadron Collider
3 CERN: where the Web was born
Tim Berners-Lee
The World Wide Web provides seamless access to information that is stored in many millions of different geographical locations. In contrast, the Grid is an emerging infrastructure that provides seamless access to computing power and data storage capacity distributed over the globe.
WSIS, Geneva, October 10-12, 2003
First Web Server
4 Particle Physics
Establish a periodic system of the fundamental building blocks and understand the forces
5 Methods of Particle Physics
The most powerful microscope
Creating conditions similar to the Big Bang
6 Particle physics data
From raw data to physics results
[Figure: raw data shown as streams of detector readout numbers]
Raw data → convert to physics quantities (see the illustrative sketch below)
Interaction with detector material, pattern recognition, particle identification
Analysis
Reconstruction
Simulation (Monte-Carlo)
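To make the "raw data to physics quantities" step concrete, here is a minimal, purely illustrative Python sketch: the channel numbers, pedestals and gains are invented for the example (the raw counts reuse values from the figure above) and do not come from any real experiment calibration.

```python
# Illustrative sketch only: turning raw readout counts into calibrated energies.
# Channels, pedestals and gains are invented; this is not real LHC calibration data.
raw_counts = {101: 2037, 102: 2446, 103: 1733}     # raw counts per readout channel
calibration = {
    101: {"pedestal": 50.0, "gain": 0.005},         # gain in GeV per count (assumed)
    102: {"pedestal": 48.0, "gain": 0.005},
    103: {"pedestal": 52.0, "gain": 0.005},
}

def to_energy(channel: int, counts: int) -> float:
    """Subtract the pedestal and apply the gain to get an energy in GeV."""
    c = calibration[channel]
    return (counts - c["pedestal"]) * c["gain"]

energies = {ch: to_energy(ch, n) for ch, n in raw_counts.items()}
print(energies)   # {101: 9.935, 102: 11.99, 103: 8.405}
```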
7 Challenge 1: Large, distributed community
Offline software effort: 1000 person-years per experiment
Software life span: 20 years
5000 physicists around the world, around the clock
[Figures: CMS, ATLAS, LHCb collaborations]
8 Challenge 2: Data Volume
Annual data storage: 12-14 PetaBytes per year (a rough rate conversion is sketched below)
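A back-of-the-envelope conversion of that annual volume into a sustained rate, assuming the 13 PB midpoint and decimal units (both assumptions are mine, not from the slide):

```python
# Rough, illustrative arithmetic: annual data volume as an average sustained rate.
petabytes_per_year = 13                      # assumed midpoint of 12-14 PB/year
seconds_per_year = 365 * 86_400
avg_rate_mb_s = petabytes_per_year * 1e15 / seconds_per_year / 1e6
print(f"~{avg_rate_mb_s:.0f} MB/s averaged over the year")   # ~412 MB/s
```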
9 Challenge 3: Find the Needle in a Haystack
10 Therefore: Provide mountains of CPU
Calibration, Reconstruction, Simulation, Analysis
For LHC computing, some 100 million SPECint2000 are needed!
Produced by Intel today in 6 hours
1 SPECint2000 ≈ 0.1 SPECint95 ≈ 1 CERN-unit ≈ 4 MIPS; a 3 GHz Pentium 4 delivers about 1000 SPECint2000 (see the arithmetic below)
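Translating the requirement into processor counts with the figures quoted on the slide (a simple, illustrative division, nothing more):

```python
# Illustrative arithmetic: how many Pentium-4-class CPUs 100 million SPECint2000 implies.
total_need_si2k = 100e6       # ~100 million SPECint2000 needed for LHC computing
per_cpu_si2k = 1000           # a 3 GHz Pentium 4 is roughly 1000 SPECint2000
print(f"~{total_need_si2k / per_cpu_si2k:,.0f} CPUs")   # ~100,000 CPUs
```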
11 The CERN Computing Centre
2,400 processors, 200 TBytes of disk, 12 PB of magnetic tape
Even with technology-driven improvements in performance and costs, CERN can provide nowhere near enough capacity for LHC!
12 What is the Grid?
- Resource Sharing
- On a global scale, across labs and universities
- Secure Access
- Needs a high level of trust
- Resource Use
- Load balancing, making the most efficient use
- The Death of Distance
- Requires excellent networking
- Open Standards
- Allow constructive distributed development
- There is not (yet) a single Grid
6.25 Gbps, 20 April 2004
13 How will it work?
- The Grid middleware (a toy matchmaking sketch follows this list):
- Finds convenient places for the scientist's job (computing task) to be run
- Optimises use of the widely dispersed resources
- Organises efficient access to scientific data
- Deals with authentication to the different sites that the scientist will be using
- Interfaces to local site authorisation and resource allocation policies
- Runs the jobs
- Monitors progress
- Recovers from problems
- ... and tells you when the work is complete and transfers the result back!
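The heart of this is matchmaking between jobs and sites. The sketch below is a deliberately simplified illustration of that idea; the site names, data fields and scoring rule are invented for the example and are not the actual LCG, Globus or Condor interfaces.

```python
# Toy illustration of a grid broker's matchmaking step.
# Everything here (Site, Job, choose_site, the site list) is invented for the sketch;
# it is not the real LCG middleware API.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int
    has_dataset: bool        # does the site already hold the job's input data?

@dataclass
class Job:
    needed_cpus: int
    dataset: str

def choose_site(job: Job, sites: list[Site]) -> Site:
    """Prefer sites that already hold the data, then those with the most free CPUs."""
    candidates = [s for s in sites if s.free_cpus >= job.needed_cpus]
    return max(candidates, key=lambda s: (s.has_dataset, s.free_cpus))

sites = [Site("CERN", 50, True), Site("FZK", 200, False), Site("RAL", 120, True)]
job = Job(needed_cpus=20, dataset="simulated-events-2004")
print(choose_site(job, sites).name)   # -> RAL: has the data and enough free CPUs
```

A real broker also has to handle the authentication, monitoring and recovery steps listed above; the sketch covers only the placement decision.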
14 The LHC Computing Grid Project - LCG
- Collaboration
- LHC experiments
- Grid projects in Europe and the US
- Regional and national centres
- Choices
- Adopt Grid technology
- Go for a Tier hierarchy (sketched below)
- Use Intel CPUs in standard PCs
- Use the Linux operating system
- Goal
- Prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors
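A minimal sketch of what the Tier hierarchy means in practice. The roles paraphrase later slides (Tier-0 recording and first-pass reconstruction at CERN, Tier-1 regional centres receiving data in real time, Tier-2 centres for simulation and analysis); grouping them in a dictionary and the example Tier-1 site names are my own illustration.

```python
# Minimal sketch of the Tier hierarchy; roles paraphrase later slides,
# the example Tier-1/Tier-2 entries are illustrative only.
tiers = {
    "Tier-0": {"sites": ["CERN"],
               "role": "data recording, first-pass reconstruction"},
    "Tier-1": {"sites": ["regional centres, e.g. RAL, FZK"],
               "role": "reliable round-the-clock services, real-time data copies"},
    "Tier-2": {"sites": ["university and lab clusters"],
               "role": "simulation and end-user analysis"},
}
for tier, info in tiers.items():
    print(f"{tier}: {info['role']}")
```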
15 Operational Management of the Project
Applications Area: development environment, joint projects, data management, distributed analysis
Middleware Area (now EGEE): provision of a base set of grid middleware - acquisition, development, integration, testing, support
ARDA: A Realisation of Distributed Analysis for LHC
CERN Fabric Area: large cluster management, data recording, cluster technology, networking, computing service at CERN
Grid Deployment Area: establishing and managing the Grid Service - middleware certification, security, operations, registration, authorisation, accounting
Joint with a new European project: EGEE (Enabling Grids for e-Science in Europe)
Phase 1 (2002-05): development of common software, prototyping and operation of a pilot computing service
Phase 2 (2006-08): acquire, build and operate the LHC computing service
16 Most of our work is also useful for
- Medical/Healthcare (imaging, diagnosis and treatment)
- Bioinformatics (study of the human genome and proteome to understand genetic diseases)
- Nanotechnology (design of new materials from the molecular scale)
- Engineering (design optimization, simulation, failure analysis, and remote instrument access and control)
- Natural Resources and the Environment (weather forecasting, earth observation, modeling and prediction of complex systems, earthquakes)
17 Virtual Organizations for LHC and others
18 Deploying the LCG Service
- Middleware
- Testing and certification
- Packaging, configuration, distribution and site validation
- Support: problem determination and resolution, feedback to middleware developers
- Operations
- Grid infrastructure services
- Site fabrics run as production services
- Operations centres: trouble and performance monitoring, problem resolution, 24x7 globally
- Support
- Experiment integration: ensure optimal use of the system
- User support: call centres/helpdesk, global coverage, documentation, training
19 Progressive Evolution
- Improve reliability, availability
- Add more sites
- Establish service quality
- Run more and more demanding data challenges
- Improve performance, efficiency
- scheduling, data migration, data transfer
- Develop interactive services
- Increase capacity and complexity gradually
- Recognise and migrate to de facto standards as
soon as they emerge
20 Challenges
- Service quality
- Reliability, availability, scaling, performance
- Security: our biggest risk
- Management and operations
- A grid is a collaboration of computing centres
- Maturity is some years away; a second (or third) generation of middleware will be needed before LHC starts
- In the short term there will be many grids and middleware implementations
- For LCG, inter-operability will be a major headache
- How homogeneous does it need to be?
21 LCG Service Status
- Certification and distribution process established
- Middleware package with components from:
- European DataGrid (EDG)
- US (Globus, Condor, PPDG, GriPhyN) → the Virtual Data Toolkit
- Agreement reached on principles for registration and security
- Rutherford Lab (UK) providing the initial Grid Operations Centre
- FZK (Karlsruhe) operating the Call Centre
- LCG-2 software released February 2004
- More than 40 centres connected with more than 3000 processors
- Four collaborations run data challenges on the grid
22 Regional centres connected to the LCG grid
> 40 sites, > 3,100 CPUs
23 Data challenges
- The 4 LHC experiments currently run data challenges using the LHC computing grid
- Part 1 (now): world-wide production of simulated data
- Job submission, resource allocation and monitoring
- Catalog of distributed data
- Part 2 (summer 04): test of Tier-0 operation
- Continuous (24x7) recording of data at up to 450 MB/s per experiment (target for ALICE in 2005: 750 MB/s; see the arithmetic sketch below)
- First-pass data reconstruction and analysis
- Distribute data in real time to Tier-1 centres
- Part 3 (fall 04): distributed analysis on the Grid
- Access to data from anywhere in the world, in an organized as well as a chaotic access pattern
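To give a feel for what those recording rates mean in volume, a small illustrative calculation (decimal units are my assumption):

```python
# Illustrative arithmetic: daily volumes implied by the recording rates above
# (decimal units, 1 TB = 1e12 bytes).
seconds_per_day = 86_400
for label, rate_mb_s in [("per-experiment target 2004", 450), ("ALICE target 2005", 750)]:
    tb_per_day = rate_mb_s * 1e6 * seconds_per_day / 1e12
    print(f"{label}: {rate_mb_s} MB/s -> ~{tb_per_day:.0f} TB/day")
# per-experiment target 2004: 450 MB/s -> ~39 TB/day
# ALICE target 2005: 750 MB/s -> ~65 TB/day
```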
24 1 GByte/s Computing Data Challenge - observed rates
Running in parallel with an increasing production service
[Plot: transfer rate in GB/s versus time in minutes]
920 MB/s average over a period of 3 days, with an 8-hour period at 1.1 GB/s and peaks of 1.2 GB/s (plot annotation: daytime tape server intervention)
In addition, 600 MB/s into CASTOR for 12 hours; then the window of opportunity closed and services started
25 High Throughput Prototype (openlab LCG prototype)
- 10GE WAN connection
- ENTERASYS N7 10 GE switches
- 180 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory)
- 56 Itanium servers (dual 1.3/1.5 GHz Itanium2, 2 GB memory)
- 28 disk servers (dual P4, IDE disks, 1 TB disk space each)
- 20 tape servers (STK 9940B)
- Nodes attached at GE or 10GE, with multi-GE connections to the backbone
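For a sense of scale, a small tally of the inventory above; the aggregation (and treating every server as dual-CPU) is my own reading of the list, not a figure from the slide.

```python
# Illustrative tally of the prototype inventory listed above.
dual_cpu_servers = {"IA32 (dual 2.4 GHz P4)": 180, "Itanium2 (dual)": 56}
total_cpus = sum(count * 2 for count in dual_cpu_servers.values())
disk_tb = 28 * 1        # 28 disk servers x 1 TB each
print(f"~{total_cpus} CPUs, ~{disk_tb} TB of disk, 20 tape servers")
# ~472 CPUs, ~28 TB of disk, 20 tape servers
```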
26 Preparing for 2007
- 2003 has demonstrated event production
- In 2004 we must show that we can also handle the data, even if the computing model is very simple -- this is a key goal of the 2004 Data Challenges
- Targets for the end of this year:
- Basic model demonstrated using current grid middleware
- All Tier-1s and 25% of Tier-2s operating a reliable service
- Validate the security model, understand the storage model
- Clear idea of the performance, scaling and management issues
27 LCG and EGEE (and EU projects in general)
- LCG counts on EGEE middleware with the required functionality
- The experience gained with existing middleware will be essential - EGEE starts from LCG-2
- LCG focuses on a practical application that cannot be satisfied otherwise
- Pushing technology to its limit
- We are dedicated to success!
- LCG provides the running of services for EGEE
- How else could a 2-year project get started quickly?
- The developments are not specific to HEP
- Other sciences will profit
- What happens after the 2 (+2) year lifespan of EGEE?
- HEP and CERN are leading users of the GEANT network