Title: The Promise of Computational Grids in the LHC Era
1 - The Promise of Computational Grids in the LHC Era
- Paul Avery, University of Florida, Gainesville, Florida, USA
- avery@phys.ufl.edu, http://www.phys.ufl.edu/avery/
- CHEP 2000, Padova, Italy, Feb. 7-11, 2000
2 - LHC Computing Challenges
- Complexity of the LHC environment and the resulting data
- Scale: Petabytes of data per year
- Geographical distribution of people and resources
- Example: CMS has 1800 physicists from 150 institutes in 32 countries
3 - Dimensioning / Deploying IT Resources
- LHC computing scale is something new
- Solution requires directed effort and new initiatives
- Solution must build on existing foundations
- Robust computing at national centers is essential
- Universities must have resources to maintain intellectual strength, foster training, and engage fresh minds
- Scarce resources are, and will remain, a fact of life → plan for it
- Goal: obtain new resources and optimize deployment of all resources to maximize effectiveness
- CPU: CERN / national lab / region / institution / desktop
- Data: CERN / national lab / region / institution / desktop
- Networks: international / national / regional / local
4 - Deployment Considerations
- Proximity of datasets to appropriate IT resources (see the sketch after this list)
- Massive datasets → CERN and national labs
- Data caches → regional centers
- Mini-summary → institutional servers
- Micro-summary → desktop
- Efficient use of network bandwidth
- Local > regional > national > international
- Utilizing all intellectual resources
- CERN, national labs, universities, remote sites
- Scientists, students
- Leverage training and education at universities
- Follow the lead of the commercial world
- Distributed data and web servers
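The proximity principle above can be read as a simple mapping from dataset class to the tier expected to host it. The sketch below is a minimal illustration of that idea; the class names and tier assignments merely restate the bullets, and the helper function is a hypothetical illustration, not a policy from the talk.

```python
# Hypothetical sketch: place each dataset class at the tier that should host it,
# following the proximity principle on this slide (not an actual CMS/ATLAS policy).

DATASET_PLACEMENT = {
    "raw":           "Tier 0/1 (CERN, national labs)",   # massive, full event data
    "data cache":    "Tier 2 (regional centers)",
    "mini-summary":  "Tier 3 (institute servers)",
    "micro-summary": "Tier 4 (desktops)",
}

def placement(dataset_class: str) -> str:
    """Return the tier where a given dataset class is expected to live."""
    return DATASET_PLACEMENT.get(dataset_class, "unknown dataset class")

if __name__ == "__main__":
    for cls in ("raw", "data cache", "micro-summary"):
        print(f"{cls:14s} -> {placement(cls)}")
```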
5 - Solution: A Data Grid
- Hierarchical grid is the best deployment option
- Hierarchy → optimal resource layout (MONARC studies)
- Grid → unified system
- Arrangement of resources (modeled in the sketch below):
- Tier 0 → central laboratory computing resources (CERN)
- Tier 1 → national center (Fermilab / BNL)
- Tier 2 → regional computing center (university)
- Tier 3 → university group computing resources
- Tier 4 → individual workstation / CPU
- We call this arrangement a Data Grid to reflect the overwhelming role that data plays in deployment
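As a compact restatement of the five-tier arrangement, the sketch below models each tier as a node whose parent is the next tier up. The facility names come from the slides; treating every Tier N as served by a single Tier N-1 parent is a simplification for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class TierNode:
    """One resource center in the Data Grid hierarchy."""
    level: int                      # 0 = CERN ... 4 = desktop
    name: str
    parent: Optional["TierNode"] = None
    children: List["TierNode"] = field(default_factory=list)

    def attach(self, child: "TierNode") -> "TierNode":
        child.parent = self
        self.children.append(child)
        return child

# Example layout taken from the slides (US model); only one branch is shown.
cern    = TierNode(0, "CERN central laboratory")
fnal    = cern.attach(TierNode(1, "Fermilab (US CMS Tier 1)"))
region  = fnal.attach(TierNode(2, "Regional center (university)"))
group   = region.attach(TierNode(3, "University group resources"))
desktop = group.attach(TierNode(4, "Individual workstation"))

# A Tier N request escalates to Tier N-1 when it cannot be satisfied locally.
node = desktop
while node.parent is not None:
    print(f"Tier {node.level} ({node.name}) is served by Tier {node.parent.level}")
    node = node.parent
```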
6 - Layout of Resources
- Want a good impedance match between Tiers
- Tier N-1 serves Tier N
- Tier N big enough to exert influence on Tier N-1
- Tier N-1 small enough not to duplicate Tier N
- Resources roughly balanced across Tiers
- Reasonable balance? (A quick check is sketched below.)
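A back-of-the-envelope check of "roughly balanced" can be made with the CPU figures from the schematic on the next slide. The center counts used below (five Tier 1 centers worldwide, five Tier 2 centers per Tier 1) are assumptions for illustration only; the slides do not fix these numbers.

```python
# Balance check using the circa-2005 CPU figures from the schematic that follows:
#   Tier 0 (CERN) ~ 350k SI95,  Tier 1 ~ 70k SI95,  Tier 2 ~ 20k SI95.
# The counts of centers below are assumed for illustration.

TIER0_SI95 = 350_000
TIER1_SI95 = 70_000
TIER2_SI95 = 20_000

N_TIER1 = 5              # assumed number of Tier 1 centers worldwide
N_TIER2_PER_TIER1 = 5    # assumed number of Tier 2 centers per Tier 1

total_tier1 = N_TIER1 * TIER1_SI95
total_tier2_in_region = N_TIER2_PER_TIER1 * TIER2_SI95

print(f"All Tier 1 combined   : {total_tier1:,} SI95 (Tier 0 = {TIER0_SI95:,})")
print(f"Tier 2s in one region : {total_tier2_in_region:,} SI95 (their Tier 1 = {TIER1_SI95:,})")
# With these assumptions each layer is within a factor of ~1.5 of the layer
# above it -- the kind of rough balance the slide is asking about.
```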
7 - Data Grid Hierarchy (Schematic)
- [Schematic figure: the Tier 0-3 hierarchy for the US model, circa 2005. Recoverable figures from the diagram:]
- CERN (CMS/ATLAS), Tier 0: 350k SI95, 350 TBytes disk, robot
- Tier 1 (FNAL/BNL): 70k SI95, 70 TBytes disk, robot; 2.4 Gbps links to CERN
- Tier 2 centers: 20k SI95, 25 TBytes disk, robot; N × 622 Mbits/s links to the Tier 1
- Tier 3 (university workgroups 1 ... M): 622 Mbits/s links to Tier 2
- US Model, circa 2005 (transfer times for these link speeds are worked out below)
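To make the bandwidth figures concrete, the sketch below estimates how long it would take to replicate a Tier 2 sized data cache over the link speeds shown in the schematic. It is an illustration only; it ignores protocol overhead, link sharing, and real-world utilization.

```python
# Rough transfer-time estimate for the link speeds in the schematic above.
# Ignores protocol overhead and competing traffic -- illustration only.

TB = 1e12                # bytes per terabyte (decimal, as in the slide figures)
cache_bytes = 25 * TB    # a Tier 2 disk cache of 25 TBytes

links_bps = {
    "622 Mbits/s (Tier 1 -> Tier 2)": 622e6,
    "2.4 Gbits/s (CERN -> Tier 1)":   2.4e9,
}

for name, bps in links_bps.items():
    seconds = cache_bytes * 8 / bps
    print(f"{name}: {seconds / 86400:.1f} days to move 25 TBytes")
# Roughly 3.7 days at 622 Mbits/s and about 1 day at 2.4 Gbits/s, even at 100%
# efficiency -- which is why the hierarchy keeps large datasets close to where
# they are used.
```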
9 - Data Grid Hierarchy (CMS)
- [Schematic figure: the CMS data grid hierarchy. Recoverable figures from the diagram:]
- Scale: 1 TIPS = 25,000 SpecInt95; a PC today is roughly 10-20 SpecInt95
- Online system: one bunch crossing every 25 nsec, ~100 triggers per second, events of ~1 MByte; ~PBytes/sec from the detector, ~100 MBytes/sec to the offline farm (~20 TIPS) and on to Tier 0
- Tier 0: CERN Computer Center
- Tier 1: Fermilab (~4 TIPS) plus the France, Italy, and Germany regional centers
- Tier 2: regional centers
- Tier 3: institute servers holding a physics data cache; physicists work on analysis channels, with ~10 physicists per institute working on one or more channels, and the data for these channels cached by the institute server
- Tier 4: workstations
- Link speeds shown: 622 Mbits/sec and 2.4 Gbits/sec among Tiers 0-2, 622 Mbits/sec down to Tier 3, and 1-10 Gbits/sec from the physics data cache to the workstations
- (The rate and volume arithmetic implied by these numbers is checked below.)
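The trigger and event-size numbers in the diagram imply the data volumes quoted earlier in the talk. The arithmetic below is a quick consistency check; the 10^7 seconds of effective running time per year is a conventional rule of thumb, not a figure from these slides.

```python
# Consistency check of the rates in the CMS hierarchy diagram.
# The ~1e7 seconds of effective running time per year is an assumption.

trigger_rate_hz   = 100     # events written per second
event_size_bytes  = 1e6     # ~1 MByte per event
live_seconds_year = 1e7     # assumed effective running time per year

rate_bytes_per_s = trigger_rate_hz * event_size_bytes
annual_bytes     = rate_bytes_per_s * live_seconds_year

print(f"Online -> offline rate: {rate_bytes_per_s / 1e6:.0f} MBytes/sec")   # ~100 MB/s
print(f"Raw data per year:      {annual_bytes / 1e15:.1f} PBytes")          # ~1 PB/year
# Matches the '100 MBytes/sec' arrow in the diagram and the
# 'Petabytes of data per year' scale quoted on the earlier slide.
```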
10 - Why a Data Grid: Physical
- Unified system: all computing resources are part of the grid
- Efficient resource use (manage scarcity)
- Averages out spikes in usage (see the sketch below)
- Resource discovery / scheduling / coordination become truly possible
- The whole is greater than the sum of its parts
- Optimal data distribution and proximity
- Labs are close to the data they need
- Users are close to the data they need
- No data or network bottlenecks
- Scalable growth
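The claim that a shared grid "averages out spikes in usage" is just the statistics of pooling many independent, bursty demands: the relative size of the fluctuations shrinks roughly as 1/sqrt(N). The simulation below illustrates this with made-up demand numbers; nothing in it is taken from the talk.

```python
import random

# Toy illustration of load averaging: N independent, bursty user groups.
# All demand numbers are invented for illustration.
random.seed(42)

def relative_fluctuation(n_groups: int, n_samples: int = 10_000) -> float:
    """Standard deviation of the pooled demand divided by its mean."""
    totals = []
    for _ in range(n_samples):
        # Each group usually idles (load 1) but occasionally spikes (load 20).
        totals.append(sum(20 if random.random() < 0.1 else 1
                          for _ in range(n_groups)))
    mean = sum(totals) / n_samples
    var = sum((t - mean) ** 2 for t in totals) / n_samples
    return var ** 0.5 / mean

for n in (1, 10, 100):
    print(f"{n:4d} pooled groups -> relative fluctuation {relative_fluctuation(n):.2f}")
# The fluctuation falls roughly as 1/sqrt(N): a shared grid can be sized for the
# average load rather than for every group's worst-case spike.
```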
11 - Why a Data Grid: Political
- A central lab cannot manage / help thousands of users
- Easier to leverage resources, maintain control, and assert priorities regionally
- Cleanly separates functionality
- Different resource types in different Tiers
- Funding complementarity (NSF vs. DOE)
- Targeted initiatives
- New IT resources can be added naturally
- Additional matching resources at Tier 2 universities
- Larger institutes can join, bringing their own resources
- Tap into new resources opened up by the IT revolution
- Broadens the community of scientists and students
- Training and education
- Vitality of the field depends on the University / Lab partnership
12 - Tier 2 Regional Centers
- Possible model: CERN : National : Tier 2 → 1/3 : 1/3 : 1/3
- Complementary role to Tier 1 lab-based centers
- Less need for 24 × 7 operation → lower component costs
- Less production-oriented → respond to analysis priorities
- Flexible organization, e.g. by physics goals or subdetectors
- Variable fraction of resources available to outside users
- Range of activities includes
- Reconstruction, simulation, physics analyses
- Data caches / mirrors to support analyses
- Production in support of the parent Tier 1
- Grid R&D
- ...
13 - Distribution of Tier 2 Centers
- Tier 2 centers arranged regionally in the US model
- Good network connections to move data (caches)
- Location independence of users is always maintained
- Increases collaborative possibilities
- Emphasis on training and involvement of students
- High-quality desktop environment for remote collaboration, e.g. the next-generation VRVS system
14 - Strawman Tier 2 Architecture
- Linux farm of 128 nodes: $0.30M
- Sun data server with RAID array: $0.10M
- Tape library: $0.04M
- LAN switch: $0.06M
- Collaborative infrastructure: $0.05M
- Installation and infrastructure: $0.05M
- Network connection to the Abilene network: $0.14M
- Tape media and consumables: $0.04M
- Staff (operations and system support): $0.20M
- Total estimated cost (first year): $0.98M
- Cost in succeeding years, for evolution, upgrades, and operations: $0.68M
- 1.5 - 2 FTE of support required per Tier 2 center; physicists from the institute also aid in support (the cost totals are checked in the sketch below)
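As a quick sanity check, the line items above do sum to the quoted first-year total. A minimal check, with the $0.68M succeeding-year figure repeated from the slide:

```python
# Check that the strawman Tier 2 line items add up to the quoted totals.
first_year_items_musd = {
    "Linux farm (128 nodes)":          0.30,
    "Sun data server + RAID array":    0.10,
    "Tape library":                    0.04,
    "LAN switch":                      0.06,
    "Collaborative infrastructure":    0.05,
    "Installation and infrastructure": 0.05,
    "Connection to Abilene":           0.14,
    "Tape media and consumables":      0.04,
    "Staff (ops and system support)":  0.20,
}

total = sum(first_year_items_musd.values())
print(f"First-year total: ${total:.2f}M (slide quotes $0.98M)")

# Succeeding years drop most one-time hardware and installation costs;
# the slide quotes $0.68M/year for evolution, upgrades, and operations.
```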
15 - Strawman Tier 2 Evolution
- Linux farm: 1,500 SI95 (2000) → 20,000 SI95 (2005)
- Disks on CPUs: 4 TB → 20 TB
- RAID array*: 1 TB → 20 TB
- Tape library: 1 TB → 50 - 100 TB
- LAN speed: 0.1 - 1 Gbps → 10 - 100 Gbps
- WAN speed: 155 - 622 Mbps → 2.5 - 10 Gbps
- Collaborative infrastructure: MPEG2 VGA (1.5 - 3 Mbps) → realtime HDTV (10 - 20 Mbps)
- * RAID disk used for higher-availability data
- Figures reflect lower Tier 2 component costs due to less demanding usage, e.g. simulation. (The implied growth factors are worked out below.)
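For a sense of how aggressive this evolution is, the sketch below computes the implied 2000 to 2005 growth factors. Comparing against a capacity doubling time of about 18 months is my own framing, not something stated on the slide.

```python
# Implied 2000 -> 2005 growth factors for the strawman Tier 2 evolution.
# The ~18-month doubling-time comparison is an assumed rule of thumb.

rows = {
    "CPU (SI95)":         (1_500, 20_000),
    "Disks on CPUs (TB)": (4, 20),
    "RAID array (TB)":    (1, 20),
    "Tape library (TB)":  (1, 75),      # midpoint of the 50 - 100 TB range
}

years = 5
doubling_factor = 2 ** (years * 12 / 18)   # ~10x if capacity doubles every 18 months

for name, (y2000, y2005) in rows.items():
    print(f"{name:20s} grows {y2005 / y2000:5.1f}x over {years} years")
print(f"Doubling every 18 months gives ~{doubling_factor:.0f}x over {years} years; "
      f"rows growing faster than that imply added purchases beyond technology gains.")
```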
16 - The GriPhyN Project
- Joint project involving
- US-CMS, US-ATLAS
- LIGO (gravitational wave experiment)
- SDSS (Sloan Digital Sky Survey)
- http://www.phys.ufl.edu/avery/mre/
- Requesting funds from NSF to build the world's first production-scale grid(s)
- Sub-implementations for each experiment
- NSF pays for Tier 2 centers, some R&D, and some networking
- Realization of a unified Grid system requires research
- Many common problems across the different implementations
- Requires partnership with CS professionals
17 - R&D Foundations I
- Globus (Grid middleware)
- Grid-wide services
- Security
- Condor (see M. Livny's paper; a toy matchmaking sketch follows this list)
- General language for service seekers / service providers
- Resource discovery
- Resource scheduling, coordination, (co)allocation
- GIOD (networked object databases)
- Nile (fault-tolerant distributed computing)
- Java-based toolkit, running on CLEO
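Condor's "general language for service seekers and providers" refers to its classified-advertisement (ClassAd) matchmaking: jobs and machines each publish attributes, requirements, and preferences, and a matchmaker pairs them. The sketch below is a toy imitation of that idea in plain Python; it does not use the real ClassAd language or the Condor API, and all attribute names and values are invented.

```python
# Toy matchmaking in the spirit of Condor ClassAds (not the real ClassAd
# language or Condor API -- a simplified illustration only).

machines = [
    {"name": "t2-node-01", "memory_mb": 512, "si95": 40, "has_objectivity": True},
    {"name": "t3-desk-07", "memory_mb": 128, "si95": 15, "has_objectivity": False},
]

jobs = [
    {"owner": "student1",
     "requirements": lambda m: m["memory_mb"] >= 256 and m["has_objectivity"],
     "rank":         lambda m: m["si95"]},          # prefer faster machines
    {"owner": "student2",
     "requirements": lambda m: m["memory_mb"] >= 64,
     "rank":         lambda m: -m["si95"]},         # happy with a slow desktop
]

def matchmake(job, machines):
    """Return the best machine satisfying the job's requirements, or None."""
    candidates = [m for m in machines if job["requirements"](m)]
    return max(candidates, key=job["rank"]) if candidates else None

for job in jobs:
    match = matchmake(job, machines)
    print(f"{job['owner']}: matched to {match['name'] if match else 'no machine'}")
```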
18 - R&D Foundations II
- MONARC
- Construct and validate architectures
- Identify important design parameters
- Simulate an extremely complex, dynamic system (a toy version is sketched below)
- PPDG (Particle Physics Data Grid)
- DOE / NGI funded for 1 year
- Testbed systems
- Later program of work to be incorporated into GriPhyN
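MONARC's contribution is a simulation of the regional-center system so that design parameters (CPU counts, link speeds, job mixes) can be varied before anything is built. The real MONARC tools are Java-based; the fragment below is only a toy queueing model in Python to show the kind of question such a simulation answers, with all parameters invented.

```python
import heapq, random

# Toy event-driven model of analysis jobs arriving at a regional center -- a
# drastically simplified stand-in for a MONARC-style simulation.
random.seed(0)

def simulate(n_cpus: int, n_jobs: int = 2000, arrival_s: float = 30.0,
             service_s: float = 600.0) -> float:
    """Mean job wait time (seconds) for a simple first-come-first-served farm."""
    free_at = [0.0] * n_cpus           # time at which each CPU becomes free
    heapq.heapify(free_at)
    t, total_wait = 0.0, 0.0
    for _ in range(n_jobs):
        t += random.expovariate(1.0 / arrival_s)     # next job arrives
        start = max(t, heapq.heappop(free_at))        # earliest free CPU
        total_wait += start - t
        heapq.heappush(free_at, start + random.expovariate(1.0 / service_s))
    return total_wait / n_jobs

for cpus in (20, 25, 40):
    print(f"{cpus:3d} CPUs -> mean wait {simulate(cpus):7.1f} s")
# Varying CPU counts (or link speeds, job mixes, ...) in such a model is how
# one identifies the design parameters that actually matter.
```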
19 - The NSF ITR Initiative
- Information Technology Research program
- Aimed at funding innovative research in IT
- $90M in funds authorized
- Maximum of $12.5M for a single proposal (5 years)
- Requires extensive student support
- GriPhyN submitted a preproposal Dec. 30, 1999
- Intent is for ITR to fund most of our Grid research program
- Major costs are for people, especially students / postdocs
- Minimal equipment
- Some networking
- Full proposal due April 17, 2000
20 - Summary of Data Grids and the LHC
- Develop an integrated distributed system, while meeting LHC goals
- ATLAS/CMS: production and data-handling oriented
- (LIGO/SDSS: computation and commodity-component oriented)
- Build and test the regional center hierarchy
- Tier 2 / Tier 1 partnership
- Commission and test software, data handling systems, and data analysis strategies
- Build and test the enabling collaborative infrastructure
- Focal points for student-faculty interaction in each region
- Realtime high-resolution video as part of the collaborative environment
- Involve students at universities in building the data analysis, and in the physics discoveries at the LHC