Title: Data Grids
1. Data Grids: Enabling Data Intensive Global Science
Paul Avery, University of Florida (avery@phys.ufl.edu)
CITI Seminar, Rice University, March 8, 2004
2. The Grid Concept
- Grid: geographically distributed computing resources configured for coordinated use
- Fabric: physical resources & networks provide raw capability
- Ownership: resources controlled by owners and shared w/ others
- Middleware: software ties it all together (tools, services, etc.)
- Goal: transparent sharing of resources
(Figure: US-CMS Virtual Organization)
3. Scope of Talk
- NOT a comprehensive Grid talk!
  - Many Grid projects going on worldwide
  - Grid standards are evolving (http://www.ggf.org/)
- My talk: a physicist's perspective
  - Science drivers for Grids, particularly High Energy Physics
  - Recent research directions
  - Grid projects and deployments, US and overseas
  - Related network and outreach developments
  - Political landscape
Globus Toolkit evolution:
- GT 1.x: 2001
- GT 2.x: 2002
- GT 3.x: 2003 (OGSI)
- GT 4.x: 2004 (WSRF)
4. Grids and Resource Sharing
- Resources for complex problems are distributed
  - Advanced scientific instruments (accelerators, telescopes, ...)
  - Storage, computing, people, institutions
- Organizations require access to common services
  - Research collaborations (physics, astronomy, engineering, ...)
  - Government agencies, health care organizations, corporations, ...
- Grids enable Virtual Organizations (VOs)
  - Create a VO from geographically separated components
  - Make all community resources available to any VO member
  - Leverage strengths at different institutions
- Grids require a foundation of strong networking
  - Communication tools, visualization
  - High-speed data transmission, instrument operation
5. Grid Challenges
- Operate a fundamentally complex entity
  - Geographically distributed resources
  - Each resource under different administrative control
  - Many failure modes
- Manage workflow of 1000s of jobs across the Grid
  - Balance policy vs. instantaneous capability to complete tasks
  - Balance effective resource use vs. fast turnaround for priority jobs
  - Match resource usage to policy over the long term
- Maintain a global view of resources and system state
  - Coherent end-to-end system monitoring
- Build a managed system with an integrated user environment
6. Data Grids & Data Intensive Sciences
- Scientific discovery increasingly driven by data collection
  - Computationally intensive analyses
  - Massive data collections
  - Data distributed across networks of varying capability
  - Internationally distributed collaborations
- Dominant factor: data growth (1 Petabyte = 1000 TB)
  - 2000: ~0.5 Petabyte
  - 2005: ~10 Petabytes
  - 2010: ~100 Petabytes
  - 2015: ~1000 Petabytes?
How to collect, manage, access and interpret this quantity of data?
Drives demand for Data Grids to handle the additional dimension of data access & movement.
7. Data Intensive Physical Sciences
- High energy & nuclear physics
  - Belle/BaBar, Tevatron, RHIC, JLAB, LHC
- Astronomy
  - Digital sky surveys: SDSS, VISTA, other Gigapixel arrays
  - VLBI arrays: multiple-Gbps data streams
  - Virtual Observatories (multi-wavelength astronomy)
- Gravity wave searches
  - LIGO, GEO, VIRGO, TAMA
- Time-dependent 3-D systems (simulation data)
- Earth observation
  - Climate modeling, oceanography, coastal dynamics
  - Geophysics, earthquake modeling
  - Fluids, aerodynamic design
  - Pollutant dispersal
8. Data Intensive Biology and Medicine
- Medical data and imaging
  - X-ray, mammography data, etc. (many petabytes)
  - Radiation oncology (real-time display of 3-D images)
- X-ray crystallography
  - Bright X-ray sources, e.g. Argonne Advanced Photon Source
- Molecular genomics and related disciplines
  - Human Genome, other genome databases
  - Proteomics (protein structure, activities, ...)
  - Protein interactions, drug delivery
- High-resolution brain scans (1-10 µm, time dependent)
9. Science Drivers for U.S. Grids
- LHC experiments: 100s of Petabytes
- High Energy Physics experiments: ~1 Petabyte (1000 TB)
- LIGO (gravity wave search): 100s of Terabytes
- Sloan Digital Sky Survey: 10s of Terabytes
- Future Grid resources
  - Massive CPU (PetaOps)
  - Large distributed datasets (>100 PB)
  - Global communities (1000s)
10. LHC and Data Grids
11. Large Hadron Collider (LHC) @ CERN
- 27 km tunnel spanning the Switzerland-France border
- Experiments: ATLAS, CMS, ALICE, LHCb, TOTEM
- Search for origin of mass & supersymmetry (startup 2007?)
12. Example: CMS Experiment at LHC
- Compact Muon Solenoid at the LHC (CERN)
(Figure: detector shown to scale against the Smithsonian "standard man")
13. LHC Data and CPU Requirements (ATLAS, CMS, LHCb)
- Storage
  - Raw recording rate: 0.1-1 GB/s
  - Plus simulated data
  - ~100 PB total by 2010 (rough conversion below)
- Processing
  - PetaOps (> 300,000 3 GHz PCs)
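A rough consistency check on those numbers, assuming the conventional ~$10^7$ live seconds per accelerator year (an assumption not stated on the slide):

$$ 1\ \mathrm{GB/s} \times 10^{7}\ \mathrm{s/yr} = 10^{7}\ \mathrm{GB/yr} = 10\ \mathrm{PB/yr} $$

so a few years of raw data at 0.1-1 GB/s, together with reconstructed and simulated copies, plausibly reaches the ~100 PB scale quoted for 2010.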
14. Complexity: Higgs Decay into 4 Muons
- 10^9 collisions/sec; selectivity: 1 in 10^13 (rough rate estimate below)
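The arithmetic behind that selectivity, using only the numbers on the slide:

$$ 10^{9}\ \mathrm{collisions/s} \times 10^{-13} = 10^{-4}\ \mathrm{selected\ events/s} \approx \text{one candidate every few hours.} $$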
15. LHC Global Collaborations (ATLAS, CMS)
- 1000-4000 collaborators per experiment
- USA is 20-25% of the total
16. Driver for Transatlantic Networks (Gb/s)
- 2001 estimates, now seen as conservative!
17. HEP Bandwidth Roadmap (Gb/s)
- HEP: leading role in network development/deployment
18. Global LHC Data Grid Hierarchy (CMS Experiment)
- Online System → (0.1-1.5 GB/s) → Tier 0: CERN Computer Center
- Tier 0 → (10-40 Gb/s) → Tier 1 centers
- Tier 1 → (2.5-10 Gb/s) → Tier 2 centers
- Tier 2 → (1-2.5 Gb/s) → Tier 3: physics caches
- Tier 3 → (1-10 Gb/s) → Tier 4: PCs
- 10s of Petabytes/yr by 2007-8; ~1000 Petabytes in < 10 yrs? (see the sketch below)
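A minimal illustrative sketch (not from the talk) of this tier model: the link bandwidths are the figures quoted above, and the helper simply converts link speed into transfer time for a dataset, ignoring protocol overhead and contention.

```python
# Tier-to-tier link bandwidths from the slide (Gb/s, low-high range),
# plus a helper that estimates how long a dataset takes to move.

TIER_LINKS_GBPS = {
    "Tier 0 -> Tier 1": (10.0, 40.0),
    "Tier 1 -> Tier 2": (2.5, 10.0),
    "Tier 2 -> Tier 3": (1.0, 2.5),
    "Tier 3 -> Tier 4": (1.0, 10.0),
}

def transfer_hours(dataset_tb, link_gbps):
    """Hours to move dataset_tb terabytes over a link_gbps link."""
    bits = dataset_tb * 1e12 * 8            # TB -> bits
    return bits / (link_gbps * 1e9) / 3600  # seconds -> hours

for link, (lo, hi) in TIER_LINKS_GBPS.items():
    print(f"{link}: 10 TB takes {transfer_hours(10, hi):.1f}-"
          f"{transfer_hours(10, lo):.1f} h")
```

Even at the top-tier bandwidths, moving tens of petabytes per year is a sustained, managed activity rather than an occasional copy, which is the point of the hierarchy.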
19. Analysis by Globally Distributed Teams
- Exploit experiment & Grid resources
- Non-hierarchical: "chaotic" analyses & productions
- Superimpose significant random data flows
20. Data Grid Projects
21. Global Context of Data Grid Projects
Collaborating Grid infrastructure projects:
- U.S. projects
  - GriPhyN (NSF)
  - iVDGL (NSF)
  - Particle Physics Data Grid (DOE)
  - PACIs and TeraGrid (NSF)
  - DOE Science Grid (DOE)
  - NEESgrid (NSF)
  - NSF Middleware Initiative (NSF)
- EU, Asia projects
  - European DataGrid (EU)
  - EDG-related national projects
  - DataTAG (EU)
  - LHC Computing Grid (CERN)
  - EGEE (EU)
  - CrossGrid (EU)
  - GridLab (EU)
  - Japanese, Korean projects
- Two primary project clusters: US + EU
- Not exclusively HEP, but driven/led by HEP (with CS)
22. International Grid Coordination
- Global Grid Forum (GGF)
  - Grid standards body (http://www.ggf.org/)
- HICB: HEP Inter-Grid Coordination Board
  - Non-competitive forum for strategic issues, consensus
  - Cross-project policies, procedures and technology; joint projects
- HICB-JTB: Joint Technical Board
  - Technical coordination, e.g. the GLUE interoperability effort
- LHC Computing Grid (LCG)
  - Technical work, deployments
  - POB, SC2, PEB, GDB, GAG, etc.
23. LCG: LHC Computing Grid Project
- A persistent Grid infrastructure for the LHC experiments
  - Matched to the decades-long research program of the LHC
- Prepare & deploy the computing environment for the LHC experiments
  - Common applications, tools, frameworks and environments
  - Deployment oriented: no middleware development(?)
- Move from testbed systems to real production services
  - Operated and supported 24x7 globally
  - Computing fabrics run as production physics services
  - A robust, stable, predictable, supportable infrastructure
24. LCG Activities
- Relies on strong leverage
  - CERN IT
  - LHC experiments: ALICE, ATLAS, CMS, LHCb
  - Grid projects: Trillium, DataGrid, EGEE
- Close relationship with the EGEE project (see Gagliardi talk)
  - EGEE will create an EU production Grid (funds 70 institutions)
  - Middleware activities done in common (same manager)
  - LCG deployment in common (LCG Grid is the EGEE prototype)
- LCG-1 deployed (Sep. 2003)
  - Tier-1 laboratories (including FNAL and BNL)
- LCG-2 in the process of being deployed
25. Sep. 29, 2003 announcement
26. LCG-1 Sites (LCG-2 service being deployed now)
27. LCG Timeline
- Phase 1: 2002-2005
  - Build a service prototype, based on existing Grid middleware
  - Gain experience in running a production Grid service
  - Produce the Computing TDR for the final system
- Phase 2: 2006-2008
  - Build and commission the initial LHC computing environment
- Phase 3: ?
28. Trillium: U.S. Physics Grid Projects
- Trillium = PPDG + GriPhyN + iVDGL
  - Large overlap in leadership, people, experiments
  - HEP members are the main drivers, esp. the LHC experiments
- Benefits of coordination
  - Common software base & packaging: VDT + Pacman
  - Collaborative / joint projects: monitoring, demos, security, ...
  - Wide deployment of new technologies, e.g. Virtual Data
- Forum for US Grid projects
  - Joint strategies, meetings and work
  - Unified U.S. entity to interact with international Grid projects
- Build significant Grid infrastructure: Grid2003
29. Science Drivers for U.S. HEP Grids
- LHC experiments: 100s of Petabytes
- Current HENP experiments: ~1 Petabyte (1000 TB)
- LIGO: 100s of Terabytes
- Sloan Digital Sky Survey: 10s of Terabytes
30. Goal: PetaScale Virtual-Data Grids
(Architecture diagram) Users — production teams, workgroups, single researchers — drive the system through interactive user tools backed by request planning & scheduling tools, request execution & management tools, and virtual data tools. These sit on resource management services, security & policy services, and other Grid services, which in turn manage distributed resources (code, storage, CPUs, networks), transformations, and the raw data source.
Targets: PetaOps, Petabytes, performance.
31. Trillium Grid Tools: Virtual Data Toolkit
(Build & test pipeline) Sources from CVS and contributors (VDS, etc.) flow through NMI and VDT builds; binaries are built and tested on a Condor pool (37 computers). Patching and packaging produce RPMs, GPT source bundles, and the VDT releases published via a Pacman cache. Plan: use NMI processes more broadly later.
A unique resource for managing, testing, supporting, deploying, packaging, upgrading, and troubleshooting complex sets of software.
32. Virtual Data Toolkit: Tools in VDT 1.1.12
- Globus Alliance
  - Grid Security Infrastructure (GSI)
  - Job submission (GRAM)
  - Information service (MDS)
  - Data transfer (GridFTP)
  - Replica Location Service (RLS)
- Condor Group
  - Condor / Condor-G
  - DAGMan
  - Fault Tolerant Shell
  - ClassAds
- EDG & LCG
  - Make Gridmap
  - Certificate Revocation List Updater
  - GLUE Schema / info provider
- ISI & UC
  - Chimera & related tools
  - Pegasus
- NCSA
  - MyProxy
  - GSI OpenSSH
- LBL
  - PyGlobus
  - NetLogger
- Caltech
  - MonALISA
- VDT
  - VDT System Profiler
  - Configuration software
- Others
  - KX509 (U. Mich.)
33. VDT Growth (1.1.13 currently)
- VDT 1.0: Globus 2.0b, Condor 6.3.1
- VDT 1.1.3, 1.1.4 & 1.1.5: pre-SC2002
- VDT 1.1.7: switch to Globus 2.2
- VDT 1.1.8: first real use by LCG
- VDT 1.1.11: Grid2003
34. Pacman Packaging System
- Language: define software environments
- Interpreter: create, install, configure, update, verify environments
- Handles software from many source packaging systems:
  - LCG / Scram
  - ATLAS / CMT
  - CMS DPE / tar/make
  - LIGO / tar/make
  - OpenSource / tar/make
  - Globus / GPT
  - NPACI / TeraGrid / tar/make
  - D0 / UPS-UPD
  - Commercial / tar/make
- Combine and manage software from arbitrary sources
- "1-button install" reduces the burden on administrators:
  pacman -get iVDGL:Grid3
- Remote experts define installation / configuration / updating for everyone at once
35. Virtual Data: Derivation and Provenance
- Most scientific data are not simple measurements
  - They are computationally corrected/reconstructed
  - They can be produced by numerical simulation
- Science & engineering projects are increasingly CPU- and data-intensive
  - Programs are significant community resources (transformations)
  - So are the executions of those programs (derivations)
  - Management of dataset dependencies is critical!
- Derivation: instantiation of a potential data product
- Provenance: complete history of any existing data product
- Previously: manual methods
- GriPhyN: automated, robust tools (Chimera Virtual Data System) — see the sketch below
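A minimal, hypothetical sketch of the distinction above — not Chimera's actual API or VDL: a transformation is a registered program, a derivation is one recorded execution of it, and provenance is the chain of derivations behind a data product.

```python
# Hypothetical data model for transformations, derivations and provenance.
from dataclasses import dataclass, field

@dataclass
class Transformation:            # a community program, e.g. a calibration code
    name: str
    version: str

@dataclass
class Derivation:                # one recorded execution of a transformation
    transformation: Transformation
    inputs: list                 # logical file names consumed
    outputs: list                # logical file names generated
    parameters: dict = field(default_factory=dict)

def provenance(product, derivations):
    """Walk backwards from a data product to all derivations it depends on."""
    chain, frontier = [], {product}
    while frontier:
        step = [d for d in derivations
                if set(d.outputs) & frontier and d not in chain]
        chain.extend(step)
        frontier = {name for d in step for name in d.inputs}
    return chain
```

With records like these, "which derived products must be recomputed after a calibration fix?" becomes a walk over the same dependency information in the forward direction.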
36. LHC Higgs Analysis with Virtual Data
- Scientist adds a new derived data branch & continues analysis
(Figure: virtual-data branches for Higgs decay channels — decay = bb; decay = WW, WW → leptons; decay = ZZ; mass = 160; decay = WW, WW → eνμν with Pt > 20; decay = WW, WW → eνμν; decay = WW)
37. Chimera: Sloan Galaxy Cluster Analysis
38. Grid2003: An Operational Grid
- 28 sites (2100-2800 CPUs), including Korea
- 400-1300 concurrent jobs
- 10 applications
- Running since October 2003
- http://www.ivdgl.org/grid2003
39. Grid2003 Participants
- US-CMS, US-ATLAS, iVDGL, GriPhyN, PPDG, DOE Labs, Biology, Korea
40. Grid2003 Applications
- High energy physics
  - US-ATLAS analysis (DIAL)
  - US-ATLAS GEANT3 simulation (GCE)
  - US-CMS GEANT4 simulation (MOP)
  - BTeV simulation
- Gravity waves
  - LIGO: blind search for continuous sources
- Digital astronomy
  - SDSS: cluster finding (maxBcg)
- Bioinformatics
  - Bio-molecular analysis (SnB)
  - Genome analysis (GADU/Gnare)
- CS demonstrators
  - Job Exerciser, GridFTP, NetLogger-grid2003
41. Grid2003 Site Monitoring
42. Grid2003: Three Months' Usage
43. Grid2003 Success
- Much larger than originally planned
  - More sites (28), CPUs (2800), simultaneous jobs (1300)
  - More applications (10) in more diverse areas
- Able to accommodate new institutions & applications
  - U. Buffalo (Biology): Nov. 2003
  - Rice U. (CMS): Feb. 2004
- Continuous operation since October 2003
  - Strong operations team (iGOC at Indiana)
- US-CMS using it for production simulations (next slide)
44. Production Simulations on Grid2003
- US-CMS Monte Carlo simulation
- Used ~1.5 x US-CMS resources
(Chart legend: USCMS vs. non-USCMS resources)
45. Grid2003: A Necessary Step
- Learning how to operate a Grid
  - Add sites, recover from errors, provide info, update, test, etc.
  - Need tools, services, procedures, documentation, organization
  - Need reliable, intelligent, skilled people
- Learning how to cope with large scale
  - Interesting failure modes as scale increases
  - Increasing scale must not overwhelm human resources
- Learning how to delegate responsibilities
  - Multiple levels: project, Virtual Organization, service, site, application
  - Essential for future growth
- Grid2003 experience critical for building useful Grids
  - Frank discussion in the Grid2003 "Project Lessons" document
46. Grid2003 Lessons (1): Investment
- Building momentum
  - PPDG: 1999
  - GriPhyN: 2000
  - iVDGL: 2001
  - Time for projects to ramp up
- Building collaborations
  - HEP: ATLAS, CMS, Run 2, RHIC, JLab
  - Non-HEP: computer science, LIGO, SDSS
  - Time for collaboration sociology to kick in
- Building testbeds
  - Build expertise, debug Grid software, develop Grid tools & services
  - US-CMS: 2002-2004
  - US-ATLAS: 2002-2004
  - WorldGrid: 2002 (Dec.)
47. Grid2003 Lessons (2): Deployment
- Building something useful draws people in
  - (Similar to a large HEP detector)
  - Cooperation, willingness to invest time, striving for excellence!
- Grid development requires significant deployments
  - Required to learn what works, what fails, what's clumsy, ...
  - Painful, but pays for itself
- Deployment provides a powerful training mechanism
48. Grid2003 Lessons (3): Packaging
- Installation and configuration (VDT + Pacman)
  - Simplifies installation and configuration of Grid tools & applications
  - Major advances over 14 VDT releases
- A strategic issue, critical to us
  - Provides uniformity & automation
  - Lowers barriers to participation → scaling
  - Expect great improvements (Pacman 3)
- Automation is the next frontier
  - Reduce FTE overhead, communication traffic
  - Automate installation, configuration, testing, validation, updates
  - Remote installation, etc.
49. Grid2003 and Beyond
- Further evolution of Grid3 (Grid3+, etc.)
  - Contribute resources to a persistent Grid
  - Maintain a development Grid, test new software releases
  - Integrate software into the persistent Grid
  - Participate in LHC data challenges
- Involvement of new sites
  - New institutions and experiments
  - New international partners (Brazil, Taiwan, Pakistan?, ...)
- Improvements in Grid middleware and services
  - Storage services
  - Integrating multiple VOs
  - Monitoring
  - Troubleshooting
  - Accounting
50. New Directions
51. U.S. Open Science Grid
- Goal: build an integrated Grid infrastructure
  - Support the US-LHC research program and other scientific efforts
  - Resources from laboratories and universities
  - Federate with the LHC Computing Grid
- Getting there: OSG-1 (Grid3), OSG-2, ...
  - Series of releases → increasing functionality & scale
  - Constant use of facilities for LHC production computing
- Jan. 12 meeting in Chicago
  - Public discussion, planning sessions
- Next steps
  - Create interim Steering Committee (now)
  - White paper to be expanded into a roadmap
  - Presentation to funding agencies (April/May?)
52. Inputs to Open Science Grid
(Diagram: contributions flow into the Open Science Grid from technologists, Trillium, US-LHC, university facilities, the education community, multi-disciplinary facilities, laboratory centers, computer science, and other science applications.)
53. Outreach: QuarkNet-Trillium Virtual Data Portal
- More than a web site
  - Organize datasets
  - Perform simple computations
  - Create new computations & analyses
  - View & share results
  - Annotate & enquire (metadata)
  - Communicate and collaborate
- Easy to use, ubiquitous
  - No tools to install
  - Open to the community
  - Grow & extend
- Initial prototype implemented by graduate student Yong Zhao and M. Wilde (U. of Chicago)
54. CHEPREO: Center for High Energy Physics Research and Educational Outreach (Florida International University)
- Physics Learning Center
- CMS research
- iVDGL Grid activities
- AMPATH network (S. America)
- Funded September 2003
55. Recent Evolution of Grid Standards: Grid and Web Services Convergence
- Grid and Web services started far apart in applications & technology; now converging
- The definition of WSRF means that the Grid and Web communities can move forward on a common base
56. Open Grid Services Architecture and WSRF
(Layered stack, top to bottom:)
- Domain-specific services
- Program execution, data services, core services
- Open Grid Services Infrastructure (OGSI)
- WS-Resource Framework (WSRF)
- Web services: messaging, security, etc.
57. Globus Toolkit and WSRF
- 2004-2005: not waiting for finalization of the WSRF specs
58. Optical Networks: National Lambda Rail
- Started in 2003
- Initial: 4 x 10 Gb/s
- Future: 40 x 10 Gb/s
59. Optical Networks: AT&T USA Waves
60. UltraLight: 10 Gb/s Network
- 10 Gb/s network
  - Caltech, UF, FIU, UM, MIT
  - SLAC, FNAL
  - Int'l partners
  - Level(3), Cisco, NLR
61. Summary
- CS & Grid tools
  - CS research, VDT releases, simplified packaging
  - Virtual data: a powerful paradigm for scientific computing
- Grid deployments providing excellent experience
  - Testbeds, productions becoming more Grid-based
  - Grid2003, LCG-1/2 using real applications
- Collaboration occurring with more partners
  - National, international (Asia, South America)
- Promising new directions
  - Grid tools: OGSI → WSRF
  - Networks: increased capabilities (bandwidth, efficiency, services)
  - Research: collaborative and Grid tools for distributed teams
  - Outreach: many new opportunities and partners
62. Grid References
- Grid2003: www.ivdgl.org/grid2003
- Globus: www.globus.org
- PPDG: www.ppdg.net
- GriPhyN: www.griphyn.org
- iVDGL: www.ivdgl.org
- LCG: www.cern.ch/lcg
- EU DataGrid: www.eu-datagrid.org
- EGEE: egee-ei.web.cern.ch
- "The Grid" book, 2nd edition: www.mkp.com/grid2
63. Extra Slides
64. Virtual Data Motivations (1)
- "I've detected a muon calibration error and want to know which derived data products need to be recomputed."
- "I've found some interesting data, but I need to know exactly what corrections were applied before I can trust it."
- "I want to search a database for 3-muon SUSY events. If a program that does this analysis exists, I won't have to write one from scratch."
- "I want to apply a forward jet analysis to 100M events. If the results already exist, I'll save weeks of computation."
(Diagram: a Derivation is an execution-of a Transformation; Data is the product-of a Derivation; Data is consumed-by / generated-by Derivations.)
65. Virtual Data Motivations (2)
- Data track-ability and result audit-ability
  - Universally sought by scientific applications
- Facilitate resource sharing and collaboration
  - Data is sent along with its recipe
  - A new approach to saving old data: economic consequences?
- Manage workflow
  - Organize, locate, specify, request data products
- Repair and correct data automatically
  - Identify dependencies, apply transformations
- Optimize performance
  - Re-create data or copy it (caches) — see the sketch below
- Manual / error-prone → automated / robust
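An illustrative sketch (not from the talk) of the "re-create data or copy it" trade-off above: because a virtual-data system knows both the recipe and the replica locations, it can choose whichever is cheaper. All inputs here are hypothetical.

```python
# Compare regenerating a data product against copying an existing replica.
def cheaper_to_recompute(cpu_hours, free_slots, size_tb, link_gbps):
    """True if regenerating the product beats copying an existing replica."""
    recompute_h = cpu_hours / max(free_slots, 1)                  # ideal parallel speedup
    transfer_h = (size_tb * 1e12 * 8) / (link_gbps * 1e9) / 3600  # TB -> bits -> hours
    return recompute_h < transfer_h

# e.g. 500 CPU-hours spread over 200 free slots vs. moving 5 TB at 1 Gb/s
print(cheaper_to_recompute(500, 200, 5.0, 1.0))   # True: ~2.5 h vs ~11 h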
66. Chimera Virtual Data System
- Virtual Data Language (VDL): describes virtual data products
- Virtual Data Catalog (VDC): used to store VDL
- Abstract job flow planner: creates a logical DAG (dependency graph)
- Concrete job flow planner: interfaces with a Replica Catalog; provides a physical DAG submission file to Condor-G
- Generic and flexible: as a toolkit and/or a framework, in a Grid environment or locally
(Pipeline: VDL (XML) → VDC → abstract planner → DAX (abstract DAG in XML) → concrete planner + Replica Catalog → DAG → DAGMan / Condor-G. Virtual data used in CMS production via MCRunJob — planning sketch below.)
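A minimal sketch of the abstract → concrete planning step described above. This is hypothetical illustration code, not the Chimera/Pegasus implementation; the job names, logical file names and gsiftp URL are invented.

```python
# Abstract DAG: jobs reference only logical file names.
abstract_dag = [
    # (job id, transformation, logical inputs, logical outputs)
    ("gen",  "cmsim",    [],           ["evts.lfn"]),
    ("reco", "orca_hit", ["evts.lfn"], ["hits.lfn"]),
]

replica_catalog = {   # logical file name -> known physical replicas
    "evts.lfn": ["gsiftp://tier2.example.edu/store/evts.root"],
}

def concretize(dag, catalog):
    """Bind logical files to physical replicas; skip jobs whose outputs exist."""
    plan = []
    for job, transform, inputs, outputs in dag:
        if outputs and all(out in catalog for out in outputs):
            continue                                   # reuse existing products
        staged = {i: catalog.get(i, ["<produced upstream>"])[0] for i in inputs}
        plan.append({"job": job, "exec": transform,
                     "inputs": staged, "outputs": outputs})
    return plan

for node in concretize(abstract_dag, replica_catalog):
    print(node)   # in Chimera this step yields a DAGMan submit file for Condor-G
```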
67. Grid Analysis Environment
- GAE and analysis teams
  - Extending locally, nationally, globally
  - Sharing worldwide computing, storage & network resources
  - Utilizing intellectual contributions regardless of location
- HEPCAL and HEPCAL II
  - Delineate common LHC use cases & requirements (HEPCAL)
  - Same for analysis (HEPCAL II)
- ARDA: Architectural Roadmap for Distributed Analysis
  - ARDA document released: LCG-2003-033 (Oct. 2003)
  - Workshop January 2004, presentations by the 4 experiments
  - CMS: CAIGEE; ALICE: AliEn; ATLAS: ADA; LHCb: DIRAC
68. One View of ARDA Core Services
- Information Service, Job Provenance, Authentication, Authorization, Auditing, Grid Access Service, Accounting, Metadata Catalog, Grid Monitoring, File Catalog, Workload Management, Site Gatekeeper, Data Management, Package Manager
- Reference: LCG-2003-033 (Oct. 2003)
69. CMS Grid-Enabled Analysis: CAIGEE
- Clients talk standard protocols (HTTP, SOAP, XML-RPC) to the Grid Services Web Server, a.k.a. the Clarens data/services portal, with a simple web-service API (see the client sketch below)
- The Clarens portal hides the complexity of Grid services from the client
- Key features: global scheduler, catalogs, monitoring, and a Grid-wide execution service
- Clarens servers form a global peer-to-peer network
(Diagram: analysis clients → Grid Services Web Server → scheduler, catalogs (metadata, virtual data, replica), fully- and partially-abstract planners, fully-concrete planner, applications, data management, monitoring, execution priority manager, Grid-wide execution service.)
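A hedged sketch of a thin analysis client speaking XML-RPC to a Clarens-style portal, using Python's standard xmlrpc.client. The portal URL and the method names ("catalog.query", "scheduler.submit") are placeholders for illustration, not the actual Clarens API.

```python
import xmlrpc.client

# Placeholder portal endpoint; a real deployment would also require
# GSI/SSL credentials, which this sketch omits.
portal = xmlrpc.client.ServerProxy("https://clarens.example.edu:8443/clarens")

# Ask the portal's catalog for matching datasets, then hand a job
# description to the Grid-wide execution service behind the portal.
datasets = portal.catalog.query({"primary": "Higgs", "data_tier": "DST"})
job_id = portal.scheduler.submit({"executable": "analysis.C",
                                  "datasets": datasets})
print("submitted job", job_id)
```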
70. Grids: Enhancing Research & Learning
- Fundamentally alters the conduct of scientific research
  - Lab-centric: activities center around a large facility
  - Team-centric: resources shared by distributed teams
  - Knowledge-centric: knowledge generated/used by a community
- Strengthens the role of universities in research
  - Couples universities to data-intensive science
  - Couples universities to national & international labs
  - Brings front-line research and resources to students
  - Exploits intellectual resources of formerly isolated schools
  - Opens new opportunities for minority and women researchers
- Builds partnerships to drive advances in IT/science/engineering
  - HEP ↔ physics, astronomy, biology, CS, etc.
  - Application sciences ↔ computer science
  - Universities ↔ laboratories
  - Scientists ↔ students
  - Research community ↔ IT industry
71. LHC: Key Driver for Data Grids
- Complexity: millions of individual detector channels
- Scale: PetaOps (CPU), 100s of Petabytes (data)
- Distribution: global distribution of people & resources
- CMS Collaboration: ~2000 physicists, 159 institutes, 36 countries