Title: Grids for 21st Century Data Intensive Science
1. Grids for 21st Century Data Intensive Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/  avery@phys.ufl.edu
NSF Meeting, Apr. 15, 2003
2. The Grid Concept
- Grid: Geographically distributed computing resources configured for coordinated use
- Fabric: Physical resources and networks provide raw capability
- Middleware: Software ties it all together (tools, services, etc.)
- Goal: Transparent resource sharing
3. Fundamental Idea: Resource Sharing
- Resources for complex problems are distributed
  - Advanced scientific instruments (accelerators, telescopes, …)
  - Storage, computing, …
  - Groups of people, institutions
- Communities require access to common services
  - Research collaborations (physics, astronomy, biology, engineering, …)
  - Government agencies
  - Health care organizations, large corporations, …
- Virtual Organizations (VOs)
  - Create a VO from geographically separated components
  - Make all community resources available to any VO member
  - Leverage strengths at different institutions
  - Add people and resources dynamically
4. Some (Realistic) Grid Examples
- High energy physics: 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data
- Climate modeling: Climate scientists visualize, annotate, and analyze Terabytes of simulation data
- Biology: A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
- Engineering: A multidisciplinary analysis in aerospace couples code and data in four companies to design a new airframe
(From Ian Foster)
5. Grid Challenges
- Manage workflow across the Grid
- Balance policy vs. instantaneous capability to complete tasks
- Balance effective resource use vs. fast turnaround for priority jobs
- Match resource usage to policy over the long term
- Goal-oriented algorithms steering requests according to metrics
- Maintain a global view of resources and system state
- Coherent end-to-end system monitoring
- Adaptive learning: new paradigms for execution optimization
- Handle user-Grid interactions: guidelines, agents
- Build high-level services and an integrated user environment
6. Data Intensive Science 2000-2015
- Scientific discovery increasingly driven by data collection
  - Computationally intensive analyses
  - Massive data collections
  - Data distributed across networks of varying capability
  - Internationally distributed collaborations
- Dominant factor: data growth (1 Petabyte = 1000 TB)
  - 2000: 0.5 Petabyte
  - 2005: 10 Petabytes
  - 2010: 100 Petabytes
  - 2015: 1000 Petabytes?
How to collect, manage, access and interpret this quantity of data?
Drives demand for Data Grids to handle the additional dimension of data access and movement
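For rough scale, the growth implied by these figures (a back-of-envelope calculation from the numbers above, not a statement from the slide) is

\[
  r = \frac{\ln(1000/0.5)}{15\ \mathrm{yr}} \approx 0.51\ \mathrm{yr^{-1}},
  \qquad
  t_{\mathrm{double}} = \frac{\ln 2}{r} \approx 1.4\ \mathrm{yr},
\]

i.e. the projected data volume doubles roughly every 16-17 months.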
7. Data Intensive Physical Sciences
- High energy and nuclear physics
  - Including new experiments at CERN's Large Hadron Collider
- Astronomy
  - Digital sky surveys: SDSS, VISTA, other Gigapixel arrays
  - VLBI arrays: multiple-Gbps data streams
  - Virtual Observatories (multi-wavelength astronomy)
- Gravity wave searches
  - LIGO, GEO, VIRGO
- Time-dependent 3-D systems (simulation data)
  - Earth observation, climate modeling
  - Geophysics, earthquake modeling
  - Fluids, aerodynamic design
  - Dispersal of pollutants in atmosphere
8. Data Intensive Biology and Medicine
- Medical data
  - X-ray, mammography data, etc. (many petabytes)
  - Digitizing patient records (ditto)
- X-ray crystallography
  - Bright X-ray sources, e.g. Argonne Advanced Photon Source
- Molecular genomics and related disciplines
  - Human Genome, other genome databases
  - Proteomics (protein structure, activities, …)
  - Protein interactions, drug delivery
- Brain scans (1-10 μm, time dependent)
9. Example: LHC Physics
Compact Muon Solenoid (CMS) at the LHC (CERN)
Smithsonian "standard man" shown for scale
10. LHC Data Rates: Detector to Storage
Physics filtering chain, from detector readout to storage:
- Detector output: 40 MHz, 1000 TB/sec
- Level 1 Trigger (special hardware): 75 kHz, 75 GB/sec
- Level 2 Trigger (commodity CPUs): 5 kHz, 5 GB/sec
- Level 3 Trigger (commodity CPUs): 100 Hz, 100-400 MB/sec
- Raw data to storage: 100-400 MB/sec
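One way to read these numbers (my own illustrative arithmetic; the per-event sizes are implied by the slide, not stated on it) is to divide each stage's data rate by its event rate:

# Illustrative check of the trigger-chain numbers above (values taken from the slide).
stages = {
    "detector": (40e6, 1000e12),   # 40 MHz, 1000 TB/s
    "level1":   (75e3,   75e9),    # 75 kHz,   75 GB/s
    "level2":   ( 5e3,    5e9),    #  5 kHz,    5 GB/s
    "level3":   (100,   400e6),    # 100 Hz, up to 400 MB/s
}
for name, (rate_hz, bytes_per_s) in stages.items():
    print(f"{name}: ~{bytes_per_s / rate_hz / 1e6:.0f} MB per event")
# The trigger stages each work out to roughly 1-4 MB per accepted event.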
11. LHC: Higgs Decay into 4 muons
10^9 events/sec; selectivity: 1 in 10^13
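For perspective (a back-of-envelope estimate, not a figure from the slide), that selectivity applied to the raw event rate leaves only

\[ 10^{9}\ \mathrm{s^{-1}} \times 10^{-13} = 10^{-4}\ \mathrm{s^{-1}} \approx 10\ \text{selected events per day}. \]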
12. LHC Computing Overview
- Complexity: Millions of individual detector channels
- Scale: PetaOps (CPU), Petabytes (data)
- Distribution: Global distribution of people and resources
  - 1800 physicists, 150 institutes, 32 countries
13. Global LHC Data Grid (CMS Experiment)
Tier0 : (sum of Tier1) : (sum of Tier2) resources roughly 1:1:1
- Online System → Tier 0 (CERN Computer Center, 20 TIPS): 100-400 MBytes/s
- Tier 0 → Tier 1: 10-40 Gbps
- Tier 1 → Tier 2: 2.5-10 Gbps
- Tier 2 → Tier 3: 1-2.5 Gbps
- Tier 3 → Tier 4 (physics caches, PCs, other portals): 1-10 Gbps
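A minimal sketch of how this tier hierarchy and its link bandwidths might be encoded, assuming the structure shown above; the site descriptions are illustrative and nothing here is CMS software:

# Illustrative encoding of the tiered data-grid topology on this slide.
# Link bandwidths are the ranges quoted above, in Gbps; site labels are assumptions.
lhc_tiers = {
    "Tier0": {"site": "CERN Computer Center", "uplink_gbps": None},      # fed by the online system
    "Tier1": {"site": "national centers",     "uplink_gbps": (10, 40)},  # Tier0 -> Tier1
    "Tier2": {"site": "regional centers",     "uplink_gbps": (2.5, 10)}, # Tier1 -> Tier2
    "Tier3": {"site": "institute servers",    "uplink_gbps": (1, 2.5)},  # Tier2 -> Tier3
    "Tier4": {"site": "PCs, other portals",   "uplink_gbps": (1, 10)},   # Tier3 -> Tier4
}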
14. Example: Digital Astronomy Trends
- Future dominated by detector improvements
  - Moore's Law growth in CCDs
  - Gigapixel arrays on horizon
- Growth in CPU/storage tracking data volumes
- Investment in software critical
(Plot: total area of 3m-class telescopes in the world, in m^2 of glass, vs. total number of CCD pixels, in Mpix)
- 25-year growth: 30x in glass, 3000x in pixels
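Converted to doubling times (my own arithmetic from the quoted growth factors):

\[
  t_{2}^{\mathrm{pixels}} = \frac{25\ \mathrm{yr}}{\log_{2} 3000} \approx 2.2\ \mathrm{yr},
  \qquad
  t_{2}^{\mathrm{glass}} = \frac{25\ \mathrm{yr}}{\log_{2} 30} \approx 5.1\ \mathrm{yr},
\]

so pixel counts have roughly tracked a Moore's-law pace while collecting area has grown far more slowly, which is the slide's point about detector-driven progress.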
15. The Age of Astronomical Mega-Surveys
- Next generation mega-surveys will change astronomy
  - Top-down design
  - Large sky coverage
  - Sound statistical plans
  - Well controlled, uniform systematics
- The technology to store and access the data is here
  - Following Moore's law
- Integrating these archives for the whole community
- Astronomical data mining will lead to stunning new discoveries
  - Virtual Observatory
16. Virtual Observatories
Multi-wavelength astronomy, multiple surveys
17. Virtual Observatory Data Challenge
- Digital representation of the sky
  - All-sky deep fields
  - Integrated catalog and image databases
  - Spectra of selected samples
- Size of the archived data
  - 40,000 square degrees
  - Resolution: 50 trillion pixels
  - One band (2 bytes/pixel): 100 Terabytes
  - Multi-wavelength: 500-1000 Terabytes
  - Time dimension: many Petabytes
- Large, globally distributed database engines
  - Multi-Petabyte data size
  - Thousands of queries per day, GByte/s I/O speed per site
  - Data Grid computing infrastructure
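The single-band figure follows directly from the pixel count (a consistency check of the numbers above):

\[ 5\times 10^{13}\ \mathrm{pixels} \times 2\ \mathrm{bytes/pixel} = 10^{14}\ \mathrm{bytes} = 100\ \mathrm{TB}, \]

so the 500-1000 TB multi-wavelength figure corresponds to roughly 5-10 bands.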
18. Sloan Sky Survey Data Grid
19. U.S. Grid Projects
20. Global Context: Data Grid Projects
- U.S. infrastructure projects
  - Particle Physics Data Grid (PPDG) (DOE)
  - GriPhyN (NSF)
  - International Virtual Data Grid Laboratory (iVDGL) (NSF)
  - TeraGrid (NSF)
  - DOE Science Grid (DOE)
  - NSF Middleware Initiative (NMI) (NSF)
- EU, Asia: major projects
  - European Data Grid (EU, EC)
  - EDG-related national projects (UK, Italy, France, …)
  - CrossGrid (EU, EC)
  - DataTAG (EU, EC)
  - LHC Computing Grid (LCG) (CERN)
  - Japanese project
  - Korean project
21. Particle Physics Data Grid
- Funded 2001-2004 at US$9.5M (DOE)
- Experiments: D0, BaBar, STAR, CMS, ATLAS
- Institutions: SLAC, LBNL, JLab, FNAL, BNL, Caltech, Wisconsin, Chicago, USC
22. PPDG Goals
- Serve high energy and nuclear physics (HENP) experiments
  - Unique challenges, diverse test environments
- Develop advanced Grid technologies
  - Focus on end-to-end integration
  - Maintain practical orientation
  - Networks, instrumentation, monitoring
  - DB file/object replication, caching, catalogs, end-to-end movement
- Make tools general enough for a wide community
  - Collaboration with GriPhyN, iVDGL, EDG, LCG
  - ESNet Certificate Authority work, security
23. GriPhyN and iVDGL
- Both funded through NSF ITR program
  - GriPhyN: $11.9M (NSF) + $1.6M (matching), 2000-2005
  - iVDGL: $13.7M (NSF) + $2M (matching), 2001-2006
- Basic composition
  - GriPhyN: 12 funded universities, SDSC, 3 labs (80 people)
  - iVDGL: 16 funded institutions, SDSC, 3 labs (80 people)
  - Experiments: US-CMS, US-ATLAS, LIGO, SDSS/NVO
  - Large overlap of people, institutions, management
- Grid research vs. Grid deployment
  - GriPhyN: 2/3 CS + 1/3 physics (0% hardware)
  - iVDGL: 1/3 CS + 2/3 physics (20% hardware)
  - iVDGL: $2.5M Tier2 hardware ($1.4M for LHC)
- 4 physics experiments provide frontier challenges
- Virtual Data Toolkit (VDT) in common
24. GriPhyN Goals
- Conduct CS research to enable Petascale science
  - Virtual Data as unifying principle (more later; see the sketch after this list)
  - Planning, execution, performance monitoring
- Disseminate through Virtual Data Toolkit
  - Series of releases
- Integrate into GriPhyN science experiments
  - Common Grid tools, services
- Impact other disciplines
  - HEP, biology, medicine, virtual astronomy, engineering, other Grid projects
- Educate, involve, and train students in IT research
  - Undergrads, grads, postdocs, underrepresented groups
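The virtual data idea can be illustrated with a small conceptual sketch (a toy, not the GriPhyN/VDT interface): a data product is described by the transformation and inputs that derive it, and a request either reuses an existing replica or (re)computes the product on demand.

# Toy illustration of "virtual data": datasets are defined by recipes
# (transformation + inputs) and materialized only when requested.
# All names here are hypothetical; this is not the GriPhyN VDT API.
import subprocess

recipes = {}           # dataset name -> (shell command template, input dataset names)
replica_catalog = {}   # dataset name -> file path of an existing copy

def declare(dataset, command, inputs=()):
    # Register the derivation (transformation + inputs) that can produce `dataset`.
    recipes[dataset] = (command, tuple(inputs))

def materialize(dataset):
    # Return a file for `dataset`: reuse an existing replica, or derive it on demand.
    if dataset in replica_catalog:
        return replica_catalog[dataset]
    command, inputs = recipes[dataset]
    input_paths = [materialize(d) for d in inputs]      # derive prerequisites first
    out_path = dataset + ".dat"
    subprocess.run(command.format(out=out_path, inputs=" ".join(input_paths)),
                   shell=True, check=True)
    replica_catalog[dataset] = out_path
    return out_path

For example, declare("ntuple", "simulate.sh --out {out} {inputs}", inputs=["geometry"]) followed by materialize("ntuple") would either find an existing ntuple or rerun the (hypothetical) simulate.sh, provided "geometry" is already registered; the GriPhyN tools apply the same idea with provenance catalogs and Grid-wide planning, execution, and monitoring rather than a local dictionary.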
25. iVDGL Goals and Context
- International Virtual-Data Grid Laboratory
  - A global Grid laboratory (US, EU, Asia, South America, …)
  - A place to conduct Data Grid tests at scale
  - A mechanism to create common Grid infrastructure
  - A laboratory for other disciplines to perform Data Grid tests
  - A focus of outreach efforts to small institutions
- Context of iVDGL in US-LHC computing program
  - Mechanism for NSF to fund proto-Tier2 centers
  - Learn how to do Grid operations (GOC)
- International participation
  - DataTAG
  - UK e-Science programme: support for 6 CS Fellows per year in U.S.
26. Goal: PetaScale Virtual-Data Grids
(Layered architecture, top to bottom)
- Users: production teams, workgroups, single researchers
- Interactive User Tools
- Request Planning and Scheduling Tools; Request Execution and Management Tools; Virtual Data Tools
- Resource Management Services; Security and Policy Services; Other Grid Services
- Transforms
- Distributed resources (code, storage, CPUs, networks); raw data sources
Targets: PetaOps, Petabytes, Performance
27. GriPhyN/iVDGL Science Drivers
- US-CMS and US-ATLAS
  - HEP experiments at LHC/CERN
  - 100s of Petabytes
- LIGO
  - Gravity wave experiment
  - 100s of Terabytes
- Sloan Digital Sky Survey
  - Digital astronomy (1/4 of the sky)
  - 10s of Terabytes
- Massive CPU
- Large, distributed datasets
- Large, distributed communities
28. US-iVDGL Sites (Spring 2003)
- Partners?
  - EU
  - CERN
  - Brazil
  - Australia
  - Korea
  - Japan
29. iVDGL Map (2002-2003)
(Network links: SURFnet, DataTAG)
- New partners
  - Russia (T1)
  - China (T1)
  - Brazil (T1)
  - Romania (?)
30. ATLAS Simulations on iVDGL Resources
Joint project with iVDGL
31. US-CMS Grid Testbed
32. CMS Grid Testbed Production
33. US-CMS Testbed Success Story
- Production run for the IGT MOP production
- Assigned 1.5 million events for eGamma and Bigjets samples
- 500 sec per event on a 750 MHz processor; all production stages from simulation to ntuple
- 2 months of continuous running across 5 testbed sites
- Demonstrated at Supercomputing 2002
1.5 Million Events Produced! (nearly 30 CPU-years)
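For scale, the quoted figures multiply out as follows (my own rounding, consistent in order of magnitude with the "nearly 30 CPU-years" quoted above):

\[ 1.5\times 10^{6}\ \mathrm{events} \times 500\ \mathrm{s/event} = 7.5\times 10^{8}\ \text{CPU-seconds} \approx 24\ \text{CPU-years}. \]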
34. Creation of WorldGrid
- Joint iVDGL/DataTAG/EDG effort
  - Resources from both sides (15 sites)
  - Monitoring tools (Ganglia, MDS, NetSaint, …)
  - Visualization tools (Nagios, MapCenter, Ganglia)
- Applications: ScienceGrid
  - CMS: CMKIN, CMSIM
  - ATLAS: ATLSIM
- Submit jobs from US or EU; jobs can run on any cluster (see the sketch after this list)
- Demonstrated at IST2002 (Copenhagen)
- Demonstrated at SC2002 (Baltimore)
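As a flavor of what transatlantic submission looked like in practice, here is a minimal sketch in the Condor-G style of that era. The gatekeeper host, jobmanager, and executable are purely hypothetical, and WorldGrid's actual configuration is not described on this slide.

# Sketch of Grid job submission via Condor-G (assumed setup; not WorldGrid's actual config).
import subprocess, textwrap

submit_description = textwrap.dedent("""\
    universe            = globus
    globusscheduler     = gatekeeper.example-site.org/jobmanager-condor
    executable          = run_cmsim.sh
    transfer_executable = true
    output              = cmsim.out
    error               = cmsim.err
    log                 = cmsim.log
    queue
    """)

with open("cmsim.submit", "w") as f:
    f.write(submit_description)

# Condor-G forwards the job to the remote Globus gatekeeper; in a WorldGrid-like
# setup any participating cluster could in principle run it.
subprocess.run(["condor_submit", "cmsim.submit"], check=True)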
35. WorldGrid Sites
36. Grid Coordination
37. U.S. Project Coordination: Trillium
- Trillium = GriPhyN + iVDGL + PPDG
- Large overlap in leadership, people, experiments
- Benefits of coordination
  - Common S/W base and packaging: VDT + Pacman
  - Collaborative / joint projects: monitoring, demos, security, …
  - Wide deployment of new technologies, e.g. Virtual Data
  - Stronger, broader outreach effort
- Forum for US Grid projects
  - Joint view, strategies, meetings and work
  - Unified entity to deal with EU and other Grid projects
  - Natural collaboration across DOE and NSF projects
- Funding agency interest?
38. International Grid Coordination
- Global Grid Forum (GGF)
  - International forum for general Grid efforts
  - Many working groups, standards definitions
- Close collaboration with EU DataGrid (EDG)
  - Many connections with EDG activities
- HICB: HEP Inter-Grid Coordination Board
  - Non-competitive forum, strategic issues, consensus
  - Cross-project policies, procedures and technology, joint projects
- HICB-JTB: Joint Technical Board
  - Definition, oversight and tracking of joint projects
  - GLUE interoperability group
- Participation in LHC Computing Grid (LCG)
  - Software and Computing Committee (SC2)
  - Project Execution Board (PEB)
  - Grid Deployment Board (GDB)
39. Future U.S. Grid Efforts
- Dynamic Workspaces proposal (ITR program, $15M)
  - Extend virtual data technologies to global analysis communities
- Collaboratory proposal (ITR program, $4M)
  - Develop collaborative tools for global research
  - Combination of Grids, communications, sociology, evaluation, …
- BTeV proposal (ITR program)
- FIU: creation of CHEPREO in Miami area
  - HEP research, participation in WorldGrid
  - Strong minority E/O, coordinated with GriPhyN/iVDGL
  - Research international network to Brazil / South America
- Other proposals
  - ITR, MRI, NMI, SciDAC, others
40. Sub-Communities in Global CMS
Pull in outside resources
41. Summary
- Progress on many fronts in PPDG/GriPhyN/iVDGL
  - Packaging: Pacman + VDT
  - Testbeds (development and production)
  - Major demonstration projects
  - Productions based on Grid tools using iVDGL resources
  - WorldGrid providing excellent experience
- Excellent collaboration with EU partners
  - Excellent opportunity to build lasting infrastructure
  - Looking to collaborate with more international partners
  - Testbeds, monitoring, deploying VDT more widely
- New directions
  - Virtual data: a powerful paradigm for LHC computing
  - Emphasis on Grid-enabled analysis
42. Grid References
- Grid Book: www.mkp.com/grids
- Globus: www.globus.org
- Global Grid Forum: www.gridforum.org
- PPDG: www.ppdg.net
- GriPhyN: www.griphyn.org
- iVDGL: www.ivdgl.org
- TeraGrid: www.teragrid.org
- EU DataGrid: www.eu-datagrid.org