Title: Global Data Grids for 21st Century Science
1 Global Data Grids for 21st Century Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/
avery@phys.ufl.edu
GriPhyN/iVDGL Outreach Workshop
University of Texas, Brownsville
March 1, 2002
2 What is a Grid?
- Grid: Geographically distributed computing resources configured for coordinated use
- Physical resources and networks provide raw capability
- Middleware software ties it together
3 What Are Grids Good For?
- Climate modeling
- Climate scientists visualize, annotate, and analyze Terabytes of simulation data
- Biology
- A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
- High energy physics
- 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data
- Engineering
- Civil engineers collaborate to design, execute, and analyze shake table experiments
- A multidisciplinary analysis in aerospace couples code and data in four companies
From Ian Foster
4 What Are Grids Good For?
- Application Service Providers
- A home user invokes architectural design functions at an application service provider
- which purchases computing cycles from cycle providers
- Commercial
- Scientists at a multinational toy company design a new product
- Cities, communities
- An emergency response team couples real-time data, weather model, population data
- A community group pools members' PCs to analyze alternative designs for a local road
- Health
- Hospitals and international agencies collaborate on stemming a major disease outbreak
From Ian Foster
5 Proto-Grid: SETI@home
- Community: SETI researchers and enthusiasts
- Arecibo radio data sent to users (250 KB data chunks)
- Over 2M PCs used
6 More Advanced Proto-Grid: Evaluation of AIDS Drugs
- Community
- Research group (Scripps)
- 1000s of PC owners
- Vendor (Entropia)
- Common goal
- Drug design
- Advance AIDS research
7 Why Grids?
- Resources for complex problems are distributed
- Advanced scientific instruments (accelerators, telescopes, ...)
- Storage and computing
- Groups of people
- Communities require access to common services
- Scientific collaborations (physics, astronomy, biology, eng. ...)
- Government agencies
- Health care organizations, large corporations, ...
- Goal is to build Virtual Organizations (VOs)
- Make all community resources available to any VO member
- Leverage strengths at different institutions
- Add people and resources dynamically
8 Grids: Why Now?
- Moore's law improvements in computing
- Highly functional end systems
- Burgeoning wired and wireless Internet connections
- Universal connectivity
- Changing modes of working and problem solving
- Teamwork, computation
- Network exponentials
- (Next slide)
9 Network Exponentials and Collaboration
- Network vs. computer performance
- Computer speed doubles every 18 months
- Network speed doubles every 9 months
- Difference: an order of magnitude per 5 years
- 1986 to 2000
- Computers: x 500
- Networks: x 340,000
- 2001 to 2010?
- Computers: x 60
- Networks: x 4000
Scientific American (Jan-2001)
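The doubling times on this slide can be checked with a little arithmetic. The sketch below is illustrative only (the function names are not from the talk): it converts a doubling time into a growth factor over a period, and inverts the quoted 1986 to 2000 growth factors back into implied doubling times.

```python
import math

def growth_factor(doubling_months: float, years: float) -> float:
    """Growth factor after `years` if capacity doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)

def implied_doubling_months(factor: float, years: float) -> float:
    """Doubling time in months implied by growing by `factor` over `years`."""
    return years * 12.0 / math.log2(factor)

# Doubling times quoted on the slide, projected over 5 years
computers_5y = growth_factor(18, 5)    # roughly 10x
networks_5y = growth_factor(9, 5)      # roughly 100x
gap_5y = networks_5y / computers_5y    # roughly one order of magnitude

# Doubling times implied by the 1986 to 2000 figures (14 years)
computers_dt = implied_doubling_months(500, 14)
networks_dt = implied_doubling_months(340_000, 14)

print(f"5-year growth: computers x{computers_5y:.1f}, "
      f"networks x{networks_5y:.1f}, gap x{gap_5y:.1f}")
print(f"1986-2000 implied doubling: computers {computers_dt:.1f} months, "
      f"networks {networks_dt:.1f} months")
```

Under these assumptions the 5-year gap comes out at about a factor of ten, and the 1986 to 2000 figures imply doubling times close to the quoted 18 and 9 months.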
10 Grid Challenges
- Overall goal: Coordinated sharing of resources
- Technical problems to overcome
- Authentication, authorization, policy, auditing
- Resource discovery, access, allocation, control
- Failure detection and recovery
- Resource brokering
- Additional issue: lack of central control and knowledge
- Preservation of local site autonomy
- Policy discovery and negotiation important
11 Layered Grid Architecture (Analogy to Internet Architecture)
(Figure: layers, top to bottom)
- User: Specialized services; app. specific distributed services
- Collective: Managing multiple resources; ubiquitous infrastructure services
- Resource: Sharing single resources; negotiating access, controlling use
- Connectivity: Talking to things; communications, security
- Fabric: Controlling things locally; accessing, controlling resources
From Ian Foster
12 Globus Project and Toolkit
- Globus Project (Argonne + USC/ISI)
- O(40) researchers and developers
- Identify and define core protocols and services
- Globus Toolkit 2.0
- A major product of the Globus Project
- Reference implementation of core protocols and services
- Growing open source developer community
- Globus Toolkit used by all Data Grid projects today
- US: GriPhyN, PPDG, TeraGrid, iVDGL
- EU: EU-DataGrid and national projects
- Recent announcement of applying web services to Grids
- Keeps Grids in the commercial mainstream
- GT 3.0
13 Globus General Approach
- Define Grid protocols and APIs
- Protocol-mediated access to remote resources
- Integrate and extend existing standards
- Develop reference implementation
- Open source Globus Toolkit
- Client and server SDKs, services, tools, etc.
- Grid-enable wide variety of tools
- Globus Toolkit
- FTP, SSH, Condor, SRB, MPI, ...
- Learn about real world problems
- Deployment
- Testing
- Applications
(Figure: layers, top to bottom: Applications; Diverse global services; Core services; Diverse resources)
14 Data Intensive Science: 2000-2015
- Scientific discovery increasingly driven by IT
- Computationally intensive analyses
- Massive data collections
- Data distributed across networks of varying capability
- Geographically distributed collaboration
- Dominant factor: data growth (1 Petabyte = 1000 TB)
- 2000: 0.5 Petabyte
- 2005: 10 Petabytes
- 2010: 100 Petabytes
- 2015: 1000 Petabytes?
How to collect, manage, access and interpret this quantity of data?
Drives demand for Data Grids to handle the additional dimension of data access and movement
15 Data Intensive Physical Sciences
- High energy and nuclear physics
- Including new experiments at CERN's Large Hadron Collider
- Gravity wave searches
- LIGO, GEO, VIRGO
- Astronomy: Digital sky surveys
- Sloan Digital Sky Survey, VISTA, other Gigapixel arrays
- Virtual Observatories (multi-wavelength astronomy)
- Time-dependent 3-D systems (simulation and data)
- Earth Observation, climate modeling
- Geophysics, earthquake modeling
- Fluids, aerodynamic design
- Pollutant dispersal scenarios
16 Data Intensive Biology and Medicine
- Medical data
- X-Ray, mammography data, etc. (many petabytes)
- Digitizing patient records (ditto)
- X-ray crystallography
- Bright X-Ray sources, e.g. Argonne Advanced Photon Source
- Molecular genomics and related disciplines
- Human Genome, other genome databases
- Proteomics (protein structure, activities, ...)
- Protein interactions, drug delivery
- Brain scans (3-D, time dependent)
- Virtual Population Laboratory (proposed)
- Database of populations, geography, transportation corridors
- Simulate likely spread of disease outbreaks
Craig Venter keynote @SC2001
17 Example: High Energy Physics
(Figure: Compact Muon Solenoid at the LHC (CERN), shown with the Smithsonian standard man)
18 LHC Computing Challenges
- Complexity of LHC interaction environment and resulting data
- Scale: Petabytes of data per year (100 PB by 2010-12)
- Global distribution of people and resources
1800 Physicists, 150 Institutes, 32 Countries
19 Global LHC Data Grid
- Tier0: CERN
- Tier1: National Lab
- Tier2: Regional Center (University, etc.)
- Tier3: University workgroup
- Tier4: Workstation
- Key ideas
- Hierarchical structure
- Tier2 centers
20 Global LHC Data Grid
- CERN/Outside Resource Ratio 1:2
- Tier0/(Σ Tier1)/(Σ Tier2) 1:1:1
(Figure: data flow through the tiers, top to bottom, with link speeds between levels)
- Experiment
- PBytes/sec
- Online System (bunch crossing per 25 nsecs, 100 triggers per second, event is 1 MByte in size)
- 100 MBytes/sec
- Tier 0+1: CERN Computer Center, 20 TIPS, HPSS
- 2.5 Gbits/sec
- Tier 1: France Center, Italy Center, UK Center, USA Center
- 2.5 Gbits/sec
- Tier 2
- 622 Mbits/sec
- Tier 3: Institute (0.25 TIPS), Institute, Institute, Institute
- 100 - 1000 Mbits/sec, physics data cache
- Tier 4: Workstations, other portals
Physicists work on analysis channels. Each institute has 10 physicists working on one or more channels.
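The rates quoted on this slide fit together arithmetically. The sketch below is a back-of-the-envelope check, not part of the talk; the figure of about 10^7 seconds of data taking per year is an assumption introduced here for illustration.

```python
# Figures quoted on the slide
BUNCH_CROSSING_NS = 25      # one bunch crossing every 25 ns
TRIGGERS_PER_SEC = 100      # events kept per second
EVENT_SIZE_MB = 1           # size of one recorded event in MBytes

# Assumption (not from the slide): about 1e7 seconds of data taking per year
LIVE_SECONDS_PER_YEAR = 1e7

# Crossings per second: 1e9 ns per second divided by 25 ns per crossing
crossing_rate_hz = 1e9 / BUNCH_CROSSING_NS

# How many crossings occur for every event that is kept
crossings_per_kept_event = crossing_rate_hz / TRIGGERS_PER_SEC

# Rate out of the online system, in MBytes/sec
rate_mb_per_sec = TRIGGERS_PER_SEC * EVENT_SIZE_MB

# Yearly volume in PBytes (1 PB = 1e9 MB in decimal units)
yearly_pb = rate_mb_per_sec * LIVE_SECONDS_PER_YEAR / 1e9

print(f"crossing rate: {crossing_rate_hz / 1e6:.0f} MHz")
print(f"crossings per kept event: {crossings_per_kept_event:.0f}")
print(f"output rate: {rate_mb_per_sec} MBytes/sec")
print(f"yearly volume: {yearly_pb:.1f} PBytes")
```

The 100 MBytes/sec out of the online system follows directly from 100 triggers per second at 1 MByte each. Under the assumed live time, this is on the order of a Petabyte per year.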
21 Sloan Digital Sky Survey Data Grid
22 LIGO (Gravity Wave) Data Grid
(Figure: sites MIT, Livingston Observatory, Hanford Observatory, Caltech (Tier1); links labeled OC48, OC3, OC3, OC12, OC48)
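For readers unfamiliar with the OC-n link labels in the figure, a SONET OC-n line runs at n times the OC-1 rate of 51.84 Mbit/s. The small sketch below (the helper name is illustrative) converts the labels used here into line rates.

```python
OC1_MBITS = 51.84  # SONET OC-1 line rate in Mbit/s

def oc_rate_mbits(n: int) -> float:
    """Line rate of an OC-n link in Mbit/s."""
    return n * OC1_MBITS

# Link types that appear in the LIGO figure
for n in (3, 12, 48):
    print(f"OC{n}: {oc_rate_mbits(n):.2f} Mbit/s")
```

OC12 and OC48 correspond to the 622 Mbits/sec and 2.5 Gbits/sec link speeds quoted for the LHC tiers.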
23 Data Grid Projects
- Particle Physics Data Grid (US, DOE)
- Data Grid applications for HENP expts.
- GriPhyN (US, NSF)
- Petascale Virtual-Data Grids
- iVDGL (US, NSF)
- Global Grid lab
- TeraGrid (US, NSF)
- Dist. supercomp. resources (13 TFlops)
- European Data Grid (EU, EC)
- Data Grid technologies, EU deployment
- CrossGrid (EU, EC)
- Data Grid technologies, EU
- DataTAG (EU, EC)
- Transatlantic network, Grid applications
- Japanese Grid Project (APGrid?) (Japan)
- Grid deployment throughout Japan
- Collaborations of application scientists and computer scientists
- Infrastructure devel. and deployment
- Globus based
24 Coordination of U.S. Grid Projects
- Three U.S. projects
- PPDG: HENP experiments, short term tools, deployment
- GriPhyN: Data Grid research, Virtual Data, VDT deliverable
- iVDGL: Global Grid laboratory
- Coordination of PPDG, GriPhyN, iVDGL
- Common experiments and personnel, management integration
- iVDGL as joint PPDG + GriPhyN laboratory
- Joint meetings (Jan. 2002, April 2002, Sept. 2002)
- Joint architecture creation (GriPhyN, PPDG)
- Adoption of VDT as common core Grid infrastructure
- Common Outreach effort (GriPhyN + iVDGL)
- New TeraGrid project (Aug. 2001)
- 13 TFlops across 4 sites, 40 Gb/s networking
- Goal: integrate into iVDGL, adopt VDT, common Outreach
25 Worldwide Grid Coordination
- Two major clusters of projects
- US based: GriPhyN Virtual Data Toolkit (VDT)
- EU based: Different packaging of similar components
26 GriPhyN = App. Science + CS + Grids
- GriPhyN = Grid Physics Network
- US-CMS: High Energy Physics
- US-ATLAS: High Energy Physics
- LIGO/LSC: Gravity wave research
- SDSS: Sloan Digital Sky Survey
- Strong partnership with computer scientists
- Design and implement production-scale grids
- Develop common infrastructure, tools and services (Globus based)
- Integration into the 4 experiments
- Broad application to other sciences via Virtual Data Toolkit
- Strong outreach program
- Multi-year project
- R&D for grid architecture (funded at $11.9M + $1.6M)
- Integrate Grid infrastructure into experiments through VDT
27 GriPhyN Institutions
- UC San Diego
- San Diego Supercomputer Center
- Lawrence Berkeley Lab
- Argonne
- Fermilab
- Brookhaven
- U Florida
- U Chicago
- Boston U
- Caltech
- U Wisconsin, Madison
- USC/ISI
- Harvard
- Indiana
- Johns Hopkins
- Northwestern
- Stanford
- U Illinois at Chicago
- U Penn
- U Texas, Brownsville
- U Wisconsin, Milwaukee
- UC Berkeley
28 GriPhyN: PetaScale Virtual-Data Grids
(Figure: layered architecture, top to bottom)
- Users: Production Team, Individual Investigator, Workgroups (1 Petaflop, 100 Petabytes)
- Interactive User Tools
- Virtual Data Tools; Request Planning and Scheduling Tools; Request Execution and Management Tools
- Resource Management Services; Security and Policy Services; Other Grid Services
- Transforms
- Distributed resources (code, storage, CPUs, networks)
- Raw data source
29 GriPhyN Research Agenda
- Virtual Data technologies (fig.)
- Derived data, calculable via algorithm
- Instantiated 0, 1, or many times (e.g., caches)
- Fetch value vs execute algorithm
- Very complex (versions, consistency, cost calculation, etc)
- LIGO example
- Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year
- For each requested data value, need to
- Locate item location and algorithm
- Determine costs of fetching vs calculating
- Plan data movements and computations required to obtain results
- Execute the plan
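The "fetch value vs execute algorithm" decision can be sketched as a small cost comparison. Everything in this sketch is hypothetical and chosen for illustration: the `Replica` and `Derivation` records, the site names, and the deliberately simple cost model (transfer time as size over bandwidth, compute time as CPU-seconds over available CPUs). A real planner would also have to account for versions, consistency, and policy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Replica:
    """An already materialized copy of the requested data (hypothetical record)."""
    site: str
    size_mb: float
    bandwidth_mb_per_s: float  # assumed usable bandwidth from that site

@dataclass
class Derivation:
    """A recipe that can recompute the data (hypothetical record)."""
    site: str
    cpu_seconds: float
    cpus_available: int

def fetch_cost(replica: Replica) -> float:
    """Estimated seconds to transfer an existing copy."""
    return replica.size_mb / replica.bandwidth_mb_per_s

def compute_cost(derivation: Derivation) -> float:
    """Estimated seconds to recompute the data from its recipe."""
    return derivation.cpu_seconds / derivation.cpus_available

def plan(replicas: List[Replica],
         derivation: Optional[Derivation]) -> Tuple[str, str, float]:
    """Return (action, site, estimated seconds) for the cheapest option."""
    options = [("fetch", r.site, fetch_cost(r)) for r in replicas]
    if derivation is not None:
        options.append(("compute", derivation.site, compute_cost(derivation)))
    if not options:
        raise ValueError("data is neither stored anywhere nor derivable")
    return min(options, key=lambda option: option[2])

# Example: two stored copies and one way to recompute
replicas = [Replica("archive", 2000, 10), Replica("regional-cache", 2000, 50)]
derivation = Derivation("local-cluster", 1200, 40)

choice = plan(replicas, derivation)
fallback = plan(replicas, None)
print(choice)    # cheapest option when recomputation is possible
print(fallback)  # cheapest option when only stored copies exist
```

With these made-up numbers recomputation wins; remove the derivation and the planner falls back to the nearest cache. The point is only that locating, costing, planning, and executing are separable steps.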
30 Virtual Data in Action
- Data request may
- Compute locally
- Compute remotely
- Access local data
- Access remote data
- Scheduling based on
- Local policies
- Global policies
- Cost
(Figure: three levels: Major facilities, archives; Regional facilities, caches; Local facilities, caches)
31 GriPhyN Research Agenda (cont.)
- Execution management
- Co-allocation of resources (CPU, storage, network transfers)
- Fault tolerance, error reporting
- Interaction, feedback to planning
- Performance analysis (with PPDG)
- Instrumentation and measurement of all grid components
- Understand and optimize grid performance
- Virtual Data Toolkit (VDT)
- VDT = virtual data services + virtual data tools
- One of the primary deliverables of R&D effort
- Technology transfer mechanism to other scientific domains
32 GriPhyN/PPDG Data Grid Architecture
(Figure: architecture diagram)
- Main flow: Application, DAG, Planner, DAG, Executor, Compute Resource and Storage Resource
- Supporting services: Catalog Services, Monitoring, Info Services, Repl. Mgmt., Policy/Security, Reliable Transfer Service
- Legend: initial solution is operational
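The Executor's role in this diagram, running the jobs of a DAG once their dependencies have finished, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The job names and the `execute` helper are hypothetical; this is not the GriPhyN/PPDG implementation, only an illustration of dependency-ordered execution.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each job maps to the set of jobs it depends on.
dag = {
    "stage_in_raw": set(),
    "calibrate": {"stage_in_raw"},
    "reconstruct": {"calibrate"},
    "register_replica": {"reconstruct"},
}

def execute(graph, run_job):
    """Run every job after all of its dependencies; return the run order."""
    order = []
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        for job in sorter.get_ready():
            run_job(job)      # a real executor would submit this to a resource
            order.append(job)
            sorter.done(job)  # unblocks jobs that were waiting on this one
    return order

order = execute(dag, run_job=lambda job: print("running", job))
```

Jobs returned together by `get_ready()` have no unmet dependencies, so a real executor could dispatch them in parallel to different compute and storage resources.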
33 Catalog Architecture
Transparency wrt location
(Figure: catalogs, labeled GCMS)

Metadata Catalog (Object Name to Logical Object Name)
Name      LObjN
X         logO1
Y         logO2
F.X       logO3
G(1).Y    logO4

Replica Catalog (Logical Container Name to physical file names)
LCN       PFNs
logC1     URL1
logC2     URL2 URL3
logC3     URL4
logC4     URL5 URL6

URLs for physical file location
Physical file storage
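Location transparency in this catalog scheme amounts to a chain of lookups from an object name to physical URLs. The sketch below uses the entries from the slide for the metadata and replica catalogs. The mapping from logical objects to logical containers is not shown on the slide, so the one-to-one `object_to_container` table here is an assumption made purely for illustration.

```python
# Metadata catalog: object name -> logical object name (entries from the slide)
metadata_catalog = {
    "X": "logO1",
    "Y": "logO2",
    "F.X": "logO3",
    "G(1).Y": "logO4",
}

# ASSUMPTION: which container holds each logical object is not on the slide;
# this one-to-one mapping is invented for the example.
object_to_container = {
    "logO1": "logC1",
    "logO2": "logC2",
    "logO3": "logC3",
    "logO4": "logC4",
}

# Replica catalog: logical container name -> physical file names (from the slide)
replica_catalog = {
    "logC1": ["URL1"],
    "logC2": ["URL2", "URL3"],
    "logC3": ["URL4"],
    "logC4": ["URL5", "URL6"],
}

def resolve(object_name: str) -> list:
    """Follow the catalogs from an object name to its physical replicas."""
    logical_object = metadata_catalog[object_name]
    container = object_to_container[logical_object]
    return replica_catalog[container]

print(resolve("Y"))    # all known replicas of Y
print(resolve("F.X"))  # all known replicas of F.X
```

The caller names only the object; which replica to use, if several exist, is left to a later planning step.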
34 iVDGL: A Global Grid Laboratory
"We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science." From NSF proposal, 2001
- International Virtual-Data Grid Laboratory
- A global Grid laboratory (US, EU, South America, Asia, ...)
- A place to conduct Data Grid tests at scale
- A mechanism to create common Grid infrastructure
- A facility to perform production exercises for LHC experiments
- A laboratory for other disciplines to perform Data Grid tests
- A focus of outreach efforts to small institutions
- Funded for $13.65M by NSF
35 iVDGL Components
- Computing resources
- Tier1, Tier2, Tier3 sites
- Networks
- USA (TeraGrid, Internet2, ESNET), Europe (Géant, ...)
- Transatlantic (DataTAG), Transpacific, AMPATH, ...
- Grid Operations Center (GOC)
- Indiana (2 people)
- Joint work with TeraGrid on GOC development
- Computer Science support teams
- Support, test, upgrade GriPhyN Virtual Data Toolkit
- Outreach effort
- Integrated with GriPhyN
- Coordination, interoperability
36 Current iVDGL Participants
- Initial experiments (funded by NSF proposal)
- CMS, ATLAS, LIGO, SDSS, NVO
- U.S. Universities and laboratories
- (Next slide)
- Partners
- TeraGrid
- EU DataGrid + EU national projects
- Japan (AIST, TITECH)
- Australia
- Complementary EU project: DataTAG
- 2.5 Gb/s transatlantic network
37 U.S. iVDGL Proposal Participants
- U Florida: CMS
- Caltech: CMS, LIGO
- UC San Diego: CMS, CS
- Indiana U: ATLAS, GOC
- Boston U: ATLAS
- U Wisconsin, Milwaukee: LIGO
- Penn State: LIGO
- Johns Hopkins: SDSS, NVO
- U Chicago/Argonne: CS
- U Southern California: CS
- U Wisconsin, Madison: CS
- Salish Kootenai: Outreach, LIGO
- Hampton U: Outreach, ATLAS
- U Texas, Brownsville: Outreach, LIGO
- Fermilab: CMS, SDSS, NVO
- Brookhaven: ATLAS
- Argonne Lab: ATLAS, CS
Legend (site categories): T2 / Software; CS support; T3 / Outreach; T1 / Labs (funded elsewhere)
38 Initial US-iVDGL Data Grid
(Map of sites: SKC, BU, Wisconsin, PSU, BNL, Fermilab, Hampton, Indiana, JHU, Caltech, UCSD, Florida, Brownsville)
Other sites to be added in 2002
39 iVDGL Map (2002-2003)
(Map labels: Surfnet, DataTAG)
- Later
- Brazil
- Pakistan
- Russia
- China
40 Summary
- Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing
- The iVDGL will provide vast experience for new collaborations
- Many challenges during the coming transition
- New grid projects will provide rich experience and lessons
- Difficult to predict situation even 3-5 years ahead
41 Grid References
- Grid Book
- www.mkp.com/grids
- Globus
- www.globus.org
- Global Grid Forum
- www.gridforum.org
- TeraGrid
- www.teragrid.org
- EU DataGrid
- www.eu-datagrid.org
- PPDG
- www.ppdg.net
- GriPhyN
- www.griphyn.org
- iVDGL
- www.ivdgl.org