Title: Computational Data Grids
1. Computational Data Grids
- Solving the Problems of Data Intensive Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/   avery@phys.ufl.edu
FSU Physics Colloquium, Feb. 20, 2001
http://www.phys.ufl.edu/avery/griphyn/talks/avery_fsu_20feb01.ppt
2. What is a Grid?
- Grid: geographically distributed computing resources configured for coordinated use
- Physical resources + networks provide the raw capability
- Middleware software ties it together
3. Why Grids?
- Resources for complex problems are distributed
- Advanced scientific instruments (accelerators, telescopes, ...)
- Large amounts of storage
- Large amounts of computing
- Groups of smart people
- Communities require access to common services
- Scientific collaborations (physics, astronomy, biology, engineering, ...)
- Government agencies
- Health care organizations
- Large corporations
- ...
- Other reasons
- Resource allocations vary between individual institutions
- Resource configurations change
4. Distributed Computation: SETI@home
- Community: SETI researchers + enthusiasts
- Arecibo radio data sent to users (250 KB data chunks)
- Over 2M PCs used
5. Distributed Computation: Optimization
- Community: mathematicians + computer scientists
- Exact solution of nug30, a 30-site Quadratic Assignment Problem
- Roughly 2x the computation of USA13509 (traveling salesman, 13,509 cities)
- Roughly half the computation of factoring a composite number of ~10^150
- A 32-year-old problem
- Condor-G distributed computing over several institutions
- Delivered 4005 CPU days in 7 days (650 average, 1009 peak)
- Parallel computers, workstations, clusters (8 sites, US-Italy)
- Optimal assignment found (facility-to-site permutation; see the scoring sketch below): 14, 5, 28, 24, 1, 3, 16, 15, 10, 9, 21, 2, 4, 29, 25, 22, 13, 26, 17, 30, 6, 20, 19, 8, 18, 7, 27, 12, 11, 23
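To make the nug30 entry concrete, here is a minimal Python sketch (not part of the original talk) of how a candidate quadratic-assignment solution such as the permutation above is scored: the objective sums flow(i,j) x distance(site(i), site(j)) over all facility pairs. The flow and distance matrices below are random placeholders, not the actual nug30 data, so the printed value is illustrative only.

```python
# Sketch: evaluating the quadratic-assignment objective for a candidate
# facility-to-site permutation. The 30x30 matrices are random placeholders,
# NOT the real nug30 data, so the printed cost is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 30
flow = rng.integers(0, 10, size=(n, n))   # flow[i][j]: traffic between facilities i and j
dist = rng.integers(0, 10, size=(n, n))   # dist[a][b]: distance between sites a and b

# Candidate assignment from the slide (facility k -> site perm[k]), converted to 0-based.
perm = [14, 5, 28, 24, 1, 3, 16, 15, 10, 9, 21, 2, 4, 29,
        25, 22, 13, 26, 17, 30, 6, 20, 19, 8, 18, 7, 27, 12, 11, 23]
perm = [p - 1 for p in perm]

def qap_cost(perm, flow, dist):
    """Sum of flow(i,j) * dist(site(i), site(j)) over all facility pairs."""
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(len(perm)) for j in range(len(perm)))

print("objective for this assignment:", qap_cost(perm, flow, dist))
```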
6. Distributed Computation: Evaluation of AIDS Drugs
- Community
- 1000s of home computer users
- Philanthropic computing vendor (Entropia)
- Research group (Scripps)
- Common goal: advance AIDS research
7. Grids: Next Generation Web
On-demand creation of powerful virtual computing systems
Grid: flexible, high-performance access to all significant resources
- Computers
- Data archives
- Software catalogs
- Sensor nets
- Colleagues
From Ian Foster
8. Grid Challenges
- Overall goal: coordinated sharing of resources
- Numerous technical problems
- Authentication, authorization, policy, auditing
- Resource discovery, access, allocation, control
- Failure detection and recovery
- Brokering
- Additional issue: lack of central control and knowledge
- Need to preserve local site independence
- Policy discovery and negotiation important
- Many interesting failure modes
9. Grids: Why Now?
- Improvements in Internet infrastructure
- Increasing bandwidth
- Advanced services
- Increased availability of compute/storage resources
- Dense web server clusters, supercomputers, etc.
- Cheap storage at the Terabyte scale
- Advances in application concepts
- Collaborative science and engineering
- Distributed analysis and simulation
- Advanced scientific instruments
- Remote control room (UF - Mauna Kea)
- ...
10. Today's Information Infrastructure
- O(10^6) nodes
- Network-centric
- Simple, fixed end systems
- Few embedded capabilities
- Few services
- No user-level quality of service
11. Tomorrow's Information Infrastructure
- O(10^9) nodes
- Application-centric
- Heterogeneous, mobile end-systems
- Many embedded capabilities (caching, resource discovery, processing, QoS)
- Rich services
- User-level quality of service
Qualitatively different, not just faster and more reliable
12. Simple View of Grid Services
- Apps: rich set of applications
- App Toolkits: remote viz toolkit, remote computation toolkit, remote data toolkit, remote sensors toolkit, remote collaboration toolkit, ...
- Grid Services: protocols, authentication, policy, resource management, instrumentation, data discovery, etc.
- Grid Fabric: archives, networks, computers, display devices, etc., plus associated local services
Globus Project: http://www.globus.org/
From Ian Foster
13. Example: Online Instrumentation
Advanced Photon Source
- Real-time collection
- Archival storage
- Tomographic reconstruction
- Wide-area dissemination
- Desktop VR clients with shared controls
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U. Chicago
From Ian Foster
14. Emerging Production Grids: Earthquake Engineering Simulation
- NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
- National infrastructure to couple earthquake engineers with experimental facilities, databases, computers
- On-demand access to experiments, data streams, computing, archives, collaboration
http://www.neesgrid.org/
15. Data Intensive Science
16. Fundamental IT Challenge
- Scientific communities of thousands, distributed globally and served by networks with bandwidths varying by orders of magnitude, need to extract small signals from enormous backgrounds via computationally demanding (Teraflops-Petaflops) analysis of datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale.
17. Data Intensive Science 2000-2015
- Scientific discovery increasingly driven by IT
- Computationally intensive analyses
- Massive data collections
- Rapid access to large subsets
- Data distributed across networks of varying capability
- Dominant factor: data growth (1 Petabyte = 1000 TB)
- 0.5 Petabytes in 2000
- 10 Petabytes by 2005
- 100 Petabytes by 2010
- 1000 Petabytes by 2015?
18. Data Intensive Sciences
- High energy and nuclear physics
- Gravity wave searches (e.g., LIGO, GEO, VIRGO)
- Astronomical sky surveys (e.g., Sloan Sky Survey)
- Virtual Observatories
- Earth Observing System
- Climate modeling
- Geophysics
- Computational chemistry?
19. Data Intensive Biology and Medicine
- Radiology data
- X-ray sources (APS crystallography data)
- Molecular genomics (e.g., Human Genome)
- Proteomics (protein structure, activities, ...)
- Simulations of biological molecules in situ
- Human Brain Project
- Global Virtual Population Laboratory (disease outbreaks)
- Telemedicine
- Etc.
20. Example: High Energy Physics
Compact Muon Solenoid (CMS) at the LHC (CERN)
[Detector schematic, with the Smithsonian "standard man" shown for scale]
21. LHC Computing Challenges
- Events resulting from beam-beam collisions
- Signal event is obscured by ~20 overlapping, uninteresting collisions in the same crossing
- CPU time does not scale from previous generations
[Figure: simulated event displays compared for 2000 and 2006 conditions]
22. LHC: Higgs Decay into 4 Muons
10^9 events/sec, selectivity: 1 in 10^13
23. LHC Computing Challenges
- Complexity of the LHC environment and resulting data
- Scale: Petabytes of data per year (~100 PB by 2010)
- Geographical distribution of people and resources: 1800 physicists, 150 institutes, 32 countries
24. Example: National Virtual Observatory
Multi-wavelength astronomy, multiple surveys
25. NVO Data Challenge
- Digital representation of the sky
- All-sky + deep fields
- Integrated catalog and image databases
- Spectra of selected samples
- Size of the archived data (back-of-envelope check below)
- 40,000 square degrees
- 2 trillion pixels (1/2 arcsecond resolution)
- One band (2 bytes/pixel): 4 Terabytes
- Multi-wavelength: 10-100 Terabytes
- Time dimension: a few Petabytes
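As a quick check of the sizes quoted above, here is a back-of-envelope Python sketch (added here, not from the original slides): 40,000 square degrees at 0.5-arcsecond pixels and 2 bytes per pixel indeed gives roughly 2 trillion pixels and ~4 TB per band. The 10-100 band multiplier used for the multi-wavelength line is an assumption.

```python
# Back-of-envelope check of the NVO numbers above.
ARCSEC_PER_DEG = 3600
pixel_size = 0.5                     # arcseconds per pixel
sky_area_deg2 = 40_000               # square degrees

pixels_per_deg2 = (ARCSEC_PER_DEG / pixel_size) ** 2
total_pixels = sky_area_deg2 * pixels_per_deg2
one_band_bytes = total_pixels * 2    # 2 bytes/pixel

print(f"total pixels: {total_pixels:.2e}")                   # ~2.1e12 ("2 trillion")
print(f"one band:     {one_band_bytes / 1e12:.1f} TB")       # ~4 TB
print(f"10-100 bands: {one_band_bytes*10/1e12:.0f}-{one_band_bytes*100/1e12:.0f} TB")
```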
26. NVO Computing Challenges
- Large distributed database engines
- Gbyte/s aggregate I/O speed
- High-speed (10 Gbits/s) backbones
- Cross-connecting the major archives
- Scalable computing environment
- 100s of CPUs for statistical analysis and discovery
28. GriPhyN Institutions
- U Florida
- U Chicago
- Boston U
- Caltech
- U Wisconsin, Madison
- USC/ISI
- Harvard
- Indiana
- Johns Hopkins
- Northwestern
- Stanford
- U Illinois at Chicago
- U Penn
- U Texas, Brownsville
- U Wisconsin, Milwaukee
- UC Berkeley
- UC San Diego
- San Diego Supercomputer Center
- Lawrence Berkeley Lab
- Argonne
- Fermilab
- Brookhaven
29. GriPhyN: App. Science + CS + Grids
- GriPhyN = Grid Physics Network
- US-CMS: High Energy Physics
- US-ATLAS: High Energy Physics
- LIGO/LSC: Gravity wave research
- SDSS: Sloan Digital Sky Survey
- Strong partnership with computer scientists
- Design and implement production-scale grids
- Investigation of the Virtual Data concept (fig.)
- Integration into 4 major science experiments
- Develop common infrastructure, tools and services
- Builds on existing foundations: PPDG project, Globus tools
- Multi-year project, ~$70M total cost to NSF, covering:
- R&D
- Tier 2 center hardware, personnel (fig.)
- Networking?
30Data Grid Hierarchy
Tier0 CERNTier1 National LabTier2 Regional
Center at UniversityTier3 University
workgroupTier4 Workstation
- GriPhyN
- RD
- Tier2 centers
- Unify all IT resources
31. LHC Grid Hierarchy
- Experiment → Online System at ~PBytes/sec (bunch crossings every 25 nsec, ~100 triggers per second, each event ~1 MByte; arithmetic check below)
- Online System → CERN Computer Center (~20 TIPS), acting as Tier 0 + 1, at ~100 MBytes/sec
- Tier 0 + 1 → Tier 1 national centers (France, Italy, UK, USA) at 2.5 Gbits/sec
- Tier 1 → Tier 2 regional centers at 2.5 Gbits/sec
- Tier 2 → Tier 3 institutes (~0.25 TIPS each, with physics data caches) at 622 Mbits/sec; physicists work on analysis channels, with ~10 physicists per institute working on one or more channels
- Tier 3 → Tier 4 (workstations, other portals) at 100 - 1000 Mbits/sec
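The bandwidth into Tier 0 and the yearly volume follow directly from the trigger parameters above. A short arithmetic sketch (the ~1e7 live seconds per year is an assumed figure, not stated on the slide):

```python
# Rough arithmetic behind the hierarchy above, under stated assumptions:
# ~100 triggered events/sec, ~1 MB/event, ~1e7 live seconds per year.
trigger_rate_hz = 100          # events written per second
event_size_mb = 1.0            # MBytes per event
seconds_per_year = 1e7         # assumed accelerator "live" seconds per year

rate_mb_s = trigger_rate_hz * event_size_mb
yearly_pb = rate_mb_s * seconds_per_year / 1e9   # MB -> PB

print(f"raw stream to Tier 0: ~{rate_mb_s:.0f} MB/s")   # ~100 MB/s, as on the slide
print(f"stored per year:      ~{yearly_pb:.1f} PB")     # order of a Petabyte per year
```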
32. Tier 2 Architecture and Cost (2006)
- CPU Farm (32K SI95): $150K - $270K
- RAID Array (150 TB): $215K - $355K
- Data Server: $60K - $140K
- LAN Switches: $60K
- Small Tape Library: $40K
- Tape Media and Consumables: $20K
- Installation and Infrastructure: $30K
- Collaborative Tools and Infrastructure: $20K
- Software Licenses: $40K
- Total Estimated Cost (First Year): $635K - $955K
- Requires small (1.5 - 2 FTE) support staff per Tier 2
33. Tier 2 Site, 2001 (One Version)
[Architecture diagram: Gigabit and Fast Ethernet switches connecting worker nodes, a RAID-backed data server, a VRVS/MPEG2 videoconferencing node, and a router with OC-3/OC-12 wide-area links]
34. GriPhyN R&D Funded
- NSF results announced Sep. 13, 2000
- $11.9M from the NSF Information Technology Research Program
- $1.4M in matching funds from universities
- Largest of all ITR awards
- Scope of ITR funding
- Major costs for people, especially students and postdocs
- 2/3 CS + 1/3 application science
- Industry partnerships needed to realize the scope
- Microsoft, Intel, IBM, Sun, HP, SGI, Compaq, Cisco
- Education and outreach
- Reach non-traditional students and other constituencies
- University partnerships
- Grids are natural for integrating intellectual resources from all locations
35. GriPhyN Philosophy
- Fundamentally alters the conduct of scientific research
- Old: people and resources flow inward to labs
- New: resources and data flow outward to universities
- Strengthens universities
- Couples universities to data intensive science
- Couples universities to national and international labs
- Brings front-line research to students
- Exploits intellectual resources of formerly isolated schools
- Opens new opportunities for minority and women researchers
- Builds partnerships to drive new IT/science advances
- Physics + Astronomy
- Application Science + Computer Science
- Universities + Laboratories
- Fundamental sciences + IT infrastructure
- Research Community + IT industry
36. GriPhyN Research Agenda
- Virtual Data technologies (fig.)
- Derived data, calculable via algorithm (e.g., most HEP data)
- Instantiated 0, 1, or many times
- Fetch vs. execute algorithm
- Very complex (versions, consistency, cost calculation, etc.)
- Planning and scheduling
- User requirements (time vs. cost)
- Global and local policies + resource availability
- Complexity of scheduling in a dynamic environment (hierarchy)
- Optimization and ordering of multiple scenarios
- Requires simulation tools, e.g. MONARC
37. Virtual Data in Action
- A data request may (see the sketch below):
- Compute locally
- Compute remotely
- Access local data
- Access remote data
- Scheduling based on:
- Local policies
- Global policies
- Local autonomy
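A minimal sketch of the fetch-vs.-execute decision described on the last two slides, assuming a toy catalog and cost model. This is illustrative only; the class and method names are invented and are not the GriPhyN Virtual Data Toolkit API.

```python
# Sketch of the "fetch vs. execute" decision at the heart of Virtual Data.
# Catalog, cost model, and names are illustrative assumptions, not the VDT API.
from dataclasses import dataclass
from typing import Callable, Dict, Any

@dataclass
class Derivation:
    name: str
    compute: Callable[[], Any]     # algorithm that can (re)materialize the data
    compute_cost: float            # estimated CPU cost (arbitrary units)

class VirtualDataCatalog:
    def __init__(self):
        self.cache: Dict[str, Any] = {}        # materialized instances (0, 1, or many)

    def request(self, d: Derivation, transfer_cost: float) -> Any:
        """Return the derived data, either by fetching a cached copy or by
        re-running the algorithm, whichever the cost model prefers."""
        if d.name in self.cache and transfer_cost <= d.compute_cost:
            return self.cache[d.name]          # fetch: cheaper than recomputing
        result = d.compute()                   # execute: rematerialize from the recipe
        self.cache[d.name] = result
        return result

# Toy usage: a product that is cheap to move but expensive to recompute.
catalog = VirtualDataCatalog()
reco = Derivation("reco_run42", compute=lambda: sum(range(1000)), compute_cost=50.0)
print(catalog.request(reco, transfer_cost=5.0))   # first call executes
print(catalog.request(reco, transfer_cost=5.0))   # second call fetches the cached copy
```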
38. Research Agenda (cont.)
- Execution management
- Co-allocation of resources (CPU, storage, network transfers)
- Fault tolerance, error reporting
- Agents (co-allocation, execution)
- Reliable event service across the Grid
- Interaction, feedback to planning
- Performance analysis
- Instrumentation and measurement of all grid components
- Understand and optimize grid performance
- Simulations (MONARC project at CERN)
- Virtual Data Toolkit (VDT)
- VDT = virtual data services + virtual data tools
- One of the primary deliverables of the R&D effort
- Ongoing activity with feedback from experiments (5-year plan)
- Technology transfer mechanism to other scientific domains
39. GriPhyN: PetaScale Virtual Data Grids
[Architecture diagram: production teams, individual investigators, and workgroups use interactive user tools; request planning and scheduling tools, request execution and management tools, and virtual data tools sit above resource management services, security and policy services, and other Grid services; transforms act on distributed resources (code, storage, computers, and network) and the raw data source]
40. Model Architecture for Data Grids
[Diagram: an application passes an attribute specification to a metadata catalog, resolves the logical collection and logical file name against a replica catalog listing multiple locations, uses MDS and NWS performance information and predictions for replica selection, and issues GridFTP commands to the selected replica at one of several storage systems (disk caches, disk array, tape library) at replica locations 1-3; a small replica-selection sketch follows]
From Ian Foster
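A hedged sketch of the replica-selection step in the diagram above: given NWS-style bandwidth predictions for each replica location, pick the copy with the shortest predicted transfer time and return its GridFTP-style URL. The hostnames and bandwidth numbers are invented placeholders, not real catalog or NWS output.

```python
# Sketch of replica selection: choose the replica with the best predicted
# transfer time. Inputs stand in for the Replica Catalog and NWS predictions.
from typing import Dict

def select_replica(replicas: Dict[str, float], file_size_gb: float) -> str:
    """replicas maps a gsiftp:// URL to a predicted bandwidth in MB/s.
    Returns the URL with the smallest predicted transfer time."""
    def transfer_time(url: str) -> float:
        return file_size_gb * 1024 / replicas[url]       # seconds
    return min(replicas, key=transfer_time)

replicas = {
    "gsiftp://tier1.example.org/data/run42.root": 40.0,   # predicted MB/s
    "gsiftp://tier2.example.edu/data/run42.root": 90.0,
    "gsiftp://cern.example.ch/data/run42.root":   15.0,
}
best = select_replica(replicas, file_size_gb=2.0)
print("fetch from:", best)    # the tier2 copy wins under these predictions
```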
41. Cluster Engineering
- Cluster management and software
- Cluster-wide upgrades for software and operating system
- Low-cost and low-personnel management
- Queuing software
- Performance monitoring tools (web-based)
- Fault-tolerance, healing
- Remote management
42. Cluster Engineering (cont.)
- Cluster performance
- High-speed and distributed file systems
- Scaling to very large clusters (100s - 1000s)
- Evolution of high-speed I/O bus technologies
- Linux vs. other operating systems
- Multi-processor architectures
- Lightweight multi-gigabit/s LAN protocols
- New switching/routing technologies
- Instrumentation to understand performance
43. Other Grid Research/Engineering
- General grid research problems
- Scheduling in a complex, distributed environment
- Wide-area networks
- High-performance and distributed databases
- Replication of databases at remote sites (caching)
- Portable execution environments
- Enterprise-level fault tolerance
- Monitoring
44. Remote Database Replication (PPDG)
ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U. Wisc/CS
Site-to-Site Data Replication Service: 100 Mbytes/sec
- PRIMARY SITE: data acquisition, CPU, disk, tape robot
- SECONDARY SITE: CPU, disk, tape robot
- First-round goal: optimized cached read access to 10-100 Gbytes drawn from a total data set of 0.1 to 1 Petabyte (sketch below)
- Matchmaking, co-scheduling: SRB, Condor, Globus services; HRM, NWS
Multi-Site Cached File Access Service
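A minimal sketch of the cached read-access idea: a secondary site serves files from its local disk cache and pulls misses from the primary site. A plain filesystem copy stands in for the wide-area transfer; the paths and function are placeholders and not the PPDG/SRB interface.

```python
# Sketch: a secondary site serves files from a local disk cache and pulls
# misses from the primary site. Paths and the copy call are placeholders.
import shutil
from pathlib import Path

PRIMARY = Path("/primary_site/data")      # stands in for the primary site's store
CACHE = Path("/secondary_site/cache")     # local disk cache at the secondary site

def cached_read(filename: str) -> Path:
    """Return a local path for `filename`, copying it from the primary site
    on a cache miss (a real service would use GridFTP and enforce quotas)."""
    local = CACHE / filename
    if not local.exists():                      # cache miss
        local.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(PRIMARY / filename, local)  # wide-area transfer in real life
    return local

# Usage: repeated reads of the same file hit the cache after the first transfer.
# path = cached_read("run42/event_sample.root")
```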
45. Simulation of Analysis Model (MONARC)
- RAW Data: 3000 SI95-sec/event, 1 job/year (experiment-wide activity, ~10^9 events)
- Reconstruction: 3000 SI95-sec/event, 3 jobs/year; re-processed 3 times per year as new detector calibrations or understanding become available (rough CPU arithmetic below)
- Monte Carlo: 5000 SI95-sec/event
- Selection: 25 SI95-sec/event, ~20 jobs/month; iterative selection once per month, with trigger-based and physics-based refinements (~20 groups' activity, 10^9 → 10^7 events)
- Analysis: 10 SI95-sec/event, 500 jobs/day; different physics cuts and MC comparison once per day, algorithms applied to data to get results (~25 individuals per group, 10^6 - 10^8 events)
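As a rough sense of scale for the reconstruction line above, here is a back-of-envelope sketch; the per-CPU rating of ~50 SI95 and the use of a full calendar year for averaging are assumptions, not figures from the slide.

```python
# Back-of-envelope CPU arithmetic implied by the reconstruction rates above.
# SI95 = SPECint95 units; the ~50 SI95/CPU rating is an assumed figure.
events = 1e9                     # experiment-wide sample (events/year)
reco_cost = 3000                 # SI95*sec per event for reconstruction
passes_per_year = 3              # re-processed three times per year

total_si95_sec = reco_cost * events * passes_per_year     # 9e12 SI95*sec/year
seconds_per_year = 3.15e7
avg_si95 = total_si95_sec / seconds_per_year              # sustained SI95 needed
cpus_needed = avg_si95 / 50                               # assuming ~50 SI95 per CPU

print(f"reconstruction alone: ~{avg_si95:,.0f} SI95 sustained "
      f"(~{cpus_needed:,.0f} CPUs at 50 SI95 each)")
```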
46. Major Data Grid Projects
- Earth System Grid (DOE Office of Science)
- Data Grid technologies, climate applications
- http://www.scd.ucar.edu/css/esg/
- Particle Physics Data Grid (DOE Science)
- Data Grid applications for high energy physics experiments
- http://www.ppdg.net/
- European Data Grid (EU)
- Data Grid technologies, deployment in the EU
- http://grid.web.cern.ch/grid/
- GriPhyN (NSF)
- Investigation of the Virtual Data concept
- Integration into 4 major science experiments
- Broad application to other disciplines via the Virtual Data Toolkit
- http://www.griphyn.org/
47. The Grid Landscape Today
- Data Grid hierarchy is the baseline computing assumption for large experiments
- Recent development (past ~6 months)
- Strong interest from other research groups
- Nuclear physics (ALICE experiment at the LHC)
- VO community in Europe
- Gravity wave community in Europe
- Collaboration of major data grid projects
- GriPhyN, PPDG, EU DataGrid
- Develop common infrastructure, collaboration
- Major meeting in Amsterdam, March 4
48. The Grid Landscape Today (cont.)
- New players
- CAL-IT2 ($112M)
- Distributed Teraflop Facility, NSF-0151 ($45M)
- UK-PPARC ($40M)
- Other national initiatives
- iVDGL: international Virtual-Data Grid Laboratory
- For national- and international-scale Grid tests and operations
- Initially US + UK + EU this year
- Add other world regions later
- Talks with Japan, Russia, China, South America, India, Pakistan
- New GriPhyN proposal to NSF ITR2001 ($15M)
- Additional iVDGL deployment
- Integration of grid tools into applications
- Toolkit support
- Deployment of small systems at small colleges (E/O)
49. Grid References
- Grid Book: www.mkp.com/grids
- Globus: www.globus.org
- Global Grid Forum: www.gridforum.org
- PPDG: www.ppdg.net
- EU DataGrid: grid.web.cern.ch/grid/
- GriPhyN: www.griphyn.org, www.phys.ufl.edu/avery/griphyn
50. Summary
- Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing
- Grids offer major benefits to data intensive sciences
- Many challenges during the coming transition
- New grid projects will provide rich experience and lessons
- Difficult to predict the situation even 3-5 years ahead