Grids for 21st Century Data Intensive Science (PowerPoint presentation transcript)

1
  • Grids for 21st CenturyData Intensive Science

Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/  avery@phys.ufl.edu
NSF Meeting, Apr. 15, 2003
2
The Grid Concept
  • Grid: Geographically distributed computing
    resources configured for coordinated use
  • Fabric: Physical resources and networks that
    provide the raw capability
  • Middleware: Software that ties it all together
    (tools, services, etc.)
  • Goal: Transparent resource sharing

3
Fundamental Idea: Resource Sharing
  • Resources for complex problems are distributed
  • Advanced scientific instruments (accelerators,
    telescopes, …)
  • Storage, computing, …
  • Groups of people, institutions
  • Communities require access to common services
  • Research collaborations (physics, astronomy,
    biology, engineering, …)
  • Government agencies
  • Health care organizations, large corporations, …
  • Virtual Organizations (VOs)
  • Create a VO from geographically separated
    components
  • Make all community resources available to any VO
    member
  • Leverage strengths at different institutions
  • Add people and resources dynamically

4
Some (Realistic) Grid Examples
  • High energy physics
  • 3,000 physicists worldwide pool Petaflops of CPU
    resources to analyze Petabytes of data
  • Climate modeling
  • Climate scientists visualize, annotate, analyze
    Terabytes of simulation data
  • Biology
  • A biochemist exploits 10,000 computers to screen
    100,000 compounds in an hour
  • Engineering
  • A multidisciplinary analysis in aerospace couples
    code and data in four companies to design a new
    airframe

From Ian Foster
5
Grid Challenges
  • Manage workflow across Grid
  • Balance policy vs. instantaneous capability to
    complete tasks
  • Balance effective resource use vs. fast
    turnaround for priority jobs
  • Match resource usage to policy over the long term
  • Goal-oriented algorithms steering requests
    according to metrics
  • Maintain a global view of resources and system
    state
  • Coherent end-to-end system monitoring
  • Adaptive learning: new paradigms for execution
    optimization
  • Handle user-Grid interactions
  • Guidelines, agents
  • Build high-level services: integrated user
    environment

6
Data Intensive Science 2000-2015
  • Scientific discovery increasingly driven by data
    collection
  • Computationally intensive analyses
  • Massive data collections
  • Data distributed across networks of varying
    capability
  • Internationally distributed collaborations
  • Dominant factor: data growth (1 Petabyte = 1000
    TB)
  • 2000: 0.5 Petabytes
  • 2005: 10 Petabytes
  • 2010: 100 Petabytes
  • 2015: 1000 Petabytes?

How to collect, manage, access, and interpret
this quantity of data?
Drives demand for Data Grids to handle the
additional dimension of data access and
movement
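The projected volumes above imply a steady exponential. A quick sketch (volumes are the slide's figures; the growth rate is derived here, not stated on the slide):

```python
# Derive the compound annual growth rate implied by the projections above.
volumes_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}  # slide figures

span_years = 2015 - 2000
cagr = (volumes_pb[2015] / volumes_pb[2000]) ** (1 / span_years) - 1
print(f"Implied growth: {cagr:.0%} per year")  # about 66% per year
```

A 2000x increase over 15 years works out to roughly a 10x increase every 5 years, matching the listed milestones.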
7
Data Intensive Physical Sciences
  • High energy nuclear physics
  • Including new experiments at CERN's Large Hadron
    Collider
  • Astronomy
  • Digital sky surveys: SDSS, VISTA, other Gigapixel
    arrays
  • VLBI arrays: multiple-Gbps data streams
  • Virtual Observatories (multi-wavelength
    astronomy)
  • Gravity wave searches
  • LIGO, GEO, VIRGO
  • Time-dependent 3-D systems (simulation & data)
  • Earth Observation, climate modeling
  • Geophysics, earthquake modeling
  • Fluids, aerodynamic design
  • Dispersal of pollutants in atmosphere

8
Data Intensive Biology and Medicine
  • Medical data
  • X-Ray, mammography data, etc. (many petabytes)
  • Digitizing patient records (ditto)
  • X-ray crystallography
  • Bright X-Ray sources, e.g. Argonne Advanced
    Photon Source
  • Molecular genomics and related disciplines
  • Human Genome, other genome databases
  • Proteomics (protein structure, activities, …)
  • Protein interactions, drug delivery
  • Brain scans (1-10 μm, time-dependent)

9
Example: LHC Physics
Compact Muon Solenoid at the LHC (CERN)
Smithsonian standard man
10
LHC Data Rates: Detector to Storage
  • Detector output: 40 MHz, 1000 TB/sec
  • Physics filtering cascade:
  • Level 1 Trigger (special hardware): 75 kHz, 75 GB/sec
  • Level 2 Trigger (commodity CPUs): 5 kHz, 5 GB/sec
  • Level 3 Trigger (commodity CPUs): 100 Hz, 100-400 MB/sec
  • Raw data to storage
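The cascade can be cross-checked by dividing throughput by event rate at each stage. This sketch uses the slide's rates and throughputs; the per-event sizes are derived here:

```python
# Cross-check of the trigger cascade: bytes/sec divided by events/sec
# gives the per-event size at each stage (rates and throughputs are the
# slide's figures; the derived sizes are not printed there).
stages = {
    "Detector output": (40e6, 1000e12),   # 40 MHz, 1000 TB/s
    "After Level 1":   (75e3, 75e9),      # 75 kHz, 75 GB/s
    "After Level 2":   (5e3,  5e9),       # 5 kHz, 5 GB/s
    "After Level 3":   (100,  100e6),     # 100 Hz, 100 MB/s (low end)
}
for name, (rate_hz, throughput) in stages.items():
    print(f"{name}: {throughput / rate_hz / 1e6:.0f} MB/event")

# Overall rejection: 40 MHz in, 100 Hz out, a factor of 400,000
print(f"Rejection factor: {40e6 / 100:.0f}")
```

The detector produces roughly 25 MB per event, and each trigger level passes events of about 1 MB; the filtering is almost entirely a reduction in event rate, not event size.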
11
LHC Higgs Decay into 4 muons
10^9 events/sec, selectivity: 1 in 10^13
12
LHC Computing Overview
  • Complexity: Millions of individual detector
    channels
  • Scale: PetaOps (CPU), Petabytes (Data)
  • Distribution: Global distribution of people and
    resources

1800 Physicists, 150 Institutes, 32 Countries
13
Global LHC Data Grid
CMS Experiment
Capacity ratio Tier0 : (Σ Tier1) : (Σ Tier2) ≈ 1:1:1
  • Online System → Tier 0 (CERN Computer Center,
    ~20 TIPS): 100-400 MBytes/s
  • Tier 0 → Tier 1: 10-40 Gbps
  • Tier 1 → Tier 2: 2.5-10 Gbps
  • Tier 2 → Tier 3 (physics cache): 1-2.5 Gbps
  • Tier 3 → Tier 4 (PCs, other portals): 1-10 Gbps
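To put the tier bandwidths in perspective, a sketch of how long a hypothetical 1 TB dataset would take to traverse each link (bandwidths are the slide's figures, taken at the low end of each range; the 1 TB dataset is an assumed example, not from the slide):

```python
# Illustrative only: time to move a hypothetical 1 TB dataset over each
# tier-to-tier link, using the low end of the bandwidth ranges above.
links_bps = {
    "Tier 0 -> Tier 1": 10e9,    # 10 Gbps
    "Tier 1 -> Tier 2": 2.5e9,   # 2.5 Gbps
    "Tier 2 -> Tier 3": 1e9,     # 1 Gbps
}
DATASET_BITS = 1e12 * 8          # 1 TB expressed in bits

for link, bps in links_bps.items():
    hours = DATASET_BITS / bps / 3600
    print(f"{link}: {hours:.1f} hours per TB")
```

Even at these bandwidths, moving Terabyte-scale datasets down the hierarchy takes hours, which is why the architecture caches data near the analysis tiers rather than shipping it on demand.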
14
Example: Digital Astronomy Trends
  • Future dominated by detector improvements
  • Moore's Law growth in CCDs
  • Gigapixel arrays on horizon
  • Growth in CPU/storage tracking data volumes
  • Investment in software critical

Glass vs. MPixels
  • Total area of 3m telescopes in the world, in m²
  • Total number of CCD pixels, in Mpix
  • 25-year growth: 30x in glass, 3000x in pixels

15
The Age of Astronomical Mega-Surveys
  • Next generation mega-surveys will change
    astronomy
  • Top-down design
  • Large sky coverage
  • Sound statistical plans
  • Well controlled, uniform systematics
  • The technology to store and access the data is
    here
  • Following Moore's law
  • Integrating these archives for the whole
    community
  • Astronomical data mining will lead to stunning
    new discoveries
  • Virtual Observatory

16
Virtual Observatories
Multi-wavelength astronomy,Multiple surveys
17
Virtual Observatory Data Challenge
  • Digital representation of the sky
  • All-sky deep fields
  • Integrated catalog and image databases
  • Spectra of selected samples
  • Size of the archived data
  • 40,000 square degrees
  • Resolution: 50 trillion pixels
  • One band (2 bytes/pixel): 100 Terabytes
  • Multi-wavelength: 500-1000 Terabytes
  • Time dimension: Many Petabytes
  • Large, globally distributed database engines
  • Multi-Petabyte data size
  • Thousands of queries per day, GByte/s I/O speed
    per site
  • Data Grid computing infrastructure
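The quoted sizes are internally consistent. A sketch checking the one-band figure and the pixel scale it implies (pixel count, sky coverage, and bytes/pixel are the slide's figures; the arcsec-per-pixel derivation is added here):

```python
import math

# Check the archive sizes quoted for the Virtual Observatory.
PIXELS = 50e12          # "50 trillion pixels" (slide figure)
SQ_DEGREES = 40_000     # sky coverage (slide figure)
BYTES_PER_PIXEL = 2     # one band (slide figure)

# One-band archive size: 50e12 px * 2 B = 1e14 B = 100 TB, as quoted
one_band_tb = PIXELS * BYTES_PER_PIXEL / 1e12
print(f"One band: {one_band_tb:.0f} TB")

# Implied pixel scale (derived, not on the slide):
# 1 square degree = 3600 x 3600 arcsec^2
pixels_per_sq_arcsec = PIXELS / (SQ_DEGREES * 3600 ** 2)
print(f"Pixel scale: ~{1 / math.sqrt(pixels_per_sq_arcsec):.2f} arcsec/pixel")
```

The implied sampling of roughly 0.1 arcsec/pixel is finer than typical ground-based seeing, so the 50-trillion-pixel figure comfortably oversamples the whole surveyed sky.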

18
Sloan Sky Survey Data Grid
19
U.S. Grid Projects
20
Global Context Data Grid Projects
  • U.S. Infrastructure Projects
  • Particle Physics Data Grid (PPDG): DOE
  • GriPhyN: NSF
  • International Virtual Data Grid Laboratory
    (iVDGL): NSF
  • TeraGrid: NSF
  • DOE Science Grid: DOE
  • NSF Middleware Initiative (NMI): NSF
  • EU, Asia: major projects
  • European Data Grid (EDG) (EU, EC)
  • EDG-related national projects (UK, Italy, France, …)
  • CrossGrid (EU, EC)
  • DataTAG (EU, EC)
  • LHC Computing Grid (LCG) (CERN)
  • Japanese project
  • Korean project

21
Particle Physics Data Grid
  • Funded 2001-2004 @ US$9.5M (DOE)
  • D0, BaBar, STAR, CMS, ATLAS
  • SLAC, LBNL, Jlab, FNAL, BNL, Caltech, Wisconsin,
    Chicago, USC

22
PPDG Goals
  • Serve high energy nuclear physics (HENP)
    experiments
  • Unique challenges, diverse test environments
  • Develop advanced Grid technologies
  • Focus on end-to-end integration
  • Maintain practical orientation
  • Networks, instrumentation, monitoring
  • DB file/object replication, caching, catalogs,
    end-to-end movement
  • Make tools general enough for wide community
  • Collaboration with GriPhyN, iVDGL, EDG, LCG
  • ESNet Certificate Authority work, security

23
GriPhyN and iVDGL
  • Both funded through NSF ITR program
  • GriPhyN: $11.9M (NSF) + $1.6M (matching),
    2000-2005
  • iVDGL: $13.7M (NSF) + $2M (matching),
    2001-2006
  • Basic composition
  • GriPhyN: 12 funded universities, SDSC, 3
    labs (~80 people)
  • iVDGL: 16 funded institutions, SDSC, 3 labs (~80
    people)
  • Experiments: US-CMS, US-ATLAS, LIGO, SDSS/NVO
  • Large overlap of people, institutions, management
  • Grid research vs. Grid deployment
  • GriPhyN: 2/3 CS + 1/3 physics (~0% H/W)
  • iVDGL: 1/3 CS + 2/3 physics (~20% H/W)
  • iVDGL: $2.5M Tier2 hardware ($1.4M LHC)
  • 4 physics experiments provide frontier challenges
  • Virtual Data Toolkit (VDT) in common

24
GriPhyN Goals
  • Conduct CS research to enable Petascale science
  • Virtual Data as unifying principle (more later)
  • Planning, execution, performance monitoring
  • Disseminate through Virtual Data Toolkit
  • Series of releases
  • Integrate into GriPhyN science experiments
  • Common Grid tools, services
  • Impact other disciplines
  • HEP, biology, medicine, virtual astronomy, eng.,
    other Grid projects
  • Educate, involve, train students in IT research
  • Undergrads, grads, postdocs, underrepresented
    groups

25
iVDGL Goals and Context
  • International Virtual-Data Grid Laboratory
  • A global Grid laboratory (US, EU, Asia, South
    America, )
  • A place to conduct Data Grid tests at scale
  • A mechanism to create common Grid infrastructure
  • A laboratory for other disciplines to perform
    Data Grid tests
  • A focus of outreach efforts to small institutions
  • Context of iVDGL in US-LHC computing program
  • Mechanism for NSF to fund proto-Tier2 centers
  • Learn how to do Grid operations (GOC)
  • International participation
  • DataTAG
  • UK e-Science programme: supports 6 CS Fellows per
    year in the U.S.

26
Goal: PetaScale Virtual-Data Grids
  • Users: Production Team, Workgroups, Single
    Researcher
  • Tools: Interactive User Tools; Request Execution
    & Management Tools; Request Planning &
    Scheduling Tools; Virtual Data Tools
  • Services: Resource Management Services, Security
    and Policy Services, Other Grid Services
  • Targets: PetaOps, Petabytes, Performance
  • Transforms
  • Distributed resources (code, storage, CPUs,
    networks)
  • Raw data source
27
GriPhyN/iVDGL Science Drivers
  • US-CMS & US-ATLAS
  • HEP experiments at LHC/CERN
  • 100s of Petabytes
  • LIGO
  • Gravity wave experiment
  • 100s of Terabytes
  • Sloan Digital Sky Survey
  • Digital astronomy (1/4 sky)
  • 10s of Terabytes
  • Massive CPU
  • Large, distributed datasets
  • Large, distributed communities

28
US-iVDGL Sites (Spring 2003)
  • Partners?
  • EU
  • CERN
  • Brazil
  • Australia
  • Korea
  • Japan

29
iVDGL Map (2002-2003)
Surfnet
DataTAG
  • New partners
  • Russia (T1)
  • China (T1)
  • Brazil (T1)
  • Romania ?

30
ATLAS Simulations on iVDGL Resources
Joint project with iVDGL
31
US-CMS Grid Testbed
32
CMS Grid Testbed Production
33
US-CMS Testbed Success Story
  • Production run for the IGT MOP Production
  • Assigned 1.5 million events for eGamma Bigjets
  • 500 sec per event on a 750 MHz processor; all
    production stages from simulation to ntuple
  • 2 months of continuous running across 5 testbed
    sites
  • Demonstrated at Supercomputing 2002

1.5 Million Events Produced! (nearly 30 CPU-years)
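A back-of-envelope check of the quoted CPU time (event count and per-event time are the slide's figures):

```python
# Back-of-envelope check of the production numbers above (slide figures).
EVENTS = 1.5e6          # events produced
SEC_PER_EVENT = 500     # per event on a 750 MHz processor, all stages

total_cpu_sec = EVENTS * SEC_PER_EVENT          # 7.5e8 seconds
cpu_years = total_cpu_sec / (365 * 24 * 3600)
print(f"{cpu_years:.1f} CPU-years of pure event processing")
# ~23.8 CPU-years; with scheduling overhead and failed or re-run jobs
# this is consistent with the "nearly 30 CPU-years" quoted on the slide
```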
34
Creation of WorldGrid
  • Joint iVDGL/DataTAG/EDG effort
  • Resources from both sides (15 sites)
  • Monitoring tools (Ganglia, MDS, NetSaint, …)
  • Visualization tools (Nagios, MapCenter, Ganglia)
  • Applications: ScienceGrid
  • CMS: CMKIN, CMSIM
  • ATLAS: ATLSIM
  • Submit jobs from US or EU
  • Jobs can run on any cluster
  • Demonstrated at IST2002 (Copenhagen)
  • Demonstrated at SC2002 (Baltimore)

35
WorldGrid Sites
36
Grid Coordination
37
U.S. Project Coordination Trillium
  • Trillium = GriPhyN + iVDGL + PPDG
  • Large overlap in leadership, people, experiments
  • Benefits of coordination
  • Common S/W base and packaging: VDT + PACMAN
  • Collaborative / joint projects: monitoring,
    demos, security, …
  • Wide deployment of new technologies, e.g. Virtual
    Data
  • Stronger, broader outreach effort
  • Forum for US Grid projects
  • Joint view, strategies, meetings, and work
  • Unified entity to deal with EU and other Grid
    projects
  • Natural collaboration across DOE and NSF
    projects
  • Funding agency interest?

38
International Grid Coordination
  • Global Grid Forum (GGF)
  • International forum for general Grid efforts
  • Many working groups, standards definitions
  • Close collaboration with EU DataGrid (EDG)
  • Many connections with EDG activities
  • HICB: HEP Inter-Grid Coordination Board
  • Non-competitive forum; strategic issues,
    consensus
  • Cross-project policies, procedures, and
    technology; joint projects
  • HICB-JTB: Joint Technical Board
  • Definition, oversight, and tracking of joint
    projects
  • GLUE interoperability group
  • Participation in LHC Computing Grid (LCG)
  • Software & Computing Committee (SC2)
  • Project Execution Board (PEB)
  • Grid Deployment Board (GDB)

39
Future U.S. Grid Efforts
  • Dynamic workspaces proposal (ITR program, $15M)
  • Extend virtual data technologies to global
    analysis communities
  • Collaboratory proposal (ITR program, $4M)
  • Develop collaborative tools for global research
  • Combination of Grids, communications, sociology,
    evaluation, …
  • BTeV proposal (ITR program)
  • FIU: Creation of CHEPREO in the Miami area
  • HEP research, participation in WorldGrid
  • Strong minority E/O, coordinated with
    GriPhyN/iVDGL
  • Research int'l network: Brazil / South America
  • Other proposals
  • ITR, MRI, NMI, SciDAC, others

40
Sub-Communities in Global CMS
Pull In Outside Resources
41
Summary
  • Progress on many fronts in PPDG/GriPhyN/iVDGL
  • Packaging: Pacman + VDT
  • Testbeds (development and production)
  • Major demonstration projects
  • Productions based on Grid tools using iVDGL
    resources
  • WorldGrid providing excellent experience
  • Excellent collaboration with EU partners
  • Excellent opportunity to build lasting
    infrastructure
  • Looking to collaborate with more international
    partners
  • Testbeds, monitoring, deploying VDT more widely
  • New directions
  • Virtual data: a powerful paradigm for LHC
    computing
  • Emphasis on Grid-enabled analysis

42
Grid References
  • Grid Book
  • www.mkp.com/grids
  • Globus
  • www.globus.org
  • Global Grid Forum
  • www.gridforum.org
  • PPDG
  • www.ppdg.net
  • GriPhyN
  • www.griphyn.org
  • iVDGL
  • www.ivdgl.org
  • TeraGrid
  • www.teragrid.org
  • EU DataGrid
  • www.eu-datagrid.org