GriPhyN: Grid Physics Network and iVDGL: International Virtual Data Grid Laboratory
1
GriPhyN: Grid Physics Network
and
iVDGL: International Virtual Data Grid Laboratory
2
Collaboratory Basics
  • Two NSF-funded Grid projects in HENP (high energy
    and nuclear physics) and computer science
  • MPS and CISE have oversight
  • GriPhyN and iVDGL are too closely related to
    discuss one without discussing the other
  • One focuses on CS research and test applications;
    the other builds an international-scale facility
    to run these tests and to address other goals as
    well
  • Share vision, personnel and components
  • These two collaboratories are part of a larger
    effort to develop the components and
    infrastructure for supporting data intensive
    science

3
Some Science Drivers
  • Computation is becoming an increasingly important
    tool of scientific discovery
  • Computationally intense analyses
  • Global collaborations
  • Large datasets
  • The increasing importance of computation in
    science is more pronounced in some fields
  • Complex (e.g. climate modeling) and high volume
    (HEP) simulations
  • Detailed rendering (e.g. biomedical informatics)
  • Data intensive science (e.g. astronomy and
    physics)
  • GriPhyN and iVDGL were founded to provide the
    models and software for the data management
    infrastructure for four large projects

4
SDSS / NVO
  • SDSS / NVO are in full production
  • Explore how the Grid can be used in astronomy
  • What's the benefit?
  • How to integrate?
  • How can the Grid be used for future sky surveys?
  • Data processing pipelines are complex
  • Has made the most sophisticated use of the
    virtual data concept

5
LIGO
  • Not in full production, but real data is being
    taken
  • LIGO I Engineering Runs
  • 35 TB since 1999 and growing
  • LIGO I Science Runs
  • 62 TB in two science runs, additional run planned
    that will generate 135 TB
  • Eventual constant operation at 270 TB/year
  • LIGO II Upgrade
  • Eventual Operation at 1-2 PB / year
  • Need distributed computing power of the Grid
  • Need virtual data catalogs for efficient
    dissemination of data and management of workflow

6
CMS / ATLAS
  • CMS and ATLAS are two experiments being
    developed for the Large Hadron Collider at CERN
  • Two projects, two cultures, but
  • Similar data challenges
  • Similar geographic distribution
  • Moving closer to common tools through the LCG
    (LHC Computing Grid)
  • Petabytes of data per year (100 PB by 2012)

7
Function Types
  • GriPhyN
  • Distributed Research Center
  • iVDGL
  • Community Data System

8
GriPhyN
9
GriPhyN Funding
  • Funded in 2000 through NSF ITR program
  • $11.9M + $1.6M matching

10
GriPhyN Project Team
  • Led by U. Florida and U. Chicago
  • PDs: Paul Avery (UF) and Ian Foster (UC)
  • 22 Participant institutions
  • 13 funded
  • 9 unfunded
  • Roughly 82 people involved
  • 2/3 of activity computer science, 1/3 physics

11
  • Funded Institutions
  • U. Florida
  • U. Chicago
  • CalTech
  • U. Wisconsin - Madison
  • USC / ISI
  • Indiana U.
  • Johns Hopkins U.
  • Texas A&M
  • UT Brownsville
  • UC Berkeley
  • U. Wisconsin - Milwaukee
  • SDSC
  • Unfunded Institutions
  • Argonne NL
  • Fermi NAL
  • Brookhaven NL
  • UC San Diego
  • U. Pennsylvania
  • U. Illinois - Chicago
  • Stanford
  • Harvard
  • Boston U.
  • Lawrence Berkeley Lab

12
Technology
  • GriPhyN's science drivers demand timely access to
    very large datasets and the computer cycles and
    information management infrastructure needed to
    manipulate and transform those datasets in a
    meaningful way
  • Data Grids are an approach to data management and
    resource sharing in environments where datasets
    are very large
  • Policy-driven resource sharing, distributed
    storage, distributed computation, replication and
    provenance tracking
  • GriPhyN and iVDGL aim to enable petascale virtual
    data grids (a minimal sketch of the data-grid idea
    follows this list)
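
To make the data-grid capabilities listed above concrete, here is a minimal Python sketch of policy-driven replica selection. It is illustrative only, not GriPhyN/VDT code; the catalog contents, site names, and cost figures are invented for the example.

# Hypothetical sketch of data-grid replica selection (not GriPhyN/VDT code).
# A logical file name (LFN) maps to physical replicas at several sites;
# a simple policy picks the replica with the lowest transfer cost.

REPLICA_CATALOG = {
    # LFN                   -> list of (site, physical URL)
    "run42/calibrated.fits": [
        ("fnal.gov", "gsiftp://fnal.gov/data/run42/calibrated.fits"),
        ("cern.ch",  "gsiftp://cern.ch/store/run42/calibrated.fits"),
    ],
}

# Illustrative per-site transfer cost as seen from the requesting site.
TRANSFER_COST = {"fnal.gov": 1.0, "cern.ch": 4.5}

def select_replica(lfn: str) -> str:
    """Return the physical URL of the cheapest replica for an LFN."""
    replicas = REPLICA_CATALOG.get(lfn, [])
    if not replicas:
        raise LookupError(f"no replica registered for {lfn}")
    site, url = min(replicas, key=lambda r: TRANSFER_COST.get(r[0], float("inf")))
    return url

print(select_replica("run42/calibrated.fits"))

In a real data grid the catalog would be a distributed service and the selection policy would draw on network monitoring, storage availability, and site policy rather than a fixed cost table.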

13

Petascale Virtual Data Grids
14
GriPhyN Datagrid Contributions
  • GriPhyN has three areas of contribution for
    achieving the DataGrid vision
  • Contributing CS research
  • Virtual Data as a unifying concept
  • Planning, execution and performance monitoring
  • Integrating these facilities in a transparent and
    high-productivity manner: making the grid as easy
    to use as a workstation and the web
  • Disseminating this research through the Virtual
    Data Toolkit and other tools
  • Chimera
  • Pegasus
  • Integrate CS research results into GriPhyN
    science projects
  • GriPhyN experiments serve as an exciting but
    demanding CS and HCI laboratory

15
Virtual Data Toolkit (VDT)
  • A suite of tools developed by the CS team to
    support science on the Grid
  • Uniting theme is virtual data
  • Nearly all data in physics / astronomy is virtual
    data - derivations of a large, well known data
    set
  • It is possible to represent derived data as the
    set of instructions that created it
  • There is no need to always copy a derived data
    set - it can be recomputed if you have the
    workflow
  • Virtual data also has a number of beneficial side
    effects, e.g. data provenance, discovery,
    re-creation, and workflow automation (see the
    sketch after this list)
  • Many packages, a few are unique to GriPhyN,
    others are common across many Grid projects
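
As a rough illustration of the virtual data idea (a sketch only, not the VDT's actual interfaces), a derived dataset can be stored as the transformation and inputs that produced it; a request for the data returns a cached copy if one exists and otherwise replays the recorded recipe. The class, method, and toy data below are invented for the example.

# Minimal sketch of the virtual-data idea (illustrative, not the VDT's API).
# A derivation records the transformation and inputs that produce a dataset;
# materialize() returns a cached copy if one exists, otherwise recomputes it.

from dataclasses import dataclass
from typing import Callable, Dict, Tuple

cache: Dict[str, object] = {}          # stands in for grid storage

@dataclass
class Derivation:
    name: str                          # logical name of the derived dataset
    transform: Callable[..., object]   # the recorded transformation
    inputs: Tuple[object, ...]         # inputs the transformation consumed

    def materialize(self) -> object:
        if self.name in cache:                 # data already derived somewhere
            return cache[self.name]
        result = self.transform(*self.inputs)  # replay the recipe (provenance)
        cache[self.name] = result
        return result

# Example: a galaxy-cluster count derived from a toy catalog of magnitudes.
cluster_count = Derivation(
    name="sdss/cluster_count",
    transform=lambda mags, cut: sum(1 for m in mags if m < cut),
    inputs=([17.2, 19.8, 16.5, 21.0], 18.0),
)
print(cluster_count.materialize())   # computed once, then served from cache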

16
Motivations
"I've come across some interesting data, but I
need to understand the nature of the corrections
applied when it was constructed before I can
trust it for my purposes."
"I've detected a calibration error in an
instrument and want to know which derived data to
recompute."
"I want to apply an astronomical analysis program
to millions of objects. If the results already
exist, I'll save weeks of computation."
"I want to search an astronomical database for
galaxies with certain characteristics. If a
program that performs this analysis exists, I
won't have to write one from scratch."
[Diagram: Data, Transformation, and Derivation nodes
linked by created-by, consumed-by / generated-by, and
execution-of relations]
Slide courtesy Ian Foster
17
Chimera
  • The Chimera Virtual Data System is one of the
    core tools of the GriPhyN Virtual Data Toolkit
  • Virtual Data Catalog
  • Represents transformation procedures and derived
    data
  • Virtual Data Language Interpreter
  • Translates user requests into Grid workflows (a
    rough sketch follows this list)
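
The following is a rough Python sketch of the catalog-plus-interpreter pattern, not Chimera's actual VDL syntax or API: derivations declare the files they consume and produce, and a request for a file is expanded into the ordered list of derivations needed to produce it. The derivation and file names are made up.

# Illustrative sketch of a virtual data catalog (not Chimera's real VDL).
# Each derivation names the files it consumes and produces; resolving a
# requested file walks the catalog backwards and emits an ordered workflow.

CATALOG = [
    # (derivation name, inputs,          outputs)
    ("extract",   ["sdss/raw.fits"],     ["fields.txt"]),
    ("findclus",  ["fields.txt"],        ["clusters.txt"]),
    ("histogram", ["clusters.txt"],      ["size_dist.png"]),
]

EXISTING = {"sdss/raw.fits"}   # files already present on the grid

def plan(target: str, steps=None) -> list:
    """Return derivations, in execution order, needed to produce `target`."""
    steps = [] if steps is None else steps
    if target in EXISTING:
        return steps
    for name, inputs, outputs in CATALOG:
        if target in outputs:
            for inp in inputs:          # recursively satisfy inputs first
                plan(inp, steps)
            if name not in steps:
                steps.append(name)
            return steps
    raise LookupError(f"no derivation produces {target}")

print(plan("size_dist.png"))   # -> ['extract', 'findclus', 'histogram']

In Chimera the equivalent information lives in the Virtual Data Catalog, and the interpreter's output is an abstract DAG handed on to the planners described on the next slides, rather than a flat list.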

18
Pegasus
  • Planning for Execution in Grids
  • Tool for mapping complex workflows onto the Grid
  • Converts an abstract Chimera workflow into a
    concrete workflow, which is sent to DAGMan for
    execution
  • DAGMan is the Condor meta-scheduler
  • Determines sites and data transfers

19
Virtual Data Processing Tools
[Diagram: VDLx is processed by the abstract planner
into an XML DAG; the XML DAG is handled either by the
Pegasus concrete planner, producing a Condor DAG, or
by the local shell planner, producing a shell DAG. A
minimal sketch of the concrete-planning step follows.]
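
Below is a hedged Python sketch of the concrete-planning step. It is illustrative only, not Pegasus itself: each abstract job is bound to a made-up site and the dependencies are written out using Condor DAGMan's JOB / PARENT ... CHILD statements; the submit-file naming convention is invented for the example.

# Hedged sketch of concrete planning (illustrative only, not Pegasus itself):
# bind each abstract job to an execution site, then emit a Condor DAGMan
# file expressing the dependencies with JOB / PARENT ... CHILD statements.

abstract_dag = {
    # job       -> jobs it depends on
    "extract":   [],
    "findclus":  ["extract"],
    "histogram": ["findclus"],
}

# Site choice would normally come from resource and data-location information;
# here it is a fixed, made-up mapping.
site_of = {"extract": "ufl.edu", "findclus": "uchicago.edu", "histogram": "ufl.edu"}

def to_dagman(dag: dict) -> str:
    lines = []
    for job in dag:
        # one Condor submit file per job, named after its chosen site
        lines.append(f"JOB {job} {job}.{site_of[job]}.submit")
    for job, parents in dag.items():
        for parent in parents:
            lines.append(f"PARENT {parent} CHILD {job}")
    return "\n".join(lines)

print(to_dagman(abstract_dag))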
20
Example: Sloan Galaxy Cluster Finder
[Diagram: DAG transforming Sloan Data into a galaxy
cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab);
Michael Milligan, Yong Zhao (Chicago)
21
Example: Sloan Galaxy Cluster
[Diagram: DAG transforming Sloan Data into a galaxy
cluster size distribution]
With Jim Annis and Steve Kent, FNAL
22
Resource Diagram
23
International Virtual Data Grid Laboratory (iVDGL)
24
Some Context
  • There is much more to the DataGrid world than
    GriPhyN
  • Broad problem space, with many cooperative
    projects
  • U.S.
  • Particle Physics Data Grid (PPDG)
  • GriPhyN
  • Europe
  • DataTAG
  • EU DataGrid
  • International
  • iVDGL

25
Background and Goals
  • U.S. portion funded in 2001 as a Large ITR
    through NSF
  • $13.7M + $2M matching
  • International partners responsible for own
    funding
  • Aims of iVDGL
  • Establish a Global Grid Laboratory
  • Conduct DataGrid tests at scale
  • Promote interoperability
  • Promote testbeds for non-physics applications

26
Relationship to GriPhyN
  • Significant overlap
  • Common management, personnel overlap
  • Roughly 80 people on each project, 120 total
  • Tight technical coordination
  • VDT
  • Outreach
  • Testbeds
  • Common External Advisory Committee
  • Different focus and domain challenges
  • GriPhyN - 2/3 CS, 1/3 physics (IT research)
  • iVDGL - 1/3 CS, 2/3 physics (testbed deployment
    and operation)

27
Project Composition
  • CS Research
  • U.S. iVDGL Institutions
  • UK e-science programme
  • DataTAG
  • EU DataGrid
  • Testbeds
  • ATLAS / CMS
  • LIGO
  • National Virtual Observatory
  • SDSS

28
iVDGL Institutions
  • U Florida CMS
  • Caltech CMS, LIGO
  • UC San Diego CMS, CS
  • Indiana U ATLAS, iGOC
  • Boston U ATLAS
  • U Wisconsin, Milwaukee LIGO
  • Penn State LIGO
  • Johns Hopkins SDSS, NVO
  • U Chicago CS
  • U Southern California CS
  • U Wisconsin, Madison CS
  • Salish Kootenai Outreach, LIGO
  • Hampton U Outreach, ATLAS
  • U Texas, Brownsville Outreach, LIGO
  • Fermilab CMS, SDSS, NVO
  • Brookhaven ATLAS
  • Argonne Lab ATLAS, CS

[Site categories: T2 / software, CS support,
T3 / outreach, T1 / labs (not funded)]
29
US-iVDGL Sites
  • Partners?
  • EU
  • CERN
  • Brazil
  • Australia
  • Korea
  • Japan

30
Component Projects
  • iVDGL contains several core projects
  • iGOC
  • International Grid Operations Center
  • GLUE
  • Grid Laboratory Uniform Environment
  • WorldGrid - 2002 international demo
  • Grid3 - 2003 deployment effort

31
iGOC
  • International Grid Operations Center
  • iVDGL headquarters
  • Analogous to a Network Operations Center
  • Located at Indiana University
  • Single point of contact for iVDGL operations
  • Database of contact information
  • Centralized information about storage, network
    and compute resources
  • Directory for monitoring services at iVDGL sites

32
GLUE
  • Grid Laboratory Uniform Environment
  • A grid interface subset specification that
    permits applications to run on grids built from
    VDT and EDG software (illustrated in the sketch
    after this list)
  • Effort to ensure interoperability across
    numerous physics grid projects
  • GriPhyN, iVDGL, PPDG
  • EU DataGrid, DataTAG, CrossGrid, etc.
  • Interoperability effort focuses on
  • Software
  • Configuration
  • Documentation
  • Test suites
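
As a loose illustration of the uniform-environment idea, the Python sketch below assumes sites installed from VDT or EDG sources publish the same resource description, so a single matchmaking routine can run against both. The attribute names are invented for the example and are not the real GLUE schema.

# Illustrative sketch of the uniform-description idea behind GLUE
# (field names are invented for this example; not the real GLUE schema).
# Sites installed from VDT or EDG sources publish the same attributes,
# so one matchmaking routine works against both.

sites = [
    {"name": "ufl.edu", "middleware": "VDT", "cpus": 128, "free_gb": 500},
    {"name": "cern.ch", "middleware": "EDG", "cpus": 512, "free_gb": 2000},
]

def matching_sites(min_cpus: int, min_free_gb: int) -> list:
    """Return sites that satisfy a job's resource requirements."""
    return [s["name"] for s in sites
            if s["cpus"] >= min_cpus and s["free_gb"] >= min_free_gb]

# The same query runs unchanged over VDT- and EDG-based sites.
print(matching_sites(min_cpus=256, min_free_gb=1000))   # -> ['cern.ch']

The real GLUE effort supplies the common schema along with the software, configuration, documentation, and test suites listed above.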

33
WorldGrid
  • Effort at a worldwide DataGrid
  • Easy to deploy and administer
  • Middleware based on VDT
  • Chimera development
  • Scalability
  • Demo at SC2002
  • United DataTAG and iVDGL

34
Resource Diagram
35
iVDGL Management
36
Issues across projects
  • Technical readiness
  • Infrastructure readiness
  • Collaboration readiness
  • Common ground
  • Coupling of tasks
  • Incentives

37
Technical readiness
  • Very high
  • Physics and CS are both very high on the adoption
    curve, generally
  • Long history of infrastructure development to
    support national and global experiments

38
Infrastructure readiness
  • Also quite high
  • Not all of the pieces are in place to meet demand
  • The expertise exists within these communities to
    build and maintain the necessary infrastructure
  • Community is inventing the infrastructure
  • Real understanding in the project that
    interoperability and standards are part of
    infrastructure

39
Collaboration readiness
  • Again, quite high
  • Physicists have a long history of large scale
    collaboration
  • CS collaborations built on old relationships with
    long time collaborators

40
Common ground
  • Perhaps a bit too high
  • What you can do with a physics background
  • Win the ACM Turing Award
  • Co-invent the World Wide Web
  • Direct the development of the Abilene backbone
  • Because the application community has a strong
    understanding of the required work and its
    technical aspects, there is some friction over
    how the work is divided
  • History of physicists building computational
    tools, e.g. ROOT

41
Coupling of tasks
  • Tasks decompose into subtasks that are somewhat
    tightly coupled
  • Locate tightly coupled tasks at individual sites

42
Incentives
  • Both groups are well motivated, but for different
    reasons
  • CS is engaged in cutting-edge research across a
    wide range of activities
  • Funded for deployment as well as development
  • Physics is structurally committed to global
    collaborations

43
Some successes
  • Lessons in infrastructure development
  • Outreach and engagement
  • Community buy-in / investment
  • Achieving the CS research goals for Virtual Data
    and Grid execution

44
Infrastructure Dev
  • Looking at the history of the Grid (electrical,
    not computational)
  • Long phases
  • Invention
  • Initial production use
  • Adaptation
  • Standardization / regulation
  • Geographically bounded dominant design
  • i.e. 220V vs. 110V

45
Infrastructure Dev
  • We don't see this with GriPhyN / iVDGL
  • Projects concurrent, not consecutive
  • Pipeline approach to phases of infrastructure
    development
  • Real efforts at cooperation with other DataGrid
    communities
  • Why?
  • Deep understanding at high levels of project that
    building it alone is not enough
  • Directive and funding from NSF to do deployment

46
Outreach
  • The GriPhyN / iVDGL community is extremely active
    in outreach to other projects and communities
  • Evangelizing virtual data
  • Distributed tools
  • This is a huge win for building CI that others
    can use

47
Community buy-in
  • Together, these projects are funded at nearly
    $30M over 6 years
  • This does not represent the total investment that
    was needed to make this work
  • Leveraged FTE
  • Unfunded testbed sites
  • International partners
  • Extensive collaboration with PPDG, some of it
    starting with Alliance Expeditions, etc.
  • This kind of community commitment necessary for a
    project of this size to succeed

48
Challenges
  • Staying relevant
  • Building infrastructure with term limited funding

49
Staying Relevant (1)
  • The application communities are fast paced, high
    power groups of people
  • Real danger in those communities developing tools
    that satisfice while they wait for the tools that
    are optimal and fit into a greater
    cyberinfrastructure
  • Each experiment ideally wants tools perfectly
    tailored to their needs
  • Maintaining user engagement and meeting the needs
    of each community is critical, but difficult

50
Staying Relevant (2)
  • In addition to staying relevant to the
    experiments, GriPhyN must also be relevant to the
    greater scientific community
  • To CS researchers
  • To similarly data-intensive projects
  • Easy to understand code, concepts, APIs, etc.
  • How do you accommodate both a focused client
    community and the broader scientific community?
  • Common challenge across many CI initiatives

51
Limited Term Investment
  • These projects are both funded under the NSF ITR
    mechanism
  • 5 year limit
  • Would you buy your telephone service from a
    company that was going to shut down after 5
    years?
  • Challenge to find a sustainable support mechanism
    for CI