The Grid: The Future of High Energy Physics Computing

Transcript and Presenter's Notes

1
The Grid: The Future of High Energy Physics
Computing?
  • Shawn McKee
  • January 7, 2002
  • University of Michigan

2
Acknowledgements
  • Disclaimer: This talk will be an overview from a
    physicist who is a grid user, rather than a
    computer scientist who is a grid expert!
  • Much of this talk was borrowed from various
    sources. I would like to thank:
  • Rob Gardner (IU)
  • Harvey Newman (Caltech)
  • Jennifer Schopf (Northwestern)
  • The Globus Team

3
Outline
  • Definitions
  • Example Grid Uses
  • HEP Motivations for the Grid
  • LHC Experiments and their scope
  • ATLAS as an example
  • LHC Tiered computing model
  • HENP Related Grid Projects
  • Grid Work at Michigan
  • Globus and the Globus Toolkit
  • Conclusions

4
What is The Grid?
  • There are many answers and interpretations
  • The term was originally coined in the mid-1990s
    (in analogy with the power grid?) and can be
    described as follows:
  • "The grid provides flexible, secure,
    coordinated resource sharing among dynamic
    collections of individuals, institutions and
    resources" (virtual organizations, or VOs)

5
Grid Perspectives
  • User's Viewpoint
  • A virtual computer which minimizes time to
    completion for my application while transparently
    managing access to inputs and resources
  • Programmer's Viewpoint
  • A toolkit of applications and APIs which provide
    transparent access to distributed resources
  • Administrator's Viewpoint
  • An environment to monitor, manage and secure
    access to geographically distributed computers,
    storage and networks.

6
Some Important Definitions
  • Resource
  • Network protocol
  • Network enabled service
  • Application Programmer Interface (API)
  • Software Development Kit (SDK)
  • Syntax
  • Not discussed, but important: policies

From Introduction to Grids and Globus
7
Resource
  • An entity that is to be shared
  • E.g., computers, storage, data, software
  • Does not have to be a physical entity
  • E.g., Condor pool, distributed file system,
  • Defined in terms of interfaces, not devices
  • E.g., schedulers such as LSF and PBS define a
    compute resource
  • Open/close/read/write define access to a
    distributed file system, e.g. NFS, AFS, DFS

8
Network Protocol
  • A formal description of message formats and a set
    of rules for message exchange
  • Rules may define sequence of message exchanges
  • Protocol may define state-change in endpoint,
    e.g., file system state change
  • Good protocols designed to do one thing
  • Protocols can be layered
  • Examples of protocols
  • IP, TCP, TLS (was SSL), HTTP, Kerberos

9
Network Enabled Services
  • Implementation of a protocol that defines a set
    of capabilities
  • Protocol defines interaction with service
  • All services require protocols
  • Not all protocols are used to provide services
    (e.g. IP, TLS)
  • Examples: FTP and Web servers

10
Application Programming Interface
  • A specification for a set of routines to
    facilitate application development
  • Refers to definition, not implementation
  • E.g., there are many implementations of MPI
  • Spec often language-specific (or IDL)
  • Routine name, number, order and type of
    arguments, mapping to language constructs
  • Behavior or function of routine
  • Examples
  • GSS API (security), MPI (message passing)
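To make the API-versus-implementation distinction concrete, here is a minimal sketch using the mpi4py Python binding (my choice of illustration, not something from the talk): the code names only standard MPI routines, so any conforming implementation such as MPICH can sit underneath unchanged.

  # Sketch: the MPI API is defined by routine names and semantics,
  # not by any one implementation. mpi4py (assumed here) maps those
  # routines onto Python; MPICH or a vendor MPI provides the library.
  from mpi4py import MPI

  comm = MPI.COMM_WORLD          # the standard world communicator
  rank = comm.Get_rank()         # MPI_Comm_rank
  size = comm.Get_size()         # MPI_Comm_size

  # Each process reports its host name; rank 0 gathers the results.
  names = comm.gather(MPI.Get_processor_name(), root=0)
  if rank == 0:
      print("Running on %d processes: %s" % (size, names))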

11
Software Development Kit
  • A particular instantiation of an API
  • SDK consists of libraries and tools
  • Provides implementation of API specification
  • Can have multiple SDKs for an API
  • Examples of SDKs
  • MPICH, Motif Widgets

12
Syntax
  • Rules for encoding information, e.g.
  • XML, Condor ClassAds, Globus RSL
  • X.509 certificate format (RFC 2459)
  • Cryptographic Message Syntax (RFC 2630)
  • Distinct from protocols
  • One syntax may be used by many protocols (e.g.,
    XML) and may be useful for other purposes
  • Syntaxes may be layered
  • E.g., Condor ClassAds -> XML -> ASCII
  • Important to understand layerings when comparing
    or evaluating syntaxes
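As a small illustration of syntax layering (a sketch of mine, using plain Python and its standard XML library rather than the actual Condor tooling; the attribute names are made up), a ClassAd-style set of attributes is expressed in XML, which is in turn encoded as ASCII bytes:

  # Sketch only: a ClassAd-like resource description (hypothetical
  # attributes), layered as attributes -> XML elements -> ASCII bytes.
  import xml.etree.ElementTree as ET

  classad = {"Machine": "linat01.grid.example.edu",   # hypothetical host
             "OpSys": "LINUX",
             "Memory": "512",
             "Requirements": "OpSys == \"LINUX\" && Memory >= 256"}

  ad = ET.Element("classad")
  for name, value in classad.items():
      attr = ET.SubElement(ad, "attribute", name=name)
      attr.text = value

  wire_bytes = ET.tostring(ad, encoding="us-ascii")  # ASCII encoding layer
  print(wire_bytes)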

13
The Grid Problem
  • Flexible, secure, coordinated resource sharing
    among dynamic collections of individuals,
    institutions, and resources
  • From "The Anatomy of the Grid: Enabling Scalable
    Virtual Organizations"
  • Enable communities (virtual organizations) to
    share geographically distributed resources as
    they pursue common goals -- assuming the absence
    of
  • central location,
  • central control,
  • omniscience,
  • existing trust relationships.

14
Elements of the Problem
  • Resource sharing
  • Computers, storage, sensors, networks,
  • Sharing is always conditional: issues of trust,
    policy, negotiation, payment, ...
  • Coordinated problem solving
  • Beyond client-server: distributed data analysis,
    computation, collaboration, ...
  • Dynamic, multi-institutional virtual orgs
  • Community overlays on classic org structures
  • Large or small, static or dynamic

15
Why Grids?
  • A biochemist exploits 10,000 computers to screen
    100,000 compounds in an hour
  • 1,000 physicists worldwide pool resources for
    petaop analyses of petabytes of data
  • Civil engineers collaborate to design, execute,
    analyze shake table experiments
  • Climate scientists visualize, annotate, analyze
    terabyte simulation datasets
  • An emergency response team couples real time
    data, weather model, population data

16
Why Grids? (cont'd)
  • A multidisciplinary analysis in aerospace couples
    code and data in four companies
  • A home user invokes architectural design
    functions at an application service provider
  • An application service provider purchases cycles
    from compute cycle providers
  • Scientists working for a multinational soap
    company design a new product
  • A community group pools members' PCs to analyze
    alternative designs for a local road

17
Online Access to Scientific Instruments
Advanced Photon Source example (diagram): real-time collection,
wide-area dissemination, desktop VR clients with shared controls,
tomographic reconstruction, archival storage.
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U. Chicago
18
Mathematicians Solve NUG30
  • Looking for the solution to the NUG30 quadratic
    assignment problem
  • An informal collaboration of mathematicians and
    computer scientists
  • Condor-G delivered 3.46E8 CPU seconds in 7 days
    (peak 1009 processors) in U.S. and Italy (8 sites)

Solution permutation: 14,5,28,24,1,3,16,15, 10,9,21,2,4,29,25,22, 13,26,
17,30,6,20,19, 8,18,7,27,12,11,23
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
19
Network for Earthquake Engineering Simulation
  • NEESgrid national infrastructure to couple
    earthquake engineers with experimental
    facilities, databases, computers, each other
  • On-demand access to experiments, data streams,
    computing, archives, collaboration

NEESgrid Argonne, Michigan, NCSA, UIUC, USC
20
Home Computers Evaluate AIDS Drugs
  • Community
  • 1000s of home computer users
  • Philanthropic computing vendor (Entropia)
  • Research group (Scripps)
  • Common goal: advance AIDS research

21
Data Grids for High Energy Physics
(Diagram of the LHC tiered data grid. CERN/outside resource ratio ~1:2;
Tier0 : (sum of Tier1) : (sum of Tier2) ~ 1:1:1.)
Tier 0+1: the online system (~1 PByte/sec) feeds the offline farm at the
CERN computer centre (~25 TIPS, HPSS) at ~100 MBytes/sec
Tier 1 (~2.5 Gbits/sec links): national centres in France, Italy, the UK,
and the BNL Center
Tier 2 (~2.5 Gbps links): regional centres
Tier 3: institutes (~0.25 TIPS each); physicists work on analysis channels,
and each institute has ~10 physicists working on one or more channels
Tier 4 (100 - 1000 Mbits/sec): physicists' workstations with a physics data
cache
22
Broader Context
  • Grid Computing has much in common with major
    industrial thrusts
  • Business-to-business, Peer-to-peer, Application
    Service Providers, Storage Service Providers,
    Distributed Computing, Internet Computing
  • Sharing issues are not adequately addressed by
    existing technologies
  • Complicated requirements: "run program X at site
    Y subject to community policy P, providing access
    to data at Z according to policy Q"
  • High performance: unique demands of advanced
    high-performance systems

23
Why Now?
  • Moore's Law improvements in computing produce
    highly functional end systems
  • The Internet and burgeoning wired and wireless
    networks provide universal connectivity
  • Changing modes of working and problem solving
    emphasize teamwork and computation
  • Network exponentials produce dramatic changes in
    geometry and geography

24
Network Exponentials
  • Network vs. computer performance
  • Computer speed doubles every 18 months
  • Network speed doubles every 9 months
  • Difference: an order of magnitude every 5 years
    (see the short calculation below)
  • 1986 to 2000
  • Computers x 500
  • Networks x 340,000
  • 2001 to 2010
  • Computers x 60
  • Networks x 4000

Moore's Law vs. storage improvements vs. optical
improvements. Graph from Scientific American
(Jan-2001) by Cleo Vilett, source Vinod Khosla,
Kleiner, Caufield and Perkins.
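Turning the doubling times above into growth factors is a one-liner; this small calculation (an illustration, not part of the original slide) shows where the order of magnitude per 5 years comes from:

  # Growth factor over a period for a given doubling time: 2 ** (months / doubling_time)
  def growth(months, doubling_months):
      return 2.0 ** (months / doubling_months)

  years = 5
  months = 12 * years
  computers = growth(months, 18.0)   # ~10x in 5 years
  networks = growth(months, 9.0)     # ~100x in 5 years
  print("Over %d years: computers x%.0f, networks x%.0f, ratio ~x%.0f"
        % (years, computers, networks, networks / computers))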
25
The Network
  • As the previous slide suggests, it can be argued
    that the evolution of the network has been the
    primary motivator for the Grid.
  • Ubiquitous, dependable worldwide networks have
    opened up the possibility of tying together
    geographically distributed resources
  • The success of the WWW for sharing information
    has spawned a push for a system to share
    resources
  • The network has become the virtual bus of a
    virtual computer.
  • More on this later

26
Motivation for the Grid
  • A HEP Perspective

27
Large Hadron Collider at CERN
28
Four LHC Experiments The Petabyte to Exabyte
Challenge
  • ATLAS, CMS, ALICE, LHCb: Higgs, new particles,
    quark-gluon plasma, CP violation

Data stored: ~40 Petabytes/year and up
CPU: ~0.30 Petaflops and up
0.1 to 1 Exabyte (1 EB = 10^18 Bytes) for the LHC
experiments (~2007 to ~2012?)
29
How Much Data is Involved?
(Scatter plot of Level-1 rate (Hz) vs. event size (bytes) for HEP
experiments: LHCb, ATLAS, CMS, HERA-B, KLOE, Tevatron Run II, CDF/D0,
H1/ZEUS, ALICE, NA49, UA1, LEP. Annotations: high Level-1 trigger rate
(~1 MHz), high number of channels / high bandwidth (~500 Gbit/s), high
data archive (PetaByte).
Source: Hans Hoffmann, DOE/NSF Review, Nov 2000.)
30
ATLAS
  • A Toroidal LHC ApparatuS
  • Collaboration
  • 150 institutes
  • 1850 physicists
  • Detector
  • Inner tracker
  • Calorimeter
  • Magnet
  • Muon
  • United States ATLAS
  • 29 universities, 3 national labs
  • ~20% of ATLAS

31
(No Transcript)
32
Discovery Potential for SM Higgs Boson
  • Good sensitivity over the full mass range from
    100 GeV to 1 TeV
  • For most of the mass range at least two channels
    available
  • Detector performance is crucial: b-tag, leptons,
    γ, E resolution, γ / jet separation, ...

33
(No Transcript)
34
Data Flow from ATLAS
40 MHz (40 TB/sec)
  -> Level 1 (special hardware): 75 kHz (75 GB/sec)
  -> Level 2 (embedded processors): 5 kHz (5 GB/sec)
  -> Level 3 (PCs): 100 Hz (100 MB/sec)
  -> data recording and offline analysis
ATLAS: ~9 PB/yr, roughly one million PC hard drives!
35
ATLAS Parameters
  • Running conditions in the early years
  • Raw event size: ~2 MB
  • 2.7x10^9 event sample -> 5.4 PB/year, before data
    processing
  • Reconstructed events, Monte Carlo data ->
  • ~9 PB/year (~2 PB disk); CPU ~2M SI95 (today's PC
    is ~20 SI95)
  • CERN alone can handle only about 1/3 of these
    resources. How will we handle this? (See the
    back-of-envelope check below.)
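The data-volume numbers on the last two slides can be checked with a few lines of arithmetic, using only the sizes and rates quoted above (the ~1 MB/event after Level 1 is implied by the quoted rates):

  # Back-of-envelope check of the ATLAS data volumes quoted above.
  MB = 1e6
  PB = 1e15

  event_size = 2 * MB            # raw event size quoted above
  events_per_year = 2.7e9        # event sample quoted above
  raw_volume = event_size * events_per_year
  print("Raw data: %.1f PB/year" % (raw_volume / PB))      # -> 5.4 PB/year

  # Trigger chain: output rate times ~1 MB/event reproduces slide 34.
  for level, rate_hz in [("Level 1", 75e3), ("Level 2", 5e3), ("Offline", 100)]:
      bw = rate_hz * 1 * MB
      print("%s output: %.3g GB/s" % (level, bw / 1e9))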

36
Data Intensive Computing and Grids
  • The term "Data Grid" is often used
  • Unfortunate, as it implies a distinct
    infrastructure, which it isn't, but it is easy to say
  • Data-intensive computing shares numerous
    requirements with collaboration, instrumentation,
    computation,
  • Security, resource mgt, info services, etc.
  • Important to exploit commonalities as very
    unlikely that multiple infrastructures can be
    maintained
  • Fortunately this seems easy to do!

37
Data Intensive Issues Include
  • Harness potentially large numbers of data,
    storage, network resources located in distinct
    administrative domains
  • Respect local and global policies governing what
    can be used for what
  • Schedule resources efficiently, again subject to
    local and global constraints
  • Achieve high performance, with respect to both
    speed and reliability
  • Catalog software and virtual data

38
Examples of Desired Data Grid Functionality
  • High-speed, reliable access to remote data
  • Automated discovery of best copy of data
  • Manage replication to improve performance
  • Co-schedule compute, storage, network
  • Transparency with respect to delivered performance
  • Enforce access control on data
  • Allow representation of global resource
    allocation policies

39
HEP Data Analysis
  • Raw data
  • hits, pulse heights
  • Reconstructed data (ESD)
  • tracks, clusters
  • Analysis Objects (AOD)
  • Physics Objects
  • Summarized
  • Organized by physics topic
  • Ntuples, histograms, statistical data

40
Production Analysis
(Diagram: the trigger system and data acquisition, together with run
conditions and calibration data, feed the Level 3 trigger and produce raw
data and trigger tags; reconstruction then yields Event Summary Data (ESD)
and event tags.)
Coordination is required at the collaboration and group levels.
41
Physics Analysis
(Diagram: event tags and raw data support event selection at Tier 0/1,
collaboration wide; analysis processing with calibration data produces
analysis objects at Tier 2, the analysis groups; physics objects and
statistical objects feed physics analysis at Tiers 3 and 4, the individual
physicists.)
42
A Model Architecture for Data Grids
(Diagram: an application presents an attribute specification to a metadata
catalog, which resolves it to a logical collection and logical file name;
a replica catalog maps this to multiple locations; replica selection uses
performance information and predictions (MDS, NWS) to pick a replica; the
selected replica is then accessed via the GridFTP control and data
channels, backed by disk caches, disk arrays and tape libraries at replica
locations 1, 2 and 3.)
43
LHC Computing Model (Based on MONARC Simulations)
  • Hierarchical, distributed tiers
  • The grid is necessary to tie these distributed
    resources together

(Diagram: Tier-0 at CERN, linked by dedicated or QoS network links to
Tier-1 national/regional computing centers (e.g. BNL, FNAL), which in turn
serve Tier-2 centers and universities.)
44
Why Worldwide Computing? Regional Center Concept
Goals
  • Managed, fair-shared access for Physicists
    everywhere
  • Maximize total funding resources while meeting
    the total computing and data handling needs
  • Balance proximity of datasets to large central
    resources, against regional resources under more
    local control
  • Tier-N Model
  • Efficient network use: higher throughput on short
    paths
  • Local > regional > national > international
  • Utilizing all intellectual resources, in several
    time zones
  • CERN, national labs, universities, remote sites
  • Involving physicists and students at their home
    institutions
  • Greater flexibility to pursue different physics
    interests, priorities, and resource allocation
    strategies by region
  • And/or by Common Interests (physics topics,
    subdetectors, ...)
  • Manage the Systems Complexity
  • Partitioning facility tasks, to manage and focus
    resources

45
Tier 2 Centers
  • Bring LHC Physics to the Universities
  • Optimize physics discovery potential
  • Standard configuration optimized for analysis at
    the DST level
  • Primary Resource for Monte Carlo Simulation
  • Production level particle searches (University
    autonomy)
  • Configuration
  • Commodity Pentium/Linux: ~100K SpecInt95
    (Tier 1: ~500K)
  • Estimated 144 dual-processor nodes (Tier 1: ~640)
  • Online storage: ~100 TB disk (Tier 1: ~1000 TB)
  • High Performance Storage Area Network

46
Who is working on the Grid?
  • HEP Perspective

47
HENP Related Data Grid Projects
  • Funded Projects
  • PPDG I USA DOE 2M 1999-2001
  • GriPhyN USA NSF 11.9M 1.6M 2000-2005
  • EU DataGrid EU EC 10M 2001-2004
  • PPDG II (CP) USA DOE 9.5M 2001-2004
  • iVDGL USA NSF 13.7M 2M 2001-2006
  • DataTAG EU EC 4M 2002-2004
  • About to be Funded Project
  • GridPP UK PPARC >15M? 2001-2004
  • Many national projects of interest to HENP
  • Initiatives in US, UK, Italy, France, NL,
    Germany, Japan,
  • EU networking initiatives (Géant, SURFNet)
  • US Distributed Terascale Facility (53M, 12
    TFL, 40 Gb/s network)

in final stages of approval
48
Grid Physics Network (GriPhyN): Enabling R&D for
advanced data grid systems, focusing in
particular on the Virtual Data concept
ATLAS CMS LIGO SDSS
49
International Virtual Data Grid Laboratory
50
TeraGrid (NCSA, ANL, SDSC, Caltech): A Preview of the Grid Hierarchy and
Networks of the LHC Era
(Network diagram: the DTF backplane (4 x lambda, 40 Gbps) links San Diego
(SDSC), Pasadena (Caltech), Urbana (NCSA/UIUC) and the Chicago area
(StarLight international optical peering point at Northwestern, see
www.startap.net; UIC; Illinois Institute of Technology; ANL; University
of Chicago), with connections to Abilene via Indianapolis (Abilene NOC).
Link types shown: OC-48 (2.5 Gb/s, Abilene), multiple 10 GbE (Qwest),
multiple 10 GbE (I-WIRE dark fiber), and multiple carrier hubs.)
  • Solid lines: in place and/or available in 2001
  • Dashed I-WIRE lines: planned for Summer 2002

Source: Charlie Catlett, Argonne
51
PACI, TeraGrid and HENP
  • The scale, complexity and global extent of the
    LHC Data Analysis problem is unprecedented
  • The solution of the problem, using globally
    distributed Grids, is mission-critical for
    frontier science and engineering
  • HENP has a tradition of deploying new highly
    functional systems (and sometimes new
    technologies) to meet its technical and
    ultimately its scientific needs
  • HENP problems are mostly "embarrassingly
    parallel" but potentially overwhelming in their
    data- and network-intensiveness
  • HENP/Computer Science synergy has increased
    dramatically over the last two years, focused on
    Data Grids
  • Successful collaborations in GriPhyN, PPDG, EU
    Data Grid
  • The TeraGrid (present and future) and its
    development program is scoped at an appropriate
    level of depth and diversity
  • to tackle the LHC and other Petascale
    problems, over a 5-year time span
  • matched to the LHC time schedule, with full
    operations in 2007

52
Selected Major Grid Projects
(Table of selected major grid projects, spanning slides 52-55; the table
itself is not transcribed. Several entries are flagged as new.)
Also many technology R&D projects, e.g., Condor,
NetSolve, Ninf, NWS. See also www.gridforum.org
56
Grid Related Work at Michigan
57
Grid Activities at Michigan
  • There are many ongoing activities related to the
    grid within the department
  • NPACI/CPC collaboration on grid development
  • Collaboration with the Visible Human Project on
    networking and grid performance issues
  • Authenticated QoS work with CITI/ITCOM
  • Collaborative tools and the Web Lecture Archive
    Project (see http://wlap.org)
  • Network issues: bandwidth, services and
    performance (I will focus on this later)

58
US ATLAS Data Grid Testbed
U Michigan
Boston University
UC Berkeley LBNL-NERSC
Argonne National Laboratory
Brookhaven National Laboratory
University of Oklahoma
Prototype Tier 2s
Indiana University
University of Texas at Arlington
HPSS sites
59
US ATLAS Grid Testbed Activities
  • We are an active participant in the US ATLAS grid
    testbed
  • Collaboration with CPC on grid issues and
    development
  • Hosted a US ATLAS Grid workshop in Winter 2001
  • Leadership in network issues
  • Strong collaborative tools effort
  • Active in the Global Grid forum and Internet2
    HENP WG
  • Testbed Activities
  • Network monitoring
  • Security configuration
  • Hardware testing for high performance bottlenecks
  • Certificate attributes (grid account management)
  • PACMAN cache site
  • Kick-start (one floppy) OS install development
  • Using AFS as a replacement for NFS

60
Internet2 HENP Networking WG Mission
  • To help ensure that the required
  • National and international network
    infrastructures
  • Standardized tools and facilities for high
    performance and end-to-end monitoring and
    tracking, and
  • Collaborative systems
  • are developed and deployed in a timely manner,
    and used effectively to meet the needs of the US
    LHC and other major HENP Programs, as well as
    the general needs of our scientific community.
  • To carry out these developments in a way that is
    broadly applicable across many fields, within and
    beyond the scientific community
  • Co-Chairs: S. McKee (Michigan), H. Newman
    (Caltech). With thanks to R. Gardner and J.
    Williams (Indiana)

61
UM/ATLAS Grid Cluster: Current Status as of
January 2002
All systems running Globus 1.1.4 and Condor
62
Networking and the Grid
63
Why Networking?
  • Since the early 1980s physicists have depended
    upon leading-edge networks to enable ever larger
    international collaborations.
  • Major HEP collaborations, such as ATLAS, require
    rapid access to event samples from massive data
    stores, not all of which can be locally stored at
    each computational site.
  • Evolving integrated applications, i.e. Data
    Grids, rely on seamless, transparent operation of
    the underlying LANs and WANs.
  • Networks are among the most basic Grid building
    blocks.

64
Transatlantic Networking WG (H. Newman, L. Price):
Bandwidth Requirements
(Table of installed bandwidth by year; a maximum link occupancy of 50% is
assumed.) The network challenge is shared by both next- and
present-generation experiments.
65
TCP WAN Performance
  • Mathis et al., Computer Communications Review,
    v27(3), July 1997, demonstrated the dependence of
    TCP bandwidth on network parameters:

BW < (MSS / RTT) x (C / sqrt(PkLoss)), with C a constant of order one
BW = bandwidth, MSS = max. segment size, RTT = round-trip time,
PkLoss = packet loss rate
If you want to get 90 Mbps via TCP/IP on a WAN
link from LBL to UM you need a packet loss rate < 1.8e-6 !! (70 ms RTT).
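A quick sanity check of that loss-rate figure, using the Mathis relation above; the segment size of 1460 bytes and the constant C = 0.7 are my assumptions, not values from the slide:

  import math

  # Mathis et al.: BW ~ (MSS/RTT) * (C/sqrt(p))  =>  p ~ (C*MSS / (RTT*BW))**2
  C = 0.7                 # assumed order-unity constant
  MSS = 1460 * 8          # bits per segment (assumed Ethernet MSS)
  RTT = 0.070             # seconds, LBL <-> UM as quoted above
  BW = 90e6               # target bandwidth, bits/s

  p = (C * MSS / (RTT * BW)) ** 2
  print("Required packet loss rate: %.1e" % p)   # ~1.7e-6, i.e. of order 1.8e-6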
66
Network Monitoring: Iperf
(http://atgrid.physics.lsa.umich.edu/cricket/cricket/grapher.cgi)
  • We have set up testbed network monitoring using
    Iperf (V1.2) (S. McKee (UMich), D. Yu (BNL))
  • We test both UDP (90 Mbps sending) and TCP
    between all combinations of our 8 testbed sites.
  • Globus is used to initiate both the client and
    server Iperf processes.
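For reference, tests of this kind can be driven with a few lines of scripting. The sketch below shells out to iperf directly rather than starting it through Globus, the target host is a placeholder, and the options shown (-c, -u, -b, -t) should be checked against the installed iperf version:

  # Sketch: run one UDP and one TCP iperf test against a remote testbed node.
  # (In the testbed itself the client/server processes are started via Globus.)
  import subprocess

  remote = "testbed-node.example.edu"   # placeholder test target

  # UDP test at a 90 Mbps sending rate for 10 seconds
  subprocess.run(["iperf", "-c", remote, "-u", "-b", "90M", "-t", "10"], check=True)

  # TCP test for 10 seconds (window size left at its default here)
  subprocess.run(["iperf", "-c", remote, "-t", "10"], check=True)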

67
Testbed Network Measurements
68
Iperf Network Test Setup
69
UM Network IPERF Results
Our new switch has enabled us to increase our
bandwidth to the edge of campus by a factor of
7-15 (Gig vs Campus)
70
Achieving High Performance Networking
  • Server and client CPU, I/O and NIC throughput
    must be sufficient
  • Must consider firmware, hard disk interfaces, bus
    type/capacity
  • Knowledge base of hardware performance, tuning
    issues, examples
  • TCP/IP stack configuration and tuning is
    absolutely required
  • Large windows, multiple streams (see the
    window-size example below)
  • No local infrastructure bottlenecks
  • Gigabit Ethernet "clear path" between selected
    host pairs
  • To 10 Gbps Ethernet by 2003
  • Careful router/switch configuration and
    monitoring
  • Enough router horsepower (CPUs, buffer size,
    backplane BW)
  • Packet loss must be ~zero (well below 0.1%)
  • i.e., no commodity networks (need ESnet,
    Internet2-type networks)
  • End-to-end monitoring and tracking of performance
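One concrete piece of the "large windows" item above is the bandwidth-delay product: TCP throughput is capped at window/RTT, so the window must cover the product of bandwidth and round-trip time. A small worked example (the link speeds and RTT are illustrative):

  # TCP throughput is limited to window / RTT, so the window must be at
  # least the bandwidth-delay product (BDP) of the path.
  def bdp_bytes(bandwidth_bps, rtt_s):
      return bandwidth_bps * rtt_s / 8.0

  rtt = 0.070                      # 70 ms, e.g. a wide-area US path
  for label, bw in [("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)]:
      window = bdp_bytes(bw, rtt)
      print("%s x 70 ms -> window >= %.1f MB" % (label, window / 1e6))
  # A default ~64 KB TCP window would cap such a path near 7 Mb/s,
  # hence the need for large windows and/or multiple parallel streams.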

71
Back to the Grid
  • The Globus Toolkit

72
The Globus Project: Making Grid Computing a
Reality
  • Close collaboration with real Grid projects in
    science and industry
  • Development and promotion of standard Grid
    protocols to enable interoperability and shared
    infrastructure
  • Development and promotion of standard Grid
    software APIs and SDKs to enable portability and
    code sharing
  • The Globus Toolkit: open source, reference
    software base for building grid infrastructure
    and applications
  • Global Grid Forum: development of standard
    protocols and APIs for Grid computing

73
Globus Toolkit Components
  • Two major Data Grid components
  • 1. Data Transport and Access
  • Common protocol
  • Secure, efficient, flexible, extensible data
    movement
  • Family of tools supporting this protocol
  • 2. Replica Management Architecture
  • Simple scheme for managing
  • multiple copies of files
  • collections of files

74
Layered Grid Architecture (By Analogy to Internet
Architecture)
75
The Hourglass Model
  • Focus on architecture issues
  • Propose set of core services as basic
    infrastructure
  • Use to construct high-level, domain-specific
    solutions
  • Design principles
  • Keep participation cost low
  • Enable local control
  • Support for adaptation
  • IP hourglass model

Applications
Diverse global services
Core services
Local OS
76
Resource Layer: Protocols and Services
  • Grid Resource Allocation Mgmt (GRAM)
  • Remote allocation, reservation, monitoring,
    control of compute resources
  • GridFTP protocol (FTP extensions)
  • High-performance data access transport
  • Grid Resource Information Service (GRIS)
  • Access to structure and state information
  • Network reservation, monitoring, control
  • All built on the connectivity layer: GSI and IP

GridFTP: www.gridforum.org; GRAM, GRIS: www.globus.org
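To make the resource-layer protocols concrete, the sketch below shows roughly how a GRAM job submission and a GridFTP transfer look from the command line, driven from Python. The gatekeeper contact, job manager and paths are placeholders, and the exact option spellings should be verified against the installed Globus Toolkit; this example is mine, not the speaker's.

  # Sketch (placeholder hosts/paths): submit a job via GRAM with an RSL
  # string, then fetch an output file over GridFTP with globus-url-copy.
  import subprocess

  gatekeeper = "gatekeeper.example.edu/jobmanager-pbs"   # placeholder GRAM contact
  rsl = "&(executable=/bin/hostname)(count=1)"           # minimal RSL description

  subprocess.run(["globusrun", "-o", "-r", gatekeeper, rsl], check=True)

  subprocess.run(["globus-url-copy",
                  "gsiftp://gatekeeper.example.edu/data/run001.root",  # placeholder
                  "file:///tmp/run001.root"], check=True)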
77
Collective Layer: Protocols and Services
  • Index servers aka metadirectory services
  • Custom views on dynamic resource collections
    assembled by a community
  • Resource brokers (e.g., Condor Matchmaker)
  • Resource discovery and allocation
  • Replica catalogs
  • Replication services
  • Co-reservation and co-allocation services
  • Workflow management services
  • Etc.

Condor www.cs.wisc.edu/condor
78
Example: High-Throughput Computing System
(Layered mapping of a high-throughput computing application onto the grid
architecture.)
Collective (application-specific): dynamic checkpointing, job management,
failover, staging
Collective (generic): brokering, certificate authorities
Resource: access to data, access to computers, access to network
performance data
Connectivity: communication, service discovery (DNS), authentication,
authorization, delegation
Fabric: storage systems, schedulers
79
Virtual Data Queries
  • A query for events really means asking whether an
    input data sample corresponding to a set of
    calibrations, methods, and perhaps Monte Carlo
    history matches a set of criteria
  • It is vital to know, for example
  • What data sets already exist, and in which
    formats? (ESD, AOD, Physics Objects) If not,
    can they be materialized?
  • Was this data calibrated optimally?
  • If I want to recalibrate a detector, what is
    required?
  • Methods
  • Virtual data catalogs and APIs
  • Data signatures
  • Interface to Event Selector Service

80
Virtual Data Scenario
  • A physicist issues a query for events
  • Issues
  • How expressive is this query?
  • What is the nature of the query?
  • What language (syntax) will be supported for the
    query?
  • Algorithms are already available in local shared
    libraries
  • For ATLAS, an Athena service consults an ATLAS
    Virtual Data Catalog or Registry Service
  • Three possibilities (see the sketch after this
    list)
  • File exists on local machine
  • Analyze it
  • File exists in a remote store
  • Copy the file, then analyze it
  • File does not exist
  • Generate, reconstruct, analyze; possibly done
    remotely, then copied
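The three possibilities above amount to a simple decision rule. The sketch below is purely illustrative pseudo-Python; none of the function or method names are real Athena or ATLAS virtual-data APIs:

  # Hypothetical sketch of the virtual-data lookup described above.
  def materialize(logical_name, catalog):
      entry = catalog.lookup(logical_name)        # hypothetical catalog API

      if entry.exists_locally():
          return entry.local_path()               # 1. analyze it in place

      if entry.remote_replicas():
          replica = entry.best_replica()          # e.g. chosen via NWS predictions
          return replica.copy_to_local()          # 2. copy the file, then analyze

      # 3. no copy exists anywhere: generate and reconstruct it, possibly
      #    remotely, register the result, then copy it back for analysis.
      job = entry.generation_recipe().submit()
      job.wait()
      return entry.register_and_fetch(job.output())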

81
The Future of the Grid
82
Problem Evolution
  • Past-present: O(10^2) high-end systems; Mb/s
    networks; centralized (or entirely local) control
  • I-WAY (1995): 17 sites, week-long; 155 Mb/s
  • GUSTO (1998): 80 sites, long-term experiment
  • NASA IPG, NSF NTG: O(10) sites, production
  • Present: O(10^4-10^6) data systems, computers; Gb/s
    networks; scaling, decentralized control
  • Scalable resource discovery; restricted
    delegation; community policy; Data Grid: 100s of
    sites, O(10^4) computers; complex policies
  • Future: O(10^6-10^9) data, sensors, computers; Tb/s
    networks; highly flexible policy, control

83
The Globus View of the Future: All Software is
Network-Centric
  • We don't build or buy "computers" anymore, we
    borrow or lease required resources
  • When I walk into a room, need to solve a problem,
    need to communicate
  • A "computer" is a dynamically, often
    collaboratively constructed collection of
    processors, data sources, sensors, networks
  • Similar observations apply for software

84
And Thus
  • Reduced barriers to access mean that we do much
    more computing, and more interesting computing,
    than today => many more components (and services);
    massive parallelism
  • All resources are owned by others => sharing (for
    fun or profit) is fundamental; trust, policy,
    negotiation, payment
  • All computing is performed on unfamiliar systems
    => dynamic behaviors, discovery, adaptivity,
    failure

85
Future of the Grid for HEP
  • Grid Optimist
  • "Best thing since the WWW. Don't worry, the grid
    will solve all our computational and data
    problems! Just click Install."
  • Grid Pessimist
  • "The grid is merely an excuse by computer
    scientists to milk the political system for more
    research grants so they can write yet more lines
    of useless code" (The Economist, June 21, 2001)
  • "A distraction from getting real science done"
    (McCubbin)
  • Grid Realist
  • The grid can solve our problems, because we will
    design it to! We must work closely with the
    developers as it evolves, providing our
    requirements and testing their deliverables in
    our environment.

86
Conclusions
  • LHC computing requirements are 5-10x those of
    existing experiments in both data volume and CPU
    requirements
  • LHC physics will depend heavily on resources
    outside of CERN
  • LHC Computing Model adopted by CERN
  • Strong endorsement of a multi-tiered hierarchy of
    distributed resources
  • This model will rely on grid software to provide
    efficient, easy access for physicists
  • This is a new platform for physics analysis
  • Like the web, if the grid is going to happen, it
    will be pushed forward by HENP experiments

87
For More Information on the Grid
  • Globus Project
  • www.globus.org
  • Grid Forum
  • www.gridforum.org
  • Online tutorials/papers
  • www.globus.org/training/
  • www.globus.org/research/papers.html
  • Book (Morgan Kaufman)
  • www.mkp.com/grids

88
Baseline BW for the US-CERN Link: HENP
Transatlantic WG (DOE/NSF)
Transoceanic networking integrated with the
TeraGrid, Abilene, regional nets and continental
network infrastructures in the US, Europe, Asia and
South America
US-CERN plans: 155 Mbps to 2 x 155 Mbps this
year; 622 Mbps in April 2002; DataTAG 2.5 Gbps
research link in Summer 2002; 10 Gbps research
link in 2003
89
LHC Schedule
  • Dec 2005: Ring closed and cooled
  • 2006-2007
  • April: First collisions, L = 5x10^32 to 2x10^33
    -> 1 fb^-1
  • Jan-March: Machine commissioning with 1 proton
    beam
  • Start detector commissioning: ~10^5 Z -> ll,
    W -> lnu, ttbar events
  • May-July: Shutdown (continue detector
    installation)
  • August: Physics run, L = 2x10^33, 10 fb^-1
  • Complete detector commissioning
  • -> February 2007: Start of Physics
  • 2008
  • High luminosity running: L = 2x10^34, 100 fb^-1 per
    year

90
Standard Model Higgs Production
K. Jacobs, Fermilab Higgs Workshop, May 2001
91
Modeling and Simulation: The MONARC System
  • Modelling and understanding current systems,
    their performance and limitations, is essential
    for the design of the future large scale
    distributed processing systems.
  • The simulation program developed within the
    MONARC (Models Of Networked Analysis At Regional
    Centers) project is based on a process-oriented
    approach to discrete event simulation. It is
    based on Java(TM) technology and provides
    a realistic modelling tool for such large-scale
    distributed systems.

SIMULATION of Complex Distributed Systems
92
MONARC SONN: 3 Regional Centers Learning to
Export Jobs (Day 9)
(Simulation snapshot, day 9: three regional centers, CERN (30 CPUs),
CALTECH (25 CPUs) and NUST (20 CPUs), connected by links of 1 MB/s at
150 ms RTT, 1.2 MB/s at 150 ms RTT and 0.8 MB/s at 200 ms RTT; mean
efficiencies <E> of 0.73, 0.83 and 0.66.)
93
Data Grid Reference Architecture
(Layered diagram.)
Application: discipline-specific data grid application
Collective: request management, catalogs, replica management, community
policy, ...
Resource: access to data, access to computers, access to network
performance data, ...
Connectivity: communication, service discovery (DNS), authentication,
delegation
Fabric: storage systems, compute systems, networks, code repositories, ...
94
Data Grid Reference Architecture
(Components: user applications, request formulation, virtual data
catalogs, request manager, request planner, request executor; underlying
storage systems, code repositories, computers and networks.)
95
Grid Architectures and Athena
  • Grid Services
  • Resource discovery
  • Scheduling
  • Security
  • Monitoring
  • Data Access
  • Policy
  • Athena Services
  • Application manager
  • Job Options service
  • Event Selector service
  • Event persistency service
  • Detector persistency
  • Histogram service
  • User interfaces
  • Visualization
  • Database
  • Event model
  • Object federations
  • Concurrency

96
Athena's Persistency Mechanism
97
ATLAS Grid Testbed (US)
  • 8 sites
  • University groups: BU, IU, UM, OU, UTA
  • Labs: ANL, BNL, LBNL
  • 15-20 users
  • All sites
  • Globus, Condor
  • AFS, ATLAS software release
  • Dedicated resources
  • Accounts for most users on all machines
  • Applications
  • Monte Carlo production w/ legacy code
  • Athena controlled Monte Carlo

98
Motivation for a Common Data Access Protocol
  • Existing distributed data storage systems
  • DPSS, HPSS: focus on high-performance access,
    utilize parallel data transfer, striping
  • DFS: focus on high-volume usage, dataset
    replication, local caching
  • SRB: connects heterogeneous data collections,
    uniform client interface, metadata queries
  • Problems
  • Incompatible (and proprietary) protocols
  • Each requires a custom client
  • Partitions available data sets and storage
    devices
  • Each protocol has a subset of the desired functionality

99
A Common, Secure, Efficient Data Access Protocol
  • Common, extensible transfer protocol
  • Common protocol means all can interoperate
  • Decouple low-level data transfer mechanisms from
    the storage service
  • Advantages
  • New, specialized storage systems are
    automatically compatible with existing systems
  • Existing systems have richer data transfer
    functionality
  • Interface to many storage systems
  • HPSS, DPSS, file systems
  • Plan for SRB integration
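As an example of the richer transfer functionality a common protocol buys, GridFTP clients can request parallel streams and large TCP buffers on a single transfer. The sketch below uses the -p (parallel streams) and -tcp-bs (TCP buffer size) options of globus-url-copy; the values, URLs and paths are placeholders and should be checked against the installed toolkit:

  # Sketch: a tuned GridFTP transfer using parallel streams and a large TCP buffer.
  import subprocess

  subprocess.run(["globus-url-copy",
                  "-p", "4",                    # 4 parallel TCP streams
                  "-tcp-bs", "1048576",         # 1 MB TCP buffer per stream
                  "gsiftp://storage.example.org/data/esd_0001.root",   # placeholder
                  "file:///scratch/esd_0001.root"], check=True)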

100
iVDGL Architecture
101
The 13.6 TF TeraGrid: Computing at 40 Gb/s
(Diagram: the four TeraGrid/DTF sites, NCSA/PACI (8 TF, 240 TB), SDSC
(4.1 TF, 225 TB), Caltech and Argonne, each with site resources (HPSS or
UniTree archives) and external network connections, interconnected over
the 40 Gb/s backplane.)
TeraGrid/DTF: NCSA, SDSC, Caltech, Argonne; www.teragrid.org
102
Grid R&D Focal Areas for NPACI/HENP Partnership
  • Development of Grid-Enabled User Analysis
    Environments
  • CLARENS (IGUANA) Project for Portable
    Grid-Enabled Event Visualization, Data
    Processing and Analysis
  • Object Integration backed by an ORDBMS, and
    File-Level Virtual Data Catalogs
  • Simulation Toolsets for Systems Modeling,
    Optimization
  • For example the MONARC System
  • Globally Scalable Agent-Based Realtime
    Information Marshalling Systems
  • To face the next-generation challenge of dynamic
    Global Grid design and operations
  • Self-learning (e.g. SONN) optimization
  • Simulation (Now-Casting) enhanced to monitor,
    track and forward predict site, network and
    global system state
  • 1-10 Gbps Networking development and global
    deployment
  • Work with the TeraGrid, STARLIGHT, Abilene, the
    iVDGL GGGOC, HENP Internet2 WG, Internet2 E2E,
    and DataTAG
  • Global Collaboratory Development e.g. VRVS,
    Access Grid

103
Virtual Data Registries
Event Selector Service
Algorithm creates VD IDs
Virtual Data Registry Service
104
Current Grid Challenges: Resource Discovery,
Co-Scheduling, Transparency
  • Discovery and Efficient Co-Scheduling of
    Computing, Data Handling, and Network Resources
  • Effective, Consistent Replica Management
  • Virtual Data Recomputation Versus Data Transport
    Decisions
  • Reduction of Complexity In a Petascale World
  • GA3: Global Authentication, Authorization,
    Allocation
  • VDT: Transparent Access to Results (and Data
    When Necessary)
  • Location Independence of the User Analysis,
    Grid, and Grid-Development Environments
  • Seamless Multi-Step Data Processing and
    Analysis: DAGMan (Wisc), MOP/IMPALA (FNAL)

105
Next Round of Grid Challenges: Global Workflow
Monitoring, Management, and Optimization
  • Workflow Management, Balancing Policy Versus
    Moment-to-moment Capability to Complete Tasks
  • Balance High Levels of Usage of Limited Resources
    Against Better Turnaround Times for Priority
    Jobs
  • Goal-Oriented According to (Yet to be Developed)
    Metrics
  • Maintaining a Global View of Resources and System
    State
  • Global System Monitoring, Modeling,
    Quasi-realtime simulation feedback on the
    Macro- and Micro-Scales
  • Adaptive Learning: new paradigms for execution
    optimization and decision support (eventually
    automated)
  • Grid-enabled User Environments