1
The Grid: The Future of High Energy Physics
Computing?
  • Shawn McKee
  • January 7, 2002
  • University of Michigan

2
Acknowledgements
  • Disclaimer: This talk will be an overview from a
    physicist who is a grid user, rather than a
    computer scientist who is a grid expert!
  • Much of this talk was borrowed from various
    sources. I would like to thank:
  • Rob Gardner (IU)
  • Harvey Newman (Caltech)
  • Jennifer Schopf (Northwestern)
  • The Globus Team

3
Outline
  • Definitions
  • Example Grid Uses
  • HEP Motivations for the Grid
  • LHC Experiments and their scope
  • ATLAS as an example
  • LHC Tiered computing model
  • HENP Related Grid Projects
  • Grid Work at Michigan
  • Globus and the Globus Toolkit
  • Conclusions

4
What is The Grid?
  • There are many answers and interpretations
  • The term was originally coined in the mid-1990s
    (in analogy with the power grid?) and can be
    described as follows:
  • The grid provides flexible, secure,
    coordinated resource sharing among dynamic
    collections of individuals, institutions and
    resources (virtual organizations, or VOs)

5
Grid Perspectives
  • User's viewpoint:
  • A virtual computer which minimizes time to
    completion for my application while transparently
    managing access to inputs and resources
  • Programmer's viewpoint:
  • A toolkit of applications and APIs which provide
    transparent access to distributed resources
  • Administrator's viewpoint:
  • An environment to monitor, manage and secure
    access to geographically distributed computers,
    storage and networks.

6
Some Important Definitions
  • Resource
  • Network protocol
  • Network enabled service
  • Application Programmer Interface (API)
  • Software Development Kit (SDK)
  • Syntax
  • Not discussed, but important: policies

From "Introduction to Grids and Globus"
7
Resource
  • An entity that is to be shared
  • E.g., computers, storage, data, software
  • Does not have to be a physical entity
  • E.g., Condor pool, distributed file system,
  • Defined in terms of interfaces, not devices
  • E.g., schedulers such as LSF and PBS define a
    compute resource
  • Open/close/read/write define access to a
    distributed file system, e.g. NFS, AFS, DFS

8
Network Protocol
  • A formal description of message formats and a set
    of rules for message exchange
  • Rules may define sequence of message exchanges
  • Protocol may define state-change in endpoint,
    e.g., file system state change
  • Good protocols are designed to do one thing
  • Protocols can be layered
  • Examples of protocols
  • IP, TCP, TLS (was SSL), HTTP, Kerberos

9
Network Enabled Services
  • Implementation of a protocol that defines a set
    of capabilities
  • Protocol defines interaction with service
  • All services require protocols
  • Not all protocols are used to provide services
    (e.g. IP, TLS)
  • Examples: FTP and Web servers

10
Application Programming Interface
  • A specification for a set of routines to
    facilitate application development
  • Refers to definition, not implementation
  • E.g., there are many implementations of MPI
  • Spec often language-specific (or IDL)
  • Routine name, number, order and type of
    arguments mapping to language constructs
  • Behavior or function of routine
  • Examples (a small MPI example follows)
  • GSS-API (security), MPI (message passing)
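
The distinction between an API and its implementations can be made concrete with MPI: a program is written against the MPI specification and runs unchanged on MPICH or any other conforming implementation. Below is a minimal sketch using mpi4py, a later Python binding to MPI, shown here only as an illustration (it is not part of the original talk):

    # Minimal MPI example: MPI is an API specification; mpi4py simply wraps
    # whichever MPI implementation (e.g. MPICH) is installed underneath.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD        # communicator spanning all started processes
    rank = comm.Get_rank()       # index of this process
    size = comm.Get_size()       # total number of processes

    # Each rank reports in; rank 0 gathers and prints the messages.
    msg = "hello from rank %d of %d" % (rank, size)
    all_msgs = comm.gather(msg, root=0)
    if rank == 0:
        for m in all_msgs:
            print(m)

Run with, for example, mpiexec -n 4 python hello_mpi.py; the same source works regardless of which MPI SDK provides the underlying library.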

11
Software Development Kit
  • A particular instantiation of an API
  • SDK consists of libraries and tools
  • Provides implementation of API specification
  • Can have multiple SDKs for an API
  • Examples of SDKs
  • MPICH, Motif Widgets

12
Syntax
  • Rules for encoding information, e.g.
  • XML, Condor ClassAds, Globus RSL
  • X.509 certificate format (RFC 2459)
  • Cryptographic Message Syntax (RFC 2630)
  • Distinct from protocols
  • One syntax may be used by many protocols (e.g.,
    XML) and can be useful for other purposes
  • Syntaxes may be layered
  • E.g., Condor ClassAds -> XML -> ASCII (see the
    sketch below)
  • Important to understand layerings when comparing
    or evaluating syntaxes
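
To make the layering concrete, here is a small illustrative sketch (my own, not from the talk) that takes a ClassAd-like attribute set, renders it as XML, and finally encodes it as ASCII bytes, mirroring the Condor ClassAds -> XML -> ASCII layering:

    # Hypothetical sketch: one resource description passed through three
    # syntax layers (attribute list -> XML -> ASCII bytes).
    import xml.etree.ElementTree as ET

    classad = {"MyType": "Machine", "Memory": 2048, "Arch": "INTEL", "OpSys": "LINUX"}

    # Layer 2: express the same information in XML.
    root = ET.Element("classad")
    for name, value in classad.items():
        attr = ET.SubElement(root, "attr", name=name)
        attr.text = str(value)
    xml_text = ET.tostring(root, encoding="unicode")

    # Layer 3: encode the XML as ASCII for transport.
    wire_bytes = xml_text.encode("ascii")
    print(wire_bytes)

The element and attribute names here are invented for illustration; the point is only that the same information survives each change of syntax.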

13
The Grid Problem
  • Flexible, secure, coordinated resource sharing
    among dynamic collections of individuals,
    institutions, and resources
  • From "The Anatomy of the Grid: Enabling Scalable
    Virtual Organizations"
  • Enable communities (virtual organizations) to
    share geographically distributed resources as
    they pursue common goals -- assuming the absence
    of
  • central location,
  • central control,
  • omniscience,
  • existing trust relationships.

14
Elements of the Problem
  • Resource sharing
  • Computers, storage, sensors, networks,
  • Sharing is always conditional: issues of trust,
    policy, negotiation, payment, ...
  • Coordinated problem solving
  • Beyond client-server: distributed data analysis,
    computation, collaboration, ...
  • Dynamic, multi-institutional virtual orgs
  • Community overlays on classic org structures
  • Large or small, static or dynamic

15
Why Grids?
  • A biochemist exploits 10,000 computers to screen
    100,000 compounds in an hour
  • 1,000 physicists worldwide pool resources for
    petaop analyses of petabytes of data
  • Civil engineers collaborate to design, execute,
    analyze shake table experiments
  • Climate scientists visualize, annotate, analyze
    terabyte simulation datasets
  • An emergency response team couples real time
    data, weather model, population data

16
Why Grids? (cont'd)
  • A multidisciplinary analysis in aerospace couples
    code and data in four companies
  • A home user invokes architectural design
    functions at an application service provider
  • An application service provider purchases cycles
    from compute cycle providers
  • Scientists working for a multinational soap
    company design a new product
  • A community group pools members' PCs to analyze
    alternative designs for a local road

17
Online Access to Scientific Instruments
[Example: the Advanced Photon Source. Real-time
collection at the instrument, wide-area
dissemination, desktop VR clients with shared
controls, tomographic reconstruction, and archival
storage.]
DOE X-ray grand challenge: ANL, USC/ISI, NIST,
U. Chicago
18
Mathematicians Solve NUG30
  • Looking for the solution to the NUG30 quadratic
    assignment problem
  • An informal collaboration of mathematicians and
    computer scientists
  • Condor-G delivered 3.46E8 CPU seconds in 7 days
    (peak 1009 processors) in U.S. and Italy (8 sites)

Solution: 14,5,28,24,1,3,16,15, 10,9,21,2,4,29,25,22, 13,26,
17,30,6,20,19, 8,18,7,27,12,11,23
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
19
Network for Earthquake Engineering Simulation
  • NEESgrid: national infrastructure to couple
    earthquake engineers with experimental
    facilities, databases, computers, each other
  • On-demand access to experiments, data streams,
    computing, archives, collaboration

NEESgrid Argonne, Michigan, NCSA, UIUC, USC
20
Home Computers Evaluate AIDS Drugs
  • Community:
  • 1000s of home computer users
  • Philanthropic computing vendor (Entropia)
  • Research group (Scripps)
  • Common goal: advance AIDS research

21
Data Grids for High Energy Physics
[Diagram: the LHC tiered data grid model.
CERN/outside resource ratio roughly 1:2;
Tier0 : (sum of Tier1) : (sum of Tier2) roughly 1:1:1.
The online system produces on the order of a
PByte/sec at the detector and 100 MBytes/sec into
the offline farm at the CERN computer centre
(25 TIPS, HPSS mass storage), which forms Tier 0+1.
Tier 1 centres (France, Italy, UK, the BNL center)
connect at 2.5 Gbits/sec and serve Tier 2 centres
over 2.5 Gbps links. Tier 3 is the institutes
(0.25 TIPS each); physicists work on analysis
channels, with each institute having about 10
physicists working on one or more channels. Tier 4
is workstations, with 100-1000 Mbits/sec links and
local physics data caches.]
22
Broader Context
  • Grid Computing has much in common with major
    industrial thrusts
  • Business-to-business, Peer-to-peer, Application
    Service Providers, Storage Service Providers,
    Distributed Computing, Internet Computing
  • Sharing issues are not adequately addressed by
    existing technologies
  • Complicated requirements: run program X at site
    Y subject to community policy P, providing access
    to data at Z according to policy Q
  • High performance: unique demands of advanced
    high-performance systems

23
Why Now?
  • Moore's law improvements in computing produce
    highly functional end systems
  • The Internet and burgeoning wired and wireless
    networks provide universal connectivity
  • Changing modes of working and problem solving
    emphasize teamwork, computation
  • Network exponentials produce dramatic changes in
    geometry and geography

24
Network Exponentials
  • Network vs. computer performance
  • Computer speed doubles every 18 months
  • Network speed doubles every 9 months
  • Difference: an order of magnitude every 5 years
  • 1986 to 2000:
  • Computers x 500
  • Networks x 340,000
  • 2001 to 2010:
  • Computers x 60
  • Networks x 4000 (a quick check of these factors
    follows this list)
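
These multipliers follow from the doubling times quoted above; a quick arithmetic check (mine, not the slide's):

    # Growth implied by the doubling times on this slide:
    # computer speed doubles every 18 months, network speed every 9 months.
    def growth(years, doubling_months):
        return 2 ** (years * 12.0 / doubling_months)

    print("1986-2000: computers x %.0f, networks x %.0f"
          % (growth(14, 18), growth(14, 9)))   # about x650 and x420,000
    print("2001-2010: computers x %.0f, networks x %.0f"
          % (growth(9, 18), growth(9, 9)))     # about x64 and x4,100

The results are of the same order as the factors quoted on the slide (x500 and x340,000; x60 and x4000).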

Moore's law vs. storage improvements vs. optical
improvements. Graph from Scientific American
(Jan-2001) by Cleo Vilett; source: Vinod Khosla,
Kleiner Perkins Caufield & Byers.
25
The Network
  • As the previous slide suggests, it can be argued
    that the evolution of the network has been the
    primary motivator for the Grid.
  • Ubiquitous, dependable worldwide networks have
    opened up the possibility of tying together
    geographically distributed resources
  • The success of the WWW for sharing information
    has spawned a push for a system to share
    resources
  • The network has become the virtual bus of a
    virtual computer.
  • More on this later

26
Motivation for the Grid
  • A HEP Perspective

27
Large Hadron Collider at CERN
28
Four LHC Experiments: The Petabyte to Exabyte
Challenge
  • ATLAS, CMS, ALICE, LHCb: Higgs, new particles,
    quark-gluon plasma, CP violation

Data stored: 40 Petabytes/year and up
CPU: 0.30 Petaflops and up
0.1 to 1 Exabyte (1 EB = 10^18
Bytes) (2007) (2012?) for the LHC
Experiments
29
How Much Data is Involved?
[Figure: Level-1 trigger rate (Hz) versus event size
(bytes) for various experiments: LEP, UA1, NA49,
ALICE, H1/ZEUS, CDF/D0, KLOE, HERA-B, TeV II, and
the LHC experiments LHCb, ATLAS and CMS. The LHC
experiments combine a high Level-1 trigger rate
(up to about 1 MHz), a high number of channels and
high bandwidth (about 500 Gbit/s), and a high data
archive volume (PetaBytes).]
Source: Hans Hoffmann, DOE/NSF Review, Nov. 2000
30
ATLAS
  • A Toroidal LHC ApparatuS
  • Collaboration
  • 150 institutes
  • 1850 physicists
  • Detector
  • Inner tracker
  • Calorimeter
  • Magnet
  • Muon
  • United States ATLAS
  • 29 universities, 3 national labs
  • About 20% of ATLAS

31
(No Transcript)
32
Discovery Potential for SM Higgs Boson
  • Good sensitivity over the full mass range from
    100 GeV to 1 TeV
  • For most of the mass range at least two channels
    available
  • Detector performance is crucial: b-tagging,
    leptons, γ, E resolution, γ/jet separation, ...

33
(No Transcript)
34
Data Flow from ATLAS
40 MHz (40 TB/sec)
  -> Level 1: special hardware
75 kHz (75 GB/sec)
  -> Level 2: embedded processors
5 kHz (5 GB/sec)
  -> Level 3: PCs
100 Hz (100 MB/sec)
  -> data recording and offline analysis
ATLAS: 9 PB/y, roughly one million PC hard drives!
35
ATLAS Parameters
  • Running conditions in the early years
  • Raw event size: 2 MB
  • 2.7x10^9 event sample -> 5.4 PB/year, before data
    processing (checked in the sketch below)
  • Reconstructed events plus Monte Carlo data ->
  • 9 PB/year (2 PB disk); CPU: 2M SI95 (today's PC
    is about 20 SI95)
  • CERN alone can handle only about 1/3 of these
    resources; how will we handle this?
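
The 5.4 PB/year raw-data figure is simple arithmetic from the two numbers above; a one-line check:

    # Yearly raw-data volume implied by the ATLAS parameters above.
    raw_event_size = 2e6          # bytes (2 MB per raw event)
    events_per_year = 2.7e9       # raw event sample per year

    raw_pb_per_year = raw_event_size * events_per_year / 1e15
    print("raw data: %.1f PB/year" % raw_pb_per_year)   # -> 5.4 PB/year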

36
Data IntensiveComputing and Grids
  • The term Data Grid is often used
  • Unfortunate, as it implies a distinct
    infrastructure, which it isn't; but it is easy to say
  • Data-intensive computing shares numerous
    requirements with collaboration, instrumentation,
    computation,
  • Security, resource mgt, info services, etc.
  • Important to exploit commonalities as very
    unlikely that multiple infrastructures can be
    maintained
  • Fortunately this seems easy to do!

37
Data Intensive Issues Include
  • Harness potentially large numbers of data,
    storage, network resources located in distinct
    administrative domains
  • Respect local and global policies governing what
    can be used for what
  • Schedule resources efficiently, again subject to
    local and global constraints
  • Achieve high performance, with respect to both
    speed and reliability
  • Catalog software and virtual data

38
Examples ofDesired Data Grid Functionality
  • High-speed, reliable access to remote data
  • Automated discovery of best copy of data
  • Manage replication to improve performance
  • Co-schedule compute, storage, network
  • Transparency with respect to delivered performance
  • Enforce access control on data
  • Allow representation of global resource
    allocation policies

39
HEP Data Analysis
  • Raw data:
  • hits, pulse heights
  • Reconstructed data (ESD):
  • tracks, clusters
  • Analysis Objects (AOD):
  • physics objects
  • Summarized data:
  • organized by physics topic
  • ntuples, histograms, statistical data

40
Production Analysis
[Diagram: the production analysis chain. The trigger
system and data acquisition (Level 3 trigger, run
conditions, calibration data) produce raw data and
trigger tags; reconstruction then produces Event
Summary Data (ESD) and event tags. Coordination is
required at the collaboration and group levels.]
41
Physics Analysis
[Diagram: the physics analysis chain. Event tags
feed event selection (Tier 0/1, collaboration-wide);
analysis processing (Tier 2, analysis groups) uses
analysis objects, calibration data and raw data to
produce physics objects and statistical objects,
which physicists at Tier 3/4 use for physics
analysis.]
42
A Model Architecture for Data Grids
[Diagram: a model architecture for data grids. An
application presents an attribute specification to a
metadata catalog, which returns a logical collection
and logical file names; a replica catalog maps these
to multiple physical locations. Replica selection
then uses performance information and predictions
(from MDS and NWS) to choose a replica, and the data
are moved over GridFTP control and data channels
among the disk caches, disk arrays and tape libraries
at the replica locations. A sketch of the
replica-selection step follows.]
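
The replica-selection step in that flow can be summarized in a few lines of Python. This is a sketch under my own naming conventions (it does not use the actual Globus replica catalog or NWS interfaces): map a logical file name to its replicas, rank them with performance predictions, and hand the winner to GridFTP.

    # Illustrative replica selection (hypothetical data structures and names).
    replica_catalog = {
        "lfn://atlas/esd/run1234.root": [
            "gsiftp://tier1.example.org/data/run1234.root",
            "gsiftp://tier2.example.edu/cache/run1234.root",
        ],
    }

    def predicted_bandwidth(url):
        # Stand-in for an NWS-style forecast of achievable rate (MB/s).
        forecasts = {"tier1.example.org": 40.0, "tier2.example.edu": 90.0}
        host = url.split("/")[2]
        return forecasts.get(host, 1.0)

    def select_replica(logical_name):
        replicas = replica_catalog[logical_name]
        # Pick the replica with the best predicted transfer performance.
        return max(replicas, key=predicted_bandwidth)

    print("fetch via GridFTP:", select_replica("lfn://atlas/esd/run1234.root"))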
43
LHC Computing Model(Based on MONARC Simulations)
  • Hierarchical, distributed tiers
  • The grid is necessary to tie these distributed
    resources together

[Diagram: CERN (Tier-0) linked by dedicated or QoS
network links to national/regional computing centers
(Tier-1, e.g. BNL, FNAL), which in turn serve Tier-2
centers and universities.]
44
Why Worldwide Computing? Regional Center Concept
Goals
  • Managed, fair-shared access for Physicists
    everywhere
  • Maximize total funding resources while meeting
    the total computing and data handling needs
  • Balance proximity of datasets to large central
    resources, against regional resources under more
    local control
  • Tier-N Model
  • Efficient network use higher throughput on short
    paths
  • Local > regional > national > international
  • Utilizing all intellectual resources, in several
    time zones
  • CERN, national labs, universities, remote sites
  • Involving physicists and students at their home
    institutions
  • Greater flexibility to pursue different physics
    interests, priorities, and resource allocation
    strategies by region
  • And/or by common interests (physics topics,
    subdetectors, ...)
  • Manage the system's complexity
  • Partitioning facility tasks, to manage and focus
    resources

45
Tier 2 Centers
  • Bring LHC Physics to the Universities
  • Optimize physics discovery potential
  • Standard configuration optimized for analysis at
    the DST level
  • Primary Resource for Monte Carlo Simulation
  • Production level particle searches (University
    autonomy)
  • Configuration (Tier 2, with Tier 1 for comparison)
  • Commodity Pentium/Linux: 100K SpecInt95
    (Tier 1: 500K)
  • Estimated 144 dual-processor nodes (Tier 1: 640)
  • Online storage: 100 TB disk (Tier 1: 1000 TB)
  • High-performance storage area network

46
Who is working on the Grid?
  • HEP Perspective

47
HENP Related Data Grid Projects
  • Funded Projects
  • PPDG I (USA, DOE): 2M, 1999-2001
  • GriPhyN (USA, NSF): 11.9M + 1.6M, 2000-2005
  • EU DataGrid (EU, EC): 10M, 2001-2004
  • PPDG II (CP) (USA, DOE): 9.5M, 2001-2004
  • iVDGL (USA, NSF): 13.7M + 2M, 2001-2006
  • DataTAG (EU, EC): 4M, 2002-2004
  • About to be Funded Project
  • GridPP (UK, PPARC): >15M?, 2001-2004
  • Many national projects of interest to HENP
  • Initiatives in US, UK, Italy, France, NL,
    Germany, Japan,
  • EU networking initiatives (Géant, SURFNet)
  • US Distributed Terascale Facility (53M, 12
    TFL, 40 Gb/s network)

in final stages of approval
48
Grid Physics Network (GriPhyN): enabling R&D for
advanced data grid systems, focusing in
particular on the Virtual Data concept
ATLAS, CMS, LIGO, SDSS
49
International Virtual Data Grid Laboratory
50
TeraGrid: NCSA, ANL, SDSC, Caltech
StarLight international optical peering point (see
www.startap.net)
A preview of the Grid hierarchy and networks of the
LHC era
[Map: the DTF backplane (4 lambdas, 40 Gbps) links
Pasadena (Caltech), San Diego (SDSC), Urbana
(NCSA/UIUC) and the Chicago area (Starlight/NW Univ,
UIC, Ill Inst of Tech, ANL, Univ of Chicago) using
I-WIRE dark fiber and multiple 10 GbE (Qwest), with
OC-48 (2.5 Gb/s) Abilene connectivity through the
Abilene NOC in Indianapolis and multiple carrier
hubs.
  • Solid lines: in place and/or available in 2001
  • Dashed I-WIRE lines: planned for Summer 2002]

Source: Charlie Catlett, Argonne
51
PACI, TeraGrid and HENP
  • The scale, complexity and global extent of the
    LHC data analysis problem are unprecedented
  • The solution of the problem, using globally
    distributed Grids, is mission-critical for
    frontier science and engineering
  • HENP has a tradition of deploying new highly
    functional systems (and sometimes new
    technologies) to meet its technical and
    ultimately its scientific needs
  • HENP problems are mostly "embarrassingly
    parallel", but potentially overwhelming in their
    data- and network-intensiveness
  • HENP/Computer Science synergy has increased
    dramatically over the last two years, focused on
    Data Grids
  • Successful collaborations in GriPhyN, PPDG, EU
    Data Grid
  • The TeraGrid (present and future) and its
    development program is scoped at an appropriate
    level of depth and diversity
  • to tackle the LHC and other Petascale
    problems over a 5-year time span
  • matched to the LHC time schedule, with full
    operations in 2007

52
Selected Major Grid Projects
53
Selected Major Grid Projects
54
Selected Major Grid Projects
55
Selected Major Grid Projects
Also many technology R&D projects, e.g., Condor,
NetSolve, Ninf, NWS. See also www.gridforum.org
56
Grid Related Work at Michigan
57
Grid Activities at Michigan
  • There are many ongoing activities related to the
    grid within the department
  • NPACI/CPC collaboration on grid development
  • Collaboration with the Visible Human Project on
    networking and grid performance issues
  • Authenticated QoS work with CITI/ITCOM
  • Collaborative tools and the Web Lecture Archive
    Project (see http://wlap.org)
  • Network issues: bandwidth, services and
    performance (I will focus on this later)

58
USATLAS Data Grid Testbed
U Michigan
Boston University
UC Berkeley LBNL-NERSC
Argonne National Laboratory
Brookhaven National Laboratory
University of Oklahoma
Prototype Tier 2s
Indiana University
University of Texas at Arlington
HPSS sites
59
US ATLAS Grid Testbed Activities
  • We are an active participant in the US ATLAS grid
    testbed
  • Collaboration with CPC on grid issues and
    development
  • Hosted a US ATLAS Grid workshop in Winter 2001
  • Leadership in network issues
  • Strong collaborative tools effort
  • Active in the Global Grid forum and Internet2
    HENP WG
  • Testbed Activities
  • Network monitoring
  • Security configuration
  • Hardware testing for high performance bottlenecks
  • Certificate attributes (grid account management)
  • PACMAN cache site
  • Kick-start (one floppy) OS install development
  • Using AFS as a replacement for NFS

60
Internet2 HENP Networking WG Mission
  • To help ensure that the required
  • National and international network
    infrastructures
  • Standardized tools and facilities for high
    performance and end-to-end monitoring and
    tracking, and
  • Collaborative systems
  • are developed and deployed in a timely manner,
    and used effectively to meet the needs of the US
    LHC and other major HENP Programs, as well as
    the general needs of our scientific community.
  • To carry out these developments in a way that is
    broadly applicable across many fields, within and
    beyond the scientific community
  • Co-Chairs: S. McKee (Michigan), H. Newman
    (Caltech). With thanks to R. Gardner and J.
    Williams (Indiana)

61
UM/ATLAS Grid Cluster: Current Status as of
January 2002
All systems running Globus 1.1.4 and Condor
62
Networking and the Grid
63
Why Networking?
  • Since the early 1980s physicists have depended
    upon leading-edge networks to enable ever larger
    international collaborations.
  • Major HEP collaborations, such as ATLAS, require
    rapid access to event samples from massive data
    stores, not all of which can be locally stored at
    each computational site.
  • Evolving integrated applications, i.e. Data
    Grids, rely on seamless, transparent operation of
    the underlying LANs and WANs.
  • Networks are among the most basic Grid building
    blocks.

64
Transatlantic Net WG (HN, L. Price)
Bandwidth Requirements
[Table of projected transatlantic bandwidth
requirements by year; installed BW assumes a maximum
link occupancy of 50%.]
The network challenge is shared by both next- and
present-generation experiments.
65
TCP WAN Performance
  • Mathis et al. (Computer Communications Review,
    27(3), July 1997) demonstrated the dependence of
    TCP bandwidth on network parameters:

BW < C x MSS / (RTT x sqrt(PkLoss)), where
BW = bandwidth, MSS = maximum segment size,
RTT = round-trip time, PkLoss = packet loss rate,
and C is a constant of order one.
If you want to get 90 Mbps via TCP/IP on a WAN
link from LBL to UM, you need a packet loss
< 1.8e-6 (70 ms RTT)!
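
A quick numerical check of that claim, assuming a 1500-byte MSS and taking the constant C to be about 0.7 (my assumptions, chosen to reproduce the slide's figure):

    # Maximum packet-loss rate allowed by the Mathis et al. relation
    #   BW < C * MSS / (RTT * sqrt(p))   =>   p < (C * MSS / (RTT * BW))**2
    C = 0.7                    # conservative constant in the Mathis formula
    mss_bits = 1500 * 8        # maximum segment size, in bits
    rtt = 0.070                # LBL <-> UM round-trip time, in seconds
    target_bw = 90e6           # desired throughput, in bits/s

    max_loss = (C * mss_bits / (rtt * target_bw)) ** 2
    print("packet loss must stay below %.1e" % max_loss)   # about 1.8e-06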
66
Network Monitoring Iperf
(http://atgrid.physics.lsa.umich.edu/cricket/cricket/grapher.cgi)
  • We have set up testbed network monitoring using
    Iperf (v1.2) (S. McKee (UMich), D. Yu (BNL))
  • We test both UDP (90 Mbps sending) and TCP
    between all combinations of our 8 testbed sites.
  • Globus is used to initiate both the client and
    server Iperf processes (a sketch of one such test
    follows).
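
For concreteness, one such measurement looks roughly like the commands below. This is a sketch only: the exact Iperf options varied between versions, and in the real setup Globus launches the server and client remotely rather than via a local subprocess call.

    # Rough sketch of one monitoring measurement between two testbed hosts.
    import subprocess

    server_host = "atgrid.physics.lsa.umich.edu"   # example host from the URL above

    # Server side:  iperf -s
    # Client side:  a 10-second UDP test at 90 Mbps, then a TCP test.
    udp_cmd = ["iperf", "-c", server_host, "-u", "-b", "90M", "-t", "10"]
    tcp_cmd = ["iperf", "-c", server_host, "-t", "10"]

    for cmd in (udp_cmd, tcp_cmd):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=False)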

67
Testbed Network Measurements
68
Iperf Network Test Setup
69
UM Network IPERF Results
Our new switch has enabled us to increase our
bandwidth to the edge of campus by a factor of
7-15 (Gigabit Ethernet vs. the campus network)
70
Achieving High Performance Networking
  • Server and Client CPU, I/O and NIC throughput
    sufficient
  • Must consider firmware, hard disk interfaces, bus
    type/capacity
  • Knowledge base of hardware performance, tuning
    issues, examples
  • TCP/IP stack configuration and tuning is
    absolutely required
  • Large windows, multiple streams (see the sketch
    after this list)
  • No local infrastructure bottlenecks
  • Gigabit Ethernet clear path between selected
    host pairs
  • To 10 Gbps Ethernet by 2003
  • Careful router/switch configuration and
    monitoring
  • Enough router horsepower (CPUs, buffer size,
    backplane BW)
  • Packet loss must be near zero (well below 0.1%)
  • i.e., no commodity networks (need ESnet, I2-type
    networks)
  • End-to-end monitoring and tracking of performance
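
The "large windows" item above refers to sizing the TCP window to the bandwidth-delay product; a short illustration with my own example figures:

    # TCP needs a window of at least bandwidth x round-trip time (the
    # bandwidth-delay product) to keep a long, fat pipe full.
    def required_window_bytes(bandwidth_bps, rtt_s):
        return bandwidth_bps * rtt_s / 8.0

    for bw, rtt in [(90e6, 0.070), (1e9, 0.070), (2.5e9, 0.150)]:
        win = required_window_bytes(bw, rtt)
        print("%.0f Mb/s, %.0f ms RTT -> window >= %.2f MB"
              % (bw / 1e6, rtt * 1000, win / 1e6))

    # Default 64 kB windows fall far short of these values, hence the need
    # for window tuning and/or multiple parallel streams.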

71
Back to the Grid
  • The Globus Toolkit

72
The Globus Project: Making Grid computing a
reality
  • Close collaboration with real Grid projects in
    science and industry
  • Development and promotion of standard Grid
    protocols to enable interoperability and shared
    infrastructure
  • Development and promotion of standard Grid
    software APIs and SDKs to enable portability and
    code sharing
  • The Globus Toolkit: open-source, reference
    software base for building grid infrastructure
    and applications
  • Global Grid Forum: development of standard
    protocols and APIs for Grid computing

73
Globus Toolkit Components
  • Two major Data Grid components
  • 1. Data Transport and Access
  • Common protocol
  • Secure, efficient, flexible, extensible data
    movement
  • Family of tools supporting this protocol
  • 2. Replica Management Architecture
  • Simple scheme for managing
  • multiple copies of files
  • collections of files

74
Layered Grid Architecture(By Analogy to Internet
Architecture)
75
The Hourglass Model
  • Focus on architecture issues
  • Propose set of core services as basic
    infrastructure
  • Use to construct high-level, domain-specific
    solutions
  • Design principles
  • Keep participation cost low
  • Enable local control
  • Support for adaptation
  • IP hourglass model

[Hourglass figure, top to bottom: applications;
diverse global services; core services; local OS.]
76
Resource Layer Protocols and Services
  • Grid Resource Allocation Mgmt (GRAM)
  • Remote allocation, reservation, monitoring,
    control of compute resources
  • GridFTP protocol (FTP extensions)
  • High-performance data access transport
  • Grid Resource Information Service (GRIS)
  • Access to structure and state information
  • Network reservation, monitoring, control
  • All built on the connectivity layer: GSI, IP

GridFTP: www.gridforum.org; GRAM, GRIS:
www.globus.org
77
Collective Layer Protocols and Services
  • Index servers aka metadirectory services
  • Custom views on dynamic resource collections
    assembled by a community
  • Resource brokers (e.g., Condor Matchmaker)
  • Resource discovery and allocation
  • Replica catalogs
  • Replication services
  • Co-reservation and co-allocation services
  • Workflow management services
  • Etc.

Condor: www.cs.wisc.edu/condor
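
As a cartoon of what a resource broker such as the Condor Matchmaker does, the following sketch (invented attribute names, not real ClassAd semantics) matches a job's requirements against advertised resources:

    # Toy matchmaking: pair a job ad with the first resource ad whose
    # advertised attributes satisfy the job's requirements.
    resources = [
        {"Name": "node01", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 512},
        {"Name": "node02", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 2048},
    ]

    job = {
        "Cmd": "athena_reco",
        "Requirements": lambda r: r["OpSys"] == "LINUX" and r["Memory"] >= 1024,
    }

    def matchmake(job_ad, resource_ads):
        for r in resource_ads:
            if job_ad["Requirements"](r):
                return r
        return None

    match = matchmake(job, resources)
    print("run %s on %s" % (job["Cmd"], match["Name"] if match else "nothing"))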
78
Example High-ThroughputComputing System
[Layered view of a high-throughput computing system:
App: the high-throughput computing application.
Collective (app-specific): dynamic checkpointing,
job management, failover, staging.
Collective (generic): brokering, certificate
authorities.
Resource: access to data, access to computers,
access to network performance data.
Connectivity: communication, service discovery
(DNS), authentication, authorization, delegation.
Fabric: storage systems, schedulers.]
79
Virtual Data Queries
  • A query for events implies
  • Really asking whether an input data sample
    corresponding to a set of calibrations, methods,
    and perhaps Monte Carlo history matches a set of
    criteria
  • It is vital to know, for example:
  • What data sets already exist, and in which
    formats (ESD, AOD, Physics Objects)? If none, can
    the data be materialized?
  • Was this data calibrated optimally?
  • If I want to recalibrate a detector, what is
    required?
  • Methods
  • Virtual data catalogs and APIs
  • Data signatures
  • Interface to Event Selector Service

80
Virtual Data Scenario
  • A physicist issues a query for events
  • Issues
  • How expressive is this query?
  • What is the nature of the query?
  • What language (syntax) will be supported for the
    query?
  • Algorithms are already available in local shared
    libraries
  • For ATLAS, an Athena service consults an ATLAS
    Virtual Data Catalog or Registry Service
  • Three possibilities (sketched below):
  • File exists on the local machine
  • Analyze it
  • File exists in a remote store
  • Copy the file, then analyze it
  • File does not exist
  • Generate, reconstruct, analyze (possibly done
    remotely, then copied)
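
The three possibilities above amount to a small decision procedure. A sketch, with the analysis, copy and production steps represented by placeholder functions invented for illustration:

    # Sketch of the virtual-data lookup logic described on this slide.
    def analyze(path):
        print("analyzing", path)

    def copy_from_remote(url):                 # stand-in for a GridFTP copy
        print("copying", url)
        return "/tmp/" + url.split("/")[-1]

    def generate_and_reconstruct(dataset):     # stand-in for production
        print("materializing", dataset)
        return "/tmp/%s.generated" % dataset

    def resolve_and_analyze(dataset, local_store, replica_catalog):
        if dataset in local_store:
            analyze(local_store[dataset])                         # 1. local file
        elif dataset in replica_catalog:
            analyze(copy_from_remote(replica_catalog[dataset]))   # 2. remote copy
        else:
            analyze(generate_and_reconstruct(dataset))            # 3. materialize

    resolve_and_analyze("higgs_sample_01", local_store={},
                        replica_catalog={"higgs_sample_01": "gsiftp://tier1/higgs_sample_01"})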

81
The Future of the Grid
82
Problem Evolution
  • Past-present: O(10^2) high-end systems; Mb/s
    networks; centralized (or entirely local) control
  • I-WAY (1995): 17 sites, week-long, 155 Mb/s
  • GUSTO (1998): 80 sites, long-term experiment
  • NASA IPG, NSF NTG: O(10) sites, production
  • Present: O(10^4-10^6) data systems, computers;
    Gb/s networks; scaling, decentralized control
  • Scalable resource discovery; restricted
    delegation; community policy; Data Grid: 100s of
    sites, O(10^4) computers; complex policies
  • Future: O(10^6-10^9) data, sensors, computers;
    Tb/s networks; highly flexible policy, control

83
The Globus View of the Future: All Software is
Network-Centric
  • We don't build or buy computers anymore; we
    borrow or lease required resources
  • When I walk into a room, need to solve a problem,
    need to communicate, ...
  • A computer is a dynamically, often
    collaboratively constructed collection of
    processors, data sources, sensors, networks
  • Similar observations apply for software

84
And Thus
  • Reduced barriers to access mean that we do much
    more computing, and more interesting computing,
    than today -> many more components (and services),
    massive parallelism
  • All resources are owned by others -> sharing (for
    fun or profit) is fundamental: trust, policy,
    negotiation, payment
  • All computing is performed on unfamiliar systems
    -> dynamic behaviors, discovery, adaptivity,
    failure

85
Future of the Grid for HEP
  • Grid Optimist
  • Best thing since the WWW. Don't worry, the grid
    will solve all our computational and data
    problems! Just click Install.
  • Grid Pessimist
  • The grid is merely an excuse by computer
    scientists to milk the political system for more
    research grants so they can write yet more lines
    of useless code (The Economist, June 21, 2001)
  • A distraction from getting real science done
    (McCubbin)
  • Grid Realist
  • The grid can solve our problems, because we
    design it to! We must work closely with the
    developers as it evolves, providing our
    requirements and testing their deliverables in
    our environment.

86
Conclusions
  • LHC computing requirements are 5-10 times those
    of existing experiments, in both data volume and
    CPU requirements
  • LHC physics will depend heavily on resources
    outside of CERN
  • LHC Computing Model adopted by CERN
  • Strong endorsement of a multi-tiered hierarchy of
    distributed resources
  • This model will rely on grid software to provide
    efficient, easy access for physicists
  • This is a new platform for physics analysis
  • Like the web, if the grid is going to happen, it
    will be pushed forward by HENP experiments

87
For More Information on the Grid
  • Globus Project
  • www.globus.org
  • Grid Forum
  • www.gridforum.org
  • Online tutorials/papers
  • www.globus.org/training/
  • www.globus.org/research/papers.html
  • Book (Morgan Kaufmann)
  • www.mkp.com/grids

88
Baseline BW for the US-CERN Link: HENP
Transatlantic WG (DOE/NSF)
Transoceanic networking integrated with the
TeraGrid, Abilene, regional nets and continental
network infrastructures in the US, Europe, Asia and
South America
US-CERN plans: 155 Mbps to 2 x 155 Mbps this
year; 622 Mbps in April 2002; DataTAG 2.5 Gbps
research link in Summer 2002; 10 Gbps research
link in 2003
89
LHC Schedule
  • Dec. 2005: ring closed and cooled
  • 2006-2007:
  • April: first collisions, L = 5x10^32 to 2x10^33
    -> ~1 fb^-1
  • Jan-March: machine commissioning with 1 proton
    beam
  • Start detector
    commissioning: 10^5 Z -> ll, W -> lν, tt events
  • May-July: shutdown (continue detector
    installation)
  • August: physics run, L = 2x10^33, 10 fb^-1
  • Complete detector commissioning
  • -> February 2007: start of physics
  • 2008:
  • High-luminosity running, L = 2x10^34, 100 fb^-1
    per year

90
Standard Model Higgs Production
K. Jacobs, Fermilab Higgs Workshop, May 2001
91
Modeling and SimulationMONARC System
  • Modelling and understanding current systems,
    their performance and limitations, is essential
    for the design of the future large scale
    distributed processing systems.
  • The simulation program developed within the
    MONARC (Models Of Networked Analysis at Regional
    Centers) project is based on a process-oriented
    approach to discrete event simulation. It is
    built on Java(TM) technology and provides
    a realistic modelling tool for such large-scale
    distributed systems.

SIMULATION of Complex Distributed Systems
92
MONARC SONN: 3 Regional Centers Learning to
Export Jobs (Day 9)
[Simulation snapshot: CERN (30 CPUs), CALTECH (25
CPUs) and NUST (20 CPUs), connected by links of
0.8-1.2 MB/s with 150-200 ms RTT; efficiencies
<E> = 0.73, 0.83 and 0.66 are shown for the three
centers on Day 9.]
93
Data Grid Reference Architecture
Application: discipline-specific data grid
application
Collective: request management, catalogs, replica
management, community policy, ...
Resource: access to data, access to computers,
access to network performance data, ...
Connectivity: communication, service discovery
(DNS), authentication, delegation
Fabric: storage systems, compute systems, networks,
code repositories, ...
94
Data Grid Reference Architecture
[Diagram: user applications perform request
formulation; a request manager, request planner and
request executor consult virtual data catalogs and
operate over storage systems, code repositories,
computers and networks.]
95
Grid Architectures and Athena
  • Grid Services
  • Resource discovery
  • Scheduling
  • Security
  • Monitoring
  • Data Access
  • Policy
  • Athena Services
  • Application manager
  • Job Options service
  • Event Selector service
  • Event persistency service
  • Detector persistency
  • Histogram service
  • User interfaces
  • Visualization
  • Database
  • Event model
  • Object federations
  • Concurrency

96
Athena's Persistency Mechanism
97
ATLAS Grid Testbed (US)
  • 8 sites
  • University groups BU, IU, UM, OU, UTA
  • Labs ANL, BNL, LBNL
  • 15-20 users
  • All sites
  • Globus and Condor
  • AFS, ATLAS software release
  • Dedicated resources
  • Accounts for most users on all machines
  • Applications
  • Monte Carlo production w/ legacy code
  • Athena controlled Monte Carlo

98
Motivation for a Common Data Access Protocol
  • Existing distributed data storage systems
  • DPSS, HPSS: focus on high-performance access;
    utilize parallel data transfer, striping
  • DFS: focus on high-volume usage, dataset
    replication, local caching
  • SRB: connects heterogeneous data collections,
    uniform client interface, metadata queries
  • Problems
  • Incompatible (and proprietary) protocols
  • Each requires a custom client
  • Partitions available data sets and storage
    devices
  • Each protocol has a subset of the desired
    functionality

99
A Common, Secure,Efficient Data Access Protocol
  • Common, extensible transfer protocol
  • Common protocol means all can interoperate
  • Decouple low-level data transfer mechanisms from
    the storage service
  • Advantages
  • New, specialized storage systems are
    automatically compatible with existing systems
  • Existing systems have richer data transfer
    functionality
  • Interface to many storage systems
  • HPSS, DPSS, file systems
  • Plan for SRB integration
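
The decoupling argued for on this slide is essentially a common front-end interface over pluggable storage back ends; a minimal sketch with invented class names:

    # One client-side transfer interface, many storage back ends. The back
    # ends are trivial stand-ins for HPSS, DPSS, plain file systems, etc.;
    # only a common 'get' operation is shown.
    class LocalFileBackend:
        def get(self, path):
            with open(path, "rb") as f:
                return f.read()

    class ArchiveBackend:
        def get(self, path):
            # A real HPSS/DPSS driver would stage the file and stream it back.
            return b"(staged from archive) " + path.encode()

    BACKENDS = {"file": LocalFileBackend(), "archive": ArchiveBackend()}

    def fetch(url):
        scheme, _, path = url.partition("://")
        return BACKENDS[scheme].get(path)

    print(fetch("archive://store/atlas/run1234.esd"))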

100
iVDGL Architecture
101
The 13.6 TF TeraGrid: Computing at 40 Gb/s
[Diagram: four sites, NCSA/PACI (8 TF, 240 TB),
SDSC (4.1 TF, 225 TB), Caltech and Argonne, each
with local site resources and archival storage
(HPSS or UniTree) and connections to external
networks over the 40 Gb/s backplane.]
TeraGrid/DTF: NCSA, SDSC, Caltech, Argonne
(www.teragrid.org)
102
Grid R&D Focal Areas for NPACI/HENP Partnership
  • Development of Grid-Enabled User Analysis
    Environments
  • CLARENS (IGUANA) Project for Portable
    Grid-Enabled Event Visualization, Data
    Processing and Analysis
  • Object Integration backed by an ORDBMS, and
    File-Level Virtual Data Catalogs
  • Simulation Toolsets for Systems Modeling,
    Optimization
  • For example the MONARC System
  • Globally Scalable Agent-Based Realtime
    Information Marshalling Systems
  • To face the next-generation challenge of
    dynamic Global Grid design and operations
  • Self-learning (e.g. SONN) optimization
  • Simulation (Now-Casting) enhanced to monitor,
    track and forward predict site, network and
    global system state
  • 1-10 Gbps Networking development and global
    deployment
  • Work with the TeraGrid, STARLIGHT, Abilene, the
    iVDGL GGGOC, HENP Internet2 WG, Internet2 E2E,
    and DataTAG
  • Global Collaboratory Development, e.g. VRVS,
    Access Grid

103
Virtual Data Registries
[Diagram: algorithms and the Event Selector Service
create virtual data IDs, which are registered with
and resolved by a Virtual Data Registry Service.]
104
Current Grid Challenges: Resource Discovery,
Co-Scheduling, Transparency
  • Discovery and Efficient Co-Scheduling of
    Computing, Data Handling, and Network Resources
  • Effective, Consistent Replica Management
  • Virtual Data: Recomputation Versus Data Transport
    Decisions (a toy cost comparison follows this list)
  • Reduction of Complexity in a Petascale World
  • GA3: Global Authentication, Authorization,
    Allocation
  • VDT: Transparent Access to Results (and Data
    When Necessary)
  • Location Independence of the User Analysis, Grid,
    and Grid-Development Environments
  • Seamless Multi-Step Data Processing and
    Analysis: DAGMan (Wisc), MOP/IMPALA (FNAL)
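
The recomputation-versus-transport decision mentioned above is, at its simplest, a cost comparison; an illustrative sketch with invented numbers:

    # Decide whether to move a derived dataset or rederive it locally,
    # using rough time estimates (all figures are illustrative).
    def transfer_time_s(size_bytes, bandwidth_bps):
        return size_bytes * 8.0 / bandwidth_bps

    def recompute_time_s(n_events, cpu_s_per_event, n_cpus):
        return n_events * cpu_s_per_event / n_cpus

    move = transfer_time_s(size_bytes=500e9, bandwidth_bps=622e6)   # 500 GB over 622 Mb/s
    redo = recompute_time_s(n_events=1e7, cpu_s_per_event=5.0, n_cpus=200)

    print("transport: %.1f h, recompute: %.1f h -> %s"
          % (move / 3600, redo / 3600,
             "transport the data" if move < redo else "recompute locally"))

A real planner would also fold in policy, storage availability and network load, which is exactly why this is listed as an open challenge.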

105
Next Round of Grid Challenges: Global Workflow
Monitoring, Management, and Optimization
  • Workflow Management, Balancing Policy Versus
    Moment-to-moment Capability to Complete Tasks
  • Balance High Levels of Usage of Limited Resources
    Against Better Turnaround Times for Priority
    Jobs
  • Goal-Oriented According to (Yet to be Developed)
    Metrics
  • Maintaining a Global View of Resources and System
    State
  • Global system monitoring, modeling, and
    quasi-realtime simulation; feedback on the
    macro- and micro-scales
  • Adaptive learning: new paradigms for execution
    optimization and decision support (eventually
    automated)
  • Grid-enabled User Environments