Title: Grid3: Practice and Principles
1 Grid3: Practice and Principles
- Rob Gardner
- University of Chicago
- rwg_at_hep.uchicago.edu
- April 12, 2004
2 Introduction
- Today I'd like to review a little about Grid3, its principles and practice
- I'd also like to show how we intend to use it for ATLAS Data Challenge 2
- and how we will evolve it over the course of the next several months
- Acknowledgements
- I'm entirely reporting on others' work!
- Much of the project was summarized at the iVDGL NSF review (Feb 04) and the US LHC project reviews (Jan 04)
- I am especially indebted to Marco Mambelli, Ian Fisk, Ruth Pordes, Jorge Rodriguez, and Leigh Grundhoefer for slides
3 Outline
- Intro/project background
- Grid3 infrastructure and operations
- Applications
- Metrics and Lessons
- Grid3, development, and ATLAS DC2
- Conclusions
4 Grid themes, then and now
- E.g., the proposal (simple, naïve?)
- Internationally integrated virtual data grid system; interoperable, controlled sharing of data and compute resources
- Common grid infrastructure, exported to other application domains
- Large-scale research laboratory for VDT development
- Now
- Many new initiatives, in the US and worldwide
- Dynamic resources and organizations
- Dynamic project priorities
- Ability to adapt to change is a key factor for success
- → Grid2003 (Grid3)
5 Grid2003 Project history
- Joint project of US ATLAS, US CMS, iVDGL, PPDG, and GriPhyN
- Organized as a project: Grid2003
- Developed Summer/Fall 2003: the Grid3 grid
- Benefited from STAR cycles and local efforts at BNL/ITD
- Uses US-developed components
- VDT-based (GRAM, GridFTP, MDS, monitoring components) applications
- iGOC monitoring and VO-level services
- Interoperate, or federate, with other grids such as LCG
- ATLAS successfully used Chimera → LCG-1 last December
- US CMS storage element interoperability
6 23 institutes
- Argonne National Laboratory: Jerry Gieraltowski, Scott Gose, Natalia Maltsev, Ed May, Alex Rodriguez, Dinanath Sulakhe
- Boston University: Jim Shank, Saul Youssef
- Brookhaven National Laboratory: David Adams, Rich Baker, Wensheng Deng, Jason Smith, Dantong Yu
- Caltech: Iosif Legrand, Suresh Singh, Conrad Steenberg, Yang Xia
- Fermi National Accelerator Laboratory: Anzar Afaq, Eileen Berman, James Annis, Lothar Bauerdick, Michael Ernst, Ian Fisk, Lisa Giacchetti, Greg Graham, Anne Heavey, Joe Kaiser, Nickolai Kuropatkin, Ruth Pordes, Vijay Sekhri, John Weigand, Yujun Wu
- Hampton University: Keith Baker, Lawrence Sorrillo
- Harvard University: John Huth
- Indiana University: Matt Allen, Leigh Grundhoefer, John Hicks, Fred Luehring, Steve Peck, Rob Quick, Stephen Simms
- Johns Hopkins University: George Fekete, Jan vandenBerg
- Kyungpook National University / KISTI: Kihyeon Cho, Kihwan Kwon, Dongchul Son, Hyoungwoo Park
- Lawrence Berkeley National Laboratory: Shane Canon, Jason Lee, Doug Olson, Iwona Sakrejda, Brian Tierney
- University at Buffalo: Mark Green, Russ Miller
- University of California San Diego: James Letts, Terrence Martin
- University of Chicago: David Bury, Catalin Dumitrescu, Daniel Engh, Ian Foster, Robert Gardner, Marco Mambelli, Yuri Smirnov, Jens Voeckler, Mike Wilde, Yong Zhao, Xin Zhao
- University of Florida: Paul Avery, Richard Cavanaugh, Bockjoo Kim, Craig Prescott, Jorge L. Rodriguez, Andrew Zahn
- University of Michigan: Shawn McKee
- University of New Mexico: Christopher T. Jordan, James E. Prewett, Timothy L. Thomas
- University of Oklahoma: Horst Severini
- University of Southern California: Ben Clifford, Ewa Deelman, Larry Flon, Carl Kesselman, Gaurang Mehta, Nosa Olomu, Karan Vahi
- University of Texas, Arlington: Kaushik De, Patrick McGuigan, Mark Sosebee
- University of Wisconsin-Madison: Dan Bradley, Peter Couvares, Alan De Smet, Carey Kireyev, Erik Paulson, Alain Roy
- University of Wisconsin-Milwaukee: Scott Koranda, Brian Moe
- Vanderbilt University: Bobby Brown, Paul Sheldon
Contact authors
60 people working directly: 8 full time, 10 half time, 20 site admins at ¼ time
7 Grid3 services, roughly
- Site Software
- VO Services
- Information Services
- Monitoring
- Applications
8 Operations - Site Software
- Design of the Grid3 site software distribution base
- Largely based upon the successful WorldGrid installation deployment
- Create and maintain installation guides
- Coordinate upgrades after initial installation (Installation Fests)
[Diagram: Pacman delivers the iVDGL Grid3 site software (VDT, VO service, GIIS registration, information providers, Grid3 schema, log management) onto a Grid3 site's compute facility and storage]
9 Operations - Security Base
- The iVDGL Registration Authority is one of the Virtual Organization Registration Authorities (VO RAs) operating with delegated authority from the DOE Grids Certificate Authority
- The iVDGL RA is used to check the identity of individuals requesting certificates
- 282 iVDGL certificates have been issued for iVDGL use
10 Operations - Security Model
- Provide Grid3 compute resources with an automated multi-VO authorization model, using VOMS and mkgridmap
- Each VO manages a service and its members
- Each Grid3 site is able to generate a Globus authorization file via an authenticated SOAP query to each VO service (sketched below)
[Diagram: Grid3 sites build their Globus authorization files from the SDSS, USCMS, US ATLAS, BTeV, LSC, and iVDGL VOMS servers]
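The production path uses VOMS servers queried over SOAP by the mkgridmap tooling; the sketch below only mimics the final step of that model, merging per-VO member lists into a single Globus grid-mapfile. The file names and the plain-text member-list format are simplifying assumptions, not the real VOMS interface.

```python
# Sketch only: merge per-VO membership into a Globus grid-mapfile.
# In Grid3 the member lists come from VOMS via an authenticated SOAP
# query; here each VO's membership is assumed to be a plain text file
# of certificate DNs, and every member maps to a shared VO account.
vo_member_files = {            # hypothetical file names
    "usatlas": "usatlas_members.txt",
    "uscms":   "uscms_members.txt",
    "ivdgl":   "ivdgl_members.txt",
}

def build_gridmap(vo_member_files, out_path="grid-mapfile"):
    lines = []
    for vo, path in vo_member_files.items():
        with open(path) as f:
            for dn in f:
                dn = dn.strip()
                if dn:
                    # grid-mapfile format: "<quoted DN>" local_account
                    lines.append(f'"{dn}" {vo}')
    with open(out_path, "w") as out:
        out.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    build_gridmap(vo_member_files)
```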
11 Operations - Support and Policy
- Investigation and resolution of grid middleware problems at the level of 16-20 contacts per week
- Develop service level agreements for grid service systems and iGOC support services, and for other centralized support points such as the LHC Tier1
- Membership charter completed, defining the process to add new VOs, sites, and applications to the Grid Laboratory
- Support matrix defining Grid3 and VO service providers and contact information
12 Operations - Index Services
- Hierarchical Globus Information Index Service (GIIS) design (an example client query is sketched below)
- Automated resource registration to the index service
- MDS Grid3 schema development and information provider verification
- MDS tuning for a large heterogeneous grid
[Diagram: site GRIS servers (e.g. Boston U, UofChicago, ANL, BNL, UFL, FNAL, RiceU, CalTech) register into per-VO index services (USAtlas GIIS, USCMS GIIS, ...; 6 in all), which feed the Grid3 Index Service; published attributes include the Grid3 location and the Grid3 data, applications, and temporary directories]
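As a rough illustration of how a client reads this hierarchy, the sketch below shells out to ldapsearch against a GIIS (MDS 2.x speaks LDAP). The host name is a placeholder, and the port and base DN follow the usual Globus MDS 2.x conventions, which individual sites or VO indexes may override.

```python
# Sketch: query a GIIS/GRIS with ldapsearch.  Host, port, and base DN
# are assumptions; the attributes returned would include the Grid3
# application/data/temporary directory values mentioned above.
import subprocess

def query_giis(host="giis.example.org", port=2135,
               base="mds-vo-name=local,o=grid", filt="(objectClass=*)"):
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", f"ldap://{host}:{port}", "-b", base, filt]
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return out.stdout

if __name__ == "__main__":
    print(query_giis())
```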
13 Grid Operations - Site Monitoring
- Ganglia
- Open source tool to collect cluster monitoring information such as CPU and network load, memory and disk usage
- MonALISA
- Monitoring tool to support resource discovery, access to information, and a gateway to other information-gathering systems
- ACDC Job Monitoring System
- Application using grid-submitted jobs to query the job managers and collect information about jobs; this information is stored in a DB and available for aggregated queries and browsing
- Metrics Data Viewer (MDViewer)
- Analyzes and plots information collected by the different monitoring tools, such as the DBs at the iGOC
- Globus MDS
- Grid2003 schema for the information services and index services
14 Monitoring services
[Diagram: producers (OS syscalls and /proc, GRIS, log files, system configuration, job managers) feed intermediaries (MonALISA clients) which serve consumers (WWW reports, user clients, MDViewer)]
15 Ganglia
- Usage information
- CPU load
- NIC traffic
- Memory usage
- Disk usage
- Used directly and indirectly
- Site Web pages
- Central Web pages
- MonALISA agent
16 MonALISA
- Flexible framework
- Java based
- JINI directory
- Multiple agents
- Nice graphic interface
- 3D globe
- Elastic connection
- ML repository
- Persistent repository
17 Metrics Data Viewer
- Flexible tool for information analysis
- Databases, log files
- Multiple plot capabilities
- Predefined plots, possibility to add new ones
- Customizable, possibility to export the plots
18 MDViewer (2)
- CPU provided
- CPU used
- Load
- Usage per VO
- IO
- NIC
- File transfers per VO
- Jobs
- Submitted, Failed,
- Running, Idle,
19 MDViewer (3)
20 Operations - Site Catalog Map
21 Grid2003 Applications
22 Application Overview
- 7 scientific applications and 3 CS demonstrators
- All iVDGL experiments participated in the Grid2003 project
- A third HEP and two bio-chemical experiments also participated
- Over 100 users authorized to run on Grid3
- Application execution performed by dedicated individuals
- Typically 1, 2, or 3 users ran the applications from a particular experiment
- Participation from all Grid3 sites
- Sites categorized according to policies and resources
- Applications ran concurrently on most of the sites
- Large sites with generous local-use policies were more popular
23 Scientific Applications
- High energy physics simulation and analysis
- US CMS MOP: GEANT-based full MC simulation and reconstruction
- Workflow and batch job scripts generated by McRunJob
- Jobs generated at the MOP master (outside of Grid3), which submits to Grid3 sites via Condor-G
- Data products are archived at the Fermilab SRM/dCache
- US ATLAS GCE: GEANT-based full MC simulation and reconstruction
- Workflow is generated by the Chimera VDS, with the Pegasus grid scheduler and Globus MDS for resource discovery
- Data products archived at BNL; Magda and the Globus RLS are employed
- US ATLAS DIAL: distributed analysis application
- Dataset catalogs built, n-tuple analysis and histogramming (data generated on Grid3)
- BTeV: full MC simulation
- Also utilizes the Chimera workflow generator and Condor-G (VDT)
24 Scientific Applications, cont.
- Astrophysics and astronomy
- LIGO/LSC: blind search for continuous gravitational waves
- SDSS: maxBcg, cluster finding package
- Bio-chemical
- SnB: bio-molecular program, analyses of X-ray diffraction to find molecular structures
- GADU/Gnare: genome analysis, compares protein sequences
- Computer science
- Evaluation of adaptive data placement and scheduling algorithms
25 CS Demonstrator Applications
- Exerciser
- Periodically runs low-priority jobs at each site to test operational status
- NetLogger-grid2003
- Monitored data transfers between Grid3 sites via a NetLogger-instrumented pyglobus-url-copy
- GridFTP Demo
- Data mover application using GridFTP, designed to meet the 2 TB/day metric
26 Running on Grid3
- With information provided by the Grid3 information system, the user:
- Composes a list of target sites
- Resources available
- Local site policies
- Finds where to install the application and where to write data
- The MDS information system is used
- Provides pathnames for the APP, DATA, TMP, and WNTMP areas
- The user sends and remotely installs the application from a local site
- The user submits job(s) through Globus GRAM
- The user does not need to interact with local site administrators (see the sketch below)
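A minimal sketch of these per-site steps, driven from a submit host holding a valid grid proxy, is shown below. globus-url-copy and globus-job-run are the standard Globus 2.x clients shipped in the VDT; the gatekeeper contact, jobmanager name, and directory paths are placeholders that in practice come from the MDS attributes just described.

```python
# Sketch of running on a Grid3 site: stage the application kit, then
# submit through the gatekeeper.  Host names, jobmanager, and paths are
# hypothetical and would come from the Grid3 information system.
import subprocess

SITE = "gatekeeper.example.edu"           # hypothetical Grid3 site
JOBMANAGER = f"{SITE}/jobmanager-condor"  # site-dependent
APP_DIR = "/grid3/app/myexp"              # from MDS application dir
TMP_DIR = "/grid3/tmp/myexp"              # from MDS temporary dir

def stage_application(tarball="/home/user/myapp.tar.gz"):
    # Copy the application kit into the site's shared application area.
    subprocess.run(["globus-url-copy",
                    f"file://{tarball}",
                    f"gsiftp://{SITE}{APP_DIR}/myapp.tar.gz"], check=True)

def submit_job(executable=f"{APP_DIR}/run_myapp.sh"):
    # Fork a job through the gatekeeper; the local jobmanager queues it.
    subprocess.run(["globus-job-run", JOBMANAGER,
                    executable, TMP_DIR], check=True)

if __name__ == "__main__":
    stage_application()
    submit_job()
```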
27 US CMS use of Grid3
[Plot: history of US CMS usage over the past three months, in CPU-days]
28 ATLAS PreDC2 on Grid3 (Fall 2003)
- US ATLAS PreDC2 exercise
- Development of ATLAS tools for DC2
- Collaborative work on Grid2003 project
- Gain experience with the LCG grid
[Diagram: US ATLAS Testbed - shared, heterogeneous US ATLAS resources contributed to Grid2003]
29 PreDC2 approach
- Monte Carlo production on Grid3 sites
- Dynamic software installation of ATLAS releases on Grid3
- Integrated GCE client-server based on VDT tools (Chimera, Pegasus, control database, lightweight grid scheduler, Globus MDS)
- Measure job performance metrics using MonALISA and MDViewer (metrics data viewer)
- Collect and archive output to BNL
- MAGDA and Globus RLS used (distributed LRC/RLI at two sites)
- Reconstruct a fraction of the data at CERN and other LCG-1 sites
- To pursue Chimera/VDT interoperability issues with LCG-1
- Copy data back to BNL to exercise Tier1-CERN links
- Successful integration post SC2003!
- Analysis: distributed analysis of datasets using DIAL
- Dataset catalogs built, n-tuple analysis and histogramming
- Metrics collected and archived on all of the above
30 US ATLAS Datasets on Grid3
- Grid3 resources used
- 16 sites, 1500 CPUs exercised, peak of 400 jobs, over a three-week period
- Higgs → 4 lepton sample
- Simulation and reconstruction
- 2000 jobs (x 6 subjobs), 100-200 events per job (~200K events)
- 500 GB output data files
- Top sample
- Reproduce DC1 dataset simulation and reconstruction steps
- 1200 jobs (x 6 subjobs), 100 events per job (120K sample)
- 480 GB input data files
- Data used by PhD student at U. Geneva
[Plot: H → 4e production progress; 11/18/03: 50% of sample, 800 jobs]
31 US ATLAS during SC2003
[Plots: CPU usage totals and CPU usage by day, for ATLAS and CMS]
32 US ATLAS and LCG
- ATLSIM output staged on disk at BNL
- Use the GCE-Client host to submit to an LCG-1 server
- Jobs executing on LCG-1
- Input files registered at the BNL RLI
- Stage data from BNL to the local scratch area
- Run Athena reconstruction using release 6.5.0
- Write data to the local storage element
- Copy back to the disk cache at BNL, register in RLS
- Implementing 3rd-party transfer between LCG SEs and BNL GridFTP servers
- Post-job
- Magda registration of ESD (combined ntuple output), verification
- DIAL dataset catalog and analysis
- Many lessons learned about site architectural differences and service configurations
- Concrete steps towards combined use of LCG and US grids!
33 Grid2003 Metrics and Lessons
34 Metrics Summary
35 Grid3 Metrics Collection
- Grid3 monitoring system
- MonALISA
- Metrics Data Viewer
- Queries to the persistent storage DB
- MonALISA plots
- MDViewer plots
36 Grid2003 Metrics Results
- Hardware resources
- Total of 2762 CPUs
- Maximum CPU count
- Off-project contribution > 60%
- Total of 27 sites
- 27 administrative domains with local policies in effect
- Across the US and Korea
- Running jobs
- Peak number of jobs: 1100
- During SC2003, various applications were running simultaneously across various Grid3 sites
37 Data transfers around Grid3 sites
- Data transfer metric
- GridFTP demo
- Data transfer application
- Used concurrently with application runs
- Target met 11.12.03 (4.4 TB); see the rate check below
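The 2 TB/day metric corresponds to a fairly modest sustained rate, as the one-line check below shows (the arithmetic only, nothing here is project data).

```python
# Back-of-the-envelope check of the 2 TB/day transfer metric.
tb_per_day = 2
mb_per_s = tb_per_day * 1024 * 1024 / 86400   # ~24 MB/s sustained
print(f"{tb_per_day} TB/day ~= {mb_per_s:.1f} MB/s sustained")
```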
38 Global Lessons
- Grid2003 was an evolutionary change in environment
- The infrastructure was built on established testbed efforts from the participating experiments
- However, it was a revolutionary change in scale
- The available computing and storage resources increased
- A factor of 4-5 over individual VO environments
- The human interactions in the project increased
- More participants from more experiments and closer interactions with CS colleagues
- It is difficult to find many positive examples of successful common projects between experiments
- Grid2003 is an exemplar of how this can work
39 Lessons Categories
The Grid2003 lessons can be divided into two categories:
- Architectural lessons
- Service setup on the clusters
- Processing, storage, and information-providing issues
- Scale that can realistically be achieved
- Interoperability issues with other grid projects
- Identifying single-point failures
- Operational lessons
- Software configuration and software updates
- Troubleshooting
- Support and contacts
- Information exchange and coordination
- Service levels
- Grid operations
These lessons are guiding the next phase of development.
40 Identifying System Bottlenecks
- Grid2003 attempted to keep requirements on sites light
- Install software on interface nodes
- Do not install software on worker nodes
- Makes the system flexible and deployable on many architectures
- Maximizes participation
- Can place heavy loads on the gateway systems
- As these services become overloaded, they also become unreliable
- Need to improve the scalability of the architecture
- Difficult to make larger clusters
- Head nodes already need to be powerful systems to keep up with the requirements
[Diagram: a head node carrying the Grid2003 packages bridges the WAN to the cluster's LAN of worker nodes and storage]
41 Gateway Functionality
- Gateway systems are currently expected to:
- Handle the GRAM bridge for processing
- Involves querying running and queued jobs regularly
- Significant load (see the sketch below)
- Handle data transfers in and out for all jobs running on the cluster
- Allows worker nodes to not need WAN access
- Significant load
- Monitor and publish the status of cluster elements
- Audit and record cluster usage
- Small load
- Collect and publish information
- Site configuration and GLUE schema elements
- Small load
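Why the GRAM bridge is a significant load can be seen from a toy poller: with one status query per managed job per cycle, a gateway tracking hundreds of jobs fires hundreds of scheduler queries each minute. The snippet below is purely illustrative (Condor is used as the example scheduler; the polling loop is not how the real jobmanagers are written).

```python
# Toy illustration of per-job status polling on the gateway: each
# managed job triggers its own scheduler query every cycle.
import subprocess
import time

def poll_jobs(job_ids, interval=60):
    while True:
        for job in job_ids:
            # One scheduler query per job per cycle (illustrative only).
            subprocess.run(["condor_q", str(job)], capture_output=True)
        time.sleep(interval)

# At the SC2003 peak of ~1100 running jobs, a naive one-query-per-job
# poller would issue ~1100 scheduler queries per minute on the head node.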
42 Processing Requirements
- Grid2003 is a diverse set of sites
- We are attempting to provide applications with a richer description of resource requirements and to collect job execution requirements
- We have not deployed resource brokers and have limited use of job optimizers
- Even with manually submitted grid jobs, it is hard to know and then satisfy the site requirements
- We have periodically fallen back to e-mail information exchange
- It is equally hard for sites to know what the requirements and activities of incoming processes will be
- The Grid2003 lesson was that it is difficult to submit jobs with requirement exchange
- Need to improve and automate information exchange
43 Single Point Failures
- The Grid2003 team already knew that single-point failures are bad
- Fortunately, we also learned that in our environment they are rare
- The grid certificate authorities are, by design, a single point of information
- The only grid-wide meltdowns were a result of certificate authority problems
- Unfortunately, this happened more than once
- The Grid2003 information providers were not protected against loss
- Even during failures, grid functionality was not lost
- The information provided is currently fairly static
44 Operational Issue - Configuration
- One of the first operational lessons Grid2003 learned was the need for better tools to check and verify the configuration
- We had good packaging and deployment tools through Pacman, but we spent an enormous amount of time and effort diagnosing configuration problems
- Early on, Grid2003 implemented tools and scripts to check site status, but there were few tools that allowed a site admin to check site configuration
- Too much manual information exchange was required before a site was operational
- In the next phase of the project we expect to improve this (a sketch of such a check follows)
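A minimal sketch of the kind of local self-check a site admin could run is shown below: verify that the directories the site advertises through the information system actually exist and are writable. The path names are hypothetical examples, not Grid3 defaults, and a real verification tool would cover far more (certificates, gatekeeper, schema attributes).

```python
# Sketch of a local site self-check: confirm that advertised Grid3
# directories exist and are writable.  Paths are hypothetical.
import os

ADVERTISED = {
    "GRID3_APP_DIR":  "/grid3/app",
    "GRID3_DATA_DIR": "/grid3/data",
    "GRID3_TMP_DIR":  "/grid3/tmp",
}

def check_site():
    ok = True
    for name, path in ADVERTISED.items():
        if not os.path.isdir(path):
            print(f"FAIL {name}: {path} does not exist")
            ok = False
        elif not os.access(path, os.W_OK):
            print(f"FAIL {name}: {path} not writable")
            ok = False
        else:
            print(f"ok   {name}: {path}")
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check_site() else 1)
```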
45 Software Updates
- The software updates in Grid2003 varied from relatively easy to extremely painful
- Some software packages could self-update
- The monitoring framework MonALISA could upgrade its entire distribution
- This allowed developers to make improvements and satisfy changing requirements without involving site admins
- Could be set to update regularly with a cron job, or triggered manually
- Grid2003 is investigating whether this is a reasonable operating mode
- This functionality is becoming more common
46 Grid3 Evolution
47 Grid3 → Grid3+
- Endorsement in December for extension of the project
- US LHC and Trillium grid projects support adiabatic upgrades to Grid3 and continued operation for the next 6 months
- Begin as well a development activity focusing on grid/web services, but at a much reduced level
- Planning exercise in January-February to collect key issues from stakeholders
- Each VO: US ATLAS, US CMS, SDSS, LIGO, BTeV, iVDGL
- iGOC coordinating operations
- VDT as the main supplier of core middleware
- Two key recommendations:
- Development of a storage element interface and its introduction into Grid3
- Support for a common development test grid environment
48 Grid Deployment Goals for 2004
- Detailed instructions for operating grid middleware through firewalls and other port-limiting software
- Introduction of VOMS-extended X.509 proxies to allow callout authorization at the resource
- Secure web service for user registration for VO members
- Evaluate new methods for VO accounts at sites
- Move to the new Pacman version 3 for software distribution and installation
- Adoption of new VDT core software, which provides install-time service configuration
- Support and testing for other distributed file system architectures
- Install-time support for multi-homed systems
49 iGOC Infrastructure Goals for 2004
- Status of site monitoring to enable an effective job scheduling system: collection and archival of monitoring data
- Reporting of available storage resources at each site
- Status display of current and queued workload for each site
- Integrate the Grid3 schema into the GLUE schema
- Acceptable Use Policy in development
- Grid3 Operational Model in negotiation
50 New Development Grid Infrastructure
Grid3 Common Environment:
- Authentication Service
- Approved VOMS servers
- Monitoring Service
- catalog
- MonALISA
- ganglia
- ACDC
- Stable Grid3 software cache
grid3dev Environment:
- Authentication Service
- test VOMS server
- approved VOMS servers
- new VOMS server(s)
- Monitoring Service
- catalog (test version)
- MonALISA (test version)
- ganglia (test version)
- Development s/w caches
51 Grid3 upgrades
- Tests of the new Grid3 installation on the US ATLAS DTG (Development Test Grid) for the past two weeks
- These tests were based on VDT 1.1.13
- Decision to wait for VDT 1.1.14
- VDT 1.1.14 has many desirable features
- Globus upgrade to 2.4 has many patches
- MDS and the rest of the Globus software are now in sync
- Much improved MonALISA installation
- The upgrade can proceed adiabatically
- Submission to a mixed-VDT grid is okay
- Support for the storage element will come later, during DC2
52 US ATLAS DC2 and Grid3
53 Execution System for DC2
- The execution system for US ATLAS is based on the ATLAS supervisor/executor job model
- ATLAS jobs described in the ATLAS production database must be translated by the executor into the job description language and execution environment of the US grid system (a sketch of this translation follows)
- Based on the VDT and divided into two parts
- Client side (job submission)
- Supervisor (Windmill) client
- GCE-Client (Grid Component Environment software: VDT-Client, Chimera virtual data system, Pegasus DAG builder)
- Capone (ATLAS job execution framework) web service
- Server side (job execution): the Grid
- Grid3 middleware and ACE (ATLAS software kit releases, GCE-Server, DC2 transformations, Pacman readiness kit)
- And other services: information, monitoring, VO management
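A minimal sketch of the translation step is shown below: a job record from the production database (represented here as a plain dict, a simplification) is rendered into a Globus RSL string. The real Capone/GCE chain builds Chimera/Pegasus DAGs rather than bare RSL, and the field names below are illustrative only.

```python
# Sketch: translate a production-DB job record into a GRAM RSL string.
# The dict layout and field names are hypothetical; the real executor
# goes through Chimera/Pegasus rather than emitting RSL directly.
def job_to_rsl(job):
    args = " ".join(f'"{a}"' for a in job["arguments"])
    return ("&(executable={exe})"
            "(arguments={args})"
            "(count={count})"
            "(stdout={out})").format(exe=job["transformation"],
                                     args=args,
                                     count=job.get("count", 1),
                                     out=job["logfile"])

example_job = {                      # hypothetical production-DB record
    "transformation": "dc2.g4sim.sh",
    "arguments": ["--events", "100", "--seed", "12345"],
    "logfile": "dc2.g4sim.12345.log",
}

if __name__ == "__main__":
    print(job_to_rsl(example_job))
```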
54 Phase 1
[Diagram: the Windmill supervisor host (Oracle client, user certificate) connects to the ATLAS production database (PDB) and, through a Jabber switch/proxy (XMPP/XML) and SOAP, to the Capone web service on the GCE-Client submit host (user certificate); Capone uses the Globus RLS, the ATLAS production DB, and Don Quijote (DQ), and submits via GRAM to Grid3 (VDT) compute elements carrying ACE, GCE-Server, the site readiness kit, and ATLAS releases; generator input and simulation output storage elements sit at BNL or any grid-visible location; a Capone diagnostic client is also shown]
55 DC2 Environments (Saul Youssef)
- Use Pacman 3 to capture working environments (ACE, Windmill, Capone, Don Quijote) as things come together for DC2
- Monitor sites, update software, add sites
56 Windmill-Capone Communication
- Supervisor request (web service / Jabber)
- XML message
- Request message translated
- Processing
- CPE elaboration
- Grid interactions
- Don Quijote interactions
- Response
- XML response
- Response to supervisor
[Diagram: message flow from the Supervisor through the messaging layer (WS/Jabber) and translation into the Capone process engine (CPE) and process DB, out to the Grid (GCE) and DQ]
57 Capone Processing
- Jobs received from the supervisor undergo 13 processing steps (sketched below)
- Receive, Translate, DAXgen, RLSreg, Schedule, cDAGgen, Submit, Run, Check, stageOut, Clean, Finish, Kill
- Status codes
- Success/failure (each step)
- Completion/failure (job)
[State diagram: executeJob moves a received job through the steps above, with recovery paths around stageOut, a fixJob repair path, and an end state]
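A minimal sketch of this per-job walk is given below: each step reports success or failure, and the job ends in a completed or failed state. The step names come from the slide; the handler is a stub and the real Capone recovery and kill paths are not modeled.

```python
# Sketch of the Capone-style step sequence: per-step success/failure,
# overall completion/failure.  Handlers are stubs.
STEPS = ["receive", "translate", "DAXgen", "RLSreg", "schedule",
         "cDAGgen", "submit", "run", "check", "stageOut", "clean",
         "finish"]                      # "kill" is an external abort path

def run_step(job, step):
    # Placeholder for the real work done at each stage.
    print(f"job {job['id']}: {step}")
    return True                         # pretend every step succeeds

def execute_job(job):
    for step in STEPS:
        if not run_step(job, step):
            job["status"] = f"failed:{step}"
            return job
    job["status"] = "completed"
    return job

if __name__ == "__main__":
    print(execute_job({"id": 1}))
```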
58 DC2 Metrics, to be collected
- CPU used: average number of CPUs used during the day
- CPU provided: average number of CPUs provided (by ATLAS and by the Grid) during the day
- Response time: completion time minus submission time (see the sketch below)
- Jobs wanted, submitted, successfully completed, failed
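The response-time metric as defined above is just a timestamp difference; a minimal sketch with made-up timestamps:

```python
# Response time = completion time - submission time (definition above).
from datetime import datetime

def response_time(submitted, completed, fmt="%Y-%m-%d %H:%M:%S"):
    t0 = datetime.strptime(submitted, fmt)
    t1 = datetime.strptime(completed, fmt)
    return (t1 - t0).total_seconds()

# Example with made-up timestamps:
print(response_time("2004-05-01 10:00:00", "2004-05-01 16:30:00") / 3600,
      "hours")
```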
59 Conclusions and Outlook
- Grid2003 taught us many lessons about how to deploy and run a grid
- Grid3 will be a critical resource for continued data challenges, which are driving the next phase of development
- The challenge will be to maintain a vital R&D effort while doing sustained production operations
- ATLAS DC2 begins May 1
- It will provide a new class of lessons and directions for future development