Title: Introduction to Grid Computing and the Globus Toolkit
1Introduction to Grid Computing and the Globus Toolkit
- Jennifer M. Schopf
- Argonne National Lab
- National eScience Centre
2Overview
- What is a Grid?
- What does the Globus Toolkit do?
- Security
- Resource Management
- Data Management
- Monitoring
- Example Application: OSG
- Conclusions
3What is a Grid?
- Resource sharing
- Computers, storage, sensors, networks,
- Sharing always conditional: issues of trust, policy, negotiation, payment, ...
- Coordinated problem solving
- Beyond client-server: distributed data analysis, computation, collaboration, ...
- Dynamic, multi-institutional virtual orgs
- Community overlays on classic org structures
- Large or small, static or dynamic
4What is a Grid?
- Many definitions, many differences, especially between academics and industry
- Both use the buzzword to get funding
- My definition:
- Resource sharing
- Coordinated problem solving
- Dynamic, multi-institutional virtual orgs
5Resource Sharing
- Resources can be anything:
- Computers
- Storage/repositories
- Sensors and Networks
- People and software
- Local control of the resources, and local policies for their use
- Sharing is always conditional
- Issues of trust, policy
- Negotiation and payment
6Coordinated Problem Solving
- Beyond client-server
- Client-server defines a small set of well-understood interactions as the only ones that can take place
- Actions in this space can include
- Distributed data analysis
- Computation and visualization of results
- Collaboration
7Dynamic, Multi-institutional Virtual Organizations
- Crossing administrative domains
- No one has full control over the resources
- Policy is local, not global
- Different local policy on different sites
- Community overlays on classic organizational structures
- Large or small, static or dynamic
8Why is this hard/different?
- Lack of central control
- Cannot dictate what runs on a resource, how, or when
- Different policies at different sites
- Heterogeneity is everywhere
- Shared resources
- Contention, variability
- Communication
- Different sites imply different sys admins, users, institutional goals, and often strong personalities
9So why do it?
- Computations that need to be done within a time limit
- Data that can't fit on one site
- Data owned by multiple sites
- Applications that need to be run bigger, faster, more
10What Kinds of Applications?
- Computation intensive
- Interactive simulation (climate modeling, financial markets)
- Very large-scale simulation and analysis (galaxy formation, gravity waves, battlefield simulation, business models)
- Engineering (parameter studies, linked models)
- Data intensive
- Experimental data analysis (high-energy physics)
- Image and sensor analysis (astronomy, climate study, ecology)
- Distributed collaboration
- Online instrumentation (microscopes, x-ray devices, etc.)
- Remote visualization (climate studies, biology)
- Engineering (large-scale structural testing, chemical engineering)
- All require people in several organizations to collaborate and share computing resources, data, and instruments
11History
- In the early 90s, Ian Foster (ANL, U-C) and Carl Kesselman (USC-ISI) enjoyed helping scientists apply distributed computing.
- Opportunities seemed ripe for the picking.
- Application of technology always uncovers new and interesting requirements.
- Science is cool!
- Big/innovative science is even cooler!
12What Types of Problems?
- While helping to build/integrate a diverse range of applications, the same problems kept showing up over and over again.
- Too hard to keep track of authentication data (ID/password) across institutions
- Too many ways to submit jobs
- Too many ways to store and access files and data
- Too many ways to keep track of data
- Too easy to leave dangling resources lying around (robustness)
13What Was Needed
- Solutions to common problems
- Way to address heterogeneity
- Way to use standards, or to help push standards forward
- Without standards we can't have interoperability
- Globus Toolkit was built to address this
14Overview
- What is a Grid?
- What does the Globus Toolkit do?
- Security
- Monitoring
- Resource Management
- Data Management
- Example Application: Grid3
- Conclusions
15The Role of the Globus Toolkit
- A collection of solutions to problems that come up frequently when building collaborative distributed applications
- Heterogeneity
- A focus, in particular, on overcoming heterogeneity for application developers
- Standards
- We capitalize on and encourage use of existing standards (IETF, W3C, OASIS, GGF)
- GT also includes reference implementations of new/proposed standards in these organizations
16Globus is an Hourglass
- Local sites have their own policies and installs: heterogeneity!
- Queuing systems, monitors, network protocols, etc.
- Globus unifies
- Build on Web services
- Use WS-RF, WS-Notification to represent/access state
- Common management abstractions and interfaces
Hourglass diagram, top to bottom: Higher-Level Services and Users; Standard GT4 Interfaces; Local heterogeneity
17What Is the Globus Toolkit?
- Collection of solutions to common problems when building collaborative distributed applications
- A set of basic Grid services
- Job submission/management
- File transfer (individual, queued)
- Database access
- Data management (replication, metadata)
- Monitoring/indexing system information
- A Grid development environment for your own services
- Building blocks for WSRF-compliant Web Services, including security infrastructure
- Tools and examples
- The prerequisites for many Grid community tools!
18Globus Is Standard Plumbing for the Grid
- Not turnkey solutions, but building blocks and tools for application developers and system integrators.
- Some components (e.g., file transfer) go farther than others (e.g., remote job submission) toward end-user relevance.
- Since these solutions exist and others are already using them (and they're free), it's easier to reuse than to reinvent.
- And compatibility with other Grid systems comes for free!
19How it Really Happens
Diagram components: Compute Server (two), Simulation Tool, Web Browser, Web Portal, Registration Service, Camera (two), Telepresence Monitor, Data Viewer Tool, Chat Tool, Data Catalog, Database service (three), Credential Repository, Certificate Authority
Diagram captions:
- Users work with client applications
- Application services organize VOs and enable access to other services
- Collective services aggregate and/or virtualize resources
- Resources implement standard access and management interfaces
20How it Really Happens (without the Globus Toolkit)
Diagram components, with five resources labelled A to E: Compute Server (A), Compute Server (B), Database service (C), Database service (D), Database service (E), Simulation Tool, Web Browser, Web Portal, Registration Service, Camera (two), Telepresence Monitor, Data Viewer Tool, Chat Tool, Data Catalog, Credential Repository, Certificate Authority
Diagram captions:
- Users work with client applications
- Application services organize VOs and enable access to other services
- Collective services aggregate and/or virtualize resources
- Resources implement standard access and management interfaces
21How it Really Happens (with the Grid)
Diagram components: Compute Server with Globus GRAM (two), Database service with Globus OGSA-DAI (three), Portal/CHEF, Globus Index Service, CHEF Chat Teamlet, Globus MCS/RLS, MyProxy Credential Repository, Certificate Authority, Simulation Tool, Web Browser, Camera (two), Telepresence Monitor, Data Viewer Tool
Diagram captions:
- Users work with client applications
- Application services organize VOs and enable access to other services
- Collective services aggregate and/or virtualize resources
- Resources implement standard access and management interfaces
22Globus Toolkit V4.0
- Released in April 2005
- Previous fifteen months spent on design, development, and testing
- 1.8M lines of code
- Major contributions from five institutions
- Hundreds of millions of service calls executed over weeks of continuous operation
- Significant improvements over GT3 code base in all dimensions
23Our Goals for GT4
- Usability, reliability, scalability, ...
- Web service components have quality equal or superior to pre-WS components
- Documentation at acceptable quality level
- Consistency with latest standards (WS-*, WSRF, WS-N, etc.) and Apache platform
- WS-I Basic (Security) Profile compliant
- New components, platforms, languages
- And links to larger Globus ecosystem
24Open Source/Open Standards
- WSRF developed in collaboration with IBM
- Currently in OASIS process
- Contributions to Apache for
- WS-Security
- WS-Addressing
- Axis
- Apollo (WSRF)
- Hermes (WS-Notification)
25WSRF vs XML/SOAP
- The definition of WSRF means that the Grid and Web services communities can move forward on a common base
- Why not just use XML/SOAP?
- WSRF and WS-N are just XML and SOAP
- WSRF and WS-N are just Web services
- Benefits of following the specs
- These patterns represent best practices that have been learned in many Grid applications
- There is a community behind them
- Why reinvent the wheel?
- Standards facilitate interoperability
26Globus Toolkit: Open Source Grid Infrastructure
Globus Toolkit v4 (www.globus.org) components, by area:
- Data Mgmt: Data Replication, Replica Location, Data Access and Integration, Reliable File Transfer, GridFTP
- Security: Credential Mgmt, Delegation, Community Authorization, Authentication and Authorization
- Common Runtime: Python Runtime, C Runtime, Java Runtime
- Execution Mgmt: Grid Telecontrol Protocol, Community Scheduling Framework, Workspace Management, Grid Resource Allocation and Management
- Info Services: WebMDS, Trigger, Index
27Why Grid Security is Hard
- Resources being used may be valuable and the problems being solved sensitive
- Resources are often located in distinct administrative domains
- Each resource has its own policies and procedures
- Set of resources used by a single computation may be large, dynamic, and unpredictable
- Not just client/server; requires delegation
- It must be broadly available and applicable
- Standard, well-tested, well-understood protocols integrated with a wide variety of tools
28Security Tools
- Grid Security is based on public key infrastructure (PKI)
- Basic Grid Security Mechanisms
- Certificate Generation Tools
- Certificate Management Tools
- Getting users registered to use a Grid
- Getting Grid credentials to wherever they're needed in the system
- Authorization/Access Control Tools
- Storing and providing access to system-wide authorization information
29Basic Grid Security Mechanisms
- Globus Toolkit provides
- Grid-wide identities implemented as PKI certificates
- Transport-level and message-level authentication
- Ability to delegate credentials to agents
- Ability to map between Grid and local identities
- Local security administration and enforcement
- Single sign-on support implemented as proxies
- A plug-in framework for authorization decisions
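The mapping between Grid and local identities can be pictured as a lookup from a certificate subject name to a local account. The sketch below assumes a mapping file in the style of a GSI grid-mapfile, where each line holds a quoted subject name followed by a local user name. The helper names and the sample entries are hypothetical; this is an illustration, not Globus Toolkit code.

```python
import shlex


def parse_gridmap(text):
    """Return {subject_name: local_user} from lines like: "/O=Grid/CN=Jane Doe" jdoe"""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # honours the quotes around the subject name
        if len(parts) != 2:
            raise ValueError(f"malformed mapping line: {line!r}")
        subject, local_user = parts
        mapping[subject] = local_user
    return mapping


def local_account(subject, mapping):
    """Map an authenticated Grid identity to a local account, or refuse."""
    if subject not in mapping:
        raise PermissionError(f"no local account for {subject}")
    return mapping[subject]


# Hypothetical entries, for illustration only.
SAMPLE = '''
# subject name              local user
"/O=Grid/OU=Example/CN=Jane Doe" jdoe
"/O=Grid/OU=Example/CN=John Roe" jroe
'''
```

Because the site owns the mapping file, the decision about which Grid identities become which local users stays under local control.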
30Other Security Services Include
- MyProxy
- Simplified credential management
- Web portal integration
- Single-sign-on support
- KCA and kx.509
- Bridging into/out-of Kerberos domains
- SimpleCA
- Online credential generation
- PERMIS
- Authorization service callout
31A Cautionary Note
- Grid security mechanisms are tedious to set up.
- If exposed to users, hand-holding is usually required.
- These mechanisms can be hidden entirely from end users, but still used behind the scenes.
- These mechanisms exist for good reasons.
- Many useful things can be done without Grid security.
- It is unlikely that an ambitious project could go into production operation without security like this.
- Most successful projects end up using Grid security, but using it in ways that end users don't see much.
32The Resource Management Challenge
- Enabling secure, controlled remote access to heterogeneous computational resources and management of remote computation
- Authentication and authorization
- Resource discovery and characterization
- Reservation and allocation
- Computation monitoring and control
- Addressed by a set of protocols and services
- GRAM protocol as a basic building block
- Resource brokering and co-allocation services
- GSI for security, MDS for discovery
33GRAM - Basic Job Submission and Control Service
- A uniform service interface for remote job submission and control
- Includes file staging and I/O management
- Includes reliability features
- Supports basic Grid security mechanisms
- Available in Pre-WS and WS
- GRAM is not a scheduler.
- No scheduling
- No metascheduling/brokering
- Often used as a front-end to schedulers, and often used to simplify metaschedulers/brokers
34Execution Management (GRAM)
- Common WS interface to schedulers
- Unix, Condor, LSF, PBS, SGE, ...
- More generally: interface for process execution management
- Lay down execution environment
- Stage data
- Monitor and manage lifecycle
- Kill it, clean up
- A basis for application-driven provisioning
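The lifecycle steps listed above (lay down an execution environment, stage data, monitor, kill, clean up) can be sketched on a single machine. This is a local illustration with a hypothetical `run_job` helper built on the Python standard library; it is not the GRAM interface, and it does no scheduling or remote submission.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

PYTHON = sys.executable  # interpreter used in the usage example below


def run_job(argv, inputs, timeout=10.0):
    """Run argv in a fresh working directory; inputs maps file name -> bytes.

    Returns (exit_code, stdout). The directory is always removed afterwards.
    """
    workdir = Path(tempfile.mkdtemp(prefix="job-"))     # lay down execution environment
    try:
        for name, data in inputs.items():               # stage data in
            (workdir / name).write_bytes(data)
        proc = subprocess.Popen(argv, cwd=workdir,
                                stdout=subprocess.PIPE, text=True)
        try:
            out, _ = proc.communicate(timeout=timeout)  # monitor until done
        except subprocess.TimeoutExpired:
            proc.kill()                                 # kill it
            out, _ = proc.communicate()
        return proc.returncode, out
    finally:
        shutil.rmtree(workdir, ignore_errors=True)      # clean up
```

For example, `run_job([PYTHON, "-c", "print(open('in.txt').read())"], {"in.txt": b"hello"})` stages one input file, runs a short script against it, and removes the working directory whether or not the job succeeds.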
35GT4 GRAM
- 2nd-generation WS implementation
- Optimized for performance, stability, scalability
- Streamlined critical path
- Use only what you need
- Flexible credential management
- Credential cache and delegation service
- GridFTP and RFT used for data operations
- Data staging and streaming output
- Eliminates redundant GASS code
- Single and multi-job support
37GT4 Data Management
- Stage/move large data to/from nodes
- GridFTP, Reliable File Transfer (RFT)
- Alone, and integrated with GRAM
- Locate data of interest
- Replica Location Service (RLS)
- Replicate data for performance/reliability
- Distributed Replication Service (DRS)
- Provide access to diverse data sources
- File systems, parallel file systems, hierarchical storage: GridFTP
- Databases: OGSA-DAI
38Data Management
Diagram: an application locating and fetching replicated data
- Numbered steps: (1) Attribute Specification, (2) Logical Collection and Logical File Name, (3) Log. Info, (4) Multiple Locations, (5) Selected Replica, (6) Phys. Info
- Services: Application, Metadata Catalog, Replica Catalog, Replica Selection, MDS, NWS (Performance Information and Predictions)
- Transfer: GridFTP Control Channel, GridFTP Data Channel
- Storage: Replica Location 1, Replica Location 2, Replica Location 3 (Disk Cache, Tape Library, Disk Array, Disk Cache)
39GridFTP
- A high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks
- FTP with well-defined extensions
- Uses basic Grid security (control and data channels)
- Multiple data channels for parallel transfers
- Partial file transfers
- Third-party (direct server-to-server) transfers
- Reusable data channels
- Command pipelining
- GGF recommendation GFD.20
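Partial file transfers and multiple data channels combine naturally: a file is cut into byte ranges, and each range moves on its own channel. The sketch below shows only that idea, using in-memory bytes and worker threads. The function names are hypothetical and nothing here speaks the GridFTP protocol.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, channels):
    """Split [0, size) into at most `channels` contiguous (offset, length) ranges."""
    if channels < 1:
        raise ValueError("need at least one channel")
    base, extra = divmod(size, channels)
    ranges, offset = [], 0
    for i in range(channels):
        length = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges


def parallel_copy(source, channels=4):
    """Copy `source` (bytes) by moving each byte range on its own worker thread."""
    dest = bytearray(len(source))

    def fetch(rng):
        offset, length = rng
        # Each worker performs one partial transfer into its own slice of dest.
        dest[offset:offset + length] = source[offset:offset + length]

    with ThreadPoolExecutor(max_workers=channels) as pool:
        list(pool.map(fetch, split_ranges(len(source), channels)))
    return bytes(dest)
```

Since every range carries its own offset, a failed range can be retried alone, which is the same property that makes restarts of partial transfers cheap.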
40Striped GridFTP Service
- A distributed GridFTP service that runs on a storage cluster
- Every node of the cluster is used to transfer data into/out of the cluster
- Head node coordinates transfers
- Multiple NICs/internal busses lead to very high performance
- Maximizes use of Gbit WANs
41RFT - File Transfer Queuing
- A GT4 web service for queuing file transfer requests
- Server-to-server transfers
- Checkpointing for restarts
- Database back-end for failovers
- Allows clients to request transfers and then disappear
- No need to manage the transfer
- Status monitoring available if desired
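The combination of a queue, checkpoints, and a database back-end can be sketched with a small table that records how many bytes of each request have been moved. The class below is a hypothetical illustration using SQLite from the Python standard library; the schema, method names, and example URLs are assumptions, not the RFT service interface.

```python
import sqlite3


class TransferQueue:
    """Queue of transfer requests whose progress is kept in a database."""

    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS transfers ("
            "id INTEGER PRIMARY KEY, src TEXT, dst TEXT, "
            "size INTEGER, done INTEGER DEFAULT 0)"
        )
        self.db.commit()

    def submit(self, src, dst, size):
        """Record a request; the client may disconnect after this returns."""
        cur = self.db.execute(
            "INSERT INTO transfers (src, dst, size) VALUES (?, ?, ?)",
            (src, dst, size))
        self.db.commit()
        return cur.lastrowid

    def advance(self, transfer_id, nbytes):
        """Checkpoint: note that nbytes more have been moved."""
        self.db.execute(
            "UPDATE transfers SET done = MIN(size, done + ?) WHERE id = ?",
            (nbytes, transfer_id))
        self.db.commit()

    def status(self, transfer_id):
        """Return (bytes_done, size) for optional status monitoring."""
        return self.db.execute(
            "SELECT done, size FROM transfers WHERE id = ?",
            (transfer_id,)).fetchone()

    def pending(self):
        """Unfinished transfers with the offset to restart from."""
        return self.db.execute(
            "SELECT id, done FROM transfers WHERE done < size ORDER BY id"
        ).fetchall()
```

Opening the queue on a file path instead of `":memory:"` lets a restarted process call `pending()` and resume each transfer from its last checkpoint.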
42Replica Location Service
- Identify location of files via logical-to-physical name map
- Distributed indexing of names, fault-tolerant update protocols
- GT4 version scalable and stable
- Managing 40 million files across 10 sites
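A logical-to-physical name map with a separate index over site catalogs can be sketched in a few lines. The classes and the sample names below are hypothetical and in-memory only; they illustrate the data structure, not the RLS protocol or its fault-tolerant updates.

```python
from collections import defaultdict


class ReplicaCatalog:
    """One site's map from logical file name (LFN) to physical file names (PFNs)."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, lfn, pfn):
        self._replicas[lfn].add(pfn)

    def knows(self, lfn):
        return lfn in self._replicas

    def lookup(self, lfn):
        """Return the known physical names, sorted; empty list if none."""
        return sorted(self._replicas.get(lfn, ()))


class ReplicaIndex:
    """Index over several site catalogs: answers which sites to ask about an LFN."""

    def __init__(self):
        self._sites = {}

    def add_site(self, name, catalog):
        self._sites[name] = catalog

    def sites_for(self, lfn):
        return sorted(n for n, c in self._sites.items() if c.knows(lfn))

    def locate(self, lfn):
        """Collect physical names from every site that holds a replica."""
        found = []
        for name in self.sites_for(lfn):
            found.extend(self._sites[name].lookup(lfn))
        return found
```

Keeping the index separate from the catalogs means a lookup touches only the sites that actually hold a copy.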
43OGSA-DAI
- Web service interface for accessing XML and relational data stores
- Implements the GGF DAIS WG standard (in progress)
Figure courtesy of Malcolm Atkinson and Rob Baxter, UK eScience Center
45Monitoring and Discovery System (MDS4)
- Grid-level monitoring system used most often for resource selection
- Aid user/agent to identify host(s) on which to run an application
- Uses standard interfaces to provide publishing of data, discovery, and data access, including subscription/notification
- WS-ResourceProperties, WS-BaseNotification, WS-ServiceGroup
- Functions as an hourglass to provide a common interface to lower-level monitoring tools
46Information Users: Schedulers, Portals, etc.
WS standard interfaces for subscription, registration, notification
GLUE Schema Attributes (cluster info, queue info, FS info)
47MDS4 Components
- Higher level services
- Index Service: a way to aggregate data
- Trigger Service: a way to be notified of changes
- Both built on common aggregator framework
- Information providers
- Monitoring is a part of every WSRF service
- Non-WS services can also be used
- Clients
- WebMDS
- All of the tools are schema-agnostic, but interoperability needs a well-understood common language
48MDS4 Index Service
- Index Service is both registry and cache
- Subscribes to information providers
- Publishes (as resource properties)
- Datatype and data provider info, like a registry
- Last value of data, like a cache
- In-memory default approach; DB backing store currently being developed to allow for very large indexes
- Soft-state registration
- Can be set up for a site or set of sites, a specific set of project data, or for user-specific data only
- Can be a multi-rooted hierarchy
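Soft-state registration means an entry stays in the index only while its provider keeps refreshing it. A minimal sketch of a registry that also caches the last value, assuming a fixed lifetime per registration; the class, the lifetime value, and the provider names are hypothetical, not the MDS4 implementation.

```python
import time


class SoftStateIndex:
    """Registry plus last-value cache; entries vanish unless refreshed."""

    def __init__(self, lifetime=60.0, clock=time.monotonic):
        self.lifetime = lifetime
        self.clock = clock          # injectable so expiry can be tested without waiting
        self._entries = {}          # provider name -> (expiry time, last value)

    def register(self, provider, value=None):
        """Register or refresh a provider and cache its latest value."""
        self._entries[provider] = (self.clock() + self.lifetime, value)

    def _purge(self):
        now = self.clock()
        expired = [n for n, (expiry, _) in self._entries.items() if expiry <= now]
        for name in expired:
            del self._entries[name]

    def providers(self):
        """Registry view: names of providers that are still registered."""
        self._purge()
        return sorted(self._entries)

    def last_value(self, provider):
        """Cache view: most recent value published by a live provider."""
        self._purge()
        return self._entries[provider][1]
```

A provider that crashes simply stops refreshing and drops out after one lifetime, so nobody has to deregister it explicitly.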
49MDS4 Trigger Service
- Subscribe to a set of resource properties
- Evaluate that data against a set of pre-configured conditions (triggers)
- When a condition matches, email is sent to a pre-defined address
- Similar functionality in Hawkeye
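The evaluate-and-notify loop can be sketched as a function that checks each pre-configured condition against the current property values. Everything named here is hypothetical: conditions are plain Python callables, and `notify` stands in for sending email.

```python
def evaluate_triggers(properties, triggers, notify):
    """Call notify(address, message) for every trigger whose condition matches.

    properties: dict of resource property name -> current value
    triggers:   list of (name, condition, address); condition is a callable
                taking the properties dict and returning True or False
    Returns the names of the triggers that fired.
    """
    fired = []
    for name, condition, address in triggers:
        if condition(properties):
            notify(address, f"trigger {name!r} matched: {properties}")
            fired.append(name)
    return fired


# Hypothetical triggers and addresses, for illustration only.
EXAMPLE_TRIGGERS = [
    ("low-disk", lambda p: p["free_disk_gb"] < 10, "admin@example.org"),
    ("high-load", lambda p: p["load"] > 8, "ops@example.org"),
]
```

Passing `notify` as a parameter keeps the matching logic separate from the delivery mechanism, so the same conditions could drive email, logging, or a pager.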
50Information Providers: Cluster and Queue Data
- Interfaces to Hawkeye, Ganglia, CluMon
- Not WS, so these are Execution Sources
- Basic host data (name, ID), processor information, memory size, OS name and version, file system data, processor load data
- Some Condor/cluster-specific data
- Interfaces to PBS, Torque, LSF queue systems
- Queue information, number of CPUs available and free, job count information, some memory statistics and host info for head node of cluster
51Information Providers: GT4 Services
- Every WS built using GT4 core
- ServiceMetaDataInfo element includes start time, version, and service type name
- Reliable File Transfer Service (RFT)
- Service status data, number of active transfers, transfer status, information about the resource running the service
- Community Authorization Service (CAS)
- Identifies the VO served by the service instance
- Replica Location Service (RLS)
- Note: not a WS
- Location of replicas on physical storage systems (based on user registrations) for later queries
52WebMDS User Interface
- Web-based interface to WSRF resource property information
- User-friendly front-end to the Index Service
- Uses standard resource property requests to query resource property data
- XSLT transforms to format and display them
- Customized pages are simply done by using HTML form options and creating your own XSLT transforms
- Sample page
- http://mds.globus.org:8080/webmds/webmds?info=indexinfo&xsl=servicegroupxsl
53WebMDS Service
56Overview
- What is a Grid?
- What does the Globus Toolkit do?
- Security
- Resource Management
- Data Management
- Monitoring
- Example Application: Grid3
- Conclusions
57Open Science Grid (OSG)
- 28 sites (2100-2800 CPUs) and growing
- 400-1300 concurrent jobs
- 8 substantial applications plus CS experiments
- Running since October 2003
Korea
http://www.ivdgl.org/grid2003
58Grid2003 Project Goals
- Ramp up U.S. Grid capabilities in anticipation of LHC experiment needs in 2005.
- Build, deploy, and operate a working Grid.
- Include all U.S. LHC institutions.
- Run real scientific applications on the Grid.
- Provide state-of-the-art monitoring services.
- Cover non-technical issues (e.g., SLAs) as well as technical ones.
- Unite the U.S. CS and Physics projects that are aimed at support for LHC.
- Common infrastructure
- Joint (collaborative) work
59Example OSG Workflows
Genome sequence analysis
Sloan digital sky survey
Physics data analysis
60Grid2003 Components
- Security
- GT GSI, GSI-OpenSSH, Community Authorization Service (CAS)
- Monitoring
- GT MDS, Ganglia on local systems
- Job Submission
- GT GRAM, Chimera and Pegasus for workflows
- Data Tools
- GT GridFTP, GT RLS, metadata catalogs
61OSG Metrics
62OSG Summary
- Working Grid for wide set of applications
- Joint effort between application scientists, computer scientists
- Globus software as a starting point, additions from other communities as needed
63What Should You Take Away From This Talk
- Grids are a way to work between administrative domains
- The Globus Toolkit offers a starting point for building these applications
- Many applications (research, science, and business) use these resources
- Much work still to be done in this area: many open research questions!
64For More Information
- Jennifer M. Schopf
- jms_at_mcs.anl.gov
- www.mcs.anl.gov/jms
- Support from DOE, NSF, NeSC
- This talk
- www.mcs.anl.gov/jms/Talks (not there yet)
- Globus Toolkit
- www.globus.org
- UK ETF GT4 report
- www.nesc.ac.uk/technical_papers/UKeS-2005-03.pdf