Title: Distributed Systems Management
1Distributed Systems Management
- Presented by
- Rajesh Kumar
- Gabriela Jacques da Silva
2Motivation
- A layer of indirection solves every problem in computer science
- David Wheeler
3Problem
- Distributed infrastructures are popular and are here to stay.
- How to efficiently and smoothly manage operations in distributed systems?
4Metacomputing
- A networked environment of resources connected by high-speed links functioning as one huge virtual supercomputer
- Isn't that grid computing?
5Virtual Organizations
Source: Wikipedia
6Why metacomputing?
- It's the money
- A particular configuration is required rarely
- Challenge of executing high-performance applications
- Resources are hard and expensive to replicate
- Still valid in a few contexts
7Applications
- Desktop supercomputing
- Visualization capability linked with supercomputers and databases
- Smart instruments
- Instruments linked with supercomputers for real-time data processing and actuation
- Collaborative environments
- Link virtual environments for remote user interaction
- Distributed supercomputing
- Connect multiple computers to tackle hard problems
- Which one would be the most prevalent?
8Metasystem characteristics
- Scale and distributed selection
- Heterogeneous across all levels
- Unpredictable structure
- Dynamic and unpredictable behavior
- Multiple administrative domains
9Globus
- Goal: to address the problems of configuration and performance optimization in metacomputing environments
- Developing low-level mechanisms to support high-level services
- Techniques that allow higher-level services to observe and guide the operation of these mechanisms
10Source: Original paper
11Low-level modules
- Resource (al)location
- Communications
- Unified resource information service
- Authentication
- Process creation
- Data access
12Support for AWARE Services
- Adaptive Wide Area Resource Environment (AWARE) is one of the long-term goals of Globus. It is supported through the following mechanisms:
- Rule-based selection
- Resource property inquiry
- Notification or call-back mechanism
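A minimal sketch of how the three mechanisms could fit together, using the choice of a communication link as the example. Every name here (`Link`, `inquire`, `subscribe`, `select`, the property keys) is hypothetical and is not the Globus API; the sketch only illustrates the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Link:
    """Hypothetical communication link whose properties can be queried."""
    name: str
    properties: Dict[str, float]
    listeners: List[Callable[[str, str, float], None]] = field(default_factory=list)

    def inquire(self, key: str) -> float:
        # Resource property inquiry: ask the mechanism about its current state.
        return self.properties[key]

    def subscribe(self, callback: Callable[[str, str, float], None]) -> None:
        # Notification: register a call-back fired when a property changes.
        self.listeners.append(callback)

    def update(self, key: str, value: float) -> None:
        self.properties[key] = value
        for callback in self.listeners:
            callback(self.name, key, value)


def select(links: List[Link], rules: List[Callable[[Link], bool]]) -> Optional[Link]:
    # Rule-based selection: first link, in preference order, that passes every rule.
    for link in links:
        if all(rule(link) for rule in rules):
            return link
    return None


links = [
    Link("shared-memory", {"available": 0, "latency_ms": 0.01}),
    Link("tcp", {"available": 1, "latency_ms": 5.0}),
]
rules = [
    lambda link: link.inquire("available") == 1,
    lambda link: link.inquire("latency_ms") < 10,
]
events = []
links[0].subscribe(lambda name, key, value: events.append((name, key, value)))

before = select(links, rules)      # only the tcp link is available
links[0].update("available", 1)    # fires the call-back
after = select(links, rules)       # the preferred link is now usable
```

A higher-level service would typically re-run the selection from inside the call-back instead of polling.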
13Communications
- Based on Nexus Communication Library
- Five concepts: node, thread, context, global pointer, Remote Service Request (RSR)
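A toy model of how a global pointer and a Remote Service Request relate, assuming a context is an address space that holds handlers and data. This is not the Nexus API: the class and function names are invented for illustration, and the handler runs synchronously in the same process instead of being sent over a network to a thread in the target context.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class Context:
    """Toy stand-in for a context: an address space with handlers and data."""

    def __init__(self, name: str):
        self.name = name
        self.handlers: Dict[str, Callable] = {}
        self.memory: Dict[str, Any] = {}

    def register(self, handler_name: str, fn: Callable) -> None:
        self.handlers[handler_name] = fn


@dataclass(frozen=True)
class GlobalPointer:
    """Names a location inside a (possibly remote) context."""
    context: Context
    address: str


def remote_service_request(gp: GlobalPointer, handler_name: str, payload: Any) -> None:
    # Look up the named handler in the target context and apply it to the
    # location the global pointer refers to. Nothing is returned to the caller.
    handler = gp.context.handlers[handler_name]
    handler(gp.context, gp.address, payload)


def accumulate(ctx: Context, address: str, value: int) -> None:
    ctx.memory[address] = ctx.memory.get(address, 0) + value


remote = Context("node-b")
remote.register("accumulate", accumulate)
counter = GlobalPointer(remote, "counter")

remote_service_request(counter, "accumulate", 5)
remote_service_request(counter, "accumulate", 5)
```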
14Metacomputing Directory Service
- Information provided includes:
- Resource configuration details
- Real-time performance information
- Application-specific information
- Information can be aggregated from multiple sources such as NIS, SNMP
- Information represented and accessed using interface defined by LDAP
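As a rough illustration of LDAP-style access, the sketch below keeps entries in a dictionary keyed by a distinguished name and searches a subtree by attribute. The entry names, attribute names, and the `search` helpers are all made up for this example; a real deployment would query a directory server through an LDAP client.

```python
from typing import Callable, Dict, List

# Hypothetical entries keyed by an LDAP-style distinguished name.
DIRECTORY: Dict[str, Dict[str, object]] = {
    "hn=node1, ou=cs, o=example": {"objectclass": "computer", "cpus": 16, "load": 0.3},
    "hn=node2, ou=cs, o=example": {"objectclass": "computer", "cpus": 4, "load": 0.9},
    "nn=link1, ou=cs, o=example": {"objectclass": "network", "bandwidth_mbps": 155},
    "hn=node9, ou=ee, o=example": {"objectclass": "computer", "cpus": 8, "load": 0.1},
}


def search(base: str, **equals: object) -> List[str]:
    """Return DNs under `base` whose attributes equal the given values."""
    return sorted(
        dn for dn, attrs in DIRECTORY.items()
        if dn.endswith(base) and all(attrs.get(k) == v for k, v in equals.items())
    )


def search_where(base: str, predicate: Callable[[Dict[str, object]], bool]) -> List[str]:
    """Return DNs under `base` whose attributes satisfy an arbitrary predicate."""
    return sorted(
        dn for dn, attrs in DIRECTORY.items()
        if dn.endswith(base) and predicate(attrs)
    )


# Configuration query: all computers in one organizational unit.
cs_computers = search("ou=cs, o=example", objectclass="computer")

# Performance query: lightly loaded computers anywhere in the organization.
idle = search_where("o=example", lambda a: a.get("objectclass") == "computer"
                    and a.get("load", 1.0) < 0.5)
```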
15Authentication
- Use the Generic Security Service (GSS), which defines a standard procedure and API for obtaining credentials (passwords or certificates), for mutual authentication (client and server), and for message-oriented encryption and decryption.
- GSS is independent of any particular security mechanism and can be layered on top of different security methods, such as Kerberos and SSL.
16Data Access Services
Source: Original paper
17Available in High-Quality Open Source Software
Globus Toolkit v4 www.globus.org
[Figure: Globus Toolkit v4 components.
Category labels: Security, Data Mgmt, Execution Mgmt, Info Services, Common Runtime.
Component labels: Authentication Authorization, Community Authorization, Delegation, Credential Mgmt, GridFTP, Reliable File Transfer, Replica Location, Data Replication, Data Access Integration, Grid Resource Allocation Management, Community Scheduling Framework, Workspace Management, Grid Telecontrol Protocol, Index, Trigger, WebMDS]
I. Foster, Globus Toolkit Version 4: Software for Service-Oriented Systems, LNCS 3779, 2-13, 2005
18Higher-level services
- Parallel Programming Interfaces
- Numerous PPIs have been adapted to use Globus for low-level services.
- Unified Certificate-based Authentication
- Define a global, public key-based authentication space for all users and resources.
- Provide a centralized authority that defines system-wide names (accounts) for users and resources.
- Basic implementation to show how it can be done
19Discussion
- Does it support executing multiple applications on the grid?
- How would it handle conflicting requirements?
- Inter-domain job scheduling
- Job migration?
- Fault-tolerance and error recovery
20Condor and the Grid
- Douglas Thain, Todd Tannenbaum, Miron Livny
21Condor
- Batch-scheduler for high-throughput computing
- University of Wisconsin - 1988
- Steal idle cycles from any workstation - opportunistic computing
- resource finder
- batch queue manager
- scheduler
- Well deployed
- base for commercial systems - LoadLeveler
22Condor
23Key Mechanisms
- ClassAds
- resource matching
- Job checkpoint and migration
- failure
- workstation is now in use
- Remote system call
- mobile sandbox
- redirection of I/O
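The checkpoint-and-migrate idea can be sketched at the application level: save progress after every step, stop when the workstation is reclaimed, and resume later from the saved state. This is only an analogy under simplifying assumptions. The function name, the `evict_at` parameter, and the pickle file are invented for illustration, and the sketch saves application state explicitly, which is not how Condor's own checkpointing is implemented.

```python
import os
import pickle
import tempfile


def run_job(total_steps, checkpoint_path, evict_at=None):
    """Sum 0..total_steps-1, checkpointing after each step.

    Returns the sum, or None if the job was evicted before finishing.
    """
    # Resume from a checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if evict_at is not None and state["step"] == evict_at:
            return None  # workstation is now in use; saved progress survives
        state["acc"] += state["step"]
        state["step"] += 1
        with open(checkpoint_path, "wb") as f:
            pickle.dump(state, f)
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "job.ckpt")
    first = run_job(10, path, evict_at=4)   # evicted after four steps
    second = run_job(10, path)              # "migrated" run resumes at step 4
```

The second call could run on a different machine as long as the checkpoint file travels with the job.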
24Condor-G = Condor + Globus
- Globus
- Widely used
- Speaks with foreign batch systems
- Interdomain authentication
- No error recovery
- Condor
- Reliable submission
- Job management
25Condor-G
26Building Computing Communities
[Figure labels: Matchmaker, Problem Solver, User, Agent, Resource, Shadow, Sandbox, Job]
27Condor Pool
[Figure labels: Agent, Matchmaker, Resource (x3)]
28Collaboration between Pools
[Figure: Pool A and Pool B. Node labels: R (x6), M (x2), A]
29Pools in Condor-G
[Figure labels: Foreign batch sched (x2), GRAM (x2), Q (x2), R (x6), M, A]
30Pools in Condor-G
[Figure: Personal Condor Pool. Node labels: R (x6), M, Q (x2), A]
31Pools in Condor-G
[Figure: Personal Condor Pool. Node labels: R (x6), M, Q (x2), A]
32Planning and Scheduling
- Different administrative domains → different scheduling policies
- Planning → where
- Scheduling → when
- Matchmaker
- Agents and resources - classified advertisements (ClassAds)
- Pairing based on constraints
- Matching → Notification → Claiming
33Classified Advertisements
- Submitting a job
- creates a job ClassAd
- Type = "Job"
- Owner = "gjsilva"
- Cmd = "my_computation"
- WantRemoteSyscalls = 1
- WantCheckpoint = 1
- Args = "-example args"
- Constraint =
- other.Type == "Machine" &&
- Arch == "INTEL" &&
- OpSys == "LINUX" &&
- other.Memory > 1000
34Classified Advertisements
- Announcing resources
- Type = "Machine"
- Activity = "idle"
- KeyboardIdle = 36000 // seconds
- Disk = 20 // GB
- Memory = 512
- Arch = "INTEL"
- OpSys = "LINUX"
- State = "Unclaimed"
- Friends = { "gjsilva", "rkumar8" }
- Constraint =
- member(other.Owner, Friends)
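Two-sided matching of the job and machine ads from these slides can be sketched with plain dictionaries, where each ad carries a constraint that is evaluated against the other ad. This is an illustration only and not the ClassAd language or the Condor matchmaker. The `matches` helper and the `big_machine` ad are assumptions, and the job's `Arch` and `OpSys` terms are read here as referring to the machine's attributes.

```python
def member(item, collection):
    """True if item appears in collection (mirrors the member() call in the ad)."""
    return item in collection


job = {
    "Type": "Job", "Owner": "gjsilva", "Cmd": "my_computation",
    "WantRemoteSyscalls": 1, "WantCheckpoint": 1, "Args": "-example args",
    # `my` is this ad; `other` is the candidate ad on the other side.
    "Constraint": lambda my, other: (
        other.get("Type") == "Machine"
        and other.get("Arch") == "INTEL"
        and other.get("OpSys") == "LINUX"
        and other.get("Memory", 0) > 1000
    ),
}

machine = {
    "Type": "Machine", "Activity": "idle", "KeyboardIdle": 36000,
    "Disk": 20, "Memory": 512, "Arch": "INTEL", "OpSys": "LINUX",
    "State": "Unclaimed", "Friends": ["gjsilva", "rkumar8"],
    "Constraint": lambda my, other: member(other.get("Owner"), my["Friends"]),
}

# Hypothetical second machine with enough memory for the job.
big_machine = dict(machine, Memory=2048)


def matches(a, b):
    # A pair matches only if each ad's constraint accepts the other ad.
    return a["Constraint"](a, b) and b["Constraint"](b, a)


small_ok = matches(job, machine)      # False: 512 is not > 1000
big_ok = matches(job, big_machine)    # True: both constraints hold
```

With the values on the slides the pair does not match, because the job asks for more memory than the machine advertises.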
35Problem Solvers
- Provide programming models
- How to execute job - application layer
- Master-Worker
- Embarrassingly parallel programs
- Jobs are independent
- Directed Acyclic Graph Manager
- Enforce order on job completion
36Master-Worker
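A minimal sketch of the master-worker model, assuming independent tasks. Threads in one process stand in for remote workers, and `work` and `master` are hypothetical names; this is not Condor's Master-Worker interface.

```python
from concurrent.futures import ThreadPoolExecutor


def work(task):
    """One independent unit of work; here it just squares its input."""
    return task * task


def master(tasks, n_workers=4):
    # The master hands tasks to whichever worker is free and collects the
    # results in task order. No task depends on another task's result.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, tasks))


results = master(range(5))
```

Because tasks are independent, a task lost with a failed worker can simply be handed out again.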
37DAGMan
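The ordering constraint that a DAG manager enforces can be sketched with the standard-library `graphlib` module (Python 3.9 or later). The `run_dag` helper and the four-job diamond are illustrative assumptions; this is not DAGMan's input format or implementation.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, run):
    """Run every job once all of its prerequisites have finished.

    `dependencies` maps each job to the set of jobs that must complete first.
    Returns the order in which jobs were run.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        for job in sorter.get_ready():   # jobs whose prerequisites are done
            run(job)
            order.append(job)
            sorter.done(job)             # unlocks the jobs that depend on it
    return order


# Diamond: A runs first, B and C both need A, D needs B and C.
diamond = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
finished = []
order = run_dag(diamond, finished.append)
```

Jobs returned together by `get_ready()` have no ordering between them, so they could be submitted in parallel.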
38Split Execution
- Guarantee job isolation and correct execution
- Shadow
- User
- Provides executable, arguments, environment, input files
- Sandbox
- Safe place to execute
- Create appropriate environment
- Fetch files through RPCs
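The division of labour between shadow and sandbox can be sketched as two objects, with ordinary method calls standing in for the RPCs. All names are hypothetical, the "executable" is a Python callable, and nothing here enforces real isolation; the sketch only shows that the job reaches its files through the shadow.

```python
class Shadow:
    """Submit side: holds everything the job needs on behalf of the user."""

    def __init__(self, executable, args, env, files):
        self.executable = executable
        self.args = args
        self.env = env
        self.files = files
        self.output = {}

    # These two methods stand in for remote procedure calls from the sandbox.
    def rpc_read(self, name):
        return self.files[name]

    def rpc_write(self, name, data):
        self.output[name] = data


class Sandbox:
    """Execute side: builds the environment and runs the job."""

    def __init__(self, shadow):
        self.shadow = shadow

    def run(self):
        env = dict(self.shadow.env)  # private copy of the environment
        # The job gets only read/write callbacks, never the local file system.
        return self.shadow.executable(
            self.shadow.args, env, self.shadow.rpc_read, self.shadow.rpc_write
        )


def word_count(args, env, read, write):
    text = read(args[0])                         # fetched from the submit side
    write(env["OUT"], str(len(text.split())))    # result goes back the same way
    return 0


shadow = Shadow(word_count, ["input.txt"], {"OUT": "count.txt"},
                {"input.txt": "to be or not"})
exit_code = Sandbox(shadow).run()
```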
39Split Execution
40Split Execution
41Discussion
- Very reliable, stable system
- Most issues have been taken care of in recent versions
- Problem solvers
- Simple models
- Fail to provide solutions that exploit locality
- Checkpointing
- Single independent job → easy task
- How to do migration for jobs with connections? MPI applications?
- No single point of failure
- Hard to find information about failure handling in critical components
- Is it possible to be selfish?
- set a low ranking
- allow matchmaking but deny resource claim
- Firewall issues?
42Globus and PlanetLab Resource Management Solutions Compared
- M. Ripeanu, M. Bowman, J. Chase, I. Foster, M. Milenkovic
43What is common anyway?
- Both build infrastructures that enable federated, extensible and secure resource sharing
- Across distributed trust domains
44Why compare?
- Some functionality can be transferred
- Some functionality might be complementary
- Synergistic evolution is possible
45Some issues though
- Both are active projects, hence compare existing and planned functionality
- Projects are complementary
- Globus is a software toolkit with many deployments; PlanetLab is a single deployment with some software
46PlanetLab - recap
- Infrastructure testbed especially for network services
- Best suited for services that need dispersed nodes
- Experimental and production use
- Designed to run on dedicated hosts
- Uses virtualization
- Low level system abstraction
- The user sees a distributed set of virtual containers
- Higher value services are built on top
- Currently 753 nodes at 363 sites
47Similar but not quite
- User communities
- Application characteristics
- Resources
- Resource Ownership
48User communities
- PlanetLab
- CS researchers
- Minimal functionality
- Users build their own
- Duplicated effort ?
- But competitive ?
- Globus
- Heterogeneous set
- Rich functionality
- Standardized
- Can be further built upon
49Application Characteristics
- PlanetLab
- Network-intensive
- Experimentation
- Distributedness is an objective
- Globus
- Computation-intensive
- High-performance
- Distributedness is a necessary evil
50Resources
- PlanetLab
- Trend is towards standardization of resources
- An economic necessity
- Globus
- Supports diverse devices and platforms
- A feature
51Resource ownership
- Globus
- Resource owner controls the site
- PlanetLab
- Limited control for the resource owner
- Homogeneity is required
- PlanetLab admin has root access on nodes
- PlanetLab admin has access to a remote power button
Source: Original paper
52Different assumptions → Different solutions
- Local resource management
- Global federation-building
53Local resource management
- Globus
- Unified interface for local resource management
- Underlying mechanisms may vary
- Main abstractions
- Service
- Job
- PlanetLab
- Low level management functionality
- Same for all individual resources
- Main abstraction
- Virtual Machine
54Federated resource sharing
- Global view of the resources
- Basic concept: delegation
- Resource usage delegation
- Delegate the right to consume resources of a site
- Delegate to an application or a broker
- Identity delegation
- Delegate one's identity to another to act on one's behalf
- Not available in PlanetLab
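The difference between the two kinds of delegation can be sketched with toy data types. A capability is a transferable right to use a resource and carries no identity, while a delegated credential is a chain that still leads back to the original user. All names below are hypothetical; neither PlanetLab's nor Globus's real mechanisms are reproduced, and no cryptography is involved.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capability:
    """Resource usage delegation: a right to consume part of a site's resources."""
    site: str
    cpu_share: float
    serial: int


class NodeManager:
    """Issues capabilities for its site and honours each one once."""

    def __init__(self, site):
        self.site = site
        self._serial = itertools.count(1)
        self._outstanding = set()

    def issue(self, cpu_share):
        cap = Capability(self.site, cpu_share, next(self._serial))
        self._outstanding.add(cap)
        return cap

    def redeem(self, cap):
        # Whoever presents a valid capability may use it; identity is not checked.
        if cap in self._outstanding:
            self._outstanding.remove(cap)
            return True
        return False


@dataclass(frozen=True)
class Credential:
    """Identity delegation: a proxy chain that ends at the original user."""
    subject: str
    issuer: Optional["Credential"] = None

    def root_identity(self):
        cred = self
        while cred.issuer is not None:
            cred = cred.issuer
        return cred.subject


def delegate_identity(cred, to):
    # The new credential lets `to` act on behalf of whoever `cred` represents.
    return Credential(subject=f"{cred.subject}/proxy:{to}", issuer=cred)


# Usage delegation: the node manager gives a broker a capability, the broker
# forwards it to an application, and the application redeems it.
node = NodeManager("site-a")
cap = node.issue(0.25)
first_use = node.redeem(cap)
second_use = node.redeem(cap)

# Identity delegation: a broker, then a job manager, act as the user.
user = Credential("alice")
proxy = delegate_identity(delegate_identity(user, "broker"), "jobmanager")
```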
55Global resource allocation and scheduling
- Globus
- Exploits identity delegation
- PlanetLab
- Exploits resource usage delegation
[Figure labels: Job submission + identity delegation, Brokers, Job submission (x2), Forward capability, Provide capability, Node Managers]
Source: presentation by Dionysis Logothetis
56Globus on top of PlanetLab
- PlanetLab
- Interoperability
- Identity delegation
- Globus
- Integrating community contributions
- Resource usage delegation rights
57What is this?
Source: Original paper
58Discussion
- What is gained by running Globus over PlanetLab?
- Heterogeneity is not the goal of PlanetLab
- May never need interoperability
- Why not build another Grid and deploy mTCP/BANANAS?