Distributed Systems Management

1
Distributed Systems Management

Presented by
Rajesh Kumar
Gabriela Jacques da Silva

2
Motivation

"A layer of indirection solves every
problem in computer science."
(David Wheeler)

3
Problem

Distributed infrastructures are popular and are
here to stay.
How to efficiently and smoothly manage operations
in distributed systems?

4
Metacomputing

A networked environment of resources connected
by high-speed links functioning as one huge
virtual super-computer
Isn't that grid computing?

5
Virtual Organizations
Source: Wikipedia
6
Why metacomputing?

It's the money.
A particular configuration is rarely required
Challenge of executing high-performance
applications
Resources are hard and expensive to replicate
Still valid in a few contexts

7
Applications

Desktop supercomputing
Visualization capability linked with
supercomputers and databases
Smart instruments
Instruments linked with supercomputers for
real-time data processing and actuation
Collaborative environments
Link virtual environments for remote user
interaction
Distributed supercomputing
Connect multiple computers to tackle hard
problems
Which one would be the most prevalent?

8
Metasystem characteristics

Scale and distributed selection
Heterogeneous across all levels
Unpredictable structure
Dynamic and unpredictable behavior
Multiple administrative domains

9
Globus

Goal: address the problems of configuration
and performance optimization in metacomputing
environments
Developing low-level mechanisms to support
high-level services
Techniques that allow higher-level services to
observe and guide the operation of these
mechanisms

10
Source: original paper
11
Low-level modules

Resource (al)location
Communications
Unified resource information service
Authentication
Process creation
Data access

12
Support for AWARE Services

Adaptive Wide Area Resource Environment is one
of the long-term goals of Globus. It is supported
through the following mechanisms:
Rule-based selection
Resource property inquiry
Notification or call-back mechanism
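The three mechanisms listed on this slide can be illustrated with a minimal Python sketch. Every name in it (Resource, inquire, on_change, select) is hypothetical and chosen for illustration; none of it is the Globus API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Resource:
    """Hypothetical resource with inquirable properties and change callbacks."""
    name: str
    properties: Dict[str, float]
    _listeners: List[Callable[[str, str, float], None]] = field(default_factory=list)

    def inquire(self, key: str) -> float:
        # Resource property inquiry: read one current property value.
        return self.properties[key]

    def on_change(self, callback: Callable[[str, str, float], None]) -> None:
        # Notification: register a callback invoked on every property update.
        self._listeners.append(callback)

    def update(self, key: str, value: float) -> None:
        self.properties[key] = value
        for callback in self._listeners:
            callback(self.name, key, value)


def select(resources, rules):
    # Rule-based selection: first resource that satisfies every rule.
    for resource in resources:
        if all(rule(resource) for rule in rules):
            return resource
    return None


links = [
    Resource("atm", {"bandwidth_mbps": 155.0}),
    Resource("ethernet", {"bandwidth_mbps": 10.0}),
]
rules = [lambda r: r.inquire("bandwidth_mbps") >= 100.0]

chosen = select(links, rules)
events = []
chosen.on_change(lambda name, key, value: events.append((name, key, value)))

# The link degrades; the callback fires and the application can re-select.
chosen.update("bandwidth_mbps", 40.0)
reselected = select(links, rules)
```

After the update no link satisfies the rule any more, which is the situation in which an adaptive service would relax its rules or pick a different mechanism.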

13
Communications

Based on Nexus Communication Library
Five concepts: node, thread, context, global
pointer, Remote Service Request (RSR)

14
Metacomputing Directory Service

Information provided includes
Resource configuration details
Real-time performance information
Application-specific information
Information can be aggregated from multiple
sources, such as NIS and SNMP
Information is represented and accessed through
the interface defined by LDAP

15
Authentication

Uses the Generic Security Service (GSS), which
defines a standard procedure and API for
obtaining credentials (passwords or certificates),
for mutual authentication (client and server),
and for message-oriented encryption and
decryption.
GSS is independent of any particular security
mechanism and can be layered on top of different
security methods, such as Kerberos and SSL.

16
Data Access Services
Source: original paper
17
Available in High-Quality Open Source Software
Globus Toolkit v4 (www.globus.org)
[Figure labels]
Categories: Data Mgmt, Security, Common Runtime, Execution Mgmt, Info Services
Components: Data Replication, Replica Location, Grid Telecontrol Protocol, Credential Mgmt, Data Access and Integration, Community Scheduling Framework, Delegation, WebMDS, Reliable File Transfer, Community Authorization, Trigger, Workspace Management, GridFTP, Authentication and Authorization, Grid Resource Allocation and Management, Index
Source: I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems", LNCS 3779, pp. 2-13, 2005
18
Higher-level services

Parallel Programming Interfaces
Numerous PPIs have been adapted to use Globus for
low-level services.
Unified Certificate-based Authentication
Define a global, public key-based authentication
space for all users and resources.
Provide a centralized authority that defines
system-wide names (accounts) for users and
resources.
Basic implementation to show how it can be done

19
Discussion

Does it support executing multiple applications
on the grid?
How would it handle conflicting requirements?
Inter-domain job scheduling
Job migration?
Fault-tolerance and error recovery

20
Condor and the Grid

Douglas Thain, Todd Tannenbaum, Miron Livny

21
Condor

Batch-scheduler for high-throughput computing
University of Wisconsin - 1988
Steals idle cycles from any workstation
(opportunistic computing)
resource finder
batch queue manager
scheduler
Widely deployed
Basis for commercial systems, e.g. LoadLeveler

22
Condor
23
Key Mechanisms

ClassAds
resource matching
Job checkpoint and migration
failure
workstation is now in use
Remote system call
mobile sandbox
redirection of I/O
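The checkpoint-and-migrate idea can be sketched in a few lines of Python. This is only an application-level imitation under stated assumptions: Condor checkpoints the whole process, whereas the hypothetical run function below saves its own small state dictionary with pickle, and the evict_at parameter merely simulates the workstation owner returning.

```python
import os
import pickle
import tempfile


def run(total_steps, checkpoint_path, evict_at=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    Returns None if the job is vacated before finishing.
    """
    if os.path.exists(checkpoint_path):
        # Resume from the last checkpoint instead of starting over.
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if evict_at is not None and state["step"] == evict_at:
            return None  # workstation is now in use: job is vacated
        state["acc"] += state["step"]
        state["step"] += 1
        with open(checkpoint_path, "wb") as f:
            pickle.dump(state, f)
    return state["acc"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run(10, path, evict_at=4)  # vacated after four steps
result = run(10, path)             # "migrated": resumes at step 4
```

The second call could run on a different machine, provided the checkpoint file travels with the job.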

24
Condor-G = Condor + Globus

Globus
Widely used
Speaks with foreign batch systems
Interdomain authentication
No error recovery
Condor
Reliable submission
Job management

25
Condor-G
26
Building Computing Communities
[Figure labels: Matchmaker, Problem Solver, User, Agent, Resource, Shadow, Sandbox, Job]
27
Condor Pool
[Figure labels: Agent, Matchmaker, Resource (x3)]
28
Collaboration between Pools
[Figure labels: Pool A, Pool B, R (x6), M (x2), A]
29
Pools in Condor-G
[Figure labels: Foreign batch scheduler (x2), GRAM (x2), Q (x2), R (x6), M, A]
30
Pools in Condor-G
[Figure labels: Personal Condor Pool, Q (x2), R (x6), M, A]
31
Pools in Condor-G
[Figure labels: Personal Condor Pool, Q (x2), R (x6), M, A]
32
Planning and Scheduling

Different administrative domains → different
scheduling policies
Planning → where
Scheduling → when
Matchmaker
Agents and resources publish classified
advertisements (ClassAds)
Pairing based on constraints
Matching → Notification → Claiming
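The three steps named on this slide (matching, notification, claiming) can be sketched as follows. The attribute names MinMemory and Name and all function names are hypothetical, and the advertisements are plain Python dictionaries rather than real ClassAds. The point of the sketch is that the matchmaker works on a snapshot of the advertisement, so the resource checks the request again at claim time.

```python
def compatible(agent_ad, resource_ad):
    # Both sides must agree: the resource is free and large enough.
    return (resource_ad["State"] == "Unclaimed"
            and resource_ad["Memory"] >= agent_ad["MinMemory"])


def matchmake(agent_ad, advertised):
    # Matching: the matchmaker only sees advertisements, not live state.
    for ad in advertised:
        if compatible(agent_ad, ad):
            return ad
    return None


def claim(agent_ad, resource):
    # Claiming: the resource verifies the request against its current state.
    if compatible(agent_ad, resource):
        resource["State"] = "Claimed"
        return True
    return False


job = {"Owner": "gjsilva", "MinMemory": 256}
machine = {"Name": "ws1", "State": "Unclaimed", "Memory": 512}

snapshot = dict(machine)            # advertisement sent to the matchmaker
match = matchmake(job, [snapshot])  # 1. matching
notified = match is not None        # 2. notification of both parties
first_claim = claim(job, machine)   # 3. claiming succeeds
second_claim = claim(job, machine)  # a stale match is refused
```

Because the claim is a separate step, a match is only a hint: the resource always has the final say.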

33
Classified Advertisements

Submitting a job creates a job ClassAd:
Type = "Job"
Owner = "gjsilva"
Cmd = "my_computation"
WantRemoteSyscalls = 1
WantCheckpoint = 1
Args = "-example args"
Constraint = other.Type == "Machine" && Arch == "INTEL" && OpSys == "LINUX" && other.Memory > 1000

34
Classified Advertisements

Announcing resources:
Type = "Machine"
Activity = "idle"
KeyboardIdle = 36000 // seconds
Disk = 20 // GB
Memory = 512
Arch = "INTEL"
OpSys = "LINUX"
State = "Unclaimed"
Friends = { "gjsilva", "rkumar8" }
Constraint = member(other.Owner, Friends)
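To see how the job and machine advertisements on these two slides interact, the sketch below re-expresses them as Python dictionaries with the constraints written as functions. It is not the ClassAd language or its evaluator; reading Arch and OpSys in the job constraint as attributes of the machine is an assumption, and symmetric_match is a hypothetical helper.

```python
job_ad = {
    "Type": "Job",
    "Owner": "gjsilva",
    "Cmd": "my_computation",
    # Job constraint from the slide, with Arch/OpSys read from the machine.
    "Constraint": lambda my, other: (
        other["Type"] == "Machine"
        and other["Arch"] == "INTEL"
        and other["OpSys"] == "LINUX"
        and other["Memory"] > 1000
    ),
}

machine_ad = {
    "Type": "Machine",
    "Arch": "INTEL",
    "OpSys": "LINUX",
    "Memory": 512,
    "State": "Unclaimed",
    "Friends": ["gjsilva", "rkumar8"],
    # Machine constraint from the slide: only serve jobs owned by friends.
    "Constraint": lambda my, other: other["Owner"] in my["Friends"],
}


def symmetric_match(a, b):
    # A pairing is valid only if each side's constraint accepts the other.
    return a["Constraint"](a, b) and b["Constraint"](b, a)


small = symmetric_match(job_ad, machine_ad)

# Same machine with more memory (a made-up value for illustration).
big_machine = dict(machine_ad, Memory=2048)
big = symmetric_match(job_ad, big_machine)
```

With the values shown on the slides, the machine accepts the job's owner but the job rejects the machine, because 512 does not exceed the requested 1000. Raising the machine's memory makes both constraints hold.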

35
Problem Solvers

Provide programming models
How to execute job - application layer
Master-Worker
Embarrassingly parallel programs
Jobs are independent
Directed Acyclic Graph Manager
Enforce order on job completion

36
Master-Worker
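A minimal sketch of the master-worker model for independent jobs, using a thread pool from the Python standard library in place of remote workers. The work and master functions are hypothetical and are not Condor's Master-Worker library.

```python
from concurrent.futures import ThreadPoolExecutor


def work(task):
    # Stand-in for one independent unit of work.
    return task * task


def master(tasks, n_workers=4):
    # The master hands tasks to idle workers and collects results in order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, tasks))


results = master(range(5))
```

Because the tasks do not depend on each other, a lost worker only costs a re-run of the tasks it held.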
37
DAGMan
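The ordering that a DAG manager enforces can be sketched with the standard-library graphlib module (Python 3.9 or later). The job names and the diamond-shaped dependency graph are made up for illustration, and this is not DAGMan's input format.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must complete before it starts.
dependencies = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def run_dag(deps):
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Every job returned here has all of its parents completed.
        for job in sorted(sorter.get_ready()):
            order.append(job)  # placeholder for submitting the job
            sorter.done(job)
    return order


order = run_dag(dependencies)
```

Jobs B and C become ready together once A is done, so a real manager could submit them in parallel; D waits for both.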
38
Split Execution

Guarantee job isolation and correct execution
Shadow
User
Provides executable, arguments, environment,
input files
Sandbox
Safe place to execute
Create appropriate environment
Fetch files through RPCs
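The division of labor between shadow and sandbox can be sketched as two small classes. All names are hypothetical, and the RPCs are plain method calls here; in a real deployment the two sides run on different machines.

```python
class Shadow:
    """Submit side: holds the executable, arguments and input files."""

    def __init__(self, executable, args, files):
        self.executable = executable
        self.args = args
        self.files = files
        self.outputs = {}

    def rpc_read(self, name):
        # Serves a file read requested by the sandbox.
        return self.files[name]

    def rpc_write(self, name, data):
        # Receives output, so nothing has to stay on the execute machine.
        self.outputs[name] = data


class Sandbox:
    """Execute side: runs the job and redirects all I/O to the shadow."""

    def __init__(self, shadow):
        self.shadow = shadow

    def run(self):
        job = self.shadow.executable
        data = self.shadow.rpc_read(self.shadow.args[0])
        self.shadow.rpc_write("out.txt", job(data))


shadow = Shadow(executable=str.upper, args=["in.txt"], files={"in.txt": "hello"})
Sandbox(shadow).run()
```

The job never touches the execute machine's file system, which is what lets it run on a workstation where the user has no account or files.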

39
Split Execution
40
Split Execution
41
Discussion

Very reliable, stable system
Most issues have been taken care of in recent
versions
Problem solvers
Simple models
Fail to provide solutions that exploit locality
Checkpointing
Single independent job → easy task
How to do migration for jobs with connections?
MPI applications?
No single point of failure
Hard to find information about failure handling
in critical components
Is it possible to be selfish?
set a low ranking
allow matchmaking but deny resource claim
Firewall issues?

42
Globus and PlanetLab Resource Management
Solutions Compared

M. Ripeanu, M. Bowman, J. Chase, I. Foster, M.
Milenkovic

43
What is common anyway?

Both build infrastructures that enable federated,
extensible and secure resource sharing
across distributed trust domains

44
Why compare?

Some functionality can be transferred
Some functionality might be complementary
Synergistic evolution is possible

45
Some issues though

Both are active projects, so the comparison covers
existing and planned functionality
Projects are complementary
Globus is a software toolkit with many
deployments; PlanetLab is a single deployment
with some software

46
PlanetLab - recap

Infrastructure testbed especially for network
services
Best suited for services that need dispersed
nodes
Experimental and production use
Designed to run on dedicated hosts
Uses virtualization
Low level system abstraction
The user sees a distributed set of virtual
containers
Higher value services are built on top
Currently 753 nodes on 363 sites

47
Similar but not quite

User communities
Application characteristics
Resources
Resource Ownership

48
User communities

PlanetLab
CS researchers
Minimal functionality
Users build their own
Duplicated effort ?
But competitive ?

Globus
Heterogeneous set
Rich functionality
Standardized
Can be further built upon

49
Application Characteristics

PlanetLab
Network-intensive
Experimentation
Distributedness is an objective

Globus
Computation-intensive
High-performance
Distributedness is a necessary evil

50
Resources

PlanetLab
Trend is towards standardization of resources
An economic necessity

Globus
Supports diverse devices and platforms
A feature

51
Resource ownership

Globus
Resource owner controls the site

PlanetLab
Limited control for the resource owner
Homogeneity is required
PlanetLab admin has root access on nodes
PlanetLab admin has access to a remote power
button

Source: original paper
52
Different assumptions → Different solutions

Local resource management
Global federation-building

53
Local resource management

Globus
Unified interface for local resource management
Underlying mechanisms may vary
Main abstractions
Service
Job

PlanetLab
Low level management functionality
Same for all individual resources
Main abstraction
Virtual Machine

54
Federated resource sharing

Global view of the resources
Basic concept: delegation
Resource usage delegation
Delegate the right to consume resources of a site
Delegate to an application or a broker
Identity delegation
Delegate one's identity to another party to act
on one's behalf
Not available in PlanetLab
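The difference between the two kinds of delegation can be sketched as two small data types. Everything here is hypothetical (class names, fields, the cpu_share unit) and reflects neither the Globus nor the PlanetLab implementation: a capability carries the right itself, while a proxy identity makes the site decide based on who the holder acts as.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Resource usage delegation: a transferable right to consume resources."""
    site: str
    cpu_share: float
    holder: str

    def delegate(self, new_holder, cpu_share):
        # A holder can pass on at most what it holds itself.
        if cpu_share > self.cpu_share:
            raise ValueError("cannot delegate more than is held")
        return Capability(self.site, cpu_share, new_holder)


@dataclass(frozen=True)
class ProxyIdentity:
    """Identity delegation: the holder acts on behalf of another user."""
    acts_as: str
    held_by: str


def site_admits(capability, site):
    # The site checks the capability, not who presents it.
    return capability.site == site and capability.cpu_share > 0


def site_authorizes(proxy, allowed_users):
    # The site checks the delegated identity against its own user list.
    return proxy.acts_as in allowed_users


owner_cap = Capability("siteA", 0.5, "owner")
broker_cap = owner_cap.delegate("broker", 0.2)
proxy = ProxyIdentity(acts_as="gjsilva", held_by="scheduler")
```

With capabilities a broker can hand out shares without the site knowing the end user; with identity delegation each site keeps its own authorization decision.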

55
Global resource allocation and scheduling

Globus
Exploits identity delegation

PlanetLab
Exploits resource usage delegation

[Figure labels: Job submission, identity delegation, Forward capability, Brokers, Provide capability, Node Managers]

Source: presentation by Dionysis Logothetis
56
Globus on top of PlanetLab

PlanetLab
Interoperability
Identity delegation

Globus
Integrating community contributions
Resource usage delegation rights

57
What is this?
Source: original paper
58
Discussion

What is gained by running Globus over PlanetLab?
Heterogeneity is not the goal of PlanetLab
May never need interoperability
Why not build another Grid and deploy
mTCP/BANANAS?
