Distributed Systems Management

1
Distributed Systems Management
  • Presented by
  • Rajesh Kumar
  • Gabriela Jacques da Silva

2
Motivation
  • "A layer of indirection solves every
    problem in computer science"
  • David Wheeler

3
Problem
  • Distributed infrastructures are popular and are
    here to stay.
  • How to efficiently and smoothly manage operations
    in distributed systems?

4
Metacomputing
  • A networked environment of resources connected
    by high-speed links functioning as one huge
    virtual super-computer
  • Isn't that grid computing?

5
Virtual Organizations
Source: Wikipedia
6
Why metacomputing?
  • It's the money
  • A particular configuration is required rarely
  • Challenge of executing high-performance
    applications
  • Resources are hard and expensive to replicate
  • Still valid in a few contexts

7
Applications
  • Desktop supercomputing
  • Visualization capability linked with
    supercomputers and databases
  • Smart instruments
  • Instruments linked with supercomputers for
    real-time data processing and actuation
  • Collaborative environments
  • Link virtual environments for remote user
    interaction
  • Distributed supercomputing
  • Connect multiple computers to tackle hard
    problems
  • Which one would be the most prevalent?

8
Metasystem characteristics
  • Scale and distributed selection
  • Heterogeneous across all levels
  • Unpredictable structure
  • Dynamic and unpredictable behavior
  • Multiple administrative domains

9
Globus
  • Goal: address the problems of configuration
    and performance optimization in metacomputing
    environments
  • Developing low-level mechanisms to support
    high-level services
  • Techniques that allow higher-level services to
    observe and guide the operation of these
    mechanisms

10
Source: original paper
11
Low-level modules
  • Resource (al)location
  • Communications
  • Unified resource information service
  • Authentication
  • Process creation
  • Data access

12
Support for AWARE Services
  • An Adaptive Wide Area Resource Environment (AWARE)
    is one of the long-term goals of Globus. It is
    supported through the following mechanisms:
  • Rule-based selection
  • Resource property inquiry
  • Notification or call-back mechanism
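
The three mechanisms above can be pictured with a small sketch. This is not Globus code: the `Link` class, the property names, and the method names in `RULES` are all hypothetical, chosen only to show how rule-based selection, property inquiry, and call-backs might fit together.

```python
class Link:
    """Hypothetical stand-in for a network path with queryable properties."""

    def __init__(self, latency_ms, same_machine):
        self.props = {"latency_ms": latency_ms, "same_machine": same_machine}
        self._callbacks = []

    def inquire(self, name):
        """Resource property inquiry: read the current value of a property."""
        return self.props[name]

    def on_change(self, callback):
        """Notification: register a call-back fired when a property changes."""
        self._callbacks.append(callback)

    def update(self, name, value):
        self.props[name] = value
        for callback in self._callbacks:
            callback(self)


# Rule-based selection: the first rule whose predicate holds wins.
# The method names are illustrative labels, not real Globus options.
RULES = [
    (lambda link: link.inquire("same_machine"), "shared-memory"),
    (lambda link: link.inquire("latency_ms") < 5, "tcp"),
    (lambda link: True, "tcp-with-compression"),
]


def select_method(link):
    for predicate, method in RULES:
        if predicate(link):
            return method


# An application re-selects its method whenever it is notified of a change.
link = Link(latency_ms=2, same_machine=False)
current = {"method": select_method(link)}
link.on_change(lambda changed: current.update(method=select_method(changed)))
link.update("latency_ms", 80)
```

The point of the sketch is the division of labour: the low-level layer exposes properties and change events, and the policy (the rule list) stays in the higher-level service.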

13
Communications
  • Based on the Nexus communication library
  • Five concepts: node, thread, context, global
    pointer, Remote Service Request (RSR)
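
A toy, single-process model may help relate the five concepts. It is not the Nexus API: every class and function name below is an assumption made for illustration, and a real RSR crosses the network rather than calling into a local object.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Context:
    """An address space living on a node; owns handlers and exported objects."""
    node: str
    handlers: Dict[str, Callable[[Any, Any], None]] = field(default_factory=dict)
    objects: Dict[int, Any] = field(default_factory=dict)

    def register(self, name, handler):
        self.handlers[name] = handler

    def export(self, obj):
        """Make a local object remotely addressable via a global pointer."""
        key = len(self.objects)
        self.objects[key] = obj
        return GlobalPointer(self, key)


@dataclass
class GlobalPointer:
    """Names an object inside some (possibly remote) context."""
    context: Context
    key: int


def remote_service_request(pointer, handler_name, payload):
    """Run a named handler against the pointed-to object in a new thread.

    The caller does not wait for a result; the thread is returned only so
    this demo can join it.
    """
    ctx = pointer.context
    target = ctx.objects[pointer.key]
    thread = threading.Thread(target=ctx.handlers[handler_name],
                              args=(target, payload))
    thread.start()
    return thread


# Usage: a "remote" list receives a value through an RSR.
ctx = Context(node="host-a")
ctx.register("append", lambda target, value: target.append(value))
inbox = []
gp = ctx.export(inbox)
remote_service_request(gp, "append", 42).join()
```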

14
Metacomputing Directory Service
  • Information provided includes
  • Resource configuration details
  • Real-time performance information
  • Application-specific information
  • Information can be aggregated from multiple
    sources such as NIS, SNMP
  • Information represented and accessed using
    interface defined by LDAP

15
Authentication
  • Uses the Generic Security Service (GSS) API, which
    defines a standard procedure and API for
    obtaining credentials (passwords or certificates),
    for mutual authentication (client and server),
    and for message-oriented encryption and
    decryption.
  • GSS is independent of any particular security
    mechanism and can be layered on top of different
    security methods, such as Kerberos and SSL.

16
Data Access Services
Source: original paper
17
Available in High-Quality Open Source Software
Globus Toolkit v4 (www.globus.org)
[Figure: Globus Toolkit v4 component map. Categories: Data Mgmt, Security, Common Runtime, Execution Mgmt, Info Services. Components shown: Data Replication, Replica Location, Grid Telecontrol Protocol, Credential Mgmt, Data Access & Integration, Community Scheduling Framework, Delegation, WebMDS, Reliable File Transfer, Community Authorization, Trigger, Workspace Management, GridFTP, Authentication & Authorization, Grid Resource Allocation & Management, Index]
Source: I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems," LNCS 3779, pp. 2-13, 2005
18
Higher-level services
  • Parallel Programming Interfaces
  • Numerous PPIs have been adapted to use Globus for
    low-level services.
  • Unified Certificate-based Authentication
  • Define a global, public key-based authentication
    space for all users and resources.
  • Provide a centralized authority that defines
    system-wide names (accounts) for users and
    resources.
  • Basic implementation to show how it can be done

19
Discussion
  • Does it support executing multiple applications
    on the grid?
  • How would it handle conflicting requirements?
  • Inter-domain job scheduling
  • Job migration?
  • Fault-tolerance and error recovery

20
Condor and the Grid
  • Douglas Thain, Todd Tannenbaum, Miron Livny

21
Condor
  • Batch-scheduler for high-throughput computing
  • University of Wisconsin - 1988
  • Steals idle cycles from any workstation:
    opportunistic computing
  • resource finder
  • batch queue manager
  • scheduler
  • Well deployed
  • Basis for commercial systems, e.g. LoadLeveler

22
Condor
23
Key Mechanisms
  • ClassAds
  • resource matching
  • Job checkpoint and migration
  • failure
  • workstation is now in use
  • Remote system call
  • mobile sandbox
  • redirection of I/O
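
The checkpoint-and-migrate idea can be sketched at application level. This is only an illustration under stated assumptions: Condor checkpoints jobs at the process level, whereas the hypothetical `run_with_checkpoints` below saves its own state explicitly, and the `evict_at` parameter merely simulates a workstation owner returning.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(total_steps, ckpt_path, evict_at=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    Returns the final sum, or None if the job was evicted before finishing.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if evict_at is not None and state["step"] == evict_at:
            return None  # workstation is now in use: vacate immediately
        state["acc"] += state["step"]
        state["step"] += 1
        # Write to a temp file and rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state["acc"]


# Usage: the job is evicted on one "machine" and finishes on another,
# without redoing the steps already completed.
ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first_attempt = run_with_checkpoints(10, ckpt, evict_at=4)
second_attempt = run_with_checkpoints(10, ckpt)
```

Only the checkpoint file has to move between machines, which is why a single independent job is the easy case.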

24
Condor-G = Condor + Globus
  • Globus
  • Widely used
  • Speaks with foreign batch systems
  • Interdomain authentication
  • No error recovery
  • Condor
  • Reliable submission
  • Job management

25
Condor-G
26
Building Computing Communities
[Figure: components of a Condor community. Labels: User, Problem Solver, Agent, Matchmaker, Resource, Shadow, Sandbox, Job]
27
Condor Pool
[Figure: a Condor pool. Labels: one Agent, one Matchmaker, three Resources]
28
Collaboration between Pools
[Figure: Pool A and Pool B. Labels: two matchmakers (M), six resources (R), one agent (A)]
29
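Pools in Condor-G
[Figure: Condor-G with foreign batch schedulers. Labels: two foreign batch schedulers, two GRAM interfaces, two queues (Q), one matchmaker (M), one agent (A), six resources (R)]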
30
Pools in Condor-G
[Figure: Personal Condor Pool. Labels: one matchmaker (M), one agent (A), two queues (Q), six resources (R)]
31
Pools in Condor-G
[Figure: Personal Condor Pool, continued. Labels: one matchmaker (M), one agent (A), two queues (Q), six resources (R)]
32
Planning and Scheduling
  • Different administrative domains → different
    scheduling policies
  • Planning → where
  • Scheduling → when
  • Matchmaker
  • Agents and resources - classified advertisements
    (ClassAds)
  • Pairing based on constraints
  • Matching → Notification → Claiming
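
The three-step sequence can be sketched as follows. This is a simplified model, not Condor code: the `Resource`, `Matchmaker`, and `schedule` names and the memory-only compatibility test are assumptions. It shows why claiming is a separate step: the matchmaker works from advertisements that may be stale, so the resource re-checks the request when it is actually claimed.

```python
class Resource:
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory
        self.claimed_by = None

    def advertise(self):
        """Snapshot of the resource's state, sent to the matchmaker."""
        return {"name": self.name, "memory": self.memory}

    def claim(self, job):
        """The resource has the final say and re-checks at claim time."""
        if self.claimed_by is not None or job["memory_needed"] > self.memory:
            return False
        self.claimed_by = job["owner"]
        return True


class Matchmaker:
    def __init__(self):
        self.resource_ads = []

    def publish(self, ad):
        self.resource_ads.append(ad)

    def match(self, job):
        """Find a compatible advertisement; it may no longer be accurate."""
        for ad in self.resource_ads:
            if ad["memory"] >= job["memory_needed"]:
                return ad
        return None


def schedule(job, matchmaker, resources):
    ad = matchmaker.match(job)                    # 1. matching
    if ad is None:
        return None
    resource = resources[ad["name"]]              # 2. notification
    return resource if resource.claim(job) else None  # 3. claiming


# Usage: two jobs are matched against the same advertisement, but only
# the first one succeeds in claiming the resource.
node = Resource("node1", memory=2048)
mm = Matchmaker()
mm.publish(node.advertise())
pool = {"node1": node}
won = schedule({"owner": "gjsilva", "memory_needed": 1024}, mm, pool)
lost = schedule({"owner": "rkumar8", "memory_needed": 1024}, mm, pool)
```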

33
Classified Advertisements
  • Submitting a job creates a job ClassAd
  • Type = "Job"
  • Owner = "gjsilva"
  • Cmd = "my_computation"
  • WantRemoteSyscalls = 1
  • WantCheckpoint = 1
  • Args = "-example args"
  • Constraint =
  • other.Type == "Machine"
  • && Arch == "INTEL"
  • && OpSys == "LINUX"
  • && other.Memory > 1000

34
Classified Advertisements
  • Announcing resources
  • Type = "Machine"
  • Activity = "idle"
  • KeyboardIdle = 36000 // seconds
  • Disk = 20 // GB
  • Memory = 512
  • Arch = "INTEL"
  • OpSys = "LINUX"
  • State = "Unclaimed"
  • Friends = { "gjsilva", "rkumar8" }
  • Constraint =
  • member(other.Owner, Friends)
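
A minimal sketch of two-sided matching, using the job and machine advertisements from the slides. It is not the ClassAd language or its evaluator: ads are plain Python dictionaries, constraints are functions of `(my, other)`, and a missing attribute simply makes the constraint fail. The `matches` helper is a name introduced here for illustration.

```python
job_ad = {
    "Type": "Job",
    "Owner": "gjsilva",
    "Cmd": "my_computation",
    "WantRemoteSyscalls": 1,
    "WantCheckpoint": 1,
    "Args": "-example args",
    # The job only accepts Linux/Intel machines with enough memory.
    "Constraint": lambda my, other: (
        other.get("Type") == "Machine"
        and other.get("Arch") == "INTEL"
        and other.get("OpSys") == "LINUX"
        and other.get("Memory", 0) > 1000
    ),
}

machine_ad = {
    "Type": "Machine",
    "Activity": "idle",
    "KeyboardIdle": 36000,  # seconds
    "Disk": 20,             # GB
    "Memory": 512,
    "Arch": "INTEL",
    "OpSys": "LINUX",
    "State": "Unclaimed",
    "Friends": ["gjsilva", "rkumar8"],
    # The machine only accepts jobs owned by one of its friends.
    "Constraint": lambda my, other: other.get("Owner") in my["Friends"],
}


def matches(a, b):
    """A pair matches only if each ad's constraint accepts the other ad."""
    return a["Constraint"](a, b) and b["Constraint"](b, a)


# With the values on the slides the pair does NOT match: the machine
# accepts the owner, but it advertises Memory 512 and the job needs > 1000.
slide_pair_matches = matches(job_ad, machine_ad)

# The same machine with more memory satisfies both sides.
bigger_machine = dict(machine_ad, Memory=2048)
bigger_pair_matches = matches(job_ad, bigger_machine)
```

Because both sides carry a constraint, neither the user nor the resource owner gives up control over the pairing.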

35
Problem Solvers
  • Provide programming models
  • How to execute job - application layer
  • Master-Worker
  • Embarrassingly parallel programs
  • Jobs are independent
  • Directed Acyclic Graph Manager
  • Enforce order on job completion

36
Master-Worker
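
The master-worker model can be sketched with a local thread pool standing in for remote workers. This is not the Condor Master-Worker interface; `work` and `master` are hypothetical names, and the squaring task is a placeholder for any independent unit of work.

```python
from concurrent.futures import ThreadPoolExecutor


def work(task):
    """One independent unit of work; it needs nothing from other tasks."""
    return task * task


def master(tasks, n_workers=4):
    """Hand tasks to a pool of workers and gather results in task order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, tasks))


results = master(range(5))
```

Since the tasks are independent, the master can reassign any task whose worker disappears without affecting the others, which suits opportunistic resources.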
37
DAGMan
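
Enforcing an order on job completion amounts to running jobs in a topological order of the dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); it is not DAGMan itself, and `run_dag` and the job names are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def run_dag(parents, run_job):
    """Run every job only after all of its parents have completed.

    `parents` maps each job name to the set of jobs it depends on.
    Returns the order in which jobs were run.
    """
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        for job in sorter.get_ready():  # jobs whose parents are all done
            run_job(job)
            order.append(job)
            sorter.done(job)            # unblocks this job's children
    return order


# A diamond: B and C depend on A, and D depends on both B and C.
diamond = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
executed = run_dag(diamond, run_job=lambda name: None)
```

In this sketch B and C become ready together once A is done, so a real scheduler could submit them in parallel.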
38
Split Execution
  • Guarantee job isolation and correct execution
  • Shadow
  • User
  • Provides executable, arguments, environment,
    input files
  • Sandbox
  • Safe place to execute
  • Create appropriate environment
  • Fetch files through RPCs
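
The split between shadow and sandbox can be sketched in a single process. This is a model under stated assumptions, not Condor's implementation: the "RPCs" are ordinary method calls, and the `Shadow` and `Sandbox` classes and their method names are hypothetical.

```python
import io


class Shadow:
    """Submit-side half: holds what the user provided for the job."""

    def __init__(self, files):
        self.files = files           # file name -> contents on the submit machine
        self.output = io.StringIO()  # where the job's output ends up

    # Stand-ins for remote calls; a real system would cross the network.
    def rpc_read(self, name):
        return self.files[name]

    def rpc_write(self, text):
        self.output.write(text)


class Sandbox:
    """Execute-side half: runs the job without touching local files."""

    def __init__(self, shadow):
        self.shadow = shadow

    def run(self, job, input_name):
        data = self.shadow.rpc_read(input_name)  # fetch input through the shadow
        result = job(data)
        self.shadow.rpc_write(result)            # redirect output back home


# Usage: a word-count job reads its input from, and writes its result to,
# the submit side only.
shadow = Shadow({"input.txt": "steal idle cycles"})
Sandbox(shadow).run(lambda text: str(len(text.split())), "input.txt")
```

The execution machine never needs the user's files or account, only a channel back to the shadow.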

39
Split Execution
40
Split Execution
41
Discussion
  • Very reliable, stable system
  • Most issues have been taken care of in recent
    versions
  • Problem solvers
  • Simple models
  • Fail to provide solutions that exploit locality
  • Checkpointing
  • Single independent job → easy task
  • How to do migration for jobs with connections?
    MPI applications?
  • No single point of failure
  • Hard to find information about failure handling
    in critical components
  • Is it possible to be selfish?
  • set a low ranking
  • allow matchmaking but deny resource claim
  • Firewall issues?

42
Globus and PlanetLab Resource Management
Solutions Compared
  • M. Ripeanu, M. Bowman, J. Chase, I. Foster, M.
    Milenkovic

43
What is common anyway?
  • Both build infrastructures that enable federated,
    extensible and secure resource sharing
  • Across distributed trust domains

44
Why compare?
  • Some functionality can be transferred
  • Some functionality might be complementary
  • Synergistic evolution is possible

45
Some issues though
  • Both are active projects, hence we compare existing
    and planned functionality
  • Projects are complementary
  • Globus is a software toolkit with many
    deployments; PlanetLab is a single deployment
    with some software

46
PlanetLab - recap
  • Infrastructure testbed especially for network
    services
  • Best suited for services that need dispersed
    nodes
  • Experimental and production use
  • Designed to run on dedicated hosts
  • Uses virtualization
  • Low level system abstraction
  • The user sees a distributed set of virtual
    containers
  • Higher value services are built on top
  • Currently 753 nodes at 363 sites

47
Similar but not quite
  • User communities
  • Application characteristics
  • Resources
  • Resource Ownership

48
User communities
  • PlanetLab
  • CS researchers
  • Minimal functionality
  • Users build their own
  • Duplicated effort?
  • But competitive?
  • Globus
  • Heterogeneous set
  • Rich functionality
  • Standardized
  • Can be further built upon

49
Application Characteristics
  • PlanetLab
  • Network-intensive
  • Experimentation
  • Distributedness is an objective
  • Globus
  • Computation-intensive
  • High-performance
  • Distributedness is a necessary evil

50
Resources
  • PlanetLab
  • Trend is towards standardization of resources
  • An economic necessity
  • Globus
  • Supports diverse devices and platforms
  • A feature

51
Resource ownership
  • Globus
  • Resource owner controls the site
  • PlanetLab
  • Limited control for the resource owner
  • Homogeneity is required
  • PlanetLab admin has root access on nodes
  • PlanetLab admin has access to a remote power
    button

Source: original paper
52
Different assumptions → Different solutions
  • Local resource management
  • Global federation-building

53
Local resource management
  • Globus
  • Unified interface for local resource management
  • Underlying mechanisms may vary
  • Main abstractions
  • Service
  • Job
  • PlanetLab
  • Low level management functionality
  • Same for all individual resources
  • Main abstraction
  • Virtual Machine

54
Federated resource sharing
  • Global view of the resources
  • Basic concept: delegation
  • Resource usage delegation
  • Delegate the right to consume resources of a site
  • Delegate to an application or a broker
  • Identity delegation
  • Delegate one's identity to another party so that it
    can act on one's behalf
  • Not available in PlanetLab
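
Resource usage delegation can be pictured as handing over a signed capability: whoever holds it may consume the named resources, regardless of who they are. The sketch below is a generic illustration using an HMAC, not PlanetLab's actual capability format; `SITE_KEY`, `issue_capability`, `redeem`, and the `cpu_share` field are all hypothetical.

```python
import hashlib
import hmac
import json

SITE_KEY = b"demo-site-secret"  # hypothetical key known only to the site


def issue_capability(resource, cpu_share):
    """Site side: sign a statement of what the holder may consume."""
    body = json.dumps({"resource": resource, "cpu_share": cpu_share},
                      sort_keys=True)
    tag = hmac.new(SITE_KEY, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def redeem(capability):
    """Node side: accept any holder, but only an unmodified capability."""
    expected = hmac.new(SITE_KEY, capability["body"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, capability["tag"]):
        raise PermissionError("forged or altered capability")
    return json.loads(capability["body"])


# Usage: the site issues a capability, the user forwards it to a broker,
# and the broker redeems it on the node without presenting the user's identity.
cap = issue_capability("node1", cpu_share=0.1)
granted = redeem(cap)
```

Identity delegation differs in what is passed along: the delegate receives the right to act as the user, so the site's decision is still based on who the user is.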

55
Global resource allocation and scheduling
  • Globus
  • Exploits identity delegation
  • PlanetLab
  • Exploits resource usage delegation

[Figure: brokers and node managers. Arrow labels: "Job submission, identity delegation", "Forward capability", "Job submission" (twice), "Provide capability"]
Source: presentation by Dionysis Logothetis
56
Globus on top of PlanetLab
  • PlanetLab
  • Interoperability
  • Identity delegation
  • Globus
  • Integrating community contributions
  • Resource usage delegation rights

57
What is this?
Source: original paper
58
Discussion
  • What is gained by running Globus over PlanetLab?
  • Heterogeneity is not the goal of PlanetLab
  • May never need interoperability
  • Why not build another Grid and deploy
    mTCP/BANANAS?