Title: Grid Computing 2
1. Grid Computing 2
- http://www.globus.org
- http://www.cs.virginia.edu/legion/
- http://www.cs.wisc.edu/condor/
- (thanks to Shava and Holly; see notes for CSE 225)
2. Outline
- Today
- Condor
- Globus
- Legion
- Next class
- Talk by Marc Snir, Architect of IBM's Blue Gene
- Tuesday June 6, APM 4301, 1:00-2:00
3. Condor
- Condor is a high-throughput scheduler
- Main idea is to leverage free cycles on very large collections of privately owned, non-dedicated desktop workstations
- Performance measure is throughput of jobs
- Rather than how fast a particular job can run, the question is how many jobs can complete over a long period of time
- Developed by Miron Livny et al. at U. of Wisconsin
4. Condor Basics
- Condor: "hunter of idle workstations"
- Condor pool consists of a large number of privately controlled UNIX workstations
- (Condor now being ported to NT)
- WS owners define the conditions under which the WS can be allocated by Condor to an external user
- External Condor jobs run while machines are idle
- User does not need a login on participating machines
- Uses remote system calls to the submitting WS
5. Condor Architecture (all machines in same Condor pool)
- Architecture
- Each WS runs Schedd and Startd daemons
- Startd monitors and terminates jobs assigned by the CM
- Schedd queues jobs submitted to Condor at that WS and seeks resources for them
- Central Manager (CM) WS controls allocation and execution for all jobs
6. Standard Condor Protocol (all machines in same Condor pool)
- Protocol
- Schedd (submitting machine) sends the job context to the CM; the execution machine sends its machine context to the CM
- CM identifies a match between job requirements and execution machine resources
- CM sends the execution machine ID to the Schedd
- Schedd forks a Shadow process on the submission machine
- Shadow passes job requirements to the Startd on the execution machine and gets acknowledgement that the execution machine is still idle
- Shadow sends the executable to the execution machine, where it executes until completion or migration
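The message flow above can be condensed into a short sketch; the function and ad-field names below are illustrative, not Condor's actual interfaces.

```python
# Pseudocode sketch of the standard Condor protocol (illustrative
# names and ad fields; not Condor's actual code).

def condor_protocol(job_ad, machine_ads):
    # Schedd sends the job context to the CM; execution machines have
    # already sent their machine contexts.  The CM finds a match:
    match = next((m for m in machine_ads
                  if m["memory_mb"] >= job_ad["min_memory_mb"]), None)
    if match is None:
        return None
    # CM sends the execution machine ID back to the Schedd, which
    # forks a Shadow; the Shadow confirms the machine is still idle:
    if not match["idle"]:
        return None
    # Shadow ships the executable, which runs until completion or
    # migration; here we just report where the job landed.
    return match["id"]

machines = [{"id": "ws1", "memory_mb": 32, "idle": True},
            {"id": "ws2", "memory_mb": 128, "idle": True}]
print(condor_protocol({"min_memory_mb": 64}, machines))  # ws2
```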
7. More Condor Basics
- Participating Condor machines not required to share file systems
- No source code changes to the user's code are required to use Condor, but users must re-link their program in order to use checkpointing and migration
- "vanilla" jobs vs. "condor" jobs
- Condor jobs allocated to a good target resource using a matchmaker
- Single Condor jobs automatically checkpointed, migrated between WSs, and restarted as needed
8. Condor Remote System Call Strategy
- Job must be able to read and write files on its
submit workstation
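The idea can be sketched as follows: the job's file operations become calls back to a shadow process that owns the files on the submit side. This is a toy illustration; real Condor relinks the job against a library that traps the actual system calls, and the class names here are invented.

```python
# Illustrative sketch of remote system calls: file operations on the
# execution machine are forwarded to the submit workstation's Shadow.

class Shadow:
    """Stands in for the submit machine; owns the job's files."""
    def __init__(self):
        self.files = {}
    def remote_write(self, name, data):
        self.files[name] = self.files.get(name, "") + data
    def remote_read(self, name):
        return self.files.get(name, "")

class RemoteFile:
    """File handle used by the job; every operation becomes a call
    back to the Shadow rather than touching the local disk."""
    def __init__(self, shadow, name):
        self.shadow, self.name = shadow, name
    def write(self, data):
        self.shadow.remote_write(self.name, data)
    def read(self):
        return self.shadow.remote_read(self.name)

shadow = Shadow()
f = RemoteFile(shadow, "out.txt")
f.write("result=42\n")
print(shadow.files["out.txt"])   # the data lives on the submit side
```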
9. Condor Matchmaking
- Matchmaking mechanism matches job specs to machine characteristics
- Matchmaking done using classads
- Resources produce "resource offer" ads
- Include information such as available RAM, CPU type and speed, virtual memory size, physical location, current load average, etc.
- Jobs provide a "resource request" ad which defines the required and desired set of resources to run on
- Condor acts as a broker which matches and ranks resource offer ads against resource request ads
- Condor makes sure that all requirements in both ads are satisfied
- Priorities of users and certain types of ads are also taken into consideration
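The broker's match-and-rank step can be sketched as below. This is a toy matchmaker in the spirit of classads, not the real ClassAd language; the ad fields and machine names are invented.

```python
# Toy matchmaker: offer ads describe machines; a request ad carries a
# requirements predicate and a rank function (illustrative only).

offer_ads = [
    {"name": "ws-a", "memory_mb": 64,  "kflops": 5000, "load": 0.1},
    {"name": "ws-b", "memory_mb": 256, "kflops": 9000, "load": 0.0},
]

request_ad = {
    # Requirements must hold in full; the broker verifies both sides.
    "requirements": lambda m: m["memory_mb"] >= 64 and m["load"] < 0.5,
    # Rank orders the acceptable machines; higher is better.
    "rank": lambda m: m["kflops"],
}

def matchmake(request, offers):
    acceptable = [m for m in offers if request["requirements"](m)]
    if not acceptable:
        return None
    return max(acceptable, key=request["rank"])["name"]

print(matchmake(request_ad, offer_ads))  # ws-b (fastest acceptable machine)
```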
10. Condor Checkpointing
- When the WS owner returns, the job can be checkpointed and restarted on another WS
- Periodic checkpoint feature can checkpoint the job at intervals so that work is not lost should the job be migrated
- Condor jobs vs. vanilla jobs
- Condor job executables must be relinked and can be checkpointed, migrated and restarted
- Vanilla jobs are not relinked and cannot be checkpointed or migrated
11. Condor Checkpointing Limitations
- Only single-process jobs supported
- Inter-process communication not supported (socket, send, recv, etc. not implemented)
- All file operations must be idempotent (read-only and write-only access work correctly; reading and writing the same file may not)
- Disk space must be available to store the checkpoint file on the submitting machine
- Each checkpointed job has an associated checkpoint file which is approximately the size of the address space of the process
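The checkpoint/restart cycle can be illustrated at the application level; note this is only a sketch of the idea, since real Condor snapshots the whole address space via the relinked executable with no source-code changes.

```python
# Application-level sketch of periodic checkpointing (illustrative).

import pickle

def run(steps, store, state=None, checkpoint_every=3):
    state = state or {"i": 0, "total": 0}
    while state["i"] < steps:
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            store["ckpt"] = pickle.dumps(state)   # periodic snapshot
    return state["total"]

store = {}
run(4, store)                          # job "evicted" mid-computation
resumed = pickle.loads(store["ckpt"])  # restart from the last checkpoint
print(run(10, store, state=resumed))   # 45, i.e. sum(range(10))
```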
12. Condor-PVM and Parallel Jobs
- PVM master/slave jobs can be submitted to the Condor pool (special "condor-pvm" universe)
- Master is run on the machine where the job was submitted
- Slaves pulled from the Condor pool as they become available
- Condor acts as resource manager for the PVM daemon
- Whenever the PVM program asks for nodes, the request is remapped to Condor
- Condor finds a machine in the Condor pool and adds it to the PVM virtual machine
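The remapping step can be sketched as below; the function name stands in for PVM's add-hosts call, and the machine names are invented.

```python
# Sketch of the condor-pvm idea (illustrative): when the PVM master
# asks for nodes, the request is remapped to Condor, which pulls idle
# machines from the pool into the PVM virtual machine.

pool = ["wsA", "wsB", "wsC"]          # idle machines in the Condor pool
virtual_machine = []                  # hosts in the PVM session

def pvm_addhosts(n):
    """Stand-in for PVM's add-host request, remapped to Condor."""
    while n and pool:
        virtual_machine.append(pool.pop(0))   # Condor picks a machine
        n -= 1
    return len(virtual_machine)

print(pvm_addhosts(2))   # 2: two pool machines joined the virtual machine
```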
13. Condor and the Grid
- Condor and the Alliance
- Condor is one of the Grid technologies deployed by the Alliance
- Used for production high-throughput computing by partners
- Condor and Globus
- Globus can use Condor as a local resource manager
- Globus RSL specs translated into matchmaker classads
14. Condor and the Grid
- Flock of Condors
- Aggregation of Condor pools into a flock enables Condor pools to cross load-sharing and protection boundaries
- A Condor flock may include Condor pools connected by wide-area networks
- Infrastructure
- Idea is to add a Gateway (GW) machine for every pool
- Gateway machines act as resource brokers for machines external to a pool
- In the published description, the GW machine presents randomly chosen external pools/machines
- CM does not need to know about flocking
- Each GW machine runs GW-startd and GW-schedd, as with a single Condor pool
15. Flocking Protocol (machines in different pools)
16. Globus
- Globus: an integrated toolkit of Grid services
- Developed by Ian Foster (ANL/UC) and Carl Kesselman (USC/ISI)
- "Bag of services" model: applications can use Grid services without having to adopt a particular programming model
17. Core Globus Services
- Resource allocation and process management (GRAM, DUROC, RSL)
- Information Infrastructure (MDS)
- Security (GSI)
- Communication (Nexus)
- Remote Access (GASS, GEM)
- Fault Detection (HBM)
- QoS (GARA, Gloperf)
18. Globus Layered Architecture
- Layers, top to bottom (figure):
- Applications
- High-level Services and Tools: GlobusView, Testbed Status, DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC
- Core Services: GRAM, Nexus, Metacomputing Directory Service, Globus Security Interface, Heartbeat Monitor, Gloperf, GASS
19. Globus Resource Management Services
- Resource Management services provide mechanisms for remote job submission and management
- 3 low-level services:
- GRAM (Globus Resource Allocation Manager)
- Provides remote job submission and management
- DUROC (Dynamically Updated Request Online Co-allocator)
- Provides simultaneous job submission
- Layers on top of GRAM
- RSL (Resource Specification Language)
- Language used to communicate resource requests
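As a rough illustration of the notation (the attribute names below follow early RSL conventions, but treat the specifics as a sketch rather than a definitive spec), a four-process job with memory and time limits might be requested as:

```
& (executable = myapp)
  (count = 4)
  (maxMemory = 64)
  (maxTime = 10)
```

DUROC-style co-allocation combines several such clauses, one per resource, into a single multi-request submitted simultaneously.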
20. Globus Resource Management Architecture
- Diagram: the application emits RSL; brokers specialize it (using Information Service queries) into "ground RSL"; simple ground RSL goes to GRAMs, which hand jobs to local resource managers such as LSF, EASY-LL, and NQE
21. Globus Information Infrastructure
- MDS (Metacomputing Directory Service)
- Each MDS entry stores information about some type of object (organization, person, network, computer, etc.)
- Object class associated with each entry describes a set of entry attributes
- LDAP (Lightweight Directory Access Protocol) used to store information about resources
- LDAP's hierarchical, tree-structured information model defines the form and character of the information
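The hierarchical entry/attribute model can be sketched with a toy in-memory directory; this is only illustrative (the real MDS speaks the LDAP protocol, and the entry names below are hypothetical).

```python
# Toy LDAP-style hierarchical directory in the spirit of MDS.
# Entries are named by a path of components; each entry carries
# attributes described by its object class.

directory = {
    ("o=Grid",): {"objectClass": "organization"},
    ("o=Grid", "ou=ANL"): {"objectClass": "organizationalUnit"},
    ("o=Grid", "ou=ANL", "cn=host1"): {
        "objectClass": "computer", "cpuType": "sparc", "ramMB": 256},
}

def search(base, attr, value):
    """Return the names of entries under `base` whose attribute matches."""
    return [dn for dn, entry in directory.items()
            if dn[:len(base)] == base and entry.get(attr) == value]

print(search(("o=Grid",), "objectClass", "computer"))
# every computer anywhere under the o=Grid subtree
```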
22. Globus Security Service
- GSI (Grid Security Infrastructure)
- Provides a public key-based security system that layers on top of local site security
- User identified to the system using an X.509 certificate containing info about the duration of permissions, the public key, and the signature of the certificate authority
- User also has a private key
- Provides users with single sign-on access to the various sites to which they are authorized
23. More GSI
- Resource management system uses GSI to establish which machines the user may access
- GSI allows for proxies, so the user need only log on once, as opposed to logging on to every machine involved in a distributed computation
- Proxies used for short-term authentication, rather than long-term use
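The proxy idea can be sketched as below: the long-term user credential signs a short-lived proxy once, and the proxy then authenticates to each site without further logins. The classes and fields here are invented for illustration, not the real GSI API.

```python
# Sketch of GSI-style proxy delegation (illustrative only).

import time

class Credential:
    def __init__(self, subject, lifetime_s, issuer=None):
        self.subject = subject
        self.expires = time.time() + lifetime_s
        self.issuer = issuer          # chain back to the user's identity

    def valid(self):
        return time.time() < self.expires

def sign_proxy(user_cred, lifetime_s=12 * 3600):
    """Single sign-on: done once, then the proxy is reused at every site."""
    return Credential(user_cred.subject + "/proxy", lifetime_s,
                      issuer=user_cred)

user = Credential("C=US/O=Grid/CN=Alice", lifetime_s=365 * 86400)
proxy = sign_proxy(user)              # short-term, as GSI intends
print(proxy.valid(), proxy.issuer.subject)
```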
24. Globus Communication Services
- Nexus
- Communication library which provides asynchronous RPC, multi-method communication, data conversion and multi-threading facilities
- I/O
- Low-level communication library which provides a thin wrapper around TCP, UDP, IP multicast and file I/O
- Integrates GSI into TCP communication
25. Globus Remote Access Services
- GASS (Globus Access to Secondary Storage)
- Provides secure remote access to files
- GEM (Globus Executable Management)
- Intended to support identification, location, and creation of executables in a heterogeneous environment
26. Globus Fault Detection Services
- HBM (Heartbeat Monitor)
- Provides mechanisms for monitoring multiple remote processes in a job and enabling the application to respond to failures
- Nexus Fault Detection
- Notifies applications using Nexus when a communicating process fails (but not which one)
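The heartbeat idea reduces to a simple check: each process reports a timestamp, and any process silent for longer than a timeout is reported as failed so the application can react. A minimal sketch (illustrative, not the HBM API):

```python
# Toy heartbeat monitor in the spirit of HBM.

def failed_processes(last_beat, now, timeout):
    """last_beat maps process id -> time of its most recent heartbeat;
    return the ids that have been silent longer than `timeout`."""
    return sorted(pid for pid, t in last_beat.items() if now - t > timeout)

beats = {"rank0": 100.0, "rank1": 97.5, "rank2": 88.0}
print(failed_processes(beats, now=101.0, timeout=10.0))  # ['rank2']
```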
27. Globus QoS Services
- GARA (Globus Architecture for Reservation and Allocation)
- Provides dedicated access to collections of resources via reservations
- Gloperf
- Provides bandwidth and latency information
- Wolski's NWS being integrated with Globus
- NWS provides monitoring and predictive information
28. Globus and the Grid
- Major player in Grid infrastructure development
- Currently widely deployed
- Strong user community
- Infrastructure supported by IPG, Alliance and NPACI
- Exclusive infrastructure of the Alliance and IPG
29. Legion
- Developed by Andrew Grimshaw (UVA)
- Provides a single, coherent virtual machine model that addresses grid issues within a reflective, object-based metasystem
- Everything is an object in the Legion model: HW resources, SW resources, etc.
30. Legion Goals
- Site autonomy
- Each organization maintains control over its own resources
- Extensibility
- Users can construct their own mechanisms and policies within Legion
- Scalability
- No centralized structures or servers; full distribution
31. Legion Goals
- Easy to use / seamless
- System must hide the complexity of the environment
- "Ninja" users must be able to tune applications
- High performance via parallelism
- Coarse-grained applications should perform well
- Single, persistent object space
- Single name space, transparent with respect to location or replication
- Security
- "Do no harm": Legion should not weaken local security policies
32. Legion Object Model
- Every Legion object is defined and managed by its class object; class objects act as managers and make policy, as well as define instances
- Legion defines the interface and basic functionality of a set of core object types which support basic services
- Users may also define and build their own class objects
33. Legion Object Model
- Core objects
- Host objects
- Encapsulate machine capabilities in Legion (processors and memory)
- Currently represent single-host systems (uniprocessor and shared-memory multiprocessor)
- Vault objects
- Represent persistent storage
- Implementation objects
- Generally an executable file a host object can execute when it receives a request to activate or create an object
34. Legion Object Model
- Basic system services provided by core objects
- Naming and binding; object creation, activation, deactivation and deletion
- Responsibility for system-level functionality rests with classes
- Classes (which are also objects) define and manage the objects associated with them
- Classes create new instances, schedule them for execution, activate and deactivate them, and provide current location info for contacting them
- Users can define and build their own class objects
35. Legion Programming
- Legion supports the MPI and PVM libraries via emulation libraries (which use the Legion runtime library)
- Applications need to be recompiled and relinked
- Legion supports BFS (Basic Fortran Support) and Java
- Legion OO programming language: Mentat (MPL)
36. Legion and the Grid
- Major Grid player, along with Globus
- Legion infrastructure deployed at NPACI and Department of Defense Modernization sites; being considered as infrastructure for Boeing's distributed product data management and manufacturing resource control systems
- Large-scale implementations of the molecular dynamics applications Charmm and Amber at NPACI
37. Still Other Infrastructure Approaches
- CORBA
- Globe (Europe)
- Suma (Venezuela)
- Web-based approaches (Geoffrey Fox)
- Jini (Sun)
- DCOM (MS)
- etc.
38. What's Missing?
- How do we ensure application performance?
- Performance-efficient application development and execution
- "Ninja" programming
- AppLeS, Nimrod, Mars, Prophet/Gallop, MSHN, etc.
- GrADS
39. GrADS: Grid Application Development and Execution Environment
- Prototype system which facilitates end-to-end grid-aware program development
- Based on the idea of a "performance economy" in which negotiated contracts bind applications to resources
- Joint project with a large team of researchers
- Ken Kennedy, Jack Dongarra, Dennis Gannon, Dan Reed, Lennart Johnsson, Andrew Chien, Rich Wolski, Ian Foster, Carl Kesselman, Fran Berman
40. Cool GrADS Ideas
- Performance contracts
- Vehicle for sharing complex, multi-dimensional performance information between components
- Performance economy
- Framework in which to negotiate services and promote performance
- Performance contracts play a fundamental role in the exchange of information and the binding of resources
- Resource allocation and performance steering using fuzzy logic (AppLePilot)
- Mechanism for describing quality of information
- Allows for performance steering based on evaluation of application progress
41. Next Time
- Talk by Marc Snir, Architect of IBM's Blue Gene
- Tuesday June 6, APM 4301, 1:00-2:00
- Abstract
- IBM Research announced in December a 5-year, $100M research project aimed at developing a petaop computer and using it for research in computational biology. The talk will discuss the architectural choices involved in the design of a petaop computer, and will present the design point pursued by the Blue Gene project. We shall discuss the mapping of molecular dynamics computations onto the Blue Gene architecture and outline research problems in Computer Science and Computational Biology that such a project motivates.