1
Grid Computing 2
  • http://www.globus.org
  • http://www.cs.virginia.edu/legion/
  • http://www.cs.wisc.edu/condor/
  • (thanks to Shava and Holly; see notes for CSE
    225)

2
Outline
  • Today
  • Condor
  • Globus
  • Legion
  • Next class
  • Talk by Marc Snir, Architect of IBM's Blue Gene
  • Tuesday, June 6, APM 4301, 1:00-2:00

3
Condor
  • Condor is a high-throughput scheduler
  • Main idea is to leverage free cycles on very
    large collections of privately owned,
    non-dedicated desktop workstations
  • Performance measure is throughput of jobs
  • The question is not how fast a particular job
    can run, but how many jobs can complete over a
    long period of time
  • Developed by Miron Livny et al. at U. of Wisconsin

4
Condor Basics
  • Condor: a hunter of idle workstations
  • Condor pool consists of large number of privately
    controlled UNIX workstations
  • (Condor now being ported to NT)
  • WS owners define the conditions under which the
    WS can be allocated by Condor to an external user
  • External Condor jobs run while machines are idle
  • User does not need a login on participating
    machines
  • Uses remote system calls to submitting WS

5
Condor Architecture (all machines in same Condor
Pool)
  • Architecture
  • Each WS runs Schedd and Startd daemons
  • Startd monitors the WS and starts/terminates
    jobs assigned by the CM
  • Schedd queues jobs submitted to Condor at that WS
    and seeks resources for them
  • Central Manager (CM) WS controls allocation and
    execution for all jobs

6
Standard Condor Protocol (all machines in same
Condor Pool)
  • Protocol
  • Schedd (submitting machine) sends job context to
    CM; execution machine sends machine context to CM
  • CM identifies a match between job requirements
    and execution machine resources
  • CM sends to Schedd the execution machine ID
  • Schedd forks a Shadow process on submission
    machine
  • Shadow passes job requirements to Startd on
    execution machine and gets acknowledgement that
    execution machine is still idle
  • Shadow sends executable to execution machine
    where it executes until completion or migration

7
More Condor Basics
  • Participating condor machines not required to
    share file systems
  • No source code changes to users' code are
    required to use Condor, but users must re-link
    their programs to use checkpointing and migration
    (see the sketch below)
  • vanilla jobs vs. condor jobs
  • Condor jobs allocated to good target resource
    using a matchmaker
  • Single-process Condor jobs are automatically
    checkpointed, migrated between WSs, and restarted
    as needed
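
A minimal sketch of the re-link/submit cycle (the program and
file names are hypothetical; condor_compile and condor_submit
are the standard Condor tools):

    # relink against the Condor library to enable checkpointing
    condor_compile gcc -o myprog myprog.c

    # submit description file, myprog.sub
    #   universe = standard  -> relinked "condor" job
    #   universe = vanilla   -> unmodified binary, no checkpointing
    universe   = standard
    executable = myprog
    input      = myprog.in
    output     = myprog.out
    error      = myprog.err
    log        = myprog.log
    queue

    condor_submit myprog.sub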

8
Condor Remote System Call Strategy
  • Job must be able to read and write files on its
    submit workstation; its I/O system calls are
    shipped back to the submitting WS (see the sketch
    below)
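
A conceptual sketch of the remote-system-call idea (a toy
model, not Condor's actual code): the job's file I/O is
forwarded over a socket to the Shadow, which performs it on
the submit machine.

    /* Toy model: the "shadow" performs I/O requested by the "job". */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    int main(void)
    {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   /* job <-> shadow link */

        if (fork() == 0) {               /* "shadow" on the submit WS */
            char req[16], reply[64];
            read(sv[1], req, sizeof req);            /* receive request */
            snprintf(reply, sizeof reply, "<contents of %s>", req);
            write(sv[1], reply, strlen(reply) + 1);  /* do I/O locally, */
            _exit(0);                                /* ship bytes back */
        }
        /* "job" on the execution WS: its read is redirected remotely */
        char buf[64];
        write(sv[0], "input.txt", 10);   /* ask shadow to read a file  */
        read(sv[0], buf, sizeof buf);    /* data arrives over the wire */
        printf("job received: %s\n", buf);
        return 0;
    }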

9
Condor Matchmaking
  • Matchmaking mechanism matches job specs to
    machine characteristics
  • Matchmaking is done using ClassAds (classified
    advertisements); example ads appear below
  • Resources produce resource offer ads
  • Include information such as available RAM,
    CPU type and speed, virtual memory size, physical
    location, current load average, etc.
  • Jobs provide resource request ad which defines
    the required and desired set of resources to run
    on
  • Condor acts as a broker which matches and ranks
    resource offer ads against resource request ads
  • Condor makes sure that all requirements in both
    ads are satisfied
  • Priorities of users and certain types of ads also
    taken into consideration
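
For illustration (attribute values are invented; the attribute
names follow Condor's ClassAd conventions), a resource offer ad
and a resource request ad might look like:

    # resource offer ad (machine)
    MyType       = "Machine"
    TargetType   = "Job"
    Name         = "ws12.example.edu"
    Arch         = "INTEL"
    OpSys        = "LINUX"
    Memory       = 128                   # MB of RAM
    Requirements = (LoadAvg <= 0.3) && (KeyboardIdle > 15 * 60)

    # resource request ad (job)
    MyType       = "Job"
    TargetType   = "Machine"
    Requirements = (Arch == "INTEL") && (OpSys == "LINUX")
                   && (Memory >= 64)
    Rank         = Memory                # prefer more RAM

A match requires the Requirements expression of each ad to
evaluate to true against the other ad; Rank orders the matches.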

10
Condor Checkpointing
  • When WS owner returns, job can be checkpointed
    and restarted on another WS
  • A periodic checkpoint feature saves the job's
    state at intervals so that work is not lost when
    the job must be migrated
  • Condor jobs vs. vanilla jobs
  • Condor job executables must be relinked and can
    be checkpointed, migrated and restarted
  • Vanilla jobs are not relinked and cannot be
    checkpointed and migrated

11
Condor Checkpointing Limitations
  • Only single process jobs supported
  • Inter-process communication not supported
    (socket, send, recv, etc. not implemented)
  • All file operations must be idempotent (read-only
    and write-only access works correctly; reading and
    writing the same file may not)
  • Disk space must be available to store the
    checkpoint file on the submitting machines.
  • Each checkpointed job has an associated
    checkpoint file which is approximately the size
    of the address space of the process.

12
Condor-PVM and Parallel Jobs
  • PVM master/slave jobs can be submitted to Condor
    pool. (Special condor-pvm universe)
  • Master is run on machine where the job was
    submitted
  • Slaves pulled from the condor pool as they become
    available
  • Condor acts as resource manager for the PVM
    daemon (see the master sketch below)
  • Whenever the PVM program asks for nodes, the
    request is remapped to Condor
  • Condor finds a machine in the Condor pool and
    adds it to the PVM virtual machine
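
A minimal master sketch using the standard PVM 3 API (the
slave program name is hypothetical); under the condor-pvm
universe, the pvm_spawn() request is handed to Condor, which
pulls idle machines from the pool into the virtual machine:

    /* PVM master: spawn slaves and collect one int from each. */
    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int tids[4];
        /* ask for 4 slaves; Condor supplies hosts as they go idle */
        int n = pvm_spawn("slave_prog", NULL, PvmTaskDefault,
                          "", 4, tids);
        for (int i = 0; i < n; i++) {
            int result;
            pvm_recv(tids[i], 1);        /* wait for a msgtag-1 reply */
            pvm_upkint(&result, 1, 1);   /* unpack one integer        */
            printf("slave %d returned %d\n", tids[i], result);
        }
        pvm_exit();                      /* leave the virtual machine */
        return 0;
    }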

13
Condor and the Grid
  • Condor and the Alliance
  • Condor one of the Grid technologies deployed by
    the Alliance
  • Used for production high-throughput computing by
    partners
  • Condor and Globus
  • Globus can use Condor as a local resource
    manager.
  • Globus RSL specifications are translated into
    matchmaker ClassAds

14
Condor and the Grid
  • Flock of Condors
  • Aggregating Condor pools into a flock lets jobs
    cross load-sharing and protection boundaries
  • Condor flock may include Condor pools connected
    by wide-area networks
  • Infrastructure
  • The idea is to add a Gateway (GW) machine to
    every pool
  • Gateway machines act as resource brokers for
    machines external to a pool
  • In the published description, the GW machine
    presents randomly chosen external pools/machines
  • CM does not need to know about flocking
  • Each GW machine runs GW-startd and GW-schedd as
    with a single condor pool

15
Flocking Protocol (machines in different pools)
16
Globus
  • Globus -- integrated toolkit of Grid services
  • Developed by Ian Foster (ANL/UC) and Carl
    Kesselman (USC/ISI)
  • "Bag of services" model: applications can use
    Grid services without having to adopt a
    particular programming model

17
Core Globus Services
  • Resource allocation and process management (GRAM,
    DUROC, RSL)
  • Information Infrastructure (MDS)
  • Security (GSI)
  • Communication (Nexus)
  • Remote Access (GASS, GEM)
  • Fault Detection (HBM)
  • QoS (GARA, Gloperf)

18
Globus Layered Architecture (diagram)
  • Applications
  • High-level services and tools: GlobusView,
    Testbed Status, DUROC, globusrun, MPI, Nimrod/G,
    MPI-IO, CC++
  • Core services: GRAM, Nexus, Metacomputing
    Directory Service, Globus Security Interface,
    Heartbeat Monitor, Gloperf, GASS
19
Globus Resource Management Services
  • Resource Management services provide mechanism
    for remote job submission and management
  • Three low-level services
  • GRAM (Globus Resource Allocation Manager)
  • Provides remote job submission and management
  • DUROC (Dynamically Updated Request Online
    Co-allocator)
  • Provides simultaneous job submission
  • Layers on top of GRAM
  • RSL (Resource Specification Language)
  • Language used to communicate resource requests
    (example below)
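
A sketch of a GRAM submission (the resource contact strings
are hypothetical; the RSL attributes and globusrun usage
follow GT-era conventions):

    # run 4 copies of a program on one GRAM-managed resource
    globusrun -r gram.example.edu \
        '&(executable=/home/user/myprog)(count=4)(maxMemory=64)'

    # DUROC co-allocation: "+" combines requests for
    # simultaneous submission to two resources
    globusrun '+(&(resourceManagerContact="a.example.edu")
                  (count=2)(executable=/home/user/myprog))
                (&(resourceManagerContact="b.example.edu")
                  (count=4)(executable=/home/user/myprog))'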

20
Globus Resource Management Architecture
(diagram) An application expresses its needs in RSL; brokers
specialize the RSL (issuing queries to the Information Service)
until it is ground RSL; simple ground RSL requests go to
individual GRAMs, each of which submits to a local resource
manager such as LSF, EASY-LL, or NQE.
21
Globus Information Infrastructure
  • MDS (Metacomputing Directory Service)
  • MDS stores directory entries, each describing
    some type of object (organization, person,
    network, computer, etc.)
  • Object class associated with each entry describes
    a set of entry attributes
  • LDAP (Lightweight Directory Access Protocol) is
    used to store and access resource information
    (example query below)
  • LDAP's hierarchical, tree-structured information
    model defines the form and character of the
    information
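
Since MDS is accessed via LDAP, a standard ldapsearch can
query it. A sketch (the host, object class, and attribute
names are illustrative; "o=Grid" was MDS's conventional
directory root):

    ldapsearch -h mds.example.edu -p 2135 -b "o=Grid" \
        "(objectclass=GlobusComputeResource)" \
        cpuType freePhysicalMemory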

22
Globus Security Service
  • GSI (Grid Security Infrastructure)
  • Provides a public-key-based security system that
    layers on top of local site security
  • User is identified to the system by an X.509
    certificate containing the duration of
    permissions, the user's public key, and the
    signature of the certificate authority
  • User also has private key
  • Provides users with single sign-on access to
    the various sites to which they are authorized

23
More GSI
  • Resource management system uses GSI to establish
    which machines user may have access to
  • GSI supports proxies so that the user need only
    log on once, rather than logging on to every
    machine involved in a distributed computation
    (see the commands below)
  • Proxies used for short-term authentication,
    rather than long-term use
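
In practice, single sign-on looks like the following (standard
GSI commands; the resource contact is hypothetical). The
passphrase unlocks the long-term private key once; afterwards
the short-lived proxy authenticates on the user's behalf:

    grid-proxy-init     # prompts for passphrase, signs a
                        # short-lived proxy certificate
    grid-proxy-info     # show subject and remaining lifetime
    globusrun -r gram.example.edu '&(executable=/bin/hostname)'
                        # no further logins needed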

24
Globus Communication Services
  • Nexus
  • Communication library which provides asynchronous
    RPC, multi-method communication, data conversion
    and multi-threading facilities
  • I/O
  • Low level communication library which provides a
    thin wrapper around TCP, UDP, IP multicast and
    file I/O
  • Integrates GSI into TCP communication

25
Globus Remote Access Services
  • GASS (Globus Access to Secondary Storage)
  • Provides secure remote access to files (usage
    sketch below)
  • GEM (Globus Executable Management)
  • Intended to support identification, location, and
    creation of executables in a heterogeneous
    environment.
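
One common GASS usage (a sketch; in GT-era globusrun, -s
starts a local GASS server and $(GLOBUSRUN_GASS_URL) is
substituted into the RSL so the remote job stages its
executable from, and writes its output back to, the
submitting machine):

    globusrun -s -r gram.example.edu \
      '&(executable=$(GLOBUSRUN_GASS_URL)/home/user/myprog)
        (stdout=$(GLOBUSRUN_GASS_URL)/home/user/myprog.out)'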

26
Globus Fault Detection Services
  • HBM (Heartbeat Monitor)
  • Provides mechanisms for monitoring multiple
    remote processes in a job and enabling the
    application to respond to failures
  • Nexus Fault Detection
  • Notifies applications using Nexus when a
    communicating process fails (but not which one)

27
Globus QoS Services
  • GARA (Globus Architecture for Reservation and
    Allocation)
  • Provides dedicated access to collections of
    resources via reservations
  • Gloperf
  • Provides bandwidth and latency information
  • Wolski's NWS (Network Weather Service) is being
    integrated with Globus
  • NWS provides monitoring and predictive
    information

28
Globus and the Grid
  • Major player in Grid Infrastructure development
  • Currently deployed widely
  • User community strong
  • Infrastructure supported by IPG, Alliance and
    NPACI
  • Exclusive infrastructure of Alliance and IPG

29
Legion
  • Developed by Andrew Grimshaw (UVA)
  • Provides single, coherent virtual machine model
    that addresses grid issues within a reflective,
    object-based metasystem
  • Everything is an object in the Legion model: HW
    resources, SW resources, etc.

30
Legion Goals
  • Site autonomy
  • Each organization maintains control over its own
    resources
  • Extensibility
  • Users can construct own mechanisms and policies
    within Legion
  • Scalability
  • No centralized structures or servers; fully
    distributed

31
Legion Goals
  • Easy to use / seamless
  • System must hide complexity of environment
  • "Ninja" (expert) users must be able to tune
    applications
  • High performance via parallelism
  • Coarse-grained applications should perform well
  • Single, persistent object space
  • Single name space, transparent with respect to
    location and replication
  • Security
  • "Do no harm": Legion should not weaken local
    security policies

32
Legion Object Model
  • Every Legion object is defined and managed by its
    class object; class objects act as managers and
    make policy, as well as define instances
  • Legion defines the interface and basic
    functionality of a set of core object types which
    support basic services
  • Users may also define and build their own class
    objects

33
Legion Object Model
  • Core Objects
  • Host objects
  • Encapsulate machine capabilities in Legion
    (processors and memory)
  • Currently represent single host systems
    (uniprocessor and multiprocessor shared memory)
  • Vault objects
  • Represents persistent storage
  • Implementation objects
  • Generally an executable file that a host object
    can execute when it receives a request to
    activate or create an object

34
Legion Object Model
  • Basic system services provided by core objects
  • Naming and binding, object creation, activation,
    deactivation and deletion
  • Responsibility for system-level functionality is
    vested in classes
  • Classes (which are also objects) define and
    manage objects associated with them
  • Classes create new instances, schedule them for
    execution, activate and deactivate them, and
    provide current location info for contacting them
  • Users can define and build own class objects

35
Legion Programming
  • Legion supports the MPI and PVM libraries via
    emulation libraries (which use the Legion runtime
    library)
  • Applications need to be recompiled and relinked
  • Legion supports BFS (Basic Fortran Support) and
    Java
  • Legion's OO programming language: Mentat (MPL,
    the Mentat Programming Language)

36
Legion and the Grid
  • Major Grid player with Globus
  • Legion infrastructure deployed at NPACI and
    Department of Defense Modernization sites; being
    considered as infrastructure for Boeing's
    distributed product data management and
    manufacturing resource control systems
  • Large-scale implementations of the molecular
    dynamics applications CHARMM and Amber at NPACI

37
Still other Infrastructure Approaches
  • CORBA
  • Globe (Europe)
  • Suma (Venezuela)
  • Web-based approaches (Geoffrey Fox)
  • Jini (Sun)
  • DCOM (MS)
  • etc.

38
What's Missing?
  • How do we ensure application performance?
  • Performance-efficient application development and
    execution
  • Ninja programming
  • AppLeS, Nimrod, Mars, Prophet/Gallop, MSHN, etc.
  • GrADS

39
GrADS: Grid Application Development and
Execution Environment
  • Prototype system which facilitates end-to-end
    grid-aware program development
  • Based on the idea of a performance economy in
    which negotiated contracts bind applications to
    resources
  • Joint project with large team of researchers
  • Ken Kennedy
  • Jack Dongarra
  • Dennis Gannon
  • Dan Reed
  • Lennart Johnsson

  • Andrew Chien
  • Rich Wolski
  • Ian Foster
  • Carl Kesselman
  • Fran Berman
40
Cool GrADS Ideas
  • Performance Contracts
  • Vehicle for sharing complex, multi-dimensional
    performance information between components
  • Performance Economy
  • Framework in which to negotiate services and
    promote performance.
  • Performance contracts play a fundamental role in
    the exchange of information and binding of
    resources
  • Resource allocation and performance steering
    using fuzzy logic (AppLePilot)
  • Mechanism for describing quality of information
  • Allows for performance steering based on
    evaluation of application progress

41
Next Time
  • Talk by Marc Snir, Architect of IBM's Blue Gene
  • Tuesday, June 6, APM 4301, 1:00-2:00
  • Abstract
  • IBM Research announced in December a 5-year,
    $100M research project aimed at developing a
    petaop computer and using it for research in
    computational biology. The talk will discuss the
    architectural choices involved in the design of a
    petaop computer, and will present the design
    point pursued by the Blue Gene project. We shall
    discuss the mapping of molecular dynamics
    computations onto the Blue Gene architecture and
    outline research problems in Computer Science and
    Computational Biology that such a project
    motivates.