Melody: A Desktop Grid Architecture for Computational Workloads

Slides: 39
Provided by: Swa71
Learn more at: https://users.sdsc.edu
Transcript and Presenter's Notes

Title: Melody: A Desktop Grid Architecture for Computational Workloads


1
Melody: A Desktop Grid Architecture for
Computational Workloads
  • Vijay K Naik
  • Swaminathan Sivasubramanian
  • Sriram Krishnan

2
Grid with Desktop Resources
[Diagram: Enterprise Computation; Discovery, deployment, scheduling, rebalancing; Grid (desktops shown as "Z..")]
3
Outline
  • Desktop Grids
  • Introduction to Melody
  • Overview of Melody client side architecture
  • Overview of Melody server side architecture
  • Related Work
  • Conclusions and Future work

4
Desktop Grid Environments
  • Intranet
  • All desktops are under a single administrative
    domain
  • Grid nodes are highly reliable and available
  • Internet
  • Desktops spread over the Internet
  • Grid nodes cannot be assumed to be reliable
  • Co-operative environment
  • Desktops span across co-operating Intranets
  • E.g., Business partners, collaborating
    organizations, universities, etc., each with
    their own administration
  • Grid nodes need not be reliable, however
    co-operating administrators can take action to
    handle reliability issues

5
Grid Workloads
  • Two kinds of Grid workloads
  • Transactional, e.g., Web Service based
    transactions
  • Computational, e.g., batch workloads
  • Harmony
  • Dynamic scheduling based on resource availability
  • Desktop owners set policies of usage for grid
    workload
  • Targeted for an intranet scenario
  • Melody
  • Grid infrastructure for running loosely coupled
    and asynchronous Grid applications
  • Co-operative environment
  • Focus of this talk: Melody

6
Melody: Introduction
  • Desktop Grid architecture for running
    computational workloads
  • Workloads are assumed to be loosely coupled
  • Not suitable for jobs requiring communication
    during the execution of jobs on the Grid nodes
  • Joint collaboration with University of Maryland
  • Runs Hierarchical N-Body Simulations as the Grid
    application
  • Will be deployed in the High Schools of Maryland

7
Melody: System setup
[Diagram labels: High School 1, High School 2, High School 3, ..., High School n; Grid Server]
  • Grid Administrator: creates and manages school accounts and the grid server
  • School Administrator: adds new nodes for the account and installs
    client side software
  • Grid User (Grid User1, Grid User2): submits new grid jobs, reviews results
8
Overview of Melody client side architecture
9
Components in Client Side
  • Grid Agent
  • Grid Administrator portals
  • School Administrator portals
  • Grid User portals

10
Grid Agent
  • Client side software that interacts with the grid
    server and is responsible for orchestrating the life
    cycle of instances of multiple Grid applications
  • Functionalities
  • Orchestrating life cycle of Grid applications
  • Grid Application interaction
  • Grid node characterization
  • Grid Server communication/interaction
  • Job Monitoring
  • Runs as a screensaver or a stand-alone
    application

11
Interactions with Grid Application
  • Applications are treated as black boxes
  • Applications are not re-compiled
  • Applications are expected to conform to a model
    of execution that defines:
  • Input and output format for the grid application
  • Checkpointing format
  • Error semantics
  • Grid Agent (GA) uses the interfaces defined by
    this model to start the application
    (fresh/checkpointed state)
  • GA can run potentially different grid
    applications

12
Grid node characterization
  • Quality of a grid node
  • Computing power, memory capacity
  • Determines the turn-around time for a grid job
  • Determining quality
  • Generic method
  • Transmit system information (processor, memory
    and disk space available)
  • Grid server infers the quality based on this
    information
  • Less accurate, however independent of grid
    application
  • Application-oriented method
  • Perform a benchmark run to determine relative
    quality
  • More precise, but must be performed for each kind
    of application

13
Grid node characterization (contd.)
  • GA records the availability of the grid node
  • Percentage of availability for Grid computations
  • Information regarding interval of availability
  • Used to determine which job can run at least to
    the next checkpoint over the time interval available
  • Availability information is used by the server
    for scheduling
  • A 50% available machine counts as half as powerful
  • Desktop availability - bi-modal behavior
  • Run different kinds of jobs/application during
    different modes

14
Grid Server communication
  • Grid Agent communicates with Grid Server for
  • Registering a grid node
  • Requesting a new job
  • Updating its availability and quality
    characteristics
  • Returning results
  • Sending application-related error details
  • Communication using Web services
  • Each interaction is a web service call

15
Grid node - Typical scenario
[Diagram labels: School Administrator, Grid Administrator, Grid Server]
Steps performed by the grid node:
  • Registers itself
  • Downloads grid application package
  • Updates quality values
  • Gets a new job, instantiates it
  • Monitors the job status
  • Returns results
16
Error Handling
  • Errors
  • Application related errors
  • Communication errors
  • Critical errors
  • Application related errors
  • E.g., application instantiation error, result
    reading error, etc.
  • Possibly due to intentional/unintentional
  • Corruption of grid application package, state and
    output files
  • Corruption of registries
  • Grid Agent sends the application related error
    code to Grid Server, if possible

17
Error Handling (contd.)
  • Communication errors
  • Possibly due to network/server failure
  • Client uses an exponential back-off algorithm for
    retrying its web-service call
  • Decreases network congestion
  • Critical errors
  • E.g., registry entries are corrupted, application
    package not found, etc.
  • Grid Agent exits with a message

18
Grid Agent Implementation
  • Grid Agent - implemented in C
  • Web Service calls using gSOAP C client stubs
  • Application invocation, monitoring
  • Windows process control APIs
  • Stores node characteristics and state information
    in the windows registry
  • Self Installer package
  • Grid Agent package was created using
    Installshield
  • Interactions with Grid Administrator, School
    administrator and Grid user
  • Portals implemented using JSPs

19
Overview of Melody Server Side Architecture
20
The Grid Server Architecture
  • Requirements Overview
  • Data Management
  • Management of job requirements, inputs and
    outputs
  • Management of Grid nodes
  • Scheduling and retrieval of results
  • Appropriate allocation of jobs to Grid nodes
  • Receiving results back from the Grid nodes
  • Reliability
  • Verification of correctness of results
  • Ensuring that all jobs complete execution
  • Security
  • Ensuring that only authorized Grid nodes access
    services provided by the Grid server

21
Data Management Requirements
  • Types of data to be managed
  • Data describing Grid Jobs
  • Location of Input parameters
  • Metadata about Grid Jobs
  • Expected execution times
  • Expected Grid node capabilities
  • Data describing the tasks running on Grid nodes
  • A task is a running instance of a Grid Job
  • The Grid node that has been assigned the task
  • The start time of the task, and the status
  • Output or error information once the task is done

22
Data Management Requirements
  • Types of data to be managed (contd.)
  • Data describing Grid nodes
  • The computational capabilities (represented as a
    quality factor)
  • The average availabilities of the Grid nodes
  • The reliability information for the Grid nodes
  • The location (schools) for the Grid nodes
  • The maximum number of Grid nodes allowed per
    school
  • Data describing Grid application software
  • The location where the binaries can be found for
    different architectures

23
Database Design
Grid Server Database
24
Scheduler Requirements
  • Types of workloads
  • Transactional (e.g. Harmony)
  • The scheduler responds to client requests to
    execute transactional (business) operations
  • Response time is the most important metric
  • A push based model is more appropriate
  • Guarantees immediate response to requests
  • Computational (e.g. Melody)
  • The scheduler provides load-balancing for
    multiple computational (batch) jobs at the same
    time
  • System throughput is the most important metric
  • A pull based model is more appropriate
  • All Grid nodes are kept busy at all times if
    there are enough jobs to be scheduled

25
Scheduler Algorithm
26
Receiving Results and Errors
  • Grid Agent contacts the Results Receiver service
    with results
  • The service receives results, and stores them
    locally
  • Updates status of task as DONE in the Task
    Management Database
  • Updates the number of received tasks and
    outstanding tasks for a particular job in the Job
    Schedule database
  • Grid Agent contacts the Error Receiver service
    with errors
  • If there is a fault with the input parameters, it
    removes the Job from the Job Schedule database,
    with an error message
  • If there is a fault with the execution of the
    task on the Grid Agent side, it puts the Job back
    to be scheduled

27
Reliability Requirements
  • Scenarios affecting reliability
  • Tampering with the results
  • Results can be altered by a malicious user at the
    Grid node
  • Results may get corrupted during transit
  • It should be possible to recover from incorrect
    results sent by a Grid node
  • Unexpected delays in processing at the Grid nodes
  • The Grid nodes may experience failures, and not
    return results at an expected time
  • It should be possible to reschedule such a task
    on another Grid node, without affecting the
    correctness

28
Reliability: Tampering with Results
  • Replication of tasks as a mechanism to ensure
    correctness
  • Triple Modular Redundancy (TMR), with equal
    voting
  • May be an overkill, as we assume a co-operative
    environment
  • Random replication, to be used as a sanity check
  • Build up the confidence of machines gradually,
    and keep track of it in the database
  • Weighted voting schemes, with variable number of
    redundancies
  • Use the reliabilities to associate votes for each
    Grid node, to be used to determine the
    correctness of results
  • Reduce the number of replications for highly
    reliable machines and vice-versa

29
Reliability: Tampering (contd.)
  • Current Implementation provides for
  • No replication
  • A sanity-checker algorithm that verifies if the
    results received are feasible
  • If results are feasible
  • Accept the results, and store metadata in the
    database
  • If results are not feasible
  • Put the task back to be scheduled
  • Decrement the reliability information of the Grid
    node. If the Grid node sends back incorrect
    results constantly, inform the school
    administrator

30
Reliability: Delays in Task Completion
  • Monitoring of tasks to circumvent unexpected
    delays
  • Associated with each scheduled task is a start
    time and an expected response time
  • A Task Status Checker service periodically goes
    through the Task Management database
  • If the Grid node associated with a task does not
    respond with results within the expected time, it
    puts the task back in the task queue to be
    scheduled
  • The expected time can be adjusted depending on
    the characteristics of the Grid nodes in the
    system
  • The Result Receiver service can still receive the
    response from the first Grid node (if it passes
    the correctness check)

31
Security
  • Requirements
  • The Grid nodes can authenticate the Grid server
    during any interaction
  • The Grid server can authenticate the Grid nodes
    during any interaction
  • Encryption of data on the wire, if need be
  • Current implementation
  • Use of HTTPS with X509 Certificates, with mutual
    authentication
  • The Grid server uses a self-signed certificate
  • Every client is given a client certificate,
    signed by the server's private key

32
Grid Server Architecture Summary
IBM HTTP Server
WebSphere Application Server
Various Web Services
JSPs for Grid user interface
IBM DB2 Server
IBM DB2 Server
Repository for input and output
Firewall
33
Implementation Details
  • Tools used
  • Web services, deployed on the WebSphere
    Application Server 5.0
  • Written in Java, developed using WSAD5.0
  • Can be accessed via IBM HTTP Server 1.3.26
  • IBM DB2v7.2 back-end
  • C Web services client (inside the screensaver)
  • gSOAP provides the C client side stubs
  • Florida State University, Open Source project
  • JSP based interface to the Grid user

34
Related Work
  • Volunteer Computing
  • SETI@Home http://setiathome.ssl.berkeley.edu/
  • Screensaver based Search for Extra-Terrestrial
    Intelligence
  • Bayanihan Computing http://bayanihancomputing.net/
  • A Web services based approach, using applet
    clients
  • BOINC http://boinc.berkeley.edu/
  • Resource sharing among independent projects
  • Entropia DCGrid, United Devices
  • Industry solutions for a PC Grid
  • High Throughput Computing
  • Condor(G) http://www.cs.wisc.edu/condor/
  • Leverages large collections of distributed
    resources to run scientific applications

35
Conclusions
  • Melody: A desktop Grid architecture for
    computational workloads
  • Leveraging idle cycles of desktop PCs in a
    co-operative environment
  • A pull-based scheduling model to support high
    throughput
  • Can potentially run different Grid applications
  • To be deployed on High School PCs in Maryland, to
    run Astronomy simulations (Hierarchical N-Body)

36
Future Work
  • Generalization of the architecture to support a
    class of similar applications
  • Involves formal definition of
  • Requirements from the Grid application and its
    interaction with the Grid agent
  • Standardization of input/output parameters
  • Description of dependencies between various jobs
  • Replication algorithms for result correctness
  • TMR, Variable replication with weighted voting
  • Analyzing the effectiveness of such approaches
  • Investigation of WS-Security for Grid Agent to
    Grid Server interactions
  • Need implementations in both Java and C that
    interoperate
  • Investigation of the applicability of OGSA

37
Backup
38
Job Monitoring
  • GA instantiates and monitors the grid
    computations
  • GA can be in 4 states
  • No Job, Job-obtained but not started, Job started
    and running, Job finished-result yet to be sent
  • State information is stored in the registry
  • If screensaver is pre-empted due to user
    interaction
  • GA kills the grid job
  • Updates its state in the registry and exits
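The four Grid Agent states can be sketched as a small state machine. The event names are hypothetical, and the state chosen after pre-emption (back to "job obtained but not started", so the job can be restarted later) is an assumption: the slide says only that the agent kills the job, updates its state, and exits.

```python
from enum import Enum


class AgentState(Enum):
    NO_JOB = "no job"
    JOB_OBTAINED = "job obtained but not started"
    JOB_RUNNING = "job started and running"
    JOB_FINISHED = "job finished, result yet to be sent"


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (AgentState.NO_JOB, "job_received"): AgentState.JOB_OBTAINED,
    (AgentState.JOB_OBTAINED, "job_started"): AgentState.JOB_RUNNING,
    (AgentState.JOB_RUNNING, "job_completed"): AgentState.JOB_FINISHED,
    (AgentState.JOB_FINISHED, "result_sent"): AgentState.NO_JOB,
    # Assumption: a pre-empted job is killed and must be started again.
    (AgentState.JOB_RUNNING, "preempted"): AgentState.JOB_OBTAINED,
}


def next_state(state, event):
    """Return the new state, or the unchanged state for an unknown event."""
    return TRANSITIONS.get((state, event), state)


print(next_state(AgentState.JOB_RUNNING, "preempted").value)
```

Persisting the current state (in Melody's case, in the Windows registry) is what lets the agent resume from the right point after it exits and restarts.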