Title: Melody: A Desktop Grid Architecture for Computational Workloads
1 Melody: A Desktop Grid Architecture for Computational Workloads
- Vijay K Naik
- Swaminathan Sivasubramanian
- Sriram Krishnan
2 Grid with Desktop Resources
[Figure: enterprise computation mapped onto a Grid of desktop resources; the Grid performs discovery, deployment, scheduling, and rebalancing]
3 Outline
- Desktop Grids
- Introduction to Melody
- Overview of Melody client side architecture
- Overview of Melody server side architecture
- Related Work
- Conclusions and Future work
4 Desktop Grid Environments
- Intranet
- All desktops are under a single administrative domain
- Grid nodes are highly reliable and available
- Internet
- Desktops spread over the Internet
- Grid nodes cannot be assumed to be reliable
- Co-operative environment
- Desktops span co-operating intranets
- E.g., business partners, collaborating organizations, universities, etc., each with their own administration
- Grid nodes need not be reliable; however, co-operating administrators can take action to handle reliability issues
5 Grid Workloads
- Two kinds of Grid workloads
- Transactional, e.g., Web Service based transactions
- Computational, e.g., batch workloads
- Harmony
- Dynamic scheduling based on resource availability
- Desktop owners set usage policies for the grid workload
- Targeted at an intranet scenario
- Melody
- Grid infrastructure for running loosely coupled and asynchronous Grid applications
- Co-operative environment
- Focus of this talk: Melody
6 Melody: Introduction
- Desktop Grid architecture for running computational workloads
- Workloads are assumed to be loosely coupled
- Not suitable for jobs requiring communication during the execution of jobs on the Grid nodes
- Joint collaboration with University of Maryland
- Runs Hierarchical N-Body Simulations as the Grid application
- Will be deployed in the High Schools of Maryland
7 Melody: System setup
[Figure: High Schools 1 to n, each with its Grid users and a School Administrator, connected to a central Grid Server managed by the Grid Administrator]
- Grid Administrator: creates and manages school accounts and the grid server
- Grid User: submits new grid jobs, reviews results
- School Administrator: adds new nodes for the account and installs client side software
8 Overview of Melody client side architecture
9 Components in Client Side
- Grid Agent
- Grid Administrator portals
- School Administrator portals
- Grid User portals
10 Grid Agent
- Client side software that interacts with the grid server and is responsible for orchestrating the life cycle of instances of multiple Grid applications
- Functionalities
- Orchestrating life cycle of Grid applications
- Grid Application interaction
- Grid node characterization
- Grid Server communication/interaction
- Job Monitoring
- Runs as a screensaver or a stand-alone
application
11 Interactions with Grid Application
- Applications are treated as black boxes
- Applications are not re-compiled
- Applications are expected to conform to a model of execution that defines
- Input and output format for the grid application
- Checkpointing format
- Error semantics
- Grid Agent (GA) uses the interfaces defined by this model to start the application (fresh/checkpointed state)
- GA can potentially run different grid applications
12 Grid node characterization
- Quality of a grid node
- Computing power, memory capacity
- Determines the turn-around time for a grid job
- Determining quality
- Generic method
- Transmit system information (processor, memory and disk space available)
- Grid server infers the quality based on this information
- Less accurate, but independent of the grid application
- Application-oriented method
- Perform a benchmark run to determine relative quality
- More precise, but has to be performed for each kind of application
13 Grid node characterization (contd.)
- GA records the availability of the grid node
- Percentage of availability for Grid computations
- Information regarding interval of availability
- Used to determine which job can be run at least to the next checkpoint over the time interval available
- Availability information is used by the server for scheduling
- A machine that is 50% available counts as half as powerful
- Desktop availability shows bi-modal behavior
- Run different kinds of jobs/applications during different modes
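The way availability feeds into scheduling can be illustrated with a small calculation. This is a minimal sketch under stated assumptions, not Melody's actual formula: it scales a node's quality factor by the fraction of time the node is available, and it assumes the quality factor can be read as work units per hour. The names `effective_power` and `fits_interval` are hypothetical.

```python
def effective_power(quality: float, availability_pct: float) -> float:
    """Scale a node's quality factor by its availability (0 to 100 percent)."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return quality * availability_pct / 100.0


def fits_interval(work_units: float, quality: float, interval_hours: float) -> bool:
    """Return True if `work_units` can be processed within the available interval.

    Assumption (not from the slides): `quality` is work units per hour.
    """
    return work_units <= quality * interval_hours
```

With these definitions a node of quality 2.0 that is available 50% of the time is treated like a node of quality 1.0 that is always available.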
14 Grid Server communication
- Grid Agent communicates with Grid Server for
- Registering a grid node
- Requesting a new job
- Updating its availability and quality characteristics
- Returning results
- Sending application-related error details
- Communication using Web services
- Each interaction is a web service call
15 Grid node: Typical scenario
[Figure: sequence of interactions between a Grid node and the Grid Server; the School Administrator and Grid Administrator also appear in the diagram]
- Registers itself
- Downloads grid application package
- Updates quality values
- Gets a new job, instantiates it
- Monitors the job status
- Returns results
16 Error Handling
- Errors
- Application related errors
- Communication errors
- Critical errors
- Application related errors
- E.g., application instantiation error, result reading error, etc.
- Possibly due to intentional/unintentional
- Corruption of grid application package, state and output files
- Corruption of registries
- Grid Agent sends the application related error
code to Grid Server, if possible
17 Error Handling (contd.)
- Communication errors
- Possibly due to network/server failure
- Client uses an exponential back-off algorithm for retrying its web-service call
- Decreases network congestion
- Critical errors
- E.g., registry entries are corrupted, application package not found, etc.
- Grid Agent exits with a message
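The retry behavior for communication errors can be sketched as follows. This is an illustrative Python sketch, not the Grid Agent's C implementation. The function name `call_with_backoff`, its parameters, the use of `ConnectionError`, and the random jitter are all assumptions made for the example.

```python
import random
import time


def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Invoke `call` and retry on ConnectionError with exponential back-off.

    The delay doubles after every failed attempt and is capped at `max_delay`.
    A small random jitter (an assumption, not from the slides) keeps many
    clients from retrying in lock step.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Because each failed attempt doubles the wait, a server that is down sees retries from any one client thin out quickly, which is how back-off reduces congestion.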
18 Grid Agent Implementation
- Grid Agent: implemented in C
- Web Service calls using gSOAP C client stubs
- Application invocation, monitoring
- Windows process control APIs
- Stores node characteristics and state information in the Windows registry
- Self Installer package
- Grid Agent package was created using InstallShield
- Interactions with Grid Administrator, School Administrator and Grid User
- Portals implemented using JSPs
19 Overview of Melody Server Side Architecture
20 The Grid Server Architecture
- Requirements Overview
- Data Management
- Management of job requirements, inputs and outputs
- Management of Grid nodes
- Scheduling and retrieval of results
- Appropriate allocation of jobs to Grid nodes
- Receiving results back from the Grid nodes
- Reliability
- Verification of correctness of results
- Ensuring that all jobs complete execution
- Security
- Ensuring that only authorized Grid nodes access
services provided by the Grid server
21 Data Management Requirements
- Types of data to be managed
- Data describing Grid Jobs
- Location of Input parameters
- Metadata about Grid Jobs
- Expected execution times
- Expected Grid node capabilities
- Data describing the tasks running on Grid nodes
- A task is a running instance of a Grid Job
- The Grid node that has been assigned the task
- The start time of the task, and the status
- Output or error information once the task is done
22 Data Management Requirements
- Types of data to be managed (contd.)
- Data describing Grid nodes
- The computational capabilities (represented as a quality factor)
- The average availabilities of the Grid nodes
- The reliability information for the Grid nodes
- The location (schools) for the Grid nodes
- The maximum number of Grid nodes allowed per school
- Data describing Grid application software
- The location where the binaries can be found for different architectures
23 Database Design
[Figure: Grid Server Database]
24 Scheduler Requirements
- Types of workloads
- Transactional (e.g. Harmony)
- The scheduler responds to client requests to execute transactional (business) operations
- Response time is the most important metric
- A push based model is more appropriate
- Guarantees immediate response to requests
- Computational (e.g. Melody)
- The scheduler provides load-balancing for multiple computational (batch) jobs at the same time
- System throughput is the most important metric
- A pull based model is more appropriate
- All Grid nodes are kept busy at all times if
there are enough jobs to be scheduled
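A pull-based model of the kind described for computational workloads can be sketched as follows. This is a minimal illustration under assumptions, not Melody's scheduler: the class `PullScheduler`, the `min_quality` field, and the oldest-job-first rule are hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    min_quality: float  # hypothetical: capability a node is expected to have


class PullScheduler:
    """Hands out jobs only when a Grid node asks for one (pull model)."""

    def __init__(self) -> None:
        self._queue = deque()

    def submit(self, job: Job) -> None:
        """Queue a job until some node pulls it."""
        self._queue.append(job)

    def request_job(self, node_quality: float) -> Optional[Job]:
        """Return the oldest queued job this node is capable of running."""
        for job in list(self._queue):  # iterate over a copy; we mutate the queue
            if node_quality >= job.min_quality:
                self._queue.remove(job)
                return job
        return None  # nothing suitable; the node asks again later
```

Because nodes ask for work whenever they are free, no node sits idle while a suitable job is queued, which is the throughput argument made above.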
25 Scheduler Algorithm
26 Receiving Results and Errors
- Grid Agent contacts the Results Receiver service with results
- The service receives the results and stores them locally
- Updates status of task as DONE in the Task Management database
- Updates the number of received tasks and outstanding tasks for a particular job in the Job Schedule database
- Grid Agent contacts the Error Receiver service with errors
- If there is a fault with the input parameters, it removes the Job from the Job Schedule database, with an error message
- If there is a fault with the execution of the task on the Grid Agent side, it puts the Job back to be scheduled
27 Reliability Requirements
- Scenarios affecting reliability
- Tampering with the results
- Results can be altered by a malicious user at the Grid node
- Results may get corrupted during transit
- It should be possible to recover from incorrect results sent by a Grid node
- Unexpected delays in processing at the Grid nodes
- The Grid nodes may experience failures, and not return results at an expected time
- It should be possible to reschedule such a task on another Grid node, without affecting the correctness
28 Reliability: Tampering with Results
- Replication of tasks as a mechanism to ensure correctness
- Triple Modular Redundancy (TMR), with equal voting
- May be overkill, as we assume a co-operative environment
- Random replication, to be used as a sanity check
- Build up the confidence of machines gradually, and keep track of it in the database
- Weighted voting schemes, with variable number of redundancies
- Use the reliabilities to associate votes for each Grid node, to be used to determine the correctness of results
- Reduce the number of replications for highly reliable machines and vice-versa
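The weighted voting idea can be sketched as follows. This is an illustration under assumptions, not an implemented Melody component: the function names, the majority-of-weight rule, and the reliability thresholds are hypothetical. With equal weights and three replicas, `weighted_vote` reduces to TMR with equal voting.

```python
from collections import defaultdict


def weighted_vote(results, threshold=0.5):
    """Pick the result backed by more than `threshold` of the total weight.

    `results` is a list of (result, reliability) pairs, one per replica.
    Returns the winning result, or None if no result has enough weight.
    """
    totals = defaultdict(float)
    for result, reliability in results:
        totals[result] += reliability
    total_weight = sum(totals.values())
    if total_weight == 0:
        return None
    winner = max(totals, key=totals.get)
    return winner if totals[winner] / total_weight > threshold else None


def replicas_needed(reliability, high=0.9, low=0.5):
    """Hypothetical policy: fewer replicas for more reliable nodes."""
    if reliability >= high:
        return 1
    return 2 if reliability >= low else 3
```

In this sketch one highly reliable node can outvote two unreliable ones, which is the point of weighting votes by reliability rather than counting them equally.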
29 Reliability: Tampering (contd.)
- Current implementation provides for
- No replication
- A sanity-checker algorithm that verifies if the results received are feasible
- If results are feasible
- Accept the results, and store metadata in the database
- If results are not feasible
- Put the task back to be scheduled
- Decrement the reliability information of the Grid node. If the Grid node repeatedly sends back incorrect results, inform the school administrator
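The accept-or-reschedule logic above can be sketched as follows. This is a hypothetical illustration, not the server's Java code: the feasibility test is passed in as a function because it depends on the application, and the starting reliability score and the alert threshold of three consecutive bad results are assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    node_id: str
    reliability: int = 10      # hypothetical starting score
    consecutive_bad: int = 0   # infeasible results in a row


def handle_result(node, task_id, result, is_feasible, pending, accepted,
                  alert_after=3):
    """Accept a feasible result; otherwise requeue the task and penalize the node.

    Returns True when the school administrator should be informed.
    """
    if is_feasible(result):
        accepted[task_id] = result
        node.consecutive_bad = 0
        return False
    pending.append(task_id)  # put the task back to be scheduled
    node.reliability = max(0, node.reliability - 1)
    node.consecutive_bad += 1
    return node.consecutive_bad >= alert_after
```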
30 Reliability: Delays in Task Completion
- Monitoring of tasks to circumvent unexpected delays
- Associated with each scheduled task is a start time and an expected response time
- A Task Status Checker service periodically goes through the Task Management database
- If the Grid node associated with a task does not respond with results within the expected time, it puts the task back in the task queue to be scheduled
- The expected time can be adjusted depending on the characteristics of the Grid nodes in the system
- The Results Receiver service can still receive the response from the first Grid node (if it passes the correctness check)
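One pass of a periodic task status check can be sketched as follows. This is a minimal sketch under assumptions: the `Task` fields mirror the data listed for tasks earlier in the talk, while the status names and the function `requeue_overdue` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    node_id: str
    start_time: float         # seconds since some epoch
    expected_seconds: float   # expected response time
    status: str = "RUNNING"


def requeue_overdue(tasks, queue, now):
    """Put every RUNNING task past its expected response time back in the queue.

    Returns the ids of the tasks that were requeued in this pass.
    """
    overdue = []
    for task in tasks:
        if task.status == "RUNNING" and now - task.start_time > task.expected_seconds:
            task.status = "RESCHEDULED"  # hypothetical status name
            queue.append(task.task_id)
            overdue.append(task.task_id)
    return overdue
```

Marking the task as rescheduled rather than deleting it leaves room for a late result from the first node to be accepted, as the last bullet above allows.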
31 Security
- Requirements
- The Grid nodes can authenticate the Grid server during any interaction
- The Grid server can authenticate the Grid nodes during any interaction
- Encryption of data on the wire, if need be
- Current implementation
- Use of HTTPS with X.509 certificates, with mutual authentication
- The Grid server uses a self-signed certificate
- Every client is given a client certificate, signed by the server's private key
32 Grid Server Architecture: Summary
[Figure: Grid Server architecture. Components shown: firewall, IBM HTTP Server, WebSphere Application Server (various Web Services, JSPs for Grid user interface), two IBM DB2 Servers, and a repository for input and output]
33 Implementation Details
- Tools used
- Web services, deployed on the WebSphere Application Server 5.0
- Written in Java, developed using WSAD 5.0
- Can be accessed via IBM HTTP Server 1.3.26
- IBM DB2 v7.2 back-end
- C Web services client (inside the screensaver)
- gSOAP provides the C client side stubs
- Florida State University, Open Source project
- JSP based interface to the Grid user
34 Related Work
- Volunteer Computing
- SETI@home: http://setiathome.ssl.berkeley.edu/
- Screensaver based Search for Extra-Terrestrial Intelligence
- Bayanihan Computing: http://bayanihancomputing.net/
- A Web services based approach, using applet clients
- BOINC: http://boinc.berkeley.edu/
- Resource sharing among independent projects
- Entropia DCGrid, United Devices
- Industry solutions for a PC Grid
- High Throughput Computing
- Condor(G): http://www.cs.wisc.edu/condor/
- Leverages large collections of distributed
resources to run scientific applications
35 Conclusions
- Melody: A desktop Grid architecture for computational workloads
- Leveraging idle cycles of desktop PCs in a co-operative environment
- A pull-based scheduling model to support high throughput
- Can potentially run different Grid applications
- To be deployed on High School PCs in Maryland, to
run Astronomy simulations (Hierarchical N-Body)
36 Future Work
- Generalization of the architecture to support a class of similar applications
- Involves formal definition of
- Requirements from the Grid application and its interaction with the Grid agent
- Standardization of input/output parameters
- Description of dependencies between various jobs
- Replication algorithms for result correctness
- TMR, Variable replication with weighted voting
- Analyzing the effectiveness of such approaches
- Investigation of WS-Security for Grid Agent to Grid Server interactions
- Need implementations in both Java and C that interoperate
- Investigation of the applicability of OGSA
37 Backup
38 Job Monitoring
- GA instantiates and monitors the grid computations
- GA can be in 4 states
- No job; job obtained but not started; job started and running; job finished, result yet to be sent
- State information is stored in the registry
- If screensaver is pre-empted due to user interaction
- GA kills the grid job
- Updates its state in the registry and exits
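The four Grid Agent states can be sketched as a small state machine. This is an illustrative Python sketch, not the agent's C code. The event names are hypothetical, and the transition taken on pre-emption (back to "job obtained but not started") is an assumption: the slides only say that the job is killed and the state is updated in the registry.

```python
from enum import Enum


class AgentState(Enum):
    NO_JOB = "no job"
    JOB_OBTAINED = "job obtained but not started"
    JOB_RUNNING = "job started and running"
    RESULT_PENDING = "job finished, result yet to be sent"


# Allowed transitions; the event names are hypothetical.
TRANSITIONS = {
    (AgentState.NO_JOB, "job_received"): AgentState.JOB_OBTAINED,
    (AgentState.JOB_OBTAINED, "job_started"): AgentState.JOB_RUNNING,
    (AgentState.JOB_RUNNING, "job_finished"): AgentState.RESULT_PENDING,
    (AgentState.RESULT_PENDING, "result_sent"): AgentState.NO_JOB,
    # Assumption: a pre-empted job is killed and can be started again later.
    (AgentState.JOB_RUNNING, "preempted"): AgentState.JOB_OBTAINED,
}


def next_state(state, event):
    """Return the state after `event`, or raise ValueError if it is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None
```

Persisting the current state after every transition, as the agent does with the registry, lets a restarted agent resume from where it stopped.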