A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System - PowerPoint PPT Presentation

Slides: 17
Provided by: Piy66
Learn more at: http://alumni.cs.ucr.edu

1
A Generic Fault Tolerant System for Dynamic
Scheduling in Distributed System
  • Class project
  • by
  • Piyush Ranjan Satapathy
  • Van Lepham

2
Problem Addressed
  • What happens when scheduled jobs in a real
    distributed system fail because of fault-causing
    behaviors?
  • Can I schedule my jobs on different types of
    resources?
  • Is there a truly generic tool for dynamic
    distributed scheduling of jobs?
  • Generic meaning any kind of resource, any kind
    of job, a suitable algorithm, and fault
    tolerance
  • Jobs range from regression test cases to
    complex scientific calculations: anything that
    can run in parallel

3
Motivation
  • Can we build a system that executes our
    distributed jobs dynamically or statically
    despite faulty behavior, while minimizing the
    number of resources used and the total
    completion time?
  • Can we bring it up to the standard expected in
    both academia and industry?

4
Our Contribution
  • A system built from scratch
  • A monitoring and feedback mechanism for
    real-life parallel job execution
  • Implemented and evaluated 5 algorithms
  • A Java GUI for better user interaction

5
Outline
  • Introduction
  • Tools Visited (Related Work)
  • Our Central Idea (Architecture)
  • Implementation
  • Evaluation
  • Conclusion
  • Next Step

6
Introduction
  • No existing tool, academic or industrial, fits
    a wide range of environments
  • Our objective is a system that is easy to use,
    port, and enhance
  • We monitor the running jobs and the scheduled
    machines, collect information, and use it for
    scheduling analysis
  • Our initial evaluation shows up to 10-15%
    better performance on a pool of 40 machines,
    11 of which are faulty

7
Tools Visited
  • OpenSTARS (2005)
    • A flexible, real-time, optimized tool for
      scheduling jobs in a distributed system
    • Does not handle any fault-tolerant behavior
    • Source: http://rtdev.cs.uri.edu/svn/repos/trunk/
  • SPHINX (2005)
    • A fault-tolerant system for scheduling in a
      dynamic Grid environment
    • Source: http://sphinx.phys.ufl.edu/
  • Cheddar (2004)
    • Free real-time scheduling tool based on EDF
      (Earliest Deadline First) and LLF (Least
      Laxity First)
    • Source: http://beru.univ-brest.fr/~singhoff/cheddar/
  • VEST (2003)
    • Real-time schedulability analysis for
      software-to-hardware allocations
    • Source: http://www.cs.virginia.edu/~ty4k/vestpage/

8
Tools Visited
  • STAF (2001)
    • Software Test Automation Framework (IBM) that
      runs across a number of machines; good for
      nightly software builds
    • Source: http://staf.sourceforge.net/index.php
  • TimeWiz (2000)
    • A comprehensive tool for real-time modeling
      and analysis
    • Source: http://www.timesys.com/products/timewiz/
  • RapidRMA (1993)
    • Based on EDF and CORBA-based real-time systems
    • Source: http://www.tripac.com/html/downloads.html

9
Core Architecture
  • 3 layers
    • Adaptive Analyzer
    • Monitoring and Feedback
    • Information Storage
  • [Architecture diagram. Labels: History Storage,
    Native Machine, Supervisor Keeping History,
    List of Machines, Resource Monitor, Analyzer,
    Grid Clusters, Sets Algorithm, LSF, Job Monitor,
    User scheduling job, Middleware, Wide Variety
    of Resources]

10
Implementation
  • An individual job is represented as a script
  • The job list contains all jobs to be executed
  • The machine list contains the names of machines
    or grid sites
  • Login information (submit command, status
    command, kill command) is given for each Grid,
    LSF, or remote machine
  • Either can be static or dynamic
  • Monitoring and feedback can be turned on or off
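
The inputs listed on this slide could be modeled roughly as follows. This is a minimal sketch, not the project's actual code: the class names `Job`, `Resource`, and `SchedulerConfig`, the field names, and the `{script}` / `{job_id}` placeholder convention are all assumptions made for illustration. The `bsub`, `bjobs`, and `bkill` strings are the standard LSF commands, used here only as example values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """One job, represented by the script that runs it (hypothetical model)."""
    name: str
    script: str  # path to the job script


@dataclass
class Resource:
    """A machine or grid site plus the commands used to drive it."""
    name: str
    submit_cmd: str  # template with a {script} placeholder (assumed convention)
    status_cmd: str  # template with a {job_id} placeholder
    kill_cmd: str    # template with a {job_id} placeholder

    def submit_line(self, job: Job) -> str:
        """Build the command line that would submit `job` to this resource."""
        return self.submit_cmd.format(script=job.script)


@dataclass
class SchedulerConfig:
    jobs: List[Job] = field(default_factory=list)
    resources: List[Resource] = field(default_factory=list)
    dynamic: bool = True     # assumed flag: static vs. dynamic scheduling
    monitoring: bool = True  # monitoring and feedback on or off


lsf = Resource("lsf-cluster", "bsub {script}", "bjobs {job_id}", "bkill {job_id}")
config = SchedulerConfig(
    jobs=[Job("regress-01", "tests/regress01.sh")],
    resources=[lsf],
    monitoring=False,
)
print(lsf.submit_line(config.jobs[0]))
```

Keeping the three commands as per-resource templates is one way to let the same scheduler drive a grid site, an LSF queue, or a plain remote machine without special cases.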

11
Implementation (GUI)
12
Experimental Setup
  • 40 machines inside the EBII building
  • Fault-causing behaviors:
    • 5 machines are not connected at all
    • 6 machines connect and then hang
  • 60 jobs
  • Execution time ranges from 2 seconds to 5 minutes
  • 5 algorithms considered:
    • Round Robin (working)
    • CPU Based (working)
    • Job Completion Based (working)
    • EDF (yet to work)
    • LLF (yet to work)
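
To make the interaction between Round Robin and the two fault types concrete, here is a minimal sketch, assuming a hypothetical `try_run(machine, job)` callback that reports whether a job finished. It is not the authors' implementation: it runs jobs one at a time for clarity, and it treats an unreachable machine and a hung machine the same way (a real monitor would likely detect a hang with a timeout).

```python
from collections import deque


def round_robin_with_feedback(jobs, machines, try_run):
    """Assign jobs to machines in rotation, dropping machines that fail.

    try_run(machine, job) is a hypothetical callback: it returns True when
    the job finished and False when the machine was unreachable or hung.
    """
    healthy = deque(machines)
    pending = deque(jobs)
    placement = {}
    while pending:
        if not healthy:
            raise RuntimeError("no healthy machines left")
        job = pending.popleft()
        machine = healthy.popleft()
        if try_run(machine, job):
            placement[job] = machine
            healthy.append(machine)  # back of the rotation
        else:
            # Feedback step: retry the job on the next machine and
            # leave the faulty machine out of the rotation.
            pending.appendleft(job)
    return placement


# Simulated setup: m1 stands in for an unreachable machine, m3 for one that hangs.
faulty = {"m1", "m3"}
result = round_robin_with_feedback(
    ["j0", "j1", "j2", "j3"],
    ["m0", "m1", "m2", "m3"],
    lambda machine, job: machine not in faulty,
)
print(result)
```

In this simulation every job ends up on `m0` or `m2`, and each faulty machine is tried only once before being removed from the rotation.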

13
Evaluation
14
Conclusion
  • Designed a dynamic distributed scheduling
    system from top to bottom
  • Implemented fault tolerance through
    monitoring and feedback
  • Stored the type and history of every job that
    has executed once, to make the analyzer's job
    easier
  • Obtained some interesting initial results from
    small experiments
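
One way stored job history could help an analyzer is sketched below. The `JobHistory` class, its method names, and the longest-job-first ordering are assumptions for illustration only; the slides do not say how the history is stored or used.

```python
from collections import defaultdict
from statistics import mean


class JobHistory:
    """Remembers observed run times per job type (hypothetical structure)."""

    def __init__(self):
        self._runtimes = defaultdict(list)

    def record(self, job_type, seconds):
        """Store one observed run time for a job type."""
        self._runtimes[job_type].append(seconds)

    def estimate(self, job_type, default=None):
        """Average of past run times, or `default` for an unseen job type."""
        seen = self._runtimes.get(job_type)
        return mean(seen) if seen else default


history = JobHistory()
history.record("regression-test", 2.0)
history.record("regression-test", 4.0)
history.record("simulation", 300.0)

# Assumed policy: start the job types with the longest expected run time first.
order = sorted(
    ["regression-test", "simulation"],
    key=lambda job_type: history.estimate(job_type, 0.0),
    reverse=True,
)
print(order)
```

A job type only needs to run once before the analyzer has an estimate for it, which matches the idea of recording each job after its first execution.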

15
What's Next?
  • We didn't get a login ID for either PlanetLab
    or Grid3
  • Experiment on a real grid
  • Implement more algorithms
  • Run on real benchmarks

16
  • Thank You