1
Trust-Sensitive Scheduling on the Open Grid
  • Jon B. Weissman, with help from Jason Sonnek
    and Abhishek Chandra
  • Department of Computer Science
  • University of Minnesota
  • Trends in HPDC Workshop
  • Amsterdam 2006

2
Background
  • Public donation-based infrastructures are
    attractive
  • positives: cheap, scalable, fault tolerant
    (UW-Condor, @home, ...)
  • negatives: hostile - uncertain resource
    availability/connectivity, node behavior,
    end-user demand => best-effort service

3
Background
  • Such infrastructures have been used for
    throughput-based applications
  • just make progress, all tasks equal
  • Service applications are more challenging
  • all tasks not equal
  • explicit boundaries between user requests
  • may even have SLAs, QoS, etc.

4
Service Model
  • Distributed Service
  • request -> set of independent tasks
  • each task mapped to a donated node
  • makespan
  • E.g. BLAST service
  • user request (input sequence) + chunk of DB
    form a task

5
BOINC BLAST
workunit = input_sequence + chunk of DB, generated
when a request arrives
6
The Challenge
  • Nodes are unreliable
  • timeliness: heterogeneity, bottlenecks, ...
  • cheating: hacked, malicious (> 1% of SETI
    nodes), misconfigured
  • failure
  • churn
  • For a service, this matters

7
Some data - timeliness
[Figures: computation heterogeneity - both across and
within nodes (PlanetLab as a lower bound); communication
heterogeneity - both across and within nodes]
8
The Problem for Today
  • Deal with node misbehavior
  • Result verification
  • application-specific verifiers: not general
  • redundancy + voting
  • Most approaches assume ad-hoc replication
  • under-replicate: task re-execution (↑ latency)
  • over-replicate: wasted resources (↓ throughput)
  • Using information about the past behavior of a
    node, we can intelligently size the amount of
    redundancy

9
System Model
10
Problems with ad-hoc replication
[Figure: task x is sent to group A, which contains an
unreliable node; task y is sent to group B, whose nodes
are all reliable]
11
Smart Replication
  • Reputation
  • ratings based on past interactions with clients
  • simple sample-based probability (r_i) over
    window t
  • extend to worker group (assuming no collusion)
    => likelihood of correctness (LOC)
  • Smarter Redundancy
  • variable-sized worker groups
  • intuition: higher-reliability clients => smaller
    groups
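The sample-based rating can be sketched as a sliding-window estimate of each client's reliability; the class and method names, the default window size, and the 0.5 prior for unseen clients are illustrative assumptions, not details from the deck:

```python
from collections import deque

class Reputation:
    """Sample-based reliability rating r_i computed over a sliding
    window of the last t interactions with a client (the deck's
    window parameter). Names and defaults here are illustrative."""

    def __init__(self, window=20):
        # Only the most recent `window` verdicts are kept.
        self.verdicts = deque(maxlen=window)

    def record(self, correct):
        # correct: did this client's result pass verification/voting?
        self.verdicts.append(bool(correct))

    def rating(self):
        # r_i: fraction of recent tasks answered correctly.
        # Before any observations, default to 0.5 (unknown client).
        if not self.verdicts:
            return 0.5
        return sum(self.verdicts) / len(self.verdicts)

rep = Reputation(window=20)
for verdict in [True, True, False, True]:
    rep.record(verdict)
print(rep.rating())  # 0.75
```

Bounding the window (rather than averaging over all history) is what lets the rating track the non-stationary behavior discussed later in the deck.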

12
Terms
  • LOC (Likelihood of Correctness), l_g
  • computes the actual probability of getting a
    correct answer from a group of clients (group g)
  • Target LOC (ltarget)
  • the task success-rate that the system tries to
    ensure while forming client groups
  • related to the statistics of the underlying
    distribution
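Under the no-collusion assumption, one way to compute l_g for a voting group is to sum the probabilities of every outcome in which a strict majority answers correctly. This is a simplified sketch of the LOC idea; the paper's exact estimator may differ:

```python
import math
from itertools import product

def loc(ratings):
    """l_g: probability that a strict majority of a voting group
    returns the correct answer, given per-client reliabilities r_i
    and assuming independent (non-colluding) clients."""
    n = len(ratings)
    total = 0.0
    for outcome in product([True, False], repeat=n):
        if 2 * sum(outcome) > n:  # strict majority correct
            total += math.prod(r if ok else 1 - r
                               for ok, r in zip(outcome, ratings))
    return total

# Three clients at 90% reliability:
# l_g = 3 * (0.9^2)(0.1) + 0.9^3 = 0.972
print(round(loc([0.9, 0.9, 0.9]), 3))  # 0.972
```

The enumeration is exponential in group size, which is fine for the small (size ~3+) groups the deck considers.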

13
Trust Sensitive Scheduling
  • Guiding metrics
  • throughput (r): the number of successfully
    completed tasks in an interval
  • success rate (s): ratio of throughput to the
    number of tasks attempted

14
Scheduling Algorithms
  • First-Fit
  • attempt to form the first group that satisfies
    ltarget
  • Best-Fit
  • attempt to form a group that best satisfies
    ltarget
  • Random-Fit
  • attempt to form a random group that satisfies
    ltarget
  • Fixed-size
  • randomly form fixed-size groups, ignoring
    client ratings
  • Random-Fit and Fixed-size are our baselines
  • min group size: 3
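The First-Fit strategy might be sketched as follows; the greedy scan order, the helper names, and the majority-vote LOC model are assumptions for illustration, not the paper's implementation:

```python
import math
from itertools import product

def loc(ratings):
    # Majority-vote likelihood of correctness for a group,
    # assuming independent (non-colluding) clients.
    n = len(ratings)
    return sum(math.prod(r if ok else 1 - r
                         for ok, r in zip(outcome, ratings))
               for outcome in product([True, False], repeat=n)
               if 2 * sum(outcome) > n)

def first_fit_groups(clients, l_target, min_size=3):
    """First-Fit: scan clients in their given order, growing a group
    until it satisfies l_target (and the minimum group size of 3);
    then start the next group. Clients that cannot complete a
    group are returned ungrouped."""
    groups, current = [], []
    for client_id, rating in clients:
        current.append((client_id, rating))
        if (len(current) >= min_size
                and loc([r for _, r in current]) >= l_target):
            groups.append(current)
            current = []
    return groups, current

clients = [("a", 0.9), ("b", 0.8), ("c", 0.95),
           ("d", 0.6), ("e", 0.7), ("f", 0.8)]
groups, leftover = first_fit_groups(clients, l_target=0.7)
print([[cid for cid, _ in g] for g in groups])  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Best-Fit would differ only in searching for the group whose LOC is closest to ltarget rather than accepting the first one that clears it; Random-Fit would draw candidate members in random order.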

15
Scheduling Algorithms
16
Scheduling Algorithms (contd)
17
Different Groupings
ltarget = 0.5
18
Evaluation
  • Simulated a wide-variety of node reliability
    distributions
  • Set ltarget to the success rate of Fixed-size
  • goal: match the success rate of Fixed-size (which
    over-replicates) yet achieve higher throughput
  • if desired, can drive throughput even higher (but
    success rate would suffer)

19
Comparison
gain of 25-250%; open question: how much better
could we have done?
20
Non-stationarity
  • Nodes may suddenly shift gears
  • deliberately malicious, virus, detach/rejoin
  • underlying reliability distribution changes
  • Solution
  • window-based rating (reduce window t to 20 from
    infinite)
  • Experiment: blackout at round 300 (30% of nodes
    affected)

21
Role of ltarget
  • Key parameter
  • Too large
  • groups will be too large (low throughput)
  • Too small
  • groups will be too small (low success rate)
  • Adaptively learn it (parameterless)
  • maximizing both r and s (goodput)
  • or could bias toward r or s

22
Adaptive algorithm
  • Multi-objective optimization
  • choose target LOC to simultaneously maximize
    throughput r and success rate s
  • objective: a1·r + a2·s
  • use weighted combination to reduce multiple
    objectives to a single objective
  • employ hill-climbing and feedback techniques to
    control dynamic parameter adjustment
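The hill-climbing feedback loop could look roughly like this; the step size, the bounds on ltarget, and the controller structure are illustrative assumptions:

```python
def weighted_objective(r, s, a1=1.0, a2=1.0):
    # Weighted combination reducing the two objectives
    # (throughput r, success rate s) to a single scalar.
    return a1 * r + a2 * s

def adapt_l_target(l_target, objective, prev_objective,
                   direction=1, step=0.05, lo=0.5, hi=0.99):
    """One hill-climbing step on the target LOC: keep moving
    l_target in the current direction while the weighted objective
    improves; reverse direction when it drops. Step size and
    bounds here are assumptions, not the paper's values."""
    if objective < prev_objective:
        direction = -direction  # last move hurt the objective: back off
    l_target = min(hi, max(lo, l_target + direction * step))
    return l_target, direction

# Example round: raising ltarget lowered the objective, so reverse.
l, d = adapt_l_target(0.85,
                      weighted_objective(30, 0.80),
                      weighted_objective(35, 0.82),
                      direction=1)
print(round(l, 2), d)  # 0.8 -1
```

Setting a1 = 1, a2 = 0 recovers the pure-throughput curve shown on the next slide; other weightings bias the controller toward success rate.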

23
Adapting ltarget
  • Blackout example

24
Throughput (a1 = 1, a2 = 0)
25
Current/Future Work
  • Implementation of reputation-based scheduling
    framework (BOINC and PL)
  • Mechanisms to retain node identities (hence r_i)
    under node churn
  • node signatures that capture the
    characteristics of the node

26
Current/Future Work (contd)
  • Timeliness
  • extending reliability to encompass time
  • a node whose performance is highly variable is
    less reliable
  • Client collusion
  • detection: group signatures
  • prevention
  • combine quiz-based tasks with reputation systems
  • form random-groupings

27
  • Thank you.