Title: Robust Network Supercomputing with Malicious Processes (Reliably Executing Tasks Upon Estimating the Number of Malicious Processes)
1Robust Network Supercomputing with Malicious Processes (Reliably Executing Tasks Upon Estimating the Number of Malicious Processes)
- Kishori M. Konwar
- Sanguthevar Rajasekaran
- Alexander A. Shvartsman
- Computer Science and Engineering Department
- University of Connecticut
- Storrs, CT
2Motivation
- Internet supercomputing is becoming an increasingly powerful tool for harnessing massive amounts of computational resources
  - high-bandwidth Internet connections are widely available
  - there is an enormous number of processes around the world
  - it comes at a cost substantially lower than acquiring a supercomputer or building a cluster of powerful machines
4TASKS
6PrimeNet Server
- PrimeNet Server is a distributed, massively parallel scientific computing Internet supercomputer
- Supported by Entropia.com; ranks among the most powerful computers in the world
- A project comprising about 30,000 PCs and laptops
- Currently sustains 22,296 billion floating-point operations per second (gigaflops), i.e., operations that involve fractional numbers
7SETI@home
- The SETI@home project is a massive distributed cooperative computer
- Used for analysis of gigabytes of data for the Search for Extraterrestrial Intelligence (SETI)
- Comprises millions of voluntary machines around the world
- The SETI@home project reported its speed to be more than 57,290 billion floating-point operations per second
8Reliability Issues
- The master and perhaps certain workers are reliable
  - they will correctly execute the tasks assigned by the server
- However, workers are commonly unreliable
  - they may return incorrect results to the master due to unintended failures caused, e.g., by over-clocked processors
  - they may deceivingly claim to have performed assigned work so as to obtain an incentive, such as a higher rank
10Some Previous Studies
- [FGLS05] assumed that worker processes might act maliciously and hence deliberately return wrong results
  - the goal is to design algorithms that enable the master to accept correct results with high probability at a lower cost
  - they provided a randomized algorithm
  - unfortunately, the cost complexity results depend on several parameters and are hard to interpret
11Some Previous Studies (contd)
- [GM05] considered the problem of maximizing the expected number of correct results
  - the tasks are dependent
  - any worker computes correctly with probability $p < 1$; any incorrectly computed task corrupts all dependent tasks
  - the goal is to compute a schedule that maximizes the expected number of correct results under a given time constraint
  - they showed the optimization problem to be NP-hard
  - they provided some solutions on a restricted DAG
12Overview
- Models of Computation
- Stopping Rule Algorithm based solution
- Detection of Faulty Processors
- Performing Tasks with Faulty Workers
- Conclusions
14Models of Computation
- Processes take steps in lockstep, i.e., in synchrony
- Processes communicate by exchanging messages
- The tasks are independent and idempotent
- Processes are subject to failures and can return incorrect results maliciously
- Workers $P = \{1, 2, \ldots, n\}$ and a master $M$
15Work Complexities
- [CDS01] define work complexity as the available processor steps
  - All steps taken by processes during the execution of the algorithm are counted, including the steps of idling and waiting non-faulty processes
- [DHW92] define work as the number of performed tasks, counting multiplicities
  - This approach does not charge for idling and waiting; it is called task-oriented work
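The difference between the two measures can be illustrated with a small accounting sketch. It assumes a synchronous execution in which no process crashes, so every process is charged one step per round under the first measure; the function names and the example schedule are hypothetical.

```python
def available_processor_steps(num_processes, rounds):
    """Charge every process one step per round, whether busy or idle."""
    return num_processes * len(rounds)


def task_oriented_work(rounds):
    """Charge only for performed tasks, counting multiplicities."""
    return sum(rounds)


# Hypothetical schedule for 8 processes: the number of task executions
# performed in each of three synchronous rounds.
rounds = [8, 3, 1]
```

For this schedule the first measure charges 8 * 3 = 24 steps, while the task-oriented measure charges only the 12 task executions.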
16Few Comments
- work ?
- We say that an event $E$ occurs with high probability (w.h.p.) to mean that $\Pr[E] = 1 - O(n^{-\alpha})$ for some constant $\alpha > 0$.
17Modeling Failures
- Failure model $F_a$
  - An $f$-fraction, $0 < f < 1/2$, of the $n$ workers may fail
  - Each possibly faulty worker independently exhibits faulty behavior with probability $p$, $0 < p < 1/2$
  - The master has no a priori knowledge of $f$ and $p$
18Modeling Failures (contd)
- Failure model $F_b$
  - There is a fixed bound on the $f$-fraction, $0 < f < 1/2$, of the $n$ workers that can be faulty
  - Any worker from the remaining $(1-f)$-fraction of the workers fails with probability $p$, $0 < p < 1/2$, independently of other workers
  - The master knows the values of $f$ and $p$
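The two failure models can be simulated roughly as follows. This is a sketch under stated assumptions: results are modeled as booleans, a wrong answer is the negation of the correct one, and a faulty worker under $F_b$ is assumed to always answer wrongly (the slides allow arbitrary behavior). All function names are hypothetical.

```python
import random


def sample_faulty_set(n, f, rng):
    """Pick the floor(f * n) workers that are allowed to misbehave."""
    return set(rng.sample(range(n), int(f * n)))


def worker_result_fa(worker, faulty, p, correct, rng):
    """Model F_a: only workers in `faulty` may err, each time with prob. p."""
    if worker in faulty and rng.random() < p:
        return not correct
    return correct


def worker_result_fb(worker, faulty, p, correct, rng):
    """Model F_b: workers in `faulty` are assumed always wrong here; every
    other worker errs independently with probability p."""
    if worker in faulty:
        return not correct
    if rng.random() < p:
        return not correct
    return correct
```

Passing an explicit `random.Random` instance as `rng` keeps a simulation reproducible.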
19Algorithmic Template
- procedure for master process $M$, task $T$
  - Choose a set $S \subseteq P$
  - Send task $T$ to each processor $p \in S$
  - Wait for the results from the processes in $S$
  - Decide on the result value $v$ from the responses
- procedure for worker $w \in P$
  - Wait to receive a task from master $M$
  - Upon receiving a task from $M$
    - Execute the task
    - Send the result to $M$
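The template can be sketched in a few lines. This is only an illustration: message passing is replaced by direct function calls, and the `choose` and `decide` parameters, as well as the `majority` rule, are hypothetical placeholders for the choices a concrete algorithm would make.

```python
from collections import Counter


def worker(task):
    """Worker procedure: execute the received task and return the result."""
    return task()


def master(task, workers, choose, decide):
    """Master procedure for a single task T."""
    selected = choose(workers)                 # choose a set S, a subset of P
    responses = [w(task) for w in selected]    # send T and wait for results
    return decide(responses)                   # decide on the result value v


def majority(responses):
    """One possible decision rule: the most frequent response."""
    return Counter(responses).most_common(1)[0][0]
```

For example, `master(lambda: 6 * 7, [worker] * 5, lambda ws: ws[:3], majority)` sends the task to three reliable workers and returns 42.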
21($\epsilon$, $\delta$)-approximation algorithm
- $Z$ is a random variable distributed in the interval $[0,1]$ with mean $\mu_Z$
- $Z_1, Z_2, Z_3, \ldots$ are independently and identically distributed according to the random variable $Z$
- An $(\epsilon, \delta)$-approximation algorithm, with $0 < \epsilon < 1$, $\delta > 0$, for estimating $\mu_Z$ satisfies
  - $\Pr[\mu_Z(1-\epsilon) \le \tilde{\mu}_Z \le \mu_Z(1+\epsilon)] > 1 - \delta$
  - where $\tilde{\mu}_Z$ is the estimated value of $\mu_Z$
22Stopping Rule Algorithm
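- Dagum, Karp, Luby, and Ross 1995
- Input: parameters $(\epsilon, \delta)$ with $0 < \epsilon < 1$, $\delta > 0$
- Let $\Upsilon_1 = 1 + (1+\epsilon)\Upsilon$ // $\lambda \approx 0.72$, $\Upsilon = 4\lambda \log(2/\delta)/\epsilon^2$
- Initialize $N \leftarrow 0$, $S \leftarrow 0$
- While $S < \Upsilon_1$ do: $N \leftarrow N+1$, $S \leftarrow S + Z_N$
- Output $\tilde{\mu}_Z \leftarrow \Upsilon_1/N$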
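The steps of the Stopping Rule Algorithm translate almost line by line into code. The sketch below assumes that the logarithm is the natural logarithm and that samples are supplied by a zero-argument callable `sample` (a hypothetical interface); it does not terminate if every sample is 0, which matches the theorem's requirement that the mean be positive.

```python
import math

LAMBDA = 0.72  # constant lambda as given on the slide


def stopping_rule(sample, eps, delta):
    """Return (estimate, N) for the mean of a [0,1]-valued random variable.

    `sample` is a zero-argument callable returning one fresh draw Z_N.
    """
    if not (0 < eps < 1 and delta > 0):
        raise ValueError("need 0 < eps < 1 and delta > 0")
    upsilon = 4 * LAMBDA * math.log(2 / delta) / eps ** 2
    upsilon1 = 1 + (1 + eps) * upsilon
    n, s = 0, 0.0
    while s < upsilon1:          # keep sampling until the sum reaches Upsilon_1
        n += 1
        s += sample()
    return upsilon1 / n, n       # estimate of the mean, number of samples used
```

A typical call would be `stopping_rule(lambda: float(random.random() < 0.3), 0.1, 0.05)` after `import random`, which estimates a Bernoulli mean of 0.3.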
23Stopping Rule Theorem
- Theorem (Stopping Rule Theorem) [Dagum, Karp, Luby, and Ross]
  - Let $Z$ be a random variable in $[0,1]$ with $\mu_Z = E[Z] > 0$. Let $\tilde{\mu}_Z$ be the estimate produced and let $N_Z$ be the number of experiments that SRA runs with respect to $Z$ on input $\epsilon$ and $\delta$. Then,
  - (i) $\Pr[\mu_Z(1-\epsilon) \le \tilde{\mu}_Z \le \mu_Z(1+\epsilon)] > 1 - \delta$,
  - (ii) $E[N_Z] \le \Upsilon_1/\mu_Z$, and
  - (iii) $\Pr[N_Z > (1+\epsilon)\Upsilon_1/\mu_Z] \le \delta/2$
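To get a feel for the bounds, here is a worked instance with illustrative values $\epsilon = 0.1$, $\delta = 0.05$, and $\mu_Z = 0.25$. These values are not from the slides, and the logarithm is assumed to be the natural logarithm.

```latex
\begin{align*}
\Upsilon   &= \frac{4\lambda \ln(2/\delta)}{\epsilon^2}
            = \frac{4(0.72)\ln 40}{0.01} \approx 1062.4, \\
\Upsilon_1 &= 1 + (1+\epsilon)\Upsilon \approx 1 + 1.1(1062.4) \approx 1169.6, \\
E[N_Z]     &\le \frac{\Upsilon_1}{\mu_Z} \approx 4.68 \times 10^{3}, \\
\Pr\!\left[N_Z > \frac{(1+\epsilon)\Upsilon_1}{\mu_Z} \approx 5.15 \times 10^{3}\right]
           &\le \frac{\delta}{2} = 0.025 .
\end{align*}
```

Under these assumptions, a few thousand samples suffice for a 10% relative error with 95% confidence, and the sample count grows as the mean $\mu_Z$ shrinks.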
24Algorithm Af,p to estimate f and p
25Work Complexity of Af,p
- Theorem: Algorithm $A_{f,p}$ is an $(\epsilon, \delta)$-approximation algorithm, $0 < \epsilon < 1$, $\delta > 0$, for the estimation of $f$ and $p$ with work complexity $O(\log^2 n)$, complexity $O(n \log n)$, message complexity $O(\log^2 n)$, and time complexity $O(\log n)$, with high probability.
27Detection of Faulty Processors
- Lemma: It is not possible to perform all the $n$ tasks correctly in the failure model $F_a$ with linear complexity (i.e., $O(n)$) with high probability.
28Detection of Faulty Processors
- procedure for master process $M$
  - Initially, $F \leftarrow \emptyset$
  - For $t = 0, \ldots, k \log n$, $k > 0$
    - Choose a set $S \subseteq P \setminus F$
    - Send each process $p \in S$ a test task
    - Wait for the results from the processes in $S$
    - If the response is faulty
      - $F \leftarrow F \cup \{p\}$ // $p$ is a faulty process
    - End If
  - End For
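The detection loop can be simulated as below. The sketch makes two assumptions that are not stated on the slide: the set $S$ is taken to be all of $P \setminus F$ in every round, and the logarithm is base 2. It illustrates the loop structure only and does not reproduce the work bound; the function name and parameters are hypothetical.

```python
import math
import random


def detect_faulty(n, faulty, p, k, rng):
    """Simulate the master's detection loop; return (F, tests_sent).

    `faulty` is the hidden set of possibly faulty workers; each of them
    answers a test task incorrectly with probability p (model F_a).
    """
    detected = set()                                  # F starts empty
    tests_sent = 0
    rounds = math.ceil(k * math.log2(n))
    for _ in range(rounds + 1):                       # t = 0, ..., k log n
        # Assumption: S is the whole of P \ F in every round.
        candidates = [w for w in range(n) if w not in detected]
        tests_sent += len(candidates)
        for w in candidates:
            answered_wrong = w in faulty and rng.random() < p
            if answered_wrong:                        # master knows the answer
                detected.add(w)                       # add w to F
    return detected, tests_sent
```

Since a reliable worker never fails a test in this model, the returned set contains no false positives; a faulty worker escapes detection only if it answers every test correctly.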
29Detection of Faulty Processors
- Lemma: The algorithm detects all faulty processes among the $n$ workers in $O(\log n)$ time with $O(n)$ work, with high probability
- Theorem [Karp 04]: Suppose that $a(x)$ is a non-decreasing, continuous function that is strictly increasing on $\{x \mid a(x) > 0\}$, and $m(x)$ is a continuous function. Then for every positive real $x$ and every positive integer $t$,
  - $\Pr[T(x) > u(x) + t\,a(x)] \le (m(x)/x)^t$
  - where $u(x)$ is the solution to the equation $u(x) = a(x) + u(m(x))$
  - with $m^{[0]}(x) = x$ and $m^{[i+1]}(x) = m(m^{[i]}(x))$.
31Performing Tasks under Fa
- procedure for master process $M$
  - Initially, $C \leftarrow \emptyset$, $J \leftarrow$ set of $n$ tasks
  - Randomly choose a set, possibly with repetition, $S \subseteq P$, $|S| = kn/\log n$ workers, where $k > 0$ is a constant
  - For $i = 1, \ldots, k' \log n$, $k' > 0$
    - Send to each worker $p \in S$ a test task
    - Collect the responses from all the workers
  - End For
  - If all the responses from a worker $p \in S$ are correct then
    - $C \leftarrow C \cup \{p\}$
  - End If
  - For $i = 1, \ldots, n/|C|$
    - Send $|C|$ jobs from $J$, not sent in previous iterations, one to each worker in $C$
    - Collect the responses from the $|C|$ workers
  - End For
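A rough simulation of this procedure is given below. Assumptions made for the sketch: logarithms are base 2, a wrong result is represented by `None`, the test task has the known answer 1, and the function name and parameters are hypothetical. A faulty worker that happens to pass every test stays in $C$ and may still corrupt results, exactly as in the procedure above.

```python
import math
import random


def perform_tasks_fa(tasks, n_workers, faulty, p, k, k_prime, rng):
    """Sketch of the master's procedure under F_a.

    Returns (results, trusted), where results[j] is the answer accepted for
    tasks[j]. A faulty worker returns None with probability p.
    """
    def ask(w, task):
        if w in faulty and rng.random() < p:
            return None                      # stand-in for a wrong result
        return task()

    def test_task():
        return 1                             # answer known to the master

    log_n = max(1, math.ceil(math.log2(n_workers)))
    size = max(1, math.ceil(k * n_workers / log_n))
    sample = [rng.randrange(n_workers) for _ in range(size)]  # with repetition

    n_tests = math.ceil(k_prime * log_n)
    trusted = [w for w in dict.fromkeys(sample)      # drop repeated workers
               if all(ask(w, test_task) == 1 for _ in range(n_tests))]
    if not trusted:
        raise RuntimeError("no worker passed all tests")

    results = []
    for start in range(0, len(tasks), len(trusted)):  # about n/|C| iterations
        batch = tasks[start:start + len(trusted)]
        results.extend(ask(w, t) for w, t in zip(trusted, batch))
    return results, trusted
```

With an empty `faulty` set every sampled worker is trusted and all tasks are computed correctly; the interesting regime is a non-empty `faulty` set with `p` below 1/2.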
32Work and Time Complexities
- Theorem: The algorithm performs all $n$ tasks correctly in $O(\log n)$ time and has $O(n)$ work and complexities, with high probability.
34Performing Tasks under Fb
- procedure for master process $M$
  - For $t = 0, \ldots, k \log n$, $k > 0$
    - Choose a random permutation $\pi \in_R S_n$
    - Foreach $j \in [n]$
      - Send task $j$ to processor $\pi(j)$
    - End For
    - Collect the responses from all the workers
  - End For
  - Foreach $j \in [n]$
    - Choose the majority of the results of computation for task $j$ as the result
  - End For
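The replicate-and-vote procedure can be sketched as follows. Assumptions: the number of workers equals the number of tasks $n$ (at least 1), logarithms are base 2, and `compute(worker, task)` is a hypothetical stand-in for sending a task to a worker and receiving its possibly wrong answer.

```python
import math
import random
from collections import Counter


def perform_tasks_fb(tasks, compute, k, rng):
    """Replicate each task over about k log n rounds and take the majority."""
    n = len(tasks)
    rounds = math.ceil(k * math.log2(n)) + 1         # t = 0, ..., k log n
    votes = [Counter() for _ in range(n)]
    for _ in range(rounds):
        perm = list(range(n))
        rng.shuffle(perm)                            # random permutation pi
        for j, task in enumerate(tasks):
            votes[j][compute(perm[j], task)] += 1    # task j goes to pi(j)
    # For every task, accept the most frequent answer.
    return [v.most_common(1)[0][0] for v in votes]
```

Because a fresh permutation is drawn in every round, a fixed faulty worker sees a given task only occasionally, which is what lets the majority vote recover the correct value.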
35Work and Time Complexities
- Theorem: The algorithm performs all $n$ tasks correctly in $O(\log n)$ time and has work complexities $O(n \log n)$, for $0 < p, f < 1/2$ and $(1-f)(1-p) > 1/2$, with high probability
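One way to see why the condition $(1-f)(1-p) > 1/2$ matters is the following sketch. It is not taken from the slides: it assumes natural logarithms, $m \ge k \ln n$ rounds, independence of the per-round outcomes for a fixed task, and writes $(1-f)(1-p) = 1/2 + \gamma$ with $\gamma > 0$. Let $X_t$ indicate that the copy of a fixed task computed in round $t$ is correct.

```latex
\begin{align*}
\Pr[X_t = 1] &\ge (1-f)(1-p) = \tfrac12 + \gamma
  && \text{(reliable worker chosen, then no failure)} \\
\Pr\Big[\textstyle\sum_{t=1}^{m} X_t \le \tfrac{m}{2}\Big]
  &\le e^{-2\gamma^2 m} \le n^{-2\gamma^2 k}
  && \text{(Hoeffding bound, } m \ge k \ln n\text{)} \\
\Pr[\text{some task has a wrong majority}]
  &\le n \cdot n^{-2\gamma^2 k} = n^{\,1 - 2\gamma^2 k}
  && \text{(union bound over } n \text{ tasks)}
\end{align*}
```

Under these assumptions, choosing $k \ge (1+\alpha)/(2\gamma^2)$ makes the failure probability $O(n^{-\alpha})$, which is the high-probability form used throughout the slides.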
37Conclusions
- Perform tasks under the above models where the tasks are dependent
  - The dependency graph can be a DAG
  - Quantify work and time complexities in terms of some characteristics of the DAG