Title: Decentralizing Grids
1. Decentralizing Grids
- Jon Weissman
- University of Minnesota
- E-Science Institute
- Nov. 8 2007
2. Roadmap
- Background
- The problem space
- Some early solutions
- Research frontier/opportunities
- Wrap-up
3. Background
- Grids are distributed but also centralized
- Condor, Globus, BOINC, Grid Services, VOs
- Why? client-server based
- Centralization pros
- Security, policy, global resource management
- Decentralization pros
- Reliability, dynamic, flexible, scalable
- Fertile CS research frontier
4. Challenges
- May have to live within the Grid ecosystem
- Condor, Globus, Grid services, VOs, etc.
- First-principles approaches are risky (e.g., Legion)
- 50,000-foot view
- How to decentralize Grids yet retain their existing features?
- High performance, workflows, performance prediction, etc.
5. Decentralized Grid platform
- Minimal assumptions about each node
- Nodes have associated assets (A)
- basic: CPU, memory, disk, etc.
- complex: application services
- exposed interface to assets: OS, Condor, BOINC, Web service
- Nodes may be up or down
- Node trust is not a given (asked to do X, does Y instead)
- Nodes may connect to other nodes or not
- Nodes may be aggregates
- Grid may be large (> 100K nodes); scalability is key (a minimal node model is sketched below)
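To make the node model concrete, here is a minimal Python sketch of a node that exposes assets and an exact-match interface; the names (Asset, Node, matches) are illustrative assumptions for this talk, not part of any Grid middleware.

```python
# Minimal sketch of the node/asset model described above (hypothetical names).
from dataclasses import dataclass, field

@dataclass
class Asset:
    kind: str          # e.g. "cpu", "memory", "disk", or an application service
    properties: dict   # capacity, version, interface type (OS, Condor, BOINC, Web service)

@dataclass
class Node:
    node_id: str
    assets: list = field(default_factory=list)     # assets this node exposes
    neighbors: list = field(default_factory=list)  # overlay links (may be empty)
    trusted: bool = False                          # trust is not a given
    alive: bool = True                             # nodes may be up or down

    def matches(self, query: dict) -> bool:
        """Exact-match check: does some asset satisfy every property in the query?"""
        return any(a.kind == query.get("kind") and
                   all(a.properties.get(k) == v
                       for k, v in query.get("properties", {}).items())
                   for a in self.assets)
```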
6. Grid Overlay
[Diagram: overlay spanning a Condor network, a Grid service, raw OS services, and a BOINC network]
7. Grid Overlay - Join
[Diagram: a node joins the overlay (same networks as above)]
8. Grid Overlay - Departure
[Diagram: a node departs the overlay (same networks as above)]
9. Routing Discovery
[Diagram: a 'discover A' query is routed through the overlay]
- Query contains sufficient information to locate a node: RSL, ClassAd, etc.
- Exact match or semantic match
10. Routing Discovery
[Diagram: bingo! - a node offering A is found]
11. Routing Discovery
- Discovered node returns a handle sufficient for the client to interact with it: perform service invocation, job/data transmission, etc. (a discovery sketch follows)
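As a rough illustration of this step, the sketch below floods a query with a TTL over an unstructured overlay (reusing the hypothetical Node sketch from slide 5) and returns a handle; the Handle type and the grid:// endpoint format are invented for the example, and a real system might use structured routing instead.

```python
# Sketch of decentralized discovery: the query is forwarded with a TTL;
# a matching, live node returns a handle the client can use to invoke it
# or ship jobs/data. Names (Handle, discover) are illustrative.
from collections import namedtuple

Handle = namedtuple("Handle", ["node_id", "endpoint"])

def discover(start_node, query, ttl=6, seen=None):
    """Return a Handle for a node whose assets satisfy `query`, or None."""
    seen = seen if seen is not None else set()
    if ttl < 0 or start_node.node_id in seen:
        return None
    seen.add(start_node.node_id)
    if start_node.alive and start_node.matches(query):
        return Handle(start_node.node_id, f"grid://{start_node.node_id}")
    for nbr in start_node.neighbors:          # forward to overlay neighbors
        found = discover(nbr, query, ttl - 1, seen)
        if found:
            return found
    return None
```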
12. Routing Discovery
- Three parties
- initiator of discovery events for A
- client: invocation, health of A
- node offering A
- Often the initiator and the client will be the same
- Other times the client will be determined dynamically
- if W is a web service and results are returned to a calling client CW, we want to locate CW near W => discover W, then CW!
13. Routing Discovery
[Diagram: the 'discover A' query routed through the overlay; one node marked X]
14. Routing Discovery
[Diagram]
15. Routing Discovery
[Diagram: bingo! - a matching node is found]
16. Routing Discovery
[Diagram]
17. Routing Discovery
[Diagram: outside client]
18. Routing Discovery
[Diagram: 'discover As' query]
19. Routing Discovery
[Diagram]
20. Grid Overlay
- This generalizes
- Resource query (query contains job requirements)
- Looks like decentralized matchmaking
- These are the easy cases
- independent simple queries
- find a CPU with characteristics x, y, z
- find 100 CPUs each with x, y, z
- suppose queries are complex or related?
- find N CPUs with aggregate power of G Gflops
- locate an asset near a previously discovered asset (a matchmaking sketch follows)
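The gap between the easy and hard cases can be shown with a toy sketch: a per-node predicate for the simple query versus a greedy accumulation for the aggregate-Gflops query. The dictionaries and the greedy strategy are assumptions of this sketch, not the talk's protocol.

```python
# Sketch of the two query classes above: a simple per-node predicate vs. an
# aggregate query that must be satisfied collectively across responses.

def simple_match(node, min_mem_gb=4, min_gflops=2.0):
    """Find a CPU with characteristics x, y, z: judged one node at a time."""
    return node["mem_gb"] >= min_mem_gb and node["gflops"] >= min_gflops

def aggregate_match(candidates, target_gflops=100.0):
    """Find N CPUs whose aggregate power reaches G Gflops: responses must be
    accumulated (greedily here) rather than judged independently."""
    chosen, total = [], 0.0
    for node in sorted(candidates, key=lambda n: n["gflops"], reverse=True):
        chosen.append(node)
        total += node["gflops"]
        if total >= target_gflops:
            return chosen
    return None  # the overlay could not satisfy the collective query
```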
21. Grid Scenarios
- Grid applications are more challenging
- Application has a more complex structure: multi-task, parallel/distributed, control/data dependencies
- individual job/task needs a resource near a data source
- workflow
- queries are not independent
- Metrics are collective
- not simply raw throughput
- makespan
- response
- QoS
22. Related Work
- Maryland/Purdue
- matchmaking
- Oregon CCOF
- time-zone CAN
23. Related Work (cont'd)
- None of these approaches address the Grid scenarios (in a decentralized manner)
- Complex multi-task data/control dependencies
- Collective metrics
24. 50,000-Ft Research Issues
- Overlay Architecture
- structured, unstructured, hybrid
- what is the right architecture?
- Decentralized control/data dependencies
- how to do it?
- Reliability
- how to achieve it?
- Collective metrics
- how to achieve them?
25. Context: Application Model
[Diagram: application components drawing on a data source and returning an answer]
26. Context: Application Models
Reliability · Collective metrics · Data dependence · Control dependence
27. Context: Environment
- RIDGE project - ridge.cs.umn.edu
- reliable infrastructure for donation grid environments
- Live deployment on PlanetLab (planet-lab.org)
- 700 nodes spanning 335 sites and 35 countries
- emulators and simulators
- Applications
- BLAST
- Traffic planning
- Image comparison
28. Application Models
Reliability · Collective metrics · Data dependence · Control dependence
29. Reliability Example
[Diagram: components B, C, D, E, G in the overlay]
30. Reliability Example
[Diagram: a guard component CG placed for G]
- CG is responsible for G's health (a heartbeat-style sketch follows)
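One plausible reading of "CG is responsible for G's health" is a heartbeat-style guard loop like the sketch below; ping(), rediscover(), and the timeout values are hypothetical, not the talk's mechanism.

```python
# Sketch of the guard idea: CG periodically probes G and triggers recovery
# (e.g. re-discovery of a replacement node) after several missed heartbeats.
import time

def guard(g_handle, ping, rediscover, interval=5.0, max_misses=3):
    misses = 0
    while True:
        if ping(g_handle):              # G answered the health probe
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                g_handle = rediscover() # locate a replacement for G
                misses = 0
        time.sleep(interval)
```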
31. Reliability Example
[Diagram: the reply carries 'G, loc(CG)']
32. Reliability Example
[Diagram: G and CG located in the overlay]
- could also discover G first, then CG
33. Reliability Example
[Diagram: a failure (X); CG remains]
34. Reliability Example
[Diagram]
35. Reliability Example
[Diagram]
36. Client Replication
[Diagram]
37. Client Replication
[Diagram: guard replicas CG1 and CG2]
- loc(G), loc(CG1), loc(CG2) propagated
38. Client Replication
[Diagram: one guard replica fails (X)]
- client hand-off depends on the nature of G and the interaction
39. Component Replication
[Diagram]
40. Component Replication
[Diagram: component replicas G1 and G2 with guard CG]
41. Replication Research
- Nodes are unreliable: crash, hacked, churn, malicious, slow, etc.
- How many replicas?
- too many: waste of resources
- too few: application suffers
42. System Model
- Reputation rating r_i: degree of node reliability
- Dynamically size the redundancy based on r_i
- Nodes are not connected and check in to a central server
- Note the variable-sized groups
[Diagram: nodes with ratings between 0.3 and 0.9 arranged into variable-sized replica groups]
43. Reputation-based Scheduling
- Reputation rating
- Techniques for estimating reliability based on past interactions
- Reputation-based scheduling algorithms
- Using reliabilities for allocating work
- Relies on a success-threshold parameter
44. Algorithm Space
- How many replicas?
- first-fit, best-fit, random, fixed, ...
- algorithms compute how many replicas are needed to meet a success threshold (sketched below)
- How to reach consensus?
- M-first (better for timeliness)
- Majority (better for Byzantine threats)
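A minimal sketch of replica-group sizing under a success threshold, assuming independent node failures and an M-first criterion with M = 1 (one correct result suffices); the greedy ordering by reputation is one possible "fit" strategy for illustration, not necessarily the RIDGE algorithm.

```python
# Sketch: keep adding workers until the estimated probability that at least one
# returns a correct result meets the success threshold (M-first, M = 1).
# Independence of node failures is an assumption of this sketch.

def size_group_greedy(workers, success_threshold=0.95):
    """workers: list of (node_id, reputation r_i); returns the chosen replica group."""
    group, p_all_fail = [], 1.0
    for node_id, r in sorted(workers, key=lambda w: w[1], reverse=True):
        group.append(node_id)
        p_all_fail *= (1.0 - r)                 # probability every replica misbehaves
        if 1.0 - p_all_fail >= success_threshold:
            return group
    return group                                # best effort if the threshold is unreachable
```

First-fit and best-fit variants would differ only in the order in which workers are considered; a majority-voting criterion would instead require enough replicas that a majority of correct results is likely.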
45. Experimental Results: correctness
- This was a simulation based on Byzantine behavior, with majority voting
46. Experimental Results: timeliness
- M-first (M1), best BOINC (BOINC), conservative (BOINC-) vs. RIDGE
47. Next Steps
- Nodes are decentralized, but trust management is not!
- Need a peer-based trust exchange framework
- Stanford EigenTrust project: local exchange until the network converges to a global state
48. Application Models
Reliability · Collective metrics · Data dependence · Control dependence
49. Collective Metrics
- Throughput is not always the best metric
- Response, completion time, application-centric metrics
- makespan
- response
50. Communication Makespan
- Nodes download data from replicated data nodes
- Nodes choose data servers independently (decentralized)
- Minimize the maximum download time over all worker nodes (communication makespan)
- data download dominates
51. Data Node Selection
- Several possible factors
- Proximity (RTT)
- Network bandwidth
- Server capacity
[Plots: Download Time vs. RTT - linear; Download Time vs. Bandwidth - exponential]
52. Heuristic Ranking Function
- Query to get candidates; RTT/bandwidth probes
- Node i, data server node j
- Cost function: cost(i,j) = RTT(i,j) × exp(k_j / BW(i,j)), where k_j reflects load/capacity
- Least-cost data node selected independently (a selection sketch follows)
- Three server-selection heuristics that use k_j:
- BW-ONLY: k_j = 1
- BW-LOAD: k_j = n-minute average load (past)
- BW-CAND: k_j = number of candidate responses in the last m seconds (~ future load)
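A sketch of the ranking function in code; only the cost formula and the three k_j variants come from the slide, while the rtt/bw/load dictionaries and function names are illustrative inputs for the example.

```python
# Sketch of least-cost data-server selection: worker i ranks candidate servers j
# by cost(i, j) = rtt[i][j] * exp(k_j / bw[i][j]) and picks the minimum
# independently. k_j follows the BW-ONLY / BW-LOAD / BW-CAND heuristics.
import math

def k_value(server, heuristic, load=None, recent_candidacies=None):
    if heuristic == "BW-ONLY":
        return 1.0
    if heuristic == "BW-LOAD":
        return load[server]                 # n-minute average load (past)
    if heuristic == "BW-CAND":
        return recent_candidacies[server]   # responses in the last m seconds (~future load)
    raise ValueError(heuristic)

def select_server(worker, candidates, rtt, bw, heuristic="BW-ONLY", **kw):
    def cost(server):
        return rtt[worker][server] * math.exp(k_value(server, heuristic, **kw) / bw[worker][server])
    return min(candidates, key=cost)
```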
53. Performance Comparison
54. Computational Makespan
55. Computational Makespan
[Graph: variable-sized vs. equal-sized groups]
56. Next Steps
- Other makespan scenarios
- Eliminate probes for bandwidth and RTT -> estimation
- Richer collective metrics
- deadlines, user-in-the-loop
57. Application Models
Reliability · Collective metrics · Data dependence · Control dependence
58. Application Models
Reliability · Collective metrics · Data dependence · Control dependence
59. Data Dependence
- A data-dependent component needs access to one or more data sources; data may be large
[Diagram: 'discover A' query]
60. Data Dependence (cont'd)
[Diagram: 'discover A' query]
- Where to run it?
61. The Problem
- Where to run a data-dependent component?
- determine the candidate set
- select a candidate
- Unlikely that a candidate knows its downstream bandwidth from particular data nodes
- Idea: infer bandwidth from neighbor observations with respect to the data nodes!
62. Estimation Technique
- Candidate C1 may have had little past interaction with the data source
- but its neighbors may have
- For each neighbor, generate a download estimate
- DT: the neighbor's prior download time from the data source
- RTT: from the candidate and the neighbor to the data source, respectively
- DP: average weighted measure of prior download times for any node to any data source
63. Estimation Technique (cont'd)
- Download Power (DP) characterizes the download capability of a node
- DP: average of prior download times normalized by RTT to the data source
- DT alone is not enough (far-away vs. nearby data source)
- Estimation associated with each neighbor n_i:
- ElapsedEst(n_i) = α × β × DT
- α = my_RTT / neighbor_RTT (to the data source)
- β = neighbor_DP / my_DP
- no active probes: historical data, RTT inference
- Combining neighbor estimates
- mean, median, min, ...
- median worked the best
- Take the min over all candidate estimates (sketched below)
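A sketch of the neighbor-based estimate following the formulas above (the α and β ratios, median combination, min over candidates); the dictionary-based DP and RTT bookkeeping is an assumption about how these averages would be maintained, not the talk's implementation.

```python
# Sketch: for candidate c and data source s, each neighbor n contributes
#   alpha * beta * DT(n, s),  with  alpha = RTT(c, s) / RTT(n, s)
#                             and   beta  = DP(n) / DP(c).
# Per-neighbor estimates are combined with the median; the candidate with the
# smallest estimate is selected.
from statistics import median

def estimate_download(candidate, source, neighbors, rtt, prior_dt, dp):
    ests = []
    for n in neighbors[candidate]:
        if (n, source) not in prior_dt:
            continue                                   # neighbor never fetched from this source
        alpha = rtt[(candidate, source)] / rtt[(n, source)]
        beta = dp[n] / dp[candidate]
        ests.append(alpha * beta * prior_dt[(n, source)])
    return median(ests) if ests else float("inf")

def pick_candidate(candidates, source, neighbors, rtt, prior_dt, dp):
    return min(candidates,
               key=lambda c: estimate_download(c, source, neighbors, rtt, prior_dt, dp))
```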
64. Comparison of Candidate Selection Heuristics
- SELF uses direct observations
65. Take Away
- Next steps
- routing to the best candidates
- Locality between a data source and component
- scalable, no probing needed
- many uses
66. Application Models
Reliability · Collective metrics · Data dependence · Control dependence
67. The Problem
- How to enable decentralized control?
- propagate downstream graph stages
- perform distributed synchronization
- Idea (see the sketch below):
- distributed dataflow token matching
- graph forwarding, futures (Mentat project)
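A minimal sketch of dataflow-style token matching: a matcher node buffers tokens destined for a downstream stage and fires the stage once every required input has arrived. The class and method names are invented for illustration; for the example graph that follows, B's token could be delivered as matcher.deliver('E', 'BCD', 'B', loc_SB).

```python
# Sketch of token matching for decentralized control: tokens are keyed by their
# destination stage; the stage fires when all required inputs are present.
from collections import defaultdict

class TokenMatcher:
    def __init__(self, fire):
        self.pending = defaultdict(dict)   # destination stage -> {input name: payload}
        self.fire = fire                   # callback that launches the stage

    def deliver(self, dest, required_inputs, input_name, payload):
        """A token like 'E, BCD' says: stage E needs inputs B, C, D."""
        bucket = self.pending[dest]
        bucket[input_name] = payload
        if all(name in bucket for name in required_inputs):
            self.fire(dest, {k: bucket[k] for k in required_inputs})
            del self.pending[dest]
```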
68. Control Example
[Diagram: components B, C, D, E; a control node performs token matching]
69. Simple Example
[Diagram: components B, C, D, E, G]
70. Control Example
[Diagram: tokens 'E, BCD', 'C, G', 'D, G' attached to components B, C, D, E, G]
71. Control Example
[Diagram: tokens 'E, BCD' forwarded from B, C, D]
72. Control Example
[Diagram: tokens 'E, BCD, loc(SB)', 'E, BCD, loc(SC)', 'E, BCD, loc(SD)']
- output stored at loc(): where the component is run, at the client, or at a storage node
73. Control Example
[Diagram]
74. Control Example
[Diagram]
75. Control Example
[Diagram]
76. Control Example
[Diagram]
77. Control Example
[Diagram]
78. Control Example
[Diagram]
- How to color and route tokens so that they arrive at the same control node?
79. Open Problems
- Support for global operations
- troubleshooting: what happened?
- monitoring: application progress?
- cleanup: application died, clean up state
- Load balance across different applications
- routing to guarantee dispersion
80. Summary
- Decentralizing Grids is a challenging problem
- Re-think systems, algorithms, protocols, and middleware => fertile research
- Keep our eye on the ball
- reliability, scalability, and maintaining performance
- Some preliminary progress on point solutions
81. My Visit
- Looking to apply some of these ideas to existing UK projects via collaboration
- Current and potential projects
- Decentralized dataflow (Adam Barker)
- Decentralized applications: Haplotype analysis (Andrea Christoforou, Mike Baker)
- Decentralized control: OpenKnowledge (Dave Robertson)
- Goal: improve reliability and scalability of applications and/or infrastructures
82. (No transcript)
83. (No transcript)
84. (No transcript)
85. Non-stationarity
- Nodes may suddenly shift gears
- deliberately malicious, virus, detach/rejoin
- underlying reliability distribution changes
- Solution (a windowed-rating sketch follows)
- window-based rating
- adapt/learn target
- Experiment: blackout at round 300 (30% affected)
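A sketch of window-based rating: only the last W interactions contribute to r_i, so a blackout or sudden behavior shift is reflected quickly. The window size and the prior value are illustrative choices, not parameters from the talk.

```python
# Sketch of a windowed reputation rating for non-stationary node behavior.
from collections import deque

class WindowedRating:
    def __init__(self, window=50, prior=0.5):
        self.history = deque(maxlen=window)   # 1 = correct/timely result, 0 = failure
        self.prior = prior

    def record(self, success: bool):
        self.history.append(1 if success else 0)

    def rating(self) -> float:
        if not self.history:
            return self.prior                 # no observations yet
        return sum(self.history) / len(self.history)
```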
86. Adapting
87. Adaptive Algorithm
[Graphs: success rate, throughput]
88. [Graphs: success rate, throughput]
89. Scheduling Algorithms
90. Estimation Accuracy
- Objects: 27 (0.5 MB - 2 MB)
- Nodes: 130 on PlanetLab
- Download 15,000 times from a randomly chosen node
- Download Elapsed Time Ratio (x-axis) is the ratio of estimated to real measured time
- 1 means perfect estimation
- Accept if the estimate is within a range of the measured time (measured ± error)
- Accept with error = 0.33: 67% of the total are accepted
- Accept with error = 0.50: 83% of the total are accepted
91. Impact of Churn
[Graph: Random mean vs. Global(Prox) mean]
92. Estimating RTT
- We use distance = √(RTT + 1)
- Simple RTT inference technique based on the triangle inequality
- Triangle inequality: Latency(a,c) <= Latency(a,b) + Latency(b,c)
- |Latency(a,b) - Latency(b,c)| <= Latency(a,c) <= Latency(a,b) + Latency(b,c)
- Pick the intersected area as the range, and take the mean
[Diagram: per-neighbor lower/upper bounds via Neighbors A, B, C; the final inference is taken from the intersected range]
93. RTT Inference Result
- More neighbors, greater accuracy
- With 5 neighbors, 85% of the total have < 16% error
94. Other Constraints
[Diagram: tokens 'E, BCD', 'C, A, dep-CD', 'D, A, dep-CD']
- C and D interact and should be co-allocated, nearby
- Tokens in bold should route to the same control point so that a collective query for C and D can be issued
95. Support for Global Operations
- Troubleshooting: what happened?
- Monitoring: application progress?
- Cleanup: application died, clean up state
- Solution mechanism: propagate control-node IPs back to the origin (=> origin IP piggybacked)
- Control nodes and matcher nodes report progress (or lack thereof, via timeouts) to the origin
- Load balance across different applications
96. Other Constraints
[Diagram: tokens 'E, BCD', 'C, A', 'D, A']
- C and D interact and should be co-allocated, nearby
97. Combining Neighbors' Estimation
- MEDIAN shows the best results: using 3 neighbors, 88% of the time the error is within 50% (variation in download times is a factor of 10-20)
- 3 neighbors gives the greatest bang
98. Effect of Candidate Size
99. Performance Comparison
- Parameters
- Data size: 2 MB
- Replication: 10
- Candidates: 5
100. Computation Makespan (cont'd)
- Now bring in reliability: makespan improvement scales well
[Graph: components]
101. Token Loss
- Between B and the matcher; between the matcher and the next stage
- the matcher must notify CB when the token arrives (pass loc(CB) with B's token)
- the destination (E) must notify CB when the token arrives (pass loc(CB) with B's token)
102. RTT Inference
- > 90-95% of Internet paths obey the triangle inequality
- RTT(a,c) < RTT(a,b) + RTT(b,c)
- RTT(server,c) < RTT(server,n_i) + RTT(n_i,c): upper bound
- lower bound: RTT(server,n_i) - RTT(n_i,c)
- iterate over all neighbors to get max L, min U
- return the mid-point (sketched below)
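A sketch of the inference above: each neighbor yields a [lower, upper] interval for RTT(server, client); the intervals are intersected (max of lowers, min of uppers) and the mid-point is returned. The dictionary-based RTT table is an assumption of the sketch.

```python
# Sketch of triangle-inequality RTT inference:
#   lower_i = |RTT(server, n_i) - RTT(n_i, c)|,  upper_i = RTT(server, n_i) + RTT(n_i, c)

def infer_rtt(server, client, neighbors, rtt):
    lowers, uppers = [], []
    for n in neighbors:
        a, b = rtt[(server, n)], rtt[(n, client)]
        lowers.append(abs(a - b))
        uppers.append(a + b)
    lo, hi = max(lowers), min(uppers)
    return (lo + hi) / 2.0        # mid-point of the intersected range
```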