CS 294-8: Distributed Load Balancing
http://www.cs.berkeley.edu/yelick/294
Load Balancing
- Problem: distribute items into buckets
- Data to memory locations
- Files to disks
- Tasks to processors
- Web pages to caches
- Goal: even distribution
- Slides stolen from Karger at MIT
  http://theory.lcs.mit.edu/karger
Load Balancing
- Enormous and diverse literature on load balancing
- Computer Science: systems
- operating systems
- parallel computing
- distributed computing
- Computer Science: theory
- Operations research (IEOR)
- Application domains
Agenda
- Overview 
- Load Balancing Data 
- Load Balancing Computation 
- (if there is time) 
The Web
[Figure: browsers (clients) sending requests to servers such as MIT, CNN, UCB, CMU, USC]
Hot Spots
[Figure: browsers (clients) converging on a single server (UCB) among IRAM, OceanStore, BANE, Telegraph]
Temporary Loads
- For permanent loads, use a bigger server
- Must also deal with flash crowds 
- IBM chess match 
- Florida election tally 
- Inefficient to design for max load 
- Rarely attained 
- Much capacity wasted 
- Better to offload peak load elsewhere
Proxy Caches Balance Load
[Figure: browsers (clients) spread their requests across proxy caches in front of the servers (MIT, CNN, UCB, CMU, USC, IRAM, OceanStore, BANE, Telegraph)]
Proxy Caching
- Old: server hit once for each browser
- New: server hit once for each page
- Adapts to changing access patterns
Proxy Caching
- Every server can also be a cache 
- Incentives 
- Provides a social good 
- Reduces load at sites you want to contact 
- Costs you little, if done right 
- Few accesses 
- Small amount of storage (times many servers)
Who Caches What?
- Each cache should hold few items 
- Otherwise swamped by clients 
- Each item should be in few caches 
- Otherwise server swamped by caches 
- And cache invalidates/updates expensive 
- Browser must know right cache 
- Could ask the server to redirect
- But server gets swamped by redirects
Hashing
- Simple and powerful load balancing
- Constant time to find bucket for item
- Example: map to n buckets. Pick a, b:
- y = ax + b (mod n) (sketched below)
- Intuition: hash maps each item to one random bucket
- No bucket gets many items
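A minimal sketch of this scheme (names and constants are illustrative; reducing mod a prime P before mod n is the standard universal-hashing detail):

```python
import random

P = 2_147_483_647            # an illustrative prime larger than any key below
a = random.randrange(1, P)   # pick a, b once, at random
b = random.randrange(0, P)

def bucket(key: int, n: int) -> int:
    """Hash an integer key into one of n buckets via y = ax + b (mod n)."""
    return ((a * key + b) % P) % n

counts = [0] * 8
for key in range(1000):
    counts[bucket(key, 8)] += 1
print(counts)                # roughly 125 items per bucket
```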
Problem: Adding Caches
- Suppose a new cache arrives
- How to work it into the hash function?
- Natural change:
- y = ax + b (mod n+1)
- Problem: changes the bucket for every item (see the demo below)
- Every cache will be flushed
- Server swamped with new requests
- Goal: when a bucket is added, few items move
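A quick demo of the problem, reusing bucket() from the sketch above: going from 4 to 5 buckets reassigns most items, since a key stays put only when its hash agrees mod 4 and mod 5 (about 20% of keys).

```python
moved = sum(bucket(key, 4) != bucket(key, 5) for key in range(10_000))
print(f"{moved / 10_000:.0%} of items moved")   # typically around 80%
```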
Problem: Inconsistent Views
- Each client knows about a different set of caches: its view
- View affects choice of cache for item
- With many views, each cache will be asked for the item
- Item in all caches => swamps server
- Goal: item in few caches despite views
Problem: Inconsistent Views
[Figures: four clients (my view, Joe's, Sue's, Mike's) see the same four caches numbered in different orders; ax + b (mod 4) = 2 picks "cache 2" in each view, which is a different physical cache each time, so the UCB page ends up in every cache]
Consistent Hashing
- A new kind of hash function
- Maps any item to a bucket in my view
- Computable in constant time, locally
- 1 standard hash function
- Adding a bucket to a view takes log time
- Logarithmic # of standard hash functions
- Handles incremental and inconsistent views
Single View Properties
- Balance: all buckets get roughly the same number of items
- Smooth: when the kth bucket is added, only a 1/k fraction of items move
- And only from O(log n) servers
- Minimum needed to preserve balance
Multiple View Properties
- Consider n views, each of an arbitrary constant fraction of the buckets
- Load: the number of items a bucket gets from all views is O(log n) times average
- Despite views, load balanced
- Spread: over all views, each item appears in O(log n) buckets
- Despite views, few caches for each item
Implementation
- Use a standard hash function H to map items and caches to the unit circle
- If H maps to 0..M, divide by M
- Map each item to the closest cache (going clockwise)
- A holds 1, 2, 3
- B holds 4, 5
[Figure: unit circle with cache points A, B and item points 1-5]
Implementation
- To add a new cache:
- Hash the cache id
- Move items that should be assigned to it
- Items do not move between A and B
- A holds 3
- B holds 4, 5
- C holds 1, 2
Implementation
- Cache points stored in a pre-computed binary tree
- Lookup for a cached item requires:
- Hash of the item key (e.g., URL)
- BST lookup of the successor (sketched below with a sorted array)
- Consistent hashing with n caches requires O(log n) time
- An alternative that breaks the unit circle into equal-length intervals can make this constant time
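A minimal sketch of the whole scheme (a sorted list plus binary search stands in for the BST; the cache names and hash choice are illustrative):

```python
import bisect
import hashlib

def h(key: str) -> float:
    """Map any string to a point on the unit circle via a standard hash."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    def __init__(self, caches=()):
        self.points = []            # sorted cache points on the circle
        self.cache_at = {}          # point -> cache id
        for c in caches:
            self.add_cache(c)

    def add_cache(self, cache):
        p = h(cache)
        bisect.insort(self.points, p)   # O(log n) search into the sorted list
        self.cache_at[p] = cache

    def remove_cache(self, cache):
        p = h(cache)
        self.points.remove(p)
        del self.cache_at[p]

    def lookup(self, item):
        """Successor cache point clockwise from the item, wrapping at 1.0."""
        i = bisect.bisect_right(self.points, h(item))
        return self.cache_at[self.points[i % len(self.points)]]

ring = ConsistentHashRing(["cacheA", "cacheB"])
url = "http://www.cs.berkeley.edu/"
print(ring.lookup(url))
ring.add_cache("cacheC")    # only items between C and its predecessor move
print(ring.lookup(url))
```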
Balance
- Cache points uniformly distributed by H
- Each cache owns an equal portion of the unit circle
- Item positions random by H
- So each cache gets about the same number of items
Smoothness
- To add the kth cache, hash it to the circle
- Captures items between it and the nearest cache
- 1/k fraction of total items
- Only from 1 other bucket
- O(log n) to find it, as with lookup
Low Spread
- Some views might not see the nearest cache to an item and hash it elsewhere
- But every view will have a bucket near the item (on the circle) by random placement
- So only buckets near the item will ever have to hold it
- Only a few buckets are near the item, by random placement
Low Load
- A cache only gets item I if no other cache is closer to I
- Under any view, some cache is close to I, by random placement of caches
- So a cache only gets items close to it
- But an item is unlikely to be close
- So a cache doesn't get many items
Fault Tolerance
- Suppose the contacted cache is down
- Delete it from the cache set of the view (BST) and find the next closest cache on the circle
- Just a small change in view
- Even with many failures, uniform load and other properties still hold
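Failure handling falls out of the same structure; reusing the ConsistentHashRing sketch from the Implementation slides:

```python
ring = ConsistentHashRing(["cacheA", "cacheB", "cacheC"])
url = "http://www.cs.berkeley.edu/"
down = ring.lookup(url)
ring.remove_cache(down)   # contacted cache is down: drop it from the view
print(ring.lookup(url))   # request falls through to the next cache clockwise
```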
Experimental Setup
- Cache Resolver System
- Cache machines for content
- Users' browsers that direct requests toward virtual caches
- Resolution units (DNS) that use consistent hashing to map virtual caches to physical machines
- Surge: web load generator from BU
- Two modes:
- Common mode (fixed cache for a set of clients)
- Cache resolver mode using consistent hashing
Performance
Summary of Consistent Hashing
- Trivial to implement 
- Fast to compute 
- Uniformly distributes items 
- Can cheaply add/remove cache 
- Even with multiple views 
- No cache gets too many items 
- Each item in only a few caches
Consistent Hashing for Caching
- Works well
- Client maps known caches to the unit circle
- When an item arrives, hash it to a cache
- Server gets O(log n) requests for its own pages
- Each server can also be a cache
- Gets a small number of requests for others' pages
- Robust to failures
- Caches can come and go
- Different browsers can know different caches
Refinement: BW Adaptation
- Browser bandwidth to machines may vary
- If bandwidth to the server is high, unwilling to use a lower bandwidth cache
- Consistently hash the item only to caches with bandwidth as good as the server
- Theorem: all previous properties still hold
- Uniform cache loads
- Low server loads (few caches per item)
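One plausible reading of this refinement as code (the bw table and threshold rule are assumptions, not the paper's exact mechanism), again reusing the ConsistentHashRing sketch:

```python
def lookup_with_bandwidth(ring, item, bw, server_bw):
    """Hash the item only onto caches whose measured bandwidth (bw[cache])
    is at least as good as the bandwidth to the origin server."""
    good = [c for c in ring.cache_at.values() if bw[c] >= server_bw]
    if not good:
        return None                     # no cache beats the server: go direct
    return ConsistentHashRing(good).lookup(item)   # filtered sub-ring
```

A real client would precompute the filtered sub-ring per server rather than rebuild it on every lookup.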
Refinement: Hot Pages
- What if one page gets popular?
- Cache responsible for it gets swamped
- Use a tree of caches?
- Cache at root gets swamped
- Use a different tree for each page
- Build using consistent hashing
- Balances load for hot pages and hot servers
Cache Tree Result
- Using cache trees of log depth, for any set of 
 page accesses, can adaptively balance load such
 that every server gets at most log times the
 average load of the system (browser/server ratio)
- Modulo some theory caveats 
Agenda
- Overview 
- Load Balancing Data 
- Load Balancing Computation 
- (if there is time) 
Load Balancing Spectrum
- Task costs
- Do all tasks have equal costs?
- If not, when are the costs known?
- Task dependencies
- Can all tasks be run in any order (including in parallel)?
- If not, when are the dependencies known?
- Locality
- Is it important for some tasks to be scheduled on the same processor (or nearby)?
- When is the information about communication known?
- Heterogeneity
- Are all the machines equally fast?
- If not, when do we know their performance?
Task Cost Spectrum
Task Dependency Spectrum
Task Locality Spectrum
Machine Heterogeneity Spectrum
- Easy: all nodes (e.g., processors) are equally powerful
- Harder: nodes differ, but resources are fixed
- Different physical characteristics
- Hardest: nodes change dynamically
- Other loads on the system (dynamic)
- Data layout (inner vs. outer tracks on disks)
Spectrum of Solutions
- When is load balancing information known?
- Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (offline algorithms)
- Semi-static scheduling. Information may be known at program startup, or at the beginning of each timestep, or at other points. Offline algorithms may be used.
- Dynamic scheduling. Information is not known until mid-execution. (online algorithms)
Approaches
- Static load balancing
- Semi-static load balancing
- Self-scheduling
- Distributed task queues
- Diffusion-based load balancing
- DAG scheduling
- Note: these are not all-inclusive, but represent some of the problems for which good solutions exist.
Static Load Balancing
- Static load balancing is used when all information is available in advance, e.g.,
- dense matrix algorithms, such as LU factorization
- done using a blocked/cyclic layout
- blocked for locality, cyclic for load balance (see the sketch below)
- most computations on a regular mesh, e.g., FFT
- done using a cyclic+transpose+blocked layout for 1D
- similar for higher dimensions, i.e., with transpose
- explicit methods and iterative methods on an unstructured mesh
- use graph partitioning
- assumes the graph does not change over time (or at least within a timestep during an iterative solve)
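The blocked/cyclic trade-off is easy to see in a toy row-to-processor map (the owner() helper is hypothetical; block size is the only knob):

```python
def owner(row: int, p: int, block: int) -> int:
    """Processor that owns a matrix row under a block-cyclic layout."""
    return (row // block) % p

n, p = 16, 4
print([owner(i, p, block=n // p) for i in range(n)])  # blocked: contiguous runs
print([owner(i, p, block=1) for i in range(n)])       # cyclic: round-robin
print([owner(i, p, block=2) for i in range(n)])       # block-cyclic: in between
```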
Semi-Static Load Balance
- If the domain changes slowly over time and locality is important
- Often used in:
- particle simulations, particle-in-cell (PIC) methods
- poor locality may be more of a problem than load imbalance as particles move from one grid partition to another
- tree-structured computations (Barnes-Hut, etc.)
- grid computations with a dynamically changing grid, which changes slowly
Self-Scheduling
- Self-scheduling:
- Keep a pool of tasks that are available to run (see the sketch below)
- When a processor completes its current task, look at the pool
- If the computation of one task generates more, add them to the pool
- Originally used for:
- Scheduling loops by the compiler (really the runtime system)
- Original paper by Tang and Yew, ICPP 1986
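A minimal self-scheduling sketch with a shared pool (task bodies and counts are illustrative):

```python
import queue
import threading

tasks = queue.Queue()
for t in range(20):
    tasks.put(t)                    # a batch of independent tasks

results, lock = [], threading.Lock()

def worker():
    while True:
        try:
            t = tasks.get_nowait()  # grab the next available task
        except queue.Empty:
            return                  # pool is drained: this worker is done
        r = t * t                   # stand-in for real work
        # if this task generated more work, it would do: tasks.put(new_task)
        with lock:
            results.append(r)

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print(sorted(results))
```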
When to Use Self-Scheduling
- Useful when:
- There is a batch (or set) of tasks without dependencies
- can also be used with dependencies, but most analysis has only been done for task sets without dependencies
- The cost of each task is unknown
- Locality is not important
- Using a shared memory multiprocessor, so a centralized solution is fine
Variations on Self-Scheduling
- Typically, don't want to grab the smallest unit of parallel work.
- Instead, choose a chunk of tasks of size K.
- If K is large, access overhead for the task queue is small
- If K is small, we are likely to have even finish times (load balance)
- Variations:
- Use a fixed chunk size
- Guided self-scheduling
- Tapering
- Weighted factoring
- Note: there are more
V1: Fixed Chunk Size
- Kruskal and Weiss give a technique for computing the optimal chunk size
- Requires a lot of information about the problem characteristics
- e.g., task costs, number of tasks
- Results in an off-line algorithm. Not very useful in practice.
- For use in a compiler, for example, the compiler would have to estimate the cost of each task
- All tasks must be known in advance
V2: Guided Self-Scheduling
- Idea: use larger chunks at the beginning to avoid excessive overhead and smaller chunks near the end to even out the finish times.
- The chunk size K_i at the ith access to the task pool is given by
- K_i = ceiling(R_i / p) (simulated below)
- where R_i is the total number of tasks remaining and
- p is the number of processors
- See Polychronopoulos, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers," IEEE Transactions on Computers, Dec. 1987.
V3: Tapering
- Idea: the chunk size K_i is a function of not only the remaining work, but also the task cost variance
- variance is estimated using history information
- high variance => small chunk size should be used
- low variance => larger chunks OK
- See S. Lucco, "Adaptive Parallel Programs," PhD Thesis, UCB, CSD-95-864, 1994.
- Gives analysis (based on workload distribution)
- Also gives experimental results: tapering always works at least as well as GSS, although the difference is often small
V4: Weighted Factoring
- Idea: similar to self-scheduling, but divide task cost by the computational power of the requesting node
- Useful for heterogeneous systems
- Also useful for shared resource NOWs, e.g., built using all the machines in a building
- as with tapering, historical information is used to predict future speed
- speed may depend on the other loads currently on a given processor
- See Hummel, Schmidt, Uma, and Wein, SPAA '96
- includes experimental data and analysis
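A hedged sketch in the spirit of weighted factoring (the halve-the-remaining-work rule and the speed table are assumptions, not the paper's exact formula): each round hands out work in proportion to measured node speed.

```python
from math import ceil

def weighted_chunk(remaining: int, speeds: dict, node: str) -> int:
    """Chunk for `node`: its speed-weighted share of half the remaining work."""
    share = speeds[node] / sum(speeds.values())
    return max(1, ceil(remaining * share / 2))

speeds = {"fast": 4.0, "slow": 1.0}         # illustrative relative speeds
print(weighted_chunk(100, speeds, "fast"))  # 40
print(weighted_chunk(100, speeds, "slow"))  # 10
```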
V5: Distributed Task Queues
- The obvious extension of self-scheduling to distributed memory is
- a distributed task queue (or bag)
- When are these a good idea?
- Distributed memory multiprocessors
- Or, shared memory with significant synchronization overhead
- Locality is not (very) important
- Tasks may be:
- known in advance, e.g., a bag of independent ones
- or generated on the fly, i.e., dependencies exist
- The cost of tasks is not known in advance
Theory of Distributed Queues
- Main result: a simple randomized algorithm is optimal with high probability
- Karp and Zhang '88 show this for a tree of unit cost (equal size) tasks
- Chakrabarti et al. '94 show this for a tree of variable cost tasks
- using randomized pushing of tasks
- Blumofe and Leiserson '94 show this for a fixed task tree of variable cost tasks
- uses task pulling (stealing), which is better for locality
- Also have (loose) bounds on the total memory required
Engineering Distributed Queues
- A lot of papers on engineering these systems on various machines, and their applications
- If nothing is known about task costs when created:
- organize local tasks as a stack (push/pop from top)
- steal from the stack bottom (as if it were a queue), as sketched below
- If something is known about task costs and communication costs, it can be used as hints. (See Wen, UCB PhD, 1996.)
- Goldstein, Rogers, Grunwald, and others (independent work) have all shown
- advantages of integrating into the language framework
- very lightweight thread creation
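A minimal single-threaded sketch of that stack/queue discipline (a real implementation needs synchronization between the owner and thieves):

```python
from collections import deque

class WorkStealingQueue:
    """Owner pushes/pops at the top (LIFO: good locality, freshest task
    first); thieves steal from the bottom (FIFO: the oldest task, which
    for tree-shaped work tends to be the largest)."""

    def __init__(self):
        self.tasks = deque()

    def push(self, task):          # owner: push on top
        self.tasks.append(task)

    def pop(self):                 # owner: pop from top
        return self.tasks.pop() if self.tasks else None

    def steal(self):               # thief: take from the bottom
        return self.tasks.popleft() if self.tasks else None

q = WorkStealingQueue()
q.push("old-big-task")
q.push("new-small-task")
print(q.pop())      # owner gets "new-small-task"
print(q.steal())    # a thief would get "old-big-task"
```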
Diffusion-Based Load Balancing
- In the randomized schemes, the machine is treated 
 as fully-connected.
- Diffusion-based load balancing takes topology 
 into account
- Locality properties better than prior work 
- Load balancing somewhat slower than randomized 
- Cost of tasks must be known at creation time 
- No dependencies between tasks
Diffusion-Based Load Balancing
- The machine is modeled as a graph
- At each step, we compute the weight of tasks remaining on each processor
- This is simply the count if they are unit cost tasks
- Each processor compares its weight with its neighbors and performs some averaging (see the sketch below)
- See Ghosh et al., SPAA '96 for a second order diffusive load balancing algorithm
- takes into account the amount of work sent last time
- avoids some oscillation of first order schemes
- Note: locality is not directly addressed
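A first-order diffusion step on a toy ring of 6 machines (the topology, alpha, and starting load are illustrative):

```python
# Each step, a node exchanges a fraction alpha of the load difference
# with each of its neighbors; total load is conserved.
n, alpha = 6, 0.25
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
load = [60.0, 0, 0, 0, 0, 0]          # all work starts on machine 0

for step in range(40):
    new = load[:]
    for i, nbrs in neighbors.items():
        for j in nbrs:
            new[i] += alpha * (load[j] - load[i])
    load = new

print([round(x, 1) for x in load])    # converges toward 10.0 everywhere
```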
DAG Scheduling
- Some problems involve a DAG of tasks
- nodes represent computation (may be weighted)
- edges represent orderings and usually communication (may also be weighted)
- Two application domains:
- Digital Signal Processing computations
- Sparse direct solvers (mainly Cholesky, since it doesn't require pivoting)
- The basic strategy: partition the DAG to minimize communication and keep all processors busy
- NP-complete, so heuristics are used in practice (see the sketch below)
- See Gerasoulis and Yang, IEEE Transactions on PDS, June '93.
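A greedy list-scheduling sketch of one such heuristic (communication costs are ignored; the weights and DAG are illustrative):

```python
import heapq

def list_schedule(costs, deps, p):
    """Run each ready task on the processor that frees up first.
    costs: task -> time; deps: task -> set of predecessor tasks."""
    indeg = {t: len(deps.get(t, ())) for t in costs}
    succs = {t: [] for t in costs}
    for t, preds in deps.items():
        for q in preds:
            succs[q].append(t)
    ready = [t for t in costs if indeg[t] == 0]
    free = [(0.0, proc) for proc in range(p)]   # (time available, processor)
    heapq.heapify(free)
    finish, schedule = {}, []
    while ready:
        t = ready.pop()
        avail, proc = heapq.heappop(free)
        # cannot start before the processor is free and all predecessors finish
        start = max([avail] + [finish[q] for q in deps.get(t, ())])
        finish[t] = start + costs[t]
        schedule.append((t, proc, start))
        heapq.heappush(free, (finish[t], proc))
        for s in succs[t]:                      # newly satisfied dependencies
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return schedule

costs = {"a": 2, "b": 3, "c": 1, "d": 2}        # a diamond DAG: a -> b,c -> d
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(list_schedule(costs, deps, p=2))
```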
Heterogeneous Machines
- Diffusion-based load balancing for heterogeneous environments
- Fizzano, Karger, Stein, Wein
- Graduated declustering
- Remzi Arpaci-Dusseau et al.
- And more