CS 294-8: Distributed Load Balancing
http://www.cs.berkeley.edu/yelick/294
Load Balancing
- Problem: distribute items into buckets
- Data to memory locations
- Files to disks
- Tasks to processors
- Web pages to caches
- Goal: even distribution
- Slides stolen from Karger at MIT
  http://theory.lcs.mit.edu/karger
Load Balancing
- Enormous and diverse literature on load balancing
- Computer Science: systems
- operating systems
- parallel computing
- distributed computing
- Computer Science: theory
- Operations research (IEOR)
- Application domains
Agenda
- Overview 
- Load Balancing Data 
- Load Balancing Computation 
- (if there is time) 
The Web
[Figure: browsers (clients) sending requests to servers such as MIT, CNN, UCB, CMU, USC]
Hot Spots
[Figure: browsers (clients) converging on a single server (UCB) among IRAM, OceanStore, BANE, Telegraph]
Temporary Loads
- For permanent loads, use a bigger server
- Must also deal with flash crowds 
- IBM chess match 
- Florida election tally 
- Inefficient to design for max load 
- Rarely attained 
- Much capacity wasted 
- Better to offload peak load elsewhere
Proxy Caches Balance Load
[Figure: browsers (clients) spread their requests across proxy caches in front of the servers (MIT, CNN, UCB, CMU, USC, IRAM, OceanStore, BANE, Telegraph)]
Proxy Caching
- Old: server hit once for each browser
- New: server hit once for each page
- Adapts to changing access patterns
Proxy Caching
- Every server can also be a cache 
- Incentives 
- Provides a social good 
- Reduces load at sites you want to contact 
- Costs you little, if done right 
- Few accesses 
- Small amount of storage (times many servers)
Who Caches What?
- Each cache should hold few items 
- Otherwise swamped by clients 
- Each item should be in few caches 
- Otherwise server swamped by caches 
- And cache invalidates/updates expensive 
- Browser must know right cache 
- Could ask the server to redirect
- But server gets swamped by redirects
Hashing
- Simple and powerful load balancing
- Constant time to find bucket for item
- Example: map to n buckets. Pick a, b:
- y = ax + b (mod n) (sketched below)
- Intuition: hash maps each item to one random bucket
- No bucket gets many items
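A minimal sketch of this scheme (names and constants are illustrative; reducing mod a prime P before mod n is the standard universal-hashing detail):

```python
import random

P = 2_147_483_647            # an illustrative prime larger than any key below
a = random.randrange(1, P)   # pick a, b once, at random
b = random.randrange(0, P)

def bucket(key: int, n: int) -> int:
    """Hash an integer key into one of n buckets via y = ax + b (mod n)."""
    return ((a * key + b) % P) % n

counts = [0] * 8
for key in range(1000):
    counts[bucket(key, 8)] += 1
print(counts)                # roughly 125 items per bucket
```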
Problem: Adding Caches
- Suppose a new cache arrives
- How to work it into the hash function?
- Natural change:
- y = ax + b (mod n+1)
- Problem: changes the bucket for every item (see the demo below)
- Every cache will be flushed
- Server swamped with new requests
- Goal: when a bucket is added, few items move
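A quick demo of the problem, reusing bucket() from the sketch above: going from 4 to 5 buckets reassigns most items, since a key stays put only when its hash agrees mod 4 and mod 5 (about 20% of keys).

```python
moved = sum(bucket(key, 4) != bucket(key, 5) for key in range(10_000))
print(f"{moved / 10_000:.0%} of items moved")   # typically around 80%
```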
Problem: Inconsistent Views
- Each client knows about a different set of caches: its view
- View affects choice of cache for item
- With many views, each cache will be asked for the item
- Item in all caches => swamps server
- Goal: item in few caches despite views
Problem: Inconsistent Views
[Figures: four clients (my view, Joe's, Sue's, Mike's) see the same four caches numbered in different orders; ax + b (mod 4) = 2 picks "cache 2" in each view, which is a different physical cache each time, so the UCB page ends up in every cache]
Consistent Hashing
- A new kind of hash function
- Maps any item to a bucket in my view
- Computable in constant time, locally
- 1 standard hash function
- Adding a bucket to a view takes log time
- Logarithmic # of standard hash functions
- Handles incremental and inconsistent views
Single View Properties
- Balance: all buckets get roughly the same number of items
- Smooth: when the kth bucket is added, only a 1/k fraction of items move
- And only from O(log n) servers
- Minimum needed to preserve balance
Multiple View Properties
- Consider n views, each of an arbitrary constant fraction of the buckets
- Load: the number of items a bucket gets from all views is O(log n) times average
- Despite views, load balanced
- Spread: over all views, each item appears in O(log n) buckets
- Despite views, few caches for each item
Implementation
- Use a standard hash function H to map items and caches to the unit circle
- If H maps to 0..M, divide by M
- Map each item to the closest cache (going clockwise)
- A holds 1, 2, 3
- B holds 4, 5
[Figure: unit circle with cache points A, B and item points 1-5]
Implementation
- To add a new cache:
- Hash the cache id
- Move items that should be assigned to it
- Items do not move between A and B
- A holds 3
- B holds 4, 5
- C holds 1, 2
Implementation
- Cache points stored in a pre-computed binary tree
- Lookup for a cached item requires:
- Hash of the item key (e.g., URL)
- BST lookup of the successor (sketched below with a sorted array)
- Consistent hashing with n caches requires O(log n) time
- An alternative that breaks the unit circle into equal-length intervals can make this constant time
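A minimal sketch of the whole scheme (a sorted list plus binary search stands in for the BST; the cache names and hash choice are illustrative):

```python
import bisect
import hashlib

def h(key: str) -> float:
    """Map any string to a point on the unit circle via a standard hash."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    def __init__(self, caches=()):
        self.points = []            # sorted cache points on the circle
        self.cache_at = {}          # point -> cache id
        for c in caches:
            self.add_cache(c)

    def add_cache(self, cache):
        p = h(cache)
        bisect.insort(self.points, p)   # O(log n) search into the sorted list
        self.cache_at[p] = cache

    def remove_cache(self, cache):
        p = h(cache)
        self.points.remove(p)
        del self.cache_at[p]

    def lookup(self, item):
        """Successor cache point clockwise from the item, wrapping at 1.0."""
        i = bisect.bisect_right(self.points, h(item))
        return self.cache_at[self.points[i % len(self.points)]]

ring = ConsistentHashRing(["cacheA", "cacheB"])
url = "http://www.cs.berkeley.edu/"
print(ring.lookup(url))
ring.add_cache("cacheC")    # only items between C and its predecessor move
print(ring.lookup(url))
```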
Balance
- Cache points uniformly distributed by H
- Each cache owns an equal portion of the unit circle
- Item positions random by H
- So each cache gets about the same number of items
Smoothness
- To add the kth cache, hash it to the circle
- Captures items between it and the nearest cache
- 1/k fraction of total items
- Only from 1 other bucket
- O(log n) to find it, as with lookup
Low Spread
- Some views might not see the nearest cache to an item and hash it elsewhere
- But every view will have a bucket near the item (on the circle) by random placement
- So only buckets near the item will ever have to hold it
- Only a few buckets are near the item, by random placement
Low Load
- A cache only gets item I if no other cache is closer to I
- Under any view, some cache is close to I, by random placement of caches
- So a cache only gets items close to it
- But an item is unlikely to be close
- So a cache doesn't get many items
Fault Tolerance
- Suppose the contacted cache is down
- Delete it from the cache set of the view (BST) and find the next closest cache on the circle
- Just a small change in view
- Even with many failures, uniform load and other properties still hold
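Failure handling falls out of the same structure; reusing the ConsistentHashRing sketch from the Implementation slides:

```python
ring = ConsistentHashRing(["cacheA", "cacheB", "cacheC"])
url = "http://www.cs.berkeley.edu/"
down = ring.lookup(url)
ring.remove_cache(down)   # contacted cache is down: drop it from the view
print(ring.lookup(url))   # request falls through to the next cache clockwise
```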
Experimental Setup
- Cache Resolver System
- Cache machines for content
- Users' browsers that direct requests toward virtual caches
- Resolution units (DNS) that use consistent hashing to map virtual caches to physical machines
- Surge: web load generator from BU
- Two modes:
- Common mode (fixed cache for a set of clients)
- Cache resolver mode using consistent hashing
Performance
Summary of Consistent Hashing
- Trivial to implement 
- Fast to compute 
- Uniformly distributes items 
- Can cheaply add/remove cache 
- Even with multiple views 
- No cache gets too many items 
- Each item in only a few caches
Consistent Hashing for Caching
- Works well
- Client maps known caches to the unit circle
- When an item arrives, hash it to a cache
- Server gets O(log n) requests for its own pages
- Each server can also be a cache
- Gets a small number of requests for others' pages
- Robust to failures
- Caches can come and go
- Different browsers can know different caches
Refinement: BW Adaptation
- Browser bandwidth to machines may vary
- If bandwidth to the server is high, unwilling to use a lower bandwidth cache
- Consistently hash the item only to caches with bandwidth as good as the server
- Theorem: all previous properties still hold
- Uniform cache loads
- Low server loads (few caches per item)
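One plausible reading of this refinement as code (the bw table and threshold rule are assumptions, not the paper's exact mechanism), again reusing the ConsistentHashRing sketch:

```python
def lookup_with_bandwidth(ring, item, bw, server_bw):
    """Hash the item only onto caches whose measured bandwidth (bw[cache])
    is at least as good as the bandwidth to the origin server."""
    good = [c for c in ring.cache_at.values() if bw[c] >= server_bw]
    if not good:
        return None                     # no cache beats the server: go direct
    return ConsistentHashRing(good).lookup(item)   # filtered sub-ring
```

A real client would precompute the filtered sub-ring per server rather than rebuild it on every lookup.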
Refinement: Hot Pages
- What if one page gets popular?
- Cache responsible for it gets swamped
- Use a tree of caches?
- Cache at root gets swamped
- Use a different tree for each page
- Build using consistent hashing
- Balances load for hot pages and hot servers
Cache Tree Result
- Using cache trees of log depth, for any set of 
 page accesses, can adaptively balance load such
 that every server gets at most log times the
 average load of the system (browser/server ratio)
- Modulo some theory caveats 
Agenda
- Overview 
- Load Balancing Data 
- Load Balancing Computation 
- (if there is time) 
Load Balancing Spectrum
- Task costs
- Do all tasks have equal costs?
- If not, when are the costs known?
- Task dependencies
- Can all tasks be run in any order (including in parallel)?
- If not, when are the dependencies known?
- Locality
- Is it important for some tasks to be scheduled on the same processor (or nearby)?
- When is the information about communication known?
- Heterogeneity
- Are all the machines equally fast?
- If not, when do we know their performance?
Task Cost Spectrum
Task Dependency Spectrum
Task Locality Spectrum
Machine Heterogeneity Spectrum
- Easy: all nodes (e.g., processors) are equally powerful
- Harder: nodes differ, but resources are fixed
- Different physical characteristics
- Hardest: nodes change dynamically
- Other loads on the system (dynamic)
- Data layout (inner vs. outer tracks on disks)
Spectrum of Solutions
- When is load balancing information known?
- Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (offline algorithms)
- Semi-static scheduling. Information may be known at program startup, or at the beginning of each timestep, or at other points. Offline algorithms may be used.
- Dynamic scheduling. Information is not known until mid-execution. (online algorithms)
Approaches
- Static load balancing
- Semi-static load balancing
- Self-scheduling
- Distributed task queues
- Diffusion-based load balancing
- DAG scheduling
- Note: these are not all-inclusive, but represent some of the problems for which good solutions exist.
Static Load Balancing
- Static load balancing is used when all information is available in advance, e.g.,
- dense matrix algorithms, such as LU factorization
- done using a blocked/cyclic layout
- blocked for locality, cyclic for load balance (see the sketch below)
- most computations on a regular mesh, e.g., FFT
- done using a cyclic+transpose+blocked layout for 1D
- similar for higher dimensions, i.e., with transpose
- explicit methods and iterative methods on an unstructured mesh
- use graph partitioning
- assumes the graph does not change over time (or at least within a timestep during an iterative solve)
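The blocked/cyclic trade-off is easy to see in a toy row-to-processor map (the owner() helper is hypothetical; block size is the only knob):

```python
def owner(row: int, p: int, block: int) -> int:
    """Processor that owns a matrix row under a block-cyclic layout."""
    return (row // block) % p

n, p = 16, 4
print([owner(i, p, block=n // p) for i in range(n)])  # blocked: contiguous runs
print([owner(i, p, block=1) for i in range(n)])       # cyclic: round-robin
print([owner(i, p, block=2) for i in range(n)])       # block-cyclic: in between
```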
Semi-Static Load Balance
- If the domain changes slowly over time and locality is important
- Often used in:
- particle simulations, particle-in-cell (PIC) methods
- poor locality may be more of a problem than load imbalance as particles move from one grid partition to another
- tree-structured computations (Barnes-Hut, etc.)
- grid computations with a dynamically changing grid, which changes slowly
Self-Scheduling
- Self-scheduling:
- Keep a pool of tasks that are available to run (see the sketch below)
- When a processor completes its current task, look at the pool
- If the computation of one task generates more, add them to the pool
- Originally used for:
- Scheduling loops by the compiler (really the runtime system)
- Original paper by Tang and Yew, ICPP 1986
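A minimal self-scheduling sketch with a shared pool (task bodies and counts are illustrative):

```python
import queue
import threading

tasks = queue.Queue()
for t in range(20):
    tasks.put(t)                    # a batch of independent tasks

results, lock = [], threading.Lock()

def worker():
    while True:
        try:
            t = tasks.get_nowait()  # grab the next available task
        except queue.Empty:
            return                  # pool is drained: this worker is done
        r = t * t                   # stand-in for real work
        # if this task generated more work, it would do: tasks.put(new_task)
        with lock:
            results.append(r)

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print(sorted(results))
```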
When to Use Self-Scheduling
- Useful when:
- There is a batch (or set) of tasks without dependencies
- can also be used with dependencies, but most analysis has only been done for task sets without dependencies
- The cost of each task is unknown
- Locality is not important
- Using a shared memory multiprocessor, so a centralized solution is fine
Variations on Self-Scheduling
- Typically, don't want to grab the smallest unit of parallel work.
- Instead, choose a chunk of tasks of size K.
- If K is large, access overhead for the task queue is small
- If K is small, we are likely to have even finish times (load balance)
- Variations:
- Use a fixed chunk size
- Guided self-scheduling
- Tapering
- Weighted factoring
- Note: there are more
V1: Fixed Chunk Size
- Kruskal and Weiss give a technique for computing the optimal chunk size
- Requires a lot of information about the problem characteristics
- e.g., task costs, number of tasks
- Results in an off-line algorithm. Not very useful in practice.
- For use in a compiler, for example, the compiler would have to estimate the cost of each task
- All tasks must be known in advance
V2: Guided Self-Scheduling
- Idea: use larger chunks at the beginning to avoid excessive overhead and smaller chunks near the end to even out the finish times.
- The chunk size K_i at the ith access to the task pool is given by
- K_i = ceiling(R_i / p) (simulated below)
- where R_i is the total number of tasks remaining and
- p is the number of processors
- See Polychronopoulos, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers," IEEE Transactions on Computers, Dec. 1987.
V3: Tapering
- Idea: the chunk size K_i is a function of not only the remaining work, but also the task cost variance
- variance is estimated using history information
- high variance => small chunk size should be used
- low variance => larger chunks OK
- See S. Lucco, "Adaptive Parallel Programs," PhD Thesis, UCB, CSD-95-864, 1994.
- Gives analysis (based on workload distribution)
- Also gives experimental results: tapering always works at least as well as GSS, although the difference is often small
V4: Weighted Factoring
- Idea: similar to self-scheduling, but divide task cost by the computational power of the requesting node
- Useful for heterogeneous systems
- Also useful for shared resource NOWs, e.g., built using all the machines in a building
- as with tapering, historical information is used to predict future speed
- speed may depend on the other loads currently on a given processor
- See Hummel, Schmidt, Uma, and Wein, SPAA '96
- includes experimental data and analysis
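A hedged sketch in the spirit of weighted factoring (the halve-the-remaining-work rule and the speed table are assumptions, not the paper's exact formula): each round hands out work in proportion to measured node speed.

```python
from math import ceil

def weighted_chunk(remaining: int, speeds: dict, node: str) -> int:
    """Chunk for `node`: its speed-weighted share of half the remaining work."""
    share = speeds[node] / sum(speeds.values())
    return max(1, ceil(remaining * share / 2))

speeds = {"fast": 4.0, "slow": 1.0}         # illustrative relative speeds
print(weighted_chunk(100, speeds, "fast"))  # 40
print(weighted_chunk(100, speeds, "slow"))  # 10
```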
V5: Distributed Task Queues
- The obvious extension of self-scheduling to distributed memory is
- a distributed task queue (or bag)
- When are these a good idea?
- Distributed memory multiprocessors
- Or, shared memory with significant synchronization overhead
- Locality is not (very) important
- Tasks may be:
- known in advance, e.g., a bag of independent ones
- or generated on the fly, i.e., dependencies exist
- The cost of tasks is not known in advance
Theory of Distributed Queues
- Main result: a simple randomized algorithm is optimal with high probability
- Karp and Zhang '88 show this for a tree of unit cost (equal size) tasks
- Chakrabarti et al. '94 show this for a tree of variable cost tasks
- using randomized pushing of tasks
- Blumofe and Leiserson '94 show this for a fixed task tree of variable cost tasks
- uses task pulling (stealing), which is better for locality
- Also have (loose) bounds on the total memory required
Engineering Distributed Queues
- A lot of papers on engineering these systems on various machines, and their applications
- If nothing is known about task costs when created:
- organize local tasks as a stack (push/pop from top)
- steal from the stack bottom (as if it were a queue), as sketched below
- If something is known about task costs and communication costs, it can be used as hints. (See Wen, UCB PhD, 1996.)
- Goldstein, Rogers, Grunwald, and others (independent work) have all shown
- advantages of integrating into the language framework
- very lightweight thread creation
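A minimal single-threaded sketch of that stack/queue discipline (a real implementation needs synchronization between the owner and thieves):

```python
from collections import deque

class WorkStealingQueue:
    """Owner pushes/pops at the top (LIFO: good locality, freshest task
    first); thieves steal from the bottom (FIFO: the oldest task, which
    for tree-shaped work tends to be the largest)."""

    def __init__(self):
        self.tasks = deque()

    def push(self, task):          # owner: push on top
        self.tasks.append(task)

    def pop(self):                 # owner: pop from top
        return self.tasks.pop() if self.tasks else None

    def steal(self):               # thief: take from the bottom
        return self.tasks.popleft() if self.tasks else None

q = WorkStealingQueue()
q.push("old-big-task")
q.push("new-small-task")
print(q.pop())      # owner gets "new-small-task"
print(q.steal())    # a thief would get "old-big-task"
```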
Diffusion-Based Load Balancing
- In the randomized schemes, the machine is treated 
 as fully-connected.
- Diffusion-based load balancing takes topology 
 into account
- Locality properties better than prior work 
- Load balancing somewhat slower than randomized 
- Cost of tasks must be known at creation time 
- No dependencies between tasks
Diffusion-Based Load Balancing
- The machine is modeled as a graph
- At each step, we compute the weight of tasks remaining on each processor
- This is simply the count if they are unit cost tasks
- Each processor compares its weight with its neighbors and performs some averaging (see the sketch below)
- See Ghosh et al., SPAA '96 for a second order diffusive load balancing algorithm
- takes into account the amount of work sent last time
- avoids some oscillation of first order schemes
- Note: locality is not directly addressed
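A first-order diffusion step on a toy ring of 6 machines (the topology, alpha, and starting load are illustrative):

```python
# Each step, a node exchanges a fraction alpha of the load difference
# with each of its neighbors; total load is conserved.
n, alpha = 6, 0.25
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
load = [60.0, 0, 0, 0, 0, 0]          # all work starts on machine 0

for step in range(40):
    new = load[:]
    for i, nbrs in neighbors.items():
        for j in nbrs:
            new[i] += alpha * (load[j] - load[i])
    load = new

print([round(x, 1) for x in load])    # converges toward 10.0 everywhere
```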
DAG Scheduling
- Some problems involve a DAG of tasks
- nodes represent computation (may be weighted)
- edges represent orderings and usually communication (may also be weighted)
- Two application domains:
- Digital Signal Processing computations
- Sparse direct solvers (mainly Cholesky, since it doesn't require pivoting)
- The basic strategy: partition the DAG to minimize communication and keep all processors busy
- NP-complete, so heuristics are used in practice (see the sketch below)
- See Gerasoulis and Yang, IEEE Transactions on PDS, June '93.
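A greedy list-scheduling sketch of one such heuristic (communication costs are ignored; the weights and DAG are illustrative):

```python
import heapq

def list_schedule(costs, deps, p):
    """Run each ready task on the processor that frees up first.
    costs: task -> time; deps: task -> set of predecessor tasks."""
    indeg = {t: len(deps.get(t, ())) for t in costs}
    succs = {t: [] for t in costs}
    for t, preds in deps.items():
        for q in preds:
            succs[q].append(t)
    ready = [t for t in costs if indeg[t] == 0]
    free = [(0.0, proc) for proc in range(p)]   # (time available, processor)
    heapq.heapify(free)
    finish, schedule = {}, []
    while ready:
        t = ready.pop()
        avail, proc = heapq.heappop(free)
        # cannot start before the processor is free and all predecessors finish
        start = max([avail] + [finish[q] for q in deps.get(t, ())])
        finish[t] = start + costs[t]
        schedule.append((t, proc, start))
        heapq.heappush(free, (finish[t], proc))
        for s in succs[t]:                      # newly satisfied dependencies
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return schedule

costs = {"a": 2, "b": 3, "c": 1, "d": 2}        # a diamond DAG: a -> b,c -> d
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(list_schedule(costs, deps, p=2))
```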
Heterogeneous Machines
- Diffusion-based load balancing for heterogeneous environments
- Fizzano, Karger, Stein, Wein
- Graduated declustering
- Remzi Arpaci-Dusseau et al.
- And more