Title: HaLoop: Efficient Iterative Data Processing On Large Scale Clusters
Slide 1: HaLoop: Efficient Iterative Data Processing On Large Scale Clusters
- Yingyi Bu, UC Irvine
- Bill Howe, UW
- Magda Balazinska, UW
- Michael Ernst, UW
http://clue.cs.washington.edu/
Award IIS 0844572 Cluster Exploratory (CluE)
http://escience.washington.edu/
VLDB 2010, Singapore
Slide 2: Thesis in one slide
- Observation: MapReduce has proven successful as a common runtime for non-recursive declarative languages
  - Hive (SQL)
  - Pig (RA with nested types)
- Observation: Many people roll their own loops
  - Graphs, clustering, mining, recursive queries
  - Iteration managed by an external script
- Thesis: With minimal extensions, we can provide an efficient common runtime for recursive languages
  - Map, Reduce, Fixpoint
Slide 3: Related Work: Twister [Ekanayake HPDC 2010]
- Redesigned evaluation engine using pub/sub
- Termination condition evaluated by main()

    while (!complete) {
        monitor = driver.runMapReduceBCast(cData);
        monitor.monitorTillCompletion();
        DoubleVectorData newCData =
            ((KMeansCombiner) driver.getCurrentCombiner()).getResults();
        totalError = getError(cData, newCData);
        cData = newCData;
        if (totalError < THRESHOLD) {
            complete = true;
            break;
        }
    }
Slide 4: In Detail: PageRank (Twister)

    while (!complete) {
        // start the pagerank map reduce process      (slide callout: "run MR")
        monitor = driver.runMapReduceBCast(
            new BytesValue(tmpCompressedDvd.getBytes()));
        monitor.monitorTillCompletion();
        // get the result of the process
        newCompressedDvd =
            ((PageRankCombiner) driver.getCurrentCombiner()).getResults();
        // decompress the compressed pagerank values
        newDvd = decompress(newCompressedDvd);
        tmpDvd = decompress(tmpCompressedDvd);
        // get the difference between new and old pagerank values
        totalError = getError(tmpDvd, newDvd);
        // slide callout: "term. cond."
        if (totalError < tolerance) {
            complete = true;
        }
        tmpCompressedDvd = newCompressedDvd;
    }
Slide 5: Related Work: Spark [Zaharia HotCloud 2010]
- Reduction output collected at the driver program
- Does not currently support a grouped reduce operation as in MapReduce
- All output sent to the driver

    val spark = new SparkContext(<Mesos master>)
    var count = spark.accumulator(0)
    for (i <- spark.parallelize(1 to 10000, 10)) {
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      if (x*x + y*y < 1) count += 1
    }
    println("Pi is roughly " + 4 * count.value / 10000.0)
Slide 6: Related Work: Pregel [Malewicz PODC 2009]
- Graphs only
  - clustering: k-means, canopy, DBScan
- Assumes each vertex has access to its outgoing edges
  - So an edge representation Edge(from, to) requires offline preprocessing
    - perhaps using MapReduce
Slide 7: Related Work: Piccolo [Power OSDI 2010]
- Partitioned table data model, with user-defined partitioning
- Programming model
  - message-passing with global synchronization barriers
  - user can give locality hints: GroupTables(curr, next, graph)
- Worth exploring a direct comparison
Slide 8: Related Work: BOOM [cf. Alvaro EuroSys '10]
- Distributed computing based on Overlog (Datalog + temporal logic + more)
- Recursion supported naturally
  - app: API-compliant implementation of MR
- Worth exploring a direct comparison
Slide 9: Details
- Architecture
- Programming Model
- Caching (and Indexing)
- Scheduling
Slide 10: Example 1: PageRank

Rank Table R0:
  url        rank
  www.a.com  1.0
  www.b.com  1.0
  www.c.com  1.0
  www.d.com  1.0
  www.e.com  1.0

Linkage Table L:
  url_src    url_dest
  www.a.com  www.b.com
  www.a.com  www.c.com
  www.c.com  www.a.com
  www.e.com  www.c.com
  www.d.com  www.b.com
  www.c.com  www.e.com
  www.e.com  www.c.com
  www.a.com  www.d.com

Loop body computing R_{i+1} from R_i and L:
  join:       R_i.url = L.url_src
  rank split: R_i.rank = R_i.rank / γ_url COUNT(url_dest)
  new ranks:  π(url_dest, γ_url_dest SUM(rank))

Rank Table R3:
  url        rank
  www.a.com  2.13
  www.b.com  3.89
  www.c.com  2.60
  www.d.com  2.60
  www.e.com  2.13
Slide 11: A MapReduce Implementation

[Figure: one PageRank iteration as a chain of MapReduce jobs: a "join & compute rank" job reading R_i and the linkage splits L-split0/L-split1, an "aggregate" job, and a separate "fixpoint evaluation" job. The client checks "Converged?", then either sets i = i+1 and repeats, or finishes (done).]
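To make the dataflow above concrete, here is a minimal sketch of the kind of client-side driver loop such a plain-Hadoop implementation implies. Every name in it (NaivePageRankDriver, runMapReduceJob, runFixpointJob, the R_i and tmp_i paths) is a hypothetical stand-in, not HaLoop or Hadoop API; only the three-jobs-per-iteration structure follows the slide.

    // Hypothetical driver loop for the pipeline sketched above; all names are
    // made up for illustration.
    public class NaivePageRankDriver {
        public static void main(String[] args) {
            int i = 0;
            boolean converged = false;
            while (!converged) {
                // Job 1: join R_i with the loop-invariant linkage table L, compute rank contributions
                runMapReduceJob("join-compute-rank", "R_" + i, "L", "tmp_" + i);
                // Job 2: aggregate the contributions into R_{i+1}
                runMapReduceJob("aggregate", "tmp_" + i, null, "R_" + (i + 1));
                // Job 3: an extra MapReduce job whose only purpose is the convergence test
                converged = runFixpointJob("R_" + i, "R_" + (i + 1), 0.1 /* threshold */);
                i = i + 1;
            }
        }

        // Stubs standing in for real job submissions; a real driver would launch
        // Hadoop jobs here and block until they finish.
        static void runMapReduceJob(String name, String in1, String in2, String out) { }

        static boolean runFixpointJob(String oldRanks, String newRanks, double threshold) {
            return true; // stub: pretend we converged after one iteration
        }
    }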
Slide 12: What's the problem?

[Figure: the same dataflow as Slide 11, with the three problem spots marked 1, 2, 3.]

L is loop invariant, but
1. L is loaded on each iteration
2. L is shuffled on each iteration
3. The fixpoint is evaluated as a separate MapReduce job per iteration
Slide 13: Example 2: Transitive Closure

Find all transitive friends of Eric over the Friend relation.

[Figure: the Friend graph and the per-iteration result sets R0 through R3; R0 starts at (Eric, Eric) and each iteration adds friends one more hop away, e.g. (Eric, Elisa), then (Eric, Tom), (Eric, Harry).]

(semi-naïve evaluation)
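For readers unfamiliar with semi-naïve evaluation, here is a minimal single-machine sketch of the idea (not HaLoop code; the Friend edges and names below are illustrative, not the talk's dataset): each round joins only the newly discovered tuples with Friend and keeps only the ones not already seen.

    import java.util.*;

    // Minimal single-machine sketch of semi-naive transitive closure.
    public class SemiNaiveTC {
        public static void main(String[] args) {
            Map<String, List<String>> friend = new HashMap<>();
            friend.put("Eric", Arrays.asList("Elisa"));
            friend.put("Elisa", Arrays.asList("Tom", "Harry"));

            Set<String> reached = new HashSet<>(Collections.singleton("Eric")); // R0
            Set<String> delta = new HashSet<>(reached);   // tuples discovered in the last round
            while (!delta.isEmpty()) {
                Set<String> next = new HashSet<>();
                for (String person : delta) {             // join: delta with Friend
                    for (String f : friend.getOrDefault(person, Collections.emptyList())) {
                        if (reached.add(f)) {             // dupe-elim: keep only unseen tuples
                            next.add(f);
                        }
                    }
                }
                delta = next;  // fixpoint reached when no new tuples appear
            }
            System.out.println("Transitive friends of Eric: " + reached);
        }
    }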
Slide 14: Example 2 in MapReduce

[Figure: each iteration runs two MapReduce jobs over S_i and the Friend splits (Friend0, Friend1): a "Join" job that computes the next generation of friends and a "Dupe-elim" job that removes the ones we've already seen. The client asks "Anything new?", then either sets i = i+1 and repeats, or finishes (done).]
Slide 15: What's the problem?

[Figure: the same Join / Dupe-elim dataflow as Slide 14, with the two problem spots marked 1 and 2.]

Friend is loop invariant, but
1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration
Slide 16: Example 3: k-means

k_i = the k centroids at iteration i

[Figure: in each iteration, mappers read the point partitions P0, P1, P2 together with the current centroids k_i; a reducer produces the new centroids k_{i+1}. The client checks |k_i - k_{i+1}| < threshold, then either sets i = i+1 and repeats, or finishes (done).]
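As a reference for what one iteration computes, here is a minimal single-machine sketch of the map/reduce structure of a k-means step (illustrative only: 1-D points, made-up data, and no Hadoop/HaLoop job plumbing).

    import java.util.*;

    // Sketch of one k-means step in map/reduce style.
    public class KMeansSketch {
        // "map" side: assign a point to its nearest centroid
        static int nearest(double point, double[] centroids) {
            int best = 0;
            for (int j = 1; j < centroids.length; j++) {
                if (Math.abs(point - centroids[j]) < Math.abs(point - centroids[best])) best = j;
            }
            return best;
        }

        // one iteration: group points by nearest centroid, "reduce" each group to its mean
        static double[] step(double[] points, double[] centroids) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int c = nearest(p, centroids);
                sum[c] += p;
                count[c]++;
            }
            double[] next = centroids.clone();
            for (int j = 0; j < next.length; j++) {
                if (count[j] > 0) next[j] = sum[j] / count[j];
            }
            return next;
        }

        public static void main(String[] args) {
            double[] points = {1.0, 1.2, 0.8, 8.0, 8.2, 7.9};   // P, loop invariant
            double[] k = {0.0, 5.0};                             // k_i
            for (int i = 0; i < 10; i++) {
                double[] kNext = step(points, k);                // k_{i+1}
                double diff = 0;
                for (int j = 0; j < k.length; j++) diff += Math.abs(kNext[j] - k[j]);
                k = kNext;
                if (diff < 1e-6) break;                          // |k_i - k_{i+1}| < threshold
            }
            System.out.println(Arrays.toString(k));
        }
    }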
Slide 17: What's the problem?

[Figure: the same k-means dataflow as Slide 16, with the problem spot marked 1.]

P is loop invariant, but
1. P is loaded on each iteration
Slide 18: Approach: Inter-iteration caching

[Figure: the loop body annotated with the four cache points.]
- Reducer output cache (RO)
- Reducer input cache (RI)
- Mapper output cache (MO)
- Mapper input cache (MI)
Slide 19: RI: Reducer Input Cache
- Provides
  - Access to loop-invariant data without map/shuffle (see the sketch after this list)
- Used by
  - the Reducer function
- Assumes
  - Mapper output for a given table is constant across iterations
  - Static partitioning (implies no new nodes)
- PageRank
  - Avoid shuffling the network at every step
- Transitive Closure
  - Avoid shuffling the graph at every step
- K-means
  - No help
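Here is an illustrative sketch (not HaLoop's API) of what the reducer input cache buys for the PageRank join step: the invariant linkage table L is shuffled to the reducers once, cached locally, and read from local storage on later iterations, so only the rank table R_i is mapped and shuffled each round. The class, method, and cache names are made up.

    import java.util.*;

    // Sketch of a join-step reducer backed by a reducer input cache.
    public class ReducerInputCacheSketch {
        // stand-in for the per-node cache of L's rows, keyed by source url
        static Map<String, List<String>> localCache = new HashMap<>();

        // reduce() for the join step: joins shuffled rank tuples with linkage tuples
        static void reduce(String url, List<String> shuffledRankValues, int iteration) {
            List<String> linkageRows;
            if (iteration == 0) {
                linkageRows = readLinkageFromShuffle(url);   // L arrives via the shuffle once...
                localCache.put(url, linkageRows);            // ...and is cached on this node
            } else {
                linkageRows = localCache.get(url);           // later iterations: local read only
            }
            for (String rank : shuffledRankValues) {
                for (String dest : linkageRows) {
                    emit(dest, rank);                        // rank contribution to each out-link
                }
            }
        }

        static List<String> readLinkageFromShuffle(String url) { return new ArrayList<>(); }
        static void emit(String urlDest, String rank) { }
    }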
Slide 20: Reducer Input Cache Benefit
[Chart: overall run time. Transitive closure on the Billion Triples dataset (120GB), 90 small instances on EC2.]

Slide 21: Reducer Input Cache Benefit
[Chart: the join step only. Transitive closure on the Billion Triples dataset (120GB), 90 small instances on EC2.]

Slide 22: Reducer Input Cache Benefit
[Chart: reduce and shuffle of the join step. Transitive closure on the Billion Triples dataset (120GB), 90 small instances on EC2.]
Slide 23: [Figure: the PageRank dataflow from Slide 11 repeated: the "join & compute rank" and "aggregate" jobs over R_i and the L splits, plus the separate fixpoint evaluation job.]
Slide 24: RO: Reducer Output Cache
- Provides
  - Distributed access to the output of previous iterations (see the sketch after this list)
- Used by
  - Fixpoint evaluation
- Assumes
  - Partitioning constant across iterations
  - Reducer output key functionally determines reducer input key
- PageRank
  - Allows distributed fixpoint evaluation
  - Obviates the extra MapReduce job
- Transitive Closure
  - No help
- K-means
  - No help
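Here is an illustrative sketch (not HaLoop's API) of fixpoint evaluation against a reducer output cache: each reducer compares the ranks it just produced with the ranks it cached on the previous iteration and reports a local distance, so no separate MapReduce job is needed for the convergence test. All names and the data values are made up.

    import java.util.*;

    // Sketch of a per-reducer convergence check using cached previous output.
    public class ReducerOutputCacheSketch {
        // cached output of the previous iteration for this reducer's partition
        static Map<String, Double> previousRanks = new HashMap<>();

        // record one newly produced rank and return its contribution to the distance
        static double emitAndMeasure(String url, double newRank) {
            Double oldRank = previousRanks.put(url, newRank);   // update the cache in place
            return (oldRank == null) ? newRank : Math.abs(newRank - oldRank);
        }

        // per-reducer convergence summary that would be reported back to the master
        static boolean partitionConverged(Map<String, Double> newRanks, double threshold) {
            double distance = 0.0;
            for (Map.Entry<String, Double> e : newRanks.entrySet()) {
                distance += emitAndMeasure(e.getKey(), e.getValue());
            }
            return distance < threshold;
        }

        public static void main(String[] args) {
            Map<String, Double> iter1 = Map.of("a", 1.0, "b", 2.0);
            Map<String, Double> iter2 = Map.of("a", 1.01, "b", 1.99);
            partitionConverged(iter1, 0.05);                     // seeds the cache
            System.out.println(partitionConverged(iter2, 0.05)); // true: total delta 0.02 < 0.05
        }
    }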
Slide 25: Reducer Output Cache Benefit
[Charts: fixpoint evaluation time (s) per iteration; one chart for the Livejournal dataset (50 EC2 small instances) and one for the Freebase dataset (90 EC2 small instances).]
Slide 26: MI: Mapper Input Cache
- Provides
  - Access to non-local mapper input on later iterations
- Used
  - During scheduling of map tasks
- Assumes
  - Mapper input does not change
- PageRank
  - Subsumed by use of the Reducer Input Cache
- Transitive Closure
  - Subsumed by use of the Reducer Input Cache
- K-means
  - Avoids non-local data reads on iterations > 0
Slide 27: Mapper Input Cache Benefit
5% non-local data reads → 5% improvement
Slide 28: Conclusions (last slide)
- Relatively simple changes to MapReduce/Hadoop can support arbitrary recursive programs
  - TaskTracker (cache management)
  - Scheduler (cache awareness)
  - Programming model (multi-step loop bodies, cache control)
- Optimizations
  - Caching loop-invariant data realizes the largest gain
  - Good to eliminate the extra MapReduce step for termination checks
  - Mapper input cache benefit inconclusive; need a busier cluster
- Future Work
  - Analyze the expressiveness of Map + Reduce + Fixpoint
  - Consider a model of Map + (Reduce) + Fixpoint
Slide 29: Data-Intensive Scalable Science
http://escience.washington.edu
Award IIS 0844572 Cluster Exploratory (CluE)
http://clue.cs.washington.edu
Slide 30: Motivation in One Slide
- MapReduce can't express recursion/iteration
- Lots of interesting programs need loops
  - graph algorithms
  - clustering
  - machine learning
  - recursive queries (CTEs, datalog, WITH clause)
- Dominant solution: use a driver program outside of MapReduce
- Hypothesis: making MapReduce loop-aware affords optimization
  - and lays a foundation for scalable implementations of recursive languages
Slide 31: Experiments
- Amazon EC2
  - 20, 50, 90 default small instances
- Datasets
  - Billion Triples (120GB): 1.5B nodes, 1.6B edges
  - Freebase (12GB): 7M nodes, 154M edges
  - Livejournal social network (18GB): 4.8M nodes, 67M edges
- Queries
  - Transitive Closure
  - PageRank
  - k-means
Slide 32: HaLoop Architecture
Slide 33: Scheduling Algorithm

    Input: Node node
    Global variables: HashMap<Node, List<Partition>> last,
                      HashMap<Node, List<Partition>> current

    if (iteration == 0) {
        // the same as MapReduce
        Partition part = StandardMapReduceSchedule(node);
        current.add(node, part);
    } else {
        if (node.hasFullLoad()) {
            // find a substitution
            Node substitution = findNearbyNode(node);
            last.get(substitution).addAll(last.remove(node));
            return;
        }
        if (last.get(node).size() > 0) {
            // iteration-local schedule
            Partition part = last.get(node).get(0);
            schedule(part, node);
            current.get(node).add(part);
            last.get(node).remove(part);
        }
    }
Slide 34: Programming Interface

    Job job = new Job();
    // define the loop body
    job.AddMap(Map_Rank, 1);
    job.AddReduce(Reduce_Rank, 1);
    job.AddMap(Map_Aggregate, 2);
    job.AddReduce(Reduce_Aggregate, 2);
    // declare an input as invariant
    job.AddInvariantTable(1);
    // specify loop body input, parameterized by iteration
    job.SetInput(IterationInput);
    // termination condition
    job.SetFixedPointThreshold(0.1);
    job.SetDistanceMeasure(ResultDistance);
    job.SetMaxNumOfIterations(10);
    // turn on caches
    job.SetReducerInputCache(true);
    job.SetReducerOutputCache(true);
    job.Submit();
Slide 35: Cache Infrastructure Details
- Programmer control
- Architecture for cache management
- Scheduling for inter-iteration locality
- Indexing the values in the cache (see the sketch after this list)
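Here is a hedged sketch of one way values in a reducer input cache could be indexed; it is illustrative only and is not HaLoop's actual cache or index format. Cached (key, value) records live in a local file, and an in-memory map from key to file offsets lets a reducer fetch the values for a key without scanning the whole cache.

    import java.io.*;
    import java.util.*;

    // Hypothetical keyed index over a locally cached partition.
    public class CacheIndexSketch {
        private final Map<String, List<Long>> index = new HashMap<>(); // key -> record offsets
        private final RandomAccessFile cache;

        CacheIndexSketch(File path) throws IOException {
            this.cache = new RandomAccessFile(path, "rw");
        }

        // append one cached record and remember where it starts
        void put(String key, String value) throws IOException {
            long offset = cache.length();
            cache.seek(offset);
            cache.writeUTF(key);
            cache.writeUTF(value);
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(offset);
        }

        // fetch all cached values for a key by following the index
        List<String> get(String key) throws IOException {
            List<String> values = new ArrayList<>();
            for (long offset : index.getOrDefault(key, Collections.emptyList())) {
                cache.seek(offset);
                cache.readUTF();                 // skip the stored key
                values.add(cache.readUTF());
            }
            return values;
        }
    }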
Slide 36: Other Extensions and Experiments
- Distributed databases and Pig/Hadoop for Astronomy [IASDS 09]
- Efficient Friends-of-Friends in Dryad [SSDBM 2010]
- SkewReduce: Automated skew handling [SOCC 2010]
- Image Stacking and Mosaicing with Hadoop [Hadoop Summit 2010]
- HaLoop: Efficient iterative processing with Hadoop [VLDB 2010]
Slide 37: MapReduce Broadly Applicable
- Biology
  - Schatz 08, 09
- Astronomy
  - IASDS 09, SSDBM 10, SOCC 10, PASP 10
- Oceanography
  - UltraVis 09
- Visualization
  - UltraVis 09, EuroVis 10
Slide 38: Key idea
- When the loop output is large
  - transitive closure
  - connected components
  - PageRank (with a convergence test as the termination condition)
- need a distributed fixpoint operator
  - typically implemented as yet another MapReduce job, on every iteration
Slide 39: Background
- Why is MapReduce popular?
  - Because it's fast?
  - Because it scales to 1000s of commodity nodes?
  - Because it's fault tolerant?
- Witness:
  - MapReduce on GPUs
  - MapReduce on MPI
  - MapReduce in main memory
  - MapReduce on <10 nodes
Slide 40: So why is MapReduce popular?
- The programming model
  - Two serial functions, parallelism for free
  - Easy and expressive
- Compare this with MPI
  - 70+ operations
- But it can't express recursion
  - graph algorithms
  - clustering
  - machine learning
  - recursive queries (CTEs, datalog, WITH clause)
Slide 41: Fixpoint
- A fixpoint of a function f is a value x such that f(x) = x
- The fixpoint queries, FIX, can be expressed with the relational algebra plus a fixpoint operator
- Map - Reduce - Fixpoint
  - hypothesis: a sufficient model for all recursive queries
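As a standard illustration of this definition (textbook material, not taken from the talk), the transitive closure of Example 2 is the least fixpoint of a relational-algebra step function over the Friend relation:

    % Transitive closure as a fixpoint (standard formulation, shown for illustration):
    % F adds one more hop through Friend; TC is the least R with F(R) = R.
    F(R) \;=\; \text{Friend} \;\cup\; \pi_{x,z}\bigl(R(x,y) \bowtie \text{Friend}(y,z)\bigr),
    \qquad \text{TC} \;=\; \text{least } R \text{ such that } F(R) = R .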