1 MapReduce
- Source: MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat, Google Inc.
- (Wim Bohm, cs.colostate.edu)
Except as otherwise noted, the content of this
presentation is licensed under the Creative
Commons Attribution 2.5 license.
2 MapReduce Concept
- Simple, implicitly parallel (//) programming model
- Based on Lisp's Map and Reduce higher order functions
- Lisp: Map(fM,L) = cons(fM(first(L)), Map(fM, rest(L)))
- Lisp: Reduce(fR,L) = fR(first(L), Reduce(fR, rest(L)))
- Lisp: MapReduce(fM,fR,L) = Reduce(fR, Map(fM,L))
- Lisp = Lots of Irritating Superfluous Parentheses
- (base cases left out)
- Very savvy implementation
- High throughput, high performance, rack aware
- Functional: the runtime system (RTS) takes care of fault tolerance (FT), restart, distribution (//ism)
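- As a point of reference, Reduce(fR, Map(fM, L)) can be written with Java streams; a minimal sketch (illustrative only, not part of the paper or of Hadoop):

    import java.util.List;

    public class LispStyleMapReduce {
        public static void main(String[] args) {
            List<Integer> l = List.of(1, 2, 3, 4);
            // MapReduce(fM, fR, L) = Reduce(fR, Map(fM, L)):
            // fM squares each element, fR sums the mapped list.
            int result = l.stream()
                          .map(x -> x * x)           // Map(fM, L)
                          .reduce(0, Integer::sum);  // Reduce(fR, ...)
            System.out.println(result);              // prints 30
        }
    }

- Because fM is applied to each element independently and fR is associative, the same pipeline runs in parallel with l.parallelStream(); this is exactly the //ism MapReduce exploits.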
3 Introduction
- Data center apps are a special type of // program, processing large amounts of data on large clusters.
- Complexity: much of this complexity is NOT in the actual computation, but in the data distribution, replication, access, FT, restart, etc. These issues arise for ALL data center apps.
- This has given rise to the MapReduce abstraction and its implementation.
4 Map and Reduce
- Map: take a set of (key,value) pairs and generate a set of intermediate (key,value) pairs by applying some function f to all these pairs
- Reduce: merge all pairs with the same key, applying a reduction function R to the values
- f and R are user defined
- All implemented in a non-functional language such as Java, C, Python
5 Wordcount

    Map(String key, String value):
      // key: doc name, value: doc contents
      for each word w in value:
        EmitIntermediate(w, "1")

    Reduce(String key, Iterator values):
      // key: word, values: list of counts
      int sum = 0
      for each v in values:
        sum += ParseInt(v)
      Emit(AsString(sum))
6 Types
- Map: (keytype1, valuetype1) -> list((keytype2, valuetype2))
- Reduce: (keytype2, list(valuetype2)) -> list(valuetype2)
- Types 1 and 2 passed between the user functions can be any valid (e.g. Java) type
- Communication goes through files; the types are e.g. LongWritable (see examples)
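- For instance, the word count from slide 5 with these types made explicit; a minimal sketch against the classic org.apache.hadoop.mapred API (keytype1 = LongWritable byte offset, valuetype1 = Text line; keytype2 = Text word, valuetype2 = IntWritable count):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Map: (LongWritable, Text) -> list((Text, IntWritable))
    class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            for (String w : line.toString().split("\\s+")) {
                if (w.isEmpty()) continue;
                word.set(w);
                out.collect(word, ONE);   // EmitIntermediate(w, 1)
            }
        }
    }

    // Reduce: (Text, list(IntWritable)) -> list(IntWritable)
    class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) sum += counts.next().get();
            out.collect(word, new IntWritable(sum));   // Emit(sum)
        }
    }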
7 Example: Pi-Estimator
- Idea: generate random points in a square
- Count how many are inside the circle and how many are in the square (producing area estimates)
- Square area As = (2r)^2 = 4r^2 -> r^2 = As/4
- Circle area Ac = pi*r^2 -> pi = Ac/r^2
- -> pi = 4*Ac/As
- Example of a Monte Carlo method: simulating a physical phenomenon using many random samples
8 Worker / Multi-threading view

    Master:
      get input params (nWorkers, nPoints)
      for (i = 0; i < nWorkers; i++) thrCreate(i, nPoints)
      for (i = 0; i < nWorkers; i++) join
      As = 0; Ac = 0
      for (i = 0; i < nWorkers; i++) { As += nPoints; Ac += cPoints[i] }
      piEst = 4*Ac/As

    Slave i:
      cPoints[i] = 0
      for (j = 0; j < nPoints; j++)
        create 2 random pts x,y in (-.5 .. .5)
        if (sqrt(x*x + y*y) < .5) cPoints[i]++
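- A runnable Java rendering of this master/slave sketch (nWorkers and nPoints are hard-coded here purely for illustration):

    import java.util.concurrent.ThreadLocalRandom;

    public class PiEstimatorThreads {
        public static void main(String[] args) throws InterruptedException {
            final int nWorkers = 4, nPoints = 1_000_000;   // input params
            final long[] cPoints = new long[nWorkers];     // one slot per slave

            Thread[] workers = new Thread[nWorkers];
            for (int i = 0; i < nWorkers; i++) {           // thrCreate(i, nPoints)
                final int id = i;
                workers[i] = new Thread(() -> {
                    long hits = 0;
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    for (int j = 0; j < nPoints; j++) {
                        double x = rnd.nextDouble() - 0.5; // point in (-.5 .. .5)
                        double y = rnd.nextDouble() - 0.5;
                        if (x * x + y * y < 0.25) hits++;  // same test as sqrt(..) < .5
                    }
                    // each slave writes only its own slot; join() below makes
                    // the write visible to the master, so no mutex is needed
                    cPoints[id] = hits;
                });
                workers[i].start();
            }
            long As = 0, Ac = 0;
            for (int i = 0; i < nWorkers; i++) {           // join, then combine
                workers[i].join();
                As += nPoints;
                Ac += cPoints[i];
            }
            System.out.println("pi estimate: " + 4.0 * Ac / As); // pi = 4*Ac/As
        }
    }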
9 Multithreading vs Lisp functional
- The multithreading view assumes:
- We can spawn threads and join them back
- We have shared memory
- If there are read/write hazards, we use explicit mutex locks
- Therefore we have parallelism
- The Lisp functional view uses map/reduce over lists:
- The list of worker numbers is MAPped to a list of cPoints
- The list of cPoints is REDUCEd to sumCPoints
- sumCPoints is used to estimate pi
- The lists make this inherently SEQUENTIAL
10 We want MapReduce to be parallel!
- Just like in multithreading, we need some kind of spawn(id, func, data) construct
- In Lisp the spawn is taken care of by the higher order function mechanism: reduce(rFun, map(mFun, inList))
- In MapReduce we use method override to define our specific versions of map and reduce, and we have a Driver that creates a Job Configuration to provide parallelism.
11 We need to communicate results
- Somehow the map processes need input (key1,val1) pairs and need to produce intermediate (key2,val2) pairs that the reduce processes can pick up.
- But we are in a distributed environment:
- What provides a shared name space?
- The file system!
- A functional (write-once) HDFS allows for parallelism
12 What about parallel writes?
- HDFS: no parallel writes
- GFS: parallel append-type writes
- MapReduce: parallel processes doing potentially parallel writes; writes are guaranteed to be atomic operations. If process 1 writes aaaaa and process 2 writes bbbbb, we get aaaaabbbbb or bbbbbaaaaa, never something like ababababab.
- The data written by one process occurs in the order written by that process
13 Parallel writes vs multithreading
- Parallel writes are like multiple threads appending to a mutex-lock protected list.
- The list is just a collection of unordered records.
- The reducer has to be aware of this:
- Either it can impose an order,
- Or it can make sure the reduction function is associative and commutative.
- Take // grep: if you want outcomes sorted by line, make the line position part of the key, and sort (see the sketch below).
14 MapReduce for PiEstimator
- MapReduce is integrated into Eclipse
- We need the MapReduce plugins to create a MapReduce Eclipse perspective.
- MapReduce projects contain three classes:
- 1. A Driver (like the master in the multithreading case), creating a configuration, defining mappers and reducers, starting the app, and dealing with the final result gathering (a sketch follows this list).
- 2. A Mapper (inherited class implementing the mapper interface), getting data from files in a directory specified by the driver.
- 3. A Reducer (inherited class implementing the reducer interface), getting data from files in a directory specified by the driver, produced by the mappers.
15 Pi
- Two versions:
- 1. mypi: nMaps, nSamples
- Each of the nMaps maps does nSamples samples
- More maps, more work, hopefully a better result.
- 2. mypi2: nMaps, nSamples, nReps
- Each of the nMaps maps does nSamples/nMaps samples, so the total amount of work is always the same. Done nReps times for the speedup experiment.
- (A sketch of such a mapper and reducer follows.)
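- What the mypi mapper and reducer might look like (hypothetical names and configuration key; same mapred API as above, and the real course code may differ). Each map task draws its samples and emits one (key, hits) pair; a single reducer sums the hits Ac, and the driver then forms pi = 4*Ac/As with As = nMaps*nSamples:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    class PiMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private long nSamples;

        public void configure(JobConf job) {
            // "mypi.nSamples" is a hypothetical key the driver would set
            nSamples = job.getLong("mypi.nSamples", 1_000_000);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            long hits = 0;
            Random rnd = new Random();
            for (long j = 0; j < nSamples; j++) {
                double x = rnd.nextDouble() - 0.5, y = rnd.nextDouble() - 0.5;
                if (x * x + y * y < 0.25) hits++;    // inside the circle
            }
            out.collect(new Text("inside"), new LongWritable(hits));
        }
    }

    class PiReducer extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            long Ac = 0;
            while (values.hasNext()) Ac += values.next().get();  // total hits
            out.collect(key, new LongWritable(Ac));  // driver reads Ac, forms 4*Ac/As
        }
    }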
16 Mypi2 on Laptop
[Chart] Multiple sequential mappers do not bring the performance down.
17 Mypi2 on Hadoop cluster
[Chart] Twenty parallel mappers give a five-fold speedup; twelve seems better.
18 Other example: grep
- Input DIRECTORY to output DIRECTORY
- Whole app written in one class
- Not 3 (driver, mapper, reducer)
- Uses a lot of support code: sort, regular expression scanner
- Deals with regular expressions like (app|ban|coc).*
19 MapReduce: Google implementation
- Large clusters of commodity PCs connected with switched Ethernet.
- Luiz A. Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: the Google cluster architecture. IEEE Micro, 23(2):22-28, April 2003.
- Nodes: dual-processor x86, Linux, 2-4 GB of memory
- Storage: local disks on the individual nodes
- GFS (Google's file system, the model for HDFS)
- Jobs (sets of tasks) are submitted to a scheduler and IMPLICITLY mapped to the set of available nodes
20 Execution overview
[Figure: execution overview, after Fig. 1 of the paper. (1) The user program forks the master and the workers. (2) The master assigns map tasks and reduce tasks. (3) Map workers read their input splits (split0 .. split4). (4) Map workers write buffered pairs to local files. (5) Reduce workers do remote reads of these intermediate files. (6) Reduce workers write the output files (output file 0, output file 1). Phases: Input files -> Map phase -> Intermediate local files -> Reduce phase -> Output files.]
21 Execution overview
- 1. Input files are split into M pieces (16 to 64 MB). Many worker copies of the program are forked.
- 2. One special copy, the master, assigns map and reduce tasks to idle slave workers.
- 3. Map workers read input splits, parse (key,value) pairs, apply the map function, and create buffered output pairs.
22 Execution overview (cont.)
- 4. Buffered output pairs are periodically written to local disk, partitioned into R regions; the locations of the regions are passed back to the master.
- 5. The master notifies reduce workers about the locations. The worker uses remote procedure calls to read the data from the local disks of the map workers, and sorts it by intermediate key to group records with the same key together.
23 Execution overview (cont.)
- 6. The reduce worker passes each key plus the corresponding set of all intermediate data to the reduce function. The output of the reduce function is appended to the final output file.
- 7. When all map and reduce tasks are completed, the master wakes up the user program, which resumes the user code.
24 Fault Tolerance: workers
- The master pings workers periodically. No response: the worker is marked as failed. Its completed map tasks are reset to idle state so they can be restarted, because their results (local to the failed worker) are lost.
- Completed reduce tasks do not need to be restarted (their output is stored in the global file system). Reduce tasks are notified of the re-executed map tasks, so they can read not-yet-read data from the new locations.
25 Fault Tolerance: Master
- The master writes checkpoints.
- There is only one master, so less chance of failure.
- If the master fails, the MapReduce task aborts.
26 Backup tasks
- A common cause of slowdown: a straggler, a machine that takes a lot of time because it is very busy.
- The master schedules backup executions of the remaining in-progress tasks. A task is marked completed when whichever copy of it finishes first.
- Smart mechanism, but it needs tuning.
- E.g. sort is 44% slower if the backup mechanism is not used.
27 File names as keys
- MapReduce programs take an input DIRECTORY and produce an output DIRECTORY.
- The files in the input directory are broken into almost equal shards and handed to the mappers.
- The default key value pair is (byte offset of the first char of the line, line content).
- The byte offset allows quick file access.
- What if we want the file name as key?
- We have to write our own RecordReader.
28 Steps towards an ls in MapReduce
- Created WholeFileRecordReader.java
- It implements RecordReader<Text,Text>; Text implements both Writable and WritableComparable.
- The user driver (here ls_driver.java) calls the runJob driver that, in order to put shards together, calls the RecordReader.
- ls_driver specifies the inputFormat to be MultiFileContentInputFormat, which specifies Text for the input and output format and returns our RecordReader, WholeFileRecordReader.
- Eclipse produced the method stubs.
- Most methods are straightforward.
- The interesting one is next (produce the next record), sketched below.
- Our next produces <fileName, fileSize> or <fileName, content>.
- Probably better: <fileName, path>, so the parallel mappers read the content themselves.
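- A sketch of what such a next might look like (old mapred API; assumes each split covers one whole, unsplit file; the class name follows the slide, but the body is illustrative, not the course code):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.RecordReader;

    // Emits exactly one <fileName, fileContent> record per input split.
    public class WholeFileRecordReader implements RecordReader<Text, Text> {
        private final FileSplit split;
        private final Configuration conf;
        private boolean processed = false;

        public WholeFileRecordReader(FileSplit split, Configuration conf) {
            this.split = split;
            this.conf = conf;
        }

        public Text createKey() { return new Text(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return processed ? split.getLength() : 0; }
        public float getProgress() { return processed ? 1.0f : 0.0f; }
        public void close() throws IOException { }

        public boolean next(Text key, Text value) throws IOException {
            if (processed) return false;               // one record per file
            Path file = split.getPath();
            key.set(file.getName());                   // key = file name
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = fs.open(file);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                in.close();
            }
            value.set(contents, 0, contents.length);   // value = whole file content
            processed = true;
            return true;
        }
    }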