Title: Creating MapReduce Programs Using Hadoop
1 Creating Map-Reduce Programs Using Hadoop
2 Presentation Overview
- Recall Hadoop
- Overview of the map-reduce paradigm
- Elaboration on the WordCount example
  - components of Hadoop that make WordCount possible
- Major new example: N-Gram Generator
  - step-by-step assembly of this map-reduce job
- Design questions to ask when creating your own Hadoop jobs
3 Recall why Hadoop rocks
- Hadoop is
  - free and open source
  - high quality, like all Apache Foundation projects
  - cross-platform (pure Java)
  - fault-tolerant
  - highly scalable
  - supplied with bindings for non-Java programming languages
  - applicable to many computational problems
4 Map-Reduce System Overview
- JobTracker: makes scheduling decisions
- TaskTracker: manages tasks for a given node
- Task process
  - Runs an individual map or reduce fragment for a given job
  - Forks from the TaskTracker
- Processes communicate by a custom RPC implementation
  - Easy to change/extend
  - Defined as Java interfaces
  - Server objects implement the interface
  - Client proxy objects are created automatically
- All messages originate at the client (e.g., Task to TaskTracker)
  - Prevents cycles and therefore deadlocks
5 Process Flow Diagram
6 Application Overview
- Launching program
  - Creates a JobConf to define a job
  - Submits the JobConf to the JobTracker and waits for completion
- Mapper
  - Is given a stream of (key1, value1) pairs
  - Generates a stream of (key2, value2) pairs
- Reducer
  - Is given a key2 and a stream of value2s
  - Generates a stream of (key3, value3) pairs
7 Job Launch Process: Client
- Client program creates a JobConf
- Identify classes implementing the Mapper and Reducer interfaces
  - JobConf.setMapperClass()
  - JobConf.setReducerClass()
- Specify input and output formats
  - JobConf.setInputFormat(TextInputFormat.class)
  - JobConf.setOutputFormat(TextOutputFormat.class)
- Other options too
  - JobConf.setNumReduceTasks()
  - JobConf.setOutputFormat()
  - Many, many more (Facade pattern)
8 An onslaught of terminology
- We'll explain these terms, each of which plays a role in any non-trivial map/reduce job
  - InputFormat, OutputFormat, FileInputFormat, ...
  - JobClient and JobConf
  - JobTracker and TaskTracker
  - TaskRunner, MapTaskRunner, MapRunner, ...
  - InputSplit, RecordReader, LineRecordReader, ...
  - Writable, WritableComparable, IntWritable, ...
9 InputFormat and OutputFormat
- The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application.
- InputFormat
  - Splits the input to determine the input to each map task
  - Defines a RecordReader that reads (key, value) pairs that are passed to the map task
- OutputFormat
  - Given the (key, value) pairs and a filename, writes the reduce task's output to persistent store
10 Example
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
11 Job Launch Process: JobClient
- Pass the JobConf to JobClient.runJob() or JobClient.submitJob()
  - runJob() blocks, waiting until the job finishes
  - submitJob() does not block
    - Poll for status to make running decisions (see the sketch below)
    - Avoid polling with JobConf.setJobEndNotificationURI()
- JobClient
  - Determines the proper division of input into InputSplits
  - Sends job data to the master JobTracker server
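A minimal sketch of the two submission styles using the classic org.apache.hadoop.mapred API (illustrative only; the class name and the five-second polling interval are arbitrary choices):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitStyles {
    // Blocking: submits the job and waits, reporting progress as it runs.
    static void runBlocking(JobConf conf) throws Exception {
        JobClient.runJob(conf);
    }

    // Non-blocking: submits the job and polls for completion ourselves.
    static void runPolling(JobConf conf) throws Exception {
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);
        while (!job.isComplete()) {
            Thread.sleep(5000);               // poll every 5 seconds (arbitrary)
        }
        if (!job.isSuccessful()) {
            throw new RuntimeException("Job failed");
        }
    }
}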
12 Job Launch Process: JobTracker
- JobTracker
  - Inserts the jar and JobConf (serialized to XML) in a shared location
  - Posts a JobInProgress to its run queue
13 Job Launch Process: TaskTracker
- TaskTrackers running on slave nodes periodically query the JobTracker for work
- Retrieve the job-specific jar and config
- Launch the task in a separate instance of Java
  - main() is provided by Hadoop
14 Job Launch Process: Task
- TaskTracker.Child.main()
  - Sets up the child TaskInProgress attempt
  - Reads the XML configuration
  - Connects back to the necessary MapReduce components via RPC
  - Uses TaskRunner to launch the user process
15 Job Launch Process: TaskRunner
- TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper
- The Task knows ahead of time which InputSplits it should be mapping
- It calls the Mapper once for each record retrieved from the InputSplit
- Running the Reducer is much the same
16 Creating the Mapper
- You provide the instance of Mapper
  - Should extend MapReduceBase
  - Implement interface Mapper<K1, V1, K2, V2>
- One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
- It exists in a separate process from all other instances of Mapper: no data sharing!
17 Mapper
- Override the function map()
  void map(WritableComparable key,
           Writable value,
           OutputCollector output,
           Reporter reporter)
- Emit (k2, v2) pairs with output.collect(k2, v2)
18 Example
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
19 What is Writable?
- Hadoop defines its own box classes for strings (Text), integers (IntWritable), etc.
- All values are instances of Writable
- All keys are instances of WritableComparable
- (a small illustration follows below)
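A tiny, self-contained illustration of the box classes (the class name and example values are made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo {
    public static void main(String[] args) {
        Text word = new Text("hadoop");           // wraps a String; implements WritableComparable
        IntWritable count = new IntWritable(42);  // wraps an int; implements WritableComparable
        count.set(count.get() + 1);               // box classes are mutable and can be reused
        System.out.println(word + "\t" + count);  // prints: hadoop	43
    }
}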
20 Reading data
- Data sets are specified by InputFormats
  - Defines the input data (e.g., a directory)
  - Identifies the partitions of the data that form an InputSplit
  - Factory for RecordReader objects to extract (k, v) records from the input source
21 FileInputFormat and friends
- TextInputFormat: treats each \n-terminated line of a file as a value
- KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v"
- SequenceFileInputFormat: binary file of (k, v) pairs with some additional metadata
- SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())
22 Filtering File Inputs
- FileInputFormat will read all files out of a specified directory and send them to the mapper
- It delegates filtering of this file list to a method subclasses may override
  - e.g., create your own xyzFileInputFormat to read .xyz files from a directory list (see the sketch below)
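One way to express such a filter (a sketch; the class name is made up, and it assumes your Hadoop version provides FileInputFormat.setInputPathFilter for registration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only files whose names end in ".xyz".
public class XyzPathFilter implements PathFilter {
    public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
    }
}

// In the driver (hypothetical registration):
//   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);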
23Record Readers
Without a RecordReader, Hadoop would be forced to
divide input on byte boundaries. Each InputFormat
provides its own RecordReader implementation Provi
des capability multiplexing LineRecordReader
Reads a line from a text file KeyValueRecordReader
Used by KeyValueTextInputFormat
24Input Split Size
FileInputFormat will divide large files into
chunks Exact size controlled by
mapred.min.split.size RecordReaders receive
file, offset, and length of chunk Custom
InputFormat implementations may override split
size e.g., NeverChunkFile
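A minimal sketch of a "never chunk" input format in the classic API (the class name is illustrative, not the NeverChunkFile mentioned on the slide):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Each input file becomes exactly one split, no matter how large it is.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // never divide a file into chunks
    }
}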
25Sending Data To Reducers
Map function receives OutputCollector
object OutputCollector.collect() takes (k, v)
elements Any (WritableComparable, Writable) can
be used
26WritableComparator
Compares WritableComparable data Will call
WritableComparable.compare()? Can provide fast
path for serialized data Explicitly stated in
JobConf setup JobConf.setOutputValueGroupingCompar
ator()?
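A hedged sketch of a custom comparator (the class name and ordering are made up; a real fast path would also override the raw byte-level compare()):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts Text keys by length instead of lexicographically (illustrative only).
public class LengthComparator extends WritableComparator {
    public LengthComparator() {
        super(Text.class, true);   // true: deserialize keys so compare() receives objects
    }

    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Text) a).getLength() - ((Text) b).getLength();
    }
}

// In the driver:
//   conf.setOutputKeyComparatorClass(LengthComparator.class);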
27Sending Data To The Client
Reporter object sent to Mapper allows simple
asynchronous feedback incrCounter(Enum key, long
amount) setStatus(String msg)? Allows
self-identification of input InputSplit
getInputSplit()?
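A small illustrative mapper using these calls (class, counter, and field names are made up):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    // Hypothetical counter enum; the name is illustrative.
    public enum MyCounters { EMPTY_LINES }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        if (value.getLength() == 0) {
            reporter.incrCounter(MyCounters.EMPTY_LINES, 1);           // bump a job-wide counter
            return;
        }
        reporter.setStatus("processing byte offset " + key.get());     // progress message
        output.collect(value, new IntWritable(1));
    }
}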
28Partitioner
int getPartition(key, val, numPartitions)? Outputs
the partition number for a given key One
partition values sent to one Reduce
task HashPartitioner used by default Uses
key.hashCode() to return partition num JobConf
sets Partitioner implementation
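An illustrative custom Partitioner (the class name and partitioning scheme are made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends keys starting with the same first letter to the same reduce task.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }   // no configuration needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        int letter = s.isEmpty() ? 0 : s.charAt(0);
        return (letter & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
//   conf.setPartitionerClass(FirstLetterPartitioner.class);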
29Reducer
reduce( WritableComparable key, Iterator
values, OutputCollector output, Reporter
reporter)? Keys values sent to one partition
all go to the same reduce task Calls are sorted
by key earlier keys are reduced and output
before later keys
30Example
- public static class Reduce extends MapReduceBase
implements ReducerltText, IntWritable, Text,
IntWritablegt - public void reduce(Text key, IteratorltIntWritablegt
values, OutputCollectorltText, IntWritablegt
output, Reporter reporter) throws IOException - int sum 0
- while (values.hasNext())
- sum values.next().get()
-
- output.collect(key, new IntWritable(sum))
-
31OutputFormat
Analogous to InputFormat TextOutputFormat
Writes key val\n strings to output
file SequenceFileOutputFormat Uses a binary
format to pack (k, v) pairs NullOutputFormat
Discards output
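An illustrative driver fragment that switches a job to binary output (the class name and output path are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class OutputFormatExample {
    static void useBinaryOutput(JobConf conf) {
        conf.setOutputFormat(SequenceFileOutputFormat.class);           // pack (k, v) in a binary file
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(conf, new Path("output-seq"));   // illustrative path
    }
}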
32 Presentation Overview
- Recall Hadoop
- Overview of the map-reduce paradigm
- Elaboration on the WordCount example
  - components of Hadoop that make WordCount possible
- Major new example: N-Gram Generator
  - step-by-step assembly of this map-reduce job
- Design questions to ask when creating your own Hadoop jobs
33 Major example: N-Gram Generation
- N-Gram is a common natural language processing technique (used by Google, etc.)
- An N-gram is a subsequence of N items in a given sequence (i.e., a subsequence of words in a given text)
- Example 3-grams (from Google) with corresponding occurrences:
  - ceramics collectables collectibles (55)
  - ceramics collectables fine (130)
  - ceramics collected by (52)
  - ceramics collectible pottery (50)
  - ceramics collectibles cooking (45)
- Generate N-Grams through MapReduce!
34 Understanding the process
- Someone wise said, "A week of writing code saves an hour of research."
- Before embarking on developing a Hadoop job, walk through the process step by step manually and understand the flow and manipulation of data.
- Once you can comfortably (and deterministically!) do it mentally, begin writing code.
35 Requirements
- Input
  - a beginning word/phrase
  - the n-gram size (bigram, trigram, n-gram)
  - the minimum number of occurrences (frequency)
  - whether letter case matters
- Output: all possible n-grams that occur sufficiently frequently
36 High-level view of data flow
- Given one or more files containing regular text.
- Look for the desired startword. If seen, take the next N-1 words and add the group to the database.
- Similarly to word count, find the number of occurrences of each N-gram.
- Remove those N-grams that do not occur frequently enough for our liking.
37 Follow along
- The N-grams implementation exists and is ready for your perusal.
- Grab it
  - if you use Git revision control:
    - git clone git://git.qnan.org/pmw/hadoop-ngram
  - to get the files with your browser, go to:
    - http://www.qnan.org/pmw/software/hadoop-ngram
- We used Project Gutenberg ebooks as input.
38 Follow along
- Start Hadoop
  - bin/start-all.sh
- Grab the NGram code and build it
  - Type "ant" and all will be built
- Look at the README to see how to run it.
- Load some text files into your HDFS
  - good source: http://www.gutenberg.org
- Run it yourself (or see me do it) before we proceed.
39 Can we just use WordCount?
- We have the WordCount example that does a similar thing, but there are differences:
  - We don't want to count the number of times our startword appears; we want to capture the subsequent words too.
  - A more subtle problem is that WordCount maps one line at a time. That's a problem if we want 3-grams with a startword of "pillows" in the book containing this:
    - "The guests stayed in the guest bedroom; the pillows were
    - delightfully soft and had a faint scent of mint."
- Still, WordCount is a good foundation for our code.
40 Steps we must perform
- Read our text in paragraphs rather than in discrete lines
  - RecordReader
  - InputFormat
- Develop the mapper and reducer classes
  - first mapper: find the startword, get the next N-1 words, and return <N-gram, 1>
  - first reducer: sum the number of occurrences of each N-gram
  - second mapper: no action
  - second reducer: discard N-grams that are too rare
- Driver program
41 A new RecordReader
- Ours must implement RecordReader<K, V>
- Must contain certain functions: createKey(), createValue(), getPos(), getProgress(), next()
- Hadoop offers a LineRecordReader but no support for paragraphs
- We'll need a ParagraphRecordReader
  - Use the Delegation pattern instead of extending LineRecordReader. We couldn't extend it because it has private members.
  - Create a new next() function
42 ParagraphRecordReader.next()
public synchronized boolean next(LongWritable key, Text value) throws IOException {
    Text linevalue = new Text();
    boolean appended, gotsomething;
    boolean retval;
    byte[] space = { ' ' };
    value.clear();
    gotsomething = false;
    do {
        appended = false;
        retval = lrr.next(key, linevalue);    // delegate to the wrapped LineRecordReader
        if (retval) {
            if (linevalue.toString().length() > 0) {
                byte[] rawline = linevalue.getBytes();
                int rawlinelen = linevalue.getLength();
                value.append(rawline, 0, rawlinelen);
                value.append(space, 0, 1);
                appended = true;
            }
            gotsomething = true;
        }
    } while (appended);
    // System.out.println("ParagraphRecordReader::next() returns " + gotsomething +
    //                    " after setting value to " + value.toString());
    return gotsomething;
}
43 A new InputFormat
- Given to the JobTracker during execution
- getRecordReader() method
  - This is why we need our own InputFormat
  - Must return our ParagraphRecordReader
44 ParagraphInputFormat
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {
    private CompressionCodecFactory compressionCodecs = null;

    public void configure(JobConf conf) {
        compressionCodecs = new CompressionCodecFactory(conf);
    }

    protected boolean isSplitable(FileSystem fs, Path file) {
        return compressionCodecs.getCodec(file) == null;
    }

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParagraphRecordReader(job, (FileSplit) genericSplit);
    }
}
45 First stage: Find Mapper
- Define the startword at startup
- Each time map() is called we parse an entire paragraph and output matching N-Grams
- Tell the Reporter how far done we are to track progress
- Output <N-Gram, 1>, like WordCount
  - output.collect(ngram, new IntWritable(1))
- This last part is important... the next slide explains.
46 Importance of output.collect()
- Remember Hadoop's data type model
  - map: (K1, V1) → list(K2, V2)
- This means that for every single (K1, V1) tuple, the map stage can output zero, one, two, or any other number of tuples, and they don't have to match the input at all.
- Example
  - output.collect(ngram, new IntWritable(1))
  - output.collect(good-ol'-ngram, new IntWritable(0))
- (a sketch of such a Find-stage map() follows below)
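A hedged sketch of what such a Find-stage map() might look like (illustrative only, not the actual hadoop-ngram source; it lives inside the Find mapper class and assumes the desiredPhrase, Nvalue, and caseSensitive fields set up in configure() on the next slide):

public void map(LongWritable key, Text paragraph,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    String text = caseSensitive ? paragraph.toString() : paragraph.toString().toLowerCase();
    String[] words = text.split("\\s+");
    for (int i = 0; i < words.length; i++) {
        if (words[i].equals(desiredPhrase)) {            // found the start word
            if (i + Nvalue <= words.length) {            // enough words left in the paragraph?
                StringBuilder ngram = new StringBuilder(words[i]);
                for (int j = 1; j < Nvalue; j++) {
                    ngram.append(' ').append(words[i + j]);
                }
                output.collect(new Text(ngram.toString()), new IntWritable(1));
            }
            // else: the paragraph ended early, a partial N-gram (see the Counters slide)
        }
    }
}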
47 Find Mapper
- Our mapper must have a configure() method
- We can pass primitives through the JobConf

public void configure(JobConf conf) {
    desiredPhrase = conf.get("mapper.desired-phrase");
    Nvalue = conf.getInt("mapper.N-value", 3);
    caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
}
48 Find Reducer
- Like the WordCount example
- Sum all the numbers matching our N-Gram
- Output <N-Gram, number of occurrences>
49 Second stage: Prune Mapper
- Parse each line from the previous output and divide it into key/value pairs
Prune Reducer
- This way we can sort our elements by frequency
- If this N-Gram occurs fewer times than our minimum, trim it out
- (an illustrative sketch of both follows below)
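A hedged sketch of the Prune stage (illustrative only, not the actual hadoop-ngram source; it assumes the Find job's output was written by TextOutputFormat as "ngram<TAB>count" lines, and a minFreq field read from "reducer.min-freq" in the reducer's configure()):

// Prune mapper: re-parse the lines written by the Find job into (ngram, count) pairs.
public void map(LongWritable key, Text line,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    String[] parts = line.toString().split("\t");
    if (parts.length == 2) {
        output.collect(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
    }
}

// Prune reducer: drop N-grams below the configured minimum frequency.
public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output,
                   Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    if (sum >= minFreq) {               // minFreq: assumed field set in configure()
        output.collect(key, new IntWritable(sum));
    }
}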
50 Piping data between M/R jobs
- How does the Find map/reduce job pass its results to the Prune map/reduce job?
- I create a temporary file within HDFS. This temporary file is used as the output of Find and the input of Prune.
- At the end, I delete the temporary file.
51 Counters
- The N-Gram generator has one programmer-defined counter: the number of partial/incomplete N-grams. These occur when a paragraph ends before we can read the N-1 subsequent words.
- We can add as many counters as we want.
52 JobConf
- We need to set everything up
- 2 jobs executing in series: Find and Prune
- User input parameters
  - Starting N-Gram word/phrase
  - N-Gram size
  - Minimum frequency for pruning

JobConf ngram_find_conf = new JobConf(getConf(), NGram.class),
        ngram_prune_conf = new JobConf(getConf(), NGram.class);
53 Find JobConf
- Now we can plug everything in
- Also pass input parameters
- And point to our input and output files

ngram_find_conf.setJobName("ngram-find");
ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
ngram_find_conf.setOutputKeyClass(Text.class);
ngram_find_conf.setOutputValueClass(IntWritable.class);
ngram_find_conf.setMapperClass(FindJob_MapClass.class);
ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);

ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));
ngram_find_conf.setInt("mapper.N-value", new Integer(other_args.get(3)).intValue());
ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);

FileInputFormat.setInputPaths(ngram_find_conf, other_args.get(0));
FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);
54 Prune JobConf
- Perform setup as before
- We need to point our inputs to the outputs of the previous job

ngram_prune_conf.setJobName("ngram-prune");
ngram_prune_conf.setInt("reducer.min-freq", min_freq);
ngram_prune_conf.setOutputKeyClass(Text.class);
ngram_prune_conf.setOutputValueClass(IntWritable.class);
ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);

FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);
FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));
55 Execute Jobs
- Run as a blocking process with runJob()
- Batch processing is done in series

JobClient.runJob(ngram_find_conf);
JobClient.runJob(ngram_prune_conf);
56 Design questions to ask
- From where will my input come?
  - InputFormat (e.g., a FileInputFormat subclass)
- How is my input structured?
  - RecordReader
- (There are already several common InputFormats and RecordReaders. Don't reinvent the wheel.)
- Mapper and Reducer classes
  - Do the Key (WritableComparable) and Value (Writable) classes exist?
- Do I need to count anything while the job is in progress?
- Where is my output going?
- Executor class
  - What information do my map/reduce classes need?
  - Must I block, waiting for job completion?
  - Which file formats must I set?