Creating Map-Reduce Programs Using Hadoop - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Creating Map-Reduce Programs Using Hadoop
2
Presentation Overview
  • Recall Hadoop
  • Overview of the map-reduce paradigm
  • Elaboration on the WordCount example
  • components of Hadoop that make WordCount possible
  • Major new example: N-Gram Generator
  • step-by-step assembly of this map-reduce job
  • Design questions to ask when creating your own
    Hadoop jobs

3
Recall why Hadoop rocks
  • Hadoop is
  • Free and open source
  • high quality, like all Apache Foundation projects
  • cross-platform (pure Java)
  • fault-tolerant
  • highly scalable
  • has bindings for non-Java programming languages
  • applicable to many computational problems

4
Map-Reduce System Overview
  • JobTracker: Makes scheduling decisions
  • TaskTracker: Manages tasks for a given node
  • Task process:
  • Runs an individual map or reduce fragment for a
    given job
  • Forks from the TaskTracker

5
Map-Reduce System Overview
  • Processes communicate by custom RPC
    implementation
  • Easy to change/extend
  • Defined as Java interfaces
  • Server objects implement the interface
  • Client proxy objects automatically created
  • All messages originate at the client (e.g., Task
    to TaskTracker)
  • Prevents cycles and therefore deadlocks

6
Process Flow Diagram
7
Application Overview
  • Launching Program
  • Creates a JobConf to define a job.
  • Submits JobConf to JobTracker and waits for
    completion.
  • Mapper
  • Is given a stream of (key1, value1) pairs
  • Generates a stream of (key2, value2) pairs
  • Reducer
  • Is given a key2 and a stream of value2s
  • Generates a stream of (key3, value3) pairs

8
Job Launch Process: Client
  • Client program creates a JobConf
  • Identify classes implementing Mapper and Reducer
    interfaces
  • JobConf.setMapperClass(), JobConf.setReducerClass()
  • Specify input and output formats
  • JobConf.setInputFormat(TextInputFormat.class)
  • JobConf.setOutputFormat(TextOutputFormat.class)
  • Other options too
  • JobConf.setNumReduceTasks()
  • JobConf.setOutputFormat()
  • Many, many more (Facade pattern)

9
An onslaught of terminology
  • We'll explain these terms, each of which plays a
    role in any non-trivial map/reduce job
  • InputFormat, OutputFormat, FileInputFormat, ...
  • JobClient and JobConf
  • JobTracker and TaskTracker
  • TaskRunner, MapTaskRunner, MapRunner, ...
  • InputSplit, RecordReader, LineRecordReader, ...
  • Writable, WritableComparable, IntWritable, ...

10
InputFormat and OutputFormat
  • The application also chooses input and output
    formats, which define how the persistent data is
    read and written. These are interfaces and can be
    defined by the application.
  • InputFormat
  • Splits the input to determine the input to each
    map task.
  • Defines a RecordReader that reads (key, value)
    pairs that are passed to the map task
  • OutputFormat
  • Given the (key, value) pairs and a filename, writes
    the reduce task output to persistent store.

11
Example
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}

12
Job Launch Process: JobClient
  • Pass JobConf to JobClient.runJob() or
    JobClient.submitJob()
  • runJob() blocks: waits until the job finishes
  • submitJob() does not
  • Poll for status to make running decisions
  • Avoid polling with
    JobConf.setJobEndNotificationURI()
  • JobClient
  • Determines proper division of input into
    InputSplits
  • Sends job data to master JobTracker server

13
Job Launch Process: JobTracker
  • JobTracker
  • Inserts jar and JobConf (serialized to XML) in
    shared location
  • Posts a JobInProgress to its run queue

14
Job Launch Process: TaskTracker
  • TaskTrackers running on slave nodes periodically
    query the JobTracker for work
  • Retrieve job-specific jar and config
  • Launch the task in a separate instance of Java (JVM)
  • main() is provided by Hadoop

15
Job Launch Process: Task
  • TaskTracker.Child.main()
  • Sets up the child TaskInProgress attempt
  • Reads XML configuration
  • Connects back to necessary MapReduce components
    via RPC
  • Uses TaskRunner to launch user process

16
Job Launch Process: TaskRunner
  • TaskRunner, MapTaskRunner, MapRunner work in a
    daisy-chain to launch your Mapper
  • Task knows ahead of time which InputSplits it
    should be mapping
  • Calls Mapper once for each record retrieved from
    the InputSplit
  • Running the Reducer is much the same

17
Creating the Mapper
  • You provide the instance of Mapper
  • Should extend MapReduceBase
  • Implement interface Mapper<K1, V1, K2, V2>
  • One instance of your Mapper is initialized by the
    MapTaskRunner for a TaskInProgress
  • Exists in a separate process from all other
    instances of Mapper: no data sharing!

18
Mapper
  • Override function map():

void map(WritableComparable key,
         Writable value,
         OutputCollector output,
         Reporter reporter)

Emit (k2, v2) with output.collect(k2, v2)
19
Example
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

20
What is Writable?
  • Hadoop defines its own box classes for strings
    (Text), integers (IntWritable), etc.
  • All values are instances of Writable
  • All keys are instances of WritableComparable
    (a sketch of a custom key type follows below)
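For illustration only (not from the original slides), here is a minimal sketch of a custom box class: a hypothetical IntPairWritable key implementing the Writable/WritableComparable contract by hand.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type: a pair of ints, shown only to illustrate
// the Writable/WritableComparable contract.
public class IntPairWritable implements WritableComparable<IntPairWritable> {
  private int first;
  private int second;

  public void set(int first, int second) { this.first = first; this.second = second; }

  // Serialization: write the fields in a fixed order...
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }

  // ...and read them back in exactly the same order.
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }

  // Keys must be comparable so the framework can sort them between map and reduce.
  public int compareTo(IntPairWritable o) {
    if (first != o.first) return first < o.first ? -1 : 1;
    if (second != o.second) return second < o.second ? -1 : 1;
    return 0;
  }

  // hashCode() matters too: HashPartitioner uses it to route keys to reducers.
  public int hashCode() { return first * 163 + second; }
  public boolean equals(Object o) {
    return (o instanceof IntPairWritable)
        && ((IntPairWritable) o).first == first
        && ((IntPairWritable) o).second == second;
  }
}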

21
Reading data
  • Data sets are specified by InputFormats
  • Defines input data (e.g., a directory)
  • Identifies partitions of the data that form an
    InputSplit
  • Factory for RecordReader objects to extract (k,
    v) records from the input source

22
FileInputFormat and friends
  • TextInputFormat: Treats each \n-terminated
    line of a file as a value
  • KeyValueTextInputFormat: Maps \n-terminated
    text lines of "key SEP value"
  • SequenceFileInputFormat: Binary file of (k, v)
    pairs with some additional metadata
  • SequenceFileAsTextInputFormat: Same, but maps
    (k.toString(), v.toString())

23
Filtering File Inputs
FileInputFormat will read all files out of a
specified directory and send them to the mapper.
It delegates filtering of this file list to a
method that subclasses may override, e.g., create
your own xyzFileInputFormat to read *.xyz files
from the directory list (see the sketch below).
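A rough sketch of that idea (illustrative, not taken from the deck's code): the filter itself is a small PathFilter; how it is wired in depends on your Hadoop version, e.g., via FileInputFormat.setInputPathFilter() where available, or from inside your own FileInputFormat subclass when it lists its input files.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter that accepts only *.xyz input files.
public class XyzPathFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");
  }
}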
24
Record Readers
Without a RecordReader, Hadoop would be forced to
divide input on byte boundaries. Each InputFormat
provides its own RecordReader implementation,
which provides capability multiplexing.
LineRecordReader: reads a line from a text file.
KeyValueRecordReader: used by
KeyValueTextInputFormat.
25
Input Split Size
FileInputFormat will divide large files into
chunks. The exact size is controlled by
mapred.min.split.size. RecordReaders receive the
file, offset, and length of the chunk. Custom
InputFormat implementations may override the split
size, e.g., a NeverChunkFile format (sketched below).
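A minimal sketch of the NeverChunkFile idea, assuming the old mapred API used throughout this deck: reuse TextInputFormat but refuse to split files, so each input file becomes exactly one InputSplit (and thus one map task).

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Never split a file: one whole file per map task.
public class NeverChunkFileInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}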
26
Sending Data To Reducers
The map function receives an OutputCollector
object. OutputCollector.collect() takes (k, v)
elements. Any (WritableComparable, Writable) pair
can be used.
27
WritableComparator
Compares WritableComparable data. Will call
WritableComparable.compareTo(). Can provide a fast
path for serialized data. Explicitly stated in the
JobConf setup:
JobConf.setOutputValueGroupingComparator()
(a custom comparator is sketched below).
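A hedged illustration (not part of the deck): a comparator can subclass WritableComparator and be registered in the JobConf; the grouping variant is registered the same way with setOutputValueGroupingComparator().

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator that sorts IntWritable keys in descending order.
public class DescendingIntComparator extends WritableComparator {
  public DescendingIntComparator() {
    super(IntWritable.class, true);   // true: deserialize keys before comparing
  }
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);      // invert the natural ordering
  }
}

// In the driver:
//   conf.setOutputKeyComparatorClass(DescendingIntComparator.class);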
28
Sending Data To The Client
The Reporter object sent to the Mapper allows
simple asynchronous feedback:
incrCounter(Enum key, long amount)
setStatus(String msg)
It also allows self-identification of the input:
InputSplit getInputSplit()
29
Partitioner
int getPartition(key, val, numPartitions): outputs
the partition number for a given key. One
partition: values sent to one Reduce task.
HashPartitioner is used by default; it uses
key.hashCode() to return the partition number.
JobConf sets the Partitioner implementation
(a custom Partitioner is sketched below).
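A minimal sketch of a custom Partitioner under the old mapred API (the class below is illustrative, not from the deck):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: route each Text key by its first character,
// so all keys starting with the same letter land in the same reduce task.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }   // nothing to configure

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    int c = s.length() == 0 ? 0 : Character.toLowerCase(s.charAt(0));
    return (c & Integer.MAX_VALUE) % numPartitions;
  }
}

// Selected in the driver with: conf.setPartitionerClass(FirstLetterPartitioner.class);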
30
Reducer
reduce(WritableComparable key, Iterator values,
OutputCollector output, Reporter reporter).
Keys and values sent to one partition all go to
the same reduce task. Calls are sorted by key:
earlier keys are reduced and output before later
keys.
31
Example
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

32
OutputFormat
Analogous to InputFormat. TextOutputFormat:
writes "key val\n" strings to the output file.
SequenceFileOutputFormat: uses a binary format to
pack (k, v) pairs. NullOutputFormat: discards
output.
33
Presentation Overview
  • Recall Hadoop
  • Overview of the map-reduce paradigm
  • Elaboration on the WordCount example
  • components of Hadoop that make WordCount possible
  • Major new example: N-Gram Generator
  • step-by-step assembly of this map-reduce job
  • Design questions to ask when creating your own
    Hadoop jobs

34
Major example N-Gram Generation
  • N-Gram is a common natural language processing
    technique (used by Google, etc.)
  • An N-Gram is a subsequence of N items in a given
    sequence (e.g., a subsequence of words in a given
    text)
  • Example 3-grams (from Google) with corresponding
    occurrence counts:
  • ceramics collectables collectibles (55)
  • ceramics collected by (52)
  • ceramics collectibles cooking (45)

35
Understanding the process
  • Someone wise said, "A week of writing code saves
    an hour of research."
  • Before embarking on developing a Hadoop job, walk
    through the process step by step manually and
    understand the flow and manipulation of data.
  • Once you can comfortably (and deterministically!)
    do it mentally, begin writing code.

36
Requirements
  • Input
  • a beginning word/phrase
  • n-gram size (bigram, trigram, n-gram)
  • the minimum number of occurrences (frequency)
  • whether letter case matters
  • Output: all possible n-grams that occur
    sufficiently frequently.

37
High-level view of data flow
  • Given one or more files containing regular text.
  • Look for the desired startword. If seen, take
    the next N-1 words and add the group to the
    database.
  • Similarly to word count, find the number of
    occurrences of each N-gram.
  • Remove those N-grams that do not occur frequently
    enough for our liking.

38
Follow along
  • The N-grams implementation exists and is ready
    for your perusal.
  • Grab it:
  • if you use Git revision control:
  • git clone git://git.qnan.org/pmw/hadoop-ngram
  • to get the files with your browser, go to
  • http://www.qnan.org/pmw/software/hadoop-ngram
  • We used Project Gutenberg ebooks as input.

39
Follow along
  • Start Hadoop
  • bin/start-all.sh
  • Grab the NGram code and build it
  • Type ant and all will be built
  • Look at the README to see how to run it.
  • Load some text files into your HDFS
  • good source: http://www.gutenberg.org
  • Run it yourself (or see me do it) before we
    proceed.

40
Can we just use WordCount?
  • We have the WordCount example that does a similar
    thing. But there are differences:
  • We don't want to count the number of times our
    startword appears; we want to capture the
    subsequent words too.
  • A more subtle problem is that WordCount maps one
    line at a time. That's a problem if we want
    3-grams with a startword of "pillows" in the book
    containing this:
  • "The guests stayed in the guest bedroom; the
    pillows were
  • delightfully soft and had a faint scent of mint."
  • Still, WordCount is a good foundation for our
    code.

41
Steps we must perform
  • Read our text in paragraphs rather than in
    discrete lines
  • RecordReader
  • InputFormat
  • Develop the mapper and reducer classes
  • first mapper: find startword, get the next N-1
    words, and return <N-gram, 1>
  • first reducer: sum the number of occurrences of
    each N-gram
  • second mapper: no action
  • second reducer: discard N-grams that are too rare
  • Driver program

42
A new RecordReader
  • Ours must implement RecordReader<K, V>
  • Contains certain functions: createKey(),
    createValue(), getPos(), getProgress(), next()
  • Hadoop offers a LineRecordReader but no support
    for paragraphs
  • We'll need a ParagraphRecordReader
  • Use the Delegation Pattern instead of extending
    LineRecordReader. We couldn't extend it because
    it has private elements.
  • Create a new next() function

43
public synchronized boolean next(LongWritable key, Text value) throws IOException {
  Text linevalue = new Text();
  boolean appended, gotsomething;
  boolean retval;
  byte[] space = { ' ' };
  value.clear();
  gotsomething = false;
  do {
    appended = false;
    retval = lrr.next(key, linevalue);
    if (retval) {
      if (linevalue.toString().length() > 0) {
        byte[] rawline = linevalue.getBytes();
        int rawlinelen = linevalue.getLength();
        value.append(rawline, 0, rawlinelen);
        value.append(space, 0, 1);
        appended = true;
      }
      gotsomething = true;
    }
  } while (appended);
  // System.out.println("ParagraphRecordReader:next() returns " + gotsomething
  //     + " after setting value to " + value.toString());
  return gotsomething;
}
44
A new InputFormat
  • Given to the JobTracker during execution
  • getRecordReader method
  • This is why we need an InputFormat
  • Must return our ParagraphRecordReader

45
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text>
    implements JobConfigurable {
  private CompressionCodecFactory compressionCodecs = null;

  public void configure(JobConf conf) {
    compressionCodecs = new CompressionCodecFactory(conf);
  }

  protected boolean isSplitable(FileSystem fs, Path file) {
    return compressionCodecs.getCodec(file) == null;
  }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
      JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(genericSplit.toString());
    return new ParagraphRecordReader(job, (FileSplit) genericSplit);
  }
}
46
First stage: Find Mapper
  • Define the startword at startup
  • Each time map() is called we parse an entire
    paragraph and output matching N-Grams
  • Tell the Reporter how far along we are to track
    progress
  • Output <N-Gram, 1> like WordCount:
  • output.collect(ngram, new IntWritable(1))
  • This last part is important... the next slide
    explains.

47
Importance of output.collect()
  • Remember Hadoop's data type model:
  • map: (K1, V1) → list(K2, V2)
  • This means that for every single (K1, V1) tuple,
    the map stage can output zero, one, two, or any
    other number of tuples, and they don't have to
    match the input at all.
  • Example:
  • output.collect(ngram, new IntWritable(1))
  • output.collect(good-ol'-ngram, new
    IntWritable(0))

48
Find Mapper
  • Our mapper must have a configure() method
  • We can pass primitives through the JobConf
    (a sketch of the full mapper follows below)

public void configure(JobConf conf) {
  desiredPhrase = conf.get("mapper.desired-phrase");
  Nvalue = conf.getInt("mapper.N-value", 3);
  caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
}

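To make the data flow concrete, here is a simplified sketch of the whole Find mapper; the actual FindJob_MapClass in the hadoop-ngram code may differ (e.g., multi-word phrases, partial-N-gram counting, progress reporting).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Simplified Find mapper: scan each paragraph for the start word and emit <N-gram, 1>.
public class FindJob_MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private String desiredPhrase;
  private int Nvalue;
  private boolean caseSensitive;

  public void configure(JobConf conf) {
    desiredPhrase = conf.get("mapper.desired-phrase");
    Nvalue = conf.getInt("mapper.N-value", 3);
    caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // One record is a whole paragraph (thanks to ParagraphRecordReader); split it into words.
    List<String> words = new ArrayList<String>();
    StringTokenizer tok = new StringTokenizer(value.toString());
    while (tok.hasMoreTokens()) {
      words.add(tok.nextToken());
    }

    String target = caseSensitive ? desiredPhrase : desiredPhrase.toLowerCase();
    for (int i = 0; i + Nvalue <= words.size(); i++) {
      String w = caseSensitive ? words.get(i) : words.get(i).toLowerCase();
      if (w.equals(target)) {
        // Build the N-gram: the start word plus the next N-1 words.
        StringBuilder ngram = new StringBuilder(words.get(i));
        for (int j = 1; j < Nvalue; j++) {
          ngram.append(' ').append(words.get(i + j));
        }
        output.collect(new Text(ngram.toString()), ONE);
      }
    }
  }
}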
49
Find Reducer
  • Like the WordCount example
  • Sum all the numbers matching our N-Gram
  • Output <N-Gram, # of occurrences>

50
Second stage: Prune Mapper
  • Parse each line from the previous output and
    divide it into key/value pairs

Prune Reducer
This way we can sort our elements by frequency. If
an N-Gram occurs fewer times than our minimum,
trim it out. (A sketch of both classes follows
below.)
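A hedged sketch of the two Prune classes (the real PruneJob_MapClass and PruneJob_ReduceClass may differ in details): the mapper re-parses each "ngram TAB count" line written by the Find stage with TextOutputFormat, and the reducer drops N-grams below the minimum frequency.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Prune mapper: split each "ngram<TAB>count" line back into a (Text, IntWritable) pair.
public class PruneJob_MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split("\t");
    if (parts.length == 2) {
      output.collect(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1].trim())));
    }
  }
}

// Prune reducer: keep only N-grams whose total count reaches the configured minimum.
class PruneJob_ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  private int minFreq;

  public void configure(JobConf conf) {
    minFreq = conf.getInt("reducer.min-freq", 1);
  }

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    if (sum >= minFreq) {
      output.collect(key, new IntWritable(sum));
    }
  }
}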
51
Piping data between M/R jobs
  • How does the Find map/reduce job pass its
    results to the Prune map/reduce job?
  • I create a temporary file within HDFS. This
    temporary file is used as the output of Find and
    the input of Prune (sketched below).
  • At the end, I delete the temporary file.
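A minimal sketch of that wiring, reusing the JobConf variables from the surrounding slides (the temporary-directory name is an assumption):

// The intermediate data lives in a temporary HDFS directory (name chosen arbitrarily here).
Path tempDir = new Path("ngram-temp");

FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);   // Find writes here...
FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);   // ...and Prune reads from here.

try {
  JobClient.runJob(ngram_find_conf);
  JobClient.runJob(ngram_prune_conf);
} finally {
  // Remove the intermediate directory once both jobs have run.
  FileSystem.get(ngram_find_conf).delete(tempDir, true);
}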

52
Counters
  • The N-Gram generator has one programmer-defined
    counter: the number of partial/incomplete
    N-grams. These occur when a paragraph ends
    before we can read N-1 subsequent words.
  • We can add as many counters as we want (see the
    sketch below).
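For example (the enum name below is illustrative, not necessarily the one used in the hadoop-ngram code), a counter is just an enum plus an incrCounter() call from the map or reduce code:

// Declared once, e.g., inside the mapper class:
public enum NGramCounters { PARTIAL_NGRAMS }

// Bumped wherever a paragraph ends before N-1 following words are available:
reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);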

53
JobConf
  • We need to set everything up
  • 2 jobs executing in series: Find and Prune
  • User inputs parameters:
  • Starting N-Gram word/phrase
  • N-Gram size
  • Minimum frequency for pruning

JobConf ngram_find_conf = new JobConf(getConf(), NGram.class),
        ngram_prune_conf = new JobConf(getConf(), NGram.class);
54
Find JobConf
  • Now we can plug everything in
  • Also pass input parameters
  • And point to our input and output files

ngram_find_conf.setJobName("ngram-find");
ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
ngram_find_conf.setOutputKeyClass(Text.class);
ngram_find_conf.setOutputValueClass(IntWritable.class);
ngram_find_conf.setMapperClass(FindJob_MapClass.class);
ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);
ngram_find_conf.set("mapper.desired-phrase", other_args.get(2));
ngram_find_conf.setInt("mapper.N-value", new Integer(other_args.get(3)).intValue());
ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);
FileInputFormat.setInputPaths(ngram_find_conf, other_args.get(0));
FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);
55
Prune JobConf
  • Perform set up as before
  • We need to point our inputs to the outputs of the
    previous job

ngram_prune_conf.setJobName("ngram-prune");
ngram_prune_conf.setInt("reducer.min-freq", min_freq);
ngram_prune_conf.setOutputKeyClass(Text.class);
ngram_prune_conf.setOutputValueClass(IntWritable.class);
ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);
FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);
FileOutputFormat.setOutputPath(ngram_prune_conf, new Path(other_args.get(1)));
56
Execute Jobs
  • Run as a blocking process with runJob()
  • Batch processing is done in series

JobClient.runJob(ngram_find_conf);
JobClient.runJob(ngram_prune_conf);
57
Design questions to ask
  • From where will my input come?
  • InputFormat / FileInputFormat
  • How is my input structured?
  • RecordReader
  • (There are already several common InputFormats
    and RecordReaders. Don't reinvent the wheel.)
  • Mapper and Reducer classes
  • Do suitable Key (WritableComparable) and Value
    (Writable) classes exist?

58
Design questions to ask
  • Do I need to count anything while the job is in
    progress?
  • Where is my output going?
  • Executor class
  • What information do my map/reduce classes need?
    Must I block, waiting for job completion? Set
    a FileFormat?