MapReduce: Simplified Data Processing on Large Clusters - PowerPoint PPT Presentation
1
MapReduce Simplified Data Processing on Large
Clusters
  • Jeffrey Dean
  • Sanjay Ghemawat
  • Presented By Brendan Melville

2
Why Am I Doing This?
3
Model
  • Two user-defined functions
  • Map
  • Input key/value pairs → intermediate output
    key/value pairs
  • Reduce
  • Intermediate key K and all intermediate values
    associated with K → output value(s)
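The two user-defined functions above can be sketched in Python, with word count as the example program (the `run_mapreduce` driver is a hypothetical, single-machine stand-in for the library):

```python
from collections import defaultdict

def map_fn(key, value):
    # Input key/value pair -> intermediate key/value pairs.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Intermediate key K and all values for K -> output value(s).
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy driver: group intermediate pairs by key, then reduce
    # each key in sorted order (the real library distributes this).
    intermediate = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    output = {}
    for ik in sorted(intermediate):
        for ok, ov in reduce_fn(ik, intermediate[ik]):
            output[ok] = ov
    return output
```

For example, `run_mapreduce([("doc1", "a b a")], map_fn, reduce_fn)` yields `{"a": 2, "b": 1}`.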

4
Examples
  • Grep
  • Map emits a line if the pattern is matched
  • Reduce copies the data to output
  • Inverted Index
  • Map parses a document and emits <word, docId>
    pairs
  • Reduce takes all pairs for a given word, sorts
    the docId values, and emits a <word, list(docId)>
    pair
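The inverted-index example above can be sketched as follows (the tiny grouping driver and the document set are illustrative, not part of the library):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Parse a document and emit <word, docId> pairs
    # (deduplicated per document, a choice made for this sketch).
    for word in set(text.split()):
        yield word, doc_id

def reduce_fn(word, doc_ids):
    # All pairs for a word -> <word, sorted list of docIds>.
    yield word, sorted(doc_ids)

# Toy driver: group map output by word, then reduce each word.
docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
groups = defaultdict(list)
for doc_id, text in docs:
    for word, d in map_fn(doc_id, text):
        groups[word].append(d)
index = dict(kv for w in groups for kv in reduce_fn(w, groups[w]))
# index["the"] -> ["d1", "d2"]
```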

5
Execution
  • Input files are split into M pieces
  • Master and workers are assigned
  • Map tasks run
  • Intermediate data is written to local disk,
    partitioned into R regions
  • Intermediate data is read and sorted by key
  • Reduce tasks run
  • Control returns to the user program
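The partitioning step above can be sketched as follows: each intermediate pair is assigned to one of R regions by hashing its key, so reduce task r reads region r from every map worker. (R = 4, the stable CRC32 hash, and the sample pairs are illustrative choices for this sketch.)

```python
import zlib

R = 4  # number of reduce regions (illustrative value)

def partition(key, num_regions):
    # Stable hash of the key, modulo the number of regions;
    # every pair with the same key lands in the same region.
    return zlib.crc32(key.encode()) % num_regions

regions = [[] for _ in range(R)]
for key, value in [("apple", 1), ("banana", 1), ("apple", 1)]:
    regions[partition(key, R)].append((key, value))
```

Because all pairs for a given key fall in one region, a single reduce task sees every value for that key.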

6
Execution (diagram)
7
Fault Recovery
  • Workers are pinged by master periodically
  • Non-responsive workers are marked as failed
  • All tasks in-progress or completed by failed
    worker become eligible for rescheduling
  • Master could periodically checkpoint
  • Current implementations abort on master failure
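The rescheduling rules above can be sketched as follows (`task_state` and its fields are hypothetical names; note that in the paper, completed reduce tasks need not be re-run because their output is already in the global file system, while completed map output sits on the failed worker's now-unreachable local disk):

```python
def handle_failed_worker(failed, task_state):
    # Mark tasks on the failed worker as eligible for rescheduling:
    # anything in progress there, plus any map task it completed.
    for state in task_state.values():
        if state["worker"] != failed:
            continue
        if state["status"] == "in_progress" or state["kind"] == "map":
            state["status"] = "idle"
            state["worker"] = None

tasks = {
    "m1": {"kind": "map", "status": "completed", "worker": "w1"},
    "r1": {"kind": "reduce", "status": "in_progress", "worker": "w1"},
    "r2": {"kind": "reduce", "status": "completed", "worker": "w2"},
}
handle_failed_worker("w1", tasks)
# m1 and r1 become idle; r2 on the healthy worker is untouched.
```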

8
Execution Issues
  • Bandwidth
  • Scarce resource, must be conserved
  • GFS is used as the global file system
  • Master schedules tasks based on the location of
    input data
  • Straggler
  • A task that takes an unusually long time to
    complete
  • Backup copies of the remaining in-progress tasks
    are scheduled toward the end of the computation

9
Benefits
  • User code is smaller and easier to understand
  • Cluster issues (failures, network problems, slow
    machines) handled by library
  • Performance is good enough that conceptually
    unrelated computations can be kept separate

10
Hadoop
  • MapReduce and Distributed File-system framework
    for large commodity clusters
  • Master/Slave relationship (JobTracker /
    TaskTracker)
  • JobTracker handles all scheduling and data flow
    between TaskTrackers
  • TaskTracker handles all worker tasks on a node
  • Individual Task runs map or reduce operation
  • Integrates with HDFS for data locality

11
Hadoop
  • HDFS also Master/Slave (NameNode / DataNode)
  • Client in communication with both
  • Master handles replication, deletion, creation
  • Slave handles data retrieval
  • Files are stored as many blocks
  • Each block is identified by a block ID
  • Each block ID maps to several nodes
    (hostname:port), depending on the level of
    replication
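The metadata split above can be pictured with a toy model of the NameNode's tables (file paths, block IDs, node addresses, and the replication factor here are all illustrative, not HDFS internals):

```python
REPLICATION = 3  # illustrative replication level

# file path -> ordered list of block IDs
files = {"/logs/jan.log": ["blk_1", "blk_2"]}

# block ID -> hostname:port of each DataNode holding a replica
blocks = {
    "blk_1": ["dn1:50010", "dn2:50010", "dn3:50010"],
    "blk_2": ["dn2:50010", "dn3:50010", "dn4:50010"],
}

def locate(path):
    # The client asks the master for block locations, then reads
    # the bytes directly from the slaves (data retrieval).
    return [(blk, blocks[blk]) for blk in files[path]]
```

The master only serves this metadata; the client fetches block contents from the DataNodes themselves.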

12
Hadoop
  • Used by last.fm on 25 nodes
  • Chart calculation & web log analysis
  • Yahoo! on 900 nodes
  • 3M files (150 TB of data)
  • Sort benchmark on a 200-node cluster using 700
    reduce operations: 2010 GB in 6.6 hours (47 hours
    using non-MapReduce)

13
The End
  • Visit Hadoop at http://lucene.apache.org/hadoop/