Title: ISQS 6339, Business Intelligence - Hadoop MapReduce
ISQS 6339, Business Intelligence
Hadoop MapReduce
- Zhangxi Lin
- Texas Tech University
Outline
- Big data ecology
- Review of Hadoop
- MapReduce Algorithm
- The Hadoop Ecological System
- Appendix
- Examples of MapReduce
Review of Hadoop
Questions before viewing the videos
- What is Hadoop?
- What is MapReduce?
- Why did they become a major solution for coping with big data problems?
Videos of Hadoop
- Challenges Created by Big Data (8:51)
- Published on Apr 10, 2013. This video explains the challenges created by big data that Hadoop addresses efficiently. You will learn why the traditional enterprise model fails to address the Variety, Volume, and Velocity challenges created by Big Data and why the creation of Hadoop was required.
- http://www.youtube.com/watch?v=cA2btTHKPMY
- Hadoop Architecture (14:27)
- Published on Mar 24, 2013
- http://www.youtube.com/watch?v=YewlBXJ3rv8
- History Behind Creation of Hadoop (6:29)
- Published on Apr 5, 2013. This video talks about the brief history behind the creation of Hadoop: how Google invented the technology, how it went into Yahoo, how Doug Cutting and Michael Cafarella created Hadoop, and how it went to Apache.
- http://www.youtube.com/watch?v=jA7kYyHKeX8
Hadoop for BI in Cloudera
- Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
- Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
- Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
Apache Hadoop
- The Apache Hadoop framework is composed of the following modules:
- Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS): a distributed file system that stores data across the cluster's nodes.
- Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
- Hadoop MapReduce: a programming model for large-scale data processing.
MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
How Hadoop Operates
Hadoop 2: Big data's big leap forward
- The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
- The biggest constraint on scale has been Hadoop's job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck.
- Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
MapReduce 2.0: YARN (Yet Another Resource Negotiator)
Apache Spark
- An open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
- Spark requires a cluster manager and a distributed storage system. For the cluster manager, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
- In February 2014, Spark became an Apache Top-Level Project. Spark had over 465 contributors in 2014.
- Source: http://en.wikipedia.org/wiki/Apache_Spark
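To make the contrast with the two-stage, disk-based MapReduce paradigm concrete, here is a minimal word-count sketch against Spark's Java API (assuming a Spark 2.x-style API; the app name, master setting, and paths are placeholders, not from the slides):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark in-process for testing; on a cluster the
        // master would instead be supplied by YARN, Mesos, or standalone mode.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element of the RDD is one line of the input text.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");

        // The whole pipeline stays in memory between stages -- no
        // intermediate writes to disk as in classic two-stage MapReduce.
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///user/demo/output");
        sc.stop();
    }
}
```

The map and reduce steps here mirror the word-count example developed later in this deck; the difference is that Spark keeps the intermediate (word, 1) pairs in memory rather than spilling them to HDFS between jobs.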
MapReduce Algorithm
Videos: MapReduce
- Intro to MapReduce (9:08)
- Published on Mar 1, 2013. Intro to MapReduce concepts. Explores the flow of a MapReduce program.
- http://www.youtube.com/watch?v=HFplUBeBhcM
- Hadoop Map Reduce, Part 1 (4:21)
- Published on Mar 20, 2012
- http://www.youtube.com/watch?v=dVqaz2j2kII
Distributed File Systems (DFS) Implementations
- Files are divided into chunks, typically 64 megabytes in size. Chunks are replicated three times, at three different compute nodes located on different racks.
- To find the chunks of a file, the master node (or name node) is used. The master node is itself replicated.
- Three standards:
- The Google File System (GFS), the original of the class.
- Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce, distributed by the Apache Software Foundation.
- CloudStore, an open-source DFS originally developed by Kosmix.
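To make the chunk-and-replica picture concrete, here is a minimal sketch using the Hadoop FileSystem Java API to ask where the chunks of a file live (assumes a reachable HDFS cluster configured in core-site.xml; the file path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from the cluster configuration,
        // which points the client at the name node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: any file already stored in HDFS.
        Path file = new Path("/user/demo/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation is one chunk; its host list names the data
        // nodes holding the (typically three) replicas of that chunk.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset()
                + " length " + b.getLength()
                + " hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```

The client asks the name node for chunk locations exactly as described above; the data itself is then read directly from the data nodes.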
Map/Reduce Execution
Example 1: counting the number of occurrences of each word in a collection of documents
- The input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, . . . , wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs
- (w1, 1), (w2, 1), . . . , (wn, 1)
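As a concrete sketch, this Map function can be written against the Hadoop Java API as follows (class and variable names are illustrative, not from the slides):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text doc, Context context)
            throws IOException, InterruptedException {
        // Break the document into its words w1, w2, ..., wn and emit
        // (w1, 1), (w2, 1), ..., (wn, 1), exactly as described above.
        StringTokenizer it = new StringTokenizer(doc.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
        }
    }
}
```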
Map Task
- A single Map task will typically process many documents. Thus, its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output.
- After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from all the Map tasks.
Grouping and Aggregation
Reduce Task
- The output of the Reduce function is a sequence of zero or more key-value pairs.
- The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
- The application of the Reduce function to a single key and its associated list of values is referred to as a reducer.
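A matching sketch of this Reduce function against the Hadoop Java API (again, the class name is illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
            throws IOException, InterruptedException {
        // For key w with value list [1, 1, ..., 1], emit (w, m),
        // where m is the total number of occurrences of w.
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```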
Combiner
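Because this word-count Reduce function is associative and commutative, it can also be applied as a combiner inside each Map task, pre-summing (w, 1) pairs locally before they are shuffled across the network. A minimal driver sketch wiring together the illustrative classes above (input and output paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Run the reducer locally over each map task's output first.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the combiner in place, a Map task that sees a word w m times ships a single (w, m) pair instead of m separate (w, 1) pairs.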
Reducers, Reduce Tasks, Compute Nodes, and Skew
The Hadoop Ecological System
Choosing the right Hadoop architecture
- Application dependent
- Too many solution providers
- Too many choices
Videos
- The Evolution of the Apache Hadoop Ecosystem (Cloudera) (8:11)
- Published on Sep 6, 2013. Hadoop co-founder Doug Cutting explains how the Hadoop ecosystem has expanded and evolved into a much larger Big Data platform with Hadoop at its center.
- http://www.youtube.com/watch?v=eo1PwSfCXTI
- A Hadoop Ecosystem Overview (21:54)
- Published on Jan 10, 2014. This is a technical overview explaining the Hadoop Ecosystem. As part of this presentation, we chose to focus on the HDFS, MapReduce, YARN, Hive, Pig, and HBase software components.
- http://www.youtube.com/watch?v=kRnh3WpcKXo
- Working in the Hadoop Ecosystem (10:40)
- Published on Sep 5, 2013. Mark Grover, a Software Engineer at Cloudera, talks about working in the Hadoop ecosystem.
- http://www.youtube.com/watch?v=nbUsY9tj-pM
Cloudera's Hadoop System
Comparison of Two Generations of Hadoop
Different Components of Hadoop
Pivotal Big Data Product - OSS
- Greenplum was a big data analytics company headquartered in San Mateo, California. Its products include the Unified Analytics Platform, Data Computing Appliance, Analytics Lab, Database, HD, and Chorus. It was acquired by EMC Corporation in July 2010 and then became part of Pivotal Software in 2012.
- Pivotal GemFire is a distributed data management platform designed for many diverse data management situations, but it is especially useful for high-volume, latency-sensitive, mission-critical, transactional systems.
- Pivotal Software, Inc. (Pivotal) is a software company based in Palo Alto, California, that provides software and services for the development of custom applications for data and analytics based on cloud computing technology. Pivotal Software is a spin-out and joint venture of EMC Corporation and its subsidiary VMware that combined software products, employees, and lines of business from the two parent companies, including Greenplum, Cloud Foundry, Spring, Pivotal Labs, GemFire, and other products from the VMware vFabric Suite.
2015 Team-Topic

No | Topic | Focus | Components | Team | Schedule
1 | Data warehousing | Hadoop data warehouse design | HDFS, HBase, Hive, NoSQL/NewSQL, Solr | DW1 | 4/7
2 | Publicly available big data services | Tools and free resources | Hortonworks, Cloudera, HaaS, EC2 | DW2 | 4/9
3 | MapReduce data mining | Efficiency of distributed data/text mining | Mahout, H2O, R, Python | DW3 | 4/14
4 | Big data ETL-1 | 1) Heterogeneous data processing across platforms; 2) System management | 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | DW4 | 4/16
5 | Big data ETL-2 | 1) Heterogeneous data processing across platforms; 2) System management | 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | DW5 | 4/21
6 | Application development platform | 1) Algorithms and innovative development environments; 2) Load balancing | Tomcat, Neo4j, Pig, Hue | DW6 | 4/23
7 | Tools and visualizations | Features for big data visualization and data utilization | Pentaho, Tableau, Saiku, Mondrian, Gephi | DW7 | 4/28
8 | Streaming data processing | Efficiency and effectiveness of real-time data processing | Spark, Storm, Kafka, Avro | | 5/5
2014 Topics

Topic | Components | Team | Presentation
Data warehousing | HDFS, HBase, Hive, NoSQL | 5 | 4/8
Data mining | Mahout, R | 1 | 4/10
System management | Oozie, ZooKeeper | 3 | 4/15
ETL | Kettle, Flume, Sqoop | 8 | 4/17
Programming platform | Pig, Hue, Python, Tomcat, Jetty, Neo4j | 2 | 4/22
Streaming data processing | Storm, Kafka, Avro | 4 | 4/24
Information retrieval | Impala, Pentaho, Solr | 6 | 4/29
Visualization | Saiku, Mondrian, Gephi, Ganglia | 7 | 5/1
Project teams' tasks
- 1) Collect the information.
- 2) Study the product.
- 3) Present the topic for 40-50 minutes with 20-25 slides:
- Explain the position of your topic in the Hadoop ecological system
- Main challenges and solutions
- Products and their features
- Application demonstrations
- Your comments
- 3 questions for your classmates
- 4) Provide the references, including tutorial materials, videos, and web links.
- 5) Report outcomes with documents of 5-10 pages.
- Optional deliverables with extra credit:
- 1) Implementation demonstration
- 2) Research papers
- 3) A proposal for research/implementation ideas that demonstrates creativity
More Hadoop Charts
Vision of Data Flow
Real-time Data Processing
Application Perspective of Hadoop
Matrix Calculation
Map/Reduce Matrix Multiplication
Map/Reduce Scheme 1, Step 1
Map/Reduce Scheme 1, Step 2
Map/Reduce Scheme 2, One-shot
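A hedged textual summary of the two schemes, following the standard textbook formulation of MapReduce matrix multiplication (for example, in Mining of Massive Datasets, whose word-count example this deck also uses); the slide diagrams may differ in detail. For P = MN, where p(i,k) = sum over j of m(i,j) * n(j,k):

- Scheme 1, Step 1 (join on the shared index j): the Map function turns each element m(i,j) of M into the pair (j, (M, i, m(i,j))) and each element n(j,k) of N into (j, (N, k, n(j,k))). The Reduce function, for each key j, pairs every (M, i, m(i,j)) with every (N, k, n(j,k)) and emits ((i,k), m(i,j) * n(j,k)).
- Scheme 1, Step 2 (aggregation): the Map function is the identity. The Reduce function, for each key (i,k), sums its list of partial products to produce p(i,k).
- Scheme 2, one-shot: the Map function replicates each element, emitting m(i,j) as ((i,k), (M, j, m(i,j))) for every column k of N, and n(j,k) as ((i,k), (N, j, n(j,k))) for every row i of M. The Reduce function, for each key (i,k), matches values with equal j, multiplies each matched pair, and sums the products. This saves a second MapReduce job at the cost of more communication.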