Title: ISQS 6339, Business Intelligence - Hadoop MapReduce
ISQS 6339, Business Intelligence
Hadoop MapReduce
- Zhangxi Lin
- Texas Tech University
Outline
- Big data ecology
- Review of Hadoop
- MapReduce Algorithm
- The Hadoop Ecological System
- Appendix
- Examples of MapReduce
Review of Hadoop
Questions before viewing the videos
- What is Hadoop?
- What is MapReduce?
- Why did they become a major solution for coping with big data problems?
Videos of Hadoop
- Challenges Created by Big Data (8:51)
- Published on Apr 10, 2013. This video explains the challenges created by big data that Hadoop addresses efficiently. You will learn why the traditional enterprise model fails to address the Variety, Volume, and Velocity challenges created by Big Data and why the creation of Hadoop was required.
- http://www.youtube.com/watch?v=cA2btTHKPMY
- Hadoop Architecture (14:27)
- Published on Mar 24, 2013
- http://www.youtube.com/watch?v=YewlBXJ3rv8
- History Behind Creation of Hadoop (6:29)
- Published on Apr 5, 2013. This video talks about the brief history behind the creation of Hadoop: how Google invented the technology, how it went into Yahoo, how Doug Cutting and Michael Cafarella created Hadoop, and how it went to Apache.
- http://www.youtube.com/watch?v=jA7kYyHKeX8
Hadoop for BI in Cloudera
- Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
- Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
- Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
Apache Hadoop
- The Apache Hadoop framework is composed of the following modules:
- Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS): a distributed file system that stores data across the cluster's nodes.
- Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
- Hadoop MapReduce: a programming model for large-scale data processing.
MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
How Hadoop Operates
Hadoop 2: Big data's big leap forward
- The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
- The biggest constraint on scale has been Hadoop's job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck.
- Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
MapReduce 2.0: YARN (Yet Another Resource Negotiator)
Apache Spark
- An open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
- Spark requires a cluster manager and a distributed storage system. For the cluster manager, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
- In February 2014, Spark became an Apache Top-Level Project. Spark had over 465 contributors in 2014.
- Source: http://en.wikipedia.org/wiki/Apache_Spark
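To make the contrast with the two-stage, disk-based MapReduce paradigm concrete, here is a minimal word-count sketch against Spark's Java API (assuming a Spark 2.x-style API; the app name, master setting, and paths are placeholders, not from the slides):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark in-process for testing; on a cluster the
        // master would instead be supplied by YARN, Mesos, or standalone mode.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element of the RDD is one line of the input text.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");

        // The whole pipeline stays in memory between stages -- no
        // intermediate writes to disk as in classic two-stage MapReduce.
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///user/demo/output");
        sc.stop();
    }
}
```

The map and reduce steps here mirror the word-count example developed later in this deck; the difference is that Spark keeps the intermediate (word, 1) pairs in memory rather than spilling them to HDFS between jobs.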
MapReduce Algorithm
Videos: MapReduce
- Intro to MapReduce (9:08)
- Published on Mar 1, 2013. Intro to MapReduce concepts. Explores the flow of a MapReduce program.
- http://www.youtube.com/watch?v=HFplUBeBhcM
- Hadoop Map Reduce, Part 1 (4:21)
- Published on Mar 20, 2012
- http://www.youtube.com/watch?v=dVqaz2j2kII
Distributed File Systems (DFS) Implementations
- Files are divided into chunks, typically 64 megabytes in size. Chunks are replicated three times, at three different compute nodes located on different racks.
- To find the chunks of a file, the master node (or name node) is used. The master node is itself replicated.
- Three standards:
- The Google File System (GFS), the original of the class.
- Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop, an implementation of MapReduce, distributed by the Apache Software Foundation.
- CloudStore, an open-source DFS originally developed by Kosmix.
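To make the chunk-and-replica picture concrete, here is a minimal sketch using the Hadoop FileSystem Java API to ask where the chunks of a file live (assumes a reachable HDFS cluster configured in core-site.xml; the file path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from the cluster configuration,
        // which points the client at the name node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: any file already stored in HDFS.
        Path file = new Path("/user/demo/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation is one chunk; its host list names the data
        // nodes holding the (typically three) replicas of that chunk.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset()
                + " length " + b.getLength()
                + " hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```

The client asks the name node for chunk locations exactly as described above; the data itself is then read directly from the data nodes.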
Map/Reduce Execution
Example 1: counting the number of occurrences of each word in a collection of documents
- The input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, . . . , wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs
- (w1, 1), (w2, 1), . . . , (wn, 1)
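As a concrete sketch, this Map function can be written against the Hadoop Java API as follows (class and variable names are illustrative, not from the slides):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text doc, Context context)
            throws IOException, InterruptedException {
        // Break the document into its words w1, w2, ..., wn and emit
        // (w1, 1), (w2, 1), ..., (wn, 1), exactly as described above.
        StringTokenizer it = new StringTokenizer(doc.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
        }
    }
}
```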
Map Task
- A single Map task will typically process many documents. Thus, its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output.
- After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from all the Map tasks.
Grouping and Aggregation
Reduce Task
- The output of the Reduce function is a sequence of zero or more key-value pairs.
- The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
- The application of the Reduce function to a single key and its associated list of values is referred to as a reducer.
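A matching sketch of this Reduce function against the Hadoop Java API (again, the class name is illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
            throws IOException, InterruptedException {
        // For key w with value list [1, 1, ..., 1], emit (w, m),
        // where m is the total number of occurrences of w.
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```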
Combiner
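Because this word-count Reduce function is associative and commutative, it can also be applied as a combiner inside each Map task, pre-summing (w, 1) pairs locally before they are shuffled across the network. A minimal driver sketch wiring together the illustrative classes above (input and output paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Run the reducer locally over each map task's output first.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the combiner in place, a Map task that sees a word w m times ships a single (w, m) pair instead of m separate (w, 1) pairs.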
Reducers, Reduce Tasks, Compute Nodes, and Skew
The Hadoop Ecological System
Choosing the right Hadoop architecture
- Application dependent
- Too many solution providers
- Too many choices
Videos
- The Evolution of the Apache Hadoop Ecosystem (Cloudera) (8:11)
- Published on Sep 6, 2013. Hadoop co-founder Doug Cutting explains how the Hadoop ecosystem has expanded and evolved into a much larger Big Data platform with Hadoop at its center.
- http://www.youtube.com/watch?v=eo1PwSfCXTI
- A Hadoop Ecosystem Overview (21:54)
- Published on Jan 10, 2014. This is a technical overview explaining the Hadoop Ecosystem. As part of this presentation, we chose to focus on the HDFS, MapReduce, YARN, Hive, Pig, and HBase software components.
- http://www.youtube.com/watch?v=kRnh3WpcKXo
- Working in the Hadoop Ecosystem (10:40)
- Published on Sep 5, 2013. Mark Grover, a Software Engineer at Cloudera, talks about working in the Hadoop ecosystem.
- http://www.youtube.com/watch?v=nbUsY9tj-pM
Cloudera's Hadoop System
Comparison of Two Generations of Hadoop
Different Components of Hadoop
Pivotal Big Data Product - OSS
- Greenplum was a big data analytics company headquartered in San Mateo, California. Its products include the Unified Analytics Platform, Data Computing Appliance, Analytics Lab, Database, HD, and Chorus. It was acquired by EMC Corporation in July 2010 and then became part of Pivotal Software in 2012.
- Pivotal GemFire is a distributed data management platform designed for many diverse data management situations, but it is especially useful for high-volume, latency-sensitive, mission-critical, transactional systems.
- Pivotal Software, Inc. (Pivotal) is a software company based in Palo Alto, California, that provides software and services for the development of custom applications for data and analytics based on cloud computing technology. Pivotal Software is a spin-out and joint venture of EMC Corporation and its subsidiary VMware that combined software products, employees, and lines of business from the two parent companies, including Greenplum, Cloud Foundry, Spring, Pivotal Labs, GemFire, and other products from the VMware vFabric Suite.
2015 Team-Topic

No | Topic | Focus | Components | Team | Schedule
1 | Data warehousing | Hadoop data warehouse design | HDFS, HBase, Hive, NoSQL/NewSQL, Solr | DW1 | 4/7
2 | Publicly available big data services | Tools and free resources | Hortonworks, Cloudera, HaaS, EC2 | DW2 | 4/9
3 | MapReduce data mining | Efficiency of distributed data/text mining | Mahout, H2O, R, Python | DW3 | 4/14
4 | Big data ETL-1 | 1) Heterogeneous data processing across platforms; 2) System management | 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | DW4 | 4/16
5 | Big data ETL-2 | 1) Heterogeneous data processing across platforms; 2) System management | 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | DW5 | 4/21
6 | Application development platform | 1) Algorithms and innovative development environments; 2) Load balancing | Tomcat, Neo4j, Pig, Hue | DW6 | 4/23
7 | Tools and visualizations | Features for big data visualization and data utilization | Pentaho, Tableau, Saiku, Mondrian, Gephi | DW7 | 4/28
8 | Streaming data processing | Efficiency and effectiveness of real-time data processing | Spark, Storm, Kafka, Avro | | 5/5
2014 Topics

Topic | Components | Team | Presentation
Data warehousing | HDFS, HBase, Hive, NoSQL | 5 | 4/8
Data mining | Mahout, R | 1 | 4/10
System management | Oozie, ZooKeeper | 3 | 4/15
ETL | Kettle, Flume, Sqoop | 8 | 4/17
Programming platform | Pig, Hue, Python, Tomcat, Jetty, Neo4j | 2 | 4/22
Streaming data processing | Storm, Kafka, Avro | 4 | 4/24
Information retrieval | Impala, Pentaho, Solr | 6 | 4/29
Visualization | Saiku, Mondrian, Gephi, Ganglia | 7 | 5/1
Project teams' tasks
- 1) Collect the information.
- 2) Study the product.
- 3) Present the topic for 40-50 minutes with 20-25 slides:
- Explain the position of your topic in the Hadoop ecological system
- Main challenges and solutions
- Products and their features
- Application demonstrations
- Your comments
- 3 questions for your classmates
- 4) Provide the references, including tutorial materials, videos, and web links.
- 5) Report outcomes with documents of 5-10 pages.
- Optional deliverables with extra credit:
- 1) Implementation demonstration
- 2) Research papers
- 3) A proposal for research/implementation ideas that demonstrates creativity
More Hadoop Charts
Vision of Data Flow
Real-time Data Processing
Application Perspective of Hadoop
Matrix Calculation
Map/Reduce Matrix Multiplication
Map/Reduce Scheme 1, Step 1
Map/Reduce Scheme 1, Step 2
Map/Reduce Scheme 2, One-shot
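A hedged textual summary of the two schemes, following the standard textbook formulation of MapReduce matrix multiplication (for example, in Mining of Massive Datasets, whose word-count example this deck also uses); the slide diagrams may differ in detail. For P = MN, where p(i,k) = sum over j of m(i,j) * n(j,k):

- Scheme 1, Step 1 (join on the shared index j): the Map function turns each element m(i,j) of M into the pair (j, (M, i, m(i,j))) and each element n(j,k) of N into (j, (N, k, n(j,k))). The Reduce function, for each key j, pairs every (M, i, m(i,j)) with every (N, k, n(j,k)) and emits ((i,k), m(i,j) * n(j,k)).
- Scheme 1, Step 2 (aggregation): the Map function is the identity. The Reduce function, for each key (i,k), sums its list of partial products to produce p(i,k).
- Scheme 2, one-shot: the Map function replicates each element, emitting m(i,j) as ((i,k), (M, j, m(i,j))) for every column k of N, and n(j,k) as ((i,k), (N, j, n(j,k))) for every row i of M. The Reduce function, for each key (i,k), matches values with equal j, multiplies each matched pair, and sums the products. This saves a second MapReduce job at the cost of more communication.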