ISQS 6339, Business Intelligence: Hadoop
1
ISQS 6339, Business Intelligence: Hadoop
MapReduce
  • Zhangxi Lin
  • Texas Tech University

2
Outline
  • Big data ecology
  • Review of Hadoop
  • MapReduce Algorithm
  • The Hadoop Ecosystem
  • Appendix
  • Examples of MapReduce

9
Review of Hadoop
10
Questions before viewing the videos
  • What is Hadoop?
  • What is MapReduce?
  • Why did they become a major solution for coping
    with big data problems?

11
Videos of Hadoop
  • Challenges Created by Big Data (8:51)
  • Published on Apr 10, 2013. This video explains
    the challenges created by big data that Hadoop
    addresses efficiently. You will learn why the
    traditional enterprise model fails to address the
    Variety, Volume, and Velocity challenges created
    by Big Data and why the creation of Hadoop was
    required.
  • http://www.youtube.com/watch?v=cA2btTHKPMY
  • Hadoop Architecture (14:27)
  • Published on Mar 24, 2013
  • http://www.youtube.com/watch?v=YewlBXJ3rv8
  • History Behind Creation of Hadoop (6:29)
  • Published on Apr 5, 2013. This video talks about
    the brief history behind the creation of Hadoop:
    how Google invented the technology, how it went
    into Yahoo, how Doug Cutting and Michael Cafarella
    created Hadoop, and how it went to Apache.
  • http://www.youtube.com/watch?v=jA7kYyHKeX8
12
Hadoop for BI in the Cloud Era
  • Hadoop is a free, Java-based programming
    framework that supports the processing of large
    data sets in a distributed computing environment.
  • Hadoop makes it possible to run applications on
    systems with thousands of nodes involving
    thousands of terabytes of data.
  • Hadoop was inspired by Google's MapReduce, a
    software framework in which an application is
    broken down into numerous small parts. Doug
    Cutting, Hadoop's creator, named the framework
    after his child's stuffed toy elephant.

13
Apache Hadoop
  • The Apache Hadoop framework is composed of the
    following modules:
  • Hadoop Common - contains the libraries and
    utilities needed by other Hadoop modules
  • Hadoop Distributed File System (HDFS) - a
    distributed file system that stores data across
    the cluster's machines
  • Hadoop YARN - a resource-management platform
    responsible for managing compute resources in
    clusters and using them to schedule users'
    applications
  • Hadoop MapReduce - a programming model for
    large-scale data processing

14
MapReduce
MapReduce is a framework for processing
parallelizable problems across huge datasets using
a large number of computers (nodes), collectively
referred to as a cluster or a grid.
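A single-machine sketch of that flow, assuming nothing beyond the
standard library (illustrative only; real Hadoop distributes each
phase across the cluster, and the function names here are invented):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):
    # Map: apply the mapper to every input element, collecting
    # all emitted (key, value) pairs.
    pairs = [kv for element in inputs for kv in mapper(element)]
    # Shuffle/group: bring together all values that share a key.
    pairs.sort(key=itemgetter(0))
    # Reduce: apply the reducer to each key and its list of values.
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# Example: word count over two "documents".
docs = ["the cat", "the dog"]
result = map_reduce(docs,
                    lambda doc: [(w, 1) for w in doc.split()],
                    lambda word, counts: (word, sum(counts)))
print(result)  # [('cat', 1), ('dog', 1), ('the', 2)]
```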
15
How Hadoop Operates
16
Hadoop 2: Big data's big leap forward
  • The new Hadoop is the Apache Foundation's attempt
    to create a whole new general framework for the
    way big data can be stored, mined, and processed.
  • The biggest constraint on scale has been Hadoop's
    job handling. All jobs in Hadoop are run as batch
    processes through a single daemon called
    JobTracker, which creates a scalability and
    processing-speed bottleneck.
  • Hadoop 2 uses an entirely new job-processing
    framework built around two daemons:
    ResourceManager, which governs all jobs in the
    system, and NodeManager, which runs on each
    Hadoop node and keeps the ResourceManager
    informed about what's happening on that node
    (see the sketch below).
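A toy sketch of that division of labor (not the YARN API; all names
and numbers are invented), with NodeManagers reporting status and one
ResourceManager placing work:

```python
class NodeManager:
    def __init__(self, node_id, free_slots):
        self.node_id = node_id
        self.free_slots = free_slots

    def heartbeat(self):
        # Each node keeps the ResourceManager informed about itself.
        return {"node": self.node_id, "free_slots": self.free_slots}

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def schedule(self, job):
        # Govern all jobs: place work on the node with the most
        # free capacity, according to the latest heartbeats.
        reports = [n.heartbeat() for n in self.nodes]
        best = max(reports, key=lambda r: r["free_slots"])
        return f"{job} -> {best['node']}"

rm = ResourceManager([NodeManager("node-1", 2), NodeManager("node-2", 5)])
print(rm.schedule("job-42"))  # job-42 -> node-2
```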

17
MapReduce 2.0: YARN (Yet Another Resource Negotiator)
18
Apache Spark
  • An open-source cluster computing framework
    originally developed in the AMPLab at UC
    Berkeley. In contrast to Hadoop's two-stage
    disk-based MapReduce paradigm, Spark's in-memory
    primitives provide performance up to 100 times
    faster for certain applications.
  • Spark requires a cluster manager and a
    distributed storage system. As the cluster
    manager, Spark supports standalone (native Spark
    cluster), Hadoop YARN, or Apache Mesos. For
    distributed storage, Spark can interface with a
    wide variety of systems, including the Hadoop
    Distributed File System (HDFS), Cassandra,
    OpenStack Swift, and Amazon S3.
  • In February 2014, Spark became an Apache
    Top-Level Project. As of 2014, Spark had over
    465 contributors.
  • Source: http://en.wikipedia.org/wiki/Apache_Spark
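As an illustration of the same word-count job in Spark's style, a
minimal PySpark sketch (assumes a local Spark installation; the input
path is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")
counts = (sc.textFile("input.txt")          # illustrative path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.collect())
sc.stop()
```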

19
Map/Reduce Algorithm
20
Videos - MapReduce
  • Intro To MapReduce (9:08)
  • Published on Mar 1, 2013. Intro to MapReduce
    concepts. Explores the flow of a MapReduce
    program.
  • http://www.youtube.com/watch?v=HFplUBeBhcM
  • Hadoop Map Reduce Part 1 (4:21)
  • Published on Mar 20, 2012
  • http://www.youtube.com/watch?v=dVqaz2j2kII

21
Distributed File Systems (DFS) Implementations
  • Files are divided into chunks, typically 64
    megabytes in size. Each chunk is replicated three
    times, at three different compute nodes located
    on different racks.
  • To find the chunks of a file, the master node (or
    name node) is used. The master node is itself
    replicated.
  • Three standards:
  • The Google File System (GFS), the original of the
    class.
  • Hadoop Distributed File System (HDFS), an
    open-source DFS used with Hadoop, an
    implementation of MapReduce, distributed by the
    Apache Software Foundation.
  • CloudStore, an open-source DFS originally
    developed by Kosmix.
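A toy sketch of the two ideas above, fixed-size chunking and
three-way replication across distinct racks (the placement policy
shown is simplified and invented for illustration; HDFS's actual
policy differs in detail):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as on the slide

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    # Divide a file's bytes into fixed-size chunks.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(chunk_id, racks):
    # racks: {"rack1": ["n1", "n2"], "rack2": [...], ...}
    # Put one replica on each of three different racks.
    chosen = list(racks.items())[:3]
    return [nodes[chunk_id % len(nodes)] for _, nodes in chosen]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3"], "rack3": ["n4", "n5"]}
print(place_replicas(0, racks))  # ['n1', 'n3', 'n4']
```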

22
Map/Reduce Execution
24
Example 1: counting the number of occurrences of
each word in a collection of documents
  • The input file is a repository of documents, and
    each document is an element. The Map function for
    this example uses keys of type String (the words)
    and values that are integers. The Map task reads
    a document and breaks it into its sequence of
    words w1, w2, . . . , wn. It then emits a
    sequence of key-value pairs where the value is
    always 1. That is, the output of the Map task for
    this document is the sequence of key-value pairs
  • (w1, 1), (w2, 1), . . . , (wn, 1)
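A Hadoop Streaming-style mapper for this example, as a sketch
(tab-separated output is the streaming convention; the file name
mapper.py is illustrative):

```python
#!/usr/bin/env python3
# mapper.py: emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```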

25
Map Task
  • A single Map task will typically process many
    documents, so its output will be more than the
    sequence for the one document suggested above. If
    a word w appears m times among all the documents
    assigned to that task, then there will be m
    key-value pairs (w, 1) among its output.
  • After all the Map tasks have completed
    successfully, the master controller merges the
    files from each Map task that are destined for a
    particular Reduce task and feeds the merged file
    to that process as a sequence of
    key-list-of-values pairs. That is, for each key
    k, the input to the Reduce task that handles key
    k is a pair of the form (k, [v1, v2, . . . , vn]),
    where (k, v1), (k, v2), . . . , (k, vn) are all
    the key-value pairs with key k coming from all
    the Map tasks.
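A minimal sketch of that merge/group step (the function name is
invented for illustration):

```python
from collections import defaultdict

def group_by_key(pairs):
    # Collect every (key, value) pair into a per-key value list,
    # as the master does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups  # {k: [v1, v2, ...], ...}

print(group_by_key([("the", 1), ("cat", 1), ("the", 1)]))
# defaultdict(<class 'list'>, {'the': [1, 1], 'cat': [1]})
```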

26
Grouping and Aggregation
27
Reduce Task
  • The output of the Reduce function is a sequence
    of zero or more key-value pairs.
  • For word count, the Reduce function simply adds
    up all the values. The output of a reducer
    consists of the word and the sum. Thus, the
    output of all the Reduce tasks is a sequence of
    (w, m) pairs, where w is a word that appears at
    least once among all the input documents and m is
    the total number of occurrences of w among all
    those documents.
  • The application of the Reduce function to a
    single key and its associated list of values is
    referred to as a reducer.
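The matching Hadoop Streaming-style reducer, as a sketch: streaming
delivers the mapper output sorted by key, so one pass with a running
sum suffices (reducer.py is an illustrative name):

```python
#!/usr/bin/env python3
# reducer.py: sum the 1s for each word; input arrives on stdin
# sorted by key, so equal words are adjacent.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```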

28
Combiner
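A combiner optionally pre-aggregates map output on the map node
before the shuffle, reducing network traffic; for word count the
reduce logic can double as the combiner because addition is
associative and commutative. A minimal sketch (the function name is
invented):

```python
from collections import Counter

def combine(pairs):
    # Locally pre-sum the (word, 1) pairs emitted by one map task
    # before they are sent over the network.
    local = Counter()
    for word, one in pairs:
        local[word] += one
    return list(local.items())

print(combine([("the", 1), ("cat", 1), ("the", 1)]))
# [('the', 2), ('cat', 1)]
```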
29
Reducers, Reduce Tasks, Compute Nodes, and Skew
36
The Hadoop Ecosystem
37
Choosing the right Hadoop architecture
  • Application dependent
  • Too many solution providers
  • Too many choices

38
Videos
  • The Evolution of the Apache Hadoop Ecosystem
    (Cloudera) (8:11)
  • Published on Sep 6, 2013. Hadoop co-founder Doug
    Cutting explains how the Hadoop ecosystem has
    expanded and evolved into a much larger Big Data
    platform with Hadoop at its center.
  • http://www.youtube.com/watch?v=eo1PwSfCXTI
  • A Hadoop Ecosystem Overview (21:54)
  • Published on Jan 10, 2014. This is a technical
    overview explaining the Hadoop ecosystem. As
    part of this presentation, we chose to focus on
    the HDFS, MapReduce, Yarn, Hive, Pig, and HBase
    software components.
  • http://www.youtube.com/watch?v=kRnh3WpcKXo
  • Working in the Hadoop Ecosystem (10:40)
  • Published on Sep 5, 2013. Mark Grover, a Software
    Engineer at Cloudera, talks about working in the
    Hadoop ecosystem.
  • http://www.youtube.com/watch?v=nbUsY9tj-pM

39
Cloudera's Hadoop System
41
Comparison of Two Generations of Hadoop
43
Different Components of Hadoop
44
Pivotal Big Data Product - OSS
  • Greenplum was a big data analytics company
    headquartered in San Mateo, California. Its
    products include the Unified Analytics Platform,
    Data Computing Appliance, Analytics Lab,
    Database, HD, and Chorus. It was acquired by EMC
    Corporation in July 2010 and became part of
    Pivotal Software in 2012.
  • Pivotal GemFire is a distributed data management
    platform designed for many diverse data
    management situations, but is especially useful
    for high-volume, latency-sensitive,
    mission-critical, transactional systems.
  • Pivotal Software, Inc. (Pivotal) is a software
    company based in Palo Alto, California that
    provides software and services for the
    development of custom applications for data and
    analytics based on cloud computing technology.
    Pivotal Software is a spin-out and joint venture
    of EMC Corporation and its subsidiary VMware that
    combined software products, employees, and lines
    of business from the two parent companies,
    including Greenplum, Cloud Foundry, Spring,
    Pivotal Labs, GemFire, and other products from
    the VMware vFabric Suite.

45
2015 Team-Topic
No | Topic | Focus | Components | Team | Schedule
1 | Data warehousing | Hadoop data warehouse design | HDFS, HBase, HIVE, NoSQL/NewSQL, Solr | DW1 | 4/7
2 | Publicly available big data services | Tools and free resources | Hortonworks, CloudEra, HaaS, EC2 | DW2 | 4/9
3 | MapReduce and data mining | Efficiency of distributed data/text mining | Mahout, H2O, R, Python | DW3 | 4/14
4 | Big data ETL-1 | 1) Heterogeneous data processing across platforms; 2) System management | 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | DW4 | 4/16
5 | Big data ETL-2 | 1) Heterogeneous data processing across platforms; 2) System management | 1) Kettle, Flume, Sqoop, Impala; 2) Oozie, ZooKeeper, Ambari, Loom, Ganglia | DW5 | 4/21
6 | Application development platform | 1) Algorithms and innovative development environments; 2) Load balancing | Tomcat, Neo4J, Pig, Hue | DW6 | 4/23
7 | Tools and visualizations | Features for big data visualization and data utilization | Pentaho, Tableau, Saiku, Mondrian, Gephi | DW7 | 4/28
8 | Streaming data processing | Efficiency and effectiveness of real-time data processing | Spark, Storm, Kafka, Avro | | 5/5
46
2014 Topics
Topic | Components | Team | Presentation
Data warehousing | HDFS, HBase, HIVE, NoSQL | 5 | 4/8
Data mining | Mahout, R | 1 | 4/10
System management | Oozie, ZooKeeper | 3 | 4/15
ETL | Kettle, Flume, Sqoop | 8 | 4/17
Programming platform | Pig, Hue, Python, Tomcat, Jetty, Neo4J | 2 | 4/22
Streaming data processing | Storm, Kafka, Avro | 4 | 4/24
Information retrieval | Impala, Pentaho, Solr | 6 | 4/29
Visualization | Saiku, Mondrian, Gephi, Ganglia | 7 | 5/1
47
Project team tasks
  • 1) Collect the information
  • 2) Study the product
  • 3) Present the topic for 40-50 minutes with 20-25
    slides, covering
  • The position of your topic in the Hadoop
    ecosystem
  • Main challenges and solutions
  • Products and their features
  • Application demonstrations
  • Your comments
  • 3 questions for your classmates
  • 4) Provide the references, including tutorial
    materials, videos, and web links
  • 5) Report outcomes in a document of 5-10 pages
  • Optional deliverables for extra credit
  • 1) An implementation demonstration
  • 2) Research papers
  • 3) A proposal for research/implementation ideas
    that demonstrates creativity

48
More Hadoop Charts
  • Appendix

49
Vision of Data Flow
51
Real-time Data Processing
52
Application Perspective of Hadoop
53
Matrix Calculation
  • Appendix

54
Map/Reduce Matrix Multiplication
55
Map/Reduce Scheme 1, Step 1
56
Map/Reduce Scheme 1, Step 2
57
Map/Reduce Scheme 2, Oneshot
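In the one-shot scheme, the Map function sends each matrix element to
every reducer key (i, k) it contributes to, and each reducer computes
one entry of the product P = MN, where M is I x J and N is J x K. A
hedged single-machine sketch, assuming matrices stored as
dictionaries of nonzero entries (a representation chosen for the
example):

```python
from collections import defaultdict

def map_phase(M, N, I, K):
    # Each m_ij contributes to every P entry in row i; each n_jk
    # contributes to every P entry in column k.
    for (i, j), m in M.items():
        for k in range(K):
            yield (i, k), ("M", j, m)
    for (j, k), n in N.items():
        for i in range(I):
            yield (i, k), ("N", j, n)

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    P = {}
    for (i, k), vals in groups.items():
        # P[i][k] = sum over j of m_ij * n_jk.
        m_row = {j: v for tag, j, v in vals if tag == "M"}
        n_col = {j: v for tag, j, v in vals if tag == "N"}
        P[(i, k)] = sum(m_row[j] * n_col.get(j, 0) for j in m_row)
    return P

# 2x2 example: M = [[1, 2], [3, 4]], N = identity, so P should equal M.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 1, (1, 1): 1}
print(reduce_phase(map_phase(M, N, I=2, K=2)))
```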