BRIEF OVERVIEW OF HIVE

About This Presentation
Title:

BRIEF OVERVIEW OF HIVE

Description:

Title: A0 PNW Power Gating on IREM Author: Dan Bockelman Last modified by: Jon Brauer Created Date: 8/17/2000 2:34:20 PM Document presentation format – PowerPoint PPT presentation

1
BRIEF OVERVIEW OF HIVE
  • Jonathan Brauer
  • ESE 380L
  • Feb 2014

2
Overview
  • Hive is a Massively Parallel Data Warehousing
    environment
  • Hive provides a SQL-like programming environment
    for Hadoop
  • Hadoop is becoming common in Big Data houses
  • Hadoop makes it relatively easy to implement
    MapReduce jobs quickly, but writing those jobs
    often requires plug-ins or APIs
  • Engineers who are familiar with SQL but not
    MapReduce may be more productive with SQL
  • Hive queries are compiled into MapReduce jobs
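To make the last bullet concrete, here is a minimal Python sketch of how a grouped count (roughly what `SELECT dept, COUNT(*) ... GROUP BY dept` asks for) could be expressed as a map phase, a shuffle, and a reduce phase. The table contents and function names are hypothetical, and this is an illustration of the idea rather than Hive's actual execution plan.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [("ann", "eng"), ("bob", "ops"), ("cy", "eng"), ("dee", "eng")]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_dept(rows):
    return reduce_phase(shuffle(map_phase(rows)))

print(count_by_dept(ROWS))  # {'eng': 3, 'ops': 1}
```

The SQL author never writes these three functions; the point of the sketch is only that a declarative query can be lowered to this shape.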

3
Background on Hadoop
  • What is Hadoop?
  • Open source implementation of a MapReduce
    environment
  • A distributed filesystem for storing data: the
    Hadoop Distributed File System (HDFS)
  • Multiple copies of data
  • Very large files can be handled
  • Files are broken up into blocks, commonly 128 MB
  • MapReduce consists of a Map function and Reduce
    function
  • Map functions are applied to all data
  • Reduce functions collate map output
  • Example in SQL terms: the Map step does the
    SELECT on rows and the Reduce step could SORT the
    output
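The SQL analogy in the last bullet can be sketched in a few lines of Python. This is a single-process illustration under assumed data: the records, the `min_score` filter, and the function names are hypothetical, and a real Hadoop job would distribute the map calls across many machines.

```python
# Hypothetical records: (userid, score)
RECORDS = [(3, 40), (1, 95), (2, 70), (4, 10)]

def mapper(record, min_score=50):
    """Map step: applied to every record; emits only rows that pass the filter."""
    userid, score = record
    if score >= min_score:
        yield userid, score

def reducer(pairs):
    """Reduce step: collates the mapper output, here by sorting on userid."""
    return sorted(pairs)

def run_job(records):
    # Apply the mapper to all data, then hand everything it emitted to the reducer.
    mapped = [pair for record in records for pair in mapper(record)]
    return reducer(mapped)

print(run_job(RECORDS))  # [(1, 95), (2, 70)]
```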

4
Advantages
  • Hive allows developers with a SQL background to
    ramp up rapidly and write Hive queries
  • Open source Apache project
  • Hive is compatible with other MapReduce
    operations in the same infrastructure: some
    groups can use Hive and others native MapReduce
  • Can share tables with HBase
  • Hive has built-in functions for reducing data,
    such as sampling
  • Block Sampling
  • Bucket Sampling
  • Deterministic Sampling
  • Non-Deterministic Sampling
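As a rough illustration of the difference between deterministic and non-deterministic bucket sampling, the Python sketch below assigns each row to one of `num_buckets` buckets and keeps a single bucket. It assumes a plain modulo on an integer key instead of Hive's own hash function, and `bucket_sample` is a hypothetical helper, not a Hive API.

```python
import random

def bucket_sample(rows, bucket, num_buckets, key=None, rng=None):
    """Keep rows that fall into `bucket` (1-based) out of `num_buckets`.

    With `key`, the bucket depends only on the row, so the sample is
    repeatable. Without it, each row is assigned a random bucket.
    """
    rng = rng or random.Random()
    sample = []
    for row in rows:
        if key is not None:
            slot = key(row) % num_buckets   # deterministic: same row, same bucket
        else:
            slot = rng.randrange(num_buckets)  # non-deterministic: varies per run
        if slot == bucket - 1:
            sample.append(row)
    return sample

USER_IDS = list(range(1, 21))

deterministic = bucket_sample(USER_IDS, bucket=1, num_buckets=4, key=lambda uid: uid)
print(deterministic)  # [4, 8, 12, 16, 20]

# Roughly a quarter of the rows, but a different subset on each run.
print(bucket_sample(USER_IDS, bucket=1, num_buckets=4))
```

Sampling on a column gives the same rows every time, which helps when results must be reproducible; sampling on a random value gives a fresh subset per query.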

5
Disadvantages
  • Not for real time unless the data is very small
    (in which case, why are you using Hadoop?)
  • Row updates are generally not allowed
  • Hive queries can be very time consuming
  • As with an RDBMS, some experience and knowledge
    of writing efficient queries is necessary in Hive
  • Hive features require extending and modifying SQL
    operations, and some SQL operations behave
    differently
  • SORT BY vs. ORDER BY (SORT BY orders rows within
    each reducer; ORDER BY produces one global order
    through a single reducer)
  • Large data sizes make some queries impossible to
    finish in a meaningful time because of individual
    system resources (doing an ORDER BY on all
    columns in a petabyte search is a bad idea)
  • Queries are still I/O bound
  • Hive optimizations are still ongoing
  • Consider using Hadoop natively, HBase (fast, row
    edits), or Pig (transforms)
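The SORT BY vs. ORDER BY distinction is easy to miss, so here is a small Python sketch of the two behaviors. The two input lists stand in for the rows that two reducers would receive; the data and function names are hypothetical.

```python
# Hypothetical partitions of rows, one list per reducer.
REDUCER_INPUTS = [[9, 2, 7], [4, 8, 1]]

def sort_by(partitions):
    """SORT BY-style: each reducer sorts only its own rows (local order)."""
    return [sorted(part) for part in partitions]

def order_by(partitions):
    """ORDER BY-style: all rows go through one reducer for a total order."""
    all_rows = [row for part in partitions for row in part]
    return sorted(all_rows)

print(sort_by(REDUCER_INPUTS))   # [[2, 7, 9], [1, 4, 8]]
print(order_by(REDUCER_INPUTS))  # [1, 2, 4, 7, 8, 9]
```

Each reducer's output is sorted in the first case, but concatenating the outputs is not globally sorted. The second case gives a total order at the cost of funneling every row through one reducer, which is why it scales poorly on very large tables.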

6
Example
  • SELECT a.userid, b.text
  • FROM users TABLESAMPLE(1 PERCENT) a
  • JOIN data b
  • ON a.dat = '2012-03-15'
  • AND b.dat = '2012-03-15'
  • AND a.userid = b.id

7
Questions?
  • Questions?