An introduction to Apache HCatalog - PowerPoint PPT Presentation

About This Presentation
Title: An introduction to Apache HCatalog

Description: What Apache HCatalog is, why it is useful, and how it helps Pig, Hive and MapReduce users on Hadoop share data.

Slides: 9
Provided by: semtechs


Transcript and Presenter's Notes


1
Apache HCatalog
  • What is it?
  • How does it work?
  • Interfaces
  • Architecture
  • Example

www.semtech-solutions.co.nz info_at_semtech-solutions.co.nz
2
HCatalog: What is it?
  • A set of interfaces to the Hive metastore
  • Shared schema and data types for Hadoop tools
  • REST interface for external data access
  • Assists interoperability between Pig, Hive and MapReduce
  • Table abstraction of data storage
  • Will provide data availability notifications
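
The REST interface mentioned above is WebHCat (formerly Templeton). As a rough sketch only, the Python below builds the kind of URL an external client might request to list tables, describe a table, or list its partitions. It makes no network call. The host name and the user name are placeholders, and the port 50111 and the /templeton/v1 base path are assumed defaults that a given cluster may override.

```python
from urllib.parse import quote, urlencode

# Assumed defaults: WebHCat commonly listens on port 50111 under /templeton/v1.
# The host name below is a placeholder, not a real server.
BASE_URL = "http://webhcat.example.com:50111/templeton/v1"


def table_url(database, table=None, partitions=False, user="joe"):
    """Build a WebHCat DDL URL for a database's tables, one table, or its partitions."""
    path = f"{BASE_URL}/ddl/database/{quote(database)}/table"
    if table is not None:
        path += f"/{quote(table)}"
        if partitions:
            path += "/partition"
    # WebHCat identifies the caller through the user.name query parameter.
    return f"{path}?{urlencode({'user.name': user})}"


if __name__ == "__main__":
    print(table_url("default"))                              # list tables
    print(table_url("default", "rawevents"))                 # describe one table
    print(table_url("default", "rawevents", partitions=True))  # list partitions
```

A client would issue an HTTP GET against these URLs and receive JSON back; the exact response fields are not shown here.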

3
HCatalog: How does it work?
  • Pig
    • HCatLoader and HCatStorer interfaces
  • MapReduce
    • HCatInputFormat and HCatOutputFormat interfaces
  • Hive
    • No interface necessary
    • Direct access to metadata
  • Notifications when data is available

4
HCatalog Interfaces
  • Interface via
    • Pig
    • MapReduce
    • Hive
    • Streaming
  • Access data via
    • ORC file
    • RC file
    • Text file
    • Sequence file
    • Custom format

5
HCatalog Interfaces
6
HCatalog Architecture
7
HCatalog Example
  • A data flow example from hive.apache.org
  • First, Joe in data acquisition uses distcp to get data onto the grid.
    • hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
    • hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
  • Second, Sally in data processing uses Pig to cleanse and prepare the data.
  • Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.
    • A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);
    • B = filter A by bot_finder(zeta) == 0;
    • store Z into 'data/processedevents/20100819/data';
  • With HCatalog, HCatalog will send a JMS message that data is available. The Pig job can then be started.
    • A = load 'rawevents' using HCatLoader();
    • B = filter A by date == '20100819' and bot_finder(zeta) == 0;
    • store Z into 'processedevents' using HCatStorer('date=20100819');
  • Note that the Pig job refers to the data by name (rawevents) rather than by location.
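
To make the difference between name and location concrete, here is a toy model in Python. It is not the HCatalog API: ToyCatalog and its methods are invented for illustration, and a plain callback stands in for the JMS message. It only shows the idea that a producer registers a partition once, and consumers then resolve the table name to a location and get told when data arrives.

```python
from collections import defaultdict


class ToyCatalog:
    """Hypothetical stand-in for a metastore: (table, partition) -> location."""

    def __init__(self):
        self._partitions = {}
        self._listeners = defaultdict(list)

    def subscribe(self, table, callback):
        """Register a callback to run when a partition is added to `table`."""
        self._listeners[table].append(callback)

    def add_partition(self, table, spec, location):
        """Record a partition, then notify subscribers (stands in for JMS)."""
        self._partitions[(table, spec)] = location
        for callback in self._listeners[table]:
            callback(table, spec)

    def location(self, table, spec):
        """Resolve a table name and partition spec to its storage location."""
        return self._partitions[(table, spec)]


if __name__ == "__main__":
    catalog = ToyCatalog()

    # Sally subscribes instead of polling HDFS or waiting to hear from Joe.
    catalog.subscribe(
        "rawevents",
        lambda table, spec: print("available:", catalog.location(table, spec)),
    )

    # Joe registers the partition after copying the data onto the grid.
    catalog.add_partition(
        "rawevents", "ds=20100819", "hdfs://data/rawevents/20100819/data"
    )
```

If Joe later moves the data, only the catalog entry changes; a job written against the name rawevents would not need editing.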

8
Contact Us
  • Feel free to contact us at
    • www.semtech-solutions.co.nz
    • info_at_semtech-solutions.co.nz
  • We offer IT project consultancy
  • We are happy to hear about your problems
  • You pay only for the hours you need to solve your problems