Apache Gobblin - PowerPoint PPT Presentation

About This Presentation

Title:

Apache Gobblin

Description:

This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, data sources and sinks, and its work unit processing. Links for further information and for connecting with the author are included.

Slides: 15

Provided by: semtechs

Category: Medicine, Science & Technology

Tags: apache | data | framework | gobblin | ingestion

Transcript and Presenter's Notes

1
What Is Apache Gobblin?

A big data integration framework
To simplify integration issues like
Data ingestion
Replication
Organization
Lifecycle management
For streaming and batch
An Apache Incubator project

2
Gobblin Execution Modes

Gobblin has a number of execution modes
Standalone
Run on a single box / JVM / embedded mode
MapReduce
Run as a MapReduce application
YARN / Mesos (proposed?)
Run on a cluster via a scheduler, supports HA
Cloud
Run on AWS / Azure, supports HA

3
Gobblin Sinks/Writers

Gobblin supports the following sinks
Avro HDFS
Parquet HDFS
HDFS byte array
Console (StdOut)
Couchbase
HTTP
JDBC
Kafka

4
Gobblin Sources

Gobblin supports the following sources

Avro files
File copy
Query based
REST API
Google Analytics
Google Drive
Google Webmasters
Hadoop text input
Hive Avro to ORC
Hive compliance purging

JSON
Kafka
MySQL
Oracle
Salesforce
FTP / SFTP
SQL Server
Teradata
Wikipedia

5
Gobblin Architecture
6
Gobblin Architecture

A Gobblin job is built on a set of pluggable constructs
Which are extensible
A job is a set of tasks created from work units
The work unit serves as a container at runtime
Tasks are executed by the Gobblin runtime
On the chosen deployment e.g. MapReduce
Runtime handles scheduling, error handling etc
Utilities handle metadata, state, metrics etc

7
Gobblin Job
8
Gobblin Job

Acquire the job lock (optional, stops the next job instance from starting)
Create source instance
From source work units create tasks
Launch and run tasks
Publish data if OK to do so
Persist the job/task states into the state store
Clean up temporary work data
Release the job lock (optional)
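
The job flow above can be sketched in a few lines of Python. This is only an illustration of the ordering of the steps, not Gobblin code: the names run_job and JobRun, the state strings and the upper-casing "task" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class JobRun:
    """Records what one (hypothetical) job run did, step by step."""
    steps: list = field(default_factory=list)
    state_store: dict = field(default_factory=dict)
    published: list = field(default_factory=list)


def make_task(work_unit):
    """Build one task per work unit; this toy task upper-cases each record."""
    return lambda: [record.upper() for record in work_unit]


def run_job(work_units, use_lock=True, lock_held=False):
    run = JobRun()
    if use_lock:
        if lock_held:
            # Another instance still holds the lock, so this run is skipped.
            run.steps.append("skipped: lock held")
            return run
        run.steps.append("acquire lock")
    try:
        run.steps.append("create source")
        tasks = [make_task(wu) for wu in work_units]
        run.steps.append("create tasks")
        outputs, states = [], {}
        for i, task in enumerate(tasks):
            try:
                outputs.append(task())
                states[f"task-{i}"] = "SUCCESSFUL"
            except Exception:
                states[f"task-{i}"] = "FAILED"
        run.steps.append("run tasks")
        # Publish only if every task succeeded ("if OK to do so").
        job_ok = all(state == "SUCCESSFUL" for state in states.values())
        if job_ok:
            run.published = [rec for out in outputs for rec in out]
            run.steps.append("publish")
        states["job"] = "COMMITTED" if job_ok else "FAILED"
        run.state_store.update(states)
        run.steps.append("persist state")
        run.steps.append("clean up")
    finally:
        if use_lock:
            run.steps.append("release lock")
    return run
```

Calling run_job([["a", "b"], ["c"]]) walks through all eight steps and publishes the three records. A work unit whose task raises an error leaves the job unpublished, but its state is still persisted and the lock is still released.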

9
Gobblin Constructs
10
Gobblin Constructs

Source partitions data into work units
Source creates work unit data extractors
Converter converts schema and data records
Quality checker checks row and task level data
Fork operator allows control to flow into multiple streams
Writer sends data records to the sink
Publisher publishes job records
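
As a rough sketch of how these constructs chain together, the Python below models each one as a plain function. None of this is the Gobblin API: the function names, the "name, age" record format and the 50 percent task-level threshold are assumptions chosen to keep the example small.

```python
def source(rows, unit_size):
    """Source: partition the data into work units (fixed-size chunks here)."""
    return [rows[i:i + unit_size] for i in range(0, len(rows), unit_size)]


def extractor(work_unit):
    """Extractor: yield the records of one work unit."""
    yield from work_unit


def converter(record):
    """Converter: change schema and data, from 'name, age' text to a dict."""
    name, age = record.split(",")
    return {"name": name.strip(), "age": int(age)}


def row_quality_ok(record):
    """Row-level quality check: reject records with an implausible age."""
    return 0 <= record["age"] <= 150


def run_task(work_unit, n_branches=2):
    """Run one task: extract, convert, check, fork, write to staging."""
    branches = [[] for _ in range(n_branches)]  # one staged output per fork branch
    read = passed = 0
    for raw in extractor(work_unit):
        read += 1
        record = converter(raw)
        if row_quality_ok(record):
            passed += 1
            for branch in branches:  # fork: the same record goes to every stream
                branch.append(record)  # writer: send the record to its sink
    # Task-level quality check (assumed policy): at least half the rows must pass.
    task_ok = read > 0 and passed / read >= 0.5
    return branches, task_ok


def publish(task_results, n_branches=2):
    """Publisher: make staged output visible only for tasks that passed."""
    published = [[] for _ in range(n_branches)]
    for branches, task_ok in task_results:
        if task_ok:
            for out, branch in zip(published, branches):
                out.extend(branch)
    return published
```

For example, publish([run_task(wu) for wu in source(["ann, 31", "bob, 200", "cy, 45"], 2)]) drops the row with age 200 at the row-level check and publishes the other two records to both forked streams.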

11
Gobblin Job Configuration

Gobblin jobs are configured via configuration files
May be named .pull / .job plus .properties
Source properties file defines
Connection / converter / quality / publisher
Job file defines
Name / group / description / schedule
Extraction properties
Source properties
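
To make the shape of such a file concrete, the sketch below embeds an illustrative job definition as text and reads it with a small Python parser. The key names are modelled on the dotted property style used in Gobblin's documentation and should be checked against the Gobblin configuration glossary before use. The values, including every com.example class name, are placeholders invented for this example.

```python
# Illustrative job file contents; values and class names are placeholders.
JOB_FILE = """\
# Job identity and schedule
job.name=ExamplePull
job.group=Examples
job.description=Illustrative job definition
job.schedule=0 0/15 * * * ?

# Source and extraction
source.class=com.example.ExampleSource
extract.namespace=com.example.demo

# Converter, quality checker and publisher
converter.classes=com.example.ExampleConverter
qualitychecker.task.policies=com.example.ExamplePolicy
data.publisher.type=com.example.ExamplePublisher
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def group_by_prefix(props):
    """Group properties by the part of the key before the first dot."""
    groups = {}
    for key, value in props.items():
        prefix = key.split(".", 1)[0]
        groups.setdefault(prefix, {})[key] = value
    return groups
```

Grouping by prefix mirrors the slide: the job group holds name, group, description and schedule, while the source, extract, converter, qualitychecker and data groups hold the remaining settings.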

12
Gobblin Users
13
Available Books

See Big Data Made Easy
Apress Jan 2015
See Mastering Apache Spark
Packt Oct 2015
See Complete Guide to Open Source Big Data Stack
Apress Jan 2018
Find the author on Amazon
www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
Connect on LinkedIn
www.linkedin.com/in/mike-frampton-38563020

14
Connect

Feel free to connect on LinkedIn
www.linkedin.com/in/mike-frampton-38563020
See my open source blog at
open-source-systems.blogspot.com/
I am always interested in
New technology
Opportunities
Technology based issues
Big data integration
