Title: Big Data Frameworks
BIG DATA FRAMEWORKS
Presented by Cuelogic Technologies
Introduction
There are 3Vs that are vital for classifying data as Big Data: Volume, Velocity and Variety.
- Volume: data volumes in the order of terabytes, petabytes and beyond.
- Velocity: the high speed of data movement, such as real-time data streaming at rates measured in microseconds.
- Variety: the handling of both structured and unstructured data.
Implementation of Big Data infrastructure and technology can be seen in industries such as banking, retail, insurance, healthcare and media. Big Data management functions such as storage, sorting, processing and analysis of such colossal volumes cannot be handled by traditional database systems or technologies.
There are many frameworks in this space. Some of the popular ones are Spark, Hadoop, Hive and Storm. Some, such as Presto, score high on the utility index, while frameworks such as Flink show great potential. Others worth mentioning include Samza, Impala and Apache Pig. Some of these frameworks are briefly discussed below.
Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug Cutting. This open-source framework provides batch data processing as well as data storage services across clusters of commodity hardware machines. Hadoop consists of multiple layers, such as HDFS and YARN, that work together to carry out data processing.
HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage across the cluster nodes. In the event of a node failure, data can still be made available for processing from replicated copies. YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling. MapReduce is the software layer that functions as the batch processing engine.
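To make the batch-processing role of MapReduce concrete, below is a minimal word-count job sketch in Java. It assumes the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class name and the HDFS input/output paths passed on the command line are illustrative only.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit a (word, 1) pair for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper emits one pair per token, YARN schedules the map and reduce tasks across the cluster, and the reducer sums the counts per word.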
Pros: a cost-effective solution, high throughput, multi-language support, compatibility with most emerging Big Data technologies, high scalability, fault tolerance, suitability for R&D, and high availability through an excellent failure-handling mechanism.
Cons: vulnerability to security breaches; no in-memory computation, hence processing overheads; unsuitability for stream and real-time processing; and issues in processing small files in large numbers.
Apache Spark
Spark is a batch processing framework with enhanced data streaming capabilities. With full in-memory computation and processing optimisation, it promises a lightning-fast cluster computing system.
The Spark framework is composed of five layers:
- HDFS and HBase: the first layer, the data storage systems.
- YARN and Mesos: the resource management layer.
- Core engine: the third layer.
- Library: the fourth layer, containing Spark SQL for SQL queries, Spark Streaming for stream processing, GraphX for processing graph data, SparkR for R integration, and MLlib for machine learning algorithms.
- The fifth layer contains the application program interfaces, such as Java or Scala.
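As a sketch of how these layers are used from the API layer, the following Java snippet builds a SparkSession, loads a dataset and runs a Spark SQL aggregation. The input path, view name and column names are illustrative assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
  public static void main(String[] args) {
    // Local mode for illustration; drop .master(...) when submitting the job
    // to a YARN or Mesos cluster via spark-submit.
    SparkSession spark = SparkSession.builder()
        .appName("spark-sql-sketch")
        .master("local[*]")
        .getOrCreate();

    // The storage layer could equally be HDFS or HBase; a JSON file on HDFS
    // is used here purely for illustration.
    Dataset<Row> events = spark.read().json("hdfs:///data/events.json");

    // Register the data as a temporary view and aggregate it with Spark SQL
    // (the library layer); the core engine plans and executes the job.
    events.createOrReplaceTempView("events");
    Dataset<Row> counts =
        spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country");

    counts.show();
    spark.stop();
  }
}
```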
Pros: scalability, lightning processing speeds through a reduced number of I/O operations to disk, fault tolerance, support for advanced analytics applications with superior AI implementation, and seamless integration with Hadoop.
Cons: complexity of setup and implementation, limited language support, and not being a genuine streaming engine.
Storm
Storm is a platform-independent application development framework: it can be used with any programming language and guarantees delivery of data with the least latency. In the Storm architecture there are two kinds of nodes, the Master Node and the Worker/Supervisor Node. The master node monitors machine failures and is responsible for task allocation. In case of a node failure, its tasks are reassigned to another node.
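Below is a minimal topology sketch in Java showing the spout-and-bolt model that the supervisor nodes execute. It assumes Storm's core Java API; TestWordSpout ships with Storm for testing, while the bolt and topology names here are illustrative.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class StormTopologySketch {

  // A bolt that simply logs every word emitted by the spout.
  public static class PrinterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      System.out.println("received: " + input.getStringByField("word"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // Nothing is emitted downstream.
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // TestWordSpout ships with Storm and keeps emitting random words.
    builder.setSpout("words", new TestWordSpout(), 2);
    builder.setBolt("printer", new PrinterBolt(), 2)
           .shuffleGrouping("words");

    // Run in-process for illustration; on a real cluster the master node
    // assigns these tasks to the supervisor nodes, and the topology would be
    // submitted through StormSubmitter instead.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-printer", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}
```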
Pros: ease of setup and operation, high scalability, good speed, fault tolerance, and support for a wide range of languages.
Cons: complex implementation, debugging issues, and not being very learner-friendly.
Apache Flink
Apache Flink is an open-source framework equally good for both batch and stream data processing. It is suited for cluster environments and is based on the concept of transformations on streams. It is often described as the 4G of Big Data and is claimed to be up to 100 times faster than Hadoop MapReduce.
The Flink system contains multiple layers: the Deploy layer, the Runtime layer and the Library layer.
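As an illustration of Flink's streams-and-transformations model, here is a minimal DataStream word-count sketch in Java. It assumes Flink's DataStream API; the socket source host and port are illustrative.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCountSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Source: read text lines from a socket (e.g. one started with `nc -lk 9999`).
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    // Transformations on the stream: split into words, key by word, and keep
    // a running count, processed entry by entry as records arrive.
    DataStream<Tuple2<String, Integer>> counts = lines
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
              out.collect(Tuple2.of(word, 1));
            }
          }
        })
        .keyBy(t -> t.f0)
        .sum(1);

    counts.print();
    env.execute("flink-word-count-sketch");
  }
}
```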
Pros: low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, and compatibility with Hadoop.
Cons: a few scalability issues.
Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of the Hadoop HDFS platform.
The key components of the Hive architecture include Hive Clients, Hive Services, and Hive Storage and Computing. The Hive engine converts SQL queries or requests into chains of MapReduce tasks. The engine comprises:
- Parser: goes through the incoming SQL requests and sorts them.
- Optimizer: goes through the sorted requests and optimises them.
- Executor: sends the tasks to the MapReduce framework.
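To show how a client hands a query to these services, the sketch below submits a HiveQL aggregation through HiveServer2's JDBC interface. It assumes the hive-jdbc driver on the classpath; the host, credentials, table and column names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 usually listens on port 10000; the database here is "default".
    // The org.apache.hive.jdbc.HiveDriver class is auto-loaded from the
    // hive-jdbc dependency.
    String url = "jdbc:hive2://localhost:10000/default";

    // Credentials depend on how HiveServer2 authentication is configured.
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // Hive's parser, optimizer and executor turn this HiveQL statement
         // into a chain of MapReduce tasks behind the scenes.
         ResultSet rs = stmt.executeQuery(
             "SELECT country, COUNT(*) AS n FROM page_views GROUP BY country")) {
      while (rs.next()) {
        System.out.println(rs.getString("country") + "\t" + rs.getLong("n"));
      }
    }
  }
}
```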
Pros: a familiar SQL-like query language (HiveQL), suitability for batch ETL and data warehousing workloads, scalability over very large datasets, and fault tolerance inherited from Hadoop.
Cons: high query latency, making it unsuitable for real-time or interactive processing, and limited support for row-level updates.
Presto
Presto is an open-source distributed SQL query tool best suited for smaller datasets of up to around 3 TB. The Presto engine includes a coordinator and multiple workers. When a client submits queries, the coordinator parses and analyses them, plans their execution, and distributes the processing among the workers.
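A client-side sketch of this flow in Java is shown below: the query is submitted over JDBC and the coordinator transparently plans and distributes it to the workers. It assumes the legacy presto-jdbc driver (in newer Trino releases the driver and URL prefix are named trino); the catalog, schema, table and column names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoQuerySketch {
  public static void main(String[] args) throws Exception {
    // The coordinator's HTTP endpoint; catalog and schema are illustrative.
    String url = "jdbc:presto://localhost:8080/hive/default";
    Properties props = new Properties();
    props.setProperty("user", "analyst"); // Presto requires a user name

    try (Connection conn = DriverManager.getConnection(url, props);
         Statement stmt = conn.createStatement();
         // The coordinator parses and plans this query, then fans the work
         // out to the worker nodes; the client just reads back the rows.
         ResultSet rs = stmt.executeQuery(
             "SELECT status, COUNT(*) AS n FROM orders GROUP BY status")) {
      while (rs.next()) {
        System.out.println(rs.getString("status") + "\t" + rs.getLong("n"));
      }
    }
  }
}
```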
Pros: minimal query degradation even under increased concurrent query workload, a query execution rate roughly three times faster than Hive's, ease of adding images and embedding links, and a highly user-friendly experience.
Cons: reliability issues.
Impala
Impala is an open-source MPP (Massively Parallel Processing) query engine that runs on multiple systems within a Hadoop cluster. It is written in C++ and Java.
Impala is not coupled with its storage engine. It includes three main components:
- Impala Daemon (impalad): executed on every node where Impala is installed.
- Impala StateStore
- Impala MetaStore
Impala has its own SQL-like query language.
Pros: support for in-memory computation, so it accesses data directly from Hadoop nodes without data movement; smooth integration with BI tools such as Tableau and ZoomData; and support for a wide range of file formats.
Cons: no support for serialisation and deserialisation of data, inability to read custom binary files, and the need to refresh tables whenever new records are added.
Contact Us
1 347 374 8437
info@cuelogic.com
https://www.cuelogic.com/
Unit 610, 134 W 29th St, New York, NY 10001
Content Source: Cuelogic Blog