Title: Big Data
1 Big Data
3 Big Data
- What is Big Data?
- Analog storage vs. digital
- The FOUR Vs of Big Data
- Who's Generating Big Data?
- The importance of Big Data
- Optimization
- HDFS
4 Definition
Big data is the term for a collection of data
sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data processing
applications. The challenges include capture,
curation, storage, search, sharing, transfer,
analysis, and visualization.
6 The FOUR Vs of Big Data
From traffic patterns and music downloads to web
history and medical records, data is recorded,
stored, and analyzed to enable the technology
and services that the world relies on every day.
But what exactly is big data? According to IBM
scientists, big data can be broken down into four
dimensions: Volume, Velocity, Variety and
Veracity.
8 The FOUR Vs of Big Data
Volume. Many factors contribute to the increase
in data volume. Transaction-based data stored
through the years. Unstructured data streaming in
from social media. Increasing amounts of sensor
and machine-to-machine data being collected. In
the past, excessive data volume was a storage
issue. But with decreasing storage costs, other
issues emerge, including how to determine
relevance within large data volumes and how to
use analytics to create value from relevant data.
10 The FOUR Vs of Big Data
Variety. Data today comes in all types of
formats. Structured, numeric data in traditional
databases. Information created from
line-of-business applications. Unstructured text
documents, email, video, audio, stock ticker data
and financial transactions. Managing, merging and
governing different varieties of data is
something many organizations still grapple with.
12 The FOUR Vs of Big Data
Velocity. Data is streaming in at unprecedented
speed and must be dealt with in a timely manner.
RFID tags, sensors and smart metering are driving
the need to deal with torrents of data in
near-real time. Reacting quickly enough to deal
with data velocity is a challenge for most
organizations.
14 The FOUR Vs of Big Data
Veracity. Big Data veracity refers to the
biases, noise and abnormality in data. Is the
data that is being stored and mined meaningful
to the problem being analyzed? Inderpal feels
that veracity in data analysis is the biggest
challenge when compared to things like volume
and velocity. In scoping out your big data
strategy, you need to have your team and
partners work to keep your data clean, and put
processes in place to keep dirty data from
accumulating in your systems.
15 Who's Generating Big Data?
- Progress and innovation are no longer hindered
by the ability to collect data
- But by the ability to manage, analyze,
summarize, visualize, and discover knowledge from
the collected data in a timely manner and in a
scalable fashion
16 The importance of Big Data
- The real issue is not that you are acquiring
large amounts of data. It's what you do with the
data that counts. The hopeful vision is that
organizations will be able to take data from any
source, harness relevant data and analyze it to
find answers that enable
- Cost reductions
- Time reductions
- New product development and optimized offerings
- Smarter business decision making
18 The importance of Big Data
- For instance, by combining big data and
high-powered analytics, it is possible to
- Determine root causes of failures, issues and
defects in near-real time, potentially saving
billions of dollars annually.
- Optimize routes for many thousands of package
delivery vehicles while they are on the road.
- Analyze millions of SKUs to determine prices
that maximize profit and clear inventory.
- Generate retail coupons at the point of sale
based on the customer's current and past
purchases.
- Send tailored recommendations to mobile devices
while customers are in the right area to take
advantage of offers.
- Recalculate entire risk portfolios in minutes.
- Quickly identify customers who matter the most.
- Use clickstream analysis and data mining to
detect fraudulent behavior.
19 HDFS / Hadoop
- Data in an HDFS cluster is broken down into
smaller pieces (called blocks) and distributed
throughout the cluster. In this way, the map and
reduce functions can be executed on smaller
subsets of your larger data sets, and this
provides the scalability that is needed for big
data processing. The goal of Hadoop is to use
commonly available servers in a very large
cluster, where each server has a set of
inexpensive internal disk drives.
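To make the map and reduce functions concrete,
below is a minimal word-count sketch using
Hadoop's standard MapReduce Java API, following
the canonical example from the Hadoop
documentation. The class names are illustrative,
and the input/output paths would come from the
command line in a real job.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
    // The mapper runs in parallel on each block-sized input split
    // and emits (word, 1) for every token it sees.
    public static class TokenMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      public void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
    // The reducer sums the counts collected from all mappers
    // for each distinct word.
    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values,
          Context ctx) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
      }
    }
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenMapper.class);
      job.setCombinerClass(SumReducer.class); // local pre-aggregation
      job.setReducerClass(SumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Because each mapper works on one block, adding
nodes lets the same job spread over more splits
without changing the code.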
20 PROS OF HDFS
- Scalable: New nodes can be added as needed, and
added without needing to change data formats, how
data is loaded, how jobs are written, or the
applications on top.
- Cost effective: Hadoop brings massively parallel
computing to commodity servers. The result is a
sizeable decrease in the cost per terabyte of
storage, which in turn makes it affordable to
model all your data.
- Flexible: Hadoop is schema-less, and can absorb
any type of data, structured or not, from any
number of sources. Data from multiple sources can
be joined and aggregated in arbitrary ways,
enabling deeper analyses than any one system can
provide.
- Fault tolerant: When you lose a node, the system
redirects work to another location of the data
and continues processing without missing a beat
(see the sketch after this list).
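The fault tolerance comes from block replication:
HDFS stores several copies of each block on
different nodes (three by default), so a failed
node still leaves live copies. As a minimal
sketch, the snippet below writes a file through
Hadoop's FileSystem API with an explicit
replication factor; the NameNode address and the
file path are assumptions made for illustration.

  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Hypothetical cluster address; replace with your NameNode URI.
      conf.set("fs.defaultFS", "hdfs://namenode:8020");
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("/demo/replicated.txt"); // illustrative path
      // Replication factor 3: each block of this file is stored on
      // three nodes, so losing one node leaves two live copies and
      // processing continues uninterrupted.
      short replication = 3;
      try (FSDataOutputStream out = fs.create(path, true,
          conf.getInt("io.file.buffer.size", 4096), replication,
          fs.getDefaultBlockSize(path))) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }
    }
  }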
24 Sources
- McKinsey Global Institute
- Cisco
- Gartner
- EMC, SAS
- IBM
- MEPTEC
25 Thank you for your attention.
Authors: Tomasz Wis, Krzysztof Rudnicki