1
Big Data - What Is It?
Demetris Zeinalipour, Assistant Professor, Data
Management Systems Laboratory, Department of
Computer Science, University of Cyprus
http://dmsl.cs.ucy.ac.cy/
EPL671: Research Methodologies in Computer
Science, Graduate Course, Tuesday, Mar 19th,
2013.
2
Objectives

To provide an overview of the emerging field of
Big Data Management from a wide range of
perspectives:
Fundamentals / Trends, Industrial / Academic,
Commercial / Open, Reality / Visionary, etc.
I assume that the audience has a technical
background (e.g., DBAs).
Lots of examples and illustrations to keep this
presentation entertaining and educational.

3
Talk Outline

Big Data Definitions and Background
Big Data Definition by 3V Examples
Velocity
Sensor Monitoring, Network Monitoring, Web2.0
Media, Smartphone Services
Volume
Text < Multimedia < Sciences, Web Data, Filesystems
Variety
The New Database Landscape
NoSQL (Document Stores, Replication, Consistency,
Map-Reduce, Column Stores)
NewSQL Trends
Big Data Education and Research
Courses @ UCY
Research Prototypes @ UCY

4
Big Data Definitions

"Refers to data sets whose size and structure
strains (stretches) the ability of commonly used
relational DBMSs to capture, manage, and process
the data within a tolerable elapsed time."
Hoffer, Ramesh, Topi Modern Database Management,
11E, 2013.
Similar from Wikipedia, Feb. 2013
"big data is a collection of data sets so large
and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications."

5
Big Data Characteristics

Size: from a few dozen terabytes to many
petabytes in a single database.
Data model: anything from structured (relational
or tabular) to semi-structured (XML or JSON) or
even unstructured (Web text and log files).
Architectures: highly parallel and distributed in
order to cope with the inherent I/O and CPU
limitations.
Hardware: mid-scale private clouds (datacenters),
offering higher privacy, to large-scale public
clouds.
Functionality: operational (OLTP) and analytic
(OLAP) functionality, stand-alone or as-a-Service.

6
Big Data Characteristics
2013 IEEE International Conference on Big Data
(IEEE BigData 2013), October 6-9, 2013, Silicon
Valley, CA, USA
Wordle.net
7
Background: Public Clouds
Google's Datacenter in Oregon
8
Background: Public Clouds
Microsoft Azure in Chicago
112 containers x 2,000 servers = 224,000 servers
9
Background -as-a-Service
To Amazon RDS (Relational Database Service)
963 / year
27,165 / year
10
Background: Private Clouds
Our Laboratory Private IaaS
11
Cloud Computing => Big Data
12
Big Data: Velocity-Volume-Variety

Velocity
how fast data is being produced and how fast the
data must be processed to meet demand.
How to deal with torrents of data, in near-real
time, streaming from RFID tags and smart metering
systems?
How to identify fraud in 5 million trade events
created each day?
Reacting quickly enough to deal with velocity is
a challenge to most organizations.

Source: IDC, "Big Data Analytics: Future
Architectures, Skills and Roadmaps for the CIO,"
September 2011.
13
Big Data: Velocity-Volume-Variety

Volume
Past Challenge: Store data.
Transaction-based data stored through the years.
Sensor data being collected.
Integration with web applications and social media.
New Challenge: Create value from data.
Turn 12 TB of Tweets each day into a sentiment analysis (opinion mining) product.
E.g., people feel positive/negative/neutral about brand X.
Turn 350 billion annual smart meter readings into knowledge that helps predict power consumption.

14
Big Data: Velocity-Volume-Variety

Variety
By some estimates, 80 percent of an organization's data is not numeric!
Different data formats: unstructured, structured, semi-structured
(text, sensor data, audio, video, click streams, log files, etc.)

15
Talk Outline

Big Data Definitions and Background
Big Data Definition by 3V Examples
Velocity
Sensor Monitoring, Network Monitoring, Web2.0
Media, Smartphone Services
Volume
Text < Multimedia < Sciences, Web Data, Filesystems
Variety
The New Database Landscape
NoSQL (Document Stores, Replication, Consistency,
File Systems, Map-Reduce, Column Stores)
NewSQL Trends
Big Data Education and Research
Courses @ UCY
Research Prototypes @ UCY

16
Velocity 1: Smart Meters

A smart meter records the consumption of electric
energy at intervals and communicates that
information to the utility for monitoring and
billing purposes.

Every 15 min
17
Velocity 1: Smart Meters

Ontario's Meter Data Management and Repository (MDM/R): storing, processing and managing all smart meter data in Ontario, Canada
Characteristics:
Provides hourly billing quantity and extensive reports.
4.6 million smart meters.
Storage/Bandwidth: 4.6M meters x 0.5 KB message (typical HTTP) = 2.3 GB / round
110 million meter reads per day
On an annual basis, this exceeds the number of debit card transactions processed in the country (Canada!)

Source: Smart Metering Entity, http://www.smi-ieso.ca/mdmr
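
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that "0.5K" means 0.5 KB (taken here as 500 bytes) per message and that "hourly" means 24 reads per meter per day; both are interpretations, not numbers stated by the source.

```python
METERS = 4_600_000      # smart meters in Ontario (from the slide)
MESSAGE_BYTES = 500     # assumption: "0.5K" = 0.5 KB = 500 bytes per message
READS_PER_DAY = 24      # assumption: one read per meter per hour

# Data volume when every meter reports once ("one round").
bytes_per_round = METERS * MESSAGE_BYTES

# Number of reads per day and the average arrival rate.
reads_per_day = METERS * READS_PER_DAY
reads_per_second = reads_per_day / 86_400

print(f"{bytes_per_round / 1e9:.1f} GB per round")
print(f"{reads_per_day / 1e6:.1f} million reads per day")
print(f"{reads_per_second:.0f} reads per second on average")
```

Under these assumptions the round size matches the 2.3 GB on the slide, and the daily total comes out close to the quoted 110 million reads.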
18
Velocity 2: Network Monitoring

Akamai
CDN serving 15-30% of all Web traffic (10 TB/sec)
One out of every three Global 500 companies
All of the top Internet portals
Has a picture of the global traffic every 6 seconds
How?
119,000 servers in 80 countries within over 1,100 networks.
Servers report network health information (latency/loss) to a proprietary database every 6 seconds.

Proprietary DBMS
Every 6 seconds
ping/traceroute
19
Velocity 2: Network Monitoring
Companies started seeking Big Data engineers.
20
Velocity 3: Web2.0 Media

Analyze online conversations in social networks.
Accelerated responses to marketplace shifts.

Continuously, over Web2.0 protocols
21
Velocity 3: Web2.0 Media
Web1.0: The Unstructured Web, http://books.google.com/
(content in HTML, apprehensible only to the user)
22
Velocity 3: Web2.0 Media
Web2.0: The Semi-structured Web! https://www.googleapis.com/books/v1/volumes?q=databases
(content in XML/JSON, apprehensible to a computer)
23
Velocity 3: Web2.0 Media
Twitter API
https://twitter.com/users/dmslucy.json
24
Velocity 3: Web2.0 Media
In fact, Web2.0 services are omnipresent! (Google,
Twitter, Facebook, YouTube, LinkedIn, ...)
http://www.programmableweb.com/ - 7800 APIs!!!
6800 Mashups!
https://code.google.com/apis
25
Velocity 4: Smartphone Services
Request Format (request.json)
{
  "homeMobileCountryCode": 310,
  "homeMobileNetworkCode": 260,
  "radioType": "gsm",
  "carrier": "T-Mobile",
  "cellTowers": [
    {
      "cellId": 39627456,
      "locationAreaCode": 40495,
      "mobileCountryCode": 310,
      "mobileNetworkCode": 260,
      "age": 0,
      "signalStrength": -95
    }
  ],
  "wifiAccessPoints": [
    {
      "macAddress": "01:23:45:67:89:AB",
      "signalStrength": 8,
      "age": 0,
      "signalToNoiseRatio": -65,
      "channel": 8
    },
    {
      "macAddress": "01:23:45:67:89:AC",
      "signalStrength": 4,
      "age": 0
    }
  ]
}

Response Format: The response format is also JSON.
{
  "location": {
    "latitude": 51.0,
    "longitude": -0.1
  },
  "accuracy": 1200.4
}

We will discuss some further in-house
applications in a while.
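
To make the request/response exchange on this slide concrete, here is a minimal Python sketch that serializes a request body and parses a response. It uses only the field names and values shown on the slide; the MAC addresses are written in colon-separated form, which is an assumption about the original notation, and no network call is made because the endpoint is not given in the source.

```python
import json

# Request body built from the field names shown on the slide.
request_body = {
    "homeMobileCountryCode": 310,
    "homeMobileNetworkCode": 260,
    "radioType": "gsm",
    "carrier": "T-Mobile",
    "cellTowers": [
        {"cellId": 39627456, "locationAreaCode": 40495,
         "mobileCountryCode": 310, "mobileNetworkCode": 260,
         "age": 0, "signalStrength": -95},
    ],
    "wifiAccessPoints": [
        # Assumption: colon-separated MAC notation.
        {"macAddress": "01:23:45:67:89:AB", "signalStrength": 8,
         "age": 0, "signalToNoiseRatio": -65, "channel": 8},
        {"macAddress": "01:23:45:67:89:AC", "signalStrength": 4, "age": 0},
    ],
}
payload = json.dumps(request_body)  # the text a client would send

# Response text with the values from the slide (not fetched from a server).
response_text = (
    '{"location": {"latitude": 51.0, "longitude": -0.1}, "accuracy": 1200.4}'
)

def parse_location(text):
    """Return (latitude, longitude, accuracy) from a JSON response document."""
    doc = json.loads(text)
    location = doc["location"]
    return location["latitude"], location["longitude"], doc["accuracy"]

lat, lon, accuracy = parse_location(response_text)
print(f"lat={lat}, lon={lon}, accuracy={accuracy}")
```

The point of the example is that both sides of a Web2.0 service exchange plain JSON text, which any language can produce and consume with a standard library.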
26
Velocity 4: Smartphone Services
Wireless Data Transfer Rates

4G ITU peak rates:
100 Mbps (high mobility, such as trains and cars)
1 Gbps (low mobility, such as pedestrians and stationary users)

Plot courtesy of H. Kim, N. Agrawal, and C. Ungureanu, "Revisiting Storage for Smartphones", The 10th USENIX Conference on File and Storage Technologies (FAST'12), San Jose, CA, February 2012. Best Paper Award.
27
Velocity 4: Smartphone Services
Mapping road traffic by collecting WiFi signals.
Every 1 second
Received Signal Strength (RSS): the power present in a WiFi radio signal
Graphics courtesy of A. Thiagarajan et al., "VTrack: Accurate, Energy-Aware Road Traffic Delay Estimation Using Mobile Phones", in SenSys'09, pages 85-98, ACM (Best Paper), MIT's CarTel Group
28
Talk Outline

Big Data Definitions and Background
Big Data Definition by 3V Examples
Velocity
Sensor Monitoring, Network Monitoring, Web2.0
Media, Smartphone Services
Volume
Text < Multimedia < Sciences, Web Data, Filesystems
Variety
The New Database Landscape
NoSQL (Document Stores, Replication, Consistency,
File Systems, Map-Reduce, Column Stores)
NewSQL Trends
Big Data Education and Research
Courses @ UCY
Research Prototypes @ UCY

29
Volume 1: Text < Multimedia < Sciences

From the TB-era to the PB-era.
The U.S. Library of Congress (April 2011): 235 TB
Ancestry.com genealogical data: 600 TB
Games: World of Warcraft uses 1.3 PB of storage to maintain its game.
Internet video will account for 61% of total Internet data by 2015 (966 exabytes, or nearly 1 zettabyte!)
Climate science: The German Climate Computing Centre (DKRZ) has a storage capacity of 60 PB of climate data.
Physics: The experiments in the Large Hadron Collider produce about 15 PB of data per year, which is distributed over the LHC Computing Grid (our department is part of EGEE, Enabling Grids for E-sciencE, now EGI, the European Grid Infrastructure).

Human Generated
Multimedia / Streaming
Sciences / Sensors
Source: Petabyte, from Wikipedia
http://en.wikipedia.org/wiki/Petabyte
30
Volume 2: Web Data
Google Volume (in 2006)
IDC: The total amount of global data is expected to grow to 2.7 zettabytes during 2012. This is 48% up from 2011.
http://en.wikipedia.org/wiki/Zettabyte
"Bigtable: A Distributed Storage System for Structured Data", OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006.
31
Volume 3: Big Data File Systems

Big Data Filesystems: HDFS

Namespace lookups are fast (1 Master is enough!): 1 GB of metadata for 1 PB of data.
In NFS, metadata and transfers go through the same server => not scalable.
HDFS is designed for unreliable hardware (2-3 failures / 1000 nodes / day).
32
Volume 3: Big Data File Systems

Big Data Filesystems: How Big?
Results from 2010

"HDFS scalability: the limits to growth"
http://static.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
33
Variety 2 File Systems
NFS uses a Client/Server Architecture that is a
single point of failure by default.
34
Talk Outline

Big Data Definitions and Background
Big Data Definition by 3V Examples
Velocity
Sensor Monitoring, Network Monitoring, Web2.0
Media, Smartphone Services
Volume
Text < Multimedia < Sciences, Web Data, Filesystems
Variety
The New Database Landscape
NoSQL (Document Stores, Replication, Consistency,
File Systems, Map-Reduce, Column Stores)
NewSQL Overview (ACID-compliant NoSQL stores)
Big Data Teaching and Research
Courses @ UCY
Research Prototypes @ UCY

35
Variety: Overview
451 Research, Matthew Aslett, http://goo.gl/GYcEx
36
Variety 1: NoSQL

NoSQL ("not only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.
NoSQL databases are NOT built primarily on tables, and generally DO NOT use SQL for data manipulation.
NoSQL => Not Relational!
Key-Value stores (e.g., BerkeleyDB - embedded, Oracle NoSQL - distributed)
Document Stores (e.g., JSON stores)
BigTables (i.e., Column-stores)
Graph Databases (e.g., FlockDB)
(Potentially a much longer list, but I will only focus on a few trends.)

37
Variety 1: NoSQL / Document Stores
Document in CouchDB
Map Function
function(doc) {
  for (i in doc.authors) {
    author = doc.authors[i];
    emit(doc._id, author);
  }
}
Results (through REST/HTTP or Futon)
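
The effect of the map function on this slide can be mimicked outside CouchDB. The following Python sketch is only an emulation: `run_view`, the `emit` callback and the sample documents are hypothetical stand-ins for illustration, not part of the CouchDB API or of the original slides.

```python
def map_authors(doc, emit):
    """Python counterpart of the slide's map function: one row per author."""
    for author in doc.get("authors", []):
        emit(doc["_id"], author)

def run_view(docs, map_fn):
    """Apply map_fn to every document and return the emitted rows in sorted order."""
    rows = []
    for doc in docs:
        map_fn(doc, lambda key, value: rows.append((key, value)))
    return sorted(rows)

# Hypothetical documents, for illustration only.
docs = [
    {"_id": "book-2", "title": "B", "authors": ["Carol"]},
    {"_id": "book-1", "title": "A", "authors": ["Alice", "Bob"]},
]

for key, value in run_view(docs, map_authors):
    print(key, value)
```

Each document contributes one (key, value) row per author, so a document with two authors appears twice in the result.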
38
Variety 1: NoSQL / Document Stores
For a real app we could envision much more
complex queries.

http://rickosborne.org/download/SQL-to-MongoDB.pdf
39
Variety 1: NoSQL / Replication
Asynchronous replication means eventually
consistent.
Asynchronous
Asynchronous
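
Why asynchronous replication implies eventual consistency can be shown with a toy model. The sketch below is a single-process simulation with hypothetical `Primary` and `Replica` classes; it does not model any particular NoSQL product, and the explicit `replicate()` call stands in for what a real system does in the background.

```python
from collections import deque

class Replica:
    """A read-only copy that receives updates some time after the primary."""
    def __init__(self):
        self.data = {}

class Primary:
    """Acknowledges writes immediately and ships them to replicas later."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = deque()  # replication log not yet applied on replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # queued, not yet replicated

    def replicate(self):
        """Drain the log; a real system would do this asynchronously."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value

replica = Replica()
primary = Primary([replica])

primary.write("x", 1)
stale = replica.data.get("x")   # read before replication: value is missing
primary.replicate()
fresh = replica.data.get("x")   # read after replication: replica has converged

print(stale, fresh)
```

Between the write and the replication step a reader of the replica sees old state; once the log is drained, all copies agree again.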
40
Variety 1: NoSQL / Consistency?
SQL RDBMSs: Strongly consistent!
(Most) NoSQL DBMSs: Eventually consistent!
41
Variety 2: NoSQL / Map Reduce Analytics

Map-Reduce: a programming model for processing large data sets (not online, like warehouses?).
Invented by Google! "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
Can be implemented in any language (Java example next).
Hadoop: Apache's open-source software framework that supports data-intensive distributed applications.
Derived from Google's MapReduce and Google File System (GFS) papers.
Enables applications to work with thousands of computation-independent computers and petabytes of data.
Download: http://hadoop.apache.org/

42
Variety 2: NoSQL / Map Reduce Analytics
Count the distinct words in all documents: cat *.txt | sort | uniq -c
1 TB on 1 PC: 2 hours!!! 1 TB on 100 PCs: 1 min!!!
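
The word-count task on this slide is the usual introductory Map-Reduce example. Below is a minimal single-process Python sketch of the three phases; it illustrates the programming model only and is not Hadoop code. The function names and the input documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    """Run map over every document, then shuffle and reduce the result."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))

# Hypothetical input documents.
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a cluster, the map calls run in parallel on different blocks of input and the shuffle moves each key to one reducer, which is what lets the same job spread over many machines.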
43
Variety 2: NoSQL / Map Reduce Analytics
The example uses 1 mapper / 1 reducer only!
Shuffle
Reduce
Map
44
Variety 2: NoSQL / Map Reduce Analytics
Standard Output (e.g., socket)
HDFS blocks (64 MB, containing documents)
Hashing
Remote Write (e.g., Socket)
HDFS Writing
Local Shuffling (of terms)
HDFS Reading
45
Variety 3: NoSQL / Column Stores

A column-oriented DBMS is a database management
system (DBMS) that stores data tables as sections
of columns rather than as rows, like most
relational DBMSs
Suggested for data warehouses, customer
relationship management (CRM) systems and other
ad-hoc inquiry systems where aggregates or scans
are carried out over large numbers of similar
data items

Column-Store (OLAP workloads!):
1,2,3
Smith,Jones,Johnson
Joe,Mary,Cathy
40000,50000,44000

Row-Store (OLTP workloads!):
1,Smith,Joe,40000
2,Jones,Mary,50000
3,Johnson,Cathy,44000
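
The two layouts can be compared directly in code. The sketch below stores the slide's three records both ways; the column names (`id`, `last_name`, `first_name`, `salary`) are assumptions, since the slide shows only values.

```python
# Assumed column names; the slide lists values only.
NAMES = ("id", "last_name", "first_name", "salary")

# Row-store: each record is kept together.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

def to_columns(rows, names):
    """Pivot row tuples into one list per column (column-store layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Column-store: each attribute is kept together.
columns = to_columns(rows, NAMES)

# OLTP-style access: fetch one whole record, a single lookup in the row-store.
record = rows[1]

# OLAP-style access: aggregate one attribute, a single scan of one column.
average_salary = sum(columns["salary"]) / len(columns["salary"])

print(record)
print(columns["salary"], round(average_salary, 2))
```

Reading a full record touches one entry in the row layout but every column in the columnar one, while an aggregate over salaries touches one list in the columnar layout but every record in the row layout.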
46
Variety 3: NoSQL / Column Stores
All column family members are stored together on
the big data filesystem.
47
Variety 3: NoSQL / Column Stores

Not suitable for every problem.
You need enough data: more than a few thousand/million rows.
Make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.).
An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example.
Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Have enough hardware: even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop, but this should be considered a development configuration only.

48
Variety 4: NewSQL

"NewSQL" is a class of modern relational database
management systems that seek to provide the same
scalable performance of NoSQL systems for OLTP
workloads while still maintaining the ACID
guarantees (i.e., offering transactions) of a
traditional DBMS.

NewSQL = NoSQL + Transactions
49
Variety 4: NewSQL
Google's Trajectory

(2003) Google GFS Paper (SOSP'03)
Objective: Create a Google-scale filesystem.
Apache HDFS is the open-source implementation of GFS.
(2004) Google's Map-Reduce Paper (OSDI'04)
Objective: Enable big-data analytics over non-tabular data (e.g., XML or text) with the assistance of GFS.
Apache's MapReduce: An open-source implementation of the paper.
(2006) Google BigTable Paper (OSDI'06)
Objective: Enable big-data analytics over tabular data (i.e., tables).
(2008) Apache's HBase: An open-source implementation of the paper.
(2010) Facebook Messaging moves from Cassandra to HBase.
(2012) Google's F1 RDBMS (SIGMOD'12) and Spanner Storage (OSDI'12) Papers

50
Talk Outline

Big Data Definitions and Background
Big Data Definition by 3V Examples
Velocity
Sensor Monitoring, Network Monitoring, Web2.0
Media, Smartphone Services
Volume
Text < Multimedia < Sciences, Web Data, Filesystems
Variety
The New Database Landscape
NoSQL (Document Stores, Replication, Consistency,
File Systems, Map-Reduce, Column Stores)
NewSQL Overview (ACID-compliant NoSQL stores)
Big Data Education and Research
Courses @ UCY
Research Prototypes @ UCY

51
Big Data Courses @ UCY

NoSQL and NewSQL
Intro to Web2.0: the JSON data interchange format
Key-Value data model: CouchDB
Introduction and Fundamentals: I/O Performance, Replication Strategies, etc.
Big-data Filesystems: HDFS
"Big-Data" Analytics: Map-Reduce, Hadoop, PIG
Column Stores: BigTable, HBase and Intro to NewSQL (Spanner and F1)

Advanced Topics in Databases: http://www.cs.ucy.ac.cy/dzeina/courses/epl646
52
Big Data Courses Elsewhere

Data science incorporates varying elements and builds on techniques and theories from many fields, with the goal of extracting meaning from data and creating data products.
Data Science Combines the Following Fields:
Math,
Statistics,
Data engineering,
Pattern recognition and learning,
Advanced computing,
Visualization,
Uncertainty modeling,
Data warehousing, and
High performance computing

53
Big Data Courses Elsewhere

Course Syllabus Example (Univ. of Washington)
Data modeling: relations, key-value, trees, graphs, images, text
Relational algebra and parallel query processing
NoSQL systems, key-value stores
Tradeoffs of SQL, NoSQL, and NewSQL systems
Algorithm design in Hadoop (and MapReduce in general)
Basic statistical analysis at scale: sampling, regression
Introduction to data mining: clustering, association rules, decision trees
Case studies in analytics: social networking, bioinformatics, text processing
Free 10-week course

https://www.coursera.org/course/datasci/
54
Big Data Research @ UCY

Crowdbeam: Build an innovative Windows Phone messaging platform for a Finnish alliance, backed by Microsoft and Nokia.
Problem: Millions of users querying their K closest smartphones continuously.
Query executed every few seconds.
Currently a state-less service.
Setup: A 14-node Couchbase cluster (i.e., a distributed, shared-nothing-architecture, NoSQL document-oriented database that is optimized for interactive applications).
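
The query described in the Problem statement (each user asking for their K closest smartphones) can be stated precisely with a brute-force sketch. This is not how Crowdbeam is implemented; the function names and the sample positions are hypothetical, and a deployment serving millions of users every few seconds would need something faster than a linear scan per query.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def k_closest(user_id, positions, k):
    """Return the ids of the k devices nearest to user_id (linear scan)."""
    origin = positions[user_id]
    candidates = (
        (haversine_km(origin, pos), uid)
        for uid, pos in positions.items()
        if uid != user_id
    )
    return [uid for _, uid in heapq.nsmallest(k, candidates)]

# Hypothetical device positions as (latitude, longitude).
positions = {
    "u1": (35.1450, 33.4100),
    "u2": (35.1460, 33.4110),
    "u3": (35.1700, 33.3600),
    "u4": (34.7000, 33.0200),
}

print(k_closest("u1", positions, k=2))
```

Each call scans all devices once, so the cost grows with the number of users times the query rate, which is the scaling pressure the slide points at.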

55
Big Data Research @ UCY
Native JSON Store JSON RESTful API
56
Big Data Research @ UCY

Airplace: Build an innovative indoor localization and navigation platform for a Taiwanese company.
Problem: Radiomaps of indoor environments are fairly large structures, considering that they are becoming massively available.
Setup: A 4-node Apache HBase cluster (i.e., a distributed, non-relational, shared-nothing-architecture database that is modeled after Google's BigTable and written in Java).
Best Demo Award at IEEE MDM'12, covered on Euronews and local media.

57
Big Data Research @ UCY
SmartLab: Massive smartphone simulations with our
first global open smartphone IaaS cloud
http://smartlab.cs.ucy.ac.cy/
58
Big Data Research @ UCY
http://smartlab.cs.ucy.ac.cy/
59
Big Data - What Is It?
Demetris Zeinalipour, Assistant Professor, Data
Management Systems Laboratory, Department of
Computer Science, University of Cyprus
http://dmsl.cs.ucy.ac.cy/
Thanks! Questions?
