Title: Big Data - What Is It?
1. Big Data - What Is It?
Demetris Zeinalipour, Assistant Professor
Data Management Systems Laboratory, Department of Computer Science, University of Cyprus
http://dmsl.cs.ucy.ac.cy/
EPL671 Research Methodologies in Computer Science, Graduate Course, Tuesday, Mar 19th, 2013.
2. Objectives
- To provide an overview of the emerging field of Big Data Management from a wide range of perspectives.
  - Fundamentals / Trends, Industrial / Academic, Commercial / Open, Reality / Visionary, etc.
- I assume that the audience has a technical background (e.g., DBAs).
- Lots of examples and illustrations to keep this presentation entertaining and educational.
3. Talk Outline
- Big Data Definitions and Background
- Big Data Definition by 3V Examples
  - Velocity
    - Sensor Monitoring, Network Monitoring, Web2.0 Media, Smartphone Services
  - Volume
    - Text < Multimedia < Sciences, Web Data, Filesystems
  - Variety
    - The New Database Landscape
    - NoSQL (Document Stores, Replication, Consistency, Map-Reduce, Column Stores)
    - NewSQL Trends
- Big Data Education and Research
  - Courses @ UCY
  - Research Prototypes @ UCY
4. Big Data Definitions
- "Refers to data sets whose size and structure strains (stretches) the ability of commonly used relational DBMSs to capture, manage, and process the data within a tolerable elapsed time."
  - Hoffer, Ramesh, Topi: Modern Database Management, 11E, 2013.
- Similar from Wikipedia, Feb. 2013:
  - "big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."
5. Big Data Characteristics
- Size: from a few dozen terabytes to many petabytes in a single database.
- Data model: anything from structured (relational or tabular) to semi-structured (XML or JSON) or even unstructured (Web text and log files).
- Architectures: highly parallel and distributed in order to cope with the inherent I/O and CPU limitations.
- Hardware: mid-scale private clouds (datacenters), offering higher privacy, to large-scale public clouds.
- Functionality: operational (OLTP) and analytic (OLAP) functionality, stand-alone or as-a-Service.
6. Big Data Characteristics
2013 IEEE International Conference on Big Data
(IEEE BigData 2013), October 6-9, 2013, Silicon
Valley, CA, USA
Wordle.net
7. Background: Public Clouds
Google's Datacenter in Oregon
8. Background: Public Clouds
Microsoft Azure in Chicago
112 containers x 2000 servers = 224,000 servers
9. Background: -as-a-Service
To Amazon RDS (Relational Database Service)
963 / year
27,165 / year
10. Background: Private Clouds
Our Laboratory Private IaaS
11. Cloud Computing => Big Data
12. Big Data: Velocity-Volume-Variety
- Velocity
  - How fast data is being produced and how fast the data must be processed to meet demand.
  - How to deal with torrents of data, in near-real time, streaming from RFID tags and smart metering systems?
  - How to identify fraud in 5 million trade events created each day?
  - Reacting quickly enough to deal with velocity is a challenge to most organizations.
Source: IDC, "Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO," September 2011.
13. Big Data: Velocity-Volume-Variety
- Volume
  - Past Challenge: Store data.
    - Transaction-based data stored through the years.
    - Sensor data being collected.
    - Integration with web applications and social media.
  - New Challenge: Create value from data.
    - Turn 12 TB of Tweets each day into a sentiment analysis (opinion mining) product.
      - e.g., People feel positive/negative/neutral about brand X.
    - Turn 350 billion annual smart meter readings into knowledge that helps predict power consumption.
14. Big Data: Velocity-Volume-Variety
- Variety
  - By some estimates, 80 percent of an organization's data is not numeric!
  - Different data formats: unstructured, structured, semi-structured.
    - Text, sensor data, audio, video, click streams, log files, etc.
16. Velocity 1: Smart Meters
- A smart meter records consumption of electric energy in intervals and communicates that information to the utility for monitoring and billing purposes.
Every 15 min
17. Velocity 1: Smart Meters
- Ontario's Meter Data Management and Repository (MDM/R): storing, processing and managing all smart meter data in Ontario, Canada.
- Characteristics
  - Provides hourly billing quantity and extensive reports.
  - 4.6 million smart meters.
  - Storage/Bandwidth: 4.6M meters x 0.5K message (typical HTTP) = 2.3 GB / round.
  - 110 million meter reads per day.
    - On an annual basis, exceeds the number of debit card transactions processed in the country (Canada!).
Source: Smart Metering Entity, http://www.smi-ieso.ca/mdmr
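The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes "0.5K" means 500 bytes per message and uses decimal gigabytes; the function names are illustrative and not part of the MDM/R system.

```python
# Figures quoted on the slide.
METERS = 4_600_000            # smart meters in Ontario
MESSAGE_BYTES = 500           # assumption: "0.5K" read as 500 bytes per message
READS_PER_DAY = 110_000_000   # meter reads per day


def bytes_per_round(meters: int = METERS, message_bytes: int = MESSAGE_BYTES) -> int:
    """Total payload when every meter reports exactly once."""
    return meters * message_bytes


def reads_per_meter_per_day(reads: int = READS_PER_DAY, meters: int = METERS) -> float:
    """Average number of reads each meter contributes per day."""
    return reads / meters


def reads_per_year(reads: int = READS_PER_DAY, days: int = 365) -> int:
    """Total reads per (non-leap) year."""
    return reads * days


print(f"{bytes_per_round() / 1e9:.1f} GB per round")              # 2.3 GB per round
print(f"{reads_per_meter_per_day():.1f} reads per meter per day")  # 23.9
print(f"{reads_per_year():,} reads per year")                      # 40,150,000,000
```

Roughly 24 reads per meter per day is consistent with the hourly billing quantity mentioned above.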
18. Velocity 2: Network Monitoring
- Akamai
  - CDN serving 15-30% of all Web traffic (10TB/sec).
  - One out of every three Global 500 companies.
  - All of the top Internet portals.
  - Has a picture of the global traffic every 6 seconds. How?
    - 119,000 servers in 80 countries within over 1,100 networks.
    - Servers report network health information (latency/loss) to a proprietary database every 6 seconds.
Proprietary DBMS
Every 6 seconds
ping/traceroute
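To make the "picture every 6 seconds" idea concrete, here is a minimal sketch of a collector that keeps the latest health report per server and summarizes only the fresh ones. It is not Akamai's proprietary design; the class, field names, and the 6-second freshness window are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthReport:
    server_id: str
    latency_ms: float
    loss_pct: float
    timestamp: float  # seconds since some reference point


class HealthSnapshot:
    """Keeps only the most recent report per server (hypothetical collector)."""

    def __init__(self, window_s=6.0):
        self.window_s = window_s
        self._latest = {}

    def ingest(self, report):
        current = self._latest.get(report.server_id)
        if current is None or report.timestamp >= current.timestamp:
            self._latest[report.server_id] = report

    def fresh(self, now):
        """Reports that arrived within the last reporting window."""
        return [r for r in self._latest.values() if now - r.timestamp <= self.window_s]

    def mean_latency(self, now):
        reports = self.fresh(now)
        if not reports:
            return None
        return sum(r.latency_ms for r in reports) / len(reports)


def reports_per_second(servers, interval_s):
    """Average ingest rate if every server reports once per interval."""
    return servers / interval_s


print(round(reports_per_second(119_000, 6)))  # about 19833 reports per second
```

With the slide's numbers, the collector would have to absorb on the order of 20,000 reports per second, which hints at why a purpose-built database is used.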
19. Velocity 2: Network Monitoring
Companies have started seeking Big Data engineers.
20. Velocity 3: Web2.0 Media
- Analyze online conversations in Social Nets.
- Accelerated responses to marketplace shifts.
Continuously, over Web2.0 protocols
21. Velocity 3: Web2.0 Media
Web1.0: The Unstructured Web, http://books.google.com/
(content in HTML, only apprehensible to the user)
22. Velocity 3: Web2.0 Media
Web2.0: The Semi-structured Web! https://www.googleapis.com/books/v1/volumes?q=databases
(content in XML/JSON, apprehensible to the computer)
23. Velocity 3: Web2.0 Media
Twitter API
https://twitter.com/users/dmslucy.json
24. Velocity 3: Web2.0 Media
In fact, Web2.0 services are omnipresent! (Google, Twitter, Facebook, YouTube, LinkedIn, ...)
http://www.programmableweb.com/ - 7800 APIs!!! 6800 Mashups!
https://code.google.com/apis
25. Velocity 4: Smartphone Services
Request Format (request.json):
  {
    "homeMobileCountryCode": 310,
    "homeMobileNetworkCode": 260,
    "radioType": "gsm",
    "carrier": "T-Mobile",
    "cellTowers": [
      {
        "cellId": 39627456,
        "locationAreaCode": 40495,
        "mobileCountryCode": 310,
        "mobileNetworkCode": 260,
        "age": 0,
        "signalStrength": -95
      }
    ],
    "wifiAccessPoints": [
      {
        "macAddress": "01:23:45:67:89:AB",
        "signalStrength": 8,
        "age": 0,
        "signalToNoiseRatio": -65,
        "channel": 8
      },
      {
        "macAddress": "01:23:45:67:89:AC",
        "signalStrength": 4,
        "age": 0
      }
    ]
  }
Response Format: the response format is also JSON.
  {
    "location": {
      "latitude": 51.0,
      "longitude": -0.1
    },
    "accuracy": 1200.4
  }
We will discuss some further in-house applications in a while.
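The request/response exchange on this slide can be sketched with the standard json module. This is a minimal illustration that performs no network call; the helper names build_request and parse_response are hypothetical, and the field names are simply the ones shown on the slide.

```python
import json


def build_request(cell_towers, wifi_access_points, *, mcc, mnc,
                  radio_type="gsm", carrier=None):
    """Serialize observed cell towers and WiFi access points into a JSON request body."""
    body = {
        "homeMobileCountryCode": mcc,
        "homeMobileNetworkCode": mnc,
        "radioType": radio_type,
        "cellTowers": cell_towers,
        "wifiAccessPoints": wifi_access_points,
    }
    if carrier is not None:
        body["carrier"] = carrier
    return json.dumps(body)


def parse_response(text):
    """Extract (latitude, longitude, accuracy) from a JSON response body."""
    data = json.loads(text)
    location = data["location"]
    return location["latitude"], location["longitude"], data["accuracy"]


tower = {"cellId": 39627456, "locationAreaCode": 40495,
         "mobileCountryCode": 310, "mobileNetworkCode": 260,
         "age": 0, "signalStrength": -95}
request_body = build_request([tower], [], mcc=310, mnc=260, carrier="T-Mobile")

sample_response = '{"location": {"latitude": 51.0, "longitude": -0.1}, "accuracy": 1200.4}'
print(parse_response(sample_response))  # (51.0, -0.1, 1200.4)
```

Because both sides speak JSON, the smartphone needs nothing more than a serializer and an HTTP client to use such a service.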
26. Velocity 4: Smartphone Services
Wireless Data Transfer Rates
- 4G ITU peak rates
  - 100 Mbps (high mobility, such as trains and cars)
  - 1 Gbps (low mobility, such as pedestrians and stationary users)
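As a rough feel for what these peak rates mean, the sketch below computes ideal transfer times. The 1 GB payload is an arbitrary assumption, and real-world throughput is typically well below the peak rates.

```python
HIGH_MOBILITY_BPS = 100e6   # 100 Mbps peak rate (trains, cars)
LOW_MOBILITY_BPS = 1e9      # 1 Gbps peak rate (pedestrians, stationary users)


def transfer_seconds(size_bytes: float, rate_bits_per_s: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return size_bytes * 8 / rate_bits_per_s


one_gb = 1e9  # assumption: a 1 GB payload, decimal units
print(transfer_seconds(one_gb, HIGH_MOBILITY_BPS))  # 80.0 seconds
print(transfer_seconds(one_gb, LOW_MOBILITY_BPS))   # 8.0 seconds
```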
Plot courtesy of H. Kim, N. Agrawal, and C. Ungureanu, "Revisiting Storage for Smartphones", The 10th USENIX Conference on File and Storage Technologies (FAST'12), San Jose, CA, February 2012. Best Paper Award.
27. Velocity 4: Smartphone Services
Mapping road traffic by collecting WiFi signals.
Every 1 second
Received Signal Strength (RSS): power present in a WiFi radio signal
Graphics courtesy of A. Thiagarajan et al., "VTrack: Accurate, Energy-Aware Road Traffic Delay Estimation Using Mobile Phones", in SenSys'09, pages 85-98, ACM (Best Paper). MIT's CarTel Group.
29. Volume 1: Text < Multimedia < Sciences
- From the TB-era to the PB-era.
- The U.S. Library of Congress (April 2011): 235 TB
- Ancestry.com genealogical data: 600 TB
- Games: World of Warcraft uses 1.3 PB of storage to maintain its game.
- Internet video will account for 61% of total Internet data by 2015 (966 Exabytes, or nearly 1 Zettabyte!).
- Climate science: The German Climate Computing Centre (DKRZ) has a storage capacity of 60 PB of climate data.
- Physics: The experiments in the Large Hadron Collider produce about 15 PB of data per year, which is distributed over the LHC Computing Grid (our department is part of EGEE, Enabling Grids for E-sciencE, now EGI, the European Grid Infrastructure).
Figure labels: Human Generated; Multimedia / Streaming; Sciences / Sensors
Source: Petabyte, from Wikipedia, http://en.wikipedia.org/wiki/Petabyte
30. Volume 2: Web Data
Google Volume (in 2006)
IDC: The total amount of global data is expected to grow to 2.7 zettabytes during 2012. This is 48% up from 2011. http://en.wikipedia.org/wiki/Zettabyte
"Bigtable: A Distributed Storage System for Structured Data", OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006.
31. Volume 3: Big Data File Systems
- Big Data Filesystems: HDFS
Namespace lookups are fast (1 Master is enough!)
1GB Metadata = 1PB Data
In NFS: metadata and transfers go through the same server => Not Scalable
HDFS is designed for unreliable hardware (2-3 failures / 1000 nodes / day)
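The two rules of thumb on this slide lend themselves to a quick sizing estimate. The sketch below reads them as roughly 1 GB of master (NameNode) metadata per 1 PB of stored data and 2-3 node failures per 1000 nodes per day; the linear scaling and the 4000-node cluster are assumptions for illustration only.

```python
def namenode_metadata_gb(data_pb: float, gb_per_pb: float = 1.0) -> float:
    """Metadata the master must hold, assuming ~1 GB of metadata per 1 PB of data."""
    return data_pb * gb_per_pb


def expected_failures_per_day(nodes: int, low: float = 2.0, high: float = 3.0,
                              per_nodes: int = 1000):
    """Scale the 2-3 failures / 1000 nodes / day rule linearly to a cluster size."""
    scale = nodes / per_nodes
    return low * scale, high * scale


print(namenode_metadata_gb(30))         # 30.0 GB of metadata for 30 PB of data
print(expected_failures_per_day(4000))  # (8.0, 12.0) failures per day
```

At this ratio the whole namespace fits in the memory of a single master, which is why one master is enough for lookups, while daily failures at cluster scale are the norm rather than the exception.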
32. Volume 3: Big Data File Systems
- Big Data Filesystems: How Big?
  - Results from 2010
"HDFS scalability: the limits to growth", http://static.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
33. Variety 2: File Systems
NFS uses a client/server architecture in which the server is a single point of failure by default.
35. Variety: Overview
451 Research, Matthew Aslett, http://goo.gl/GYcEx
36. Variety 1: NoSQL
- NoSQL ("not only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.
- NoSQL databases are NOT built primarily on tables, and generally DO NOT use SQL for data manipulation.
- NoSQL => Not Relational!
  - Key-Value (e.g., BerkeleyDB - embedded, Oracle NoSQL - distributed)
  - Document Stores (e.g., JSON stores)
  - BigTables (i.e., Column-stores)
  - Graph Databases (e.g., FlockDB)
  - ... potentially a much longer list, but I will only focus on a few trends.
37. Variety 1: NoSQL / Document Stores
Document in CouchDB
Map Function
  function(doc) {
    // Emit one row per author, keyed by the document id.
    for (var i in doc.authors) {
      var author = doc.authors[i];
      emit(doc._id, author);
    }
  }
Results (through REST/HTTP or Futon)
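To see what rows such a map function produces, the following sketch mimics it in Python over two made-up documents. It is only an analogue, not CouchDB's view engine: the documents and helper names are hypothetical, and rows are ordered by key using plain Python string comparison.

```python
def map_authors(doc):
    """Python analogue of the map function: one (key, value) row per author."""
    for author in doc.get("authors", []):
        yield doc["_id"], author


def run_view(docs, map_fn):
    """Apply the map function to every document and order the rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])  # stable: emit order kept per key


# Hypothetical documents, for illustration only.
docs = [
    {"_id": "book2", "title": "Modern Database Management",
     "authors": ["Hoffer", "Ramesh", "Topi"]},
    {"_id": "book1", "title": "MapReduce", "authors": ["Dean", "Ghemawat"]},
    {"_id": "note1", "title": "No authors here"},
]

for key, value in run_view(docs, map_authors):
    print(key, value)
```

Documents without an authors field simply contribute no rows, which is how a schema-free store copes with heterogeneous documents.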
38. Variety 1: NoSQL / Document Stores
For a real app we could envision much more complex queries.
http://rickosborne.org/download/SQL-to-MongoDB.pdf
39. Variety 1: NoSQL / Replication
Asynchronous replication means eventually consistent.
40. Variety 1: NoSQL / Consistency?
- SQL RDBMSs: Strongly consistent!
- (Most) NoSQL DBMSs: Eventually consistent!
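The difference between the two consistency models can be illustrated with a toy simulation of asynchronous replication. This is a deliberately simplified sketch, not the protocol of any particular NoSQL system; the Primary and Replica classes are hypothetical.

```python
from collections import deque


class Replica:
    """A node that serves reads from its local copy of the data."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary(Replica):
    """Accepts writes and ships them to replicas later (asynchronously)."""

    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas
        self.pending = deque()  # updates acknowledged but not yet replicated

    def write(self, key, value):
        self.data[key] = value             # the write is acknowledged immediately
        self.pending.append((key, value))  # replication happens some time later

    def replicate(self):
        """Ship all pending updates, in order, to every replica."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replica = Replica()
primary = Primary([replica])
primary.write("x", 1)
print(replica.read("x"))  # None: the replica is stale until replication runs
primary.replicate()
print(replica.read("x"))  # 1: the replicas have converged
```

A strongly consistent system would not acknowledge the write until the replica had applied it, so the stale read in the middle could not happen.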
41. Variety 2: NoSQL / Map Reduce Analytics
- Map-Reduce: a programming model for processing large data sets (not online, like Warehouses?).
  - Invented by Google! "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
  - Can be implemented in any language (Java example next).
- Hadoop: Apache's open-source software framework that supports data-intensive distributed applications.
  - Derived from Google's MapReduce and Google File System (GFS) papers.
  - Enables applications to work with thousands of computation-independent computers and petabytes of data.
  - Download: http://hadoop.apache.org/
42. Variety 2: NoSQL / Map Reduce Analytics
Count the distinct words in all documents: cat *.txt | sort | uniq -c
1 TB on 1 PC: 2 hours!!! 1 TB on 100 PCs: 1 min!!!
43. Variety 2: NoSQL / Map Reduce Analytics
Example uses 1 mapper / 1 reducer only!
Map
Shuffle
Reduce
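The three stages can be sketched in a single process for the word-count task. This is a minimal in-memory illustration of the programming model with one mapper and one reducer; it is not Hadoop code, and the function names and sample documents are made up.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


docs = {"a.txt": "big data big", "b.txt": "data stores"}  # hypothetical input
print(word_count(docs))  # {'big': 2, 'data': 2, 'stores': 1}
```

In a real cluster, many mappers run in parallel over different blocks and the shuffle moves each key's values across the network to the reducer responsible for that key.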
44. Variety 2: NoSQL / Map Reduce Analytics
Standard Output (e.g., socket)
HDFS blocks (64MB, containing documents)
Hashing
Remote Write (e.g., socket)
HDFS Writing
Local Shuffling (of terms)
HDFS Reading
45. Variety 3: NoSQL / Column Stores
- A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns rather than as rows, as most relational DBMSs do.
- Suggested for data warehouses, customer relationship management (CRM) systems and other ad-hoc inquiry systems where aggregates or scans are carried out over large numbers of similar data items.
Row-Store (OLTP workloads!):
  1,Smith,Joe,40000
  2,Jones,Mary,50000
  3,Johnson,Cathy,44000
Column-Store (OLAP workloads!):
  1,2,3
  Smith,Jones,Johnson
  Joe,Mary,Cathy
  40000,50000,44000
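The two layouts can be compared directly on the slide's three-record example. The sketch below is only an in-memory illustration; the column names are assumptions, since the slide does not name them.

```python
# Row-store layout: one tuple per record (the slide's example data).
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]
COLUMNS = ("id", "last_name", "first_name", "salary")  # hypothetical column names


def to_column_store(rows, columns=COLUMNS):
    """Column-store layout: one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def fetch_record(rows, record_id):
    """OLTP-style access: the whole record sits together in a row store."""
    return next(row for row in rows if row[0] == record_id)


def average_salary(column_store):
    """OLAP-style access: an aggregate touches a single column only."""
    salaries = column_store["salary"]
    return sum(salaries) / len(salaries)


columns = to_column_store(rows)
print(columns["last_name"])               # ['Smith', 'Jones', 'Johnson']
print(fetch_record(rows, 2))              # (2, 'Jones', 'Mary', 50000)
print(round(average_salary(columns), 2))  # 44666.67
```

The aggregate never reads names or ids in the column layout, which is the reason scans over a few columns of very large tables favor column stores.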
46. Variety 3: NoSQL / Column Stores
All column family members are stored together on the big data filesystem.
47. Variety 3: NoSQL / Column Stores
- Not suitable for every problem.
- You need enough data: hundreds of millions or billions of rows. With only a few thousand/million rows, a traditional RDBMS may be the better choice.
- Make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.).
  - An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example.
  - Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
- Have enough hardware: even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
  - HBase can run quite well stand-alone on a laptop, but this should be considered a development configuration only.
48. Variety 4: NewSQL
- "NewSQL" is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees (i.e., offering transactions) of a traditional DBMS.
NewSQL = NoSQL + Transactions
49. Variety 4: NewSQL
Google's Trajectory
- (2003) Google GFS Paper (SOSP'03)
  - Objective: Create a Google-scale Filesystem.
  - Apache HDFS is the open-source implementation of GFS.
- (2004) Google's Map-Reduce Paper (OSDI'04)
  - Objective: Enable big-data analytics over non-tabular data (e.g., XML or text) with the assistance of GFS.
  - Apache's MapReduce: An open-source implementation of the paper.
- (2006) Google BigTable Paper (OSDI'06)
  - Objective: Enable big-data analytics over tabular data (i.e., tables).
  - (2008) Apache's HBase: An open-source implementation of the paper.
  - (2010) Facebook Messaging moves from Cassandra to HBase.
- (2012) Google's F1 RDBMS (SIGMOD'12) and Spanner Storage (OSDI'12) Papers
51. Big Data Courses @ UCY
- NoSQL and NewSQL
  - Intro to Web2.0: the JSON data interchange format.
  - Key-Value data model: CouchDB.
  - Introduction / Fundamentals: I/O Performance, Replication Strategies, etc.
  - Big-data Filesystems: HDFS
  - "Big-Data" Analytics: Map-Reduce, Hadoop, PIG
  - Column Stores: BigTable, HBase, and Intro to NewSQL (Spanner and F1)
Advanced Topics in Databases: http://www.cs.ucy.ac.cy/dzeina/courses/epl646
52. Big Data Courses Elsewhere
- Data science incorporates varying elements and builds on techniques and theories from many fields, with the goal of extracting meaning from data and creating data products.
- Data science combines the following fields:
  - Math,
  - Statistics,
  - Data engineering,
  - Pattern recognition and learning,
  - Advanced computing,
  - Visualization,
  - Uncertainty modeling,
  - Data warehousing, and
  - High performance computing
53. Big Data Courses Elsewhere
- Course Syllabus Example (Univ. of Washington)
  - Data modeling: relations, key-value, trees, graphs, images, text
  - Relational algebra and parallel query processing
  - NoSQL systems, key-value stores
  - Tradeoffs of SQL, NoSQL, and NewSQL systems
  - Algorithm design in Hadoop (and MapReduce in general)
  - Basic statistical analysis at scale: sampling, regression
  - Introduction to data mining: clustering, association rules, decision trees
  - Case studies in analytics: social networking, bioinformatics, text processing
  - Free 10-week course: https://www.coursera.org/course/datasci/
54. Big Data Research @ UCY
- Crowdbeam: Build an innovative Windows Phone messaging platform for a Finnish alliance, backed by Microsoft and Nokia.
  - Problem: Millions of users querying their K closest smartphones continuously.
    - Query executed every few seconds.
    - Currently a state-less service.
  - Setup: A 14-node Couchbase cluster (i.e., a distributed, shared-nothing, NoSQL document-oriented database that is optimized for interactive applications).
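The core query, "find my K closest smartphones", can be sketched as a stateless brute-force scan. This is not Crowdbeam's implementation, which is not described here; the function names, the haversine distance, and the sample coordinates are all assumptions for illustration.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def k_nearest(user_id, positions, k):
    """Stateless query: scan all known positions and keep the k closest users."""
    origin = positions[user_id]
    candidates = ((haversine_km(origin, pos), uid)
                  for uid, pos in positions.items() if uid != user_id)
    return [uid for _, uid in heapq.nsmallest(k, candidates)]


# Hypothetical last-known positions (latitude, longitude).
positions = {
    "a": (35.0, 33.0),
    "b": (35.0, 33.01),
    "c": (35.0, 33.5),
    "d": (36.0, 33.0),
}
print(k_nearest("a", positions, 2))  # ['b', 'c']
```

A full scan per query is fine for a handful of users but not for millions querying every few seconds, which is what motivates distributing the data and the query load across a cluster.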
55. Big Data Research @ UCY
Native JSON Store JSON RESTful API
56. Big Data Research @ UCY
- Airplace: Build an innovative indoor localization and navigation platform for a Taiwanese company.
  - Problem: Radiomaps of indoor environments are fairly large structures, considering that they are becoming massively available.
  - Setup: A 4-node Apache HBase cluster (i.e., a distributed, non-relational, shared-nothing architecture modeled after Google's BigTable and written in Java).
  - Best Demo Award at IEEE MDM'12, covered on Euronews and local media.
57. Big Data Research @ UCY
SmartLab: Massive smartphone simulations with our first global open smartphone IaaS cloud
http://smartlab.cs.ucy.ac.cy/
58. Big Data Research @ UCY
http://smartlab.cs.ucy.ac.cy/
59. Big Data - What Is It?
Thanks! Questions?