1
Big Data - What Is It?
Demetris Zeinalipour, Assistant Professor
Data Management Systems Laboratory, Department of
Computer Science, University of Cyprus
http://dmsl.cs.ucy.ac.cy/
EPL671 Research Methodologies in Computer
Science, Graduate Course, Tuesday, Mar 19th,
2013.
2
Objectives
  • To provide an overview of the emerging field of
    Big Data Management from a wide range of
    perspectives
  • Fundamentals / Trends, Industrial / Academic,
    Commercial / Open, Reality / Visionary, etc.
  • I assume that the audience has a technical
    background (e.g., DBAs)
  • Lots of examples and illustrations to keep this
    presentation entertaining and educational.

3
Talk Outline
  • Big Data Definitions and Background
  • Big Data Definition by 3V Examples
  • Velocity
  • Sensor Monitoring, Network Monitoring, Web2.0
    Media, Smartphone Services
  • Volume
  • Text < Multimedia < Sciences, Web Data, Filesystems
  • Variety
  • The New Database Landscape
  • NoSQL (Document Stores, Replication, Consistency,
    Map-Reduce, Column Stores)
  • NewSQL Trends
  • Big Data Education and Research
  • Courses @ UCY
  • Research Prototypes @ UCY

4
Big Data Definitions
  • "Refers to data sets whose size and structure
    strains (stretches) the ability of commonly used
    relational DBMSs to capture, manage, and process
    the data within a tolerable elapsed time."
  • Hoffer, Ramesh, Topi: Modern Database Management,
    11E, 2013.
  • Similarly, from Wikipedia (Feb. 2013):
  • "big data is a collection of data sets so large
    and complex that it becomes difficult to process
    using on-hand database management tools or
    traditional data processing applications."

5
Big Data Characteristics
  • Size: from a few dozen terabytes to many
    petabytes in a single database.
  • Data model: anything from structured (relational
    or tabular) to semi-structured (XML or JSON) or
    even unstructured (Web text and log files).
  • Architectures: highly parallel and distributed in
    order to cope with the inherent I/O and CPU
    limitations.
  • Hardware: from mid-scale private clouds
    (datacenters), offering higher privacy, to
    large-scale public clouds.
  • Functionality: operational (OLTP) and analytic
    (OLAP) functionality, stand-alone or as-a-Service.

6
Big Data Characteristics
2013 IEEE International Conference on Big Data
(IEEE BigData 2013), October 6-9, 2013, Silicon
Valley, CA, USA
Wordle.net
7
Background Public Clouds
Google's Datacenter in Oregon
8
Background Public Clouds
Microsoft Azure in Chicago
112 containers x 2,000 servers = 224,000 servers
9
Background -as-a-Service
To Amazon RDS (Relational Database Service)
963 / year
27,165 / year
10
Background Private Clouds
Our Laboratory Private IaaS
11
Cloud Computing => Big Data
12
Big Data Velocity-Volume-Variety
  • Velocity
  • how fast data is being produced and how fast the
    data must be processed to meet demand.
  • How to deal with torrents of data, in near-real
    time, streaming from RFID tags and smart metering
    systems?
  • How to identify fraud in 5 million trade events
    created each day?
  • Reacting quickly enough to deal with velocity is
    a challenge to most organizations.

Source: IDC, "Big Data Analytics: Future
Architectures, Skills and Roadmaps for the CIO,"
September 2011.
13
Big Data Velocity-Volume-Variety
  • Volume
  • Past Challenge: Store data.
  • Transaction-based data stored through the years.
  • Sensor data being collected.
  • Integration with web applications and social media.
  • New Challenge: Create value from data.
  • Turn 12 TB of Tweets each day into a sentiment
    analysis (opinion mining) product.
  • e.g., People feel positive/negative/neutral about
    brand X.
  • Turn 350 billion annual smart meter readings into
    knowledge that helps predict power consumption.

14
Big Data Velocity-Volume-Variety
  • Variety
  • By some estimates, 80 percent of an
    organization's data is not numeric!
  • Different data formats: unstructured, structured,
    semi-structured
  • text, sensor data, audio, video, click streams,
    log files, etc.

15
Talk Outline
  • Big Data Definitions and Background
  • Big Data Definition by 3V Examples
  • Velocity
  • Sensor Monitoring, Network Monitoring, Web2.0
    Media, Smartphone Services
  • Volume
  • Text < Multimedia < Sciences, Web Data, Filesystems
  • Variety
  • The New Database Landscape
  • NoSQL (Document Stores, Replication, Consistency,
    File Systems, Map-Reduce, Column Stores)
  • NewSQL Trends
  • Big Data Education and Research
  • Courses @ UCY
  • Research Prototypes @ UCY

16
Velocity 1 Smart Meters
  • Smart meter records consumption of electric
    energy in intervals and communicates that
    information to the utility for monitoring and
    billing purposes.

Every 15 min
17
Velocity 1 Smart Meters
  • Ontario's Meter Data Management and Repository
    (MDM/R): storing, processing and managing all
    smart meter data in Ontario, Canada.
  • Characteristics:
  • Provides hourly billing quantity and extensive
    reports.
  • 4.6 million smart meters.
  • Storage/Bandwidth: 4.6M meters x 0.5K message
    (typical HTTP) = 2.3 GB / round.
  • 110 million meter reads per day.
  • On an annual basis, this exceeds the number of
    debit card transactions processed in the country
    (Canada!).

Source: Smart Metering Entity, http://www.smi-ieso.ca/mdmr
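The smart-meter figures above can be cross-checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes 0.5K means 500 bytes and that "hourly billing" implies 24 reads per meter per day, neither of which the slide states explicitly.

```python
# Back-of-envelope check of the smart-meter figures quoted on the slide.
METERS = 4_600_000             # smart meters in Ontario (from the slide)
MESSAGE_BYTES = 500            # assumption: "0.5K" taken as 500 bytes
READS_PER_METER_PER_DAY = 24   # assumption: one read per meter per hour

bytes_per_round = METERS * MESSAGE_BYTES
reads_per_day = METERS * READS_PER_METER_PER_DAY
reads_per_year = reads_per_day * 365

print(f"{bytes_per_round / 1e9:.1f} GB per round")
print(f"{reads_per_day / 1e6:.1f} million reads per day")
print(f"{reads_per_year / 1e9:.1f} billion reads per year")
```

Under these assumptions the hourly rate lands close to the 110 million daily reads quoted above.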
18
Velocity 2 Network Monitoring
  • Akamai
  • CDN serving 15-30% of all Web traffic (10 TB/sec)
  • One out of every three Global 500 companies
  • All of the top Internet portals
  • Has a picture of the global traffic every 6
    seconds
  • How?
  • 119,000 servers in 80 countries within over 1,100
    networks.
  • Servers report network health information
    (latency/loss) to a proprietary database every
    6 seconds.

Proprietary DBMS
Every 6 seconds
ping/traceroute
19
Velocity 2 Network Monitoring
Companies started seeking Big data engineers.
20
Velocity 3 Web2.0 Media
  • Analyze online conversations in Social Nets.
  • Accelerated responses to marketplace shifts.

Continuously, over Web2.0 protocols
21
Velocity 3 Web2.0 Media
Web1.0: The Unstructured Web, http://books.google.com/
(content in HTML, comprehensible only to the user)
22
Velocity 3 Web2.0 Media
Web2.0: The Semi-structured Web!
https://www.googleapis.com/books/v1/volumes?q=databases
(content in XML/JSON, comprehensible to a computer)
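To illustrate why semi-structured content is easy for a program to consume, here is a minimal sketch that pulls book titles out of a JSON reply. The sample reply is hand-written and only shaped like a books-search response; the field names (`items`, `volumeInfo`, `title`) are assumptions for this illustration, and no network call is made.

```python
import json

# Hand-written sample shaped like a books-search reply (not real data).
SAMPLE_RESPONSE = """
{
  "kind": "books#volumes",
  "totalItems": 2,
  "items": [
    {"id": "vol-1", "volumeInfo": {"title": "Database Systems", "authors": ["A. Author"]}},
    {"id": "vol-2", "volumeInfo": {"title": "Modern Databases"}}
  ]
}
"""

def extract_titles(raw_json):
    """Return the title of every volume in a books-style JSON reply."""
    reply = json.loads(raw_json)
    return [item["volumeInfo"]["title"] for item in reply.get("items", [])]

titles = extract_titles(SAMPLE_RESPONSE)
print(titles)
```

Doing the same against an HTML page would require scraping markup that was designed for human readers.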
23
Velocity 3 Web2.0 Media
Twitter API
https://twitter.com/users/dmslucy.json
24
Velocity 3 Web2.0 Media
In fact, Web2.0 Services are omnipresent! (Google,
Twitter, Facebook, YouTube, LinkedIn, ...)
http://www.programmableweb.com/ - 7800 APIs!!!
6800 Mashups!
https://code.google.com/apis
25
Velocity 4 Smartphone Services
Request Format (request.json):
{
  "homeMobileCountryCode": 310,
  "homeMobileNetworkCode": 260,
  "radioType": "gsm",
  "carrier": "T-Mobile",
  "cellTowers": [
    {
      "cellId": 39627456,
      "locationAreaCode": 40495,
      "mobileCountryCode": 310,
      "mobileNetworkCode": 260,
      "age": 0,
      "signalStrength": -95
    }
  ],
  "wifiAccessPoints": [
    {
      "macAddress": "01:23:45:67:89:AB",
      "signalStrength": 8,
      "age": 0,
      "signalToNoiseRatio": -65,
      "channel": 8
    },
    {
      "macAddress": "01:23:45:67:89:AC",
      "signalStrength": 4,
      "age": 0
    }
  ]
}
Response Format: The response format is also JSON.
{
  "location": { "latitude": 51.0, "longitude": -0.1 },
  "accuracy": 1200.4
}
We will discuss some further in-house applications
in a while.
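A client for such a service mostly assembles and parses JSON. The sketch below builds a request body using the field names shown on the slide and parses a reply of the same shape as the slide's response. The helper name `build_request` is hypothetical, the colon-separated MAC address format is an assumption, and the HTTP call itself is left out because the endpoint is not given here.

```python
import json

def build_request(cell_towers, wifi_access_points):
    """Assemble a request body with the field names shown on the slide."""
    return {
        "homeMobileCountryCode": 310,
        "homeMobileNetworkCode": 260,
        "radioType": "gsm",
        "carrier": "T-Mobile",
        "cellTowers": cell_towers,
        "wifiAccessPoints": wifi_access_points,
    }

request_body = build_request(
    cell_towers=[{"cellId": 39627456, "locationAreaCode": 40495,
                  "mobileCountryCode": 310, "mobileNetworkCode": 260,
                  "age": 0, "signalStrength": -95}],
    wifi_access_points=[{"macAddress": "01:23:45:67:89:AB", "signalStrength": 8,
                         "age": 0, "signalToNoiseRatio": -65, "channel": 8}],
)
payload = json.dumps(request_body)  # the string that would be sent over HTTP

# Parse a reply shaped like the slide's response example.
reply = json.loads(
    '{"location": {"latitude": 51.0, "longitude": -0.1}, "accuracy": 1200.4}'
)
lat = reply["location"]["latitude"]
lon = reply["location"]["longitude"]
print(f"position ({lat}, {lon}), accuracy {reply['accuracy']}")
```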
26
Velocity 4 Smartphone Services
Wireless Data Transfer Rates
  • 4G: ITU peak rates
  • 100 Mbps (high mobility, such as trains and cars)
  • 1 Gbps (low mobility, such as pedestrians and
    stationary users)

Plot courtesy of H. Kim, N. Agrawal, and C.
Ungureanu, "Revisiting Storage for Smartphones",
The 10th USENIX Conference on File and Storage
Technologies (FAST'12), San Jose, CA, February
2012. Best Paper Award.
27
Velocity 4 Smartphone Services
Mapping road traffic by collecting WiFi signals.
Every 1 second
Received Signal Strength (RSS): power present in a
WiFi radio signal
Graphics courtesy of A. Thiagarajan et al.,
"VTrack: Accurate, Energy-Aware Road Traffic
Delay Estimation using Mobile Phones", in
SenSys'09, pages 85-98, ACM (Best Paper), MIT's
CarTel Group
28
Talk Outline
  • Big Data Definitions and Background
  • Big Data Definition by 3V Examples
  • Velocity
  • Sensor Monitoring, Network Monitoring, Web2.0
    Media, Smartphone Services
  • Volume
  • Text < Multimedia < Sciences, Web Data, Filesystems
  • Variety
  • The New Database Landscape
  • NoSQL (Document Stores, Replication, Consistency,
    File Systems, Map-Reduce, Column Stores)
  • NewSQL Trends
  • Big Data Education and Research
  • Courses @ UCY
  • Research Prototypes @ UCY

29
Volume 1 Text < Multimedia < Sciences
  • From the TB-era to the PB-era.
  • The U.S. Library of Congress (April 2011): 235 TB
  • Ancestry.com: Genealogical data, 600 TB
  • Games: World of Warcraft uses 1.3 PB of storage
    to maintain its game.
  • Internet Video will account for 61% of total
    Internet Data by 2015 (966 Exabytes or nearly 1
    Zettabyte!)
  • Climate science: The German Climate Computing
    Centre (DKRZ) has a storage capacity of 60 PB of
    climate data.
  • Physics: The experiments in the Large Hadron
    Collider produce about 15 PB of data per year,
    which is distributed over the LHC Computing Grid
    (our department is part of the EGEE: Enabling
    Grids for E-sciencE, now EGI - European Grid
    Infrastructure).

Human Generated
Multimedia/ Streaming
Sciences/Sensors
Source: Petabyte, from Wikipedia,
http://en.wikipedia.org/wiki/Petabyte
30
Volume 2 Web Data
Google Volume (in 2006)
IDC: The total amount of global data is expected
to grow to 2.7 zettabytes during 2012. This is 48%
up from 2011.
http://en.wikipedia.org/wiki/Zettabyte
Bigtable: A Distributed Storage System for
Structured Data, OSDI'06: Seventh Symposium on
Operating System Design and Implementation,
Seattle, WA, November, 2006.
31
Volume 3 Big Data File Systems
  • Big Data Filesystems: HDFS
  • Namespace lookups are fast (1 Master is enough!):
    1 GB of metadata for 1 PB of data.
  • In NFS, metadata and transfers go through the
    same server => not scalable.
  • HDFS is designed for unreliable hardware
    (2-3 failures / 1000 nodes / day).
32
Volume 3 Big Data File Systems
  • Big Data Filesystems: How Big?
  • Results from 2010

HDFS scalability: the limits to growth,
http://static.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
33
Variety 2 File Systems
NFS uses a Client/Server Architecture that is a
single point of failure by default.
34
Talk Outline
  • Big Data Definitions and Background
  • Big Data Definition by 3V Examples
  • Velocity
  • Sensor Monitoring, Network Monitoring, Web2.0
    Media, Smartphone Services
  • Volume
  • Text < Multimedia < Sciences, Web Data, Filesystems
  • Variety
  • The New Database Landscape
  • NoSQL (Document Stores, Replication, Consistency,
    File Systems, Map-Reduce, Column Stores)
  • NewSQL Overview (ACID-compliant NoSQL stores)
  • Big Data Teaching and Research
  • Courses @ UCY
  • Research Prototypes @ UCY

35
Variety Overview
451 Research, Matthew Aslett, http://goo.gl/GYcEx
36
Variety 1 NoSQL
  • NoSQL ("not only SQL") is a broad class of
    database management systems identified by
    non-adherence to the widely used relational
    database management system model.
  • NoSQL databases are NOT built primarily on
    tables, and generally DO NOT use SQL for data.
  • NoSQL => Not Relational!
  • Key-Value (e.g., BerkeleyDB - embedded, Oracle
    NoSQL - distributed)
  • Document Stores (e.g., JSON stores)
  • BigTables (i.e., Column-stores)
  • Graph Databases (e.g., FlockDB)
  • Potentially a much longer list, but I will only
    focus on a few trends.

37
Variety 1 NoSQL / Document Stores
Document in CouchDB
Map Function
function (doc) {
  // Emit one (document id, author) row per author.
  for (var i in doc.authors) {
    var author = doc.authors[i];
    emit(doc._id, author);
  }
}
Results (through REST/HTTP or Futon)
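To see what rows such a map function produces, the following sketch mimics it in plain Python. It is only an analogue for illustration, not CouchDB itself: the documents are made up, and `map_authors` is a hypothetical stand-in for the view's map function.

```python
def map_authors(doc):
    """Python analogue of the slide's map function: one (key, value) row per author."""
    for author in doc.get("authors", []):
        yield doc["_id"], author

# Hypothetical documents, made up for illustration.
docs = [
    {"_id": "book-2", "title": "Example B", "authors": ["Carol"]},
    {"_id": "book-1", "title": "Example A", "authors": ["Alice", "Bob"]},
]

# A view holds all emitted rows, ordered by key.
rows = sorted(row for doc in docs for row in map_authors(doc))
for key, value in rows:
    print(key, value)
```

A document with several authors contributes several rows, which is what makes the view queryable per author.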
38
Variety 1 NoSQL / Document Stores
For a real app we could envision much more
complex queries.

http://rickosborne.org/download/SQL-to-MongoDB.pdf
39
Variety 1 NoSQL / Replication
Asynchronous replication means eventually
consistent.
[Diagram: replication links labeled "Asynchronous"]
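The link between asynchronous replication and eventual consistency can be shown with a toy model. This is a single-process sketch under simplifying assumptions (one primary, explicit `replicate()` call standing in for a background process); the class names are hypothetical and do not correspond to any particular NoSQL product.

```python
class Replica:
    """Holds a copy of the data that lags behind the primary."""
    def __init__(self):
        self.data = {}

class Primary:
    """Accepts writes immediately and ships them to replicas later."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []  # writes acknowledged but not yet replicated

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # acknowledged before replication

    def replicate(self):
        """Stand-in for the background replication step."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

replica = Replica()
primary = Primary([replica])
primary.write("x", 1)
stale_read = replica.data.get("x")   # replica has not seen the write yet
primary.replicate()
fresh_read = replica.data.get("x")   # replica converges after replication
print(stale_read, fresh_read)
```

The window between `write` and `replicate` is exactly where a reader of the replica can observe stale data.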
40
Variety 1 NoSQL / Consistency ?
SQL RDBMSs: Strongly consistent!
(Most) NoSQL DBMSs: Eventually consistent!
41
Variety 2 NoSQL / Map Reduce Analytics
  • Map-Reduce: a programming model for processing
    large data sets (Not online like Warehouses ?).
  • Invented by Google! "MapReduce: Simplified Data
    Processing on Large Clusters", Jeffrey Dean and
    Sanjay Ghemawat, OSDI'04: Sixth Symposium on
    Operating System Design and Implementation, San
    Francisco, CA, December, 2004.
  • Can be implemented in any language (Java, example
    next)
  • Hadoop: Apache's open-source software framework
    that supports data-intensive distributed
    applications
  • Derived from Google's MapReduce and Google File
    System (GFS) papers.
  • Enables applications to work with thousands of
    computation-independent computers and petabytes
    of data.
  • Download: http://hadoop.apache.org/

42
Variety 2 NoSQL / Map Reduce Analytics
Count the distinct words in all documents
(assuming one word per line):
cat *.txt | sort | uniq -c
1 TB on 1 PC: 2 hours!!! 1 TB on 100 PCs: 1 min!!!
43
Variety 2 NoSQL / Map Reduce Analytics
Example uses 1 mapper / 1 reducer only!
Phases: Map, Shuffle, Reduce
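The three phases can be sketched for the word-count example in a few lines. This is a single-process illustration of the programming model only, with made-up input; a framework such as Hadoop would run the same logic across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle_phase(pairs):
    """Group all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data about data"]  # made-up input
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(sorted(word_counts.items()))
```

Because each document is mapped independently and each word is reduced independently, both phases can be spread over many workers.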
44
Variety 2 NoSQL / Map Reduce Analytics
Standard Output (e.g., socket)
HDFS blocks (64MB containing documents)
Hashing
Remote Write (e.g., Socket)
HDFS Writing
Local Shuffling (of terms)
HDFS Reading
45
Variety 3 NoSQL / Column Stores
  • A column-oriented DBMS is a database management
    system (DBMS) that stores data tables as sections
    of columns rather than as rows, as most
    relational DBMSs do.
  • Suggested for data warehouses, customer
    relationship management (CRM) systems and other
    ad-hoc inquiry systems where aggregates or scans
    are carried out over large numbers of similar
    data items.

Row-Store (OLTP workloads!):
1,Smith,Joe,40000
2,Jones,Mary,50000
3,Johnson,Cathy,44000
Column-Store (OLAP workloads!):
1,2,3
Smith,Jones,Johnson
Joe,Mary,Cathy
40000,50000,44000
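The difference between the two layouts can be made concrete with the slide's three records. In this sketch the column names are assumptions added for readability, and Python lists stand in for on-disk storage.

```python
# The slide's three employee records, stored row by row.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]
COLUMN_NAMES = ("id", "last_name", "first_name", "salary")  # assumed names

# Column layout: one contiguous list per attribute.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}

# OLTP-style access: fetch one whole record; the row layout has it in one place.
record = rows[1]

# OLAP-style access: aggregate one attribute; the column layout touches only that list.
average_salary = sum(columns["salary"]) / len(columns["salary"])

print(record)
print(columns["salary"], average_salary)
```

An aggregate over salaries reads one list in the column layout, whereas the row layout would have to walk every full record.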
46
Variety 3 NoSQL / Column Stores
All column family members are stored together on
the big data filesystem.
47
Variety 3 NoSQL Column Stores
  • Not suitable for every problem.
  • You need enough data: more than a few
    thousand/million rows.
  • Make sure you can live without all the extra
    features that an RDBMS provides (e.g., typed
    columns, secondary indexes, transactions,
    advanced query languages, etc.)
  • An application built against an RDBMS cannot be
    "ported" to HBase by simply changing a JDBC
    driver, for example.
  • Consider moving from an RDBMS to HBase as a
    complete redesign as opposed to a port.
  • Have enough hardware: Even HDFS doesn't do well
    with anything less than 5 DataNodes (due to
    things such as HDFS block replication, which has
    a default of 3), plus a NameNode.
  • HBase can run quite well stand-alone on a laptop
    - but this should be considered a development
    configuration only.

48
Variety 4 NewSQL
  • "NewSQL" is a class of modern relational database
    management systems that seek to provide the same
    scalable performance of NoSQL systems for OLTP
    workloads while still maintaining the ACID
    guarantees (i.e., offering transactions) of a
    traditional DBMS.

NewSQL = NoSQL + Transactions
49
Variety 4 NewSQL
Google's Trajectory
  • (2003) Google GFS Paper (SOSP'03)
  • Objective: Create a Google-scale Filesystem
  • Apache HDFS is the open-source implementation of
    GFS.
  • (2004) Google's Map-Reduce Paper (OSDI'04)
  • Objective: Enable big-data analytics over
    non-tabular data (e.g., XML or text) with the
    assistance of GFS.
  • Apache's MapReduce: An open-source implementation
    of the paper
  • (2006) Google BigTable Paper (OSDI'06)
  • Objective: Enable big-data analytics over tabular
    data (i.e., tables)
  • (2008) Apache's HBase: An open-source
    implementation of the paper
  • (2010) Facebook Messaging moves from Cassandra
    to HBase
  • (2012) Google's F1 RDBMS (SIGMOD'12) and Spanner
    Storage (OSDI'12) papers

50
Talk Outline
  • Big Data Definitions and Background
  • Big Data Definition by 3V Examples
  • Velocity
  • Sensor Monitoring, Network Monitoring, Web2.0
    Media, Smartphone Services
  • Volume
  • Text < Multimedia < Sciences, Web Data, Filesystems
  • Variety
  • The New Database Landscape
  • NoSQL (Document Stores, Replication, Consistency,
    File Systems, Map-Reduce, Column Stores)
  • NewSQL Overview (ACID-compliant NoSQL stores)
  • Big Data Education and Research
  • Courses @ UCY
  • Research Prototypes @ UCY

51
Big Data Courses @ UCY
  • NoSQL and NewSQL
  • Intro to Web2.0: the JSON data interchange
    format
  • Key-Value data model: CouchDB
  • Introduction / Fundamentals: I/O Performance,
    Replication Strategies, etc.
  • Big-data Filesystems: HDFS
  • "Big-Data" Analytics: Map-Reduce, Hadoop, PIG
  • Column Stores: BigTable, HBase and Intro to
    NewSQL (Spanner and F1)

Advanced Topics in Databases:
http://www.cs.ucy.ac.cy/dzeina/courses/epl646
52
Big Data Courses Elsewhere
  • Data science incorporates varying elements and
    builds on techniques and theories from many
    fields, with the goal of extracting meaning from
    data and creating data products.
  • Data Science Combines the Following Fields:
  • Math,
  • Statistics,
  • Data engineering,
  • Pattern recognition and learning,
  • Advanced computing,
  • Visualization,
  • Uncertainty modeling,
  • Data warehousing, and
  • High performance computing

53
Big Data Courses Elsewhere
  • Course Syllabus Example (Univ. of Washington)
  • Data modeling: relations, key-value, trees,
    graphs, images, text
  • Relational algebra and parallel query processing
  • NoSQL systems, key-value stores
  • Tradeoffs of SQL, NoSQL, and NewSQL systems
  • Algorithm design in Hadoop (and MapReduce in
    general)
  • Basic statistical analysis at scale: sampling,
    regression
  • Introduction to data mining: clustering,
    association rules, decision trees
  • Case studies in analytics: social networking,
    bioinformatics, text processing
  • Free 10-week course

https://www.coursera.org/course/datasci/
54
Big Data Research @ UCY
  • Crowdbeam: Build an innovative Windows Phone
    messaging platform for a Finnish alliance, backed
    by Microsoft and Nokia.
  • Problem: Millions of users querying their K
    closest smartphones continuously.
  • Query executed every few seconds.
  • Currently a state-less service.
  • Setup: A 14-node Couchbase cluster (i.e., a
    distributed, shared-nothing, NoSQL
    document-oriented database that is optimized for
    interactive applications).
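The query itself, "find my K closest smartphones", can be sketched as a brute-force nearest-neighbour search. This is only an illustration of the problem, not the platform's actual method: the device positions are made up, and a linear scan like this would not scale to millions of users issuing the query every few seconds, which is precisely the challenge named above.

```python
import heapq
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def k_closest(me, others, k):
    """Return the ids of the k devices nearest to `me` (linear scan)."""
    lat, lon = me
    nearest = heapq.nsmallest(
        k, others.items(), key=lambda item: haversine_km(lat, lon, *item[1])
    )
    return [device_id for device_id, _ in nearest]

# Hypothetical device positions as (lat, lon).
devices = {
    "phone-a": (35.1460, 33.4100),
    "phone-b": (35.1700, 33.3600),
    "phone-c": (34.7000, 33.0200),
    "phone-d": (35.1400, 33.4200),
}
closest = k_closest((35.1456, 33.4106), devices, k=2)
print(closest)
```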

55
Big Data Research @ UCY
Native JSON Store JSON RESTful API
56
Big Data Research @ UCY
  • Airplace: Build an innovative indoor localization
    and navigation platform for a Taiwanese company.
  • Problem: Radiomaps of indoor environments are
    fairly large structures, considering that those
    become massively available.
  • Setup: A 4-node Apache HBase cluster (i.e., a
    distributed, non-relational, shared-nothing
    architecture modeled after Google's BigTable and
    written in Java).
  • Best Demo Award at IEEE MDM'12, covered on
    Euronews and local media.

57
Big Data Research @ UCY
SmartLab: Massive smartphone simulations with our
first global open smartphone IaaS cloud,
http://smartlab.cs.ucy.ac.cy/
58
Big Data Research @ UCY
http://smartlab.cs.ucy.ac.cy/
59
Big Data - What Is It?
Demetris Zeinalipour, Assistant Professor
Data Management Systems Laboratory, Department of
Computer Science, University of Cyprus
http://dmsl.cs.ucy.ac.cy/
Thanks! Questions?