High Performance Computing Solutions for Data Mining


1
High Performance Computing Solutions for Data Mining
Prof. Navneet Goyal
2

Topics

Big Data
Sources
Characteristics
Management
Analytics
The Road Ahead
The need for HPC for taming BIG DATA
The holy grail of Programming Performance
Abstraction vs. Performance
Parallel Domain Specific Languages (PDSL)
Distributed Computing Clusters
Parallel Computing: New Avatar
MapReduce & Hadoop
What we are doing @ BITS-Pilani
Motivation and Future Directions

3
BIG DATA

Just a Hype?
Or a real Challenge?
Or a great Opportunity?

4
BIG DATA

A challenge in terms of how to manage this data
An opportunity in terms of what we can do with this data to enrich the lives of everybody around us and to make our mother Earth a better place to live
IBM's Smarter Planet Initiative!

5
Best Quote so far

"Data is the new 'oil' and there is a growing need for the ability to refine it."
- Dhiraj Rajaram, Founder & CEO of Mu Sigma, a leading data analytics company

6
Another Quote

"We don't have better algorithms. We just have more data."
- Peter Norvig, Director of Research, Google

7
What Is Big Data?

There is no consensus on how to define big data.

"Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population."
- Teradata Magazine article, 2011
"Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."
- The McKinsey Global Institute, 2011
8
Big Data
In 2005 there were 1.3 billion RFID tags in circulation.
By the end of 2011, this was about 30 billion and growing even faster.
Source: Slides of Dean Compher, IBM
9
RFID

Radio Frequency ID (RFID) tags.
Wal-Mart redesigned its supply chain with RFID.
The cost of RFID tags has come down so much that they've proliferated all over the world.
A good place to start with Big Data, because they are now ubiquitous, as is the opportunity for Big Data.
RFID tags are used to track:
cars on a toll route,
food supplies in temperature-controlled transport,
livestock,
inventories,
luggage,
retail,
tickets used for transportation

Source: Slides of Dean Compher, IBM
10
An increasingly sensor-enabled and
instrumented business environment
generates HUGE volumes of data with
MACHINE SPEED characteristics
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!
11
Aircraft

Aircraft are heavily sensor-enabled devices, instrumented to collect data as they operate.
They also generate huge volumes of data.
This particular Airbus runs over a billion lines of code, and a single engine generates 10 terabytes of data every 30 minutes.
With four engines, a flight from the UK to New York would generate 640 TB of data.
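
The 640 TB figure can be checked with a little arithmetic. The sketch below uses the per-engine rate and the engine count from the slide. The 8-hour flight time is an assumption (it is not stated on the slide), chosen because it is the duration the 640 TB total implies.

```python
# Figures quoted on the slide
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 4

# Assumption (not on the slide): an 8-hour flight, the duration implied by 640 TB
FLIGHT_HOURS = 8


def flight_data_tb(hours, engines=ENGINES,
                   tb_per_half_hour=TB_PER_ENGINE_PER_HALF_HOUR):
    """Total engine data for one flight, in terabytes."""
    half_hours = hours * 2
    return tb_per_half_hour * half_hours * engines


print(flight_data_tb(FLIGHT_HOURS))  # 10 TB x 16 half-hours x 4 engines = 640
```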

12
350B transactions/year
Meter reads every 15 min.
120M meter reads/month
3.65B meter reads/day
13
In August 2010, Adam Savage of MythBusters took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account with the phrase "Off to work." Since the photo was taken with his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken. By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
14
The Social Layer in an Instrumented
Interconnected World
12 TBs of tweet data every day
? TBs of data every day
25 TBs of log data every day
15
Big Data Includes Any of the Following Characteristics
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible
Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Velocity: Streaming data and large volume data movement
Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)
16
Big Data

Three key things with Big Data
Volume (2.5 exabytes of data created every day, doubling every 40 months)
Velocity (real-time or nearly real-time information needed)
Variety (new sources of data, unstructured)
To turn all this information into Competitive Gold, we need innovative thinking and HPC tools (choose the right data and build models that predict and optimize)
Using Big Data enables organizations to decide on the basis of Evidence rather than Intuition
Data Scientists manage Big Data and find insights in it

17
3 Vs of Big Data

The BIG in big data isn't just about volume

Introduction

We are generating more Data than we can handle!!!
That's why we are here!!!
BIG DATA
Using Data to our benefit is a far cry!!!
In future, everything will be Data driven
High time we figured out how to tame this
monster and use it for the benefit of the
society

Introduction

BIG DATA poses a big challenge to our
capabilities
Data scaling outdoing scaling of compute
resources
CPU speed not increasing either
At the same time, BIG DATA offers a BIGGER
opportunity
Opportunity to
Understand nature
Understand evolution
Understand human behavior/psychology/physiology
Understand stock markets
Understand road and network traffic
Opportunity for us to be more and more
innovative!!

Taming BIG DATA

Divide & Conquer
Partition large problem into smaller
independent sub-problems
Can be handled by different workers
Threads in a processor core
Cores in a multi/many core processor
Multiple processors in a machine
Multiple machines in a cluster
Multiple clusters in a cloud

Abstraction
21

Analyzing BIG DATA

"Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed."
Source: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States

22
Source: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States
23

Programming Models

Abstraction vs. Performance
Look at the Generation of Languages
Evolution from Machine Language to NLP
Abstraction increasing
What about Performance?
Are we paying too high a price for high levels of
abstraction??
A delicate trade-off!!
Holy Grail
Desired level of (abstraction + performance)

Domain Specific Languages (DSL)

Lot of interest in DSLs recently
High-level languages
Optimized for a particular domain
Two types
Internal (embedded in a host language)
External (new language, new compiler; more tedious)
OptiML is an internal DSL embedded in Scala
BIG DATA has necessitated the development of Parallel DSLs (PDSL)
Kevin J. Brown et al. A Heterogeneous Parallel Framework for Domain-Specific Languages. PACT 2011, pp. 89-100
A. K. Sujeeth et al. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. ICML, 2011

Programming Issues

Need to go wide & deep!
WIDE: more nodes in a cluster
DEEP: more cores in a node
Active research in both WIDE & DEEP models
Nodes are typically multicore
WIDE models are at the mercy of the OS for leveraging multicores
WIDE & DEEP models are not necessarily orthogonal
Combining WIDE and DEEP models is nontrivial
Hybrid programming models: MPI + OpenMP

Distributed Computing

Existing distributed computing options like Cluster & Grid computing provide low levels of abstraction
Programmer has to deal with
synchronization, deadlocks, data dependency,
mutual exclusion, replication, reliability,
platform scalability and provisioning
Too much to ask from a data mining researcher!!
What are the solutions available?

MapReduce/Hadoop

MapReduce: a programming model and its associated implementation
provides a high level of abstraction
but has limitations
Only data-parallel tasks stand to benefit!
MapReduce hides parallel/distributed computing concepts from users/programmers
Even novice users/programmers can leverage cluster computing for data-intensive problems
Cluster, Grid, and MapReduce are intended as platforms for general-purpose computing
The Hadoop/Pig combo is very effective!

MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004
28

MapReduce/Hadoop

Input
Map
Shuffle
Reduce
Output

http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
By Milind Bhandarkar
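
The Input, Map, Shuffle, Reduce, Output pipeline can be illustrated with the classic word-count example. The sketch below runs in a single process and only mimics the data flow. It does not use Hadoop or any MapReduce library, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def word_count(records):
    # Each record is mapped independently of the others (the data-parallel part).
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data", "Big clusters"]))
# {'big': 2, 'data': 1, 'clusters': 1}
```

In a real deployment the framework would run the map and reduce calls on different nodes and perform the shuffle over the network. The programmer writes only the two functions.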
29

Research Gaps

Distributed computing is the only option for large-scale data analytics, but
current distributed computing systems are for general-purpose computing, and
they support low levels of abstraction (parallelization is nontrivial)
MapReduce does provide a higher level of abstraction,
but is not suitable for data mining (evident from the literature survey);
in particular, it scales only for data-parallel problems
Need for a scalable distributed computing framework that provides both abstraction and performance

Research Gaps

K-means on MapReduce/Hadoop
A typical data-parallel problem
Suited for MapReduce
DBSCAN or OPTICS
Not a data-parallel problem
Not suitable for MapReduce
Throws up new data distribution and algorithmic
challenges
Need for a scalable distributed computing framework that allows us to efficiently exploit all kinds of parallelism that exist in an algorithm
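
To see why k-means counts as data-parallel, one iteration can be written as a map step and a reduce step. Every point finds its nearest centroid without looking at any other point, and each new centroid is the mean of the points assigned to it. The sketch below is a single-process illustration with hypothetical function names, not the implementation used in the work described here. Density-based methods such as DBSCAN do not decompose this way as readily, because growing a cluster depends on what has already been found for neighbouring points.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_iteration(points, centroids):
    # Map: each point independently emits (index of nearest centroid, point).
    mapped = [(nearest(p, centroids), p) for p in points]

    # Shuffle: group points by centroid index.
    groups = defaultdict(list)
    for idx, p in mapped:
        groups[idx].append(p)

    # Reduce: new centroid is the mean of its points; an empty cluster keeps
    # its old centroid (one simple convention among several).
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(kmeans_iteration(points, [(1.0, 1.0), (9.0, 9.0)]))
# [(0.0, 1.0), (10.0, 11.0)]
```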

@ CS Department, BITS-Pilani

Programming Model
Researchers (DM Algorithms)
Distributed File System 20 TB NAS for Central
Storage
End users (DM Applications)
INFINIBAND CLUSTER 48 Multicore NODES
32

@ CS Department, BITS-Pilani

SCALABLE SERVICE
End users (DM Applications)
Researchers (DM Algorithms)
CLOUD
33

@ CS Department, BITS-Pilani

Advanced Data Analytics & Parallel Technologies Lab (ADAPT Lab): ADAPTing to the future
A new distributed computing framework for data mining
HPC Division
Department of Electronics & Information Technology (DeitY)
INR 1.20 Crores (3 Years)
Navneet Goyal, Poonam Goyal, Sundar B
Collaborators: IASRI, New Delhi
Full-time PhD students: Sonal Kumari (TCS PhD Fellow), Mohit Sati, Jagat Sesh Challa, Saiyedul Islam (TCS PhD Fellow)

@ CS Department, BITS-Pilani

Advanced Data Analytics & Parallel Technologies Lab (ADAPT Lab): ADAPTing to the future
48-node Beowulf cluster
20 TB NAS
Intel Cluster Studio
Vampir Standard
VMware vCenter
IBM SPSS Modeler