High Performance Computing Solutions for Data Mining - PowerPoint PPT Presentation

1
High Performance Computing Solutions for Data Mining
Prof. Navneet Goyal
2
  • Topics
  • Big Data
  • Sources
  • Characteristics
  • Management
  • Analytics
  • The Road Ahead
  • The need for HPC for taming BIG DATA
  • The holy grail of Programming Performance
  • Abstraction vs. Performance
  • Parallel Domain Specific Languages (PDSL)
  • Distributed Computing Clusters
  • Parallel Computing: New Avatar
  • MapReduce & Hadoop
  • What we are doing @ BITS-Pilani
  • Motivation and Future Directions

3
BIG DATA
  • Just a Hype?
  • Or a real Challenge?
  • Or a great Opportunity?

4
BIG DATA
  • Just a Hype?
  • Or a real Challenge?
  • Or a great Opportunity?
  • Challenge in terms of how to manage this data
  • Opportunity in terms of what we can do with this
    data to enrich the lives of everybody around us
    and to make our mother Earth a better place to
    live
  • IBM's Smarter Planet Initiative!

5
Best Quote so far
  • Dhiraj Rajaram, Founder & CEO of Mu Sigma, a
    leading data analytics company
  • "Data is the new 'oil' and there is a growing
    need for the ability to refine it."

6
Another Quote
  • "We don't have better algorithms. We just have
    more data."
  • Peter Norvig, Director of Research, Google

7
What Is Big Data?
  • There is no consensus on how to define big
    data

"Big data exceeds the reach of commonly used
hardware environments and software tools to
capture, manage, and process it within a
tolerable elapsed time for its user population."
- Teradata Magazine article, 2011

"Big data refers to data sets whose size is
beyond the ability of typical database software
tools to capture, store, manage and analyze."
- The McKinsey Global Institute, 2011
8
Big Data
In 2005 there were 1.3 billion RFID tags in
circulation; by the end of 2011, this was about
30 billion and growing even faster
Source: Slides of Dean Compher, IBM
9
RFID
  • Radio Frequency ID (RFID) tags.
  • Wal-Mart redesigned its supply chain with RFID.
  • The cost of RFID tags has come down so much that
    they've proliferated all over the world.
  • A good place to start with Big Data, because they
    are now ubiquitous, as is the opportunity for Big
    Data.
  • track cars on a toll route,
  • food supplies for temperature transport,
  • livestock,
  • inventories,
  • luggage,
  • retail,
  • tickets used for transportation

Source: Slides of Dean Compher, IBM
10
An increasingly sensor-enabled and
instrumented business environment
generates HUGE volumes of data with
MACHINE SPEED characteristics
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!
11
Aircraft
  • Aircraft are heavily sensor-enabled devices,
    instrumented to collect data as they operate,
    and they generate huge volumes of data.
  • This particular Airbus has over a billion lines
    of code, and a single engine generates 10
    terabytes of data every 30 minutes.
  • And there are four engines, right?
  • A flight from the UK to New York would generate
    640 TB of data.
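
The 640 TB figure can be checked with a few lines of arithmetic. The sketch below uses the slide's numbers (10 TB per engine every 30 minutes, four engines); the 8-hour flight time is an assumption, not stated on the slide, chosen because it is the duration that makes the quoted total work out.

```python
# Figures quoted on the slide.
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 4

# Assumption (not on the slide): a UK to New York flight of about 8 hours.
FLIGHT_HOURS = 8


def flight_data_tb(hours, engines=ENGINES,
                   tb_per_half_hour=TB_PER_ENGINE_PER_HALF_HOUR):
    """Total engine data, in TB, for a flight of the given duration."""
    half_hours = hours * 2
    return tb_per_half_hour * half_hours * engines


print(flight_data_tb(FLIGHT_HOURS))  # 640
```

Put differently, the four engines together produce 80 TB per hour of flight under these figures.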

12
350B transactions/year
Meter reads every 15 min.
120M meter reads/month
3.65B meter reads/day
13
In August of 2010, Adam Savage, of MythBusters,
took a photo of his vehicle using his
smartphone. He then posted the photo to his
Twitter account including the phrase "Off to
work." Since the photo was taken by his
smartphone, the image contained metadata
revealing the exact geographical location where
the photo was taken. By simply taking and posting
a photo, Savage revealed the exact location of his
home, the vehicle he drives, and the time he
leaves for work.
14
The Social Layer in an Instrumented
Interconnected World
12 TBs of tweet data every day
? TBs of data every day
25 TBs of log data every day
15
Big Data Includes Any of the Following
Characteristics
Extracting insight from an immense volume,
variety and velocity of data, in context, beyond
what was previously possible
Variety: Manage the complexity of data in many
different structures, ranging from relational, to
logs, to raw text
Velocity: Streaming data and large volume data
movement
Volume: Scale from Terabytes to Petabytes (1K
TBs) to Zettabytes (1B TBs)
16
Big Data
  • Three Key things with Big Data
  • Volume (2.5 Exabytes of data created every day
    and doubles every 40 months)
  • Velocity (Real-time or nearly real-time info
    needed)
  • Variety (New sources of data, unstructured data)
  • To turn all this information into Competitive
    Gold, we need innovative thinking and HPC tools
    (choose the right data and build models that
    predict and optimize)
  • Using Big Data enables organizations to decide on
    the basis of Evidence rather than Intuition
  • Data Scientists manage Big Data and find
    insights in it

17
3 Vs of Big Data
  • The BIG in big data isn't just about volume

18
  • Introduction
  • We are generating more Data than we can handle!!!
  • That's why we are here!!!
  • BIG DATA
  • Using Data to our benefit is a far cry!!!
  • In future, everything will be Data driven
  • High time we figured out how to tame this
    monster and use it for the benefit of
    society

19
  • Introduction
  • BIG DATA poses a big challenge to our
    capabilities
  • Data is scaling faster than compute
    resources
  • CPU speed is not increasing either
  • At the same time, BIG DATA offers a BIGGER
    opportunity
  • Opportunity to
  • Understand nature
  • Understand evolution
  • Understand human behavior/psychology/physiology
  • Understand stock markets
  • Understand road and network traffic
  • Opportunity for us to be more and more
    innovative!!

20
  • Taming BIG DATA
  • Divide & Conquer
  • Partition a large problem into smaller
    independent sub-problems
  • Can be handled by different workers
  • Threads in a processor core
  • Cores in a multi/many core processor
  • Multiple processors in a machine
  • Multiple machines in a cluster
  • Multiple clusters in a cloud
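
The divide-and-conquer pattern on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function names are hypothetical, and threads from Python's standard library stand in for "workers". In CPython, threads will not speed up CPU-bound work like this because of the global interpreter lock; the point is only the structure (partition, work independently, combine), which stays the same when the workers are cores, processors, or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into at most n_parts contiguous, independent chunks."""
    size = max(1, (len(data) + n_parts - 1) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """Work done by one worker; it needs no data from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    chunks = partition(data, n_workers)               # divide
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)                              # combine


print(parallel_sum_of_squares(list(range(10))))  # 285
```

The sub-problems are independent because each partial sum depends only on its own chunk, which is what lets them be handed to different workers.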

Abstraction
21
  • Analyzing BIG DATA
  • Data analysis, organization, retrieval, and
    modeling are other foundational challenges. Data
    analysis is a clear bottleneck in many
    applications, both due to lack of scalability of
    the underlying algorithms and due to the
    complexity of the data that needs to be analyzed
  • Challenges and Opportunities with Big Data
  • A community white paper developed by leading
    researchers across the United States

22
Source: Challenges and Opportunities with Big
Data. A community white paper developed by
leading researchers across the United States
23
  • Programming Models
  • Abstraction vs. Performance
  • Look at the Generation of Languages
  • Evolution from Machine Language to NLP
  • Abstraction increasing
  • What about Performance?
  • Are we paying too high a price for high levels of
    abstraction??
  • A delicate trade-off!!
  • Holy Grail
  • Desired level of (abstraction & performance)

24
  • Domain Specific Languages (DSL)
  • Lot of interest in DSLs recently
  • High-level languages
  • Optimized for a particular domain
  • Two types
  • Internal (embedded in a host language)
  • External (new language, new compiler; more
    tedious)
  • OptiML is an internal DSL embedded in Scala
  • BIG DATA has necessitated the development of
    Parallel DSLs (PDSL)
  • Kevin J. Brown et al. A Heterogeneous Parallel
    Framework for Domain-Specific Languages. PACT
    2011: 89-100
  • A. K. Sujeeth et al. OptiML: An Implicitly
    Parallel Domain-Specific Language for Machine
    Learning. ICML, 2011

25
  • Programming Issues
  • Need to go wide & deep!
  • WIDE: more nodes in a cluster
  • DEEP: more cores in a node
  • Active research in both WIDE & DEEP models
  • Nodes are typically multicore
  • WIDE models are at the mercy of the OS for
    leveraging multicores
  • WIDE & DEEP models are not necessarily orthogonal
  • Combining WIDE and DEEP models is nontrivial
  • Hybrid programming models: MPI + OpenMP

26
  • Distributed Computing
  • Existing distributed computing options like
    Cluster & Grid computing provide low levels of
    abstraction
  • Programmer has to deal with
  • synchronization, deadlocks, data dependency,
    mutual exclusion, replication, reliability,
    platform scalability and provisioning
  • Too much to ask from a data mining researcher!!
  • What are the solutions available?
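
To make one item on this list concrete, here is a minimal sketch of mutual exclusion using Python's standard `threading` module. It is an illustration only (the function name is hypothetical and the example is shared-memory, not distributed): even for a trivial shared counter, the programmer has to identify the critical section and guard it explicitly.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:           # mutual exclusion around the shared update
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # synchronization: wait for every worker
    return counter


print(count_with_lock())  # 40000
```

Without the lock, the read-modify-write on `counter` is not guaranteed to be atomic, so updates can be lost. This kind of low-level bookkeeping is what higher-level abstractions aim to hide.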

27
  • MapReduce/Hadoop
  • MapReduce - A programming model and its
    associated implementation
  • provides a high level of abstraction
  • but has limitations
  • Only data parallel tasks stand to benefit!
  • MapReduce hides parallel/distributed computing
    concepts from users/programmers
  • Even novice users/programmers can leverage
    cluster computing for data-intensive problems
  • Cluster, Grid, and MapReduce are platforms
    intended for general-purpose computing
  • Hadoop/Pig combo is very effective!

MapReduce: Simplified Data Processing on Large
Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI,
2004
28
  • MapReduce/Hadoop
  • Input
  • Map
  • Shuffle
  • Reduce
  • Output

http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
By Milind Bhandarkar
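
The Input, Map, Shuffle, Reduce, Output flow can be sketched in plain Python using the word-count example. This is a single-process illustration of the programming model only, with hypothetical function names; it does not use Hadoop or any of its APIs, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values collected for one key."""
    return key, sum(values)


def run_job(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in groups.items())


print(run_job(["big data", "Big clusters", "data mining"]))
# {'big': 2, 'data': 2, 'clusters': 1, 'mining': 1}
```

Each `map_phase` call sees only one record and each `reduce_phase` call sees only one key, which is why both phases can be spread across a cluster without the programmer writing any synchronization code.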
29
  • Research Gaps
  • Distributed computing is the only option for
    large-scale data analytics, but
  • current distributed computing systems are for
    general-purpose computing and
  • they support low levels of abstraction
    (parallelization is non-trivial)
  • MapReduce does provide a higher level of
    abstraction,
  • but is not suitable for Data Mining
  • (evident from literature survey)
  • in particular, it scales only for data-parallel
    problems
  • Need for a scalable distributed computing
    framework that provides both abstraction and
    performance

30
  • Research Gaps
  • K-means on MapReduce/Hadoop
  • A typical data-parallel problem
  • Suited for MapReduce
  • DBSCAN or OPTICS
  • Not a data-parallel problem
  • Not suitable for MapReduce
  • Throws up new data distribution and algorithmic
    challenges
  • Need for a scalable distributed computing
    framework that allows us to efficiently exploit
    all kinds of parallelisms that exist in an
    algorithm
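
To illustrate why K-means counts as data-parallel, the sketch below writes one K-means iteration in map/reduce style. It is a single-process illustration with hypothetical function names, not an implementation from the talk or from Hadoop. The map step assigns each point to its nearest centroid using only that point and the current centroids; the reduce step averages the points assigned to each centroid.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Map: each point is assigned independently of all other points."""
    return nearest(point, centroids), point


def kmeans_reduce(points):
    """Reduce: the new centroid is the mean of the points assigned to it."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for point in points:                       # map + shuffle
        idx, p = kmeans_map(point, centroids)
        groups[idx].append(p)
    # A centroid with no assigned points keeps its old position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)]))
# [(0.0, 1.0), (10.0, 11.0)]
```

By contrast, in a density-based method such as DBSCAN, whether a point joins a cluster depends on its neighbours and on the points reachable through them, so the points cannot simply be split across workers and processed in isolation.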

31
  • @ CS Department, BITS-Pilani

Programming Model
Researchers (DM Algorithms)
Distributed File System: 20 TB NAS for Central Storage
End users (DM Applications)
InfiniBand Cluster: 48 Multicore Nodes
32
  • @ CS Department, BITS-Pilani

SCALABLE SERVICE
End users (DM Applications)
Researchers (DM Algorithms)
CLOUD
33
  • @ CS Department, BITS-Pilani
  • Advanced Data Analytics & Parallel Technologies
    Lab (ADAPT Lab): ADAPTing to the future
  • A new distributed computing framework for data
    mining
  • HPC Division
  • Department of Electronics & Information
    Technology (DeitY)
  • INR 1.20 Crores (3 Years)
  • Navneet Goyal, Poonam Goyal, Sundar B
  • Collaborators: IASRI, New Delhi
  • Full-time PhD Students:
  • Sonal Kumari (TCS PhD Fellow)
  • Mohit Sati
  • Jagat Sesh Challa
  • Saiyedul Islam (TCS PhD Fellow)

34
  • @ CS Department, BITS-Pilani
  • Advanced Data Analytics & Parallel Technologies
    Lab (ADAPT Lab): ADAPTing to the future
  • 48-node Beowulf Cluster
  • 20 TB NAS
  • Intel Cluster Studio
  • Vampir Standard
  • VMware vCenter
  • IBM SPSS Modeler

35
  • @ CS Department, BITS-Pilani
  • Advanced Data Analytics & Parallel Technologies
    Lab (ADAPT Lab): ADAPTing to the future
  • Programming Environment
  • MPI 2.0 (Open MPI): Distributed Memory
  • Does not exploit cores
  • MPI 3.0 exploits cores
  • OpenMP vs. TBB (Intel) vs. OpenCL: Shared
    Memory
  • Profiling & Debugging Tools
  • Vampir
  • PGI
  • TotalView
  • Intel Cluster Studio XE

36
  • Future Directions
  • Interpretation of Analysis
  • Data Visualization
  • Better Programming Models
  • Data Modeling
  • Sampling Techniques

37
Skill Sets
  • Probability & Statistics
  • Linear Algebra
  • Vector Analysis

38
  • Thank you
  • Q & A
  • goel@pilani.bits-pilani.ac.in