Transcript and Presenter's Notes

Title: Graphs, Data Mining, and High Performance Computing


1
Graphs, Data Mining, and High Performance Computing
  • Bruce Hendrickson
  • Sandia National Laboratories, Albuquerque, NM
  • University of New Mexico, Computer Science Dept.

2
Outline
  • High performance computing
  • Why current approaches can't work for data mining
  • Test case: graphs for knowledge representation
  • High performance graph algorithms, an oxymoron?
  • Implications for broader data mining community
  • Future trends

3
Data Mining and High Performance Computing
  • We can only consider simple algorithms
  • Data too big for anything but O(n) algorithms
  • Often have some kind of real-time constraints
  • This greatly limits the kinds of questions we can
    address
  • Terascale data gives different insights than
    gigascale data
  • Current search capabilities are wonderful, but
    innately limited
  • Can high-performance computing make an impact?
  • What if our algorithms ran 100x faster and could
    use 100x more memory? 1000x?
  • Assertion: Quantitative improvements in
    capabilities result in qualitative changes in the
    science that can be done.

4
Modern Computers
  • Fast processors, slow memory
  • Use memory hierarchy to keep processor fed
  • Stage some data in smaller, faster memory (cache)
  • Can dramatically enhance performance
  • But only if accesses have spatial or temporal
    locality (see the sketch below)
  • Use accessed data repeatedly, or use nearby data
    next
  • Parallel computers are collections of these
  • Pivotal to have a processor own most data it
    needs
  • Memory patterns determine performance
  • Processor speed hardly matters
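A minimal sketch (not from the talk) of why the access pattern dominates: both functions add up the same data, but the first walks memory with stride 1 while the second follows a random permutation and misses cache on nearly every load. The array size and the permutation are arbitrary choices for illustration.

    // Contrast cache-friendly (stride-1) access with random access
    // over the same data.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    int64_t sum_sequential(const std::vector<int64_t>& a) {
        int64_t s = 0;
        for (int64_t x : a) s += x;        // stride-1: spatial locality, prefetch-friendly
        return s;
    }

    int64_t sum_random(const std::vector<int64_t>& a,
                       const std::vector<std::size_t>& perm) {
        int64_t s = 0;
        for (std::size_t i : perm) s += a[i];  // same arithmetic, but scattered loads
        return s;
    }

    int main() {
        std::vector<int64_t> a(1 << 24, 1);            // larger than typical caches
        std::vector<std::size_t> perm(a.size());
        std::iota(perm.begin(), perm.end(), std::size_t{0});
        std::shuffle(perm.begin(), perm.end(), std::mt19937{42});
        return sum_sequential(a) == sum_random(a, perm) ? 0 : 1;
    }

On most cache-based machines the second loop typically runs several times slower even though the arithmetic is identical.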

5
High Performance Computing
  • Largely the purview of science and engineering
    communities
  • Machines, programming models, algorithms to
    serve their needs
  • Can these be utilized by learning and data mining
    communities?
  • Search companies make great use of parallelism
    for simple things
  • But not general purpose
  • Goals
  • Large (cumulative) core for holding big data sets
  • Fast and scalable performance of complex
    algorithms
  • Ease of programmability

6
Algorithms We've Seen This Week
  • Hashing (of many sorts)
  • Feature detection
  • Sampling
  • Inverse index construction
  • Sparse matrix and tensor products
  • Training
  • Clustering
  • All of these involve
  • complex memory access patterns
  • only small amounts of computation
  • Performance dominated by latency waiting for
    data (see the sketch below)
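As a concrete instance, here is a generic CSR sparse matrix-vector product (not code from the talk): note how little arithmetic there is per memory access, and that the gather from x is indirect.

    // Minimal CSR sparse matrix-vector product y = A*x. The load x[col[k]]
    // is an indirect, data-dependent access with almost no computation to
    // hide its latency. Array names follow the usual CSR conventions.
    #include <vector>

    void spmv_csr(const std::vector<int>& row_ptr,   // size n+1
                  const std::vector<int>& col,       // column index per nonzero
                  const std::vector<double>& val,    // value per nonzero
                  const std::vector<double>& x,
                  std::vector<double>& y) {
        const int n = static_cast<int>(row_ptr.size()) - 1;
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col[k]];           // indirect gather: poor locality
            y[i] = sum;
        }
    }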

7
Architectural Challenges
  • Runtime is dominated by latency
  • Lots of indirect addressing, pointer chasing,
    etc.
  • Perhaps many at once
  • Very little computation to hide memory costs
  • Access pattern can be data dependent
  • Prefetching unlikely to help (see the sketch
    below)
  • Usually only want small part of cache line
  • Potentially abysmal locality at all levels of
    memory hierarchy
  • Bad serial and abysmal parallel performance
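A tiny sketch of data-dependent addressing (illustrative only, not from the talk): the address of each load comes out of the previous load, so the hardware prefetcher cannot run ahead and every step pays full memory latency.

    #include <cstddef>
    #include <vector>

    // 'next' encodes an arbitrary traversal order; each iteration's address
    // depends on the value just loaded (classic pointer chasing).
    std::size_t chase(const std::vector<std::size_t>& next,
                      std::size_t start, std::size_t steps) {
        std::size_t i = start;
        for (std::size_t s = 0; s < steps; ++s)
            i = next[i];
        return i;
    }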

8
Graphs for Knowledge Representation
  • Graphs can capture rich semantic structure in
    data (a minimal representation is sketched below)
  • More complex than bag of features
  • Examples
  • Protein interaction networks
  • Web pages with hyperlinks
  • Semantic web
  • Social networks, etc.
  • Algorithms of interest include
  • Connectivity (of various sorts)
  • Clustering and community detection
  • Common motif discovery
  • Pattern matching, etc.
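A minimal sketch of what such a representation might look like; the field names are illustrative, not the talk's data model. Vertices and edges carry labels so queries can match structure and attributes together.

    #include <string>
    #include <vector>

    struct Edge {
        int         target;    // index of the destination vertex
        std::string relation;  // e.g. "interacts_with", "links_to", "knows"
    };

    struct Vertex {
        std::string label;         // e.g. protein name, URL, person
        std::vector<Edge> out;     // outgoing labeled edges
    };

    using SemanticGraph = std::vector<Vertex>;  // vertex id = index into the vector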

9
Semantic Graph Example
10
Finding Threats: Subgraph Isomorphism
Image Source: T. Coffman, S. Greenblatt, S. Marcus,
"Graph-based technologies for intelligence
analysis," CACM 47(3), March 2004, pp. 45-47.
11
Mohammed Jabarah (Canadian citizen handed over to
US authorities on suspicion of links to 9/11).  
Omar Khadr (at Guantanamo)
Thanks to Kevin McCurley
12
[Image-only slide; no transcript.]
13
Graph-Based Informatics Data
  • Graphs can be enormous
  • High performance computing may be needed for
    memory and performance
  • Graphs are highly unstructured
  • High variance in number of neighbors
  • Little or no locality; not partitionable
  • Experience with scientific computing graphs is of
    limited utility
  • Terrible locality in memory access patterns

14
Desirable Architectural Features
  • Low latency / high bandwidth
  • For small messages!
  • Latency tolerant
  • Light-weight synchronization mechanisms
  • Global address space
  • No data partitioning required
  • Avoid memory-consuming profusion of ghost-nodes
  • No local/global numbering conversions
  • One machine with these properties is the Cray
    MTA-2
  • And its successor, the XMT

15
Massive Multithreading: The Cray MTA-2
  • Slow clock rate (220 MHz)
  • 128 streams per processor
  • Global address space
  • Fine-grain synchronization
  • Simple, serial-like programming model
  • Advanced parallelizing compilers

Latency tolerance is important for graph algorithms
16
Cray MTA Processor
No Processor Cache!
Hashed Memory!
  • Each thread can have 8 memory refs in flight
  • Round trip to memory: 150 cycles

17
How Does the MTA Work?
  • Latency tolerance via massive multi-threading
  • Context switch in a single tick
  • Global address space, hashed to reduce hot-spots
  • No cache or local memory. Context switch on
    memory request.
  • Multiple outstanding loads
  • Remote memory request doesn't stall processor
  • Other streams work while your request gets
    fulfilled
  • Light-weight, word-level synchronization (a
    portable analogue is sketched below)
  • Minimizes access conflicts
  • Flexibly supports dynamic load balancing
  • Notes:
  • MTA-2 is 7 years old
  • Largest machine is 40 processors
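A portable analogue (not the MTA hardware or any Cray API) of word-level full/empty synchronization: a reader blocks until the word is full and leaves it empty, a writer does the reverse. On the MTA the full/empty bit is per memory word in hardware, and a blocked thread simply lets other streams run instead of spinning as this sketch does.

    #include <atomic>

    class SyncWord {
        enum State { EMPTY = 0, BUSY = 1, FULL = 2 };
        std::atomic<int> state{EMPTY};
        long value{0};
    public:
        void write_ef(long v) {               // wait until empty, then fill
            int expected = EMPTY;
            while (!state.compare_exchange_weak(expected, BUSY,
                                                std::memory_order_acquire))
                expected = EMPTY;             // spin while full or busy
            value = v;
            state.store(FULL, std::memory_order_release);
        }
        long read_fe() {                      // wait until full, then empty
            int expected = FULL;
            while (!state.compare_exchange_weak(expected, BUSY,
                                                std::memory_order_acquire))
                expected = FULL;              // spin while empty or busy
            long v = value;
            state.store(EMPTY, std::memory_order_release);
            return v;
        }
    };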

18
Case Study: MTA-2 vs. BlueGene/L
  • With LLNL, implemented S-T shortest paths in MPI
  • Ran on IBM/LLNL BlueGene/L, world's fastest
    computer
  • Finalist for 2005 Gordon Bell Prize
  • 4B vertex, 20B edge, Erdős-Rényi random graph
  • Analysis touches about 200K vertices
  • Time: 1.5 seconds on 32K processors
  • Ran similar problem on MTA-2
  • 32 million vertices, 128 million edges
  • Measured touches about 23K vertices
  • Time: 0.7 seconds on one processor, 0.09 seconds
    on 10 processors
  • Conclusion: 4 MTA-2 processors ≈ 32K BlueGene/L
    processors

19
But Speed Isn't Everything
  • Unlike MTA code, MPI code is limited to
    Erdős-Rényi graphs
  • Can't support the power-law graphs pervasive in
    informatics
  • MPI code is 3 times larger than MTA-2 code
  • Took considerably longer to develop
  • MPI code can only solve this very special problem
  • MTA code is part of general and flexible
    infrastructure
  • MTA easily supports multiple, simultaneous users
  • But MPI code runs everywhere
  • MTA code runs only on MTA/Eldorado and on serial
    machines

20
Multithreaded Graph Software Design
  • Build generic infrastructure for core operations
    including
  • Breadth-first search (e.g. short paths)
  • Distributed local searches (e.g. subgraph
    isomorphism)
  • Rich filtering operations (numerous applications)
  • Separate basic kernels from instance specifics
  • Infrastructure is challenging to write
  • Parallelization and performance challenges reside
    in the infrastructure
  • Must port to multiple architectures
  • But with infrastructure in place, application
    development is highly productive and portable

21
Customizing Behavior: Visitors
  • Idea from BOOST (Lumsdaine)
  • Application programmer writes small visitor
    functions
  • Get invoked at key points by basic infrastructure
  • E.g. when a new vertex is visited, etc.
  • Adjust behavior or copy data to build tailored
    knowledge products
  • Example: with one breadth-first-search routine,
    you can
  • Find short paths
  • Construct spanning trees
  • Find connected components, etc. (see the sketch
    below)
  • Architectural dependence is hidden in
    infrastructure
  • Applications programming is highly productive
  • Use just enough C++ for flexibility, but not too
    much
  • Note: Code runs on serial Linux, Windows, and Mac
    machines
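A minimal serial sketch of the visitor idea (not the Eldorado/MTA infrastructure): one breadth-first-search kernel with a user-supplied hook invoked on each tree edge. Passing different visitors yields distances, a BFS spanning tree, or component labels without touching the kernel.

    #include <queue>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists

    template <class Visitor>
    void bfs(const Graph& g, int source, Visitor&& on_tree_edge) {
        std::vector<char> seen(g.size(), 0);
        std::queue<int> q;
        seen[source] = 1;
        q.push(source);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int v : g[u]) {
                if (!seen[v]) {
                    seen[v] = 1;
                    on_tree_edge(u, v);   // customization point
                    q.push(v);
                }
            }
        }
    }

    // Two example visitors: shortest-path distances and a BFS spanning tree.
    void examples(const Graph& g) {
        std::vector<int> dist(g.size(), -1), parent(g.size(), -1);
        dist[0] = 0;
        bfs(g, 0, [&](int u, int v) { dist[v] = dist[u] + 1; parent[v] = u; });
    }

Connected components follow the same pattern: run the search from each unvisited vertex and have the visitor copy the root's label.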

22
Eldorado Graph Infrastructure: C++ Design Levels
[Layered-design diagram: Graph class, Algorithm class, Visitor class, and Data Structure class, spanning the Analyst Support, Algorithms Programmer, and Infrastructure Programmer levels. The layering gives parallelism "for free" and hides most concurrency. Inspired by Boost GL, but not Boost GL.]
23
Kahan's Algorithm for Connected Components
24
Infrastructure Implementation of Kahan's Algorithm
[Diagram mapping the algorithm onto the infrastructure: Kahan's Phase I visitor with the search kernel (tricky), Kahan's Phase II visitor (trivial) on top of Shiloach-Vishkin CRCW (tricky), and Kahan's Phase III visitor (trivial).]
25
Infrastructure Implementation of Kahan's Algorithm
[Code excerpt, Phase I visitor: component values start empty; make them full. Wait until both endpoints' component values are full, then add the pair to a hash table. A sketch of the idea follows.]
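A hedged sketch of the recorded-collision idea described on the slide, in plain C++ rather than the MTA full/empty code: when a search reaches an edge whose endpoints already carry two different component labels, the pair of labels goes into a hash table for the later merge phase.

    #include <cstdint>
    #include <functional>
    #include <unordered_set>
    #include <utility>

    struct PairHash {
        std::size_t operator()(const std::pair<int, int>& p) const {
            return std::hash<std::uint64_t>{}(
                (std::uint64_t(std::uint32_t(p.first)) << 32) |
                std::uint32_t(p.second));
        }
    };

    using MergeSet = std::unordered_set<std::pair<int, int>, PairHash>;

    // Record that components comp_u and comp_v touch and must be merged.
    void record_collision(int comp_u, int comp_v, MergeSet& to_merge) {
        if (comp_u == comp_v) return;
        if (comp_u > comp_v) std::swap(comp_u, comp_v);   // canonical order
        to_merge.insert({comp_u, comp_v});
    }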
26
Traceview Output for Infrastructure Impl. of
Kahan's CC algorithm
27
More General Filtering: The Bully Algorithm
28
Bully Algorithm Implementation
[Code excerpt: traverse edge e if we would anyway, or if this test returns true (or / and / replace); the destination is locked while testing. A loose sketch of the hook follows.]
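A loose sketch of the filtering hook the slide describes; the names and the mutex are stand-ins for the MTA's word-level locking, not the actual bully-algorithm code. An edge is traversed if the default rule already says so, or if a user-supplied test on the destination's state, run while the destination is locked, says so.

    #include <mutex>

    struct VertexState {
        long value;
        std::mutex lock;      // stand-in for locking the destination word
    };

    template <class Test>
    bool should_traverse(bool would_traverse_anyway,
                         VertexState& dest, Test&& test) {
        if (would_traverse_anyway) return true;
        std::lock_guard<std::mutex> guard(dest.lock);  // lock dest while testing
        return test(dest);    // the test may also replace fields of dest
    }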
29
Traceview Output for the Bully Algorithm
30
MTA-2 Scaling of Connected Components
[Scaling chart for a power-law (highly unstructured) graph; labeled run times include 5.41 s and 2.91 s.]
31
Computational Results: Subgraph Isomorphism
32
A Renaissance in Architecture
  • Bad news
  • Power considerations limit the improvement in
    clock speed
  • Good news
  • Moore's Law marches on
  • Real estate on a chip is no longer at a premium
  • On a processor, much of the area is already
    memory control
  • Only a tiny bit is computing (e.g. floating point)
  • The future is not like the past

33
Example: AMD Opteron
34
Example: AMD Opteron
[Annotated die diagram: latency-avoidance memory (L1 D-Cache, L1 I-Cache, L2 Cache).]
35
Example: AMD Opteron
[Annotated die diagram adds latency-tolerance logic: out-of-order execution, load/store unit, memory/coherency, I-fetch/scan/align, and the memory controller.]
36
Example: AMD Opteron
[Annotated die diagram adds the memory and I/O interfaces: bus, DDR, and HyperTransport (HT).]
37
Example: AMD Opteron
[Annotated die diagram adds the actual computing: FPU execution and integer execution. Only a small fraction of the die is the "COMPUTER."]
Thanks to Thomas Sterling
38
Consequences
  • Current response: stamp out more processors.
  • Multicore processors. Not very imaginative.
  • Makes life worse for most of us
  • Near future trends
  • Multithreading to tolerate latencies
  • MTA-like capability on commodity machines
  • Potentially big impact on data-centric
    applications
  • Further out
  • Application-specific circuitry
  • E.g. hashing, feature detection, etc.
  • Reconfigurable hardware?
  • Adapt circuits to the application at run time

39
Summary
  • Massive multithreading has great potential for
    data mining and learning
  • Software development is challenging
  • correctness
  • performance
  • Well designed infrastructure can hide many of
    these challenges
  • Once built, infrastructure enables high
    productivity
  • Potential to become mainstream. Stay tuned.

40
Acknowledgements
  • Jon Berry
  • Simon Kahan, Petr Konecny (Cray)
  • David Bader, Kamesh Madduri (Ga. Tech) (MTA s-t
    connectivity)
  • Will McClendon (MPI s-t connectivity)