Transcript and Presenter's Notes

Title: Graphs, Data Mining, and High Performance Computing


1
Graphs, Data Mining, and High Performance Computing
  • Bruce Hendrickson
  • Sandia National Laboratories, Albuquerque, NM
  • University of New Mexico, Computer Science Dept.

2
Outline
  • High performance computing
  • Why current approaches can't work for data mining
  • Test case: graphs for knowledge representation
  • High performance graph algorithms, an oxymoron?
  • Implications for broader data mining community
  • Future trends

3
Data Mining and High Performance Computing
  • We can only consider simple algorithms
  • Data too big for anything but O(n) algorithms
  • Often have some kind of real-time constraints
  • This greatly limits the kinds of questions we can
    address
  • Terascale data gives different insights than
    gigascale data
  • Current search capabilities are wonderful, but
    innately limited
  • Can high-performance computing make an impact?
  • What if our algorithms ran 100x faster and could
    use 100x more memory? 1000x?
  • Assertion: Quantitative improvements in
    capabilities result in qualitative changes in the
    science that can be done.

4
Modern Computers
  • Fast processors, slow memory
  • Use memory hierarchy to keep processor fed
  • Stage some data in smaller, faster memory (cache)
  • Can dramatically enhance performance
  • But only if accesses have spatial or temporal
    locality (see the sketch below)
  • Use accessed data repeatedly, or use nearby data
    next
  • Parallel computers are collections of these
  • Pivotal to have a processor own most data it
    needs
  • Memory patterns determine performance
  • Processor speed hardly matters
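A minimal sketch (not from the talk) of why the access pattern dominates: both functions add up the same data, but the first walks memory with stride 1 while the second follows a random permutation and misses cache on nearly every load. The array size and the permutation are arbitrary choices for illustration.

    // Contrast cache-friendly (stride-1) access with random access
    // over the same data.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    int64_t sum_sequential(const std::vector<int64_t>& a) {
        int64_t s = 0;
        for (int64_t x : a) s += x;        // stride-1: spatial locality, prefetch-friendly
        return s;
    }

    int64_t sum_random(const std::vector<int64_t>& a,
                       const std::vector<std::size_t>& perm) {
        int64_t s = 0;
        for (std::size_t i : perm) s += a[i];  // same arithmetic, but scattered loads
        return s;
    }

    int main() {
        std::vector<int64_t> a(1 << 24, 1);            // larger than typical caches
        std::vector<std::size_t> perm(a.size());
        std::iota(perm.begin(), perm.end(), std::size_t{0});
        std::shuffle(perm.begin(), perm.end(), std::mt19937{42});
        return sum_sequential(a) == sum_random(a, perm) ? 0 : 1;
    }

On most cache-based machines the second loop typically runs several times slower even though the arithmetic is identical.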

5
High Performance Computing
  • Largely the purview of science and engineering
    communities
  • Machines, programming models, algorithms to
    serve their needs
  • Can these be utilized by learning and data mining
    communities?
  • Search companies make great use of parallelism
    for simple things
  • But not general purpose
  • Goals
  • Large (cumulative) core for holding big data sets
  • Fast and scalable performance of complex
    algorithms
  • Ease of programmability

6
Algorithms We've Seen This Week
  • Hashing (of many sorts)
  • Feature detection
  • Sampling
  • Inverse index construction
  • Sparse matrix and tensor products
  • Training
  • Clustering
  • All of these involve
  • complex memory access patterns
  • only small amounts of computation
  • Performance dominated by latency waiting for
    data (see the sketch below)
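As a concrete instance, here is a generic CSR sparse matrix-vector product (not code from the talk): note how little arithmetic there is per memory access, and that the gather from x is indirect.

    // Minimal CSR sparse matrix-vector product y = A*x. The load x[col[k]]
    // is an indirect, data-dependent access with almost no computation to
    // hide its latency. Array names follow the usual CSR conventions.
    #include <vector>

    void spmv_csr(const std::vector<int>& row_ptr,   // size n+1
                  const std::vector<int>& col,       // column index per nonzero
                  const std::vector<double>& val,    // value per nonzero
                  const std::vector<double>& x,
                  std::vector<double>& y) {
        const int n = static_cast<int>(row_ptr.size()) - 1;
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col[k]];           // indirect gather: poor locality
            y[i] = sum;
        }
    }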

7
Architectural Challenges
  • Runtime is dominated by latency
  • Lots of indirect addressing, pointer chasing,
    etc.
  • Perhaps many at once
  • Very little computation to hide memory costs
  • Access pattern can be data dependent
  • Prefetching unlikely to help (see the sketch
    below)
  • Usually only want small part of cache line
  • Potentially abysmal locality at all levels of
    memory hierarchy
  • Bad serial and abysmal parallel performance
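A tiny sketch of data-dependent addressing (illustrative only, not from the talk): the address of each load comes out of the previous load, so the hardware prefetcher cannot run ahead and every step pays full memory latency.

    #include <cstddef>
    #include <vector>

    // 'next' encodes an arbitrary traversal order; each iteration's address
    // depends on the value just loaded (classic pointer chasing).
    std::size_t chase(const std::vector<std::size_t>& next,
                      std::size_t start, std::size_t steps) {
        std::size_t i = start;
        for (std::size_t s = 0; s < steps; ++s)
            i = next[i];
        return i;
    }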

8
Graphs for Knowledge Representation
  • Graphs can capture rich semantic structure in
    data (a minimal representation is sketched below)
  • More complex than bag of features
  • Examples
  • Protein interaction networks
  • Web pages with hyperlinks
  • Semantic web
  • Social networks, etc.
  • Algorithms of interest include
  • Connectivity (of various sorts)
  • Clustering and community detection
  • Common motif discovery
  • Pattern matching, etc.
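A minimal sketch of what such a representation might look like; the field names are illustrative, not the talk's data model. Vertices and edges carry labels so queries can match structure and attributes together.

    #include <string>
    #include <vector>

    struct Edge {
        int         target;    // index of the destination vertex
        std::string relation;  // e.g. "interacts_with", "links_to", "knows"
    };

    struct Vertex {
        std::string label;         // e.g. protein name, URL, person
        std::vector<Edge> out;     // outgoing labeled edges
    };

    using SemanticGraph = std::vector<Vertex>;  // vertex id = index into the vector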

9
Semantic Graph Example
10
Finding Threats: Subgraph Isomorphism
Image Source: T. Coffman, S. Greenblatt, S. Marcus,
"Graph-based technologies for intelligence
analysis," CACM 47(3), March 2004, pp. 45-47.
11
Mohammed Jabarah (Canadian citizen handed over to
US authorities on suspicion of links to 9/11).  
Omar Khadr (at Guantanamo)
Thanks to Kevin McCurley
12
[Image-only slide; no transcript.]
13
Graph-Based Informatics Data
  • Graphs can be enormous
  • High performance computing may be needed for
    memory and performance
  • Graphs are highly unstructured
  • High variance in number of neighbors
  • Little or no locality; not partitionable
  • Experience with scientific computing graphs is of
    limited utility
  • Terrible locality in memory access patterns

14
Desirable Architectural Features
  • Low latency / high bandwidth
  • For small messages!
  • Latency tolerant
  • Light-weight synchronization mechanisms
  • Global address space
  • No data partitioning required
  • Avoid memory-consuming profusion of ghost-nodes
  • No local/global numbering conversions
  • One machine with these properties is the Cray
    MTA-2
  • And its successor, the XMT

15
Massive Multithreading: The Cray MTA-2
  • Slow clock rate (220 MHz)
  • 128 streams per processor
  • Global address space
  • Fine-grain synchronization
  • Simple, serial-like programming model
  • Advanced parallelizing compilers

Latency tolerance is important for graph algorithms
16
Cray MTA Processor
No Processor Cache!
Hashed Memory!
  • Each thread can have 8 memory refs in flight
  • Round trip to memory: 150 cycles

17
How Does the MTA Work?
  • Latency tolerance via massive multi-threading
  • Context switch in a single tick
  • Global address space, hashed to reduce hot-spots
  • No cache or local memory. Context switch on
    memory request.
  • Multiple outstanding loads
  • Remote memory request doesn't stall processor
  • Other streams work while your request gets
    fulfilled
  • Light-weight, word-level synchronization (a
    portable analogue is sketched below)
  • Minimizes access conflicts
  • Flexibly supports dynamic load balancing
  • Notes:
  • MTA-2 is 7 years old
  • Largest machine is 40 processors
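A portable analogue (not the MTA hardware or any Cray API) of word-level full/empty synchronization: a reader blocks until the word is full and leaves it empty, a writer does the reverse. On the MTA the full/empty bit is per memory word in hardware, and a blocked thread simply lets other streams run instead of spinning as this sketch does.

    #include <atomic>

    class SyncWord {
        enum State { EMPTY = 0, BUSY = 1, FULL = 2 };
        std::atomic<int> state{EMPTY};
        long value{0};
    public:
        void write_ef(long v) {               // wait until empty, then fill
            int expected = EMPTY;
            while (!state.compare_exchange_weak(expected, BUSY,
                                                std::memory_order_acquire))
                expected = EMPTY;             // spin while full or busy
            value = v;
            state.store(FULL, std::memory_order_release);
        }
        long read_fe() {                      // wait until full, then empty
            int expected = FULL;
            while (!state.compare_exchange_weak(expected, BUSY,
                                                std::memory_order_acquire))
                expected = FULL;              // spin while empty or busy
            long v = value;
            state.store(EMPTY, std::memory_order_release);
            return v;
        }
    };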

18
Case Study: MTA-2 vs. BlueGene/L
  • With LLNL, implemented S-T shortest paths in MPI
  • Ran on IBM/LLNL BlueGene/L, world's fastest
    computer
  • Finalist for 2005 Gordon Bell Prize
  • 4B vertex, 20B edge, Erdős-Rényi random graph
  • Analysis touches about 200K vertices
  • Time: 1.5 seconds on 32K processors
  • Ran similar problem on MTA-2
  • 32 million vertices, 128 million edges
  • Measured touches about 23K vertices
  • Time: 0.7 seconds on one processor, 0.09 seconds
    on 10 processors
  • Conclusion: 4 MTA-2 processors ≈ 32K BlueGene/L
    processors

19
But Speed Isn't Everything
  • Unlike MTA code, MPI code is limited to
    Erdős-Rényi graphs
  • Can't support the power-law graphs pervasive in
    informatics
  • MPI code is 3 times larger than MTA-2 code
  • Took considerably longer to develop
  • MPI code can only solve this very special problem
  • MTA code is part of general and flexible
    infrastructure
  • MTA easily supports multiple, simultaneous users
  • But MPI code runs everywhere
  • MTA code runs only on MTA/Eldorado and on serial
    machines

20
Multithreaded Graph Software Design
  • Build generic infrastructure for core operations
    including
  • Breadth-first search (e.g. short paths)
  • Distributed local searches (e.g. subgraph
    isomorphism)
  • Rich filtering operations (numerous applications)
  • Separate basic kernels from instance specifics
  • Infrastructure is challenging to write
  • Parallelization and performance challenges reside
    in the infrastructure
  • Must port to multiple architectures
  • But with infrastructure in place, application
    development is highly productive and portable

21
Customizing Behavior: Visitors
  • Idea from BOOST (Lumsdaine)
  • Application programmer writes small visitor
    functions
  • Get invoked at key points by basic infrastructure
  • E.g. when a new vertex is visited, etc.
  • Adjust behavior or copy data to build tailored
    knowledge products
  • Example: with one breadth-first-search routine,
    you can
  • Find short paths
  • Construct spanning trees
  • Find connected components, etc. (see the sketch
    below)
  • Architectural dependence is hidden in
    infrastructure
  • Applications programming is highly productive
  • Use just enough C++ for flexibility, but not too
    much
  • Note: Code runs on serial Linux, Windows, and Mac
    machines
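A minimal serial sketch of the visitor idea (not the Eldorado/MTA infrastructure): one breadth-first-search kernel with a user-supplied hook invoked on each tree edge. Passing different visitors yields distances, a BFS spanning tree, or component labels without touching the kernel.

    #include <queue>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists

    template <class Visitor>
    void bfs(const Graph& g, int source, Visitor&& on_tree_edge) {
        std::vector<char> seen(g.size(), 0);
        std::queue<int> q;
        seen[source] = 1;
        q.push(source);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int v : g[u]) {
                if (!seen[v]) {
                    seen[v] = 1;
                    on_tree_edge(u, v);   // customization point
                    q.push(v);
                }
            }
        }
    }

    // Two example visitors: shortest-path distances and a BFS spanning tree.
    void examples(const Graph& g) {
        std::vector<int> dist(g.size(), -1), parent(g.size(), -1);
        dist[0] = 0;
        bfs(g, 0, [&](int u, int v) { dist[v] = dist[u] + 1; parent[v] = u; });
    }

Connected components follow the same pattern: run the search from each unvisited vertex and have the visitor copy the root's label.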

22
Eldorado Graph Infrastructure: C++ Design Levels
[Layered-design diagram: Graph class, Algorithm class, Visitor class, and Data Structure class, spanning the Analyst Support, Algorithms Programmer, and Infrastructure Programmer levels. The layering gives parallelism "for free" and hides most concurrency. Inspired by Boost GL, but not Boost GL.]
23
Kahan's Algorithm for Connected Components
24
Infrastructure Implementation of Kahan's Algorithm
[Diagram mapping the algorithm onto the infrastructure: Kahan's Phase I visitor with the search kernel (tricky), Kahan's Phase II visitor (trivial) on top of Shiloach-Vishkin CRCW (tricky), and Kahan's Phase III visitor (trivial).]
25
Infrastructure Implementation of Kahan's Algorithm
[Code excerpt, Phase I visitor: component values start empty; make them full. Wait until both endpoints' component values are full, then add the pair to a hash table. A sketch of the idea follows.]
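A hedged sketch of the recorded-collision idea described on the slide, in plain C++ rather than the MTA full/empty code: when a search reaches an edge whose endpoints already carry two different component labels, the pair of labels goes into a hash table for the later merge phase.

    #include <cstdint>
    #include <functional>
    #include <unordered_set>
    #include <utility>

    struct PairHash {
        std::size_t operator()(const std::pair<int, int>& p) const {
            return std::hash<std::uint64_t>{}(
                (std::uint64_t(std::uint32_t(p.first)) << 32) |
                std::uint32_t(p.second));
        }
    };

    using MergeSet = std::unordered_set<std::pair<int, int>, PairHash>;

    // Record that components comp_u and comp_v touch and must be merged.
    void record_collision(int comp_u, int comp_v, MergeSet& to_merge) {
        if (comp_u == comp_v) return;
        if (comp_u > comp_v) std::swap(comp_u, comp_v);   // canonical order
        to_merge.insert({comp_u, comp_v});
    }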
26
Traceview Output for Infrastructure Impl. of
Kahan's CC algorithm
27
More General Filtering: The Bully Algorithm
28
Bully Algorithm Implementation
[Code excerpt: traverse edge e if we would anyway, or if this test returns true (or / and / replace); the destination is locked while testing. A loose sketch of the hook follows.]
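A loose sketch of the filtering hook the slide describes; the names and the mutex are stand-ins for the MTA's word-level locking, not the actual bully-algorithm code. An edge is traversed if the default rule already says so, or if a user-supplied test on the destination's state, run while the destination is locked, says so.

    #include <mutex>

    struct VertexState {
        long value;
        std::mutex lock;      // stand-in for locking the destination word
    };

    template <class Test>
    bool should_traverse(bool would_traverse_anyway,
                         VertexState& dest, Test&& test) {
        if (would_traverse_anyway) return true;
        std::lock_guard<std::mutex> guard(dest.lock);  // lock dest while testing
        return test(dest);    // the test may also replace fields of dest
    }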
29
Traceview Output for the Bully Algorithm
30
MTA-2 Scaling of Connected Components
[Scaling chart for a power-law (highly unstructured) graph; labeled run times include 5.41 s and 2.91 s.]
31
Computational Results: Subgraph Isomorphism
32
A Renaissance in Architecture
  • Bad news
  • Power considerations limit the improvement in
    clock speed
  • Good news
  • Moore's Law marches on
  • Real estate on a chip is no longer at a premium
  • On a processor, much of the area is already
    memory control
  • Only a tiny bit is computing (e.g. floating point)
  • The future is not like the past

33
Example: AMD Opteron
34
Example: AMD Opteron
[Annotated die diagram: latency-avoidance memory (L1 D-Cache, L1 I-Cache, L2 Cache).]
35
Example: AMD Opteron
[Annotated die diagram adds latency-tolerance logic: out-of-order execution, load/store unit, memory/coherency, I-fetch/scan/align, and the memory controller.]
36
Example: AMD Opteron
[Annotated die diagram adds the memory and I/O interfaces: bus, DDR, and HyperTransport (HT).]
37
Example: AMD Opteron
[Annotated die diagram adds the actual computing: FPU execution and integer execution. Only a small fraction of the die is the "COMPUTER."]
Thanks to Thomas Sterling
38
Consequences
  • Current response: stamp out more processors.
  • Multicore processors. Not very imaginative.
  • Makes life worse for most of us
  • Near future trends
  • Multithreading to tolerate latencies
  • MTA-like capability on commodity machines
  • Potentially big impact on data-centric
    applications
  • Further out
  • Application-specific circuitry
  • E.g. hashing, feature detection, etc.
  • Reconfigurable hardware?
  • Adapt circuits to the application at run time

39
Summary
  • Massive multithreading has great potential for
    data mining and learning
  • Software development is challenging
  • correctness
  • performance
  • Well designed infrastructure can hide many of
    these challenges
  • Once built, infrastructure enables high
    productivity
  • Potential to become mainstream. Stay tuned.

40
Acknowledgements
  • Jon Berry
  • Simon Kahan, Petr Konecny (Cray)
  • David Bader, Kamesh Madduri (Ga. Tech) (MTA s-t
    connectivity)
  • Will McClendon (MPI s-t connectivity)