Scalability for High Cardinality in Steerable MDS

1
Scalability for High Cardinality in Steerable MDS
  • CPSC 533C Project Presentation
  • Allan Rempel
  • December 19, 2005

2
Scalability
  • Ability to run on very large data sets
  • Both theoretical and practical efficiency
  • Both time and space efficiency
  • Effective use of resources: memory, disk, db
  • Avoid waste and thrashing
  • Cardinality and dimensionality

3
Review - multidimensional scaling (MDS)
  • Display multivariate abstract point data in 2D
  • Data from bioinformatics, financial sector, etc.
  • No inherent mapping in 2D space
  • p-dim embedding of q-dim space (p < q) where
    inter-object relationships are approximated in
    low-dimensional space
  • Proximity in high-D -> proximity in 2D
  • High-dim distance between points (similarity)
    determines relative (x,y) position
  • Absolute (x,y) positions are not meaningful
  • Want to see clusters, curves, etc.
  • Features that stand out from the noise
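
The point that only relative positions matter can be checked with a small sketch. The helper names below (`pairwise_distances`, `rotate_and_shift`) and the three sample points are illustrative assumptions, not part of MDSteer. The sketch rotates and shifts a 2D layout and confirms that every inter-point distance is unchanged, which is why two MDS layouts that differ only by such a motion are equivalent.

```python
import math

def pairwise_distances(points):
    """Return the Euclidean distance for every pair i < j."""
    return [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]

def rotate_and_shift(layout, angle, dx, dy):
    """Apply a rigid motion (rotation, then translation) to a 2D layout."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in layout]

# Hypothetical 2D layout produced by some MDS run.
layout = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
moved = rotate_and_shift(layout, math.pi / 3, 10.0, -2.0)

before = pairwise_distances(layout)
after = pairwise_distances(moved)

# Absolute coordinates changed, but the relationships did not.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after))
print(same)
```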

4
Review - multidimensional scaling (MDS)
  • Refinement of algorithms
  • O(N^3), O(N^2), O(N log N)
  • 1996, 2002, 2003, 2004
  • Chalmers, Morrison, Ross, Jourdan, Melançon
  • Progressive steerable MDS
  • Munzner et al, 2004, 2005

5
MDSteer
  • MDS visualization system developed by T. Munzner,
    M. Tory, D. Westrom, M. Williams
  • C++ with Qt GUI, OpenGL
  • Runs on Windows and Linux
  • Test platform: P-III, 256 MB, Linux

6
MDSteer

7
MDSteer limitations and proposal
  • Theoretically, handles 300 dimensions and
    1,000,000 points
  • Practically, limited by system memory
  • 100,000 points when calculating distances on the fly
  • 10,000 points with a precomputed distance matrix
  • 5000 points exhausted memory on the test machine
  • Big question: can we raise these limits by using
    a database? At what cost?
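
A rough back-of-the-envelope sketch shows why a precomputed distance matrix hits memory limits so quickly. It assumes a full N x N matrix of 4-byte floats; that layout is an assumption, although it is consistent with the "2000 points 16 MB, 5000 points 100 MB" figures quoted later in the deck. The function name `distance_matrix_bytes` is hypothetical.

```python
def distance_matrix_bytes(n_points, bytes_per_entry=4, symmetric=False):
    """Estimate storage for a pairwise distance matrix.

    Assumes 4-byte entries by default. With symmetric=True only the
    strict upper triangle (i < j) is counted.
    """
    if symmetric:
        entries = n_points * (n_points - 1) // 2
    else:
        entries = n_points * n_points
    return entries * bytes_per_entry

for n in (2_000, 5_000, 10_000, 100_000):
    mb = distance_matrix_bytes(n) / 1e6
    print(f"{n:>7} points: {mb:,.0f} MB")
```

Under these assumptions, 5000 points already need about 100 MB, a large share of the 256 MB on the test platform, and the requirement grows quadratically from there.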

8
Scaling issues
  • Time complexity matters - Chalmers
  • Time complexity doesn't matter - Jourdan
  • Lots of things matter
  • Actual vs. perceived speed: progressiveness
  • What are the limits of the hardware?

9
Experimental results: Chalmers et al.
  • 3-D data sets: 5000 to 50,000 points
  • 13-D data sets: 2000 to 24,000 points
  • Took less than 1/3 the time of the O(N^2) model
  • Achieved lower stress when done
  • Also compared against original O(N^3) model
  • 9 seconds vs. 577 and 24 vs. 3642
  • Achieved much lower stress (0.06 vs. 0.2)
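
For readers unfamiliar with the stress values quoted above, here is a minimal sketch of one common normalized stress measure: the sum of squared differences between high-dimensional and layout distances, divided by the sum of squared high-dimensional distances. The exact normalization used in the cited papers may differ, so treat this formula and the function name `stress` as assumptions for illustration.

```python
import math

def stress(high_points, layout):
    """Normalized stress: sum (h_ij - l_ij)^2 / sum h_ij^2 over pairs i < j.

    h_ij is the distance in the original space, l_ij the distance in the
    2D layout. 0.0 means the layout preserves every distance exactly.
    """
    num = 0.0
    den = 0.0
    n = len(high_points)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(high_points[i], high_points[j])
            l = math.dist(layout[i], layout[j])
            num += (h - l) ** 2
            den += h ** 2
    return num / den if den > 0 else 0.0

# Three 3-D points that lie in a plane, so a perfect 2D layout exists.
high = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
perfect = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
collapsed = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]

print(stress(high, perfect))    # layout keeps all distances
print(stress(high, collapsed))  # layout loses all distance information
```

Lower is better: a layout that preserves distances scores near zero, while one that collapses all points onto each other scores 1 under this normalization.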

10
Experimental results: Chalmers et al.

11
Comparison: Jourdan & Melançon
  • Chalmers et al. is better for N < 5500
  • Main diff is in parent-finding, represented by
    Fig. 3

12
Comparison: Jourdan & Melançon
  • Experimental study confirms theoretical results
  • This technique becomes better for N > 70,000

13
Multiscale MDS: Jourdan & Melançon
  • Recursively defining the initial kernel set of
    points can yield much better real-time
    performance
  • Same time complexity, but much faster execution

14
Experimental results
  • Memory usage with/without 5000 node MDS run

15
Offloading distance matrix
  • Step 1: Write to file instead of database
  • Database uses files anyway; saves db overhead
  • Use /tmp for high speed/capacity: 3 GB available
  • Similar speed and memory usage as before
  • Probably because file is being cached
  • Probably a similar scenario to memory being swapped
  • Until 2000 of 5000 points placed; then thrashing
  • 2000 points: 16 MB; 5000 points: 100 MB
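
The file-based approach can be sketched as follows. This is not the MDSteer class (which is not shown in the slides); `FileDistanceMatrix` is a hypothetical stand-in that stores a full N x N matrix of 4-byte floats in a flat file and seeks to each entry on demand. The entry size is an assumption chosen to match the 16 MB figure for 2000 points.

```python
import os
import struct
import tempfile

class FileDistanceMatrix:
    """Hypothetical N x N matrix of 4-byte floats kept in a flat file."""

    ENTRY = struct.Struct("<f")  # little-endian 32-bit float

    def __init__(self, path, n):
        self.n = n
        self.f = open(path, "w+b")
        # Reserve the full file size up front.
        self.f.truncate(n * n * self.ENTRY.size)

    def _offset(self, i, j):
        return (i * self.n + j) * self.ENTRY.size

    def set(self, i, j, value):
        """Store a distance symmetrically at (i, j) and (j, i)."""
        packed = self.ENTRY.pack(value)
        for r, c in ((i, j), (j, i)):
            self.f.seek(self._offset(r, c))
            self.f.write(packed)

    def get(self, i, j):
        self.f.seek(self._offset(i, j))
        return self.ENTRY.unpack(self.f.read(self.ENTRY.size))[0]

    def close(self):
        self.f.close()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dist.bin")
    m = FileDistanceMatrix(path, 2000)
    m.set(3, 1999, 2.5)
    value = m.get(1999, 3)
    size_bytes = os.path.getsize(path)
    m.close()

print(value, size_bytes)
```

Reads and writes here still pass through the operating system's file cache, which fits the slide's guess that the file was being cached and that memory pressure simply moved rather than disappeared.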

16
Offloading distance matrix
  • Step 2: Write to database
  • Offload some work to database server
  • Tried to connect to db over network
  • Tried to run MDSteer on db server
  • Tried to modify MDSteer for MDS only, no GUI
  • Suspect that database overhead, network comm.,
    NFS vs. local disks would cause significant
    slowdown
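
A database-backed matrix could look roughly like the sketch below. The slides do not name the database or schema used, so everything here is an assumption: Python's built-in `sqlite3` stands in for the database server, and the `dist` table and `DbDistanceMatrix` class are hypothetical. Because SQLite runs in-process, this sketch does not reproduce the network or NFS overhead the slide is concerned about.

```python
import sqlite3

class DbDistanceMatrix:
    """Hypothetical distance store: one row per unordered pair (i <= j)."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dist ("
            " i INTEGER NOT NULL, j INTEGER NOT NULL, d REAL NOT NULL,"
            " PRIMARY KEY (i, j))"
        )

    @staticmethod
    def _key(i, j):
        # Distances are symmetric, so store each pair once.
        return (i, j) if i <= j else (j, i)

    def set(self, i, j, d):
        lo, hi = self._key(i, j)
        self.conn.execute(
            "INSERT OR REPLACE INTO dist (i, j, d) VALUES (?, ?, ?)",
            (lo, hi, d),
        )

    def get(self, i, j):
        lo, hi = self._key(i, j)
        row = self.conn.execute(
            "SELECT d FROM dist WHERE i = ? AND j = ?", (lo, hi)
        ).fetchone()
        return None if row is None else row[0]

conn = sqlite3.connect(":memory:")
m = DbDistanceMatrix(conn)
m.set(7, 2, 1.25)
found = m.get(2, 7)      # same pair, opposite order
missing = m.get(0, 1)    # never stored
conn.close()

print(found, missing)
```

Each lookup is a full query, so per-entry cost is much higher than an array access; that trade-off is the "at what cost?" the deck raises.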

17
MDSteer code modifications
  • Build and run on Linux
  • Fixed some bugs
  • Class to handle file-based matrix
  • Class to handle database-based matrix
  • Restructured code to accommodate modes
  • Partial separation of GUI from MDS back end

18
Future work
  • Use .ui files for GUI specification
  • Separate MDS from GUI functionality
  • What does MDSteer do if the user is not interactive?
  • Continue examination of database use
  • Library: provide hooks for use in other software
  • Do not require recompile for different modes
  • Clean compile w/o warnings
  • Cross-platform with single code base