Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

1
Algebraic Techniques for Analysis of Large
Discrete-Valued Datasets
  • Mehmet Koyuturk (1), Ananth Grama (1), and Naren
    Ramakrishnan (2)
  • 1. Dept. of Computer Sciences, Purdue University
  • {koyuturk, ayg}@cs.purdue.edu
  • 2. Dept. of Computer Sciences, Virginia Tech
  • naren@cs.vt.edu

2
Motivation
  • Handling large discrete-valued datasets
  • Extracting relations between data items
  • Summarizing data in an error-bounded fashion
  • Clustering of data items
  • Finding concise representations for clustered
    data

3
Background
  • Singular Value Decomposition (SVD) [Berry et al.,
    1995]
  • Decompose matrix into A = U S V^T
  • U and V orthogonal matrices, S diagonal with
    singular values
  • Used for Latent Semantic Indexing in Information
    Retrieval
  • Truncate decomposition to compress data
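
To illustrate truncation, here is a minimal sketch that keeps only the leading singular triplet, so that A ≈ sigma * u * v^T. It uses plain power iteration rather than a full SVD routine, and it assumes a small, dense, nonnegative, nonzero matrix; the function name and the iteration count are illustrative and do not come from the slides.

```python
import math

def top_singular_triplet(A, iters=100):
    """Power iteration on A^T A; returns (u, sigma, v) with A ~ sigma * u * v^T.

    Assumption: A is a nonnegative, nonzero matrix given as a list of rows,
    so the all-ones start vector is not orthogonal to the leading direction.
    """
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # One step of v <- normalize(A^T (A v))
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(t * t for t in u))
    u = [t / sigma for t in u]
    return u, sigma, v

A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
u, sigma, v = top_singular_triplet(A)
# Rank-1 truncation: keeps the dominant 2x2 block and drops the weaker entry.
A1 = [[sigma * u[i] * v[j] for j in range(3)] for i in range(3)]
```

Storing u, sigma and v takes m + n + 1 numbers instead of m * n, which is the compression the truncation buys.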

4
Background
  • Semi-Discrete Decomposition (SDD) [Kolda and
    O'Leary, 1998]
  • Restrict entries of U and V to {-1, 0, 1}
  • Requires a very small amount of storage
  • Can perform as well as SVD in LSI using less than
    one-tenth the storage
  • Effective in finding outlier clusters
  • Works well for datasets containing a large number
    of small clusters
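
The storage argument can be made concrete with a rough count. The sketch below assumes 64-bit floats for SVD factors and 2 bits per entry for a value in {-1, 0, 1}; both figures and both function names are assumptions for illustration, not the encoding used by Kolda and O'Leary.

```python
def svd_storage_bits(m, n, k, float_bits=64):
    """Rank-k SVD: k left vectors (m floats), k right vectors (n floats), k singular values."""
    return k * (m + n + 1) * float_bits

def sdd_storage_bits(m, n, k, float_bits=64):
    """k-term SDD: 2 bits per ternary entry of the vectors, plus one float weight per term."""
    return k * (2 * (m + n) + float_bits)

svd_bits = svd_storage_bits(1000, 500, 10)   # 960640 bits
sdd_bits = sdd_storage_bits(1000, 500, 10)   # 30640 bits
```

At equal k the discrete factors are far smaller, so an SDD can afford many more terms than an SVD within the same storage budget.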

5
Rank-1 Approximation
x: presence vector, y: pattern vector
6
Rank-1 Approximation
Iteratively solve for x and y until no
improvement is possible
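
The slide's equations did not survive in this transcript, so the following is a sketch under a stated assumption: A, x and y are binary and the objective is the number of entries where A and x y^T disagree. Under that assumption, for a fixed y the best choice is x(i) = 1 exactly when row i shares at least half of y's nonzeros, and symmetrically for y. Rows are stored as lists of nonzero column indices; all function names are illustrative.

```python
def solve_presence(rows, y):
    """Fixed y: x(i) = 1 iff 2 * |row_i AND y| >= |y|."""
    size_y = sum(y)
    return [1 if 2 * sum(y[j] for j in row) >= size_y else 0 for row in rows]

def solve_pattern(rows, x, n):
    """Fixed x: y(j) = 1 iff column j is nonzero in at least half of the selected rows."""
    size_x = sum(x)
    counts = [0] * n
    for i, row in enumerate(rows):
        if x[i]:
            for j in row:
                counts[j] += 1
    return [1 if 2 * c >= size_x else 0 for c in counts]

def mismatches(rows, x, y):
    """Number of entries where A and x y^T differ."""
    size_y = sum(y)
    err = 0
    for i, row in enumerate(rows):
        overlap = sum(y[j] for j in row)
        err += (len(row) - overlap) + (size_y - overlap) if x[i] else len(row)
    return err

def rank_one_approximation(rows, n, y):
    """Alternate between x and y until the mismatch count stops decreasing."""
    x = solve_presence(rows, y)
    best = mismatches(rows, x, y)
    while True:
        y_new = solve_pattern(rows, x, n)
        x_new = solve_presence(rows, y_new)
        err = mismatches(rows, x_new, y_new)
        if err >= best:
            return x, y
        x, y, best = x_new, y_new, err

# Five rows over five columns; the start vector selects column 0 only.
rows = [[0, 1, 2], [0, 1, 2], [0, 1], [3, 4], [3, 4]]
x, y = rank_one_approximation(rows, 5, [1, 0, 0, 0, 0])
```

Every pass touches each nonzero a constant number of times, so one pass costs time proportional to nz(A).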
7
Initialization of pattern vector
  • Crucial to escape from local optima
  • Must require at most O(nz(A)) time, not to
  • Some possible schemes:
  • AllOnes: Set all entries to 1. Poor.
  • Threshold: Set only the entries whose
    corresponding columns have more non-zeros
    than a threshold. Can lead to bad local optima.
  • Maximum: Set only the entry that corresponds to
    the column with the max. number of non-zeros.
    Risky: that column may be shared by lots of patterns.
  • Partition: Partition the rows of the matrix based on
    a column, then apply the threshold scheme taking into
    account only one of the parts. Best among these.
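
The four schemes above can be sketched as follows. Rows are lists of nonzero column indices, and each scheme visits every nonzero at most a constant number of times. The function names, the threshold parameter t, and the choice to keep the rows that contain the splitting column in the Partition scheme are assumptions made for illustration; the slides do not specify them.

```python
def column_counts(rows, n):
    """Number of non-zeros in each of the n columns."""
    counts = [0] * n
    for row in rows:
        for j in row:
            counts[j] += 1
    return counts

def init_all_ones(rows, n):
    return [1] * n

def init_threshold(rows, n, t):
    """Set entries whose columns have more than t non-zeros."""
    return [1 if c > t else 0 for c in column_counts(rows, n)]

def init_maximum(rows, n):
    """Set only the entry of the densest column (first one on ties)."""
    counts = column_counts(rows, n)
    j_max = counts.index(max(counts))
    return [1 if j == j_max else 0 for j in range(n)]

def init_partition(rows, n, col, t):
    """Keep the rows containing column `col`, then threshold within that part."""
    part = [row for row in rows if col in row]
    return init_threshold(part, n, t)

rows = [[0, 1, 2], [0, 1, 2], [0, 1], [3, 4], [3, 4]]
y_thr = init_threshold(rows, 5, 2)
y_max = init_maximum(rows, 5)
y_par = init_partition(rows, 5, 3, 1)
```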

8
Recursive Algorithm
- if x(i) = 1, row i goes to A1
9
Recursive Algorithm
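
The recursion can be sketched as below. Only the rule "if x(i) = 1, row i goes to A1" comes from the slides. The name A0 for the remaining rows, the stopping rule (stop when a split is trivial or a depth limit is reached), the Maximum-style initialization, and all function names are assumptions made so the example is self-contained.

```python
def rank_one(rows, n, max_iters=100):
    """Hypothetical helper: binary rank-1 approximation, started from the densest column."""
    counts = [0] * n
    for row in rows:
        for j in row:
            counts[j] += 1
    y = [0] * n
    y[counts.index(max(counts))] = 1
    x = None
    for _ in range(max_iters):
        size_y = sum(y)
        x_new = [1 if 2 * sum(y[j] for j in row) >= size_y else 0 for row in rows]
        if x_new == x:          # no change in x: nothing left to improve
            break
        x = x_new
        size_x = sum(x)
        counts = [0] * n
        for i, row in enumerate(rows):
            if x[i]:
                for j in row:
                    counts[j] += 1
        y = [1 if 2 * c >= size_x else 0 for c in counts]
    return x, y

def decompose(rows, n, ids=None, depth=0, max_depth=10):
    """Return a list of (pattern vector, original row ids) leaves."""
    if ids is None:
        ids = list(range(len(rows)))
    x, y = rank_one(rows, n)
    a1 = [k for k in range(len(rows)) if x[k] == 1]   # rows where the pattern is present
    a0 = [k for k in range(len(rows)) if x[k] == 0]   # remaining rows
    # Assumed stopping rule: trivial split or depth limit.
    if not a1 or not a0 or depth >= max_depth:
        return [(y, ids)]
    leaves = []
    for part in (a1, a0):
        leaves += decompose([rows[k] for k in part], n,
                            [ids[k] for k in part], depth + 1, max_depth)
    return leaves

rows = [[0, 1, 2], [0, 1, 2], [0, 1], [3, 4], [3, 4]]
leaves = decompose(rows, 5)
```

Each leaf pairs one pattern vector with the rows it represents, which gives the hierarchical summary of the data.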
10
Effectiveness of Analysis
11
Effectiveness of Analysis
12
Run-time Scalability
  • Rank-1 approximation requires O(nz(A)) time
  • Total run-time at each level in the recursive
    tree cannot exceed this, since the total number
    of nonzeros at each level is at most nz(A)
  • ⇒ Run-time is linear in nz(A)
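
The per-level bound can be written out as a short derivation. The symbols c (a constant bounding the cost per nonzero of one rank-1 approximation), A_1, ..., A_p (the submatrices at one level of the tree) and d (the depth of the tree) are introduced here for illustration and do not appear in the slides.

```latex
% Assumptions: one rank-1 approximation on A_j costs at most c * nz(A_j);
% the submatrices at a level hold disjoint sets of rows of A.
\begin{align*}
\sum_{j=1}^{p} \operatorname{nz}(A_j) &\le \operatorname{nz}(A)
  && \text{(rows are split, never duplicated)} \\
T_{\text{level}} = \sum_{j=1}^{p} c \,\operatorname{nz}(A_j) &\le c \,\operatorname{nz}(A) \\
T_{\text{total}} &\le c \, d \,\operatorname{nz}(A)
\end{align*}
```

Under these assumptions the total is linear in nz(A) whenever the depth d is bounded by a constant.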

13
Conclusions and Ongoing Work
  • Proposed algorithm is
  • Scalable to extremely high dimensions
  • Effective in discovering dominant patterns
  • Hierarchical in nature, allowing multi-resolution
    analysis
  • Currently working on
  • Real-world applications of proposed method
  • Effective initialization schemes
  • Parallel implementation