1
A Clustering Method Based on Nonnegative Matrix
Factorization for Text Mining
  • Farial Shahnaz

2
Topics
  • Introduction
  • Algorithm
  • Performance
  • Observation
  • Conclusion and Future Work

3
Introduction
4
Basic Concepts
  • Text Mining: detection of trends or patterns in text data
  • Clustering: grouping or classifying documents based on similarity of content

5
Clustering
  • Manual vs. Automated
  • Supervised vs. Unsupervised
  • Hierarchical vs. Partitional

6
Clustering
  • Objective: automated, unsupervised, partitional clustering of text data (documents)
  • Method: Nonnegative Matrix Factorization (NMF)

7
Vector Space Model of Text Data
  • Documents represented as n-dimensional vectors
  • n = number of terms in the dictionary
  • vector component = importance of the corresponding term
  • Document collection represented as a term-by-document matrix

8
Term-by-Document Matrix
  • Terms in the dictionary, n = 9 (a, brown, dog, fox, jumped, lazy, over, quick, the)
  • Document 1: "a quick brown fox"
  • Document 2: "jumped over the lazy dog"

9
Term-by-Document Matrix
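The matrix pictured on this slide can be rebuilt from the example on the previous slide. A minimal Python sketch (scikit-learn assumed; the token pattern is widened so the single-character term "a" is not dropped by the default tokenizer):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["a quick brown fox", "jumped over the lazy dog"]
    # the default token pattern drops 1-character tokens such as "a"
    vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    # fit_transform yields document-by-term; transpose to term-by-document
    V = vec.fit_transform(docs).toarray().T
    print(vec.get_feature_names_out())
    # ['a' 'brown' 'dog' 'fox' 'jumped' 'lazy' 'over' 'quick' 'the']
    print(V)  # 9 terms x 2 documents; entries are term counts

Each of the nine rows corresponds to a dictionary term and each of the two columns to a document, so V is the 9 × 2 term-by-document matrix with a 1 wherever a term occurs in a document.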
10
Clustering Method: NMF
  • Low rank approximation of large sparse matrices
  • Preserves data nonnegativity
  • Introduces the concept of parts-based
    representation (by Lee and Seung in Nature, 1999)

11
Other Methods
  • Other rank-reduction methods:
  • Principal Component Analysis (PCA)
  • Vector Quantization (VQ)
  • Produce basis vectors with negative entries
  • Additive and subtractive combinations of basis vectors yield the original document vectors

12
NMF
  • Produces nonnegative basis vectors
  • Additive combinations of basis vectors yield the original document vectors

13
Term-by-Document Matrix (all entries nonnegative)
14
NMF
  • Basis vectors interpreted as semantic features or
    topics
  • Documents clustered on the basis of shared
    features

15
NMF
  • Demonstrated by Xu et al. (2003)
  • Outperforms Singular Value Decomposition (SVD)
  • Comparable to graph-partitioning methods

16
Algorithm
17
NMF: Definition
  • Given:
  • S = document collection
  • V (m × n) = term-by-document matrix
  • m = number of terms in the dictionary
  • n = number of documents in S

18
NMF: Definition
  • NMF is defined as:
  • A low-rank approximation of V (m × n) with respect to some metric
  • Factor V into the product WH
  • W (m × k): contains the basis vectors
  • H (k × n): contains the linear combinations
  • k = selected number of topics or basis vectors, k << min(m, n)

19
NMF: Common Approach
  • Minimize an objective function (shown below)
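The objective referenced here is the standard Frobenius-norm reconstruction error, minimized subject to nonnegativity:

    min_{W >= 0, H >= 0}  || V - WH ||_F^2

That is, choose nonnegative factors W and H whose product approximates V as closely as possible.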

20
NMF: Existing Methods
  • Multiplicative Method (MM) by Lee and Seung
  • Based on multiplicative update rules (sketched below)
  • ||V - WH|| is monotonically nonincreasing, and constant iff W and H are at a stationary point
  • A version of the Gradient Descent (GD) optimization scheme
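For reference, the Lee–Seung multiplicative update rules for the Frobenius-norm objective are (multiplication and division taken elementwise):

    H <- H * (W^T V) / (W^T W H)
    W <- W * (V H^T) / (W H H^T)

Each update keeps both factors nonnegative, and under these rules ||V - WH|| never increases, which is the monotonicity property cited above.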

21
NMF: Existing Methods
  • Sparse Encoding by Hoyer
  • Based on a study of neural networks
  • Enforces statistical sparsity of H
  • Minimizes the sum of nonzeros in H

22
NMF: Existing Methods
  • Sparse Encoding by Mu, Plemmons, and Santago
  • Similar to Hoyer's method
  • Enforces statistical sparsity of H using a regularization parameter
  • Minimizes the number of nonzeros in H

23
NMF: Proposed Algorithm
  • Hybrid method:
  • W approximated using the Multiplicative Method
  • H calculated using a Constrained Least Squares (CLS) model as the metric
  • Penalizes the number of nonzeros
  • Similar to the method of Mu, Plemmons, and Santago
  • Called GD-CLS

24
GD-CLS
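A minimal NumPy sketch of one GD-CLS iteration as described on the previous slide (a sketch under those assumptions, not the paper's exact listing; names such as gd_cls and lam are illustrative, with lam standing in for the regularization parameter λ):

    import numpy as np

    def gd_cls(V, k, lam=0.01, n_iter=100, seed=0):
        # V: m x n nonnegative term-by-document matrix
        m, n = V.shape
        rng = np.random.default_rng(seed)
        W = rng.random((m, k))   # basis vectors (topics)
        H = rng.random((k, n))   # linear combinations
        eps = 1e-9               # guards against division by zero
        for _ in range(n_iter):
            # W step: one Lee-Seung multiplicative update
            W *= (V @ H.T) / (W @ H @ H.T + eps)
            # H step: constrained least squares; solve
            # (W^T W + lam * I) H = W^T V, then clip negatives to 0
            A = W.T @ W + lam * np.eye(k)
            H = np.linalg.solve(A, W.T @ V)
            H[H < 0] = 0
        return W, H

Documents can then be clustered by their dominant feature: document j is assigned to the basis vector with the largest entry in column j of H, e.g. labels = H.argmax(axis=0).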
25
Performance
26
Text Collections Used
  • Two benchmark topic-detection text collections:
  • Reuters: collection of documents on assorted topics
  • TDT2: transcripts from news media

27
Text Collections Used
28
Accuracy Metric
  • Defined by: AC = (1/n) Σ_{i=1..n} δ(d_i)
  • d_i = document number i
  • δ(d_i) = 1 if the topic labels match, 0 otherwise
  • k = 2, 4, 6, 8, 10, 15, 20
  • λ = 0.1, 0.01, 0.001
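A minimal sketch of the AC computation, assuming each document's computed cluster label has already been mapped onto a topic label (the helper name is hypothetical):

    import numpy as np

    def accuracy(pred_labels, true_labels):
        # AC = (1/n) * sum over i of delta(d_i);
        # delta(d_i) = 1 when the labels for document i match
        pred = np.asarray(pred_labels)
        true = np.asarray(true_labels)
        return (pred == true).mean()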

29
Results for Reuters and for TDT2 (accuracy charts)
30
Observations
31
Observations: AC
  • AC is inversely proportional to k
  • The nature of the collection affects AC
  • Reuters topics: earn, interest, cocoa
  • TDT2 topics: Asian economic crisis, Oprah lawsuit

32
Observations: λ parameter
  • AC declines as λ increases (mostly effective for homogeneous text collections)
  • CPU time declines as λ increases

33
Observations: Cluster size
  • Imbalance in cluster sizes has an adverse effect on AC

34
Conclusion and Future Work
  • GD-CLS can be used to effectively cluster text data. Further development involves:
  • Smart updating
  • Use in bioinformatics
  • Developing a user interface
  • Converting to C