1
A Clustering Method Based on Nonnegative Matrix
Factorization for Text Mining
  • Farial Shahnaz

2
Topics
  • Introduction
  • Algorithm
  • Performance
  • Observation
  • Conclusion and Future Work

3
Introduction
4
Basic Concepts
  • Text Mining: detection of trends or patterns in text data
  • Clustering: grouping or classifying documents based on similarity of content

5
Clustering
  • Manual vs. Automated
  • Supervised vs. Unsupervised
  • Hierarchical vs. Partitional

6
Clustering
  • Objective: automated, unsupervised, partitional clustering of text data (documents)
  • Method: Nonnegative Matrix Factorization (NMF)

7
Vector Space Model of Text Data
  • Documents represented as n-dimensional vectors
  • n = number of terms in the dictionary
  • vector component = importance of the corresponding term
  • Document collection represented as a term-by-document matrix

8
Term-by-Document Matrix
  • Terms in the dictionary, n = 9 (a, brown, dog, fox, jumped, lazy, over, quick, the)
  • Document 1: "a quick brown fox"
  • Document 2: "jumped over the lazy dog"

9
Term-by-Document Matrix
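The matrix pictured on this slide can be rebuilt from the example on the previous slide. A minimal Python sketch (scikit-learn assumed; the token pattern is widened so the single-character term "a" is not dropped by the default tokenizer):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["a quick brown fox", "jumped over the lazy dog"]
    # the default token pattern drops 1-character tokens such as "a"
    vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    # fit_transform yields document-by-term; transpose to term-by-document
    V = vec.fit_transform(docs).toarray().T
    print(vec.get_feature_names_out())
    # ['a' 'brown' 'dog' 'fox' 'jumped' 'lazy' 'over' 'quick' 'the']
    print(V)  # 9 terms x 2 documents; entries are term counts

Each of the nine rows corresponds to a dictionary term and each of the two columns to a document, so V is the 9 × 2 term-by-document matrix with a 1 wherever a term occurs in a document.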
10
Clustering Method: NMF
  • Low rank approximation of large sparse matrices
  • Preserves data nonnegativity
  • Introduces the concept of parts-based
    representation (by Lee and Seung in Nature, 1999)

11
Other Methods
  • Other rank-reduction methods:
  • Principal Component Analysis (PCA)
  • Vector Quantization (VQ)
  • Produce basis vectors with negative entries
  • Additive and subtractive combinations of basis vectors yield the original document vectors

12
NMF
  • Produces nonnegative basis vectors
  • Additive combinations of basis vectors yield the original document vectors

13
Term-by-Document Matrix (all entries nonnegative)
14
NMF
  • Basis vectors interpreted as semantic features or
    topics
  • Documents clustered on the basis of shared
    features

15
NMF
  • Demonstrated by Xu et al. (2003)
  • Outperforms Singular Value Decomposition (SVD)
  • Comparable to graph-partitioning methods

16
Algorithm
17
NMF: Definition
  • Given:
  • S = document collection
  • V (m × n) = term-by-document matrix
  • m = number of terms in the dictionary
  • n = number of documents in S

18
NMF: Definition
  • NMF is defined as:
  • A low-rank approximation of V (m × n) with respect to some metric
  • Factor V into the product WH
  • W (m × k): contains the basis vectors
  • H (k × n): contains the linear combinations
  • k = selected number of topics or basis vectors, k << min(m, n)

19
NMF: Common Approach
  • Minimize an objective function (shown below)
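The objective referenced here is the standard Frobenius-norm reconstruction error, minimized subject to nonnegativity:

    min_{W >= 0, H >= 0}  || V - WH ||_F^2

That is, choose nonnegative factors W and H whose product approximates V as closely as possible.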

20
NMF: Existing Methods
  • Multiplicative Method (MM) by Lee and Seung
  • Based on multiplicative update rules (sketched below)
  • ||V - WH|| is monotonically nonincreasing, and constant iff W and H are at a stationary point
  • A version of the Gradient Descent (GD) optimization scheme
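For reference, the Lee–Seung multiplicative update rules for the Frobenius-norm objective are (multiplication and division taken elementwise):

    H <- H * (W^T V) / (W^T W H)
    W <- W * (V H^T) / (W H H^T)

Each update keeps both factors nonnegative, and under these rules ||V - WH|| never increases, which is the monotonicity property cited above.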

21
NMF: Existing Methods
  • Sparse Encoding by Hoyer
  • Based on a study of neural networks
  • Enforces statistical sparsity of H
  • Minimizes the sum of nonzeros in H

22
NMF: Existing Methods
  • Sparse Encoding by Mu, Plemmons, and Santago
  • Similar to Hoyer's method
  • Enforces statistical sparsity of H using a regularization parameter
  • Minimizes the number of nonzeros in H

23
NMF: Proposed Algorithm
  • Hybrid method:
  • W approximated using the Multiplicative Method
  • H calculated using a Constrained Least Squares (CLS) model as the metric
  • Penalizes the number of nonzeros
  • Similar to the method of Mu, Plemmons, and Santago
  • Called GD-CLS

24
GD-CLS
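A minimal NumPy sketch of one GD-CLS iteration as described on the previous slide (a sketch under those assumptions, not the paper's exact listing; names such as gd_cls and lam are illustrative, with lam standing in for the regularization parameter λ):

    import numpy as np

    def gd_cls(V, k, lam=0.01, n_iter=100, seed=0):
        # V: m x n nonnegative term-by-document matrix
        m, n = V.shape
        rng = np.random.default_rng(seed)
        W = rng.random((m, k))   # basis vectors (topics)
        H = rng.random((k, n))   # linear combinations
        eps = 1e-9               # guards against division by zero
        for _ in range(n_iter):
            # W step: one Lee-Seung multiplicative update
            W *= (V @ H.T) / (W @ H @ H.T + eps)
            # H step: constrained least squares; solve
            # (W^T W + lam * I) H = W^T V, then clip negatives to 0
            A = W.T @ W + lam * np.eye(k)
            H = np.linalg.solve(A, W.T @ V)
            H[H < 0] = 0
        return W, H

Documents can then be clustered by their dominant feature: document j is assigned to the basis vector with the largest entry in column j of H, e.g. labels = H.argmax(axis=0).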
25
Performance
26
Text Collections Used
  • Two benchmark topic-detection text collections:
  • Reuters: collection of documents on assorted topics
  • TDT2: transcripts from news media

27
Text Collections Used
28
Accuracy Metric
  • Defined by: AC = (1/n) Σ_{i=1..n} δ(d_i)
  • d_i = document number i
  • δ(d_i) = 1 if the topic labels match, 0 otherwise
  • k = 2, 4, 6, 8, 10, 15, 20
  • λ = 0.1, 0.01, 0.001
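A minimal sketch of the AC computation, assuming each document's computed cluster label has already been mapped onto a topic label (the helper name is hypothetical):

    import numpy as np

    def accuracy(pred_labels, true_labels):
        # AC = (1/n) * sum over i of delta(d_i);
        # delta(d_i) = 1 when the labels for document i match
        pred = np.asarray(pred_labels)
        true = np.asarray(true_labels)
        return (pred == true).mean()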

29
Results for Reuters and for TDT2 (accuracy charts)
30
Observations
31
Observations: AC
  • AC is inversely proportional to k
  • The nature of the collection affects AC
  • Reuters topics: earn, interest, cocoa
  • TDT2 topics: Asian economic crisis, Oprah lawsuit

32
Observations: λ parameter
  • AC declines as λ increases (mostly effective for homogeneous text collections)
  • CPU time declines as λ increases

33
Observations: Cluster size
  • Imbalance in cluster sizes has an adverse effect on AC

34
Conclusion and Future Work
  • GD-CLS can be used to effectively cluster text data. Further development involves:
  • Smart updating
  • Use in bioinformatics
  • Developing a user interface
  • Converting to C