Title: Feature selection using a hybrid method for text clustering
1. Feature selection using a hybrid method for text clustering
2. Outline
- Introduction
- Feature selection (supervised/unsupervised)
- Association mining and frequent itemsets
- Locally adaptive clustering method
- Objective and problem definition
- Experimental design and result analysis
- Conclusions
3. Introduction
- Text clustering groups similar documents together.
- The general bag-of-words representation raises one severe problem: the high dimensionality of the feature space and the inherent data sparsity.
4. Introduction
- Clustering in such high-dimensional spaces presents tremendous difficulty, and the performance of clustering algorithms declines dramatically, much more than in predictive learning such as decision trees or Naive Bayes classification.
- Introducing irrelevant features hurts the clustering result.
- Similarity measures suffer from the curse of dimensionality.
5. Feature selection
- Feature selection is a process that chooses a subset of the original feature set according to some criteria.
- Depending on whether class label information is required, feature selection can be either supervised or unsupervised.
6. Supervised feature selection
- Information gain: measures the information obtained for category prediction by the presence or absence of a term in a document.
- χ² statistic (CHI): measures the association between a term and a category (see the sketch below).
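A minimal sketch of how the CHI score can be computed from the 2x2 term/category contingency table; the four counts A, B, C, D follow the usual convention in Yang and Pedersen's survey, and the function name is illustrative:

```python
def chi_square(A, B, C, D):
    """CHI score of a term t for a category c from the 2x2 contingency table.

    A: # docs in c containing t          B: # docs outside c containing t
    C: # docs in c without t             D: # docs outside c without t
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```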
7. Unsupervised feature selection
- Document frequency
- Term strength: computed from the conditional probability that a term occurs in the second half of a pair of related documents given that it occurs in the first half.
- Entropy-based ranking: a term is measured by the entropy reduction when it is removed (a small sketch follows). The entropy is defined as

E = -\sum_{i=1}^{M} \sum_{j=1}^{M} \left( S_{ij} \log S_{ij} + (1 - S_{ij}) \log(1 - S_{ij}) \right), \quad S_{ij} = e^{-\alpha d_{ij}}

where S_ij is the similarity between documents i and j, and d_ij their distance.
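A sketch of this ranking, assuming the Dash-Liu style similarity S_ij = exp(-α·d_ij) with α chosen so the mean-distance pair has similarity 0.5; all names are illustrative:

```python
import numpy as np

def dataset_entropy(X):
    """Entropy of dataset X (rows = documents), using pairwise
    similarities S_ij = exp(-alpha * d_ij)."""
    n = X.shape[0]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    iu = np.triu_indices(n, k=1)              # each pair once, i < j
    alpha = -np.log(0.5) / d[iu].mean()
    s = np.clip(np.exp(-alpha * d[iu]), 1e-12, 1 - 1e-12)
    return -np.sum(s * np.log2(s) + (1 - s) * np.log2(1 - s))

def rank_by_entropy_reduction(X):
    """Score feature f by the entropy reduction when f is removed,
    as the slide describes, and rank features by that score."""
    base = dataset_entropy(X)
    scores = {f: base - dataset_entropy(np.delete(X, f, axis=1))
              for f in range(X.shape[1])}
    return sorted(scores, key=scores.get, reverse=True)
```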
8. Association rule mining
- Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.
9. Frequent itemsets
- Itemset
  - A collection of one or more items
  - Example: {Milk, Bread, Diaper}
- k-itemset
  - An itemset that contains k items
- Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s)
  - Fraction of transactions that contain an itemset
  - E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset
  - An itemset whose support is greater than or equal to a minsup threshold (a brute-force sketch follows below)
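A brute-force sketch of these definitions, using the classic five-basket example that matches the slide's numbers (σ({Milk, Bread, Diaper}) = 2, s = 2/5); a real implementation would use Apriori or FP-growth:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Brute-force mining: return {itemset: support} for every itemset
    whose support >= minsup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            support = sum(set(cand) <= t for t in transactions) / n
            if support >= minsup:
                frequent[cand] = support
                found = True
        if not found:   # Apriori property: no superset can be frequent either
            break
    return frequent

# The classic five-basket example consistent with the slide's numbers:
baskets = [{"Bread", "Milk"},
           {"Bread", "Diaper", "Beer", "Eggs"},
           {"Milk", "Diaper", "Beer", "Coke"},
           {"Bread", "Milk", "Diaper", "Beer"},
           {"Bread", "Milk", "Diaper", "Coke"}]
result = frequent_itemsets(baskets, minsup=0.4)
print(result[("Bread", "Diaper", "Milk")])   # 0.4, i.e. sigma = 2 out of 5
```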
10. Applying association rules in text mining
- Scenario 1
  - baskets = documents
  - items = words in documents
  - frequent word-groups → linked concepts
- Scenario 2
  - baskets = web pages
  - items = outgoing links
  - pages with similar references → about the same topic
11. Co-clustering
- To capture the local correlations of the data, a proper feature selection procedure should operate locally.
- A local operation would allow embedding different distance measures in different regions.
- Definition: simultaneous clustering of both the row and column sets of a data matrix.
12. Locally Adaptive Clustering (LAC)
- Weighted cluster: a subset of data points, together with a weight vector, such that the points are closely clustered according to the corresponding weighted Euclidean distance.
- Objective: find cluster centroids and weight vectors.
- Proposed by Professor Carlotta Domeniconi et al.
13. Locally adaptive clustering
[Figure: two clusters in the x-y plane; cluster 1 has w1x > w1y and cluster 2 has w2y > w2x, i.e., each cluster puts the larger weight on the dimension along which its points are tightest]
14. Locally Adaptive Clustering (LAC)
Minimize

E(C, W) = \sum_{j=1}^{k} \sum_{i=1}^{N} \left( w_{ji} X_{ji} + h \, w_{ji} \log w_{ji} \right)

Subject to

\sum_{i=1}^{N} w_{ji} = 1, \quad w_{ji} \ge 0, \quad j = 1, \dots, k

Where

X_{ji} = \frac{1}{|S_j|} \sum_{x \in S_j} (c_{ji} - x_i)^2

(a sketch of the resulting updates follows).
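A minimal sketch of the two updates that follow from this objective, assuming the exponential closed-form solution w_ji = exp(-X_ji/h) / Σ_l exp(-X_jl/h) from Domeniconi et al.; empty clusters and convergence checks are omitted:

```python
import numpy as np

def lac_weights(X, labels, centroids, h):
    """Weight update: w_ji = exp(-X_ji / h) / sum_l exp(-X_jl / h), where
    X_ji is the average squared distance, along dimension i, of cluster j's
    points from its centroid. Assumes every cluster is non-empty."""
    k, d = centroids.shape
    W = np.zeros((k, d))
    for j in range(k):
        members = X[labels == j]
        X_j = ((members - centroids[j]) ** 2).mean(axis=0)
        e = np.exp(-X_j / h)
        W[j] = e / e.sum()
    return W

def weighted_assign(X, centroids, W):
    """Assign each point to the cluster minimizing the weighted Euclidean
    distance sum_i w_ji * (x_i - c_ji)^2."""
    d2 = np.array([((X - c) ** 2 * w).sum(axis=1)
                   for c, w in zip(centroids, W)])
    return d2.argmin(axis=0)
```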
15. Objective and motivation
- Text datasets have very high dimensionality, which makes it hard to apply the locally adaptive method directly.
- For clustering problems, a dimensionality greater than 16 is generally considered high.
- All clusters may share some common irrelevant features, while each cluster may have its own subset of relevant features (subspace clustering).
16. What can we gain?
- In text applications, web clustering based on site contents yields 5,000-15,000 attributes (pages/contents) even for modest web sites.
- Biology and genomic data can have dimensionalities that easily surpass 10,000 attributes.
- The gap is significant. Newly developed subspace clustering algorithms can to some extent solve the high-dimensionality problem by finding local correlations in the data and performing dimensionality reduction on the locally correlated clusters individually.
17. What can we gain?
- However, such algorithms may work well for dimensionalities below 200, but they will obviously not work well for dimensionalities around 10,000.
- We therefore apply a global feature selection method to the original feature space to reduce it to 200-300 features, since in most conditions many irrelevant features are common to all clusters. We then take advantage of the locally adaptive clustering method to find locally correlated features for each cluster.
18. How do we combine?
- The main point of the hybrid feature selection method is to apply the locally adaptive clustering algorithm to a feature space pre-selected using frequent itemsets. Concretely (an end-to-end sketch follows the steps):
- Step 1: Represent the document set in bag-of-words format as D.
- Step 2: Apply association rule mining to D to get a group of frequent termsets S = {S1, ..., Sn}.
- Step 3: Combine all the termsets in S to get a set of global features GS = S1 ∪ ... ∪ Sn.
- Step 4: Apply the Locally Adaptive Clustering algorithm (LAC) to document set D on the selected feature space GS.
- Step 5: Get the document partition and the local weights for each partition.
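A sketch of the five steps end to end, reusing the frequent_itemsets, weighted_assign, and lac_weights sketches from the earlier slides; initialization, empty-cluster handling, and convergence tests are simplified:

```python
import numpy as np

def hybrid_clustering(D, vocab, transactions, minsup, h, k, n_iter=20):
    """End-to-end sketch of Steps 1-5. D is the bag-of-words matrix
    (documents x terms, Step 1), vocab lists the term of each column,
    and transactions holds each document as a set of terms."""
    # Step 2: mine frequent termsets S = {S1, ..., Sn}
    S = frequent_itemsets(transactions, minsup)            # sketch on slide 9
    # Step 3: global feature set GS = S1 U ... U Sn
    GS = sorted({t for itemset in S for t in itemset})
    X = D[:, [vocab.index(t) for t in GS]]                 # reduced space
    # Step 4: LAC on the reduced space (slide 14 sketches the updates)
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    W = np.full((k, X.shape[1]), 1.0 / X.shape[1])         # uniform start
    for _ in range(n_iter):
        labels = weighted_assign(X, centroids, W)
        W = lac_weights(X, labels, centroids, h)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: the document partition and the per-cluster weight vectors
    return labels, W
```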
19. Problem definition
- Given a set S of M documents in the original N_org-dimensional space, let N be the reduced dimensionality obtained by applying association rule mining to the M × N_org data matrix. Find a set of k centers {c_1, ..., c_k}, c_j ∈ R^N for j = 1, ..., k, coupled with a set of corresponding weight vectors {w_1, ..., w_k}, w_j ∈ R^N for j = 1, ..., k, that partition S into k sets {S_1, ..., S_k}.
- Here w_ji and c_ji denote the i-th components of the vectors w_j and c_j respectively.
20. Dataset used
21. Experiment design
- Preprocessing the dataset
  - Stop word removal (based on a stopword list)
  - Low-frequency word removal (keep terms with df > 3)
- Data representation
  - relative term frequency (rtf) = term frequency (tf) / total term count in that document (sketched below)
- Distance measurement
  - normalized Euclidean distance
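A minimal sketch of this preprocessing, assuming each document is already tokenized into a list of words; names are illustrative:

```python
from collections import Counter
import numpy as np

def preprocess(docs, stopwords, min_df=3):
    """Build the rtf matrix: drop stopwords, keep terms with df > min_df,
    then rtf = term frequency / total term count of the document."""
    counts = [Counter(w for w in doc if w not in stopwords) for doc in docs]
    df = Counter(t for c in counts for t in c)              # document frequency
    vocab = sorted(t for t, n in df.items() if n > min_df)
    index = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(docs), len(vocab)))
    for r, c in enumerate(counts):
        total = sum(c.values())
        for t, n in c.items():
            if t in index and total:
                M[r, index[t]] = n / total                  # relative term frequency
    return M, vocab
```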
22. Experimental design
- Apply association rule mining to get a set of frequent itemsets as a pre-selected feature set for each dataset, based on the chosen frequent itemset support.
- Apply the locally adaptive clustering (LAC) method to each dataset to get weighted feature sets for each category, based on the pre-selected feature set.
23. SPAM dataset (support, h)
24. Weighted features
- Some selected typical features with high rank:
- SPAM: conference, applications, algorithm, papers, committee, neuroscience
- NON-SPAM: sales, money, marketing, credit, pay, free, order, product
25. Weight comparison for selected features
26. Stdev comparison for selected features
27. Why does it happen?
[Figure: three distributions of feature i in a certain cluster; distributions 2 and 3 get high weight, distribution 1 gets low weight]
28. Experimental result analysis (1)
- From the results above (Tables 1 and 2), we can see that when the feature dimensionality is reduced from 791 to 285, out of the original feature space of 9,210, the clustering result stays nearly the same (an error rate of around 2%), which is fairly good. Obviously, the support setting for the association rules is the most important factor influencing the final result. The weighting parameter h also has an impact on the final result.
29. 20 Newsgroup dataset
30. Experimental result analysis (2)
- By analyzing the weights of some selected keywords, we found that when a word (feature) gets a high weight in a certain class, that word has a low standard deviation within that class, which means the values of that feature across the documents of that class are relatively tight. So this kind of feature can be a discriminative feature and should get a higher weight, as the toy check below illustrates.
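This matches the exponential weighting scheme reconstructed on slide 14: X_ji is the average squared deviation of feature i within cluster j, so a tight (low-stdev) feature yields a large weight. A toy check with made-up dispersion values:

```python
import numpy as np

# Two features inside one cluster: per-feature average squared deviations
# (made-up numbers): feature 0 is tight, feature 1 is spread out.
X_j = np.array([0.01, 0.50])
h = 0.1
w = np.exp(-X_j / h)
w /= w.sum()
print(w)   # -> [0.9926 0.0074]: the low-stdev feature dominates the weight
```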
31. Comment on the results
- The support setting for the association rules in the first step is very important. The support should be set so that around 200-300 features are selected. In that situation the method achieves relatively good clustering performance and also converges very fast compared with using the high-dimensional space.
32. Conclusion
- By applying association rules to the document dataset, we dramatically reduce the feature space from about 10,000 to 200-300 features while keeping the error rate low and improving the computation time.
- By applying locally adaptive clustering to the selected features, we get weight vectors that reflect the truly discriminative word features for each cluster.
- This effectively helps the locally adaptive clustering algorithm scale to high-dimensional data such as text data.
33. Future work
- Try other datasets, such as microarray datasets.
- Try to find a formula relating the support setting, the number of data points, the original feature count, and the target feature count.
- Try to find a way to choose the best h.
- Try to find a way to identify clusters using the weight vectors or feature distributions.
- Change how the initial centers are chosen.
34. References
- Aggarwal, C. C., & Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. Proc. of SIGMOD'00.
- Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proc. of SIGIR'95.
- Domeniconi, C., Papadopoulos, D., Gunopulos, D., & Ma, S. (2004). Subspace clustering of high dimensional data. Proc. of the SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 22-24, 2004.
- Frigui, H., & Nasraoui, O. Simultaneous categorization of text documents and identification of cluster-dependent keywords.
- Barbara, D., Domeniconi, C., & Kang, N. (2004). Classifying documents without labels. SDM'04.
- Frigui, H., & Nasraoui, O. Simultaneous clustering and attribute discrimination.
- Yang, Y., & Pedersen, J. O. A comparative study on feature selection in text categorization.