Title: Feature selection using a hybrid method for text clustering
1. Feature selection using a hybrid method for text clustering
2. Outline
- Introduction
- Feature selection (supervised/unsupervised)
- Association mining and frequent itemsets
- Locally adaptive clustering method
- Objective and problem definition
- Experimental design and result analysis
- Conclusions
3. Introduction
- Text clustering groups similar documents together.
- The general bag-of-words representation raises one severe problem: the high dimensionality of the feature space and the inherent data sparsity.
4. Introduction
- Clustering in such high-dimensional spaces presents tremendous difficulty, and the performance of clustering algorithms declines dramatically, much more than in predictive learning such as decision trees or Naive Bayes classification.
- Introducing irrelevant features hurts the clustering result.
- Similarity measures suffer from the curse of dimensionality.
5. Feature selection
- Feature selection is a process that chooses a subset of the original feature set according to some criteria.
- Depending on whether class label information is required, feature selection can be either supervised or unsupervised.
6. Supervised feature selection
- Information gain: measures the information obtained for category prediction by the presence or absence of a term in a document.
- χ² statistic (CHI): measures the association between a term and a category (see the sketch below).
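A minimal sketch of how the CHI score can be computed from the 2x2 term/category contingency table; the four counts A, B, C, D follow the usual convention in Yang and Pedersen's survey, and the function name is illustrative:

```python
def chi_square(A, B, C, D):
    """CHI score of a term t for a category c from the 2x2 contingency table.

    A: # docs in c containing t          B: # docs outside c containing t
    C: # docs in c without t             D: # docs outside c without t
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```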
7. Unsupervised feature selection
- Document frequency
- Term strength: computed from the conditional probability that a term occurs in the second half of a pair of related documents given that it occurs in the first half.
- Entropy-based ranking: a term is measured by the entropy reduction when it is removed (a small sketch follows). The entropy is defined as

E = -\sum_{i=1}^{M} \sum_{j=1}^{M} \left( S_{ij} \log S_{ij} + (1 - S_{ij}) \log(1 - S_{ij}) \right), \quad S_{ij} = e^{-\alpha d_{ij}}

where S_ij is the similarity between documents i and j, and d_ij their distance.
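A sketch of this ranking, assuming the Dash-Liu style similarity S_ij = exp(-α·d_ij) with α chosen so the mean-distance pair has similarity 0.5; all names are illustrative:

```python
import numpy as np

def dataset_entropy(X):
    """Entropy of dataset X (rows = documents), using pairwise
    similarities S_ij = exp(-alpha * d_ij)."""
    n = X.shape[0]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    iu = np.triu_indices(n, k=1)              # each pair once, i < j
    alpha = -np.log(0.5) / d[iu].mean()
    s = np.clip(np.exp(-alpha * d[iu]), 1e-12, 1 - 1e-12)
    return -np.sum(s * np.log2(s) + (1 - s) * np.log2(1 - s))

def rank_by_entropy_reduction(X):
    """Score feature f by the entropy reduction when f is removed,
    as the slide describes, and rank features by that score."""
    base = dataset_entropy(X)
    scores = {f: base - dataset_entropy(np.delete(X, f, axis=1))
              for f in range(X.shape[1])}
    return sorted(scores, key=scores.get, reverse=True)
```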
8. Association rule mining
- Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.
9. Frequent itemsets
- Itemset
  - A collection of one or more items
  - Example: {Milk, Bread, Diaper}
- k-itemset
  - An itemset that contains k items
- Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s)
  - Fraction of transactions that contain an itemset
  - E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset
  - An itemset whose support is greater than or equal to a minsup threshold (a brute-force sketch follows below)
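A brute-force sketch of these definitions, using the classic five-basket example that matches the slide's numbers (σ({Milk, Bread, Diaper}) = 2, s = 2/5); a real implementation would use Apriori or FP-growth:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Brute-force mining: return {itemset: support} for every itemset
    whose support >= minsup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            support = sum(set(cand) <= t for t in transactions) / n
            if support >= minsup:
                frequent[cand] = support
                found = True
        if not found:   # Apriori property: no superset can be frequent either
            break
    return frequent

# The classic five-basket example consistent with the slide's numbers:
baskets = [{"Bread", "Milk"},
           {"Bread", "Diaper", "Beer", "Eggs"},
           {"Milk", "Diaper", "Beer", "Coke"},
           {"Bread", "Milk", "Diaper", "Beer"},
           {"Bread", "Milk", "Diaper", "Coke"}]
result = frequent_itemsets(baskets, minsup=0.4)
print(result[("Bread", "Diaper", "Milk")])   # 0.4, i.e. sigma = 2 out of 5
```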
10. Applying association rules in text mining
- Scenario 1
  - baskets = documents
  - items = words in documents
  - frequent word-groups → linked concepts
- Scenario 2
  - baskets = web pages
  - items = outgoing links
  - pages with similar references → about the same topic
11. Co-clustering
- To capture the local correlations of the data, a proper feature selection procedure should operate locally.
- A local operation would allow embedding different distance measures in different regions.
- Definition: simultaneous clustering of both the row and column sets of a data matrix.
12. Locally Adaptive Clustering (LAC)
- Weighted cluster: a subset of data points, together with a weight vector, such that the points are closely clustered according to the corresponding weighted Euclidean distance.
- Objective: find cluster centroids and weight vectors.
- Proposed by Professor Carlotta Domeniconi et al.
13. Locally adaptive clustering
[Figure: two clusters in the x-y plane; cluster 1 has w1x > w1y and cluster 2 has w2y > w2x, i.e., each cluster puts the larger weight on the dimension along which its points are tightest]
14. Locally Adaptive Clustering (LAC)
Minimize

E(C, W) = \sum_{j=1}^{k} \sum_{i=1}^{N} \left( w_{ji} X_{ji} + h \, w_{ji} \log w_{ji} \right)

Subject to

\sum_{i=1}^{N} w_{ji} = 1, \quad w_{ji} \ge 0, \quad j = 1, \dots, k

Where

X_{ji} = \frac{1}{|S_j|} \sum_{x \in S_j} (c_{ji} - x_i)^2

(a sketch of the resulting updates follows).
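A minimal sketch of the two updates that follow from this objective, assuming the exponential closed-form solution w_ji = exp(-X_ji/h) / Σ_l exp(-X_jl/h) from Domeniconi et al.; empty clusters and convergence checks are omitted:

```python
import numpy as np

def lac_weights(X, labels, centroids, h):
    """Weight update: w_ji = exp(-X_ji / h) / sum_l exp(-X_jl / h), where
    X_ji is the average squared distance, along dimension i, of cluster j's
    points from its centroid. Assumes every cluster is non-empty."""
    k, d = centroids.shape
    W = np.zeros((k, d))
    for j in range(k):
        members = X[labels == j]
        X_j = ((members - centroids[j]) ** 2).mean(axis=0)
        e = np.exp(-X_j / h)
        W[j] = e / e.sum()
    return W

def weighted_assign(X, centroids, W):
    """Assign each point to the cluster minimizing the weighted Euclidean
    distance sum_i w_ji * (x_i - c_ji)^2."""
    d2 = np.array([((X - c) ** 2 * w).sum(axis=1)
                   for c, w in zip(centroids, W)])
    return d2.argmin(axis=0)
```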
15. Objective and motivation
- Text datasets have very high dimensionality, which makes it hard to apply the locally adaptive method directly.
- For clustering problems, a dimensionality greater than 16 is generally considered high.
- All clusters may share some common irrelevant features, while each cluster may have its own subset of relevant features (subspace clustering).
16. What can we gain?
- In text applications, web clustering based on site contents yields 5,000-15,000 attributes (pages/contents) even for modest web sites.
- Biology and genomic data can have dimensionalities that easily surpass 10,000 attributes.
- The gap is significant. Newly developed subspace clustering algorithms can to some extent solve the high-dimensionality problem by finding local correlations in the data and performing dimensionality reduction on the locally correlated clusters individually.
17. What can we gain?
- However, such algorithms may work well for dimensionalities below 200, but they will obviously not work well for dimensionalities around 10,000.
- We therefore apply a global feature selection method to the original feature space to reduce it to 200-300 features, since in most conditions many irrelevant features are common to all clusters. We then take advantage of the locally adaptive clustering method to find locally correlated features for each cluster.
18. How do we combine?
- The main point of the hybrid feature selection method is to apply the locally adaptive clustering algorithm to a feature space pre-selected using frequent itemsets. Concretely (an end-to-end sketch follows the steps):
- Step 1: Represent the document set in bag-of-words format as D.
- Step 2: Apply association rule mining to D to get a group of frequent termsets S = {S1, ..., Sn}.
- Step 3: Combine all the termsets in S to get a set of global features GS = S1 ∪ ... ∪ Sn.
- Step 4: Apply the Locally Adaptive Clustering algorithm (LAC) to document set D on the selected feature space GS.
- Step 5: Get the document partition and the local weights for each partition.
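A sketch of the five steps end to end, reusing the frequent_itemsets, weighted_assign, and lac_weights sketches from the earlier slides; initialization, empty-cluster handling, and convergence tests are simplified:

```python
import numpy as np

def hybrid_clustering(D, vocab, transactions, minsup, h, k, n_iter=20):
    """End-to-end sketch of Steps 1-5. D is the bag-of-words matrix
    (documents x terms, Step 1), vocab lists the term of each column,
    and transactions holds each document as a set of terms."""
    # Step 2: mine frequent termsets S = {S1, ..., Sn}
    S = frequent_itemsets(transactions, minsup)            # sketch on slide 9
    # Step 3: global feature set GS = S1 U ... U Sn
    GS = sorted({t for itemset in S for t in itemset})
    X = D[:, [vocab.index(t) for t in GS]]                 # reduced space
    # Step 4: LAC on the reduced space (slide 14 sketches the updates)
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    W = np.full((k, X.shape[1]), 1.0 / X.shape[1])         # uniform start
    for _ in range(n_iter):
        labels = weighted_assign(X, centroids, W)
        W = lac_weights(X, labels, centroids, h)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: the document partition and the per-cluster weight vectors
    return labels, W
```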
19. Problem definition
- Given a set S of M documents in the original N_org-dimensional space, let N be the reduced dimensionality obtained by applying association rule mining to the M × N_org data matrix. Find a set of k centers {c_1, ..., c_k}, c_j ∈ R^N for j = 1, ..., k, coupled with a set of corresponding weight vectors {w_1, ..., w_k}, w_j ∈ R^N for j = 1, ..., k, that partition S into k sets {S_1, ..., S_k}.
- Here w_ji and c_ji denote the i-th components of the vectors w_j and c_j respectively.
20. Dataset used
21. Experiment design
- Preprocessing the dataset
  - Stop word removal (based on a stopword list)
  - Low-frequency word removal (keep terms with df > 3)
- Data representation
  - relative term frequency (rtf) = term frequency (tf) / total term count in that document (sketched below)
- Distance measurement
  - normalized Euclidean distance
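A minimal sketch of this preprocessing, assuming each document is already tokenized into a list of words; names are illustrative:

```python
from collections import Counter
import numpy as np

def preprocess(docs, stopwords, min_df=3):
    """Build the rtf matrix: drop stopwords, keep terms with df > min_df,
    then rtf = term frequency / total term count of the document."""
    counts = [Counter(w for w in doc if w not in stopwords) for doc in docs]
    df = Counter(t for c in counts for t in c)              # document frequency
    vocab = sorted(t for t, n in df.items() if n > min_df)
    index = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(docs), len(vocab)))
    for r, c in enumerate(counts):
        total = sum(c.values())
        for t, n in c.items():
            if t in index and total:
                M[r, index[t]] = n / total                  # relative term frequency
    return M, vocab
```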
22. Experimental design
- Apply association rule mining to get a set of frequent itemsets as a pre-selected feature set for each dataset, based on the chosen frequent itemset support.
- Apply the locally adaptive clustering (LAC) method to each dataset to get weighted feature sets for each category, based on the pre-selected feature set.
23. SPAM dataset (support, h)
24. Weighted features
- Some selected typical features with high rank:
- SPAM: conference, applications, algorithm, papers, committee, neuroscience
- NON-SPAM: sales, money, marketing, credit, pay, free, order, product
25. Weight comparison for selected features
26. Stdev comparison for selected features
27. Why does it happen?
[Figure: three distributions of feature i in a certain cluster; distributions 2 and 3 get high weight, distribution 1 gets low weight]
28. Experimental result analysis (1)
- From the results above (Tables 1 and 2), we can see that when the feature dimensionality is reduced from 791 to 285, out of the original feature space of 9,210, the clustering result stays nearly the same (an error rate of around 2%), which is fairly good. Obviously, the support setting for the association rules is the most important factor influencing the final result. The weighting parameter h also has an impact on the final result.
29. 20 Newsgroup dataset
30. Experimental result analysis (2)
- By analyzing the weights of some selected keywords, we found that when a word (feature) gets a high weight in a certain class, that word has a low standard deviation within that class, which means the values of that feature across the documents of that class are relatively tight. So this kind of feature can be a discriminative feature and should get a higher weight, as the toy check below illustrates.
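This matches the exponential weighting scheme reconstructed on slide 14: X_ji is the average squared deviation of feature i within cluster j, so a tight (low-stdev) feature yields a large weight. A toy check with made-up dispersion values:

```python
import numpy as np

# Two features inside one cluster: per-feature average squared deviations
# (made-up numbers): feature 0 is tight, feature 1 is spread out.
X_j = np.array([0.01, 0.50])
h = 0.1
w = np.exp(-X_j / h)
w /= w.sum()
print(w)   # -> [0.9926 0.0074]: the low-stdev feature dominates the weight
```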
31. Comment on the results
- The support setting for the association rules in the first step is very important. The support should be set so that around 200-300 features are selected. In that situation the method achieves relatively good clustering performance and also converges very fast compared with using the high-dimensional space.
32. Conclusion
- By applying association rules to the document dataset, we dramatically reduce the feature space from about 10,000 to 200-300 features while keeping the error rate low and improving the computation time.
- By applying locally adaptive clustering to the selected features, we get weight vectors that reflect the truly discriminative word features for each cluster.
- This effectively helps the locally adaptive clustering algorithm scale to high-dimensional data such as text data.
33. Future work
- Try other datasets, such as microarray datasets.
- Try to find a formula relating the support setting, the number of data points, the original feature count, and the target feature count.
- Try to find a way to choose the best h.
- Try to find a way to identify clusters using the weight vectors or feature distributions.
- Change how the initial centers are chosen.
34. References
- Aggarwal, C. C., & Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. Proc. of SIGMOD'00.
- Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proc. of SIGIR'95.
- Domeniconi, C., Papadopoulos, D., Gunopulos, D., & Ma, S. (2004). Subspace clustering of high dimensional data. Proc. of the SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 22-24, 2004.
- Frigui, H., & Nasraoui, O. Simultaneous categorization of text documents and identification of cluster-dependent keywords.
- Barbara, D., Domeniconi, C., & Kang, N. (2004). Classifying documents without labels. SDM'04.
- Frigui, H., & Nasraoui, O. Simultaneous clustering and attribute discrimination.
- Yang, Y., & Pedersen, J. O. A comparative study on feature selection in text categorization.