1
Feature selection using hybrid method for text
clustering
  • Ning Kang
  • INFS 795

2
Outline
  • Introduction
  • Feature selection(supervised/unsupervised)
  • Association mining and frequent itemset
  • Local adaptive clustering method
  • Objective and problem definition
  • Experimental Design and result analysis
  • Conclusions

3
Introduction
  • Text clustering groups similar documents together.
  • The common bag-of-words representation raises one severe problem: the high dimensionality of the feature space and the inherent data sparsity.

4
Introduction
  • Clustering in such high dimensional spaces presents tremendous difficulty, and the performance of clustering algorithms declines dramatically, much more than in predictive learning such as decision trees or Naive Bayes classification.
  • Introducing irrelevant features hurts the clustering result.
  • Similarity measurement suffers from the curse of dimensionality.

5
Feature selection
  • Feature selection is a process that chooses a subset of the original feature set according to some criteria.
  • Depending on whether class label information is required, feature selection can be either supervised or unsupervised.

6
Supervised Feature selection
  • Information gain: measures the information obtained for category prediction by the presence or absence of the term in a document.
  • χ² statistic (CHI): measures the association between the term and the category (a short sketch of this statistic follows).
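A minimal sketch of the χ² score for a single term-category pair, computed from a 2×2 contingency table of document counts (the function and variable names are illustrative, not from the presentation):

```python
def chi_square(n_tc, n_t, n_c, n):
    """chi^2 association between a term t and a category c.

    n_tc: documents in category c that contain term t
    n_t:  documents containing term t
    n_c:  documents in category c
    n:    total number of documents
    """
    a = n_tc                    # term present, in category
    b = n_t - n_tc              # term present, not in category
    c = n_c - n_tc              # term absent, in category
    d = n - n_t - n_c + n_tc    # term absent, not in category
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Example: term occurs in 40 of 50 category docs, 60 of 1000 docs total
print(chi_square(n_tc=40, n_t=60, n_c=50, n=1000))
```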

7
Unsupervised feature selection
  • Document frequency: the number of documents in which a term occurs.
  • Term strength: computed as the conditional probability that a term occurs in the second half of a pair of related documents given that it occurs in the first half.
  • Entropy-based ranking: a term is measured by the entropy reduction when it is removed. The entropy is defined as the following equation.
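The equation itself is a slide image that is not preserved here; a plausible reconstruction, assuming the standard entropy-based ranking of Dash and Liu, is:

$$E = -\sum_{i=1}^{M} \sum_{j=1}^{M} \Big( S_{ij} \log S_{ij} + (1 - S_{ij}) \log (1 - S_{ij}) \Big), \qquad S_{ij} = e^{-\alpha\, d_{ij}}$$

where $d_{ij}$ is the distance between documents $i$ and $j$ and $\alpha$ is a positive scaling constant.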

8
Association Rule Mining
  • Given a set of transactions, find rules that will
    predict the occurrence of an item based on the
    occurrences of other items in the transaction

9
Frequent Itemset
  • Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}.
  • k-itemset: an itemset that contains k items.
  • Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2.
  • Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
  • Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold (a support-counting sketch follows this list).
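A minimal sketch of computing support over a list of transactions; the five example transactions are the textbook market-basket example and reproduce the σ = 2 and s = 2/5 values above:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item in X."""
    return sum(itemset <= t for t in transactions)

itemset = {"Milk", "Bread", "Diaper"}
sigma = support_count(itemset, transactions)
s = sigma / len(transactions)
print(sigma, s)   # 2 0.4 -> frequent if s >= minsup
```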

10
Apply association rule in text mining
  • Scenario 1
  • baskets documents
  • items words in documents
  • frequent word-groups ? linked concepts.
  • Scenario 2
  • baskets web pages
  • items outgoing links
  • pages with similar references ? about same topic

11
Co-clustering
  • To capture the local correlations of the data, a proper feature selection procedure should operate locally.
  • A local operation allows embedding different distance measures in different regions.
  • Definition: simultaneous clustering of both the row and column sets of a data matrix.

12
Locally Adaptive Clustering (LAC)
  • Weighted cluster: a subset of data points, together with a weight vector, such that the points are closely clustered according to the corresponding weighted Euclidean distance.
  • Objective: find the cluster centroids and weight vectors.
  • Proposed by Professor Carlotta Domeniconi et al.

13
Local adaptive clustering
[Figure: two clusters in the (x, y) plane with per-cluster dimension weights: w1x > w1y for cluster 1, w2y > w2x for cluster 2]
14
Locally Adaptive Clustering (LAC)
  • Error function (the slide equations are images; reconstructed here following the LAC formulation of Domeniconi et al.):

$$E(C, W) = \sum_{j=1}^{k} \sum_{i=1}^{N} \left( w_{ji} X_{ji} + h\, w_{ji} \log w_{ji} \right)$$

Subject to

$$\sum_{i=1}^{N} w_{ji} = 1, \quad w_{ji} \ge 0, \qquad j = 1, \ldots, k$$

Where

$$X_{ji} = \frac{1}{|S_j|} \sum_{x \in S_j} (c_{ji} - x_i)^2$$
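A minimal sketch of the LAC loop under this formulation, assuming the exponential weight update $w_{ji} \propto e^{-X_{ji}/h}$ that the objective above yields (all names and defaults are illustrative):

```python
import numpy as np

def lac_iterate(X, k, h=0.5, n_iter=20, seed=0):
    """Minimal LAC-style loop: weighted nearest-centroid assignment,
    then weight and centroid updates via w_ji ~ exp(-X_ji / h)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    centers = X[rng.choice(M, size=k, replace=False)].copy()
    weights = np.full((k, N), 1.0 / N)   # uniform weights to start

    for _ in range(n_iter):
        # Weighted Euclidean distance of every point to every centroid
        d = np.stack([((X - c) ** 2 * w).sum(axis=1)
                      for c, w in zip(centers, weights)])  # (k, M)
        labels = d.argmin(axis=0)

        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # Per-dimension dispersion X_ji around the current centroid
            disp = ((members - centers[j]) ** 2).mean(axis=0)
            w = np.exp(-disp / h)        # small spread -> large weight
            weights[j] = w / w.sum()     # normalize: sum_i w_ji = 1
            centers[j] = members.mean(axis=0)
    return centers, weights, labels
```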
15
Objective and motivation
  • A text dataset has very high dimensionality, which makes it hard to apply the locally adaptive method directly.
  • For clustering problems, a dimensionality greater than 16 is generally considered high.
  • There may be some irrelevant features common to all clusters, and each cluster may have its own subset of relevant features (subspace clustering).

16
What can we gain?
  • In text applications, web clustering based on site contents results in 5,000-15,000 attributes (pages/contents) for modest web sites.
  • Biology and genomic data can have dimensions that easily surpass 10,000 attributes.
  • The gap is significant: the newly developed subspace clustering algorithms can to some extent solve the high dimensionality problem by finding local correlations in the data and performing dimensionality reduction on the locally correlated clusters individually.

17
What can we gain?
  • This kind of algorithm may work well for dimensionalities below 200, but it will obviously not work well for dimensions such as 10,000.
  • We therefore apply a global feature selection method to the original feature space to reduce it to 200-300 features, since in most conditions there are many irrelevant features common to all clusters. We then take advantage of the locally adaptive clustering method to find locally correlated features for each cluster.

18
How do we combine?
  • The main point of the hybrid feature selection method is to apply the locally adaptive clustering algorithm to a feature space preselected using frequent itemsets. Concretely (a sketch of these steps follows the list):
  • Step 1. Represent the document set in bag-of-words format as D.
  • Step 2. Apply association rule mining to D to get a group of frequent termsets S = {S1, ..., Sn}.
  • Step 3. Combine all the termsets in S to get a set of global features GS = S1 ∪ ... ∪ Sn.
  • Step 4. Apply the Locally Adaptive Clustering algorithm (LAC) to document set D on the selected feature space GS.
  • Step 5. Get the document partition and the local weights for each partition.
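A minimal sketch of steps 1-3 (function names are illustrative). Note that by the Apriori anti-monotonicity property, every item of a frequent itemset is itself frequent, so the union GS of all frequent termsets equals the set of frequent single terms, which is what the sketch computes:

```python
import numpy as np
from collections import Counter

def global_feature_set(docs, minsup=0.05):
    """Steps 1-3: GS, the union of all frequent termsets, computed
    as the set of terms whose document-frequency support >= minsup."""
    df = Counter(t for d in docs for t in set(d.split()))
    n = len(docs)
    return sorted(t for t, c in df.items() if c / n >= minsup)

def to_matrix(docs, vocab):
    """Bag-of-words counts restricted to the selected features GS."""
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for t in d.split():
            if t in index:
                X[r, index[t]] += 1
    return X

# Steps 4-5: run LAC (see the lac_iterate sketch above) on X
# docs = [...]; GS = global_feature_set(docs, minsup=0.05)
# X = to_matrix(docs, GS)
# centers, weights, labels = lac_iterate(X, k=2, h=0.5)
```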

19
Problem Definition
  • Given a set S of documents in the Norg-dimensional space, where M is the number of documents in the dataset and N is the dimensionality obtained after applying association rule mining to the M × Norg space: find a set of k centers {c1, ..., ck}, cj ∈ R^N, j = 1, ..., k, coupled with a set of corresponding weight vectors {w1, ..., wk}, wj ∈ R^N, j = 1, ..., k, partitioning S into k sets {S1, ..., Sk}.
  • Here wji and cji represent the ith components of the vectors wj and cj respectively.
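The assignment rule is not written out on the slide; under the LAC formulation it would be weighted nearest-centroid assignment (a reconstruction, following the cited LAC paper):

$$S_j = \Big\{ x : \sum_{i=1}^{N} w_{ji} (x_i - c_{ji})^2 < \sum_{i=1}^{N} w_{li} (x_i - c_{li})^2, \;\; \forall\, l \ne j \Big\}$$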

20
Dataset used
21
Experiment design
  • Preprocessing the dataset:
  • Stop word removal (based on a stopword list)
  • Low-frequency word removal (df > 3)
  • Data representation:
  • relative term frequency (rtf) = term frequency (tf) / total term frequency in that document (a short rtf sketch follows)
  • Distance measurement:
  • normalized Euclidean distance
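A minimal sketch of the rtf representation described above (tokenization and names are illustrative):

```python
from collections import Counter

def rtf_vector(doc, vocab):
    """Relative term frequency: tf of each vocabulary term divided
    by the total number of term occurrences in the document."""
    counts = Counter(doc.split())
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in vocab]

print(rtf_vector("spam spam ham", ["spam", "ham", "eggs"]))
# [0.666..., 0.333..., 0.0]
```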

22
Experimental Design
  • Apply association rule mining to get a set of frequent itemsets as a pre-selected feature set for each dataset, based on a chosen frequent-itemset support threshold.
  • Apply the locally adaptive clustering (LAC) method to each dataset, on the pre-selected feature set, to get weighted feature sets for each category.

23
SPAM dataset (support, h)
24
Weighted features
  • Some selected typical features with high rank:
  • SPAM: conference, applications, algorithm, papers, committee, neuroscience
  • NON-SPAM: sales, money, marketing, credit, pay, free, order, product

25
Weight comparison for selected features
26
Stdev comparison for selected features
27
Why does this happen?
[Figure: three distributions of feature i within a certain cluster: the tight distributions 2 and 3 get high weight, the spread-out distribution 1 gets low weight]
28
Experimental Result Analysis(1)
  • From the above results (Tables 1 and 2), we can see that when the dimensionality of the features is reduced from 791 to 285, relative to the original feature space of 9,210, the clustering result stays nearly the same, at around 2% error, which is fairly good. Obviously, the support setting for the association rules is the most important factor influencing the final result, and the weighting parameter h also has an impact.

29
20 Newsgroups dataset
30
Experimental Result Analysis(2)
  • By analyzing the weights for some selected keywords, we found that when a word (feature) in a certain class gets a high weight, that word has a low standard deviation within that class, meaning the values of that feature across the documents in that class are relatively tight. Such a feature can be a discriminative feature and should get a higher weight (illustrated numerically below).
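A small numeric illustration of this relationship, using the exponential weighting scheme from the LAC sketch (the values are illustrative):

```python
import numpy as np

# Per-dimension dispersions (squared spread) of three features
# within one cluster: one tight feature and two loose ones.
disp = np.array([0.05, 0.60, 0.90])
h = 0.5
w = np.exp(-disp / h)
w /= w.sum()
print(w.round(3))   # the tight feature gets the largest weight
```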

31
Comment on result
  • The support setting for the association rules in the first step is very important. The support should be set so that around 200-300 features are selected. In that range the method achieves relatively good clustering performance and also converges very fast compared with using the high-dimensional space.

32
Conclusion
  • By applying association rules to the document dataset, we dramatically reduce the feature space from about 10,000 to 200-300 dimensions while keeping the error rate low and improving the computation time.
  • By applying locally adaptive clustering on the selected features, we get weight vectors that reflect the truly discriminative word features for each cluster.
  • This effectively helps apply the locally adaptive clustering algorithm to high-dimensional data such as text.

33
Future Work
  • Try other datasets, such as microarray data.
  • Try to find a formula relating the support setting, the number of data points, and the original and target feature counts.
  • Try to find a way to choose the best h.
  • Try to find a way to identify clusters using the weight vectors or feature distributions.
  • Change how the initial centers are chosen.

34
Reference
  • Aggarwal, C. C. & Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. Proc. of SIGMOD'00.
  • Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proc. of SIGIR'95.
  • Domeniconi, C., Papadopoulos, D., Gunopulos, D. & Ma, S. (2004). Subspace clustering of high dimensional data. Proc. of the SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 22-24, 2004.
  • Frigui, H. & Nasraoui, O. Simultaneous categorization of text documents and identification of cluster-dependent keywords.
  • Barbara, D., Domeniconi, C. & Kang, N. (2004). Classifying documents without labels. Proc. of SDM'04.
  • Frigui, H. & Nasraoui, O. Simultaneous clustering and attribute discrimination.
  • Yang, Y. & Pedersen, J. O. A comparative study on feature selection in text categorization.