Title: Recent Trends in Text Mining
1Recent Trends in Text Mining
- Girish Keswani
- gkeswani_at_micron.com
2Text Mining?
- What?
- Data Mining on Text Data
- Why?
- Information Retrieval
- Confusion Set Disambiguation
- Topic Distillation
- How?
- Data Mining
3Organization
- Text Mining Algorithms
- Jargon Used
- Background
- Data Modeling,
- Text Classification, and
- Text Clustering
- Applications
- Experiments: NBC, NN, and ssFCM
- Further work
- References
4Text Mining Algorithms
- Classification Algorithms
- Naïve Bayes Classifier
- Decision Trees
- Neural Networks
- Clustering Algorithms
- EM Algorithms
- Fuzzy
5Jargon
- DM: Data Mining
- IR: Information Retrieval
- NBC: Naïve Bayes Classifier
- EM: Expectation Maximization
- NN: Neural Networks
- ssFCM: Semi-Supervised Fuzzy C-Means
- Labeled Data (Training Data)
- Unlabeled Data
- Test Data
6Background Modeling
7Background Modeling
- Generative Models of Data [13]: Probabilistic
- To generate a document, a class is first selected based on its prior probability, and then a document is generated using the parameters of the chosen class distribution
- NBC and EM algorithms are based on this model
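The generative story above can be sketched in a few lines of Python. This is a minimal illustration, not the setup used in the experiments: the two classes, the four-word vocabulary, and all probabilities in `PRIORS` and `WORD_PROBS` are hypothetical toy values, and `generate_document` and `classify` are illustrative names. The classifier simply inverts the generative process with Bayes' rule, which is the idea behind NBC.

```python
import math
import random
from collections import Counter

# Hypothetical toy parameters: class priors and per-class word distributions.
PRIORS = {"sports": 0.5, "politics": 0.5}
WORD_PROBS = {
    "sports":   {"game": 0.5, "team": 0.3, "vote": 0.1, "law": 0.1},
    "politics": {"game": 0.1, "team": 0.1, "vote": 0.4, "law": 0.4},
}

def generate_document(length, rng):
    """Select a class from the prior, then draw words from that class's distribution."""
    classes = list(PRIORS)
    label = rng.choices(classes, weights=[PRIORS[c] for c in classes])[0]
    vocab = list(WORD_PROBS[label])
    words = rng.choices(vocab, weights=[WORD_PROBS[label][w] for w in vocab], k=length)
    return label, words

def classify(words):
    """Naive Bayes decision: argmax over classes of log P(c) + sum of log P(w | c)."""
    counts = Counter(words)
    scores = {
        c: math.log(PRIORS[c])
           + sum(n * math.log(WORD_PROBS[c][w]) for w, n in counts.items())
        for c in PRIORS
    }
    return max(scores, key=scores.get)

rng = random.Random(0)
label, doc = generate_document(8, rng)   # one sampled (class, document) pair
predicted = classify(doc)                # class recovered from the words alone
```

In practice the parameters are not known in advance; they are estimated from labeled (and, with EM, unlabeled) documents.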
8Importance of Unlabeled Data?
[Figure: diagram of the Labeled Data, Unlabeled Data, and Test Data sets, with regions labeled A through G]
- Provides access to the feature distribution in set F using joint probability distributions
9How to make use of Unlabeled Data?
10How to make use of Unlabeled Data?
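One common answer, in the spirit of the EM approach of [3], is to train NBC on the labeled documents, use it to assign probabilistic labels to the unlabeled documents, and then re-estimate the classifier from both sets, repeating until the parameters settle. The sketch below is a simplified illustration under stated assumptions: documents are lists of words, Laplace smoothing with `alpha=1.0` is used, a fixed number of iterations replaces a convergence test, and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def train_nb(docs, soft_labels, classes, vocab, alpha=1.0):
    """M-step: estimate priors and word probabilities from (possibly fractional) labels."""
    class_weight = {c: 0.0 for c in classes}
    word_weight = {c: defaultdict(float) for c in classes}
    for words, weights in zip(docs, soft_labels):
        for c in classes:
            class_weight[c] += weights[c]
            for w in words:
                word_weight[c][w] += weights[c]
    total = sum(class_weight.values())
    priors = {c: (class_weight[c] + alpha) / (total + alpha * len(classes)) for c in classes}
    word_probs = {}
    for c in classes:
        denom = sum(word_weight[c].values()) + alpha * len(vocab)
        word_probs[c] = {w: (word_weight[c][w] + alpha) / denom for w in vocab}
    return priors, word_probs

def posterior(words, priors, word_probs):
    """E-step for one document: P(c | d) under the current parameters.
    All words must belong to the vocabulary used in training."""
    log_scores = {c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words)
                  for c in priors}
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

def semi_supervised_em(labeled, unlabeled, classes, iterations=10):
    """labeled: list of (words, class); unlabeled: list of word lists."""
    vocab = sorted({w for words, _ in labeled for w in words}
                   | {w for words in unlabeled for w in words})
    labeled_docs = [words for words, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Start from a classifier trained on the labeled documents only.
    priors, word_probs = train_nb(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [posterior(words, priors, word_probs) for words in unlabeled]
        priors, word_probs = train_nb(labeled_docs + unlabeled, hard + soft, classes, vocab)
    return priors, word_probs

# Hypothetical toy data: "score" never appears in a labeled document.
classes = ["sports", "politics"]
labeled = [(["game", "team"], "sports"), (["vote", "law"], "politics")]
unlabeled = [["game", "score"], ["team", "score"], ["law", "senate"], ["vote", "senate"]]

base_priors, base_probs = semi_supervised_em(labeled, unlabeled, classes, iterations=0)
baseline = posterior(["score"], base_priors, base_probs)   # labeled data only

priors, word_probs = semi_supervised_em(labeled, unlabeled, classes)
result = posterior(["score"], priors, word_probs)          # after EM on unlabeled data
```

With labeled data alone, the word "score" carries no class information. After EM it leans toward "sports", because it co-occurs with "game" and "team" in the unlabeled documents.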
11Experimental Results 1
Using NBC, EM and ssFCM
12Experimental Results 2
Using NBC and EM
13Extensions and Variants of these approaches
- The authors of [6] propose a Class Distribution Constraint matrix
- Results on Confusion Set Disambiguation
- Automatic Title Generation [7]
- Using EM Algorithm
- Non-extractive approach
14Relational Data [9]
- A collection of data that also describes the relations between entities is known as relational data
- Probabilistic Relational Models
15Commercial Use/Products
- IBM Text Analyzer [11]
- Decision Tree Based
- SAS Text Miner [12]
- Singular Value Decomposition
- Filtering Junk Email
- Hotmail, Yahoo
- Advanced Search Engines
16Applications: Search Engines
17Vivisimo Search Engine (www.vivisimo.com)
18Experiments
- NBC
- Naïve Bayes Classifier
- Probabilistic
- NN
- Neural Networks
- ssFCM
- Semi-Supervised Fuzzy Clustering
- Fuzzy
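To make the fuzzy approach concrete, the sketch below shows a simplified semi-supervised fuzzy c-means loop. It is not the exact ssFCM algorithm of [1]. It assumes that labeled points keep crisp, fixed memberships, that every cluster has at least one labeled point, and that unlabeled points receive the standard fuzzy c-means membership update. The function name `ssfcm`, the `label_weight` parameter, and the toy vectors are hypothetical.

```python
def ssfcm(labeled, unlabeled, n_clusters, m=2.0, iterations=20, label_weight=1.0):
    """labeled: list of (vector, cluster_index); unlabeled: list of vectors.
    Returns (centers, memberships of the unlabeled points)."""
    dim = len(labeled[0][0])

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Labeled points keep crisp memberships that are never updated.
    fixed = [[1.0 if i == y else 0.0 for i in range(n_clusters)] for _, y in labeled]
    points = [x for x, _ in labeled] + list(unlabeled)
    weights = [label_weight] * len(labeled) + [1.0] * len(unlabeled)

    def centers_from(memberships):
        centers = []
        for i in range(n_clusters):
            num = [0.0] * dim
            den = 0.0
            for x, u, w in zip(points, memberships, weights):
                coeff = w * u[i] ** m
                den += coeff
                for j in range(dim):
                    num[j] += coeff * x[j]
            centers.append([v / den for v in num])
        return centers

    # Unlabeled points start with zero membership, so the first centers
    # are the means of the labeled points in each cluster.
    memberships = fixed + [[0.0] * n_clusters for _ in unlabeled]
    for _ in range(iterations):
        centers = centers_from(memberships)
        updated = []
        for x in unlabeled:
            d = [max(dist2(x, c), 1e-12) for c in centers]
            # Standard FCM update; d holds squared distances, hence exponent -1/(m-1).
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d]
            total = sum(inv)
            updated.append([v / total for v in inv])
        memberships = fixed + updated
    return centers, memberships[len(labeled):]

# Hypothetical toy data: two labeled seeds and three unlabeled points.
labeled = [([0.0, 0.0], 0), ([10.0, 10.0], 1)]
unlabeled = [[1.0, 0.5], [9.0, 9.5], [0.5, 1.5]]
centers, u = ssfcm(labeled, unlabeled, n_clusters=2)
labels = [row.index(max(row)) for row in u]   # harden the fuzzy memberships
```

Raising `label_weight` above 1.0 makes the labeled points pull the cluster centers more strongly than the unlabeled ones.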
19Datasets (20 Newsgroups Data)
Dataset    min2   min4   min6
Features   --     9467   5685

Dataset    Sampling Percentage   Number of Features
Sample25   25                    13925
Sample30   30                    15067
Sample35   35                    16737
Sample40   40                    16871
Sample45   45                    17712
Sample50   50                    19135

[Figure: diagram with the labels Raw Data, Sampling I, Sampling II, and Vectors]
20Naïve Bayes Classifier
SAMPLE     TRAINING (%)   TEST (%)   ACCURACY (%)
Sample25   20             80         34.4637
Sample25   63             36         48.4945
Sample25   76             23         50.9322
Sample25   82             17         47.7728
Sample25   86             13         48.9971
Sample25   20             80         31.5436
Sample25   63             36         48.0729
Sample25   76             23         47.8661
Sample25   82             17         50.5568
Sample25   86             13         50.4587
Sample30   33             66         39.1137
Sample30   66             33         46.4233
Sample30   77             22         48.5528
Sample30   83             16         52.7383
Sample30   86             13         51.2136
Sample30   33             66         39.26
Sample30   66             33         47.0192
Sample30   77             22         48.8439
Sample30   83             16         49.6907
Sample30   86             13         51.6169
21Naïve Bayes Classifier
22NBC
[Figure panels: Sample25, Sample30]
23ssFCM
[Figure panels: Effect of Labeled Data, Effect of Unlabeled Data]
24ssFCM
25Further Work
- Ensemble of Classifiers [16]
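As a rough illustration of the ensemble idea, the sketch below trains several classifiers on bootstrap resamples of the training data (bagging) and combines them by majority vote. It is only one possible ensemble scheme and is not taken from the experiments above. The base learner `train_nearest_neighbor`, the helper names, and the toy data are hypothetical; in this setting the base learners could instead be NBC, NN, or ssFCM.

```python
import random
from collections import Counter

def bagging(train_fn, data, n_models, rng):
    """Train n_models classifiers, each on a bootstrap resample of data."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # sample with replacement
        models.append(train_fn(sample))
    return models

def majority_vote(models, x):
    """Return the class predicted by the largest number of models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def train_nearest_neighbor(sample):
    """Hypothetical base learner: 1-nearest-neighbour on a single numeric feature."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

# Hypothetical toy data: (feature, class) pairs.
data = [(0.0, "a"), (0.2, "a"), (0.4, "a"), (2.0, "b"), (2.2, "b"), (2.4, "b")]
models = bagging(train_nearest_neighbor, data, n_models=11, rng=random.Random(0))
prediction = majority_vote(models, 0.1)
```

An odd number of models avoids ties in a two-class vote.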
26Further Work
- Knowledge Gathering from Experts
- E.g. 3-class data
[Figure: diagram with the labels Input Data (C1, C2, C3), Test Data (?), Classifier, C1, C2, and C3]
27References
- [1] Text Classification using Semi-Supervised Fuzzy Clustering, Girish Keswani and L.O. Hall, appeared in IEEE WCCI 2002 conference.
- [2] Using Unlabeled Data to Improve Text Classification, Kamal Paul Nigam.
- [3] Text Classification from Labeled and Unlabeled Documents using EM, Kamal Paul Nigam et al.
- [4] The Value of Unlabeled Data for Classification Problems, Tong Zhang.
- [5] Learning from Partially Labeled Data, Martin Szummer et al.
- [6] Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint, Yoshimasa Tsuruoka and Junichi Tsujii.
- [7] Automatic Title Generation using EM, Paul E. Kennedy and Alexander G. Hauptmann.
- [8] Unlabeled Data can degrade Classification Performance of Generative Classifiers, Fabio G. Cozman and Ira Cohen.
- [9] Probabilistic Classification and Clustering in Relational Data, Ben Taskar et al.
- [10] Using Clustering to Boost Text Classification, Y.C. Fang et al.
- [11] IBM Text Analyzer: A decision-tree-based symbolic rule induction system for text categorization, D.E. Johnson et al.
- [12] SAS Text Miner, Reincke.
- [13] Pattern Recognition, Duda and Hart, 2000.
- [14] Machine Learning, Tom Mitchell.
- [15] Data Mining, Margaret Dunham.
- [16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/