Title: Recent Trends in Text Mining

1
Recent Trends in Text Mining

Girish Keswani
gkeswani_at_micron.com

2
Text Mining?

What?
Data Mining on Text Data
Why?
Information Retrieval
Confusion Set Disambiguation
Topic Distillation
How?
Data Mining

3
Organization

Text Mining Algorithms
Jargon Used
Background
Data Modeling,
Text Classification, and
Text Clustering
Applications
Experiments: NBC, NN and ssFCM
Further work
References

4
Text Mining Algorithms

Classification Algorithms
Naïve Bayes Classifier
Decision Trees
Neural Networks
Clustering Algorithms
EM Algorithms
Fuzzy

5
Jargon

DM: Data Mining
IR: Information Retrieval
NBC: Naïve Bayes Classifier
EM: Expectation Maximization
NN: Neural Networks
ssFCM: Semi-Supervised Fuzzy C-Means
Labeled Data (Training Data)
Unlabeled Data
Test Data

6
Background: Modeling

Vector Space Model
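
As a minimal sketch of the vector space model, each document can be represented as a sparse vector of term counts and compared by cosine similarity. The whitespace tokenizer, the raw term-frequency weighting, and the example sentences are assumptions made for illustration; the slides do not say which weighting scheme was used.

```python
import math
from collections import Counter

def tf_vector(text):
    """Map a document to a sparse term-frequency vector (word -> count)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical example documents.
doc_a = tf_vector("text mining finds patterns in text")
doc_b = tf_vector("data mining finds patterns in data")
doc_c = tf_vector("the weather is sunny today")

sim_ab = cosine_similarity(doc_a, doc_b)  # documents share four terms
sim_ac = cosine_similarity(doc_a, doc_c)  # documents share no terms
```

Documents that share vocabulary score closer to 1, and documents with no terms in common score 0.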

7
Background: Modeling

Generative Models of Data [13]: Probabilistic
To generate a document, a class is first
selected according to its prior probability; a
document is then generated using the parameters
of that class's distribution
NBC and EM algorithms are based on this model
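
The generative view can be sketched as a small multinomial Naïve Bayes classifier: class priors come from the label counts, and the per-class word distribution comes from smoothed word counts. The Laplace (add-one) smoothing, the function names, and the toy documents are assumptions for illustration, not the setup used in the experiments.

```python
import math
from collections import Counter, defaultdict

def train_nbc(docs, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, model):
    """Return the class with the highest log posterior for a tokenized document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed estimate of P(word | class)
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
train_docs = [["goal", "match", "team"], ["team", "score", "goal"],
              ["cpu", "memory", "disk"], ["disk", "cpu", "driver"]]
train_labels = ["sports", "sports", "hardware", "hardware"]
model = train_nbc(train_docs, train_labels)
```

Calling `classify(["goal", "team"], model)` picks the class whose prior and word distribution best explain the document.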

8
Importance of Unlabeled Data?
[Figure: diagram with regions A to G over Labeled Data, Unlabeled Data, and Test Data]

Unlabeled data provides access to the feature
distribution in set F through joint probability
distributions

9
How to make use of Unlabeled Data?
10
How to make use of Unlabeled Data?
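
One way to use unlabeled data, in the spirit of the EM approach cited in [3], is sketched below: a Naïve Bayes model is initialised from the labeled documents, then EM alternates between assigning soft class probabilities to the unlabeled documents and re-estimating the model from all documents. This is a simplified illustration under assumed names, smoothing, and toy data; it is not the exact procedure from the slides or from [3].

```python
import math
from collections import defaultdict

def m_step(docs, weights, classes, vocab):
    """Re-estimate priors and word probabilities from (possibly soft) class weights."""
    priors, word_probs = {}, {}
    for c in classes:
        class_weight = sum(w[c] for w in weights)
        priors[c] = (class_weight + 1) / (len(docs) + len(classes))
        counts = defaultdict(float)
        for doc, w in zip(docs, weights):
            for word in doc:
                counts[word] += w[c]  # fractional count for soft labels
        total = sum(counts.values())
        word_probs[c] = {v: (counts[v] + 1) / (total + len(vocab)) for v in vocab}
    return priors, word_probs

def e_step(doc, priors, word_probs):
    """Posterior class probabilities for one document under the current model."""
    log_scores = {
        c: math.log(priors[c])
        + sum(math.log(word_probs[c][w]) for w in doc if w in word_probs[c])
        for c in priors
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: s / z for c, s in exp_scores.items()}

def semi_supervised_em(labeled, unlabeled, classes, iterations=10):
    """labeled: list of (tokens, class); unlabeled: list of token lists."""
    labeled_docs = [d for d, _ in labeled]
    docs = labeled_docs + unlabeled
    vocab = {w for d in docs for w in d}
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initialise from the labeled documents only.
    priors, word_probs = m_step(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, priors, word_probs) for d in unlabeled]
        priors, word_probs = m_step(docs, hard + soft, classes, vocab)
    return priors, word_probs

# Hypothetical toy data: two labeled and four unlabeled documents.
classes = ["sports", "hardware"]
labeled = [(["goal", "team"], "sports"), (["cpu", "disk"], "hardware")]
unlabeled = [["goal", "match"], ["team", "match", "score"],
             ["cpu", "driver"], ["disk", "driver", "memory"]]
priors, word_probs = semi_supervised_em(labeled, unlabeled, classes)
posterior = e_step(["match", "score"], priors, word_probs)
```

In this toy setup the words "match" and "score" never occur in a labeled document, so a model trained on the labeled data alone cannot separate the classes on them; after EM, their co-occurrence with labeled vocabulary in the unlabeled documents shifts the posterior toward "sports".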
11
Experimental Results 1
Using NBC, EM and ssFCM
12
Experimental Results 2
Using NBC and EM
13
Extensions and Variants of these approaches

The authors of [6] propose a Class
Distribution Constraint matrix
Results on Confusion Set Disambiguation
Automatic Title Generation [7]
Using EM Algorithm
Non-extractive approach

14
Relational Data [9]

A collection of data in which the relations
between entities are made explicit is known as
relational data
Probabilistic Relational Models

15
Commercial Use/Products

IBM Text Analyzer [11]
Decision Tree Based
SAS Text Miner [12]
Singular Value Decomposition
Filtering Junk Email
Hotmail, Yahoo
Advanced Search Engines

16
Applications: Search Engines
17
Vivisimo Search Engine (www.vivisimo.com)
18
Experiments

NBC
Naïve Bayes Classifier
Probabilistic
NN
Neural Networks
ssFCM
Semi-Supervised Fuzzy Clustering
Fuzzy
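
To make the fuzzy clustering idea concrete, here is a simplified sketch of a semi-supervised fuzzy c-means loop: labeled points keep a fixed crisp membership in their class, while unlabeled points receive the standard fuzzy c-means memberships. The clamping scheme, the fuzzifier `m=2.0`, the 2-D toy points, and all function names are assumptions for illustration; the ssFCM variant used in [1] may weight labeled points differently. For text, the points would be document term vectors.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # The point coincides with one or more centers: share membership among them.
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

def update_centers(points, memberships, n_clusters, m=2.0):
    """Each center is the mean of all points weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for k in range(n_clusters):
        weights = [u[k] ** m for u in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(dim)
        ))
    return centers

def ss_fcm(labeled, unlabeled, n_clusters, m=2.0, iterations=20):
    """labeled: list of (point, cluster_index); every cluster needs a labeled point."""
    fixed = [[1.0 if k == y else 0.0 for k in range(n_clusters)] for _, y in labeled]
    labeled_points = [p for p, _ in labeled]
    points = labeled_points + unlabeled
    # Initial centers come from the labeled points alone.
    centers = update_centers(labeled_points, fixed, n_clusters, m)
    for _ in range(iterations):
        soft = [fcm_memberships(p, centers, m) for p in unlabeled]
        centers = update_centers(points, fixed + soft, n_clusters, m)
    return centers, soft

# Hypothetical 2-D data: one labeled point per cluster, four unlabeled points.
labeled = [((0.0, 0.0), 0), ((10.0, 10.0), 1)]
unlabeled = [(1.0, 0.5), (0.5, 1.0), (9.0, 9.5), (9.5, 9.0)]
centers, soft = ss_fcm(labeled, unlabeled, n_clusters=2)
```

Each row of `soft` sums to 1 and gives the degree to which an unlabeled point belongs to each cluster, rather than a single hard assignment.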

19
Datasets (20 Newsgroups Data)

Sampling I
Sampling II

Dataset   min2  min4  min6
Features  --    9467  5685

Dataset   Sampling Percentage  Number of Features
Sample25  25                   13925
Sample30  30                   15067
Sample35  35                   16737
Sample40  40                   16871
Sample45  45                   17712
Sample50  50                   19135

[Diagram labels: Raw Data, Sampling I, Sampling II, Vectors]
20
Naïve Bayes Classifier
SAMPLE    TRAINING  TEST  ACCURACY
Sample25  20        80    34.4637
Sample25  63        36    48.4945
Sample25  76        23    50.9322
Sample25  82        17    47.7728
Sample25  86        13    48.9971
Sample25  20        80    31.5436
Sample25  63        36    48.0729
Sample25  76        23    47.8661
Sample25  82        17    50.5568
Sample25  86        13    50.4587
Sample30  33        66    39.1137
Sample30  66        33    46.4233
Sample30  77        22    48.5528
Sample30  83        16    52.7383
Sample30  86        13    51.2136
Sample30  33        66    39.26
Sample30  66        33    47.0192
Sample30  77        22    48.8439
Sample30  83        16    49.6907
Sample30  86        13    51.6169
21
Naïve Bayes Classifier
22
NBC
[Charts: Sample25, Sample30]
23
ssFCM
[Charts: Effect of Labeled Data, Effect of Unlabeled Data]

24
ssFCM
25
Further Work

Ensemble of Classifiers [16]
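
A simple way to combine classifiers is majority voting, sketched below. The slides do not say which ensemble method is intended, so voting, the keyword-rule base classifiers, and the example sentence are assumptions chosen only to illustrate the idea.

```python
from collections import Counter

def majority_vote(classifiers, document):
    """Return the label predicted by the most classifiers."""
    votes = [classify(document) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]

def keyword_rule(keyword, label, default):
    """Hypothetical base classifier: predict `label` if `keyword` occurs, else `default`."""
    def classify(document):
        return label if keyword in document.lower().split() else default
    return classify

# Three weak, hypothetical classifiers combined into one ensemble.
ensemble = [
    keyword_rule("goal", "sports", "hardware"),
    keyword_rule("team", "sports", "hardware"),
    keyword_rule("cpu", "hardware", "sports"),
]
prediction = majority_vote(ensemble, "the team bought a new cpu")
```

Here two of the three rules vote "hardware", so the ensemble overrides the single rule that was misled by the word "team".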

26
Further Work

Knowledge Gathering from Experts
E.g. 3-class data

[Diagram: Input Data (classes C1, C2, C3), Classifier, Test Data (class unknown)]
27
References

[1] Text Classification using Semi-Supervised Fuzzy Clustering, Girish Keswani and L.O. Hall, IEEE WCCI 2002 conference.
[2] Using Unlabeled Data to Improve Text Classification, Kamal Paul Nigam.
[3] Text Classification from Labeled and Unlabeled Documents using EM, Kamal Paul Nigam et al.
[4] The Value of Unlabeled Data for Classification Problems, Tong Zhang.
[5] Learning from Partially Labeled Data, Martin Szummer et al.
[6] Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint, Yoshimasa Tsuruoka and Junichi Tsujii.
[7] Automatic Title Generation using EM, Paul E. Kennedy and Alexander G. Hauptmann.
[8] Unlabeled Data can degrade Classification Performance of Generative Classifiers, Fabio G. Cozman and Ira Cohen.
[9] Probabilistic Classification and Clustering in Relational Data, Ben Taskar et al.
[10] Using Clustering to Boost Text Classification, Y.C. Fang et al.
[11] IBM Text Analyzer: A decision-tree-based symbolic rule induction system for text categorization, D.E. Johnson et al.
[12] SAS Text Miner, Reincke.
[13] Pattern Recognition, Duda and Hart, 2000.
[14] Machine Learning, Tom Mitchell.
[15] Data Mining, Margaret Dunham.
[16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/