Recent Trends in Text Mining - PowerPoint PPT Presentation

Title: Recent Trends in Text Mining


1
Recent Trends in Text Mining
  • Girish Keswani
  • gkeswani_at_micron.com

2
Text Mining?
  • What?
  • Data Mining on Text Data
  • Why?
  • Information Retrieval
  • Confusion Set Disambiguation
  • Topic Distillation
  • How?
  • Data Mining

3
Organization
  • Text Mining Algorithms
  • Jargon Used
  • Background
  • Data Modeling,
  • Text Classification, and
  • Text Clustering
  • Applications
  • Experiments: NBC, NN and ssFCM
  • Further work
  • References

4
Text Mining Algorithms
  • Classification Algorithms
  • Naïve Bayes Classifier
  • Decision Trees
  • Neural Networks
  • Clustering Algorithms
  • EM Algorithms
  • Fuzzy

5
Jargon
  • DM: Data Mining
  • IR: Information Retrieval
  • NBC: Naïve Bayes Classifier
  • EM: Expectation Maximization
  • NN: Neural Networks
  • ssFCM: Semi-Supervised Fuzzy C-Means
  • Labeled Data (Training Data)
  • Unlabeled Data
  • Test Data

6
Background: Modeling
  • Vector Space Model
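
As a minimal sketch of the vector space model, each document can be mapped to a vector of term counts and two documents compared by the cosine of the angle between their vectors. The whitespace tokenizer, the raw term-frequency weighting, and the function names `term_vector` and `cosine_similarity` are assumptions made for illustration; the slides do not say which weighting scheme was used.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to a sparse term-frequency vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Counter returns 0 for terms that are missing from b.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = term_vector("text mining is data mining on text")
doc2 = term_vector("data mining on text data")
doc3 = term_vector("neural networks learn weights")

print(cosine_similarity(doc1, doc2))  # shared terms give a positive score
print(cosine_similarity(doc1, doc3))  # no shared terms give 0.0
```

Documents that share no terms score 0.0, and a document compared with itself scores 1.0 up to rounding.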

7
Background: Modeling
  • Generative Models of Data [13]: Probabilistic
  • To generate a document, a class is first
    selected based on its prior probability, and then
    a document is generated using the parameters of
    the chosen class distribution
  • NBC and EM Algorithms are based on this model
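
The generative view above leads directly to a Naïve Bayes classifier: estimate a prior for each class and a word distribution for each class, then pick the class with the highest posterior score for a new document. The sketch below assumes a multinomial word model with Laplace smoothing (`alpha`); the toy junk-mail data and the names `train_nbc` and `classify` are illustrative and are not taken from the experiments in these slides.

```python
import math
from collections import Counter, defaultdict


def train_nbc(labeled_docs, alpha=1.0):
    """Estimate log class priors and log per-class word probabilities.

    labeled_docs: list of (list_of_words, class_label) pairs.
    alpha: Laplace smoothing constant (an assumption for this sketch).
    """
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_like


def classify(words, log_prior, log_like):
    """Return the class maximizing log P(c) + sum of log P(w | c)."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in words if w in log_like[c]
        )
    return max(scores, key=scores.get)


train = [
    ("cheap pills buy now".split(), "junk"),
    ("buy cheap watches".split(), "junk"),
    ("meeting agenda for monday".split(), "ok"),
    ("project meeting notes".split(), "ok"),
]
log_prior, log_like = train_nbc(train)
print(classify("buy cheap pills".split(), log_prior, log_like))  # junk
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.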

8
Importance of Unlabeled Data?
[Figure: diagram of Labeled Data, Unlabeled Data, and Test Data, with regions marked A through G]
  • Provides access to feature distribution in set F
    using joint probability distributions

9
How to make use of Unlabeled Data?
10
How to make use of Unlabeled Data?
11
Experimental Results 1
Using NBC, EM and ssFCM
12
Experimental Results 2
Using NBC and EM
13
Extensions and Variants of these approaches
  • The authors of [6] propose a Class
    Distribution Constraint matrix
  • Results on Confusion Set Disambiguation
  • Automatic Title Generation [7]
  • Using EM Algorithm
  • Non-extractive approach

14
Relational Data [9]
  • Relational data is a collection of data in which
    the relations between entities are specified
  • Probabilistic Relational Models

15
Commercial Use/Products
  • IBM Text Analyzer [11]
  • Decision Tree Based
  • SAS Text Miner [12]
  • Singular Value Decomposition
  • Filtering Junk Email
  • Hotmail, Yahoo
  • Advanced Search Engines

16
Applications: Search Engines
17
Vivisimo Search Engine (www.vivisimo.com)
18
Experiments
  • NBC
  • Naïve Bayes Classifier
  • Probabilistic
  • NN
  • Neural Networks
  • ssFCM
  • Semi-Supervised Fuzzy Clustering
  • Fuzzy
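
To make the fuzzy clustering idea concrete, here is a rough sketch of a semi-supervised fuzzy c-means loop. It assumes a common simplified variant in which labeled points keep crisp, fixed memberships and only unlabeled points have their memberships updated; it is not necessarily the exact ssFCM formulation of [1]. The function name `ss_fcm`, the fuzzifier `m=2.0`, the fixed iteration count, and the toy 2-D points are all assumptions, and the sketch expects at least one labeled point per cluster.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ss_fcm(points, labels, n_clusters, m=2.0, n_iter=50):
    """Semi-supervised fuzzy c-means sketch.

    points: list of feature vectors (lists of floats).
    labels: cluster index for labeled points, None for unlabeled points.
    Returns (centers, memberships); memberships[k][i] is the degree to
    which point k belongs to cluster i.
    """
    n, dim = len(points), len(points[0])
    # Crisp memberships for labeled points, uniform for unlabeled ones.
    u = []
    for lab in labels:
        if lab is None:
            u.append([1.0 / n_clusters] * n_clusters)
        else:
            u.append([1.0 if i == lab else 0.0 for i in range(n_clusters)])
    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for i in range(n_clusters):
            w = [u[k][i] ** m for k in range(n)]
            total = sum(w)
            centers.append(
                [sum(wk * p[d] for wk, p in zip(w, points)) / total
                 for d in range(dim)]
            )
        # Membership update, applied to unlabeled points only.
        for k in range(n):
            if labels[k] is not None:
                continue
            dists = [sq_dist(points[k], c) for c in centers]
            if min(dists) == 0.0:
                hit = dists.index(0.0)
                u[k] = [1.0 if i == hit else 0.0 for i in range(n_clusters)]
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in dists]
                s = sum(inv)
                u[k] = [v / s for v in inv]
    return centers, u


points = [[0, 0], [10, 10], [1, 0], [0, 1], [9, 10], [10, 9]]
labels = [0, 1, None, None, None, None]  # two labeled, four unlabeled
centers, u = ss_fcm(points, labels, n_clusters=2)
print([round(row[0], 3) for row in u])  # membership of each point in cluster 0
```

Unlike a hard clustering, each unlabeled point ends up with a graded membership in every cluster, and the labeled points pull the cluster centers toward the known classes.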

19
Datasets (20 Newsgroups Data)
  • Sampling I
  • Sampling II

Dataset    min2   min4   min6
Features   --     9467   5685

Dataset    Sampling Percentage   Number of Features
Sample25   25                    13925
Sample30   30                    15067
Sample35   35                    16737
Sample40   40                    16871
Sample45   45                    17712
Sample50   50                    19135

[Figure: diagram relating Raw Data, Sampling I, Sampling II, and Vectors]
20
Naïve Bayes Classifier
SAMPLE     TRAINING   TEST   ACCURACY
Sample25   20         80     34.4637
Sample25   63         36     48.4945
Sample25   76         23     50.9322
Sample25   82         17     47.7728
Sample25   86         13     48.9971
Sample25   20         80     31.5436
Sample25   63         36     48.0729
Sample25   76         23     47.8661
Sample25   82         17     50.5568
Sample25   86         13     50.4587
Sample30   33         66     39.1137
Sample30   66         33     46.4233
Sample30   77         22     48.5528
Sample30   83         16     52.7383
Sample30   86         13     51.2136
Sample30   33         66     39.26
Sample30   66         33     47.0192
Sample30   77         22     48.8439
Sample30   83         16     49.6907
Sample30   86         13     51.6169
21
Naïve Bayes Classifier
22
NBC
[Figure: NBC results, with panels for Sample25 and Sample30]
23
ssFCM
[Figure: ssFCM results, with panels "Effect of Labeled Data" and "Effect of Unlabeled Data"]

24
ssFCM
25
Further Work
  • Ensemble of Classifiers [16]
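
One simple way to combine classifiers is majority voting, sketched below. The slide does not say which ensemble method is intended, so plain voting is an assumption here, and the three keyword-based classifiers are hypothetical stand-ins for trained models such as NBC, NN, or ssFCM.

```python
from collections import Counter


def majority_vote(classifiers, document):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(document) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical toy classifiers over a tokenized document.
def has_buy(doc):
    return "junk" if "buy" in doc else "ok"


def has_cheap(doc):
    return "junk" if "cheap" in doc else "ok"


def is_short(doc):
    return "junk" if len(doc) < 3 else "ok"


ensemble = [has_buy, has_cheap, is_short]
print(majority_vote(ensemble, "buy cheap pills now".split()))  # junk
```

Using an odd number of classifiers on a two-class problem avoids tied votes.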

26
Further Work
  • Knowledge Gathering from Experts
  • E.g. 3-class data

[Figure: diagram with Input Data (C1, C2, C3), Test Data (?), a Classifier, and classes C1, C2, C3]
27
References
  • [1] Text Classification using Semi-Supervised
    Fuzzy Clustering, Girish Keswani and L.O. Hall,
    appeared in IEEE WCCI 2002 conference.
  • [2] Using Unlabeled Data to Improve Text
    Classification, Kamal Paul Nigam.
  • [3] Text Classification from Labeled and
    Unlabeled Documents using EM, Kamal Paul Nigam
    et al.
  • [4] The Value of Unlabeled Data for
    Classification Problems, Tong Zhang.
  • [5] Learning from Partially Labeled Data,
    Martin Szummer et al.
  • [6] Training a Naïve Bayes Classifier via the EM
    Algorithm with a Class Distribution Constraint,
    Yoshimasa Tsuruoka and Junichi Tsujii.
  • [7] Automatic Title Generation using EM, Paul
    E. Kennedy and Alexander G. Hauptmann.
  • [8] Unlabeled Data can degrade Classification
    Performance of Generative Classifiers, Fabio G.
    Cozman and Ira Cohen.
  • [9] Probabilistic Classification and Clustering
    in Relational Data, Ben Taskar et al.
  • [10] Using Clustering to Boost Text
    Classification, Y.C. Fang et al.
  • [11] IBM Text Analyzer: A decision-tree-based
    symbolic rule induction system for text
    categorization, D.E. Johnson et al.
  • [12] SAS Text Miner, Reincke.
  • [13] Pattern Recognition, Duda and Hart, 2000.
  • [14] Machine Learning, Tom Mitchell.
  • [15] Data Mining, Margaret Dunham.
  • [16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/