1
Enhancing Text Analysis via Dimensionality
Reduction
  • David G. Underhill and Luke K. McDowell
  • Computer Science Department, United States Naval
    Academy
  • David J. Marchette and Jeffrey L. Solka
  • Naval Surface Warfare Center, Dahlgren Division

3
How to make sense of an overwhelming amount of
data?
  • Can dimensionality reduction help?

4
Outline
  • Problem Statement
  • Background
  • Text Mining Process
  • Dimensionality Reduction
  • Experimental Analysis
  • Classification
  • Contributions and Conclusions
  • Future Work

5
Text Mining Overview
[Diagram: documents -> Encode -> Term-Document Matrix -> Compare -> Distance Matrix -> Analyze]
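The encode and compare stages named on this slide can be sketched as follows. This is a minimal illustration, not the presenters' implementation: whitespace tokenization, raw term counts, and cosine distance are all assumptions, and the three sample documents are invented for the example.

```python
import math
from collections import Counter

def encode(docs):
    """Encode: build a term-document matrix of raw term counts (one row per document)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]

def cosine_distance(u, v):
    """1 - cosine similarity; an assumed choice, the slides do not name a metric."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def compare(matrix):
    """Compare: turn the term-document matrix into a document-by-document distance matrix."""
    return [[cosine_distance(u, v) for v in matrix] for u in matrix]

# Hypothetical toy corpus.
docs = ["stars and galaxies", "galaxies and planets", "stocks and bonds"]
vocab, tdm = encode(docs)
dist = compare(tdm)
```

The analyze stage (for example, classification) would then work from `tdm` or `dist`.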
6
Text Mining Overview
7
Dimensionality Reduction (DR)
  • Goal: simplify a complex data set in a way that
    preserves the meaning inherent in the original data
  • Usually applied to geometric or numerical data
  • How can DR improve text mining?
    • May reveal patterns obscured in the original data
    • Reduces analysis time relative to the original,
      larger data
    • Greatly decreases storage and transmission costs
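As one concrete instance of DR, the sketch below projects a data matrix onto its top principal components using NumPy's SVD. It assumes documents are rows and features are columns; the function name `pca_reduce` and the random input are illustrative only, not taken from the presentation.

```python
import numpy as np

def pca_reduce(X, d):
    """Project the rows of X onto the top-d principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

# Hypothetical data: 20 "documents" with 50 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
Z = pca_reduce(X, 3)                             # 20 x 3 instead of 20 x 50
```

The reduced matrix `Z` holds far fewer values than `X`, which is where the storage and analysis-time savings come from.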

8
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Experimental Question and Method
  • Task 1: Classification
  • Nearest Neighbor Classifier
  • Linear Classifier
  • Quadratic Classifier
  • Contributions and Conclusions
  • Future Work

9
Experimental Question
  • Can DR improve text mining performance?
  • Many valid DR approaches
  • Relative DR performance unknown for textual data

Ultimate goal: identify the DR techniques that best
facilitate text mining.
10
Experimental Method
  • Evaluate 5 DR methods
    • Linear
      • 1) PCA (Principal Components Analysis)
      • 2) MDS (Multidimensional Scaling)
    • Non-linear
      • 3) Isomap
      • 4) LLE (Locally Linear Embedding)
      • 5) LDM (Lafon's Diffusion Maps)
    • Baseline
      • None-Sort: original features sorted by average
        weight
  • Evaluate 3 classifiers
    • 1) Nearest Neighbor
    • 2) Linear
    • 3) Quadratic
  • Evaluate 3 data sets
    • 1) Science News
    • 2) Google News
    • 3) Science Technology
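The None-Sort baseline can be sketched as below. The slide only says the original features are sorted by average weight; sorting in descending order and keeping the first d features is an assumption made here so the baseline can be compared with DR at a fixed number of dimensions. The function name and the toy weights are hypothetical.

```python
def none_sort(matrix, d):
    """Keep the d original features with the highest average weight (no transformation)."""
    n_docs = len(matrix)
    n_feats = len(matrix[0])
    avg = [sum(row[j] for row in matrix) / n_docs for j in range(n_feats)]
    # Assumed ordering: highest average weight first.
    order = sorted(range(n_feats), key=lambda j: avg[j], reverse=True)
    keep = order[:d]
    return [[row[j] for j in keep] for row in matrix], keep

# Hypothetical term weights: 2 documents, 3 features.
weights = [[0.1, 0.9, 0.5],
           [0.2, 0.8, 0.4]]
reduced, kept = none_sort(weights, 2)
```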

11
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Experimental Question
  • Classification
  • Nearest Neighbor Classifier
  • Linear Classifier
  • Quadratic Classifier
  • Contributions and Conclusions
  • Future Work

12
Classification
  • Labeling documents with known categories based on
    training data

Assessment: accuracy of category assignments
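Accuracy here can be read as the fraction of documents whose assigned category matches the known one. A minimal sketch under that reading, with invented labels:

```python
def accuracy(predicted, actual):
    """Fraction of category assignments that match the known categories."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels: 3 of 4 assignments are correct.
score = accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```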
13
k-Nearest Neighbor Classifier
  • Assign category based on k nearest neighbors
  • Most frequent category is assigned
  • k = 9 used for the following graphs
  • Trends are similar for other values of k
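The voting rule on this slide can be sketched in a few lines. Euclidean distance is an assumption (the slides do not state the metric), and the training points below are invented; the default k = 9 follows the slide, while the tiny example uses k = 3.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=9):
    """Assign the most frequent category among the k nearest training points."""
    # Assumed metric: Euclidean distance in the (possibly reduced) feature space.
    nearest = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two categories.
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
near_a = knn_classify(train, labels, (0.5, 0.5), k=3)
near_b = knn_classify(train, labels, (5.5, 5.5), k=3)
```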

14
kNN Classifier on Science News
15
kNN Classifier on Google News
16
kNN Classifier on Science Technology
17
Linear Classifier
  • Assign category based on a linear combination of
    features
  • Assumes features are normally distributed
  • Results for the quadratic classifier, which
    doesn't make this assumption, were comparable
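One common way to build a linear classifier under a normality assumption is a Gaussian discriminant with a covariance matrix pooled across categories. The sketch below takes that approach; it is an assumption about the general technique, not a reproduction of the presenters' classifier, and the small ridge term and the synthetic data are added here for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Fit per-class linear scores assuming Gaussian features with a shared covariance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(X) - len(classes))   # pooled covariance
    cov += 1e-6 * np.eye(X.shape[1])                        # small ridge for stability
    cov_inv = np.linalg.inv(cov)
    priors = np.array([np.mean(y == c) for c in classes])
    W = means @ cov_inv                                     # one weight vector per class
    b = -0.5 * np.sum(W * means, axis=1) + np.log(priors)
    return classes, W, b

def predict_linear(model, X):
    """Pick the class whose linear score x.w + b is largest."""
    classes, W, b = model
    scores = np.asarray(X, dtype=float) @ W.T + b
    return classes[np.argmax(scores, axis=1)]

# Hypothetical data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
                     rng.normal(4.0, 1.0, size=(30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
model = fit_linear(X_train, y_train)
pred = predict_linear(model, [[0.0, 0.0], [4.0, 4.0]])
```

Each class score is linear in the features, which is what makes the decision boundaries linear.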

18
Linear Classifier on Science News
19
Linear Classifier on Google News
20
Linear Classifier on Science Technology
21
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Contributions and Conclusions
  • Future Work

22
Classification Results
  • Applying DR improves accuracy versus not applying
    DR for a fixed number of dimensions
  • Best DR techniques achieve high accuracy in few
    dimensions
  • MDS and Isomap yield the most consistent and
    reliable results
  • This advantage is more pronounced on difficult
    corpora
  • Contradicts van der Maaten et al. (2007), whose
    results show PCA performing best but who evaluate
    only one textual data set
  • PCA is good, but not the best: it suffers on
    harder data sets

23
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Contributions and Conclusions
  • Future Work

24
Future Work
  • More precisely characterize the advantage of MDS
    and Isomap
  • Investigate other classification methods
  • Evaluate data sets with different kinds of
    information

25
Acknowledgements
  • Trident Scholar Research Program
  • Office of Naval Research

27
2-Dimensional Visualizations
  • Reduction to just 2 dimensions
  • Easy visualization: graph on a Cartesian plot
  • Each point is colored according to its category
  • Assess quality of separation with best 2
    dimensions
  • Highlight areas of confusion
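A reduction to 2 dimensions can be sketched with classical MDS applied to a distance matrix. The slides do not say which DR method produced each plot, so MDS is used here only as an example; the input points are invented, and the plotting step (one color per category) is left to whatever plotting library is at hand.

```python
import numpy as np

def classical_mds(D, d=2):
    """Embed points in d dimensions from a matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:d]             # d largest eigenvalues
    scale = np.sqrt(np.clip(vals[top], 0.0, None))
    return vecs[:, top] * scale

# Hypothetical points; only their pairwise distances are given to MDS.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
Y = classical_mds(D, d=2)                        # one (x, y) pair per document
```

Each row of `Y` is a point that could be drawn on a Cartesian plot and colored by its category.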

28
2D Visualization of Science News (2-cat)
29
2D Visualization of Science News (8-cat)
30
2D Visualization of Google News
31
2D Visualization of Science Technology
32
kNN Classifier on Science News (2-category)
33
kNN Classifier on Science News (4-OL)