1. Enhancing Text Analysis via Dimensionality Reduction
- David G. Underhill, Luke K. McDowell
- Computer Science Department, United States Naval Academy
- David J. Marchette, Jeffrey L. Solka
- Naval Surface Warfare Center, Dahlgren Division
2. How to make sense of an overwhelming amount of data?
3. How to make sense of an overwhelming amount of data?
- Can dimensionality reduction help?
4. Outline
- Problem Statement
- Background
- Text Mining Process
- Dimensionality Reduction
- Experimental Analysis
- Classification
- Contributions and Conclusions
- Future Work
5. Text Mining Overview
[Diagram: documents are encoded into a Term-Document Matrix, compared to produce a Distance Matrix, and then analyzed.]
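The encode and compare steps can be illustrated with a short sketch. This is not the authors' code, only an assumed scikit-learn rendering, and the tiny corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical mini-corpus standing in for a real document collection.
docs = [
    "galaxies and dark matter in astronomy",
    "protein folding and gene expression",
    "dark energy and the expanding universe",
]

# Encode: build a (weighted) term-document matrix.
vectorizer = TfidfVectorizer(stop_words="english")
term_doc = vectorizer.fit_transform(docs)   # shape: (n_docs, n_terms)

# Compare: derive an interpoint distance matrix between documents.
dist = cosine_distances(term_doc)           # shape: (n_docs, n_docs)
print(dist.round(2))
```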
6. Text Mining Overview
7. Dimensionality Reduction (DR)
- Goal: simplify a complex data set in a way that preserves the meanings inherent in the original data (a minimal sketch follows below)
- Usually applied to geometric or numerical data
- How can DR improve text mining?
- May reveal patterns obscured in the original data
- Improves analysis time over the original, larger data
- Greatly decreases storage and transmission costs
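To make the idea concrete, here is a minimal PCA sketch, my own illustration rather than anything from the slides, that compresses a high-dimensional term-document matrix into a few dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical term-document matrix: 100 documents described by 5000 terms.
rng = np.random.default_rng(0)
term_doc = rng.random((100, 5000))

# Reduce to 10 dimensions while retaining as much variance as possible.
pca = PCA(n_components=10)
reduced = pca.fit_transform(term_doc)        # shape: (100, 10)

print(reduced.shape)
print(pca.explained_variance_ratio_.sum())   # fraction of variance kept
```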
8. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question and Method
- Task 1: Classification
- Nearest Neighbor Classifier
- Linear Classifier
- Quadratic Classifier
- Contributions and Conclusions
- Future Work
9. Experimental Question
- Can DR improve text mining performance?
- Many valid DR approaches
- Relative DR performance unknown for textual data
Ultimate Goal: Identify DR techniques that best facilitate text mining.
10. Experimental Method
- Evaluate 5 DR methods (see the sketch after this list)
- Linear
- 1) PCA (Principal Components Analysis)
- 2) MDS (Multidimensional Scaling)
- Non-Linear
- 3) Isomap
- 4) LLE (Locally Linear Embedding)
- 5) LDM (Lafon's Diffusion Maps)
- Baseline
- None-Sort: original features sorted by average weight
- Evaluate 3 classifiers
- 1) Nearest Neighbor
- 2) Linear
- 3) Quadratic
- Evaluate 3 data sets
- 1) Science News
- 2) Google News
- 3) Science Technology
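The deck shows no code; purely as an assumed mapping onto scikit-learn (not the authors' implementation), the five reducers and three classifiers could be set up as below. Scikit-learn has no diffusion-map class, so SpectralEmbedding appears only as a loose stand-in for LDM; the None-Sort baseline would simply keep the highest-weighted original features instead of transforming them.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

N_DIMS = 10  # number of reduced dimensions to evaluate (illustrative choice)

# The five DR methods under evaluation (SpectralEmbedding is only a rough
# stand-in for Lafon's diffusion maps, which scikit-learn does not provide).
reducers = {
    "PCA": PCA(n_components=N_DIMS),
    "MDS": MDS(n_components=N_DIMS),
    "Isomap": Isomap(n_components=N_DIMS, n_neighbors=12),
    "LLE": LocallyLinearEmbedding(n_components=N_DIMS, n_neighbors=12),
    "LDM (stand-in)": SpectralEmbedding(n_components=N_DIMS),
}

# The three classifiers applied to each reduced representation.
classifiers = {
    "Nearest Neighbor": KNeighborsClassifier(n_neighbors=9),
    "Linear": LinearDiscriminantAnalysis(),
    "Quadratic": QuadraticDiscriminantAnalysis(),
}
```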
11. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question
- Classification
- Nearest Neighbor Classifier
- Linear Classifier
- Quadratic Classifier
- Contributions and Conclusions
- Future Work
12. Classification
- Labeling documents with known categories, based on training data
- Assessment: accuracy of category assignments
13. k-Nearest Neighbor Classifier
- Assign category based on the k nearest neighbors (sketched below)
- Most frequent category among them is assigned
- k = 9 used for the following graphs
- Trends are similar for other values of k
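As an illustration only (assumed scikit-learn usage on hypothetical reduced data, not the talk's code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical reduced representation: 200 documents in 10 dimensions,
# each labeled with one of 4 categories.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 4, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 9, as on the following slides: each test document receives the most
# frequent category among its 9 nearest training documents.
knn = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))
```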
14. kNN Classifier on Science News
15. kNN Classifier on Google News
16. kNN Classifier on Science Technology
17. Linear Classifier
- Assign category based on a linear combination of features (sketched below)
- Assumes features are normally distributed
- Results for the quadratic classifier, which doesn't make this assumption, were comparable
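Purely as an illustration (assuming scikit-learn on hypothetical data, not the authors' implementation), the linear and quadratic classifiers correspond to linear and quadratic discriminant analysis:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Hypothetical reduced data: 200 documents, 10 dimensions, 4 categories.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)

# Linear classifier: decision rule is a linear combination of the features,
# derived under a Gaussian model with a shared covariance matrix.
lda = LinearDiscriminantAnalysis().fit(X, y)

# Quadratic classifier: each class keeps its own covariance, giving
# quadratic decision boundaries instead of linear ones.
qda = QuadraticDiscriminantAnalysis().fit(X, y)

print(lda.predict(X[:5]), qda.predict(X[:5]))
```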
18. Linear Classifier on Science News
19. Linear Classifier on Google News
20. Linear Classifier on Science Technology
21. Outline
- Problem Statement
- Background
- Experimental Analysis
- Contributions and Conclusions
- Future Work
22. Classification Results
- Applying DR improves accuracy versus not applying DR, for a fixed number of dimensions
- The best DR techniques achieve high accuracy in few dimensions
- MDS and Isomap yield the most consistent and reliable results
- This advantage is more pronounced on difficult corpora
- Contradicts van der Maaten et al. (2007), whose results show PCA best but which evaluates only one textual data set
- PCA is good, but not the best; it suffers on harder data sets
23. Outline
- Problem Statement
- Background
- Experimental Analysis
- Contributions and Conclusions
- Future Work
24. Future Work
- More precisely characterize the MDS and Isomap advantage
- Investigate other classification methods
- Evaluate data sets with different kinds of information
25. Acknowledgements
- Trident Scholar Research Program
- Office of Naval Research
26. Enhancing Text Analysis via Dimensionality Reduction
- David G. Underhill, Luke K. McDowell
- Computer Science Department, United States Naval Academy
- David J. Marchette, Jeffrey L. Solka
- Naval Surface Warfare Center, Dahlgren Division
27. 2-Dimensional Visualizations
- Reduction to just 2 dimensions
- Easy visualization: graph on a Cartesian plot (sketched below)
- Each point is colored according to its category
- Assess quality of separation with the best 2 dimensions
- Highlight areas of confusion
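A minimal sketch of this kind of plot (assumed matplotlib/scikit-learn usage on hypothetical data, not the slides' actual figures):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical documents in a high-dimensional feature space, with labels.
rng = np.random.default_rng(2)
X = rng.random((150, 50))
labels = rng.integers(0, 3, size=150)

# Reduce to 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform(X)

# One color per category; overlapping clusters reveal areas of confusion.
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10")
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.title("2-D visualization colored by category")
plt.show()
```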
28. 2D Visualization of Science News (2-cat)
29. 2D Visualization of Science News (8-cat)
30. 2D Visualization of Google News
31. 2D Visualization of Science Technology
32. kNN Classifier on Science News (2-category)
33. kNN Classifier on Science News (4-OL)