1. Exploring Dimensionality Reduction for Text Mining
- MIDN 1/C David G. Underhill
- Assistant Professor Lucas K. McDowell
- Computer Science Department
2. How to make sense of an overwhelming amount of data?
3. How to make sense of an overwhelming amount of data?
- Can dimensionality reduction help?
4. Outline
- Problem Statement
- Background
- Text Mining Process
- Dimensionality Reduction
- Experimental Analysis
- Task 1: Classification
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
5. Text Mining Overview
[Diagram: documents are Encoded into a Term-Document Matrix, which is Compared to produce a Distance Matrix, which is then Analyzed]
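A minimal sketch of the Encode and Compare steps, assuming a TF-IDF encoding and scikit-learn (the slides do not name an implementation):

```python
# Encode: build a term-document matrix; Compare: derive a distance matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

docs = [
    "fish oil reduces blood viscosity",        # toy stand-in corpus
    "platelet aggregation and blood flow",
    "electromagnetic properties of coins",
]

tdm = TfidfVectorizer().fit_transform(docs)      # documents x terms (TF-IDF weights)
dist = pairwise_distances(tdm, metric="cosine")  # documents x documents
print(dist.shape)  # (3, 3); entry [i, j] is the distance between docs i and j
```

The Analyze step then operates on either matrix, e.g. classification or pair scoring.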
6. Text Mining Overview
7. Dimensionality Reduction (DR)
- Goal: simplify a complex data set in a way that preserves meanings inherent in the original data
- Usually applied to geometric or numerical data
- How can DR improve text mining?
- May reveal patterns obscured in the original data
- Improves analysis time over the original, larger data
- Greatly decreases storage and transmission costs
8. 2-Dimensional Visualizations
- Reduction to just 2 dimensions
- Easy visualization: graph on a Cartesian plot (a sketch follows this list)
- Each point is colored according to its category
- Assess quality of separation with the best 2 dimensions
- Highlight areas of confusion
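A minimal sketch of such a plot, with PCA as the reducer and random stand-in data (the slides fix neither choice):

```python
# Reduce documents to 2 dimensions and scatter-plot them, colored by category.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((60, 300))            # stand-in for encoded document vectors
y = rng.integers(0, 4, size=60)      # stand-in category label per document

coords = PCA(n_components=2).fit_transform(X)  # best 2 linear dimensions
plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab10")
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.title("Documents in 2D, colored by category")
plt.show()
```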
9. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question and Method
- Task 1: Classification
- Nearest Neighbor Classifier
- Linear Classifier
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
10. Experimental Question
- Can DR improve text mining performance?
- Many valid DR approaches
- Relative DR performance unknown for textual data
Ultimate Goal: Identify DR techniques that best facilitate text mining.
11. Experimental Method
- Evaluate 5 DR methods (a code sketch follows this slide)
- Linear
- 1) PCA (Principal Components Analysis)
- 2) MDS (Multidimensional Scaling)
- Non-Linear
- 3) Isomap
- 4) LLE (Locally Linear Embedding)
- 5) LDM (Lafon's Diffusion Maps)
- Baseline
- None-Sort: original features sorted by average weight
- Evaluate on two text mining tasks
- 1) Classification
- 2) Literature-Based Discovery
- Evaluate with three data sets
- 1) Science News
- 2) Google News
- 3) Science & Technology
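A sketch of the six reductions under comparison, assuming scikit-learn implementations. LDM has no scikit-learn class, so SpectralEmbedding, a related spectral method, stands in for it here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

X = np.random.default_rng(0).random((100, 500))  # stand-in term-document matrix (dense)
d = 10                                           # target dimensionality

reduced = {
    "PCA":    PCA(n_components=d).fit_transform(X),
    "MDS":    MDS(n_components=d).fit_transform(X),
    "Isomap": Isomap(n_components=d).fit_transform(X),
    "LLE":    LocallyLinearEmbedding(n_components=d, n_neighbors=12).fit_transform(X),
    "~LDM":   SpectralEmbedding(n_components=d).fit_transform(X),  # stand-in, not Lafon's exact method
}

# None-Sort baseline: keep the d original features with the highest average weight.
top = np.argsort(X.mean(axis=0))[::-1][:d]
reduced["None-Sort"] = X[:, top]
```

Each reduced matrix then feeds the same downstream task, so the methods can be compared dimension-for-dimension.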
12. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question
- Task 1: Classification
- Nearest Neighbor Classifier
- Linear Classifier
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
13. Classification
- Labeling documents with known categories based on training data
Assessment: accuracy of category assignments
14. k-Nearest Neighbor Classifier
- Assign category based on the k nearest neighbors (sketched below)
- Most frequent category is assigned
- k = 9 used for the following graphs
- Trends similar for other values
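A minimal sketch of this classifier with k = 9, assuming scikit-learn and random stand-in data:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))            # stand-in for DR-reduced document vectors
y = rng.integers(0, 8, size=200)     # 8 categories, as in Science News

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr)  # majority vote of the 9 nearest
print(accuracy_score(y_te, knn.predict(X_te)))             # accuracy = assessment metric
```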
15. kNN Classifier on Science News (8-category)
16. kNN Classifier on Google News
17. kNN Classifier on Science & Technology
18. kNN Classifier on Science News (2-category)
19. kNN Classifier on Science News (4-category)
20. Linear Classifier
- Assign category based on a linear combination of features (sketched below)
- Assumes features are normally distributed
- Results for the quadratic classifier, which doesn't make this assumption, were comparable
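One plausible instantiation, assuming the linear and quadratic classifiers correspond to scikit-learn's LDA and QDA (the slides do not name an implementation):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,     # linear: Gaussian features, shared covariance
    QuadraticDiscriminantAnalysis,  # quadratic: per-class covariance instead
)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # stand-in for DR-reduced document vectors
y = rng.integers(0, 4, size=200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y), qda.score(X, y))  # training accuracy of each classifier
```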
21. Linear Classifier on Science News (8-category)
22. Linear Classifier on Google News
23. Linear Classifier on Science & Technology
24. Classification Results
- Applying DR improves accuracy versus not applying DR
- Best DR techniques achieve high accuracy in few dimensions
- MDS and Isomap yield the most consistent and reliable results
- This advantage is more pronounced on difficult corpora
- Contradicts van der Maaten et al. (2007): their results show PCA best, but evaluate only one textual data set
- PCA is good, but not the best: it suffers on harder data sets
25. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question
- Task 1: Classification
- Nearest Neighbor Classifier
- Linear Classifier
- 2D Visualizations
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
26. Literature-Based Discovery (LBD)
- Identify candidate interesting associations between seemingly unrelated documents
- Example: Swanson's manual discovery
[Diagram: Fish Oil → reduces High Blood Viscosity and Platelet Aggregation → alleviates Raynaud's Disease]
27. Automatic LBD Assessment
- Time-Consuming
- Subjective
- Expensive
28. Literature-Based Discovery
[Diagram: Score Pairs step of the LBD pipeline]
Assessment: novelty scores of candidate discoveries (an illustrative sketch follows)
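An illustrative sketch of the Score Pairs step; the talk's exact pairing procedure is not detailed here, so ranking cross-category pairs by cosine similarity is an assumption:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((50, 20))             # stand-in for DR-reduced document vectors
cat = rng.integers(0, 4, size=50)    # category of each document

sim = cosine_similarity(X)
pairs = [
    (i, j, sim[i, j])
    for i in range(len(X))
    for j in range(i + 1, len(X))
    if cat[i] != cat[j]              # "seemingly unrelated": different categories
]
pairs.sort(key=lambda p: p[2], reverse=True)
print(pairs[:5])                     # highest-scoring candidate discoveries
```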
29. Novelty Scoring Metric
- Example: Interesting Connection Found
- Student sleep deprivation paired with a cross-cultural approach to improving sleep
- Computing the Novelty Score (a hypothetical sketch follows this slide)
- 1) Compute relative significance on Google Scholar
- 2) Compute relative significance on Google
- 3) Compute novelty score estimate
- Relevant on Google Scholar and not well-known on Google → high novelty estimate
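A hypothetical sketch of the three-step estimate; the hit-count inputs, function names, and the exact formula are illustrative assumptions, not the talk's definition:

```python
def relative_significance(pair_hits: int, term_a_hits: int, term_b_hits: int) -> float:
    """How often two keywords co-occur, relative to how often each appears alone."""
    return pair_hits / max(1, min(term_a_hits, term_b_hits))

def novelty_estimate(scholar_sig: float, web_sig: float) -> float:
    """High when a pairing matters on Google Scholar but is not well known on Google."""
    return scholar_sig / max(web_sig, 1e-9)

# Example: strong scholarly co-occurrence, weak open-web co-occurrence -> high novelty.
scholar = relative_significance(120, 5_000, 800)        # step 1 (Google Scholar)
web = relative_significance(30, 2_000_000, 90_000)      # step 2 (Google)
print(novelty_estimate(scholar, web))                   # step 3: high score
```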
30. Novelty Scoring Metric
- Example: Uninteresting Pair → Low Score
- Anti-Counterfeiting paired with Adjusting Electromagnetic Properties of the Sacagawea Coin
- Computing the Novelty Score
- 1) Compute relative significance on Google Scholar
- 2) Compute relative significance on Google
- 3) Compute novelty score estimate
- No distinction between Google and Google Scholar → low novelty estimate
31. LBD Effectiveness on Science News (4-cat)
32. LBD Effectiveness on Science & Technology
33. LBD Relative Effectiveness
[Charts: relative effectiveness of None-Sort, LDM, PCA, LLE, MDS, and Isomap on Science News (4-cat), Science News (8-cat), Science & Technology, and Google News]
34. LBD Results
- For a fixed number of dimensions, applying DR can improve the quality of candidate discoveries over not applying DR
- Performance remains strong even with relatively few dimensions
- PCA and Isomap yield the most consistent and reliable results
35. Outline
- Problem Statement
- Background
- Experimental Analysis
- Contributions and Conclusions
- Future Work
36. Conclusions and Contributions
- Evaluated two distinct text mining processes with regard to dimensionality reduction techniques
- Showed that DR can be highly effective
- Surprisingly, non-linear techniques did not improve performance
- Classification
- PCA (most commonly used) is inconsistent for text classification
- MDS and Isomap are often the best
- Literature-Based Discovery
- Developed novel keyword extraction and LBD scoring techniques
- PCA was often the best
- May want to combine results from multiple techniques to maximize performance
37. Outline
- Problem Statement
- Background
- Experimental Analysis
- Contributions and Conclusions
- Future Work
38. Future Work
- Human evaluation of LBD results
- LBD is subjective
- Would benefit from human analysis
- Examination of document pairs found by more than one technique
- May be superior
- Improved keyword extraction
- Part-of-speech tagging
- Improve the DR process
- Ability to insert new documents
- Efficiency
39. Acknowledgements
- Naval Surface Warfare Center, Dahlgren Division
- Dr. David Marchette
- Dr. Jeff Solka
- Trident Scholar Committee
- Office of Naval Research
- Multimedia Support Center (MSC)
- Publications Office (PAO)
40. Exploring Dimensionality Reduction for Text Mining
- MIDN 1/C David G. Underhill
- Assistant Professor Lucas K. McDowell
- Computer Science Department
41. 2D Visualization of Science News (2-cat)
42. 2D Visualization of Science News (8-cat)
43. 2D Visualization of Google News
44. 2D Visualization of Science & Technology