1. Exploring Dimensionality Reduction for Text Mining
- MIDN 1/C David G. Underhill
- Assistant Professor Lucas K. McDowell
- Computer Science Department
2. How to make sense of an overwhelming amount of data?
3. How to make sense of an overwhelming amount of data?
- Can dimensionality reduction help?
4. Outline
- Problem Statement
- Background
- Text Mining Process
- Dimensionality Reduction
- Experimental Analysis
- Task 1: Classification
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
5. Text Mining Overview
[Diagram: documents are Encoded into a Term-Document Matrix, which is Compared to produce a Distance Matrix, which is then Analyzed]
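A minimal sketch of the Encode and Compare steps, assuming a TF-IDF encoding and scikit-learn (the slides do not name an implementation):

```python
# Encode: build a term-document matrix; Compare: derive a distance matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

docs = [
    "fish oil reduces blood viscosity",        # toy stand-in corpus
    "platelet aggregation and blood flow",
    "electromagnetic properties of coins",
]

tdm = TfidfVectorizer().fit_transform(docs)      # documents x terms (TF-IDF weights)
dist = pairwise_distances(tdm, metric="cosine")  # documents x documents
print(dist.shape)  # (3, 3); entry [i, j] is the distance between docs i and j
```

The Analyze step then operates on either matrix, e.g. classification or pair scoring.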
6. Text Mining Overview
7. Dimensionality Reduction (DR)
- Goal: simplify a complex data set in a way that preserves meanings inherent in the original data
- Usually applied to geometric or numerical data
- How can DR improve text mining?
- May reveal patterns obscured in the original data
- Improves analysis time over the original, larger data
- Greatly decreases storage and transmission costs
8. 2-Dimensional Visualizations
- Reduction to just 2 dimensions
- Easy visualization: graph on a Cartesian plot (a sketch follows this list)
- Each point is colored according to its category
- Assess quality of separation with the best 2 dimensions
- Highlight areas of confusion
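A minimal sketch of such a plot, with PCA as the reducer and random stand-in data (the slides fix neither choice):

```python
# Reduce documents to 2 dimensions and scatter-plot them, colored by category.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((60, 300))            # stand-in for encoded document vectors
y = rng.integers(0, 4, size=60)      # stand-in category label per document

coords = PCA(n_components=2).fit_transform(X)  # best 2 linear dimensions
plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab10")
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.title("Documents in 2D, colored by category")
plt.show()
```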
9. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question and Method
- Task 1: Classification
- Nearest Neighbor Classifier
- Linear Classifier
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
10. Experimental Question
- Can DR improve text mining performance?
- Many valid DR approaches
- Relative DR performance unknown for textual data
Ultimate Goal: Identify DR techniques that best facilitate text mining.
11. Experimental Method
- Evaluate 5 DR methods (a code sketch follows this slide)
- Linear
- 1) PCA (Principal Components Analysis)
- 2) MDS (Multidimensional Scaling)
- Non-Linear
- 3) Isomap
- 4) LLE (Locally Linear Embedding)
- 5) LDM (Lafon's Diffusion Maps)
- Baseline
- None-Sort: original features sorted by average weight
- Evaluate on two text mining tasks
- 1) Classification
- 2) Literature-Based Discovery
- Evaluate with three data sets
- 1) Science News
- 2) Google News
- 3) Science & Technology
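A sketch of the six reductions under comparison, assuming scikit-learn implementations. LDM has no scikit-learn class, so SpectralEmbedding, a related spectral method, stands in for it here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

X = np.random.default_rng(0).random((100, 500))  # stand-in term-document matrix (dense)
d = 10                                           # target dimensionality

reduced = {
    "PCA":    PCA(n_components=d).fit_transform(X),
    "MDS":    MDS(n_components=d).fit_transform(X),
    "Isomap": Isomap(n_components=d).fit_transform(X),
    "LLE":    LocallyLinearEmbedding(n_components=d, n_neighbors=12).fit_transform(X),
    "~LDM":   SpectralEmbedding(n_components=d).fit_transform(X),  # stand-in, not Lafon's exact method
}

# None-Sort baseline: keep the d original features with the highest average weight.
top = np.argsort(X.mean(axis=0))[::-1][:d]
reduced["None-Sort"] = X[:, top]
```

Each reduced matrix then feeds the same downstream task, so the methods can be compared dimension-for-dimension.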
12. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question
- Task 1: Classification
- Nearest Neighbor Classifier
- Linear Classifier
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
13. Classification
- Labeling documents with known categories based on training data
Assessment: accuracy of category assignments
14. k-Nearest Neighbor Classifier
- Assign category based on the k nearest neighbors (sketched below)
- Most frequent category is assigned
- k = 9 used for the following graphs
- Trends similar for other values
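A minimal sketch of this classifier with k = 9, assuming scikit-learn and random stand-in data:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))            # stand-in for DR-reduced document vectors
y = rng.integers(0, 8, size=200)     # 8 categories, as in Science News

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr)  # majority vote of the 9 nearest
print(accuracy_score(y_te, knn.predict(X_te)))             # accuracy = assessment metric
```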
15. kNN Classifier on Science News (8-category)
16. kNN Classifier on Google News
17. kNN Classifier on Science & Technology
18. kNN Classifier on Science News (2-category)
19. kNN Classifier on Science News (4-category)
20. Linear Classifier
- Assign category based on a linear combination of features (sketched below)
- Assumes features are normally distributed
- Results for the quadratic classifier, which doesn't make this assumption, were comparable
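One plausible instantiation, assuming the linear and quadratic classifiers correspond to scikit-learn's LDA and QDA (the slides do not name an implementation):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,     # linear: Gaussian features, shared covariance
    QuadraticDiscriminantAnalysis,  # quadratic: per-class covariance instead
)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # stand-in for DR-reduced document vectors
y = rng.integers(0, 4, size=200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y), qda.score(X, y))  # training accuracy of each classifier
```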
21. Linear Classifier on Science News (8-category)
22. Linear Classifier on Google News
23. Linear Classifier on Science & Technology
24. Classification Results
- Applying DR improves accuracy versus not applying DR
- Best DR techniques achieve high accuracy in few dimensions
- MDS and Isomap yield the most consistent and reliable results
- This advantage is more pronounced on difficult corpora
- Contradicts van der Maaten et al. (2007): their results show PCA best, but evaluate only one textual data set
- PCA is good, but not the best: it suffers on harder data sets
25. Outline
- Problem Statement
- Background
- Experimental Analysis
- Experimental Question
- Task 1: Classification
- Nearest Neighbor Classifier
- Linear Classifier
- 2D Visualizations
- Task 2: Literature-Based Discovery
- Contributions and Conclusions
- Future Work
26. Literature-Based Discovery (LBD)
- Identify candidate interesting associations between seemingly unrelated documents
- Example: Swanson's manual discovery
[Diagram: Fish Oil → reduces High Blood Viscosity and Platelet Aggregation → alleviates Raynaud's Disease]
27. Automatic LBD Assessment
- Time-Consuming
- Subjective
- Expensive
28. Literature-Based Discovery
[Diagram: Score Pairs step of the LBD pipeline]
Assessment: novelty scores of candidate discoveries (an illustrative sketch follows)
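An illustrative sketch of the Score Pairs step; the talk's exact pairing procedure is not detailed here, so ranking cross-category pairs by cosine similarity is an assumption:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((50, 20))             # stand-in for DR-reduced document vectors
cat = rng.integers(0, 4, size=50)    # category of each document

sim = cosine_similarity(X)
pairs = [
    (i, j, sim[i, j])
    for i in range(len(X))
    for j in range(i + 1, len(X))
    if cat[i] != cat[j]              # "seemingly unrelated": different categories
]
pairs.sort(key=lambda p: p[2], reverse=True)
print(pairs[:5])                     # highest-scoring candidate discoveries
```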
29. Novelty Scoring Metric
- Example: Interesting Connection Found
- Student sleep deprivation paired with a cross-cultural approach to improving sleep
- Computing the Novelty Score (a hypothetical sketch follows this slide)
- 1) Compute relative significance on Google Scholar
- 2) Compute relative significance on Google
- 3) Compute novelty score estimate
- Relevant on Google Scholar and not well-known on Google → high novelty estimate
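A hypothetical sketch of the three-step estimate; the hit-count inputs, function names, and the exact formula are illustrative assumptions, not the talk's definition:

```python
def relative_significance(pair_hits: int, term_a_hits: int, term_b_hits: int) -> float:
    """How often two keywords co-occur, relative to how often each appears alone."""
    return pair_hits / max(1, min(term_a_hits, term_b_hits))

def novelty_estimate(scholar_sig: float, web_sig: float) -> float:
    """High when a pairing matters on Google Scholar but is not well known on Google."""
    return scholar_sig / max(web_sig, 1e-9)

# Example: strong scholarly co-occurrence, weak open-web co-occurrence -> high novelty.
scholar = relative_significance(120, 5_000, 800)        # step 1 (Google Scholar)
web = relative_significance(30, 2_000_000, 90_000)      # step 2 (Google)
print(novelty_estimate(scholar, web))                   # step 3: high score
```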
30. Novelty Scoring Metric
- Example: Uninteresting Pair → Low Score
- Anti-Counterfeiting paired with Adjusting Electromagnetic Properties of the Sacagawea Coin
- Computing the Novelty Score
- 1) Compute relative significance on Google Scholar
- 2) Compute relative significance on Google
- 3) Compute novelty score estimate
- No distinction between Google and Google Scholar → low novelty estimate
31. LBD Effectiveness on Science News (4-cat)
32. LBD Effectiveness on Science & Technology
33. LBD Relative Effectiveness
[Charts: relative effectiveness of None-Sort, LDM, PCA, LLE, MDS, and Isomap on Science News (4-cat), Science News (8-cat), Science & Technology, and Google News]
34. LBD Results
- For a fixed number of dimensions, applying DR can improve the quality of candidate discoveries over not applying DR
- Performance remains strong even with relatively few dimensions
- PCA and Isomap yield the most consistent and reliable results
35. Outline
- Problem Statement
- Background
- Experimental Analysis
- Contributions and Conclusions
- Future Work
36. Conclusions and Contributions
- Evaluated two distinct text mining processes with regard to dimensionality reduction techniques
- Showed that DR can be highly effective
- Surprisingly, non-linear techniques did not improve performance
- Classification
- PCA (most commonly used) is inconsistent for text classification
- MDS and Isomap are often the best
- Literature-Based Discovery
- Developed novel keyword extraction and LBD scoring techniques
- PCA was often the best
- May want to combine results from multiple techniques to maximize performance
37. Outline
- Problem Statement
- Background
- Experimental Analysis
- Contributions and Conclusions
- Future Work
38. Future Work
- Human evaluation of LBD results
- LBD is subjective
- Would benefit from human analysis
- Examination of document pairs found by more than one technique
- May be superior
- Improved keyword extraction
- Part-of-speech tagging
- Improve the DR process
- Ability to insert new documents
- Efficiency
39. Acknowledgements
- Naval Surface Warfare Center, Dahlgren Division
- Dr. David Marchette
- Dr. Jeff Solka
- Trident Scholar Committee
- Office of Naval Research
- Multimedia Support Center (MSC)
- Publications Office (PAO)
40. Exploring Dimensionality Reduction for Text Mining
- MIDN 1/C David G. Underhill
- Assistant Professor Lucas K. McDowell
- Computer Science Department
41. 2D Visualization of Science News (2-cat)
42. 2D Visualization of Science News (8-cat)
43. 2D Visualization of Google News
44. 2D Visualization of Science & Technology