Feature Extraction from Textual Datasets

1
Feature Extraction from Textual Datasets
  • Patrick Moran, Bethany Herwaldt, Jeffrey Salter
  • Dr. Carl Meyer, NC State University
  • Shaina Race, Ralph Abbey

2
The Problem - Specific
  • Collection of product reviews for a camera.
  • The reviews discuss a variety of camera features:
  • Lens / Weight / Noise / Size
  • They express varied opinions on these topics.
  • We want a summary:
  • Identify which topics are discussed.
  • Later research: classify the opinions as good or bad.

3
The Problem - General
  • We have a textual dataset.
  • Documents in the set have a variety of contents.
  • We want to know:
  • What topics are discussed.
  • What is being said about those topics.
  • We want no domain-specific inputs:
  • As unsupervised as possible.

4
Leica D-Lux 3
  • 146 reviews from a variety of web sites.
  • General consensus:
  • Great camera.
  • Too expensive.
  • Viewed as a DSLR alternative.

5
First Impressions
  • This seems like a soft-clustering problem:
  • Cluster the documents, with each topic being a cluster.
  • Each review discusses certain topics, so each review should belong to
    multiple clusters.
  • A nonnegative matrix factorization (NMF) seemed like the right tool.
  • We enforced sparsity via Patrick Hoyer's method (see the sketch after
    this list).
  • The degree of sparsity is a parameter.
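
  As context, here is a minimal sketch of Hoyer's sparseness measure
  (Hoyer, 2004), the quantity his method constrains; the projection step
  that enforces a target sparsity is omitted.

    import numpy as np

    # Hoyer sparseness: 0 for a uniform vector, 1 for a single nonzero entry.
    def hoyer_sparseness(x):
        x = np.asarray(x, dtype=float)
        n = x.size
        l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
        return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)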

6
The Clustering
  [Diagram: the term-by-review matrix A (rows Term 1 ... Term m, columns
  Review 1 ... Review n) is factored as A ≈ WH.]
7
Interpretation
  • Columns of W should be pseudo-documents.
  • Each one should correspond to a given topic.
  • The larger a word's entry in the column, the more
    relevant that term is to the topic.
  • We should be able to summarize a topic by
    reading off the highest elements of the
    corresponding vector.

8
Problems
  • Results
  • noise, buy, sensor, panasonic, silly, fuji
  • quality, manufacture, pay, operational, lens
  • format, shoot, flash, slowlag, promise,
    automotive, flashoth, side, equipment, side
  • image, clarity, color, small, size, alternative,
    mk, lightweight, sturdy, c-lux
  • camera, amazing, happy, menu, master, photo, mp,
    close

9
Problems
  • The results weren't very good:
  • Generally good words,
  • but some poor words mixed in,
  • and poor grouping of the words.
  • The assumptions behind using NMF for clustering are not satisfied here:
  • Synonyms may be less likely to appear together, since a writer usually
    picks one alternative.
  • Unrelated words often appear together.

10
What can be done?
  • We need a different approach for grouping these words together.
  • NMF still seems to be a good source of words, but it yields too few.
  • We can use an additional metric to get more, and better, words.

11
Aldteran Metric
  • We have a vector of word frequencies in the English language.
  • Divide each word's frequency in the dataset by its frequency in our
    English corpus (a minimal sketch follows).
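
  A minimal sketch of the ratio, assuming review_tokens and english_tokens
  (hypothetical names) are flat lists of stemmed words:

    from collections import Counter

    def aldteran_ratio(word, review_tokens, english_tokens):
        data_counts = Counter(review_tokens)
        eng_counts = Counter(english_tokens)
        f_data = data_counts[word] / len(review_tokens)
        # Smoothing unseen English words to a count of 1 is our addition;
        # it avoids division by zero for words absent from the corpus.
        f_eng = max(eng_counts[word], 1) / len(english_tokens)
        return f_data / f_eng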

12
Some Practicalities
  • Our English corpus consists of movie and TV scripts,
  • obtained from Wiktionary.
  • It shares a conversational tone with the data.
  • We applied Porter's stemming algorithm:
  • running, runner, and run all transformed to run.
  • We applied a stoplist:
  • a, an, the, etc. were not considered. (A sketch using NLTK follows.)
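
  A sketch of this preprocessing using NLTK's Porter stemmer and English
  stoplist; the talk does not say which implementations were actually used.

    import nltk
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # one-time stoplist download

    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))

    def preprocess(text):
        tokens = [t for t in text.lower().split() if t.isalpha()]
        # Drop stopwords, then stem (e.g. "running" -> "run").
        return [stemmer.stem(t) for t in tokens if t not in stop]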

13
Collecting Up Terms
  • Simply taking the words with the highest Aldteran metric gives many
    misspellings high weights,
  • since misspellings are infrequent in English.
  • Simply taking the top x words from the NMF gives overly common words.
  • However, the single top word was always good:
  • it was generally dominant by a factor of 2-10.

14
Collecting Up Terms
  • Build sets from the NMF (a sketch follows this list):
  • The single highest-weighted term in each column of W is placed in the
    keyword set.
  • The next β terms of each column are added to the potential-keyword set.
  • Iterate this process, growing the sets with each NMF run.
  • Then move the n words in the potential-keyword set with the highest
    Aldteran ratings into the keyword set.
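
  A sketch of this loop, using scikit-learn's NMF as a stand-in (it does not
  reproduce the Hoyer-style sparsity constraint) and a caller-supplied
  one-argument aldteran scorer; all names here are ours:

    import numpy as np
    from sklearn.decomposition import NMF

    def collect_keywords(A, vocab, k, beta, runs, n, aldteran):
        keywords, potential = set(), set()
        for seed in range(runs):  # iterate the NMF several times
            W = NMF(n_components=k, init="random", random_state=seed,
                    max_iter=500).fit_transform(A)  # A is terms x reviews
            for j in range(k):
                order = np.argsort(W[:, j])[::-1]  # terms by falling weight
                keywords.add(vocab[order[0]])      # the top word is reliable
                potential.update(vocab[i] for i in order[1:beta + 1])
        # Promote the n potential keywords with the best Aldteran ratings.
        best = sorted(potential - keywords, key=aldteran, reverse=True)[:n]
        keywords.update(best)
        return keywords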

15
Graphing the Keywords
  • We form a graph of keywords:
  • Keywords are nodes.
  • The weight of the edge between two nodes is based on:
  • Semantic similarity.
  • Word proximity.
  • This graph can then be clustered to find topics (a sketch follows).
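
  A sketch of the graph construction with networkx; edge_weight is a
  hypothetical callable that combines the two metrics described next:

    import networkx as nx

    def build_graph(keywords, edge_weight):
        G = nx.Graph()
        G.add_nodes_from(keywords)
        kws = sorted(keywords)
        for i, u in enumerate(kws):
            for v in kws[i + 1:]:
                G.add_edge(u, v, weight=edge_weight(u, v))
        return G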

16
Semantic Similarity - WordNet
  [Diagram: WordNet neighborhoods grown outward from "heavy" and "size"
  until the two subgraphs meet.]
17
Semantic Similarity Finesse
  • Once the two subgraphs meet, we go one step further in each direction.
  • We then take the size of the overlap of the two subgraphs as a second
    metric,
  • since words could otherwise be related only through obscure meanings.
    (A rough sketch follows.)
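
  A rough sketch of the WordNet procedure; which relations the subgraphs
  were expanded over is not stated in the talk, so hypernym/hyponym links
  are an assumption here (requires nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    def neighbors(synsets):
        out = set()
        for s in synsets:
            out.update(s.hypernyms())
            out.update(s.hyponyms())
        return out

    def meet(word1, word2, max_steps=8):
        a, b = set(wn.synsets(word1)), set(wn.synsets(word2))
        for step in range(1, max_steps + 1):
            a |= neighbors(a)  # grow each subgraph by one hop
            b |= neighbors(b)
            if a & b:
                a |= neighbors(a)  # one step further in each direction
                b |= neighbors(b)
                return step, len(a & b)  # distance and overlap size
        return None, 0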

18
Word Proximity
  • Cui, Mittal, and Datar concluded that there is no significant
    relationship between words more than 5 positions apart.
  • For each pair of words, we count the number of times they appear within
    5 words of each other.
  • We divide this count by the smaller of the two words' occurrence counts
    (a sketch follows).
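
  A sketch of this count, where docs is a list of token lists:

    from collections import Counter

    def proximity_scores(docs, window=5):
        counts, pairs = Counter(), Counter()
        for tokens in docs:
            counts.update(tokens)
            for i, u in enumerate(tokens):
                # Count each pair appearing within `window` words.
                for v in tokens[i + 1:i + 1 + window]:
                    if u != v:
                        pairs[frozenset((u, v))] += 1
        # Normalize by the rarer word's total occurrence count.
        return {p: c / min(counts[w] for w in p)
                for p, c in pairs.items()}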

19
Combining the Metrics
  • Weight the two metrics against each other by a single parameter.
  • Empirically, 0.5 (even weighting) performs well, but we lack any theory
    for this choice.
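
  Concretely, the combined edge weight would be (the symbol α is ours):

    w(u, v) = α·semantic(u, v) + (1 - α)·proximity(u, v),  with α = 0.5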

20
The Graph
  [Diagram: example keyword graph with nodes Quality, Item, Image, Menu,
  Nikon, and Sony; edge weights range from .0289 to .7864.]
21
Cluster the Graph
  • Use your favorite graph-clustering algorithm to cluster the keywords
    into topics.
  • A topic is a set of words which, together, define a topic of
    conversation.
  • We used an algorithm that projected the data into a lower-dimensional
    space via the SVD, partitioned it with principal-direction gap
    partitioning, and post-processed with k-means. (A loose sketch follows.)
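
  A loose sketch of this pipeline: a truncated SVD of the symmetric weight
  matrix followed by k-means; the principal-direction gap-partitioning step
  is folded into k-means here for brevity.

    import numpy as np
    from scipy.sparse.linalg import svds
    from sklearn.cluster import KMeans

    def cluster_keywords(weights, keywords, n_topics, dim=10):
        # dim must be smaller than the number of keywords.
        U, S, _ = svds(np.asarray(weights, dtype=float), k=dim)
        coords = U * S  # embed each node in R^dim
        labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(coords)
        topics = {}
        for kw, lab in zip(keywords, labels):
            topics.setdefault(lab, []).append(kw)
        return list(topics.values())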

22
Results - Good
  • image, images, color, quality, clarity
  • lens, optics, image, sturdy
  • canon, nikon, sony, mp, packaging
  • pictures, candid, landscapes
  • options, menu, item, manual, settings, sensor,
    photographer, worlds, shoots
  • love, very, great, also, expensive
  • camera, cameras

23
Results - Bad
  • use, its
  • delicate, shipping, raw, mode, ratio
  • size, post, noise, flash, screen
  • feature, format, shoot, lightweight
  • everyday
  • grandchildren
  • aspect
  • digital, compact, complicate, swears

24
Cluster Entropy
  • A measure of cluster goodness.
  • Requires a human to classify the data for scoring.
  • For a cluster X whose proportion of topic i is x_i, the entropy is
    E(X) = -Σ_i x_i log(x_i); lower is better. (A sketch follows.)
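
  A minimal sketch of the score: labels holds a human-assigned topic for
  each word in one cluster (natural log here; the talk does not state the
  base):

    import numpy as np

    def cluster_entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        x = counts / counts.sum()
        return float(-(x * np.log(x)).sum())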

25
Cluster Entropy
  • Since human classification is subjective, we calculate 2 entropies:
  • Most generous to our result: 0.2042
  • Least generous to our result: 0.3247
  • Entropy of a random grouping: 0.440
  • We also tested on another dataset:
  • Most generous to our result: 0.1261
  • Least generous to our result: 0.2034
  • Entropy of a random grouping: 0.2208

26
Future Work - So Many Parameters!
  • k - rank of the NMF.
  • β - number of words per NMF column to consider.
  • Sparsity of our NMF.
  • The number of clusters to request.
  • Number of times to iterate the NMF.
  • Clustering algorithm used.
  • Scaling constants for combining:
  • The two semantic similarity measures.
  • Semantic similarity and word proximity.

27
Future Work
  • Replace word proximity with more sophisticated natural language
    processing.
  • Create an unsupervised, and preferably less empirical, means of setting
    the parameters.
  • Integrate spell-checking and correction, along with dictionary-based
    stemming.
  • Improve our English corpus.

28
Thanks
  • Thanks to the NSF for the funding.
  • Thanks to Dr. Carl Meyer, Shaina Race, and Ralph
    Abbey for mentoring.
  • Thanks to my teammates Jeffrey Salter and Bethany
    Herwaldt for their hard work.
  • Thanks to NCSU for their support.
  • Thanks to Dr. Langville for the background I
    needed in this research.