Feature Extraction from Textual Datasets

1
Feature Extraction from Textual Datasets
  • Patrick Moran, Bethany Herwaldt, Jeffrey Salter
  • Dr. Carl Meyer, NC State University
  • Shaina Race, Ralph Abbey

2
The Problem - Specific
  • Collection of product reviews for a camera.
  • The reviews discuss a variety of camera features:
  • Lens / Weight / Noise / Size
  • They express varied opinions on these topics.
  • We want a summary:
  • Identify which topics are discussed.
  • Later research: classify the opinions as good or bad.

3
The Problem - General
  • We have a textual dataset.
  • Documents in the set have a variety of contents.
  • We want to know:
  • What topics are discussed.
  • What is being said about those topics.
  • We want no domain-specific inputs:
  • As unsupervised as possible.

4
Leica D-Lux 3
  • 146 reviews from a variety of web sites.
  • General consensus:
  • Great camera.
  • Too expensive.
  • Viewed as a DSLR alternative.

5
First Impressions
  • This seems like a soft-clustering problem:
  • Cluster the documents, with each topic being a cluster.
  • Each review discusses certain topics, so each review should belong to
    multiple clusters.
  • A nonnegative matrix factorization (NMF) seemed like the right tool.
  • We enforced sparsity via Patrick Hoyer's method (see the sketch after
    this list).
  • The degree of sparsity is a parameter.
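
  As context, here is a minimal sketch of Hoyer's sparseness measure
  (Hoyer, 2004), the quantity his method constrains; the projection step
  that enforces a target sparsity is omitted.

    import numpy as np

    # Hoyer sparseness: 0 for a uniform vector, 1 for a single nonzero entry.
    def hoyer_sparseness(x):
        x = np.asarray(x, dtype=float)
        n = x.size
        l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
        return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)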

6
The Clustering
  [Diagram: the term-by-review matrix A (rows Term 1 ... Term m, columns
  Review 1 ... Review n) is factored as A ≈ WH.]
7
Interpretation
  • Columns of W should be pseudo-documents.
  • Each one should correspond to a given topic.
  • The larger a word's entry in the column, the more
    relevant that term is to the topic.
  • We should be able to summarize a topic by
    reading off the highest elements of the
    corresponding vector.

8
Problems
  • Results
  • noise, buy, sensor, panasonic, silly, fuji
  • quality, manufacture, pay, operational, lens
  • format, shoot, flash, slowlag, promise,
    automotive, flashoth, side, equipment, side
  • image, clarity, color, small, size, alternative,
    mk, lightweight, sturdy, c-lux
  • camera, amazing, happy, menu, master, photo, mp,
    close

9
Problems
  • The results weren't very good:
  • Generally good words,
  • but some poor words mixed in,
  • and poor grouping of the words.
  • The assumptions behind using NMF for clustering are not satisfied here:
  • Synonyms may be less likely to appear together, since a writer usually
    picks one alternative.
  • Unrelated words often appear together.

10
What can be done?
  • We need a different approach for grouping these words together.
  • NMF still seems to be a good source of words, but it yields too few.
  • We can use an additional metric to get more, and better, words.

11
Aldteran Metric
  • We have a vector of word frequencies in the English language.
  • Divide each word's frequency in the dataset by its frequency in our
    English corpus (a minimal sketch follows).
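
  A minimal sketch of the ratio, assuming review_tokens and english_tokens
  (hypothetical names) are flat lists of stemmed words:

    from collections import Counter

    def aldteran_ratio(word, review_tokens, english_tokens):
        data_counts = Counter(review_tokens)
        eng_counts = Counter(english_tokens)
        f_data = data_counts[word] / len(review_tokens)
        # Smoothing unseen English words to a count of 1 is our addition;
        # it avoids division by zero for words absent from the corpus.
        f_eng = max(eng_counts[word], 1) / len(english_tokens)
        return f_data / f_eng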

12
Some Practicalities
  • Our English corpus consists of movie and TV scripts,
  • obtained from Wiktionary.
  • It shares a conversational tone with the data.
  • We applied Porter's stemming algorithm:
  • running, runner, and run all transformed to run.
  • We applied a stoplist:
  • a, an, the, etc. were not considered. (A sketch using NLTK follows.)
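
  A sketch of this preprocessing using NLTK's Porter stemmer and English
  stoplist; the talk does not say which implementations were actually used.

    import nltk
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # one-time stoplist download

    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))

    def preprocess(text):
        tokens = [t for t in text.lower().split() if t.isalpha()]
        # Drop stopwords, then stem (e.g. "running" -> "run").
        return [stemmer.stem(t) for t in tokens if t not in stop]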

13
Collecting Up Terms
  • Simply taking the words with the highest Aldteran metric gives many
    misspellings high weights,
  • since misspellings are infrequent in English.
  • Simply taking the top x words from the NMF gives overly common words.
  • However, the single top word was always good:
  • it was generally dominant by a factor of 2-10.

14
Collecting Up Terms
  • Build sets from the NMF (a sketch follows this list):
  • The single highest-weighted term in each column of W is placed in the
    keyword set.
  • The next β terms of each column are added to the potential-keyword set.
  • Iterate this process, growing the sets with each NMF run.
  • Then move the n words in the potential-keyword set with the highest
    Aldteran ratings into the keyword set.
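
  A sketch of this loop, using scikit-learn's NMF as a stand-in (it does not
  reproduce the Hoyer-style sparsity constraint) and a caller-supplied
  one-argument aldteran scorer; all names here are ours:

    import numpy as np
    from sklearn.decomposition import NMF

    def collect_keywords(A, vocab, k, beta, runs, n, aldteran):
        keywords, potential = set(), set()
        for seed in range(runs):  # iterate the NMF several times
            W = NMF(n_components=k, init="random", random_state=seed,
                    max_iter=500).fit_transform(A)  # A is terms x reviews
            for j in range(k):
                order = np.argsort(W[:, j])[::-1]  # terms by falling weight
                keywords.add(vocab[order[0]])      # the top word is reliable
                potential.update(vocab[i] for i in order[1:beta + 1])
        # Promote the n potential keywords with the best Aldteran ratings.
        best = sorted(potential - keywords, key=aldteran, reverse=True)[:n]
        keywords.update(best)
        return keywords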

15
Graphing the Keywords
  • We form a graph of keywords:
  • Keywords are nodes.
  • The weight of the edge between two nodes is based on:
  • Semantic similarity.
  • Word proximity.
  • This graph can then be clustered to find topics (a sketch follows).
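
  A sketch of the graph construction with networkx; edge_weight is a
  hypothetical callable that combines the two metrics described next:

    import networkx as nx

    def build_graph(keywords, edge_weight):
        G = nx.Graph()
        G.add_nodes_from(keywords)
        kws = sorted(keywords)
        for i, u in enumerate(kws):
            for v in kws[i + 1:]:
                G.add_edge(u, v, weight=edge_weight(u, v))
        return G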

16
Semantic Similarity - WordNet
  [Diagram: WordNet neighborhoods grown outward from "heavy" and "size"
  until the two subgraphs meet.]
17
Semantic Similarity Finesse
  • Once the two subgraphs meet, we go one step further in each direction.
  • We then take the size of the overlap of the two subgraphs as a second
    metric,
  • since words could otherwise be related only through obscure meanings.
    (A rough sketch follows.)
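
  A rough sketch of the WordNet procedure; which relations the subgraphs
  were expanded over is not stated in the talk, so hypernym/hyponym links
  are an assumption here (requires nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    def neighbors(synsets):
        out = set()
        for s in synsets:
            out.update(s.hypernyms())
            out.update(s.hyponyms())
        return out

    def meet(word1, word2, max_steps=8):
        a, b = set(wn.synsets(word1)), set(wn.synsets(word2))
        for step in range(1, max_steps + 1):
            a |= neighbors(a)  # grow each subgraph by one hop
            b |= neighbors(b)
            if a & b:
                a |= neighbors(a)  # one step further in each direction
                b |= neighbors(b)
                return step, len(a & b)  # distance and overlap size
        return None, 0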

18
Word Proximity
  • Cui, Mittal, and Datar concluded that there is no significant
    relationship between words more than 5 positions apart.
  • For each pair of words, we count the number of times they appear within
    5 words of each other.
  • We divide this count by the smaller of the two words' occurrence counts
    (a sketch follows).
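
  A sketch of this count, where docs is a list of token lists:

    from collections import Counter

    def proximity_scores(docs, window=5):
        counts, pairs = Counter(), Counter()
        for tokens in docs:
            counts.update(tokens)
            for i, u in enumerate(tokens):
                # Count each pair appearing within `window` words.
                for v in tokens[i + 1:i + 1 + window]:
                    if u != v:
                        pairs[frozenset((u, v))] += 1
        # Normalize by the rarer word's total occurrence count.
        return {p: c / min(counts[w] for w in p)
                for p, c in pairs.items()}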

19
Combining the Metrics
  • Weight the two metrics against each other by a single parameter.
  • Empirically, 0.5 (even weighting) performs well, but we lack any theory
    for this choice.
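
  Concretely, the combined edge weight would be (the symbol α is ours):

    w(u, v) = α·semantic(u, v) + (1 - α)·proximity(u, v),  with α = 0.5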

20
The Graph
  [Diagram: example keyword graph with nodes Quality, Item, Image, Menu,
  Nikon, and Sony; edge weights range from .0289 to .7864.]
21
Cluster the Graph
  • Use your favorite graph-clustering algorithm to cluster the keywords
    into topics.
  • A topic is a set of words which, together, define a topic of
    conversation.
  • We used an algorithm that projected the data into a lower-dimensional
    space via the SVD, partitioned it with principal-direction gap
    partitioning, and post-processed with k-means. (A loose sketch follows.)
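
  A loose sketch of this pipeline: a truncated SVD of the symmetric weight
  matrix followed by k-means; the principal-direction gap-partitioning step
  is folded into k-means here for brevity.

    import numpy as np
    from scipy.sparse.linalg import svds
    from sklearn.cluster import KMeans

    def cluster_keywords(weights, keywords, n_topics, dim=10):
        # dim must be smaller than the number of keywords.
        U, S, _ = svds(np.asarray(weights, dtype=float), k=dim)
        coords = U * S  # embed each node in R^dim
        labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(coords)
        topics = {}
        for kw, lab in zip(keywords, labels):
            topics.setdefault(lab, []).append(kw)
        return list(topics.values())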

22
Results - Good
  • image, images, color, quality, clarity
  • lens, optics, image, sturdy
  • canon, nikon, sony, mp, packaging
  • pictures, candid, landscapes
  • options, menu, item, manual, settings, sensor,
    photographer, worlds, shoots
  • love, very, great, also, expensive
  • camera, cameras

23
Results - Bad
  • use, its
  • delicate, shipping, raw, mode, ratio
  • size, post, noise, flash, screen
  • feature, format, shoot, lightweight
  • everyday
  • grandchildren
  • aspect
  • digital, compact, complicate, swears

24
Cluster Entropy
  • A measure of cluster goodness.
  • Requires a human to classify the data for scoring.
  • For a cluster X whose proportion of topic i is x_i, the entropy is
    E(X) = -Σ_i x_i log(x_i); lower is better. (A sketch follows.)
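
  A minimal sketch of the score: labels holds a human-assigned topic for
  each word in one cluster (natural log here; the talk does not state the
  base):

    import numpy as np

    def cluster_entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        x = counts / counts.sum()
        return float(-(x * np.log(x)).sum())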

25
Cluster Entropy
  • Since human classification is subjective, we calculate 2 entropies:
  • Most generous to our result: 0.2042
  • Least generous to our result: 0.3247
  • Entropy of a random grouping: 0.440
  • We also tested on another dataset:
  • Most generous to our result: 0.1261
  • Least generous to our result: 0.2034
  • Entropy of a random grouping: 0.2208

26
Future Work - So Many Parameters!
  • k - rank of the NMF.
  • β - number of words per NMF column to consider.
  • Sparsity of our NMF.
  • The number of clusters to request.
  • Number of times to iterate the NMF.
  • Clustering algorithm used.
  • Scaling constants for combining:
  • The two semantic similarity measures.
  • Semantic similarity and word proximity.

27
Future Work
  • Replace word proximity with more sophisticated natural language
    processing.
  • Create an unsupervised, and preferably less empirical, means of setting
    the parameters.
  • Integrate spell-checking and correction, along with dictionary-based
    stemming.
  • Improve our English corpus.

28
Thanks
  • Thanks to the NSF for the funding.
  • Thanks to Dr. Carl Meyer, Shaina Race, and Ralph
    Abbey for mentoring.
  • Thanks to my teammates Jeffrey Salter and Bethany
    Herwaldt for their hard work.
  • Thanks to NCSU for their support.
  • Thanks to Dr. Langville for the background I
    needed in this research.