CSC 9010: Text Mining Applications: Document-Level Techniques

Provided by: Matu

Learn more at: http://www.csc.villanova.edu

1
CSC 9010 Text Mining Applications
Document-Level Techniques

Dr. Paula Matuszek
Paula_A_Matuszek_at_glaxosmithkline.com
(610) 270-6851

2
Dealing with Documents

Sometimes our information need is not for
something specific which we can capture in a
clear-cut knowledge model:
What is the current research in secure networks?
What are our competitors working on?
Who should review this paper?
These kinds of questions are more typically
answered by techniques which look at the entire
document, or set of documents.
Categorizing
Clustering
Visualizing

3
Document Categorization

Document categorization
Assign documents to pre-defined categories
Examples
Process email into work, personal, junk
Process documents from a newsgroup into
interesting, not interesting, spam and
flames
Process transcripts of bugged phone calls into
relevant and irrelevant
Issues
Real-time?
How many categories/document? Flat or
hierarchical?
Categories defined automatically or by hand?

4
Document Categorization

Usually
relatively few categories
well defined; a person could do the task easily
Categories don't change quickly
Flat vs Hierarchy
Simple categorization is into mutually-exclusive
document collections
Richer categorization is into hierarchy with
multiple inheritance
broader and narrower categories
documents can go more than one place
merges into search engines with category browsers

5
Categorization -- Automatic

Statistical approaches similar to search engine
Set of training documents define categories
Underlying representation of document is bag of
words/TFIDF variant
Category description is created using neural
nets, regression trees, other Machine Learning
techniques
Individual documents categorized by net, inferred
rules, etc
Requires relatively little effort to create
categories
Accuracy is heavily dependent on "good" training
examples
Typically limited to flat, mutually exclusive
categories
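
The statistical approach above can be sketched in a few lines. This is a minimal illustration, not the method of any particular product: it swaps in a nearest-centroid classifier for the neural nets or regression trees the slide mentions, and the training emails, function names, and IDF smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Bag of words: lowercase, split on whitespace, ignore order.
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_centroids(docs, labels):
    """Sum the TF-IDF vectors of the training documents in each category."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    # Assumed IDF variant: log(N / df) + 1, so no term gets weight zero.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    centroids = {}
    for toks, label in zip(tokenized, labels):
        centroid = centroids.setdefault(label, Counter())
        for term, count in Counter(toks).items():
            centroid[term] += count * idf[term]
    return centroids, idf

def categorize(text, centroids, idf):
    """Assign the one category whose centroid is most similar (flat, exclusive)."""
    vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(text)).items()}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical training set for the work / personal / junk email example.
TRAIN = [
    ("meeting agenda for the project review", "work"),
    ("quarterly budget report attached", "work"),
    ("dinner with family on saturday", "personal"),
    ("happy birthday see you at the party", "personal"),
    ("win a free prize click now", "junk"),
    ("cheap pills free offer click here", "junk"),
]
docs, labels = zip(*TRAIN)
centroids, idf = train_centroids(docs, labels)
print(categorize("free prize offer", centroids, idf))
print(categorize("project budget meeting", centroids, idf))
```

Creating a category takes only a handful of labeled examples, but the result is only as good as those examples, and each document lands in exactly one category.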

6
Categorization -- Manual

Natural Language/linguistic techniques
Categories are defined by people
underlying representation of document is stream
of tokens
category description contains
ontology of terms and relations
pattern-matching rules
individual documents categorized by
pattern-matching
Defining categories can be very time-consuming
Typically takes some experimentation to "get it
right"
Can handle much more complex structures
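
A rough sketch of rule-based categorization follows. The rule set, category names, and function name are hypothetical; a real system would pair such patterns with an ontology of terms and relations rather than a flat dictionary.

```python
import re

# Hypothetical hand-written rules: each category lists patterns that signal it.
RULES = {
    "junk": [r"\bfree\s+(offer|prize)\b", r"\bclick\s+here\b"],
    "work": [r"\b(meeting|deadline|budget)\b", r"\bproject\s+\w+\b"],
    "personal": [r"\b(birthday|dinner|family)\b"],
}
COMPILED = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in RULES.items()
}

def categorize_by_rules(text):
    """Return every category with at least one matching rule, in sorted order."""
    return sorted(
        category
        for category, patterns in COMPILED.items()
        if any(p.search(text) for p in patterns)
    )

print(categorize_by_rules("Click here for a FREE prize"))
print(categorize_by_rules("Budget meeting, then family dinner"))
```

Unlike the statistical sketch, a document here can fall into several categories or none, and every rule has to be written and tuned by hand.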

7
Document Classification

Document classification
Cluster documents based on similarity
Examples
Group samples of writing in an attempt to
determine author(s)
Look for hot spots in customer feedback
Find new trends in a document collection
(outliers, hard to classify)
Getting into areas where we don't know ahead of
time what we will have: true mining

8
Document Classification -- How

Typical process is
Describe each document
Assess similarities among documents
Establish classification scheme which creates
optimal "separation"
One typical approach
document is represented as term vector
cosine similarity for measuring association
bottom-up pairwise combining of documents to get
clusters
Assumes you have the corpus in hand
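
The typical approach above (term vectors, cosine similarity, bottom-up pairwise merging) might look like the sketch below. The stopping threshold, the choice to compare clusters by their summed term vectors, and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    # Raw term counts; a TF-IDF weighting could be substituted here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def agglomerate(docs, threshold=0.2):
    """Repeatedly merge the two most similar clusters until none exceed threshold.

    Each cluster is (member indices, summed term vector). Returns sorted lists
    of document indices.
    """
    clusters = [([i], term_vector(d)) for i, d in enumerate(docs)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break  # remaining clusters are too dissimilar to combine
        i, j = pair
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)

# Toy corpus (hypothetical): two golf stories, two network-security stories.
CORPUS = [
    "golf tournament tiger woods golf",
    "woods wins golf tournament",
    "network security firewall attack",
    "secure network firewall rules",
]
print(agglomerate(CORPUS))
```

The pairwise search needs the whole corpus up front, which is why this style of clustering is not incremental.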

9
Document Clustering

Approaches vary a great deal in
document characteristics used to describe
document (linguistic or semantic? BOW?)
methods used to define "similar"
methods used to create clusters
Other relevant factors
Number of clusters to extract is variable
Often combined with visualization tools based on
similarity and/or clusters
Sometimes important that approach be incremental
Useful approach when you don't have a handle on
the domain or it's changing

10
Document Visualization

Visualization
Visually display relationships among documents
Examples
hyperbolic viewer based on document similarity
browse a field of scientific documents
map based techniques showing peaks, valleys,
outliers
graphs showing relationships between companies
and research areas
Highly interactive, intended to aid a human in
finding interrelationships and new knowledge in
the document set.

11
Latent Semantic Analysis

Bag of Words methods we have looked at ignore
syntax -- A document is "about" the words in it
People interpret documents in a richer context
a document is about some domain
reflected in the vocabulary
but not limited to it

12
Match Topic and Phrase

I saw Pathfinder on Mars with a telescope.
The Pathfinder photograph mars our perception of
a lifeless planet.
The Pathfinder photograph from Ford has arrived.
When a Pathfinder fords a river it sometimes mars
its paint job.

Astronomy
Automobiles
Biology

13
Domain-Based Processing

This task is relatively easy because we know a
lot about all of the domains, and can
disambiguate using that knowledge.
It's not completely trivial: the biology choice
could also have been astronomy.
Information Extraction systems like GATE and
AeroText model the domain knowledge explicitly,
but this takes a lot of effort.
Is there an easier way?

14
Word Co-Occurrences

BOW approaches assume meaning is carried by
vocabulary, ignore syntax
Domain modeling approaches capture detailed
knowledge about the meaning
An intermediate position is to look at vocabulary
groups: what words tend to occur together?
Still a statistical approach, but richer
representation than single terms

15
Examples of What We Would Like

Looking for articles about Tiger Woods in an AP
newswire database brings up stories about the
golfer, followed by articles about golf
tournaments that don't mention his name.
Constraining the search to days when no articles
were written about Tiger Woods still brings up
stories about golf tournaments and well-known
players.
So we are recognizing that Tiger Woods is about
golf.
javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm

16
Example

Tiger Woods takes some drama out of cut streak
with opening round at Funai.
Every player on the money list is at Disney
trying to make it to the Tour Championship.
Tiger Woods has no such worries.
Going into this week's event, Tiger Woods has
made the cut in 113 successive events. He tied
the PGA Tour's consecutive cut record two weeks
ago at the Funai Classic in Orlando, Florida,
while Cink finished second.
Stewart Cink finished second at the Funai Classic
at Walt Disney World.

17
Example, Cont.

Woods tended to occur in the same articles as Funai.
Cink also tended to occur in articles about Funai.
So there is a relationship between Woods and Cink
which is stronger than is indicated just by the
one article in which they are both mentioned.
It has to do with cuts, Funai, and the
championship tour.
So by creating a term-document matrix and
examining it we can find potential relationships
which are latent, or hidden. They are tied
together by the meaning, or semantics, of the
terms.
This is the basic concept of Latent Semantic
Analysis and Latent Semantic Indexing.
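
To make the idea concrete, here is a small sketch of a term-document matrix and the indirect link it exposes. It is not LSA itself (LSA recovers such links with a matrix decomposition, not an explicit set intersection); the three toy documents and the function names are assumptions for the example.

```python
from collections import defaultdict

def term_document_matrix(docs):
    """Sparse term-by-document matrix: term -> {document index: count}."""
    matrix = defaultdict(dict)
    for j, doc in enumerate(docs):
        for term in doc.lower().split():
            matrix[term][j] = matrix[term].get(j, 0) + 1
    return dict(matrix)

def cooccurring_terms(matrix, term):
    """All other terms that share at least one document with `term`."""
    docs_with_term = set(matrix.get(term, {}))
    return {
        other
        for other, row in matrix.items()
        if other != term and docs_with_term & set(row)
    }

def shared_context(matrix, a, b):
    """Terms that co-occur with both a and b: the 'bridge' between them."""
    return cooccurring_terms(matrix, a) & cooccurring_terms(matrix, b)

# Toy documents (hypothetical), loosely modeled on the golf stories.
DOCS = [
    "woods opening round funai cut streak",
    "cink finished second funai classic",
    "players disney tour championship",
]
matrix = term_document_matrix(DOCS)
direct = set(matrix["woods"]) & set(matrix["cink"])  # documents mentioning both
bridge = shared_context(matrix, "woods", "cink")
print(direct, bridge)
```

In this toy corpus Woods and Cink never share a document, yet both co-occur with Funai, which is the kind of latent relationship the matrix makes visible.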

18
Problem: Very High Dimensionality

A vector of TFIDF representing a document is
high dimensional.
If we start looking at a matrix of terms by
documents, it gets even worse.
Need some way to trim words looked at
First, throw away anything "not useful"
Second, identify clusters and pick representative
terms

19
Throw Away

Most domain semantics carried by nouns,
adjectives, verbs, adverbs
throw away prepositions, articles, conjunctions,
pronouns
Very frequent words don't add to domain
semantics.
throw away common verbs (go, be, see),
adjectives (big, good, bad), adverbs (very)
throw away words which appear in most documents
Very infrequent words don't either
throw away terms which only appear in one
document
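
These pruning rules can be sketched as a vocabulary filter. The stopword list stands in for a real part-of-speech filter, and the two cutoffs (`max_doc_fraction`, `min_doc_count`) are assumed values chosen for the toy corpus.

```python
from collections import Counter

# Small illustrative stopword list; a real one would be much longer
# or would come from a part-of-speech tagger.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "or",
             "but", "he", "she", "it", "they", "to", "is"}

def prune_vocabulary(docs, max_doc_fraction=0.8, min_doc_count=2):
    """Keep terms that are not stopwords, not in most documents, and not unique to one."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    return {
        term
        for term, df in doc_freq.items()
        if term not in STOPWORDS
        and df >= min_doc_count               # drop terms seen in only one document
        and df / n_docs <= max_doc_fraction   # drop terms seen in most documents
    }

# Hypothetical mini-corpus.
NEWS = [
    "the golfer said he made the cut",
    "the coach said the golfer missed the cut",
    "officials said the tournament is in orlando",
    "a golfer said he won the tournament",
]
print(sorted(prune_vocabulary(NEWS)))
```

Here "said" is dropped for appearing in every document, and words such as "orlando" are dropped for appearing in only one.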

20
What's Left

A condensed matrix where we can assume that most
terms are meaningful.
It's still very large, and very sparse.
Basic index table for a keyword search tool.
Where can we go now?
We have fewer concepts than terms
So move from terms to concepts
So identify clusters and pick representative
terms

21
Singular Value Decomposition

One approach to this is called Singular Value
Decomposition.
Have a term space of thousands of dimensions,
with each document a vector in that space.
Want to project or map those dimensions onto a
smaller number of dimensions in such a way that
relative distance among vectors is preserved as
much as possible.
We end up with a much smaller number of
dimensions, and a vector for each document of its
value for those dimensions
For a detailed explanation:
http://www.acm.org/sigmm/MM98/electronic_proceedings/huang/node4.html

22
Dimension Reduction

For n (words) x m (documents) matrix M
Finds least-squares best U (n x k)
Rows of U map input features (words) to encoded
features (concept clusters)
Closely related to
symmetric eigenvalue decomposition,
factor analysis,
principal component analysis
Subroutine in many math packages.
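
For reference, the standard form of the decomposition can be written out as below. The symbols M, U, n, m, and k follow the slide; the names for the other factors, the truncated matrix M_k, the document column m_j, and the reduced vector d-hat are notation assumed here.

```latex
% Full SVD of the n x m term-document matrix (r = rank of M)
M = U \Sigma V^{\mathsf{T}}, \qquad
U \in \mathbb{R}^{n \times r},\;
\Sigma = \operatorname{diag}(\sigma_1 \ge \dots \ge \sigma_r > 0),\;
V \in \mathbb{R}^{m \times r}

% Keep only the k largest singular values: the least-squares best rank-k fit
M_k = U_k \Sigma_k V_k^{\mathsf{T}}
    = \arg\min_{\operatorname{rank}(X) \le k} \lVert M - X \rVert_F

% Document j (column m_j of M) mapped to k concept dimensions
\hat{d}_j = U_k^{\mathsf{T}} m_j \in \mathbb{R}^{k}
```

The last line is the sense in which U maps word features to concept features: each document's thousands of term weights collapse to k coordinates.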

www.cs.princeton.edu/courses/archive/fall01/cs302/notes/12-5/LSI.ppt
23
LSI/LSA

Latent semantic indexing is the application of
SVD to IR.
Latent semantic analysis is the more general
term.
Features are words, examples are text passages.
Latent: Not visible on the surface
Semantic: Word meanings

24
Geometric View

Words embedded in high-d space.

[Figure: the words "exam", "test", and "fish" plotted as points in the space; surviving values: 0.02, 0.42, 0.01]
25
Comparison to VSM

A: The feline climbed upon the roof
B: A cat leapt onto a house
C: The final will be on a Thursday
How similar?
Vector space model: sim(A,B) = 0
LSI: sim(A,B) = .49, sim(A,C) = .45
Non-zero sim with no words in common by overlap
in reduced representation.
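
The vector space half of this comparison is easy to reproduce. The sketch below uses raw word counts with no stopword removal or weighting (an assumption; the slide does not say which variant was used) and does not attempt the LSI figures, which depend on a training corpus.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

A = bag_of_words("The feline climbed upon the roof")
B = bag_of_words("A cat leapt onto a house")
C = bag_of_words("The final will be on a Thursday")

sim_ab = cosine(A, B)  # no shared words at all
sim_ac = cosine(A, C)  # shares only the function word "the"
print(sim_ab, sim_ac)
```

With raw counts, A looks closer to C than to B purely because of "the", which is the failure a reduced representation is meant to correct.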

26
What Does LSI Do?

Let's send it to school.

27
Plato's Problem

7th grader learns 10-15 new words today, fewer
than 1 by direct instruction. Perhaps 3 were
even encountered. How can this be?
Plato: You already knew them.
LSA: Many weak relationships combined (data to
back it up!)
Rate comparable to students.

28
Vocabulary

TOEFL synonym test
Choose alternative with highest similarity score.
LSA correct on 64% of 80 items.
Matches avg applicant to US college. Mistakes
correlate w/ people (r = .44).
best solo measure of intelligence

29
Multiple Choice Exam

Trained on psych textbook.
Given same test as students.
LSA: 60%, lower than average, but passes.
Has trouble with hard ones.

30
Essay Test

LSA can't write.
If you can't do, judge.
Students write essays, LSA trained on related
text.
Compare similarity and length with graded essays
(labeled).
Cosine-weighted average of top 10. Regression to
combine sim and len.
Correlation .64-.84. Better than human. Bag of
words!?
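
The scoring step can be sketched as a cosine-weighted average over the most similar graded essays. This is an illustration under stated assumptions: plain word-count vectors stand in for LSA vectors, the regression on essay length is left out, and the graded essays and function names are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def predict_grade(essay, graded, k=10):
    """Cosine-weighted average grade of the k graded essays most similar to `essay`.

    `graded` is a list of (text, grade) pairs. Returns None when the essay has
    no similarity to any graded essay.
    """
    vec = term_vector(essay)
    scored = sorted(
        ((cosine(vec, term_vector(text)), grade) for text, grade in graded),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return None
    return sum(sim * grade for sim, grade in scored) / total

# Hypothetical graded essays on a psychology topic.
GRADED = [
    ("memory encoding storage retrieval", 5),
    ("memory storage", 3),
    ("weather today sunny", 1),
]
print(predict_grade("memory encoding retrieval", GRADED))
```

The estimate is pulled toward the grades of the closest matches, so an essay resembling the top-graded example scores near the top of the scale.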

31
Digit Representations

Look at similarities of all pairs from one to
nine.
Look at best fit of these similarities in one
dimension: they come out in order!
Similar experiments with cities in Europe in two
dimensions.

32
Word Sense

The chemistry student knew this was not a good
time to forget how to calculate volume and mass.
heavy? .21
church? .14
LSI picks best p

33
LSApplications

Improve IR.
Cross-language IR. Train on parallel collection.
Measure text coherency.
Use essays to pick educational text.
Grade essays.
Visualize word clusters
Demos at http://LSA.colorado.edu

34
LSI Background Reading

Landauer, Laham, Foltz (1998). Learning
human-like knowledge by Singular Value
Decomposition: A Progress Report. Advances in
Neural Information Processing Systems 10 (pp.
44-51).
http://lsa.colorado.edu/papers/nips.ps
