Title: CSC 9010: Text Mining Applications Document-Level Techniques
1 CSC 9010 Text Mining Applications
Document-Level Techniques
- Dr. Paula Matuszek
- Paula_A_Matuszek_at_glaxosmithkline.com
- (610) 270-6851
2 Dealing with Documents
- Sometimes our information need is not for something specific that we can capture in a clear-cut knowledge model:
- What is the current research in secure networks?
- What are our competitors working on?
- Who should review this paper?
- These kinds of questions are more typically answered by techniques that look at the entire document, or set of documents:
- Categorizing
- Clustering
- Visualizing
3 Document Categorization
- Document categorization
- Assign documents to pre-defined categories
- Examples
- Process email into work, personal, junk
- Process documents from a newsgroup into interesting, not interesting, spam and flames
- Process transcripts of bugged phone calls into relevant and irrelevant
- Issues
- Real-time?
- How many categories/document? Flat or hierarchical?
- Categories defined automatically or by hand?
4 Document Categorization
- Usually
- relatively few categories
- well defined: a person could do the task easily
- Categories don't change quickly
- Flat vs Hierarchy
- Simple categorization is into mutually-exclusive document collections
- Richer categorization is into hierarchy with multiple inheritance
- broader and narrower categories
- documents can go more than one place
- merges into search-engine with category browsers
5 Categorization -- Automatic
- Statistical approaches similar to search engine
- Set of training documents define categories
- Underlying representation of document is bag of words/TFIDF variant
- Category description is created using neural nets, regression trees, other Machine Learning techniques
- Individual documents categorized by net, inferred rules, etc.
- Requires relatively little effort to create categories
- Accuracy is heavily dependent on "good" training examples
- Typically limited to flat, mutually exclusive categories
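A minimal sketch of the statistical approach above, assuming a nearest-centroid rule as a simple stand-in for the neural nets or regression trees the slide names. The training documents, the whitespace tokenizer, the IDF formula, and the function names (build_idf, tfidf, train, categorize) are illustrative assumptions, not part of the course material.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_idf(docs):
    """Inverse document frequency for every term in the training set."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf(text, idf):
    """Bag-of-words vector: term frequency times IDF; unseen terms are dropped."""
    tf = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(labeled_docs):
    """Return (idf, centroids): one summed TF-IDF vector per category."""
    idf = build_idf([text for text, _ in labeled_docs])
    centroids = {}
    for text, label in labeled_docs:
        centroid = centroids.setdefault(label, Counter())
        for term, weight in tfidf(text, idf).items():
            centroid[term] += weight
    return idf, centroids

def categorize(text, idf, centroids):
    """Assign the single category whose centroid is most similar (flat, exclusive)."""
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical training set for the email example (work, personal, junk).
training = [
    ("meeting agenda for the project review", "work"),
    ("quarterly budget report attached", "work"),
    ("dinner with the family on sunday", "personal"),
    ("happy birthday see you at the party", "personal"),
    ("win a free prize click now", "junk"),
    ("free offer click to claim your prize", "junk"),
]
idf, centroids = train(training)
print(categorize("please review the budget before the meeting", idf, centroids))
print(categorize("click now for a free prize", idf, centroids))
```

Note that the categories come entirely from the labeled examples, which is why accuracy depends so heavily on "good" training documents.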
6 Categorization -- Manual
- Natural Language/linguistic techniques
- Categories are defined by people
- underlying representation of document is stream of tokens
- category description contains
- ontology of terms and relations
- pattern-matching rules
- individual documents categorized by pattern-matching
- Defining categories can be very time-consuming
- Typically takes some experimentation to "get it right"
- Can handle much more complex structures
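A rough sketch of categorization by hand-written pattern-matching rules. The rule table and the categorize function are hypothetical, and the sketch leaves out the ontology of terms and relations entirely; it only illustrates that people, not training data, define the categories, and that a document can land in more than one of them.

```python
import re

# Hypothetical hand-written rules: category -> regular expressions over the text.
RULES = {
    "junk": [r"\bfree\b.*\bprize\b", r"\bclick (here|now)\b"],
    "work": [r"\b(meeting|agenda|budget)\b"],
}

def categorize(text, rules=RULES):
    """Return every category that has at least one matching pattern."""
    lowered = text.lower()
    return sorted(
        category
        for category, patterns in rules.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    )

print(categorize("Click now to claim your FREE prize"))
print(categorize("Budget meeting: free lunch, no prize"))
print(categorize("dinner on sunday"))
```

The second call matches both categories because the junk rule is too loose, which is the kind of thing that takes experimentation to "get right".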
7 Document Classification
- Document classification
- Cluster documents based on similarity
- Examples
- Group samples of writing in an attempt to determine author(s)
- Look for hot spots in customer feedback
- Find new trends in a document collection (outliers, hard to classify)
- Getting into areas where we don't know ahead of time what we will have: true mining
8 Document Classification -- How
- Typical process is
- Describe each document
- Assess similarities among documents
- Establish classification scheme which creates optimal "separation"
- One typical approach
- document is represented as term vector
- cosine similarity for measuring association
- bottom-up pairwise combining of documents to get clusters
- Assumes you have the corpus in hand
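The "one typical approach" above can be sketched as follows, assuming raw term counts as the term vector, summed vectors to represent a merged cluster, and a similarity threshold as the stopping rule. The threshold value, the sample documents, and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def term_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(c * b[t] for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.3):
    """Bottom-up clustering: repeatedly merge the two most similar clusters.

    Each cluster is (member indices, summed term vector). Merging stops when
    the best remaining pair is less similar than `threshold`.
    """
    clusters = [([i], term_vector(d)) for i, d in enumerate(docs)]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)

docs = [
    "golf tournament cut streak",
    "golf tournament championship",
    "mars rover photograph",
    "mars planet photograph telescope",
]
print(cluster(docs))  # [[0, 1], [2, 3]]
```

The whole corpus is compared pairwise on every pass, which is why this approach assumes you have the corpus in hand.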
9 Document Clustering
- Approaches vary a great deal in
- document characteristics used to describe document (linguistic or semantic? BOW?)
- methods used to define "similar"
- methods used to create clusters
- Other relevant factors
- Number of clusters to extract is variable
- Often combined with visualization tools based on similarity and/or clusters
- Sometimes important that approach be incremental
- Useful approach when you don't have a handle on the domain or it's changing
10 Document Visualization
- Visualization
- Visually display relationships among documents
- Examples
- hyperbolic viewer based on document similarity: browse a field of scientific documents
- map-based techniques showing peaks, valleys, outliers
- graphs showing relationships between companies and research areas
- Highly interactive, intended to aid a human in finding interrelationships and new knowledge in the document set.
11 Latent Semantic Analysis
- Bag of Words methods we have looked at ignore syntax -- a document is "about" the words in it
- People interpret documents in a richer context
- a document is about some domain
- reflected in the vocabulary
- but not limited to it
12 Match Topic and Phrase
- I saw Pathfinder on Mars with a telescope.
- The Pathfinder photograph mars our perception of a lifeless planet.
- The Pathfinder photograph from Ford has arrived.
- When a Pathfinder fords a river it sometimes mars its paint job.
- Astronomy
- Automobiles
- Biology
13 Domain-Based Processing
- This task is relatively easy because we know a lot about all of the domains, and can disambiguate using that knowledge.
- It's not completely trivial: the biology choice could also have been astronomy.
- Information Extraction systems like GATE and AeroText model the domain knowledge explicitly, but this takes a lot of effort.
- Is there an easier way?
14 Word Co-Occurrences
- BOW approaches assume meaning is carried by vocabulary, ignore syntax
- Domain modeling approaches capture detailed knowledge about the meaning
- An intermediate position is to look at vocabulary groups: what words tend to occur together?
- Still a statistical approach, but richer representation than single terms
15 Examples of What We Would Like
- Looking for articles about Tiger Woods in an AP newswire database brings up stories about the golfer, followed by articles about golf tournaments that don't mention his name.
- Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players.
- So we are recognizing that Tiger Woods is about golf.
- javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
16 Example
- Tiger Woods takes some drama out of cut streak with opening round at Funai.
- Every player on the money list is at Disney trying to make it to the Tour Championship. Tiger Woods has no such worries.
- Going into this week's event, Tiger Woods has made the cut in 113 successive events. He tied the PGA Tour's consecutive cut record two weeks ago at the Funai Classic in Orlando, Florida, while Cink finished second.
- Stewart Cink finished second at the Funai Classic at Walt Disney World.
17 Example, Cont.
- Woods tended to occur in the same articles as Funai.
- Cink also tended to occur in articles about Funai.
- So there is a relationship between Woods and Cink which is stronger than is indicated just by the one article in which they are both mentioned.
- It has to do with cuts, Funai, and the championship tour.
- So by creating a term-document matrix and examining it we can find potential relationships which are latent, or hidden. They are tied together by the meaning, or semantics, of the terms.
- This is the basic concept of Latent Semantic Analysis and Latent Semantic Indexing.
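As a small illustration, the four passages from the example slide can be turned into a term-document matrix and inspected directly. The regular-expression tokenizer and the helper names (term_document_matrix, shared_docs) are assumptions made for this sketch.

```python
import re
from collections import Counter

def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw counts."""
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    return {term: [c[term] for c in counts] for term in vocab}

def shared_docs(matrix, a, b):
    """Number of documents in which both terms occur."""
    return sum(1 for x, y in zip(matrix[a], matrix[b]) if x and y)

docs = [
    "Tiger Woods takes some drama out of cut streak with opening round at Funai.",
    "Every player on the money list is at Disney trying to make it to the Tour "
    "Championship. Tiger Woods has no such worries.",
    "Going into this week's event, Tiger Woods has made the cut in 113 successive "
    "events. He tied the PGA Tour's consecutive cut record two weeks ago at the "
    "Funai Classic in Orlando, Florida, while Cink finished second.",
    "Stewart Cink finished second at the Funai Classic at Walt Disney World.",
]
matrix = term_document_matrix(docs)
for term in ("woods", "cink", "funai"):
    print(term, matrix[term])
print(shared_docs(matrix, "woods", "cink"))   # 1: only one passage names both
print(shared_docs(matrix, "woods", "funai"))  # 2
print(shared_docs(matrix, "cink", "funai"))   # 2
```

Woods and Cink share only one column directly, but both rows overlap with the Funai row, which is the hidden link the slide describes.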
18 Problem: Very High Dimensionality
- A vector of TFIDF representing a document is high dimensional.
- If we start looking at a matrix of terms by documents, it gets even worse.
- Need some way to trim words looked at
- First, throw away anything "not useful"
- Second, identify clusters and pick representative terms
19 Throw Away
- Most domain semantics carried by nouns, adjectives, verbs, adverbs
- throw away prepositions, articles, conjunctions, pronouns
- Very frequent words don't add to domain semantics.
- throw away common verbs (go, be, see), adjectives (big, good, bad), adverbs (very)
- throw away words which appear in most documents
- Very infrequent words don't either
- throw away terms which only appear in one document
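A sketch of these three filters, assuming a tiny hand-made stop list (a real one would be far longer, or would use part-of-speech tags) and treating "most documents" as more than half of them. The cutoff values, the sample documents, and the function name prune_vocabulary are illustrative assumptions.

```python
from collections import Counter

# A tiny illustrative stop list covering function words and very common words.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "or", "it", "to",
              "go", "be", "see", "big", "good", "bad", "very"}

def prune_vocabulary(docs, max_doc_fraction=0.5, min_docs=2):
    """Return the terms that survive all three filters, in sorted order."""
    tokenized = [set(d.lower().split()) for d in docs]
    df = Counter(term for words in tokenized for term in words)
    n = len(docs)
    return sorted(
        term for term, count in df.items()
        if term not in STOP_WORDS           # drop stop-listed words
        and count / n <= max_doc_fraction   # drop words found in most documents
        and count >= min_docs               # drop words found in only one document
    )

docs = [
    "report the golf tournament at funai",
    "report woods made the cut at the golf tournament",
    "cink finished second at funai",
    "report the mars rover landed",
]
print(prune_vocabulary(docs))  # ['funai', 'golf', 'tournament']
```

On a corpus this small the last filter is far too aggressive (it discards "woods" and "cink"); the cutoffs only make sense on a large collection.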
20 What's Left
- A condensed matrix where we can assume that most terms are meaningful.
- It's still very large, and very sparse.
- Basic index table for a keyword search tool.
- Where can we go now?
- We have fewer concepts than terms
- So move from terms to concepts
- So identify clusters and pick representative terms
21 Singular Value Decomposition
- One approach to this is called Singular Value Decomposition.
- Have a term space of thousands of dimensions, with each document a vector in that space.
- Want to project or map those dimensions onto a smaller number of dimensions in such a way that relative distance among vectors is preserved as much as possible.
- We end up with a much smaller number of dimensions, and a vector for each document of its value for those dimensions
- For a detailed explanation
- http://www.acm.org/sigmm/MM98/electronic_proceedings/huang/node4.html
22 Dimension Reduction
- For n (words) x m (documents) matrix M
- Finds least squares best U (n x k)
- Rows of U map input features (words) to encoded features (concept clusters)
- Closely related to
- symm. eigenvalue decomposition,
- factor analysis
- principal component analysis
- Subroutine in many math packages.
www.cs.princeton.edu/courses/archive/fall01/cs302/notes/12-5/LSI.ppt
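A dependency-free sketch of the reduction, assuming power iteration with deflation as a stand-in for the SVD routine a math package would provide. The function name truncated_svd, the toy 6 x 4 word-by-document matrix, and the choice k = 2 are illustrative assumptions; in practice one would call a library SVD instead.

```python
import math
import random

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def cosine(a, b):
    na, nb = norm(a), norm(b)
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def truncated_svd(M, k, iters=200, seed=0):
    """Top-k singular triplets of M by power iteration with deflation.

    Returns (U, S, V): U is n x k (one row per word), S holds the k singular
    values, V is m x k (one row per document).
    """
    rng = random.Random(seed)
    A = [row[:] for row in M]
    us, ss, vs = [], [], []
    for _ in range(k):
        At = transpose(A)
        v = [rng.random() + 0.1 for _ in range(len(At))]
        for _ in range(iters):
            v = matvec(At, matvec(A, v))  # one step of (A^T A) v
            length = norm(v)
            if length == 0.0:
                break
            v = [x / length for x in v]
        Av = matvec(A, v)
        sigma = norm(Av)
        if sigma < 1e-12:  # nothing left to extract
            break
        u = [x / sigma for x in Av]
        us.append(u)
        ss.append(sigma)
        vs.append(v)
        # Deflate: subtract the rank-1 component just found.
        A = [[A[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return transpose(us), ss, transpose(vs)

# Toy matrix M: rows are words, columns are documents d0..d3.
M = [
    [1, 1, 0, 0],  # golf
    [1, 0, 0, 0],  # woods
    [0, 1, 0, 0],  # cink
    [0, 0, 1, 1],  # mars
    [0, 0, 1, 0],  # planet
    [0, 0, 0, 1],  # rover
]
U, S, V = truncated_svd(M, k=2)
# Each document becomes a k-dimensional vector (singular value times V entry).
docs_k = [[S[c] * V[j][c] for c in range(len(S))] for j in range(len(V))]
raw = transpose(M)
print(round(cosine(raw[0], raw[1]), 3))        # 0.5 in the original word space
print(round(cosine(docs_k[0], docs_k[1]), 3))  # 1.0 in the reduced space
```

In this toy case the two golf documents share only one of their two words, yet they collapse onto the same direction once the six word dimensions are reduced to two concept dimensions.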
23 LSI/LSA
- Latent semantic indexing is the application of SVD to IR.
- Latent semantic analysis is the more general term.
- Features are words, examples are text passages.
- Latent: Not visible on the surface
- Semantic: Word meanings
24 Geometric View
- Words embedded in high-dimensional space.
[Figure: the words "exam", "test", and "fish" plotted in the space, with the values 0.02, 0.42, and 0.01]
25 Comparison to VSM
- A: The feline climbed upon the roof
- B: A cat leapt onto a house
- C: The final will be on a Thursday
- How similar?
- Vector space model: sim(A,B) = 0
- LSI: sim(A,B) = .49, sim(A,C) = .45
- Non-zero sim with no words in common by overlap in reduced representation.
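The vector space model half of this comparison is easy to check by hand. The sketch below assumes raw word counts with no stop-word removal; the function names are illustrative. It does not reproduce the LSI figures, which depend on the corpus the reduced space was trained on.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(c * b[t] for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

A = bag_of_words("The feline climbed upon the roof")
B = bag_of_words("A cat leapt onto a house")
C = bag_of_words("The final will be on a Thursday")

print(cosine(A, B))      # 0.0: A and B have no words in common
print(cosine(B, C) > 0)  # True: B and C share only the article "a"
```

With plain word overlap, B looks closer to the unrelated sentence C than to its paraphrase A, which is the weakness the reduced representation is meant to address.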
26 What Does LSI Do?
27 Plato's Problem
- 7th grader learns 10-15 new words today, fewer than 1 by direct instruction. Perhaps 3 were even encountered. How can this be?
- Plato: You already knew them.
- LSA: Many weak relationships combined (data to back it up!)
- Rate comparable to students.
28 Vocabulary
- TOEFL synonym test
- Choose alternative with highest similarity score.
- LSA correct on 64% of 80 items.
- Matches avg applicant to US college. Mistakes correlate w/ people (r = .44).
- best solo measure of intelligence
29 Multiple Choice Exam
- Trained on psych textbook.
- Given same test as students.
- LSA: 60%, lower than average, but passes.
- Has trouble with hard ones.
30 Essay Test
- LSA can't write.
- If you can't do, judge.
- Students write essays, LSA trained on related text.
- Compare similarity and length with graded essays (labeled).
- Cosine weighted average of top 10. Regression to combine sim and len.
- Correlation .64-.84. Better than human. Bag of words!?
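One reading of "cosine weighted average of top 10" can be sketched as below: take the ten graded essays most similar to the new essay and average their grades, weighting each by its cosine similarity. The function name, the sample numbers, and the assumption that similarities are positive are all illustrative; the regression step that combines similarity with length is not shown.

```python
def predict_grade(similarities, grades, top_n=10):
    """Similarity-weighted average grade of the top_n most similar graded essays.

    similarities[i] is the cosine between the new essay and graded essay i;
    grades[i] is that essay's human-assigned grade.
    """
    ranked = sorted(zip(similarities, grades), reverse=True)[:top_n]
    total = sum(sim for sim, _ in ranked)
    if total <= 0:
        raise ValueError("no graded essay is positively similar to this one")
    return sum(sim * grade for sim, grade in ranked) / total

# Hypothetical similarities and grades for five already-graded essays.
sims = [0.9, 0.8, 0.2, 0.1, 0.05]
grades = [5, 4, 2, 1, 1]
print(round(predict_grade(sims, grades), 2))
print(round(predict_grade(sims, grades, top_n=2), 2))
```

The most similar essays dominate the estimate, so a new essay that closely resembles highly graded ones gets a high predicted grade.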
31 Digit Representations
- Look at similarities of all pairs from one to nine.
- Look at best fit of these similarities in one dimension: they come out in order!
- Similar experiments with cities in Europe in two dimensions.
32 Word Sense
- The chemistry student knew this was not a good time to forget how to calculate volume and mass.
- heavy? .21
- church? .14
- LSI picks best p
33 LSA Applications
- Improve IR.
- Cross-language IR. Train on parallel collection.
- Measure text coherency.
- Use essays to pick educational text.
- Grade essays.
- Visualize word clusters
- Demos at http://LSA.colorado.edu
34 LSI Background Reading
- Landauer, Laham, Foltz (1998). Learning human-like knowledge by Singular Value Decomposition: A Progress Report. Advances in Neural Information Processing Systems 10 (pp. 44-51).
- http://lsa.colorado.edu/papers/nips.ps