Title: David Newman, UC Irvine, Lecture 12: Topic Models 1
1. CS 277 Data Mining, Lecture 12: Topic Models
- David Newman
- Department of Computer Science
- University of California, Irvine
2. Notices
- Homework 2 due now
- Homework 3 available on web
- Progress Report 2 due Tuesday Nov 13 in class
- will give instructions for Progress Report 2 on Thursday
3. Progress Report 1 Comments
- Good progress overall
- But (for some of you)
- Scale back your plans
- Be realistic about what you can accomplish
- Focus on producing a preliminary result
- This is a DATA mining project
- If you don't yet have your data, get it as soon as possible
4. Progress Report 1 Comments (cont.)
- Extreme programming (XP) / rapid prototyping
- Don't have a grand fixed plan
- Get something working early on
- Suggestions
- Do the simple method first
- Use a subset of the data first
- For next week
- Produce an actual result with actual data
- Pretend you have 1 week to convince your manager that this is a worthwhile project. What work would you do, and how would you present it?
5. Today's lecture
- Topic modeling
- Recap on LSI/SVD
- NMF
- Topic modeling
6. Latent Semantic Indexing
- Latent Semantic Indexing (LSI) ≈ Singular Value Decomposition (SVD) ≈ Principal Component Analysis (PCA)
- Matlab (expanded in the sketch below):
- [U,S,V] = svds(X,20)
- err = norm(X - U*S*V')
- Issues?
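To make the recipe concrete, here is a minimal Matlab sketch (my own illustration, not from the slides), assuming X is a sparse term-by-document count matrix:

    [U, S, V] = svds(X, 20);                     % rank-20 truncated SVD
    err = norm(X - U*S*V', 'fro');               % reconstruction error
    % inspect a "topic" direction: sort terms by weight in a singular vector
    [wts, ix] = sort(abs(U(:,1)), 'descend');
    disp(ix(1:10));                              % indices of the 10 strongest terms

The signed word weights on the next slide are exactly such sorted entries of the singular vectors; note that they can be negative, which is arguably one of the "issues" that motivates NMF.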
7. 1st and 6th topic vectors
- topic 1
- 0.29 terror
- 0.28 12
- 0.26 sept
- 0.26 aftermath
- 0.23 2001
- 0.17 commandeered
- 0.15 times
- 0.15 piecing
- 0.14 jetliners
- 0.13 york
- topic 6
- -0.27 cheese
- -0.21 flour
- -0.19 cheddar
- -0.18 grams
- -0.16 eggs
- -0.16 baking
- -0.15 teaspoon
- -0.15 soup
- -0.13 butter
- -0.13 cups
8. Nonnegative Matrix Factorization (NMF)
- V ≈ WH, with W ≥ 0, H ≥ 0
- Objective: Q = ||V - WH||^2
- Gradient descent?
- → whiteboard (see the multiplicative-update sketch below)
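Gradient descent works, but a standard alternative (a sketch of Lee and Seung's multiplicative updates, not code from the lecture) keeps W and H nonnegative automatically. Assume V is a nonnegative term-by-document matrix and K is the number of topics:

    K = 20;                                      % number of topics (assumed)
    [m, n] = size(V);
    W = rand(m, K); H = rand(K, n);              % random nonnegative initialization
    for iter = 1:200
        H = H .* (W' * V) ./ (W' * W * H + eps); % multiplicative update for H
        W = W .* (V * H') ./ (W * H * H' + eps); % multiplicative update for W
    end
    Q = norm(V - W*H, 'fro')^2;                  % objective ||V - WH||^2

Because each update only multiplies by a nonnegative ratio, W ≥ 0 and H ≥ 0 are preserved at every step.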
9. Unsupervised Learning from Text
- Large collections of unlabeled documents..
- Web
- Digital libraries
- Email archives, etc
- Often wish to organize/summarize/index/tag these documents automatically
- We will look at probabilistic techniques for clustering and topic extraction from sets of documents
10. Pennsylvania Gazette
- 1728-1800
- 80,000 articles
- 25 million words
- www.accessible.com
11. Enron email data
- 250,000 emails
- 28,000 authors
- 1999-2002
12. (No transcript: image-only slide.)
13. Other Examples of Data Sets
- CiteSeer digital collection
- 700,000 papers, 700,000 authors, 1986-2005
- MEDLINE collection
- 16 million abstracts in medicine/biology
- US Patent collection
- and many more....
14. Problems of Interest
- What topics do these documents span?
- Which documents are about a particular topic?
- How have topics changed over time?
- What does author X write about?
- Who is likely to write about topic Y?
- Who wrote this specific document?
- and so on..
15. Probability Models for Documents
- Example: 50,000 possible words in our vocabulary
- Simple memoryless model, aka "bag of words"
- 50,000-sided die
- each side of the die represents 1 word
- a non-uniform die: each side/word has its own probability
- to generate N words we toss the die N times
- gives a "bag of words" (no sequence information)
- This is a simple probability model
- p(document | φ) = ∏_i p(word_i | φ)
- to "learn" the model we just count frequencies (sketch below)
- p(word i) = (number of occurrences of word i) / (total number of words)
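As a sketch of the counting step (my illustration; doc is assumed to be a vector of word ids into the 50,000-word vocabulary):

    V = 50000;                                   % vocabulary size
    counts = accumarray(doc(:), 1, [V 1]);       % occurrences of each word
    p = counts / sum(counts);                    % p(word i) = count_i / total count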
16. The Multinomial Model
- Example: tossing a 6-sided die
- P = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
- Multinomial model for documents
- V-sided die: a probability distribution over possible words
- Some words have higher probability than others
- Document with N words generated by N memoryless draws
- Typically interested in conditional multinomials, e.g.,
- p(words | spam) versus p(words | non-spam) (sketch below)
17. Real examples of Word Multinomials
18. Explaining The Topic Model
The topic model is based on an easy-to-understand document generator: documents are generated by randomly choosing words out of topic buckets.
- "Orange/Yellow Fruit/Veg." bucket: orange, pumpkin, carrot, mango, banana
- "Green Fruit/Veg." bucket: kiwi, spinach, leek, broccoli, apple
19-34. Explaining The Topic Model (animation frames)
A sequence of slides builds up two documents word by word from the buckets:
- Recipe 1 mixes the topics 67% Orange/Yellow and 33% Green. Drawing words from the buckets in those proportions yields:
Recipe 1: pumpkin kiwi carrot orange carrot orange leek broccoli carrot banana kiwi pumpkin
- Recipe 2 mixes the topics 33% Orange/Yellow and 67% Green, yielding:
Recipe 2: mango apple apple leek orange broccoli spinach pumpkin broccoli apple kiwi banana
35. The Topic Model
The topic model inverts this process to determine both the likely topics in the collection and the mix of topics within each document. Note that there is no a priori definition of topics.
Recipe 1: pumpkin kiwi carrot orange carrot orange leek broccoli carrot banana kiwi pumpkin
Recipe 2: mango apple apple leek orange broccoli spinach pumpkin broccoli apple kiwi banana
(words not colored!)
36. The Topic Model
Same caption as the previous slide; here the diagram shows the unknowns to be inferred: the words in topic 1 and topic 2, and the topics-in-doc mixture for each recipe, all displayed as "???". (words not colored!)
37. The Topic Model
The topic model inverts this process to determine both the likely topics in the collection and the mix of topics within each document. Note that there is no a priori definition of topics.
- topic 1 (words in topic): orange carrot mango pumpkin banana
- topic 2 (words in topic): apple leek broccoli kiwi spinach
Recipe 1: pumpkin kiwi carrot orange carrot orange leek broccoli carrot banana kiwi pumpkin
Recipe 2: mango apple apple leek orange broccoli spinach pumpkin broccoli apple kiwi banana
38. Humans Interpret Topics
After the model produces the topics, a human domain expert interprets the list of most likely words and chooses a topic name.
- topic 1: orange carrot mango pumpkin banana → "Orange/Yellow Fruit/Veg."
- topic 2: apple leek broccoli kiwi spinach → "Green Fruit/Veg."
39. Explaining The Topic Model
40. Graphical Model (annotated)
- z_i is a "label" for each word w_i
- Prob(w_i | φ, z_i = t): a multinomial over words, i.e., a "topic"
- Prob(z_i | θ_d): a distribution over topics that is document-specific
- (Diagram labels: topics in doc, topic, word, words in topic.)
41. Learning using Gibbs Sampling
- N(w,t) = count of word w assigned to topic t
- N(t,d) = count of topic t assigned to doc d
- Probability that word i is assigned to topic t (the standard collapsed Gibbs update; see the sampler sketch below):
- p(z_i = t | all other z) ∝ (N(w_i,t) + β) / (Σ_w N(w,t) + Vβ) × (N(t,d_i) + α)
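A minimal collapsed Gibbs sampler sketch of this update (my reconstruction of the standard algorithm, not the lecture's code). Assumed inputs: w(i) and d(i) give the word id and doc id of token i; V is the vocabulary size and D the number of documents:

    T = 10; alpha = 0.1; beta = 0.01;            % topics and hyperparameters (assumed)
    N = length(w);
    z = randi(T, N, 1);                          % random initial topic assignments
    Nwt = zeros(V, T); Ntd = zeros(T, D);        % the two count matrices
    for i = 1:N
        Nwt(w(i), z(i)) = Nwt(w(i), z(i)) + 1;
        Ntd(z(i), d(i)) = Ntd(z(i), d(i)) + 1;
    end
    for iter = 1:500                             % several hundred iterations typical
        for i = 1:N
            t = z(i);                            % remove token i from the counts
            Nwt(w(i), t) = Nwt(w(i), t) - 1;
            Ntd(t, d(i)) = Ntd(t, d(i)) - 1;
            p = (Nwt(w(i), :) + beta) ./ (sum(Nwt, 1) + V*beta) ...
                .* (Ntd(:, d(i))' + alpha);      % conditional p(z_i = t | rest)
            t = find(cumsum(p) >= rand * sum(p), 1);  % sample a new topic
            z(i) = t;                            % put token i back with its new topic
            Nwt(w(i), t) = Nwt(w(i), t) + 1;
            Ntd(t, d(i)) = Ntd(t, d(i)) + 1;
        end
    end

Each pass over i = 1..N is one Gibbs iteration, a full sweep through all words in all documents.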
42. Probabilistic Model
- Generative direction: Parameters → Real World Data, via P(Data | Parameters)
- Statistical inference: Data → Parameters, via P(Parameters | Data)
43. See a Gibbs sampler live!
- http://psiexp.ss.uci.edu/research/demotopics2.html
44. Sample topics from different collections
45. (No transcript: image-only slide.)
46. A Graphical Model
- p(doc | φ) = ∏_i p(w_i | φ)
- φ = "parameter vector" = a set of probabilities, one per word: p(w | φ)
- (Diagram: node φ with arrows to words w_1, w_2, ..., w_n.)
47. Another view....
- p(doc | φ) = ∏_i p(w_i | φ)
- This is plate notation
- Items inside the plate are conditionally independent given the variable outside the plate
- There are n conditionally independent replicates represented by the plate
- (Diagram: φ → w_i, plate i = 1..n.)
48. Being Bayesian....
- This is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0
- (Diagram: α → φ → w_i, plate i = 1..n.)
49. Being Bayesian....
- Learning: infer p(φ | words, α), which is proportional to p(words | φ) p(φ | α)
- (Diagram: α → φ → w_i, plate i = 1..n.)
50. Multiple Documents
- p(corpus | φ) = ∏_d p(doc_d | φ)
- (Diagram: α → φ → w_i, plates i = 1..n and d = 1..D.)
51. Different Document Types
- p(w | φ) is a multinomial over words
- (Diagram: α → φ → w_i, plate 1..n.)
52. Different Document Types
- p(w | φ) is a multinomial over words
- (Diagram: α → φ → w_i, plates 1..n and 1..D.)
53. Different Document Types
- p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc
- (Diagram: α → φ → w_i ← z_d, plates 1..n and 1..D.)
54. Different Document Types
- p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc
- Different multinomials, depending on the value of z_d (discrete): φ now represents one multinomial per value of z_d
- (Diagram: α → φ → w_i ← z_d, plates 1..n and 1..D.)
55. Unknown Document Types
- Now the values of z for each document are unknown - hopeless?
- (Diagram: α → φ → w_i ← z_d, with a prior on z_d; plates 1..n and 1..D.)
56. Unknown Document Types
- Now the values of z for each document are unknown - hopeless?
- Not hopeless :) We can learn about both z and φ, e.g., with the EM algorithm
- This gives probabilistic clustering: p(w | z = k, φ) is the kth multinomial over words
- (Diagram: α → φ → w_i ← z_d, with a prior on z_d; plates 1..n and 1..D.)
57. Topic Model
- z_i is a "label" for each word
- p(w | φ, z_i = k): a multinomial over words, i.e., a "topic"
- p(z_i | θ_d): a distribution over topics that is document-specific
- (Diagram: α → θ_d → z_i → w_i ← φ, plates 1..n and 1..D; generative sketch below.)
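For intuition, a tiny sketch of the generative process this diagram encodes (toy sizes; φ is assumed given here, and the Dirichlet(1) draw is built from exponentials to avoid toolbox dependencies):

    T = 2; V = 10; n = 12;                       % toy sizes (assumed)
    g = -log(rand(1, T));                        % Gamma(1,1) draws, so...
    theta = g / sum(g);                          % theta_d ~ Dirichlet(1): topics in doc
    phi = rand(V, T); phi = phi ./ sum(phi, 1);  % assumed topic-word multinomials
    doc = zeros(1, n);
    for i = 1:n
        zi = find(cumsum(theta) >= rand, 1);           % z_i ~ theta_d
        doc(i) = find(cumsum(phi(:, zi)) >= rand, 1);  % w_i ~ phi(:, z_i)
    end

This is exactly the recipe-from-buckets story of slides 18-34, written as code.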
58. Key Features of Topic Models
- Generative model for documents in the form of bags of words
- Allows a document to be composed of multiple topics
- Much more powerful than one doc → one cluster
- Completely unsupervised
- Topics learned directly from data
- Leverages strong dependencies at the word level AND large data sets
- Learning algorithm
- Gibbs sampling is the method of choice
- Scalable
- Linear in number of word tokens
- Can be run on millions of documents
59. More Details on Learning
- Gibbs sampling for z
- Typically run several hundred Gibbs iterations
- 1 iteration = full pass through all words in all documents
- Estimating θ and φ (sketch below)
- from a z sample → point estimates
- non-informative Dirichlet priors for θ and φ
- Computational efficiency
- Learning is linear in the number of word tokens
- Memory requirements can be a limitation for large corpora
- Predictions on new documents
- can average over θ and φ (from different samples, different runs)
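A sketch of the point estimates, reusing the count matrices Nwt and Ntd from the sampler sketch above (assumes Matlab's implicit expansion, R2016b+):

    phi   = (Nwt + beta)  ./ (sum(Nwt, 1) + V*beta);    % V x T: p(w | t)
    theta = (Ntd + alpha) ./ (sum(Ntd, 1) + T*alpha);   % T x D: p(t | d)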
60. History of topic models
- Origins in statistics
- latent class models in social science
- admixture models in statistical genetics
- Applications in computer science
- Hofmann, SIGIR, 1999
- Blei, Ng, Jordan, JMLR 2003
- Griffiths and Steyvers, PNAS, 2004
- More recent work
- Author-topic models (Steyvers et al., Rosen-Zvi et al., 2004)
- Hierarchical topics (McCallum et al., 2006)
- Correlated topic models (Blei and Lafferty, 2005)
- Dirichlet process models (Teh, Jordan, et al.)
- Large-scale web applications (Buntine et al., 2004, 2005)
- Undirected models (Welling et al., 2004)
61. Examples of Topics from CiteSeer
62. Four example topics from NIPS
63. Clusters v. Topics
64. Clusters v. Topics
- One Cluster
65. Clusters v. Topics
- Multiple Topics
- One Cluster
66. What can Topic Models be used for?
- Queries
- Who writes on this topic?
- e.g., finding experts or reviewers in a particular area
- What topics does this person do research on?
- Comparing groups of authors or documents
- Discovering trends over time
- Detecting unusual papers and authors
- Interactive browsing of a digital library via topics
- Parsing documents (and parts of documents) by topic
- and more..
67. What is this paper about?
- "Empirical Bayes screening for multi-item associations"
- Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001
- Most likely topics according to the model are
- data, mining, discovery, association, attribute, ..
- set, subset, maximal, minimal, complete, ..
- measurements, correlation, statistical, variation, ..
- Bayesian, model, prior, data, mixture, ..
68. 3 of 300 example topics (TASA)
69. Automated Tagging of Words (numbers and colors → topic assignments)
70. Experiments on Various Data Sets
- Corpora
- CiteSeer: 160K abstracts, 85K authors
- NIPS: 1.7K papers, 2K authors
- Enron: 250K emails, 28K authors (senders)
- Medline: 300K abstracts, 128K authors
- Removed stop words; no stemming
- Ignore word order, just use word counts
- Processing time
- NIPS: 2000 Gibbs iterations → 8 hours
- CiteSeer: 2000 Gibbs iterations → 4 days
71. Four example topics from CiteSeer (T=300)
72. More CiteSeer Topics
73. Temporal patterns in topics: hot and cold topics
- We have CiteSeer papers from 1986-2002
- For each year, calculate the fraction of words assigned to each topic
- gives a time-series for each topic (see the sketch below)
- Hot topics become more prevalent
- Cold topics become less prevalent
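A sketch of the time-series computation, reusing z from the sampler sketch above and assuming year(i) gives the publication year of the document containing token i:

    years = 1986:2002;
    frac = zeros(T, numel(years));
    for y = 1:numel(years)
        idx = (year == years(y));                % tokens from papers in this year
        for t = 1:T
            frac(t, y) = sum(z(idx) == t) / max(sum(idx), 1);
        end
    end
    % plot(years, frac(t,:)): a rising curve is a hot topic, a falling one is cold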
74-79. (No transcript: image-only slides.)