Title: David Newman, UC Irvine, Lecture 12: Topic Models 1
1. CS 277 Data Mining, Lecture 12: Topic Models
- David Newman
- Department of Computer Science
- University of California, Irvine
2. Notices
- Homework 2 due now
- Homework 3 available on web
- Progress Report 2 due Tuesday Nov 13 in class
- will give instructions for Progress Report 2 on Thursday
3. Progress Report 1 Comments
- Good progress overall
- But (for some of you)
- Scale back your plans
- Be realistic about what you can accomplish
- Focus on producing a preliminary result
- This is a DATA mining project
- If you don't yet have your data, get it as soon as possible
4. Progress Report 1 Comments (cont.)
- Extreme programming (XP) / rapid prototyping
- Don't have a grand fixed plan
- Get something working early on
- Suggestions
- Do the simple method first
- Use a subset of the data first
- For next week
- Produce an actual result with actual data
- Pretend you have 1 week to convince your manager that this is a worthwhile project. What work would you do, and how would you present it?
5. Today's lecture
- Topic modeling
- Recap on LSI/SVD
- NMF
- Topic modeling
6. Latent Semantic Indexing
- Latent Semantic Indexing (LSI) ≈ Singular Value Decomposition (SVD) ≈ Principal Component Analysis (PCA)
- Matlab (expanded in the sketch below):
- [U,S,V] = svds(X,20)
- err = norm(X - U*S*V')
- Issues?
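To make the recipe concrete, here is a minimal Matlab sketch (my own illustration, not from the slides), assuming X is a sparse term-by-document count matrix:

    [U, S, V] = svds(X, 20);                     % rank-20 truncated SVD
    err = norm(X - U*S*V', 'fro');               % reconstruction error
    % inspect a "topic" direction: sort terms by weight in a singular vector
    [wts, ix] = sort(abs(U(:,1)), 'descend');
    disp(ix(1:10));                              % indices of the 10 strongest terms

The signed word weights on the next slide are exactly such sorted entries of the singular vectors; note that they can be negative, which is arguably one of the "issues" that motivates NMF.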
7. 1st and 6th topic vectors
- topic 1
- 0.29 terror
- 0.28 12
- 0.26 sept
- 0.26 aftermath
- 0.23 2001
- 0.17 commandeered
- 0.15 times
- 0.15 piecing
- 0.14 jetliners
- 0.13 york
- topic 6
- -0.27 cheese
- -0.21 flour
- -0.19 cheddar
- -0.18 grams
- -0.16 eggs
- -0.16 baking
- -0.15 teaspoon
- -0.15 soup
- -0.13 butter
- -0.13 cups
8. Nonnegative Matrix Factorization (NMF)
- V ≈ WH, with W ≥ 0, H ≥ 0
- Objective: Q = ||V - WH||^2
- Gradient descent?
- → whiteboard (see the multiplicative-update sketch below)
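Gradient descent works, but a standard alternative (a sketch of Lee and Seung's multiplicative updates, not code from the lecture) keeps W and H nonnegative automatically. Assume V is a nonnegative term-by-document matrix and K is the number of topics:

    K = 20;                                      % number of topics (assumed)
    [m, n] = size(V);
    W = rand(m, K); H = rand(K, n);              % random nonnegative initialization
    for iter = 1:200
        H = H .* (W' * V) ./ (W' * W * H + eps); % multiplicative update for H
        W = W .* (V * H') ./ (W * H * H' + eps); % multiplicative update for W
    end
    Q = norm(V - W*H, 'fro')^2;                  % objective ||V - WH||^2

Because each update only multiplies by a nonnegative ratio, W ≥ 0 and H ≥ 0 are preserved at every step.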
9. Unsupervised Learning from Text
- Large collections of unlabeled documents..
- Web
- Digital libraries
- Email archives, etc
- Often wish to organize/summarize/index/tag these documents automatically
- We will look at probabilistic techniques for clustering and topic extraction from sets of documents
10. Pennsylvania Gazette
- 1728-1800
- 80,000 articles
- 25 million words
- www.accessible.com
11. Enron email data
- 250,000 emails
- 28,000 authors
- 1999-2002
12. (No transcript: image-only slide.)
13. Other Examples of Data Sets
- CiteSeer digital collection
- 700,000 papers, 700,000 authors, 1986-2005
- MEDLINE collection
- 16 million abstracts in medicine/biology
- US Patent collection
- and many more....
14. Problems of Interest
- What topics do these documents span?
- Which documents are about a particular topic?
- How have topics changed over time?
- What does author X write about?
- Who is likely to write about topic Y?
- Who wrote this specific document?
- and so on..
15. Probability Models for Documents
- Example: 50,000 possible words in our vocabulary
- Simple memoryless model, aka "bag of words"
- 50,000-sided die
- each side of the die represents 1 word
- a non-uniform die: each side/word has its own probability
- to generate N words we toss the die N times
- gives a "bag of words" (no sequence information)
- This is a simple probability model
- p(document | φ) = ∏_i p(word_i | φ)
- to "learn" the model we just count frequencies (sketch below)
- p(word i) = (number of occurrences of word i) / (total number of words)
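As a sketch of the counting step (my illustration; doc is assumed to be a vector of word ids into the 50,000-word vocabulary):

    V = 50000;                                   % vocabulary size
    counts = accumarray(doc(:), 1, [V 1]);       % occurrences of each word
    p = counts / sum(counts);                    % p(word i) = count_i / total count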
16. The Multinomial Model
- Example: tossing a 6-sided die
- P = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
- Multinomial model for documents
- V-sided die: a probability distribution over possible words
- Some words have higher probability than others
- Document with N words generated by N memoryless draws
- Typically interested in conditional multinomials, e.g.,
- p(words | spam) versus p(words | non-spam) (sketch below)
17. Real examples of Word Multinomials
18. Explaining The Topic Model
The topic model is based on an easy-to-understand document generator: documents are generated by randomly choosing words out of topic buckets.
- "Orange/Yellow Fruit/Veg." bucket: orange, pumpkin, carrot, mango, banana
- "Green Fruit/Veg." bucket: kiwi, spinach, leek, broccoli, apple
19-34. Explaining The Topic Model (animation frames)
A sequence of slides builds up two documents word by word from the buckets:
- Recipe 1 mixes the topics 67% Orange/Yellow and 33% Green. Drawing words from the buckets in those proportions yields:
Recipe 1: pumpkin kiwi carrot orange carrot orange leek broccoli carrot banana kiwi pumpkin
- Recipe 2 mixes the topics 33% Orange/Yellow and 67% Green, yielding:
Recipe 2: mango apple apple leek orange broccoli spinach pumpkin broccoli apple kiwi banana
35. The Topic Model
The topic model inverts this process to determine both the likely topics in the collection and the mix of topics within each document. Note that there is no a priori definition of topics.
Recipe 1: pumpkin kiwi carrot orange carrot orange leek broccoli carrot banana kiwi pumpkin
Recipe 2: mango apple apple leek orange broccoli spinach pumpkin broccoli apple kiwi banana
(words not colored!)
36. The Topic Model
Same caption as the previous slide; here the diagram shows the unknowns to be inferred: the words in topic 1 and topic 2, and the topics-in-doc mixture for each recipe, all displayed as "???". (words not colored!)
37. The Topic Model
The topic model inverts this process to determine both the likely topics in the collection and the mix of topics within each document. Note that there is no a priori definition of topics.
- topic 1 (words in topic): orange carrot mango pumpkin banana
- topic 2 (words in topic): apple leek broccoli kiwi spinach
Recipe 1: pumpkin kiwi carrot orange carrot orange leek broccoli carrot banana kiwi pumpkin
Recipe 2: mango apple apple leek orange broccoli spinach pumpkin broccoli apple kiwi banana
38. Humans Interpret Topics
After the model produces the topics, a human domain expert interprets the list of most likely words and chooses a topic name.
- topic 1: orange carrot mango pumpkin banana → "Orange/Yellow Fruit/Veg."
- topic 2: apple leek broccoli kiwi spinach → "Green Fruit/Veg."
39. Explaining The Topic Model
40. Graphical Model (annotated)
- z_i is a "label" for each word w_i
- Prob(w_i | φ, z_i = t): a multinomial over words, i.e., a "topic"
- Prob(z_i | θ_d): a distribution over topics that is document-specific
- (Diagram labels: topics in doc, topic, word, words in topic.)
41. Learning using Gibbs Sampling
- N(w,t) = count of word w assigned to topic t
- N(t,d) = count of topic t assigned to doc d
- Probability that word i is assigned to topic t (the standard collapsed Gibbs update; see the sampler sketch below):
- p(z_i = t | all other z) ∝ (N(w_i,t) + β) / (Σ_w N(w,t) + Vβ) × (N(t,d_i) + α)
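A minimal collapsed Gibbs sampler sketch of this update (my reconstruction of the standard algorithm, not the lecture's code). Assumed inputs: w(i) and d(i) give the word id and doc id of token i; V is the vocabulary size and D the number of documents:

    T = 10; alpha = 0.1; beta = 0.01;            % topics and hyperparameters (assumed)
    N = length(w);
    z = randi(T, N, 1);                          % random initial topic assignments
    Nwt = zeros(V, T); Ntd = zeros(T, D);        % the two count matrices
    for i = 1:N
        Nwt(w(i), z(i)) = Nwt(w(i), z(i)) + 1;
        Ntd(z(i), d(i)) = Ntd(z(i), d(i)) + 1;
    end
    for iter = 1:500                             % several hundred iterations typical
        for i = 1:N
            t = z(i);                            % remove token i from the counts
            Nwt(w(i), t) = Nwt(w(i), t) - 1;
            Ntd(t, d(i)) = Ntd(t, d(i)) - 1;
            p = (Nwt(w(i), :) + beta) ./ (sum(Nwt, 1) + V*beta) ...
                .* (Ntd(:, d(i))' + alpha);      % conditional p(z_i = t | rest)
            t = find(cumsum(p) >= rand * sum(p), 1);  % sample a new topic
            z(i) = t;                            % put token i back with its new topic
            Nwt(w(i), t) = Nwt(w(i), t) + 1;
            Ntd(t, d(i)) = Ntd(t, d(i)) + 1;
        end
    end

Each pass over i = 1..N is one Gibbs iteration, a full sweep through all words in all documents.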
42. Probabilistic Model
- Generative direction: Parameters → Real World Data, via P(Data | Parameters)
- Statistical inference: Data → Parameters, via P(Parameters | Data)
43. See a Gibbs sampler live!
- http://psiexp.ss.uci.edu/research/demotopics2.html
44. Sample topics from different collections
45. (No transcript: image-only slide.)
46. A Graphical Model
- p(doc | φ) = ∏_i p(w_i | φ)
- φ = "parameter vector" = a set of probabilities, one per word: p(w | φ)
- (Diagram: node φ with arrows to words w_1, w_2, ..., w_n.)
47. Another view....
- p(doc | φ) = ∏_i p(w_i | φ)
- This is plate notation
- Items inside the plate are conditionally independent given the variable outside the plate
- There are n conditionally independent replicates represented by the plate
- (Diagram: φ → w_i, plate i = 1..n.)
48. Being Bayesian....
- This is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0
- (Diagram: α → φ → w_i, plate i = 1..n.)
49. Being Bayesian....
- Learning: infer p(φ | words, α), which is proportional to p(words | φ) p(φ | α)
- (Diagram: α → φ → w_i, plate i = 1..n.)
50. Multiple Documents
- p(corpus | φ) = ∏_d p(doc_d | φ)
- (Diagram: α → φ → w_i, plates i = 1..n and d = 1..D.)
51. Different Document Types
- p(w | φ) is a multinomial over words
- (Diagram: α → φ → w_i, plate 1..n.)
52. Different Document Types
- p(w | φ) is a multinomial over words
- (Diagram: α → φ → w_i, plates 1..n and 1..D.)
53. Different Document Types
- p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc
- (Diagram: α → φ → w_i ← z_d, plates 1..n and 1..D.)
54. Different Document Types
- p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc
- Different multinomials, depending on the value of z_d (discrete): φ now represents one multinomial per value of z_d
- (Diagram: α → φ → w_i ← z_d, plates 1..n and 1..D.)
55. Unknown Document Types
- Now the values of z for each document are unknown - hopeless?
- (Diagram: α → φ → w_i ← z_d, with a prior on z_d; plates 1..n and 1..D.)
56. Unknown Document Types
- Now the values of z for each document are unknown - hopeless?
- Not hopeless :) We can learn about both z and φ, e.g., with the EM algorithm
- This gives probabilistic clustering: p(w | z = k, φ) is the kth multinomial over words
- (Diagram: α → φ → w_i ← z_d, with a prior on z_d; plates 1..n and 1..D.)
57. Topic Model
- z_i is a "label" for each word
- p(w | φ, z_i = k): a multinomial over words, i.e., a "topic"
- p(z_i | θ_d): a distribution over topics that is document-specific
- (Diagram: α → θ_d → z_i → w_i ← φ, plates 1..n and 1..D; generative sketch below.)
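For intuition, a tiny sketch of the generative process this diagram encodes (toy sizes; φ is assumed given here, and the Dirichlet(1) draw is built from exponentials to avoid toolbox dependencies):

    T = 2; V = 10; n = 12;                       % toy sizes (assumed)
    g = -log(rand(1, T));                        % Gamma(1,1) draws, so...
    theta = g / sum(g);                          % theta_d ~ Dirichlet(1): topics in doc
    phi = rand(V, T); phi = phi ./ sum(phi, 1);  % assumed topic-word multinomials
    doc = zeros(1, n);
    for i = 1:n
        zi = find(cumsum(theta) >= rand, 1);           % z_i ~ theta_d
        doc(i) = find(cumsum(phi(:, zi)) >= rand, 1);  % w_i ~ phi(:, z_i)
    end

This is exactly the recipe-from-buckets story of slides 18-34, written as code.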
58. Key Features of Topic Models
- Generative model for documents in the form of bags of words
- Allows a document to be composed of multiple topics
- Much more powerful than one doc → one cluster
- Completely unsupervised
- Topics learned directly from data
- Leverages strong dependencies at the word level AND large data sets
- Learning algorithm
- Gibbs sampling is the method of choice
- Scalable
- Linear in number of word tokens
- Can be run on millions of documents
59. More Details on Learning
- Gibbs sampling for z
- Typically run several hundred Gibbs iterations
- 1 iteration = full pass through all words in all documents
- Estimating θ and φ (sketch below)
- from a z sample → point estimates
- non-informative Dirichlet priors for θ and φ
- Computational efficiency
- Learning is linear in the number of word tokens
- Memory requirements can be a limitation for large corpora
- Predictions on new documents
- can average over θ and φ (from different samples, different runs)
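A sketch of the point estimates, reusing the count matrices Nwt and Ntd from the sampler sketch above (assumes Matlab's implicit expansion, R2016b+):

    phi   = (Nwt + beta)  ./ (sum(Nwt, 1) + V*beta);    % V x T: p(w | t)
    theta = (Ntd + alpha) ./ (sum(Ntd, 1) + T*alpha);   % T x D: p(t | d)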
60. History of topic models
- Origins in statistics
- latent class models in social science
- admixture models in statistical genetics
- Applications in computer science
- Hofmann, SIGIR, 1999
- Blei, Ng, Jordan, JMLR 2003
- Griffiths and Steyvers, PNAS, 2004
- More recent work
- Author-topic models (Steyvers et al., Rosen-Zvi et al., 2004)
- Hierarchical topics (McCallum et al., 2006)
- Correlated topic models (Blei and Lafferty, 2005)
- Dirichlet process models (Teh, Jordan, et al.)
- Large-scale web applications (Buntine et al., 2004, 2005)
- Undirected models (Welling et al., 2004)
61. Examples of Topics from CiteSeer
62. Four example topics from NIPS
63. Clusters v. Topics
64. Clusters v. Topics
- One Cluster
65. Clusters v. Topics
- Multiple Topics
- One Cluster
66. What can Topic Models be used for?
- Queries
- Who writes on this topic?
- e.g., finding experts or reviewers in a particular area
- What topics does this person do research on?
- Comparing groups of authors or documents
- Discovering trends over time
- Detecting unusual papers and authors
- Interactive browsing of a digital library via topics
- Parsing documents (and parts of documents) by topic
- and more..
67. What is this paper about?
- "Empirical Bayes screening for multi-item associations"
- Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001
- Most likely topics according to the model are
- data, mining, discovery, association, attribute, ..
- set, subset, maximal, minimal, complete, ..
- measurements, correlation, statistical, variation, ..
- Bayesian, model, prior, data, mixture, ..
68. 3 of 300 example topics (TASA)
69. Automated Tagging of Words (numbers and colors → topic assignments)
70. Experiments on Various Data Sets
- Corpora
- CiteSeer: 160K abstracts, 85K authors
- NIPS: 1.7K papers, 2K authors
- Enron: 250K emails, 28K authors (senders)
- Medline: 300K abstracts, 128K authors
- Removed stop words; no stemming
- Ignore word order, just use word counts
- Processing time
- NIPS: 2000 Gibbs iterations → 8 hours
- CiteSeer: 2000 Gibbs iterations → 4 days
71. Four example topics from CiteSeer (T=300)
72. More CiteSeer Topics
73. Temporal patterns in topics: hot and cold topics
- We have CiteSeer papers from 1986-2002
- For each year, calculate the fraction of words assigned to each topic
- gives a time-series for each topic (see the sketch below)
- Hot topics become more prevalent
- Cold topics become less prevalent
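A sketch of the time-series computation, reusing z from the sampler sketch above and assuming year(i) gives the publication year of the document containing token i:

    years = 1986:2002;
    frac = zeros(T, numel(years));
    for y = 1:numel(years)
        idx = (year == years(y));                % tokens from papers in this year
        for t = 1:T
            frac(t, y) = sum(z(idx) == t) / max(sum(idx), 1);
        end
    end
    % plot(years, frac(t,:)): a rising curve is a hot topic, a falling one is cold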
74-79. (No transcript: image-only slides.)