David Newman, UC Irvine Lecture 12: Topic Models 1 - PowerPoint PPT Presentation

1
CS 277 Data Mining, Lecture 12: Topic Models
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 2 due now
  • Homework 3 available on web
  • Progress Report 2 due Tuesday Nov 13 in class
  • Will give instructions for Progress Report 2 on
    Thursday

3
Progress Report 1 Comments
  • Good progress overall
  • But (for some of you)
  • Scale back your plans
  • Be realistic about what you can accomplish
  • Focus on producing a preliminary result
  • This is a DATA mining project
  • If you don't yet have your data, get it as soon
    as possible

4
Progress Report 1 Comments (cont.)
  • Extreme programming (XP) / rapid prototyping
  • Don't have a grand fixed plan
  • Get something working early on
  • Suggestions
  • Do simple method first
  • Use subset of data first
  • For next week
  • Produce an actual result with actual data
  • Pretend you have 1 week to convince your manager
    that this is a worthwhile project. What work
    would you do, and how would you present it?

5
Today's lecture
  • Topic modeling
  • Recap on LSI/SVD
  • NMF
  • Topic modeling

6
Latent Semantic Indexing
  • Latent Semantic Indexing (LSI) ≈ Singular Value
    Decomposition (SVD) ≈ Principal Component
    Analysis (PCA)
  • Matlab:
  • [U,S,V] = svds(X,20)
  • err = norm(X - U*S*V')
  • Issues?
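The Matlab `svds` call above can be mirrored in Python with NumPy; a minimal sketch on a toy term-document count matrix (the matrix and the rank k are illustrative, not from the lecture):

```python
import numpy as np

# Toy 4-term x 3-document count matrix (rows = terms, cols = docs).
X = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 2.],
              [1., 0., 1.]])

# Thin SVD; keep the top k singular vectors (the LSI "topic" directions).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstruction error, analogous to norm(X - U*S*V') in the Matlab line.
err = np.linalg.norm(X - Xk)
```

By the Eckart-Young theorem, this rank-k truncation is the best rank-k approximation of X in Frobenius norm, which is why LSI keeps only the leading singular vectors.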

7
1st and 6th topic vectors
  • topic 1
  • 0.29 terror
  • 0.28 12
  • 0.26 sept
  • 0.26 aftermath
  • 0.23 2001
  • 0.17 commandeered
  • 0.15 times
  • 0.15 piecing
  • 0.14 jetliners
  • 0.13 york
  • topic 6
  • -0.27 cheese
  • -0.21 flour
  • -0.19 cheddar
  • -0.18 grams
  • -0.16 eggs
  • -0.16 baking
  • -0.15 teaspoon
  • -0.15 soup
  • -0.13 butter
  • -0.13 cups

8
Nonnegative Matrix Factorization (NMF)
  • V ≈ WH, with W ≥ 0, H ≥ 0
  • Q = ||V - WH||² (objective to minimize)
  • Gradient descent?
  • → whiteboard
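One standard answer to the gradient-descent question is Lee and Seung's multiplicative update rules, which keep W and H nonnegative and do not increase Q = ||V - WH||². A sketch on random data (matrix sizes, rank, and iteration count are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4))           # nonnegative data matrix
k = 2                            # number of components
W = rng.random((6, k)) + 1e-3
H = rng.random((k, 4)) + 1e-3

def frob_err(V, W, H):
    # Frobenius-norm reconstruction error ||V - WH||.
    return np.linalg.norm(V - W @ H)

errs = [frob_err(V, W, H)]
for _ in range(200):
    # Multiplicative updates (Lee & Seung): elementwise scaling keeps
    # W and H nonnegative; small epsilon avoids division by zero.
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    errs.append(frob_err(V, W, H))
```

Unlike SVD/LSI, the factors here are nonnegative, so each column of W can be read directly as a "topic" over terms.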

9
Unsupervised Learning from Text
  • Large collections of unlabeled documents:
  • Web
  • Digital libraries
  • Email archives, etc.
  • Often wish to organize/summarize/index/tag these
    documents automatically
  • We will look at probabilistic techniques for
    clustering and topic extraction from sets of
    documents

10
Pennsylvania Gazette
1728-1800, 80,000 articles, 25 million words
www.accessible.com
11
Enron email data
250,000 emails, 28,000 authors, 1999-2002
12
(No Transcript)
13
Other Examples of Data Sets
  • CiteSeer digital collection
  • 700,000 papers, 700,000 authors, 1986-2005
  • MEDLINE collection
  • 16 million abstracts in medicine/biology
  • US Patent collection
  • and many more....

14
Problems of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • Who is likely to write about topic Y?
  • Who wrote this specific document?
  • and so on..

15
Probability Models for Documents
  • Example: 50,000 possible words in our vocabulary
  • Simple memoryless model, aka "bag of words"
  • 50,000-sided die
  • each side of the die represents 1 word
  • a non-uniform die: each side/word has its own
    probability
  • to generate N words we toss the die N times
  • gives a "bag of words" (no sequence information)
  • This is a simple probability model
  • p( document | φ ) = ∏ p( word i | φ )
  • to "learn" the model we just count frequencies
  • p(word i) = (number of occurrences of i) / (total
    number of words)
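The "just count frequencies" step can be made concrete; the toy document below borrows words from the recipe example later in the deck:

```python
from collections import Counter
import math

doc = "pumpkin kiwi carrot orange carrot orange leek".split()

# "Learn" the model by counting: p(word) = count / total words.
counts = Counter(doc)
total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}

# Bag-of-words log-likelihood: log p(doc | phi) = sum_i log p(word_i).
loglik = sum(math.log(p[w]) for w in doc)
```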

16
The Multinomial Model
  • Example: tossing a 6-sided die
  • P = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
  • Multinomial model for documents
  • V-sided die = probability distribution over
    possible words
  • Some words have higher probability than others
  • Document with N words generated by N memoryless
    draws
  • Typically interested in conditional multinomials,
    e.g.,
  • p( words | spam ) versus p( words | non-spam )
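Comparing conditional multinomials is the core of a naive-Bayes-style classifier; a sketch using hypothetical toy word lists for each class (none of this data is from the lecture):

```python
import math
from collections import Counter

# Hypothetical toy training data: a word list per class.
spam     = "offer offer money win money".split()
non_spam = "meeting report money review report".split()

def multinomial(words, vocab, alpha=1.0):
    # Smoothed conditional multinomial p(word | class).
    c, N = Counter(words), len(words)
    return {w: (c[w] + alpha) / (N + alpha * len(vocab)) for w in vocab}

vocab = sorted(set(spam) | set(non_spam))
p_spam = multinomial(spam, vocab)
p_ham  = multinomial(non_spam, vocab)

def score(doc, p):
    # Bag-of-words log-likelihood under one class multinomial.
    return sum(math.log(p[w]) for w in doc)

doc = "money money offer".split()
is_spam = score(doc, p_spam) > score(doc, p_ham)
```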

17
Real examples of Word Multinomials
18
Explaining The Topic Model
The topic model is based on an easy-to-understand
document generator: documents are generated by
randomly choosing words out of topic buckets.
Orange/Yellow Fruit/Veg. bucket: orange, pumpkin,
carrot, mango, banana
Green Fruit/Veg. bucket: kiwi, spinach, leek,
broccoli, apple
19-27
Explaining The Topic Model
(Animated build over slides 19-27.) Recipe 1 is
generated by drawing 67% of its words from the
Orange/Yellow bucket and 33% from the Green bucket:
Recipe 1: pumpkin kiwi carrot orange carrot
orange leek broccoli carrot banana kiwi pumpkin
28-34
Explaining The Topic Model
(Animated build over slides 28-34.) Recipe 2 is
generated with the proportions reversed, 33%
Orange/Yellow and 67% Green:
Recipe 2: mango apple apple leek orange broccoli
spinach pumpkin broccoli apple kiwi banana
35
The Topic Model
The topic model inverts this process to determine
both the likely topics in the collection and the
mix of topics within each document. Note that
there is no a priori definition of topics.
Recipe 1: pumpkin kiwi carrot orange carrot
orange leek broccoli carrot banana kiwi pumpkin
Recipe 2: mango apple apple leek orange broccoli
spinach pumpkin broccoli apple kiwi banana
(words not colored!)
36
The Topic Model
The topic model inverts this process to determine
both the likely topics in the collection and the
mix of topics within each document. Note that
there is no a priori definition of topics.
(Figure: the words in topic 1 and topic 2, and the
topics in each doc, are all shown as unknowns "???"
to be inferred.)
Recipe 1: pumpkin kiwi carrot orange carrot
orange leek broccoli carrot banana kiwi pumpkin
Recipe 2: mango apple apple leek orange broccoli
spinach pumpkin broccoli apple kiwi banana
(words not colored!)
37
The Topic Model
The topic model inverts this process to determine
both the likely topics in the collection and the
mix of topics within each document. Note that
there is no a priori definition of topics.
topic 1 (words in topic): orange carrot mango
pumpkin banana
topic 2 (words in topic): apple leek broccoli kiwi
spinach
Recipe 1: pumpkin kiwi carrot orange carrot
orange leek broccoli carrot banana kiwi pumpkin
Recipe 2: mango apple apple leek orange broccoli
spinach pumpkin broccoli apple kiwi banana
38
Humans Interpret Topics
After the model produces the topics, a human
domain expert interprets the list of most likely
words and chooses a topic name.
topic 1: orange carrot mango pumpkin banana
→ "Orange/Yellow Fruit/Veg."
topic 2: apple leek broccoli kiwi spinach
→ "Green Fruit/Veg."
39
Explaining The Topic Model
40
Graphical Model
(Annotated graphical model: topic, word, words in
topic, topics in doc.)
zi is a "label" for each word wi
Prob( wi | φ, zi = t ): multinomial over words, a
"topic"
Prob( zi | θd ): distribution over topics that is
document specific
41
Learning using Gibbs Sampling
count of word w assigned to topic t
count of topic t assigned to doc d
probability that word i is assigned to topic t
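The counts on this slide are the ingredients of the standard collapsed Gibbs update for LDA, p( zi = t | z-i, w ) ∝ (Nwt + β)/(Σw Nwt + Vβ) · (Ntd + α). A minimal sketch on a tiny synthetic corpus (the corpus, topic count, and hyperparameters are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny corpus: documents as lists of word ids over a V-word vocabulary.
docs = [[0, 0, 1, 2, 0], [3, 3, 4, 3, 4], [0, 1, 3, 4, 2]]
V, T = 5, 2                      # vocabulary size, number of topics
alpha, beta = 0.5, 0.1           # symmetric Dirichlet hyperparameters

# Count tables from the slide: Nwt (word-topic), Ntd (topic-doc).
Nwt = np.zeros((V, T))
Ntd = np.zeros((T, len(docs)))
z = []                           # topic assignment for every token
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        t = rng.integers(T)      # random initial assignment
        Nwt[w, t] += 1; Ntd[t, d] += 1
        zd.append(t)
    z.append(zd)

for _ in range(50):              # Gibbs sweeps over all tokens
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            Nwt[w, t] -= 1; Ntd[t, d] -= 1   # remove current assignment
            # p(zi = t | rest) from the counts with current token removed.
            p = (Nwt[w] + beta) / (Nwt.sum(axis=0) + V * beta) \
                * (Ntd[:, d] + alpha)
            t = rng.choice(T, p=p / p.sum())
            Nwt[w, t] += 1; Ntd[t, d] += 1
            z[d][i] = t
```

Each sweep is one pass over all word tokens, which is why learning is linear in the number of tokens, as slide 58 notes.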
42
Probabilistic Model
P( Data | Parameters ): Parameters → Real World Data
P( Parameters | Data ): Real World Data → Parameters
(Statistical Inference)
43
See a Gibbs sampler live!
  • http://psiexp.ss.uci.edu/research/demotopics2.html

44
sample topics from different collections
45
  • LAST SLIDE

46
A Graphical Model
p( doc | φ ) = ∏ p( wi | φ )
φ: "parameter vector", a set of
probabilities, one per word
p( w | φ )
(Graphical model: node φ with arrows to w1, w2,
..., wn.)
47
Another view....
p( doc | φ ) = ∏ p( wi | φ )
  • This is plate notation
  • Items inside the plate are conditionally
    independent given the variable outside the plate
  • There are n conditionally independent replicates
    represented by the plate

(Graphical model: φ → wi, plate i = 1...n.)
48
Being Bayesian....
This is a prior on our multinomial parameters,
e.g., a simple Dirichlet smoothing prior with
symmetric parameter α, to avoid
estimates of probabilities that are 0
(Graphical model: α → φ → wi, plate i = 1...n.)
49
Being Bayesian....
Learning: infer p( φ | words, α ),
proportional to p( words | φ ) p( φ | α )
(Graphical model: α → φ → wi, plate i = 1...n.)
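For a symmetric Dirichlet prior this posterior has a closed-form mean: p(w) = (nw + α)/(N + Vα), so no word gets probability 0. A small sketch (the vocabulary and α are illustrative):

```python
from collections import Counter

doc = "mango apple apple leek".split()
vocab = ["mango", "apple", "leek", "orange", "kiwi"]
alpha = 1.0                      # symmetric Dirichlet parameter

counts = Counter(doc)
N, V = len(doc), len(vocab)

# Posterior-mean estimate under a symmetric Dirichlet(alpha) prior:
# p(w) = (n_w + alpha) / (N + V*alpha).
# Unseen words get alpha / (N + V*alpha) > 0 instead of 0.
p = {w: (counts[w] + alpha) / (N + V * alpha) for w in vocab}
```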
50
Multiple Documents
p( corpus | φ ) = ∏ p( doc | φ )
(Graphical model: α → φ → wi, plates i = 1...n,
d = 1...D.)
51
Different Document Types
p( w | φ ) is a multinomial over words
(Graphical model: α → φ → wi, plate i = 1...n.)
52
Different Document Types
p( w | φ ) is a multinomial over words
(Graphical model: α → φ → wi, plates i = 1...n,
d = 1...D.)
53
Different Document Types
p( w | φ, zd ) is a multinomial over
words; zd is the "label" for each doc
(Graphical model: α → φ → wi, zd → wi, plates
i = 1...n, d = 1...D.)
54
Different Document Types
p( w | φ, zd ) is a multinomial over
words; zd is the "label" for each
doc. Different multinomials, depending on
the value of zd (discrete); φ now
represents one multinomial per value
of zd
(Graphical model: α → φ → wi, zd → wi, plates
i = 1...n, d = 1...D.)
55
Unknown Document Types
Now the values of z for each document are unknown.
Hopeless?
(Graphical model: α → φ → wi, zd → wi with its own
prior, plates i = 1...n, d = 1...D.)
56
Unknown Document Types
Now the values of z for each document are unknown.
Hopeless?
Not hopeless! We can learn about both z
and θ, e.g., with the EM algorithm. This gives
probabilistic clustering: p( w | z = k, φ ) is
the kth multinomial over words
(Graphical model: α → φ → wi, zd → wi with its own
prior, plates i = 1...n, d = 1...D.)
57
Topic Model
zi is a "label" for each word; p( w | φ, zi =
k ) is a multinomial over words, a
"topic"; p( zi | θd ) is a distribution over
topics that is document specific
(Graphical model: α → θd → zi → wi ← φ, plates
i = 1...n, d = 1...D.)
58
Key Features of Topic Models
  • Generative model for documents in form of bags of
    words
  • Allows a document to be composed of multiple
    topics
  • Much more powerful than 1 doc → 1 cluster
  • Completely unsupervised
  • Topics learned directly from data
  • Leverages strong dependencies at word level AND
    large data sets
  • Learning algorithm
  • Gibbs sampling is the method of choice
  • Scalable
  • Linear in number of word tokens
  • Can be run on millions of documents

59
More Details on Learning
  • Gibbs sampling for x and z
  • Typically run several hundred Gibbs iterations
  • 1 iteration = full pass through all words in all
    documents
  • Estimating θ and φ
  • from x and z samples → point estimates
  • non-informative Dirichlet priors for θ and φ
  • Computational Efficiency
  • Learning is linear in the number of word tokens
  • Memory requirements can be a limitation for large
    corpora
  • Predictions on new documents
  • can average over θ and φ (from different
    samples, different runs)

60
History of topic models
  • Origins in statistics
  • latent class models in social science
  • admixture models in statistical genetics
  • Applications in computer science
  • Hofmann, SIGIR, 1999
  • Blei, Jordan, Ng, JMLR 2003
  • Griffiths and Steyvers, PNAS, 2004
  • More recent work
  • Author-topic models Steyvers et al, Rosen-Zvi et
    al, 2004
  • Hierarchical topics McCallum et al, 2006
  • Correlated topic models Blei and Lafferty, 2005
  • Dirichlet process models Teh, Jordan, et al
  • Large-scale web applications Buntine et al,
    2004, 2005
  • undirected models Welling et al, 2004

61
Examples of Topics from CiteSeer
62
Four example topics from NIPS
63
Clusters v. Topics
64
Clusters v. Topics
One Cluster
65
Clusters v. Topics
Multiple Topics
One Cluster
66
What can Topic Models be used for?
  • Queries
  • Who writes on this topic?
  • e.g., finding experts or reviewers in a
    particular area
  • What topics does this person do research on?
  • Comparing groups of authors or documents
  • Discovering trends over time
  • Detecting unusual papers and authors
  • Interactive browsing of a digital library via
    topics
  • Parsing documents (and parts of documents) by
    topic
  • and more..

67
What is this paper about?
  • Empirical Bayes screening for multi-item
    associations
  • Bill DuMouchel and Daryl Pregibon, ACM SIGKDD
    2001
  • Most likely topics according to the model are
  • data, mining, discovery, association, attribute..
  • set, subset, maximal, minimal, complete,
  • measurements, correlation, statistical,
    variation,
  • Bayesian, model, prior, data, mixture,..

68
3 of 300 example topics (TASA)
69
Automated Tagging of Words (numbers/colors →
topic assignments)
70
Experiments on Various Data Sets
  • Corpora
  • CiteSeer: 160K abstracts, 85K authors
  • NIPS: 1.7K papers, 2K authors
  • Enron: 250K emails, 28K authors (sender)
  • Medline: 300K abstracts, 128K authors
  • Removed stop words; no stemming
  • Ignore word order, just use word counts
  • Processing time
  • NIPS: 2000 Gibbs iterations → 8 hours
  • CiteSeer: 2000 Gibbs iterations → 4 days

71
Four example topics from CiteSeer (T300)
72
More CiteSeer Topics
73
Temporal patterns in topics: hot and cold topics
  • We have CiteSeer papers from 1986-2002
  • For each year, calculate the fraction of words
    assigned to each topic
  • - a time-series for topics
  • Hot topics become more prevalent
  • Cold topics become less prevalent
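The per-year fraction computation can be sketched as follows, with made-up counts standing in for the CiteSeer time series (years, counts, and the slope criterion are illustrative):

```python
import numpy as np

# Toy topic-assignment counts: rows = years, cols = topics.
# Hypothetical numbers for illustration only.
years = [1990, 1995, 2000]
counts = np.array([[120.,  30.,  50.],
                   [100.,  80.,  55.],
                   [ 60., 150.,  50.]])

# Fraction of words assigned to each topic, per year.
frac = counts / counts.sum(axis=1, keepdims=True)

# A topic is "hot" if its fraction trends up, "cold" if it trends
# down; here judged by a simple least-squares slope over the years.
slopes = np.polyfit(years, frac, 1)[0]
hot = [t for t in range(counts.shape[1]) if slopes[t] > 0]
```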

74
(No Transcript)
75
(No Transcript)
76
(No Transcript)
77
(No Transcript)
78
(No Transcript)
79
(No Transcript)