Title: David Newman, UC Irvine Lecture 13: Topic Models 1
1. CS 277 Data Mining, Lecture 13: Topic Models (cont.)
- David Newman
- Department of Computer Science
- University of California, Irvine
2. Notices
- Homework 3 available on web
- Progress Report 2 due Tuesday Nov 13 in class
- email me ppt/pdf by 2pm Tuesday Nov 13
3. Progress Report 2
- 2-page ppt/pdf
- Produce an actual result with actual data
- You are to convince your manager (the rest of the class) that this is a worthwhile project. What work do you do, and how do you present it?
- Submit by email: 2-page ppt/pdf
- 2 slides that summarize your best result(s) so far
- Should contain:
- Your name (top right corner)
- Clear description of the main task
- Your best result so far
- Make it graphical (use text sparingly)
- Email to me no later than 2pm Tuesday Nov 13th
4. Homework 2
5. K-Nearest Neighbor
- TF accuracy (%)
- 26, 28, 29, 29, 64, 27, 29, 29, 29
- IDF accuracy (%)
- 71, 69, 73, 72, 73, 73, 73, 72, 73
- Best K, accuracy
- (K=3, 39) (K=18, 69) (K=16, 72) (K=1, 12) (K=21, 72) (K=52, 81) (K=16, 80) (K=2, 3)
- Did you compute accuracy on test data?
- (K=52, 74)
6. Complexity
- D documents
- Dtrain
- Dtest (we'll compute per test document)
- W words
- L average document length
- C classes
7. K-Nearest Neighbor
- Training Complexity
- Time: zero
- Space: D·W·e, or D·L (space for Xtrain; e is the fraction of nonzero entries)
- Test Complexity (per test doc)
- Time: D·L (see the sketch below)
- Space: same as training
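To make the D·L test cost concrete, here is a minimal sketch (hypothetical toy data; bag-of-words dicts) of brute-force k-NN: each test document is compared against all D training documents, each of average length L, so classification costs roughly D·L operations per test document.

```python
import heapq
from collections import Counter

def knn_predict(test_doc, train_docs, train_labels, k):
    """Classify one test document: O(D * L) time per test doc."""
    sims = []
    for label, doc in zip(train_labels, train_docs):   # D training docs
        # sparse dot product over the test doc's terms: ~L operations
        s = sum(cnt * doc.get(w, 0) for w, cnt in test_doc.items())
        sims.append((s, label))
    top_k = heapq.nlargest(k, sims)                    # k most similar docs
    votes = Counter(label for _, label in top_k)       # majority vote
    return votes.most_common(1)[0][0]

# hypothetical toy data: word-count dicts and class labels
train = [{"ball": 3, "goal": 1}, {"stock": 2, "price": 4}]
labels = ["sports", "finance"]
print(knn_predict({"goal": 2, "ball": 1}, train, labels, k=1))  # -> sports
```

There is no training step (time zero), which is why all the cost lands at test time.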
8. Naïve Bayes
- Binary accuracy (%)
- 43, 56, 67, 53, 57, 58, 62
- Multinomial accuracy (%)
- 79, 78, 79, 78, 78, 79, 79
- McCallum (Fig 3) did slightly better?
9. Naïve Bayes
- Optimal smoothing parameter
- 1, 1, 1, 0.06, 1, 0.1
- Dirichlet(α)
- α is a real number, α > 0
- Pseudo counts can be fractional, e.g. α = 0.06
- (figure: dog, cat, fish, hen; red, green, blue)
- Prob(green | fish) = ? (see the sketch below)
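A minimal sketch of how a fractional pseudo count answers the Prob(green | fish) question: with Dirichlet (add-α) smoothing, a word never observed with a class still gets nonzero probability. The counts below are hypothetical, chosen so that "green" never co-occurs with "fish".

```python
alpha = 0.06                                     # fractional pseudo count
vocab = ["red", "green", "blue"]
counts_fish = {"red": 4, "green": 0, "blue": 2}  # hypothetical counts for class "fish"

def p_word_given_class(word, counts, alpha, vocab):
    """Smoothed estimate: (n(w,c) + alpha) / (n(c) + alpha * |V|)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * len(vocab))

print(p_word_given_class("green", counts_fish, alpha, vocab))
# 0.06 / (6 + 0.18) ≈ 0.0097 -- small but nonzero, so a product of
# word probabilities never collapses to exactly zero
```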
10. Naïve Bayes (Multinomial)
- Training Complexity
- Time: W·C · (D/C) = W·D
- Space: W·C
- Test Complexity (per test doc)
- Time: L·C (see the sketch below)
- Space: W·C
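The L·C test time can be read off a direct implementation: for each of the C classes, sum log probabilities over the document's roughly L word tokens. A sketch, assuming the smoothed log-probability table (the W·C space term) was precomputed at training time:

```python
import math

def nb_classify(doc, log_prior, log_p_w_c):
    """Score one test document: C classes x ~L tokens = O(L * C) time.
    log_p_w_c[c][w] is the smoothed log P(w|c); storing it for every
    word and class is the W * C space cost."""
    best_class, best_score = None, -math.inf
    for c, lp in log_prior.items():          # C classes
        score = lp
        for w, cnt in doc.items():           # ~L word tokens
            # smoothing guarantees every vocabulary word has an entry
            score += cnt * log_p_w_c[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```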
11. Support Vector Machine
- 1 vs. rest
- 27, 76, 76, 76, 76, 76, 76, 76, 76
- 1 vs. 1
- 56, 56, 56, 54, 56, 56, 61, 56
- Several comments
- SVM 1 vs 1 should give best results
- SVM should perform best
- SVM 1 vs 1 has to be the best
- I don't think this is necessarily the case
12. Support Vector Machine
- Training Complexity
- Time: D^3 (empirically D^1.6 to D^1.9)
- Space: S·L (S = number of support vectors)
- Test Complexity (per test doc)
- Time: S·L
- Space: S·L
- 1 vs. rest: build C SVMs
- 1 vs. 1: build C(C-1)/2 SVMs (see the sketch below)
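A small sketch of where the classifier counts come from: 1 vs. rest trains one binary SVM per class, while 1 vs. 1 trains one per unordered pair of classes. The class names here are hypothetical.

```python
from itertools import combinations

classes = ["earn", "acq", "crude", "trade"]      # hypothetical class set, C = 4
one_vs_rest = [(c, "rest") for c in classes]     # C binary SVMs
one_vs_one = list(combinations(classes, 2))      # C(C-1)/2 binary SVMs

print(len(one_vs_rest))  # 4
print(len(one_vs_one))   # 6 = 4 * 3 / 2
```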
13. Weka
- Attempt at J48 (Weka's decision tree)
- Memory problems
14. Conclusions
- Best overall classifier
- NB?
- SVM?
- Best classifier for each class
- NB or SVM?
- Best training time complexity
- K-NN (zero)
- Best testing time complexity
- NB (L·C)
- Best testing space complexity
- NB (W·C)
15. Today's lecture
- Topic modeling
- LSI (aka SVD)
- NMF
- LDA
- PLSI (today)
16. Topic Modeling as dimensionality reduction
- X (W×D text collection) ≈ φ (W×K topics) × θ (K×D mixtures)
- K << min(W, D)
- (see the sketch below)
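A minimal sketch of this factorization using scikit-learn's NMF on a synthetic count matrix (random Poisson counts stand in for a real text collection; the values of W, D, K are chosen arbitrarily):

```python
import numpy as np
from sklearn.decomposition import NMF

W, D, K = 50, 20, 3                       # K << min(W, D)
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(W, D)).astype(float)  # W x D word counts

nmf = NMF(n_components=K, init="nndsvda", max_iter=500)
phi = nmf.fit_transform(X)                # W x K: topics (word weights)
theta = nmf.components_                   # K x D: per-document mixtures
print(phi.shape, theta.shape)             # (50, 3) (3, 20)
```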
17. Probabilistic Latent Semantic Indexing
- Factorization
- X ≈ U S Vᵀ (SVD)
- X ≈ W H (NMF)
- PLSI
- P(w,d) = P(d) Σ_z P(w|z) P(z|d)
- P(w,d) = Σ_z P(w|z) P(z) P(d|z)
18. Probabilistic Latent Semantic Indexing
- Select document d with probability P(d)
- Select a topic with probability P(z|d)
- Generate a word with probability P(w|z)
- P(d,w) = ?
- Log likelihood of collection = ? (see the sketch below)
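One way to answer the log-likelihood question: under PLSI the log likelihood of the collection is L = Σ_{w,d} n(w,d) · log[ P(d) Σ_z P(w|z) P(z|d) ]. A NumPy sketch that evaluates it given the model's factors (the argument names are my own):

```python
import numpy as np

def plsi_loglik(N, p_w_z, p_z_d, p_d):
    """N:     W x D matrix of word counts n(w, d)
    p_w_z: W x K, column z holds P(w|z)
    p_z_d: K x D, column d holds P(z|d)
    p_d:   length-D vector of P(d)"""
    p_w_d = p_w_z @ p_z_d                 # W x D: sum_z P(w|z) P(z|d)
    return float(np.sum(N * np.log(p_w_d * p_d + 1e-12)))  # epsilon avoids log(0)
```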
19. Topic Modeling
- Latent Dirichlet Allocation (LDA) evolved from PLSI
- LDA has now replaced PLSI
- Gibbs sampling (an MCMC method) is a popular inference method for LDA (see the sketch below)
- Newer methods, e.g. Hierarchical Dirichlet Processes, learn the optimal number of topics
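For concreteness, a minimal (unoptimized) sketch of collapsed Gibbs sampling for LDA; the count-based full conditional follows the standard formulation, but this is an illustration, not the exact code behind the experiments below.

```python
import numpy as np

def gibbs_lda(docs, W, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of word-id lists; W: vocab size; K: number of topics."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))                    # topic counts per document
    nkw = np.zeros((K, W))                    # word counts per topic
    nk = np.zeros(K)                          # total words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):            # initialize counts randomly
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                   # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional P(z_i = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + W * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                   # record the new topic
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, nkw                             # assignments and topic-word counts
```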
20. What can Topic Models be used for?
- Queries
- Who writes on this topic?
- e.g., finding experts or reviewers in a particular area
- What topics does this person do research on?
- Comparing groups of authors or documents
- Discovering trends over time
- Detecting unusual papers and authors
- Interactive browsing of a digital library via topics
- Parsing documents (and parts of documents) by topic
- and more...
21. Examples of Topics from CiteSeer
22. Four example topics from NIPS
23. Clusters v. Topics
24. Clusters v. Topics
- One Cluster
25. Clusters v. Topics
- Multiple Topics
- One Cluster
26. 3 of 300 example topics (TASA)
27. Automated Tagging of Words (numbers/colors → topic assignments)
28. Experiments on Various Data Sets
- Corpora
- CiteSeer: 160K abstracts, 85K authors
- NIPS: 1.7K papers, 2K authors
- Enron: 250K emails, 28K authors (senders)
- Medline: 300K abstracts, 128K authors
- Removed stop words; no stemming
- Ignore word order, just use word counts
- Processing time
- NIPS: 2000 Gibbs iterations → 8 hours
- CiteSeer: 2000 Gibbs iterations → 4 days
29. Temporal patterns in topics: hot and cold topics
- We have CiteSeer papers from 1986-2002
- For each year, calculate the fraction of words assigned to each topic (see the sketch below)
- → a time-series for topics
- Hot topics become more prevalent
- Cold topics become less prevalent
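A sketch of the per-year computation, assuming each word token carries a (year, topic) pair from the sampler's final assignments:

```python
from collections import defaultdict

def topic_fractions_by_year(assignments):
    """assignments: iterable of (year, topic) pairs, one per word token.
    Returns {year: {topic: fraction of that year's words}} -- one
    time series per topic; rising fractions = hot, falling = cold."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, topic in assignments:
        counts[year][topic] += 1
    fractions = {}
    for year, topics in counts.items():
        total = sum(topics.values())
        fractions[year] = {t: n / total for t, n in topics.items()}
    return fractions
```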
30-35. (Figure slides; no transcript)
36. Applications
- The Calit2 browser of research and researchers
- http://yarra.ics.uci.edu/calit2/