Title: David Newman, UC Irvine Lecture 13: Topic Models 1
1. CS 277 Data Mining, Lecture 13: Topic Models (cont.)
- David Newman
- Department of Computer Science
- University of California, Irvine
2. Notices
- Homework 3 available on web
- Progress Report 2 due Tuesday Nov 13 in class
- email me ppt/pdf by 2pm Tuesday Nov 13
3. Progress Report 2
- 2-page ppt/pdf
- Produce an actual result with actual data
- You are to convince your manager (the rest of the class) that this is a worthwhile project. What work do you do, and how do you present it?
- Submit by email: 2-page ppt/pdf
- 2 slides that summarize your best result(s) so far
- Should contain:
- Your name (top right corner)
- Clear description of the main task
- Your best result so far
- Make it graphical (use text sparingly)
- Email to me no later than 2pm Tuesday Nov 13th
4. Homework 2
5. K-Nearest Neighbor
- TF accuracy (%)
- 26, 28, 29, 29, 64, 27, 29, 29, 29
- IDF accuracy (%)
- 71, 69, 73, 72, 73, 73, 73, 72, 73
- Best K, accuracy
- (K=3, 39) (K=18, 69) (K=16, 72) (K=1, 12) (K=21, 72) (K=52, 81) (K=16, 80) (K=2, 3)
- Did you compute accuracy on test data?
- (K=52, 74)
6. Complexity
- D documents
- Dtrain
- Dtest (we'll compute per test document)
- W words
- L average document length
- C classes
7. K-Nearest Neighbor
- Training Complexity
- Time: zero
- Space: D·W·e, or D·L (space for Xtrain; e is the fraction of nonzero entries)
- Test Complexity (per test doc)
- Time: D·L (see the sketch below)
- Space: same as training
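To make the D·L test cost concrete, here is a minimal sketch (hypothetical toy data; bag-of-words dicts) of brute-force k-NN: each test document is compared against all D training documents, each of average length L, so classification costs roughly D·L operations per test document.

```python
import heapq
from collections import Counter

def knn_predict(test_doc, train_docs, train_labels, k):
    """Classify one test document: O(D * L) time per test doc."""
    sims = []
    for label, doc in zip(train_labels, train_docs):   # D training docs
        # sparse dot product over the test doc's terms: ~L operations
        s = sum(cnt * doc.get(w, 0) for w, cnt in test_doc.items())
        sims.append((s, label))
    top_k = heapq.nlargest(k, sims)                    # k most similar docs
    votes = Counter(label for _, label in top_k)       # majority vote
    return votes.most_common(1)[0][0]

# hypothetical toy data: word-count dicts and class labels
train = [{"ball": 3, "goal": 1}, {"stock": 2, "price": 4}]
labels = ["sports", "finance"]
print(knn_predict({"goal": 2, "ball": 1}, train, labels, k=1))  # -> sports
```

There is no training step (time zero), which is why all the cost lands at test time.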
8. Naïve Bayes
- Binary accuracy (%)
- 43, 56, 67, 53, 57, 58, 62
- Multinomial accuracy (%)
- 79, 78, 79, 78, 78, 79, 79
- McCallum (Fig 3) did slightly better?
9. Naïve Bayes
- Optimal smoothing parameter
- 1, 1, 1, 0.06, 1, 0.1
- Dirichlet(α)
- α is a real number, α > 0
- Pseudo counts can be fractional, e.g. α = 0.06
- (figure: dog, cat, fish, hen; red, green, blue)
- Prob(green | fish) = ? (see the sketch below)
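A minimal sketch of how a fractional pseudo count answers the Prob(green | fish) question: with Dirichlet (add-α) smoothing, a word never observed with a class still gets nonzero probability. The counts below are hypothetical, chosen so that "green" never co-occurs with "fish".

```python
alpha = 0.06                                     # fractional pseudo count
vocab = ["red", "green", "blue"]
counts_fish = {"red": 4, "green": 0, "blue": 2}  # hypothetical counts for class "fish"

def p_word_given_class(word, counts, alpha, vocab):
    """Smoothed estimate: (n(w,c) + alpha) / (n(c) + alpha * |V|)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * len(vocab))

print(p_word_given_class("green", counts_fish, alpha, vocab))
# 0.06 / (6 + 0.18) ≈ 0.0097 -- small but nonzero, so a product of
# word probabilities never collapses to exactly zero
```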
10. Naïve Bayes (Multinomial)
- Training Complexity
- Time: W·C · (D/C) = W·D
- Space: W·C
- Test Complexity (per test doc)
- Time: L·C (see the sketch below)
- Space: W·C
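The L·C test time can be read off a direct implementation: for each of the C classes, sum log probabilities over the document's roughly L word tokens. A sketch, assuming the smoothed log-probability table (the W·C space term) was precomputed at training time:

```python
import math

def nb_classify(doc, log_prior, log_p_w_c):
    """Score one test document: C classes x ~L tokens = O(L * C) time.
    log_p_w_c[c][w] is the smoothed log P(w|c); storing it for every
    word and class is the W * C space cost."""
    best_class, best_score = None, -math.inf
    for c, lp in log_prior.items():          # C classes
        score = lp
        for w, cnt in doc.items():           # ~L word tokens
            # smoothing guarantees every vocabulary word has an entry
            score += cnt * log_p_w_c[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```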
11. Support Vector Machine
- 1 vs. rest
- 27, 76, 76, 76, 76, 76, 76, 76, 76
- 1 vs. 1
- 56, 56, 56, 54, 56, 56, 61, 56
- Several comments
- SVM 1 vs 1 should give best results
- SVM should perform best
- SVM 1 vs 1 has to be the best
- I don't think this is necessarily the case
12. Support Vector Machine
- Training Complexity
- Time: D^3 (empirically D^1.6 to D^1.9)
- Space: S·L (S = number of support vectors)
- Test Complexity (per test doc)
- Time: S·L
- Space: S·L
- 1 vs. rest: build C SVMs
- 1 vs. 1: build C(C-1)/2 SVMs (see the sketch below)
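A small sketch of where the classifier counts come from: 1 vs. rest trains one binary SVM per class, while 1 vs. 1 trains one per unordered pair of classes. The class names here are hypothetical.

```python
from itertools import combinations

classes = ["earn", "acq", "crude", "trade"]      # hypothetical class set, C = 4
one_vs_rest = [(c, "rest") for c in classes]     # C binary SVMs
one_vs_one = list(combinations(classes, 2))      # C(C-1)/2 binary SVMs

print(len(one_vs_rest))  # 4
print(len(one_vs_one))   # 6 = 4 * 3 / 2
```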
13. Weka
- Attempt at J48 (Weka's decision tree)
- Memory problems
14. Conclusions
- Best overall classifier
- NB?
- SVM?
- Best classifier for each class
- NB or SVM?
- Best training time complexity
- K-NN (zero)
- Best testing time complexity
- NB (L·C)
- Best testing space complexity
- NB (W·C)
15. Today's lecture
- Topic modeling
- LSI (aka SVD)
- NMF
- LDA
- PLSI (today)
16. Topic Modeling as dimensionality reduction
- X (W×D text collection) ≈ φ (W×K topics) × θ (K×D mixtures)
- K << min(W, D)
- (see the sketch below)
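A minimal sketch of this factorization using scikit-learn's NMF on a synthetic count matrix (random Poisson counts stand in for a real text collection; the values of W, D, K are chosen arbitrarily):

```python
import numpy as np
from sklearn.decomposition import NMF

W, D, K = 50, 20, 3                       # K << min(W, D)
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(W, D)).astype(float)  # W x D word counts

nmf = NMF(n_components=K, init="nndsvda", max_iter=500)
phi = nmf.fit_transform(X)                # W x K: topics (word weights)
theta = nmf.components_                   # K x D: per-document mixtures
print(phi.shape, theta.shape)             # (50, 3) (3, 20)
```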
17. Probabilistic Latent Semantic Indexing
- Factorization
- X ≈ U S Vᵀ (SVD)
- X ≈ W H (NMF)
- PLSI
- P(w,d) = P(d) Σ_z P(w|z) P(z|d)
- P(w,d) = Σ_z P(w|z) P(z) P(d|z)
18. Probabilistic Latent Semantic Indexing
- Select document d with probability P(d)
- Select a topic with probability P(z|d)
- Generate a word with probability P(w|z)
- P(d,w) = ?
- Log likelihood of collection = ? (see the sketch below)
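One way to answer the log-likelihood question: under PLSI the log likelihood of the collection is L = Σ_{w,d} n(w,d) · log[ P(d) Σ_z P(w|z) P(z|d) ]. A NumPy sketch that evaluates it given the model's factors (the argument names are my own):

```python
import numpy as np

def plsi_loglik(N, p_w_z, p_z_d, p_d):
    """N:     W x D matrix of word counts n(w, d)
    p_w_z: W x K, column z holds P(w|z)
    p_z_d: K x D, column d holds P(z|d)
    p_d:   length-D vector of P(d)"""
    p_w_d = p_w_z @ p_z_d                 # W x D: sum_z P(w|z) P(z|d)
    return float(np.sum(N * np.log(p_w_d * p_d + 1e-12)))  # epsilon avoids log(0)
```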
19. Topic Modeling
- Latent Dirichlet Allocation (LDA) evolved from PLSI
- LDA has now replaced PLSI
- Gibbs sampling (an MCMC method) is a popular inference method for LDA (see the sketch below)
- Newer methods, e.g. Hierarchical Dirichlet Processes, learn the optimal number of topics
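For concreteness, a minimal (unoptimized) sketch of collapsed Gibbs sampling for LDA; the count-based full conditional follows the standard formulation, but this is an illustration, not the exact code behind the experiments below.

```python
import numpy as np

def gibbs_lda(docs, W, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of word-id lists; W: vocab size; K: number of topics."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))                    # topic counts per document
    nkw = np.zeros((K, W))                    # word counts per topic
    nk = np.zeros(K)                          # total words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):            # initialize counts randomly
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                   # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional P(z_i = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + W * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                   # record the new topic
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, nkw                             # assignments and topic-word counts
```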
20. What can Topic Models be used for?
- Queries
- Who writes on this topic?
- e.g., finding experts or reviewers in a particular area
- What topics does this person do research on?
- Comparing groups of authors or documents
- Discovering trends over time
- Detecting unusual papers and authors
- Interactive browsing of a digital library via topics
- Parsing documents (and parts of documents) by topic
- and more...
21. Examples of Topics from CiteSeer
22. Four example topics from NIPS
23. Clusters v. Topics
24. Clusters v. Topics
- One Cluster
25. Clusters v. Topics
- Multiple Topics
- One Cluster
26. 3 of 300 example topics (TASA)
27. Automated Tagging of Words (numbers/colors → topic assignments)
28. Experiments on Various Data Sets
- Corpora
- CiteSeer: 160K abstracts, 85K authors
- NIPS: 1.7K papers, 2K authors
- Enron: 250K emails, 28K authors (senders)
- Medline: 300K abstracts, 128K authors
- Removed stop words; no stemming
- Ignore word order, just use word counts
- Processing time
- NIPS: 2000 Gibbs iterations → 8 hours
- CiteSeer: 2000 Gibbs iterations → 4 days
29. Temporal patterns in topics: hot and cold topics
- We have CiteSeer papers from 1986-2002
- For each year, calculate the fraction of words assigned to each topic (see the sketch below)
- → a time-series for topics
- Hot topics become more prevalent
- Cold topics become less prevalent
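A sketch of the per-year computation, assuming each word token carries a (year, topic) pair from the sampler's final assignments:

```python
from collections import defaultdict

def topic_fractions_by_year(assignments):
    """assignments: iterable of (year, topic) pairs, one per word token.
    Returns {year: {topic: fraction of that year's words}} -- one
    time series per topic; rising fractions = hot, falling = cold."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, topic in assignments:
        counts[year][topic] += 1
    fractions = {}
    for year, topics in counts.items():
        total = sum(topics.values())
        fractions[year] = {t: n / total for t, n in topics.items()}
    return fractions
```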
30-35. (Figure slides; no transcript)
36. Applications
- The Calit2 browser of research and researchers
- http://yarra.ics.uci.edu/calit2/