David Newman, UC Irvine, Lecture 13: Topic Models 1

Transcript
1
CS 277 Data Mining, Lecture 13: Topic Models (cont.)
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 3 available on web
  • Progress Report 2 due Tuesday Nov 13 in class
  • Email me ppt/pdf by 2pm Tuesday Nov 13

3
Progress Report 2
  • 2-page ppt/pdf: 2 slides that summarize your
    best result(s) so far
  • Produce an actual result with actual data
  • You are to convince your manager (the rest of the
    class) that this is a worthwhile project. What
    work did you do, and how do you present it?
  • Should contain:
  • Your name (top right corner)
  • Clear description of the main task
  • Your best result so far
  • Make it graphical (use text sparingly)
  • Submit by email, no later than 2pm Tuesday Nov 13th

4
Homework 2
  • Homework 2 review

5
K-Nearest Neighbor
  • TF accuracy (%)
  • 26, 28, 29, 29, 64, 27, 29, 29, 29
  • IDF accuracy (%)
  • 71, 69, 73, 72, 73, 73, 73, 72, 73
  • Best (K, accuracy)
  • (K=3, 39), (K=18, 69), (K=16, 72), (K=1, 12),
    (K=21, 72), (K=52, 81), (K=16, 80), (K=2, 3)
  • Did you compute accuracy on test data?
  • (K=52, 74)

6
Complexity
  • D documents
  • Dtrain
  • Dtest (we'll compute per test document)
  • W words
  • L average document length
  • C classes

7
K-Nearest Neighbor
  • Training Complexity
  • Time: zero
  • Space: D·W·e (e = fraction of nonzero entries), or D·L (space for Xtrain)
  • Test Complexity (per test doc)
  • Time: D·L
  • Space: same as training
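To make the D·L test cost concrete, here is a minimal brute-force K-NN sketch in Python (the toy data and names are assumptions for illustration, not the homework's setup): scoring one test document touches every stored training document.

import numpy as np

def knn_predict(Xtrain, ytrain, xtest, K):
    # Brute force: similarity against all D training docs, so time ~ D*L per test doc
    sims = Xtrain @ xtest
    sims = sims / (np.linalg.norm(Xtrain, axis=1) * np.linalg.norm(xtest) + 1e-12)
    top = np.argsort(-sims)[:K]                 # K most similar training docs
    return np.bincount(ytrain[top]).argmax()    # majority vote over their labels

Xtrain = np.array([[2., 0., 1.], [1., 0., 0.], [0., 3., 1.], [0., 1., 2.]])
ytrain = np.array([0, 0, 1, 1])
print(knn_predict(Xtrain, ytrain, np.array([0., 2., 1.]), K=3))   # -> 1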

8
Naïve Bayes
  • Binary accuracy (%)
  • 43, 56, 67, 53, 57, 58, 62
  • Multinomial accuracy (%)
  • 79, 78, 79, 78, 78, 79, 79
  • McCallum (Fig 3) did slightly better?

9
Naïve Bayes
  • Optimal smoothing parameter
  • 1, 1, 1, 0.06, 1, 0.1
  • Dirichlet(a)
  • a is a real number, a > 0
  • Pseudo counts can be fractional, e.g. a = 0.06

dog cat fish hen red green blue
Prob(green|fish) = ?
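As a worked check of the fractional pseudo count, here is the smoothed estimate in Python; the counts below are made-up assumptions, only the formula is the point.

# Smoothed multinomial NB: P(w|c) = (n(w,c) + a) / (n(c) + W*a)
a = 0.06          # Dirichlet pseudo count; can be fractional
W = 7             # vocabulary size: dog cat fish hen red green blue
n_green_fish = 0  # assumed: 'green' never occurs with class 'fish'
n_fish = 12       # assumed: 12 word tokens total in class 'fish'
print((n_green_fish + a) / (n_fish + W * a))   # ~0.0048, nonzero despite a zero count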
10
Naïve Bayes (Multinomial)
  • Training Complexity
  • Time: W·C × D/C = W·D
  • Space: W·C
  • Test Complexity (per test doc)
  • Time: L·C
  • Space: W·C
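A minimal sketch of why these bounds hold (toy data and names are assumptions): training is a single counting pass over all tokens, and the entire model is one W×C table.

import numpy as np

def train_multinomial_nb(docs, labels, W, C, a=1.0):
    # One pass over every token: time ~ total token count; model is a W x C table
    counts = np.zeros((W, C))
    for doc, c in zip(docs, labels):
        for w in doc:               # doc is a list of word ids
            counts[w, c] += 1
    return (counts + a) / (counts.sum(axis=0) + W * a)   # smoothed P(w|c)

docs = [[0, 0, 1], [1, 2], [2, 2, 3]]       # word ids per document
pwc = train_multinomial_nb(docs, [0, 0, 1], W=4, C=2)
print(pwc.sum(axis=0))                       # each class column sums to 1

Classifying a test document then scores its L tokens against each of the C columns, which is where the L·C test time comes from.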

11
Support Vector Machine
  • 1 vs. rest accuracy (%)
  • 27, 76, 76, 76, 76, 76, 76, 76, 76
  • 1 vs. 1 accuracy (%)
  • 56, 56, 56, 54, 56, 56, 61, 56
  • Several comments:
  • "SVM 1 vs 1 should give best results"
  • "SVM should perform best"
  • "SVM 1 vs 1 has to be the best"
  • I don't think this is necessarily the case

12
Support Vector Machine
  • Training Complexity
  • Time: D^3 (empirically D^1.6 to D^1.9)
  • Space: S·L (S = number of support vectors)
  • Test Complexity (per test doc)
  • Time: S·L
  • Space: S·L
  • 1 vs rest: build C SVMs
  • 1 vs 1: build C(C-1)/2 SVMs
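For comparison, a hedged scikit-learn sketch of the two multiclass schemes (assumes scikit-learn is installed; the toy data is made up). LinearSVC trains one-vs-rest (C binary SVMs), while SVC uses one-vs-one internally (C(C-1)/2 binary SVMs).

import numpy as np
from sklearn.svm import SVC, LinearSVC

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 2.], [2., 3.]])
y = np.array([0, 0, 1, 1, 2, 2])

ovr = LinearSVC().fit(X, y)             # one-vs-rest: C binary SVMs
ovo = SVC(kernel='linear').fit(X, y)    # one-vs-one: C(C-1)/2 binary SVMs
print(ovr.predict([[2., 2.5]]), ovo.predict([[2., 2.5]]))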

13
Weka
  • Attempt at J48 (Weka's decision tree)
  • Memory problems

14
Conclusions
  • Best overall classifier
  • NB?
  • SVM?
  • Best classifier for each class
  • NB or SVM?
  • Best training time complexity
  • K-NN (zero)
  • Best testing time complexity
  • NB: L·C
  • Best testing space complexity
  • NB: W·C

15
Today's lecture
  • Topic modeling
  • LSI (aka SVD)
  • NMF
  • LDA
  • PLSI (today)

16
Topic Modeling as dimensionality reduction
[Figure: X (W×D text collection) ≈ Φ (W×K topics) × Θ (K×D mixtures)]
K << min(W, D)
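A quick numpy shape check of this factorization (the sizes below are arbitrary assumptions):

import numpy as np

W, D, K = 5000, 1000, 50   # vocabulary, documents, topics; K << min(W, D)
Phi = np.random.dirichlet(np.ones(W), size=K).T     # W x K: word dist per topic
Theta = np.random.dirichlet(np.ones(K), size=D).T   # K x D: topic mix per doc
X = Phi @ Theta                                     # W x D reconstruction
print(X.shape, X.sum(axis=0)[:3])                   # (5000, 1000); columns sum to 1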
17
Probabilistic Latent Semantic Indexing
  • Factorization
  • SVD: X ≈ U S V^T
  • NMF: X ≈ W H
  • PLSI
  • P(w,d) = P(d) Σ_z P(w|z) P(z|d)
  • P(w,d) = Σ_z P(w|z) P(z) P(d|z)

[Figure: X ≈ Φ Θ, as on the previous slide]
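A minimal EM sketch for the PLSI model above (a toy Python implementation using dense arrays and a fixed iteration count; an assumption-laden illustration, not the lecture's reference code):

import numpy as np

def plsi(X, K, iters=50, seed=0):
    # EM for PLSI: X is a W x D count matrix; returns P(w|z) and P(z|d)
    rng = np.random.default_rng(seed)
    W, D = X.shape
    Pwz = rng.random((W, K)); Pwz /= Pwz.sum(axis=0)   # P(w|z), columns sum to 1
    Pzd = rng.random((K, D)); Pzd /= Pzd.sum(axis=0)   # P(z|d), columns sum to 1
    for _ in range(iters):
        R = Pwz[:, :, None] * Pzd[None, :, :]          # E-step: P(z|d,w), W x K x D
        R /= R.sum(axis=1, keepdims=True) + 1e-12
        T = R * X[:, None, :]                          # M-step: weight by counts n(d,w)
        Pwz = T.sum(axis=2); Pwz /= Pwz.sum(axis=0) + 1e-12
        Pzd = T.sum(axis=0); Pzd /= Pzd.sum(axis=0) + 1e-12
    return Pwz, Pzd

X = np.array([[4., 2., 0.], [3., 1., 0.], [0., 1., 5.], [0., 2., 4.]])
Pwz, Pzd = plsi(X, K=2)
print(np.round(Pzd, 2))    # each document's topic mixture (columns sum to 1)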

18
Probabilistic Latent Semantic Indexing
  • Select a document d with probability P(d)
  • Select a topic z with probability P(z|d)
  • Generate a word w with probability P(w|z)
  • P(d,w) = ?
  • Log likelihood of collection = ?
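For reference, the standard PLSI answers to the two questions above:
  • P(d,w) = P(d) Σ_z P(w|z) P(z|d)
  • log L = Σ_d Σ_w n(d,w) log P(d,w), where n(d,w) is the count of word w in document d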

19
Topic Modeling
  • Latent Dirichlet Allocation (LDA) evolved from
    PLSI
  • LDA has now replaced PLSI
  • Gibbs sampling (an MCMC method) is a popular
    inference method for LDA
  • Newer methods, e.g. Hierarchical Dirichlet
    Processes, learn the optimal number of topics
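A minimal collapsed Gibbs sampling sketch for LDA (a toy Python illustration of the standard sampling update; the hyperparameters and data are assumptions):

import numpy as np

def lda_gibbs(docs, W, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    # Resample each token's topic from p(z=k) ∝ (n_wk+beta)/(n_k+W*beta) * (n_dk+alpha)
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # topic of each token
    nwk = np.zeros((W, K)); ndk = np.zeros((len(docs), K)); nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            nwk[w, k] += 1; ndk[d, k] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                nwk[w, k] -= 1; ndk[d, k] -= 1; nk[k] -= 1   # remove token
                p = (nwk[w] + beta) / (nk + W * beta) * (ndk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())             # resample its topic
                z[d][i] = k
                nwk[w, k] += 1; ndk[d, k] += 1; nk[k] += 1   # add it back
    return nwk, ndk

docs = [[0, 0, 1, 1], [1, 2, 2], [3, 3, 2]]   # word ids per document
nwk, ndk = lda_gibbs(docs, W=4, K=2)
print(ndk)    # per-document topic counts after sampling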

20
What can Topic Models be used for?
  • Queries
  • Who writes on this topic?
  • e.g., finding experts or reviewers in a
    particular area
  • What topics does this person do research on?
  • Comparing groups of authors or documents
  • Discovering trends over time
  • Detecting unusual papers and authors
  • Interactive browsing of a digital library via
    topics
  • Parsing documents (and parts of documents) by
    topic
  • and more…

21
Examples of Topics from CiteSeer
22
Four example topics from NIPS
23
Clusters v. Topics
24
Clusters v. Topics
One Cluster
25
Clusters v. Topics
Multiple Topics
One Cluster
26
3 of 300 example topics (TASA)
27
Automated Tagging of Words (numbers/colors → topic assignments)
28
Experiments on Various Data Sets
  • Corpora
  • CiteSeer: 160K abstracts, 85K authors
  • NIPS: 1.7K papers, 2K authors
  • Enron: 250K emails, 28K authors (senders)
  • Medline: 300K abstracts, 128K authors
  • Removed stop words; no stemming
  • Ignore word order, just use word counts
  • Processing time
  • NIPS: 2000 Gibbs iterations → 8 hours
  • CiteSeer: 2000 Gibbs iterations → 4 days

29
Temporal patterns in topics: hot and cold topics
  • We have CiteSeer papers from 1986-2002
  • For each year, calculate the fraction of words
    assigned to each topic
  • → a time series for each topic
  • Hot topics become more prevalent
  • Cold topics become less prevalent
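A minimal sketch of that per-year computation (the token-level topic assignments and years below are made-up assumptions):

import numpy as np

# topic assignment z_i and publication year for every word token (toy data)
z     = np.array([0, 0, 1, 1, 1, 0, 2, 2, 1])
years = np.array([1986, 1986, 1986, 1987, 1987, 1987, 1987, 1988, 1988])
K = 3

for y in np.unique(years):
    mask = years == y
    frac = np.bincount(z[mask], minlength=K) / mask.sum()  # fraction of words per topic
    print(y, np.round(frac, 2))   # rising fractions = hot topics, falling = cold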

30-35
(No transcript: figure-only slides)
36
Applications
  • The Calit2 browser of research and researchers
  • http://yarra.ics.uci.edu/calit2/