Title: Modeling Documents
1. Modeling Documents
- Amruta Joshi
- Department of Computer Science
- Stanford University
2. Outline
- Topic Models
- Topic Extraction
- Author Information
- Modeling Topics
- Modeling Authors
- Author-Topic Model
- Inference
- Integrating Topics and Syntax
- Probabilistic Models
- Composite Model
- Inference
3. Motivation
- Identifying the content of a document
- Identifying its latent structure
- More specifically: given a collection of documents, we want to build a model that collects information about
  - Authors
  - Topics
  - Syntactic constructs
4. Topics & Authors
- Why model topics?
  - Observe topic trends
  - See how documents relate to one another
  - Tag abstracts
- Why model authors' interests?
  - Identify what an author writes about
  - Identify authors with similar interests
  - Authorship attribution
  - Creating reviewer lists
  - Finding unusual work by an author
5. Topic Extraction Overview
- Supervised learning techniques
  - Learn from a labeled document collection
  - But: unlabeled documents, rapidly changing fields (Yang 1998)
6. Topic Extraction Overview
- Dimensionality reduction
  - Represent documents in a vector space of terms
  - Map to a low-dimensional space
  - Non-linear dimensionality reduction: WEBSOM (Lagus et al. 1999)
  - Linear projection: LSI (Berry, Dumais, O'Brien 1995)
  - Regions of the space represent topics
7. Topic Extraction Overview
- Cluster documents on semantic content
  - Typically, each cluster has just one topic
- Aspect model
  - A topic is modeled as a distribution over words
  - Documents are generated from multiple topics
8. Author Information Overview
- Analyzing text using
  - Stylometry: statistical analysis of literary style, frequency of word usage, etc.
  - Semantics: the content of the document
9. Author Information Overview
- Graph-based models
  - Build an interactive ReferralWeb using citations (Kautz, Selman, Shah 1997)
  - Build co-author graphs (White & Smyth), using PageRank for analysis
10. The Big Idea
- Topic model: model topics as distributions over words
- Author model: model authors as distributions over words
- Author-Topic model: a probabilistic model of both
  - Model topics as distributions over words
  - Model authors as distributions over topics
11. Bayesian Networks
- Nodes are random variables; edges are direct probabilistic influence
- Topology captures independence: XRay is conditionally independent of Pneumonia given Infiltrates
(Slide credit: Lisa Getoor, UMD College Park)
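The conditional-independence claim above can be checked by brute-force enumeration over the joint distribution. A minimal sketch for the chain Pneumonia → Infiltrates → XRay; all CPT numbers below are made up for illustration:

```python
# Hypothetical CPTs for the chain Pneumonia -> Infiltrates -> XRay
p_pneu = {True: 0.02, False: 0.98}                                  # P(P)
p_inf = {True: {True: 0.9, False: 0.1},
         False: {True: 0.05, False: 0.95}}                          # P(I | P)
p_xray = {True: {True: 0.8, False: 0.2},
          False: {True: 0.1, False: 0.9}}                           # P(X | I)

def joint(p, i, x):
    # Chain rule over the network topology: P(P) * P(I|P) * P(X|I)
    return p_pneu[p] * p_inf[p][i] * p_xray[i][x]

def p_xray_given(i, p=None):
    # P(X=True | I=i [, P=p]) by summing out the unobserved variables
    ps = [p] if p is not None else [True, False]
    num = sum(joint(pp, i, True) for pp in ps)
    den = sum(joint(pp, i, x) for pp in ps for x in [True, False])
    return num / den

# Once Infiltrates is observed, Pneumonia carries no extra information:
assert abs(p_xray_given(True, p=True) - p_xray_given(True, p=False)) < 1e-12
assert abs(p_xray_given(True) - p_xray_given(True, p=True)) < 1e-12
```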
12. Bayesian Networks
- Associated with each node Xi there is a conditional probability distribution P(Xi | Pa_i): a distribution over Xi for each assignment to its parents
- If the variables are discrete, P is usually multinomial
- P can also be linear Gaussian, a mixture of Gaussians, etc.
(Slide credit: Lisa Getoor, UMD College Park)
13. BN Learning
- BN models can be learned from empirical data
  - Parameter estimation via numerical optimization
  - Structure learning via combinatorial search
(Slide credit: Lisa Getoor, UMD College Park)
14. Generative Model
- Mixture weights: the document's distribution over topics
- Mixture components: the topics' distributions over words
- Bayesian approach: use priors
  - Mixture weights ~ Dirichlet(α)
  - Mixture components ~ Dirichlet(β)
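The generative process can be sketched end to end: draw the mixture components and weights from their Dirichlet priors, then sample each word by first picking a topic. The vocabulary, topic count, and hyperparameter values below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["river", "loan", "money", "bank", "stream"]  # toy vocabulary
T, alpha, beta = 2, 0.5, 0.5                          # illustrative settings

# Mixture components: each topic is a distribution over words, phi ~ Dirichlet(beta)
phi = rng.dirichlet([beta] * len(vocab), size=T)
# Mixture weights: the document's distribution over topics, theta ~ Dirichlet(alpha)
theta = rng.dirichlet([alpha] * T)

def generate_doc(n_words):
    # For each word: pick a topic from theta, then a word from that topic's phi
    topics = rng.choice(T, size=n_words, p=theta)
    return [vocab[rng.choice(len(vocab), p=phi[z])] for z in topics]

doc = generate_doc(10)
```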
15. Bayesian Network for Modeling Document Generation
16. Topic Model: Plate Notation
17. Topic Model: Geometric Representation
18. Modeling Authors with Words
19. Author-Topic Model
20. Inference
- Expectation-Maximization
  - But poor results (local maxima)
- Gibbs sampling
  - Parameters θ, φ
  - Start with an initial random assignment
  - Update each parameter using the other parameters
  - Converges after n iterations (burn-in time)
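The Gibbs sampling procedure above can be sketched as a collapsed Gibbs sampler for the plain topic model: remove a token's assignment, resample its topic from the conditional, and repeat past burn-in. The toy corpus, hyperparameters, and iteration count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["river", "loan", "money", "bank", "stream"]
docs = [[0, 3, 4, 0, 4], [1, 2, 3, 1, 2], [3, 2, 1, 3, 4]]  # word indices
T, alpha, beta = 2, 0.5, 0.01
V = len(vocab)

# Start with an initial random topic assignment for every token
z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
ndt = np.zeros((len(docs), T))   # document-topic counts
ntw = np.zeros((T, V))           # topic-word counts
nt = np.zeros(T)                 # per-topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndt[d, z[d][i]] += 1; ntw[z[d][i], w] += 1; nt[z[d][i]] += 1

for _ in range(200):             # iterate past burn-in
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1   # remove token
            # Conditional: p(t) proportional to (n_dt + a)(n_tw + b)/(n_t + Vb)
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
            t = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = t
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1   # add back

# Point estimate of the recovered topics phi from the final counts
phi = (ntw + beta) / (nt[:, None] + V * beta)
```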
21. Inference and Learning for Documents
- φ_mj: probability of word m in topic j
- θ_dj: probability of topic j in document d
22. Matrix Factorization
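One way to read the topic model is as a matrix factorization: P(w|d) = Σ_z P(w|z) P(z|d), so the document-word probability matrix factors into the low-rank product Θ·Φ. A quick numpy check with arbitrary Dirichlet draws (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, T, D = 5, 2, 4
phi = rng.dirichlet([0.5] * V, size=T)      # T x V: topics over words
theta = rng.dirichlet([0.5] * T, size=D)    # D x T: documents over topics

# P(w|d) = sum_z P(w|z) P(z|d)  ->  a rank-T factorization of the D x V matrix
p_w_given_d = theta @ phi

assert np.allclose(p_w_given_d.sum(axis=1), 1.0)     # each row is a distribution
assert np.linalg.matrix_rank(p_w_given_d) <= T       # rank bounded by topic count
```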
23. Topic Model Inference
[Figure: word-document count matrix over the vocabulary {River, Loan, Money, Bank, Stream}]
- Can we recover the original topics and topic mixtures from this data?
(Slide credit: Padhraic Smyth, UC Irvine)
24. Example of Gibbs Sampling
- Assign word tokens randomly to topics (topic 1 or topic 2)
[Figure: random topic assignments over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
25. After 1 Iteration
- Apply the sampling equation to each word token
[Figure: topic assignments over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
26. After 4 Iterations
[Figure: topic assignments over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
27. After 32 Iterations
[Figure: estimated φ and θ over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
28. Results
- Tested on scientific papers
- NIPS dataset
  - V = 13,649 words, D = 1,740 documents, K = 2,037 authors
  - Topics: 100
  - Tokens: 2,301,375
- CiteSeer dataset
  - V = 30,799 words, D = 162,489 documents, K = 85,465 authors
  - Topics: 300
  - Tokens: 11,685,514
29. Evaluating Predictive Power
- Perplexity
  - Indicates the ability to predict words in new, unseen documents
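Perplexity is the exponentiated negative average log-likelihood per token, exp(-(1/N) Σ log p(w|d)): lower means better prediction. A minimal sketch; the `perplexity` helper and the uniform-model sanity check are illustrative, not from the papers:

```python
import numpy as np

def perplexity(docs, phi, theta):
    # docs: list of word-index lists; phi: T x V topic-word; theta: D x T doc-topic
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        p_w = theta[d] @ phi              # P(w|d) = sum_z P(w|z) P(z|d)
        for w in doc:
            log_lik += np.log(p_w[w])
        n_tokens += len(doc)
    return np.exp(-log_lik / n_tokens)    # lower = better prediction

# Sanity check: a uniform model over V words has perplexity exactly V
V = 5
phi = np.full((1, V), 1.0 / V)
theta = np.array([[1.0]])
assert abs(perplexity([[0, 1, 2]], phi, theta) - V) < 1e-9
```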
30. Results: Perplexity
31. Recap
- First
  - Author model
  - Topic model
- Then
  - Author-Topic model
- Next
  - Integrating topics & syntax
32. Integrating Topics & Syntax
- Probabilistic models
  - Short-range dependencies
    - Syntactic constraints, represented as distinct syntactic classes
    - HMMs, probabilistic CFGs
  - Long-range dependencies
    - Semantic constraints, represented as probabilistic distributions
    - Bayes model, topic model
- New idea: use both!
33. How to Integrate These?
- Mixture of models
  - Each word exhibits either short- or long-range dependencies
- Product of models
  - Each word exhibits both short- and long-range dependencies
- Composite model
  - Asymmetric
  - All words exhibit short-range dependencies
  - A subset of words exhibits long-range dependencies
34. The Composite Model (1)
- Capturing asymmetry
  - Replace the probability distribution over words with a semantic model
  - The syntactic model chooses when to emit a content word
  - The semantic model chooses which word to emit
- Method
  - The syntactic component is an HMM
  - The semantic component is a topic model
35. Generating Phrases
- Sample phrases generated by the model: "network used for images", "image obtained with kernel", "output described with objects", "neural network trained with svm images"
36. The Composite Model (2): Graphical Representation
37. The Composite Model (3)
- θ(d): the document's distribution over topics
- Transitions between classes c_{i-1} and c_i follow the distribution π(c_{i-1})
- A document is generated as follows: for each word w_i in document d
  - Draw z_i from θ(d)
  - Draw c_i from π(c_{i-1})
  - If c_i = 1, draw w_i from φ(z_i); else draw w_i from φ(c_i)
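The generative steps above can be sketched directly: an HMM walks over classes, and whenever it enters the semantic class the topic model chooses the word. The class count, the toy vocabularies, and the convention that class index 0 plays the role of the semantic class (c_i = 1 in the slide's 1-based notation) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
C, T = 3, 2                                             # classes, topics (toy sizes)
vocab_sem = ["network", "image", "kernel", "svm"]       # content words (topics)
vocab_syn = [["the", "a"], ["used", "trained"]]         # words for classes 1..C-1
pi = rng.dirichlet([1.0] * C, size=C)                   # transitions pi(c_{i-1})
phi_topic = rng.dirichlet([0.5] * len(vocab_sem), size=T)
theta_d = rng.dirichlet([0.5] * T)                      # this document's theta(d)

def generate(n):
    words, c = [], 0
    for _ in range(n):
        z = int(rng.choice(T, p=theta_d))   # draw z_i from theta(d)
        c = int(rng.choice(C, p=pi[c]))     # draw c_i from pi(c_{i-1})
        if c == 0:                          # semantic class: topic model emits w_i
            words.append(vocab_sem[int(rng.choice(len(vocab_sem), p=phi_topic[z]))])
        else:                               # syntactic class: HMM class emits w_i
            words.append(str(rng.choice(vocab_syn[c - 1])))
    return words

phrase = generate(8)
```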
38. Results
- Tested on
  - Brown corpus (tagged with word types)
  - Concatenated Brown + TASA corpus
- HMM + Topic model
  - 20 classes (one start/end-marker class + 19 classes)
  - T = 200
39. Results
- Identifying syntactic classes and semantic topics
  - Clean separation observed
- Identifying function words vs. content words
  - e.g., "control" as a plain verb (syntax) or a semantic word
- Part-of-speech tagging
  - Identifying the syntactic class
- Document classification
  - Brown corpus: 500 docs → 15 groups
  - Results similar to the plain topic model
40. Extensions to the Topic Model
- Integrating link information (Cohn, Hofmann 2001)
- Learning topic hierarchies
- Integrating syntax and topics
- Integrating authorship information with content (author-topic model)
- Grade-of-membership models
- Random sentence generation
41. Conclusion
- Identifying a document's latent structure
- Document content is modeled for
  - Semantic associations: topic model
  - Authorship: author-topic model
  - Syntactic constructs: HMM
42. Acknowledgements
- Prof. Rajeev Motwani: advice and guidance regarding topic selection
- T. K. Satish Kumar: help with probabilistic models
44. References
- Primary
  - Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
  - Steyvers, M., & Griffiths, T. Probabilistic Topic Models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)
  - Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.
  - Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (in press). Integrating Topics and Syntax. In Advances in Neural Information Processing Systems 17.
  - Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.