Title: Modeling Documents
1. Modeling Documents
- Amruta Joshi
- Department of Computer Science
- Stanford University
2. Outline
- Topic Models
- Topic Extraction
- Author Information
- Modeling Topics
- Modeling Authors
- Author-Topic Model
- Inference
- Integrating Topics and Syntax
- Probabilistic Models
- Composite Model
- Inference
3. Motivation
- Identifying the content of a document
- Identifying its latent structure
- More specifically: given a collection of documents, we want to build a model that collects information about
  - Authors
  - Topics
  - Syntactic constructs
4. Topics & Authors
- Why model topics?
  - Observe topic trends
  - See how documents relate to one another
  - Tag abstracts
- Why model authors' interests?
  - Identify what an author writes about
  - Identify authors with similar interests
  - Authorship attribution
  - Creating reviewer lists
  - Finding unusual work by an author
5. Topic Extraction Overview
- Supervised learning techniques
  - Learn from a labeled document collection
  - But: unlabeled documents, rapidly changing fields (Yang 1998)
6. Topic Extraction Overview
- Dimensionality reduction
  - Represent documents in a vector space of terms
  - Map to a low-dimensional space
  - Non-linear dimensionality reduction: WEBSOM (Lagus et al. 1999)
  - Linear projection: LSI (Berry, Dumais, O'Brien 1995)
  - Regions of the space represent topics
7. Topic Extraction Overview
- Cluster documents on semantic content
  - Typically, each cluster has just one topic
- Aspect model
  - A topic is modeled as a distribution over words
  - Documents are generated from multiple topics
8. Author Information Overview
- Analyzing text using
  - Stylometry: statistical analysis of literary style, frequency of word usage, etc.
  - Semantics: the content of the document
9. Author Information Overview
- Graph-based models
  - Build an interactive ReferralWeb using citations (Kautz, Selman, Shah 1997)
  - Build co-author graphs (White & Smyth), using PageRank for analysis
10. The Big Idea
- Topic model: model topics as distributions over words
- Author model: model authors as distributions over words
- Author-Topic model: a probabilistic model of both
  - Model topics as distributions over words
  - Model authors as distributions over topics
11. Bayesian Networks
- Nodes are random variables; edges are direct probabilistic influence
- Topology captures independence: XRay is conditionally independent of Pneumonia given Infiltrates
(Slide credit: Lisa Getoor, UMD College Park)
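The conditional-independence claim above can be checked by brute-force enumeration over the joint distribution. A minimal sketch for the chain Pneumonia → Infiltrates → XRay; all CPT numbers below are made up for illustration:

```python
# Hypothetical CPTs for the chain Pneumonia -> Infiltrates -> XRay
p_pneu = {True: 0.02, False: 0.98}                                  # P(P)
p_inf = {True: {True: 0.9, False: 0.1},
         False: {True: 0.05, False: 0.95}}                          # P(I | P)
p_xray = {True: {True: 0.8, False: 0.2},
          False: {True: 0.1, False: 0.9}}                           # P(X | I)

def joint(p, i, x):
    # Chain rule over the network topology: P(P) * P(I|P) * P(X|I)
    return p_pneu[p] * p_inf[p][i] * p_xray[i][x]

def p_xray_given(i, p=None):
    # P(X=True | I=i [, P=p]) by summing out the unobserved variables
    ps = [p] if p is not None else [True, False]
    num = sum(joint(pp, i, True) for pp in ps)
    den = sum(joint(pp, i, x) for pp in ps for x in [True, False])
    return num / den

# Once Infiltrates is observed, Pneumonia carries no extra information:
assert abs(p_xray_given(True, p=True) - p_xray_given(True, p=False)) < 1e-12
assert abs(p_xray_given(True) - p_xray_given(True, p=True)) < 1e-12
```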
12. Bayesian Networks
- Associated with each node Xi there is a conditional probability distribution P(Xi | Pa_i): a distribution over Xi for each assignment to its parents
- If the variables are discrete, P is usually multinomial
- P can also be linear Gaussian, a mixture of Gaussians, etc.
(Slide credit: Lisa Getoor, UMD College Park)
13. BN Learning
- BN models can be learned from empirical data
  - Parameter estimation via numerical optimization
  - Structure learning via combinatorial search
(Slide credit: Lisa Getoor, UMD College Park)
14. Generative Model
- Mixture weights: the document's distribution over topics
- Mixture components: the topics' distributions over words
- Bayesian approach: use priors
  - Mixture weights ~ Dirichlet(α)
  - Mixture components ~ Dirichlet(β)
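The generative process can be sketched end to end: draw the mixture components and weights from their Dirichlet priors, then sample each word by first picking a topic. The vocabulary, topic count, and hyperparameter values below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["river", "loan", "money", "bank", "stream"]  # toy vocabulary
T, alpha, beta = 2, 0.5, 0.5                          # illustrative settings

# Mixture components: each topic is a distribution over words, phi ~ Dirichlet(beta)
phi = rng.dirichlet([beta] * len(vocab), size=T)
# Mixture weights: the document's distribution over topics, theta ~ Dirichlet(alpha)
theta = rng.dirichlet([alpha] * T)

def generate_doc(n_words):
    # For each word: pick a topic from theta, then a word from that topic's phi
    topics = rng.choice(T, size=n_words, p=theta)
    return [vocab[rng.choice(len(vocab), p=phi[z])] for z in topics]

doc = generate_doc(10)
```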
15. Bayesian Network for Modeling Document Generation
16. Topic Model: Plate Notation
17. Topic Model: Geometric Representation
18. Modeling Authors with Words
19. Author-Topic Model
20. Inference
- Expectation-Maximization
  - But poor results (local maxima)
- Gibbs sampling
  - Parameters θ, φ
  - Start with an initial random assignment
  - Update each parameter using the other parameters
  - Converges after n iterations (burn-in time)
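The Gibbs sampling procedure above can be sketched as a collapsed Gibbs sampler for the plain topic model: remove a token's assignment, resample its topic from the conditional, and repeat past burn-in. The toy corpus, hyperparameters, and iteration count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["river", "loan", "money", "bank", "stream"]
docs = [[0, 3, 4, 0, 4], [1, 2, 3, 1, 2], [3, 2, 1, 3, 4]]  # word indices
T, alpha, beta = 2, 0.5, 0.01
V = len(vocab)

# Start with an initial random topic assignment for every token
z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
ndt = np.zeros((len(docs), T))   # document-topic counts
ntw = np.zeros((T, V))           # topic-word counts
nt = np.zeros(T)                 # per-topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndt[d, z[d][i]] += 1; ntw[z[d][i], w] += 1; nt[z[d][i]] += 1

for _ in range(200):             # iterate past burn-in
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1   # remove token
            # Conditional: p(t) proportional to (n_dt + a)(n_tw + b)/(n_t + Vb)
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
            t = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = t
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1   # add back

# Point estimate of the recovered topics phi from the final counts
phi = (ntw + beta) / (nt[:, None] + V * beta)
```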
21. Inference and Learning for Documents
- φ_mj: probability of word m in topic j
- θ_dj: probability of topic j in document d
22. Matrix Factorization
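One way to read the topic model is as a matrix factorization: P(w|d) = Σ_z P(w|z) P(z|d), so the document-word probability matrix factors into the low-rank product Θ·Φ. A quick numpy check with arbitrary Dirichlet draws (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, T, D = 5, 2, 4
phi = rng.dirichlet([0.5] * V, size=T)      # T x V: topics over words
theta = rng.dirichlet([0.5] * T, size=D)    # D x T: documents over topics

# P(w|d) = sum_z P(w|z) P(z|d)  ->  a rank-T factorization of the D x V matrix
p_w_given_d = theta @ phi

assert np.allclose(p_w_given_d.sum(axis=1), 1.0)     # each row is a distribution
assert np.linalg.matrix_rank(p_w_given_d) <= T       # rank bounded by topic count
```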
23. Topic Model Inference
[Figure: word-document count matrix over the vocabulary {River, Loan, Money, Bank, Stream}]
- Can we recover the original topics and topic mixtures from this data?
(Slide credit: Padhraic Smyth, UC Irvine)
24. Example of Gibbs Sampling
- Assign word tokens randomly to topics (topic 1 or topic 2)
[Figure: random topic assignments over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
25. After 1 Iteration
- Apply the sampling equation to each word token
[Figure: topic assignments over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
26. After 4 Iterations
[Figure: topic assignments over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
27. After 32 Iterations
[Figure: estimated φ and θ over {River, Loan, Money, Bank, Stream}]
(Slide credit: Padhraic Smyth, UC Irvine)
28. Results
- Tested on scientific papers
- NIPS dataset
  - V = 13,649 words, D = 1,740 documents, K = 2,037 authors
  - Topics: 100
  - Tokens: 2,301,375
- CiteSeer dataset
  - V = 30,799 words, D = 162,489 documents, K = 85,465 authors
  - Topics: 300
  - Tokens: 11,685,514
29. Evaluating Predictive Power
- Perplexity
  - Indicates the ability to predict words in new, unseen documents
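Perplexity is the exponentiated negative average log-likelihood per token, exp(-(1/N) Σ log p(w|d)): lower means better prediction. A minimal sketch; the `perplexity` helper and the uniform-model sanity check are illustrative, not from the papers:

```python
import numpy as np

def perplexity(docs, phi, theta):
    # docs: list of word-index lists; phi: T x V topic-word; theta: D x T doc-topic
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        p_w = theta[d] @ phi              # P(w|d) = sum_z P(w|z) P(z|d)
        for w in doc:
            log_lik += np.log(p_w[w])
        n_tokens += len(doc)
    return np.exp(-log_lik / n_tokens)    # lower = better prediction

# Sanity check: a uniform model over V words has perplexity exactly V
V = 5
phi = np.full((1, V), 1.0 / V)
theta = np.array([[1.0]])
assert abs(perplexity([[0, 1, 2]], phi, theta) - V) < 1e-9
```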
30. Results: Perplexity
31. Recap
- First
  - Author model
  - Topic model
- Then
  - Author-Topic model
- Next
  - Integrating topics & syntax
32. Integrating Topics & Syntax
- Probabilistic models
  - Short-range dependencies
    - Syntactic constraints, represented as distinct syntactic classes
    - HMMs, probabilistic CFGs
  - Long-range dependencies
    - Semantic constraints, represented as probabilistic distributions
    - Bayes model, topic model
- New idea: use both!
33. How to Integrate These?
- Mixture of models
  - Each word exhibits either short- or long-range dependencies
- Product of models
  - Each word exhibits both short- and long-range dependencies
- Composite model
  - Asymmetric
  - All words exhibit short-range dependencies
  - A subset of words exhibits long-range dependencies
34. The Composite Model (1)
- Capturing asymmetry
  - Replace the probability distribution over words with a semantic model
  - The syntactic model chooses when to emit a content word
  - The semantic model chooses which word to emit
- Method
  - The syntactic component is an HMM
  - The semantic component is a topic model
35. Generating Phrases
- Sample phrases generated by the model: "network used for images", "image obtained with kernel", "output described with objects", "neural network trained with svm images"
36. The Composite Model (2): Graphical Representation
37. The Composite Model (3)
- θ(d): the document's distribution over topics
- Transitions between classes c_{i-1} and c_i follow the distribution π(c_{i-1})
- A document is generated as follows: for each word w_i in document d
  - Draw z_i from θ(d)
  - Draw c_i from π(c_{i-1})
  - If c_i = 1, draw w_i from φ(z_i); else draw w_i from φ(c_i)
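The generative steps above can be sketched directly: an HMM walks over classes, and whenever it enters the semantic class the topic model chooses the word. The class count, the toy vocabularies, and the convention that class index 0 plays the role of the semantic class (c_i = 1 in the slide's 1-based notation) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
C, T = 3, 2                                             # classes, topics (toy sizes)
vocab_sem = ["network", "image", "kernel", "svm"]       # content words (topics)
vocab_syn = [["the", "a"], ["used", "trained"]]         # words for classes 1..C-1
pi = rng.dirichlet([1.0] * C, size=C)                   # transitions pi(c_{i-1})
phi_topic = rng.dirichlet([0.5] * len(vocab_sem), size=T)
theta_d = rng.dirichlet([0.5] * T)                      # this document's theta(d)

def generate(n):
    words, c = [], 0
    for _ in range(n):
        z = int(rng.choice(T, p=theta_d))   # draw z_i from theta(d)
        c = int(rng.choice(C, p=pi[c]))     # draw c_i from pi(c_{i-1})
        if c == 0:                          # semantic class: topic model emits w_i
            words.append(vocab_sem[int(rng.choice(len(vocab_sem), p=phi_topic[z]))])
        else:                               # syntactic class: HMM class emits w_i
            words.append(str(rng.choice(vocab_syn[c - 1])))
    return words

phrase = generate(8)
```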
38. Results
- Tested on
  - Brown corpus (tagged with word types)
  - Concatenated Brown + TASA corpus
- HMM + Topic model
  - 20 classes (one start/end-marker class + 19 classes)
  - T = 200
39. Results
- Identifying syntactic classes and semantic topics
  - Clean separation observed
- Identifying function words vs. content words
  - e.g., "control" as a plain verb (syntax) or a semantic word
- Part-of-speech tagging
  - Identifying the syntactic class
- Document classification
  - Brown corpus: 500 docs → 15 groups
  - Results similar to the plain topic model
40. Extensions to the Topic Model
- Integrating link information (Cohn, Hofmann 2001)
- Learning topic hierarchies
- Integrating syntax and topics
- Integrating authorship information with content (author-topic model)
- Grade-of-membership models
- Random sentence generation
41. Conclusion
- Identifying a document's latent structure
- Document content is modeled for
  - Semantic associations: topic model
  - Authorship: author-topic model
  - Syntactic constructs: HMM
42. Acknowledgements
- Prof. Rajeev Motwani: advice and guidance regarding topic selection
- T. K. Satish Kumar: help with probabilistic models
44. References
- Primary
  - Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
  - Steyvers, M., & Griffiths, T. Probabilistic Topic Models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)
  - Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.
  - Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (in press). Integrating Topics and Syntax. In Advances in Neural Information Processing Systems 17.
  - Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.