Transcript and Presenter's Notes

Title: Modeling Documents


1
Modeling Documents
  • Amruta Joshi
  • Department of Computer Science
  • Stanford University

2
Outline
  • Topic Models
  • Topic Extraction
  • Author Information
  • Modeling Topics
  • Modeling Authors
  • Author Topic Model
  • Inference
  • Integrating topics and syntax
  • Probabilistic Models
  • Composite Model
  • Inference

3
Motivation
  • Identifying content of a document
  • Identifying its latent structure
  • More specifically
  • Given a collection of documents, we want to build
    a model that captures information about
  • Authors
  • Topics
  • Syntactic constructs

4
Topics & Authors
  • Why model topics?
  • Observe topic trends
  • How documents relate to one another
  • Tagging abstracts
  • Why model authors' interests?
  • Identifying what an author writes about
  • Identifying authors with similar interests
  • Authorship attribution
  • Creating reviewer lists
  • Finding unusual work by an author

5
Topic Extraction Overview
  • Supervised Learning Techniques
  • Learn from labeled document collection
  • But: unlabeled documents, rapidly changing fields
    (Yang 1998)

6
Topic Extraction Overview
  • Dimensionality Reduction
  • Represent documents in Vector Space of terms
  • Map to low-dimensionality
  • Non-linear dim. reduction
  • WEBSOM (Lagus et al. 1999)
  • Linear Projection
  • LSI (Berry, Dumais, O'Brien 1995)
  • Regions represent topics

7
Topic Extraction Overview
  • Cluster documents on semantic content
  • Typically, each cluster has just 1 topic
  • Aspect Model
  • Topic modeled as distribution over words
  • Documents generated from multiple topics
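The aspect-model idea above (topics as word distributions, documents generated from a mixture of topics) can be sketched as a toy two-step sampling process; the topics, words, and mixture weights below are illustrative, not taken from any dataset:

```python
import random

# Hypothetical toy topics: each topic is a distribution over words.
topics = {
    "finance": {"bank": 0.4, "loan": 0.3, "money": 0.3},
    "nature":  {"river": 0.5, "stream": 0.3, "bank": 0.2},
}

def generate_document(topic_mixture, length, seed=0):
    """Sample each word by first picking a topic, then a word from it."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        # Step 1: choose a topic according to the document's mixture weights
        topic = rng.choices(list(topic_mixture), weights=topic_mixture.values())[0]
        # Step 2: choose a word from that topic's word distribution
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist), weights=word_dist.values())[0])
    return words

doc = generate_document({"finance": 0.7, "nature": 0.3}, length=10)
```

Note the contrast with hard clustering: a single document draws words from both topics, weighted by its mixture.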

8
Author Information Overview
  • Analyzing text using
  • Stylometry
    statistical analysis of literary style,
    word-usage frequency, etc.
  • Semantics
  • Content of document

9
Author Information Overview
  • Graph-based models
  • Build Interactive ReferralWeb using citations
  • Kautz, Selman, Shah 1997
  • Build Co-Author Graphs
  • White & Smith
  • PageRank for analysis
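PageRank over a co-author graph can be sketched with a few lines of power iteration; the graph and author names below are made up for illustration:

```python
# A hypothetical co-author graph: each author links to co-authors.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["alice"],
    "carol": ["alice", "bob"],
}

def pagerank(graph, damping=0.85, iters=50):
    """Plain power-iteration PageRank; every node is assumed to have out-links."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        # Teleportation mass, shared equally by all nodes
        new = {node: (1 - damping) / n for node in graph}
        # Each node distributes its rank equally among its neighbors
        for node, neighbors in graph.items():
            share = damping * rank[node] / len(neighbors)
            for nb in neighbors:
                new[nb] += share
        rank = new
    return rank

ranks = pagerank(graph)
# "alice" receives the most inbound rank, so she scores highest
```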

10
The Big Idea
  • Topic Model
  • Model topics as distribution over words
  • Author Model
  • Model author as distribution over words
  • Author-Topic Model
  • Probabilistic Model for both
  • Model topics as distribution over words
  • Model authors as distribution over topics

11
Bayesian Networks
  • Nodes: random variables; edges: direct
    probabilistic influence
  • Topology captures independence: XRay
    conditionally independent of Pneumonia given
    Infiltrates
Slide Credit Lisa Getoor, UMD College Park
12
Bayesian Networks
  • Associated with each node Xi there is a
    conditional probability distribution P(Xi | Pai):
    a distribution over Xi for each assignment to its
    parents
  • If variables are discrete, P is usually
    multinomial
  • P can be linear Gaussian, mixture of Gaussians, ...

Slide Credit Lisa Getoor, UMD College Park
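A discrete CPD of the kind just described can be stored as a lookup table with one multinomial per assignment of the parents; the XRay/Infiltrates probabilities below are illustrative, not from any real model:

```python
# P(XRay | Infiltrates): one word distribution per parent assignment.
cpd_xray = {
    ("infiltrates=yes",): {"xray=abnormal": 0.9, "xray=normal": 0.1},
    ("infiltrates=no",):  {"xray=abnormal": 0.2, "xray=normal": 0.8},
}

def prob(cpd, x, parent_assignment):
    """Look up P(X = x | parents = parent_assignment)."""
    return cpd[parent_assignment][x]

p = prob(cpd_xray, "xray=abnormal", ("infiltrates=yes",))
```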
13
BN Learning
  • BN models can be learned from empirical data
  • parameter estimation via numerical optimization
  • structure learning via combinatorial search.

Slide Credit Lisa Getoor, UMD College Park
14
Generative Model
  • Mixture weights
  • Mixture components
  • Bayesian approach: use priors
  • Mixture weights ~ Dirichlet(α)
  • Mixture components ~ Dirichlet(β)
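The two Dirichlet priors can be sampled directly; this pure-Python sketch draws mixture weights from Dirichlet(α) and mixture components from Dirichlet(β) via normalized Gamma draws (the dimensions and hyperparameter values are illustrative):

```python
import random

rng = random.Random(0)

def dirichlet_sample(alphas):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

n_topics, vocab_size = 4, 10
alpha, beta = 0.5, 0.1  # illustrative symmetric hyperparameters

# Mixture weights: a document's distribution over topics
theta = dirichlet_sample([alpha] * n_topics)
# Mixture components: each topic's distribution over words
phi = [dirichlet_sample([beta] * vocab_size) for _ in range(n_topics)]
```

Small α and β concentrate the mass: documents use few topics, topics use few words.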
15
Bayesian Network for modeling document generation
16
Topic Model Plate Notation
17
Topic Model Geometric Representation
18
Modeling Authors with words
19
Author-Topic Model
20
Inference
  • Expectation Maximization
  • But poor results (local maxima)
  • Gibbs Sampling
  • Parameters φ, θ
  • Start with a random initial assignment
  • Update each parameter using the others
  • Converges after n iterations
  • Burn-in time
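The Gibbs sampling procedure can be sketched for a plain topic model as collapsed Gibbs sampling over token-topic assignments; the toy corpus, hyperparameters, and iteration count below are illustrative:

```python
import random
from collections import defaultdict

docs = [["bank", "loan", "money"], ["river", "stream", "bank"]]
K, alpha, beta = 2, 0.5, 0.1
V = len({w for d in docs for w in d})

rng = random.Random(0)
# Random initial topic assignment for every word token
z = [[rng.randrange(K) for _ in doc] for doc in docs]

# Count tables, updated incrementally during sampling
ndk = defaultdict(int)  # document-topic counts
nkw = defaultdict(int)  # topic-word counts
nk = defaultdict(int)   # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(100):  # iterations (the first ones are burn-in)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]  # remove this token's assignment from the counts
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # Full conditional p(z_i = k | all other assignments)
            weights = [(ndk[d, j] + alpha) * (nkw[j, w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k  # add the new assignment back
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```

φ and θ are then estimated from the final count tables rather than sampled directly, which is why this variant is called "collapsed".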

21
Inference and Learning for Documents
(figure: estimating the word-topic matrix Φmj and the document-topic matrix Θdj)
22
Matrix Factorization
23
Topic Model Inference
(figure: toy word-document count matrix over the words River, Loan, Money, Bank, Stream)
Can we recover the original topics and topic
mixtures from this data?
Slide Credit Padhraic Smyth, UC Irvine
24
Example of Gibbs Sampling
  • Assign word tokens randomly to topics (topic 1
    or topic 2)

(figure: random token-topic assignments over River, Loan, Money, Bank, Stream)
Slide Credit Padhraic Smyth, UC Irvine
25
After 1 iteration
  • Apply sampling equation to each word token

(figure: updated token-topic assignments over River, Loan, Money, Bank, Stream)
Slide Credit Padhraic Smyth, UC Irvine
26
After 4 iterations
(figure: token-topic assignments after 4 iterations over River, Loan, Money, Bank, Stream)
Slide Credit Padhraic Smyth, UC Irvine
27
After 32 iterations
(figure: recovered topic distributions over River, Loan, Money, Bank, Stream after 32 iterations)
Slide Credit Padhraic Smyth, UC Irvine
28
Results
  • Tested on scientific papers
  • NIPS dataset
  • V = 13,649; D = 1,740; K = 2,037
  • Topics: 100
  • Tokens: 2,301,375
  • CiteSeer dataset
  • V = 30,799; D = 162,489; K = 85,465
  • Topics: 300
  • Tokens: 11,685,514

29
Evaluating Predictive Power
  • Perplexity
  • Indicates ability to predict words on new unseen
    documents
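Perplexity is the exponential of the negative average log-likelihood of held-out words; this minimal sketch uses a toy word distribution, whereas a real evaluation would plug in the model's p(w | document):

```python
import math

# Hypothetical word probabilities and a toy held-out word sequence
model = {"bank": 0.4, "loan": 0.2, "money": 0.2, "river": 0.1, "stream": 0.1}
held_out = ["bank", "money", "river"]

# Perplexity = exp(-(1/N) * sum of log p(w))
log_prob = sum(math.log(model[w]) for w in held_out)
perplexity = math.exp(-log_prob / len(held_out))
# Lower perplexity means the model predicts unseen words better
```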

30
Results: Perplexity
31
Recap
  • First
  • Author Model
  • Topic Model
  • Then
  • Author-Topic Model
  • Next
  • Integrating Topics & Syntax

32
Integrating topics & syntax
  • Probabilistic Models
  • Short-range dependencies
  • Syntactic Constraints
  • Represented as distinct syntactic classes
  • HMM, Probabilistic CFGs
  • Long-range dependencies
  • Semantic Constraints
  • Represented as probabilistic distribution
  • Bayes Model, Topic Model
  • New Idea! Use both

33
How to integrate these?
  • Mixture of models
  • Each word exhibits either short- or long-range
    dependencies
  • Product of models
  • Each word exhibits both short- and long-range
    dependencies
  • Composite model
  • Asymmetric
  • All words exhibit short-range dependencies
  • A subset of words exhibits long-range dependencies

34
The Composite Model 1
  • Capturing asymmetry
  • Replace probability distribution over words with
    semantic model
  • Syntactic model chooses when to emit content word
  • Semantic model chooses which word to emit
  • Methods
  • Syntactic component is HMM
  • Semantic component is Topic model

35
(figure: sample phrases generated by the model, from fragments such as "network
used for images", "image obtained with kernel", "output described with objects",
"neural network trained with svm images")
36
The Composite Model 2 (Graphical)
37
The Composite Model 3
  • θ(d): document d's distribution over topics
  • Transitions between classes ci−1 and ci follow
    distribution π(ci−1)
  • A document is generated as follows
  • For each word wi in document d
  • Draw zi from θ(d)
  • Draw ci from π(ci−1)
  • If ci = 1, then draw wi from φ(zi),
  • else draw wi from φ(ci)
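These generative steps can be sketched directly; all the distributions below are illustrative toys, with class 1 reserved for the semantic (topic) component as in the slide:

```python
import random

rng = random.Random(0)

# Class transition table pi(c_{i-1}) over 3 classes (class 1 = semantic)
pi = {0: [0.3, 0.4, 0.3], 1: [0.5, 0.3, 0.2], 2: [0.4, 0.4, 0.2]}
theta_d = [0.7, 0.3]  # document's distribution over 2 topics
phi_topic = {0: {"bank": 0.6, "loan": 0.4},      # semantic (topic) model
             1: {"river": 0.5, "stream": 0.5}}
phi_class = {0: {"the": 0.6, "a": 0.4},          # syntactic classes
             2: {"in": 0.5, "of": 0.5}}

def pick(dist):
    return rng.choices(list(dist), weights=dist.values())[0]

words, c = [], 0
for _ in range(8):
    z = rng.choices([0, 1], weights=theta_d)[0]   # draw z_i from theta(d)
    c = rng.choices([0, 1, 2], weights=pi[c])[0]  # draw c_i from pi(c_{i-1})
    # If c_i == 1 emit a content word from phi(z_i), else from phi(c_i)
    words.append(pick(phi_topic[z]) if c == 1 else pick(phi_class[c]))
```

The HMM decides *when* a content word appears; the topic model decides *which* content word it is.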

38
Results
  • Tested on
  • Brown corpus (tagged with word types)
  • Concatenated Brown + TASA corpus
  • HMM + Topic Model
  • 20 classes (1 start/end-marker class + 19 word
    classes)
  • T = 200 topics

39
Results
  • Identifying syntactic classes & semantic topics
  • Clean separation observed
  • Identifying function words & content words
  • e.g. "control": plain verb (syntax) or content
    word (semantics)
  • Part-of-speech tagging
  • Identifying syntactic class
  • Document classification
  • Brown corpus: 500 docs → 15 groups
  • Results similar to plain topic model

40
Extensions to Topic Model
  • Integrating link information (Cohn, Hofmann 2001)
  • Learning Topic Hierarchies
  • Integrating Syntax & Topics
  • Integrate authorship info with content
    (author-topic model)
  • Grade-of-membership Models
  • Random sentence generation

41
Conclusion
  • Identifying a document's latent structure
  • Document content is modeled for
  • Semantic associations: topic model
  • Authorship: author-topic model
  • Syntactic constructs: HMM

42
Acknowledgements
  • Prof. Rajeev Motwani
  • Advice and guidance regarding topic selection
  • T. K. Satish Kumar
  • Help on Probabilistic Models

43
  • Thank you!

44
References
  • Primary
  • Steyvers, M., Smyth, P., Rosen-Zvi, M.,
    Griffiths, T. (2004). Probabilistic Author-Topic
    Models for Information Discovery. The Tenth ACM
    SIGKDD International Conference on Knowledge
    Discovery and Data Mining. Seattle, Washington.
  • Steyvers, M. Griffiths, T. Probabilistic topic
    models. (http//psiexp.ss.uci.edu/research/papers/
    SteyversGriffithsLSABookFormatted.pdf)
  • Rosen-Zvi, M., Griffiths T., Steyvers, M.,
    Smyth, P. (2004). The Author-Topic Model for
    Authors and Documents. In 20th Conference on
    Uncertainty in Artificial Intelligence. Banff,
    Canada
  • Griffiths, T.L., Steyvers, M.,  Blei, D.M.,
    Tenenbaum, J.B. (in press). Integrating Topics
    and Syntax. In Advances in Neural Information
    Processing Systems, 17.
  • Griffiths, T., Steyvers, M. (2004). Finding
    Scientific Topics. Proceedings of the National
    Academy of Sciences, 101 (suppl. 1), 5228-5235. 