Title: Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations
1. Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations
- Wei Li
- Andrew McCallum
- Computer Science Department
- University of Massachusetts Amherst
With thanks to David Blei, Yee Whye Teh, and Sam Roweis for helpful discussions, and thanks to Michael Jordan for help in naming the model.
2. Statistical Topic Models
- Discover a low-dimensional set of topics that summarize concepts in text collections
- Also applicable to non-textual data, e.g., images and biological findings
3. Latent Dirichlet Allocation
Blei, Ng, Jordan, 2003
[Graphical model: Dirichlet prior α → per-document topic distribution θ → topic assignment z → word w; plate over the N words in a document; φ, the per-topic multinomial over words, with plate over T topics.]
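To make the generative story concrete, here is a minimal sketch of LDA's generative process in Python (NumPy; the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def generate_lda_document(alpha, phi, n_words, rng=np.random.default_rng(0)):
    """Generate one document under LDA.
    alpha: Dirichlet prior over topics, shape (T,)
    phi:   per-topic multinomials over words, shape (T, V)
    """
    theta = rng.dirichlet(alpha)                # per-document topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)     # sample a topic for this word
        w = rng.choice(phi.shape[1], p=phi[z])  # sample the word from topic z
        words.append(w)
    return words
```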
4. Correlated Topic Model
Blei, Lafferty, 2005
[Graphical model: like LDA, but the per-document topic proportions are drawn from a logistic normal with mean μ and covariance Σ instead of a Dirichlet; plates over N words and T topics.]
The covariance Σ is a square matrix of pairwise topic correlations.
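A small sketch of the logistic-normal draw that replaces LDA's Dirichlet in CTM (illustrative names; NumPy):

```python
import numpy as np

def logistic_normal_proportions(mu, sigma, rng=np.random.default_rng(0)):
    """Draw topic proportions from a logistic normal.
    mu:    mean vector, shape (T,)
    sigma: T x T covariance; its off-diagonal entries carry the
           pairwise topic correlations (the square matrix above).
    """
    eta = rng.multivariate_normal(mu, sigma)  # correlated Gaussian draw
    eta -= eta.max()                          # stabilize the softmax
    theta = np.exp(eta)
    return theta / theta.sum()                # map onto the simplex
```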
5. Topic Correlation Representation
7 topics: A, B, C, D, E, F, G. Correlations among {A, B, C, D, E} and among {C, D, E, F, G}.
[Diagram: how CTM represents these two correlation groups with pairwise edges among the topics A through G.]
6. Pachinko Machine
7. Pachinko Allocation Model (PAM)
Thanks to Michael Jordan for suggesting the name
Li, McCallum, 2006
Model structure: a directed acyclic graph (DAG), with a Dirichlet distribution at each interior node over its children, and words at the leaves. (This is the model structure, not the graphical model.)

For each document: sample a multinomial from each node's Dirichlet.

For each word in the document: starting from the root, sample a child from successive nodes down to a leaf, then generate the word at that leaf.

[DAG figure: root θ11; interior nodes θ21, θ22 and θ31, θ32, θ33; leaves θ41 through θ45 over word1 through word8.]

Like a Pólya tree, but DAG-shaped, with an arbitrary number of children per node.
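A minimal sketch of this per-word walk down the DAG, assuming each document has already drawn a multinomial theta[i] at every interior node i (the data structures and names here are illustrative, not from the paper):

```python
import numpy as np

def sample_word(children, theta, leaf_word_dists, root,
                rng=np.random.default_rng(0)):
    """Walk the DAG from the root to a leaf, then emit a word.
    children:        dict: interior node id -> list of child node ids
    theta:           dict: interior node id -> this document's multinomial
                     over that node's children
    leaf_word_dists: dict: leaf node id -> multinomial over the vocabulary
    """
    node = root
    path = [node]
    while node in children:                    # interior node: keep descending
        node = rng.choice(children[node], p=theta[node])
        path.append(node)
    word = rng.choice(len(leaf_word_dists[node]), p=leaf_word_dists[node])
    return word, path                          # the word and its topic path
```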
8. Pachinko Allocation Model
Li, McCallum, 2006
- DAG may have arbitrary structure:
  - arbitrary depth
  - any number of children per node
  - sparse connectivity
  - edges may skip layers
(Same DAG figure as before: model structure, not the graphical model.)
9. Pachinko Allocation Model
Li, McCallum, 2006
(Same DAG figure: model structure, not the graphical model.) Annotations by level:
- Upper interior nodes: distributions over distributions over topics...
- Lower interior nodes: distributions over topic mixtures, representing topic correlations
- Leaves: distributions over words (like LDA topics)
Some interior nodes could contain a single multinomial used for all documents (i.e., a very peaked Dirichlet).
10. Pachinko Allocation Model
Li, McCallum, 2006
Estimate all these Dirichlets from data. Estimate the model structure from data (number of nodes and connectivity).
(Same DAG figure: model structure, not the graphical model.)
11. Related Models
- Latent Dirichlet Allocation
- Correlated Topic Model
- Hierarchical LDA
- Hierarchical Dirichlet Processes
12. Pachinko Allocation Special Cases
Latent Dirichlet Allocation
[DAG figure: a single root θ11 over topics θ21 through θ25, each a multinomial over word1 through word8; this three-level, fully-connected structure is exactly LDA.]
13. Hierarchical LDA
[Tree figure: a topic hierarchy over CS, AI, NLP, and Robotics; each document is generated from a single root-to-leaf path, e.g. CS → AI → NLP or CS → AI → Robotics.]
14. Pachinko Allocation Special Cases
Hierarchical Latent Dirichlet Allocation (HLDA)
HLDA corresponds to a PAM with a very low-variance Dirichlet at the root. Each leaf of the HLDA topic hierarchy has a distribution over the nodes on its path to the root.
[DAG figure: the HLDA hierarchy (root θ11; nodes θ21 through θ24 and θ31 through θ34; leaves θ41, θ42, θ51) over word1 through word8.]
15. Pachinko Allocation on a Topic Hierarchy
Combining the best of HLDA and Pachinko Allocation.
[Figure: a PAM DAG (root θ00 with children θ11, θ12) sits on top of the HLDA hierarchy, representing correlations among the topic leaves.]
16. Correlated Topic Model
- CTM captures pairwise correlations.
- The number of parameters in CTM grows as the
square of the number of topics.
17. Hierarchical Dirichlet Processes
- HDP can be used to automatically determine the number of topics.
- HDP captures topic correlations only when the data is pre-organized into nested groups.
- HDP does not learn the topic hierarchy.
18. PAM - Notation
- V = {x1, ..., xv}: word vocabulary
- T = {t1, ..., ts}: topics
- r: the root
- gi(αi): Dirichlet distribution associated with topic ti
19. PAM - Generative Process
To generate a document:
- For each topic ti, sample a multinomial distribution θi from gi(αi).
- For each word w in the document:
  - Sample a topic path zw based on the multinomials, starting from the root.
  - Sample the word from the last topic on the path.
20. PAM - Likelihood
- Joint probability of d, z(d), and θ(d)
- Marginal probability of d
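The equations on this slide did not survive extraction; the following LaTeX sketch reconstructs them from the generative process above, writing z_w = ⟨z_w1, ..., z_wL⟩ for a word's root-to-leaf path (z_w1 is the root). Consult the PAM paper for the exact form:

```latex
% Joint probability of document d, its topic paths z^{(d)},
% and its sampled multinomials \theta^{(d)}:
P(d, z^{(d)}, \theta^{(d)} \mid \alpha) =
  \prod_{i} g_i\!\left(\theta_i^{(d)}\right)
  \prod_{w \in d}
    \Big( \prod_{k=2}^{L} P\!\left(z_{wk} \mid \theta_{z_{w(k-1)}}^{(d)}\right) \Big)
    P\!\left(w \mid \theta_{z_{wL}}^{(d)}\right)

% Marginal probability of d: integrate out \theta^{(d)} and sum over paths.
P(d \mid \alpha) =
  \int \prod_{i} g_i\!\left(\theta_i^{(d)}\right)
  \prod_{w \in d} \sum_{z_w}
    \Big( \prod_{k=2}^{L} P\!\left(z_{wk} \mid \theta_{z_{w(k-1)}}^{(d)}\right) \Big)
    P\!\left(w \mid \theta_{z_{wL}}^{(d)}\right) d\theta^{(d)}
```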
21. Four-level PAM
...with two topic layers (super-topics and sub-topics), no layer skipping, and full connectivity from one layer to the next.
[Figure: root θ11 → super-topics θ21 through θ23 → sub-topics θ31 through θ35 → word1 through word8; the sub-topics' multinomials over words are fixed across documents.]
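A minimal sketch of the four-level generative process under this structure (illustrative names; the fixed sub-topic multinomials phi play the role of LDA topics):

```python
import numpy as np

def generate_four_level_pam_doc(alpha_root, alpha_super, phi, n_words,
                                rng=np.random.default_rng(0)):
    """Generate one document under four-level PAM.
    alpha_root:  Dirichlet prior over S super-topics, shape (S,)
    alpha_super: per-super-topic Dirichlet priors over K sub-topics, shape (S, K)
    phi:         fixed sub-topic multinomials over words, shape (K, V)
    """
    theta_root = rng.dirichlet(alpha_root)                # over super-topics
    theta_super = np.stack([rng.dirichlet(a) for a in alpha_super])
    words = []
    for _ in range(n_words):
        z2 = rng.choice(len(alpha_root), p=theta_root)    # super-topic
        z3 = rng.choice(phi.shape[0], p=theta_super[z2])  # sub-topic
        words.append(rng.choice(phi.shape[1], p=phi[z3])) # word from fixed phi
    return words
```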
22. Graphical Models
Four-level PAM (with fixed multinomials for
sub-topics)
LDA
[Plate diagrams side by side. Four-level PAM: priors α1, α2 → per-document distributions θ2, θ3 → topic assignments z2, z3 → word w, with fixed multinomials φ over T sub-topics. LDA: prior α → θ → z → w, with φ over T topics. Both with plates over N documents and the n words per document.]
23. Inference: Gibbs Sampling
[Plate diagram for four-level PAM; the topic assignments z2 and z3 for each word are jointly sampled.]
Dirichlet parameters α are estimated with moment matching.
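A minimal sketch of moment matching for a Dirichlet, assuming the standard estimator that matches the sample mean and variance of observed proportions (variable names are illustrative; in PAM the rows would be sampled topic proportions from the Gibbs state):

```python
import numpy as np

def dirichlet_moment_match(props, eps=1e-12):
    """Moment-matching estimate of Dirichlet parameters.
    props: shape (n_samples, K); each row is an observed multinomial,
           e.g. a document's topic proportions from a Gibbs sample.
    """
    m = props.mean(axis=0)                      # component-wise sample means
    v = props.var(axis=0) + eps                 # component-wise sample variances
    # Each component yields an estimate of the Dirichlet precision s;
    # combine the per-component estimates with a geometric mean.
    s_k = np.clip(m * (1.0 - m) / v - 1.0, eps, None)
    s = np.exp(np.log(s_k).mean())
    return s * m                                # alpha_k = s * m_k
```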
24. Experimental Results
- Topic clarity by human judgement
- Likelihood on held-out data
- Document classification
25. Datasets
- Rexa (http://rexa.info/)
  - 4,000 documents, 278,438 word tokens, 25,597 unique words
- NIPS
  - 1,647 documents, 114,142 word tokens, 11,708 unique words
- 20 newsgroups comp5 subset
  - 4,836 documents, 35,567 unique words
26. Example Topics
Topic labels from the slide: images, motion, eyes (some topics are generic).

LDA (100 topics): motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real

PAM (100 topics): motion video surface surfaces figure scene camera noisy sequence activation generated analytical pixels measurements assigne advance lated shown closed perceptual

LDA (20 topics): visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location

PAM (100 topics): eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ning

PAM (100 topics): image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts tor
27. Blind Topic Evaluation
- Randomly select 25 similar pairs of topics generated from PAM and LDA.
- 5 people, each asked to select the topic in each pair that they find more semantically coherent.
[Chart: vote counts per topic.]
28. Examples
[Example topic pairs with vote tallies: 5 votes vs. 0 votes; 4 votes vs. 1 vote.]
29. Examples
[Example topic pairs with vote tallies: 4 votes vs. 1 vote; 1 vote vs. 4 votes.]
30. Topic Correlations
31. Likelihood Comparison
- Dataset: NIPS
- Two experiments
- Varying number of topics
- Different proportions of training data
32. Likelihood Comparison
33. Likelihood Comparison
- Different proportions of training data
34. Likelihood Estimation
- Variational: perform inference in a simpler model.
- (Gibbs sampling) Harmonic mean: approximate the marginal probability with the harmonic mean of conditional probabilities.
- (Gibbs sampling) Empirical likelihood: estimate the distribution based on empirical samples.
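A minimal sketch of the harmonic-mean estimator computed from Gibbs samples (a simple but notoriously high-variance estimator; names are illustrative):

```python
import numpy as np

def harmonic_mean_log_likelihood(log_cond_probs):
    """Harmonic-mean estimate of a document's marginal log-likelihood.
    log_cond_probs: log P(d | z_s) for each of S Gibbs samples z_s.
    Returns log( S / sum_s 1/P(d | z_s) ), computed stably in log space.
    """
    lp = np.asarray(log_cond_probs)
    neg = -lp                                    # work with -log p_s
    m = neg.max()
    log_sum = m + np.log(np.exp(neg - m).sum())  # logsumexp(-log p_s)
    return np.log(len(lp)) - log_sum
```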
35. Empirical Likelihood Estimation
36. Document Classification
- 20 newsgroups comp5 subset
- 5-way classification (accuracy in %)
- Statistically significant with a p-value
37. Conclusion and Future Work
- Pachinko Allocation provides the flexibility to capture arbitrary nested mixtures of topic correlations.
- More applications
- More advanced DAG structures
- Nonparametric PAM with nested HDP
38. Non-parametric PAM
[Plate diagram: a nested HDP-style construction of PAM, with base measure H, stick-breaking weights (β0, β1i), concentration parameters, infinite plates over topics, and per-word topic assignments z2, z3 for the N words of each document d.]