Title: Topic Models for Groups, Correlations, Trends, and Phrases
1 Topic Models for Groups, Correlations, Trends, and Phrases
- Andrew McCallum
- Computer Science Department
- University of Massachusetts Amherst
Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.
2 Social Network in an Email Dataset
3 ART: Roles but not Groups
[Figure: Enron TransWestern Division. Panels: Traditional SNA, Author-Topic, ART. Panel annotations: Not, Not, Block structured.]
4 Groups and Topics
- Input
- Observed relations between people
- Attributes on those relations (text, or categorical)
- Output
- Attributes clustered into topics
- Groups of people, varying depending on topic
5 Discovering Groups from Observed Set of Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Admiration relations among six high school students.
6 Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
[Figure: three 6x6 adjacency matrices over the students. One is ordered A B C D E F; one is ordered A B C D E F with group labels G1 G2 G1 G2 G3 G3; one is reordered A C B D E F with group labels G1 G1 G2 G2 G3 G3.]
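As a minimal sketch of the idea on this slide, the relations listed above can be turned into a 0/1 adjacency matrix and then reordered by group so that the block structure becomes visible. The function names `adjacency` and `show` are illustrative choices, not from the talk; the group ordering A C B D E F is the one given on the slide.

```python
# Academic admiration relations from the slide: Acad(X, Y) means X admires Y.
ACAD = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]


def adjacency(relations, order):
    """Return a 0/1 matrix whose rows and columns follow `order`."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for source, target in relations:
        matrix[index[source]][index[target]] = 1
    return matrix


def show(matrix, order):
    """Print the matrix with row and column labels."""
    print("  " + " ".join(order))
    for name, row in zip(order, matrix):
        print(name + " " + " ".join(str(value) for value in row))


alphabetical = list("ABCDEF")
by_group = list("ACBDEF")  # G1 = {A, C}, G2 = {B, D}, G3 = {E, F}

show(adjacency(ACAD, alphabetical), alphabetical)
print()
show(adjacency(ACAD, by_group), by_group)
```

In the reordered matrix the ones fall into three 2x2 blocks: G1 admires G2, G2 admires G3, and G3 admires G1.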
7 Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations (Nowicki, Snijders 2001)
[Figure: graphical model with distribution labels Beta, Dirichlet, Multinomial and Binomial.]
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in Kemp, Griffiths, Tenenbaum 2004
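A rough generative sketch of a stochastic blockmodel, using the four distributions named on this slide. The wiring is an assumption made for illustration: group proportions come from a Dirichlet, each entity's group from a Multinomial, one relation probability per ordered group pair from a Beta, and each relation from a single-trial Binomial. The hyperparameter values and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one proportion vector from Dirichlet(alpha) via normalized Gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_blockmodel(num_entities, num_groups, rng,
                      dirichlet_alpha=1.0, beta_a=0.5, beta_b=0.5):
    """Sample group assignments and a directed 0/1 relation matrix."""
    # Group proportions ~ Dirichlet (symmetric; the value is an assumption).
    proportions = sample_dirichlet([dirichlet_alpha] * num_groups, rng)
    # Each entity's group ~ Multinomial(proportions).
    groups = rng.choices(range(num_groups), weights=proportions, k=num_entities)
    # One relation probability per ordered pair of groups ~ Beta.
    link_prob = [[rng.betavariate(beta_a, beta_b) for _ in range(num_groups)]
                 for _ in range(num_groups)]
    # Each relation ~ Binomial with one trial, using the group-pair probability.
    relations = [[0] * num_entities for _ in range(num_entities)]
    for i in range(num_entities):
        for j in range(num_entities):
            if i != j and rng.random() < link_prob[groups[i]][groups[j]]:
                relations[i][j] = 1
    return groups, link_prob, relations


rng = random.Random(0)
groups, link_prob, relations = sample_blockmodel(6, 3, rng)
print(groups)
for row in relations:
    print(row)
```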
8 Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Social Admiration:
Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)
[Figure: two 6x6 adjacency matrices. One is ordered A C B D E F with group labels G1 G1 G2 G2 G3 G3; one is ordered A C E B D F with group labels G1 G1 G1 G2 G2 G2.]
9 The Group-Topic Model: Discovering Groups and Topics Simultaneously
Wang, Mohanty, McCallum 2006
[Figure: graphical model with distribution labels Beta, Uniform, Dirichlet, Multinomial, Dirichlet, Binomial, Multinomial.]
10 Inference and Estimation
- Gibbs Sampling
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
11 Dataset 1: U.S. Senate
- 16 years of voting records in the US Senate (1989-2005)
- a Senator may respond Yea or Nay to a resolution
- 3423 resolutions with text attributes (index terms)
- 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. MI (introduced 3/5/1991)
Cosponsors: (2)
Latest Major Action: 12/19/1991 Became Public Law No 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms
Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay
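One way to read the symmetric-relationship assumption against this dataset is to treat two senators as related on a resolution when they cast the same vote. The sketch below builds such pairwise agreement relations from the seven votes shown above. Treating agreement as the relation is an assumption made for illustration, and `agreement_relations` is a hypothetical helper name.

```python
from itertools import combinations

# Votes on S.543 as listed on the slide.
VOTES = {
    "Adams": "Nay", "Akaka": "Yea", "Bentsen": "Yea", "Biden": "Yea",
    "Bond": "Yea", "Bradley": "Nay", "Conrad": "Nay",
}


def agreement_relations(votes):
    """Map each unordered pair of voters to 1 if they voted alike, else 0."""
    return {
        frozenset((a, b)): int(votes[a] == votes[b])
        for a, b in combinations(sorted(votes), 2)
    }


relations = agreement_relations(VOTES)
for pair, agree in sorted(relations.items(), key=lambda item: sorted(item[0])):
    print(" - ".join(sorted(pair)), agree)
```

Because each pair is stored as a `frozenset`, the relation is symmetric by construction: there is no separate entry for (a, b) and (b, a).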
12 Topics Discovered (U.S. Senate)
Mixture of Unigrams
Education    Energy     Military Misc.  Economic
education    energy     government      federal
school       power      military        labor
aid          water      foreign         insurance
children     nuclear    tax             aid
drug         gas        congress        tax
students     petrol     aid             business
elementary   research   law             employee
prevention   pollution  policy          care
Group-Topic Model
Education Domestic  Foreign       Economic   Social Security Medicare
education           foreign       labor      social
school              trade         insurance  security
federal             chemicals     tax        insurance
aid                 tariff        congress   medical
government          congress      income     care
tax                 drugs         minimum    medicare
energy              communicable  wage       disability
research            diseases      business   assistance
13 Groups Discovered (US Senate)
Groups from topic Education Domestic
14 Senators Who Change Coalition the Most Dependent on Topic
e.g. Senator Shelby (D-AL) votes
- with the Republicans on Economic
- with the Democrats on Education Domestic
- with a small group of maverick Republicans on Social Security Medicaid
15 Dataset 2: The UN General Assembly
- Voting records of the UN General Assembly (1990-2003)
- A country may choose to vote Yes, No or Abstain
- 931 resolutions with text attributes (titles)
- 192 countries in total
- Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions.
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
16 Topics Discovered (UN)
Mixture of Unigrams
Everything Nuclear  Human Rights  Security in Middle East
nuclear             rights        occupied
weapons             human         israel
use                 palestine     syria
implementation      situation     security
countries           israel        calls
Group-Topic Model
Nuclear Non-proliferation  Nuclear Arms Race  Human Rights
nuclear                    nuclear            rights
states                     arms               human
united                     prevention         palestine
weapons                    race               occupied
nations                    space              israel
17 Groups Discovered (UN)
The country list for each group is ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
18 Groups and Topics, Trends over Time (UN)
19 Outline
Social Network Analysis with Topic Models
- Role Discovery (Author-Recipient-Topic Model, ART)
- Group Discovery (Group-Topic Model, GT)
- Enhanced Topic Models
- Correlations among Topics (Pachinko Allocation, PAM)
- Time Localized Topics (Topics-over-Time Model, TOT)
- Markov Dependencies in Topics (Topical N-Grams Model, TNG)
- Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures
20 Latent Dirichlet Allocation
Blei, Ng, Jordan, 2003
[Figure: LDA plate diagram with variables α, θ, z, w, β, φ and plates N, n, T.]
21 Correlated Topic Model
Blei, Lafferty, 2005
[Figure: Correlated Topic Model plate diagram with a logistic normal in place of the Dirichlet; variables z, w and plates N, n, T.]
Square matrix of pairwise correlations.
22 Pachinko Machine
23 Pachinko Allocation Model
Thanks to Michael Jordan for suggesting the name
Li, McCallum, 2005, 2006
Given: a directed acyclic graph (DAG), with a Dirichlet over its children at each interior node, and words at the leaves.
For each document: sample a multinomial from each Dirichlet.
For each word in this document: starting from the root, sample a child from successive nodes, down to a leaf. Generate the word at the leaf.
Like a Polya tree, but DAG shaped, with arbitrary number of children.
[Figure: model structure, not the graphical model. A DAG with layers of 1, 2, 3 and 5 interior nodes above the leaves word1 to word8.]
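The generative story on this slide can be sketched in a few lines. The toy DAG below is hypothetical (it is not the one in the figure), and a symmetric Dirichlet with concentration 1.0 at every interior node is an assumption made only to keep the example short.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one proportion vector from Dirichlet(alpha) via normalized Gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical toy DAG: each interior node maps to its children.
# Nodes that do not appear as keys are leaves (words).
DAG = {
    "root": ["s1", "s2"],
    "s1": ["t1", "t2"],
    "s2": ["t2", "t3"],          # t2 has two parents, so this is not a tree
    "t1": ["word1", "word2", "word3"],
    "t2": ["word3", "word4", "word5"],
    "t3": ["word6", "word7", "word8"],
}


def sample_document(dag, root, num_words, rng, concentration=1.0):
    """Sample one document: words plus the DAG path that produced each word."""
    # For each document: sample a multinomial from each node's Dirichlet.
    multinomials = {
        node: sample_dirichlet([concentration] * len(children), rng)
        for node, children in dag.items()
    }
    words, paths = [], []
    for _ in range(num_words):
        # For each word: walk from the root down to a leaf.
        node, path = root, [root]
        while node in dag:
            node = rng.choices(dag[node], weights=multinomials[node], k=1)[0]
            path.append(node)
        words.append(node)
        paths.append(path)
    return words, paths


rng = random.Random(1)
words, paths = sample_document(DAG, "root", 10, rng)
for word, path in zip(words, paths):
    print(" -> ".join(path))
```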
24 Pachinko Allocation Model
Li, McCallum, 2005
- DAG may have arbitrary structure
- arbitrary depth
- any number of children per node
- sparse connectivity
- edges may skip layers
[Figure: model structure, not the graphical model. A DAG with layers of 1, 2, 3 and 5 interior nodes above the leaves word1 to word8.]
25 Pachinko Allocation Model
Li, McCallum, 2005
[Figure: model structure, not the graphical model. A DAG with layers of 1, 2, 3 and 5 interior nodes above the leaves word1 to word8. Annotations, from upper to lower layers:]
- Distributions over distributions over topics...
- Distributions over topic mixtures, representing topic correlations
- Distributions over words (like LDA topics)
Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).
26 Pachinko Allocation Model
Li, McCallum, 2005
Estimate all these Dirichlets from data.
Estimate model structure from data (number of nodes, and connectivity).
[Figure: model structure, not the graphical model. A DAG with layers of 1, 2, 3 and 5 interior nodes above the leaves word1 to word8.]
27 Pachinko Allocation Special Cases
Latent Dirichlet Allocation
[Figure: a single interior node above a layer of five nodes, above the leaves word1 to word8.]
28 Pachinko Allocation Special Cases
Hierarchical Latent Dirichlet Allocation (HLDA)
Very low variance Dirichlet at root.
Each leaf of the HLDA topic hierarchy has a distribution over nodes on the path to the root.
[Figure: a DAG with layers of 1, 4, 4, 2 and 1 interior nodes above the leaves word1 to word8; part of it is labeled "The HLDA hier."]
29 Pachinko Allocation on a Topic Hierarchy
Combining best of HLDA and Pachinko Allocation
[Figure: a DAG with layers of 1, 2, 4, 4, 2 and 1 interior nodes above the leaves word1 to word8. Labels: "The PAM DAG ... representing correlations among topic leaves" and "The HLDA hier."]
30 Pachinko Allocation Model
... with two layers, no skipping layers, fully-connected from one layer to the next.
[Figure: a root above three super-topics, above five sub-topics (fixed multinomials), above the leaves word1 to word8.]
Another special case would select only one super-topic per document.
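31 Graphical Models
[Figure: two plate diagrams, LDA and PAM (with fixed multinomials for topics). The LDA diagram has a single assignment z per word; the PAM diagram has assignments z1, z2, zm per word. Both use plates N, n and T.]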
32 Pachinko Allocation Model
- Likelihood
- Estimate the z's by Gibbs sampling
- Estimate the Dirichlet parameters by moment matching.
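For readers unfamiliar with moment matching, here is a generic sketch of estimating Dirichlet parameters from observed proportion vectors by matching the sample mean and variance. It is not claimed to be the exact estimator used for PAM; it relies only on the standard Dirichlet identity var_k = mean_k (1 - mean_k) / (alpha_sum + 1). The demo parameters are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one proportion vector from Dirichlet(alpha) via normalized Gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def moment_match_dirichlet(samples):
    """Estimate Dirichlet parameters from proportion vectors by matching moments."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(dims)]
    variances = [sum((s[k] - means[k]) ** 2 for s in samples) / n
                 for k in range(dims)]
    # Each component with nonzero variance gives one estimate of alpha_sum:
    # alpha_sum = mean_k (1 - mean_k) / var_k - 1. Average those estimates.
    precisions = [means[k] * (1.0 - means[k]) / variances[k] - 1.0
                  for k in range(dims) if variances[k] > 0.0]
    if not precisions:
        raise ValueError("samples have zero variance in every component")
    alpha_sum = sum(precisions) / len(precisions)
    return [m * alpha_sum for m in means]


rng = random.Random(0)
true_alpha = [2.0, 3.0, 5.0]  # hypothetical parameters for the demo
samples = [sample_dirichlet(true_alpha, rng) for _ in range(5000)]
estimate = moment_match_dirichlet(samples)
print([round(a, 2) for a in estimate])
```

Moment matching needs no iterative optimization, which keeps it cheap enough to run between Gibbs sweeps.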
33 Preliminary Experimental Results
- Topic Coherence
- Likelihood on held-out data
- Document classification
34 NIPS Dataset
NIPS Conference Papers, Volumes 0-12, spanning 1987-1999. Prepared by Sam Roweis.
- 1740 papers
- 13649 words
- 2,301,375 tokens
35 Topic Coherence Comparison
models, estimation, stopwords
estimation, some junk
LDA 100: estimation likelihood maximum noisy estimates mixture scene surface normalization generated measurements surfaces estimating estimated iterative combined figure divisive sequence ideal
LDA 20: models model parameters distribution bayesian probability estimation data gaussian methods likelihood em mixture show approach paper density framework approximation markov
Example super-topic:
33  input hidden units function number
27  estimation bayesian parameters data methods
24  distribution gaussian markov likelihood mixture
11  exact kalman full conditional deterministic
1   smoothing predictive regularizers intermediate slope
36 Topic Coherence Comparison
images, motion eyes
motion, some junk
motion
eyes
images
LDA 100: motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
PAM 100: motion video surface surfaces figure scene camera noisy sequence activation generated analytical pixels measurements assigne advance lated shown closed perceptual
LDA 20: visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
PAM 100: eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ning
PAM 100: image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts tor
37 Topic Coherence Comparison
neural networks, much less junk
neural networks, some junk
neural networks, some junk
PAM 100: input hidden units function number functions networks output linear layer single results weight inputs basis parameters standard network patterns study
LDA 100: network layer multi trained high perceptron layers give type nonlinearity perceptrons module modified matched performed provided designed samples study mode
LDA 20: architecture network input output structure paper level task work sequences sequence multiple problem shows connectionist networks context perform scale learn
38 Blind Topic Evaluation
- Randomly select 25 similar pairs of topics generated from PAM and LDA
- 5 people
- Each asked to "select the topic in each pair that you find most semantically coherent."
Prefer PAM
Topic counts
             LDA  PAM
5 votes        0    5
>= 4 votes     3    8
>= 3 votes     9   16
39 Example Topic Pairs with Human Evaluation
40 Topic Correlations in PAM
5000 research paper abstracts, from across all CS.
Numbers on edges are the super-topics' Dirichlet parameters.
41 Likelihood on Held Out Data
- Likelihood comparison
- NIPS abstracts
- Train the model with 75% of the data
- Calculate likelihood on the remaining 25%
- Calculate likelihood by
- Sampling many, many documents from the model
- Estimating a simple mixture of multinomials from these
- Calculating the likelihood of data under this simple mixture.
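A minimal sketch of the last two steps, under simplifying assumptions that are not from the talk: each sampled document becomes one mixture component (a multinomial with add-one smoothing), components are weighted uniformly, and a held-out document is scored as a token sequence, so the multinomial coefficient is omitted. The function names are hypothetical.

```python
import math
from collections import Counter


def smoothed_multinomial(tokens, vocabulary, pseudo_count=1.0):
    """Estimate a word distribution from one sampled document."""
    counts = Counter(tokens)
    total = len(tokens) + pseudo_count * len(vocabulary)
    return {word: (counts[word] + pseudo_count) / total for word in vocabulary}


def log_likelihood(tokens, components, weights=None):
    """Log-probability of a token sequence under a mixture of multinomials."""
    if weights is None:
        weights = [1.0 / len(components)] * len(components)
    log_terms = [
        math.log(weight) + sum(math.log(component[t]) for t in tokens)
        for weight, component in zip(weights, components)
    ]
    # Log-sum-exp keeps the computation stable for long documents.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(term - peak) for term in log_terms))


vocabulary = ["a", "b"]
sampled_documents = [["a", "a"], ["b", "b"]]  # stand-ins for model samples
components = [smoothed_multinomial(doc, vocabulary) for doc in sampled_documents]
print(math.exp(log_likelihood(["a", "b"], components)))
```

In practice the sampled documents would come from the trained topic model, and the held-out score would be summed over all test documents.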
42 Likelihood Comparison
43 Document Classification
Comp5 from 20 Newsgroups corpus. Train on 25%, test on 75%.
Like Naive Bayes, but use LDA/PAM per-class instead of multinomial.
[Figure: Test Accuracy (%); annotation: 2.5% increase.]
44 Outline
Social Network Analysis with Topic Models
- Role Discovery (Author-Recipient-Topic Model, ART)
- Group Discovery (Group-Topic Model, GT)
- Enhanced Topic Models
- Correlations among Topics (Pachinko Allocation, PAM)
- Time Localized Topics (Topics-over-Time Model, TOT)
- Markov Dependencies in Topics (Topical N-Grams Model, TNG)
- Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures
45 Want to Model Trends over Time
- Is prevalence of topic growing or waning?
- Pattern appears only briefly
- Capture its statistics in focused way
- Don't confuse it with patterns elsewhere in time
- How do roles, groups, influence shift over time?
46 Topics over Time (TOT)
Wang, McCallum 2006
[Figure: TOT plate diagram. Labels: Dirichlet; multinomial over topics; Uniform prior; Dirichlet prior; topic index z; word w; timestamp t; Beta over time; Multinomial over words; plates Nd, D and T.]
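A rough generative sketch based on the labels in the diagram: a per-document multinomial over topics drawn from a Dirichlet, and for each token a topic index z, a word w from that topic's multinomial over words, and a timestamp t from that topic's Beta over time. The two toy topics, their Beta parameters, and the choice to draw a timestamp per token on a normalized [0, 1] time axis are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one proportion vector from Dirichlet(alpha) via normalized Gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical topics: word distributions and Beta(a, b) shapes over time.
TOPIC_WORD = [
    {"tariff": 0.5, "treasury": 0.5},   # topic 0
    {"soviet": 0.5, "defense": 0.5},    # topic 1
]
TOPIC_TIME = [(2.0, 8.0), (8.0, 2.0)]   # topic 0 peaks early, topic 1 late


def sample_tot_document(num_tokens, topic_word, topic_time, rng, alpha=1.0):
    """Return a list of (topic index, word, timestamp) triples."""
    num_topics = len(topic_word)
    theta = sample_dirichlet([alpha] * num_topics, rng)  # multinomial over topics
    tokens = []
    for _ in range(num_tokens):
        z = rng.choices(range(num_topics), weights=theta, k=1)[0]
        vocab = list(topic_word[z])
        w = rng.choices(vocab, weights=[topic_word[z][v] for v in vocab], k=1)[0]
        t = rng.betavariate(*topic_time[z])              # Beta over time
        tokens.append((z, w, t))
    return tokens


rng = random.Random(0)
for z, w, t in sample_tot_document(8, TOPIC_WORD, TOPIC_TIME, rng):
    print(z, w, round(t, 3))
```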
47 State of the Union Address
208 Addresses delivered between January 8, 1790 and January 29, 2002.
- To increase the number of documents, we split the addresses into paragraphs and treated them as documents. One-line paragraphs were excluded. Stopping was applied.
- 17156 documents
- 21534 words
- 669,425 tokens
Our scheme of taxation, by means of which this
needless surplus is taken from the people and put
into the public Treasury, consists of a tariff
or duty levied upon importations from abroad and
internal-revenue taxes levied upon the
consumption of tobacco and spirituous and malt
liquors. It must be conceded that none of the
things subjected to internal-revenue
taxation are, strictly speaking, necessaries.
There appears to be no just complaint of this
taxation by the consumers of these articles, and
there seems to be nothing so well able to bear
the burden without hardship to any portion of the
people.
1910
48 Comparing TOT with LDA
49 Sample Topic: Cold War
world nations united states peace free economic military soviet international security strength defense freedom europe force peoples efforts aggression today
50 Comparing TOT against LDA
51 TOT on 17 years of NIPS proceedings
52 Topic Distributions Conditioned on Time
[Figure: topic mass (in vertical height) against time.]
53 TOT on 17 years of NIPS proceedings
[Figure: panels TOT and LDA.]
54 TOT versus LDA on my email
55 TOT improves ability to Predict Time
Predicting the year of a State-of-the-Union
address.
L1 distance between predicted year and actual
year.
56 Outline
Social Network Analysis with Topic Models
- Role Discovery (Author-Recipient-Topic Model, ART)
- Group Discovery (Group-Topic Model, GT)
- Enhanced Topic Models
- Correlations among Topics (Pachinko Allocation, PAM)
- Time Localized Topics (Topics-over-Time Model, TOT)
- Markov Dependencies in Topics (Topical N-Grams Model, TNG)
- Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures
57 Topics Modeling Phrases
- Topics based only on unigrams often difficult to interpret
- Topic discovery itself is confused because important meaning / distinctions carried by phrases.
- Significant opportunity to provide improved language models to ASR, MT, IR, etc.
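58 Topical N-gram Model
Wang, McCallum 2005
[Figure: graphical model. Three chains of variables over word positions, z1 z2 z3 z4 ..., y1 y2 y3 y4 ... and w1 w2 w3 w4 ..., inside a plate D; parameter plates are labeled W and T.]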
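To make the role of the y chain concrete, the sketch below assumes that each y is a binary indicator saying whether the word at that position continues a phrase started by the previous word. That reading is an assumption, since the slide shows only the variable names. Under it, a run of flagged words collapses into one n-gram. `merge_phrases` is a hypothetical helper.

```python
def merge_phrases(words, continues_phrase):
    """Join each word flagged 1 onto the phrase ending at the previous word."""
    phrases = []
    for word, flag in zip(words, continues_phrase):
        if flag and phrases:
            phrases[-1] += " " + word
        else:
            phrases.append(word)
    return phrases


words = ["genetic", "algorithms", "use", "a", "fitness", "function"]
flags = [0, 1, 0, 0, 0, 1]
print(merge_phrases(words, flags))
```

Consecutive flags chain together, so `["markov", "decision", "processes"]` with flags `[0, 1, 1]` yields a single trigram.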
59 LDA Topic
LDA: algorithms algorithm genetic problems efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
60 Sample Topical N-gram topics
Sample LDA topics
61 Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
policy action states actions function reward control agent q-learning optimal goal learning space step environment system problem steps sutton policies
learning optimal reinforcement state problems policy dynamic action programming actions function markov methods decision rl continuous spaces step policies planning
reinforcement learning optimal policy dynamic
programming optimal control function
approximator prioritized sweeping finite-state
controller learning system reinforcement learning
rl function approximators markov decision
problems markov decision processes local
search state-action pair markov decision
process belief states stochastic policy action
selection upright position reinforcement learning
methods
62 Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
motion response direction cells stimulus figure contrast velocity model responses stimuli moving cell intensity population image center tuning complex directions
motion visual field position figure direction fields eye location retina receptive velocity vision moving system flow edge center light local
receptive field spatial frequency temporal
frequency visual motion motion energy tuning
curves horizontal cells motion detection preferred
direction visual processing area mt visual
cortex light intensity directional
selectivity high contrast motion
detectors spatial phase moving stimuli decision
strategy visual stimuli
63 Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
speech word training system recognition hmm speaker performance phoneme acoustic words context systems frame trained sequence phonetic speakers mlp hybrid
word system recognition hmm speech training performance phoneme words context systems frame trained speaker sequence speakers mlp frames segmentation models
speech recognition training data neural
network error rates neural net hidden markov
model feature vectors continuous speech training
procedure continuous speech recognition gamma
filter hidden control speech production neural
nets input representation output layers training
algorithm test set speech frames speaker dependent
64 Outline
Social Network Analysis with Topic Models
- Role Discovery (Author-Recipient-Topic Model, ART)
- Group Discovery (Group-Topic Model, GT)
- Enhanced Topic Models
- Correlations among Topics (Pachinko Allocation, PAM)
- Time Localized Topics (Topics-over-Time Model, TOT)
- Markov Dependencies in Topics (Topical N-Grams Model, TNG)
- Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures
65 Social Networks in Research Literature
- Better understand structure of our own research area.
- Structure helps us learn a new field.
- Aid collaboration
- Map how ideas travel through social networks of researchers.
- Aids for hiring and finding reviewers!
66 Traditional Bibliometrics
- Analyses a small amount of data (e.g. 19 articles from a single issue of a journal)
- Uses journal as a proxy for research topic (but there is no journal for information extraction)
- Uses impact measures almost exclusively based on simple citation counts.
How can we use topic models to create new, interesting impact measures?
67 Our Data
- Over 1 million research papers, gathered as part of Rexa.info portal.
- Cross linked references / citations.
68 Finding Topics with TNG
Traditional unigram LDA run on 1 million titles / abstracts (200 topics)
... select 300k papers on ML, NLP, robotics, vision ...
Find 200 TNG topics among those papers.
69 Topical Bibliometric Impact Measures
Mann, Mimno, McCallum, 2006
- Topical Citation Counts
- Topical Impact Factors
- Topical Longevity
- Topical Diversity
- Topical Precedence
- Topical Transfer
70 Topical Diversity
Entropy of the topic distribution among papers that cite this paper (this topic).
[Figure: examples labeled Low Diversity and High Diversity.]
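The definition above translates directly into a few lines of code. In this sketch each citing paper is represented by a single topic label, which is a simplifying assumption (a citing paper could instead carry a full topic distribution). Entropy is reported in bits, and the topic labels are illustrative.

```python
import math
from collections import Counter


def topical_diversity(citing_topics):
    """Entropy (in bits) of the topic distribution among citing papers."""
    if not citing_topics:
        return 0.0
    counts = Counter(citing_topics)
    total = len(citing_topics)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


low = ["speech"] * 8                                # every citation from one topic
high = ["speech", "vision", "ir", "robotics"] * 2   # citations spread evenly
print(topical_diversity(low), topical_diversity(high))
```

A paper cited only from within one topic scores zero, while one cited evenly from four topics scores two bits.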
71 Topical Diversity
Can also be measured on particular papers...
72 Topical Precedence
Early-ness
Within a topic, what are the earliest papers that received more than n citations?
- Information Retrieval
- On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
- Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
- Relevance feedback in information retrieval, Rocchio (1971)
- Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
- New experiments in relevance feedback, Ide (1971)
- Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
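The question on this slide amounts to a filter followed by a sort. The sketch below uses made-up placeholder records (the titles, years and citation counts are hypothetical and do not describe the papers listed above), and `earliest_cited_papers` is an illustrative name.

```python
def earliest_cited_papers(papers, topic, min_citations, limit=5):
    """Earliest papers in `topic` with more than `min_citations` citations."""
    eligible = [p for p in papers
                if p["topic"] == topic and p["citations"] > min_citations]
    return sorted(eligible, key=lambda p: p["year"])[:limit]


# Hypothetical records for illustration only.
PAPERS = [
    {"title": "paper A", "topic": "IR", "year": 1975, "citations": 40},
    {"title": "paper B", "topic": "IR", "year": 1962, "citations": 12},
    {"title": "paper C", "topic": "IR", "year": 1958, "citations": 3},
    {"title": "paper D", "topic": "Speech", "year": 1950, "citations": 90},
]

for paper in earliest_cited_papers(PAPERS, "IR", min_citations=10):
    print(paper["year"], paper["title"])
```

The citation threshold n filters out early but obscure papers, so "paper C" is skipped even though it is the oldest in its topic.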
73 Topical Precedence
Early-ness
Within a topic, what are the earliest papers that received more than n citations?
- Speech Recognition
- Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
- Spectrographic study of vowel reduction, B. Lindblom (1963)
- Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
- Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
- Automatic Recognition of Speakers from Their Voices, B. Atal (1976)
74 Topical Transfer
Transfer from Digital Libraries to other topics
Other topic      Cits  Paper Title
Web Pages        31    Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision  14    On being Undigital with digital cameras extending the dynamic...
Video            12    Lessons learned from the creation and deployment of a terabyte digital video
Graphs           12    Trawling the Web for Emerging Cyber-Communities
Web Pages        11    WebBase a repository of Web pages
75 Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
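Counting citations from one topic to another can be sketched as follows. Each paper is assumed to carry a single topic label, and transfer is keyed as (cited topic, citing topic), so that the cited topic is the producer and the citing topic the consumer. The paper identifiers and topic assignments are hypothetical.

```python
from collections import Counter


def topical_transfer(citations, paper_topic):
    """Count cross-topic citations as (cited topic, citing topic) pairs."""
    counts = Counter()
    for citing, cited in citations:
        producer, consumer = paper_topic[cited], paper_topic[citing]
        if producer != consumer:
            counts[(producer, consumer)] += 1
    return counts


# Hypothetical papers and citation links (citing paper, cited paper).
PAPER_TOPIC = {"p1": "Digital Libraries", "p2": "Web Pages",
               "p3": "Web Pages", "p4": "Video"}
CITATIONS = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p3", "p2")]

for (producer, consumer), count in topical_transfer(CITATIONS, PAPER_TOPIC).items():
    print(producer, "->", consumer, count)
```

Summing a producer's row against its column in these counts would indicate whether a topic mainly exports or imports ideas.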
76 Outline
Social Network Analysis with Topic Models
- Role Discovery (Author-Recipient-Topic Model, ART)
- Group Discovery (Group-Topic Model, GT)
- Enhanced Topic Models
- Correlations among Topics (Pachinko Allocation, PAM)
- Time Localized Topics (Topics-over-Time Model, TOT)
- Markov Dependencies in Topics (Topical N-Grams Model, TNG)
- Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures
77 Topic Model Musings
- 3 years ago Latent Dirichlet Allocation appeared as a complex innovation... but now these methods' mechanics are well-understood.
- Innovation now is to understand data and modeling needs, how to structure a new model to capture these.