Title: Predictively Modeling Social Text
1. Predictively Modeling Social Text

- William W. Cohen
- Machine Learning Dept. and Language Technologies Institute, School of Computer Science
- Carnegie Mellon University
- Joint work with Amr Ahmed, Andrew Arnold, Ramnath Balasubramanyan, Frank Lin, Matt Hurst (MSFT), Ramesh Nallapati, Noah Smith, Eric Xing, Tae Yano
2. Newswire Text vs. Social Media Text

Newswire text:
- Formal
- Primary purpose: inform the typical reader about recent events
- Broad audience
- Explicitly establishes shared context with the reader
- Ambiguity is often avoided

Social media text:
- Informal
- Many purposes: entertain, connect, persuade
- Narrow audience: friends and colleagues
- Shared context already established
- Many statements are ambiguous out of social context
3. Newswire Text vs. Social Media Text

Goals of analysis (newswire):
- Extract information about events from the text
- Understanding the text requires understanding the typical reader: the conventions for communicating with him/her, prior knowledge, background, ...

Goals of analysis (social media):
- Very diverse
- Evaluation is difficult, and requires revisiting often as goals evolve
- Often, understanding social text requires understanding a community
4. Outline

- Tools for analysis of text
- Probabilistic models for text, communities, and time
  - Mixture models and LDA models for text
  - LDA extensions to model hyperlink structure
  - LDA extensions to model time
- Alternative framework based on graph analysis to model time and community
  - Preliminary results and tradeoffs
- Discussion of results and challenges
5. Introduction to Topic Models

[Plate diagram: a Naïve Bayes generative model. A class node C (here "football") generates each word W1, W2, W3, ..., WN of a document, e.g. "The Pittsburgh Steelers won ...". The box (plate) is shorthand for many repetitions of the structure; an outer plate repeats the whole structure for M documents.]
6. Introduction to Topic Models

[Plate diagram: the same Naïve Bayes model, now with class C = "politics" generating the words "The Pittsburgh mayor stated ...".]
7. Introduction to Topic Models

- Naïve Bayes model: compact representation

[Plate diagram: the word nodes W1..WN collapsed into a single node W inside a plate of size N, nested in a document plate of size M, with the class node C and its parameters outside the plates.]
8. Introduction to Topic Models

- For each document d = 1, ..., M:
  - Generate Cd ~ Mult(π)
  - For each position n = 1, ..., Nd:
    - Generate wn ~ Mult(β | Cd)

Example, for document d = 1:
- Generate Cd ~ Mult(π): "football"
- For each position n = 1, ..., Nd = 67:
  - Generate w1 ~ Mult(β | Cd): "the"
  - Generate w2: "Pittsburgh"
  - Generate w3: "Steelers"
  - ...

[Plate diagram as before: C generates W1..WN, repeated over M documents.]
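A minimal runnable sketch of this generative story in Python/numpy; the class names, vocabulary, and parameter values below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

classes = ["football", "politics"]
vocab = ["the", "pittsburgh", "steelers", "won", "mayor", "stated"]

pi = np.array([0.5, 0.5])            # class prior: C_d ~ Mult(pi)
beta = np.array([                    # per-class word distributions: w_n ~ Mult(beta | C_d)
    [0.3, 0.2, 0.3, 0.2, 0.0, 0.0],  # "football"
    [0.3, 0.2, 0.0, 0.0, 0.3, 0.2],  # "politics"
])

def generate_document(n_words):
    c = rng.choice(len(classes), p=pi)                   # draw the class
    words = rng.choice(vocab, size=n_words, p=beta[c])   # draw each word given the class
    return classes[c], list(words)

for d in range(3):
    print(generate_document(5))
```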
9. Introduction to Topic Models

- In the graphs:
  - shaded circles are known (observed) values
  - the parents of a variable W are the inputs to the function used in generating W
- Goal: given the known values, estimate the rest, usually so as to maximize the probability of the observations

[Plate diagram as before.]
10. Introduction to Topic Models

- Mixture model: an unsupervised naïve Bayes model
- Joint probability of words and classes (written out below)
- But the classes are not visible
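In the notation used above (π for the class prior, β for the class-conditional word distributions), the joint and marginal probabilities are (a standard reconstruction; the original slide shows this as an equation image):

```latex
P(C_d = c, w_1, \dots, w_{N_d}) \;=\; \pi_c \prod_{n=1}^{N_d} \beta_{c,\,w_n},
\qquad
P(w_1, \dots, w_{N_d}) \;=\; \sum_{c} \pi_c \prod_{n=1}^{N_d} \beta_{c,\,w_n}
```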
[Plate diagram: mixture model with a hidden class variable generating each word W, inside plates of size N and M.]
11. Introduction to Topic Models

- Learning for naïve Bayes:
  - Take logs; the resulting function decomposes and is easy to optimize for any parameter (closed form)
- Learning for the mixture model:
  - Many local maxima (at least one for each permutation of the classes)
  - Expectation-Maximization (EM) is the most common method
12. Introduction to Topic Models

- Mixture model: EM solution
  - E-step
  - M-step
13. Introduction to Topic Models

- Mixture model: EM solution (a code sketch follows below)
  - E-step: estimate the expected values of the unknown variables (a soft classification of each document)
  - M-step: maximize the parameter values subject to this guess; usually, this means learning the parameter values given the soft classifications
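A compact sketch of EM for a mixture of multinomials, assuming documents are given as a term-count matrix; the toy data, smoothing constant, and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-count matrix: D documents x V vocabulary words.
X = rng.poisson(2.0, size=(20, 6)).astype(float)
K = 2  # number of mixture components (hidden classes)

# Random initialisation of pi (class prior) and beta (class word distributions).
pi = np.full(K, 1.0 / K)
beta = rng.dirichlet(np.ones(X.shape[1]), size=K)

for it in range(50):
    # E-step: soft assignments r[d, k] proportional to pi_k * prod_w beta[k, w] ** X[d, w]
    log_r = np.log(pi) + X @ np.log(beta).T
    log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the soft assignments.
    pi = r.mean(axis=0) + 1e-12
    pi /= pi.sum()
    beta = r.T @ X + 1e-3                       # small smoothing to avoid zeros
    beta /= beta.sum(axis=1, keepdims=True)

print("estimated class prior:", np.round(pi, 3))
```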
14. Introduction to Topic Models
15. Introduction to Topic Models

- Probabilistic Latent Semantic Analysis (PLSA) model:
  - Select document d ~ Mult(π)
  - For each position n = 1, ..., Nd:
    - Generate zn ~ Mult(θd)
    - Generate wn ~ Mult(φzn)
- Mixture model: each document is generated by a single (unknown) multinomial distribution over words; the corpus is mixed by π
- PLSA model: each word is generated by a single (unknown) multinomial distribution over words; each document is mixed by its own topic distribution θd

[Plate diagram: d → z → w, with per-document topic distribution θd and topic-word distributions φ; N word positions per document, M documents.]
16. Introduction to Topic Models

- Latent Dirichlet Allocation (Blei, Ng & Jordan, JMLR 2003)
17. Introduction to Topic Models

- Latent Dirichlet Allocation (a code sketch follows below):
  - For each document d = 1, ..., M:
    - Generate θd ~ Dir(α)
    - For each position n = 1, ..., Nd:
      - Generate zn ~ Mult(θd)
      - Generate wn ~ Mult(φzn)

[Plate diagram: α → θd → zn → wn, with topic-word distributions φ; N word positions per document, M documents.]
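A minimal sketch of the LDA generative process above in Python/numpy; the vocabulary size, topic count, document lengths, and the sampled φ are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, M = 8, 3, 5                         # vocabulary size, topics, documents
alpha = np.full(K, 0.1)                   # Dirichlet prior over per-document topic mixtures
phi = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions (sampled here; learned in practice)

docs = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)                 # theta_d ~ Dir(alpha)
    n_d = rng.poisson(10) + 1                      # document length (not part of the model proper)
    z = rng.choice(K, size=n_d, p=theta_d)         # z_n ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=phi[zn]) for zn in z])  # w_n ~ Mult(phi_{z_n})
    docs.append((z, w))

print(docs[0])
```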
18. Introduction to Topic Models

19. Introduction to Topic Models
20. Introduction to Topic Models

- Latent Dirichlet Allocation:
  - Overcomes some technical issues with PLSA
    - PLSA only estimates mixing parameters for the training docs
  - Parameter learning is more complicated:
    - Gibbs sampling: easy to program, often slow (sketched below)
    - Variational EM
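Since the slide mentions Gibbs sampling, here is a hedged sketch of a collapsed Gibbs sampler for LDA on a toy corpus with symmetric priors; it illustrates the technique and is not the exact implementation used in any of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iters=200):
    """Collapsed Gibbs sampling for LDA. docs: list of word-id lists."""
    nd = np.zeros((len(docs), K))   # doc-topic counts
    nw = np.zeros((K, V))           # topic-word counts
    nt = np.zeros(K)                # topic totals
    z = []                          # current topic assignment for every token
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            nd[d, k] += 1; nw[k, w] += 1; nt[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment from the counts.
                nd[d, k] -= 1; nw[k, w] -= 1; nt[k] -= 1
                # p(z = k | rest) is proportional to (nd + alpha) * (nw + beta) / (nt + V*beta)
                p = (nd[d] + alpha) * (nw[:, w] + beta) / (nt + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                nd[d, k] += 1; nw[k, w] += 1; nt[k] += 1
    return nd, nw

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 1, 4, 4]]   # toy word-id documents
nd, nw = lda_gibbs(docs, V=5, K=2)
print(nw)
```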
21. Introduction to Topic Models

- Perplexity comparison of various models (lower is better)

[Figure: held-out perplexity curves for a unigram model, the mixture model, PLSA, and LDA.]
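For reference, held-out perplexity is typically defined as follows (standard definition; not spelled out on the slide):

```latex
\mathrm{perplexity}(D_{\mathrm{test}}) \;=\; \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```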
22. Introduction to Topic Models

- Prediction accuracy for classification, using learning with topic-model outputs as features (higher is better)
23. Outline

- Tools for analysis of text
- Probabilistic models for text, communities, and time
  - Mixture models and LDA models for text
  - LDA extensions to model hyperlink structure
  - LDA extensions to model time
- Alternative framework based on graph analysis to model time and community
  - Preliminary results and tradeoffs
- Discussion of results and challenges
24. Hyperlink modeling using LDA
25. Hyperlink modeling using LinkLDA (Erosheva, Fienberg, Lafferty, PNAS 2004)

- For each document d = 1, ..., M:
  - Generate θd ~ Dir(α)
  - For each position n = 1, ..., Nd:
    - Generate zn ~ Mult(θd)
    - Generate wn ~ Mult(φzn)
  - For each citation j = 1, ..., Ld:
    - Generate zj ~ Mult(θd)
    - Generate cj ~ Mult(γzj)
- Learning using variational EM

[Plate diagram: θd generates both word topics (z → w, plate of size N) and citation topics (z → c, plate of size L), repeated over M documents; φ and γ are the topic-word and topic-citation distributions.]
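A rough sketch of the LinkLDA generative story just listed, with toy parameters; γ plays the role of the topic-to-cited-document distributions, and all sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, M = 10, 3, 6              # vocab size, topics, documents (also the citation targets)
alpha = np.full(K, 0.1)
phi = rng.dirichlet(np.ones(V), size=K)    # topic -> word distributions
gamma = rng.dirichlet(np.ones(M), size=K)  # topic -> cited-document distributions

def generate(n_words=12, n_cites=3):
    theta = rng.dirichlet(alpha)                                           # theta_d ~ Dir(alpha)
    words = [rng.choice(V, p=phi[rng.choice(K, p=theta)]) for _ in range(n_words)]
    cites = [rng.choice(M, p=gamma[rng.choice(K, p=theta)]) for _ in range(n_cites)]
    return words, cites

print(generate())
```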
26. Hyperlink modeling using LDA (Erosheva, Fienberg, Lafferty, PNAS 2004)
27. Newswire Text vs. Social Media Text

Goals of analysis (newswire):
- Extract information about events from the text
- Understanding the text requires understanding the typical reader: the conventions for communicating with him/her, prior knowledge, background, ...

Goals of analysis (social media):
- Very diverse
- Evaluation is difficult, and requires revisiting often as goals evolve
- Often, understanding social text requires understanding a community

Science as a testbed for social text: an open community which we understand.
28. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)

- "Copycat" model of citation influence:
  - LDA model for cited papers
  - Extended LDA model for citing papers
  - For each word, depending on a coin flip c, you might choose to copy a word from a cited paper instead of generating the word
29. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)

- Citation influence graph for the LDA paper
30. Models of hypertext for blogs (ICWSM 2008)

Ramesh Nallapati, me
31. Link-PLSA-LDA

- LinkLDA model for the citing documents
- Variant of the PLSA model for the cited documents
- Topics are shared between citing and cited documents
- Links depend on the topics in the two documents
32. Experiments

- 8.4M blog postings in the Nielsen/Buzzmetrics corpus
  - Collected over three weeks in summer 2005
- Selected all postings with >2 inlinks or >2 outlinks
  - 2248 citing documents (>2 outlinks), 1777 cited documents (>2 inlinks)
  - Only 68 in both sets, which are duplicated
- Fit the model using variational EM
33. Topics in blogs

- The model can answer questions like: which blogs are most likely to be cited when discussing topic z?
34. Topics in blogs

- The model can be evaluated by predicting which links an author will include in an article (lower is better)

[Figure: link-prediction performance of Link-LDA vs. Link-PLSA-LDA.]
35. Another model: Pairwise Link-LDA

- LDA for both cited and citing documents
- Generate an indicator for every pair of docs
  - vs. generating pairs of docs
- A link depends on the mixing components (the θs)
  - a stochastic block model
36. Pairwise Link-LDA supports new inferences, but doesn't perform better on link prediction
37. Outline

- Tools for analysis of text
- Probabilistic models for text, communities, and time
  - Mixture models and LDA models for text
  - LDA extensions to model hyperlink structure
    - Observation: these models can be used for many purposes
  - LDA extensions to model time
- Alternative framework based on graph analysis to model time and community
- Discussion of results and challenges
38. Predicting Response to Political Blog Posts with Topic Models (NAACL 2009)

- Noah Smith
- Tae Yano
39. Political blogs and comments

- Posts are often coupled with comment sections
- Comment style is casual, creative, less carefully edited
40. Political blogs and comments

- Most of the text associated with large "A-list" community blogs is comments
  - 5-20x as many words in comments as in the post text, for the 5 sites considered in Yano et al.
- A large part of socially-created commentary in the blogosphere is comments, not blog → blog hyperlinks
- Comments do not just echo the post
41. Modeling political blogs

- Our political blog model: CommentLDA
- Notation: z, z' are topics; w is a word in the post; w' is a word in the comments; u is a user
- D = # of documents; N = # of words in the post; M = # of words in the comments
42. Modeling political blogs

- Our proposed political blog model
- The LHS (post side) is vanilla LDA
- D = # of documents; N = # of words in the post; M = # of words in the comments
43. Modeling political blogs

- Our proposed political blog model
- The RHS captures the generation of the reaction separately from the post body
- The two chambers share the same topic mixture
- Two separate sets of word distributions
- D = # of documents; N = # of words in the post; M = # of words in the comments
44. Modeling political blogs

- Our proposed political blog model
- The user IDs of the commenters are treated as part of the comment text: they are generated along with the words in the comment section
- D = # of documents; N = # of words in the post; M = # of words in the comments
45. Modeling political blogs

- Another model we tried: take out the words from the comment section!
- This model is structurally equivalent to LinkLDA (Erosheva et al., 2004)
- It is agnostic to the words in the comment section (a rough sketch of the full CommentLDA story follows below)
- D = # of documents; N = # of words in the post; M = # of words in the comments
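A heavily simplified sketch of the two-chamber generative story described on the last few slides: a shared topic mixture, separate post/comment word distributions, and commenter IDs generated alongside comment words. The NAACL-09 model variants (-v, -r, -c) are not reproduced, and all parameter names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, U, K = 12, 5, 3          # vocab size, number of users, topics
alpha = np.full(K, 0.1)
beta_post = rng.dirichlet(np.ones(V), size=K)   # topic -> post-word distributions
beta_cmnt = rng.dirichlet(np.ones(V), size=K)   # topic -> comment-word distributions (separate chamber)
gamma_user = rng.dirichlet(np.ones(U), size=K)  # topic -> commenter distributions

def generate_post(n_post=15, n_cmnt=30):
    theta = rng.dirichlet(alpha)                 # topic mixture shared by both chambers
    post = [rng.choice(V, p=beta_post[rng.choice(K, p=theta)]) for _ in range(n_post)]
    comments = []
    for _ in range(n_cmnt):
        z = rng.choice(K, p=theta)
        comments.append((rng.choice(U, p=gamma_user[z]),   # commenter id
                         rng.choice(V, p=beta_cmnt[z])))   # comment word
    return post, comments

print(generate_post())
```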
46. Topic discovery - Matthew Yglesias (MY) site

47. Topic discovery - Matthew Yglesias (MY) site

48. Topic discovery - Matthew Yglesias (MY) site
49. Comment prediction

- LinkLDA and CommentLDA consistently outperform baseline models
- Neither consistently outperforms the other

[Figure: user prediction, precision at top 10, on the MY, CB, and RS sites; from left to right, Link LDA (-v, -r, -c), Comment LDA (-v, -r, -c), and baselines (Freq, NB).]
50. From Episodes to Sagas: Temporally Clustering News Via Social-Media Commentary (current work)

- Noah Smith, Matthew Hurst, Frank Lin, Ramnath Balasubramanyan
51. Motivation

- The news-related blogosphere is driven by recency
- Some recent news is better understood in the context of a sequence of related stories
  - Some readers have this context; some don't
- To reconstruct the context, reconstruct the sequence of related stories (the "saga")
  - Similar to retrospective event detection
- First efforts:
  - Find related stories
  - Cluster by time
  - Evaluation: agreement with human annotators
52. Clustering results on Democratic-primary-related documents

- k-walks (more later)
- SpeCluster + time: a mixture-of-multinomials model for the general text, with the timestamp drawn from a Gaussian (a sketch follows below)
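A hedged sketch of the kind of text-plus-timestamp mixture just mentioned: each cluster has a multinomial over words and a Gaussian over timestamps. All parameter values are illustrative assumptions, not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 20, 3
pi = np.full(K, 1.0 / K)                        # cluster prior
beta = rng.dirichlet(np.ones(V), size=K)        # cluster -> word distribution
mu, sigma = np.array([5.0, 15.0, 25.0]), 2.0    # cluster timestamp Gaussians (e.g. a day index)

def generate_doc(n_words=30):
    c = rng.choice(K, p=pi)                     # pick a cluster (event)
    words = rng.choice(V, size=n_words, p=beta[c])
    timestamp = rng.normal(mu[c], sigma)        # timestamp drawn from the cluster's Gaussian
    return c, words, timestamp

print(generate_doc())
```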
53. Clustering results on Democratic-primary-related documents

- Also had three human annotators build gold-standard timelines
  - hierarchical
  - annotated with names of events, times, ...
- A machine-produced timeline can be evaluated by its tree-distance to the gold-standard one
54. Clustering results on Democratic-primary-related documents

- Issue: divergence of opinion with human annotators
  - Is modeling community interests the problem?
  - How much of what we want is actually in the data?
  - Should this task be supervised or unsupervised?
55. More sophisticated time models

- Hierarchical LDA Over Time (HOTS) model
  - LDA to generate text
  - Also generates a timestamp for each document from topic-specific Gaussians
  - Non-parametric model: the number of clusters is also generated (not specified by the user)
  - Allows the use of user-provided prototypes
  - Evaluated on liberal/conservative blogs and ML papers from NIPS conferences
- Ramnath Balasubramanyan
56. Results with the HOTS model: unsupervised
57. Results with the HOTS model: human guidance

- Adding human seeds for some key events improves performance on all events
- Allows a user to partially specify a timeline of events and have the system complete it
58. Outline

- Tools for analysis of text
- Probabilistic models for text, communities, and time
  - Mixture models and LDA models for text
  - LDA extensions to model hyperlink structure
  - LDA extensions to model time
- Alternative framework based on graph analysis to model time and community
  - Preliminary results and tradeoffs
- Discussion of results and challenges
59. Spectral Clustering: Graph = Matrix; Vector = Node → Weight

[Figure: a 10-node example graph with nodes A-J, written as its adjacency matrix M (a 1 wherever an edge exists), and a vector v assigning a weight to each node (e.g. A = 3, B = 2, C = 3).]
60. Spectral Clustering: Graph = Matrix; M v1 = v2 "propagates weights from neighbors"

[Figure: the same adjacency matrix M and weight vector v1; in the product v2 = M v1, each node's new value is the sum of its neighbors' v1 weights.]
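A tiny numpy illustration of the "propagates weights from neighbors" idea, on an assumed 4-node toy graph rather than the slide's A-J example:

```python
import numpy as np

# Adjacency matrix of a 4-node toy graph (0/1 edges, no self-loops).
M = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
v1 = np.array([3.0, 2.0, 3.0, 1.0])   # one weight per node

v2 = M @ v1     # each entry becomes the sum of the neighbors' weights
print(v2)       # e.g. node 0 gets v1[1] + v1[2] = 5.0
```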
61. Spectral Clustering: Graph = Matrix; M v1 = v2 "propagates weights from neighbors"

[Figure: the same example after the multiplication, e.g. A = 5, B = 6, C = 5, showing how the weights spread along the edges.]
62. Spectral Clustering: Graph = Matrix; W v1 = v2 "propagates weights from neighbors"

[Figure: nodes plotted in the space of the second and third eigenvectors (e2, e3) of W; points from the same cluster (marked x, y, or z) group together. Shi & Meila, 2002.]
63. Spectral Clustering

- If W is the row-normalized adjacency matrix of a connected graph with k closely-connected subcommunities, then:
  - the top eigenvector is a constant vector
  - the next k eigenvectors are roughly piecewise constant, with the "pieces" corresponding to the subcommunities
- Spectral clustering (sketched in code below):
  - Find the top k+1 eigenvectors v1, ..., v_{k+1}
  - Discard the top one
  - Replace every node a with the k-dimensional vector xa = <v2(a), ..., v_{k+1}(a)>
  - Cluster with k-means
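A short sketch of the spectral clustering recipe above, assuming numpy and a reasonably recent SciPy are available; the toy graph (two small cliques joined by one edge) and the k-means settings are illustrative choices.

```python
import numpy as np
from scipy.cluster.vq import kmeans2   # assumes SciPy is available

def spectral_cluster(A, k):
    """Cluster nodes of a graph with adjacency matrix A into k groups."""
    W = A / A.sum(axis=1, keepdims=True)       # row-normalized adjacency matrix
    vals, vecs = np.linalg.eig(W)
    order = np.argsort(-vals.real)             # sort eigenvectors by eigenvalue
    X = vecs[:, order[1:k + 1]].real           # drop the top (constant) eigenvector, keep the next k
    _, labels = kmeans2(X, k, minit="++")      # k-means on the embedded nodes
    return labels

# Two 3-node cliques joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_cluster(A, 2))
```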
64. Spectral Clustering: Pros and Cons

- Elegant, and well-founded mathematically
- Works quite well when relations are approximately transitive (like similarity or social connections)
- Expensive for very large datasets
  - Computing eigenvectors is the bottleneck
- Noisy datasets cause problems
  - Informative eigenvectors need not be in the top few
  - Performance can drop suddenly from good to terrible
65. Experimental results: best-case assignment of class labels to clusters

[Figure: results on the political-blog network of Adamic & Glance, "Divided They Blog", 2004.]
66. Spectral Clustering: Graph = Matrix; M v1 = v2 "propagates weights from neighbors"

[Figure: the same adjacency-matrix and weight-propagation example as before, repeated as a reminder.]
67. Repeated averaging with neighbors as a clustering method

- Pick a vector v0 (maybe even at random)
- Compute v1 = W v0
  - i.e., replace v0[x] with the weighted average of v0[y] over the neighbors y of x
- Plot v1[x] for each x
- Repeat for v2, v3, ...
- What are the dynamics of this process?
68. Repeated averaging with neighbors on a sample problem

[Figure: node values plotted after repeated averaging on a sample problem (groups labeled "larger" and "small").]
69. PIC: Power Iteration Clustering - run power iteration (repeated averaging with neighbors) with early stopping

- Frank Lin
- Formally, we can show this works when spectral techniques work
- Experimentally, runs in linear time
- Easy to implement and efficient
- Very easily parallelized
- Experimentally, often better than traditional spectral methods
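A minimal sketch of the power-iteration-clustering idea, using the same row-normalized W as in the spectral sketch above; the normalization and early-stopping rule here are simplified relative to the published algorithm, so treat it as illustrative only.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def pic(A, k, max_iter=50, tol=1e-3, rng_seed=0):
    """Power Iteration Clustering sketch: repeated averaging with early stopping."""
    rng = np.random.default_rng(rng_seed)
    W = A / A.sum(axis=1, keepdims=True)
    v = rng.random(A.shape[0])
    v /= np.abs(v).sum()
    prev_delta = None
    for _ in range(max_iter):
        v_new = W @ v                          # each entry becomes the average of its neighbors
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v).max()
        # Stop early once the rate of change stabilizes: the vector is then
        # near-piecewise-constant over the clusters rather than fully converged.
        if prev_delta is not None and abs(prev_delta - delta) < tol:
            break
        v, prev_delta = v_new, delta
    _, labels = kmeans2(v.reshape(-1, 1), k, minit="++")   # cluster the 1-D embedding
    return labels

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(pic(A, 2))
```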
70. Experimental results: best-case assignment of class labels to clusters

71. Experiments: run time and scalability

[Figure: run time in milliseconds.]
72. Clustering results on Democratic-primary-related documents

- k-walks is an early version of PIC
- Cluster a graph with several types of nodes: blog entries, news stories, and dates
- Clusters (communities) of the graph correspond to events in the saga
- Advantage: PIC clusters at interactive speeds
- Evaluated against human annotators
73. Outline

- Tools for analysis of text
- Probabilistic models for text, communities, and time
  - Mixture models and LDA models for text
  - LDA extensions to model hyperlink structure
  - LDA extensions to model time
- Alternative framework based on graph analysis to model time and community
  - Preliminary results and tradeoffs
- Discussion of results and challenges
74. Comments

Probabilistic models:
- Can model many aspects of social text
  - Community (links, comments)
  - Time
- Evaluation:
  - Introspective and qualitative, on communities we understand (e.g., scientific communities)
  - Quantitative, on predictive tasks (link prediction, user prediction, ...)
  - Against a gold-standard visualization (sagas)

Social media text (recap):
- Goals of analysis are very diverse
- Evaluation is difficult, and requires revisiting often as goals evolve
- Often, understanding social text requires understanding a community
75. Thanks to

- NIH/NIGMS
- NSF
- Microsoft LiveLabs
- Microsoft Research
- Johnson & Johnson
- Language Technologies Institute, CMU