Title: Topic modeling
1. Topic modeling
Mark Steyvers
Department of Cognitive Sciences, University of California, Irvine
2. Some topics we can discuss
- Introduction to LDA, the basic topic model
- Preliminary work on therapy transcripts
- Extensions to LDA
  - Conditional topic models (for predicting behavioral codes)
  - Various topic models for word order
  - Topic models incorporating parse trees
  - Topic models for dialogue
  - Topic models incorporating speech information
3. Most basic topic model: LDA (Latent Dirichlet Allocation)
4. Automatic and unsupervised extraction of semantic themes from large text collections
- Pennsylvania Gazette (1728-1800): 80,000 articles
- Enron: 250,000 emails
- NYT: 330,000 articles
- NSF/NIH: 100,000 grants
- AOL queries: 20,000,000 queries, 650,000 users
- Medline: 16 million articles
5. Model Input
- Matrix of counts: the number of times words occur in documents (see the sketch below)
- Note:
  - word order is lost (bag-of-words approach)
  - some function words are deleted (the, a, in)
[Figure: documents x words count matrix]
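As an illustration, here is a minimal sketch of how such a count matrix can be built using scikit-learn's CountVectorizer; the two toy documents are invented for the example.

```python
# Build a word-document count matrix: a bag-of-words representation in which
# word order is discarded and common function words ("the", "a", "in", ...)
# are removed. The toy documents are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the bank approved the loan for the new account",
    "the river bank was flooded by the stream",
]

vectorizer = CountVectorizer(stop_words="english")  # drops function words
counts = vectorizer.fit_transform(docs)             # documents x words matrix

print(vectorizer.get_feature_names_out())
print(counts.toarray())
```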
6. Basic Assumptions
- Each topic is a distribution over words
- Each document is a mixture of topics
- Each word in a document originates from a single topic
7. Document = mixture of topics
Example topics:
- auto car parts cars used ford honda truck toyota
- party store wedding birthday jewelry ideas cards cake gifts
- hannah montana zac efron disney high school musical miley cyrus hilary duff
- webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
[Figure: two example documents shown with their mixing proportions over these topics (the 20, 80, and 100 percent bars in the original figure)]
8. Generative Process
- For each document, choose a mixture of topics: θ ~ Dirichlet(α)
- Sample a topic z in 1..T from the mixture: z ~ Multinomial(θ)
- Sample a word from the topic: w ~ Multinomial(φ(z)), where φ ~ Dirichlet(β)
[Plate diagram: word plate N_d inside document plate D; topic plate T]
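The generative story above can be written out directly; here is a minimal sketch in Python/numpy, with all sizes and hyperparameter values chosen only for illustration.

```python
# A minimal sketch of the LDA generative process described above.
# T topics, a vocabulary of W words; alpha and beta are symmetric Dirichlet
# hyperparameters. All sizes here are illustrative, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
T, W, D, N_d = 3, 20, 5, 10          # topics, vocab size, docs, words per doc
alpha, beta = 0.1, 0.01

phi = rng.dirichlet(np.full(W, beta), size=T)    # phi(z) ~ Dirichlet(beta), one per topic

for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))     # theta ~ Dirichlet(alpha)
    for i in range(N_d):
        z = rng.choice(T, p=theta)               # z ~ Multinomial(theta)
        w = rng.choice(W, p=phi[z])              # w ~ Multinomial(phi(z))
```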
9. Prior Distributions
- Dirichlet priors encourage sparsity on topic mixtures and topics
θ ~ Dirichlet(α): prior over topic mixtures (simplex with corners Topic 1, Topic 2, Topic 3)
φ ~ Dirichlet(β): prior over topics (simplex with corners Word 1, Word 2, Word 3)
(darker colors indicate lower probability)
10. Toy Example
Two topics (shown in the original figure as word distributions), three documents with topic assignments (the subscript on each token), and the topic weights for each document:
- Doc 1 (weights 1.0 / 0.0): MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ...
- Doc 2 (weights 0.6 / 0.4): RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ...
- Doc 3 (weights 0.0 / 1.0): RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2 ...
11. Statistical Inference
Now the topics, topic assignments, and topic weights are all unknown (?):
- Doc 1 (weights ?): MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ...
- Doc 2 (weights ?): RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ...
- Doc 3 (weights ?): RIVER? BANK? STREAM? BANK? RIVER? BANK? ...
12. Statistical Inference
- Three sets of latent variables:
  - document-topic distributions θ
  - topic-word distributions φ
  - topic assignments z
- Estimate the posterior distribution over topic assignments, P(z | w)
  - we collapse over the topic mixtures and word mixtures
  - we can later infer θ and φ
- Use approximate methods: Markov chain Monte Carlo (MCMC) with Gibbs sampling
13. Toy Example: Artificial Dataset
- Two topics
- 16 documents
[Figure: docs x words count matrix of the artificial dataset]
Can we recover the original topics and topic mixtures from this data?
14. Initialization: assign word tokens randomly to topics
(colors indicate topic 1 vs. topic 2)
15. Gibbs Sampling
Each word token i (word w_i in document d_i) is reassigned to a topic t by sampling from

P(z_i = t \mid z_{-i}, w) \propto \frac{C^{WT}_{w_i t} + \beta}{\sum_{w} C^{WT}_{w t} + W\beta} \cdot \frac{C^{DT}_{d_i t} + \alpha}{\sum_{t'} C^{DT}_{d_i t'} + T\alpha}

where C^{DT}_{d t} is the count of topic t assigned to document d, and C^{WT}_{w t} is the count of word w assigned to topic t.
16. After 1 iteration
- Apply the sampling equation to each word token
(colors indicate topic 1 vs. topic 2)
17. After 4 iterations
(colors indicate topic 1 vs. topic 2)
18. After 8 iterations
(colors indicate topic 1 vs. topic 2)
19. After 32 iterations
(colors indicate topic 1 vs. topic 2)
20. Summary of Algorithm
INPUT: word-document counts (word order is irrelevant)
OUTPUT:
- topic assignments to each word: P(z_i)
- likely words in each topic: P(w | z)
- likely topics in each document (the "gist"): P(z | d)
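A compact sketch of the collapsed Gibbs sampler summarized above, following the sampling equation from slide 15; the function name and default settings are illustrative choices, not from the original slides.

```python
# Collapsed Gibbs sampling for LDA. `docs` is a list of documents, each a list
# of word indices into a vocabulary of size W.
import numpy as np

def gibbs_lda(docs, W, T=2, alpha=0.5, beta=0.1, iters=32, seed=0):
    rng = np.random.default_rng(seed)
    C_dt = np.zeros((len(docs), T))          # count of topic t assigned to doc d
    C_wt = np.zeros((W, T))                  # count of word w assigned to topic t
    z = []                                   # topic assignment of every token
    for d, doc in enumerate(docs):           # random initialization
        z.append([])
        for w in doc:
            t = rng.integers(T)
            z[d].append(t); C_dt[d, t] += 1; C_wt[w, t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                  # remove this token's current assignment
                C_dt[d, t] -= 1; C_wt[w, t] -= 1
                p = ((C_wt[w] + beta) / (C_wt.sum(axis=0) + W * beta)
                     * (C_dt[d] + alpha))    # doc-side normalizer is constant, omitted
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t; C_dt[d, t] += 1; C_wt[w, t] += 1
    return z, C_dt, C_wt

# Toy usage: vocabulary {0: MONEY, 1: LOAN, 2: BANK, 3: RIVER, 4: STREAM}
docs = [[0, 2, 2, 1, 2, 0], [3, 2, 4, 2, 3, 2]]
z, C_dt, C_wt = gibbs_lda(docs, W=5)
```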
21. Example topics from TASA, an educational corpus
- 37K docs, 26K word vocabulary
- 300 topics, e.g.:
22. Three documents with the word "play" (numbers and colors = topic assignments)
23. LSA vs. the topic model
LSA: the words x documents matrix C is factorized by SVD as C = U D V^T, with U (words x dims), D (dims x dims), and V^T (dims x documents); the dimensions are unconstrained.
Topic model: the normalized co-occurrence matrix C (words x documents) is factorized as C = F Q, where F (words x topics) holds the mixture components (topics as distributions over words) and Q (topics x documents) holds the mixture weights (documents as mixtures over topics).
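Both factorizations can be tried side by side; a hedged sketch using scikit-learn (TruncatedSVD for the LSA-style factorization, LatentDirichletAllocation for the topic model), with a made-up toy corpus.

```python
# Compare the two factorizations of the same count matrix.
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["money bank loan bank", "river bank stream bank", "money loan cash"]
C = CountVectorizer().fit_transform(docs)      # documents x words counts

lsa = TruncatedSVD(n_components=2).fit(C)      # C ~ U D V^T: unconstrained axes
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)

print(lsa.components_)   # LSA dimensions: word loadings of arbitrary sign
print(lda.components_)   # topic-word weights: non-negative, interpretable
```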
24. Documents as Topic Mixtures: a Geometric Interpretation
[Figure: the simplex of word probabilities, P(word1) + P(word2) + P(word3) = 1. Topic 1 and topic 2 are points on the simplex; an observed document lies on the segment between them, since it is a mixture of the two topics.]
25. Some Preliminary Work on Therapy Transcripts
26. Defining documents
- A document can be defined in multiple ways:
  - all words within a therapy session
  - all words from a particular speaker within a session
- Clearly we need to extend the topic model to dialogue.
27. (No transcript: figure-only slide)
28. Positive/Negative Topic Usage by Group
29. Positive/Negative Topic Usage by Changes in Satisfaction
Couples whose satisfaction decreases over the course of therapy use relatively more negative language; those who leave therapy with increased satisfaction exhibit more positive language.
30. Topics used by Satisfied/Unsatisfied Couples
Topic 38: talk divorce problem house along separate separation talking agree example
Dissatisfied couples talk relatively more often about separation and divorce.
31. Affect Dynamics
- Analyze the short-term dynamics of affect usage
- Do unhappy couples follow up negative language with negative language more often than happy couples? In other words, are unhappy couples caught in a negative feedback loop?
- Calculated (see the sketch after this list):
  - P(z2 = + | z1 = +)
  - P(z2 = + | z1 = -)
  - P(z2 = - | z1 = +)
  - P(z2 = - | z1 = -)
- E.g., P(z2 = - | z1 = +) is the probability that after a positive word the next non-neutral word will be a negative word
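These conditional probabilities are simple bigram statistics over the sequence of non-neutral words; here is a minimal sketch, where the input format (a per-word list of '+', '-', and None labels) is an assumption for illustration.

```python
# Estimate P(z2 | z1) over consecutive non-neutral words, where each word is
# labeled '+' (positive topic), '-' (negative topic), or None (neutral).
from collections import Counter

def affect_transitions(labels):
    seq = [s for s in labels if s is not None]   # keep only non-neutral words
    pairs = Counter(zip(seq, seq[1:]))           # count (z1, z2) bigrams
    totals = Counter(z1 for z1, _ in pairs.elements())
    return {(z1, z2): n / totals[z1] for (z1, z2), n in pairs.items()}

print(affect_transitions(['+', None, '-', '-', '+', '-', None, '-']))
```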
32. Markov Chain Illustration
Estimated base rates and transition probabilities between positive (+) and negative (-) affect words, by group:

Group             P(+)  P(-)  P(+|+)  P(-|+)  P(-|-)  P(+|-)
Normal Controls   .51   .49   .73     .27     .72     .28
Positive Change   .45   .55   .67     .33     .73     .27
Little Change     .38   .62   .63     .37     .78     .22
Negative Change   .35   .65   .59     .41     .78     .22
33. Modeling Extensions
34. Extensions
- Multi-label document classification
  - conditional topic models
- Topic models and word order
  - n-grams/collocations
  - hidden Markov models
- Some potential model developments
  - topic models incorporating parse trees
  - topic models for dialogue
  - topic models incorporating speech information
35. Conditional Topic Models
Assume there is a topic associated with each label/behavioral code. The model is only allowed to assign words to labels that are associated with the document (one way to realize this constraint is sketched below). This model can learn the distribution of words associated with each label/behavioral code.
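One plausible way to realize this constraint inside a Gibbs sampling step is to zero out the probability of every topic whose label is not attached to the document; the helper below and its signature are hypothetical, not from the original model.

```python
# Restrict a token's topic distribution to the topics (behavioral codes)
# attached to its document; one illustrative realization of the constraint.
import numpy as np

def constrained_topic_probs(p, allowed_topics, T):
    """p: unnormalized topic probabilities; allowed_topics: the doc's codes."""
    mask = np.zeros(T)
    mask[list(allowed_topics)] = 1.0      # allow only the document's codes
    p = p * mask
    return p / p.sum()

# e.g. a document labeled only with codes 0 and 2:
print(constrained_topic_probs(np.array([0.2, 0.5, 0.3]), {0, 2}, T=3))
```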
36. Vulnerability = yes, Hard Expression = no:
word? word? word? word? word? word? word? word? word? word? word? word? word? ...
Vulnerability = no, Hard Expression = yes:
word? word? word? word? word? word? word? word? word? word? word? word? ...
Vulnerability = yes, Hard Expression = yes:
word? word? word? word? word? word? ...
(Topics associated with behavioral codes; topic weights; documents and topic assignments, all unknown before inference.)
37. Preliminary Results
38. Topic Models for Short-Range Sequential Dependencies
39. Hidden Markov Topics Model
- Syntactic dependencies → short-range dependencies
- Semantic dependencies → long-range dependencies
[Graphical model: the topic mixture θ and semantic state generate words from the topic model (topic assignments z1 ... z4 for words w1 ... w4); syntactic states s1 ... s4 generate words from an HMM.]
(Griffiths, Steyvers, Blei, Tenenbaum, 2004)
40. NIPS Semantics
- IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS PIXEL VISUAL
- KERNEL SUPPORT VECTOR SVM KERNELS SPACE FUNCTION MACHINES SET
- NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS OUTPUTS
- EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
- MEMBRANE SYNAPTIC CELL CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
- DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
- STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL
NIPS Syntax
- IN WITH FOR ON FROM AT USING INTO OVER WITHIN
- I X T N - C F P
- IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
- SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
- MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
- HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
- USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
41. Random sentence generation
Generated strings (each [S] marks a sentence boundary):
- LANGUAGE
- RESEARCHERS GIVE THE SPEECH
- THE SOUND FEEL NO LISTENERS
- WHICH WAS TO BE MEANING
- HER VOCABULARIES STOPPED WORDS
- HE EXPRESSLY WANTED THAT BETTER VOWEL
42. Collocation Topic Model
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
43. Potential Model Developments
44. Using parse trees / POS taggers?
[Figure: parse trees for "You complete me" and "I complete you", each S expanding to NP and VP]
45. Modeling Dialogue
46. Topic Segmentation Model
- Purver, Körding, Griffiths, and Tenenbaum (2006). Unsupervised topic modelling for multi-party spoken discourse. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
- Automatically segments multi-party discourse into topically coherent segments
- Outperforms standard HMMs
- Model does not incorporate speaker information or speaker turns
  - the goal is simply to segment a long stream of words into segments
47. At each utterance, there is a probability of changing theta, the topic mixture. If no change is indicated, words are drawn from the same mixture of topics. If there is a change, the topic mixture is resampled from the Dirichlet prior, as in the sketch below.
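A hedged sketch of this generative step, with illustrative sizes and a made-up change probability p_change.

```python
# At each utterance, resample the topic mixture theta from its Dirichlet prior
# with probability p_change (a segment boundary); otherwise reuse it.
import numpy as np

rng = np.random.default_rng(0)
T, alpha, p_change = 4, 0.5, 0.2

theta = rng.dirichlet(np.full(T, alpha))
for utterance in range(10):
    if rng.random() < p_change:                    # segment boundary
        theta = rng.dirichlet(np.full(T, alpha))   # resample topic mixture
    topics = rng.choice(T, size=8, p=theta)        # topics for this utterance's words
```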
48. Latent Dialogue Structure model (Ding et al., NIPS workshop, 2009)
- Designed for modeling sequences of messages on discussion forums
- Models the relationship of messages within documents: a message might relate to any previous message within a dialogue
- Does not incorporate speaker-specific variables
49. Some details
50. Learning User Intentions in Spoken Dialogue Systems (Chinaei et al., ICAART, 2009)
- Applies the HTMM model (Gruber et al., 2007) to dialogue
- Assumes that within each talk-turn, words are drawn from the same topic z (not a mixture!). At the start of a new talk-turn, there is some probability (psi below) of sampling a new topic z from the mixture theta
51. Other ideas
- Can we enhance topic models with non-verbal speech information?
- Each topic is a distribution over words as well as voicing information (f0, timing, etc.)
[Plate diagram: topic plate T; word plate N_d inside document plate D, with a non-verbal feature attached to each word]
52. Other Extensions
53. Learning Topic Hierarchies (example: Psych Review abstracts)
Root topic (function words): THE OF AND TO IN A IS
Lower-level topics include:
- A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
- SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
- MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
- DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
- RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
- SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
- ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
- GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
- SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
- REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
- IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
- CONDITIONING STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES