Latent Semantic Analysis Probabilistic Topic Models - PowerPoint PPT Presentation

Title: The Topics Model for Semantic Representation. Author: Steyvers. Last modified by: Mark Steyvers. Created: 9/23/2003 4:02:03 PM.



1
Latent Semantic Analysis, Probabilistic Topic
Models & Associative Memory
2
The Psychological Problem
  • How do we learn semantic structure?
  • Covariation between words and the contexts they
    appear in (e.g. LSA)
  • How do we represent semantic structure?
  • Semantic Spaces (e.g. LSA)
  • Probabilistic Topics

3
Latent Semantic Analysis (Landauer & Dumais, 1997)
[Figure: word-document counts → SVD → high-dimensional space containing the words STREAM, RIVER, BANK, MONEY]
  • Each word is a single point in semantic space
  • Similarity measured by cosine of angle between
    word vectors
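
A minimal sketch of the two bullets above, assuming a tiny hand-made word-document count matrix (the counts, the choice of k = 2 dimensions and the helper names are illustrative, not taken from the slides). It reduces the counts with a truncated SVD and compares word vectors by cosine.

```python
import numpy as np

# Toy word-document count matrix (rows: words, columns: documents).
# The counts are made up for illustration only.
words = ["money", "bank", "river", "stream"]
counts = np.array([
    [3, 2, 0, 0],   # money
    [2, 3, 2, 1],   # bank
    [0, 0, 3, 2],   # river
    [0, 0, 2, 3],   # stream
], dtype=float)

def lsa_word_vectors(counts, k):
    """Return k-dimensional word vectors from a truncated SVD."""
    u, s, _vt = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]  # scale each kept dimension by its singular value

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = lsa_word_vectors(counts, k=2)
vec = dict(zip(words, vectors))
sim_river_stream = cosine(vec["river"], vec["stream"])
sim_river_money = cosine(vec["river"], vec["money"])
```

In this toy space RIVER lies closer to STREAM than to MONEY. Full LSA pipelines usually also reweight the counts before the SVD, which is omitted here.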

4
Critical Assumptions of Semantic Spaces (e.g.
LSA)
  • Psychological distance should obey three axioms
  • Minimality
  • Symmetry
  • Triangle inequality
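
Written out for a distance function d over items a, b, c (the symbols are introduced here for illustration and do not appear on the slide), the three axioms are usually stated as:

```latex
\begin{align*}
\text{Minimality:} \quad & d(a,b) \ge d(a,a) = 0 \\
\text{Symmetry:} \quad & d(a,b) = d(b,a) \\
\text{Triangle inequality:} \quad & d(a,c) \le d(a,b) + d(b,c)
\end{align*}
```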

5
For conceptual relations, violations of distance
axioms often found
  • Similarities can often be asymmetric
  • North Korea is more similar to China than
    vice versa
  • Pomegranate is more similar to Apple than
    vice versa
  • Violations of triangle inequality

[Figure: triangle with sides AB, BC and AC]
Euclidean distance: AC ≤ AB + BC
6
Triangle Inequality in Semantic Spaces might not
always hold
[Figure: word vectors w1, w2, w3; labeled words THEATER, PLAY, SOCCER]
Euclidean distance: AC ≤ AB + BC
Cosine similarity: cos(w1,w3) ≥ cos(w1,w2)cos(w2,w3) − sin(w1,w2)sin(w2,w3)
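
The cosine bound on this slide is the triangle inequality applied to angles: the angle between w1 and w3 cannot exceed the sum of the other two angles, so cos(w1,w3) cannot fall below the cosine of that sum. A small numerical check, assuming random 5-dimensional vectors (the dimensionality, sample size and helper names are arbitrary choices for illustration):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_lower_bound(c12, c23):
    """cos(angle12 + angle23), written in terms of the two cosines."""
    s12 = np.sqrt(max(0.0, 1.0 - c12 ** 2))  # sine of the first angle
    s23 = np.sqrt(max(0.0, 1.0 - c23 ** 2))  # sine of the second angle
    return c12 * c23 - s12 * s23

rng = np.random.default_rng(0)
violations = 0
for _ in range(10_000):
    w1, w2, w3 = rng.normal(size=(3, 5))  # three random word vectors
    bound = cosine_lower_bound(cosine(w1, w2), cosine(w2, w3))
    if cosine(w1, w3) < bound - 1e-9:     # small tolerance for rounding
        violations += 1
```

No random triple should violate the bound. This is the constraint that a spatial model places on words like PLAY, which is similar to both THEATER and SOCCER.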
7
Nearest neighbor problem (Tversky & Hutchinson, 1986)
  • In similarity data, Fruit is nearest neighbor
    in 18 out of 20 fruit words
  • In 2D solution, Fruit can be nearest neighbor
    of at most 5 items
  • High-dimensional solutions might solve this but
    these are less appealing

8
Probabilistic Topic Models
  • A probabilistic version of LSA: no spatial
    constraints.
  • Originated in domain of statistics & machine
    learning
  • (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
  • Extracts topics from large collections of text
  • Topics are interpretable unlike the arbitrary
    dimensions of LSA

9
Model is Generative
Find parameters that reconstruct data
DATA: corpus of text (word counts for each document)
Topic Model
10
Probabilistic Topic Models
  • Each document is a probability distribution over
    topics (distribution over topics = gist)
  • Each topic is a probability distribution over
    words
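
Taken together, the two bullets imply that the probability of a word in a document is a mixture over topics. A sketch in symbols, assuming w is a word, d a document and z one of T topics (the notation is introduced here, not taken from the slide):

```latex
\[
P(w \mid d) \;=\; \sum_{z=1}^{T} P(w \mid z)\, P(z \mid d)
\]
```

Here P(z | d) is the document's distribution over topics and P(w | z) is the topic's distribution over words.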

11
Document generation as a probabilistic process
  1. For each document, choose a mixture of topics
  2. For every word slot, sample a topic 1..T from
    the mixture
  3. Sample a word from the topic

[Figure: TOPICS MIXTURE → TOPIC ... TOPIC → WORD ... WORD]
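
The three steps can be sketched as follows, assuming two hand-made topics over a five-word vocabulary and a mixture fixed by hand. The probabilities, the document length and the function name are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["money", "loan", "bank", "river", "stream"]
# Each row is one topic: a probability distribution over vocab (made-up values).
topics = np.array([
    [1/3, 1/3, 1/3, 0.0, 0.0],   # topic 1: money, loan, bank
    [0.0, 0.0, 1/3, 1/3, 1/3],   # topic 2: bank, river, stream
])

def generate_document(mixture, n_words):
    """Steps 2 and 3: per word slot, sample a topic, then a word from it."""
    doc = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=mixture)    # step 2: sample a topic
        w = rng.choice(len(vocab), p=topics[z])   # step 3: sample a word
        doc.append((vocab[w], z + 1))             # keep the topic label (1-based)
    return doc

# Step 1: choose a mixture of topics for the document (fixed by hand here).
doc = generate_document(mixture=np.array([0.8, 0.2]), n_words=30)
```

Each element of `doc` pairs a word with the topic that produced it, so "bank" can appear with either label.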
12
Example
TOPIC 1 (mixture component): money, loan, bank
TOPIC 2 (mixture component): river, stream, bank
Mixture weights: Document 1 = .8 Topic 1, .2 Topic 2; Document 2 = .3 Topic 1, .7 Topic 2
(The number after each word is the topic it was sampled from.)
DOCUMENT 1: money1 bank1 bank1 loan1 river2
stream2 bank1 money1 river2 bank1 money1 bank1
loan1 money1 stream2 bank1 money1 bank1 bank1
loan1 river2 stream2 bank1 money1 river2 bank1
money1 bank1 loan1 bank1 money1 stream2
DOCUMENT 2: river2 stream2 bank2 stream2 bank2
money1 loan1 river2 stream2 loan1 bank2 river2
bank2 bank1 stream2 river2 loan1 bank2
stream2 bank2 money1 loan1 river2 stream2 bank2
stream2 bank2 money1 river2 stream2 loan1
bank2 river2 bank2 money1 bank1 stream2 river2
bank2 stream2 bank2 money1
Bayesian approach: use priors
  • Mixture weights ~ Dirichlet(α)
  • Mixture components ~ Dirichlet(β)
13
Inverting (fitting) the model
[Figure: the same two documents, now with unknown (?) topic assignments, mixture components and mixture weights]
DOCUMENT 1: money? bank? bank? loan? river?
stream? bank? money? river? bank? money? bank?
loan? money? stream? bank? money? bank? bank?
loan? river? stream? bank? money? river? bank?
money? bank? loan? bank? money? stream?
DOCUMENT 2: river? stream? bank? stream? bank?
money? loan? river? stream? loan? bank? river?
bank? bank? stream? river? loan? bank?
stream? bank? money? loan? river? stream? bank?
stream? bank? money? river? stream? loan?
bank? river? bank? money? bank? stream? river?
bank? stream? bank? money?
TOPIC 1 (mixture component): ?
TOPIC 2 (mixture component): ?
Mixture weights: ?
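
The slide does not say which fitting procedure is used. One common choice for this kind of model is collapsed Gibbs sampling, sketched below as an assumption rather than as the presenter's method. The function name, the default values of `alpha` and `beta` (standing in for the Dirichlet priors on mixture weights and mixture components), the iteration count and the toy documents are all illustrative.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_vocab, alpha=1.0, beta=0.1, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for a topic model (illustrative sketch).

    docs: list of lists of word ids. Returns (assignments, topic_word_counts).
    """
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))   # tokens per (document, topic)
    topic_word = np.zeros((n_topics, n_vocab))    # tokens per (topic, word)
    topic_total = np.zeros(n_topics)              # tokens per topic
    # Start from random topic assignments for every token.
    z = [rng.integers(n_topics, size=len(d)) for d in docs]

    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            doc_topic[d, z[d][i]] += 1
            topic_word[z[d][i], w] += 1
            topic_total[z[d][i]] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                # P(topic | all other assignments), up to a constant.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + n_vocab * beta))
                t = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment.
                z[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return z, topic_word

# Toy corpus with word ids: 0 money, 1 loan, 2 bank, 3 river, 4 stream.
docs = [[0, 2, 1, 2, 0, 1, 2, 0],
        [3, 4, 2, 4, 3, 2, 3, 4],
        [0, 1, 2, 3, 4, 2, 0, 3]]
z, topic_word = gibbs_lda(docs, n_topics=2, n_vocab=5)
```

After sampling, `z` holds a topic label for every token (the "?" marks in the figure) and normalizing the rows of `topic_word + beta` gives an estimate of each topic's word distribution.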
14
Application to corpus data
  • TASA corpus: text from first grade to college
  • representative sample of text
  • 26,000 word types (stop words removed)
  • 37,000 documents
  • 6,000,000 word tokens

15
Example topics from an educational corpus (TASA)
  • 37K docs, 26K words
  • 1700 topics, e.g.

  • PRINTING PAPER PRINT PRINTED TYPE PROCESS INK
    PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES
    FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
  • PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA
    SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE
    DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS
    SCENES OPERA PERFORMED
  • TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING
    SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE
    COURT GAMES TRY COACH GYM SHOT
  • JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY
    DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER
    WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE
    CRIMINAL
  • HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS
    SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL
    TEST METHOD HYPOTHESES TESTED EVIDENCE BASED
    OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
  • STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY
    TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE
    STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
16
Polysemy
[Same six topics as on the previous slide. Several words appear in more than one topic: CHARACTERS, PLAY, COURT, EVIDENCE, TEST, TRY]
17
Three documents with the word "play" (numbers &
colors → topic assignments)
18
No Problem of Triangle Inequality
[Figure: TOPIC 1 and TOPIC 2 with the words SOCCER, MAGNETIC, FIELD]
Topic structure easily explains violations of
triangle inequality
19
Applications
20
Enron email data
  • 500,000 emails
  • 5,000 authors
  • 1999-2002
21
Enron topics
  • TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM
    GAME SPORTS GAMES
  • GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS
    SPIRITUAL VISIT
  • ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA
    PENDING SAFETY WATER GASOLINE
  • FERC MARKET ISO COMMISSION ORDER FILING COMMENTS
    PRICE CALIFORNIA FILED
  • POWER CALIFORNIA ELECTRICITY UTILITIES PRICES
    MARKET PRICE UTILITY CUSTOMERS ELECTRIC
  • STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL
    POWER BONDS MOU
TIMELINE
May 22, 2000 Start of California energy crisis
22
Applying Model to Psychological Data
23
Network of Word Associations
[Figure: association network linking BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER]
(Association norms by Doug Nelson et al., 1998)
24
Explaining structure with topics
[Figure: the words BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER grouped under topic 1 and topic 2]
25
Modeling Word Association
  • Word association modeled as prediction
  • Given that a single word is observed, which
    other words are likely to occur in the future?
  • Under a single topic assumption

[Equation not preserved in the transcript; its terms were labeled Response and Cue]
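
One plausible reading of this prediction, assumed here because the equation itself is missing, is P(response | cue) = Σ_z P(response | z) P(z | cue), with P(z | cue) obtained from Bayes' rule. A minimal sketch with two made-up topics and a uniform prior over topics (all numbers and names are illustrative):

```python
import numpy as np

vocab = ["bat", "ball", "game", "play", "stage", "theater"]
# Made-up topic-word probabilities P(w | z); each row sums to 1.
p_w_given_z = np.array([
    [0.25, 0.25, 0.25, 0.25, 0.00, 0.00],   # a sports-like topic
    [0.00, 0.00, 0.00, 0.40, 0.30, 0.30],   # a theater-like topic
])
p_z = np.array([0.5, 0.5])  # assumed uniform prior over topics

def associate(cue):
    """Return P(response | cue) for every word in vocab."""
    i = vocab.index(cue)
    p_z_given_cue = p_w_given_z[:, i] * p_z   # Bayes' rule, unnormalized
    p_z_given_cue /= p_z_given_cue.sum()
    return p_z_given_cue @ p_w_given_z        # average the topics' word distributions

p_response = associate("play")
```

With these numbers the cue "play" spreads its predictions over both sports and theater words, while the cue "bat" predicts only sports words. The predictions are also asymmetric: `associate("stage")` gives "play" a higher probability than `associate("play")` gives "stage".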
26
Observed associates for the cue play
27
Model predictions
[Figure not preserved; label: RANK 9]
28
Median rank of first associate
[Figure not preserved; axis label: Median Rank]
29
Recall: example study list
  • STUDY: Bed, Rest, Awake, Tired, Dream, Wake,
    Snooze, Blanket, Doze, Slumber, Snore, Nap,
    Peace, Yawn, Drowsy
  • FALSE RECALL: Sleep 61

30
Recall as a reconstructive process
  • Reconstruct study list based on the stored gist
  • The gist can be represented by a distribution
    over topics
  • Under a single topic assumption

[Equation not preserved in the transcript; its terms were labeled Retrieved word and Study list]
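
A sketch of how such a reconstruction could be written, assuming the study list is w_1, ..., w_n, that z ranges over T topics, and that the whole list comes from a single topic. The exact form used in the talk is not recoverable from the transcript, so this is an assumed reading.

```latex
\begin{align*}
P(w \mid \text{list}) &= \sum_{z=1}^{T} P(w \mid z)\, P(z \mid \text{list}), \\
P(z \mid \text{list}) &\propto P(z) \prod_{i=1}^{n} P(w_i \mid z)
\end{align*}
```

Under this reading, a word that was never studied can still receive a high retrieval probability if it is probable under the topics that best explain the study list.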
31
Predictions for the Sleep list
[Figure not preserved; panels: STUDY LIST and EXTRA LIST (top 8)]