1
HCC class lecture 14 comments
John Canny, 3/9/05
2
Administrivia
3
Clustering LSA again
  • The input is a matrix. Rows represent text blocks
    (sentences, paragraphs or documents)
  • Columns are distinct terms
  • Matrix elements are term counts (× tf-idf weight)
  • The idea is to factor this matrix into A D B
    (a code sketch follows the diagram below)

[Diagram: the Textblocks × Terms matrix M is factored as A (Textblocks × Themes) · D · B (Themes × Terms)]
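A minimal sketch of this setup (assuming numpy and scikit-learn; the sample documents, variable names, and the choice k = 2 are illustrative, not from the lecture):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Four tiny sample text blocks (illustrative only)
    docs = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
        "investors sold shares as markets dropped",
    ]

    # Rows are text blocks, columns are distinct terms,
    # entries are tf-idf weighted term counts
    vectorizer = TfidfVectorizer()
    M = vectorizer.fit_transform(docs).toarray()

    # Truncated SVD keeps k themes, so M is approximately A D B
    k = 2
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k]          # text blocks x themes
    D = np.diag(s[:k])    # diagonal matrix of theme strengths
    B = Vt[:k, :]         # themes x terms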
4
LSA again
  • A encodes the representation of each text block
    in a space of themes.
  • B encodes each theme with term weights. It can be
    used to explicitly describe the theme (sketched
    below).

[Diagram repeated from slide 3: M ≈ A D B]
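Continuing the sketch from slide 3, B can describe each theme by its strongest terms; SVD signs are arbitrary, so magnitudes are compared:

    # Top-weighted terms per theme, read off the rows of B
    terms = vectorizer.get_feature_names_out()
    for t in range(k):
        top = abs(B[t]).argsort()[::-1][:3]
        print("theme", t, [terms[i] for i in top])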
5
LSA limitations
  • LSA has a few assumptions that don't make much
    sense:
  • If documents really do comprise different
    themes, there shouldn't be negative weights in
    the LSA matrices.
  • LSA implicitly models Gaussian random processes
    for theme and word generation. Actual document
    statistics are far from Gaussian.
  • SVD forces themes to be orthogonal in the A and B
    matrices. Why should they be?

6
Non-negative Matrix Factorization
  • NMF addresses the non-negativity and orthogonality
    problems above, but still assumes Gaussian
    statistics (a least-squares sketch follows).
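A hedged sketch of least-squares NMF on the matrix M built in the LSA sketch, using scikit-learn's NMF class; the parameters are illustrative and this is not the paper's own code:

    from sklearn.decomposition import NMF

    # Frobenius (least-squares) NMF: M is approximately A B with
    # A, B >= 0 elementwise; there is no separate diagonal D here
    nmf = NMF(n_components=2, init="nndsvd", max_iter=500)
    A = nmf.fit_transform(M)    # text blocks x themes, non-negative
    B = nmf.components_         # themes x terms, non-negative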

7
LSA again
  • The consequences are:
  • LSA themes are not meaningful beyond the first
    few (the ones with strongest singular value).
  • LSA is largely insensitive to the choice of
    semantic space (most 300-dim spaces will do).

8
NMF
  • The corresponding properties of NMF are:
  • NMF components track themes well (up to 30 or
    more).
  • The NMF components can be used directly as topic
    markers, so the choice of semantic space matters
    (a sketch of this use follows).
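Because the weights are non-negative, each row of A from the NMF sketch can be read directly as a topic marker, for example:

    # Assign each text block to its strongest NMF component
    labels = A.argmax(axis=1)
    print(labels)    # one component index per text block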

9
NMF
  • NMF is an umbrella term for several algorithms.
  • The one in this paper uses least squares to match
    the original term matrix, i.e. it minimizes
  • ∑ᵢⱼ (Mᵢⱼ − (AB)ᵢⱼ)²
  • Another natural metric is the KL or
    Kullback-Leibler divergence. The KL-divergence
    between two probability distributions p and q is
  • ∑ᵢ pᵢ log(pᵢ / qᵢ)
  • Another natural version of NMF uses the
    KL-divergence between M and its approximation A B
    (both objectives are written out below).
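Both objectives written out with the M, A, B of the NMF sketch. One caveat: for unnormalized matrices, KL-divergence NMF usually minimizes the generalized form, which adds the terms −M + AB to the sum on the slide:

    # Least-squares objective: sum of squared residuals
    R = A @ B
    least_squares = ((M - R) ** 2).sum()

    # Generalized KL divergence between M and its approximation R;
    # eps guards against log(0) and division by zero
    eps = 1e-10
    kl = (M * np.log((M + eps) / (R + eps)) - M + R).sum()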

10
NMF
  • KL-divergence is usually a more accurate way to
    compare probability distributions.
  • However, in clustering applications, the quality
    of fit to the probability distribution is
    secondary to the quality of the clusters.
  • KL-divergence NMF performs well for smoothing
    (extrapolation) tasks, but not as well as
    least-squares for clustering.
  • The reasons are not entirely clear, but it may
    simply be an artifact of the basic NMF
    recurrences (sketched below), which find only
    locally-optimal matches.
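The recurrences in question are presumably the multiplicative updates of Lee and Seung; a sketch for the least-squares objective, with random initialization, so different runs can land in different local optima:

    # Each update is guaranteed not to increase the objective,
    # but convergence is only to a local optimum
    rng = np.random.default_rng(0)
    A = rng.random((M.shape[0], 2))
    B = rng.random((2, M.shape[1]))
    for _ in range(200):
        B *= (A.T @ M) / (A.T @ A @ B + 1e-10)
        A *= (M @ B.T) / (A @ B @ B.T + 1e-10)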

11
A Simpler Text Summarizer
  • A simpler text summarizer based on inter-sentence
    analysis did as well as any of the custom systems
    on the DUC-2002 dataset (Document Understanding
    Conference).
  • This algorithm, called TextRank, is based on a
    graph analysis of the similarity graph between
    sentences in the text.

12
A Simpler Text Summarizer
  • Vertices in the graph represent sentences; edge
    weights are the similarity between sentence pairs
    (a sketch follows the diagram).

[Diagram: sentence similarity graph with vertices S1 through S7]
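A sketch of building such a graph, assuming networkx; cosine similarity of tf-idf vectors is a stand-in here (TextRank's own measure is based on word overlap), and the sentences are illustrative:

    import networkx as nx
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "Markets fell sharply on Monday.",
        "Investors sold shares as prices dropped.",
        "The central bank declined to intervene.",
        "Analysts expect further declines this week.",
    ]
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)

    # One vertex per sentence; weighted edges for non-zero similarity
    G = nx.Graph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > 0:
                G.add_edge(i, j, weight=sim[i, j])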
13
TextRank
  • TextRank computes vertex strength using a variant
    of Google's PageRank. It gives the probability of
    being at a vertex during a long random walk on
    the graph (see the sketch after the diagram).

[Diagram repeated from slide 12]
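Continuing the sketch, networkx's pagerank computes exactly this random-walk score on the weighted graph:

    # Stationary probability of a long random walk over the graph;
    # edge weights bias the walk toward similar sentences
    scores = nx.pagerank(G, weight="weight")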
14
TextRank
  • The highest-ranked vertices comprise the summary
    (see the sketch below).
  • TextRank achieved the same summary performance as
    the best single-sentence summarizers at DUC-2002.
    (TextRank appeared in ACL 2004.)
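Using the scores from the previous sketch, a summary is just the top-ranked sentences restored to text order:

    # Keep the two strongest vertices, print them in original order
    top = sorted(scores, key=scores.get, reverse=True)[:2]
    print(" ".join(sentences[i] for i in sorted(top)))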

15
Discussion Topics
  • T1: The best text-analysis algorithms for a
    variety of tasks seem to use numerical
    representations (BOW or graph-based models) of
    texts. Discuss what information these
    representations capture and why they might be
    effective.