Iterative residual rescaling: An analysis and generalization of LSI (transcript and presenter's notes)
1
Iterative residual rescaling: An analysis and
generalization of LSI
  • Rie Kubota Ando and Lillian Lee. Iterative residual
    rescaling: An analysis and generalization of LSI.
    In Proceedings of the 24th Annual International ACM
    SIGIR Conference (SIGIR 2001), 2001.
  • Presenter ???

2
Introduction
  • The disadvantage of the vector space model (VSM):
  • Documents that do not share terms are mapped to
    orthogonal vectors even if they are clearly
    related.
  • LSI attempts to overcome this shortcoming by
    projecting the term-document matrix onto a
    lower-dimensional subspace.
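As an illustration of that projection (not from the slides; the toy matrix and the dimension h are placeholders), LSI can be computed with a truncated SVD:

    import numpy as np

    def lsi_project(D, h):
        # D: m-by-n term-document matrix, columns are documents
        U, s, Vt = np.linalg.svd(D, full_matrices=False)
        return U[:, :h].T @ D              # h-dimensional document representations

    D = np.array([[1., 0., 0.],            # toy 4-term, 3-document matrix
                  [1., 1., 0.],
                  [0., 1., 1.],
                  [0., 0., 1.]])
    docs2 = lsi_project(D, 2)
    # documents 1 and 3 share no terms, yet their projected similarity is nonzero
    print((docs2.T @ docs2).round(2))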

3
Introduction of IRR
  • LSI
  • IRR

[Figure: SVD of the m-by-n term-document matrix A = U Σ V^T (terms as rows, documents as columns), with eigenvalues and eigenvectors labeled; IRR differs from LSI by rescaling the residual before each eigenvector is extracted, which changes how documents are weighted.]
4
Frobenius norm and matrix 2-norm
  • Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$
  • 2-norm: $\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$,
    the largest singular value of A
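A quick NumPy check of the two norms (illustrative, not from the slides):

    import numpy as np

    A = np.array([[3., 0.],
                  [4., 5.]])
    print(np.linalg.norm(A, 'fro'))   # Frobenius norm: sqrt(9 + 16 + 25)
    print(np.linalg.norm(A, 2))       # 2-norm: largest singular value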

5
Analyzing LSI
  • Topic-based similarities
  • C: an n-document collection
  • D: the m-by-n term-document matrix
  • k underlying topics (k < n)
  • Relevance scores rel(t, d) for each document d and
    each topic t, giving each document a
    topic-relevance vector
  • True topic-based similarity between $d_i$ and $d_j$:
    how well their topic-relevance vectors match
  • collecting these gives an n-by-n similarity matrix S

[Figure: S is an n-by-n document-by-document matrix derived from the document-topic relevance scores.]
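A minimal sketch of how such an S could be formed, assuming (as an illustration, not necessarily the paper's exact normalization) that similarity is the inner product of unit-length topic-relevance vectors:

    import numpy as np

    R = np.array([[1., 1., 0., 0.2],           # rel(t, d): k-by-n relevance scores
                  [0., 0.3, 1., 1.]])          # here k = 2 topics, n = 4 documents
    R = R / np.linalg.norm(R, axis=0)          # unit-length relevance vector per document
    S = R.T @ R                                # n-by-n true topic-based similarity
    print(S.round(2))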
6
The optimum subspace
  • Given a subspace $\mathcal{X}$ of $\mathbb{R}^m$, let
    $B = (b_1, \dots, b_h)$ form an orthonormal basis of $\mathcal{X}$

7
The optimum subspace
  • We have the m-by-n term-document matrix D
  • $D_{\mathcal{X}} = B B^T D$: the
    projection of D onto $\mathcal{X}$

8
The optimum subspace
  • Deviation matrix: the difference between the
    similarity matrix of the projected documents and
    the true similarity matrix S
  • Goal: find a subspace $\mathcal{X}$ such that the
    entries of the deviation matrix are small.
  • The optimum subspace minimizes this deviation
  • Optimum error: the deviation achieved by the
    optimum subspace

If the optimum error is high, then we cannot expect
the optimum subspace to fully reveal the topic
dominances.
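A hedged sketch of scoring a candidate subspace, assuming the deviation is measured as the Frobenius norm between the projected documents' cosine-similarity matrix and S (the paper's exact normalization may differ):

    import numpy as np

    def subspace_error(D, B, S):
        # B: m-by-h orthonormal basis of the candidate subspace
        P = B @ (B.T @ D)                              # project the documents
        norms = np.linalg.norm(P, axis=0)
        norms[norms == 0] = 1.0                        # guard zero columns
        P = P / norms                                  # length-normalize
        return np.linalg.norm(P.T @ P - S, 'fro')      # deviation from true similarity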
9
The singular value decomposition and LSI
  • SVD: $D = U \Sigma V^T$
  • Intuition about the left singular vectors is gained
    from the following observation:
  • let $\hat d_j$ be the projection of $d_j$
    onto the span of $u_1, \dots, u_{i-1}$
  • let $r_j = d_j - \hat d_j$ be the residual vector;
    $u_i$ is then the direction that best fits these
    residuals
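An illustrative NumPy check of this residual view (toy data, not from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.random((5, 4))                    # toy 5-term, 4-document matrix
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    i = 2
    P = U[:, :i-1] @ (U[:, :i-1].T @ D)       # projections onto span(u_1, ..., u_{i-1})
    R = D - P                                 # residual vectors r_j
    Ur, _, _ = np.linalg.svd(R, full_matrices=False)
    # the direction that best fits the residuals is exactly u_i
    print(np.isclose(abs(Ur[:, 0] @ U[:, i-1]), 1.0))   # True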

10
Analysis of LSI
11
Non-uniformity and LSI
  • A crucial quantity in the analysis is the
    dominance of a given topic t (roughly, how strongly
    t is represented in the collection)

12
Non-uniformity and LSI
  • Topic mingling
  • High topic mingling means each document has
    substantial similarity with several different
    topics; the topics will then be fairly difficult
    to distinguish.

13
Non-uniformity and LSI
  • Let $\sigma_i$ be the i-th largest singular value
    of D. Then the $\sigma_i$ can be bounded in terms
    of the topic dominances. [Bound omitted in the
    transcript.]

14
Non-uniformity and LSI
  • Define the ratio of topic dominances
  • the more the largest topic dominates the
    collection, the higher this ratio will tend to be.

15
Non-uniformity and LSI
  • Original error: the error of the unprojected
    document space
  • Let $\mathcal{X}_{VSM}$ denote the VSM space
  • Root original error: the square root of the
    original error (also called the input error)
16
Non-uniformity and LSI
  • Let $\mathcal{X}_h$ be the h-dimensional LSI
    subspace spanned by the first h left singular
    vectors of D
  • when the topic-document distribution is relatively
    uniform, the error of $\mathcal{X}_h$ must be close
    to the optimum error.

17
Notation for related values
  • A symbol is reserved for the topic mingling
  • For related values a and b we write $a \approx b$;
    the approximation becomes closer as the
    optimum error (or original error) becomes smaller.

18
Ando's IRR algorithm
  • IRR algorithm

19
Introduction of IRR
20
Ando's IRR algorithm
Find the unit vector x that best approximates R: the
first left singular vector of the rescaled residual
matrix.
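A sketch of the core loop as these slides describe it (rescale each residual vector by its length to the power q, take the first left singular vector of the rescaled matrix, subtract the new direction); the numerical details are assumptions, not the paper's reference implementation:

    import numpy as np

    def irr(D, h, q):
        # D: m-by-n term-document matrix, columns are documents
        # h: number of basis vectors; q: rescaling exponent (q = 0 gives plain LSI)
        R = D.astype(float)                           # residual matrix, starts as D
        basis = []
        for _ in range(h):
            norms = np.linalg.norm(R, axis=0)
            Rs = R * norms**q                         # rescale residual r_j by ||r_j||^q
            U, _, _ = np.linalg.svd(Rs, full_matrices=False)
            b = U[:, 0]                               # best-fit direction for the residuals
            basis.append(b)
            R = R - np.outer(b, b @ R)                # remove the new direction
        return np.column_stack(basis)                 # m-by-h orthonormal basis

Rescaling amplifies documents that are still poorly represented, so minority topics get a chance to contribute a basis vector; with q = 0 the loop reduces to ordinary LSI.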
21
Ando's IRR algorithm
22
Auto-scale method
  • Automatic scaling factor determination

[Figure: topic-document relevance pattern when documents are approximately single-topic.]
23
Auto-scale method
  • Implementing auto-scale
  • We set the rescaling factor q to a linear function
    of f(D)
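A hedged sketch of this step; the constants a and b and the particular non-uniformity estimate f(D) below are stand-ins, not the paper's actual definitions:

    import numpy as np

    def auto_scale_q(D, a=1.0, b=0.0):
        # f(D): a measurable non-uniformity estimate of the collection;
        # this choice (spectrum concentration) is an assumption
        f = np.linalg.norm(D.T @ D, 'fro') / np.linalg.norm(D, 'fro')**2
        return a * f + b                     # q is linear in f(D)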

24
Dimension selection
  • Stopping criterion
  • residual ratio (effective for both LSI and
    IRR)
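One way to realize such a stopping criterion (a sketch; the threshold and the exact ratio definition are assumptions): grow the subspace one basis vector at a time, e.g. with the irr() sketch above, and stop when the residual ratio gets small.

    import numpy as np

    def select_dimension(D, basis, threshold=0.1):
        # basis: m-by-h_max matrix of basis vectors from LSI or IRR
        R = D.astype(float)
        for h in range(basis.shape[1]):
            b = basis[:, h]
            R = R - np.outer(b, b @ R)                       # remove dimension h+1
            ratio = np.linalg.norm(R, 'fro') / np.linalg.norm(D, 'fro')
            if ratio < threshold:                            # residual ratio small enough
                return h + 1
        return basis.shape[1]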

25
Evaluation Metrics
  • Kappa average precision
  • Pair-wise average precision
  • the measured similarity for any two
    intra-topic documents (sharing at least one topic)
    should be higher than for any two cross-topic
    documents, which have no topics in common.

Denote by $p_j$ the document pair with the j-th
largest measured cosine; the non-intra-topic
probability is used to correct the precision for
chance.
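A sketch of pair-wise average precision under a standard reading (sort all document pairs by cosine, treat intra-topic pairs as the relevant items, and average the precision at each relevant rank); this reading is an assumption:

    import numpy as np
    from itertools import combinations

    def pairwise_average_precision(docs, topic_sets):
        # docs: m-by-n matrix of length-normalized document vectors
        # topic_sets: list of n sets of topic labels
        pairs = list(combinations(range(docs.shape[1]), 2))
        cos = [float(docs[:, i] @ docs[:, j]) for i, j in pairs]
        order = np.argsort(cos)[::-1]                 # pairs by descending cosine
        hits, precisions = 0, []
        for rank, idx in enumerate(order, start=1):
            i, j = pairs[idx]
            if topic_sets[i] & topic_sets[j]:         # intra-topic pair
                hits += 1
                precisions.append(hits / rank)        # precision at this rank
        return float(np.mean(precisions))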
26
Evaluation Metrics
  • Clustering
  • let C be a cluster-topic contingency table
  • $C_{ij}$ is the number of documents in cluster
    i that are relevant to topic j.
  • from C, define the clustering score S(C)
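Building the contingency table itself is straightforward (a sketch; the score S(C) computed from it is left out, since the slide's formula is not recoverable):

    import numpy as np

    def contingency_table(clusters, topics, n_clusters, n_topics):
        # clusters[d]: cluster id of document d; topics[d]: set of relevant topic ids
        C = np.zeros((n_clusters, n_topics), dtype=int)
        for d, cl in enumerate(clusters):
            for t in topics[d]:
                C[cl, t] += 1                  # document d counts for (cluster, topic)
        return C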

27
Experimental setting
  • (1) Choose two TREC topics (more than two can be
    chosen)
  • (2) Specify seven distribution types:
  • (25,25), (30,20), (35,15), (40,10), (43,7),
    (45,5), (46,4)
  • Each document was relevant to exactly one of the
    pre-selected topics.
  • (3) Extract single-word stemmed terms using
    TALENT and remove stop-words.
  • (4) Create the term-document matrix, and
    length-normalize the document vectors.
  • (5) Run AUTO-SCALE to set the rescaling factor q

28
Controlled-distribution results
  • The chosen scaling factor increases on average as
    the non-uniformity goes up.

29
Controlled-distribution results
[Figure: the clusterings with the lowest and the highest S(C).]
30
Controlled-distribution results
31
Conclusion
  • Provided a new theoretical analysis of LSI,
  • showing a precise relationship between LSI's
    performance and the uniformity of the underlying
    topic-document distribution.
  • Extended Ando's IRR algorithm.
  • IRR gives very good performance in comparison
    to LSI.

32
IRR on summarization
[Figure: each document is turned into a term-by-sentence
matrix, and IRR is applied to it in place of the SVD's
U and V^T.]
All of the documents together are used as the query,
and similarity to this query is computed.
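A hedged sketch of this use (sentence segmentation and scoring details are assumptions): build a term-by-sentence matrix, compute an IRR basis with the irr() sketch above, pool all document text into one query vector, and rank sentences by similarity in the subspace.

    import numpy as np

    def rank_sentences(T, basis):
        # T: m-by-s term-by-sentence matrix; basis: m-by-h from irr()
        q = T.sum(axis=1)                        # all text pooled as the query
        qh = basis.T @ q                         # query in the IRR subspace
        Sh = basis.T @ T                         # sentences in the IRR subspace
        norms = np.linalg.norm(Sh, axis=0)
        norms[norms == 0] = 1.0                  # guard empty sentences
        scores = (Sh / norms).T @ (qh / np.linalg.norm(qh))
        return np.argsort(scores)[::-1]          # sentence indices, best first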