Vector Space Model - PowerPoint PPT Presentation

Provided by: rong7

Learn more at: http://www.cse.msu.edu

1
Vector Space Model

Rong Jin

2
Basic Issues in A Retrieval Model
3
Basic Issues in IR

How to represent queries?
How to represent documents?
How to compute the similarity between documents
and queries?
How to use user feedback to improve
retrieval performance?

4
IR Formal Formulation

Vocabulary: $V = \{w_1, w_2, \ldots, w_n\}$ of language
Query: $q = q_1, \ldots, q_m$, where $q_i \in V$
Collection: $C = \{d_1, \ldots, d_k\}$
Document: $d_i = (d_{i1}, \ldots, d_{im_i})$, where $d_{ij} \in V$
Set of relevant documents: $R(q) \subseteq C$
Generally unknown and user-dependent
Query is a hint on which doc is in $R(q)$
Task: compute $R'(q)$, an approximation of $R(q)$

5
Computing R(q)

Strategy 1: Document selection
Classification function $f(d,q) \in \{0,1\}$
Outputs 1 for relevance, 0 for irrelevance
$R'(q)$ is determined as the set $\{d \in C \mid f(d,q) = 1\}$
System must decide if a doc is relevant or not
(absolute relevance)
Example: Boolean retrieval
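
The selection strategy above can be sketched in a few lines. This is a minimal illustration, assuming a conjunctive (AND) Boolean query; the function name `boolean_and_select` and the toy collection are hypothetical and not taken from the slides.

```python
def boolean_and_select(collection, query_terms):
    """Document selection: f(d, q) = 1 if d contains every query term, else 0."""
    selected = []
    for doc_id, text in collection.items():
        words = set(text.lower().split())
        # Classification function f(d, q) with values in {0, 1}.
        f = 1 if all(term.lower() in words for term in query_terms) else 0
        if f == 1:
            selected.append(doc_id)
    return selected


# Hypothetical toy collection, made up for illustration only.
collection = {
    "d1": "java coffee starbucks",
    "d2": "java programming language",
    "d3": "microsoft programming tools",
}

selected = boolean_and_select(collection, ["java", "programming"])
print(selected)  # ['d2']
```

Every document is either in or out of the result set; nothing in the output says that one selected document is a better match than another.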

6
Document Selection Approach
[Figure: the true relevant set R(q) compared with the set chosen by the classifier C(q)]
7
Computing R(q)

Strategy 2: Document ranking
Similarity function $f(d,q) \in \mathbb{R}$
Outputs a similarity between document d and query q
Cut-off $\theta$
The minimum similarity for a document and a query to
be relevant
$R'(q)$ is determined as the set $\{d \in C \mid f(d,q) > \theta\}$
System must decide if one doc is more likely to
be relevant than another (relative relevance)
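
A minimal sketch of the ranking strategy, assuming the similarity scores have already been computed by some function f(d, q). The scores, the cut-off value, and the name `rank_documents` are hypothetical.

```python
def rank_documents(scores, theta):
    """Rank documents by decreasing similarity and keep those above the cut-off."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score > theta]


# Hypothetical similarity scores f(d, q) for four documents.
scores = {"d1": 0.55, "d2": 0.10, "d3": 0.99, "d4": 0.86}

result = rank_documents(scores, theta=0.5)
print(result)  # ['d3', 'd4', 'd1']
```

Unlike document selection, the order of the output carries information, and changing the cut-off only changes where the list is truncated.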

8
Document Selection vs. Ranking
[Figure: True R(q)]
9
Document Selection vs. Ranking
[Figure: True R(q)]
10
Ranking is often preferred

Similarity function is more general than
classification function
The classifier is unlikely to be accurate
Ambiguous information needs, short queries
Relevance is a subjective concept
Absolute relevance vs. relative relevance

11
Probability Ranking Principle

As stated by Cooper:
Ranking documents by decreasing probability of usefulness maximizes the
utility of IR systems

"If a reference retrieval system's response to
each request is a ranking of the documents in the
collections in order of decreasing probability of
usefulness to the user who submitted the request,
where the probabilities are estimated as
accurately as possible on the basis of whatever
data has been made available to the system for this
purpose, then the overall effectiveness of the
system to its users will be the best that is
obtainable on the basis of that data."
12
Vector Space Model

Any text object can be represented by a term
vector
Examples: documents, queries, sentences, ...
A query is viewed as a short document
Similarity is determined by relationship between
two vectors
e.g., the cosine of the angle between the
vectors, or the distance between vectors
The SMART system
Developed at Cornell University, 1960-1999
Still used widely

13
Vector Space Model illustration
         Java   Starbuck   Microsoft
D1       1      1          0
D2       0      1          1
D3       1      0          1
D4       1      1          1
Query    1      0.1        1
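
The table above can be turned into a ranking. The sketch below assumes cosine similarity, one of the two measures mentioned on the earlier Vector Space Model slide; the slide itself does not say which measure the illustration uses.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Term order: Java, Starbuck, Microsoft (values copied from the table).
docs = {
    "D1": [1, 1, 0],
    "D2": [0, 1, 1],
    "D3": [1, 0, 1],
    "D4": [1, 1, 1],
}
query = [1, 0.1, 1]

sims = {name: cosine(query, vec) for name, vec in docs.items()}
ranking = sorted(sims, key=sims.get, reverse=True)

for name in ranking:
    print(name, round(sims[name], 4))
```

D3 ranks first because it matches the two heavily weighted query terms and nothing else; D4 matches them too but is penalized for its extra term, and D1 and D2 tie.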
14
Vector Space Model illustration
15
Vector Space Model Similarity

Represent both documents and queries by word
histogram vectors
n: the number of unique words
A query $q = (q_1, q_2, \ldots, q_n)$
$q_i$: occurrence of the i-th word in the query
A document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$
$d_{k,i}$: occurrence of the i-th word in the document
Similarity of a query q to a document dk

16
Some Background in Linear Algebra

Dot product (scalar product)
Example
Measure the similarity by dot product
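
The formula and example on this slide did not survive extraction. As a worked instance of the standard dot product, using the query and document vectors from the slide 13 table (the choice of these numbers is an assumption made for illustration):

$$ q \cdot d_k = \sum_{i=1}^{n} q_i \, d_{k,i} $$

$$ q = (1,\ 0.1,\ 1),\quad D_1 = (1,\ 1,\ 0):\qquad q \cdot D_1 = 1 \cdot 1 + 0.1 \cdot 1 + 1 \cdot 0 = 1.1 $$

$$ D_3 = (1,\ 0,\ 1):\qquad q \cdot D_3 = 1 \cdot 1 + 0.1 \cdot 0 + 1 \cdot 1 = 2 $$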

17
Some Background in Linear Algebra

Length of a vector
Angle between two vectors
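
The slide's formulas were lost in extraction; the standard definitions are given here, with a worked example that reuses the query and D3 from the slide 13 table (an illustrative choice, not necessarily the slide's own example):

$$ |q| = \sqrt{\sum_{i=1}^{n} q_i^2}, \qquad \cos\theta(q, d_k) = \frac{q \cdot d_k}{|q|\,|d_k|} $$

$$ q = (1,\ 0.1,\ 1),\quad D_3 = (1,\ 0,\ 1):\qquad \cos\theta = \frac{2}{\sqrt{2.01}\,\sqrt{2}} \approx 0.9975 $$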

[Figure: vectors q and dk and the angle between them]
18
Some Background in Linear Algebra

Example
Measure similarity by the angle between vectors

[Figure: vectors q and dk and the angle between them]
19
Vector Space Model Similarity

Given
A query $q = (q_1, q_2, \ldots, q_n)$
$q_i$: occurrence of the i-th word in the query
A document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$
$d_{k,i}$: occurrence of the i-th word in the document
Similarity of a query q to a document dk

[Figure, slides 19 to 21: query vector q and document vector dk]
22
Term Weighting

$w_{k,i}$: the importance of the i-th word for
document dk
Why weighting?
Some query terms carry more information
TF.IDF weighting
TF (Term Frequency): within-document frequency
IDF (Inverse Document Frequency)
TF normalization: avoid the bias toward long documents

23
TF Weighting

A term is important if it occurs frequently in
a document
Formulas
f(t,d): number of occurrences of word t in document d
Maximum frequency normalization

Term frequency normalization
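
The slide's formula for maximum frequency normalization was lost in extraction. The sketch below assumes one widely used form, TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d); the constant 0.5 (exposed as `alpha`) and the function name are assumptions, not something the slide confirms.

```python
from collections import Counter


def max_freq_tf(tokens, alpha=0.5):
    """Maximum frequency normalization (assumed form):
    TF(t, d) = alpha + (1 - alpha) * f(t, d) / max_t' f(t', d)
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    max_f = max(counts.values())
    return {t: alpha + (1 - alpha) * f / max_f for t, f in counts.items()}


# Hypothetical four-word document.
tf = max_freq_tf("java java java coffee".split())
print(tf)
```

Dividing by the most frequent term's count keeps every TF value in a fixed range regardless of document length.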
24
TF Weighting

A term is important if it occurs frequently in
a document
Formulas
f(t,d): number of occurrences of word t in document d
Okapi/BM25 TF

Term frequency normalization
doclen(d): the length of document d
avg_doclen: average document length
k, b: predefined constants
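
The Okapi/BM25 TF formula itself is missing from the transcript. The sketch below assumes the commonly published BM25 term-frequency component, (k + 1) f / (f + k (1 - b + b * doclen / avg_doclen)); the slide's exact variant may differ, and the default values k = 1.2 and b = 0.75 are conventional choices rather than values from the slides.

```python
def bm25_tf(f, doclen, avg_doclen, k=1.2, b=0.75):
    """BM25-style term frequency (assumed form).

    f          -- raw count of the term in the document, f(t, d)
    doclen     -- length of the document
    avg_doclen -- average document length in the collection (must be > 0)
    k, b       -- predefined constants
    """
    length_norm = k * (1 - b + b * doclen / avg_doclen)
    return (k + 1) * f / (f + length_norm)


# Repeated occurrences add less and less weight.
for count in (1, 2, 5, 20):
    print(count, round(bm25_tf(count, doclen=100, avg_doclen=100), 3))

# The same count in a longer-than-average document scores lower.
print(round(bm25_tf(5, doclen=200, avg_doclen=100), 3))
```

The score rises with f but stays below k + 1, and b controls how strongly long documents are penalized.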
25
TF Normalization

Why?
Document length variation
Repeated occurrences are less informative than
the first occurrence
Two views of document length
A doc is long because it uses more words
A doc is long because it has more contents
Generally penalize long doc, but avoid
over-penalizing (pivoted normalization)

26
TF Normalization
[Figure: Norm. TF plotted against Raw TF; pivoted normalization]
27
IDF Weighting

A term is discriminative if it occurs only in a
few documents
Formula: $\mathrm{IDF}(t) = 1 + \log(n/m)$
n: total number of docs; m: number of docs with term t
(doc freq)
Can be interpreted as mutual information
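
A small sketch of the IDF formula from this slide. The slide does not state the base of the logarithm; the natural logarithm is assumed here, and the collection size is hypothetical.

```python
import math


def idf(n_docs, doc_freq):
    """IDF(t) = 1 + log(n / m), where m is the document frequency of t (m > 0)."""
    return 1 + math.log(n_docs / doc_freq)


# Hypothetical collection of 1000 documents.
for m in (1, 10, 100, 1000):
    print(m, round(idf(1000, m), 3))
```

A term that occurs in every document gets IDF 1, and the value grows as the term becomes rarer.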

28
TF-IDF Weighting

TF-IDF weighting
The importance of a term t to a document d
$\mathrm{weight}(t,d) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t)$
Frequent in doc → high tf → high weight
Rare in collection → high idf → high weight
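
The weighting rule above, sketched end to end. The sketch assumes raw term counts for TF and IDF(t) = 1 + log(n/m) with the natural logarithm; the three-document collection and the name `tfidf_weights` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """weight(t, d) = TF(t, d) * IDF(t), with raw counts as TF."""
    n = len(docs)
    tokenized = {name: text.lower().split() for name, text in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[name] = {t: f * (1 + math.log(n / df[t])) for t, f in tf.items()}
    return weights


# Hypothetical toy collection.
docs = {
    "d1": "java java coffee",
    "d2": "java microsoft",
    "d3": "microsoft windows",
}
w = tfidf_weights(docs)
for name, vec in w.items():
    print(name, {t: round(x, 3) for t, x in vec.items()})
```

In d1, "java" gains weight from occurring twice, while "coffee" gains weight from appearing in only one document.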

29
TF-IDF Weighting

TF-IDF weighting
The importance of a term t to a document d
$\mathrm{weight}(t,d) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t)$
Frequent in doc → high tf → high weight
Rare in collection → high idf → high weight
Both $q_i$ and $d_{k,i}$ are binary values, i.e. presence
or absence of a word in the query and document.

30
Problems with Vector Space Model

Still limited to word based matching
A document will never be retrieved if it does not
contain any query word
How to modify the vector space model ?

31
Choice of Bases
[Figure, slides 31 to 35: vectors D, D1, and Q under different choices of bases]
36
Choosing Bases for VSM

Modify the bases of the vector space
Each basis is a concept: a group of words
Every document is a vector in the concept space

      c1   c2   c3   c4   c5   m1   m2   m3   m4
A1    1    1    1    1    1    0    0    0    0
A2    0    0    0    0    0    1    1    1    1
37
Choosing Bases for VSM

Modify the bases of the vector space
Each basis is a concept: a group of words
Every document is a mixture of concepts

      c1   c2   c3   c4   c5   m1   m2   m3   m4
A1    1    1    1    1    1    0    0    0    0
A2    0    0    0    0    0    1    1    1    1
38
Choosing Bases for VSM

Modify the bases of the vector space
Each basis is a concept: a group of words
Every document is a mixture of concepts
How to define/select basic concept?
In VS model, each term is viewed as an
independent concept

39
Basic Matrix Multiplication
41
Linear Algebra Basic Eigen Analysis

Eigenvectors (for a square $m \times m$ matrix S)
Example

eigenvalue
(right) eigenvector
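
The defining equation and the example on this slide were lost in extraction. The standard definition, with a small worked example chosen here for illustration (not necessarily the one on the slide):

$$ S v = \lambda v, \qquad v \neq 0 $$

$$ S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}:\qquad S \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 3 \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad S \begin{pmatrix} 1 \\ -1 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ -1 \end{pmatrix} $$

So the eigenvalues are 3 and 1, with (right) eigenvectors $(1, 1)^T$ and $(1, -1)^T$.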
42
Linear Algebra Basic Eigen Analysis
43
Linear Algebra Basic Eigen Decomposition
$S = U \Lambda U^T$
45
Linear Algebra Basic Eigen Decomposition
$S = U \Lambda U^T$

This is true for any real symmetric square
matrix
Columns of U are eigenvectors of S
Diagonal elements of $\Lambda$ are eigenvalues of S
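
A worked check of the decomposition on a small symmetric matrix, chosen here for illustration. Its unit-length eigenvectors $(1,1)^T/\sqrt{2}$ and $(1,-1)^T/\sqrt{2}$ form the columns of U, and its eigenvalues 3 and 1 form the diagonal of $\Lambda$:

$$ S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} $$

$$ U \Lambda U^T = \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 4 & 2 \\ 2 & 4 \end{pmatrix} = S $$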

46
Singular Value Decomposition
For an $m \times n$ matrix A of rank r there exists a
factorization (Singular Value Decomposition,
SVD) as follows: $A = U \Sigma V^T$
The columns of U are left singular vectors
The columns of V are right singular vectors
$\Sigma$ is a diagonal matrix with singular values
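
A small worked example, written in the reduced form where only the r nonzero singular values are kept (the slide does not say which form it uses, so this is an assumption). In that form $U^T U = I$, $V^T V = I$, and $\sigma_1 \ge \ldots \ge \sigma_r > 0$:

$$ A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad r = 1, \qquad \sigma_1 = 2, \qquad u_1 = v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} $$

$$ \sigma_1 u_1 v_1^T = 2 \cdot \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = A $$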
47
Singular Value Decomposition

Illustration of SVD dimensions and sparseness

50
Low Rank Approximation

Approximate matrix with the largest singular
values and singular vectors
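
Written out, the approximation keeps the k largest singular values and their singular vectors. The error expression below is the standard Eckart-Young result; the symbols $A_k$, $\sigma_i$, $u_i$, $v_i$ follow the usual SVD notation and are not taken from the slide, whose formula did not survive extraction:

$$ A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T, \qquad \sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0 $$

$$ \| A - A_k \|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2} $$

No other matrix of rank at most k is closer to A in Frobenius norm, so the approximation is good whenever the discarded singular values are small.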

53
Latent Semantic Indexing (LSI)

Computation using singular value decomposition
(SVD) with the first m largest singular values
and singular vectors, where m is the number of
concepts
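
One common way to write this down, assuming A is the term-by-document matrix (the slide's own formula was lost, so the notation below is an assumption):

$$ A \approx A_m = U_m \Sigma_m V_m^T $$

Here $U_m$ and $V_m$ hold the first m left and right singular vectors and $\Sigma_m$ the m largest singular values. Document j is represented in the concept space by the j-th row of $V_m$, and a query vector q is mapped into the same space by

$$ \hat{q} = \Sigma_m^{-1} U_m^T q $$

after which query and documents can be compared with the cosine measure, as in the original term space.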

54
Finding Good Concepts
55
SVD Example: m = 2
59
SVD Orthogonality
$v_1 \cdot v_2 = 0$
$u_1 \cdot u_2 = 0$
60
SVD Properties
[Figure: the original matrix X, with rank(X) = 9, and its approximation X', with rank(X') = 2]