Title: Vector Space Model
1Vector Space Model
2Basic Issues in A Retrieval Model
3Basic Issues in IR
- How to represent queries?
- How to represent documents?
- How to compute the similarity between documents and queries?
- How to utilize users' feedback to enhance retrieval performance?
4IR Formal Formulation
- Vocabulary $V = \{w_1, w_2, \ldots, w_n\}$ of language
- Query $q = q_1, \ldots, q_m$, where $q_i \in V$
- Collection $C = \{d_1, \ldots, d_k\}$
- Document $d_i = (d_{i1}, \ldots, d_{im_i})$, where $d_{ij} \in V$
- Set of relevant documents $R(q) \subseteq C$
- Generally unknown and user-dependent
- Query is a hint on which docs are in $R(q)$
- Task: compute $R'(q)$, an approximation of $R(q)$
5Computing R'(q)
- Strategy 1: Document selection
- Classification function $f(d,q) \in \{0,1\}$
- Outputs 1 for relevance, 0 for irrelevance
- $R'(q)$ is determined as the set $\{d \in C \mid f(d,q) = 1\}$
- System must decide if a doc is relevant or not (absolute relevance)
- Example: Boolean retrieval
6Document Selection Approach
[Figure: the true relevant set R(q) compared with the set chosen by the classifier C(q)]
7Computing R'(q)
- Strategy 2: Document ranking
- Similarity function $f(d,q) \in \mathbb{R}$
- Outputs a similarity between document d and query q
- Cut off $\theta$
- The minimum similarity for document and query to be relevant
- $R'(q)$ is determined as the set $\{d \in C \mid f(d,q) > \theta\}$
- System must decide if one doc is more likely to be relevant than another (relative relevance)
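The two strategies can be contrasted in a minimal sketch. The toy documents, the Boolean AND rule, and the overlap-based similarity below are assumptions chosen only for illustration; they are not taken from the slides.

```python
def select_documents(docs, query_terms):
    """Strategy 1 (selection): f(d, q) = 1 only if d contains every query term."""
    return {doc_id for doc_id, terms in docs.items() if query_terms <= terms}


def rank_documents(docs, query_terms, theta):
    """Strategy 2 (ranking): keep docs with f(d, q) > theta, highest score first."""
    scored = []
    for doc_id, terms in docs.items():
        # Assumed toy similarity: fraction of query terms found in the document.
        score = len(query_terms & terms) / len(query_terms)
        if score > theta:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical collection: each document is a set of terms.
docs = {
    "d1": {"java", "coffee", "island"},
    "d2": {"java", "programming"},
    "d3": {"tea", "leaves"},
}
query = {"java", "coffee"}

selected = select_documents(docs, query)
ranked = rank_documents(docs, query, theta=0.0)
print(selected)
print(ranked)
```

Selection returns only d1, while ranking also surfaces the partial match d2 below it, which is the relative-relevance behavior described above.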
8Document Selection vs. Ranking
[Figure: the true relevant set R(q) compared with a ranked list cut off at threshold θ]
10Ranking is often preferred
- Similarity function is more general than classification function
- The classifier is unlikely to be accurate
- Ambiguous information needs, short queries
- Relevance is a subjective concept
- Absolute relevance vs. relative relevance
11Probability Ranking Principle
- As stated by Cooper
- Ranking documents by decreasing probability of usefulness maximizes the utility of IR systems
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
12Vector Space Model
- Any text object can be represented by a term vector
- Examples: documents, queries, sentences, ...
- A query is viewed as a short document
- Similarity is determined by the relationship between two vectors
- e.g., the cosine of the angle between the vectors, or the distance between vectors
- The SMART system
- Developed at Cornell University, 1960-1999
- Still used widely
13Vector Space Model illustration
       Java  Starbuck  Microsoft
D1     1     1         0
D2     0     1         1
D3     1     0         1
D4     1     1         1
Query  1     0.1       1
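As a small sketch of how the illustration can be scored, the code below compares the query with D1 to D4 using both the dot product and the cosine of the angle. The vectors come from the table; the helper names `dot` and `cosine` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Term order: Java, Starbuck, Microsoft (values from the illustration).
docs = {
    "D1": [1, 1, 0],
    "D2": [0, 1, 1],
    "D3": [1, 0, 1],
    "D4": [1, 1, 1],
}
query = [1, 0.1, 1]

dot_scores = {name: dot(query, vec) for name, vec in docs.items()}
cos_scores = {name: cosine(query, vec) for name, vec in docs.items()}

best_by_dot = max(dot_scores, key=dot_scores.get)
best_by_cos = max(cos_scores, key=cos_scores.get)
print(best_by_dot, best_by_cos)
```

The two measures disagree on the top document: the dot product favors D4, which matches all three terms, while the cosine favors D3, whose direction is closest to a query that puts little weight on Starbuck.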
15Vector Space Model Similarity
- Represent both documents and queries by word histogram vectors
- n: the number of unique words
- A query $q = (q_1, q_2, \ldots, q_n)$
- $q_i$: occurrence of the i-th word in query
- A document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$
- $d_{k,i}$: occurrence of the i-th word in document
- Similarity of a query q to a document dk
16Some Background in Linear Algebra
- Dot product (scalar product)
- Example
- Measure the similarity by dot product
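The formula on this slide did not survive extraction. The standard definition of the dot product, written with the query and document notation used in this deck, is sketched below; the worked numbers reuse the query and D4 from the Java/Starbuck/Microsoft illustration.

```latex
q \cdot d_k = \sum_{i=1}^{n} q_i \, d_{k,i}

% Example: q = (1, 0.1, 1), d = (1, 1, 1)
q \cdot d = 1 \cdot 1 + 0.1 \cdot 1 + 1 \cdot 1 = 2.1
```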
17Some Background in Linear Algebra
- Length of a vector
- Angle between two vectors
[Figure: the vectors q and dk and the angle between them]
18Some Background in Linear Algebra
- Example
- Measure similarity by the angle between vectors
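The formulas for vector length and angle were lost with the slide images. The standard definitions are sketched below in the deck's notation; the example uses the query and D3 from the Java/Starbuck/Microsoft illustration.

```latex
|q| = \sqrt{\sum_{i=1}^{n} q_i^2}
\qquad
\cos\theta(q, d_k) = \frac{q \cdot d_k}{|q| \, |d_k|}

% Example: q = (1, 0.1, 1), d = (1, 0, 1)
\cos\theta = \frac{2}{\sqrt{2.01} \, \sqrt{2}} \approx 0.9975
```

A cosine close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.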
19Vector Space Model Similarity
- Given
- A query $q = (q_1, q_2, \ldots, q_n)$
- $q_i$: occurrence of the i-th word in query
- A document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$
- $d_{k,i}$: occurrence of the i-th word in document
- Similarity of a query q to a document dk
22Term Weighting
- $w_{k,i}$: the importance of the i-th word for document $d_k$
- Why weighting?
- Some query terms carry more information
- TF.IDF weighting
- TF (Term Frequency): within-doc frequency
- IDF (Inverse Document Frequency)
- TF normalization: avoid the bias of long documents
23TF Weighting
- A term is important if it occurs frequently in a document
- Formulas
- f(t,d): term occurrence of word t in document d
- Maximum frequency normalization
Term frequency normalization
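The slide's formula was lost in extraction. A commonly used form of maximum frequency normalization is TF(t,d) = 0.5 + 0.5 · f(t,d) / MaxFreq(d); the sketch below assumes that form, and the function name and the zero score for absent terms are illustrative choices rather than something the slides state.

```python
from collections import Counter


def max_freq_tf(term, tokens):
    """Maximum-frequency-normalized TF (assumed form: 0.5 + 0.5 * f / max_f)."""
    counts = Counter(tokens)
    if counts[term] == 0:
        # Assumption: a term that does not occur gets no weight at all.
        return 0.0
    return 0.5 + 0.5 * counts[term] / max(counts.values())


# Hypothetical tokenized document.
tokens = "java java starbuck microsoft".split()
print(max_freq_tf("java", tokens))      # most frequent term
print(max_freq_tf("starbuck", tokens))  # occurs half as often
```

Dividing by the most frequent term's count keeps the score in a fixed range, so a long document cannot dominate simply by repeating words.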
24TF Weighting
- A term is important if it occurs frequently in a document
- Formulas
- f(t,d): term occurrence of word t in document d
- Okapi/BM25 TF
Term frequency normalization
doclen(d): the length of document d
avg_doclen: average document length
k, b: predefined constants
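The slide's BM25 formula did not survive extraction. The sketch below uses the widely cited BM25 term-frequency component, which may differ in scaling from the form on the original slide; the defaults k = 1.2 and b = 0.75 are conventional values assumed here, not values given in the deck.

```python
def bm25_tf(f, doclen, avg_doclen, k=1.2, b=0.75):
    """BM25-style TF: saturates as f grows and penalizes longer documents."""
    # Length normalizer: 1.0 for an average-length document, larger for long ones.
    norm = 1.0 - b + b * doclen / avg_doclen
    return (k + 1.0) * f / (f + k * norm)


# Hypothetical values: a term occurring 1, 2, 3 times in an average-length doc.
for f in (1, 2, 3):
    print(f, round(bm25_tf(f, doclen=100, avg_doclen=100), 3))

# The same count in a document twice the average length scores lower.
print(round(bm25_tf(3, doclen=200, avg_doclen=100), 3))
```

Each additional occurrence adds less than the previous one, and the score never exceeds k + 1, which matches the idea that repeated occurrences are less informative than the first.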
25TF Normalization
- Why?
- Document length variation
- Repeated occurrences are less informative than the first occurrence
- Two views of document length
- A doc is long because it uses more words
- A doc is long because it has more contents
- Generally penalize long docs, but avoid over-penalizing (pivoted normalization)
26TF Normalization
[Figure: normalized TF plotted against raw TF, illustrating pivoted normalization]
27IDF Weighting
- A term is discriminative if it occurs only in a few documents
- Formula: IDF(t) = 1 + log(n/m), where n is the total number of docs and m is the number of docs with term t (doc freq)
- Can be interpreted as mutual information
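A minimal sketch of the IDF formula above. The natural logarithm, the toy collection, and the zero score for unseen terms are assumptions; the slide does not specify a log base.

```python
import math


def idf(term, collection):
    """IDF(t) = 1 + log(n / m), with n docs in total and m docs containing t."""
    n = len(collection)
    m = sum(1 for doc in collection if term in doc)
    if m == 0:
        # Assumption: a term that appears nowhere gets a score of 0.
        return 0.0
    return 1.0 + math.log(n / m)


# Hypothetical collection: each document is a set of terms.
collection = [
    {"the", "java"},
    {"the", "coffee"},
    {"the", "java", "island"},
    {"the"},
]

for term in ("the", "java", "island"):
    print(term, round(idf(term, collection), 3))
```

A term that appears in every document ("the") gets the minimum value 1, while the rarest term ("island") gets the highest.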
29TF-IDF Weighting
- TF-IDF weighting
- The importance of a term t to a document d
- weight(t,d) = TF(t,d) × IDF(t)
- Frequent in doc → high TF → high weight
- Rare in collection → high IDF → high weight
- Both $q_i$ and $d_{k,i}$ are binary values, i.e. presence and absence of a word in query and document.
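The weighting rule weight(t,d) = TF(t,d) × IDF(t) can be sketched as follows. Raw counts are used for TF and 1 + ln(n/m) for IDF; both choices, along with the toy documents and the function name, are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return the sorted vocabulary and one TF-IDF weight vector per document."""
    n = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vocab = sorted(doc_freq)
    idf = {t: 1.0 + math.log(n / doc_freq[t]) for t in vocab}
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical tokenized documents.
docs = [
    ["java", "java", "coffee"],
    ["java", "microsoft"],
    ["coffee", "coffee", "coffee"],
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

In the second document, "microsoft" outweighs "java" even though both occur once, because "microsoft" is rarer in the collection.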
30Problems with Vector Space Model
- Still limited to word based matching
- A document will never be retrieved if it does not contain any query word
- How to modify the vector space model?
31Choice of Bases
[Figure: document vectors D and D1 and query vector Q drawn under different choices of basis vectors]
36Choosing Bases for VSM
- Modify the bases of the vector space
- Each basis is a concept: a group of words
- Every document is a vector in the concept space
    c1  c2  c3  c4  c5  m1  m2  m3  m4
A1  1   1   1   1   1   0   0   0   0
A2  0   0   0   0   0   1   1   1   1
38Choosing Bases for VSM
- Modify the bases of the vector space
- Each basis is a concept: a group of words
- Every document is a mixture of concepts
- How to define/select basic concepts?
- In the VS model, each term is viewed as an independent concept
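A small sketch of why changing the basis helps, using the concepts A1 and A2 from the table (A1 groups c1 to c5, A2 groups m1 to m4). The document and query vectors are hypothetical, and the bases are left unnormalized to keep the arithmetic simple.

```python
# Term order: c1..c5, m1..m4. Concept definitions taken from the table.
concepts = {
    "A1": [1, 1, 1, 1, 1, 0, 0, 0, 0],
    "A2": [0, 0, 0, 0, 0, 1, 1, 1, 1],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def to_concept_space(term_vector):
    """Project a term vector onto each concept basis (one coordinate per concept)."""
    return [dot(term_vector, basis) for basis in concepts.values()]


doc = [1, 1, 0, 0, 0, 0, 0, 0, 0]    # hypothetical document containing c1 and c2
query = [0, 0, 1, 0, 0, 0, 0, 0, 0]  # hypothetical query containing only c3

term_sim = dot(doc, query)
concept_sim = dot(to_concept_space(doc), to_concept_space(query))
print(term_sim, concept_sim)
```

In term space the document and query share no word, so their similarity is 0. In concept space both fall under A1, so the document can be retrieved.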
39Basic Matrix Multiplication
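The content of this slide was lost in extraction. As a reminder, the standard definition of the matrix product is sketched below; the 2×2 matrices are hypothetical values chosen for the worked example.

```latex
(AB)_{ij} = \sum_{k} A_{ik} B_{kj}

% Example
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
```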
41Linear Algebra Basic Eigen Analysis
- Eigenvectors (for a square $m \times m$ matrix $S$): $S v = \lambda v$, where $\lambda$ is the eigenvalue and $v$ is the (right) eigenvector
- Example
43Linear Algebra Basic Eigen Decomposition
$S = U \Lambda U^{T}$
- This is generally true for a symmetric square matrix
- Columns of U are eigenvectors of S
- Diagonal elements of $\Lambda$ are eigenvalues of S
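The decomposition can be checked numerically. The sketch below assumes NumPy is available and uses a hypothetical 2×2 symmetric matrix; `numpy.linalg.eigh` is the routine for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

# Hypothetical symmetric matrix.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of U are orthonormal eigenvectors of S.
eigenvalues, U = np.linalg.eigh(S)
Lambda = np.diag(eigenvalues)

# Rebuild S from its eigen decomposition U Lambda U^T.
reconstructed = U @ Lambda @ U.T

print(eigenvalues)
print(np.allclose(reconstructed, S))
```

Each column v of U satisfies S v = λ v for the matching diagonal entry λ of Λ.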
46Singular Value Decomposition
For an $m \times n$ matrix $A$ of rank $r$ there exists a factorization (Singular Value Decomposition, SVD) as follows:
$A = U \Sigma V^{T}$
The columns of U are left singular vectors.
The columns of V are right singular vectors.
$\Sigma$ is a diagonal matrix with singular values.
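A minimal sketch of the factorization, assuming NumPy is available. The matrix reuses the D1 to D4 rows of the Java/Starbuck/Microsoft illustration as a document-term matrix; `full_matrices=False` requests the reduced form, in which U has as many columns as there are singular values.

```python
import numpy as np

# Rows: D1..D4; columns: Java, Starbuck, Microsoft.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# s holds the singular values in descending order; Vt is V transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma = np.diag(s)

reconstructed = U @ Sigma @ Vt

print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(reconstructed, A))
```

The singular vectors are orthonormal, so U^T U and V^T V are both identity matrices.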
47Singular Value Decomposition
- Illustration of SVD dimensions and sparseness
50Low Rank Approximation
- Approximate matrix with the largest singular
values and singular vectors
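Truncating the SVD gives the approximation described above. The sketch below assumes NumPy is available; the function name `low_rank_approximation` is hypothetical, and the matrix reuses the D1 to D4 rows of the Java/Starbuck/Microsoft illustration.

```python
import numpy as np


def low_rank_approximation(A, k):
    """Keep only the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

A2 = low_rank_approximation(A, 2)
s = np.linalg.svd(A, compute_uv=False)

# Frobenius-norm error of the rank-2 approximation.
error = np.linalg.norm(A - A2)

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A2))
print(error, s[2])
```

The approximation error in the Frobenius norm equals the size of the discarded singular values, here the single value `s[2]`.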
53Latent Semantic Indexing (LSI)
- Computation using singular value decomposition (SVD) with the first m largest singular values and singular vectors, where m is the number of concepts
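A minimal LSI sketch under stated assumptions: NumPy is available, the toy document-term matrix is hypothetical, and the query is mapped into concept space by projecting it onto the top m right singular vectors. This is one common way to fold in a query, not necessarily the procedure on the original slide.

```python
import numpy as np

# Hypothetical collection. Columns: car, automobile, engine, coffee, tea.
A = np.array([
    [1, 0, 1, 0, 0],   # d0: car engine
    [0, 1, 1, 0, 0],   # d1: automobile engine
    [0, 0, 0, 1, 1],   # d2: coffee tea
    [0, 0, 0, 1, 0],   # d3: coffee
], dtype=float)

m = 2  # number of concepts to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_m = Vt[:m, :].T            # terms x m: the m strongest concepts

doc_concepts = A @ V_m       # each document as a mixture of m concepts
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # query containing only "car"
query_concepts = query @ V_m


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


scores = [cosine(query_concepts, d) for d in doc_concepts]
print([round(x, 3) for x in scores])
```

Document d1 shares no word with the query, so plain term matching scores it 0, yet in concept space it scores as high as d0 because "car" and "automobile" both co-occur with "engine".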
54Finding Good Concepts
55SVD Example: m = 2
59SVD Orthogonality
$u_1 \cdot u_2 = 0$
$v_1 \cdot v_2 = 0$
60SVD Properties
[Figure: two matrices labeled X, one with rank(X) = 2 and one with rank(X) = 9]
- rank(S): the maximum number of either row or column vectors within matrix S that are linearly independent.
- SVD produces the best low rank approximation
62SVD Visualization
- SVD tries to preserve the Euclidean distance of
document vectors