Generalized Vector Space Model

Slides: 23
Provided by: cs038

1
Generalized Vector Space Model
  • Definition: Let $\vec{k}_i$ be a vector associated
    with the index term $k_i$. Independence of index
    terms in the vector model implies that the set
    of vectors $\{\vec{k}_1, \vec{k}_2, \ldots, \vec{k}_t\}$ is linearly
    independent and forms a basis for the subspace of interest.
    The dimension of this space is the number $t$ of
    index terms in the collection.

2
An example of independent vectors
  • V1 = (1, 0, 0), V2 = (0, 1, 0), V3 = (0, 0, 1).
  • V1 · V2 = 0 + 0 + 0 = 0.
  • Vi · Vj = 0 for i ≠ j.
  • Each element represents a keyword.
  • Different keywords are treated as totally
    different items. This is not reasonable, since
    keywords are sometimes related.
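
The orthogonality claim on this slide can be checked directly. The snippet below is a minimal sketch; the helper name `dot` is introduced here for illustration and is not part of the slides.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

V1, V2, V3 = (1, 0, 0), (0, 1, 0), (0, 0, 1)
basis = [V1, V2, V3]

# Every pair of distinct keyword vectors has dot product 0,
# so the model sees no relationship between different keywords.
for i, u in enumerate(basis):
    for j, v in enumerate(basis):
        if i < j:
            print(f"V{i + 1} . V{j + 1} = {dot(u, v)}")
# V1 . V2 = 0
# V1 . V3 = 0
# V2 . V3 = 0
```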

3
  • Definition: Given the set $\{k_1, k_2, \ldots, k_t\}$ of index
    terms in a collection, as before, let $w_{i,j}$ be the
    weight associated with the term-document pair
    $[k_i, d_j]$. If the $w_{i,j}$ weights are all binary, then
    all possible patterns of term co-occurrence
    (inside documents) can be represented by a set of
    $2^t$ minterms given by $\mathit{min}_1 = (0, 0, \ldots, 0)$,
    $\mathit{min}_2 = (1, 0, \ldots, 0)$, ..., $\mathit{min}_{2^t} = (1, 1, \ldots, 1)$.
  • Let $g_i(\mathit{min}_j)$ return the weight $\{0, 1\}$ of the
    index term $k_i$ in the minterm $\mathit{min}_j$. ($g_i(d_j)$ is
    defined similarly.)
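
As a rough illustration of the definition, the sketch below enumerates all 2^t binary patterns for a small t and implements the lookup g_i. The function names `minterms` and `g` are hypothetical, and the enumeration order produced by `itertools.product` is an assumption that need not match the numbering on the slide; only the set of patterns matters here.

```python
from itertools import product

def minterms(t):
    """Return all 2**t binary co-occurrence patterns over t index terms."""
    return [tuple(bits) for bits in product((0, 1), repeat=t)]

def g(i, pattern):
    """Weight (0 or 1) of index term k_i (1-based) in a minterm pattern."""
    return pattern[i - 1]

patterns = minterms(3)
print(len(patterns))               # 8
print(patterns[0], patterns[-1])   # (0, 0, 0) (1, 1, 1)
print(g(2, (1, 0, 1)))             # 0
```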

4
  • Definition: Let us define the following set of
    vectors (containing $2^t$ elements):
  • $\vec{m}_1 = (0, 0, \ldots, 0, 1)$
  • $\vec{m}_2 = (0, 0, \ldots, 1, 0)$
  • ...
  • $\vec{m}_{2^t} = (1, 0, \ldots, 0, 0)$
  • where each vector $\vec{m}_i$ is associated with the
    respective minterm $\mathit{min}_i$.
  • $\vec{m}_i \cdot \vec{m}_j = 0$ for all $i \neq j$.

5
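The new vector $\vec{k}_i$ is defined as
$$\vec{k}_i = \frac{\sum_{\forall r,\ g_i(\mathit{min}_r) = 1} c_{i,r}\, \vec{m}_r}{\sqrt{\sum_{\forall r,\ g_i(\mathit{min}_r) = 1} c_{i,r}^2}} \qquad (1.1)$$
$$c_{i,r} = \sum_{d_j \,\mid\, g_l(d_j) = g_l(\mathit{min}_r)\ \text{for all } l} w_{i,j} \qquad (1.2)$$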
6
(No Transcript)
7
An example of the Generalized Vector Space Model
  • Suppose that the system has 12 documents and 4
    keywords.
  • D1 = (2, 1, 0, 0), D2 = (5, 1, 0, 0), D3 = (1, 1, 1, 1),
  • D4 = (0, 0, 2, 2), D5 = (0, 1, 1, 2), D6 = (0, 0, 1, 1),
  • D7 = (0, 0, 1, 0), D8 = (1, 1, 0, 0), D9 = (2, 1, 1, 1),
  • D10 = (0, 2, 2, 2), D11 = (1, 0, 2, 0), D12 = (0, 0, 2, 1).
  • Minterms: 6 minterms are used as independent
    vectors to form a basis.
  • min1 = (1, 1, 0, 0), min2 = (1, 1, 1, 1), min3 = (0, 0, 1, 1),
  • min4 = (0, 1, 1, 1), min5 = (0, 0, 1, 0), min6 = (1, 0, 1, 0).
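
To see which documents share a minterm, each weight vector can be reduced to its binary pattern (nonzero weight becomes 1) and matched against the six minterms. This is a small sketch using the data from this slide; the helper `pattern` and the dictionary layout are assumptions made for illustration.

```python
docs = {
    "D1": (2, 1, 0, 0), "D2": (5, 1, 0, 0), "D3": (1, 1, 1, 1),
    "D4": (0, 0, 2, 2), "D5": (0, 1, 1, 2), "D6": (0, 0, 1, 1),
    "D7": (0, 0, 1, 0), "D8": (1, 1, 0, 0), "D9": (2, 1, 1, 1),
    "D10": (0, 2, 2, 2), "D11": (1, 0, 2, 0), "D12": (0, 0, 2, 1),
}
minterms = {
    "min1": (1, 1, 0, 0), "min2": (1, 1, 1, 1), "min3": (0, 0, 1, 1),
    "min4": (0, 1, 1, 1), "min5": (0, 0, 1, 0), "min6": (1, 0, 1, 0),
}

def pattern(weights):
    """Binary co-occurrence pattern: 1 where the term weight is nonzero."""
    return tuple(1 if w > 0 else 0 for w in weights)

# Invert the minterm table so a pattern can be looked up by value.
by_pattern = {p: name for name, p in minterms.items()}
groups = {name: [] for name in minterms}
for doc, weights in docs.items():
    groups[by_pattern[pattern(weights)]].append(doc)

for name, members in groups.items():
    print(name, members)
# min1 ['D1', 'D2', 'D8']
# min2 ['D3', 'D9']
# min3 ['D4', 'D6', 'D12']
# min4 ['D5', 'D10']
# min5 ['D7']
# min6 ['D11']
```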

8
Generalized Vector Space Model
  • Independent vectors:
  • V1 = (1, 0, 0, 0, 0, 0), V2 = (0, 1, 0, 0, 0, 0),
  • V3 = (0, 0, 1, 0, 0, 0), V4 = (0, 0, 0, 1, 0, 0),
  • V5 = (0, 0, 0, 0, 1, 0), V6 = (0, 0, 0, 0, 0, 1).
  • $V_i$ represents minterm $\mathit{min}_i$.
  • Each pair $V_i$, $V_j$ with $i \neq j$ is orthogonal (dot
    product = 0).
  • The four keywords k1, k2, k3, and k4 are
    represented by combinations of the independent
    vectors.

9
Generalized Vector Space Model
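  • The four keywords k1, k2, k3, and k4 are
    represented by combinations of the independent
    vectors.
  • $\vec{k}_1 = (c_{1,1}V_1 + c_{1,2}V_2 + c_{1,3}V_3 + c_{1,4}V_4 + c_{1,5}V_5 + c_{1,6}V_6)/C$
  • where $c_{1,1} = w_{1,1} + w_{1,2} + w_{1,8} = 2 + 5 + 1 = 8$ (D1, D2, and D8
    have minterm min1), $c_{1,2} = w_{1,3} + w_{1,9} = 1 + 2 = 3$ (D3 and
    D9 have minterm min2), $c_{1,3} = w_{1,4} + w_{1,6} + w_{1,12} = 0 + 0 + 0 = 0$
    (D4, D6, and D12 have minterm min3),
    $c_{1,4} = w_{1,5} + w_{1,10} = 0 + 0 = 0$, $c_{1,5} = w_{1,7} = 0$, $c_{1,6} = w_{1,11} = 1$.
  • $C = (c_{1,1}^2 + c_{1,2}^2 + c_{1,3}^2 + c_{1,4}^2 + c_{1,5}^2 + c_{1,6}^2)^{0.5}$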

10
Generalized Vector Space Model
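  • $\vec{k}_2 = (c_{2,1}V_1 + c_{2,2}V_2 + c_{2,3}V_3 + c_{2,4}V_4 + c_{2,5}V_5 + c_{2,6}V_6)/C$
  • where $c_{2,1} = w_{2,1} + w_{2,2} + w_{2,8} = 1 + 1 + 1 = 3$ (D1, D2, and D8
    have minterm min1), $c_{2,2} = w_{2,3} + w_{2,9} = 1 + 1 = 2$ (D3 and D9
    have minterm min2), $c_{2,3} = w_{2,4} + w_{2,6} + w_{2,12} = 0 + 0 + 0 = 0$
    (D4, D6, and D12 have minterm min3),
    $c_{2,4} = w_{2,5} + w_{2,10} = 1 + 2 = 3$, $c_{2,5} = w_{2,7} = 0$, $c_{2,6} = w_{2,11} = 0$.
  • $C = (c_{2,1}^2 + c_{2,2}^2 + c_{2,3}^2 + c_{2,4}^2 + c_{2,5}^2 + c_{2,6}^2)^{0.5}$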

11
Generalized Vector Space Model
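  • $\vec{k}_3 = (c_{3,1}V_1 + c_{3,2}V_2 + c_{3,3}V_3 + c_{3,4}V_4 + c_{3,5}V_5 + c_{3,6}V_6)/C$
  • where $c_{3,1} = w_{3,1} + w_{3,2} + w_{3,8} = 0$ (D1, D2, and D8 have
    minterm min1), $c_{3,2} = w_{3,3} + w_{3,9} = 1 + 1 = 2$ (D3 and D9 have
    minterm min2), $c_{3,3} = w_{3,4} + w_{3,6} + w_{3,12} = 2 + 1 + 2 = 5$ (D4, D6,
    and D12 have minterm min3), $c_{3,4} = w_{3,5} + w_{3,10} = 1 + 2 = 3$,
    $c_{3,5} = w_{3,7} = 1$, $c_{3,6} = w_{3,11} = 2$.
  • $C = (c_{3,1}^2 + c_{3,2}^2 + c_{3,3}^2 + c_{3,4}^2 + c_{3,5}^2 + c_{3,6}^2)^{0.5}$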

12
Generalized Vector Space Model
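  • $\vec{k}_4 = (c_{4,1}V_1 + c_{4,2}V_2 + c_{4,3}V_3 + c_{4,4}V_4 + c_{4,5}V_5 + c_{4,6}V_6)/C$
  • where $c_{4,1} = w_{4,1} + w_{4,2} + w_{4,8} = 0$ (D1, D2, and D8 have
    minterm min1), $c_{4,2} = w_{4,3} + w_{4,9} = 1 + 1 = 2$ (D3 and D9 have
    minterm min2), $c_{4,3} = w_{4,4} + w_{4,6} + w_{4,12} = 2 + 1 + 1 = 4$ (D4, D6,
    and D12 have minterm min3), $c_{4,4} = w_{4,5} + w_{4,10} = 2 + 2 = 4$,
    $c_{4,5} = w_{4,7} = 0$, $c_{4,6} = w_{4,11} = 0$.
  • $C = (c_{4,1}^2 + c_{4,2}^2 + c_{4,3}^2 + c_{4,4}^2 + c_{4,5}^2 + c_{4,6}^2)^{0.5}$
  • The keyword vectors are thus converted from vectors of
    length 4 into vectors of length 6.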
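
The coefficient sums on the last four slides can be reproduced mechanically. The following sketch assumes the 12 documents and 6 minterms of the running example; the function names `pattern`, `coefficients`, and `term_vector` are hypothetical helpers, not names from the slides.

```python
import math

# Term weights per document (k1..k4), D1 through D12 from the example.
docs = [
    (2, 1, 0, 0), (5, 1, 0, 0), (1, 1, 1, 1), (0, 0, 2, 2),
    (0, 1, 1, 2), (0, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 0),
    (2, 1, 1, 1), (0, 2, 2, 2), (1, 0, 2, 0), (0, 0, 2, 1),
]
# min1..min6, in the order used by V1..V6.
minterms = [
    (1, 1, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1),
    (0, 1, 1, 1), (0, 0, 1, 0), (1, 0, 1, 0),
]

def pattern(weights):
    """Binary co-occurrence pattern of a document."""
    return tuple(int(w > 0) for w in weights)

def coefficients(i):
    """c_{i,r} for 0-based term index i: sum of w_{i,j} over documents with minterm r."""
    return [sum(d[i] for d in docs if pattern(d) == m) for m in minterms]

def term_vector(i):
    """Coefficients divided by C, the Euclidean norm of the coefficient list."""
    c = coefficients(i)
    C = math.sqrt(sum(x * x for x in c))
    return [x / C for x in c]

for i in range(4):
    print(f"k{i + 1}: c = {coefficients(i)}")
# k1: c = [8, 3, 0, 0, 0, 1]
# k2: c = [3, 2, 0, 3, 0, 0]
# k3: c = [0, 2, 5, 3, 1, 2]
# k4: c = [0, 2, 4, 4, 0, 0]

# Unlike the original unit vectors, two keyword vectors can now overlap.
k3, k4 = term_vector(2), term_vector(3)
print(sum(a * b for a, b in zip(k3, k4)))
```

The final line prints the dot product of the k3 and k4 vectors, which is nonzero because both keywords occur in documents sharing min2, min3, and min4.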

13
Extended Boolean Model
  • Disadvantages of the Boolean model:
  • No term weights are used.
  • Counterexample: query q = kx AND ky.
  • A document containing just one term, e.g., kx,
    is considered as irrelevant as a document
    containing neither term.
  • The size of the output might be too large or too
    small.

14
Extended Boolean Model
  • The Extended Boolean model was introduced in 1983
    by Salton, Fox, and Wu [703].
  • The idea is to make use of term weights, as in the
    vector space model.
  • Strategy: combine Boolean queries with the vector
    space model.
  • Why not just use the vector space model?
  • Advantage: it is easy for the user to provide a query.

15
Extended Boolean Model
  • Each document is represented by a vector (similar
    to the vector space model).
  • Remember the formula.
  • The query is expressed as a Boolean formula.
  • How to rank the documents?

16
Fig. Extended Boolean logic considering the space
composed of two terms kx and ky only.
[Figure not transcribed: two plots, each with axes kx and ky.]

17
Extended Boolean Model
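  • For the query q = kx OR ky, (0, 0) is the point we try
    to avoid. Thus, we can use
  • $sim(q_{or}, d) = \sqrt{(x^2 + y^2)/2}$, where x and y are
    the weights of kx and ky in d,
  • to rank the documents.
  • The bigger the better.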

18
Extended Boolean Model
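  • For the query q = kx AND ky, (1, 1) is the most
    desirable point.
  • We use
  • $sim(q_{and}, d) = 1 - \sqrt{((1 - x)^2 + (1 - y)^2)/2}$, where x
    and y are the weights of kx and ky in d,
  • to rank the documents.
  • The bigger the better.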
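
The two ranking ideas can be sketched in a few lines. This assumes the usual Euclidean form of the extended Boolean model (distance from (0, 0) for OR, distance from (1, 1) for AND, each normalized by the square root of 2); the function names are illustrative only.

```python
import math

def sim_or(x, y):
    """Normalized distance from (0, 0), the point to avoid for 'kx OR ky'."""
    return math.sqrt((x ** 2 + y ** 2) / 2)

def sim_and(x, y):
    """One minus the normalized distance from (1, 1), the ideal point for 'kx AND ky'."""
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

for x, y in [(0, 0), (1, 0), (1, 1)]:
    print((x, y), round(sim_or(x, y), 3), round(sim_and(x, y), 3))
# (0, 0) 0.0 0.0
# (1, 0) 0.707 0.293
# (1, 1) 1.0 1.0
```

Note that a document containing only kx now scores 0.293 for the AND query instead of 0, which addresses the counterexample raised against the plain Boolean model.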

19
Extend the idea to m terms
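  • $q_{or} = k_1 \vee^p k_2 \vee^p \cdots \vee^p k_m$, with
    $sim(q_{or}, d_j) = \left(\frac{x_1^p + x_2^p + \cdots + x_m^p}{m}\right)^{1/p}$
  • $q_{and} = k_1 \wedge^p k_2 \wedge^p \cdots \wedge^p k_m$, with
    $sim(q_{and}, d_j) = 1 - \left(\frac{(1 - x_1)^p + (1 - x_2)^p + \cdots + (1 - x_m)^p}{m}\right)^{1/p}$
  • Here $x_i$ is the weight of $k_i$ in $d_j$ and $1 \le p \le \infty$.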
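
A minimal sketch of the m-term generalization follows, assuming the standard p-norm form in which the OR score is the normalized p-distance from the all-zeros point and the AND score is one minus the normalized p-distance from the all-ones point. The names `sim_or` and `sim_and` are chosen here for illustration. The assertions check that m = 2 and p = 2 recover the two-term Euclidean case.

```python
import math

def sim_or(xs, p):
    """p-norm OR score for term weights xs, each in [0, 1]."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    """p-norm AND score for term weights xs, each in [0, 1]."""
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

x, y = 0.6, 0.3
# With two terms and p = 2 the general form reduces to the Euclidean formulas.
assert math.isclose(sim_or([x, y], 2), math.sqrt((x ** 2 + y ** 2) / 2))
assert math.isclose(sim_and([x, y], 2), 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2))

# Three terms, p = 2.
print(sim_or([0.8, 0.1, 0.0], 2), sim_and([0.8, 0.1, 0.0], 2))
```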

20
Properties
  • The p-norm as defined above enjoys a couple of
    interesting properties. First, when
    p = 1 it can be verified that
  • $sim(q_{or}, d_j) = sim(q_{and}, d_j) = (x_1 + x_2 + \cdots + x_m)/m$
  • Second, when $p = \infty$ it can be verified that
  • $sim(q_{or}, d_j) = \max(x_i)$
  • $sim(q_{and}, d_j) = \min(x_i)$
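
Both limiting cases can be checked numerically. The sketch below assumes the standard p-norm definitions and uses a large finite p (200, an arbitrary choice) as a stand-in for p going to infinity, so the max and min are only approached, not reached exactly.

```python
def sim_or(xs, p):
    """p-norm OR score for term weights xs, each in [0, 1]."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    """p-norm AND score for term weights xs, each in [0, 1]."""
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

xs = [0.2, 0.5, 0.9]
mean = sum(xs) / len(xs)

# p = 1: OR and AND both collapse to the average weight (up to rounding).
print(sim_or(xs, 1), sim_and(xs, 1), mean)

# Large p: OR approaches the largest weight, AND the smallest.
print(sim_or(xs, 200), max(xs))
print(sim_and(xs, 200), min(xs))
```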

21
Example
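  • For instance, consider the query $q = (k_1 \wedge^p k_2) \vee^p k_3$.
    The similarity sim(q, dj) between a document dj
    and this query is then computed as
  • $sim(q, d_j) = \left(\frac{\left(1 - \left(\frac{(1 - x_1)^p + (1 - x_2)^p}{2}\right)^{1/p}\right)^p + x_3^p}{2}\right)^{1/p}$
  • Any Boolean formula can be expressed as a numerical
    formula.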
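
Nested queries are evaluated from the inside out: the score of the inner subexpression is treated as a weight for the outer operator. The sketch below assumes the example query is (k1 AND k2) OR k3, since the operator symbols did not survive in the transcript, and it assumes the standard p-norm definitions; `p_or`, `p_and`, and `sim_example` are hypothetical names.

```python
def p_or(values, p):
    """p-norm OR over a list of scores in [0, 1]."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

def p_and(values, p):
    """p-norm AND over a list of scores in [0, 1]."""
    return 1 - (sum((1 - v) ** p for v in values) / len(values)) ** (1 / p)

def sim_example(x1, x2, x3, p=2):
    """Score for q = (k1 AND k2) OR k3 (operator assignment is an assumption)."""
    inner = p_and([x1, x2], p)      # score of the subquery k1 AND k2
    return p_or([inner, x3], p)     # combine it with k3 under OR

print(sim_example(0.0, 0.0, 0.0))   # no term present
print(sim_example(1.0, 1.0, 0.0))   # only the AND part satisfied
print(sim_example(1.0, 1.0, 1.0))   # everything present
```

The same composition works for any Boolean formula, which is the sense in which every query has a corresponding numerical formula.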

22
Exercise
  • 1. Give the numerical formula in the extended Boolean
    model for the query
  • q = (k1 or k2 or k3) and (not k4 or k5). (Assume
    that there are 5 terms in total.)
  • 2. Assume that the document is represented by the
    vector (0.8, 0.1, 0.0, 0.0, 1.0).
  • What is sim(q, d) in the extended Boolean model?
  • Also try more exercises with other Boolean
    formulas.