Title: Classical Models
1. Lecture 3: Document Models for IR
- Classical Models
- Latent Semantic Indexing Model
- A Structural Model
2. Logical View of a Document
- A text document may be represented for computer analysis in different formats:
- Full text
- Index terms
- Structures
3. The Role of the Indexer
- The huge size of the Internet makes it unrealistic to use the full text for information retrieval that requires quick response
- The indexer simplifies the logical view of a document
- The indexing method dictates the document storage and retrieval algorithms
- Automation of indexing methods is necessary for information retrieval over the Internet
4. Possible Drawbacks
- Summarizing a document through a set of index terms may lead to poor performance:
- many unrelated documents may be included in the answer set for a query
- relevant documents which are not indexed by any of the query keywords cannot be retrieved
5. A Formal Description of IR Models
- A quadruple [D, Q, F, R(q,d)]
- D (documents) is a set composed of logical views (or representations) of the documents in the collection
- Q (queries) is a set composed of logical views (or representations) of the user information needs
- F (framework) is a framework for modeling document representations, queries, and their relationships
- R(q,d) is a ranking function which associates a real number with a query q and a document representation d; such a ranking defines an ordering among the documents with regard to the query q
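To make the quadruple concrete, here is a minimal illustrative sketch (not from the lecture; the overlap-based ranker is an assumption) in which documents and queries are sets of index terms:

```python
from typing import FrozenSet

Document = FrozenSet[str]   # logical view of a document: a set of index terms
Query = FrozenSet[str]      # logical view of an information need

def R(q: Query, d: Document) -> float:
    # One possible framework F: rank by the fraction of query terms in d
    return len(q & d) / len(q) if q else 0.0

D = [frozenset({"degree", "aim", "cityu"}), frozenset({"course", "fee"})]
q = frozenset({"degree", "aim"})
print(max(D, key=lambda d: R(q, d)))   # best-ranked document
```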
6. Classic Models
- Boolean Model
- Vector Space Model
- Probabilistic Model
7. Boolean Model
- Document representation: full text, or a set of keywords (contained in the text or not)
- Query representation: logic operators, query terms, query expressions
- Searching: use an inverted file and set operations to construct the result set
8. Boolean Searching
- Queries
- e.g. A AND B AND (C OR D)
- Break the collection into two unordered sets:
- documents that match the query
- documents that don't
- Return all documents that match the query
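A minimal sketch of Boolean searching with an inverted file and set operations; the four toy documents are assumptions for illustration:

```python
# Toy collection; real systems index full web pages.
docs = {
    1: "A B C",
    2: "A B D",
    3: "A C",
    4: "B C D",
}

# Inverted file: term -> set of IDs of documents containing the term
inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

# Evaluate A AND B AND (C OR D) with set intersection and union
result = inverted["A"] & inverted["B"] & (inverted["C"] | inverted["D"])
print(sorted(result))   # -> [1, 2]
```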
9. Boolean Model
[Figure: Venn diagram over the index terms ka, kb, kc, with the regions (1,1,1), (1,1,0), and (1,0,0) shaded]
The three conjunctive components for the query q = ka ∧ (kb ∨ ¬kc)
10. Another Example
Consider three documents: about CityU at http://www.cityu.edu.hk/cityu/about/index.htm, about FSE at http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm, and about CS at http://www.cs.cityu.edu.hk/content/about/
The query 'degree AND aim' returns the page about CityU; the query 'degree OR aim' returns all three
11. Advantages
- Simple and clean formalism
- The answer set is exactly what the users ask for
- Therefore, users have complete control if they know how to write a Boolean formula of terms for the document(s) they want to find
- Easy to implement on computers
- Popular (most search engines support this model)
12. Disadvantages
- All results are considered equal; there is no ranking of the documents
- The set of all documents that satisfies a query may still be too large for the users to browse through, or too small
- The users may only know what they are looking for in a vague way and may not be able to formulate it as a Boolean expression
- Users need to be trained
13. Improvements to the Boolean Model
- Expand and refine queries through interactive protocols
- Automate query formula generation
- Assign weights to query terms and rank the results accordingly
14. Vector Space Model
- Vector Representation
- Similarity Measure
15. Vector Space Model
- Represent stored text, as well as information queries, by vectors of terms
- A term is typically a word, a word stem, or a phrase associated with the text under consideration; terms may carry weights
- Generate terms by a term-weighting system:
- terms are not equally useful for content representation
- assign high weights to terms deemed important and low weights to the less important terms
16. Vector Representation
- Every document in the collection is represented by a vector
- The distinct terms in the collection are called the index terms, or the vocabulary
[Figure: a page collection about computers and its index terms, e.g. Computer, XML, Operating System, Microsoft Office, Unix, Search Engines]
17. Term Relationships
- Each term is identified as T_i
- There is no relationship between terms in the vector space; they are orthogonal
- In reality, in a collection about computers, terms like 'computer' and 'OS' are correlated with each other
18. Vector Space Model
- A vocabulary of 2 terms forms a 2D space; each document may contain 0, 1, or 2 of the terms. We may see vectors such as:
- D1 = <0, 0>
- D2 = <0, 0.3>
- D3 = <2, 3>
19. Term/Document Matrix
- t terms form a t-dimensional space
- Documents and queries can be represented as t-dimensional vectors
- A document can be considered a point in the t-dimensional space
- We may form an n-by-t matrix for n documents indexed with t terms
20. Document-Term Matrix
[Figure: an n-by-t matrix; rows are documents, columns are terms, and each entry is the weight of a term in the document]
21. Deciding the Weight
- Combine two factors in the document-term weight:
- tf_ij: frequency of term j in document i
- df_j: document frequency of term j
- the number of documents containing term j
- idf_j:
- inverse document frequency of term j
- log2(N / df_j), where N is the number of documents in the collection
- Inverse document frequency is an indication of a term's value as a document discriminator
22. Tf-idf Term Weight
- A term that occurs frequently in the document but rarely in the rest of the collection has a high weight
- A typical combined term importance indicator:
- w_ij = tf_ij × idf_j = tf_ij × log2(N / df_j)
- Many other ways have been proposed to determine the document-term weight
23. An Example of 5 Documents
- D1: How to Bake Bread without Recipes
- D2: The Classic Art of Viennese Pastry
- D3: Numerical Recipes: The Art of Scientific Computing
- D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
- D5: Pastry: A Book of Best French Recipes
24. Six Index Terms
- T1: bak(e, ing)
- T2: recipes
- T3: bread
- T4: cake
- T5: pastr(y, ies)
- T6: pie
25. An Example of 5 Documents
- D1: How to Bake Bread without Recipes
- D2: The Classic Art of Viennese Pastry
- D3: Numerical Recipes: The Art of Scientific Computing
- D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
- D5: Pastry: A Book of Best French Recipes
26. Term Frequency in Documents
Entry (i, j) = 1 if document i contains term j once (0 otherwise):

       bak  recipes  bread  cake  pastr  pie
  D1    1      1       1     0      0     0
  D2    0      0       0     0      1     0
  D3    0      1       0     0      0     0
  D4    1      1       1     1      1     1
  D5    0      1       0     0      1     0
27. Document Frequency of Term j

  df_j:  bak = 2, recipes = 4, bread = 2, cake = 1, pastr = 3, pie = 1
28. Tf-idf Weight Matrix

       bak       recipes   bread     cake    pastr     pie
  D1   log(5/2)  log(5/4)  log(5/2)  0       0         0
  D2   0         0         0         0       log(5/3)  0
  D3   0         log(5/4)  0         0       0         0
  D4   log(5/2)  log(5/4)  log(5/2)  log(5)  log(5/3)  log(5)
  D5   0         log(5/4)  0         0       log(5/3)  0
29. Exercise
- Write a program that uses the tf-idf term weight to form the term/document matrix. Test it on the following three documents:
- http://www.cityu.edu.hk/cityu/about/index.htm
- http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm
- http://www.cs.cityu.edu.hk/content/about/
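A minimal Python sketch of such a program, assuming a crude regex tokenizer and skipping the HTML fetching and stemming a full solution would need; it is tried on the five book titles from the earlier example rather than the three web pages:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer; a full solution would also strip HTML tags
    # from the fetched pages and apply stemming.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_matrix(docs):
    """Return (vocabulary, n-by-t matrix) with w_ij = tf_ij * log2(N / df_j)."""
    N = len(docs)
    tfs = [Counter(tokenize(d)) for d in docs]                 # tf_ij
    vocab = sorted(set().union(*tfs))                          # index terms
    df = {t: sum(1 for tf in tfs if t in tf) for t in vocab}   # df_j
    return vocab, [[tf[t] * math.log2(N / df[t]) for t in vocab] for tf in tfs]

docs = [
    "How to Bake Bread without Recipes",
    "The Classic Art of Viennese Pastry",
    "Numerical Recipes: The Art of Scientific Computing",
    "Breads, Pastries, Pies and Cakes: Quantity Baking Recipes",
    "Pastry: A Book of Best French Recipes",
]
vocab, M = tfidf_matrix(docs)
```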
30. Similarity Measure
- Determine the similarity between a document D and a query Q
- Many methods can be used to calculate the similarity
- Cosine similarity measure
31. Similarity Measure: Cosine
[Figure: document vector d_j and query vector Q separated by angle θ]
Cosine similarity measures the cosine of the angle between the two vectors:
sim(d_j, Q) = (d_j · Q) / (|d_j| |Q|)
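A small sketch of the cosine measure over plain Python lists; the function and variable names are illustrative:

```python
import math

def cosine_similarity(d, q):
    # sim(d, q) = (d . q) / (|d| |q|), with vectors as equal-length lists
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Rank the rows of a document-term matrix M against a query vector q:
# ranked = sorted(range(len(M)), key=lambda i: cosine_similarity(M[i], q),
#                 reverse=True)
```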
32. Advantages of VSM
- The term-weighting scheme improves retrieval performance
- The partial matching strategy allows retrieval of documents that approximate the query conditions
- Its cosine ranking formula sorts the documents according to their degree of similarity to the query
33. Limitations of VSM
- The underlying assumption is that the terms in the vector are orthogonal
- Several query terms are needed if a discriminating ranking is to be achieved, whereas only two or three ANDed terms may suffice in a Boolean environment to obtain high-quality output
- It is difficult to explicitly specify synonymous and phrasal relationships, which can easily be handled in a Boolean environment by means of the OR and AND operators or by an extended Boolean model
34. Latent Semantic Indexing Model of Document/Query
- Map document and query vectors into a lower-dimensional space which is associated with concepts
- G.W. Furnas, S. Deerwester, S.T. Dumais, T.K. Landauer, R.A. Harshman, L.A. Streeter, and K.E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic structure. 11th ACM SIGIR Conference, pp. 465-480, 1988
- http://www.cs.utk.edu/lsi/
- A tutorial: http://www.cs.utk.edu/berry/lsi/node5.html
35. General Approach
- It is based on the vector space model
- In the vector space model, terms are treated independently
- Here some relationships among the terms are obtained implicitly, almost magically, through matrix analysis
- This allows reduction of some unnecessary information in the document representation
36. Term-Document Association Matrix
- Let t be the number of terms and N the number of documents
- Let M = (M_ij) be the term-document association matrix
- M_ij may be considered the weight associated with the term-document pair (t_i, d_j)
37. Eigenvalues and Eigenvectors
- Let A be an n-by-n matrix and x an n-dimensional vector
- x is an eigenvector of A if Ax = cx, where c is a scalar; c is the corresponding eigenvalue
- Example:
- A = [[2, 1], [1, 2]], x = (1, 1)^t
- Then Ax = 3x
- 3 is an eigenvalue, and x is an eigenvector
- Question: find another eigenvalue
38. Example Continued
- y^t = (1, -1); Ay = (1, -1)^t = y
- Therefore, another eigenvalue is 1, and its associated eigenvector is y
- Let S = [[3, 0], [0, 1]]
- Then A(x, y) = (x, y)S
- Moreover, x^t y = 0
39. Example Continued
- Let K = (x, y)/sqrt(2)
- Then K^t K = I
- and A = K S K^t
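A quick numerical check of the worked example, sketched with NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])    # the matrix from the example
x = np.array([1.0, 1.0])                   # eigenvector for eigenvalue 3
y = np.array([1.0, -1.0])                  # eigenvector for eigenvalue 1

K = np.column_stack([x, y]) / np.sqrt(2)   # orthonormal eigenvectors
S = np.diag([3.0, 1.0])                    # eigenvalues on the diagonal

print(np.allclose(K.T @ K, np.eye(2)))     # K^t K = I   -> True
print(np.allclose(K @ S @ K.T, A))         # A = K S K^t -> True
```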
40. A General Theorem from Linear Algebra
- If A is a symmetric matrix, then
- there exist a matrix K (with K^t K = I) and a diagonal matrix S such that
- A = K S K^t
41. Application to Our Case
- Both MM^t and M^t M are symmetric
- In addition, their eigenvalues are the same, except that the larger of the two matrices has an extra number of zero eigenvalues
42. Decomposition of the Term-Document Association Matrix
- Decompose M = K S D^t
- K: the matrix of eigenvectors derived from the term-to-term correlation matrix given by MM^t
- D^t: that of M^t M
- S: an r-by-r diagonal matrix of singular values, where r is the rank of M
43. Reduced Concept Space
- Let S_s keep only the s largest singular values of S
- Let K_s and D_s^t be the corresponding columns of K and rows of D^t
- The matrix M_s = K_s S_s D_s^t
- is closest to M in the least-squares sense
- NOTE: M_s has the same number of rows (terms) and columns (documents) as M, but it may be totally different from M
- A numerical example: www.cs.arizona.edu/classes/cs630/spring03/slides/jan-29.ppt
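A sketch of the reduction using NumPy's SVD (numpy.linalg.svd returns the singular values as a vector; the variable names mirror the slides, and the random matrix is just a stand-in):

```python
import numpy as np

def reduced_concept_space(M, s):
    """Truncated SVD: M_s = K_s S_s D_s^t is the best rank-s
    approximation to M in the least-squares (Frobenius) sense."""
    K, sing, Dt = np.linalg.svd(M, full_matrices=False)
    return K[:, :s], np.diag(sing[:s]), Dt[:s, :]

M = np.random.rand(6, 5)                   # stand-in 6-term x 5-document matrix
Ks, Ss, Dst = reduced_concept_space(M, 2)  # keep s = 2 concepts
Ms = Ks @ Ss @ Dst                         # same shape as M, but rank 2
```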
44. The Relationship of Two Documents d_i and d_j
- M_s^t M_s = (K_s S_s D_s^t)^t (K_s S_s D_s^t)
  = D_s S_s K_s^t K_s S_s D_s^t
  = D_s S_s S_s D_s^t
  = (D_s S_s)(D_s S_s)^t
- The (i, j) element quantifies the relationship between documents i and j
45. The Choice of s
- It should be large enough to allow fitting all the structure in the original data
- It should be small enough to allow filtering out the noise caused by variations in the choice of terms
46. Ranking Documents According to a Query
- Model the query Q as a pseudo-document in the original term-document matrix M
- The vector M_s^t Q provides the ranks of all documents with respect to the query Q
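A self-contained sketch of this ranking step, again with NumPy; the toy matrix and query are stand-ins for a real term-document matrix and tf-idf pseudo-document:

```python
import numpy as np

def rank_documents(M, q, s):
    # Score every document against query q in the rank-s concept space:
    # scores = M_s^t q = D_s S_s (K_s^t q)
    K, sing, Dt = np.linalg.svd(M, full_matrices=False)
    Ks, Ss, Dst = K[:, :s], np.diag(sing[:s]), Dt[:s, :]
    scores = Dst.T @ Ss @ (Ks.T @ q)       # one score per document
    return np.argsort(scores)[::-1]        # document indices, best first

M = np.random.rand(6, 5)                   # stand-in term-document matrix
q = np.random.rand(6)                      # query as a t-dimensional pseudo-document
print(rank_documents(M, q, s=2))
```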
47. Advantages
- When s is small with respect to t and N, it provides an efficient indexing model
- It provides for the elimination of noise and the removal of redundancy
- It introduces conceptualization based on the theory of singular value decomposition
48. Graph Model of Document/Query
- Improving Effectiveness and Efficiency of Web Search by Graph-based Text Representation, by Junji Tomita and Yoshihiko Hayashi, http://www9.org/final-posters/13/poster13.html
- Interactive web search by graphical query refinement, by Junji Tomita and Genichiro Kikui, http://www10.org/cdrom/posters/1078.pdf
49. Graph-Based Text Representation Model
- Subject Graph:
- a node represents a term in the text
- a link denotes an association between the linked terms
- The significance of terms and of term-term associations is represented as weights assigned to them
50. Assignment of Weights
- Term-statistics-based weighting schemes:
- frequencies of terms
- frequencies of term-term associations
- multiplied by the inverse document frequency
51. Similarity of Documents
- Subject graph matching:
- weight terms and term-term associations with λ and 1-λ for an adequately chosen λ
- then calculate the cosine value of the two documents, treating the weighted terms and term-term associations as elements of the vector space model
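The posters do not spell out the implementation, so the following is a hypothetical sketch: terms and within-window term pairs are collected into one weighted vector (λ for terms, 1-λ for associations) and two documents are compared with the cosine measure:

```python
import math
from collections import Counter

def subject_vector(tokens, lam, window=3):
    # Hypothetical scheme: single terms weighted by lam,
    # term-term co-occurrences within a small window by 1 - lam.
    vec = Counter()
    for t in tokens:
        vec[t] += lam
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            vec[tuple(sorted((tokens[i], tokens[j])))] += 1 - lam
    return vec

def graph_similarity(tokens_a, tokens_b, lam=0.5):
    va, vb = subject_vector(tokens_a, lam), subject_vector(tokens_b, lam)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(graph_similarity("travel guide for Japan".split(),
                       "travel guide for Asia".split()))
```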
52. Query as Graph
- Sometimes the user's query is vague
- The system represents the user's query as a query graph
- The user can interactively and explicitly clarify his/her query by looking at and editing the query graph
- The system implicitly edits the query graph according to the user's choice of documents
53. User Interface of the System
54. A Query Graph
[Figure: a query graph whose nodes include guide, transport, travel, train, Asia, and Japan; some node pairs are linked, others are not]
55. Interactive Query Graph Refinement
1. The user inputs sentences as a query; the system displays the initial query graph made from the inputs
2. The user edits the query graph by removing and/or adding nodes and/or links
3. The system measures the relevance score of each document against the modified query graph
4. The system ranks the search results in descending score order and displays their titles in the user interface
5. The user selects documents relevant to his/her needs
6. The system refines the query graph based on the documents selected by the user and the old query graph
7. The system displays the new query graph to the user
8. Repeat the previous steps until the user is satisfied with the search results
56. Details of Step 6: Making a New Query Graph
57. Digest Graph
- The output of the search engine is presented via a graphical representation:
- a subgraph of the subject graph for the entire document
- The subgraph is generated on the fly in response to the current query
- The user can intuitively understand the subject of each document from the terms and the term-term associations in the graph