Classical Models - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Classical Models

1
Lecture 3: Document Models for IR
  • Classical Models
  • Latent Semantic Indexing Model
  • A Structural Model

2
Logical View of a Document
  • A text document may be represented for computer
    analysis in different formats:
  • Full text
  • Index terms
  • Structures

3
The Role of the Indexer
  • The huge size of the Internet makes it
    unrealistic to use the full text for information
    retrieval that requires quick response
  • The indexer simplifies the logical view of a
    document
  • The indexing method dictates the document storage
    and retrieval algorithms
  • Automation of indexing methods is necessary for
    information retrieval over the Internet

4
Possible drawbacks
  • Summarizing a document through a set of index terms
    may lead to poor performance:
  • many unrelated documents may be included in the
    answer set for a query
  • relevant documents which are not indexed by any
    of the query keywords cannot be retrieved

5
A Formal Description of IR Models
  • A quadruple (D, Q, F, R(q,d))
  • D (documents) is a set composed of logical views
    (or representations) of the documents in the
    collection
  • Q (queries) is a set composed of logical views
    (or representations) of user information needs
  • F (framework) is a framework for modeling
    document representations, queries, and their
    relationships
  • R(q,d) is a ranking function which associates a
    real number with a query q and a document
    representation d. Such a ranking defines an
    ordering among the documents with regard to the
    query q

6
Classic Models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model

7
Boolean Model
  • Document representation: full text or a set of
    keywords (contained in the text or not)
  • Query representation: logic operators, query
    terms, query expressions
  • Searching: use an inverted file and set operations
    to construct the result set

8
Boolean Searching
  • Queries
  • A AND B AND (C OR D)
  • Break the collection into two unordered sets:
  • Documents that match the query
  • Documents that don't
  • Return all that match the query
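The set-based evaluation above can be sketched with an inverted file; the documents and the query below are made-up illustrations, not from the slides.

```python
# A sketch of Boolean retrieval with an inverted file and set
# operations; the documents and the query are illustrative examples.
docs = {
    1: "information retrieval boolean model",
    2: "vector space model ranking",
    3: "boolean query with inverted file",
}

# Build the inverted file: term -> set of ids of documents containing it.
inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

# Evaluate the query "boolean AND (model OR file)" with set operations.
result = inverted.get("boolean", set()) & (
    inverted.get("model", set()) | inverted.get("file", set())
)
print(sorted(result))  # [1, 3]
```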

9
Boolean Model
[Figure: Venn diagram over the index terms ka, kb, kc, showing
the three conjunctive components (1,0,0), (1,1,0), and (1,1,1)
for the query q = ka AND (kb OR NOT kc)]
10
Another Example
Consider three documents:
  • about CityU @ http://www.cityu.edu.hk/cityu/about/index.htm
  • about FSE @ http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm
  • about CS @ http://www.cs.cityu.edu.hk/content/about/
The query "degree AND aim" returns "about CityU";
the query "degree OR aim" returns all three.
11
Advantages
  • Simple and clean formalism
  • The answer set is exactly what the users look
    for
  • Therefore, users can have complete control if
    they know how to write a Boolean formula of terms
    for the document(s) they want to find
  • Easy to implement on computers
  • Popular (most search engines support this
    model)

12
Disadvantages
  • All results are considered equal; there is no
    ranking of the documents
  • The set of all documents that satisfies a query
    may still be too large for the users to browse
    through, or too small
  • The users may only know what they are looking for
    in a vague way and not be able to formulate it as
    a Boolean expression
  • Need to train the users

13
Improvements to the Boolean model
  • Expand and refine query through interactive
    protocols
  • Automation of query formula generation
  • Assign weights to query terms and rank the
    results accordingly

14
Vector Space Model
  • Vector Representation
  • Similarity Measure

15
Vector Space Model
  • represent stored text as well as information
    queries by vectors of terms
  • a term is typically a word, a word stem, or a
    phrase associated with the text under
    consideration; vector entries may be term weights
  • generate term weights with a term weighting system:
  • terms are not equally useful for content
    representation
  • assign high weights to terms deemed important and
    low weights to the less important terms

16
Vector Representation
  • Every document in the collection is represented by
    a vector
  • The distinct terms in the collection are called index
    terms, or the vocabulary

[Figure: a page collection about computers with the index terms
Computer, XML, Operating System, Microsoft Office, Unix,
Search Engines]
17
Term relationships
  • Each term is identified as Ti
  • There is no relationship between terms in the vector
    space; they are orthogonal
  • In reality, in a collection about computers, terms
    like "computer" and "OS" are correlated to each
    other

18
Vector space model
  • A vocabulary of 2 terms forms a 2D space; each
    document may contain 0, 1, or 2 of the terms. We may
    see the following vectors for the representation:
  • D1 = <0, 0>
  • D2 = <0, 0.3>
  • D3 = <2, 3>

19
Term/Document matrix
  • t terms form a t-dimensional space
  • Documents and queries can be represented as
    t-dimensional vectors
  • A document can be considered as a point in the
    t-dimensional space
  • We may form an n-by-t matrix for n
    documents indexed with t terms
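The matrix construction can be sketched as follows; the three documents and their vocabulary are illustrative, not from the slides.

```python
# A sketch of forming the n-by-t document-term matrix of raw term
# counts; documents and vocabulary are made-up examples.
docs = ["apple banana apple", "banana cherry", "cherry cherry apple"]
vocab = sorted({t for d in docs for t in d.split()})  # the index terms

# Row i is document i as a point in t-dimensional term space.
matrix = [[d.split().count(t) for t in vocab] for d in docs]
print(vocab)   # ['apple', 'banana', 'cherry']
print(matrix)  # [[2, 1, 0], [0, 1, 1], [1, 0, 2]]
```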

20
Document-Term Matrix
[Figure: a matrix whose rows are documents and whose columns are
terms; each entry is the weight of a term in a document]
21
Deciding the weight
  • Combine two factors in the document-term weight:
  • tf_ij: frequency of term j in document i
  • df_j: document frequency of term j
  • number of documents containing term j
  • idf_j: inverse document frequency of term j
  • log2(N / df_j), where N is the number of documents
    in the collection
  • Inverse document frequency is an indication of a
    term's value as a document discriminator

22
Tf-idf term weight
  • A term that occurs frequently in the document but
    rarely in the rest of the collection receives a
    high weight
  • A typical combined term importance indicator:
  • w_ij = tf_ij × idf_j = tf_ij × log2(N / df_j)
  • many other ways to determine the
    document-term weight have been proposed
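A minimal sketch of the w_ij = tf_ij × log2(N / df_j) weighting, applied to the five-document, six-term example that the following slides develop.

```python
import math

# Sketch of tf-idf weighting on the baking-book example:
# rows D1..D5; columns T1 bake, T2 recipes, T3 bread, T4 cake,
# T5 pastry, T6 pie.
tf = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
N = len(tf)  # 5 documents
df = [sum(1 for row in tf if row[j] > 0) for j in range(6)]
w = [[tf[i][j] * math.log2(N / df[j]) for j in range(6)] for i in range(N)]
print(df)  # [2, 4, 2, 1, 3, 1]
```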

23
An Example of 5 documents
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe

24
Six Index terms
  • T1: bak(e, ing)
  • T2: recipes
  • T3: bread
  • T4: cake
  • T5: pastr(y, ies)
  • T6: pie

25
An Example of 5 documents
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe
26
Term Frequency in documents
  • tf(i,j) = 1 if document i contains term j once

         T1  T2  T3  T4  T5  T6
    D1    1   1   1   0   0   0
    D2    0   0   0   0   1   0
    D3    0   1   0   0   0   0
    D4    1   1   1   1   1   1
    D5    0   1   0   0   1   0
27
Document frequency of term j
  • df_1 = 2, df_2 = 4, df_3 = 2, df_4 = 1, df_5 = 3, df_6 = 1
28
Tf-idf weight matrix

             T1        T2        T3       T4        T5       T6
    D1   log(5/2)  log(5/4)  log(5/2)    0         0        0
    D2      0         0         0        0      log(5/3)    0
    D3      0      log(5/4)     0        0         0        0
    D4   log(5/2)  log(5/4)  log(5/2)  log(5)   log(5/3)  log(5)
    D5      0      log(5/4)     0        0      log(5/3)    0
29
Exercise
  • Write a program that uses the tf-idf term weight to
    form the term/document matrix. Test it on the
    following three documents:
  • http://www.cityu.edu.hk/cityu/about/index.htm
  • http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm
  • http://www.cs.cityu.edu.hk/content/about/

30
Similarity Measure
  • Determine the similarity between a document D and
    a query Q
  • Many methods can be used to calculate the
    similarity
  • Cosine similarity measure

31
Similarity Measure: cosine
[Figure: vectors dj and Q with the angle θ between them]
Cosine similarity measures the cosine of the
angle between two vectors:
sim(dj, Q) = (dj · Q) / (|dj| |Q|)
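The measure can be sketched directly; the document and query vectors below are illustrative.

```python
import math

# A sketch of the cosine similarity between a document vector d
# and a query vector q (illustrative 3-term vectors).
def cosine(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

d = [2.0, 3.0, 0.0]
q = [1.0, 1.0, 0.0]
print(round(cosine(d, q), 4))  # 0.9806
```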
32
Advantages of VSM
  • Term-weighting scheme improves retrieval
    performance
  • Partial matching strategy allows retrieval of
    documents that approximate the query conditions
  • Its cosine ranking formula sorts the documents
    according to their degree of similarity to the
    query

33
Limitations of VSM
  • The underlying assumption is that the terms in the
    vector are orthogonal
  • Several query terms are needed if a
    discriminating ranking is to be achieved, whereas
    only two or three ANDed terms may suffice in a
    Boolean environment to obtain a high-quality
    output
  • It is difficult to explicitly specify synonymous and
    phrasal relationships, whereas these can be easily
    handled in a Boolean environment by means of the
    OR and AND operators or by an extended Boolean
    model

34
Latent Semantic Indexing Model of document/query
  • Map document and query vectors into a lower
    dimensional space which is associated with
    concepts
  • Information retrieval using a singular value
    decomposition model of latent semantic structure.
    11th ACM SIGIR Conference, pp. 465-480, 1988
  • by G. W. Furnas, S. Deerwester, S. T. Dumais,
    T. K. Landauer, R. A. Harshman, L. A. Streeter,
    and K. E. Lochbaum
  • http://www.cs.utk.edu/lsi/
  • A tutorial:
  • http://www.cs.utk.edu/berry/lsi/node5.html

35
General Approach
  • It is based on the Vector Space Model
  • In the vector space model, terms are treated
    independently
  • Here some relationships between the terms are
    obtained implicitly through matrix analysis
  • This allows reduction of some unnecessary
    information in the document representation

36
Term-document association matrix
  • Let t be the number of terms and N be the number
    of documents
  • Let M = (Mij) be the term-document association matrix
  • Mij may be considered as the weight associated with
    the term-document pair (ti, dj)

37
Eigenvalue and Eigenvector
  • Let A be an n-by-n matrix and x be an n-dimensional
    vector
  • c is an eigenvalue of A, with eigenvector x, if
    Ax = cx for the scalar c
  • Example
  • A = [[2, 1], [1, 2]],  x = (1, 1)^t
  • Then Ax = (3, 3)^t = 3x
  • 3 is an eigenvalue, and x is an eigenvector
  • Question: find another eigenvalue

38
Example continued
  • y^t = (1, -1). Ay = (1, -1)^t = y.
  • Therefore, another eigenvalue is 1 and its
    associated eigenvector is y
  • Then let
  • S = [[3, 0], [0, 1]]
  • Then A(x, y) = (x, y)S
  • Moreover, x^t y = 0

39
Example continued
  • Let K = (x, y)/sqrt(2)
  • Then
  • K^t K = I
  • and A = K S K^t
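The worked example can be checked numerically. The matrix A = [[2, 1], [1, 2]] is the unique symmetric matrix with Ax = 3x and Ay = y for the given x and y.

```python
import numpy as np

# Numerical check of the eigen-decomposition example:
# A = [[2, 1], [1, 2]], eigenvalues 3 and 1, orthogonal
# eigenvectors x = (1, 1)^t and y = (1, -1)^t.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
x = np.array([1.0, 1.0])
y = np.array([1.0, -1.0])
K = np.column_stack([x, y]) / np.sqrt(2)
S = np.diag([3.0, 1.0])

assert np.allclose(A @ x, 3 * x)        # Ax = 3x
assert np.allclose(A @ y, y)            # Ay = y
assert np.allclose(K.T @ K, np.eye(2))  # K^t K = I
assert np.allclose(K @ S @ K.T, A)      # A = K S K^t
```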

40
A General Theorem from Linear Algebra
  • If A is a symmetrical matrix, then
  • There exist a matrix K (KtKI) and a diagonal
    matrix S such that
  • AKSKt

41
Application to our case
  • Both M M^t and M^t M are symmetric
  • In addition, their eigenvalues are the same,
    except that the larger of the two matrices has an
    extra number of zero eigenvalues

42
Decomposition of the term-document association matrix
  • Decompose M = K S D^t
  • K: the matrix of eigenvectors derived from the
    term-to-term correlation matrix given by M M^t
  • D: that of M^t M
  • S: an r-by-r diagonal matrix of singular values,
    where r is the rank of M
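This decomposition can be sketched with numpy's SVD; the 4-term by 3-document matrix below is a made-up example.

```python
import numpy as np

# A sketch of the decomposition M = K S D^t (terms as rows,
# documents as columns; illustrative values).
M = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])
K, s, Dt = np.linalg.svd(M, full_matrices=False)  # s: singular values
S = np.diag(s)
assert np.allclose(K @ S @ Dt, M)  # the factors reproduce M
```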

43
Reduced Concept Space
  • Let Ss be the s largest singular values of S
  • Let Ks and Ds^t be the corresponding columns of
    K and rows of D^t
  • The matrix Ms = Ks Ss Ds^t
  • is closest to M in the least-squares sense
  • NOTE: Ms has the same number of rows (terms) and
    columns (documents) as M, but it may be totally
    different from M
  • A numerical example:
  • www.cs.arizona.edu/classes/cs630/spring03/slides/jan-29.ppt
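The rank-s truncation can be sketched as follows; the matrix and the choice s = 2 are illustrative.

```python
import numpy as np

# A sketch of the reduced concept space: keep the s largest
# singular values together with the matching columns of K and
# rows of D^t; Ms = Ks Ss Ds^t is the rank-s matrix closest to M
# in the least-squares sense.
def truncate(M, s):
    K, sv, Dt = np.linalg.svd(M, full_matrices=False)
    return K[:, :s] @ np.diag(sv[:s]) @ Dt[:s, :]

M = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
])
Ms = truncate(M, 2)
# Ms keeps the same shape as M but generally differs entry by entry.
assert Ms.shape == M.shape
```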

44
The relationship of two documents di and dj
  • Ms^t Ms = (Ks Ss Ds^t)^t (Ks Ss Ds^t)
  •         = Ds Ss Ks^t Ks Ss Ds^t
  •         = Ds Ss Ss Ds^t
  •         = Ds Ss (Ds Ss)^t
  • The (i,j) element quantifies the relationship
    between documents i and j

45
The choice of s
  • It should be large enough to allow fitting all
    the structure in the original data
  • It should be small enough to allow filtering out
    the noise caused by variations in the choice of terms

46
Ranking documents according to a query
  • Model the query Q as a pseudo-document in the
    original term-document matrix M
  • The vector Ms^t Q provides the ranks of all documents
    with respect to this query Q
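The ranking step can be sketched as follows; the matrix, the choice s = 2, and the query are illustrative.

```python
import numpy as np

# A sketch of ranking documents by Ms^t Q, with the query Q
# treated as a pseudo-document in term space.
M = np.array([           # terms x documents (illustrative)
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])
K, sv, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ms = K[:, :s] @ np.diag(sv[:s]) @ Dt[:s, :]

Q = np.array([1.0, 1.0, 0.0])  # a query using the first two terms
scores = Ms.T @ Q              # one rank value per document
order = np.argsort(-scores)    # documents in descending score order
print(order)
```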

47
Advantages
  • When s is small with respect to t and N, it
    provides an efficient indexing model
  • It provides for elimination of noise and removal
    of redundancy
  • It introduces conceptualization based on the theory
    of singular value decomposition

48
Graph model of document/query
  • Improving Effectiveness and Efficiency of Web
    Search by Graph-based Text Representation
  • by Junji Tomita and Yoshihiko Hayashi
  • http://www9.org/final-posters/13/poster13.html
  • Interactive web search by graphical query
    refinement
  • by Junji Tomita and Genichiro Kikui
  • http://www10.org/cdrom/posters/1078.pdf

49
Graph-based text representation model
  • Subject Graph:
  • a node represents a term in the text,
  • a link denotes an association between the linked
    terms
  • The significance of terms and of term-term
    associations is represented as weights assigned to them

50
Assignment of Weights
  • Term-statistics-based weighting schemes
  • frequencies of terms
  • frequencies of term-term association
  • multiplied by inverse document frequency

51
Similarity of documents
  • Subject Graph Matching:
  • Weight terms and term-term associations with α
    and 1−α for an adequately chosen α
  • Then calculate the cosine value of two documents,
    treating the weighted terms and term-term
    associations as elements of the vector space
    model
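This matching scheme can be sketched under assumed data structures: the dicts keyed by term or term pair, the graphs, and the choice α = 0.7 are this sketch's own illustrations, not taken from the cited papers.

```python
import math

# A sketch of subject-graph matching: weighted terms and weighted
# term-term associations become coordinates of one combined vector,
# and two documents are compared by the cosine of those vectors.
def graph_vector(terms, assocs, alpha):
    # terms: {term: weight}; assocs: {(t1, t2): weight}
    vec = {t: alpha * w for t, w in terms.items()}
    vec.update({pair: (1 - alpha) * w for pair, w in assocs.items()})
    return vec

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

g1 = graph_vector({"travel": 2.0, "japan": 1.0}, {("travel", "japan"): 1.0}, 0.7)
g2 = graph_vector({"travel": 1.0, "asia": 1.0}, {("travel", "japan"): 0.5}, 0.7)
sim = cosine(g1, g2)
print(round(sim, 3))
```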

52
Query as graph
  • Sometimes the user's query is vague
  • The system represents the user's query as a query graph
  • The user can interactively and explicitly clarify
    his/her query by looking at and editing the query
    graph
  • The system implicitly edits the query graph
    according to the user's choice of documents

53
User interface of the system
54
A query graph
[Figure: a query graph with the nodes guide, transport,
travel, train, Asia, Japan; linked nodes are ANDed, and
nodes with no link are ORed]
55
Interactive query graph refinement
  • The user inputs sentences as a query; the system
    displays the initial query graph made from the inputs
  • The user edits the query graph by removing and/or
    adding nodes and/or links
  • The system measures the relevance score of each
    document against the modified query graph
  • The system ranks the search results in descending
    score order and displays their titles in the user
    interface
  • The user selects the documents relevant to his/her needs
  • The system refines the query graph based on the
    documents selected by the user and the old query
    graph
  • The system displays the new query graph to the user
  • Repeat the previous steps until the user is satisfied
    with the search results

56
Details of step 6: making a new query graph
57
Digest Graph
  • The output of the search engine is presented via a
    graphical representation:
  • a subgraph of the Subject Graph for the entire
    document
  • The subgraph is generated on the fly in response
    to the current query
  • The user can intuitively understand the subject of
    each document from the terms and the term-term
    associations in the graph