Lecture 3: Document Models for IR

1
Lecture 3: Document Models for IR
  • Prof. Xiaotie Deng
  • Department of Computer Science

2
Outline
  • Background
  • Classical Models
  • Latent Semantic Indexing Model
  • Graph Model

3
Background: Document Logical View
  • A text document may be represented for computer
    analysis in different formats:
  • Full text
  • Index terms
  • Structures

4
Background: Document Indexer
  • The huge size of the Internet makes it
    unrealistic to use the full text for information
    retrieval when a quick response is required
  • The indexer simplifies the logical view of a
    document
  • The indexing method dictates the document storage
    and retrieval algorithms
  • Automation of indexing methods is necessary for
    information retrieval over the Internet.

5
Background: Document Indexer: Possible
Drawbacks
  • Summarizing a document by a set of index terms
    may lead to poor retrieval performance:
  • many unrelated documents may be included in the
    answer set for a query
  • relevant documents which are not indexed by any
    of the query keywords cannot be retrieved

6
Background: IR Models: A Formal Description
  • An IR model is a quadruple [D, Q, F, R(q, d)]
  • D (documents) is a set composed of logical views
    (or representations) of the documents in the
    collection.
  • Q (queries) is a set composed of logical views
    (or representations) of user information needs.
  • F (framework) is a framework for modeling
    document representations, queries, and their
    relationships.
  • R(q, d) is a ranking function which associates a
    real number with a query q and a document
    representation d. Such a ranking defines an
    ordering among the documents with regard to the
    query q.
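A minimal sketch of the quadruple in code. The set-based document view and the overlap-based ranking function R(q, d) are invented for this illustration, not part of the lecture's definition:

```python
from typing import Set

Document = Set[str]   # logical view of a document: a set of index terms
Query = Set[str]      # logical view of an information need

def overlap_rank(q: Query, d: Document) -> float:
    """R(q, d): here, the fraction of query terms that d contains."""
    return len(q & d) / len(q) if q else 0.0

# D: the collection's logical views; Q: one user query.
D = [{"bake", "bread", "recipes"}, {"pastry"}]
Q = {"bread", "recipes"}
ranking = sorted(D, key=lambda d: overlap_rank(Q, d), reverse=True)
```

Any concrete model (Boolean, vector space, probabilistic) is obtained by choosing different D, Q, F, and R.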

7
Classic Models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model

8
Classic Models: Boolean Model
  • Document representation: full text or a set of
    keywords (contained in the text or not)
  • Query representation: logic operators, query
    terms, query expressions
  • Searching: use the inverted file and set
    operations to construct the result set

9
Classic Models: Boolean Model: Search
  • Queries
  • A AND B AND (C OR D)
  • Break the collection into two unordered sets:
  • documents that match the query
  • documents that don't
  • Return all that match the query
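The search above can be sketched with an inverted file; the postings (term to set of document IDs) below are made up for illustration:

```python
# A sketch of Boolean search over an inverted file; the postings
# (term -> set of document IDs) are made up for illustration.
inverted = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {3},
    "D": {2, 5},
}

# Evaluate A AND B AND (C OR D) with set operations on the postings.
result = inverted["A"] & inverted["B"] & (inverted["C"] | inverted["D"])
# result is the unordered set of matching documents: {2, 3}
```

AND maps to set intersection and OR to set union, which is why an inverted file makes Boolean retrieval cheap.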

10
Classic Models: Boolean Model: Example 1
[Figure: Venn diagram over the index terms ka, kb,
kc, shading the three conjunctive components for the
query q = ka ∧ (kb ∨ ¬kc): (1,1,0), (1,0,0), and
(1,1,1)]
11
Classic Models: Boolean Model: Example 2
Consider three documents: about CityU at
http://www.cityu.edu.hk/cityu/about/index.htm, about
FSE at http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm,
and about CS at http://www.cs.cityu.edu.hk/content/about/.
The query "degree AND aim" returns only the page
about CityU; the query "degree OR aim" returns all
three.
12
Classic Models: Boolean Model: Advantages
  • Simple and clean formalism
  • The answer set is exactly what the user asked
    for:
  • users therefore have complete control, if they
    know how to write a Boolean formula of terms for
    the document(s) they want to find
  • Easy to implement on computers
  • Popular (most search engines support this model)

13
Classic Models: Boolean Model: Disadvantages
  • All retrieved documents are considered equal;
    there is no ranking of the documents
  • The set of documents that satisfies a query may
    be too large for the user to browse through, or
    too small
  • Users may only know what they are looking for in
    a vague way and be unable to formulate it as a
    Boolean expression
  • Users need to be trained

14
Classic Models: Boolean Model: Improvement
  • Expand and refine query through interactive
    protocols
  • Automation of query formula generation
  • Assign weights to query terms and rank the
    results accordingly

15
Classic Models: Vector Space Model
  • Representation
  • Similarity Measure
  • Advantages
  • Limitations

16
Classic Models: Vector Space Model:
Representation: Introduction
  • Represent stored texts as well as information
    queries by vectors of terms
  • A term is typically a word, a word stem, or a
    phrase associated with the text under
    consideration
  • Generate terms by a term-weighting system:
  • terms are not equally useful for content
    representation
  • assign high weights to terms deemed important and
    low weights to the less important terms

17
Classic Models: Vector Space Model:
Representation: Illustration
  • Every document in the collection is represented
    by a vector
  • The distinct terms in the collection are called
    the index terms, or the vocabulary

[Figure: a page collection about computers and its
index terms: Computer, XML, Operating System,
Microsoft Office, Unix, Search Engines]
18
Classic Models: Vector Space Model:
Representation: Term Relationships
  • Each term i is identified as Ti
  • There is no relationship between terms in the
    vector space: they are orthogonal
  • In reality, however, terms in a collection about
    computers, such as "computer" and "OS", are
    correlated with each other

19
Classic Models: Vector Space Model:
Representation: Vector
  • A vocabulary of 2 terms forms a 2-D space; each
    document may contain 0, 1, or 2 of the terms. We
    may see vectors such as the following:
  • D1 = <0, 0>
  • D2 = <0, 0.3>
  • D3 = <2, 3>

20
Classic Models: Vector Space Model:
Representation: Matrix
  • t terms form a t-D space
  • Documents and queries can be represented as t-D
    vectors
  • A document can be considered a point in the t-D
    space
  • We may form an n-by-t matrix for n documents
    indexed with t terms
  • The matrix is called the document/term matrix

21
Classic Models: Vector Space Model:
Representation: Matrix Illustration
[Figure: the document/term matrix, with documents as
rows, terms as columns, and each entry the weight of
a term in a document]
22
Classic Models: Vector Space Model:
Representation: Matrix Weight
  • Combine two factors in the document-term weight:
  • tf_ij: frequency of term j in document i
  • df_j: document frequency of term j = number of
    documents containing term j
  • idf_j: inverse document frequency of term j =
    log2(N / df_j), where N is the number of
    documents in the collection
  • Inverse document frequency is an indication of a
    term's value as a document discriminator.

23
Classic Models: Vector Space Model:
Representation: Matrix Weight: tf-idf
Indicator
  • A term that occurs frequently in a document but
    rarely in the rest of the collection gets a high
    weight
  • A typical combined term importance indicator:
  • w_ij = tf_ij × idf_j = tf_ij × log2(N / df_j)
  • Many other ways to determine the document-term
    weight have been proposed

24
Classic Models: Vector Space Model:
Representation: Matrix Example, Page 1
  • 5 Documents
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe

25
Classic Models: Vector Space Model:
Representation: Matrix Example, Page 2
  • 6 Index Terms
  • T1: bak(e, ing)
  • T2: recipes
  • T3: bread
  • T4: cake
  • T5: pastr(y, ies)
  • T6: pie

26
Classic Models: Vector Space Model:
Representation: Matrix Example, Page 3
  • D1: How to Bake Bread without Recipes
  • D2: The Classic Art of Viennese Pastry
  • D3: Numerical Recipes: The Art of Scientific
    Computing
  • D4: Breads, Pastries, Pies and Cakes: Quantity
    Baking Recipes
  • D5: Pastry: A Book of Best French Recipe

27
Classic Models: Vector Space Model:
Representation: Matrix Example, Page 4
Term frequencies in the documents: entry (i, j) = 1
if document Di contains term Tj once, 0 otherwise.

       T1  T2  T3  T4  T5  T6
  D1    1   1   1   0   0   0
  D2    0   0   0   0   1   0
  D3    0   1   0   0   0   0
  D4    1   1   1   1   1   1
  D5    0   1   0   0   1   0
28
Classic Models: Vector Space Model:
Representation: Matrix Example, Page 5
Document frequency of each term j:

        T1  T2  T3  T4  T5  T6
  df_j   2   4   2   1   3   1
29
Classic Models: Vector Space Model:
Representation: Matrix Example, Page 6
tf-idf weight matrix, w_ij = tf_ij × log2(5 / df_j)
(all logarithms base 2):

       T1        T2        T3        T4      T5        T6
  D1   log(5/2)  log(5/4)  log(5/2)  0       0         0
  D2   0         0         0         0       log(5/3)  0
  D3   0         log(5/4)  0         0       0         0
  D4   log(5/2)  log(5/4)  log(5/2)  log(5)  log(5/3)  log(5)
  D5   0         log(5/4)  0         0       log(5/3)  0
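The weight matrix above can be reproduced in a few lines, assuming binary term frequencies (each term occurs at most once per title), as the example suggests:

```python
import math

# Reproducing the slide's tf-idf matrix.  Rows are D1..D5, columns are
# T1..T6 (bake, recipes, bread, cake, pastry, pie); tf entries are 0/1.
tf = [
    [1, 1, 1, 0, 0, 0],   # D1: How to Bake Bread without Recipes
    [0, 0, 0, 0, 1, 0],   # D2: The Classic Art of Viennese Pastry
    [0, 1, 0, 0, 0, 0],   # D3: Numerical Recipes: ...
    [1, 1, 1, 1, 1, 1],   # D4: Breads, Pastries, Pies and Cakes: ...
    [0, 1, 0, 0, 1, 0],   # D5: Pastry: A Book of Best French Recipe
]
N = len(tf)
df = [sum(row[j] > 0 for row in tf) for j in range(6)]   # [2, 4, 2, 1, 3, 1]
W = [[f * math.log2(N / df[j]) for j, f in enumerate(row)] for row in tf]
# e.g. W[0][0] == log2(5/2) and W[3][3] == log2(5), matching the matrix above.
```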
30
Classic Models: Vector Space Model:
Representation: Matrix Exercise
  • Write a program that uses tf-idf term weights to
    form the term/document matrix. Test it on the
    following three documents:
  • http://www.cityu.edu.hk/cityu/about/index.htm
  • http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm
  • http://www.cs.cityu.edu.hk/content/about/

31
Classic Models: Vector Space Model:
Similarity Measure
  • Determine the similarity between a document D and
    a query Q
  • Many methods can be used to calculate the
    similarity
  • Cosine similarity measure

32
Classic Models: Vector Space Model:
Similarity Measure: Cosine
Cosine similarity measures the cosine of the angle θ
between the two vectors dj and Q:

sim(dj, Q) = (dj · Q) / (|dj| × |Q|)
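A minimal sketch of the cosine measure on term-weight vectors; the plain-list vector encoding is our choice for illustration:

```python
import math

# Cosine similarity: sim(d, q) = (d . q) / (|d| |q|).
def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))
    nd = math.sqrt(sum(di * di for di in d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    return dot / (nd * nq) if nd and nq else 0.0
```

Parallel vectors score 1 and orthogonal vectors score 0; since vector length cancels out, a long document is not favored over a short one with the same term proportions.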
33
Classic Models: Vector Space Model: Advantages
  • Term-weighting scheme improves retrieval
    performance
  • Partial matching strategy allows retrieval of
    documents that approximate the query conditions
  • Its cosine ranking formula sorts the documents
    according to their degree of similarity to the
    query

34
Classic Models: Vector Space Model:
Limitations
  • The underlying assumption is that the terms in
    the vector are orthogonal
  • Several query terms are needed if a
    discriminating ranking is to be achieved, whereas
    only two or three ANDed terms may suffice in a
    Boolean environment to obtain high-quality
    output
  • It is difficult to explicitly specify synonymous
    and phrasal relationships, whereas these are
    easily handled in a Boolean environment by means
    of the OR and AND operators or by an extended
    Boolean model

35
Latent Semantic Indexing Model
  • Map document and query vectors into a
    lower-dimensional space which is associated with
    concepts
  • "Information retrieval using a singular value
    decomposition model of latent semantic
    structure," by G.W. Furnas, S. Deerwester,
    S.T. Dumais, T.K. Landauer, R.A. Harshman,
    L.A. Streeter, and K.E. Lochbaum, 11th ACM SIGIR
    Conference, pp. 465-480, 1988
  • http://www.cs.utk.edu/lsi/
  • A tutorial: http://www.cs.utk.edu/berry/lsi/node5.html

36
LSI: General Approach
  • It is based on the vector space model
  • In the vector space model, terms are treated
    independently
  • Here, relationships among terms are obtained
    implicitly through matrix analysis
  • This allows reduction of unnecessary
    information in the document representation.

37
LSI: Background Knowledge: Term-Document
Association Matrix
  • Let t be the number of terms and N be the number
    of documents
  • Let M = (Mij) be the term-document association
    matrix
  • Mij may be considered the weight associated with
    the term-document pair (ti, dj)

38
LSI: Background Knowledge: Eigenvalue and
Eigenvector
  • We have
  • A: an n × n matrix
  • v: an n-dimensional vector
  • c: a scalar
  • If Av = cv, then
  • c is called an eigenvalue of A
  • v is called an eigenvector of A

39
LSI: Background Knowledge: Eigenvalue and
Eigenvector: Example, Page 1
  • A = [[2, 1], [1, 2]], x = (1, 1)^t
  • Then Ax = 3x
  • 3 is an eigenvalue, and x is an eigenvector
  • Question: find another eigenvalue

40
LSI: Background Knowledge: Eigenvalue and
Eigenvector: Example, Page 2
  • y^t = (1, -1). Ay = (1, -1)^t = y.
  • Therefore, another eigenvalue is 1 and its
    associated eigenvector is y
  • Then let
  • S = [[3, 0], [0, 1]]
  • Then A(x, y) = (x, y)S
  • Moreover, x^t y = 0

41
LSI: Background Knowledge: Eigenvalue and
Eigenvector: Example, Page 3
  • Let K = (x, y) / sqrt(2)
  • Then
  • K^t K = I
  • and A = K S K^t
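The example can be checked numerically; a dependency-free sketch (the matmul/transpose helpers are ours) verifying K^t K = I and A = K S K^t for the running example:

```python
import math

# Numerical check of the running example: A = [[2, 1], [1, 2]] with
# eigenpairs (3, x = (1, 1)^t) and (1, y = (1, -1)^t).
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

s = 1 / math.sqrt(2)
K = [[s, s], [s, -s]]    # columns: x / sqrt(2) and y / sqrt(2)
S = [[3, 0], [0, 1]]     # diagonal matrix of the eigenvalues

KtK = matmul(transpose(K), K)                # = identity
KSKt = matmul(matmul(K, S), transpose(K))    # reconstructs A = [[2, 1], [1, 2]]
```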

42
LSI: Background Knowledge: A Corollary of the
Eigendecomposition Theorem
  • If A is a symmetric matrix, then there exist a
    matrix K (with K^t K = I) and a diagonal matrix S
    such that A = K S K^t
  • See http://mathworld.wolfram.com/EigenDecompositionTheorem.html
  • Application to our setting:
  • Both M M^t and M^t M are symmetric
  • In addition, their eigenvalues are the same,
    except that the larger matrix has an extra number
    of zero eigenvalues.

43
LSI: Background Knowledge: Matrix
Decomposition
  • Decompose M = K S D^t
  • K: the matrix of eigenvectors derived from the
    term-to-term correlation matrix given by M M^t
  • D: the matrix of eigenvectors derived from M^t M
  • S: an r × r diagonal matrix of singular values,
    where r is the rank of M

44
LSI: Reduced Concept Space
  • Let Ss be the diagonal matrix of the s largest
    singular values of S.
  • Let Ks and Ds^t be the corresponding columns of K
    and rows of D^t.
  • The matrix Ms = Ks Ss Ds^t
  • is closest to M in the least-squares sense.
  • NOTE: Ms has the same number of rows (terms) and
    columns (documents) as M, but its entries may be
    quite different from those of M.
  • A numerical example:
  • http://www.cse.ogi.edu/class/cse580ir/handouts/10%20October/Models%20in%20Information%20Retrieval%20II%20Motivating%20Engineering%20Decisions/sld041.htm

45
LSI: The Relationship of Two Documents di and
dj
  • Ms^t Ms = (Ks Ss Ds^t)^t (Ks Ss Ds^t)
  •         = Ds Ss Ks^t Ks Ss Ds^t
  •         = Ds Ss Ss Ds^t
  •         = (Ds Ss)(Ds Ss)^t
  • The (i, j) element quantifies the relationship
    between documents i and j.
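The identity above, together with the query ranking M_s^t Q used later, can be sketched with NumPy; the term-document matrix M below (4 terms, 3 documents) and the query q are made up for illustration, and NumPy is assumed to be available:

```python
import numpy as np

# Decompose a small made-up term-document matrix M (terms x documents)
# and keep only the s = 2 largest singular values.
M = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
K, svals, Dt = np.linalg.svd(M, full_matrices=False)   # M = K S D^t
s = 2
Ks, Ss, Dst = K[:, :s], np.diag(svals[:s]), Dt[:s, :]
Ms = Ks @ Ss @ Dst            # rank-s matrix closest to M (least squares)

DsSs = Dst.T @ Ss             # document coordinates in the concept space
doc_sims = DsSs @ DsSs.T      # entry (i, j) relates documents i and j

q = np.array([1., 1., 0., 0.])   # a query as a pseudo-document over the terms
scores = Ms.T @ q                # ranks of all documents w.r.t. the query
```

Because the columns of Ks are orthonormal, doc_sims equals Ms^t Ms exactly, which is the identity derived above.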

46
LSI: The Choice of s
  • It should be large enough to fit all the real
    structure in the original data
  • It should be small enough to filter out the noise
    caused by variations in the choice of terms.

47
LSI: Ranking of Query Results
  • Model the query Q as a pseudo-document vector in
    the original term-document space of M
  • The vector Ms^t Q then gives the ranks of all
    documents with respect to the query Q.

48
LSI: Advantages
  • When s is small with respect to t and N, it
    provides an efficient indexing model
  • It eliminates noise and removes redundancy
  • It introduces conceptualization based on the
    theory of singular value decomposition

49
Graph Model
  • "Improving Effectiveness and Efficiency of Web
    Search by Graph-based Text Representation,"
    Junji Tomita and Yoshihiko Hayashi,
    http://www9.org/final-posters/13/poster13.html
  • "Interactive Web Search by Graphical Query
    Refinement," Junji Tomita and Genichiro Kikui,
    http://www10.org/cdrom/posters/1078.pdf

50
Graph Model: Subject Graph: Basics
  • A node represents a term in the text.
  • A link denotes an association between the linked
    terms.
  • The significance of terms and of term-term
    associations is represented by weights assigned
    to them.

51
Graph Model: Subject Graph: Weight Assignment
  • Term-statistics-based weighting schemes:
  • frequencies of terms
  • frequencies of term-term associations
  • multiplied by the inverse document frequency

52
Graph Model: Subject Graph: Similarity of
Documents
  • Subject graph matching:
  • Weight terms and term-term associations by µ
    and 1 - µ, for a suitably chosen µ.
  • Then calculate the cosine similarity of the two
    documents, treating the weighted terms and
    term-term associations as elements of a vector
    space model.
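The weighted-cosine matching above can be sketched as follows; the graph encoding (dicts keyed by terms and frozenset pairs) and the sample documents are our invention, not the paper's representation:

```python
import math

# A sketch of subject-graph matching.  A graph maps a term (str) or a
# term-term association (frozenset of two terms) to its weight; mu
# balances term weights against association weights.
def graph_cosine(g1, g2, mu=0.5):
    def w(g, k):
        scale = mu if isinstance(k, str) else 1 - mu
        return scale * g.get(k, 0.0)
    dot = sum(w(g1, k) * w(g2, k) for k in set(g1) | set(g2))
    n1 = math.sqrt(sum(w(g1, k) ** 2 for k in g1))
    n2 = math.sqrt(sum(w(g2, k) ** 2 for k in g2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

d1 = {"travel": 2.0, "Japan": 1.0, frozenset({"travel", "Japan"}): 1.0}
d2 = {"travel": 1.0, "Asia": 1.0, frozenset({"travel", "Asia"}): 1.0}
```

Setting µ = 1 ignores associations and recovers ordinary term-vector cosine; µ = 0 matches on associations alone.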

53
Graph Model: Query As A Graph
  • Sometimes the user's query is vague
  • The system represents the user's query as a query
    graph
  • The user can interactively and explicitly clarify
    his/her query by inspecting and editing the query
    graph
  • The system implicitly edits the query graph
    according to the user's choice of documents

54
Graph Model: Query As A Graph: A Sample
System User Interface
[Figure: screenshot of the sample system's user
interface]
55
Graph Model: Query As A Graph: Illustration
[Figure: an example query graph with the term nodes
"guide", "transport", "travel", "train", "Asia", and
"Japan"; some nodes are connected by links and some
have no link]
56
Graph Model: Query As A Graph: Interactive
Graph Query Refinement
  • The user inputs sentences as a query; the system
    displays the initial query graph built from the
    input
  • The user edits the query graph by removing and/or
    adding nodes and/or links
  • The system measures the relevance score of each
    document against the modified query graph
  • The system ranks the search results in descending
    score order and displays their titles in the user
    interface
  • The user selects documents relevant to his/her
    needs
  • The system refines the query graph based on the
    selected documents and the old query graph
  • The system displays the new query graph to the
    user
  • The previous steps repeat until the user is
    satisfied with the search results

57
Graph Model: Query As A Graph: Interactive
Graph Query Refinement: Details of Step 6 --
Making A New Query Graph
[Figure: how the new query graph is built from the
selected documents and the old query graph]
58
Graph Model: Query As A Graph: Digest Graph
  • The output of the search engine is presented via
    a graphical representation:
  • a sub-graph of the subject graph for the entire
    document.
  • The sub-graph is generated on the fly in response
    to the current query.
  • Users can intuitively understand the subject of
    each document from the terms and the term-term
    associations in the graph.

59
Summary
  • Background
  • Classical Models
  • Latent Semantic Indexing Model
  • Graph Model