1
INFORMATION SEARCH
  • Presenter Pham Kim Son
  • Saint Petersburg State University

2
Overview
  • Introduction
  • Overview of IR system and basic terminologies
  • Classical models of Information Retrieval
  • Boolean Model
  • Vector Model
  • Modern models of Information Retrieval
  • Latent Semantic Indexing
  • Correlation Method
  • Conclusion

3
Introduction
  • People need information to solve problems
  • Very simple things
  • Complex things
  • The perfect search engine, as defined by Larry
    Page, is something that understands exactly what
    you mean and returns exactly what you want.

4
Challenges to IR
  • Introduction of the WWW
  • The WWW is large
  • The WWW is heterogeneous
  • These properties pose challenges to IR

5
Overview of IR and basic terminologies
  • An IR system can be divided into 3 components
  • Input
  • Processor
  • Output

[Diagram: Documents and Query feed the Input component, which feeds the Processor, which produces the Output]
6
Input
  • The main task: obtain a good representation of
    each document and query for the computer to use
  • A document representative: a list of extracted
    keywords
  • Idea: let the computer process the natural
    language in the document

7
Input (cont)
  • Obstacles:
  • The theory of human language has not been
    sufficiently developed for every language
  • It is not clear how to use it to enhance
    information retrieval

8
Input (cont)
  • Representing documents by keywords
  • Step 1: Removing common words
  • Step 2: Stemming

9
  • Step 1: Removing common words
  • Very high-frequency words are very common words
  • They should be removed
  • by comparing against a stop-list
  • So:
  • Non-significant words do not interfere with the
    IR process
  • The document size is reduced by between
    30 and 50 per cent (C.J. van Rijsbergen)

Example stop words: are, be, by, will, of, the
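The removal step can be sketched in a few lines of Python (a minimal sketch; the stop-list here is only the handful of example words above, not a full one):

```python
# Words from a stop-list are filtered out before indexing.
STOP_WORDS = {"are", "be", "by", "will", "of", "the"}

def remove_stop_words(tokens):
    """Drop non-significant words by comparing against the stop-list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept = remove_stop_words("The theory of human language".split())
# kept == ["theory", "human", "language"]
```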
10
  • Step 2: Stemming
  • Def. Stemming: the process of chopping the
    ending of a term, e.g. removing 'ed', 'ing'
  • Algorithm: Porter
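A naive suffix-chopping sketch (the real Porter algorithm has many more rules and conditions; the suffix list and minimum stem length here are illustrative assumptions):

```python
# Chop common endings off a term: a toy stand-in for the Porter stemmer.
SUFFIXES = ("ing", "ed", "s")

def stem(term):
    for suffix in SUFFIXES:
        # keep at least a 3-letter stem so short words survive intact
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

# stem("walking") == "walk", stem("removed") == "remov", stem("is") == "is"
```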

11
Processor
  • This part of the IR system is concerned with
    the retrieval process:
  • structuring the documents in an appropriate way
  • performing the actual retrieval function, using
    a predefined model

12
Output
  • The output is a set of documents, assumed to be
    relevant to the user.
  • Purpose of IR:
  • retrieve all relevant documents
  • retrieve as few irrelevant documents as possible

13
  • Definition:

    Recall = (relevant docs retrieved) / (all relevant docs)

    Precision = (relevant docs retrieved) / (all docs retrieved)
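Both measures can be computed directly from sets of document ids (a small sketch; the document ids are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant docs retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
# p == 0.5 (2 of 4 retrieved are relevant), r == 2/3 (2 of 3 relevant retrieved)
```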
14
[Figure: the precision/recall trade-off, with Precision (0 to 1) on one axis and Recall (0 to 1) on the other]
15
Information Retrieval Models
  • Classical models
  • Boolean model
  • Vector model
  • Novel models
  • Latent semantic indexing model
  • Correlation method

16
Boolean model
  • The earliest and simplest method, still widely
    used in IR systems today
  • Based on set theory and Boolean algebra
  • Queries are Boolean expressions of keywords,
    connected by the operators AND, OR, AND NOT
  • Ex: (Saint Petersburg AND Russia) OR (beautiful
    city AND Russia)

17
  • Inverted files are widely used
  • Ex:
  • Term1: doc2, doc5, doc6
  • Term2: doc2, doc4, doc5
  • Query q = (term1 AND term2)
  • Result: doc2, doc5
  • A term-document matrix can also be used
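The inverted-file example above can be reproduced with Python sets, evaluating an AND query as an intersection of posting sets:

```python
# Inverted file: each term maps to the set of documents containing it.
index = {
    "term1": {2, 5, 6},
    "term2": {2, 4, 5},
}

def boolean_and(index, *terms):
    """Evaluate (t1 AND t2 AND ...) by intersecting posting sets."""
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result if result is not None else set()

hits = boolean_and(index, "term1", "term2")
# hits == {2, 5}, matching the slide's result: doc2, doc5
```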

18
Thinking about Boolean model
  • Advantages:
  • Very simple model, based on set theory
  • Easy to understand and implement
  • Supports exact queries
  • Disadvantages:
  • Retrieval is based on a binary decision
    criterion, with no notion of partial matching
  • Sets are easy, but complex Boolean expressions
    are not
  • The Boolean queries formulated by users are most
    often too simplistic
  • Retrieves too many or too few documents
  • Gives unranked results

19
Vector Model
  • Why a vector model?
  • The Boolean model
  • only takes into account the existence or
    nonexistence of terms in a document
  • has no sense of their different contributions
    to documents

20
Overview of the vector model
  • Documents and queries are represented as vectors
    in index-term space
  • The space dimension equals the vocabulary size
  • The components of these vectors are the weights
    of the corresponding index terms, reflecting
    their significance in terms of representative
    and discrimination power
  • Retrieval is based on whether the query vector
    and document vector are close enough.

21
  • A set of documents
  • A finite set of terms
  • Every document can be represented as a vector of
    term weights
  • and the query likewise

22
[Figure: document vector dj and query vector q, separated by an angle θ, in term space]
  • Similarity of query q and document dj:
    sim(q, dj) = cos θ = (q · dj) / (|q| |dj|)
  • Given a threshold, all documents with
    similarity > threshold are retrieved
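The cosine similarity and threshold retrieval can be sketched as follows (the query, document vectors and threshold value are made-up toy data):

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# retrieve documents whose similarity to the query exceeds a threshold
query = [1.0, 0.0, 1.0]
docs = {"d1": [0.5, 0.0, 0.5], "d2": [0.0, 1.0, 0.0]}
retrieved = [d for d, vec in docs.items() if cosine(query, vec) > 0.5]
# retrieved == ["d1"]
```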

23
Computing a good weight
  • A variety of weighting schemes are available
  • They are based on three proven principles:
  • Terms that occur in only a few documents are more
    valuable than ones that appear in many
  • The more often a term occurs in a document, the
    more likely it is to be important to that
    document
  • A term that appears the same number of times in a
    short document and in a long one is likely to be
    more valuable for the former

24
tf-idf-dl (tf-idf) scheme
  • tfij: the term frequency of a term ti in
    document dj
  • The length of document dj:
  • DLj = total number of term occurrences in
    document dj
  • Inverted document frequency: in a collection of N
    documents, the inverted document frequency of a
    term ti that appears in n documents is
    idfi = log(N / n)
  • Weight: wij is built from these quantities,
    e.g. wij = (tfij / DLj) · idfi
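One common instantiation of this scheme is length-normalized term frequency times log idf (a sketch under that assumption; the exact formula on the original slide was lost, and the toy documents are made up):

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    N = len(docs)
    # document frequency n: number of documents containing each term
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        dl = len(doc)                     # DLj: total term occurrences in dj
        w = {}
        for term in set(doc):
            tf = doc.count(term) / dl     # length-normalized term frequency
            idf = math.log(N / df[term])  # inverted document frequency
            w[term] = tf * idf
        weights.append(w)
    return weights

w = tf_idf([["human", "interface"], ["human", "computer"]])
# "human" appears in every document, so its idf (and hence weight) is 0
```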

25
Think about Vector model
  • Advantages:
  • Term weighting improves the quality of the answer
  • Partial matching allows retrieving documents
    that approximate the query conditions
  • The cosine ranking formula sorts the answer
  • Disadvantages:
  • Assumes the independence of terms
  • The polysemy and synonymy problems are unsolved

26
Modern models of IR
  • Why?
  • Problems with polysemy
  • Bass (fish or music?)
  • Problems with synonymy
  • Car or automobile?
  • These failures can be traced to:
  • The way index terms are identified is incomplete
  • Lack of efficient methods to deal with polysemy
  • Idea to solve this problem: take advantage of
    the implicit higher-order (latent) structure in
    the association of terms with documents.

27
Latent Semantic Indexing (LSI)
  • LSI overview
  • Representing documents merely by terms suffers
    from unreliability, ambiguity and redundancy
  • We should find a method in which
  • documents and terms are represented as vectors
    in a k-dimensional concept space, with weights
    indicating the strength of association with each
    of these concepts
  • That method should be flexible enough to remove
    the weak concepts, considered as noise

28
  • A document-term matrix Amxn is built
  • Matrix A is factored into 3 matrices using the
    Singular Value Decomposition (SVD):
    A = U S V^T, where S is the diagonal matrix of
    singular values
  • U, V are orthogonal matrices
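With numpy the factorization and its rank-k truncation look like this (a sketch on a made-up toy matrix; here rows are terms and columns are documents):

```python
import numpy as np

# toy term-document matrix (4 terms x 3 documents); entries are term counts
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A == U @ diag(s) @ Vt
k = 2                                             # keep the 2 largest factors
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of A
```

Dropping the smallest singular values discards the weak (noisy) factors while keeping the dominant structure.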

29
  • These special matrices show a breakdown of the
    original relationship (document-term) into
    linearly independent components (factors)
  • Many of these components are very small
    (considered as noise) and should be ignored

30
  • Criteria for choosing k:
  • ignore the noise
  • important information is not lost
  • Documents and terms are represented as vectors
    in the k-dimensional space
  • The Eckart-Young theorem assures us that
    important information is not lost

31
Query
  • We need a method to map a query into the
    k-dimensional space
  • A query q can be seen as a (pseudo-)document.
  • From the equation A = U S V^T
  • we have q_k = q^T U_k S_k^-1
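This folding-in of a query as a pseudo-document, q_k = q^T U_k S_k^-1, can be sketched with numpy (toy matrix and query are made up):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],   # toy term-document matrix (3 terms x 3 docs)
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])

q = np.array([1.0, 1.0, 0.0])      # query over the 3 terms, as a pseudo-doc
q_k = q @ Uk @ np.linalg.inv(Sk)   # coordinates of the query in concept space
```

Folding in an existing document column of A this way reproduces that document's row in V_k, which is a quick sanity check on the formula.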

32
Similarity between objects
  • Term-term
  • The dot product between two row vectors of
    matrix Ak reflects the similarity between two
    terms
  • Term-term similarity matrix:
    Ak Ak^T = U_k S_k^2 U_k^T
  • We can consider the rows of matrix U_k S_k as
    coordinates of the terms.
  • The relation between taking rows of U_k as
    coordinates and rows of U_k S_k as coordinates
    is simple: each dimension is scaled by its
    singular value

33
  • Document-document
  • The dot product between two column vectors of
    matrix Ak reflects the similarity between two
    documents: Ak^T Ak = V_k S_k^2 V_k^T
  • We can consider the rows of matrix V_k S_k as
    coordinates of the documents.
  • Term-document
  • This value can be obtained by looking at the
    elements of matrix Ak
  • Drawback: between and within comparisons cannot
    be done simultaneously without rescaling the
    coordinates.

34
Example
  • q1: Human machine interface for Lab ABC computer
    applications
  • q2: A survey of user opinion of computer system
    response time
  • q3: The EPS user interface management system
  • q4: System and human system engineering testing
    of EPS
  • q5: Relation of user-perceived response time to
    error measurement
  • q6: The generation of random, binary, unordered
    trees
  • q7: The intersection graph of paths in trees
  • q8: Graph minors IV: Widths of trees and
    well-quasi-ordering
  • q9: Graph minors: A survey

35
Numerical results
36
Query: human computer interaction
37
Updating
  • Folding in:
  • a new document is folded into the existing space
  • new terms and docs have no effect on the
    representation of pre-existing docs and terms
  • Re-computing the SVD:
  • the SVD is recomputed from scratch
  • requires time and memory
  • One of these two methods must be chosen

38
Think about LSI
  • Advantages:
  • The synonymy problem is solved
  • Documents are represented in a more reliable
    space: the concept space
  • Disadvantages:
  • The polysemy problem is still unsolved
  • Special algorithms for handling large matrices
    must be implemented

39
Correlation Method
  • Idea: if a keyword is present in the document,
    correlated keywords should be taken into account
    as well. Then the concepts contained in the
    document aren't obscured by the choice of a
    specific vocabulary.
  • In the vector space model, a similarity vector
    is computed between the query and each document
  • Depending on the user query q, we now build the
    best query, taking the correlated keywords into
    account as well.

40
  • The correlation matrix is built from the
    term-document matrix
  • Let A be the term-document matrix
  • D is the number of documents and m is the mean
    of the document (column) vectors
  • The covariance matrix C is computed from the
    mean-centered columns of A
  • The correlation matrix S is obtained by
    normalizing C: Sij = Cij / sqrt(Cii Cjj)
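These steps can be sketched with numpy (a toy term-document matrix; the data is made up, and rows are assumed to be terms):

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],    # toy term-document matrix, rows = terms
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0]])
D = A.shape[1]                    # number of documents

mean = A.mean(axis=1, keepdims=True)   # mean of the document vectors
C = (A - mean) @ (A - mean).T / D      # covariance between terms
d = np.sqrt(np.diag(C))                # per-term standard deviations
S = C / np.outer(d, d)                 # correlation matrix; S[i, i] == 1
```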

41
  • A better query
  • We now use the SVD to reduce noise in the
    correlation of keywords
  • We choose the first k largest factors to obtain
    the k-dimensional approximation of S
  • Generate our best query
  • Compute the vector of similarities
  • Define a projection through the reduced
    correlation matrix

42
Strength of correlation method
  • In the real world, correlations between words
    are stable.
  • The number of terms is more stable than the
    number of documents
  • The number of documents is many times larger
    than the number of keywords.
  • This method can handle databases with a very
    large number of documents, and the correlation
    matrix doesn't have to be updated every time new
    documents are added.
  • This is important in electronic networks.

43
Conclusion
  • IR overview
  • Classical IR models
  • Boolean model
  • Vector model
  • Modern IR models
  • LSI
  • Correlation methods

44
Any questions?
45
References
  • Gheorghe Muresan: Using Document Clustering and
    Language Modeling in Mediated IR
  • Georges Dupret: Latent Concepts and the Number
    of Orthogonal Factors in Latent Semantic Indexing
  • C.J. van Rijsbergen: Information Retrieval
  • Ricardo Baeza-Yates: Modern Information Retrieval
  • Sandor Dominich: Mathematical Foundations of
    Information Retrieval
  • IR lecture notes from www.cs.utexas.edu
  • Scott Deerwester, Susan T. Dumais: Indexing by
    Latent Semantic Analysis

46
Desktop Search
  • Design a desktop search tool that satisfies:
  • it does not affect the computer's performance
  • privacy is protected
  • ability to search as many types of files as
    possible (consider music files)
  • multi-language search
  • user-friendly
  • supports semantic search if possible