Introduction to Information Retrieval - PowerPoint PPT Presentation

Provided by: lin87


1
Introduction to Information Retrieval
  • Martin Volk
  • Computational Linguistics
  • Stockholm University
  • volk_at_ling.su.se

2
Documents
  • Textual documents
  • Manuals (Software manuals, airplane manuals,
    power plant manuals)
  • Patent texts
  • Law texts
  • Medical texts
  • Audio and video documents

3
IR System
[Figure: IR System]

4
Information Retrieval
  • Terms (JM p. 647)
    • Document = a unit available for retrieval
    • Collection = the set of all documents
    • Term = a lexical item in a document
    • Query = a formulation of the user's information
      need (question)
    • Index = a list of terms with pointers to their
      source

5
Questions
  • How can the relevant information be found?
  • How can the relevant facts be retrieved?
  • How can the found facts be used for inference?
  • How can the information be used for
    classification?
  • How can information be found across languages?

6
Relevance
  • Relevance is a subjective judgment and may
    include:
    • Being on the proper subject.
    • Being timely (recent information).
    • Being authoritative (from a trusted source).
    • Satisfying the goals of the user and his/her
      intended use of the information (information
      need).

7
Keyword Search
  • The simplest notion of relevance is that the query
    string appears verbatim in the document.
  • A slightly less strict notion is that the words in
    the query appear frequently in the document, in
    any order (bag of words).
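
The two notions of relevance above can be contrasted in a few lines of Python. This is only a minimal sketch: the function names, the regular-expression tokenizer, and the sample text are assumptions made for illustration, not part of the original slides.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def verbatim_match(query: str, document: str) -> bool:
    """Strict notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()


def bag_of_words_score(query: str, document: str) -> int:
    """Looser notion: count occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))


# Hypothetical sample document
doc = "The bat flew out of the cave. A second bat followed the first bat."

print(verbatim_match("bat cave", doc))      # the two words are not adjacent
print(bag_of_words_score("bat cave", doc))  # "bat" three times, "cave" once
```

The verbatim test fails here because "bat" and "cave" never occur next to each other, while the bag-of-words count still finds both words.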

8
Problems with Keywords
  • May not retrieve relevant documents that include
    synonymous terms.
    • restaurant vs. café
    • PRC vs. China
  • May retrieve irrelevant documents that include
    ambiguous terms.
    • bat (baseball vs. mammal)
    • Apple (company vs. fruit)
    • bit (unit of data vs. act of eating)

9
IR System Components
  • Text operations: form index words (tokens).
    • Stopword removal
    • Stemming
  • Indexing: constructs an inverted index of
    word-to-document pointers.
  • Searching: retrieves documents that contain a
    given query token from the inverted index.
  • Ranking: scores all retrieved documents according
    to a relevance metric.
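
The first three components above can be sketched as a small Python pipeline. The stopword list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), and the three-document collection are all assumptions made for illustration; ranking is left out.

```python
import re
from collections import defaultdict

# Assumed tiny stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are"}


def stem(token: str) -> str:
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def text_operations(text: str) -> list[str]:
    """Tokenize, remove stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(collection: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: each term points to the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return every document that contains at least one query token."""
    hits = set()
    for term in text_operations(query):
        hits |= index.get(term, set())
    return hits


# Hypothetical collection
collection = {
    "d1": "Manuals for power plants",
    "d2": "Patent texts and law texts",
    "d3": "The plant manual is missing",
}
index = build_index(collection)
print(sorted(search(index, "manuals")))
```

Because the query passes through the same text operations as the documents, "manuals" also finds the document that only contains "manual".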

10
Web Search System
[Figure: IR System]

11
Recall and Precision
  • Recall is the ratio of
    • the number of correctly found documents to
    • the number of all relevant documents.
    • → What percentage of the relevant documents has
      been found?
  • Precision is the ratio of
    • the number of correctly found documents to
    • the number of all found documents.
    • → What percentage of the found documents is
      relevant?
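
Both ratios are easy to compute once the found and the relevant documents are treated as sets. The sketch below assumes document IDs are plain strings; the function name and the sample sets are hypothetical.

```python
def precision_recall(found: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of found documents."""
    correct = found & relevant  # correctly found documents
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 documents found, 5 relevant overall, 2 in common
found = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}
precision, recall = precision_recall(found, relevant)
print(precision, recall)
```

Here 2 of the 4 found documents are relevant (precision 2/4) and 2 of the 5 relevant documents were found (recall 2/5).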

12
The vector space model
  • The idea: documents (d) and queries (q) are
    represented as vectors of features that stand for
    the terms (t)
    • $d_j = (t_{1,j}, t_{2,j}, t_{3,j}, \ldots, t_{N,j})$
    • $q_k = (t_{1,k}, t_{2,k}, t_{3,k}, \ldots, t_{N,k})$
  • Simple assumption: terms are present (1) or
    absent (0)
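
As a minimal sketch of the present/absent representation, the following Python builds binary vectors over a fixed vocabulary. The vocabulary, the whitespace tokenization, and the sample texts are assumptions for illustration only.

```python
def binary_vector(text: str, vocabulary: list[str]) -> list[int]:
    """One feature per vocabulary term: 1 if present in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


# Hypothetical vocabulary of N = 5 terms
vocabulary = ["apple", "bat", "bit", "cave", "fruit"]

d_j = binary_vector("the bat left the cave", vocabulary)
q_k = binary_vector("bat cave fruit", vocabulary)

# With 1/0 features, the dot product counts the terms both vectors share.
shared = sum(d * q for d, q in zip(d_j, q_k))
print(d_j, q_k, shared)
```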

13
The vector space model
  • Then the similarity between the query (vector)
    and the document (vector) is measured by summing
    the number of terms they share.
  • But some terms are more important than others.
  • Therefore 1/0 is replaced by weights.
    • $d_j = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{N,j})$
    • $q_k = (w_{1,k}, w_{2,k}, w_{3,k}, \ldots, w_{N,k})$
  • The collection is then a matrix of weights, where
    $w_{i,j}$ stands for the weight of term i in
    document j.
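
The slides do not say how the weights are chosen. The sketch below assumes the simplest option, raw term frequency, purely to make the term-by-document matrix concrete; the function name and the sample collection are hypothetical.

```python
from collections import Counter


def weight_matrix(collection: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return (vocabulary, w), where w[i][j] is the weight of term i in document j.

    Assumption: the weight is the raw term frequency.
    """
    counts = [Counter(text.lower().split()) for text in collection]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical collection of three documents
collection = ["bat bat cave", "apple fruit", "bat fruit fruit"]
vocabulary, matrix = weight_matrix(collection)

for term, row in zip(vocabulary, matrix):
    print(term, row)
```

Each row belongs to one term and each column to one document, so column j is the weight vector of document j.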

14
The vector space model
  • Vectors need to be normalized to unit length.
  • Then the dot product between vectors computes
    the cosine of the angle between them.
    • Identical documents: cosine = 1
    • Totally disjoint documents: cosine = 0
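
The normalization step and the two boundary cases can be checked with a short Python sketch. The function names and the example vectors are assumptions made for illustration.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    length = math.sqrt(sum(w * w for w in vector))
    return [w / length for w in vector] if length else list(vector)


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of the two unit-length vectors, i.e. the cosine of their angle."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


identical = cosine([1, 2, 0], [1, 2, 0])  # same weights in both vectors
disjoint = cosine([1, 0, 0], [0, 3, 4])   # no term in common
print(round(identical, 6), disjoint)
```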

15
Search engines: ranking principles
  • A document D is ranked higher
    • If D contains more search terms.
    • If the search terms are more frequent in D.
    • If the search term occurs less frequently in the
      set of all documents.
    • If D is shorter but still contains the search
      terms.
    • If the search terms are closer to each other in
      D.
    • If the search terms appear earlier (closer to the
      top) in D.
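
The first four principles above can be combined into one score. The sketch below is only one possible combination and is not taken from the slides: it assumes a tf-idf style weight (term frequency times a logarithmic inverse document frequency) divided by document length, and it ignores term proximity and position. All names and the sample collection are hypothetical.

```python
import math
from collections import Counter


def score(query_terms: list[str], doc_tokens: list[str],
          collection: list[list[str]]) -> float:
    """Higher for more matching terms, higher frequency, rarer terms, shorter docs."""
    counts = Counter(doc_tokens)
    total = 0.0
    for term in set(query_terms):
        tf = counts[term]                                  # frequency in D
        df = sum(1 for doc in collection if term in doc)   # documents containing term
        if tf == 0 or df == 0:
            continue
        idf = math.log(len(collection) / df) + 1.0         # rarer terms weigh more
        total += tf * idf
    return total / len(doc_tokens) if doc_tokens else 0.0  # favor shorter documents


# Hypothetical tokenized collection
collection = [
    ["bat", "cave"],
    ["bat", "cave", "and", "many", "other", "words"],
    ["bat", "fruit"],
]
query = ["bat", "cave"]
scores = [score(query, doc, collection) for doc in collection]
ranking = sorted(range(len(collection)), key=lambda j: scores[j], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

The short document containing both query terms comes out on top. How the individual principles should be traded off against each other is a design choice that this sketch fixes arbitrarily.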

16
Search engines: ranking principles
  • A document D is ranked higher
    • If D is newer !?
    • If there are more documents that point to D !?
    • If D is accessed more frequently !?
    • If D's owner pays more ... !?

17
Increasing Precision
  • Information Retrieval (Google)
    • Query → List of documents for human inspection
  • Answer Retrieval (SUIS)
    • Query → List of answers (specific parts of
      documents) for human inspection
  • Fact Retrieval (??)
    • Query → List of facts for insertion in database

18
Other IR tasks
  • Document categorization
  • Document filtering / routing
  • Text mining (deriving new information, i.e.
    information that is not explicitly mentioned,
    from a textual document)
  • Text summarization

19
Filtering, Routing
  • Classifying documents (e.g. emails) according to
    predefined criteria.
    • Distinguishing real emails from junk mail
      (spam)
    • Routing emails to the respective department
      (orders, complaints, product information,
      question, ...)
    • Routing ticker news for special customers.
  • Filtering = yes / no routing
20
Summary
  • Modern Information Retrieval systems use
    statistical methods for ranking documents
    according to relevance.
  • Precision and recall are the most frequently used
    measures for evaluating the quality of IR systems.
  • Routing/filtering, text mining, summarization are
    other IR tasks.

21
Any Questions?