Modern Information Retrieval: A Brief Overview

Provided by: mohanrathi
Learn more at: https://crystal.uta.edu

Transcript and Presenter's Notes


1
Modern Information Retrieval: A Brief Overview
  • By
  • Amit Singhal
  • Ranjan Dash

2
Layout
  • History
  • Models and Implementations
  • Evaluation
  • Key Techniques
  • Term Weighting
  • Query Modification
  • Other Techniques and Applications
  • Conclusion

3
History
  • Starts around 3000 BC with the Sumerians
  • Major IR developments start in the 1950s and
    1960s
  • 1945: Vannevar Bush; 1950s: H. P. Luhn
  • 1960s: SMART system (Gerard Salton); Cranfield
    evaluation (Cyril Cleverdon)
  • 1970s and 1980s: various models for document
    retrieval on small text collections
  • 1992: TREC (Text REtrieval Conference)
  • Other fields such as retrieval of spoken
    information, non-English language retrieval, and
    information filtering
  • Modern textual IR: WWW search, 1996-1998

4
Models and Implementations
  • IR systems
  • Boolean systems
  • Ranked Retrieval Systems
  • Models
  • Vector space model
  • Probabilistic Model
  • Inference Network Model
  • Implementation

5
Models and Implementations (continued)
  • Vector space model
  • Every word in the vocabulary is an independent
    dimension
  • Documents and queries are vectors in this
    high-dimensional space
  • Vectors lie in the positive quadrant of the
    vector space
  • Numeric similarity between the query vector and
    a document vector: the cosine of the angle
    between them
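
The cosine similarity described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: vectors are stored as sparse dicts, and the helper to_vector (a hypothetical name) uses raw term counts as weights purely for demonstration.

```python
import math
from collections import Counter


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # an empty vector has no direction; treat as no match
    return dot / (q_norm * d_norm)


def to_vector(text):
    """Hypothetical helper: raw term counts as weights (an assumption)."""
    return dict(Counter(text.lower().split()))


query = to_vector("information retrieval")
docs = {
    "d1": to_vector("modern information retrieval overview"),
    "d2": to_vector("speech recognition overview"),
}
# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
```

Because all weights are non-negative, the similarity falls between 0 and 1; in practice the raw counts would be replaced by a term-weighting scheme.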

6
Models and Implementations (continued)
  • Probabilistic model: Probabilistic Ranking
    Principle (PRP)
  • Documents are ranked by decreasing probability
    of their relevance to a query
  • Maron and Kuhns, 1960
  • Probability of relevance for document D: P(R|D)


7
Models and Implementations (continued)
Assumptions

8
Models and Implementations (continued)
  • Inference network model
  • Retrieval is modeled as an inference process in
    an inference network
  • A document instantiates a term with a certain
    strength, and credit from multiple terms is
    accumulated
  • The strength of instantiation of a term acts as
    the term's weight
  • Document ranking in this model: similar to the
    vector space or probabilistic models

9
Models and Implementations (continued)
  • Implementation
  • Inverted list
  • Stop words
  • Stemming: of limited benefit for English, more
    effective for languages with many word
    inflections (e.g., German)
  • Multiword phrases
  • Techniques to generate lists of phrases:
    linguistic, statistical
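
The inverted list and stop-word removal named above can be sketched as follows. This is a minimal sketch under stated assumptions: the stop-word set is a tiny illustrative sample, tokenization is plain whitespace splitting, no stemming or phrase handling is attempted, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative stop-word list only; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def candidates(index, query):
    """Doc ids containing at least one non-stop-word query term."""
    found = set()
    for token in query.lower().split():
        if token not in STOP_WORDS:
            # .get avoids creating empty entries in the defaultdict.
            found.update(index.get(token, {}))
    return found


docs = {1: "the history of retrieval", 2: "retrieval of spoken information"}
index = build_inverted_index(docs)
matches = candidates(index, "spoken retrieval")
```

Only documents that appear in some query term's list need to be scored, which is the reason the inverted list makes ranking over large collections practical.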

10
Evaluation
  • Objective evaluation
  • Cranfield tests
  • Characteristics for search effectiveness
  • Recall: proportion of relevant documents
    retrieved by the system
  • Precision: proportion of the retrieved documents
    that are relevant
  • Average precision: averaging precisions at
    different recall points
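
The three measures above can be computed directly from a ranked result list and a set of relevance judgments. The sketch below assumes the ranked list contains no duplicate ids, and it uses one common non-interpolated variant of average precision (precision taken at the rank of each relevant document retrieved, divided by the total number of relevant documents); other averaging schemes exist.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a retrieved list against a relevant set."""
    relevant = set(relevant)
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Non-interpolated average precision over a ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank  # precision at this recall point
    # Relevant documents never retrieved contribute zero.
    return total / len(relevant)


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
precision, recall = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
```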

11
Key Techniques
  • Term weighting
  • Term frequency (tf)
  • Raw tf: non-optimal
  • Dampened tf (logarithmic tf): works better
  • Okapi weighting
  • Pivoted normalization weighting
  • Document frequency
  • Document length
  • Query modification/expansion via relevance
    feedback
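
To make the interplay of term frequency, document frequency, and document length concrete, here is a hedged sketch of a logarithmically dampened tf, an inverse document frequency, and an Okapi (BM25-style) term weight. The function names are hypothetical, the constants k1=1.2 and b=0.75 are commonly quoted defaults rather than values taken from the slides, and the query-term-frequency factor of the full Okapi formula is omitted (it equals 1 when a term occurs once in the query). Pivoted normalization is not shown.

```python
import math


def dampened_tf(tf):
    """Logarithmic tf: grows slowly, so repeated terms do not dominate."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms receive higher weight."""
    return math.log(num_docs / doc_freq)


def bm25_term_weight(tf, doc_freq, num_docs, doc_len, avg_doc_len,
                     k1=1.2, b=0.75):
    """Okapi-style weight of one term in one document (assumed defaults)."""
    if tf == 0:
        return 0.0
    # This idf form turns negative when the term is in over half the docs.
    idf_part = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalized through b.
    length_norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return idf_part * ((k1 + 1.0) * tf) / (length_norm + tf)


short_doc = bm25_term_weight(tf=3, doc_freq=10, num_docs=1000,
                             doc_len=50, avg_doc_len=100)
long_doc = bm25_term_weight(tf=3, doc_freq=10, num_docs=1000,
                            doc_len=400, avg_doc_len=100)
```

With the same term frequency, the shorter document receives the higher weight, and the weight saturates as tf grows instead of increasing without bound.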

12
Key Techniques
  • Query modification/expansion
  • Adding synonyms: limited by the lack of query
    context
  • Relevance feedback: Rocchio, 1965
  • Uses user judgments to modify the query
  • Quite effective
  • Pseudo-feedback: for short user queries
  • Assumes the top few documents retrieved by the
    initial user query are relevant and performs
    relevance feedback on them to generate a new
    query
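
Relevance feedback in the Rocchio style moves the query vector toward the centroid of relevant documents and away from the centroid of non-relevant ones. The sketch below is an illustration under assumptions: vectors are sparse dicts, the coefficients alpha, beta, and gamma are illustrative defaults rather than values from the slides, negative weights are dropped (a common practical choice), and pseudo_feedback is a hypothetical helper that simply treats the top_k retrieved documents as relevant.

```python
from collections import defaultdict


def centroid(vectors):
    """Average of a list of sparse term-weight vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}


def rocchio(query, relevant, non_relevant=(),
            alpha=1.0, beta=0.75, gamma=0.15):
    """New query = alpha*query + beta*rel centroid - gamma*non-rel centroid."""
    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    new_query = {}
    for t in set(query) | set(rel_c) | set(non_c):
        w = (alpha * query.get(t, 0.0)
             + beta * rel_c.get(t, 0.0)
             - gamma * non_c.get(t, 0.0))
        if w > 0:  # assumption: discard terms whose weight goes negative
            new_query[t] = w
    return new_query


def pseudo_feedback(query, ranked_doc_vectors, top_k=3):
    """Assume the top_k retrieved documents are relevant; no user input."""
    return rocchio(query, ranked_doc_vectors[:top_k])


query = {"jaguar": 1.0}
judged_relevant = [{"jaguar": 1.0, "car": 2.0}, {"jaguar": 1.0, "engine": 2.0}]
judged_non_relevant = [{"cat": 4.0}]
expanded = rocchio(query, judged_relevant, judged_non_relevant)
```

The expanded query picks up terms such as "car" and "engine" that the user never typed, which is how feedback compensates for short queries.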

13
Other Techniques and Applications
  • Cluster hypothesis: documents that cluster
    together have a similar relevance profile for a
    query
  • Natural Language Processing (NLP)
  • Not very effective for IR
  • Other IR fields besides document ranking
  • Information Filtering (IF), Topic Detection and
    Tracking (TDT), speech retrieval, cross-language
    retrieval

14
Conclusion
  • 40 years of experience in IR
  • Statistical techniques have proven the most
    effective