Project 1: Adding Proximity Preference to VectorSpace Retrieval

Provided by: Raymond


1
Project 1: Adding Proximity Preference to
Vector-Space Retrieval
2
Problems with VSR (see trace)
  • Does not know the position of tokens in the
    document or query.
      • Therefore cannot account for the proximity of
        tokens in the documents compared to the query.
  • Similarity along a few dimensions can dominate
    others.
      • May prefer a document in which one of the query
        words is very frequent over another in which both
        occur less frequently.
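
Both problems can be reproduced with a small bag-of-words sketch. This is an illustration only, not the course's VSR code: it uses raw term counts with no IDF weighting or stop-word removal, and the class and method names (`BagOfWordsDemo`, `termCounts`, `cosine`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: raw term-frequency vectors, no IDF, no stop words.
final class BagOfWordsDemo {

    // Count how often each whitespace-separated token occurs (positions are discarded).
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity between two term-count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termCounts("computer science");
        // Same words, different order: the vectors are identical.
        System.out.println(cosine(query, termCounts("computer science department")));
        System.out.println(cosine(query, termCounts("science department computer")));
        // One query word repeated vs. both query words among other terms.
        System.out.println(cosine(query, termCounts("computer computer computer computer")));
        System.out.println(cosine(query,
                termCounts("computer science and many other unrelated topics here")));
    }
}
```

Under these simplifications, reordering a document's words leaves its score unchanged, and the document that only repeats "computer" outscores the one containing both query words.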

3
Solutions
  • Phrasal search
      • Requires verbatim appearance of the correct
        phrase.
  • Boolean search
      • Requires strict occurrence of all words, but
        still does not address proximity.
  • Proximity metric
      • Include a measure of proximity in the similarity
        metric itself.
      • Allows a flexible match with a proximity factor.
      • Google apparently includes such a factor in its
        scoring.

4
Example
  • Query: "computer science"
  • Should not prefer just any page that talks about
    computers and talks about science.
  • Should prefer a page in which the words are close
    and in the same order even though the exact phrase
    does not appear, e.g. "Computer and Information
    Science" or "Computer Engineering and Science".
  • Should also prefer a page in which the terms are
    frequent.

5
My Solution
  • Separate proximity score for each document,
    normalized to be between 0 and 1.
  • Average closest distance in the document
    (measured in number of words, excluding stop
    words) that a query word appears from another
    query word.
  • Averaged across all pairs of words in the query.
  • Multiplicative penalty factor applied when a
    pair of words appears in the reverse order from
    that in the query.
  • Final score is the ratio of the cosine similarity
    to the proximity score.
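
One possible reading of the scheme above, as a minimal sketch rather than the instructor's implementation. Several details are assumptions: "closest distance" is taken as the minimum over all occurrence pairs, the penalty is applied to each reversed occurrence pair before taking the minimum, the penalty value and the `missingDistance` fallback are invented for illustration, and the normalization to [0, 1] is left out because the slide does not say how it is done. All names (`ProximitySketch`, `REVERSE_PENALTY`, and so on) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a pairwise proximity score over token positions.
final class ProximitySketch {

    // Assumed penalty multiplier for query words found in reverse order.
    static final double REVERSE_PENALTY = 2.0;

    // Smallest (penalized) distance between any occurrence of the earlier
    // query word (first) and any occurrence of the later query word (second).
    // Returns +infinity if either word does not occur in the document.
    static double closestDistance(List<Integer> first, List<Integer> second) {
        double best = Double.POSITIVE_INFINITY;
        for (int p : first) {
            for (int q : second) {
                double d = Math.abs(q - p);
                if (q < p) {
                    d *= REVERSE_PENALTY; // second word precedes first: reversed order
                }
                best = Math.min(best, d);
            }
        }
        return best;
    }

    // Average closest distance over all pairs of query words.
    // positions.get(i) holds the document positions of the i-th query word.
    // missingDistance is an assumed fallback for pairs with an absent word.
    static double averageProximity(List<List<Integer>> positions, double missingDistance) {
        double total = 0;
        int pairs = 0;
        for (int i = 0; i < positions.size(); i++) {
            for (int j = i + 1; j < positions.size(); j++) {
                double d = closestDistance(positions.get(i), positions.get(j));
                total += Double.isInfinite(d) ? missingDistance : d;
                pairs++;
            }
        }
        // A single-word query has no pairs; 1.0 leaves the cosine score unchanged.
        return pairs == 0 ? 1.0 : total / pairs;
    }

    // Final score as the ratio of cosine similarity to proximity.
    static double combinedScore(double cosine, double proximity) {
        return cosine / proximity;
    }

    public static void main(String[] args) {
        // "Computer and Information Science" with the stop word removed:
        // computer at position 0, science at position 2.
        List<List<Integer>> doc = Arrays.asList(Arrays.asList(0), Arrays.asList(2));
        double proximity = averageProximity(doc, 10.0);
        System.out.println(combinedScore(0.8, proximity));
    }
}
```

Because two distinct words never share a position, every distance here is at least 1, so the division in `combinedScore` is safe in this unnormalized form. A normalized variant would need its own guard against a zero proximity value.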

6
Your Task
  • Be creative! This should be challenging, so leave
    sufficient time to complete the project.
  • Requires fundamental changes to existing code to
    add positional information to the inverted index.
  • So first understand the existing code.
  • Make changes by creating new classes and methods
    rather than altering existing ones.
  • Final code must support both the original and the
    new proximity-enhanced approach.

7
Positional Inverted Index
  • Instead of just a count of the token,
    TokenOccurence must include a list of integer
    positions of the token in the document.
  • Positions can be in terms of token number in the
    document, excluding stop words.
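
A rough sketch of what such a positional index could look like. It does not use the project's existing classes: `PositionalIndexSketch`, its methods, the tokenizer, and the tiny stop-word list are all hypothetical stand-ins. The old per-document count remains available as the length of the position list, so the original scoring can still be supported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical positional inverted index: token -> (document id -> positions).
final class PositionalIndexSketch {

    // Tiny illustrative stop-word list; a real system would use a fuller one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "in", "of", "the"));

    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    // Tokenize the text and record each token's position, where positions
    // count only the tokens that survive stop-word removal.
    void addDocument(String docId, String text) {
        int position = 0;
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (tok.isEmpty() || STOP_WORDS.contains(tok)) {
                continue;
            }
            postings.computeIfAbsent(tok, k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    // Positions of a token in a document, or an empty list if it never occurs.
    List<Integer> positions(String token, String docId) {
        Map<String, List<Integer>> byDoc = postings.get(token);
        if (byDoc == null) {
            return Collections.<Integer>emptyList();
        }
        List<Integer> found = byDoc.get(docId);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    // The plain occurrence count is recoverable from the position list.
    int count(String token, String docId) {
        return positions(token, docId).size();
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.addDocument("d1", "Computer and Information Science");
        System.out.println(index.positions("science", "d1"));
    }
}
```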

8
Project Submission
  • Follow submission directions on the web to submit
    electronically with turnin.
  • Document all code (Javadoc).
  • Include a trace of your working system on the
    sample problems.
  • Include a short project report describing your
    approach and discussing your results.