INFORMATION RETRIEVAL SEARCH PROJECT - PowerPoint PPT Presentation

Provided by: ilpsSci

1
INFORMATION RETRIEVAL SEARCH PROJECT
  • midterm presentation

Paulus Wiezer, Laurentiu Stancu, Tjarda Koster, Paul Koppen,
Farrell Ligeon, Rene De Haas, Peter Bloem
2
Summary
  • Task Description
  • Implementation Issues
  • Search Module Interface
      • To Data Structuring Module
      • To Presentation Module
  • Future plans
  • Questions?

3
Task Description
  • Build the Indexer (the indexing task is performed
    offline)
      • Retrieve structured QA pairs from the Data
        Structuring Module
      • Parse the data
      • Build the inverted index (Lucene index)
  • Implement an XML-RPC server (the query/search
    task is performed online)
      • Receive a QA search query from the user through
        the Presentation Module
      • Perform the search using the previously built
        inverted index
      • Return the search results as an ordered list of
        QA pair IDs
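
The offline/online split above can be illustrated with a toy inverted index. This is a minimal sketch in plain Python, not the project's Lucene code: the function names, the sample QA pairs, and the term-frequency score are all assumptions made for illustration (Lucene uses its own index format and scoring).

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(qa_pairs):
    """Offline step: map each term to {pair_id: term frequency}."""
    index = defaultdict(dict)
    for pair_id, (question, answer) in qa_pairs.items():
        counts = Counter(tokenize(question + " " + answer))
        for term, count in counts.items():
            index[term][pair_id] = count
    return index


def search(index, query):
    """Online step: rank pair IDs by summed term frequency (a toy score)."""
    scores = Counter()
    for term in tokenize(query):
        for pair_id, count in index.get(term, {}).items():
            scores[pair_id] += count
    # Sort by descending score, then by ID so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [pair_id for pair_id, _ in ranked]


# Hypothetical sample data; the real pairs come from the Data Structuring Module.
qa_pairs = {
    "qa1": ("Who invented this stuff?", "We did."),
    "qa2": ("What is an inverted index?",
            "A map from terms to the documents containing them."),
}
index = build_index(qa_pairs)
print(search(index, "inverted index"))
```

The index is built once and then reused for every query, which is why the indexing task can run offline while only the lookup runs online.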

4
Implementation Issues: The Indexer
  • The main classes used in the project, together
    with the functionality each one provides:
  • FreaqXMLParser class
      • Gets the data from the Data Structuring Module
        as an XML file
      • Parses the retrieved XML file for QA pairs
      • Sends a collection of QAPairs to QAIndexer
  • SemiXMLIndexer class
      • Note: this class is temporarily used in place of
        FreaqXMLParser, because the real data from the
        Data Structuring Module is not yet available. It
        gets its data from a WeBFAQ repository that
        another project has already made available.
  • QAIndexer class
      • Receives the collection of QAPairs from
        FreaqXMLParser or SemiXMLIndexer
      • Builds an inverted index using the Lucene library
      • Provides the built index to the XML-RPC server

5
Implementation Issues (cont.): The Query/Search
  • A message in XML format containing a query comes
    in from a client through the FreAQServer
    connection.
  • This message is then handled by an implementation
    of the abstract class QueryProcessor, e.g. the
    LuceneQueryProcessor.
  • The query is extracted from the XML message by
    the FreAQMessageCreator and stored in our
    internal format, i.e. a FreAQMessage.

6
Implementation Issues (cont.): The Query/Search
  • The resulting FreAQMessage is transformed into
    the native format of the chosen search engine
    implementation (e.g. a Lucene Query is created by
    the LuceneQueryCreator).
  • The query (in the search engine's native format)
    is run against a collection.
  • The IDs of the returned ranked documents are
    added to an XML message.
  • The XML message containing the results is sent
    back to the client.
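
The query flow described on the last two slides (internal message, native query, ranked IDs) can be sketched as an abstract class with one engine-specific subclass. This is a hedged illustration in Python, not the project's code: only the names QueryProcessor and FreAQMessage come from the slides, while the fields, the method names, and the InMemoryQueryProcessor stand-in for a LuceneQueryProcessor are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class FreAQMessage:
    """Internal query representation (these fields are assumed)."""
    with_terms: list = field(default_factory=list)
    without_terms: list = field(default_factory=list)


class QueryProcessor(ABC):
    """Engine-independent flow; subclasses supply the engine-specific steps."""

    def process(self, message):
        native_query = self.create_query(message)  # internal -> native format
        return self.run(native_query)              # ranked list of document IDs

    @abstractmethod
    def create_query(self, message):
        """Translate a FreAQMessage into the engine's native query."""

    @abstractmethod
    def run(self, native_query):
        """Run the native query and return document IDs, best match first."""


class InMemoryQueryProcessor(QueryProcessor):
    """Hypothetical stand-in for a LuceneQueryProcessor, backed by a dict."""

    def __init__(self, documents):
        self.documents = documents  # {doc_id: text}

    def create_query(self, message):
        return set(message.with_terms), set(message.without_terms)

    def run(self, native_query):
        wanted, banned = native_query
        hits = []
        for doc_id, text in self.documents.items():
            words = set(text.lower().split())
            if words & banned:
                continue  # document contains an excluded word
            overlap = len(words & wanted)
            if overlap:
                hits.append((-overlap, doc_id))
        # More matching words rank first; ties are broken by ID.
        return [doc_id for _, doc_id in sorted(hits)]


documents = {
    "qa1": "who invented this stuff",
    "qa2": "who broke this stuff today",
}
processor = InMemoryQueryProcessor(documents)
print(processor.process(FreAQMessage(with_terms=["who", "invented"])))
```

Keeping the flow in the abstract base class means a different engine (Lemur, for instance) would only need its own create_query and run.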

7
Search Module Interface: From Data Structuring
Module
  • The interface between the Search Module and the
    Data Structuring Module is an XML file containing
    a collection of QA pairs.
  • The XML format looks like:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE freaq SYSTEM "freaq.dtd">
    <freaq>
      <collection uri="http://example.com/faq"
                  date="2007-02-12T12:21:00"
                  title="Page title goes here">
        <context>Full page content (without markup) goes here.</context>
        <pairs>
          <pair id="qa1">
            <q>Who invented this stuff?</q>
            <a>We did.</a>
          </pair>
        </pairs>
      </collection>
    </freaq>
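
Reading QA pairs out of such a file could look roughly like the sketch below. It uses Python's standard xml.etree.ElementTree rather than the project's FreaqXMLParser, and it assumes the sample corresponds to well-formed XML with ordinary angle brackets and attribute syntax. The function name and the dictionary keys are illustrative, and the DTD is not validated.

```python
import xml.etree.ElementTree as ET

# Sample document following the format shown on the slide.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE freaq SYSTEM "freaq.dtd">
<freaq>
  <collection uri="http://example.com/faq"
              date="2007-02-12T12:21:00"
              title="Page title goes here">
    <context>Full page content (without markup) goes here.</context>
    <pairs>
      <pair id="qa1">
        <q>Who invented this stuff?</q>
        <a>We did.</a>
      </pair>
    </pairs>
  </collection>
</freaq>
"""


def parse_qa_pairs(xml_bytes):
    """Return one dict per <pair>, tagged with its collection's URI."""
    root = ET.fromstring(xml_bytes)
    pairs = []
    for collection in root.iter("collection"):
        for pair in collection.iter("pair"):
            pairs.append({
                "id": pair.get("id"),
                "question": pair.findtext("q", default=""),
                "answer": pair.findtext("a", default=""),
                "source": collection.get("uri"),
            })
    return pairs


for qa in parse_qa_pairs(SAMPLE):
    print(qa["id"], qa["question"], "->", qa["answer"])
```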

8
Search Module Interface (cont.): From
Presentation Module
  • The interface between the Search Module and the
    Presentation Module uses XML-RPC (remote
    procedure calls in XML format).
  • The XML query format looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <freaq>
      <query>
        <with>with at least one of the words</with>
        <with field="...">with at least one of the words in field ...</with>
        <without>without the words</without>
        <without field="...">without the words in field ...</without>
        <exact>with the exact phrase</exact>
        <exact field="...">with the exact phrase in field ...</exact>
      </query>
      <parameters>
        <filetype>filetype</filetype>
      </parameters>
    </freaq>
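
Extracting the clauses from a query message might be sketched as follows. This is an assumption-laden illustration of what a component like FreAQMessageCreator has to do, written with Python's standard library. The function name, the returned structure, and the sample field name "q" and file type "html" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical query message in the format shown on the slide.
QUERY = b"""<?xml version="1.0" encoding="UTF-8"?>
<freaq>
  <query>
    <with>lucene index</with>
    <without field="q">spam</without>
    <exact>inverted index</exact>
  </query>
  <parameters>
    <filetype>html</filetype>
  </parameters>
</freaq>
"""


def parse_query(xml_bytes):
    """Return ({clause type: [(field or None, text)]}, filetype or None)."""
    root = ET.fromstring(xml_bytes)
    clauses = {"with": [], "without": [], "exact": []}
    query = root.find("query")
    if query is not None:
        for element in query:
            if element.tag in clauses:
                text = (element.text or "").strip()
                clauses[element.tag].append((element.get("field"), text))
    filetype = root.findtext("parameters/filetype")
    return clauses, filetype


clauses, filetype = parse_query(QUERY)
print(clauses)
print(filetype)
```

A clause without a field attribute is reported with None as its field, so the caller can decide which default field to search.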

9
Search Module Interface (cont.): To Presentation
Module
  • The interface between the Search Module and the
    Presentation Module uses XML-RPC (remote
    procedure calls in XML format).
  • The XML result format looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <freaq>
      <results>
        <id>id of highest ranked result</id>
        <id>id of second highest ranked result</id>
        ...
        <id>id of nth highest ranked result</id>
      </results>
    </freaq>
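
Producing a result message from a ranked list of IDs could be sketched like this, again with Python's standard library rather than the project's code. The function name and the sample IDs are assumptions, and the sketch leaves out the XML declaration that the slide's format shows.

```python
import xml.etree.ElementTree as ET


def build_results(ranked_ids):
    """Serialize ranked QA pair IDs, best match first, as a result message."""
    root = ET.Element("freaq")
    results = ET.SubElement(root, "results")
    for pair_id in ranked_ids:
        ET.SubElement(results, "id").text = pair_id
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


print(build_results(["qa7", "qa2"]))
```

The order of the id elements carries the ranking, so the client needs no separate score field.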

10
Future Plans
  • Some of the upcoming tasks:
  • Investigate different search methods and evaluate
    them
  • Investigate Lemur functionality
  • Investigate natural language analysis of the
    query/data.

11
  • QUESTIONS ?