CSA3216 Search Engine Technology Assignment

1
CSA3216 Search Engine Technology Assignment
  • Angelo Dalli
  • Department of Intelligent Computing Systems

2
Assignment Overview
  • This assignment guides you through the creation
    of a simple search engine consisting of a simple
    indexing system and query processing system
  • There are 10 well-defined steps to implement
  • A compressed data file containing HTML documents
    forms the document collection for your search
    engine

3
Step 1
  • Download the document data collection from
  • http://staff.um.edu.mt/adal1/csa3216/csa3216_assignment_data.zip
  • You need to unzip the documents to some suitable
    folder on your computer.
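
The unzip step can also be scripted. The following is a minimal sketch using Python's standard zipfile module; the function name extract_collection and the paths passed to it are hypothetical and not part of the assignment.

```python
import zipfile
from pathlib import Path


def extract_collection(zip_path, target_dir):
    """Unzip the archive into target_dir and return the extracted file paths.

    Both arguments are placeholders: pass wherever you saved the downloaded
    archive and whichever folder you want the documents in.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    # Directory entries end with "/", so keep only real files.
    return sorted(target / name for name in names if not name.endswith("/"))
```

The returned list of paths can then be fed to whatever parsing routine you write for Step 2.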

4
Step 2
  • Parse the documents to extract tokens (strip out
    HTML tags, punctuation, etc.)
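
One possible way to approach the parsing step is sketched below, using only the Python standard library. It assumes that tokens are lowercase runs of ASCII letters and digits and that the contents of script and style elements should be ignored; both choices are assumptions, and the names TextExtractor and tokenize are illustrative.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def tokenize(html):
    """Strip HTML tags and punctuation, returning a list of lowercase tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Assumption: a token is a run of ASCII letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())
```

Because punctuation acts as a separator here, a word such as "It's" is split into two tokens; a different token pattern may suit your collection better.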

5
Step 3
  • Build an inverted index
  • token → doc1, doc2, ..., docn
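
The mapping from each token to the documents that contain it can be sketched as a dictionary of dictionaries. This version also stores the term frequency per document, which is an assumption made here because the later weighting step needs it; the function name and input shape are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build token -> {doc_id: term frequency}.

    docs is assumed to be a dict mapping a document id to its token list.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)
```

The list of documents for a token is then simply the keys of its entry, for example sorted(index["engine"]).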

6
Step 4
  • Calculate TF.IDF weights for the document
    collection
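
The slides do not say which TF.IDF variant to use. The sketch below assumes the common form weight = tf × ln(N / df), where tf is the raw count of the token in the document, N is the number of documents and df is the number of documents containing the token. It takes an inverted index of the shape token → {doc_id: tf}, which is also an assumption.

```python
import math


def tfidf_weights(index, num_docs):
    """Return doc_id -> {token: tf * ln(N / df)} from an inverted index.

    index is assumed to map token -> {doc_id: raw term frequency}.
    """
    vectors = {}
    for token, postings in index.items():
        # df is the number of documents in which the token occurs.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            vectors.setdefault(doc_id, {})[token] = tf * idf
    return vectors
```

With this variant a token that occurs in every document gets weight 0, so it contributes nothing to the similarity scores.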

7
Step 5
  • Implement a vector distance model using the
    cosine similarity measure
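
Cosine similarity between two weighted vectors is their dot product divided by the product of their lengths. A minimal sketch follows, assuming each vector is stored sparsely as a dict from token to weight; that representation is a choice made here, not something the assignment prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts token -> weight)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)
```

For non-negative weights the result lies between 0 (no shared terms) and 1 (same direction).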

8
Step 6
  • Implement a simple query processor that takes in
    the query, removes any punctuation, etc. and
    creates a query vector
  • The query vector then needs to be matched with
    the existing pre-computed vectors, using the
    cosine similarity match
  • Hint: use the inverted index to speed up the
    process
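
The hint can be read as follows: only documents that share at least one term with the query can have a non-zero cosine score, so the inverted index tells you which documents to score. The sketch below assumes an index of the shape token → {doc_id: tf}, pre-computed document vectors of the shape doc_id → {token: weight}, and the tf × ln(N / df) weighting for the query; the function name rank_documents is hypothetical.

```python
import math
import re


def rank_documents(query, index, doc_vectors, num_docs):
    """Return [(doc_id, cosine score)] sorted from most to least similar."""
    counts = {}
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        if token in index:  # terms absent from the collection cannot match
            counts[token] = counts.get(token, 0) + 1
    # Query vector, weighted the same way as the documents (an assumption).
    q_vec = {t: tf * math.log(num_docs / len(index[t])) for t, tf in counts.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    if q_norm == 0:
        return []
    # Accumulate dot products only for documents listed in the postings.
    dots = {}
    for token, q_weight in q_vec.items():
        for doc_id in index[token]:
            d_weight = doc_vectors[doc_id].get(token, 0.0)
            dots[doc_id] = dots.get(doc_id, 0.0) + q_weight * d_weight
    scores = []
    for doc_id, dot in dots.items():
        d_norm = math.sqrt(sum(w * w for w in doc_vectors[doc_id].values()))
        if d_norm > 0:
            scores.append((doc_id, dot / (q_norm * d_norm)))
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores
```

Document norms are recomputed on every call to keep the sketch short; in practice they could be computed once when the document vectors are built.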

9
Step 7
  • For each query, output the following (in simple
    HTML format or as text output):
      • Ranked list of the top 10 documents for the
        query, sorted in descending order of relevance
        (i.e. most relevant document in 1st position)
      • Document titles
      • A snippet for each document
      • Indication of relevance using the similarity
        output
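
A plain-text version of this output might look like the sketch below. It assumes the ranking is already available as a list of (doc_id, score) pairs sorted best first, and that titles and document text are held in dicts keyed by doc_id. The snippet rule (a fixed-width window around the first query term found) is just one simple option, and all names are illustrative.

```python
def make_snippet(text, query_terms, width=60):
    """Return a window of about `width` characters around the first query term.

    Matching is by plain substring, so a term may match inside a longer word.
    """
    lowered = text.lower()
    positions = [lowered.find(term) for term in query_terms if term in lowered]
    if not positions:
        return text[:width].strip()
    start = max(min(positions) - width // 2, 0)
    return text[start:start + width].strip()


def format_results(ranked, titles, texts, query_terms, top_k=10):
    """Format the top_k results as numbered lines with title, score and snippet."""
    lines = []
    for position, (doc_id, score) in enumerate(ranked[:top_k], start=1):
        lines.append(f"{position}. {titles[doc_id]} (similarity {score:.3f})")
        lines.append(f"   {make_snippet(texts[doc_id], query_terms)}")
    return "\n".join(lines)
```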

10
Step 8
  • Write documentation including assumptions,
    implementation details and a few test cases and
    examples

11
Step 9
  • Evaluation of your results
  • XML output of results
  • Discussion and conclusion
  • Optional steps:
      • Stemming
      • Stopword removal
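
The two optional steps could be added as a filter over the token list. In the sketch below both the stopword list and the suffix-stripping rule are deliberately tiny stand-ins, assumed only for illustration; this is not the Porter stemmer or any other established algorithm.

```python
# A very small illustrative stopword list; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def naive_stem(token):
    """Strip a few common English suffixes. Crude by design."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(tokens):
    """Drop stopwords, then stem what remains."""
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```

The same normalization has to be applied to both the documents and the queries, otherwise their terms will not match.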

12
Step 10: XML Output
  • <xml>
  • <query></query>
  • <results>
  • <result>
  • <doc id="" title="" relevancy="">
  • <snippet></snippet>
  • </doc>
  • </result>
  • </results>
  • </xml>
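
Output with the element names used on this slide (xml, query, results, result, doc, snippet) can be generated with Python's standard xml.etree.ElementTree module, which takes care of escaping. The sketch assumes each result is a dict with the keys id, title, relevancy and snippet, and that the relevancy is written with four decimal places; both are assumptions rather than requirements from the slides.

```python
import xml.etree.ElementTree as ET


def results_to_xml(query, results):
    """Serialize a query and its ranked results to an XML string."""
    root = ET.Element("xml")
    ET.SubElement(root, "query").text = query
    results_el = ET.SubElement(root, "results")
    for item in results:
        result_el = ET.SubElement(results_el, "result")
        doc_el = ET.SubElement(
            result_el,
            "doc",
            id=str(item["id"]),
            title=item["title"],
            relevancy=f"{item['relevancy']:.4f}",
        )
        ET.SubElement(doc_el, "snippet").text = item["snippet"]
    return ET.tostring(root, encoding="unicode")
```

Characters such as & and < in queries, titles or snippets are escaped automatically, so the output stays well-formed.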