Search Engine Project GROUPD - PowerPoint PPT Presentation

Provided by: cometLeh

1
DISTRIBUTED INFORMATION RETRIEVAL SYSTEMS
  • Search Engine Project GROUP-D

Group Members: Deepesh Sen (Group Leader), Hilda
Gonzalez, Jonathan D Holland, Kavitha
Marupakula, Palilba Singuinam, Somalina Samal
2
Introduction
GlobeSurf is a high-performance, full-featured
text and image search engine written entirely in
Java.
  • The Lucene API is used for text searching and
    indexing.
  • Image searching and indexing were implemented
    using a database.
  • Documents are downloaded from Lehman and other
    websites using the JOBO web spider.
  • The search engine features full text search and
    supports Term, Phrase and Boolean queries.
  • It allows category-based search: Lehman
    College, Web, News and Image. More categories can
    be added in the future.
  • Automatic "OR" Queries - Returns pages that
    include any or all of the search terms.
  • Automatic Exclusion of Common Words (Stop words)
    like a, an, and, there, they, this, to, was,
    will, with etc.
  • Automatic Capitalization - Searches are NOT case
    sensitive.
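
The three automatic behaviours above (OR matching, stop-word removal, case-insensitive search) can be pictured with a small plain-Java sketch. It is only an illustration under simplifying assumptions: in GlobeSurf this work is done by Lucene, the class and method names below are hypothetical, and the stop-word set holds only the words named on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a simplified stand-in for the analysis and OR matching
// that Lucene performs. All names here are hypothetical.
class OrQuerySketch {
    // Only the stop words named on the slide, not the complete list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "there", "they", "this", "to", "was", "will", "with"));

    // Lowercase the text, split on non-alphanumeric characters, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // OR semantics: a page matches if it contains at least one query term.
    static boolean matches(String page, String query) {
        Set<String> pageTerms = new HashSet<>(analyze(page));
        for (String term : analyze(query)) {
            if (pageTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is a Search ENGINE"));
        System.out.println(matches("Lehman College admissions", "college with tuition"));
    }
}
```

A query made up entirely of stop words analyzes to an empty term list and therefore matches nothing in this sketch.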

3
Introduction Contd.
  • Word Variations (Stemming) - Uses PorterStemmer
    to search not only for the search terms, but also
    for words that are similar to some or all of
    those terms.
  • Image Search performs exact and partial match of
    image name and shows thumbnail of returned
    images.
  • Result links open in new window so that you can
    keep your search window open.
  • Best Result - Returns the document that best
    matches the query.
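
To give a feel for what stemming does to a query, here is a deliberately crude suffix stripper. It is NOT the Porter algorithm that PorterStemmer implements (Porter applies several ordered rule phases); the class name and the suffix list are assumptions chosen only to show how word variations collapse to one index term.

```java
import java.util.Locale;

// Illustrative only: naive suffix stripping, far simpler than Porter stemming.
class NaiveStemSketch {
    // Checked in this order; the first suffix that fits is removed.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words stay intact.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"search", "searches", "searched", "searching"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Because the same reduction is applied at index time and at query time, a search for "searching" can find a page that only contains "searched".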

4
Use Case Diagram
5
Process Flow
6
Downloading Document - JOBO
The JOBO web spider was used to download documents
from websites.
  • Four separate download processes were executed with
    JOBO:
      • Lehman College domain
      • News domain
      • All Web
      • Images
  • The files downloaded by JOBO were stored in the
    following directories:
      • \DownloadedFiles\lehman
      • \DownloadedFiles\news
      • \DownloadedFiles\web
      • \DownloadedFiles\images

7
Downloading Document - JOBO
JOBO XML configuration
  • To download the images, we allowed only files of
    image type (gif, jpg, jpeg). For the other
    three downloads we allowed text and html files
    only.
  • The sleep time was 5 seconds between files.
  • There was no limit on the age of the files that
    were downloaded.
  • JOBO downloaded files that were at most 25 clicks
    away from the starting page.
  • JOBO did not have any bandwidth limitation; it
    used all our available internet bandwidth.
  • Minimum file size was zero.
  • Maximum file size was 100KB.
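
The download rules above can be written out as a plain Java predicate. This is a sketch, not JOBO's API or its XML schema: every name is hypothetical, file types are judged by extension for simplicity, the extensions for "text and html files" are assumed, and 100KB is taken to mean 100 * 1024 bytes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: the slide's crawl limits as constants and one check.
class DownloadPolicySketch {
    static final int MAX_DEPTH = 25;                 // clicks from the starting page
    static final long MIN_SIZE_BYTES = 0;
    static final long MAX_SIZE_BYTES = 100L * 1024;  // assumes 1 KB = 1024 bytes
    static final long SLEEP_MILLIS = 5000;           // pause between two downloads

    static final Set<String> IMAGE_TYPES = new HashSet<>(Arrays.asList("gif", "jpg", "jpeg"));
    // Assumed extensions for the "text and html" crawls.
    static final Set<String> TEXT_TYPES = new HashSet<>(Arrays.asList("txt", "htm", "html"));

    // Lowercased extension without the dot, or "" if there is none.
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True if a file passes the type, depth and size limits of the given crawl.
    static boolean accept(String fileName, long sizeBytes, int depth, boolean imageCrawl) {
        Set<String> allowed = imageCrawl ? IMAGE_TYPES : TEXT_TYPES;
        return allowed.contains(extension(fileName))
                && depth <= MAX_DEPTH
                && sizeBytes >= MIN_SIZE_BYTES
                && sizeBytes <= MAX_SIZE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(accept("photo.JPG", 35 * 1024, 3, true));
        System.out.println(accept("index.html", 2048, 26, false));
    }
}
```

A crawler loop would call `accept` for each candidate and sleep for `SLEEP_MILLIS` between downloads; there is no age check because the slide states that file age was unlimited.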

8
Indexing Html/Text files
Html/Text indexing was done using Lucene
9
Indexing Html/Text files
Classes used for Html/Text indexing
  • HTML documents are parsed using the HtmlParser class
    to separate the title and content sections and to
    remove all HTML tags.
  • A Document object is created for each file with the
    appropriate Field objects:
      • URL - UnIndexed field, stored with the document,
        not searchable.
      • Modified Time - Keyword field, searchable, not
        tokenized.
      • Content - Tokenized and indexed.
      • Summary - UnIndexed field, stored with the
        document, not searchable.
      • Title - Tokenized and indexed.
  • Stop words such as "a", "an" and "and" are removed
    using StopAnalyzer and StopFilter. There are 35
    words in the stop word list.
  • Stemming is performed using PorterStemFilter and
    PorterStemmer. PorterStemFilter is called within
    StandardAnalyzer and uses LowerCaseTokenizer.
  • An IndexWriter is created with the modified
    StandardAnalyzer and writes the index documents
    while walking through the directory hierarchy.
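
The difference between stored, indexed and tokenized fields is easier to see in a toy model. The sketch below mirrors the field list above in plain Java; it does not use or reproduce Lucene's classes, all names are hypothetical, and whether Title and Content are also stored is an assumption the slides do not settle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a toy model of stored / indexed / tokenized fields.
class FieldModelSketch {
    static final class Field {
        final String name;
        final String value;
        final boolean stored;     // value can be returned with a hit
        final boolean indexed;    // value can be searched
        final boolean tokenized;  // value is split into words before indexing

        Field(String name, String value, boolean stored, boolean indexed, boolean tokenized) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.indexed = indexed;
            this.tokenized = tokenized;
        }
    }

    // Terms a field contributes to the index: none if unindexed, the whole
    // value if it is a keyword, lowercased words if it is tokenized.
    static List<String> indexTerms(Field f) {
        List<String> terms = new ArrayList<>();
        if (!f.indexed) {
            return terms;
        }
        if (!f.tokenized) {
            terms.add(f.value);
            return terms;
        }
        for (String t : f.value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    static List<Field> htmlDocument(String url, String modified, String title,
                                    String summary, String content) {
        List<Field> doc = new ArrayList<>();
        doc.add(new Field("url", url, true, false, false));          // UnIndexed
        doc.add(new Field("modified", modified, true, true, false)); // Keyword
        doc.add(new Field("title", title, true, true, true));        // stored flag assumed
        doc.add(new Field("summary", summary, true, false, false));  // UnIndexed
        doc.add(new Field("content", content, false, true, true));   // stored flag assumed
        return doc;
    }

    public static void main(String[] args) {
        for (Field f : htmlDocument("http://example.org/a.html", "20041201",
                "Lehman College Home", "A short summary", "Body text of the page")) {
            System.out.println(f.name + " -> " + indexTerms(f));
        }
    }
}
```

The URL and summary produce no index terms, so they can be shown in the result list but cannot be searched, which matches the "not searchable" wording above.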

10
Indexing Image files
11
Indexing Image files
Classes used for Image indexing
  • Walk through the downloaded image directory.
  • Create an ImageDocument object for each file with
    the appropriate Field objects, for example:
      • Name: photo.jpg
      • URL: http://www.lehman.edu/..
      • Size: 35 KB
      • Modified Date: 12/01/2004
  • Insert the image information into the image
    database using the ImageIndexer class. The
    ImageIndexer.indexImages() method opens a database
    connection to the image database and inserts all
    image-related information into image_table.
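
The walk-and-insert steps above might look roughly like the following. This is a sketch and not the project's ImageIndexer: the class and method names are hypothetical, the column names of image_table are assumed, and the local file path is stored where the real system would store the original URL.

```java
import java.io.File;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: collect image metadata from disk, then batch-insert it.
class ImageIndexSketch {
    static final class ImageRecord {
        final String name;
        final String path;
        final long sizeBytes;
        final long modifiedMillis;

        ImageRecord(String name, String path, long sizeBytes, long modifiedMillis) {
            this.name = name;
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.modifiedMillis = modifiedMillis;
        }
    }

    static boolean isImage(String fileName) {
        String n = fileName.toLowerCase(Locale.ROOT);
        return n.endsWith(".gif") || n.endsWith(".jpg") || n.endsWith(".jpeg");
    }

    // Recursively collect image files; a missing directory yields an empty list.
    static List<ImageRecord> scan(File dir) {
        List<ImageRecord> out = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return out;
        }
        for (File f : children) {
            if (f.isDirectory()) {
                out.addAll(scan(f));
            } else if (isImage(f.getName())) {
                out.add(new ImageRecord(f.getName(), f.getPath(), f.length(), f.lastModified()));
            }
        }
        return out;
    }

    // Column names are assumptions; the slides only name the table.
    static void insertAll(Connection conn, List<ImageRecord> images) throws SQLException {
        String sql = "INSERT INTO image_table (name, url, size_bytes, modified) VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (ImageRecord r : images) {
                ps.setString(1, r.name);
                ps.setString(2, r.path);
                ps.setLong(3, r.sizeBytes);
                ps.setTimestamp(4, new Timestamp(r.modifiedMillis));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(scan(dir).size() + " image file(s) found under " + dir.getPath());
    }
}
```

Using a PreparedStatement with batching keeps file names out of the SQL text and sends the rows in one round trip; `insertAll` needs a JDBC connection supplied by the caller.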

12
Searching Html/Text files
13
Searching Html/Text files
Classes used for Html/Text search operation
The input for a search operation is a 'query'
that specifies the criteria for selecting
documents, and its output is a list of documents
('hits') that match those criteria.
  • The Searcher class is used to find documents
    matching the user query. The
    Searcher.search(File indexDir, String q)
    method accepts an index directory and a query
    string as arguments and returns a
    DataStore object containing the result information.
  • QueryParser parses the user's query string.
    QueryParser uses various types of Query objects
    internally:
      • PhraseQuery - A Query that matches documents
        containing a particular sequence of terms.
        This may be combined with other terms
        using a BooleanQuery.
      • BooleanQuery - A Query that matches documents
        matching Boolean combinations of other queries,
        typically TermQuerys or PhraseQuerys.
      • TermQuery etc.
  • The QueryParser.parse() method returns a Query
    object and uses StandardAnalyzer (the same
    analyzer used in indexing).
  • The Query object is handed to the
    IndexSearcher.search() method, which returns a
    Hits collection.
  • The Hits object is a collection of the Document
    objects matched by the query, with an associated
    relevance score for each document, sorted by
    score.
  • For each Document returned in Hits, a
    ResultDisplayComponent is created.
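
The parse, match and rank flow described above can be imitated in a few lines of plain Java. The sketch is a simplification and not Lucene: quoted runs become phrase clauses, other words become term clauses, clauses are combined with OR, and the score is just the number of matching clauses rather than Lucene's relevance score. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: term and phrase clauses, OR-combined, ranked by a toy score.
class QueryFlowSketch {
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // A quoted run becomes a phrase clause; any other word becomes a term clause.
    static List<List<String>> parse(String query) {
        List<List<String>> clauses = new ArrayList<>();
        Matcher m = Pattern.compile("\"([^\"]*)\"|(\\S+)").matcher(query);
        while (m.find()) {
            List<String> clause = tokens(m.group(1) != null ? m.group(1) : m.group(2));
            if (!clause.isEmpty()) {
                clauses.add(clause);
            }
        }
        return clauses;
    }

    // A clause matches when its terms occur consecutively in the document.
    static boolean matches(List<String> docTokens, List<String> clause) {
        return Collections.indexOfSubList(docTokens, clause) >= 0;
    }

    // Returns the indexes of matching documents, highest score first.
    static List<Integer> search(List<String> docs, String query) {
        List<List<String>> clauses = parse(query);
        int[] scores = new int[docs.size()];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            List<String> docTokens = tokens(docs.get(i));
            for (List<String> clause : clauses) {
                if (matches(docTokens, clause)) {
                    scores[i]++;
                }
            }
            if (scores[i] > 0) {
                hits.add(i);
            }
        }
        hits.sort((a, b) -> Integer.compare(scores[b], scores[a]));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "college football news",
                "Lehman College library hours",
                "image search engine");
        System.out.println(search(docs, "college \"lehman college\""));
    }
}
```

In the example, the second document matches both the term and the phrase, so it is ranked ahead of the first, which matches the term only; the third document is not returned.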

14
Searching Image files
15
Searching Image files
Classes used for Image search operation
  • The Searcher.searchImages() method accepts the user
    query, searches the image database for a
    partial or full name match (using a LIKE operator
    with leading and trailing wildcards) and returns a
    list (DataStore) of images ordered by modified time.
  • A ResultDisplayComponent is created for each image
    returned by the Searcher.searchImages() method.
    This component contains an HtmlImage object and
    Image Name and Size Information objects.
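
A name match with leading and trailing wildcards could be expressed as below. This is a sketch under assumptions: the SQL text, the column names and the ascending sort direction are guesses, the class is hypothetical, and the in-memory method only imitates the query so that it can be run without a database. Whether LIKE ignores case depends on the database; the in-memory version ignores it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative only: partial-name image search, oldest modification first.
class ImageSearchSketch {
    // Assumed query; bind likePattern(userQuery) to the single parameter.
    static final String SQL =
            "SELECT name, url, size_bytes, modified FROM image_table "
            + "WHERE name LIKE ? ORDER BY modified";

    // Wildcards on both sides, so the text may appear anywhere in the name.
    static String likePattern(String userQuery) {
        return "%" + userQuery.trim() + "%";
    }

    static final class Image {
        final String name;
        final long modified;

        Image(String name, long modified) {
            this.name = name;
            this.modified = modified;
        }
    }

    // In-memory imitation of the SQL above: substring match, sorted by modified time.
    static List<String> search(List<Image> images, String userQuery) {
        String needle = userQuery.trim().toLowerCase(Locale.ROOT);
        List<Image> matched = new ArrayList<>();
        for (Image img : images) {
            if (img.name.toLowerCase(Locale.ROOT).contains(needle)) {
                matched.add(img);
            }
        }
        matched.sort(Comparator.comparingLong((Image img) -> img.modified));
        List<String> names = new ArrayList<>();
        for (Image img : matched) {
            names.add(img.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Image> images = Arrays.asList(
                new Image("photo.jpg", 300L),
                new Image("campus_photo.gif", 100L),
                new Image("logo.gif", 200L));
        System.out.println(search(images, "photo"));
    }
}
```

Note that a user query containing % or _ would itself act as a wildcard in the SQL version unless it is escaped.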

16
Result Display
Classes used for displaying results
The user interface and display logic are
implemented using servlets, JSP and a framework
similar to Struts.
  • The SiteServer servlet handles all user requests,
    forwards each request to the appropriate section,
    manages the user session, etc.
  • The DataStore object is a container that holds the
    information about the matching documents, such as
    URL, title, summary and file size.
  • For each entry of the DataStore, a
    ResultDisplayComponent is created, which displays
    the document information on the screen.
  • There are two kinds of ResultDisplayComponent:
    one for Html/Text results and one for Image
    results.
  • HtmlDataTable and HtmlDisplayBox handle the
    search result display and page navigation.
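
Page navigation over a result list comes down to slicing the list by page number. The helper below is a hypothetical sketch of that step only; it is not the HtmlDataTable or HtmlDisplayBox code, and the page size is an arbitrary parameter because the slides do not state one.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: cut a result list into fixed-size pages.
class ResultPagerSketch {
    // Returns the entries of the given 1-based page, or an empty list past the end.
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be at least 1");
        }
        long from = (long) (pageNumber - 1) * pageSize;
        if (from >= results.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, (long) results.size());
        return results.subList((int) from, to);
    }

    // Number of pages needed for the given number of results.
    static int pageCount(int totalResults, int pageSize) {
        return (totalResults + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
        for (int p = 1; p <= pageCount(hits.size(), 2); p++) {
            System.out.println("page " + p + ": " + page(hits, p, 2));
        }
    }
}
```

Each entry on the returned page would then be wrapped in its own ResultDisplayComponent, as described above.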

17
Platform
The project is implemented on the following platform:
  • Hardware
      • CPU: Intel Centrino 1.6 GHz, 1 CPU
      • RAM: 1.3 GB
      • Drive configuration: IDE 7200 rpm, RAID-1
  • Software
      • Java version: 1.4.2
      • Java VM: IBM JDK
      • OS: Windows 2000 Professional, Service Pack 4
      • IDE: VisualAge for Java
      • Application server: WebSphere 3.5.6 or WebSphere
        5.2
      • Web server: IBM HTTP Server
      • User interface: Servlets / JSP
      • Location of html/text index: local (B-tree)
      • Location of image index: local (database)

18
Screenshots Lehman Search
19
Screenshots Lehman Search Result
20
Screenshots Web Search
21
Screenshots Web Search Result
22
Screenshots Image Search
23
Screenshots Image Search Result
24
Screenshots News Search
25
Screenshots News Search Result