Title: Search Engine Project GROUP-D
1 DISTRIBUTED INFORMATION RETRIEVAL SYSTEMS
- Search Engine Project GROUP-D
Group Members: Deepesh Sen (Group Leader), Hilda Gonzalez, Jonathan D Holland, Kavitha Marupakula, Palilba Singuinam, Somalina Samal
2 Introduction
GlobeSurf is a high-performance, full-featured
text and image search engine written entirely in
Java.
- The Lucene API is used for text searching and indexing.
- Image searching and indexing were implemented using a database.
- Documents are downloaded from Lehman and other websites using the JOBO web spider.
- The search engine features full-text search and supports Term, Phrase and Boolean queries.
- It allows category-based search: Lehman College, Web, News and Image. More categories can be added in the future.
- Automatic "OR" Queries - Returns pages that include any or all of the search terms.
- Automatic Exclusion of Common Words (Stop words) like a, an, and, there, they, this, to, was, will, with etc.
- Automatic Capitalization - Searches are NOT case sensitive.
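The three automatic behaviours listed above (OR matching, stop word exclusion, case insensitivity) can be illustrated with a small self-contained sketch. This is not the project's code and does not use Lucene; the class name `OrQuerySketch` and its methods are hypothetical, and the stop word set contains only the words named on the slide, not the project's full list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: not GlobeSurf's or Lucene's code.
final class OrQuerySketch {
    // Only the stop words named on the slide; the project's full list is not shown.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "there", "they", "this", "to", "was", "will", "with"));

    // Lowercase the text, split on non-alphanumerics, and drop stop words.
    static List<String> normalize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Return the names of documents that contain ANY of the query terms.
    static List<String> search(Map<String, String> docs, String query) {
        List<String> queryTerms = normalize(query);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Set<String> docTerms = new HashSet<>(normalize(doc.getValue()));
            for (String term : queryTerms) {
                if (docTerms.contains(term)) {
                    matches.add(doc.getKey());
                    break; // one matching term is enough for an OR query
                }
            }
        }
        return matches;
    }
}
```

Because both the documents and the query pass through the same `normalize` step, a query such as "LEHMAN news" matches a page containing "Lehman" as well as a page containing "news".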
3 Introduction Contd.
- Word Variations (Stemming) - Uses PorterStemmer to search not only for the search terms, but also for words that are similar to some or all of those terms.
- Image Search performs exact and partial matches on the image name and shows thumbnails of the returned images.
- Result links open in a new window so that you can keep your search window open.
- Best Result - Returns the document that best matches the query.
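To give a feel for how stemming lets a query match word variations, here is a deliberately naive suffix stripper. It is NOT the Porter algorithm that the project uses through PorterStemmer, which applies a much larger and more careful rule set; the class `NaiveStemmerSketch` and its suffix list are assumptions made only for illustration.

```java
import java.util.Locale;

// Illustrative only: a naive suffix stripper, NOT the Porter algorithm used by the project.
final class NaiveStemmerSketch {
    // Checked in this order, so longer suffixes are tried before the bare "s".
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Lowercase the word and strip the first matching suffix, keeping at least 3 letters.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Two words are treated as variations of each other when their stems are equal.
    static boolean sameStem(String a, String b) {
        return stem(a).equals(stem(b));
    }
}
```

If the same stemming step is applied at indexing time and at query time, "searching", "searches" and "searched" all reduce to one index term, which is the effect the Word Variations feature describes.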
4 Use Case Diagram
5 Process Flow
6 Downloading Document - JOBO
The JOBO web spider was used to download documents from websites.
- Four separate download processes were executed with JOBO:
  - Lehman College domain
  - News domain
  - All Web
  - Images
- The files downloaded by JOBO were stored in the following directories:
  - \DownloadedFiles\lehman
  - \DownloadedFiles\news
  - \DownloadedFiles\web
  - \DownloadedFiles\images
7 Downloading Document - JOBO
JOBO XML configuration
- To download the images, we allowed only image file types such as gif, jpg and jpeg; for the other three downloads we allowed text and html files only.
- The sleep time was 5 seconds between files.
- There was no limit on the age of the files that were downloaded.
- JOBO downloaded files that were at most 25 clicks away from the starting page.
- JOBO had no bandwidth limit; it used all our available internet bandwidth.
- Minimum file size was zero.
- Maximum file size was 100KB.
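The download rules above can be read as a single acceptance test applied to each candidate file. The sketch below models them in plain Java; it is not JOBO's API or its XML configuration format. The class `DownloadRulesSketch`, the extension list for text files, and the choice of 1 KB = 1024 bytes are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the download rules; this is not JOBO's API or configuration format.
final class DownloadRulesSketch {
    static final int MAX_DEPTH = 25;               // maximum clicks from the starting page
    static final long MIN_SIZE_BYTES = 0;          // minimum file size
    static final long MAX_SIZE_BYTES = 100 * 1024; // maximum file size, assuming 1 KB = 1024 bytes
    static final long SLEEP_MILLIS = 5_000;        // pause between files

    static final Set<String> IMAGE_TYPES = new HashSet<>(Arrays.asList("gif", "jpg", "jpeg"));
    // Assumed extensions for "text and html files"; the slide does not list them.
    static final Set<String> TEXT_TYPES = new HashSet<>(Arrays.asList("txt", "htm", "html"));

    private final Set<String> allowedTypes;

    DownloadRulesSketch(Set<String> allowedTypes) {
        this.allowedTypes = allowedTypes;
    }

    // Extract the lowercase extension after the last dot, or "" if there is none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // A file is accepted only if its type, link depth and size all satisfy the rules.
    boolean accepts(String fileName, int depth, long sizeBytes) {
        return allowedTypes.contains(extensionOf(fileName))
                && depth <= MAX_DEPTH
                && sizeBytes >= MIN_SIZE_BYTES
                && sizeBytes <= MAX_SIZE_BYTES;
    }
}
```

One rule object built with `IMAGE_TYPES` corresponds to the image download, and one built with `TEXT_TYPES` corresponds to the other three downloads.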
8 Indexing Html/Text files
Html/Text indexing was done using Lucene
9 Indexing Html/Text files
Classes used for Html/Text indexing
- Html documents are parsed using the HtmlParser class to separate the title and content sections and to remove all Html tags.
- A Document object is created for each file with appropriate Field objects:
  - URL - UnIndexed field, stored with the document, not searchable.
  - Modified Time - Keyword field, searchable, not tokenized.
  - Content - Tokenized and indexed.
  - Summary - UnIndexed field, stored with the document, not searchable.
  - Title - Tokenized and indexed.
- Stop words like a, an, and etc. are removed using StopAnalyzer and StopFilter. There are 35 words in the stop word list.
- Stemming is performed using PorterStemFilter and PorterStemmer. PorterStemFilter is called within StandardAnalyzer and uses LowerCaseTokenizer.
- An IndexWriter is created with the modified StandardAnalyzer and writes the index documents while walking through the directory hierarchy.
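The difference between the three field kinds used above (stored only, indexed as a single term, tokenized and indexed) can be shown with a small in-memory model. This is a sketch under stated assumptions and not Lucene's implementation: the class `FieldIndexSketch`, its `Kind` enum and its methods are hypothetical names, and tokenization here is a plain lowercase split without stop word removal or stemming.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative in-memory model of the three field kinds; not Lucene's implementation.
final class FieldIndexSketch {
    enum Kind { UNINDEXED, KEYWORD, TOKENIZED }

    // "field:term" -> ids of the documents containing that term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    // Every field value is kept here so it can be shown with a search result.
    private final List<Map<String, String>> stored = new ArrayList<Map<String, String>>();

    // Start a new document and return its id.
    int newDocument() {
        stored.add(new HashMap<String, String>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value, Kind kind) {
        stored.get(docId).put(name, value);
        if (kind == Kind.KEYWORD) {
            post(name, value, docId); // indexed as one exact, untokenized term
        } else if (kind == Kind.TOKENIZED) {
            for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    post(name, token, docId);
                }
            }
        }
        // UNINDEXED values are stored only, so no posting is written for them.
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    // Ids of documents whose field contains the exact indexed term.
    Set<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }
}
```

In this model a URL added as `UNINDEXED` can be read back for display but never matches a lookup, while a title added as `TOKENIZED` matches on each of its individual words.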
10 Indexing Image files
11 Indexing Image files
Classes used for Image indexing
- Walk through the downloaded image directory.
- Create an ImageDocument object for each file with appropriate Field objects:
  - Name: photo.jpg
  - URL: http://www.lehman.edu/..
  - Size: 35 KB
  - Modified Date: 12/01/2004
- Insert the image information into the image database using the ImageIndexer class. The ImageIndexer.indexImages() method opens a database connection to the image database and inserts all image-related information into image_table.
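The directory walk and per-file metadata collection described above can be sketched with the standard `java.nio.file` API. This is an illustration rather than the project's ImageIndexer: the class `ImageWalkSketch`, the `ImageInfo` holder and the column names in `INSERT_SQL` are hypothetical (only the table name image_table comes from the slide), and the sketch stops short of opening a database connection. How the project maps a downloaded file back to its source URL is not described, so the URL is left out of `ImageInfo`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch; ImageInfo and the SQL column names are hypothetical.
final class ImageWalkSketch {
    static final class ImageInfo {
        final String name;
        final long sizeBytes;
        final long modifiedMillis;

        ImageInfo(String name, long sizeBytes, long modifiedMillis) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.modifiedMillis = modifiedMillis;
        }
    }

    // Parameterized statement that a JDBC PreparedStatement could execute once per image.
    static final String INSERT_SQL =
            "INSERT INTO image_table (name, url, size_bytes, modified_time) VALUES (?, ?, ?, ?)";

    static boolean isImage(Path file) {
        String n = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return n.endsWith(".gif") || n.endsWith(".jpg") || n.endsWith(".jpeg");
    }

    // Recursively collect one ImageInfo per image file under the given directory.
    static List<ImageInfo> collect(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(ImageWalkSketch::isImage)
                        .sorted()
                        .collect(Collectors.toList());
        }
        List<ImageInfo> images = new ArrayList<>();
        for (Path file : files) {
            images.add(new ImageInfo(file.getFileName().toString(),
                    Files.size(file),
                    Files.getLastModifiedTime(file).toMillis()));
        }
        return images;
    }
}
```

Each collected `ImageInfo` would then supply the parameters for one execution of the insert statement.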
12 Searching Html/Text files
13 Searching Html/Text files
Classes used for Html/Text search operation
The input for a search operation is a 'query' that specifies the criteria for selecting documents, and its output is a list of documents ('hits') that match those criteria.
- The Searcher class is used to find documents matching the user query. The Searcher.search(File indexDir, String q,) method accepts an index directory and a query string as arguments and returns a DataStore object containing the result information.
- QueryParser parses the user's query string. QueryParser uses various types of Query objects internally:
  - PhraseQuery - A Query that matches documents containing a particular sequence of terms. This may be combined with other terms with a BooleanQuery.
  - BooleanQuery - A Query that matches documents matching Boolean combinations of other queries, typically TermQuerys or PhraseQuerys.
  - TermQuery etc.
- The QueryParser.parse() method returns a Query object and uses StandardAnalyzer (the same analyzer used in indexing).
- The Query object instance is handed to the IndexSearcher.search() method, which returns a Hits collection.
- The Hits object is a collection of the Document objects matched by the query, with an associated relevance score for each document, sorted by score.
- For each Document returned in Hits, a ResultDisplayComponent is created.
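The relationship between the term, phrase and Boolean query types can be illustrated by modelling each query as a predicate over a tokenized document. These are not Lucene's Query classes and there is no scoring here; the class `QueryTypesSketch` and its method names are hypothetical, and the sketch only shows what it means for each kind of query to match.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Illustrative predicates over a tokenized document; not Lucene's Query classes.
final class QueryTypesSketch {
    // Term query: the document contains the single term.
    static Predicate<List<String>> term(String t) {
        return doc -> doc.contains(t);
    }

    // Phrase query: the terms occur consecutively and in the given order.
    static Predicate<List<String>> phrase(String... terms) {
        List<String> wanted = Arrays.asList(terms);
        return doc -> Collections.indexOfSubList(doc, wanted) >= 0;
    }

    // Boolean AND: every sub-query must match.
    @SafeVarargs
    static Predicate<List<String>> all(Predicate<List<String>>... queries) {
        return doc -> Arrays.stream(queries).allMatch(q -> q.test(doc));
    }

    // Boolean OR: at least one sub-query must match.
    @SafeVarargs
    static Predicate<List<String>> any(Predicate<List<String>>... queries) {
        return doc -> Arrays.stream(queries).anyMatch(q -> q.test(doc));
    }

    // Boolean NOT: the sub-query must not match.
    static Predicate<List<String>> not(Predicate<List<String>> query) {
        return query.negate();
    }
}
```

For example, `all(term("lehman"), phrase("search", "engine"))` matches only documents that contain "lehman" and also contain "search" immediately followed by "engine", which mirrors how a phrase can be combined with other terms in a Boolean query.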
14 Searching Image files
15 Searching Image files
Classes used for Image search operation
- The Searcher.searchImages() method accepts the user query and searches the image database for a partial or full name match (using the LIKE operator with leading and trailing wildcards) and returns a list (DataStore) of images ordered by modified time.
- A ResultDisplayComponent is created for each image returned by the Searcher.searchImages() method. This component contains an HtmlImage object and Image Name and Size Information objects.
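A name search with leading and trailing wildcards could look roughly like the sketch below. It is not the project's Searcher.searchImages() implementation: the class `ImageNameSearchSketch`, the `Image` holder and the column names are hypothetical, and only the table name image_table comes from the slides. The in-memory `search` method assumes case-insensitive matching, which in a real database depends on the collation, and it sorts in ascending order because the slides do not state the sort direction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative sketch; column names are hypothetical, image_table is the table named on the slides.
final class ImageNameSearchSketch {
    // Parameterized query; the single parameter is the pattern built by likePattern().
    static final String SEARCH_SQL =
            "SELECT name, url, size_bytes, modified_time FROM image_table "
          + "WHERE name LIKE ? ORDER BY modified_time";

    // Wrap the user's text in leading and trailing wildcards for a partial match.
    // Note: '%' and '_' typed by the user are not escaped in this sketch.
    static String likePattern(String userQuery) {
        return "%" + userQuery.trim() + "%";
    }

    static final class Image {
        final String name;
        final long modifiedMillis;

        Image(String name, long modifiedMillis) {
            this.name = name;
            this.modifiedMillis = modifiedMillis;
        }
    }

    // In-memory equivalent of the query, for seeing the behaviour without a database.
    static List<Image> search(List<Image> images, String userQuery) {
        String needle = userQuery.trim().toLowerCase(Locale.ROOT);
        List<Image> result = new ArrayList<>();
        for (Image image : images) {
            if (image.name.toLowerCase(Locale.ROOT).contains(needle)) {
                result.add(image);
            }
        }
        result.sort(Comparator.comparingLong((Image i) -> i.modifiedMillis));
        return result;
    }
}
```

Passing the pattern as a statement parameter, rather than concatenating it into the SQL text, keeps the user's input out of the query string itself.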
16 Result Display
Classes used for displaying results
The user interface and display logic are implemented using servlets, JSP and a framework similar to Struts.
- The SiteServer servlet handles all user requests, forwards each request to the appropriate section, manages the user session etc.
- The DataStore object is a container that holds the information of the matching documents, like url, title, summary, file size etc.
- For each entry of the DataStore, a ResultDisplayComponent is created, which displays the document information on the screen.
- There are two different kinds of ResultDisplayComponent: one for Html/Text and the other for Image.
- HtmlDataTable and HtmlDisplayBox handle the search result display and page navigation.
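Page navigation and result rendering can be sketched independently of the servlet framework. The class `ResultPageSketch` below is hypothetical and is not the project's HtmlDataTable or ResultDisplayComponent; it only shows one way to slice a result list into pages and to render a result as a link that opens in a new window.

```java
import java.util.List;

// Illustrative sketch of paging and link rendering; not the project's HtmlDataTable.
final class ResultPageSketch {
    // Return the slice of results shown on a 1-based page number.
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be positive");
        }
        int from = Math.min((pageNumber - 1) * pageSize, results.size());
        int to = Math.min(from + pageSize, results.size());
        return results.subList(from, to);
    }

    // Number of pages needed to show all results.
    static int pageCount(int totalResults, int pageSize) {
        return (totalResults + pageSize - 1) / pageSize;
    }

    // Escape the characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Render one result as a link that opens in a new window.
    static String link(String url, String title) {
        return "<a href=\"" + escape(url) + "\" target=\"_blank\">" + escape(title) + "</a>";
    }
}
```

Escaping the title and URL before writing them into the page matters because both come from downloaded documents, which may contain characters that would otherwise be interpreted as markup.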
17 Platform
The project is implemented on the following platform
- Hardware
  - CPU: Intel Centrino 1.6 GHz, 1 CPU
  - RAM: 1.3 GB
  - Drive configuration: IDE 7200rpm RAID-1
- Software
  - Java Version: 1.4.2
  - Java VM: IBM JDK
  - OS: Windows 2000 Professional, Service Pack 4
  - IDE: VisualAge for Java
  - Application Server: WebSphere 3.5.6 or WebSphere 5.2
  - Web Server: IBM HTTP Server
  - User Interface: Servlets / JSP
  - Location of html/text index: local (B Tree)
  - Location of image index: local (Database)
18 Screenshots Lehman Search
19 Screenshots Lehman Search Result
20 Screenshots Web Search
21 Screenshots Web Search Result
22 Screenshots Image Search
23 Screenshots Image Search Result
24 Screenshots News Search
25 Screenshots News Search Result