Search Engine Project GROUPD

About This Presentation

Slides: 26

Provided by: cometLeh

Transcript and Presenter's Notes

Title: Search Engine Project GROUPD

1
DISTRIBUTED INFORMATION RETRIEVAL SYSTEMS

Search Engine Project GROUP-D

Group Members: Deepesh Sen (Group Leader), Hilda
Gonzalez, Jonathan D Holland, Kavitha Marupakula,
Palilba Singuinam, Somalina Samal
2
Introduction
GlobeSurf is a high-performance, full-featured
text and image search engine written entirely in
Java.

The Lucene API is used for text indexing and
searching.
Image indexing and searching were implemented
using a database.
Documents are downloaded from Lehman and other
websites using the JOBO web spider.
The search engine features full-text search and
supports Term, Phrase and Boolean queries.
It allows category-based search: Lehman College,
Web, News and Image. More categories can be
added in the future.
Automatic "OR" Queries - Returns pages that
include any or all of the search terms.
Automatic Exclusion of Common Words (Stop words)
like a, an, and, there, they, this, to, was,
will, with etc.
Automatic Capitalization - Searches are NOT case
sensitive.

3
Introduction Contd.

Word Variations (Stemming) - Uses PorterStemmer
to search not only for the search terms, but also
for words that share a stem with some or all of
those terms.
Image Search performs exact and partial matching
of the image name and shows thumbnails of the
returned images.
Result links open in a new window so that you can
keep your search window open.
Best Result - Returns the document that best
matches the query.
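
The query features listed on these two slides (case-insensitive matching, stop-word exclusion and automatic OR) can be illustrated with a minimal plain-Java sketch. It does not use Lucene and it omits stemming; the class name QueryAnalysisSketch, its method names and the shortened stop-word list are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hand-rolled stand-in for the analysis steps, not Lucene code.
class QueryAnalysisSketch {

    // A few stop words taken from the slide text; the project's own list is longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "there", "they", "this", "to", "was", "will", "with"));

    // Lower-cases the input, splits on anything that is not a letter or digit,
    // and drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Automatic OR: a document matches if it contains at least one query term.
    static boolean matchesAny(String query, String document) {
        Set<String> docTerms = new HashSet<>(analyze(document));
        for (String term : analyze(query)) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Search with THIS Engine"));
        System.out.println(matchesAny("LIBRARY hours", "Visit the library"));
    }
}
```

Because the same analyze() step is applied to both the query and the document, a query typed in capitals still matches lower-case text, which is the behaviour the slide calls Automatic Capitalization.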

4
Use Case Diagram
5
Process Flow
6
Downloading Document - JOBO
The JOBO web spider was used to download
documents from websites.

Four separate download processes were executed
with JOBO:
Lehman college domain
News domain
All Web
Images
The files downloaded by JOBO were stored in the
following directories:
\DownloadedFiles\lehman
\DownloadedFiles\news
\DownloadedFiles\web
\DownloadedFiles\images

7
Downloading Document - JOBO
JOBO XML configuration

To download the images, we allowed only image
file types such as gif, jpg and jpeg; for the
other three downloads we allowed only text and
HTML files.
The sleep time between files was 5 seconds.
There was no limit on the age of the files that
were downloaded.
JOBO downloaded files that were at most 25 clicks
away from the starting page.
JOBO had no bandwidth limit; it used all of our
available Internet bandwidth.
Minimum file size was zero.
Maximum file size was 100 KB.
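
The limits above can be read as a single acceptance rule per downloaded file. The sketch below expresses that rule in plain Java; it is not JOBO's configuration format (JOBO is configured through XML) and not part of the project. The class name, the extension-based type check, the extensions chosen for text and HTML files, and the reading of 100 KB as 100 * 1024 bytes are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the crawl limits described on the slide.
class DownloadPolicySketch {
    static final int MAX_DEPTH = 25;                 // clicks away from the starting page
    static final long MAX_SIZE_BYTES = 100L * 1024;  // assumed meaning of "100 KB"
    static final long SLEEP_MILLIS = 5000;           // pause between two files

    // Assumed extension lists; the slide names gif, jpg, jpeg and "text and html".
    static final Set<String> IMAGE_TYPES = new HashSet<>(Arrays.asList("gif", "jpg", "jpeg"));
    static final Set<String> TEXT_TYPES = new HashSet<>(Arrays.asList("txt", "html", "htm"));

    // Returns true if a file passes the type, size and depth limits.
    // imageCrawl selects the image download; the other three use the text rule.
    static boolean accept(String fileName, long sizeBytes, int depth, boolean imageCrawl) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Set<String> allowed = imageCrawl ? IMAGE_TYPES : TEXT_TYPES;
        return allowed.contains(ext)
                && sizeBytes >= 0                    // minimum file size is zero
                && sizeBytes <= MAX_SIZE_BYTES
                && depth <= MAX_DEPTH;
    }
}
```

For example, accept("photo.JPG", 35 * 1024, 3, true) passes all three checks, while the same file at depth 26 or at 200 KB is rejected.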

8
Indexing Html/Text files
Html/Text indexing was done using Lucene
9
Indexing Html/Text files
Classes used for Html/Text indexing

HTML documents are parsed using the HtmlParser
class to separate the title and content sections
and to remove all HTML tags.
A Document object is created for each file with
the appropriate Field objects:
URL - UnIndexed field, stored with the document,
not searchable.
Modified Time - Keyword field, searchable, not
tokenized.
Content - Tokenized and indexed.
Summary - UnIndexed field, stored with the
document, not searchable.
Title - Tokenized and indexed.
Stop words such as "a", "an" and "and" are
removed using StopAnalyzer and StopFilter. There
are 35 words in the stop word list.
Stemming is performed using PorterStemFilter and
PorterStemmer.
PorterStemFilter is called within
StandardAnalyzer and uses LowerCaseTokenizer.
An IndexWriter is created with the modified
StandardAnalyzer and writes the index documents
while walking through the directory hierarchy.
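
The difference between the three field kinds above (stored only, keyword, tokenized) can be made concrete with a small in-memory model. This is a sketch in plain Java, not Lucene code and not the project's indexer; the class FieldIndexSketch and its methods are hypothetical, and stop-word removal and stemming are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative in-memory model of the field layout; this is not Lucene code.
class FieldIndexSketch {
    // "field:term" -> ids of the documents that contain the term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Values kept with each document for display only (URL and summary).
    private final List<Map<String, String>> stored = new ArrayList<>();

    int addDocument(String url, String modified, String title,
                    String content, String summary) {
        int id = stored.size();
        Map<String, String> display = new HashMap<>();
        display.put("url", url);          // stored only, never indexed
        display.put("summary", summary);  // stored only, never indexed
        stored.add(display);

        index("modified", modified, id);  // keyword: the whole value is one term
        for (String token : tokenize(title)) {
            index("title", token, id);    // tokenized and indexed
        }
        for (String token : tokenize(content)) {
            index("content", token, id);  // tokenized and indexed
        }
        return id;
    }

    // Looks up one exact term in one field; unknown fields or terms give an empty set.
    Set<Integer> search(String field, String term) {
        Set<Integer> ids = postings.get(field + ":" + term);
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }

    String storedValue(int id, String field) {
        return stored.get(id).get(field);
    }

    private void index(String field, String term, int id) {
        postings.computeIfAbsent(field + ":" + term, k -> new HashSet<>()).add(id);
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

In this model a search on "content" or "title" finds individual words, a search on "modified" only matches the complete value, and the URL and summary can be read back for display but never produce a hit.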

10
Indexing Image files
11
Indexing Image files
Classes used for Image indexing

Walk through the downloaded image directory.
Create an ImageDocument object for each file with
the appropriate Field objects:
Name: photo.jpg
URL: http://www.lehman.edu/..
Size: 35 KB
Modified Date: 12/01/2004
Insert the image information into the image
database using the ImageIndexer class.
The ImageIndexer.indexImages() method opens a
database connection to the image database and
inserts all image-related information into
image_table.
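
The directory walk and record creation described above might look roughly like the following sketch. It is not the project's ImageIndexer: the class ImageWalkSketch, the shape of its ImageDocument, the baseUrl parameter used to rebuild a URL from the relative path, and the column names in INSERT_SQL are assumptions. Only the table name image_table comes from the slide, and the database insert itself is shown as a SQL string rather than executed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of collecting image metadata from a download directory.
class ImageWalkSketch {

    // Assumed column names; a real indexer would bind one ImageDocument per row.
    static final String INSERT_SQL =
            "INSERT INTO image_table (name, url, size_bytes, modified) VALUES (?, ?, ?, ?)";

    static final class ImageDocument {
        final String name;
        final String url;
        final long sizeBytes;
        final long modifiedMillis;

        ImageDocument(String name, String url, long sizeBytes, long modifiedMillis) {
            this.name = name;
            this.url = url;
            this.sizeBytes = sizeBytes;
            this.modifiedMillis = modifiedMillis;
        }
    }

    // Walks the directory tree and returns one record per gif/jpg/jpeg file.
    static List<ImageDocument> collect(Path root, String baseUrl) throws IOException {
        List<ImageDocument> docs = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (!name.toLowerCase(Locale.ROOT).matches(".*\\.(gif|jpe?g)")) {
                    continue; // not an image type named on the slides
                }
                String relative = root.relativize(file).toString().replace('\\', '/');
                docs.add(new ImageDocument(name, baseUrl + relative,
                        Files.size(file), Files.getLastModifiedTime(file).toMillis()));
            }
        }
        return docs;
    }
}
```

Each returned record carries the four values listed on the slide (name, URL, size and modified date), which is what a JDBC PreparedStatement built from INSERT_SQL would need.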

12
Searching Html/Text files
13
Searching Html/Text files
Classes used for Html/Text search operation
The input for a search operation is a 'query'
that specifies the criteria for selecting
documents, and its output is a list of documents
('hits') that match those criteria.

The Searcher class is used to find documents
matching the user query.
The Searcher.search(File indexDir, String q)
method accepts an index directory and a query
string as arguments and returns a DataStore
object containing the result information.
QueryParser parses the user's query string.
QueryParser uses various types of Query objects
internally:
PhraseQuery - A Query that matches documents
containing a particular sequence of terms.
This may be combined with other terms
with a BooleanQuery.
BooleanQuery - A Query that matches documents
matching Boolean combinations of other queries,
typically TermQuerys or PhraseQuerys.
TermQuery etc.
The QueryParser.parse() method returns a Query
object and uses StandardAnalyzer (the same
analyzer used in indexing).
The Query object is handed to the
IndexSearcher.search() method, which returns a
Hits collection.
A Hits object is a collection of the Document
objects matched by the query, each with an
associated relevance score, sorted by score.
For each Document returned in Hits, a
ResultDisplayComponent is created.
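
The overall shape of the search step (query in, scored hits out, best score first) can be sketched without Lucene. The scoring below simply counts how often the query terms occur in each document; it is a deliberately simplified stand-in and not Lucene's relevance formula. The class RankedSearchSketch and its Hit type are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: term-count scoring as a stand-in for a real relevance score.
class RankedSearchSketch {

    static final class Hit {
        final String url;
        final int score;

        Hit(String url, int score) {
            this.url = url;
            this.score = score;
        }
    }

    // docs maps a URL to the document text; the result is sorted by descending score.
    static List<Hit> search(Map<String, String> docs, String query) {
        String[] terms = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            List<String> words = Arrays.asList(
                    entry.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
            int score = 0;
            for (String term : terms) {
                if (!term.isEmpty()) {
                    score += Collections.frequency(words, term);
                }
            }
            if (score > 0) {
                hits.add(new Hit(entry.getKey(), score)); // only matching documents are hits
            }
        }
        hits.sort((a, b) -> Integer.compare(b.score, a.score));
        return hits;
    }
}
```

A caller would then iterate over the returned list in order and build one display entry per hit, mirroring the step where a ResultDisplayComponent is created for each Document.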

14
Searching Image files
15
Searching Image files
Classes used for Image search operation

The Searcher.searchImages() method accepts the
user query and searches the image database for a
partial or full name match (using leading and
trailing wildcards with the LIKE operator) and
returns a list (DataStore) of images ordered by
modified time.
A ResultDisplayComponent is created for each
image returned by the Searcher.searchImages()
method.
This component contains an HtmlImage object and
Image Name and Size Information objects.
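
A partial-name match with the SQL LIKE operator is typically done by wrapping the user's text in % wildcards. The sketch below shows one way to build such a pattern safely and bind it with JDBC. It is not the project's Searcher code: the class name, the column names, the ESCAPE character and the lower-casing are assumptions, and the slide does not say whether results are sorted in ascending or descending order of modified time.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Locale;

// Hypothetical sketch of a partial-name image lookup; only image_table is from the slides.
class ImageNameSearchSketch {

    // Assumed column names; sort direction is left at the SQL default.
    static final String SEARCH_SQL =
            "SELECT name, url, size_bytes, modified FROM image_table "
          + "WHERE LOWER(name) LIKE ? ESCAPE '!' ORDER BY modified";

    // Wraps the query in leading and trailing % so it matches anywhere in the name.
    // Literal %, _ and ! typed by the user are escaped so they are not wildcards.
    static String likePattern(String userQuery) {
        String escaped = userQuery.trim().toLowerCase(Locale.ROOT)
                .replace("!", "!!")
                .replace("%", "!%")
                .replace("_", "!_");
        return "%" + escaped + "%";
    }

    // Binds the pattern as a parameter instead of concatenating it into the SQL.
    static PreparedStatement prepare(Connection connection, String userQuery)
            throws SQLException {
        PreparedStatement statement = connection.prepareStatement(SEARCH_SQL);
        statement.setString(1, likePattern(userQuery));
        return statement;
    }
}
```

With this pattern a query such as "photo" matches both photo.jpg (a full name match on the stem) and class_photo_2004.gif (a partial match).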

16
Result Display
Classes used for displaying results
The user interface and display logic are
implemented using servlets, JSP and a framework
similar to Struts.

The SiteServer servlet handles all user requests,
forwards each request to the appropriate section,
manages the user session, etc.
The DataStore object is a container that holds
information about the matching documents, such as
URL, title, summary and file size.
For each entry in the DataStore, a
ResultDisplayComponent is created, which displays
the document information on the screen.
There are two kinds of ResultDisplayComponent:
one for Html/Text and one for Image.

HtmlDataTable and HtmlDisplayBox handle the
search result display and page navigation.
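
Page navigation over a result list comes down to slicing the list by page number. The following is a minimal sketch of that calculation only; it does not reproduce HtmlDataTable or HtmlDisplayBox, and the class ResultPagerSketch, the 1-based page numbering and the page size parameter are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for splitting a result list into pages.
class ResultPagerSketch {

    // Returns the entries shown on the given page; pageNumber starts at 1.
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageSize <= 0 || pageNumber <= 0) {
            throw new IllegalArgumentException("pageNumber and pageSize must be positive");
        }
        int from = (pageNumber - 1) * pageSize;
        if (from >= results.size()) {
            return Collections.emptyList(); // past the last page
        }
        int to = Math.min(results.size(), from + pageSize);
        return new ArrayList<>(results.subList(from, to));
    }

    // Number of pages needed to show all results (rounds up).
    static int pageCount(int totalResults, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (totalResults + pageSize - 1) / pageSize;
    }
}
```

With five results and a page size of two, pageCount gives three pages and the third page holds the single remaining entry.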

17
Platform
The project is implemented on the following
platform:

Hardware
CPU: Intel Centrino 1.6 GHz, 1 CPU
RAM: 1.3 GB
Drive configuration: IDE 7200 rpm, RAID-1
Software
Java Version: 1.4.2
Java VM: IBM JDK
OS: Windows 2000 Professional, Service Pack 4
IDE: VisualAge for Java
Application Server: WebSphere 3.5.6 or WebSphere
5.2
Web Server: IBM HTTP Server
User Interface: Servlets / JSP
Location of html/text index: local (B Tree)
Location of image index: local (Database)

18
Screenshots Lehman Search
19
Screenshots Lehman Search Result
20
Screenshots Web Search
21
Screenshots Web Search Result
22
Screenshots Image Search
23
Screenshots Image Search Result
24
Screenshots News Search
25
Screenshots News Search Result
