Tata Steel Intranet Search System

1
  • Presentation
  • Tata Steel Intranet Search System
  • Dolf Trieschnigg
  • 19th of December 2003

2
Overview
  • Problem identification
  • Possible solutions
  • Developed solution
  • Architecture & Design
  • Demo & Comparison
  • Conclusion & Recommendations

3
Problem identification
  • Need for improvement of the search system
  • Current intranet search results are not satisfying
      • searches only the main intranet server
      • results are disappointing
  • Search system is decentralized

4
Possible solutions
  • Real-time searching
      • start looking at intranet pages after querying
  • Index-based searching
      • make a central index of the intranet content
      • search this index
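
The index-based option can be illustrated with a minimal sketch. It is written in Python for brevity, which is not the technology of the presented system, and the names `tokenize`, `build_index` and `search` as well as the sample pages are hypothetical. It builds an in-memory inverted index once, then answers queries from that index without visiting the pages again.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical intranet content, for illustration only.
pages = {
    "http://intranet/safety.html": "Safety rules for the steel plant",
    "http://dept1/rolling.html": "Rolling mill schedule and safety notes",
    "http://dept2/canteen.html": "Canteen menu for this week",
}
index = build_index(pages)
```

Calling `search(index, "safety mill")` only looks up two sets and intersects them. Real-time searching would instead have to fetch and scan every page after each query.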

5
Developed solution
  • Crawler-based search system
      • retrieve pages/documents from the intranet
      • build and maintain a local index of the intranet
      • search the index via a web-based interface

6
Architecture

7
Architecture (cont'd)
  • TAranTulA (Crawler/Spider)
      • Java implementation
      • Using open source projects for HTTP-Client and HTML-parser (Sourceforge, Apache Jakarta project)
  • Indexer
      • Microsoft SQL Server (storing)
      • Microsoft Search (indexing stored content)
  • Web interface
      • Microsoft IIS Server
      • ASP pages with VBScript and JScript

8
TAranTulA
  • Retrieves pages from the intranet
  • Crawling a location
      • Retrieve, store and index a location (HTTP-client)
      • Extract new locations (links) from the current location
      • Crawl every newly found location
  • Builds the index completely from the user's perspective
      • the old system indexed the script itself, not the output of the script
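
The crawl steps listed on this slide can be sketched as a breadth-first loop. This is an illustrative Python sketch, not TAranTulA's Java code: `LinkExtractor`, `crawl`, the `fetch` callback and the in-memory `FAKE_WEB` are hypothetical, and storing the page stands in for the store-and-index step. Because pages arrive through `fetch` the way a browser would receive them, the crawler sees the output of a script rather than its source.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the targets of <a href> and <frame src> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "frame" and attrs.get("src"):
            self.links.append(attrs["src"])


def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) returns HTML, or None if unavailable."""
    store = {}                      # url -> retrieved content
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        html = fetch(url)           # retrieve the location
        if html is None:
            continue
        store[url] = html           # store it (indexing would happen here)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:   # extract new locations
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store


# Hypothetical three-page intranet served from memory instead of HTTP.
FAKE_WEB = {
    "http://intranet/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://intranet/a.html": '<a href="/">home</a>',
    "http://intranet/b.html": '<frame src="http://intranet/a.html">',
}
crawled = crawl("http://intranet/", FAKE_WEB.get)
```

The `seen` set is what keeps the loop finite when pages link back to each other, as the home page and `a.html` do here.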

9
Difficulties & Concessions
  • Many different file/MIME types
      • Supported: html (incl. asp, jsp etc.), txt, doc, xls, ppt
  • Unlimited locations because of dynamic pages: http://host/page1.asp?bgcolor=123...
      • Supported: only URLs without parameters
  • Links in embedded objects: Applets, Flash, client-side scripts
      • Only follow HTML links (A-HREF, FRAME-SRC, etc.)
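
The first two concessions amount to a filter applied to each candidate location. The Python sketch below is one possible reading of them, not the crawler's actual rule set: `is_crawlable` and `SUPPORTED_EXTENSIONS` are hypothetical names, `.htm` is added by assumption, and deciding by file extension instead of by MIME type is a simplification made for this example.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions taken from the slide, plus ".htm" as an assumption.
SUPPORTED_EXTENSIONS = {
    ".html", ".htm", ".asp", ".jsp", ".txt", ".doc", ".xls", ".ppt",
}


def is_crawlable(url):
    """Accept only parameter-free HTTP URLs with a supported extension."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return False
    if parts.query:                  # dynamic page with parameters: skip
        return False
    ext = posixpath.splitext(parts.path)[1].lower()
    # Assumption: a path without an extension (e.g. a directory) serves HTML.
    return ext == "" or ext in SUPPORTED_EXTENSIONS
```

Rejecting every URL with a query string bounds the number of locations, at the cost of missing query-driven content.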

10
However
  • HTML, text and Word documents form the largest part of the available content.
  • Content on department servers is now also searchable.

11
OO-Design
  • Derived from process flow

12
Class diagram

13
TAranTulA is
  • an automated HTTP-client
  • an HTML parser & summarizer
  • multi-threaded
  • scheduled
  • a Java Windows NT service

14
Web interface: searching

15
Web interface: submission
  • Submit new/updated sites for indexing by TAranTulA

16
Demo

17
A small comparison

18
External improvement
  • Quality of the results depends on the text on the page.
  • Improve the results by improving the meta-data on intranet pages
      • provide a good title
      • provide keywords
      • provide a description
      • provide alternative text for images, etc.
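
To make these recommendations concrete, the Python sketch below shows a hypothetical page carrying all four kinds of meta-data and a small parser that picks them up. `MetadataParser` and the sample page are invented for illustration; how the presented system actually weighs these fields is not stated in the slides.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect title, meta keywords/description and image alt texts."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self.alt_texts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name == "keywords":
                self.keywords = attrs.get("content") or ""
            elif name == "description":
                self.description = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical intranet page with the meta-data the slide recommends.
PAGE = """<html><head>
<title>Blast Furnace Maintenance Schedule</title>
<meta name="keywords" content="blast furnace, maintenance, schedule">
<meta name="description" content="Planned maintenance windows per furnace.">
</head><body><img src="plan.gif" alt="Maintenance calendar"></body></html>"""

meta = MetadataParser()
meta.feed(PAGE)
```

A page without these elements leaves all four fields empty, so a text-based index has only the body text to match against.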

19
Conclusion
  • Improvement of the intranet search system
      • more and better results
      • more user-friendly interface
  • Future improvements
      • develop a method for indexing query-driven content
      • improve link retrieval from embedded objects
      • improve the ranking system