Search for the Enterprise - PowerPoint PPT Presentation

Provided by: foss

1
Search for the Enterprise
  • Free flow of information is the only safeguard
    against tyranny
  • Tarun Dua
  • Company: Induslogic Inc.
  • LUG Affiliation: Linux Delhi

2
Why Search?
  • Search is the easiest and fastest way to access
    information.
  • Alternatives
  • 1. SQL Database
  • 2. File System
  • They don't look as good as search!!

3
What is Search?
  • Approximates to
  • Index and assign Relevance
    • Fetch
    • Analyse
    • Rank
  • Access Method
    • Command line/GUI
    • CGI
    • API
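
The fetch, analyse, index and rank steps above can be sketched in a few lines. This is a minimal illustration only: the documents, the function names and the TF-IDF-style scoring are assumptions made for the example, not something the slides prescribe.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical "fetched" documents; a real crawler would read files or URLs.
DOCS = {
    "mail-archive": "Mailman archive of the Linux Delhi mailing list",
    "wiki": "Intranet wiki page about search and index tools",
    "howto": "How to index a mailing list archive for search",
}


def analyse(text):
    """Analyse step: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Index step: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(analyse(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Rank step: order documents by a simple TF-IDF sum over the query terms."""
    scores = defaultdict(float)
    for term in analyse(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


INDEX = build_index(DOCS)
print(search(INDEX, len(DOCS), "mailing list archive"))
```

The same `search` function could sit behind any of the access methods listed (command line, CGI or an API); only the wrapper around it changes.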

4
Various Terms
  • Index
  • Meta Tags
  • Crawler
  • Analyser/Tokenizer
  • Filters
  • Rank
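
To make the analyser/tokenizer and filter terms concrete, here is a small sketch of a token pipeline. The stop-word list and the function names are illustrative assumptions; real engines ship their own analysers.

```python
import re

# Illustrative stop-word list only; real analysers use longer, language-specific lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def tokenize(text):
    """Tokenizer: break raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Filter: normalise case so 'Crawler' and 'crawler' match."""
    return [t.lower() for t in tokens]


def drop_stop_words(tokens):
    """Filter: remove very common words that add little to relevance."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyse(text, filters=(lowercase, drop_stop_words)):
    """Analyser: tokenizer followed by a chain of filters, applied in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyse("The Crawler fetches pages and the Analyser tokenizes them"))
```

The order of the filters matters: the stop-word filter here expects lowercase input, so `lowercase` runs first.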

5
What to Search?
  • Desktop Search for individual users
  • Try to get as much information onto web-based
    Intranet Interfaces
  • GNU Mailman for Web-Archives
  • Document Repository using Drupal

6
Desktop Search: Namazu
  • Create Index
    • Customize mknmz
    • mknmz [options] <target directory>
    • Install a cron job
  • Search
    • Customize namazurc
    • namazu <query> <index>
    • Set up CGI access to the namazu index
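
The create-index and search steps above could be scripted roughly as follows. This is a sketch under assumptions: the directory paths are hypothetical, and the `-O` (output directory) option should be checked against `mknmz --help` on your installation before relying on it.

```python
import shutil
import subprocess

# Hypothetical locations, chosen only for illustration.
TARGET_DIR = "/home/user/docs"
INDEX_DIR = "/home/user/.namazu-index"


def mknmz_command(target_dir, index_dir):
    """Build the argv for indexing target_dir into index_dir.

    Assumes -O selects the directory the index files are written to.
    """
    return ["mknmz", "-O", index_dir, target_dir]


def namazu_command(query, index_dir):
    """Build the argv for querying an existing index."""
    return ["namazu", query, index_dir]


def run_if_installed(argv):
    """Run the command only when the binary is on PATH; otherwise return None."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # A cron entry would invoke the same indexing command on a schedule.
    print(" ".join(mknmz_command(TARGET_DIR, INDEX_DIR)))
    print(run_if_installed(namazu_command("search engine", INDEX_DIR)))
```

Building the argument list separately from running it keeps the script usable on machines where Namazu is not installed.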

7
Namazu
  • Pluggable Document Filters
  • Phrase Search
  • Sub-String Matching
  • Regexp Search
  • Good for Desktop Sized Searches

8
Setting up and using htdig
  • Configure htdig.conf
    • Define Scope of Crawl
  • Create Index
    • rundig
    • Set up cron
  • Using htdig
    • htsearch
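
Defining the scope of the crawl comes down to a handful of `name: value` attributes in htdig.conf. The sketch below renders such a file from a dictionary; the host name, paths and hop limit are hypothetical values, and the attribute names should be checked against the ht://Dig attribute reference for your version.

```python
def render_htdig_conf(settings):
    """Render 'name: value' lines in the attribute style htdig.conf uses."""
    return "".join(f"{name}:\t{value}\n" for name, value in settings.items())


# Hypothetical values for an intranet crawl.
SETTINGS = {
    "database_dir": "/var/lib/htdig/db",          # where the index databases live
    "start_url": "http://intranet.example.com/",  # where the crawl begins
    "limit_urls_to": "${start_url}",              # stay inside the starting site
    "exclude_urls": "/cgi-bin/ .cgi",             # patterns to skip
    "max_hop_count": "5",                         # how many links deep to follow
}

if __name__ == "__main__":
    print(render_htdig_conf(SETTINGS), end="")
```

Generating the file from code is only one option; the same attributes can be edited by hand before running rundig.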

9
Htdig Advantages
  • HTTP-based crawl (configurable)
  • Comprehensive Toolset
    • htdump, htload, htmerge
  • Fuzzy Indexing
    • htfuzzy
  • Scalability
    • Millions of pages can be indexed
  • Document Types Supported
  • Ranking Supported
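
Fuzzy indexing means a query can match words that are similar to it rather than identical. As an illustration of the idea only (not htfuzzy's own code), the sketch below groups words by their Soundex phonetic key; the word list is made up.

```python
from collections import defaultdict

# Standard Soundex letter groups.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def build_fuzzy_index(words):
    """Map each phonetic key to the set of indexed words that share it."""
    fuzzy = defaultdict(set)
    for word in words:
        fuzzy[soundex(word)].add(word)
    return fuzzy


FUZZY = build_fuzzy_index(["Robert", "Rupert", "Linux", "Delhi"])
print(sorted(FUZZY[soundex("Rupert")]))
```

At query time the search term is reduced to the same key, so a misspelt or similar-sounding name still finds the indexed word.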

10
What Next?
  • Tomcat as primary server
  • Custom Indexing/ Meta Information
  • Programmatic Control over search results
    according to logged in user
  • Web-Application Session Awareness
  • Lucene/Nutch
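
Programmatic control over results according to the logged-in user can be as simple as filtering hits by group membership before display. The sketch below assumes each hit carries the groups allowed to see it; the `Hit` class, the group names and the URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    url: str
    score: float
    allowed_groups: frozenset  # groups permitted to see this document


def visible_hits(hits, user_groups):
    """Keep only hits the user may see, best score first."""
    groups = set(user_groups)
    allowed = [hit for hit in hits if hit.allowed_groups & groups]
    return sorted(allowed, key=lambda hit: hit.score, reverse=True)


# Hypothetical raw results from the search engine.
HITS = [
    Hit("/hr/salaries.pdf", 0.9, frozenset({"hr"})),
    Hit("/wiki/search-howto", 0.7, frozenset({"staff", "hr"})),
    Hit("/lists/linux-delhi", 0.4, frozenset({"staff"})),
]

# In a web application the groups would come from the user's session.
print([hit.url for hit in visible_hits(HITS, ["staff"])])
```

Filtering after the search keeps the index shared between users; the trade-off is that restricted documents are still indexed and must be hidden at every access point.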

11
Lucene
  • Cross-platform Search Engine Library
  • Simple working example with PDFBox
    • See http://dharmanand.tarundua.net
  • Implementations based on Lucene
    • Nutch
    • regain
  • Plugin Architecture
    • e.g. PDFBox integration
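
A plugin architecture for document types usually means a registry that maps a format to a text extractor, so new formats can be added without touching the indexer. The sketch below shows the pattern in Python rather than with Lucene's Java API; the registry, decorator and function names are hypothetical.

```python
import re

# Registry of text extractors, keyed by MIME type.
EXTRACTORS = {}


def extractor(mime_type):
    """Decorator that registers a function as the plugin for one MIME type."""
    def register(func):
        EXTRACTORS[mime_type] = func
        return func
    return register


@extractor("text/plain")
def extract_plain(data):
    return data.decode("utf-8", errors="replace")


@extractor("text/html")
def extract_html(data):
    # Crude tag stripping, enough for an illustration.
    text = re.sub(r"<[^>]+>", " ", data.decode("utf-8", errors="replace"))
    return " ".join(text.split())


def extract_text(mime_type, data):
    """Dispatch to the registered plugin, or fail clearly if none exists."""
    handler = EXTRACTORS.get(mime_type)
    if handler is None:
        raise ValueError(f"no extractor registered for {mime_type}")
    return handler(data)


print(extract_text("text/html", b"<p>Hello <b>world</b></p>"))
```

A PDF plugin would register itself under `application/pdf` in the same way and hand the extracted text to the indexer.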

12
Nutch
  • Works with Tomcat / pure Java
  • Continuous indexing
  • Aims to return results in < 1 sec
  • Link Analysis
  • Original Page Caching
  • Relevance Quality options
  • Supports PDF, MS-Word, Text and HTML
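
Link analysis scores a page by the links pointing at it. The sketch below uses a PageRank-style power iteration as one well-known example of the idea; it is not taken from Nutch's code, and the link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by incoming links using a basic power iteration."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical intranet link graph: page -> pages it links to.
LINKS = {
    "home": ["docs", "wiki"],
    "docs": ["home"],
    "wiki": ["home", "docs"],
    "orphan": ["home"],
}

RANKS = pagerank(LINKS)
print(max(RANKS, key=RANKS.get))
```

In a search engine a link score like this is typically combined with the text relevance score rather than used on its own.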

13
Using Nutch
  • BootStrap Web Database
  • Fetch
  • Fetch initial Segments
  • Expand Segments
  • Analyse
  • Index
  • Search
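
The workflow above (bootstrap a database of URLs, fetch in segments, expand, index, search) can be simulated in memory to show how the pieces fit. This is a conceptual sketch, not Nutch's commands or API; the `WEB` dictionary stands in for real HTTP fetches and all URLs are hypothetical.

```python
# Hypothetical site: URL -> (page text, outgoing links).
WEB = {
    "http://intranet.example.com/": (
        "intranet home page",
        ["http://intranet.example.com/docs", "http://intranet.example.com/lists"],
    ),
    "http://intranet.example.com/docs": (
        "document repository search",
        ["http://intranet.example.com/docs/policy"],
    ),
    "http://intranet.example.com/lists": ("mailing list archive", []),
    "http://intranet.example.com/docs/policy": ("leave policy document", []),
}


def crawl(seeds, rounds):
    """Bootstrap from seeds, then fetch one segment of new URLs per round."""
    known = set(seeds)   # the "web database" of URLs seen so far
    fetched = {}         # URL -> page text
    segments = []
    for _ in range(rounds):
        segment = sorted(known - set(fetched))  # URLs known but not yet fetched
        if not segment:
            break
        segments.append(segment)
        for url in segment:
            text, links = WEB.get(url, ("", []))
            fetched[url] = text
            known.update(links)  # expand the database with newly found links
    return segments, fetched


def build_index(pages):
    """Index step: map each term to the set of URLs containing it."""
    index = {}
    for url, text in pages.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return index


def search(index, term):
    return sorted(index.get(term.lower(), set()))


SEGMENTS, PAGES = crawl(["http://intranet.example.com/"], rounds=3)
INDEX = build_index(PAGES)
print(search(INDEX, "policy"))
```

Each round only reaches pages one link further from the seeds, which is why the number of rounds controls the crawl depth.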

14
References
  • http://www.htdig.org/
  • http://www.namazu.org
  • http://jakarta.apache.org/lucene
  • http://www.nutch.org