Intelligent Crawling and Indexing using Lucene

Transcript and Presenter's Notes

Title: Intelligent Crawling and Indexing using Lucene


1
Intelligent Crawling and Indexing using Lucene
  • By
  • Shiva Thatipelli
  • Mohammad Zubair (Advisor)

2
Contents
  • Searching
  • Indexing
  • Lucene
  • Indexing with Lucene
  • Indexing Static and Dynamic Pages
  • Extracting and Indexing Dynamic Pages
  • Implementation
  • Screens

3
Searching
  • Looking up words in an index
  • Factors Affecting Search
  • Precision: how well the system can filter out irrelevant results
  • Speed
  • Support for single and multiple phrase queries, results ranking,
    sorting, wildcard queries, and range queries

4
Indexing
  • Sequential Search is bad (Not Scalable)
  • Index speeds up selection
  • Index is a special data structure which allows
    rapid searching.
  • Different Index Implementations
  • - B Trees
  • - Hash Map

5
Search Process
[Diagram labels: Query, Docs, Indexing API, Index, Hits]

6
Lucene
  • High-performance, full-featured text search
    engine library
  • Written 100% in pure Java
  • Easy to use yet powerful API
  • An Apache Jakarta product with strong open source
    community support

7
Why Lucene?
  • Open source (Not proprietary)
  • Easy to use, good documentation
  • Interoperable, e.g. an index generated by Java can
    be used by VB, ASP, or Perl applications
  • Powerful and highly scalable
  • Index format
  • - Designed for interoperability
  • - Well documented
  • - Resides on the file system, in RAM, or in a custom store

8
Continued
  • Algorithms
  • - Efficient, fast, and optimized
  • Incremental indexing
  • Boolean Query, Fuzzy Query, Range Query, Multi
    Phrase Query, Wildcard Query, etc.
  • Content tagging: documents as collections of
    terms
  • Heterogeneous documents: useful when different
    sets of metadata are present for different MIME types

9
Indexing With Lucene
  • What type of documents can be indexed?
  • Any document whose text can be fetched over the
    net with a URL and extracted
  • Uses Inverted Index
  • - The index stores statistics about terms in
    order to make term-based search more efficient.
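
To make the inverted-index idea concrete, here is a minimal in-memory sketch. It is not Lucene's index format or API; the class name `InvertedIndexSketch` and its methods are hypothetical, and tokenization is a naive split on non-word characters. It only shows why looking up a term is cheaper than scanning every document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted ids of the documents containing it.
// Illustration only; this is not how Lucene stores its index.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Adds a document, records each of its terms, and returns the document id.
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Looks up a single term; no document text is scanned at query time.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a text search engine library");
        index.add("Apache access logs record request URLs");
        index.add("Lucene can index pages fetched from URLs");
        System.out.println(index.search("lucene")); // ids of documents containing "lucene"
        System.out.println(index.search("urls"));   // ids of documents containing "urls"
    }
}
```

A real engine also keeps per-term statistics (such as frequencies) alongside the postings so that hits can be ranked.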

10
Indexing With Lucene (Contd.)
[Diagram not preserved; the only surviving labels read "extracted"]

11
Indexing Static and Dynamic Pages
  • Static pages are HTML, XLS, Word, and PDF
    documents on the web that can be easily crawled and
    indexed by search engines like Google and Yahoo.
  • Static pages on the internet can be passed into
    Lucene, indexed, and searched using their direct URLs.
  • Dynamic pages are generated from submitted
    parameters, for example search results pages and
    database hidden pages. They cannot be indexed
    with direct URLs.
  • To index dynamic pages we need the parameters
    that users submitted to generate those pages.

12
Extracting and Indexing Dynamic Pages
  • Extracting dynamic web pages, which can also be
    called database hidden pages, needs some kind
    of input to generate the URLs.
  • To get the input parameters, we used Apache
    access logs, which contain each user request as a URL.
  • A sample entry in the Apache access log is as
    follows:
  • 127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET
    /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title
    HTTP/1.1" 200 9560
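
As a rough illustration of pulling the request URL out of such an entry, the sketch below matches one line against a regular expression. It assumes the entry follows Apache's Common Log Format, with the timestamp in square brackets and the request in double quotes. The class `AccessLogLine` and its field names are hypothetical and are not taken from the presentation's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One parsed access log entry. Assumed layout (Common Log Format):
// host ident user [timestamp] "METHOD url PROTOCOL" status bytes
class AccessLogLine {
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    final String host;
    final String timestamp;
    final String method;
    final String requestUrl;
    final String protocol;
    final int status;

    private AccessLogLine(Matcher m) {
        host = m.group(1);
        timestamp = m.group(4);
        method = m.group(5);
        requestUrl = m.group(6);
        protocol = m.group(7);
        status = Integer.parseInt(m.group(8));
    }

    // Returns null when the line does not match the assumed layout.
    static AccessLogLine parse(String line) {
        Matcher m = CLF.matcher(line.trim());
        return m.matches() ? new AccessLogLine(m) : null;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
            + "\"GET /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title HTTP/1.1\" "
            + "200 9560";
        AccessLogLine entry = AccessLogLine.parse(line);
        System.out.println(entry.method + " " + entry.requestUrl + " -> " + entry.status);
    }
}
```

Lines that do not match (for example, entries written with a custom log format) come back as null, so a caller can skip them instead of failing.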

13
Extracting and Indexing Dynamic Pages Contd...
  • Each entry contains the IP address of the computer
    accessing the information, the date and time of
    access, the method called, the request URL, the
    HTTP version, and the HTTP status code.
  • The request URL is the part that carries all the
    input parameters, in this case formname=simple,
    fulltext=maly, group=subject, and sort=title.
  • The results page is dynamic and depends on the
    parameters passed.
  • A full URL like
    http://archon.cs.odu.edu:8066/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title
    can be generated from the request URL by
    prepending the website address.
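
The URL generation step can be sketched as simple string handling. The helper names `fullUrl` and `queryParameters` are hypothetical, the sketch assumes parameters are written as name=value pairs joined by `&`, and it does not URL-decode values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of turning a logged request URL into a full URL and listing its parameters.
class DynamicUrlSketch {

    // Joins the website address and the request URL with exactly one slash between them.
    static String fullUrl(String siteAddress, String requestUrl) {
        String base = siteAddress.endsWith("/")
            ? siteAddress.substring(0, siteAddress.length() - 1)
            : siteAddress;
        String path = requestUrl.startsWith("/") ? requestUrl : "/" + requestUrl;
        return base + path;
    }

    // Splits the query string into name/value pairs, keeping their original order.
    // Values are returned as written in the log (no percent-decoding).
    static Map<String, String> queryParameters(String requestUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = requestUrl.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : requestUrl.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String request = "/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title";
        System.out.println(fullUrl("http://archon.cs.odu.edu:8066", request));
        System.out.println(queryParameters(request));
    }
}
```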

14
Indexing Dynamic Pages
[Flow chart: Apache logs -> parse and generate URL -> results page (could be any file type)]

15
Implementation
  • The flow chart above describes how the Apache
    logs are parsed and the URLs are generated.
  • It shows how the results pages are fetched from
    the URLs and their text extracted.
  • Each results page is sent for analysis, then Lucene
    generates the index that will be used for future
    searches.
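
The overall flow can be outlined with the fetch and index stages left as pluggable functions, so the sketch runs without a network connection or the Lucene library. This is an assumption-laden outline, not the presentation's IndexByLog code: `LogIndexingFlow`, `generateUrls`, and `run` are hypothetical names, and de-duplicating URLs and skipping empty pages are choices made here for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Outline of the flow: request URLs from the log -> full URLs -> fetched text -> indexer.
class LogIndexingFlow {

    // Stage 1: turn logged request URLs into full URLs, dropping repeats
    // (the same search is often requested many times).
    static Set<String> generateUrls(String siteAddress, List<String> requestUrls) {
        Set<String> urls = new LinkedHashSet<>();
        for (String requestUrl : requestUrls) {
            urls.add(siteAddress + requestUrl);
        }
        return urls;
    }

    // Stages 2 and 3: fetch each results page and pass (url, text) to the indexer.
    // Returns the number of pages handed to the indexer.
    static int run(Set<String> urls,
                   Function<String, String> fetcher,
                   BiConsumer<String, String> indexer) {
        int indexed = 0;
        for (String url : urls) {
            String text = fetcher.apply(url);
            if (text == null || text.isEmpty()) {
                continue; // nothing extractable, skip this page
            }
            indexer.accept(url, text);
            indexed++;
        }
        return indexed;
    }

    public static void main(String[] args) {
        Set<String> urls = generateUrls("http://archon.cs.odu.edu:8066", Arrays.asList(
            "/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title",
            "/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title"));
        List<String> indexedUrls = new ArrayList<>();
        // Stub stages: a real fetcher would download and extract the page text,
        // and a real indexer would add a document to the search index.
        int count = run(urls, url -> "stub text for " + url, (url, text) -> indexedUrls.add(url));
        System.out.println(count + " page(s) indexed: " + indexedUrls);
    }
}
```

Keeping the stages separate also makes it easy to swap in a different extractor per file type, since the results page could be any file type.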

16
Demo
17
Results
  • Hardware environment
  • - Dedicated machine for indexing: no, but nominal
    usage at time of indexing
  • - CPU: Intel x86 P4 2.8 GHz
  • - RAM: 512 DDR
  • - Drive configuration: IDE 7200 rpm
  • Software environment
  • - Lucene version: 1.4
  • - Java version: 1..2
  • - OS version: Windows 2000
  • - Apache web server version: 1.3 to 2.0
  • - Location of index: local

18
Create Index
The IndexByLog.java file reads the access logs on the
local computer, generates the URLs, fetches the
results pages from those URLs and extracts their text,
then indexes them and stores the index in the
LuceneIndex folder.
19
Files extraction and Index Creation
20
Searching at the prompt
21
Searching on the web
22
Results on the web
23
Conclusion
  • It is easy to implement efficient and
    powerful search engines using Lucene.
  • Lucene can be used to index dynamic pages and
    database hidden pages.
  • Web server access logs can be used to generate
    URLs, and Java with the Lucene API can be used to
    fetch and index database hidden pages.
  • There are some security risks involved, as this
    approach can reveal which users ran which searches,
    along with other sensitive information.

24
Questions?