High Performance Index Build Algorithms for Intranet Search Engines - PowerPoint PPT Presentation

1
High Performance Index Build Algorithms for
Intranet Search Engines
  • Marcus Fontoura, Eugene Shekita, Jason Zien,
    Sridhar Rajagopalan, Andreas Neumann
  • fontoura_at_almaden.ibm.com

2
Agenda
  • Overview and problem description
  • Global analysis
  • Major data structures for index build
  • Index build algorithm

3
Overview and problem description
  • The goal of Trevi is to provide high-quality intranet
    search capability to corporate portals such as
    w3.ibm.com
  • Trevi is a scalable text search engine being
    developed by a joint IBM Research and Software
    Group team
  • This talk focuses on how to efficiently
    incorporate global analysis into the index build
    process

4
Global analysis (GA)
  • Duplicate detection
      • Computes a fingerprint for each page (64-bit
        shingle)
      • Masters are identified by using the (previous)
        static rank
  • Anchor text (D1: <a href=D2>Trevi</a>)
      • Appends anchor text tokens to the target documents
  • Static rank
      • Host in-degree, i.e., the number of hosts that point
        to a page ( PageRank on the IBM intranet)
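
The duplicate detection and static rank steps above can be sketched as follows. This is a minimal illustration, not the Trevi implementation: the slide only says "64 bit shingle", so the min-hash over 4-token shingles, the BLAKE2b hash, and the function names shingle_fingerprint, pick_masters, and host_in_degree are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def shingle_fingerprint(tokens, width=4):
    """Return a 64-bit fingerprint: the smallest hash over all width-token shingles.

    Assumption: the slide does not say how the shingle is chosen or hashed.
    """
    if len(tokens) < width:
        shingles = [tuple(tokens)]
    else:
        shingles = [tuple(tokens[i:i + width])
                    for i in range(len(tokens) - width + 1)]
    return min(
        int.from_bytes(
            hashlib.blake2b(" ".join(s).encode("utf-8"), digest_size=8).digest(),
            "big",
        )
        for s in shingles
    )


def pick_masters(pages, static_rank):
    """Group pages by fingerprint; the master is the page with the highest
    previous static rank. Returns {master_id: [duplicate ids]}."""
    groups = defaultdict(list)
    for doc_id, tokens in pages.items():
        groups[shingle_fingerprint(tokens)].append(doc_id)
    masters = {}
    for ids in groups.values():
        master = max(ids, key=lambda d: (static_rank.get(d, 0), d))
        masters[master] = sorted(d for d in ids if d != master)
    return masters


def host_in_degree(links):
    """Count the distinct source hosts pointing at each target URL."""
    hosts = defaultdict(set)
    for src, dst in links:
        hosts[dst].add(urlsplit(src).hostname)
    return {dst: len(h) for dst, h in hosts.items()}
```

Pages that share a fingerprint collapse to one master, and the duplicates are the ones a later index build would drop.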

5
Index build requires GA
  • Rebuild the inverted text index and update the
    global analysis (GA)
  • Duplicate documents are deleted from the index
  • Anchor text is indexed together with the
    document content
  • Static rank gives the index ordering, allowing
    for early termination during query evaluation
  • The time to rebuild the index will be dominated
    by the GA time as the analyses get more complex
  • Semantic search
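
To illustrate why a static-rank index ordering allows early termination, here is a small sketch of a conjunctive (AND) query. It assumes, hypothetically, that document IDs are assigned in static-rank order (0 is the best-ranked page), so each posting list is sorted best-first and the scan can stop after k hits. The function name top_k_and is made up for this example.

```python
def top_k_and(posting_lists, k):
    """Intersect posting lists sorted by doc ID (assumed to be static-rank
    order) and stop as soon as k matching documents are found."""
    if not posting_lists or k <= 0:
        return []
    pos = [0] * len(posting_lists)
    hits = []
    while len(hits) < k:
        # Stop when any list is exhausted: no further document can match.
        if any(p >= len(lst) for p, lst in zip(pos, posting_lists)):
            break
        heads = [lst[p] for p, lst in zip(pos, posting_lists)]
        target = max(heads)
        if all(h == target for h in heads):
            hits.append(target)
            pos = [p + 1 for p in pos]
        else:
            # Advance every cursor that is behind the largest head.
            pos = [p + (1 if h < target else 0) for p, h in zip(pos, heads)]
    return hits
```

Because the best-ranked documents come first, the first k matches are also the k best by static rank, and the tails of the lists are never read.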

6
Major data structures
  • Store
      • Storage for the tokenized version of each
        document
  • Index
      • Inverted text index over the Store
  • Delta store and delta index
      • Small versions of the Store and Index with new
        and modified documents
      • Allow for hourly updates of the Index content
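
A toy model of these structures may help. It assumes in-memory dictionaries and a query path where documents in the delta store shadow their older versions in the main Index. The slides do not describe how queries combine the two indexes, so that rule and the names build_index and search are illustrative only.

```python
def build_index(store):
    """Build an inverted index {term: [doc ids]} over a store {doc_id: tokens}."""
    index = {}
    for doc_id in sorted(store):
        for term in sorted(set(store[doc_id])):
            index.setdefault(term, []).append(doc_id)
    return index


def search(term, index, delta_index, delta_store):
    """Look a term up in both indexes.

    Assumption: a document present in the delta store replaces its older
    version, so its postings in the main index are ignored.
    """
    main_hits = [d for d in index.get(term, []) if d not in delta_store]
    return sorted(set(main_hits) | set(delta_index.get(term, [])))
```

For example, with a store {1: ["a", "b"], 2: ["b"]} and a delta store {2: ["c"], 3: ["b"]}, a search for "b" skips the stale version of document 2 and picks up the new document 3.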

7
Index build algorithm (1/3)
  • Index build merges the current version of the
    Store (Store_i) with the current version of
    the DeltaStore and generates the new version of
    the Store and the new Index, Store_{i+1} and Index_{i+1}

[Figure: Index Build takes Store_i and DeltaStore as inputs and produces Store_{i+1} and Index_{i+1}]
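
The merge step on this slide can be sketched as a single pass over two streams sorted by document ID. This is a simplified model under stated assumptions: both stores are lists of (doc_id, tokens) pairs, a delta entry replaces the stored entry with the same ID, and deletions are not modeled. The names merge_stores and index_build are hypothetical.

```python
def merge_stores(store_i, delta_store):
    """Merge two (doc_id, tokens) lists sorted by doc_id; on equal IDs the
    delta version wins."""
    i = j = 0
    while i < len(store_i) and j < len(delta_store):
        old_id, new_id = store_i[i][0], delta_store[j][0]
        if old_id < new_id:
            yield store_i[i]
            i += 1
        elif old_id > new_id:
            yield delta_store[j]
            j += 1
        else:
            yield delta_store[j]  # modified document replaces the old one
            i += 1
            j += 1
    yield from store_i[i:]
    yield from delta_store[j:]


def index_build(store_i, delta_store):
    """Produce the next store and the next inverted index in one pass."""
    store_next, index_next = [], {}
    for doc_id, tokens in merge_stores(store_i, delta_store):
        store_next.append((doc_id, tokens))
        for term in sorted(set(tokens)):
            index_next.setdefault(term, []).append(doc_id)
    return store_next, index_next
```

Reading both inputs sequentially keeps the merge linear in the combined size of the two stores.
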
8
Index build algorithm (2/3)
  • Index build using global analysis

[Figure: index build using global analysis; surviving label: DeltaStore]
9
Index build algorithm (3/3)
  • Index build using lagging global analysis

Global Analysis and DeltaIndex build can proceed in parallel

[Figure: index build using lagging global analysis. Labels: Store_i, Store_{i+1}, Index_{i+1}, DeltaStore, GA inputs, GA_i, GA_{i+1}, DeltaStore_j, DeltaStore_{j+1}, newly crawled documents, DeltaIndex_{j+1}]
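
One way to read the lagging scheme is that the new index is ordered with the previous analysis results (GA_i) while the next analysis (GA_{i+1}) runs concurrently. The sketch below shows only that control flow. The stand-in analysis (distinct-token count as "rank"), the default rank of 0 for documents GA_i has not seen, and the names global_analysis, build_index, and build_cycle are placeholders, not taken from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def global_analysis(store):
    """Placeholder analysis: rank = number of distinct tokens (an assumption)."""
    return {doc_id: len(set(tokens)) for doc_id, tokens in store.items()}


def build_index(store, ga_prev):
    """Build posting lists ordered by the previous (lagging) static rank.
    Documents unknown to ga_prev get rank 0."""
    order = sorted(store, key=lambda d: (-ga_prev.get(d, 0), d))
    index = {}
    for doc_id in order:
        for term in sorted(set(store[doc_id])):
            index.setdefault(term, []).append(doc_id)
    return index


def build_cycle(store_next, ga_prev):
    """Run the index build (using GA_i) and the next analysis in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ga_future = pool.submit(global_analysis, store_next)
        index_future = pool.submit(build_index, store_next, ga_prev)
        return index_future.result(), ga_future.result()
```

The index build never waits for the new analysis; its results only take effect in the following cycle.
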
10
Indexing algorithm
  • Radix sort
  • Linear time sorting
  • Flexibility in defining the sort criteria
  • Bigger sort buffers increase performance
  • Pipelining load and sort phases
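
As a rough illustration of the first three points, here is a least-significant-digit radix sort over postings with a pluggable key function. It is a generic textbook sketch, not the authors' sorter: the byte-wise buckets, the 8-byte key width, and the packing of (term_id, doc_id) into one integer (which assumes both IDs fit in 32 bits) are choices made for the example.

```python
def radix_sort(records, key, key_bytes=8):
    """Stable LSD radix sort by a non-negative integer key of at most
    key_bytes bytes. Runs in O(key_bytes * n) time."""
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]
        for rec in records:
            buckets[(key(rec) >> shift) & 0xFF].append(rec)
        records = [rec for bucket in buckets for rec in bucket]
    return records


def by_term_then_doc(posting):
    """Sort key: term ID in the high 32 bits, doc ID in the low 32 bits."""
    term_id, doc_id = posting
    return (term_id << 32) | doc_id


def by_doc_then_term(posting):
    """Alternative sort criterion: doc ID first, then term ID."""
    term_id, doc_id = posting
    return (doc_id << 32) | term_id
```

Changing the sort criterion only means swapping the key function, for example radix_sort(postings, by_doc_then_term) instead of radix_sort(postings, by_term_then_doc).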

11
Experimental results
  • Lagging global analysis does not degrade quality
  • More than 25% performance improvement
  • Even more advantageous when the analyses are more
    complex
  • Indexing algorithm scales linearly with the
    number of documents
  • Superior performance when compared to several
    state-of-the-art indexing algorithms

12
Hardware and software architectures