High Performance Index Build Algorithms for Intranet Search Engines

Slides: 13

Provided by: MAF9

1
High Performance Index Build Algorithms for
Intranet Search Engines

Marcus Fontoura, Eugene Shekita, Jason Zien,
Sridhar Rajagopalan, Andreas Neumann
fontoura_at_almaden.ibm.com

2
Agenda

Overview and problem description
Global analysis
Major data structures for index build
Index build algorithm

3
Overview and problem description

Trevi's goal is to provide high-quality intranet search capability to corporate portals such as w3.ibm.com
Trevi is a scalable text search engine being developed by a joint IBM Research and Software Group team
This talk focuses on how to efficiently incorporate global analysis into the index build process

4
Global analysis (GA)

Duplicate detection
Computes a fingerprint for each page (64-bit shingle)
Masters are identified using the (previous) static rank
Anchor text (D1: <a href="D2">Trevi</a>)
Appends anchor text tokens to documents
Static rank
Host in-degree, i.e., the number of hosts that point to a page (PageRank on the IBM intranet)
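
The duplicate-detection step above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how the 64-bit shingle fingerprint is derived, so the sketch takes the minimum 64-bit hash over all k-token shingles, and the names shingle_fingerprint, find_masters, and the parameter k are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingle_fingerprint(tokens, k=4):
    """Return a 64-bit fingerprint for a token list.

    Assumption: the fingerprint is the smallest 64-bit hash among all
    k-token shingles. The slides only say "64 bit shingle".
    """
    if len(tokens) < k:
        shingles = [tuple(tokens)]
    else:
        shingles = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = []
    for shingle in shingles:
        digest = hashlib.blake2b(" ".join(shingle).encode("utf-8"),
                                 digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return min(hashes)


def find_masters(docs, static_rank):
    """Map every doc id to the master of its duplicate group.

    docs: dict of doc id -> token list.
    static_rank: dict of doc id -> rank from the previous build.
    The master is the group member with the highest static rank
    (ties broken by doc id, an arbitrary choice for this sketch).
    """
    groups = defaultdict(list)
    for doc_id, tokens in docs.items():
        groups[shingle_fingerprint(tokens)].append(doc_id)
    masters = {}
    for ids in groups.values():
        master = max(ids, key=lambda d: (static_rank.get(d, 0), d))
        for d in ids:
            masters[d] = master
    return masters
```

Documents whose master is another document would then be the ones dropped from the index as duplicates.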

5
Index build requires GA

Rebuild the inverted text index and update the global analysis (GA)
Duplicate documents are deleted from the index
Anchor text is indexed together with the document's content
Static rank gives the index ordering, allowing for early termination during query evaluation
The time to rebuild the index will be dominated by the GA time as the analyses get more complex
Semantic search
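
To illustrate why a static-rank index ordering permits early termination, here is a minimal sketch. It assumes doc ids are assigned in static-rank order (0 is the best-ranked page), so every postings list is already sorted best-first. The function name top_k_conjunctive is hypothetical and this is not claimed to be Trevi's query evaluator.

```python
def top_k_conjunctive(postings_lists, k):
    """Intersect rank-ordered postings lists, stopping after k matches.

    Each list holds doc ids in ascending order; because ids follow static
    rank, the first k matches are also the k best-ranked matches.
    """
    if not postings_lists or k <= 0:
        return []
    positions = [0] * len(postings_lists)
    results = []
    while len(results) < k:
        heads = []
        for plist, pos in zip(postings_lists, positions):
            if pos >= len(plist):
                return results  # one list is exhausted: no more matches
            heads.append(plist[pos])
        candidate = max(heads)
        if all(h == candidate for h in heads):
            results.append(candidate)
            positions = [p + 1 for p in positions]
        else:
            # Skip every list forward to the first id >= candidate.
            for i, plist in enumerate(postings_lists):
                while positions[i] < len(plist) and plist[positions[i]] < candidate:
                    positions[i] += 1
    return results
```

With k much smaller than the lists, the loop stops long before the postings are fully scanned, which is the benefit the slide refers to.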

6
Major data structures

Store
Storage for the tokenized version of each document
Index
Inverted text index over the Store
Delta store and delta index
Small versions of the Store and Index with new and modified documents
Allow for hourly updates of the Index content
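
A toy model of these four structures may help. It assumes the Store maps doc ids to token lists, the Index maps terms to sorted doc-id lists, and a query consults the delta structures so that new or modified documents shadow their older versions. The names build_index and search, and the shadowing rule itself, are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(store):
    """Build an inverted index (term -> sorted doc ids) over a store."""
    index = defaultdict(list)
    for doc_id in sorted(store):
        for term in sorted(set(store[doc_id])):
            index[term].append(doc_id)
    return dict(index)


def search(term, index, delta_index, delta_store):
    """Look up a term in the main index and the delta index.

    A document present in the delta store supersedes its main-store
    version, so main-index hits for such documents are ignored.
    """
    hits = [d for d in index.get(term, []) if d not in delta_store]
    hits.extend(delta_index.get(term, []))
    return sorted(hits)
```

For example, if document 2 is re-crawled and no longer contains a term, the lookup stops returning it as soon as the small delta index is rebuilt, without touching the main Index.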

7
Index build algorithm (1/3)

Index build merges the current version of the Store (Store_i) with the current version of the DeltaStore and generates the new version of the Store and the new Index, Store_{i+1} and Index_{i+1}

[Diagram: Store_i and DeltaStore are the inputs to Index Build; Store_{i+1} and Index_{i+1} are its outputs]
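
The merge described on this slide can be sketched as a single pass over two inputs sorted by doc id. The representation (lists of (doc_id, tokens) pairs) and the names merge_stores and index_build are assumptions made for the example; the slides do not describe the on-disk layout.

```python
def merge_stores(store_i, delta_store):
    """Merge two lists of (doc_id, tokens) sorted by doc_id.

    When a doc id appears in both inputs, the DeltaStore version wins.
    """
    i = j = 0
    while i < len(store_i) and j < len(delta_store):
        old, new = store_i[i], delta_store[j]
        if old[0] < new[0]:
            yield old
            i += 1
        elif old[0] > new[0]:
            yield new
            j += 1
        else:
            yield new
            i += 1
            j += 1
    yield from store_i[i:]
    yield from delta_store[j:]


def index_build(store_i, delta_store):
    """Produce the next Store and the next Index in one merge pass."""
    store_next = []
    index_next = {}
    for doc_id, tokens in merge_stores(store_i, delta_store):
        store_next.append((doc_id, tokens))
        for term in set(tokens):
            # Docs arrive in ascending id order, so postings stay sorted.
            index_next.setdefault(term, []).append(doc_id)
    return store_next, index_next
```

Because both inputs are consumed sequentially, the new Store and the new Index come out of the same scan.
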
8
Index build algorithm (2/3)

Index build using global analysis

9
Index build algorithm (3/3)

Index build using lagging global analysis

Global Analysis and DeltaIndex build can proceed in parallel

[Diagram labels: Store_i, Store_{i+1}, Index_{i+1}, DeltaStore, GA inputs, GA_i, GA_{i+1}, DeltaStore_j, DeltaStore_{j+1}, DeltaIndex_{j+1}, newly crawled documents]
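
One reading of "lagging" global analysis is that the index for the new generation is built with the GA results of the previous cycle (GA_i) while GA_{i+1} is computed concurrently for use in the next cycle. The sketch below illustrates that reading only. build_cycle, compute_ga, and build_index are hypothetical names, and the two callables stand in for whatever the real system runs.

```python
from concurrent.futures import ThreadPoolExecutor


def build_cycle(ga_inputs, documents, ga_prev, compute_ga, build_index):
    """Run one build cycle with lagging global analysis.

    ga_prev is the GA output of the previous cycle. The index build uses it
    immediately, so it never waits for the fresh analysis running in the
    other worker.
    Returns (new index, GA output to be used by the next cycle).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        ga_future = pool.submit(compute_ga, ga_inputs)
        index_future = pool.submit(build_index, documents, ga_prev)
        return index_future.result(), ga_future.result()
```

The trade-off is that ranking and duplicate information lag by one cycle in exchange for taking the analysis off the critical path of the build.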
10
Indexing algorithm

Radix sort
Linear time sorting
Flexibility in defining the sort criteria
Bigger sort buffers increase performance
Pipelining load and sort phases
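
As a reminder of how radix sort achieves linear time with a flexible sort criterion, here is a minimal least-significant-digit sketch. The posting layout (term_id, doc_id), the key packing, and the name radix_sort are assumptions for the example, not the presenters' implementation.

```python
def radix_sort(records, key, key_bytes=8):
    """Stable LSD radix sort by a non-negative integer key.

    Makes one pass per key byte, so the cost is O(key_bytes * n).
    key extracts the integer sort key from a record, which is where the
    sort criterion can be changed without touching the sort itself.
    """
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]
        for rec in records:
            buckets[(key(rec) >> shift) & 0xFF].append(rec)
        records = [rec for bucket in buckets for rec in bucket]
    return records


def posting_key(posting):
    """Order postings by term id, then doc id (both assumed below 2**32)."""
    term_id, doc_id = posting
    return (term_id << 32) | doc_id
```

Calling radix_sort(postings, posting_key) groups the postings by term with doc ids ascending inside each term; swapping in a different key function changes the ordering.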

11
Experimental results

Lagging global analysis does not degrade quality
More than 25% performance improvement
Even more advantageous when the analyses are more complex
Indexing algorithm scales linearly with the number of documents
Superior performance when compared to several state-of-the-art indexing algorithms

12
Hardware and software architectures
