1
IR Software for Large-Scale Research
  • Gregory B. Newby
    School of Information and Library Science,
    University of North Carolina at Chapel Hill
    CB 3360 Manning Hall, Chapel Hill, NC 27599-3360
    gbnewby@ils.unc.edu

2
Abstract
3
Who
  • Greg Newby has been working on experimental IR
    systems for over 10 years. He's participated in
    TREC since 1986.
  • His interest has been in extending information
    space ideas to IR systems (see his recent JASIST
    article, "Information Space and Cognitive
    Space").
  • IRTools is a more generalized version of software
    he's developed previously
  • With an NSF/ITR grant, it's been possible to hire
    student programmers to help write code and test
    system performance

4
What
  • IRTools is a software toolkit. It's not a
    ready-made IR system, but it can be easily
    configured to perform consistently with major IR
    models:
  • Boolean retrieval with various term and document
    weighting
  • Vector Space Model (VSM)
  • Latent Semantic Indexing (LSI) and Newby's
    Information Space
  • Probabilistic IR
  • The software is designed for modularity,
    scalability and high performance, but with an
    emphasis on IR experimentation, not real-world
    production use

5
Where
  • UNC Chapel Hill has a tradition of information
    retrieval research, systems development and
    evaluation
  • School facilities include new SunFire servers.
    The University provides additional computational
    hosts, and a robotic tape-to-disk library with
    effectively unlimited storage
  • Project facilities include two research systems
    with 2 GB and 4 GB of RAM and 1,000 GB of disk
    space

6
When
  • The NSF/ITR project runs for 3 years, ending in
    August 2003
  • Software development is ongoing, and partners and
    contributors are sought to join in a virtual
    development team
  • The approximate timeline for IRTools is:
  • 2001: Fundamental software functional for Boolean
    and VSM
  • 2002: Functionality for LSI and Information Space
  • 2003: More emphasis on XML and other
    semi-structured data types

7
Why
  • To have configurable, flexible software for IR
    experimentation that is freely available, high
    performance and scalable.
  • Excellent IR software packages such as SMART,
    Okapi and INQUERY are each missing one or more of
    the desired qualities above
  • Excellent Web retrieval software such as ht://Dig
    is not suitable for experimentation, as it
    implements only a subset of desirable retrieval
    models
  • The search engines don't share their source code,
    algorithms or methods

8
How
  • Write code. We use mostly C++, with some
    reliance on the Standard Template Library (STL).
    We also use C, Perl and other languages as needed
  • Test and evaluate. The code includes a full
    regression test (make test)
  • Experiment. We've been working with the 10GB Web
    dataset from TREC, with several years of
    relevance judgments
  • Tune. Data structures, file structures and
    algorithms need experimental validation. Often,
    they must be tuned for particular retrieval
    methods

9
Getting the Code
  • Source code is periodically assembled into
    releases. We have not yet made a 1.0 release
  • Visit the project homepage for documentation and
    information about current work
  • For the source code, visit our development site
    at SourceForge: http://sf.net/projects/irtools
  • You can download the most current code there
  • IRTools has been tested on:
  • Solaris
  • Linux (i686 and Alpha)

10
Full Disclosure
  • Does this software work? Not fully, but many
    parts of it function quite well. It's a work in
    progress.
  • So, your TREC 2001 results must have been pretty
    good, eh? No, there were some bugs that resulted
    in poor performance this year. We were trying to
    test our implementation of the VSM with pivoted
    term weights
  • Will this be better than Google? Doubtful, but
    that's not the point. This is for IR
    researchers, not a commercial product
  • Are you trying to get people to use IRTools for
    their own research? Not necessarily, but we hope
    it will be helpful for other researchers, and
    possibly for use in the classroom

11
Major Components
12
The Spider
  • Needed for live Web use. For existing datasets
    (such as TREC data), we dont need the spider
  • Were borrowing methods from wget and other
    open-source spidering tools
  • Challenges include spider traps and poorly formed
    HTML
  • The spider is solely concerned with Web
    interaction to get documents and handle errors.
    The indexer worries about seeking more documents
    (HREFs), parsing the documents, etc. (see the
    sketch below)
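
As a rough illustration of the link-seeking work the indexer takes over from the spider, here is a minimal sketch of pulling HREF targets out of fetched HTML. The function name and the simple string scan are assumptions for illustration; the actual IRTools parsing code is more elaborate and more tolerant of malformed HTML.

    #include <cctype>
    #include <string>
    #include <vector>

    // Hypothetical helper: collect href targets from a page of HTML.
    // Real-world HTML is often malformed, so production code needs to
    // be far more forgiving than this sketch.
    std::vector<std::string> extract_hrefs(const std::string &html)
    {
      std::vector<std::string> links;
      std::string::size_type pos = 0;
      while ((pos = html.find("href", pos)) != std::string::npos) {
        pos += 4;
        // Skip whitespace, '=' and an optional quote character
        while (pos < html.size() &&
               (std::isspace((unsigned char)html[pos]) || html[pos] == '=' ||
                html[pos] == '"' || html[pos] == '\''))
          ++pos;
        std::string::size_type end = pos;
        while (end < html.size() && html[end] != '"' && html[end] != '\'' &&
               html[end] != '>' && !std::isspace((unsigned char)html[end]))
          ++end;
        if (end > pos)
          links.push_back(html.substr(pos, end - pos));
        pos = end;
      }
      return links;
    }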

13
The Indexer
  • Quite complicated, with dozens of classes and
    thousands of lines of code
  • Some components are generic, but many are
    specific to a particular retrieval experiment.
  • Different indexing methods are applied based on:
  • The type of data being indexed (Web, abstracts,
    full text)
  • What retrieval methods will be used (VSM, LSI,
    Boolean)
  • What term weighting is needed
  • The size of the data (e.g., to determine whether
    multiple files will be used for the inverted
    index, or only one); a configuration sketch
    follows this list
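
A hedged sketch of how these indexing decisions might be captured in code. The names IndexConfig, DataType, RetrievalModel and TermWeighting are illustrative assumptions, not the actual IRTools classes.

    // Illustrative only: one way to record the indexing choices above.
    enum DataType       { WEB_PAGES, ABSTRACTS, FULL_TEXT };
    enum RetrievalModel { BOOLEAN, VSM, LSI };
    enum TermWeighting  { RAW_TF, TF_IDF, PIVOTED };

    struct IndexConfig {
      DataType       data_type;             // what is being indexed
      RetrievalModel model;                 // how it will be searched
      TermWeighting  weighting;             // which weights to precompute
      bool           split_inverted_index;  // multiple files for large data?
    };

    // Example: a large Web collection indexed for VSM retrieval
    IndexConfig wt10g_config = { WEB_PAGES, VSM, PIVOTED, true };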

14
The Retrieval Engine
  • Highly configurable for different experiments
  • One collection (i.e., a set of indexed data) may
    be used with different retrieval methods. This is
    the core value of the software: enabling
    experiments in which many factors are held
    constant
  • Small proxy servers enable the retrieval engine
    to interact with external interfaces (e.g., Java
    programs)
  • Other small servers can retrieve from Web search
    engines, such as Google, then reformat hits
    internally

15
A Typical TREC-Style Experiment: Indexer Configuration
  • Estimate high-water marks for memory and disk
    usage. Determine whether you can index the
    entire dataset with one run, or if you need
    multiple runs
  • Bring together different indexing classes and
    methods into one program. For example:
  • File opener (to recursively retrieve files and
    directories)
  • Tokenizer (identify word boundaries)
  • Stemmer and stoplist handlers
  • Choice of HTML or XML tags or other elements to
    identify, and how to identify them
  • Choice of what data to store to disk (e.g.,
    separate inverted indexes for particular tags and
    a sequential index); a tokenizing pipeline is
    sketched below
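
As an illustration of how the tokenizer, stoplist and stemmer fit together, here is a minimal sketch of a tokenize / stop / stem pipeline. The function name and the deliberately trivial suffix-stripping stemmer are assumptions for illustration; IRTools uses its own tokenizer and stemmer classes.

    #include <cctype>
    #include <set>
    #include <string>
    #include <vector>

    // Illustrative pipeline: split text into lowercase terms, drop
    // stopwords, and apply a (deliberately trivial) stemmer.
    std::vector<std::string> tokenize(const std::string &text,
                                      const std::set<std::string> &stoplist)
    {
      std::vector<std::string> terms;
      std::string word;
      for (std::string::size_type i = 0; i <= text.size(); ++i) {
        if (i < text.size() && std::isalnum((unsigned char)text[i])) {
          word += (char)std::tolower((unsigned char)text[i]);
        } else if (!word.empty()) {
          if (stoplist.find(word) == stoplist.end()) {
            // Toy stemmer: strip a trailing 's' (a real system would
            // use something like the Porter stemmer)
            if (word.size() > 3 && word[word.size() - 1] == 's')
              word.erase(word.size() - 1);
            terms.push_back(word);
          }
          word.clear();
        }
      }
      return terms;
    }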

16
A Typical TREC-Style Experiment: Retrieval Engine Configuration
  • For batch-oriented retrieval, queries may be
    pre-stemmed and stopped (or you could use term
    IDs instead of the terms)
  • For interactive retrieval or testing, the
    tokenizer, stemmer and stopword processor should
    match the indexer's
  • Add components as needed, such as:
  • Candidate document selection (e.g., Boolean AND)
  • Query expansion
  • Weighting of terms and documents (tf-idf,
    pivoted, user-specified)
  • Similarity measure (cosine, geometric distance;
    see the sketch after this list)
  • Ranking
  • Presentation of results
  • Relevance feedback, query adjustment, etc.
  • Adding CGI functionality, command line options
    and other interaction methods is easily done
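
To make the weighting, similarity and ranking steps concrete, here is a minimal sketch of cosine-similarity ranking over tf-idf weighted vectors. The map-based sparse vectors and the function names are assumptions for illustration, not IRTools' internal representation.

    #include <algorithm>
    #include <cmath>
    #include <map>
    #include <utility>
    #include <vector>

    typedef std::map<int, double> SparseVec;   // term ID -> tf-idf weight

    // Cosine similarity between two sparse term-weight vectors.
    double cosine(const SparseVec &q, const SparseVec &d)
    {
      double dot = 0.0, qnorm = 0.0, dnorm = 0.0;
      for (SparseVec::const_iterator it = q.begin(); it != q.end(); ++it) {
        qnorm += it->second * it->second;
        SparseVec::const_iterator jt = d.find(it->first);
        if (jt != d.end())
          dot += it->second * jt->second;
      }
      for (SparseVec::const_iterator it = d.begin(); it != d.end(); ++it)
        dnorm += it->second * it->second;
      if (qnorm == 0.0 || dnorm == 0.0)
        return 0.0;
      return dot / (std::sqrt(qnorm) * std::sqrt(dnorm));
    }

    // Rank candidate documents (doc ID -> vector) against a query vector.
    std::vector<std::pair<double, int> >
    rank(const SparseVec &query, const std::map<int, SparseVec> &docs)
    {
      std::vector<std::pair<double, int> > scored;
      for (std::map<int, SparseVec>::const_iterator it = docs.begin();
           it != docs.end(); ++it)
        scored.push_back(std::make_pair(cosine(query, it->second), it->first));
      std::sort(scored.rbegin(), scored.rend());   // highest score first
      return scored;
    }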

17
Some Files used by IRTools
  • Inverted index: binary files. One file contains
    term IDs, term counts, weights and offset
    locations into the document list. The second
    file contains the document list for each term.
    A third file contains the list of term locations
    (for the NEAR operator)
  • Sequential index: for each document, a list of
    the term IDs, term counts and locations (3
    separate binary files)
  • Term map: a database (Berkeley DB) to look up the
    term ID for a term
  • Term ID data: for each term, its frequency in the
    collection
  • Sparse matrix files: for co-occurrence data,
    term-by-document lists, etc. Binary files using a
    modified Harwell-Boeing format
  • For any experiment, only some of these files (or
    others) are needed; an illustrative record layout
    is sketched below
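
For illustration only, here is one plausible way the inverted-index records described above could be laid out as C++ structs. The field names and types are assumptions; the actual IRTools file formats may differ.

    // Illustrative on-disk record layouts for the inverted index files
    // above. Field names and types are assumptions, not the actual
    // IRTools format.

    typedef int   irt_int;     // assumed 32-bit integer type
    typedef float irt_float;   // assumed 32-bit float type

    // File 1: one record per term
    struct TermRecord {
      irt_int   term_id;       // term ID (see the Berkeley DB term map)
      irt_int   doc_count;     // number of documents containing the term
      irt_float weight;        // precomputed collection weight (e.g., idf)
      long      postings_off;  // byte offset into the document-list file
    };

    // File 2: document list (postings), doc_count entries per term
    struct PostingRecord {
      irt_int   doc_id;        // document ID
      irt_int   term_count;    // occurrences of the term in this document
    };

    // File 3: term locations within each document (for the NEAR
    // operator), stored as runs of irt_int offsets.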

18
A Little Source Code: weight.h

    class IRT_Weight {
     public:
      // Constructor
      IRT_Weight (IRT_Index &inarg)
        : in (inarg)
      { }  // End of constructor for class IRT_Weight

      // Destructor
      ~IRT_Weight ()
      { }  // End of destructor for class IRT_Weight

      // Get a tf weight
      irt_float weight_get_tf (vector<irt_int> &, irt_float);

      // Get an idf weight
      irt_float weight_get_idf (irt_float, irt_float);
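
The header only declares the weighting functions. As a rough sketch of what such methods typically compute, here are textbook-style tf and idf formulas; the function names and formulas are assumptions for illustration, not taken from the IRTools sources.

    #include <cmath>

    // Hedged, textbook-style weighting formulas, for illustration only;
    // the actual IRT_Weight member functions may compute something
    // different (e.g., pivoted weights).

    // Log-scaled term frequency weight
    double example_tf_weight (int term_count_in_doc)
    {
      return term_count_in_doc > 0
                 ? 1.0 + std::log ((double) term_count_in_doc)
                 : 0.0;
    }

    // Inverse document frequency weight
    double example_idf_weight (double num_docs, double docs_with_term)
    {
      if (docs_with_term <= 0.0)
        return 0.0;
      return std::log (num_docs / docs_with_term);
    }

    // A term's tf-idf weight is then
    // example_tf_weight(tf) * example_idf_weight(N, df).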

19
A Little Source Code: bool_or.cc
  • This class member function merges lists of
    document IDs (a minimal merge sketch follows
    below)
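
The bool_or.cc source is not reproduced on the slide. As an illustration of what such an OR merge does, here is a minimal sketch that unions two sorted document ID lists; the function name and the use of std::set_union are assumptions, not the actual IRTools implementation.

    #include <algorithm>
    #include <iterator>
    #include <vector>

    // Boolean OR: union of two sorted, duplicate-free document ID lists.
    std::vector<int> bool_or_merge (const std::vector<int> &a,
                                    const std::vector<int> &b)
    {
      std::vector<int> result;
      result.reserve (a.size () + b.size ());
      std::set_union (a.begin (), a.end (), b.begin (), b.end (),
                      std::back_inserter (result));
      return result;
    }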

20
A Program to Index WT10G (the TREC 10GB Web
dataset)
  • This program runs in about 3 hours on our dual
    Alpha station
  • It creates separate inverted indexes for terms in
    the <title> and <h1> tags (a routing sketch
    follows below)
  • The slowest part is the tokenizer, which
    identifies terms and tags of interest. The token
    class is being redesigned for higher performance
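
As a hedged illustration of keeping separate inverted indexes for particular tags, the following sketch shows one way terms could be routed to a per-tag index in addition to the main index. The class and method names are invented for illustration and are not the IRTools classes.

    #include <map>
    #include <string>
    #include <vector>

    // Illustrative only: route terms to a per-tag posting store plus a
    // main index.
    class TagAwareIndexer {
     public:
      // Record a term occurrence; 'tag' is "" for body text, or a tag
      // name such as "title" or "h1".
      void add (const std::string &term, int doc_id, const std::string &tag)
      {
        main_index_[term].push_back (doc_id);
        if (!tag.empty ())
          tag_index_[tag][term].push_back (doc_id);
      }

     private:
      std::map<std::string, std::vector<int> > main_index_;
      std::map<std::string,
               std::map<std::string, std::vector<int> > > tag_index_;
    };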

21
Current Projects Include
  • TREC Interactive Track. We'll be using IRTools
    to post-process Google results and display them
    via a proxy to the end user. A sitemap-style
    interface will be compared to a traditional list.
  • 3D navigation. Several interfaces to navigate
    through information space. These can use a local
    dataset, or visualize relative locations of
    documents retrieved elsewhere.

22
Our Thanks To
  • gcc and g++: these compilers greatly facilitate
    cross-platform development
  • Berkeley DB: high-performance database
    functionality for single-key data
  • wget and ht://Dig: open-source tools whose
    functionality we have learned from