1
IR Software for Large-Scale Research
  • Gregory B. Newby
    School of Information and Library Science,
    University of North Carolina at Chapel Hill
    CB 3360 Manning Hall, Chapel Hill, NC 27599-3360
    gbnewby@ils.unc.edu

2
Abstract
3
Who
  • Greg Newby has been working on experimental IR
    systems for over 10 years. He's participated in
    TREC since 1986.
  • His interest has been in extending information
    space ideas to IR systems (see his recent JASIST
    article, "Information Space and Cognitive
    Space").
  • IRTools is a more generalized version of software
    he's developed previously
  • With an NSF/ITR grant, it's been possible to hire
    student programmers to help write code and test
    system performance

4
What
  • IRTools is a software toolkit. It's not a
    ready-made IR system, but it can be easily
    configured to perform consistently with major IR
    models:
  • Boolean retrieval with various term and document
    weighting
  • Vector Space Model (VSM)
  • Latent Semantic Indexing (LSI) and Newby's
    Information Space
  • Probabilistic IR
  • The software is designed for modularity,
    scalability and high performance, but with an
    emphasis on IR experimentation, not real-world
    production use

5
Where
  • UNC Chapel Hill has a tradition of information
    retrieval research, systems development and
    evaluation
  • School facilities include new SunFire servers.
    The University provides additional computational
    hosts, and a robotic tape-to-disk library with
    effectively unlimited storage
  • Project facilities include two research systems
    with 2 GB and 4 GB of RAM and 1,000 GB of disk
    space

6
When
  • The NSF/ITR project runs for 3 years, ending in
    August 2003
  • Software development is ongoing, and partners and
    contributors are sought to join in a virtual
    development team
  • The approximate timeline for IRTools is:
  • 2001: Fundamental software functional for Boolean
    and VSM
  • 2002: Functionality for LSI and Information Space
  • 2003: More emphasis on XML and other
    semi-structured data types

7
Why
  • To have configurable, flexible software for IR
    experimentation that is freely available, high
    performance and scalable.
  • Excellent IR software packages such as SMART,
    Okapi and INQUERY are each missing one or more of
    the desired qualities above
  • Excellent Web retrieval software such as ht://Dig
    is not suitable for experimentation, as it
    implements only a subset of desirable retrieval
    models
  • The search engines don't share their source code,
    algorithms or methods

8
How
  • Write code. We use mostly C++, with some
    reliance on the Standard Template Library (STL).
    We also use C, Perl and other languages as needed
  • Test and evaluate. The code includes a full
    regression test (make test)
  • Experiment. We've been working with the 10GB Web
    dataset from TREC, with several years of
    relevance judgments
  • Tune. Data structures, file structures and
    algorithms need experimental validation. Often,
    they must be tuned for particular retrieval
    methods

9
Getting the Code
  • Source code is periodically assembled into
    releases. We have not yet made a 1.0 release
  • Visit the project homepage for documentation and
    information about current work
  • For the source code, visit our development site
    at SourceForge: http://sf.net/projects/irtools
  • You can download the most current code there
  • IRTools has been tested on:
  • Solaris
  • Linux (i686 and Alpha)

10
Full Disclosure
  • Does this software work? Not fully, but many
    parts of it function quite well. It's a work in
    progress.
  • So, your TREC 2001 results must have been pretty
    good, eh? No, there were some bugs that resulted
    in poor performance this year. We were trying to
    test our implementation of the VSM with pivoted
    term weights
  • Will this be better than Google? Doubtful, but
    that's not the point. This is for IR
    researchers, not a commercial product
  • Are you trying to get people to use IRTools for
    their own research? Not necessarily, but we hope
    it will be helpful for other researchers, and
    possibly for use in the classroom

11
Major Components
12
The Spider
  • Needed for live Web use. For existing datasets
    (such as TREC data), we dont need the spider
  • Were borrowing methods from wget and other
    open-source spidering tools
  • Challenges include spider traps and poorly formed
    HTML
  • The spider is solely concerned with Web
    interaction to get documents and handle errors.
    The indexer worries about seeking more documents
    (HREFs), parsing the documents, etc. (see the
    sketch below)
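
As a rough illustration of the link-seeking work the indexer takes over from the spider, here is a minimal sketch of pulling HREF targets out of fetched HTML. The function name and the simple string scan are assumptions for illustration; the actual IRTools parsing code is more elaborate and more tolerant of malformed HTML.

    #include <cctype>
    #include <string>
    #include <vector>

    // Hypothetical helper: collect href targets from a page of HTML.
    // Real-world HTML is often malformed, so production code needs to
    // be far more forgiving than this sketch.
    std::vector<std::string> extract_hrefs(const std::string &html)
    {
      std::vector<std::string> links;
      std::string::size_type pos = 0;
      while ((pos = html.find("href", pos)) != std::string::npos) {
        pos += 4;
        // Skip whitespace, '=' and an optional quote character
        while (pos < html.size() &&
               (std::isspace((unsigned char)html[pos]) || html[pos] == '=' ||
                html[pos] == '"' || html[pos] == '\''))
          ++pos;
        std::string::size_type end = pos;
        while (end < html.size() && html[end] != '"' && html[end] != '\'' &&
               html[end] != '>' && !std::isspace((unsigned char)html[end]))
          ++end;
        if (end > pos)
          links.push_back(html.substr(pos, end - pos));
        pos = end;
      }
      return links;
    }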

13
The Indexer
  • Quite complicated, with dozens of classes and
    thousands of lines of code
  • Some components are generic, but many are
    specific to a particular retrieval experiment.
  • Different indexing methods are applied based on:
  • The type of data being indexed (Web, abstracts,
    full text)
  • What retrieval methods will be used (VSM, LSI,
    Boolean)
  • What term weighting is needed
  • The size of the data (e.g., to determine whether
    multiple files will be used for the inverted
    index, or only one); a configuration sketch
    follows this list
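
A hedged sketch of how these indexing decisions might be captured in code. The names IndexConfig, DataType, RetrievalModel and TermWeighting are illustrative assumptions, not the actual IRTools classes.

    // Illustrative only: one way to record the indexing choices above.
    enum DataType       { WEB_PAGES, ABSTRACTS, FULL_TEXT };
    enum RetrievalModel { BOOLEAN, VSM, LSI };
    enum TermWeighting  { RAW_TF, TF_IDF, PIVOTED };

    struct IndexConfig {
      DataType       data_type;             // what is being indexed
      RetrievalModel model;                 // how it will be searched
      TermWeighting  weighting;             // which weights to precompute
      bool           split_inverted_index;  // multiple files for large data?
    };

    // Example: a large Web collection indexed for VSM retrieval
    IndexConfig wt10g_config = { WEB_PAGES, VSM, PIVOTED, true };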

14
The Retrieval Engine
  • Highly configurable for different experiments
  • One collection (i.e., a set of indexed data) may
    be used with different retrieval methods. This is
    the core value of the software: enabling
    experiments in which many factors are held
    constant
  • Small proxy servers enable the retrieval engine
    to interact with external interfaces (e.g., Java
    programs)
  • Other small servers can retrieve from Web search
    engines, such as Google, then reformat hits
    internally

15
A Typical TREC-Style Experiment: Indexer Configuration
  • Estimate high-water marks for memory and disk
    usage. Determine whether you can index the
    entire dataset with one run, or if you need
    multiple runs
  • Bring together different indexing classes and
    methods into one program. For example:
  • File opener (to recursively retrieve files and
    directories)
  • Tokenizer (identify word boundaries)
  • Stemmer and stoplist handlers
  • Choice of HTML or XML tags or other elements to
    identify, and how to identify them
  • Choice of what data to store to disk (e.g.,
    separate inverted indexes for particular tags and
    a sequential index); a tokenizing pipeline is
    sketched below
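
As an illustration of how the tokenizer, stoplist and stemmer fit together, here is a minimal sketch of a tokenize / stop / stem pipeline. The function name and the deliberately trivial suffix-stripping stemmer are assumptions for illustration; IRTools uses its own tokenizer and stemmer classes.

    #include <cctype>
    #include <set>
    #include <string>
    #include <vector>

    // Illustrative pipeline: split text into lowercase terms, drop
    // stopwords, and apply a (deliberately trivial) stemmer.
    std::vector<std::string> tokenize(const std::string &text,
                                      const std::set<std::string> &stoplist)
    {
      std::vector<std::string> terms;
      std::string word;
      for (std::string::size_type i = 0; i <= text.size(); ++i) {
        if (i < text.size() && std::isalnum((unsigned char)text[i])) {
          word += (char)std::tolower((unsigned char)text[i]);
        } else if (!word.empty()) {
          if (stoplist.find(word) == stoplist.end()) {
            // Toy stemmer: strip a trailing 's' (a real system would
            // use something like the Porter stemmer)
            if (word.size() > 3 && word[word.size() - 1] == 's')
              word.erase(word.size() - 1);
            terms.push_back(word);
          }
          word.clear();
        }
      }
      return terms;
    }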

16
A Typical TREC-Style Experiment: Retrieval Engine Configuration
  • For batch-oriented retrieval, queries may be
    pre-stemmed and stopped (or you could use term
    IDs instead of the terms)
  • For interactive retrieval or testing, the
    tokenizer, stemmer and stopword processor should
    match the indexer's
  • Add components as needed, such as:
  • Candidate document selection (e.g., Boolean AND)
  • Query expansion
  • Weighting of terms and documents (tf-idf,
    pivoted, user-specified)
  • Similarity measure (cosine, geometric distance;
    see the sketch after this list)
  • Ranking
  • Presentation of results
  • Relevance feedback, query adjustment, etc.
  • Adding CGI functionality, command line options
    and other interaction methods is easily done
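
To make the weighting, similarity and ranking steps concrete, here is a minimal sketch of cosine-similarity ranking over tf-idf weighted vectors. The map-based sparse vectors and the function names are assumptions for illustration, not IRTools' internal representation.

    #include <algorithm>
    #include <cmath>
    #include <map>
    #include <utility>
    #include <vector>

    typedef std::map<int, double> SparseVec;   // term ID -> tf-idf weight

    // Cosine similarity between two sparse term-weight vectors.
    double cosine(const SparseVec &q, const SparseVec &d)
    {
      double dot = 0.0, qnorm = 0.0, dnorm = 0.0;
      for (SparseVec::const_iterator it = q.begin(); it != q.end(); ++it) {
        qnorm += it->second * it->second;
        SparseVec::const_iterator jt = d.find(it->first);
        if (jt != d.end())
          dot += it->second * jt->second;
      }
      for (SparseVec::const_iterator it = d.begin(); it != d.end(); ++it)
        dnorm += it->second * it->second;
      if (qnorm == 0.0 || dnorm == 0.0)
        return 0.0;
      return dot / (std::sqrt(qnorm) * std::sqrt(dnorm));
    }

    // Rank candidate documents (doc ID -> vector) against a query vector.
    std::vector<std::pair<double, int> >
    rank(const SparseVec &query, const std::map<int, SparseVec> &docs)
    {
      std::vector<std::pair<double, int> > scored;
      for (std::map<int, SparseVec>::const_iterator it = docs.begin();
           it != docs.end(); ++it)
        scored.push_back(std::make_pair(cosine(query, it->second), it->first));
      std::sort(scored.rbegin(), scored.rend());   // highest score first
      return scored;
    }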

17
Some Files used by IRTools
  • Inverted index: binary files. One file contains
    term IDs, term counts, weights and offset
    locations into the document list. The second
    file contains the document list for each term.
    A third file contains the list of term locations
    (for the NEAR operator)
  • Sequential index: for each document, a list of
    the term IDs, term counts and locations (3
    separate binary files)
  • Term map: a database (Berkeley DB) to look up the
    term ID for a term
  • Term ID data: for each term, its frequency in the
    collection
  • Sparse matrix files: for co-occurrence data,
    term-by-document lists, etc. Binary files using a
    modified Harwell-Boeing format
  • For any experiment, only some of these files (or
    others) are needed; an illustrative record layout
    is sketched below
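
For illustration only, here is one plausible way the inverted-index records described above could be laid out as C++ structs. The field names and types are assumptions; the actual IRTools file formats may differ.

    // Illustrative on-disk record layouts for the inverted index files
    // above. Field names and types are assumptions, not the actual
    // IRTools format.

    typedef int   irt_int;     // assumed 32-bit integer type
    typedef float irt_float;   // assumed 32-bit float type

    // File 1: one record per term
    struct TermRecord {
      irt_int   term_id;       // term ID (see the Berkeley DB term map)
      irt_int   doc_count;     // number of documents containing the term
      irt_float weight;        // precomputed collection weight (e.g., idf)
      long      postings_off;  // byte offset into the document-list file
    };

    // File 2: document list (postings), doc_count entries per term
    struct PostingRecord {
      irt_int   doc_id;        // document ID
      irt_int   term_count;    // occurrences of the term in this document
    };

    // File 3: term locations within each document (for the NEAR
    // operator), stored as runs of irt_int offsets.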

18
A Little Source Code: weight.h

    class IRT_Weight {
     public:
      // Constructor
      IRT_Weight (IRT_Index &inarg)
        : in (inarg)
      { }  // End of constructor for class IRT_Weight

      // Destructor
      ~IRT_Weight ()
      { }  // End of destructor for class IRT_Weight

      // Get a tf weight
      irt_float weight_get_tf (vector<irt_int> &, irt_float);

      // Get an idf weight
      irt_float weight_get_idf (irt_float, irt_float);
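
The header only declares the weighting functions. As a rough sketch of what such methods typically compute, here are textbook-style tf and idf formulas; the function names and formulas are assumptions for illustration, not taken from the IRTools sources.

    #include <cmath>

    // Hedged, textbook-style weighting formulas, for illustration only;
    // the actual IRT_Weight member functions may compute something
    // different (e.g., pivoted weights).

    // Log-scaled term frequency weight
    double example_tf_weight (int term_count_in_doc)
    {
      return term_count_in_doc > 0
                 ? 1.0 + std::log ((double) term_count_in_doc)
                 : 0.0;
    }

    // Inverse document frequency weight
    double example_idf_weight (double num_docs, double docs_with_term)
    {
      if (docs_with_term <= 0.0)
        return 0.0;
      return std::log (num_docs / docs_with_term);
    }

    // A term's tf-idf weight is then
    // example_tf_weight(tf) * example_idf_weight(N, df).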

19
A Little Source Code: bool_or.cc
  • This class member function merges lists of
    document IDs (a minimal merge sketch follows
    below)
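
The bool_or.cc source is not reproduced on the slide. As an illustration of what such an OR merge does, here is a minimal sketch that unions two sorted document ID lists; the function name and the use of std::set_union are assumptions, not the actual IRTools implementation.

    #include <algorithm>
    #include <iterator>
    #include <vector>

    // Boolean OR: union of two sorted, duplicate-free document ID lists.
    std::vector<int> bool_or_merge (const std::vector<int> &a,
                                    const std::vector<int> &b)
    {
      std::vector<int> result;
      result.reserve (a.size () + b.size ());
      std::set_union (a.begin (), a.end (), b.begin (), b.end (),
                      std::back_inserter (result));
      return result;
    }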

20
A Program to Index WT10G (the TREC 10GB Web
dataset)
  • This program runs in about 3 hours on our dual
    Alpha station
  • It creates separate inverted indexes for terms in
    the <title> and <h1> tags (a routing sketch
    follows below)
  • The slowest part is the tokenizer, which
    identifies terms and tags of interest. The token
    class is being redesigned for higher performance
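
As a hedged illustration of keeping separate inverted indexes for particular tags, the following sketch shows one way terms could be routed to a per-tag index in addition to the main index. The class and method names are invented for illustration and are not the IRTools classes.

    #include <map>
    #include <string>
    #include <vector>

    // Illustrative only: route terms to a per-tag posting store plus a
    // main index.
    class TagAwareIndexer {
     public:
      // Record a term occurrence; 'tag' is "" for body text, or a tag
      // name such as "title" or "h1".
      void add (const std::string &term, int doc_id, const std::string &tag)
      {
        main_index_[term].push_back (doc_id);
        if (!tag.empty ())
          tag_index_[tag][term].push_back (doc_id);
      }

     private:
      std::map<std::string, std::vector<int> > main_index_;
      std::map<std::string,
               std::map<std::string, std::vector<int> > > tag_index_;
    };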

21
Current Projects Include
  • TREC Interactive Track. We'll be using IRTools
    to post-process Google results and display them
    via a proxy to the end user. A sitemap-style
    interface will be compared to a traditional list.
  • 3D navigation. Several interfaces to navigate
    through information space. These can use a local
    dataset, or visualize relative locations of
    documents retrieved elsewhere.

22
Our Thanks To
  • gcc and g++: these compilers greatly facilitate
    cross-platform development
  • Berkeley DB: high-performance database
    functionality for single-key data
  • wget and ht://Dig: open-source tools whose
    functionality we have learned from