Transcript and Presenter's Notes

Title: LIS618 lecture 3


1
LIS618 lecture 3
  • Thomas Krichel
  • 2003-02-13

2
Structure of talk
  • Document Preprocessing
  • Basic ingredients of query languages
  • Retrieval performance evaluation

3
document preprocessing
  • There are some operations that may be done to the
    documents before indexing
  • lexical analysis
  • stemming of words
  • elimination of stop words
  • selection of index terms
  • construction of term categorization structures
  • we will look at these in turn
  • in many cases, document preprocessing is not well
    documented by the provider,
  • but searchers need to be aware of it

4
lexical analysis
  • divides a stream of characters into a stream of
    words
  • seems easy enough, but...
  • should we keep numbers?
  • hyphens: compare "state-of-the-art" with "b-52"
  • removal of punctuation: but then what about
    "333B.C."?
  • casing: compare "bank" and "Bank" (see the sketch
    below)
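A minimal tokenizer sketch in Python, under one possible policy (lowercase everything, keep digits, treat hyphens and punctuation as separators); real engines make different choices:

    import re

    def tokenize(text):
        # lowercase so "Bank" and "bank" become the same index term
        text = text.lower()
        # anything that is not a letter or digit separates tokens,
        # so hyphens and punctuation split words apart
        return [t for t in re.split(r"[^a-z0-9]+", text) if t]

    print(tokenize('State-of-the-art B-52 flights, 333B.C.'))
    # ['state', 'of', 'the', 'art', 'b', '52', 'flights', '333b', 'c']

Note how this particular policy mangles "b-52" and "333B.C.": exactly the trade-offs the bullets above point at.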

5
stemming
  • in general, users search for the occurrence of a
    term irrespective of grammar
  • plural, gerund forms, past tense can be subject
    to stemming
  • important algorithm by Porter
  • evidence about the effect of stemming on
    information retrieval is mixed
  • stemming is relatively rare these days (see the
    sketch below)
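As a quick illustration, Porter's algorithm ships with the nltk package (a sketch, assuming nltk is installed):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    # plural, gerund, and past-tense forms collapse to one stem
    for word in ["connection", "connections", "connecting", "connected"]:
        print(word, "->", stemmer.stem(word))
    # all four print the stem "connect"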

6
elimination of stop words
  • some words carry no meaning and should be
    eliminated
  • in fact, any word that appears in 80% of all
    documents is pretty much useless, but
  • consider a search for "to be or not to be":
    every one of its words is a likely stop word
  • it is better to reduce the index weight of terms
    that appear very frequently (see the sketch
    below)
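A sketch of why stop-word elimination defeats that query; the stop list here is a tiny illustrative sample, not any engine's real list:

    STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and"}

    def index_terms(query):
        # drop stop words before indexing or matching
        return [w for w in query.lower().split() if w not in STOP_WORDS]

    print(index_terms("to be or not to be"))   # [] -- nothing left to match
    print(index_terms("the bank of england"))  # ['bank', 'england']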

7
index term selection
  • some engines try to capture nouns only
  • some nouns that appear heavily together can be
    considered to be one index term, such as
    "computer science"
  • Dialog deals with this through phrase indexing.
  • Most web engines, however, index all words, and
    all of them individually (see the sketch below)
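One simple way to find candidate multi-word index terms such as "computer science" is to count bigrams across the collection; a sketch (the documents are invented for illustration):

    from collections import Counter

    docs = ["computer science is the study of computation",
            "the science of computer games",
            "computer science departments teach computer science"]

    bigrams = Counter()
    for d in docs:
        tokens = d.split()
        bigrams.update(zip(tokens, tokens[1:]))

    # bigrams that occur unusually often can be promoted to index terms
    print(bigrams.most_common(1))  # [(('computer', 'science'), 3)]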

8
thesauri
  • a list of words and for each word, a list of
    related words
  • synonyms
  • broader terms
  • narrower terms
  • used
  • to provide a consistent vocabulary for indexing
    and searching
  • to assist users with locating terms for query
    formulation
  • to allow users to broaden or narrow a query (see
    the sketch below)
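A toy thesaurus and query broadening/narrowing sketch; the term relations are invented for illustration:

    # each entry: synonyms, broader terms (bt), narrower terms (nt)
    THESAURUS = {
        "cat": {"syn": ["feline"], "bt": ["mammal"], "nt": ["kitten"]},
        "dog": {"syn": ["canine"], "bt": ["mammal"], "nt": ["puppy"]},
    }

    def broaden(term):
        # expand a query term with its synonyms and broader terms
        entry = THESAURUS.get(term, {})
        return [term] + entry.get("syn", []) + entry.get("bt", [])

    def narrow(term):
        # replace a query term by its narrower terms, if any
        return THESAURUS.get(term, {}).get("nt", [term])

    print(broaden("cat"))  # ['cat', 'feline', 'mammal']
    print(narrow("cat"))   # ['kitten']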

9
use of thesauri
  • Thesauri are limited to experimental systems, or
    some high-quality systems; see
    http://www.sosig.ac.uk for an example.
  • They can be confusing to users.
  • Frequently the relationship between terms in the
    query is badly served by the relationships in the
    thesaurus. Thus thesaurus expansion of an initial
    query (if performed automatically) can lead to
    bad results.

10
simple queries
  • single-word queries
  • one word only
  • hopefully some word combinations are understood
    as one word, e.g. "on-line"
  • context queries
  • phrase queries (be aware of stop words)
  • proximity queries, which generalize phrase
    queries (see the sketch below)
  • Boolean queries
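A sketch of a proximity query, with the phrase query as its tight special case; the function and names are illustrative, not any engine's API:

    def within(doc_tokens, t1, t2, k):
        # proximity query: t1 and t2 at most k positions apart
        pos1 = [i for i, t in enumerate(doc_tokens) if t == t1]
        pos2 = [i for i, t in enumerate(doc_tokens) if t == t2]
        return any(abs(i - j) <= k for i in pos1 for j in pos2)

    doc = "the bank of england raised rates".split()
    print(within(doc, "bank", "england", 2))  # True
    print(within(doc, "bank", "rates", 2))    # False

A phrase query is essentially the case k = 1 with word order enforced.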

11
simple pattern queries
  • prefix queries (e.g. "anal" for analogy)
  • suffix queries (e.g. "oral" for choral)
  • substring (e.g. "al" for talk)
  • ranges (e.g. from "held" to "hero")
  • within a distance of a query term, usually
    Levenshtein distance (i.e. the minimum number of
    insertions, deletions, and replacements needed to
    turn one string into the other); see the sketch
    below
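A straightforward dynamic-programming sketch of Levenshtein distance in Python:

    def levenshtein(a, b):
        # dist[j] holds the distance from the current prefix of a to b[:j]
        dist = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dist[0] = dist[0], i
            for j, cb in enumerate(b, 1):
                cur = dist[j]
                dist[j] = min(dist[j] + 1,        # delete ca
                              dist[j - 1] + 1,    # insert cb
                              prev + (ca != cb))  # replace ca with cb
                prev = cur
        return dist[-1]

    print(levenshtein("held", "hero"))  # 2 (replace l->r and d->o)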

12
regular expressions
  • come from UNIX computing
  • built from strings where certain characters are
    metacharacters
  • example: "pro(blem|tein)s?" matches "problem",
    "problems", "protein", and "proteins"
  • example: "New .*y" matches "New Jersey", "New
    York City", and "New Delhy" (but not "New
    Delhi"); both examples are checked in the sketch
    below
  • great variety of dialects, usually very powerful
  • extremely important in digital libraries
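The two examples, checked with Python's re module:

    import re

    words = re.compile(r"pro(blem|tein)s?")
    for w in ["problem", "problems", "protein", "proteins"]:
        print(w, bool(words.fullmatch(w)))    # all True

    cities = re.compile(r"New .*y")
    for c in ["New Jersey", "New York City", "New Delhy", "New Delhi"]:
        print(c, bool(cities.fullmatch(c)))   # True, True, True, False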

13
structured queries
  • make use of document structures
  • the simplest example is when the documents are
    database records: we can search for terms in a
    certain field only
  • if there is sufficient structure to the field
    contents, the field can be interpreted as meaning
    something different from the words it contains,
    for example dates (see the sketch below)
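A minimal sketch of fielded search over database-like records; the records and field names are invented for illustration:

    records = [
        {"title": "Modern Information Retrieval", "year": 1999},
        {"title": "Managing Gigabytes", "year": 1999},
        {"title": "Introduction to Information Retrieval", "year": 2008},
    ]

    # a term query restricted to one field
    hits = [r for r in records if "gigabytes" in r["title"].lower()]
    print([r["title"] for r in hits])    # ['Managing Gigabytes']

    # a structured field (here a year) supports range queries,
    # not just word matching
    recent = [r for r in records if r["year"] >= 2000]
    print([r["title"] for r in recent])  # ['Introduction to Information Retrieval']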

14
query protocols
  • There are some standard languages
  • Z39.50 queries
  • CCL, the "common command language", is a
    development of Z39.50
  • CD-RDx, "compact disk read only data exchange",
    is supported by US government agencies such as
    the CIA and NASA
  • SFQL, the "structured full-text query language",
    is built on SQL

15
http://openlib.org/home/krichel
  • Thank you for your attention!

16
retrieval performance evaluation
  • "Recall" and "Precision" are two classic measures
    to measure the performance of information
    retrieval in a single query.
  • Both assume that there is an answer set of
    documents that contain the answer to the query.
  • Performance is optimal if
  • the database returns all the documents in the
    answer set
  • the database returns only documents in the answer
    set
  • Recall is the fraction of the relevant documents
    that the query result has captured.
  • Precision is the fraction of the retrieved
    documents that is relevant (see the sketch
    below).
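In code, with the answer (relevant) set and the retrieved set as Python sets:

    def recall(relevant, retrieved):
        # fraction of the relevant documents that were retrieved
        return len(relevant & retrieved) / len(relevant)

    def precision(relevant, retrieved):
        # fraction of the retrieved documents that are relevant
        return len(relevant & retrieved) / len(retrieved)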

17
recall and precision curves
  • Assume that all the retrieved documents arrive at
    once and are examined one by one.
  • During that process, the user discovers more and
    more relevant documents, so recall increases.
  • During the same process, at least eventually,
    there are fewer and fewer useful documents, so
    precision (usually) declines.
  • This can be represented as a curve.

18
Example
  • Let the answer set be 0,1,2,3,4,5,6,7,8,9 and
    non-relevant documents represented by letters.
  • A query returns the following result:
  • 7,a,3,b,c,9,n,j,l,5,r,o,s,e,4.
  • For the first document, (recall, precision) is
    (10%, 100%); for the third, (20%, 66%); for the
    sixth, (30%, 50%); for the tenth, (40%, 40%);
    etc. (see the sketch below)
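Recomputing those points from the slide's ranking:

    answer_set = set("0123456789")
    ranking = list("7a3bc9njl5rose4")   # the query result, in rank order

    relevant_seen = 0
    for rank, doc in enumerate(ranking, 1):
        if doc in answer_set:
            relevant_seen += 1
            print(rank,
                  relevant_seen / len(answer_set),  # recall
                  relevant_seen / rank)             # precision
    # rank 1: 0.1, 1.0   rank 3: 0.2, 0.67   rank 6: 0.3, 0.5
    # rank 10: 0.4, 0.4  rank 15: 0.5, 0.33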

19
recall/precision curves
  • Such a curve can be formed for each query.
  • An average curve over several queries can be
    computed at each recall level.
  • Recall and precision levels can also be used to
    calculate two single-valued summaries.
  • average precision at seen relevant documents
  • R-precision

20
average precision at seen relevant documents
  • To find it, sum the precision levels observed at
    each new relevant document discovered by the user
    and divide by the number of relevant documents
    seen.
  • In our example, it is 0.57 (recomputed in the
    sketch below).
  • This measure favors retrieval methods that get
    the relevant documents to the top.
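Recomputing it for the example (exact fractions give 0.58; the slide's 0.57 reflects rounding the intermediate precisions down to two decimals first):

    precisions = [1/1, 2/3, 3/6, 4/10, 5/15]   # at ranks 1, 3, 6, 10, 15
    print(round(sum(precisions) / len(precisions), 2))  # 0.58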

21
R-precision
  • a more ad hoc measure.
  • Let R be the size of the answer set.
  • Take the first R results of the query.
  • Find the number of relevant documents among them.
  • Divide by R.
  • In our example, R is 10 and the first ten results
    contain four relevant documents (7, 3, 9, 5), so
    the R-precision is 0.4 (see the sketch below).
  • An average can be calculated for a number of
    queries.
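The same computation in code, reusing answer_set and ranking from the earlier example:

    R = len(answer_set)                    # 10
    top_R = ranking[:R]                    # first R results
    hits = sum(doc in answer_set for doc in top_R)
    print(hits / R)                        # 0.4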

22
critique of recall precision
  • Recall has to be estimated by an expert
  • Recall is very difficult to estimate in a large
    collection
  • They focus on one query only. No serious user
    works like this.
  • There are some other measures, but that is more
    for an advanced course in IR.