Transcript and Presenter's Notes

Title: LIS618 lecture 3


1
LIS618 lecture 3
  • Thomas Krichel
  • 2003-02-13

2
Structure of talk
  • Document Preprocessing
  • Basic ingredients of query languages
  • Retrieval performance evaluation

3
document preprocessing
  • There are some operations that may be done to the
    documents before indexing
  • lexical analysis
  • stemming of words
  • elimination of stop words
  • selection of index terms
  • construction of term categorization structures
  • we will look at these in turn
  • in many cases, document preprocessing is not well
    documented by the provider,
  • but searchers need to be aware of it

4
lexical analysis
  • divides a stream of characters into a stream of
    words
  • seems easy enough, but...
  • should we keep numbers?
  • hyphens: compare "state-of-the-art" with "b-52"
  • removal of punctuation: but then what about
    "333B.C."?
  • casing: compare "bank" and "Bank" (see the sketch
    below)
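A minimal tokenizer sketch in Python, under one possible policy (lowercase everything, keep digits, treat hyphens and punctuation as separators); real engines make different choices:

    import re

    def tokenize(text):
        # lowercase so "Bank" and "bank" become the same index term
        text = text.lower()
        # anything that is not a letter or digit separates tokens,
        # so hyphens and punctuation split words apart
        return [t for t in re.split(r"[^a-z0-9]+", text) if t]

    print(tokenize('State-of-the-art B-52 flights, 333B.C.'))
    # ['state', 'of', 'the', 'art', 'b', '52', 'flights', '333b', 'c']

Note how this particular policy mangles "b-52" and "333B.C.": exactly the trade-offs the bullets above point at.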

5
stemming
  • in general, users search for the occurrence of a
    term irrespective of grammar
  • plural, gerund forms, past tense can be subject
    to stemming
  • important algorithm by Porter
  • evidence about the effect of stemming on
    information retrieval is mixed
  • stemming is relatively rare these days (see the
    sketch below)
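As a quick illustration, Porter's algorithm ships with the nltk package (a sketch, assuming nltk is installed):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    # plural, gerund, and past-tense forms collapse to one stem
    for word in ["connection", "connections", "connecting", "connected"]:
        print(word, "->", stemmer.stem(word))
    # all four print the stem "connect"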

6
elimination of stop words
  • some words carry no meaning and should be
    eliminated
  • in fact, any word that appears in 80% of all
    documents is pretty much useless, but
  • consider a search for "to be or not to be":
    every one of its words is a likely stop word
  • it is better to reduce the index weight of terms
    that appear very frequently (see the sketch
    below)
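A sketch of why stop-word elimination defeats that query; the stop list here is a tiny illustrative sample, not any engine's real list:

    STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and"}

    def index_terms(query):
        # drop stop words before indexing or matching
        return [w for w in query.lower().split() if w not in STOP_WORDS]

    print(index_terms("to be or not to be"))   # [] -- nothing left to match
    print(index_terms("the bank of england"))  # ['bank', 'england']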

7
index term selection
  • some engines try to capture nouns only
  • some nouns that appear heavily together can be
    considered to be one index term, such as
    "computer science"
  • Dialog deals with this through phrase indexing.
  • Most web engines, however, index all words, and
    all of them individually (see the sketch below)
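One simple way to find candidate multi-word index terms such as "computer science" is to count bigrams across the collection; a sketch (the documents are invented for illustration):

    from collections import Counter

    docs = ["computer science is the study of computation",
            "the science of computer games",
            "computer science departments teach computer science"]

    bigrams = Counter()
    for d in docs:
        tokens = d.split()
        bigrams.update(zip(tokens, tokens[1:]))

    # bigrams that occur unusually often can be promoted to index terms
    print(bigrams.most_common(1))  # [(('computer', 'science'), 3)]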

8
thesauri
  • a list of words and for each word, a list of
    related words
  • synonyms
  • broader terms
  • narrower terms
  • used
  • to provide a consistent vocabulary for indexing
    and searching
  • to assist users with locating terms for query
    formulation
  • to allow users to broaden or narrow a query (see
    the sketch below)
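A toy thesaurus and query broadening/narrowing sketch; the term relations are invented for illustration:

    # each entry: synonyms, broader terms (bt), narrower terms (nt)
    THESAURUS = {
        "cat": {"syn": ["feline"], "bt": ["mammal"], "nt": ["kitten"]},
        "dog": {"syn": ["canine"], "bt": ["mammal"], "nt": ["puppy"]},
    }

    def broaden(term):
        # expand a query term with its synonyms and broader terms
        entry = THESAURUS.get(term, {})
        return [term] + entry.get("syn", []) + entry.get("bt", [])

    def narrow(term):
        # replace a query term by its narrower terms, if any
        return THESAURUS.get(term, {}).get("nt", [term])

    print(broaden("cat"))  # ['cat', 'feline', 'mammal']
    print(narrow("cat"))   # ['kitten']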

9
use of thesauri
  • Thesauri are limited to experimental systems, or
    some high-quality systems; see
    http://www.sosig.ac.uk for an example.
  • They can be confusing to users.
  • Frequently the relationship between terms in the
    query is badly served by the relationships in the
    thesaurus. Thus thesaurus expansion of an initial
    query (if performed automatically) can lead to
    bad results.

10
simple queries
  • single-word queries
  • one word only
  • hopefully some word combinations are understood
    as one word, e.g. "on-line"
  • context queries
  • phrase queries (be aware of stop words)
  • proximity queries, which generalize phrase
    queries (see the sketch below)
  • Boolean queries
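A sketch of a proximity query, with the phrase query as its tight special case; the function and names are illustrative, not any engine's API:

    def within(doc_tokens, t1, t2, k):
        # proximity query: t1 and t2 at most k positions apart
        pos1 = [i for i, t in enumerate(doc_tokens) if t == t1]
        pos2 = [i for i, t in enumerate(doc_tokens) if t == t2]
        return any(abs(i - j) <= k for i in pos1 for j in pos2)

    doc = "the bank of england raised rates".split()
    print(within(doc, "bank", "england", 2))  # True
    print(within(doc, "bank", "rates", 2))    # False

A phrase query is essentially the case k = 1 with word order enforced.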

11
simple pattern queries
  • prefix queries (e.g. "anal" for analogy)
  • suffix queries (e.g. "oral" for choral)
  • substring (e.g. "al" for talk)
  • ranges (e.g. from "held" to "hero")
  • within a distance of a query term, usually
    Levenshtein distance (i.e. the minimum number of
    insertions, deletions, and replacements needed to
    turn one string into the other); see the sketch
    below
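A straightforward dynamic-programming sketch of Levenshtein distance in Python:

    def levenshtein(a, b):
        # dist[j] holds the distance from the current prefix of a to b[:j]
        dist = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dist[0] = dist[0], i
            for j, cb in enumerate(b, 1):
                cur = dist[j]
                dist[j] = min(dist[j] + 1,        # delete ca
                              dist[j - 1] + 1,    # insert cb
                              prev + (ca != cb))  # replace ca with cb
                prev = cur
        return dist[-1]

    print(levenshtein("held", "hero"))  # 2 (replace l->r and d->o)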

12
regular expressions
  • come from UNIX computing
  • built from strings where certain characters are
    metacharacters
  • example: "pro(blem|tein)s?" matches "problem",
    "problems", "protein", and "proteins"
  • example: "New .*y" matches "New Jersey", "New
    York City", and "New Delhy" (but not "New
    Delhi"); both examples are checked in the sketch
    below
  • great variety of dialects, usually very powerful
  • extremely important in digital libraries
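The two examples, checked with Python's re module:

    import re

    words = re.compile(r"pro(blem|tein)s?")
    for w in ["problem", "problems", "protein", "proteins"]:
        print(w, bool(words.fullmatch(w)))    # all True

    cities = re.compile(r"New .*y")
    for c in ["New Jersey", "New York City", "New Delhy", "New Delhi"]:
        print(c, bool(cities.fullmatch(c)))   # True, True, True, False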

13
structured queries
  • make use of document structures
  • the simplest example is when the documents are
    database records: we can search for terms in a
    certain field only
  • if there is sufficient structure to the field
    contents, the field can be interpreted as meaning
    something different from the words it contains,
    for example dates (see the sketch below)
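A minimal sketch of fielded search over database-like records; the records and field names are invented for illustration:

    records = [
        {"title": "Modern Information Retrieval", "year": 1999},
        {"title": "Managing Gigabytes", "year": 1999},
        {"title": "Introduction to Information Retrieval", "year": 2008},
    ]

    # a term query restricted to one field
    hits = [r for r in records if "gigabytes" in r["title"].lower()]
    print([r["title"] for r in hits])    # ['Managing Gigabytes']

    # a structured field (here a year) supports range queries,
    # not just word matching
    recent = [r for r in records if r["year"] >= 2000]
    print([r["title"] for r in recent])  # ['Introduction to Information Retrieval']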

14
query protocols
  • There are some standard languages
  • Z39.50 queries
  • CCL, the "common command language", is a
    development of Z39.50
  • CD-RDx, "compact disk read only data exchange",
    is supported by US government agencies such as
    the CIA and NASA
  • SFQL, the "structured full-text query language",
    is built on SQL

15
http://openlib.org/home/krichel
  • Thank you for your attention!

16
retrieval performance evaluation
  • "Recall" and "Precision" are two classic measures
    to measure the performance of information
    retrieval in a single query.
  • Both assume that there is an answer set of
    documents that contain the answer to the query.
  • Performance is optimal if
  • the database returns all the documents in the
    answer set
  • the database returns only documents in the answer
    set
  • Recall is the fraction of the relevant documents
    that the query result has captured.
  • Precision is the fraction of the retrieved
    documents that is relevant (see the sketch
    below).
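In code, with the answer (relevant) set and the retrieved set as Python sets:

    def recall(relevant, retrieved):
        # fraction of the relevant documents that were retrieved
        return len(relevant & retrieved) / len(relevant)

    def precision(relevant, retrieved):
        # fraction of the retrieved documents that are relevant
        return len(relevant & retrieved) / len(retrieved)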

17
recall and precision curves
  • Assume that all the retrieved documents arrive at
    once and are examined one by one.
  • During that process, the user discovers more and
    more relevant documents, so recall increases.
  • During the same process, at least eventually,
    there are fewer and fewer useful documents, so
    precision (usually) declines.
  • This can be represented as a curve.

18
Example
  • Let the answer set be 0,1,2,3,4,5,6,7,8,9 and
    non-relevant documents represented by letters.
  • A query returns the following result:
  • 7,a,3,b,c,9,n,j,l,5,r,o,s,e,4.
  • For the first document, (recall, precision) is
    (10%, 100%); for the third, (20%, 66%); for the
    sixth, (30%, 50%); for the tenth, (40%, 40%);
    etc. (see the sketch below)
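Recomputing those points from the slide's ranking:

    answer_set = set("0123456789")
    ranking = list("7a3bc9njl5rose4")   # the query result, in rank order

    relevant_seen = 0
    for rank, doc in enumerate(ranking, 1):
        if doc in answer_set:
            relevant_seen += 1
            print(rank,
                  relevant_seen / len(answer_set),  # recall
                  relevant_seen / rank)             # precision
    # rank 1: 0.1, 1.0   rank 3: 0.2, 0.67   rank 6: 0.3, 0.5
    # rank 10: 0.4, 0.4  rank 15: 0.5, 0.33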

19
recall/precision curves
  • Such a curve can be formed for each query.
  • An average curve over several queries can be
    computed at each recall level.
  • Recall and precision levels can also be used to
    calculate two single-valued summaries.
  • average precision at seen relevant documents
  • R-precision

20
average precision at seen relevant documents
  • To find it, sum the precision levels observed at
    each new relevant document discovered by the user
    and divide by the number of relevant documents
    seen.
  • In our example, it is 0.57 (recomputed in the
    sketch below).
  • This measure favors retrieval methods that get
    the relevant documents to the top.
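Recomputing it for the example (exact fractions give 0.58; the slide's 0.57 reflects rounding the intermediate precisions down to two decimals first):

    precisions = [1/1, 2/3, 3/6, 4/10, 5/15]   # at ranks 1, 3, 6, 10, 15
    print(round(sum(precisions) / len(precisions), 2))  # 0.58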

21
R-precision
  • a more ad hoc measure.
  • Let R be the size of the answer set.
  • Take the first R results of the query.
  • Find the number of relevant documents among them.
  • Divide by R.
  • In our example, R is 10 and the first ten results
    contain four relevant documents (7, 3, 9, 5), so
    the R-precision is 0.4 (see the sketch below).
  • An average can be calculated for a number of
    queries.
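The same computation in code, reusing answer_set and ranking from the earlier example:

    R = len(answer_set)                    # 10
    top_R = ranking[:R]                    # first R results
    hits = sum(doc in answer_set for doc in top_R)
    print(hits / R)                        # 0.4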

22
critique of recall precision
  • Recall has to be estimated by an expert
  • Recall is very difficult to estimate in a large
    collection
  • They focus on one query only. No serious user
    works like this.
  • There are some other measures, but that is more
    for an advanced course in IR.