Modern Information Retrieval Chapter 7: Text Operations - PowerPoint PPT Presentation

1
Modern Information Retrieval, Chapter 7: Text Operations
  • Ricardo Baeza-Yates
  • Berthier Ribeiro-Neto

2
Document Preprocessing
  • Lexical analysis of the text
  • Elimination of stopwords
  • Stemming
  • Selection of index terms
  • Construction of term categorization structures

3
Lexical Analysis of the Text
  • Word separators
  • space
  • digits
  • hyphens
  • punctuation marks
  • the case of the letters
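The word-separator issues above can be sketched as a minimal tokenizer. This is an illustrative simplification, not the book's algorithm: it folds letter case, splits on spaces, hyphens, and punctuation, and drops pure-digit tokens.

```python
import re

def tokenize(text):
    # Normalize letter case, then split on any run of non-word
    # characters (spaces, hyphens, punctuation marks).
    tokens = re.split(r"[^\w]+", text.lower())
    # Drop empty strings and pure-digit tokens.
    return [t for t in tokens if t and not t.isdigit()]

print(tokenize("State-of-the-art retrieval, in 2024!"))
# -> ['state', 'of', 'the', 'art', 'retrieval', 'in']
```

Real lexical analyzers need more care with cases like hyphenated names or numbers that carry meaning (e.g. "B-52"), which this sketch discards.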

4
Elimination of Stopwords
  • A list of stopwords
  • words that are too frequent among the documents
  • articles, prepositions, conjunctions, etc.
  • Can reduce the size of the indexing structure
    considerably
  • Problem
  • Search for "to be or not to be"?
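A sketch of stopword elimination with a toy stop list. It also demonstrates the problem noted above: a query like "to be or not to be" consists entirely of stopwords and vanishes.

```python
# Illustrative toy stop list; production systems use lists of
# hundreds of high-frequency articles, prepositions, conjunctions.
STOPWORDS = {"a", "an", "the", "of", "to", "be", "or", "not", "and", "in"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stop list.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# -> [] : the whole query is eliminated
```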

5
Stemming
  • Example
  • connect, connected, connecting, connection,
    connections
  • effectiveness -> effective -> effect
  • picnicking -> picnic
  • king -/-> k
  • Removing strategies
  • affix removal intuitive, simple
  • table lookup
  • successor variety
  • n-gram
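A toy affix-removal stemmer in the longest-match style. The suffix list is illustrative; real stemmers such as Porter's apply many more rules, including recoding steps that handle cases like picnicking -> picnic. A minimum stem length guards against over-stemming like king -> k.

```python
# Illustrative suffix list, ordered longest-first so the longest
# matching affix is removed.
SUFFIXES = ["ations", "ation", "ings", "ing", "ions", "ion", "ed", "s"]

def stem(word, min_stem=3):
    # Affix removal: strip the first (longest) matching suffix,
    # but never reduce the stem below min_stem characters,
    # so "king" is left alone rather than becoming "k".
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[:len(word) - len(suf)]
    return word

for w in ["connect", "connected", "connecting", "connection", "king"]:
    print(w, "->", stem(w))
```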

6
Index Terms Selection
  • Motivation
  • A sentence is usually composed of nouns,
    pronouns, articles, verbs, adjectives, adverbs,
    and connectives.
  • Most of the semantics is carried by the noun
    words.
  • Identification of noun groups
  • A noun group is a set of nouns whose syntactic
    distance in the text does not exceed a predefined
    threshold

7
Thesauri
  • Peter Roget, 1988
  • Example
  • cowardly, adj.
  • Ignobly lacking in courage: "cowardly turncoats"
  • Syns: chicken (slang), chicken-hearted, craven,
    dastardly, faint-hearted, gutless, lily-livered,
    pusillanimous, unmanly, yellow (slang),
    yellow-bellied (slang).
  • A controlled vocabulary for indexing and searching

8
The Purpose of a Thesaurus
  • To provide a standard vocabulary for indexing and
    searching
  • To assist users with locating terms for proper
    query formulation
  • To provide classified hierarchies that allow the
    broadening and narrowing of the current query
    request

9
Thesaurus Term Relationships
  • BT: broader term
  • NT: narrower term
  • RT: non-hierarchical, but related term
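The three relationship types can be stored as a simple mapping. This hypothetical toy thesaurus shows how BT and NT links support broadening and narrowing a query term.

```python
# Hypothetical toy thesaurus: each term maps its BT (broader),
# NT (narrower), and RT (related) terms.
THESAURUS = {
    "vehicle": {"NT": ["car", "truck"], "RT": ["transport"]},
    "car": {"BT": ["vehicle"], "NT": ["sedan"], "RT": ["driver"]},
}

def broaden(term):
    # Follow BT links to widen the current query request.
    return THESAURUS.get(term, {}).get("BT", [])

def narrow(term):
    # Follow NT links to make the query more specific.
    return THESAURUS.get(term, {}).get("NT", [])

print(broaden("car"))     # -> ['vehicle']
print(narrow("vehicle"))  # -> ['car', 'truck']
```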

10
Term Selection
  • Automatic Text Processing
  • by G. Salton, Chap 9,
  • Addison-Wesley, 1989.

11
Automatic Indexing
  • Indexing
  • assign identifiers (index terms) to text
    documents.
  • Identifiers
  • single-term vs. term phrase
  • controlled vs. uncontrolled vocabularies:
    instruction manuals, terminological schedules, …
  • objective vs. nonobjective text identifiers:
    cataloging rules define, e.g., author names,
    publisher names, dates of publication, …

12
Two Issues
  • Issue 1: indexing exhaustivity
  • exhaustive: assign a large number of terms
  • nonexhaustive
  • Issue 2: term specificity
  • broad terms (generic): cannot distinguish relevant
    from nonrelevant documents
  • narrow terms (specific): retrieve relatively fewer
    documents, but most of them are relevant

13
Parameters of retrieval effectiveness
  • Recall
  • Precision
  • Goal high recall and high precision

14
Contingency table of retrieved vs. relevant items:

                 Relevant items   Nonrelevant items
Retrieved              a                 b
Not retrieved          c                 d

Recall = a / (a + c), Precision = a / (a + b)
15
A Joint Measure
  • F-score: Fβ = (1 + β²) · P · R / (β² · P + R)
  • β is a parameter that encodes the relative
    importance of recall and precision
  • β = 1: equal weight
  • β < 1: precision is more important
  • β > 1: recall is more important
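The F-score computation can be sketched as a small function, with beta encoding the relative importance of recall against precision:

```python
def f_beta(precision, recall, beta=1.0):
    # F = (1 + beta^2) * P * R / (beta^2 * P + R)
    # beta = 1 weighs both equally; beta > 1 favors recall,
    # beta < 1 favors precision.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.5, 0.5))            # -> 0.5 (harmonic mean)
print(f_beta(0.1, 0.8, beta=2.0))  # recall-weighted score
```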

16
Choices of Recall and Precision
  • Both recall and precision vary from 0 to 1.
  • Particular choices of indexing and search
    policies have produced variations in performance
    ranging from 0.8 precision and 0.2 recall to 0.1
    precision and 0.8 recall.
  • In many circumstances, recall and precision values
    between 0.5 and 0.6 are satisfactory for the
    average user.

17
Term-Frequency Consideration
  • Function words
  • for example, "and", "or", "of", "but",
  • the frequencies of these words are high in all
    texts
  • Content words
  • words that actually relate to document content
  • varying frequencies in the different texts of a
    collection
  • indicate term importance for content

18
A Frequency-Based Indexing Method
  • Eliminate common function words from the document
    texts by consulting a special dictionary, or stop
    list, containing a list of high frequency
    function words.
  • Compute the term frequency tfij for all remaining
    terms Tj in each document Di, specifying the
    number of occurrences of Tj in Di.
  • Choose a threshold frequency T, and assign to
    each document Di all terms Tj for which tfij > T.

19
Inverse Document Frequency
  • Inverse Document Frequency (IDF) for term Tj:
    idfj = log(N / dfj), where dfj (document frequency
    of term Tj) is the number of documents in which Tj
    occurs and N is the total number of documents.
  • fulfils both the recall and the precision goals
  • high for terms that occur frequently in individual
    documents but rarely in the remainder of the
    collection

20
TFxIDF
  • Weight wij of a term Tj in a document Di:
    wij = tfij × idfj = tfij × log(N / dfj)
  • Eliminate common function words
  • Compute the value of wij for each term Tj in
    each document Di
  • Assign to the documents of a collection all
    terms with sufficiently high (tf × idf) factors

21
Term-discrimination Value
  • Useful index terms
  • Distinguish the documents of a collection from
    each other
  • Document Space
  • Two documents with very similar term sets
    correspond to points that appear close together
    in the document space
  • Assigning a high-frequency term with no
    discriminating power increases the document
    space density

22
A Virtual Document Space
  • Original state
  • After assignment of a good discriminator
  • After assignment of a poor discriminator
23
Good Term Assignment
  • When a term is assigned to the documents of a
    collection, the few objects to which the term is
    assigned will be distinguished from the rest of
    the collection.
  • This should increase the average distance between
    the objects in the collection and hence produce a
    document space less dense than before.

24
Poor Term Assignment
  • When a high-frequency term that does not
    discriminate between the objects of a collection
    is assigned, it renders the documents more
    similar.
  • This is reflected in an increase in document
    space density.

25
Term Discrimination Value
  • Definition: dvj = Q - Qj, where Q and Qj are the
    space densities before and after the assignment
    of term Tj.
  • dvj > 0: Tj is a good term; dvj < 0: Tj is a
    poor term.

26
Variations of Term-Discrimination Value with
Document Frequency
  • Low document frequency: dvj ≈ 0
  • Medium document frequency: dvj > 0
  • High document frequency: dvj < 0
27
TFij x dvj
  • wij = tfij × dvj
  • compared with tfij × idfj:
  • idfj decreases steadily with increasing
    document frequency
  • dvj increases from zero to positive as the
    document frequency of the term increases, then
    decreases sharply as the document frequency
    becomes still larger.

28
Document Centroid
  • Issue: efficiency, N(N - 1) pairwise similarities
  • Document centroid: C = (c1, c2, c3, ..., ct),
    where cj = (1/N) Σi wij and wij is the weight of
    the j-th term in document i.
  • Space density: average similarity between each
    document and the centroid C
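A sketch of centroid-based space density and the resulting discrimination value dv_j = Q - Q_j. Cosine similarity to the centroid is used as the density measure, one common choice where the slides leave the formula implicit.

```python
import math

def centroid(doc_vectors):
    # c_j = (1/N) * sum_i w_ij
    n = len(doc_vectors)
    return [sum(d[j] for d in doc_vectors) / n
            for j in range(len(doc_vectors[0]))]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def space_density(doc_vectors):
    # Q: average similarity of each document to the centroid,
    # avoiding the N(N-1) pairwise comparisons.
    c = centroid(doc_vectors)
    return sum(cosine(d, c) for d in doc_vectors) / len(doc_vectors)

def discrimination_value(doc_vectors, j):
    # dv_j = Q - Q_j: density with term j removed, minus density
    # with term j assigned. Positive -> good discriminator.
    without = [[w for k, w in enumerate(d) if k != j]
               for d in doc_vectors]
    return space_density(without) - space_density(doc_vectors)
```

For example, a term present in every document pulls all points toward the centroid, so its discrimination value comes out negative.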

29
Probabilistic Term Weighting
  • Goal: make explicit distinctions between
    occurrences of terms in relevant and nonrelevant
    documents of a collection
  • Definition: given a user query q, and the ideal
    answer set of the relevant documents
  • From decision theory, the best ranking algorithm
    for a document D ranks by
    g(D) = log(Pr(D|rel) / Pr(D|nonrel))
         + log(Pr(rel) / Pr(nonrel))

30
Probabilistic Term Weighting
  • Pr(rel), Pr(nonrel): a priori probabilities of
    document relevance and nonrelevance
  • Pr(D|rel), Pr(D|nonrel): occurrence probabilities
    of document D in the relevant and nonrelevant
    document sets

31
Assumptions
  • Terms occur independently in documents

32
Derivation Process
33
For a specific document D
  • Given a document D = (d1, d2, ..., dt)
  • Assume di is either 0 (absent) or 1 (present).
  • Pr(xi=1|rel) = pi, Pr(xi=0|rel) = 1 - pi
  • Pr(xi=1|nonrel) = qi, Pr(xi=0|nonrel) = 1 - qi
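Under the term-independence assumption, Pr(D|rel) factors into a product over the binary term indicators; a minimal sketch:

```python
def doc_prob(d, p):
    # Pr(D|rel) = prod_i p_i^{d_i} * (1 - p_i)^{1 - d_i}
    # d: binary presence vector; p: per-term probabilities p_i.
    prob = 1.0
    for di, pi in zip(d, p):
        prob *= pi if di == 1 else (1 - pi)
    return prob

print(doc_prob([1, 0], [0.5, 0.5]))  # -> 0.25
```

The same function with the q_i values gives Pr(D|nonrel), and the ratio of the two drives the ranking formula above.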
34
35
Term Relevance Weight
36
Issue
  • How to compute pj and qj?
  • pj = rj / R, qj = (dfj - rj) / (N - R)
  • rj: the number of relevant documents that contain
    term Tj
  • R: the total number of relevant documents
  • N: the total number of documents

37
Estimation of Term-Relevance
  • The occurrence probability of a term in the
    nonrelevant documents, qj, is approximated by the
    occurrence probability of the term in the entire
    document collection: qj = dfj / N.
  • The occurrence probabilities of the terms in the
    small number of relevant documents are
    approximated by a constant value: pj = 0.5 for
    all j.
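With the approximations pj = 0.5 and qj = dfj / N, the term relevance weight trj = log(pj(1 - qj) / (qj(1 - pj))) reduces to log((N - dfj) / dfj); a sketch:

```python
import math

def term_relevance(df_j, n, p_j=0.5):
    # q_j approximated by the collection-wide rate df_j / N.
    q_j = df_j / n
    # tr_j = log( p_j (1 - q_j) / (q_j (1 - p_j)) );
    # with p_j = 0.5 this equals log((N - df_j) / df_j).
    return math.log((p_j * (1 - q_j)) / (q_j * (1 - p_j)))

print(term_relevance(1, 10))  # rare term -> large weight
```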

38
Comparison
With pj = 0.5 and qj = dfj / N, the term relevance
weight becomes trj = log((N - dfj) / dfj). When N is
sufficiently large, N - dfj ≈ N, so trj ≈ log(N / dfj)
= idfj.
39
Estimation of Term-Relevance
  • Estimate the number of relevant documents rj in
    the collection that contain term Tj as a function
    of the known document frequency dfj of the term
    Tj.
  • pj = rj / R, qj = (dfj - rj) / (N - R), where R
    is an estimate of the total number of relevant
    documents in the collection.

40
Summary
  • Inverse document frequency, idfj
  • tfij × idfj (TF×IDF)
  • Term discrimination value, dvj
  • tfij × dvj
  • Probabilistic term weighting, trj
  • tfij × trj
  • Global properties of terms in a document
    collection