Title: Modern Information Retrieval Chapter 7: Text Operations
1Modern Information Retrieval, Chapter 7: Text Operations
- Ricardo Baeza-Yates
- Berthier Ribeiro-Neto
2Document Preprocessing
- Lexical analysis of the text
- Elimination of stopwords
- Stemming
- Selection of index terms
- Construction of term categorization structures
3Lexical Analysis of the Text
- Word separators
- space
- digits
- hyphens
- punctuation marks
- the case of the letters
4Elimination of Stopwords
- A list of stopwords
- words that are too frequent among the documents
- articles, prepositions, conjunctions, etc.
- Can reduce the size of the indexing structure
considerably
- Problem
- Search for "to be or not to be"?
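The lexical analysis and stopword steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a tiny hypothetical one, and treating every non-letter (digits, hyphens, punctuation) as a separator and lowercasing everything is one possible policy, not the only one.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "be", "in", "not", "of", "or", "the", "to"}

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not on the stop list."""
    return [t for t in tokens if t not in stopwords]

doc_terms = remove_stopwords(tokenize("The retrieval of documents in a collection"))
query_terms = remove_stopwords(tokenize("To be, or not to be?"))
print(doc_terms)    # ['retrieval', 'documents', 'collection']
print(query_terms)  # []
```

The empty result for the second input shows the problem named on the slide: a query made up entirely of stopwords leaves nothing to search for.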
5Stemming
- Example
- connect, connected, connecting, connection,
connections
- effectiveness --> effective --> effect
- picnicking --> picnic
- king -/-> k (must not be stemmed to k)
- Removing strategies
- affix removal: intuitive, simple
- table lookup
- successor variety
- n-gram
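Affix removal can be illustrated with a minimal suffix stripper. The suffix list and the minimum stem length below are assumptions made for this sketch; it is not the Porter algorithm or any other published stemmer.

```python
# Hypothetical suffix list, longest suffixes first so "ions" wins over "s".
SUFFIXES = ["ions", "ion", "ing", "ed", "s"]
MIN_STEM = 3  # assumed guard so that short words such as "king" are left alone

def strip_suffix(word):
    """Remove the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [strip_suffix(w) for w in words]
print(stems)                 # ['connect', 'connect', 'connect', 'connect', 'connect']
print(strip_suffix("king"))  # king
```

With these rules "picnicking" becomes "picnick" rather than "picnic", which shows why practical stemmers need additional rewriting rules beyond plain suffix removal.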
6Index Terms Selection
- Motivation
- A sentence is usually composed of nouns,
pronouns, articles, verbs, adjectives, adverbs,
and connectives.
- Most of the semantics is carried by the noun
words.
- Identification of noun groups
- A noun group is a set of nouns whose syntactic
distance in the text does not exceed a predefined
threshold
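A rough sketch of noun grouping follows. It assumes the nouns are already known (here a hypothetical hand-made set stands in for a part-of-speech tagger) and it measures syntactic distance as the number of words between two consecutive nouns, which is one possible reading of the definition above.

```python
def noun_groups(tokens, nouns, max_gap=1):
    """Chain consecutive nouns separated by at most max_gap other words."""
    groups, current, last_pos = [], [], None
    for pos, token in enumerate(tokens):
        if token not in nouns:
            continue
        # Number of words standing between this noun and the previous noun.
        if last_pos is not None and pos - last_pos - 1 > max_gap:
            groups.append(current)
            current = []
        current.append(token)
        last_pos = pos
    if current:
        groups.append(current)
    return groups

tokens = "computer science department of the state university".split()
# Hypothetical noun set; a real system would obtain it from a tagger.
nouns = {"computer", "science", "department", "state", "university"}
groups = noun_groups(tokens, nouns, max_gap=1)
print(groups)  # [['computer', 'science', 'department'], ['state', 'university']]
```

Raising `max_gap` to 2 merges everything into a single group, so the threshold directly controls how long the extracted index phrases become.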
7Thesauri
- Peter Roget, 1988
- Example
- cowardly adj.
- Ignobly lacking in courage: cowardly turncoats
- Syns: chicken (slang), chicken-hearted, craven,
dastardly, faint-hearted, gutless, lily-livered,
pusillanimous, unmanly, yellow (slang),
yellow-bellied (slang).
- A controlled vocabulary for the indexing and
searching
8The Purpose of a Thesaurus
- To provide a standard vocabulary for indexing and
searching
- To assist users with locating terms for proper
query formulation
- To provide classified hierarchies that allow the
broadening and narrowing of the current query
request
9Thesaurus Term Relationships
- BT: broader
- NT: narrower
- RT: non-hierarchical, but related
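One way to picture these relationships is a small lookup table used to broaden or narrow a query. The entries below are hypothetical and only illustrative; they are not taken from any real thesaurus, and adding related terms to the query (rather than replacing the original term) is an assumed design choice.

```python
# Hypothetical thesaurus entries keyed by term, then by relationship type.
THESAURUS = {
    "retrieval": {
        "BT": ["information processing"],
        "NT": ["text retrieval", "image retrieval"],
        "RT": ["indexing"],
    },
}

def expand_query(terms, relation, thesaurus=THESAURUS):
    """Add, for each query term, the terms linked to it by the given relation."""
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, {}).get(relation, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

broader = expand_query(["retrieval"], "BT")
narrower = expand_query(["retrieval"], "NT")
print(broader)   # ['retrieval', 'information processing']
print(narrower)  # ['retrieval', 'text retrieval', 'image retrieval']
```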
10Term Selection
- Automatic Text Processing
- by G. Salton, Chap 9,
- Addison-Wesley, 1989.
11Automatic Indexing
- Indexing
- assign identifiers (index terms) to text
documents.
- Identifiers
- single-term vs. term phrase
- controlled vs. uncontrolled vocabularies:
instruction manuals, terminological schedules, ...
- objective vs. nonobjective text identifiers:
cataloging rules define, e.g., author names,
publisher names, dates of publications, ...
12Two Issues
- Issue 1: indexing exhaustivity
- exhaustive: assign a large number of terms
- nonexhaustive
- Issue 2: term specificity
- broad terms (generic): cannot distinguish relevant
from nonrelevant documents
- narrow terms (specific): retrieve relatively fewer
documents, but most of them are relevant
13Parameters of retrieval effectiveness
- Recall
- Precision
- Goal: high recall and high precision
14Retrieved Part
[Figure: a collection split into Relevant Items and Nonrelevant Items, with the Retrieved Part marked; regions labeled a, b, c, d]
15A Joint Measure
- F-score
- β is a parameter that encodes the relative
importance of recall and precision.
- β = 1: equal weight
- β < 1: precision is more important
- β > 1: recall is more important
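The slide's own formula did not survive extraction. The sketch below assumes the commonly used weighted form F = (1 + β²)PR / (β²P + R), which behaves as the bullets describe, together with the usual set-based definitions of precision and recall. The document identifiers are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f_score(precision, recall, beta=1.0):
    """Assumed form: (1 + beta^2) * P * R / (beta^2 * P + R); 0 if undefined."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical run: 4 documents retrieved, 5 relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6", "d7"])
f1 = f_score(p, r, beta=1.0)
f_half = f_score(p, r, beta=0.5)  # leans toward precision
f_two = f_score(p, r, beta=2.0)   # leans toward recall
print(p, r)  # 0.5 0.4
```

Because precision exceeds recall in this run, the precision-leaning score is the highest of the three and the recall-leaning score the lowest.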
16Choices of Recall and Precision
- Both recall and precision vary from 0 to 1.
- Particular choices of indexing and search
policies have produced variations in performance
ranging from 0.8 precision and 0.2 recall to 0.1
precision and 0.8 recall.
- In many circumstances, recall and precision values
that both lie between 0.5 and 0.6 are more
satisfactory for the average user.
17Term-Frequency Consideration
- Function words
- for example, "and", "or", "of", "but", ...
- the frequencies of these words are high in all
texts
- Content words
- words that actually relate to document content
- varying frequencies in the different texts of a
collection
- indicate term importance for content
18A Frequency-Based Indexing Method
- Eliminate common function words from the document
texts by consulting a special dictionary, or stop
list, containing a list of high-frequency
function words.
- Compute the term frequency tfij for all remaining
terms Tj in each document Di, specifying the
number of occurrences of Tj in Di.
- Choose a threshold frequency T, and assign to
each document Di all terms Tj for which tfij > T.
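The three steps above translate almost directly into code. The stop list and the sample text are hypothetical; the threshold comparison is the strict tfij > T from the slide.

```python
from collections import Counter

# Hypothetical stop list of high-frequency function words.
STOP_LIST = {"the", "of", "and", "a", "in", "is", "to"}

def index_terms(doc_tokens, threshold, stop_list=STOP_LIST):
    """Return {term: tf} for terms whose frequency exceeds the threshold."""
    # Steps 1 and 2: drop function words, then count the remaining terms.
    tf = Counter(t for t in doc_tokens if t not in stop_list)
    # Step 3: keep only the terms with tf > threshold.
    return {term: freq for term, freq in tf.items() if freq > threshold}

doc = "the index of the text and the index terms of the text collection".split()
assigned = index_terms(doc, threshold=1)
print(assigned)  # {'index': 2, 'text': 2}
```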
19Inverse Document Frequency
- Inverse Document Frequency (IDF) for term Tj:
idfj = log(N / dfj), where dfj (document frequency
of term Tj) is the number of documents in which Tj
occurs and N is the total number of documents.
- Terms that fulfil both the recall and the precision
requirements
- occur frequently in individual documents but
rarely in the remainder of the collection
20TFxIDF
- Weight wij of a term Tj in a document Di:
wij = tfij x log(N / dfj)
- Eliminating common function words
- Computing the value of wij for each term Tj in
each document Di
- Assigning to the documents of a collection all
terms with sufficiently high (tf x idf) factors
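A minimal sketch of the weighting step, assuming the plain form wij = tfij x log(N / dfj) with the natural logarithm and documents that are already tokenized and stripped of function words. The toy collection is made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["retrieval", "index", "index"],
    ["retrieval", "query"],
    ["retrieval", "thesaurus"],
]
w = tf_idf(docs)
print(w[0]["retrieval"])  # 0.0, the term occurs in every document
```

A term that appears in every document gets weight zero, while "index", which is frequent in one document and absent elsewhere, gets the highest weight in the first document.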
21Term-discrimination Value
- Useful index terms
- Distinguish the documents of a collection from
each other
- Document Space
- When two documents are assigned very similar term
sets, the corresponding points in the document
configuration appear close together
- When a high-frequency term without discrimination
is assigned, it will increase the document space
density
22A Virtual Document Space
[Figure: three panels showing the document space: Original State, After Assignment of good discriminator, After Assignment of poor discriminator]
23Good Term Assignment
- When a term is assigned to the documents of a
collection, the few objects to which the term is
assigned will be distinguished from the rest of
the collection.
- This should increase the average distance between
the objects in the collection and hence produce a
document space less dense than before.
24Poor Term Assignment
- A high-frequency term is assigned that does not
discriminate between the objects of a collection.
Its assignment will render the documents more
similar.
- This is reflected in an increase in document
space density.
25Term Discrimination Value
- Definition: dvj = Q - Qj, where Q and Qj are the
space densities before and after the
assignment of term Tj.
- dvj > 0: Tj is a good term; dvj < 0: Tj is a poor
term.
26Variations of Term-Discrimination Value with
Document Frequency
[Figure: dvj plotted along a document-frequency axis running up to N]
- Low frequency: dvj = 0
- Medium frequency: dvj > 0
- High frequency: dvj < 0
27TFij x dvj
- wij = tfij x dvj
- compared with wij = tfij x log(N / dfj)
- the idf factor log(N / dfj) decreases steadily
with increasing document frequency
- dvj increases from zero to positive as the
document frequency of the term increases, then
decreases sharply as the document frequency
becomes still larger.
28Document Centroid
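- Issue: efficiency problem, N(N-1) pairwise
similarities
- Document centroid: C = (c1, c2, c3, ..., ct),
with each cj computed from the weights wij, where
wij is the weight of the j-th term in document i.
- Space density: measured against the centroid
instead of over all document pairs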
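The centroid shortcut and the discrimination value can be sketched together. The formulas on the slide were lost, so the following are assumptions: the centroid is the per-term average of the document weights, similarity is the cosine, and the density is the average similarity between each document and the centroid. The small term-weight matrix is made up.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def density(docs):
    """Average similarity between each document vector and the centroid."""
    n = len(docs)
    centroid = [sum(column) / n for column in zip(*docs)]
    return sum(cosine(centroid, d) for d in docs) / n

def discrimination_value(docs, j):
    """dv_j = Q - Q_j: density before minus density after assigning term j."""
    before = [d[:j] + d[j + 1:] for d in docs]  # collection without term j
    return density(before) - density(docs)

# Rows are documents, columns are terms. Term 0 occurs in every document,
# terms 1 and 2 each occur in half of them.
docs = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
dv = [discrimination_value(docs, j) for j in range(3)]
print(dv)  # negative for term 0, positive for terms 1 and 2
```

Each density call needs only N similarity computations, compared with the N(N-1) pairwise similarities of the direct definition.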
29Probabilistic Term Weighting
- Goal: explicit distinctions between occurrences of
terms in relevant and nonrelevant documents of a
collection
- Definition: given a user query q, and the ideal
answer set of the relevant documents
- From decision theory, the best ranking algorithm
for a document D
30Probabilistic Term Weighting
- Pr(rel), Pr(nonrel): a document's a priori
probabilities of relevance and nonrelevance
- Pr(D|rel), Pr(D|nonrel): occurrence probabilities
of document D in the relevant and nonrelevant
document sets
31Assumptions
- Terms occur independently in documents
32Derivation Process
33For a specific document D
- Given a document D = (d1, d2, ..., dt)
- Assume di is either 0 (absent) or 1 (present).
- With xi denoting the value of di:
Pr(xi = 1 | rel) = pi, Pr(xi = 0 | rel) = 1 - pi
Pr(xi = 1 | nonrel) = qi, Pr(xi = 0 | nonrel) = 1 - qi
35Term Relevance Weight
36Issue
- How to compute pj and qj?
- pj = rj / R, qj = (dfj - rj) / (N - R)
- rj: the number of relevant documents that contain
term Tj
- R: the total number of relevant documents
- N: the total number of documents
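The estimates above can be plugged into a term-relevance weight. The weight formula itself is missing from the transcript, so this sketch assumes the usual form trj = log[pj(1 - qj) / (qj(1 - pj))]. The counts in the example are made up.

```python
import math

def term_relevance(r_j, df_j, R, N):
    """Assumed weight: log(p(1 - q) / (q(1 - p))), p = r_j/R, q = (df_j - r_j)/(N - R).

    Undefined (raises an error) when p or q is exactly 0 or 1.
    """
    p = r_j / R
    q = (df_j - r_j) / (N - R)
    return math.log(p * (1 - q) / (q * (1 - p)))

# Hypothetical collection: 1000 documents, 10 of them relevant.
rare_term = term_relevance(r_j=5, df_j=50, R=10, N=1000)     # log(21), about 3.04
common_term = term_relevance(r_j=5, df_j=500, R=10, N=1000)  # about 0
print(rare_term, common_term)
```

Both terms occur in half of the relevant documents, but the one that is rare among the nonrelevant documents receives the much higher weight.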
37Estimation of Term-Relevance
- The occurrence probability of a term in the
nonrelevant documents, qj, is approximated by the
occurrence probability of the term in the entire
document collection: qj = dfj / N
- The occurrence probabilities of the terms in the
small number of relevant documents are assumed
equal, using a constant value pj = 0.5 for all j.
38Comparison
When N is sufficiently large, N - dfj ≈ N, so the term-relevance weight under these estimates is approximately log(N / dfj) = idfj.
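The comparison can be written out as a short derivation. It assumes the usual term-relevance weight trj = log[pj(1 - qj) / (qj(1 - pj))], which is not preserved in this transcript, together with the estimates pj = 0.5 and qj = dfj / N from the previous slide.

```latex
\begin{align*}
tr_j &= \log\frac{p_j\,(1-q_j)}{q_j\,(1-p_j)}
      = \log\frac{0.5\,\left(1 - df_j/N\right)}{\left(df_j/N\right)\,0.5} \\
     &= \log\frac{N - df_j}{df_j}
      \approx \log\frac{N}{df_j} = idf_j
      \qquad \text{when } N \gg df_j
\end{align*}
```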
39Estimation of Term-Relevance
- Estimate the number of relevant documents rj in
the collection that contain term Tj as a function
of the known document frequency dfj of the term
Tj.
- pj = rj / R, qj = (dfj - rj) / (N - R), where R is an
estimate of the total number of relevant
documents in the collection.
40Summary
- Inverse document frequency, idfj
- tfij x idfj (TFxIDF)
- Term discrimination value, dvj
- tfij x dvj
- Probabilistic term weighting, trj
- tfij x trj
- Global properties of terms in a document
collection