Title: Text Operations
1. Text Operations
2. The Retrieval Process
3. Outline
- Document Preprocessing (7.1-7.2)
- Text Compression (7.4-7.5): skipped
- Automatic Indexing (Chap. 9, Salton)
- Term Selection
4. Document Preprocessing
- Lexical analysis
- Letters, digits, punctuation marks, …
- Stopword removal
- the, of, …
- Stemming
- Prefix, suffix
- Index term selection
- Nouns
- Construction of term categorization structures
- Thesaurus
5. Logical View of the Documents
[Figure: pipeline from docs to the logical view — accents/spacing normalization, stopword removal, stemming, noun groups, manual indexing; the document structure feeds the resulting index structure]
6. Lexical Analysis
- Converting a stream of characters into a stream of words
- Recognition of words
- Digits: usually not good index terms
- Ex. the number of deaths due to car accidents between 1910 and 1989, 510 B.C., credit card numbers, …
- Hyphens
- Ex. state-of-the-art, gilt-edge, B-49, …
- Punctuation marks: normally removed entirely
- Ex. 510B.C., program codes: x.id vs. xid, …
- The case of letters: usually not important
- Ex. Bank vs. bank, Unix-like operating systems, …
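The choices above can be sketched as a toy tokenizer; the regular expression and the specific policies (drop pure-digit tokens, split hyphenated forms, fold case) are illustrative assumptions, not a prescribed implementation:

```python
import re

def lexical_analysis(text):
    """Turn a character stream into index-term candidates (toy policy)."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    words = []
    for tok in tokens:
        for part in tok.split("-"):     # break hyphenated forms apart
            if part.isdigit():          # digits alone: usually poor index terms
                continue
            words.append(part.lower())  # case-fold: Bank vs. bank
    return words

print(lexical_analysis("State-of-the-art Unix-like systems in 1989."))
```

Whether to split or keep hyphenated forms (B-49 vs. b, 49) is exactly the kind of policy decision the slide flags; a real system would special-case such tokens.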
7. Elimination of Stopwords
- Stopwords: words that are too frequent among the documents in the collection are not good discriminators
- Articles, prepositions, conjunctions, …
- Some verbs, adverbs, and adjectives
- Goal: to reduce the size of the indexing structure
- Stopword removal might reduce recall
- Ex. "to be or not to be"
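A minimal sketch of stopword removal; the stop list here is a tiny illustrative sample (real lists contain a few hundred high-frequency words):

```python
# A tiny illustrative stop list; production lists hold a few hundred words.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "be", "or", "not", "in", "is"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the size of the indexing structure".split()))
# The slide's warning: every word of "to be or not to be" is a stopword,
# so the whole phrase disappears and can no longer be retrieved.
print(remove_stopwords("to be or not to be".split()))
```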
8. Stemming
- The substitution of words by their respective stems
- Ex. plurals, gerund forms, past-tense suffixes, …
- A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
- Ex. connect: connected, connecting, connection, connections
- Controversy about the benefits
- Useful for improving retrieval performance
- Reduces the size of the indexing structure
9. Stemming
- Four types of stemming strategies
- Affix removal, table lookup, successor variety, and n-grams (or term clustering)
- Suffix removal
- Porter's algorithm (available in the Appendix)
- Simplicity and elegance
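To make suffix removal concrete, here is a toy stemmer; it is NOT Porter's algorithm (which applies staged, condition-guarded rewrite rules), just a longest-match strip over an assumed suffix list:

```python
def crude_stem(word):
    """Toy suffix-removal stemmer: strip the first matching suffix,
    keeping at least a 3-letter stem.  Not Porter's algorithm."""
    for suffix in ("ations", "ation", "ions", "ing", "ion", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["connect", "connected", "connecting", "connection", "connections"]
print(sorted({crude_stem(w) for w in variants}))  # the slide's example family conflates
```

Even this crude version shows the payoff: five surface forms collapse into one index entry, shrinking the indexing structure.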
10. Index Term Selection
- Manually or automatically
- Identification of noun groups
- Most of the semantics is carried by the noun words
- Systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns
- A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold
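The noun-group definition can be sketched directly: nouns are grouped whenever the distance to the previous noun stays within a threshold. The part-of-speech tags and the threshold value are assumed inputs, not part of the slide's specification:

```python
def noun_groups(tagged_tokens, threshold=2):
    """Group nouns whose position is within `threshold` words of the
    previous noun in the group; POS tags are assumed to be given."""
    groups, current, last_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_tokens):
        if tag != "NOUN":
            continue
        if last_pos is not None and pos - last_pos > threshold:
            groups.append(current)   # gap too large: close the current group
            current = []
        current.append(word)
        last_pos = pos
    if current:
        groups.append(current)
    return groups

tagged = [("computer", "NOUN"), ("science", "NOUN"), ("is", "VERB"),
          ("widely", "ADV"), ("taught", "VERB"), ("degree", "NOUN")]
print(noun_groups(tagged))
```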
11. Thesauri
- Thesaurus: a reference to a treasury of words
- A precompiled list of important words in a given domain of knowledge
- For each word in this list, a set of related words
- Ex. synonyms, …
- It also involves normalization of vocabulary, and a structure
12. Example Entry in Peter Roget's Thesaurus
- cowardly (adjective)
- Ignobly lacking in courage: "cowardly turncoats"
- Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
13. Main Purposes of a Thesaurus
- To provide a standard vocabulary for indexing and searching
- To assist users with locating terms for proper query formulation
- To provide classified hierarchies that allow the broadening and narrowing of the current query request according to the user's needs
14. Motivation for Building a Thesaurus
- Using a controlled vocabulary for indexing and searching
- Normalization of indexing concepts
- Reduction of noise
- Identification of indexing terms with a clear semantic meaning
- Retrieval based on concepts rather than on words
- Ex. the term classification hierarchy in Yahoo!
15. Main Components of a Thesaurus
- Index terms: individual words, groups of words, phrases
- Concept
- Ex. missiles, ballistic
- Definition or explanation
- Ex. seal (marine animals) vs. seal (documents)
- Relationships among the terms
- BT (broader term), NT (narrower term)
- RT (related term): much more difficult to compile
- A layout design for these term relationships
- A list or a bi-dimensional display
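These components can be sketched as a small in-memory structure; the vocabulary, the relations, and the `broaden`/`narrow` helpers are invented for illustration:

```python
# Invented sample entries: BT (broader), NT (narrower), RT (related).
thesaurus = {
    "missile": {
        "BT": ["weapon"],
        "NT": ["ballistic missile"],
        "RT": ["rocket", "warhead"],   # RT links are the hardest to compile
    },
    "weapon": {"BT": [], "NT": ["missile"], "RT": []},
}

def broaden(term):
    """Replace a query term by its broader terms (query broadening)."""
    entry = thesaurus.get(term, {})
    return entry.get("BT") or [term]

def narrow(term):
    """Replace a query term by its narrower terms (query narrowing)."""
    entry = thesaurus.get(term, {})
    return entry.get("NT") or [term]

print(broaden("missile"), narrow("missile"))
```

Broadening and narrowing like this is exactly the query-reformulation use listed among the thesaurus purposes above.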
16. Automatic Indexing (Term Selection)
17. Automatic Indexing
- Indexing
- assign identifiers (index terms) to text documents
- Identifiers
- single-term vs. term phrase
- controlled vs. uncontrolled vocabularies: instruction manuals, terminological schedules, …
- objective vs. nonobjective text identifiers: under cataloging-rule control, e.g., author names, publisher names, dates of publication, …
18. Two Issues
- Issue 1: indexing exhaustivity
- exhaustive: assign a large number of terms
- nonexhaustive: cover only the main aspects of subject content
- Issue 2: term specificity
- broad terms (generic): cannot distinguish relevant from nonrelevant documents
- narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant
19. Recall vs. Precision
- Recall (R) = number of relevant documents retrieved / total number of relevant documents in the collection
- The proportion of relevant items that are retrieved
- Precision (P) = number of relevant documents retrieved / total number of documents retrieved
- The proportion of retrieved items that are relevant
- Example: for a query, e.g. "Taipei"
[Figure: Venn diagram of all docs, relevant docs, and retrieved docs]
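The two definitions in code, with an invented collection of 10 relevant documents and 8 retrieved ones:

```python
def recall_precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)     # relevant AND retrieved
    recall = hits / len(relevant)        # fraction of the relevant docs found
    precision = hits / len(retrieved)    # fraction of the retrieved docs that are relevant
    return recall, precision

relevant = {f"d{i}" for i in range(10)}                        # 10 relevant docs exist
retrieved = {"d0", "d1", "d2", "d3", "x1", "x2", "x3", "x4"}   # 8 returned, 4 relevant
print(recall_precision(retrieved, relevant))                   # (0.4, 0.5)
```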
20. More on Recall/Precision
- Simultaneously optimizing both recall and precision is not normally achievable
- Narrow and specific terms: precision is favored
- Broad and nonspecific terms: recall is favored
- When a choice must be made between term specificity and term breadth, the former is generally preferable
- A high-recall, low-precision result set will burden the user
- Lack of precision is more easily remedied than lack of recall
21. Term-Frequency Consideration
- Function words
- for example, "and", "of", "or", "but", …
- the frequencies of these words are high in all texts
- Content words
- words that actually relate to document content
- have varying frequencies in the different texts of a collection
- indicate term importance for content
22. A Frequency-Based Indexing Method
- Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words
- Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di
- Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T
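The three steps can be sketched as follows; the stop list and the threshold value are illustrative:

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "in", "is"}   # step 1: drop function words

def index_terms(doc_text, threshold=1):
    """Steps 2-3: count tf_ij for the remaining terms, keep tf_ij > threshold."""
    tf = Counter(w for w in doc_text.lower().split() if w not in STOP_LIST)
    return {term for term, freq in tf.items() if freq > threshold}

doc = "the cat sat and the cat ate the fish fish fish"
print(sorted(index_terms(doc)))   # ['cat', 'fish']
```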
23. More on Term Frequency
- High-frequency terms favor recall
- Ex. "Apple"
- But only if the term's occurrence frequency is not equally high in other documents
- Low-frequency terms favor precision
- Ex. "Huntington's disease"
- They are able to distinguish the few documents in which they occur from the many from which they are absent
24. How to Compute the Weight wij?
- Inverse document frequency idfj
- wij = tfij × idfj (TF×IDF)
- Term discrimination value dvj
- wij = tfij × dvj
- Probabilistic term weighting trj
- wij = tfij × trj
- idfj, dvj, and trj are global properties of terms in a document collection
25. Inverse Document Frequency
- Inverse document frequency (IDF) for term Tj: idfj = log(N / dfj), where dfj (the document frequency of term Tj) is the number of documents in which Tj occurs, and N is the total number of documents
- Fulfils both the recall and the precision goals
- Favors terms that occur frequently in individual documents but rarely in the remainder of the collection
26. TF×IDF
- Weight wij of a term Tj in a document Di: wij = tfij × log(N / dfj)
- Eliminate common function words
- Compute the value of wij for each term Tj in each document Di
- Assign to the documents of a collection all terms with sufficiently high (tf × idf) weights
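A minimal tf × idf computation over an invented three-document collection, using idfj = log(N / dfj):

```python
import math
from collections import Counter

def tfidf(docs):
    """w_ij = tf_ij * log(N / df_j) for each term in each document."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

docs = ["apple banana apple", "banana cherry", "cherry apple banana"]
weights = tfidf(docs)
# "banana" occurs in every document, so idf = log(3/3) = 0 and its weight vanishes
print(weights[0]["banana"], weights[0]["apple"])
```

Note how a term that appears in every document gets weight 0 regardless of its tf, which is exactly the discrimination behavior the slide asks for.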
27. Term-Discrimination Value
- Useful index terms distinguish the documents of a collection from each other
- Document space
- Each point represents a particular document of a collection
- The distance between two points is inversely proportional to the similarity between the respective term assignments
- When two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together
28A Virtual Document Space
After Assignment of good discriminator
After Assignment of poor discriminator
Original State
29. Good Term Assignment
- When a term is assigned to the documents of a collection, the few documents to which the term is assigned will be distinguished from the rest of the collection
- This should increase the average distance between the documents in the collection and hence produce a document space less dense than before
30. Poor Term Assignment
- A high-frequency term is assigned that does not discriminate between the objects of a collection
- Its assignment will render the documents more similar to each other
- This is reflected in an increase in document space density
31. Term-Discrimination Value
- Definition: dvj = Q - Qj, where Q and Qj are the space densities before and after the assignment of term Tj
- Space density: the average pairwise similarity between all pairs of distinct documents
- dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term
32. Variations of Term-Discrimination Value with Document Frequency
- Low document frequency: dvj ≈ 0
- Medium document frequency: dvj > 0
- High document frequency (approaching N): dvj < 0
33. tfij × dvj
- wij = tfij × dvj
- Compared with tfij × idfj:
- idfj decreases steadily with increasing document frequency
- dvj increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger
- Issue: the efficiency problem of computing N(N-1) pairwise similarities
34. Document Centroid
- Document centroid: C = (c1, c2, c3, ..., ct), where cj is the sum over all documents Di of the weights wij of term Tj
- A dummy "average" document located in the center of the document space
- Space density: the average similarity between each document and the centroid, which avoids the N(N-1) pairwise computation
35. Probabilistic Term Weighting
- Goal: make explicit distinctions between occurrences of terms in the relevant and nonrelevant documents of a collection
- Definition: given a user query q, and the ideal answer set of the relevant documents
- From decision theory, the best ranking algorithm for a document D ranks by g(D) = log [Pr(D|rel) / Pr(D|nonrel)] + log [Pr(rel) / Pr(nonrel)]
36. Probabilistic Term Weighting
- Pr(rel), Pr(nonrel): the documents' a priori probabilities of relevance and nonrelevance
- Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets
37. Assumptions
- Terms occur independently in documents
38. Derivation Process
- Under the term-independence assumption, Pr(D|rel) = Πi Pr(xi = di | rel), and similarly for Pr(D|nonrel); substituting into the ranking function and dropping document-independent constants leaves a sum of per-term weights trj over the terms present in D
39. For a Specific Document D
- Given a document D = (d1, d2, ..., dt)
- Assume di is either 0 (absent) or 1 (present)
- Pr(xi = 1 | rel) = pi, Pr(xi = 0 | rel) = 1 - pi
- Pr(xi = 1 | nonrel) = qi, Pr(xi = 0 | nonrel) = 1 - qi
41. Term Relevance Weight
- trj = log [ pj (1 - qj) / ( qj (1 - pj) ) ]
42. Issue
- How to compute pj and qj?
- pj = rj / R, qj = (dfj - rj) / (N - R)
- rj: the number of relevant documents that contain term Tj
- R: the total number of relevant documents
- N: the total number of documents
43. Estimation of Term Relevance
- The occurrence probability of a term in the nonrelevant documents, qj, is approximated by the occurrence probability of the term in the entire document collection: qj = dfj / N
- The large majority of documents will be nonrelevant to the average query
- The occurrence probabilities of the terms in the small number of relevant documents are assumed to be equal, by using a constant value pj = 0.5 for all j
44. Comparison
- With pj = 0.5 and qj = dfj / N: trj = log [ 0.5 (1 - qj) / (0.5 qj) ] = log [ (N - dfj) / dfj ]
- When N is sufficiently large, N - dfj ≈ N, so trj ≈ log (N / dfj) = idfj
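The exact term-relevance weight and its idf-like approximation side by side; the counts below are invented for illustration:

```python
import math

def term_relevance(r_j, R, df_j, N):
    """Exact tr_j = log[p(1-q) / (q(1-p))] with p = r_j/R, q = (df_j-r_j)/(N-R)."""
    p = r_j / R
    q = (df_j - r_j) / (N - R)
    return math.log(p * (1 - q) / (q * (1 - p)))

def idf_like(df_j, N):
    """Approximation with p_j = 0.5, q_j = df_j/N: tr_j = log((N - df_j)/df_j)."""
    return math.log((N - df_j) / df_j)

# A term in 50 of 100,000 docs, 5 of the 10 known-relevant ones:
print(term_relevance(5, 10, 50, 100_000))             # concentrated in relevant docs: positive
print(idf_like(50, 100_000), math.log(100_000 / 50))  # nearly identical for large N
```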