Title: Text Operations
1 Chapter 7: Text Operations

2 Text Operations
- Text operations are
  - a pre-processing step on docs in a collection to determine representative index terms, a process of (i) controlling the size of index terms and (ii) improving retrieval performance
- Useful text operations include
  - elimination of stopwords
  - stemming (reduce each word to its grammatical root by removing affixes, i.e., suffixes and prefixes)
  - building a thesaurus (represent conceptual term relationships; construct term categorization structures)
  - performing compression (reduce query response time)
- Drawback: text operations may yield unexpected results for phrase queries
3 Basic Concepts
- Logical view of the documents
  - the (internal) structure in a document (e.g., chapter and section)
- Document representation viewed as a continuum: the logical view of documents might shift from a full-text representation to a higher-level representation
4 Lexical Analysis
- Lexical analysis is
  - the process of converting an input stream of characters into a stream of words (or tokens)
  - the first stage of automatic indexing and query processing
- Query processing is the activity of (i) analyzing a query and (ii) comparing it to indexes to find relevant items
- Design a lexical analyzer to extract tokens that exclude
  - digits: a number by itself doesn't make a good index term
  - hyphens: breaking hyphenated terms apart helps with inconsistent usage, but loses the specificity of a phrase
  - punctuation: often used as part of index terms
  - case: usually insignificant in index terms, but may be important in some situations and should be preserved
5 Lexical Analysis
- Policies to be considered
  - recognizing numbers as tokens
    - (-) adds many terms with poor discrimination value to indexing
    - (+) maybe a good policy if exhaustive searching is important
  - breaking up hyphenated terms (increases recall but decreases precision)
  - preserving case distinctions (enhances precision but decreases recall, e.g., in an author field)
- The cost of lexical analysis is high
  - it requires examination of each input character
  - it can account for 50% of the computational expense of compilation
- Solutions
  - specify the exceptions through regular expressions, or
  - consider the full-text search/indexing strategy
- A tokenizer sketch illustrating these policies follows.
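A minimal Python sketch of such a lexical analyzer, treating each policy above as a switch (the function name, defaults, and sample text are illustrative assumptions, not from the source):

```python
import re

def tokenize(text, keep_numbers=False, split_hyphens=True, keep_case=False):
    """Convert a character stream into tokens; each policy is a switch."""
    if not keep_case:
        text = text.lower()            # case: usually insignificant
    if split_hyphens:
        text = text.replace("-", " ")  # raises recall, loses phrase specificity
    tokens = re.findall(r"[a-zA-Z0-9]+", text)
    if not keep_numbers:
        # a number by itself makes a poor index term
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("State-of-the-art retrieval in 1994"))
# -> ['state', 'of', 'the', 'art', 'retrieval', 'in']
```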
6 Stoplists and Stopwords
- Candidate index terms are often checked to see whether they are in a stoplist, or negative dictionary, which includes articles, prepositions, and conjunctions
- Stoplist words (such as "the", "of", "and", "for") are
  - known to make poor or worthless index terms: their discrimination value is low
  - a large fraction (20-40%) of the text in documents
  - immediately removed from consideration as index terms
- Eliminating stopwords
  - speeds processing
  - saves huge amounts of space in indexing
  - doesn't damage retrieval effectiveness
- Not all frequently occurring words are stopwords, e.g., the 200 most frequently occurring words include "war", "time", etc.
7 Stoplists and Stopwords
- Stoplist policy depends on
  - the database/IR system (commercial IR systems are very conservative, with few stopwords)
  - features of the users
  - the indexing process
- Implementation of stoplists
  - filtering stopwords from the lexical analyzer's output, e.g., using hashing (fast, but slowed down by re-computing the hash value of each token and resolving collisions)
  - removing stopwords as part of the lexical analysis process
    - at almost no extra cost
    - can be automated, which is easier/less error-prone than filtering stopwords manually
- A set-based filtering sketch follows.
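A minimal sketch of the filtering approach, using a Python set for the hashed lookups mentioned above (the stoplist contents are illustrative; real stoplists run to hundreds of words):

```python
# Illustrative stoplist; a hashed set makes each membership test O(1).
STOPLIST = frozenset({"the", "of", "and", "for", "a", "an", "in", "to"})

def filter_stopwords(tokens):
    """Drop candidate index terms that appear in the stoplist."""
    return [t for t in tokens if t.lower() not in STOPLIST]

print(filter_stopwords(["the", "costs", "of", "war", "and", "time"]))
# -> ['costs', 'war', 'time']
```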
8 Stoplists and Stopwords

9 Stemming
- A stem is the portion of a word that remains after reducing variants of the same root word to a common form (concept)
- Stemming algorithms are programs that relate morphologically similar indexing and search terms
- Stemming (conflation, i.e., fusing or combining) provides a way of finding morphological variants of search terms
- Stemming is used to improve retrieval effectiveness and to reduce the size of indexing files
- Since a single stem corresponds to several full terms, by storing stems instead of terms, compression factors of over 50% can be achieved
- Terms can be stemmed at indexing time (efficient, but requires extra storage) or at search time
10 Stemming in an IR System
11 Stemming Algorithms
- Goals
  - different words with the same base meaning should be conflated to the same form
  - words with distinct meanings should be kept separate
- Criteria for judging stemmers
  - Correctness
    - overstemming: too much of a term is removed, which causes unrelated terms to be conflated (effect: irrelevant documents are retrieved)
    - understemming: too little of a term is removed, which prevents related terms from being conflated (effect: relevant documents cannot be retrieved)
  - Retrieval effectiveness: measured by precision, recall, speed, and size
  - Compression performance
12 Stemming
- Conflation methods
  - Affix Removal: removes suffixes and/or prefixes from terms to yield a stem
  - Successor Variety: uses the frequencies of letter sequences for stemming; variants are Cutoff, Peak and Plateau, Complete Word, and Entropy
  - Table Lookup: stores terms and their corresponding stems in a lookup table
  - N-gram: conflates terms based on the number of shared substrings of length N
13 Stemming Algorithms
- Types of Stemming
- 1) Table lookup stemmers
  - store all index terms and their stems in a table
  - terms from queries/indexes are stemmed via table lookup
  - Advantages: lookups are fast, e.g., using a B+-tree/hash function
  - Disadvantages: domain-dependent DBs; storage overhead (trading size for time)
- A dictionary-based sketch follows.
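A minimal sketch of table-lookup stemming; the table entries and the fallback policy are assumptions for illustration (Python's dict plays the role of the hash table):

```python
# Terms and their stems are stored ahead of time; stemming is one lookup.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term):
    # Unknown terms fall back to themselves (an assumed policy).
    return STEM_TABLE.get(term, term)

print(lookup_stem("engineering"))  # -> engineer
```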
14 Stemming Algorithms
- Types of Stemming (continued)
- 2) Successor variety stemmers
  - the successor variety of a string S is the number of different characters that follow S in words of a text
  - determine word boundaries based on the distribution of letters, e.g., given the word "apple" and a text T that contains "able", "axe", "ape", and "accept":
    - the successor variety of "a" is 4, i.e., b, x, p, c
    - the successor variety of "ap" is 1, i.e., e
    - the successor variety of "app" is 0
  - the successor variety (SV) of substrings of a term decreases as more characters are added, until a segment boundary is reached, where SV sharply increases
  - when a segment boundary is reached, a stem is identified
- A helper that computes successor variety follows.
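A small helper reproducing the slide's example (the function name is an assumption):

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the corpus words."""
    i = len(prefix)
    return len({w[i] for w in corpus if w.startswith(prefix) and len(w) > i})

T = ["able", "axe", "ape", "accept"]      # the slide's text T
for p in ["a", "ap", "app"]:
    print(p, successor_variety(p, T))     # a 4, ap 1, app 0
```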
15 Stemming Algorithms
- Types of Stemming
- Successor variety stemmers (continued)
  - a. Cutoff Method
    - some cutoff value is selected for successor varieties to identify boundaries
    - drawback: if the cutoff value is too small, incorrect cuts will be made; if it is too large, correct cuts will be missed
16 Stemming Algorithms
- Types of Stemming
- Successor variety stemmers (continued)
  - b. Peak and Plateau Method: removes the need for the cutoff value
    - a segment break is identified after a character whose successor variety exceeds that of the character
      - immediately preceding it, and
      - immediately following it
  - Example. Let "readable" be the test word, and let the corpus be {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}

    Prefix     Successor Variety   Letters
    R          3                   e, i, o
    RE         2                   a, d
    REA        1                   d
    READ       3                   a, i, s
    READA      1                   b
    READAB     1                   l
    READABL    1                   e
    READABLE   0                   (blank)

  - Result: "readable" is segmented into "read" and "able" (READ is a peak: its successor variety, 3, exceeds that of REA and READA); a segmentation sketch follows.
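A sketch of peak-and-plateau segmentation over the slide's corpus (helper and function names are assumptions):

```python
def successor_variety(prefix, corpus):
    i = len(prefix)
    return len({w[i] for w in corpus if w.startswith(prefix) and len(w) > i})

def peak_plateau_segments(word, corpus):
    """Break after a prefix whose SV exceeds that of the prefixes
    immediately preceding and following it."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    breaks = [i + 1 for i in range(1, len(sv) - 1)
              if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    segments, start = [], 0
    for b in breaks:
        segments.append(word[start:b])
        start = b
    segments.append(word[start:])
    return segments

corpus = ["able", "ape", "beatable", "fixable", "read",
          "readable", "reading", "reads", "red", "rope", "ripe"]
print(peak_plateau_segments("readable", corpus))  # -> ['read', 'able']
```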
17 Stemming Algorithms
- Types of Stemming (continued)
  - c. Complete Word (Segmentation) Method
    - a segment break is identified if a segment is a complete word in the corpus
    - the choice of the 1st or 2nd segment as the stem, e.g.,
      - if (the 1st segment occurs in ≤ 12 words in the corpus), then the 1st segment is the stem
      - else the 2nd segment is the stem, i.e., the 1st segment is a prefix
  - Example. Let "readable" be the test word, and let the corpus be {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}. Using the Complete Word (Segmentation) Method, "readable" is segmented into "read" and "able", the same result as in the Peak and Plateau Method.
18 Stemming Algorithms
- Types of Stemming (continued)
  - d. Entropy Method
    - considers the distribution of successor variety letters
    - Approach
      - let D_αi be the number of words in a text body beginning with the i-length sequence of letters α
      - let D_αij be the number of words in D_αi with the successor letter j
      - D_αij / D_αi is the probability that a member of D_αi has the successor j
      - the entropy value of D_αi is
        E_αi = -Σ_j (D_αij / D_αi) · log2 (D_αij / D_αi)
    - using the entropy values, a cutoff value is identified and thus the boundary of a word
  - A sketch of the entropy computation follows.
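A sketch of the computation under one simplifying assumption: words exactly equal to the prefix (which have no successor letter) are excluded from D_αi, so the probabilities sum to 1:

```python
import math
from collections import Counter

def successor_entropy(prefix, corpus):
    """E = -sum_j (D_ij / D_i) * log2(D_ij / D_i) over successor letters j."""
    i = len(prefix)
    successors = Counter(w[i] for w in corpus
                         if w.startswith(prefix) and len(w) > i)
    d_i = sum(successors.values())   # D_i (assumed: words with a successor)
    if d_i == 0:
        return 0.0
    return -sum((d_ij / d_i) * math.log2(d_ij / d_i)
                for d_ij in successors.values())

corpus = ["able", "ape", "beatable", "fixable", "read",
          "readable", "reading", "reads", "red", "rope", "ripe"]
print(successor_entropy("read", corpus))  # log2(3) ~ 1.585: likely boundary
```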
19 Stemming Algorithms
- 3) N-gram stemmers
  - calculate association measures between pairs of terms based on shared unique substrings of N consecutive letters; a term-clustering procedure
  - the similarity measure S based on unique digrams is computed as
    S = 2C / (A + B)
    where
    A = the number of unique digrams in the 1st word
    B = the number of unique digrams in the 2nd word
    C = the number of unique digrams shared by the two words
20 Stemming Algorithms
- Example (N-gram stemmers). The terms "statistics" and "statistical" can be broken into digrams as
  - statistics → st ta at ti is st ti ic cs
    - unique digrams: st ta at ti is ic cs (7)
  - statistical → st ta at ti is st ti ic ca al
    - unique digrams: st ta at ti is ic ca al (8)
  - S = 2C / (A + B), so S = (2 × 6) / (7 + 8) = 0.8
  - "statistics" and "statistical" are assigned to a single cluster if a cutoff similarity value of 0.6 is used; a sketch of this computation follows.
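A sketch reproducing this computation (Dice's coefficient over unique digrams; function names are assumptions):

```python
def digrams(word):
    """The set of unique digrams (adjacent letter pairs) of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(w1, w2):
    """S = 2C / (A + B) over unique digrams."""
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(digram_similarity("statistics", "statistical"))  # -> 0.8
```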
21 Stemming Algorithms
- The similarity matrix of N-gram stemmers
  - similarity matrix: a (symmetric) matrix that holds the similarity measure for each pair of terms in the system
  - symmetric means S_ij = S_ji
22 Stemming Algorithms
- 4) Affix Removal Stemmers
  - remove suffixes/prefixes from terms, leaving a stem
  - resultant stems may also be transformed
  - Example. Remove the plurals from terms (rules are considered in the given order, i.e., use only the 1st applicable rule):
    - if a word ends in "ies" but not "eies" or "aies", then replace "ies" by "y" (e.g., skies → sky)
    - if a word ends in "es" but not "aes", "ees", or "oes" (e.g., goes), then replace "es" by "e" (e.g., eyes → eye)
    - if a word ends in "s" but not "us" or "ss" (e.g., bus and kiss), then remove "s" (e.g., cars → car)
  - These rules are sketched in code below.
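The three rules in code, applied first-match-wins as the slide specifies (the function name is an assumption):

```python
def remove_plural(word):
    """Apply the slide's plural rules in order; only the 1st applicable fires."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # skies -> sky
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"      # eyes -> eye
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # cars -> car
    return word                     # bus, kiss -> unchanged

for w in ["skies", "eyes", "bus", "kiss", "cars"]:
    print(w, "->", remove_plural(w))
```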
23 Stemming Algorithms
- 4a) Affix (Simple) Removal Stemmers
  - Techniques
    - Recoding: a context-sensitive transformation of the form AxC → AyC, where A and C specify the context of the transformation, x is the input string, and y is the transformed string; e.g., ski → sky, where "i" is the input string and "y" is the transformed string
    - Partial matching: the n initial characters of stems are used in comparison; two stems are equivalent if they agree in all but their last characters, e.g., "transmission" and "transmit" match for n = 7
24 Stemming Algorithms
- 4b) Affix (Longest Match) Removal Stemmers (Porter's Algorithm)
  - consists of a set of condition/action rules that are evaluated in the specified order so as to remove the longest possible suffix
  - Notations
    - a vowel, denoted v, is A, E, I, O, or U; Y is a vowel if it is preceded by a consonant, otherwise it is a consonant
    - a consonant, denoted c, is a letter other than A, E, I, O, or U, with the exception of Y as above
    - a list ccc... of length greater than 0 is denoted by C
    - a list vvv... of length greater than 0 is denoted by V
    - any word, or part of a word, has one of the following forms: CVCV...C, CVCV...V, VCVC...C, or VCVC...V, which can be represented by the single form [C]VCVC...[V]
25 Stemming Algorithms
- 4b) Affix (Longest Match) Removal Stemmers (Porter's Algorithm)
  - Conditions used in the rules:
    - 1) the measure, m ≥ 0, of a stem is defined by writing the stem as C(VC)^m V, where C is a sequence of consonants (i.e., non-vowel letters) and V is a sequence of vowels; e.g., BY (m = 0), TREES (m = 1), PRIVATE (m = 2)
    - 2) <X>: the stem ends with a given letter X, e.g., (m > 1 and (<S> or <T>)) in Row 1 of the rule table (next page)
    - 3) *v*: the stem contains a vowel, e.g., Row 2 of the rule table (next page)
26 Stemming Algorithms
- 4b) Porter's Algorithm (continued)
  - 4) *d: the stem ends in a double consonant, e.g., Row 3 of the rule table
  - 5) *o: the stem ends with a consonant-vowel-consonant (cvc, single letters rather than the sequences C, V, C) pattern, where the final consonant is not w, x, or y, e.g., Row 4 of the rule table

    Conditions (on the potential stem)   Suffix (S1)   Replacement (S2)   Example (S1 → S2)
    m > 1 and (<S> or <T>)               ion           NULL               adoption → adopt
    *v*                                  ing           NULL               motoring → motor; sing → sing
    m > 1 and *d and <L>                 NULL          single letter      controll → control; roll → roll
    m = 1 and not *o                     e             NULL               cease → ceas
27 Stemming Algorithms
- 4b) Porter's Algorithm (http://www.tartarus.org/martin/)

28 Stemming Algorithms
- 4b) Porter's Algorithm (continued)

29 Stemming Algorithms
- 4b) Porter's Algorithm (continued)
- A sketch of rule evaluation in code follows.
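A sketch of how the measure m and the four sample rules from the table can be evaluated; this implements only those rows, not Porter's full multi-step algorithm, and all names are assumptions:

```python
import re

def is_consonant(word, i):
    """Porter: a consonant is any letter other than a, e, i, o, u,
    and other than y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC)^m[V]: the number of VC transitions in the stem."""
    forms = "".join("c" if is_consonant(stem, i) else "v"
                    for i in range(len(stem)))
    return len(re.findall(r"v+c", forms))

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_cvc(stem):
    """*o: final consonant-vowel-consonant, last consonant not w, x, or y."""
    if len(stem) < 3 or stem[-1] in "wxy":
        return False
    n = len(stem)
    return (is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1))

def apply_table_rules(word):
    # Row 1: (m > 1 and (<S> or <T>)) ion -> NULL
    if word.endswith("ion"):
        s = word[:-3]
        if measure(s) > 1 and s[-1:] in ("s", "t"):
            return s
    # Row 2: (*v*) ing -> NULL
    if word.endswith("ing") and contains_vowel(word[:-3]):
        return word[:-3]
    # Row 3: (m > 1 and *d and <L>) double letter -> single letter
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    # Row 4: (m = 1 and not *o) e -> NULL
    if word.endswith("e"):
        s = word[:-1]
        if measure(s) == 1 and not ends_cvc(s):
            return s
    return word

for w in ["adoption", "motoring", "sing", "controll", "roll", "cease"]:
    print(w, "->", apply_table_rules(w))
# adoption -> adopt, motoring -> motor, sing -> sing,
# controll -> control, roll -> roll, cease -> ceas
```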
30 Thesauri
- A term thesaurus ("treasury of words") refines/broadens the interpretation of terms
- Uses similar or closely related terms:
  - synonyms and antonyms (i.e., related words) for each word
  - broader and narrower query terms using classified hierarchies
- A thesaurus, which broadens the vocabulary terms, enhances the recall performance in retrieval
- Can be used during
  - document storage processing, by replacing each term variant w/ a standard term based on the thesaurus
  - query processing, to broaden a query to ensure that relevant documents are not missed
- Problems to be dealt with: homographs (2 words w/ distinct meanings but identical spellings, e.g., "Mr. Post" and "post office")
31 Thesauri
- Thesaurus classes are groups or categories of terms used in a given topic area
- A term match would result through the thesaurus transformation
32 Thesauri
- Thesauri can be constructed manually, semi-automatically, or fully automatically
- Problems arise during construction:
  - what terms should be included in the thesaurus
  - terms specified for inclusion must be suitably grouped
33 Thesauri
- Automatic Thesaurus Construction
  - use a set of document vectors and represent a document collection by a matrix
  - the rows of the matrix represent the individual document vectors
  - the columns identify the term assignments/weights in the docs
34 Thesauri
- Similarity measures between pairs of terms
  - let TERMk and TERMh be two terms in a collection of docs
  - let tik be the weight of TERMk in document i of the collection
  - let n be the number of documents in the collection
  - the similarity measure of TERMk and TERMh can be defined as
    SIM(TERMk, TERMh) = Σ_i (tik × tih)
  - a normalized similarity measure of TERMk and TERMh is used to limit SIM to values between 0 and 1
35 Thesauri
- Similarity measures between pairs of terms
  - a term-term association matrix can be constructed by comparing all pairs of columns in the document-vector matrix
  - let SIM(TERMi, TERMj) be the similarity measure of terms i and j
- A matrix-based sketch follows.
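A sketch with NumPy (an assumed tool; the weights are made-up illustrations). The association matrix over all column pairs is just DᵀD, and a cosine normalization, one plausible choice for the unspecified normalized measure, bounds the values to [0, 1]:

```python
import numpy as np

# Document-term weight matrix: rows = documents i, columns = terms k.
D = np.array([[0.5, 0.0, 0.3],
              [0.0, 0.8, 0.6],
              [0.7, 0.2, 0.0]])

# SIM(TERM_k, TERM_h) = sum_i t_ik * t_ih for every pair of columns:
sim = D.T @ D                       # symmetric: SIM_kh = SIM_hk

# Assumed normalization (cosine) to keep values in [0, 1]:
norms = np.linalg.norm(D, axis=0)
sim_norm = sim / np.outer(norms, norms)

print(sim_norm.round(2))            # diagonal entries equal 1
```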
36 Thesauri
- Use of thesauri
- A thesaurus can be used to broaden the existing
indexing vocabulary by - replacing the initial terms w/ the corresponding
thesaurus class, or - adding the thesaurus class identifiers to the
original terms - Example.
37 Thesauri
- Use of thesauri
- Term associations and thesaurus classes can be
displayed to help the IR system users in - formulating the search queries
- familiarizing themselves with the vocabulary
- An attractive format of display is a graph-like
structure
38 Thesauri
- Use of thesauri
  - Maintenance problem: use of a thesaurus requires maintenance
    - rebuilding: results of user interaction w/ the system, e.g., new user populations and interests, require new vocabulary terms
    - accommodating collection growth: adding new documents requires update strategies:
      - (a) the original thesauri are left unchanged during further expansion
      - (b) new terms derived from the added items are placed into existing thesaurus categories
      - (c) new terms are placed into separate new classes
      - (d) the thesauri are completely restructured by generating a term classification from the updated vocabulary
  - Option (a) produces a loss of performance,
  - Option (d) is expensive, and
  - Options (b) and (c) yield no clear-cut (better) choice
39 Thesauri
- Automatic Classification/Clustering Method
  - use the term-term similarity matrix to construct classes of similar terms (→ thesaurus classes) by collecting all terms whose similarity coefficients are sufficiently large, i.e., exceed a threshold value
  - Refinement: for each thesaurus class TC with m term vectors, define the term-centroid ⟨c̄1, c̄2, ...⟩, which is the average vector of the term vectors of TC, i.e., its i-th component is the average of the corresponding TERMk weights:
    c̄i = (1/m) Σ_{TERMk in TC} tik
40 Thesauri
- Automatic Classification/Clustering Method (continued)
  - the term-centroids can be used to refine thesaurus classes by computing the similarity between TERMk and the term-centroid of every thesaurus class
  - assume that there are t term vectors distributed into p classes; generate a similarity-coefficient matrix of dimension t × p:
    SIMILAR_CO(TERMk, TERM-CENTROIDh), where 1 ≤ k ≤ t and 1 ≤ h ≤ p
  - each term vector is then assigned to the class whose TERM-CENTROID it is most similar to
  - if a given term vector switches from one class to another, the centroids of the involved classes must be recomputed; this repeats until no further class changes occur
- A sketch of this refinement loop follows.
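A sketch of the refinement loop with NumPy; the data, the cosine similarity used for SIMILAR_CO, and the stopping test are illustrative assumptions:

```python
import numpy as np

terms = np.array([[0.9, 0.1, 0.0],          # t = 4 term vectors (rows)
                  [0.8, 0.2, 0.1],
                  [0.1, 0.9, 0.7],
                  [0.0, 0.8, 0.9]])
labels = np.array([0, 1, 0, 1])             # initial split into p = 2 classes

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

changed = True
while changed:                              # until no further class changes
    # term-centroid of each class: average of its member term vectors
    centroids = np.array([terms[labels == h].mean(axis=0) for h in range(2)])
    # SIMILAR_CO(TERM_k, TERM-CENTROID_h): a t x p coefficient matrix
    sim = np.array([[cosine(tv, c) for c in centroids] for tv in terms])
    new_labels = sim.argmax(axis=1)         # assign to the closest centroid
    changed = not np.array_equal(new_labels, labels)
    labels = new_labels                     # a switch forces recomputation

print(labels)                               # -> [0 0 1 1]
```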