Title: Text Operations
1 Chapter 7: Text Operations

2 Text Operations
- Text operations are
  - a pre-processing step on docs in a collection to determine representative index terms, a process of (i) controlling the size of index terms and (ii) improving retrieval performance
- Useful text operations include
  - elimination of stopwords
  - stemming (reduce each word to its grammatical root by removing affixes, i.e., suffixes and prefixes)
  - building a thesaurus (represent conceptual term relationships; construct term categorization structures)
  - performing compression (reduce query response time)
- Drawback: text operations may yield unexpected results for phrase queries
3 Basic Concepts
- Logical view of the documents
  - the (internal) structure in a document (e.g., chapter and section)
- Document representation viewed as a continuum: the logical view of documents might shift from a full-text representation to a higher-level representation
4 Lexical Analysis
- Lexical analysis is
  - the process of converting an input stream of characters into a stream of words (or tokens)
  - the first stage of automatic indexing and query processing
- Query processing is the activity of (i) analyzing a query and (ii) comparing it to indexes to find relevant items
- Design a lexical analyzer to extract tokens that exclude
  - digits: a number by itself doesn't make a good index term
  - hyphens: breaking hyphenated terms apart helps with inconsistent usage, but loses the specificity of a phrase
  - punctuation: often used as part of index terms
  - case: usually insignificant in index terms, but may be important in some situations and should be preserved
5 Lexical Analysis
- Policies to be considered
  - recognizing numbers as tokens
    - (-) adds many terms with poor discrimination value to indexing
    - (+) maybe a good policy if exhaustive searching is important
  - breaking up hyphenated terms (increases recall but decreases precision)
  - preserving case distinctions (enhances precision but decreases recall, e.g., in an author field)
- The cost of lexical analysis is high
  - it requires examination of each input character
  - it can account for 50% of the computational expense of compilation
- Solutions
  - specify the exceptions through regular expressions, or
  - consider the full-text search/indexing strategy
- A tokenizer sketch illustrating these policies follows.
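A minimal Python sketch of such a lexical analyzer, treating each policy above as a switch (the function name, defaults, and sample text are illustrative assumptions, not from the source):

```python
import re

def tokenize(text, keep_numbers=False, split_hyphens=True, keep_case=False):
    """Convert a character stream into tokens; each policy is a switch."""
    if not keep_case:
        text = text.lower()            # case: usually insignificant
    if split_hyphens:
        text = text.replace("-", " ")  # raises recall, loses phrase specificity
    tokens = re.findall(r"[a-zA-Z0-9]+", text)
    if not keep_numbers:
        # a number by itself makes a poor index term
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("State-of-the-art retrieval in 1994"))
# -> ['state', 'of', 'the', 'art', 'retrieval', 'in']
```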
6 Stoplists and Stopwords
- Candidate index terms are often checked to see whether they are in a stoplist, or negative dictionary, which includes articles, prepositions, and conjunctions
- Stoplist words (such as "the", "of", "and", "for") are
  - known to make poor or worthless index terms: their discrimination value is low
  - a large fraction (20-40%) of the text in documents
  - immediately removed from consideration as index terms
- Eliminating stopwords
  - speeds processing
  - saves huge amounts of space in indexing
  - doesn't damage retrieval effectiveness
- Not all frequently occurring words are stopwords, e.g., the 200 most frequently occurring words include "war", "time", etc.
7 Stoplists and Stopwords
- Stoplist policy depends on
  - the database/IR system (commercial IR systems are very conservative, with few stopwords)
  - features of the users
  - the indexing process
- Implementation of stoplists
  - filtering stopwords from the lexical analyzer's output, e.g., using hashing (fast, but slowed down by re-computing the hash value of each token and resolving collisions)
  - removing stopwords as part of the lexical analysis process
    - at almost no extra cost
    - can be automated, which is easier/less error-prone than filtering stopwords manually
- A set-based filtering sketch follows.
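A minimal sketch of the filtering approach, using a Python set for the hashed lookups mentioned above (the stoplist contents are illustrative; real stoplists run to hundreds of words):

```python
# Illustrative stoplist; a hashed set makes each membership test O(1).
STOPLIST = frozenset({"the", "of", "and", "for", "a", "an", "in", "to"})

def filter_stopwords(tokens):
    """Drop candidate index terms that appear in the stoplist."""
    return [t for t in tokens if t.lower() not in STOPLIST]

print(filter_stopwords(["the", "costs", "of", "war", "and", "time"]))
# -> ['costs', 'war', 'time']
```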
8 Stoplists and Stopwords

9 Stemming
- A stem is the portion of a word that remains after reducing variants of the same root word to a common form (concept)
- Stemming algorithms are programs that relate morphologically similar indexing and search terms
- Stemming (conflation, i.e., fusing or combining) provides a way of finding morphological variants of search terms
- Stemming is used to improve retrieval effectiveness and to reduce the size of indexing files
- Since a single stem corresponds to several full terms, by storing stems instead of terms, compression factors of over 50% can be achieved
- Terms can be stemmed at indexing time (efficient, but requires extra storage) or at search time
10 Stemming in an IR System
11 Stemming Algorithms
- Goals
  - different words with the same base meaning should be conflated to the same form
  - words with distinct meanings should be kept separate
- Criteria for judging stemmers
  - Correctness
    - overstemming: too much of a term is removed, which causes unrelated terms to be conflated (effect: irrelevant documents are retrieved)
    - understemming: too little of a term is removed, which prevents related terms from being conflated (effect: relevant documents cannot be retrieved)
  - Retrieval effectiveness: measured by precision, recall, speed, and size
  - Compression performance
12 Stemming
- Conflation methods
  - Affix Removal: removes suffixes and/or prefixes from terms to yield a stem
  - Successor Variety: uses the frequencies of letter sequences for stemming; variants are Cutoff, Peak and Plateau, Complete Word, and Entropy
  - Table Lookup: stores terms and their corresponding stems in a lookup table
  - N-gram: conflates terms based on the number of shared substrings of length N
13 Stemming Algorithms
- Types of Stemming
- 1) Table lookup stemmers
  - store all index terms and their stems in a table
  - terms from queries/indexes are stemmed via table lookup
  - Advantages: lookups are fast, e.g., using a B+-tree/hash function
  - Disadvantages: domain-dependent DBs; storage overhead (trading size for time)
- A dictionary-based sketch follows.
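A minimal sketch of table-lookup stemming; the table entries and the fallback policy are assumptions for illustration (Python's dict plays the role of the hash table):

```python
# Terms and their stems are stored ahead of time; stemming is one lookup.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term):
    # Unknown terms fall back to themselves (an assumed policy).
    return STEM_TABLE.get(term, term)

print(lookup_stem("engineering"))  # -> engineer
```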
14 Stemming Algorithms
- Types of Stemming (continued)
- 2) Successor variety stemmers
  - the successor variety of a string S is the number of different characters that follow S in words of a text
  - determine word boundaries based on the distribution of letters, e.g., given the word "apple" and a text T that contains "able", "axe", "ape", and "accept":
    - the successor variety of "a" is 4, i.e., b, x, p, c
    - the successor variety of "ap" is 1, i.e., e
    - the successor variety of "app" is 0
  - the successor variety (SV) of substrings of a term decreases as more characters are added, until a segment boundary is reached, where SV sharply increases
  - when a segment boundary is reached, a stem is identified
- A helper that computes successor variety follows.
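A small helper reproducing the slide's example (the function name is an assumption):

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the corpus words."""
    i = len(prefix)
    return len({w[i] for w in corpus if w.startswith(prefix) and len(w) > i})

T = ["able", "axe", "ape", "accept"]      # the slide's text T
for p in ["a", "ap", "app"]:
    print(p, successor_variety(p, T))     # a 4, ap 1, app 0
```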
15 Stemming Algorithms
- Types of Stemming
- Successor variety stemmers (continued)
  - a. Cutoff Method
    - some cutoff value is selected for successor varieties to identify boundaries
    - drawback: if the cutoff value is too small, incorrect cuts will be made; if it is too large, correct cuts will be missed
16 Stemming Algorithms
- Types of Stemming
- Successor variety stemmers (continued)
  - b. Peak and Plateau Method: removes the need for the cutoff value
    - a segment break is identified after a character whose successor variety exceeds that of the character
      - immediately preceding it, and
      - immediately following it
  - Example. Let "readable" be the test word, and let the corpus be {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}

    Prefix     Successor Variety   Letters
    R          3                   e, i, o
    RE         2                   a, d
    REA        1                   d
    READ       3                   a, i, s
    READA      1                   b
    READAB     1                   l
    READABL    1                   e
    READABLE   0                   (blank)

  - Result: "readable" is segmented into "read" and "able" (READ is a peak: its successor variety, 3, exceeds that of REA and READA); a segmentation sketch follows.
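A sketch of peak-and-plateau segmentation over the slide's corpus (helper and function names are assumptions):

```python
def successor_variety(prefix, corpus):
    i = len(prefix)
    return len({w[i] for w in corpus if w.startswith(prefix) and len(w) > i})

def peak_plateau_segments(word, corpus):
    """Break after a prefix whose SV exceeds that of the prefixes
    immediately preceding and following it."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    breaks = [i + 1 for i in range(1, len(sv) - 1)
              if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    segments, start = [], 0
    for b in breaks:
        segments.append(word[start:b])
        start = b
    segments.append(word[start:])
    return segments

corpus = ["able", "ape", "beatable", "fixable", "read",
          "readable", "reading", "reads", "red", "rope", "ripe"]
print(peak_plateau_segments("readable", corpus))  # -> ['read', 'able']
```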
17 Stemming Algorithms
- Types of Stemming (continued)
  - c. Complete Word (Segmentation) Method
    - a segment break is identified if a segment is a complete word in the corpus
    - the choice of the 1st or 2nd segment as the stem, e.g.,
      - if (the 1st segment occurs in ≤ 12 words in the corpus), then the 1st segment is the stem
      - else the 2nd segment is the stem, i.e., the 1st segment is a prefix
  - Example. Let "readable" be the test word, and let the corpus be {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}. Using the Complete Word (Segmentation) Method, "readable" is segmented into "read" and "able", the same result as in the Peak and Plateau Method.
18 Stemming Algorithms
- Types of Stemming (continued)
  - d. Entropy Method
    - considers the distribution of successor variety letters
    - Approach
      - let D_αi be the number of words in a text body beginning with the i-length sequence of letters α
      - let D_αij be the number of words in D_αi with the successor letter j
      - D_αij / D_αi is the probability that a member of D_αi has the successor j
      - the entropy value of D_αi is
        E_αi = -Σ_j (D_αij / D_αi) · log2 (D_αij / D_αi)
    - using the entropy values, a cutoff value is identified and thus the boundary of a word
  - A sketch of the entropy computation follows.
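A sketch of the computation under one simplifying assumption: words exactly equal to the prefix (which have no successor letter) are excluded from D_αi, so the probabilities sum to 1:

```python
import math
from collections import Counter

def successor_entropy(prefix, corpus):
    """E = -sum_j (D_ij / D_i) * log2(D_ij / D_i) over successor letters j."""
    i = len(prefix)
    successors = Counter(w[i] for w in corpus
                         if w.startswith(prefix) and len(w) > i)
    d_i = sum(successors.values())   # D_i (assumed: words with a successor)
    if d_i == 0:
        return 0.0
    return -sum((d_ij / d_i) * math.log2(d_ij / d_i)
                for d_ij in successors.values())

corpus = ["able", "ape", "beatable", "fixable", "read",
          "readable", "reading", "reads", "red", "rope", "ripe"]
print(successor_entropy("read", corpus))  # log2(3) ~ 1.585: likely boundary
```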
19 Stemming Algorithms
- 3) N-gram stemmers
  - calculate association measures between pairs of terms based on shared unique substrings of N consecutive letters; a term-clustering procedure
  - the similarity measure S based on unique digrams is computed as
    S = 2C / (A + B)
    where
    A = the number of unique digrams in the 1st word
    B = the number of unique digrams in the 2nd word
    C = the number of unique digrams shared by the two words
20 Stemming Algorithms
- Example (N-gram stemmers). The terms "statistics" and "statistical" can be broken into digrams as
  - statistics → st ta at ti is st ti ic cs
    - unique digrams: st ta at ti is ic cs (7)
  - statistical → st ta at ti is st ti ic ca al
    - unique digrams: st ta at ti is ic ca al (8)
  - S = 2C / (A + B), so S = (2 × 6) / (7 + 8) = 0.8
  - "statistics" and "statistical" are assigned to a single cluster if a cutoff similarity value of 0.6 is used; a sketch of this computation follows.
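A sketch reproducing this computation (Dice's coefficient over unique digrams; function names are assumptions):

```python
def digrams(word):
    """The set of unique digrams (adjacent letter pairs) of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(w1, w2):
    """S = 2C / (A + B) over unique digrams."""
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(digram_similarity("statistics", "statistical"))  # -> 0.8
```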
21 Stemming Algorithms
- The similarity matrix of N-gram stemmers
  - similarity matrix: a (symmetric) matrix that holds the similarity measure for each pair of terms in the system
  - symmetric means S_ij = S_ji
22 Stemming Algorithms
- 4) Affix Removal Stemmers
  - remove suffixes/prefixes from terms, leaving a stem
  - resultant stems may also be transformed
  - Example. Remove the plurals from terms (rules are considered in the given order, i.e., use only the 1st applicable rule):
    - if a word ends in "ies" but not "eies" or "aies", then replace "ies" by "y" (e.g., skies → sky)
    - if a word ends in "es" but not "aes", "ees", or "oes" (e.g., goes), then replace "es" by "e" (e.g., eyes → eye)
    - if a word ends in "s" but not "us" or "ss" (e.g., bus and kiss), then remove "s" (e.g., cars → car)
  - These rules are sketched in code below.
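The three rules in code, applied first-match-wins as the slide specifies (the function name is an assumption):

```python
def remove_plural(word):
    """Apply the slide's plural rules in order; only the 1st applicable fires."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # skies -> sky
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"      # eyes -> eye
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # cars -> car
    return word                     # bus, kiss -> unchanged

for w in ["skies", "eyes", "bus", "kiss", "cars"]:
    print(w, "->", remove_plural(w))
```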
23 Stemming Algorithms
- 4a) Affix (Simple) Removal Stemmers
  - Techniques
    - Recoding: a context-sensitive transformation of the form AxC → AyC, where A and C specify the context of the transformation, x is the input string, and y is the transformed string; e.g., ski → sky, where "i" is the input string and "y" is the transformed string
    - Partial matching: the n initial characters of stems are used in comparison; two stems are equivalent if they agree in all but their last characters, e.g., "transmission" and "transmit" match for n = 7
24 Stemming Algorithms
- 4b) Affix (Longest Match) Removal Stemmers (Porter's Algorithm)
  - consists of a set of condition/action rules that are evaluated in the specified order so as to remove the longest possible suffix
  - Notations
    - a vowel, denoted v, is A, E, I, O, or U; Y is a vowel if it is preceded by a consonant, otherwise it is a consonant
    - a consonant, denoted c, is a letter other than A, E, I, O, or U, with the exception of Y as above
    - a list ccc... of length greater than 0 is denoted by C
    - a list vvv... of length greater than 0 is denoted by V
    - any word, or part of a word, has one of the following forms: CVCV...C, CVCV...V, VCVC...C, or VCVC...V, which can be represented by the single form [C]VCVC...[V]
25 Stemming Algorithms
- 4b) Affix (Longest Match) Removal Stemmers (Porter's Algorithm)
  - Conditions used in the rules:
    - 1) the measure, m ≥ 0, of a stem is defined by writing the stem as C(VC)^m V, where C is a sequence of consonants (i.e., non-vowel letters) and V is a sequence of vowels; e.g., BY (m = 0), TREES (m = 1), PRIVATE (m = 2)
    - 2) <X>: the stem ends with a given letter X, e.g., (m > 1 and (<S> or <T>)) in Row 1 of the rule table (next page)
    - 3) *v*: the stem contains a vowel, e.g., Row 2 of the rule table (next page)
26 Stemming Algorithms
- 4b) Porter's Algorithm (continued)
  - 4) *d: the stem ends in a double consonant, e.g., Row 3 of the rule table
  - 5) *o: the stem ends with a consonant-vowel-consonant (cvc, single letters rather than the sequences C, V, C) pattern, where the final consonant is not w, x, or y, e.g., Row 4 of the rule table

    Conditions (on the potential stem)   Suffix (S1)   Replacement (S2)   Example (S1 → S2)
    m > 1 and (<S> or <T>)               ion           NULL               adoption → adopt
    *v*                                  ing           NULL               motoring → motor; sing → sing
    m > 1 and *d and <L>                 NULL          single letter      controll → control; roll → roll
    m = 1 and not *o                     e             NULL               cease → ceas
27 Stemming Algorithms
- 4b) Porter's Algorithm (http://www.tartarus.org/martin/)

28 Stemming Algorithms
- 4b) Porter's Algorithm (continued)

29 Stemming Algorithms
- 4b) Porter's Algorithm (continued)
- A sketch of rule evaluation in code follows.
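A sketch of how the measure m and the four sample rules from the table can be evaluated; this implements only those rows, not Porter's full multi-step algorithm, and all names are assumptions:

```python
import re

def is_consonant(word, i):
    """Porter: a consonant is any letter other than a, e, i, o, u,
    and other than y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC)^m[V]: the number of VC transitions in the stem."""
    forms = "".join("c" if is_consonant(stem, i) else "v"
                    for i in range(len(stem)))
    return len(re.findall(r"v+c", forms))

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_cvc(stem):
    """*o: final consonant-vowel-consonant, last consonant not w, x, or y."""
    if len(stem) < 3 or stem[-1] in "wxy":
        return False
    n = len(stem)
    return (is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1))

def apply_table_rules(word):
    # Row 1: (m > 1 and (<S> or <T>)) ion -> NULL
    if word.endswith("ion"):
        s = word[:-3]
        if measure(s) > 1 and s[-1:] in ("s", "t"):
            return s
    # Row 2: (*v*) ing -> NULL
    if word.endswith("ing") and contains_vowel(word[:-3]):
        return word[:-3]
    # Row 3: (m > 1 and *d and <L>) double letter -> single letter
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    # Row 4: (m = 1 and not *o) e -> NULL
    if word.endswith("e"):
        s = word[:-1]
        if measure(s) == 1 and not ends_cvc(s):
            return s
    return word

for w in ["adoption", "motoring", "sing", "controll", "roll", "cease"]:
    print(w, "->", apply_table_rules(w))
# adoption -> adopt, motoring -> motor, sing -> sing,
# controll -> control, roll -> roll, cease -> ceas
```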
30 Thesauri
- A term thesaurus ("treasury of words") refines/broadens the interpretation of terms
- Uses similar or closely related terms:
  - synonyms and antonyms (i.e., related words) for each word
  - broader and narrower query terms using classified hierarchies
- A thesaurus, which broadens the vocabulary terms, enhances the recall performance in retrieval
- Can be used during
  - document storage processing, by replacing each term variant w/ a standard term based on the thesaurus
  - query processing, to broaden a query to ensure that relevant documents are not missed
- Problems to be dealt with: homographs (2 words w/ distinct meanings but identical spellings, e.g., "Mr. Post" and "post office")
31 Thesauri
- Thesaurus classes are groups or categories of terms used in a given topic area
- A term match would result through the thesaurus transformation
32 Thesauri
- Thesauri can be constructed manually, semi-automatically, or fully automatically
- Problems arise during construction:
  - what terms should be included in the thesaurus
  - terms specified for inclusion must be suitably grouped
33 Thesauri
- Automatic Thesaurus Construction
  - use a set of document vectors and represent a document collection by a matrix
  - the rows of the matrix represent the individual document vectors
  - the columns identify the term assignments/weights in the docs
34 Thesauri
- Similarity measures between pairs of terms
  - let TERMk and TERMh be two terms in a collection of docs
  - let tik be the weight of TERMk in document i of the collection
  - let n be the number of documents in the collection
  - the similarity measure of TERMk and TERMh can be defined as
    SIM(TERMk, TERMh) = Σ_i (tik × tih)
  - a normalized similarity measure of TERMk and TERMh is used to limit SIM to values between 0 and 1
35 Thesauri
- Similarity measures between pairs of terms
  - a term-term association matrix can be constructed by comparing all pairs of columns in the document-vector matrix
  - let SIM(TERMi, TERMj) be the similarity measure of terms i and j
- A matrix-based sketch follows.
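A sketch with NumPy (an assumed tool; the weights are made-up illustrations). The association matrix over all column pairs is just DᵀD, and a cosine normalization, one plausible choice for the unspecified normalized measure, bounds the values to [0, 1]:

```python
import numpy as np

# Document-term weight matrix: rows = documents i, columns = terms k.
D = np.array([[0.5, 0.0, 0.3],
              [0.0, 0.8, 0.6],
              [0.7, 0.2, 0.0]])

# SIM(TERM_k, TERM_h) = sum_i t_ik * t_ih for every pair of columns:
sim = D.T @ D                       # symmetric: SIM_kh = SIM_hk

# Assumed normalization (cosine) to keep values in [0, 1]:
norms = np.linalg.norm(D, axis=0)
sim_norm = sim / np.outer(norms, norms)

print(sim_norm.round(2))            # diagonal entries equal 1
```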
36 Thesauri
- Use of thesauri
- A thesaurus can be used to broaden the existing
indexing vocabulary by - replacing the initial terms w/ the corresponding
thesaurus class, or - adding the thesaurus class identifiers to the
original terms - Example.
37 Thesauri
- Use of thesauri
- Term associations and thesaurus classes can be
displayed to help the IR system users in - formulating the search queries
- familiarizing themselves with the vocabulary
- An attractive format of display is a graph-like
structure
38 Thesauri
- Use of thesauri
  - Maintenance problem: use of a thesaurus requires maintenance
    - rebuilding: results of user interaction w/ the system, e.g., new user populations and interests, require new vocabulary terms
    - accommodating collection growth: adding new documents requires update strategies:
      - (a) the original thesauri are left unchanged during further expansion
      - (b) new terms derived from the added items are placed into existing thesaurus categories
      - (c) new terms are placed into separate new classes
      - (d) the thesauri are completely restructured by generating a term classification from the updated vocabulary
  - Option (a) produces a loss of performance,
  - Option (d) is expensive, and
  - Options (b) and (c) yield no clear-cut (better) choice
39 Thesauri
- Automatic Classification/Clustering Method
  - use the term-term similarity matrix to construct classes of similar terms (→ thesaurus classes) by collecting all terms whose similarity coefficients are sufficiently large, i.e., exceed a threshold value
  - Refinement: for each thesaurus class TC with m term vectors, define the term-centroid ⟨c̄1, c̄2, ...⟩, which is the average vector of the term vectors of TC, i.e., its i-th component is the average of the corresponding TERMk weights:
    c̄i = (1/m) Σ_{TERMk in TC} tik
40 Thesauri
- Automatic Classification/Clustering Method (continued)
  - the term-centroids can be used to refine thesaurus classes by computing the similarity between TERMk and the term-centroid of every thesaurus class
  - assume that there are t term vectors distributed into p classes; generate a similarity-coefficient matrix of dimension t × p:
    SIMILAR_CO(TERMk, TERM-CENTROIDh), where 1 ≤ k ≤ t and 1 ≤ h ≤ p
  - each term vector is then assigned to the class whose TERM-CENTROID it is most similar to
  - if a given term vector switches from one class to another, the centroids of the involved classes must be recomputed; this repeats until no further class changes occur
- A sketch of this refinement loop follows.
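A sketch of the refinement loop with NumPy; the data, the cosine similarity used for SIMILAR_CO, and the stopping test are illustrative assumptions:

```python
import numpy as np

terms = np.array([[0.9, 0.1, 0.0],          # t = 4 term vectors (rows)
                  [0.8, 0.2, 0.1],
                  [0.1, 0.9, 0.7],
                  [0.0, 0.8, 0.9]])
labels = np.array([0, 1, 0, 1])             # initial split into p = 2 classes

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

changed = True
while changed:                              # until no further class changes
    # term-centroid of each class: average of its member term vectors
    centroids = np.array([terms[labels == h].mean(axis=0) for h in range(2)])
    # SIMILAR_CO(TERM_k, TERM-CENTROID_h): a t x p coefficient matrix
    sim = np.array([[cosine(tv, c) for c in centroids] for tv in terms])
    new_labels = sim.argmax(axis=1)         # assign to the closest centroid
    changed = not np.array_equal(new_labels, labels)
    labels = new_labels                     # a switch forces recomputation

print(labels)                               # -> [0 0 1 1]
```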