1
  • Chapter 7
  • Text Operations

2
Text Operations
  • Text operations
  • a pre-processing step applied to the documents in a collection to
    determine representative index terms; a process of (i) controlling the
    size of the set of index terms and (ii) improving retrieval performance
  • useful text operations include
  • elimination of stopwords
  • stemming (reduce each word to its grammatical root by removing affixes,
    i.e., suffixes and prefixes)
  • building a thesaurus (represent conceptual term relationships; construct
    term categorization structures)
  • performing compression (reduce query response time)
  • drawback: these operations can yield unexpected results for phrase
    queries (e.g., removing stopwords breaks a phrase query such as "to be
    or not to be")
3
Basic Concepts
  • Logical view of the documents
  • the (internal) structure of a document (e.g., its chapters and sections)
  • document representation is viewed as a continuum: the logical view of
    documents might shift from a full-text representation to a higher-level
    representation

4
Lexical Analysis
  • Lexical analysis is
  • the process of converting an input stream of characters into a stream of
    words (or tokens)
  • the first stage of automatic indexing and query processing
  • query processing is the activity of (i) analyzing a query and (ii)
    comparing it to indexes to find relevant items
  • Design a lexical analyzer to extract tokens that exclude the following
    (a tokenizer sketch follows this list)
  • digits: a number by itself doesn't make a good index term
  • hyphens: breaking hyphenated terms apart helps with inconsistent usage,
    but loses the specificity of a phrase
  • punctuation: often used as part of index terms
  • case: usually insignificant in index terms, but may be important in some
    situations and should then be preserved
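
A minimal tokenizer sketch in Python illustrating these policies (the
function name and regular expression are illustrative, not from the slides):
it folds case, splits hyphenated terms apart, strips punctuation, and drops
tokens that are numbers by themselves.

import re

def tokenize(text):
    # Split hyphenated terms apart, keep only alphanumeric runs
    # (strips punctuation), fold case, and drop pure numbers.
    tokens = re.findall(r"[A-Za-z0-9]+", text.replace("-", " "))
    return [t.lower() for t in tokens if not t.isdigit()]

print(tokenize("State-of-the-art IR systems, since 1992!"))
# -> ['state', 'of', 'the', 'art', 'ir', 'systems', 'since']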

5
Lexical Analysis
  • Policies to be considered
  • recognizing numbers as tokens
  • (-) adds many terms with poor discrimination value to the index
  • (+) may be a good policy if exhaustive searching is important
  • breaking up hyphenated terms (increases recall but decreases precision)
  • preserving case distinctions (enhances precision but decreases recall;
    case may matter in some situations, e.g., in an author field)
  • Lexical analysis is costly
  • it requires examination of each input character
  • it can account for as much as 50% of the computational expense of
    compilation
  • Solutions
  • specify the exceptions through regular expressions, or
  • consider the full-text search/indexing strategy
6
Stoplists and Stopwords
  • Candidate index terms are often checked to see
    whether they are in a stoplist, or negative
    dictionary, which includes articles,
    propositions, and conjunctions
  • Stoplist words (such as the, of, and,
    for) are
  • known to make poor or worthless index terms
  • their discrimination value is low
  • making up a large fraction (20 40) of text in
    documents
  • immediately removed from consideration as index
    terms
  • Eliminating stopwords
  • speeds processing
  • saves hugh amounts of space in indexing
  • doesnt damage retrieval effectiveness
  • Stopwords are not always frequently occuring
    words, e.g., the 200 most frequently occurring
    words include war, time, etc.

7
Stoplists and Stopwords
  • Stoplist policy is depended on
  • Database/IR systems (commercial IR systems are
    very conservative with a few stopwords)
  • features of the users
  • indexing process
  • Implementation of stoplists
  • filtering stopwords from lexical analyzer output,
    e.g., use hashing (fast but slow down by
    re-computing hash value of each token and
    resolve collision)
  • removing stopwords as part of the lexical
    analysis process
  • at almost no extra cost
  • can be automated easier/less error-prone than
    filtering stopwords manually
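
A sketch of stopword removal folded into lexical analysis (the tiny stoplist
is illustrative only): membership tests against a hash-based set are O(1),
so stopwords never reach the indexer, at almost no extra cost.

# Hash-based stoplist; real systems use a larger, carefully chosen list.
STOPLIST = {"the", "of", "and", "for", "a", "an", "in", "to", "is"}

def index_terms(tokens):
    # Drop any token found in the stoplist; everything else remains a
    # candidate index term.
    return [t for t in tokens if t not in STOPLIST]

print(index_terms(["the", "elimination", "of", "stopwords", "saves", "space"]))
# -> ['elimination', 'stopwords', 'saves', 'space']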

8
Stoplists and Stopwords

9
Stemming
  • A stem is the portion of a word after reducing
    its variant of the same root word to a common
    concept
  • Stemming algorithms are programs that relate
    morphologically similar indexing and search
    terms
  • Stemming (conflation, i.e., fusing or combining)
    provides a way of finding morphological
    variants of search terms
  • Stemming is used to improve retrieval
    effectiveness and to reduce the size of indexing
    files
  • Since a single stem corresponds to several full
    terms, by storing stems instead of terms,
    compression factors of over 50 can be
    achieved
  • Terms can be stemmed at indexing (efficiency but
    require extra storage) or search time

10
Stemming in an IR System
11
Stemming Algorithms
  • Goals
  • different words with the same base meaning should be conflated to the
    same form
  • words with distinct meanings should be kept separate
  • Criteria for judging stemmers
  • correctness
  • overstemming: too much of a term is removed, which causes unrelated
    terms to be conflated (effect: irrelevant documents are retrieved)
  • understemming: too little of a term is removed, which prevents related
    terms from being conflated (effect: relevant documents cannot be
    retrieved)
  • retrieval effectiveness: measured by precision, recall, speed, and size
  • compression performance
12
Stemming Conflation Methods
  • Conflation methods fall into four groups:
  • Affix Removal: removes suffixes and/or prefixes from terms to yield a
    stem
  • Successor Variety: uses the frequencies of letter sequences in a text
    for stemming; variants include the Cutoff, Peak and Plateau, Complete
    Word, and Entropy methods
  • Table Lookup: stores terms and their corresponding stems in a lookup
    table
  • N-gram: conflates terms based on the number of shared substrings of
    length N
13
Stemming Algorithms
  • Types of Stemming
  • 1) Table Lookup
  • store all index terms and their stems in a table
  • terms from queries/indexes are stemmed via table lookup (see the sketch
    below)
  • advantages: lookups are fast, e.g., using a B(+)-tree or hash function
  • disadvantages
  • stem tables are domain-dependent
  • storage overhead: trading space for time
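
A table-lookup sketch using a Python dict as the hash-based stem table (the
entries are illustrative; a real table must be built per domain, which is
the disadvantage noted above).

# Terms and their corresponding stems stored in a lookup table.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def stem(term):
    # Unknown terms fall through unchanged; coverage depends entirely on
    # the stored table (the storage/domain-dependence trade-off).
    return STEM_TABLE.get(term, term)

print(stem("engineering"))  # -> engineer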
14
Stemming Algorithms
  • Types of Stemming (continued)
  • 2) Successor Variety Stemmers
  • the successor variety of a string S is the number of different
    characters that follow S in the words of a text, e.g., a text T that
    contains "able", "axe", "ape", and "accept"
  • word boundaries are determined from the distribution of letters, e.g.,
    given the word "apple" and the words in T above (see the sketch after
    this list)
  • the successor variety of "a" is 4, i.e., b, x, p, c
  • the successor variety of "ap" is 1, i.e., e
  • the successor variety of "app" is 0
  • the successor variety (SV) of substrings of a term decreases as more
    characters are added, until a segment boundary is reached, at which
    point SV sharply increases
  • when a segment boundary is reached, a stem is identified
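
A sketch of the successor-variety computation, reproducing the "apple"
example above (the function name is illustrative):

def successor_variety(prefix, words):
    # Count the distinct characters that follow `prefix` in the words
    # of the text.
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

T = ["able", "axe", "ape", "accept"]
print(successor_variety("a", T))    # -> 4 (b, x, p, c)
print(successor_variety("ap", T))   # -> 1 (e)
print(successor_variety("app", T))  # -> 0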
15
Stemming Algorithms
  • Types of Stemming
  • Successor Variety Stemmers (continued)
  • a. Cutoff Method
  • a cutoff value is selected for successor varieties; a boundary is
    identified whenever the successor variety exceeds the cutoff
  • drawback: if the cutoff value is too small, incorrect cuts will be
    made; if too large, correct cuts will be missed
16
Stemming Algorithms
  • Types of Stemming
  • Successor Variety Stemmers (continued)
  • b. Peak and Plateau Method: removes the need for a cutoff value
  • a segment break is made after a character whose successor variety
    exceeds that of both the character immediately preceding it and the
    character immediately following it (a sketch follows the example below)

Example. Let "readable" be the test word, and let the corpus be able, ape,
beatable, fixable, read, readable, reading, reads, red, rope, ripe

Prefix      Successor Variety   Letters
R           3                   e, i, o
RE          2                   a, d
REA         1                   d
READ        3                   a, i, s
READA       1                   b
READAB      1                   l
READABL     1                   e
READABLE    0                   (end of word)

Result: "readable" is segmented into "read" and "able"
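
A peak-and-plateau sketch reproducing this example (names are illustrative;
ties are treated as plateaus, so only strict peaks trigger a break):

def successor_variety(prefix, words):
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

def segment(word, corpus):
    # Successor variety of every prefix of the test word.
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    # Break after a character whose SV exceeds both neighbours.
    breaks = [i + 1 for i in range(1, len(word) - 1)
              if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    parts, prev = [], 0
    for b in breaks + [len(word)]:
        parts.append(word[prev:b])
        prev = b
    return parts

corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]
print(segment("readable", corpus))  # -> ['read', 'able']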
17
Stemming Algorithms
  • Types of Stemming (continued)
  • c. Complete Word (Segmentation) Method
  • a segment break is made after a segment if that segment is a complete
    word in the corpus
  • the choice of the 1st or 2nd segment as the stem, e.g.,
  • if the 1st segment occurs in ≤ 12 words in the corpus, then the 1st
    segment is the stem
  • else the 2nd segment is the stem, i.e., the 1st segment is a prefix
  • Example. Let "readable" be the test word, and let the corpus be able,
    ape, beatable, fixable, read, readable, reading, reads, red, rope,
    ripe.
  • Using the Complete Word (Segmentation) Method, "readable" is segmented
    into "read" and "able", the same result as the peak and plateau method.

18
Stemming Algorithms
  • Types of Stemming (continued)
  • d. Entropy Method
  • considers the distribution of successor-variety letters (see the sketch
    below)
  • Approach
  • let Dαi be the number of words in a text body beginning with the
    length-i sequence of letters α
  • let Dαij be the number of words in Dαi with the successor letter j
  • Dαij / Dαi is the probability that a member of Dαi has the successor j
  • the entropy value of Dαi is
  • Eαi = −Σj (Dαij / Dαi) × log2 (Dαij / Dαi)
  • using the entropy values, a cutoff value is identified and thus the
    boundary of a word
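
A sketch of the entropy computation for one prefix (names are illustrative;
the corpus is a subset of the small "readable" corpus used earlier):

import math

def successor_entropy(prefix, words):
    # Successor letters of `prefix` in the text, with multiplicity.
    succ = [w[len(prefix)] for w in words
            if w.startswith(prefix) and len(w) > len(prefix)]
    if not succ:
        return 0.0
    # E = -sum over successors j of p_j * log2(p_j)
    counts = {c: succ.count(c) for c in set(succ)}
    return -sum((n / len(succ)) * math.log2(n / len(succ))
                for n in counts.values())

corpus = ["read", "readable", "reading", "reads", "red", "rope", "ripe"]
print(successor_entropy("r", corpus))  # -> about 1.15 (spread over e, o, i)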

19
Stemming Algorithms
  • 3) N-gram Stemmers
  • calculate association measures between pairs of terms based on shared
    unique substrings of N consecutive letters; a term-clustering procedure
  • the similarity measure S based on unique digrams (Dice's coefficient)
    is computed as
  • S = 2C / (A + B)
  • where
  • A = the number of unique digrams in the 1st word
  • B = the number of unique digrams in the 2nd word
  • C = the number of unique digrams shared by the two words

20
Stemming Algorithms
  • Example (N-gram Stemmers).
  • The terms "statistics" and "statistical" can be broken into digrams as
  • statistics → st ta at ti is st ti ic cs
  • unique digrams: st ta at ti is ic cs (7)
  • statistical → st ta at ti is st ti ic ca al
  • unique digrams: st ta at ti is ic ca al (8)
  • S = 2C / (A + B) → S = (2 × 6) / (7 + 8) = 0.8 (see the sketch below)
  • "statistics" and "statistical" are assigned to a single cluster if a
    cutoff similarity value of 0.6 is used
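
A digram-similarity sketch reproducing this example (function names are
illustrative):

def digrams(word):
    # Set of unique digrams (adjacent letter pairs) in the word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def similarity(w1, w2):
    # Dice's coefficient: S = 2C / (A + B).
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(similarity("statistics", "statistical"))  # -> 0.8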

21
Stemming Algorithms
  • The similarity matrix of N-gram stemmers
  • similarity matrix: a matrix that includes the similarity measure for
    each pair of terms in the system
  • the matrix is symmetric (i.e., Sij = Sji)

22
Stemming Algorithms
  • 4) Affix Removal Stemmers
  • remove suffixes/prefixes from terms, leaving a stem
  • resultant stems may also be transformed
  • Example: remove plurals from terms (rules are considered in the given
    order, i.e., only the 1st applicable rule is used; see the sketch after
    these rules)
  • if a word ends in "ies" but not "eies" or "aies", then replace "ies"
    by "y" (e.g., skies → sky)
  • if a word ends in "es" but not "aes", "ees", or "oes" (e.g., goes),
    then replace "es" by "e" (e.g., eyes → eye)
  • if a word ends in "s" but not "us" or "ss" (e.g., bus and kiss), then
    remove "s" (e.g., cars → car)
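
A sketch of these ordered rules, where only the first applicable rule fires
(the function name is illustrative):

def remove_plural(word):
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # skies -> sky
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # eyes -> eye
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # cars -> car
    return word                  # bus, kiss -> unchanged

for w in ["skies", "eyes", "cars", "bus", "kiss"]:
    print(w, "->", remove_plural(w))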
23
Stemming Algorithms
  • 4a) Affix (Simple) Removal Stemmers
  • Techniques
  • recoding: a context-sensitive transformation of the form
  • AxC → AyC,
  • where A and C specify the context of the transformation, x is the
    input string, and y is the transformed string
  • e.g., ski → sky, where "i" is the input string and "y" is the
    transformed string
  • partial matching: only the n initial characters of stems are used in
    comparison
  • two stems are treated as equivalent if they agree in all but their
    last characters
  • e.g., "transmission" matches "transmit" (n = 7)

24
Stemming Algorithms
  • 4b) Affix (Longest Match) Removal Stemmers (Porter's Algorithm)
  • consists of a set of condition/action rules, evaluated in the specified
    order, that remove the longest possible suffix
  • Notation
  • a vowel, denoted v, is A, E, I, O, or U; Y is a vowel when preceded by
    a consonant (e.g., the Y in SYZYGY) and a consonant otherwise (e.g.,
    the Y in TOY)
  • a consonant, denoted c, is a letter other than A, E, I, O, or U, with
    Y handled as above
  • a list ccc… of length greater than 0 is denoted by C
  • a list vvv… of length greater than 0 is denoted by V
  • any word, or part of a word, has one of the following forms
  • CVCV…C, CVCV…V, VCVC…C, or VCVC…V,
  • which can be represented by the single form [C]VCVC…[V]


25
Stemming Algorithms

4b) Affix (Longest Match) Removal Stemmers (Porter's Algorithm)
1) The measure, m ≥ 0, of a stem is defined by the form C(VC)mV, where C is
a sequence of consonants (i.e., non-vowel letters) and V is a sequence of
vowels, e.g., BY (m = 0), TREES (m = 1), PRIVATE (m = 2)
2) <X>: the stem ends with the given letter X, e.g., (m > 1 and (<S> or
<T>)) in Row 1 of the rule table (next page)
3) *v*: the stem contains a vowel, e.g., Row 2 of the rule table (next
page)
26
Stemming Algorithms

4b) Porter's Algorithm (continued)
4) *d: the stem ends in a double consonant, e.g., Row 3 in the rule table
5) *o: the stem ends with a consonant-vowel-consonant sequence (cvc, i.e.,
single letters, not the sequences C and V), where the final consonant is
not w, x, or y, e.g., Row 4 in the rule table

Rule table (a sketch of the measure and of Row 1 follows the table):

Condition (on the potential stem)   Suffix (S1)   Replacement (S2)   Examples
m > 1 and (<S> or <T>)              ion           NULL               adoption → adopt
*v*                                 ing           NULL               motoring → motor, sing → sing
m > 1 and *d and <L>                NULL          single letter      controll → control, roll → roll
m = 1 and not *o                    e             NULL               cease → ceas
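
A sketch of the measure m and of Row 1 of the table (function names are
illustrative; Y handling follows the definition on the notation slide):

import re

def cv_pattern(word):
    out = []
    for ch in word.lower():
        if ch in "aeiou" or (ch == "y" and out and out[-1] == "C"):
            out.append("V")  # Y preceded by a consonant acts as a vowel
        else:
            out.append("C")
    return "".join(out)

def measure(stem):
    # m = the number of VC sequences in the form C(VC)mV.
    return len(re.findall(r"V+C+", cv_pattern(stem)))

def ion_rule(word):
    # Row 1: if m > 1 and the stem ends in S or T, drop the suffix "ion".
    if word.endswith("ion"):
        stem = word[:-3]
        if measure(stem) > 1 and stem.endswith(("s", "t")):
            return stem
    return word

print(measure("by"), measure("trees"), measure("private"))  # -> 0 1 2
print(ion_rule("adoption"))  # -> adopt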
27
Stemming Algorithms
4b) Porter's Algorithm (http://www.tartarus.org/martin/)
28
Stemming Algorithms
4b) Porter's Algorithm (continued)
29
Stemming Algorithms
4b) Porter's Algorithm (continued)
30
Thesauri
  • Term thesauri ("treasuries of words") refine/broaden the interpretation
    of terms
  • Use similar or closely related terms, with
  • synonyms and antonyms (i.e., related words) for each word
  • broader and narrower query terms using classification hierarchies
  • A thesaurus, which broadens the vocabulary terms, enhances the recall
    performance of retrieval
  • Can be used during
  • document storage processing, by replacing each term variant w/ a
    standard term based on the thesaurus
  • query processing, to broaden a query to ensure that relevant documents
    are not missed
  • Problems to be dealt with: homographs (two words w/ distinct meanings
    but identical spellings, e.g., "Mr. Post" and "post office")

31
Thesauri
  • Thesaurus classes are groups or categories of terms used in a given
    topic area
  • A term match can result through the thesaurus transformation

32
Thesauri
  • Thesauri can be constructed manually, semi-automatically, or fully
    automatically
  • Problems arise during construction
  • which terms should be included in the thesaurus
  • the terms selected for inclusion must be suitably grouped

33
Thesauri
  • Automatic Thesaurus Construction
  • use a set of document vectors and represent a document collection by a
    matrix
  • the rows of the matrix are the individual document vectors
  • the columns give the term assignments/weights in the documents

34
Thesauri
  • Similarity measures between pairs of terms
  • let TERMk and TERMh be two terms in a collection of documents
  • let tik be the weight of TERMk in document i of the collection
  • let n be the number of documents in the collection
  • the similarity measure of TERMk and TERMh can be defined as
  • SIM(TERMk, TERMh) = Σi (tik × tih), summed over i = 1, …, n
  • a normalized version of SIM(TERMk, TERMh) limits the measure to values
    between 0 and 1 (see the sketch below)
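
A sketch of the unnormalized measure over a tiny illustrative document-term
matrix (rows are document vectors, columns are term weights):

# doc_term[i][k] = tik, the weight of TERMk in document i.
doc_term = [
    [1, 0, 2],
    [0, 1, 1],
    [2, 1, 0],
]

def sim(k, h):
    # SIM(TERMk, TERMh) = sum over documents i of tik * tih.
    return sum(row[k] * row[h] for row in doc_term)

t = len(doc_term[0])
for k in range(t):
    print([sim(k, h) for h in range(t)])  # symmetric: sim(k, h) == sim(h, k)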

35
Thesauri
  • Similarity measures between pairs of terms
  • a term-term association matrix can be constructed by comparing all
    pairs of columns of the document-vector matrix
  • let SIM(TERMi, TERMj) be the similarity measure of terms i and j

36
Thesauri
  • Use of thesauri
  • A thesaurus can be used to broaden the existing
    indexing vocabulary by
  • replacing the initial terms w/ the corresponding
    thesaurus class, or
  • adding the thesaurus class identifiers to the
    original terms
  • Example.

37
Thesauri
  • Use of thesauri
  • Term associations and thesaurus classes can be
    displayed to help the IR system users in
  • formulating the search queries
  • familiarizing themselves with the vocabulary
  • an attractive display format is a graph-like structure

38
Thesauri
  • Use of thesauri
  • Maintenance problem: use of a thesaurus requires maintenance
  • rebuilding may be required by user interaction w/ the system, e.g., new
    user populations and interests require new vocabulary terms
  • collection growth must be accommodated: as new documents are added,
    updating strategies include
  • (a) the original thesaurus is left unchanged for further expansion
  • (b) new terms derived from the added items are placed into existing
    thesaurus categories
  • (c) new terms are placed into separate new classes
  • (d) the thesaurus is completely restructured by generating a term
    classification from the updated vocabulary
  • Option (a) produces a loss of performance,
  • Option (d) is expensive, and
  • Options (b) and (c) offer no clear-cut (better) choice

39
Thesauri
  • Automatic Classification/Clustering Method
  • use the term-term similarity matrix to construct classes of similar
    terms (i.e., thesaurus classes) by collecting all terms whose
    similarity coefficients are sufficiently large, i.e., exceed a
    threshold value
  • Refinement: for each thesaurus class TC with m term vectors, define the
    term-centroid, the average vector of the term vectors of TC
  • its component for document i is the average of the TERMk weights:
  • (1/m) × Σ tik, summed over the TERMk in TC

40
Thesauri
  • Automatic Classification/Clustering Method
  • term-centroids can be used to refine thesaurus classes by computing the
    similarity between TERMk and the term-centroid of every thesaurus class
  • assuming that there are t term vectors distributed into p classes,
    generate a similarity-coefficient matrix of dimension t × p:
  • SIMILAR_CO(TERMk, TERM-CENTROIDh),
  • where 1 ≤ k ≤ t and 1 ≤ h ≤ p
  • each term vector is assigned to the class whose TERM-CENTROID it is
    most similar to
  • if a given term vector switches from one class to another, the
    centroids of the involved classes must be recomputed; the process
    repeats until no further class changes occur (a sketch of this loop
    follows below)
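
A sketch of this refinement loop (the names, the dot-product similarity,
and the toy data are illustrative assumptions; the slides do not specify
SIMILAR_CO, so any term-term similarity could be substituted; classes are
lists of term-vector indices):

def centroid(vectors):
    # Component-wise average of the term vectors in a class.
    m = len(vectors)
    return [sum(v[i] for v in vectors) / m for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refine(term_vectors, classes):
    while True:
        cents = [centroid([term_vectors[k] for k in c]) for c in classes]
        new = [[] for _ in classes]
        for k, v in enumerate(term_vectors):
            # Assign each term vector to its most similar term-centroid.
            best = max(range(len(cents)), key=lambda h: dot(v, cents[h]))
            new[best].append(k)
        new = [c for c in new if c]  # drop classes that lost all terms
        if new == classes:           # stop when no term changes class
            return classes
        classes = new

term_vectors = [[1, 0], [0.9, 0.1], [0, 1]]
print(refine(term_vectors, [[0, 2], [1]]))  # -> [[2], [0, 1]]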