Terms and Query Operations

About This Presentation

Title:

Terms and Query Operations

Description:

Terms and Query Operations Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. – PowerPoint PPT presentation

Slides: 39

Provided by: Hsin82

Transcript and Presenter's Notes

Title: Terms and Query Operations

1
Terms and Query Operations

Information Retrieval Data Structures and
Algorithms
by W.B. Frakes and R. Baeza-Yates (Eds.)
Englewood Cliffs, NJ Prentice Hall, 1992.
Chapter 7 - 9

2
Lexical Analysis and Stoplists

Chapter 7

3
Lexical Analysis for Automatic Indexing

Lexical analysis: convert an input stream of characters into a stream of words or tokens.
What is a word or a token? Tokens consist of letters.
Digits: most numbers are not good index terms. Counterexamples: case numbers in a legal database; B6 and B12 in a vitamin database.
Hyphens:
break hyphenated words: state-of-the-art becomes state of the art
keep hyphenated words as a single token: Jean-Claude, F-16
Other punctuation: often used as part of a term, e.g., OS/2
Case: usually not significant in index terms
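
The choices on this slide can be sketched as a small tokenizer. This is only an illustration under stated assumptions: the token pattern, the function name tokenize, and the two switches break_hyphens and fold_case are hypothetical and are not taken from the textbook.

```python
import re

# Assumed token shape: starts with a letter, may contain digits, and may carry
# internal "-" or "/" so that "F-16", "OS/2" and "B12" stay single tokens.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[-/][A-Za-z0-9]+)*")


def tokenize(text, break_hyphens=False, fold_case=True):
    """Turn a character stream into a list of index-term candidates."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if fold_case:
            # Case is usually not significant in index terms.
            token = token.lower()
        if break_hyphens:
            # Recall-enhancing choice: "state-of-the-art" -> state, of, the, art.
            tokens.extend(part for part in token.split("-") if part)
        else:
            tokens.append(token)
    return tokens
```

With the defaults, pure numbers such as 1992 are dropped, while F-16 and OS/2 survive as single lowercased tokens.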

4
Lexical Analysis for Automatic Indexing (Continued)

Issues: recall and precision
breaking up hyphenated terms: increases recall but decreases precision
preserving case distinctions: enhances precision but decreases recall
commercial information systems: usually take a conservative (recall-enhancing) approach

5
Lexical Analysis for Query Processing

Tasks
depend on the design strategies of the lexical
analyzer for automatic indexing (search terms
must match index terms)
distinguish operators like Boolean operators
distinguish grouping indicators like parentheses
and brackets
flag illegal characters as unrecognized tokens

6
STOPLISTS (negative dictionary)

Avoid retrieving almost every item in a database regardless of its relevance.
Example (derived from the Brown corpus), 425 words: a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...
Commercial information systems tend to take a conservative approach, with few stopwords.

7
Implementing Stoplists

Approaches
examine lexical analyzer output and remove any
stopwords
remove stopwords as part of lexical analysis
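
The two approaches can be contrasted in a short sketch. The stoplist below is a handful of words taken from the example list on the previous slide; the function names and the simple whitespace splitting are assumptions made for illustration only.

```python
# A tiny illustrative stoplist; a real one would be much longer.
STOPWORDS = frozenset(
    {"a", "about", "above", "across", "after", "all", "an", "and", "are", "as", "at", "be"}
)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Approach 1: filter the lexical analyzer's output after the fact."""
    return [token for token in tokens if token not in stopwords]


def tokens_without_stopwords(text, stopwords=STOPWORDS):
    """Approach 2: drop stopwords while the tokens are being produced."""
    for raw in text.lower().split():
        token = raw.strip(".,;:!?()\"'")
        if token and token not in stopwords:
            yield token
```

A set lookup keeps the per-token cost constant in either approach; the second approach simply avoids building the intermediate token list.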

8
Stemming Algorithms

Chapter 8

9
Stemmers

Programs that relate morphologically similar indexing and search terms
Stem at indexing time
advantage: efficiency and index file compression
disadvantage: information about the full terms is lost
Example (CATALOG system), stem at search time:
Look for: system users
Search term: users
    Term     Occurrences
1.  user     15
2.  users    1
3.  used     3
4.  using    2

10
Conflation Methods

Manual
Automatic (stemmers)
table lookup
successor variety
n-gram
affix removal: longest match vs. simple removal
Evaluation
correctness
retrieval effectiveness
compression performance

11
Successor Variety

Definition (successor variety of a string): the number of different characters that follow it in words in some body of text
Example: a body of text contains able, axle, accident, ape, about
successor variety of "apple":
1st letter (a): 4 (b, x, c, p)
2nd letter (ap): 1 (e)

12
Successor Variety (Continued)

Idea: the successor variety of substrings of a term decreases as more characters are added, until a segment boundary is reached; at that point the successor variety increases sharply.
Example:
Test word: READABLE
Corpus: ABLE, BEATABLE, FIXABLE, READ, READABLE, READING, RED, ROPE, RIPE

Prefix     Successor Variety   Letters
R          3                   E, O, I
RE         2                   A, D
REA        1                   D
READ       3                   A, I, S
READA      1                   B
READAB     1                   L
READABL    1                   E
READABLE   1                   blank
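
A minimal sketch of how the successor varieties for READABLE could be computed. It assumes that the end of a word counts as one possible successor (the "blank" in the last table row); the marker END and the function name are hypothetical. With the corpus exactly as listed, the third successor of READ comes out as the end-of-word marker rather than S, but the counts agree with the table.

```python
END = "#"  # assumed end-of-word marker, counted as a successor ("blank")


def successor_varieties(word, corpus):
    """For each prefix of word, collect the characters that follow it in corpus."""
    result = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {
            (w[i] if len(w) > i else END)
            for w in corpus
            if w.startswith(prefix)
        }
        result.append((prefix, successors))
    return result


CORPUS = ["ABLE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "RED", "ROPE", "RIPE"]

for prefix, successors in successor_varieties("READABLE", CORPUS):
    print(prefix, len(successors), sorted(successors))
```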

13
The successor variety stemming process

Determine the successor varieties for a word.
Use this information to segment the word:
cutoff method: a boundary is identified whenever the cutoff value is reached
peak and plateau method: a break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it
complete word method: a break is made after a segment that is a complete word in the corpus
entropy method
Select one of the segments as the stem.
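
The peak and plateau rule can be sketched as follows. The function takes the list of successor varieties for the prefixes of a word (one number per character) and breaks after each strict local peak; the function name is hypothetical, and the final stem-selection step is deliberately left out.

```python
def peak_and_plateau_segments(word, varieties):
    """Split word after each character whose successor variety is a local peak.

    varieties[i] is the successor variety of the prefix word[:i + 1].
    """
    segments = []
    start = 0
    # The first and last characters have only one neighbour, so they are skipped.
    for i in range(1, len(word) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments
```

For READABLE with varieties 3, 2, 1, 3, 1, 1, 1, 1 the only peak is at READ, so the word is segmented into READ and ABLE.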

14
n-gram stemmers

Digram: a pair of consecutive letters
Shared digram method (Adamson and Boreham, 1974): association measures are calculated between pairs of terms using Dice's coefficient
S = 2C / (A + B)
where A is the number of unique digrams in the first word, B is the number of unique digrams in the second, and C is the number of unique digrams shared by the two words

15
n-gram stemmers (Continued)

Example:
statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti
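
A small sketch of the shared digram measure for this example, assuming the association measure is Dice's coefficient, S = 2C / (A + B), with A and B the numbers of unique digrams in each word and C the number they share. The function names are illustrative.

```python
def unique_digrams(word):
    """Return the set of distinct pairs of consecutive letters in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def digram_similarity(first, second):
    """Dice's coefficient over the unique digrams of two words."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # one-letter words have no digrams
    return 2 * len(a & b) / total


print(digram_similarity("statistics", "statistical"))
```

Here A = 7, B = 8 and C = 6, so the similarity is 12 / 15 = 0.8.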

16
n-gram stemmers (Continued)

similarity matrix: determine the similarity measures for all pairs of terms in the database

         word1   word2   word3   ...   wordn-1
word1
word2    S21
word3    S31     S32
...
wordn    Sn1     Sn2     Sn3     ...   Sn(n-1)

terms are clustered using a single link clustering method
most pairwise similarity measures were 0
using a cutoff similarity value of .6
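
One way to picture single link clustering with a cutoff: any two terms whose similarity reaches the cutoff end up in the same cluster, directly or through a chain of such links. The sketch below uses a union-find structure for this and repeats a digram-based Dice similarity so that it is self-contained; all names are hypothetical, and comparing every pair is only reasonable for small vocabularies.

```python
from itertools import combinations


def digram_similarity(first, second):
    """Dice's coefficient over unique digrams (assumed similarity measure)."""
    a = {first[i:i + 2] for i in range(len(first) - 1)}
    b = {second[i:i + 2] for i in range(len(second) - 1)}
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def single_link_clusters(terms, similarity=digram_similarity, cutoff=0.6):
    """Cluster terms; a link exists when similarity(a, b) >= cutoff."""
    parent = {term: term for term in terms}

    def find(term):
        # Follow parent pointers to the cluster representative.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in combinations(terms, 2):
        if similarity(a, b) >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for term in terms:
        clusters.setdefault(find(term), []).append(term)
    return list(clusters.values())
```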

17
Affix Removal Stemmers

Procedure: remove suffixes and/or prefixes from terms, leaving a stem, and transform the resultant stem.
Example: plural forms
If a word ends in "ies" but not "eies" or "aies", then "ies" --> "y"
If a word ends in "es" but not "aes", "ees", or "oes", then "es" --> "e"
If a word ends in "s" but not "us" or "ss", then "s" --> NULL
Ambiguity
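
The plural-form rules above translate almost directly into code. The sketch assumes the rules are tried in the order listed and that only the first rule whose condition holds is applied; the slide does not state this, and the function name is hypothetical.

```python
def strip_plural(word):
    """Apply the first plural rule whose condition holds (assumed rule order)."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"      # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # "s" -> NULL
    return word
```

The rules are purely orthographic, which is where the ambiguity comes from: "indexes" becomes "indexe" rather than "index".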

18
Affix Removal Stemmers (Continued)

Iterative longest match stemmer: remove the longest possible string of characters from a word according to a set of rules
recoding: AxC --> AyC, e.g., ki --> ky
partial matching: only the first n characters of stems are used in comparing
Different versions: Lovins, Salton, Dawson, Porter, ...
Students can refer to the rules listed in the textbook.

19
Thesaurus Construction

Chapter 9

20
Thesaurus Construction

IR thesaurus: a list of terms (words or phrases) along with relationships among them
INSPEC thesaurus (1979), covering physics, EE, electronics, computers, and control:
cesium (Cs)
  USE caesium (USE: the preferred form)
computer-aided instruction
  see also education (cross-referenced terms)
  UF teaching machines (UF: a set of alternatives)
  BT educational computing (BT: broader terms, cf. NT)
  TT computer applications (TT: root node/top term)
  RT education, teaching (RT: related terms)
  CC C7810C (CC: subject area)
  FC C7810Cf (FC: subject area)

21
Usage

Indexing: select the most appropriate thesaurus entries for representing the document.
Searching: design the most appropriate search strategy.
If the search does not retrieve enough documents,
the thesaurus can be used to expand the query.
If the search retrieves too many items, the
thesaurus can suggest more specific search
vocabulary.
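
As a rough illustration of using a thesaurus to broaden a query, the sketch below stores two INSPEC-style entries as a dictionary and follows selected relationships. The data structure, the function name expand_query, and the default choice of UF and RT are assumptions made for this example only.

```python
# Hypothetical in-memory form of two thesaurus entries.
THESAURUS = {
    "cesium": {"USE": ["caesium"]},
    "computer-aided instruction": {
        "UF": ["teaching machines"],
        "BT": ["educational computing"],
        "RT": ["education", "teaching"],
    },
}


def expand_query(terms, thesaurus, relations=("UF", "RT")):
    """Replace non-preferred terms via USE, then add terms from the given relations."""
    expanded = []
    for term in terms:
        entry = thesaurus.get(term, {})
        candidates = list(entry.get("USE", [term]))
        for relation in relations:
            candidates.extend(entry.get(relation, []))
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

Narrowing a search that retrieves too many items would follow NT (narrower term) links in the same way, provided the entries record them.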

22
Features of Thesauri (1/5)

Coordination Level
the construction of phrases from individual terms
pre-coordination: the thesaurus contains phrases
phrases are available for indexing and retrieval
advantage: reduces ambiguity in indexing and searching
disadvantage: the searcher has to know the phrase formulation rules
lower recall
post-coordination: the thesaurus does not allow phrases
phrases are constructed while searching
advantage: users do not need to worry about the exact word ordering
disadvantage: search precision may fall, e.g., library school vs. school library
lower precision

23
Features of Thesauri (2/5)

intermediate level: allows both phrases and single words
the higher the level of coordination, the greater
the precision of the vocabulary but the larger
the vocabulary size
it also implies an increase in the number of
relationships to be encoded
Precoordination is more common in manually
constructed thesauri.
Automatic phrase construction is still quite
difficult and therefore automatic thesaurus
construction usually implies post-coordination

24
Features of Thesauri (3/5)

Term Relationships
Aitchison and Gilchrist (1972)
equivalence relationships: synonymy or quasi-synonymy
hierarchical relationships, e.g., genus-species
nonhierarchical relationships,
e.g., thing-part: bus and seat
e.g., thing-attribute: rose and fragrance
Wang, Vandendorpe, and Evens (1985)
parts-wholes, e.g., set-element, count-mass
collocation relations: words that frequently co-occur in the same phrase or sentence
paradigmatic relations, e.g., moon and lunar
taxonomy and synonymy
antonymy relations

25
Features of Thesauri (4/5)

Number of entries for each term
homographs: words with multiple meanings
each homograph entry is associated with its own set of relations
problem: how to select between alternative meanings
typically the user has to select between alternative meanings
Specificity of vocabulary
is a function of the precision associated with the component terms
disadvantage: the size of the vocabulary grows, since a large number of terms are required to cover the concepts in the domain
high specificity implies a high coordination
level
a highly specific vocabulary promotes precision
in retrieval

26
Features of Thesauri (5/5)

Control on term frequency of class members
for statistical thesaurus construction methods
terms included in the same thesaurus class have
roughly equal frequencies
the total frequency in each class should also be
roughly similar
Normalization of vocabulary
Normalization of vocabulary terms is given
considerable emphasis in manual thesauri
terms should be in noun form
noun phrases should avoid prepositions unless
they are commonly known
a limited number of adjectives should be used
...

27
Thesaurus Construction

Manual thesaurus construction
define the boundaries of the subject area
collect the terms for each subarea; sources: indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, ...
organize the terms and their relationships into structures
review (and refine) the entire thesaurus for consistency
Automatic thesaurus construction
from a collection of document items
by merging existing thesauri

28
Thesaurus Construction from Texts

1. Construction of vocabulary: normalization and selection of terms; phrase construction depending on the coordination level desired.
2. Similarity computations between terms: identify the significant statistical associations between terms.
3. Organization of vocabulary: organize the selected vocabulary into a hierarchy on the basis of the associations computed in step 2.

29
Construction of Vocabulary

Objective: identify the most informative terms (words and phrases)
Procedure:
(1) Identify an appropriate document collection. The document collection should be sizable and representative of the subject area.
(2) Determine the required specificity for the thesaurus.
(3) Normalize the vocabulary terms.
    (a) Eliminate very trivial words such as prepositions and conjunctions.
    (b) Stem the vocabulary.
(4) Select the most interesting stems, and create interesting phrases for a higher coordination level.

30
Stem evaluation and selection

Selection by frequency of occurrence
each term may belong to a category of high, medium, or low frequency
terms in the mid-frequency range are the best for
indexing and searching

31
Stem evaluation and selection (Continued)

Selection by discrimination value (DV)
the more discriminating a term, the higher its value as an index term
procedure:
compute the average inter-document similarity in the collection
remove the term K from the indexing vocabulary, and recompute the average similarity
DV(K) = (average similarity without K) - (average similarity with K)
The DV for good discriminators is positive.
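
The procedure can be tried on a toy collection. The sketch assumes documents are dictionaries of term weights and that inter-document similarity is the cosine measure; neither choice is fixed by the slide, and the function names are illustrative. It needs at least two documents.

```python
from itertools import combinations
from math import sqrt


def cosine(doc_a, doc_b):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(weight * doc_b.get(term, 0) for term, weight in doc_a.items())
    norm = (sqrt(sum(w * w for w in doc_a.values()))
            * sqrt(sum(w * w for w in doc_b.values())))
    return dot / norm if norm else 0.0


def average_similarity(docs):
    """Average similarity over all document pairs (requires >= 2 documents)."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_value(term, docs):
    """DV(K) = (average similarity without K) - (average similarity with K)."""
    without = [{t: w for t, w in doc.items() if t != term} for doc in docs]
    return average_similarity(without) - average_similarity(docs)


DOCS = [
    {"the": 1, "stemmer": 1},
    {"the": 1, "thesaurus": 1},
    {"the": 1, "stoplist": 1},
]
```

In this toy collection "the" occurs everywhere, so removing it makes the documents less alike and its DV is negative; "stemmer" occurs in one document only, and its DV is positive.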

32
Phrase Construction

Salton and McGill procedure
1. Compute pairwise co-occurrence for high-frequency words.
2. If this co-occurrence is lower than a threshold, then do not consider the pair any further.
3. For pairs that qualify, compute the cohesion value, using either
   COHESION(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))
   or
   COHESION(ti, tj) = size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))
   where size-factor is the size of the thesaurus vocabulary.
4. If cohesion is above a second threshold, retain the phrase.
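
A sketch of the procedure with both cohesion formulas, reading the first as co-occurrence frequency divided by the square root of the product of the two term frequencies, and the second as size-factor times co-occurrence frequency divided by that product. The function names, the input dictionaries, and the choice of the square-root variant inside select_phrases are assumptions for illustration.

```python
from math import sqrt


def cohesion_sqrt(cooccurrence, freq_i, freq_j):
    """co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))"""
    return cooccurrence / sqrt(freq_i * freq_j)


def cohesion_size_factor(cooccurrence, freq_i, freq_j, size_factor):
    """size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))"""
    return size_factor * cooccurrence / (freq_i * freq_j)


def select_phrases(pair_counts, term_freq, min_cooccurrence, min_cohesion):
    """Keep pairs that pass the co-occurrence and then the cohesion threshold."""
    phrases = []
    for (ti, tj), count in pair_counts.items():
        if count < min_cooccurrence:
            continue  # step 2: drop rarely co-occurring pairs
        if cohesion_sqrt(count, term_freq[ti], term_freq[tj]) > min_cohesion:
            phrases.append((ti, tj))  # step 4: retain the phrase
    return phrases
```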

33
Phrase Construction (Continued)

Choueka procedure
1. Select the range of length allowed for each collocational expression, e.g., 2-6 words.
2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency.
3. Delete sequences that begin or end with a trivial word (e.g., prepositions, pronouns, articles, conjunctions, etc.).
4. Delete expressions that contain high-frequency nontrivial words.
5. Given an expression, evaluate any potential sub-expressions for relevance. Discard any that are not sufficiently relevant.
6. Try to merge smaller expressions into larger and more meaningful ones.

34
Term-Phrase Formation

Term phrase: a sequence of related text words that carries a more specific meaning than the single terms, e.g., computer science vs. computer

[Figure: phrase transformation and thesaurus transformation, plotted against document frequency (axis up to N)]

35
Similarity Computation

Cosine: compute the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second term.
Dice: compute the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.
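
Both measures can be written directly from these descriptions if each term is represented by the set of documents it occurs in. The function names and the set representation are assumptions. The Dice function follows the wording above literally; the Dice coefficient is more commonly written with a factor of 2 in the numerator.

```python
from math import sqrt


def cosine_terms(docs_a, docs_b):
    """Shared documents / sqrt(|docs of term a| * |docs of term b|)."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / sqrt(len(docs_a) * len(docs_b))


def dice_terms(docs_a, docs_b):
    """Shared documents / (|docs of term a| + |docs of term b|), as worded above."""
    total = len(docs_a) + len(docs_b)
    if total == 0:
        return 0.0
    return len(docs_a & docs_b) / total
```

For two terms occurring in documents {1, 2, 3, 4} and {3, 4, 5, 6}, two documents are shared, giving a cosine of 2 / 4 = 0.5 and a Dice value (as worded here) of 2 / 8 = 0.25.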

36
Vocabulary Organization

Clustering
Forsyth and Rada (1986)
Assumptions
(1) high-frequency words have broad meaning, while low-frequency words have narrow meaning.
(2) if the density functions of two terms have the same shape, then the two words have similar meaning.

1. Identify a set of frequency ranges.
2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1.
3. The highest frequency class is assigned level 0, the next, level 1, and so on.

37
Forsyth and Rada (cont.)

4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute the similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies for this, then each becomes a parent of t. In other words, a term is allowed to have multiple parents.
5. After all terms in level i have been linked to level i-1 terms, check the level i-1 terms and identify those that have no children. Propagate each such term to level i by creating an identical dummy term as its child.
6. Perform steps 4 and 5 for each level, starting with level 1.
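
The six steps can be sketched compactly. Everything here is an assumption made for illustration: frequency ranges are given as descending lower bounds, the similarity function is supplied by the caller, every level is non-empty, and nodes are written as (term, level) pairs so that a dummy child can share its parent's name.

```python
def assign_levels(term_freq, lower_bounds):
    """Steps 1-3: lower_bounds is descending; level 0 holds the most frequent terms."""
    levels = [[] for _ in lower_bounds]
    for term, freq in term_freq.items():
        for level, bound in enumerate(lower_bounds):
            if freq >= bound:
                levels[level].append(term)
                break
    return levels


def link_levels(levels, similarity):
    """Steps 4-6: return (levels with dummies, list of (child, parent) links)."""
    levels = [list(level) for level in levels]  # copy; dummy terms are appended
    links = []
    for i in range(1, len(levels)):
        parents_used = set()
        for term in levels[i]:
            scores = {p: similarity(term, p) for p in levels[i - 1]}
            best = max(scores.values())
            for parent, score in scores.items():
                if score == best:  # ties give the term several parents
                    links.append(((term, i), (parent, i - 1)))
                    parents_used.add(parent)
        for parent in levels[i - 1]:
            if parent not in parents_used:
                # Step 5: a childless term is propagated as an identical dummy child.
                levels[i].append(parent)
                links.append(((parent, i), (parent, i - 1)))
    return levels, links
```

Because dummy children are appended to level i before level i+1 is processed, a childless term can still acquire real children further down the hierarchy.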