Terms and Query Operations - PowerPoint PPT Presentation

About This Presentation
Title:

Terms and Query Operations

Description:

Terms and Query Operations Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. – PowerPoint PPT presentation

Slides: 39
Provided by: Hsin82

Transcript and Presenter's Notes

Title: Terms and Query Operations


1
Terms and Query Operations
  • Information Retrieval: Data Structures and
    Algorithms
  • by W.B. Frakes and R. Baeza-Yates (Eds.)
    Englewood Cliffs, NJ: Prentice Hall, 1992.
  • Chapters 7-9

2
Lexical Analysis and Stoplists
  • Chapter 7

3
Lexical Analysis for Automatic Indexing
  • Lexical analysis: convert an input stream of
    characters into a stream of words or tokens.
  • What is a word or a token? Tokens consist of
    letters.
  • Digits: most numbers are not good index
    terms. Counterexamples: case numbers in a legal
    database; B6 and B12 in a vitamin database.
  • Hyphens
  • break hyphenated words: state-of-the-art -> state
    of the art
  • keep hyphenated words as a token: Jean-Claude,
    F-16
  • Other punctuation: often used as parts of terms,
    e.g., OS/2
  • Case: usually not significant in index terms
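The design choices on this slide can be sketched as switches on a small tokenizer. This is only an illustration: the flag names (`break_hyphens`, `keep_digits`, `fold_case`) and the regular expression are assumptions, not something taken from the textbook.

```python
import re

# Letters/digits, optionally joined by "-" or "/" (so "OS/2" and "F-16" match whole).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-/][A-Za-z0-9]+)*")

def tokenize(text, break_hyphens=True, keep_digits=False, fold_case=True):
    """Return candidate index terms; every flag is a hypothetical policy switch."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if fold_case:
            token = token.lower()  # case is usually not significant
        # Either split "state-of-the-art" into its parts or keep it as one token.
        parts = token.split("-") if break_hyphens else [token]
        for part in parts:
            if not keep_digits and part.isdigit():
                continue  # pure numbers are usually poor index terms
            tokens.append(part)
    return tokens
```

With the defaults, `tokenize("State-of-the-art OS/2 has 42 users")` drops `42` and splits the hyphenated word; passing `break_hyphens=False` keeps tokens such as `F-16` intact.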

4
Lexical Analysis for Automatic Indexing (Continued)
  • Issues: recall and precision
  • breaking up hyphenated terms: increases recall but
    decreases precision
  • preserving case distinctions: enhances precision
    but decreases recall
  • commercial information systems: usually take a
    conservative (recall-enhancing) approach

5
Lexical Analysis for Query Processing
  • Tasks
  • depend on the design strategies of the lexical
    analyzer for automatic indexing (search terms
    must match index terms)
  • distinguish operators like Boolean operators
  • distinguish grouping indicators like parentheses
    and brackets
  • flag illegal characters as unrecognized tokens

6
STOPLISTS (negative dictionary)
  • Avoid retrieving almost every item in a database
    regardless of its relevance.
  • Example (derived from the Brown corpus): 425 words
    a, about, above, across, after, again, against, all,
    almost, alone, along, already, also, although,
    always, among, an, and, another, any, anybody,
    anyone, anything, anywhere, are, area, areas,
    around, as, ask, asked, asking, asks, at, away,
    b, back, backed, backing, backs, be, because,
    became, ...
  • Commercial information systems tend to take a
    conservative approach, with few stopwords

7
Implementing Stoplists
  • Approaches
  • examine lexical analyzer output and remove any
    stopwords
  • remove stopwords as part of lexical analysis
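A minimal sketch of the first approach (filtering the lexical analyzer's output), assuming the stoplist fits in memory as a hash set. The tiny `STOPWORDS` set here is just the first few words of the example list, and the function name is made up for illustration.

```python
# A handful of entries from the example stoplist; a real list would be longer.
STOPWORDS = frozenset({"a", "about", "above", "across", "after", "again", "against", "all"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stoplist (case-insensitive lookup)."""
    return [token for token in tokens if token.lower() not in stopwords]
```

A set lookup takes constant time on average, so the filter adds little cost per token. The second approach would move the same check inside the lexical analyzer itself.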

8
Stemming Algorithms
  • Chapter 8

9
Stemmers
  • Programs that relate morphologically similar
    indexing and search terms
  • Stem at indexing time
  • advantage: efficiency and index file compression
  • disadvantage: information about the full terms is
    lost
  • Example (CATALOG system), stem at search time
    Look for: system users
    Search Term: users
    Term        Occurrences
    1. user     15
    2. users    1
    3. used     3
    4. using    2

10
Conflation Methods
  • Manual
  • Automatic (stemmers)
  • table lookup
  • successor variety
  • n-gram
  • affix removal: longest match vs. simple removal
  • Evaluation
  • correctness
  • retrieval effectiveness
  • compression performance

11
Successor Variety
  • Definition (successor variety of a string): the
    number of different characters that follow it in
    words in some body of text
  • Example: a body of text: able, axle, accident,
    ape, about
    successor variety of apple
    1st: 4 (b, x, c, p)
    2nd: 1 (e)

12
Successor Variety (Continued)
  • Idea: the successor variety of substrings of a
    term will decrease as more characters are added
    until a segment boundary is reached, at which
    point the successor variety will sharply increase.
  • Example
    Test word: READABLE
    Corpus: ABLE, BEATABLE, FIXABLE, READ, READABLE,
    READING, RED, ROPE, RIPE
    Prefix     Successor Variety   Letters
    R          3                   E, O, I
    RE         2                   A, D
    REA        1                   D
    READ       3                   A, I, S
    READA      1                   B
    READAB     1                   L
    READABL    1                   E
    READABLE   1                   blank
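The counts can be reproduced with a few lines of code. The sketch below assumes that the end of a word counts as one more possible successor (the "blank" row); the `END` marker and the function names are illustrative choices.

```python
END = "$"  # hypothetical marker standing for "the word ends here" (blank)

def successor_variety(prefix, corpus):
    """Count the distinct successors of `prefix` among the corpus words."""
    successors = set()
    for word in corpus:
        if word.startswith(prefix):
            rest = word[len(prefix):]
            successors.add(rest[0] if rest else END)
    return len(successors)

def variety_profile(word, corpus):
    """Successor variety of every prefix of `word`, shortest prefix first."""
    return [(word[:i], successor_variety(word[:i], corpus))
            for i in range(1, len(word) + 1)]

corpus = ["ABLE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "RED", "ROPE", "RIPE"]
for prefix, variety in variety_profile("READABLE", corpus):
    print(prefix, variety)
```

Under that assumption the profile for READABLE comes out as 3, 2, 1, 3, 1, 1, 1, 1, with the jump at READ marking the likely segment boundary.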

13
The successor variety stemming process
  • Determine the successor variety for a word.
  • Use this information to segment the word.
  • cutoff method: a boundary is identified whenever
    the cutoff value is reached
  • peak and plateau method: a break is made after a
    character whose successor variety exceeds that of
    the character immediately preceding it and the
    character immediately following it
  • complete word method: a break is made after a
    segment that is a complete word
  • entropy method
  • Select one of the segments as the stem.
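As an illustration of the segmentation step, here is a sketch of the peak test only. It takes the successor varieties as a precomputed list, one value per prefix, and it ignores the first and last positions because they have only one neighbour; that boundary handling is an assumption.

```python
def peak_and_plateau_segments(word, varieties):
    """Split `word` after each position whose successor variety is a local peak.

    varieties[i] is the successor variety of the prefix word[:i + 1].
    """
    cuts = []
    for i in range(1, len(varieties) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            cuts.append(i + 1)  # cut after the (i + 1)-th character
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments

# Varieties for the prefixes R, RE, REA, READ, ... of READABLE.
print(peak_and_plateau_segments("READABLE", [3, 2, 1, 3, 1, 1, 1, 1]))
```

For READABLE the only peak is at READ, so the word splits into READ and ABLE; choosing which segment becomes the stem is a separate step.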

14
n-gram stemmers
  • Digram: a pair of consecutive letters
  • Shared digram method (Adamson and Boreham, 1974)
  • association measures are calculated between pairs
    of terms: S = 2C / (A + B)
  • where A = the number of unique digrams in the
    first word, B = the number of unique digrams
    in the second, C = the number of unique
    digrams shared by A and B

15
n-gram stemmers (Continued)
  • Example
    statistics => st ta at ti is st ti ic cs
    unique digrams => at cs ic is st ta ti
    statistical => st ta at ti is st ti ic ca al
    unique digrams => al at ca ic is st ta ti
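The example can be checked with a short sketch. It assumes the association measure is Dice's coefficient, S = 2C / (A + B), where A and B are the numbers of unique digrams in each word and C is the number they share; the function names are illustrative.

```python
def unique_digrams(word):
    """Set of distinct pairs of consecutive letters in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(word1, word2):
    """Dice's coefficient over unique digrams: 2C / (A + B)."""
    a, b = unique_digrams(word1), unique_digrams(word2)
    if not a and not b:
        return 0.0  # two words that are too short to have any digram
    return 2 * len(a & b) / (len(a) + len(b))

print(sorted(unique_digrams("statistics")))
print(dice_similarity("statistics", "statistical"))
```

Here "statistics" has 7 unique digrams, "statistical" has 8, and 6 are shared, so S = 2 * 6 / (7 + 8) = 0.80.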

16
n-gram stemmers (Continued)
  • similarity matrix: determine the similarity
    measures for all pairs of terms in the database
             word1  word2  word3  ...  wordn-1
    word1
    word2    S21
    word3    S31    S32
    ...
    wordn    Sn1    Sn2    Sn3    ...  Sn(n-1)
  • terms are clustered using a single link
    clustering method
  • most pairwise similarity measures were 0
  • using a cutoff similarity value of .6
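A rough sketch of the clustering step: single-link clustering with a cutoff amounts to linking every pair of terms whose similarity reaches the cutoff and taking the connected groups. The union-find bookkeeping, the use of `>=` at the cutoff, and the Dice measure over digrams are assumptions made for this example.

```python
from itertools import combinations

def unique_digrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(word1, word2):
    a, b = unique_digrams(word1), unique_digrams(word2)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def single_link_clusters(terms, cutoff=0.6):
    """Group terms so that any pair with similarity >= cutoff shares a cluster."""
    parent = {term: term for term in terms}

    def find(term):
        # Follow parent links to the cluster representative (with path halving).
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in combinations(terms, 2):
        if dice(a, b) >= cutoff:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for term in terms:
        clusters.setdefault(find(term), []).append(term)
    return sorted(sorted(members) for members in clusters.values())

print(single_link_clusters(["statistics", "statistical", "reader", "readers", "index"]))
```

Each resulting cluster plays the role of one stem class; terms with no sufficiently similar partner stay in clusters of their own.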

17
Affix Removal Stemmers
  • Procedure: remove suffixes and/or prefixes from
    terms leaving a stem, and transform the resultant
    stem.
  • Example: plural forms
    If a word ends in "ies" but not "eies" or "aies",
    then "ies" -> "y"
    If a word ends in "es" but not "aes", "ees", or
    "oes", then "es" -> "e"
    If a word ends in "s" but not "us" or "ss",
    then "s" -> NULL
  • Ambiguity
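The three plural rules translate almost directly into code. The sketch below assumes lowercase input and that only the first applicable rule fires; the function name is illustrative.

```python
def remove_plural(word):
    """Apply the first plural rule whose suffix and exceptions both fit."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> NULL
    return word

for plural in ["libraries", "cases", "trees", "glass", "corpus"]:
    print(plural, "->", remove_plural(plural))
```

Note how "trees" falls through the second rule because of the "ees" exception and is handled by the third, while "glass" and "corpus" are left unchanged.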

18
Affix Removal Stemmers (Continued)
  • Iterative longest match stemmer: remove the
    longest possible string of characters from a word
    according to a set of rules
  • recoding: AxC -> AyC, e.g., ki -> ky
  • partial matching: only the n initial characters of
    stems are used in comparing
  • Different versions: Lovins, Salton, Dawson,
    Porter. Students can refer to the rules listed
    in the textbook.

19
Thesaurus Construction
  • Chapter 9

20
Thesaurus Construction
  • IR thesaurus: a list of terms (words or phrases)
    along with relationships among them
  • INSPEC thesaurus (1979): physics, EE, electronics,
    computer and control
    cesium (Cs)
      USE caesium (USE: the preferred form)
    computer-aided instruction
      see also education (cross-referenced terms)
      UF teaching machines (UF: a set of alternatives)
      BT educational computing (BT: broader terms,
         cf. NT)
      TT computer applications (TT: root node/top term)
      RT education (RT: related terms)
         teaching
      CC C7810C (CC: subject area)
      FC C7810Cf (subject area)

21
Usage
  • Indexing: select the most appropriate thesaurus
    entries for representing the document.
  • Searching: design the most appropriate search
    strategy.
  • If the search does not retrieve enough documents,
    the thesaurus can be used to expand the query.
  • If the search retrieves too many items, the
    thesaurus can suggest more specific search
    vocabulary.

22
Features of Thesauri (1/5)
  • Coordination Level
  • the construction of phrases from individual terms
  • pre-coordination: contains phrases
  • phrases are available for indexing and retrieval
  • advantage: reducing ambiguity in indexing and
    searching
  • disadvantage: the searcher has to know the phrase
    formulation rules
  • lower recall
  • post-coordination: does not allow phrases
  • phrases are constructed while searching
  • advantage: users do not have to worry about the
    exact word ordering
  • disadvantage: the search precision may fall,
    e.g., library school vs. school library
  • lower precision

23
Features of Thesauri (2/5)
  • intermediate level: allows both phrases and
    single words
  • the higher the level of coordination, the greater
    the precision of the vocabulary but the larger
    the vocabulary size
  • it also implies an increase in the number of
    relationships to be encoded
  • Precoordination is more common in manually
    constructed thesauri.
  • Automatic phrase construction is still quite
    difficult and therefore automatic thesaurus
    construction usually implies post-coordination

24
Features of Thesauri (3/5)
  • Term Relationships
  • Aitchison and Gilchrist (1972)
  • equivalence relationships: synonymy or
    quasi-synonymy
  • hierarchical relationships, e.g., genus-species
  • nonhierarchical relationships,
  • e.g., thing-part, bus and seat
  • e.g., thing-attribute, rose and fragrance
  • Wang, Vandendorpe, and Evens (1985)
  • parts-wholes, e.g., set-element, count-mass
  • collocation relations: words that frequently
    co-occur in the same phrase or sentence
  • paradigmatic relations, e.g., moon and
    lunar
  • taxonomy and synonymy
  • antonymy relations

25
Features of Thesauri (4/5)
  • Number of entries for each term
  • homographs: words with multiple meanings
  • each homograph entry is associated with its own
    set of relations
  • problem: how to select between alternative
    meanings
  • typically the user has to select between
    alternative meanings
  • Specificity of vocabulary
  • is a function of the precision associated with
    the component terms
  • disadvantage: the size of the vocabulary grows
    since a large number of terms are required to
    cover the concepts in the domain
  • high specificity implies a high coordination
    level
  • a highly specific vocabulary promotes precision
    in retrieval

26
Features of Thesauri (5/5)
  • Control on term frequency of class members
  • for statistical thesaurus construction methods
  • terms included in the same thesaurus class have
    roughly equal frequencies
  • the total frequency in each class should also be
    roughly similar
  • Normalization of vocabulary
  • Normalization of vocabulary terms is given
    considerable emphasis in manual thesauri
  • terms should be in noun form
  • noun phrases should avoid prepositions unless
    they are commonly known
  • a limited number of adjectives should be used
  • ...

27
Thesaurus Construction
  • Manual thesaurus construction
  • define the boundaries of the subject area
  • collect the terms for each subarea
    sources: indexes, encyclopedias, handbooks,
    textbooks, journal titles and abstracts,
    catalogues, ...
  • organize the terms and their relationships into
    structures
  • review (and refine) the entire thesaurus for
    consistency
  • Automatic thesaurus construction
  • from a collection of document items
  • by merging existing thesauri

28
Thesaurus Construction from Texts
1. Construction of vocabulary: normalization
and selection of terms; phrase construction
depending on the coordination level desired
2. Similarity computations between terms:
identify the significant statistical associations
between terms
3. Organization of vocabulary: organize the
selected vocabulary into a hierarchy on the basis
of the associations computed in step 2.
29
Construction of Vocabulary
  • Objective: identify the most informative terms
    (words and phrases)
  • Procedure
    (1) Identify an appropriate document collection.
        The document collection should be sizable and
        representative of the subject area.
    (2) Determine the required specificity for the
        thesaurus.
    (3) Normalize the vocabulary terms.
        (a) Eliminate very trivial words such as
            prepositions and conjunctions.
        (b) Stem the vocabulary.
    (4) Select the most interesting stems, and create
        interesting phrases for a higher coordination
        level.

30
Stem evaluation and selection
  • Selection by frequency of occurrence
  • each term falls into a high-, medium-, or
    low-frequency category
  • terms in the mid-frequency range are the best for
    indexing and searching

31
Stem evaluation and selection (Continued)
  • selection by discrimination value (DV)
  • the more discriminating a term, the higher its
    value as an index term
  • procedure
  • compute the average inter-document similarity in
    the collection
  • Remove the term K from the indexing vocabulary,
    and recompute the average similarity
  • DV(K) = (average similarity without K) - (average
    similarity with K)
  • The DV for good discriminators is positive.
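The procedure can be sketched as follows. The slide does not say which inter-document similarity to use, so this example assumes cosine similarity over term-frequency dictionaries, averaged over all document pairs; all names are illustrative.

```python
import math
from itertools import combinations

def cosine(doc1, doc2):
    """Cosine similarity of two {term: weight} dictionaries."""
    dot = sum(weight * doc2.get(term, 0) for term, weight in doc1.items())
    norm1 = math.sqrt(sum(w * w for w in doc1.values()))
    norm2 = math.sqrt(sum(w * w for w in doc2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def average_similarity(docs):
    """Mean similarity over all unordered pairs of documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def discrimination_value(term, docs):
    """DV(K) = (average similarity without K) - (average similarity with K)."""
    without = [{t: w for t, w in doc.items() if t != term} for doc in docs]
    return average_similarity(without) - average_similarity(docs)

docs = [{"the": 1, "stem": 1}, {"the": 1, "index": 1}, {"the": 1, "query": 1}]
print(discrimination_value("the", docs), discrimination_value("stem", docs))
```

In this toy collection "the" occurs in every document, so removing it makes the documents less alike and its DV is negative; "stem" separates one document from the rest, so its DV is positive.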

32
Phrase Construction
  • Salton and McGill procedure
    1. Compute pairwise co-occurrence for
       high-frequency words.
    2. If this co-occurrence is lower than a threshold,
       then do not consider the pair any further.
    3. For pairs that qualify, compute the cohesion
       value, using either
       COHESION(ti, tj) = co-occurrence-frequency /
         sqrt(frequency(ti) * frequency(tj))
       or
       COHESION(ti, tj) = size-factor *
         co-occurrence-frequency /
         (frequency(ti) * frequency(tj))
       where size-factor is the size of the thesaurus
       vocabulary.
    4. If cohesion is above a second threshold, retain
       the phrase.
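A small sketch of both cohesion formulas and the two-threshold filter. The function names, the dictionary layout, and the choice of the square-root variant inside the filter are assumptions for illustration; the numbers in the usage lines are invented.

```python
import math

def cohesion_sqrt(cooccurrence, freq_i, freq_j):
    """co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))"""
    return cooccurrence / math.sqrt(freq_i * freq_j)

def cohesion_size_factor(cooccurrence, freq_i, freq_j, size_factor):
    """size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))"""
    return size_factor * cooccurrence / (freq_i * freq_j)

def retain_phrases(pair_counts, term_freq, cooc_threshold, cohesion_threshold):
    """Keep pairs that pass the co-occurrence and then the cohesion threshold."""
    kept = []
    for (ti, tj), cooccurrence in pair_counts.items():
        if cooccurrence < cooc_threshold:
            continue  # step 2: too rare together, drop the pair
        if cohesion_sqrt(cooccurrence, term_freq[ti], term_freq[tj]) > cohesion_threshold:
            kept.append((ti, tj))  # step 4: cohesive enough to keep as a phrase
    return kept

# Made-up counts, only to show the call shape.
pair_counts = {("computer", "science"): 8, ("computer", "the"): 2}
term_freq = {"computer": 16, "science": 4, "the": 100}
print(retain_phrases(pair_counts, term_freq, cooc_threshold=3, cohesion_threshold=0.5))
```

Both formulas reward pairs that co-occur often relative to how frequent the individual words are; they differ only in how strongly the individual frequencies are penalized.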

33
Phrase Construction (Continued)
  • Choueka procedure
    1. Select the range of length allowed for each
       collocational expression, e.g., 2-6 words.
    2. Build a list of all potential expressions from
       the collection with the prescribed length that
       have a minimum frequency.
    3. Delete sequences that begin or end with a
       trivial word (e.g., prepositions, pronouns,
       articles, conjunctions, etc.).
    4. Delete expressions that contain high-frequency
       nontrivial words.
    5. Given an expression, evaluate any potential
       sub-expressions for relevance. Discard any
       that are not sufficiently relevant.
    6. Try to merge smaller expressions into larger
       and more meaningful ones.

34
Term-Phrase Formation
  • Term phrase: a sequence of related text words
    that carries a more specific meaning than the
    single terms, e.g., computer science vs. computer

[Figure: phrase transformation and thesaurus transformation along a document frequency axis (up to N)]
35
Similarity Computation
  • Cosine: compute the number of documents associated
    with both terms divided by the square root of the
    product of the number of documents associated
    with the first term and the number of documents
    associated with the second term.
  • Dice: compute the number of documents associated
    with both terms divided by the sum of the number
    of documents associated with one term and the
    number associated with the other.
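Both measures follow directly from document sets. The sketch below represents each term by the set of document ids it occurs in (an assumed representation) and implements the two definitions exactly as worded on the slide. Note that the Dice coefficient is often written with a factor of 2 in the numerator; that factor is left out here to match the slide.

```python
import math

def cosine_similarity(docs_a, docs_b):
    """|A and B| / sqrt(|A| * |B|) for two sets of document ids."""
    denominator = math.sqrt(len(docs_a) * len(docs_b))
    return len(docs_a & docs_b) / denominator if denominator else 0.0

def dice_similarity(docs_a, docs_b):
    """|A and B| / (|A| + |B|), as the slide words it (no factor of 2)."""
    denominator = len(docs_a) + len(docs_b)
    return len(docs_a & docs_b) / denominator if denominator else 0.0

# Hypothetical postings: ids of the documents containing each term.
term_a = {1, 2, 3, 4}
term_b = {3, 4, 5, 6}
print(cosine_similarity(term_a, term_b), dice_similarity(term_a, term_b))
```

With two shared documents out of four each, the cosine value is 2 / sqrt(16) = 0.5 and the slide's Dice value is 2 / 8 = 0.25.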

36
Vocabulary Organization
  • Clustering
  • Forsyth and Rada (1986)
  • Assumptions
  • (1) high-frequency words have broad meaning,
    while low-frequency words have narrow meaning.
  • (2) if the density functions of two terms have
    the same shape, then the two words have similar
    meaning.

1. Identify a set of frequency ranges.
2. Group the vocabulary terms into different classes
based on their frequencies and the ranges selected
in step 1.
3. The highest frequency class is assigned level 0,
the next, level 1, and so on.
37
Forsyth and Rada (cont.)
  • 4. Parent-child links are determined between
    adjacent levels as follows. For each term t
    in level i, compute similarity between t and
    every term in level i-1. Term t becomes the
    child of the most similar term in level i-1. If
    more than one term in level i-1 qualifies for
    this, then each becomes a parent of t. In other
    words, a term is allowed to have multiple
    parents.
  • 5. After all terms in level i have been linked to
    level i-1 terms, check level i-1 terms and
    identify those that have no children. Propagate
    such terms to level i by creating an identical
    dummy term as its child.
  • 6. Perform steps 4 and 5 for each level starting
    with level.
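The level assignment and linking steps can be sketched as below. This is only an illustration under assumptions: frequency ranges are given as descending lower bounds, every level is non-empty, the similarity function is supplied by the caller, and the loop simply visits each level that has a level above it. None of the names come from the original method.

```python
def assign_levels(term_freq, lower_bounds):
    """Steps 1-3: place each term in the first range whose lower bound it meets.

    lower_bounds is descending, e.g. [100, 10, 0]; index 0 is level 0.
    """
    levels = [[] for _ in lower_bounds]
    for term, freq in sorted(term_freq.items()):
        for level, bound in enumerate(lower_bounds):
            if freq >= bound:
                levels[level].append(term)
                break
    return levels

def link_levels(levels, similarity):
    """Steps 4-5: link each term to its most similar term(s) one level up."""
    levels = [list(level) for level in levels]  # copy, since dummies are appended
    parents = {}  # (level, term) -> list of parent terms in level - 1
    for i in range(1, len(levels)):
        upper = levels[i - 1]
        with_children = set()
        for term in levels[i]:
            scores = {p: similarity(term, p) for p in upper}
            best = max(scores.values())
            chosen = [p for p in upper if scores[p] == best]  # ties: several parents
            parents[(i, term)] = chosen
            with_children.update(chosen)
        for p in upper:
            if p not in with_children:
                levels[i].append(p)      # identical dummy term one level down
                parents[(i, p)] = [p]
    return levels, parents
```

A usage example would pass any term-term similarity, such as a cosine over document sets. A childless term is carried down as a dummy so that terms in later levels can still attach to it.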

38
Merging Existing Thesauri
  • simple merge: link hierarchies wherever they have
    terms in common
  • complex merge
  • link terms from different hierarchies if they are
    similar enough.
  • similarity is a function of the number of parent
    and child terms in common