1
Text Operations
  • J. H. Wang
  • Feb. 21, 2008

2
The Retrieval Process
3
Outline
  • Document Preprocessing (7.1-7.2)
  • Text Compression (7.4-7.5, skipped)
  • Automatic Indexing (Chap. 9, Salton)
  • Term Selection

4
Document Preprocessing
  • Lexical analysis
  • Letters, digits, punctuation marks, …
  • Stopword removal
  • the, of, …
  • Stemming
  • Prefix, suffix
  • Index term selection
  • Noun
  • Construction of term categorization structure
  • Thesaurus

5
  • Logical view of the documents
(Figure: text transformation pipeline — docs, structure, accents/spacing, stopwords, noun groups, stemming, manual indexing)
6
Lexical Analysis
  • Converting a stream of characters into a stream
    of words
  • Recognition of words
  • Digits: usually not good index terms
  • Ex: the number of deaths due to car accidents
    between 1910 and 1989, 510 B.C., credit card
    numbers, …
  • Hyphens
  • Ex: state-of-the-art, gilt-edge, B-49, …
  • Punctuation marks: normally removed entirely
  • Ex: 510B.C., program codes x.id vs. xid, …
  • The case of letters: usually not important
  • Ex: Bank vs. bank, Unix-like operating systems, …
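The policies above can be sketched as a small tokenizer. This is a minimal illustration, not a production lexical analyzer; the choices (fold case, split hyphenated forms, discard digit runs) are just the options discussed on this slide:

```python
import re

def tokenize(text):
    """Minimal lexical-analysis sketch: lowercase the text, split on
    non-letter characters, and drop empty tokens. Digits, hyphens, and
    punctuation all act as separators, so numbers disappear and
    "state-of-the-art" yields four tokens."""
    # Lowercase so "Bank" and "bank" map to the same term
    text = text.lower()
    # Anything that is not a letter is treated as a token boundary
    tokens = re.split(r"[^a-z]+", text)
    return [t for t in tokens if t]

print(tokenize("State-of-the-art systems, ca. 510 B.C."))
```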

7
Elimination of Stopwords
  • Stopwords: words that are too frequent among the
    documents in the collection and thus are not good
    discriminators
  • Articles, prepositions, conjunctions, …
  • Some verbs, adverbs, and adjectives
  • Reduces the size of the indexing structure
  • Stopword removal might reduce recall
  • Ex: "to be or not to be"
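Stopword elimination is a simple filter. The sketch below uses a tiny hand-picked stop list (real systems use lists of a few hundred words) and shows the recall hazard from the slide: the famous query vanishes entirely:

```python
# Tiny illustrative stop list; actual stop lists are much larger
STOPWORDS = {"the", "of", "a", "an", "to", "be", "or", "not", "and", "in"}

def remove_stopwords(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]

# The recall hazard: every word of this query is a stopword
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))  # -> []
```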

8
Stemming
  • The substitution of words by their respective
    stems
  • Ex: plurals, gerund forms, past-tense suffixes, …
  • A stem is the portion of a word which is left
    after the removal of its affixes (i.e., prefixes
    and suffixes)
  • Ex: connect, connected, connecting, connection,
    connections
  • Controversy about the benefits
  • Useful for improving retrieval performance
  • Reduces the size of the indexing structure

9
Stemming
  • Four types of stemming strategies
  • Affix removal, table lookup, successor variety,
    and n-grams (or term clustering)
  • Suffix removal
  • Porter's algorithm (available in the Appendix)
  • Simplicity and elegance
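The affix-removal strategy can be illustrated with a toy suffix stripper. This is far simpler than Porter's algorithm (it has none of Porter's measure conditions and will over- and under-stem), but it shows the idea of trying suffixes longest-first:

```python
def stem(word):
    """Toy suffix-removal stemmer: strip the first matching suffix,
    longest first, as long as at least 3 characters of stem remain.
    Illustrative only -- not Porter's algorithm."""
    for suffix in ("ations", "ation", "ions", "ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
print({stem(w) for w in words})  # all five collapse to one stem
```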

10
Index Term Selection
  • Manually or automatically
  • Identification of noun groups
  • Most of the semantics is carried by the noun
    words
  • Systematic elimination of verbs, adjectives,
    adverbs, connectives, articles, and pronouns
  • A noun group is a set of nouns whose syntactic
    distance in the text does not exceed a predefined
    threshold

11
Thesauri
  • Thesaurus: a reference to a treasury of words
  • A precompiled list of important words in a given
    domain of knowledge
  • For each word in this list, a set of related
    words
  • Ex: synonyms, …
  • It also involves normalization of vocabulary and
    a structure

12
Example Entry in Peter Roget's Thesaurus
  • Cowardly (adjective)
  • Ignobly lacking in courage: cowardly turncoats.
  • Syns: chicken (slang), chicken-hearted, craven,
    dastardly, faint-hearted, gutless, lily-livered,
    pusillanimous, unmanly, yellow (slang),
    yellow-bellied (slang).

13
Main Purposes of a Thesaurus
  • To provide a standard vocabulary for indexing and
    searching
  • To assist users with locating terms for proper
    query formulation
  • To provide classified hierarchies that allow the
    broadening and narrowing of the current query
    request according to the user's needs

14
Motivation for Building a Thesaurus
  • Using a controlled vocabulary for the indexing
    and searching
  • Normalization of indexing concepts
  • Reduction of noise
  • Identification of indexing terms with a clear
    semantic meaning
  • Retrieval based on concepts rather than on words
  • Ex: the term classification hierarchy in Yahoo!

15
Main Components of a Thesaurus
  • Index terms: individual words, groups of words,
    phrases
  • Concept
  • Ex: missiles, ballistic
  • Definition or explanation
  • Ex: seal (marine animals), seal (documents)
  • Relationships among the terms
  • BT (broader), NT (narrower)
  • RT (related): much more difficult
  • A layout design for these term relationships
  • A list or bi-dimensional display

16
Automatic Indexing (Term Selection)
17
Automatic Indexing
  • Indexing
  • assign identifiers (index terms) to text
    documents
  • Identifiers
  • single-term vs. term-phrase
  • controlled vs. uncontrolled vocabularies:
    instruction manuals, terminological schedules, …
  • objective vs. nonobjective text identifiers:
    controlled by cataloging rules, e.g., author
    names, publisher names, dates of publication, …

18
Two Issues
  • Issue 1: indexing exhaustivity
  • exhaustive: assign a large number of terms
  • nonexhaustive: only the main aspects of subject
    content
  • Issue 2: term specificity
  • broad terms (generic): cannot distinguish relevant
    from nonrelevant documents
  • narrow terms (specific): retrieve relatively fewer
    documents, but most of them are relevant

19
Recall vs. Precision
  • Recall (R) = number of relevant documents
    retrieved / total number of relevant documents in
    the collection
  • The proportion of relevant items retrieved
  • Precision (P) = number of relevant documents
    retrieved / total number of documents retrieved
  • The proportion of retrieved items that are
    relevant
  • Example: for a query, e.g., "Taipei"

(Figure: Venn diagram — all docs, relevant docs, retrieved docs)
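The two definitions translate directly into set operations. A minimal sketch with hypothetical document IDs:

```python
def recall_precision(retrieved, relevant):
    """Recall = |relevant retrieved| / |relevant|;
    Precision = |relevant retrieved| / |retrieved|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant              # relevant documents retrieved
    recall = len(hits) / len(relevant)       # fraction of relevant items found
    precision = len(hits) / len(retrieved)   # fraction of retrieved items relevant
    return recall, precision

# Hypothetical result for a query such as "Taipei"
r, p = recall_precision(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d3", "d7"])
print(r, p)  # 2/3 recall, 0.5 precision
```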
20
More on Recall/Precision
  • Simultaneously optimizing both recall and
    precision is not normally achievable
  • Narrow and specific terms precision is favored
  • Broad and nonspecific terms recall is favored
  • When a choice must be made between term
    specificity and term breadth, the former is
    generally preferable
  • High-recall, low-precision output will burden
    the user
  • Lack of precision is more easily remedied than
    lack of recall

21
Term-Frequency Consideration
  • Function words
  • for example, "and", "of", "or", "but",
  • the frequencies of these words are high in all
    texts
  • Content words
  • words that actually relate to document content
  • varying frequencies in the different texts of a
    collection
  • indicate term importance for content

22
A Frequency-Based Indexing Method
  • Eliminate common function words from the document
    texts by consulting a special dictionary, or stop
    list, containing a list of high-frequency
    function words
  • Compute the term frequency tfij for all remaining
    terms Tj in each document Di, specifying the
    number of occurrences of Tj in Di
  • Choose a threshold frequency T, and assign to
    each document Di all terms Tj for which tfij > T
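The three steps above can be sketched as follows (the stop list and threshold are illustrative placeholders):

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "in", "is"}  # stand-in stop list

def index_terms(doc_text, threshold=1):
    """Frequency-based indexing: (1) drop stop-list words,
    (2) count tf_ij for the remaining terms, (3) keep the terms
    whose frequency exceeds the threshold T."""
    tokens = [t for t in doc_text.lower().split() if t not in STOP_LIST]
    tf = Counter(tokens)
    return {term for term, freq in tf.items() if freq > threshold}

doc = "the cat sat on the mat and the cat slept"
print(index_terms(doc))  # only "cat" occurs more than once
```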

23
More on Term Frequency
  • High-frequency terms
  • Favor recall
  • Ex: Apple
  • But only if the occurrence frequency is not
    equally high in other documents
  • Low-frequency terms
  • Favor precision
  • Ex: Huntington's disease
  • Able to distinguish the few documents in which
    they occur from the many from which they are
    absent

24
How to Compute Weight wij?
  • Inverse document frequency, idfj
  • tfij x idfj (TF x IDF)
  • Term discrimination value, dvj
  • tfij x dvj
  • Probabilistic term weighting, trj
  • tfij x trj
  • Global properties of terms in a document
    collection

25
Inverse Document Frequency
  • Inverse Document Frequency (IDF) for term Tj:
    idfj = log (N / dfj)
    where dfj (document frequency of term Tj) is the
    number of documents in which Tj occurs, and N is
    the total number of documents in the collection
  • The best terms fulfil both the recall and the
    precision goals
  • They occur frequently in individual documents but
    rarely in the remainder of the collection

26
TFxIDF
  • Weight wij of a term Tj in a document Di:
    wij = tfij x log (N / dfj)
  • Eliminate common function words
  • Compute the value of wij for each term Tj in
    each document Di
  • Assign to the documents of a collection all
    terms with sufficiently high (tf x idf) weights
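The weighting can be sketched end to end. This assumes the plain idf form log(N/dfj) from the previous slide, with no smoothing; `docs` is a hypothetical toy collection of token lists:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = tf_ij * log(N / df_j).
    `docs` is a list of token lists; returns one {term: weight}
    dict per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))      # df_j: number of docs containing T_j
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["apple", "pie", "apple"], ["apple", "tart"], ["car", "engine"]]
w = tfidf_weights(docs)
print(w[0])
```

Note that a term occurring in every document gets weight 0: its idf is log(1) = 0, matching the intuition that such a term does not discriminate.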

27
Term-discrimination Value
  • Useful index terms
  • Distinguish the documents of a collection from
    each other
  • Document Space
  • Each point represents a particular document of a
    collection
  • The distance between two points is inversely
    proportional to the similarity between the
    respective term assignments
  • When two documents are assigned very similar term
    sets, the corresponding points in the document
    configuration appear close together

28
A Virtual Document Space
(Figure: the document space in its original state, after assignment of a
good discriminator, and after assignment of a poor discriminator)
29
Good Term Assignment
  • When a term is assigned to the documents of a
    collection, the few documents to which the term
    is assigned will be distinguished from the rest
    of the collection
  • This should increase the average distance between
    the documents in the collection and hence produce
    a document space less dense than before

30
Poor Term Assignment
  • A high frequency term is assigned that does not
    discriminate between the objects of a collection
  • Its assignment will render the documents more
    similar to each other
  • This is reflected in an increase in document
    space density

31
Term Discrimination Value
  • Definition: dvj = Q - Qj
    where Q and Qj are the space densities before and
    after the assignment of term Tj
  • Density: the average pairwise similarity between
    all pairs of distinct documents
  • dvj > 0: Tj is a good term; dvj < 0: Tj is a poor
    term

32
Variations of Term-Discrimination Value with
Document Frequency
(Figure: dvj plotted against document frequency, from 0 to N)
Low frequency: dvj ≈ 0
Medium frequency: dvj > 0
High frequency: dvj < 0
33
TFij x dvj
  • wij = tfij x dvj
  • Compared with idfj:
  • idfj decreases steadily with increasing
    document frequency
  • dvj increases from zero to positive as the
    document frequency of the term increases, then
    decreases sharply as the document frequency
    becomes still larger
  • Issue: efficiency problem to compute N(N-1)
    pairwise similarities

34
Document Centroid
  • Document centroid C = (c1, c2, c3, ..., ct)
    where cj = (1/N) x the sum of wij over all
    documents i, and wij is the weight of the j-th
    term in document i
  • A dummy "average" document located in the center
    of the document space
  • Space density: the average similarity between the
    centroid C and each document of the collection
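The centroid shortcut and the discrimination value dvj = Q - Qj can be sketched together. This is a minimal illustration assuming cosine similarity and density measured as average similarity to the centroid (O(N) per density instead of O(N²) pairwise comparisons); the toy vectors are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(docs):
    """Average similarity between each document and the centroid C."""
    n, t = len(docs), len(docs[0])
    centroid = [sum(d[j] for d in docs) / n for j in range(t)]
    return sum(cosine(d, centroid) for d in docs) / n

def discrimination_value(docs, j):
    """dv_j = Q - Q_j: density with term j removed minus density with
    it present. Positive: assigning T_j spreads the space out."""
    without_j = [[w for k, w in enumerate(d) if k != j] for d in docs]
    return space_density(without_j) - space_density(docs)

# Term 2 appears equally in every document: a poor discriminator
print(discrimination_value([[1, 0, 5], [0, 1, 5], [1, 1, 5]], 2))
```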

35
Probabilistic Term Weighting
  • Goal: explicit distinctions between occurrences of
    terms in relevant and nonrelevant documents of a
    collection
  • Definition: given a user query q and the ideal
    answer set of the relevant documents, from
    decision theory the best ranking algorithm ranks a
    document D by
    log [Pr(D|rel) / Pr(D|nonrel)] + log [Pr(rel) / Pr(nonrel)]

36
Probabilistic Term Weighting
  • Pr(rel), Pr(nonrel): the a priori probabilities of
    relevance and nonrelevance of a document
  • Pr(D|rel), Pr(D|nonrel): occurrence probabilities
    of document D in the relevant and nonrelevant
    document sets

37
Assumptions
  • Terms occur independently in documents

38
Derivation Process
39
For a specific document D
  • Given a document D = (x1, x2, ..., xt)
  • Assume xi is either 0 (absent) or 1 (present)
  • Pr(xi=1|rel) = pi, Pr(xi=0|rel) = 1 - pi
  • Pr(xi=1|nonrel) = qi, Pr(xi=0|nonrel) = 1 - qi
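Under the term-independence assumption, Pr(D|rel) and Pr(D|nonrel) factor over the terms, so the ranking score becomes a sum of per-term log ratios. A minimal sketch with hypothetical per-term probabilities:

```python
import math

def log_odds(doc, p, q):
    """log Pr(D|rel)/Pr(D|nonrel) under term independence:
    sum over i of x_i*log(p_i/q_i) + (1-x_i)*log((1-p_i)/(1-q_i)).
    `doc` holds the binary x_i; p and q hold p_i and q_i."""
    score = 0.0
    for x, pi, qi in zip(doc, p, q):
        if x:
            score += math.log(pi / qi)   # term present
        else:
            score += math.log((1 - pi) / (1 - qi))  # term absent
    return score

# Hypothetical values: term 0 favors relevance, term 1 is uninformative
print(log_odds([1, 0], p=[0.8, 0.5], q=[0.2, 0.5]))
```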
40
(No Transcript)
41
Term Relevance Weight
  • trj = log [ pj (1 - qj) / (qj (1 - pj)) ]
42
Issue
  • How to compute pj and qj?
  • pj = rj / R, qj = (dfj - rj) / (N - R)
  • rj: the number of relevant documents that
    contain term Tj
  • R: the total number of relevant documents
  • N: the total number of documents
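Plugging these estimates into the standard term-relevance weight trj = log[pj(1-qj)/(qj(1-pj))] gives a short computation. The collection figures below are hypothetical:

```python
import math

def term_relevance_weight(r_j, R, df_j, N):
    """tr_j from relevance data: p_j = r_j/R, q_j = (df_j - r_j)/(N - R),
    combined as tr_j = log[p_j(1-q_j) / (q_j(1-p_j))]."""
    p = r_j / R
    q = (df_j - r_j) / (N - R)
    return math.log(p * (1 - q) / (q * (1 - p)))

# Hypothetical: 100 docs, 10 relevant; term in 20 docs, 8 of them relevant
print(term_relevance_weight(r_j=8, R=10, df_j=20, N=100))
```

A term concentrated in the relevant set gets a large positive weight; a term mostly in nonrelevant documents gets a negative one.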

43
Estimation of Term-Relevance
  • The occurrence probability of a term in the
    nonrelevant documents, qj, is approximated by the
    occurrence probability of the term in the entire
    document collection: qj = dfj / N
  • The large majority of documents will be
    nonrelevant to the average query
  • The occurrence probabilities of the terms in the
    small number of relevant documents are assumed to
    be equal, using a constant value pj = 0.5 for
    all j

44
Comparison
  • With pj = 0.5 and qj = dfj / N,
    trj = log [(1 - qj) / qj] = log [(N - dfj) / dfj]
  • When N is sufficiently large, N - dfj ≈ N, so
    trj ≈ log (N / dfj) = idfj