Modern Information Retrieval Chapter 7: Text Operations - PowerPoint PPT Presentation

1
Modern Information Retrieval, Chapter 7: Text Operations
  • Ricardo Baeza-Yates
  • Berthier Ribeiro-Neto

2
Document Preprocessing
  • Lexical analysis of the text
  • Elimination of stopwords
  • Stemming
  • Selection of index terms
  • Construction of term categorization structures

3
Lexical Analysis of the Text
  • Word separators
  • space
  • digits
  • hyphens
  • punctuation marks
  • the case of the letters
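The word-separator issues above can be sketched as a minimal tokenizer. This is an illustrative simplification, not the book's algorithm: it folds letter case, splits on spaces, hyphens, and punctuation, and drops pure-digit tokens.

```python
import re

def tokenize(text):
    # Normalize letter case, then split on any run of non-word
    # characters (spaces, hyphens, punctuation marks).
    tokens = re.split(r"[^\w]+", text.lower())
    # Drop empty strings and pure-digit tokens.
    return [t for t in tokens if t and not t.isdigit()]

print(tokenize("State-of-the-art retrieval, in 2024!"))
# -> ['state', 'of', 'the', 'art', 'retrieval', 'in']
```

Real lexical analyzers need more care with cases like hyphenated names or numbers that carry meaning (e.g. "B-52"), which this sketch discards.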

4
Elimination of Stopwords
  • A list of stopwords
  • words that are too frequent among the documents
  • articles, prepositions, conjunctions, etc.
  • Can reduce the size of the indexing structure
    considerably
  • Problem
  • Search for "to be or not to be"?
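A sketch of stopword elimination with a toy stop list. It also demonstrates the problem noted above: a query like "to be or not to be" consists entirely of stopwords and vanishes.

```python
# Illustrative toy stop list; production systems use lists of
# hundreds of high-frequency articles, prepositions, conjunctions.
STOPWORDS = {"a", "an", "the", "of", "to", "be", "or", "not", "and", "in"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stop list.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# -> [] : the whole query is eliminated
```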

5
Stemming
  • Example
  • connect, connected, connecting, connection,
    connections
  • effectiveness -> effective -> effect
  • picnicking -> picnic
  • king -/-> k
  • Removing strategies
  • affix removal intuitive, simple
  • table lookup
  • successor variety
  • n-gram
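A toy affix-removal stemmer in the longest-match style. The suffix list is illustrative; real stemmers such as Porter's apply many more rules, including recoding steps that handle cases like picnicking -> picnic. A minimum stem length guards against over-stemming like king -> k.

```python
# Illustrative suffix list, ordered longest-first so the longest
# matching affix is removed.
SUFFIXES = ["ations", "ation", "ings", "ing", "ions", "ion", "ed", "s"]

def stem(word, min_stem=3):
    # Affix removal: strip the first (longest) matching suffix,
    # but never reduce the stem below min_stem characters,
    # so "king" is left alone rather than becoming "k".
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[:len(word) - len(suf)]
    return word

for w in ["connect", "connected", "connecting", "connection", "king"]:
    print(w, "->", stem(w))
```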

6
Index Terms Selection
  • Motivation
  • A sentence is usually composed of nouns,
    pronouns, articles, verbs, adjectives, adverbs,
    and connectives.
  • Most of the semantics is carried by the noun
    words.
  • Identification of noun groups
  • A noun group is a set of nouns whose syntactic
    distance in the text does not exceed a predefined
    threshold

7
Thesauri
  • Peter Roget, 1988
  • Example
  • cowardly, adj.
  • Ignobly lacking in courage: "cowardly turncoats"
  • Syns: chicken (slang), chicken-hearted, craven,
    dastardly, faint-hearted, gutless, lily-livered,
    pusillanimous, unmanly, yellow (slang),
    yellow-bellied (slang).
  • A controlled vocabulary for indexing and searching

8
The Purpose of a Thesaurus
  • To provide a standard vocabulary for indexing and
    searching
  • To assist users with locating terms for proper
    query formulation
  • To provide classified hierarchies that allow the
    broadening and narrowing of the current query
    request

9
Thesaurus Term Relationships
  • BT: broader term
  • NT: narrower term
  • RT: non-hierarchical, but related term
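The three relationship types can be stored as a simple mapping. This hypothetical toy thesaurus shows how BT and NT links support broadening and narrowing a query term.

```python
# Hypothetical toy thesaurus: each term maps its BT (broader),
# NT (narrower), and RT (related) terms.
THESAURUS = {
    "vehicle": {"NT": ["car", "truck"], "RT": ["transport"]},
    "car": {"BT": ["vehicle"], "NT": ["sedan"], "RT": ["driver"]},
}

def broaden(term):
    # Follow BT links to widen the current query request.
    return THESAURUS.get(term, {}).get("BT", [])

def narrow(term):
    # Follow NT links to make the query more specific.
    return THESAURUS.get(term, {}).get("NT", [])

print(broaden("car"))     # -> ['vehicle']
print(narrow("vehicle"))  # -> ['car', 'truck']
```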

10
Term Selection
  • Automatic Text Processing
  • by G. Salton, Chap 9,
  • Addison-Wesley, 1989.

11
Automatic Indexing
  • Indexing
  • assign identifiers (index terms) to text
    documents.
  • Identifiers
  • single-term vs. term phrase
  • controlled vs. uncontrolled vocabularies:
    instruction manuals, terminological schedules, …
  • objective vs. nonobjective text identifiers:
    cataloging rules define, e.g., author names,
    publisher names, dates of publication, …

12
Two Issues
  • Issue 1: indexing exhaustivity
  • exhaustive: assign a large number of terms
  • nonexhaustive
  • Issue 2: term specificity
  • broad terms (generic): cannot distinguish relevant
    from nonrelevant documents
  • narrow terms (specific): retrieve relatively fewer
    documents, but most of them are relevant

13
Parameters of retrieval effectiveness
  • Recall
  • Precision
  • Goal high recall and high precision

14
Contingency table of retrieved vs. relevant items:

                 Relevant items   Nonrelevant items
Retrieved              a                 b
Not retrieved          c                 d

Recall = a / (a + c), Precision = a / (a + b)
15
A Joint Measure
  • F-score: Fβ = (1 + β²) · P · R / (β² · P + R)
  • β is a parameter that encodes the relative
    importance of recall and precision
  • β = 1: equal weight
  • β < 1: precision is more important
  • β > 1: recall is more important
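The F-score computation can be sketched as a small function, with beta encoding the relative importance of recall against precision:

```python
def f_beta(precision, recall, beta=1.0):
    # F = (1 + beta^2) * P * R / (beta^2 * P + R)
    # beta = 1 weighs both equally; beta > 1 favors recall,
    # beta < 1 favors precision.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.5, 0.5))            # -> 0.5 (harmonic mean)
print(f_beta(0.1, 0.8, beta=2.0))  # recall-weighted score
```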

16
Choices of Recall and Precision
  • Both recall and precision vary from 0 to 1.
  • Particular choices of indexing and search
    policies have produced variations in performance
    ranging from 0.8 precision and 0.2 recall to 0.1
    precision and 0.8 recall.
  • In many circumstances, recall and precision values
    between 0.5 and 0.6 are satisfactory for the
    average user.

17
Term-Frequency Consideration
  • Function words
  • for example, "and", "or", "of", "but",
  • the frequencies of these words are high in all
    texts
  • Content words
  • words that actually relate to document content
  • varying frequencies in the different texts of a
    collection
  • indicate term importance for content

18
A Frequency-Based Indexing Method
  • Eliminate common function words from the document
    texts by consulting a special dictionary, or stop
    list, containing a list of high frequency
    function words.
  • Compute the term frequency tfij for all remaining
    terms Tj in each document Di, specifying the
    number of occurrences of Tj in Di.
  • Choose a threshold frequency T, and assign to
    each document Di all terms Tj for which tfij > T.

19
Inverse Document Frequency
  • Inverse Document Frequency (IDF) for term Tj:
    idfj = log(N / dfj), where dfj (document frequency
    of term Tj) is the number of documents in which Tj
    occurs and N is the total number of documents.
  • fulfils both the recall and the precision goals
  • high for terms that occur frequently in individual
    documents but rarely in the remainder of the
    collection

20
TFxIDF
  • Weight wij of a term Tj in a document Di:
    wij = tfij × idfj = tfij × log(N / dfj)
  • Eliminate common function words
  • Compute the value of wij for each term Tj in
    each document Di
  • Assign to the documents of a collection all
    terms with sufficiently high (tf × idf) factors

21
Term-discrimination Value
  • Useful index terms
  • Distinguish the documents of a collection from
    each other
  • Document Space
  • Two documents with very similar term sets
    correspond to points that appear close together
    in the document space
  • Assigning a high-frequency term with no
    discriminating power increases the document
    space density

22
A Virtual Document Space
  • Original state
  • After assignment of a good discriminator
  • After assignment of a poor discriminator
23
Good Term Assignment
  • When a term is assigned to the documents of a
    collection, the few objects to which the term is
    assigned will be distinguished from the rest of
    the collection.
  • This should increase the average distance between
    the objects in the collection and hence produce a
    document space less dense than before.

24
Poor Term Assignment
  • When a high-frequency term that does not
    discriminate between the objects of a collection
    is assigned, it renders the documents more
    similar.
  • This is reflected in an increase in document
    space density.

25
Term Discrimination Value
  • Definition: dvj = Q - Qj, where Q and Qj are the
    space densities before and after the assignment
    of term Tj.
  • dvj > 0: Tj is a good term; dvj < 0: Tj is a
    poor term.

26
Variations of Term-Discrimination Value with
Document Frequency
  • Low document frequency: dvj ≈ 0
  • Medium document frequency: dvj > 0
  • High document frequency: dvj < 0
27
TFij x dvj
  • wij = tfij × dvj
  • compared with tfij × idfj:
  • idfj decreases steadily with increasing
    document frequency
  • dvj increases from zero to positive as the
    document frequency of the term increases, then
    decreases sharply as the document frequency
    becomes still larger.

28
Document Centroid
  • Issue: efficiency, N(N - 1) pairwise similarities
  • Document centroid: C = (c1, c2, c3, ..., ct),
    where cj = (1/N) Σi wij and wij is the weight of
    the j-th term in document i.
  • Space density: average similarity between each
    document and the centroid C
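A sketch of centroid-based space density and the resulting discrimination value dv_j = Q - Q_j. Cosine similarity to the centroid is used as the density measure, one common choice where the slides leave the formula implicit.

```python
import math

def centroid(doc_vectors):
    # c_j = (1/N) * sum_i w_ij
    n = len(doc_vectors)
    return [sum(d[j] for d in doc_vectors) / n
            for j in range(len(doc_vectors[0]))]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def space_density(doc_vectors):
    # Q: average similarity of each document to the centroid,
    # avoiding the N(N-1) pairwise comparisons.
    c = centroid(doc_vectors)
    return sum(cosine(d, c) for d in doc_vectors) / len(doc_vectors)

def discrimination_value(doc_vectors, j):
    # dv_j = Q - Q_j: density with term j removed, minus density
    # with term j assigned. Positive -> good discriminator.
    without = [[w for k, w in enumerate(d) if k != j]
               for d in doc_vectors]
    return space_density(without) - space_density(doc_vectors)
```

For example, a term present in every document pulls all points toward the centroid, so its discrimination value comes out negative.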

29
Probabilistic Term Weighting
  • Goal: make explicit distinctions between
    occurrences of terms in relevant and nonrelevant
    documents of a collection
  • Definition: given a user query q, and the ideal
    answer set of the relevant documents
  • From decision theory, the best ranking algorithm
    for a document D ranks by
    g(D) = log(Pr(D|rel) / Pr(D|nonrel))
         + log(Pr(rel) / Pr(nonrel))

30
Probabilistic Term Weighting
  • Pr(rel), Pr(nonrel): a priori probabilities of
    document relevance and nonrelevance
  • Pr(D|rel), Pr(D|nonrel): occurrence probabilities
    of document D in the relevant and nonrelevant
    document sets

31
Assumptions
  • Terms occur independently in documents

32
Derivation Process
33
For a specific document D
  • Given a document D = (d1, d2, ..., dt)
  • Assume di is either 0 (absent) or 1 (present).
  • Pr(xi=1|rel) = pi, Pr(xi=0|rel) = 1 - pi
  • Pr(xi=1|nonrel) = qi, Pr(xi=0|nonrel) = 1 - qi
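Under the term-independence assumption, Pr(D|rel) factors into a product over the binary term indicators; a minimal sketch:

```python
def doc_prob(d, p):
    # Pr(D|rel) = prod_i p_i^{d_i} * (1 - p_i)^{1 - d_i}
    # d: binary presence vector; p: per-term probabilities p_i.
    prob = 1.0
    for di, pi in zip(d, p):
        prob *= pi if di == 1 else (1 - pi)
    return prob

print(doc_prob([1, 0], [0.5, 0.5]))  # -> 0.25
```

The same function with the q_i values gives Pr(D|nonrel), and the ratio of the two drives the ranking formula above.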
34
35
Term Relevance Weight
36
Issue
  • How to compute pj and qj?
  • pj = rj / R, qj = (dfj - rj) / (N - R)
  • rj: the number of relevant documents that contain
    term Tj
  • R: the total number of relevant documents
  • N: the total number of documents

37
Estimation of Term-Relevance
  • The occurrence probability of a term in the
    nonrelevant documents, qj, is approximated by the
    occurrence probability of the term in the entire
    document collection: qj = dfj / N.
  • The occurrence probabilities of the terms in the
    small number of relevant documents are
    approximated by a constant value: pj = 0.5 for
    all j.
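With the approximations pj = 0.5 and qj = dfj / N, the term relevance weight trj = log(pj(1 - qj) / (qj(1 - pj))) reduces to log((N - dfj) / dfj); a sketch:

```python
import math

def term_relevance(df_j, n, p_j=0.5):
    # q_j approximated by the collection-wide rate df_j / N.
    q_j = df_j / n
    # tr_j = log( p_j (1 - q_j) / (q_j (1 - p_j)) );
    # with p_j = 0.5 this equals log((N - df_j) / df_j).
    return math.log((p_j * (1 - q_j)) / (q_j * (1 - p_j)))

print(term_relevance(1, 10))  # rare term -> large weight
```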

38
Comparison
With pj = 0.5 and qj = dfj / N, the term relevance
weight becomes trj = log((N - dfj) / dfj). When N is
sufficiently large, N - dfj ≈ N, so trj ≈ log(N / dfj)
= idfj.
39
Estimation of Term-Relevance
  • Estimate the number of relevant documents rj in
    the collection that contain term Tj as a function
    of the known document frequency dfj of the term
    Tj.
  • pj = rj / R, qj = (dfj - rj) / (N - R), where R
    is an estimate of the total number of relevant
    documents in the collection.

40
Summary
  • Inverse document frequency, idfj
  • tfij × idfj (TF×IDF)
  • Term discrimination value, dvj
  • tfij × dvj
  • Probabilistic term weighting, trj
  • tfij × trj
  • Global properties of terms in a document
    collection