Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

Information Retrieval

Description:

http://www.sims.berkeley.edu/~hearst/irbook ... 2. Operations TOC. Introduction. Relevance Feedback. Query Expansion. Term Reweighting ... – PowerPoint PPT presentation

Slides: 63
Provided by: bert193
Learn more at: https://s2.smu.edu

Transcript and Presenter's Notes

Title: Information Retrieval


1
Information Retrieval
  • CSE 8337
  • Spring 2007
  • Query Operations
  • Material for these slides obtained from
  • Modern Information Retrieval by Ricardo
    Baeza-Yates and Berthier Ribeiro-Neto
    http://www.sims.berkeley.edu/~hearst/irbook/
  • Prof. Raymond J. Mooney in CS378 at University of
    Texas
  • Introduction to Modern Information Retrieval by
    Gerard Salton and Michael J. McGill,
    McGraw-Hill, 1983.
  • Automatic Text Processing, by Gerard Salton,
    Addison-Wesley,1989.

2
Operations TOC
  • Introduction
  • Relevance Feedback
  • Query Expansion
  • Term Reweighting
  • Automatic Local Analysis
  • Query Expansion using Clustering
  • Automatic Global Analysis
  • Query Expansion using Thesaurus
  • Similarity Thesaurus
  • Statistical Thesaurus
  • Complete Link Algorithm

3
Query Operations Introduction
  • IR queries as stated by the user may not be
    precise or effective.
  • There are many techniques to improve a stated
    query and then process that query instead.

4
Relevance Feedback
  • Use assessments by users as to the relevance of
    previously returned documents to create new
    (modify old) queries.
  • Technique
  • Increase weights of terms from relevant
    documents.
  • Decrease weight of terms from nonrelevant
    documents.
  • Figure 10.4 in Automatic Text Processing
  • Figure 6-10 in Introduction to Modern Information
    Retrieval

5
Relevance Feedback
  • After initial retrieval results are presented,
    allow the user to provide feedback on the
    relevance of one or more of the retrieved
    documents.
  • Use this feedback information to reformulate the
    query.
  • Produce new results based on reformulated query.
  • Allows more interactive, multi-pass process.

6
Relevance Feedback Architecture
[Figure: relevance feedback architecture. Surviving labels: Document corpus, Rankings, IR System]
7
Query Reformulation
  • Revise query to account for feedback
  • Query Expansion: Add new terms to query from
    relevant documents.
  • Term Reweighting: Increase weight of terms in
    relevant documents and decrease weight of terms
    in irrelevant documents.
  • Several algorithms for query reformulation.

8
Query Reformulation for VSR
  • Change query vector using vector algebra.
  • Add the vectors for the relevant documents to the
    query vector.
  • Subtract the vectors for the irrelevant docs from
    the query vector.
  • This adds both positively and negatively
    weighted terms to the query and also
    reweights the initial terms.

9
Optimal Query
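  • Assume that the set of relevant documents $C_r$ is
    known.
  • Then the best query, which ranks all and only the
    relevant documents at the top, is

    $$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$

    where $N$ is the total number of documents.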
10
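Standard Rocchio Method
  • Since the full set of relevant documents is unknown, just use
    the known relevant ($D_r$) and irrelevant ($D_n$) sets
    of documents and include the initial query $\vec{q}$.

    $$\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

  • $\alpha$: Tunable weight for initial query.
  • $\beta$: Tunable weight for relevant documents.
  • $\gamma$: Tunable weight for irrelevant documents.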
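
The Rocchio update can be sketched in a few lines of Python. This is only an illustration, assuming sparse vectors stored as term-to-weight dictionaries; the function name `rocchio` and the toy documents are made up for this example and do not come from the slides.

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Return the reformulated query as a dict mapping term -> weight.

    All vectors are sparse dicts (term -> weight); this layout is an
    assumption of the sketch.
    """
    new_query = defaultdict(float)
    # Keep the initial query, scaled by alpha.
    for term, w in query.items():
        new_query[term] += alpha * w
    # Add the centroid of the known relevant documents, scaled by beta.
    for doc in relevant:
        for term, w in doc.items():
            new_query[term] += beta * w / len(relevant)
    # Subtract the centroid of the known irrelevant documents, scaled by gamma.
    for doc in nonrelevant:
        for term, w in doc.items():
            new_query[term] -= gamma * w / len(nonrelevant)
    return dict(new_query)

# Toy data (hypothetical): one query term, two relevant and one irrelevant document.
q = {"apple": 1.0}
d_rel = [{"apple": 1.0, "laptop": 2.0}, {"apple": 1.0}]
d_non = [{"apple": 1.0, "fruit": 4.0}]
q_new = rocchio(q, d_rel, d_non)
```

With all three constants set to 1, "laptop" enters the query with a positive weight and "fruit" with a negative one, which matches the earlier remark that reformulation adds both positively and negatively weighted terms.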
11
Ide Regular Method
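  • Since more feedback should perhaps increase the
    degree of reformulation, do not normalize for
    amount of feedback

    $$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

  • $\alpha$: Tunable weight for initial query.
  • $\beta$: Tunable weight for relevant documents.
  • $\gamma$: Tunable weight for irrelevant documents.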
12
Ide Dec Hi Method
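  • Bias towards rejecting just the highest ranked of
    the irrelevant documents

    $$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \max_{non\text{-}relevant}(\vec{d}_j)$$

  • $\alpha$: Tunable weight for initial query.
  • $\beta$: Tunable weight for relevant documents.
  • $\gamma$: Tunable weight for the highest-ranked irrelevant document.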
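
The two Ide variants differ from Rocchio only in how the feedback documents are combined. A minimal sketch, assuming the same sparse dict vectors as before; the helper `add_scaled` and the convention that the irrelevant documents arrive ordered by their original rank (best first) are assumptions of this example.

```python
def add_scaled(target, vector, scale):
    """Add scale * vector into target in place (sparse dict vectors)."""
    for term, w in vector.items():
        target[term] = target.get(term, 0.0) + scale * w

def ide_regular(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide regular: sum the feedback vectors without normalizing by set size."""
    new_query = {}
    add_scaled(new_query, query, alpha)
    for doc in relevant:
        add_scaled(new_query, doc, beta)
    for doc in nonrelevant:
        add_scaled(new_query, doc, -gamma)
    return new_query

def ide_dec_hi(query, relevant, ranked_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide dec-hi: subtract only the highest-ranked irrelevant document.

    ranked_nonrelevant is assumed to be ordered by original rank, best first.
    """
    new_query = {}
    add_scaled(new_query, query, alpha)
    for doc in relevant:
        add_scaled(new_query, doc, beta)
    if ranked_nonrelevant:
        add_scaled(new_query, ranked_nonrelevant[0], -gamma)
    return new_query

# Hypothetical toy vectors.
q = {"a": 1.0}
rel = [{"a": 1.0, "b": 1.0}, {"b": 1.0}]
non = [{"c": 2.0}, {"a": 1.0}]
regular = ide_regular(q, rel, non)
dec_hi = ide_dec_hi(q, rel, non)
```

In this toy case the regular variant lowers the weight of "a" because the second irrelevant document contains it, while dec-hi ignores that document entirely.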
13
Comparison of Methods
  • Overall, experimental results indicate no clear
    preference for any one of the specific methods.
  • All methods generally improve retrieval
    performance (recall and precision) with feedback.
  • Generally just let tunable constants equal 1.

14
Fair Evaluation of Relevance Feedback
  • Remove from the corpus any documents for which
    feedback was provided.
  • Measure recall/precision performance on the
    remaining residual collection.
  • Compared to complete corpus, specific
    recall/precision numbers may decrease since
    relevant documents were removed.
  • However, relative performance on the residual
    collection provides fair data on the
    effectiveness of relevance feedback.
  • Fig 10.5 in Automatic Text Processing
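
As a rough illustration of residual-collection evaluation, the sketch below drops every document the user already judged before measuring precision at a cutoff. The function name, the choice of precision at k as the measure, and the document ids are assumptions of this example.

```python
def residual_precision_at_k(ranking, relevant, judged, k):
    """Precision at k on the residual collection.

    ranking:  document ids in ranked order (from the reformulated query)
    relevant: set of ids known to be relevant
    judged:   set of ids the user gave feedback on; these are removed first
    """
    residual = [doc_id for doc_id in ranking if doc_id not in judged]
    top = residual[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

# Hypothetical ranking: d1 and d2 were shown to the user and judged.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
judged = {"d1", "d2"}
p_at_2 = residual_precision_at_k(ranking, relevant, judged, k=2)
```

Running the same function on the rankings of the original and the reformulated query gives the relative comparison described above, without rewarding the system for documents the user already identified.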

15
Evaluating Relevance Feedback
  • Test-and-control Collection
  • Divide document collection in two parts
  • Use test portion to perform relevance feedback
    and to modify query
  • Perform test on control portion using both
    original and modified query
  • Compare results

16
Why is Feedback Not Widely Used?
  • Users are sometimes reluctant to provide explicit
    feedback.
  • Results in long queries that require more
    computation to retrieve, and search engines
    process lots of queries and allow little time for
    each one.
  • Makes it harder to understand why a particular
    document was retrieved.

17
Pseudo Feedback
  • Use relevance feedback methods without explicit
    user input.
  • Just assume the top m retrieved documents are
    relevant, and use them to reformulate the query.
  • Allows for query expansion that includes terms
    that are correlated with the query terms.
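
A minimal sketch of pseudo feedback, assuming a toy corpus of sparse term-weight dicts, dot-product scoring, and a Rocchio-style update with no negative component. The corpus, the function names, and the choice to average over the top m documents are all illustrative assumptions.

```python
def score(query, doc):
    """Dot product between two sparse term -> weight dicts."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())

def rank(query, corpus):
    """Return document names sorted by descending score (ties keep input order)."""
    return sorted(corpus, key=lambda name: score(query, corpus[name]), reverse=True)

def pseudo_feedback(query, corpus, m=2, beta=1.0):
    """Assume the top m retrieved documents are relevant and add their centroid."""
    top = rank(query, corpus)[:m]
    new_query = dict(query)
    for name in top:
        for term, w in corpus[name].items():
            new_query[term] = new_query.get(term, 0.0) + beta * w / len(top)
    return new_query

# Hypothetical corpus.
corpus = {
    "d1": {"apple": 2.0, "laptop": 1.0},
    "d2": {"apple": 1.0, "powerbook": 1.0},
    "d3": {"fruit": 3.0},
    "d4": {"laptop": 1.0, "powerbook": 1.0},
}
q = {"apple": 1.0}
q_new = pseudo_feedback(q, corpus, m=2)
new_ranking = rank(q_new, corpus)
```

Here d4 does not contain "apple" at all, yet it moves ahead of d3 after expansion because it shares the correlated terms "laptop" and "powerbook" with the top-ranked documents.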

18
Pseudo Feedback Results
  • Found to improve performance on TREC competition
    ad-hoc retrieval task.
  • Works even better if top documents must also
    satisfy additional boolean constraints in order
    to be used in feedback.

19
Term Reweighting for PM
  • Use statistics found in retrieved documents
  • $D_r$: Set of relevant and retrieved documents.
  • $D_{r,i}$: Set of relevant and retrieved documents that contain
    $k_i$.
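
For context, these two sets are typically used to re-estimate the term probabilities of the probabilistic model. The formulas below follow the usual textbook formulation and are given as a sketch; $n_i$ (the number of documents containing $k_i$) and $N$ (the collection size) are symbols assumed here, since this slide does not define them.

```latex
P(k_i \mid R) = \frac{|D_{r,i}|}{|D_r|}, \qquad
P(k_i \mid \bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}
```

In practice a small adjustment factor is often added to numerator and denominator so that neither estimate becomes exactly 0 or 1 when little feedback is available.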

20
Term Reweighting
  • No query expansion
  • Document term weights not used
  • Query term weights not used
  • Therefore, not usually as effective as previous
    vector approach.

21
Local vs. Global Automatic Analysis
  • Local: Documents retrieved are examined to
    automatically determine query expansion. No
    relevance feedback needed.
  • Global: Thesaurus used to help select terms for
    expansion.

22
Automatic Local Analysis
  • At query time, dynamically determine similar
    terms based on analysis of top-ranked retrieved
    documents.
  • Base correlation analysis on only the local set
    of retrieved documents for a specific query.
  • Avoids ambiguity by determining similar
    (correlated) terms only within relevant
    documents.
  • "Apple computer" → "Apple computer Powerbook laptop"

23
Automatic Local Analysis
  • Expand query with terms found in local clusters.
  • $D_l$: Set of documents retrieved for query $q$.
  • $V_l$: Set of words used in $D_l$.
  • $S_l$: Set of distinct stems in $V_l$.
  • $f_{s_i,j}$: Frequency of stem $s_i$ in document $d_j$ found
    in $D_l$.
  • Construct stem-stem association matrix.

24
Association Matrix
    $$c_{ij} = \sum_{d_k \in D_l} f_{ik} \times f_{jk}$$

  • $c_{ij}$: Correlation factor between stems $s_i$ and $s_j$.
  • $f_{ik}$: Frequency of term $i$ in document $k$.
25
Normalized Association Matrix
  • Frequency based correlation factor favors more
    frequent terms.
  • Normalize association scores:

    $$s_{ij} = \frac{c_{ij}}{c_{ii} + c_{jj} - c_{ij}}$$

  • Normalized score is 1 if two stems have the same
    frequency in all documents.
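
A small sketch of the association matrix and its normalized form, assuming each local document is already a list of stems. The frequency-product sum and the normalization $c_{ij} / (c_{ii} + c_{jj} - c_{ij})$ are the standard definitions; the function names and toy documents are made up for illustration.

```python
from collections import Counter

def association_matrix(docs):
    """docs: list of stem lists. Returns dict (si, sj) -> sum over docs of f_i * f_j."""
    counts = [Counter(doc) for doc in docs]
    stems = sorted(set().union(*docs))
    return {
        (si, sj): sum(freq[si] * freq[sj] for freq in counts)
        for si in stems
        for sj in stems
    }

def normalize_association(c):
    """Normalized score s_ij = c_ij / (c_ii + c_jj - c_ij)."""
    return {
        (si, sj): value / (c[(si, si)] + c[(sj, sj)] - value)
        for (si, sj), value in c.items()
    }

# Two stems with different per-document frequencies.
c = association_matrix([["a", "a", "b"], ["a", "b", "b"]])
s = normalize_association(c)

# Two stems with identical frequencies in every document.
s_same = normalize_association(association_matrix([["a", "b"], ["a", "a", "b", "b"]]))
```

In the first collection the normalized score for ("a", "b") is 4/6; in the second, where both stems have the same frequency in every document, it is exactly 1.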

26
Metric Correlation Matrix
  • Association correlation does not account for the
    proximity of terms in documents, just
    co-occurrence frequencies within documents.
  • Metric correlations account for term proximity.

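    $$c_{ij} = \sum_{k_u \in V_i} \sum_{k_v \in V_j} \frac{1}{r(k_u, k_v)}$$

  • $V_i$: Set of all occurrences of term $i$ in any
    document.
  • $r(k_u, k_v)$: Distance in words between
    word occurrences $k_u$ and $k_v$ ($\infty$
    if $k_u$ and $k_v$ are occurrences in different
    documents).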
27
Normalized Metric Correlation Matrix
  • Normalize scores to account for term frequencies:

    $$s_{ij} = \frac{c_{ij}}{|V_i| \times |V_j|}$$
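
The metric correlation can be sketched as follows, assuming documents are lists of terms and distance is the difference in word positions. Pairs of occurrences in different documents have infinite distance and therefore contribute nothing, so the loop only visits pairs within one document. Function names and toy data are illustrative.

```python
def metric_correlation(docs, ti, tj):
    """Sum of 1 / distance over all occurrence pairs of ti and tj in the same document."""
    total = 0.0
    for doc in docs:
        positions_i = [p for p, term in enumerate(doc) if term == ti]
        positions_j = [p for p, term in enumerate(doc) if term == tj]
        for pi in positions_i:
            for pj in positions_j:
                if pi != pj:  # skip an occurrence paired with itself
                    total += 1.0 / abs(pi - pj)
    return total

def normalized_metric_correlation(docs, ti, tj):
    """Divide by the number of occurrences of each term (|V_i| * |V_j|)."""
    n_i = sum(doc.count(ti) for doc in docs)
    n_j = sum(doc.count(tj) for doc in docs)
    if n_i == 0 or n_j == 0:
        return 0.0
    return metric_correlation(docs, ti, tj) / (n_i * n_j)

# Hypothetical documents.
docs = [["a", "b", "c", "a"], ["b", "a"]]
c_ab = metric_correlation(docs, "a", "b")
s_ab = normalized_metric_correlation(docs, "a", "b")
```

For these documents the raw score is 1/1 + 1/2 from the first document plus 1/1 from the second, and the normalized score divides that by 3 × 2 occurrences.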

28
Query Expansion with Correlation Matrix
  • For each term $i$ in query, expand query with the $n$
    terms $j$ with the highest value of $c_{ij}$ (or $s_{ij}$).
  • This adds semantically related terms in the
    neighborhood of the query terms.
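
A sketch of this expansion step, assuming the correlation matrix is available as a dict keyed by term pairs. The function name, the tie-breaking rule (alphabetical), and the scores are assumptions made for the example.

```python
def expand_query(query_terms, corr, n=2):
    """Add, for each query term i, the n terms j with the highest corr[(i, j)]."""
    expanded = list(query_terms)
    for i in query_terms:
        neighbours = [(value, j) for (a, j), value in corr.items() if a == i and j != i]
        # Highest correlation first; ties broken alphabetically (an arbitrary choice).
        neighbours.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, j in neighbours[:n]:
            if j not in expanded:
                expanded.append(j)
    return expanded

# Hypothetical correlation scores.
corr = {
    ("apple", "apple"): 1.0,
    ("apple", "laptop"): 0.8,
    ("apple", "powerbook"): 0.6,
    ("apple", "fruit"): 0.3,
}
expanded = expand_query(["apple"], corr, n=2)
```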

29
Problems with Local Analysis
  • Term ambiguity may introduce irrelevant
    statistically correlated terms.
  • "Apple computer" → "Apple red fruit computer"
  • Since terms are highly correlated anyway,
    expansion may not retrieve many additional
    documents.

30
Automatic Global Analysis
  • Determine term similarity through a pre-computed
    statistical analysis of the complete corpus.
  • Compute association matrices which quantify term
    correlations in terms of how frequently they
    co-occur.
  • Expand queries with statistically most similar
    terms.

31
Automatic Global Analysis
  • There are two modern variants based on a
    thesaurus-like structure built using all
    documents in collection
  • Query Expansion based on a Similarity Thesaurus
  • Query Expansion based on a Statistical Thesaurus

32
Thesaurus
  • A thesaurus provides information on synonyms and
    semantically related words and phrases.
  • Example
  • physician
  • syn: croaker, doc, doctor, MD, medical,
    mediciner, medico, sawbones
  • rel: medic, general practitioner, surgeon,

33
Thesaurus-based Query Expansion
  • For each term, t, in a query, expand the query
    with synonyms and related words of t from the
    thesaurus.
  • May weight added terms less than original query
    terms.
  • Generally increases recall.
  • May significantly decrease precision,
    particularly with ambiguous terms.
  • "interest rate" → "interest rate fascinate
    evaluate"
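
A minimal sketch of thesaurus-based expansion with down-weighted added terms. The thesaurus here is a hand-written dict and the weight 0.5 is an arbitrary assumption; a real system would look terms up in an actual thesaurus.

```python
def thesaurus_expand(query_terms, thesaurus, added_weight=0.5):
    """Return term -> weight; original terms get 1.0, added terms get added_weight."""
    weights = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for related in thesaurus.get(term, []):
            # setdefault keeps the weight of a term that is already in the query.
            weights.setdefault(related, added_weight)
    return weights

# Hypothetical thesaurus entries that pick the wrong sense of each word.
thesaurus = {"interest": ["fascinate"], "rate": ["evaluate"]}
expanded = thesaurus_expand(["interest", "rate"], thesaurus)
```

The toy entries reproduce the ambiguity problem: both added terms are valid synonyms of the individual words but unrelated to the phrase "interest rate", so they are likely to hurt precision even at reduced weight.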

34
Similarity Thesaurus
  • The similarity thesaurus is based on term to term
    relationships rather than on a matrix of
    co-occurrence.
  • These relationships are not derived directly from
    co-occurrence of terms inside documents.
  • They are obtained by considering that the terms
    are concepts in a concept space.
  • In this concept space, each term is indexed by
    the documents in which it appears.
  • Terms assume the original role of documents while
    documents are interpreted as indexing elements

35
Similarity Thesaurus
  • The following definitions establish the proper
    framework
  • $t$: number of terms in the collection
  • $N$: number of documents in the collection
  • $f_{i,j}$: frequency of occurrence of the term $k_i$ in
    the document $d_j$
  • $t_j$: vocabulary of document $d_j$
  • $itf_j$: inverse term frequency for document $d_j$

36
Similarity Thesaurus
  • Inverse term frequency for document $d_j$:

    $$itf_j = \log \frac{t}{t_j}$$

  • To $k_i$ is associated a vector

    $$\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$$

37
Similarity Thesaurus
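  • where $w_{i,j}$ is a weight associated with the
    index-document pair $[k_i, d_j]$. These weights are
    computed as follows:

    $$w_{i,j} = \frac{\left(0.5 + 0.5 \frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^2}}$$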

38
Similarity Thesaurus
  • The relationship between two terms $k_u$ and $k_v$ is
    computed as a correlation factor $c_{u,v}$ given by

    $$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$$

  • The global similarity thesaurus is built through
    the computation of the correlation factor $c_{u,v}$ for
    each pair of indexing terms $[k_u, k_v]$ in the
    collection
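
The construction can be sketched as below, assuming documents are lists of terms, $t_j$ is the number of distinct terms in $d_j$, and each term vector is length-normalized so that the correlation factor is a plain dot product. Setting the weight to 0 when a term does not occur in a document is an assumption of this sketch, as are the function name and toy data.

```python
import math
from collections import Counter

def similarity_thesaurus(docs):
    """docs: list of term lists. Returns (weights, corr).

    weights[term] is a list with one weight per document;
    corr[(u, v)] is the dot product of the two term vectors.
    """
    terms = sorted(set().union(*docs))
    t = len(terms)                                      # distinct terms in the collection
    counts = [Counter(doc) for doc in docs]
    itf = [math.log(t / len(freq)) for freq in counts]  # len(freq) = distinct terms in d_j
    weights = {}
    for term in terms:
        max_f = max(freq[term] for freq in counts)
        raw = [
            (0.5 + 0.5 * freq[term] / max_f) * itf[j] if freq[term] > 0 else 0.0
            for j, freq in enumerate(counts)
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        weights[term] = [x / norm if norm > 0 else 0.0 for x in raw]
    corr = {
        (u, v): sum(a * b for a, b in zip(weights[u], weights[v]))
        for u in terms
        for v in terms
    }
    return weights, corr

# Hypothetical three-document collection.
weights, corr = similarity_thesaurus([["a", "b"], ["a", "c", "c"], ["b"]])
```

Because the pairwise computation is quadratic in the number of terms, a sketch like this is only practical for small vocabularies; it is meant to show the data flow, not an efficient implementation.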

39
Similarity Thesaurus
  • This computation is expensive
  • Global similarity thesaurus has to be computed
    only once and can be updated incrementally

40
Query Expansion based on a Similarity Thesaurus
  • Query expansion is done in three steps as
    follows
  • Represent the query in the concept space used for
    representation of the index terms
  • Based on the global similarity thesaurus, compute
    a similarity sim(q,kv) between each term kv
    correlated to the query terms and the whole query
    q.
  • Expand the query with the top r ranked terms
    according to sim(q,kv)

41
Query Expansion - step one
  • To the query $q$ is associated a vector $\vec{q}$ in the
    term-concept space given by

    $$\vec{q} = \sum_{k_i \in q} w_{i,q} \vec{k}_i$$

  • where $w_{i,q}$ is a weight associated with the
    index-query pair $[k_i, q]$

42
Query Expansion - step two
  • Compute a similarity $sim(q, k_v)$ between each term
    $k_v$ and the user query $q$

    $$sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}$$

  • where $c_{u,v}$ is the correlation factor

43
Query Expansion - step three
  • Add the top $r$ ranked terms according to $sim(q, k_v)$
    to the original query $q$ to form the expanded
    query $q'$
  • To each expansion term $k_v$ in the query $q'$ is
    assigned a weight $w_{v,q'}$ given by

    $$w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}$$

  • The expanded query $q'$ is then used to retrieve
    new documents for the user
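
The three steps can be sketched together, assuming the correlation factors are already available as a dict keyed by term pairs and the query is a dict of term weights. Original query terms keep their weights here and only newly added terms receive the computed expansion weight; that choice, the function name, and the numbers are assumptions of this example.

```python
def expand_with_similarity_thesaurus(query_weights, corr, r=2):
    """query_weights: term -> w_{u,q}; corr: (u, v) -> c_{u,v}.

    Returns (expanded_query, sim) where sim[v] = sum_u w_{u,q} * c_{u,v}.
    """
    candidates = {v for (_, v) in corr}
    sim = {
        v: sum(w * corr.get((u, v), 0.0) for u, w in query_weights.items())
        for v in candidates
    }
    total = sum(query_weights.values())
    # Rank by similarity, breaking ties alphabetically (an arbitrary choice).
    ranked = sorted(sim, key=lambda v: (-sim[v], v))
    expanded = dict(query_weights)
    for v in ranked[:r]:
        if v not in expanded:
            expanded[v] = sim[v] / total
    return expanded, sim

# Hypothetical query and correlation factors.
query = {"a": 1.0, "e": 2.0}
corr = {
    ("a", "a"): 1.0, ("a", "c"): 0.8, ("a", "d"): 0.2, ("a", "e"): 0.0,
    ("e", "a"): 0.0, ("e", "c"): 0.1, ("e", "d"): 0.6, ("e", "e"): 1.0,
}
expanded, sim = expand_with_similarity_thesaurus(query, corr, r=2)
```

Note that the similarity is computed against the whole query: "d" wins over "c" even though "c" is the closest neighbour of "a", because "d" is better correlated with the more heavily weighted term "e".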

44
Query Expansion Sample
  • Doc1: D, D, A, B, C, A, B, C
  • Doc2: E, C, E, A, A, D
  • Doc3: D, C, B, B, D, A, B, C, A
  • Doc4: A
  • c(A,A) = 10.991
  • c(A,C) = 10.781
  • c(A,D) = 10.781
  • ...
  • c(D,E) = 10.398
  • c(B,E) = 10.396
  • c(E,E) = 10.224

45
Query Expansion Sample
  • Query: q = A E E
  • sim(q,A) = 24.298
  • sim(q,C) = 23.833
  • sim(q,D) = 23.833
  • sim(q,B) = 23.830
  • sim(q,E) = 23.435
  • New query: q' = A C D E E
  • w(A,q') = 6.88
  • w(C,q') = 6.75
  • w(D,q') = 6.75
  • w(E,q') = 6.64

46
WordNet
  • A more detailed database of semantic
    relationships between English words.
  • Developed by famous cognitive psychologist George
    Miller and a team at Princeton University.
  • About 144,000 English words.
  • Nouns, adjectives, verbs, and adverbs grouped
    into about 109,000 synonym sets called synsets.

47
WordNet Synset Relationships
  • Antonym: front → back
  • Attribute: benevolence → good (noun to adjective)
  • Pertainym: alphabetical → alphabet (adjective to
    noun)
  • Similar: unquestioning → absolute
  • Cause: kill → die
  • Entailment: breathe → inhale
  • Holonym: chapter → text (part-of)
  • Meronym: computer → cpu (whole-of)
  • Hyponym: tree → plant (specialization)
  • Hypernym: fruit → apple (generalization)

48
WordNet Query Expansion
  • Add synonyms in the same synset.
  • Add hyponyms to add specialized terms.
  • Add hypernyms to generalize a query.
  • Add other related terms to expand query.

49
Statistical Thesaurus
  • Existing human-developed thesauri are not easily
    available in all languages.
  • Human thesauri are limited in the type and range
    of synonymy and semantic relations they
    represent.
  • Semantically related terms can be discovered from
    statistical analysis of corpora.

50
Query Expansion Based on a Statistical Thesaurus
  • Global thesaurus is composed of classes which
    group correlated terms in the context of the
    whole collection
  • Such correlated terms can then be used to expand
    the original user query
  • These terms must be low-frequency terms
  • However, it is difficult to cluster low frequency
    terms
  • To circumvent this problem, we cluster documents
    into classes instead and use the low frequency
    terms in these documents to define our thesaurus
    classes.
  • This algorithm must produce small and tight
    clusters.

51
Query Expansion based on a Statistical Thesaurus
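  • Use the thesaurus classes for query expansion.
  • Compute an average term weight $wt_C$ for each
    thesaurus class $C$

    $$wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}$$

    where $|C|$ is the number of terms in class $C$ and $w_{i,C}$ is
    the weight of term $k_i$ in class $C$.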

52
Query Expansion based on a Statistical Thesaurus
  • $wt_C$ can be used to compute a thesaurus class
    weight $w_C$ as

    $$w_C = \frac{wt_C}{|C|} \times 0.5$$

53
Query Expansion Sample
  • Doc1: D, D, A, B, C, A, B, C
  • Doc2: E, C, E, A, A, D
  • Doc3: D, C, B, B, D, A, B, C, A
  • Doc4: A
  • q = A E E
  • sim(1,3) = 0.99, sim(1,2) = 0.40, sim(2,3) = 0.29,
    sim(4,1) = 0.00, sim(4,2) = 0.00, sim(4,3) = 0.00
  • TC = 0.90, NDC = 2.00, MIDF = 0.2
  • idf A = 0.0, idf B = 0.3, idf C = 0.12, idf D = 0.12,
    idf E = 0.60
  • q' = A B E E
54
Query Expansion based on a Statistical Thesaurus
  • Problems with this approach
  • initialization of parameters TC,NDC and MIDF
  • TC depends on the collection
  • Inspection of the cluster hierarchy is almost
    always necessary for assisting with the setting
    of TC.
  • A high value of TC might yield classes with too
    few terms

55
Complete link algorithm
  • This is a document clustering algorithm which
    produces small and tight clusters
  • 1. Place each document in a distinct cluster.
  • 2. Compute the similarity between all pairs of
    clusters.
  • 3. Determine the pair of clusters $[C_u, C_v]$ with the
    highest inter-cluster similarity.
  • 4. Merge the clusters $C_u$ and $C_v$.
  • 5. Verify a stop criterion. If this criterion is not
    met then go back to step 2.
  • 6. Return a hierarchy of clusters.
  • Similarity between two clusters is defined as the
    minimum of the similarities between all pairs of
    inter-cluster documents
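
The steps above can be sketched as follows, assuming the pairwise document similarities are given up front. The similarity values are the ones listed in the statistical thesaurus sample; the function name, the data layout (a dict keyed by frozenset pairs), and the stop criterion (stop when the best similarity drops below `min_similarity`) are assumptions of this sketch.

```python
def complete_link(doc_ids, sim, min_similarity=0.0):
    """Return the merge history as a list of (similarity, merged_cluster).

    sim maps frozenset({i, j}) -> similarity between documents i and j.
    """
    clusters = [frozenset([d]) for d in doc_ids]
    merges = []

    def cluster_sim(cu, cv):
        # Complete link: the least similar pair of documents across the clusters.
        return min(sim[frozenset((a, b))] for a in cu for b in cv)

    while len(clusters) > 1:
        candidates = [
            (cluster_sim(cu, cv), cu, cv)
            for idx, cu in enumerate(clusters)
            for cv in clusters[idx + 1:]
        ]
        best_sim, cu, cv = max(candidates, key=lambda item: item[0])
        if best_sim < min_similarity:  # stop criterion (assumed form)
            break
        clusters = [c for c in clusters if c not in (cu, cv)] + [cu | cv]
        merges.append((best_sim, cu | cv))
    return merges

# Pairwise similarities from the four-document sample.
sim = {
    frozenset((1, 3)): 0.99, frozenset((1, 2)): 0.40, frozenset((2, 3)): 0.29,
    frozenset((4, 1)): 0.00, frozenset((4, 2)): 0.00, frozenset((4, 3)): 0.00,
}
merges = complete_link([1, 2, 3, 4], sim)
```

Documents 1 and 3 merge first at 0.99. Document 2 then joins at 0.29 rather than 0.40, because complete link uses the weakest pair between the clusters, which is what keeps the clusters tight.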

56
Selecting the terms that compose each class
  • Given the document cluster hierarchy for the
    whole collection, the terms that compose each
    class of the global thesaurus are selected as
    follows
  • Obtain from the user three parameters
  • TC: Threshold class
  • NDC: Number of documents in class
  • MIDF: Minimum inverse document frequency

57
Selecting the terms that compose each class
  • Use the parameter TC as threshold value for
    determining the document clusters that will be
    used to generate thesaurus classes
  • This threshold has to be surpassed by sim(Cu,Cv)
    if the documents in the clusters Cu and Cv are to
    be selected as sources of terms for a thesaurus
    class

58
Selecting the terms that compose each class
  • Use the parameter NDC as a limit on the size of
    clusters (number of documents) to be considered.
  • A low value of NDC might restrict the selection
    to the smaller cluster Cuv

59
Selecting the terms that compose each class
  • Consider the set of documents in each document
    cluster pre-selected above.
  • Only the lower frequency documents are used as
    sources of terms for the thesaurus classes
  • The parameter MIDF defines the minimum value of
    inverse document frequency for any term which is
    selected to participate in a thesaurus class
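
One possible reading of the three parameters, sketched with the numbers from the four-document sample: a cluster is kept if its merge similarity exceeds TC and it holds at most NDC documents, and a term enters the class if its idf is at least MIDF. The exact comparisons, the function name, and the hard-coded merge history are assumptions of this example.

```python
def thesaurus_classes(merges, docs, idf, tc, ndc, midf):
    """Select thesaurus classes from a cluster hierarchy.

    merges: list of (similarity, cluster) pairs, cluster = frozenset of doc ids
    docs:   doc id -> list of terms
    idf:    term -> inverse document frequency
    """
    classes = []
    for similarity, cluster in merges:
        if similarity > tc and len(cluster) <= ndc:
            terms = set()
            for doc_id in cluster:
                terms.update(t for t in docs[doc_id] if idf[t] >= midf)
            classes.append(terms)
    return classes

# Data from the sample: documents, idf values, and the complete-link merge history.
docs = {
    1: ["D", "D", "A", "B", "C", "A", "B", "C"],
    2: ["E", "C", "E", "A", "A", "D"],
    3: ["D", "C", "B", "B", "D", "A", "B", "C", "A"],
    4: ["A"],
}
idf = {"A": 0.0, "B": 0.3, "C": 0.12, "D": 0.12, "E": 0.60}
merges = [
    (0.99, frozenset({1, 3})),
    (0.29, frozenset({1, 2, 3})),
    (0.0, frozenset({1, 2, 3, 4})),
]
classes = thesaurus_classes(merges, docs, idf, tc=0.90, ndc=2, midf=0.2)
```

Under this reading only the cluster of documents 1 and 3 passes, and B is the only term in it with a high enough idf, which is consistent with the sample's expanded query q' = A B E E.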

60
Global vs. Local Analysis
  • Global analysis requires intensive term
    correlation computation only once at system
    development time.
  • Local analysis requires intensive term
    correlation computation for every query at run
    time (although number of terms and documents is
    less than in global analysis).
  • But local analysis gives better results.

61
Query Expansion Conclusions
  • Expansion of queries with related terms can
    improve performance, particularly recall.
  • However, must select similar terms very carefully
    to avoid problems, such as loss of precision.

62
Conclusion
  • A thesaurus is an efficient method to expand queries
  • The computation is expensive but it is executed
    only once
  • Query expansion based on similarity thesaurus may
    use high term frequency to expand the query
  • Query expansion based on a statistical thesaurus
    needs well-defined parameters