Cue Validity Variance (CVV) Database Selection Algorithm Enhancement - PowerPoint PPT Presentation

About This Presentation
Title:

Cue Validity Variance (CVV) Database Selection Algorithm Enhancement

Description:

In SYM, larger collections (# documents) tend to contain more of the relevant documents ... SBR, random, etc. meritquery,coll = estimated merit ('goodness') Ranks ... – PowerPoint PPT presentation

Number of Views:66
Avg rating:3.0/5.0
Slides: 27
Provided by: stude1616
Category:

less

Transcript and Presenter's Notes

Title: Cue Validity Variance (CVV) Database Selection Algorithm Enhancement


1
Cue Validity Variance (CVV)Database Selection
Algorithm Enhancement
  • Travis Emmitt
  • 9 August 1999

2
Collection Selection Overview
UserY
BROKER ACTIONS 1) Receives query from
user 2) Decides which collections are likely
to contain resources relevant to the
query, ranks them 3) Forwards query to top-n
ranked collection servers 4) Receives results
from collection servers 5) Merges and
presents results to user
UserZ
UserX
QueryB
QueryC
QueryA
QueryA
Broker
Broker
Clones
Coll1
Coll3
Coll2
document relevant to QueryA
Step 2 is the Collection Selection (a/k/a
Database Selection) Problem. This is our focus.
document relevant to QueryB
document relevant to QueryC
3
TREC-Based Test Environment
  • 6 Sources AP, FR, PATN, SJM, WSJ, ZIFF
  • Sources partitioned into a total of 236
    collections (a/k/a sites, databases)
  • Different decomposition (partitioning) methods
  • 250 Queries
  • Often contain many possibly repeating query
    terms
  • 1000s of Relevance Judgements
  • Humans listed documents relevant to each query
  • Complete only for queries 51-150

4
SYM Decomposition
  • SYM Source Year Month
  • Example collections
  • AP.88.02 - contains all Associated Press
    documents from February 1988
  • ZIFF.90.12 - contains all ZIFF documents from
    December 1990
  • 236 collections total

5
SYMs Goto AP Problem
  • In SYM, larger collections ( documents) tend to
    contain more of the relevant documents
  • Some collections (AP) so large w.r.t. others that
    any algorithm favoring large collections over
    small collections will perform well,
    independently of the query contents

6
UDC Decomposition
  • UDC Uniform Document Count
  • Example collections
  • AP.01 - contains the first 1/236 (approx) of the
    complete set of documents restricted to AP
    documents only so no overlap w/ FR
  • ZIFF.45 - contains the last 1/236 (approx) of the
    complete set documents
  • 236 collections total

7
Baselines
  • RBR Relevance Based Ranking
  • meritquery,coll number of documents in
    collection which were deemed relevant to query
  • Others...

8
Estimates
  • Estimate names/categories
  • gGLOSS - ideal(k)
  • CORI - U.Mass, best performance
  • SMART - ntn_ntn, etc. (tweaked components)
  • CVV - Yuwono/Lees traditional version
  • CVVp,q,r,s - hybrid, elements from CVV, SMART
  • SBR, random, etc.
  • meritquery,coll estimated merit (goodness)

9
Ranks
  • For each query, collections are ranked according
    to estimated merit (highest 1st)
  • A ranking represents the order in which
    collections should be searched for docs
  • Often want to consider only top-n collections
  • n is the collection cut-off or simply cut-off

10
Rank Comparisons
  • An estimates ranks are compared against a
    baselines ranks (e.g., RBR vs CORI)
  • Different comparison metrics
  • Mean Squared Error (MSE), Spearmans rho
  • P(n) Precision at cut-off n
  • prop. of estimated top-n collections with real
    merit gt 0
  • R(n) Recall at cut-off n
  • prop. of top-n real merit in estimated top-n
    collections
  • R(n) R hat(n) H(n)
  • prop. of total real merit in estimated top-n
    collections

11
Overall Performance Metric
  • In this study, we use R(avg) averaged over all
    queries
  • This is a normalized area under the curve

12
What is CVV?
  • CVV Cue Validity Variance
  • CVV algorithm
  • a/k/a CVV ranking method
  • an estimator, generates merits (which can be
    ranked and compared against a baseline)
  • consists of CVV component, DF component
  • CVV component
  • derived from DF and N
  • What are DF and N?

13
Terminology, cont.
  • C set of collections in the system
  • C number of collections in the system (236)
  • Ncoll number of documents in collection
  • DFterm, coll Document Frequency
  • number of documents in collection in which term
    occurs at least once
  • So, DFterm,coll lt Ncoll

14
Cue Validity (CV)
  • Densityterm,coll DFterm,coll / Ncoll
  • Externalterm,coll Ã¥ (DFterm,c) / Ã¥ (Nc)
    c ! coll
    c ! coll
  • CVterm,coll
    Densityterm,coll
    Densityterm,coll Externalterm,coll
  • CVterm Ã¥ (CVterm,coll) / C coll
    in C CVterm,coll
    values are always between 0 and 1

15
Cue Validity Variance (CVV)
  • CVVterm Ã¥ (CVterm,coll - CVterm)2 / C
    coll in C
  • meritquery,coll estimated Goodness
    å (CVVterm DFterm,coll)
    term in query

    2 components
  • The basic CVV algorithm ignores the number of
    times a term occurs in the query (Query Term
    Weight)

16
Basic CVV Problem
  • What if query cat cat dog?
  • Basic CVV ignores the number of times cat
    appears in the query, so merits (and consequently
    ranks) will be the same as for query cat dog
  • Unlike in CVVs development environment, many of
    our test queries have multiply-occurring terms
  • Performance of basic CVV tends to be noticeably
    poorer than that of algorithms which incorporate
    the query term frequency in their merit
    calculations
  • We can modify CVV to incorporate query term
    frequency

17
Enchancing CVV with QTW
  • QTWquery,term query term weight (or freq)
    of times term occurs in query
  • meritquery,coll Ã¥ (CVVterm DFterm,coll
    QTWquery,term)
    term in query

    3 components

18
Effects of QTW Enhancement
  • For R(avg) averaged over all queries
  • SYM performance increased from .8416 to .8486
  • UDC performance increased from .6735 to .6846
  • Meanwhile, CORIs SYM .8972, UDC .7884
  • So, performance increased, but not dramatically
  • For most individual queries, performance improved
    or stayed the same, but for 20-30 of queries,
    performance decreased
  • Could QTW be overcorrecting?

19
Further Questions...
  • What if we varied the degree to which QTW
    influenced the merit calculations? We could
  • Add an exponent to the QTW componentmeritterm,co
    ll å (CVVterm DFterm,coll QTWterme)
  • Evaluate using a range of exponents, searching
    for version that maximizes performance (e 0,
    0.5, 1, 2, )
  • What if we added exponents to the other two
    components as well CVV and DF?
  • What about adding an ICF component?

20
Inverse Collection Frequency (ICF)
  • CFterm collection frequency of term
    of collections in which term occurs
  • ICFterm log ((C 1) / CFterm)
  • ICF is used to decrease contribution of terms
    which appear in many collections and are
    therefore not good discriminators terms which
    occur in few collections are best discriminators.
  • ICF has shown useful in other algorithms.

21
Final Enhancement Equation
  • meritquery,coll Ã¥ (CVVtermp DFterm,collq
    QTWquery,termr ICFterms) term in query

    4 components
  • Notes
  • DF is the only component dependent upon
    collection
  • CVV(1,1,0,0) Basic CVV CVV(0) or cvv
  • CVV(1,1,1,0) QTW-Enhanced CVV CVV(1)
  • CVV(0,1,1,2) ntn_ntn
  • CVV(,0,,) alphabetical (same merits for all
    collections)

22
  • P E R F O R M A N C E
  • Estimate SYM UDC Joint
  • CORI .8972 .7884 .8428
  • 1.0 0.5 2.0 2.0 .8938 .7373 .8155
  • 1.0 0.5 2.0 0.7 .8839 .7776 .8307
  • 0.5 0.2 3.0 1.0 .8937 .7712 .8325
  • Basic CVV .8416 .6735 .7576
  • Ideal(0) .8570 .7146 .7858
  • ntn_ntn .8729 .7356 .8042
  • Note ntn_ntn 0.0 1.0 1.0 2.0

23
Essentiality
  • Whats the best performance you can get if you
    hold a components exponent at 0?
  • CVV component appears to be the least
    essentialDF appears to be the most essential

Omitted Best Performer and
Performance Comp
SYM UDC
Joint . none
1.0 0.5 2.0 2.0 (.8938) 1.0 0.5 2.0 0.7
(.7776) 0.5 0.2 3.0 1.0 (.8325) CVV 0.0 0.5
3.0 3.0 (.8932) 0.0 0.5 2.0 1.0 (.7737) 0.0 0.5
3.0 1.0 (.8298) DF tied
(.6081) tied (.6017) tied
(.6049) QTW 1.0 0.5 0.0 2.0 (.8851) 3.0 0.5
0.0 0.0 (.7525) 0.0 0.5 0.0 1.0 (.8053) ICF
2.0 0.5 3.0 0.0 (.8755) 3.0 0.5 2.0 0.0
(.7647) 3.0 0.5 3.0 0.0 (.8188)
24
Other Reductions
Active Best Performer and
Performance Comps
SYM UDC
Joint . all
1.0 0.5 2.0 2.0 (.8938) 1.0 0.5 2.0 0.7
(.7776) 0.5 0.2 3.0 1.0 (.8325)DF,CVV 0.0
0.5 0.0 0.0 (.8533) 3.0 0.5 0.0 0.0 (.7525)
2.0 0.5 0.0 0.0 (.7945)DF,QTW 0.0 0.5 3.0 0.0
(.8725) 0.0 0.5 3.0 0.0 (.7306) 0.0 0.5 3.0
0.0 (.8015)DF,ICF 0.0 0.5 0.0 2.0 (.8831)
0.0 0.5 0.0 1.0 (.7372) 0.0 0.5 0.0 1.0
(.8053)DF only 0.0 0.5 0.0 0.0 (.8533) 0.0
0.5 0.0 0.0 (.6889) 0.0 0.5 0.0 0.0 (.7711)
  • Comments
  • Fewer sample points, since we didnt focus
    searches here

25
Open Questions
  • How much would other algorithms performances
    improve if we tweaked them?
  • Would CORI get so much better that a CVV hybrid
    couldnt even get close?
  • Or, is CORI already optimized?
  • Is there value in automating these searches,
    using adaptive programming?

26
Basic CVV Example
  • Given
  • query1 cat dog
  • CVVcat 0.8 (cat is unevenly distributed)
  • CVVdog 0.2 (dog is more evenly
    distributed)
  • DFcat,collA 1 (cat appears in only one of
    collAs docs)
  • DFdog,collA 20
  • DFcat,collB 5
  • DFdog,collB 3
  • meritquery1,collA (0.81) (0.220) 0.8 4
    4.8
  • meritquery1,collB (0.85) (0.23) 4 0.6
    4.6 For query1, collA would be ranked
    higher (better) than collB
Write a Comment
User Comments (0)
About PowerShow.com