Metasearch
Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007


1
Metasearch
Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007
  • Zhenyu (Victor) Liu
  • Software Engineer
  • Google Inc.
  • vicliu@google.com

2
Roadmap
  • The problem
  • Database content modeling
  • Database selection
  • Summary

3
Metasearch: the problem
[Figure: a user query "applied mathematics" goes to the Metasearch Engine, which forwards it to multiple underlying databases and returns merged search results.]
4
Subproblems
  • Database content modeling
  • How does a Metasearch engine perceive the
    content of each database?
  • Database selection
  • Selectively issue the query to the best
    databases
  • Query translation
  • Different databases have different query formats
  • "a b" / "a AND b" / "title:a AND body:b" /
    etc.
  • Result merging
  • Query "applied mathematics" returns top-10
    results from both science.com and nature.com;
    how to present them?

5
Database content modeling and selection: a
simplified example
  • A content summary of each database
  • Selection based on # of matching docs
  • Assuming independence between words (see the
    sketch below)

Database A: 10,000 docs total; Pr("applied") = 0.4, Pr("mathematics") = 0.25
Database B: 60,000 docs total; Pr("applied") = 0.00333, Pr("mathematics") = 0.005
10,000 × 0.4 × 0.25 = 1,000 documents match
"applied mathematics"
60,000 × 0.00333 × 0.005 ≈ 1 document matches
"applied mathematics"
1,000 > 1, so database A is selected
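
A minimal Python sketch of the independence-based estimate above; the document totals and word probabilities are the slide's illustrative numbers, not real statistics.

```python
def expected_matches(total_docs: int, word_probs: list[float]) -> float:
    """Expected # of docs matching ALL query words, assuming the words
    appear independently of one another."""
    p = 1.0
    for prob in word_probs:
        p *= prob
    return total_docs * p

# Database A: 10,000 docs, Pr("applied") = 0.4, Pr("mathematics") = 0.25
print(expected_matches(10_000, [0.4, 0.25]))       # 1000.0
# Database B: 60,000 docs, Pr("applied") = 0.00333, Pr("mathematics") = 0.005
print(expected_matches(60_000, [0.00333, 0.005]))  # ~1 (0.999)
```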
6
Roadmap
  • The problem
  • Database content modeling
  • Database selection
  • Summary

7
Database content modeling
  • Able to replicate the entire text database
  • most storage-demanding; fully cooperative database
  • Able to obtain a full content summary
  • less storage-demanding; fully cooperative database
  • Download part of a text database
  • more storage-demanding; non-cooperative database
  • Approximate the content summary via sampling
  • least storage-demanding; non-cooperative database
8
Replicate the entire database
  • E.g.
  • www.google.com/patents, replica of the entire
    USPTO patent document database

9
Download a non-cooperative database
  • Objective: download as much as possible
  • Basic idea: probing (querying with short
    queries) and downloading all results
  • Practically, can only issue a fixed # of probes
    (e.g., 1000 queries per day)

[Figure: the Metasearch Engine sends probe words such as "applied" and "mathematics" to the search interface of a text database.]
10
Harder than the set-coverage problem
  • All docs in a database db as the universe
  • assuming all docs are equal
  • Each probe corresponds to a subset
  • Find the least # of subsets (probes) that covers
    db
  • or, the max coverage with a fixed # of subsets
    (probes)
  • NP-complete
  • Greedy algo. proven to be the best-possible
    P-time approximation algo. (sketch below)
  • Cardinality of each subset (# of matching docs
    for each probe) unknown!
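
For reference, a minimal Python sketch of the greedy set-coverage heuristic mentioned above, assuming (unlike the actual probing setting) that the result set of every candidate probe word is fully known in advance:

```python
def greedy_probes(probe_results: dict[str, set], budget: int) -> list[str]:
    """Pick up to `budget` probe words, each time choosing the word whose
    result set covers the most not-yet-covered documents."""
    covered: set = set()
    chosen: list[str] = []
    for _ in range(budget):
        word = max(probe_results,
                   key=lambda w: len(probe_results[w] - covered),
                   default=None)
        if word is None or not (probe_results[word] - covered):
            break  # no remaining probe adds new documents
        chosen.append(word)
        covered |= probe_results.pop(word)
    return chosen
```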
11
Pseudo-greedy algorithms [NPC05]
  • Greedy set-coverage: choose subsets with the max
    cardinality gain
  • When the cardinality of subsets is unknown:
  • Assume the relative cardinality of subsets is
    proportionally the same across databases
  • e.g., build a reference database with Web pages
    crawled from the Internet, rank single words
    according to their frequency
  • Start with certain seed queries, adaptively
    choose query words within the docs returned
  • Choice of probing words varies from database to
    database

12
An adaptive method
  • D(wi): the subset of docs returned by probing
    with word wi
  • With w1, w2, …, wn already issued, the gain of a
    new probe w_{n+1} is |D(w_{n+1}) ∪ D(w1) ∪ … ∪
    D(wn)| − |D(w1) ∪ … ∪ D(wn)|
  • Rewritten as |db|·Pr(w_{n+1}) − |db|·Pr(w_{n+1} ∧
    (w1 ∨ … ∨ wn))
  • Pr(w): prob. of w appearing in a doc of db

13
An adaptive method (cont'd)
  • How to estimate Pr(w_{n+1})?
  • Zipf's law:
  • Pr(w) = α·(R(w) + β)^(−γ), where R(w) is the rank
    of w in a descending order of Pr(w)
  • Assuming the ranking of w1, w2, …, wn and other
    words remains the same in the downloaded subset
    and in db
  • Interpolate (see the sketch below)
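
A minimal sketch of this estimation step, assuming SciPy is available and that Pr(w) is known for the words already used as probes (e.g., derived from the match counts their search results reported); the function names are illustrative, not from the cited work.

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf(rank, alpha, beta, gamma):
    # Pr(w) = alpha * (R(w) + beta)^(-gamma)
    return alpha * (rank + beta) ** (-gamma)

def estimate_pr(known_pr: dict[str, float], sample_rank: dict[str, int],
                word: str) -> float:
    """Fit the Zipf curve on words with known Pr, using their ranks in the
    downloaded sample, then interpolate `word` onto the fitted curve."""
    ranks = np.array([sample_rank[w] for w in known_pr])
    probs = np.array(list(known_pr.values()))
    (alpha, beta, gamma), _ = curve_fit(zipf, ranks, probs,
                                        p0=(0.1, 1.0, 1.0), maxfev=10_000)
    return zipf(sample_rank[word], alpha, beta, gamma)
```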

14
Obtain an exact content summary
  • C(db) for a database db
  • Statistics about words in db, e.g., df (document
    frequency)
  • Standards and proposals for cooperative
    databases to follow to export C(db)
  • STARTS [GCM97]
  • Initiated by Stanford; attracted the main search
    engine players by 1997: Fulcrum, Infoseek, PLS,
    Verity, WAIS, Excite, etc.
  • SDARTS [GIG01]
  • Initiated by Columbia U.

15
Approximate the content summary
  • Objective: an approximate content summary Ĉ(db)
    of a database db, with high vocabulary coverage
    and high accuracy
  • Basic idea: probing and downloading sample docs
    [CC01], as sketched below
  • Example, with df as the content-summary statistic:
  • 1. Pick a single word as the query, probe the
    database
  • 2. Download a fraction of the results, e.g., the
    top-k
  • 3. If the terminating condition is unsatisfied,
    go to 1
  • 4. Output ⟨w, df⟩ pairs based on the sample docs
    downloaded
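
A minimal sketch of the sampling loop above. `search` stands in for the database's query interface and is an assumed callable returning docs as strings; the probe-word selection is deliberately simpler than in [CC01].

```python
import random
from collections import Counter

def sample_content_summary(search, seed_words, num_probes=100, top_k=4):
    """Probe with single-word queries, download the top-k results of each
    probe, and build an approximate <word, df> summary from the sample."""
    sampled_docs = []
    vocab = list(seed_words)
    for _ in range(num_probes):
        word = random.choice(vocab)
        docs = search(word)[:top_k]          # download a fraction of results
        sampled_docs.extend(docs)
        for doc in docs:                     # later probe words are drawn
            vocab.extend(set(doc.split()))   # from the docs seen so far
    df = Counter()
    for doc in set(sampled_docs):            # df counts distinct docs
        df.update(set(doc.split()))
    return df
```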

16
Vocabulary coverage
  • Can a small sample of docs cover the vocabulary
    of a big database?
  • Yes, based on Heaps' law [Hea78]:
  • |W| = K·n^β
  • n: # of words scanned
  • W: set of distinct words encountered
  • K: constant, typically in [10, 100]
  • β: constant, typically in [0.4, 0.6]
  • Empirically verified [CC01]; a quick numeric
    check below
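
A quick numeric check of Heaps' law, with K and β picked from the middle of the typical ranges above; the sublinear exponent is what lets a modest sample cover much of the vocabulary.

```python
# |W| = K * n^beta with mid-range constants
K, beta = 50, 0.5
for n in (10_000, 1_000_000):
    print(n, int(K * n ** beta))
# 10,000 words scanned    -> ~5,000 distinct words
# 1,000,000 words scanned -> ~50,000 distinct words (100x input, 10x vocab)
```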

17
Estimate document frequency
  • How to identify the df of a word w in the entire
    database?
  • w used as a query during sampling: df typically
    revealed in the search results
  • w only appearing in the sampled docs: need to
    estimate df based on the doc sample
  • Apply Zipf's law and interpolate [IG02], as
    sketched below:
  • Rank all words based on their frequency in the
    sample
  • Curve-fit based on the true df of the words whose
    df is known
  • Interpolate the estimated df of the remaining
    words onto the fitted curve
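
A minimal sketch of the curve-fit-and-interpolate step, assuming a pure power-law (Zipf) fit in log-log space; [IG02] describes a more careful procedure.

```python
import numpy as np

def estimate_dfs(sample_rank: dict[str, int],
                 known_df: dict[str, int]) -> dict[str, float]:
    """Fit log df = a + g * log(rank) on the words whose true df is known
    (the probe words), then interpolate every other sampled word."""
    x = np.log([sample_rank[w] for w in known_df])
    y = np.log(list(known_df.values()))
    g, a = np.polyfit(x, y, 1)   # slope (negative) and intercept
    return {w: float(np.exp(a + g * np.log(r)))
            for w, r in sample_rank.items() if w not in known_df}
```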

18
What if db changes over time?
  • So do its content summary C(db) and the
    approximation Ĉ(db) [INC05]
  • Empirical study:
  • 152 Web databases, a snapshot downloaded weekly,
    for 1 year
  • df as the statistics measure
  • Kullback-Leibler divergence as the change
    measure
  • between the latest snapshot and the snapshot
    time t ago
  • db does change!
  • How do we model the change?
  • When to resample and get a new Ĉ(db)?

[Figure: Kullback-Leibler divergence grows as the time gap t increases.]
19
Model the change
  • KL_db(t): the KL divergence between the current
    C(db) and C(db, t), the summary from time t ago
  • T: time when KL_db(t) exceeds a pre-specified
    threshold τ
  • Applying principles of Survival Analysis:
  • Survival function: S_db(t) = 1 − Pr(T ≤ t)
  • Hazard function: h_db(t) = −(dS_db(t)/dt) / S_db(t)
  • How to compute h_db(t) and then S_db(t)?

20
Learn the h_db(t) of database change
  • Cox proportional hazards regression model:
  • ln(h_db(t)) = ln(h_base(t)) + β1·x1 + β2·x2 + …,
    where each xi is a predictor variable
  • Predictors:
  • Pre-specified threshold τ
  • Web domain of db: .com / .edu / .gov / .org /
    others
  • 5 binary domain variables
  • ln(|db|)
  • avg KL_db(1 week) measured in the training period

21
Train the Cox model
  • Stratified Cox model applied:
  • Domain variables didn't satisfy the Cox
    proportionality assumption
  • Stratify on each domain, i.e., a separate
    h_base(t) / S_base(t) for each domain
  • Train S_base(t) for each domain
  • Assuming a Weibull distribution, S_base(t) =
    e^(−λt^γ) (evaluated in the sketch below)
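
A minimal sketch of evaluating the Weibull survival function; the per-domain (λ, γ) pairs below are made up purely for illustration.

```python
import math

def weibull_survival(t_weeks: float, lam: float, gamma: float) -> float:
    """S_base(t) = exp(-lambda * t^gamma): probability that KL_db(t) has
    not yet exceeded the threshold after t weeks."""
    return math.exp(-lam * t_weeks ** gamma)

strata = {".com": (0.12, 0.9), ".edu": (0.05, 0.7)}  # hypothetical strata
for domain, (lam, gamma) in strata.items():
    print(domain, [round(weibull_survival(t, lam, gamma), 3)
                   for t in (1, 4, 12)])
```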

22
Training result
  • γ ranges in (0.57, 1.08) ⇒ S_base(t) is not an
    exponential distribution

[Figure: fitted S_base(t) curves decaying with t, one per domain.]
23
Training result (cont'd)
  • A larger db takes less time to have KL_db(t)
    exceed τ
  • Databases that change faster during a short period
    are more likely to change later on

24
How to use the trained model?
  • The model gives S_db(t), the likelihood that db
    has not changed much by time t
  • An update policy to periodically resample each db:
  • Intuitively, maximize Σ_db S_db(t)
  • More precisely, maximize S = lim_{t→∞} (1/t) ∫_0^t
    Σ_db S_db(t′) dt′
  • A policy is {f_db}, where f_db is the update
    frequency of db, e.g., 2/week
  • Subject to practical constraints, e.g., a total
    update cap per week


25
Derive an optimal update policy
  • Find {f_db} that maximizes S under the constraint
    Σ_db f_db ≤ F, where F is a global frequency limit
  • Solvable by the Lagrange-multiplier method (a
    numeric sketch follows)
  • Sample results
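
A minimal numeric sketch of this optimization, solved with SciPy's SLSQP rather than by hand with Lagrange multipliers; the per-database Weibull parameters and the budget F are made up for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

dbs = [(0.12, 0.9), (0.05, 0.7), (0.30, 1.0)]  # hypothetical (lambda, gamma)
F = 6.0                                         # total updates per week

def avg_survival(f, lam, gamma):
    """Time-averaged S_db between updates spaced 1/f apart."""
    val, _ = quad(lambda t: np.exp(-lam * t ** gamma), 0.0, 1.0 / f)
    return f * val

def neg_total(fs):
    return -sum(avg_survival(f, lam, g) for f, (lam, g) in zip(fs, dbs))

res = minimize(neg_total, x0=[F / len(dbs)] * len(dbs), method="SLSQP",
               bounds=[(1e-3, F)] * len(dbs),
               constraints=[{"type": "eq", "fun": lambda fs: sum(fs) - F}])
print(dict(zip(["db1", "db2", "db3"], np.round(res.x, 2))))
```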

26
Roadmap
  • The problem
  • Database content modeling
  • Database selection
  • Summary

27
Database selection
  • Select the databases to issue a given query to
  • Necessary when the Metasearch engine does not
    have an entire replica of each database; most
    likely it has content summaries only
  • Reduces query load in the entire system
  • Formalization:
  • Query q = ⟨w1, …, wm⟩, databases db1, …, dbn
  • Rank databases according to their relevancy
    score r(dbi, q) for query q

28
Relevancy score
  • # of matching docs in db
  • Similarity between q and the top docs returned by db
  • Typically vector-space similarity (dot-product)
    between q and a doc
  • Sum / avg of similarities of the top-k docs of
    each db, e.g., top-10
  • Sum / avg of similarities of the top docs of each
    db exceeding a similarity threshold
  • Relevancy of db as judged by users
  • Explicit relevance feedback
  • User click behavior data

29
Estimating r(db,q)
  • Typically, the true r(db, q) is unavailable
  • Estimate r̂(db, q) based on C(db) or Ĉ(db)

30
Estimating r(db,q), example 1 [GGT99]
  • r(db, q) = # of matching docs in db
  • Independence assumption:
  • Query words w1, …, wm appear independently in db
  • r̂(db, q) = |db| · Π_{j=1..m} df(db, wj)/|db|
    (sketched below)
  • df(db, wj): document frequency of wj in db;
    could be the estimate from Ĉ(db)
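
A minimal sketch of this estimator; the df values reuse the slide-5 illustrative numbers.

```python
def r_hat(db_size: int, df: dict[str, int], query_words: list[str]) -> float:
    """Expected # of matching docs under the word-independence assumption."""
    est = float(db_size)
    for w in query_words:
        est *= df.get(w, 0) / db_size   # Pr(w appears in a doc of db)
    return est

# 10,000 docs, df("applied") = 4,000, df("mathematics") = 2,500
print(r_hat(10_000, {"applied": 4_000, "mathematics": 2_500},
            ["applied", "mathematics"]))   # 1000.0
```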

31
Estimating r(db,q), example 2 [GGT99]
  • r(db, q) = Σ_{d ∈ db, sim(d, q) > ℓ} sim(d, q)
  • d: a doc in db
  • sim(d, q): vector dot-product between d and q
  • each word in d and q weighted with common tf·idf
    weighting
  • ℓ: a pre-specified threshold

32
Estimating r(db,q), example 2 (cont'd)
  • Content summary C(db) required:
  • df(db, w): doc frequency
  • v(db, w) = Σ_{d ∈ db} (weight of w in d's vector)
  • ⟨v(db, w1), v(db, w2), …⟩: the centroid of the
    entire db viewed as a cluster of doc vectors
33
Estimating r(db,q), example 2 (cont'd)
  • ℓ = 0: sum of all q-doc similarity values of db
  • r(db, q) = Σ_{d ∈ db} sim(d, q)
  • r̂(db, q) = r(db, q) = ⟨v(q, w1), …⟩ · ⟨v(db, w1),
    v(db, w2), …⟩
  • v(q, w): weight of w in the query vector
  • ℓ > 0?



34
Estimating r(db,q), example 2 (cont'd)
  • Assume a uniform weight of w among all docs
    using w
  • i.e., weight of w in any doc = v(db, w) / df(db, w)
  • Highly-correlated query words scenario:
  • If df(db, wi) < df(db, wj), every doc using wi
    also uses wj
  • Words in q sorted s.t. df(db, w1) ≤ df(db, w2) ≤
    … ≤ df(db, wm)
  • r̂(db, q) = Σ_{i=1..p} v(q, wi)·v(db, wi) +
    df(db, wp) · Σ_{j=p+1..m} v(q, wj)·v(db, wj)/df(db, wj),
    where p is determined by some criteria in [GGT99]
  • Disjoint query words scenario:
  • No doc using wi uses wj
  • r̂(db, q) = Σ_{i=1..m} [df(db, wi) > 0 ∧
    v(q, wi)·v(db, wi)/df(db, wi) > ℓ] · v(q, wi)·v(db, wi)
  • Both scenarios are sketched in code below
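
A minimal sketch of both scenarios as reconstructed above. Note that p is passed in directly here, whereas [GGT99] derives it from the threshold ℓ.

```python
def r_hat_disjoint(v_q, v_db, df, ell):
    """Disjoint scenario: each query word contributes independently."""
    return sum(v_q[w] * v_db[w] for w in v_q
               if df.get(w, 0) > 0 and v_q[w] * v_db[w] / df[w] > ell)

def r_hat_correlated(v_q, v_db, df, p):
    """Highly-correlated scenario: words sorted by ascending df; the first
    p words contribute fully, the rest via the df(w_p) docs that remain."""
    words = sorted(v_q, key=lambda w: df[w])
    head, tail = words[:p], words[p:]
    return (sum(v_q[w] * v_db[w] for w in head)
            + df[words[p - 1]] * sum(v_q[w] * v_db[w] / df[w] for w in tail))
```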






35
Estimating r(db,q), example 2 (cont'd)
  • Ranking of databases based on r̂(db, q) was
    empirically evaluated in [GGT99]

36
A probabilistic model for errors in estimation
[LLC04]
  • Any estimation makes errors
  • An error distribution (observed) for each db
  • the distribution of db1 ≠ the distribution of db2
  • Definition of error: relative error of the
    estimate r̂(db, q) w.r.t. the true r(db, q)

37
Modeling the errors: a motivating experiment
  • db_PMC: PubMed Central, www.pubmedcentral.nih.gov
  • Two query sets, Q1 and Q2 (healthcare related)
  • |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅
  • Compute err(db_PMC, q) for each sample query q ∈
    Q1 or Q2
  • Further verified through statistical tests
    (Pearson χ2)

[Figure: the error probability distribution of err(db_PMC, q) over q ∈ Q1 and the one over q ∈ Q2 look nearly identical.]
38
Implications of the experiment
  • On a text database:
  • Similar error behavior among sample queries
  • Can sample a database and summarize the error
    behavior into an Error Distribution (ED)
  • Use the ED to predict the error for a future
    unseen query
  • Sampling size study [LLC04]:
  • A few hundred sample queries are good enough

39
From an Error Distribution (ED) to a Relevancy
Distribution (RD)
  • Database db1, query qnew
  • The ED for db1, from sampling: Pr(err = −50%) =
    0.5, Pr(err = 0%) = 0.4, Pr(err = +50%) = 0.1
  • An existing estimation method gives the point
    estimate r̂(db1, qnew) = 1000
  • By the definition of relative error, each error
    value maps to a relevancy value, turning the ED
    into an RD for r(db1, qnew): Pr(500) = 0.5,
    Pr(1000) = 0.4, Pr(1500) = 0.1
40
RD-based selection
  • Estimation-based: db1 > db2, since r̂(db1, qnew) =
    1000 > r̂(db2, qnew) = 650
  • RD-based: db1 < db2
    ( Pr(db1 < db2) = 0.85 )

[Figure: the ED of db1 (probabilities 0.5, 0.4, 0.1 at errors −50%, 0%, +50%) and the ED of db2 (probabilities 0.9, 0.1 at errors 0%, +100%) are combined with r̂(db1, qnew) = 1000 and r̂(db2, qnew) = 650 to produce the two RDs being compared; a code sketch of such a comparison follows.]
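
A minimal sketch of comparing two discrete RDs. The exact EDs behind the slide's Pr(db1 < db2) = 0.85 are not recoverable from this transcript, so the distributions below are illustrative only.

```python
def prob_greater(rd_a, rd_b):
    """Pr(relevancy of A > relevancy of B) for independent discrete RDs,
    each a list of (relevancy, probability) pairs. Ties count as 'not >'."""
    return sum(pa * pb for ra, pa in rd_a for rb, pb in rd_b if ra > rb)

rd_db1 = [(500, 0.5), (1000, 0.4), (1500, 0.1)]  # RD around r_hat(db1) = 1000
rd_db2 = [(650, 0.9), (1300, 0.1)]               # RD around r_hat(db2) = 650
print(prob_greater(rd_db2, rd_db1))              # Pr(db2 beats db1)
```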
41
Correctness metric
  • Terminology:
  • DB_k: the k databases returned by some method
  • DB_topk: the actual top-k answer
  • How correct is DB_k compared to DB_topk?
  • Absolute correctness: Cor_a(DB_k) = 1 if DB_k =
    DB_topk, 0 otherwise
  • Partial correctness: Cor_p(DB_k) = |DB_k ∩
    DB_topk| / k
  • Cor_a(DB_k) = Cor_p(DB_k) for k = 1

42
Effectiveness of RD-based selection
  • 20 healthcare-related text databases on the Web
  • Q1 (training, 1000 queries) to learn the ED of
    each database
  • Q2 (testing, 1000 queries) to test the
    correctness of database selection

43
Probing to improve correctness
  • RD-based selection gives, e.g.,
  • 0.85 = Pr(db2 > db1) = Pr(db2 = DB_top1)
  • E[Cor_a({db2})] = 1·Pr(db2 = DB_top1) +
    0·Pr(db2 ≠ DB_top1) = 0.85
  • Probe db_i: contact db_i to obtain its exact
    relevancy
  • After probing db1, the RD of db1 collapses to its
    exact relevancy, and E[Cor_a({db2})] =
    Pr(db2 > db1) can rise to 1
44
Computing the expected correctness
  • Expected absolute correctness:
  • E[Cor_a(DB_k)] = 1·Pr(Cor_a(DB_k) = 1) +
    0·Pr(Cor_a(DB_k) = 0) = Pr(DB_k = DB_topk)
  • Expected partial correctness:
  • E[Cor_p(DB_k)] = (1/k)·Σ_{db ∈ DB_k} Pr(db ∈
    DB_topk), estimated below via Monte Carlo
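
A minimal Monte Carlo sketch of E[Cor_p(DB_k)], sampling each database's discrete RD independently; `chosen` is the candidate DB_k.

```python
import random

def expected_cor_p(rds: dict[str, list], chosen: set, k: int,
                   trials: int = 100_000) -> float:
    """E[Cor_p] = E[|DB_k intersect DB_topk|] / k, by sampling the RDs."""
    names = list(rds)
    hits = 0
    for _ in range(trials):
        draw = {n: random.choices([r for r, _ in rds[n]],
                                  [p for _, p in rds[n]])[0] for n in names}
        topk = set(sorted(names, key=lambda n: -draw[n])[:k])
        hits += len(topk & chosen)
    return hits / (trials * k)
```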

45
Adaptive probing algorithm APro
  • User-specified correctness threshold t

[Flowchart: maintain exact relevancy values for the probed databases db1, …, dbi and RDs for the unprobed databases db_{i+1}, …, dbn. If some DB_k satisfies E[Cor(DB_k)] ≥ t, return this DB_k (YES branch); otherwise (NO branch) probe one more database and repeat.]
46
Which database to probe?
  • A greedy strategy:
  • The stopping condition: E[Cor(DB_k)] ≥ t
  • Once probed, which database leads to the highest
    E[Cor(DB_k)]?
  • Suppose we will probe db3:
  • if r(db3, q) = ra, max E[Cor(DB_k)] = 0.85
  • if r(db3, q) = rb, max E[Cor(DB_k)] = 0.8
  • if r(db3, q) = rc, max E[Cor(DB_k)] = 0.9
  • Probe the database that leads to the largest
    expected max E[Cor(DB_k)] (sketched below)

[Figure: RDs of db1 through db4; the RD of db3 has three possible relevancy values ra, rb, rc.]
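
A minimal sketch of this greedy choice for k = 1 with small discrete RDs. Probing a database collapses its RD to a single point; each candidate is scored by the expectation, over its own RD, of the best E[Cor_a(DB_1)] achievable afterwards.

```python
from itertools import product

def max_ecor_top1(rds: dict[str, list]) -> float:
    """max over databases of Pr(that database is the top-1)."""
    names = list(rds)
    wins = dict.fromkeys(names, 0.0)
    for combo in product(*(rds[n] for n in names)):
        p = 1.0
        for _, prob in combo:
            p *= prob
        best = max(range(len(combo)), key=lambda i: combo[i][0])
        wins[names[best]] += p
    return max(wins.values())

def best_probe(rds: dict[str, list]) -> str:
    """Greedy: probe the db whose outcomes maximize expected max ECor."""
    scores = {}
    for cand in rds:
        scores[cand] = sum(
            prob * max_ecor_top1({**rds, cand: [(value, 1.0)]})
            for value, prob in rds[cand])
    return max(scores, key=scores.get)
```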
47
Effectiveness of adaptive probing
  • 20 healthcare-related text databases on the Web
  • Q1 (training, 1000 queries) to learn the RD of
    each database
  • Q2 (testing, 1000 queries) to test the
    correctness of database selection

[Figure: average Cor_a at k = 1, average Cor_a at k = 3, and average Cor_p at k = 3.]
48
The lazy TA problem
  • The same problem, generalized and humanized:
  • After the final exam, the TA wants to find out
    the top-scoring students
  • The TA is lazy and doesn't want to score all the
    exam sheets
  • Input: every student's score follows a known
    distribution
  • Observed from previous quizzes and mid-term exams
  • Output: a scoring strategy that
  • Maximizes the correctness of the guessed top-k
    students

49
Further study of this problem [LSC05]
  • Proves greedy probing is optimal in special
    cases
  • More interesting factors to be explored:
  • Optimal probing strategy in general cases
  • Non-uniform probing cost
  • Time-variant distributions

50
Roadmap
  • The problem
  • Database content modeling
  • Database selection
  • Summary

51
Summary
  • Metasearch: a challenging problem
  • Database content modeling
  • Sampling enhanced by proper application of
    Zipf's law and Heaps' law
  • Content change modeled using Survival Analysis
  • Database selection
  • Estimation of database relevancy based on
    assumptions
  • A probabilistic framework that models the error
    as a distribution
  • Optimal probing strategy for a collection of
    distributions as input

52
References
  • [CC01] J.P. Callan and M. Connell, "Query-Based
    Sampling of Text Databases," ACM Trans. on
    Information Systems, 19(2), 2001
  • [GCM97] L. Gravano, C-C. K. Chang, H.
    Garcia-Molina, A. Paepcke, "STARTS: Stanford
    Proposal for Internet Meta-searching," in Proc.
    of the ACM SIGMOD Int'l Conf. on Management of
    Data, 1997
  • [GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic,
    "GlOSS: Text-Source Discovery over the Internet,"
    ACM Trans. on Database Systems, 24(2), 1999
  • [GIG01] N. Green, P. Ipeirotis, L. Gravano,
    "SDLIP + STARTS = SDARTS: A Protocol and Toolkit
    for Metasearching," in Proc. of the Joint Conf.
    on Digital Libraries (JCDL), 2001
  • [Hea78] H.S. Heaps, Information Retrieval:
    Computational and Theoretical Aspects, Academic
    Press, 1978
  • [IG02] P. Ipeirotis, L. Gravano, "Distributed
    Search over the Hidden Web: Hierarchical Database
    Sampling and Selection," in Proc. of the 28th
    VLDB Conf., 2002

53
References (cont'd)
  • [INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L.
    Gravano, "Modeling and Managing Content Changes
    in Text Databases," in Proc. of the 21st IEEE
    Int'l Conf. on Data Eng. (ICDE), 2005
  • [LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, "A
    Probabilistic Approach to Metasearching with
    Adaptive Probing," in Proc. of the 20th IEEE
    Int'l Conf. on Data Eng. (ICDE), 2004
  • [LSC05] Z. Liu, K.C. Sia, J. Cho, "Cost-Efficient
    Processing of Min/Max Queries over Distributed
    Sensors with Uncertainty," in Proc. of the ACM
    Annual Symposium on Applied Computing, 2005
  • [NPC05] A. Ntoulas, P. Zerfos, J. Cho,
    "Downloading Hidden Web Content," in Proc. of the
    Joint Conf. on Digital Libraries (JCDL), June
    2005