Title: Metasearch. Mathematics of Knowledge and Search Engines: Tutorials at IPAM, 9/13/2007
1. Metasearch
Mathematics of Knowledge and Search Engines: Tutorials at IPAM, 9/13/2007
- Zhenyu (Victor) Liu
- Software Engineer
- Google Inc.
- vicliu@google.com
2. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
3. Metasearch: the problem
[Figure: a user query, "applied mathematics", is sent to a Metasearch Engine, which forwards it to multiple underlying databases and returns merged search results.]
4. Subproblems
- Database content modeling
  - How does a Metasearch engine perceive the content of each database?
- Database selection
  - Selectively issue the query to the best databases
- Query translation
  - Different databases have different query formats: "a b" / "a AND b" / "title:a AND body:b" / etc.
- Result merging
  - Query: "applied mathematics"; given the top-10 results from both science.com and nature.com, how should they be presented?
5. Database content modeling and selection: a simplified example
- A content summary of each database
- Selection based on the # of matching docs
- Assuming independence between words
Database 1: 10,000 docs in total; its summary gives Pr("applied") = 0.4 and Pr("mathematics") = 0.25, so 10,000 × 0.4 × 0.25 = 1,000 documents are estimated to match "applied mathematics"
Database 2: 60,000 docs in total; with Pr("applied") = 0.00333 and Pr("mathematics") = 0.005, 60,000 × 0.00333 × 0.005 ≈ 1 document is estimated to match "applied mathematics"
Since 1,000 > 1, database 1 is selected. (A sketch of this computation follows below.)
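A minimal sketch of this independence-based estimate; the function name and structure are illustrative, not from the tutorial:

```python
# Expected # of docs matching ALL query words = database size times the
# product of per-word match probabilities from the content summary,
# under the independence assumption.

def estimated_matches(total_docs, word_probs):
    """Estimate # of docs containing every query word, assuming independence."""
    estimate = total_docs
    for p in word_probs:
        estimate *= p
    return estimate

# The slide's example values:
print(estimated_matches(10_000, [0.4, 0.25]))      # database 1 -> 1000.0
print(estimated_matches(60_000, [0.00333, 0.005])) # database 2 -> ~1.0
```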
6. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
7Database content modeling
able to replicate theentire text database - most
storage demanding- fully cooperative database
able to obtain a fullcontent summary - less
storage demanding- fully cooperative database
download part of atext database - more storage
demanding- non-cooperative database
approximate the contentsummary via sampling -
least storage demanding- non-cooperative database
8. Replicate the entire database
- E.g., www.google.com/patents: a replica of the entire USPTO patent document database
9. Download a non-cooperative database
- Objective: download as much of the database as possible
- Basic idea: probing (querying with short queries) and downloading all results
- Practically, one can only issue a fixed # of probes (e.g., 1,000 queries per day)
[Figure: the Metasearch Engine issues probe words such as "applied" and "mathematics" through the search interface of a text database.]
10. Harder than the set-coverage problem
- All docs in a database db form the universe
  - assuming all docs are equally valuable
- Each probe corresponds to a subset
- Find the least # of subsets (probes) that covers db
  - or, the max coverage with a fixed # of subsets (probes)
- NP-complete
  - The greedy algorithm is proved to be the best-possible P-time approximation algorithm (see the sketch below)
- But the cardinality of each subset (the # of matching docs for each probe) is unknown!
[Figure: probe words "applied" and "mathematics" each cover a subset of the docs in db.]
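For contrast with the unknown-cardinality setting above, a minimal sketch of classic greedy max-coverage when each probe's result set *is* known; probe words and document IDs are illustrative:

```python
# Greedy max-coverage: each step, take the probe that adds the most
# not-yet-covered documents. This is the best-possible polynomial-time
# approximation for the problem.

def greedy_max_coverage(probe_results, budget):
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(probe_results, key=lambda w: len(probe_results[w] - covered))
        if not probe_results[best] - covered:
            break  # no probe adds anything new
        chosen.append(best)
        covered |= probe_results[best]
    return chosen, covered

probe_results = {
    "applied":     {1, 2, 3, 4},
    "mathematics": {3, 4, 5},
    "theorem":     {5, 6},
}
print(greedy_max_coverage(probe_results, budget=2))
# -> (['applied', 'theorem'], {1, 2, 3, 4, 5, 6})
```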
11. Pseudo-greedy algorithms [NPC05]
- Greedy set-coverage: choose the subset with the max cardinality gain
- When the cardinality of subsets is unknown:
  - Assume the subset cardinalities are proportionally the same across databases
    - e.g., build a reference database from Web pages crawled from the Internet, and rank single words according to their frequency in it
  - Or start with certain seed queries and adaptively choose query words from within the docs returned
    - the choice of probing words then varies from database to database
12. An adaptive method
- D(wi): the subset of docs returned by probing with word wi
- w1, w2, ..., wn already issued
- The gain of the next probe word w_{n+1} is the # of new docs, |D(w_{n+1}) \ (D(w1) ∪ ... ∪ D(wn))|
- Rewritten as |db|·Pr(w_{n+1}) − |db|·Pr(w_{n+1} ∧ (w1 ∨ ... ∨ wn)) (sketch below)
- Pr(w): the probability of w appearing in a doc of db
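A minimal sketch of the gain formula, assuming the two probability estimates are already available (e.g., from the Zipf interpolation on the next slide); all names are illustrative:

```python
# Estimated # of NEW docs a candidate probe returns:
#   |db| * Pr(w_{n+1})  -  |db| * Pr(w_{n+1} AND (w1 OR ... OR wn))

def probe_gain(db_size, p_word, p_word_and_covered):
    return db_size * p_word - db_size * p_word_and_covered

# e.g., |db| = 100,000, Pr(w) = 0.03, Pr(w AND already-covered) = 0.01:
print(probe_gain(100_000, 0.03, 0.01))  # -> 2000.0 new docs expected
```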
13. An adaptive method (cont'd)
- How to estimate Pr(w_{n+1})?
- Zipf's law:
  - Pr(w) = α·(R(w) + β)^(−γ), where R(w) is the rank of w in descending order of Pr(w)
- Assume the ranking of w1, w2, ..., wn and the other words remains the same in the downloaded subset as in db
- Interpolate: fit the curve on the words with known Pr, then read the unknown Pr values off the fitted curve
14. Obtain an exact content summary
- C(db) for a database db
  - statistics about the words in db, e.g., df (document frequency)
- Standards and proposals for cooperative databases to follow when exporting C(db):
  - STARTS [GCM97]
    - initiated by Stanford; had attracted the main search-engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.
  - SDARTS [GIG01]
    - initiated by Columbia U.
15. Approximate the content summary
- Objective: an approximate summary C'(db) of a database db, with high vocabulary coverage and high accuracy
- Basic idea: probing and downloading sample docs [CC01]
- Example, with df as the content-summary statistic (see the sketch below):
  1. Pick a single word as the query and probe the database
  2. Download a fraction of the results, e.g., the top-k
  3. If the terminating condition is unsatisfied, go to 1
  4. Output <w, df> pairs based on the sample docs downloaded
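A minimal sketch of this sampling loop, assuming a `search(word, k)` stand-in for the database's query interface; the termination rule (a fixed probe budget) and all names are illustrative:

```python
from collections import Counter
import random

def sample_content_summary(search, seed_word, k=4, max_probes=100):
    """Build an approximate <word, df> summary from sampled documents."""
    df, vocab, seen = Counter(), [seed_word], set()
    for _ in range(max_probes):              # terminating condition: probe budget
        word = random.choice(vocab)          # 1. pick a single-word query
        for doc in search(word, k):          # 2. download the top-k results
            if doc in seen:                  # count each sampled doc once
                continue
            seen.add(doc)
            words = set(doc.lower().split())
            df.update(words)                 # df measured over the sample
            vocab.extend(words)              # later probes drawn from sample vocab
    return df                                # 4. <w, df> over the sampled docs
```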
16. Vocabulary coverage
- Can a small sample of docs cover the vocabulary of a big database?
- Yes, based on Heaps' law [Hea78]:
  - |W| = K·n^β
  - n: # of words scanned
  - W: set of distinct words encountered
  - K: constant, typically in [10, 100]
  - β: constant, typically in [0.4, 0.6]
- Empirically verified [CC01]; illustrated below
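A quick numeric illustration of why the sublinear growth matters, with K = 50 and β = 0.5 picked from the typical ranges above:

```python
# Heaps' law |W| = K * n^beta with illustrative constants.
K, beta = 50, 0.5
for n in (10_000, 1_000_000, 100_000_000):   # words scanned
    print(n, int(K * n ** beta))             # distinct words encountered
# 10,000 words      ->   5,000 distinct
# 1,000,000 words   ->  50,000 distinct
# 100,000,000 words -> 500,000 distinct
```

Scanning 10,000 times more text grows the vocabulary only 100-fold, so a modest sample already covers much of a large database's vocabulary.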
17. Estimate document frequency
- How to identify the df of w in the entire database?
  - If w was used as a query during sampling: its df is typically revealed in the search results
  - If w merely appears in the sampled docs: its df needs to be estimated from the doc sample
- Apply Zipf's law and interpolate [IG02] (see the sketch below):
  1. Rank the words by their frequency in the sample
  2. Curve-fit using the true df of the words that served as queries
  3. Read the estimated df of every other word off the fitted curve
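A minimal sketch of the fit-and-interpolate step, simplified to a pure power law (a log-log linear fit) over the words whose true df was revealed by probing; the helper names and example numbers are illustrative:

```python
import math

def fit_zipf(known):
    """known: {sample_rank: true_df}. Least-squares fit of log(df) vs
    log(rank); returns a function rank -> estimated df."""
    xs = [math.log(r) for r in known]
    ys = [math.log(df) for df in known.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda rank: math.exp(intercept + slope * math.log(rank))

# Probe queries revealed: the rank-1 word has df 10,000, the rank-100 word df 100.
est_df = fit_zipf({1: 10_000, 100: 100})
print(round(est_df(10)))   # df estimate for the word ranked 10th -> ~1000
```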
18. What if db changes over time?
- So do its content summary C(db) and the approximation C'(db) [INC05]
- Empirical study:
  - 152 Web databases, a snapshot downloaded weekly, for 1 year
  - df as the statistics measure
  - Kullback-Leibler divergence as the change measure, between the latest snapshot and the snapshot from time t ago (a sketch follows below)
- db does change!
  - How do we model the change?
  - When should we resample and get a new C'(db)?
[Figure: the Kullback-Leibler divergence grows with the time gap t.]
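A minimal sketch of the change measure: KL divergence between the df-based word distributions of two snapshots of the same database. The smoothing constant is an illustrative choice:

```python
import math

def kl_divergence(df_new, df_old, eps=1e-9):
    """df_new, df_old: {word: document frequency} for two snapshots."""
    vocab = set(df_new) | set(df_old)
    n_new = sum(df_new.values()) or 1
    n_old = sum(df_old.values()) or 1
    kl = 0.0
    for w in vocab:
        p = df_new.get(w, 0) / n_new + eps   # current snapshot
        q = df_old.get(w, 0) / n_old + eps   # snapshot from time t ago
        kl += p * math.log(p / q)
    return kl

print(kl_divergence({"flu": 80, "vaccine": 20},
                    {"flu": 50, "vaccine": 50}))  # -> ~0.19
```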
19. Model the change
- KLdb(t): the KL divergence between the current C(db) and C(db, t), the summary from time t ago
- T: the time when KLdb(t) exceeds a pre-specified threshold τ
- Applying principles of Survival Analysis:
  - survival function: Sdb(t) = 1 − Pr(T ≤ t)
  - hazard function: hdb(t) = −(dSdb(t)/dt) / Sdb(t)
- How to compute hdb(t), and then Sdb(t)? (see the derivation below)
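The two functions determine each other; a short standard derivation (not specific to the tutorial) that also explains the Weibull form used two slides later:

```latex
% Hazard and survival determine each other:
%   h_{db}(t) = -\frac{S_{db}'(t)}{S_{db}(t)} = -\frac{d}{dt}\,\ln S_{db}(t).
% Integrating from 0 to t with S_{db}(0) = 1 gives
S_{db}(t) \;=\; \exp\!\Big(-\int_0^t h_{db}(u)\,du\Big).
% Weibull case: h_{db}(t) = \lambda\gamma\, t^{\gamma-1}
% gives \int_0^t h_{db}(u)\,du = \lambda t^{\gamma}, i.e.
% S_{db}(t) = e^{-\lambda t^{\gamma}}, the form fitted on slide 21.
```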
20. Learn the hdb(t) of database change
- Cox proportional-hazards regression model:
  - ln( hdb(t) ) = ln( hbase(t) ) + β1·x1 + β2·x2 + ..., where each xi is a predictor variable
- Predictors:
  - the pre-specified threshold τ
  - the Web domain of db: .com / .edu / .gov / .org / others
    - 5 binary domain variables
  - ln( |db| )
  - avg KLdb(1 week) measured in the training period
21. Train the Cox model
- A stratified Cox model is applied:
  - the domain variables didn't satisfy the Cox proportional-hazards assumption
  - stratify on each domain, i.e., a separate hbase(t) / Sbase(t) per domain
- Training Sbase(t) for each domain:
  - assuming a Weibull distribution, Sbase(t) = e^(−λt^γ)
22. Training result
- γ ranges in (0.57, 1.08) ⇒ Sbase(t) is not an exponential distribution (exponential would require γ = 1)
[Figure: the fitted Sbase(t) curves plotted against t.]
23. Training result (cont'd)
- A larger db takes less time for KLdb(t) to exceed τ
- Databases that change faster during a short period are more likely to change later on
24. How to use the trained model?
- The model gives Sdb(t) ⇒ the likelihood that db has not changed much by time t
- An update policy to periodically resample each db
  - intuitively: maximize Σdb Sdb(t)
  - more precisely: maximize S = lim_{t→∞} (1/t)·∫₀ᵗ Σdb Sdb(t′) dt′
  - a policy is {fdb}, where fdb is the update frequency of db, e.g., 2/week
  - subject to practical constraints, e.g., a total update cap per week
25. Derive an optimal update policy
- Find the {fdb} that maximizes S under the constraint Σdb fdb ≤ F, where F is a global frequency limit
- Solvable by the Lagrange-multiplier method (a numeric sketch follows below)
- Sample results
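A minimal numeric sketch of the constrained problem, under the assumption that resampling resets Sdb to 1 and that we average survival over one update cycle; the Weibull parameters are illustrative, and the sketch solves numerically rather than via the closed-form Lagrange conditions:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

dbs = [(0.2, 0.8), (1.5, 0.8), (0.6, 1.0)]   # illustrative (lambda, gamma) per db
F = 4.0                                       # global budget: total updates/week

def avg_survival(f, lam, gam):
    # resampling every 1/f resets S to 1; average S over one cycle
    return f * quad(lambda t: np.exp(-lam * t**gam), 0, 1.0 / f)[0]

def neg_objective(fs):
    return -sum(avg_survival(f, lam, gam) for f, (lam, gam) in zip(fs, dbs))

res = minimize(neg_objective, x0=[F / len(dbs)] * len(dbs),
               bounds=[(1e-3, F)] * len(dbs),
               constraints=[{"type": "eq", "fun": lambda fs: sum(fs) - F}])
print(np.round(res.x, 2))  # faster-changing dbs receive higher update frequency
```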
26. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
27. Database selection
- Select the databases to which a given query should be issued
- Necessary when the Metasearch engine does not have an entire replica of each database; most likely it has content summaries only
- Reduces the query load on the entire system
- Formalization:
  - query q = <w1, ..., wm>, databases db1, ..., dbn
  - rank the databases according to their relevancy score r(dbi, q) for query q
28. Relevancy score
- # of matching docs in db
- Similarity between q and the top docs returned by db
  - typically vector-space similarity (dot-product) between q and a doc
  - sum / avg of the similarities of the top-k docs of each db, e.g., top-10
  - sum / avg of the similarities of the top docs of each db exceeding a similarity threshold
- Relevancy of db as judged by users
  - explicit relevance feedback
  - user click-behavior data
29. Estimating r(db,q)
- Typically, the true r(db, q) is unavailable
- Estimate r(db, q) based on C(db), or its approximation C'(db)
30. Estimating r(db,q), example 1 [GGT99]
- r(db, q) = # of matching docs in db
- Independence assumption:
  - query words w1, ..., wm appear independently in db
- r(db, q) ≈ |db| · Π_{j=1..m} ( df(db, wj) / |db| ) (sketch below)
  - df(db, wj): document frequency of wj in db; could be the approximate df from C'(db)
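The same arithmetic as the slide-5 sketch, now driven by df values from the content summary; names are illustrative:

```python
# r(db, q) ~= |db| * prod( df(db, w_j) / |db| ) under word independence.

def estimate_matches(db_size, dfs):
    r = db_size
    for df in dfs:
        r *= df / db_size
    return r

# df("applied") = 4000 and df("mathematics") = 2500 in a 10,000-doc db:
print(estimate_matches(10_000, [4000, 2500]))  # -> 1000.0
```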
31. Estimating r(db,q), example 2 [GGT99]
- r(db, q) = Σ_{d ∈ db, sim(d, q) > l} sim(d, q)
- d: a doc in db
- sim(d, q): vector dot-product between d and q
  - each word in d and q weighted with a common tf·idf weighting
- l: a pre-specified threshold
32. Estimating r(db,q), example 2 (cont'd)
- The content summary C(db) required:
  - df(db, w): doc frequency
  - v(db, w) = Σ_{d ∈ db} (weight of w in d's vector)
    - <v(db, w1), v(db, w2), ...> is the centroid of the entire db, viewed as a cluster of doc vectors
33. Estimating r(db,q), example 2 (cont'd)
- For l = 0: the sum of all q-doc similarity values of db
  - r(db, q) = Σ_{d ∈ db} sim(d, q)
  - r(db, q) = <v(q, w1), v(q, w2), ...> · <v(db, w1), v(db, w2), ...>
  - v(q, w): weight of w in the query vector
- What about l > 0?
34. Estimating r(db,q), example 2 (cont'd)
- Assume a uniform weight of w among all docs using w
  - i.e., the weight of w in any such doc is v(db, w) / df(db, w)
- Highly-correlated query words scenario:
  - if df(db, wi) < df(db, wj), every doc using wi also uses wj
  - sort the words in q s.t. df(db, w1) ≥ df(db, w2) ≥ ... ≥ df(db, wm)
  - r(db, q) = Σ_{i=1..p} v(q, wi)·v(db, wi) + df(db, wp) · Σ_{j=p+1..m} v(q, wj)·v(db, wj)/df(db, wj), where p is determined by some criteria [GGT99]
- Disjoint query words scenario:
  - no doc using wi also uses wj
  - r(db, q) = Σ_{i=1..m} [ df(db, wi) > 0 ∧ v(q, wi)·v(db, wi)/df(db, wi) > l ] · v(q, wi)·v(db, wi)
  (a sketch of the disjoint case follows below)
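A minimal sketch of the disjoint-words case, the easier of the two to verify: under the uniform-weight assumption every doc using wi has similarity v(q, wi)·v(db, wi)/df(db, wi), so all df(db, wi) of them clear the threshold l together or not at all. Inputs are illustrative:

```python
def r_disjoint(query_weights, db_weights, dfs, l):
    """Estimate r(db, q) for disjoint query words from summary statistics."""
    r = 0.0
    for vq, vdb, df in zip(query_weights, db_weights, dfs):
        if df > 0 and (vq * vdb) / df > l:   # per-doc similarity beats l
            r += vq * vdb                    # df docs, each (vq*vdb)/df
    return r

# Per-doc sims: 6/3 = 2.0 and 1/10 = 0.1; with l = 0.5 only the first word's
# 3 docs qualify, contributing 6.0 in total.
print(r_disjoint([2.0, 0.5], [3.0, 2.0], [3, 10], l=0.5))  # -> 6.0
```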
35. Estimating r(db,q), example 2 (cont'd)
- Database rankings based on the estimated r(db, q) were empirically evaluated in [GGT99]
36. A probabilistic model for errors in estimation [LLC04]
- Any estimation makes errors
- An error (observed) distribution for each db
  - the distribution of db1 ≠ the distribution of db2
- Definition of error: relative, so that the true r(db, q) equals the estimate scaled by (1 + err)
37. Modeling the errors: a motivating experiment
- dbPMC: PubMed Central, www.pubmedcentral.nih.gov
- Two query sets, Q1 and Q2 (healthcare related)
  - |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅
- Compute err(dbPMC, q) for each sample query q ∈ Q1 or Q2
  - the two observed error distributions look nearly identical
  - further verified through statistical tests (Pearson-χ²)
[Figure: the error probability distributions of err(dbPMC, q) over q ∈ Q1 and over q ∈ Q2, side by side.]
38. Implications of the experiment
- On a text database:
  - similar error behavior among sample queries
  - can sample a database and summarize the error behavior into an Error Distribution (ED)
  - use the ED to predict the error for a future, unseen query
- Sampling-size study [LLC04]:
  - a few hundred sample queries are good enough
39. From an Error Distribution (ED) to a Relevancy Distribution (RD)
[Figure: the ED for db1, obtained from sampling, puts probability 0.5, 0.4, and 0.1 on errors −50, 0, and +50 (%). An existing estimation method gives the point estimate r(db1, qnew) = 1000. Combining the two, by the definition of the error, yields the RD for r(db1, qnew): 500, 1000, and 1500 with probabilities 0.5, 0.4, and 0.1.]
40. RD-based selection
- Estimation-based: db1 > db2, since the point estimates are r(db1, qnew) = 1000 > r(db2, qnew) = 650
- RD-based: db1 < db2, with Pr(db1 < db2) = 0.85 (a comparison sketch follows below)
[Figure: db1's ED (probabilities 0.5, 0.4, 0.1 on errors −50, 0, +50) applied to its estimate 1000, and db2's ED (probabilities 0.9, 0.1 on errors 0, +100) applied to its estimate 650, give two RDs whose comparison yields Pr(r(db1, qnew) < r(db2, qnew)).]
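A minimal sketch of the RD-based comparison: apply each database's ED to its point estimate, then compute Pr(r1 < r2) over the discrete supports. The EDs below are illustrative, not necessarily the figure's exact ones:

```python
def rd(point_estimate, error_dist):
    """error_dist: {relative_error: probability} -> {relevancy: probability}."""
    return {point_estimate * (1 + e): p for e, p in error_dist.items()}

def prob_less(rd1, rd2):
    """Exact Pr(r1 < r2) assuming the two RDs are independent."""
    return sum(p1 * p2 for r1, p1 in rd1.items()
                       for r2, p2 in rd2.items() if r1 < r2)

rd1 = rd(1000, {-0.5: 0.5, 0.0: 0.4, 0.5: 0.1})
rd2 = rd(650,  {0.0: 0.9, 1.0: 0.1})
print(prob_less(rd1, rd2))  # db1's point estimate is larger, yet db2 may win
```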
41. Correctness metric
- Terminology:
  - DBk: the k databases returned by some method
  - DBtopk: the actual answer
- How correct is DBk compared to DBtopk?
  - absolute correctness: Cora(DBk) = 1 if DBk = DBtopk, and 0 otherwise
  - partial correctness: Corp(DBk) = |DBk ∩ DBtopk| / k, the fraction of overlap
  - Cora(DBk) = Corp(DBk) for k = 1
42. Effectiveness of RD-based selection
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the ED of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection
43. Probing to improve correctness
- RD-based selection (continuing the previous example, k = 1):
  - 0.85 = Pr(db2 > db1) = Pr(db2 = DBtop1)
  - E[Cora(db2)] = 1·Pr(db2 = DBtop1) + 0·Pr(db2 ≠ DBtop1) = 0.85
- Probe dbi: contact dbi to obtain its exact relevancy
- After probing db1:
  - E[Cora(db2)] = Pr(db2 > db1), now evaluated against db1's exact relevancy, so it moves toward 0 or 1
44. Computing the expected correctness
- Expected absolute correctness:
  - E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0) = Pr(Cora(DBk) = 1) = Pr(DBk = DBtopk)
- Expected partial correctness:
  - E[Corp(DBk)] = (1/k)·Σ_{db ∈ DBk} Pr(db ∈ DBtopk), by linearity from the overlap-fraction definition
45. Adaptive probing algorithm APro
- User-specified correctness threshold τ
- Maintain the RDs of the probed and unprobed databases db1, ..., dbn
- Loop: if some DBk has E[Cor(DBk)] ≥ τ, return this DBk; otherwise probe one more database, collapse its RD to its exact relevancy, and repeat
[Figure: flowchart of APro; databases db1, ..., dbi-1 have been probed, dbi, ..., dbn remain unprobed.]
46. Which database to probe?
- A greedy strategy
  - the stopping condition: E[Cor(DBk)] ≥ τ
  - once probed, which database leads to the highest E[Cor(DBk)]?
- Suppose we probe db3; its RD has possible outcomes ra, rb, rc:
  - if r(db3, q) = ra, max E[Cor(DBk)] = 0.85
  - if r(db3, q) = rb, max E[Cor(DBk)] = 0.8
  - if r(db3, q) = rc, max E[Cor(DBk)] = 0.9
- Probe the database that leads to the largest expected max E[Cor(DBk)] (a simplified sketch of the overall loop follows below)
[Figure: the RDs of db1, ..., db4; db3's RD has support {ra, rb, rc}.]
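A minimal sketch of the probe-until-confident loop for k = 1, with two simplifications: expected correctness is estimated by Monte Carlo over the RDs, and the greedy probe choice is approximated by probing the widest unresolved RD rather than by the full expected-max computation; all names are illustrative:

```python
import random

def expected_cor_top1(rds, trials=20_000):
    """rds: {db: {relevancy: prob}}. Pick the db with the highest mean
    relevancy; estimate Pr(it is truly top-1) by sampling all RDs jointly."""
    mean = lambda rd: sum(v * p for v, p in rd.items())
    pick = max(rds, key=lambda db: mean(rds[db]))
    wins = 0
    for _ in range(trials):
        draws = {db: random.choices(list(rd), weights=rd.values())[0]
                 for db, rd in rds.items()}
        wins += max(draws, key=draws.get) == pick
    return pick, wins / trials

def apro_top1(rds, probe, tau=0.9):
    """Probe databases until the pick is top-1 with probability >= tau.
    `probe(db)` is an assumed oracle returning the exact relevancy."""
    while True:
        pick, ecor = expected_cor_top1(rds)
        unresolved = [db for db in rds if len(rds[db]) > 1]
        if ecor >= tau or not unresolved:
            return pick, ecor
        # simplified greedy: probe the unresolved db with the widest RD
        target = max(unresolved, key=lambda db: max(rds[db]) - min(rds[db]))
        rds[target] = {probe(target): 1.0}   # RD collapses to the exact value
```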
47. Effectiveness of adaptive probing
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the RD of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection
[Figure: average Cora for k = 1, average Cora for k = 3, and average Corp for k = 3.]
48. The lazy TA problem
- The same problem, generalized and humanized:
  - after the final exam, the TA wants to find the top-scoring students
  - the TA is lazy and doesn't want to score all the exam sheets
- Input: every student's score follows a known distribution
  - observed from previous quizzes and mid-term exams
- Output: a scoring strategy
  - maximizes the correctness of the guessed top-k students
49. Further study of this problem [LSC05]
- Proves that greedy probing is optimal in special cases
- More interesting factors to be explored:
  - optimal probing strategy in general cases
  - non-uniform probing cost
  - time-variant distributions
50. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
51. Summary
- Metasearch: a challenging problem
- Database content modeling
  - sampling, enhanced by proper application of Zipf's law and Heaps' law
  - content change modeled using Survival Analysis
- Database selection
  - estimation of database relevancy based on assumptions
  - a probabilistic framework that models the estimation error as a distribution
  - optimal probing strategy for a collection of distributions as input
52. References
- [CC01] J.P. Callan and M. Connell, "Query-Based Sampling of Text Databases," ACM Transactions on Information Systems, 19(2), 2001
- [GCM97] L. Gravano, C-C. K. Chang, H. Garcia-Molina, A. Paepcke, "STARTS: Stanford Proposal for Internet Meta-searching," in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 1997
- [GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, "GlOSS: Text-Source Discovery over the Internet," ACM Transactions on Database Systems, 24(2), 1999
- [GIG01] N. Green, P. Ipeirotis, L. Gravano, "SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching," in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001
- [Hea78] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978
- [IG02] P. Ipeirotis, L. Gravano, "Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection," in Proc. of the 28th VLDB Conf., 2002
53. References (cont'd)
- [INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, "Modeling and Managing Content Changes in Text Databases," in Proc. of the 21st IEEE Int'l Conf. on Data Engineering (ICDE), 2005
- [LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, "A Probabilistic Approach to Metasearching with Adaptive Probing," in Proc. of the 20th IEEE Int'l Conf. on Data Engineering (ICDE), 2004
- [LSC05] Z. Liu, K.C. Sia, J. Cho, "Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty," in Proc. of the ACM Annual Symposium on Applied Computing, 2005
- [NPC05] A. Ntoulas, P. Zerfos, J. Cho, "Downloading Hidden Web Content," in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005