Title: Metasearch. Mathematics of Knowledge and Search Engines: Tutorials at IPAM, 9/13/2007
1. Metasearch
Mathematics of Knowledge and Search Engines: Tutorials at IPAM, 9/13/2007
- Zhenyu (Victor) Liu
- Software Engineer
- Google Inc.
- vicliu@google.com
2. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
3. Metasearch: the problem
[Figure: a user query, "applied mathematics", is sent to a Metasearch Engine, which forwards it to multiple underlying databases and returns merged search results.]
4. Subproblems
- Database content modeling
  - How does a Metasearch engine perceive the content of each database?
- Database selection
  - Selectively issue the query to the best databases
- Query translation
  - Different databases have different query formats: "a b" / "a AND b" / "title:a AND body:b" / etc.
- Result merging
  - Query: "applied mathematics"; given the top-10 results from both science.com and nature.com, how should they be presented?
5. Database content modeling and selection: a simplified example
- A content summary of each database
- Selection based on the # of matching docs
- Assuming independence between words
Database 1: 10,000 docs in total; its summary gives Pr("applied") = 0.4 and Pr("mathematics") = 0.25, so 10,000 × 0.4 × 0.25 = 1,000 documents are estimated to match "applied mathematics"
Database 2: 60,000 docs in total; with Pr("applied") = 0.00333 and Pr("mathematics") = 0.005, 60,000 × 0.00333 × 0.005 ≈ 1 document is estimated to match "applied mathematics"
Since 1,000 > 1, database 1 is selected. (A sketch of this computation follows below.)
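A minimal sketch of this independence-based estimate; the function name and structure are illustrative, not from the tutorial:

```python
# Expected # of docs matching ALL query words = database size times the
# product of per-word match probabilities from the content summary,
# under the independence assumption.

def estimated_matches(total_docs, word_probs):
    """Estimate # of docs containing every query word, assuming independence."""
    estimate = total_docs
    for p in word_probs:
        estimate *= p
    return estimate

# The slide's example values:
print(estimated_matches(10_000, [0.4, 0.25]))      # database 1 -> 1000.0
print(estimated_matches(60_000, [0.00333, 0.005])) # database 2 -> ~1.0
```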
6. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
7Database content modeling
able to replicate theentire text database - most
storage demanding- fully cooperative database
able to obtain a fullcontent summary - less
storage demanding- fully cooperative database
download part of atext database - more storage
demanding- non-cooperative database
approximate the contentsummary via sampling -
least storage demanding- non-cooperative database
8. Replicate the entire database
- E.g., www.google.com/patents: a replica of the entire USPTO patent document database
9. Download a non-cooperative database
- Objective: download as much of the database as possible
- Basic idea: probing (querying with short queries) and downloading all results
- Practically, one can only issue a fixed # of probes (e.g., 1,000 queries per day)
[Figure: the Metasearch Engine issues probe words such as "applied" and "mathematics" through the search interface of a text database.]
10. Harder than the set-coverage problem
- All docs in a database db form the universe
  - assuming all docs are equally valuable
- Each probe corresponds to a subset
- Find the least # of subsets (probes) that covers db
  - or, the max coverage with a fixed # of subsets (probes)
- NP-complete
  - The greedy algorithm is proved to be the best-possible P-time approximation algorithm (see the sketch below)
- But the cardinality of each subset (the # of matching docs for each probe) is unknown!
[Figure: probe words "applied" and "mathematics" each cover a subset of the docs in db.]
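For contrast with the unknown-cardinality setting above, a minimal sketch of classic greedy max-coverage when each probe's result set *is* known; probe words and document IDs are illustrative:

```python
# Greedy max-coverage: each step, take the probe that adds the most
# not-yet-covered documents. This is the best-possible polynomial-time
# approximation for the problem.

def greedy_max_coverage(probe_results, budget):
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(probe_results, key=lambda w: len(probe_results[w] - covered))
        if not probe_results[best] - covered:
            break  # no probe adds anything new
        chosen.append(best)
        covered |= probe_results[best]
    return chosen, covered

probe_results = {
    "applied":     {1, 2, 3, 4},
    "mathematics": {3, 4, 5},
    "theorem":     {5, 6},
}
print(greedy_max_coverage(probe_results, budget=2))
# -> (['applied', 'theorem'], {1, 2, 3, 4, 5, 6})
```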
11. Pseudo-greedy algorithms [NPC05]
- Greedy set-coverage: choose the subset with the max cardinality gain
- When the cardinality of subsets is unknown:
  - Assume the subset cardinalities are proportionally the same across databases
    - e.g., build a reference database from Web pages crawled from the Internet, and rank single words according to their frequency in it
  - Or start with certain seed queries and adaptively choose query words from within the docs returned
    - the choice of probing words then varies from database to database
12. An adaptive method
- D(wi): the subset of docs returned by probing with word wi
- w1, w2, ..., wn already issued
- The gain of the next probe word w_{n+1} is the # of new docs, |D(w_{n+1}) \ (D(w1) ∪ ... ∪ D(wn))|
- Rewritten as |db|·Pr(w_{n+1}) − |db|·Pr(w_{n+1} ∧ (w1 ∨ ... ∨ wn)) (sketch below)
- Pr(w): the probability of w appearing in a doc of db
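A minimal sketch of the gain formula, assuming the two probability estimates are already available (e.g., from the Zipf interpolation on the next slide); all names are illustrative:

```python
# Estimated # of NEW docs a candidate probe returns:
#   |db| * Pr(w_{n+1})  -  |db| * Pr(w_{n+1} AND (w1 OR ... OR wn))

def probe_gain(db_size, p_word, p_word_and_covered):
    return db_size * p_word - db_size * p_word_and_covered

# e.g., |db| = 100,000, Pr(w) = 0.03, Pr(w AND already-covered) = 0.01:
print(probe_gain(100_000, 0.03, 0.01))  # -> 2000.0 new docs expected
```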
13. An adaptive method (cont'd)
- How to estimate Pr(w_{n+1})?
- Zipf's law:
  - Pr(w) = α·(R(w) + β)^(−γ), where R(w) is the rank of w in descending order of Pr(w)
- Assume the ranking of w1, w2, ..., wn and the other words remains the same in the downloaded subset as in db
- Interpolate: fit the curve on the words with known Pr, then read the unknown Pr values off the fitted curve
14. Obtain an exact content summary
- C(db) for a database db
  - statistics about the words in db, e.g., df (document frequency)
- Standards and proposals for cooperative databases to follow when exporting C(db):
  - STARTS [GCM97]
    - initiated by Stanford; had attracted the main search-engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.
  - SDARTS [GIG01]
    - initiated by Columbia U.
15. Approximate the content summary
- Objective: an approximate summary C'(db) of a database db, with high vocabulary coverage and high accuracy
- Basic idea: probing and downloading sample docs [CC01]
- Example, with df as the content-summary statistic (see the sketch below):
  1. Pick a single word as the query and probe the database
  2. Download a fraction of the results, e.g., the top-k
  3. If the terminating condition is unsatisfied, go to 1
  4. Output <w, df> pairs based on the sample docs downloaded
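A minimal sketch of this sampling loop, assuming a `search(word, k)` stand-in for the database's query interface; the termination rule (a fixed probe budget) and all names are illustrative:

```python
from collections import Counter
import random

def sample_content_summary(search, seed_word, k=4, max_probes=100):
    """Build an approximate <word, df> summary from sampled documents."""
    df, vocab, seen = Counter(), [seed_word], set()
    for _ in range(max_probes):              # terminating condition: probe budget
        word = random.choice(vocab)          # 1. pick a single-word query
        for doc in search(word, k):          # 2. download the top-k results
            if doc in seen:                  # count each sampled doc once
                continue
            seen.add(doc)
            words = set(doc.lower().split())
            df.update(words)                 # df measured over the sample
            vocab.extend(words)              # later probes drawn from sample vocab
    return df                                # 4. <w, df> over the sampled docs
```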
16. Vocabulary coverage
- Can a small sample of docs cover the vocabulary of a big database?
- Yes, based on Heaps' law [Hea78]:
  - |W| = K·n^β
  - n: # of words scanned
  - W: set of distinct words encountered
  - K: constant, typically in [10, 100]
  - β: constant, typically in [0.4, 0.6]
- Empirically verified [CC01]; illustrated below
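A quick numeric illustration of why the sublinear growth matters, with K = 50 and β = 0.5 picked from the typical ranges above:

```python
# Heaps' law |W| = K * n^beta with illustrative constants.
K, beta = 50, 0.5
for n in (10_000, 1_000_000, 100_000_000):   # words scanned
    print(n, int(K * n ** beta))             # distinct words encountered
# 10,000 words      ->   5,000 distinct
# 1,000,000 words   ->  50,000 distinct
# 100,000,000 words -> 500,000 distinct
```

Scanning 10,000 times more text grows the vocabulary only 100-fold, so a modest sample already covers much of a large database's vocabulary.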
17. Estimate document frequency
- How to identify the df of w in the entire database?
  - If w was used as a query during sampling: its df is typically revealed in the search results
  - If w merely appears in the sampled docs: its df needs to be estimated from the doc sample
- Apply Zipf's law and interpolate [IG02] (see the sketch below):
  1. Rank the words by their frequency in the sample
  2. Curve-fit using the true df of the words that served as queries
  3. Read the estimated df of every other word off the fitted curve
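A minimal sketch of the fit-and-interpolate step, simplified to a pure power law (a log-log linear fit) over the words whose true df was revealed by probing; the helper names and example numbers are illustrative:

```python
import math

def fit_zipf(known):
    """known: {sample_rank: true_df}. Least-squares fit of log(df) vs
    log(rank); returns a function rank -> estimated df."""
    xs = [math.log(r) for r in known]
    ys = [math.log(df) for df in known.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda rank: math.exp(intercept + slope * math.log(rank))

# Probe queries revealed: the rank-1 word has df 10,000, the rank-100 word df 100.
est_df = fit_zipf({1: 10_000, 100: 100})
print(round(est_df(10)))   # df estimate for the word ranked 10th -> ~1000
```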
18. What if db changes over time?
- So do its content summary C(db) and the approximation C'(db) [INC05]
- Empirical study:
  - 152 Web databases, a snapshot downloaded weekly, for 1 year
  - df as the statistics measure
  - Kullback-Leibler divergence as the change measure, between the latest snapshot and the snapshot from time t ago (a sketch follows below)
- db does change!
  - How do we model the change?
  - When should we resample and get a new C'(db)?
[Figure: the Kullback-Leibler divergence grows with the time gap t.]
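A minimal sketch of the change measure: KL divergence between the df-based word distributions of two snapshots of the same database. The smoothing constant is an illustrative choice:

```python
import math

def kl_divergence(df_new, df_old, eps=1e-9):
    """df_new, df_old: {word: document frequency} for two snapshots."""
    vocab = set(df_new) | set(df_old)
    n_new = sum(df_new.values()) or 1
    n_old = sum(df_old.values()) or 1
    kl = 0.0
    for w in vocab:
        p = df_new.get(w, 0) / n_new + eps   # current snapshot
        q = df_old.get(w, 0) / n_old + eps   # snapshot from time t ago
        kl += p * math.log(p / q)
    return kl

print(kl_divergence({"flu": 80, "vaccine": 20},
                    {"flu": 50, "vaccine": 50}))  # -> ~0.19
```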
19. Model the change
- KLdb(t): the KL divergence between the current C(db) and C(db, t), the summary from time t ago
- T: the time when KLdb(t) exceeds a pre-specified threshold τ
- Applying principles of Survival Analysis:
  - survival function: Sdb(t) = 1 − Pr(T ≤ t)
  - hazard function: hdb(t) = −(dSdb(t)/dt) / Sdb(t)
- How to compute hdb(t), and then Sdb(t)? (see the derivation below)
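The two functions determine each other; a short standard derivation (not specific to the tutorial) that also explains the Weibull form used two slides later:

```latex
% Hazard and survival determine each other:
%   h_{db}(t) = -\frac{S_{db}'(t)}{S_{db}(t)} = -\frac{d}{dt}\,\ln S_{db}(t).
% Integrating from 0 to t with S_{db}(0) = 1 gives
S_{db}(t) \;=\; \exp\!\Big(-\int_0^t h_{db}(u)\,du\Big).
% Weibull case: h_{db}(t) = \lambda\gamma\, t^{\gamma-1}
% gives \int_0^t h_{db}(u)\,du = \lambda t^{\gamma}, i.e.
% S_{db}(t) = e^{-\lambda t^{\gamma}}, the form fitted on slide 21.
```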
20. Learn the hdb(t) of database change
- Cox proportional-hazards regression model:
  - ln( hdb(t) ) = ln( hbase(t) ) + β1·x1 + β2·x2 + ..., where each xi is a predictor variable
- Predictors:
  - the pre-specified threshold τ
  - the Web domain of db: .com / .edu / .gov / .org / others
    - 5 binary domain variables
  - ln( |db| )
  - avg KLdb(1 week) measured in the training period
21. Train the Cox model
- A stratified Cox model is applied:
  - the domain variables didn't satisfy the Cox proportional-hazards assumption
  - stratify on each domain, i.e., a separate hbase(t) / Sbase(t) per domain
- Training Sbase(t) for each domain:
  - assuming a Weibull distribution, Sbase(t) = e^(−λt^γ)
22. Training result
- γ ranges in (0.57, 1.08) ⇒ Sbase(t) is not an exponential distribution (exponential would require γ = 1)
[Figure: the fitted Sbase(t) curves plotted against t.]
23. Training result (cont'd)
- A larger db takes less time for KLdb(t) to exceed τ
- Databases that change faster during a short period are more likely to change later on
24. How to use the trained model?
- The model gives Sdb(t) ⇒ the likelihood that db has not changed much by time t
- An update policy to periodically resample each db
  - intuitively: maximize Σdb Sdb(t)
  - more precisely: maximize S = lim_{t→∞} (1/t)·∫₀ᵗ Σdb Sdb(t′) dt′
  - a policy is {fdb}, where fdb is the update frequency of db, e.g., 2/week
  - subject to practical constraints, e.g., a total update cap per week
25. Derive an optimal update policy
- Find the {fdb} that maximizes S under the constraint Σdb fdb ≤ F, where F is a global frequency limit
- Solvable by the Lagrange-multiplier method (a numeric sketch follows below)
- Sample results
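A minimal numeric sketch of the constrained problem, under the assumption that resampling resets Sdb to 1 and that we average survival over one update cycle; the Weibull parameters are illustrative, and the sketch solves numerically rather than via the closed-form Lagrange conditions:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

dbs = [(0.2, 0.8), (1.5, 0.8), (0.6, 1.0)]   # illustrative (lambda, gamma) per db
F = 4.0                                       # global budget: total updates/week

def avg_survival(f, lam, gam):
    # resampling every 1/f resets S to 1; average S over one cycle
    return f * quad(lambda t: np.exp(-lam * t**gam), 0, 1.0 / f)[0]

def neg_objective(fs):
    return -sum(avg_survival(f, lam, gam) for f, (lam, gam) in zip(fs, dbs))

res = minimize(neg_objective, x0=[F / len(dbs)] * len(dbs),
               bounds=[(1e-3, F)] * len(dbs),
               constraints=[{"type": "eq", "fun": lambda fs: sum(fs) - F}])
print(np.round(res.x, 2))  # faster-changing dbs receive higher update frequency
```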
26. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
27. Database selection
- Select the databases to which a given query should be issued
- Necessary when the Metasearch engine does not have an entire replica of each database; most likely it has content summaries only
- Reduces the query load on the entire system
- Formalization:
  - query q = <w1, ..., wm>, databases db1, ..., dbn
  - rank the databases according to their relevancy score r(dbi, q) for query q
28. Relevancy score
- # of matching docs in db
- Similarity between q and the top docs returned by db
  - typically vector-space similarity (dot-product) between q and a doc
  - sum / avg of the similarities of the top-k docs of each db, e.g., top-10
  - sum / avg of the similarities of the top docs of each db exceeding a similarity threshold
- Relevancy of db as judged by users
  - explicit relevance feedback
  - user click-behavior data
29. Estimating r(db,q)
- Typically, the true r(db, q) is unavailable
- Estimate r(db, q) based on C(db), or its approximation C'(db)
30. Estimating r(db,q), example 1 [GGT99]
- r(db, q) = # of matching docs in db
- Independence assumption:
  - query words w1, ..., wm appear independently in db
- r(db, q) ≈ |db| · Π_{j=1..m} ( df(db, wj) / |db| ) (sketch below)
  - df(db, wj): document frequency of wj in db; could be the approximate df from C'(db)
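The same arithmetic as the slide-5 sketch, now driven by df values from the content summary; names are illustrative:

```python
# r(db, q) ~= |db| * prod( df(db, w_j) / |db| ) under word independence.

def estimate_matches(db_size, dfs):
    r = db_size
    for df in dfs:
        r *= df / db_size
    return r

# df("applied") = 4000 and df("mathematics") = 2500 in a 10,000-doc db:
print(estimate_matches(10_000, [4000, 2500]))  # -> 1000.0
```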
31. Estimating r(db,q), example 2 [GGT99]
- r(db, q) = Σ_{d ∈ db, sim(d, q) > l} sim(d, q)
- d: a doc in db
- sim(d, q): vector dot-product between d and q
  - each word in d and q weighted with a common tf·idf weighting
- l: a pre-specified threshold
32. Estimating r(db,q), example 2 (cont'd)
- The content summary C(db) required:
  - df(db, w): doc frequency
  - v(db, w) = Σ_{d ∈ db} (weight of w in d's vector)
    - <v(db, w1), v(db, w2), ...> is the centroid of the entire db, viewed as a cluster of doc vectors
33. Estimating r(db,q), example 2 (cont'd)
- For l = 0: the sum of all q-doc similarity values of db
  - r(db, q) = Σ_{d ∈ db} sim(d, q)
  - r(db, q) = <v(q, w1), v(q, w2), ...> · <v(db, w1), v(db, w2), ...>
  - v(q, w): weight of w in the query vector
- What about l > 0?
34. Estimating r(db,q), example 2 (cont'd)
- Assume a uniform weight of w among all docs using w
  - i.e., the weight of w in any such doc is v(db, w) / df(db, w)
- Highly-correlated query words scenario:
  - if df(db, wi) < df(db, wj), every doc using wi also uses wj
  - sort the words in q s.t. df(db, w1) ≥ df(db, w2) ≥ ... ≥ df(db, wm)
  - r(db, q) = Σ_{i=1..p} v(q, wi)·v(db, wi) + df(db, wp) · Σ_{j=p+1..m} v(q, wj)·v(db, wj)/df(db, wj), where p is determined by some criteria [GGT99]
- Disjoint query words scenario:
  - no doc using wi also uses wj
  - r(db, q) = Σ_{i=1..m} [ df(db, wi) > 0 ∧ v(q, wi)·v(db, wi)/df(db, wi) > l ] · v(q, wi)·v(db, wi)
  (a sketch of the disjoint case follows below)
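A minimal sketch of the disjoint-words case, the easier of the two to verify: under the uniform-weight assumption every doc using wi has similarity v(q, wi)·v(db, wi)/df(db, wi), so all df(db, wi) of them clear the threshold l together or not at all. Inputs are illustrative:

```python
def r_disjoint(query_weights, db_weights, dfs, l):
    """Estimate r(db, q) for disjoint query words from summary statistics."""
    r = 0.0
    for vq, vdb, df in zip(query_weights, db_weights, dfs):
        if df > 0 and (vq * vdb) / df > l:   # per-doc similarity beats l
            r += vq * vdb                    # df docs, each (vq*vdb)/df
    return r

# Per-doc sims: 6/3 = 2.0 and 1/10 = 0.1; with l = 0.5 only the first word's
# 3 docs qualify, contributing 6.0 in total.
print(r_disjoint([2.0, 0.5], [3.0, 2.0], [3, 10], l=0.5))  # -> 6.0
```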
35. Estimating r(db,q), example 2 (cont'd)
- Database rankings based on the estimated r(db, q) were empirically evaluated in [GGT99]
36. A probabilistic model for errors in estimation [LLC04]
- Any estimation makes errors
- An error (observed) distribution for each db
  - the distribution of db1 ≠ the distribution of db2
- Definition of error: relative, so that the true r(db, q) equals the estimate scaled by (1 + err)
37. Modeling the errors: a motivating experiment
- dbPMC: PubMed Central, www.pubmedcentral.nih.gov
- Two query sets, Q1 and Q2 (healthcare related)
  - |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅
- Compute err(dbPMC, q) for each sample query q ∈ Q1 or Q2
  - the two observed error distributions look nearly identical
  - further verified through statistical tests (Pearson-χ²)
[Figure: the error probability distributions of err(dbPMC, q) over q ∈ Q1 and over q ∈ Q2, side by side.]
38. Implications of the experiment
- On a text database:
  - similar error behavior among sample queries
  - can sample a database and summarize the error behavior into an Error Distribution (ED)
  - use the ED to predict the error for a future, unseen query
- Sampling-size study [LLC04]:
  - a few hundred sample queries are good enough
39. From an Error Distribution (ED) to a Relevancy Distribution (RD)
[Figure: the ED for db1, obtained from sampling, puts probability 0.5, 0.4, and 0.1 on errors −50, 0, and +50 (%). An existing estimation method gives the point estimate r(db1, qnew) = 1000. Combining the two, by the definition of the error, yields the RD for r(db1, qnew): 500, 1000, and 1500 with probabilities 0.5, 0.4, and 0.1.]
40. RD-based selection
- Estimation-based: db1 > db2, since the point estimates are r(db1, qnew) = 1000 > r(db2, qnew) = 650
- RD-based: db1 < db2, with Pr(db1 < db2) = 0.85 (a comparison sketch follows below)
[Figure: db1's ED (probabilities 0.5, 0.4, 0.1 on errors −50, 0, +50) applied to its estimate 1000, and db2's ED (probabilities 0.9, 0.1 on errors 0, +100) applied to its estimate 650, give two RDs whose comparison yields Pr(r(db1, qnew) < r(db2, qnew)).]
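A minimal sketch of the RD-based comparison: apply each database's ED to its point estimate, then compute Pr(r1 < r2) over the discrete supports. The EDs below are illustrative, not necessarily the figure's exact ones:

```python
def rd(point_estimate, error_dist):
    """error_dist: {relative_error: probability} -> {relevancy: probability}."""
    return {point_estimate * (1 + e): p for e, p in error_dist.items()}

def prob_less(rd1, rd2):
    """Exact Pr(r1 < r2) assuming the two RDs are independent."""
    return sum(p1 * p2 for r1, p1 in rd1.items()
                       for r2, p2 in rd2.items() if r1 < r2)

rd1 = rd(1000, {-0.5: 0.5, 0.0: 0.4, 0.5: 0.1})
rd2 = rd(650,  {0.0: 0.9, 1.0: 0.1})
print(prob_less(rd1, rd2))  # db1's point estimate is larger, yet db2 may win
```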
41. Correctness metric
- Terminology:
  - DBk: the k databases returned by some method
  - DBtopk: the actual answer
- How correct is DBk compared to DBtopk?
  - absolute correctness: Cora(DBk) = 1 if DBk = DBtopk, and 0 otherwise
  - partial correctness: Corp(DBk) = |DBk ∩ DBtopk| / k, the fraction of overlap
  - Cora(DBk) = Corp(DBk) for k = 1
42. Effectiveness of RD-based selection
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the ED of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection
43. Probing to improve correctness
- RD-based selection (continuing the previous example, k = 1):
  - 0.85 = Pr(db2 > db1) = Pr(db2 = DBtop1)
  - E[Cora(db2)] = 1·Pr(db2 = DBtop1) + 0·Pr(db2 ≠ DBtop1) = 0.85
- Probe dbi: contact dbi to obtain its exact relevancy
- After probing db1:
  - E[Cora(db2)] = Pr(db2 > db1), now evaluated against db1's exact relevancy, so it moves toward 0 or 1
44. Computing the expected correctness
- Expected absolute correctness:
  - E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0) = Pr(Cora(DBk) = 1) = Pr(DBk = DBtopk)
- Expected partial correctness:
  - E[Corp(DBk)] = (1/k)·Σ_{db ∈ DBk} Pr(db ∈ DBtopk), by linearity from the overlap-fraction definition
45. Adaptive probing algorithm APro
- User-specified correctness threshold τ
- Maintain the RDs of the probed and unprobed databases db1, ..., dbn
- Loop: if some DBk has E[Cor(DBk)] ≥ τ, return this DBk; otherwise probe one more database, collapse its RD to its exact relevancy, and repeat
[Figure: flowchart of APro; databases db1, ..., dbi-1 have been probed, dbi, ..., dbn remain unprobed.]
46. Which database to probe?
- A greedy strategy
  - the stopping condition: E[Cor(DBk)] ≥ τ
  - once probed, which database leads to the highest E[Cor(DBk)]?
- Suppose we probe db3; its RD has possible outcomes ra, rb, rc:
  - if r(db3, q) = ra, max E[Cor(DBk)] = 0.85
  - if r(db3, q) = rb, max E[Cor(DBk)] = 0.8
  - if r(db3, q) = rc, max E[Cor(DBk)] = 0.9
- Probe the database that leads to the largest expected max E[Cor(DBk)] (a simplified sketch of the overall loop follows below)
[Figure: the RDs of db1, ..., db4; db3's RD has support {ra, rb, rc}.]
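A minimal sketch of the probe-until-confident loop for k = 1, with two simplifications: expected correctness is estimated by Monte Carlo over the RDs, and the greedy probe choice is approximated by probing the widest unresolved RD rather than by the full expected-max computation; all names are illustrative:

```python
import random

def expected_cor_top1(rds, trials=20_000):
    """rds: {db: {relevancy: prob}}. Pick the db with the highest mean
    relevancy; estimate Pr(it is truly top-1) by sampling all RDs jointly."""
    mean = lambda rd: sum(v * p for v, p in rd.items())
    pick = max(rds, key=lambda db: mean(rds[db]))
    wins = 0
    for _ in range(trials):
        draws = {db: random.choices(list(rd), weights=rd.values())[0]
                 for db, rd in rds.items()}
        wins += max(draws, key=draws.get) == pick
    return pick, wins / trials

def apro_top1(rds, probe, tau=0.9):
    """Probe databases until the pick is top-1 with probability >= tau.
    `probe(db)` is an assumed oracle returning the exact relevancy."""
    while True:
        pick, ecor = expected_cor_top1(rds)
        unresolved = [db for db in rds if len(rds[db]) > 1]
        if ecor >= tau or not unresolved:
            return pick, ecor
        # simplified greedy: probe the unresolved db with the widest RD
        target = max(unresolved, key=lambda db: max(rds[db]) - min(rds[db]))
        rds[target] = {probe(target): 1.0}   # RD collapses to the exact value
```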
47. Effectiveness of adaptive probing
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the RD of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection
[Figure: average Cora for k = 1, average Cora for k = 3, and average Corp for k = 3.]
48. The lazy TA problem
- The same problem, generalized and humanized:
  - after the final exam, the TA wants to find the top-scoring students
  - the TA is lazy and doesn't want to score all the exam sheets
- Input: every student's score follows a known distribution
  - observed from previous quizzes and mid-term exams
- Output: a scoring strategy
  - maximizes the correctness of the guessed top-k students
49. Further study of this problem [LSC05]
- Proves that greedy probing is optimal in special cases
- More interesting factors to be explored:
  - optimal probing strategy in general cases
  - non-uniform probing cost
  - time-variant distributions
50. Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
51. Summary
- Metasearch: a challenging problem
- Database content modeling
  - sampling, enhanced by proper application of Zipf's law and Heaps' law
  - content change modeled using Survival Analysis
- Database selection
  - estimation of database relevancy based on assumptions
  - a probabilistic framework that models the estimation error as a distribution
  - optimal probing strategy for a collection of distributions as input
52. References
- [CC01] J.P. Callan and M. Connell, "Query-Based Sampling of Text Databases," ACM Transactions on Information Systems, 19(2), 2001
- [GCM97] L. Gravano, C-C. K. Chang, H. Garcia-Molina, A. Paepcke, "STARTS: Stanford Proposal for Internet Meta-searching," in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 1997
- [GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, "GlOSS: Text-Source Discovery over the Internet," ACM Transactions on Database Systems, 24(2), 1999
- [GIG01] N. Green, P. Ipeirotis, L. Gravano, "SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching," in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001
- [Hea78] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978
- [IG02] P. Ipeirotis, L. Gravano, "Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection," in Proc. of the 28th VLDB Conf., 2002
53. References (cont'd)
- [INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, "Modeling and Managing Content Changes in Text Databases," in Proc. of the 21st IEEE Int'l Conf. on Data Engineering (ICDE), 2005
- [LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, "A Probabilistic Approach to Metasearching with Adaptive Probing," in Proc. of the 20th IEEE Int'l Conf. on Data Engineering (ICDE), 2004
- [LSC05] Z. Liu, K.C. Sia, J. Cho, "Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty," in Proc. of the ACM Annual Symposium on Applied Computing, 2005
- [NPC05] A. Ntoulas, P. Zerfos, J. Cho, "Downloading Hidden Web Content," in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005