VLDB'99 TUTORIAL Metasearch Engines: Solutions and Challenges

1
VLDB'99 TUTORIAL Metasearch Engines
Solutions and Challenges
  • Clement Yu, Dept. of EECS, U. of Illinois at
    Chicago, Chicago, IL 60607, yu_at_eecs.uic.edu
  • Weiyi Meng, Dept. of Computer Science, SUNY at
    Binghamton, Binghamton, NY 13902,
    meng_at_cs.binghamton.edu

2
The Problem
How am I going to find the 5 best pages on
Internet Security?
  (diagram: search engines 1..n, each searching its
  own text source 1..n)


3
Metasearch Engine Solution
  (diagram: user → user interface → query dispatcher
  and result merger → search engines 1..n → text
  sources 1..n; the query flows down and the results
  flow back up)
4
Some Observations
  • most sources are not useful for a given query
  • sending a query to a useless source would
  • incur unnecessary network traffic
  • waste local resources for evaluating the query
  • increase the cost of merging the results
  • retrieving too many documents from a source is
    inefficient

5
A More Efficient Metasearch Engine
  (diagram: as above, but a database selector and a
  document selector sit between the user interface
  and the query dispatcher / result merger)
6
Tutorial Outline
  • 1. Introduction to Text Retrieval
  • consider only Vector Space Model
  • 2. Search Engines on the Web
  • 3. Introduction to Metasearch Engine
  • 4. Database Selection
  • 5. Document Selection
  • 6. Result Merging
  • 7. New Challenges

7
Introduction to Text Retrieval (1)
  • Document representation
  • remove stopwords: of, the, ...
  • stemming: stemming → stem
  • d = (d1 , ..., di , ..., dn)
  • di = weight of the ith term in d
  • tf*idf formula for computing di (see the sketch
    below)
  • Example: consider term t of document d in a
    database of N documents.
  • tf weight of t in d: 0.5 + 0.5*tf/max_tf if
    tf > 0 (0 otherwise)
  • idf weight of t: log(N/df)
  • weight of t in d: (0.5 + 0.5*tf/max_tf) * log(N/df)
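A minimal Python sketch of the weighting formula above; the function name and signature are illustrative rather than taken from the tutorial:

    import math

    def term_weight(tf, max_tf, df, N):
        # tf: frequency of the term in the document
        # max_tf: largest term frequency in the document
        # df: number of documents containing the term
        # N: total number of documents in the database
        if tf <= 0 or df <= 0:
            return 0.0
        tf_part = 0.5 + 0.5 * tf / max_tf    # augmented tf component
        idf_part = math.log(N / df)          # idf component
        return tf_part * idf_part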

8
Introduction to Text Retrieval (2)
  • Query representation
  • q = (q1 , ..., qi , ..., qn)
  • qi = weight of the ith term in q
  • compute qi using the tf weight only
  • alternative: use the idf weight for query terms,
    not document terms
  • query expansion (e.g., add related terms)

9
Introduction to Text Retrieval (3)
  • Similarity Functions
  • simple dot product: favors long documents
  • Cosine function (see the sketch below)
  • other similarity functions exist
  • normalized similarities in [0, 1.0]

  (figure: angle between the query vector q and the
  document vector d)
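A short Python sketch contrasting the two similarity functions; vectors are plain lists of term weights and the function names are mine:

    import math

    def dot_product(q, d):
        # simple dot product; longer documents tend to score higher
        return sum(qi * di for qi, di in zip(q, d))

    def cosine(q, d):
        # dot product divided by the vector lengths, so the result
        # falls in [0, 1] for non-negative weights
        norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
        return dot_product(q, d) / norm if norm else 0.0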
10
Introduction to Text Retrieval (4)
  • Retrieval Effectiveness
  • relevant documents: documents useful to the user
    who issued the query
  • recall: percentage of the relevant documents that
    are retrieved
  • precision: percentage of the retrieved documents
    that are relevant (sketch below)

  (figure: precision and recall)
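A tiny Python sketch of the two measures over sets of document ids (the names are illustrative):

    def precision_recall(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)   # relevant documents retrieved
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall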
11
Search Engines on the Web (1)
  • Search engine as a document retrieval system
  • no control over which web pages can be searched
  • web pages have rich structures and semantics
  • web pages are extensively linked
  • additional information for each page (time last
    modified, organization publishing it, etc.)
  • databases are dynamic and can be very large
  • few general-purpose search engines and numerous
    special-purpose search engines

12
Search Engines on the Web (2)
  • New indexing techniques
  • partial-text indexing to improve scalability
  • ignore and/or discount spamming terms
  • use anchor terms to index linked pages
  • e.g. WWWW McBr94, Google BrPa98,
  • Webor CSM97

  (figure: Page 1 contains the anchor text "airplane
  ticket and hotel" linking to Page 2,
  http://travelocity.com/; the anchor terms can be
  used to index Page 2)
13
Search Engines on the Web (3)
  • New term weighting schemes
  • higher weights to terms enclosed by special tags
  • title (SIBRIS WaWJ89, Altavista, HotBot, Yahoo)
  • special fonts (Google BrPa98)
  • special fonts tags (LASER BoFJ96)
  • Webor CSM97 approach
  • partition tags into disjoint classes (title,
    header, strong, anchor, list, plain text)
  • assign different importance factors to terms in
    different classes
  • determine optimal importance factors

14
Search Engines on the Web (4)
  • New document ranking methods
  • Vector Spreading Activation YuLe96
  • add a fraction of the parents' similarities
    (sketch below)
  • Example: suppose for query q
  • sim(q, d1) = 0.4, sim(q, d2) = 0.2, sim(q, d3) = 0.2
  • final score of d3 = 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26

  (figure: pages d1 and d2 both link to page d3)
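A small Python sketch of spreading activation with the fraction 0.1 used in the example; the function name and dictionaries are my own illustration:

    def spread_activation(sim, parents, alpha=0.1):
        # sim: dict page -> similarity with the query
        # parents: dict page -> list of pages that link to it
        # add a fraction alpha of the parents' similarities
        return {p: s + alpha * sum(sim.get(x, 0.0) for x in parents.get(p, []))
                for p, s in sim.items()}

    scores = spread_activation({"d1": 0.4, "d2": 0.2, "d3": 0.2},
                               {"d3": ["d1", "d2"]})
    # scores["d3"] -> 0.26, matching the example above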
15
Search Engines on the Web (5)
  • New document ranking methods
  • combine similarity with rank
  • PageRank PaBr98 an important page is linked to
    by many pages and/or by important pages
  • combine similarity with authority score
  • authority Klei98 an important content page is
    highly linked to among initially retrieved pages
    and their neighbors

16
Introduction to Metasearch Engine (1)
  • An Example
  • Query: Internet Security
  • Databases: NYT DB, WP DB, ...
  • Retrieved results: t1, t2, ... from NYT; p1, p2,
    ... from WP
  • Merged results: p1, t1, ...

17
Introduction to Metasearch Engine (2)
  • Database Selection Problem
  • Select potentially useful databases for a given
    query
  • essential if the number of local databases is
    large
  • reduce network traffic
  • avoid wasting local resources

18
Introduction to Metasearch Engine (3)
  • A potentially useful database contains potentially
    useful documents
  • Potentially useful documents have
  • global similarity above a threshold, or
  • global similarity among the m highest
  • Need some knowledge about each database in
    advance in order to perform database selection
  • Database Representative

19
Introduction to Metasearch Engine (4)
  • Document Selection Problem
  • Select potentially useful documents from each
    selected local database efficiently
  • Step 1 Retrieve all potentially useful documents
    while minimizing the retrieval of useless
    documents
  • from a global similarity threshold to the tightest
    local similarity threshold
  • want all d with Gsim(q, d) > GT
  • retrieve d from DBk if Lsim(q, d) > LTk
  • LTk is the largest threshold such that
    Gsim(q, d) > GT implies Lsim(q, d) > LTk
20
Introduction to Metasearch Engine (5)
  • Efficient Document Selection
  • Step 2 Transmit all potentially useful documents
    to result merger while minimizing the
    transmission of useless documents
  • further filtering to reduce transmission cost and
    merge cost
  • Example
  (figure: retrieve d1, ..., ds from local database
  DBk, filter them, and transmit only d2, d7, d10 to
  the result merger)
21
Introduction to Metasearch Engine (6)
  • Result Merging Problem
  • Objective: merge the documents returned from
    multiple sources into a single ranked list.
  • Difficulty: local document similarities may be
    incomparable or not available.
  • Solution: generate "global" similarities for
    ranking.

  (figure: DB1 returns d11, d12, ...; DBN returns
  dN1, dN2, ...; the merger produces a single ranked
  list d12, d54, ...)
22
Introduction to Metasearch Engine (7)
  • An Ideal Metasearch Engine
  • Retrieval effectiveness: the same as if all
    documents were in a single collection.
  • Efficiency: optimize the retrieval process
  • Implications: should aim at
  • selecting only useful search engines
  • retrieving and transmitting only useful documents
  • ranking documents according to their degrees of
    relevance

23
Introduction to Metasearch Engine (8)
  • Main Sources of Difficulties MYL99
  • autonomy of local search engines
  • design autonomy
  • maintenance autonomy
  • heterogeneities among local search engines
  • indexing method
  • document/query term weighting schemes
  • similarity/ranking function
  • document database
  • document version
  • result presentation

24
Introduction to Metasearch Engine (9)
  • Impact of Autonomy and Heterogeneities MLY99
  • unwilling to provide database representatives or
    provide different types of representatives
  • difficult to find potentially useful documents
  • difficult to merge documents from multiple sources

25
Database Selection Basic Idea
  • Goal Identify potentially useful databases for
    each user query.
  • General approach
  • use representative to indicate approximately the
    content of each database
  • use these representatives to select databases for
    each query
  • Diversity of solutions
  • different types of representatives
  • different algorithms using the representatives

26
Solution Classification
  • Naive Approach
  • select all databases (e.g. MetaCrawler,
    NCSTRL)
  • Qualitative Approaches estimate the quality of
    each local database
  • based on rough representatives
  • based on detailed representatives
  • Quantitative Approaches estimate quantities that
    measure the quality of each local database more
    directly and explicitly
  • Learning-based Approaches database
    representatives are obtained through training or
    learning

27
Qualitative Approaches Using Rough Representatives
  • typical representative
  • a few words or a few paragraphs in certain format
  • manual construction often needed
  • can work well for special-purpose local search
    engines
  • very scalable storage requirement
  • selection can be inaccurate as the description is
    too rough

28
Qualitative Approaches Using Rough Representatives
  • Example 1: ALIWEB Kost94
  • Representative has a fixed format (example: a site
    containing files for the Perl language)
  • Template-Type: DOCUMENT
  • Title: Perl
  • Description: Information on the Perl Programming
    Language. Includes a local Hypertext Perl Manual,
    and the latest FAQ in Hypertext.
  • Keywords: perl, perl-faq, language
  • user query can match against one or more fields

29
Qualitative Approaches Using Rough Representatives
  • Example 2: NetSerf ChHa95
  • Representative has a WordNet-based structure
    (example: a site for world facts listed by country)
  • topic: country
  • synset: nation, nationality, land, country,
    a_people
  • synset: state, nation, country, land,
    commonwealth, res_publica, body_politic
  • synset: country, state, land, nation
  • info-type: facts
  • user query is transformed into a similar structure
    before matching

30
Qualitative Approaches Using Detailed
Representatives
  • Use detailed statistical information for each
    term
  • employ special measures to estimate the
    usefulness/quality of each search engine for each
    query
  • the measures reflect the usefulness in a less
    direct/explicit way compared to those used in
    quantitative approaches.
  • scalability starts to become an issue

31
Qualitative Approaches Using Detailed
Representatives
  • Example 1: gGlOSS GrGa95
  • representative for term ti: (dfi , Wi)
  • dfi -- document frequency of ti
  • Wi -- the sum of the weights of ti over all
    documents
  • database usefulness: sum of the high similarities
  • usefulness(q, D, T) = sum of sim(q, d) over the
    documents d in D with sim(q, d) ≥ T

32
gGlOSS (continued)
  • Suppose for query q we have the following
    similarities (computation sketched below)
  • D1: d11 = 0.6, d12 = 0.5
  • D2: d21 = 0.3, d22 = 0.3, d23 = 0.2
  • D3: d31 = 0.7, d32 = 0.1, d33 = 0.1
  • usefulness(q, D1, 0.3) = 1.1
  • usefulness(q, D2, 0.3) = 0.6
  • usefulness(q, D3, 0.3) = 0.7
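A minimal Python sketch of the usefulness measure applied to this example; the example treats the threshold as inclusive, and the function name is mine:

    def usefulness(sims, T):
        # sum of the (estimated) similarities at or above the threshold T
        return sum(s for s in sims if s >= T)

    print(usefulness([0.6, 0.5], 0.3))        # D1 -> 1.1
    print(usefulness([0.3, 0.3, 0.2], 0.3))   # D2 -> 0.6
    print(usefulness([0.7, 0.1, 0.1], 0.3))   # D3 -> 0.7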

33
gGlOSS (continued)
  • gGlOSS usefulness is estimated for two cases
  • high-correlation case: if dfi ≤ dfj , then every
    document having ti also has tj .
  • Example: consider q = (1, 1, 1) with df1 = 2,
    df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.

              actual weights        assumed weights
              t1    t2    t3        t1    t2    t3
      d1      0.2   0.1   0.3       0.3   0.2   0.3
      d2      0.4   0.3   0.2       0.3   0.2   0.3
      d3      0     0.2   0.4       0     0.2   0.3
      d4      0     0     0.3       0     0     0.3

  • usefulness(q, D, 0.5) = W1 + W2 + df2*W3/df3
    = 0.6 + 0.6 + 3*1.2/4 = 2.1

34
gGlOSS (continued)
  • disjoint case: for any two query terms ti and tj ,
    no document contains both ti and tj .
  • Example: consider q = (1, 1, 1) with df1 = 2,
    df2 = 1, df3 = 1, W1 = 0.5, W2 = 0.2 and W3 = 0.4.

              actual weights        assumed weights
              t1    t2    t3        t1    t2    t3
      d1      0.2   0     0         0.25  0     0
      d2      0     0.2   0         0     0.2   0
      d3      0.3   0     0         0.25  0     0
      d4      0     0     0.4       0     0     0.4

  • usefulness(q, D, 0.3) = W3 = 0.4

35
gGlOSS (continued)
  • Some observations
  • usefulness dependent on threshold
  • representative has two quantities per term
  • strong assumptions are used
  • high-correlation tends to overestimate
  • disjoint tends to underestimate
  • the two estimates tend to form bounds for the sum
    of the similarities ≥ T

36
Qualitative Approaches Using Detailed
Representatives
  • Example 2: CORI Net CaLC95
  • representative: (dfi , cfi ) for term ti
  • dfi -- document frequency of ti
  • cfi -- collection frequency of ti
  • cfi can be shared by all databases
  • database usefulness
  • usefulness(q, D) = sim(q, representative of D)
  • analogy with document ranking:
    usefulness ↔ similarity, dfi ↔ tfi, cfi ↔ dfi
37
CORI Net (continued)
  • Some observations
  • estimates independent of threshold
  • representative has less than two quantities per
    term
  • similarity is computed based on inference network
  • same method for ranking documents and ranking
    databases

38
Qualitative Approaches Using Detailed
Representatives
  • Example 3: D-WISE YuLe97
  • representative: dfi,j for term tj in database Di
  • database usefulness: a measure of query-term
    concentration in the different databases
  • usefulness(q, Di) = sum over the k query terms tj
    of CVVj * dfi,j
  • k = number of query terms
  • CVVj = cue validity variance of term tj across all
    databases; a larger CVVj means tj is more useful
    in distinguishing different databases

39
D-WISE (continued)
N = number of databases
ni = number of documents in database Di
  • ACVj = average cue validity of tj over all
    databases
  • Observations
  • estimates independent of threshold
  • representative has one quantity per term
  • the measure is difficult to understand
40
Quantitative Approaches
  • Two types of quantities may be estimated wrt
    query q
  • the number of documents in a database D with
    similarities higher than a threshold T
  • NoDoc(q, D, T) = |{ d : d ∈ D and sim(q, d) > T }|
  • the global similarity of the most similar
    document in D
  • msim(q, D) = max over d in D of sim(q, d)
  • can be used to rank databases in descending order
    of similarity (or any desirability measure)

41
Estimating NoDoc(q, D, T)
  • Basic Approach MLYW98
  • representative: (pi , wi ) for term ti
  • pi = probability that ti appears in a document
  • wi = average weight of ti among the documents
    containing ti
  • Example: the normalized weights of ti in 10
    documents are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4,
    0.6, 0.6).
  • pi = 0.6, wi = 0.4

42
Estimating NoDoc(q, D, T)
  • Basic Approach (continued)
  • Example: consider query q = (1, 1).
  • Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.
  • The generating function (sketched in code below):
  • (0.2 X^2 + 0.8) (0.4 X + 0.6)
    = 0.08 X^3 + 0.12 X^2 + 0.32 X + 0.48
  • in a term a X^b, a is the probability that a
    document in D has similarity b with q
  • NoDoc(q, D, 1) = 10 * (0.08 + 0.12) = 2
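A Python sketch of the generating-function computation; it represents the polynomial as a dict from exponent to coefficient, and the names are my own:

    from collections import defaultdict

    def generating_function(terms):
        # terms: list of (p_i, exponent) pairs, one per query term,
        # multiplying the factors (p_i * X^exponent + (1 - p_i))
        poly = {0: 1.0}
        for p, exp in terms:
            new = defaultdict(float)
            for e, c in poly.items():
                new[e + exp] += c * p       # the term occurs in the document
                new[e] += c * (1.0 - p)     # the term does not occur
            poly = dict(new)
        return poly

    def NoDoc(poly, n_docs, T):
        # estimated number of documents with similarity strictly above T
        return n_docs * sum(c for e, c in poly.items() if e > T)

    # the example on this slide: p1=0.2, w1=2, p2=0.4, w2=1, 10 documents
    poly = generating_function([(0.2, 2), (0.4, 1)])
    # poly -> {3: 0.08, 2: 0.12, 1: 0.32, 0: 0.48}
    print(NoDoc(poly, 10, 1))   # about 2 (= 10 * (0.08 + 0.12))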

43
Estimating NoDoc(q, D, T)
  • Basic Approach (continued)
  • Consider query q = (q1, ..., qr).
  • Proposition. If the terms are independent and the
    weight of term ti whenever present in a document
    is wi (the average weight), 1 ≤ i ≤ r, then the
    coefficient of X^s in the generating function
    (p1 X^(q1*w1) + 1 - p1) ... (pr X^(qr*wr) + 1 - pr)
    is the probability that a document in D has
    similarity s with q.

44
Estimating NoDoc(q, D, T)
  • Subrange-based Approach MLYW99
  • overcome the uniform term weight assumption
  • additional information for term ti
  • σi = standard deviation of the weights of ti in
    all documents
  • mnwi = maximum normalized weight of ti

45
Estimating NoDoc(q, D, T)
  • Example: the weights of term ti are 4, 4, 1, 1, 1,
    1, 0, 0, 0, 0
  • generating function (factor) using the average
    weight
  • 0.6 X^2 + 0.4
  • a more accurate factor using subranges of weights
  • 0.2 X^4 + 0.4 X + 0.4
  • In general, the weights are partitioned into k
    subranges
  • pi1 X^mi1 + ... + pik X^mik + (1 - pi)
  • the probability pij and the median mij can be
    estimated using σi and the average weight of ti .
  • A special implementation: use the maximum
    normalized weight as the first subrange by itself.

46
Estimating NoDoc(q, D, T)
  • Combined-term Approach LYMW99
  • relieve the term independence assumption
  • Example: consider the query "Chinese medicine".
  • Suppose the generating functions are
  • Chinese: 0.1 X^3 + 0.3 X + 0.6
  • medicine: 0.2 X^2 + 0.4 X + 0.4
  • Chinese medicine (under term independence):
    0.02 X^5 + 0.04 X^4 + 0.1 X^3 + ...
  • Chinese medicine (as a combined term):
    0.05 X^w + ...

47
Estimating NoDoc(q, D, T)
  • Criteria for combining Chinese and medicine
  • The maximum normalized weight of the combined term
    is higher than the maximum normalized weight of
    each of the two individual terms (w > 3)
  • The sum of the estimated probabilities of the
    terms with exponents ≥ w under the term
    independence assumption is very different from
    1/N, where N is the number of documents in the
    database
  • They are adjacent terms in previous queries.

48
Database Selection Using msim(q,D)
  • Optimal Ranking of Databases YLWM99b
  • User for query q, find the m most similar
    documents or with the m largest degrees of
    relevance
  • Definition Databases D1, D2, , Dp are
    optimally ranked with respect to q if there
    exists a k such that each of the databases D1, ,
    Dk contains one of the m most similar documents,
    and all of these m documents are contained in
    these k databases.

49
Database Selection Using msim(q,D)
  • Optimal Ranking of Databases
  • Example: for a given query q
  • D1: d1 = 0.8, d2 = 0.5, d3 = 0.2, ...
  • D2: d9 = 0.7, d2 = 0.6, d10 = 0.4, ...
  • D3: d8 = 0.9, d12 = 0.3, ...
  • the other databases have only documents with small
    similarities
  • When m = 5, pick D1, D2, D3

50
Database Selection Using msim(q,D)
  • Proposition: Databases D1, D2, ..., Dp are
    optimally ranked with respect to a query q if and
    only if
  • msim(q, Di) ≥ msim(q, Dj), for i < j
  • Example: msim(q, D1) = 0.8 (d1),
    msim(q, D2) = 0.7 (d9), msim(q, D3) = 0.9 (d8)
  • Optimal rank: D3, D1, D2, ...

51
Estimating msim(q, D)
  • Use subrange-based or combined-term method.
  • Example: suppose there are 100 documents in a
    database.
  • For query q, the generating function is
  • 0.002 X^4 + 0.009 X^3 + ...
  • Since 100 * (0.002 + 0.009) ≈ 1, the global
    similarity of the most similar document is
    estimated to be 3.
  • Weakness of this approach
  • require large storage for database representative
  • exponential computation complexity

52
Estimating msim(q, D)
  • A more efficient method
  • global database representative: the global dfi of
    each term ti
  • local database representative
  • anwi = average normalized weight of ti
  • mnwi = maximum normalized weight of ti
  • Example: term ti has weights d1 = 0.3, d2 = 0.4,
    d3 = 0, d4 = 0.7
  • anwi = (0.3 + 0.4 + 0 + 0.7)/4 = 0.35
  • mnwi = 0.7

53
Estimating msim(q, D)
  • A more efficient method (continued)
  • term weighting scheme
  • query term: tf * gidf
  • document term: tf
  • query q = (q1, q2)
  • msim(q, D) =
    max { q1*gidf1*mnw1 + q2*gidf2*anw2 ,
          q2*gidf2*mnw2 + q1*gidf1*anw1 }
  • linear computation complexity (sketch below)
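A Python sketch of this estimate for a query with any number of terms; dictionaries and names are my own illustration, and one query term is assumed to take its maximum normalized weight in the best document while the others take their average:

    def estimate_msim(query, reps, gidf):
        # query: dict term -> query tf weight
        # reps:  dict term -> (anw, mnw) = average / maximum normalized weight
        # gidf:  dict term -> global idf weight
        contrib = {}
        for t, q in query.items():
            anw, mnw = reps.get(t, (0.0, 0.0))
            g = gidf.get(t, 0.0)
            contrib[t] = (q * g * anw, q * g * mnw)
        base = sum(a for a, _ in contrib.values())      # every term at its anw
        # swap one term from anw to mnw and keep the best choice (linear in k)
        return max((base - a + m for a, m in contrib.values()), default=0.0)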

54
Estimating msim(q, D)
  • Combine terms to improve estimation accuracy
  • Restrictions for combining terms ti and tj into
    tij
  • ti and tj are adjacent query terms
  • mnwij > max { mnwi + anwj , mnwj + anwi }
  • Given a query having ti , tj and tk in this order,
    decide which terms to combine, if any.
  • Combine ti and tj if
  • mnwij > max { mnwi + anwj , mnwj + anwi }
  • and mnwij - max { mnwi + anwj , mnwj + anwi }
    > mnwkj - max { mnwk + anwj , mnwj + anwk }

55
Learning-based Approaches
  • Use past retrieval experiences to determine
    usefulness
  • Assume no or little global database or local
    database statistics
  • Static learning: learning based on static training
    queries
  • Dynamic learning: learning based on evaluated user
    queries
  • Combined learning: knowledge learned from training
    queries is adjusted based on user queries

56
Static Learning
  • Example MRDD (Modeling Relevant Document
    Distribution) VoGJ95
  • record the result of each training query for each
    local database
  • <r1, ..., rs> : ri indicates the minimum number of
    top-ranked documents that must be retrieved to
    obtain i relevant documents
  • <2, 5, ...> : need to retrieve 2 documents to
    obtain 1 relevant document

57
MRDD (continued)
  • For a new query
  • identify the k most similar training queries
  • obtain the average distribution vector from the k
    training queries for each database
  • use these vectors to determine databases to
    search and documents to retrieve to maximize
    precision
  • Example: suppose for query q, three average
    distribution vectors are obtained
  • D1: <1, 4, 6, 7, 10, 12, 17>
  • D2: <1, 5, 7, 9, 15, 20>
  • D3: <2, 3, 6, 9, 11, 16>
  • To retrieve two relevant documents, select D1 and
    D2 (sketch below).
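A greedy Python sketch of using the distribution vectors to decide where to retrieve from; this is my own heuristic reading of MRDD (not necessarily the exact optimization in VoGJ95), but it reproduces the example above:

    def mrdd_plan(distributions, n_relevant):
        # distributions: dict db -> vector v, where v[i-1] is the number of
        #                top-ranked documents needed to obtain i relevant ones
        # greedily take the next relevant document from whichever database
        # needs the fewest additional retrieved documents for it
        got = {db: 0 for db in distributions}        # relevant documents so far
        retrieve = {db: 0 for db in distributions}   # documents to retrieve so far
        for _ in range(n_relevant):
            best_db, best_cost = None, None
            for db, v in distributions.items():
                if got[db] < len(v):
                    cost = v[got[db]] - retrieve[db]
                    if best_cost is None or cost < best_cost:
                        best_db, best_cost = db, cost
            if best_db is None:
                break
            got[best_db] += 1
            retrieve[best_db] = distributions[best_db][got[best_db] - 1]
        return {db: k for db, k in retrieve.items() if k > 0}

    plan = mrdd_plan({"D1": [1, 4, 6, 7, 10, 12, 17],
                      "D2": [1, 5, 7, 9, 15, 20],
                      "D3": [2, 3, 6, 9, 11, 16]}, 2)
    # plan -> {"D1": 1, "D2": 1}: one document from each of D1 and D2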

58
Dynamic Learning
  • Example SavvySearch DrHo97
  • database representative: a weight wi and cfi for
    each term ti , plus two penalty values ph and pr
    for each database D.
  • wi indicates how well D responds to query term ti
  • cfi = number of databases containing ti
  • ph : penalty if the average number of hits h
    returned for the most recent five queries < Th
  • ph = (Th - h)^2 / Th^2
  • pr : penalty if the average response time r for
    the most recent five queries > Tr
  • pr = (r - Tr)^2 / (45 - Tr)^2

59
SavvySearch (continued)
  • Update of wi
  • initially zero
  • reduce by 1/k if no document is retrieved for a
    k-term query containing ti
  • increase by 1/k if some returned document is read
  • Compute the ranking score of database D for query
    q = (t1, ..., tk) from the wi , the cfi and the
    penalties ph and pr (weight-update sketch below)
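A small Python sketch of the wi update rule described above; the exact bookkeeping in SavvySearch may differ, and the function name and arguments are mine:

    def update_weights(w, query_terms, num_hits, clicked):
        # w: dict term -> weight for one database (missing terms start at 0)
        # query_terms: the k terms of the query
        # num_hits: number of documents the database returned
        # clicked: True if the user read some returned document
        k = len(query_terms)
        if k == 0:
            return w
        if num_hits == 0:
            delta = -1.0 / k        # penalize: nothing retrieved
        elif clicked:
            delta = 1.0 / k         # reward: a returned document was read
        else:
            delta = 0.0
        for t in query_terms:
            w[t] = w.get(t, 0.0) + delta
        return w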

60
Combined Learning
  • Example ProFusion FaGa99
  • Phase 1 Static Learning
  • 13 categories/concepts are utilized
  • training queries in each category are selected
  • relevance assessment for each query is used to
    compute the average score of each local database
    with respect to each category
      category     D1      D2      . . .   Dn
      C1           0.3     0.1     . . .   0.2
      . . .        . . .   . . .   . . .   . . .
      C13          0       0.4     . . .   0.1

61
ProFusion (continued)
  • Phase 2 Database Selection and Dynamic Learning
  • Each user query is mapped to one or more
    categories
  • Databases are selected based on accumulated
    scores over involved categories
  • Example: suppose query q is mapped to C1, C4, C5
    (sketch below)

      category      D1     D2     D3     D4
      C1            0.2    0      0.1    0.3
      C4            0.1    0.2    0      0
      C5            0      0.4    0.3    0.2
      total score   0.3    0.6    0.4    0.5
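A Python sketch of accumulating the category scores for this example; the number of databases selected (top_k) is my own assumption, not stated on the slide:

    def select_databases(category_scores, query_categories, top_k=2):
        # category_scores: dict category -> dict database -> score
        # sum each database's scores over the categories the query maps to
        totals = {}
        for c in query_categories:
            for db, s in category_scores.get(c, {}).items():
                totals[db] = totals.get(db, 0.0) + s
        return sorted(totals, key=totals.get, reverse=True)[:top_k]

    scores = {"C1": {"D1": 0.2, "D2": 0.0, "D3": 0.1, "D4": 0.3},
              "C4": {"D1": 0.1, "D2": 0.2, "D3": 0.0, "D4": 0.0},
              "C5": {"D1": 0.0, "D2": 0.4, "D3": 0.3, "D4": 0.2}}
    print(select_databases(scores, ["C1", "C4", "C5"]))  # ['D2', 'D4'] (0.6, 0.5)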

62
ProFusion (continued)
  • Each retrieved document from all selected
    databases is re-ranked based on the product of
    local similarity of the document and the score of
    the database.
  • if the first document clicked by the user is not
    the top-ranked one
  • increase the score of the database that produced
    the document in related categories
  • decrease the score of other searched databases in
    related categories

63
Other Database Selection Techniques
  • incorporating ranks YMLW99a
  • query expansion XuCa98
  • use of lightweight queries HaTh99
  • shorter
  • not evaluated like regular queries
  • use of representative hierarchies YMLW99b

64
Document Selection
  • Goal Select all globally most similar documents
    from a selected local search engine while
    minimizing the retrieval of useless documents.
  • General approaches
  • determine the number k of documents to retrieve
    from a local search engine and then retrieve the
    k documents with the largest local similarities
    from the search engine
  • determine a local threshold for the local
    database and retrieve documents whose local
    similarities exceed the threshold
  • The two approaches are equivalent.

65
Solution Classification
  • Local Determination
  • all locally retrieved documents will be returned
  • Examples NCSTRL, Search Broker MaBi97
  • User Determination
  • global user determines how many documents should
    be retrieved from each local database
  • neither effective nor practical when the number
    of databases is large.
  • Examples MetaCrawler SeEt97
  • SavvySearch DrHo97

66
Solution Classification (continued)
  • Weighted Allocation
  • retrieve proportionally more documents from local
    databases that are ranked higher
  • Learning-based Approaches
  • use past retrieval experience for selection
  • Guaranteed Retrieval
  • aimed at guaranteeing the retrieval of globally
    most similar documents

67
Weighted Allocation
  • Suppose m documents are to be retrieved from N
    local databases.
  • Example 1: CORI net CaLC95
  • Retrieve m * 2(1 + N - i) / (N (N + 1)) documents
    from the ith ranked local database.
  • Example 2: D-WISE YuLe97
  • Let ri be the ranking score of local database Di .
  • Retrieve m * ri / (r1 + ... + rN) documents from
    Di .
  • When retrieving k documents from local database
    Di , the k documents with the largest local
    similarities are retrieved (sketch below).
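A short Python sketch of the two allocation formulas; for D-WISE the allocation is assumed to be proportional to the ranking scores (normalized by their sum), and in practice the quotas would be rounded to integers:

    def cori_allocation(m, N):
        # documents to request from the i-th ranked database, i = 1..N
        return [m * 2 * (1 + N - i) / (N * (N + 1)) for i in range(1, N + 1)]

    def dwise_allocation(m, scores):
        # allocation proportional to the database ranking scores
        total = sum(scores)
        return [m * r / total for r in scores]

    print(cori_allocation(10, 4))                   # [4.0, 3.0, 2.0, 1.0]
    print(dwise_allocation(10, [0.5, 0.25, 0.25]))  # [5.0, 2.5, 2.5]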

68
Learning-based Approaches
  • determine the number of documents to retrieve
    from a local database based on past retrieval
    experiences with the local database.
  • Example MRDD VoGJ95
  • For query q, three average distribution vectors
    are obtained
  • D1: <1, 4, 6, 7, 10, 12, 17>
  • D2: <1, 5, 7, 9, 15, 20>
  • D3: <2, 3, 6, 9, 11, 16>
  • To retrieve four relevant documents: retrieve 1
    document from D1, 1 from D2 and 3 from D3.

69
Guaranteed Retrieval
  • Aim at
  • guaranteeing that all potentially useful
    documents with respect to a query be retrieved
  • minimizing the retrieval of useless documents
  • Two cases
  • case 1 a global similarity threshold is known
  • case 2 the number of globally desired
    documents is known
  • The two cases are mutually translatable.

70
Case 1 Global Similarity Threshold
GT Is Known
  • find all documents whose global similarities ≥ GT
  • Technique 1: Query modification MLYW98
  • Modify q to q' such that Gsim(q, d) = Lsim(q', d)
  • find all documents whose local similarities with
    q' are ≥ GT
  • Example: q = (q1, q2), d = (d1, d2)
  • Gsim(q, d) = gidf1*q1*d1 + gidf2*q2*d2
  • Lsim(q, d) = lidf1*q1*d1 + lidf2*q2*d2
  • q' = ( (gidf1/lidf1)*q1 , (gidf2/lidf2)*q2 )
  • Lsim(q', d) = lidf1*(gidf1/lidf1)*q1*d1
    + lidf2*(gidf2/lidf2)*q2*d2 = Gsim(q, d)
    (sketch below)
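A one-function Python sketch of this query modification for the dot-product-style similarities on this slide; the dictionaries and the function name are illustrative:

    def modify_query(query, gidf, lidf):
        # scale each query term weight by gidf/lidf so that the local
        # similarity of the modified query equals the global similarity
        # of the original query
        return {t: q * gidf[t] / lidf[t]
                for t, q in query.items() if lidf.get(t)}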

71
Case 1 Global Similarity Threshold
GT Is Known
  • Technique 2: find the largest local threshold LT
    such that
  • Gsim(q, d) ≥ GT implies Lsim(q, d) ≥ LT
    MLYW98
  • retrieve all d with Lsim(q, d) ≥ LT to form set S
  • transmit d from S if Gsim(q, d) ≥ GT
  • Example:     Gsim(q, d)    Lsim(q, d)
  •   d1         0.8           0.7
  •   d2         0.75          0.35
  •   ....
  •   d3         0.4           0.6
  • If d2 is desired, then LT can be no higher than
    0.35.
  • If GT = 0.6, d3 will not be transmitted.
  • Transmit ≤ m documents from each local database.

72
Case 1 Global Similarity Threshold
GT Is Known
  • Define the tightest local threshold
  • LT = min over d of { Lsim(q, d) : Gsim(q, d) ≥ GT }
  • Determining LT
  • if both Gsim and Lsim are linear functions, apply
    linear programming
  • otherwise, try the Lagrange multiplier method.

73
Case 1 Global Similarity Threshold
GT Is Known
  • Example: Gsim(q, d) = Cosine(qG , d)
  •          Lsim(q, d) = Cosine(qL , d)
  • LT = min over d of { Cosine(qL , d) :
    Cosine(qG , d) ≥ GT }
  • LT = Cosine(θ + θ1) when qG , qL and d are in the
    same plane, where θ1 is the angle between qG and
    qL and Cosine(θ) = GT
  •    = GT * Cosine(θ1) - sin(θ) * sin(θ1)

  (figure: vectors qL , qG and d, with angle θ1
  between qL and qG and angle θ between qG and d)
74
Case 2 Number of Globally Desired
Documents Is Known
  • Solution
  • rank databases optimally for a given query q
  • retrieve documents from databases in the optimal
    order

75
Case 2 Number of Globally Desired
Documents Is Known
  • Algorithm OptDocRetrv YLWM99
  • while fewer than m documents have been obtained do
  • 1. select the next database in the order
  • 2. compute the actual global similarity of its
    most similar document
  • 3. find the minimum min_sim of the actual
    similarities of the most similar documents of the
    selected databases
  • 4. select, from each selected database, the
    documents whose actual global similarities
    ≥ min_sim
  • end loop
  • Sort the documents in descending order of
    similarity and present the top m to the user
    (sketch below).
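A Python sketch of the loop above, using the numbers from the example on the following slide; it assumes the actual global similarities of the documents in each database are already available, which in practice requires interaction with the local engines:

    def opt_doc_retrv(ranked_dbs, m):
        # ranked_dbs: databases in (optimal) rank order; each database is a
        #             list of (doc, global_similarity) pairs, best first
        selected, result, min_sim = [], {}, None
        for db in ranked_dbs:
            if len(result) >= m:
                break
            selected.append(db)
            top_sim = db[0][1] if db else 0.0        # its most similar document
            min_sim = top_sim if min_sim is None else min(min_sim, top_sim)
            result = {doc: s for d in selected for doc, s in d if s >= min_sim}
        ranked = sorted(result.items(), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked[:m]]

    dbs = [[("d1", 0.53), ("d2", 0.48), ("d3", 0.39)],
           [("d10", 0.47), ("d21", 0.43), ("d52", 0.42)],
           [("d23", 0.54), ("d42", 0.49)],
           [("d33", 0.40)]]
    print(opt_doc_retrv(dbs, 4))   # ['d23', 'd1', 'd42', 'd2']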

76
Case 2 Number of Globally Desired
Documents Is Known
  • Example: the number of documents desired is 4.
  • Databases are ranked in the order D1, D2, D3, D4
  • D1: d1 = 0.53, d2 = 0.48, d3 = 0.39, ...
  • D2: d10 = 0.47, d21 = 0.43, d52 = 0.42, ...
  • D3: d23 = 0.54, d42 = 0.49, ...
  • D4: d33 = 0.40, ...
  • select D1, min_sim = 0.53, result: d1
  • select D2, min_sim = 0.47, result: d1, d2, d10
  • select D3, min_sim = 0.47, result: d1, d2, d10,
    d23, d42
  • result to user: d1, d2, d23, d42

77
Case 2 Number of Globally Desired
Documents Is Known
  • Proposition If databases are optimally ranked,
    then all the m globally most similar documents
    will be retrieved by algorithm OptDocRetrv.
  • Proposition For any single-term query, all the m
    globally most similar documents will be retrieved
    by algorithm OptDocRetrv.

78
Result Merging
  • Goal Merge returned documents from multiple
    sources into a single ranked list.
  • Difficulties
  • local similarities are usually not comparable due
    to
  • different similarity functions
  • different term weighting schemes
  • different statistical values
  • e.g., global idf vs. local idf
  • local similarities may be unavailable to
    metasearch engine (only ranks are provided)
  • Ideal rank in non-increasing order of global
    similarities

79
Solution Classification
  • similarity normalization
  • normalize all local similarities into a
    common fixed range to improve comparability
  • similarity adjustment
  • adjust local similarities/ranks based on the
    quality of local databases
  • global similarity computation
  • aim at obtaining the actual global
    similarities
  • Merge based on normalized/adjusted/computed
    similarities.

80
Similarity Normalization
  • Example 1: MetaCrawler SeEt97
  • map all local similarities into [0, 1000]
  • map the largest local similarity from each source
    to 1000
  • map the other local similarities proportionally
  • add the normalized local similarities of documents
    retrieved from multiple sources (sketch below)

                        D1                   D2
                        d1    d2    d3       d1    d4    d5
      local similarity  100   200   400      0.3   0.2   0.5
      normalized        250   500   1000     600   400   1000
      final similarity  850   500   1000           400   1000
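A Python sketch of this normalization-and-merge step for the example above; the function name and the result structure are my own:

    def merge_metacrawler(results, scale=1000.0):
        # results: list of dicts (one per source), doc -> local similarity
        # map each source's best similarity to `scale`, the rest
        # proportionally, then sum scores of documents seen in several sources
        merged = {}
        for res in results:
            top = max(res.values(), default=0.0)
            if top <= 0:
                continue
            for doc, s in res.items():
                merged[doc] = merged.get(doc, 0.0) + s / top * scale
        return sorted(merged.items(), key=lambda x: x[1], reverse=True)

    print(merge_metacrawler([{"d1": 100, "d2": 200, "d3": 400},
                             {"d1": 0.3, "d4": 0.2, "d5": 0.5}]))
    # [('d3', 1000.0), ('d5', 1000.0), ('d1', 850.0), ('d2', 500.0), ('d4', 400.0)]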

81
Similarity Normalization
  • Example 2: SavvySearch DrHo97
  • same as MetaCrawler except using the range [0, 1]
  • documents with no local similarities are assigned
    0.5
  • Retrieval based on Multiple Evidence
  • a normalized similarity between 0 and 1 can be
    viewed as a confidence that the document is useful
  • let si be the confidence of source i that document
    d is useful to query q
  • estimate the overall confidence that d is useful:
  • S(d, q) = 1 - (1 - s1)(1 - s2)...(1 - sk)
  • Example: s1 = 0.7, s2 = 0.8,
    S(d, q) = 1 - 0.3*0.2 = 0.94

82
Similarity Adjustment
  • Use local similarity of d and the ranking score
    of its database to estimate the global similarity
    of d.
  • database ranking score the higher the better
  • Example: CORI net CaLC95
  • assign the following weight to database D
  • w(D) = 1 + N * (r - r') / r'
  • r = ranking score of D wrt q
  • r' = average of the ranking scores of the searched
    databases
  • N = number of local databases searched
  • adjust the local similarity s of document d in D
    to s * w(D)
  • A similar approach is employed in ProFusion GaWG96.

83
Similarity Adjustment
  • Use local rank of d and the ranking score of its
    database to estimate the global similarity of d.
  • Example: D-WISE YuLe97
  • Gsim(q, d) = 1 - (r - 1) * Rmin / (m * Ri)
  • Ri = ranking score of database Di
  • Rmin = lowest database ranking score
  • r = local rank of document d from Di
  • m = total number of documents desired
  • Observation: the top-ranked document from any
    database has the same global similarity

84
D-WISE (continued)
  • Example: R1 = 0.3, R2 = 0.7, Rmin = 0.2, m = 4
  • Gsim(q, d) = 1 - (r - 1) * 0.2 / (4 * Ri)

              D1                  D2
              r     Gsim          r     Gsim
      d1      1     1.0     d1'   1     1.0
      d2      2     0.83    d2'   2     0.93
      d3      3     0.67    d3'   3     0.86

  • documents from databases with higher ranking
    scores receive higher global similarities

85
Global Similarity Computation
  • Technique 1 Document Fetching (e.g. E2RD2,
    ParaCrawler)
  • fetch documents to the metasearch engine
  • collect desired statistics (tf, idf, ...)
  • compute global similarities
  • Problem may not scale well.

86
Global Similarity Computation
  • Technique 2 Knowledge Discovery
  • discover similarity functions and term weighting
    schemes used in different search engines
  • use the discovered knowledge to determine
  • what local similarities are reasonably
    comparable?
  • how to adjust local similarities to make them
    more comparable?
  • how to compute/estimate global similarities?

87
Knowledge Discovery (continued)
  • Example
  • All local search engines selected for a query
  • employ same methods for indexing local documents
    and computing local similarities
  • do not use idf information
  • local similarities comparable
  • idf information is used and q has a single term t
  • Lsim(q, d) = tft(q) * lidft * tft(d) / (|q| * |d|)
    = lidft * tft(d) / |d|
  • Gsim(q, d) = gidft * tft(d) / |d|
  • Gsim(q, d) = Lsim(q, d) * gidft / lidft

88
Knowledge Discovery (continued)
  • Example (continued)
  • idf information is used and q has terms t1, ...,
    tk
  • Gsim(q, d) can be assembled from the per-term
    contributions
  • each per-term quantity can be determined using ti
    as a single-term query.

89
Knowledge Discovery (continued)
  • submit ti as a single-term query and let
  • si = Lsim(d, q(ti))

90
New Challenges
  • Incorporate new search techniques into
    metasearch.
  • Document ranks in Google
  • Kleinberg's hub and authority scores
  • Tag information in HTML documents
  • Implicit user feedback on previous retrieval
  • Pseudo relevance feedback on previous retrieval
  • Use of user profiles
  • Integrate local systems supporting different
    query types
  • less research on Boolean queries, proximity
    queries and hierarchical queries

91
New Challenges (continued)
  • Develop techniques to discover knowledge
    (representatives, ranking algorithms) about local
    search engines more accurately and more
    efficiently.
  • some search engines may be unwilling to provide
    the desired representatives or may provide
    inaccurate representatives
  • indexing techniques, term weighting schemes and
    similarity functions are typically proprietary.
  • Develop a standard guideline on what information
    each search engine should provide to the
    metasearch engine (some existing efforts: STARTS,
    Dublin Core).

92
New Challenges (continued)
  • Distributed implementation of metasearch engine
  • alternative ways to store local database
    representatives?
  • how to perform database selection and document
    selection at multiple sites in parallel?
  • Scale to a million databases
  • storage of database representatives
  • fast algorithms for database selection, document
    selection and result merging
  • efficient network utilization

93
New Challenges (continued)
  • Standard testbed for evaluation
  • need a large number of local databases
  • documents should have links for computing ranks,
    hub and authority scores
  • a large number of typical Internet queries
  • relevance assessment of documents to each query
  • Go beyond text databases
  • how to extend to databases containing text,
    images, video, audio, structured data?