Title: VLDB'99 TUTORIAL Metasearch Engines: Solutions and Challenges
1. VLDB'99 TUTORIAL: Metasearch Engines: Solutions and Challenges
- Clement Yu, Dept. of EECS, U. of Illinois at Chicago, Chicago, IL 60607, yu_at_eecs.uic.edu
- Weiyi Meng, Dept. of Computer Science, SUNY at Binghamton, Binghamton, NY 13902, meng_at_cs.binghamton.edu
2. The Problem
- How am I going to find the 5 best pages on Internet Security?
- [Figure: search engines 1..n, each searching its own text source 1..n]
3. Metasearch Engine Solution
- [Figure: the user submits a query to the user interface; a query dispatcher forwards it to search engines 1..n over text sources 1..n; a result merger combines the returned results for the user]
4. Some Observations
- most sources are not useful for a given query
- sending a query to a useless source would
  - incur unnecessary network traffic
  - waste local resources for evaluating the query
  - increase the cost of merging the results
- retrieving too many documents from a source is inefficient
5. A More Efficient Metasearch Engine
- [Figure: same architecture as slide 3, with a database selector and a document selector added between the user interface and the query dispatcher/result merger]
6. Tutorial Outline
- 1. Introduction to Text Retrieval
  - only the Vector Space Model is considered
- 2. Search Engines on the Web
- 3. Introduction to Metasearch Engine
- 4. Database Selection
- 5. Document Selection
- 6. Result Merging
- 7. New Challenges
7. Introduction to Text Retrieval (1)
- Document representation
  - remove stopwords: of, the, ...
  - stemming: stemming -> stem
  - d = (d1, ..., di, ..., dn)
  - di = weight of the ith term in d
  - tf*idf formula for computing di
- Example: consider term t of document d in a database of N documents (see the sketch below)
  - tf weight of t in d (if tf > 0): 0.5 + 0.5*tf/max_tf
  - idf weight of t: log(N/df)
  - weight of t in d: (0.5 + 0.5*tf/max_tf) * log(N/df)
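A minimal Python sketch of the tf*idf weight above; the function name and the example numbers are illustrative, not from the tutorial.

```python
import math

def term_weight(tf: int, max_tf: int, df: int, N: int) -> float:
    """Weight of term t in document d: (0.5 + 0.5*tf/max_tf) * log(N/df)."""
    if tf == 0 or df == 0:
        return 0.0
    return (0.5 + 0.5 * tf / max_tf) * math.log(N / df)

# Example: t occurs 3 times in d, the most frequent term in d occurs 6 times,
# and t appears in 100 of the N = 1000 documents.
print(term_weight(tf=3, max_tf=6, df=100, N=1000))  # (0.5 + 0.25) * log(10)
```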
8. Introduction to Text Retrieval (2)
- Query representation
  - q = (q1, ..., qi, ..., qn)
  - qi = weight of the ith term in q
  - compute qi using the tf weight only
  - alternative: use the idf weight for query terms, not for document terms
  - query expansion (e.g., add related terms)
9. Introduction to Text Retrieval (3)
- Similarity Functions (see the sketch below)
  - simple dot product: favors long documents
  - Cosine function
  - other similarity functions exist
  - normalized similarities lie in [0, 1.0]
- [Figure: vectors q and d separated by an angle]
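A small Python sketch contrasting the dot product and the Cosine function described above; the vectors are made-up examples.

```python
import math

def dot(q, d):
    """Simple dot product; favors long documents."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Cosine similarity: dot product normalized by the vector lengths;
    lies in [0, 1] for non-negative weights."""
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot(q, d) / norm if norm else 0.0

q = [1.0, 0.5, 0.0]
d = [0.2, 0.0, 0.9]
print(dot(q, d), cosine(q, d))
```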
10. Introduction to Text Retrieval (4)
- Retrieval Effectiveness
  - relevant documents: documents useful to the user of the query
  - recall: percentage of relevant documents that are retrieved
  - precision: percentage of retrieved documents that are relevant
- [Figure: precision-recall curve]
11. Search Engines on the Web (1)
- Search engine as a document retrieval system
  - no control over which web pages can be searched
  - web pages have rich structures and semantics
  - web pages are extensively linked
  - additional information is available for each page (time last modified, publishing organization, etc.)
  - databases are dynamic and can be very large
  - few general-purpose search engines and numerous special-purpose search engines
12. Search Engines on the Web (2)
- New indexing techniques
  - partial-text indexing to improve scalability
  - ignore and/or discount spamming terms
  - use anchor terms to index linked pages
    - e.g., WWWW McBr94, Google BrPa98, Webor CSM97
- [Figure: Page 1 contains the anchor text "airplane ticket and hotel" linking to Page 2 at http://travelocity.com/; the anchor terms can be used to index Page 2]
13. Search Engines on the Web (3)
- New term weighting schemes
  - higher weights to terms enclosed by special tags
    - title (SIBRIS WaWJ89, Altavista, HotBot, Yahoo)
    - special fonts (Google BrPa98)
    - special font tags (LASER BoFJ96)
  - Webor CSM97 approach
    - partition tags into disjoint classes (title, header, strong, anchor, list, plain text)
    - assign different importance factors to terms in different classes
    - determine optimal importance factors
14. Search Engines on the Web (4)
- New document ranking methods
  - Vector Spreading Activation YuLe96: add a fraction of the parents' similarities
  - Example: suppose for query q
    - sim(q, d1) = 0.4, sim(q, d2) = 0.2, sim(q, d3) = 0.2
    - final score of d3 = 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26 (see the sketch below)
- [Figure: d1 and d2 both link to d3]
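A short Python sketch of the Vector Spreading Activation scoring described above, assuming a fraction of 0.1 as in the example; the function and variable names are illustrative.

```python
def spread_activation(sim, parents, alpha=0.1):
    """Final score = own similarity + alpha * sum of the similarities of the
    pages that link to the document (its 'parents')."""
    return {d: s + alpha * sum(sim[p] for p in parents.get(d, ()))
            for d, s in sim.items()}

sim = {"d1": 0.4, "d2": 0.2, "d3": 0.2}
parents = {"d3": ["d1", "d2"]}          # d1 and d2 both link to d3
print(spread_activation(sim, parents))  # d3 -> 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26
```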
15. Search Engines on the Web (5)
- New document ranking methods
  - combine similarity with rank
    - PageRank PaBr98: an important page is linked to by many pages and/or by important pages
  - combine similarity with authority score
    - authority Klei98: an important content page is highly linked to among the initially retrieved pages and their neighbors
16. Introduction to Metasearch Engine (1)
- An Example
  - Query: Internet Security
  - Databases: NYT, WP, ..., other DBs
  - Retrieved results: t1, t2, ... from NYT; p1, p2, ... from WP
  - Merged results: p1, t1, ...
17. Introduction to Metasearch Engine (2)
- Database Selection Problem
  - Select potentially useful databases for a given query
  - essential if the number of local databases is large
    - reduce network traffic
    - avoid wasting local resources
18. Introduction to Metasearch Engine (3)
- Potentially useful databases contain potentially useful documents
- Potentially useful documents
  - global similarity above a threshold, or
  - global similarity among the m highest
- Need some knowledge about each database in advance in order to perform database selection: the Database Representative
19. Introduction to Metasearch Engine (4)
- Document Selection Problem
  - Select potentially useful documents from each selected local database efficiently
  - Step 1: Retrieve all potentially useful documents while minimizing the retrieval of useless documents
    - from the global similarity threshold to the tightest local similarity threshold
    - want all d with Gsim(q, d) > GT
    - retrieve d from DBk if Lsim(q, d) > LTk
    - LTk is the largest value such that Gsim(q, d) > GT implies Lsim(q, d) > LTk
20. Introduction to Metasearch Engine (5)
- Efficient Document Selection
  - Step 2: Transmit all potentially useful documents to the result merger while minimizing the transmission of useless documents
    - further filtering to reduce transmission cost and merging cost
  - Example: documents d1, ..., ds are retrieved from local database DBk; after filtering, only d2, d7 and d10 are transmitted
21. Introduction to Metasearch Engine (6)
- Result Merging Problem
  - Objective: Merge returned documents from multiple sources into a single ranked list.
  - Difficulty: Local document similarities may be incomparable or not available.
  - Solution: Generate "global" similarities for ranking.
- [Figure: DB1 returns d11, d12, ...; DBN returns dN1, dN2, ...; the merger produces a single ranked list d12, d54, ...]
22. Introduction to Metasearch Engine (7)
- An Ideal Metasearch Engine
  - Retrieval effectiveness: the same as if all documents were in a single collection.
  - Efficiency: optimize the retrieval process
  - Implications: should aim at
    - selecting only useful search engines
    - retrieving and transmitting only useful documents
    - ranking documents according to their degrees of relevance
23. Introduction to Metasearch Engine (8)
- Main Sources of Difficulties MYL99
  - autonomy of local search engines
    - design autonomy
    - maintenance autonomy
  - heterogeneities among local search engines
    - indexing method
    - document/query term weighting schemes
    - similarity/ranking function
    - document database
    - document version
    - result presentation
24. Introduction to Metasearch Engine (9)
- Impact of Autonomy and Heterogeneities MLY99
  - local search engines may be unwilling to provide database representatives, or may provide different types of representatives
  - difficult to find potentially useful documents
  - difficult to merge documents from multiple sources
25. Database Selection: Basic Idea
- Goal: Identify potentially useful databases for each user query.
- General approach
  - use a representative to indicate approximately the content of each database
  - use these representatives to select databases for each query
- Diversity of solutions
  - different types of representatives
  - different algorithms using the representatives
26. Solution Classification
- Naive Approach
  - select all databases (e.g., MetaCrawler, NCSTRL)
- Qualitative Approaches: estimate the quality of each local database
  - based on rough representatives
  - based on detailed representatives
- Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly
- Learning-based Approaches: database representatives are obtained through training or learning
27. Qualitative Approaches Using Rough Representatives
- typical representative
  - a few words or a few paragraphs in a certain format
  - manual construction often needed
- can work well for special-purpose local search engines
- very scalable in storage requirement
- selection can be inaccurate as the description is too rough
28. Qualitative Approaches Using Rough Representatives
- Example 1: ALIWEB Kost94
  - Representative has a fixed format (site containing files for the Perl language):
    - Template-Type: DOCUMENT
    - Title: Perl
    - Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
    - Keywords: perl, perl-faq, language
  - user query can match against one or more fields
29. Qualitative Approaches Using Rough Representatives
- Example 2: NetSerf ChHa95
  - Representative has a WordNet-based structure (site for world facts listed by country):
    - topic: country
      - synset: nation, nationality, land, country, a_people
      - synset: state, nation, country, land, commonwealth, res_publica, body_politic
      - synset: country, state, land, nation
    - info-type: facts
  - user query is transformed to a similar structure before matching
30. Qualitative Approaches Using Detailed Representatives
- Use detailed statistical information for each term
- employ special measures to estimate the usefulness/quality of each search engine for each query
- the measures reflect usefulness less directly/explicitly than those used in quantitative approaches
- scalability starts to become an issue
31. Qualitative Approaches Using Detailed Representatives
- Example 1: gGlOSS GrGa95
  - representative for term ti:
    - dfi -- document frequency of ti
    - Wi -- the sum of the weights of ti in all documents
  - database usefulness = sum of the high similarities:
    - usefulness(q, D, T) = sum of sim(q, d) over all documents d in D with sim(q, d) >= T
32. gGlOSS (continued)
- Suppose for query q, the document similarities are
  - D1: d11 = 0.6, d12 = 0.5
  - D2: d21 = 0.3, d22 = 0.3, d23 = 0.2
  - D3: d31 = 0.7, d32 = 0.1, d33 = 0.1
- usefulness(q, D1, 0.3) = 1.1
- usefulness(q, D2, 0.3) = 0.6
- usefulness(q, D3, 0.3) = 0.7
- (see the sketch below)
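A small Python sketch of the usefulness measure applied to the similarities above; note that gGlOSS only estimates these sums from (dfi, Wi), whereas this sketch uses the true similarities for illustration.

```python
def usefulness(similarities, T):
    """gGlOSS database usefulness: sum of the similarities that are >= T."""
    return sum(s for s in similarities if s >= T)

D1 = [0.6, 0.5]
D2 = [0.3, 0.3, 0.2]
D3 = [0.7, 0.1, 0.1]
print(usefulness(D1, 0.3), usefulness(D2, 0.3), usefulness(D3, 0.3))  # 1.1 0.6 0.7
```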
33. gGlOSS (continued)
- gGlOSS usefulness is estimated for two cases
- high-correlation case: if dfi <= dfj, then every document having ti also has tj.
- Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.
  - actual weights vs. estimated weights:
          t1   t2   t3        t1   t2   t3
    d1    0.2  0.1  0.3       0.3  0.2  0.3
    d2    0.4  0.3  0.2       0.3  0.2  0.3
    d3    0    0.2  0.4       0    0.2  0.3
    d4    0    0    0.3       0    0    0.3
  - usefulness(q, D, 0.5) = W1 + W2 + (df2/df3)*W3 = 0.6 + 0.6 + (3/4)*1.2 = 2.1
34. gGlOSS (continued)
- disjoint case: for any two query terms ti and tj, no document contains both ti and tj.
- Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 1, df3 = 1, W1 = 0.5, W2 = 0.2 and W3 = 0.4.
  - actual weights vs. estimated weights:
          t1   t2   t3        t1    t2   t3
    d1    0.2  0    0         0.25  0    0
    d2    0    0.2  0         0     0.2  0
    d3    0.3  0    0         0.25  0    0
    d4    0    0    0.4       0     0    0.4
  - usefulness(q, D, 0.3) = W3 = 0.4
35. gGlOSS (continued)
- Some observations
  - usefulness is dependent on the threshold
  - the representative has two quantities per term
  - strong assumptions are used
    - high-correlation tends to overestimate
    - disjoint tends to underestimate
  - the two estimates tend to form bounds on the sum of the similarities >= T
36. Qualitative Approaches Using Detailed Representatives
- Example 2: CORI Net CaLC95
  - representative: (dfi, cfi) for term ti
    - dfi -- document frequency of ti
    - cfi -- collection frequency of ti; can be shared by all databases
  - database usefulness
    - usefulness(q, D) = sim(q, representative of D)
    - usefulness is computed like a similarity, with dfi playing the role of tfi and cfi playing the role of dfi
37. CORI Net (continued)
- Some observations
  - estimates are independent of a threshold
  - the representative has fewer than two quantities per term (cfi is shared)
  - similarity is computed based on an inference network
  - the same method is used for ranking documents and ranking databases
38. Qualitative Approaches Using Detailed Representatives
- Example 3: D-WISE YuLe97
  - representative: dfi,j for term tj in database Di
  - database usefulness: a measure of query term concentration in the different databases
    - usefulness(q, Di) = sum over the k query terms tj of CVVj * dfi,j
    - k = number of query terms
    - CVVj = cue validity variance of term tj across all databases; a larger CVVj means tj is more useful in distinguishing different databases
39. D-WISE (continued)
- Notation for the CVV formulas:
  - N = number of databases
  - ni = number of documents in database Di
  - ACVj = average cue validity of tj over all databases
- Observations
  - estimates are independent of a threshold
  - the representative has one quantity per term
  - the measure is difficult to understand
40. Quantitative Approaches
- Two types of quantities may be estimated with respect to a query q
  - the number of documents in a database D with similarities higher than a threshold T
    - NoDoc(q, D, T) = |{ d : d in D and sim(q, d) > T }|
  - the global similarity of the most similar document in D
    - msim(q, D) = max over d in D of sim(q, d)
- these can be used to rank databases in descending order of similarity (or any desirability measure)
41. Estimating NoDoc(q, D, T)
- Basic Approach MLYW98
  - representative: (pi, wi) for term ti
    - pi = probability that ti appears in a document
    - wi = average weight of ti among the documents containing ti
  - Example: the normalized weights of ti in 10 documents are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6)
    - pi = 0.6, wi = 0.4
42. Estimating NoDoc(q, D, T)
- Basic Approach (continued)
  - Example: Consider query q = (1, 1).
  - Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.
  - The generating function:
    - (0.2 X^2 + 0.8) * (0.4 X + 0.6)
    - = 0.08 X^3 + 0.12 X^2 + 0.32 X + 0.48
  - In a term a X^b, a is the probability that a document in D has similarity b with q
  - For a database of 10 documents: NoDoc(q, D, 1) = 10 * (0.08 + 0.12) = 2 (see the sketch below)
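A minimal Python sketch of the generating-function computation above; the helper names are illustrative.

```python
from collections import defaultdict

def multiply(poly_a, poly_b):
    """Multiply two generating functions given as {exponent: coefficient} dicts."""
    out = defaultdict(float)
    for ea, ca in poly_a.items():
        for eb, cb in poly_b.items():
            out[ea + eb] += ca * cb
    return dict(out)

def no_doc(factors, n_docs, T):
    """Estimate NoDoc(q, D, T): n_docs times the probability mass on
    exponents (similarities) strictly greater than T."""
    gf = {0: 1.0}
    for f in factors:
        gf = multiply(gf, f)
    return n_docs * sum(c for e, c in gf.items() if e > T)

# Factors (pi*X^(qi*wi) + (1 - pi)) for q = (1, 1), p1=0.2, w1=2, p2=0.4, w2=1:
factors = [{2: 0.2, 0: 0.8}, {1: 0.4, 0: 0.6}]
print(no_doc(factors, n_docs=10, T=1))  # 10 * (0.08 + 0.12) = 2.0
```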
43. Estimating NoDoc(q, D, T)
- Basic Approach (continued)
  - Consider query q = (q1, ..., qr).
  - Proposition: If the terms are independent and the weight of term ti, whenever present in a document, is wi (the average weight), 1 <= i <= r, then the coefficient of X^s in the following generating function is the probability that a document in D has similarity s with q:
    - (p1 X^(q1*w1) + 1 - p1) * (p2 X^(q2*w2) + 1 - p2) * ... * (pr X^(qr*wr) + 1 - pr)
44. Estimating NoDoc(q, D, T)
- Subrange-based Approach MLYW99
  - overcomes the uniform term weight assumption
  - additional information for term ti:
    - σi = standard deviation of the weights of ti in all documents
    - mnwi = maximum normalized weight of ti
45. Estimating NoDoc(q, D, T)
- Example: weights of term ti are 4, 4, 1, 1, 1, 1, 0, 0, 0, 0
  - generating-function factor using the average weight:
    - 0.6 X^2 + 0.4
  - a more accurate factor using subranges of weights:
    - 0.2 X^4 + 0.4 X + 0.4
- In general, the weights are partitioned into k subranges:
  - pi1 X^mi1 + ... + pik X^mik + (1 - pi)
  - probability pij and median mij can be estimated using σi and the average of the weights of ti
- A special implementation: use the maximum normalized weight as the first subrange by itself.
46. Estimating NoDoc(q, D, T)
- Combined-term Approach LYMW99
  - relieves the term independence assumption
  - Example: Consider the query "Chinese medicine".
  - Suppose the generating-function factors are
    - Chinese: 0.1 X^3 + 0.3 X + 0.6
    - medicine: 0.2 X^2 + 0.4 X + 0.4
    - their product (assuming independence): 0.02 X^5 + 0.04 X^4 + 0.1 X^3 + ...
    - combined term "Chinese medicine": 0.05 X^w + ...
47. Estimating NoDoc(q, D, T)
- Criteria for combining "Chinese" and "medicine"
  - The maximum normalized weight of the combined term is higher than the maximum normalized weight of each of the two individual terms (w > 3).
  - The sum of the estimated probabilities of the terms with exponents >= w under the term independence assumption is very different from 1/N, where N is the number of documents in the database.
  - They are adjacent terms in previous queries.
48. Database Selection Using msim(q, D)
- Optimal Ranking of Databases YLWM99b
  - User: for query q, find the m most similar documents (or those with the m largest degrees of relevance)
  - Definition: Databases D1, D2, ..., Dp are optimally ranked with respect to q if there exists a k such that each of the databases D1, ..., Dk contains at least one of the m most similar documents, and all of these m documents are contained in these k databases.
49. Database Selection Using msim(q, D)
- Optimal Ranking of Databases
- Example: For a given query q
  - D1: d1 = 0.8, d2 = 0.5, d3 = 0.2, ...
  - D2: d9 = 0.7, d2 = 0.6, d10 = 0.4, ...
  - D3: d8 = 0.9, d12 = 0.3, ...
  - other databases have only documents with small similarities
- When m = 5: pick D1, D2, D3
50. Database Selection Using msim(q, D)
- Proposition: Databases D1, D2, ..., Dp are optimally ranked with respect to a query q if and only if
  - msim(q, Di) >= msim(q, Dj) for i < j
- Example: D1: d1 = 0.8, ...; D2: d9 = 0.7, ...; D3: d8 = 0.9, ...
  - Optimal rank: D3, D1, D2, ...
51. Estimating msim(q, D)
- Use the subrange-based or combined-term method.
- Example: Suppose a database has 100 documents.
  - For query q, the generating function is
    - 0.002 X^4 + 0.009 X^3 + ...
  - Since 100 * (0.002 + 0.009) ≈ 1, the global similarity of the most similar document is estimated to be 3.
- Weaknesses of this approach
  - requires large storage for the database representative
  - exponential computation complexity
52. Estimating msim(q, D)
- A more efficient method
  - global database representative: global dfi of term ti
  - local database representative:
    - anwi = average normalized weight of ti
    - mnwi = maximum normalized weight of ti
  - Example: weights of term ti: d1 = 0.3, d2 = 0.4, d3 = 0, d4 = 0.7
    - anwi = (0.3 + 0.4 + 0 + 0.7)/4 = 0.35
    - mnwi = 0.7
53. Estimating msim(q, D)
- A more efficient method (continued)
  - term weighting scheme
    - query term: tf * gidf
    - document term: tf
  - query q = (q1, q2)
  - msim(q, D) = max { q1*gidf1*mnw1 + q2*gidf2*anw2 , q2*gidf2*mnw2 + q1*gidf1*anw1 }
  - linear computation complexity (see the sketch below)
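A short Python sketch of the linear-time msim estimate above, generalized to any number of query terms; the gidf/anw/mnw values in the example are made up.

```python
def msim(q_tf, gidf, anw, mnw):
    """Assume the most similar document takes the maximum normalized weight for
    one query term and the average normalized weight for all others; take the
    best choice over all terms. Linear in the number of query terms."""
    contrib = [q_tf[i] * gidf[i] for i in range(len(q_tf))]
    total_avg = sum(contrib[i] * anw[i] for i in range(len(q_tf)))
    return max(total_avg + contrib[i] * (mnw[i] - anw[i]) for i in range(len(q_tf)))

# Two-term query q = (q1, q2): reproduces
# max{ q1*gidf1*mnw1 + q2*gidf2*anw2 , q2*gidf2*mnw2 + q1*gidf1*anw1 }
print(msim(q_tf=[1, 1], gidf=[2.0, 1.5], anw=[0.35, 0.2], mnw=[0.7, 0.5]))
```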
54. Estimating msim(q, D)
- Combine terms to improve estimation accuracy
- Restrictions for combining terms ti and tj into tij:
  - ti and tj are adjacent query terms
  - mnwij > max { mnwi + anwj , mnwj + anwi }
- Given a query having ti, tj and tk in this order, decide which terms to combine, if any.
- Combine ti and tj if
  - mnwij > max { mnwi + anwj , mnwj + anwi }
  - and mnwij - max { mnwi + anwj , mnwj + anwi } > mnwjk - max { mnwk + anwj , mnwj + anwk }
55. Learning-based Approaches
- Use past retrieval experiences to determine usefulness
- Assume little or no global or local database statistics
- Static learning: learning based on static training queries
- Dynamic learning: learning based on evaluated user queries
- Combined learning: knowledge learned from training queries is adjusted based on user queries
56. Static Learning
- Example: MRDD (Modeling Relevant Document Distribution) VoGJ95
  - record the result of each training query for each local database
  - <r1, ..., rs>: ri indicates the minimum number of top-ranked documents to retrieve in order to obtain i relevant documents
  - <2, 5, ...>: need to retrieve 2 documents to obtain 1 relevant document, 5 documents to obtain 2 relevant documents
57. MRDD (continued)
- For a new query
  - identify the k most similar training queries
  - obtain the average distribution vector over the k training queries for each database
  - use these vectors to determine which databases to search and how many documents to retrieve so as to maximize precision (see the sketch below)
- Example: Suppose for query q, three average distribution vectors are obtained
  - D1: <1, 4, 6, 7, 10, 12, 17>
  - D2: <1, 5, 7, 9, 15, 20>
  - D3: <2, 3, 6, 9, 11, 16>
- To retrieve two relevant documents: select D1 and D2 (one document from each).
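One simple way to realize the "maximize precision" step from the distribution vectors: greedily take the next relevant document from the database with the smallest marginal retrieval cost. This is only a sketch; the tutorial does not spell out the exact selection algorithm.

```python
def allocate(dists, wanted):
    """Greedy allocation: dists[db] = <r1, r2, ...> (documents needed for the
    i-th relevant document); returns {db: number of documents to retrieve}."""
    got = {db: 0 for db in dists}        # relevant documents taken so far per database
    retrieve = {db: 0 for db in dists}   # documents to retrieve per database
    for _ in range(wanted):
        # marginal cost of the next relevant document from each database
        candidates = {db: dists[db][got[db]] - retrieve[db]
                      for db in dists if got[db] < len(dists[db])}
        best = min(candidates, key=candidates.get)
        got[best] += 1
        retrieve[best] = dists[best][got[best] - 1]
    return {db: k for db, k in retrieve.items() if k > 0}

dists = {"D1": [1, 4, 6, 7, 10, 12, 17],
         "D2": [1, 5, 7, 9, 15, 20],
         "D3": [2, 3, 6, 9, 11, 16]}
print(allocate(dists, wanted=2))  # {'D1': 1, 'D2': 1} -> select D1 and D2
```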
58. Dynamic Learning
- Example: SavvySearch DrHo97
  - database representative: a weight wi and cfi for each term ti, plus two penalty values ph and pr for each database D
    - wi indicates how well D responds to query term ti
    - cfi = number of databases containing ti
    - ph = penalty if the average number of hits h returned for the most recent five queries < Th:
      - ph = (Th - h)^2 / Th^2
    - pr = penalty if the average response time r for the most recent five queries > Tr:
      - pr = (r - Tr)^2 / (45 - Tr)^2
59. SavvySearch (continued)
- Update of wi
  - initially zero
  - reduce by 1/k if no document is retrieved for a k-term query containing ti
  - increase by 1/k if some returned document is read
- Compute the ranking score of database D for query q = (t1, ..., tk) from the term weights wi and cfi and the penalty values ph and pr.
60. Combined Learning
- Example: ProFusion FaGa99
- Phase 1: Static Learning
  - 13 categories/concepts are utilized
  - training queries in each category are selected
  - relevance assessments for the queries are used to compute the average score of each local database with respect to each category
    category | D1  | D2  | ... | Dn
    C1       | 0.3 | 0.1 | ... | 0.2
    ...      | ... | ... | ... | ...
    C13      | 0   | 0.4 | ... | 0.1
61. ProFusion (continued)
- Phase 2: Database Selection and Dynamic Learning
  - Each user query is mapped to one or more categories
  - Databases are selected based on scores accumulated over the involved categories
  - Example: Suppose query q is mapped to C1, C4, C5
    category    | D1  | D2  | D3  | D4
    C1          | 0.2 | 0   | 0.1 | 0.3
    C4          | 0.1 | 0.2 | 0   | 0
    C5          | 0   | 0.4 | 0.3 | 0.2
    total score | 0.3 | 0.6 | 0.4 | 0.5
62. ProFusion (continued)
- Each document retrieved from the selected databases is re-ranked based on the product of its local similarity and the score of its database.
- If the first document clicked by the user is not the top-ranked one:
  - increase the score, in the related categories, of the database that produced the document
  - decrease the scores of the other searched databases in the related categories
63. Other Database Selection Techniques
- incorporating ranks YMLW99a
- query expansion XuCa98
- use of lightweight queries HaTh99
  - shorter
  - not evaluated like regular queries
- use of representative hierarchies YMLW99b
64. Document Selection
- Goal: Select all globally most similar documents from a selected local search engine while minimizing the retrieval of useless documents.
- General approaches
  - determine the number k of documents to retrieve from a local search engine, then retrieve the k documents with the largest local similarities from that engine
  - determine a local threshold for the local database and retrieve the documents whose local similarities exceed the threshold
- The two approaches are equivalent.
65. Solution Classification
- Local Determination
  - all locally retrieved documents are returned
  - Examples: NCSTRL, Search Broker MaBi97
- User Determination
  - the global user determines how many documents should be retrieved from each local database
  - neither effective nor practical when the number of databases is large
  - Examples: MetaCrawler SeEt97, SavvySearch DrHo97
66. Solution Classification (continued)
- Weighted Allocation
  - retrieve proportionally more documents from local databases that are ranked higher
- Learning-based Approaches
  - use past retrieval experience for selection
- Guaranteed Retrieval
  - aimed at guaranteeing the retrieval of the globally most similar documents
67. Weighted Allocation
- Suppose m documents are to be retrieved from N local databases.
- Example 1: CORI net CaLC95
  - Retrieve m * 2(1 + N - i) / (N(N+1)) documents from the ith ranked local database.
- Example 2: D-WISE YuLe97
  - Let ri be the ranking score of local database Di.
  - Retrieve m * ri / (r1 + ... + rN) documents from Di.
- When k documents are to be retrieved from local database Di, the k documents with the largest local similarities are retrieved (see the sketch below).
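A minimal Python sketch of the two allocation formulas above; the m, N and score values are made-up examples.

```python
def cori_allocation(m, N, i):
    """CORI net: documents to request from the i-th ranked database
    (i = 1 is the top database); the counts over all N databases sum to m."""
    return m * 2 * (1 + N - i) / (N * (N + 1))

def dwise_allocation(m, scores, db):
    """D-WISE: allocate in proportion to database db's ranking score."""
    return m * scores[db] / sum(scores.values())

print([cori_allocation(m=10, N=4, i=i) for i in (1, 2, 3, 4)])       # [4.0, 3.0, 2.0, 1.0]
print(dwise_allocation(m=10, scores={"D1": 0.6, "D2": 0.4}, db="D1"))  # 6.0
```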
68. Learning-based Approaches
- Determine the number of documents to retrieve from a local database based on past retrieval experiences with that database.
- Example: MRDD VoGJ95
  - For query q, three average distribution vectors are obtained
    - D1: <1, 4, 6, 7, 10, 12, 17>
    - D2: <1, 5, 7, 9, 15, 20>
    - D3: <2, 3, 6, 9, 11, 16>
  - To retrieve four relevant documents: retrieve 1 document from D1, 1 from D2 and 3 from D3.
69. Guaranteed Retrieval
- Aims at
  - guaranteeing that all potentially useful documents with respect to a query are retrieved
  - minimizing the retrieval of useless documents
- Two cases
  - case 1: a global similarity threshold is known
  - case 2: the number of globally desired documents is known
- The two cases are mutually translatable.
70. Case 1: Global Similarity Threshold GT Is Known
- Find all documents whose global similarities are >= GT.
- Technique 1: Query modification MLYW98
  - Modify q to q' such that Gsim(q, d) = Lsim(q', d)
  - find all documents whose local similarities with q' are >= GT
  - Example: q = (q1, q2), d = (d1, d2)
    - Gsim(q, d) = gidf1*q1*d1 + gidf2*q2*d2
    - Lsim(q, d) = lidf1*q1*d1 + lidf2*q2*d2
    - q' = ((gidf1/lidf1)*q1, (gidf2/lidf2)*q2)
    - Lsim(q', d) = lidf1*(gidf1/lidf1)*q1*d1 + lidf2*(gidf2/lidf2)*q2*d2 = Gsim(q, d)
  - (see the sketch below)
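A small Python sketch of the query-modification idea above; the idf and weight values are made up, and the sums assume the simple linear (dot-product) similarity used in the example.

```python
def modify_query(q, gidf, lidf):
    """Technique 1: scale each query weight by gidf/lidf so that evaluating the
    modified query q' with local idfs yields the global similarity."""
    return [qi * g / l for qi, g, l in zip(q, gidf, lidf)]

q, gidf, lidf, d = [1.0, 2.0], [3.0, 1.0], [1.5, 2.0], [0.4, 0.3]
q_prime = modify_query(q, gidf, lidf)
gsim = sum(qi * g * di for qi, g, di in zip(q, gidf, d))                 # global similarity
lsim_qprime = sum(qpi * l * di for qpi, l, di in zip(q_prime, lidf, d))  # local sim of q'
print(q_prime, gsim, lsim_qprime)  # the last two values coincide
```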
71. Case 1: Global Similarity Threshold GT Is Known
- Technique 2: find the largest local threshold LT such that
  - Gsim(q, d) >= GT implies Lsim(q, d) >= LT  MLYW98
  - retrieve each d such that Lsim(q, d) >= LT to form set S
  - transmit d from S if Gsim(q, d) >= GT
- Example:
    document | Gsim(q, d) | Lsim(q, d)
    d1       | 0.8        | 0.7
    d2       | 0.75       | 0.35
    d3       | 0.4        | 0.6
    ...
  - If d2 is desired, then LT can be no higher than 0.35.
  - If GT = 0.6, d3 will not be transmitted.
- Transmit <= m documents from each local database.
72. Case 1: Global Similarity Threshold GT Is Known
- Define the tightest local threshold:
  - LT = min over d of { Lsim(q, d) : Gsim(q, d) >= GT }
- Determining LT
  - if both Gsim and Lsim are linear functions, apply linear programming
  - otherwise, try the Lagrange multiplier method
73. Case 1: Global Similarity Threshold GT Is Known
- Example: Gsim(q, d) = Cosine(qG, d), Lsim(q, d) = Cosine(qL, d)
  - LT = min over d of { Cosine(qL, d) : Cosine(qG, d) >= GT }
  - = Cosine(θ + θ1) when qG, qL and d are in the same plane, where cos θ = GT
  - = GT * cos θ1 - sin θ * sin θ1
- [Figure: vectors qL, qG and d, with angle θ1 between qL and qG and angle θ between qG and d]
74. Case 2: Number of Globally Desired Documents Is Known
- Solution
  - rank the databases optimally for the given query q
  - retrieve documents from the databases in the optimal order
75. Case 2: Number of Globally Desired Documents Is Known
- Algorithm OptDocRetrv YLWM99
  - while fewer than m documents have been obtained do
    1. select the next database in the order
    2. compute the actual similarity of its most similar document
    3. find the minimum, min_sim, of the actual similarities of the most similar documents of the selected databases
    4. select, from each selected database, the documents whose actual global similarities are >= min_sim
  - end loop
  - Sort the documents in descending order of similarity and present the top m to the user.
76. Case 2: Number of Globally Desired Documents Is Known
- Example: Number of documents desired m = 4.
- Databases are ranked in the order D1, D2, D3, D4
  - D1: d1 = 0.53, d2 = 0.48, d3 = 0.39, ...
  - D2: d10 = 0.47, d21 = 0.43, d52 = 0.42, ...
  - D3: d23 = 0.54, d42 = 0.49, ...
  - D4: d33 = 0.40, ...
- select D1: min_sim = 0.53, result {d1}
- select D2: min_sim = 0.47, result {d1, d2, d10}
- select D3: min_sim = 0.47, result {d1, d2, d10, d23, d42}
- result to user: d1, d2, d23, d42 (see the sketch below)
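A Python sketch of OptDocRetrv on the example above. In a real metasearch engine the documents above min_sim would be requested from each local search engine; here the similarity lists are given directly.

```python
def opt_doc_retrv(ranked_dbs, m):
    """ranked_dbs: list of (name, {doc: global similarity}) in the (ideally
    optimal) database order; returns the top m documents with their similarities."""
    selected, result = [], {}
    for name, docs in ranked_dbs:
        selected.append(docs)
        # minimum, over the selected databases, of their best document's similarity
        min_sim = min(max(d.values()) for d in selected)
        result = {doc: s for d in selected for doc, s in d.items() if s >= min_sim}
        if len(result) >= m:
            break
    return sorted(result.items(), key=lambda x: -x[1])[:m]

dbs = [("D1", {"d1": 0.53, "d2": 0.48, "d3": 0.39}),
       ("D2", {"d10": 0.47, "d21": 0.43, "d52": 0.42}),
       ("D3", {"d23": 0.54, "d42": 0.49}),
       ("D4", {"d33": 0.40})]
print(opt_doc_retrv(dbs, m=4))  # d23, d1, d42, d2
```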
77. Case 2: Number of Globally Desired Documents Is Known
- Proposition: If the databases are optimally ranked, then all of the m globally most similar documents will be retrieved by algorithm OptDocRetrv.
- Proposition: For any single-term query, all of the m globally most similar documents will be retrieved by algorithm OptDocRetrv.
78. Result Merging
- Goal: Merge returned documents from multiple sources into a single ranked list.
- Difficulties
  - local similarities are usually not comparable due to
    - different similarity functions
    - different term weighting schemes
    - different statistical values (e.g., global idf vs. local idf)
  - local similarities may be unavailable to the metasearch engine (only ranks are provided)
- Ideal: rank in non-increasing order of global similarities
79. Solution Classification
- similarity normalization
  - normalize all local similarities into a common fixed range to improve comparability
- similarity adjustment
  - adjust local similarities/ranks based on the quality of the local databases
- global similarity computation
  - aim at obtaining the actual global similarities
- Merge based on the normalized/adjusted/computed similarities.
80. Similarity Normalization
- Example 1: MetaCrawler SeEt97
  - map all local similarities into [0, 1000]
  - map the largest local similarity from each source to 1000
  - map other local similarities proportionally
  - add the normalized local similarities of documents retrieved from multiple sources (see the sketch below)
  - Example:
                        D1: d1   d2   d3      D2: d1   d4   d5
    local similarity        100  200  400         0.3  0.2  0.5
    normalized              250  500  1000        600  400  1000
    final similarity: d1 = 850, d2 = 500, d3 = 1000, d4 = 400, d5 = 1000
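A small Python sketch of the MetaCrawler-style normalization and merging above, reproducing the example numbers.

```python
def normalize(local_sims, top=1000):
    """Map the largest local similarity of a source to `top`, others proportionally."""
    best = max(local_sims.values())
    return {doc: top * s / best for doc, s in local_sims.items()}

def merge(sources, top=1000):
    """Add the normalized similarities of documents returned by multiple sources."""
    merged = {}
    for local_sims in sources:
        for doc, s in normalize(local_sims, top).items():
            merged[doc] = merged.get(doc, 0.0) + s
    return merged

D1 = {"d1": 100, "d2": 200, "d3": 400}
D2 = {"d1": 0.3, "d4": 0.2, "d5": 0.5}
print(merge([D1, D2]))  # d1: 850, d2: 500, d3: 1000, d4: 400, d5: 1000
```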
81. Similarity Normalization
- Example 2: SavvySearch DrHo97
  - same as MetaCrawler except it uses the range [0, 1]
  - documents with no local similarities are assigned 0.5
- Retrieval based on Multiple Evidence
  - a normalized similarity between 0 and 1 can be interpreted as a confidence that a document is useful
  - let si be the confidence of source i that document d is useful to query q
  - estimate the overall confidence that d is useful:
    - S(d, q) = 1 - (1 - s1)...(1 - sk)
  - Example: s1 = 0.7, s2 = 0.8, S(d, q) = 0.94 (see the sketch below)
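A one-function Python sketch of the multiple-evidence combination above.

```python
def combined_confidence(confidences):
    """Multiple evidence: S(d, q) = 1 - (1 - s1)(1 - s2)...(1 - sk)."""
    prod = 1.0
    for s in confidences:
        prod *= (1.0 - s)
    return 1.0 - prod

print(combined_confidence([0.7, 0.8]))  # 1 - 0.3*0.2 = 0.94
```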
82. Similarity Adjustment
- Use the local similarity of d and the ranking score of its database to estimate the global similarity of d.
  - database ranking score: the higher, the better
- Example: CORI net CaLC95
  - assign the following weight to database D:
    - w(D) = 1 + N * (r - r') / r'
    - r = ranking score of D with respect to q
    - r' = average of the scores of the searched databases
    - N = number of local databases searched
  - adjust the local similarity s of document d in D to s * w(D)
- A similar approach is employed in ProFusion GaWG96.
83. Similarity Adjustment
- Use the local rank of d and the ranking score of its database to estimate the global similarity of d.
- Example: D-WISE YuLe97
  - Gsim(q, d) = 1 - (r - 1) * Rmin / (m * Ri)
    - Ri = ranking score of database Di
    - Rmin = lowest database ranking score
    - r = local rank of document d from Di
    - m = total number of documents desired
  - Observation: the top-ranked document from any database gets the same global similarity
84. D-WISE (continued)
- Example: R1 = 0.3, R2 = 0.7, Rmin = 0.2, m = 4
  - Gsim(q, d) = 1 - (r - 1) * 0.2 / (4 * Ri)
          D1              D2
          r   Gsim        r   Gsim
    d1    1   1.0    d1'  1   1.0
    d2    2   0.83   d2'  2   0.93
    d3    3   0.67   d3'  3   0.86
- documents from databases with higher ranking scores receive higher global similarities, so more of them end up near the top of the merged list (see the sketch below)
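A small Python sketch of the D-WISE merging formula, reproducing the table above.

```python
def dwise_gsim(r, R_i, R_min, m):
    """Convert a local rank r from database Di (ranking score R_i) into an
    estimated global similarity."""
    return 1.0 - (r - 1) * R_min / (m * R_i)

R_min, m = 0.2, 4
for name, R in (("D1", 0.3), ("D2", 0.7)):
    print(name, [round(dwise_gsim(r, R, R_min, m), 2) for r in (1, 2, 3)])
# D1 [1.0, 0.83, 0.67]   D2 [1.0, 0.93, 0.86]
```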
85. Global Similarity Computation
- Technique 1: Document Fetching (e.g., E2RD2, ParaCrawler)
  - fetch the documents to the metasearch engine
  - collect the desired statistics (tf, idf, ...)
  - compute global similarities
- Problem: may not scale well.
86. Global Similarity Computation
- Technique 2: Knowledge Discovery
  - discover the similarity functions and term weighting schemes used by the different search engines
  - use the discovered knowledge to determine
    - which local similarities are reasonably comparable
    - how to adjust local similarities to make them more comparable
    - how to compute/estimate global similarities
87. Knowledge Discovery (continued)
- Example
  - All local search engines selected for a query
    - employ the same methods for indexing local documents and computing local similarities
    - do not use idf information
    - then the local similarities are comparable
  - idf information is used and q has a single term t
    - Lsim(q, d) = tft(q)*lidft*tft(d)/(|q|*|d|) = lidft*tft(d)/|d|
    - Gsim(q, d) = gidft*tft(d)/|d|
    - Gsim(q, d) = Lsim(q, d)*gidft/lidft
88. Knowledge Discovery (continued)
- Example (continued)
  - idf information is used and q has terms t1, ..., tk
  - Gsim(q, d) can be expressed in terms of per-term quantities, each of which can be determined by using ti as a single-term query.
89. Knowledge Discovery (continued)
- submit each ti as a single-term query and let
  - si = Lsim(d, q(ti))
90. New Challenges
- Incorporate new search techniques into metasearch
  - document ranks in Google
  - Kleinberg's hub and authority scores
  - tag information in HTML documents
  - implicit user feedback on previous retrievals
  - pseudo relevance feedback on previous retrievals
  - use of user profiles
- Integrate local systems supporting different query types
  - less research on boolean queries, proximity queries and hierarchical queries
91. New Challenges (continued)
- Develop techniques to discover knowledge (representatives, ranking algorithms) about local search engines more accurately and more efficiently.
  - some search engines may be unwilling to provide the desired representatives or may provide inaccurate representatives
  - indexing techniques, term weighting schemes and similarity functions are typically proprietary
- Develop a standard guideline on what information each search engine should provide to a metasearch engine (some efforts: STARTS, Dublin Core).
92. New Challenges (continued)
- Distributed implementation of the metasearch engine
  - alternative ways to store local database representatives?
  - how to perform database selection and document selection at multiple sites in parallel?
- Scale to a million databases
  - storage of database representatives
  - fast algorithms for database selection, document selection and result merging
  - efficient network utilization
93. New Challenges (continued)
- Standard testbed for evaluation
  - need a large number of local databases
  - documents should have links for computing ranks and hub/authority scores
  - a large number of typical Internet queries
  - relevance assessments of the documents for each query
- Go beyond text databases
  - how to extend to databases containing text, images, video, audio and structured data?