MetaSearch Engine - PowerPoint PPT Presentation

About This Presentation

Title:

MetaSearch Engine

Description:

Title: String Matching Allowing Errors Author: Dept of Computer Science Last modified by: cswangl Created Date: 3/17/2004 2:48:04 AM Document presentation format – PowerPoint PPT presentation

Slides: 49

Provided by: DeptofCom6

Transcript and Presenter's Notes

1
Lecture 9 Rank Aggregation in MetaSearch

MetaSearch Engine
Social Choice Rules
Rank Aggregation

2
Choices of Search Engines

Many search engines exist to compete for users
The results are not necessarily the same
Different users prefer different search engines
Search results may, in the future, be biased
towards paid advertisements.

3
MetaSearch Engine

Metasearch engines are designed to increase coverage of the web by forwarding users' queries to multiple search engines.
Users' requests are sent to multiple search engines such as AlltheWeb, Google, and MSN.
The results from the individual search engines are then combined into a single result set that is presented to users.

4
Different Forms of MetaSearch

Submit different representations of the same query to the same search engine, then combine the results.
Submit the same query to several search engines that adopt different information retrieval models, then combine the results.

5
Issues

How to combine the results retrieved by the different source search engines is crucial for the success of a metasearch engine.
This is the problem that social choice theory has been trying to answer.

6
Search Engine Watch

Interesting metasearch engines are listed at
http://www.searchenginewatch.com/links/article.php/2156241

7
Social Choice Theory

Studies protocols that help a group of people make collective decisions, such as voting.

8
A Fundamental problem

Given a collection of agents (voters) with preferences over different alternatives (allocations, outcomes), how should society evaluate these alternatives and make a single decision for all, one that may follow the will of some voters but go against that of others?

9
Applications

Voters elect a president from several candidates.
National polls on the economic or political policy of the government
The procedure or rule of an election
The ranking of a metasearch engine obtained from the rankings of its search engines

10
Group Decisions

How do we make decisions?
Flip a coin?
Dictatorship?
Democracy (Majority rule)?

11
Group Decision Rules

Majority rule
Condorcet paradox (voting cycle)
Borda rule

12
Mathematical model

A set of voters $V = \{v_1, v_2, v_3, \dots, v_n\}$
A set of alternatives or outcomes $S = \{s_1, s_2, s_3, \dots, s_m\}$, with $|S| = m$, and
A set of preference relations $P = \{R_1, R_2, R_3, \dots, R_n\}$, called a preference profile.
The preference relation $R_i$ for each voter $i$ is a permutation (order) of the elements in $S$.

13
Example 1 Majority Rule

3 rational people have rational preferences over 2 alternatives x, y.

Pref.   Person 1   Person 2   Person 3
1st     X          Y          X
2nd     Y          X          Y

i.e. Person 1: X > Y, Person 2: Y > X, Person 3: X > Y
How to aggregate their preferences? How to choose?

Use majority rule.
More than half of the people (two out of three) prefer x to y,
so the group prefers x to y.

15
Example 2 Condorcet Paradox

3 rational people have rational preferences over 3 alternatives x, y, z.

Pref.   Person 1   Person 2   Person 3
1st     X          Y          Z
2nd     Y          Z          X
3rd     Z          X          Y

i.e. Person 1: X > Y > Z, Person 2: Y > Z > X, Person 3: Z > X > Y

16
Binary/paired Comparison With Majority rule

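Pref.   Person 1   Person 2   Person 3
1st     X          Y          Z
2nd     Y          Z          X
3rd     Z          X          Y

For the pair (X, Y): Person 1 has X > Y, Person 2 has Y > X, Person 3 has X > Y, so the majority gives X > Y.
Similarly, for (Y, Z) we get Y > Z, and for (Z, X) we get Z > X.
Then X > Y > Z > X (a cycle): the group preference is intransitive, hence not rational.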

Condorcet noted in the 18th century that, in cases like this one, no alternative can win a majority against all other alternatives.
Pairwise majority is therefore not satisfactory in all cases.
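
The voting cycle can be checked mechanically. The following is a minimal Python sketch, assuming every voter supplies a full ranking as a list (most preferred first); the function names `pairwise_majority` and `condorcet_winner` are illustrative and do not come from the lecture.

```python
from itertools import combinations

def pairwise_majority(profile):
    """Return {(a, b): number of voters who rank a ahead of b}."""
    candidates = profile[0]
    votes = {}
    for a, b in combinations(candidates, 2):
        a_votes = sum(1 for ranking in profile if ranking.index(a) < ranking.index(b))
        votes[(a, b)] = a_votes
        votes[(b, a)] = len(profile) - a_votes
    return votes

def condorcet_winner(profile):
    """Return the alternative that beats every other one by a strict majority, or None."""
    votes = pairwise_majority(profile)
    n = len(profile)
    for c in profile[0]:
        if all(2 * votes[(c, other)] > n for other in profile[0] if other != c):
            return c
    return None

# Preference profile of Example 2 (Condorcet paradox).
paradox = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
print(pairwise_majority(paradox))   # X beats Y, Y beats Z, Z beats X, each 2 to 1
print(condorcet_winner(paradox))    # None: the majority relation cycles
```

On the two-alternative profile of Example 1 the same function returns X, which matches plain majority rule.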

18
Example 3 Borda Rule

For each voter, associate the number 1 with the most preferred alternative, 2 with the second, and so on.
Assign to each alternative the sum of the numbers the individual voters associated with it.
The alternative with the smallest sum is ranked first.

Pref.   Person 1   Person 2   Person 3
1st     X(1)       Y(1)       X(1)
2nd     Y(2)       X(2)       W(2)
3rd     Z(3)       W(3)       Z(3)
4th     W(4)       Z(4)       Y(4)

Sums: X(4), Y(7), Z(10), W(9)
Then we get the choice X > Y > W > Z.

For the above example, if we use binary/paired comparison with majority rule, we get
X > Y in 2 out of 3, Y > W in 2 out of 3,
W > Z in 2 out of 3, X > W in 3 out of 3,
X > Z in 3 out of 3, Y > Z in 2 out of 3.
So we arrive at the same choice, X > Y > W > Z.

For the previous example (Example 2), where majority rule via binary/paired comparison ran into trouble, the Borda rule gives a tie between all three alternatives:
all three alternatives get a sum of 6.
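
The Borda sums above can be reproduced with a few lines of Python. This is a minimal sketch, assuming full rankings given as lists with the most preferred alternative first; the helper names and the alphabetical tie-break are assumptions made for the illustration, not part of the lecture.

```python
def borda_scores(profile):
    """Sum, for each alternative, the positions (1 = most preferred) given by all voters."""
    scores = {}
    for ranking in profile:
        for position, alt in enumerate(ranking, start=1):
            scores[alt] = scores.get(alt, 0) + position
    return scores

def borda_ranking(profile):
    """Rank alternatives by increasing Borda sum (ties broken alphabetically, an assumption)."""
    scores = borda_scores(profile)
    return sorted(scores, key=lambda alt: (scores[alt], alt))

# Example 3: four alternatives, three voters.
example3 = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
print(borda_scores(example3))    # {'X': 4, 'Y': 7, 'Z': 10, 'W': 9}
print(borda_ranking(example3))   # ['X', 'Y', 'W', 'Z']

# Example 2 (Condorcet paradox): every alternative sums to 6.
paradox = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
print(borda_scores(paradox))     # {'X': 6, 'Y': 6, 'Z': 6}
```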

Some variations
1. With relevance scores available:
allot each input system p points, to be distributed among its documents according to their relevance scores.
2. Weighted Borda rule:
the voters need not contribute equally to the final result. We may give more weight to input systems of good quality.
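
One way to read the weighted variation is to multiply each voter's position numbers by that voter's weight before summing. The sketch below assumes exactly that reading and keeps the "smaller sum is better" convention of Example 3; the function name `weighted_borda` and the weights used are hypothetical.

```python
def weighted_borda(profile, weights):
    """Weighted Borda sums: each voter's positions are scaled by that voter's weight.

    Assumption: a larger weight means the voter's ordering counts more;
    alternatives are ranked by increasing weighted sum.
    """
    scores = {}
    for ranking, weight in zip(profile, weights):
        for position, alt in enumerate(ranking, start=1):
            scores[alt] = scores.get(alt, 0) + weight * position
    ranking = sorted(scores, key=lambda alt: (scores[alt], alt))
    return ranking, scores

example3 = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
# Hypothetical weights: the third voter is trusted three times as much.
ranking, scores = weighted_borda(example3, [1, 1, 3])
print(scores)    # {'X': 6, 'Y': 15, 'Z': 16, 'W': 13}
print(ranking)   # ['X', 'W', 'Y', 'Z']
```

With equal weights the function reduces to the plain Borda rule; with the weights above, the third voter's preference for W over Y decides the order of those two alternatives.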

Condorcet winner algorithm
It also comes from social choice theory. The
Condorcet algorithm says that any candidate that
can beat all other candidates in a head-to-head
contest (pair-wise comparison) should win the
election.

Step 1: Construct the Condorcet graph.
For each candidate pair (x, y), there is an edge from x to y if x would receive at least as many votes as y in a head-to-head contest.
In the Condorcet graph there is at least one directed edge between every pair of candidates (we call such a graph semi-complete).
The graph may contain cycles. This is due to the voting paradox of Condorcet voting.

Step 2: Form a new acyclic graph from the old cyclic one by contracting all of the nodes in a cycle into one. The result is the strongly connected component graph (SCCG).
A directed graph is strongly connected if, for any two nodes u and v, there are paths from u to v and from v to u.
Definition of a strongly connected component (SCC):
a strongly connected subgraph S of a directed graph D such that no vertex or subset of vertices of D can be added to S while keeping the new subgraph strongly connected.

The graph is totally orderable at the level of the SCCs, and each SCC is a pocket of cycles within which all candidates are tied. (Why?)
Step 3: A Condorcet-consistent Hamiltonian path is any Hamiltonian path through the Condorcet graph.
Definition of a Hamiltonian path: a path between two vertices of a graph that visits each vertex exactly once.

Theorem 1. Suppose x and y are nodes in a graph g, and that X and Y are nodes of the associated SCCG G such that $x \in X$ and $y \in Y$. If there exists a path from X to Y in G, then every Condorcet path of g has x before y.
Refer to Javed A. Aslam, Mark Montague (2001) for the proof.
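
Steps 1 and 2 can be sketched in Python as follows. This is an illustrative sketch rather than the algorithm of Aslam and Montague: it assumes full rankings, finds components by plain reachability instead of an efficient SCC algorithm, and uses the fact that in a semi-complete graph an earlier component reaches strictly more nodes than a later one. All function names are hypothetical.

```python
def condorcet_graph(profile):
    """Edge x -> y if x gets at least as many head-to-head votes as y."""
    candidates = sorted(profile[0])
    edges = {c: set() for c in candidates}
    for x in candidates:
        for y in candidates:
            if x != y:
                x_votes = sum(1 for r in profile if r.index(x) < r.index(y))
                if 2 * x_votes >= len(profile):
                    edges[x].add(y)
    return edges

def reachable(edges, start):
    """Set of nodes reachable from start (including start) by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def ordered_components(edges):
    """Strongly connected components, listed in the order induced by the SCCG."""
    reach = {c: reachable(edges, c) for c in edges}
    components = []
    for c in edges:
        component = frozenset(o for o in reach[c] if c in reach[o])
        if component not in components:
            components.append(component)
    # An earlier component reaches every later one, so it reaches more nodes.
    components.sort(key=lambda comp: -len(reach[next(iter(comp))]))
    return [sorted(comp) for comp in components]

example3 = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
paradox = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
print(ordered_components(condorcet_graph(example3)))  # [['X'], ['Y'], ['W'], ['Z']]
print(ordered_components(condorcet_graph(paradox)))   # [['X', 'Y', 'Z']]
```

Any ordering that lists the components in this sequence, with an arbitrary order inside each component, is consistent with Theorem 1.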

28
Rank Aggregation in MetaSearch

Here we discuss two cases that use algorithms rooted in social choice theory for metasearch rank aggregation.
Data fusion track in TREC:
Javed A. Aslam, Mark Montague (2001), Models for Metasearch, in SIGIR 2001
Rank aggregation for web search engines:
Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar (2001), Rank Aggregation Methods for the Web, in WWW10

29
Data fusion track in TREC

TREC (Text Retrieval Conference, see http://trec.nist.gov/) maintains about 6 GB of SGML-tagged text, queries, and their respective answers for evaluation purposes.
The TREC organizers distribute data sets in advance, along with 50 new queries each year.
The competing teams then submit the ranked lists of documents that their systems returned in response to each query, and these retrieval systems are evaluated.

These ranked lists are available for metasearch researchers to download and use.
For each query, every retrieval system returns its top 1000 documents, and relevance scores are available.
Given these results retrieved by many different retrieval systems, how can they be aggregated for better performance?

31
Previous algorithms

Min, Max and Average Models
Fox and Shaw,1995
Linear Combination Model
Bartell 1995
Logistic Regression Model

32
Example

Min, Max and Average model
The final score of each document d is based on the scores given to d by each of the input systems (voters).

Algorithm   Final score
CombMin     minimum of individual relevance scores
CombMed     median of individual relevance scores
CombMax     maximum of individual relevance scores
CombSum     sum of individual relevance scores
CombANZ     CombSum / number of non-zero relevance scores
CombMNZ     CombSum × number of non-zero relevance scores
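
The six combination rules can be written down directly. The sketch below assumes that a document's scores arrive as one list with an entry per input system, using 0 for systems that did not retrieve the document; the function name `comb_scores` is illustrative.

```python
from statistics import median

def comb_scores(scores):
    """Compute the Comb* fusion scores for one document.

    scores: relevance scores given to the document by each input system
            (assumption: 0 means the system did not retrieve it).
    """
    nonzero = [s for s in scores if s != 0]
    total = sum(scores)
    return {
        "CombMin": min(scores),
        "CombMed": median(scores),
        "CombMax": max(scores),
        "CombSum": total,
        "CombANZ": total / len(nonzero) if nonzero else 0.0,
        "CombMNZ": total * len(nonzero),
    }

# Four hypothetical input systems; the second one did not retrieve the document.
result = comb_scores([0.5, 0.0, 0.25, 0.25])
print(result["CombSum"], result["CombMNZ"])  # 1.0 3.0
```

CombMNZ rewards documents retrieved by many systems, while CombANZ averages only over the systems that actually retrieved the document.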

Linear Combination Model (LC model)
The final score of document d is a linear combination of the normalized relevance scores given to it, with each input system weighted differently:
$s(d) = \sum_i a_i \, s_i(d)$
$a_i$: weight of input system i
$s_i(d)$: normalized relevance score given to d by input system i
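
A small Python sketch of the linear combination model follows. The slides do not say how scores are normalized, so min-max normalization per input system is an assumption here, as are the function names and the example weights; documents a system did not return contribute 0 for that system.

```python
def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] (normalization scheme is an assumption)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def linear_combination(system_scores, weights):
    """Fused score of each document: sum over systems of weight * normalized score."""
    fused = {}
    for scores, weight in zip(system_scores, weights):
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weight * s
    return fused

# Two hypothetical input systems with raw scores on different scales.
systems = [{"d1": 10, "d2": 0}, {"d1": 0, "d2": 4, "d3": 2}]
fused = linear_combination(systems, [1.0, 0.5])
print(sorted(fused, key=fused.get, reverse=True))  # ['d1', 'd2', 'd3']
```

In practice the weights would be learned from training data; here they are fixed by hand for the illustration.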

34
Experiment result on TREC Model

The performance of rank aggregation is evaluated by average precision over the queries.
Score-based Borda-fuse (LC model) is usually the best method among the Borda variants.
It is better than the best input system on most of the data collections, such as TREC3 and TREC5.

35
Experiment result II

The performance of rank aggregation is evaluated by average precision over the queries.
Condorcet-fusion is the only algorithm that, without training data, ever matches the performance of the best input system on TREC 9.
Condorcet-fusion seems particularly sensitive to dependence among the input systems. If the input systems (voters) are too similar, performance decreases.

36
Rank aggregation methods for web

New challenges: different from the case of TREC data fusion,
the coverage of the various search engines differs.
Thus some highly relevant web pages may not be ranked by some search engines.
Therefore, each voter ranks only a partial candidate list.

37
Preliminaries

Given a universe U, an ordered list $\tau$ with respect to U is an ordering of a subset $S \subseteq U$,
i.e., $\tau = [x_1 \ge x_2 \ge \dots \ge x_d]$, with each $x_i \in S$, where $\ge$ is some ordering relation on S.
If $\tau$ contains all the elements in U, then it is said to be a full list;
otherwise it is called a partial list.

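Distance measures between two full lists with respect to a set S
The Kendall tau distance
It counts the number of pairwise disagreements between two lists.
For full lists $\sigma$ and $\tau$, with $\sigma(i)$ denoting the position of element i in $\sigma$, the distance is given by
$K(\sigma, \tau) = |\{\{i, j\} \subseteq S : (\sigma(i) - \sigma(j))(\tau(i) - \tau(j)) < 0\}|$
Normalize it by dividing by the maximum possible value, $\binom{|S|}{2} = |S|(|S|-1)/2$.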
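
As an illustration, the Kendall tau distance can be computed by checking every pair of elements. This is a minimal quadratic-time sketch, assuming both arguments are full lists over the same elements with no repeats; the function names are illustrative.

```python
from itertools import combinations

def kendall_tau(sigma, tau):
    """Number of element pairs that the two full lists order differently."""
    pos_s = {x: i for i, x in enumerate(sigma)}
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (pos_s[a] - pos_s[b]) * (pos_t[a] - pos_t[b]) < 0
    )

def normalized_kendall_tau(sigma, tau):
    """Kendall tau distance divided by its maximum n(n-1)/2 (requires n >= 2)."""
    n = len(sigma)
    return kendall_tau(sigma, tau) / (n * (n - 1) / 2)

print(kendall_tau(["a", "b", "c", "d"], ["a", "b", "d", "c"]))  # 1
print(normalized_kendall_tau([1, 2, 3], [3, 2, 1]))             # 1.0
```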

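Spearman footrule distance
Given two full lists $\sigma$ and $\tau$, with $\sigma(i)$ denoting the position of element i in $\sigma$, the distance is given by
$F(\sigma, \tau) = \sum_{i \in S} |\sigma(i) - \tau(i)|$
Normalize it by dividing by the maximum value, $\lfloor |S|^2 / 2 \rfloor$.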
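Distance measures for more than two lists
Given several full lists $\tau_1, \dots, \tau_k$, for instance, the normalized footrule distance of $\sigma$ to $\tau_1, \dots, \tau_k$ is given by
$F(\sigma, \tau_1, \dots, \tau_k) = \frac{1}{k} \sum_{i=1}^{k} F(\sigma, \tau_i)$
If $\tau_1, \dots, \tau_k$ are partial lists, let U denote the union of the elements in $\tau_1, \dots, \tau_k$ and let $\sigma$ be a full list with respect to U.
Considering the distance between $\tau_i$ and the projection $\sigma_{|\tau_i}$ of $\sigma$ with respect to $\tau_i$, we have the induced footrule distance
$F(\sigma, \tau_1, \dots, \tau_k) = \frac{1}{k} \sum_{i=1}^{k} F(\sigma_{|\tau_i}, \tau_i)$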
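
A sketch of the induced footrule distance for partial lists: the full list is projected onto the elements of each partial list, the normalized footrule distance to that list is taken, and the results are averaged. Normalizing each term by the length of its own partial list is an assumption of this sketch, and the function names are illustrative.

```python
def footrule(sigma, tau):
    """Spearman footrule distance between two lists over the same elements."""
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(abs(i - pos_t[x]) for i, x in enumerate(sigma))

def induced_footrule(sigma, partial_lists):
    """Average normalized footrule distance between each partial list and
    the projection of the full list sigma onto that list's elements."""
    total = 0.0
    for tau in partial_lists:
        members = set(tau)
        projection = [x for x in sigma if x in members]  # sigma restricted to tau
        max_value = len(tau) * len(tau) // 2
        if max_value:  # lists with fewer than 2 elements contribute 0
            total += footrule(projection, tau) / max_value
    return total / len(partial_lists)

sigma = ["a", "b", "c", "d"]
partial = [["a", "c"], ["d", "b", "a"]]
print(induced_footrule(sigma, partial))  # 0.5
```

In the example, the first partial list agrees with the projection (distance 0) and the second is its exact reverse (normalized distance 1), so the average is 0.5.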

41
Optimal rank aggregation

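The question is:
given (full or partial) lists $\tau_1, \dots, \tau_k$,
find a $\sigma$ such that $\sigma$ is a full list with respect to the union of the elements of $\tau_1, \dots, \tau_k$, and $\sigma$ minimizes the distance $d(\sigma, \tau_1, \dots, \tau_k)$.
The aggregation obtained by optimizing the Kendall distance is called Kemeny optimal aggregation.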

When $k \ge 4$, computing the Kemeny optimal aggregation is NP-hard.
(Please refer to Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar (2001) for the detailed proof.)
We can use the Spearman footrule distance to approximate the Kendall distance.
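
For very small inputs the Kemeny optimal aggregation can be found by exhaustive search, which also makes the cost of the exact approach visible: the loop below visits all n! permutations. This is a brute-force sketch for illustration only; it sums unnormalized Kendall distances to the projections (an assumption for partial lists), and the function names are hypothetical.

```python
from itertools import combinations, permutations

def kendall_tau(sigma, tau):
    """Number of element pairs that the two lists order differently."""
    pos_s = {x: i for i, x in enumerate(sigma)}
    pos_t = {x: i for i, x in enumerate(tau)}
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (pos_s[a] - pos_s[b]) * (pos_t[a] - pos_t[b]) < 0
    )

def kemeny_brute_force(lists):
    """Try every permutation of the union of elements; keep one of minimum cost."""
    universe = sorted(set().union(*lists))
    best, best_cost = None, None
    for candidate in permutations(universe):
        cost = 0
        for tau in lists:
            members = set(tau)
            projection = [x for x in candidate if x in members]
            cost += kendall_tau(projection, tau)
        if best_cost is None or cost < best_cost:
            best, best_cost = list(candidate), cost
    return best, best_cost

example3 = [["X", "Y", "Z", "W"], ["Y", "X", "W", "Z"], ["X", "W", "Z", "Y"]]
print(kemeny_brute_force(example3))  # (['X', 'Y', 'W', 'Z'], 4)
```

On the Borda example the search returns the same order as pairwise majority, X > Y > W > Z, with a total of four pairwise disagreements.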

43
LCS approach (My own method)

Given m lists
$l_{1,1}, l_{1,2}, \dots, l_{1,n_1}$
$l_{2,1}, l_{2,2}, \dots, l_{2,n_2}$
$l_{3,1}, l_{3,2}, \dots, l_{3,n_3}$
...
$l_{m,1}, l_{m,2}, \dots, l_{m,n_m}$,
find a longest common subsequence (LCS) of these lists.

44
LCS approach (My own method)

LCS is NP-hard for m sequences if some elements may appear twice in a sequence.
For the lists obtained from search engines, each document appears at most once.
There exists an efficient algorithm that solves the problem for this special case.
Assume $n_i = n_j$ for $i, j = 1, 2, \dots, m$.

45
Efficient algorithm for LCS of m sequences

Fix the order of the first sequence as $1, 2, \dots, n_1$.
Define d(i) to be the length of the LCS of the elements $1, 2, \dots, i$ that contains i.

46
Computation of d(i,1) and d(i,2)

$d(i) = \max_k \{ d(k) + 1 \}$, taken over all k such that k comes before i in all m lists. (If no such k exists, $d(i) = 1$.)
The length of the LCS is $\max_i d(i)$ for $i = 1, 2, \dots, n_1$.
A backtracking process then gives the LCS itself.
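
The recurrence for d(i) translates into a short dynamic program. The sketch below assumes, as in the slides, that all m lists contain the same elements and that each element appears exactly once per list; the function name and the `prev` back-pointer array are illustrative. It runs in O(m · n²) time.

```python
def lcs_of_permutations(lists):
    """Longest common subsequence of m lists that are permutations of one another."""
    first = lists[0]
    n = len(first)
    if n == 0:
        return []
    positions = [{x: i for i, x in enumerate(lst)} for lst in lists]
    d = [1] * n        # d[i]: length of the longest common subsequence ending at first[i]
    prev = [-1] * n    # back-pointers used to recover the subsequence
    for i in range(n):
        for k in range(i):
            # first[k] may precede first[i] only if it does so in every list.
            if d[k] + 1 > d[i] and all(p[first[k]] < p[first[i]] for p in positions):
                d[i] = d[k] + 1
                prev[i] = k
    end = max(range(n), key=d.__getitem__)
    result = []
    while end != -1:   # backtrack from the best end point
        result.append(first[end])
        end = prev[end]
    return result[::-1]

rankings = [[1, 2, 3, 4, 5], [2, 1, 3, 5, 4], [1, 3, 2, 4, 5]]
print(lcs_of_permutations(rankings))  # [1, 3, 4]
```

In the example, element 1 precedes 3 and 3 precedes 4 in all three lists, while no common subsequence of length four exists.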

47
An Example