Title: Voting, Meta-search, and Bioconsensus
Fred S. Roberts
Department of Mathematics and DIMACS (Center for Discrete Mathematics and Theoretical Computer Science), Rutgers University, Piscataway, New Jersey
From Social Science Methods to Information Technology Applications
Over the years, social scientists have developed a variety of methods for dealing with problems of voting, decisionmaking, conflict and cooperation, measurement, etc. These methods, often heavily mathematical, are beginning to find novel uses in a variety of information technology applications.
Such methods will need to be substantially improved to deal with issues such as:
- Computational intractability
- Limitations on computational power/information
- The sheer size of some of the new applications
- Learning through repetition
- Security, privacy, and cryptography
This talk will concentrate on social science methods for dealing with voting and decisionmaking. We will look briefly at various applications of these methods to a variety of information technology problems and then concentrate on a particular application to biological databases.
Voting/Group Decisionmaking
In a standard model for voting, each member of a group gives an opinion, and we seek a consensus among these opinions. Sometimes the opinion is just a vote for a first choice among a set of alternative choices or candidates. In other contexts, the opinion might be a ranking of all the alternatives.
Obtaining opinions as rankings of the alternatives or candidates can give much more information about a voter's true preferences than simply obtaining a first choice. But then we face the challenge of defining what we mean by a consensus. In many applications, we seek a ranking that is in some sense a consensus of the rankings provided by all of the voters.
Medians and Means
Among the most important directions of research in the theory of group consensus is the idea that we can obtain a group consensus by first finding a way to measure the distance between any two alternatives or any two rankings. Let M be the set of alternatives (candidates) or the set of rankings of alternatives, and let d(a,b) be the distance between a and b in M. A profile (of opinions) is a vector (a1, a2, ..., an) of points from M.
The median of a profile is the set of all points x of M that minimize d(a1,x) + d(a2,x) + ... + d(an,x), and the mean is the set of all points x of M that minimize d(a1,x)^2 + d(a2,x)^2 + ... + d(an,x)^2.
One very commonly used method for measuring the distance between two rankings of candidates is the Kemeny-Snell distance: twice the number of pairs of candidates i and j for which i is ranked above j in one ranking and below j in the other, plus the number of pairs that are ranked in one ranking and tied in the other.
Consider the following profile:
Voter 1 (a1): Bush, Gore, Nader
Voter 2 (a2): Bush, Gore, Nader
Voter 3 (a3): Gore, Bush, Nader
For this profile, the Kemeny-Snell median is the ranking x = Bush, Gore, Nader. We have d(a1,x) + d(a2,x) + d(a3,x) = 0 + 0 + 2 = 2. However, the Kemeny-Snell mean is the ranking y = Bush-Gore, Nader, in which Bush and Gore are tied for first place. For d(a1,y)^2 + d(a2,y)^2 + d(a3,y)^2 = 1 + 1 + 1 = 3, while d(a1,x)^2 + d(a2,x)^2 + d(a3,x)^2 = 0 + 0 + 4 = 4.
Note that medians or means need not be unique.
Voter 1: Bush, Gore, Nader
Voter 2: Gore, Nader, Bush
Voter 3: Nader, Bush, Gore
This is the voter's paradox situation. In this case, there are three Kemeny-Snell medians: these three rankings. However, there is a unique Kemeny-Snell mean, the ranking in which all three candidates are tied. Because of non-uniqueness, we think of consensus as defining a function F from the set of profiles to the set of sets of rankings. Kenneth Arrow called this a group consensus function or social welfare function.
In case the elements of M are rankings, calculation of medians and means can be quite difficult.
Theorem (Bartholdi, Tovey, and Trick, 1989; Wakabayashi, 1986): Under the Kemeny-Snell distance, the calculation of the median of a profile of rankings is an NP-complete problem.
Meta-Search and Other Information Technology Applications of Consensus Methods
Meta-search is the process of combining the results of several search engines. We seek the consensus of the search engines, whether it is a consensus first-choice website or a consensus ranking of websites.
Meta-search has been studied using consensus methods by Cohen, Schapire, and Singer (1999) and Dwork, Kumar, Naor, and Sivakumar (2000). One important point: in voting, there are usually few candidates and many voters; in meta-search, there are usually few voters and many candidates. In this setting, Dwork et al. developed an approximation to the Kemeny-Snell median that preserves most of its desirable properties and is computationally tractable.
Information Retrieval: We rank documents according to their probability of relevance to a query. Given a number of queries, we seek a consensus ranking.
Collaborative Filtering: We use knowledge of the behavior of multiple users to make recommendations to an active user, for example combining movie ratings by others to prepare an ordered list of movies for a given user.
Consensus methods have been applied to collaborative filtering by Freund, Iyer, Schapire, and Singer (1998), who designed an efficient boosting system for combining preferences. Pennock, Horvitz, and Giles (2000) applied consensus methods to develop recommender systems.
Software Measurement: Combining several different measures or ratings through appropriate consensus methods is an important topic in the measurement of the understandability, quality, functionality, reliability, efficiency, usability, maintainability, or portability of software (Fenton and Pfleeger, 1997).
Ordinal Filtering in Digital Image Processing: In one method of noise removal, to check whether a pixel is noise, one compares it with neighboring pixels. If the values differ beyond a certain threshold, one replaces the value of the given pixel with a mean or median of the values of its neighbors (see Janowitz (1986)).
Related methods are used in models of distributed consensus: a number of processors each hold an initial (binary) value; some of them may be faulty and ignore any protocol, yet it is required that the non-faulty processors eventually agree (reach consensus) on a value.
Berman and Garay (1993) developed a protocol for distributed consensus based on the parliamentary procedure known as cloture and showed that it performs very well on a number of important criteria, including polynomial computation and communication.
Bioconsensus
In recent years, methods of consensus developed for applications in the social sciences have become widely used in biology. In molecular biology alone, Bill Day has compiled a bibliography of hundreds of papers that use such consensus methods.
The following are some of the ways that bioconsensus problems arise:
- Alternative phylogenies (evolutionary trees) are produced using different methods, and we need to choose a consensus tree.
- Alternative taxonomies (classifications) are produced using different models, and we need to choose a consensus taxonomy.
- Alternative molecular sequences are produced using different criteria or different algorithms, and we need to choose a consensus sequence.
- Alternative sequence alignments are produced, and we need to choose a consensus alignment.
Finding a Pattern or Feature Appearing in a Set of Molecular Sequences
In many problems of the social and biological sciences, data is presented as a sequence or word over some alphabet Σ. Given a set of sequences, we seek a pattern or feature that appears widely, and we think of this as a consensus sequence or set of sequences. A pattern is often thought of as a consecutive subsequence of short, fixed length. In biology, such sequences arise from DNA, RNA, proteins, etc.
Why Look for Such Patterns? Similarities between sequences or parts of sequences lead to the discovery of shared phenomena. For example, it was discovered that the sequence for platelet-derived growth factor, which causes growth in the body, is 87% identical to the sequence for v-sis, a cancer-causing gene. This led to the discovery that v-sis works by stimulating growth.
In recent years, we have developed huge databases
of molecular sequences. For example, GenBank has
over 7 million sequences comprising 8.6 billion
bases. The search for similarity or patterns has
extended from pairs of sequences to finding
patterns that appear in common in a large number
of sequences or throughout the database. To find
patterns in a database of sequences, it is useful
to measure the distance between sequences. If a
and b are sequences of the same length, a
common way to define the distance d(a,b) is to
take it to be the number of mismatches between
the sequences.
To measure how closely a pattern fits into a sequence, we have to measure the distance between sequences of different lengths. If b is longer than a, then d(a,b) could be the smallest number of mismatches over all possible alignments of a as a consecutive subsequence of b. We call this the best-mismatch distance.
Example: a = 0011, b = 111010. Possible alignments:
111010     111010     111010
0011        0011        0011
The best-mismatch distance is 2, which is achieved in the third alignment.
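A minimal Python sketch (my own, not from the talk) of the best-mismatch distance, checked on this example:

```python
def mismatches(w1, w2):
    """Number of positions at which two equal-length words differ."""
    return sum(c1 != c2 for c1, c2 in zip(w1, w2))

def best_mismatch(a, b):
    """Best-mismatch distance: the fewest mismatches over all alignments
    of the shorter word a as a consecutive block of the longer word b."""
    k = len(a)
    return min(mismatches(a, b[i:i + k]) for i in range(len(b) - k + 1))

print(best_mismatch("0011", "111010"))   # 2 (third alignment: 0011 vs 1010)
```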
An alternative way to measure d(a,b) is to count the smallest number of mismatches between sequences obtained from a and b by inserting gaps in appropriate places, where a mismatch between a letter of Σ and a gap is counted as an ordinary mismatch. We won't use this alternative measure of distance.
Waterman (1989), Waterman, Galas, and Arratia (1984), Galas, Eggert, and Waterman (1985), and others study the following situation:
- Σ is a finite alphabet
- k is a fixed finite number (the pattern length)
- A profile Π = (a1, a2, ..., an) consists of a set of words (sequences) of length L from Σ, with L ≥ k
We seek a set F(Π) = F(a1, a2, ..., an) of consensus words of length k from Σ. Here is a small piece of data from Waterman (1989), in which he looks at 59 bacterial promoter sequences:
RRNABP1: ACTCCCTATAATGCGCCA
TNAA:    GAGTGTAATAATGTAGCC
UVRBP2:  TTATCCAGTATAATTTGT
SFC:     AAGCGGTGTTATAATGCC
Notice that if we are looking for patterns of length 4, each sequence contains the pattern TAAT.
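This observation is easy to verify mechanically; the following sketch (mine, using the best-mismatch distance defined earlier) confirms that TAAT occurs exactly (distance 0) in each of the four sequences shown:

```python
def best_mismatch(a, b):
    """Fewest mismatches over all alignments of a inside b."""
    k = len(a)
    return min(sum(x != y for x, y in zip(a, b[i:i + k]))
               for i in range(len(b) - k + 1))

promoters = {
    "RRNABP1": "ACTCCCTATAATGCGCCA",
    "TNAA":    "GAGTGTAATAATGTAGCC",
    "UVRBP2":  "TTATCCAGTATAATTTGT",
    "SFC":     "AAGCGGTGTTATAATGCC",
}
for name, seq in promoters.items():
    print(name, best_mismatch("TAAT", seq))   # distance 0 for every sequence
```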
However, suppose that we add another sequence:
M1 RNA: AACCCTCTATACTGCGCG
The pattern TAAT does not appear here. However, it almost appears, since the word TACT appears, and this has only one mismatch from the pattern TAAT. So, in some sense, TAAT is still a good consensus pattern. We now make this idea precise.
In practice, the problem is a bit more complicated than we have described it. We have long sequences, and we consider windows of length L beginning at a fixed position, say the jth. Thus, we consider words of length L in a long sequence, beginning at the jth position. For each possible pattern of length k, we ask how closely it can be matched in each of the sequences in a window of length L starting at the jth position.
Formalization: Let Σ be a finite alphabet of size at least 2 and Π be a finite collection of words of length L on Σ. Let F(Π) be the set of words of length k ≥ 2 that are our consensus patterns. (We drop the distinction between profile as vector and profile as set.) Let Π = {a1, a2, ..., an}. One way to define F(Π) is as follows. Let d(a,b) be the best-mismatch distance. Consider nonnegative parameters λd that are monotone decreasing with d, and let F(a1, a2, ..., an) be all those words w of length k that maximize the score
sΠ(w) = λd(w,a1) + λd(w,a2) + ... + λd(w,an),
where λd(w,ai) denotes the parameter λd with d = d(w,ai).
We call such an F a Waterman consensus. In particular, Waterman and others use the parameters λd = (k-d)/k.
Example: An alphabet used frequently is the purine/pyrimidine alphabet {R,Y}, where R = A (adenine) or G (guanine) and Y = C (cytosine) or T (thymine). For simplicity, we use the digits 0,1 rather than the letters R,Y. Thus, let Σ = {0,1} and k = 2. Then the possible pattern words are 00, 01, 10, 11.
Suppose a1 = 111010, a2 = 111111. How do we find F(a1,a2)? We have
d(00,a1) = 1, d(00,a2) = 2
d(01,a1) = 0, d(01,a2) = 1
d(10,a1) = 0, d(10,a2) = 1
d(11,a1) = 0, d(11,a2) = 0
sΠ(00) = λ1 + λ2, sΠ(01) = λ0 + λ1, sΠ(10) = λ0 + λ1, sΠ(11) = λ0 + λ0.
As long as λ0 > λ1 > λ2, it follows that 11 is the consensus pattern according to Waterman's consensus.
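The scoring in this example can be sketched in a few lines of Python (my own illustration, using Waterman's parameters λd = (k-d)/k):

```python
from itertools import product

def best_mismatch(a, b):
    """Fewest mismatches over all alignments of a inside b."""
    k = len(a)
    return min(sum(x != y for x, y in zip(a, b[i:i + k]))
               for i in range(len(b) - k + 1))

def waterman_scores(profile, k, lam, alphabet="01"):
    """Score each length-k pattern w by s(w) = sum_i lam[d(w, a_i)]."""
    return {"".join(p): sum(lam[best_mismatch("".join(p), a)] for a in profile)
            for p in product(alphabet, repeat=k)}

k = 2
lam = [(k - d) / k for d in range(k + 1)]   # lambda_d = (k-d)/k: [1.0, 0.5, 0.0]
scores = waterman_scores(["111010", "111111"], k, lam)
print(scores)                     # {'00': 0.5, '01': 1.5, '10': 1.5, '11': 2.0}
print(max(scores, key=scores.get))  # '11'
```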
Example: Let Σ = {0,1}, k = 3, and consider F(a1,a2,a3) where a1 = 000000, a2 = 100000, a3 = 111110. The possible pattern words are 000, 001, 010, 011, 100, 101, 110, 111.
d(000,a1) = 0, d(000,a2) = 0, d(000,a3) = 2
d(001,a1) = 1, d(001,a2) = 1, d(001,a3) = 2
d(100,a1) = 1, d(100,a2) = 0, d(100,a3) = 1, etc.
sΠ(000) = λ2 + 2λ0, sΠ(001) = λ2 + 2λ1, sΠ(100) = 2λ1 + λ0, etc.
Now, λ0 > λ1 > λ2 implies that sΠ(000) > sΠ(001). Similarly, one shows that the score is maximized by sΠ(000) or sΠ(100). Monotonicity alone doesn't say which of these is highest.
Other Consensus Functions: The median is the collection of words w of length k that minimize
ρΠ(w) = d(w,a1) + d(w,a2) + ... + d(w,an).
The mean is the collection of words w of length k that minimize
σΠ(w) = d(w,a1)^2 + d(w,a2)^2 + ... + d(w,an)^2.
Another measure that it might be of interest to minimize is a convex combination of these two:
τΠ(w) = α ρΠ(w) + (1-α) σΠ(w), α ∈ [0,1].
Words that minimize τΠ will be called the mixed median-mean. This might be of interest if we are not ready to choose either medians or means, or want some combination of the two. We might also choose to minimize Σi d(w,ai)^m or Σi log[d(w,ai)^m] for fixed m.
Example: Let Σ = {0,1}, k = 2, Π = {a1,a2,a3,a4}, a1 = 1111, a2 = 0000, a3 = 1000, a4 = 0001. Possible pattern words: 00, 01, 10, 11.
Σi d(00,ai) = 2, Σi d(01,ai) = 3, Σi d(10,ai) = 3, Σi d(11,ai) = 4. Thus, 00 is the median.
Σi d(00,ai)^2 = 4, Σi d(01,ai)^2 = 3, Σi d(10,ai)^2 = 3, Σi d(11,ai)^2 = 6, so the mean consists of the two words 01 and 10, neither of which is a median.
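These sums can be checked mechanically; a small sketch of my own, reusing the best-mismatch distance:

```python
def best_mismatch(a, b):
    """Fewest mismatches over all alignments of a inside b."""
    k = len(a)
    return min(sum(x != y for x, y in zip(a, b[i:i + k]))
               for i in range(len(b) - k + 1))

profile = ["1111", "0000", "1000", "0001"]
patterns = ["00", "01", "10", "11"]
rho   = {w: sum(best_mismatch(w, a)      for a in profile) for w in patterns}
sigma = {w: sum(best_mismatch(w, a) ** 2 for a in profile) for w in patterns}
print(rho)    # {'00': 2, '01': 3, '10': 3, '11': 4}  -> the median is 00
print(sigma)  # {'00': 4, '01': 3, '10': 3, '11': 6}  -> the means are 01 and 10
```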
Summary of Notation:
sΠ(w) = λd(w,a1) + ... + λd(w,an)
ρΠ(w) = d(w,a1) + ... + d(w,an)
σΠ(w) = d(w,a1)^2 + ... + d(w,an)^2
τΠ(w) = α ρΠ(w) + (1-α) σΠ(w), α ∈ [0,1]
The Special Case λd = (k-d)/k: Suppose that λd = (k-d)/k. We have
sΠ(w) = Σi (k - d(w,ai))/k = n - (1/k) Σi d(w,ai) = n - (1/k) ρΠ(w).
Thus, for fixed k ≥ 2, Σ of size at least 2, and any size set Π, for all words w, w′ of length k:
ρΠ(w) ≤ ρΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′).
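The identity sΠ(w) = n - (1/k) ρΠ(w) is easy to confirm numerically; here is a quick check of mine on a small random profile:

```python
from itertools import product
import random

def best_mismatch(a, b):
    """Fewest mismatches over all alignments of a inside b."""
    k = len(a)
    return min(sum(x != y for x, y in zip(a, b[i:i + k]))
               for i in range(len(b) - k + 1))

random.seed(0)
k, L, n = 3, 7, 5
profile = ["".join(random.choice("01") for _ in range(L)) for _ in range(n)]
lam = [(k - d) / k for d in range(k + 1)]      # lambda_d = (k-d)/k

for pat in product("01", repeat=k):
    w = "".join(pat)
    s   = sum(lam[best_mismatch(w, a)] for a in profile)   # s_Pi(w)
    rho = sum(best_mismatch(w, a) for a in profile)        # rho_Pi(w)
    assert abs(s - (n - rho / k)) < 1e-9   # s = n - rho/k for every pattern
print("identity verified for all", 2 ** k, "patterns")
```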
It follows that for fixed k ≥ 2, Σ of size at least 2, and any size set Π, there is a choice of the parameters λd so that the Waterman consensus is the same as the median. (This also holds for k = 1 or Σ of size 1, but these are uninteresting cases.) Similarly, one can show that for any fixed k ≥ 2, Σ of size at least 2, and any size set Π, there is a choice of parameters λd so that for all words w, w′ of length k:
σΠ(w) ≤ σΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′).
For this choice of λd, a word is a Waterman consensus iff it is a mean.
More generally, for all rational numbers α ∈ [0,1], for fixed k ≥ 2, Σ of size at least 2, and any size set Π, there is a choice of parameters λd so that for all words w, w′ of length k:
τΠ(w) ≤ τΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′).
For this choice of λd, a word is a Waterman consensus iff it is a mixed median-mean with convex combination depending upon α.
What Parameters λd Give Rise to the Median, Mean, or Mixed Median-Mean?
Let us first decide whether Π can have repeated words. From the point of view of the application, this is a reasonable assumption. (Repeats are allowed in the database, or some words have more significance than others.) We shall investigate both the repetitive case, where Π is allowed to have repeated words, and the nonrepetitive case. The following results are joint with Boris Mirkin.
When Do We Get the Median in Waterman Consensus?
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least two letters, and λd is a sequence with λ1 < λ0. Then the following are equivalent:
(a) The equivalence ρΠ(w) ≤ ρΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′) holds for all words w, w′ of length k from Σ and all finite nonrepetitive sets Π (of any size) of words of length L ≥ k from Σ.
(b) There are constants B, C with B < 0 such that for all 0 ≤ j ≤ k, λj = Bj + C.
In other words, under the hypotheses of the theorem, the median procedure corresponds exactly to the choice of parameters λj = Bj + C, B < 0. Note that λj = (k-j)/k is a special case of this.
Remark: This theorem (and all subsequent theorems) also holds if we replace words of length L ≥ k by words of fixed length L, L ≥ k.
When Do We Get the Mean in Waterman Consensus?
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least two letters, and λd is a sequence with λ1 < λ0. Then the following are equivalent:
(a) The equivalence σΠ(w) ≤ σΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′) holds for all words w, w′ of length k from Σ and all finite sets Π (of any size) of words of length L ≥ k from Σ.
(b) There are constants A, C with A < 0 such that for all 0 ≤ j ≤ k, λj = Aj^2 + C.
In other words, under the hypotheses of the theorem, the mean procedure corresponds exactly to the choice of parameters λj = Aj^2 + C, A < 0. Note that we require (a) to hold for all finite sets Π, even those with repetitions. It is a technicality to try to remove this hypothesis. To do so, we have found it necessary to allow a larger alphabet or to take k sufficiently large, and also to consider only L larger than k. The first result uses an alphabet of size at least four.
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least four letters, and λd is a sequence with λ1 < λ0. Then the following are equivalent:
(a) The equivalence σΠ(w) ≤ σΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′) holds for all words w, w′ of length k from Σ and all finite nonrepetitive sets Π (of any size) of words of length L > k from Σ.
(b) There are constants A, C with A < 0 such that for all 0 ≤ j ≤ k, λj = Aj^2 + C.
The next result removes the hypothesis that the alphabet has size at least 4, but adds the hypothesis that k ≥ 3.
Theorem: Suppose k is fixed, k ≥ 3, Σ is an alphabet of at least two letters, and λd is a sequence with λ1 < λ0. Then the following are equivalent:
(a) The equivalence σΠ(w) ≤ σΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′) holds for all words w, w′ of length k from Σ and all finite nonrepetitive sets Π (of any size) of words of length L > k from Σ.
(b) There are constants A, C with A < 0 such that for all 0 ≤ j ≤ k, λj = Aj^2 + C.
When Do We Get the Mixed Median-Mean in Waterman Consensus?
Recall that τΠ(w) = α ρΠ(w) + (1-α) σΠ(w), α ∈ [0,1]. In the following, we shall assume that α is a rational number. It might be of purely technical interest to figure out what happens when α is irrational, but we have not been able to obtain analogous results for this case.
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least two letters, and λd is a sequence with λ1 < λ0. Then the following are equivalent:
(a) The equivalence τΠ(w) ≤ τΠ(w′) ⟺ sΠ(w) ≥ sΠ(w′) holds for all words w, w′ of length k from Σ and all finite sets Π (of any size) of words of length L ≥ k from Σ.
(b) There are constants D, E with D < 0 such that for all 0 ≤ j ≤ k, λj = D(1-α)j^2 + Dαj + E.
In other words, under the hypotheses of the theorem, the mixed median-mean procedure corresponds exactly to the choice of parameters λj = D(1-α)j^2 + Dαj + E, D < 0. If we only want to require that the equivalence in part (a) holds for finite nonrepetitive sets Π, one way to do so, as with the results about the mean (σΠ), is to assume that Σ is sufficiently large. We have been able to prove this result under the added hypothesis that L > k and the added hypothesis that Σ has at least r(α) + 1 elements, where 2(2-α) = s/t for s, t positive integers and r(α) = max{s-t, t+1}.
Other Consensus Functions as Special Cases of Waterman's Consensus: It would be interesting to study conditions under which other consensus methods are special cases of Waterman's. Of particular interest might be such consensus methods as minimizing Σi d(w,ai)^m or minimizing Σi log[d(w,ai)^m].
Algorithms: In practical applications in molecular biology, good algorithms for obtaining a consensus pattern are essential. Waterman and his co-authors provide a method for computing their consensus patterns in the case λd = (k-d)/k.
The Brute-Force Algorithm: Suppose that D(w,w′) is the number of mismatches between two words w, w′ of length k. The most naive algorithm for finding the Waterman consensus would proceed by brute force and calculate all best-mismatch distances d(w,a), for w a potential pattern word of length k and a ∈ Π, by calculating D(w,w′) for all words w′ of length k in a.
Suppose that c is the cost of computing any D(w,w′). Let a be any word of length L and w be a word of length k. If the best-mismatch distance d(w,a) is calculated by computing D(w,w′) for all w′ of length k in a, then since there are L-k+1 such words w′ in a, we calculate d(w,a) with cost (L-k+1)c. If p = |Σ|, there are p^k potential pattern words w. If there are n words in Π, then the brute-force algorithm can compute the scores sΠ(w) for all potential pattern words w, and hence obtain the optimal pattern word, by making calculations of total cost p^k n(L-k+1)c.
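The brute-force algorithm can be sketched directly from this description (my illustration; the cost counting matches the p^k n(L-k+1) analysis above):

```python
from itertools import product

def best_mismatch(w, a):
    """Fewest mismatches over all L-k+1 alignments of w inside a."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))

def brute_force_consensus(database, k, alphabet, lam):
    """Score every one of the p^k candidate patterns against every database
    word: p^k * n * (L-k+1) mismatch computations in total."""
    best_w, best_s = None, float("-inf")
    for pat in product(alphabet, repeat=k):
        w = "".join(pat)
        s = sum(lam[best_mismatch(w, a)] for a in database)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

db = ["111010", "111111", "110110"]
lam = [(2 - d) / 2 for d in range(3)]    # lambda_d = (k-d)/k with k = 2
print(brute_force_consensus(db, 2, "01", lam))   # ('11', 3.0)
```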
Note that p is relatively small, k is small, and n is typically large.
An Improved Algorithm: As Waterman et al. observe, we can improve the performance of the algorithm considerably by looking at neighborhoods of a word w′ of length k. Let the neighborhood N(w′) consist of all words w of length k such that D(w,w′) ≤ T for some threshold T. Usually, T is taken to be small, for example ≤ 3. The idea is to eliminate all potential pattern words w that don't fall within N(w′) for some k-length word w′ in each sequence word a in the database Π, i.e., to eliminate all words w such that d(w,a) > T for some a in Π.
Initial Calculations: As an initial step, calculate D(w,w′) for all words w, w′ of length k from Σ. This requires p^(2k) calculations of cost c each.
Going Through the Database: Go through the database Π one word at a time. Given a word a, consider each k-length word w′ in a, find N(w′), and look up D(w,w′) for all k-length pattern words w in N(w′). If N is an upper bound on the size of the neighborhoods N(w′), then we have to look up at most (L-k+1)N values D(w,w′). Assume that C is the look-up cost. If a word w is not in N(w′) for any such w′, dismiss it as a potential pattern word.
Calculate sΠ(w) for all pattern words w that have not been dismissed. Since we can update sΠ(w) as we go through each word of Π, and since there are n words in Π, we can calculate sΠ(w) for all non-dismissed w, and therefore obtain the highest-scoring non-dismissed w, with cost at most n(L-k+1)NC + p^(2k)c, where the second term comes from the initial calculation. This cost is considerably smaller than the cost of the brute-force algorithm, n(L-k+1)p^k c.
This is because, typically, n is large in comparison to p^(2k). Since C is much less than c and N is presumably much less than p^k, NC is much less than p^k c; and since n is large, this saving dominates the added one-time cost p^(2k)c. Of course, the improved algorithm assumes that no dismissed word can be a consensus word, which could be false.
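A sketch of this neighborhood-filtering idea in Python (my own simplification of the scheme just described: distances are precomputed once in a p^(2k) table, and a pattern is dismissed as soon as some database word has no k-window within threshold T of it):

```python
from itertools import product

def hamming(w1, w2):
    """D(w, w'): mismatches between two equal-length words."""
    return sum(a != b for a, b in zip(w1, w2))

def neighborhood_consensus(database, k, alphabet, lam, T=1):
    """Score only patterns within distance T of some window of every word."""
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    # Initial calculation: all p^(2k) pairwise distances D(w, w').
    D = {(w, v): hamming(w, v) for w in patterns for v in patterns}
    survivors = set(patterns)
    scores = {w: 0.0 for w in patterns}
    for a in database:
        windows = [a[i:i + k] for i in range(len(a) - k + 1)]
        for w in list(survivors):
            d = min(D[w, v] for v in windows)   # best-mismatch via look-up
            if d > T:
                survivors.discard(w)            # dismiss w: d(w, a) > T
            else:
                scores[w] += lam[d]             # update s(w) incrementally
    return max(survivors, key=lambda w: scores[w]) if survivors else None

db = ["111010", "111111", "110110"]
lam = [(2 - d) / 2 for d in range(3)]
print(neighborhood_consensus(db, 2, "01", lam, T=1))   # '11'
```

As the slide notes, the filtering is a heuristic: with a small T, a true consensus word could in principle be dismissed.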
Algorithms for the Median: A considerable amount of work in the literature has been devoted to finding algorithms for obtaining the median, although this is often a difficult computational problem and is NP-complete in some contexts. In a typical application, we have a large database Π, and so a very efficient algorithm will be needed. One of the reasons for our interest in the median procedure was that some of the algorithms for computing medians might be usable to improve upon the computational methods given for the Waterman consensus. We have, however, not worked on this idea.
Axiomatic Approach: In the group consensus literature in the social sciences and elsewhere, there has been considerable emphasis on finding axioms characterizing different consensus procedures. However, consensus methods used in molecular biology tend to be chosen because they seem interesting or useful, rather than on the basis of some theory. Such a theory could be based on an axiomatic approach. Axioms have been given for the median procedure in some contexts:
- Young and Levenglick (1978) axiomatized the Kemeny-Snell median where the ai are rankings rather than words.
- The median procedure has been axiomatically characterized when the ai are vertices of various kinds of graphs:
  - n-trees (Barthelemy and McMorris, 1986)
  - Covering graphs of semilattices (LeClerc, 1994)
  - Median graphs (McMorris, Mulder, and Roberts, 1998)
Not as much is known about means:
- The Kemeny-Snell mean procedure for rankings has not been characterized.
- The mean procedure was characterized for trees by Holzman (1990).
- However, Hansen and Roberts (1996) showed that a natural generalization of Holzman's axioms to arbitrary connected graphs with cycles leads to an impossibility result.
It would be interesting and potentially useful to characterize axiomatically the median and the mean in the bioconsensus context we have described, i.e., to give axioms for F(Π) to be obtained by minimizing ρΠ or σΠ. Our results do not give such a characterization, since they depend upon the Waterman consensus method. No results are known in the sequence context that characterize axiomatically those consensus functions that are the median or the mean, either when L = k or when L ≥ k. It would also be interesting and potentially of practical significance to try to axiomatize the Waterman consensus. No results are known about this problem.