Title: Randomized Approximation Algorithms for Offline and Online Set Multicover Problems
1 - Randomized Approximation Algorithms for
- Offline and Online Set Multicover Problems
- Bhaskar DasGupta
- Department of Computer Science
- University of Illinois at Chicago
- dasgupta_at_cs.uic.edu
- Joint work with Piotr Berman (Penn State) and Eduardo Sontag (Rutgers)
- A collection of results that appeared in APPROX-2004, WADS-2005 and to appear in Discrete Applied Math (special issue on computational biology)
- Supported by NSF grants CCR-0206795, CCR-0208749 and a CAREER award IIS-0346973
2 - More interesting title for the theoretical computer science community
- Randomized Approximation Algorithms for Set Multicover Problems
- with Applications to Reverse Engineering of Protein and Gene Networks
3 - More interesting title for the biological community
- Randomized Approximation Algorithms for Set Multicover Problems
- with Applications to Reverse Engineering of Protein and Gene Networks
4 - Set k-multicover (SCk)
- Input: universe U = {1, 2, ..., n}, sets S1, S2, ..., Sm ⊆ U, integer (coverage factor) k ≥ 1
- Valid solution: cover every element of the universe ≥ k times, i.e., a subset of indices I ⊆ {1, 2, ..., m} such that ∀x ∈ U: |{j ∈ I : x ∈ Sj}| ≥ k
- Objective: minimize the number of picked sets |I|
- k = 1 ⇒ simply called (unweighted) set-cover, a well-studied problem
- Special case of interest in our applications: k is large, e.g., k = n−1
- (a small worked instance appears below)
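To make the definition concrete, here is a minimal sketch of an SCk instance and a feasibility check. The instance itself is hypothetical (not from the talk); only the definition of a valid k-multicover is taken from the slide.

```python
# Tiny illustrative SC_k instance (hypothetical, for intuition only).
U = {1, 2, 3, 4}
sets = {1: {1, 2}, 2: {2, 3, 4}, 3: {1, 3, 4}, 4: {1, 2, 4}}
k = 2

def is_valid_multicover(I):
    """Check that every element of U is covered at least k times by the chosen sets."""
    return all(sum(x in sets[j] for j in I) >= k for x in U)

print(is_valid_multicover({1, 2, 3}))   # True: every element is covered >= 2 times
print(is_valid_multicover({1, 2}))      # False: element 1 is covered only once
```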
5 - Known positive results
- Set-cover (k = 1)
- can approximate with an approx. ratio of 1 + ln a, where a = maximum size of any set (deterministic or randomized)
- Johnson 1974, Chvátal 1979, Lovász 1975
- Set-multicover (k > 1)
- the same ratio holds for k > 1
- e.g., primal-dual fitting: Rajagopalan and Vazirani 1999
6 - Known negative results for set-cover (i.e., k = 1)
- (modulo NP ⊄ DTIME(n^(log log n)))
- an approx. ratio better than (1 − ε) ln n is not possible for any constant 0 < ε < 1 (Feige 1998)
- (modulo NP ≠ P)
- better than (1 − ε) ln n is not possible for some constant 0 < ε < 1 (Raz and Safra 1997)
- the lower bound can be generalized in terms of the set size a:
- better than ln a − O(ln ln a) is not possible (Trevisan 2001)
7 - r(a,k) = approx. ratio of an algorithm as a function of a, k
- We know that for the greedy algorithm r(a,k) ≤ 1 + ln a
- at every step, select the set that contains the maximum number of elements not yet covered k times (a sketch of this greedy rule appears below)
- Can we design an algorithm such that r(a,k) decreases with increasing k?
- possible approaches
- improved analysis of greedy?
- randomized approach (LP rounding)?
- ?
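A minimal sketch of the greedy rule described above (function and variable names are mine, not from the talk; the rule itself is the one the slide states):

```python
def greedy_multicover(universe, sets, k):
    """Greedy set multicover: repeatedly pick the set covering the most
    still-deficient elements (elements covered fewer than k times so far)."""
    coverage = {x: 0 for x in universe}
    chosen = []
    remaining = dict(sets)  # index -> set of elements
    while any(c < k for c in coverage.values()):
        # Pick the set that covers the largest number of deficient elements.
        best = max(remaining, key=lambda i: sum(coverage[x] < k for x in remaining[i]))
        if sum(coverage[x] < k for x in remaining[best]) == 0:
            raise ValueError("instance is infeasible for this k")
        chosen.append(best)
        for x in remaining[best]:
            coverage[x] += 1
        del remaining[best]  # each set may be picked at most once
    return chosen
```

Slide 7 states that this greedy rule achieves ratio at most 1 + ln a.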
8 - Our results (very roughly)
- n = number of elements of the universe U
- k = number of times each element must be covered
- a = maximum size of any set
- Greedy would not do any better:
- r(a,k) = Ω(log n) even if k is large, e.g., k = n
- But we can design a randomized algorithm, based on the LP-rounding approach, such that the expected approx. ratio is better:
- E[r(a,k)] ≤ max{2 + o(1), ln(a/k)} (as appears in the conference proceedings)
- → (further improvement, via comments from Feige)
- → max{1 + o(1), ln(a/k)}
9 - More precise bounds on E[r(a,k)]:
- E[r(a,k)] ≤ 1 + ln a, if k = 1
- E[r(a,k)] ≤ (1 + e^(−(k−1)/5)) · ln(a/(k−1)), if a/(k−1) ≥ e² ≈ 7.4 and k > 1
- E[r(a,k)] ≤ min{2 + 2e^(−(k−1)/5), 2 + 0.46·a/k}, if 1/4 ≤ a/(k−1) ≤ e² and k > 1
- E[r(a,k)] ≤ 1 + 2(a/k)^(1/2), if a/(k−1) ≤ 1/4 and k > 1
10 - Can E[r(a,k)] converge to 1 at a much faster rate?
- Probably not... for example, the problem can be shown to be APX-hard when a/k is close to 1
- Can we prove matching lower bounds of the form max{1 + o(1), 1 + ln(a/k)}?
- Do not know...
11 - How about the weighted case?
- each set has an arbitrary positive weight
- minimize the sum of the weights of the selected sets
- It seems that the multi-cover version may not be much easier than the single-cover version:
- take a single-cover instance
- add a few new elements and new must-select sets with almost-zero weights that cover the original elements k−1 times and all new elements k times
- (a sketch of this reduction appears below)
12 - Our randomized algorithm
- Standard LP-relaxation for set multicover (SCk):
- selection variable xi for each set Si (1 ≤ i ≤ m)
- minimize x1 + x2 + ... + xm
- subject to ∑_{i : u ∈ Si} xi ≥ k for every u ∈ U
- 0 ≤ xi ≤ 1 for all i
13 - Our randomized algorithm
- Solve the LP-relaxation
- Select a scaling factor β carefully:
- β = ln a, if k = 1
- β = ln(a/(k−1)), if a/(k−1) ≥ e² and k > 1
- β = 2, if 1/4 ≤ a/(k−1) ≤ e² and k > 1
- β = 1 + (a/k)^(1/2), otherwise
- Deterministic rounding: select Si if β·xi ≥ 1
- C0 = {Si : β·xi ≥ 1}
- Randomized rounding: select each Si ∈ {S1, ..., Sm} \ C0 with prob. β·xi
- C1 = collection of such selected sets
- Greedy choice: if an element u ∈ U is covered fewer than k times, pick sets containing u from {S1, ..., Sm} \ (C0 ∪ C1) arbitrarily until it is
- (a sketch of the full algorithm appears below)
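A compact sketch of this three-phase algorithm. It assumes scipy is available for solving the LP; the helper names are mine, and the choice of β follows the case table on the slide. This is an illustrative sketch, not the exact implementation from the paper.

```python
import math
import random
from scipy.optimize import linprog

def lp_rounding_multicover(universe, sets, k):
    """Three phases from slide 13: LP relaxation, scaled deterministic plus
    randomized rounding, then a greedy repair step."""
    m, elems = len(sets), sorted(universe)
    a = max(len(s) for s in sets)
    # LP: minimize sum x_i  s.t.  sum_{i: u in S_i} x_i >= k,  0 <= x_i <= 1.
    A_ub = [[-1.0 if u in sets[i] else 0.0 for i in range(m)] for u in elems]
    b_ub = [-float(k)] * len(elems)
    lp = linprog(c=[1.0] * m, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * m, method="highs")
    x = lp.x
    # Scaling factor beta, per the case analysis on the slide.
    if k == 1:
        beta = math.log(a)
    elif a / (k - 1) >= math.e ** 2:
        beta = math.log(a / (k - 1))
    elif a / (k - 1) >= 0.25:
        beta = 2.0
    else:
        beta = 1.0 + math.sqrt(a / k)
    chosen = {i for i in range(m) if beta * x[i] >= 1}        # deterministic rounding (C0)
    chosen |= {i for i in range(m)
               if i not in chosen and random.random() < beta * x[i]}  # randomized rounding (C1)
    # Greedy repair: cover every still-deficient element up to k times.
    for u in elems:
        need = k - sum(u in sets[i] for i in chosen)
        for i in range(m):
            if need <= 0:
                break
            if i not in chosen and u in sets[i]:
                chosen.add(i)
                need -= 1
    return chosen
```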
14 - The most non-trivial part of the analysis involved proving the following bound on E[r(a,k)]:
- E[r(a,k)] ≤ (1 + e^(−(k−1)/5)) · ln(a/(k−1)), if a/(k−1) ≥ e² and k > 1
- It needed an amortized analysis of the interaction between the deterministic and randomized rounding steps and the greedy step.
- For a tight analysis, the standard Chernoff bounds were not always sufficient, and hence we needed to devise more appropriate bounds for certain parameter ranges.
15 - Proof of the simplest of the bounds:
- E[r(a,k)] ≤ 1 + 2(a/k)^(1/2), if a/k ≤ 1/4
- Notational simplification:
- α = (a/k)^(−1/2) ≥ 2
- thus, β = 1 + 1/α
- need to show that E[r(a,k)] ≤ 1 + 2/α
- (x1, x2, ..., xm) is the solution vector for the LP
- thus, OPT ≥ ∑_i xi
- Also, obviously, OPT ≥ (n·k)/a = n·α²
16 - Focus on a single element j ∈ U
- Recall the algorithm:
- Deterministic rounding: select Si if β·xi ≥ 1
- C0 = {Si : β·xi ≥ 1}
- Let C0,j = those sets in C0 that contain j
- Randomized rounding: select each Si ∈ {S1, ..., Sm} \ C0 with prob. β·xi
- C1 = collection of such selected sets
- Let C1,j = those sets in C1 that contain j
- p = sum of the probabilities of those sets that contain j
- Greedy choice: if an element j ∈ U is covered fewer than k times, pick sets containing j from {S1, ..., Sm} \ (C0 ∪ C1) arbitrarily; let C2 be all such selected sets
- Let C2,j = those sets in C2 that contain j
17 - What is E[|C0| + |C1|]?
- Obvious.
- E[|C0| + |C1|] ≤ β·(∑_i xi) ≤ (1 + α^(−1))·OPT
- (no set is in both C0 and C1; the calculation is spelled out below)
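Spelling out the one-line calculation behind the bound the slide calls "obvious" (my own filling-in of the step, using β = 1 + 1/α and the fact that the LP value ∑_i x_i is at most OPT):

```latex
\[
\mathbb{E}\bigl[|C_0| + |C_1|\bigr]
  \;=\; \sum_{S_i \in C_0} 1 \;+\; \sum_{S_i \notin C_0} \beta x_i
  \;\le\; \sum_{S_i \in C_0} \beta x_i \;+\; \sum_{S_i \notin C_0} \beta x_i
  \;=\; \beta \sum_{i=1}^{m} x_i
  \;\le\; \Bigl(1 + \tfrac{1}{\alpha}\Bigr)\,\mathrm{OPT}
\]
% using that S_i \in C_0 means \beta x_i \ge 1, and that the LP value lower-bounds OPT
```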
18 - What is E[|C2,j|]?
- Suppose that |C0,j| = k − f for some f
- say S1, S2, ..., S_{k−f} are the sets of C0,j and the remaining sets containing j lie outside C0
- the LP constraint ∑_{i : j ∈ Si} xi ≥ k and xi ≤ 1 for every i imply that the sets containing j outside C0 carry total LP value at least f, so p ≥ β·f
19 - (Focus on a single element j ∈ U)
- Goal is to:
- first determine E[|C0| + |C1|]
- then determine E[|C2,j|]
- sum it up over all j to get E[|C2|]
- finally determine E[|C0| + |C1| + |C2|]
20 - What is E[|C2,j|]? (contd.)
- |C2,j| = max{0, f − |C1,j|}, and thus, after some algebra, a bound on E[|C2,j|] follows
21 - What is E[|C2,j|]? (contd.)
23 - One application
- We used the randomized algorithm for robust string barcoding
- Check the publications on the software webpage:
- http://dna.engr.uconn.edu/software/barcode/
- (joint project with Kishori Konwar, Ion Mandoiu and Alex Shvartsman at Univ. of Connecticut)
24 - Another (the original) motivation for looking at set-multicover:
- Reverse engineering of biological networks
25 - (roadmap figure) Biological motivation → biological problem via differential equations → linear algebraic formulation → set-multicover formulation → randomized algorithm → selection of appropriate biological experiments
26 - (same roadmap figure as slide 25)
27 - (figure) The linear-algebraic setup C = A·B: A is an unknown n×n matrix (rows Ai); B is an n×m matrix whose columns Bj are initially unknown but can be queried, and whose columns are linearly independent; for each queried column Bj we obtain the zero structure of the corresponding column Cj of C (which entries are 0 and which are ≠ 0).
28 - (figure) A numerical example of C = A·B with queried columns B0, B1, B2, B3, B4; A is unknown, the columns of B are in general position and initially unknown but can be queried, and only the zero structure C0 of C is known (which entries are 0 and which are ≠ 0); the example asks: what is B2?
29 - Rough objective: obtain as much information about A while performing as few queries as possible
- Obviously, the best we can hope for is to identify A up to scaling
30 - (figure) Example continued: using the zero structure C0 and the queried columns, a row Ai of A can be recovered (up to scaling) once at least n−1 of the queried columns Bj have a zero in row i of C (|Ji| ≥ n−1 in the notation of the next slide).
31 - Suppose we query columns Bj for j ∈ J = {j1, ..., jl}
- Let Ji = {j : j ∈ J and cij = 0}
- Suppose |Ji| ≥ n−1. Then each Ai is uniquely determined up to a scalar multiple (theoretically the best possible)
- Thus, the combinatorial question is:
- find J of minimum cardinality such that |Ji| ≥ n−1 for all i
32 - Combinatorial question
- Input: sets Ji ⊆ {1, 2, ..., n} for 1 ≤ i ≤ m
- Valid solution: a subset Γ ⊆ {1, 2, ..., m} such that
- ∀ 1 ≤ i ≤ n: |{α : α ∈ Γ and i ∈ Jα}| ≥ n−1
- Goal: minimize |Γ|
- This is the set-multicover problem with coverage factor n−1
- More generally, one can ask for a lower coverage factor, n−k for some k > 1, to allow fewer queries but resulting in an ambiguous determination of A
- (a sketch of the mapping from the zero structure to this instance appears below)
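A small sketch of how the zero structure translates into such a set-multicover instance. The variable names are mine; the translation itself follows the slides: query j "covers" row i exactly when c_ij = 0, and each row needs n−1 units of coverage.

```python
def multicover_instance_from_zero_structure(C0):
    """C0 is an n-by-m 0/1 matrix: C0[i][j] == 0 means row A_i is orthogonal to
    the queried column B_j.  Query j 'covers' row i whenever C0[i][j] == 0; each
    row must be covered at least n-1 times for A_i to be determined up to scaling."""
    n, m = len(C0), len(C0[0])
    cover_sets = [{i for i in range(n) if C0[i][j] == 0} for j in range(m)]
    coverage_factor = n - 1
    return cover_sets, coverage_factor
```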
33 - (roadmap figure) Biological problem via differential equations → linear algebraic formulation → combinatorial formulation → combinatorial algorithms (randomized) → selection of appropriate biological experiments
34 - Time evolution of the state variables (x1(t), x2(t), ..., xn(t)) is given by a set of differential equations:
- ∂x1/∂t = f1(x1, x2, ..., xn, p1, p2, ..., pm)
- ...
- ∂xn/∂t = fn(x1, x2, ..., xn, p1, p2, ..., pm)
- in vector form: ∂x/∂t = f(x, p)
- p = (p1, p2, ..., pm) represents the concentrations of certain enzymes
- f(x̄, p̄) = 0
- p̄ is the wild-type (i.e., normal) condition of p
- x̄ is the corresponding steady-state condition
35 - Goal
- We are interested in obtaining information about the signs of ∂fi/∂xj(x̄, p̄)
- e.g., if ∂fi/∂xj > 0, then xj has a positive (catalytic) effect on the formation of xi
36 - Assumption
- We do not know f, but we do know that certain parameters pj do not affect certain variables xi
- This gives the zero structure of the matrix C:
- a matrix C0 = (c0ij) with c0ij = 0 ⇒ ∂fi/∂pj ≡ 0
37 - m experiments
- in each, change one parameter, say pj (1 ≤ j ≤ m)
- for the perturbed p ≠ p̄, measure the steady-state vector x = ξ(p)
- estimate the n sensitivities bij ≈ [ξi(p̄ + ε·ej) − ξi(p̄)] / ε for a small ε, where ej is the jth canonical basis vector
- (a numerical sketch of this step appears below)
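A minimal numerical sketch of this estimation step. The callable xi_steady_state, standing in for the measured steady state ξ(p), is a hypothetical placeholder; the finite-difference form follows the formula above.

```python
import numpy as np

def estimate_sensitivity_matrix(xi_steady_state, p_bar, eps=1e-3):
    """Estimate B with b_ij ~ [xi_i(p_bar + eps*e_j) - xi_i(p_bar)] / eps,
    i.e., one perturbation experiment per parameter p_j."""
    p_bar = np.asarray(p_bar, dtype=float)
    base = np.asarray(xi_steady_state(p_bar), dtype=float)   # wild-type steady state
    m, n = p_bar.size, base.size
    B = np.zeros((n, m))
    for j in range(m):
        e_j = np.zeros(m)
        e_j[j] = 1.0
        B[:, j] = (np.asarray(xi_steady_state(p_bar + eps * e_j)) - base) / eps
    return B
```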
38 - In practice, a perturbation experiment involves:
- letting the system relax to steady state
- measuring the expression profiles of the variables xi (e.g., using microarrays)
39 - Biology to linear algebra (continued)
- Let A be the Jacobian matrix ∂f/∂x
- Let C be the negative of the Jacobian matrix ∂f/∂p
- From f(ξ(p), p) = 0, taking the derivative with respect to p and using the chain rule, we get C = A·B.
- This gives the linear-algebraic formulation of the problem. (The one-line derivation is spelled out below.)
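For completeness, the chain-rule step (with B = ∂ξ/∂p, the sensitivity matrix estimated in the experiments):

```latex
\[
f(\xi(p),p) \equiv 0
\;\Longrightarrow\;
\underbrace{\frac{\partial f}{\partial x}}_{A}\,
\underbrace{\frac{\partial \xi}{\partial p}}_{B}
\;+\;
\frac{\partial f}{\partial p} \;=\; 0
\;\Longrightarrow\;
A\,B \;=\; -\frac{\partial f}{\partial p} \;=\; C
\]
```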
41 - Performance measure
- via the competitive ratio:
- the ratio of the total cost of the online algorithm to that of an optimal offline algorithm that knows the entire input in advance
- For randomized algorithms, we measure the expected competitive ratio
42 - Parameters of interest (for the performance measure)
- frequency m: maximum number of sets to which any presented element belongs (unknown)
- maximum set size d: maximum number of presented elements that a set contains (unknown)
- total number of elements in the universe n (≥ d) (unknown)
- coverage factor k (given)
43 - Previous result
- Alon, Awerbuch, Azar, Buchbinder, and Naor (STOC 2003 and SODA 2004)
- considered k = 1
- both deterministic and randomized algorithms
- competitive ratio O(log m · log n), worst-case/expected
- an almost matching lower bound for deterministic algorithms and almost all parameter values
44 - Our improved algorithm
- Expected competitive ratio of O(log m · log d) instead of O(log m · log n) (note d ≤ n)
- roughly log₂ m · ln d plus lower-order terms
- small, precise constants
- the ratio improves with larger k
- c = largest weight / smallest weight
45 - Even more precise, smaller constants for the unweighted k = 1 case, via improved analysis
46 - Our lower bounds on the competitive ratio (for deterministic algorithms)
- for both the unweighted and the weighted case, for many values of the parameters
47 - Work concurrent with our conference publication
- Alon, Azar and Gutner (SPAA 2005)
- a different version of the online problem (weighted case)
- the same element can be presented multiple times
- if the same element is presented k times, the goal is to cover it by at least k different sets
- expected competitive ratio O(log m · log n)
- easy to see that it applies to our version with the same bounds
- Conversely, our algorithm and analysis can be easily adapted to provide an expected competitive ratio of log₂ m · ln(d/....) for the above version
48 - Yet another version of online set-cover
- Awerbuch, Azar, Fiat, Leighton (STOC 96)
- elements are presented one at a time
- allowed to pick k sets at a given time, for a specified k
- goal: maximize the number of presented elements for which at least one set containing the element was selected before the element was presented
- provides efficient randomized approximation algorithms and matching lower bounds
49 - Our algorithmic approach
- a randomized version of the so-called winnowing approach
- the (deterministic) winnowing approach was first used long ago:
- N. Littlestone, Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm, Machine Learning, 2, pp. 285-318, 1988
- this approach was also used by Alon, Awerbuch, Azar, Buchbinder and Naor in their STOC-2003 paper
50 - Very, very rough description of our approach
- every set starts with zero probability of selection
- start with an empty solution
- when the next element i is presented:
- if k already-selected sets contain i, there is nothing to do for i
- otherwise, appropriately increase the probabilities of all sets containing i (the promotion step of winnowing)
- select sets containing i with the above probabilities
- if i is still not covered by k selected sets, select more sets greedily:
- select the least-cost set not selected already, then the next least-cost set, etc.
- (a sketch of this loop appears below)
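A minimal sketch of this per-element loop. The exact promotion rule and constants from the paper are not reproduced on the slide, so the multiplicative-plus-additive update below is illustrative only; all names are mine.

```python
import random

def online_multicover_step(element_sets, weights, prob, selected, k):
    """One step of the randomized winnowing-style online loop for a newly
    presented element, following the rough description on slide 50.
    element_sets: indices of the sets containing the presented element
    prob: per-set selection probabilities (start at 0 and are only promoted)
    selected: indices already in the solution (modified in place)."""
    if sum(1 for i in element_sets if i in selected) >= k:
        return  # already covered k times; nothing to do
    freq = len(element_sets)  # frequency of the presented element
    for i in element_sets:
        # Promotion step (illustrative rule only): cheaper sets and rarer
        # elements get a larger boost, capped at probability 1.
        prob[i] = min(1.0, 2 * prob[i] + 1.0 / (freq * weights[i]))
    for i in element_sets:
        if i not in selected and random.random() < prob[i]:
            selected.add(i)
    # Greedy repair: take cheapest remaining sets until the element is covered k times.
    deficit = k - sum(1 for i in element_sets if i in selected)
    for i in sorted(element_sets, key=lambda i: weights[i]):
        if deficit <= 0:
            break
        if i not in selected:
            selected.add(i)
            deficit -= 1
```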
51 - Many desirable (and sometimes conflicting) goals:
- the increase in the probability of each set should not be too large
- else, e.g., the randomized step may select too many sets
- the increase in the probability of each set should not be too small
- else, e.g., optimal sets may be missed too many times, and the greedy step may dominate too much
- light sets should be preferred over heavy sets, unless the heavy sets are in an optimal solution
- the increase in probability should be somehow inversely linked to the frequency of i, to avoid selecting too many sets in the randomized step
53 - Slightly improved algorithm for the unweighted case
- (the expected competitive ratio has better constants/asymptotics)
- modify the promotion step slightly (change the probability-update rule)
54 - New expected competitive ratio
55 - Motivation for the online version
- Similar to before, except that we use fluorescent proteins instead of microarrays
- Fluorescent proteins can be used to measure the rate at which a certain gene is transcribed in a cell under a set of conditions
- a priori, the matrix C is not known completely, but is to be learnt by doing experiments
56 - Thank you for your attention!