Title: Large Scale Combinatorial Optimization Problems in Bioinformatics
1Large Scale Combinatorial Optimization Problems
in Bioinformatics
- Nathan Edwards
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park
2Sample Preparation for Peptide Identification
3Single Stage MS
[Figure: MS spectrum; x-axis: m/z]
4Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection; x-axis: m/z]
5Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection, collision-induced dissociation (CID), and the resulting MS/MS spectrum; x-axis: m/z]
6Novel Splice Isoform
- Human Jurkat leukemia cell-line
- Lipid-raft extraction protocol, targeting T cells
- von Haller, et al. MCP 2003.
- LIME1 gene
- LCK interacting transmembrane adaptor 1
- LCK gene
- Leukocyte-specific protein tyrosine kinase
- Proto-oncogene
- Chromosomal aberration involving LCK in leukemias.
- Multiple significant peptide identifications
7Novel Splice Isoform
8Novel Splice Isoform
9Polymerase Chain Reaction
10PCR
11Applications of k-mer sets
- Peptide Identification
- Set of all human amino-acid 30-mer peptide sequences...
- ...that occur at least twice in dbEST
- PCR Primer Design
- Unique 20-mers
- What does it mean to be unique?
12k-mer superstrings
- Completeness
- All of the required k-mers are represented
- Correctness
- No additional k-mers are represented
- Minimize the total representation length
- Correlates with running time
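The completeness and correctness conditions can be checked directly on a small example. The following is a minimal sketch only; the function names and the toy 3-mer set are assumptions made for illustration, not part of the talk.

```python
def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def check_representation(sequences, required, k):
    """Return (complete, correct, total_length) for a candidate representation."""
    represented = set()
    for s in sequences:
        represented |= kmers(s, k)
    complete = required <= represented   # every required k-mer is represented
    correct = represented <= required    # no additional k-mer is represented
    total_length = sum(len(s) for s in sequences)
    return complete, correct, total_length

required = {"ACG", "CGT", "GTA"}          # hypothetical toy k-mer set
print(check_representation(["ACGTA"], required, 3))
print(check_representation(["ACGTAC"], required, 3))  # also contains TAC
```

The second candidate is complete but not correct, because it introduces the extra 3-mer TAC.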
13Shortest superstring problem
- General strings (arbitrary length)
- Completeness only!
- Classical NP-hard problem
- Garey and Johnson
- Approximable to within 2.5 × OPT
- Max-SNP hard
- One of the first algorithmic approaches to genome
assembly
14de Bruijn Sequences
- de Bruijn sequences represent all words of length k from some alphabet A.
- A = {0,1}, k = 3: s = 0001110100
- A = {0,1}, k = 4: s = 0000111101011001000
15de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph; each edge labeled 0 or 1]
16de Bruijn Sequences and Graphs
- de Bruijn graphs (k,A)
- Edges represent length k words from A
- Each node has
- in-degree |A|
- out-degree |A|
- Eulerian tour constructs de Bruijn sequence.
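As a minimal sketch of the Eulerian-tour construction, the code below takes the nodes to be (k-1)-mers and the edges to be k-mers, and walks the graph with Hierholzer's algorithm. The function name is hypothetical, and the sequence it produces may differ from the examples on the earlier slide, since a de Bruijn graph has many Eulerian tours.

```python
from itertools import product

def de_bruijn_sequence(alphabet, k):
    """Linear sequence containing every length-k word over alphabet exactly once."""
    if k == 1:
        return "".join(alphabet)
    # Nodes are (k-1)-mers; each node has one unused outgoing edge per symbol.
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, tour = [start], []
    # Hierholzer's algorithm: follow unused edges, backtrack when stuck.
    while stack:
        node = stack[-1]
        if unused[node]:
            symbol = unused[node].pop()
            stack.append((node + symbol)[1:])
        else:
            tour.append(stack.pop())
    tour.reverse()
    # Spell the tour: the start node, then the last symbol of each later node.
    return tour[0] + "".join(node[-1] for node in tour[1:])

s = de_bruijn_sequence("01", 4)
words = {s[i:i + 4] for i in range(len(s) - 3)}
print(s, len(s), len(words))
```

The result has length |A|^k + k - 1, because the linear form repeats the first k-1 symbols at the end.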
17SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
18Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
19Sequence Databases and CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
20Sequence Databases and CSBH-graphs
- All k-mers represented by an edge have the same count
[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
21Correct, Complete Enumeration
- Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
22Related work
- Chinese Postman Problem
- Edmonds and Johnson, 1973
- Undirected graph, weighted edges
- Shortest closed walk that uses every edge at least once
- Solvable in polynomial time
- Construct minimum-weight matching between nodes of odd degree
- Add matching to graph and find Eulerian tour
- Minimize weight of extra edges used
23Correct, Complete, Enumeration
- Chinese postman problem, except
- Directed graph
- Add edges from nodes with surplus in-degree to nodes with surplus out-degree
- Fixed cost teleportation option
- Can always start a new sequence
- Find optimal set of additional edges
- Transportation problem / min cost flow instance
24Patching the CSBH-graph
- Use artificial edges to fix unbalanced nodes
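Before any artificial edges can be chosen, the unbalanced nodes have to be identified. The sketch below only computes each node's surplus; choosing the cheapest set of artificial edges is the min cost flow step and is not shown. The function name and the toy edge list are assumptions.

```python
from collections import Counter

def node_imbalance(edges):
    """Map each unbalanced node to (in-degree - out-degree) for (u, v) edges."""
    balance = Counter()
    for u, v in edges:
        balance[u] -= 1   # edge leaves u
        balance[v] += 1   # edge enters v
    return {node: b for node, b in balance.items() if b != 0}

# Hypothetical toy graph: b has one incoming edge and two outgoing edges.
edges = [("a", "b"), ("b", "c"), ("b", "d")]
print(node_imbalance(edges))
```

Positive values mark nodes with surplus in-degree (sources of artificial edges), negative values mark nodes with surplus out-degree (targets of artificial edges).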
25C3 Enumeration
[Figure: flow network; node labels: in-out; edge label: Cost k]
26C3 Enumeration
[Figure: flow network; node labels: in-out; edge labels: Cost 0, Cost 0, Cost k]
27C2 Enumeration
[Figure: flow network with shortcut paths; node labels: in-out; edge labels: 4, 10, 7]
28C2 Superstring
[Figure: flow network; node labels: in-out; edge labels: Cost 0, Cost 0, Cap 1, Cost 0]
29Large scale instances
- Millions of nodes and edges
- cSBH-graph instance
- Min cost flow instance
- cSBH-graph instance takes days to construct
- 2 days on 250 CPUs
- Algorithms must be linear in problem size
- Out-of-core Eulerian path algorithm?
30Grid computing
- Heterogeneous machines
- Varying disk/memory/MHz/cores capabilities
- Centralized scheduler
- Jobs started asynchronously
- Other jobs may preempt current job
- Input files may need to be staged
- 250 simultaneous requests for a 3Gb file?
- How to guarantee integrity of input files?
- Problem decomposition may be non-trivial
- Job sizes need to fit the least capable machine
- Sometimes need to game the scheduler
- Need to ensure the integrity of job output
31Uniqueness Oracles
- Oracle for uniqueness of 20-mers in the Human genome (n ≈ 3Gb)
- Count occurrences in the genome: 0, 1, ≥2
- Construct 20-mer superstring for 20-mers with count 1
- Construct 20-mer superstring for 20-mers with count > 1
- Easy for exact sequence match: O(n)
- Fast automata, indexing, hash tables.
32Inexact sequence match
- Inexact sequence matching: O(nmk)
- Errors/Mismatches (k): 1, 2, 3
- Number of distinct 20-mers (m): O(n)
- Achieve expected linear time using a hybrid approach
- Exact search for short chunks of queries
- Expensive alignment only where chunks match
- Large chunks ⇒ fast, but miss occurrences
- Small chunks ⇒ slow, find all matches
33Inexact sequence match
- Baeza-Yates and Perleberg
- Correct and O(n) for small k
- Split each query into k+1 chunks: at least 1 chunk is observed with no error.
- Form of locality sensitive hashing
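The chunk filter can be sketched as follows, assuming mismatches only (Hamming distance, no insertions or deletions), queries of equal length, and at least k+1 characters per query. All function names are hypothetical; this illustrates the pigeonhole idea and is not the authors' implementation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def chunk_bounds(m, k):
    """Split range(m) into k + 1 contiguous chunks of near-equal size."""
    step, extra = divmod(m, k + 1)
    bounds, start = [], 0
    for i in range(k + 1):
        end = start + step + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def find_with_mismatches(genome, queries, k):
    """Return sorted (position, query) pairs matching with at most k mismatches."""
    m = len(queries[0])            # assumes all queries share one length
    bounds = chunk_bounds(m, k)
    index = {}                     # (chunk start, chunk text) -> queries
    for q in queries:
        for s, e in bounds:
            index.setdefault((s, q[s:e]), set()).add(q)
    hits = set()
    for pos in range(len(genome) - m + 1):
        window = genome[pos:pos + m]
        for s, e in bounds:
            # Cheap exact lookup first; expensive check only on chunk hits.
            for q in index.get((s, window[s:e]), ()):
                if hamming(q, window) <= k:
                    hits.add((pos, q))
    return sorted(hits)

print(find_with_mismatches("TTACGTACCTAA", ["ACGTACGT"], 1))
```

With k mismatches spread over k+1 chunks, at least one chunk must match exactly, so the filter cannot miss an occurrence.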
34Locality Sensitive Hashing
- For each query
- store a (set of) hash(es) in hash-table
- At each position in the genome
- look-up a (set of) hash(es) in hash-table
- if any are in the table, do more expensive check
- Need to weigh
- sensitivity (false negatives) against
- specificity (false positives)
- Our application requires no false negatives
35Random Projection
- Choose T templates of l random care positions
[Figure: template t1 aligned against genome g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
36Random Projection
- Choose T templates of l random positions
[Figure: templates t1 and t2 aligned against genome g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2 = 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
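Applying a template to a word amounts to keeping only the characters at its care positions. The sketch below reads the 0/1 row for t1 as a mask in which 1 marks a care position; that reading, the function name, and the example word are assumptions.

```python
def project(word, template):
    """Keep the characters of word at the template's care positions (the 1s)."""
    return "".join(c for c, care in zip(word, template) if care == "1")

t1 = "01101010001010100"        # the t1 row from the slide, read as a mask
word = "ACGTACGTACGTACGTA"      # hypothetical 17-character window
print(project(word, t1))
```

Two windows that differ only at don't-care positions project to the same key, so they land in the same hash bucket.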
39Gapped seed-set design problem
- Given
- mer-size m (20)
- errors k (1, 2, 3)
- cares l (10, 12, 14)
- Find the smallest set of templates with no false negatives.
- Minimize running time.
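The no-false-negatives requirement can be tested by brute force on small instances. The sketch below assumes that a template is a set of care positions and that it detects an error placement exactly when none of its care positions is an error position; the function name is hypothetical.

```python
from itertools import combinations

def uncovered_alignments(templates, m, k):
    """Return the placements of k errors that no template detects.

    Each template is a set of care positions in range(m). An empty result
    means the template set has no false negatives.
    """
    missed = []
    for errors in combinations(range(m), k):
        error_set = set(errors)
        if not any(error_set.isdisjoint(t) for t in templates):
            missed.append(errors)
    return missed

print(uncovered_alignments([{0, 1}, {2, 3}], m=4, k=1))   # both halves used
print(uncovered_alignments([{0, 1}], m=4, k=1))           # errors at 0 or 1 are missed
```

The enumeration visits all C(m, k) error placements, so it is only practical for small m and k such as those listed above.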
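40Alternative Formulation 1 (for k = 2)
- Cover the edges of K_m with copies of K_{m-l}
- Edge = a pair of error positions; copy of K_{m-l} = the m-l don't-care positions of one template
- How many triangles to cover K_6?
- Some instances of (m, 2, m-3) cover each edge exactly once
- Steiner triple systems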
41Alternative Formulation 2
- Set cover instance
- Ground set: all possible placements of the k errors (alignments)
- Covering sets: all possible placements of the l care positions
- For (m = 20, k = 2, l = 10):
- 190 elements, 184,756 sets!
- Greedy approximation algorithm works
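A minimal sketch of the greedy set-cover heuristic for this formulation follows, run on a deliberately small instance. The function names and the (8, 2, 4) instance are assumptions; the same code applies to (20, 2, 10) but enumerates all 184,756 candidate templates in every round, so it is slow in pure Python.

```python
from itertools import combinations

def gain(template, uncovered, m, k):
    """Count uncovered error placements lying entirely in don't-care positions."""
    free = [p for p in range(m) if p not in template]
    return sum(1 for a in combinations(free, k) if a in uncovered)

def greedy_templates(m, k, l):
    """Greedy set cover: add templates until every k-error placement is covered."""
    uncovered = set(combinations(range(m), k))
    candidates = [frozenset(c) for c in combinations(range(m), l)]
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda t: gain(t, uncovered, m, k))
        if gain(best, uncovered, m, k) == 0:
            raise ValueError("no template covers the remaining alignments")
        chosen.append(best)
        # A placement stays uncovered if the new template cares about an error.
        uncovered = {a for a in uncovered if not best.isdisjoint(a)}
    return [sorted(t) for t in chosen]

for t in greedy_templates(8, 2, 4):
    print(t)
```

Greedy yields a feasible template set, not necessarily a smallest one.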
42Alternative Formulation 3
[Figure: bipartite graph between template nodes and position nodes (m); each template node has degree l]
- Remove any k position nodes: at least 1 template still has degree l.
43Alternative Formulation 3
- Template t has care at position i: x_{ti}
- Alignment a is not covered by template t: y_{ta}
- Alignment a is not covered by any template: z_a
44Alternative Formulation 3
- Polynomial size in terms of number of templates
- Select T in advance and test whether T is sufficient.
- Greedily add T templates.
- Apply iteratively to achieve feasible solution
- Extremely weak LP relaxation
- Lots of symmetry!
- Hard to solve useful instances
45Solution for (20,2,10)
[Figure: care positions of templates t1 to t6 across the 20 positions]
- Need at least 4 templates, 6 is optimal
46Remember the application!
- We are checking some templates twice!
- We compute the hash(es) at each position in the genome
- Any template that is a shift of another will be computed at some nearby genomic position!
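One way to spot templates that are shifts of one another is to normalize each template by its smallest care position and group the results. This is a sketch under the assumption that templates are sets of care positions; the function name and the example templates are hypothetical.

```python
def group_by_shift(templates):
    """Group templates (sets of care positions) that are shifts of one another.

    Returns a dict mapping each normalized shape to the list of offsets
    at which that shape occurs.
    """
    groups = {}
    for t in templates:
        offset = min(t)
        shape = tuple(sorted(p - offset for p in t))
        groups.setdefault(shape, []).append(offset)
    return groups

templates = [{0, 1, 4}, {2, 3, 6}, {0, 2, 4}]
print(group_by_shift(templates))
```

Here the first two templates share one shape, so only two distinct hash computations per genomic position would be needed for the three templates.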
47Solution for (20,2,10)
[Figure: care positions of templates t1 to t6 across the 20 positions]
- Need at most 3 templates...can we do better?
48Alternative Formulation 3 with template shift
49Solution for (20,2,10) w/ shift
[Figure: care positions of templates t1 and t2 across the 20 positions]
- Optimal is 2 templates...
50Alternative Formulation 3 with shift
- Polynomial size in terms of number of templates
- Select T in advance and test whether T is sufficient.
- Greedily add T templates.
- Apply iteratively to achieve feasible solution
- Greedy not known to be good!
- Extremely weak LP relaxation
- Much less symmetry
- Hard to solve useful instances
51Conclusions
- Lots of interesting optimization problems in bioinformatics
- Usually very large scale
- Need good empirical algorithms/solvers
- Modeling tradeoffs abound
- Speed/Memory/Optimality/Correctness
- Many variants of LSH in different domains
- Resource constrained allocation problems
- ...due to limitations in biotechnologies
- As the technologies scale up, so do the issues