Large Scale Combinatorial Optimization Problems in Bioinformatics


Slides: 52

Provided by: umiac7


1
Large Scale Combinatorial Optimization Problems
in Bioinformatics

Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park


2
Sample Preparation for Peptide Identification
3
Single Stage MS
MS
m/z
4
Tandem Mass Spectrometry(MS/MS)
m/z
Precursor selection
m/z
5
Tandem Mass Spectrometry(MS/MS)
Precursor selection + collision-induced dissociation (CID)
m/z
MS/MS
m/z
6
Novel Splice Isoform

Human Jurkat leukemia cell-line
Lipid-raft extraction protocol, targeting T cells
von Haller, et al. MCP 2003.
LIME1 gene
LCK interacting transmembrane adaptor 1
LCK gene
Leukocyte-specific protein tyrosine kinase
Proto-oncogene
Chromosomal aberration involving LCK in
leukemias.
Multiple significant peptide identifications

7
Novel Splice Isoform
8
Novel Splice Isoform
9
Polymerase Chain Reaction
10
PCR
11
Applications of k-mer sets

Peptide Identification
Set of all human amino-acid 30-mer peptide
sequences...
...that occur at least twice in dbEST
PCR Primer Design
Unique 20-mers
What does it mean to be unique?

12
k-mer superstrings

Completeness
All of the required k-mers are represented
Correctness
No additional k-mers are represented
Minimize the total representation length
Correlates with running time

13
Shortest superstring problem

General strings (arbitrary length)
Completeness only!
Classical NP-hard problem
Garey and Johnson
Approximable within 2.5 × OPT
Max-SNP hard
One of the first algorithmic approaches to genome
assembly
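
As an illustration of the superstring idea, here is a minimal sketch of the classic greedy merge heuristic: repeatedly merge the two strings with the largest suffix/prefix overlap. This is a swapped-in textbook heuristic, not the 2.5-approximation the slide refers to, and the function names are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    # Drop duplicates and strings already contained in another input string.
    items = sorted(s for s in set(strings)
                   if not any(s != t and s in t for t in strings))
    while len(items) > 1:
        best = (-1, 0, 1)
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = items[i] + items[j][n:]
        items = [s for idx, s in enumerate(items) if idx not in (i, j)] + [merged]
    return items[0] if items else ""


result = greedy_superstring(["ACGT", "CGTA", "GTAC"])
print(result, len(result))
```

The result only guarantees completeness (every input is a substring), which is the point the slide makes about general strings.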

14
de Bruijn Sequences

de Bruijn sequences represent all words of length k from some alphabet A.
A = {0,1}, k = 3: s = 0001110100
A = {0,1}, k = 4: s = 0000111101011001000
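
The two example sequences on this slide can be checked mechanically. The sketch below (helper names are hypothetical) tests that a string contains every length-k word over the alphabet exactly once.

```python
from itertools import product


def kmers(s, k):
    """All length-k substrings of s, in order, with repeats."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def is_de_bruijn(s, alphabet, k):
    """True if s contains every length-k word over alphabet exactly once."""
    words = kmers(s, k)
    every = {"".join(p) for p in product(alphabet, repeat=k)}
    return set(words) == every and len(words) == len(every)


print(is_de_bruijn("0001110100", "01", 3))
print(is_de_bruijn("0000111101011001000", "01", 4))
```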

15
de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph whose edges are labeled 0 or 1]
16
de Bruijn Sequences & Graphs

de Bruijn graph (k, A)
Edges represent the length-k words over A
Each node has
in-degree |A|
out-degree |A|
An Eulerian tour constructs a de Bruijn sequence.
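
The Eulerian-tour construction can be sketched as follows, assuming nodes are the (k-1)-mers and k >= 2. The tour is found with Hierholzer's algorithm, which is one standard choice and not necessarily what the author used; the function name is hypothetical.

```python
from itertools import product


def de_bruijn_sequence(alphabet, k):
    """Spell out an Eulerian tour of the de Bruijn graph (requires k >= 2)."""
    # Each node is a (k-1)-mer; the edge labeled c leaves u and enters (u + c)[1:].
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, tour = [start], []
    while stack:                       # iterative Hierholzer
        u = stack[-1]
        if unused[u]:
            c = unused[u].pop()
            stack.append((u + c)[1:])
        else:
            tour.append(stack.pop())
    tour.reverse()
    # First node in full, then the last character of every following node.
    return tour[0] + "".join(v[-1] for v in tour[1:])


s = de_bruijn_sequence("01", 3)
print(s, len(s))
```

Each edge is used exactly once, so the output has |A|^k + k - 1 characters.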

17
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
18
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
19
Sequence Databases & CSBH-graphs

Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
20
Sequence Databases & CSBH-graphs

All k-mers represented by an edge have the same count

[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
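
A minimal sketch of building the SBH-graph for the three example sequences, compressing non-branching runs, and reading off the k-mer counts per compressed edge. The slides do not state k; k = 4 is an assumption here, chosen because it gives five compressed edges whose counts match the figure. All function names are hypothetical, and sequence ends are not treated as break points, which a real implementation would probably need.

```python
from collections import Counter, defaultdict


def kmer_counts(seqs, k):
    """Count every k-mer occurrence across all sequences."""
    return Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))


def compressed_edges(kmers):
    """Merge runs of SBH-graph edges that pass through non-branching nodes."""
    k = len(next(iter(kmers)))          # assumes k >= 2
    out_edges, indeg = defaultdict(list), Counter()
    for w in sorted(kmers):
        out_edges[w[:-1]].append(w)     # edge w leaves node w[:-1]
        indeg[w[1:]] += 1               # and enters node w[1:]

    def simple(v):
        return indeg[v] == 1 and len(out_edges.get(v, [])) == 1

    labels = []
    for u in sorted(set(out_edges) | set(indeg)):
        if simple(u):
            continue                    # runs start only at branching or end nodes
        for w in out_edges.get(u, []):
            label = w
            while simple(label[-(k - 1):]):
                label += out_edges[label[-(k - 1):]][0][-1]
            labels.append(label)
    return labels                       # isolated simple cycles are not handled


def edge_counts(seqs, k):
    """Map each compressed edge label to the distinct counts of its k-mers."""
    counts = kmer_counts(seqs, k)
    return {label: sorted({counts[label[i:i + k]]
                           for i in range(len(label) - k + 1)})
            for label in compressed_edges(counts)}


for label, c in edge_counts(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4).items():
    print(label, c)
```

In this example every compressed edge carries a single count value, which is the property the slide states.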
21
Correct, Complete Enumeration

Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
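
The completeness and correctness of this enumeration can be checked directly. The sketch below assumes k = 4, which the slides do not state, and compares the k-mer set of the original sequences with that of the two paths.

```python
def kmer_set(seqs, k):
    """Set of all k-mers that occur in any of the sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


K = 4  # assumed; the slides do not give k
originals = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
paths = ["ACDEFGEFGI", "DEFACG"]

required = kmer_set(originals, K)
represented = kmer_set(paths, K)

complete = required <= represented   # every required k-mer is represented
correct = represented <= required    # no additional k-mers are represented
print(complete, correct)
print(sum(map(len, originals)), "->", sum(map(len, paths)), "characters")
```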
22
Related work

Chinese Postman Problem
Edmonds and Johnson, 1973
Undirected graph, weighted edges
Shortest tour that uses every edge at least once
Solvable in polynomial time
Construct a minimum-weight matching between the odd-degree nodes
Add the matching to the graph and find an Eulerian tour
Minimizes the weight of the extra edges used
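
The matching step can be sketched on a tiny graph. This version assumes a connected graph, uses Floyd-Warshall for shortest paths, and finds the minimum-weight matching of odd-degree nodes by brute-force recursion; a polynomial-time implementation would use a proper matching algorithm instead. The function name is hypothetical.

```python
def postman_extra_weight(n, edges):
    """Minimum total weight of the edges that must be traversed a second time.

    n: number of nodes (0..n-1); edges: list of (u, v, weight), undirected.
    """
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    degree = [0] * n
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
        degree[u] += 1
        degree[v] += 1
    for mid in range(n):                      # Floyd-Warshall
        for i in range(n):
            for j in range(n):
                if dist[i][mid] + dist[mid][j] < dist[i][j]:
                    dist[i][j] = dist[i][mid] + dist[mid][j]
    odd = [v for v in range(n) if degree[v] % 2]

    def best(nodes):                          # brute-force perfect matching
        if not nodes:
            return 0
        first, rest = nodes[0], nodes[1:]
        return min(dist[first][p] + best([x for x in rest if x != p])
                   for p in rest)

    return best(odd)


square_with_diagonal = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1), (0, 2, 1)]
extra = postman_extra_weight(4, square_with_diagonal)
print(sum(w for _, _, w in square_with_diagonal) + extra)  # length of the tour
```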

23
Correct, Complete, Enumeration

Chinese postman problem, except
Directed graph
Add edges from nodes with surplus in-degree to
nodes with surplus out-degree
Fixed cost teleportation option
Can always start a new sequence
Find optimal set of additional edges
Transportation problem / min cost flow instance
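
One way to picture the min cost flow instance: unbalanced nodes send or receive flow, direct arcs cost the length of a path through the graph, and a hub node models the fixed-cost "teleport" of starting a new sequence. The sketch below solves a toy instance with successive shortest paths (Bellman-Ford). The node numbering, costs, and hub modelling are assumptions for illustration, not the author's formulation.

```python
def min_cost_flow(n, arcs, source, sink):
    """Successive shortest paths. arcs: list of (tail, head, capacity, cost)."""
    graph = [[] for _ in range(n)]
    head, cap, cost = [], [], []
    for u, v, c, w in arcs:                 # forward arc at even index, reverse at odd
        graph[u].append(len(head)); head.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(head)); head.append(u); cap.append(0); cost.append(-w)
    inf = float("inf")
    total_flow = total_cost = 0
    while True:
        dist, prev = [inf] * n, [-1] * n
        dist[source] = 0
        for _ in range(n - 1):              # Bellman-Ford on the residual graph
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[head[e]]:
                        dist[head[e]] = dist[u] + cost[e]
                        prev[head[e]] = e
                        changed = True
            if not changed:
                break
        if dist[sink] == inf:
            return total_flow, total_cost
        push, v = inf, sink
        while v != source:                  # bottleneck along the path
            push = min(push, cap[prev[v]])
            v = head[prev[v] ^ 1]
        v = sink
        while v != source:                  # augment
            cap[prev[v]] -= push
            cap[prev[v] ^ 1] += push
            v = head[prev[v] ^ 1]
        total_flow += push
        total_cost += push * dist[sink]


# Toy instance (all numbers hypothetical):
# 0 = source, 1 and 2 = nodes needing one extra outgoing edge each,
# 3 = teleport hub, 4 = node needing two extra incoming edges, 5 = sink.
BIG, TELEPORT = 10**9, 4
arcs = [
    (0, 1, 1, 0), (0, 2, 1, 0),                # one unit of imbalance at each node
    (1, 4, BIG, 2), (2, 4, BIG, 9),            # path costs through the graph
    (1, 3, BIG, TELEPORT), (2, 3, BIG, TELEPORT),
    (3, 4, BIG, 0),                            # leaving the hub is free
    (4, 5, 2, 0),
]
print(min_cost_flow(6, arcs, 0, 5))
```

In this toy instance one unit takes the cheap direct path and the other teleports, since the hub is cheaper than the long path.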

24
Patching the CSBH-graph

Use artificial edges to fix unbalanced nodes

25
C3 Enumeration
[Figure: flow network between node groups labeled in-out; edge label: Cost k]
26
C3 Enumeration
[Figure: flow network between node groups labeled in-out; edge labels: Cost 0, Cost 0, Cost k]
27
C2 Enumeration
[Figure: node groups labeled in-out joined by shortcut paths; labels: 4, 10, 7]
28
C2 Superstring
[Figure: flow network between node groups labeled in-out; edge labels: Cost 0, Cost 0, Cap 1, Cost 0]
29
Large scale instances

Millions of nodes and edges
cSBH-graph instance
Min cost flow instance
cSBH-graph instance takes days to construct
2 days on 250 CPUs
Algorithms must be linear in problem size
Out-of-core Eulerian path algorithm?

30
Grid computing

Heterogeneous machines
Varying disk/memory/MHz/cores capabilities
Centralized scheduler
Jobs started asynchronously
Other jobs may preempt current job
Input files may need to be staged
250 simultaneous requests for a 3Gb file?
How to guarantee integrity of input files?
Problem decomposition may be non-trivial
Job sizes need to fit the least capable machine
Sometimes need to game the scheduler
Need to ensure the integrity of job output

31
Uniqueness Oracles

Oracle for uniqueness of 20-mers in the human genome (n ≈ 3 Gb)
Count occurrences in the genome: 0, 1, 2
Construct 20-mer superstring for 20-mers with count 1
Construct 20-mer superstring for 20-mers with count > 1
Easy for exact sequence match: O(n)
Fast automata, indexing, hash tables.
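
For exact matching, the oracle amounts to a capped occurrence count. Here is a minimal hash-table sketch on a toy string; it ignores reverse complements and would not fit a 3 Gb genome in memory as written, so treat it as an illustration of the 0 / 1 / more-than-1 answer only. The names are hypothetical.

```python
from collections import Counter


def build_oracle(genome, k):
    """Return a function mapping a k-mer to 0, 1, or 2 (2 means more than once)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

    def oracle(kmer):
        return min(counts.get(kmer, 0), 2)

    return oracle


oracle = build_oracle("ACGTACGA", 3)
print(oracle("ACG"), oracle("CGT"), oracle("TTT"))
```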

32
Inexact sequence match

Inexact sequence matching: O(nmk)
Errors/mismatches (k): 1, 2, 3
Number of distinct 20-mers (m): O(n)
Achieve expected linear time using a hybrid approach
Exact search for short chunks of queries
Expensive alignment only where chunks match
Large chunks ⇒ fast, but miss occurrences
Small chunks ⇒ slow, but find all matches

33
Inexact sequence match

Baeza-Yates & Perleberg
Correct and O(n) for small k
Split the query into k+1 chunks: with at most k errors, at least 1 chunk is observed with no error.
Form of locality sensitive hashing
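
The chunk filter can be sketched as follows for mismatches only (Hamming distance, no insertions or deletions). The query is cut into k+1 chunks, each chunk is searched exactly, and the full comparison runs only where a chunk matches. It assumes the query has at least k+1 characters; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(genome, query, k):
    """Start positions where query matches genome with at most k mismatches."""
    m = len(query)
    bounds = [i * m // (k + 1) for i in range(k + 2)]   # k+1 chunks
    hits = set()
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = query[lo:hi]
        pos = genome.find(chunk)
        while pos != -1:
            start = pos - lo                             # implied query start
            if 0 <= start <= len(genome) - m:
                if hamming(genome[start:start + m], query) <= k:
                    hits.add(start)
            pos = genome.find(chunk, pos + 1)
    return sorted(hits)


print(find_with_mismatches("TTACGTACGTTT", "ACGAACGT", 1))
```

Because at least one chunk must match exactly, the filter never misses an occurrence with at most k mismatches.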

34
Locality Sensitive Hashing

For each query
store a (set of) hash(es) in hash-table
At each position in the genome
look-up a (set of) hash(es) in hash-table
if any are in the table, do more expensive check
Need to weigh
sensitivity (false negatives) against
specificity (false positives)
Our application requires no false negatives

35
Random Projection

Choose T templates of l random care positions

[Figure: template t1 laid over g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
36
Random Projection

Choose T templates of l random positions

[Figure: templates t1 and t2 laid over g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2 = 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
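
The hashing scheme from the previous slides can be sketched with templates as 0/1 masks: each query is stored under its projection onto every template, and each genome window is looked up the same way before the expensive check. The tiny templates below (m = 4, k = 1) are hypothetical and are not the ones from the figure.

```python
from collections import defaultdict


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def project(s, template):
    """Keep only the characters of s at the template's care positions."""
    return "".join(c for c, care in zip(s, template) if care)


def scan(genome, queries, templates, k):
    """Return (position, query) pairs with at most k mismatches."""
    m = len(queries[0])
    table = defaultdict(set)
    for q in queries:                                   # index every query
        for t_idx, t in enumerate(templates):
            table[(t_idx, project(q, t))].add(q)
    hits = set()
    for pos in range(len(genome) - m + 1):              # scan the genome
        window = genome[pos:pos + m]
        for t_idx, t in enumerate(templates):
            for q in table.get((t_idx, project(window, t)), ()):
                if hamming(window, q) <= k:             # expensive check
                    hits.add((pos, q))
    return hits


templates = [(1, 1, 0, 0), (0, 0, 1, 1)]   # one error always misses some template
print(sorted(scan("AAACGTAAAGGTAA", ["ACGT"], templates, 1)))
```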
39
Gapped seed-set design problem

Given
mer-size m (20)
errors k (1, 2, 3)
cares l (10, 12, 14)
Find the smallest set of templates with no false
negatives.
Minimize running time.
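
The "no false negatives" requirement can be tested exhaustively for small parameters: a template sees a k-error alignment only if none of its care positions is an error position. A minimal sketch, with a hypothetical function name and toy templates:

```python
from itertools import combinations


def uncovered_alignments(templates, m, k):
    """Error placements (k positions out of m) that every template would miss."""
    cares = [{i for i, c in enumerate(t) if c} for t in templates]
    return [errs for errs in combinations(range(m), k)
            if all(care & set(errs) for care in cares)]


# Toy instance: m = 4, k = 1, l = 2.
print(uncovered_alignments([(1, 1, 0, 0), (0, 0, 1, 1)], 4, 1))
print(uncovered_alignments([(1, 1, 0, 0)], 4, 1))
```

An empty list means the template set has no false negatives for exactly k errors, and therefore for fewer errors as well.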

40
Alternative Formulation 1 (for k = 2)

Cover the edges of K_m with copies of K_{m-l}
How many triangles to cover K_6?
Some instances of (m, 2, m-3) cover each edge exactly once
Steiner triple systems
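
Both points on this slide can be explored by brute force on small cases. The sketch below checks that the Fano plane, a standard Steiner triple system on 7 points that is not taken from the slides, covers every pair exactly once, and searches for the fewest triangles that cover the edges of K_6. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_coverage(triples):
    """How many chosen triples contain each pair of points."""
    return Counter(frozenset(p) for t in triples
                   for p in combinations(sorted(t), 2))


def min_triangle_cover(m):
    """Smallest number of triangles covering every edge of K_m (exhaustive search)."""
    pairs = {frozenset(p) for p in combinations(range(m), 2)}
    triangles = list(combinations(range(m), 3))
    for size in range(1, len(triangles) + 1):
        for chosen in combinations(triangles, size):
            if set(pair_coverage(chosen)) == pairs:
                return size


FANO = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
coverage = pair_coverage(FANO)
print(len(coverage), set(coverage.values()))
print(min_triangle_cover(6))
```

In template terms, each triple is the set of don't-care positions of one template for (m, 2, m-3).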

41
Alternative Formulation 2

Set cover instance
Ground set: all possible placements of the k errors (alignments)
Covering sets: all possible placements of the l care positions
For (m = 20, k = 2, l = 10),
190 elements, 184,756 sets!
Greedy approximation algorithm works
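
The set cover view leads to the usual greedy heuristic: repeatedly pick the template that covers the most still-uncovered alignments. The sketch below is a naive version run on a small hypothetical instance (m = 8, k = 2, l = 4); at (20, 2, 10) it would scan all 184,756 candidate templates on every round and be slow as written.

```python
from itertools import combinations


def greedy_templates(m, k, l):
    """Greedy set cover; each template is returned as a tuple of care positions."""
    if m - l < k:
        raise ValueError("no template can avoid k errors")
    alignments = set(combinations(range(m), k))      # ground set
    candidates = list(combinations(range(m), l))     # covering sets
    chosen = []
    while alignments:
        def gain(care):
            free = set(range(m)) - set(care)         # don't-care positions
            return sum(1 for a in alignments if free.issuperset(a))
        best = max(candidates, key=gain)
        free = set(range(m)) - set(best)
        alignments = {a for a in alignments if not free.issuperset(a)}
        chosen.append(best)
    return chosen


templates = greedy_templates(8, 2, 4)
print(len(templates))
for care in templates:
    print("".join("1" if i in care else "0" for i in range(8)))
```

Greedy gives a feasible template set but no guarantee that it is the smallest one.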

42
Alternative Formulation 3
[Figure: bipartite graph between templates and positions (m); each template has degree l]
Remove any k position nodes: at least 1 template still has degree l.
43
Alternative Formulation 3

x_{ti} = 1: template t has a care at position i
y_{ta} = 1: alignment a is not covered by template t
z_a = 1: alignment a is not covered by any template
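
One integer program consistent with these three variable definitions is sketched below. The constraints are a reconstruction and an assumption, not taken from the slides: x_{ti}, y_{ta}, z_a are the binary variables defined above, T is the number of templates fixed in advance, and "i in a" means position i is one of the k error positions of alignment a.

```latex
\begin{align*}
\min \quad & \sum_{a} z_a \\
\text{s.t.} \quad
  & \sum_{i=1}^{m} x_{ti} = l
      && \forall t \\
  & y_{ta} \ge x_{ti}
      && \forall t,\ \forall a,\ \forall i \in a \\
  & z_a \ge \sum_{t=1}^{T} y_{ta} - (T - 1)
      && \forall a \\
  & x_{ti},\ y_{ta},\ z_a \in \{0, 1\}
\end{align*}
```

Under this reading, T templates are sufficient exactly when the optimal value is 0, and the model has on the order of T times the number of alignments variables and constraints.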

44
Alternative Formulation 3

Polynomial size in terms of number of templates
Select T in advance and test whether T is
sufficient.
Greedily add T templates.
Apply iteratively to achieve feasible solution
Extremely weak LP relaxation
Lots of symmetry!
Hard to solve useful instances

45
Solution for (20,2,10)

[Figure: templates t1 to t6 laid out across the 20 positions]

Need at least 4 templates, 6 is optimal

46
Remember the application!

We are checking some templates twice!
We compute the hash(es) at each position in the
genome
Any template that is a shift of another will be
computed at some nearby genomic position!
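
The observation about shifts can be made concrete: two templates with the same pattern of cares, offset by s positions, produce the same projection at genome positions that differ by s. The sketch below groups templates by their care pattern with leading and trailing don't-cares stripped; the names and the toy templates are hypothetical, and every template is assumed to have at least one care.

```python
def core_and_offset(template):
    """Care pattern without outer don't-cares, and the index of the first care."""
    bits = "".join("1" if c else "0" for c in template)
    return bits.strip("0"), bits.index("1")


def shift_classes(templates):
    """Group template indices by core pattern; members are shifts of each other."""
    classes = {}
    for idx, t in enumerate(templates):
        core, offset = core_and_offset(t)
        classes.setdefault(core, []).append((idx, offset))
    return classes


def project(genome, pos, template):
    """Characters at the template's care positions for the window starting at pos."""
    return "".join(genome[pos + i] for i, c in enumerate(template) if c)


templates = [(1, 1, 0, 1, 0, 0), (0, 1, 1, 0, 1, 0), (1, 0, 1, 1, 0, 0)]
print(shift_classes(templates))

genome = "ACGTTGCAAC"
# Template 1 is template 0 shifted by one position, so these two hashes agree.
print(project(genome, 2, templates[1]) == project(genome, 3, templates[0]))
```

Only one template per class needs to be hashed while scanning; the others can reuse the value from a nearby position.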

47
Solution for (20,2,10)