Large Scale Combinatorial Optimization Problems in Bioinformatics

1
Large Scale Combinatorial Optimization Problems
in Bioinformatics
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
Sample Preparation for Peptide Identification
3
Single Stage MS
MS
m/z
4
Tandem Mass Spectrometry (MS/MS)
m/z
Precursor selection
m/z
5
Tandem Mass Spectrometry (MS/MS)
Precursor selection, collision-induced dissociation (CID)
m/z
MS/MS
m/z
6
Novel Splice Isoform
  • Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller et al., MCP 2003.
  • LIME1 gene
  • LCK interacting transmembrane adaptor 1
  • LCK gene
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in
    leukemias.
  • Multiple significant peptide identifications

7
Novel Splice Isoform
8
Novel Splice Isoform
9
Polymerase Chain Reaction
10
PCR
11
Applications of k-mer sets
  • Peptide Identification
  • Set of all human amino-acid 30-mer peptide
    sequences...
  • ...that occur at least twice in dbEST
  • PCR Primer Design
  • Unique 20-mers
  • What does it mean to be unique?

12
k-mer superstrings
  • Completeness
  • All of the required k-mers are represented
  • Correctness
  • No additional k-mers are represented
  • Minimize the total representation length
  • Correlates with running time
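The three criteria above can be checked mechanically. The following is a minimal sketch, assuming the representation is a list of strings and the required k-mers are given as a set; the function names and the toy DNA data are illustrative and not from the talk.

```python
def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def represented(strings, k):
    """Union of the k-mers of every string in the representation."""
    out = set()
    for s in strings:
        out |= kmers(s, k)
    return out

def is_complete(strings, required, k):
    """Completeness: every required k-mer appears somewhere."""
    return required <= represented(strings, k)

def is_correct(strings, required, k):
    """Correctness: no k-mer outside the required set appears."""
    return represented(strings, k) <= required

def total_length(strings):
    """The objective to minimize."""
    return sum(len(s) for s in strings)

# Toy data (hypothetical): four required 3-mers, two candidate representations.
required = {"ACG", "CGT", "GTA", "TTT"}
good = ["ACGTA", "TTT"]   # complete and correct
bad = ["ACGTATTT"]        # complete, but also contains TAT and ATT
```

Joining the two strings of `good` into one shorter-looking superstring breaks correctness, which is why a correct representation may need several strings.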

13
Shortest superstring problem
  • General strings (arbitrary length)
  • Completeness only!
  • Classical NP-hard problem
  • Garey and Johnson
  • Approximable within 2.5 × OPT
  • MAX-SNP-hard
  • One of the first algorithmic approaches to genome
    assembly
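For illustration, here is a sketch of the classical greedy merge heuristic for shortest superstring: repeatedly merge the two strings with the largest overlap. This is a well-known heuristic, not the 2.5-approximation cited above and not necessarily what the talk used; the function names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(strings):
    """Greedy heuristic: merge the pair with maximum overlap until one string is left."""
    uniq = sorted(set(strings))
    # Strings contained in another string need no extra characters.
    pool = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    while len(pool) > 1:
        best = (-1, 0, 1)
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = pool[i] + pool[j][n:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""
```

The result always contains every input string (completeness), but it can contain extra k-mers at the merge junctions, which is why this problem covers "completeness only".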

14
de Bruijn Sequences
  • de Bruijn sequences represent all words of length
    k from some alphabet A.
  • A = {0,1}, k = 3: s = 0001110100
  • A = {0,1}, k = 4: s = 0000111101011001000
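The two example strings can be checked directly: a linear de Bruijn sequence over A for word length k has |A|^k + k - 1 characters, and its length-k windows are exactly the |A|^k possible words, each once. A small sketch (the function name is illustrative):

```python
from itertools import product

def covers_all_words_once(s, alphabet, k):
    """True if the length-k windows of s are exactly the words of length k, each once."""
    windows = [s[i:i + k] for i in range(len(s) - k + 1)]
    words = {"".join(p) for p in product(alphabet, repeat=k)}
    return set(windows) == words and len(windows) == len(words)

print(covers_all_words_once("0001110100", "01", 3))
print(covers_all_words_once("0000111101011001000", "01", 4))
```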

15
de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph with edges labeled 0 or 1]
16
de Bruijn Sequences & Graphs
  • de Bruijn graphs (k, A)
  • Edges represent length-k words from A
  • Each node has
  • in-degree |A|
  • out-degree |A|
  • Eulerian tour constructs de Bruijn sequence.
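A sketch of that construction, assuming nodes are the (k-1)-mers and the edge from a node labeled by character c spells the k-mer node + c. The tour is found with Hierholzer's algorithm, a standard choice that the slides do not name; k must be at least 2 here.

```python
from itertools import product

def de_bruijn(alphabet, k):
    """Linear de Bruijn sequence from an Eulerian tour of the de Bruijn graph (k >= 2)."""
    # Unused outgoing edge labels for every (k-1)-mer node.
    out = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, tour = [start], []
    while stack:
        node = stack[-1]
        if out[node]:
            c = out[node].pop()            # follow an unused edge
            stack.append((node + c)[1:])
        else:
            tour.append(stack.pop())       # dead end: node is finished
    tour.reverse()
    # Spell the tour: the first node, then the last character of each later node.
    return tour[0] + "".join(n[-1] for n in tour[1:])

s = de_bruijn("01", 4)
print(s, len(s))
```

Every node has in-degree and out-degree |A|, so the graph is balanced and the tour uses each edge, that is each k-mer, exactly once.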

17
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
18
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
19
Sequence Databases & CSBH-graphs
  • Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
20
Sequence Databases & CSBH-graphs
  • All k-mers represented by an edge have the same
    count

[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
21
Correct, Complete Enumeration
  • Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
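As a worked check of this example, the sketch below compares the k-mers of the three database sequences with the k-mers of the two paths. The transcript does not state k for this example; k = 4 is an assumption made here.

```python
def kmers(s, k):
    """Set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

K = 4  # assumption: the slides do not give k for this example

database = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
paths = ["ACDEFGEFGI", "DEFACG"]

db_kmers = set().union(*(kmers(s, K) for s in database))
path_kmers = set().union(*(kmers(s, K) for s in paths))

print(db_kmers == path_kmers)        # True: complete and correct for k = 4
print(sum(len(s) for s in database)) # 23 characters in the original sequences
print(sum(len(s) for s in paths))    # 16 characters in the enumeration
```

Under this assumption the two paths represent the same ten 4-mers in 16 characters instead of 23.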
22
Related work
  • Chinese Postman Problem
  • Edmonds and Johnson, 1973
  • Undirected graph, weighted edges
  • Shortest tour that uses every edge at least once
  • Solvable in polynomial time
  • Construct minimum weighted matching between nodes
    of odd-degree
  • Add matching to graph and find Eulerian path
  • Minimize weight of extra edges used
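The recipe above can be sketched for small graphs as follows. This is an illustration only: it assumes a connected graph and finds the minimum-weight matching of odd-degree nodes by exhaustive search, which is exponential; the polynomial-time result relies on a proper matching algorithm. Names are illustrative.

```python
def postman_length(nodes, edges):
    """Length of the shortest closed tour covering every undirected edge (u, v, w)."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    degree = {u: 0 for u in nodes}
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
        degree[u] += 1
        degree[v] += 1
    # All-pairs shortest paths (Floyd-Warshall).
    for m in nodes:
        for u in nodes:
            for v in nodes:
                if dist[u][m] + dist[m][v] < dist[u][v]:
                    dist[u][v] = dist[u][m] + dist[m][v]
    odd = [u for u in nodes if degree[u] % 2]

    def best_matching(rest):
        """Minimum total distance of a perfect matching on rest (brute force)."""
        if not rest:
            return 0
        first, others = rest[0], rest[1:]
        return min(dist[first][p] + best_matching([q for q in others if q != p])
                   for p in others)

    # Every edge once, plus the cheapest duplicated paths between odd nodes.
    return sum(w for _, _, w in edges) + best_matching(odd)

square = [("A", "B", 1), ("B", "C", 1), ("C", "D", 1), ("D", "A", 1), ("A", "C", 2)]
print(postman_length(["A", "B", "C", "D"], square))  # 8
```

In the example, A and C have odd degree; duplicating a shortest A-C path of weight 2 makes the graph Eulerian, giving 6 + 2 = 8.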

23
Correct, Complete Enumeration
  • Chinese postman problem, except
  • Directed graph
  • Add edges from nodes with surplus in-degree to
    nodes with surplus out-degree
  • Fixed cost teleportation option
  • Can always start a new sequence
  • Find optimal set of additional edges
  • Transportation problem / min cost flow instance

24
Patching the CSBH-graph
  • Use artificial edges to fix unbalanced nodes
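A minimal sketch of the patching step, assuming directed edges given as (tail, head) pairs. It pairs unbalanced nodes arbitrarily, as if every artificial edge had the same fixed cost; choosing the cheapest set of artificial edges is the min cost flow problem and is not solved here. Names are illustrative.

```python
from collections import Counter

def surplus(edges):
    """in-degree minus out-degree for every node of a directed edge list."""
    s = Counter()
    for u, v in edges:
        s[u] -= 1
        s[v] += 1
    return s

def patch(edges):
    """Artificial edges that balance every node.

    Each artificial edge leaves a node with surplus in-degree and enters a node
    with surplus out-degree. The pairing is arbitrary (uniform cost assumed).
    """
    s = surplus(edges)
    tails = [u for u, d in s.items() for _ in range(max(d, 0))]
    heads = [u for u, d in s.items() for _ in range(max(-d, 0))]
    return list(zip(tails, heads))

edges = [("A", "B"), ("B", "C"), ("B", "D")]
extra = patch(edges)
print(extra)
```

After adding the artificial edges every node has equal in-degree and out-degree, so an Eulerian tour exists in each connected component; cutting the tour at the artificial edges yields the set of paths.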

25
C3 Enumeration
[Figure: network with node labels "in-out", "in-out"; edge label "Cost k"]
26
C3 Enumeration
[Figure: network with node labels "in-out", "in-out"; edge labels "Cost 0", "Cost 0", "Cost k"]
27
C2 Enumeration
[Figure: network with node labels "in-out", "in-out"; "Shortcut paths" labeled 4, 10, 7]
28
C2 Superstring
[Figure: network with node labels "in-out", "in-out"; edge labels "Cost 0", "Cost 0", "Cap 1", "Cost 0"]
29
Large scale instances
  • Millions of nodes and edges
  • cSBH-graph instance
  • Min cost flow instance
  • cSBH-graph instance takes days to construct
  • 2 days on 250 CPUs
  • Algorithms must be linear in problem size
  • Out-of-core Eulerian path algorithm?

30
Grid computing
  • Heterogeneous machines
  • Varying disk/memory/MHz/cores capabilities
  • Centralized scheduler
  • Jobs started asynchronously
  • Other jobs may preempt current job
  • Input files may need to be staged
  • 250 simultaneous requests for a 3Gb file?
  • How to guarantee integrity of input files?
  • Problem decomposition may be non-trivial
  • Job sizes need to fit the least capable machine
  • Sometimes need to game the scheduler
  • Need to ensure the integrity of job output

31
Uniqueness Oracles
  • Oracle for uniqueness of 20-mers in the Human
    genome (n ≈ 3 Gb)
  • Count occurrences in the genome: 0, 1, or ≥ 2
  • Construct 20-mer superstring for 20-mers with
    count 1
  • Construct 20-mer superstring for 20-mers with
    count > 1
  • Easy for exact sequence match O(n)
  • Fast automata, indexing, hash tables.
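For exact matching, the counting step can be sketched with a hash table. This toy version holds everything in memory and ignores reverse complements, both simplifying assumptions; the names are illustrative.

```python
from collections import Counter

def classify_kmers(genome, k):
    """Split the k-mers of genome into those seen exactly once and those seen more often."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    unique = {w for w, c in counts.items() if c == 1}
    repeated = {w for w, c in counts.items() if c > 1}
    return unique, repeated

unique, repeated = classify_kmers("ACGTACGA", 3)
print(sorted(unique))    # ['CGA', 'CGT', 'GTA', 'TAC']
print(sorted(repeated))  # ['ACG']
```

One pass over the genome does O(n) table updates, matching the bound on the slide; at genome scale the table itself becomes the bottleneck.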

32
Inexact sequence match
  • Inexact sequence matching: O(nmk)
  • Errors/mismatches (k): 1, 2, 3
  • Number of distinct 20-mers (m): O(n)
  • Achieve expected linear time using a hybrid
    approach
  • Exact search for short chunks of queries
  • Expensive alignment only where chunks match
  • Large chunks ⇒ Fast, but miss occurrences
  • Small chunks ⇒ Slow, find all matches

33
Inexact sequence match
  • Baeza-Yates & Perleberg
  • Correct and O(n) for small k
  • At least 1 chunk is observed with no error.
  • Form of locality sensitive hashing
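The pigeonhole idea can be sketched as follows, assuming mismatches only (no insertions or deletions): split the query into k + 1 disjoint chunks, search each chunk exactly, and run the expensive comparison only where a chunk matches. With at most k mismatches, at least one chunk is error-free, so no occurrence is missed. Function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_inexact(genome, query, k):
    """Start positions where query matches genome with at most k mismatches."""
    m = len(query)
    # Boundaries of k + 1 disjoint chunks of nearly equal size.
    bounds = [i * m // (k + 1) for i in range(k + 2)]
    hits = set()
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = query[lo:hi]
        pos = genome.find(chunk)
        while pos != -1:
            start = pos - lo  # where the whole query would begin
            if 0 <= start <= len(genome) - m:
                if hamming(genome[start:start + m], query) <= k:
                    hits.add(start)
            pos = genome.find(chunk, pos + 1)
    return sorted(hits)

print(find_inexact("ACGTACGTTAGCACGAACGT", "ACGTACGA", 2))
```

Shorter chunks (larger k) match more often by chance, so more candidate positions reach the expensive check.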

34
Locality Sensitive Hashing
  • For each query
  • store a (set of) hash(es) in hash-table
  • At each position in the genome
  • look-up a (set of) hash(es) in hash-table
  • if any are in the table, do more expensive check
  • Need to weigh
  • sensitivity (false negatives) against
  • specificity (false positives)
  • Our application requires no false negatives
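The store, look-up, and verify loop can be sketched like this. The hash family is an assumption made for illustration: each hash keeps only the characters at a chosen subset of positions, written as a 0/1 mask string. All names are illustrative.

```python
def project(word, mask):
    """Hash a word by keeping only the characters where mask has a '1'."""
    return "".join(c for c, bit in zip(word, mask) if bit == "1")

def scan(genome, queries, masks, k):
    """Return (position, query) pairs that match with at most k mismatches.

    Only positions whose hash under some mask is in the table are verified.
    """
    m = len(masks[0])
    table = {}
    for q in queries:                       # store the hashes of every query
        for mi, mask in enumerate(masks):
            table.setdefault((mi, project(q, mask)), []).append(q)
    hits = set()
    for pos in range(len(genome) - m + 1):  # look up at each genome position
        window = genome[pos:pos + m]
        for mi, mask in enumerate(masks):
            for q in table.get((mi, project(window, mask)), ()):
                if sum(a != b for a, b in zip(q, window)) <= k:  # expensive check
                    hits.add((pos, q))
    return hits

# With k = 1, one error cannot hit both halves, so these two masks miss nothing.
print(scan("AAAACCGTAA", ["CCGA"], ["1100", "0011"], 1))
```

False positives cost only an extra verification; false negatives depend entirely on whether the masks leave at least one hash untouched for every placement of the errors.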

35
Random Projection
  • Choose T templates of l random care positions

[Figure: template t1 applied to genome g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
36
Random Projection
  • Choose T templates of l random positions

[Figure: templates t1 and t2 applied to genome g]
t1 = 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2 = 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
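A sketch of how randomly chosen templates can be checked for false negatives. A template is represented here as the set of its care positions, and a placement of k errors is missed when it touches a care position in every template. The representation and names are assumptions for illustration.

```python
import random
from itertools import combinations

def random_templates(m, l, T, seed=0):
    """T templates, each a random set of l care positions out of m."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(m), l)) for _ in range(T)]

def uncovered(templates, m, k):
    """Placements of k errors that hit a care position in every template."""
    return [e for e in combinations(range(m), k)
            if all(t & set(e) for t in templates)]

for T in (2, 4, 8, 16):
    missed = uncovered(random_templates(20, 10, T), 20, 2)
    print(T, len(missed))
```

An empty `uncovered` list means the template set has no false negatives for that (m, k); random choice gives no such guarantee for a fixed T, which motivates designing the set deliberately.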
39
Gapped seed-set design problem
  • Given
  • mer-size m (20)
  • errors k (1, 2, 3)
  • cares l (10, 12, 14)
  • Find the smallest set of templates with no false
    negatives.
  • Minimize running time.

40
Alternative Formulation 1 (for k = 2)
  • Cover the edges of K_m with copies of K_{m-l}
  • How many triangles to cover K_6?
  • Some instances of (m, 2, m-3) cover each edge
    exactly once
  • Steiner triple systems
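The K_6 question is small enough to answer by exhaustive search. The sketch below is a brute-force illustration, only practical for tiny m; the function name is illustrative.

```python
from itertools import combinations

def min_clique_cover(m, size):
    """Fewest cliques on `size` vertices needed to cover every edge of K_m."""
    edges = set(combinations(range(m), 2))
    cliques = [set(combinations(c, 2)) for c in combinations(range(m), size)]
    n = 1
    while True:
        for choice in combinations(cliques, n):
            if set().union(*choice) == edges:
                return n
        n += 1

print(min_clique_cover(6, 3))  # 6
```

K_6 has 15 edges, so 5 triangles would have to cover each edge exactly once. That is impossible because every vertex has odd degree 5 and each triangle uses two edges at a vertex, so 6 triangles are needed. Exact covers by triangles (Steiner triple systems) exist only when m is congruent to 1 or 3 modulo 6, for example m = 7.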

41
Alternative Formulation 2
  • Set cover instance
  • Ground set all possible placements of the k
    errors (alignments)
  • Covering sets all possible placements of the l
    care positions
  • For (m = 20, k = 2, l = 10):
  • 190 elements, 184,756 sets!
  • Greedy approximation algorithm works
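A sketch of the greedy set cover heuristic on this formulation, run on a small instance so that it finishes quickly in plain Python. A template covers an error placement when none of the errors falls on a care position. Names and the small instance are illustrative.

```python
from itertools import combinations

def greedy_templates(m, k, l):
    """Greedy set cover: pick the template covering the most uncovered placements."""
    if m - l < k:
        raise ValueError("no template can avoid k errors when m - l < k")
    uncovered = set(combinations(range(m), k))   # ground set: error placements
    candidates = list(combinations(range(m), l)) # covering sets: care placements
    chosen = []
    while uncovered:
        def gain(t):
            care = set(t)
            return sum(1 for e in uncovered if care.isdisjoint(e))
        best = max(candidates, key=gain)
        chosen.append(best)
        care = set(best)
        uncovered = {e for e in uncovered if not care.isdisjoint(e)}
    return chosen

templates = greedy_templates(10, 2, 5)  # 45 elements, 252 sets
print(len(templates))
```

For (20, 2, 10) the same loop scans all 184,756 candidate templates in every round, which is still feasible but slow without a more careful implementation.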

42
Alternative Formulation 3
[Figure: bipartite graph of templates and positions (m); label l]
Remove any k position nodes: at least 1 template still has
degree l.
43
Alternative Formulation 3
  • x_{ti}: template t has a care at position i
  • y_{ta}: alignment a is not covered by template t
  • z_a: alignment a is not covered by any template

44
Alternative Formulation 3
  • Polynomial size in terms of number of templates
  • Select T in advance and test whether T is
    sufficient.
  • Greedily add T templates.
  • Apply iteratively to achieve feasible solution
  • Extremely weak LP relaxation
  • Lots of symmetry!
  • Hard to solve useful instances

45
Solution for (20,2,10)
[Figure: templates t1 to t6 laid out over 20 positions]
  • Need at least 4 templates, 6 is optimal

46
Remember the application!
  • We are checking some templates twice!
  • We compute the hash(es) at each position in the
    genome
  • Any template that is a shift of another will be
    computed at some nearby genomic position!
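One way to detect such redundancy is to compare templates after removing leading and trailing don't-care positions. This sketch assumes templates are 0/1 strings and that a shift simply slides the care pattern left or right within the window; the transcript does not spell out the exact model, and the names are illustrative.

```python
def canonical(template):
    """Strip leading and trailing don't-care positions ('0')."""
    return template.strip("0")

def distinct_up_to_shift(templates):
    """Care patterns that remain after merging templates that are shifts of each other."""
    return sorted({canonical(t) for t in templates})

print(distinct_up_to_shift(["011010", "110100", "001101"]))  # ['1101']
```

Here three templates collapse to a single care pattern, so only one hash per genome position has to be computed for all of them.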

47
Solution for (20,2,10)
[Figure: templates t1 to t6 laid out over 20 positions]
  • Need at most 3 templates...can we do better?

48
Alternative Formulation 3 with template shift
49
Solution for (20,2,10) w/ shift
[Figure: templates t1 and t2 laid out over 20 positions]
  • Optimal is 2 templates...

50
Alternative Formulation 3 with shift
  • Polynomial size in terms of number of templates
  • Select T in advance and test whether T is
    sufficient.
  • Greedily add T templates.
  • Apply iteratively to achieve feasible solution
  • Greedy not known to be good!
  • Extremely weak LP relaxation
  • Much less symmetry
  • Hard to solve useful instances

51
Conclusions
  • Lots of interesting optimization problems in
    bioinformatics
  • Usually very large scale
  • Need good empirical algorithms/solvers
  • Modeling tradeoffs abound
  • Speed/Memory/Optimality/Correctness
  • Many variants of LSH in different domains
  • Resource constrained allocation problems
  • ...due to limitations in biotechnologies
  • As the technologies scale up, so do the issues