Title: Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads
1Ab Initio Whole Genome Shotgun Assembly With
Mated Short Reads
Paul Medvedevs brain
Michael Brudnos brain
Brainchild!
- Presented by Lucas Lochovsky
2Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
3Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
4Who Wants to Assemble an Entire Genome?
- Two approaches
- I) Clone-by-clone assembly (a.k.a. hierarchical
shotgun approach) - Break genome into 150 kb fragments
- Insert into BAC and map onto chromosomes
- Shotgun sequence each BAC and assemble into a
contig
5Who Wants to Assemble an Entire Genome? (contd)
- II) Whole Genome Shotgun (WGS) Assembly
- Break genome into shotgun-sized fragments and
sequence - Match the overlapping regions of contiguous
sequences - Demonstrated by Celera Genomics to be feasible
for whole genome assembly - Sequenced human genome at 1/10th the cost of the
public Human Genome Project
6Who Wants to Assemble an Entire Genome? (contd)
- Next Generation Sequencing (NGS) potential
- third player?
- Improved speed and cost-effectiveness relative to
the other methods - but much shorter read length (25-200 bp)
- Only proven on resequencing projects, i.e. a
reference genome is already available - Not yet proven on ab initio genome sequencing
- No advance knowledge available for ab initio
(Defn from first principles) genome sequencing
7Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
8Review of Related Work
- Genome assembly intuition identify shortest
common superstring of all reads - Equivalent to finding the shortest genome that is
possible with the given reads - Problem each repeat only used once
over-collapsing the repeats - Accommodate multiple uses of repeat
regions makes sequence assembly amenable to
graph-theoretic methods
9Review of Related Work (contd)
- EULER assembler (Pevzner, Tang and
- Waterman)
- Represent reads as edges and overlaps as vertices
in a de Bruijn graph - Assembly can be efficiently solved as an Eulerian
Path Problem each edge must be visited exactly
once - Repeats dealt with by using multiple edges for a
single repeat read
10Review of Related Work (contd)
- String graph (Myers)
- Represent reads as vertices, and read overlaps as
edges - Remove redundant edges
- Establish edge constraints
- Unique? (flow is exactly one)
- Required? (min. flow is 1)
- Optional? (min. flow is 0)
- Find shortest walk
11Review of Related Work (contd)
- Flow entering and exiting each vertex in these
graphs must be balanced - Gave rise to the idea of network flow methods for
assembly - EULER assembler minimum cost circulation problem
- String graph place constraints on copy-counts
before solving the flow - Network flow fails for the assembly problem when
there are repeats longer than the span of a
single read
12Review of Related Work (contd)
- Sequence assembly using NGS
- Several methods available now (e.g. SSAKE, VCAKE,
SHARCGS, etc.) - All of these assume that the length of the
assembled genome must be minimized - Results in over-collapsing of repeats
- Given ubiquity of repeats in eukaryotic genomes,
authors considered this a poor assumption
13Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
14The Medvedev-Brudno Method
- Change goal of sequence assembly
- Maximize the likelihood that the resultant genome
was the source of the given reads - Take advantage of the high coverage of NGS to
statistically estimate the copy-count of each
read identify and quantify repeats - Maximizing the likelihood of observed read
frequencies can be cast as mininum cost
bidirected flow (biflow) problem - Allows solution to be obtained with an
off-the-shelf network flow solver - Authors claim 99.99 accuracy
15The Medvedev-Brudno Method (contd)
- Second important aspect is the use of matepair
information for joining contigs - Other systems look for all paths between mated
reads - The Medvedev-Brudno Method looks only for short
paths between some pairs of reads - Question How do you decide the upper bound for
these short paths? And how do you decide which
pairs of reads to examine?
16Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
17Methods Bidirected Overlap Graph
- Bidirected graphs are kind of like directed
graphs, except each edge has an orientation on
each of its ends - Gives rise to three types of edges
- Edges where one arrow points out of a vertex, and
one arrow points into a vertex - Edges with both arrows pointing out, and
- Edges with both arrows pointing in (easiest one
to do in PowerPoint!) - For a walk in a bidirected graph, for each vertex
on that walk, the orientation of the edge
entering the vertex must be opposite that of the
edge leaving the vertex
18Methods Bidirected Overlap Graph (contd)
- In a bidirected overlap graph, each vertex is a
double-stranded read - Edges represent read overlaps
- Three possible ways that two double-stranded
reads can overlap (corresponds to the three types
of edges) - Suppose we have two ds reads r1 and r2
- Each read can be oriented to the left or to the
right - The three possible overlaps are
- i) Both strands point in the same direction (both
- reads can point left, or both can point
right, its - the same overlap either way)
- ii) r1 points left and r2 points right
- iii) r1 points right and r2 points left
19Methods Bidirected Overlap Graph (contd)
- A walk along this graph that visits every vertex
at least once produces the original
double-stranded genome (under the assumptions
that the whole genome was covered by the reads,
and that the reads are error-free) - In this paper, the overlap graph is constructed
by placing an edge between two reads if they
overlap by a minimum number of characters omin - Question How is omin determined?
- Then perform transitive edge reduction remove
overlaps covered by two shorter overlaps
20Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
21Methods Adjustments to the Standard Min-cost
Biflow Problem
- Standard Min-cost Biflow Problem
- Set upper and lower flow bounds on each edge
- Flow function f E ? N must obey the constraint
for each edge e - For each vertex, the incoming flow is balanced
with the outgoing flow - Objective Find the flow that minimizes
22Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
- Medvedev-Brudno Min-cost Biflow Problem
- Upper and lower flow bounds on vertices as well
- Accomplished by splitting every vertex v into
two v and v- - v- serves as the incoming vertex, and inherits
vs incoming edges - v serves as the outgoing vertex, and inherits
vs outgoing edges - Finally add one edge between v-
- and v and assign it the upper and
- lower flow bounds for v
23Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
- Second variation represent the cost ce as a
convex function - A function is convex if every point on or above
it forms a convex set - A convex set refers to an area where, for every
pair of points within that area, every point on
the straight line segment connecting those points
also lies within that area
24Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
- An area that is not convex would have some sort
of concave portion that would contradict the
above property of convex sets - In the overlap graph, convex functions are
modelled with piecewise-linear approximations,
allowing the flow to be solved as a linear
min-cost flow problem
25Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
- Supersource and supersink added to convert flow
problem into circulation problem - Each vertex has a lower bound of 1, since each
read must appear in the finished genome at least
once - Edge bounds are set to 0 (lower bound) and
infinity (upper bound) - Put prohibitively large cost on the edge leading
from the supersource and the edge leading to the
supersink to ensure that the assembly uses the
smallest number of contigs possible - Flow through each vertex represents number of
times it appears in the assembled genome
26Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
27Methods Maximizing the Global Read-Count
Likelihood
- Start with the probability of a k-mer i being
sampled a certain number of times from a genome G - Let N(G) be the length of the genome assembly of
G, and let gi be the frequency of i in G - Under the assumption of uniform sampling, the
probability of sampling i is gi/N(G) - Let Xi be the random variable that represents the
number of trials whose outcome is i - Each random variable for every possible k-mer has
a binomial distribution. Their joint distribution
is the following multinomial distribution
28Methods Maximizing the Global Read-Count
Likelihood (contd)
- From this, derive the global read-count
likelihood, the likelihood of k-mer distributions
(gi) given the sampling outcomes (xi) - Goal is to maximize L, or, equivalently, minimize
the negative log of L - To translate this problem into a convex min-cost
biflow problem, we need convex functions for each
k-mer ci s.t. - Problem the Xi random variables are not
independent
29Methods Maximizing the Global Read-Count
Likelihood (contd)
- unless the number of trials approaches infinity
- The number of trials is usually large enough to
warrant the approximation of the multinomial
distribution as the product of the binomial
distributions for each Xi - In this binomial approximation, genome length
N(G) is constant, and independent of the sampling
frequencies - Therefore, use N instead, which is the actual
length of the genome G
30Methods Maximizing the Global Read-Count
Likelihood (contd)
- New approximation of L
- Now
- And
- ci is used as the convex functions for the
vertices of the min-cost biflow graph described
earlier
31Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
32Methods Efficiently Solving a Min-Cost Biflow
(Linear)
- Problem No existing efficient implementation of
a min-cost biflow algorithm - Solution Roll your own! Introducing the
Medvedev-Brudno Min-cost Biflow Solver! - Its fast!
- Its 2-approximate!
- Get one today! Now available at half-price for
one day only!
offer contingent on willingness of authors to go
into business with this
33Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
- Can solve directed network flow by reducing the
problem to a linear program (LP) - Use an edge incidence matrix derived from
the overlap graph - If cell has a value of 1, then edge n is an
in-edge for vertex m - If the value is -1, n is an out-edge
- 0 means n and m are not on speaking terms
- Use incidence matrix as constraint matrix for LP
optimal LP solution corresponds to a minimum flow
34Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
- The incidence matrix is Totally Unimodular (TU)
- Translation from wowspeak Every linear
combination of a TU matrix M and the identity
matrix I has integer coefficients. Therefore,
every solution to the l.c. consists of integers - Makes it possible to produce an integral solution
with LP, rather than resort to Integer
Programming -gt NP-hard
35Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
- Possible for 2 or -2 to appear in the incidence
matrix, since two in-edges/out-edges can enter a
single vertex - Incidence matrix is actually a
- binet matrix
- Optimal LP solution for binet matrices
is guaranteed to be half-integral (i.e. the
coefficients are multiples of 0.5)
36Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
- The binet matrix can be reduced to a TU matrix
through monotonization, i.e. doubling the number
of columns and rows - TU matrix corresponds to a directed graph
- Min-cost directed flow can be solved must faster
than LP
37Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
- Monotonization Procedure
- For every vertex v in the bidirected graph,
replace with two vertices v1 and v2 in the new
graph - Each of vs in-edges are replaced with two edges,
one of which points into v1, while the other
points out of v2 - Likewise, each of vs out-edges are replaced with
two edges, one of which points out of v1, while
the other points into v2 - Bounds and costs from original graph are
transferred to the new graph, and the solution of
the new graph will be transferred to the original
graph - Problem can now be solved with off-the-shelf
software (no more work for us!)
38Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
39Methods Show Me the Contigs
- Assuming the flows been solved, now its time to
decompose it into a collection of walks, which
translates into assembled contigs - Graph is first simplified by removing all edges
with a flow of zero - Additional simplifications possible by removing
vertices v where - There is exactly one edge going into v and one
edge leading out of v, and the flow on both edges
is the same - Vertices where there is also a loop with the same
flow as the other two edges, and - Split and join vertices, where the flow on the
in- - edges is the same as those of the out-edges
40Methods Show Me the Contigs (contd)
- After at most 2V of these simplifications, the
number of edges will be reduced by 105 fold, and
only conflict vertices remain (those that
didnt match the previous criteria) - Conflict Node Resolution
- Look for edges at these vertices with opposite
orientations supported by matepairs - Use BFS to find all reads within a certain
distance from the vertex - Match those reads that are matepairs
- For those matepairs where one read is on the
incoming side and the other is on the outgoing
side, find the shortest path between them using
Dijkstras algorithm
41Methods Show Me the Contigs (contd)
- Make note of the number of mates that fall within
the expected insert distance - Pairs of in/out edges that have a significant
number of matepairs that fall within the insert
distance are joined into a common edge - The previous step is repeated until no more edges
can be joined in this manner - Graph simplification continues in iterative
phases until somebody decides its time to stop
42Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
43Results
- Unable to obtain real data for experimental
validation - Generated synthetic reads from E. coli genome,
which must be 4.6 megabases long, or else it
takes up 4.6 MB of hard drive space - Simulated matepairs distances were uniformly
distributed within 10 of the expected insert
size - Reads were 25 bp long, and error-free
- Coverage rates involved 50x, 75x, 100x, and 200x
- Minimum overlap length varied between 17 and 21
- Overall running time on one machine 1 hour
- Question What kind of machine?
44Read Count Results
- Compared vertex flow with read frequency in the
original genome - High degree of accuracy
- Error rate between 10-4 and 10-6
- Generally more tendency to overestimate read
frequency - Authors claim only slight improvements beyond 75x
coverage, but 200x coverage is fantastically good
45Assembly Results
- Take the edges of the graph produced after the
conflict node resolution and generate the
sequence it spells out - Compute N50 The length of the shortest contig
s.t. 50 of the genome lies in longer contigs - Also compute N90 Similar to N50, but the cutoff
is 90 - Finally, compute errors by aligning each contig
to the reference genome and seeing how many local
alignments it takes to completely tile the contig
(minus one because it always takes at least one
alignment to do it)
46Assembly Results (contd)
- Length of contigs that contain 50 of the genome
varied between 23-28 kb - Length of contigs that contain 90 of the genome
varied between 7-8 kb - N50 error rate 1/100-180 kb
- N90 error rate 1/100-160 kb
- Greedy algorithm can be fooled by several strong
edge matches - Contig size is good relative to other whole
genome assemblies involving small read sizes
47Outline
- Whole Genome Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Methods Bidirected Overlap Graph
- Methods Adjustments to the Standard Min-cost
Biflow Problem - Methods Maximizing the Global Read-Count
Likelihood - Methods Efficiently Solving a Min-cost Biflow
(Linear) - Methods Show Me the Contigs
- Results
- Discussion
48Discussion
- Successfully demonstrated that ab initio genome
assembly is feasible with 25 bp reads and
matepair information - Still needs improvements for sequencing unknown
(bacterial) genomes - In the future, matepair information could be
further used to join contigs into supercontigs,
and better help resolve conflict nodes - Convex network flow can be more broadly applied
within computational biology
49Discussion (contd)
- First major assumption Reads are error-free
- Can be overcome with higher coverage
- Second major assumption Uniform sampling of all
genomic regions - Reality certain portions of the genome are
easier to sample than others - More difficult to overcome
- Could be overcome by establishing the biases of
the sequencing apparatus used
50Is NGS The One?
- The One that will herald the beginning of
cost-effective whole genome assembly? - Maybe you should ask the Oracle
51Thats all folks!
- Discussion questions
- What were the strengths/weaknesses of the
Medvedev-Brudno Method? How would you improve it? - How rigorous do you think the experimental
validation was? How would you improve it? - Would you agree with the conclusions the authors
drew from their results? Why or why not? - How many of you fell asleep during this
presentation? What can I do to keep your
attention in the future?