Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads - PowerPoint PPT Presentation

About This Presentation
Title:

Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads

Description:

Problem: each repeat only used once 'over-collapsing' the repeats ... Results in over-collapsing of repeats ... vertex at least once produces the original ... – PowerPoint PPT presentation

Number of Views:86
Avg rating:3.0/5.0
Slides: 52
Provided by: lucaslo
Category:

less

Transcript and Presenter's Notes

Title: Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads


1
Ab Initio Whole Genome Shotgun Assembly With
Mated Short Reads

Paul Medvedevs brain
Michael Brudnos brain
Brainchild!
  • Presented by Lucas Lochovsky

2
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

3
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

4
Who Wants to Assemble an Entire Genome?
  • Two approaches
  • I) Clone-by-clone assembly (a.k.a. hierarchical
    shotgun approach)
  • Break genome into 150 kb fragments
  • Insert into BAC and map onto chromosomes
  • Shotgun sequence each BAC and assemble into a
    contig

5
Who Wants to Assemble an Entire Genome? (contd)
  • II) Whole Genome Shotgun (WGS) Assembly
  • Break genome into shotgun-sized fragments and
    sequence
  • Match the overlapping regions of contiguous
    sequences
  • Demonstrated by Celera Genomics to be feasible
    for whole genome assembly
  • Sequenced human genome at 1/10th the cost of the
    public Human Genome Project

6
Who Wants to Assemble an Entire Genome? (contd)
  • Next Generation Sequencing (NGS) potential
  • third player?
  • Improved speed and cost-effectiveness relative to
    the other methods
  • but much shorter read length (25-200 bp)
  • Only proven on resequencing projects, i.e. a
    reference genome is already available
  • Not yet proven on ab initio genome sequencing
  • No advance knowledge available for ab initio
    (Defn from first principles) genome sequencing

7
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

8
Review of Related Work
  • Genome assembly intuition identify shortest
    common superstring of all reads
  • Equivalent to finding the shortest genome that is
    possible with the given reads
  • Problem each repeat only used once
    over-collapsing the repeats
  • Accommodate multiple uses of repeat
    regions makes sequence assembly amenable to
    graph-theoretic methods

9
Review of Related Work (contd)
  • EULER assembler (Pevzner, Tang and
  • Waterman)
  • Represent reads as edges and overlaps as vertices
    in a de Bruijn graph
  • Assembly can be efficiently solved as an Eulerian
    Path Problem each edge must be visited exactly
    once
  • Repeats dealt with by using multiple edges for a
    single repeat read

10
Review of Related Work (contd)
  • String graph (Myers)
  • Represent reads as vertices, and read overlaps as
    edges
  • Remove redundant edges
  • Establish edge constraints
  • Unique? (flow is exactly one)
  • Required? (min. flow is 1)
  • Optional? (min. flow is 0)
  • Find shortest walk

11
Review of Related Work (contd)
  • Flow entering and exiting each vertex in these
    graphs must be balanced
  • Gave rise to the idea of network flow methods for
    assembly
  • EULER assembler minimum cost circulation problem
  • String graph place constraints on copy-counts
    before solving the flow
  • Network flow fails for the assembly problem when
    there are repeats longer than the span of a
    single read

12
Review of Related Work (contd)
  • Sequence assembly using NGS
  • Several methods available now (e.g. SSAKE, VCAKE,
    SHARCGS, etc.)
  • All of these assume that the length of the
    assembled genome must be minimized
  • Results in over-collapsing of repeats
  • Given ubiquity of repeats in eukaryotic genomes,
    authors considered this a poor assumption

13
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

14
The Medvedev-Brudno Method
  • Change goal of sequence assembly
  • Maximize the likelihood that the resultant genome
    was the source of the given reads
  • Take advantage of the high coverage of NGS to
    statistically estimate the copy-count of each
    read identify and quantify repeats
  • Maximizing the likelihood of observed read
    frequencies can be cast as mininum cost
    bidirected flow (biflow) problem
  • Allows solution to be obtained with an
    off-the-shelf network flow solver
  • Authors claim 99.99 accuracy

15
The Medvedev-Brudno Method (contd)
  • Second important aspect is the use of matepair
    information for joining contigs
  • Other systems look for all paths between mated
    reads
  • The Medvedev-Brudno Method looks only for short
    paths between some pairs of reads
  • Question How do you decide the upper bound for
    these short paths? And how do you decide which
    pairs of reads to examine?

16
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

17
Methods Bidirected Overlap Graph
  • Bidirected graphs are kind of like directed
    graphs, except each edge has an orientation on
    each of its ends
  • Gives rise to three types of edges
  • Edges where one arrow points out of a vertex, and
    one arrow points into a vertex
  • Edges with both arrows pointing out, and
  • Edges with both arrows pointing in (easiest one
    to do in PowerPoint!)
  • For a walk in a bidirected graph, for each vertex
    on that walk, the orientation of the edge
    entering the vertex must be opposite that of the
    edge leaving the vertex

18
Methods Bidirected Overlap Graph (contd)
  • In a bidirected overlap graph, each vertex is a
    double-stranded read
  • Edges represent read overlaps
  • Three possible ways that two double-stranded
    reads can overlap (corresponds to the three types
    of edges)
  • Suppose we have two ds reads r1 and r2
  • Each read can be oriented to the left or to the
    right
  • The three possible overlaps are
  • i) Both strands point in the same direction (both
  • reads can point left, or both can point
    right, its
  • the same overlap either way)
  • ii) r1 points left and r2 points right
  • iii) r1 points right and r2 points left

19
Methods Bidirected Overlap Graph (contd)
  • A walk along this graph that visits every vertex
    at least once produces the original
    double-stranded genome (under the assumptions
    that the whole genome was covered by the reads,
    and that the reads are error-free)
  • In this paper, the overlap graph is constructed
    by placing an edge between two reads if they
    overlap by a minimum number of characters omin
  • Question How is omin determined?
  • Then perform transitive edge reduction remove
    overlaps covered by two shorter overlaps

20
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

21
Methods Adjustments to the Standard Min-cost
Biflow Problem
  • Standard Min-cost Biflow Problem
  • Set upper and lower flow bounds on each edge
  • Flow function f E ? N must obey the constraint
    for each edge e
  • For each vertex, the incoming flow is balanced
    with the outgoing flow
  • Objective Find the flow that minimizes

22
Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
  • Medvedev-Brudno Min-cost Biflow Problem
  • Upper and lower flow bounds on vertices as well
  • Accomplished by splitting every vertex v into
    two v and v-
  • v- serves as the incoming vertex, and inherits
    vs incoming edges
  • v serves as the outgoing vertex, and inherits
    vs outgoing edges
  • Finally add one edge between v-
  • and v and assign it the upper and
  • lower flow bounds for v

23
Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
  • Second variation represent the cost ce as a
    convex function
  • A function is convex if every point on or above
    it forms a convex set
  • A convex set refers to an area where, for every
    pair of points within that area, every point on
    the straight line segment connecting those points
    also lies within that area

24
Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
  • An area that is not convex would have some sort
    of concave portion that would contradict the
    above property of convex sets
  • In the overlap graph, convex functions are
    modelled with piecewise-linear approximations,
    allowing the flow to be solved as a linear
    min-cost flow problem

25
Methods Adjustments to the Standard Min-cost
Biflow Problem (contd)
  • Supersource and supersink added to convert flow
    problem into circulation problem
  • Each vertex has a lower bound of 1, since each
    read must appear in the finished genome at least
    once
  • Edge bounds are set to 0 (lower bound) and
    infinity (upper bound)
  • Put prohibitively large cost on the edge leading
    from the supersource and the edge leading to the
    supersink to ensure that the assembly uses the
    smallest number of contigs possible
  • Flow through each vertex represents number of
    times it appears in the assembled genome

26
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

27
Methods Maximizing the Global Read-Count
Likelihood
  • Start with the probability of a k-mer i being
    sampled a certain number of times from a genome G
  • Let N(G) be the length of the genome assembly of
    G, and let gi be the frequency of i in G
  • Under the assumption of uniform sampling, the
    probability of sampling i is gi/N(G)
  • Let Xi be the random variable that represents the
    number of trials whose outcome is i
  • Each random variable for every possible k-mer has
    a binomial distribution. Their joint distribution
    is the following multinomial distribution

28
Methods Maximizing the Global Read-Count
Likelihood (contd)
  • From this, derive the global read-count
    likelihood, the likelihood of k-mer distributions
    (gi) given the sampling outcomes (xi)
  • Goal is to maximize L, or, equivalently, minimize
    the negative log of L
  • To translate this problem into a convex min-cost
    biflow problem, we need convex functions for each
    k-mer ci s.t.
  • Problem the Xi random variables are not
    independent

29
Methods Maximizing the Global Read-Count
Likelihood (contd)
  • unless the number of trials approaches infinity
  • The number of trials is usually large enough to
    warrant the approximation of the multinomial
    distribution as the product of the binomial
    distributions for each Xi
  • In this binomial approximation, genome length
    N(G) is constant, and independent of the sampling
    frequencies
  • Therefore, use N instead, which is the actual
    length of the genome G

30
Methods Maximizing the Global Read-Count
Likelihood (contd)
  • New approximation of L
  • Now
  • And
  • ci is used as the convex functions for the
    vertices of the min-cost biflow graph described
    earlier

31
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

32
Methods Efficiently Solving a Min-Cost Biflow
(Linear)
  • Problem No existing efficient implementation of
    a min-cost biflow algorithm
  • Solution Roll your own! Introducing the
    Medvedev-Brudno Min-cost Biflow Solver!
  • Its fast!
  • Its 2-approximate!
  • Get one today! Now available at half-price for
    one day only!

offer contingent on willingness of authors to go
into business with this
33
Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
  • Can solve directed network flow by reducing the
    problem to a linear program (LP)
  • Use an edge incidence matrix derived from
    the overlap graph
  • If cell has a value of 1, then edge n is an
    in-edge for vertex m
  • If the value is -1, n is an out-edge
  • 0 means n and m are not on speaking terms
  • Use incidence matrix as constraint matrix for LP
    optimal LP solution corresponds to a minimum flow

34
Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
  • The incidence matrix is Totally Unimodular (TU)
  • Translation from wowspeak Every linear
    combination of a TU matrix M and the identity
    matrix I has integer coefficients. Therefore,
    every solution to the l.c. consists of integers
  • Makes it possible to produce an integral solution
    with LP, rather than resort to Integer
    Programming -gt NP-hard

35
Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
  • Possible for 2 or -2 to appear in the incidence
    matrix, since two in-edges/out-edges can enter a
    single vertex
  • Incidence matrix is actually a
  • binet matrix
  • Optimal LP solution for binet matrices
    is guaranteed to be half-integral (i.e. the
    coefficients are multiples of 0.5)

36
Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
  • The binet matrix can be reduced to a TU matrix
    through monotonization, i.e. doubling the number
    of columns and rows
  • TU matrix corresponds to a directed graph
  • Min-cost directed flow can be solved must faster
    than LP

37
Methods Efficiently Solving a Min-Cost Biflow
(Linear) (contd)
  • Monotonization Procedure
  • For every vertex v in the bidirected graph,
    replace with two vertices v1 and v2 in the new
    graph
  • Each of vs in-edges are replaced with two edges,
    one of which points into v1, while the other
    points out of v2
  • Likewise, each of vs out-edges are replaced with
    two edges, one of which points out of v1, while
    the other points into v2
  • Bounds and costs from original graph are
    transferred to the new graph, and the solution of
    the new graph will be transferred to the original
    graph
  • Problem can now be solved with off-the-shelf
    software (no more work for us!)

38
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

39
Methods Show Me the Contigs
  • Assuming the flows been solved, now its time to
    decompose it into a collection of walks, which
    translates into assembled contigs
  • Graph is first simplified by removing all edges
    with a flow of zero
  • Additional simplifications possible by removing
    vertices v where
  • There is exactly one edge going into v and one
    edge leading out of v, and the flow on both edges
    is the same
  • Vertices where there is also a loop with the same
    flow as the other two edges, and
  • Split and join vertices, where the flow on the
    in-
  • edges is the same as those of the out-edges

40
Methods Show Me the Contigs (contd)
  • After at most 2V of these simplifications, the
    number of edges will be reduced by 105 fold, and
    only conflict vertices remain (those that
    didnt match the previous criteria)
  • Conflict Node Resolution
  • Look for edges at these vertices with opposite
    orientations supported by matepairs
  • Use BFS to find all reads within a certain
    distance from the vertex
  • Match those reads that are matepairs
  • For those matepairs where one read is on the
    incoming side and the other is on the outgoing
    side, find the shortest path between them using
    Dijkstras algorithm

41
Methods Show Me the Contigs (contd)
  • Make note of the number of mates that fall within
    the expected insert distance
  • Pairs of in/out edges that have a significant
    number of matepairs that fall within the insert
    distance are joined into a common edge
  • The previous step is repeated until no more edges
    can be joined in this manner
  • Graph simplification continues in iterative
    phases until somebody decides its time to stop

42
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

43
Results
  • Unable to obtain real data for experimental
    validation
  • Generated synthetic reads from E. coli genome,
    which must be 4.6 megabases long, or else it
    takes up 4.6 MB of hard drive space
  • Simulated matepairs distances were uniformly
    distributed within 10 of the expected insert
    size
  • Reads were 25 bp long, and error-free
  • Coverage rates involved 50x, 75x, 100x, and 200x
  • Minimum overlap length varied between 17 and 21
  • Overall running time on one machine 1 hour
  • Question What kind of machine?

44
Read Count Results
  • Compared vertex flow with read frequency in the
    original genome
  • High degree of accuracy
  • Error rate between 10-4 and 10-6
  • Generally more tendency to overestimate read
    frequency
  • Authors claim only slight improvements beyond 75x
    coverage, but 200x coverage is fantastically good

45
Assembly Results
  • Take the edges of the graph produced after the
    conflict node resolution and generate the
    sequence it spells out
  • Compute N50 The length of the shortest contig
    s.t. 50 of the genome lies in longer contigs
  • Also compute N90 Similar to N50, but the cutoff
    is 90
  • Finally, compute errors by aligning each contig
    to the reference genome and seeing how many local
    alignments it takes to completely tile the contig
    (minus one because it always takes at least one
    alignment to do it)

46
Assembly Results (contd)
  • Length of contigs that contain 50 of the genome
    varied between 23-28 kb
  • Length of contigs that contain 90 of the genome
    varied between 7-8 kb
  • N50 error rate 1/100-180 kb
  • N90 error rate 1/100-160 kb
  • Greedy algorithm can be fooled by several strong
    edge matches
  • Contig size is good relative to other whole
    genome assemblies involving small read sizes

47
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

48
Discussion
  • Successfully demonstrated that ab initio genome
    assembly is feasible with 25 bp reads and
    matepair information
  • Still needs improvements for sequencing unknown
    (bacterial) genomes
  • In the future, matepair information could be
    further used to join contigs into supercontigs,
    and better help resolve conflict nodes
  • Convex network flow can be more broadly applied
    within computational biology

49
Discussion (contd)
  • First major assumption Reads are error-free
  • Can be overcome with higher coverage
  • Second major assumption Uniform sampling of all
    genomic regions
  • Reality certain portions of the genome are
    easier to sample than others
  • More difficult to overcome
  • Could be overcome by establishing the biases of
    the sequencing apparatus used

50
Is NGS The One?
  • The One that will herald the beginning of
    cost-effective whole genome assembly?
  • Maybe you should ask the Oracle

51
Thats all folks!
  • Discussion questions
  • What were the strengths/weaknesses of the
    Medvedev-Brudno Method? How would you improve it?
  • How rigorous do you think the experimental
    validation was? How would you improve it?
  • Would you agree with the conclusions the authors
    drew from their results? Why or why not?
  • How many of you fell asleep during this
    presentation? What can I do to keep your
    attention in the future?
Write a Comment
User Comments (0)
About PowerShow.com