Fragment Assembly - PowerPoint PPT Presentation

Provided by: csFitEdu1
Learn more at: https://cs.fit.edu
Transcript and Presenter's Notes

1
Fragment Assembly

2
Introduction
  • Fragments are typically 200-700 bp long
  • The target string is about 30k-100k bp long
  • Problem: given a set of fragments, reconstruct the
    target

3
Introduction
  • Multiple alignment of the fragments, ignoring
    spaces at the ends
  • The alignment is called the layout
  • The output is called the consensus sequence
  • An optimization problem

4
Complications
  • Base-call errors
  • Substitution errors: p 107
  • Insertion errors (possibly from the host
    sequence): p 108, fig 4.3
  • Deletion errors: fig 4.4
  • Majority voting solves them (or some form of
    optimization)
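
As a rough illustration of majority voting, the sketch below takes a layout that has already been aligned and picks the most frequent symbol in each column. The function name `majority_consensus`, the use of '-' for a gap and ' ' for an uncovered position, and the toy layout are assumptions made for this example, not part of the slides.

```python
from collections import Counter


def majority_consensus(layout):
    """Column-wise majority vote over an aligned layout.

    layout: list of equal-length rows. '-' is a gap inside a fragment,
    ' ' marks positions the fragment does not cover (assumed convention).
    """
    consensus = []
    for column in zip(*layout):
        votes = Counter(ch for ch in column if ch != " ")
        if not votes:
            continue  # no fragment covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != "-":  # a winning gap means the column is dropped
            consensus.append(winner)
    return "".join(consensus)


# Toy layout: the third row carries a substitution error (T for G) in column 3.
layout = [
    "ACCGT   ",
    "  CGTGC ",
    "  CTTGCA",
]
print(majority_consensus(layout))  # ACCGTGCA
```

Ties are broken arbitrarily here; a real assembler would more likely weigh base-call quality.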

5
Complications
  • Chimeras
  • Two non-contiguous fragments get joined as a
    single fragment: p 109, fig 4.5
  • Need to be weeded out as a preprocessing step
  • Similar to chimeras, contaminant fragments
    (possibly from the host) need to be filtered out
    as well

6
Complications
  • Unknown orientation
  • Fragments may come from either strand
  • If a fragment comes from the opposite strand, its
    reverse complement must be in the target string
  • Consequence: try both the forward and the
    reverse-complement form of each fragment (2^n
    trials in the worst case, for n fragments)
  • p 109, fig 4.6
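
A minimal sketch of the orientation issue: each fragment has two candidate forms, so n fragments give 2^n orientation assignments in the worst case. The helper names `reverse_complement`, `orientations`, and `all_orientation_choices` are made up for this illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]


def orientations(fragment):
    """The two forms in which a fragment may appear in the target."""
    return (fragment, reverse_complement(fragment))


def all_orientation_choices(fragments):
    """Enumerate every way to orient the fragments (2^n tuples)."""
    return product(*(orientations(f) for f in fragments))


print(reverse_complement("ACCGT"))  # ACGGT
print(len(list(all_orientation_choices(["AC", "GG", "TT"]))))  # 8
```

Enumerating all assignments is only feasible for tiny inputs; it is shown here to make the 2^n count concrete.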

7
Complications
  • Repeats
  • Regions (super-string of some fragments) may
    repeat in a target
  • Consequent problem: where do the fragments really
    come from, under approximate alignment? p 110,
    fig 4.7
  • Problem 2: where should the inter-repeat
    fragments go? p 111, fig 4.8, fig 4.9
  • Inverted repeats: repeats of the reverse
    complement, fig 4.10

8
Complications
  • Insufficient coverage
  • Chance of coverage increases with redundancy (a
    heuristic: cover 8 times the target length)
  • The chance of covering a gap decreases when it
    remains uncovered even after many fragments are
    aligned; random sampling is not a good solution
    here

9
Complications
  • Insufficient coverage
  • What you get with insufficient coverage is
    multiple contigs, not one contig
  • A t-contig is one where we require an overlap of
    at least t between consecutive fragments
  • Expected number of contigs: p 112, formula 4.1
  • Lower t means fewer contigs (more aligned
    segments), but weaker consensus
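
The slides only cite formula 4.1 without stating it. The sketch below assumes it is the usual Lander-Waterman style estimate, n * exp(-n(l - t)/T), for n fragments of length l sampled at random from a target of length T with required overlap t. Treat both the formula and the function name `expected_contigs` as assumptions for illustration.

```python
import math


def expected_contigs(n, frag_len, target_len, t):
    """Estimated number of apparent contigs (assumed Lander-Waterman form)."""
    return n * math.exp(-n * (frag_len - t) / target_len)


# Same sampling, two overlap thresholds: the lower t yields fewer contigs.
low_t = expected_contigs(n=500, frag_len=400, target_len=50_000, t=20)
high_t = expected_contigs(n=500, frag_len=400, target_len=50_000, t=40)
print(low_t < high_t)  # True
```

This matches the trade-off stated above: relaxing t merges more fragments, at the cost of weaker evidence for each join.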

10
Reconstruction
  • Shortest common superstrings are not the best
    solution
  • Fig 4.12 vs Fig 4.13 (p115/116)

11
Reconstruction
  • Superstring to be reconstructed out of fragments
  • An alignment problem with no end penalty
  • d_s(f, S) is the edit distance without end
    penalty: the edit distance d minimized over all
    substrings of S
  • Fig 4.14 (p117) for best aligned
    subsequence-matching
  • Note: a matched character is charged 0, a
    mismatch 1, a gap 2 (a distance rather than a
    similarity)
  • We will use d for d_s
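
The distance without end penalty can be sketched as a small dynamic program in which starting and ending anywhere in S is free. The costs follow the slide (match 0, mismatch 1, gap 2); the function name `substring_distance` and the row-by-row formulation are choices made for this example.

```python
def substring_distance(f, s, mismatch=1, gap=2):
    """Minimum edit distance between f and any substring of s."""
    # prev[j]: best cost of aligning the processed prefix of f so that the
    # alignment ends at s[:j]. Row 0 is all zeros: starting anywhere is free.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        curr = [prev[0] + gap]  # f[:i] aligned entirely against gaps
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            curr.append(min(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return min(prev)  # ending anywhere in s is free


print(substring_distance("ACG", "TTACGTT"))  # 0, exact substring
print(substring_distance("ACG", "TTAGGTT"))  # 1, one mismatch
```

A fragment f would then count as an approximate substring of S at error level e when this value is at most e times the length of f.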

12
Reconstruction
  • f is an approximate substring of S at error
    level e if
  • d(f, S) ≤ e|f|
  • e = 0 means no error allowed
  • e > 0 allows insert/delete/substitution errors
  • Both f and its reverse complement f- should be
    tried for a match

13
Reconstruction Problem
  • Input: set F of fragments, error level e
  • Output: shortest possible string S s.t. for all f
    in F,
  • min(d(f, S), d(f-, S)) ≤ e|f|

14
Reconstruction Multicontig
  • How much overlap do we require between strings?
  • Ideally, each column in the layout L should have
    the same character, for all columns 1 through |L|
  • Fig 4.4 (p 118): t-contigs for t = 3, 2, 1
  • Balance between t and the number of t-contigs

15
Reconstruction Multicontig
  • S is an e-consensus sequence (multicontig) for
    0 ≤ e ≤ 1 if the edit distance satisfies
    d(f, S) ≤ e|f|
  • Multicontig problem
  • Input: set F, integer t ≥ 0, 0 ≤ e ≤ 1
  • Output: minimum partition of F, where each part
    Ci is a t-contig with an e-consensus

16
Reconstruction Overlap Multi-graph
  • Nodes are the fragments
  • Directed arcs are labeled with the length t of
    the overlap between nodes: t-suffix = t-prefix
  • Arcs between all pairs of nodes, but no
    self-loops
  • Fig 4.15 (p 121): example
  • Length of a created superstring + total weight
    along the path (the overlaps) = total length of
    all fragments involved
  • A maximum-weight Hamiltonian path is what we are
    looking for in this graph: it gives the maximally
    overlapped superstring
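
The length identity on this slide can be checked on a toy path. The sketch below uses exact suffix-prefix overlaps and keeps only the longest overlap per ordered pair; the function names and the three example fragments are invented for illustration.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-t:] == b[:t]:
            return t
    return 0


def path_weight(path):
    """Total overlap along consecutive fragments of the path."""
    return sum(overlap(a, b) for a, b in zip(path, path[1:]))


def path_superstring(path):
    """Glue the fragments of a path together, sharing each overlap once."""
    result = path[0]
    for a, b in zip(path, path[1:]):
        result += b[overlap(a, b):]
    return result


path = ["ACCGT", "CGTGC", "TGCAA"]
s = path_superstring(path)
print(s)  # ACCGTGCAA
# superstring length + path weight == total fragment length (9 + 6 == 15)
print(len(s) + path_weight(path) == sum(len(f) for f in path))  # True
```

Since the total fragment length is fixed, maximizing the path weight is the same as minimizing the superstring length.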

17
Reconstruction
  • Fragments that are substrings of other fragments
    in the set are noise: remove them
  • Draw the OMG (overlap multigraph) of the
    substring-free set of fragments
  • A shortest common superstring always corresponds
    to a Hamiltonian path in this graph

18
Reconstruction OMG
  • Thm 4.1 (p 123): if F is substring-free, then for
    every common superstring S there is a Ham. path
    P s.t. S(P) is contained in S
  • Fragments are strictly ordered along S: the
    order of left endpoints = the order of right
    endpoints (otherwise one fragment would be a
    substring of another)
  • The path follows the same order of fragments (as
    in S) in the OMG
  • S may contain extra garbage material, so S(P) is
    within S

19
Reconstruction OMG
  • If S is a shortest common superstring, then S
    must also be within S(P), so S = S(P)
  • In other words, a shortest common superstring of
    a substring-free fragment set F corresponds to a
    Ham. path in the OMG of F

20
Reconstruction OMG
  • Think of an algorithm for weeding out substrings
    from F
  • Also, weed out multi-edges by keeping the
    largest-weight edge between any pair of nodes
  • If the weight on an edge is below a threshold t,
    then the weight should be treated as 0
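
One straightforward (quadratic) answer to the exercise of weeding out substrings is sketched below, together with the thresholding rule for edge weights. The names `remove_substrings` and `thresholded` are hypothetical, and reverse complements are ignored to keep the example short.

```python
def remove_substrings(fragments):
    """Drop duplicates and every fragment contained in a longer one."""
    # Longest first, so a fragment is only ever compared with candidates
    # that are at least as long; ties are ordered alphabetically.
    ordered = sorted(set(fragments), key=lambda f: (-len(f), f))
    kept = []
    for frag in ordered:
        if not any(frag in longer for longer in kept):
            kept.append(frag)
    return kept


def thresholded(weight, t):
    """Treat overlaps shorter than t as no overlap at all."""
    return weight if weight >= t else 0


print(remove_substrings(["ACCGT", "CGT", "CCG", "TTAC", "ACCGT"]))
# ['ACCGT', 'TTAC']
```

For large fragment sets an index such as a suffix tree or suffix array would avoid the all-pairs containment test.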

21
Reconstruction OMG
  • Greedy Algorithm to draw Ham. Path (p 125)
  • Collects edges largest to smallest,
  • (1) preventing cycle (union-find),
  • (2) indegree of each node should be ≤ 1 (first
    node has 0)
  • (3) outdegree of each node should be ≤ 1 (last
    node has 0)
  • Does not return a Ham. path. Can you modify it to
    return one?
  • The algorithm is NOT optimal: the example (p 126)
    returns 3, optimal weight is 4
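
The greedy procedure can be sketched as follows: sort all arcs by weight, then accept an arc only if it keeps in-degree and out-degree at most 1 and does not close a cycle (checked with union-find). This is one possible way to also return the path, not the textbook's code; exact overlaps, the function names, and the example fragments are assumptions.

```python
def overlap(a, b):
    """Longest proper suffix of a that equals a prefix of b (exact match)."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-t:] == b[:t]:
            return t
    return 0


def greedy_hamiltonian_path(fragments, t=1):
    """Return (order of fragment indices, total weight) chosen greedily."""
    n = len(fragments)
    if n == 0:
        return [], 0
    edges = []
    for i in range(n):
        for j in range(n):
            if i != j:
                w = overlap(fragments[i], fragments[j])
                edges.append((w if w >= t else 0, i, j))  # below t counts as 0
    edges.sort(key=lambda e: -e[0])  # heaviest first, stable for ties

    parent = list(range(n))  # union-find over path components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_in, has_out = [False] * n, [False] * n
    succ, total = {}, 0
    for w, i, j in edges:
        if has_out[i] or has_in[j]:
            continue  # would break the degree constraints
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # would close a cycle
        parent[ri] = rj
        has_out[i] = has_in[j] = True
        succ[i] = j
        total += w
    start = next(i for i in range(n) if not has_in[i])
    order = [start]
    while order[-1] in succ:
        order.append(succ[order[-1]])
    return order, total


print(greedy_hamiltonian_path(["ACCGT", "CGTGC", "TGCAA"]))  # ([0, 1, 2], 6)
```

Because zero-weight arcs are kept in the edge list, the accepted arcs always link all fragments into a single path, though not necessarily the heaviest one.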

22
Reconstruction OMG
  • Subinterval: a fragment whose interval on the
    target is embedded within that of another
    fragment in the set
  • A subinterval-free and repeat-free graph
    connected at level t has a Ham. path that
    generates the target string

23
Reconstruction OMG
  • If a repeat exists in the original string, then
    the graph will have a cycle
  • False positive: substrings from two different
    portions have a t-overlap
  • If a cycle exists in the graph, then there must
    be a false positive (Thm 4.4, p 129): proof by
    contradiction, since otherwise the
    subinterval-free fragments can be totally ordered

24
Reconstruction OMG
  • If there are no repeats in a subinterval-free
    graph, then there exists a unique Ham. path
  • If there exists a cycle, it may not come from a
    repeat

25
Reconstruction OMG
  • Example 4.6 (p 130): the greedy algorithm finds
    the wrong string, but the Ham. path finds the
    correct one
  • Greedy does not care about linkage (it optimizes
    total overlap, aiming at the shortest common
    superstring)
  • The Ham. path chooses any t-overlap connections:
    it cares for linkage only

26
Parameters in aligning for fragment assembly
  • Score on a column: traditionally 0, -1, -2 in
    sum-of-pairs
  • Entropy
  • Sum over the alphabet and space: - Σ_c p_c log
    p_c, where p_c is the probability of c
  • All same character: p_c = 1, entropy = 0
  • For a, t, c, g, -, all different: p_c = 1/5,
    entropy = log 5
  • Entropy measures uniformity alone, a better
    metric
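
The two extreme cases on this slide can be reproduced with a few lines. The sketch estimates p_c as the relative frequency of each symbol in the column and uses the natural logarithm; the function name `column_entropy` is chosen for this example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one layout column: -sum over symbols of p_c * log(p_c)."""
    total = len(column)
    return -sum(
        (count / total) * math.log(count / total)
        for count in Counter(column).values()
    )


print(column_entropy("aaaaa") == 0)                           # True
print(math.isclose(column_entropy("atcg-"), math.log(5)))     # True
```

Lower entropy means a more uniform column, so a layout can be scored by summing the entropy over its columns.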

27
Parameters in aligning for fragment assembly
  • Coverage: how many fragments cover each column?
    (Average, min, max)
  • This is different from the concept of t-overlap
  • If a column (of the target) has coverage 0, then
    the layout is disconnected
  • Conflicts with the requirement of a
    subinterval-free collection if we expect
    coverage > 1 for all columns
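
Coverage statistics are easy to compute once each fragment's position in the layout is known. The sketch below assumes half-open (start, end) column intervals per fragment; the function names and the toy intervals are illustrative only.

```python
def column_coverage(intervals, target_len):
    """Number of fragments covering each column of the target."""
    coverage = [0] * target_len
    for start, end in intervals:  # half-open interval [start, end)
        for col in range(start, end):
            coverage[col] += 1
    return coverage


def coverage_summary(coverage):
    """(min, max, average) coverage over all columns."""
    return min(coverage), max(coverage), sum(coverage) / len(coverage)


cov = column_coverage([(0, 5), (2, 7), (2, 8)], target_len=8)
print(cov)                    # [1, 1, 3, 3, 3, 2, 2, 1]
print(coverage_summary(cov))  # (1, 3, 2.0)

# A column with coverage 0 signals a disconnected layout.
print(0 in column_coverage([(0, 3), (5, 8)], target_len=8))  # True
```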

28
Parameters in aligning for fragment assembly
  • Coverage is not enough; we need good linkage.
    Example: p 133
  • Ham. Path algorithm is doing that

29
Steps in assembly
  • Step 1: Overlap finding
  • Approximate: delete, insert, replace allowed
  • by a semi-global DP algorithm
  • with appropriate end-gap penalty,
  • pairwise over all fragments and their
    reverse complements
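
A minimal sketch of semi-global overlap detection between two fragments: the unused prefix of the first fragment and the unused suffix of the second are free, everything in between is scored. It uses a similarity score (match +1, mismatch -1, gap -2) rather than a distance so that an empty overlap does not win trivially; the scoring values and the function name `best_overlap` are assumptions for this example.

```python
def best_overlap(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring alignment of a suffix of a with a prefix of b.

    Returns (score, j) where b[:j] is the overlapping prefix of b.
    """
    # Row 0: nothing of a is used yet, so b[:j] is aligned against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [0]  # skipping a prefix of a costs nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # The alignment must reach the end of a; the rest of b is free.
    best_j = max(range(len(b) + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


print(best_overlap("ACCGT", "CGTGC"))  # (3, 3): suffix CGT meets prefix CGT
print(best_overlap("AAAA", "CCCC"))    # (0, 0): no useful overlap
```

In an assembler this would be run for every ordered pair drawn from the fragments and their reverse complements, and the scores would become arc weights.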

30
Steps in assembly
  • Step 2: Construct the overlap graph over (F union
    F-bar) for the fragment set F
  • (-- after eliminating substrings?)
  • Construct a Hamiltonian path in this graph
  • Cycles and unbalanced coverage may mean repeats

31
Steps in assembly
  • Step 3: Fine-tune the multiple alignment to get
    a consensus target
  • Manual or algorithmic
  • Examples on p 137-138