Fragment Assembly

Provided by: trin7

1
Fragment Assembly
  • A graphic approach to shotgun DNA sequencing

2
  • Quick review of shotgun sequencing
  • Shotgun sequencing assembles a long sequence
    from read data collected by gel-electrophoretic
    procedures.
  • Today a single read resolves up to about 1000
    nucleotides of a sample. Since strands are much
    longer than that, they need to be broken up,
    read in pieces, and then assembled back
    together.

3
  • F = number of fragments
  • L = average read length
  • N = amount of data collected = F · L
  • G = length of the shotgun sequence target string
  • c = coverage = N / G
  • F = N / L and N = cG
  • so F = cG / L
  • With today's technology, L = 300, c = 6, G =
    50,000
  • So the average problem size is 1000 fragments.
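
The arithmetic behind the 1000-fragment figure can be checked with a short sketch. The function name `fragment_count` is made up for illustration; only the relation F = cG / L comes from the slide.

```python
def fragment_count(coverage: float, target_length: int, read_length: int) -> float:
    """Number of fragments F needed so that F * L / G equals the coverage c."""
    return coverage * target_length / read_length


# Values quoted on the slide: c = 6, G = 50,000, L = 300.
print(fragment_count(6, 50_000, 300))  # 1000.0
```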

4
  • Standard shotgun sequencing works, right?

The goal is to minimize the length of the
reconstruction by finding overlaps between the
fragments and stringing them together into one
continuous sequence.
5
  • Why can't we just try to minimize length?
  • There are cases where this basic method of
    shotgun assembly fails, such as repeats: the
    shortest reconstruction collapses repeated
    regions into fewer copies than the target has.
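
A toy illustration of the repeat problem, as a sketch only. The target string, the read length of 4, and the helper `reads_of` are all invented for this example; reads are assumed error-free and taken at every position.

```python
def reads_of(text: str, k: int) -> set[str]:
    """All length-k substrings of text, standing in for error-free reads."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


target = "AACGCGCGCGTT"    # hypothetical target with a tandem CG repeat
collapsed = "AACGCGCGTT"   # the same sequence with one repeat unit dropped

reads = reads_of(target, 4)
explains_all = all(read in collapsed for read in reads)
print(explains_all, len(collapsed), len(target))  # True 10 12
```

Every read fits inside the shorter string, so an assembler that only minimizes length would prefer the collapsed answer over the true target.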

6
  • The quest for uniform distribution
  • We still need the reconstruction to conform to
    the standards set for error, but what we really
    want is for the fragment distribution to be
    close to level across the sequence. We are going
    to implement a graph ADT to help place the
    fragments.
  • R = our reconstruction string
  • f_i = the ith read fragment of R
  • sp_i = the start-point of the ith f
  • ep_i = the end-point of the ith f

7
  • 1 ≤ sp_i, ep_i ≤ |R| = G
  • To be e-valid (within acceptable error), each
    f_i may differ from its region of R in no more
    than e·|f_i| positions, where e is the allowed
    error rate.
  • R must be covered completely: NO GAPS
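
The two requirements (bounded differences per fragment, no gaps) can be sketched as a checker. This is a simplified illustration, not the method from the slides: it assumes differences are counted position by position (substitutions only, no insertions or deletions), and the names `Placement`, `region` and `is_e_valid` are hypothetical.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class Placement:
    seq: str  # the read f_i
    sp: int   # start point in R (1-based)
    ep: int   # end point in R (1-based); sp > ep means the read is reversed


def region(R: str, p: Placement) -> str:
    """The stretch of R that the read claims to cover, in the read's orientation."""
    lo, hi = min(p.sp, p.ep), max(p.sp, p.ep)
    piece = R[lo - 1:hi]
    if p.sp > p.ep:
        piece = piece.translate(COMPLEMENT)[::-1]  # reverse complement
    return piece


def is_e_valid(R: str, placements: list[Placement], e: float) -> bool:
    covered = [False] * len(R)
    for p in placements:
        lo, hi = min(p.sp, p.ep), max(p.sp, p.ep)
        if lo < 1 or hi > len(R):
            return False
        piece = region(R, p)
        if len(piece) != len(p.seq):
            return False  # assumption: substitutions only, so lengths must match
        differences = sum(a != b for a, b in zip(piece, p.seq))
        if differences > e * len(p.seq):
            return False
        for pos in range(lo - 1, hi):
            covered[pos] = True
    return all(covered)  # NO GAPS


R = "ACGTACGTAC"
layout = [Placement("ACGTAC", 1, 6), Placement("GTACG", 10, 6)]
print(is_e_valid(R, layout, e=0.0))      # True
print(is_e_valid(R, layout[:1], e=0.0))  # False: positions 7..10 are uncovered
```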

8
  • p = an overlap
  • p_i = the overlap for the ith f
  • We are going to represent the chain of f_i's and
    p_i's as a graph, implemented as follows...
  • f_i = R[sp_i, ep_i] if sp_i < ep_i
  • f_i = R[ep_i, sp_i]^c (the complementary strand)
    otherwise
  • Each fragment has a start and end point, by which
    it can be referenced, conveniently stored in a
    2D array or some kind of hash. In our graph,
    each fragment is represented as a vertex, and
    each overlap is represented as an edge.
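
One way the graph could be stored, as a minimal sketch: a hash from fragment name to its (sp, ep) pair, plus adjacency lists for the overlap edges. The class and method names are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class OverlapGraph:
    # Hash from fragment name to its (sp, ep) pair.
    endpoints: dict = field(default_factory=dict)
    # Adjacency lists: fragment name -> names of fragments it overlaps into.
    successors: dict = field(default_factory=dict)

    def add_fragment(self, name: str, sp: int, ep: int) -> None:
        self.endpoints[name] = (sp, ep)
        self.successors.setdefault(name, [])

    def add_overlap(self, src: str, dst: str) -> None:
        if src not in self.endpoints or dst not in self.endpoints:
            raise KeyError("both fragments must be added before their overlap")
        self.successors[src].append(dst)


g = OverlapGraph()
g.add_fragment("A", 1, 300)
g.add_fragment("B", 201, 500)
g.add_overlap("A", "B")
print(g.endpoints["B"], g.successors["A"])  # (201, 500) ['B']
```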

9
  • p.A = the part of fragment A covered by overlap p
  • We can refer to p.A by its start and stop, just
    like f:
  • p.A = [p.s_A, p.e_A]
  • Now that we know the terms, we can easily refer
    to parts of the graph without having to describe
    it all over again.

10
  • There are 3 phases to fragment assembly:
  • overlap
  • layout
  • consensus
  • We can skip the processes behind overlap and
    just assume that they work as in general
    shotgun assembly.
  • Layout is where we generate values for each sp_i
    and ep_i and for all of the overlaps.
  • Consensus is an e-validity checking and
    correcting process that brings the coverage down
    in over-covered regions.

11
  • Layout

In the edge column, fragments A and B are each
represented as a vertex and the overlap between
them as a graph edge. Note the edge direction and
which end (suffix or prefix) of each fragment
overlaps.
12
  • Application
  • There are several ways to arrange the Dovetail
    Path Framework, but some parts are unique.

13
  • First we do the containment overlaps, then we can
    do the dovetails.

14
  • dovein(f) = dovetail in-degree of vertex f
  • doveout(f) = dovetail out-degree of vertex f
  • contin(f) = containment in-degree of vertex f
  • contout(f) = containment out-degree of vertex f

dovein(B) = 1, doveout(B) = 1, contin(C) = 1,
contout(E) = 2, contin(D) = 0
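
The four degree functions can be sketched as counters over a typed edge list. The edge list below is hypothetical: the original figure is not in the transcript, so the edges were chosen only so that the counts match the example values listed above. Containment edges are assumed to point from the containing fragment to the contained one.

```python
from collections import Counter


def degree_tables(edges):
    """edges holds (src, dst, kind) triples, kind being 'dovetail' or 'containment'."""
    tables = {name: Counter() for name in ("dovein", "doveout", "contin", "contout")}
    for src, dst, kind in edges:
        prefix = "dove" if kind == "dovetail" else "cont"
        tables[prefix + "out"][src] += 1
        tables[prefix + "in"][dst] += 1
    return tables


# Hypothetical edges, picked to reproduce the example values.
edges = [
    ("D", "A", "dovetail"),
    ("A", "B", "dovetail"),
    ("B", "E", "dovetail"),
    ("E", "C", "containment"),
    ("E", "F", "containment"),
]
deg = degree_tables(edges)
print(deg["dovein"]["B"], deg["doveout"]["B"], deg["contin"]["C"],
      deg["contout"]["E"], deg["contin"]["D"])  # 1 1 1 2 0
```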
15
  • To maintain e-validity, the layout must satisfy,
    for all f in R:
  • dovein(f) ≤ 1
  • doveout(f) ≤ 1
  • contin(f) ≤ 1
  • if dovein(f) = 0 and doveout(f) = 0, then
    contin(f) = 1
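
These conditions translate directly into a checker. The sketch below reads the bounds as "at most 1" and reports every fragment that breaks a rule; the function name `layout_violations` and the (src, dst, kind) edge format are assumptions made for this example.

```python
from collections import Counter


def layout_violations(fragments, edges):
    """Return a list of human-readable problems; an empty list means all rules hold."""
    deg = Counter()
    for src, dst, kind in edges:
        if kind == "dovetail":
            deg["doveout", src] += 1
            deg["dovein", dst] += 1
        else:  # containment edge: container -> contained
            deg["contin", dst] += 1

    problems = []
    for f in fragments:
        for name in ("dovein", "doveout", "contin"):
            if deg[name, f] > 1:
                problems.append(f"{name}({f}) > 1")
        off_path = deg["dovein", f] == 0 and deg["doveout", f] == 0
        if off_path and deg["contin", f] != 1:
            problems.append(f"{f} is off the dovetail path but contin({f}) != 1")
    return problems


good = [("A", "B", "dovetail"), ("A", "C", "containment")]
print(layout_violations(["A", "B", "C"], good))       # []
print(layout_violations(["A", "B", "C", "D"], good))  # D is neither chained nor contained
```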

16
  • Now we can infer some things about a layout.
  • We can create an example overlap

p.s_k is the distance from sp_j to sp_k, and p.e_k
is the distance from sp_j to ep_k. Because we
sometimes have complementary fragments where
sp_i > ep_i, we need a way to handle them:
p.lft_i = min(p.s_i, p.e_i)
p.rgt_i = max(p.s_i, p.e_i)
17
  • The reason for all of this is to calculate the
    length of the part of f_i that hangs outside
    the overlap.
  • p.hang_i = |[1, |f_i|] − [p.lft_i, p.rgt_i]|
  • The only other thing we need to know is which
    direction the fragments are going.
  • Let p.suf_i be true if the overlap substring of
    f_i is a suffix of f_i, as in p.rgt_i = |f_i|
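
The quantities lft, rgt, hang and suf can be computed together. This sketch assumes 1-based positions inside the fragment and reads the hang as the number of fragment positions outside the overlap interval; the function name is hypothetical.

```python
def overlap_geometry(s: int, e: int, length: int):
    """Return (lft, rgt, hang, suf) for one fragment's side of an overlap.

    s, e   : start and end of the overlap within the fragment (s > e if reversed)
    length : the fragment length
    """
    lft, rgt = min(s, e), max(s, e)
    hang = length - (rgt - lft + 1)  # positions of [1, length] outside [lft, rgt]
    suf = rgt == length              # the overlap reaches the fragment's last base
    return lft, rgt, hang, suf


print(overlap_geometry(201, 300, 300))  # (201, 300, 200, True)
print(overlap_geometry(100, 1, 300))    # (1, 100, 200, False)
```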

18
  • We can use these equations as functions over all
    i and calculate each starting point recursively
    (all D.P.).
  • Base case: sp_1 = 1
  • fwd_1 = p_1.suf_{f_1}
  • For i > 1: sp_i = sp_{i-1} +
    p_{i-1}.hang_{f_{i-1}}
  • fwd_i = ¬p_{i-1}.suf_{f_i} (f_i runs forward
    when its overlap with f_{i-1} is a prefix of
    f_i, not a suffix)
  • This gets us all of the dovetail pairs;
    we use j and k to get a containment.
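
The recurrence can be sketched as a single left-to-right pass. Two readings are assumed here and are not confirmed by the transcript: the hang is added to the previous start point, and a fragment runs forward when its overlap with the previous fragment is not a suffix of it. The function name and input layout are invented for illustration.

```python
def layout_dovetail_path(first_is_suffix: bool, overlaps):
    """Place fragments 1..n along a dovetail path.

    first_is_suffix : whether the first overlap is a suffix of fragment 1
    overlaps        : one (hang_prev, suf_next) pair per consecutive pair, where
                      hang_prev is the hang of the earlier fragment and suf_next
                      says whether the overlap is a suffix of the later fragment
    """
    sp = [1]                   # base case: sp_1 = 1
    fwd = [first_is_suffix]    # base case: fwd_1
    for hang_prev, suf_next in overlaps:
        sp.append(sp[-1] + hang_prev)  # assumed: sp_i = sp_{i-1} + hang
        fwd.append(not suf_next)       # assumed: forward means the overlap is a prefix
    return sp, fwd


# Three reads; the second is reversed relative to the first.
print(layout_dovetail_path(True, [(200, True), (150, False)]))
# ([1, 201, 351], [True, False, True])
```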

19
  • Since k is contained in j, we have a containment
    overlap, to be computed after all of the dovetail
    pairs are computed, so we can assume that sp_j is
    known. However, sp_k is floating off in
    no-man's-land with all the other fragments f_i
    that have dovein(f_i) = 0, doveout(f_i) = 0 and
    contin(f_i) = 1.

We will use j and k as an example now.
20
  • sp_k = sp_j + p.s_k if fwd_j
  • sp_k = sp_j + |j| − p.e_k otherwise
  • fwd_k = fwd_j if p.s_j < p.e_j
  • fwd_k = ¬fwd_j otherwise
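
The containment case can be written as one small function. It follows the two-case rule above literally, with two assumptions of mine: the offset is added to sp_j, and the second branch for fwd_k is a negation. All parameter names are hypothetical.

```python
def place_contained(sp_j: int, fwd_j: bool, len_j: int,
                    s_k: int, e_k: int, s_j: int, e_j: int):
    """Place fragment k, which is contained in the already placed fragment j."""
    offset = s_k if fwd_j else len_j - e_k
    sp_k = sp_j + offset                          # assumed: offset is added to sp_j
    fwd_k = fwd_j if s_j < e_j else not fwd_j     # assumed: flip when the overlap is reversed
    return sp_k, fwd_k


# j runs forward from position 201 and is 300 long.
print(place_contained(201, True, 300, s_k=50, e_k=149, s_j=50, e_j=149))   # (251, True)
# Same overlap, but j is reversed.
print(place_contained(201, False, 300, s_k=50, e_k=149, s_j=50, e_j=149))  # (352, False)
```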

Let's reverse j and try it again.
21
[Figure: fragments j and k, with j reversed]
  • sp_k = sp_j + p.s_k if fwd_j
  • sp_k = sp_j + |j| − p.e_k otherwise
  • fwd_k = fwd_j if p.s_j < p.e_j
  • fwd_k = ¬fwd_j otherwise

22
The following formula is the uniform deviation:
deviation = ½ (max_{1≤i≤F}(i/F − sp_i/G) −
min_{1≤i≤F}(i/F − sp_i/G) + 1/F)
  • Big max, big min: the max grows faster at the
    extreme
  • Little max, little min: the min shrinks faster
    at the extreme
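
A sketch of the deviation as code. The operators were lost in the transcript, so this version assumes a minus sign inside each max/min term and a plus sign before 1/F; it also assumes the start points are ranked in sorted order. Treat it as one plausible reading, not the definitive formula.

```python
def uniform_deviation(start_points, G: int) -> float:
    """Half the spread of (i/F - sp_i/G) over the sorted start points, plus 1/F."""
    F = len(start_points)
    gaps = [i / F - sp / G for i, sp in enumerate(sorted(start_points), start=1)]
    return 0.5 * (max(gaps) - min(gaps) + 1 / F)


even = uniform_deviation([25, 50, 75, 100], G=100)
clustered = uniform_deviation([1, 2, 3, 4], G=100)
print(even < clustered)  # True: evenly spread start points deviate less
```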

23
  • Simplifying Layout

First we have to remove all of the contained
fragments, as well as all edges to and from these
fragments.
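
This first simplification step might look like the sketch below. It assumes edges are (src, dst, kind) triples and that a containment edge points from the containing fragment to the contained one; both conventions are mine, not the slides'.

```python
def drop_contained(fragments, edges):
    """Remove every contained fragment and all edges that touch one."""
    contained = {dst for _, dst, kind in edges if kind == "containment"}
    kept_fragments = [f for f in fragments if f not in contained]
    kept_edges = [(s, d, k) for s, d, k in edges
                  if s not in contained and d not in contained]
    return kept_fragments, kept_edges


fragments = ["A", "B", "C"]
edges = [("A", "B", "dovetail"), ("A", "C", "containment"), ("C", "B", "dovetail")]
print(drop_contained(fragments, edges))
# (['A', 'B'], [('A', 'B', 'dovetail')])
```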
24
  • Simplifying Layout

25
  • Simplifying Layout

Remove all transitive edges
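
A simplified sketch of transitive edge removal: an edge a -> c is dropped when some b gives a two-edge path a -> b -> c. It looks only at graph structure and ignores overlap lengths, which a real assembler would also need to check for consistency; the function name is hypothetical.

```python
def remove_transitive_edges(edges):
    """edges holds (src, dst) pairs; returns the edges not implied by a two-edge path."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    def is_transitive(a, c):
        # True when some intermediate vertex b gives a -> b -> c.
        return any(b not in (a, c) and c in succ.get(b, set()) for b in succ[a])

    return [(a, c) for a, c in edges if not is_transitive(a, c)]


print(remove_transitive_edges([("A", "B"), ("B", "C"), ("A", "C")]))
# [('A', 'B'), ('B', 'C')]
```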
26
  • Simplifying Layout

27
  • Simplifying Layout

Unique-join: collapse all linear sections into one
vertex.
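
Collapsing linear sections can be sketched as follows. An edge a -> b is treated as a unique join when it is a's only outgoing edge and b's only incoming edge; chains of such edges become one chunk. The sketch assumes the graph has no cycles, and the function name and return format are invented for illustration.

```python
def collapse_unique_joins(vertices, edges):
    """Return (chunks, chunk_edges): chunks are lists of vertices, edges use chunk indices."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)

    def unique_join(a, b):
        # a -> b is a's only way out and b's only way in.
        return succ[a] == [b] and pred[b] == [a]

    chunks, chunk_of = [], {}
    for v in vertices:
        if len(pred[v]) == 1 and unique_join(pred[v][0], v):
            continue  # v sits inside a chain that starts at another vertex
        chain = [v]
        while len(succ[chain[-1]]) == 1 and unique_join(chain[-1], succ[chain[-1]][0]):
            chain.append(succ[chain[-1]][0])
        for member in chain:
            chunk_of[member] = len(chunks)
        chunks.append(chain)

    chunk_edges = sorted({(chunk_of[a], chunk_of[b]) for a, b in edges
                          if chunk_of[a] != chunk_of[b]})
    return chunks, chunk_edges


vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
print(collapse_unique_joins(vertices, edges))
# ([['A', 'B', 'C'], ['D'], ['E']], [(0, 1), (0, 2)])
```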
28
  • Simplifying Layout

29
  • Simplifying Layout

Every e-valid layout can be modeled in the chunk
framework, which has far fewer vertices and edges
and is therefore much more efficient to work with.
30
  • Repeat

31
The End