Fragment Assembly

Slides: 32

Provided by: trin7

Transcript and Presenter's Notes

Fragment Assembly

A graphic approach to shotgun DNA sequencing

Quick review of shotgun sequencing
Shotgun sequencing is a way of processing data
collected by gel-electrophoretic procedures.
Today, a single read resolves up to 1000
nucleotides of a sample, but since strands are
much longer than that, they need to be broken up,
read in pieces, and then assembled back
together.

$F$ = number of fragments
$L$ = average read length
$N$ = amount of data collected $= F \cdot L$
$G$ = length of the shotgun sequence target string
$c$ = coverage $= N / G$
$F = N / L$ and $N = cG$, so
$F = cG / L$
With today's technology, $L \approx 300$, $c \approx 6$, $G \approx 50{,}000$.
So the average problem size is about 1000 fragments.
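
As a quick check of the arithmetic above, here is a minimal sketch that evaluates F = cG / L. The function name `expected_fragments` is only illustrative and does not come from the slides.

```python
def expected_fragments(c: float, G: int, L: int) -> float:
    """Fragments needed to cover a target of length G with coverage c
    when the average read length is L, using F = cG / L."""
    return c * G / L


# Values quoted on the slide: L = 300, c = 6, G = 50,000
print(expected_fragments(6, 50_000, 300))  # 1000.0
```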

Standard shotgun sequencing works, right?

The goal is to minimize the length of the
reconstruction by finding overlaps between the
fragments and stringing them together into one
continuous sequence.

Why can't we just try to minimize length?
There are some cases where the basic method of
shotgun sequencing fails, such as repeats.
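
To make the repeat problem concrete, the sketch below builds a made-up target containing three copies of a short repeat and shows that a shorter string with only two copies still contains every read. The sequences and the read length of 6 are invented for illustration; they are not from the slides.

```python
def all_reads(s: str, k: int) -> set:
    """Every length-k substring of s, standing in for error-free reads."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


repeat = "TTAGG"
target = "ACGT" + repeat * 3 + "CATG"     # true sequence, three repeat copies
collapsed = "ACGT" + repeat * 2 + "CATG"  # shorter candidate, two copies

reads = all_reads(target, 6)
explained = all(r in collapsed for r in reads)

# Every read fits the shorter string, so minimizing length alone
# would prefer the collapsed (wrong) reconstruction.
print(explained, len(collapsed), len(target))
```

Because each read is shorter than two repeat copies, no single read can tell two copies from three, which is why a pure shortest-reconstruction criterion under-counts the repeat.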

The quest for uniform distribution
We still need the reconstruction to conform to
certain standards set for error, but we really
want the distribution of fragments to be close to
uniform across the sequence. We are going to
implement a graph ADT to help move fragments.

$R$ = our reconstruction string
$f_i$ = the $i$th read fragment of $R$
$sp_i$ = the start-point of the $i$th $f$
$ep_i$ = the end-point of the $i$th $f$
$1 \le sp_i, ep_i \le |R| = G$
To be $\epsilon$-valid (within acceptable error),
there must be no more than $\epsilon |f_i|$
differences between each $f_i$ and $R$.
$R$ must be covered: NO GAPS.
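
The two conditions above can be sketched as small checks. This is only an illustration under stated assumptions: the slides do not say how differences are counted, so `within_error` simply counts mismatching positions between equal-length strings (a Hamming distance), whereas a real assembler would more likely use an alignment that allows insertions and deletions. Both function names are hypothetical.

```python
def covers_without_gaps(intervals, G):
    """True if the closed, 1-based intervals (sp, ep) cover every position 1..G.
    Reversed fragments (sp > ep) are normalized first."""
    reach = 0  # rightmost position covered so far
    for lo, hi in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if lo > reach + 1:  # a gap opens before this fragment starts
            return False
        reach = max(reach, hi)
    return reach >= G


def within_error(fragment, window, eps):
    """Simplified epsilon-validity: at most eps * |f| mismatching positions.
    Assumes equal lengths (Hamming distance), which is a simplification."""
    if len(fragment) != len(window):
        return False
    diffs = sum(a != b for a, b in zip(fragment, window))
    return diffs <= eps * len(fragment)
```

For example, `covers_without_gaps([(1, 4), (3, 8), (8, 10)], 10)` holds, while `[(1, 4), (6, 10)]` leaves position 5 uncovered.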

$\pi$ = an overlap
$\pi_i$ = the overlap for the $i$th $f$
We are going to represent the chain of $f_i$'s and
$\pi_i$'s as a graph, to be implemented as
follows:
$f_i = R[sp_i, ep_i]$, or
$f_i = R[ep_i, sp_i]^c$ (the complement)
Each fragment has a start and end point, by which
it can be referenced, and which are conveniently
stored in a 2D array or some kind of hash. In our
graph, each fragment is represented as a vertex,
and each overlap is represented as an edge.
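
One possible shape for such a graph ADT is sketched below, with fragments stored in a hash keyed by name and overlaps stored as directed edges. The class and field names (`Overlap`, `OverlapGraph`, `kind`) are assumptions made for illustration, not the presenter's implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Overlap:
    a: str     # fragment the edge leaves
    b: str     # fragment the edge enters
    kind: str  # "dovetail" or "containment"


@dataclass
class OverlapGraph:
    fragments: dict = field(default_factory=dict)  # name -> (sp, ep)
    edges: list = field(default_factory=list)      # list of Overlap

    def add_fragment(self, name, sp, ep):
        self.fragments[name] = (sp, ep)

    def add_overlap(self, a, b, kind):
        if a not in self.fragments or b not in self.fragments:
            raise KeyError("both fragments must be added before the overlap")
        self.edges.append(Overlap(a, b, kind))

    def successors(self, name):
        """Fragments reached by an edge leaving `name`."""
        return [e.b for e in self.edges if e.a == name]


g = OverlapGraph()
g.add_fragment("A", 1, 300)
g.add_fragment("B", 201, 500)
g.add_overlap("A", "B", "dovetail")
```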

$\pi.A$ = the overlap on $A$
We can refer to $\pi.A$ by its start and stop,
like $f$:
$\pi.A = [\pi.s_A, \pi.e_A]$
Now that we know the terms, we can easily refer
to parts of the graph without having to describe
it all over again.

There are 3 phases to fragment assembly:
- overlap
- layout
- consensus
We can skip the processes behind overlap, and
just assume that they work as in general
shotgun assembly.
Layout is where we generate values for each $sp_i$
and $ep_i$ and for all of the overlaps.
Consensus is an $\epsilon$-validity checking and
correcting process that brings the coverage down
in high-coverage regions.

Layout

In the edge column, we see each fragment A and B
represented as a vertex, and the overlap as a
graph edge. Note direction, and suffix/prefix.

Application
There are several ways to arrange the Dovetail
Path Framework, but some parts are unique.

First we do the containment overlaps, then we can
do the dovetails.

dovein(f) = dovetail in-degree of vertex f
doveout(f) = dovetail out-degree of vertex f
contin(f) = containment in-degree of vertex f
contout(f) = containment out-degree of vertex f

dovein(B) = 1, doveout(B) = 1, contin(C) = 1,
contout(E) = 2, contin(D) = 0
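
The four degree functions can be computed directly from edge lists. The figure for this slide is not in the transcript, so the graph below is a hypothetical one chosen only so that it reproduces the five values quoted above; it also assumes containment edges point from the containing fragment to the contained one.

```python
from collections import Counter

# Hypothetical edges (the original figure is not available).
dovetail = [("A", "B"), ("B", "D"), ("D", "E")]
containment = [("E", "C"), ("E", "F")]  # assumed direction: container -> contained


def degrees(edges):
    """Return (in-degree, out-degree) counters for a list of directed edges."""
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    return indeg, outdeg


dovein, doveout = degrees(dovetail)
contin, contout = degrees(containment)

# A Counter returns 0 for vertices it has never seen, e.g. contin["D"].
print(dovein["B"], doveout["B"], contin["C"], contout["E"], contin["D"])
```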

To maintain $\epsilon$-validity, the layout must satisfy,
for all $f$ in $R$:
dovein(f) $\le$ 1
doveout(f) $\le$ 1
contin(f) $\le$ 1
if dovein(f) = 0 and
doveout(f) = 0 then
contin(f) = 1
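
These degree conditions translate into a short check. The sketch below reads the bounds as "at most 1", which is an interpretation of the garbled comparison symbols in the transcript, and the function name `layout_is_valid` and the sample edges are hypothetical.

```python
from collections import Counter


def layout_is_valid(fragments, dovetail, containment):
    """Check the degree conditions for every fragment.
    Edges are (source, target) pairs; bounds are read as 'at most 1'."""
    dovein = Counter(v for _, v in dovetail)
    doveout = Counter(u for u, _ in dovetail)
    contin = Counter(v for _, v in containment)
    for f in fragments:
        if dovein[f] > 1 or doveout[f] > 1 or contin[f] > 1:
            return False
        # A fragment on no dovetail path must be contained in exactly one other.
        if dovein[f] == 0 and doveout[f] == 0 and contin[f] != 1:
            return False
    return True


frags = ["A", "B", "C", "D", "E", "F"]
dove = [("A", "B"), ("B", "D"), ("D", "E")]
cont = [("E", "C"), ("E", "F")]
print(layout_is_valid(frags, dove, cont))
```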

Now we can infer some things about a layout.
We can create an example overlap

$\pi.s_k$ is the distance from $sp_j$ to $sp_k$, and
$\pi.e_k$ is the distance from $sp_j$ to $ep_k$.
Because we sometimes have complementary fragments
where $sp_i > ep_i$, we need a method to handle
them:
$\pi.lft_i = \min(\pi.s_i, \pi.e_i)$
$\pi.rgt_i = \max(\pi.s_i, \pi.e_i)$

The reason for all of this is to calculate the
length of $f_i$:
$\pi.hang_i = [1, |i|] - [\pi.lft_i, \pi.rgt_i]$
The only other thing we need to know is which
direction the fragments are going.
Let $\pi.suf_i$ be true if the overlap substring
of $i$ is a suffix of $i$, as in $\pi.rgt_i = |i|$.
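
The per-fragment quantities defined so far (lft, rgt, hang, suf) fit naturally on one small record. This is a sketch under the assumption that the hang is the part of fragment i outside the overlap interval, so its length is |i| minus the overlap length; the class name `OverlapEnd` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlapEnd:
    s: int       # start of the overlap on fragment i
    e: int       # end of the overlap on fragment i
    length: int  # |i|, the length of fragment i

    @property
    def lft(self) -> int:
        return min(self.s, self.e)

    @property
    def rgt(self) -> int:
        return max(self.s, self.e)

    @property
    def suf(self) -> bool:
        """True when the overlap reaches the last position of the fragment."""
        return self.rgt == self.length

    @property
    def hang_len(self) -> int:
        """Positions of [1, |i|] that lie outside [lft, rgt]."""
        return self.length - (self.rgt - self.lft + 1)


end = OverlapEnd(s=201, e=300, length=300)
print(end.lft, end.rgt, end.suf, end.hang_len)
```

Taking min and max means a reversed overlap such as `OverlapEnd(s=80, e=1, length=300)` is handled without a special case.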

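We can use these equations as functions over all
$i$ and calculate each starting point recursively
(this is all dynamic programming).
Base case: $sp_1 = 1$, $fwd_1 = \pi_1.suf_{f_1}$
For $i > 1$: $sp_i = sp_{i-1} + |\pi_{i-1}.hang_{f_{i-1}}|$
$fwd_i = \neg\,\pi_{i-1}.suf_{f_i}$
(fragment $i$ runs forward when its overlap with the
previous fragment is a prefix of $i$, not a suffix)
This gets us all of the dovetail pairs;
use $j$ and $k$ to get a containment.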
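
The recurrence can be unrolled into a loop over the overlaps of a dovetail path. This is a sketch, not the presenter's code: it assumes each start point is the previous start point plus the length of the previous fragment's hang, and that a fragment is forward when the overlap with its predecessor is a prefix of it (the negation is an assumption, since operators were lost in the transcript). The tuple layout is hypothetical.

```python
def layout_dovetail_path(overlaps):
    """overlaps[n] describes the overlap between fragment n+1 and fragment n+2
    as (hang_len_of_earlier, suf_on_earlier, suf_on_later).
    Returns (sp, fwd) lists with one entry per fragment."""
    if not overlaps:
        raise ValueError("need at least one overlap to orient the path")
    sp = [1]                    # base case: sp_1 = 1
    fwd = [overlaps[0][1]]      # base case: fwd_1 = suf of overlap 1 on f_1
    for hang_len, _suf_earlier, suf_later in overlaps:
        sp.append(sp[-1] + hang_len)  # sp_i = sp_{i-1} + |hang|
        fwd.append(not suf_later)     # forward if the overlap is a prefix of f_i
    return sp, fwd


sp, fwd = layout_dovetail_path([(200, True, False), (150, True, True)])
print(sp, fwd)
```

Each value depends only on the one before it, so a single left-to-right pass is enough and no explicit recursion is needed.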

Since $k$ is contained in $j$, we have a containment
overlap, to be computed after all of the dovetail
pairs are computed. So we can assume that $sp_j$ is
known. However, $sp_k$ is floating off in
no-man's-land with all the other fragments $f_i$
that have dovein($f_i$) = 0, doveout($f_i$) = 0 and
contin($f_i$) = 1.

We will use $j$ and $k$ as an example now.

$$sp_k = sp_j + \begin{cases} \pi.s_k & \text{if } fwd_j \\ |j| - \pi.e_k & \text{otherwise} \end{cases}$$

$$fwd_k = \begin{cases} fwd_j & \text{if } \pi.s_j < \pi.e_j \\ \neg fwd_j & \text{otherwise} \end{cases}$$
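
The containment rule can be sketched as one small function. The transcript lost the operators in these formulas, so the addition to sp_j, the use of |j|, and the negation in the second case are reconstructions and should be treated as assumptions; the function and parameter names are hypothetical.

```python
def place_contained(sp_j, fwd_j, len_j, s_k, e_k, s_j, e_j):
    """Place fragment k, contained in an already placed fragment j.
    Returns (sp_k, fwd_k)."""
    # Offset of k from sp_j depends on which way j is running.
    offset = s_k if fwd_j else len_j - e_k
    sp_k = sp_j + offset
    # Orientation test follows the slide's condition on s_j and e_j.
    fwd_k = fwd_j if s_j < e_j else (not fwd_j)
    return sp_k, fwd_k


print(place_contained(100, True, 300, 50, 120, 50, 120))
print(place_contained(100, False, 300, 50, 120, 50, 120))
```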

Let's reverse $j$ and try it again.

$$sp_k = sp_j + \begin{cases} \pi.s_k & \text{if } fwd_j \\ |j| - \pi.e_k & \text{otherwise} \end{cases}$$

$$fwd_k = \begin{cases} fwd_j & \text{if } \pi.s_j < \pi.e_j \\ \neg fwd_j & \text{otherwise} \end{cases}$$

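The following formula is the uniform deviation:

$$\delta = \tfrac{1}{2}\left( \max_{1 \le i \le F}\left(\frac{i}{F} - \frac{sp_i}{G}\right) - \min_{1 \le i \le F}\left(\frac{i}{F} - \frac{sp_i}{G}\right) + \frac{1}{F} \right)$$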

Big max, big min: the max grows faster at the extreme.
Little max, little min: the min shrinks faster at the extreme.
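
A minimal sketch of this statistic follows. The signs inside the formula were lost in the transcript, so the version here (difference between i/F and sp_i/G, plus 1/F at the end) is a reconstruction, and indexing the fragments in order of start point is an assumption. The function name is hypothetical.

```python
def uniform_deviation(start_points, G):
    """Half of (max - min of i/F - sp_i/G, plus 1/F), with fragments
    numbered 1..F in increasing order of start point (assumed)."""
    sp = sorted(start_points)
    F = len(sp)
    diffs = [i / F - s / G for i, s in enumerate(sp, start=1)]
    return 0.5 * (max(diffs) - min(diffs) + 1 / F)


even = uniform_deviation([1, 26, 51, 76], 100)   # evenly spread starts
clumped = uniform_deviation([1, 2, 3, 4], 100)   # all starts at one end
print(even, clumped)
```

Evenly spaced start points give the smallest value the formula allows for a given F, while start points bunched at one end push it up.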

Simplifying Layout

First we have to remove all of the contained
fragments, as well as all edges to and from these
fragments.
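
This first simplification step can be sketched on plain edge lists. The sketch assumes containment edges point from the containing fragment to the contained one, and the function name `remove_contained` and the sample data are hypothetical.

```python
def remove_contained(fragments, dovetail, containment):
    """Drop every contained fragment and every dovetail edge touching one.
    Containment edges are (container, contained) pairs (assumed direction)."""
    contained = {k for _, k in containment}
    kept = [f for f in fragments if f not in contained]
    kept_edges = [(a, b) for a, b in dovetail
                  if a not in contained and b not in contained]
    return kept, kept_edges


kept, kept_edges = remove_contained(
    ["A", "B", "C", "D"],
    [("A", "B"), ("B", "C"), ("B", "D")],
    [("B", "C")],  # C lies inside B
)
print(kept, kept_edges)
```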

Remove all transitive edges
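
A simple way to sketch this step: an edge a -> c is treated as transitive when some b exists with a -> b and b -> c. This version only looks for two-step paths and assumes an acyclic graph, and it ignores overlap lengths, which a real assembler would also check. The function name is hypothetical.

```python
def remove_transitive_edges(edges):
    """Remove every edge (a, c) for which edges (a, b) and (b, c) both exist."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    redundant = set()
    for a, outs in succ.items():
        for b in outs:
            for c in succ.get(b, ()):
                if c in outs:  # a -> c is implied by a -> b -> c
                    redundant.add((a, c))
    return [e for e in edges if e not in redundant]


reduced = remove_transitive_edges([("A", "B"), ("B", "C"), ("A", "C")])
print(reduced)
```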

Unique-join: collapse all linear sections into one
vertex.
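
Collapsing linear sections can be sketched by joining every edge a -> b where a has exactly one outgoing edge and b has exactly one incoming edge. This is an illustration that assumes an acyclic graph; the function name `collapse_chains` and the sample graph are hypothetical.

```python
from collections import Counter


def collapse_chains(vertices, edges):
    """Group vertices into chunks by merging across unique joins.
    Returns a list of chains, each a list of original vertices in path order."""
    indeg = Counter(b for _, b in edges)
    outdeg = Counter(a for a, _ in edges)
    # a -> b is a unique join when it is a's only exit and b's only entry.
    nxt = {a: b for a, b in edges if outdeg[a] == 1 and indeg[b] == 1}
    has_prev = set(nxt.values())
    chunks = []
    for v in vertices:
        if v in has_prev:
            continue  # v belongs to a chain that starts earlier
        chain = [v]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        chunks.append(chain)
    return chunks


chunks = collapse_chains(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")],
)
print(chunks)
```

Here A, B and C form one linear section, while the branch at C keeps D and E as separate vertices.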

Every $\epsilon$-valid layout can be modeled by a chunk
framework, and this is much more efficient.