Title: Fragment Assembly
1- Fragment Assembly
- A graphic approach to shotgun DNA sequencing
2- Quick review of shotgun sequencing
- Shotgun sequencing processes data collected by gel-electrophoretic procedures.
- Today a single read resolves up to about 1000 nucleotides of a sample. Strands are much longer than that, so they must be broken up, read in pieces, and then assembled back together.
3- F = number of fragments
- L = average read length
- N = amount of data collected = F * L
- G = length of the shotgun sequence target string
- c = coverage = N / G
- F = N / L and N = cG, so
- F = cG / L
- With today's technology, L = 300, c = 6, G = 50,000
- So the average problem size is 1000 fragments.
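The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the function name `fragment_count` is made up for illustration.

```python
def fragment_count(coverage, target_length, read_length):
    """Return F = c * G / L, the expected number of fragments."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return coverage * target_length / read_length


# Values from the slide: c = 6, G = 50,000, L = 300
print(fragment_count(6, 50_000, 300))  # 1000.0
```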
4- Standard shotgun sequencing works, right?
The goal is to minimize the length of the reconstructed sequence by finding overlaps between the fragments and stringing them together into one continuous sequence.
5- Why can't we just try to minimize length?
- There are some cases where the basic method of shotgun sequencing fails, such as repeats.
6- The quest for uniform distribution
- We still need the reconstruction to conform to the standards set for error, but we also want the distribution of fragments to be close to level across the sequence. We are going to implement a graph ADT to help move fragments.
7- R = our reconstruction string
- f_i = the ith read fragment of R
- sp_i = the start point of the ith f
- ep_i = the end point of the ith f
- 1 <= sp_i, ep_i <= |R| = G
- To be e-valid (within acceptable error), each f_i may have no more than e * |f_i| differences from its stretch of R
- R must be covered: NO GAPS
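The NO GAPS condition can be checked mechanically. Below is a minimal sketch, assuming each fragment is given as an (sp, ep) pair with 1-based inclusive positions; the name `has_no_gaps` is hypothetical, and the error bound is not checked here.

```python
def has_no_gaps(endpoints, length):
    """Return True if the fragments cover every position 1..length.

    endpoints is a list of (sp, ep) pairs. A pair with sp > ep marks a
    complemented fragment, so each pair is normalised to (left, right).
    """
    intervals = sorted((min(sp, ep), max(sp, ep)) for sp, ep in endpoints)
    covered_up_to = 0  # rightmost position known to be covered so far
    for left, right in intervals:
        if left > covered_up_to + 1:
            return False  # positions covered_up_to+1 .. left-1 are uncovered
        covered_up_to = max(covered_up_to, right)
    return covered_up_to >= length


print(has_no_gaps([(1, 5), (9, 4), (8, 12)], 12))  # True
print(has_no_gaps([(1, 5), (7, 12)], 12))          # False, position 6 is bare
```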
8- p = an overlap
- p_i = the overlap for the ith f
- We are going to represent the chain of f_i's and p_i's as a graph, implemented as follows...
- f_i = R[sp_i, ep_i] when the fragment reads forward (sp_i <= ep_i)
- f_i = R[ep_i, sp_i]^c, the complement, otherwise
- Each fragment can be referenced by its start and end points, which are conveniently stored in a 2D array or some kind of hash.
- In our graph, each fragment is represented as a vertex, and each overlap is represented as an edge.
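A small sketch of this representation follows. It assumes 1-based inclusive positions and that "complement" means the reverse complement of the DNA string. The names `fragment_from_layout`, `positions` and `edges` are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def fragment_from_layout(R, sp, ep):
    """Return the fragment laid out between sp and ep of R (1-based, inclusive)."""
    if sp <= ep:
        return R[sp - 1:ep]  # forward fragment: R[sp, ep]
    # complemented fragment: take R[ep, sp], complement it, and reverse it
    return R[ep - 1:sp].translate(COMPLEMENT)[::-1]


# Vertices: a hash from fragment name to its (sp, ep) pair.
positions = {"A": (1, 3), "B": (6, 4)}
# Edges: one (from, to) pair per overlap.
edges = [("A", "B")]

R = "AAACCG"
print(fragment_from_layout(R, *positions["A"]))  # AAA
print(fragment_from_layout(R, *positions["B"]))  # CGG
```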
9- p.A = the overlap on A
- We can refer to p.A by its start and stop, just as we do for f:
- p.A = A[p.s_A, p.e_A]
- Now that we know the terms, we can easily refer to parts of the graph without having to describe it all over again.
10- There are 3 phases to fragment assembly.
- overlap
- layout
- consensus
- We can skip the processes behind overlap, and just assume that they work as in general shotgun assembly.
- Layout is where we generate values for each sp_i and ep_i and for all of the overlaps.
- Consensus is an e-valid checking and changing process that brings the coverage down in high-coverage regions.
11- In the edge column, we see each fragment A and B represented as a vertex, and the overlap as a graph edge. Note the direction, and the suffix/prefix.
12- Application
- There are several ways to arrange the Dovetail Path Framework, but some parts are unique.
13- First we do the containment overlaps, then we can
do the dovetails.
14- dovein(f) = dovetail in-degree of vertex f
- doveout(f) = dovetail out-degree of vertex f
- contin(f) = containment in-degree of vertex f
- contout(f) = containment out-degree of vertex f
- Example: dovein(B) = 1, doveout(B) = 1, contin(C) = 1, contout(E) = 2, contin(D) = 0
15- To maintain e-validity, the layout must satisfy
- For all f in R
- dovein(f) <= 1
- doveout(f) <= 1
- contin(f) <= 1
- if dovein(f) = 0 and doveout(f) = 0 then contin(f) = 1
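These conditions translate directly into a per-vertex check. The sketch below assumes the degree rules are "at most 1" and that the degrees have already been counted; `satisfies_degree_rules` is a made-up name.

```python
def satisfies_degree_rules(dovein, doveout, contin):
    """Check the layout conditions on this slide for one vertex f."""
    if dovein > 1 or doveout > 1 or contin > 1:
        return False
    if dovein == 0 and doveout == 0:
        # a vertex outside the dovetail chain must be contained in something
        return contin == 1
    return True


print(satisfies_degree_rules(1, 1, 0))  # True: interior of a dovetail chain
print(satisfies_degree_rules(0, 0, 1))  # True: a contained fragment
print(satisfies_degree_rules(0, 0, 0))  # False: not placed anywhere
```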
16- Now we can infer some things about a layout.
- We can create an example overlap p between fragments j and k.
- p.s_k is the distance from sp_j to sp_k.
- p.e_k is the distance from sp_j to ep_k.
- Because we sometimes have complementary fragments where sp_i > ep_i, we need a method to handle them:
- p.lft_i = min(p.s_i, p.e_i)
- p.rgt_i = max(p.s_i, p.e_i)
17- The reason for all of this is to calculate the length of the part of f_i that hangs outside the overlap:
- p.hang_i = | [1, |i|] - [p.lft_i, p.rgt_i] |
- The only other thing we need to know is which direction the fragments are going.
- Let p.suf_i be true if the overlap substring of i is a suffix of i, as in p.rgt_i = |i|
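The four quantities can be bundled into one small record. This sketch reads p.hang_i as the number of positions of fragment i that fall outside [p.lft_i, p.rgt_i]; that reading, and the class name `OverlapEnd`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class OverlapEnd:
    """One fragment's side of an overlap p (hypothetical helper type)."""
    s: int       # p.s_i, where the overlap substring starts within fragment i
    e: int       # p.e_i, where the overlap substring ends within fragment i
    length: int  # |i|, the length of fragment i

    @property
    def lft(self):
        return min(self.s, self.e)

    @property
    def rgt(self):
        return max(self.s, self.e)

    @property
    def hang(self):
        # positions of [1, |i|] that are not inside [lft, rgt]
        return self.length - (self.rgt - self.lft + 1)

    @property
    def suf(self):
        # True when the overlap substring is a suffix of fragment i
        return self.rgt == self.length


end = OverlapEnd(s=201, e=300, length=300)
print(end.lft, end.rgt, end.hang, end.suf)  # 201 300 200 True
```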
18- We are able to use these equations as functions over all i and calculate each starting point recursively (all D.P.).
- Base case: sp_1 = 1
- fwd_1 = p_1.suf_{f_1}
- For i > 1: sp_i = sp_{i-1} + p_{i-1}.hang_{f_{i-1}}
- fwd_i = not p_{i-1}.suf_{f_i} (a forward f_i meets the previous fragment at its prefix, not its suffix)
- This gets us all of the dovetail pairs; use j and k to get a containment.
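The recurrence can be unrolled into a simple loop. The sketch below makes two assumptions: each overlap is reduced to a (hang on the earlier fragment, suf flag on the later fragment) pair, and fwd_i is the negation of that suf flag. The function name is hypothetical.

```python
def layout_dovetail_chain(first_suf, overlaps):
    """Compute start points and directions along a dovetail chain.

    first_suf  -- p_1.suf for fragment 1
    overlaps   -- one (hang_prev, suf_next) pair per overlap p_1, p_2, ...:
                  hang_prev is p.hang on the earlier fragment,
                  suf_next is p.suf on the later fragment
    """
    sp = [1]            # base case: sp_1 = 1
    fwd = [first_suf]   # base case: fwd_1 = p_1.suf for fragment 1
    for hang_prev, suf_next in overlaps:
        sp.append(sp[-1] + hang_prev)  # sp_i = sp_{i-1} + hang
        fwd.append(not suf_next)       # forward means the overlap is a prefix
    return sp, fwd


sp, fwd = layout_dovetail_chain(True, [(200, False), (150, True)])
print(sp)   # [1, 201, 351]
print(fwd)  # [True, True, False]
```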
19- Since k is contained in j, we have a containment overlap, to be computed after all of the dovetail pairs are computed. So we can assume that sp_j is known. However, sp_k is still floating off in no-man's-land with all the other fragments f_i that have dovein(f_i) = 0, doveout(f_i) = 0 and contin(f_i) = 1.
We will use j and k as an example now.
20- sp_k = sp_j + p.s_k if fwd_j
- sp_k = sp_j + |j| - p.e_k otherwise
- fwd_k = fwd_j if p.s_j < p.e_j
- fwd_k = not fwd_j otherwise
Let's reverse j and try it again.
21- Figure: fragments j and k, with j reversed.
- sp_k = sp_j + p.s_k if fwd_j
- sp_k = sp_j + |j| - p.e_k otherwise
- fwd_k = fwd_j if p.s_j < p.e_j
- fwd_k = not fwd_j otherwise
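The containment rule from the last two slides fits in one function. This is a sketch under the assumption that the two cases add an offset to sp_j and that the second fwd_k case is the negation of fwd_j; `place_contained` and its parameter names are invented for the example.

```python
def place_contained(sp_j, fwd_j, len_j, s_k, e_k, s_j, e_j):
    """Place a contained fragment k relative to its container j.

    sp_j, fwd_j -- start point and direction of j (already laid out)
    len_j       -- |j|
    s_k, e_k    -- p.s_k and p.e_k of the containment overlap
    s_j, e_j    -- p.s_j and p.e_j of the containment overlap
    """
    offset = s_k if fwd_j else len_j - e_k
    sp_k = sp_j + offset
    fwd_k = fwd_j if s_j < e_j else not fwd_j
    return sp_k, fwd_k


print(place_contained(101, True, 300, 50, 120, 51, 121))   # (151, True)
print(place_contained(101, False, 300, 50, 120, 121, 51))  # (281, True)
```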
22- The following formula is the uniform deviation:
uniform deviation = ½ ( max_{1 <= i <= F} (i/F - sp_i/G) - min_{1 <= i <= F} (i/F - sp_i/G) + 1/F )
- Big max, big min: max grows faster at the extreme
- Lil max, lil min: min shrinks faster at the extreme
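A small sketch of the uniform deviation follows. It assumes the terms inside max and min are the differences i/F - sp_i/G, that 1/F is added, and that the start points are taken in sorted order. The function name is illustrative.

```python
def uniform_deviation(start_points, G):
    """Half the spread of (i/F - sp_i/G) over all fragments, plus 1/(2F)."""
    if not start_points:
        raise ValueError("need at least one start point")
    sps = sorted(start_points)  # assumption: fragment i is the ith start point
    F = len(sps)
    diffs = [i / F - sp / G for i, sp in enumerate(sps, start=1)]
    return 0.5 * (max(diffs) - min(diffs) + 1 / F)


evenly_spread = uniform_deviation([1, 26, 51, 76], 100)
clustered = uniform_deviation([1, 2, 3, 4], 100)
print(evenly_spread < clustered)  # True
```

Evenly spaced start points give the smaller value, which is the behaviour the quest for a level distribution asks for.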
23- First we have to remove all of the contained fragments, as well as all edges to and from these fragments.
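This step could be sketched as below, assuming the graph is held as a vertex list plus two edge lists and that each containment edge is a (container, contained) pair. Those conventions and the name `remove_contained` are assumptions.

```python
def remove_contained(vertices, dovetail_edges, containment_edges):
    """Drop every contained fragment and every edge that touches one."""
    contained = {inner for _, inner in containment_edges}
    kept_vertices = [v for v in vertices if v not in contained]
    kept_edges = [(a, b) for a, b in dovetail_edges
                  if a not in contained and b not in contained]
    return kept_vertices, kept_edges


vertices = ["A", "B", "C", "D"]
dovetails = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C")]
containments = [("B", "D")]  # D is contained in B
print(remove_contained(vertices, dovetails, containments))
# (['A', 'B', 'C'], [('A', 'B'), ('B', 'C')])
```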
25- Remove all transitive edges.
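A simplified sketch of transitive edge removal: an edge a -> c is dropped when some b gives a -> b and b -> c. It only looks for two-step detours, which is an assumption that suits overlap graphs where the shortcut edges are all present. It is not a general transitive reduction.

```python
def remove_transitive_edges(edges):
    """Drop each edge (a, c) that is implied by a pair (a, b), (b, c)."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)

    def is_transitive(a, c):
        return any(c in successors.get(b, ())
                   for b in successors[a] if b not in (a, c))

    return [(a, c) for a, c in edges if not is_transitive(a, c)]


print(remove_transitive_edges([("A", "B"), ("B", "C"), ("A", "C")]))
# [('A', 'B'), ('B', 'C')]
```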
27- Unique-Join: collapse all linear sections into one vertex.
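One way to sketch the collapse: treat an edge a -> b as a unique join when it is the only edge out of a and the only edge into b, then merge each run of unique joins into a single chunk. The code assumes an acyclic graph with no duplicate edges. All names are hypothetical.

```python
def collapse_linear_sections(vertices, edges):
    """Merge runs of uniquely joined vertices; return (chunks, chunk_edges)."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)

    def unique_join(a, b):
        return succ[a] == [b] and pred[b] == [a]

    chunks, chunk_of = [], {}
    for v in vertices:
        if len(pred[v]) == 1 and unique_join(pred[v][0], v):
            continue  # v belongs to a chunk that starts at an earlier vertex
        run = [v]
        while len(succ[run[-1]]) == 1 and unique_join(run[-1], succ[run[-1]][0]):
            run.append(succ[run[-1]][0])
        chunk = tuple(run)
        chunks.append(chunk)
        for member in run:
            chunk_of[member] = chunk

    # edges that are not unique joins survive as edges between chunks
    chunk_edges = [(chunk_of[a], chunk_of[b]) for a, b in edges
                   if not unique_join(a, b)]
    return chunks, chunk_edges


chunks, chunk_edges = collapse_linear_sections(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])
print(chunks)       # [('A', 'B', 'C'), ('D',), ('E',)]
print(chunk_edges)  # [(('A', 'B', 'C'), ('D',)), (('A', 'B', 'C'), ('E',))]
```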
29- Every e-valid layout can be modeled by a chunk framework, and this is much more efficient.
31- The End