Sequence Assembly for Single Molecule Methods - PowerPoint PPT Presentation

About This Presentation

Title:

Sequence Assembly for Single Molecule Methods

Description:

Sequence Assembly for Single Molecule Methods Steven Skiena, Alexey Smirnov Department of Computer Science SUNY at Stony Brook {skiena, alexey}_at_cs.sunysb.edu – PowerPoint PPT presentation

Number of Views:41

Avg rating:3.0/5.0

Slides: 21

Provided by: Preferr169

Learn more at: https://www3.cs.stonybrook.edu

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Assembly for Single Molecule Methods

1
Sequence Assembly for Single Molecule Methods

Steven Skiena, Alexey Smirnov
Department of Computer Science
SUNY at Stony Brook
skiena, alexey_at_cs.sunysb.edu

2
The State of Sequence Assembly

The success of full genome sequencing implies
that shotgun sequence assembly with current
technologies is largely a solved problem
With conventional sequence technologies
read length about 500 base pairs
error rate under 2
coverage about 10 times for bacteria, about 30
times for humans
But single molecule sequencing methods promise to
change these parameters significantly

3
Single Molecule Sequencing Methods

Single molecule sequencing methods, such as being
developed by U.S. Genomics, promise much longer
read lengths
read length hundreds of thousands of bases ?
error rate ?
"No free lunch hypothesis" - we anticipate that
the new technologies will (at least initially)
have significantly higher error rates than
current sequencing machines.
Our assumption long lousy reads.

4
Our Problems

What levels of coverage will be needed to get
accurate sequence informationfrom long noisy
reads?
How do we efficiently assemble such long noisy
reads?

5
Sequencing from Subsequences

Why subsequences?
We anticipate that certain single molecule
sequencing technologies will be prone to having
many base deletion errors
Example in the U.S. Genomics technology,
sequence bases are replaced by tagged bases.
Untagged bases are invisible, generating
subsequences.
We study the effect of per base deletion
frequencies on our ability to accurately
reconstruct long sequences. Our study revolves
around this theoretical error model. But our
algorithm can be easily generalized.

6
Notation

n length of the original sequence
p base deletion rate
k number of reads
Ri a read of the original sequence

7
Quality of Reconstruction Metric

Our score function is

where ED is the edit distance, s is the target
sequence of length n, s sequence reconstructed
from the reads.
An empty string has a score of 0
The target string has a score of 1

8
Lower Bounds on Reconstruction Quality

k0 -gt report a random string of some length.
Computational experiments showed that reporting a
string of length 0.6n gives best results
(score0.37)
k1 -gt report this read score1-p (because
(1-p)n characters will be matched and the rest
will be inserted).

9
Lower Bounds on Reconstruction Quality
10
Information Theory Bounds

What is the minimal number of reads that we need
to reconstruct the sequence?
First, we need to know the number of sequences of
length n in which a given read of length k occurs

Each of reads gives us at most this number of
bits of information
Therefore, we will need at least this many reads
11
Bounds on the Number of Reads
Conclusion reconstruction becomes impossible for
error rates higher than 75, but possible for 50
12
Sequence Assembly Algorithm

We use a two phase procedure
Insertion align a read Ri with consensus
sequence Ci-2 and build a new consensus Ci-1
Refining and Cleanup delete/reorder characters
from current consensus to better reflect the
reads and delete unused characters

C3
refine cleanup
C2
R4
refine cleanup
C1
R3
refine cleanup
R1
R2
13
Read Insertion

How to choose the optimal alignment to insert a
new read into current consensus Ci?
Pairwise align all reads against Ci and for each
position of Ci, compute the number of times each
particular character was inserted into it at this
position.
Align the read being inserted against the
weighted consensus sequence using the insertion
weights generated before.

14
Consensus Refining

Pairwise alignment from reads is prone to two
types of errors inserting a pair of characters
in a wrong order and undersampling

ATA
R1
refine
ATCA
ACTAA
ACTAA
ACA
R2
Solution Try to make a swap and a character
doubling at each position and see if it improves
the alignment score for some reads.
15
Clean up Procedure

Pairwise align all reads against the target to
weight the positions of S by frequency of use.
Update weights after each alignment to bias
matches toward frequently used positions.
Delete all characters matched fewer than a
certain number of times.

16
Complexity Analysis

Each insertion step takes O(knn) time
Each refining step takes O(knn) time
Each cleanup step takes O(knn) time
Total O(niterknn) where niter is the number
of iterations

17
Results

For base deletion rates as high as 40, we can
completely reconstruct sequences with high enough
coverage (50 times coverage)
For larger error rates, our algorithm finds
shorter supersequences, i.e. there are multiple
answers so exact reconstruction is impossible.
Here we ignored the possibility of
insertion/substitution errors, but it is clear
our methods can adapt to different error models
at lower error rates.

18
Results
19
Future Work

We want to build your single molecule sequence
assembler!
Our Stroll shotgun sequence assembler (Chen and
Skiena) was used by Brookhaven National
Laboratory to sequence the bacterial Borrelia
burgdorferi.
We are particularly interested in identifying
better error models for sequencing technologies
under current development.

20
The End