Title: An Eulerian path approach to DNA fragment assembly. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman
1An Eulerian path approach to DNA fragment assembly
Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman, PNAS 2001
- Mohamed Tikah Marrakchi
- BIN6002
- Summer, 2005
2Genome sequencing
- Analysing DNA with improved technologies opens a new era in biomedical research, and beyond.
- In the past few years, the genome sequences of many organisms have been generated:
  - A yeast (Saccharomyces cerevisiae)
  - A nematode (C. elegans)
  - A fly (Drosophila melanogaster)
  - A plant (Arabidopsis thaliana)
  - Human (Homo sapiens)
3Genome sequencing
- The sequencing strategy is chosen according to the intended use of the genome sequence data:
  - a detailed 'blueprint' (to establish a gene catalogue...)
  - a less detailed sequence (to acquire information about repetitive sequences, carry out simple comparisons with other organisms...)
4Genome sequencing
- Central to almost all past and current genome sequencing projects is the 'dideoxy chain termination' sequencing method (developed by Fred Sanger and colleagues in the 70s).
- It relies on electrophoretic separation and detection of in vitro synthesized, single-stranded DNA molecules terminated with dideoxynucleotides.
5Genome sequencing
- Many improvements were made to the Sanger sequencing method:
  - laser-based instrumentation that allows the detection of fluorescently labeled DNA molecules
  - development of thermostable polymerases
  - development of more robust fluorescent dye systems
  - robotic systems designed to automate specific steps in the sequencing process (preparing sequencing reactions, loading samples on gels...)
6Genome sequencing
- Improvement of the quality and overall throughput
of DNA sequencing and decrease in the cost
(100-fold less in the past decade...)
7Genome sequencing
- Software tools were developed for analysing sequence data and for carrying out sequence assembly:
  - calling the nucleotide base at each position and assigning a corresponding quality score
  - assembling sequences (using the quality scores to calculate accuracy rates, which are helpful for the analysis and finishing steps)
  - user-friendly viewers
  - examples: phred (base calling), phrap (assembly) and consed (viewing and editing)
8Genome sequencing
- Many strategies were tested for sequencing large genomes.
- 'Shotgun' sequencing (described in the 80s) was found to be very efficient:
  - a large piece of DNA can be sequenced by first fragmenting it into smaller pieces, generating redundant amounts of sequence
  - sequence data are obtained from the random fragments, then the sequence reads are pieced together
9Genome sequencing
- Strategies in shotgun sequencing:
  - Strategy using large-insert clones and associated physical maps
  - Strategy taking a whole-genome approach (without using clone-based physical maps)
  - Hybrid strategies involving both approaches
10Clone by clone shotgun sequencing
- Also referred to as hierarchical shotgun sequencing or map-based shotgun sequencing.
- Map construction: pieces of genomic DNA are cloned using a host-vector system (bacteria or yeast).
- Individual clones are analysed for the presence of unique DNA landmarks (STSs, restriction sites...) used to assemble overlapping clone maps.
  - YACs (yeast artificial chromosomes): up to a megabase pair in size (used for the first physical maps of the human and mouse genomes).
  - BACs (bacterial artificial chromosomes): 100 - 200 kb
11Clone by clone shotgun sequencing
- 2 Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
12Clone by clone shotgun sequencing
- Clone selection: with the assembled BAC contig map, minimally overlapping clones are selected for shotgun sequencing.
- A BAC is usually selected for sequencing if each of the restriction fragments in its fingerprint is also present in one overlapping clone.
13Clone by clone shotgun sequencing
- 2 Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
14Clone by clone shotgun sequencing
- Subclone library selection: random fragmentation of the cloned DNA in each selected BAC, followed by subcloning of the fragments into a plasmid.
- Even assemblies generated with low coverage (3 - 5x) can be used for important analyses and provide a working draft sequence.
- Highly accurate sequences (>99.99% accurate) are obtained with 8 - 10x coverage.
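The coverage figures above translate directly into a number of reads per BAC. The following is a minimal sketch of that arithmetic, assuming the usual definition of coverage as total bases read divided by the length of the target; the 150 kb BAC size (within the 100 - 200 kb range quoted earlier) and the 500 bp read length are illustrative assumptions, not values from the slides.

```python
import math

def coverage(n_reads: int, read_length: int, target_length: int) -> float:
    """Average coverage: total bases read divided by the target length."""
    return n_reads * read_length / target_length

def reads_needed(target_length: int, read_length: int, target_coverage: float) -> int:
    """Smallest number of reads that reaches the requested average coverage."""
    return math.ceil(target_coverage * target_length / read_length)

BAC_LENGTH = 150_000   # assumed BAC insert size in bp
READ_LENGTH = 500      # assumed average read length in bp

draft_reads = reads_needed(BAC_LENGTH, READ_LENGTH, 4)      # working draft, ~4x
finished_reads = reads_needed(BAC_LENGTH, READ_LENGTH, 10)  # finished quality, ~10x
print(draft_reads, finished_reads)
```

Under these assumptions a working draft needs roughly 1,200 reads per BAC and a finished sequence roughly 3,000.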
15Clone by clone shotgun sequencing
- Directed finishing phase
- Sequence finishing is the process in which remaining problems with the assembly are resolved:
  - discontinuities between sequence contigs (gaps), areas of low quality, ambiguous bases in the consensus sequence...
- Software facilitates the process of sequence finishing:
  - phred: statistical foundation for sequence assembly programs (phrap).
  - Autofinish: automates the finishing process (recommends specific additional sequencing reactions)
16Clone by clone shotgun sequencing
- 2 Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
17Whole genome shotgun sequencing
- Assembly of sequence reads generated in a random, genome-wide fashion.
- Bypasses the need for a clone-based physical map.
- Pieces of the entire genome are subcloned in suitable plasmid vectors.
- Sequence reads are generated from both ends of subclones to produce highly redundant sequence coverage.
- Coverage to deal with the problem of repetitive sequences.
18Whole genome shotgun sequencing
- Examples: Drosophila, Haemophilus influenzae
- using several size classes of subclones is important
- availability of long-range mapping data is also crucially important
- software tools
19Whole genome shotgun sequencing
- 2 Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
20Hybrid strategies for shotgun sequencing
- The two strategies, clone-by-clone and whole-genome shotgun sequencing, are not mutually exclusive.
- A mixed approach captures the advantages of both approaches:
  - providing rapid insight into the sequence of the entire genome
  - minimizing the likelihood of serious misassemblies
- The challenge is finding the optimal balance between generating sequence reads in a clone-by-clone versus whole-genome fashion.
21Hybrid strategies for shotgun sequencing
- 2 Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
22Sequencing of genomes from multicellular organisms
- 2 Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
23DNA Fragment Assembly
- Fragment assembly is like trying to assemble a big puzzle.
- It follows the "overlap - layout - consensus" paradigm, which is used in almost all available assembly tools (phrap, cap, tigr, celera).
- No polynomial algorithm is known for the layout step.
- The finishing step is time consuming.
24DNA Fragment Assembly
- The fragment assembly problem amounts to finding a path in the overlap graph.
- This is a Hamiltonian path problem, which is NP-complete (a difficult problem).
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
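To make the overlap-graph formulation concrete, here is a minimal sketch that treats each read as a vertex, links two reads when a suffix of one matches a prefix of the other, and then searches for a path visiting every read exactly once. The toy reads, the `min_len` threshold, and the brute-force search over all orderings are illustrative assumptions; real assemblers do not enumerate permutations, and the factorial cost of doing so is exactly why the layout step is hard.

```python
from itertools import permutations

def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def hamiltonian_layout(reads, min_len=3):
    """Brute force: try every ordering of the reads and return the merged
    sequence of the first ordering in which all consecutive reads overlap."""
    for order in permutations(reads):
        overlaps = [overlap(a, b, min_len) for a, b in zip(order, order[1:])]
        if all(overlaps):
            merged = order[0]
            for read, k in zip(order[1:], overlaps):
                merged += read[k:]  # append only the non-overlapping tail
            return merged
    return None

# Toy reads sampled from the made-up sequence ATGGCGTACG.
toy_reads = ["GTACG", "ATGGC", "GGCGT", "CGTAC"]
print(hamiltonian_layout(toy_reads, min_len=3))
```

With four reads there are only 24 orderings to try; the number grows factorially with the number of reads.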
25DNA Fragment Assembly
- Euler is a new algorithm and software tool that solves the repeat problem.
- It uses a counter-intuitive approach, which consists in breaking the puzzle into more pieces!
  - loss of information is minimal (if we still use 'big' pieces).
  - information can be restored in later stages.
- It doesn't have the overlap step.
- It reduces the NP-complete Hamiltonian path problem to an easy-to-solve Eulerian path problem.
26DNA Fragment Assembly
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
27DNA Fragment Assembly - EULER
- A repeat corresponds to an edge rather than a collection of vertices.
- The problem is transformed into finding a path visiting every edge of the graph exactly once: the Eulerian path problem.
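Unlike the Hamiltonian path problem, an Eulerian path can be found in time linear in the number of edges. The sketch below uses Hierholzer's algorithm, a standard textbook method that is not necessarily the routine used inside EULER, on a small directed graph with made-up vertex names. It assumes an Eulerian path exists.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: return a list of vertices that traverses every
    directed edge exactly once. Assumes such a path exists."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, or anywhere for a cycle.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: vertex is final in the path
    return path[::-1]

toy_edges = [("a", "b"), ("b", "c"), ("c", "b"), ("b", "d")]
toy_path = eulerian_path(toy_edges)
print(toy_path)
```

Each edge is pushed and popped once, so the running time is proportional to the number of edges.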
28DNA Fragment Assembly - EULER
- How to construct the de Bruijn graph from sequencing reads?
- The finished DNA sequence is not available... it's actually what we are looking for!
- Consensus (error correction in reads) is the first step in the proposed approach.
- But again, how could we correct the errors without having the final sequence?
29DNA Fragment Assembly - EULER
- Unlike the existing tools, Euler starts with the consensus step:
  - Spectral Alignment Problem
  - Error Correction Problem
30DNA Fragment Assembly - EULER
- Spectral Alignment Problem: two types of l-tuples (substrings of length l) are defined:
  - solid l-tuples: belonging to more than M reads (M is a threshold)
  - weak l-tuples: otherwise
- Let T be a set of l-tuples (called a spectrum).
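The solid/weak distinction can be sketched by counting, for every l-tuple, the number of reads that contain it. The reads, the value of l and the threshold M below are toy values chosen for illustration, and reverse complements are ignored to keep the example short.

```python
from collections import Counter

def l_tuples(read: str, l: int) -> set:
    """Distinct substrings of length l occurring in a read."""
    return {read[i:i + l] for i in range(len(read) - l + 1)}

def classify(reads, l, M):
    """Split the l-tuples of the reads into solid (found in more than M reads)
    and weak (all others)."""
    counts = Counter()
    for read in reads:
        counts.update(l_tuples(read, l))  # a set, so each read counts once
    solid = {t for t, c in counts.items() if c > M}
    weak = set(counts) - solid
    return solid, weak

toy_reads = ["ACGTAC", "CGTACG", "ACGTTC"]  # the last read carries one error
solid, weak = classify(toy_reads, l=3, M=1)
print(sorted(solid), sorted(weak))
```

Here the two l-tuples created by the error in the third read appear in a single read and end up weak.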
31DNA Fragment Assembly - EULER
- Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string (i.e. a string all of whose l-tuples belong to T).
- The problem is solved using dynamic programming.
- Use spectral alignment of a read against the set of all solid l-tuples.
- Iterating spectral alignments over the set of reads reduces the number of weak l-tuples (and increases the number of solid l-tuples).
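A simplified version of the dynamic programming idea can be sketched as follows. This is not the algorithm from the paper: it assumes that the only allowed mutations are substitutions (no insertions or deletions), and the spectrum and reads are toy values. The state of the recursion is the last l-tuple of a T-string prefix, and the value is the cheapest number of mismatches against s so far.

```python
def spectral_alignment(s: str, T: set):
    """Minimum number of substitutions turning s into a T-string, or None if
    no T-string of length len(s) exists. Substitution-only simplification."""
    l = len(next(iter(T)))
    if len(s) < l:
        raise ValueError("s must be at least l characters long")
    # best[t]: cheapest cost of a T-string prefix that currently ends with t.
    best = {t: sum(a != b for a, b in zip(t, s[:l])) for t in T}
    by_prefix = {}
    for t in T:
        by_prefix.setdefault(t[:-1], []).append(t)
    for i in range(l, len(s)):
        nxt = {}
        for t, cost in best.items():
            # Extend by one character: the next l-tuple must start with t[1:].
            for u in by_prefix.get(t[1:], []):
                c = cost + (u[-1] != s[i])
                if c < nxt.get(u, float("inf")):
                    nxt[u] = c
        best = nxt
    return min(best.values()) if best else None

toy_spectrum = {"ACG", "CGT", "GTA", "TAC"}
print(spectral_alignment("ACGTTC", toy_spectrum))  # read with one error
print(spectral_alignment("ACGTAC", toy_spectrum))  # already a T-string
```

The work per position is bounded by the number of l-tuples in T times the alphabet size, so the whole alignment is linear in the read length for a fixed spectrum.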
32DNA Fragment Assembly - EULER
- Error Correction Problem
- Given a collection of reads S and a maximum of d errors per read, introduce up to d corrections in each read in such a way that |Sl| is minimised, where Sl is the spectrum of S (all l-tuples of the reads and their reverse complements).
- An error in a read s affects at most 2l l-tuples (l in s and l in its reverse complement), which all point to the same error
- or 2x for a position within a distance x < l from the endpoints of the read.
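The 2l and 2x bounds can be checked on a toy example. The sketch below counts how many l-tuples contain a given read position on both strands, and measures how much the spectrum grows when a read with a single substitution is added. The read, the value of l and the error position are made up for illustration.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def spectrum(reads, l):
    """All l-tuples of the reads and of their reverse complements."""
    tuples = set()
    for read in reads:
        for strand in (read, revcomp(read)):
            tuples.update(strand[i:i + l] for i in range(len(strand) - l + 1))
    return tuples

def affected_tuples(pos: int, read_len: int, l: int) -> int:
    """Number of l-tuples, counted on both strands, containing 0-based position pos."""
    first = max(0, pos - l + 1)      # leftmost start of a tuple covering pos
    last = min(pos, read_len - l)    # rightmost start of a tuple covering pos
    return 2 * max(0, last - first + 1)

l = 4
read = "ACGTTGCATGCCAGTA"
noisy = read[:8] + "A" + read[9:]    # substitution T -> A at position 8
growth = len(spectrum([read, noisy], l)) - len(spectrum([read], l))
print(affected_tuples(8, len(read), l), affected_tuples(0, len(read), l), growth)
```

The growth of the spectrum can be smaller than the bound when an erroneous l-tuple happens to occur elsewhere in the data, which is why 2l is stated as a maximum.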
33DNA Fragment Assembly - EULER
- Error Correction Problem
- Look for error corrections that reduce the size of Sl by 2l (or 2x).
- Euler uses a more elaborate approach. It eliminates 97.7% of sequencing errors (in some cases going from 4.8 errors per read to 0.11 errors per read).
- Error correction is not perfect. It can introduce errors, which is acceptable as long as the errors are consistent.
- Errors introduced are corrected at a later stage. Eliminating the false edges in the de Bruijn graph being built is more important.
34DNA Fragment Assembly - EULER
- Eulerian Superpath
- S is a set of reads. The de Bruijn graph is defined as follows:
  - Sl is the set of l-tuples of S
  - the vertices of the graph are the elements of S(l-1)
  - if Sl contains an l-tuple whose first (l-1) elements form the vertex v and whose last (l-1) elements form the vertex w, then join v to w by a directed edge
- If S consists of only one read, then the assembly problem amounts to finding a Chinese Postman path, which can be easily transformed into the Eulerian path problem.
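The definition above can be sketched in a few lines: every distinct l-tuple becomes one directed edge from its (l-1)-prefix to its (l-1)-suffix. The reads and the value of l are toy values, and only the forward strand is used here, which is a simplification.

```python
from collections import defaultdict

def de_bruijn(reads, l):
    """Map each (l-1)-tuple vertex to the set of vertices it points to.
    One edge per distinct l-tuple found in the reads (forward strand only)."""
    tuples = {r[i:i + l] for r in reads for i in range(len(r) - l + 1)}
    graph = defaultdict(set)
    for t in sorted(tuples):
        graph[t[:-1]].add(t[1:])  # edge: prefix vertex -> suffix vertex
    return dict(graph)

toy_reads = ["ATGGCG", "GGCGTG"]
toy_graph = de_bruijn(toy_reads, l=3)
for v, successors in sorted(toy_graph.items()):
    print(v, "->", sorted(successors))
```

The l-tuples shared by the two reads (GGC and GCG) contribute a single edge each, so overlapping reads are merged without any explicit pairwise overlap computation.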
35DNA Fragment Assembly - EULER
- Definitions
- source vertex
- sink vertex
- branching vertex
- repeat
- entrance
- exit
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
36DNA Fragment Assembly - EULER
- The de Bruijn graph is very complicated even in error-free cases.
- Use the information about which l-tuples belong to the same reads.
- Covering reads (reads containing an entrance and an exit) reveal information about the pairing between entrances and exits.
- Tangles are repeats that do not have a covering read.
37DNA Fragment Assembly - EULER
- Eulerian Superpath Problem: find in a given graph G an Eulerian path that contains a given set of paths P as subpaths.
  - the graph G is the de Bruijn graph
  - the paths in P are the reads.
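The role of the read paths can be illustrated with a small checker that tests whether a candidate path is Eulerian and contains every read path as a contiguous subpath. The graph, the vertex names and the read path below are hypothetical, and the checker only verifies a given candidate; it does not solve the superpath problem.

```python
def contains_subpath(path, sub):
    """True if sub occurs as a contiguous run of vertices inside path."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))

def is_eulerian_superpath(path, edges, read_paths):
    """True if path uses every edge exactly once and contains every read path."""
    walked = sorted(zip(path, path[1:]))
    return walked == sorted(edges) and all(
        contains_subpath(path, p) for p in read_paths
    )

# A repeat vertex R visited three times, with two loops (via b and via c).
toy_edges = [("start", "R"), ("R", "b"), ("b", "R"),
             ("R", "c"), ("c", "R"), ("R", "end")]
path_b_first = ["start", "R", "b", "R", "c", "R", "end"]
path_c_first = ["start", "R", "c", "R", "b", "R", "end"]
covering_read = [["b", "R", "c"]]  # a read spanning the repeat

print(is_eulerian_superpath(path_b_first, toy_edges, covering_read))
print(is_eulerian_superpath(path_c_first, toy_edges, covering_read))
```

Both candidates are Eulerian paths, but only the first one is consistent with the read, which is how covering reads remove ambiguity at repeats.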
38DNA Fragment Assembly - EULER
- Solving the Eulerian superpath problem
- Carry out a series of k consecutive transformations of the graph G and the system of paths P in order to obtain a new 'equivalent' graph Gk and system of paths Pk, in which every path is a single edge.
- Every solution of the Eulerian path problem in (Gk, Pk) provides a solution of the Eulerian superpath problem in (G, P).
39DNA Fragment Assembly - EULER
- Definitions: Px,y, P->x, Py->
- x,y-detachment reduces the number of edges in G, eventually reaching 1 path per edge.
- However, in the case of multiple edges some extra work has to be done.
40DNA Fragment Assembly - EULER
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
41DNA Fragment Assembly - EULER
- resolvable path, resolvable edge
- some edges cannot be resolved even after detachment of all resolvable edges
- usually this situation corresponds to tangles (x-cuts)
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
42DNA Fragment Assembly - EULER
- Neisseria meningitidis (NM) project
- Better results with real sequencing data than those obtained with other tools using error-free sequencing data.
43DNA Fragment Assembly - EULER
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
44Bibliography
- 1 An Eulerian path approach to DNA fragment assembly. Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
- 2 Strategies for the systematic sequencing of complex genomes. Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
- 3 Eulerian Cycle / Chinese Postman. The Stony Brook Algorithm Repository. Steven S. Skiena. http://www.cs.sunysb.edu/algorith/