An Eulerian path approach to DNA fragment assembly, by Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman

1
An Eulerian path approach to DNA fragment assembly
Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman
PNAS, 2001
  • Mohamed Tikah Marrakchi
  • BIN6002
  • Summer, 2005

2
Genome sequencing
  • Analysing DNA with improved technologies has
    opened a new era in biomedical research (and
    beyond).
  • In the past few years, the genome sequences of
    many organisms have been generated:
  • A yeast (Saccharomyces cerevisiae)
  • A nematode (C. elegans)
  • A fly (Drosophila melanogaster)
  • A plant (Arabidopsis thaliana)
  • Human (Homo sapiens)

3
Genome sequencing
  • The sequencing strategy is chosen according to
    the intended use of the genome sequence data:
  • a detailed 'blueprint' (to establish a gene
    catalogue...)
  • a less detailed sequence (to acquire information
    about repetitive sequences, carry out simple
    comparisons with other organisms...)

4
Genome sequencing
  • Central to almost all past and current genome
    sequencing projects is the 'dideoxy chain
    termination' sequencing method (developed by
    Fred Sanger and colleagues in the 1970s)
  • electrophoretic separation and detection of
    in vitro synthesized, single-stranded DNA
    molecules terminated with dideoxynucleotides.

5
Genome sequencing
  • Many improvements have been made to the Sanger
    sequencing method:
  • laser-based instrumentation that allows the
    detection of fluorescently labeled DNA molecules
  • development of thermostable polymerases
  • development of more robust fluorescent dye
    systems
  • robotic systems designed to automate specific
    steps in the sequencing process (preparing
    sequencing reactions, loading samples on gels ...)

6
Genome sequencing
  • These improvements raised the quality and overall
    throughput of DNA sequencing and lowered its cost
    (100-fold less in the past decade...)

7
Genome sequencing
  • Software tools were developed for analysing
    sequence data and for carrying out sequence
    assembly:
  • calling the nucleotide base at each position and
    assigning a corresponding quality score
  • assembling sequences (using the quality scores to
    calculate accuracy rates, which are helpful for
    the analysis and finishing steps)
  • viewing assemblies in user-friendly viewers
  • examples: phred (base calling), phrap (assembly)
    and consed (viewer)
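
The quality score assigned to each called base can be illustrated with the standard Phred-style definition, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The following is a minimal sketch of that formula only; the function names are hypothetical and this is not code from phred itself.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Phred-style quality score: Q = -10 * log10(P_error)."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse mapping: P_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # A base with a 1-in-1000 chance of being wrong has quality 30.
    print(round(phred_quality(0.001), 6))
    # Quality 20 corresponds to a 1% error probability.
    print(error_probability(20))
```

On this scale, each additional 10 points of quality means a tenfold lower estimated error probability.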

8
Genome sequencing
  • Many strategies were tested for sequencing large
    genomes.
  • 'Shotgun' sequencing (described in the 1980s) was
    found to be very efficient.
  • a large piece of DNA can be sequenced by first
    fragmenting it into smaller pieces, generating
    redundant amounts of sequence
  • sequence data are obtained from the random
    fragments, then the sequence reads are pieced
    together

9
Genome sequencing
  • Strategies in shotgun sequencing
  • Strategy using large insert clones and associated
    physical maps
  • Strategy taking a whole genome approach (without
    using clone-based physical maps)
  • Hybrid strategies involving both approaches

10
Clone by clone shotgun sequencing
  • Also referred to as hierarchical shotgun
    sequencing or map-based shotgun sequencing
  • Map construction: pieces of genomic DNA are
    cloned using a host-vector system (bacteria or
    yeast)
  • Individual clones are analysed for the presence
    of unique DNA landmarks (STSs, restriction sites
    ...) used to assemble overlapping clone maps.
  • YACs (yeast artificial chromosomes): up to a
    megabase pair in size (used for the first
    physical maps of the human and mouse genomes).
  • BACs (bacterial artificial chromosomes): 100 -
    200 kb

11
Clone by clone shotgun sequencing
  • 2 Eric D. Green. Nature Reviews Genetics 2,
    573-583 (2001)

12
Clone by clone shotgun sequencing
  • Clone selection: with the assembled BAC contig
    map, minimally overlapping clones are selected
    for shotgun sequencing
  • One BAC is usually selected for sequencing if
    each of the restriction fragments in its
    fingerprint is also present in one overlapping
    clone.

13
Clone by clone shotgun sequencing
  • 2 Eric D. Green. Nature Reviews Genetics 2,
    573-583 (2001)

14
Clone by clone shotgun sequencing
  • Subclone library selection: the cloned DNA in
    each selected BAC is randomly fragmented and
    subcloned into a plasmid
  • even assemblies generated with low coverage (3 -
    5x) can be used for important analyses and
    provide a working draft sequence
  • highly accurate sequences (>99.99% accurate) are
    obtained with 8 - 10x coverage.
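
To make the coverage figures concrete, here is a small sketch that computes coverage as (number of reads × read length) / target length, together with the textbook Poisson estimate e^(-c) for the fraction of bases left uncovered at coverage c. It assumes an idealized model with uniformly random read positions and no edge effects; the BAC size, read length, and read count in the example are hypothetical.

```python
import math


def coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of reads covering each base of the target."""
    return num_reads * read_length / target_length


def expected_uncovered_fraction(cov: float) -> float:
    """Poisson approximation: probability that a base is covered by no read."""
    return math.exp(-cov)


if __name__ == "__main__":
    # Hypothetical example: a 150 kb BAC sequenced with 2400 reads of 500 bp.
    c = coverage(num_reads=2400, read_length=500, target_length=150_000)
    print(c)                               # 8.0
    print(expected_uncovered_fraction(c))  # expected fraction of bases with no read
    # At low coverage a noticeable part of the clone remains unsequenced.
    print(expected_uncovered_fraction(3.0))
```

Under this model, a few percent of the bases are still uncovered at 3x, while at 8x the uncovered fraction drops well below 0.1%, which is consistent with draft versus finished-quality use.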

15
Clone by clone shotgun sequencing
  • Directed Finishing phase
  • sequence finishing is a process in which
    remaining problems with the assembly are
    resolved
  • discontinuities between sequence contigs (gaps),
    areas of low quality, ambiguous bases in the
    consensus sequence...
  • software facilitates the process of sequence
    finishing.
  • phred: provides the statistical foundation for
    sequence assembly programs (phrap).
  • Autofinish: automates the finishing process
    (recommends specific additional sequencing
    reactions)

16
Clone by clone shotgun sequencing
  • 2 Eric D. Green. Nature Reviews Genetics 2,
    573-583 (2001)

17
Whole genome shotgun sequencing
  • Assembly of sequence reads generated in a random,
    genome-wide fashion.
  • Bypasses the need for a clone-based physical map.
  • Pieces of the entire genome are subcloned in
    suitable plasmid vectors.
  • Sequence reads are generated from both ends of
    subclones to produce highly redundant sequence
    coverage, which helps deal with the problem of
    repetitive sequences.

18
Whole genome shotgun sequencing
  • Examples: Drosophila, Haemophilus influenzae
  • using several size classes of subclones is
    important
  • availability of long-range mapping data is also
    crucially important
  • software tools

19
Whole genome shotgun sequencing
  • 2 Eric D. Green. Nature Reviews Genetics 2,
    573-583 (2001)

20
Hybrid strategies for shotgun sequencing
  • The two strategies, clone-by-clone and whole
    genome shotgun sequencing, are not mutually
    exclusive.
  • A mixed approach captures the advantages of
    both approaches:
  • providing rapid insight into the sequence of the
    entire genome
  • minimizing the likelihood of serious
    misassemblies
  • The challenge is finding the optimal balance
    between generating sequence reads in a
    clone-by-clone versus whole genome fashion

21
Hybrid strategies for shotgun sequencing
  • 2 Eric D. Green. Nature Reviews Genetics 2,
    573-583 (2001)

22
Sequencing of genomes from multicellular organisms
  • 2 Eric D. Green. Nature Reviews Genetics 2,
    573-583 (2001)

23
DNA Fragment Assembly
  • Fragment assembly is like trying to assemble a
    big puzzle.
  • It traditionally follows the "overlap - layout -
    consensus" paradigm, which is used in almost all
    available assembly tools (phrap, cap, tigr,
    celera)
  • No polynomial-time algorithm is known for the
    layout step
  • The finishing step is time-consuming

24
DNA Fragment Assembly
  • The fragment assembly problem: finding a path in
    the overlap graph that visits every vertex (read)
    exactly once.
  • This is the Hamiltonian path problem, which is
    NP-complete (a difficult problem).

1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
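
The overlap-graph formulation can be sketched on toy data as follows. Reads are vertices, an edge a -> b exists when a suffix of a matches a prefix of b, and the layout is a path visiting every read exactly once. This is an illustrative brute-force search under simplifying assumptions (exact overlaps, error-free reads, a single strand, a hypothetical minimum overlap `min_len`), not the method of any particular assembler; the exhaustive search over orderings is exactly what becomes infeasible for real data.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len: int = 3):
    """Map each ordered pair of reads with a sufficient overlap to its length."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k > 0:
            graph[(a, b)] = k
    return graph


def hamiltonian_layout(reads, min_len: int = 3):
    """Brute force: try every ordering of the reads (factorial time)."""
    graph = overlap_graph(reads, min_len)
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None


if __name__ == "__main__":
    reads = ["GTACC", "ATGCG", "GCGTA"]
    print(hamiltonian_layout(reads))
```

With n reads the loop may examine up to n! orderings, which is why the layout step is the bottleneck of this formulation.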
25
DNA Fragment Assembly
  • Euler is a new algorithm and software tool that
    solves the repeat problem.
  • It uses a counter-intuitive approach, which
    consists in breaking the puzzle into more pieces!
  • the loss of information is minimal (if we still
    use 'big' pieces).
  • the information can be restored in later stages.
  • It doesn't have the overlap step
  • It recasts fragment assembly from an NP-complete
    Hamiltonian path problem into an easy-to-solve
    Eulerian path problem.

26
DNA Fragment Assembly
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
27
DNA Fragment Assembly - EULER
  • A repeat corresponds to an edge rather than a
    collection of vertices.
  • The problem is transformed into finding a path
    visiting every edge of the graph exactly once.
  • This is the Eulerian path problem.
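
A path that visits every edge exactly once can be found in time linear in the number of edges. The sketch below uses Hierholzer's algorithm, a standard textbook method, on a directed multigraph given as a list of edges. It only illustrates why the Eulerian formulation is easy; it is not the EULER software, and the toy edge list is a hypothetical example built from the 3-tuples of a short string.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a vertex list using every directed edge exactly once, or None."""
    if not edges:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    if any(b not in (-1, 0, 1) for b in balance.values()):
        return None
    if len(starts) > 1 or len(starts) != len(ends):
        return None
    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit vertex
    path.reverse()
    # If some edge was never reached, the graph is not connected.
    return path if len(path) == len(edges) + 1 else None


if __name__ == "__main__":
    # Edges between 2-tuples, one per 3-tuple of the toy string ATGGCGTGCA.
    edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"),
             ("CG", "GT"), ("GT", "TG"), ("TG", "GC"), ("GC", "CA")]
    path = eulerian_path(edges)
    print(path[0] + "".join(v[-1] for v in path[1:]))
```

Note that this toy graph admits more than one Eulerian path, because the vertices TG and GC are each visited twice; the algorithm returns one of them. Choosing the right one requires extra information, such as the reads themselves.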

28
DNA Fragment Assembly - EULER
  • How to construct the de Bruijn graph from
    sequencing reads?
  • the finished DNA sequence is not available... it's
    actually what we are looking for!
  • Consensus (error correction in reads) is the
    first step in the proposed approach.
  • But again, how could we correct the errors
    without having the final sequence?

29
DNA Fragment Assembly - EULER
  • Unlike the existing tools, Euler starts with the
    consensus step
  • Spectral Alignment Problem
  • Error Correction Problem

30
DNA Fragment Assembly - EULER
  • Spectral Alignment Problem: two types of l-tuples
    (contiguous substrings of length l) are defined
  • solid l-tuples: belonging to more than M reads (M
    is a threshold)
  • weak l-tuples: otherwise
  • Let T be a set of l-tuples (called a spectrum).
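
The solid/weak distinction can be sketched directly from the definition above: count, for every l-tuple, the number of reads that contain it, and call it solid when that count exceeds M. This is a minimal illustration with hypothetical function names; it considers one strand only and ignores reverse complements.

```python
from collections import Counter


def l_tuples(read: str, l: int) -> set:
    """All distinct substrings of length l in a read."""
    return {read[i:i + l] for i in range(len(read) - l + 1)}


def classify_l_tuples(reads, l: int, M: int):
    """Split l-tuples into solid (in more than M reads) and weak (otherwise)."""
    counts = Counter()
    for read in reads:
        counts.update(l_tuples(read, l))  # each read counts once per l-tuple
    solid = {t for t, c in counts.items() if c > M}
    weak = set(counts) - solid
    return solid, weak


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "ACGA"]  # the last read ends with a likely error
    solid, weak = classify_l_tuples(reads, l=3, M=1)
    print(sorted(solid))  # ['ACG', 'CGT']
    print(sorted(weak))   # ['CGA']
```

The set of solid l-tuples can then play the role of the spectrum T.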

31
DNA Fragment Assembly - EULER
  • Given a string s and a spectrum T, find the
    minimum number of mutations in s that transform s
    into a T-string (i.e. a string all of whose
    l-tuples belong to T)
  • solve the problem using dynamic programming
  • Use spectral alignment of a read against the set
    of all solid l-tuples
  • Iterating spectral alignments over the set of
    reads reduces the number of weak l-tuples (and
    increases the number of solid l-tuples)
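
As an illustration of the dynamic programming idea, the sketch below solves a simplified version of the spectral alignment problem in which the only allowed mutations are substitutions (no insertions or deletions). The state is the l-tuple ending at the current position, and two consecutive states must overlap in l-1 characters. This is an assumption-laden toy, not the algorithm implemented in EULER, and it assumes every element of the spectrum has length l.

```python
def spectral_alignment_distance(s: str, spectrum, l: int):
    """Minimum number of substitutions turning s into a T-string, or None.

    Substitution-only simplification: the result has the same length as s.
    """
    if len(s) < l:
        raise ValueError("string shorter than l")
    inf = float("inf")
    # cost[t]: best cost of a prefix whose last l characters equal t.
    cost = {t: sum(a != b for a, b in zip(t, s[:l])) for t in spectrum}
    for j in range(l, len(s)):
        # Best cost among current windows, indexed by their (l-1)-suffix.
        best_by_suffix = {}
        for t, c in cost.items():
            if c < best_by_suffix.get(t[1:], inf):
                best_by_suffix[t[1:]] = c
        # Extend by one character: new window t must start with that suffix.
        cost = {t: best_by_suffix[t[:-1]] + (t[-1] != s[j])
                for t in spectrum if t[:-1] in best_by_suffix}
    return min(cost.values()) if cost else None


if __name__ == "__main__":
    T = {"ACG", "CGT", "GTA"}
    print(spectral_alignment_distance("ACGTA", T, 3))  # 0: already a T-string
    print(spectral_alignment_distance("ACCTA", T, 3))  # 1: change C to G
```

The running time of this version is proportional to the length of s times the size of the spectrum.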

32
DNA Fragment Assembly - EULER
  • Error Correction Problem
  • Given a collection of reads S and a maximum of d
    errors per read, introduce up to d corrections in
    each read in such a way that |Sl| is minimised,
    where Sl is the spectrum of S (all l-tuples of
    the reads and their reverse complements).
  • An error in a read s affects at most 2l l-tuples
    (l in the read and l in its reverse complement)
    that all point to the same error
  • or 2x for a position within a distance x < l from
    the endpoints of the read.
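
The 2l (or 2x) count follows from counting the length-l windows that contain a given position, then doubling for the reverse complement. A small sketch of that arithmetic, with a hypothetical function name and 0-based positions:

```python
def affected_l_tuples(read_length: int, position: int, l: int,
                      both_strands: bool = True) -> int:
    """Number of l-tuples that contain the base at a 0-based position."""
    first_start = max(0, position - l + 1)      # leftmost window covering it
    last_start = min(position, read_length - l)  # rightmost window covering it
    count = max(0, last_start - first_start + 1)
    return 2 * count if both_strands else count


if __name__ == "__main__":
    # Interior position of a 100-base read with l = 20: 2l = 40 l-tuples.
    print(affected_l_tuples(100, 50, 20))  # 40
    # Fifth base from the start (x = 5 < l): 2x = 10 l-tuples.
    print(affected_l_tuples(100, 4, 20))   # 10
```

A single correction that removes all of these erroneous l-tuples at once is therefore a strong signal of a genuine sequencing error.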

33
DNA Fragment Assembly - EULER
  • Error Correction Problem
  • Look for error corrections that reduce the size
    of Sl by 2l (or 2x)
  • Euler uses a more evolved approach. It eliminates
    97.7% of sequencing errors (in some cases going
    from 4.8 errors per read to 0.11 errors per read)
  • Error correction is not perfect. It can introduce
    errors, which is acceptable as long as the errors
    are consistent.
  • Errors introduced are corrected in a later stage.
    Eliminating the false edges in the de Bruijn
    graph being built is more important.

34
DNA Fragment Assembly - EULER
  • Eulerian Superpath
  • S is a set of reads. The de Bruijn graph is
    defined as follows:
  • Sl is the set of l-tuples of S
  • the vertices of the graph are the set S(l-1).
  • if Sl contains an l-tuple whose first l-1
    elements are the vertex v and whose last l-1
    elements are the vertex w, then join v to w by a
    directed edge in the graph.
  • If S is only one read, then the assembly problem
    is finding the Chinese Postman path, which can be
    easily transformed into the Eulerian path problem.
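
The construction above can be sketched in a few lines: collect the distinct l-tuples of the reads, and turn each one into a directed edge from its first l-1 characters to its last l-1 characters. This is a minimal illustration with a hypothetical function name; it keeps one edge per distinct l-tuple (so edge multiplicities are not tracked) and ignores reverse complements.

```python
def de_bruijn_edges(reads, l: int):
    """One directed edge (prefix, suffix) per distinct l-tuple of the reads."""
    tuples = {read[i:i + l]
              for read in reads
              for i in range(len(read) - l + 1)}
    return sorted((t[:-1], t[1:]) for t in tuples)


if __name__ == "__main__":
    reads = ["ATGGC", "GGCGT"]  # two overlapping toy reads
    for v, w in de_bruijn_edges(reads, l=3):
        print(v, "->", w)
```

Because the two reads share the l-tuple GGC, they are glued together on the edge GG -> GC without any explicit pairwise overlap computation.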

35
DNA Fragment Assembly - EULER
  • Definitions
  • source vertex
  • sink vertex
  • branching vertex
  • repeat
  • entrance
  • exit

1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
36
DNA Fragment Assembly - EULER
  • The de Bruijn graph is very complicated even in
    the error-free case.
  • use the information about which l-tuples belong
    to the same reads
  • covering reads (reads containing an entrance and
    an exit) reveal information about the pairing
    between entrances and exits
  • tangles are repeats that do not have a covering
    read.

37
DNA Fragment Assembly - EULER
  • Eulerian Superpath Problem: find in a given graph
    G an Eulerian path which contains a given set of
    paths P as subpaths.
  • the graph G is the de Bruijn graph
  • the subpaths in P are the reads.
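
The superpath condition can be checked mechanically: a candidate path must use every edge of G exactly once and contain each read path as a contiguous subpath. The sketch below is only a verifier for a given candidate, not a solver, and its names and toy data are hypothetical. It shows how a single read can rule out one of two otherwise valid Eulerian paths.

```python
def contains_subpath(path, sub) -> bool:
    """True if sub occurs as a contiguous run of vertices in path."""
    path, sub = list(path), list(sub)
    m = len(sub)
    return any(path[i:i + m] == sub for i in range(len(path) - m + 1))


def is_eulerian_superpath(path, edges, read_paths) -> bool:
    """Check that path uses each edge once and contains every read path."""
    path = list(path)
    uses_each_edge_once = sorted(zip(path, path[1:])) == sorted(edges)
    return uses_each_edge_once and all(
        contains_subpath(path, p) for p in read_paths)


if __name__ == "__main__":
    edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"),
             ("CG", "GT"), ("GT", "TG"), ("TG", "GC"), ("GC", "CA")]
    # Two different Eulerian paths through the same toy graph.
    p1 = ["AT", "TG", "GG", "GC", "CG", "GT", "TG", "GC", "CA"]
    p2 = ["AT", "TG", "GC", "CG", "GT", "TG", "GG", "GC", "CA"]
    # Vertex path of the read TGGCG (l = 3).
    read = ["TG", "GG", "GC", "CG"]
    print(is_eulerian_superpath(p1, edges, [read]))  # True
    print(is_eulerian_superpath(p2, edges, [read]))  # False
```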

38
DNA Fragment Assembly - EULER
  • Solving the superpath problem
  • carry out a series of k consecutive
    transformations of the graph G and the system of
    paths P in order to obtain a new 'equivalent'
    graph Gk and system of paths Pk, where every path
    is a single edge.
  • Every solution of the Eulerian path problem in
    (Gk, Pk) provides a solution of the Eulerian
    superpath problem in (G, P)

39
DNA Fragment Assembly - EULER
  • Definitions: Px,y, P->x, Py->
  • x,y-detachment reduces the number of edges in G,
    so that eventually every path is a single edge.
  • However, in the case of multiple edges some extra
    work has to be done.

40
DNA Fragment Assembly - EULER
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
41
DNA Fragment Assembly - EULER
  • resolvable path, resolvable edge
  • some edges cannot be resolved even after
    detachment of all resolvable edges
  • usually this situation corresponds to tangles
    (x-cuts)

1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
42
DNA Fragment Assembly - EULER
  • Neisseria meningitidis (NM) project
  • Better results with real sequencing data than
    those obtained with other tools using error-free
    sequencing data.

43
DNA Fragment Assembly - EULER
1 Pavel A. Pevzner, Haixu Tang, Michael S.
Waterman. PNAS, August 2001.
44
Bibliography
  • 1 An Eulerian path approach to DNA fragment
    assembly. Pavel A. Pevzner, Haixu Tang, Michael
    S. Waterman. PNAS, August 2001.
  • 2 Strategies for the systematic sequencing of
    complex genomes. Eric D. Green. Nature Reviews
    Genetics 2, 573-583 (2001)
  • 3 Eulerian Cycle / Chinese Postman. The Stony
    Brook Algorithm Repository. Steven S. Skiena.
    http://www.cs.sunysb.edu/algorith/