Whole Genome Assembly - PowerPoint PPT Presentation


About This Presentation

Title:

Whole Genome Assembly



Slides: 26

Provided by: Owne834


Transcript and Presenter's Notes

Title: Whole Genome Assembly

1
Whole Genome Assembly

The problem: biological sequencing gives only
relatively short fragments (clones) of DNA.
How can the whole genome be reconstructed?
Two general approaches were implemented:
clone-by-clone
shotgun sequencing (the one discussed here)
The results were almost identical.

2
Mapping using Clones

Clone:
A large fragment of genomic DNA obtained using
restriction enzymes.
One can make a large number of faithful copies of
a clone from a small number of initial clones.
All location information for a clone is assumed
to be lost. For instance, it is not known:
Which chromosome a clone belongs to
Whether two clones overlap
What base-pair sequence the clone has, etc.

3
Clone Library

A large set of clones (a Clone Library) is created.
Locations of the clones are assumed to be
uniformly randomly distributed.
The sizes of all clones are roughly the same.
G = genome length, L = clone length,
N = number of clones in a library
Coverage: $c = NL/G$
(c is the expected number of clones covering any
location of the genome.)
If the coverage at a base is at least 3, that
base is assumed to be reliable.

4
Clone Library
[Figure: genome, clone library, and minimal tiling path]
5
Clone Libraries Commonly Used
6
Example Genome Sizes

Approximate sizes in base pairs:
Bacteriophage lambda (virus): 50,000
Escherichia coli (bacterium): 5,000,000
Saccharomyces cerevisiae (yeast): 10,000,000
Caenorhabditis elegans (worm): 100,000,000
Drosophila melanogaster (fruit fly): 200,000,000
Homo sapiens (human): 3,000,000,000

7
Example

A BAC library for human:
$G = 3{,}300$ Mb, $L = 180$ kb, $N = 96{,}000$
$c = NL/G = (96 \times 10^3 \times 180 \times 10^3) / (3.3 \times 10^9) \approx 5.2$
96,000 randomly chosen BACs from the human genome
therefore provide a library of roughly 5× coverage.
Certain regions of the genome may be difficult to
clone and hence may not be represented in the
library.
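The coverage arithmetic for this BAC library can be checked with a few lines of Python. This is only a sketch: the function name `coverage` and its parameter names are chosen here for illustration, and the numbers are the G, L and N given on the slide.

```python
def coverage(num_clones: int, clone_length: int, genome_length: int) -> float:
    """Expected number of clones covering any genome position: c = N * L / G."""
    return num_clones * clone_length / genome_length


# Values from the slide: N = 96,000 BACs, L = 180 kb, G = 3,300 Mb.
c = coverage(96_000, 180_000, 3_300_000_000)
print(f"coverage = {c:.2f}x")
```

The same function applies to any library for which N, L and G are known in the same unit (here, base pairs).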
A Tiling Path is a subset of clones that
minimally covers the genome:
removal of any clone from the tiling path
will leave some location of the genome uncovered.
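As a rough illustration of a tiling path, the sketch below picks a small covering subset of clones with a greedy sweep. It assumes something the slides do not: that each clone's position is already known as a half-open interval (start, end) on the genome. The function name `minimal_tiling_path` and the toy intervals are hypothetical.

```python
def minimal_tiling_path(clones, genome_length):
    """Greedy interval cover of [0, genome_length).

    clones: list of (start, end) half-open intervals (positions assumed known).
    At each step, among the clones starting at or before the covered prefix,
    take the one that reaches farthest to the right.
    """
    ordered = sorted(clones)
    path = []
    covered = 0  # the genome is covered on [0, covered)
    i = 0
    while covered < genome_length:
        best = None
        while i < len(ordered) and ordered[i][0] <= covered:
            if best is None or ordered[i][1] > best[1]:
                best = ordered[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"no clone covers position {covered}")
        path.append(best)
        covered = best[1]
    return path


# Toy library on a genome of length 12.
library = [(0, 5), (1, 4), (3, 9), (4, 7), (8, 12)]
print(minimal_tiling_path(library, 12))
```

On the toy library the sweep keeps three of the five clones, and dropping any of the three would leave part of the genome uncovered.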

8
Mapping A Single Clone

Provide a clone with additional information, a
fingerprint:
Restriction Pattern
End Sequencing (500 base pairs on each end)
Probes (hybridization probes, etc.)
Restriction Pattern:
Take a clone and completely digest it into small
pieces (restriction fragments) with a restriction
enzyme.
The restriction fragments and their order are
always the same for that clone.

Restriction Patterns are too expensive for large
genome projects.
Mapping with probes is much cheaper. Probes may
be short random sequences or extracted from DNA.
STSs (sequence-tagged sites) are DNA pieces long
enough to be unique with very high probability.
Probing is usually done at the ends of clones.

10
Sequence Assembly

Idealized assembly with probes can be formalized
as the following problem (assuming no errors in
the read sequences):
Shortest Common Superstring (SCS) Problem:
Given a collection of n strings $F = \{f_1, f_2, \ldots, f_n\}$,
find the shortest string that is a
superstring of every string in F.

11
Example

F = {actcc, gagca, ccctac, agg}
Each of these is a superstring of everything in F
(because each contains everything in F as a
substring):
actccgagcaccctacagg (19 characters)
actcccctacaggagca (17 characters)
aggagcactccctac (15 characters)
The overlaps are shown in bold on the slide; the
shortest superstring is the best.
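The three candidates above can be checked mechanically. The short sketch below verifies that each one contains every fragment of F and then picks the shortest; the helper name `is_superstring` is chosen here for illustration.

```python
F = ["actcc", "gagca", "ccctac", "agg"]
candidates = [
    "actccgagcaccctacagg",
    "actcccctacaggagca",
    "aggagcactccctac",
]


def is_superstring(s, fragments):
    """True if every fragment occurs in s as a contiguous substring."""
    return all(f in s for f in fragments)


valid = [s for s in candidates if is_superstring(s, F)]
shortest = min(valid, key=len)
print(shortest, len(shortest))
```

This only compares the three strings listed on the slide; it does not by itself show that no shorter superstring exists.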

12
About the SCS Problem

No algorithm is known which is guaranteed to
find the best solution (i.e. the shortest
superstring) in reasonable time for large cases
(the problem is NP-hard).
So in practice we cannot guarantee finding the
correct/best superstring (all genome
reconstruction problems are large).
That's not all: even if we had a solution,
there are still hard problems.

13
Problems of the SCS solution

Does not allow experimental errors: it needs a
perfect superstring.
The orientation of each fragment must be known,
which is unrealistic.
May not be the biological solution: if there are
many long repeats, it is almost impossible to
reconstruct the exact original sequence.

14
Example of the repeats problem

The real sequence may include
agactactactactga, and the covering fragments
may be
agactactact and actactactga,
which will be collapsed by solving the SCS to
agactactactga,
overlapping too many of the act repeats.
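The collapse can be reproduced in a few lines. The sketch below merges the two fragments on their longest suffix/prefix overlap, which is what a shortest-superstring solution does for two strings; the helper names `overlap` and `merge` are chosen here for illustration.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b):
    """Glue b onto a, sharing their longest overlap."""
    return a + b[overlap(a, b):]


real = "agactactactactga"
f1, f2 = "agactactact", "actactactga"

collapsed = merge(f1, f2)
print(overlap(f1, f2))             # overlap used by the merge
print(collapsed, len(collapsed))   # shorter than the real sequence
print(real, len(real))
```

Both fragments really do occur in the 16-base sequence, where they share only 6 bases, but the maximal overlap of 9 bases swallows one copy of act.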

15
Fragment Assembly

Despite the issues, we need to put the fragments
together somehow, so:
Approximate methods are used, i.e. algorithms
that try to find good (maybe not the best)
solutions quickly.
Repeats are dealt with by focusing extra
biotechnology effort on getting large sequenced
fragments (long reads) around certain areas
which might contain multiple repeats.

16
Overlap multigraphs
A useful data structure for addressing the SCS is
the overlap multigraph of the set of strings.
This is denoted OM(F) for a set of strings
(fragments) F. First, an illustration for the
set of fragments:
maryha, teass, ryhad, hite, aswhit, lambit, ssnow,
tlela, little, yhadali, lelam, tsfle, leecew, ecewas
[Figure: overlap multigraph of these fragments. The
edge between two fragments is labelled by the length
of their overlap.]
17
More formally
OM(F) is a directed graph in which we have an
edge from fa to fb whenever there is a nonzero
overlap between a suffix of fa and a prefix
of fb. The edge is labelled with the length of
the largest such overlap. A Hamiltonian path in a
directed graph is a path which visits every node
in the graph precisely once. Any Hamiltonian
path in an overlap multigraph corresponds to a
superstring. Finally, finding the SCS
corresponds to finding the Hamiltonian path with
maximal weight (adding up the edge labels on the
path).
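To make the correspondence concrete, here is a minimal sketch that builds the overlap graph for the small set F = {actcc, gagca, ccctac, agg} from the earlier example and tries every ordering of the fragments. Trying all orderings is only feasible for a handful of strings; the function names are chosen here for illustration, and a missing edge is treated as an overlap of zero.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(fragments):
    """Edges (fa, fb) -> overlap length, kept only when the overlap is nonzero."""
    edges = {}
    for a in fragments:
        for b in fragments:
            if a != b and overlap(a, b) > 0:
                edges[(a, b)] = overlap(a, b)
    return edges


def best_hamiltonian_path(fragments):
    """Exhaustive search for the maximum-weight ordering (tiny inputs only)."""
    edges = overlap_graph(fragments)

    def weight(order):
        return sum(edges.get(pair, 0) for pair in zip(order, order[1:]))

    best = max(permutations(fragments), key=weight)
    superstring = best[0]
    for prev, cur in zip(best, best[1:]):
        superstring += cur[edges.get((prev, cur), 0):]
    return best, weight(best), superstring


F = ["actcc", "gagca", "ccctac", "agg"]
order, total, s = best_hamiltonian_path(F)
print(order, total, s)
```

The total fragment length is 19 and the best path has weight 4, so the resulting superstring has 19 - 4 = 15 characters.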

18
Example
[Figure: the overlap multigraph of the fragments
maryha, teass, ryhad, hite, aswhit, lambit, tlela,
ssnow, little, yhadali, lelam, tsfle, leecew, ecewas,
with one Hamiltonian path highlighted in red.]
The red path is the longest H path in this case,
and leads to the correct sequence. But in genome
sequencing it is rarely so easy: the graph is
immensely larger, and there are very many H
paths.
19
How it's done

Basic approach is this (where an edge is a pair
(start, finish)):
1. Build the OM(F) graph
2. Sort the edges in descending order of
weight, breaking ties randomly
3. Iteratively choose the highest weighted
edge (first in the list) with the following
properties:
-- its finish is not the finish of a previously
chosen edge, and adding it does not close a
cycle with the edges already chosen
-- its start is not the start of a
previously chosen edge (i.e. we're building a
path, not a tree).

20
Reconstruction

To deal with experimental errors, we need to
introduce the concept of a distance between
sequences.
Substring edit distance: the cost of every
insertion, deletion, or substitution is one unit
of distance. Exception: deletions at the
extremities of the second sequence have no cost.
If we measure substring edit distances between
strings, then we can ask for a string S such that
every fragment f, or its reverse complement, is an
approximate substring of S at error level e.

21
Given a collection F of strings, and an error
tolerance e between 0 and 1, we need to find the
shortest possible string S such that for every f
in F we have
$\min(\mathrm{dist}(f, S), \mathrm{dist}(\bar{f}, S)) \le e\,|f|$,
where $\bar{f}$ is the reverse complement of f.
In other words, find a string S as short as
possible such that every f (or its reverse
complement) is an approximate substring of S at
error level e. This means that we are allowed an
average of e errors for each base in f. If
e = 0.05 we are allowed 5 errors per hundred
bases.
Example: If a = GCGATAG and b = CAGTCGCTGATCGTACG,
then the best alignment is
-----GC-GATAG----
CAGTCGCTGATCGTACG
which has substring edit distance 2 (one gap and
one substitution).
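The substring edit distance can be computed with the usual edit-distance dynamic program, changed so that skipping bases at either end of the second sequence is free. The sketch below is one way to write it; the function names, and the uppercase ACGT alphabet used for the reverse complement, are assumptions made for this example.

```python
def substring_edit_distance(f, s):
    """Minimum number of edits turning f into some substring of s."""
    prev = [0] * (len(s) + 1)  # row 0: starting anywhere in s is free
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)  # column 0: f[:i] against the empty string
        for j in range(1, len(s) + 1):
            cost = 0 if f[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # base of f with no partner
                         cur[j - 1] + 1)      # base of s with no partner
        prev = cur
    return min(prev)  # ending anywhere in s is free


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def is_approximate_substring(f, s, e):
    """True if f or its reverse complement fits in s within e * len(f) edits."""
    d = min(substring_edit_distance(f, s),
            substring_edit_distance(reverse_complement(f), s))
    return d <= e * len(f)


a, b = "GCGATAG", "CAGTCGCTGATCGTACG"
print(substring_edit_distance(a, b))
print(is_approximate_substring(a, b, 0.05), is_approximate_substring(a, b, 0.3))
```

For the example pair the distance is 2 over 7 bases, so a fits b at error level 0.3 but not at 0.05.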
22
The next step

Once short reads (a few hundred bp) have been
assembled into contigs (long sequences) by some
of these techniques, the contigs are assembled
into supercontigs (or scaffolds) and placed at
the right locations on the genome.

23
The order of the whole genome assembly
24
Placing the assembly on the genome

A sequence tag is a short sequence that is unique
within the whole genome.
A genetic map contains many sequence tags and
their locations.
Align the supercontigs to the genome according
to the tags.
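A toy version of the placement step might look like the sketch below. Everything in it is hypothetical: the tag sequences, their map positions, the supercontig names, and the function `place_supercontigs`. It also assumes exact tag matches and a known orientation, which real data would not guarantee.

```python
from collections import Counter


def place_supercontigs(supercontigs, tag_map):
    """Estimate where each supercontig starts on the genome.

    supercontigs: dict name -> sequence
    tag_map: dict tag sequence -> genome position of the tag's first base
    Each tag found in a supercontig votes for a start position; the most
    common vote wins. Supercontigs without any tag are left unplaced.
    """
    placed = []
    for name, seq in supercontigs.items():
        votes = Counter()
        for tag, genome_pos in tag_map.items():
            offset = seq.find(tag)
            if offset != -1:
                votes[genome_pos - offset] += 1
        if votes:
            start, _ = votes.most_common(1)[0]
            placed.append((start, name))
    return [(name, start) for start, name in sorted(placed)]


# Hypothetical genetic map and supercontigs.
tag_map = {"ttagc": 40, "acgtt": 47, "ggcca": 520}
supercontigs = {"sc1": "aaggccatt", "sc2": "ttagcaaacgtt", "sc3": "cccccc"}
print(place_supercontigs(supercontigs, tag_map))
```

Here sc2 carries two tags that agree on its position, sc1 carries one, and sc3 carries none and so stays unplaced.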

25
Review of WGS

WGS: Whole Genome Shotgun sequencing
1. Create a clone library
2. Assemble the overlapping reads into contigs
3. Assemble the contigs into supercontigs
4. Align the supercontigs to the genome
5. Genome finishing: fill the possible gaps
between supercontigs by additional sequencing
