Bioinformatics - PowerPoint PPT Presentation

About This Presentation

Title:

Bioinformatics

Slides: 21

Provided by: aruv

Transcript and Presenter's Notes

1
Lecture 14
Bioinformatics

Genome sequencing projects
Hierarchical and Shotgun approaches
Genome assembly
TIGR Assembler
Ensembl

2
Genome size

A mammalian genome is about 3 gigabases
(3 x 10^9 base pairs).
How many books are needed to print the entire
mammalian genome?
1,500 letters per page x 1,000 pages per book x
2,000 books = 3 x 10^9 letters.
Assuming 5 cm per book, the shelf is 100 meters
long!
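
The shelf estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative, and the inputs are the round figures quoted on the slide.

```python
# Back-of-the-envelope check of the "genome as a library" estimate.
GENOME_BP = 3 * 10**9        # approximate mammalian genome size in base pairs
LETTERS_PER_PAGE = 1_500
PAGES_PER_BOOK = 1_000
BOOK_THICKNESS_CM = 5

letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
books_needed = GENOME_BP // letters_per_book
shelf_length_m = books_needed * BOOK_THICKNESS_CM / 100

print(f"{books_needed} books, {shelf_length_m:.0f} m of shelf")
```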

3
Genome sequencing: the problem

Sequencing read lengths vary with several
parameters, but 600 to 800 nucleotides is a good
estimate. To sequence much larger fragments, or
even whole genomes, essentially two strategies
have been designed.
a) The hierarchical approach. Depending on the
cloning vector used, BAC, YAC, cosmid and other
libraries of cloned inserts are created. The
size of an insert (contig) may vary from tens to
hundreds of thousands of base pairs.
Collections of sub-fragments obtained by
restriction enzyme digestion are mapped to build
a unique contig, from which a minimal set of
sub-fragments can be selected and sequenced,
thus limiting sequence redundancy.
b) The shotgun approach. This can be applied to a
DNA sequence of any size, including a whole
genome. The DNA is randomly fragmented by
sonication or shearing. Following fragmentation
and enzymatic end repair, the DNA fragments are
ligated into a plasmid vector and a bacterial
host is transformed to produce a library. Clones
taken at random from the library are then
sequenced from both ends using two universal
primers. At this stage a shotgun is
characterised by its depth, i.e. the cumulative
length of sequence determined divided by the
length of the fragment or genome to be
sequenced. For example, for an estimated size of
4 Mb, a 10X shotgun corresponds to the assembly
of about 60,000 reads with a mean size of
650 nt. The resulting sequences are assembled
into a single contig representing the whole
fragment by sequence comparison using
appropriate bioinformatics programs. The final,
or polishing, stage consists of closing gaps and
resolving other possible problems.
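
The depth definition above is a simple ratio, and the 4 Mb example can be reproduced as follows. This is a minimal sketch: the function names `reads_needed` and `depth_of` are made up for illustration, and a single mean read length is assumed for all reads.

```python
def reads_needed(target_bp: int, depth: float, mean_read_bp: int) -> int:
    """Number of reads whose cumulative length reaches the requested depth."""
    return round(target_bp * depth / mean_read_bp)


def depth_of(target_bp: int, n_reads: int, mean_read_bp: int) -> float:
    """Depth = cumulative length of sequence read / length of the target."""
    return n_reads * mean_read_bp / target_bp


# The slide's example: a 4 Mb target, 10X depth, 650 nt reads.
print(reads_needed(4_000_000, 10, 650))   # roughly 60,000 reads
print(depth_of(4_000_000, 60_000, 650))   # depth reached with exactly 60,000 reads
```

With exactly 60,000 reads the depth comes out slightly under 10X, which is why the slide says "about" 60,000.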

4
Shotgun approach
5
Genome assembly
6
Assembly of contiguous DNA sequences

Sequencing projects have rapidly moved to using
the two approaches sequentially.
For example, the construction of a BAC map
covering an entire genome or chromosome is
followed by a shotgun strategy to sequence a
minimal set of BACs.
The change introduced by J. Craig Venter was
the size of the DNA fragment or genome that was
shotgunned directly. Increasing the size of
shotgun projects depended on the development of
robots adapted to high-throughput projects and
of bioinformatics programs that solve two major
problems.
The first is a quantitative problem: the
capacity to store, compare and retrieve millions
of reads corresponding to billions of
nucleotides. This is the database (DB) problem.
The second is the presence of numerous repeat
sequences that are often longer than the mean
read length, which complicates correct assembly.
This is the assembly problem.

7
Fragment assembly problem

The Shortest Superstring Problem, while
challenging in itself, is a simplified
abstraction of fragment assembly, which must
also take three other difficulties into account.
1. Sequence data are not perfect, and erroneous
reads are possible.
2. Numerous repeats are present. The human
genome contains about a million copies of the
roughly 300-base-pair Alu repeat, and many other
repeats. Fortunately, copies of a repeat may
differ slightly because of mutation.
3. As DNA is double-stranded, the orientation of
each substring is unknown: it is not known which
strand a read comes from when reconstructing the
sequence.
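
To illustrate the third difficulty: because a read may come from either strand, an assembler has to consider each read together with its reverse complement. The sketch below assumes reads contain only the letters A, C, G and T; the helper names are illustrative.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def both_orientations(read: str) -> tuple[str, str]:
    """The two strings an assembler must try when the strand is unknown."""
    return read, reverse_complement(read)


print(both_orientations("ATGC"))
```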
Most fragment assembly algorithms include the
following three steps.
Overlap. The problem is to find the best match
between the suffix of one sequence and the prefix
of another. The difficulties above force the use
of variations of the dynamic programming
algorithm and of filtration methods.
Layout. This is the hardest step in DNA assembly,
and it becomes even more computationally
demanding as the number of fragments increases.
The most difficult part is deciding whether two
fragments with a good overlap really overlap or
represent a repeat or something else.
Consensus. This step finds the most frequent
character at each position of the layout
constructed in the layout step. More
sophisticated algorithms align substrings in
small windows along the layout or use a mosaic
of the best (highest-scoring) segments from the
layout.
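
The three steps can be sketched on a toy example. This is a deliberately simplified illustration, not the method of any particular assembler: overlaps are exact (no sequencing errors), all reads are assumed to come from the same strand, repeats are not handled, and the layout is built by a naive greedy merge. All function names are hypothetical.

```python
from collections import Counter


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Overlap step: longest suffix of `a` that exactly equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Layout step (greedy): repeatedly merge the pair with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: several contigs are left
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags


def consensus(layout: list[tuple[int, str]]) -> str:
    """Consensus step: majority vote in each column of a list of (offset, read)."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for pos, base in enumerate(read):
            columns[offset + pos][base] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))

# The middle read carries one sequencing error (A instead of T at column 6);
# the other two reads outvote it.
print(consensus([(0, "ATGGCGT"), (3, "GCGAACC"), (6, "TACCTGA")]))
```

The greedy merge is where repeats cause trouble: two fragments from different copies of a repeat overlap just as well as two truly adjacent fragments, which is why the layout step is described as the hardest.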

8
Genome assembly from smaller sequence fragments
9
TIGR Assembler

TIGR Assembler is open-source software.
It is a sequence fragment assembly program that
builds contigs from small sequence reads.
It is versatile, offering a wide variety of
options for tuning the assembly process and
analyzing sequence data. The current assembly
engine uses a greedy algorithm and heuristics to
build contigs, find repeat regions, and target
alignment regions.
Sequence overlaps are detected and scored using a
32-mer hash.
Sequence alignment and merging are done using a
Smith-Waterman dynamic programming algorithm.
Gap penalties and score values corresponding to
the bases and their quality values are predefined
and hard-coded into the program.
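
The two ideas named above, k-mer hashing to find candidate overlaps and Smith-Waterman to align them, can be sketched as follows. This is not TIGR Assembler's code. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions rather than the program's hard-coded values, quality values are ignored, and the function names are hypothetical. The k-mer length is a parameter so that the toy example can use a small k instead of 32.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads: list[str], k: int) -> set[tuple[int, int]]:
    """Pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for rid, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple[int, int, int]:
    """Best local alignment as (score, end position in a, end position in b)."""
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best[0]:
                best = (curr[j], i, j)
        prev = curr
    return best


reads = ["ATGGCGT", "GCGTACC", "TTTTTTT"]
for i, j in sorted(candidate_pairs(reads, k=4)):
    print(i, j, smith_waterman(reads[i], reads[j]))
```

Only pairs that share a k-mer are passed to the quadratic-time alignment, which is the point of hashing first.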

10
(No Transcript)
11
(No Transcript)
12
Genome assembly: contigs and supercontigs
alignment

It is very difficult to produce a finished
continuous sequence for a genome with the level
of redundancy typical of many higher eukaryotes.
Instead, a draft sequence of about 150,000
contigs will be generated, which can be combined
into a few thousand supercontigs.
The parallel production of a dense radiation
hybrid (RH) map will not only facilitate the
assembly of the contigs into supercontigs, but
will also make it possible to order the
supercontigs, a necessary step for understanding
genome rearrangements and synteny.

13
85cM
14
Mouse Genome sequencing and assembly

The mouse genome is about 14% smaller than the
human genome (2.5 Gb compared with 2.9 Gb),
probably due to a higher rate of deletions.
Over 90% of the mouse and human genomes can be
partitioned into corresponding regions of
conserved synteny.
The sequencing strategy included four
approaches: 1) construction of a BAC-based
physical map by fingerprinting and sequencing
the clone ends, 2) whole-genome shotgun (WGS)
sequencing to 7-fold coverage and assembly to
generate an initial draft, 3) hierarchical
shotgun sequencing of BAC clones combined with
WGS to create a hybrid WGS-BAC assembly,
4) production of finished sequence by using the
BAC clones as templates for direct finishing.
About 41 million reads were generated by the
project participants, of which 33.6 million
passed quality checks and 29.7 million were
paired (derived from opposite ends of the same
clone). Clone inserts provide 47-fold physical
coverage of the genome.
Genome assembly was achieved using two newly
developed programs, Arachne and Phusion.
The assembly contains 224,713 contigs, connected
into 7,418 supercontigs. The 200 largest
supercontigs span more than 98% of the assembled
sequence, of which 3% is within sequence gaps.
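
A few of the ratios implied by these figures can be worked out directly. The sketch below uses only numbers quoted on the slide. It assumes that the paired-read count is in millions, like the other read counts, and the variable names are illustrative.

```python
# Figures quoted on the slide.
human_gb, mouse_gb = 2.9, 2.5
reads_total, reads_passed, reads_paired = 41e6, 33.6e6, 29.7e6
contigs, supercontigs = 224_713, 7_418

size_reduction = 1 - mouse_gb / human_gb          # mouse vs. human genome size
pass_rate = reads_passed / reads_total            # reads surviving quality checks
paired_fraction = reads_paired / reads_passed     # passing reads with a mate
contigs_per_supercontig = contigs / supercontigs  # average scaffold content

print(f"mouse genome smaller by {size_reduction:.1%}")
print(f"reads passing quality checks: {pass_rate:.1%}")
print(f"passing reads that are paired: {paired_fraction:.1%}")
print(f"contigs per supercontig: {contigs_per_supercontig:.1f}")
```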

15
Ensembl: An Open-Source Tool

Ensembl consists of two main parts:
1) The analysis pipeline, which regularly adds
new data and analyses to the core database. The
DB contains DNA sequences, predicted features on
the sequences and a complete body of evidence
supporting these predictions. Ensembl known genes
are therefore those predicted genes that have
high similarity to genes confirmed by
experimental evidence.
2) The API (application programming interface),
which gives structured access to the data.
The ease of retrieving information in a
meaningful form makes the API an extremely
powerful tool. The initial implementation of the
API is in Perl, built upon a layer of BioPerl
objects. Implementations in other languages,
such as Java and Python, are also in use.
Ensembl is based around two ideas: a golden
path (the path through the data containing
nonredundant sequence) and the virtual contig (a
contig defined by the user, covering an
arbitrary region of a chromosome).
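
The relationship between the two ideas can be illustrated with a toy data structure. This is not the Ensembl API: `Tile` and `virtual_contig` are hypothetical names, coordinates are assumed to be 1-based and inclusive, the golden-path tiles are assumed not to overlap, and uncovered positions are filled with N.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """One piece of a golden path: part of a raw contig placed on a chromosome."""
    contig_id: str
    chr_start: int  # 1-based, inclusive
    chr_end: int    # 1-based, inclusive
    sequence: str


def virtual_contig(golden_path: list[Tile], start: int, end: int) -> str:
    """Sequence of an arbitrary chromosome region, stitched from the tiles."""
    pieces = []
    pos = start
    for tile in sorted(golden_path, key=lambda t: t.chr_start):
        if tile.chr_end < pos or tile.chr_start > end:
            continue  # tile lies entirely outside the remaining region
        if tile.chr_start > pos:
            pieces.append("N" * (tile.chr_start - pos))  # gap between tiles
            pos = tile.chr_start
        last = min(end, tile.chr_end)
        pieces.append(tile.sequence[pos - tile.chr_start:last - tile.chr_start + 1])
        pos = last + 1
    if pos <= end:
        pieces.append("N" * (end - pos + 1))  # region extends past the last tile
    return "".join(pieces)


path = [
    Tile("c1", 1, 5, "AAAAA"),
    Tile("c2", 6, 10, "CCCCC"),
    Tile("c3", 13, 16, "GGGG"),
]
print(virtual_contig(path, 4, 14))
```

The user asks for a region in chromosome coordinates and never needs to know where the underlying contig boundaries fall, which is the idea behind a virtual contig.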
The NCBI and UCSC websites host systems similar
to Ensembl.