Bioinformatics - PowerPoint PPT Presentation

Provided by: aruv

Transcript and Presenter's Notes

Title: Bioinformatics


1
Lecture 14
Bioinformatics
  • Genome sequencing projects
  • Hierarchical and Shotgun approaches
  • Genome assembly
  • TIGR Assembler
  • Ensembl

2
  • Genome size
  • Mammalian genome: about 3 gigabases (3 x 10^9 base
    pairs)
  • How many books are needed to print the entire
    mammalian genome?
  • 1,500 letters per page x 1,000 pages per book x
    2,000 books = 3 x 10^9 letters
  • Assuming 5 cm per book, this shelf is 100 meters
    long!
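
The book estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function names are illustrative and not part of the original lecture.

```python
# Back-of-the-envelope check of the "genome as a bookshelf" estimate.
GENOME_BP = 3 * 10**9        # mammalian genome size from the slide
LETTERS_PER_PAGE = 1_500
PAGES_PER_BOOK = 1_000
BOOK_THICKNESS_CM = 5


def books_needed(genome_bp, letters_per_page, pages_per_book):
    """Number of books needed to print genome_bp letters (rounded up)."""
    letters_per_book = letters_per_page * pages_per_book
    return -(-genome_bp // letters_per_book)  # ceiling division


def shelf_length_m(n_books, thickness_cm):
    """Shelf length in meters for n_books of the given thickness."""
    return n_books * thickness_cm / 100


books = books_needed(GENOME_BP, LETTERS_PER_PAGE, PAGES_PER_BOOK)
print(books, "books,", shelf_length_m(books, BOOK_THICKNESS_CM), "m of shelf")
```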

3
Genome sequencing: the problem
  • Sequencing read lengths vary with several
    parameters, but 600 to 800 nucleotides is a good
    estimate. To sequence much larger fragments, or
    even a whole genome, essentially two strategies
    have been designed.
  • a) The hierarchical approach. Depending on the
    vector used for cloning, BAC, YAC, cosmid and
    other libraries of cloned contigs are usually
    created. The size of the insert/contig may vary
    from tens to hundreds of thousands of base pairs.
    Collections of sub-fragments obtained by
    restriction enzyme digestion are mapped to build
    a unique contig, from which a minimal set of
    sub-fragments can be selected and sequenced, thus
    limiting sequence redundancy.
  • b) The shotgun approach. This can be applied to a
    DNA sequence of any size, including the whole
    genome. DNA is randomly fragmented by sonication
    or shearing. Following fragmentation and
    enzymatic end repair, the DNA fragments are
    ligated to a plasmid vector and a bacterial host
    is transformed to produce a library. Clones taken
    at random from the library are then sequenced
    from both ends using two universal primers. At
    this stage a shotgun is characterised by its
    depth, i.e. the cumulative length of sequence
    determined divided by the length of the fragment
    or genome to be sequenced. For example, with an
    estimated size of 4 Mb, a 10X shotgun would
    correspond to the assembly of about 60,000 reads
    with a mean size of 650 nt. The resulting
    sequences are assembled into a single contig
    representing the whole fragment by sequence
    comparison using appropriate bioinformatic
    programs. The final stage, or polishing stage,
    corresponds to the elimination of gaps and other
    possible problems.
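
The depth figure quoted in b) follows directly from the definition (cumulative read length divided by target length). A minimal sketch, assuming reads of uniform mean length; the function names are illustrative:

```python
import math


def shotgun_depth(n_reads, mean_read_len, target_len):
    """Depth = cumulative sequenced length / length of the target."""
    return n_reads * mean_read_len / target_len


def reads_for_depth(depth, mean_read_len, target_len):
    """Smallest number of reads that reaches the requested depth."""
    return math.ceil(depth * target_len / mean_read_len)


# Slide example: 4 Mb target, 650 nt mean read length.
print(shotgun_depth(60_000, 650, 4_000_000))   # depth reached by 60,000 reads
print(reads_for_depth(10, 650, 4_000_000))     # reads needed for exactly 10X
```

With these numbers, 60,000 reads give a depth slightly under 10X, which is why the slide says "about 60,000".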

4
Shotgun approach
5
Genome assembly
6
Assembly of contiguous DNA sequences
  • Sequencing projects have rapidly moved to using
    the two approaches sequentially.
  • For example, the construction of a BAC map
    covering an entire genome or chromosome is
    followed by a shotgun strategy to sequence a
    minimal set of BACs.
  • The change introduced by J. C. Venter was the
    size of the DNA fragment or genome that was
    shotgunned directly. The possibility of
    increasing the size of shotgun projects depended
    on the development of robots adapted to
    high-throughput projects and of bioinformatic
    programs that solve two major problems.
  • One is a quantitative problem: the capacity to
    store, compare and retrieve millions of reads
    corresponding to billions of nucleotides (the
    database problem).
  • The second problem is related to the presence of
    numerous repeat sequences that are often longer
    than the mean read length, complicating correct
    assembly (the assembly problem).

7
Fragment assembly problem
  • The Shortest Superstring Problem, while
    representing a challenge, is a simplified
    abstraction, since a real assembly must also take
    three other difficulties into consideration.
  • 1. Sequence data are not perfect and mistaken
    reads are possible.
  • 2. Presence of numerous repeats. There are about
    a million copies of the 300 base pair Alu repeat,
    and many other repeats. Fortunately, some copies
    of a repeat differ slightly because of mutation.
  • 3. As DNA is double-stranded, orientation of
    substrings is unknown and it is not known which
    strand should be used in the reconstruction.
  • Most fragment assembly algorithms include the
    following three steps:
  • Overlap. The problem is to find the best match
    between the suffix of one sequence and the prefix
    of another. The difficulties above force the use
    of variations of the dynamic programming
    algorithm and of filtration methods.
  • Layout. This is the hardest step in DNA assembly,
    and it becomes even more computationally
    demanding as the number of fragments increases.
    The most difficult part is deciding whether two
    fragments with a good overlap really overlap or
    represent a repeat or something else.
  • Consensus. This step is devoted to finding the
    most frequent character at each position of the
    substring layout constructed in the layout step.
    More sophisticated algorithms align substrings in
    small windows along the layout or use a mosaic of
    the best segments (those with high probabilistic
    scores) from the layout.
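
The three steps above can be illustrated on toy data. This is a deliberately simplified sketch, not the method of any particular assembler: it assumes error-free reads, exact suffix-prefix matches, a single known strand, and a greedy merge order, so it ignores exactly the difficulties listed earlier (errors, repeats, orientation). All function names are illustrative.

```python
from collections import Counter


def reverse_complement(seq):
    """Opposite-strand sequence; a real assembler tries both orientations."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def overlap(a, b, min_len=3):
    """Overlap step: longest exact suffix of a equal to a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Layout step (greedy): repeatedly merge the best-overlapping pair."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: several contigs are left
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads


def consensus(rows):
    """Consensus step: majority vote per column of gap-padded rows."""
    result = []
    for column in zip(*rows):
        bases = [c for c in column if c != "-"]
        result.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return "".join(result)


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))
print(consensus(["ATGC-", "ATGCA", "ATTCA"]))
```

A greedy merge can join two reads that merely share a repeat, which is the layout difficulty described above; real assemblers add checks (for example paired-end constraints) before accepting an overlap.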

8
Genome assembly from smaller sequence fragments
9
TIGR Assembler
  • TIGR Assembler is open-source software.
  • The TIGR Assembler is a sequence fragment
    assembly program building contigs from small
    sequence reads.
  • It is versatile, offering a wide variety of
    options for tuning the assembly process and
    analyzing sequence data. The current assembly
    engine uses a greedy algorithm and heuristics to
    build contigs, find repeat regions, and target
    alignment regions.
  • Sequence overlaps are detected and scored using a
    32-mer hash.
  • Sequence alignment and merging are done using a
    Smith-Waterman dynamic programming algorithm.
  • Gap penalties and score values corresponding to
    the bases and their quality values are predefined
    and hard coded into the program.
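
The two techniques named above, k-mer hashing to find candidate overlaps and Smith-Waterman local alignment to score them, can be sketched as follows. This is not TIGR Assembler's code: the scoring values (match 2, mismatch -1, gap -2) and the small k used in the example are assumptions chosen for illustration, whereas the program described above uses 32-mers and its own hard-coded, quality-aware scores.

```python
from collections import defaultdict


def kmer_index(reads, k):
    """Map every k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def candidate_pairs(reads, k):
    """Pairs of reads sharing at least one k-mer (possible overlaps)."""
    pairs = set()
    for ids in kmer_index(reads, k).values():
        ordered = sorted(ids)
        for x in range(len(ordered)):
            for y in range(x + 1, len(ordered)):
                pairs.add((ordered[x], ordered[y]))
    return pairs


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


reads = ["ACGTACGT", "TACGTTTT", "GGGGGGGG"]
for i, j in sorted(candidate_pairs(reads, k=4)):
    print(i, j, smith_waterman(reads[i], reads[j]))
```

Only pairs that share a k-mer are passed to the quadratic-time alignment, which is the point of the hashing step.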

12
Genome assembly: contigs and supercontigs alignment
  • It is very difficult to produce a finished
    continuous sequence for a genome with the level
    of redundancy typical of many higher eukaryotes.
  • Instead, a draft sequence of about 150,000
    contigs will be generated that could be combined
    to give a few thousand supercontigs.
  • The production, in parallel, of a dense radiation
    hybrid (RH) map will not only facilitate the
    assembly of the contigs into supercontigs, but
    will also make it possible to order the
    supercontigs, a necessary step for understanding
    genome rearrangements and synteny.

13
85cM
14
Mouse Genome sequencing and assembly
  • The mouse genome is about 14% smaller than the
    human genome (2.5 Gb compared with 2.9 Gb),
    probably due to a higher rate of deletions.
  • Over 90% of the mouse and human genomes can be
    partitioned into corresponding regions of
    conserved synteny.
  • The sequencing strategy included four approaches:
    1) construction of a BAC-based physical map by
    fingerprinting and sequencing the clone ends, 2)
    whole-genome shotgun (WGS) sequencing to 7-fold
    coverage and assembly to generate an initial
    draft, 3) hierarchical shotgun sequencing of BAC
    clones combined with WGS to create a hybrid
    WGS-BAC assembly, 4) production of finished
    sequence by using the BAC clones as templates for
    direct finishing.
  • About 41 million reads were generated by the
    project participants, of which 33.6 million
    passed quality checks and 29.7 million were
    paired (opposite ends of the same clone). Clone
    inserts provide 47-fold physical coverage of the
    genome.
  • Genome assembly was achieved using two newly
    developed programs, Arachne and Phusion.
  • The assembly contains 224,713 contigs, connected
    into 7,418 supercontigs. The 200 largest
    supercontigs span more than 98% of the assembled
    sequence, of which 3% is within sequence gaps.
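
Several of the figures on this slide can be cross-checked from the numbers quoted. The sketch below uses only values stated above; the helper names are illustrative.

```python
def percent_smaller(smaller, larger):
    """How much smaller one value is than another, as a percentage."""
    return 100 * (larger - smaller) / larger


def percent_of(part, whole):
    """Part expressed as a percentage of whole."""
    return 100 * part / whole


# Mouse (2.5 Gb) versus human (2.9 Gb) genome size.
print(round(percent_smaller(2.5, 2.9), 1), "% smaller")

# Share of the 33.6 million quality-passed reads that were paired.
print(round(percent_of(29.7, 33.6), 1), "% paired")

# Average number of contigs joined into each supercontig.
print(round(224_713 / 7_418, 1), "contigs per supercontig")
```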

15
Ensembl: An Open-Source Tool
  • Ensembl consists of two main parts:
  • 1) The analysis pipeline, which adds new data and
    analyses regularly to the core database. The DB
    contains DNA sequences, predicted features on the
    sequences and a complete body of evidence
    supporting these predictions. Ensembl known genes
    therefore are those predicted genes that have
    high similarity to genes confirmed by
    experimental evidence.
  • 2) The API (application programming interface),
    which gives structured access to the data. The
    ease of retrieving information in a meaningful
    form makes the API an extremely powerful tool.
    The initial implementation of the API is in Perl,
    built upon a layer of BioPerl objects.
    Implementations in other languages, such as Java
    and Python, are also in use.
  • Ensembl is based around two ideas: a golden path
    (the pathway through the data containing
    nonredundant sequence) and a virtual contig (a
    contig determined by the user, an arbitrary
    region of a chromosome).
  • The NCBI and UCSC web sites contain systems
    similar to Ensembl.
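
The golden path and virtual contig ideas can be illustrated with a toy data structure. This is a conceptual sketch only and does not use or reproduce the Ensembl API: `PathSegment` and `virtual_contig` are hypothetical names, coordinates are assumed to be 1-based and inclusive, and uncovered positions are assumed to be reported as `N`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathSegment:
    """One non-redundant piece of sequence placed on a chromosome."""
    chrom_start: int  # 1-based chromosome coordinate of the first base
    sequence: str

    @property
    def chrom_end(self):
        return self.chrom_start + len(self.sequence) - 1


def virtual_contig(golden_path, start, end, gap_char="N"):
    """Sequence of an arbitrary chromosome region [start, end].

    The region is stitched together from whichever golden-path segments
    cover it; positions that no segment covers are filled with gap_char.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    region = [gap_char] * (end - start + 1)
    for segment in golden_path:
        lo = max(start, segment.chrom_start)
        hi = min(end, segment.chrom_end)
        for pos in range(lo, hi + 1):
            region[pos - start] = segment.sequence[pos - segment.chrom_start]
    return "".join(region)


# Two placed segments with a two-base gap between them (positions 7-8).
path = [PathSegment(1, "ACGTAC"), PathSegment(9, "GGTTAA")]
print(virtual_contig(path, 4, 11))
```

The user asks for a region by chromosome coordinates and never needs to know which underlying contigs supply the bases, which is the convenience the virtual contig idea provides.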
