Computational Biology - PowerPoint PPT Presentation

Provided by: UWPar4

1
Computational Biology
  • The advent of genomic sequencing has led to a new
    era in biological studies, one that uses
    information technology to expedite the
    analysis of biological data.

2
BIOS 480 Goals
  • Provide a comprehensive understanding of current
    methods in biological sequence analysis
  • Assess challenges and approaches in new
    bioinformatics-related disciplines
  • Support in-depth, hands-on experience in design
    and implementation of bioinformatics tools (BIOS
    482)

3
Grades
  • The primary assessment tool in this course will
    be participation as exhibited by in-class
    discussion (40%) and performance on assigned
    homework (30%).
  • A final exam testing your comprehensive knowledge
    of the material will count for 20% of your final
    grade.
  • www.uwp.edu/barber/bioinformatics/BIOS480.htm
    has lectures and important materials for this
    class and BIOS 482

4
Introduction to
  • Sequence alignments
  • Gene prediction
  • Motifs
  • Phylogeny
  • Genomics
  • Proteomics
  • Metabolomics

5
The advent of genome sequencing brought
bioinformatics into its own
  • Yeah, but now that the human genome is done,
    isn't genomics done?
  • No.

6
Capillary and slab gel electrophoresis use a
modified Sanger technology with
fluorescent dyes
Typical reads are 500-750 nt on an hour
timescale, varying with the sequencer.
7
Microfabricated Capillary Arrays
  • Etch a glass chip with T-shaped channels that are
    7 cm long and micrometer-scale in depth and width;
    one can devise a 96-well chip that would be
    capable of 150,000 bases/h
  • Miniaturization is one booming field driving
    bioinformatics

8
Free Solution Electrophoresis
  • Possibly will improve separation time (no matrix)
    without losing read length
  • Label DNA molecules with a friction-increasing
    molecule such as streptavidin
  • Currently reads only about 100 bp; a long way to go

9
Who needs electrophoresis?
  • Pyrosequencing
  • MALDI-TOF Mass Spectrometry
  • Sequencing by Hybridization
  • Massively Parallel Signature Sequencing
  • A testimony to innovative molecular biology
  • Single molecule methods

10
Pyrosequencing
  • Real-time sequencing measuring release of PPi
    during DNA synthesis
  • Has been of particular use for SNP analysis

11
Put the sequencing reactions through a mass
spectrometer
Spectra of the C- and G- terminated
oligonucleotides
Current limit: 100 bp. Facilitated by sensitivity
and high-throughput loading
12
Potential innovations in DNA sequencing
  • Sequencing by hybridization
  • Cot-based analysis
  • http://www.msstate.edu/research/mgel/cbcs/cbcs.htm
  • Chip-based analysis
  • http://www.hyseq.com/content/131.php
  • http://citeseer.nj.nec.com/context/471959/0
  • Linear Read
  • http://www.usgenomics.com/about/index.shtml

13
Cot analysis
14
(No Transcript)
15
Growth in genomic technology
  • U.S. Genomics' technology platform, the
    GeneEngine, has two components: (1)
    nanotechnology systems for positioning DNA so
    that it can be read linearly
    (broadly termed DNA Delivery
    Mechanism(s)) and (2) detection technologies
    that allow the reading of information from the
    DNA Delivery Mechanism(s). (FRET-based??)

16
The future looks bright, but what about right now?
17
Overview of Genomic Sequencing
Original DNA
  • Break DNA into random fragments (8-10X coverage)
  • Amplify fragments in a vector and sequence
    500-700 bases in from each end
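
As a rough illustration of what 8-10X coverage means in practice, the number of reads can be estimated as coverage × genome length / read length. The sketch below is only back-of-the-envelope arithmetic; the function name, the 2 Mb genome size, and the 600-base read length are assumptions chosen for the example, not values from the lecture.

```python
import math

def reads_needed(genome_length, read_length, coverage):
    """Estimate how many reads give the requested average coverage.

    Average coverage = (number of reads * read length) / genome length,
    so we solve for the number of reads and round up.
    """
    return math.ceil(coverage * genome_length / read_length)

# Hypothetical example: a 2 Mb genome sequenced with 600-base reads.
for cov in (8, 10):
    print(f"{cov}X coverage needs about {reads_needed(2_000_000, 600, cov)} reads")
```

Note that this is an average: random fragmentation still leaves some positions covered by fewer reads than others.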

Base calling performed by Phred software
http://www.phrap.org/
http://www.genome.org/cgi/reprint/8/3/175.pdf
18
Overview of Genomic Sequencing
Original DNA
  • Break DNA into random fragments (8-10X coverage)
  • Amplify fragments in a vector and sequence
    500-700 bases in from each end
  • Assemble fragments of sequence that have been
    read

Contig 1
Contig 2
19
Phred Software
  • Calls bases in four phases:
  • Predicting peaks (ideal locations)
  • Locating observed peaks
  • Matching observed to predicted
  • Finding missing peaks
  • http://www.genome.org/cgi/reprint/8/3/186.pdf
  • http://www.genome.org/cgi/reprint/8/3/175.pdf

20
Errors in Sequencing Reads
  • Each base call is assigned a quality score
  • q = -10 × log10(p), where p is the estimated
    error probability of the call. Higher quality
    scores correspond to lower error probabilities
  • Errors are associated with the vicinity of a peak;
    the following parameters, calibrated on a
    TRAINING SET, are used to determine the error
    probability
  • Peak spacing
  • Uncalled/called ratio (two window sizes)
  • Peak resolution
  • The result is a look-up table built into the Phred
    software
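
The quality-score formula can be checked with a few lines of code. This is a minimal sketch of the conversion in both directions; the function names are made up for illustration and are not part of Phred itself.

```python
import math

def error_prob_to_quality(p):
    """Quality score q = -10 * log10(p) for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def quality_to_error_prob(q):
    """Inverse conversion: error probability p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)

# Each step of 10 in quality is a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(f"q = {q}: p = {quality_to_error_prob(q):g}")
```

For example, a quality of 20 corresponds to 1 error in 100 calls, and a quality of 30 to 1 error in 1,000.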

21
Common Sources of Sequencing Errors
  • The first fifty or so peaks of a trace are noisy
    and unevenly spaced due to anomalous migration of
    short DNA fragments, and unreacted dye-primer and
    dye-terminator molecules.
  • Near the end of the trace, peaks become less
    evenly spaced due to less accurate trace
    processing, and less well resolved as diffusion
    effects increase and the number of labeled
    molecules decreases.
  • Compressions: most common in GC-rich regions,
    these occur when bases near the end of a
    single-stranded fragment bind to a complementary
    region, forming a hairpin (which migrates more
    rapidly than expected)
  • The dye-terminator sequencing method helps resolve
    compressions but has its own problems: about 85%
    of high-quality dye-terminator errors resulted
    from a missing G peak following an A, or a
    missing A following a T (Ewing and Green, 1998).

22
Assembly of large DNA sequences
  • Several assembly programs exist and can be run
    with different degrees of success: Phrap, TIGR
    Assembler, CAP, STROLL, etc.

23
Overlap-layout-consensus
  • Most fragment assembly algorithms include the
    following three steps:
  • Overlap. Finding potentially overlapping
    fragments.
  • Layout. Finding the order of fragments.
  • Consensus. Deriving the DNA sequence from the
    layout.

24
Overlap
  • The overlap problem is to find the best match
    between the suffix of one sequence and the prefix
    of another.
  • If there are no sequencing errors, simply find the
    longest suffix of one string that exactly matches
    the prefix of another string.
  • Since error rates are low, the common practice is
    to use a filtration method: filter out pairs of
    fragments that do not share a significantly long
    common substring.
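
Both ideas on this slide can be sketched briefly. The code below is an illustrative, unoptimized version, assuming error-free reads for the exact suffix-prefix match and using shared k-mers (exact substrings of length k) as the filtration criterion. The function names and the choice of k are assumptions made for the example.

```python
def longest_suffix_prefix(a, b, min_len=1):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for n in range(min(len(a), len(b)), max(min_len, 1) - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def share_kmer(a, b, k):
    """Filtration: True if `a` and `b` share at least one exact substring of length k."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))

read1, read2 = "ACGTTGCA", "TGCAGGT"
if share_kmer(read1, read2, k=4):            # cheap filter first
    print(longest_suffix_prefix(read1, read2))  # then the overlap test
```

The filter is cheap compared with a full alignment, so only pairs that pass it need the more expensive, error-tolerant overlap computation.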

25
Layout
  • Many algorithms select a pair of fragments with
    the best overlap at every step.
  • The overlap score is either the similarity
    score or a more involved probabilistic score.
  • The selected pair of fragments with the best
    overlap score is checked for consistency.
  • If the check passes, the two fragments are
    merged.
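
The merge loop described above can be sketched as a toy greedy assembler. This is a simplified illustration, not how Phrap or any other named assembler works: it assumes error-free reads, scores overlaps by exact suffix-prefix length, and replaces the consistency check with a plain minimum-overlap threshold (`min_overlap` is an assumed parameter).

```python
from itertools import permutations

def overlap_len(a, b):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair with the best overlap until none qualifies."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_pair = 0, None
        for i, j in permutations(range(len(contigs)), 2):
            n = overlap_len(contigs[i], contigs[j])
            if n > best_len:
                best_len, best_pair = n, (i, j)
        if best_len < min_overlap:
            break  # stand-in for a failed consistency check
        i, j = best_pair
        merged = contigs[i] + contigs[j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs

print(greedy_assemble(["CTGATTAG", "ATGCGTAC", "GTACCTGA"]))
```

Because each step commits to the locally best overlap, repeats in the genome can lead this kind of procedure to merge fragments that do not truly belong together.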

26
Layout
  • At later stages of the algorithm, collections
    of fragments (contigs) rather than individual
    fragments are merged.
  • The difficulty with the layout step is deciding
    whether two fragments with a good overlap really
    overlap (i.e. their differences are caused by
    sequencing errors) or represent a repeat in a
    genome (i.e. their differences are caused by
    mutations).
  • Use additional scaffolding measures: physical
    mapping, optical mapping
    (http://schwartzlab.biotech.wisc.edu/omm/omm.html)

27
Consensus
  • The simplest way to build the consensus is to
    report the most frequent character in the
    substring layout that is (implicitly) constructed
    after the layout step is completed.
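
A majority-vote consensus can be sketched as follows. The example assumes the layout is already available as equal-length rows, one per read, padded with "-" wherever a read does not cover a position. That representation and the function name are assumptions made for illustration.

```python
from collections import Counter

def consensus(layout, gap="-"):
    """Report the most frequent non-gap character in each column of the layout.

    Ties are broken arbitrarily (by first occurrence within the column).
    """
    result = []
    for column in zip(*layout):
        counts = Counter(c for c in column if c != gap)
        if counts:  # skip columns no read covers
            result.append(counts.most_common(1)[0][0])
    return "".join(result)

layout = [
    "ACGTAC---",
    "--GTTCGA-",   # disagrees with the other reads at the fifth position
    "---TACGAT",
]
print(consensus(layout))
```

A natural refinement, given that each base call carries a quality score, would be to weight each vote by that score rather than counting all reads equally.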

28
The Human Touch
  • Consed: A Graphical Tool for Editing Phrap
    Assemblies.

29
Some definitions
  • Heuristics: A term in computer science that
    refers to "guesses" made by a program to obtain
    approximately accurate results. Typically, these
    are used to increase the speed of a program
    greatly at the cost of potentially yielding
    suboptimal results. BLAST and FASTA use
    heuristics based on knowledge of how sequences
    evolve.
  • Greedy algorithm: The idea behind a greedy
    algorithm is to perform a single procedure in the
    recipe over and over again until it can't be done
    any more and see what kind of results it will
    produce.