Computational Biology - PowerPoint PPT Presentation

Provided by: UWPar4

Transcript and Presenter's Notes

Title: Computational Biology

1
Computational Biology

The advent of genomic sequencing has led to a new
era in biological studies, one that draws on
information technology to expedite the analysis
of biological data.

2
BIOS 480 Goals

Provide a comprehensive understanding of current
methods in biological sequence analysis
Assess challenges and approaches in new
bioinformatics-related disciplines
Support in-depth, hands-on experience in design
and implementation of bioinformatics tools (BIOS
482)

3
Grades

The primary assessment tool in this course will
be participation as exhibited by in-class
discussion (40%) and performance on assigned
homework (30%).
A final exam testing your comprehensive knowledge
of the material will count for 20% of your final
grade.
www.uwp.edu/barber/bioinformatics/BIOS480.htm
has lectures and important materials for this
class and BIOS 482

4
Introduction to

Sequence alignments
Gene prediction
Motifs
Phylogeny
Genomics
Proteomics
Metabolomics

5
The advent of genome sequencing brought
bioinformatics into its own

Yeah, but now that the human genome is done,
isn't genomics done?
No.

6
Capillary and slab gel electrophoresis use a
modified Sanger technology with
fluorescent dyes
Typical reads are 500-750 nt on a timescale of
about an hour, varying with the sequencer.
7
Microfabricated Capillary Arrays

Etch a glass chip with T-shaped channels that are
7 cm long and µm in depth and width; a 96-well
chip built this way would be capable of 150,000
bases/h
Miniaturization is one booming field driving
bioinformatics

8
Free Solution Electrophoresis

Possibly will improve separation time (no matrix)
without losing read length
Label DNA molecules with a friction-increasing
molecule such as streptavidin
Currently can read 100 bp; a long way to go

9
Who needs electrophoresis?

Pyrosequencing
MALDI-TOF Mass Spectrometry
Sequencing by Hybridization
Massively Parallel Signature Sequencing
A testimony to innovative molecular biology
Single molecule methods

10
Pyrosequencing

Real-time sequencing that measures the release of
pyrophosphate (PPi) during DNA synthesis
Has been of particular use for SNP analysis

11
Put the sequencing reactions through a mass
spectrometer
Spectra of the C- and G- terminated
oligonucleotides
Current limit: 100 bp. Facilitated by sensitivity
and high-throughput loading
12
Potential innovations in DNA sequencing

Sequencing by hybridization
Cot-based analysis
http://www.msstate.edu/research/mgel/cbcs/cbcs.htm
Chip-based analysis
http://www.hyseq.com/content/131.php
http://citeseer.nj.nec.com/context/471959/0
Linear Read
http://www.usgenomics.com/about/index.shtml

13
Cot analysis
14
(No Transcript)
15
Growth in genomic technology

U.S. Genomics's technology platform, the
GeneEngine, has two components: (1)
nanotechnology systems for positioning DNA so
that it can be read linearly (broadly termed
DNA Delivery Mechanism(s)) and (2) detection
technologies that allow the reading of
information from the DNA Delivery Mechanism(s).
(FRET-based??)

16
The future looks bright, but what about right now?
17
Overview of Genomic Sequencing
Original DNA

Break DNA into random fragments (8-10X Coverage)
Amplify fragments in a vector and sequence
500-700 bases
in from each end

Base calling performed by Phred software
http://www.phrap.org/
http://www.genome.org/cgi/reprint/8/3/175.pdf
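
The coverage figure above can be turned into rough arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the estimate of uncovered bases assumes reads land uniformly at random (a Poisson model), which the slides do not state.

```python
import math


def reads_needed(genome_length, read_length, coverage):
    """Reads required so that total bases read = coverage * genome_length."""
    return math.ceil(coverage * genome_length / read_length)


def fraction_uncovered(coverage):
    """Expected fraction of bases hit by no read.

    Assumption: read start positions follow a Poisson process,
    so the chance a given base is missed is exp(-coverage).
    """
    return math.exp(-coverage)


if __name__ == "__main__":
    # A hypothetical 1 Mb target, 500-base reads, 8X coverage.
    print(reads_needed(1_000_000, 500, 8))
    print(fraction_uncovered(8))
```

Under this model, moving from 8X to 10X coverage lowers the expected uncovered fraction further, which is one way to read the 8-10X range on the slide.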
18
Overview of Genomic Sequencing
Original DNA

Break DNA into random fragments (8-10X Coverage)
Amplify fragments in a vector and sequence
500-700 bases
in from each end
Assemble fragments of sequence that have been
read

Contig 1
Contig 2
19
Phred Software

Calls bases in four phases
Predicting peaks (ideal locations)
Locating observed peaks
Matching observed to predicted
Finding missing peaks
http://www.genome.org/cgi/reprint/8/3/186.pdf
http://www.genome.org/cgi/reprint/8/3/175.pdf

20
Errors in Sequencing Reads

Each base call is assigned a quality score
q = -10 × log10(p), where p is the estimated
error probability of the base call. Higher
quality scores correspond to lower error
probabilities
Errors are associated with the vicinity of a
peak; the following parameters are used to
determine error probabilities on a TRAINING SET:
Peak spacing
Uncalled/called ratio (two window sizes)
Peak resolution
Results in a look-up table inherent to Phred
software
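
The quality score relation on this slide can be written as a pair of conversions. This is a minimal sketch of the formula only, with hypothetical function names; it is not Phred's code, which assigns scores from its trained look-up table.

```python
import math


def error_prob_to_quality(p):
    """Quality score q = -10 * log10(p) for error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def quality_to_error_prob(q):
    """Inverse relation: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # One expected error in 1000 calls corresponds to a quality of about 30.
    print(error_prob_to_quality(0.001))
    print(quality_to_error_prob(20))
```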

21
Common Sources of Sequencing Errors

The first fifty or so peaks of a trace are noisy
and unevenly spaced due to anomalous migration of
short DNA fragments, and unreacted dye-primer and
dye-terminator molecules.
Near the end of the trace, peaks become less
evenly spaced due to less accurate trace
processing, and less well resolved as diffusion
effects increase and the number of labeled
molecules decreases.
Compressions: most common in GC-rich regions,
when bases near the end of a single-stranded
fragment bind to a complementary region, forming a
hairpin (which migrates more rapidly than expected)
Dye-terminator sequencing helps resolve
compressions, but has its own problems: about 85%
of high-quality dye-terminator errors resulted
from a missing G peak following an A, or a
missing A following a T (Ewing and Green, 1998).

22
Assembly of large DNA sequences

Several assembly programs exist and can be run
with different degrees of success: Phrap, TIGR
Assembler, CAP, STROLL, etc.

23
Overlap-layout-consensus

Most fragment assembly algorithms include the
following three steps:
Overlap. Finding potentially overlapping
fragments.
Layout. Finding the order of fragments.
Consensus. Deriving the DNA sequence from the
layout.

24
Overlap

The overlap problem is to find the best match
between the suffix of one sequence and the prefix
of another.
If there are no sequencing errors, simply find the
longest suffix of one string that exactly matches
the prefix of another string.
Since error rates are small, the common practice
is to use a filtration method: filter out pairs of
fragments that do not share a significantly long
common substring.
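
Both ideas on this slide can be sketched in a few lines. The function names and the `k` and `min_length` parameters are assumptions made for illustration; real assemblers use faster indexing and error-tolerant alignment rather than this brute-force scan.

```python
def longest_exact_overlap(a, b, min_length=1):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 if no overlap of at least min_length exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def share_kmer(a, b, k):
    """Filtration test: do a and b share an exact substring of length k?"""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers_a for i in range(len(b) - k + 1))


if __name__ == "__main__":
    print(longest_exact_overlap("ACGTTGCA", "TGCAAGT"))
    print(share_kmer("ACGTTGCA", "TGCAAGT", 4))
```

Pairs that fail `share_kmer` can be skipped before any more expensive overlap scoring is attempted.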

25
Layout

Many algorithms select a pair of fragments with
the best overlap at every step.
The overlap score is either the similarity
score or a more involved probabilistic score.
The selected pair of fragments with the best
overlap score is checked for consistency.
If the check passes, the two fragments are
merged.
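
The greedy merging described above might look like the following sketch. It assumes error-free reads and scores an overlap by its exact length; the consistency check mentioned on the slide is omitted, and all names are hypothetical.

```python
def overlap_length(a, b, min_length):
    """Longest suffix of a equal to a prefix of b, or 0 if below min_length."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(fragments, min_length=3):
    """Repeatedly merge the pair with the best overlap until none remains."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(a, b, min_length)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            return contigs  # no pair overlaps enough; stop
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ATTAGACC", "GACCTGCC", "TGCCGGAA"]))
```

Because each step commits to the locally best pair, a repeat in the genome can cause a wrong merge, which is why a consistency check and additional mapping data matter in practice.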

26
Layout

At later stages of the algorithm, collections
of fragments (contigs) rather than individual
fragments are merged.
The difficulty with the layout step is deciding
whether two fragments with a good overlap really
overlap (i.e. their differences are caused by
sequencing errors) or represent a repeat in a
genome (i.e. their differences are caused by
mutations).
Use additional scaffolding measures: physical
mapping, optical mapping
http://schwartzlab.biotech.wisc.edu/omm/omm.html

27
Consensus

The simplest way to build the consensus is to
report the most frequent character in the
substring layout that is (implicitly) constructed
after the layout step is completed.
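
A column-wise majority vote is one way to sketch this. The layout representation here is an assumption: each fragment is given as a string of the full layout width, padded with "-" where it covers nothing. Ties go to the base seen first in the column.

```python
from collections import Counter


def consensus(layout, gap="-"):
    """Most frequent non-gap character in each column of the layout."""
    result = []
    for column in zip(*layout):
        counts = Counter(base for base in column if base != gap)
        if counts:  # skip columns that no fragment covers
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    layout = [
        "ATTAGACC--------",
        "----GACCTGCC----",
        "--------TGCCGGAA",
    ]
    print(consensus(layout))
```

With enough coverage, an isolated miscall in one read is outvoted by the other reads covering the same column.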

28
The Human Touch

Consed: A Graphical Tool for Editing Phrap
Assemblies.

29
Some definitions

Heuristics: A term in computer science that
refers to "guesses" made by a program to obtain
approximately accurate results. Typically, these
are used to increase the speed of a program
greatly at the cost of potentially yielding
suboptimal results. BLAST and FASTA use
heuristics based on knowledge of how sequences
evolve.
Greedy algorithm: The idea behind a greedy
algorithm is to perform a single procedure in the
recipe over and over again until it can't be done
any more and see what kind of results it will
produce.