Title: Predictive methods using DNA sequences Unit 11
1Predictive methods using DNA sequences Unit 11
- BIOL221T Advanced Bioinformatics for Biotechnology
- Irene Gabashvili, PhD
2Reminders from last week
- Polymorphism and mutations
- Mapping and Sequencing
- Genomic Map Elements
- Types of Maps
- Resources
- Practical Use
3Polymorphism - Types of variation
- SNP [snp_class]: true single nucleotide polymorphism
- in-del: insertion/deletion polymorphism ('-/)
- Microsatellite/simple sequence repeat
- FUNC [Function_Class]
- "coding nonsynonymous"
- locus region, intron, exception
- mrna utr, splice site
- coding synonymous
4Nonsynonymous Mutations
- Missense: a type of nonsynonymous mutation (a different amino acid in the product of the mutated gene)
- EXAMPLE, sickle-cell disease: the replacement of A by T at the 17th nucleotide of the gene for the beta chain of hemoglobin changes the codon GAG (for glutamic acid) to GTG (which encodes valine). Thus the 6th amino acid in the chain becomes valine instead of glutamic acid.
5Nonsynonymous Mutations
- An example of a nonsense mutation (a codon for an amino acid becomes a STOP codon): in one patient with cystic fibrosis (Patient B), the substitution of a T for a C at nucleotide 1609 converted a glutamine codon (CAG) to a STOP codon (TAG). The protein produced by this patient had only the first 493 amino acids of the normal chain of 1480 and could not function.
6Nonsynonymous Mutations
- The new nucleotide changes a codon that specified an amino acid to one of the STOP codons (TAA, TAG, or TGA). Therefore, translation of the messenger RNA transcribed from this mutant gene will stop prematurely. The earlier in the gene this occurs, the more truncated the protein product and the more likely that it will be unable to function. Mutations of this type are called nonsense mutations.
7Insertions and Deletions (Indels)
- Example query: ADRB1[gene] AND human[orgn] AND "in-del"[snp_class]
- Base pairs may be added (insertions) or removed (deletions) from the DNA of a gene. The number can range from one to thousands.
- As a result, translation of the gene can be "frameshifted". Indels of three nucleotides or multiples of three may be less serious.
- Huntington's disease and the fragile X syndrome are examples of trinucleotide repeat diseases caused by insertion
8Silent and splice-site mutations
- For example, if the third base in the TCT codon for serine is changed to any one of the other three bases, serine will still be encoded. Such mutations are said to be silent because they cause no change in the protein (synonymous).
- Nucleotide signals at the splice sites guide the enzymatic machinery. If a mutation alters one of these signals, then the intron is not removed and remains as part of the final RNA molecule. This alters the sequence of the protein product.
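The three kinds of codon change discussed on the preceding slides (missense, nonsense, silent) can be told apart by translating the codon before and after the substitution. The following is a minimal sketch that assumes the standard genetic code; the function name `classify_substitution` is made up for illustration, and the three test codons are the ones quoted on the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' is STOP.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a change within one codon as silent, missense, or nonsense.

    Illustrative helper: a change that removes a STOP codon is reported
    as 'missense' here, which is a simplification.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


# Codon pairs taken from the sickle-cell, cystic fibrosis, and serine examples.
for ref, alt in [("GAG", "GTG"), ("CAG", "TAG"), ("TCT", "TCC")]:
    print(ref, "->", alt, classify_substitution(ref, alt))
```

Only the coding effect is classified; splice-site mutations and frameshifting indels need the surrounding sequence and are outside this sketch.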
9Types of Maps (see MapViewer)
- Cytogenetic
- Genetic Linkage
- Physical
- Radiation Hybrid
- Sequence-based
10Genomic Map Elements
- DNA markers, PCR-based
- STS
- Polymorphic markers
- RFLPs, VNTRs, SNPs
- DNA clones
- BACs and PACs
11Databases and Servers
- BLAT
- MapViewer
- GeneCards
- GeneLoc
- Stanford Source
- Bioinformatics Harvester
12Predictive methods using DNA sequences, BO chapter 5
- Gene Prediction methods
- Gene Prediction Programs
- How good are the methods?
- Promoter Analysis
- Strategies and Considerations
- Markov models
- HMMs in Gene Prediction
- Discriminant Analysis in Gene Prediction
13Sequence Signals Gene Structure
14Sequence Signals Gene Structure
- UCSC Genome Browser
- Ensembl
- NCBI's Gene Viewer
15What is Computational Gene Finding?
- Given an uncharacterized DNA sequence, find out:
- Which region codes for a protein?
- Which DNA strand is used to encode the gene?
- Which reading frame is used in that strand?
- Where does the gene start and end?
- Where are the exon-intron boundaries in eukaryotes?
- (optionally) Where are the regulatory sequences for that gene?
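For an intron-free sequence, the strand, reading-frame, and start/end questions above can be approached by scanning all six reading frames for open reading frames. The sketch below is deliberately naive: it assumes ATG as the only start codon, ignores introns and nested starts, and uses invented names (`find_orfs`, `min_codons`); real gene finders add content and signal statistics on top of this.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Find ATG...STOP open reading frames in all six frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned. The length
    threshold min_codons counts the start and stop codons as well.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))
```

In the toy sequence the only ORF is ATG AAA TTT TAG on the forward strand in frame 2.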
16Gene Prediction Methods
- Searching by Signal
- Searching by Content
- Homology-based Gene Prediction
- Comparative Gene Prediction
- Ab initio, intrinsic, template (1st and 2nd)
vs extrinsic, look-up (3rd and 4th)
17Eukaryotes vs Prokaryotes
- Eukaryotes: genes separated by intergenic DNA, coding exons separated by large introns
- Prokaryotes: ORFs adjacent to one another
18Prokaryotic vs. Eukaryotic Gene Finding
- Prokaryotes
- small genomes: 0.5 to 10 x 10^6 bp
- high coding density (>90%)
- no introns
- Gene identification relatively easy, with success rate ~99%
- Problems
- overlapping ORFs
- short genes
- finding TSS and promoters
- Eukaryotes
- large genomes: 10^7 to 10^10 bp
- low coding density (<50%)
- intron/exon structure
- Gene identification a complex problem, gene level accuracy ~50%
- Problems
- many
19Gene Structure
20Gene Finding Different Approaches
- Similarity-based methods (extrinsic): use similarity to annotated sequences
- proteins
- cDNAs
- ESTs
- Comparative genomics: aligning genomic sequences from different species
- Ab initio gene-finding (intrinsic)
- Integrated approaches
21Similarity-based methods
- Based on sequence conservation due to functional constraints
- Use local alignment tools (Smith-Waterman algorithm, BLAST, FASTA) to search protein, cDNA, and EST databases
- Will not identify genes that code for proteins not already in databases (can identify ~50% of new genes)
- Limits of the regions of similarity not well defined
22Comparative Genomics
- Based on the assumption that coding sequences are more conserved than non-coding
- Two approaches
- intra-genomic (gene families)
- inter-genomic (cross-species)
- Alignment of homologous regions
- Difficult to define limits of higher similarity
- Difficult to find optimal evolutionary distance (pattern of conservation differs between loci)
24Summary for Extrinsic Approaches
- Strengths
- Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions
- Weaknesses
- Limited to pre-existing biological data
- Errors in databases
- Difficult to find limits of similarity
25Signal Sensors
- Signal: a string of DNA recognized by the cellular machinery
26Signal Sensors
- Various pattern recognition methods are used for identification of these signals
- consensus sequences
- weight matrices
- weight arrays
- decision trees
- Hidden Markov Models (HMMs)
- neural networks
27Example of Consensus Sequence
- Obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest
- Aligned sites: TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT
- consensus sequence: TATAAT
- consensus (IUPAC): TATRNT
- Leads to loss of information and can produce many false positive or false negative predictions
- Word analogy: MELON, MANGO, HONEY, SWEET, COOKY give the column-wise consensus MONEY, which is none of the original words
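The majority-vote consensus on this slide can be reproduced in a few lines. This is a minimal sketch, not any particular program's method; the function name `consensus` and the tie-breaking rule (alphabetically first symbol wins) are assumptions made here so that the result is deterministic.

```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent symbol in each column of equal-length strings.

    Ties are broken in favor of the alphabetically first symbol
    (an arbitrary choice made for this illustration).
    """
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # sorted() fixes the iteration order, so max() picks the first
        # symbol among those with the highest count.
        result.append(max(sorted(counts), key=lambda sym: counts[sym]))
    return "".join(result)


sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
print(consensus(sites))                                           # TATAAT
print(consensus(["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]))  # MONEY
```

Position 4 of the DNA sites is a 3-to-3 tie between A and G, which is exactly the information a plain consensus throws away.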
28Example of (Positional) Weight Matrix
- Computed by measuring the frequency of every element at every position of the site (weight)
- Score for any putative site is the sum of the matrix values (converted to probabilities) for that sequence (log-likelihood score)
- Disadvantages
- cut-off value required
- assumes independence between adjacent bases
- Example sites: TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT
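A positional weight matrix can be built from the six example sites and used to score a candidate site as a sum of per-position log-odds values. The sketch below makes several assumptions that are not on the slide: a pseudocount of 1 per base, a uniform background of 0.25, log base 2, and the helper names `build_pwm` and `score_site`.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping base -> log2 odds vs. background."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for column in zip(*sites):
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_site(pwm, site):
    """Sum the per-position log-odds values; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pwm, site))


pwm = build_pwm(SITES)
for candidate in ("TATAAT", "GATACT", "CCCCCC"):
    print(candidate, round(score_site(pwm, candidate), 2))
```

Deciding which scores count as a hit still requires a cut-off, and the sum over positions is where the independence assumption listed under Disadvantages enters.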
29Example of Decision Tree
30Markov Models
31Ingredients of a Markov Model
- Collection of states
- $S_1, S_2, \ldots, S_N$
- State transition probabilities (transition matrix)
- $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
- Initial state distribution
- $\pi_i = P(q_1 = S_i)$
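As a concrete instance of these ingredients, a first-order Markov chain over the four nucleotides can be estimated from training sequences and then used to score a new sequence. This is a sketch under stated assumptions: the states are the bases A, C, G, T, a pseudocount of 1 avoids zero probabilities, and the names `train_markov_chain` and `log_probability` are invented here. In the code, `transition[prev][nxt]` holds the probability of moving from state `prev` to state `nxt`.

```python
import math


def train_markov_chain(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate the initial distribution and transition probabilities."""
    starts = {a: pseudocount for a in alphabet}
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in sequences:
        starts[seq[0]] += 1
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    total_starts = sum(starts.values())
    initial = {a: starts[a] / total_starts for a in alphabet}
    transition = {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }
    return initial, transition


def log_probability(seq, initial, transition):
    """Natural-log probability of seq under the chain."""
    logp = math.log(initial[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(transition[prev][nxt])
    return logp


# Toy training data: a CG-alternating sequence.
initial, transition = train_markov_chain(["CGCGCGCG"])
print(log_probability("CGCG", initial, transition))
print(log_probability("AAAA", initial, transition))
```

Content sensors in gene finders typically compare such scores under a "coding" chain and a "non-coding" chain; the single chain here only shows the mechanics.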
32Hidden Markov Models
33Ingredients of a HMM
- Collection of states $S_1, S_2, \ldots, S_N$
- State transition probabilities (transition matrix)
- $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
- Initial state distribution
- $\pi_i = P(q_1 = S_i)$
- Observations $O_1, O_2, \ldots, O_M$
- Observation probabilities
- $B_j(k) = P(v_t = O_k \mid q_t = S_j)$
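Given these ingredients, the most probable hidden state path for an observed sequence can be found with the Viterbi algorithm. The sketch below uses a toy two-state model ("coding" emits mostly G/C, "noncoding" mostly A/T); every probability in it is a made-up illustration, not a parameter of any real gene finder. In the code, `trans[prev][nxt]` is the probability of moving from `prev` to `nxt`.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for obs (log-space Viterbi)."""
    scores = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            best_prev = max(states, key=lambda p: scores[-1][p] + math.log(trans[p][s]))
            current[s] = (scores[-1][best_prev] + math.log(trans[best_prev][s])
                          + math.log(emit[s][symbol]))
            pointers[s] = best_prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy parameters.
STATES = ["coding", "noncoding"]
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(viterbi("GGGGGGAAAAAA", STATES, START, TRANS, EMIT))
```

With these toy numbers the decoded path switches from "coding" to "noncoding" exactly where the G run ends.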
34Examples of Gene Finders
- FGENES: linear discriminant functions (DF) for content and signal sensors, and dynamic programming (DP) for finding the optimal combination of exons
- GeneMark: HMMs enhanced with ribosomal binding site recognition
- Genie: neural networks for splicing, HMMs for coding sensors, overall structure modeled by HMM
- Genscan: weight matrices (WM), weight arrays (WA) and decision trees as signal sensors, HMMs for content sensors, overall HMM
- HMMgene: HMM trained using conditional maximum likelihood
- Morgan: decision trees for exon classification, also Markov models
- MZEF: quadratic DF, predicts only internal exons
35Genscan Example
- Developed by Chris Burge (1997)
- One of the most accurate ab initio programs
- Uses explicit state duration HMM to model gene structure (different length distributions for exons)
- Different model parameters for regions with different GC content
36Ab initio Gene Finding is Difficult
- Genes are separated by large intergenic regions
- Genes are not continuous, but split into a number of (small) coding exons, separated by (larger) non-coding introns
- in humans, coding sequences comprise only a few percent of the genome and an average of 5% of each gene
- Sequence signals that are essential for elucidation of a gene structure are degenerate and highly unspecific
- Alternative splicing
- Repeat elements (>50% in humans); some contain coding regions
37Problems with Ab initio Gene Finding
- No biological evidence
- In long genomic sequences many false positive predictions
- Prediction accuracy high, but not sufficient
38Evaluation of Gene Finding Programs
- Calculating accuracy of programs' predictions
- Many evaluation studies; among the earliest:
- Burset and Guigó, 1996 (vertebrate sequences)
- Pavy et al., 1999 (Arabidopsis thaliana)
- Rogic et al., 2001 (mammalian sequences)
39Measures of Prediction Accuracy, Part 1
- Nucleotide level accuracy
- Exon level accuracy
- Sensitivity = number of correct exons / number of actual exons
- Specificity = number of correct exons / number of predicted exons
40Measures of Prediction Accuracy, Part 2
(Figure: REALITY and PREDICTION tracks compared, with exons labeled CORRECT EXON, WRONG EXON, and MISSING EXON)
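Exon-level sensitivity and specificity, together with the missing and wrong exon categories, can be computed from two lists of exon coordinates. The sketch assumes the convention commonly used in evaluation studies: a correct exon matches both boundaries exactly, a missing exon is an actual exon overlapped by no prediction, and a wrong exon is a predicted exon overlapping no actual exon. Coordinates and function names are hypothetical, and both lists are assumed non-empty.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_accuracy(actual, predicted):
    """Compare actual and predicted exons given as (start, end) tuples."""
    actual, predicted = set(actual), set(predicted)
    correct = actual & predicted            # both boundaries exactly right
    missing = sorted(a for a in actual
                     if not any(overlaps(a, p) for p in predicted))
    wrong = sorted(p for p in predicted
                   if not any(overlaps(p, a) for a in actual))
    return {
        "sensitivity": len(correct) / len(actual),
        "specificity": len(correct) / len(predicted),
        "missing_exons": missing,
        "wrong_exons": wrong,
    }


# Hypothetical coordinates for illustration.
reality = [(100, 200), (300, 400), (500, 600)]
prediction = [(100, 200), (310, 400), (700, 800)]
print(exon_level_accuracy(reality, prediction))
```

The second predicted exon overlaps a real one but has a wrong boundary, so it counts as neither correct, missing, nor wrong.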
41Integrated Approaches for Gene Finding
- Programs that integrate results of similarity searches with ab initio techniques (GenomeScan, FGENESH, Procrustes)
- Programs that use synteny between organisms (ROSETTA, SLAM)
- Integration of programs predicting different elements of a gene (EuGène)
- Combining predictions from several gene finding programs (combination of experts)
42Combining Programs Predictions
- The set of methods used and the way they are integrated differ between individual programs
- Different programs often predict different elements of an actual gene
- they could complement each other, yielding better prediction
43Gene Prediction Links
- http://genome.imim.es/geneid.html
- http://genes.mit.edu/GENSCAN.html
- FGENES, commercial, but can try
- http://www.softberry.com/berry.phtml?topic=fgenes&group=programs&subgroup=gfind
44GeneID
- Hierarchical approach
- Splice sites and stop codons predicted and scored using position-specific weight matrices
- Exons built from the identified defining sites. Scored as the sum of the scores of the defining sites plus the score of their coding potential
- Gene structure assembled by maximizing the total score
- Later versions of the program add sequence similarity searches
45Genscan, FGENES, GeneWise
- Genscan: built on an underlying Hidden Markov Model
- FGENES: linear discriminant analysis to identify splice sites, exons, promoter elements
- GeneWise: compares a genomic sequence with a protein sequence or with HMMs representing protein sequences
46How good are the methods?
- Different methods - different results. How to measure accuracy?
- Sensitivity: proportion of coding nucleotides, exons, genes predicted correctly (true positives)
- Specificity: proportion of predicted elements, genes that are real
- Correlation coefficient: combines both
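At the nucleotide level, the three measures can be computed from per-base labels. This sketch assumes each position is labeled 'C' (coding) or 'N' (non-coding), that specificity is TP / (TP + FP) as defined on this slide (in clinical testing the same word usually means TN / (TN + FP)), and that the correlation coefficient takes the usual form built from the four counts. The example strings are invented and contain all four outcome types, so no denominator is zero.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Sensitivity, specificity, and correlation coefficient per nucleotide.

    actual, predicted: equal-length strings of 'C' (coding) and 'N' (non-coding).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)
    tn = sum(a == "N" and p == "N" for a, p in pairs)
    fp = sum(a == "N" and p == "C" for a, p in pairs)
    fn = sum(a == "C" and p == "N" for a, p in pairs)
    sensitivity = tp / (tp + fn)      # coding bases that were predicted
    specificity = tp / (tp + fp)      # predicted bases that are really coding
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return sensitivity, specificity, cc


# Hypothetical 10-base annotation and prediction, shifted by one base.
sn, sp, cc = nucleotide_accuracy("NNCCCCNNNN", "NCCCCNNNNN")
print(sn, sp, round(cc, 3))
```

A prediction that marks everything as coding reaches sensitivity 1.0 with poor specificity, which is why a single combined figure such as the correlation coefficient is reported as well.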
47Screening Test for Occult Cancer
- SENSITIVITY: of 100 patients with occult cancer, 95 have "x" in their blood
- SPECIFICITY: of 100 patients without occult cancer, 95 do not have "x" in their blood
- PREVALENCE: 5 out of every 1000 randomly selected individuals will have occult cancer
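48 2 x 2 Table
- Population: 100,000
- With occult cancer: 500 (475 test positive, 25 test negative)
- Without occult cancer: 99,500 (4,975 test positive, 94,525 test negative)
- If a patient has "x" in his blood, the chance of occult cancer is 475 / 5,450 = 8.7%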
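The 2 x 2 table follows directly from the sensitivity, specificity, and prevalence given on the previous slide. The sketch below recomputes it for a population of 100,000; the function name `two_by_two` is invented for illustration.

```python
def two_by_two(population, prevalence, sensitivity, specificity):
    """Expected counts of true/false positives and negatives."""
    diseased = population * prevalence
    healthy = population - diseased
    tp = diseased * sensitivity       # diseased and test positive
    fn = diseased - tp                # diseased but test negative
    tn = healthy * specificity        # healthy and test negative
    fp = healthy - tn                 # healthy but test positive
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


table = two_by_two(100_000, prevalence=0.005, sensitivity=0.95, specificity=0.95)
positive_predictive_value = table["TP"] / (table["TP"] + table["FP"])
print(table)
print(f"Chance of cancer given a positive test: {positive_predictive_value:.1%}")
```

With these inputs there are about 475 true positives against about 4,975 false positives, so fewer than one in ten positive results reflects actual disease, even though the test is "95% accurate" in both directions.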
49Standard Terminology
True Positives (TPs)
False Negatives (FNs)
False Positives (FPs)
True Negatives (TNs)
Entire Population
50Definitions
51What is a Positive Test?
- All the analysis has assumed that it is clear whether a test is positive or negative
- In reality, many tests involve continuous values so that one result may be more positive than another
- How should one define the cut-off at which a test is judged to be abnormal?
52Continuously Valued Variables
(Figure; x-axis: Result)
53Continuously Valued Variables
- Fewer false positives (more conservative)
- More false negatives
- Higher specificity
- Lower sensitivity
(Figure; x-axis: Result, with the "Normal cutoff" marked)
54Continuously Valued Variables
(Figure; x-axis: Result)
- Fewer false negatives (more aggressive)
- More false positives
- Higher sensitivity
- Lower specificity
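The trade-off on the last three slides can be seen by moving the cut-off across two sets of test values. The values below are invented for illustration, a result is called positive when it is at or above the cut-off, and specificity is used here in the screening-test sense, TN / (TN + FP).

```python
def rates_at_cutoff(diseased_values, healthy_values, cutoff):
    """Return (sensitivity, specificity) when value >= cutoff means 'positive'."""
    true_positives = sum(v >= cutoff for v in diseased_values)
    true_negatives = sum(v < cutoff for v in healthy_values)
    return (true_positives / len(diseased_values),
            true_negatives / len(healthy_values))


# Hypothetical measurements; the two groups overlap between 4 and 5.5.
diseased = [4.1, 5.0, 5.6, 6.2, 7.3]
healthy = [2.0, 2.8, 3.5, 4.4, 5.2]

for cutoff in (3.0, 4.5, 6.0):
    sensitivity, specificity = rates_at_cutoff(diseased, healthy, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sensitivity:.1f}, specificity {specificity:.1f}")
```

Raising the cut-off removes false positives at the cost of more false negatives, and lowering it does the opposite; the same applies to the score threshold of a weight matrix or a gene finder.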
55More on Projects
- Vaccine development (Ramya)
- http://immunax.dfci.harvard.edu/PEPVAC/
- HIV and the Black Plague (Harshal)
- CCR5 - chemokine (C-C motif) receptor 5
- HIV drug resistance: http://hivdb.stanford.edu/
- Gene Annotation (Chris)
- Pharmacogenomics (Jennifer)
56More on Projects
- Disease network (Jyoti)
- Disease networks, gout (Annie)
- Genotyping (Nancy)
- Physiological Genomics (Erin)
- Perl program for protein structure analysis? (Harmeet)
57More on Projects
- Human Genetic Variation (Priyanka)
- Cloning, humans (Parag)
- Evolution (Sukhpreet)
- Metabolic engineering (Danh)
- Protein structure (Tanzeema)