Title: BCB 444/544
1 BCB 444/544
- Lecture 26
- Gene Prediction
- 26_Oct22
2 Required Reading (before lecture)
- Mon Oct 22 - Lecture 26
- Gene Prediction
- Chp 8 - pp 97 - 112
- Wed Oct 24 - Lecture 27 (will not be covered on Exam 2)
- Regulatory Element Prediction
- Chp 9 - pp 113 - 126
- Thurs Oct 25 - Review Session & Project Planning
- Fri Oct 26 - EXAM 2
3 Assignments & Announcements
- Sun Oct 21 - Study Guide for Exam 2 was posted
- Mon Oct 22 - HW4 Due
- (no "correct" answer to post)
- Thu Oct 25 - Lab: Optional Review Session for Exam
- 544 Project Planning/Consult with DD & MT
- Fri Oct 26 - Exam 2 - Will cover
- Lectures 13-26 (thru Mon Sept 17)
- Labs 5-8
- HW 3 & 4
- All assigned reading
- Chps 6 (beginning with HMMs), 7-8, 12-16
- Eddy: What is an HMM?
- Ginalski: Practical Lessons
4 BCB 544 "Team" Projects
- 544 Extra HW2 is next step in Team Projects
- Write 1 page outline
- Schedule meeting with Michael & Drena to discuss topic
- Read a few papers
- Write a more detailed plan
- You may work alone if you prefer
- Last week of classes will be devoted to Projects
- Written reports due Mon Dec 3 (no class that day)
- Oral presentations (15-20') will be Wed-Fri Dec 5,6,7
- 1 or 2 teams will present during each class period
- See Guidelines for Projects posted online
5 BCB 544 Only: New Homework Assignment
- 544 Extra2 (posted online Thurs?)
- No - sorry! sent by email on Sat
- Due: PART 1 - ASAP
- PART 2 - Fri Nov 2 by 5 PM
- Part 1 - Brief outline of Project, email to Drena & Michael
- after response/approval, then
- Part 2 - More detailed outline of project
- Read a few papers and summarize status of problem
- Schedule meeting with Drena & Michael to discuss ideas
6 Seminars this Week
- BCB List of URLs for Seminars related to Bioinformatics
- http://www.bcb.iastate.edu/seminars/index.html
- Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
- Dave Segal, UC Davis: Zinc Finger Protein Design
- Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
- Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations
7 Chp 16 - RNA Structure Prediction
- SECTION V: STRUCTURAL BIOINFORMATICS
- Xiong Chp 16: RNA Structure Prediction (Terribilini)
- RNA Function
- Types of RNA Structures
- RNA Secondary Structure Prediction Methods
- Ab Initio Approach
- Comparative Approach
- Performance Evaluation
8 Covalent & non-covalent bonds in RNA
This is a new slide
- Primary
- Covalent bonds
- Secondary/Tertiary
- Non-covalent bonds
- H-bonds
- (base-pairing)
- Base stacking
Fig 6.2 Baxevanis & Ouellette 2005
9 RNA Pseudoknots & Tetraloops
This is a new slide
- Often have important regulatory or catalytic functions
Pseudoknot
Tetraloop
http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif
http://www.lbl.gov/Science-Articles/Research-Review/Annual-Reports/1995/images/rna.gif
10 Base Pairing in RNA
This slide has been changed
- G-C, A-U, G-U ("wobble") & many variants
See IMB Image Library of Biological Molecules
http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs
11 RNA Secondary Structure Prediction Methods
This slide has been changed
- Two (three, recently) main types of methods
- Ab initio - based on calculating most energetically favorable secondary structure(s)
- Energy minimization (thermodynamics)
- Comparative approach - based on comparisons of multiple evolutionarily-related RNA sequences
- Sequence comparison (co-variation)
- Combined computational & experimental
- Use experimental constraints when available
12 RNA Secondary structure prediction - 3
This is a new slide
3) Combined experimental & computational
- Experiments
- Map single-stranded vs double-stranded regions in folded RNA
- How?
- Enzymes: S1 nuclease, T1 RNase
- Chemicals: kethoxal, DMS, hydroxyl radical (OH•)
- Software
- Mfold
- Sfold
- RNAStructure
- RNAFold
- RNAlifold
13 Ab Initio Prediction: Clarifications
This slide has been changed
- Free energy is calculated based on parameters determined in the wet lab
- Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking) (not base-pair)
- Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs - cooperative - because of base-stacking interactions
- Bulges and loops adjacent to base-pairs have a free energy penalty
14 Energy minimization: What are the rules?
This is a new slide
What gives here?
Why 1.2 vs 1.6?
C Staben 2005
15 Energy minimization calculations: Base-stacking is critical
This is a new slide
- Tinoco et al.
C Staben 2005
16 Ab Initio Energy Calculation
This slide has been changed
- Search for all possible base-pairing patterns
- Calculate total energy of each structure based on all stabilizing and destabilizing forces
- Total free energy for a specific RNA conformation = sum of incremental energy terms for:
- helical stacking (sequence dependent)
- loop initiation
- unpaired stacking
(favorable "increments" are < 0)
Fig 6.3 Baxevanis & Ouellette 2005
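The sum described on this slide can be written compactly. This is only a sketch of the additive model; the symbol names are introduced here for illustration and do not come from the slide or the figure:

```latex
\Delta G_{\text{total}}
  = \sum_{\text{stacks}} \Delta G_{\text{helical stacking}}
  + \sum_{\text{loops}} \Delta G_{\text{loop initiation}}
  + \sum_{\text{unpaired}} \Delta G_{\text{unpaired stacking}}
```

Favorable increments contribute negative terms, so the predicted structure is the one with the lowest total.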
17 Dynamic Programming
This slide has been changed
- Finding optimal secondary structure is difficult - lots of possibilities
- Compare RNA sequence with itself
- Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)
- Find path that represents most energetically favorable secondary structure
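To make the "compare the sequence with itself" idea concrete, here is a minimal sketch of a dynamic programming fold. It is a deliberately simplified stand-in: it maximizes the number of base pairs (a Nussinov-style recursion) rather than minimizing free energy with stacking parameters, as Mfold-type programs do. The function names and the `min_loop` parameter are illustrative assumptions.

```python
# Simplified DP: maximize base-pair count instead of minimizing free energy.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """True for Watson-Crick pairs and the G-U wobble pair."""
    return (a, b) in PAIRS


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed hairpin-loop constraint).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the best pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

An energy-based version keeps the same table-filling structure but replaces the "+1 per pair" score with stacking and loop energy terms.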
18 3 - Popular Programs that use Combined Computational & Experimental Approaches
- Mfold
- Sfold
- RNAStructure
- RNAFold
- RNAlifold
19 Comparison of Predictions for Single RNA using Different Methods
JH Lee 2007
20 Comparison of Mfold Predictions +/- Constraints
Mfold plus constraints: -54.84 kcal/mol
Mfold: -126.05 kcal/mol
JH Lee 2007
21 Performance Evaluation
This slide has been changed
- Ab initio methods? correlation coefficient 20-60%
- Comparative approaches? correlation coefficient 20-80%
- Programs that require user to supply MSA are more accurate
- Comparative programs are consistently more accurate than ab initio
- Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace
- BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)
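The slide does not say which correlation coefficient is meant. Assuming it is the Matthews correlation coefficient, which is commonly used to score predicted base pairs against a reference structure, a minimal sketch looks like this. The function name and the example counts are hypothetical.

```python
import math


def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient from a confusion matrix.

    tp: predicted pairs that are in the reference structure
    fp: predicted pairs that are not in the reference
    fn: reference pairs that were missed
    tn: candidate pairs correctly left unpaired
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only.
print(round(matthews_cc(tp=8, fp=2, fn=2, tn=88), 3))
```

The value ranges from -1 to 1, so a slide value such as 60% would correspond to 0.6 under this assumption.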
22 Chp 8 - Gene Prediction
- SECTION III: GENE AND PROMOTER PREDICTION
- Xiong Chp 8: Gene Prediction
- Categories of Gene Prediction Programs
- Gene Prediction in Prokaryotes
- Gene Prediction in Eukaryotes
23 What is a Gene?
- What is a gene? segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
- Genes can encode
- mRNA (for protein)
- other types of RNA (tRNA, rRNA, miRNA, etc.)
- Genes differ in eukaryotes vs prokaryotes (& archaea), both structure & regulation
24 Gene Finding
- Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
- ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
- ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
- Steps
- Search against protein / EST database
- Apply gene prediction programs (many programs available)
- Analyze regulatory regions
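As a toy illustration of "identify coding regions", the sketch below scans the forward strand of the example sequence for open reading frames (ATG to the first in-frame stop). This is an assumption-laden simplification: real gene finders also use the reverse strand, splice signals, and statistical models. The names `find_orfs` and `min_codons` are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each forward-strand ORF.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                break
    return orfs


example = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
for start, end, orf in find_orfs(example):
    print(start, end, orf)
```

On the slide's sequence this reports a single ORF, ATGGGGCAGGGTCAGATATAA, at positions 6 to 27; the second ATG has no in-frame stop before the sequence ends.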
25 Gene Prediction in Prokaryotes vs Eukaryotes
- Eukaryotes
- Large genomes: 10^7 - 10^10 bp
- Often less than 2% coding
- Complicated gene structure (splicing, long introns)
- Prediction success: 50-95%
- Prokaryotes
- Small genomes: 0.5 - 10 x 10^6 bp
- About 90% of genome is coding
- Simple gene structure
- Prediction success: 99%
26 DNA "Signals" Used by Gene Finding Algorithms
- Exploit the regular gene structure
- ATG - Exon1 - Intron1 - Exon2 - ... - ExonN - STOP
- Recognize coding bias
- CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-
- Recognize splice sites
- Intron - cAGt - Exon - gGTgag - Intron
- Model the duration of regions
- Introns tend to be much longer than exons, in mammals
- Exons are biased to have a given minimum length
- Use cross-species comparison
- Gene structure is conserved in mammals
- Exons are more similar (85%) than introns
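The splice-site signal above can be sketched as a naive scan for candidate introns that begin with GT (donor) and end with AG (acceptor). This is a rough illustration only: real programs score the surrounding consensus with weight matrices or Markov models instead of matching two dinucleotides. The function name and the `min_len` threshold are assumptions.

```python
def candidate_introns(dna, min_len=6):
    """List (start, end) spans that start with GT and end with AG.

    start is the 0-based index of the G in GT; end is exclusive.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len  # also guarantees the acceptor follows the donor
    ]


toy = "CCGTAAACCAGTT"
for start, end in candidate_introns(toy):
    print(start, end, toy[start:end])
```

Because GT and AG occur by chance very often, this scan over-predicts heavily, which is why gene finders combine it with the length and coding-bias signals listed on the slide.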
27 Computational Gene Finding Approaches
- Ab initio methods
- Search by signal: find DNA sequences involved in gene expression
- Search by content: test statistical properties distinguishing coding from non-coding DNA
- Similarity based methods
- Database search: exploit similarity to proteins, ESTs, and cDNAs
- Comparative genomics: exploit aligned genomes
- Do other organisms have similar sequence?
- Hybrid methods - best
28 Examples of Gene Prediction Software
- Ab initio
- Genscan, GeneMark.hmm, Genie, GeneID
- Similarity-based
- BLAST, Procrustes
- Hybrids
- GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM
- BEST? Ab initio - Genscan (according to some assessments)
- Hybrid - GeneSeqer
- But depends on organism & specific task
- Lists of Gene Prediction Software
- http://www.bioinformaticsonline.org/links/ch_09_t_1.html
- http://cmgm.stanford.edu/classes/genefind/
29 Synthesis & Processing of Eukaryotic mRNA
Gene in DNA
30 What are cDNAs & ESTs?
- cDNA libraries are important for determining gene structure & studying regulation of gene expression
- Isolate RNA (always from a specific organism, region, and time point)
- Convert RNA to complementary DNA (with reverse transcriptase)
- Clone into cDNA vector
- Sequence the cDNA inserts
- Short cDNAs are called ESTs or Expressed Sequence Tags
- ESTs are strong evidence for genes
- Full-length cDNAs can be difficult to obtain
31 UniGene: Unique genes via ESTs
- Find UniGene at NCBI
- www.ncbi.nlm.nih.gov/UniGene
- UniGene clusters contain many ESTs
- UniGene data come from many cDNA libraries.
- When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression
32 Gene Prediction
- Overview of steps & strategies
- What sequence signals can be used?
- What other types of information can be used?
- Algorithms
- HMMs, Bayesian models, neural nets
- Gene prediction software
- 3 major types
- many, many programs!
33 Overview of Gene Prediction Strategies
- What sequence signals can be used?
- Transcription: TF binding sites, promoter, initiation site, terminator, CpG islands, etc.
- Processing signals: Splice donor/acceptors, polyA signal
- Translation: Start (AUG = Met) & stop (UGA, UAA, UAG)
- ORFs, codon usage
- What other types of information can be used?
- Homology (sequence comparison, BLAST)
- cDNAs & ESTs (experimental data, pairwise alignment)
34 Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes. Why?
- Smaller genomes
- Simpler gene structures
- Many more sequenced genomes! (for comparative approaches)
Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available, e.g.,
- GeneMark.hmm
- TIGR Comprehensive Microbial Resource (CMR)
- NCBI Microbial Genomes
35 Predicting Genes - Basic steps
- Obtain genomic sequence
- BLAST it!
- Perform database similarity search (with EST & cDNA databases, if available)
- Translate in all 6 reading frames (i.e., "6-frame translation")
- Compare with protein sequence databases
- Use Gene Prediction software to locate genes
- Analyze regulatory sequences
- Refine gene prediction
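The "6-frame translation" step can be sketched in a few lines. This is a minimal illustration using the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are conventions chosen here, not taken from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three forward and three reverse-complement frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation(
    "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
).items():
    print(name, protein)
```

Each translated frame can then be compared against a protein database, which is effectively what BLASTX does for a nucleotide query.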
36 Predicting Genes - Details
- 1st, mask to "remove" repetitive elements (ALUs, etc.)
- Perform database search on translated DNA (BlastX, TFasta)
- Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)
- Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences
- Repeat
37 Spliced Alignment Algorithm
GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel et al (2004) Bioinformatics 20: 1157
- Perform pairwise alignment with large gaps in one sequence (due to introns)
- Align genomic DNA with cDNA, ESTs, protein sequences
- Score semi-conserved sequences at splice junctions
- Using Bayesian model or Markov model (MM)
- Score coding constraints in translated exons
- Using a Bayesian model or MM
Brendel 2005
38 Brendel - Spliced Alignment II: Compare with protein probes
Brendel 2005
39 Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal?
YES
- i = ith position in sequence
- Ī = avg information content over all positions > 20 nt from splice site
- σ_Ī = avg sample standard deviation of Ī
Brendel 2005
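The slide's formula did not survive extraction, so the sketch below assumes the standard Shannon definition of per-position information content for DNA, I_i = 2 + sum over bases of f_ib * log2(f_ib), computed from a set of aligned splice-site sequences. No small-sample correction is applied, and the toy alignment is invented for illustration.

```python
import math
from collections import Counter


def information_content(aligned_seqs):
    """Per-column information content (bits) for equal-length DNA sequences.

    A fully conserved column scores 2 bits; a column with all four
    bases equally frequent scores 0 bits.
    """
    length = len(aligned_seqs[0])
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_seqs)
        total = sum(counts.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        profile.append(2.0 - entropy)
    return profile


# Toy donor-site alignment (hypothetical data): GT is invariant,
# the third position is uniformly random, the fourth is conserved.
toy_sites = ["GTAA", "GTGA", "GTCA", "GTTA"]
print(information_content(toy_sites))
```

Positions whose value exceeds the background average Ī by a margin tied to σ_Ī would then be counted as contributing to the splicing signal, which is the comparison the slide's legend points to.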
40 Information content vs position
Which sequences are exons & which are introns? How can you tell?
Brendel et al (2004) Bioinformatics 20: 1157
Brendel 2005
41 Markov Model for Spliced Alignment
Brendel 2005