Title: Co-Modelling and Conditional Modelling
1Co-Modelling and Conditional Modelling
Needs
2Grammars: Finite Set of Rules for Generating Strings
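As a small illustration of "a finite set of rules for generating strings", the sketch below expands a start symbol by repeatedly applying randomly chosen rules. The grammar `RULES` (S -> a S b | empty) and the function name `generate` are hypothetical examples, not taken from the slides.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to its alternative right-hand sides.
RULES = {
    "S": [["a", "S", "b"], []],  # S -> a S b | (empty string)
}

def generate(symbol, rules, rng, max_depth=10):
    """Expand `symbol` with randomly chosen rules until only terminals remain."""
    if symbol not in rules:
        return symbol  # terminal symbol: emit as is
    alternatives = rules[symbol]
    if max_depth == 0:
        # Force the shortest alternative so that the recursion always terminates.
        alternatives = [min(alternatives, key=len)]
    rhs = rng.choice(alternatives)
    return "".join(generate(s, rules, rng, max_depth - 1) for s in rhs)

rng = random.Random(0)
samples = [generate("S", RULES, rng) for _ in range(5)]
```

Every string produced by this grammar has the form a^n b^n, which a finite-state model (an HMM) cannot generate in general.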
3Ab Initio Gene prediction
5'....tttttgcagtactcccgggccctctgttggg
gcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcct
ccttatctctagagccggccctggctctctggcgcggggccccttagtcc
gggctttttgccATGGGGTCTCTGTTCCCTCTGTCGCTGCTGTTTTTTTT
GGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGC
GGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTG
CCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAA
GTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCA
GCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGG
GTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTG
CCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCG
CCTACAgtgagggacaggggctcggtcccggctggggtgaggggaggggg
ctggaagaggtggggaagggtagttgacagtcgctctatagggagcgccc
gcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGT
GATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCC
ACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCAT
GGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGA
TCTGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACT
TCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTG
GTCCGCAACAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccct
gtaaccctggggactaggaggaagggggcagagagagttatgaccccgag
agggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctg
tgttccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCC
CACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCA
CTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGC
GTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaa
acaatctgg ggaaatggccatacatggtgg.... 3'
4Annotating genes
- Despite all difficulties, protein-coding genes are among the easiest functional elements to annotate.
- Several sources of information:
  - Sequence features (ab-initio approaches)
    - Coding exons contain no stop codons (open reading frame, ORF)
    - Coding exons tend to reside in CG-rich regions
  - Comparative information
    - Similarity to known proteins in databases
    - Similarity to other species: reduced mutation rates
  - Experimental evidence for transcription
    - cDNA sequences (complementary copy of spliced mRNA)
    - ESTs (few 100s basepair copy of 5' end of (spliced) mRNA transcript)
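The "no stop codons" sequence feature can be sketched as a scan for stop-free stretches in the three forward reading frames. This is a minimal illustration only: the function name `open_reading_frames` and the `min_codons` parameter are assumptions, and a real ab-initio predictor would also model splice sites, start codons and the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_reading_frames(seq, min_codons=1):
    """Yield (frame, start, end) for stop-free stretches in the three forward frames.

    Coordinates are 0-based and half-open; `end` excludes the stop codon.
    """
    seq = seq.upper()
    for frame in range(3):
        start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield frame, start, pos
                start = pos + 3  # next candidate begins after the stop codon
            pos += 3
        if (pos - start) // 3 >= min_codons:
            yield frame, start, pos  # stretch running to the end of the sequence

orfs = list(open_reading_frames("ATGAAATAGCCC", min_codons=2))
```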
5Annotating genes
- What makes annotating protein-coding genes so difficult?
- Gene density in human genome is low
  - 1-2% are coding exons, some of which are small (50 nt)
  - Introns may be very large (100 kb)
- Alternative splicing
  - Several promoters
  - Several alternative transcripts
- Pseudogenes
  - Genes may lose functionality (e.g. after duplication). Especially recently degenerated genes are hard to spot
  - Mature (spliced) transcript may be reverse transcribed. These are often easy to spot (no introns, poly-A tail)
6HMM Examples
7Genscan
HMM state diagram (figure labels):
- Exons of phase 0, 1 or 2
- Introns of phase 0, 1 or 2
- Initial exon
- Terminal exon
- Exon of single exon genes
- 5' UTR
- 3' UTR
- Poly-A signal
- Promoter
- Intergenic sequence
Legend: state with length distribution
Omitted: reverse strand part of the HMM
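To show how an HMM assigns a state to every base, here is a minimal Viterbi sketch for a two-state toy model. It is not Genscan: the states, all probabilities and the function name `viterbi` are invented for illustration, and Genscan additionally attaches explicit length distributions to some states, which a plain HMM like this one does not.

```python
import math

# Hypothetical two-state model; every number below is an illustrative assumption.
STATES = ("intergenic", "coding")
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters a long G-rich stretch is labelled "coding" and an A-rich stretch "intergenic".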
8Gene Finding & Protein Homology (Gelfand, Mironov & Pevzner, 1996)
Protein Database
Exon Ordering Graph
Spliced Alignment:
1. Define set of potential exons in new genome.
2. Make exon ordering graph (EOG).
3. Align EOG to protein database.
[Figure: exon ordering graph with amino-acid labelled exons aligned to a database protein, e.g. T Y G H L P against T Y - - L P M]
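Steps 1 and 2 can be sketched as follows, assuming candidate exons are given as half-open (start, end) intervals and that one exon may follow another when it starts at or after the other's end. The function names are hypothetical, and step 3 (aligning paths of the graph to a protein) is not implemented here.

```python
def exon_ordering_graph(exons):
    """Adjacency lists with an edge i -> j when exon i ends before exon j starts."""
    order = sorted(range(len(exons)), key=lambda i: exons[i])
    graph = {i: [] for i in order}
    for i in order:
        for j in order:
            if exons[i][1] <= exons[j][0]:
                graph[i].append(j)
    return graph

def chains(graph):
    """Enumerate every path in the graph; each path is a candidate exon assembly."""
    def extend(path):
        yield path
        for nxt in graph[path[-1]]:
            yield from extend(path + [nxt])
    for node in graph:
        yield from extend([node])

# Illustrative candidate exons (made-up coordinates, each with start < end).
candidates = [(0, 10), (5, 15), (20, 30)]
graph = exon_ordering_graph(candidates)
assemblies = list(chains(graph))
```

Enumerating all chains grows exponentially with the number of candidates, which is why spliced alignment searches the graph by dynamic programming rather than listing paths.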
9Comparative Gene Annotation
AGGTATATAATGCG.....
AGCCATTTAGTGCG.....
P_coding(ATG -> GTG) or P_non-coding(ATG -> GTG)
10Simultaneous Alignment & Gene Finding
Bafna & Huson, 2000; T. Scharling, 2001; Blayo, 2002.
Align by minimizing Distance / Maximizing Similarity
Align genes with structure: Known/unknown
11 5% of the Human genome is under conservation (Chiaromonte et al.)
[Figure: Whole Genome vs ARs]
Due to this work, people often say 5% of the genome is constrained
From Caleb Webber & Gerton Lunter
12Percentage of Genome under Purifying Selection
CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
Consider lengths of inter-gap segments! Do they
follow a geometric distribution?
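The question can be explored with a short sketch that extracts the lengths of gap-free column runs from a pairwise alignment and fits a geometric distribution by maximum likelihood. The two rows are the slide's example alignment rejoined into two 56-column lines, which is an assumption about the original layout; the function names are hypothetical.

```python
def inter_gap_segment_lengths(row1, row2, gap="-"):
    """Lengths of maximal runs of columns in which neither row has a gap."""
    lengths, run = [], 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            if run:
                lengths.append(run)
            run = 0
        else:
            run += 1
    if run:
        lengths.append(run)
    return lengths

def geometric_mle(lengths):
    """MLE of p for P(L = k) = (1 - p)^(k - 1) * p with k >= 1."""
    return len(lengths) / sum(lengths)

ROW1 = "CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA"
ROW2 = "CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA"
lengths = inter_gap_segment_lengths(ROW1, ROW2)
p_hat = geometric_mle(lengths)
```

Under neutral evolution with uniformly placed indels the segment lengths would be roughly geometric; an excess of long segments relative to the fitted distribution is the kind of deviation the slide asks about.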
From Caleb Webber & Gerton Lunter
13Finding Regulatory Signals in Genomes
- Searching for known signal in 1 sequence
- Searching for unknown signal common to set of unrelated sequences
- Searching for conserved segments in homologous sequences
Challenges
- Combining homologous and non-homologous analysis
- Merging Annotations
- Predicting signal-regulatory protein relationships
14Weight Matrices & Sequence Logos
Wasserman and Sandelin (2004) Applied Bioinformatics for the Identification of Regulatory Elements. Nature Reviews Genetics 5.4.276
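A weight matrix can be sketched as per-column log-odds scores built from aligned binding sites, then used to scan a sequence. The example sites, the uniform background, the pseudocount scheme and all function names below are illustrative assumptions, not data from the cited review.

```python
import math

def build_pwm(sites, background, pseudocount=1.0):
    """Per-column log2-odds scores from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount
    pwm = []
    for j in range(width):
        column = [site[j] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount * background[b]) / total) / background[b])
            for b in background
        })
    return pwm

def score(pwm, window):
    """Sum of column scores for a window of the same width as the matrix."""
    return sum(col[b] for col, b in zip(pwm, window))

def best_hit(pwm, seq):
    """Start index of the highest-scoring window in `seq`."""
    w = len(pwm)
    return max(range(len(seq) - w + 1), key=lambda i: score(pwm, seq[i:i + w]))

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
SITES = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]  # made-up example sites
pwm = build_pwm(SITES, UNIFORM)
hit = best_hit(pwm, "GGCGTATAATGCGC")
```

A sequence logo displays the same column frequencies, with letter heights scaled by the information content of each column.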
15Motifs in Biological Sequences
1990 Lawrence & Reilly: An Expectation Maximisation (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences. Proteins 7.41-51.
1992 Cardon and Stormo: Expectation Maximisation Algorithm for Identifying Protein-binding Sites with Variable Lengths from Unaligned DNA Fragments. J.Mol.Biol. 223.159-170
1993 Lawrence et al.: Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment. Science 262, 208-214.
Q = (q_{1,A}, ..., q_{w,T}): probability of different bases in the window
A = (a_1, ..., a_K): positions of the windows
q0 = (q_A, ..., q_T): background frequencies of nucleotides.
Priors: A has uniform prior; Q_j has Dirichlet(N0 α) prior, where α is the base frequency in the genome and N0 is the number of pseudocounts.
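One update of a Gibbs motif sampler under this model can be sketched as below: the window of sequence k is left out, Q is estimated from the remaining windows using the Dirichlet pseudocounts, and every candidate start in sequence k is weighted by its likelihood ratio against the background. This is a simplified illustration (it plugs in the posterior mean of Q and returns the normalised weights instead of drawing a sample); the function name and the example sequences are assumptions.

```python
def window_weights(seqs, starts, k, w, q0, alpha, n0=1.0):
    """Normalised weights for each start a_k in sequence k, given the other windows.

    q0: background base frequencies; alpha: genome base frequencies for the prior;
    n0: total pseudocount N0. A uniform prior on a_k is assumed.
    """
    others = [s[a:a + w] for idx, (s, a) in enumerate(zip(seqs, starts)) if idx != k]
    # Posterior-mean estimate of Q_j from the other windows plus Dirichlet(N0 * alpha).
    q = [
        {b: (sum(win[j] == b for win in others) + n0 * alpha[b]) / (len(others) + n0)
         for b in alpha}
        for j in range(w)
    ]
    seq = seqs[k]
    weights = []
    for i in range(len(seq) - w + 1):
        ratio = 1.0
        for j in range(w):
            base = seq[i + j]
            ratio *= q[j][base] / q0[base]  # motif model versus background
        weights.append(ratio)
    total = sum(weights)
    return [x / total for x in weights]

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
SEQS = ["CCTATAATCC", "GGGTATAATG", "TATAATCCGG"]  # made-up sequences
weights = window_weights(SEQS, starts=[0, 3, 0], k=0, w=6, q0=UNIFORM, alpha=UNIFORM)
```

A full sampler would draw a_k from these weights, move to the next sequence, and iterate until the window positions stabilise.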
16Natural Extensions to Basic Model I
Modified from Liu
17Natural Extensions to Basic Model II
18Combining Signals and other Data
Modified from Liu
19Phylogenetic Footprinting (homologous detection)
Blanchette and Tompa (2003) FootPrinter: a program designed for phylogenetic footprinting. NAR 31.13.3840-
21Statistical Alignment and Footprinting.
Solution: Cartesian Product of HMMs
22SAPF - Statistical Alignment and Phylogenetic
Footprinting
23BigFoot
http://www.stats.ox.ac.uk/research/genome/software
- Dynamic programming is too slow for more than 4-6 sequences
- MCMC integration is used instead; works until 10-15 sequences
- For more sequences other methods are needed.
24FSA - Fast Statistical Alignment (Pachter, Holmes & Co)
Data: k genomes/sequences
Iterative addition of homology statements to shrinking alignment
http://math.berkeley.edu/rbradley/papers/manual.pdf
[Figure: spanning tree with additional edges]
i. Conflicting homology statements cannot be added
ii. Some scoring on multiple sequence homology statements is used.
25Rate of Molecular Evolution versus estimated Selective Deceleration
Halpern and Bruno (1998) Evolutionary Distances for Protein-Coding Sequences. MBE 15.7.910-
Moses et al. (2003) Position specific variation in the rate of evolution of transcription factor binding sites. BMC Evolutionary Biology 3.19-
26Signal Factor Prediction
- Use PWM and Bruno-Halpern (BH) method to make TF specific evolutionary models
  - Drawback: BH only uses rates and equilibrium distribution
- Superior method: Infer TF Specific Position Specific evolutionary model
  - Drawback: cannot be done without large scale data on TF-signal binding.
http://jaspar.cgb.ki.se/ http://www.gene-regulation.com/
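The first bullet can be sketched with the Halpern-Bruno relation, which turns a background mutation rate matrix and the equilibrium base frequencies of one PWM column into position-specific substitution rates: R_ab = Q_ab * ln(x) / (1 - 1/x) with x = f_b Q_ba / (f_a Q_ab). The function name, the Jukes-Cantor-like background and the example column are assumptions; all frequencies must be strictly positive.

```python
import math

def halpern_bruno_rates(freqs, mutation):
    """Substitution rates R[(a, b)] for one motif position.

    freqs: equilibrium base frequencies of the PWM column (all > 0).
    mutation: background rates, mutation[a][b] for a != b.
    """
    rates = {}
    for a in freqs:
        for b in freqs:
            if a == b:
                continue
            x = (freqs[b] * mutation[b][a]) / (freqs[a] * mutation[a][b])
            if abs(x - 1.0) < 1e-12:
                factor = 1.0  # neutral limit: no selection between a and b
            else:
                factor = math.log(x) / (1.0 - 1.0 / x)
            rates[(a, b)] = mutation[a][b] * factor
    return rates

# Illustrative inputs: equal background rates and one made-up PWM column.
BACKGROUND = {a: {b: 1.0 for b in "ACGT" if b != a} for a in "ACGT"}
COLUMN = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
rates = halpern_bruno_rates(COLUMN, BACKGROUND)
```

Substitutions away from the preferred base A are slowed relative to the background rate and substitutions towards it are accelerated, while the column frequencies remain the equilibrium distribution.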
27Knowledge Transfer and Combining Annotations
Must be solvable by Bayesian Priors
Each position p_i: probability of being jth position in kth TFBS
If no experiment, low probability for being in TFBS
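A minimal sketch of how such a position-specific prior could enter a prediction, assuming the sequence evidence at a position is summarised as a likelihood ratio (TFBS model versus background). The function name and the numbers are hypothetical.

```python
def posterior_in_tfbs(prior, likelihood_ratio):
    """Posterior probability that a position lies in a TFBS, by Bayes' rule.

    prior: prior probability for this position (low when no experiment supports it).
    likelihood_ratio: P(data | TFBS) / P(data | background).
    """
    odds = prior * likelihood_ratio
    return odds / (odds + (1.0 - prior))

# Same sequence evidence, different priors (illustrative values only).
without_experiment = posterior_in_tfbs(prior=0.001, likelihood_ratio=100.0)
with_experiment = posterior_in_tfbs(prior=0.2, likelihood_ratio=100.0)
```

Transferring an annotation from another source then amounts to raising the prior at the annotated positions.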
28(Homologous + Non-homologous) detection
Wang and Stormo (2003) Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 19.18.2369-80
Zhou and Wong (2007) Coupling Hidden Markov Models for discovery of cis-regulatory signals in multiple species. Annals of Applied Statistics 1.1.36-65