CoModelling and Conditional Modelling - PowerPoint PPT Presentation

Provided by: Jotun

Transcript and Presenter's Notes

Title: CoModelling and Conditional Modelling


1
Co-Modelling and Conditional Modelling
Needs
2
Grammars: Finite Set of Rules for Generating
Strings
3
Ab Initio Gene prediction
5'....tttttgcagtactcccgggccctctgttggg
gcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcct
ccttatctctagagccggccctggctctctggcgcggggccccttagtcc
gggctttttgccATGGGGTCTCTGTTCCCTCTGTCGCTGCTGTTTTTTTT
GGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGC
GGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTG
CCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAA
GTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCA
GCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGG
GTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTG
CCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCG
CCTACAgtgagggacaggggctcggtcccggctggggtgaggggaggggg
ctggaagaggtggggaagggtagttgacagtcgctctatagggagcgccc
gcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGT
GATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCC
ACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCAT
GGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGA
TCTGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACT
TCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTG
GTCCGCAACAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccct
gtaaccctggggactaggaggaagggggcagagagagttatgaccccgag
agggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctg
tgttccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCC
CACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCA
CTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGC
GTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaa
acaatctgg ggaaatggccatacatggtgg.... 3'
4
Annotating genes
  • Despite all difficulties, protein-coding genes
    are among the easiest functional elements to
    annotate. Several sources of information:
  • Sequence features (ab initio approaches)
      • Coding exon contains no stop codons (open reading
        frame, ORF)
      • Coding exons tend to reside in CG-rich regions
  • Comparative information
      • Similarity to known proteins in databases
      • Similarity to other species: reduced mutation
        rates
  • Experimental evidence for transcription
      • cDNA sequences (complementary copy of spliced
        mRNA)
      • ESTs (few 100s basepair copy of 5' end of
        (spliced) mRNA transcript)
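The two sequence features listed above (stop-free reading frames and CG-rich regions) can be sketched in a few lines. This is only an illustration, not the method of any particular gene finder: the function names, the `min_codons` parameter and the restriction to the three forward frames are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_reading_frames(seq, min_codons=1):
    """Return (start, end, frame) for every stop-free stretch of at least
    min_codons codons in the three forward reading frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = frame          # start of the current stop-free stretch
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos, frame))
                start = pos + 3   # next stretch begins after the stop codon
            pos += 3
        if (pos - start) // 3 >= min_codons:
            orfs.append((start, pos, frame))
    return orfs

def gc_content(seq):
    """Fraction of C and G bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

A real ab initio predictor would also scan the reverse strand and combine these features with splice-site and codon-usage signals.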

5
Annotating genes
  • What makes annotating protein-coding genes so
    difficult?
  • Gene density in human genome is low
      • 1-2% are coding exons, some of which are small
        (50 nt)
      • Introns may be very large (100 kb)
  • Alternative splicing
      • Several promoters
      • Several alternative transcripts
  • Pseudogenes
      • Genes may lose functionality (e.g. after
        duplication). Especially recent degenerated genes
        are hard to spot.
      • Mature (spliced) transcript may be reverse
        transcribed. These are often easy to spot (no
        introns, poly-A tail).

6
HMM Examples
7
Genscan
Exons of phase 0, 1 or 2
State with length distribution

Introns of phase 0, 1 or 2
Initial exon
Terminal exon
Exon of single exon genes
5' UTR
3' UTR
Poly-A signal
Promoter
Intergenic sequence
Omitted: reverse-strand part of the HMM
8
Gene Finding & Protein Homology (Gelfand, Mironov
& Pevzner, 1996)
Spliced Alignment:
1. Define set of potential exons in new genome.
2. Make exon ordering graph (EOG).
3. Align EOG to protein database.
[Figure: Protein Database and Exon Ordering Graph panels with an example alignment of amino-acid residues (T Y G H L P against T Y - - L P M)]
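Step 2 of the spliced-alignment procedure can be pictured with a small sketch. It assumes, purely for illustration, that candidate exons are given as `(start, end)` coordinate pairs and that an edge joins two exons whenever the second starts after the first ends. The function name and this edge rule are assumptions, not details taken from the original paper.

```python
def exon_ordering_graph(exons):
    """Build a directed graph over candidate exons.

    exons: iterable of (start, end) pairs with start <= end.
    Returns a dict mapping each exon to the list of exons that may follow it,
    i.e. those that start strictly after it ends.
    """
    exons = sorted(set(exons))
    return {e: [f for f in exons if f[0] > e[1]] for e in exons}

# Example: four overlapping candidates on one strand.
graph = exon_ordering_graph([(0, 10), (5, 20), (15, 30), (25, 40)])
```

Every path through such a graph is one candidate exon chain. Step 3 then scores chains against a database protein, which is typically done by dynamic programming rather than by enumerating paths.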
9
Comparative Gene Annotation
AGGTATATAATGCG..... P_coding(ATG --> GTG)
or AGCCATTTAGTGCG..... P_non-coding(ATG --> GTG)
10
Simultaneous Alignment & Gene Finding (Bafna &
Huson, 2000; T. Scharling, 2001; Blayo, 2002)
Align by minimizing Distance / Maximizing
Similarity
Align genes with structure: Known/unknown
11
5% of the Human genome is under
conservation (Chiaromonte et al.)
[Figure panels: Whole Genome, ARs]
Due to this work, people often say 5% of the
genome is constrained
From Caleb Webber & Gerton Lunter
12
Percentage of Genome under Purifying Selection
CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
Consider lengths of inter-gap segments! Do they
follow a geometric distribution?
From Caleb Webber & Gerton Lunter
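The question about inter-gap segments can be made concrete with a short sketch: collect the lengths of ungapped runs in a pairwise alignment and fit a geometric distribution by maximum likelihood. The function names are illustrative, and the toy alignment in the comment is made up for the example. Runs touching the alignment ends are counted like any other, although strictly they are censored.

```python
def inter_gap_segment_lengths(row_a, row_b, gap="-"):
    """Lengths of maximal runs of columns in which neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    lengths = []
    run = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            if run:
                lengths.append(run)
            run = 0
        else:
            run += 1
    if run:
        lengths.append(run)
    return lengths

def geometric_mle(lengths):
    """MLE of p for P(L = k) = (1 - p)**(k - 1) * p, k >= 1, i.e. 1 / mean."""
    return len(lengths) / sum(lengths)

# Toy alignment (hypothetical data): runs of length 3 and 2.
lengths = inter_gap_segment_lengths("ACG--TTA", "ACGTT-TA")
p_hat = geometric_mle(lengths)
```

Comparing the observed length histogram with the fitted geometric distribution shows whether long ungapped segments are over-represented, which is the signal of interest when estimating the fraction of sequence under purifying selection.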
13
Finding Regulatory Signals in Genomes
Searching for known signal in 1 sequence
Searching for unknown signal common to set of
unrelated sequences
Searching for conserved segments in homologous
Challenges
Combining homologous and non-homologous analysis
Merging Annotations
Predicting signal-regulatory protein relationships
14
Weight Matrices & Sequence Logos
Wasserman and Sandelin (2004) "Applied
Bioinformatics for the Identification of
Regulatory Elements" Nature Reviews Genetics
5.4.276
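A weight matrix can be sketched as a table of log-odds scores built from aligned binding sites. The version below is a minimal illustration, assuming a uniform background, a pseudocount of 1 per base and log base 2. The function names and the example sites are hypothetical.

```python
import math

BASES = "ACGT"

def weight_matrix(sites, background=None, pseudocount=1.0):
    """Log-odds position weight matrix from equal-length aligned sites.

    Returns a list with one dict per position mapping base -> log2 odds score.
    """
    background = background or {b: 0.25 for b in BASES}
    width = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for j in range(width):
        column = [site[j] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds scores for a window of matching width."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Start index of the highest-scoring window in seq."""
    w = len(pwm)
    return max(range(len(seq) - w + 1), key=lambda i: score(pwm, seq[i:i + w]))

pwm = weight_matrix(["TATA", "TATA", "TAAA"])
hit = best_hit(pwm, "GGCTATAGG")
```

A sequence logo displays the same column frequencies, with letter heights scaled by the information content of each position.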
15
Motifs in Biological Sequences
1990 Lawrence & Reilly, "An Expectation Maximisation (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences", Proteins 7.41-51.
1992 Cardon and Stormo, "Expectation Maximisation Algorithm for Identifying Protein-binding Sites with Variable Lengths from Unaligned DNA Fragments", J.Mol.Biol. 223.159-170.
1993 Lawrence, ..., Liu, "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment", Science 262, 208-214.
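$\Theta = (\theta_{1,A}, \ldots, \theta_{w,T})$: probability of different bases in the window
$A = (a_1, \ldots, a_K)$: positions of the windows
$\theta_0 = (\theta_A, \ldots, \theta_T)$: background frequencies of nucleotides
Priors: $A$ has a uniform prior; $\Theta_j$ has a Dirichlet($N_0 \alpha$) prior, where $\alpha$ is the base frequency in the genome and $N_0$ is the pseudocount.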
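The model above (per-position base probabilities inside a window of width w, background frequencies outside, one window per sequence) can be sketched as follows. The first function evaluates the log-likelihood for given window positions. The second gives the conditional distribution of one window position under a uniform prior, which is the quantity a Gibbs sampler draws from when the window probabilities are held fixed. Function and argument names are illustrative assumptions.

```python
import math

def log_likelihood(seqs, positions, theta, theta0):
    """Log-likelihood of sequences given window starts and base probabilities.

    seqs: list of strings; positions: window start in each sequence;
    theta: list of w dicts (base -> probability inside the window);
    theta0: dict (base -> background probability).
    """
    w = len(theta)
    total = 0.0
    for seq, a in zip(seqs, positions):
        for i, base in enumerate(seq):
            if a <= i < a + w:
                total += math.log(theta[i - a][base])
            else:
                total += math.log(theta0[base])
    return total

def position_posterior(seq, theta, theta0):
    """P(window start = a | seq, theta, theta0) under a uniform prior on a."""
    w = len(theta)
    weights = []
    for a in range(len(seq) - w + 1):
        ratio = 1.0
        for j in range(w):
            base = seq[a + j]
            ratio *= theta[j][base] / theta0[base]   # motif vs background odds
        weights.append(ratio)
    z = sum(weights)
    return [x / z for x in weights]
```

A full sampler would alternate between drawing window positions from `position_posterior` and updating the window probabilities from the resulting counts plus the Dirichlet pseudocounts.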
16
Natural Extensions to Basic Model I
Modified from Liu
17
Natural Extensions to Basic Model II
18
Combining Signals and other Data
Modified from Liu
19
Phylogenetic Footprinting (homologous detection)
Blanchette and Tompa (2003) "FootPrinter: a
program designed for phylogenetic footprinting"
NAR 31.13.3840-
21
Statistical Alignment and Footprinting.
Solution: Cartesian Product of HMMs
22
SAPF - Statistical Alignment and Phylogenetic
Footprinting
23
BigFoot
http://www.stats.ox.ac.uk/research/genome/software
  • Dynamic programming is too slow for more
    than 4-6 sequences
  • MCMC integration is used instead; it works up to
    10-15 sequences
  • For more sequences other methods are needed.

24
FSA - Fast Statistical Alignment (Pachter,
Holmes & Co)
Data: k genomes/sequences
Iterative addition of homology statements to
shrinking alignment
http://math.berkeley.edu/rbradley/papers/manual.pdf
Spanning tree
Additional edges
i. Conflicting homology statements cannot be
added.
ii. Some scoring on multiple sequence
homology statements is used.
25
Rate of Molecular Evolution versus estimated
Selective Deceleration
Halpern and Bruno (1998) "Evolutionary Distances
for Protein-Coding Sequences" MBE 15.7.910-
Moses et al. (2003) "Position specific variation
in the rate of evolution of transcription binding
sites" BMC Evolutionary Biology 3.19-
26
Signal Factor Prediction
  • Use PWM and Bruno-Halpern (BH) method to make
    TF-specific evolutionary models
  • Drawback: BH only uses rates and equilibrium
    distribution
  • Superior method: infer TF-specific,
    position-specific evolutionary model
  • Drawback: cannot be done without large-scale
    data on TF-signal binding.
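The first bullet can be illustrated with a sketch of how a single PWM column and a background mutation model yield position-specific substitution rates. It assumes the commonly cited form of the Halpern-Bruno formula, R_ab = Q_ab · ln(x) / (1 − 1/x) with x = f_b·Q_ba / (f_a·Q_ab), where f is the PWM column read as equilibrium frequencies and Q is the background rate matrix. The function name and data layout are assumptions made for the example.

```python
import math

def halpern_bruno_rates(freqs, Q):
    """Site-specific substitution rates from equilibrium frequencies.

    freqs: dict base -> equilibrium frequency at this site (e.g. a PWM column,
           all entries > 0);
    Q:     dict of dicts, Q[a][b] = background mutation rate from a to b.
    Returns R with R[a][b] for a != b.
    """
    R = {a: {} for a in freqs}
    for a in freqs:
        for b in freqs:
            if a == b:
                continue
            x = (freqs[b] * Q[b][a]) / (freqs[a] * Q[a][b])
            if abs(x - 1.0) < 1e-12:
                R[a][b] = Q[a][b]          # neutral limit: ln(x)/(1 - 1/x) -> 1
            else:
                R[a][b] = Q[a][b] * math.log(x) / (1.0 - 1.0 / x)
    return R
```

With equal background rates and a column that strongly prefers A, changes towards A are accelerated and changes away from A are slowed. The resulting rates satisfy detailed balance with respect to the column frequencies, which reflects the drawback noted above: only rates and the equilibrium distribution enter the model.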

http://jaspar.cgb.ki.se/ http://www.gene-regulation.com/
27
Knowledge Transfer and Combining Annotations
Must be solvable by Bayesian priors.
Each position p_i: probability of being jth position in
kth TFBS.
If no experiment, low probability
for being in TFBS.
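One way to read the role of the prior is as a per-position probability that is updated by whatever evidence is available. The sketch below assumes the evidence is summarised as a likelihood ratio (TFBS versus background). The function name and the numbers in the comments are hypothetical.

```python
def posterior_tfbs_probability(prior, likelihood_ratio):
    """Posterior probability that a position lies in a TFBS.

    prior: prior probability in (0, 1), low where no experiment supports a site;
    likelihood_ratio: P(data | in TFBS) / P(data | not in TFBS).
    """
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)

# A position without experimental support starts low and needs strong
# sequence evidence to reach a high posterior.
weak = posterior_tfbs_probability(0.01, 3.0)
strong = posterior_tfbs_probability(0.5, 3.0)
```

Annotations transferred from another species or experiment would enter through the prior, while the sequence model supplies the likelihood ratio.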
28
(Homologous + Non-homologous) detection
Wang and Stormo (2003) "Combining phylogenetic
data with co-regulated genes to identify
regulatory motifs" Bioinformatics
19.18.2369-80
Zhou and Wong (2007) "Coupling
Hidden Markov Models for discovery of
cis-regulatory signals in multiple species"
Annals of Applied Statistics 1.1.36-65