Title: CS5263 Bioinformatics
1CS5263 Bioinformatics
- Lecture 14 Hidden Markov Models and applications
2- Last lecture
- Some practical issues in HMM modeling
- HMMs for pairwise sequence alignment
- Today
- HMM for multiple sequence alignment
- HMM for gene prediction
3Duration models
p
P(L)
s
1-p
L
p
s
s
s
s
1-p
Min, then geometric
p
p
p
p
s
s
s
s
1-p
1-p
1-p
1-p
Negative binominal
4Explicit duration modeling
Intron
Exon
Intergenic
P(A I) 0.3 P(C I) 0.2 P(G I) 0.2 P(T
I) 0.3
Generalized HMM. Often used in gene finders
P
L
Empirical intron length distribution
5Silent states
- Silent states are states that do not emit symbols
(e.g., the state 0 in our previous examples) - Silent states can be introduced in HMMs to reduce
the number of transitions
Dj
Silent state
B
E
B
Mj
E
6FSA for global alignment
e
Xi aligned to a gap
?
d
Xi and Yj aligned
?
d
Yj aligned to a gap
e
?
7Pair-HMM for global alignment
?
Xi aligned to a gap
1- ?
q(xi) 4 emission probabilities
?
Xi and Yj aligned
1-2?
q(yj) 4 emission probabilities
?
Yj aligned to a gap
P(xi,yj)
?
16 emission probabilities
1-?
Pair-wise HMM emit two sequences
simultaneously Algorithm is similar to regular
HMM, but need an additional dimension. e.g. in
Viterbi, we need Vk(i, j) instead of Vk(i)
8Pair-HMM and FSA for alignment
- After proper transformation between the
probabilities and substitution scores, the two
are identical - ?(a, b) ? log p(a, b) / (q(a) q(b))
- d ? log ?
- e ? log ?
- Details in Durbin book chap 4
- Finding an optimal FSA alignment is equivalent to
finding the most probable path with Viterbi - Has more use in comparative gene finding than in
seq alignment
9- HMM for multiple sequence alignment
10Question
- How to represent a multiple alignment so that
comparing a single sequence with an alignment can
be done efficiently?
11Solution 1
- Use regular expression to represent the consensus
sequences - Implemented in the ProSite database
- for example
- C - x(2) - P - F - x - FYWIV - x(7) - C -
x(8,10) - W - C - x(4) - DNSR - FYW - x(3,5)
- FYW - x - FYWI - C
12Multi-alignments represented by regular
expression
- Intuitive
- Flexible
- Efficient matching algorithms
- Mismatches allowed possibly
- Problems
- Limited power in expressing more divergent
positions - Specify boundaries of indels not so easy
- May have to change the regular expression when a
new member of a protein family is identified
13Solution 2
- For a non-gapped alignment, we can create a
position-specific weight matrix (PWM) - PSWM
- PSSM
- Also called a profile
- May use pseudocounts
14Scoring by PWMs
x KCIDNTHIKR
P(x M) ?i ei(xi) Random model each amino
acid xi can be generated with probability
q(xi) P(x random) ?i q(xi) Log-odds ratio
S log P(XM) / P(Xrandom) ?i log
ei(xi) / q(xi)
15PWMs
- Advantage
- Can be used to identify both strong and weak
homologies - Easy to implement and use
- Probabilistic interpretation
- Problem
- Not intuitive to deal with gaps
16PWMs are HMMs
B
E
M1
Mk
Transition probability 1
20 emission probabilities for each state
- This can only be used to search for sequences
without insertion / deletions (indels) - We can add additional states for indels!
17Dealing with insertions
Ij
B
E
M1
Mj
Mk
- This would allow an arbitrary number of
insertions after the j-th position - i.e. the sequence being compared can be longer
than the PWM
18Dealing with insertions
I1
Ij
Ik
B
E
M1
Mj
Mk
- This allows insertions at any position
19Dealing with Deletions
B
E
Mi
Mj
- This would allow a subsequence i, j being
deleted - i.e. the sequence being compared can be shorter
than the PWM
20Dealing with Deletions
B
E
- This would allow an arbitrary length of deletion
- Completely forward connected
- Too many transitions
21Deletion with silent states
Dj
Silent state
B
Mj
E
- Still allows any length of deletions
- Fewer parameters
- Less detailed control
22Full model
Dj
D deletion state I insertion state M matching
state
Ij
B
Mj
E
Algorithm basically the same as a regular HMM
23Using profile HMM
- Alignment
- Align a sequence to a profile HMM
- Viterbi
- Searching
- Given a sequence and HMMs for different protein
families, which family the sequence may belong
to? - Given a HMM for a protein family and many
proteins, which protein may belong to the family? - Viterbi or forward
- How to score?
- Training / Learning
- Given a multiple alignment, how to construct a
HMM? - Given an unaligned protein family, how to
construct a HMM?
24Pfam
- A database of protein families
- Developed by Sean Eddy and colleagues while
working in Durbins lab - Hand-curated seed multiple alignment
- Train HMM from seed alignment
- Hand-chosen score thresholds
- Automatic classification / classification of all
other protein sequences - 7973 families in Rfam 18.0, 8/2005
- (covers 75 of proteins)
25Modeling building from aligned sequences
- Match for columns with no gaps
- When gaps exist, how to decide whether they are
insertions or matches? - Heuristic rule gt50 gaps, insertion, otherwise,
match - How to add pseudocount?
- Simply add one
- According to background distribution
- Use a mixture of priors (Dirchlet mixtures)
- Sequence weighting
- Very similar sequences should each get less weight
26Modeling building from unaligned sequences
- Choose a model length and initial parameters
- Commonly use average seq length as model length
- Baum-Welch or Viterbi training
- Usually necessary to use multiple starting points
or other heuristics to escape from local optima - Align all sequences to the final model using
Viterbi - Why not F-B?
27Align a seq to profile HMMs
Viterbi
28Searching
Protein Database
- Scoring
- Log likelihood Log P(X M)
- Log odds Log P(X M) / P(X random)
- Length-normalization
- Is the hit real?
- How does the score compare with that of the
sequences already in the family? - How does the score compare with that of random
sequences?
Score for each protein
29Example modeling and searching for globins
- 300 random picked globin sequence
- Build a profile HMM from scratch (without
pre-alignment) - Align 60,000 proteins to the HMM
- Majority are non-globin
30- Even after length normalization, LL is still
length-dependent - Log-odd score provides better separation
- Takes amino acid composition into account
- Real globins could have scores less than 0
31- Estimate mean score and standard deviation for
non-globin sequences for each length - Z-score (raw score mean) / (standard
deviation) - Z-score is length-invariant
- Real globins have positive scores
32 33Gene structure
intron1
intron2
exon2
exon3
Intergenic
exon1
5
3
transcription
splicing
translation
Exon coding Intron non-coding Intergenic
non-coding
34Finding genes
Human
Fugu
worm
E.coli
As the coding/non-coding length ratio decreases,
exon prediction becomes more complex
35Gene prediction in prokaryote
- Finding long ORFs (open reading frame)
- An ORF may not contain stop codons
- Average ORF length 64/3
- Expect 300bp ORF per 36kbp
- Actual ORF length 1000bp
- Codon biases
- Some triplets are used more frequently than
others - Codon third position biases
36HMM for eukaryote gene finding
- Most basic idea is the same the distributions of
nucleotides is different in exon and other
regions - Alone wont work very well
- More signals are needed
- How to combine all the signal together?
37Simplest model
intron
Intergenic
exon
4 emission probability
64 triplets emission probabilities
4 emission probability
Actually more accurate at the di-amino-acid
level, i.e. 2 codons. Many methods use 5th-order
Markov model for all regions
- Exon length may not be exact multiple of 3
- Basically have to triple the number of states to
remember the excess number of bases in the
previous state
38More detailed model
Single exon
Init exon
intron
Intergenic
Term exon
Internal Exon
39Sub-models
CDS coding sequence
5-UTR
START
CDS
Init exon
3-UTR
STOP
CDS
Term exon
- START, STOP are PWMs
- Including start and stop codons and surrounding
bases
40Sub-model for intron
Splice donor
Intron body
Splice acceptor
Intron
- Sequence logos an informative display of PWMs
- Within each column, relative height represents
probability - Height of each column reflects information
content
41(No Transcript)
42Explicit duration modeling
x1 x2 x3 ..xN
1
2
Pk(i)
K
- For each position j and each state i
- Need to consider the transition from all previous
positions - Time O(N2K2)
- N can be 108
43Speedup GHMM
- Restrict maximum duration length to be L
- O(LNK2)
- However, intergenic and intron can be quite long
- L can be 105
- Compromise explicit duration for exons only,
geometric for all other states - Pre-compute all possible starting points of ORFs
- For init exon ATG
- For internal/terminal exon splice donor signal
(GT)
44GeneScan model
45Approaches to gene finding
- Homology
- BLAST, BLAT, etc.
- Ab initio
- Genscan, Glimmer, Fgenesh, GeneMark, etc.
- Each one has been tuned towards certain organisms
- Hybrids
- Twinscan, SLAM
- Use pair-HMM, or pre-compute score for potential
coding regions based on alignment - None are perfect, never used alone in practice
46Current status
- More accurate on internal exons
- Determining boundaries of init and term exons is
hard - Biased towards multiple-exon genes
- Alternative splicing is hard
- Non-coding RNA is hard
47- State of the Art
- predictions 60 similar to real proteins
- 80 if database similarity used
- lab verification still needed, still expensive
48HMM wrap up
- Weve talked about
- Probability, mainly Bayes Theorem
- Markov models
- Hidden Markov models
- HMM parameter estimation given state path
- Decoding given HMM and parameters
- Viterbi
- F-B
- Learning
- Baum-Welch (Expectation-Maximization)
- Viterbi
49HMM wrap up
- Weve also talked about
- Extension to gHMMs
- HMM for multiple sequence alignment
- gHMM for gene finding
- We did not talk about
- Higher-order Markov models
- How to escape from local optima in learning
50Next lectures
- String pattern matching algorithms
- Suffix tree and its applications
- Motif finding
- Biology
- Algorithm
- Combinatorial
- Probabilistic