CS5263 Bioinformatics - PowerPoint PPT Presentation

1 / 50

About This Presentation

Title:

CS5263 Bioinformatics

Description:

Mj. Dealing with Deletions. This would allow an arbitrary length of deletion ... Mj. E. Dj. Still allows any length of deletions. Fewer parameters. Less ... – PowerPoint PPT presentation

Number of Views:44

Avg rating:3.0/5.0

Slides: 51

Provided by: jianhu

Learn more at: http://www.cs.utsa.edu

Category:

more less

Transcript and Presenter's Notes

Title: CS5263 Bioinformatics

1
CS5263 Bioinformatics

Lecture 14 Hidden Markov Models and applications

Last lecture
Some practical issues in HMM modeling
HMMs for pairwise sequence alignment
Today
HMM for multiple sequence alignment
HMM for gene prediction

3
Duration models
p
P(L)
s
1-p
L
p
s
s
s
s
1-p
Min, then geometric
p
p
p
p
s
s
s
s
1-p
1-p
1-p
1-p
Negative binominal
4
Explicit duration modeling
Intron
Exon
Intergenic
P(A I) 0.3 P(C I) 0.2 P(G I) 0.2 P(T
I) 0.3
Generalized HMM. Often used in gene finders
P
L
Empirical intron length distribution
5
Silent states

Silent states are states that do not emit symbols
(e.g., the state 0 in our previous examples)
Silent states can be introduced in HMMs to reduce
the number of transitions

Dj
Silent state
B
E
B
Mj
E
6
FSA for global alignment
e
Xi aligned to a gap
?
d
Xi and Yj aligned
?
d
Yj aligned to a gap
e
?
7
Pair-HMM for global alignment
?
Xi aligned to a gap
1- ?
q(xi) 4 emission probabilities
?
Xi and Yj aligned
1-2?
q(yj) 4 emission probabilities
?
Yj aligned to a gap
P(xi,yj)
?
16 emission probabilities
1-?
Pair-wise HMM emit two sequences
simultaneously Algorithm is similar to regular
HMM, but need an additional dimension. e.g. in
Viterbi, we need Vk(i, j) instead of Vk(i)
8
Pair-HMM and FSA for alignment

After proper transformation between the
probabilities and substitution scores, the two
are identical
?(a, b) ? log p(a, b) / (q(a) q(b))
d ? log ?
e ? log ?
Details in Durbin book chap 4
Finding an optimal FSA alignment is equivalent to
finding the most probable path with Viterbi
Has more use in comparative gene finding than in
seq alignment

HMM for multiple sequence alignment

10
Question

How to represent a multiple alignment so that
comparing a single sequence with an alignment can
be done efficiently?

11
Solution 1

Use regular expression to represent the consensus
sequences
Implemented in the ProSite database
for example
C - x(2) - P - F - x - FYWIV - x(7) - C -
x(8,10) - W - C - x(4) - DNSR - FYW - x(3,5)
- FYW - x - FYWI - C

12
Multi-alignments represented by regular
expression

Intuitive
Flexible
Efficient matching algorithms
Mismatches allowed possibly
Problems
Limited power in expressing more divergent
positions
Specify boundaries of indels not so easy
May have to change the regular expression when a
new member of a protein family is identified

13
Solution 2

For a non-gapped alignment, we can create a
position-specific weight matrix (PWM)
PSWM
PSSM
Also called a profile
May use pseudocounts

14
Scoring by PWMs
x KCIDNTHIKR
P(x M) ?i ei(xi) Random model each amino
acid xi can be generated with probability
q(xi) P(x random) ?i q(xi) Log-odds ratio
S log P(XM) / P(Xrandom) ?i log
ei(xi) / q(xi)
15
PWMs

Advantage
Can be used to identify both strong and weak
homologies
Easy to implement and use
Probabilistic interpretation
Problem
Not intuitive to deal with gaps

16
PWMs are HMMs
B
E
M1
Mk
Transition probability 1
20 emission probabilities for each state

This can only be used to search for sequences
without insertion / deletions (indels)
We can add additional states for indels!

17
Dealing with insertions
Ij
B
E
M1
Mj
Mk

This would allow an arbitrary number of
insertions after the j-th position
i.e. the sequence being compared can be longer
than the PWM

18
Dealing with insertions
I1
Ij
Ik
B
E
M1
Mj
Mk

This allows insertions at any position

19
Dealing with Deletions
B
E
Mi
Mj

This would allow a subsequence i, j being
deleted
i.e. the sequence being compared can be shorter
than the PWM

20
Dealing with Deletions
B
E

This would allow an arbitrary length of deletion
Completely forward connected
Too many transitions

21
Deletion with silent states
Dj
Silent state
B
Mj
E

Still allows any length of deletions
Fewer parameters
Less detailed control

22
Full model

Called profile-HMM

Dj
D deletion state I insertion state M matching
state
Ij
B
Mj
E
Algorithm basically the same as a regular HMM
23
Using profile HMM

Alignment
Align a sequence to a profile HMM
Viterbi
Searching
Given a sequence and HMMs for different protein
families, which family the sequence may belong
to?
Given a HMM for a protein family and many
proteins, which protein may belong to the family?
Viterbi or forward
How to score?
Training / Learning
Given a multiple alignment, how to construct a
HMM?
Given an unaligned protein family, how to
construct a HMM?

24
Pfam

A database of protein families
Developed by Sean Eddy and colleagues while
working in Durbins lab
Hand-curated seed multiple alignment
Train HMM from seed alignment
Hand-chosen score thresholds
Automatic classification / classification of all
other protein sequences
7973 families in Rfam 18.0, 8/2005
(covers 75 of proteins)

25
Modeling building from aligned sequences

Match for columns with no gaps
When gaps exist, how to decide whether they are
insertions or matches?
Heuristic rule gt50 gaps, insertion, otherwise,
match
How to add pseudocount?
Simply add one
According to background distribution
Use a mixture of priors (Dirchlet mixtures)
Sequence weighting
Very similar sequences should each get less weight

26
Modeling building from unaligned sequences

Choose a model length and initial parameters
Commonly use average seq length as model length
Baum-Welch or Viterbi training
Usually necessary to use multiple starting points
or other heuristics to escape from local optima
Align all sequences to the final model using
Viterbi
Why not F-B?

27
Align a seq to profile HMMs
Viterbi
28
Searching
Protein Database

Scoring
Log likelihood Log P(X M)
Log odds Log P(X M) / P(X random)
Length-normalization
Is the hit real?
How does the score compare with that of the
sequences already in the family?
How does the score compare with that of random
sequences?

Score for each protein
29
Example modeling and searching for globins

300 random picked globin sequence
Build a profile HMM from scratch (without
pre-alignment)
Align 60,000 proteins to the HMM
Majority are non-globin

Even after length normalization, LL is still
length-dependent
Log-odd score provides better separation
Takes amino acid composition into account
Real globins could have scores less than 0

Estimate mean score and standard deviation for
non-globin sequences for each length
Z-score (raw score mean) / (standard
deviation)
Z-score is length-invariant
Real globins have positive scores

Gene prediction

33
Gene structure
intron1
intron2
exon2
exon3
Intergenic
exon1
5
3
transcription
splicing
translation
Exon coding Intron non-coding Intergenic
non-coding
34
Finding genes
Human
Fugu
worm
E.coli
As the coding/non-coding length ratio decreases,
exon prediction becomes more complex
35
Gene prediction in prokaryote

Finding long ORFs (open reading frame)
An ORF may not contain stop codons
Average ORF length 64/3
Expect 300bp ORF per 36kbp
Actual ORF length 1000bp
Codon biases
Some triplets are used more frequently than
others
Codon third position biases

36
HMM for eukaryote gene finding

Most basic idea is the same the distributions of
nucleotides is different in exon and other
regions
Alone wont work very well
More signals are needed
How to combine all the signal together?

37
Simplest model
intron
Intergenic
exon
4 emission probability
64 triplets emission probabilities
4 emission probability
Actually more accurate at the di-amino-acid
level, i.e. 2 codons. Many methods use 5th-order
Markov model for all regions

Exon length may not be exact multiple of 3
Basically have to triple the number of states to
remember the excess number of bases in the
previous state

38
More detailed model
Single exon
Init exon
intron
Intergenic
Term exon
Internal Exon
39
Sub-models
CDS coding sequence
5-UTR
START
CDS
Init exon
3-UTR
STOP
CDS
Term exon

START, STOP are PWMs
Including start and stop codons and surrounding
bases

40
Sub-model for intron
Splice donor
Intron body
Splice acceptor
Intron

Sequence logos an informative display of PWMs
Within each column, relative height represents
probability
Height of each column reflects information
content

41
(No Transcript)
42
Explicit duration modeling
x1 x2 x3 ..xN
1
2
Pk(i)
K

For each position j and each state i
Need to consider the transition from all previous
positions
Time O(N2K2)
N can be 108

43
Speedup GHMM

Restrict maximum duration length to be L
O(LNK2)
However, intergenic and intron can be quite long
L can be 105
Compromise explicit duration for exons only,
geometric for all other states
Pre-compute all possible starting points of ORFs
For init exon ATG
For internal/terminal exon splice donor signal
(GT)

44
GeneScan model
45
Approaches to gene finding

Homology
BLAST, BLAT, etc.
Ab initio
Genscan, Glimmer, Fgenesh, GeneMark, etc.
Each one has been tuned towards certain organisms
Hybrids
Twinscan, SLAM
Use pair-HMM, or pre-compute score for potential
coding regions based on alignment
None are perfect, never used alone in practice

46
Current status