Title: Markov chains
1 Markov chains
- Basic structure of a classical Markov chain
- Example: DNA. Each letter A, C, G, T can be assigned
as a state, with transition probabilities
P(x_i = t | x_{i-1} = s)
- The probability of each state x_i depends only on the
value of the preceding symbol x_{i-1}.
- The sum of probabilities over all possible sequences
(of a given length) is 1: a Markov chain describes a proper
probability distribution over the whole space of
sequences.
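A minimal sketch of the chain rule behind this slide, assuming a made-up transition table and a uniform distribution for the first symbol (neither is from the lecture). It computes P(x) for a DNA string and checks numerically that the probabilities of all sequences of one length add up to 1.

```python
from itertools import product

# Hypothetical transition probabilities P(x_i = t | x_{i-1} = s).
# Each row sums to 1; the numbers are illustrative, not estimated from data.
TRANSITIONS = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "T": {"A": 0.20, "C": 0.30, "G": 0.20, "T": 0.30},
}
# Assumed uniform distribution for the first symbol.
INITIAL = {base: 0.25 for base in "ACGT"}


def sequence_probability(seq, initial=INITIAL, transitions=TRANSITIONS):
    """P(x) = P(x_1) * prod_i P(x_i | x_{i-1}) for a non-empty DNA string."""
    prob = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        prob *= transitions[prev][curr]
    return prob


def total_probability(length):
    """Sum P(x) over all 4**length sequences of the given length."""
    return sum(sequence_probability("".join(s))
               for s in product("ACGT", repeat=length))
```

Calling `total_probability(3)` sums over all 64 trinucleotides and should return 1 up to rounding error.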
2 CpG islands: biological function and impact on
gene regulation
- Genomic regions with a CpG dinucleotide content of
at least 60% (usually defined via the observed-to-expected CpG ratio)
- Overall, genomes have a much lower CpG frequency
(about 1%; CG suppression).
- Methylation of CpG sites in the promoter of a
gene may inhibit the expression of that gene.
- CpG islands typically occur at or near the
transcription start site of genes, particularly
in housekeeping genes of vertebrates.
3 Example: CpG island
- Short stretches of 100-1000 nucleotides with
frequently occurring CpG dinucleotides
- Given a short stretch of genomic sequence, how
can one decide whether it comes from a CpG island?
- How can one find CpG islands within a long
stretch of genomic sequence?
- Learning sets
- 48 CpG islands (+ set)
- sequences outside CpG islands (- set)
- Likelihood ratio test
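The likelihood ratio test can be sketched as follows. The two tiny learning sets are invented for illustration (the 48 islands mentioned on the slide are not reproduced), and the pseudocount of 1 is an assumption. One Markov chain is trained on each set, and a query sequence is scored by the log ratio of its probability under the two chains.

```python
import math


def train_transitions(sequences, pseudocount=1.0):
    """Estimate transition probabilities by counting dinucleotides."""
    counts = {s: {t: pseudocount for t in "ACGT"} for s in "ACGT"}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}


def log_odds(seq, plus, minus):
    """Sum over i of log2(a+[x_{i-1}][x_i] / a-[x_{i-1}][x_i]), in bits."""
    return sum(math.log2(plus[s][t] / minus[s][t])
               for s, t in zip(seq, seq[1:]))


# Toy learning sets, made up for this sketch.
plus_set = ["CGCGGCGCCGCG", "GCGCGCCGGCGC"]    # "+" set: CpG-island-like
minus_set = ["ATTATCATGTAA", "TTCATTGATAAT"]   # "-" set: background-like
plus = train_transitions(plus_set)
minus = train_transitions(minus_set)
```

A positive score from `log_odds(query, plus, minus)` favours the + model, a negative score the - model.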
4 From Markov chains to hidden Markov models (HMM)
- How to find CpG islands in an unannotated sequence?
- One approach: a moving window of 100 nucleotides
- Another approach: include both Markov chains
(model + and model -) in one model, i.e. switch
between the two models
- States A+, C+, G+, T+ and A-, C-, G-, T-
- A state no longer corresponds to a symbol:
A+ and A- both generate A, so from the symbol
A alone one cannot infer where it is coming from
(hidden); the transition probabilities are different
in A+ and A-.
- Graphically
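A sketch of the combined model with a + and a - copy of every nucleotide, decoded with the Viterbi algorithm. The two chains, the switching probability `p_switch`, the uniform start and the rule that a switch uses the row of the destination chain are all assumptions made for this example.

```python
import math

BASES = "ACGT"
# Hypothetical chains: "+" favours C and G, "-" favours A and T and suppresses C->G.
PLUS = {s: {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15} for s in BASES}
MINUS = {s: {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30} for s in BASES}
MINUS["C"] = {"A": 0.30, "C": 0.30, "G": 0.05, "T": 0.35}


def make_model(plus, minus, p_switch=0.05):
    """Combine two 4-state chains into one 8-state table keyed by (base, sign)."""
    chains = {"+": plus, "-": minus}
    trans = {}
    for sign, other in (("+", "-"), ("-", "+")):
        for s in BASES:
            row = {}
            for t in BASES:
                row[(t, sign)] = (1 - p_switch) * chains[sign][s][t]
                row[(t, other)] = p_switch * chains[other][s][t]
            trans[(s, sign)] = row
    return trans


def viterbi_labels(seq, trans):
    """Most probable +/- labelling; state (b, sign) emits b with probability 1."""
    # Only the two states emitting the observed base are possible at each position.
    score = {sign: math.log(1 / 8) for sign in "+-"}  # assumed uniform start
    back = []
    for prev, curr in zip(seq, seq[1:]):
        new_score, pointers = {}, {}
        for sign in "+-":
            best = max("+-", key=lambda q: score[q]
                       + math.log(trans[(prev, q)][(curr, sign)]))
            new_score[sign] = score[best] + math.log(trans[(prev, best)][(curr, sign)])
            pointers[sign] = best
        score = new_score
        back.append(pointers)
    path = [max("+-", key=lambda q: score[q])]
    for pointers in reversed(back):  # trace the best path backwards
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


labels = viterbi_labels("ATTATAAT" + "CGCGCGCGCG" + "TATTAATA",
                        make_model(PLUS, MINUS))
```

`labels` holds one `+` or `-` per nucleotide; runs of `+` mark the predicted island.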
5 Globin sequences
6 Hidden Markov model for protein sequences
- Three types of states:
- match states m_j, insertion states i_j, gap (deletion) states
d_j
For each line in the model diagram there is a transition probability.
The sequence of states (pi) and the sequence of symbols
(x_j) are decoupled.
The probability of a path of
states is still given by a Markov chain.
A match
state m_j emits a sequence symbol, i.e. it corresponds to a
column in a multiple sequence alignment.
7 Parametrization of an HMM
- In an alignment of a sequence to a model, each residue is associated
with a match or an insert state.
- An alignment of a sequence to a model is called a
path
- The alignment is not unique
- Emission probabilities e_k(b)
In practice, estimates for the emission
probabilities are taken from multiple sequence
alignments.
8 Searching with profile HMM
- Given a sequence x, one has to find the most
probable path pi (alignment) to the model
- Viterbi algorithm
- The probability of the optimal alignment
gives a score for how well a sequence fits into the
protein family.
- Probability of an alignment of a sequence to a
model
Practical aspects: log-likelihood (LL) scoring
Log-odds score, i.e. comparison to a random
model
Example: modelling and searching for
globins
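The formula for the probability of an alignment is not preserved on this slide. In the standard notation of Durbin et al. it can be written as below, where a_{kl} (transition probability from state k to state l, with 0 the begin state and pi_{L+1} the end state), M (the family model) and R (the random model) are symbols assumed here, and e_k(b) is the emission probability from slide 7. For silent delete states the emission factor is simply omitted.

```latex
\begin{align}
P(x, \pi) &= a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \\
\pi^{*}   &= \operatorname*{argmax}_{\pi} P(x, \pi) \\
S(x)      &= \log \frac{P(x \mid M)}{P(x \mid R)}
\end{align}
```

The second line is what the Viterbi algorithm computes; the third is the log-odds score that compares the model with a random model.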
9 Profile HMM
Ref.: Durbin, Eddy, Krogh, Mitchison,
Biological Sequence Analysis, Cambridge
University Press.
10 Comparison to 1-d profiles
Multiple sequence alignment of Fig. 5.3
f(j,b): probability of amino acid b at position
j
Profile p: the expected score for a given
sequence y_j to fit into the family
Parameters for the
HMM (ad hoc rule: pseudocounts for residues not
observed)
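A sketch of how f(j,b) might be estimated from alignment columns with pseudocounts, and how an ungapped sequence could be scored against the resulting profile. The toy alignment is made up (it is not the alignment of Fig. 5.3), and the pseudocount of 1 per residue and the uniform background are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """f(j, b) for every column j, adding a pseudocount to every residue b."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({b: (counts[b] + pseudocount) / total for b in AMINO_ACIDS})
    return profile


def log_odds_score(seq, profile, background=None):
    """Sum over positions j of log2(f(j, y_j) / q(y_j)) for an ungapped sequence."""
    q = background or {b: 1 / len(AMINO_ACIDS) for b in AMINO_ACIDS}
    return sum(math.log2(profile[j][b] / q[b]) for j, b in enumerate(seq))


# Toy alignment, invented for this sketch.
alignment = ["VGA-H", "VNA-H", "VEAGH", "IAADN"]
profile = column_frequencies(alignment)
```

Because of the pseudocounts, a residue never seen in a column still gets a small non-zero probability, so `log_odds_score` stays finite for any sequence of the right length.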
11 Log-likelihood score and log-odds score
Ref.: Durbin, Eddy, Krogh, Mitchison,
Biological Sequence Analysis, Cambridge
University Press.
12 Z-score
13 Concepts of protein structure prediction
- Why is there a need for protein structure
prediction?
- the sequence of a protein is easily available
- the determination of 3D structures is still a
slow process
- Energy-based methods
- the free energy of the protein in the native state
is minimal
- Anfinsen experiment
- ab initio structure prediction is still an
unsolved problem
- holy grail of computational biology
- Knowledge-based methods
- parameters are extracted from currently known 3D
structures
- examples
- secondary structure prediction
- fold recognition methods (threading)
- knowledge-based force field terms are added to
the free energy term
14 (2) Prediction of secondary structure and long-range
contacts
- Secondary structure prediction: derive propensity
values of residues from statistical analysis of
residues in known secondary structures
- More sophisticated methods: neural networks,
combined prediction from MSA and HMM
- Long-range contacts
- Tree-determinant residues
- Motifs
- Correlated mutations
15 Comparative Homology Modeling
- 3D structures of proteins come in families and
superfamilies
- E.g. SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
- Families: high sequence identity (> 35%), same
functional residues
- Superfamilies: similar 3D fold, some common
functional motifs
- No universal definition of superfamilies
- Folds: similar 3D fold
- Rule of thumb: if two proteins have an alignment
with a sequence identity > 30%, they have the same
fold.
- More sophisticated methods for fold recognition:
3D profiles or threading
- Steps
- for a target sequence, find a homologous PDB
template structure,
- make an optimum alignment between the target
and template sequences,
- generate the tertiary structure of the
target using the template geometry.
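The rule of thumb above can be checked on a given pairwise alignment with a small helper. This is a sketch: the function names are hypothetical, and counting identity only over columns where neither sequence has a gap is one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def same_fold_expected(aligned_a, aligned_b, threshold=30.0):
    """Apply the rule of thumb: identity above the threshold suggests the same fold."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

For example, `percent_identity("ACDE-FG", "ACDQKFG")` compares six ungapped columns, five of which are identical.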
16 Additional considerations
What is the secondary structure?
Is it homologous to other protein sequences?
Is it homologous to other protein structures?
What is the best sequence alignment between your target protein
and homologous PDB structures?
Examine the regions of insertions and deletions. Are they
located in the loop regions? On the surface? Is
the region hydrophobic or hydrophilic?
The PDB template might have functional sites and
established motifs. Does your target sequence have
the same features?
If disulphide bridges are
present in the PDB template, are the cysteine
residues aligned?
17 Methods for prediction
- Classic method (Chou & Fasman, 1985)
- simplified rules
- separate amino acids into groups of helix
(β-strand) formers and breakers
- search for clusters of formers (four helix formers out
of six contiguous residues; three β-strand formers out of
five residues); extend the segments in both
directions until a tetrapeptide of breakers is
found
- later improvements
- Garnier-Osguthorpe-Robson (GOR) method
- the influence of the residue at position j on the secondary
structure in the neighborhood of the residue is
included
- the main effect is statistically found in the range
j ± 8
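A sketch of the simplified helix rule described above: find a window of six residues with at least four formers, then extend in both directions until a tetrapeptide of breakers is met. The residue classes below are illustrative assumptions, not the published Chou-Fasman tables, and the later refinements of the method are left out.

```python
# Assumed residue classes, for illustration only.
HELIX_FORMERS = set("EMALKFQWIV")
HELIX_BREAKERS = set("PGNY")


def _all_breakers(chunk, breakers, size):
    """True if chunk is a full-length run consisting only of breakers."""
    return len(chunk) == size and all(res in breakers for res in chunk)


def find_helix_segments(seq, formers=HELIX_FORMERS, breakers=HELIX_BREAKERS,
                        window=6, min_formers=4, stop=4):
    """Return half-open (start, end) index pairs of predicted helix segments."""
    segments = []
    i = 0
    while i + window <= len(seq):
        if sum(res in formers for res in seq[i:i + window]) < min_formers:
            i += 1
            continue
        start, end = i, i + window  # nucleation window found
        floor = segments[-1][1] if segments else 0
        # Extend right until the next `stop` residues are all breakers.
        while end < len(seq) and not _all_breakers(seq[end:end + stop], breakers, stop):
            end += 1
        # Extend left in the same way, without overlapping the previous segment.
        while start > floor and not _all_breakers(seq[max(0, start - stop):start],
                                                  breakers, stop):
            start -= 1
        segments.append((start, end))
        i = end
    return segments
```

Note that a segment can include a few flanking non-formers, because the nucleation window itself only needs four formers out of six.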
18 Improvements of the methods
Major improvements: larger databases; multiple
sequence alignments; neural network
methods; consensus prediction; meta servers
19 Neural network
Topology of a neural network: each node
represents a number between 0 and 1; nodes of the
input layer I_n, hidden layer, output layer
Switch function
The parameters w and b are determined by training
the net. If w_kn is very large, node k reacts
very sensitively to the value of node n.
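The switch function itself is not preserved on this slide. A common choice, assumed here, is the logistic sigmoid, which keeps every node value between 0 and 1. The sketch computes the value of one node k from the nodes n feeding into it, using weights w_kn and bias b_k.

```python
import math


def switch(z):
    """Logistic sigmoid (assumed switch function): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias):
    """Value of node k: switch(sum_n w_kn * I_n + b_k)."""
    return switch(sum(w * x for w, x in zip(weights, inputs)) + bias)
```

With a large weight such as 50 (and bias -25), changing the input from 0.4 to 0.6 flips the node from almost 0 to almost 1, which illustrates the sensitivity mentioned above.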
20 Training of the network
The training of the network requires (many) pairs
(I_p, O_p) given to the network, and adjusting w
and b to obtain an optimal fit. The number of
training patterns should be at least 3 to 5 times
larger than the number of adjustable parameters w
and b, to avoid overfitting. The network learns
by minimizing an error function T,
where O_p^calc is calculated by the NN from I_p.
- T can be minimized by an iterative process
- choose w and b randomly
- change w and b according to the steepest descent
method
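The iterative procedure can be sketched for a single sigmoid node. The error function is not preserved on the slide, so the sum of squared differences between calculated and target outputs is assumed here for T. The learning rate, step count and the logical-OR training patterns are made up for the example.

```python
import math
import random


def predict(weights, bias, inputs):
    """Calculated output O_calc of a single sigmoid node."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def error(weights, bias, patterns):
    """Assumed error function: T = sum_p (O_p_calc - O_p)**2."""
    return sum((predict(weights, bias, i) - o) ** 2 for i, o in patterns)


def train(patterns, n_inputs, rate=0.5, steps=5000, seed=0):
    rng = random.Random(seed)
    # Choose w and b randomly.
    weights = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    bias = rng.uniform(-1, 1)
    for _ in range(steps):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, target in patterns:
            out = predict(weights, bias, inputs)
            delta = 2 * (out - target) * out * (1 - out)  # dT/dz for this pattern
            for n, x in enumerate(inputs):
                grad_w[n] += delta * x
            grad_b += delta
        # Steepest descent: step against the gradient of T.
        weights = [w - rate * g for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b
    return weights, bias


# Toy training patterns (I_p, O_p): logical OR.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(patterns, n_inputs=2)
```

After training, `error(weights, bias, patterns)` should be small and rounding `predict(...)` should reproduce the target outputs.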
21 Secondary structure prediction by NN
(A) Basic scheme
Output layer: represents the
secondary structure, i.e. α, β, c
Input layer:
each amino acid is treated as a separate node (20
nodes), with additional nodes for gaps and
insertions
(B) Topology of the network
First level, sequence to structure: the window X_{i-6} to
X_{i+6} determines the secondary structure at
position i
Second level, structure to structure: a
window of predicted secondary structures with mixed
predictions, e.g. ααβααβcααββαααc, determines a
helical or β-strand region
Third level: jury
decision over independently trained networks
Ref.: Rost, B. PHD: predicting one-dimensional
protein structure by profile-based neural
networks. Methods in Enzymology 266:525-539, 1996
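A sketch of the input encoding for the first (sequence-to-structure) level: the 13 residues from i-6 to i+6 are each turned into a block of input nodes. A plain one-hot code with a single extra gap node is an assumption made here for simplicity. The cited method works on profiles from multiple alignments rather than on a single sequence.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: one extra node marks gaps and window positions beyond the sequence ends.
SYMBOLS = AMINO_ACIDS + "-"


def encode_window(seq, i, half_width=6):
    """One-hot encode residues i-6 .. i+6 as a flat list of 13 * 21 input values."""
    vector = []
    for j in range(i - half_width, i + half_width + 1):
        residue = seq[j] if 0 <= j < len(seq) else "-"
        vector.extend(1.0 if residue == s else 0.0 for s in SYMBOLS)
    return vector


inputs = encode_window("MKVLAAGIVGLK", 0)
```

For position 0 the first six window slots fall before the sequence start and are encoded as gaps. The central slot carries the residue whose secondary structure is being predicted.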
22 Secondary Structure Prediction Servers
- APSSP2: www.imtech.res.in/raghava/apssp2/
- Advanced Protein Secondary Structure Prediction
Server, GPS Raghava, Bioinformatics Center,
Chandigarh
- PSIPRED: bioinf.cs.ucl.ac.uk/psipred/index.html
- The PSIPRED Protein Structure Prediction Server,
D. T. Jones, Department of Computer Science,
University College London, UK.
- PROF: www.aber.ac.uk/phiwww/prof/
- University of Wales, Aberystwyth, Computational
Biology Group.
- PredictProtein: cubic.bioc.columbia.edu/predictprotein/
- The PredictProtein server, B. Rost, Columbia
University, NY.
- SAM-T02sec: www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
- HMM methods, K. Karplus, UCSC
- JPRED: www.compbio.dundee.ac.uk/www-jpred/
- A consensus method for protein secondary
structure prediction
- G. Barton, University of Dundee
23 Performance of secondary structure prediction
methods in CAFASP
Ref.: Eyrich et al. Proteins 53:548-560 (2003).
24 Performance of secondary structure prediction methods on a
larger set