Markov chains

Slides: 25

Provided by: werner6

1
Markov chains

Basic structure of a classical Markov chain
Example: DNA. Each letter A, C, G, T can be assigned
as a state, with transition probabilities
P(x_i = t | x_{i-1} = s)

The probability of each state x_i depends only on the
value of the preceding symbol x_{i-1}.
The probabilities of all possible sequences
sum to 1. A Markov chain describes a proper
probability distribution over the whole space of
sequences.
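
The factorization above can be sketched in a few lines of Python. The transition and initial probabilities below are hypothetical placeholders, not values from the slides, and the helper names are invented for illustration. The sketch also checks the normalization claim for sequences of one fixed length, assuming a chain without an explicit end state.

```python
import math
from itertools import product

# Hypothetical transition probabilities P(x_i = t | x_{i-1} = s); each row sums to 1.
TRANSITIONS = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.3},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
# Hypothetical probabilities for the first symbol.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def sequence_log_probability(seq, initial=INITIAL, transitions=TRANSITIONS):
    """Return log P(x) = log P(x_1) + sum over i of log P(x_i | x_{i-1})."""
    log_p = math.log(initial[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        log_p += math.log(transitions[prev][curr])
    return log_p


def total_probability(length):
    """Sum P(x) over every DNA sequence of the given length."""
    return sum(
        math.exp(sequence_log_probability("".join(symbols)))
        for symbols in product("ACGT", repeat=length)
    )
```

Working in log space avoids numerical underflow for long sequences; `total_probability` is only practical for very short lengths because it enumerates 4^length sequences.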
2
CpG islands: biological function and impact on
gene regulation

Genomic regions with a CpG dinucleotide content of
at least 60%.
Overall, genomes have a much lower CpG frequency
(1%) (CG suppression).
Methylation of CpG sites in the promoter of a
gene may inhibit the expression of that gene.
CpG islands typically occur at or near the
transcription start site of genes, particularly
in housekeeping genes of vertebrates.

3
Example: CpG island

Short stretches of 100-1000 nucleotides with
frequently occurring CpG dinucleotides.
Given a short stretch of genomic sequence, how
can one decide whether it comes from a CpG island?
How can one find CpG islands within a long
stretch of genomic sequence?
Learning sets:
48 CpG islands (+ set)
sequences outside CpG islands (- set)
Likelihood ratio test
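
A minimal sketch of the likelihood ratio test, assuming one first-order Markov chain has been trained on each learning set. The two matrices below are hypothetical placeholders (only the C row differs, with C to G favoured in the + model); they are not the values estimated from the 48 islands.

```python
import math


def uniform_row():
    """One transition row with equal probability for every base."""
    return {base: 0.25 for base in "ACGT"}


# Hypothetical "+" (CpG island) and "-" (background) transition matrices.
PLUS = {s: uniform_row() for s in "ACGT"}
MINUS = {s: uniform_row() for s in "ACGT"}
PLUS["C"] = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}   # C followed by G is common
MINUS["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}  # C followed by G is rare


def log_odds_score(seq, plus, minus):
    """Sum of log2(plus[s][t] / minus[s][t]) over adjacent pairs (s, t), in bits."""
    return sum(math.log2(plus[s][t] / minus[s][t]) for s, t in zip(seq, seq[1:]))
```

A positive score favours the + model and a negative score the - model. Dividing by the sequence length gives a per-nucleotide score that can be compared across stretches of different length.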

4
From Markov Chains to Hidden Markov Models (HMM)

How to find CpG islands in an unannotated sequence?
One approach: a moving window of 100 nucleotides.
Another approach: include both Markov chains
(model + and model -) in one model, i.e. switch
between the two models.
States: A+, C+, G+, T+ and A-, C-, G-, T-
A state no longer corresponds to a symbol:
A+ and A- both generate A, so from the symbol
A alone one cannot infer which state it came from
(hidden). The transition probabilities differ
between A+ and A-.
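
To illustrate why the states are hidden, here is a small sketch of the eight-state labelling. The function names are invented for this example; it only shows the many-to-one mapping from state paths to symbols and does not model any probabilities.

```python
from itertools import product

# The eight states of the combined model: A+, C+, G+, T+, A-, C-, G-, T-.
STATES = [base + sign for sign in "+-" for base in "ACGT"]


def emitted_symbols(path):
    """Each state emits its own base, so the observed symbol drops the +/- label."""
    return "".join(state[0] for state in path)


def compatible_paths(seq):
    """Enumerate every state path that emits seq; there are 2**len(seq) of them."""
    choices = [(base + "+", base + "-") for base in seq]
    return [list(path) for path in product(*choices)]
```

For example, the paths `["C+", "G+", "A-"]` and `["C-", "G-", "A-"]` both emit `CGA`, so the symbols alone do not reveal which model was active at each position.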
Graphically

5
Globin sequences
6
Hidden Markov Model for protein sequences

Three types of states:
match states m_j, insertion states i_j, gap (deletion) states d_j

For each line there is a transition probability.
The sequence of states (π) and the sequence of symbols
(x_j) are decoupled. The path of states is still a
Markov chain. A match state m_j emits a residue of the
sequence, i.e. it corresponds to a
column in a multiple sequence alignment.
7
Parametrization of an HMM

Each residue of a sequence aligned to a model is
associated with a match or an insert state.
An alignment of a sequence to a model is called a
path.
The alignment is not unique.
Emission probabilities e_k(b)

In practice, estimates for the emission
probabilities are taken from multiple sequence
alignments.
8
Searching with profile HMM

Given a sequence x, one has to find the most
probable path π (alignment) to the model:
Viterbi algorithm.
The probability of the optimal alignment
gives a score for how well a sequence fits into the
protein family.
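
A minimal sketch of the Viterbi recursion in log space, for a generic HMM in which every state emits a symbol. A full profile HMM also has silent deletion states, which this sketch does not handle. The argument names (`start_p`, `trans_p`, `emit_p`) are assumptions made for this example.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log-probability of the best path, best state path) for obs."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[k]: log-probability of the best path that ends in state k
    # after emitting the symbols seen so far.
    score = {k: log(start_p[k]) + log(emit_p[k][obs[0]]) for k in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for k in states:
            best = max(states, key=lambda j: prev[j] + log(trans_p[j][k]))
            score[k] = prev[best] + log(trans_p[best][k]) + log(emit_p[k][symbol])
            pointers[k] = best
        back.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

The function returns the log-probability of the single best path, which is the quantity the slide uses as a score; it is not the total probability of the sequence summed over all paths.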
Probability of an alignment of a sequence to a
model
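
The formula on this slide did not survive in the transcript. As a hedged reconstruction, the standard joint probability of a sequence x of length L and a path π, for an HMM in which every state emits, is given below. The transition symbol a_{kl} and the begin/end state 0 are notation assumed here; only e_k(b) appears in the slides. In a profile HMM, silent deletion states contribute a transition factor but no emission factor.

```latex
P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}},
\qquad \pi_{L+1} = 0
```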

Practical aspects: log-likelihood scoring
LL; log-odds score, i.e. comparison to a random
model. Example: modelling and searching for
globins
9
profile HMM
Ref - Durbin, Eddy, Krogh, Mitchison,
Biological sequence analysis, Cambridge
University Press.
10
Comparison to 1-d profiles
Multiple sequence alignment of Fig. 5.3
f(j,b): probability of amino acid b at position j
Profile p: the expected score for a given
sequence y_j to fit into the family
Parameters for HMM (ad hoc rule: pseudocounts for
residues not observed)
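
A minimal sketch of estimating f(j,b) from alignment columns with pseudocounts. It assumes the simplest add-a-constant rule for every residue type; the slide does not say which ad hoc rule was used, so the `pseudocount` parameter and the function name are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Return one dict per column with f[j][b] = (count + pc) / (n + 20 * pc)."""
    profile = []
    for column in zip(*alignment):
        # Count residues only; gap characters such as '-' are skipped.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {b: (counts[b] + pseudocount) / total for b in AMINO_ACIDS}
        )
    return profile
```

With a pseudocount of 1, a residue never seen in a column still gets a small non-zero probability, so a single unexpected residue does not drive the score of an otherwise good sequence to zero.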
11
Log likelihood score and log-odds score
Ref - Durbin, Eddy, Krogh, Mitchison,
Biological sequence analysis, Cambridge
University Press.
12
Z-score
13
Concepts of protein structure prediction

Why is there a need for protein structure
prediction ?
the sequence of a protein is easily available
the determination of 3D structures is still a
slow process
energy based methods
free energy of the protein in the native state
is minimal
Anfinsen experiment
ab initio structure prediction is still an
unsolved problem
holy grail of computational biology
knowledge-based methods
parameters are extracted from currently known 3D
structures
examples
secondary structure prediction
fold recognition methods (threading)
knowledge based force field terms are added to
free energy term

(2) prediction of secondary structure and long
range contacts
Secondary structure prediction: derive propensity
values of residues from statistical analysis of
residues in known secondary structures.
More sophisticated methods: neural networks,
combined prediction from MSA and HMM.
Long range contacts
Tree-determinant residues
Motifs
Correlated mutations

15
Comparative Homology Modeling

3D structures of proteins come in families and
superfamilies
E.g. SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
Families: high sequence identities (> 35%), same
functional residues
Superfamilies: similar 3D fold, some common
functional motifs
No universal definition of superfamilies
Folds: similar 3D fold
Rule of thumb: if two proteins have an alignment
with a sequence identity > 30%, they have the same
fold.
More sophisticated methods for fold recognition:
3D profiles or threading
Steps
- for a target sequence, find a homologous PDB
template structure,
- make an optimum alignment between the target
and template sequences,
- generate the tertiary structure of the
target using the template geometry.
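
The sequence-identity rule of thumb above can be checked on a pairwise alignment with a short helper. This is a sketch under one assumed convention: identity is counted over columns where neither sequence has a gap. Other conventions (for example dividing by the shorter sequence length) give different numbers.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)
```

Usage: pass the two rows of the target-template alignment, gaps included, and compare the result with the threshold you have chosen.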

16
Additional considerations
What is the secondary structure?
Is it homologous to other protein sequences?
Is it homologous to other protein structures?
What is the best sequence alignment between your target protein and homologous PDB structures?
Examine the regions of insertions and deletions. Are they located in the loop regions? On the surface? Is the region hydrophobic or hydrophilic?
The PDB template might have functional sites and established motifs. Does your target sequence have the same features?
If disulphide bridges are present in the PDB template, are the cysteine residues aligned?
17
Methods for prediction

Classic method (Chou Fasman, 1985)
simplified rules:
separate amino acids into groups of helix
(β-strand) formers and breakers
search for clusters of formers (four helix formers out
of six contiguous residues, three β-strand formers out of
five residues), then extend the segments in both
directions until a tetrapeptide of breakers is
found
later improvements
Garnier-Osguthorpe-Robson (GOR) method
the influence of the residue at position j on the secondary
structure in the neighborhood of the residue is
included
the main effect is statistically found in the range
j ± 8
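
The nucleation rule of the classic method (four helix formers out of six contiguous residues) can be sketched as a sliding-window count. The set of helix formers below is an illustrative assumption, not the method's full propensity table, and the sketch leaves out the extension step and the breaker test.

```python
# Illustrative assumption: a small set of residues treated as helix formers.
HELIX_FORMERS = set("AELM")


def helix_nuclei(seq, formers=HELIX_FORMERS, window=6, required=4):
    """Return start indices of windows holding at least `required` formers."""
    return [
        start
        for start in range(len(seq) - window + 1)
        if sum(res in formers for res in seq[start:start + window]) >= required
    ]
```

For β-strand nuclei the same function can be called with a strand-former set, `window=5` and `required=3`.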

18
Improvements of the methods
Major improvements:
larger databases
multiple sequence alignments
neural network methods
consensus prediction (meta servers)
19
Neural network
Topology of a neural network: each node
represents a number between 0 and 1.
Nodes of: input layer, hidden layer, output layer
Switch function
The parameters w and b are determined by training
the net. If w_kn is very large, node k is very
sensitive to the output of node n.
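
A minimal sketch of a single node, assuming the switch function is the logistic sigmoid. The slide's own formula is not in the transcript, so this choice is an assumption; it does map every input to a number between 0 and 1, as the slide requires.

```python
import math


def switch(z):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias):
    """Output of node k: switch(sum over n of w_kn * input_n + b_k)."""
    return switch(sum(w * x for w, x in zip(weights, inputs)) + bias)
```

With a large weight w_kn, a small change in input n moves the weighted sum a long way, which is why the node then reacts strongly to that input.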
20
Training of the network
The training of the network requires (many) pairs
of (I_p, O_p) given to the network, and adjusting w
and b to obtain an optimal fit. The number of
training patterns should be at least 3 to 5 times
larger than the number of adjustable parameters w
and b, to avoid overfitting. The network learns
by minimizing an error T between O_p and O_p^calc,
where O_p^calc is calculated by the NN from I_p.

T can be minimized by an iterative process:
- choose w and b randomly
- change w and b according to the steepest descent
method
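
The two training steps can be sketched for a single sigmoid node. This assumes the error is the sum of squared differences between calculated and target outputs, which is a common choice but is not confirmed by the transcript; the learning rate, epoch count and function names are likewise assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(inputs, w, b):
    """Calculated output of the node for one input pattern."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, inputs)) + b)


def error(patterns, w, b):
    """Assumed error: sum over patterns of (calculated - target) ** 2."""
    return sum((predict(x, w, b) - target) ** 2 for x, target in patterns)


def train(patterns, n_inputs, rate=0.5, epochs=2000, seed=0):
    """Steepest descent: start from random w and b, then follow -gradient."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, target in patterns:
            out = predict(inputs, w, b)
            # Derivative of (out - target)^2 with respect to the weighted sum.
            delta = 2.0 * (out - target) * out * (1.0 - out)
            for i, xi in enumerate(inputs):
                grad_w[i] += delta * xi
            grad_b += delta
        w = [wi - rate * g for wi, g in zip(w, grad_w)]
        b -= rate * grad_b
    return w, b
```

Calling `train` with `epochs=0` returns the random starting parameters, which makes it easy to compare the error before and after training.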

21
Secondary structure prediction by NN
(A) Basic scheme
Output layer: represents the secondary structure, i.e. α, β, c
Input layer: each amino acid is treated as a separate node, 20 nodes; additional nodes for gaps and insertions
(B) Topology of the network
First level, sequence to structure: the window X_{i-6} to X_{i+6} determines the secondary structure at position i
Second level, structure to structure: a window of predicted secondary structures with mixed predictions, e.g. ααβααβcααββαααc, determines a helical or β-strand region
Third level: jury decision over independently trained networks
Ref: Rost, B. PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods in Enzymology 266:525-539, 1996
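
A sketch of the input layer for the first level, assuming a simplified one-hot encoding: 20 units per window position for the amino acids plus one extra unit used for gaps and for positions beyond the sequence ends. The cited method feeds profile information rather than a single sequence, so this is an illustration of the window idea only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNITS = AMINO_ACIDS + "-"  # 21 units per position; '-' is the extra unit


def encode_window(seq, i, half_width=6):
    """One-hot encode residues i-half_width .. i+half_width around position i."""
    vector = []
    for pos in range(i - half_width, i + half_width + 1):
        symbol = seq[pos] if 0 <= pos < len(seq) else "-"
        if symbol not in UNITS:
            symbol = "-"  # unknown characters fall back to the extra unit
        vector.extend(1.0 if symbol == unit else 0.0 for unit in UNITS)
    return vector
```

With the default half width of 6, the window covers 13 positions and the input vector has 13 * 21 = 273 entries, exactly one of which is set per position.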
22
Secondary Structure Prediction Servers

APSSP2: www.imtech.res.in/raghava/apssp2/
Advanced Protein Secondary Structure Prediction
Server, GPS Raghava, Bioinformatics Center,
Chandigarh
PSIPRED: bioinf.cs.ucl.ac.uk/psipred/index.html
The PSIPRED Protein Structure Prediction Server,
D. T. Jones, Department of Computer Science,
University College London, UK.
PROF: www.aber.ac.uk/phiwww/prof/
University of Wales, Aberystwyth, Computational
Biology Group.
PredictProtein: cubic.bioc.columbia.edu/predictprotein/
The PredictProtein server, B. Rost, Columbia
University, NY.
SAM-T02sec: www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
HMM methods, K. Karplus, UCSC
JPRED: www.compbio.dundee.ac.uk/www-jpred/
A consensus method for protein secondary
structure prediction,
G. Barton, University of Dundee

23
Performance of secondary structure prediction
methods in CAFASP
Ref: Eyrich et al., Proteins 53:548-560 (2003).
24
Performance of secondary structure prediction
methods on a larger set
