Title: Markov Models and HMM in Genome Analysis
1Markov Models and HMMin Genome Analysis
Journée Statistique et Santé Paris 5 5 mai 2006
La genopole Evry France
prum_at_genopole.cnrs.fr
2Why Markov Models ?
A biological sequence X (X1, X2, ,
Xn) where Xk ??A t , c , a , g or A C
D E F G H I K L M N P Q R S T V W Y A very
common tool for analyzing these sequences is the
Markov Model (MM) P(Xk v Xj , j lt k)
P(Xk v Xk 1) u, v
??A denoted by p(u , v) if Xk 1 u
3Why MM ? 2
A complex, called Rec BCD, protects the cell
against viruses
Exemple
Rec BCD
E. coli
viruses
own bacteria genome
chi
To avoid the destruction of the genome of the
cell, along the genome exists a password gctggtgg
(it is called chi). When rec BCD bumps into the
chi, it stops its destruction. In order to be
efficient the number of occurrences of the chi is
much higher that the number predicted in a Markov
model.
4Why MM ? 3
The  exceptional character of a word depends on
the frequency of each letter ? model M0
(Bernoulli)
Exemple HIV1 t 2164 c 1773 a
3410 g 2370
5Why MM ? 3
The  exceptional character of a word depends on
the frequency of each letter ? model M0
(Bernoulli), the frequency of 2-words tt , tc
ta, tg, gg ? model M1 (Markov)
Exemple HIV1 t c a g t 548 342
684 590 2164 c 470 413 795
95 1773 a 713 561 1112 1024 3410 g 432 457
820 661 2370
6Results MM
7Hidden Markov Models
An important criticism against Markov
modelization is its stationarity a well known
theorem says that, under weak conditions, P(Xk
u) ?? µ(u) (when k ?? 8) (and the rate
of convergence is exponential.) But
biological sequences are not homogeneous.
There are gc rich segments / gc poor segments
(isochores). One may presume (and verify)
that the rules of succession of letters differ
in coding parts / non-coding parts.
Is it possible to take avantage of this problem
and to develop a tool for the
analysis of heterogeneity ?
gt annotation
8HMM 2
Suppose that d states alternate along the
sequence
Sk 1 Sk 2 Sk 1
Sk 2 Sk 1
And in each state we have a MC if Sk 1,
then P(Xk v Xk1 u) p1(u v) if Sk
2, then P(Xk v Xk1 u) p2(u v) and
(more technical than biological - see
HSMM) P(Sk y Sk1 x) p0(u v)
Estimate the parameters p1, p2, p0 Allocate
a state 1, 2 to each position
Our objectives
9HMM 3
Use the likelihood !!
L(q) ? µ0(S1) µS (X1) ....
1
...? p0(Sk-1,Sk) pk (Xk-1,Xk)
n terms (length of the sequence)
over all possibilities S1S2...Sn there are sn
terms
Désespoir !!!
210 000 103 000
10back to HMM
Step n 1
We do not know what is S 1 and what is Sk 2
We make an arbitrary allocation of these states
Sk 1 Sk 2 Sk 1
Sk 2 Sk 1
Knowing the states, it is obvious to compute
the parameters
exple p1(c,g) of the c which are followed
by a g in green parts.
p0(,) of which are followed by a
11Baum Churchill formula
Step n 2
Define the predictive probability ak(v) P(Sk
v X1k-1) the filtragee probability bk(v)
P(Sk v X1k) the estimated probability jk(v)
P(Sk v X1n)
Bayes formula
Forward (k 2 to n)
ak(v) ?u bk-1(u) p0 (u, v)
ak-1(u) pu (Xk-2, Xk-1)
bk-1(u)
?w ak-1(w) pw (Xk-2, Xk-1)
Backward (k n to 2)
jk(v)
jk-1(v) bk-1(u) ?v p0 (u, v)
bk(v)
12and Sk ?
Bad idea Sk arg max jk(v) () First good
idea keep the distribution jk(v)
EM Second good idea draw Sk
according to jk(v) SEM
() except, may be at the end of the algorithm
(freezing)
13Annotation
14SHMM
(
)
New defect transition between states is
described using a Markov model
1 - p p q 1 - q
p0
Consequence the length of segments of a given
colour are r.v. Expo()
15SHMM
Does not correspond to the reality ! ! Histograms
of biological segments (after smoothing) look
more like
density g(h)
h
It is easy to make the probability of leaving the
state depend on h to get the suitable law
g(h)
p(h)
1 - G(h)
16Searching nucleosomes
What are nucleosomes ?
In eukaryotes (only), an important part of the
chromosomes forms chromatine, a state where the
double helix winds round beads forming a collar
10 nm
Each bead is called a nucleosome. Its core is a
complex involving 8 proteins (an octamer) called
histone (H2A, H2B, H3, H4). DNA winds twice this
core and is locked by an other histone (H1). The
total weight of the histones is equal to the
weight of the DNA.
17Nucleosomes
Curvature within curvature
Back to one nucleosome the DNA helix turns
twice around the histone core. Each turn
corresponds to about 7 pitches of the helix, each
one made with about 10 nucleotides. Total 146
nt within each nucleosome.
Depending on the position (invs out) the
curvature satisfies different constraints
18Bendability
Following an idea (Baldi, Lavery,...) we
introduce an indice of bendability it
depends on succession of 2, 3, 4,
...di-nucleotides.
g
a
t
t
a
g
c
t
a
c
t
a
q
19PNUC table
2nd letter a t g c a 0.0 2.8 3,3 5.2 a t 7
,3 7,3 10,0 10,0 a g 3,0 6,4 6,2 7,5
a c 3,3 2,2 8,3 5,4 a a 0.7 0.7 5,8 5,8 t t 2.
8 0.0 5.2 3,3 t g 5.3 3.7 5.4 7,5 t c 6,7 5.2 5.
4 5.4 t 1rst letter 3rd
letter a 5.2 6,7 5.4 5.4 g t 2,2 3,3 5.4 8,3 g
g 5.4 6,5 6,0 7,5 g c 4,2 4,2 4,7 4,7 g a 3.7 5
.3 7,5 5.4 c t 6,4 3,0 7,5 6,2 c g 5.6 5.6 8.2 8
.2 c c 6,5 5.4 7,5 6,0 c
There exist various tables which indicate the
bendability of di-, tri or even tetra-nucleotides
(PNUC, DNase, ...)
We used PNUC-3
PNUC(cga) 8,3
PNUC(tcg) 8,3
() Goodsell, Dickerson, NAR 22 (1994)
20HMM for nucleosomes
Better consider a different state for each
position in the nucleosome (say 146 states)
1 2 3
. . .
145 146
and the repetition of r (say 4) identical states
for the between-nucleosome regions (
spacer). These brother states give to the law of
the length of the b.n. regions a Gamma form which
is not geometrical ! !
21A no-nuc state
Trifonov (99) as well as Rando (05) underline
that there are no nucleosome in the gene
promotors (accessibility) The introduce before
nucleosome a no-nucleosme state.
. . .
1 2
70
nucleosome core
no-nuc
spacer
Ioshikhes, Trifonov, Zhang Proc. Natl Acad. Sc.
96 (1999) Yuan, Liu, Dion, Slack, Wu, Altschuler,
Rando, Sciencexpress (2005)
22(No Transcript)
23Acknowledgements
Labo Statistique et Génome Labo MIG
INRA Maurice BAUDRY Philippe BESSIÈRE Etienne
BIRMELE François RODOLPHE Cécile COT
Sophie SCHBATH Mark HOEBEKE Élisabeth de
TURCKHEIM Mickael GUEDJ François KÉPÈS Sophie
LEBRE Labo AGRO Catherine MATIAS Jean-Noël
BACRO Vincent MIELE Jean-Jacques
DAUDIN Florence MURI-MAJOUBE Stéphane
ROBIN Grégory NUEL Hugues RICHARD Anne-Sophie
TOCQUET Lab Rouen Dominique
CELLIER Nicolas VERGNE Dominique CELLIER Sec
Michèle ILBERT Sabine MERCIER