Title: Context-specific Independence Mixture Modelling for Protein Families
1Context-specific Independence Mixture Modelling
for Protein Families
- Benjamin Georgi
- Jörg Schultz
- Alexander Schliep
2Introduction
- Protein families fall into subfamilies of similar but distinct function
- Specific function of subfamilies is often determined by a small number of residues (functional residues)
- Example: malate / lactate dehydrogenase, where a single residue determines specificity
3Functional Positions
- Multiple sequence alignment (MSA)
- Two subfamilies, three functional positions
- Strong signal of subgroup-specific conservation
[Figure: example MSA with groups labelled F1 and F2]
5Introduction
- Problem: clustering of protein families with simultaneous prediction of functional residues
- Prior approaches
- Mostly supervised, requiring additional prior knowledge
- Mostly based on phylogenetic trees
- Our approach: first unsupervised method that does not require a tree
- Context-specific independence (CSI) mixture models
6Context-specific independence mixture models
7Mixture Models
- Given random variable
- represents row in a MSA
- K component mixture density
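As a minimal sketch of a K-component mixture density over alignment rows, assume the usual naive-Bayes mixture form, P(x) = sum_k pi_k * prod_j P(x_j | theta_kj), with one discrete distribution per component and column. The function name and the `weights` / `params` layout are illustrative assumptions, not the Pymix API.

```python
import math

def mixture_log_density(row, weights, params):
    """Log-density of one MSA row under a K-component naive-Bayes mixture.

    row     -- string of residues, one per alignment column
    weights -- K mixture weights summing to 1
    params  -- params[k][j] maps residue -> probability for component k
               and column j (hypothetical layout, chosen for illustration)
    """
    log_terms = []
    for pi_k, component in zip(weights, params):
        lp = math.log(pi_k)
        for j, residue in enumerate(row):
            lp += math.log(component[j][residue])
        log_terms.append(lp)
    # log-sum-exp keeps the sum over components numerically stable
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Toy example (made-up numbers): alignment of length 2, two components
weights = [0.5, 0.5]
params = [
    [{"L": 0.9, "N": 0.1}, {"E": 0.8, "D": 0.2}],
    [{"L": 0.1, "N": 0.9}, {"E": 0.2, "D": 0.8}],
]
density_LE = math.exp(mixture_log_density("LE", weights, params))
```

The density of the row "LE" is the weighted sum of the two component products, and the densities of all four possible rows sum to one.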
8Mixture Models
- Example: MSA of length 4, 3-component mixture
9Mixture Models
10Mixture Models
- Model structure matrix: one atomic distribution per feature and component
11CSI Mixture Models
- Model structure matrix: variable number of parameters for each feature
- reduction of model complexity
- the contribution of noise features is negated
- avoids overfitting
- highly descriptive model due to block structure
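One way to picture the reduction in model complexity is to store, for each feature, a partition of the components into groups that share one distribution, and then count free parameters. The sketch below assumes discrete distributions over a 20-letter amino acid alphabet; the representation and the name `count_parameters` are assumptions made for illustration.

```python
def count_parameters(structure, alphabet_size):
    """Free parameters of a mixture whose per-feature structure is a
    partition of the components.

    structure[j] is a list of groups of component indices for feature j,
    e.g. [[0, 1], [2]] means components 0 and 1 share one distribution.
    """
    k = len({c for group in structure[0] for c in group})
    free = k - 1  # mixture weights
    for partition in structure:
        # one discrete distribution per group
        free += len(partition) * (alphabet_size - 1)
    return free

K, P = 3, 4  # 3 components, alignment of length 4
conventional = [[[c] for c in range(K)] for _ in range(P)]
csi = [
    [[0], [1], [2]],  # every component has its own distribution
    [[0, 1], [2]],    # components 0 and 1 share a distribution
    [[0, 1, 2]],      # noise feature: one distribution for all components
    [[0], [1, 2]],
]
n_conventional = count_parameters(conventional, alphabet_size=20)
n_csi = count_parameters(csi, alphabet_size=20)
```

In this toy structure the CSI model needs 8 distributions instead of 12, and the third feature no longer contributes to distinguishing the components.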
12CSI Mixture Models
- Model structure matrix: variable number of parameters for each feature
- Use model structure to predict functional positions
13CSI for Protein Families
[Figure: example alignment with labels F1, F2, C1, C2]
14CSI for Protein Families
- CSI Structure matrix (idealized)
[Figure: structure matrix with labels F1, F2, C1, C2]
15CSI for Protein Families
- CSI Structure matrix (more realistic)
[Figure: structure matrix with labels F1, F2, C1, C2]
16Prediction of Functional Residues
- Rank features by importance for characterization of a cluster
- Ranking of feature j for component i by
- Take highest ranking features as putative functional residues
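The scoring function used for the ranking is not given in the text here. Purely as an illustration of ranking features per component, the sketch below substitutes an assumed score: the Kullback-Leibler divergence between a component's column distribution and the distribution of the remaining components. Both the score and the function names are assumptions, not the authors' criterion.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two distributions given as dicts over the same symbols.
    Assumes q[a] > 0 wherever p[a] > 0."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

def rank_features(component, others):
    """Rank columns by the ASSUMED score KL(component || others).

    component[j], others[j] -- column distributions for feature j.
    Returns column indices, most discriminative first.
    """
    scores = [kl_divergence(p, q) for p, q in zip(component, others)]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy distributions over a two-letter alphabet (made-up numbers)
comp = [{"A": 0.5, "G": 0.5}, {"A": 0.9, "G": 0.1}, {"A": 0.6, "G": 0.4}]
rest = [{"A": 0.5, "G": 0.5}, {"A": 0.1, "G": 0.9}, {"A": 0.5, "G": 0.5}]
ranking = rank_features(comp, rest)
```

Column 1 ranks first because the component's distribution differs most from the rest there; column 0 is identical in both and ranks last.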
17CSI mixture structure learning
18CSI Structure Learning
- Question: How to learn a CSI structure from data?
- Bayesian approach: score models by posterior distribution
19CSI Structure Learning
Structure prior
Model posterior
Likelihood
Parameter prior
20CSI Structure Learning
- Model posterior: criterion used to select structure and parameter estimates
21CSI Structure Learning
- Likelihood: probability of the data under the mixture model
22CSI Structure Learning
- Parameter prior: typically uninformative, acts as pseudocounts, conjugate Dirichlet distribution
23CSI Structure Learning
- Structure prior: introduces preference for simpler models, acts as a regularizer, simple factored form
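The four labelled terms fit the generic Bayesian factorization below. The notation is an assumption: M = (S, θ) denotes a model with structure S and parameters θ, and D the data. The slides' own formula may use different symbols.

```latex
P(M \mid D) \;\propto\;
  \underbrace{P(D \mid M)}_{\text{likelihood}} \cdot
  \underbrace{P(\theta \mid S)}_{\text{parameter prior}} \cdot
  \underbrace{P(S)}_{\text{structure prior}},
\qquad M = (S, \theta)
```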
24Learning Algorithm
- Structural EM framework (Friedman 1998)
- Efficiently score candidate structures based on the expected sufficient statistics
- Perform greedy search over structure space to arrive at final structure
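A rough sketch of the greedy search for a single feature, given expected sufficient statistics from an E-step: start with every component in its own group and merge groups while the score improves. The scoring choices are assumptions made for illustration (Dirichlet-multinomial marginal likelihood under a symmetric prior, plus a constant log-penalty per group standing in for the structure prior). This is not the Pymix implementation.

```python
import math
from itertools import combinations

def log_marginal(counts, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of (expected) counts
    under a symmetric Dirichlet(alpha) prior."""
    n, a = sum(counts), alpha * len(counts)
    out = math.lgamma(a) - math.lgamma(a + n)
    for c in counts:
        out += math.lgamma(alpha + c) - math.lgamma(alpha)
    return out

def score(groups, stats, log_group_penalty):
    """Score of one feature's partition: pooled marginal likelihood per
    group plus an ASSUMED structure prior charging each group."""
    total = 0.0
    for g in groups:
        pooled = [sum(stats[c][a] for c in g) for a in range(len(stats[0]))]
        total += log_marginal(pooled) + log_group_penalty
    return total

def greedy_merge(stats, log_group_penalty=-2.0):
    """stats[c][a]: expected count of symbol a in component c for this feature."""
    groups = [(c,) for c in range(len(stats))]
    best = score(groups, stats, log_group_penalty)
    while len(groups) > 1:
        candidates = []
        for g1, g2 in combinations(groups, 2):
            merged = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
            candidates.append((score(merged, stats, log_group_penalty), merged))
        top_score, top_groups = max(candidates, key=lambda t: t[0])
        if top_score <= best:
            break  # no merge improves the score
        best, groups = top_score, top_groups
    return sorted(tuple(sorted(g)) for g in groups)

# Toy counts over two symbols: components 0 and 1 look alike, 2 differs
structure = greedy_merge([[10, 0], [9, 1], [0, 10]])
```

In a structural EM loop this search would be repeated per feature after each E-step, with the parameters re-estimated for the chosen groups.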
25CSI mixtures for protein data
26Mixtures for Proteins
- Using sequences to infer structural properties of the proteins
- Different amino acid substitutions have different structural impacts
- Need notion of amino acid similarity in the model to guide the structure learning
- Remark: related to substitution matrices in phylogeny
27Conceptual problem
- Example
- Aspartate (D) and Glutamate (E) are much more similar than Leucine (L) and Asparagine (N)
- E/D column might best be modeled with a single distribution in the structure
F1: L E / L E / L E / L E
F2: N D / N D / N D / N D
28Amino acid properties
Livingstone and Barton (1993)
29Amino acid properties
30Model extension
- Need to integrate AA properties into
- Idea: construct parameter prior which defines appropriate density
- Dirichlet mixture priors
31Dirichlet Distribution
- Defines density over discrete distributions
- Example preference for
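To make "density over discrete distributions" concrete, here is a small sketch that evaluates the Dirichlet log-density. The three-letter alphabet and the parameter values are made up for illustration. Larger alpha entries express a preference for distributions that put mass on the corresponding symbols.

```python
import math

def dirichlet_log_pdf(p, alpha):
    """Log-density of a Dirichlet(alpha) at the discrete distribution p.
    Assumes all entries of p are strictly positive and sum to 1."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p))

# Uninformative prior: every distribution is equally likely
flat = dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])

# Prior preferring the first symbol (toy values)
alpha = [5.0, 1.0, 1.0]
favoured = dirichlet_log_pdf([0.8, 0.1, 0.1], alpha)
disfavoured = dirichlet_log_pdf([0.1, 0.8, 0.1], alpha)
```

Under the toy prior, a distribution concentrated on the first symbol gets a higher density than one concentrated on the second.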
32Dirichlet Mixture Priors (DMP)
- Mixture of several Dirichlet distributions as parameter prior
- instead of a single Dirichlet as in the uninformative case
- Allows for different contexts, preferences in the density over the parameter space
- Fits seamlessly into the mixture framework
- How to model AA properties with DMP?
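The following sketch shows how a Dirichlet mixture prior can shape a parameter estimate, using the standard posterior-mean formula for Dirichlet mixtures. The three-letter alphabet (D, E, L) and the two prior components are toy assumptions, not the property-based prior described in the talk.

```python
import math

def log_beta(alpha):
    """Log of the multivariate Beta function."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def dmp_posterior_mean(counts, weights, alphas):
    """Posterior mean of a discrete distribution under a Dirichlet mixture
    prior with mixing weights `weights` and parameter vectors `alphas`."""
    # posterior responsibility of each Dirichlet component given the counts
    log_post = [
        math.log(w) + log_beta([a + n for a, n in zip(alpha, counts)]) - log_beta(alpha)
        for w, alpha in zip(weights, alphas)
    ]
    m = max(log_post)
    post = [math.exp(lp - m) for lp in log_post]
    z = sum(post)
    post = [x / z for x in post]
    # mix the per-component pseudocount estimates
    total = sum(counts)
    estimate = [0.0] * len(counts)
    for r, alpha in zip(post, alphas):
        denom = total + sum(alpha)
        for i, (a, n) in enumerate(zip(alpha, counts)):
            estimate[i] += r * (n + a) / denom
    return estimate

# Alphabet order: D, E, L. One component favours the acidic pair D/E,
# the other favours L (toy values).
alphas = [[5.0, 5.0, 0.1], [0.1, 0.1, 5.0]]
estimate = dmp_posterior_mean([3, 0, 0], [0.5, 0.5], alphas)
```

Although only D was observed, the estimate gives E far more probability than L, which is the kind of similarity-aware smoothing a property-based prior is meant to provide.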
33DMP for Amino acids
34DMP for Amino acids
Values of X and . for each component chosen by
heuristic
35DMP for Amino acids
- Yields probabilistic representation of amino acid property hierarchy as DMP
- Drives parameter estimation and structure learning to be consistent with this notion of similarity
- Yields improvement of model performance for protein clustering
36Results
37Data sets
- Method evaluation on well-studied families with known subgroups
- Clustering and prediction of functional residues
- Malate / Lactate dehydrogenase (MDH/LDH)
- Guanylyl / Adenylyl cyclases (GC/AC)
- (Serine/Threonine and Tyrosine Protein kinases)
38Malate / Lactate Dehydrogenase
- Oxidoreductase, part of citrate cycle
- Small, clean PFAM seed alignment for MDH/LDH NAD binding domain
- 29 sequences (13 MDH, 16 LDH)
- MSA of length 141 (after filtering highly gapped columns, > 0.33 gaps)
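The gap filtering step can be sketched as follows, assuming the alignment is a list of equal-length strings with "-" as the gap character. The function name and input format are illustrative; the 0.33 threshold is the one quoted above.

```python
def filter_gapped_columns(msa, max_gap_fraction=0.33, gap="-"):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    msa -- list of equal-length strings, one per sequence
    """
    n_rows = len(msa)
    keep = [
        j for j in range(len(msa[0]))
        if sum(row[j] == gap for row in msa) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[j] for j in keep) for row in msa]

# Toy alignment: the third column is 75% gaps and is removed
toy_msa = ["AC-D", "A--D", "ACGD", "AC-D"]
filtered = filter_gapped_columns(toy_msa)
```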
39Malate / Lactate Dehydrogenase
- Single residue has been experimentally confirmed to determine substrate specificity
- Model selection: normalized entropy criterion (NEC), K = 2
- Perfect recovery of MDH/LDH
- Consider top ranked positions for prediction of functional residues
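For orientation, the normalized entropy criterion is commonly defined as the entropy of the posterior component memberships divided by the gain in log-likelihood over the one-component model, and the K with the smallest value is preferred. The sketch below assumes that definition; the inputs are made-up numbers, not values from the data sets.

```python
import math

def nec(posteriors, loglik_k, loglik_1):
    """Normalized entropy criterion for a K-component model (K > 1).

    posteriors -- posteriors[i][k]: membership of sequence i in component k
    loglik_k   -- log-likelihood of the K-component model
    loglik_1   -- log-likelihood of the one-component model
    """
    entropy = -sum(t * math.log(t) for row in posteriors for t in row if t > 0)
    return entropy / (loglik_k - loglik_1)

# Crisp assignments give zero entropy and hence NEC = 0
crisp = nec([[1.0, 0.0], [0.0, 1.0]], loglik_k=-90.0, loglik_1=-100.0)

# Maximally uncertain assignment of a single sequence
fuzzy = nec([[0.5, 0.5]], loglik_k=-8.0, loglik_1=-10.0)
```

A well-separated clustering has near-crisp memberships and therefore a small NEC.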
40MDH/LDH
- Top 10
- Arg 81
- Met 85
- Gly 145
- Ser 88
- Leu 132
- Val 42
- Thr 123
- Ala 52
- Tyr 138
- Asn 122
E. coli MDH chain A
White: ligand interactions (NAD, SO4)
41MDH/LDH
True, experimentally verified specificity-determining residue
E. coli MDH chain A
43Guanylyl / Adenylyl Cyclases
- Catalyze ATP -> cAMP, GTP -> cGMP
- 132 sequences (81 AC, 51 GC)
- MSA of length 771 (filtered for gaps)
44Guanylyl / Adenylyl Cyclases
- Group of five residues identified to influence substrate binding in mutation experiments
- NEC: K = 2
- Clustering: sensitivity 83%, specificity 87% w.r.t. AC/GC separation
- Evaluation: consider top ranked positions within C2 domain
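The text does not state how sensitivity and specificity were computed for the clustering. One common convention, assumed here, counts pairs of sequences: a pair from the same class placed in the same cluster is a true positive, and a pair from different classes placed in different clusters is a true negative. The labels and cluster assignments below are toy data.

```python
from itertools import combinations

def pairwise_sensitivity_specificity(labels, clusters):
    """Pairwise clustering quality (ASSUMED convention, see lead-in).

    labels   -- known class of each sequence
    clusters -- cluster assigned to each sequence
    """
    tp = fn = tn = fp = 0
    for i, j in combinations(range(len(labels)), 2):
        same_class = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1
        elif same_class:
            fn += 1
        elif same_cluster:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: one AC sequence ends up in the GC cluster
labels = ["AC", "AC", "AC", "GC", "GC"]
clusters = [0, 0, 1, 1, 1]
sensitivity, specificity = pairwise_sensitivity_specificity(labels, clusters)
```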
45AC/GC Cyclases
- Top 10
- Ile 919
- Asp 1018
- Gln 1016
- Lys 1014
- Phe 975
- Lys 938
- Thr 943
- Cys 911
- Ile 1019
- Tyr 899
Rat AC C2 domain
46AC/GC Cyclases
Specificity determining residues
Rat AC C2 domain
47AC/GC Cyclases
Part of C1/C2 domain interface
Rat AC C2 domain
48AC/GC Cyclases
Next to forskolin interaction site
Rat AC C2 domain
49Conclusion
- Clustering of protein families and simultaneous prediction of functional residues
- Unsupervised and does not rely on phylogeny
- DMP based on amino acid properties
- Results on well-studied families encouraging
50Future work
- Consider machine learning approaches for DMP parameter estimation
- Analysis of protein families without known subgroup classification and functional site prediction
51Software
- Pymix: Python Mixture Package
- http://algorithmics.molgen.mpg.de/pymix.html
52
53Mixture Models
- For MSA
- Each is a realization of
- Probability of D under mixture M
54CSI Structure Learning
- Bayesian data likelihood
- is the mixture density
- is a conjugate prior over parameters
55CSI Model
Structure prior
Model posterior
Likelihood
Parameter prior
56CSI Structure Learning
- Parameter prior: amino acid property DMP instead of uninformative prior
57Protein kinase
- Data set of tyrosine kinases (TK) and two classes of serine/threonine kinases (STE, AGC)
- Stretch of three residues highly indicative for substrate specificity
- Three-component clustering: sensitivity 79%, specificity 83%
58Protein kinase
- Top 10
- Thr 201
- Lys 168
- Gly 200
- Leu 273
- Glu 170
-
- 15. Pro 169
cAMP dep. protein kinase, Mus musculus
59Protein kinase
cAMP dep. protein kinase, Mus musculus
Stretch of three residues known to
determine substrate specificity