Context-specific Independence Mixture Modelling for Protein Families

1
Context-specific Independence Mixture Modelling
for Protein Families
  • Benjamin Georgi
  • Jörg Schultz
  • Alexander Schliep

2
Introduction
  • Protein families fall into subfamilies of
    similar but distinct function
  • The specific function of a subfamily is often
    determined by a small number of residues
  • These are the functional residues
  • Example: malate / lactate dehydrogenase, where a single
    residue determines specificity

3
Functional Positions
  • Multiple sequence alignment (MSA)
  • Two subfamilies, three functional positions
  • Strong signal of subgroup-specific conservation

[Figure: MSA with subfamilies F1 and F2]
5
Introduction
  • Problem: clustering of protein families with
    simultaneous prediction of functional residues
  • Prior approaches
  • Mostly supervised, requiring additional prior
    knowledge
  • Mostly based on phylogenetic trees
  • Our approach: first unsupervised method that does
    not require a tree

Context-specific independence (CSI) mixture models
6
Context-specific independence mixture models
7
Mixture Models
  • Given a random variable that
    represents a row in an MSA
  • K-component mixture density
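
The slide's formula is not preserved in the transcript. For orientation, a K-component mixture over the columns of an MSA row is commonly written as below. All symbols here (x, p, \pi_k, f_{kj}, \theta_{kj}) are assumptions introduced for illustration, and the per-column product reflects the usual conditional-independence form of such mixtures rather than a formula confirmed by the slides.

```latex
P(x) \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} f_{kj}\!\left(x_j \mid \theta_{kj}\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

Here x = (x_1, ..., x_p) would be one alignment row, \pi_k the weight of component k, and f_{kj} the discrete distribution of component k at column j.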

8
Mixture Models
  • Example: MSA of length 4, 3-component
    mixture
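
A minimal sketch of such an example, assuming a toy four-letter alphabet and made-up weights and column distributions (none of these values come from the slides). It evaluates the density of one alignment row under a 3-component mixture over 4 columns, with columns treated as independent within each component.

```python
ALPHABET = "ACDE"  # toy alphabet (assumption); real data has 20 amino acids plus gap


def peaked(symbol, p=0.7):
    """Discrete distribution with mass p on one symbol, the rest spread evenly."""
    rest = (1.0 - p) / (len(ALPHABET) - 1)
    return {a: (p if a == symbol else rest) for a in ALPHABET}


def mixture_density(row, weights, components):
    """P(row) = sum_k weights[k] * prod_j components[k][j][row[j]]."""
    total = 0.0
    for pi_k, columns in zip(weights, components):
        prob = pi_k
        for symbol, dist in zip(row, columns):
            prob *= dist[symbol]
        total += prob
    return total


# Hypothetical parameters: three components, four columns each.
weights = [0.5, 0.3, 0.2]
components = [
    [peaked(s) for s in "ACDE"],
    [peaked(s) for s in "CCDE"],
    [peaked(s) for s in "EEAA"],
]

density = mixture_density("ACDE", weights, components)
```

The row "ACDE" matches the first component in every column, so that component contributes almost all of the density.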

9
Mixture Models
  • Model parameterization

10
Mixture Models
  • Model structure matrix: one atomic distribution
    per feature and component

11
CSI Mixture Models
  • Model structure matrix: variable number of
    parameters for each feature
  • Reduces model complexity
  • Negates the contribution of noise
    features
  • Avoids overfitting
  • Highly descriptive model due to block
    structure
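
To make the reduction in complexity concrete, the sketch below represents a structure as one partition of the components per feature and counts free parameters. The representation, the function names, and the example structure are illustrative assumptions, not Pymix's data structures.

```python
def is_valid_structure(structure, n_components):
    """Each feature's groups must partition the component indices exactly."""
    expected = set(range(n_components))
    for groups in structure:
        members = [k for group in groups for k in group]
        if len(members) != len(set(members)) or set(members) != expected:
            return False
    return True


def count_parameters(structure, alphabet_size):
    """One discrete distribution (alphabet_size - 1 free values) per group."""
    return sum(len(groups) * (alphabet_size - 1) for groups in structure)


K = 3  # number of mixture components

# Conventional mixture: every component has its own distribution in every column.
conventional = [[{0}, {1}, {2}] for _ in range(4)]

# Hypothetical CSI structure over the same four columns.
csi = [
    [{0, 1, 2}],      # noise column: one distribution shared by all components
    [{0, 1}, {2}],    # components 0 and 1 share a distribution
    [{0}, {1}, {2}],  # fully discriminative column
    [{0, 1, 2}],      # noise column
]

n_conventional = count_parameters(conventional, alphabet_size=20)
n_csi = count_parameters(csi, alphabet_size=20)
```

Columns where all components share one group carry no cluster information, which is how the block structure separates candidate functional positions from noise columns.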

12
CSI Mixture Models
  • Model structure matrix: variable number of
    parameters for each feature

Use the model structure to predict functional
positions
13
CSI for Protein Families
  • Conventional mixture

[Figure: subfamilies F1 and F2, components C1 and C2]
14
CSI for Protein Families
  • CSI Structure matrix (idealized)

[Figure: subfamilies F1 and F2, components C1 and C2]
15
CSI for Protein Families
  • CSI Structure matrix (more realistic)

[Figure: subfamilies F1 and F2, components C1 and C2]
16
Prediction of Functional Residues
  • Rank features by importance for the characterization
    of a cluster
  • A ranking score is computed for feature j in component i
  • Take the highest-ranking features as putative
    functional residues
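
The ranking formula itself is not preserved in the transcript. The sketch below swaps in one plausible score, the Kullback-Leibler divergence between a component's column distribution and a background distribution for that column. This choice, and all names and numbers, are assumptions for illustration and should not be read as the authors' criterion.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank_features(component_dists, background_dists):
    """Return feature indices sorted by decreasing score, plus the scores."""
    scores = [kl_divergence(p, q) for p, q in zip(component_dists, background_dists)]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical three-column example over a three-letter alphabet.
component_i = [
    [0.90, 0.05, 0.05],  # sharply different from the background
    [0.50, 0.30, 0.20],  # identical to the background (shared distribution)
    [0.60, 0.20, 0.20],  # mildly different
]
background = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.50, 0.30, 0.20],
    [0.40, 0.30, 0.30],
]

order, scores = rank_features(component_i, background)
```

Under this score a column whose distribution is shared with the background gets a score of zero and falls to the bottom of the ranking.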

17
CSI mixture structure learning
18
CSI Structure Learning
  • Question: how to learn a CSI structure from data?
  • Bayesian approach: score models by their posterior
    distribution

19
CSI Structure Learning
  • Model posterior (terms labelled on the slide: structure prior,
    likelihood, parameter prior)
20
CSI Structure Learning
  • Model posterior (terms labelled on the slide: structure prior,
    likelihood, parameter prior)
  • Model posterior: criterion used to select structure and
    parameter estimates
21
CSI Structure Learning
  • Model posterior (terms labelled on the slide: structure prior,
    likelihood, parameter prior)
  • Likelihood: probability of the data under the mixture model
22
CSI Structure Learning
  • Model posterior (terms labelled on the slide: structure prior,
    likelihood, parameter prior)
  • Parameter prior: typically uninformative, acts as pseudo-counts,
    conjugate Dirichlet distribution
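
The pseudo-count reading of a Dirichlet prior can be illustrated with the posterior mean of a discrete distribution: each prior parameter is simply added to the observed count. The counts and the uniform prior below are made-up values, and the posterior mean is used here for simplicity; the slides do not state which point estimate the method uses.

```python
def posterior_mean(counts, alpha):
    """Posterior mean of a discrete distribution under a Dirichlet(alpha) prior."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]


counts = [3, 0, 1]       # hypothetical symbol counts in one column
alpha = [1.0, 1.0, 1.0]  # uniform, uninformative prior: one pseudo-count per symbol

estimate = posterior_mean(counts, alpha)
```

The unobserved second symbol still receives non-zero probability, which is the practical effect of the pseudo-counts.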
23
CSI Structure Learning
  • Model posterior (terms labelled on the slide: structure prior,
    likelihood, parameter prior)
  • Structure prior: introduces a preference for simpler models,
    acts as a regularizer, simple factored form
24
Learning Algorithm
  • Structural EM framework (Friedman 1998)
  • Efficiently score candidate structures based on
    the expected sufficient statistics
  • Perform a greedy search over structure space to
    arrive at the final structure
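
A rough sketch of the greedy step for a single feature, given fixed expected counts per component. The scoring function (a Dirichlet-multinomial marginal likelihood plus a constant log penalty per group standing in for the structure prior), the penalty value, and the count table are all assumptions; the authors' actual score is not preserved in the transcript, and a full structural EM would recompute the expected counts between searches.

```python
import math


def log_marginal(counts, alpha):
    """Log Dirichlet-multinomial marginal likelihood of (expected) counts."""
    value = math.lgamma(sum(alpha)) - math.lgamma(sum(alpha) + sum(counts))
    for n, a in zip(counts, alpha):
        value += math.lgamma(a + n) - math.lgamma(a)
    return value


def score(groups, stats, alpha, log_penalty):
    """Score a partition: pooled-count marginal plus a prior term per group."""
    total = 0.0
    for group in groups:
        pooled = [sum(stats[k][s] for k in group) for s in range(len(alpha))]
        total += log_marginal(pooled, alpha) + log_penalty
    return total


def greedy_merge(stats, alpha, log_penalty=-2.0):
    """Start with one group per component; merge pairs while the score improves."""
    groups = [frozenset([k]) for k in range(len(stats))]
    best = score(groups, stats, alpha, log_penalty)
    while len(groups) > 1:
        candidates = []
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                merged = [g for i, g in enumerate(groups) if i not in (a, b)]
                merged.append(groups[a] | groups[b])
                candidates.append((score(merged, stats, alpha, log_penalty), merged))
        top_score, top_groups = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # no merge improves the score
        best, groups = top_score, top_groups
    return sorted(sorted(g) for g in groups)


# Hypothetical expected counts: rows are components, columns are symbols.
stats = [[9, 1, 0], [8, 2, 0], [0, 1, 9]]
structure = greedy_merge(stats, alpha=[1.0, 1.0, 1.0])
```

Components 0 and 1 have nearly identical counts and end up sharing one distribution, while component 2 keeps its own.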

25
CSI mixtures for protein data
26
Mixtures for Proteins
  • Using sequences to infer structural properties of
    the proteins
  • Different amino acid substitutions have different
    structural impacts
  • Need a notion of amino acid similarity in the model
    to guide the structure learning
  • Remark: related to substitution matrices in
    phylogeny

27
Conceptual problem
  • Example
  • Aspartate (D) and Glutamate (E) are much more similar
    than Leucine (L) and Asparagine (N)
  • The E/D column might best be modeled with a single
    distribution in the structure

[Figure: toy alignment; subfamily F1 rows read L E, subfamily F2 rows read N D]
28
Amino acid properties

Livingstone and Barton (1993)
29
Amino acid properties

30
Model extension
  • Need to integrate amino acid (AA) properties into the model
  • Idea: construct a parameter prior that
    defines an appropriate density
  • Dirichlet mixture priors
31
Dirichlet Distribution
  • Defines density over discrete distributions
  • Example preference for

32
Dirichlet Mixture Priors (DMP)
  • Mixture of several Dirichlet distributions as
    parameter prior
  • Instead of a single Dirichlet as in the
    uninformative case
  • Allows for different contexts and preferences in the
    density over the parameter space
  • Fits seamlessly into the mixture framework
  • How to model AA properties with a DMP?
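
As a minimal sketch of what a Dirichlet mixture prior evaluates, the code below computes the density of a weighted sum of Dirichlet distributions at a given discrete distribution. The four-letter alphabet (D, E, K, R), the two components (one favouring the acidic pair D/E, one the basic pair K/R), and all parameter values are illustrative assumptions, not the prior used in the talk.

```python
import math


def dirichlet_log_pdf(theta, alpha):
    """Log density of Dirichlet(alpha) at a strictly positive distribution theta."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def dmp_pdf(theta, weights, alphas):
    """Density of a Dirichlet mixture prior: sum_q weights[q] * Dir(theta; alphas[q])."""
    return sum(w * math.exp(dirichlet_log_pdf(theta, a)) for w, a in zip(weights, alphas))


# Symbol order: D, E, K, R (toy alphabet, assumption).
weights = [0.5, 0.5]
alphas = [
    [4.0, 4.0, 1.0, 1.0],  # prefers mass on the acidic residues D and E
    [1.0, 1.0, 4.0, 4.0],  # prefers mass on the basic residues K and R
]

theta_acidic = [0.45, 0.45, 0.05, 0.05]  # column dominated by D and E
theta_mixed = [0.45, 0.05, 0.45, 0.05]   # column mixing D with K

density_acidic = dmp_pdf(theta_acidic, weights, alphas)
density_mixed = dmp_pdf(theta_mixed, weights, alphas)
```

A column distribution that stays within one property group receives a much higher prior density than one that mixes groups, which is the preference such a prior is meant to encode.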

33
DMP for Amino acids

34
DMP for Amino acids

Values of X and . for each component chosen by
heuristic
35
DMP for Amino acids
  • Yields probabilistic representation of amino acid
    property hierarchy as DMP
  • Drives parameter estimation and structure
    learning to be consistent with this notion of
    similarity
  • Yields improvement of model performance for
    protein clustering

36
Results
37
Data sets
  • Method evaluation on well-studied families with
    known subgroups
  • Clustering and prediction of functional residues
  • Malate / Lactate dehydrogenase (MDH/LDH)
  • Guanylyl / Adenylyl cyclases (GC/AC)
  • (Serine/Threonine and Tyrosine Protein kinases)

38
Malate / Lactate Dehydrogenase
  • Oxidoreductase, part of the citrate cycle
  • Small, clean Pfam seed alignment for the MDH/LDH NAD
    binding domain
  • 29 sequences (13 MDH, 16 LDH)
  • MSA of length 141 (after filtering highly gapped
    columns: > 0.33 gaps)
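
The gap filter can be sketched as follows, assuming the alignment is a list of equal-length strings with "-" as the gap symbol and that a column is dropped when its gap fraction exceeds the threshold. The function name and the toy alignment are hypothetical.

```python
def filter_gapped_columns(alignment, max_gap_fraction=0.33, gap="-"):
    """Drop alignment columns whose fraction of gap symbols exceeds the threshold."""
    n_rows = len(alignment)
    keep = [
        j
        for j in range(len(alignment[0]))
        if sum(row[j] == gap for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[j] for j in keep) for row in alignment]


# Toy alignment: the third column is 75% gaps and is removed.
msa = ["AC-E", "A--E", "GC-D", "GCTD"]
filtered = filter_gapped_columns(msa)
```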

39
Malate / Lactate Dehydrogenase
  • A single residue has been experimentally confirmed
    to determine substrate specificity
  • Model selection: normalized entropy criterion
    (NEC), K = 2
  • Perfect recovery of MDH/LDH
  • Consider top-ranked positions for prediction of
    functional residues
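
For reference, the normalized entropy criterion as it is usually defined divides the entropy of the posterior cluster memberships by the gain in log-likelihood over a one-component model; the K with the smallest value is selected. The sketch below uses made-up posteriors and log-likelihoods, and the function names are hypothetical.

```python
import math


def membership_entropy(posteriors):
    """Entropy of the posterior memberships, summed over all sequences."""
    return -sum(t * math.log(t) for row in posteriors for t in row if t > 0)


def nec(posteriors, loglik_k, loglik_1):
    """NEC(K) = E(K) / (L(K) - L(1)) for K > 1; smaller is better."""
    return membership_entropy(posteriors) / (loglik_k - loglik_1)


# Hypothetical two-component fits with equal log-likelihood gain.
crisp = [[0.99, 0.01], [0.02, 0.98]]  # well-separated clusters
fuzzy = [[0.50, 0.50], [0.50, 0.50]]  # no separation at all

nec_crisp = nec(crisp, loglik_k=-100.0, loglik_1=-120.0)
nec_fuzzy = nec(fuzzy, loglik_k=-100.0, loglik_1=-120.0)
```

With the same likelihood gain, the well-separated clustering gets the lower NEC.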

40
MDH/LDH
  • Top 10
  • Arg 81
  • Met 85
  • Gly 145
  • Ser 88
  • Leu 132
  • Val 42
  • Thr 123
  • Ala 52
  • Tyr 138
  • Asn 122

E. coli MDH chain A
White: ligand interactions (NAD, SO4)
41
MDH/LDH
True, experimentally verified specificity-determining
residue
E. coli MDH chain A
43
Guanylyl / Adenylyl Cyclases
  • Catalyze ATP -> cAMP and GTP -> cGMP
  • 132 sequences (81 AC, 51 GC)
  • MSA of length 771 (filtered for gaps)

44
Guanylyl / Adenylyl Cyclases
  • Group of five residues identified to influence
    substrate binding in mutation experiments
  • NEC: K = 2
  • Clustering: sensitivity 83%, specificity 87%
    w.r.t. AC/GC separation
  • Evaluation: consider top-ranked positions within
    the C2 domain

45
AC/GC Cyclases
  • Top 10
  • Ile 919
  • Asp 1018
  • Gln 1016
  • Lys 1014
  • Phe 975
  • Lys 938
  • Thr 943
  • Cys 911
  • Ile 1019
  • Tyr 899

Rat AC C2 domain
46
AC/GC Cyclases
Specificity determining residues
Rat AC C2 domain
47
AC/GC Cyclases
Part of C1/C2 domain interface
Rat AC C2 domain
48
AC/GC Cyclases
Next to forskolin interaction site
Rat AC C2 domain
49
Conclusion
  • Clustering of protein families and simultaneous
    prediction of functional residues
  • Unsupervised and does not rely on phylogeny
  • DMP based on amino acid properties
  • Results on well-studied families encouraging

50
Future work
  • Consider machine learning approaches for DMP
    parameter estimation
  • Analysis of protein families without known
    subgroup classification and functional site
    prediction

51
Software
  • Pymix: Python Mixture Package
  • http://algorithmics.molgen.mpg.de/pymix.html

52
  • Thank you.

53
Mixture Models
  • For an MSA D
  • Each row of D is a realization of the random variable
  • Probability of D under mixture M

54
CSI Structure Learning
  • Bayesian data likelihood, with two terms
  • One term is the mixture density
  • The other is a conjugate prior over the parameters

55
CSI Model
  • Model posterior (terms labelled on the slide: structure prior,
    likelihood, parameter prior)
56
CSI Structure Learning
  • Model posterior (terms labelled on the slide: structure prior,
    likelihood, parameter prior)
  • Parameter prior: amino acid property DMP instead of an
    uninformative prior
57
Protein kinase
  • Data set of tyrosine kinases (TK) and two classes
    of serine/threonine kinases (STE, AGC)
  • Stretch of three residues highly indicative of
    substrate specificity
  • Three-component clustering: sensitivity 79%,
    specificity 83%

58
Protein kinase
  • Top 10
  • Thr 201
  • Lys 168
  • Gly 200
  • Leu 273
  • Glu 170
  • 15. Pro 169

cAMP-dependent protein kinase, Mus musculus
59
Protein kinase
cAMP-dependent protein kinase, Mus musculus
Stretch of three residues known to
determine substrate specificity