Title: CISC 841 Bioinformatics
1 CISC 841 Bioinformatics: Combining HMMs with SVMs
2 HMM gradients
- Fisher score: $U_X = \nabla_\theta \log P(X \mid H, \theta)$
- The gradient of a sequence X with respect to a given model is computed using the forward-backward algorithm.
- Each dimension corresponds to one parameter of the model.
- The feature space is tailored to the sequences from which the model was trained.
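As an illustration of the bullets above, here is a minimal sketch of how the forward-backward algorithm yields one gradient component per emission parameter. The two-state model, its numbers, and the function names are assumptions made for this example; no scaling or log-space arithmetic is used, so it is only suitable for short sequences.

```python
import math

def forward_backward(x, pi, a, e):
    """Return forward table, backward table and P(x) for a discrete HMM."""
    n, T = len(pi), len(x)
    f = [[0.0] * n for _ in range(T)]
    b = [[0.0] * n for _ in range(T)]
    for k in range(n):
        f[0][k] = pi[k] * e[k][x[0]]
        b[T - 1][k] = 1.0
    for t in range(1, T):
        for k in range(n):
            f[t][k] = e[k][x[t]] * sum(f[t - 1][l] * a[l][k] for l in range(n))
    for t in range(T - 2, -1, -1):
        for k in range(n):
            b[t][k] = sum(a[k][l] * e[l][x[t + 1]] * b[t + 1][l] for l in range(n))
    return f, b, sum(f[T - 1])

def expected_emission_counts(x, pi, a, e):
    """E[k][s]: posterior expected number of times state k emits symbol s."""
    f, b, px = forward_backward(x, pi, a, e)
    E = [[0.0] * len(e[0]) for _ in e]
    for t, sym in enumerate(x):
        for k in range(len(e)):
            E[k][sym] += f[t][k] * b[t][k] / px
    return E, px

# Hypothetical toy model: 2 states, 3 symbols.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
e = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
x = [0, 1, 2, 2, 1]

E, px = expected_emission_counts(x, pi, a, e)
# Unconstrained gradient of log P(x) w.r.t. each emission parameter:
# one dimension per (state, symbol) pair.
gradient = [[E[k][s] / e[k][s] for s in range(3)] for k in range(2)]
```

Each entry of `gradient` is one dimension of the feature vector, which is why two sequences of different lengths map to vectors of the same size.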
3 SVM-Fisher discrimination
- A probabilistic hidden Markov model $\theta$ is trained from some example sequences $x_1, x_2, x_3, \dots, x_N$.
- Usually the model probability $P(x_i \mid \theta)$ (or a function of $P(x_i \mid \theta)$) is used as a measure of sequence-model membership, and a threshold on this measure is used to decide membership.
- The Fisher vector is a vector of gradients of $P(x_i \mid \theta)$ (or gradients of a function of $P(x_i \mid \theta)$) w.r.t. the parameters of the model:
  $U_{x_i} = \nabla_\theta P(x_i \mid \theta)$
- One can take the training example sequences (positive set) and other sequences that are known to be non-members (negative set), and transform them into Fisher vectors.
- A support vector machine (SVM) can be trained using the positive and negative Fisher vectors, and can be used to classify other sequences.
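The last two steps can be sketched as follows. This is not the classifier used in the experiments: it is a tiny linear SVM without a bias term, trained by Pegasos-style sub-gradient descent on the hinge loss, standing in for a real SVM package. The two-dimensional "Fisher vectors" are made-up toy values, and all names are assumptions for this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_linear_svm(vectors, labels, lam=0.01, epochs=50):
    """Minimise lam/2 * |w|^2 + mean hinge loss by sub-gradient steps."""
    w = [0.0] * len(vectors[0])
    t = 0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            t += 1
            eta = 1.0 / (lam * t)               # decreasing step size
            violated = y * dot(w, x) < 1.0      # margin check with current w
            w = [(1.0 - eta * lam) * wi for wi in w]
            if violated:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    """+1: predicted family member, -1: predicted non-member."""
    return 1 if dot(w, x) > 0 else -1

# Toy Fisher vectors (hypothetical values, two model parameters).
positive_fisher = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
negative_fisher = [[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0]]
vectors = positive_fisher + negative_fisher
labels = [1, 1, 1, -1, -1, -1]

w = train_linear_svm(vectors, labels)
predictions = [classify(w, x) for x in vectors]
```

In practice one would typically use a kernel SVM from an established package; the sketch only shows where the positive and negative Fisher vectors enter the training step.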
4 Application: Protein remote homology detection
5 SVM-Pairwise method
[Figure: SVM-pairwise workflow in three numbered steps. Labels: positive training set (protein homologs) and negative training set (protein non-homologs); positive and negative pairwise score vectors; testing data (target protein of unknown function); support vector machine; binary classification.]
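A minimal sketch of the pairwise score vectors named in the figure: each protein is represented by its similarity scores against every training protein. The similarity function here (number of shared 3-mers) is only a stand-in chosen for this example; a real setup would use a sequence alignment score. The sequences are invented.

```python
def kmer_set(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Stand-in pairwise score: number of k-mers shared by two sequences."""
    return len(kmer_set(a, k) & kmer_set(b, k))

def pairwise_score_vector(seq, training_seqs):
    """One feature per training sequence."""
    return [similarity(seq, t) for t in training_seqs]

# Hypothetical toy sequences.
positives = ["MKVLAAGIV", "MKVLSAGIV"]   # homologs
negatives = ["GGSPTTQWE", "GGSPTAQWE"]   # non-homologs
training = positives + negatives

train_vectors = [pairwise_score_vector(s, training) for s in training]
target_vector = pairwise_score_vector("MKVLAAGLV", training)
print(train_vectors)
print(target_vector)
```

The training vectors, labeled positive or negative, are what the support vector machine is trained on; the target vector is what it classifies.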
6 Experiment: known protein families
Jaakkola, Diekhans and Haussler 1999
7 Sample family sizes
8 A measure of sensitivity and specificity
[Figure: examples labeled ROC = 1, ROC = 0.67, and ROC = 0.]
ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.
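The definition can be made concrete with a short sketch. It assumes the test sequences are already sorted from highest to lowest classifier score and that there are no tied scores; the function name is chosen for this example.

```python
def roc_score(ranked_labels):
    """Normalized area under the true-positives vs. false-positives curve.

    ranked_labels: booleans (True = positive), ordered from best score to worst.
    """
    pos = sum(1 for y in ranked_labels if y)
    neg = len(ranked_labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    true_positives = 0
    area = 0
    for y in ranked_labels:
        if y:
            true_positives += 1
        else:
            # Each false positive adds a column whose height is the
            # number of true positives found so far.
            area += true_positives
    return area / (pos * neg)

perfect = roc_score([True, True, False, False])
mixed = roc_score([True, True, False, True])
worst = roc_score([False, False, True, True])
```

A score of 1 means every positive is ranked above every negative, and a score of 0 means the reverse.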
9 Application: Discriminating signal peptides from transmembrane proteins
10 Feature selection
- We expect gradients w.r.t. transition parameters to be better discrimination features.
- We look for those transitions that are differentially used by TM proteins and SP proteins:
  - transform each signal peptide sequence (1275) into a Fisher vector w.r.t. transition parameters and find the resultant vector
  - transform each TM sequence into a Fisher vector w.r.t. transition parameters and find the resultant vector
  - compare the two resultant vectors
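The three steps above can be sketched as follows. The slides do not say how the two resultant vectors are compared; dividing each by its class size and ranking transitions by absolute difference is an assumption of this example, as are the transition names and the toy Fisher vectors.

```python
def resultant(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def rank_transitions(sp_vectors, tm_vectors, names):
    """Order transitions by how differently the two classes use them."""
    sp_mean = [v / len(sp_vectors) for v in resultant(sp_vectors)]
    tm_mean = [v / len(tm_vectors) for v in resultant(tm_vectors)]
    diffs = [(abs(s - t), n) for s, t, n in zip(sp_mean, tm_mean, names)]
    return [name for _, name in sorted(diffs, reverse=True)]

# Hypothetical transition parameters and toy Fisher vectors.
names = ["a->b", "b->c", "c->a"]
sp_vectors = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
tm_vectors = [[-2.0, 0.5, 2.0], [-2.0, -0.5, 1.0]]

sp_resultant = resultant(sp_vectors)
tm_resultant = resultant(tm_vectors)
ranking = rank_transitions(sp_vectors, tm_vectors, names)
```

Transitions at the top of `ranking` would be the candidates to keep as discrimination features.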
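11 Gradients of P(s|x)
- In pattern recognition problems, we are interested in $P(s \mid x, \theta)$ rather than $P(x \mid \theta)$.
- $U_{s \mid x} = \nabla_\theta \log P(s \mid x, \theta) = \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)$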
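The identity follows directly from the definition of conditional probability; the intermediate steps, written out under the same notation, are:

```latex
\begin{aligned}
P(s \mid x, \theta) &= \frac{P(s, x \mid \theta)}{P(x \mid \theta)} \\
\log P(s \mid x, \theta) &= \log P(s, x \mid \theta) - \log P(x \mid \theta) \\
U_{s \mid x} = \nabla_\theta \log P(s \mid x, \theta)
  &= \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)
\end{aligned}
```

The last line uses only the linearity of the gradient, so both terms on the right can be computed separately and subtracted.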
12 Classification experiment
- 10-fold cross-validation experiment using
  - positive set (247 TM proteins)
  - negative set (1275 signal peptide containing proteins)
- SVM-light package is used.
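A minimal sketch of how the two sets could be split for such an experiment. The slides do not say whether the folds were stratified; keeping the positive-to-negative ratio roughly equal in each fold is an assumption of this example, as are the function name and the fixed seed.

```python
import random

def stratified_folds(n_pos, n_neg, k=10, seed=0):
    """Split labeled indices into k folds with similar class balance."""
    rng = random.Random(seed)
    pos = [("pos", i) for i in range(n_pos)]
    neg = [("neg", i) for i in range(n_neg)]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(k)]
    for group in (pos, neg):
        for idx, item in enumerate(group):
            folds[idx % k].append(item)   # deal items out round-robin
    return folds

folds = stratified_folds(247, 1275)

splits = []
for i, test_fold in enumerate(folds):
    # Train on the other nine folds, test on the held-out one.
    train = [item for j, fold in enumerate(folds) if j != i for item in fold]
    splits.append((train, test_fold))
```

Each sequence is used for testing exactly once, and the reported result would be an aggregate over the ten held-out folds.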
13 Discrimination results
- Results
- A third (68) more SP proteins that were incorrectly classified as TM
- TM proteins are identified correctly.
14 Application: Protein-Protein Interaction Prediction
15 Interaction Profile Hidden Markov Model (ipHMM)
Friedrich et al. (2006)
16 Knowledge transfer
- Build ipHMM from proteins whose structural information is available.
- Align the sequences of proteins whose structural information is not available to the model.
Likelihood Score Vector
$\langle LS_{a_i,A},\ LS_{a_i,B},\ LS_{b_j,A},\ LS_{b_j,B} \rangle$
Fisher Score Vector
$U(x) = \nabla_\theta \log P(x \mid \theta)$
$U_{ij} = \dfrac{E_j(i)}{e_j(i)} - \sum_k E_j(k)$
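A short sketch of how the Fisher score vector could be assembled from the formula above, given the expected emission counts $E_j(i)$ and emission probabilities $e_j(i)$ for state $j$ and symbol $i$. The function name, the state-by-state flattening order, and the toy numbers are assumptions for this example; in practice the counts would come from aligning a sequence to the profile HMM.

```python
def fisher_score_vector(E, e):
    """Flatten U_ij = E_j(i) / e_j(i) - sum_k E_j(k) state by state."""
    U = []
    for j in range(len(e)):
        total = sum(E[j])                       # sum_k E_j(k)
        for i in range(len(e[j])):
            U.append(E[j][i] / e[j][i] - total)
    return U

# Toy values: 2 states, 2 symbols (hypothetical).
e = [[0.5, 0.5], [0.25, 0.75]]   # emission probabilities e_j(i)
E = [[2.0, 1.0], [0.5, 1.5]]     # expected emission counts E_j(i)

U = fisher_score_vector(E, e)
print(U)
```

A component is zero when a state emits a symbol exactly as often as its emission probability predicts, and positive when the sequence uses that emission more than expected.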
19 Data set: Friedrich et al. (2006), 2018 proteins in 36 domain families
20 Conclusions
- Structural information at binding sites enhances protein-protein interaction prediction.
- Interaction profile HMMs can transfer structural information.
- Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.