Transcript and Presenter's Notes

Title: Protein Classification


1
Protein Classification
2
Protein Classification
  • Given a new protein, can we place it in its
    correct position within an existing protein
    hierarchy?
  • Methods
  • BLAST / PSI-BLAST
  • Profile HMMs
  • Supervised machine learning methods

[Figure: protein hierarchy (Proteins, Family, Superfamily, Fold), with a new protein to be placed]
3
PSI-BLAST
  • Given a sequence query x, and database D
  1. Find all pairwise alignments of x to sequences
     in D
  2. Collect all matches of x to y with some minimum
     significance
  3. Construct position-specific matrix M
     • Each sequence y is given a weight so that many
       similar sequences cannot have too much influence
       on a position (Henikoff & Henikoff 1994)
  4. Using the matrix M, search D for more matches
  • Iterate steps 1-4 until convergence (a sketch of
    the loop follows below)

[Figure: PSI-BLAST profile M]
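
A minimal sketch in Python of the iteration loop above, under strong simplifications: gapless matches of fixed length, a toy log-likelihood significance threshold, and uniform sequence weights instead of the Henikoff & Henikoff scheme. All names and the threshold value are illustrative.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(matches, pseudocount=1.0):
    """Position-specific probabilities p_i(a) from gapless matched segments."""
    length = len(matches[0])
    profile = []
    for i in range(length):
        counts = Counter(seg[i] for seg in matches)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile

def profile_score(profile, segment):
    """Log-likelihood of a segment under the profile (higher = better match)."""
    return sum(log(profile[i][a]) for i, a in enumerate(segment))

def psi_blast(x, database, rounds=5, threshold=-25.0):
    """Iterate: build profile M from matches, search D with M, absorb new matches."""
    matches = {x}
    for _ in range(rounds):
        profile = build_profile(sorted(matches))
        new = {y for y in database
               if len(y) == len(x) and profile_score(profile, y) > threshold}
        if new <= matches:          # convergence: no new sequences found
            return profile
        matches |= new
    return build_profile(sorted(matches))
```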
4
Classification with Profile HMMs
[Figure: the same hierarchy (Fold, Superfamily, Family), with a new protein to be placed using profile HMMs]
5
The Fisher Kernel
  • Fisher score:
  • U_X = ∇_θ log P(X | H1, θ)
  • Quantifies how each parameter contributes to
    generating X
  • For two different sequences X and Y, can compare
    U_X, U_Y
  • D²_F(X, Y) = ‖U_X − U_Y‖² / (2σ²)
  • Given this distance function, K(X, Y) is defined
    as a similarity measure:
  • K(X, Y) = exp(−D²_F(X, Y))
  • Set σ so that the average distance of training
    sequences Xi ∈ H1 to sequences Xj ∈ H0 is 1
    (see the sketch below)
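
A minimal sketch of the kernel computation, assuming the Fisher scores U_X are already available as NumPy vectors (computing them from the profile HMM is not shown); fisher_kernel and calibrate_sigma are illustrative names.

```python
import numpy as np

def fisher_kernel(u_x, u_y, sigma):
    """K(X, Y) = exp(-||U_X - U_Y||^2 / (2 sigma^2))."""
    d2 = np.sum((u_x - u_y) ** 2) / (2.0 * sigma ** 2)
    return np.exp(-d2)

def calibrate_sigma(scores_h1, scores_h0):
    """Pick sigma so the average squared H1-to-H0 distance D2_F equals 1,
    one plausible reading of the calibration rule on the slide."""
    d2 = [np.sum((u - v) ** 2) for u in scores_h1 for v in scores_h0]
    return np.sqrt(np.mean(d2) / 2.0)
```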

6
The Fisher Kernel
  • To train a classifier for a given family H1,
  • Build profile HMM, H1
  • U_X = ∇_θ log P(X | H1, θ)   (Fisher score)
  • D²_F(X, Y) = ‖U_X − U_Y‖² / (2σ²)   (distance)
  • K(X, Y) = exp(−D²_F(X, Y))   (akin to a dot
    product)
  • L(X) = Σ_{Xi ∈ H1} λ_i K(X, Xi) − Σ_{Xj ∈ H0} λ_j K(X, Xj)
  • Iteratively adjust λ to optimize
  • J(λ) = Σ_{Xi ∈ H1} λ_i (2 − L(Xi)) + Σ_{Xj ∈ H0} λ_j (2 + L(Xj))
  • To classify query X,
  • Compute U_X
  • Compute K(X, Xi) for all training examples Xi
    with λ_i ≠ 0 (few)
  • Decide based on the sign of L(X) (see the sketch
    below)
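
A minimal sketch of the resulting classifier, assuming the support sequences' Fisher scores, labels (+1 for H1, -1 for H0), and weights λ_i have come out of training; all names are illustrative.

```python
import numpy as np

def discriminant(u_query, support_u, labels, lambdas, sigma):
    """L(X) = sum_i y_i * lambda_i * K(X, X_i), over support examples only."""
    L = 0.0
    for u_i, y_i, lam in zip(support_u, labels, lambdas):
        k = np.exp(-np.sum((u_query - u_i) ** 2) / (2.0 * sigma ** 2))
        L += y_i * lam * k
    return L

def classify(u_query, support_u, labels, lambdas, sigma):
    """Assign to H1 when L(X) > 0, else to H0."""
    return "H1" if discriminant(u_query, support_u, labels, lambdas, sigma) > 0 else "H0"
```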

7
O. Jangmin
8
(No Transcript)
9
QUESTION
  • Running time of Fisher kernel SVM
  • on query X?

10
k-mer based SVMs
  • Leslie, Eskin, Weston, Noble, NIPS 2002
  • Highlights:
  • K(X, Y) = exp(−‖U_X − U_Y‖² / (2σ²)) requires an
    expensive profile alignment:
  • U_X = ∇_θ log P(X | H1, θ) costs O(|X| |H1|)
  • Instead, the new kernel K(X, Y) just counts up
    k-mers with mismatches in common between X and
    Y: O(|X|) in practice
  • Off-the-shelf SVM software used

11
k-mer based SVMs
  • For given word size k, and mismatch tolerance l,
    define
  • K(X, Y) = # of distinct k-long word occurrences
    in common between X and Y, with up to l mismatches
  • Define normalized kernel
    K′(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
  • SVM can be learned by supplying this kernel
    function

Let k = 3, l = 1
X: A B A C A R D I
Y: A B R A D A B I
K(X, Y) = 4;  K′(X, Y) = 4 / sqrt(7 · 7) = 4/7
(verified by the sketch below)
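
A brute-force check of this example, under one plausible reading of the count that reproduces the slide's numbers: unordered pairs of distinct k-mer words, one from each sequence, matching with at most l mismatches.

```python
from itertools import product
from math import sqrt

def kmers(s, k):
    """Distinct k-long words occurring in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def hamming(a, b):
    return sum(c != d for c, d in zip(a, b))

def mismatch_kernel(x, y, k=3, l=1):
    """Unordered pairs of k-mer words within l mismatches of each other."""
    pairs = {frozenset((w, v))
             for w, v in product(kmers(x, k), kmers(y, k))
             if hamming(w, v) <= l}
    return len(pairs)

X, Y = "ABACARDI", "ABRADABI"
kxy = mismatch_kernel(X, Y)                                    # 4
norm = kxy / sqrt(mismatch_kernel(X, X) * mismatch_kernel(Y, Y))
print(kxy, norm)                                               # 4 0.571... = 4/7
```
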
12
SVMs will find a few support vectors
After training, the SVM has determined a small set
of sequences, the support vectors, which need to be
compared with the query sequence X
13
Benchmarks
14
Semi-Supervised Methods
GENERATIVE SUPERVISED METHODS
15
Semi-Supervised Methods
DISCRIMINATIVE SUPERVISED METHODS
16
Semi-Supervised Methods
UNSUPERVISED METHODS
Mixture of Centers: Data generated by a fixed set
of centers (how many?)
26
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples

27
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples
  • SVMs and other discriminative methods may make
    significant mistakes due to lack of data

31
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples

Attempt to contract the distances within each
cluster while keeping inter-cluster distances
larger
33
Semi-Supervised Methods
  • Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005
  • A PSI-BLAST profile-based method
  • Weston, Leslie, Elisseeff, Noble, NIPS 2003
  • Cluster kernels

34
(semi)1. Profile k-mer based SVMs
[Figure: PSI-BLAST produces profile M]
  • For each sequence X,
  • Obtain PSI-BLAST profile Q(X) = { p_i(a) : a ∈
    amino acids, 1 ≤ i ≤ |X| }
  • For every k-mer x_j … x_{j+k−1} in X, define the
    σ-neighborhood
  • M_{k,σ}(Q[x_j … x_{j+k−1}]) = { b_1…b_k :
    −Σ_{i=0..k−1} log p_{j+i}(b_i) < σ }
  • Define K(X, Y):
  • For each b_1…b_k matching m times in X, n times
    in Y, add m·n
  • In practice, each k-mer can have up to 2
    mismatches, and K(X, Y) can be computed quickly
    in O(k² · 20² · (|X| + |Y|))
    (see the brute-force sketch below)
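
A minimal sketch of this kernel, enumerating the σ-neighborhood by brute force over all k-length words (feasible only for tiny k; the real method prunes the 20^k word space). Q is assumed to be a list of per-position probability dicts p_i(a) from PSI-BLAST; all names are illustrative.

```python
from itertools import product
from math import log
from collections import Counter

def neighborhood(Q, j, k, sigma, alphabet):
    """All words b_1..b_k with -sum_i log p_{j+i}(b_i) < sigma."""
    words = []
    for b in product(alphabet, repeat=k):
        cost = -sum(log(Q[j + i][b[i]]) for i in range(k))
        if cost < sigma:
            words.append("".join(b))
    return words

def profile_feature(Q, k, sigma, alphabet):
    """Phi(X): for each word, # of k-mer positions whose nbhd contains it."""
    phi = Counter()
    for j in range(len(Q) - k + 1):
        phi.update(neighborhood(Q, j, k, sigma, alphabet))
    return phi

def profile_kernel(Qx, Qy, k, sigma, alphabet):
    """Sum of m*n over words occurring m times for X and n times for Y."""
    fx = profile_feature(Qx, k, sigma, alphabet)
    fy = profile_feature(Qy, k, sigma, alphabet)
    return sum(m * fy[w] for w, m in fx.items())
```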

35
(semi)1. Discriminative motifs
  • According to this kernel K(X, Y), sequence X is
    mapped to Φ_{k,σ}(X), a vector in 20^k dimensions
  • Φ_{k,σ}(X)(b_1…b_k) = # of k-mers in Q(X) whose
    neighborhood includes b_1…b_k
  • Then, SVM learns a discriminating hyperplane
    with normal vector v
  • v = Σ_{i=1..N} (±) λ_i Φ_{k,σ}(X^(i))
  • Consider a profile k-mer Q[x_j … x_{j+k−1}]; its
    contribution to v is
  • ⟨Φ_{k,σ}(Q[x_j … x_{j+k−1}]), v⟩
  • Consider a position i in X; count up the
    contributions of all words containing x_i
  • g(x_i) = Σ_{j=1..k} max{ 0,
    ⟨Φ_{k,σ}(Q[x_{i−k+j} … x_{i+j−1}]), v⟩ }
  • Sort these contributions over all positions of
    all sequences, to pick important positions, or
    discriminative motifs (see the sketch below)
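
A minimal sketch of the motif-picking step, assuming a contribution(Q, j) callable that returns ⟨Φ_{k,σ}(Q[x_j … x_{j+k−1}]), v⟩ for a trained normal vector v (computing v from the SVM expansion is not shown); all names are illustrative.

```python
def position_scores(Q, k, contribution):
    """g(x_i): sum of max(0, window contribution) over windows containing i."""
    n = len(Q)
    window = [max(0.0, contribution(Q, j)) for j in range(n - k + 1)]
    g = [0.0] * n
    for j, w in enumerate(window):
        for i in range(j, j + k):      # window j covers positions j..j+k-1
            g[i] += w
    return g

def discriminative_positions(profiles, k, contribution, top=20):
    """Rank all positions of all sequences by g; return the highest-scoring."""
    scored = [(g_i, seq_id, i)
              for seq_id, Q in enumerate(profiles)
              for i, g_i in enumerate(position_scores(Q, k, contribution))]
    return sorted(scored, reverse=True)[:top]
```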

36
(semi)1. Discriminative motifs
  • Consider a position i in X; count up the
    contributions to v of all words containing x_i
  • Sort these contributions over all positions of
    all sequences, to pick discriminative motifs

37
(semi)2. Cluster Kernels
  • Two (more!) methods
  • Neighborhood
  • For each X, run PSI-BLAST to get similar seqs →
    Nbd(X)
  • Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X′ ∈ Nbd(X)}
    Φ_original(X′)
  • Counts of all k-mers matching with at most 1
    diff. all sequences that are similar to X
  • K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|))
    Σ_{X′ ∈ Nbd(X)} Σ_{Y′ ∈ Nbd(Y)} K(X′, Y′)
  • Bagged mismatch
  • Run k-means clustering n times, giving
    assignments c_p(X), p = 1,…,n
  • For every X and Y, count up the fraction of times
    they are bagged together
  • K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
  • Combine the bag fraction with the original
    comparison K(·,·)
  • K_new(X, Y) = K_bag(X, Y) · K(X, Y)
    (see the sketch below)
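
A minimal sketch of the bagged-mismatch combination, using scikit-learn's KMeans for the n clustering runs over some vector representation of the sequences (an assumption; the slide does not fix the clusterer), with base_K the precomputed original kernel matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def bagged_kernel(features, base_K, n_runs=10, n_clusters=8, seed=0):
    """K_new = K_bag * K, with K_bag the fraction of runs co-clustering i, j."""
    m = len(features)
    together = np.zeros((m, m))
    for p in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + p).fit_predict(features)
        together += (labels[:, None] == labels[None, :])
    k_bag = together / n_runs
    return k_bag * base_K          # elementwise product, as on the slide
```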

39
Some Benchmarks
40
Google-like homology search
  • The internet and the network of protein
    homologies have some similarity: both are
    scale-free networks
  • Given query X, Google ranks webpages by a flow
    algorithm
  • From each webpage W, linked nbrs receive flow
  • At time t+1, W sends to its nbrs the flow it
    received at time t
  • Finite, ergodic, aperiodic Markov chain
  • Can find the stationary distribution efficiently
    as the left eigenvector with eigenvalue 1
  • Start with an arbitrary probability distribution,
    and repeatedly multiply by the transition matrix

41
Google-like homology search
  • Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004
  • RANKPROP algorithm for protein homology
  • First, compute a matrix K_ij of PSI-BLAST homology
    between proteins i and j, normalized so that
    Σ_j K_ji = 1
  • Initialization: y_1(0) = 1; y_i(0) = 0 for i ≠ 1
  • For t = 0, 1, …
  • For i = 2 to m
  • y_i(t+1) = K_1i + α Σ_j K_ji y_j(t)
  • In the end, let y_i be the ranking score for
    similarity of sequence i to sequence 1
  • (α = 0.95 is good; see the sketch below)
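
A minimal sketch of this loop, with the query at index 0 (the slide's sequence 1) and K a column-normalized NumPy similarity matrix; the fixed iteration count stands in for a convergence test.

```python
import numpy as np

def rankprop(K, alpha=0.95, iters=100):
    """Propagate ranking scores from the query (index 0) through K."""
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0                                # y_1(0) = 1, all others 0
    for _ in range(iters):
        y_next = K[0, :] + alpha * (K.T @ y)  # y_i = K_1i + alpha sum_j K_ji y_j
        y_next[0] = 1.0                       # query score stays pinned at 1
        y = y_next
    return y          # y_i ranks similarity of sequence i to the query
```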

42
Google-like homology search
For a given protein family, what fraction of true
members of the family are ranked higher than the
first 50 non-members?
43
Protein Structure Prediction
44
Protein Structure Determination
  • Experimental
  • X-ray crystallography
  • NMR spectroscopy
  • Computational Structure Prediction
  • (The Holy Grail)
  • Sequence implies structure, therefore in
    principle we can predict the structure from the
    sequence alone

45
Protein Structure Prediction
  • ab initio
  • Use just first principles: energy, geometry, and
    kinematics
  • Homology
  • Find the best match to a database of sequences
    with known 3D-structure
  • Threading
  • Meta-servers and other methods

46
Ab initio Prediction
  • Sampling the global conformation space
  • Lattice models / Discrete-state models
  • Molecular Dynamics
  • Picking native conformations with an energy
    function
  • Solvation model: how the protein interacts with water
  • Pair interactions between amino acids
  • Predicting secondary structure
  • Local homology
  • Fragment libraries

47
Lattice String Folding
  • HP model: the main modeled force is hydrophobic
    attraction (a scoring sketch follows below)
  • NP-hard in both the 2-D square and 3-D cubic
    lattices
  • Constant-factor approximation algorithms exist
  • Not so relevant biologically
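
A minimal scoring sketch for the HP model on the 2-D square lattice: a fold is a self-avoiding walk (one lattice point per residue), and each pair of hydrophobic (H) residues that are lattice neighbors but not consecutive in the chain contributes −1; the sequence and fold in the usage line are illustrative.

```python
def hp_energy(sequence, fold):
    """sequence: string of 'H'/'P'; fold: list of (x, y) lattice points."""
    assert len(sequence) == len(fold)
    assert len(set(fold)) == len(fold), "fold must be self-avoiding"
    energy = 0
    for i in range(len(fold)):
        for j in range(i + 2, len(fold)):          # skip chain neighbors
            (xi, yi), (xj, yj) = fold[i], fold[j]
            if abs(xi - xj) + abs(yi - yj) == 1:   # lattice contact
                if sequence[i] == "H" and sequence[j] == "H":
                    energy -= 1                    # hydrophobic attraction
    return energy

# A 2x2 square fold of HPPH brings the two H's into contact: energy -1.
print(hp_energy("HPPH", [(0, 0), (0, 1), (1, 1), (1, 0)]))
```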

48
Lattice String Folding