Transcript and Presenter's Notes

Title: Protein Classification


1
Protein Classification
2
Protein Classification
  • Given a new protein, can we place it in its
    correct position within an existing protein
    hierarchy?
  • Methods
  • BLAST / PSI-BLAST
  • Profile HMMs
  • Supervised machine learning methods

[Figure: protein hierarchy (Proteins, Family, Superfamily, Fold), with a new protein to be placed]
3
PSI-BLAST
  • Given a sequence query x, and database D
  1. Find all pairwise alignments of x to sequences
     in D
  2. Collect all matches of x to y with some minimum
     significance
  3. Construct position-specific matrix M
     • Each sequence y is given a weight so that many
       similar sequences cannot have too much influence
       on a position (Henikoff & Henikoff 1994)
  4. Using the matrix M, search D for more matches
  • Iterate steps 1-4 until convergence (a sketch of
    the loop follows below)

[Figure: PSI-BLAST profile M]
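
A minimal sketch in Python of the iteration loop above, under strong simplifications: gapless matches of fixed length, a toy log-likelihood significance threshold, and uniform sequence weights instead of the Henikoff & Henikoff scheme. All names and the threshold value are illustrative.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(matches, pseudocount=1.0):
    """Position-specific probabilities p_i(a) from gapless matched segments."""
    length = len(matches[0])
    profile = []
    for i in range(length):
        counts = Counter(seg[i] for seg in matches)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile

def profile_score(profile, segment):
    """Log-likelihood of a segment under the profile (higher = better match)."""
    return sum(log(profile[i][a]) for i, a in enumerate(segment))

def psi_blast(x, database, rounds=5, threshold=-25.0):
    """Iterate: build profile M from matches, search D with M, absorb new matches."""
    matches = {x}
    for _ in range(rounds):
        profile = build_profile(sorted(matches))
        new = {y for y in database
               if len(y) == len(x) and profile_score(profile, y) > threshold}
        if new <= matches:          # convergence: no new sequences found
            return profile
        matches |= new
    return build_profile(sorted(matches))
```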
4
Classification with Profile HMMs
[Figure: the same hierarchy (Fold, Superfamily, Family), with a new protein to be placed using profile HMMs]
5
The Fisher Kernel
  • Fisher score:
  • U_X = ∇_θ log P(X | H1, θ)
  • Quantifies how each parameter contributes to
    generating X
  • For two different sequences X and Y, can compare
    U_X, U_Y
  • D²_F(X, Y) = ‖U_X − U_Y‖² / (2σ²)
  • Given this distance function, K(X, Y) is defined
    as a similarity measure:
  • K(X, Y) = exp(−D²_F(X, Y))
  • Set σ so that the average distance of training
    sequences Xi ∈ H1 to sequences Xj ∈ H0 is 1
    (see the sketch below)
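
A minimal sketch of the kernel computation, assuming the Fisher scores U_X are already available as NumPy vectors (computing them from the profile HMM is not shown); fisher_kernel and calibrate_sigma are illustrative names.

```python
import numpy as np

def fisher_kernel(u_x, u_y, sigma):
    """K(X, Y) = exp(-||U_X - U_Y||^2 / (2 sigma^2))."""
    d2 = np.sum((u_x - u_y) ** 2) / (2.0 * sigma ** 2)
    return np.exp(-d2)

def calibrate_sigma(scores_h1, scores_h0):
    """Pick sigma so the average squared H1-to-H0 distance D2_F equals 1,
    one plausible reading of the calibration rule on the slide."""
    d2 = [np.sum((u - v) ** 2) for u in scores_h1 for v in scores_h0]
    return np.sqrt(np.mean(d2) / 2.0)
```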

6
The Fisher Kernel
  • To train a classifier for a given family H1,
  • Build profile HMM, H1
  • U_X = ∇_θ log P(X | H1, θ)   (Fisher score)
  • D²_F(X, Y) = ‖U_X − U_Y‖² / (2σ²)   (distance)
  • K(X, Y) = exp(−D²_F(X, Y))   (akin to a dot
    product)
  • L(X) = Σ_{Xi ∈ H1} λ_i K(X, Xi) − Σ_{Xj ∈ H0} λ_j K(X, Xj)
  • Iteratively adjust λ to optimize
  • J(λ) = Σ_{Xi ∈ H1} λ_i (2 − L(Xi)) + Σ_{Xj ∈ H0} λ_j (2 + L(Xj))
  • To classify query X,
  • Compute U_X
  • Compute K(X, Xi) for all training examples Xi
    with λ_i ≠ 0 (few)
  • Decide based on the sign of L(X) (see the sketch
    below)
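
A minimal sketch of the resulting classifier, assuming the support sequences' Fisher scores, labels (+1 for H1, -1 for H0), and weights λ_i have come out of training; all names are illustrative.

```python
import numpy as np

def discriminant(u_query, support_u, labels, lambdas, sigma):
    """L(X) = sum_i y_i * lambda_i * K(X, X_i), over support examples only."""
    L = 0.0
    for u_i, y_i, lam in zip(support_u, labels, lambdas):
        k = np.exp(-np.sum((u_query - u_i) ** 2) / (2.0 * sigma ** 2))
        L += y_i * lam * k
    return L

def classify(u_query, support_u, labels, lambdas, sigma):
    """Assign to H1 when L(X) > 0, else to H0."""
    return "H1" if discriminant(u_query, support_u, labels, lambdas, sigma) > 0 else "H0"
```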

7
O. Jangmin
8
(No Transcript)
9
QUESTION
  • Running time of Fisher kernel SVM
  • on query X?

10
k-mer based SVMs
  • Leslie, Eskin, Weston, Noble, NIPS 2002
  • Highlights:
  • K(X, Y) = exp(−‖U_X − U_Y‖² / (2σ²)) requires an
    expensive profile alignment:
  • U_X = ∇_θ log P(X | H1, θ) costs O(|X| |H1|)
  • Instead, the new kernel K(X, Y) just counts up
    k-mers with mismatches in common between X and
    Y: O(|X|) in practice
  • Off-the-shelf SVM software used

11
k-mer based SVMs
  • For given word size k, and mismatch tolerance l,
    define
  • K(X, Y) = # of distinct k-long word occurrences
    in common between X and Y, with up to l mismatches
  • Define normalized kernel
    K′(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
  • SVM can be learned by supplying this kernel
    function

Let k = 3, l = 1
X: A B A C A R D I
Y: A B R A D A B I
K(X, Y) = 4;  K′(X, Y) = 4 / sqrt(7 · 7) = 4/7
(verified by the sketch below)
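
A brute-force check of this example, under one plausible reading of the count that reproduces the slide's numbers: unordered pairs of distinct k-mer words, one from each sequence, matching with at most l mismatches.

```python
from itertools import product
from math import sqrt

def kmers(s, k):
    """Distinct k-long words occurring in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def hamming(a, b):
    return sum(c != d for c, d in zip(a, b))

def mismatch_kernel(x, y, k=3, l=1):
    """Unordered pairs of k-mer words within l mismatches of each other."""
    pairs = {frozenset((w, v))
             for w, v in product(kmers(x, k), kmers(y, k))
             if hamming(w, v) <= l}
    return len(pairs)

X, Y = "ABACARDI", "ABRADABI"
kxy = mismatch_kernel(X, Y)                                    # 4
norm = kxy / sqrt(mismatch_kernel(X, X) * mismatch_kernel(Y, Y))
print(kxy, norm)                                               # 4 0.571... = 4/7
```
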
12
SVMs will find a few support vectors
After training, the SVM has determined a small set
of sequences, the support vectors, which need to be
compared with the query sequence X
13
Benchmarks
14
Semi-Supervised Methods
GENERATIVE SUPERVISED METHODS
15
Semi-Supervised Methods
DISCRIMINATIVE SUPERVISED METHODS
16
Semi-Supervised Methods
UNSUPERVISED METHODS
Mixture of Centers: Data generated by a fixed set
of centers (how many?)
26
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples

27
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples
  • SVMs and other discriminative methods may make
    significant mistakes due to lack of data

31
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples

Attempt to contract the distances within each
cluster while keeping inter-cluster distances
larger
33
Semi-Supervised Methods
  • Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005
  • A PSI-BLAST profile-based method
  • Weston, Leslie, Elisseeff, Noble, NIPS 2003
  • Cluster kernels

34
(semi)1. Profile k-mer based SVMs
[Figure: PSI-BLAST produces profile M]
  • For each sequence X,
  • Obtain PSI-BLAST profile Q(X) = { p_i(a) : a ∈
    amino acids, 1 ≤ i ≤ |X| }
  • For every k-mer x_j … x_{j+k−1} in X, define the
    σ-neighborhood
  • M_{k,σ}(Q[x_j … x_{j+k−1}]) = { b_1…b_k :
    −Σ_{i=0..k−1} log p_{j+i}(b_i) < σ }
  • Define K(X, Y):
  • For each b_1…b_k matching m times in X, n times
    in Y, add m·n
  • In practice, each k-mer can have up to 2
    mismatches, and K(X, Y) can be computed quickly
    in O(k² · 20² · (|X| + |Y|))
    (see the brute-force sketch below)
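
A minimal sketch of this kernel, enumerating the σ-neighborhood by brute force over all k-length words (feasible only for tiny k; the real method prunes the 20^k word space). Q is assumed to be a list of per-position probability dicts p_i(a) from PSI-BLAST; all names are illustrative.

```python
from itertools import product
from math import log
from collections import Counter

def neighborhood(Q, j, k, sigma, alphabet):
    """All words b_1..b_k with -sum_i log p_{j+i}(b_i) < sigma."""
    words = []
    for b in product(alphabet, repeat=k):
        cost = -sum(log(Q[j + i][b[i]]) for i in range(k))
        if cost < sigma:
            words.append("".join(b))
    return words

def profile_feature(Q, k, sigma, alphabet):
    """Phi(X): for each word, # of k-mer positions whose nbhd contains it."""
    phi = Counter()
    for j in range(len(Q) - k + 1):
        phi.update(neighborhood(Q, j, k, sigma, alphabet))
    return phi

def profile_kernel(Qx, Qy, k, sigma, alphabet):
    """Sum of m*n over words occurring m times for X and n times for Y."""
    fx = profile_feature(Qx, k, sigma, alphabet)
    fy = profile_feature(Qy, k, sigma, alphabet)
    return sum(m * fy[w] for w, m in fx.items())
```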

35
(semi)1. Discriminative motifs
  • According to this kernel K(X, Y), sequence X is
    mapped to Φ_{k,σ}(X), a vector in 20^k dimensions
  • Φ_{k,σ}(X)(b_1…b_k) = # of k-mers in Q(X) whose
    neighborhood includes b_1…b_k
  • Then, SVM learns a discriminating hyperplane
    with normal vector v
  • v = Σ_{i=1..N} (±) λ_i Φ_{k,σ}(X^(i))
  • Consider a profile k-mer Q[x_j … x_{j+k−1}]; its
    contribution to v is
  • ⟨Φ_{k,σ}(Q[x_j … x_{j+k−1}]), v⟩
  • Consider a position i in X; count up the
    contributions of all words containing x_i
  • g(x_i) = Σ_{j=1..k} max{ 0,
    ⟨Φ_{k,σ}(Q[x_{i−k+j} … x_{i+j−1}]), v⟩ }
  • Sort these contributions over all positions of
    all sequences, to pick important positions, or
    discriminative motifs (see the sketch below)
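
A minimal sketch of the motif-picking step, assuming a contribution(Q, j) callable that returns ⟨Φ_{k,σ}(Q[x_j … x_{j+k−1}]), v⟩ for a trained normal vector v (computing v from the SVM expansion is not shown); all names are illustrative.

```python
def position_scores(Q, k, contribution):
    """g(x_i): sum of max(0, window contribution) over windows containing i."""
    n = len(Q)
    window = [max(0.0, contribution(Q, j)) for j in range(n - k + 1)]
    g = [0.0] * n
    for j, w in enumerate(window):
        for i in range(j, j + k):      # window j covers positions j..j+k-1
            g[i] += w
    return g

def discriminative_positions(profiles, k, contribution, top=20):
    """Rank all positions of all sequences by g; return the highest-scoring."""
    scored = [(g_i, seq_id, i)
              for seq_id, Q in enumerate(profiles)
              for i, g_i in enumerate(position_scores(Q, k, contribution))]
    return sorted(scored, reverse=True)[:top]
```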

36
(semi)1. Discriminative motifs
  • Consider a position i in X; count up the
    contributions to v of all words containing x_i
  • Sort these contributions over all positions of
    all sequences, to pick discriminative motifs

37
(semi)2. Cluster Kernels
  • Two (more!) methods
  • Neighborhood
  • For each X, run PSI-BLAST to get similar seqs →
    Nbd(X)
  • Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X′ ∈ Nbd(X)}
    Φ_original(X′)
  • Counts of all k-mers matching with at most 1
    diff. all sequences that are similar to X
  • K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|))
    Σ_{X′ ∈ Nbd(X)} Σ_{Y′ ∈ Nbd(Y)} K(X′, Y′)
  • Bagged mismatch
  • Run k-means clustering n times, giving
    assignments c_p(X), p = 1,…,n
  • For every X and Y, count up the fraction of times
    they are bagged together
  • K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
  • Combine the bag fraction with the original
    comparison K(·,·)
  • K_new(X, Y) = K_bag(X, Y) · K(X, Y)
    (see the sketch below)
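
A minimal sketch of the bagged-mismatch combination, using scikit-learn's KMeans for the n clustering runs over some vector representation of the sequences (an assumption; the slide does not fix the clusterer), with base_K the precomputed original kernel matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def bagged_kernel(features, base_K, n_runs=10, n_clusters=8, seed=0):
    """K_new = K_bag * K, with K_bag the fraction of runs co-clustering i, j."""
    m = len(features)
    together = np.zeros((m, m))
    for p in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + p).fit_predict(features)
        together += (labels[:, None] == labels[None, :])
    k_bag = together / n_runs
    return k_bag * base_K          # elementwise product, as on the slide
```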

39
Some Benchmarks
40
Google-like homology search
  • The internet and the network of protein
    homologies have some similarity: both are
    scale-free networks
  • Given query X, Google ranks webpages by a flow
    algorithm
  • From each webpage W, linked nbrs receive flow
  • At time t+1, W sends to its nbrs the flow it
    received at time t
  • Finite, ergodic, aperiodic Markov chain
  • Can find the stationary distribution efficiently
    as the left eigenvector with eigenvalue 1
  • Start with an arbitrary probability distribution,
    and repeatedly multiply by the transition matrix

41
Google-like homology search
  • Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004
  • RANKPROP algorithm for protein homology
  • First, compute a matrix K_ij of PSI-BLAST homology
    between proteins i and j, normalized so that
    Σ_j K_ji = 1
  • Initialization: y_1(0) = 1; y_i(0) = 0 for i ≠ 1
  • For t = 0, 1, …
  • For i = 2 to m
  • y_i(t+1) = K_1i + α Σ_j K_ji y_j(t)
  • In the end, let y_i be the ranking score for
    similarity of sequence i to sequence 1
  • (α = 0.95 is good; see the sketch below)
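
A minimal sketch of this loop, with the query at index 0 (the slide's sequence 1) and K a column-normalized NumPy similarity matrix; the fixed iteration count stands in for a convergence test.

```python
import numpy as np

def rankprop(K, alpha=0.95, iters=100):
    """Propagate ranking scores from the query (index 0) through K."""
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0                                # y_1(0) = 1, all others 0
    for _ in range(iters):
        y_next = K[0, :] + alpha * (K.T @ y)  # y_i = K_1i + alpha sum_j K_ji y_j
        y_next[0] = 1.0                       # query score stays pinned at 1
        y = y_next
    return y          # y_i ranks similarity of sequence i to the query
```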

42
Google-like homology search
For a given protein family, what fraction of true
members of the family are ranked higher than the
first 50 non-members?
43
Protein Structure Prediction
44
Protein Structure Determination
  • Experimental
  • X-ray crystallography
  • NMR spectroscopy
  • Computational Structure Prediction
  • (The Holy Grail)
  • Sequence implies structure, therefore in
    principle we can predict the structure from the
    sequence alone

45
Protein Structure Prediction
  • ab initio
  • Use just first principles: energy, geometry, and
    kinematics
  • Homology
  • Find the best match to a database of sequences
    with known 3D-structure
  • Threading
  • Meta-servers and other methods

46
Ab initio Prediction
  • Sampling the global conformation space
  • Lattice models / Discrete-state models
  • Molecular Dynamics
  • Picking native conformations with an energy
    function
  • Solvation model: how the protein interacts with water
  • Pair interactions between amino acids
  • Predicting secondary structure
  • Local homology
  • Fragment libraries

47
Lattice String Folding
  • HP model: the main modeled force is hydrophobic
    attraction (a scoring sketch follows below)
  • NP-hard in both the 2-D square and 3-D cubic
    lattices
  • Constant-factor approximation algorithms exist
  • Not so relevant biologically
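
A minimal scoring sketch for the HP model on the 2-D square lattice: a fold is a self-avoiding walk (one lattice point per residue), and each pair of hydrophobic (H) residues that are lattice neighbors but not consecutive in the chain contributes −1; the sequence and fold in the usage line are illustrative.

```python
def hp_energy(sequence, fold):
    """sequence: string of 'H'/'P'; fold: list of (x, y) lattice points."""
    assert len(sequence) == len(fold)
    assert len(set(fold)) == len(fold), "fold must be self-avoiding"
    energy = 0
    for i in range(len(fold)):
        for j in range(i + 2, len(fold)):          # skip chain neighbors
            (xi, yi), (xj, yj) = fold[i], fold[j]
            if abs(xi - xj) + abs(yi - yj) == 1:   # lattice contact
                if sequence[i] == "H" and sequence[j] == "H":
                    energy -= 1                    # hydrophobic attraction
    return energy

# A 2x2 square fold of HPPH brings the two H's into contact: energy -1.
print(hp_energy("HPPH", [(0, 0), (0, 1), (1, 1), (1, 0)]))
```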

48
Lattice String Folding