Protein Classification - PowerPoint PPT Presentation

Slides: 41

Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

Title: Protein Classification

1
Protein Classification
2
PDB Growth
New PDB structures
3
Protein classification

The number of protein sequences grows exponentially
The number of solved structures grows exponentially
The number of new folds identified is very small (and
close to constant)
Protein classification can
Generate overview of structure types
Detect similarities (evolutionary relationships)
between protein sequences

SCOP release 1.67

Class                               Folds  Superfamilies  Families
All alpha proteins                    202            342       550
All beta proteins                     141            280       529
Alpha and beta proteins (a/b)         130            213       593
Alpha and beta proteins (a+b)         260            386       650
Multi-domain proteins                  40             40        55
Membrane and cell surface proteins     42             82        91
Small proteins                         72            104       162
Total                                 887           1447      2630

Morten Nielsen,CBS, BioCentrum, DTU
4
Protein structure classification
Protein fold
Protein superfamily
Protein family
5
Structure Classification Databases

SCOP
Manual classification (A. Murzin)
scop.berkeley.edu
CATH
Semi-manual classification (C. Orengo)
www.biochem.ucl.ac.uk/bsm/cath
FSSP
Automatic classification (L. Holm)
www.ebi.ac.uk/dali/fssp/fssp.html

6
Major classes in SCOP

Classes
All alpha proteins
All beta proteins
Alpha and beta proteins (a/b)
Alpha and beta proteins (a+b)
Multi-domain proteins
Membrane and cell surface proteins
Small proteins

7
All α: Hemoglobin (1bab)
8
All β: Immunoglobulin (8fab)
9
α/β: Triosephosphate isomerase (1hti)
10
α+β: Lysozyme (1jsf)
11
Families

Proteins whose evolutionary relationship is
readily recognizable from the sequence
(>25% sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
The same protein may be found in several species

(Figure: SCOP hierarchy, Fold > Superfamily > Family > Proteins)
12
Superfamilies

Proteins which are (remotely) evolutionarily
related
Sequence similarity low
Share function
Share special structural features
Relationships between members of a superfamily
may not be readily recognizable from the sequence
alone

(Figure: SCOP hierarchy, Fold > Superfamily > Family > Proteins)
13
Folds

Proteins which have >50% of their secondary structure
elements arranged in the same order in the
protein chain and in three dimensions are
classified as having the same fold
No evolutionary relationship between the proteins is implied

(Figure: SCOP hierarchy, Fold > Superfamily > Family > Proteins)
14
Protein Classification

Given a new protein, can we place it in its
correct position within an existing protein
hierarchy?
Methods
BLAST / PSI-BLAST
Profile HMMs
Supervised Machine Learning methods

(Figure: SCOP hierarchy, Fold > Superfamily > Family > Proteins, with a new protein whose position is unknown)
15
PSI-BLAST

Given a query sequence x and a database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum
significance
3. Construct a position-specific matrix M
Each sequence y is given a weight so that many
similar sequences cannot have much influence on a
position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate steps 1-4 until convergence

Profile M
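
The weighting step in the PSI-BLAST outline can be sketched as follows. This is a minimal illustration of position-based sequence weights in the style of Henikoff & Henikoff (1994), not PSI-BLAST's actual implementation: it assumes an ungapped alignment of equal-length sequences, returns weighted residue frequencies rather than log-odds scores, and the function names henikoff_weights and weighted_profile are hypothetical.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights for an ungapped alignment.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct residues in the column and n is the number of sequences
    sharing this sequence's residue. Weights are normalized to sum to 1.
    """
    n_cols = len(alignment[0])
    weights = [0.0] * len(alignment)
    for col in range(n_cols):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        n_types = len(counts)
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (n_types * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def weighted_profile(alignment, alphabet):
    """Weighted residue frequencies per column (a simplified profile M)."""
    weights = henikoff_weights(alignment)
    profile = []
    for col in range(len(alignment[0])):
        freqs = {a: 0.0 for a in alphabet}
        for w, seq in zip(weights, alignment):
            freqs[seq[col]] += w
        profile.append(freqs)
    return profile


# Two identical sequences together get the same influence as the single
# divergent one, so neither residue dominates the first column.
alignment = ["AAA", "AAA", "CCC"]
print(henikoff_weights(alignment))
print(weighted_profile(alignment, "AC")[0])
```

A real profile would also add pseudocounts and convert the frequencies to log-odds scores against background frequencies; both are left out here to keep the weighting idea visible.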
16
Profile HMMs
Protein profile H

Each M state has a position-specific pre-computed
substitution table
Each I and D state has position-specific gap
penalties
Profile is a generative model
The sequence X that is aligned to H is thought
of as generated by H
Therefore, H parameterizes a conditional
distribution P(X | H)

17
Classification with Profile HMMs
(Figure: SCOP hierarchy of Fold, Superfamily, and Family, with a new protein whose position is unknown)
18
Classification with Profile HMMs

How generative models work
Training examples (sequences known to be members
of the family): positive
The model assigns a probability to any given protein
sequence
Sequences from the family yield a higher
probability than sequences from outside the family
Log-likelihood ratio as score
$$L(X) = \log \frac{P(X \mid H_1)\,P(H_1)}{P(X \mid H_0)\,P(H_0)} = \log \frac{P(H_1 \mid X)\,P(X)}{P(H_0 \mid X)\,P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}$$
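
As a small worked illustration of the score on this slide, the sketch below computes the log-odds from two log-likelihoods and a prior. The function names log_odds and posterior_h1 and the default prior of 0.5 are assumptions made for the example; how log P(X | H1) and log P(X | H0) are obtained is outside its scope.

```python
import math


def log_odds(log_p_x_h1, log_p_x_h0, prior_h1=0.5):
    """L(X) = log [P(X|H1) P(H1)] / [P(X|H0) P(H0)], with P(H0) = 1 - P(H1)."""
    return (log_p_x_h1 + math.log(prior_h1)) - (
        log_p_x_h0 + math.log(1.0 - prior_h1)
    )


def posterior_h1(score):
    """P(H1 | X) recovered from L(X) = log P(H1|X) / P(H0|X)."""
    return 1.0 / (1.0 + math.exp(-score))


# A sequence twice as likely under the family model as under the null model.
score = log_odds(math.log(0.02), math.log(0.01))
print(score, posterior_h1(score))
```

Working with log-probabilities avoids underflow, since P(X | H) for a full-length protein is far too small to represent directly.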

19
Generation of a protein by a profile HMM

$P(X \mid H) = ?$
To generate the sequence $x_1 \ldots x_n$ by profile HMM H
We will find the sum of the probabilities of all possible
ways to generate X
Define
$A_j^M(i)$: probability of generating $x_1 \ldots x_i$ and
ending with $x_i$ being emitted from $M_j$
$A_j^I(i)$: probability of generating $x_1 \ldots x_i$ and
ending with $x_i$ being emitted from $I_j$
$A_j^D(i)$: probability of generating $x_1 \ldots x_i$ and
ending in $D_j$
($x_i$ is the last character emitted before $D_j$)

20
Alignment of a protein to a profile HMM

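$$A_j^M(i) = e_{M(j)}(x_i) \left[ A_{j-1}^M(i-1)\, a_{M(j-1)M(j)} + A_{j-1}^I(i-1)\, a_{I(j-1)M(j)} + A_{j-1}^D(i-1)\, a_{D(j-1)M(j)} \right]$$
$$A_j^I(i) = e_{I(j)}(x_i) \left[ A_j^M(i-1)\, a_{M(j)I(j)} + A_j^I(i-1)\, a_{I(j)I(j)} + A_j^D(i-1)\, a_{D(j)I(j)} \right]$$
$$A_j^D(i) = A_{j-1}^M(i)\, a_{M(j-1)D(j)} + A_{j-1}^I(i)\, a_{I(j-1)D(j)} + A_{j-1}^D(i)\, a_{D(j-1)D(j)}$$
In log space, each product becomes a sum of log terms (for example $\log A_{j-1}^M(i-1) + \log a_{M(j-1)M(j)}$) and each bracketed sum becomes a log-sum-exp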
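
The recurrences on this slide can be sketched as a forward pass over a small table. This is an illustrative sketch in probability space (no log scaling), assuming integer-coded residues, a begin state M_0, and an end state reached by the "to M" transition out of the last position; the array layout and the name forward_profile_hmm are assumptions, not the layout of any particular HMM package.

```python
import numpy as np

M, I, D = 0, 1, 2  # indices for the match, insert, and delete state types


def forward_profile_hmm(x, eM, eI, a):
    """Return P(x | H) for a profile HMM with J match states.

    x  : sequence of residue indices (length n)
    eM : (J+1, K) array; eM[j] is the emission distribution of M_j (row 0 unused)
    eI : (J+1, K) array; eI[j] is the emission distribution of I_j, j = 0..J
    a  : (J+1, 3, 3) array; a[j, s, t] is the probability of moving from the
         state of type s at position j to type t, where t = M means M_{j+1}
         (the end state when j = J), t = I means I_j, and t = D means D_{j+1}.
    """
    n, J = len(x), eM.shape[0] - 1
    A = np.zeros((3, J + 1, n + 1))  # A[s, j, i]: forward value in state s_j
    A[M, 0, 0] = 1.0                 # begin state, nothing emitted yet
    for i in range(n + 1):
        for j in range(J + 1):
            if i > 0 and j > 0:      # match: emits x_i and advances j
                A[M, j, i] = eM[j, x[i - 1]] * np.dot(
                    A[:, j - 1, i - 1], a[j - 1, :, M])
            if i > 0:                # insert: emits x_i and stays at j
                A[I, j, i] = eI[j, x[i - 1]] * np.dot(
                    A[:, j, i - 1], a[j, :, I])
            if j > 0:                # delete: silent, advances j
                A[D, j, i] = np.dot(A[:, j - 1, i], a[j - 1, :, D])
    return float(np.dot(A[:, J, n], a[J, :, M]))  # transition into the end state


# Toy model: two match states over a two-letter alphabet.
J, K = 2, 2
eM = np.array([[0.0, 0.0], [0.9, 0.1], [0.2, 0.8]])
eI = np.full((J + 1, K), 0.5)
a = np.zeros((J + 1, 3, 3))
a[:, :, M], a[:, :, I], a[:, :, D] = 0.8, 0.1, 0.1
a[J, :, M], a[J, :, D] = 0.9, 0.0  # last position: end or insert only
print(forward_profile_hmm([0, 1], eM, eI, a))
```

For real proteins the products underflow quickly, so an implementation would normally run the same recurrences on log-probabilities with a log-sum-exp in place of each sum.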

21
Generative Models
22
Generative Models
23
Generative Models
24
Generative Models
25
Generative Models
26
Discriminative Methods
Instead of modeling the process that generates
data, directly discriminate between classes

More direct way to the goal
Better if model is not accurate

27
Discriminative Models -- SVM

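If $x_1 \ldots x_n$ are the training examples,
$\operatorname{sign}\left(\sum_i \lambda_i x_i^T x\right)$ decides where x falls
Train $\lambda_i$ to achieve the best margin

(Figure: separating hyperplane with normal vector v and its margin)
Decision rule: red if $v^T x > 0$
Large margin for $|v| < 1$ ⇔ margin of 1 for small $|v|$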
28
Discriminative protein classification

Jaakkola, Diekhans, Haussler, ISMB 1999
Define the discriminating function to be
$$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$$
We decide $X \in$ family $H_1$ whenever $L(X) > 0$
For now, let's just assume K(.,.) is a similarity
function
Then, we want to train $\lambda_i$ so that this classifier
makes as few mistakes as possible on new data
Similarly to SVMs, train $\lambda_i$ so that the margin is
largest for $0 \le \lambda_i \le 1$

29
Discriminative protein classification

Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$,
and $L(X_i) \le -1$ otherwise
This is not always possible; softer constraints
are obtained with the following objective
function
$$J(\lambda) = \sum_{X_i \in H_1} \lambda_i \left(2 - L(X_i)\right) + \sum_{X_j \in H_0} \lambda_j \left(2 + L(X_j)\right)$$
Training: for $X_i \in H_1$, try to make $L(X_i) = 1$
$$\lambda_i \leftarrow \frac{1 - L(X_i) + \lambda_i K(X_i, X_i)}{K(X_i, X_i)}$$
with minimum allowable value 0 and maximum 1
Similarly, for $X_i \in H_0$, try to make $L(X_i) = -1$
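
The training rule on this slide can be sketched as repeated sweeps over the training examples, updating one weight at a time and clipping it to [0, 1]. This is a minimal sketch assuming a precomputed kernel matrix and labels +1 for H1 and -1 for H0; the update for H0 examples is written as the mirror image of the H1 update (an assumption, since the slide only says "similarly"), and the names train_lambdas and discriminant are hypothetical.

```python
import numpy as np


def train_lambdas(K, y, n_sweeps=100):
    """Coordinate-wise updates of the weights lambda_i.

    K : (n, n) kernel matrix with K[i, j] = K(X_i, X_j)
    y : (n,) labels, +1 for X_i in H1 and -1 for X_i in H0
    """
    n = len(y)
    lam = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):
            L_i = np.dot(lam * y, K[:, i])  # current L(X_i)
            # Aim for y_i * L(X_i) = 1, then clip to the allowed range [0, 1].
            z = (1.0 - y[i] * L_i + lam[i] * K[i, i]) / K[i, i]
            lam[i] = min(1.0, max(0.0, z))
    return lam


def discriminant(lam, y, k_new):
    """L(X) for a new X, where k_new[i] = K(X, X_i) for each training X_i."""
    return float(np.dot(lam * y, k_new))


# Toy data: points on a line with a Gaussian similarity as the kernel.
pts = np.array([0.0, 0.1, 3.0, 3.2])
y = np.array([1, 1, -1, -1])
K = np.exp(-(pts[:, None] - pts[None, :]) ** 2)
lam = train_lambdas(K, y)
print(lam, discriminant(lam, y, np.exp(-(0.05 - pts) ** 2)))
```

Each update changes a single weight while holding the others fixed, so no general-purpose optimizer is needed; the clipping step is what enforces the bounds 0 and 1 on each weight.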

30
The Fisher Kernel

The function K(X, Y) compares two sequences
Acts effectively as an inner product in a
(non-Euclidean) space
Called Kernel
Has to be positive definite
For any $X_1, \ldots, X_n$, the matrix $K = [K_{ij}] = [K(X_i, X_j)]$
is such that
for any $X \in \mathbb{R}^n$, $X \ne 0$: $X^T K X > 0$
Choice of this function is important
Consider $P(X \mid H_1, \theta)$ sufficient statistics
How many expected times X takes each
transition/emission
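
The positive-definiteness condition on this slide can be checked numerically for a given kernel matrix. The sketch below assumes a symmetric matrix and tests whether its smallest eigenvalue is above a small tolerance; the function name is_positive_definite and the tolerance value are choices made for the example.

```python
import numpy as np


def is_positive_definite(K, tol=1e-10):
    """True if K is symmetric and every eigenvalue exceeds tol.

    For a symmetric matrix, v^T K v > 0 for all nonzero v holds exactly
    when all eigenvalues are positive.
    """
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() > tol)


print(is_positive_definite([[2.0, 1.0], [1.0, 2.0]]))  # eigenvalues 1 and 3
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))  # eigenvalues -1 and 3
```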

31
The Fisher Kernel

Fisher score
$U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
Quantifies how each parameter contributes to
generating X
For two different sequences X and Y, can compare
$U_X$, $U_Y$
$D_F^2(X, Y) = \frac{1}{2\sigma^2} \lVert U_X - U_Y \rVert^2$
Given this distance function, K(X, Y) is defined
as a similarity measure
$K(X, Y) = \exp(-D_F^2(X, Y))$
Set $\sigma$ so that the average distance of training
sequences $X_i \in H_1$ to sequences $X_j \in H_0$ is 1

Question: Is the partial derivative larger when X
uses a given parameter $\theta_i$ more or less often?
Question: Is the partial derivative larger when a
given parameter $\theta_i$ is larger or smaller?
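
The Fisher score and kernel can be illustrated on a deliberately simplified model. The sketch below replaces the profile HMM with a toy model in which residues are drawn independently with probabilities theta, so that log P(X | theta) is the sum over residues c of n_c(X) log theta_c and the partial derivative with respect to theta_c is n_c(X) / theta_c (treating the parameters as unconstrained). This stand-in model, the alphabet ordering, and the function names are assumptions for the example; a real Fisher score for a profile HMM would use expected transition and emission counts from the forward-backward algorithm.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids; the ordering is arbitrary


def fisher_score(seq, theta):
    """U_X for the toy i.i.d. model: d/d theta_c of log P(X | theta) = n_c / theta_c."""
    counts = np.array([seq.count(c) for c in ALPHABET], dtype=float)
    return counts / theta


def fisher_kernel(x, y, theta, sigma):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2))."""
    diff = fisher_score(x, theta) - fisher_score(y, theta)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


theta = np.full(len(ALPHABET), 1.0 / len(ALPHABET))
print(fisher_score("AAC", theta)[:2])
print(fisher_kernel("AAC", "AAD", theta, sigma=20.0))
```

In this toy model the derivative grows with the number of times X uses a parameter and shrinks as the parameter itself gets larger, which is one way to approach the two questions on the slide.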
32
The Fisher Kernel

In summary, to distinguish between family H1 and
(non-family) H0, define
Profile H1
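$U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ (Fisher score)
$D_F^2(X, Y) = \frac{1}{2\sigma^2} \lVert U_X - U_Y \rVert^2$ (distance)
$K(X, Y) = \exp(-D_F^2(X, Y))$ (akin to dot
product)
$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$
Iteratively adjust $\lambda$ to optimize
$J(\lambda) = \sum_{X_i \in H_1} \lambda_i \left(2 - L(X_i)\right) + \sum_{X_j \in H_0} \lambda_j \left(2 + L(X_j)\right)$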

33
The Fisher Kernel

If a given superfamily has more than one profile
model,
$$L_{\max}(X) = \max_i L_i(X) = \max_i \left( \sum_{X_j \in H_i} \lambda_j K(X, X_j) - \sum_{X_j \in H_0} \lambda_j K(X, X_j) \right)$$

(Figure labels: Superfamily, Family)
34
Benchmarks

Methods evaluated
BLAST (Altschul et al. 1990; Gish & States 1993)
HMMs using SAM-T98 methodology (Park et al. 1998;
Karplus, Barrett & Hughey 1998; Hughey & Krogh
1995, 1996)
SVM-Fisher
Measurement of recognition rate for members of
superfamilies of SCOP (Hubbard et al. 1997)
PDB90 eliminates redundant sequences
Withhold all members of a given SCOP family
Train with the remaining members of the SCOP
superfamily
Test with the withheld data
Question: Could the method discover a new family
of a known superfamily?

O. Jangmin
35
36
Other methods

WU-BLAST version 2.0a16 (Altschul & Gish 1996)
The PDB90 database was queried with each positive
training example, and E-values were recorded
BLAST:SCOP-only
BLAST:SCOP+SAM-T98-homologs
Scores were combined by the maximum method
SAM-T98 method
Same data and same set of models as in
SVM-Fisher
Scores combined with the maximum method

37
Results

Metric: the rate of false positives (RFP)
RFP for a positive test sequence: the fraction
of negative test sequences that score as well as or
better than the positive sequence
Results for the G proteins family of the nucleotide
triphosphate hydrolases SCOP superfamily
Test: the ability to distinguish 8 PDB90 G
proteins from 2439 sequences in other SCOP folds
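
The RFP metric defined on this slide can be computed directly from scores. The sketch below assumes that a higher score means a better match (for E-values the comparison would be reversed); the function name rate_of_false_positives and the example scores are made up for illustration.

```python
def rate_of_false_positives(pos_score, neg_scores):
    """Fraction of negative test sequences scoring at least as well as the positive."""
    n_as_good = sum(1 for s in neg_scores if s >= pos_score)
    return n_as_good / len(neg_scores)


# Two of the four negatives score as well as or better than the positive.
print(rate_of_false_positives(0.9, [0.1, 0.95, 0.9, 0.3]))
```

An RFP of 0 means the positive sequence outscored every negative; values near 1 mean it was ranked below almost all of them.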

38
Table 1. Rate of false positives for the G proteins
family. BLAST = BLAST:SCOP-only, B-Hom =
BLAST:SCOP+SAM-T98-homologs, S-T98 = SAM-T98, and
SVM-F = SVM-Fisher method
39
(No Transcript)
40
QUESTION