Protein Classification - PowerPoint PPT Presentation

Provided by: root
Learn more at: http://web.stanford.edu
Transcript and Presenter's Notes

Title: Protein Classification


1
Protein Classification
2
PDB Growth
New PDB structures
3
Protein classification
  • The number of protein sequences grows exponentially
  • The number of solved structures grows exponentially
  • The number of new folds identified is very small (and close to constant)
  • Protein classification can
      • Generate an overview of structure types
      • Detect similarities (evolutionary relationships) between protein sequences

SCOP release 1.67

Class                               Folds  Superfamilies  Families
All alpha proteins                    202            342       550
All beta proteins                     141            280       529
Alpha and beta proteins (a/b)         130            213       593
Alpha and beta proteins (a+b)         260            386       650
Multi-domain proteins                  40             40        55
Membrane and cell surface proteins     42             82        91
Small proteins                         72            104       162
Total                                 887           1447      2630

Morten Nielsen, CBS, BioCentrum, DTU
4
Protein structure classification
Protein fold
Protein superfamily
Protein family
5
Structure Classification Databases
  • SCOP
      • Manual classification (A. Murzin)
      • scop.berkeley.edu
  • CATH
      • Semi-manual classification (C. Orengo)
      • www.biochem.ucl.ac.uk/bsm/cath
  • FSSP
      • Automatic classification (L. Holm)
      • www.ebi.ac.uk/dali/fssp/fssp.html

6
Major classes in SCOP
  • Classes
      • All alpha proteins
      • All beta proteins
      • Alpha and beta proteins (a/b)
      • Alpha and beta proteins (a+b)
      • Multi-domain proteins
      • Membrane and cell surface proteins
      • Small proteins

7
All α: Hemoglobin (1bab)
8
All β: Immunoglobulin (8fab)
9
α/β: Triosephosphate isomerase (1hti)
10
α+β: Lysozyme (1jsf)
11
Families
  • Proteins whose evolutionary relationship is readily recognizable from
    the sequence (>25% sequence identity)
  • Families are further subdivided into Proteins
  • Proteins are divided into Species
      • The same protein may be found in several species

[Diagram: SCOP hierarchy, Fold > Superfamily > Family > Proteins]
12
Superfamilies
  • Proteins which are (remotely) evolutionarily related
      • Sequence similarity is low
      • Share function
      • Share special structural features
  • Relationships between members of a superfamily may not be readily
    recognizable from the sequence alone

[Diagram: SCOP hierarchy, Fold > Superfamily > Family > Proteins]
13
Folds
  • Proteins which have >50% of their secondary structure elements
    arranged in the same order in the protein chain and in three
    dimensions are classified as having the same fold
  • An evolutionary relationship between the proteins is not required

[Diagram: SCOP hierarchy, Fold > Superfamily > Family > Proteins]
14
Protein Classification
  • Given a new protein, can we place it in its
    correct position within an existing protein
    hierarchy?
  • Methods
      • BLAST / PSI-BLAST
      • Profile HMMs
      • Supervised machine learning methods

[Diagram: a new protein (?) to be placed in the Fold > Superfamily > Family > Proteins hierarchy]
15
PSI-BLAST
  • Given a query sequence x and a database D
  1. Find all pairwise alignments of x to sequences in D
  2. Collect all matches of x to y with some minimum significance
  3. Construct a position-specific matrix M
      • Each sequence y is given a weight so that many similar sequences
        cannot have much influence on a position (Henikoff & Henikoff 1994)
  4. Using the matrix M, search D for more matches
  5. Iterate steps 1-4 until convergence
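
The sequence-weighting and matrix-construction steps above can be sketched roughly as follows. This is a minimal illustration, not PSI-BLAST's actual implementation: it assumes the collected matches are already given as an ungapped, equal-length alignment, uses a uniform background distribution, and adds a simple pseudocount. The function names `henikoff_weights` and `position_specific_matrix` are hypothetical.

```python
import math
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights in the style of Henikoff & Henikoff (1994).

    In each column, a sequence receives 1 / (k * n), where k is the number of
    distinct residues in the column and n is how many sequences share this
    sequence's residue. Weights are summed over columns and normalized to 1.
    """
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

def position_specific_matrix(alignment, alphabet, pseudocount=1.0):
    """Log-odds matrix M[col][residue] from weighted residue frequencies.

    Assumptions for illustration: uniform background, flat pseudocount.
    """
    n = len(alignment)
    weights = henikoff_weights(alignment)
    background = 1.0 / len(alphabet)
    matrix = []
    for col in range(len(alignment[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq, w in zip(alignment, weights):
            counts[seq[col]] += n * w  # weighted count, total mass n
        total = n + pseudocount * len(alphabet)
        matrix.append({a: math.log(counts[a] / total / background) for a in alphabet})
    return matrix

alignment = ["AAC", "AAC", "AAC", "GTC"]
weights = henikoff_weights(alignment)
M = position_specific_matrix(alignment, "ACGT")
```

In this toy alignment the three identical sequences share their influence, so the single divergent sequence receives the largest weight.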

Profile M
16
Profile HMMs
Protein profile H
  • Each M state has a position-specific, pre-computed substitution table
  • Each I and D state has position-specific gap penalties
  • The profile is a generative model
      • The sequence X that is aligned to H is thought of as generated by H
      • Therefore, H parameterizes a conditional distribution $P(X \mid H)$

17
Classification with Profile HMMs
[Diagram: a new protein (?) to be placed in the Fold > Superfamily > Family hierarchy]
18
Classification with Profile HMMs
  • How generative models work
      • Training examples (sequences known to be members of the family)
        are positive examples
      • The model assigns a probability to any given protein sequence
      • Sequences from the family should yield a higher probability than
        sequences from outside the family
  • Log-likelihood ratio as score
  • $L(X) = \log \frac{P(X \mid H_1)\, P(H_1)}{P(X \mid H_0)\, P(H_0)} = \log \frac{P(H_1 \mid X)\, P(X)}{P(H_0 \mid X)\, P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}$
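
As a rough illustration of the log-likelihood ratio score, the sketch below replaces the profile HMM with a much simpler stand-in: a gapless profile for H1 (one residue distribution per position) and an i.i.d. background for H0. Both models, the two-letter alphabet, and the name `log_likelihood_ratio` are assumptions made for the example.

```python
import math

def log_likelihood_ratio(x, profile, background, prior_h1=0.5):
    """log [P(x | H1) P(H1)] / [P(x | H0) P(H0)] for a gapless toy profile.

    profile[j][c]  : probability of residue c at position j under H1
    background[c]  : probability of residue c under H0 (i.i.d.)
    prior_h1       : P(H1); P(H0) is taken as 1 - P(H1)
    """
    log_lr = sum(math.log(profile[j][c] / background[c]) for j, c in enumerate(x))
    log_prior_odds = math.log(prior_h1 / (1.0 - prior_h1))
    return log_lr + log_prior_odds

profile = [{"A": 0.8, "C": 0.2}, {"A": 0.1, "C": 0.9}]
background = {"A": 0.5, "C": 0.5}

score_member = log_likelihood_ratio("AC", profile, background)      # fits the profile
score_non_member = log_likelihood_ratio("CA", profile, background)  # does not fit
```

A positive score favors the family model H1 and a negative score favors H0.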

19
Generation of a protein by a profile HMM
  • $P(X \mid H) = {?}$
  • To generate the sequence $x_1 \dots x_n$ by profile HMM H
  • We find the sum of the probabilities of all possible ways to generate X
  • Define
      • $A_j^M(i)$: probability of generating $x_1 \dots x_i$ and ending
        with $x_i$ being emitted from $M_j$
      • $A_j^I(i)$: probability of generating $x_1 \dots x_i$ and ending
        with $x_i$ being emitted from $I_j$
      • $A_j^D(i)$: probability of generating $x_1 \dots x_i$ and ending
        in $D_j$ ($x_i$ is the last character emitted before $D_j$)

20
Alignment of a protein to a profile HMM
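  • $A_j^M(i) = e_{M(j)}(x_i) \left[ A_{j-1}^M(i-1)\, a_{M(j-1)M(j)} + A_{j-1}^I(i-1)\, a_{I(j-1)M(j)} + A_{j-1}^D(i-1)\, a_{D(j-1)M(j)} \right]$
  • $A_j^I(i) = e_{I(j)}(x_i) \left[ A_j^M(i-1)\, a_{M(j)I(j)} + A_j^I(i-1)\, a_{I(j)I(j)} + A_j^D(i-1)\, a_{D(j)I(j)} \right]$
  • $A_j^D(i) = A_{j-1}^M(i)\, a_{M(j-1)D(j)} + A_{j-1}^I(i)\, a_{I(j-1)D(j)} + A_{j-1}^D(i)\, a_{D(j-1)D(j)}$
  • In log space each product becomes a sum with a $\log a$ term, and the
    sum over predecessor states becomes a log-sum (or a max, which gives
    the score of the best single alignment)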
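
Read as a forward (sum over all paths) computation in probability space, the recurrences above might be sketched like this. To keep the example short, it assumes one set of transition probabilities shared by all positions, one emission table shared by all insert states, a begin state that acts as $M_0$, and an end transition of probability 1. These simplifications and the name `forward_probability` are illustrative, not part of the slides.

```python
def forward_probability(x, match_emit, insert_emit, trans):
    """Sum over all state paths of a toy profile HMM that generate x.

    match_emit[j-1][c] : emission probability of residue c in match state M_j
    insert_emit[c]     : emission probability in any insert state (assumption)
    trans["MI"] etc.   : transition probabilities, shared by all positions
                         (assumption); keys are source state + target state
    """
    m, n = len(match_emit), len(x)
    # fS[j][i]: probability of generating x[:i] and ending in state S_j
    fM = [[0.0] * (n + 1) for _ in range(m + 1)]
    fI = [[0.0] * (n + 1) for _ in range(m + 1)]
    fD = [[0.0] * (n + 1) for _ in range(m + 1)]
    fM[0][0] = 1.0  # the begin state is treated as M_0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # match: consumes x_i, advances to position j
                fM[j][i] = match_emit[j - 1][x[i - 1]] * (
                    fM[j - 1][i - 1] * trans["MM"]
                    + fI[j - 1][i - 1] * trans["IM"]
                    + fD[j - 1][i - 1] * trans["DM"])
            if i > 0:            # insert: consumes x_i, stays at position j
                fI[j][i] = insert_emit[x[i - 1]] * (
                    fM[j][i - 1] * trans["MI"]
                    + fI[j][i - 1] * trans["II"]
                    + fD[j][i - 1] * trans["DI"])
            if j > 0:            # delete: silent, advances to position j
                fD[j][i] = (fM[j - 1][i] * trans["MD"]
                            + fI[j - 1][i] * trans["ID"]
                            + fD[j - 1][i] * trans["DD"])
    # End transition assumed to have probability 1 from the last position.
    return fM[m][n] + fI[m][n] + fD[m][n]

match_emit = [{"A": 0.9, "C": 0.1}, {"A": 0.2, "C": 0.8}]
insert_emit = {"A": 0.5, "C": 0.5}

# Match-only model: the only path is M_1 M_2, so P is a product of emissions.
match_only = dict.fromkeys(["MM", "MI", "MD", "IM", "II", "ID", "DM", "DI", "DD"], 0.0)
match_only["MM"] = 1.0
p_match_only = forward_probability("AC", match_emit, insert_emit, match_only)

# Model with gaps: many paths contribute to the sum.
gapped = {"MM": 0.8, "MI": 0.1, "MD": 0.1,
          "IM": 0.8, "II": 0.1, "ID": 0.1,
          "DM": 0.8, "DI": 0.1, "DD": 0.1}
p_gapped = forward_probability("AC", match_emit, insert_emit, gapped)
```

For long sequences the products underflow, which is one reason such computations are usually carried out in log space.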

21
Generative Models
22
Generative Models
23
Generative Models
24
Generative Models
25
Generative Models
26
Discriminative Methods
Instead of modeling the process that generates
data, directly discriminate between classes
  • More direct way to the goal
  • Better if model is not accurate

27
Discriminative Models -- SVM
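  • Given training examples $x_1, \dots, x_n$,
  • $\mathrm{sign}\left(\sum_i \lambda_i x_i^T x\right)$ decides where x falls
  • Train the $\lambda_i$ to achieve the best margin

[Figure: two classes separated by a hyperplane with normal vector v and margin. Decision rule: red if $v^T x > 0$. A large margin for $|v| < 1$ corresponds to a margin of 1 for small $|v|$.]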
28
Discriminative protein classification
  • Jaakkola, Diekhans, Haussler, ISMB 1999
  • Define the discriminating function to be
  • $L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$
  • We decide $X \in$ family H whenever $L(X) > 0$
  • For now, let's just assume K(.,.) is a similarity function
  • Then, we want to train the $\lambda_i$ so that this classifier makes as
    few mistakes as possible on new data
  • Similarly to SVMs, train the $\lambda_i$ so that the margin is largest,
    subject to $0 \le \lambda_i \le 1$

29
Discriminative protein classification
  • Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$, and
    $L(X_i) \le -1$ otherwise
  • This is not always possible; softer constraints are obtained with the
    following objective function
  • $J(\lambda) = \sum_{X_i \in H_1} \lambda_i \left(2 - L(X_i)\right) + \sum_{X_j \in H_0} \lambda_j \left(2 + L(X_j)\right)$
  • Training: for $X_i \in H_1$, try to make $L(X_i) = 1$
  • $\lambda_i \leftarrow \frac{1 - L(X_i) + \lambda_i K(X_i, X_i)}{K(X_i, X_i)}$,
    with minimum allowable value 0 and maximum 1
  • Similarly, for $X_i \in H_0$, try to make $L(X_i) = -1$
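
The iterative update above can be sketched as a sequential sweep over the training examples. The version below is a minimal illustration that assumes a precomputed kernel matrix with a strictly positive diagonal and labels +1 (for H1) and -1 (for H0); the label trick `y[i] * L_i` folds the two cases (target +1 and target -1) into one formula. The names `train_lambdas` and `discriminant`, the number of sweeps, and the toy linear kernel are assumptions for the example.

```python
def discriminant(k_row, lam, y):
    """L(X) = sum over H1 of lam_i K(X, X_i) - sum over H0 of lam_j K(X, X_j).

    k_row[i] holds K(X, X_i); y[i] is +1 for H1 members and -1 for H0 members.
    """
    return sum(l * s * k for l, s, k in zip(lam, y, k_row))

def train_lambdas(K, y, sweeps=100):
    """Repeatedly push y_i * L(X_i) toward 1, clipping each lambda to [0, 1]."""
    n = len(y)
    lam = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            L_i = discriminant(K[i], lam, y)
            updated = (1.0 - y[i] * L_i + lam[i] * K[i][i]) / K[i][i]
            lam[i] = min(1.0, max(0.0, updated))
    return lam

# Toy data: four 1-D points with a linear kernel K(a, b) = a * b.
points = [2.0, 1.5, -2.0, -1.5]
labels = [1, 1, -1, -1]
K = [[a * b for b in points] for a in points]

lam = train_lambdas(K, labels)
train_scores = [discriminant(K[i], lam, labels) for i in range(len(points))]
```

After training, a query X is classified by the sign of `discriminant` applied to its kernel values against the training set.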

30
The Fisher Kernel
  • The function K(X, Y) compares two sequences
      • Acts effectively as an inner product in a (non-Euclidean) space
      • Called a kernel
  • It has to be positive definite
      • For any $X_1, \dots, X_n$, the matrix $K = [K_{ij}]$ with
        $K_{ij} = K(X_i, X_j)$ is such that
      • for any $X \in \mathbb{R}^n$, $X \ne 0$: $X^T K X > 0$
  • The choice of this function is important
  • Consider the sufficient statistics of $P(X \mid H_1, \theta)$
      • The expected number of times X takes each transition/emission

31
The Fisher Kernel
  • Fisher score
      • $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
      • Quantifies how each parameter contributes to generating X
  • For two different sequences X and Y, we can compare $U_X$ and $U_Y$
      • $D_F^2(X, Y) = \frac{1}{2\sigma^2} \lVert U_X - U_Y \rVert^2$
  • Given this distance function, K(X, Y) is defined as a similarity measure
      • $K(X, Y) = \exp\left(-D_F^2(X, Y)\right)$
      • Set $\sigma$ so that the average distance of training sequences
        $X_i \in H_1$ to sequences $X_j \in H_0$ is 1

Question: Is the partial derivative larger when X uses a given parameter
$\theta_i$ more or less often?
Question: Is the partial derivative larger when a given parameter
$\theta_i$ is larger or smaller?
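
To make the Fisher score and kernel concrete, here is a small sketch that swaps the profile HMM for a gapless profile (one residue distribution per position), so that no forward-backward pass is needed. For that toy model the score component for residue c at position j is taken as (number of times X uses the parameter) / θ minus 1, the form usually quoted for normalized emission parameters; applying it here is an assumption, as are the names `fisher_score` and `fisher_kernel` and the fixed σ = 1.

```python
import math

def fisher_score(x, theta):
    """U_X for a gapless toy profile: one entry per (position, residue) parameter.

    Each entry is usage / theta - 1, where usage is 1 if x has that residue
    at that position and 0 otherwise.
    """
    score = []
    for j, dist in enumerate(theta):
        for residue in sorted(dist):
            usage = 1.0 if x[j] == residue else 0.0
            score.append(usage / dist[residue] - 1.0)
    return score

def fisher_kernel(x, y, theta, sigma=1.0):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2))."""
    ux, uy = fisher_score(x, theta), fisher_score(y, theta)
    d2 = sum((a - b) ** 2 for a, b in zip(ux, uy)) / (2.0 * sigma ** 2)
    return math.exp(-d2)

theta = [{"A": 0.5, "C": 0.5}, {"A": 0.8, "C": 0.2}]
u_aa = fisher_score("AA", theta)
u_ac = fisher_score("AC", theta)
k_same = fisher_kernel("AA", "AA", theta)
k_diff = fisher_kernel("AA", "AC", theta)
```

In this toy setting a score component is larger when X uses the parameter and the parameter itself is small: using the rare C at the second position contributes 4, while using the common A contributes only 0.25.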
32
The Fisher Kernel
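  • In summary, to distinguish between family $H_1$ and (non-family) $H_0$, define
      • Profile $H_1$
      • $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ (Fisher score)
      • $D_F^2(X, Y) = \frac{1}{2\sigma^2} \lVert U_X - U_Y \rVert^2$ (distance)
      • $K(X, Y) = \exp\left(-D_F^2(X, Y)\right)$ (akin to a dot product)
      • $L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$
  • Iteratively adjust $\lambda$ to optimize
      • $J(\lambda) = \sum_{X_i \in H_1} \lambda_i \left(2 - L(X_i)\right) + \sum_{X_j \in H_0} \lambda_j \left(2 + L(X_j)\right)$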

33
The Fisher Kernel
  • If a given superfamily has more than one profile model,
  • $L_{\max}(X) = \max_i L_i(X) = \max_i \left( \sum_{X_j \in H_i} \lambda_j K(X, X_j) - \sum_{X_j \in H_0} \lambda_j K(X, X_j) \right)$

Superfamily
Family
34
Benchmarks
  • Methods evaluated
      • BLAST (Altschul et al. 1990; Gish & States 1993)
      • HMMs using the SAM-T98 methodology (Park et al. 1998; Karplus,
        Barrett & Hughey 1998; Hughey & Krogh 1995, 1996)
      • SVM-Fisher
  • Measurement of the recognition rate for members of superfamilies of
    SCOP (Hubbard et al. 1997)
      • PDB90 eliminates redundant sequences
      • Withhold all members of a given SCOP family
      • Train with the remaining members of the SCOP superfamily
      • Test with the withheld data
      • Question: Could the method discover a new family of a known
        superfamily?

O. Jangmin
35
36
Other methods
  • WU-BLAST version 2.0a16 (Altschul & Gish 1996)
      • The PDB90 database was queried with each positive training
        example, and E-values were recorded
      • BLAST:SCOP-only
      • BLAST:SCOP+SAM-T98-homologs
      • Scores were combined by the maximum method
  • SAM-T98 method
      • Same data and same set of models as in SVM-Fisher
      • Scores were combined by the maximum method

37
Results
  • Metric: the rate of false positives (RFP)
      • RFP for a positive test sequence: the fraction of negative test
        sequences that score as well as or better than the positive sequence
  • Results for the G proteins family of the nucleotide triphosphate
    hydrolases SCOP superfamily
      • Tests the ability to distinguish 8 PDB90 G proteins from 2439
        sequences in other SCOP folds
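
The RFP metric defined above is simple to compute once scores are available. The sketch below assumes that a higher score means a stronger prediction of family membership (for E-values the comparison would be reversed); the function name `rate_of_false_positives` and the example scores are made up for illustration.

```python
def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negative test sequences scoring at least as well as the positive one.

    Assumes higher scores are better.
    """
    as_good_or_better = sum(1 for s in negative_scores if s >= positive_score)
    return as_good_or_better / len(negative_scores)

negatives = [0.1, 0.95, 0.9, 0.3]
rfp = rate_of_false_positives(0.9, negatives)       # two of four negatives tie or beat 0.9
rfp_best = rate_of_false_positives(1.5, negatives)  # no negative scores this high
```

An RFP of 0 means the positive test sequence outscored every negative test sequence.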

38
Table 1. Rate of false positives for the G proteins family.
BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAM-T98-homologs,
S-T98 = SAM-T98, and SVM-F = SVM-Fisher method.
39
40
QUESTION
  • What is the running time of the Fisher kernel SVM on a query X?