Title: Multiple-Instance Machine Learning for Subcellular Localization Classification
1. Multiple-Instance Machine Learning for Subcellular Localization Classification
2. Subcellular Localization
- Cells are compartmentalized
- Each compartment has a different chemistry
  - E.g. mitochondria have a high negative charge
- Proteins have evolved to function optimally in a specific organelle
- Most proteins are synthesized in the cytoplasm
- A highly specific system is involved in transporting proteins to the correct localization
3. Why is Subcellular Localization Important?
- Subcellular localization is important for understanding gene/protein function
- Proteins must be co-localized to cooperate in a common physiological function (metabolic pathway, signal transduction cascade, structural association, etc.)
- Aberrant localizations lead to diseases such as cancer and Alzheimer's
- Yields information about a protein:
  - The protein's biological function
  - Membership in a specific enzyme pathway
  - Possible post-translational modifications
4. Cellular Compartments
- Nuclear: 1097 (45%)
- Extracellular: 325 (13%)
- Cytoplasmic: 684 (29%)
- Mitochondrial: 321 (13%)
- Prokaryote: Cytosolic, Extracellular, and Periplasmic
http://www.virtuallaboratory.net/Biofundamentals/lectureNotes/AllGraphics/animalCellCompartments.jpg
5. Outline
- Introduction
- Machine Learning
- Conventionally-used SL methods
  - Sorting signal: TargetP
  - AAC: NNPSL, SubLoc, Esub8
  - Hybrid method: SLP-Local, PSORT
- Introduction to Multiple-Instance Learning
- Application of MIL to SL
- Conclusion
6. What is Machine Learning?
- Building machines that automatically learn from experience
- Small sampling of biological applications:
  - Classification of protein sequences by family
  - Gene finding
  - Structure prediction
  - Inferring phylogenies
  - Subcellular localization prediction
7. What is Machine Learning? (cont'd)
- Given several labeled examples of a concept
  - E.g. mitochondrial targeting vs. non-mitochondrial targeting
- Examples are described by features
  - E.g. AA frequency, N-terminal targeting sequence, pI, mass
- An ML algorithm uses these examples to create a hypothesis that will predict the label of new (previously unseen) examples
- Hypotheses can take many forms: artificial neural networks, hidden Markov models, k-nearest neighbor, decision trees, etc.
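The labeled-examples setup above can be sketched with one of the simplest hypothesis forms mentioned, a 1-nearest-neighbor classifier. This is a minimal illustration, not any method from the talk; the feature vectors (pI, mass) and their values are invented toy data:

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=1):
    """Predict a label for `query` by majority vote of its k nearest
    labeled examples."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# Toy labeled examples: ([pI, mass-in-kDa], label) -- values are invented.
train = [
    ([5.2, 30.0], "mitochondrial"),
    ([5.0, 28.5], "mitochondrial"),
    ([8.1, 55.0], "non-mitochondrial"),
    ([7.9, 60.2], "non-mitochondrial"),
]

print(knn_predict(train, [5.1, 29.0]))  # → mitochondrial
```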
8. Conventional Machine Learning Model: Decision Problem
Important decision: selection of a good representation (features) for the problem
17. Possible Features for Subcellular Localization: Conventional ML
- Signal peptides (SP)
  - Commonly on the N-terminus, but also on the C-terminus: Vergunst (2005), Furukawa (1997)
- Amino acid composition (AAC): protein chemistry tends to match compartmental environments
- Identity (ID): high similarity between sequences → same localization
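The AAC feature mentioned above reduces a sequence to 20 residue frequencies. A minimal sketch (the function name and the alphabet ordering are just illustrative conventions):

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors align.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """20-dimensional amino acid composition: the fraction of each
    residue type in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

comp = aa_composition("MKKRRAS")  # fractions sum to 1.0
```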
18. Possible Features for Subcellular Localization: Signal Peptides
- Endoplasmic reticulum
  - Easily recognizable: a stretch of hydrophobic residues 15-30 AAs in from the N-terminus
- Nuclear localization signals
  - High content of Arg and Lys; may contain residues that disrupt helical domains (Pro, Gly)
- Mitochondrial targeting peptides
  - Difficult to predict: rich in Arg, Ala, and Ser; acidic residues are rare; often form alpha helices
- Chloroplast transit peptides
  - Least understood. Similar to mitochondrial, but structure is unknown
Increasingly difficult to predict
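The "easily recognizable" ER case can be approximated with a crude scan for a hydrophobic run near the N-terminus. This is only a heuristic sketch, not SignalP or TargetP; the hydrophobic residue set, the 8-residue window, and the 30-residue search region are assumptions chosen for illustration:

```python
# Assumed hydrophobic set (Kyte-Doolittle-style, simplified).
HYDROPHOBIC = set("AVLIMFWC")

def has_er_like_signal(seq, window=8, search_len=30):
    """Crude check for a run of `window` consecutive hydrophobic
    residues within the first `search_len` residues."""
    region = seq[:search_len]
    for i in range(len(region) - window + 1):
        if all(aa in HYDROPHOBIC for aa in region[i:i + window]):
            return True
    return False
```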
19. Outline
- Introduction
- Machine Learning
- Conventionally-used SL methods
  - Sorting signal: TargetP
  - AAC: NNPSL, SubLoc, Esub8
  - Hybrid method: SLP-Local, PSORT
- Introduction to Multiple-Instance Learning
- Application of MIL to SL
- Conclusion
20. Current Methods
- Input: amino acid composition, dipeptides, physico-chemical properties
  - NNPSL (ANN), SubLoc (SVM), Esub8 (SVM), SLP-Local (SVM)
- Input: N-terminal sorting sequence
  - ChloroP (ANN), SignalP (ANN/HMM), TargetP (ANN)
- PSORT (rule-based)
- (ANN) Artificial Neural Network; (SVM) Support Vector Machine; (HMM) Hidden Markov Model
21. TargetP: Emanuelsson et al. (2000)
- Uses only sorting (signal) sequences
- Predictions are based on the 130 N-terminal residues of each input sequence; missing N-terminal residues make the prediction more difficult and less reliable
- Reliability of predictions:
  - 85.3% for plants
  - 90% for animals
22. NNPSL, SubLoc: Reinhardt & Hubbard (1998), Hua & Sun (2001); Esub8: Cui et al. (2004)
- Use global amino acid composition (AAC) to capture the physico-chemical properties of the protein sequence
- 20 amino acids → 20 numbers representing the percentage of each AA in the sequence
- Reliability of predictions (NNPSL/SubLoc):
  - 66.1%/79.4% for eukaryotes
  - 80.9%/91.4% for prokaryotes
- Reliability of predictions (Esub8):
  - 84.1% for eukaryotes
    - 91% Nuc, 80% Cyt, 68% Mit, 87% Ext
  - 92.9% for prokaryotes
23. SLP-Local: Matsuda et al. (2005)
- Split sequence
- Features:
  - x1(p), ..., x20(p): composition of the 20 amino acids in part p (p = 1, 2, 3, 4, M, C, E)
  - y1(M), ..., y20(M): composition of the 20 twin amino acids in the middle part (AA, RR, NN, ...)
  - f1(q), ..., f6(q); g1(M), ..., g6(M); h1(M), ..., h6(M): distance frequencies of basic, hydrophobic, and other amino acids, respectively
  - 184 features!
- Distance frequencies
- Reliability of predictions:
  - 87.1% for eukaryotes
    - 91% Nuc, 83% Cyt, 82% Mit, 93% Ext
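To make one of the feature groups above concrete, here is a sketch of the twin amino acid composition (the y-features: fractions of adjacent identical pairs AA, RR, NN, ...). The function name and the normalization by the number of adjacent positions are assumptions, not taken from the SLP-Local paper:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def twin_composition(seq):
    """Fraction of adjacent identical residue pairs (AA, RR, NN, ...)
    per residue type, normalized by the number of adjacent positions."""
    twins = [seq[i] for i in range(len(seq) - 1) if seq[i] == seq[i + 1]]
    n = max(len(seq) - 1, 1)
    return [twins.count(aa) / n for aa in AMINO_ACIDS]
```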
24. PSORT II: Nakai & Horton (1999)
- Uses a set of rules to predict the localization of protein sequences
- Difficult to make rules for every possible case:
  - "The classical type of Nuclear Localization Signal is detected using the following two rules: a 4-residue pattern (called 'pat4') composed of 4 basic amino acids (K or R), or composed of three basic amino acids (K or R) and either H or P; the other (called 'pat7') is ..." (http://psort.nibb.ac.jp/helpwww2.html)
- Reliability of prediction:
  - 57% for yeast
  - 86% for E. coli
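The quoted pat4 rule is simple enough to sketch directly. This follows only the slide's description (4 basic residues, or 3 basic plus an H or P, in a 4-residue window); the function name is hypothetical, and pat7 is not implemented since its definition is elided in the quote:

```python
def has_pat4_nls(seq):
    """PSORT-style 'pat4' check: a 4-residue window of basic residues
    (K or R), or 3 basic residues plus one H or P."""
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        basics = sum(aa in "KR" for aa in window)
        if basics == 4:
            return True
        if basics == 3 and any(aa in "HP" for aa in window):
            return True
    return False
```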
25. Issues with Current Methods
- Signal peptides
  - Signal peptide dataset accuracies are relatively low:
    - SLP-Local: 88.1%
    - TargetP: 85.3%
    - PSORT: 69.8%
  - Feature values must themselves be predicted before use by the SL classifier → noisy features
- Amino acid composition
  - Loses all sequence information
  - Methods that split the sequence, such as SLP-Local, are very complex and still cannot represent all information
- ID
  - Impossible to compactly represent all cell machinations
26. Outline
- Introduction
- Machine Learning
- Conventionally-used SL methods
  - Sorting signal: TargetP
  - AAC: NNPSL, SubLoc, Esub8
  - Hybrid method: SLP-Local, PSORT
- Introduction to Multiple-Instance Learning
- Application of MIL to SL
- Conclusion
27. Conventional Learning Model: Decision Problem
28. Multiple-Instance Learning
- Generalizes conventional machine learning (provably harder to learn)
- Now each example consists of a set (bag) of instances
- A single label for the entire bag is a function of the individual instances' labels
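Under the standard MIL assumption, the bag-label function is simply the OR of the instance labels: a bag is positive iff at least one instance is positive. A minimal sketch (the numeric instances and threshold classifier are invented for illustration):

```python
def bag_label(instance_classifier, bag):
    """Standard MIL assumption: a bag is positive iff at least one
    of its instances is classified positive."""
    return any(instance_classifier(x) for x in bag)

# Toy 1-D instances with a threshold classifier (invented for illustration).
is_positive = lambda x: x > 0.5

print(bag_label(is_positive, [0.1, 0.2, 0.9]))  # → True (one positive instance)
print(bag_label(is_positive, [0.1, 0.2, 0.3]))  # → False (no positive instance)
```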
29. MIL - An Example
30. MIL - An Example
- Positive training example
31. MIL - An Example
- Hypothesis built from training examples
33. MIL - An Example
- New example to classify: predict negative
35. MIL - An Example
- New example to classify: predict positive
37. Application of MIL to Subcellular Localization
- The MIL/signature representation can capture SP, ID, and AAC in the following ways:
  - Sequence identity: identical sequences will have identical signatures
  - Signal peptides: particular targeting signals have close consensus sequences
  - Amino acid composition: similar compositions will have similar signatures
- Can also include features from conventional ML, such as AAC
- Where does the position in the sequence come from?
38. Application of MIL to Subcellular Localization
- Local alignments!
- Locally align pairs of sequences
- Take signatures of the locally-aligned subsequences and combine them
- A more compact MIL representation of sequences
39. Application of MIL to Subcellular Localization: AAC, Forward and Backward LA
- PDB 3bcc
- PDB 1dxm
40. MIL for SL: Forward vs. Backward
- Forward (5)
- GGKGGAKEKISALQS GTLTYEAVHQTTQV PNCP
- GEKGQYTHKIYHLKS GQYTHKIYHLKSKV PDCP
- VFNTVKEAKEKTG IGPN
- LYSVAEASKNETG LGPN
- Backward (5)
- GKGGAKEKISALQSAGVIVSMSPAQLGTCMYKEFEK GKGGAKE
- GEGGGTENKSA-EAVSYLQGVQYEQVSCPLVVRFEK GKEGDKE
- IVLIGEIGGHAEENA KAKPVVSFIAGI GGHAEENAAE
- LVEIGEGGGTENKSA KSAEAVSYLQGV GGGTENKSAE
41. Outline
- Introduction
- Machine Learning
- Conventionally-used SL methods
  - Sorting signal: TargetP
  - AAC: NNPSL, SubLoc, Esub8
  - Hybrid method: SLP-Local, PSORT
- Introduction to Multiple-Instance Learning
- Application of MIL to SL
- Conclusion