Multiple-Instance Machine Learning for Subcellular Localization Classification (PowerPoint transcript)

Transcript and Presenter's Notes
1
Multiple-Instance Machine Learning for
Subcellular Localization Classification
  • Cory Strope

2
Subcellular Localization
  • Cells are compartmentalized
  • Each compartment has a different chemistry
  • E.g. mitochondria carry a high negative charge
  • Proteins have evolved to function optimally in a
    specific organelle
  • Most proteins are synthesized in the cytoplasm
  • A highly specific transport system delivers
    proteins to their correct location

3
Why is Subcellular Localization Important?
  • Subcellular localization is important for
    understanding gene/protein function
  • Proteins must be co-localized to cooperate in a
    common physiological function (metabolic pathway,
    signal transduction cascade, structural
    association, etc.)
  • Aberrant localizations lead to diseases such as
    cancer and Alzheimer's
  • Localization yields information about a protein:
  • The protein's biological function
  • Membership in a specific enzyme pathway
  • Possible post-translational modifications

4
Cellular Compartments
  • Nuclear: 1097 (45%)
  • Extracellular: 325 (13%)
  • Cytoplasmic: 684 (29%)
  • Mitochondrial: 321 (13%)
  • Prokaryote: Cytosolic, Extracellular, and
    Periplasmic

http://www.virtuallaboratory.net/Biofundamentals/lectureNotes/AllGraphics/animalCellCompartments.jpg
5
Outline
  • Introduction
  • Machine Learning
  • Conventionally-used SL methods
  • Sorting signal: TargetP
  • AAC: NNPSL, SubLoc, Esub8
  • Hybrid methods: SLP-Local, PSORT
  • Introduction to Multiple-Instance Learning
  • Application of MIL to SL
  • Conclusion

6
What is Machine Learning?
  • Building machines that automatically learn from
    experience
  • Small sampling of biological applications
  • Classification of protein sequences by family
  • Gene finding
  • Structure prediction
  • Inferring phylogenies
  • Subcellular localization prediction

7
What is Machine Learning (cont'd)
  • Given several labeled examples of a concept
  • E.g. mitochondrial targeting vs.
    non-mitochondrial targeting
  • Examples are described by features
  • E.g. AA frequency, N-terminal targeting sequence,
    pI, mass
  • A ML algorithm uses these examples to create a
    hypothesis that will predict the label of new
    (previously unseen) examples
  • Hypotheses can take many forms: artificial
    neural networks, hidden Markov models, k-nearest
    neighbor, decision trees, etc.
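As a toy sketch of the conventional model described above (not from the slides): each example is a fixed-length feature vector with a label, and a k-nearest-neighbor hypothesis predicts the label of a previously unseen example. The feature pair (Arg frequency, pI) and all values are invented for illustration.

```python
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote of its k nearest labeled examples.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

# Toy examples: (Arg frequency, isoelectric point) -> compartment label
train = [((0.12, 11.0), "mitochondrial"),
         ((0.11, 10.5), "mitochondrial"),
         ((0.04, 5.5),  "cytoplasmic"),
         ((0.05, 6.0),  "cytoplasmic")]

print(knn_predict(train, (0.10, 10.0)))  # prints "mitochondrial"
```

The same train/predict shape holds for the ANNs, HMMs, and SVMs listed later; only the hypothesis class changes.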

8
Conventional Machine Learning Model: Decision
Problem
16
Conventional Machine Learning Model: Decision
Problem
Important decision: selecting a good
representation (features) for the problem
17
Possible Features for Subcellular Localization:
Conventional ML
  • Signal peptides (SP)
  • Commonly on the N-terminus, but also on the
    C-terminus
  • Vergunst (2005), Furukawa (1997)
  • Amino acid composition (AAC): protein chemistry
    tends to match compartmental environments
  • Identity (ID): high similarity between sequences
    implies the same localization

18
Possible Features for Subcellular Localization:
Signal Peptides
  • Endoplasmic reticulum
  • Easily recognizable: a stretch of hydrophobic
    residues 15-30 AAs in from the N-terminus
  • Nuclear localization signals
  • High content of Arg and Lys; may contain residues
    that disrupt helical domains (Pro, Gly)
  • Mitochondrial targeting peptides
  • Difficult to predict: rich in Arg, Ala, and Ser;
    acidic residues are rare; often form alpha
    helices
  • Chloroplast transit peptides
  • Least understood; similar to mitochondrial, but
    structure is unknown

Increasingly difficult to predict
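To illustrate why ER signals sit at the "easily recognizable" end of this spectrum, a naive scan for a hydrophobic stretch near the N-terminus can be sketched as follows. The residue set, window length, and cutoff are illustrative assumptions, not the rule used by any published predictor.

```python
# Commonly-cited hydrophobic residues (illustrative choice)
HYDROPHOBIC = set("AVLIMFWC")

def has_er_signal(seq, window=8, within=30):
    """Heuristic ER-signal check: look for a run of `window` consecutive
    hydrophobic residues starting within the first `within` residues."""
    prefix = seq[:within + window]
    for i in range(len(prefix) - window + 1):
        if all(aa in HYDROPHOBIC for aa in prefix[i:i + window]):
            return True
    return False
```

A hydrophobic-rich N-terminus such as "MKLLVLLLAVFA..." passes, while a charged/polar one does not; the harder signals on this slide (mitochondrial, chloroplast) resist such simple pattern rules.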
19
Outline
  • Introduction
  • Machine Learning
  • Conventionally-used SL methods
  • Sorting signal: TargetP
  • AAC: NNPSL, SubLoc, Esub8
  • Hybrid methods: SLP-Local, PSORT
  • Introduction to Multiple-Instance Learning
  • Application of MIL to SL
  • Conclusion

20
Current Methods
Input: amino acid composition, dipeptides,
physico-chemical properties
Input: N-terminal sorting sequence
ChloroP (ANN)
NNPSL (ANN)
SignalP (ANN+HMM)
PSORT
SubLoc (SVM)
TargetP (ANN)
Esub8 (SVM)
SLP-Local (SVM)
ANN: Artificial Neural Network; SVM: Support
Vector Machine; HMM: Hidden Markov Model
21
TargetP: Emanuelsson et al. (2000)
  • Uses only signaling sequences
  • Predictions based on the 130 N-terminal residues
    of each input sequence. Missing N-terminal
    residues make the prediction more difficult and
    less reliable
  • Reliability of predictions:
  • 85.3% for plants
  • 90% for animals

22
NNPSL, SubLoc: Reinhardt & Hubbard (1998), Hua &
Sun (2001)
Esub8: Cui et al. (2004)
  • Use global amino acid composition (AAC) to
    capture the physico-chemical properties of the
    protein sequence
  • 20 amino acids yield 20 numbers representing the
    percentage of each AA in the sequence
  • Reliability of predictions (NNPSL/SubLoc):
  • 66.1%/79.4% for eukaryotes
  • 80.9%/91.4% for prokaryotes
  • Reliability of predictions (Esub8):
  • 84.1% for eukaryotes
  • 91% Nuc, 80% Cyt, 68% Mit, 87% Ext
  • 92.9% for prokaryotes
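The global AAC feature vector described above is straightforward to compute; a minimal sketch (not from the presentation):

```python
# The 20 standard amino acids in a fixed order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Global amino-acid composition: the fraction of each of the 20
    standard amino acids in `seq`, as a 20-dimensional feature vector."""
    n = len(seq) or 1
    return [seq.count(aa) / n for aa in AMINO_ACIDS]

vec = aa_composition("MKKLLE")  # 20 numbers summing to 1.0
```

This vector is what NNPSL/SubLoc/Esub8 feed to their ANN or SVM; note that it discards all positional information, which is the weakness raised on a later slide.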

23
SLP-Local: Matsuda et al. (2005)
Split sequence
  • Features:
  • x1(p), ..., x20(p): composition of the 20 amino
    acids in part p (p = 1, 2, 3, 4, M, C, E)
  • y1(M), ..., y20(M): composition of the 20 twin
    amino acids in the middle part (AA, RR, NN, ...)
  • f1(q), ..., f6(q), g1(M), ..., g6(M), h1(M), ...,
    h6(M): distance frequencies of basic, hydrophobic,
    and other amino acids, respectively
  • 184 features!

Distance frequencies
  • Reliability of predictions:
  • 87.1% for eukaryotes
  • 91% Nuc, 83% Cyt, 82% Mit, 93% Ext
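One possible reading of the twin-amino-acid features y1(M), ..., y20(M), sketched under the assumption that they count identical adjacent residue pairs over all overlapping pairs in the part; the paper's exact normalization may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def twin_composition(part):
    """Fraction of each of the 20 'twin' dipeptides (AA, CC, ..., YY)
    among all overlapping residue pairs of `part` (assumed definition)."""
    pairs = [part[i:i + 2] for i in range(len(part) - 1)]
    n = len(pairs) or 1
    return [sum(p == aa * 2 for p in pairs) / n for aa in AMINO_ACIDS]
```

For "AARRA" the pairs are AA, AR, RR, RA, so the AA and RR entries are each 0.25. Together with the per-part AAC and distance-frequency features this accounts for the large (184-dimensional) feature vector.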

24
PSORT II: Nakai & Horton (1999)
  • Uses a set of rules to predict the localization
    of protein sequences
  • Difficult to make rules for every possible case
  • "The classical type of Nuclear Localization
    Signal is detected using the following two
    rules: 4 residue pattern (called 'pat4')
    composed of 4 basic amino acids (K or R), or
    composed of three basic amino acids (K or R) and
    either H or P; the other (called 'pat7') is ..."
    (http://psort.nibb.ac.jp/helpwww2.html)
  • Reliability of prediction:
  • 57% for yeast
  • 86% for E. coli
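The quoted pat4 case (four consecutive basic residues) is simple enough to sketch directly; the pat7 rule is truncated in the quote above, so only this first case is shown, as an illustrative reading rather than PSORT's actual implementation.

```python
def has_pat4(seq):
    """First pat4 case of PSORT's classical NLS rule (as quoted): any
    4-residue window composed entirely of basic residues (K or R)."""
    basic = {"K", "R"}
    return any(set(seq[i:i + 4]) <= basic for i in range(len(seq) - 3))
```

For example, "ASKRKRL" contains the window KRKR and matches, while "ASKQLE" does not. The slide's point stands: hand-writing such rules for every signal type quickly becomes infeasible.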

25
Issues with Current Methods
  • Signal peptides
  • Signal peptide dataset accuracies are relatively
    low:
  • SLP-Local: 88.1%
  • TargetP: 85.3%
  • PSORT: 69.8%
  • Feature values must themselves be predicted
    before the SL classifier can use them
  • Noisy features
  • Amino acid composition
  • Loses all sequence information
  • Methods that split the sequence, such as
    SLP-Local, are very complex yet still cannot
    represent all information
  • ID
  • Impossible to compactly represent all cell
    machinery

26
Outline
  • Introduction
  • Machine Learning
  • Conventionally-used SL methods
  • Sorting signal: TargetP
  • AAC: NNPSL, SubLoc, Esub8
  • Hybrid methods: SLP-Local, PSORT
  • Introduction to Multiple-Instance Learning
  • Application of MIL to SL
  • Conclusion

27
Conventional Learning Model: Decision Problem
28
Multiple-Instance Learning
  • Generalizes conventional machine learning
    (provably harder to learn)
  • Each example now consists of a set (bag) of
    instances
  • A single label for the entire bag is a function
    of the individual instances' labels

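Under the standard multiple-instance assumption, the bag label is the OR of the instance labels: a bag is positive iff at least one instance satisfies the target concept. A minimal sketch with an invented toy concept (points inside the unit square), not from the slides:

```python
def bag_label(bag, instance_classifier):
    """Standard multiple-instance assumption: a bag is positive iff at
    least one of its instances is classified positive."""
    return any(instance_classifier(x) for x in bag)

# Toy instance concept: positive if the 2-D point lies in the unit square
inside = lambda p: 0 <= p[0] <= 1 and 0 <= p[1] <= 1

print(bag_label([(3, 4), (0.5, 0.5)], inside))  # True: one instance inside
print(bag_label([(3, 4), (5, 1)], inside))      # False: no instance inside
```

The learning difficulty comes from the converse direction: during training only the bag labels are observed, so the learner must infer which instances made a positive bag positive.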
29
MIL - An Example
30
MIL - An Example
Positive training example
31
MIL - An Example
Hypothesis built from training examples
33
MIL - An Example
New example to classify: predict negative
35
MIL - An Example
New example to classify: predict positive
37
Application of MIL to Subcellular Localization
  • The MIL/signature representation can capture SP,
    ID, and AAC in the following way:
  • Sequence identity: identical sequences will have
    identical signatures
  • Signal peptides: particular targeting signals
    have close consensus sequences
  • Amino acid composition: similar compositions will
    have similar signatures
  • Can also include features from conventional ML,
    such as AAC
  • Where does the position in the sequence come from?

38
Application of MIL to Subcellular Localization
  • Local Alignments!
  • Locally align pairs of sequences (s)
  • Take signatures of locally-aligned subsequences
    and combine them
  • More compact MIL representation of sequences
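The local-alignment step above can be illustrated with a minimal Smith-Waterman score computation (no traceback); the scoring parameters are illustrative assumptions, not those used in the presentation.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Minimal Smith-Waterman: best local-alignment score between
    sequences `a` and `b`, via the standard dynamic-programming table."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(0,
                             rows[i - 1][j - 1] + s,   # match/mismatch
                             rows[i - 1][j] + gap,     # gap in b
                             rows[i][j - 1] + gap)     # gap in a
            best = max(best, rows[i][j])
    return best
```

High-scoring local regions between pairs of sequences would then supply the subsequences whose signatures become the instances of each bag.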



39
Application of MIL to Subcellular Localization
AAC: Forward and Backward LA
PDB 3bcc
PDB 1dxm
40
MIL for SL: Forward vs. Backward
  • Forward (5)
  • GGKGGAKEKISALQS GTLTYEAVHQTTQV PNCP
  • GEKGQYTHKIYHLKS GQYTHKIYHLKSKV PDCP
  • VFNTVKEAKEKTG IGPN
  • LYSVAEASKNETG LGPN
  • Backward (5)
  • GKGGAKEKISALQSAGVIVSMSPAQLGTCMYKEFEK GKGGAKE
  • GEGGGTENKSA-EAVSYLQGVQYEQVSCPLVVRFEK GKEGDKE
  • IVLIGEIGGHAEENA KAKPVVSFIAGI GGHAEENAAE
  • LVEIGEGGGTENKSA KSAEAVSYLQGV GGGTENKSAE

41
Outline
  • Introduction
  • Machine Learning
  • Conventionally-used SL methods
  • Sorting signal: TargetP
  • AAC: NNPSL, SubLoc, Esub8
  • Hybrid methods: SLP-Local, PSORT
  • Introduction to Multiple-Instance Learning
  • Application of MIL to SL
  • Conclusion