Title: Semi-supervised learning for protein classification
1. Semi-supervised learning for protein classification
- Brian R. King
- Chittibabu Guda, Ph.D.
- Department of Computer Science, University at Albany, SUNY
- GenNYsis Center for Excellence in Cancer Genomics, University at Albany, SUNY
2. The problem
- Develop computational models of characteristics of protein structure and function from sequence alone, using machine-learned classifiers
- Input: data
- Output: a model (function) h: X → Y
- Traditional approach: supervised learning
- Challenges
  - Experimentally determined data: expensive, limited, subject to noise/error
  - Large repositories of unannotated data
  - Data representation; bias from unbalanced/underrepresented classes; etc.
- TrEMBL (release 37.5): 5,035,267 sequences
- Swiss-Prot (release 54.5): 289,473 sequences
- AIM: Develop a method that uses both labeled and unlabeled data, improving performance given the challenges presented by small, unbalanced data
3. Solution
- Semi-supervised learning
  - Use Dl (labeled) and Du (unlabeled) for model induction
- Method: generative, Bayesian probabilistic model
  - Based on ngLOC, a supervised Naïve Bayes classification method
  - Input / feature representation: sequence → n-gram model
  - Assumption: multinomial distribution
  - IID sequences and n-grams
  - Use EXPECTATION MAXIMIZATION!
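The pipeline above (sequence → n-gram counts → multinomial Naïve Bayes, trained with EM) can be sketched end to end. This is an illustrative toy, not the actual ngLOC implementation: the two-letter alphabet, class names, and sequences are all made up, and unlabeled counts enter the M-step at full weight.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    """Overlapping n-grams of a sequence (treated as an IID bag)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def fit(labeled, classes, vocab, alpha=1.0, soft=None):
    """M-step: estimate multinomial NB parameters from hard-labeled data
    plus optional soft-labeled unlabeled data [(seq, {class: P(c|x)})]."""
    prior = {c: alpha for c in classes}          # smoothed class counts
    counts = {c: Counter() for c in classes}     # per-class n-gram counts
    for seq, c in labeled:
        prior[c] += 1.0
        counts[c].update(ngrams(seq))
    for seq, post in (soft or []):
        grams = Counter(ngrams(seq))
        for c in classes:                        # fractional counts
            prior[c] += post[c]
            for g, k in grams.items():
                counts[c][g] += post[c] * k
    zp = sum(prior.values())
    log_prior = {c: math.log(prior[c] / zp) for c in classes}
    log_cond = {}
    for c in classes:                            # Laplace-smoothed P(g|c)
        z = sum(counts[c].values()) + alpha * len(vocab)
        log_cond[c] = {g: math.log((counts[c][g] + alpha) / z) for g in vocab}
    return log_prior, log_cond

def posterior(seq, classes, log_prior, log_cond):
    """E-step for one sequence: P(class | n-gram counts).
    N-grams outside the vocabulary are simply ignored."""
    score = {c: log_prior[c] + sum(log_cond[c].get(g, 0.0)
                                   for g in ngrams(seq))
             for c in classes}
    m = max(score.values())
    w = {c: math.exp(s - m) for c, s in score.items()}
    z = sum(w.values())
    return {c: v / z for c, v in w.items()}

def em(labeled, unlabeled, classes, vocab, iters=10):
    """Initialize from labeled data only, then alternate E- and M-steps."""
    params = fit(labeled, classes, vocab)
    for _ in range(iters):
        soft = [(s, posterior(s, classes, *params)) for s in unlabeled]
        params = fit(labeled, classes, vocab, soft=soft)
    return params

# Toy usage: two classes, a two-letter "alphabet", made-up sequences.
labeled = [("AAAG", "EXT"), ("GGGA", "PLA")]
unlabeled = ["AAAA", "GGGG"]
vocab = {a + b for a in "AG" for b in "AG"}
lp, lc = em(labeled, unlabeled, ["EXT", "PLA"], vocab)
print(posterior("AAGA", ["EXT", "PLA"], lp, lc))
```

Initializing from labeled data alone and then iterating E/M is the standard semi-supervised EM recipe; a weight on the unlabeled counts (slide 4's λ) would scale the `post[c]` contributions inside `fit`.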
- Test setup
  - Prediction of subcellular localization
  - Eukaryotic, non-plant sequences only
  - Dl: data annotated with subcellular localization for eukaryotic, non-plant sequences
    - DL-2: EXT/PLA (5,500 sequences, balanced)
    - DL-3: GOL 65% / LYS 14% / POX 21% (600 sequences, unbalanced)
  - Du: set drawn from 75K eukaryotic, non-plant protein sequences
- Comparative method: TSVM (transductive SVM)
4. Algorithms based on EM
- EM-λ on DL-3 data
  - λ controls the effect of unlabeled data on parameter adjustments
  - ALL labeled data (600 sequences)
  - Varied the amount of unlabeled data
  - EM-λ outperforms TSVM on this problem
    - (TSVM failed to converge on large amounts of unlabeled data, despite parameter selection)
  - NOTE: TSVM performed very well on binary, balanced classification problems
- Basic EM on DL-2
  - Varied the amount of labeled data
  - 25,000 unlabeled sequences
  - Most improvement when labeled data is limited
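The role of λ can be shown in isolation: in the M-step, the soft counts contributed by unlabeled instances are scaled by λ before parameters are re-estimated, so λ = 0 ignores the unlabeled pool and λ = 1 lets it count fully. A minimal sketch of the class-prior update (the function name and all numbers are made up for illustration):

```python
def weighted_prior(labels, soft_posteriors, classes, lam):
    """Class prior re-estimated from hard labels plus lam-weighted
    soft labels from the unlabeled pool."""
    counts = {c: 0.0 for c in classes}
    for c in labels:                      # labeled data counts fully
        counts[c] += 1.0
    for post in soft_posteriors:          # unlabeled: lam * P(c|x)
        for c in classes:
            counts[c] += lam * post[c]
    total = sum(counts.values())
    return {c: counts[c] / total for c in classes}

classes = ["GOL", "LYS", "POX"]
labels = ["GOL", "GOL", "LYS"]            # no labeled POX at all
soft = [{"GOL": 0.1, "LYS": 0.1, "POX": 0.8}] * 3
print(weighted_prior(labels, soft, classes, lam=0.0))  # POX prior stays 0
print(weighted_prior(labels, soft, classes, lam=1.0))  # POX prior -> 0.4
```

The same λ scaling applies to the per-class n-gram counts; the prior is just the smallest piece that shows the effect.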
5. Algorithm: EM-CS
- Core ngLOC method outputs a confidence score (CS)
- Improve running time through intelligent selection of unlabeled instances
  - CS(xi) > CSthresh? Use the instance
- Test on DL-3 data
  - First, determine the range of CS scores through cross-validation without unlabeled data: 33.5–47.8 (dependent on the level of similarity in the data and the size of the dataset)
  - Using only sequences that meet or exceed CSthresh significantly reduces the unlabeled data required (97.5% eliminated)
  - NOTE: it is possible to reduce the unlabeled data too much
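The selection rule itself is a one-line filter. The exact ngLOC CS formula isn't given on this slide, so the sketch below assumes CS is 100 × the gap between the top two posterior probabilities, roughly consistent with the score range quoted above; treat that definition, and the toy posteriors, as assumptions.

```python
def confidence_score(posterior):
    """posterior: dict class -> P(c|x).  Assumed CS: scaled gap
    between the top two classes (an illustrative stand-in)."""
    top_two = sorted(posterior.values(), reverse=True)[:2]
    runner_up = top_two[1] if len(top_two) > 1 else 0.0
    return 100.0 * (top_two[0] - runner_up)

def select_unlabeled(posteriors, cs_thresh):
    """Indices of unlabeled instances whose CS meets the threshold."""
    return [i for i, p in enumerate(posteriors)
            if confidence_score(p) >= cs_thresh]

posts = [{"GOL": 0.80, "LYS": 0.15, "POX": 0.05},   # CS = 65: confident
         {"GOL": 0.40, "LYS": 0.35, "POX": 0.25}]   # CS = 5: ambiguous
print(select_unlabeled(posts, cs_thresh=40.0))       # -> [0]
```

Only the confidently scored instance survives, which is the mechanism behind the 97.5% reduction noted above.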
6. Conclusion
- Benefits
  - Probabilistic
  - Extraction of unlabeled sequences with high confidence
    - Difficult with SVM or TSVM
  - Extraction of knowledge from the model
    - Discriminative n-grams and anomalies
    - Information-theoretic measures, KL-divergence, etc.
    - Again, difficult with SVM or TSVM
  - Computational resources
    - Time: significantly lower than SVM and TSVM
    - Space: dependent on the n-gram model
  - Can use large amounts of unlabeled data
  - Applicable toward prediction of any structural or functional characteristic
  - Outputs a global model
    - Transduction is not global!
  - Most substantial gain with limited labeled data
- Current work in progress
  - TSVMs
  - Improve performance on smaller, unbalanced data
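As one concrete instance of the knowledge-extraction point above: with a generative n-gram model, discriminative n-grams can be ranked by their per-term contribution to the KL divergence between a class-conditional distribution and the background. This is a generic KL-style illustration, not the authors' specific analysis, and all probabilities are made-up toy values.

```python
import math

def kl_contributions(p_class, p_overall):
    """Per-n-gram terms of KL(P(.|c) || P(.)); large positive terms mark
    n-grams over-represented in class c."""
    return {g: p * math.log(p / p_overall[g])
            for g, p in p_class.items() if p > 0}

p_ext = {"AA": 0.5, "AG": 0.3, "GG": 0.2}       # P(g | EXT), toy values
p_all = {"AA": 0.3, "AG": 0.3, "GG": 0.4}       # P(g), toy background
ranked = sorted(kl_contributions(p_ext, p_all).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])   # prints "AA": the most EXT-discriminative n-gram
```

An SVM or TSVM decision boundary offers no comparably direct per-feature probabilistic reading, which is the contrast the slide draws.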