Semi-supervised learning for protein classification

Learn more at: https://www.iscb.org

1
Semi-supervised learning for protein
classification
  • Brian R. King
  • Chittibabu Guda, Ph.D.
  • Department of Computer Science
  • University at Albany, SUNY
  • GenNYsis Center for Excellence in Cancer
    Genomics
  • University at Albany, SUNY

2
The problem
  • Develop computational models of characteristics
    of protein structure and function from sequence
    alone using machine-learned classifiers
  • Input: Data
  • Output: A model (function) h: X → Y
  • Traditional approach: supervised learning
  • Challenges
  • Experimentally determined data: expensive,
    limited, subject to noise/error
  • Large repositories of unannotated data
  • Data representation, bias from unbalanced /
    underrepresented classes, etc.

UniProt database sizes (release: number of entries)
  • TrEMBL 37.5: 5,035,267
  • Swiss-Prot 54.5: 289,473

Aim: Develop a method that uses both labeled and
unlabeled data to improve performance despite the
challenges presented by small, unbalanced data sets.

3
Solution
  • Semi-supervised learning
  • Use Dl (labeled data) and Du (unlabeled data) for
    model induction
  • Method: generative, Bayesian probabilistic model
  • Based on ngLOC, a supervised Naïve Bayes
    classification method
  • Input / feature representation: sequence → n-gram
    model
  • Assumption: multinomial distribution
  • IID: sequences and n-grams
  • Use EXPECTATION MAXIMIZATION (EM)!
  • Test setup
  • Prediction of subcellular localization
  • Eukaryotic, non-plant sequences only
  • Dl: data annotated with subcellular localization
    for eukaryotic, non-plant sequences
  • DL-2: EXT/PLA (5,500 sequences, balanced)
  • DL-3: GOL 65% / LYS 14% / POX 21% (600
    sequences, unbalanced)
  • Du: set from 75K eukaryotic, non-plant protein
    sequences
  • Comparative method
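
The feature representation and model assumption on this slide (sequence → n-gram counts, multinomial Naïve Bayes) can be sketched roughly as below. This is a minimal illustration and not the ngLOC implementation: the function names, the Laplace smoothing constant alpha, the choice n=2, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    """Return the overlapping n-grams of a protein sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def train_nb(labeled, n=2, alpha=1.0):
    """Collect class counts and per-class n-gram counts from (sequence, label) pairs."""
    class_counts = Counter()
    gram_counts = {}
    vocab = set()
    for seq, label in labeled:
        grams = ngrams(seq, n)
        class_counts[label] += 1
        gram_counts.setdefault(label, Counter()).update(grams)
        vocab.update(grams)
    return {"n": n, "alpha": alpha, "classes": class_counts,
            "grams": gram_counts, "vocab": vocab}

def log_scores(model, seq):
    """Unnormalised log posterior: log P(class) + sum of log P(n-gram | class)."""
    total = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for c, count in model["classes"].items():
        # Laplace-smoothed multinomial estimate (alpha is an assumed constant).
        denom = sum(model["grams"][c].values()) + model["alpha"] * vocab_size
        score = math.log(count / total)
        for g in ngrams(seq, model["n"]):
            score += math.log((model["grams"][c][g] + model["alpha"]) / denom)
        scores[c] = score
    return scores

def predict(model, seq):
    """Return the class with the highest score."""
    scores = log_scores(model, seq)
    return max(scores, key=scores.get)

# Toy, made-up sequences; the labels reuse the EXT/PLA class names from the slide.
toy_labeled = [("MKKLLA", "EXT"), ("MKKLLV", "EXT"),
               ("GGSWWG", "PLA"), ("GGSWWA", "PLA")]
model = train_nb(toy_labeled)
print(predict(model, "MKKLL"))  # EXT
```

Treating the n-grams of a sequence as independent draws is what makes the per-class score a simple sum of log probabilities, which is the IID assumption listed on the slide.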

4
Algorithms based on EM
  • EM-λ on DL-3 data
  • λ controls the effect of unlabeled (UL) data on
    parameter adjustments
  • ALL labeled data (600)
  • Varied UL data
  • EM-λ outperforms TSVM on this problem
  • (TSVM failed to converge on large amounts of UL
    data, despite parameter selection)
  • NOTE: TSVM performed very well on binary,
    balanced classification problems
  • Basic EM on DL-2
  • Varied labeled data
  • 25,000 UL sequences
  • Most improvement when data is limited
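
One way to picture EM with a weighting factor on the unlabeled data is the sketch below. It is an assumption-laden illustration, not the authors' code: the weight is called lam, n-grams are bigrams, smoothing uses alpha=1, the iteration count is fixed, and the sequences are made up. Labeled sequences contribute with weight 1; each unlabeled sequence contributes its current class posterior scaled by lam.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def m_step(weighted, classes, vocab, alpha=1.0):
    """Re-estimate log priors and log n-gram probabilities.

    `weighted` holds (gram_counter, {class: weight}) pairs.
    """
    prior = {c: alpha for c in classes}
    counts = {c: Counter() for c in classes}
    for grams, resp in weighted:
        for c, w in resp.items():
            prior[c] += w
            for g, k in grams.items():
                counts[c][g] += w * k
    z = sum(prior.values())
    log_prior = {c: math.log(prior[c] / z) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like[c] = {g: math.log((counts[c][g] + alpha) / denom) for g in vocab}
    return log_prior, log_like

def e_step(grams, log_prior, log_like):
    """Posterior P(class | sequence); n-grams outside the vocabulary are skipped."""
    scores = {c: log_prior[c] + sum(k * log_like[c][g]
                                    for g, k in grams.items() if g in log_like[c])
              for c in log_prior}
    top = max(scores.values())
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

def em_lambda(labeled, unlabeled, lam=0.5, n=2, iters=10):
    """Semi-supervised Naive Bayes; lam scales the unlabeled contribution."""
    lab = [(Counter(ngrams(s, n)), {y: 1.0}) for s, y in labeled]
    unl = [Counter(ngrams(s, n)) for s in unlabeled]
    classes = sorted({y for _, y in labeled})
    vocab = set().union(*[g for g, _ in lab], *unl)
    params = m_step(lab, classes, vocab)          # start from labeled data only
    for _ in range(iters):
        soft = [(g, {c: lam * p for c, p in e_step(g, *params).items()})
                for g in unl]
        params = m_step(lab + soft, classes, vocab)
    return params

def predict(params, seq, n=2):
    post = e_step(Counter(ngrams(seq, n)), *params)
    return max(post, key=post.get)

# Toy, made-up data for illustration only.
toy_labeled = [("MKKLLA", "EXT"), ("GGSWWG", "PLA")]
toy_unlabeled = ["MKKLLV", "GGSWWA", "MKKLLT"]
params = em_lambda(toy_labeled, toy_unlabeled, lam=0.5)
print(predict(params, "MKKLLV"))
```

With lam=0 the unlabeled sequences add nothing and the result is the supervised model; with lam=1 every unlabeled sequence counts as much as a labeled one.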

5
Algorithm EM-CS
  • Core ngLOC method outputs a confidence score (CS)
  • Improve running time through intelligent
    selection of unlabeled instances
  • If CS(xi) > CSthresh, use the instance
  • Test on DL-3 data

First, determine the range of CS scores through
cross-validation without UL data: 33.5-47.8 (dependent
on the level of similarity in the data and the size of
the dataset). Using only sequences that meet or exceed
CSthresh significantly reduces the UL data required
(97.5% eliminated). NOTE: it is possible to reduce UL
data too much.
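
The selection step described above might look like the following sketch. The slides do not define how ngLOC computes its confidence score, so the stand-in used here (100 times the largest class posterior) is purely an assumption, as are the function names, the threshold of 80, and the toy posteriors. The comparison uses "meet or exceed", following the text above.

```python
def confidence_score(posterior):
    """Hypothetical stand-in for CS: 100 x the largest class posterior."""
    return 100.0 * max(posterior.values())

def select_confident(posteriors, cs_thresh):
    """Return ids of unlabeled sequences whose score meets or exceeds cs_thresh."""
    return [sid for sid, post in posteriors.items()
            if confidence_score(post) >= cs_thresh]

# Toy, made-up posteriors over the three DL-3 class names.
toy_posteriors = {
    "seqA": {"GOL": 0.90, "LYS": 0.06, "POX": 0.04},
    "seqB": {"GOL": 0.40, "LYS": 0.35, "POX": 0.25},
    "seqC": {"GOL": 0.05, "LYS": 0.10, "POX": 0.85},
    "seqD": {"GOL": 0.50, "LYS": 0.30, "POX": 0.20},
}
selected = select_confident(toy_posteriors, cs_thresh=80.0)
eliminated = 1.0 - len(selected) / len(toy_posteriors)
print(selected, eliminated)
```

Only the selected sequences would then be passed to the EM iterations, which is where the running-time saving comes from; setting the threshold too high leaves too little unlabeled data to help.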

6
Conclusion
  • Benefits
  • Probabilistic
  • Extract high-confidence unlabeled sequences
  • Difficult with SVM or TSVM
  • Extraction of knowledge from model
  • Discriminative n-grams and anomalies
  • Information theoretic measures, KL-divergence,
    etc.
  • Again, difficult with SVM or TSVM
  • Computational resources
  • Time: significantly lower than SVM and TSVM
  • Space: dependent on n-gram model
  • Can use large amounts of unlabeled data
  • Applicable toward prediction of any structural or
    functional characteristic
  • Outputs a global model
  • Transduction is not global!
  • Most substantial gain with limited labeled data
  • Current work in progress
  • TSVMs
  • Improve performance on smaller, unbalanced data
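
As a rough illustration of the "discriminative n-grams" and KL-divergence point in the conclusion, the sketch below ranks n-grams by their individual contribution to KL(p || q) between two class-conditional n-gram distributions. The distributions are toy values chosen for the example, the function name is hypothetical, and q is assumed to be smoothed so that no probability is zero.

```python
import math

def kl_terms(p, q):
    """Per-n-gram terms p(g) * log2(p(g) / q(g)); their sum is KL(p || q) in bits.

    Assumes q(g) > 0 wherever p(g) > 0 (for example after smoothing).
    """
    return {g: p[g] * math.log2(p[g] / q[g]) for g in p if p[g] > 0}

# Toy class-conditional n-gram distributions (made up for illustration).
p_class_a = {"KK": 0.50, "LL": 0.25, "GG": 0.25}
p_class_b = {"KK": 0.25, "LL": 0.25, "GG": 0.50}

terms = kl_terms(p_class_a, p_class_b)
kl = sum(terms.values())
ranked = sorted(terms, key=terms.get, reverse=True)
print(ranked[0], kl)  # KK 0.25
```

Because the model stores explicit per-class n-gram probabilities, this kind of inspection needs nothing beyond the fitted parameters.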