Title: Semi-supervised learning for protein classification
1. Semi-supervised learning for protein classification
- Brian R. King
- Chittibabu Guda, Ph.D.
- Department of Computer Science, University at Albany, SUNY
- GenNYsis Center for Excellence in Cancer Genomics, University at Albany, SUNY
2. The problem
- Develop computational models of characteristics of protein structure and function from sequence alone, using machine-learned classifiers
- Input: data
- Output: a model (function) h: X → Y
- Traditional approach: supervised learning
- Challenges
  - Experimentally determined data: expensive, limited, subject to noise/error
  - Large repositories of unannotated data
  - Data representation; bias from unbalanced/underrepresented classes; etc.
- TrEMBL (release 37.5): 5,035,267 sequences
- Swiss-Prot (release 54.5): 289,473 sequences
- AIM: Develop a method that uses both labeled and unlabeled data, improving performance given the challenges presented by small, unbalanced data
3. Solution
- Semi-supervised learning
  - Use Dl (labeled) and Du (unlabeled) for model induction
- Method: generative, Bayesian probabilistic model
  - Based on ngLOC, a supervised Naïve Bayes classification method
  - Input / feature representation: sequence → n-gram model
  - Assumption: multinomial distribution
  - IID sequences and n-grams
  - Use EXPECTATION MAXIMIZATION!
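The pipeline above (sequence → n-gram counts → multinomial Naïve Bayes, trained with EM) can be sketched end to end. This is an illustrative toy, not the actual ngLOC implementation: the two-letter alphabet, class names, and sequences are all made up, and unlabeled counts enter the M-step at full weight.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    """Overlapping n-grams of a sequence (treated as an IID bag)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def fit(labeled, classes, vocab, alpha=1.0, soft=None):
    """M-step: estimate multinomial NB parameters from hard-labeled data
    plus optional soft-labeled unlabeled data [(seq, {class: P(c|x)})]."""
    prior = {c: alpha for c in classes}          # smoothed class counts
    counts = {c: Counter() for c in classes}     # per-class n-gram counts
    for seq, c in labeled:
        prior[c] += 1.0
        counts[c].update(ngrams(seq))
    for seq, post in (soft or []):
        grams = Counter(ngrams(seq))
        for c in classes:                        # fractional counts
            prior[c] += post[c]
            for g, k in grams.items():
                counts[c][g] += post[c] * k
    zp = sum(prior.values())
    log_prior = {c: math.log(prior[c] / zp) for c in classes}
    log_cond = {}
    for c in classes:                            # Laplace-smoothed P(g|c)
        z = sum(counts[c].values()) + alpha * len(vocab)
        log_cond[c] = {g: math.log((counts[c][g] + alpha) / z) for g in vocab}
    return log_prior, log_cond

def posterior(seq, classes, log_prior, log_cond):
    """E-step for one sequence: P(class | n-gram counts).
    N-grams outside the vocabulary are simply ignored."""
    score = {c: log_prior[c] + sum(log_cond[c].get(g, 0.0)
                                   for g in ngrams(seq))
             for c in classes}
    m = max(score.values())
    w = {c: math.exp(s - m) for c, s in score.items()}
    z = sum(w.values())
    return {c: v / z for c, v in w.items()}

def em(labeled, unlabeled, classes, vocab, iters=10):
    """Initialize from labeled data only, then alternate E- and M-steps."""
    params = fit(labeled, classes, vocab)
    for _ in range(iters):
        soft = [(s, posterior(s, classes, *params)) for s in unlabeled]
        params = fit(labeled, classes, vocab, soft=soft)
    return params

# Toy usage: two classes, a two-letter "alphabet", made-up sequences.
labeled = [("AAAG", "EXT"), ("GGGA", "PLA")]
unlabeled = ["AAAA", "GGGG"]
vocab = {a + b for a in "AG" for b in "AG"}
lp, lc = em(labeled, unlabeled, ["EXT", "PLA"], vocab)
print(posterior("AAGA", ["EXT", "PLA"], lp, lc))
```

Initializing from labeled data alone and then iterating E/M is the standard semi-supervised EM recipe; a weight on the unlabeled counts (slide 4's λ) would scale the `post[c]` contributions inside `fit`.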
- Test setup
  - Prediction of subcellular localization
  - Eukaryotic, non-plant sequences only
  - Dl: data annotated with subcellular localization for eukaryotic, non-plant sequences
    - DL-2: EXT/PLA (5,500 sequences, balanced)
    - DL-3: GOL 65% / LYS 14% / POX 21% (600 sequences, unbalanced)
  - Du: set drawn from 75K eukaryotic, non-plant protein sequences
- Comparative method: TSVM (transductive SVM)
4. Algorithms based on EM
- EM-λ on DL-3 data
  - λ controls the effect of unlabeled data on parameter adjustments
  - ALL labeled data (600 sequences)
  - Varied the amount of unlabeled data
  - EM-λ outperforms TSVM on this problem
    - (TSVM failed to converge on large amounts of unlabeled data, despite parameter selection)
  - NOTE: TSVM performed very well on binary, balanced classification problems
- Basic EM on DL-2
  - Varied the amount of labeled data
  - 25,000 unlabeled sequences
  - Most improvement when labeled data is limited
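The role of λ can be shown in isolation: in the M-step, the soft counts contributed by unlabeled instances are scaled by λ before parameters are re-estimated, so λ = 0 ignores the unlabeled pool and λ = 1 lets it count fully. A minimal sketch of the class-prior update (the function name and all numbers are made up for illustration):

```python
def weighted_prior(labels, soft_posteriors, classes, lam):
    """Class prior re-estimated from hard labels plus lam-weighted
    soft labels from the unlabeled pool."""
    counts = {c: 0.0 for c in classes}
    for c in labels:                      # labeled data counts fully
        counts[c] += 1.0
    for post in soft_posteriors:          # unlabeled: lam * P(c|x)
        for c in classes:
            counts[c] += lam * post[c]
    total = sum(counts.values())
    return {c: counts[c] / total for c in classes}

classes = ["GOL", "LYS", "POX"]
labels = ["GOL", "GOL", "LYS"]            # no labeled POX at all
soft = [{"GOL": 0.1, "LYS": 0.1, "POX": 0.8}] * 3
print(weighted_prior(labels, soft, classes, lam=0.0))  # POX prior stays 0
print(weighted_prior(labels, soft, classes, lam=1.0))  # POX prior -> 0.4
```

The same λ scaling applies to the per-class n-gram counts; the prior is just the smallest piece that shows the effect.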
5. Algorithm: EM-CS
- Core ngLOC method outputs a confidence score (CS)
- Improve running time through intelligent selection of unlabeled instances
  - CS(xi) > CSthresh? Use the instance
- Test on DL-3 data
  - First, determine the range of CS scores through cross-validation without unlabeled data: 33.5–47.8 (dependent on the level of similarity in the data and the size of the dataset)
  - Using only sequences that meet or exceed CSthresh significantly reduces the unlabeled data required (97.5% eliminated)
  - NOTE: it is possible to reduce the unlabeled data too much
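The selection rule itself is a one-line filter. The exact ngLOC CS formula isn't given on this slide, so the sketch below assumes CS is 100 × the gap between the top two posterior probabilities, roughly consistent with the score range quoted above; treat that definition, and the toy posteriors, as assumptions.

```python
def confidence_score(posterior):
    """posterior: dict class -> P(c|x).  Assumed CS: scaled gap
    between the top two classes (an illustrative stand-in)."""
    top_two = sorted(posterior.values(), reverse=True)[:2]
    runner_up = top_two[1] if len(top_two) > 1 else 0.0
    return 100.0 * (top_two[0] - runner_up)

def select_unlabeled(posteriors, cs_thresh):
    """Indices of unlabeled instances whose CS meets the threshold."""
    return [i for i, p in enumerate(posteriors)
            if confidence_score(p) >= cs_thresh]

posts = [{"GOL": 0.80, "LYS": 0.15, "POX": 0.05},   # CS = 65: confident
         {"GOL": 0.40, "LYS": 0.35, "POX": 0.25}]   # CS = 5: ambiguous
print(select_unlabeled(posts, cs_thresh=40.0))       # -> [0]
```

Only the confidently scored instance survives, which is the mechanism behind the 97.5% reduction noted above.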
6. Conclusion
- Benefits
  - Probabilistic
  - Extraction of unlabeled sequences with high confidence
    - Difficult with SVM or TSVM
  - Extraction of knowledge from the model
    - Discriminative n-grams and anomalies
    - Information-theoretic measures, KL-divergence, etc.
    - Again, difficult with SVM or TSVM
  - Computational resources
    - Time: significantly lower than SVM and TSVM
    - Space: dependent on the n-gram model
  - Can use large amounts of unlabeled data
  - Applicable toward prediction of any structural or functional characteristic
  - Outputs a global model
    - Transduction is not global!
  - Most substantial gain with limited labeled data
- Current work in progress
  - TSVMs
  - Improve performance on smaller, unbalanced data
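As one concrete instance of the knowledge-extraction point above: with a generative n-gram model, discriminative n-grams can be ranked by their per-term contribution to the KL divergence between a class-conditional distribution and the background. This is a generic KL-style illustration, not the authors' specific analysis, and all probabilities are made-up toy values.

```python
import math

def kl_contributions(p_class, p_overall):
    """Per-n-gram terms of KL(P(.|c) || P(.)); large positive terms mark
    n-grams over-represented in class c."""
    return {g: p * math.log(p / p_overall[g])
            for g, p in p_class.items() if p > 0}

p_ext = {"AA": 0.5, "AG": 0.3, "GG": 0.2}       # P(g | EXT), toy values
p_all = {"AA": 0.3, "AG": 0.3, "GG": 0.4}       # P(g), toy background
ranked = sorted(kl_contributions(p_ext, p_all).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])   # prints "AA": the most EXT-discriminative n-gram
```

An SVM or TSVM decision boundary offers no comparably direct per-feature probabilistic reading, which is the contrast the slide draws.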