Title: Integration of PSLID and SLIF with Virtual Cell
1Integration of PSLID and SLIF with Virtual Cell
- Robert F. Murphy, Les Loew, Ion Moraru
- Ray and Stephanie Lane Professor of Computational
Biology - Molecular Biosensors and Imaging Center,
Departments of Biological Sciences, Biomedical
Engineering and Machine Learning and
2Alan Waggoner (CMU) and Simon Watkins (Pitt)
3Brian Athey (UMich), CMU Bob Murphy
4Central questions
- How many distinct locations within cells can
proteins be found in? What are they?
5Automated Interpretation
- Traditional analysis of fluorescence microscope
images has occurred by visual inspection
- Our goal over the past twelve years has been
to automate interpretation, with the ultimate goal
of fully automated learning of protein location
from images
6Learn to recognize all major subcellular patterns
ER
gpp130
giantin
2D Images of HeLa cells
Mito
LAMP
Nucleolin
Tubulin
DNA
TfR
Actin
7Classification Results Computer vs. Human
Murphy et al 2000; Boland & Murphy 2001; Murphy
et al 2003; Huang & Murphy 2004
Lysosomes
Giantin (Golgi)
Gpp130 (Golgi)
Notes: Even better results using MR methods by
Kovacevic group. Even better results for 3D images.
8Tissue Microarrays
Courtesy http://www.beecherinstruments.com
Courtesy www.microarraystation.com
9Human Protein Atlas
Courtesy www.proteinatlas.org
10Test Dataset from Human Protein Atlas
- Selected 16 proteins from the Atlas
- Two each from all major organelles (class)
- 45 tissue types for each class (e.g. liver,
skin)
- Goal: Train classifier to recognize each
subcellular pattern across all tissue types
Insulin in islet cells
Justin Newberg
11Subcellular Pattern Classification over 45 tissues
Prediction
Labels
Overall accuracy 81%
Accuracy for 50% of images with highest
confidence 97%
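The two figures on this slide (accuracy over all images, and accuracy over the half of the images the classifier is most confident about) can be computed as in the minimal sketch below. The function name `accuracy_at_coverage` and the toy records are hypothetical illustrations; they are not the classifier, confidence measure, or data used in the study.

```python
def accuracy_at_coverage(records, coverage=1.0):
    """Accuracy over the `coverage` fraction of records with highest confidence.

    records: list of (true_label, predicted_label, confidence) tuples.
    """
    if not records:
        raise ValueError("no records")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Rank predictions from most to least confident.
    ranked = sorted(records, key=lambda r: r[2], reverse=True)
    # Keep at least one record so the ratio is always defined.
    keep = max(1, round(len(ranked) * coverage))
    kept = ranked[:keep]
    correct = sum(1 for true, pred, _ in kept if true == pred)
    return correct / len(kept)


# Toy records, made up for illustration only.
toy = [
    ("ER", "ER", 0.95),
    ("Golgi", "Golgi", 0.90),
    ("Mito", "Mito", 0.85),
    ("Nucleus", "Nucleus", 0.80),
    ("ER", "Golgi", 0.55),
    ("Mito", "Mito", 0.50),
    ("Golgi", "ER", 0.45),
    ("Nucleus", "Nucleus", 0.40),
]
overall = accuracy_at_coverage(toy)        # all images
top_half = accuracy_at_coverage(toy, 0.5)  # 50% most confident images
```

Trading coverage for accuracy in this way is useful when only high-confidence assignments will be entered into a database.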
13Annotations of Yeast GFP Fusion Localization
Database
- Contains images of 4156 proteins (out of 6234
ORFs in all 16 yeast chromosomes).
- GFP tagged immediately before the stop codon of
each ORF to minimize perturbation of protein
expression.
- Annotations were done manually by two scorers and
co-localization experiments were done for some
cases using mRFP.
- Each protein is assigned one or more of 22
location categories.
14Classification of Yeast Subcellular Patterns
Chen et al 2007
- Selected only those assigned to a single
unambiguous location class (21 classes)
- Trained classifier to recognize those classes
- 81% agreement with human classification
- 94.5% agreement for high confidence assignments
(without using colocalization!)
- Examination of proteins for which methods
disagree suggests machine classifier is correct
in at least some cases
Shann-Ching (Sam) Chen Geoff Gordon
15Example of Potentially Incorrect Label
ORF Name: YGR130C
UCSF Location: punctate_composite
Automated Prediction: cell_periphery (60.67%),
cytoplasm (30%), ER (9.33%)
DNA GFP Segmentation
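The slide does not say how the per-class percentages for a protein are obtained. One plausible reading, shown here only as an assumption, is that each segmented cell is classified separately and the votes are tallied per protein. The function name `label_distribution` is hypothetical, and the vote counts (150 cells) are chosen only so that the percentages reproduce the figures on the slide.

```python
from collections import Counter


def label_distribution(per_cell_labels):
    """Return {label: percent of cells}, ordered from most to least frequent."""
    if not per_cell_labels:
        raise ValueError("no cells")
    counts = Counter(per_cell_labels)
    total = len(per_cell_labels)
    return {
        label: round(100.0 * n / total, 2)
        for label, n in counts.most_common()
    }


# Hypothetical per-cell predictions for one protein (assumed counts).
cells = ["cell_periphery"] * 91 + ["cytoplasm"] * 45 + ["ER"] * 14
dist = label_distribution(cells)
top_label = next(iter(dist))  # the most frequent predicted class
```

Under this reading, the spread of votes across classes also gives a rough indication of how confident the protein-level assignment is.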
16Supervised vs. Unsupervised Learning
- This work demonstrated the feasibility of using
classification methods to assign all proteins to
known major classes
- Do we know all locations? Are assignments to
major classes enough?
- Need approach to discover classes
17Location Proteomics
- Tag many proteins (many methods available; we use
CD-tagging, developed by Jonathan Jarvik and
Peter Berget): infect population of cells with a
retrovirus carrying DNA sequence that will tag
a random gene in each cell
- Isolate separate clones, each of which
expresses one tagged protein
- Use RT-PCR to identify tagged gene in each clone
- Collect many live cell images for each clone
using spinning disk confocal fluorescence
microscopy
Jarvik et al 2002
18Chen et al 2003; Chen and Murphy 2005
Group proteins by pattern automatically
Uniform punctate proteins
Nucleolar proteins
Punctate nuclear proteins
Vesicular proteins
Uniform proteins
Nuclear w/ punctate cytoplasm
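Grouping proteins by pattern without predefined classes is a clustering problem. The slides do not state which clustering method the cited papers use, so the sketch below swaps in plain average-linkage agglomerative clustering as an illustration. The feature vectors, protein names, and distance threshold are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, features):
    """Average-linkage distance between two clusters of protein names."""
    total = sum(euclidean(features[p], features[q]) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def agglomerate(features, threshold):
    """Merge the closest clusters until the closest pair exceeds `threshold`."""
    clusters = [[name] for name in sorted(features)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], features)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are considered distinct patterns
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical 2-feature descriptions of five tagged proteins.
features = {
    "protA": (0.90, 0.10),
    "protB": (0.85, 0.15),
    "protC": (0.10, 0.90),
    "protD": (0.15, 0.80),
    "protE": (0.50, 0.50),
}
groups = agglomerate(features, threshold=0.3)
```

Here the number of groups is not fixed in advance; it falls out of the threshold, which is the property that lets new location classes be discovered.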
19CD-tagging project
Garcia Osuna et al 2007
- Running 100 clones/wk
- Automated imaging
Results for 225 clones
20Subcellular Location Families and Generative
Models
- Rather than using words (e.g., GO terms) to
describe location patterns, can make entries in
protein databases that give a protein's Subcellular
Location Family, a specific node in a
Subcellular Location Tree
- Provides necessary resolution that is difficult
to obtain with words
- How do we communicate patterns? Use generative
models learned from images to capture pattern and
variation in pattern
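To make "capture pattern and variation in pattern" concrete, the sketch below fits a deliberately simple model to training cells and then samples a new synthetic cell from it. It is not the model of the cited work: the Gaussian summaries, the two per-object features (distance from nucleus, object size), and the function names are all assumptions chosen for illustration.

```python
import random
import statistics


def fit_object_model(cells):
    """Summarize object count, position, and size across training cells.

    cells: list of cells; each cell is a list of
    (distance_from_nucleus, object_size) tuples (hypothetical features).
    """
    counts = [len(cell) for cell in cells]
    dists = [d for cell in cells for d, _ in cell]
    sizes = [s for cell in cells for _, s in cell]
    model = {}
    for key, values in (("count", counts), ("dist", dists), ("size", sizes)):
        model[key + "_mean"] = statistics.mean(values)   # the pattern
        model[key + "_sd"] = statistics.pstdev(values)   # its variation
    return model


def sample_cell(model, rng):
    """Draw one synthetic cell (a list of objects) from the fitted model."""
    n = max(0, round(rng.gauss(model["count_mean"], model["count_sd"])))
    return [
        (
            max(0.0, rng.gauss(model["dist_mean"], model["dist_sd"])),
            max(0.0, rng.gauss(model["size_mean"], model["size_sd"])),
        )
        for _ in range(n)
    ]


# Made-up training data: two cells with 2 and 4 objects.
training = [
    [(0.2, 1.0), (0.4, 2.0)],
    [(0.3, 1.5), (0.5, 2.5), (0.6, 1.0), (0.4, 2.0)],
]
model = fit_object_model(training)
synthetic = sample_cell(model, random.Random(0))
```

The fitted parameters are a compact, shareable description of the pattern, and each call to `sample_cell` yields a different instance consistent with it.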
21Generative Model Components
Nucleus
Model parameters
Cell membrane
Fitted
Original
Filtered
Protein objects
Zhao & Murphy 2007
22Synthesized Images
Lysosomes
Endosomes
- Have XML design for capturing model parameters
- Have portable tool for generating images from
model
SLML toolbox - Ivan Cao-Berg, Tao Peng, Ting Zhao
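As an illustration of storing model parameters in XML and reading them back, here is a minimal round-trip sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`pattern_model`, `parameter`, `name`) and the parameter values are hypothetical; they are not the actual SLML schema, which is not shown in these slides.

```python
import xml.etree.ElementTree as ET


def model_to_xml(name, params):
    """Serialize a dict of numeric model parameters to an XML string."""
    root = ET.Element("pattern_model", {"name": name})
    for key in sorted(params):
        elem = ET.SubElement(root, "parameter", {"name": key})
        elem.text = repr(float(params[key]))
    return ET.tostring(root, encoding="unicode")


def xml_to_model(text):
    """Parse the XML produced by model_to_xml back into (name, params)."""
    root = ET.fromstring(text)
    params = {p.get("name"): float(p.text) for p in root.findall("parameter")}
    return root.get("name"), params


# Hypothetical parameter values for a lysosome-like pattern.
xml_text = model_to_xml("lysosome", {"count_mean": 3.0, "count_sd": 1.0})
name, params = xml_to_model(xml_text)
```

A text format like this lets a database return a model to any client, which can then regenerate images locally with its own tool.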
23Combining Models for Cell Simulations
Simulation for multiple proteins
Shared Nuclear and Cell Shape
XML
Integrating with Virtual Cell (University of
Connecticut) and M-Cell (Pittsburgh
Supercomputing Center)
24PSLID: Protein Subcellular Location Image Database
- Version 4 to be released March 2008
- Adding 50,000 analyzed images (1,000 clones,
350,000 cells) from 3T3 cell random tagging
project
- Adding 7,500 analyzed images (2,500 genes,
40,000 cells) from UCSF yeast GFP database
- Adding 400,000 analyzed images (3,000 proteins,
45 tissues) from Human Protein Atlas
- Adding generative models to describe subcellular
patterns consisting of discrete objects (e.g.,
lysosomes, endosomes, mitochondria)
- Return XML file with real images that match a
query
- Return XML file with generative model for a
pattern
- Connecting to MBIC TCNP fluorescent probes
database
- Connecting to CCAM TCNP Virtual Cell system