Title: Understanding Databases of Microscope Images
1Understanding Databases of Microscope Images
- Kai Huang
- Department of Biological Sciences and
- Center for Automated Learning and Discovery
- Carnegie Mellon University
2A Little Biology
- All advanced living organisms comprise many cells
- Each cell has a mechanism to control how and when it develops and functions
- The mystery is coded in the sequence of genes that are packed tightly to form chromosomes in the cell nucleus
- The genome of an organism is its set of chromosomes, containing all of its genes and associated DNA
3Central Dogma
http://www.accessexcellence.org/AB/GG/central.html
4Genomics
Comparative Genomics
Gene Prediction
Genome Sequencing
RNA Secondary Structure
Genome Analysis
5Genomics
- Both human and mouse genome drafts are in hand, plus hundreds of smaller organisms
- The building blocks of a cell are proteins that are coded in the genome
- Finding how proteins function with each other is the ultimate goal of life science
- Next step: Proteomics
6Proteomics
- The set of proteins expressed in a given cell type or tissue is its proteome
- Protein differences between cell types are responsible for the different behaviors of those cell types
7Proteomics
- Things to learn about proteins
- sequence
- structure
- expression level
- activity
- partners
- location
8Proteomics
- Things to learn about proteins
- sequence
- structure
- expression level
- activity
- partners
- location
Image source: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/A/AnimalCell.gif
9Proteomics
- Things to learn about proteins
- sequence
- structure
- expression level
- activity
- partners
- location - critical to understanding function
10Subcellular Location
- How do we determine the subcellular location of a protein?
- By looking it up
- By doing an experiment
11Looking it up Example
- Giantin
- Entrez: /note="a new 376kD Golgi complex outher membrane protein"
- SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE.
- GPP130
- Entrez: /note="GPP130 type II Golgi membrane protein"
- SwissProt: nothing
12Looking it up Example
- We learned that Giantin and GPP130 are both Golgi proteins, but do we know
- What part (i.e., cis, medial, trans) of the Golgi complex they each are found in?
- If they have the same subcellular distribution?
- If they also are found in other compartments?
13Words are not enough
- Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns
- Some efforts using restricted vocabularies (e.g., Yeast Protein Database, Gene Ontology consortium) for location have been made, but these do not provide the necessary complexity and specificity
14Need to advance past cartoon view of
subcellular location
http://www.cellsalive.net/cells/animcell.htm
- Need a systematic, quantitative approach to protein location
- Need new methods for accurately and objectively determining the subcellular location pattern of all proteins
15Fluorescence Microscopy
- Cells and proteins have almost no natural contrast under light
- Proteins can be tagged with fluorescent dyes
- Fluorescence microscopy provides high-resolution observation of protein subcellular location
16Initial Goal
- Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since
- different cells have different shapes, sizes, orientations
- organelles within cells are not found in fixed locations
17Supervised Learning Approach
- 1. Create sets of images showing the localization of many different proteins (each set defines one class of pattern)
- 2. Reduce each image to a set of numerical values (features) that are insensitive to position and rotation of the cell
- 3. Use statistical classification methods to learn how to distinguish each class using the features
18Input Images
- Created image database for HeLa cells
- Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
- Includes classes that are similar to each other
19Example Images
- Patterns that might be easily confused
Endoplasmic Reticulum (ER)
Mitochondria
20Example Images
- Patterns that might be easily confused
Lysosomes (LAMP2)
Endosomes (TfR)
21Example Images
- Patterns that might be easily confused
F-actin
Tubulin
22Example Images
- Classes expected to be indistinguishable
Golgi (Giantin)
Golgi (gpp130)
23Features Haralick texture
- Give information on correlations in intensity between adjacent pixels to answer questions like
- is the pattern more like a checkerboard or alternating stripes?
- is the pattern highly organized (ordered) or more scattered (disordered)?
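The "ordered versus scattered" question can be made concrete with a small sketch. It assumes a co-occurrence table built from horizontally adjacent pixels and a base-2 entropy; the names `cooccurrence` and `texture_entropy` and the toy images are illustrative and not taken from the talk.

```python
import math
from collections import Counter


def cooccurrence(image, dx=1, dy=0):
    """Normalized, symmetric co-occurrence table for pixel pairs offset by (dx, dy)."""
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count each pair in both directions
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_entropy(image, dx=1, dy=0):
    """Entropy (in bits) of the co-occurrence table; higher means less ordered."""
    table = cooccurrence(image, dx, dy)
    return -sum(p * math.log2(p) for p in table.values())


# Toy images (hypothetical data): regular stripes versus a mixed row.
stripes = [[0, 1, 0, 1] for _ in range(4)]
mixed = [[0, 0, 1, 1, 0]]

print(texture_entropy(stripes))  # 1.0: only the pairs (0,1) and (1,0) occur
print(texture_entropy(mixed))    # 2.0: all four pair types are equally common
```

The striped image uses only two kinds of neighboring pairs, so its entropy is lower than that of the mixed row, where every pair type appears equally often.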
24Example Difference detected by texture feature
entropy
25Features Zernike moment
- Measure degree to which pattern matches a particular Zernike polynomial
- Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern
26Examples of Zernike Polynomials
Z axis shows intensity
27Features SLF
- Developed additional features (SLF, for Subcellular Location Features)
- Motivated by descriptions of patterns used by biologists (e.g., punctate, perinuclear)
- Combined with Zernike and Haralick features to give 84 features used to describe each image
28Example Features from SLF1
- Number of fluorescent objects per cell
- Variance of the object sizes
- Ratio of the largest object to the smallest
- Average distance of objects to the center of fluorescence
- Fraction of convex hull occupied by fluorescence
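As a rough illustration of the first four features listed above, the sketch below works on a binary (already thresholded) mask. It is a simplification under stated assumptions: objects are found with 8-connectivity, the center of fluorescence is taken as the unweighted centroid of all above-threshold pixels, and the variance is the population variance. The function names are hypothetical, and the convex hull feature is left out.

```python
import math
from collections import deque


def find_objects(mask):
    """Return objects as lists of (row, col) pixels, using 8-connectivity."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny in (y - 1, y, y + 1):
                        for nx in (x - 1, x, x + 1):
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                objects.append(pixels)
    return objects


def centroid(pixels):
    """Unweighted mean position of a list of (row, col) pixels."""
    n = len(pixels)
    return (sum(y for y, _ in pixels) / n, sum(x for _, x in pixels) / n)


def object_features(mask):
    """Compute a few object-based features from a binary mask."""
    objects = find_objects(mask)
    if not objects:
        raise ValueError("mask contains no fluorescent objects")
    sizes = [len(obj) for obj in objects]
    mean_size = sum(sizes) / len(sizes)
    cy, cx = centroid([p for obj in objects for p in obj])  # center of fluorescence
    distances = [math.hypot(oy - cy, ox - cx) for oy, ox in map(centroid, objects)]
    return {
        "n_objects": len(objects),
        "size_variance": sum((s - mean_size) ** 2 for s in sizes) / len(sizes),
        "largest_to_smallest": max(sizes) / min(sizes),
        "mean_distance_to_center": sum(distances) / len(distances),
    }


# Toy mask (hypothetical data): one 4-pixel object and one 1-pixel object.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
features = object_features(mask)
print(features)
```

None of these values depends on where the cell sits in the frame or how it is rotated, which is the property the supervised learning approach asks of its features.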
29Subcellular Location Features 2D
- Haralick texture features
- Zernike moment features
- Morphological features
- Geometric features
- Edge features
- Gabor wavelet features
- Daubechies 4 wavelet features
30Feature Reduction
- Remove non-discriminative features
- Remove redundant features
- Combine features
- Benefits
- Speed
- Accuracy
- Multimedia indexing
31Feature Reduction
- Feature Recombination
- PCA (Principal Component Analysis)
- NLPCA (Nonlinear PCA)
- KPCA (Kernel PCA)
- ICA (Independent Component Analysis)
- Feature Selection
- Classification Tree (Gain ratio)
- Fractal Dimensionality Reduction
- Genetic Algorithm
- Stepwise Discriminant Analysis
32Feature Reduction Results
33Classifier Supervised Learning
- Neural Network
- Support Vector Machine
- Linear kernel
- Polynomial kernel
- Radial basis kernel
- Exponential radial basis kernel
- Ensemble Classifiers
- AdaBoost
- Bagging
- Mixtures-of-Experts
- Majority-voting classifier combining the above
classifiers
34Majority-voting Classifier
- Neural Network
- Linear-kernel SVM
- Exponential-rbf-kernel SVM
- Polynomial-kernel SVM
- AdaBoost
- Pairwise classifier-error correlation coefficients: mean 0.10, STD 0.07
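A minimal sketch of majority voting over the classifiers listed above. The tie-breaking rule (prefer the label predicted by the earliest classifier in the list) is an assumption made for illustration; the slides do not say how ties were resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; ties go to the earliest classifier's label."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of five classifiers for a single image.
votes = ["Golgi", "ER", "Golgi", "Golgi", "ER"]
print(majority_vote(votes))  # Golgi
```

Voting helps most when the individual classifiers make different mistakes, which is what a low pairwise error correlation indicates.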
352D Classification Results
Overall accuracy: 92.34%
36Human Classification Results
Overall accuracy: 83%
37Extending to 3D Labeling approach
- Total protein labeled with Cy5 reactive dye
- DNA labeled with PI
- Specific proteins labeled with primary Ab and Alexa488-conjugated secondary Ab
383D Image Set
Giantin
Nuclear
ER
Lysosomal
gpp130
Actin
Mitoch.
Nucleolar
Tubulin
Endosomal
39Features to measure z asymmetry
- 2D features treated x and y equivalently
- For 3D images, while it makes sense to treat x and y equivalently (cells don't have a "left" and "right"), z should be treated differently ("top" and "bottom" are not the same)
- We designed features to separate distance measures into an x-y component and a z component
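One way to read the last point is sketched below: a displacement from a reference point is split into an in-plane distance, which is unchanged by rotation about the z axis, and a signed vertical offset, which keeps "above" distinct from "below". The function name and the choice of reference point are assumptions for illustration.

```python
import math


def split_distance(point, center):
    """Split a 3D displacement into (x-y distance, signed z offset)."""
    x, y, z = point
    cx, cy, cz = center
    horizontal = math.hypot(x - cx, y - cy)  # same for any rotation about z
    vertical = z - cz                        # sign distinguishes top from bottom
    return horizontal, vertical


center = (0.0, 0.0, 5.0)
print(split_distance((3.0, 4.0, 2.0), center))   # (5.0, -3.0)
print(split_distance((4.0, -3.0, 2.0), center))  # (5.0, -3.0): rotated in x-y
print(split_distance((3.0, 4.0, 8.0), center))   # (5.0, 3.0): mirrored in z
```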
40Classification Results for 3D images
Overall accuracy: 97%
41How to do even better
- Biologists interpreting images of protein localization typically view many cells before reaching a conclusion
- Can simulate this by classifying sets of cells from the same microscope slide
42Classification of Sets of 3D Images
Set size: 9, Overall accuracy: 99.7%
43Next Experiment Interpretation
- Classification results demonstrate the value of the SLF feature sets for describing subcellular patterns
- The validation of the features suggests that they can be used for other applications, such as testing of hypotheses using image sets
- Enabling concept: image similarity
44Searching databases
- Sequence databases allow search by similarity
- The same is true for protein structure databases
GSNWLAMQLT
45Basic Method for Sequence Comparison
M A T N W G S L L Q
M D T N P V S L L R
Similarity Matrix
5 -1 3 2 -9 4 2 1 1 -3
25.7
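The idea behind the slide can be sketched as follows: score each aligned position with a similarity function and sum the per-position scores. The scoring used here (+2 for a match, -1 for a mismatch) is a toy stand-in and not the similarity matrix shown on the slide, so the numbers it produces will differ from the slide's.

```python
def toy_similarity(a, b):
    """Hypothetical scoring: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def ungapped_score(seq_a, seq_b, similarity=toy_similarity):
    """Return (per-position scores, total) for two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped comparison needs sequences of equal length")
    scores = [similarity(a, b) for a, b in zip(seq_a, seq_b)]
    return scores, sum(scores)


# The two sequences shown on the slide.
scores, total = ungapped_score("MATNWGSLLQ", "MDTNPVSLLR")
print(scores)  # [2, -1, 2, 2, -1, -1, 2, 2, 2, -1]
print(total)   # 8
```

A real substitution matrix would replace `toy_similarity` with a lookup that scores each residue pair individually.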
46Extension to location?
- Use SLF to find similar patterns
Database
47Goal Typical Image Selection
- To develop automated methods for selecting a
representative image from a set of images
obtained by fluorescence microscopy
48TypIC - Typical Image Chooser
Image Set
49Approach
- Calculate numerical features that contain information about each image (just like when classifying images)
- Calculate the mean and covariance matrix for the set (usually after automated elimination of outliers)
- Rank the images by their distance to the mean (centroid) of the population (usually using Mahalanobis distance, which weights according to the covariance matrix)
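The ranking step can be sketched with NumPy as below. This is an illustration under assumptions and not the TypIC implementation: it skips the outlier elimination step, uses a pseudo-inverse so that a singular covariance matrix does not raise an error, and the name `rank_by_typicality` is hypothetical.

```python
import numpy as np


def rank_by_typicality(features):
    """Order images from most to least typical by Mahalanobis distance to the mean.

    features: array-like of shape (n_images, n_features).
    Returns (order, distances), where order[0] is the most typical image.
    """
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariance
    diff = X - mean
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distances
    distances = np.sqrt(np.maximum(d2, 0.0))
    return np.argsort(distances), distances


# Toy feature matrix (hypothetical data): image 0 sits exactly at the centroid.
features = [[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1], [3, 3], [-3, -3]]
order, distances = rank_by_typicality(features)
print(order[0])  # 0
```

Because the distance is scaled by the covariance, a point lying along a direction in which the set varies a lot is penalized less than one lying across it.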
50Goal Image Set Comparison
- A common paradigm in molecular cell biology is to compare the distribution of a protein with and without the addition of a potential perturbing agent (e.g., drug, overexpressed protein)
- Such experiments are usually assayed by visual examination
- We have explored automating such comparisons
51SImEC - Statistical Imaging Experiment Comparator
Image Set 2
Image Set 1
52Method
- Calculate feature matrix for each set of images
- Compare feature matrices using a multivariate hypothesis test called the Hotelling T2-test
- Test returns an F value that can be compared to a critical value for a given confidence level
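A sketch of the two-sample Hotelling T2 statistic and its conversion to an F value, using the standard textbook formulas with a pooled covariance matrix. It is not the SImEC code; the function name is hypothetical, and the critical value is left to the caller (for example, from an F table with the returned degrees of freedom).

```python
import numpy as np


def hotelling_f(set1, set2):
    """Return (F value, (df1, df2)) for two feature matrices (rows are images)."""
    X1 = np.asarray(set1, dtype=float)
    X2 = np.asarray(set2, dtype=float)
    n1, p = X1.shape
    n2 = X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance of the two sets.
    pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
              + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    pooled = np.atleast_2d(pooled)
    t2 = (n1 * n2 / (n1 + n2)) * (diff @ np.linalg.solve(pooled, diff))
    f_value = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return float(f_value), (p, n1 + n2 - p - 1)


# Toy data (hypothetical): the same set, and a copy shifted by 10 in both features.
base = [[0, 0], [1, 0], [0, 1], [1, 1]]
shifted = [[x + 10, y + 10] for x, y in base]

print(hotelling_f(base, base))     # F is 0 when the two means coincide
print(hotelling_f(base, shifted))  # large F: the sets differ
```

An F value above the critical value for the chosen confidence level means the two image sets are judged to be different.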
53F values for comparison of all pairs of classes
using 65 features
- 95% confidence critical values are approximately 1.4 for all comparisons (depends on number of images)
54Comparison of two sets drawn randomly from the
same class
- Average F: 1.05 (TfR), 1.05 (Phal)
- Critical F (0.95): 1.63 (TfR), 1.61 (Phal)
- Number of failing sets out of 1000: 47 (TfR), 45 (Phal)
Expected result obtained: about 95% of randomly drawn sets are considered to be the same
55Image Databasing
- Our automated tools facilitate interpretation of large numbers of images
- Ideal for use with image databases
- We therefore began building an image database system in 1997 by first developing a database schema to describe all aspects of fluorescence microscope images
- Fluorescence Microscope Annotation Schema (FMAS): http://murphylab.web.cmu.edu/services/FMAS
- For Cell Biology labs to store and analyze fluorescence microscope images
56Protein Subcellular Location Image Database
- Implemented image database incorporating
- Full annotation of experimental (FMAS)
- SLF numerical features
- Queries can be done by text or image content
- Results can be fed to TypIC, SImEC, SLIC, SLIF
K. Huang, J. Lin, J.A. Gajnak, and R.F. Murphy
(2002). Image Content-based Retrieval and
Automated Interpretation of Fluorescence
Microscope Images via the Protein Subcellular
Location Image Database. Proceedings of the 2002
IEEE International Symposium on Biomedical
Imaging (ISBI 2002), pp. 325-328.
59Clustering by Image Similarity
- Ability to measure similarity of protein patterns allows us for the first time to create a systematic, objective framework for describing subcellular locations
- Ideal for database references
- One way is by creating a Subcellular Location Tree
- Start tree with the two proteins whose patterns are most similar, keep adding branches for less and less similar patterns
60Subcellular Location Tree for 10 classes in HeLa
cells
61Location Proteomics
- Tag all proteins in a cell line randomly
- Examine many cells, each of which is expected to express one tagged protein, using fluorescence microscopy to determine the subcellular location of that protein
62Example images of randomly tagged clones
- Glut1 gene (type 1 glucose transporter)
- Tmpo gene (thymopoietin ??
- tuba1 gene (α-tubulin)
- Cald gene (caldesmon 1)
- Ncl gene (nucleolin)
- Rps11 gene (ribosomal protein S11)
- Hmga1 gene (high mobility group AT-hook 1)
- Col1a2 gene (procollagen type I α2)
- Atp5a1 gene (ATP synthase isoform 1)
63Goal
- Cluster 46 clones expressing different tagged
proteins based on their subcellular location
patterns
64Outlier removal
- Have 9 to 30 images per protein
- Use Q test or t test on individual features to remove outlier images
- Calculate mean feature vector for each protein
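A sketch of the Q test (Dixon's Q) applied to one feature across the images of one protein. The statistic is the gap between the suspect value and its nearest neighbor divided by the full range. The function name is hypothetical, and the critical value must come from a published table for the actual number of images; 0.412 is the commonly tabulated 90% value for n = 10.

```python
def dixon_q(values):
    """Return (Q, index) for the more suspicious of the two extreme values."""
    if len(values) < 3:
        raise ValueError("Q test needs at least three values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    ordered = [values[i] for i in order]
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        return 0.0, None  # all values identical, nothing to flag
    low_gap = ordered[1] - ordered[0]
    high_gap = ordered[-1] - ordered[-2]
    if high_gap >= low_gap:
        return high_gap / spread, order[-1]
    return low_gap / spread, order[0]


# One feature measured on ten images (hypothetical values).
readings = [0.189, 0.167, 0.187, 0.183, 0.186, 0.182, 0.181, 0.184, 0.181, 0.177]
Q_CRIT_N10_90 = 0.412  # 90% critical value for n = 10, taken from a Q table

q, suspect = dixon_q(readings)
is_outlier = q > Q_CRIT_N10_90
print(round(q, 3), suspect, is_outlier)  # 0.455 1 True
```

An image flagged on a feature would be dropped before the mean feature vector for that protein is calculated.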
65Feature selection
- Feature set optimization is NP-complete
- Use Stepwise Discriminant Analysis (backward/forward method) to rank features based on their ability to distinguish proteins
- Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones
66Classifier accuracy
67Tree building
- Best performance using between 10 and 15 features
- Calculate Euclidean distance matrix for best 10 features
- Build SLT
- Using classifier results to cut tree
- Sort confusion matrix in order of tree
- Group proteins with more than 25% confusion
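The distance-matrix and tree-building steps can be sketched as a simple agglomerative clustering of the mean feature vectors. Single linkage (distance between the closest members of two groups) is an assumption here, since the slides do not name the linkage rule, and the protein labels and feature values are made up.

```python
import math
from itertools import combinations


def build_tree(vectors):
    """Merge the closest clusters repeatedly; return the tree as nested tuples.

    vectors: dict mapping a protein name to its mean feature vector.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in vectors]

    def linkage(a, b):
        # Single linkage: smallest Euclidean distance between any two members.
        return min(math.dist(vectors[m], vectors[n]) for m in a[1] for n in b[1])

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical mean feature vectors for four proteins.
vectors = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [5.0, 5.0], "D": [9.0, 0.0]}
tree = build_tree(vectors)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

The two most similar patterns (A and B) are joined first, and less similar patterns are attached later, so the depth at which two proteins meet reflects how alike their location patterns are.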
68Detailed view of some proteins
[Figure: classification result versus real classes]
69Hmga1-1 (nucleus)
Hmga1-2 (nucleus)
Unknown-9 (nucleus)
Hmgn2-1 (nucleus)
Unknown-8 (nucleus)
Rpl32 (Nucleolus)
70SLT with best 10 features
71Conclusions: Location Proteomics
- New frontier of automated cell biology opening for analysis of large numbers of 2D through 5D fluorescence microscope images
- Random tagging provides tool for determining patterns for many proteins
- Can construct Subcellular Location Trees to systematically represent knowledge about location
- Can be built for many cell types and can reflect dynamic properties of proteins (changes with time, drugs, oncogenes, etc.)
72Acknowledgments
- Prof. Robert F. Murphy
- Current grad students
- Kai Huang
- Xiang Chen
- Yanhua Hu
- Elvira Garcia Osuna
- Ting Zhao
- Juchang Hua
- Former students
- Meel Velliste
- Michael Boland
- Mia Markey
- Gregory Porreca
- Edward Roques
- Jie Yao
- Funding
- NSF, NCI, Merck, Rockefeller Bros. Fund
- Collaborators/Consultants
- Simon Watkins
- David Casasent
- Tom Mitchell
- Christos Faloutsos
- Jon Jarvik
- Peter Berget