Understanding Databases of Microscope Images

1
Understanding Databases of Microscope Images

Kai Huang
Department of Biological Sciences and
Center for Automated Learning and Discovery
Carnegie Mellon University

2
A Little Biology

All advanced living organisms comprise many cells
Each cell has a mechanism to control how and when
it develops and functions
The mystery is coded in the sequence of genes
that are packed tightly to form chromosomes in
the cell nucleus
The genome of an organism is its set of
chromosomes, containing all of its genes and
associated DNA

3
Central Dogma
http://www.accessexcellence.org/AB/GG/central.html
4
Genomics
Comparative Genomics
Gene Prediction
Genome Sequencing
RNA Secondary Structure
Genome Analysis
5
Genomics

Both human and mouse genome drafts are in hand
plus hundreds of smaller organisms
The building blocks of a cell are proteins that
are coded in the genome
Finding how proteins function with each other is
the ultimate goal of life science
Next step: Proteomics

6
Proteomics

The set of proteins expressed in a given cell
type or tissue is its proteome
Protein differences between cell types are
responsible for the different behaviors of those
cell types

7
Proteomics

Things to learn about proteins
sequence
structure
expression level
activity
partners
location

8
Proteomics

Things to learn about proteins
sequence
structure
expression level
activity
partners
location

Image source: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/A/AnimalCell.gif
9
Proteomics

Things to learn about proteins
sequence
structure
expression level
activity
partners
location - critical to understanding function

10
Subcellular Location

How do we determine the subcellular location of
a protein?
By looking it up
By doing an experiment

11
Looking it up: Example

Giantin
Entrez: /note="a new 376kD Golgi complex outher
membrane protein"
SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI
MEMBRANE.
GPP130
Entrez: /note="GPP130 type II Golgi membrane
protein"
SwissProt: nothing

12
Looking it up: Example

We learned that Giantin and GPP130 are both Golgi
proteins, but do we know
What part (i.e., cis, medial, trans) of the Golgi
complex they each are found in?
If they have the same subcellular distribution?
If they also are found in other compartments?

13
Words are not enough

Different investigators may use different terms
to refer to the same pattern or the same term to
refer to different patterns
Some efforts using restricted vocabularies (e.g.,
Yeast Protein Database, Gene Ontology consortium)
for location have been made but these do not
provide the necessary complexity and specificity

14
Need to advance past cartoon view of
subcellular location
http://www.cellsalive.net/cells/animcell.htm

Need a systematic, quantitative approach to
protein location
Need new methods for accurately and objectively
determining the subcellular location pattern of
all proteins

15
Fluorescence Microscopy

Cells and proteins have almost no natural
contrast under light
Proteins can be tagged with fluorescent dyes
Fluorescence microscopy provides high-resolution
observation of protein subcellular location

16
Initial Goal

Classification by direct (pixel-by-pixel)
comparison of individual images to known patterns
is not useful, since
different cells have different shapes, sizes,
orientations
organelles within cells are not found in fixed
locations

17
Supervised Learning Approach

1. Create sets of images showing the localization
of many different proteins (each set defines one
class of pattern)
2. Reduce each image to a set of numerical values
(features) that are insensitive to position and
rotation of the cell
3. Use statistical classification methods to
learn how to distinguish each class using the
features

18
Input Images

Created image database for HeLa cells
Ten classes covering all major subcellular
structures: Golgi (two markers, giantin and
gpp130), ER, mitochondria, lysosomes, endosomes,
nuclei, nucleoli, microfilaments, microtubules
Includes classes that are similar to each other

19
Example Images

Patterns that might be easily confused

Endoplasmic Reticulum (ER)
Mitochondria
20
Example Images

Patterns that might be easily confused

Lysosomes (LAMP2)
Endosomes (TfR)
21
Example Images

Patterns that might be easily confused

F-actin
Tubulin
22
Example Images

Classes expected to be indistinguishable

Golgi (Giantin)
Golgi (gpp130)
23
Features: Haralick texture

Give information on correlations in intensity
between adjacent pixels to answer questions like
is the pattern more like a checkerboard or
alternating stripes?
is the pattern highly organized (ordered) or more
scattered (disordered)?
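
A rough illustration of the ordered-versus-scattered question: the sketch below builds a gray-level co-occurrence distribution for one pixel offset and computes its entropy, one of the Haralick texture features. The function name, the single horizontal offset, the base-2 logarithm, and the toy 4x4 images are assumptions made for this example, not details taken from the talk.

```python
import math
from collections import Counter

def cooccurrence_entropy(image, dx=1, dy=0):
    """Entropy (in bits) of the gray-level co-occurrence distribution for one offset."""
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                # Count the pair in both orders so the matrix is symmetric.
                counts[(a, b)] += 1
                counts[(b, a)] += 1
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical two-level images: regular stripes versus a less regular pattern.
stripes = [[0, 1, 0, 1] for _ in range(4)]
scattered = [[0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 1], [1, 0, 0, 1]]

stripes_entropy = cooccurrence_entropy(stripes)
scattered_entropy = cooccurrence_entropy(scattered)
```

The striped image uses only two kinds of neighbor pairs, so its entropy is lower than that of the scattered image, where all four pair types occur.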

24
Example: Difference detected by texture feature
"entropy"
25
Features: Zernike moment

Measure degree to which pattern matches a
particular Zernike polynomial
Give information on basic nature of pattern
(e.g., circle, donut) and sizes (frequencies)
present in pattern

26
Examples of Zernike Polynomials
Z axis shows intensity
27
Features: SLF

Developed additional features (SLF, for
Subcellular Location Features)
Motivated by descriptions of patterns used by
biologists (e.g., punctate, perinuclear)
Combined with Zernike and Haralick features to
give 84 features used to describe each image

28
Example Features from SLF1

Number of fluorescent objects per cell
Variance of the object sizes
Ratio of the largest object to the smallest
Average distance of objects to the center of
fluorescence
Fraction of convex hull occupied by fluorescence
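
The first four features listed above can be sketched on an already thresholded image as follows. This is a simplified illustration, not the SLF1 implementation: it assumes a binary mask with at least one bright pixel, 8-connected objects, and an unweighted centroid standing in for the center of fluorescence. All function and key names are hypothetical.

```python
import math
from statistics import pvariance

def find_objects(mask):
    """Return a list of objects, each a list of (row, col) pixels, using 8-connectivity."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                stack, pixels = [(r, c)], []
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny in range(y - 1, y + 2):
                        for nx in range(x - 1, x + 2):
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                objects.append(pixels)
    return objects

def centroid(pixels):
    """Unweighted mean position of a list of (row, col) pixels."""
    return (sum(p[0] for p in pixels) / len(pixels),
            sum(p[1] for p in pixels) / len(pixels))

def object_features(mask):
    objects = find_objects(mask)
    sizes = [len(obj) for obj in objects]
    # Assumption: center of fluorescence approximated by the centroid of all bright pixels.
    cy, cx = centroid([p for obj in objects for p in obj])
    distances = [math.hypot(oy - cy, ox - cx) for oy, ox in map(centroid, objects)]
    return {
        "object_count": len(objects),
        "size_variance": pvariance(sizes),
        "size_ratio": max(sizes) / min(sizes),
        "mean_distance_to_center": sum(distances) / len(distances),
    }

# Hypothetical thresholded image containing two objects.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
features = object_features(mask)
```

None of these values depends on where the cell sits in the frame or how it is rotated, which is the property the supervised learning approach asks of its features.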

29
Subcellular Location Features 2D

Haralick texture features
Zernike moment features
Morphological features
Geometric features
Edge features
Gabor wavelet features
Daubechies 4 wavelet features

30
Feature Reduction

Remove non-discriminative features
Remove redundant features
Combine features
Benefits
Speed
Accuracy
Multimedia indexing

31
Feature Reduction

Feature Recombination
PCA (Principal Component Analysis)
NLPCA (Nonlinear PCA)
KPCA (Kernel PCA)
ICA (Independent Component Analysis)
Feature Selection
Classification Tree (Gain ratio)
Fractal Dimensionality Reduction
Genetic Algorithm
Stepwise Discriminant Analysis

32
Feature Reduction Results
33
Classifier Supervised Learning

Neural Network
Support Vector Machine
Linear kernel
Polynomial kernel
Radial basis kernel
Exponential radial basis kernel
Ensemble Classifiers
AdaBoost
Bagging
Mixtures-of-Experts
Majority-voting classifier combining the above
classifiers

34
Majority-voting Classifier

Neural Network
Linear-kernel SVM
Exponential-rbf-kernel SVM
Polynomial-kernel SVM
AdaBoost
Pairwise-Classifier-Error Correlation
Coefficients
Mean 0.10, STD 0.07
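
A minimal sketch of the voting step, assuming each member classifier has already produced one label for the image. The tie-breaking rule (the earliest-listed classifier wins) and the example votes are assumptions for illustration; the talk does not say how ties were handled.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the classifier listed first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

# Hypothetical outputs of five member classifiers for one image.
votes = ["ER", "Mitochondria", "ER", "ER", "Golgi"]
winner = majority_vote(votes)
```

Voting of this kind tends to help most when the members make different mistakes, which is what the low pairwise error correlation suggests.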

35
2D Classification Results
Overall accuracy: 92.34%
36
Human Classification Results
Overall accuracy: 83%
37
Extending to 3D: Labeling approach

Total protein labeled with Cy5 reactive dye
DNA labeled with PI
Specific proteins labeled with primary Ab plus
Alexa488-conjugated secondary Ab

38
3D Image Set
Giantin
Nuclear
ER
Lysosomal
gpp130
Actin
Mitoch.
Nucleolar
Tubulin
Endosomal
39
Features to measure z asymmetry

2D features treated x and y equivalently
For 3D images, while it makes sense to treat x
and y equivalently (cells don't have a left and
right), z should be treated differently (top
and bottom are not the same)
We designed features to separate distance
measures into x-y component and z component

40
Classification Results for 3D images
Overall accuracy: 97%
41
How to do even better

Biologists interpreting images of protein
localization typically view many cells before
reaching a conclusion
Can simulate this by classifying sets of cells
from the same microscope slide

42
Classification of Sets of 3D Images
Set size: 9, Overall accuracy: 99.7%
43
Next: Experiment Interpretation

Classification results demonstrate the value of
the SLF feature sets for describing subcellular
patterns
The validation of the features suggests that they
can be used for other applications, such as
testing of hypotheses using image sets
Enabling concept: image similarity

44
Searching databases

Sequence databases allow search by similarity

The same is true for protein structure databases

GSNWLAMQLT
45
Basic Method for Sequence Comparison
M A T N W G S L L Q
M D T N P V S L L R
Similarity Matrix
5 -1 3 2 -9 4 2 1 1 -3
25.7
46
Extension to location?

Use SLF to find similar patterns

Database
47
Goal: Typical Image Selection

To develop automated methods for selecting a
representative image from a set of images
obtained by fluorescence microscopy

48
TypIC - Typical Image Chooser
Image Set
49
Approach

Calculate numerical features that contain
information about each image (just like when
classifying images)
Calculate the mean and covariance matrix for the
set (usually after automated elimination of
outliers)
Rank the images by their distance to the mean
(centroid) of the population (usually using
Mahalanobis distance, which weights according to
the covariance matrix)
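
The ranking step can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions rather than the TypIC code: outlier elimination is omitted, the covariance matrix is assumed to be non-singular, and the function names and the tiny two-feature data set are hypothetical.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows, mean):
    """Sample covariance matrix (divides by n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def rank_by_typicality(features):
    """Return image indices ordered from most to least typical."""
    mean = mean_vector(features)
    cov = covariance(features, mean)
    scored = []
    for index, row in enumerate(features):
        d = [x - m for x, m in zip(row, mean)]
        # Squared Mahalanobis distance: d^T * inverse(cov) * d.
        dist_sq = sum(di * wi for di, wi in zip(d, solve(cov, d)))
        scored.append((dist_sq, index))
    return [index for _, index in sorted(scored)]

# Hypothetical feature vectors for five images; the last one sits at the centroid.
features = [[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0], [2.0, 2.0]]
ranking = rank_by_typicality(features)
```

The first index in the ranking is the image closest to the centroid, which would be offered as the representative image for the set.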

50
Goal: Image Set Comparison

A common paradigm in molecular cell biology is to
compare the distribution of a protein with and
without the addition of a potential perturbing
agent (e.g., drug, overexpressed protein)
Such experiments are usually assayed by visual
examination
We have explored automating such comparisons

51
SImEC - Statistical Imaging Experiment Comparator
Image Set 2
Image Set 1
52
Method

Calculate feature matrix for each set of images
Compare feature matrices using a multivariate
hypothesis test called the Hotelling T² test
Test returns an F value that can be compared to a
critical value for a given confidence level
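
A sketch of the two-sample Hotelling T² computation, assuming a pooled covariance matrix that is non-singular and more images than features. It returns the F value and its degrees of freedom; looking up the critical value for a chosen confidence level is left to a statistics table or library. The function names and example data are hypothetical, not from SImEC.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the mean (unnormalized covariance)."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def hotelling_f(set1, set2):
    """Return (F value, (numerator df, denominator df)) for two feature matrices."""
    n1, n2, p = len(set1), len(set2), len(set1[0])
    m1, m2 = mean_vector(set1), mean_vector(set2)
    s1, s2 = scatter_matrix(set1, m1), scatter_matrix(set2, m2)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    t2 = (n1 * n2 / (n1 + n2)) * sum(d * w for d, w in zip(diff, solve(pooled, diff)))
    f_value = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    return f_value, (p, n1 + n2 - p - 1)

# Hypothetical two-feature image sets; set_b is set_a shifted along the first feature.
set_a = [[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0]]
set_b = [[3.0, 1.0], [5.0, 1.0], [3.0, 3.0], [5.0, 3.0]]
f_value, dof = hotelling_f(set_a, set_b)
```

Two sets are treated as different when the F value exceeds the critical value of the F distribution with the returned degrees of freedom.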

53
F values for comparison of all pairs of classes
using 65 features

95% confidence critical values are approximately
1.4 for all comparisons (depends on number of
images)

54
Comparison of two sets drawn randomly from the
same class

                                      TfR     Phal
Average F                             1.05    1.05
Critical F (0.95)                     1.63    1.61
Number of failing sets out of 1000    47      45

Expected result obtained: 95% of randomly drawn
sets are considered to be the same
55
Image Databasing

Our automated tools facilitate interpretation of
large numbers of images
Ideal for use with image databases
We therefore began building an image database
system in 1997 by first developing a database
schema to describe all aspects of fluorescence
microscope images
Fluorescence Microscope Annotation Schema (FMAS)
http://murphylab.web.cmu.edu/services/FMAS
For Cell Biology labs to store and analyze
fluorescence microscope images

56
Protein Subcellular Location Image Database

Implemented image database incorporating
Full annotation of experimental (FMAS)
SLF numerical features
Queries can be done by text or image content
Results can be fed to TypIC, SImEC, SLIC, SLIF

K. Huang, J. Lin, J.A. Gajnak, and R.F. Murphy
(2002). Image Content-based Retrieval and
Automated Interpretation of Fluorescence
Microscope Images via the Protein Subcellular
Location Image Database. Proceedings of the 2002
IEEE International Symposium on Biomedical
Imaging (ISBI 2002), pp. 325-328.
59
Clustering by Image Similarity

Ability to measure similarity of protein patterns
allows us, for the first time, to create a
systematic, objective framework for describing
subcellular locations
Ideal for database references
One way is by creating a Subcellular Location
Tree
Start tree with the two proteins whose patterns
are most similar, keep adding branches for less
and less similar patterns
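
The tree-building idea can be sketched as greedy agglomerative clustering: repeatedly join the two most similar groups until one tree remains. Average linkage over Euclidean distances, the nested-tuple output, and the protein names and feature values below are all assumptions made for illustration.

```python
import math

def build_tree(named_vectors):
    """Greedy agglomerative clustering; returns a nested tuple of names."""
    # Each cluster is (tree so far, list of member feature vectors).
    clusters = [(name, [vec]) for name, vec in named_vectors.items()]

    def average_distance(a, b):
        return sum(math.dist(u, v) for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: average_distance(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical mean feature vectors (two features each) for four proteins.
patterns = {
    "protein_a": [1.0, 1.0],
    "protein_b": [1.1, 1.0],
    "protein_c": [4.0, 4.0],
    "protein_d": [9.0, 1.0],
}
tree = build_tree(patterns)
```

Here the two nearly identical patterns are joined first, and the least similar pattern is attached last, at the root of the tree.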

60
Subcellular Location Tree for 10 classes in HeLa
cells
61
Location Proteomics

Tag all proteins in a cell line randomly
Examine many cells, each of which is expected to
express one tagged protein, using fluorescence
microscopy to determine the subcellular location
of that protein

62
Example images of randomly tagged clones

Glut1 gene (type 1 glucose transporter)
Tmpo gene (thymopoietin)
tuba1 gene (α-tubulin)
Cald gene (caldesmon 1)
Ncl gene (nucleolin)
Rps11 gene (ribosomal protein S11)
Hmga1 gene (high mobility group AT-hook 1)
Col1a2 gene (procollagen type I α2)
Atp5a1 gene (ATP synthase isoform 1)

63
Goal

Cluster 46 clones expressing different tagged
proteins based on their subcellular location
patterns

64
Outlier removal

Have 9 to 30 images per protein
Use Q test or t test on individual features to
remove outlier images
Calculate mean feature vector for each protein

65
Feature selection

Feature set optimization is NP-complete
Use Stepwise Discriminant Analysis
(backward/forward method) to rank features based
on their ability to distinguish proteins
Use increasing numbers of features to train
neural network classifiers and evaluate
classification accuracy over all 46 clones

66
Classifier accuracy
67
Tree building

Best performance using between 10 and 15 features
Calculate Euclidean distance matrix for best 10
features
Build SLT
Using classifier results to cut tree
Sort confusion matrix in order of tree
Group proteins with more than 25% confusion

68
Detailed view of some proteins
[Figure: classification result versus real classes]
69
Hmga1-1 (nucleus)
Hmga1-2 (nucleus)
Unknown-9 (nucleus)
Hmgn2-1 (nucleus)
Unknown-8 (nucleus)
Rpl32 (Nucleolus)
70
SLT with best 10 features
71
Conclusions: Location Proteomics

New frontier of automated cell biology opening
for analysis of large numbers of 2D through 5D
fluorescence microscope images
Random tagging provides tool for determining
patterns for many proteins
Can construct Subcellular Location Trees to
systematically represent knowledge about location
Can be built for many cell types and can reflect
dynamic properties of proteins (changes with
time, drugs, oncogenes, etc.)

72
Acknowledgments

Prof. Robert F. Murphy
Current grad students
Kai Huang
Xiang Chen
Yanhua Hu
Elvira Garcia Osuna
Ting Zhao
Juchang Hua
Former students
Meel Velliste
Michael Boland
Mia Markey
Gregory Porreca
Edward Roques
Jie Yao
Funding
NSF, NCI, Merck, Rockefeller Bros. Fund

Collaborators/Consultants
Simon Watkins
David Casasent
Tom Mitchell
Christos Faloutsos
Jon Jarvik
Peter Berget
