Title: Characterizing Gene Functional Expression Profiles
1Characterizing Gene Functional Expression Profiles
- Zoran Obradovic
- Slobodan Vucetic
- Hongbo Xie, Hao Sun, Pooja Hedge
- Information Science and Technology Center, Temple
University
2Outline
- Microarray Data Analysis Process
- Functional Expression Profile Analysis
- Functional Expression Profile Ranking
- Functional Expression Profile Clustering
- Functional Characterization of
- Plasmodium falciparum,
- Saccharomyces cerevisiae,
- Mus musculus and
- Homo sapiens
3What is a DNA Microarray?
DNA microarray technology allows measuring
expressions for tens of thousands of genes at a
time
Analysis of Replicated Experiments Gordon Smyth,
Walter and Eliza Hall Institute
4Scanning/Signal Detection
Cy3 channel
Cy5 channel
5Microarray Data Analysis Process
- Designing gene expression experiments
- Image processing and analysis
- Preprocessing raw intensity data
- Discovering differentially expressed genes
- Advanced analysis
- Finding relevant pathways
- Discovering gene expression patterns
- Understanding gene functions
- More information
- www.ist.temple.edu/research/biocore.html
6Designing Gene Expression Experiments
reference design
loop design
Design experiment
A saturated design
Comparative designing
http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
7Image Processing and Analysis (figure is
obtained using Imagene software)
8Preprocessing Raw Intensity Data
normalize
Analysis of Replicated Experiments Gordon Smyth,
Walter and Eliza Hall Institute
9Discovering Differentially Expressed Genes
- Fold change (log ratio)
- Statistical methods
- 1) t-test
- 2) ANOVA
- 3) Non-parametric analysis
- Wilcoxon rank-sum test
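As a rough illustration of the first two options above, the sketch below computes a log2 fold change and a Welch t statistic for one gene. The intensity values and function names are made up for the example, and only the statistic (not a p-value) is computed; a real analysis would use a statistics package and correct for multiple testing.

```python
import math
import statistics


def log2_fold_change(treated, control):
    """Log2 ratio of the mean intensities (treated over control)."""
    return math.log2(statistics.mean(treated) / statistics.mean(control))


def welch_t(a, b):
    """Welch's t statistic for two independent samples (no p-value)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical replicate intensities for a single gene.
treated = [820.0, 790.0, 805.0, 785.0]
control = [200.0, 210.0, 190.0, 200.0]

lfc = log2_fold_change(treated, control)
# The t statistic is computed on log-transformed intensities.
t_stat = welch_t([math.log2(x) for x in treated],
                 [math.log2(x) for x in control])
print(f"log2 fold change = {lfc:.2f}, Welch t = {t_stat:.2f}")
```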
10Advanced Analysis Finding Relevant Pathways
(figure is obtained using Ingenuity software)
11Advanced Analysis Discovering Gene Expression
Patterns
- Plasmodium falciparum intraerythrocytic developmental cycle
- Genes are sorted based on expression time peaks
- (Bozdech Z et al., PLoS Biol. 2003 Oct; 1(1))
12Advanced Analysis Identifying Unknown Gene
Functions Based on Expression Profiles
Is this alignment reliable?
- Standard practice
- Basic assumption: expression profiles of functionally related genes are correlated
- Objectives: confirm a specific biological hypothesis, predict functional properties of less characterized genes, or uncover new or unexpected biological knowledge
- Methodology: cluster genes based on similarity of their expression profiles, then perform functional analysis of the obtained clusters
Figure labels:
Gene 1 expression profile with function A
Gene 2 expression profile with function B
Unknown sequence tag: function?
Unknown sequence has high correlation with gene 1 expression profile
Sequence tag has function A
13Problems with old approaches
- Genes with the same function do not necessarily have the same expression profiles
- Clustering on the expression profiles of all genes could be unreliable
14Our Approach Analyzing Microarray Functional Expression Profiles (FEPs)
FEPs: compute the FEP as the average profile of all genes associated with a given highly correlated GO term
GO:0004721 phosphoprotein phosphatase activity
GO:0016311 dephosphorylation
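The averaging step can be sketched as follows. This is a minimal illustration, not the authors' tool: the gene names, the toy profiles, and the function name `functional_expression_profile` are hypothetical, and expression profiles are assumed to be equal-length lists of values over time.

```python
def functional_expression_profile(profiles, go_to_genes, go_term):
    """Average, time point by time point, the profiles of all genes
    annotated with go_term. Genes without a profile are skipped."""
    genes = [g for g in go_to_genes.get(go_term, []) if g in profiles]
    if not genes:
        raise ValueError(f"no profiled genes annotated with {go_term}")
    n_points = len(profiles[genes[0]])
    return [sum(profiles[g][t] for g in genes) / len(genes)
            for t in range(n_points)]


# Hypothetical data: three genes measured at three time points.
profiles = {
    "geneA": [0.0, 1.0, 2.0],
    "geneB": [1.0, 2.0, 3.0],
    "geneC": [5.0, 5.0, 5.0],
}
go_to_genes = {"GO:0016311": ["geneA", "geneB"]}

fep = functional_expression_profile(profiles, go_to_genes, "GO:0016311")
print(fep)
```

Only genes annotated with the term contribute, so geneC does not affect the result here.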
15Questions that we address
- How to perform functional analysis in an objective manner
- How to estimate the biological significance of discoveries
16Tools and Applications
- Developed tools to
- (1) Explore which functions have conserved expression profiles
- (Tool 1: functional expression profile ranking package)
- (2) Explore which functions have similar expression profiles and test their functional similarity
- (Tool 2: functional expression profile clustering package)
- Applications
- Functional characterization of gene expression in Plasmodium falciparum (intraerythrocytic developmental cycle), Saccharomyces cerevisiae, Mus musculus and Homo sapiens
17Tools Architecture
Microarray raw data
Report
List of significantly correlated GO terms
Clusters of functional expression profiles
Data preprocessing
Gene function annotation database
Functional expression profile ranking
Functional expression profile clustering
Gene Function Semantic Distance Mapping Space
18Tool 1 Functional Expression Profile (FEP)
Ranking Package
- Objective
- Identify genes with the same function that have correlated expression profiles
- Task
- Evaluate gene expression correlation within each FEP
- Methodology
- Step 1: calculate the average pairwise correlation coefficient S among the n gene expression profiles for a given function term
- Step 2: randomly select n genes from the whole dataset and compute the average pairwise correlation coefficient of this random set
- Step 3: repeat Step 2 m times (m > 10,000) and compare the distribution of the random-set values to the original S to evaluate the p-value
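The three steps above can be sketched as a small permutation test. This is an illustration under stated assumptions, not the package itself: Pearson correlation is assumed as the correlation coefficient, the p-value uses an add-one estimate, and the gene names, the toy data, and the function names are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def average_pairwise_correlation(profiles):
    """Step 1: mean correlation over all pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def fep_p_value(all_profiles, term_genes, m=10_000, seed=0):
    """Steps 2 and 3: compare S for the term with S of m random gene sets."""
    rng = random.Random(seed)
    s_obs = average_pairwise_correlation([all_profiles[g] for g in term_genes])
    gene_names = sorted(all_profiles)
    hits = 0
    for _ in range(m):
        sample = rng.sample(gene_names, len(term_genes))
        s_rand = average_pairwise_correlation([all_profiles[g] for g in sample])
        if s_rand >= s_obs:
            hits += 1
    # Add-one estimate so the p-value is never exactly zero (an assumption).
    return s_obs, (hits + 1) / (m + 1)


# Hypothetical data: 30 noise genes plus 3 co-expressed genes, 12 time points.
data_rng = random.Random(1)
times = range(12)
profiles = {f"noise{i}": [data_rng.gauss(0, 1) for _ in times] for i in range(30)}
for i in range(3):
    profiles[f"term{i}"] = [math.sin(t / 2) + data_rng.gauss(0, 0.05) for t in times]

s_obs, p_value = fep_p_value(profiles, ["term0", "term1", "term2"], m=2000)
print(f"S = {s_obs:.3f}, p-value = {p_value:.4f}")
```

The example uses m = 2000 only to keep the run short; the slide specifies more than 10,000 repetitions.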
19Dataset 1 Plasmodium Falciparum
Intraerythrocytic Developmental Cycle
- (Bozdech Z et al., (2003) PLoS Biol. Oct 1(1))
- Objective: identification of P. falciparum genes whose RNA levels vary periodically within the asexual intraerythrocytic developmental cycle (IDC) transcriptome
- Materials: 5080 ORFs, 3532 unique genes, 46 assays (sampled in time) using cDNAs
- Methods: permutation test with Fast Fourier Transform algorithm and correlations
- Found: 60% of genes transcriptionally active, and most genes active only once during the IDC
- Figure: major morphological stages during the IDC and transcriptional profiles of 2712 genes
20Dataset 2 Saccharomyces Cerevisiae Cell Cycle (Spellman et al., (1998) Molecular Biology of the Cell 9, 3273-3297)
- Objective: identification of yeast genes whose RNA levels vary periodically within the cell cycle process
- Materials: 6178 ORFs, 4450 unique genes, 77 assays (sampled in time) using cDNAs
- Methods: periodicity and correlation algorithm
- Found: identified 800 genes that meet an objective minimum criterion for cell cycle regulation
- Figure: the M/G1 clusters
21Dataset 3 Homo Sapiens Cell Cycle (R. Cho et al. (2001) Nature, 27)
- Objective: identification of human genes whose RNA levels vary periodically within the cell cycle process
- Materials: 6800 ORFs, 5795 unique genes, 14 assays (sampled in time) using Affymetrix arrays
- Methods: fold change
- Found: 700 genes that display transcriptional fluctuation with a periodicity consistent with that of the cell cycle
- Figure: clustering analysis of cell-cycle-regulated transcripts
22Dataset 4 Mus Musculus Cell Cycle (Ishida, S. et al. (2001) Mol. Cell. Biol. 21, 4684-4699)
- Objective: analysis of gene regulation during the mammalian cell cycle
- Materials: 6347 unique genes, 14 assays
- Methods: clustering
- Found: identified 7 distinct clusters of genes that exhibit unique patterns of expression
- Figure: patterns of gene expression following growth stimulation and during the mammalian cell cycle
23Applying FEP Ranking Package Cumulative
Distributions of GO Term p-Values of Human,
Yeast, Mouse and P. falciparum
24Applying FEP Ranking Package GO Terms with the
Most Conserved FEP Among Multi-organisms
25Applying FEP Ranking Package Selection of GO
Terms with Significantly Correlated Expression
Patterns at Plasmodium Falciparum Developmental
Cycle Data
Cumulative distribution of p-values for GO terms associated with at least two genes
GO:0016311 dephosphorylation
GO:0007028 cytoplasm organization and biosynthesis
46% of all function GO terms are significantly correlated
52% of all process GO terms are significantly correlated
Selected
26Plasmodium Falciparum Processes and Functions
with the Highest/Lowest Correlation
Highest correlation
Lowest correlation
27Plasmodium Falciparum Findings by FEP Ranking
Package
- Of 12 FEPs referenced by Bozdech et al., two have a p-value larger than 0.05.
- E.g. the average correlation coefficient among genes associated with the ribonucleotide synthesis function is only 0.258 (p-value 0.11), which weakens the claim that it is related to the ring stage of the IDC.
- No linear relationship was found between the number of genes associated with a given GO term and the average correlation coefficient among these genes
- Ranking GO terms by p-value could be useful for rapid identification of functions that are closely related to a specific developmental stage (of Plasmodium falciparum)
28All Datasets Findings by FEP Ranking Package
- To some extent, genes with identical functions have similar expression profiles
- However, a large fraction of functions do not follow the underlying hypothesis!
- Higher-level organisms seem to have a lower fraction of significantly correlated expression profiles for identical functions.
- Fractions of correlated FEPs
- Saccharomyces cerevisiae 59% (643/1,083)
- Plasmodium falciparum 48.4% (428/884)
- Homo sapiens 16.4% (249/1,514)
- Mus musculus 13.3% (182/1,366)
- Fractions are for both processes and functions
29Tool 2 FEP Clustering Package
- Objective
- Identify genes with similar functions and similar expression profiles
- Tasks
- Cluster FEPs selected by the FEP ranking package
- Evaluate found clusters for biological relevance by
- Identifying similar functions based on the GO term hierarchy tree structure
- Evaluating inter-cluster GO term distance
- Methodology
- Randomly generate k sets, each containing the same number of GO terms as the corresponding cluster
- Calculate the total GO term distance within each generated set and sum the total distances of all sets to get the overall score S
- Repeat the procedure 1000 times and compare the distribution of S to the overall distance obtained through clustering
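The methodology above can be sketched as follows. This is a hedged illustration: the term names, the toy distance function, and the function names are hypothetical, and the random sets are assumed to be a random partition of the clustered terms with the same sizes as the found clusters.

```python
import random
from itertools import combinations


def total_within_distance(terms, dist):
    """Sum of pairwise distances among the terms of one set."""
    return sum(dist(a, b) for a, b in combinations(terms, 2))


def overall_score(clusters, dist):
    """Overall score S: within-set distances summed over all sets."""
    return sum(total_within_distance(c, dist) for c in clusters)


def random_scores(all_terms, cluster_sizes, dist, repeats=1000, seed=0):
    """Score `repeats` random partitions with the given set sizes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = rng.sample(all_terms, len(all_terms))
        start, sets = 0, []
        for size in cluster_sizes:
            sets.append(shuffled[start:start + size])
            start += size
        scores.append(overall_score(sets, dist))
    return scores


def toy_distance(a, b):
    """Hypothetical distance: 1 within a family of terms, 4 across families."""
    return 1 if a[0] == b[0] else 4


found_clusters = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]]
all_terms = [t for cluster in found_clusters for t in cluster]

found = overall_score(found_clusters, toy_distance)
scores = random_scores(all_terms, [len(c) for c in found_clusters], toy_distance)
# Fraction of random partitions that are at least as compact as the found one.
fraction = sum(s <= found for s in scores) / len(scores)
print(found, min(scores), fraction)
```

In practice `dist` would be a distance defined on the GO tree rather than the toy function used here.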
30Structure of GO Term Tree (Example)
Level 1: GO:0008150 biological process
Level 2: GO:0007275 development; GO:0007582 physiological process
Level 3: GO:0007389 pattern specification; GO:0008152 metabolism
Level 4: GO:0000003 reproduction; GO:0009798 axis specification
Level 5: GO:0009948 anterior/posterior axis specification
- Measuring distance of GO terms
- Length of the minimal chain between terms X and Y in the GO tree
- Length of the maximal chain from the top to the bottom of the tree
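Both quantities can be computed from a parent map of the tree. The sketch below is illustrative only: the parent links are an assumption read off the example levels (the slide text does not state which term is the parent of which), the function names are hypothetical, and how the two lengths are combined into a single distance is not recoverable from the slide, so they are reported separately.

```python
def steps_to_ancestors(term, parent):
    """Map each ancestor of term (including term itself) to its step count."""
    steps, chain = 0, {}
    while term is not None:
        chain[term] = steps
        term = parent.get(term)
        steps += 1
    return chain


def minimal_chain_length(x, y, parent):
    """Number of edges on the path between x and y through their
    lowest common ancestor."""
    up_from_x = steps_to_ancestors(x, parent)
    node, steps = y, 0
    while node is not None:
        if node in up_from_x:
            return up_from_x[node] + steps
        node = parent.get(node)
        steps += 1
    raise ValueError("terms share no common ancestor")


def maximal_chain_length(parent):
    """Number of edges on the longest chain from the root to a leaf."""
    return max(len(steps_to_ancestors(t, parent)) - 1 for t in parent)


# Assumed parent links for the example tree (child -> parent).
parent = {
    "GO:0008150": None,          # biological process (root)
    "GO:0007275": "GO:0008150",  # development
    "GO:0007582": "GO:0008150",  # physiological process
    "GO:0007389": "GO:0007275",  # pattern specification
    "GO:0008152": "GO:0007582",  # metabolism
    "GO:0009798": "GO:0007389",  # axis specification
    "GO:0009948": "GO:0009798",  # anterior/posterior axis specification
}

d_min = minimal_chain_length("GO:0009948", "GO:0008152", parent)
d_max = maximal_chain_length(parent)
print(d_min, d_max)
```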
31Determination of Number of Clusters
- Measured
- Larger z-score indicates a better grouping of
functions within clusters.
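One way to turn the comparison with random groupings into a z-score is sketched below. The exact measure on the slide is not recoverable, so the definition here is an assumption: the score of the found clusters is compared with the mean and standard deviation of the random scores, with the sign chosen so that a larger z-score means more compact found clusters. All numbers and names are hypothetical.

```python
import statistics


def cluster_z_score(found_score, random_scores):
    """Assumed z-score: how many standard deviations the found score
    lies below the mean score of the random groupings."""
    mu = statistics.mean(random_scores)
    sigma = statistics.stdev(random_scores)
    return (mu - found_score) / sigma


# Hypothetical scores for one choice of the number of clusters.
z = cluster_z_score(12.0, [20.0, 22.0, 24.0, 26.0, 28.0])

# Hypothetical z-scores for several choices of k; pick the largest.
z_by_k = {2: 5.1, 3: 7.4, 4: 12.0, 5: 9.8}
best_k = max(z_by_k, key=z_by_k.get)
print(f"z = {z:.3f}, best k = {best_k}")
```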
32Number of Clusters vs Z-score Results for
Plasmodium Falciparum
Plasmodium Falciparum biological processes number
of clusters vs z-scores
Plasmodium Falciparum molecular function number
of clusters vs z-scores
33Applying FEP Clustering Package Results on
Plasmodium Falciparum Processes
k-means clustering profiles of FEPs for 238 identified processes (clusters 1-4)
Cluster vs stage of IDC
34Applying FEP Clustering Package Results on
Plasmodium Falciparum Functions
k-means clustering profiles of FEPs for 199 identified molecular functions (clusters 1-4)
Cluster vs stage of IDC
35GO Trees of Functions 4 Clusters of Plasmodium
Falciparum
36Statistical Evaluation Found vs. Random Clusters for P. Falciparum
Figure panels: Biological Processes; Molecular Functions (each annotated with "found clusters")
- The distance from the found clusters to the random clusters is larger for biological processes.
- Random clusters for biological processes have smaller variance.
37Statistical Evaluation Clustering All GO Terms
for P. Falciparum
- Clustering all GO terms leads to a smaller z-score, which means lower-quality clusters
- The figure on the right is the P. falciparum functional clustering result: the z-score is 8.5, compared to 12 for clustering correlated GO terms only
found clusters
38Statistical Evaluation Found vs. Random Clusters for S. Cerevisiae and Homo Sapiens
Figure panels: Yeast processes; Human processes; Yeast functions; Human functions (each annotated with "found clusters")
39Remarks
- Statistical significance of identified clusters (separation between clusters and random groupings) is increased by
- Normalizing data (Plasmodium falciparum)
- Eliminating noise through singular value decomposition (SVD)
- Reducing data through Principal Components Analysis
40Conclusions
- Proposed microarray tools help identify
- genes with the same function and correlated expression profiles
- genes with similar functions that have similar expression profiles
- Measuring GO-tree-based distance was useful for evaluating the biological relevance of clusters; however,
- many GO terms have only 1 associated gene
- many genes do not even have a GO term
- parenthood and sibling relationships in GO trees should be differentiated, with a smaller penalty for a sibling relationship than for parenthood
- More robust clustering methods could be used
41Thank You!
More information: www.ist.temple.edu/research/biocore.html
Contact: Zoran Obradovic, director, IST Center, Temple University
215 204-6265, zoran_at_ist.temple.edu