Title: Unifying measures of gene function and evolution
1Unifying measures of gene function and evolution
Eugene V. Koonin, National Center for
Biotechnology Information, NIH, Bethesda
"Nothing in (systems) biology makes sense except
in the light of evolution" (after Theodosius
Dobzhansky, 1970)
Wolf, Carmel, Koonin, Proc. Roy Soc. B, in press
2Systems Biology and Evolution
With the advent of OMICS data
The game of correlations began
3Evolutionary systems biology
- In principle, we address the classical problem: the relationship between the (largely neutral?) evolution of the genome and the (largely adaptive) evolution of the phenotype.
- In practice, the progress of genomics and other OMICS allows us to measure, on a whole-genome scale, the effects of all kinds of molecular phenotypic characteristics (expression level, protein-protein interactions, etc.) on evolutionary rates; this typically yields weak, even if significant, correlations.
- Can we synthesize these measurements to produce a coherent picture of the links between phenomic and genomic evolution?
4The Cautionary Tale
"It was six men of Indostan / To learning much
inclined, / Who went to see the Elephant / (Though
all of them were blind), / That each by
observation / Might satisfy his mind" (J.G.
Saxe)
5The Cautionary Tail
"each was partly in the right / And all were in
the wrong" (J.G. Saxe)
6Different Faces of the Hypercube?
Pairwise correlations
Synthesis
7Analysis of Multidimensional Data
8Analysis of Multidimensional Data
9Analysis of Multidimensional Data
[Figure: axes PC1, PC2, PC3]
Principal Components Analysis (PCA) introduces a
new orthogonal coordinate system where axes are
ranked by the fraction of original variance
accounted for.
10PCA
- PCA takes a set of variables and defines new variables that are linear combinations of the initial variables.
- PCA expects the variables you enter to be correlated (as is the case in the correlation game of Systems Biology).
- PCA returns new, uncorrelated variables, the principal components or axes, that summarize the information contained in the original full set of variables.
- PCA does not test any hypotheses or predict values for dependent variables; it is more of an exploratory technique.
- The data entered represent a cloud of points in n-space.
- The cloud is, typically, longer in one direction than another, and that longest dimension is where the points are the most different; that's where PCA draws a line called the first principal component.
- The first principal component is guaranteed to be the line that places your sample points the farthest apart from each other; in that way, PCA "extracts the most variance" from your data. This process is repeated to get multiple components, or axes.
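The description above can be sketched in a few lines of NumPy. This is a minimal illustration, not the analysis code behind these slides: it assumes the variables are standardized first (so the PCA operates on the correlation matrix), and the toy data and the function name `pca` are made up for the example.

```python
import numpy as np

def pca(data):
    """PCA of an (n_samples, n_variables) array via its correlation matrix."""
    # Standardize each variable so that differing units do not dominate.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]          # rank axes by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()        # fraction of variance per axis
    scores = z @ eigvecs                       # coordinates in the new system
    return explained, eigvecs, scores

# Toy data (hypothetical): three variables, the first two correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x,
                        0.8 * x + 0.6 * rng.normal(size=500),
                        rng.normal(size=500)])
explained, loadings, scores = pca(data)
```

Each column of `loadings` gives one component as a linear combination of the original variables, `explained` ranks the components by the fraction of variance they account for, and the columns of `scores` are mutually uncorrelated.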
11The Data Set KOGs
- Ideally, we would like to obtain and synthesize the data on individual genes in precise space-time coordinates (e.g., instant evolutionary rates).
- However:
  - some of the variables are not easily measurable (if defined at all) for genes in extant species, e.g., rate of evolution;
  - other variables are measurable in principle but, in practice, are available only for a few species, e.g., expression level;
  - much of the data are inherently noisy, either due to technical problems or true biological variation, e.g., fitness effect of gene disruption.
- Thus, we analyze orthologous protein sets, using the proteins from different species to derive complementary data and smooth out variations in others.
- Practically, this means using the KOG dataset (with additions): 10058 KOGs from 15 species (Koonin et al. 2004, Genome Biol).
12The Data Set KOGs
Original KOGs for some species, "index orthologs"
for others. 10058 KOGs altogether.
13Variables Gene Loss
Propensity for Gene Loss (PGL), introduced by
Krylov et al. (Genome Res. 13, 2229-2235, 2003).
Computed from the KOG phyletic pattern. Originally
an empirical measure (Dollo parsimony
reconstruction of events; ratio of branch
lengths). In this work, it is computed with an
Expectation Maximization algorithm.
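To make the Dollo parsimony step concrete, here is a minimal sketch that counts inferred loss events for one phyletic pattern. It is only the event-reconstruction idea: it does not compute the PGL ratio itself, and it is not the Expectation Maximization procedure used in this work. The tree, the species labels, and the function names are hypothetical.

```python
def leaves(node):
    """Set of leaf names under a node (a leaf is a string, a clade a tuple)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))

def dollo_losses(tree, present):
    """Count loss events under Dollo parsimony (a single gain, losses only)."""
    present = set(present) & leaves(tree)
    if not present:
        return 0
    # The gene is gained once, at the last common ancestor of all carriers.
    node = tree
    while not isinstance(node, str):
        carriers = [c for c in node if leaves(c) & present]
        if len(carriers) != 1:
            break
        node = carriers[0]

    def count(n):
        if not (leaves(n) & present):
            return 1            # whole subtree lacks the gene: one loss
        if isinstance(n, str):
            return 0
        return sum(count(child) for child in n)

    return count(node)

# Hypothetical five-species tree and phyletic patterns.
toy_tree = ((("A", "B"), "C"), ("D", "E"))
losses_acd = dollo_losses(toy_tree, {"A", "C", "D"})   # losses in B and E
losses_ab = dollo_losses(toy_tree, {"A", "B"})         # gained in (A, B)
```

A subtree in which no species carries the gene counts as a single loss on the branch leading to it, which is why a pattern present only in A and D needs three losses (B, C, E) rather than more.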
14Variables Gene Duplication
Number of Paralogs: the average number observed for a
given KOG. Examples: KOG0417 (Ubiquitin-protein
ligase) and KOG0424 (Ubiquitin-protein ligase).
At1g16890 At1g36340 At1g64230 At1g78870 At2g16740 At2g32790 At3g08690 At3g08700 At3g13550 At4g27960 At5g25760 At5g41700 At5g53300 At5g56150 CE03482 CE09712 CE10824 CE28997 7292764 7292948 7295708_2 7296089 7297757 7298165 7299919 Hs17476541 Hs22043797 Hs22054779 Hs22064361 Hs4507773 Hs4507775 Hs4507777 Hs4507779 Hs4507793 Hs5454146 Hs7661808 Hs8393719 YBR082c YDR059c YDR092w YGR133w SPAC11E3.04c SPAC1250.03 SPBC119.02 SPBC1198.09 ECU10g0940 ECU11g1990
At3g57870 CE01332 CE09784 7296195 Hs4507785 YDL064w SPAC30D11.13 ECU01g0940
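As a worked example of the averaging, the sketch below tallies the two member lists by species. Assigning IDs to species by prefix (At, CE, Hs, numeric IDs for fly, and so on) and the short species labels are our reading of the lists, and averaging only over species in which the KOG is present is an assumption; the slides do not state how absent species are treated.

```python
from statistics import mean

# Genes per species, tallied from the two member lists above.
# Species labels and the prefix-based assignment are assumptions.
kog0417 = {"At": 14, "CE": 4, "Dm": 7, "Hs": 12, "Sc": 4, "Sp": 4, "Ecu": 2}
kog0424 = {"At": 1, "CE": 2, "Dm": 1, "Hs": 1, "Sc": 1, "Sp": 1, "Ecu": 1}

def number_of_paralogs(genes_per_species):
    """Average number of members per species in which the KOG is present."""
    counts = [n for n in genes_per_species.values() if n > 0]
    return mean(counts)

np_0417 = number_of_paralogs(kog0417)   # 47 genes over 7 species
np_0424 = number_of_paralogs(kog0424)   # 8 genes over 7 species
```

Under these assumptions the two ubiquitin-protein ligase KOGs differ roughly sixfold in their number of paralogs despite sharing an annotation.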
15Variables Evolution Rate
- Select a taxon.
- Build an alignment (MUSCLE).
- Compute the distance matrix (PAML).
- Select the minimum distance between members of the two subtrees of the group.
Example: Ascomycota (Sordariomycetes vs. Yeasts)
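The last step, taking the minimum distance between members of the two subtrees, can be sketched as follows. The distance matrix here is a hand-made toy (in practice it would come from the alignment and distance steps), and the sequence labels and the function name are hypothetical.

```python
import numpy as np

def min_between_subtrees(dist, labels, subtree_a, subtree_b):
    """Smallest pairwise distance between members of two subtrees."""
    index = {name: i for i, name in enumerate(labels)}
    rows = [index[n] for n in subtree_a if n in index]
    cols = [index[n] for n in subtree_b if n in index]
    if not rows or not cols:
        return None   # rate undefined: one subtree has no member in this KOG
    return float(dist[np.ix_(rows, cols)].min())

# Toy symmetric distance matrix for four hypothetical sequences.
labels = ["sord1", "sord2", "yeast1", "yeast2"]
dist = np.array([
    [0.0, 0.2, 1.1, 0.9],
    [0.2, 0.0, 1.3, 1.0],
    [1.1, 1.3, 0.0, 0.4],
    [0.9, 1.0, 0.4, 0.0],
])
rate = min_between_subtrees(dist, labels,
                            ["sord1", "sord2"], ["yeast1", "yeast2"])
```

Returning `None` when one subtree is empty mirrors the missing entries that appear in the evolution rate data for some KOGs.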
16Variables Expression Level
Expression Level data for S. cerevisiae, D.
melanogaster and H. sapiens were downloaded from
UCSC Table Browser (hgFixed).

Organism  Table              No. exp.  No. prob.  No. KOGs
Sacce     yeastChoCellCycle        17       6602      3030
Drome     arbFlyLifeAll           162       4921      2617
Homsa     gnfHumanAtlas2All       158      10197      3872

Standardized (μ = 0, σ = 1) log values; the median
expression level among paralogs was used to
represent a KOG.
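A minimal sketch of this summary step, assuming that the standardization means rescaling the log values to zero mean and unit standard deviation within one organism. Gene names, expression values, KOG identifiers, and function names are all hypothetical.

```python
import math
from statistics import mean, median, stdev

def standardize_log(values):
    """Log-transform expression values and rescale to mean 0, sd 1."""
    logs = {gene: math.log(v) for gene, v in values.items()}
    mu, sigma = mean(logs.values()), stdev(logs.values())
    return {gene: (x - mu) / sigma for gene, x in logs.items()}

def kog_expression(z_scores, kog_members):
    """Median standardized expression among the paralogs of each KOG."""
    result = {}
    for kog, genes in kog_members.items():
        observed = [z_scores[g] for g in genes if g in z_scores]
        if observed:              # KOGs without any measurement stay missing
            result[kog] = median(observed)
    return result

# Hypothetical expression values for four genes of one organism.
expression = {"g1": 10.0, "g2": 100.0, "g3": 1000.0, "g4": 100.0}
z = standardize_log(expression)
per_kog = kog_expression(z, {"KOG_A": ["g1", "g2"],
                             "KOG_B": ["g3"],
                             "KOG_C": ["g9"]})
```

Using the median rather than the mean keeps a single atypical paralog from dominating the value assigned to the KOG.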
17Variables Interactions
Protein-protein and genetic interaction (PPI and
GI) data for S. cerevisiae, C. elegans and D.
melanogaster were downloaded from the GRID Web
site. The median number of interaction partners
among paralogs was used to represent a KOG.
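The partner count per KOG can be sketched from a list of interacting pairs. Two choices below are assumptions made for the example rather than details given in the slides: self-interactions are ignored, and a gene absent from the interaction list is counted as having zero partners. Gene and KOG names are hypothetical.

```python
from collections import defaultdict
from statistics import median

def partner_counts(pairs):
    """Number of distinct interaction partners per gene."""
    partners = defaultdict(set)
    for a, b in pairs:
        if a != b:                       # assumption: skip self-interactions
            partners[a].add(b)
            partners[b].add(a)
    return {gene: len(p) for gene, p in partners.items()}

def kog_interactions(counts, kog_members):
    """Median partner count among the paralogs of each KOG."""
    return {kog: median(counts.get(g, 0) for g in genes)
            for kog, genes in kog_members.items() if genes}

# Hypothetical interaction list; ("b", "a") duplicates ("a", "b").
pairs = [("a", "b"), ("a", "c"), ("b", "a"), ("d", "d")]
counts = partner_counts(pairs)
per_kog = kog_interactions(counts, {"K1": ["a", "b"], "K2": ["c", "d"]})
```

Storing partners in sets means that a pair reported twice, or in both orientations, is counted once.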
18Variables Lethality
Lethality of gene knockout: data for S. cerevisiae
were downloaded from the MIPS FTP site (0/1
values). Embryonic lethality of RNA interference
(RNAi): data for C. elegans were taken from Kamath et
al., 2003 (0/1 values).
19Missing Data
Total: 38 variables in 10058 KOGs; lots of missing data.
Complete data (all 38 variables available): 23 KOGs; too few.
Combined data: 7 variables; 1482 KOGs with complete data; 4124 with at most one missing point; 3912 KOGs after removal of outliers.
Example: evolution rate.

KOG      At.Os  Sc.Ca  Mg.Nc  Hs.Mm  Pl.MF
KOG0009  -      0.168  0.300  -      0.405
KOG0010  0.671  1.252  0.606  0.087  1.492
KOG0011  0.905  1.698  0.428  0.073  1.547
KOG0012  -      2.238  0.665  0.244  -
KOG0013  0.355  -      -      0.014  1.343
KOG0014  1.913  4.041  -      0.126  2.840
KOG0015  -      2.286  0.400  0.027  -
KOG0016  -      -      0.506  0.380  -
         0.667  1.864  0.521  0.075  1.910

KOG      At.Os  Sc.Ca  Mg.Nc  Hs.Mm  Pl.MF  Average
KOG0009  -      0.090  0.575  -      0.212  0.293
KOG0010  1.006  0.672  1.162  1.166  0.781  0.957
KOG0011  1.358  0.911  0.821  0.984  0.810  0.977
KOG0012  -      1.201  1.275  3.275  -      1.917
KOG0013  0.532  -      -      0.181  0.703  0.472
KOG0014  2.869  2.168  -      1.692  1.487  2.054
KOG0015  -      1.227  0.767  0.365  -      0.786
KOG0016  -      -      0.970  5.087  -      3.028
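The numbers in this example are consistent with a simple two-step combination: each value is divided by a per-column reference (the unlabeled row of five values), and the rescaled values available for a KOG are then averaged. The sketch below reproduces that arithmetic for three of the KOGs. What the reference row represents (for instance a dataset-wide mean or median) is not stated, so the name `scale` is only a placeholder.

```python
import numpy as np

nan = np.nan
columns = ["At.Os", "Sc.Ca", "Mg.Nc", "Hs.Mm", "Pl.MF"]

# Evolution rates copied from the example; nan marks a missing value.
raw = {
    "KOG0009": [nan, 0.168, 0.300, nan, 0.405],
    "KOG0010": [0.671, 1.252, 0.606, 0.087, 1.492],
    "KOG0016": [nan, nan, 0.506, 0.380, nan],
}
# Per-column reference values (the unlabeled row of the example).
scale = np.array([0.667, 1.864, 0.521, 0.075, 1.910])

def combined_rate(row, scale):
    """Rescale each column, then average over the values that are present."""
    normalized = np.asarray(row, dtype=float) / scale
    return float(np.nanmean(normalized))

combined = {kog: combined_rate(row, scale) for kog, row in raw.items()}
```

Small differences from the printed averages (for example about 3.02 instead of 3.028 for KOG0016) are expected, because the reference values shown are rounded to three decimals.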
20Variables
- Phenotypic
  - EL: expression level
  - PPI: protein-protein interactions
  - GI: genetic interactions
  - KE: knockout effect
  - NP: number of paralogs
- Evolutionary
  - ER: (sequence) evolution rate
  - PGL: propensity for gene loss
21The correlations
      NP      PPI     GI      PGL     ER      EL      KE
NP    -
PPI    0.057  -
GI     0.060   0.034  -
PGL    0.000  -0.125  -0.019  -
ER    -0.070  -0.200   0.034   0.141  -
EL     0.129   0.199  -0.050  -0.099  -0.277  -
KE     0.027   0.234  -0.048  -0.181  -0.155   0.188  -
22Two Tiers of Variables
Observation on the pattern of pairwise
relationships in the data: "phenotypic" and
"evolutionary" variables behave differently.
25PCA of the Data Space
        PC.1   PC.2   PC.3
NP      0.17   0.69   0.44
PPI     0.46   0     -0.17
GI      0      0.67  -0.54
PGL    -0.33   0      0.51
ER     -0.47   0     -0.20
EL      0.48   0      0.36
KE      0.45  -0.27  -0.21
--------------------------
% var.  25.0   15.3   14.5
26PCA of the Data Space
[Figure: axes PC1, PC2]
27PCA of the Data Space
[Figure: axes PC2, PC3]
28PC1 "Gene status"
[Figure: axes PC1, PC2; PC1 labeled "important" and "accessory"]
29PC2 "Adaptability"
[Figure: axes PC1, PC2; PC2 labeled "flexible" and "rigid"]
30PC2 and Expression Profile Skew
31PC3 "Reactivity"
[Figure: axes PC2, PC3]
32PC3 and Expression Profile Skew
33Relationships Between Variables
34Status and Adaptability of Genes
Classification of KOGs into 4 major categories
35Status and Adaptability of Genes
Classification of KOGs into 4 major categories: INF, CELL, MET, UNKN
[Figure: axes Status, Adaptability, Reactivity]
36Status and Adaptability of Genes
Cytoplasmic and Mitochondrial ribosomal proteins
37Status and Adaptability of Genes
Vacuolar ATPase and Vacuolar Sorting proteins
38Status and Adaptability of Genes
Replication Licensing Complex and Histones
39Status and Adaptability of Genes
Core Cluster (spliceosome and mRNA
cleavage-polyadenylation complex)
RNA processing and modification
40Adaptability and Reactivity of Genes
carbohydrate transport and metabolism
translation and ribosome
replication, RNA processing and modification
signal transduction
42Conclusions
- Three composite, independent variables, "status", "adaptability" and "reactivity", dominate the multidimensional data space of quantitative genomics.
- The notion of status provides biologically relevant null hypotheses regarding the connections between various measures.
- Breaks in the pattern possibly indicate something nontrivial (targets for further investigation).
- Functional groups of genes show distinctive patterns of status, adaptability, and reactivity.
43Co-Authors
Liran Carmel
Eugene Koonin
Yuri Wolf