Unifying measures of gene function and evolution

Slides: 44

Provided by: Jord161

Learn more at: http://archive.dimacs.rutgers.edu

Transcript and Presenter's Notes

1
Unifying measures of gene function and evolution
Eugene V. Koonin, National Center for
Biotechnology Information, NIH, Bethesda
"Nothing in (systems) biology makes sense except
in the light of evolution" (after Theodosius
Dobzhansky, 1970)
Wolf, Carmel, Koonin, Proc. Roy. Soc. B, in press
2
Systems Biology and Evolution
With the advent of OMICS data
The game of correlations began
3
Evolutionary systems biology

In principle, we address the classical problem:
the relationship between the (largely neutral?)
evolution of the genome and the (largely
adaptive) evolution of the phenotype.
In practice, the progress of genomics and other
OMICS allows us to measure, on a whole-genome
scale, the effects of all kinds of molecular
phenotypic characteristics (expression level,
protein-protein interactions, etc.) on
evolutionary rates; this typically yields weak,
even if significant, correlations.
Can we synthesize these measurements to produce
a coherent picture of the links between
phenomic and
genomic evolution?

4
The Cautionary Tale
"It was six men of Indostan / To learning much
inclined, / Who went to see the Elephant / (Though
all of them were blind), / That each by
observation / Might satisfy his mind" (J.G.
Saxe)
5
The Cautionary Tail
"each was partly in the right / And all were in
the wrong" (J.G. Saxe)
6
Different Faces of the Hypercube?
Pairwise correlations
Synthesis
7
Analysis of Multidimensional Data
8
Analysis of Multidimensional Data
9
Analysis of Multidimensional Data
[Figure: axes PC1, PC2, PC3]
Principal Components Analysis (PCA) introduces a
new orthogonal coordinate system where axes are
ranked by the fraction of original variance
accounted for.
10
PCA

PCA takes a set of variables and defines new
variables that are linear combinations of the
initial variables.
PCA expects the variables you enter to be
correlated
(as is the case in the correlation game of
Systems Biology).
PCA returns new, uncorrelated variables, the
principal components or axes, that summarize the
information contained in the original full set of
variables.
PCA does not test any hypotheses or predict
values for dependent variables; it is more of an
exploratory technique.
The data entered represent a cloud of points in
n-space.
The cloud is, typically, longer in one direction
than another, and that longest dimension is where
the points are the most different; that is where
PCA draws a line called the first principal
component.
The first principal component is guaranteed to be
the line that places your sample points the
farthest apart from each other; in that way, PCA
"extracts the most variance" from your data. This
process is repeated to get multiple components,
or axes, each orthogonal to those before it.
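The description of PCA above can be illustrated with a minimal sketch for a two-dimensional cloud. This is not the analysis from the talk: the points are hypothetical, and the closed-form angle of the major axis is used only because it is available for a 2x2 covariance matrix.

```python
import math

def pca_2d(points):
    """Return (pc1, pc2, var1, var2) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed form for a symmetric 2x2 matrix: angle of the major axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))  # orthogonal to pc1

    def variance_along(axis):
        scores = [x * axis[0] + y * axis[1] for x, y in centered]
        return sum(s * s for s in scores) / (n - 1)

    return pc1, pc2, variance_along(pc1), variance_along(pc2)

# Hypothetical cloud stretched roughly along the diagonal.
cloud = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
pc1, pc2, var1, var2 = pca_2d(cloud)
print("PC1 direction:", pc1)
print("fraction of variance on PC1:", var1 / (var1 + var2))
```

The two returned axes are orthogonal by construction, and the variance along PC1 is never smaller than the variance along PC2, which is the ranking of axes described above.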

11
The Data Set: KOGs

Ideally, we would like to obtain and synthesize
the data on individual genes in precise
space-time coordinates (e.g., instant
evolutionary rates).
However:
some of the variables are not easily measurable
(if defined at all) for genes in extant species
(e.g., rate of evolution);
other variables are measurable in principle but,
in practice, are
available only for a few species (e.g.,
expression level);
much of the data are inherently noisy, either due
to technical problems or true biological
variation (e.g., fitness effect of gene
disruption).
Thus, we analyze orthologous protein sets, using
the proteins from different species to derive
complementary data and smooth out variations in
others.
Practically, this means using the KOG dataset
(with additions): 10058 KOGs from 15 species
(Koonin et al. 2004, Genome Biol).

12
The Data Set: KOGs
Original KOGs for some species, "index orthologs"
for others. 10058 KOGs altogether.
13
Variables: Gene Loss
Propensity for Gene Loss (PGL), introduced by
Krylov et al. (Genome Res. 13, 2229-2235, 2003).
Computed from the KOG phyletic pattern. Originally
an empirical measure (Dollo parsimony
reconstruction of events; ratio of branch
lengths). This work employs an Expectation
Maximization algorithm.
14
Variables: Gene Duplication
Number of Paralogs: average number observed for a
given KOG. Example: KOG0417 (Ubiquitin-protein
ligase) and KOG0424 (Ubiquitin-protein ligase).
KOG0417: At1g16890 At1g36340 At1g64230 At1g78870 At2g16740 At2g32790 At3g08690 At3g08700 At3g13550 At4g27960 At5g25760 At5g41700 At5g53300 At5g56150 CE03482 CE09712 CE10824 CE28997 7292764 7292948 7295708_2 7296089 7297757 7298165 7299919 Hs17476541 Hs22043797 Hs22054779 Hs22064361 Hs4507773 Hs4507775 Hs4507777 Hs4507779 Hs4507793 Hs5454146 Hs7661808 Hs8393719 YBR082c YDR059c YDR092w YGR133w SPAC11E3.04c SPAC1250.03 SPBC119.02 SPBC1198.09 ECU10g0940 ECU11g1990
KOG0424: At3g57870 CE01332 CE09784 7296195 Hs4507785 YDL064w SPAC30D11.13 ECU01g0940
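One way to read "average number observed for a given KOG" is the mean number of members per species in which the KOG is present. The sketch below computes that for the two member lists on this slide. Three things are assumptions, not statements from the talk: the prefix-to-species mapping, the assignment of the long list to KOG0417 and the short one to KOG0424, and the averaging over species present.

```python
import re
from collections import Counter

# Assumed mapping from identifier prefix to species (illustrative only).
PREFIX_RULES = [
    (re.compile(r"^At\d"), "A. thaliana"),
    (re.compile(r"^CE\d"), "C. elegans"),
    (re.compile(r"^Hs\d"), "H. sapiens"),
    (re.compile(r"^SP[A-Z]"), "S. pombe"),
    (re.compile(r"^ECU\d"), "E. cuniculi"),
    (re.compile(r"^Y[A-P][LR]\d"), "S. cerevisiae"),
    (re.compile(r"^\d"), "D. melanogaster"),
]

def species_of(gene_id):
    """Classify a gene identifier by its prefix; fail loudly if unknown."""
    for pattern, species in PREFIX_RULES:
        if pattern.match(gene_id):
            return species
    raise ValueError(f"unrecognized identifier: {gene_id}")

def mean_paralogs(members):
    """Mean number of members per species, over species that have any."""
    counts = Counter(species_of(g) for g in members)
    return sum(counts.values()) / len(counts)

# Member lists copied from the slide; the KOG labels are an assumption.
kog0417 = """At1g16890 At1g36340 At1g64230 At1g78870 At2g16740 At2g32790
At3g08690 At3g08700 At3g13550 At4g27960 At5g25760 At5g41700 At5g53300
At5g56150 CE03482 CE09712 CE10824 CE28997 7292764 7292948 7295708_2
7296089 7297757 7298165 7299919 Hs17476541 Hs22043797 Hs22054779
Hs22064361 Hs4507773 Hs4507775 Hs4507777 Hs4507779 Hs4507793 Hs5454146
Hs7661808 Hs8393719 YBR082c YDR059c YDR092w YGR133w SPAC11E3.04c
SPAC1250.03 SPBC119.02 SPBC1198.09 ECU10g0940 ECU11g1990""".split()
kog0424 = """At3g57870 CE01332 CE09784 7296195 Hs4507785 YDL064w
SPAC30D11.13 ECU01g0940""".split()

print("long list :", mean_paralogs(kog0417))
print("short list:", mean_paralogs(kog0424))
```

The contrast is the point of the example: two KOGs with the same annotation differ widely in how many members they have per species.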
15
Variables: Evolution Rate
Select a taxon
Build an alignment (MUSCLE)
Compute distance matrix (PAML)
Select minimum distance between members of the two
subtrees of the group.
Ascomycota: Sordariomycetes vs. Yeasts
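The last step of the procedure above, picking the minimum distance between members of the two subtrees, can be sketched as follows. The alignment and distance-matrix steps (MUSCLE, PAML) are not shown. The sequence labels and distance values are hypothetical placeholders, not data from the talk.

```python
from itertools import product

def min_cross_distance(distances, subtree_a, subtree_b):
    """Smallest pairwise distance between a member of each subtree."""
    return min(distances[frozenset(pair)]
               for pair in product(subtree_a, subtree_b))

# Hypothetical labels and distances for one orthologous group.
sordariomycetes = ["sordario_1", "sordario_2"]
yeasts = ["yeast_1", "yeast_2"]
distances = {
    frozenset(("sordario_1", "yeast_1")): 1.31,
    frozenset(("sordario_1", "yeast_2")): 1.12,
    frozenset(("sordario_2", "yeast_1")): 0.97,
    frozenset(("sordario_2", "yeast_2")): 1.05,
    frozenset(("sordario_1", "sordario_2")): 0.40,  # within-subtree, ignored
    frozenset(("yeast_1", "yeast_2")): 0.55,        # within-subtree, ignored
}

rate = min_cross_distance(distances, sordariomycetes, yeasts)
print(rate)
```

Keying the distances by unordered pairs makes the lookup independent of which subtree is listed first.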
16
Variables: Expression Level
Expression Level data for S. cerevisiae, D.
melanogaster and H. sapiens were downloaded from
UCSC Table Browser (hgFixed).
Organism  Table               No. exp.  No. prob.  No. KOGs
Sacce     yeastChoCellCycle   17        6602       3030
Drome     arbFlyLifeAll       162       4921       2617
Homsa     gnfHumanAtlas2All   158       10197      3872
Standardized (μ = 0, σ = 1) log values; median
expression level among paralogs was used to
represent a KOG.
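The two processing steps named above (standardized log values, then the median among paralogs) might look like the sketch below. It assumes that "standardized" means zero mean and unit standard deviation of the log values, and it uses the natural log and the population standard deviation; the slide does not specify either. Gene names, KOG names and expression values are hypothetical.

```python
import math
import statistics

def standardize_log(values):
    """Log-transform, then shift and scale to mean 0 and s.d. 1."""
    logs = [math.log(v) for v in values]
    mu = statistics.mean(logs)
    sigma = statistics.pstdev(logs)  # population s.d.; an assumption
    return [(x - mu) / sigma for x in logs]

# Hypothetical raw expression levels for one organism.
raw = {"geneA": 120.0, "geneB": 15.0, "geneC": 900.0, "geneD": 60.0}
# Hypothetical KOG membership; KOG_x has two paralogs in this organism.
kog_members = {"KOG_x": ["geneA", "geneB"], "KOG_y": ["geneC"], "KOG_z": ["geneD"]}

genes = list(raw)
z = dict(zip(genes, standardize_log([raw[g] for g in genes])))

# Median standardized value among paralogs represents the KOG.
kog_level = {kog: statistics.median([z[g] for g in members])
             for kog, members in kog_members.items()}
print(kog_level)
```

Standardizing within each organism before pooling puts data sets with different numbers of experiments and probes on a common scale.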
17
Variables: Interactions
Protein-protein and genetic interaction (PPI and
GI) data for S. cerevisiae, C. elegans and D.
melanogaster were downloaded from the GRID Web
site. The median number of interaction partners
among paralogs was used to represent a KOG.
18
Variables: Lethality
Lethality of gene knockout data for S. cerevisiae
were downloaded from the MIPS FTP site (0/1
values). Embryonic lethality data from RNA
interference (RNAi) for C. elegans were taken
from Kamath et al., 2003 (0/1 values).
19
Missing Data
Total: 38 variables in 10058 KOGs; lots of
missing data.
Complete data (all 38 variables available): 23
KOGs; too few.
Combined data: 7 variables; 1482 KOGs with
complete data; 4124 with at most one missing
point; 3912 KOGs after removal of outliers.
Example: evolution rate.
          At.Os   Sc.Ca   Mg.Nc   Hs.Mm.  Pl.MF
KOG0009   -       0.168   0.300   -       0.405
KOG0010   0.671   1.252   0.606   0.087   1.492
KOG0011   0.905   1.698   0.428   0.073   1.547
KOG0012   -       2.238   0.665   0.244   -
KOG0013   0.355   -       -       0.014   1.343
KOG0014   1.913   4.041   -       0.126   2.840
KOG0015   -       2.286   0.400   0.027   -
KOG0016   -       -       0.506   0.380   -
-----------------------------------------------
          0.667   1.864   0.521   0.075   1.910
Each rate divided by the bottom-row value of its
column, with the per-KOG average of the available
values:
          At.Os   Sc.Ca   Mg.Nc   Hs.Mm.  Pl.MF   Average
KOG0009   -       0.090   0.575   -       0.212   0.293
KOG0010   1.006   0.672   1.162   1.166   0.781   0.957
KOG0011   1.358   0.911   0.821   0.984   0.810   0.977
KOG0012   -       1.201   1.275   3.275   -       1.917
KOG0013   0.532   -       -       0.181   0.703   0.472
KOG0014   2.869   2.168   -       1.692   1.487   2.054
KOG0015   -       1.227   0.767   0.365   -       0.786
KOG0016   -       -       0.970   5.087   -       3.028
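The evolution-rate example can be reproduced approximately with a short sketch. It reads the example as follows: each rate is divided by a per-column scale (the unlabeled row of five values, 0.667 to 1.910), and the ratios that are available for a KOG are averaged. That reading is an assumption inferred from the numbers on the slide. The results agree with the slide's averages to about two decimal places, since the slide's inputs are rounded.

```python
COLUMNS = ["At.Os", "Sc.Ca", "Mg.Nc", "Hs.Mm", "Pl.MF"]

# Rates copied from the slide; None marks a missing value ("-").
RATES = {
    "KOG0009": [None, 0.168, 0.300, None, 0.405],
    "KOG0010": [0.671, 1.252, 0.606, 0.087, 1.492],
    "KOG0011": [0.905, 1.698, 0.428, 0.073, 1.547],
    "KOG0012": [None, 2.238, 0.665, 0.244, None],
    "KOG0013": [0.355, None, None, 0.014, 1.343],
    "KOG0014": [1.913, 4.041, None, 0.126, 2.840],
    "KOG0015": [None, 2.286, 0.400, 0.027, None],
    "KOG0016": [None, None, 0.506, 0.380, None],
}

# Per-column scale: the unlabeled row of five values on the slide.
COLUMN_SCALE = [0.667, 1.864, 0.521, 0.075, 1.910]

def normalized_mean(row, scale):
    """Average of rate/scale over the columns where a rate is present."""
    ratios = [r / s for r, s in zip(row, scale) if r is not None]
    return sum(ratios) / len(ratios)

combined_rate = {kog: normalized_mean(row, COLUMN_SCALE)
                 for kog, row in RATES.items()}
for kog, value in combined_rate.items():
    print(kog, round(value, 3))
```

Dividing by a per-column scale before averaging keeps a fast-evolving species pair from dominating the combined rate, and averaging only the available ratios lets a KOG with missing comparisons still receive a value.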
20
Variables

Phenotypic
EL: expression level
PPI: protein-protein interactions
GI: genetic interactions
KE: knockout effect
NP: number of paralogs
Evolutionary
ER: (sequence) evolution rate
PGL: propensity for gene loss

21
The correlations
       NP      PPI     GI      PGL     ER      EL      KE
NP     -
PPI    0.057   -
GI     0.060   0.034   -
PGL    0.000  -0.125  -0.019   -
ER    -0.070  -0.200   0.034   0.141   -
EL     0.129   0.199  -0.050  -0.099  -0.277   -
KE     0.027   0.234  -0.048  -0.181  -0.155   0.188   -
22
Two Tiers of Variables
Observation on the pattern of pairwise
relationships in the data: "phenotypic" and
"evolutionary" variables behave differently.
23
Two Tiers of Variables
Observation on the pattern of pairwise
relationships in the data: "phenotypic" and
"evolutionary" variables behave differently.
24
The correlations
       NP      PPI     GI      PGL     ER      EL      KE
NP     -
PPI    0.057   -
GI     0.060   0.034   -
PGL    0.000  -0.125  -0.019   -
ER    -0.070  -0.200   0.034   0.141   -
EL     0.129   0.199  -0.050  -0.099  -0.277   -
KE     0.027   0.234  -0.048  -0.181  -0.155   0.188   -
25
PCA of the Data Space
          PC.1    PC.2    PC.3
NP        0.17    0.69    0.44
PPI       0.46    0      -0.17
GI        0       0.67   -0.54
PGL      -0.33    0       0.51
ER       -0.47    0      -0.20
EL        0.48    0       0.36
KE        0.45   -0.27   -0.21
-----------------------------------------
var. (%)  25.0    15.3    14.5
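As a rough cross-check of the first column, the leading principal component can be estimated directly from the pairwise correlations reported on the slide "The correlations". The sketch below uses plain power iteration. That is a stand-in technique chosen for brevity and is not necessarily how the PCA in the talk was computed. Treating the published, rounded correlation coefficients as the input matrix is also an assumption.

```python
import math

NAMES = ["NP", "PPI", "GI", "PGL", "ER", "EL", "KE"]

# Lower triangle of the correlation matrix, copied from the slide.
LOWER = [
    [],
    [0.057],
    [0.060, 0.034],
    [0.000, -0.125, -0.019],
    [-0.070, -0.200, 0.034, 0.141],
    [0.129, 0.199, -0.050, -0.099, -0.277],
    [0.027, 0.234, -0.048, -0.181, -0.155, 0.188],
]

def full_matrix(lower):
    """Expand a lower triangle into a symmetric matrix with unit diagonal."""
    n = len(lower)
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, row in enumerate(lower):
        for j, r in enumerate(row):
            m[i][j] = m[j][i] = r
    return m

def leading_eigenpair(m, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v

corr = full_matrix(LOWER)
eigenvalue, loadings = leading_eigenpair(corr)
# For a correlation matrix the eigenvalues sum to the number of variables.
fraction = eigenvalue / len(NAMES)
print("variance fraction of PC1:", round(fraction, 3))
for name, loading in zip(NAMES, loadings):
    print(name, round(loading, 2))
```

The variance fraction can be compared with the 25.0 reported for PC.1, and the signs of the loadings with the first column of the table.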
26
PCA of the Data Space
[Figure: PC1 vs. PC2]
27
PCA of the Data Space
[Figure: PC2 vs. PC3]
28
PC1: Gene "status"
[Figure: PC1 vs. PC2; PC1 axis labeled "important" and "accessory"]
29
PC2: "Adaptability"
[Figure: PC1 vs. PC2; PC2 axis labeled "flexible" and "rigid"]
30
PC2 and Expression Profile Skew
31
PC3: "Reactivity"
[Figure: PC2 vs. PC3]
32
PC3 and Expression Profile Skew
33
Relationships Between Variables
34
Status and Adaptability of Genes
Classification of KOGs into 4 major categories
35
Status and Adaptability of Genes
Classification of KOGs into 4 major categories:
INF, CELL, MET, UNKN
[Figure axes: Status, Adaptability, Reactivity]
36
Status and Adaptability of Genes
Cytoplasmic and Mitochondrial ribosomal proteins
37
Status and Adaptability of Genes
Vacuolar ATPase and Vacuolar Sorting proteins
38
Status and Adaptability of Genes
Replication Licensing Complex and Histones
39
Status and Adaptability of Genes
Core Cluster (spliceosome and mRNA
cleavage-polyadenylation complex)
RNA processing and modification
40
Adaptability and Reactivity of Genes
carbohydrate transport and metabolism
translation and ribosome
replication, RNA processing and modification
signal transduction
41
(No Transcript)
42
Conclusions

Three composite, independent variables
"status", "adaptability" and "reactivity"
dominate the multidimensional data space of
quantitative genomics.
The notion of status provides biologically
relevant null hypotheses regarding the
connections between various measures.
Breaks in the pattern possibly indicate something
nontrivial (targets for further investigation).
Functional groups of genes show distinctive
patterns of status, adaptability, and reactivity.