Title: Day 7: Using genomics to predict new pathways
1Day 7 Using genomics to predict new pathways
2- Genome sequences
- Allow us to interpret the function of proteins within the context in which they occur
- Reverse this process: predict the function of a protein from the context in which it tends to occur → prediction of protein function/pathways from genome sequences
3What on earth does the ketoglutarate:ferredoxin
oxidoreductase do in P. abyssi when there are no
connecting enzymes of the citric acid cycle?
42-ketoglutarate likely derived from glutamate
5Succinyl-CoA can be broken down via
methylmalonyl-CoA
6Instead of interpreting, actually predicting
protein function using genomic association
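(Figure: nucleoside salvage pathway. Metabolites: deoxycytidine, deoxyuridine, deoxythymidine, purine deoxyribonucleosides, deoxyribose-1-P, deoxyribose-5-P, glyceraldehyde-3-P, acetaldehyde. Enzymes: Cdd, DeoA, DeoB, DeoC, DeoD; deoB is marked with "?". Gene cluster shown for M.genitalium and M.tuberculosis: deoD deoC deoA cdd pmm.)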
7- The prediction that the pmm gene encodes a protein that (also) functions as a phosphoribomutase is based on:
- Genomic association (operon) with genes involved in the nucleoside salvage pathway.
- Conservation of this association among distantly related species.
- Substrate specificity is less conserved than catalytic function → the mutase function is conserved, the substrate specificity is altered from a mannose/glucose to a ribose.
- A phosphoribose mutase is required, and otherwise absent from the genome.
- Such predictions of course have to be confirmed by experimental research.
8Annotation via guilt by association
Search for proteins in bacterial genomes that
often lie next to it
new, unknown gene
gene that is involved in a known process, e.g.
amino acid synthesis
Extrapolate the pathway
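The neighbour search described on this slide can be sketched in a few lines. This is only an illustration, assuming each genome is available as an ordered list of gene names that have already been mapped to orthologous groups. The function name `count_neighbours`, the genome names and the gene orders are hypothetical toy data, not real annotation.

```python
from collections import Counter

def count_neighbours(genomes, query, window=1):
    """Count in how many genomes each gene lies within `window`
    positions of `query` (gene order only; strand is ignored)."""
    counts = Counter()
    for genes in genomes.values():
        near = set()
        for pos, gene in enumerate(genes):
            if gene != query:
                continue
            lo = max(0, pos - window)
            hi = min(len(genes), pos + window + 1)
            near.update(g for g in genes[lo:hi] if g != query)
        counts.update(near)  # each genome contributes at most once per gene
    return counts

# Hypothetical toy gene orders; "unk" stands for the new, unknown gene.
genomes = {
    "genome_A": ["trpE", "trpD", "unk", "trpC", "rpsA"],
    "genome_B": ["gyrA", "unk", "trpC", "trpB"],
    "genome_C": ["trpC", "unk", "dnaK"],
    "genome_D": ["unk", "recA"],
}

neighbours = count_neighbours(genomes, "unk")
for gene, n in neighbours.most_common():
    print(gene, n)
```

In this toy set only trpC is a neighbour of the unknown gene in more than one genome, so it is the only candidate from which a pathway would be extrapolated. A real analysis would also need to account for how closely related the genomes are.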
9Define "distantly related species":
remember the rapid shuffling of genomes (compared
to 16S rRNA identity)
10Variations in genome rearrangements depend
on the relative direction of transcription →
hints at the operon organization of genes in
prokaryotes
11Besides the theoretical argument (proteins
that are not only encoded in the same operon, but
whose operon organization is also conserved in
evolution), we also need experimental benchmarks
(compare protein sequence similarity →
homology, benchmarked via the structure).
Dandekar, Snel, Huynen and Bork, TIBS 1998:
Conservation of gene order: a fingerprint of
proteins that physically interact
12..Benchmarking..
13Conservation of the tryptophan synthesis operon
among the compared genomes
14Types of Genomic Association for the Prediction
of Functional Interaction
- I gene fusion/fission
- II conservation of gene order (operons)
- III co-occurrence of genes in genomes
- IV shared regulatory elements
- V coexpression data
15All the genes in the tryptophan biosynthesis
pathway are linked via gene fusions. These
fusions do not give the order of the enzymes in
the pathway
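One way to picture how fusion links are collected is sketched here, assuming every protein has already been assigned to one or more gene families and a fused protein is simply a protein carrying two families. The function name `fusion_links`, the genome names and the protein lists are illustrative assumptions, not data from the slides.

```python
from itertools import combinations

def fusion_links(proteomes):
    """Return pairs of gene families that occur fused in one protein
    in some genome and each as a stand-alone protein in some genome."""
    fused = {}      # frozenset({a, b}) -> genomes where a and b are fused
    separate = {}   # family -> genomes where it is a stand-alone protein
    for genome, proteins in proteomes.items():
        for families in proteins:
            if len(families) == 1:
                separate.setdefault(families[0], set()).add(genome)
            for a, b in combinations(sorted(set(families)), 2):
                fused.setdefault(frozenset((a, b)), set()).add(genome)
    links = {}
    for pair, genomes in fused.items():
        a, b = sorted(pair)
        if separate.get(a) and separate.get(b):
            links[(a, b)] = sorted(genomes)
    return links

# Illustrative toy proteomes: each tuple is one protein and lists its families.
proteomes = {
    "genome_A": [("trpG", "trpD"), ("trpC", "trpF"), ("trpE",)],
    "genome_B": [("trpG",), ("trpD",), ("trpC",), ("trpF",), ("trpE",)],
}

links = fusion_links(proteomes)
print(links)
```

The links are stored as unordered pairs, which mirrors the point on the slide: a fusion says that two enzymes belong together, not in which order they act.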
16Gene fission in the evolution of carbamoyl
phosphate synthase B (carB)
17Predicting functional interactions between
proteins by the co-occurrence of their genes in
genomes.
Distribution of four M.genitalium genes among 25 genomes:
MG299 (pta)   0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
Using the mutual information between genes as a
scoring heuristic for their co-occurrence:
M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
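The scores on this slide can be recomputed from the four profiles. The sketch below assumes the usual definition M(i,j) = H(i) + H(j) - H(i,j) and natural logarithms; the slides do not state the log base, but natural logarithms reproduce the quoted values. The helper names `entropy`, `mutual_information` and `profile` are chosen here for illustration.

```python
from collections import Counter
from math import log

def entropy(states):
    """Shannon entropy (natural log) of a sequence of observed states."""
    n = len(states)
    return -sum(c / n * log(c / n) for c in Counter(states).values())

def mutual_information(a, b):
    """M(a, b) = H(a) + H(b) - H(a, b) for two presence/absence profiles."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def profile(text):
    """Turn a string of 0/1 flags into a list of ints."""
    return [int(x) for x in text.split()]

# Profiles copied from the slide (25 genomes each).
pta  = profile("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
ackA = profile("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
dnaJ = profile("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")
dnaK = profile("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")

print(round(mutual_information(pta, ackA), 2))   # 0.69
print(round(mutual_information(dnaJ, dnaK), 2))  # 0.55
print(round(mutual_information(dnaJ, ackA), 2))  # 0.19
```

The pta/ackA pair scores higher than dnaJ/dnaK even though both pairs have identical profiles, because pta and ackA are present in about half of the genomes (12 of 25) and their individual entropies are therefore close to the maximum.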
18Gene co-occurrence/phylogenetic profiling
Distribution of 2 M.genitalium genes in 25
genomes; 1 implies that the gene is present, 0
that it is absent:
MG299 (pta)   0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
Phosphotransacetylase (Pta)
Acetate kinase (AckA)
AckA and Pta are in the same pathway, explaining
their co-occurrence
19Distribution of 2 M.genitalium genes in 25
genomes; 1 implies that the gene is present, 0
that it is absent:
MG019 (dnaJ)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
Models for the interaction of DnaK and DnaJ with
their substrate (unfolded and misfolded
proteins). DnaK and DnaJ interact with each
other, explaining their genomic co-occurrence
20..entropy and mutual information
$H(i) = -\sum_i P_i \log P_i$
$H(j) = -\sum_j P_j \log P_j$
$H(i,j) = -\sum_{i,j} P_{i,j} \log P_{i,j}$
Entropy (H) is the disorder of the system. It
is maximal when all states occur with equal
frequency and minimal when one state dominates the
distribution. In terms of the distribution of
genes, it is maximal when genes occur with 50%
frequency.
$M(i,j) = H(i) + H(j) - H(i,j)$
Mutual information (M) is the sum of the
individual entropies minus the combined entropy.
It is maximal when the individual entropies are
maximal (P = 0.5) and the combined entropy is
minimal (of the four possibilities, 0 0, 0 1, 1 0
and 1 1, only two are occupied: 0 0 and 1 1, or 0
1 and 1 0)
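As a worked check of this statement, the two extreme cases for a pair of genes that are each present in half of the genomes can be written out. Natural logarithms are assumed here; the slides do not state the base.

```latex
% Perfectly co-occurring genes: P(0,0) = P(1,1) = 1/2, P(0,1) = P(1,0) = 0
\begin{align*}
H(i) &= H(j) = -2 \cdot \tfrac{1}{2}\ln\tfrac{1}{2} = \ln 2 \\
H(i,j) &= -2 \cdot \tfrac{1}{2}\ln\tfrac{1}{2} = \ln 2 \\
M(i,j) &= H(i) + H(j) - H(i,j) = \ln 2 \approx 0.69
\end{align*}
% Independent genes with the same frequencies: P = 1/4 for all four states
\begin{align*}
H(i,j) &= -4 \cdot \tfrac{1}{4}\ln\tfrac{1}{4} = 2\ln 2 \\
M(i,j) &= \ln 2 + \ln 2 - 2\ln 2 = 0
\end{align*}
```

With this choice of logarithm the maximum for presence/absence profiles is ln 2, about 0.69; with base-2 logarithms it would be 1 bit.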
21Applicability of genomic context information for
M.genitalium genes (480 genes in total)
Gene order: 215
Fusion: 27
Co-occurrence: 54
22Selectivity of Genomic Context for function
prediction
23Correlation between the strength of the genomic
and functional associations (operon)
24Correlation between the strength of the genomic
and functional associations (fusion)
25Correlation between the strength of the genomic
and functional associations (co-occurrence)
26The stronger the evolutionary conservation of the
genomic co-occurrence the more likely the
interaction
(Figure: probability that the genes are in the same pathway (y-axis, 0 to 1) against the evolutionary conservation score (x-axis, 0 to 1: how often the genes lie next to each other, are fused, or to what extent their phylogenetic distributions are alike); curves for Fusion, Gene Order and Co-occurrence.)
27Genomic context vs. homology based function
prediction in M.genitalium
Context 238
Homology 368
21
26
Added info from genomic context
28Combining homology information with genomic
association for function prediction
Repeated occurrence of MG009, one of the most
widespread enzymes on earth, encoding a
phosphohydrolase, with thymidylate kinase (tmk)
suggests a role of MG009 in pyrimidine metabolism.
29Conservation of gene order of the hypothetical
gene MG134 with dnaX, RecR suggests physical
interaction between their gene products
31From pairwise interactions to functional modules,
pathways
32 The first iteration of trpB in M. jannaschii
(MJ1038) retrieves trpA (MJ1037), with which trpB
physically interacts
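The step from pairwise links to modules can be pictured as an iterative neighbourhood search: start from one gene, collect its genomic-context partners, then repeat from those partners. The sketch below assumes the pairwise links are already available as a table. Only the trpB-trpA link is taken from the slide; `geneX`, `geneY` and the function names are hypothetical.

```python
def symmetric(pairs):
    """Build an undirected link table from a list of gene pairs."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links

def expand(links, seed, iterations):
    """Return the new genes reached from `seed` in each iteration,
    following genomic-context links one step at a time."""
    found = {seed}
    frontier = {seed}
    layers = []
    for _ in range(iterations):
        reached = set()
        for gene in frontier:
            reached |= links.get(gene, set())
        frontier = reached - found  # keep only genes not seen before
        if not frontier:
            break
        found |= frontier
        layers.append(sorted(frontier))
    return layers

# Only trpB-trpA comes from the slide; the other links are made up.
links = symmetric([("trpB", "trpA"), ("trpA", "geneX"), ("geneX", "geneY")])
layers = expand(links, "trpB", 2)
print(layers)  # [['trpA'], ['geneX']]
```

Each extra iteration widens the module but also adds genes that are linked less directly to the query, so the number of iterations trades coverage against specificity.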
34Genomic context indicates a link between the
shikimate and tryptophan synthesis pathways
(Figure: network of genomic-context links. Node labels: tyrA, aroB, asd, truA, aroE, aroC, hemK, hyp, trpF, trpC, trpE, trpG, trpA, trpD, trpB, hyp, 2c-rr. Groups labelled "Shikimate pathway" and "Tryptophan synthesis pathway".)
35Biochemical pathways vs functional modules
Coverage > 70%
Specificity ca. 90%
Von Mering et al. PNAS 100 (2003) 15428
36Limited Relevance of Gene Order for Functional
Interaction in eukaryotes
- Operons in nematodes
- Gene-order conservation of co-expressed genes between the fungi C.albicans and S.cerevisiae
37Blumenthal, 2004
38Finding Interaction Partners for a Human Disease
Gene: frataxin
- Friedreich's ataxia
- No (homolog with) known function
- No gene fusion or gene order conservation
40Iron-Sulfur (2Fe-2S) cluster in the Rieske
protein (Iwata et al, Structure 1996)
41Ancestor Proteobacteria
(Figure: gene cluster shown along a time axis; labels: fdx, IscS, IscU, IscR, RnaM.)
42The mitochondrial HSP70 protein that is involved
in iron-sulfur cluster (isc) assembly in yeast is
derived from DnaK, rather than from HscA (the
proteobacterial isc HSP70), indicating a
paralogous switch in isc assembly from the
proteobacteria to the eukaryotes.
43Mitochondrial iron-sulfur assembly
(Figure: pathway diagram with the labels Arh1/fpr, Atm1, Cys, Ala, NifS, NifU, fdx, e-, S, Fe, 2Fe2S, HscA/SSQ1, HscB and target proteins (e.g. fdx, Complex I); frataxin is marked with "?".)
44Prediction
Confirmation
45IND1, an FeS protein required for complex I
assembly
Katrine Bych, Janneke Balk, Stefan Kerscher,
Klaus Zwicker, Ulrich Brandt
Daili J A Netz, Antonio J Pierik, Roland Lill,
PHILIPPS-UNIVERSITÄT MARBURG
Martijn Huynen
46Ind1, an Mrp-like NTPase
The Ind1 sequence is homologous to the cytosolic
proteins Cfd1 and Nbp35 in S. cerevisiae, which
can function as Fe-S scaffold proteins (Netz et
al. 2007 Nat. Chem. Biol. 3, 278-286).
47Ind1 has a mitochondrial location
Cells fractionated into mitochondria and
postmitochondrial supernatant (pms)
48Phylogenetic relationship between NBP35, CFD1,
CF101, ApbC and IND1 indicates various
independent origins from bacteria and archaea
49Co-evolution of IND1 with Complex I
(Figure: distribution of IND1 and the complex I (CI) FeS proteins 75kD, TYKY, PSST, 51kD and 24kD across eukaryotic taxa, with the events "loss of IND1" and "gain of mito targeting signal" (MTS) marked. Taxa: Fungi (17), Saccharomyces s.l. (5), S.pombe, E.cuniculi, Vertebrata (16), Insects (7), Nematodes (3), S.purpuratus, D.discoideum, E.histolytica, Plants, Algae (5), Apicomplexa (10), Ciliates (2), Euglenozoa (5), G.lamblia, T.vaginalis.)
50Deletion of IND1 specifically affects complex I
activity in Y. lipolytica with an alternative
NADH dehydrogenase
51Deletion of IND1, or mutation of the conserved
IND1 cysteines, affects complex I activity in
mitochondrial membranes.
Similar reduction of NADHDAR (oxidation of NADH
by FMN on the 51-kDa subunit) and dNADHDBQ
(oxidation of NADH coupled to reduction of
ubiquinone) activity suggests less complex I
rather than impaired complex I
52Deletion of IND1 reduces complex I abundance
(Figure labels: cWT, ind1Δ, 75-kDa NUAM.)
VD and VM: dimeric and monomeric forms of complex V; I: complex I; S: incompletely characterized supercomplex that contains complex III; IIID: dimeric form of complex III.
53Verified function predictions. Making predictions
is easy, testing them is another matter.
Protein | Context | Type of interaction | Function | Ref
Mt-Ku | gene order | physical interaction | double-stranded DNA repair | 56
GlnK | gene order | physical interaction | signal transduction for ammonium transport | 57,58
PH0272 | gene order | metabolic pathway | methylmalonyl-CoA racemase | 59
PrpD | gene order | metabolic pathway | 2-methylcitrate dehydratase | 22,60
AroK | gene order | metabolic pathway | shikimate kinase | 61
ComB | gene order | metabolic pathway | 2-phosphosulfolactate phosphatase | 62
KynB | gene order | metabolic pathway | kynurenine formamidase | 63
PvlArgDC | gene order | metabolic pathway | arginine decarboxylase | 64
FabK | gene order | metabolic pathway | enoyl-ACP reductase | 65
FabM | gene order | metabolic pathway | trans-2-decenoyl-ACP isomerase | 66
COG0042 | gene order | tRNA modification | tRNA-dihydrouridine synthase | 67
Yfh1 | co-occurrence | process | iron-sulfur cluster assembly | 68,69
YchB | co-occurrence | metabolic pathway | terpenoid synthesis | 70
SmpB | co-occurrence | process | trans-translation | 5,71
ThyX | complementary | enzymatic activity | thymidylate synthase | 14,72
ThiN | complementary | enzymatic activity | thiamine phosphate synthase | 73,74
ThiE | complementary | enzymatic activity | thiamine phosphate synthase | 74
Prx | fusion | pathway | peroxiredoxin | 75
YgbB | fusion/gene order | metabolic pathway | terpenoid synthesis | 76
SelR | fusion/order/co-o. | enzymatic activity | methionine sulfoxide reductase | 14,22,77
FadE | reg. sequence | metabolic pathway | acyl CoA dehydrogenase | 78,79
TogMNAB | reg. sequence | metabolic pathway | oligogalacturonide transport | 80,81
MetD | reg. sequence | metabolic pathway | methionine transport | 82
54Further Reading
- Genomic context: Huynen M, Snel B, Lathe W 3rd, Bork P. (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 10(8):1204-10.
- Genomic context: Gabaldon T, Huynen MA. (2004) Prediction of protein function and pathways in the genome era. Cell Mol Life Sci. 61:930-44.