Title: Predicting interactions between genes based on genome comparisons The genomic context component of S
1Predicting interactions between genes based on genome comparisons
The genomic context component of STRING
Bioinformatics seminar series
5-10-2004
Berend Snel
2To do
- Seminar (today) please ask questions
- Article: "A gene co-expression network for global discovery of conserved genetic modules"
- Make schedule for article discussion (today)
- Read article (next couple of days)
- 5 minute discussion per person of the article (preferably Monday 11 October)
3http://string.embl.de
4Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Gene fusion
- Gene order
- Presence / absence of genes across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
5Complete genomes, now what?
- Post-genomic era: we have the parts list (complete genomes)
- To understand the cell we need to know the functions of the genes
6For most genes in any genome we need function
prediction
- E. coli, the most intensively studied organism
- only 1924 genes (43%) have been (partially) experimentally characterized.
7Predicting protein function
What is function? Various levels of description.
Sequence similarity/homology has the largest relevance for molecular function. This aspect of protein function is best conserved.
Molecular function can often be predicted from similarities between protein sequences (BLAST) or structures.
8Beyond homology and molecular function
- Homology-based function prediction works very well, but
- a large fraction of genes are poorly described (no homologs, or only uncharacterized homologs; this holds for 60% of the human genes)
- There are other aspects of function: functional associations, e.g. the target of a protein kinase or a transcriptional regulator
- Thus: predicting these associations
9- Genome sequences
- Allowing us to interpret the function of proteins within the context in which they occur
- Reverse this process: predict the function of a protein from the context in which it tends to occur -> prediction of protein function/pathways from genome sequences
- Use the genome sequences (through comparative genome analysis) for interaction prediction: genomic context methods
- Genomic context methods have been shown to be reliable indicators for functional associations
10There are many types of functional associations
(AKA functional interactions, interactions,
functional links, functional relations) in
molecular biology
Cellular process
11Types of functional associations
Metabolic pathways: filling gaps
12Types of functional associations
Transcription regulation
Signalling pathways
13Types of functional associations
Cellular process
Protein complexes
14Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Gene fusion
- Gene order
- Presence / absence of genes across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
15Genomic context is a tool to predict functional associations between genes
- Use the genome sequences (through comparative genome analysis) for interaction prediction: genomic context methods
- Genomic context methods have been shown to be reliable indicators for functional interaction
- Genomic context is also known as in silico interaction prediction, or genomic associations
16Genomic context methods detect evolutionary
traces in genomes of functionally associated
proteins
trpA
trpB
18Three different genomic context methods in STRING
- Gene fusion, Rosetta stone method
- Conserved gene order between divergent genomes
- Co-occurrence of genes across genomes,
phylogenetic profiles
19All genomic context methods use orthologs: corresponding genes between genomes
- Orthologs, not just homologs: related by speciation
- Orthologs are very likely to have the same function
- orthologs : genomes = alignment : sequence
Gene Duplication
Speciation
20Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Gene fusion
- Gene order
- Presence / absence of genes across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
21Gene fusion
- i.e. the orthologs of two genes in another organism are fused into one polypeptide
- A very reliable indicator for functional interaction, partly because it is a relatively infrequent evolutionary event: 3470 distinct fusions when surveying 179 genomes
Fusion
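The fusion criterion above can be illustrated with a minimal sketch. It assumes that every protein has already been assigned to one or more orthologous groups; the input format, the group names (OG1, OG2, ...) and the function name are hypothetical and are not taken from STRING itself.

```python
from itertools import combinations


def find_fusion_links(proteins):
    """Return {(group_a, group_b): [fused protein ids]}.

    proteins: dict mapping a protein id to the list of orthologous
    groups detected within that single polypeptide (hypothetical input).
    A protein carrying two different groups counts as a fusion and
    links those two groups.
    """
    links = {}
    for protein_id, groups in proteins.items():
        for a, b in combinations(sorted(set(groups)), 2):
            links.setdefault((a, b), []).append(protein_id)
    return links


# Toy data: OG1 and OG2 are separate genes in organism X,
# but occur together in one protein of organism Y.
proteins = {
    "orgX_p1": ["OG1"],
    "orgX_p2": ["OG2"],
    "orgY_p7": ["OG1", "OG2"],
    "orgY_p8": ["OG3"],
}
fusion_links = find_fusion_links(proteins)
print(fusion_links)
```

In this toy input only the pair (OG1, OG2) is linked, with orgY_p7 as the supporting fusion protein.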
22Gene fusion: an example
23Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Fusion
- Gene order
- Presence / absence of genes across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
24Gene order evolves rapidly
But
25Differential retention of divergent / convergent
gene pairs suggests that conservation implies a
functional association
26Comparison to pathways: conservation implies a functional association
27Conserved gene order
- i.e. genes that are present in the same gene cluster over sufficiently large evolutionary distances
- Contributes by far the most predictions
28Conserved gene order
NB1: predicting operons is not trivial; in fact, conserved gene order or functional association is a major clue.
NB2: using only operons, without requiring conservation, results in much less reliable function prediction.
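A minimal sketch of the conservation requirement: a gene pair only counts when it is adjacent on the same strand in several genomes. The input format, the threshold `min_genomes` and the function name are assumptions made for illustration. The sketch does not check that the supporting genomes are evolutionarily distant, which the real method requires.

```python
from collections import defaultdict


def conserved_neighbors(genomes, min_genomes=2):
    """Return {pair of orthologous groups: set of supporting genomes}.

    genomes: dict mapping a genome name to the ordered list of
    (orthologous_group, strand) along the chromosome (hypothetical input).
    Only directly adjacent genes on the same strand are counted.
    """
    support = defaultdict(set)
    for name, genes in genomes.items():
        for (g1, s1), (g2, s2) in zip(genes, genes[1:]):
            if s1 == s2 and g1 != g2:
                support[frozenset((g1, g2))].add(name)
    return {pair: found for pair, found in support.items()
            if len(found) >= min_genomes}


genomes = {
    "A": [("OG1", "+"), ("OG2", "+"), ("OG3", "-")],
    "B": [("OG9", "+"), ("OG2", "-"), ("OG1", "-")],
    "C": [("OG1", "+"), ("OG3", "+")],
}
conserved = conserved_neighbors(genomes)
print(conserved)
```

Here OG1 and OG2 are neighbors on the same strand in genomes A and B, so they are the only pair reported. The pair OG1/OG3 occurs in a single genome and is dropped, in line with NB2.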
29Conserved gene order: an example from metabolism of propionyl-CoA
target
query
30Conserved gene order: an example from metabolism of propionyl-CoA
Biochemical assays confirm the function of
members of COG0346 as a DL-methylmalonyl-CoA
racemase
31Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Gene fusion
- Gene order
- Presence / absence of genes across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
32Presence / absence of genes
Gene content -> co-evolution. (The easy case: few genomes.)
Differences between gene content reflect differences in phenotypic potentialities.
Genomes share genes for phenotypes they have in common.
33Presence / absence of genes
L. innocua (non-pathogen)
L. monocytogenes (pathogen)
34Presence / absence of genes
Genes involved in pathogenicity
L. monocytogenes (pathogenic)
L. innocua (non-pathogenic)
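With only two genomes, the comparison reduces to a set difference. The sketch below assumes each genome is given as a collection of orthologous group identifiers; the identifiers and the function name are made up for illustration and are not real Listeria genes.

```python
def differential_genes(present_in, absent_from):
    """Groups found in the first genome but not in the second."""
    return sorted(set(present_in) - set(absent_from))


# Hypothetical gene content of a pathogen and a close non-pathogenic relative.
pathogen = {"OG1", "OG2", "OG3", "OG7", "OG8"}
non_pathogen = {"OG1", "OG2", "OG3", "OG5"}

candidates = differential_genes(pathogen, non_pathogen)
print(candidates)
```

The groups specific to the pathogen (OG7 and OG8 in this toy input) are the candidates for involvement in the phenotype that differs between the two organisms.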
35Generalization: phylogenetic profiles / co-occurrence
         species 1  species 2  species 3  species 4  species 5  ...
Gene 1
Gene 2
Gene 3
...

         species 1  species 2  species 3  species 4  species 5  ...
Gene 1       1          0          1          1          0       1
Gene 2       1          1          0          0          1       0
Gene 3       0          1          0          0          1       0
...
36... but phylogenetic signal in gene content!
Escherichia coli
Haemophilus influenzae
       sp1   sp2   sp3   sp4
sp1     1    0.2   0.4   0.2
sp2           1    0.9   0.1
sp3                 1    0.3
sp4                       1
37Co-occurrence of genes across genomes
- i.e. two genes have the same presence / absence pattern over multiple genomes: they have co-evolved
- AKA phylogenetic profiles
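As a small illustration, the three toy profiles from the generalization slide can be compared pairwise. Hamming distance is used here only as one simple way to measure how similar two presence / absence patterns are; it is an assumption of this sketch and not necessarily the score used in STRING, which also has to deal with the phylogenetic signal in gene content.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


# Presence (1) / absence (0) per genome, as in the toy table.
profiles = {
    "Gene 1": [1, 0, 1, 1, 0, 1],
    "Gene 2": [1, 1, 0, 0, 1, 0],
    "Gene 3": [0, 1, 0, 0, 1, 0],
}

# Gene pairs ordered from most to least similar profile.
ranked = sorted(
    combinations(profiles, 2),
    key=lambda pair: hamming(profiles[pair[0]], profiles[pair[1]]),
)
for a, b in ranked:
    print(a, b, hamming(profiles[a], profiles[b]))
```

Gene 2 and Gene 3 differ in a single genome and rank first, so they are the pair this sketch proposes as co-occurring.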
38Predicting function of a disease gene protein
with unknown function, frataxin, using
co-occurrence of genes across genomes
- Friedreich's ataxia
- No (homolog with) known function
39Frataxin has co-evolved with hscA and hscB
indicating that it plays a role in iron-sulfur
cluster assembly
[Figure: presence / absence of cyaY / Yfh1 across species; legible species labels include "coli" and "D.melan."]
40Iron-Sulfur (2Fe-2S) cluster in the Rieske protein
41Prediction
Confirmation
42The opposite of co-occurrence: anti-correlation / complementary patterns predicting analogous enzymes
Genes with complementary phylogenetic profiles
tend to have a similar biochemical function.
A
B
A
B
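A complementary pattern can be scored as the fraction of genomes in which exactly one of the two genes is present. This is a sketch under that assumption; the profiles and names below are invented, and a real analysis would also have to handle genomes that lack the whole pathway.

```python
def complementarity(p, q):
    """Fraction of genomes with exactly one of the two genes present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    if not p:
        raise ValueError("profiles must not be empty")
    return sum((a == 1) != (b == 1) for a, b in zip(p, q)) / len(p)


# Hypothetical profiles of two candidate analogous enzymes A and B.
enzyme_a = [1, 0, 1, 0, 1]
enzyme_b = [0, 1, 0, 1, 0]

score = complementarity(enzyme_a, enzyme_b)
print(score)
```

A score of 1.0 means the two genes never co-occur and never are both absent, which is the pattern expected when either gene can fill the same biochemical role.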
43Complementary patterns in thiamin biosynthesis
predict analogous enzymes
44Prediction of analogous enzymes is confirmed
45Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Gene fusion
- Gene order
- Presence / absence of genes across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
46Benchmark and integration: KEGG maps
47Integrating genomic context scores into one
single score
- Compare each individual method against an independent benchmark (KEGG), and find equivalency
- Multiply the chances that two proteins are not interacting and subtract from 1: naive Bayesian, i.e. assuming independence
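Written out, the combination rule in the second bullet is S = 1 - (1 - s_1)(1 - s_2)...(1 - s_n), where each s_i is the benchmarked score of one method. The sketch below assumes the individual scores are already calibrated to probabilities between 0 and 1; the function name and the example values are illustrative.

```python
from math import prod


def integrate_scores(scores):
    """Combine per-method scores as 1 minus the product of (1 - score).

    Assumes the methods are independent and that every score is a
    probability in [0, 1].
    """
    scores = list(scores)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
    return 1.0 - prod(1.0 - s for s in scores)


# Example: fusion score 0.6 and gene order score 0.5 for the same pair.
combined = integrate_scores([0.6, 0.5])
print(round(combined, 3))  # 1 - 0.4 * 0.5 = 0.8
```

The combined score is never lower than the best individual score, so weak evidence from several methods can add up to a confident prediction.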
[Figure: fraction of pairs in the same KEGG map (y-axis, 0 to 1) against score (x-axis, 0 to 1) for Fusion, Gene Order and Co-occurrence]
48Benchmark
[Figure: coverage (number of predicted links between orthologous groups, 10 to 100000) against accuracy (fraction of confirmed predictions, i.e. same KEGG map, 0.5 to 1.0) for Integrated, Gene Order (norm.), Gene Order (abs.), Cooccurrence, Fusion (norm.) and Fusion (abs.)]
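The two axes of this benchmark can be computed as in the sketch below. It assumes predictions are stored as scored pairs and that the benchmark records, for pairs where both members are annotated, whether they fall on the same KEGG map. The data structures, names and values are hypothetical.

```python
def accuracy_and_coverage(predictions, same_map, threshold):
    """Return (accuracy, coverage) for predictions scoring >= threshold.

    predictions: dict mapping a pair to its score.
    same_map: dict mapping a pair to True/False; pairs missing from it
    cannot be judged and are left out of the accuracy.
    Coverage is the number of predicted links, as on the plot's axis.
    """
    selected = [pair for pair, score in predictions.items()
                if score >= threshold]
    judged = [pair for pair in selected if pair in same_map]
    confirmed = sum(same_map[pair] for pair in judged)
    accuracy = confirmed / len(judged) if judged else None
    return accuracy, len(selected)


predictions = {("a", "b"): 0.9, ("a", "c"): 0.8,
               ("b", "d"): 0.3, ("c", "d"): 0.95}
same_map = {("a", "b"): True, ("a", "c"): False, ("b", "d"): True}

result = accuracy_and_coverage(predictions, same_map, threshold=0.7)
print(result)
```

Repeating the calculation over a range of thresholds gives one accuracy / coverage curve per method.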
49Performance of genomic context compared to
high-throughput interaction data
[Figure: coverage (fraction of reference set covered by data) against accuracy (fraction of data confirmed by reference set) for purified complexes (TAP), purified complexes (HMS-PCI), genomic context, mRNA co-expression, synthetic lethality, yeast two-hybrid and combined evidence (two methods, three methods); markers distinguish raw data, filtered data and parameter choices]
50Genomic context: biochemistry by other means
Despite the high performance of genomic context methods, as a tool for function prediction it is not a button-press method. It is more like biochemistry by other means.
Often quite a lot of manual input and expert knowledge from the researcher is needed to distill associations into a concrete function prediction.
Small-scale bioinformatics?
51Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Fusion
- Gene order
- Co-occurrence across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
52STRING allows a network view
e.g. see not only to which genes the query gene
has an association, but also what the relations
are among these other genes
53STRING
Network output (depth 1)
Assigning uncharacterized archaeal proteins to a network around
Archaeal flagellins
Archaeal flagellin biosynth. ATPase
54STRING
Type IV secretion pathway
Network (depth 2)
Connecting associated cellular processes
Archaeal flagellins
Archaeal flagella components
Chemotaxis-related
55STRING
Network (depth 3)
Zooming out to other cellular processes
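The depth parameter of the network view can be read as the number of association steps taken away from the query gene. The sketch below uses a breadth-first search on a toy graph; the graph, the node names and both function names are assumptions for illustration and do not reflect how STRING is implemented.

```python
from collections import deque


def neighborhood(graph, query, depth):
    """Return {node: distance} for all nodes within `depth` steps of query.

    graph: dict mapping a node to the set of nodes it is associated with.
    """
    dist = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def induced_edges(graph, nodes):
    """Associations among the selected nodes, not only those to the query."""
    return {frozenset((a, b))
            for a in nodes for b in graph.get(a, ()) if b in nodes}


graph = {
    "q": {"a", "b"},
    "a": {"q", "b"},
    "b": {"q", "a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}
depth1 = neighborhood(graph, "q", 1)
depth2 = neighborhood(graph, "q", 2)
print(sorted(depth1), sorted(depth2))
print(len(induced_edges(graph, depth1)))
```

At depth 1 the view contains q, a and b, including the a-b association that does not involve the query; depth 2 adds c, and depth 3 would add d.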
56Using the local network to detect
multi-functional proteins
57Contents
- Predicting functional interactions between proteins
- Genomic context methods
- General
- Fusion
- Gene order
- Co-occurrence across genomes
- Integration and benchmarking of predictions
- Interaction networks
- In addition to genomic context: functional genomics data
58- STRING currently in addition includes
- Functional association data from large scale / high-throughput biochemical experiments (functional genomics data)
- protein complex purification
- yeast-2-hybrid
- ChIP-on-chip
- micro-array gene expression
- known functional relations, so called legacy data, as present in PubMed abstracts and databases like MIPS or KEGG.