Title: CMBI
1CMBI: Center for Molecular and Biomolecular Informatics
1) Protein structure (Gert Vriend)
2) Bacterial Genomics (Ronald Siezen)
3) Comparative Genomics (Martijn Huynen)
2Vriend group: protein structure, e.g. modeling and design
3Siezen group: annotation of (lactic acid) bacterial genomes
4Huynen group: prediction of protein function and pathways in the genome era
- Homology, Orthology
- Genomic Context
5Similar function
What is function? Various levels of description.
Sequence similarity (homology) has the largest relevance for Molecular Function. This is the aspect of protein function that is best conserved; protein sequence and structure can often be interpreted in terms of function.
6Models of protein sequence evolution -> ways of aligning sequences and of judging sequence similarity.
- The simplest model: all amino acids are equally dissimilar and are replaced at equal rates, independent of the position in the sequence (aligning sequences purely based on matches/identity matrices).
- A more complicated model: some amino acids are more equal than others, independent of the position in the sequence (aligning sequences based on similarity matrices).
- Bootstrapping in deriving the substitution/similarity matrices: the matrices themselves are based on some sequence alignments, and hence on some model of evolution.
7BLOSUM (62, 80, etc.): BLOcks SUbstitution Matrix, based on (gap-less) alignments of sequences that are maximally 62%, 80%, etc. identical. By choosing gapless alignments (Blocks), the matrices are relatively independent of the model used to make the alignment.
The scores are log-odds: the logarithm of the observed substitution frequency q_ij divided by the product of the frequencies p_i and p_j of the individual amino acids:
$$S_{ij} = \ln \frac{q_{ij}}{p_i \, p_j}$$
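The log-odds score on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `log_odds_matrix` and the frequencies are hypothetical, not real BLOSUM data, and the rounding and scaling that published matrices apply are left out.

```python
import math

def log_odds_matrix(pair_freqs, background):
    """Return S[(i, j)] = ln(q_ij / (p_i * p_j)) for every observed pair."""
    return {
        (i, j): math.log(q_ij / (background[i] * background[j]))
        for (i, j), q_ij in pair_freqs.items()
    }

# Hypothetical frequencies for two residues, for illustration only.
background = {"L": 0.10, "I": 0.05}                                   # p_i
pair_freqs = {("L", "L"): 0.04, ("L", "I"): 0.01, ("I", "I"): 0.02}   # q_ij

scores = log_odds_matrix(pair_freqs, background)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

A pair that is aligned more often than expected from the background frequencies gets a positive score; a pair aligned less often than expected gets a negative one.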
8Benchmarking homology detection with the Smith-Waterman algorithm, using 3D structures (PDB40) as the gold standard for what is homologous and what is not.
Use those E-values.
9Practical and theoretical considerations in pairwise sequence comparisons
- One of the assumptions behind the statistics is that the sequences are random -> no low-complexity areas (SEG: XXXXXXX).
- Convergence in sequence and in structure space occurs, e.g. in transmembrane and coiled-coil areas -> no homology, but convergence in structure and in sequence space.
- Databases are assumed to be non-redundant -> E-values are too high. Solution: compare against non-redundant databases.
- E-values are based on the whole sequence -> search with separate domains if you have indications that there are such.
- Databases are full of indirectly annotated proteins -> there is no solution, except manually checking which annotations are reliable.
10The main increase in sensitivity (2- to 3-fold) comes from profile-based searches, such as Position-Specific Iterated BLAST (PSI-BLAST), and from Hidden Markov Models (HMMs).
PSI-BLAST and HMMs allow more complicated models of sequence evolution: rather than one substitution matrix that is equal for all positions, we have one for each position, as well as position-specific gap penalties.
(HMMs: mathematical, probabilistic models that generate our protein domain. They allow us to assess the probability that any sequence of interest has been generated by any specific model.)
11A very simple Hidden Markov Model
Pos. 1: P(A)=0.01 P(C)=0.8 P(E)=0.1 etc.
Pos. 2: P(A)=0.3 P(C)=0.01 P(E)=0.02 etc.
Pos. 3: P(A)=0.05 P(C)=0.01 P(E)=0.4 etc.
Pos. 4: P(A)=0.01 P(C)=0.01 P(E)=0.3 etc.
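As a rough illustration, the four-position model on this slide can be used to score a sequence. The sketch assumes the four probability rows correspond to positions 1 to 4 in order, and it uses only match states (no insert or delete states, which a real profile HMM such as those built by HMMer would have). Residues hidden behind "etc." are not defined here, so only A, C and E can be scored. The names `PROFILE` and `sequence_probability` are made up for this example.

```python
import math

# Emission probabilities per position, copied from the slide (A, C, E only).
PROFILE = [
    {"A": 0.01, "C": 0.8, "E": 0.1},
    {"A": 0.3, "C": 0.01, "E": 0.02},
    {"A": 0.05, "C": 0.01, "E": 0.4},
    {"A": 0.01, "C": 0.01, "E": 0.3},
]

def sequence_probability(seq, profile=PROFILE):
    """Probability that the position-specific model emits `seq`."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must equal the number of positions")
    prob = 1.0
    for residue, emissions in zip(seq, profile):
        prob *= emissions[residue]  # KeyError for residues not listed on the slide
    return prob

print(sequence_probability("CAEE"))  # 0.8 * 0.3 * 0.4 * 0.3, about 0.0288
print(sequence_probability("AAAA"))  # far less likely under this model
```

Comparing such probabilities (usually as log-odds against a background model) is what lets a profile distinguish likely family members from unrelated sequences.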
12You can get all the obvious homologs, align them, make an HMM using HMMer, and run it somewhere on a big computer (dynamic programming, slow).
Or you can use PSI-BLAST (Altschul et al., 1997): relatively fast and easy, and a bit less accurate (the alignment never exceeds the length of the seed protein).
An example of how making powerful bioinformatics tools easily accessible leads to increased usage and a speed-up of research.
13Comparison of various homology search techniques in terms of sensitivity (number of homologues detected) and selectivity (number of non-homologues detected).
SAM-T98: HMM. ISS: Intermediate Sequence Search.
14Protein domains
Proteins often consist of multiple domains.
- Separate in structure -> structural definition (needs a 3D structure)
- Separate in evolution -> comparative sequence analysis definition
15The multidomain architecture of proteins is one
of the reasons why protein function prediction by
best hit homology search is often incorrect.
[Figure: multidomain protein A (domains 1 and 2) and protein B]
Protein B is wrongly annotated as having the function of domain 1, based on homology with the multidomain protein A, but not with domain 1.
16Detection of new domains
Find pieces of proteins that occur in the context of different other pieces/domains.
Be careful, though: when you do not detect homology, that does not necessarily indicate that it is not there -> having a separate 3D structure is often regarded as proof of being a separate domain.
17Sequence domain databases: prefab sets of HMMs against which sequences can be scanned.
- PFAM -> alignments generated automatically; large coverage, lower quality
- SMART -> curated alignments; less coverage (focus on signalling domains and on mobile domains)
18Gene families show gene duplications, functional differentiation
[Figure: gene tree of Gene A in Species I and Species II; labels: A, B, Orthologs, Gene duplication, Speciation]
19[Figure: genes of Genome I and Genome II connected by hits labelled 30, 35, 25, 23]
Orthologs are expected to have relatively high levels of sequence identity to each other (compared to other, non-orthologous homologs), because they diverged relatively recently and because they have similar functions (???).
Large-scale orthology determination is often done using bidirectional best hits.
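A minimal sketch of bidirectional best hits, assuming the similarity searches have already been run in both directions and summarised as `{(query, subject): score}` dictionaries. The gene names and scores are invented for illustration; real pipelines typically rank by E-value or bit score and add coverage cut-offs.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(scores_1_vs_2, scores_2_vs_1):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(scores_1_vs_2)
    reverse = best_hits(scores_2_vs_1)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical scores: genome I genes g1, g2 against genome II genes h1, h2.
i_vs_ii = {("g1", "h1"): 35, ("g1", "h2"): 25, ("g2", "h1"): 30, ("g2", "h2"): 23}
ii_vs_i = {("h1", "g1"): 35, ("h1", "g2"): 30, ("h2", "g1"): 25, ("h2", "g2"): 23}

print(bidirectional_best_hits(i_vs_ii, ii_vs_i))  # [('g1', 'h1')]
```

In this toy case g2 and h2 both prefer a gene that prefers someone else, so only one pair survives. This is the conservative behaviour that makes the method attractive for large-scale use.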
20[Figure: hits among Genome I, Genome II and Genome III, labelled 35, 35, 23, 25, 25, 30, 22, 40, 35, 20]
Multiple genomes can be used to check for consistency of bidirectional best hits.
21[Figure: gene tree for genes A, B and C in Species I, II and III, showing gene duplication and speciation events and orthologous vs. non-orthologous pairs]
Strictly speaking, orthology is non-transitive (in contrast to homology): if A is orthologous to B and B is orthologous to C, then A is not necessarily orthologous to C.
22The solution to the non-transitivity of the concept of orthology sensu stricto is group orthology.
Conceptually: all proteins that are directly descended from one protein in the last common ancestor are considered orthologous to each other.
Operationally: combine all connected best triangular hits into Clusters of Orthologous Groups (COGs; Tatusov et al., 1997). WWW.NCBI.NLM.GOV
(Watch out for fusion/fission, though!)
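The operational step can be sketched as follows. The merging rule used here (two triangles are joined when they share a side, i.e. two genes) is an assumption about what "connected" means on this slide, and the function name and gene labels are hypothetical. Genes are written as `(genome, name)` tuples and the input is a list of bidirectional best-hit pairs.

```python
from itertools import combinations

def cluster_triangles(bbh_pairs):
    """Merge triangles of mutual best hits that share a side into groups."""
    edges = {frozenset(pair) for pair in bbh_pairs}
    neighbours = {}
    for a, b in (tuple(edge) for edge in edges):
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # A triangle: three genes from three different genomes, all pairwise linked.
    triangles = set()
    for a, b in (tuple(edge) for edge in edges):
        for c in neighbours[a] & neighbours[b]:
            if len({a[0], b[0], c[0]}) == 3:
                triangles.add(frozenset((a, b, c)))
    triangles = list(triangles)

    # Union-find over triangles; triangles sharing a side end up together.
    parent = list(range(len(triangles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    by_side = {}
    for idx, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            by_side.setdefault(side, []).append(idx)
    for members in by_side.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    groups = {}
    for idx, tri in enumerate(triangles):
        groups.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical best-hit pairs among four genomes.
pairs = [
    (("I", "a"), ("II", "a")), (("I", "a"), ("III", "a")), (("II", "a"), ("III", "a")),
    (("I", "a"), ("IV", "a")), (("II", "a"), ("IV", "a")),
    (("I", "b"), ("II", "b")),  # a lone pair, not part of any triangle
]
print(cluster_triangles(pairs))
```

The lone pair is left out because it never closes a triangle; requiring a third genome to agree is what makes the grouping more robust than isolated pairwise best hits.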
23(No Transcript)
24Variations in the rate of evolution can lead to misidentification of orthology relations when the latter are based on bi/multi-directional best hits.
25Because of independent loss events, and because
of variable rates of evolution, in large gene
families, orthology determination using
bi/multi-directional best hits does not always
resolve separate orthologous and/or functional
groups. One solution to this is the creation of
phylogenies
26(No Transcript)
27Prediction of orthology using phylogenies
(unrooted)
28Comparative genomics for pathway analysis,
prediction
29- Genome sequences
- Allowing us to interpret the function of proteins within the context in which they occur
- Reverse this process: predict the function of a protein from the context in which it tends to occur -> prediction of protein function/pathways from genome sequences
30Instead of interpreting, actually predicting
protein function using genomic association
[Figure: nucleoside salvage pathway. Enzymes: Cdd, DeoA, DeoB, DeoC, DeoD. Compounds: deoxycytidine; deoxyuridine, deoxythymidine; purine deoxyribonucleosides; deoxyribose-1-P; deoxyribose-5-P; glyceraldehyde-3-P, acetaldehyde. deoB is marked "?".]
[Figure: operon in M.genitalium and M.tuberculosis: deoD deoC deoA cdd pmm]
31Prediction that the cdd gene encodes a protein that (also) functions as a phosphoribomutase is based on:
- Genomic association (operon) with genes involved in the nucleoside salvage pathway.
- Conservation of this association among distantly related species.
- Substrate specificity is less conserved than catalytic function -> conserved is the mutase function, altered is the substrate specificity, from a mannose/glucose to a ribose.
- A phosphoribose mutase is required, and otherwise absent from the genome.
Such predictions of course have to be confirmed by experimental research.
32Types of Genomic Association for the Prediction
of Functional Interaction
- I gene fusion/fission
- II conservation of gene order (operons)
- III co-occurrence of genes in genomes
- IV shared regulatory elements
- V (conservation of) coexpression data
33Gene fission in the evolution of carbamoyl
phosphate synthase B (carB)
34Predicting functional interactions between
proteins by the co-occurrence of their genes in
genomes.
Distribution of four M.genitalium genes among 25 genomes:
MG299 (pta)   0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
Using the mutual information between genes as a scoring heuristic for their co-occurrence:
M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
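The three scores on this slide can be recomputed from the presence/absence profiles. The sketch below assumes mutual information over the empirical joint distribution of the two profiles, using the natural logarithm; the slide does not state the base, but this choice reproduces its numbers. The function and helper names are made up for the example.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length 0/1 profiles."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # sum over observed (a, b) of p(a,b) * ln( p(a,b) / (p(a) p(b)) )
    return sum(
        (count / n) * math.log(count * n / (px[a] * py[b]))
        for (a, b), count in joint.items()
    )

def profile(text):
    """Parse a space-separated string of 0/1 flags."""
    return [int(flag) for flag in text.split()]

# Profiles copied from the slide (25 genomes each).
pta = profile("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
ackA = profile("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
dnaJ = profile("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")
dnaK = profile("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")

print(round(mutual_information(pta, ackA), 2))   # 0.69
print(round(mutual_information(dnaJ, dnaK), 2))  # 0.55
print(round(mutual_information(dnaJ, ackA), 2))  # 0.19
```

Note that identical profiles score the entropy of the profile itself, which is why pta/ackA (present in about half the genomes) scores higher than dnaJ/dnaK (present in most genomes) even though both pairs co-occur perfectly.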
35Selectivity of Genomic Context for function
prediction
36Correlation between the strength of the genomic
and functional associations (operon)
37Genomic context vs. homology based function
prediction in M.genitalium
[Figure: Context 238; Homology 368; further values 21 and 26; label: Added info from genomic context]
38Combining homology information with genomic
association for function prediction
Repeated occurrence of MG009, which encodes a phosphohydrolase that is one of the most widespread enzymes on earth, with thymidylate kinase (tmk) suggests a role for MG009 in pyrimidine metabolism.
39Conservation of gene order of the hypothetical
gene MG134 with dnaX, RecR suggests physical
interaction between their gene products
40Conservation of co-expression after
gene-duplication or speciation
41Using KEGG metabolic maps to score functional
interaction
42Conservation of co-expression after speciation
(red) or after parallel gene duplication (black)
increases the signal for the prediction of
functional interactions
43(No Transcript)
44Finding Interaction Partners for a Human Disease Gene: frataxin
- Friedreich's ataxia
- No (homolog with) known function
- No gene fusion or gene order conservation
45(No Transcript)
46Ancestor Proteobacteria
[Figure: fdx, IscS, IscU, IscR, RnaM; axis: (time)]
47Mitochondrial iron-sulfur assembly
[Figure: pathway diagram with labels Arh1/fpr, Atm1, Cys, Ala, NifS, NifU, fdx, e-, S, Fe, 2Fe2S, HscA/SSQ1, HscB, "frataxin ?"; targets e.g. fdx, Complex I]
48Genomic context based function predictions that have been experimentally verified

Protein   Context           Type of interaction   Function                            Ref.
Mt-Ku     gene order        physical interaction  double-stranded DNA repair          44
GnlK      gene order        physical interaction  signal transduction for transport   55,56
PH0272    gene order        metabolic pathway     methylmalonyl-CoA racemase          43
PrpD      gene order        metabolic pathway     2-methylcitrate dehydratase         23,57
arok      gene order        metabolic pathway     shikimate kinase                    58
ComB      gene order        metabolic pathway     2-phosphosulfolactate phosphatase   59
Yfh1      co-occurrence     process               iron-sulfur protein maturation      26,27
YchB      co-occurrence     metabolic pathway     terpenoid synthesis                 60
SmpB      co-occurrence     process               trans-translation                   8,61
ThyX      complement.       enzymatic activity    thymidylate synthase                14,62
Prx       fusion            pathway               peroxiredoxin                       63
YgbB      fusion/order      metabolic pathway     terpenoid synthesis                 64
SelR      fus./ord./co-o.   process               methionine sulfoxide reductase      14,65,66