Title: Computational functional genomics
1Computational functional genomics
2Introduction
- Piles of information but only flakes of knowledge.
- Collections of genomic sequences
- Expression profiles
- Protein-protein interactions
- And many more
3Introduction
- Computational biology strives to extract the
maximal possible information from known
sequences, by classifying them according to their
homologous relationships, predicting their
biochemical activity, cellular function,
3-dimensional structures and evolutionary origin.
4The COG-Clusters of Orthologous Groups of proteins
- Identification of orthologs is critical for
reliable prediction of gene function in newly
sequenced genomes.
- The purpose of COG is to serve as a platform for
functional annotation of newly sequenced genomes
and for study of genome evolution.
- COGs reflect one-to-one, one-to-many and many-to-many
orthologous relationships.
5The COG-statistics
- In 2003, there were 3307 COGs including 74059
proteins from 43 genomes.
- Genomes from Bacteria, Archaea and Eukaryota.
- The database includes 17 functional groups.
6The COG- make on your own
- The COG construction procedure is based on the notion
that any group of at least 3 proteins from
distant genomes that are more similar to each
other than to any other protein from the same
genomes is most likely to belong to an
orthologous family.
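The notion above can be sketched as a search for triangles of best hits across three genomes. This is a minimal illustration, not the COG pipeline itself: it assumes the all-against-all comparison has already been reduced to a hypothetical `best_hits` table (protein to its best hit in each other genome), and it simplifies consistency to symmetric best hits.

```python
from itertools import combinations

def mutual(a, b, best_hits, genome_of):
    """True if a and b are each other's best hit in the other's genome."""
    return (best_hits.get(a, {}).get(genome_of[b]) == b
            and best_hits.get(b, {}).get(genome_of[a]) == a)

def find_triangles(best_hits, genome_of):
    """Return triples of proteins from 3 different genomes that are
    pairwise mutual best hits (candidate seeds for an orthologous group)."""
    triangles = set()
    for a, b, c in combinations(sorted(genome_of), 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue  # the three proteins must come from three genomes
        if all(mutual(x, y, best_hits, genome_of)
               for x, y in ((a, b), (a, c), (b, c))):
            triangles.add((a, b, c))
    return triangles

# Hypothetical toy data: a2 is a paralog of a1 that is nobody's best hit.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
best_hits = {
    "a1": {"B": "b1", "C": "c1"},
    "a2": {"B": "b1", "C": "c1"},
    "b1": {"A": "a1", "C": "c1"},
    "c1": {"A": "a1", "B": "b1"},
}
triangles = find_triangles(best_hits, genome_of)
print(triangles)
```

In this toy input only a1, b1 and c1 form a triangle; a2 is left out because b1 and c1 do not point back to it.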
7The COG- make on your own
All-against-all protein sequence comparison
9The COG- adding new genomes
- The COGNITOR program adds new proteins to
pre-existing COGs on the basis of multiple Best
Hits.
- 60-80% of the proteins of prokaryotes could be
included.
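Assignment by multiple best hits can be sketched as a simple vote. This is an assumption-laden illustration, not the COGNITOR program: the function name, the input layout and the `min_hits=2` threshold are all hypothetical.

```python
from collections import Counter

def assign_cog(best_hit_cogs, min_hits=2):
    """best_hit_cogs maps genome -> COG of the new protein's best hit in
    that genome (None if the best hit belongs to no COG).
    Returns the COG supported by at least `min_hits` genomes, else None."""
    counts = Counter(c for c in best_hit_cogs.values() if c is not None)
    if not counts:
        return None
    cog, n = counts.most_common(1)[0]
    return cog if n >= min_hits else None

# Hypothetical example: two of three best hits fall into the same COG.
assigned = assign_cog({"A": "COG0001", "B": "COG0001", "C": "COG0002"})
unassigned = assign_cog({"A": "COG0001", "B": None})
print(assigned, unassigned)
```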
10The COG- more applications
- Convenient for a variety of evolutionary-oriented
analyses of protein families.
11Methods
Biochemical and genetic experiments
Homology method (BLAST), mRNA expression
Phylogenetic profile
Fusion method (Rosetta stone analysis)
Gene neighbour method
12Homology method
- The homology method searches for proteins whose amino acid
sequences are similar.
- 40-70% of the genes in a new genome can be assigned
some function.
- Involves identification of some molecular function.
13mRNA expression
- Analysis of correlated mRNA expression levels
makes it possible to establish functional linkages, by
detecting changes in mRNA expression in different
cell types or different environments.
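One common way to quantify correlated expression is the Pearson correlation between expression vectors measured over the same conditions. The sketch below is an illustration under assumptions: the gene names, values and the 0.9 threshold are hypothetical, and the slides do not say which correlation measure was used.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sx * sy)

def coexpressed_pairs(expression, threshold=0.9):
    """Return gene pairs whose profiles correlate above the threshold."""
    return [(g1, g2) for g1, g2 in combinations(sorted(expression), 2)
            if pearson(expression[g1], expression[g2]) >= threshold]

# Hypothetical expression levels across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pairs = coexpressed_pairs(expression)
print(pairs)
```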
14Phylogenetic profile
- Describes the pattern of presence or absence of a
particular protein, across a set of organisms.
- Number of possible profiles for n genomes: 2^n.
- This number far exceeds the number of protein families.
15Phylogenetic profile
- Why would two proteins always both be inherited
into new species or neither inherited, unless the
two function together?
- If two proteins have the same phylogenetic
profile, it is inferred that they have a
functional link engaged in a common pathway or
complex.
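The inference above can be sketched by encoding each profile as a tuple of presence/absence bits and grouping proteins that share a profile. The genome and protein names are hypothetical, and real analyses typically also tolerate profiles that differ in a few positions.

```python
from collections import defaultdict

def phylogenetic_profiles(presence, genomes):
    """presence maps protein -> set of genomes containing a homolog.
    Returns protein -> tuple of 0/1 bits, one bit per genome."""
    return {p: tuple(int(g in found) for g in genomes)
            for p, found in presence.items()}

def group_by_profile(profiles):
    """Group proteins with identical profiles; keep groups of 2 or more."""
    groups = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    return {prof: sorted(members) for prof, members in groups.items()
            if len(members) > 1}

# Hypothetical data for four genomes.
genomes = ["g1", "g2", "g3", "g4"]
presence = {
    "P1": {"g1", "g3"},
    "P2": {"g1", "g3"},
    "P3": {"g2", "g3", "g4"},
}
profiles = phylogenetic_profiles(presence, genomes)
linked = group_by_profile(profiles)
print(linked)
print(2 ** len(genomes))  # number of possible profiles for these genomes
```

Here P1 and P2 share the profile (1, 0, 1, 0) and would be predicted to be functionally linked.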
17Phylogenetic profile- example
- Analysis of three proteins RL7, FlgL and His5,
according to their phylogenetic profiles.
- RL7: more than half of the proteins with similar profiles
have functions associated with the ribosome.
- FlgL: more than half include various flagellar
proteins and cell-wall maintenance proteins.
- His5: more than half are involved in amino acid
metabolism.
18Phylogenetic profile- example
PgsA: phospholipid synthesis
YGGH: hypothetical
YBEX: hypothetical
RL34: ribosome L34
RL36: ribosome L36
RL27: ribosome L27
RL25: ribosome L25
YQCB: hypothetical
YABO: hypothetical
YCEC: hypothetical
RFH: peptide release factor
ClpB: heat shock protein
RL7: ribosome L7
RL15: ribosome L15
RL17: ribosome L17
PTH: peptidyl-tRNA hydrolase
RNC: ribonuclease III
YJFH: hypothetical
RS14: ribosome S14
GidB: glucose-inhibited division
RL24: ribosome L24
DEF: polypeptide deformylase
RL20: ribosome L20
MesJ: cell cycle protein
RL19: ribosome L19
RL21: ribosome L21
RL9: ribosome L9
SmpB: small protein B
G3P3: dehydrogenase
RL4: ribosome L4
NONE: hypothetical
GrpE: co-chaperone
19Phylogenetic profile
Phylogenetic profiles link proteins with
similar keywords
20Fusion method or the Rosetta stone analysis
- Some pairs of interacting proteins have homologs
in another organism, fused into a single protein
chain.
- When two separate proteins in one organism, A and
B, are expressed as a fused protein in some other
species, there is a high probability that A and B
are linked in function.
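The fusion idea can be sketched as follows, assuming homology has already been reduced to family labels. Everything here is hypothetical: each query protein is treated as belonging to a single family, and each reference protein is described by the set of families it contains.

```python
from itertools import combinations

def rosetta_links(query, reference):
    """query maps protein -> family label (separate proteins in one organism).
    reference maps protein -> set of family labels found in that protein
    (proteins from other organisms).
    Returns {(a, b): [fused proteins]} for pairs linked by a fusion."""
    links = {}
    for a, b in combinations(sorted(query), 2):
        fam_a, fam_b = query[a], query[b]
        if fam_a == fam_b:
            continue  # homologs of each other, not a fusion signal
        stones = sorted(r for r, fams in reference.items()
                        if fam_a in fams and fam_b in fams)
        if stones:
            links[(a, b)] = stones
    return links

# Hypothetical example: protA and protB are fused in another species.
query = {"protA": "famA", "protB": "famB", "protC": "famC"}
reference = {"fusedAB": {"famA", "famB"}, "single": {"famC"}}
links = rosetta_links(query, reference)
print(links)
```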
21Fusion method
22The Rosetta Stone model
23Fusion method what is it good for?
- Predicts protein pairs that have related
biological functions.
- Predicts potential protein-protein interactions.
- Can turn up complexes of proteins, or protein
pathways.
25Fusion method
- The group searched the 4290 protein sequences of
the E.coli genome.
- The proteins could form at most (4290)(4289)/2
pair interactions, but we expect far fewer.
- 6809 candidate pair interactions were
found.
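As a quick check of the arithmetic, the upper bound on pairs and the share of it covered by the candidates can be computed directly. The variable names are illustrative only.

```python
from math import comb

n_proteins = 4290
candidates = 6809

# Number of unordered pairs among n proteins: n * (n - 1) / 2
max_pairs = comb(n_proteins, 2)
fraction = candidates / max_pairs

print(max_pairs)  # 9199905
print(f"{fraction:.4%} of all possible pairs are candidates")
```

The candidates therefore cover well under a tenth of a percent of the roughly 9.2 million possible pairs.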
26Fusion method validation
- Looking for a similar function in existing
annotations that would imply at least functional
interaction.
- Of the E. coli pairs that were found in the
Rosetta Stone analysis, 68% share at least one
keyword in their annotations, whereas of E. coli
protein pairs that were selected randomly, only 15%
share a keyword.
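The keyword comparison can be sketched as the fraction of pairs whose annotation keyword sets overlap. The protein names and keywords below are hypothetical; the same function would be run once on predicted pairs and once on random pairs.

```python
def fraction_sharing_keyword(pairs, keywords):
    """pairs: list of (protein, protein); keywords: protein -> set of
    annotation keywords. Returns the fraction of pairs with >= 1 shared
    keyword (0.0 for an empty list of pairs)."""
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs
                 if keywords.get(a, set()) & keywords.get(b, set()))
    return shared / len(pairs)

# Hypothetical annotations.
keywords = {
    "p1": {"ribosome", "translation"},
    "p2": {"translation"},
    "p3": {"flagellum"},
}
predicted = [("p1", "p2"), ("p1", "p3")]
score = fraction_sharing_keyword(predicted, keywords)
print(score)
```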
27Fusion method validation
- Of the protein pairs in a database of experimentally
determined interactions,
6.4% are linked by Rosetta Stone sequences.
- The phylogenetic profile method was applied to
the interactions predicted by the fusion method.
More than 8 times as many of these interactions were
also suggested by the phylogenetic profile method as
for randomly chosen sets of interactions.
28Fusion method missing pairs
- There was no fusion of the interacting proteins.
- The fused protein disappeared during the course
of evolution.
29Fusion method False alarms
- False prediction of physical interactions when
the proteins are fused, but are only co-regulated and
do not interact.
- Cannot distinguish between homologs that bind
and those that do not.
30Fusion method False alarms
- The false positive rate in E. coli due to the
inability to distinguish homologs is about 82%.
- To reduce these errors, the promiscuous domains
were found and removed during the analysis.
- By filtering out only 5% of all domains, we can
remove the majority of falsely predicted
interactions.
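One way to picture this filtering step is to rank domains by how many distinct partners they are linked to and drop the top fraction. This is a sketch under assumptions: the slides do not say how promiscuity was scored, so partner count is used here as a stand-in, and the domain names are hypothetical.

```python
from collections import defaultdict

def filter_promiscuous(links, fraction=0.05):
    """links: list of (domain, domain) predicted links.
    Removes the `fraction` of domains with the most distinct partners
    (at least one domain) and every link that involves them."""
    partners = defaultdict(set)
    for a, b in links:
        partners[a].add(b)
        partners[b].add(a)
    # Most-connected domains first; ties broken by name for determinism.
    ranked = sorted(partners, key=lambda d: (-len(partners[d]), d))
    n_remove = max(1, int(fraction * len(ranked)))
    promiscuous = set(ranked[:n_remove])
    kept = [(a, b) for a, b in links
            if a not in promiscuous and b not in promiscuous]
    return kept, promiscuous

# Hypothetical data: hub domain D0 is linked to 19 other domains.
links = [("D0", f"D{i}") for i in range(1, 20)] + [("D1", "D2")]
kept, promiscuous = filter_promiscuous(links)
print(promiscuous, kept)
```

With 20 domains, removing 5% means dropping the single hub, which eliminates 19 of the 20 links in this toy input.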
32Neighbour method
- Functional links between genes can be identified
by examining whether the proximity of the genes
is conserved across multiple genomes.
- Powerful in uncovering functional linkages in
prokaryotes where operons are common.
34Neighbour method- definitions
- Close (proximate) genes: genes on the same strand,
within 300 bp of each other, and transcribed in the same
direction.
- Direct link: two proximate genes that are also
proximate in at least two other genomes of
different phylogenetic groups.
- Inferred link: two genes that are not close but
whose orthologs are close in at least three
other genomes of different phylogenetic groups.
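These definitions can be sketched in code. The `Gene` record, the coordinate convention and the way phylogenetic groups are passed in are assumptions made for illustration; genes on the same strand are taken to be transcribed in the same direction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    name: str
    strand: str  # "+" or "-"
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate

def is_proximate(g1, g2, max_gap=300):
    """Same strand and at most `max_gap` bp between the two genes."""
    if g1.strand != g2.strand:
        return False
    first, second = sorted((g1, g2), key=lambda g: g.start)
    return second.start - first.end <= max_gap

def classify_link(close_here, groups_with_close_orthologs):
    """close_here: are the two genes proximate in this genome?
    groups_with_close_orthologs: phylogenetic groups of the other genomes
    in which their orthologs are proximate."""
    n_groups = len(set(groups_with_close_orthologs))
    if close_here and n_groups >= 2:
        return "direct"
    if not close_here and n_groups >= 3:
        return "inferred"
    return None

# Hypothetical genes: a and b are 200 bp apart on the same strand.
a = Gene("a", "+", 100, 900)
b = Gene("b", "+", 1100, 2000)
c = Gene("c", "-", 2100, 3000)
print(is_proximate(a, b), is_proximate(b, c))
print(classify_link(is_proximate(a, b), ["group1", "group2"]))
```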
35Neighbour method- definitions
36Neighbour method
- Proximity between genes is maintained mostly
because it facilitates their co-transfer to
another organism.
- Example: restriction-modification systems.
37Neighbour method- validation
- Identify the links whose genes are annotated in
KEGG or COG, and calculate the fraction of those
in the same functional pathway / category.
- The functional correspondence correlates with
the minimal number of phylogenetic groups in
which the proximity is detected.
38Neighbour method- validation
N tradeoff
39Neighbour method- example
40Happy end???
- The group analyzed the 6,217 proteins of the
yeast Saccharomyces cerevisiae, combining several methods.
- One can expect each protein to be functionally
linked to perhaps 5-50 other proteins, giving
30,000-300,000 biologically meaningful links.
42Networks
- When methods of detecting functional linkages are
applied to all the proteins of an organism,
a network of interacting, functionally linked
proteins can be traced.
- As methods improve for detecting protein
linkages, it seems likely that most of the
proteins will be included in the network.
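Tracing such a network amounts to building a graph from the predicted links and finding its connected components. The sketch below uses a plain depth-first search on hypothetical protein names; it makes no claim about how the original analysis was implemented.

```python
from collections import defaultdict

def connected_components(links):
    """links: list of (protein, protein) functional links.
    Returns the networks as sorted lists of protein names."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

# Hypothetical links pooled from several prediction methods.
links = [("A", "B"), ("B", "C"), ("D", "E")]
networks = connected_components(links)
print(networks)
```

As more links are detected, separate components merge, which is the sense in which most proteins would end up in one network.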