Title: Improving Gene Function Prediction Using Gene Neighborhoods
1 Improving Gene Function Prediction Using Gene Neighborhoods
- Kwangmin Choi
- Bioinformatics Program
- School of Informatics
- Indiana University, Bloomington, IN
2 Introduction: PLATCOM (A Platform for Computational Comparative Genomics)
- PLATCOM is a system for the comparative analysis of multiple genomes.
- PLATCOM consists of 3 components:
  - Databases of biological entities
    - e.g. fna, faa, ptt, gbk
  - Databases of relationships among entities
    - e.g. genome-genome, protein-protein pairwise comparison
  - Mining tools over the databases
- The web interface of the PLATCOM system is located at http://biokdd.informatics.indiana.edu/kwchoi/platcom/
3 PLATCOM Web Interface: Front Page of Genome Plot
4 Background: What is an operon?
- http://biocyc.org:1555/ECOLI/new-image?object=Transcription-Units
- The operon structure was described in 1960 by two French biologists. Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318-356.
- An operon is a group of genes that encodes functionally linked proteins. Its component genes are:
  - Adjacent (intergenic distance within 200-300 nt)
  - On the same strand (+ or -)
  - Co-expressed from one promoter
5 Background: How to identify or predict operon structure?
- When a promoter and terminator are known
  - Gene clusters: transcription units
  - Classical concept of operon
- When a promoter is not known
  - Gene clusters: directrons
  - Hypothetical operon candidates
  - Depending on direction and proper intergenic distance (200-300 nt)
- Computational methods have been developed to find gene clusters in bacterial genomes.
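The directron rule above (same strand, short intergenic distance) can be sketched in a few lines. This is a minimal illustration, not the PLATCOM implementation: the `Gene` record, the `find_directrons` name, the 300 nt default and the example coordinates are all assumptions made for the sketch.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate
    end: int     # inclusive end coordinate
    strand: str  # "+" or "-"


def find_directrons(genes, max_gap=300):
    """Group genes into runs on the same strand with intergenic gaps <= max_gap nt."""
    ordered = sorted(genes, key=lambda g: g.start)
    clusters = []
    current = []
    for gene in ordered:
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # nucleotides between the two genes
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            clusters.append(current)
        current = [gene]
    if current:
        clusters.append(current)
    # Only runs of two or more genes are kept as hypothetical operon candidates.
    return [c for c in clusters if len(c) >= 2]


# Hypothetical coordinates, for illustration only.
genes = [
    Gene("a", 100, 900, "+"),
    Gene("b", 950, 1800, "+"),
    Gene("c", 1900, 2500, "+"),
    Gene("d", 4000, 4600, "+"),  # gap to c is too large
    Gene("e", 4650, 5200, "-"),  # strand changes
]
directrons = find_directrons(genes)
print([[g.name for g in c] for c in directrons])
```

Raising `max_gap` merges more genes into one candidate, so the threshold trades sensitivity against false operon calls.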
6 PCBBH and PCH. R. Overbeek et al., PNAS, 1999, Vol. 96, pp. 2896-2901
- PCBBH: Pair of Close Bidirectional Best Hits
- BBH: Bidirectional Best Hits
- PCH: Pair of Close Homologs
- COG: Clusters of Orthologous Groups
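As a rough sketch of these definitions: a BBH is a pair of genes that are each other's top-scoring hit between two genomes, and a PCBBH is a pair of BBHs whose genes are close in both genomes. The function names, the score tuples and the precomputed "close" sets below are assumptions for illustration; how closeness is decided (e.g. an intergenic distance cutoff) is left outside the sketch.

```python
from itertools import combinations


def best_hits(hits):
    """Map each query gene to its highest-scoring subject gene.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


def pairs_of_close_bbh(bbh, close_in_a, close_in_b):
    """Pair up BBHs whose genes are close in both genomes.

    close_in_a / close_in_b: sets of frozensets, each holding two gene names
    that are considered close in that genome.
    """
    result = []
    for (a1, b1), (a2, b2) in combinations(bbh, 2):
        if frozenset((a1, a2)) in close_in_a and frozenset((b1, b2)) in close_in_b:
            result.append(((a1, b1), (a2, b2)))
    return result


# Hypothetical similarity scores, for illustration only.
a_vs_b = [("a1", "b1", 900), ("a1", "b2", 300), ("a2", "b2", 850), ("a3", "b1", 400)]
b_vs_a = [("b1", "a1", 880), ("b2", "a2", 840), ("b2", "a1", 310), ("b3", "a3", 200)]
bbh = bidirectional_best_hits(a_vs_b, b_vs_a)
pcbbh = pairs_of_close_bbh(bbh, {frozenset(("a1", "a2"))}, {frozenset(("b1", "b2"))})
print(bbh)
print(pcbbh)
```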
7 Background: Über-operon. P. Bork et al., Trends Biochem. Sci., Vol. 25, pp. 474-479
- Über-operon: a set of genes with close functional and regulatory contexts that tends to be conserved despite numerous rearrangements.
- This concept focuses on the functional themes of operons, not on specific genes or gene order.
8 Background: Why are gene clusters conserved?
- Certain operons, particularly those that encode subunits of multiprotein complexes (e.g. ribosomal proteins), are conserved in phylogenetically distant bacterial genomes.
- These gene clusters might have been conserved since the last universal common ancestor. Why?
- Selfish-operon hypothesis: horizontal transfer of an entire operon is favored by natural selection over transfer of individual genes because co-expression and co-regulation are preserved.
9 Background: Problems in Operon Prediction
- Over 150 genomes have been fully sequenced to date, but the biological functions of some genes are still unknown.
- There are only a few promoter detection algorithms, and they are not fully satisfactory.
- In many cases, genomic data files do not provide full information on genes and their products (e.g. gene name, COG, PID).
- Operons tend to undergo multiple rearrangements during evolution.
- As a result, gene order at a level above the operon is poorly conserved (e.g. genes involved in de novo purine synthesis).
10 Background: Problems in Computational Algorithms to Predict Operons
- Direct Signal Finding
- Experiment-based approach
- Transcription promoters (5'-end) and terminators (3'-end) are searched.
- Only effective for species whose transcription signals are well known, e.g. E. coli.
- Combination of gene expression data, functional annotation and other experimental data.
- Literature-based approach
- Primarily applicable to well-studied genomes such as E. coli, because data files are incomplete for other genomes.
- In many cases, genomic data files do not provide full information on genes and their products (e.g. gene name, COG, PID).
11 Procedure
- As a part of the PLATCOM project, an integrated whole-genome analysis system was built on the BIOKDD server.
- A web interface for the all-to-all pairwise comparison DB and tools is also provided.
- Several tools for multiple-genome analysis were written in Perl, and gene neighborhoods were then reconstructed from the clustering data.
- My gene clustering algorithm was used to compensate for the shortcomings of the literature-based approach.
- Connected gene neighborhoods were analyzed to predict gene function and functional coupling between clusters.
12 Materials/Tools
- Raw Data
  - 22 genomes were chosen for this study (14 groups).
  - Protein-Protein Pairwise Comparison Data
    - e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/L42023.faa.U00096.faa.cmp.txt
  - PTT files from NCBI site
    - e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/U00096.ptt.txt
- Data Generated by Web Tools
  - Gene Clustering Data (based on sequence homology)
    - e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/clustering_13321_23_750.txt
  - Gene Clusters generated from PTT file (given intergenic distance)
    - e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/candidates_22211.htm
- E. coli database for reality check
  - http://biocyc.org/
  - http://ecocyc.org/
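For readers unfamiliar with PTT files, the sketch below reads the tab-delimited protein table into gene records. It assumes the usual NCBI layout (two free-text header lines, then a column header starting with `Location`, with locations written as `start..end`). The `parse_ptt` helper and the sample rows are made up for illustration and are not taken from the study's data.

```python
import csv
import io


def parse_ptt(text):
    """Parse PTT-style text into a list of gene dictionaries."""
    lines = text.splitlines()
    # Skip the free-text header lines; the table starts at the "Location" row.
    header_idx = next(i for i, line in enumerate(lines) if line.startswith("Location"))
    table = io.StringIO("\n".join(lines[header_idx:]))
    reader = csv.DictReader(table, delimiter="\t", quoting=csv.QUOTE_NONE)
    genes = []
    for row in reader:
        start, end = row["Location"].split("..")
        genes.append({
            "start": int(start),
            "end": int(end),
            "strand": row["Strand"],
            "pid": row["PID"],
            "gene": row["Gene"],
            "cog": row["COG"],
            "product": row["Product"],
        })
    return genes


# Made-up sample in PTT layout (values are placeholders, not real annotations).
SAMPLE_PTT = "\n".join([
    "Example genome, complete genome - 1..5000",
    "2 proteins",
    "\t".join(["Location", "Strand", "Length", "PID", "Gene", "Synonym", "Code", "COG", "Product"]),
    "\t".join(["190..255", "+", "21", "1001", "geneA", "x0001", "-", "-", "hypothetical protein"]),
    "\t".join(["337..2799", "+", "820", "1002", "geneB", "x0002", "-", "COG0524G", "example kinase"]),
])
parsed = parse_ptt(SAMPLE_PTT)
print(parsed[0]["start"], parsed[0]["end"], parsed[1]["cog"])
```

A `-` in the COG column marks a gene with no COG assignment, which is the case the gene clustering data is meant to cover.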
13 Genomes: http://www.infobiogen.fr/services/deambulum/english/genomes2a.html
14 Procedure: My Approach to Reconstructing Genomic Neighborhoods
- The idea underlying this study is that:
- Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods.
- By generating a Tiling Path, the entire neighborhood can be reconstructed.
- The genomic context of a well-known genome (e.g. E. coli) is used as a contextual framework.
- Start by looking at this framework and then search for a group of similar gene neighborhoods in the target genomes.
- Genomic context means the pattern of a series of COGs. If a COG is not given, we can predict the function of an unknown gene based on my gene clustering data.
- We can also identify some Hitchhikers. Hitchhikers are inserted genes that originate from different contexts/themes.
15 Tiling Path. V. Koonin et al., Nucleic Acids Research, 2002, Vol. 30, No. 10, pp. 2212-2223
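The tiling idea can be sketched as a greedy merge: start from a seed neighborhood (the contextual framework) and repeatedly add any genome fragment that overlaps what has been collected so far. This is a simplified illustration and not the published algorithm. The `reconstruct_neighborhood` name, the `min_shared` overlap threshold and the genome labels are assumptions, and newly found COGs are simply appended rather than placed at their genomic position.

```python
def reconstruct_neighborhood(seed, fragments, min_shared=2):
    """Greedily tile overlapping fragments onto a seed neighborhood.

    seed: ordered list of COG IDs from the reference genome.
    fragments: dict mapping genome label -> ordered list of COG IDs.
    Returns (neighborhood, tiling_path), where tiling_path lists the genomes
    in the order their fragments were merged.
    """
    neighborhood = list(seed)
    members = set(seed)
    remaining = dict(fragments)
    tiling_path = []
    added = True
    while added:  # repeat until a full pass merges nothing new
        added = False
        for genome, cogs in list(remaining.items()):
            if len(members & set(cogs)) >= min_shared:
                for cog in cogs:
                    if cog not in members:
                        members.add(cog)
                        neighborhood.append(cog)
                tiling_path.append(genome)
                del remaining[genome]
                added = True
    return neighborhood, tiling_path


# Hypothetical fragments; "X1" and "X2" stand for unrelated genes.
seed = ["COG0779", "COG0195", "COG2740"]
fragments = {
    "genomeB": ["COG2740", "COG0532", "COG0858"],  # overlaps only after genomeA is merged
    "genomeA": ["COG0195", "COG2740", "COG0532"],
    "genomeC": ["X1", "X2"],                       # never overlaps, so it is left out
}
neighborhood, tiling_path = reconstruct_neighborhood(seed, fragments)
print(neighborhood)
print(tiling_path)
```

In this toy input, genomeB can only be attached once genomeA has extended the neighborhood, which is the point of a tiling path: no single genome needs to contain the whole neighborhood.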
16 Gene Neighborhoods
17 Results
- Case 1
  - Relationship between Gene Order and Phylogenetic Distance
- Case 2
  - One theme: Typical Operon (rbs operon)
  - Reconstruct gene neighborhoods
  - Find missing components from the reconstructed gene clusters
- Case 3
  - Two or more themes: Functional Coupling?
  - Find genomic hitchhikers
  - Predict gene function of uncharacterized protein
  - Predict functional coupling
18 Case 1: Gene Order and Phylogenetic Distance
- If the gene order of two genomes is well conserved, the sequence of homologs should appear as a line on the genome comparison diagonal plot.
- What is the relationship between phylogenetic distance and the conservation of gene order?
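One way to make the "line on the diagonal plot" concrete is to look for runs of homolog pairs whose gene indices advance together in both genomes. The sketch below assumes homologs are given as `(i, j)` gene-index pairs and only detects forward diagonals with no gaps. The `collinear_runs` name and the `min_len` threshold are illustrative choices, not part of PLATCOM.

```python
def collinear_runs(pairs, min_len=3):
    """Find diagonal runs in a set of homolog pairs.

    pairs: iterable of (i, j), the gene indices of a homologous pair in
    genome 1 and genome 2. A run is a maximal chain (i, j), (i+1, j+1), ...
    """
    pair_set = set(pairs)
    runs = []
    for i, j in sorted(pair_set):
        if (i - 1, j - 1) in pair_set:
            continue  # this point continues an earlier run, so it is not a start
        run = [(i, j)]
        while (run[-1][0] + 1, run[-1][1] + 1) in pair_set:
            run.append((run[-1][0] + 1, run[-1][1] + 1))
        if len(run) >= min_len:
            runs.append(run)
    return runs


# Hypothetical homolog pairs: one run of four genes, one of two, one isolated hit.
pairs = [(0, 5), (1, 6), (2, 7), (3, 8), (10, 2), (11, 3), (20, 40)]
print(collinear_runs(pairs))
print(collinear_runs(pairs, min_len=2))
```

A fully conserved gene order would give one long run, while scattered short runs correspond to the fragmented clusters discussed for Case 1.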
19 Phylogenetic Tree. V. Daubin et al., Genome Research, Vol. 12, Issue 7, pp. 1080-1090
20 Genome Comparison Diagonal Plot: Phylogenetically Distant Species (Z-score over 500)
21 Genome Comparison Diagonal Plot: Phylogenetically Close Species (Z-score > 1000)
22 Fragmented Gene Clusters
23 Case 1: Conclusion
- Gene order in phylogenetically distant species is poorly conserved.
- But this observation does not mean that gene order is conserved very well among phylogenetically close species.
- Even in the case of very close species (e.g. E. coli vs. H. influenzae), gene order is completely scattered.
- In most cases, only a small number of genes are observed as a short line or cluster, and we may consider it a putative operon.
- In the next step, this possibility will be investigated in depth.
24 Case 2: Rbs Operon (Typical Operon)
- Theme: Ribose transport across membrane
  - COG1869: D-ribose high-affinity transport system membrane-associated protein
  - COG1129: ATP-binding component of D-ribose high-affinity transport system
  - COG1172: D-ribose high-affinity transport system
  - COG1879: D-ribose periplasmic binding protein
  - COG0524: ribokinase
  - COG1609: regulator for rbs operon
- http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU00206
25 Case 2: Rbs Operon (Z-score over 750, Intergenic Distance 300)
26 Case 2: Conclusion
- All components are involved in ribose transport across the bacterial cell membrane.
- In the Rbs operon system, the gene order pattern is 1869-1129-1172-1879-0524-1609.
- 10 out of 22 genomes have this operon system.
- Except in some cases, this gene order pattern is conserved very well.
- So it is possible that there exists a kind of General Contextual Framework of gene order.
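Checking a genome's cluster against the reference pattern can be sketched as follows. The helper reports which pattern components are missing and whether those that are present keep the reference order, reading in either direction to allow for the opposite strand. The `compare_to_pattern` name and the example clusters are hypothetical and are not results from the 22 genomes.

```python
RBS_PATTERN = ["COG1869", "COG1129", "COG1172", "COG1879", "COG0524", "COG1609"]


def compare_to_pattern(cluster, pattern=RBS_PATTERN):
    """Compare an ordered list of COG IDs with a reference gene order pattern."""
    present = [cog for cog in cluster if cog in pattern]
    missing = [cog for cog in pattern if cog not in cluster]
    ranks = [pattern.index(cog) for cog in present]
    # Order is conserved if ranks rise (same strand) or fall (opposite strand).
    forward = all(a < b for a, b in zip(ranks, ranks[1:]))
    reverse = all(a > b for a, b in zip(ranks, ranks[1:]))
    return {"missing": missing, "order_conserved": forward or reverse}


# Hypothetical clusters, for illustration only.
partial = ["COG1129", "COG1172", "COG1879", "COG0524"]
shuffled = ["COG1172", "COG1869", "COG1129"]
print(compare_to_pattern(RBS_PATTERN))
print(compare_to_pattern(partial))
print(compare_to_pattern(shuffled))
```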
27 Case 3: Functional Coupling of 2 or More Themes
- Theme 1: Transcription
  - COG0779: Uncharacterized conserved protein
  - COG0195: Transcription elongation factor
  - COG2740: Predicted nucleic-acid-binding protein (transcription termination?)
- Theme 2: Translation
  - COG1358: Ribosomal protein S17E
  - COG0532: Translation initiation factor 2 (GTPase)
  - COG1550: Uncharacterized conserved protein
  - COG0858: Ribosome-binding factor A
  - COG0184: Ribosomal protein S15P/S13E
  - COG0130: tRNA pseudouridine synthase
- Hitchhiker?
  - COG0196: FAD synthase (Hitchhiker?)
- http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU341
28 Case 3: Functional Coupling (Z-score over 750, Intergenic Distance 300)
29 Case 3: Conclusion
- Functional Coupling
  - In bacteria, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious.
  - COG0779 (Uncharacterized) is almost inseparable from COG0195 (Transcription Elongation Factor), so it is likely to be a functional partner of COG0195.
- Hitchhiker
  - The association of COG0196 (FAD synthase) is not as tight as the connections between the genes belonging to the themes.
- Gene function prediction
  - The functions of 3 genes in the AE0004092 genome can be predicted by reading genomic context.
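How "tight" an association is can be quantified in several ways. One simple option, used here only as an illustration, is the Jaccard index of the sets of genomes in which each COG appears in the neighborhood. The slides do not state which measure was used, so the `association` function and the example neighborhoods below are assumptions.

```python
def association(cog_a, cog_b, neighborhoods):
    """Jaccard index of the genome sets whose neighborhood contains each COG.

    neighborhoods: dict mapping genome label -> set of COG IDs found in the
    reconstructed neighborhood of that genome.
    """
    with_a = {g for g, cogs in neighborhoods.items() if cog_a in cogs}
    with_b = {g for g, cogs in neighborhoods.items() if cog_b in cogs}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


# Hypothetical neighborhoods, for illustration only.
neighborhoods = {
    "g1": {"COG0779", "COG0195", "COG2740", "COG0532", "COG0196"},
    "g2": {"COG0779", "COG0195", "COG0532"},
    "g3": {"COG0779", "COG0195", "COG2740"},
    "g4": {"COG0532", "COG0858"},
}
tight = association("COG0779", "COG0195", neighborhoods)
loose = association("COG0196", "COG0195", neighborhoods)
print(tight, loose)
```

In this toy input COG0779 and COG0195 always occur together, while COG0196 joins them in only one genome, which is the pattern expected of a hitchhiker.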
30 Conclusion
- The Genome Comparison Diagonal Plot visualizes the sequence comparison of 2 genomes. It is a simple tool, but it gives a strong intuition for genome structure.
- Conserved gene neighborhoods reconstructed from many genomes by the Tiling Path Method can be used to predict the functions of uncharacterized genes and functional coupling between well-characterized genes in those genomes.
- Ultimately, we can use this method to reconstruct metabolic and functional subsystems.
31 Acknowledgements
- Haifeng Zhao
  - Genome Pairwise Comparison DB
- Scott Martin
  - Server Management and Technical Support
- Dr. Sun Kim
  - Graduate Advisor and P.I.