Improving Gene Function Prediction Using Gene Neighborhoods

1
Improving Gene Function Prediction Using Gene
Neighborhoods
  • Kwangmin Choi
  • Bioinformatics Program
  • School of Informatics
  • Indiana University, Bloomington, IN

2
Introduction: PLATCOM (A Platform for
Computational Comparative Genomics)
  • PLATCOM is a system for the comparative analysis
    of multiple genomes.
  • PLATCOM consists of 3 components:
    • Databases of biological entities
      (e.g. fna, faa, ptt, gbk)
    • Databases of relationships among entities
      (e.g. genome-genome, protein-protein pairwise
      comparison)
    • Mining tools over the databases
  • The web interface of the PLATCOM system is located at
    http://biokdd.informatics.indiana.edu/kwchoi/platcom/

3
PLATCOM Web Interface: Front Page of Genome Plot
4
Background: What is an operon?
http://biocyc.org:1555/ECOLI/new-image?object=Transcription-Units
  • The operon structure was found in 1960 by 2
    French biologists: Jacob, F. and Monod, J. (1961)
    Genetic regulatory mechanisms in the synthesis of
    proteins. J. Mol. Biol., 3, 318-356.
  • An operon is a group of genes that encodes
    functionally linked proteins. Its component genes are:
    • Adjacent (200-300 nt)
    • On the same strand (+ or -)
    • Co-expressed from one promoter

5
Background: How to identify or predict operon
structure?
  • When a promoter and terminator are known:
    • Gene clusters are Transcription Units
    • Classical concept of operon
  • When a promoter is not known:
    • Gene clusters are Directrons
    • Hypothetical operon candidates
    • Defined by direction and proper intergenic
      distance (200-300 nt)
  • Computational methods have been developed to find
    gene clusters in bacterial genomes.
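
A minimal Python sketch of the directron idea described on this slide: genes are grouped when they lie on the same strand and the gap to the previous gene is at most 300 nt. The `Gene` class, the `find_directrons` function, the `max_gap` parameter and the example coordinates are illustrative assumptions, not the Perl tools used in this study.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # first base of the gene (1-based)
    end: int     # last base of the gene (inclusive)
    strand: str  # "+" or "-"


def find_directrons(genes, max_gap=300):
    """Group genes into same-strand runs whose intergenic gaps are <= max_gap nt."""
    clusters, current = [], []
    for gene in sorted(genes, key=lambda g: g.start):
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            clusters.append(current)
        current = [gene]
    if current:
        clusters.append(current)
    # A single gene is not a cluster, so keep runs of two or more.
    return [c for c in clusters if len(c) >= 2]


# Made-up coordinates, for illustration only.
genes = [
    Gene("a", 1, 900, "+"),
    Gene("b", 950, 1800, "+"),
    Gene("c", 1900, 2500, "-"),
    Gene("d", 2600, 3300, "-"),
    Gene("e", 5000, 5600, "-"),
]
directrons = [[g.name for g in c] for c in find_directrons(genes)]
print(directrons)
```

Here genes a and b form one candidate and c and d another; e is left out because its gap to d is far larger than 300 nt.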

6
PCBBH and PCH (R. Overbeek et al., PNAS, 1999,
Vol. 96, pp. 2896-2901)
  • PCBBH: Pair of Close Bidirectional Best Hits
  • BBH: Bidirectional Best Hits
  • PCH: Pair of Close Homologs
  • COG: Clusters of Orthologous Groups
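
As a rough illustration of these terms, the Python sketch below finds bidirectional best hits from two score tables and then keeps pairs of BBHs whose genes are close in both genomes. The function names, the dictionary layout of the scores and the precomputed "close" sets are assumptions made for this example; they are not taken from the cited paper's implementation.

```python
from itertools import combinations


def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (gene_a, gene_b) pairs that are each other's best hit."""
    fwd = best_hits(a_vs_b)
    rev = best_hits(b_vs_a)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


def pcbbh(bbh_pairs, close_in_a, close_in_b):
    """Pairs of BBHs whose genes are close together in both genomes."""
    result = []
    for (xa, xb), (ya, yb) in combinations(bbh_pairs, 2):
        if frozenset((xa, ya)) in close_in_a and frozenset((xb, yb)) in close_in_b:
            result.append(((xa, ya), (xb, yb)))
    return result


# Made-up similarity scores between genes of genome A and genome B.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 300, ("a2", "b2"): 800, ("a3", "b1"): 400}
b_vs_a = {("b1", "a1"): 900, ("b2", "a2"): 800, ("b2", "a1"): 300}
bbh = bidirectional_best_hits(a_vs_b, b_vs_a)

# Gene pairs assumed to be close (same strand, short gap) in each genome.
close_in_a = {frozenset(("a1", "a2"))}
close_in_b = {frozenset(("b1", "b2"))}
print(bbh, pcbbh(bbh, close_in_a, close_in_b))
```

Gene a3 also hits b1, but b1's best hit is a1, so a3 is not part of any BBH.
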
7
Background: Über-operon (P. Bork et al., Trends
Biochem. Sci., Vol. 25, pp. 474-479)
  • Über-operon: A set of genes with close
    functional and regulatory contexts that tends to
    be conserved despite numerous rearrangements.
  • This concept focuses on the functional themes of
    operons, not on specific genes or gene order.

8
Background: Why are gene clusters conserved?
  • Certain operons, particularly those that encode
    subunits of multiprotein complexes (e.g.
    ribosomal proteins) are conserved in
    phylogenetically distant bacterial genomes.
  • These gene clusters might have been conserved
    since the last universal common ancestor. Why?
  • Selfish-operon hypothesis: Horizontal transfer of
    an entire operon is favored by natural selection
    over transfer of individual genes because
    co-expression and co-regulation are preserved.

9
Background: Problems in Operon Prediction
  • Over 150 genomes have been fully sequenced to
    date, but the biological functions of some genes
    are still unknown.
  • There are only a few promoter detection
    algorithms, and they are not fully satisfactory.
  • In many cases, genomic data files do not provide
    full information on genes and their products
    (e.g. gene name, COG, PID).
  • Operons tend to undergo multiple rearrangements
    during evolution.
  • As a result, gene order at a level above the
    operon is poorly conserved (e.g. genes involved
    in de novo purine synthesis).

10
Background: Problems in Computational
Algorithms to Predict Operons
  • Direct Signal Finding
  • Experiment-based approach
  • Transcription promoters (5'-end) and terminators
    (3'-end) are searched for.
  • Only effective for species whose transcription
    signals are well known, e.g. E. coli.
  • Combination of gene expression data, functional
    annotation and other experimental data.
  • Literature-based approach
  • Primarily applicable to well-studied genomes such
    as E. coli, because data files are incomplete for
    other genomes.
  • In many cases, genomic data files do not provide
    full information on genes and their products
    (e.g. gene name, COG, PID).

11
Procedure
  • As a part of the PLATCOM project, an integrated
    whole-genome analysis system was built on the
    BIOKDD server.
  • A web interface for the all-to-all pairwise
    comparison DB and tools is also provided.
  • Several tools for multi-genome analysis were
    written in Perl, and gene neighborhoods were then
    reconstructed from the clustering data.
  • My gene clustering algorithm was used to
    compensate for the shortcomings of the
    literature-based approach.
  • Connected gene neighborhoods were analyzed to
    predict gene function and functional coupling
    between clusters.

12
Materials / Tools
  • Raw Data
    • 22 genomes were chosen for this study (14
      groups).
    • Protein-protein pairwise comparison data, e.g.
      http://biokdd.informatics.indiana.edu/kwchoi/Thesis/L42023.faa.U00096.faa.cmp.txt
    • PTT files from the NCBI site, e.g.
      http://biokdd.informatics.indiana.edu/kwchoi/Thesis/U00096.ptt.txt
  • Data Generated by Web Tools
    • Gene clustering data (based on sequence
      homology), e.g.
      http://biokdd.informatics.indiana.edu/kwchoi/Thesis/clustering_13321_23_750.txt
    • Gene clusters generated from PTT files (given
      intergenic distance), e.g.
      http://biokdd.informatics.indiana.edu/kwchoi/Thesis/candidates_22211.htm
  • E. coli databases for reality check
    • http://biocyc.org/
    • http://ecocyc.org/
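
Since PTT files are the source of gene positions here, the sketch below shows one way such a file could be read in Python. It assumes the usual tab-separated PTT layout with the columns Location, Strand, Length, PID, Gene, Synonym, Code, COG and Product after a short free-text header; the `parse_ptt` function and the sample rows are made up for illustration and real files should be checked against this assumption.

```python
def parse_ptt(text):
    """Parse PTT-style text into a list of gene records (dicts)."""
    genes = []
    in_table = False
    for line in text.splitlines():
        fields = line.split("\t")
        if not in_table:
            # Skip the free-text header until the column-name row is seen.
            in_table = fields[0] == "Location"
            continue
        if len(fields) < 9:
            continue  # ignore blank or malformed rows
        start, end = (int(x) for x in fields[0].split(".."))
        genes.append({
            "start": start,
            "end": end,
            "strand": fields[1],
            "pid": fields[3],
            "gene": fields[4],
            "cog": fields[7],      # kept as the raw string; "-" when absent
            "product": fields[8],
        })
    return genes


# Invented rows in the assumed layout, not real annotation data.
SAMPLE_PTT = (
    "Example genome, complete genome - 1..5000\n"
    "2 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "100..400\t+\t100\t1001\tgeneA\tx0001\t-\tCOG1869G\texample protein A\n"
    "450..900\t+\t150\t1002\t-\tx0002\t-\t-\thypothetical protein\n"
)
records = parse_ptt(SAMPLE_PTT)
print(records)
```

The second sample row has no gene name and no COG, which is the kind of incomplete record the earlier slides describe.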

13
Genomes:
http://www.infobiogen.fr/services/deambulum/english/genomes2a.html
14
Procedure: My Approach to Reconstructing Genomic
Neighborhoods
  • The idea underlying this study is that:
    • Different genomes contain different, overlapping
      parts of evolutionarily and functionally
      connected gene neighborhoods.
    • By generating a Tiling Path, the entire
      neighborhood can be reconstructed.
  • The genomic context of a well-known genome (e.g.
    E. coli) is used as a contextual framework.
  • Start by looking at this framework, then
    search for a group of similar gene neighborhoods
    in the target genomes.
  • Genomic context means the pattern of a series of
    COGs. If a COG is not given, we can predict the
    function of an unknown gene based on my gene
    clustering data.
  • We can also identify some Hitchhikers.
    Hitchhikers are inserted genes that
    originate from different contexts/themes.
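
One simple way to picture the reconstruction step is to merge partial neighborhoods from different genomes whenever they overlap. The Python sketch below does this on sets of COG identifiers; it is a simplified stand-in for the Tiling Path method, and the `merge_neighborhoods` function, the `min_shared` threshold and the example fragments are assumptions made for illustration.

```python
def merge_neighborhoods(neighborhoods, min_shared=1):
    """Merge COG sets that share at least min_shared COGs, until nothing changes."""
    groups = [set(n) for n in neighborhoods]
    changed = True
    while changed:
        changed = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if len(groups[i] & groups[j]) >= min_shared:
                    groups[i] |= groups[j]  # absorb the overlapping fragment
                    del groups[j]
                    changed = True
                    break
            if changed:
                break
    return groups


# Invented fragments: each list is the part of a neighborhood seen in one genome.
fragments = [
    ["COG0779", "COG0195", "COG2740"],
    ["COG2740", "COG0532", "COG0858"],
    ["COG0532", "COG0858", "COG0184"],
    ["COG1869", "COG1129"],
]
merged = merge_neighborhoods(fragments)
print(merged)
```

The first three fragments overlap pairwise and collapse into one neighborhood of six COGs, while the fourth stays separate because it shares nothing with the others.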

15
Tiling Path (E. V. Koonin et al., Nucleic Acids
Research, 2002, Vol. 30, No. 10, pp. 2212-2223)
16
Gene Neighborhoods
17
Results
  • Case 1
    • Relationship between Gene Order and
      Phylogenetic Distance
  • Case 2
    • One theme: Typical Operon (rbs operon)
    • Reconstruct gene neighborhoods
    • Find missing components from the reconstructed
      gene clusters
  • Case 3
    • Two or more themes: Functional Coupling?
    • Find genomic hitchhikers
    • Predict gene function of uncharacterized
      proteins
    • Predict functional coupling

18
Case 1: Gene Order and Phylogenetic Distance
  • If the gene order of two genomes is well
    conserved, the sequence of homologs should appear
    as a line on the genome comparison diagonal plot.
  • What is the relationship between phylogenetic
    distance and the conservation of gene order?
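
The "line on the plot" can be made concrete with a small Python sketch: each homolog pair is a point (gene index in genome A, gene index in genome B), and a run of consecutive genes whose homologs are also consecutive corresponds to a diagonal segment. The `collinear_runs` function, its `min_len` parameter and the sample points are illustrative assumptions, and the sketch assumes at most one homolog per gene of genome A.

```python
def collinear_runs(pairs, min_len=3):
    """Find runs where indices in A rise by 1 and indices in B move by +1 or -1."""
    runs = []
    current = []
    direction = 0  # 0 = not yet known, +1 = same order, -1 = inverted order
    for a, b in sorted(pairs):
        if current:
            prev_a, prev_b = current[-1]
            step = b - prev_b
            if a - prev_a == 1 and step in (1, -1) and direction in (0, step):
                current.append((a, b))
                direction = step
                continue
            if len(current) >= min_len:
                runs.append(current)
        current = [(a, b)]
        direction = 0
    if len(current) >= min_len:
        runs.append(current)
    return runs


# Invented homolog pairs (index in genome A, index in genome B).
points = [(0, 10), (1, 11), (2, 12), (3, 40), (5, 7), (6, 6), (7, 5), (9, 20)]
runs = collinear_runs(points)
print(runs)
```

With these points the sketch reports one run in the same orientation and one inverted run; the isolated points correspond to the scattered dots seen between short diagonal segments.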

19
Phylogenetic Tree (V. Daubin et al., Genome
Research, Vol. 12, Issue 7, pp. 1080-1090)
20
Genome Comparison Diagonal Plot:
Phylogenetically-Distant Species (Z-score over
500)
21
Genome Comparison Diagonal Plot:
Phylogenetically-Close Species (Z-score > 1000)
22
Fragmented Gene Clusters
23
Case 1: Conclusion
  • Gene order in phylogenetically-distant species
    is poorly conserved.
  • But this observation does not mean that gene
    order is conserved very well among
    phylogenetically-close species.
  • Even in the case of very close species (e.g.
    E. coli vs. H. influenzae), gene orders are
    completely scattered.
  • In most cases, only a small number of genes are
    observed as a short line or cluster, and we may
    consider it a putative operon.
  • In the next step, this possibility is
    investigated in more depth.

24
Case 2: Rbs Operon (Typical Operon)
  • Theme: Ribose transport across membrane
    • COG1869: D-ribose high-affinity transport system
      membrane-associated protein
    • COG1129: ATP-binding component of D-ribose
      high-affinity transport system
    • COG1172: D-ribose high-affinity transport system
    • COG1879: D-ribose periplasmic binding protein
    • COG0524: ribokinase
    • COG1609: regulator for rbs operon
  • http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU00206

25
Case 2: Rbs Operon (Z-score over 750,
Intergenic Distance 300)
26
Case 2: Conclusion
  • All components are involved in ribose transport
    across the bacterial cell membrane.
  • In the Rbs operon system, the gene order pattern
    is 1869-1129-1172-1879-0524-1609.
  • 10 out of 22 genomes have this operon system.
  • Except in some cases, this gene order pattern is
    conserved very well.
  • So it is possible that there exists a kind of
    General Contextual Framework of gene order.
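
A check of this kind can be sketched in Python: given the ordered COGs of a candidate cluster, test whether the rbs pattern appears contiguously in either orientation and list any components that are missing. The pattern comes from the slide above; the function names and the three example clusters (including the placeholder labels "other1" and "other2") are assumptions made for illustration and are not results from the 22 genomes.

```python
RBS_PATTERN = ["COG1869", "COG1129", "COG1172", "COG1879", "COG0524", "COG1609"]


def contains_pattern(cog_order, pattern):
    """True if pattern occurs contiguously in cog_order, forward or reversed."""
    n = len(pattern)
    reverse = pattern[::-1]  # the same operon read from the opposite strand
    for i in range(len(cog_order) - n + 1):
        window = cog_order[i:i + n]
        if window == pattern or window == reverse:
            return True
    return False


def missing_components(cog_order, pattern):
    """Pattern COGs that do not occur anywhere in cog_order."""
    present = set(cog_order)
    return [cog for cog in pattern if cog not in present]


# Invented clusters for illustration.
cluster_fwd = ["other1"] + RBS_PATTERN + ["other2"]
cluster_rev = RBS_PATTERN[::-1]
cluster_gap = [c for c in RBS_PATTERN if c != "COG1879"]

print(contains_pattern(cluster_fwd, RBS_PATTERN))
print(contains_pattern(cluster_rev, RBS_PATTERN))
print(contains_pattern(cluster_gap, RBS_PATTERN), missing_components(cluster_gap, RBS_PATTERN))
```

The third cluster fails the strict order test but reports exactly which component is absent, which is the "find missing components" step listed under Case 2.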

27
Case 3: Functional Coupling of 2 or More Themes
  • Theme 1: Transcription
    • COG0779: Uncharacterized Conserved Protein
    • COG0195: Transcription elongation factor
    • COG2740: Predicted nucleic-acid-binding protein
      (transcription termination?)
  • Theme 2: Translation
    • COG1358: Ribosomal protein S17E
    • COG0532: Translation initiation factor 2 (GTPase)
    • COG1550: Uncharacterized Conserved Protein
    • COG0858: Ribosome-binding factor A
    • COG0184: Ribosomal protein S15P/S13E
    • COG0130: tRNA Pseudouridine synthase
  • Hitchhiker?
    • COG0196: FAD Synthase (Hitchhiker?)
  • http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU341

28
Case 3: Functional Coupling (Z-score over 750,
Intergenic Distance 300)
29
Case 3: Conclusion
  • Functional Coupling
    • In bacteria, transcription, translation and RNA
      modification/degradation are coupled, and the
      advantages of co-regulating the corresponding
      genes are obvious.
    • COG0779 (Uncharacterized) is almost inseparable
      from COG0195 (Transcription Elongation
      Factor), so it is likely to be a functional
      partner of COG0195.
  • Hitchhiker
    • The association of COG0196 (FAD synthase) is
      not as tight as the connections between the
      genes belonging to the theme.
  • Gene function prediction
    • The functions of 3 genes in the AE0004092 genome
      can be predicted by reading genomic context.
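
How "tight" an association is can be expressed as a simple ratio, sketched here in Python: among the genomes that contain both COGs, the fraction in which they sit next to each other. This measure, the `adjacency_fraction` function, the `window` parameter and the four gene orders are assumptions invented for illustration; they are not the scoring used in this study, and the numbers are not results.

```python
def adjacency_fraction(genomes, cog_a, cog_b, window=1):
    """Among genomes with both COGs, fraction where they lie within `window` positions."""
    both = 0
    close = 0
    for order in genomes.values():
        pos_a = [i for i, c in enumerate(order) if c == cog_a]
        pos_b = [i for i, c in enumerate(order) if c == cog_b]
        if not pos_a or not pos_b:
            continue  # this genome cannot inform the comparison
        both += 1
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            close += 1
    return close / both if both else 0.0


# Invented gene orders for four hypothetical genomes.
genomes = {
    "g1": ["COG0779", "COG0195", "COG0532", "COG0858", "COG0130", "COG0196"],
    "g2": ["COG0196", "other1", "other2", "COG0779", "COG0195", "COG0130"],
    "g3": ["COG0779", "COG0195", "COG0858", "COG0130"],
    "g4": ["COG0195", "COG0779", "COG0130", "other1", "COG0196"],
}
tight = adjacency_fraction(genomes, "COG0779", "COG0195")
loose = adjacency_fraction(genomes, "COG0130", "COG0196")
print(tight, loose)
```

In this toy data the first pair is adjacent wherever both genes occur, while the second pair is adjacent in only one of the three genomes that carry both, mirroring the contrast between a functional partner and a possible hitchhiker.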

30
Conclusion
  • The Genome Comparison Diagonal Plot visualizes
    the sequence comparison of 2 genomes. It is a
    simple tool, but it gives a strong intuition for
    the genome structure.
  • Conserved gene neighborhoods reconstructed from
    many genomes by the Tiling Path Method can be
    used to predict the functions of uncharacterized
    genes and functional coupling between
    well-characterized genes in those genomes.
  • Ultimately, we can use these methods to
    reconstruct metabolic and functional subsystems.

31
Acknowledgements
  • Haifeng Zhao
    • Genome Pairwise Comparison DB
  • Scott Martin
    • Server Management and Technical Support
  • Dr. Sun Kim
    • Graduate Advisor and P.I.