Title: Characterization of Prokaryotic Genomic Structure and Application to Biological Pathway Prediction
1Characterization of Prokaryotic Genomic Structure
and Application to Biological Pathway Prediction
- Ying Xu
- Biochemistry and Molecular Biology Department and Institute of Bioinformatics
- University of Georgia
- http://csbl.bmb.uga.edu
2Deciphering Microbial Genomes
- Decipher microbial genomes through understanding
- individual basic units, e.g., genes and cis-regulatory elements
- organizational structures of the basic units
- linking genomic structural information to molecular and cellular machinery
gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtg
tgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatg
agcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtag
acttcgcgcataaagctgcgcgagatgattgcaaagragttagatga.
3What We Know
- 300 microbes have their complete genomes sequenced
- Most genes in each genome have been computationally predicted (quite accurately)
- Genes are grouped into operons (transcriptional units)
4What We Know
5What We Know
- While some of the concepts are well established, little is known about how to identify them accurately
- Many other unknown genomic elements and structures are yet to be identified
- RNA genes
- pseudogenes
- transposable elements
- horizontally transferred genes
- genomic islands
- genome rearrangements
- ...
- regulatory binding motifs of all sorts
- other regulatory elements encoded in the genome
- ...
6Deciphering Microbial Genomes
- Even if we have all the genomic elements and structural information, we still need to figure out
- which genes encode what biological function
- how the genomic structures encode parts of an organism
- how the parts work together to accomplish complex functions, e.g., a biological clock
7Goals of the Project
- deciphering genomic structures of prokaryotic organisms
- investigate genomic structures beyond individual genes through comparative genome analyses
- ultimately, understand why prokaryotic genomes are organized the way they are
- elucidating biological pathways and networks in prokaryotic organisms through application of
- gained information about genomic structures
- other experimental information, and
- computational modeling
8- PART I Deciphering genomic structure
gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtg
tgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatg
agcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtag
acttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagct
gatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagc
tgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgat
agctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgc
tagatcgtaggtagtagct
9Orthologous Gene Mapping-- the basic tool
- Finding equivalent genes across microbial genomes
- the most fundamental operation in comparative genome analysis
- We have developed a novel method for orthologous gene mapping using
- both sequence similarity information and genomic structure information
genome X
genome Y
Mao et al, PNAS, 2006; Wu et al, 2006 (submitted); Mao et al, 2006 (submitted)
10Orthologous Gene Mapping
- Observation: a pair of homologous genes across two genomes is substantially more likely to be orthologous than non-orthologous if there is another pair of homologous genes in their neighborhood
- We have developed a scoring scheme for measuring the likelihood that two genes are orthologous, based on
- the above observation, and
- sequence similarity information
Orthologous?
Wu et al, 2006 (submitted)
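A minimal sketch of how a score of this kind might combine the two signals. The function name `orthology_score`, the mixing weight, and the toy gene names are illustrative assumptions, not the scoring scheme of the cited work.

```python
def orthology_score(similarity, neighbors_x, neighbors_y, homologs, weight=0.5):
    """Combine sequence similarity with neighborhood support.

    similarity  : float in [0, 1], similarity of the candidate gene pair
    neighbors_x : genes near the candidate gene in genome X
    neighbors_y : genes near the candidate gene in genome Y
    homologs    : set of (gene_x, gene_y) pairs known to be homologous
    weight      : hypothetical mixing weight for the neighborhood term
    """
    # Count neighbors in X that have a homolog among the neighbors in Y.
    supported = sum(
        1 for gx in neighbors_x
        if any((gx, gy) in homologs for gy in neighbors_y)
    )
    support = supported / len(neighbors_x) if neighbors_x else 0.0
    return (1 - weight) * similarity + weight * support


homologs = {("x2", "y2"), ("x3", "y3"), ("x9", "y7")}
# Same sequence similarity, with and without homologous neighbors.
with_context = orthology_score(0.8, ["x2", "x3"], ["y2", "y3"], homologs)
without_context = orthology_score(0.8, ["x2", "x3"], ["y5", "y6"], homologs)
print(with_context, without_context)
```

With equal sequence similarity, the candidate pair whose neighbors are also homologous receives the higher score, which is the behavior the observation above calls for.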
11Orthologous Gene Mapping
- For any group of homologous genes, construct a map representing the possible orthology relationships among the homologous genes
- Interestingly, the map has a hierarchical structure!
- Developed a database for hierarchically clustered equivalent gene clusters (HCG) at different resolution levels
Wu et al, 2006 (submitted); Mao et al, 2006 (submitted)
12Deciphering Genomic Structures
- By examining orthologous gene mappings across genomes, we can derive an enormous amount of genomic structure information
- Operon: genes arranged in tandem in a genome as a basic unit of transcriptional regulation; genes of an operon work together
- Regulon: a set of operons regulated by the same (transcription) regulatory machinery; genes of a regulon work together under certain conditions
13Prediction of Operons
- Known features
- sharing a common promoter and terminator
- genes of the same operon are functionally related
- conserved operonic structures across closely related genomes
- intergenic distances within an operon are generally shorter than distances between operons
- ...
- Mathematically, the problem can be formulated as partitioning a sequence of genes into groups that are most consistent with
- conserved gene neighborhood relationships across related genomes
- functional prediction of genes
- promoter and terminator predictions
- known intergenic/operonic distributions
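The partition formulation can be illustrated with a deliberately simple sketch that uses only two of the listed features: strand and intergenic distance. The 50 bp threshold and the function name are hypothetical, and this is not the JPOP method, which combines more evidence.

```python
def partition_into_operons(genes, max_gap=50):
    """Partition genes into candidate operons.

    genes   : list of (name, start, end, strand) tuples sorted by start
    max_gap : hypothetical maximum intergenic distance (bp) within an operon
    """
    operons = []
    for gene in genes:
        name, start, end, strand = gene
        if operons:
            _, _, prev_end, prev_strand = operons[-1][-1]
            # Extend the current group only for same-strand, closely spaced genes.
            if strand == prev_strand and start - prev_end <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]


genes = [
    ("a", 0, 100, "+"),
    ("b", 120, 300, "+"),   # 20 bp after a, same strand
    ("c", 600, 800, "+"),   # 300 bp after b
    ("d", 820, 900, "-"),   # close to c but on the other strand
]
operons = partition_into_operons(genes)
print(operons)
```

A fuller treatment would score each candidate partition against all four evidence types rather than applying a single distance cutoff.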
14Prediction of Operons
- We have developed a number of computer programs, including JPOP, for operon prediction
- Prediction accuracy is 80% when applied to new genomes
- Prediction accuracy could be improved when time-course microarray data are available and used
Chen et al, NAR, 2004; Tran et al, NAR, 2006 (to appear); Dam et al, 2006 (submitted)
15Prediction of Uber-operons
- Study of conservation among groups of operons has uncovered lost associations among operons that used to work together
- An uber-operon is a group of functionally related operons whose union is conserved across multiple genomes
- We have developed an algorithm for predicting uber-operons in a genome, which are useful for
- prediction of component genes of biological pathways
- regulon prediction
genome X: (g1, g2, g3, g4) (g5, g6, g7, g8, g9, g10)
genome Y: (g1, g2, g3, g4, g5) (g6) (g7, g8, g9, g10)
Che et al, NAR, 2006
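One way to picture the "union is conserved" idea is to merge operons from two genomes whenever they share a gene, so that each merged group is a candidate uber-operon. This is a sketch under simplifying assumptions (two genomes only, genes already mapped to shared ortholog identifiers, hypothetical function name), not the algorithm of the cited paper.

```python
def candidate_uber_operons(operons_x, operons_y):
    """Merge operons from two genomes that share at least one gene.

    operons_x, operons_y : lists of operons, each a collection of
                           ortholog identifiers shared by both genomes
    Returns a list of disjoint gene sets.
    """
    groups = []  # pairwise disjoint sets of genes
    for operon in list(operons_x) + list(operons_y):
        merged = set(operon)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group      # absorb every overlapping group
            else:
                remaining.append(group)
        groups = remaining + [merged]
    return groups


genome_x = [{"g1", "g2", "g3", "g4"},
            {"g5", "g6", "g7", "g8", "g9", "g10"},
            {"h1", "h2"}]
genome_y = [{"g1", "g2", "g3", "g4", "g5"},
            {"g6"},
            {"g7", "g8", "g9", "g10"},
            {"h1"}, {"h2"}]
groups = candidate_uber_operons(genome_x, genome_y)
print(groups)
```

In this toy example the operon boundaries differ between the genomes, but the union g1 to g10 stays together, as does the pair h1, h2.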
16Prediction of Regulons
- A more challenging (and more information-rich) problem is to predict regulons
- Key characteristics of regulons: a group of operons sharing similar gene expression patterns and having common cis (transcription factor) binding sites
- Challenging issues
- TF binding sites are difficult to predict
- existing predictions of operons and binding sites are both noisy
17Prediction of Regulons
- Our strategy: clustering of operons based on
- sharing common regulatory binding sites
- functional relatedness of involved genes
- prediction of co-regulated genes based on microarray data
- information derived from uber-operons
- Clustering of operons allows us to weed out some of the erroneous predictions by individual (noisy) predictors
Su et al, NAR, 2005; Che et al, 2006 (in preparation)
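The point about weeding out errors from individual noisy predictors can be sketched with a simple vote: keep an operon pair only when several independent evidence types agree. Voting is used here as a stand-in for the clustering step, and the evidence names, threshold, and function name are assumptions for illustration.

```python
def supported_links(evidence, min_support=2):
    """Keep operon pairs supported by at least `min_support` evidence types.

    evidence : dict mapping an evidence type to a set of operon pairs,
               each pair a frozenset of two operon names
    """
    votes = {}
    for pairs in evidence.values():
        for pair in pairs:
            votes[pair] = votes.get(pair, 0) + 1
    return {pair for pair, count in votes.items() if count >= min_support}


def pair(a, b):
    return frozenset((a, b))


evidence = {
    "shared_motif": {pair("opA", "opB"), pair("opA", "opC")},
    "co_expression": {pair("opA", "opB")},
    "uber_operon": {pair("opA", "opB"), pair("opC", "opD")},
}
links = supported_links(evidence)
print(links)
```

Pairs predicted by only one source (opA with opC, opC with opD) are dropped, while the pair backed by all three sources survives.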
18Prediction of Regulatory Binding Sites
- Mathematically, the problem can be formulated as
Given a set of N promoter sequences and the genome, find a k-mer from each promoter region so that the N aligned k-mers have high information content and the statistical significance of finding N aligned k-mers with that level of information content is high.
- Popular methods mainly rely on sampling techniques (e.g., Gibbs sampling) to search for such a set of k-mers.
TGTGAAAGACTGTTTTTTTGATCGTTTTGACAAAAATGGAAGTCCACA
AAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATCCCATAG
TGATGTACTGCATGTATGCAAAGGACGTCAGATTACCGTGCAGTACAG
TAAACGATTCCACTAATTTATTCCATGTCACTCTTTTCGCATCTTTGT
ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGG
ACTTTTTTTTCATATGCCTGACGGAGTTGACACTTGTAAGTTTTCAAC
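The "information content" of a set of aligned k-mers is commonly measured as the relative entropy of each alignment column against background base frequencies. The sketch below assumes a uniform background and no pseudocounts, and it does not evaluate statistical significance; the function name is illustrative.

```python
import math
from collections import Counter


def information_content(kmers, background=None):
    """Total information content (in bits) of equally long aligned k-mers."""
    background = background or {base: 0.25 for base in "ACGT"}
    n = len(kmers)
    total = 0.0
    for column in zip(*kmers):            # iterate over alignment columns
        for base, count in Counter(column).items():
            freq = count / n
            total += freq * math.log2(freq / background[base])
    return total


conserved = information_content(["ACGT", "ACGT", "ACGT", "ACGT"])
random_like = information_content(["A", "C", "G", "T"])
print(conserved, random_like)
```

A perfectly conserved column contributes 2 bits under a uniform background, and a column with all four bases equally represented contributes none, so a motif search favors sets of k-mers with many conserved columns.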
19Prediction of Regulatory Binding Sites
- Our approach
- find conserved k-mers through data clustering
- validation through a biophysical approach
- Binding site identification through data clustering
TGGTGTGAAAGACTGTTTTTTTGATAACTGTCTGCATGGTCATATTTTT
AAATTGTGATGTGTATCGAAGTGTGTTAATGTGAGTTAGCTCACTCAT
TAGAATTCTGAGCGGATAACAATTTCACTTCTGTGAACTAAACCGAGG
TCATGAATTCTGTCACAGTGCAAATTCAGAGATTGTGATTCGATTCAC
ATTTAAATGTTGTGCTGTGGTTAACCCAATTACGGTGTCAAATACCGC
ACAGATGCGACCTGTGACGGAAGATCACTTCGCAATTTGTCAGTGGTC
GCACATATCCT
Olman et al, PSB, 2003; Olman et al, JBCB, 2003
20Tie to Structural Information
- We have developed a protein-DNA docking program for assessing the binding affinity between a protein and a DNA motif
- The core of the program is a statistics-based energy function measuring 2-body, 3-body and 4-body interactions between amino acids and nucleotides
- On a test set with 18 TF structures and 2750 predicted binding motifs, our program ranks all 18 correct binding motifs among the top 25 binding predictions
Liu et al, NAR, 2005
21Prediction of Functional Modules
- A pair of genes is considered to be functionally linked if they belong to the same (known) pathways, regulons, complexes, ...
- We found that such a functional linkage relationship could be predicted using
- co-occurrence relationships
- co-evolutionary relationships
- functional relatedness defined in terms of GO classification
- Using such predictions, we have predicted functional linkage maps for all sequenced microbial genomes
22Prediction of Functional Modules
- Identification of sub-networks that might be functional
- Sub-networks that are densely intra-connected: groups of genes that are functionally linked with each other, which might indicate that these genes work together
- Sub-networks that are conserved across multiple maps: groups of genes whose functional linkage relationships are conserved across multiple organisms, indicating that there is evolutionary pressure for the conservation
- These two types of relatedness are complementary to each other
Wu et al, NAR, 2005; Wu et al, GIW, 2005
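The two criteria can each be expressed in a few lines: the density of a gene group within one linkage map, and the links shared by several maps. This is a sketch that assumes genes from different organisms have already been mapped to shared identifiers; the function names and toy maps are hypothetical.

```python
from itertools import combinations


def density(genes, links):
    """Fraction of gene pairs in `genes` that are functionally linked."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0
    present = sum(1 for p in pairs if frozenset(p) in links)
    return present / len(pairs)


def conserved_links(maps):
    """Links present in every functional linkage map (needs at least one map)."""
    return set.intersection(*maps)


def link(a, b):
    return frozenset((a, b))


map_1 = {link("a", "b"), link("b", "c"), link("a", "c"), link("c", "d")}
map_2 = {link("a", "b"), link("c", "d")}

print(density({"a", "b", "c"}, map_1))        # fully connected triangle
print(density({"a", "b", "c", "d"}, map_1))   # 4 of 6 possible links
print(conserved_links([map_1, map_2]))
```

A group can score well on one criterion and poorly on the other, which is why the two are treated as complementary.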
23Prediction of Functional Modules
Red: pathway; Blue: regulon; Green: transcription unit; Purple: similar GO assignments
24Other Related Work
- Identification and characterization of insertion sequences (and other transposable elements) at genome scale
- Identification and characterization of protein-binding motifs at genome scale
- Functional classification of genes at multiple resolutions: a framework beyond the concepts of homology/orthology
- Evolutionary studies of operons
25Working Towards ...
- Deriving the genomic units and structures, at different levels, of microbial genomes
- making progress ...
- Understanding the organizational rules of the basic units
- through extensive comparative genome analyses
26- PART II Pathway and network prediction
27Biological Networks
- Biological network: a group of bio-molecules (protein, DNA, RNA) wired together to accomplish a (complex) biological function
- including regulatory, signaling and metabolic components
- pathways: un-branched networks
- Example: the process of nitrogen assimilation
Senses the availability of nitrogen and in what forms -> activates the transport process to take up a particular form of nitrogen into the cell -> reduces this form of nitrogen to a form the cell can utilize directly (nitrate -> nitrite -> ammonia -> glutamine -> glutamate) -> may trigger a number of biological processes
28Predicting Biological Networks
[Figure: rows A1 A2 ... An, B1 B2 ... Bn, ..., Y1 Y2 ... Yn, Z1 Z2 ... Zn shown at times t = 1, t = 2, ..., t = N]
What is the common regulation mechanism?
transcription regulation network
29Predicting Biological Networks
- Linear dynamic model for a regulatory network
- A: transition matrix
- b: constitutive expression level
- noise at time t
- expression level of all genes at time t
- Estimating matrix A as an optimization problem
Building models consistent with gene expression data
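One common way to write a linear dynamic model with these ingredients is given below. The exact form, the symbols x(t) for the expression vector and ε(t) for the noise, and the least-squares objective are assumptions for illustration, not necessarily the formulation used in the cited work.

```latex
% Assumed model: expression at the next time point is a linear
% function of the current expression plus a constant term and noise.
x(t+1) = A\,x(t) + b + \varepsilon(t), \qquad t = 1, \dots, N-1

% Assumed estimation problem: choose A and b to minimize the
% squared prediction error over the observed time course.
(\hat{A}, \hat{b}) = \arg\min_{A,\,b} \sum_{t=1}^{N-1}
    \bigl\| x(t+1) - A\,x(t) - b \bigr\|_2^2
```

With G genes, A has G × G unknown entries while a time course supplies only (N − 1) × G equations, so the estimate is under-determined unless N − 1 is at least G + 1 or further constraints are imposed.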
30Challenging Problems
- There are numerous other mathematical frameworks for modeling biological networks
- Experimental data are significantly limited compared to the complexity of the networks to be elucidated,
- making the network prediction problem significantly under-constrained
- leading to possibly infinitely many network solutions, each of which explains the data equally well
31Network Inference in Microbes-- our general strategy
- Framework: prediction of network topologies that are most consistent with high-throughput data and prior knowledge
- Constraints: derivation of as much information about (a) component genes and (b) their interactions as possible, and using it as prediction constraints
- Sampling: sample the feasible network topology space to derive a network topology distribution
Su et al, GIW, 2003; Ji and Xu, Bioinformatics, 2006
32Information Extractable from Literature to set the framework
- Literature and database search
- to infer initial conceptual models for a target pathway
- to collect information about which genes are involved in the target pathway and their interaction relationships
Pathways to utilize phosphonates
2-AEP pathway (in Gram-positive microbes)
2-aminoethylphosphonate (NH2CH2CH2PO3H) -> [transaminase] -> phosphonoacetaldehyde (COHCH2PO3H2) -> [phosphonatase] -> acetaldehyde (CHOCH3) + Pi
Automated literature mining capabilities are desperately needed!
33Derivation of Constraints
- Information derivable through comparative genome analyses and analysis of other experimental data
- Component genes (parts list) in a target network
- Functional roles of component genes
- Possible interaction relationships among component genes
- Higher-level functional modules conserved across organisms
using a systematic approach!
34Deriving Parts List
- Through analysis of microarray gene expression data, one could identify an initial list of genes possibly involved in a particular biological process
- identification of differentially expressed genes, co-expressed genes
g1, g2, ..., gk
The observed gene expression data are the results of complex interactions of possibly many pathways in a cell, which might work cooperatively, competitively or independently with each other
Microarray data might need to be interpreted in the context of a network model
Xu et al, NAR, 2003
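A first-pass parts list of differentially expressed genes can be sketched with a log fold-change cutoff. This assumes strictly positive expression values for two conditions, ignores replicates and statistical testing, and uses a hypothetical threshold and function name.

```python
import math


def differentially_expressed(control, treated, min_log2_fold=1.0):
    """Return genes whose expression changes by at least the given log2 fold.

    control, treated : dicts mapping gene name -> positive expression value
    min_log2_fold    : hypothetical cutoff (1.0 means a two-fold change)
    """
    selected = []
    for gene in control:
        log2_fold = math.log2(treated[gene] / control[gene])
        if abs(log2_fold) >= min_log2_fold:
            selected.append(gene)
    return sorted(selected)


control = {"g1": 10.0, "g2": 10.0, "g3": 40.0}
treated = {"g1": 40.0, "g2": 12.0, "g3": 10.0}
parts_list = differentially_expressed(control, treated)
print(parts_list)
```

Because observed expression reflects many interacting pathways, a list produced this way is only a starting point to be refined with other evidence.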
35Deriving Parts List
- Refining the parts list through prediction and application of genomic structures (guilt by association)
- Operons
- Uber-operons
- Regulons, and
- Functional modules
- ...
36Prediction of Interactions
- Two types of interactions we intend to capture
- physical interactions
- functional links
- There are a number of databases of experimentally verified protein-protein interactions
- DIP, BIND
Homology search against these data sets is the key technique
Su et al, GIW, 2004
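Transferring known interactions by homology can be sketched as follows. The homolog table is assumed to come from a separate sequence similarity search, the protein and gene names are made up, and no actual DIP or BIND interface is used.

```python
def transfer_interactions(known_interactions, homolog_of):
    """Predict interactions in a target genome from a reference set.

    known_interactions : iterable of (protein_a, protein_b) pairs that are
                         experimentally verified in the reference data
    homolog_of         : dict mapping a reference protein to a list of
                         homologous genes in the target genome
    """
    predicted = set()
    for a, b in known_interactions:
        for gene_a in homolog_of.get(a, []):
            for gene_b in homolog_of.get(b, []):
                if gene_a != gene_b:
                    predicted.add(frozenset((gene_a, gene_b)))
    return predicted


known = [("P1", "P2"), ("P2", "P3")]
homolog_of = {"P1": ["t1"], "P2": ["t2", "t5"]}   # P3 has no homolog
predicted = transfer_interactions(known, homolog_of)
print(predicted)
```

Interactions whose partners lack a homolog in the target genome (here P2 with P3) produce no prediction.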
37Network Mapping across Genomes
- Related genomes may employ similar networks for a particular biological process
- Through mapping a homologous network across genomes, one could possibly derive a network in the target genome
?
38Network Mapping across Genomes
- Our approach: mapping orthologous genes of a pathway to a target genome in a way that best preserves regulon structures, i.e., co-regulated operons
- The basic idea: find homologous gene pairs with the highest sequence similarity under the condition that mapped genes are grouped into co-regulated operons
homologous genes
Using both homology and genomic structure information in mapping networks!
39Network Mapping across Genomes
- The problem was formulated and solved as a Steiner network problem (called the constrained minimum spanning tree problem)
- A recent solution solves the problem as an integer programming problem
The algorithm has been implemented as the program P-MAP
Mao et al, PNAS, 2006; Olman et al, CSB, 2004
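The trade-off between sequence similarity and operon grouping can be illustrated with a tiny exhaustive search. This is not the Steiner network or integer programming formulation and not P-MAP; the penalty per operon, the function name, and the toy data are assumptions, and operon membership stands in for co-regulation.

```python
from itertools import product


def map_pathway(candidates, operon_of, operon_penalty=0.5):
    """Pick one target gene per pathway gene.

    candidates     : dict mapping pathway gene -> list of
                     (target_gene, similarity) candidates
    operon_of      : dict mapping target gene -> operon identifier
    operon_penalty : hypothetical cost for each distinct operon used
    """
    genes = sorted(candidates)
    best_score, best_mapping = float("-inf"), None
    for choice in product(*(candidates[g] for g in genes)):
        targets = [target for target, _ in choice]
        if len(set(targets)) < len(targets):
            continue  # a target gene may be assigned only once
        operons = {operon_of[t] for t in targets}
        score = sum(sim for _, sim in choice) - operon_penalty * len(operons)
        if score > best_score:
            best_score, best_mapping = score, dict(zip(genes, targets))
    return best_mapping, best_score


candidates = {
    "p1": [("t1", 0.90), ("t7", 0.95)],
    "p2": [("t2", 0.80)],
}
operon_of = {"t1": "opA", "t2": "opA", "t7": "opB"}
mapping, score = map_pathway(candidates, operon_of)
print(mapping, score)
```

Here t7 is the better sequence match for p1, but t1 is chosen because it shares an operon with the gene mapped to p2. Exhaustive search is only feasible for very small pathways, which is one reason a combinatorial optimization formulation is needed in practice.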
40Mapping KEGG Pathways
- (Generic) KEGG pathways consist of enzymes and their interactions
- Mapping a KEGG pathway is essentially finding the genes that encode the enzymes in the pathway
41Nitrogen Assimilation and Photosynthesis
- Known facts
- the core part of nitrogen assimilation is regulated by the TF ntcA, forming the ntcA regulon
- A number of genes are known to be in the ntcA regulons in some of the 16 sequenced cyanobacterial genomes
- known ntcA-regulated operons in cyanobacteria also have a σ70-like binding motif in their promoters
- We predicted the binding motif of ntcA along with the σ70-like motif
- Key idea: predicting clustered motifs
Su et al, NAR, 2005
42Nitrogen Assimilation and Photosynthesis
- Using the profiles of the two binding motifs, we searched the 16 genomes for additional ntcA-regulated genes and identified a number of additional operons
- An interesting observation is that we consistently found genes known to be involved in photosynthesis, with ntcA binding motifs, across the 11 genomes
- It was previously known that the nitrogen assimilation process is somehow coordinated with the photosynthesis process, but the molecular-level mechanism is not clear
- For the first time, we predicted a rough model for the coordination between these two important biological processes, based on the detailed functions and interactions of the involved genes.
Su et al, NAR, 2006
43Nitrogen Assimilation and Photosynthesis
[Figure: network diagram of nitrogen assimilation and photosynthesis]
Inputs: Nutrients, Light, CO2
Membranes: Periplasmic membrane, Plasma membrane
Processes: Photosystem, Calvin cycle (ATP, NADPH), Other pathways
Proteins and genes: Som, RbcL, RbcS, Icd, NrtP, NarB, NirA, PetH, PII, GS, GOGAT, GltS, Amt, Urt, Urease, Cyn, Cyanase, RpoD, NtcA, SYNY2460, 2468, 2469, 2474, SYNW2289, SYNW0273, Hypothetical proteins
Metabolites: NO3-, NO2-, NH4, Urea, Cyanate, 2-OG, Gln, Glu
Other labels: DNA
Legend (shape codes and color codes): NtcA regulon, Non-ntcA regulon, gene, protein, Transcription factor, transporter, transformation/translocation, regulation
44Summary
- A substantial amount of information about genomic structures and organizational rules is derivable through comparative genomics
- This information makes possible the computational derivation of biological pathways and networks of microbes
- Network prediction is a systems problem, and it requires a systems approach
- Combined application of the multiple types of information provides a powerful approach to network elucidation
45Acknowledgment
- People of the project at UGA
- Zhengchang Su
- Fenglou Mao
- Hongwei Wu
- PhuongAn Dam
- Victor Olman
- Guojun Li
- Zhijie Liu
- Fengfeng Zhou
- Dongsheng Che
- Collaborators
- Tao Jiang, UCR
- Xin Chen, UCR
- Brian Palenik, UCSD
- Dong Xu, Univ of Missouri
- Arthur Grossman, Carnegie Inst
- Devaki Bhaya, Carnegie Inst
- Funding support
- NSF/BDI2 NSF/ITR
- DOE GTL project
http://csbl.bmb.uga.edu