Characterization of Prokaryotic Genomic Structure and Application to Biological Pathway Prediction

Slides: 46

Provided by: Kyl160

1
Characterization of Prokaryotic Genomic Structure
and Application to Biological Pathway Prediction

Ying Xu
Biochemistry and Molecular Biology Department,
and
Institute of Bioinformatics
University of Georgia
http://csbl.bmb.uga.edu

2
Deciphering Microbial Genomes

Decipher microbial genomes through understanding:
individual basic units, e.g., genes and
cis-regulatory elements
organizational structures of the basic units
links between genomic structural information and
molecular and cellular machinery

300 microbes have their complete genomes
sequenced
Most genes in each genome have been
computationally predicted (quite accurately)
Genes are grouped into operons (transcriptional
units)

4
What We Know

5
What We Know

While some of the concepts are well established,
little is known about how to identify them
accurately
Many other unknown genomic elements and
structures are yet to be identified

RNA genes
pseudogenes
transposable elements
horizontally transferred genes
genomic islands
genome rearrangements
...

regulatory binding motifs of all sorts
other regulatory elements encoded in the genome
...

6
Deciphering Microbial Genomes

Even if we have all the genomic elements and
structural information, we still need to figure
out:
which genes encode which biological functions
how the genomic structures encode parts of an
organism
how the parts work together to accomplish complex
functions, e.g., a biological clock

7
Goals of the Project

Deciphering genomic structures of prokaryotic
organisms:
investigate genomic structures beyond individual
genes through comparative genome analyses
ultimately, understand why prokaryotic genomes
are organized the way they are
Elucidating biological pathways and networks in
prokaryotic organisms through application of:
gained information about genomic structures
other experimental information, and
computational modeling

PART I: Deciphering genomic structure

[Slide graphic: a block of raw genomic DNA sequence]
9
Orthologous Gene Mapping: the basic tool

Finding equivalent genes across microbial
genomes is the most fundamental operation in
comparative genome analysis
We have developed a novel method for orthologous
gene mapping using both sequence similarity
information and genomic structure information

genome X
genome Y
Mao et al, PNAS, 2006; Wu et al, 2006
(submitted); Mao et al, 2006 (submitted)
10
Orthologous Gene Mapping

Observation: the probability that a pair of
homologous genes across two genomes is
orthologous is substantially higher than the
probability that it is non-orthologous if there
is another pair of homologous genes in their
neighborhood
We have developed a scoring scheme for measuring
the likelihood that two genes are orthologous,
based on:
the above observation, and
sequence similarity information

Orthologous?
Wu et al, 2006 (submitted)
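
The scoring idea on this slide can be illustrated with a minimal sketch. It is not the published scoring scheme: the additive form, the `weight` and `window` parameters, and all gene names and similarity values below are assumptions made only for illustration. A homologous pair earns its sequence similarity plus a bonus for every other homologous pair found near it in both genomes.

```python
def neighborhood_support(pair, homolog_pairs, pos_x, pos_y, window=3):
    """Count other homologous pairs within `window` genes of `pair` in both genomes."""
    gx, gy = pair
    support = 0
    for hx, hy in homolog_pairs:
        if hx == gx or hy == gy:
            continue  # pairs that reuse one of the two genes are not neighbours
        near_x = abs(pos_x[hx] - pos_x[gx]) <= window
        near_y = abs(pos_y[hy] - pos_y[gy]) <= window
        if near_x and near_y:
            support += 1
    return support


def orthology_score(pair, similarity, homolog_pairs, pos_x, pos_y,
                    weight=0.5, window=3):
    """Sequence similarity plus a bonus per supporting neighbour pair (assumed form)."""
    bonus = weight * neighborhood_support(pair, homolog_pairs, pos_x, pos_y, window)
    return similarity[pair] + bonus


# Hypothetical toy data: gene order index in genome X and genome Y.
pos_x = {"a1": 0, "a2": 1}
pos_y = {"b1": 10, "b2": 11, "b3": 40}
similarity = {("a1", "b1"): 0.6, ("a1", "b3"): 0.7, ("a2", "b2"): 0.9}
pairs = list(similarity)

for candidate in [("a1", "b1"), ("a1", "b3")]:
    print(candidate, orthology_score(candidate, similarity, pairs, pos_x, pos_y))
```

In this toy case the pair (a1, b1) has the lower raw similarity but the higher combined score, because its neighbours a2 and b2 are also homologous.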
11
Orthologous Gene Mapping

For any group of homologous genes, construct a
map representing the possible orthology
relationships among the homologous genes
Interestingly, the map has a hierarchical
structure!
We developed a database of hierarchically
clustered equivalent gene clusters (HCG) at
different resolution levels

Wu et al, 2006 (submitted); Mao et al, 2006
(submitted)
12
Deciphering Genomic Structures

By examining orthologous gene mappings across
genomes, we can derive an enormous amount of
genomic structure information
Operon: genes arranged in tandem in the genome as
a basic unit of transcriptional regulation; genes
of an operon work together
Regulon: a set of operons regulated by the same
(transcription) regulatory machinery; genes of a
regulon work together under certain conditions

13
Prediction of Operons

Known features:
genes of an operon share a common promoter and
terminator
genes of the same operon are functionally related
operonic structures are conserved across closely
related genomes
intergenic distances within an operon are
generally shorter than inter-operonic distances
...
Mathematically, the problem can be formulated as
partitioning a sequence of genes into groups
that are most consistent with:
conserved gene neighborhood relationships across
related genomes
functional predictions of genes
promoter and terminator predictions
known intergenic/inter-operonic distance
distributions
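
The partitioning formulation can be sketched with only one of the listed features, intergenic distance, plus strand. This is a deliberately reduced illustration, not JPOP or any published predictor: the `max_gap` threshold and the gene coordinates are assumed values, and a real method would combine all the evidence types listed above.

```python
def partition_into_operons(genes, max_gap=50):
    """Split genes (sorted by start) into runs on one strand with short gaps.

    Each gene is a tuple (name, start, end, strand).
    """
    operons = []
    for gene in genes:
        _, start, _, strand = gene
        if operons:
            _, _, prev_end, prev_strand = operons[-1][-1]
            if strand == prev_strand and start - prev_end <= max_gap:
                operons[-1].append(gene)  # extend the current candidate operon
                continue
        operons.append([gene])  # start a new candidate operon
    return operons


# Hypothetical coordinates, for illustration only.
genes = [
    ("g1", 100, 400, "+"),
    ("g2", 420, 900, "+"),    # 20 bp after g1, same strand: same group
    ("g3", 1300, 1800, "+"),  # 400 bp gap: new group
    ("g4", 1810, 2200, "-"),  # strand change: new group
]
for operon in partition_into_operons(genes):
    print([name for name, *_ in operon])
```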

14
Prediction of Operons

We have developed a number of computer programs,
including JPOP, for operon prediction
Prediction accuracy is 80% when applied to new
genomes
Prediction accuracy could be improved when
time-course microarray data are available and
used

Chen et al, NAR, 2004; Tran et al, NAR, 2006 (to
appear); Dam et al, 2006 (submitted)
15
Prediction of Uber-operons

Study of conservation among groups of operons
has uncovered lost associations among operons
that used to work together
An uber-operon is a group of functionally related
operons whose union is conserved across multiple
genomes
We have developed an algorithm for predicting
uber-operons in a genome, which are useful for:
prediction of component genes of biological
pathways
regulon prediction

genome X: (g1, g2, g3, g4) (g5, g6, g7, g8, g9, g10)
genome Y: (g1, g2, g3, g4, g5) (g6) (g7, g8, g9, g10)
Che et al, NAR, 2006
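
The uber-operon definition can be illustrated on the genome X / genome Y example from this slide. The sketch below is a simplification and not the published algorithm: it assumes that genes carrying the same label in both genomes are orthologs, links operons from different genomes whenever they share a gene, and reports the gene union of each connected group.

```python
def uber_operon_candidates(operons_x, operons_y):
    """Group operons from two genomes that are linked by shared orthologous genes."""
    genes = {("X", i): set(op) for i, op in enumerate(operons_x)}
    genes.update({("Y", j): set(op) for j, op in enumerate(operons_y)})
    nodes = list(genes)
    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, union = [start], set()
        while stack:  # depth-first walk over linked operons
            node = stack.pop()
            union |= genes[node]
            for other in nodes:
                linked = other[0] != node[0] and genes[node] & genes[other]
                if other not in seen and linked:
                    seen.add(other)
                    stack.append(other)
        groups.append(union)
    return groups


# Operon layout taken from the slide's example.
genome_x = [["g1", "g2", "g3", "g4"], ["g5", "g6", "g7", "g8", "g9", "g10"]]
genome_y = [["g1", "g2", "g3", "g4", "g5"], ["g6"], ["g7", "g8", "g9", "g10"]]
for group in uber_operon_candidates(genome_x, genome_y):
    print(sorted(group, key=lambda g: int(g[1:])))
```

Here the operon boundaries differ between the two genomes, yet all five operons fall into one group whose union, g1 to g10, is the same in both.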
16
Prediction of Regulons

A more challenging (and more information-rich)
problem is to predict regulons
Key characteristics of a regulon: a group of
operons sharing similar gene expression patterns
and having common cis (transcription factor)
binding sites
Challenging issues:
TF binding sites are difficult to predict
existing predictions of operons and binding sites
are both noisy

17
Prediction of Regulons

Our strategy: clustering of operons based on:
shared regulatory binding sites
functional relatedness of the genes involved
prediction of co-regulated genes based on
microarray data
information derived from uber-operons
Clustering of operons allows us to weed out some
of the erroneous predictions made by individual
(noisy) predictors

Su et al, NAR, 2005; Che et al, 2006 (in
preparation)
18
Prediction of Regulatory Binding Sites

Mathematically, the problem can be formulated as:
Given a set of N promoter sequences and the
genome, find a k-mer from each promoter region so
that the N aligned k-mers have high information
content and the statistical significance of
finding N aligned k-mers with this level of
information content is high.
Popular methods mainly rely on sampling
techniques (e.g., Gibbs sampling) to search for
such a set of k-mers.

TGTGAAAGACTGTTTTTTTGATCGTTTTGACAAAAATGGAAGTCCACA
AAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATCCCATAG
TGATGTACTGCATGTATGCAAAGGACGTCAGATTACCGTGCAGTACAG
TAAACGATTCCACTAATTTATTCCATGTCACTCTTTTCGCATCTTTGT
ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGG
ACTTTTTTTTCATATGCCTGACGGAGTTGACACTTGTAAGTTTTCAAC
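
The "information content" in this formulation is commonly computed as the relative entropy of each alignment column against a background base distribution, summed over columns. The sketch below assumes a uniform background of 0.25 per base and applies no pseudocounts or small-sample correction; those choices, and the toy k-mers, are assumptions for illustration.

```python
import math


def information_content(kmers, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b), in bits."""
    if background is None:
        background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    n = len(kmers)
    total = 0.0
    for column in zip(*kmers):  # iterate over alignment columns
        for base in set(column):
            freq = column.count(base) / n
            total += freq * math.log2(freq / background[base])
    return total


# Toy alignments (not taken from the slide's sequences).
print(information_content(["ACGT", "ACGT", "ACGT", "ACGT"]))  # fully conserved
print(information_content(["A", "C", "G", "T"]))              # uninformative column
```

A fully conserved column contributes 2 bits under a uniform background, and a column matching the background contributes none, so a search procedure such as Gibbs sampling would favor sets of k-mers that push this sum upward.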
19
Prediction of Regulatory Binding Sites

Our approach:
find conserved k-mers through data clustering
validate them through a biophysical approach
Binding site identification through data
clustering

TGGTGTGAAAGACTGTTTTTTTGATAACTGTCTGCATGGTCATATTTTT
AAATTGTGATGTGTATCGAAGTGTGTTAATGTGAGTTAGCTCACTCAT
TAGAATTCTGAGCGGATAACAATTTCACTTCTGTGAACTAAACCGAGG
TCATGAATTCTGTCACAGTGCAAATTCAGAGATTGTGATTCGATTCAC
ATTTAAATGTTGTGCTGTGGTTAACCCAATTACGGTGTCAAATACCGC
ACAGATGCGACCTGTGACGGAAGATCACTTCGCAATTTGTCAGTGGTC
GCACATATCCT
Olman et al, PSB, 2003; Olman et al, JBCB, 2003
20
Tie to Structural Information

We have developed a protein-DNA docking program
for assessing the binding affinity between a
protein and a DNA motif
The core of the program is a statistics-based
energy function measuring 2-body, 3-body and
4-body interactions between amino acids and
nucleotides
On a test set with 18 TF structures and 2750
predicted binding motifs, our program ranks all
18 correct binding motifs among the top 25
binding predictions

Liu et al, NAR, 2005
21
Prediction of Functional Modules

A pair of genes is considered to be functionally
linked if they belong to the same (known)
pathway, regulon, complex, ...
We found that such a functional linkage
relationship can be predicted using:
co-occurrence relationships
co-evolutionary relationships
functional relatedness defined in terms of GO
classification
Using such predictions, we have predicted
functional linkage maps for all sequenced
microbial genomes
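
Of the three evidence types listed, co-occurrence is the easiest to sketch. The example below assumes each gene family is described by the set of genomes in which it occurs and links two families when the Jaccard similarity of those sets reaches a threshold. The similarity measure, the `threshold` value, and the gene and genome names are illustrative assumptions, not the method used in the cited work.

```python
from itertools import combinations


def cooccurrence_links(profiles, threshold=0.75):
    """Link gene families whose presence/absence profiles are similar.

    `profiles` maps a gene family name to the set of genomes containing it.
    """
    links = []
    for g1, g2 in combinations(sorted(profiles), 2):
        shared = len(profiles[g1] & profiles[g2])
        union = len(profiles[g1] | profiles[g2])
        score = shared / union if union else 0.0  # Jaccard similarity
        if score >= threshold:
            links.append((g1, g2, score))
    return links


# Hypothetical occurrence profiles.
profiles = {
    "geneA": {"org1", "org2", "org3", "org4"},
    "geneB": {"org1", "org2", "org3", "org4"},
    "geneC": {"org5"},
}
print(cooccurrence_links(profiles))
```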

22
Prediction of Functional Modules

Identification of sub-networks that might be
functional

Sub-networks that are densely intra-connected:
groups of genes that are functionally linked
with each other, which might indicate that these
genes work together
Sub-networks that are conserved across multiple
maps: groups of genes whose functional linkage
relationships are conserved across multiple
organisms, indicating that there is
evolutionary pressure for the conservation
These two types of relatedness are
complementary to each other

Wu et al, NAR, 2005; Wu et al, GIW, 2005
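
One simple way to look for densely intra-connected groups in a functional linkage map is a k-core: repeatedly discard genes with fewer than k links until every remaining gene has at least k. This is offered only as a stand-in for whatever density criterion the cited work uses; the value of `k` and the toy edge list are assumptions.

```python
def k_core(edges, k):
    """Return the set of nodes that survive repeated removal of nodes with degree < k."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    changed = True
    while changed:
        changed = False
        for node in list(neighbours):
            if len(neighbours[node]) < k:
                # Remove the node and detach it from its remaining neighbours.
                for other in neighbours.pop(node):
                    neighbours[other].discard(node)
                changed = True
    return set(neighbours)


# Hypothetical linkage map: a, b, c, d are all linked to each other; e hangs off d.
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(k_core(edges, 3)))
```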
23
Prediction of Functional Modules
Color codes: red = pathway; blue = regulon;
green = transcription unit; purple = similar GO
assignments
24
Other Related Work

Identification and characterization of insertion
sequences (and other transposable elements) at
genome scale
Identification and characterization of
protein-binding motifs at genome scale
Functional classification of genes at multiple
resolutions: a framework beyond the concepts of
homology/orthology
Evolutionary studies of operons

25
Working Towards ...

Deriving the genomic units and structures, at
different levels, of microbial genomes
making progress ...
Understanding the organizational rules of the
basic units
through extensive comparative genome analyses

PART II: Pathway and network prediction

27
Biological Networks

Biological network: a group of biomolecules
(protein, DNA, RNA) wired together to
accomplish a (complex) biological function,
including regulatory, signaling and metabolic
components
Pathways: un-branched networks
Example: the process of nitrogen assimilation

Senses which forms of nitrogen are available
-> activates the transport process to take up a
particular form of nitrogen into the cell ->
reduces this form of nitrogen to a form the cell
can utilize directly (nitrate -> nitrite ->
ammonia -> glutamine -> glutamate) -> may
trigger a number of biological processes
28
Predicting Biological Networks
[Figure: gene groups A1 ... An, B1 ... Bn, ...,
Y1 ... Yn, Z1 ... Zn observed at time points
t = 1, t = 2, ..., t = N]
What is the common regulation mechanism?
transcription regulation network
29
Predicting Biological Networks

Linear dynamic model for a regulatory network,
with the following terms:
A: transition matrix
b: constitutive expression level
noise at time t
expression level of all genes at time t
Estimating matrix A as an optimization problem
[Equations not recoverable from the transcript]

Building models consistent with gene expression
data
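
Because the slide's equations did not survive transcription, the sketch below assumes a common discrete-time form, x(t+1) = A x(t) + b + noise, where x(t) is the expression vector at time t. Under that assumption, A and b can be estimated by ordinary least squares from a time series. The model form, the function names, and the toy parameters are all assumptions; the optimization actually used in the talk may differ.

```python
def solve(matrix, rhs):
    """Solve matrix * w = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear_model(series):
    """Least-squares estimate of A and b in the assumed model x(t+1) = A x(t) + b."""
    n = len(series[0])
    inputs = [list(x) + [1.0] for x in series[:-1]]  # x(t) plus a constant term
    targets = series[1:]                             # x(t+1)
    gram = [[sum(z[i] * z[j] for z in inputs) for j in range(n + 1)]
            for i in range(n + 1)]
    A, b = [], []
    for gene in range(n):  # one regression per gene (row of A)
        moment = [sum(z[i] * y[gene] for z, y in zip(inputs, targets))
                  for i in range(n + 1)]
        w = solve(gram, moment)
        A.append(w[:n])
        b.append(w[n])
    return A, b


def simulate(A, b, x0, steps):
    """Generate a noise-free trajectory from the assumed model."""
    series = [list(x0)]
    for _ in range(steps):
        x = series[-1]
        series.append([sum(a * v for a, v in zip(row, x)) + bi
                       for row, bi in zip(A, b)])
    return series


# Hypothetical two-gene system used only to check that the fit recovers A and b.
true_A = [[0.5, 0.2], [-0.3, 0.8]]
true_b = [1.0, 0.5]
est_A, est_b = fit_linear_model(simulate(true_A, true_b, [0.0, 0.0], 6))
print(est_A, est_b)
```

With n genes the fit needs at least n + 1 informative time points; with fewer, many different matrices fit the data equally well.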
30
Challenging Problems

There are numerous other mathematical frameworks
for modeling biological networks
Experimental data is significantly limited
compared to the complexity of the networks to be
elucidated,
making the network prediction problem a
significantly under-constrained problem
leading to possibly infinitely many network
solutions, each of which explains the data
equally well

31
Network Inference in Microbes: our general
strategy

Framework: prediction of network topologies
that are most consistent with high-throughput
data and prior knowledge
Constraints: derivation of as much information
about (a) component genes and (b) their
interactions as possible, and use of this
information as prediction constraints
Sampling: sample the feasible network topology
space to derive a network topology distribution

Su et al, GIW, 2003; Ji and Xu, Bioinformatics,
2006
32
Information Extractable from Literature to set
the framework

Literature and database search
to infer initial conceptual models for a target
pathway
to collect information about which genes are
involved in the target pathway and their
interaction relationships

Pathways to utilize phosphonates
2-AEP pathway (in Gram-positive microbes):
2-aminoethylphosphonate (NH2CH2CH2PO3H)
-> [transaminase] ->
phosphonoacetaldehyde (COHCH2PO3H2)
-> [phosphonatase] ->
acetaldehyde (CHOCH3) + Pi
Automated literature mining capabilities are
desperately needed!!!
33
Derivation of Constraints

Information derivable through comparative genome
analyses and analysis of other experimental data:
Component genes (parts list) in a target network
Functional roles of component genes
Possible interaction relationships among
component genes
Higher level functional modules conserved
across organisms

using a systematic approach!
34
Deriving Parts List

Through analysis of microarray gene expression
data, one can identify an initial list of genes
possibly involved in a particular biological
process:
identification of differentially expressed genes,
co-expressed genes

g1, g2, ..., gk
The observed gene expression data are the results
of complex interactions of possibly many pathways
in a cell, which might work cooperatively,
competitively or independently with each other
Microarray data might need to be interpreted in
the context of a network model
Xu et al, NAR, 2003
35
Deriving Parts List

Refining the parts list through prediction and
application of genomic structures (guilt by
association):
Operons
Uber-operons
Regulons, and
Functional modules
...

36
Prediction of Interactions

Two types of interactions we intend to capture:
physical interactions
functional links
There are a number of databases of experimentally
verified protein-protein interactions:
DIP, BIND

Homology search against these data sets is the
key technique
Su et al, GIW, 2004
37
Network Mapping across Genomes

Related genomes may employ similar networks for a
particular biological process
Through mapping a homologous network across
genomes, one could possibly derive a network in
the target genome

?
38
Network Mapping across Genomes

Our approach: mapping orthologous genes of a
pathway to a target genome in a way that best
preserves regulon structures, i.e., co-regulated
operons
The basic idea: find homologous gene pairs with
the highest sequence similarity under the
condition that mapped genes are grouped into
co-regulated operons

homologous genes
Using both homology and genomic structure
information in mapping networks!
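
The trade-off described on this slide, high sequence similarity versus keeping mapped genes together in operons, can be illustrated with a brute-force toy. This is not P-MAP: the objective (total similarity minus a penalty per distinct operon used), the `operon_penalty` value, and all gene names are assumptions, and exhaustive enumeration is only feasible for very small pathways.

```python
from itertools import product


def map_pathway(candidates, operon_of, operon_penalty=0.3):
    """Pick one target gene per pathway gene, favouring similarity and few operons.

    `candidates` maps a pathway gene to a list of (target_gene, similarity).
    `operon_of` maps a target gene to its (predicted) operon.
    """
    pathway_genes = sorted(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[g] for g in pathway_genes)):
        targets = [t for t, _ in choice]
        if len(set(targets)) < len(targets):
            continue  # a target gene may be assigned only once
        similarity = sum(s for _, s in choice)
        operons_used = len({operon_of[t] for t in targets})
        score = similarity - operon_penalty * operons_used
        if score > best_score:
            best, best_score = dict(zip(pathway_genes, targets)), score
    return best, best_score


# Hypothetical case: t9 is the most similar hit for p2, but t2 shares an operon with t1.
candidates = {"p1": [("t1", 0.9)], "p2": [("t2", 0.7), ("t9", 0.8)]}
operon_of = {"t1": "op1", "t2": "op1", "t9": "op7"}
print(map_pathway(candidates, operon_of))
```

In the toy case the mapping prefers t2 over the slightly more similar t9, because t2 sits in the same operon as t1.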
39
Network Mapping across Genomes

The problem was formulated and solved as a
Steiner network problem (called the constrained
minimum spanning tree problem)
A recent solution solves the problem as an
integer programming problem

We have implemented the algorithm as a program,
P-MAP
Mao et al, PNAS, 2006; Olman et al, CSB, 2004
40
Mapping KEGG Pathways

(Generic) KEGG pathways consist of enzymes and
their interactions
Mapping a KEGG pathway essentially means finding
the genes that encode the enzymes in the pathway

41
Nitrogen Assimilation and Photosynthesis

Known facts:
the core part of nitrogen assimilation is
regulated by the TF ntcA, forming the ntcA regulon
a number of genes are known to be in the ntcA
regulon in some of the 16 sequenced
cyanobacterial genomes
known ntcA-regulated operons in cyanobacteria
also have a σ70-like binding motif in their
promoters
We predicted the binding motif of ntcA along with
the σ70-like motif
Key idea: predicting clustered motifs

Su et al, NAR, 2005
42
Nitrogen Assimilation and Photosynthesis

Using the profiles of the two binding motifs, we
searched the 16 genomes for additional
ntcA-regulated genes and identified a number of
additional operons
An interesting observation is that we
consistently found genes known to be involved in
photosynthesis across the 11 genomes, with ntcA
binding motifs
It was previously known that the nitrogen
assimilation process is somehow coordinated with
the photosynthesis process, but the
molecular-level mechanism is not clear
For the first time, we predicted a rough model of
the coordination between these two important
biological processes, based on the detailed
functions and interactions of the genes
involved.

Su et al, NAR, 2006
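
Searching a genome with a motif profile, as described on this slide, is commonly done by turning aligned binding sites into a log-odds position weight matrix and scoring every window of a promoter sequence. The sketch below assumes a uniform background and a pseudocount of 1; the toy sites and sequence are made up and are not the ntcA or σ70-like motifs.

```python
import math


def build_pwm(sites, pseudocount=1.0):
    """Log-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def best_hit(sequence, pwm):
    """Return (score, position) of the best-scoring window in `sequence`."""
    k = len(pwm)
    scores = [
        (sum(pwm[i][sequence[pos + i]] for i in range(k)), pos)
        for pos in range(len(sequence) - k + 1)
    ]
    return max(scores)


# Hypothetical binding sites and promoter fragment (uppercase A/C/G/T only).
pwm = build_pwm(["GTAC", "GTAC", "GTAC"])
score, position = best_hit("AAGTACAA", pwm)
print(position, round(score, 3))
```

Requiring two profiles to score well close to each other in the same promoter would be one way to approximate the "clustered motifs" idea.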
43
Nitrogen Assimilation and Photosynthesis
[Figure: pathway diagram of nitrogen assimilation
and photosynthesis]
Diagram labels: Nutrients, Light, CO2, Som,
Periplasmic membrane, Plasma membrane,
Photosystem, Calvin cycle, ATP, NADPH, RbcL,
RbcS, Icd, NrtP, Other pathways, NO3-, SYNY2460,
2468, 2469, 2474, 2-OG, PII, NarB, Hypothetical
proteins, SYNW2289, PetH, NO2-, SYNW0273, NirA,
GOGAT, Glu, Rpod, NtcA, Urt, Urease, NH4, Gln,
Urea, GS, Cyanase, DNA, Cyn, Cyanate, GltS, Amt
Legend (shape codes and color codes): NtcA
regulon, non-ntcA regulon, gene, protein,
transcription factor, transporter,
transformation/translocation, regulation
44
Summary

A substantial amount of information about genomic
structures and organizational rules is derivable
through comparative genomics
This information makes possible the
computational derivation of biological pathways
and networks of microbes
Network prediction is a systems problem, and it
requires a systems approach
Combined application of multiple types of
information provides a powerful approach to
network elucidation

45
Acknowledgment