Title: A data model for Comparative Genomics
1A data model for Comparative Genomics
- Laboratory for Bioinformatics (LBI), Institute of
Computing (IC) - UNICAMP
2Topics
- Introduction
- Motivation
- The data model
- The PABdb system
- Conclusions
- Future work
3History
- In 2002 the following genomes
- Agrobacterium tumefaciens
- Mesorhizobium loti
- Ralstonia solanacearum
- Sinorhizobium meliloti
- Xanthomonas axonopodis pv. citri
- Xanthomonas campestris pv. campestris
- Xylella fastidiosa cvc
- Xylella fastidiosa Temecula1
- Were compared by the following people
- M. A. Van Sluys, C. B. Monteiro-Vitorello, L. E.
A. Camargo, C. F. M. Menck, A. C. R. da Silva, J.
A. Ferro, M. C. Oliveira, J. C. Setubal, J. P.
Kitajima, A.J. Simpson.
Plant-associated bacteria
4- To help the comparison, a database was created: the PAB database
- Main author: J. P. Kitajima
- Publication
- M. A. van Sluys, C. B. Monteiro-Vitorello, L. E. A. Camargo, C. F. M. Menck, A. C. R. da Silva, J. A. Ferro, M. C. Oliveira, J. C. Setubal, J. P. Kitajima, and A. J. G. Simpson. Comparative genomic analysis of plant-associated bacteria. Annual Review of Phytopathology, 40, 169-189, 2002.
- This publication presents analysis results, not a database description
5This work
- PAB database overhaul
- Redesign
- Repopulation (data reload)
- Inclusion of new query and visualization tools
- PAB database description (there was none)
- Results
- It is now much more flexible
- Can be used as a building block of larger information systems
- Scalable
- Much easier to include more genomes
6Motivation for the work
- Growing number of complete genomes of bacteria
- Today there are about 130 complete genomes
- In a few years there will be more than 1000
- The genomes of several species of a genus, or indeed the genomes of several strains of the same species, have been sequenced.
- This data growth has made necessary the development of new systems and tools for comparative genomics.
- The new systems must be
- Flexible
- Scalable
7Scope
[Diagram: nested scopes, ordered from small sets of genomes to large sets of genomes]
- Strains: Xylella fastidiosa (citrus, grape, almond, oleander)
- Species: Xanthomonas (axonopodis pv. citri, campestris pv. campestris, oryzae, vesicatoria)
- Plant-associated bacteria: Agrobacterium tumefaciens, Sinorhizobium meliloti, Xanthomonas axonopodis pv. citri, Xylella fastidiosa cvc
- All microbial
8Basic concepts Replicon
- Any kind of cell unit that contains genetic
information (e.g. chromosomes, plasmids and
mitochondria)
9Basic concepts Homology
- Homology: two genes are homologous if they share a common ancestor.
10Basic concepts Homology (II)
- Paralogous genes are two (or more) homologous genes in the same organism.
- Orthologous genes are homologous genes that belong to different organisms.
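As a minimal sketch of these two definitions, the snippet below labels a pair of genes that is already known to be homologous. The `Gene` tuple, the `classify_homologs` function and the sample identifiers are illustrative assumptions, not part of PABdb.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str    # hypothetical gene identifier
    genome_id: str  # organism (genome) the gene belongs to


def classify_homologs(a: Gene, b: Gene) -> str:
    """Label two genes that are already known to be homologous."""
    # Same organism -> paralogous; different organisms -> orthologous.
    return "paralogous" if a.genome_id == b.genome_id else "orthologous"


g1 = Gene("g1", "genome-A")
g2 = Gene("g2", "genome-A")
g3 = Gene("g3", "genome-B")
print(classify_homologs(g1, g2))
print(classify_homologs(g1, g3))
```

The function does not decide whether two genes are homologous in the first place; that evidence has to come from elsewhere (for example, sequence similarity).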
11Basic concepts gene family
12Basic concepts functional category
- I - Intermediary metabolism
- Degradation
- Degradation of polysaccharides and oligosaccharides
- Degradation of small molecules
- Degradation of lipids
- Central intermediary metabolism
- Energy metabolism, carbon
- Regulatory functions
- II - Biosynthesis of small molecules
- III - Macromolecule metabolism
- IV - Cell structure
- V - Cellular processes
- VI - Mobile genetic elements
- VII - Pathogenicity, virulence, and adaptation
- VIII - Hypothetical
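Category identifiers in the query results shown later in this deck are dotted codes such as I.D.2 or VIII.A. A small sketch of how such a code could be split into its levels follows; it assumes that each dot-separated part is one level below the Roman-numeral heading, and the function names are hypothetical.

```python
def category_levels(code: str) -> list[str]:
    """Return every level of a dotted category code, from the top level down to the code itself."""
    parts = code.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def top_level(code: str) -> str:
    """Return the Roman-numeral heading of a category code."""
    return code.split(".")[0]


print(category_levels("I.D.2"))  # ['I', 'I.D', 'I.D.2']
print(top_level("VIII.A"))       # VIII
```

Grouping genes by `top_level` would give counts per main category, while `category_levels` supports roll-ups at any depth.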
13Motivation queries
- Given two or more genomes, what are the genes shared between them and to what families do they belong?
- Given two or more genomes, what are the genes specific to one in relation to the others, and to what families do they belong?
- Given a gene x from an organism not in the system, does it have homologs in the system? If so, how many?
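The first two queries can be read as set operations over gene families. The sketch below assumes that each genome has already been reduced to the set of family identifiers with at least one member in it; the `families_matching` function, the genome labels and the family numbers are made-up illustrations, not PABdb data.

```python
def families_matching(families_by_genome, present_in, absent_from=()):
    """Families with a member in every genome of present_in and in no genome of absent_from.

    present_in must contain at least one genome.
    """
    shared = set.intersection(*(set(families_by_genome[g]) for g in present_in))
    excluded = set().union(*(families_by_genome[g] for g in absent_from))
    return shared - excluded


# Toy data: genome label -> family identifiers found in that genome.
families_by_genome = {
    "G1": {1, 2, 3},
    "G2": {2, 3, 4},
    "G3": {3, 5},
}

print(families_matching(families_by_genome, ["G1", "G2"]))          # shared by G1 and G2
print(families_matching(families_by_genome, ["G1"], ["G2", "G3"]))  # specific to G1
```

The third query (a gene from outside the system) is not covered here, since it needs a fresh similarity search against the stored genes.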
14[Diagram: genomes G1, G2, ..., Gk are made up of replicons R1, R2, ..., Rp, and each replicon holds genes (labelled gx, gy, gz, gw)]
15Attributes
- Attributes based on GenBank data
- Genome
- id, strain, source, taxid, description
- Replicon
- id, genome_id, description, sequence
- Genes
- id, replicon_id, start_pos, end_pos,
gene_synonym, orientation, product, name, gi,
category
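One possible relational rendering of these attributes is sketched below using SQLite from Python. The slides do not name the DBMS or the column types, so the choice of SQLite, the types, the constraints and the sample rows are all assumptions; only the table and column names come from the attribute list.

```python
import sqlite3

# Column names follow the attribute list; types and keys are assumed.
SCHEMA = """
CREATE TABLE genome (
    id          TEXT PRIMARY KEY,
    strain      TEXT,
    source      TEXT,
    taxid       INTEGER,
    description TEXT
);
CREATE TABLE replicon (
    id          TEXT PRIMARY KEY,
    genome_id   TEXT NOT NULL REFERENCES genome(id),
    description TEXT,
    sequence    TEXT
);
CREATE TABLE gene (
    id           TEXT PRIMARY KEY,
    replicon_id  TEXT NOT NULL REFERENCES replicon(id),
    start_pos    INTEGER,
    end_pos      INTEGER,
    gene_synonym TEXT,
    orientation  TEXT,
    product      TEXT,
    name         TEXT,
    gi           INTEGER,
    category     TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample rows, only to exercise the genome -> replicon -> gene chain.
conn.execute("INSERT INTO genome (id, strain) VALUES (?, ?)", ("genome-A", "example strain"))
conn.execute("INSERT INTO replicon (id, genome_id) VALUES (?, ?)", ("genome-A-chromosome", "genome-A"))
conn.execute(
    "INSERT INTO gene (id, replicon_id, start_pos, end_pos, category) VALUES (?, ?, ?, ?, ?)",
    ("gene-1", "genome-A-chromosome", 1, 900, "I.D.2"),
)

# A gene reaches its genome through its replicon.
rows = conn.execute(
    "SELECT gene.id, replicon.genome_id "
    "FROM gene JOIN replicon ON gene.replicon_id = replicon.id"
).fetchall()
print(rows)
```

Keeping `genome_id` only on the replicon mirrors the attribute list, where a gene refers to its replicon and a replicon refers to its genome.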
16Conceptual model
[Diagram: conceptual model; legible labels include Category and BLAST Hits]
17Tables and relationships
18PABdb information system
- Plant Associated Bacteria Database
- Main objectives
- management of genome data
- comparison among genomes
- clustering of genes in gene families and in categories
- Allow easy inclusion of new comparison tools
19System overview
[Diagram: system components are data converters, a local DBMS, and BLAST, category and family operations]
20Gene Families and Categories
- Gene families were created based on BLAST results and on an undirected graph model G.
- The connected components of G are the families
- Gene categories were assigned by
- automatic methods
- human curator
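The family construction can be sketched as a connected-components computation. The slides say only that G is an undirected graph built from BLAST results; the sketch assumes that vertices are genes and that an edge joins two genes whenever a BLAST hit between them passes some filter that is not specified here. Union-find is one convenient way to find the components, not necessarily the method used in PABdb, and the gene labels are made up.

```python
def gene_families(genes, hits):
    """Group genes into families: the connected components of the hit graph.

    genes: iterable of gene identifiers (vertices).
    hits:  iterable of (gene_a, gene_b) pairs (undirected edges, already filtered).
    """
    genes = list(genes)
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the component representative, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    families = {}
    for g in genes:
        families.setdefault(find(g), set()).add(g)
    return list(families.values())


genes = ["a", "b", "c", "d", "e"]
hits = [("a", "b"), ("b", "c"), ("d", "d")]
for family in gene_families(genes, hits):
    print(sorted(family))
```

Because families are whole components, two genes can land in the same family without a direct hit between them (here "a" and "c" are linked only through "b"), and a gene without hits forms a family of its own.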
21PABdb tools
- Query tools
- Query facilitators
- Visualization tools
- Genome overview
- Comparison of orthologous genes of two genomes
22Search mechanism
What are the genes in Xanthomonas axonopodis pv.
citri and Xylella fastidiosa cvc and not in
Xanthomonas campestris pv. campestris and
Xylella fastidiosa Temecula1?
Query facilitator
XML result file
result table
Browser
23Screenshot (1) search tool
24Genes in Xanthomonas axonopodis pv. citri and
Xylella fastidiosa cvc and not in Xanthomonas
campestris pv. campestris and Xylella fastidiosa
Temecula1
family_id  gene_id           categ_id  product
2288       Xac-chromosome    I.D.2     transcriptional regulator
2288       Xfcvc-chromosome  I.D       transcriptional regulator
2730       Xac-chromosome    VI.B      plasmid stability protein
2730       Xfcvc-chromosome  VI.B      plasmid stabilization protein
2739       Xac-chromosome    VIII.A    conserved hypothetical protein
2739       Xfcvc-pXF51       VIII.A    conserved hypothetical protein
3402       Xac-chromosome    I.C.3     cytochrome like B561
3402       Xfcvc-chromosome  I.C.3     cytochrome B561
4520       Xac-chromosome    VI.A      phage-related integrase
4520       Xfcvc-chromosome  VI.A      phage-related integrase
5376       Xac-chromosome    V.B       chromosome partitioning related protein
5376       Xfcvc-chromosome  V.B       chromosome partitioning related protein
5377       Xac-chromosome    VIII.A    conserved hypothetical protein
5377       Xfcvc-chromosome  VIII.A    hypothetical protein
5377       Xfcvc-chromosome  VIII.A    hypothetical protein
5378       Xac-chromosome    VIII.A    conserved hypothetical protein
5378       Xfcvc-chromosome  VIII.A    conserved hypothetical protein
5379       Xac-chromosome    VIII.A    conserved hypothetical protein
5379       Xfcvc-chromosome  VIII.A    hypothetical protein
5380       Xac-chromosome    VIII.A    conserved hypothetical protein
5380       Xfcvc-chromosome  VIII.A    hypothetical protein
25(table continued)
family_id  gene_id           categ_id  product
5381       Xac-chromosome    VIII.A    conserved hypothetical protein
5381       Xfcvc-chromosome  VIII.A    hypothetical protein
5382       Xac-chromosome    III.A.2   single-stranded DNA binding protein
5382       Xfcvc-chromosome  III.A.2   single-stranded DNA binding protein
5383       Xac-chromosome    III.A.5   cytosine-specific DNA methyltransferase
5383       Xfcvc-chromosome  III.A.5   DNA methyltransferase
5384       Xac-chromosome    VIII.A    conserved hypothetical protein
5384       Xfcvc-chromosome  VIII.A    hypothetical protein
5385       Xac-chromosome    VIII.A    conserved hypothetical protein
5385       Xfcvc-chromosome  VIII.A    hypothetical protein
5386       Xac-chromosome    VIII.A    conserved hypothetical protein
5386       Xfcvc-chromosome  VIII.A    hypothetical protein
5387       Xac-chromosome    VIII.A    conserved hypothetical protein
5387       Xfcvc-chromosome  VIII.A    hypothetical protein
5388       Xac-chromosome    VIII.A    conserved hypothetical protein
5388       Xfcvc-chromosome  VIII.A    hypothetical protein
5389       Xac-chromosome    VI.B      plasmid-related protein
5389       Xfcvc-chromosome  VI.B      conserved plasmid protein
5390       Xac-chromosome    VIII.A    conserved hypothetical protein
5390       Xfcvc-chromosome  VIII.A    hypothetical protein
5391       Xac-chromosome    VIII.A    conserved hypothetical protein
5391       Xfcvc-chromosome  VIII.A    hypothetical protein
5413       Xac-chromosome    VIII.A    conserved hypothetical protein
5413       Xfcvc-chromosome  VIII.A    hypothetical protein
5414       Xac-chromosome    VIII.A    conserved hypothetical protein
5414       Xfcvc-chromosome  VIII.A    hypothetical protein
26Search mechanism
Given the genomes Xanthomonas axonopodis pv.
citri and Xanthomonas campestris pv. campestris,
what are the genes shared between them
(orthologous genes)? What are the genes specific
to one genome in relation to the other?
Query facilitator
result tables
XML result file
SVG result file
Visualization tool
27Screenshot (2) search tool
28Xanthomonas axonopodis pv. citri chromosome
compared with Xanthomonas campestris pv.
campestris chromosome
29Search mechanism
Given the genomes Xanthomonas axonopodis pv.
citri and Xanthomonas campestris pv. campestris,
what are the genes shared between them
(orthologous genes)?
Query facilitator
XML result file
result table
SVG result file
Visualization tool
30Screenshot (3) visualization tool
31Comparison of orthologous genes of Xanthomonas
axonopodis pv. citri and Xanthomonas campestris
pv. campestris
32Distribution of genes of each genome by category
33Conclusions
- Information systems for genome management must be scalable and allow exchange of data and operations
- This work presented a simple but flexible and extensible data model for comparative genomics: a first step in the design of a large information system
- The data model was used in a real application (the PABdb system).
34Future work
- Extend the data model to a richer context (e.g. metabolic pathways)
- Extend the model to include subdivisions between family and category
- Use of metadata to describe services and data
- Use of different methods to generate the gene families.
35Thank you!
Laboratory for Bioinformatics www.lbi.ic.unicamp.br
Institute of Computation (IC) www.ic.unicamp.br
University of Campinas (UNICAMP) www.unicamp.br
Luciano Antonio Digiampietri luciano_at_ic.unicamp.br