A data model for Comparative Genomics - PowerPoint PPT Presentation

Provided by: lucianodig

1
A data model for Comparative Genomics
  • Laboratory for Bioinformatics (LBI), Institute of
    Computing (IC) - UNICAMP

2
Topics
  • Introduction
  • Motivation
  • The data model
  • The PABdb system
  • Conclusions
  • Future work

3
History
  • In 2002 the following genomes
  • Agrobacterium tumefaciens
  • Mesorhizobium loti
  • Ralstonia solanacearum
  • Sinorhizobium meliloti
  • Xanthomonas axonopodis pv. citri
  • Xanthomonas campestris pv. campestris
  • Xylella fastidiosa cvc
  • Xylella fastidiosa Temecula1
  • Were compared by the following people
  • M. A. Van Sluys, C. B. Monteiro-Vitorello, L. E.
    A. Camargo, C. F. M. Menck, A. C. R. da Silva, J.
    A. Ferro, M. C. Oliveira, J. C. Setubal, J. P.
    Kitajima, A.J. Simpson.

Plant-associated bacteria
4
  • To help the comparison, a database was created:
    the PAB database
  • Main author: J. P. Kitajima
  • Publication:
  • M. A. Van Sluys, C. B. Monteiro-Vitorello, L. E.
    A. Camargo, C. F. M. Menck, A. C. R. da Silva, J.
    A. Ferro, M. C. Oliveira, J. C. Setubal, J. P.
    Kitajima, and A. J. G. Simpson. Comparative
    genomic analysis of plant-associated bacteria.
    Annual Review of Phytopathology, 40, 169-189,
    2002.
  • This publication presents analysis results, not
    a description of the database

5
This work
  • PAB database overhaul
  • Redesign
  • Repopulation (data reload)
  • Inclusion of new query and visualization tools
  • PAB database description (there was none)
  • Results
  • It is now much more flexible
  • Can be used as a building block of larger
    information systems
  • Scalable
  • Much easier to include more genomes

6
Motivation for the work
  • Growing number of complete genomes of bacteria
  • Today there are about 130 complete genomes
  • In a few years there will be more than 1000
  • The genomes of several species of a genus, or
    indeed of several strains of the same species,
    have been sequenced.
  • This data growth has made it necessary to
    develop new systems and tools for comparative
    genomics.
  • The new systems must be
  • Flexible
  • Scalable

7
Scope
[Figure: scope of comparison, ranging from small sets of genomes to large sets of genomes]
  • Strains: Xylella fastidiosa (citrus, grape,
    almond, oleander)
  • Species: Xanthomonas (axonopodis pv. citri,
    campestris pv. campestris, oryzae, vesicatoria)
  • Plant associated bacteria: Agrobacterium
    tumefaciens, Sinorhizobium meliloti, Xanthomonas
    axonopodis pv. citri, Xylella fastidiosa cvc
  • All microbial
8
Basic concepts: Replicon
  • Any kind of cell unit that contains genetic
    information (e.g. chromosomes, plasmids and
    mitochondria)

9
Basic concepts: Homology
  • Homology: two genes are homologous if they share
    a common ancestor.

10
Basic concepts: Homology (II)
  • Paralogous genes are two (or more) homologous
    genes in the same organism.
  • Orthologous genes are homologous genes that
    belong to different organisms.
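
The two definitions above can be expressed as a small check. This is only a sketch of the slide's simplified definitions: the `Gene` record and the `classify_homologs` function are hypothetical names introduced here, and the pair passed in is assumed to have already been judged homologous by some other means.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record: an identifier plus its organism."""
    gene_id: str
    organism: str


def classify_homologs(a: Gene, b: Gene) -> str:
    """Label a pair that is already known to be homologous.

    Same organism -> paralogous; different organisms -> orthologous.
    """
    return "paralogous" if a.organism == b.organism else "orthologous"
```

For example, two homologous genes that both come from Xanthomonas axonopodis pv. citri would be labelled paralogous, while a homologous pair split between that genome and Xylella fastidiosa cvc would be labelled orthologous.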

11
Basic concepts: gene family
12
Basic concepts: functional category
  • I - Intermediary metabolism
  • Degradation
  • Degradation of polysaccharides and
    oligosaccharides
  • Degradation of small molecules
  • Degradation of lipids
  • Central intermediary metabolism
  • Energy metabolism, carbon
  • Regulatory functions
  • II - Biosynthesis of small molecules
  • III - Macromolecule metabolism
  • IV - Cell structure
  • V - Cellular processes
  • VI - Mobile genetic elements
  • VII - Pathogenicity, virulence, and adaptation
  • VIII - Hypothetical

13
Motivation: queries
  • Given two or more genomes, what are the genes
    shared between them and to what families do they
    belong?
  • Given two or more genomes, what are the genes
    specific to one in relation to the others, and to
    what families do they belong?
  • Given a gene x from an organism not in the
    system, does it have homologs in the system? If
    so, how many?
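
The first two queries reduce to set operations once each gene carries a family identifier. The sketch below assumes genes are available as (genome, gene_id, family_id) tuples; that layout and the function names are illustrative and are not taken from PABdb. The third query needs a sequence similarity search and is not covered here.

```python
from collections import defaultdict


def families_by_genome(genes):
    """Map each genome to the set of family ids found among its genes.

    `genes` is an iterable of (genome, gene_id, family_id) tuples
    (an assumed layout, for illustration only).
    """
    result = defaultdict(set)
    for genome, _gene_id, family_id in genes:
        result[genome].add(family_id)
    return result


def shared_families(fams, genomes):
    """Families present in every one of the given genomes (at least one genome)."""
    return set.intersection(*(fams[g] for g in genomes))


def specific_families(fams, target, others):
    """Families present in `target` and absent from all of `others`."""
    return fams[target] - set().union(*(fams[g] for g in others))
```

With toy data where genome A has families {1, 2}, B has {1, 3} and C has {1}, `shared_families(fams, ["A", "B"])` gives {1} and `specific_families(fams, "A", ["B", "C"])` gives {2}. The genes themselves can then be recovered by filtering the input tuples on the resulting family ids.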

14
[Figure: genomes G1, G2, ..., Gk; their replicons R1, R2, ..., Rp; and the genes (gw, gx, gy, gz, ...) on each replicon]
15
Attributes
  • Attributes based on GenBank data
  • Genome: id, strain, source, taxid, description
  • Replicon: id, genome_id, description, sequence
  • Gene: id, replicon_id, start_pos, end_pos,
    gene_synonym, orientation, product, name, gi,
    category
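
One way to picture these attributes is as three linked tables. The sketch below uses Python's built-in sqlite3 module purely for illustration: the slides do not say which DBMS PABdb uses, and the column types, key constraints, and the `create_db` helper are assumptions. Only the column names come from the slide.

```python
import sqlite3

# Column names follow the slide; types and constraints are assumed.
SCHEMA = """
CREATE TABLE genome (
    id          TEXT PRIMARY KEY,
    strain      TEXT,
    source      TEXT,
    taxid       INTEGER,
    description TEXT
);
CREATE TABLE replicon (
    id          TEXT PRIMARY KEY,
    genome_id   TEXT NOT NULL REFERENCES genome(id),
    description TEXT,
    sequence    TEXT
);
CREATE TABLE gene (
    id           TEXT PRIMARY KEY,
    replicon_id  TEXT NOT NULL REFERENCES replicon(id),
    start_pos    INTEGER,
    end_pos      INTEGER,
    gene_synonym TEXT,
    orientation  TEXT,
    product      TEXT,
    name         TEXT,
    gi           INTEGER,
    category     TEXT
);
"""


def create_db(path=":memory:"):
    """Create the three tables in a new SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Because every gene points to a replicon and every replicon to a genome, the genome of any gene can be reached with two joins, which is what the genome-versus-genome queries rely on.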

16
Conceptual model
Category
BLAST Hits
17
Tables and relationships
18
PABdb information system
  • Plant Associated Bacteria Database
  • Main objectives
  • management of genome data
  • comparison among genomes
  • clustering of genes in gene families and in
    categories
  • Allow easy inclusion of new comparison tools

19
System overview
BLAST, category and family operations
LOCAL DBMS
converters of data
20
Gene Families and Categories
  • Gene families were created based on BLAST results
    and on an undirected graph model G.
  • The connected components of G are the families
  • Gene categories were assigned by
  • automatic methods
  • a human curator
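
The family construction can be sketched as follows, assuming that the vertices of G are genes and that an edge joins two genes whenever a BLAST hit between them was accepted. The slides do not give the acceptance criteria, so the `hits` list is taken as given. The function name is hypothetical, and union-find is just one standard way to compute connected components, not necessarily the method used in PABdb.

```python
def gene_families(genes, hits):
    """Group genes into families = connected components of an undirected graph.

    genes: iterable of gene ids (the vertices).
    hits:  iterable of (gene_a, gene_b) pairs (the edges); both ends are
           assumed to appear in `genes`.
    """
    genes = list(genes)
    parent = {g: g for g in genes}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    families = {}
    for g in genes:
        families.setdefault(find(g), set()).add(g)
    return list(families.values())
```

Note that connected components are transitive: if a hits b and b hits c, all three land in one family even when a and c have no direct hit. A gene with no hits forms a family of its own.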

21
PABdb tools
  • Query tools
  • Query facilitators
  • Visualization tools
  • Genome overview
  • Comparison of orthologous genes of two genomes

22
Search mechanism
What are the genes in Xanthomonas axonopodis pv.
citri and Xylella fastidiosa cvc and not in
Xanthomonas campestris pv. campestris and
Xylella fastidiosa Temecula1?
Query facilitator
XML result file
result table
Browser
23
Screenshot (1) search tool
24
Genes in Xanthomonas axonopodis pv. citri and
Xylella fastidiosa cvc and not in Xanthomonas
campestris pv. campestris and Xylella fastidiosa
Temecula1
family_id  gene_id           categ_id  product
2288       Xac-chromosome    I.D.2     transcriptional regulator
2288       Xfcvc-chromosome  I.D       transcriptional regulator
2730       Xac-chromosome    VI.B      plasmid stability protein
2730       Xfcvc-chromosome  VI.B      plasmid stabilization protein
2739       Xac-chromosome    VIII.A    conserved hypothetical protein
2739       Xfcvc-pXF51       VIII.A    conserved hypothetical protein
3402       Xac-chromosome    I.C.3     cytochrome like B561
3402       Xfcvc-chromosome  I.C.3     cytochrome B561
4520       Xac-chromosome    VI.A      phage-related integrase
4520       Xfcvc-chromosome  VI.A      phage-related integrase
5376       Xac-chromosome    V.B       chromosome partitioning related protein
5376       Xfcvc-chromosome  V.B       chromosome partitioning related protein
5377       Xac-chromosome    VIII.A    conserved hypothetical protein
5377       Xfcvc-chromosome  VIII.A    hypothetical protein
5377       Xfcvc-chromosome  VIII.A    hypothetical protein
5378       Xac-chromosome    VIII.A    conserved hypothetical protein
5378       Xfcvc-chromosome  VIII.A    conserved hypothetical protein
5379       Xac-chromosome    VIII.A    conserved hypothetical protein
5379       Xfcvc-chromosome  VIII.A    hypothetical protein
5380       Xac-chromosome    VIII.A    conserved hypothetical protein
5380       Xfcvc-chromosome  VIII.A    hypothetical protein
25
family_id  gene_id           categ_id  product
5381       Xac-chromosome    VIII.A    conserved hypothetical protein
5381       Xfcvc-chromosome  VIII.A    hypothetical protein
5382       Xac-chromosome    III.A.2   single-stranded DNA binding protein
5382       Xfcvc-chromosome  III.A.2   single-stranded DNA binding protein
5383       Xac-chromosome    III.A.5   cytosine-specific DNA methyltransferase
5383       Xfcvc-chromosome  III.A.5   DNA methyltransferase
5384       Xac-chromosome    VIII.A    conserved hypothetical protein
5384       Xfcvc-chromosome  VIII.A    hypothetical protein
5385       Xac-chromosome    VIII.A    conserved hypothetical protein
5385       Xfcvc-chromosome  VIII.A    hypothetical protein
5386       Xac-chromosome    VIII.A    conserved hypothetical protein
5386       Xfcvc-chromosome  VIII.A    hypothetical protein
5387       Xac-chromosome    VIII.A    conserved hypothetical protein
5387       Xfcvc-chromosome  VIII.A    hypothetical protein
5388       Xac-chromosome    VIII.A    conserved hypothetical protein
5388       Xfcvc-chromosome  VIII.A    hypothetical protein
5389       Xac-chromosome    VI.B      plasmid-related protein
5389       Xfcvc-chromosome  VI.B      conserved plasmid protein
5390       Xac-chromosome    VIII.A    conserved hypothetical protein
5390       Xfcvc-chromosome  VIII.A    hypothetical protein
5391       Xac-chromosome    VIII.A    conserved hypothetical protein
5391       Xfcvc-chromosome  VIII.A    hypothetical protein
5413       Xac-chromosome    VIII.A    conserved hypothetical protein
5413       Xfcvc-chromosome  VIII.A    hypothetical protein
5414       Xac-chromosome    VIII.A    conserved hypothetical protein
5414       Xfcvc-chromosome  VIII.A    hypothetical protein
26
Search mechanism
Given the genomes Xanthomonas axonopodis pv.
citri and Xanthomonas campestris pv. campestris,
what are the genes shared between them
(orthologous genes)? What are the genes specific
to one genome in relation to the other?
Query facilitator
result tables
XML result file
SVG result file
Visualization tool
27
Screenshot (2) search tool
28
Xanthomonas axonopodis pv. citri chromosome
compared with Xanthomonas campestris pv.
campestris chromosome
29
Search mechanism
Given the genomes Xanthomonas axonopodis pv.
citri and Xanthomonas campestris pv. campestris,
what are the genes shared between them
(orthologous genes)?
Query facilitator
XML result file
result table
SVG result file
Visualization tool
30
Screenshot (3) visualization tool
31
Comparison of orthologous genes of Xanthomonas
axonopodis pv. citri and Xanthomonas campestris
pv. campestris
32
Distribution of genes of each genome by category
33
Conclusions
  • Information systems for genomic data management
    must be scalable and allow exchange of data and
    operations
  • This work presented a simple but flexible and
    extensible data model for comparative genomics,
    a first step in the design of a large information
    system
  • The data model was used in a real application
    (PABdb system).

34
Future work
  • Extend the data model to a richer context (e.g.
    metabolic pathways)
  • Extend the model to include subdivisions between
    family and category
  • Use of metadata to describe services and data
  • Use of different methods to generate the gene
    families.

35
Thank you!
Laboratory for Bioinformatics: www.lbi.ic.unicamp.br
Institute of Computation (IC): www.ic.unicamp.br
University of Campinas (UNICAMP): www.unicamp.br
Luciano Antonio Digiampietri: luciano_at_ic.unicamp.br