DDBJ activity of genome databases and bacterial gene evaluation

1
DDBJ activity of genome databases and bacterial gene evaluation
  • GSC Meeting at Robinson College in Cambridge
  • September 21, 2006
  • Yoshio Tateno
  • DDBJ
  • National Institute of Genetics, Japan

2
Number of Entries from DDBJ Release, Excluding CON, TPA, WGS, and MGA
Region   Database   Entries      Regional total (share)
Japan    DDBJ       10,092,078   10,846,262 (19.4%)
         JPO           754,184
Europe   EMBL        4,473,387    6,001,430 (10.7%)
         EPO         1,528,043
U.S.A.   GenBank    38,208,838   39,043,303 (69.9%)
         USPTO         834,465
as of DDBJ release 65
3
Number of Nucleotides from DDBJ Release, Excluding CON, TPA, WGS, and MGA
Region   Database   Nucleotides      Regional total (share)
Japan    DDBJ        7,032,096,316    7,481,218,890 (12.4%)
         JPO           449,122,574
Europe   EMBL        7,752,604,380    8,800,142,545 (14.5%)
         EPO         1,047,538,165
U.S.A.   GenBank    43,888,277,009   44,283,360,200 (73.1%)
         USPTO         395,083,191
as of DDBJ release 65
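The regional totals and shares on the two statistics slides follow directly from the per-database counts. As a minimal sketch, the arithmetic can be rechecked as below. The dictionary layout and the function name `regional_shares` are illustrative assumptions, not part of any DDBJ tool.

```python
# Per-database counts as of DDBJ release 65 (values copied from the slides).
ENTRIES = {
    "Japan": {"DDBJ": 10_092_078, "JPO": 754_184},
    "Europe": {"EMBL": 4_473_387, "EPO": 1_528_043},
    "U.S.A.": {"GenBank": 38_208_838, "USPTO": 834_465},
}
NUCLEOTIDES = {
    "Japan": {"DDBJ": 7_032_096_316, "JPO": 449_122_574},
    "Europe": {"EMBL": 7_752_604_380, "EPO": 1_047_538_165},
    "U.S.A.": {"GenBank": 43_888_277_009, "USPTO": 395_083_191},
}


def regional_shares(counts):
    """Return {region: (regional total, percent of grand total to 0.1)}."""
    totals = {region: sum(dbs.values()) for region, dbs in counts.items()}
    grand_total = sum(totals.values())
    return {
        region: (total, round(100 * total / grand_total, 1))
        for region, total in totals.items()
    }


for label, table in (("entries", ENTRIES), ("nucleotides", NUCLEOTIDES)):
    for region, (total, pct) in regional_shares(table).items():
        print(f"{label:12s} {region:7s} {total:>15,d} {pct:5.1f}%")
```

The recomputed totals and percentages agree with the figures shown on the slides.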
4
Breakdown Statistics
http://www.ddbj.nig.ac.jp/Welcome-e.html
What is 'Breakdown Statistics'? We at DDBJ have indexed all entries of the DDBJ release (INSDC data excluding WGS and MGA) with respect to:
  • Date of publication
  • Research projects
  • Nationality of submitters
In particular, we identified the research project from the 'REFERENCE' line and determined the nationality from the address of the primary submitter in the flat file. The main purpose of 'Breakdown Statistics' is to provide well-visualized overviews of submission trends.
5
Breakdown Statistics: Trends in data growth and submission
http://www.ddbj.nig.ac.jp/breakdown_stats/submission/index-e.html
These figures show how the number of entries in each division changed from 1992 to 2005.
The division-level figures show a drastic shift from gene-oriented to genome-oriented research in the last several years.
6
Genome Annotation of E. coli K-12
The E. coli annotation workshop II was held at the Marine Biological Laboratory from 19 to 24 March 2005. The main purpose of the workshop was to re-annotate E. coli K-12 MG1655 (USA team) and W3110 (Japanese team) using the latest biological knowledge. One of the DDBJ annotators participated in the workshop with Japanese researchers from the Nara Institute of Science and Technology. The complete genome sequence (accession AP009048) of E. coli K-12 W3110, with the annotation from the workshop, was released from DDBJ. Cf. Riley, M., et al. (2006), NAR 34, 1-9.
LOCUS       AP009048             4646332 bp    DNA     circular BCT 25-MAR-2006
DEFINITION  Escherichia coli W3110 DNA, complete genome.
ACCESSION   AP009048 AB001340 D10483 D26562 D83536 D90699-D90711 D90713-D90754
            D90756-D90878 D90880-D90897
VERSION     AP009048.1
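The ACCESSION field of this entry lists the primary accession first, followed by secondary accessions, some of them written as ranges such as D90699-D90711. A minimal sketch of how such ranges could be expanded is shown below. The function name `expand_accessions` is hypothetical, and the sketch assumes every range joins two accessions with the same letter prefix and digit width.

```python
import re

# ACCESSION field of AP009048, copied from the flat-file excerpt.
ACCESSION_FIELD = (
    "AP009048 AB001340 D10483 D26562 D83536 D90699-D90711 "
    "D90713-D90754 D90756-D90878 D90880-D90897"
)

_ACC = re.compile(r"([A-Z]+)(\d+)")


def expand_accessions(field):
    """Expand range tokens like 'D90699-D90711' into single accessions."""
    accessions = []
    for token in field.split():
        if "-" not in token:
            accessions.append(token)
            continue
        first, last = token.split("-")
        m1, m2 = _ACC.fullmatch(first), _ACC.fullmatch(last)
        if not (m1 and m2) or m1.group(1) != m2.group(1):
            raise ValueError(f"unsupported accession range: {token}")
        prefix, width = m1.group(1), len(m1.group(2))
        for n in range(int(m1.group(2)), int(m2.group(2)) + 1):
            accessions.append(f"{prefix}{n:0{width}d}")
    return accessions


all_accessions = expand_accessions(ACCESSION_FIELD)
primary, secondary = all_accessions[0], all_accessions[1:]
print(primary, len(secondary))
```

The gaps between the ranges (for example, D90712 is absent) are preserved, because only the listed ranges are expanded.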
7
Rice Annotation Project Database (RAP-DB)
RAP1 data have been available since December 2005. Cf. Ohyanagi et al. (2006), NAR 34, Database issue, D741-D744.
http://rapdb.lab.nig.ac.jp/
DDBJ is now contacting submitters to reflect these annotation results in their entries.
You can view the data via two types of genome browsers:
  • GBrowse
  • G-integra
8
Gene Trek in Prokaryote Space (GTPS)
9
How many genes are there in the microbial world? How real are the genes that reportedly exist in the genomes of microbial species?
  • We have developed a scalable and traceable protocol from the viewpoint of:
      • gene prediction program
      • parameters, e.g.
          • minimum length of ORFs
          • threshold of homology
      • consistency in
          • representation (data structure)
          • contents
      • versions of reference databases
      • annual updates

10
3a. Annotation description of products
  • unknown
  • hypothetical protein
  • probable ORF
  • predicted protein
  • putative protein
  • hypothetical conserved protein
  • uncharacterized protein
  • conserved domain protein
  • conserved hypothetical
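The list above shows how many different phrases are used for products of unknown function. As a rough sketch of how such descriptions could be folded into a single flag, one might normalize case and whitespace and compare against the listed phrases. The exact-match rule and the name `is_uninformative` are assumptions for illustration, not the GTPS procedure.

```python
# Phrases taken from the slide; matching is exact after normalization.
UNINFORMATIVE_PRODUCTS = {
    "unknown",
    "hypothetical protein",
    "probable orf",
    "predicted protein",
    "putative protein",
    "hypothetical conserved protein",
    "uncharacterized protein",
    "conserved domain protein",
    "conserved hypothetical",
}


def is_uninformative(product):
    """True if the product text is one of the 'unknown function' phrases."""
    normalized = " ".join(product.lower().split())
    return normalized in UNINFORMATIVE_PRODUCTS


for name in ("Hypothetical  protein", "probable ORF", "Acetylglutamate kinase"):
    print(f"{name!r}: {is_uninformative(name)}")
```

A real pipeline would probably need looser matching (for example, "conserved hypothetical protein"), which this sketch deliberately leaves out.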

11
Features of GTPS
  • Simultaneous prediction and evaluation of all
    the possible protein coding genes (ORFs) in
    prokaryote genomes
  • All the ORFs are graded
  • The criteria and evidence data are available
  • Web site: http://gtps.ddbj.nig.ac.jp/
  • Updated once a year

12
The very short history of GTPS
ver. (data frozen in)   Species   Archaea   Bacteria
2003 (Jul 2003)         123       14        109
2004 (Sep 2004)         183       17        166
2005 (Feb 2006)         302       25        277
13
GTPS workflow (for 2003 and 2004)
  • genome sequence
  • masking of RNA regions (rRNA, tRNA, Rfam)
  • Glimmer 2.0 (60 aa and 15 aa)
  • INSDC ORFs
  • seed sequence (GIB)
  • RBSfinder
  • leader peptide, ribosomal slippage, exon-intron, etc.
  • blastp, InterProScan
  • assignment of /pseudo
  • ORF grading
14
[Figure: ORF grading chart. Columns: Blastp hit (coverage/homology, measured as alignment/subject and alignment/ORF against a threshold of 70), InterProScan hit (known motif, unknown motif, no hit), and Grade (AAAA/BBBB, AAA/BBB, AA/BB, A, B, C, D (hypo)). Potential genes: grades AAAA1-D3.]

15
The A-X grading is combined with an indicator of consistency with the data in the INSDC DB:
  • 1: identical
  • 2: 3'-matched
  • 3: newly predicted by this study
  • 4: not predicted by this study
The grade is expressed as the combination of the reliability indicator and the matching indicator to the INSDC DB, e.g. AAAA1, BB2, C3, D4.
Grades AAAA1-D3: potential protein-coding genes
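A combined grade such as AAAA1 or D4 can be split back into its two parts. The sketch below does this with a regular expression. It assumes that "AAAA1-D3" means a reliability indicator in the A, B, C, or D series together with a matching indicator of 1, 2, or 3, and that E and X may appear without a digit; both are readings of the slide, not a published specification. The function names are hypothetical.

```python
import re

# Matching indicator to the INSDC DB, as listed on the slide.
MATCH_LABELS = {
    1: "identical",
    2: "3'-matched",
    3: "newly predicted by this study",
    4: "not predicted by this study",
}

# Reliability indicator (AAAA..A, BBBB..B, C, D, E, X) plus optional digit.
_GRADE = re.compile(r"(A{1,4}|B{1,4}|[CDEX])([1-4])?")


def parse_grade(grade):
    """Split e.g. 'AAAA1' into ('AAAA', 1); the digit may be absent."""
    m = _GRADE.fullmatch(grade)
    if m is None:
        raise ValueError(f"unrecognized grade: {grade!r}")
    match = int(m.group(2)) if m.group(2) else None
    return m.group(1), match


def is_potential_gene(grade):
    """Assumed reading of 'AAAA1-D3': series A-D with indicator 1-3."""
    reliability, match = parse_grade(grade)
    return reliability[0] in "ABCD" and match in (1, 2, 3)


for g in ("AAAA1", "BB2", "C3", "D4", "X"):
    reliability, match = parse_grade(g)
    label = MATCH_LABELS.get(match, "n/a")
    print(g, reliability, label, is_potential_gene(g))
```

Under this reading, D4 is not counted as a potential gene, which is consistent with the later slides where grade D4 ORFs are removed from protein-coding genes.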
16
Number of ORFs in each grade
GTPS ver.   AAAA-A    BBBB-B   C        D         E        X         Total
2003        283,247   7,208    4,680    79,779    6,788    466,681   848,383
2004        431,672   10,250   7,511    107,382   10,225   687,110   1,254,150
2005        752,186   20,755   16,227   137,656   17,739   278,075   1,222,638
Gene prediction: Glimmer2 and RBSfinder (ver. 2003 and 2004); Glimmer 3 (ver. 2005)

GTPS ver.   Potential genes (AAAA1-D3)   INSDC
2003        370,876                      362,828
2004        551,246                      537,312
2005        903,845                      904,530
17
Differences of the annotation between GTPS and INSDC
Yellow-colored ORFs were added to protein-coding genes.
Streptococcus pyogenes SF370
position: 43518..45065 (515 AA)
ID: AHB0110136008904
grade: AAAA3
product name: putative phosphoribosylaminoimidazole-carboxamide formyltransferase / IMP cyclohydrolase
98% identity to Q8K8Y6 of UniProtKB/Swiss-Prot
result of GTPS 2004
18
Wolinella succinogenes DSMZ 1740
position: 453774..454973 (399 AA)
ID: ALO0110136006525
grade: AAAA3
product name: protein chain elongation factor EF-Tu
100% identity to P42482 of UniProtKB/Swiss-Prot

position: 455016..455174 (52 AA)
ID: ALO0110136007831
grade: AAAA3
product name: ribosomal protein L33
88% identity to P66217 of UniProtKB/Swiss-Prot
result of GTPS 2004
19
Mycobacterium leprae TN
position: 2342940..2343278 (112 AA)
ID: AEM0110136012863
grade: A3
product name: hypothetical protein
This ORF includes a motif, IPR009418 (Protein of unknown function DUF1067).
result of GTPS 2004
20
Differences of the annotation between GTPS and INSDC
Purple-colored ORFs (grade D4) in INSDC were removed from protein-coding genes.
Aeropyrum pernix K1
INSDC: 2,694; GTPS: 1,688
result of GTPS 2004
21
Differences of the annotation between GTPS and INSDC
Red-colored ORFs (grade X) and purple-colored ORFs (grade D4) were removed from protein-coding genes.
Erwinia carotovora subsp. atroseptica SCRI1043
INSDC: 4,472; GTPS: 3,893
Pirellula sp. strain 1
INSDC: 7,325; GTPS: 6,769
result of GTPS 2004
22
Newly predicted ORFs having 20-100 AA and their identity to UniProtKB/Swiss-Prot entries

Bacillus halodurans C-125
position: 1503756..1503905 (49 AA)
ID: AAP0110136014866
grade: AAAA3
product name: 50S ribosomal protein L33
83% identity to P66232

Aquifex aeolicus VF5
position: 8457..8678 (73 AA)
ID: AAJ0110136007547
grade: C3
product name: predicted in GTPS (50S ribosomal protein L29)
100% identity to P56613

Escherichia coli O157:H7 RIMD 0509952 (Escherichia coli O157:H7 str. Sakai)
position: complement(5001462..5001662) (66 AA)
ID: ADL0110136010772
grade: AAAA3
product name: thiamin biosynthesis, probable sulfur donor
95% identity to O32583

Pyrococcus horikoshii OT3
position: complement(1096..1281) (61 AA)
ID: AFO0110136005704
grade: AAAA3
product name: Preprotein translocase secE subunit
100% identity to P58199
result of GTPS 2004
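The amino-acid lengths quoted next to each position on these slides can be reproduced from the coordinates. The sketch below parses a location of the form a..b or complement(a..b) and converts the nucleotide span to a protein length. It assumes the coordinates include the stop codon, which is consistent with every example shown here; the function names are hypothetical.

```python
import re

_LOCATION = re.compile(r"(complement\()?(\d+)\.\.(\d+)(\))?")


def parse_location(location):
    """Return (start, end, strand) for 'a..b' or 'complement(a..b)'."""
    m = _LOCATION.fullmatch(location.strip())
    if m is None or bool(m.group(1)) != bool(m.group(4)):
        raise ValueError(f"unsupported location: {location!r}")
    strand = "-" if m.group(1) else "+"
    return int(m.group(2)), int(m.group(3)), strand


def protein_length(location):
    """Amino acids encoded, assuming the span includes the stop codon."""
    start, end, _ = parse_location(location)
    span = end - start + 1
    if span % 3 != 0:
        raise ValueError(f"span of {span} nt is not a multiple of 3")
    return span // 3 - 1


EXAMPLES = [
    "43518..45065",
    "455016..455174",
    "complement(5001462..5001662)",
    "complement(1096..1281)",
]
for loc in EXAMPLES:
    print(loc, parse_location(loc)[2], protein_length(loc))
```

For instance, 43518..45065 spans 1,548 nt, or 516 codons, which gives 515 AA once the stop codon is excluded.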
23
Computation time
  • Blastp search: 16-node PC cluster (Dual Xeon 3.2 GHz)
      • 8 hours for 123 genomes
      • 12.5 hours for 183 genomes
  • Motif search (InterPro): 64-CPU cluster system
      • 420 hours for 123 genomes
      • 624 hours for 183 genomes
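Dividing the reported hours by the number of genomes gives a rough per-genome cost for each step. The sketch below does only that arithmetic; the dictionary layout and the name `hours_per_genome` are illustrative, and genomes differ in size, so the result is an average rather than a prediction.

```python
# Reported run times in hours, keyed by number of genomes processed.
RUNS = {
    "Blastp search (16-node PC cluster)": {123: 8.0, 183: 12.5},
    "Motif search, InterPro (64 CPUs)": {123: 420.0, 183: 624.0},
}


def hours_per_genome(hours, genomes):
    """Average hours spent per genome for one run."""
    return hours / genomes


for step, runs in RUNS.items():
    for genomes, hours in runs.items():
        rate = hours_per_genome(hours, genomes)
        print(f"{step}: {genomes} genomes, {rate:.3f} h per genome")
```

Both steps stay close to a constant per-genome cost between the two runs (about 0.07 h for Blastp and about 3.4 h for the motif search), which suggests roughly linear scaling with the number of genomes.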

24
For more details
  • http://gtps.ddbj.nig.ac.jp/

25
GTPS group
  • Takashi KOSUGE (DDBJ)
  • Takashi ABE (DDBJ)
  • Toshihisa OKIDO (DDBJ)
  • Naoto TANAKA (JST)
  • Masaki HIRAHATA (JST)
  • Yutaka MARUYAMA (JST)
  • Jun MASHIMA (DDBJ)
  • Aki TOMIKI (DDBJ)
  • Motoyoshi KUROSAWA (RIKEN/CC)
  • Ryutaro HIMENO (RIKEN/CC)
  • Satoshi FUKUCHI (DDBJ/GTOP)
  • Ken NISHIKAWA (DDBJ/GTOP)
  • Satoru MIYAZAKI (DDBJ)
  • Takashi GOJOBORI (DDBJ)
  • Yoshio TATENO (DDBJ)
  • Hideaki SUGAWARA (DDBJ)

Please visit the GTPS site: http://gtps.ddbj.nig.ac.jp/
26
Genes TO Proteins DB (GTOP)
27
How many genes are out there?
The ratio of clustered ORFs to all ORFs among the sampled genomes increased as the number of genomes increased. The ratio is not yet saturated at 180 genomes (29.5% of the genes among the 180 genomes were clustered and presumed to have the same function).
28
Please click GTOP at http://www.ddbj.nig.ac.jp/
29
2. Parameters: the cutoff length used
Length              Number
>20                 1
>(=)30              25
>33.3 aa (100 bp)   3
>40 aa              1
>50 aa              7
>60 aa              4
>66.6 aa (200 bp)   1
>80                 2
>100 aa             6
>150 aa             1
>200 aa             1
>300 aa             1
>400 aa             1
30
3a. Annotation description of products
Hahella chejuensis KCTC 2396
     CDS             1023521..1024429
                     /gene="argB"
                     /locus_tag="HCH_01027"
                     /EC_number="2.7.2.8"
                     /note="COG0548"
                     /codon_start=1
                     /transl_table=11
                     /product="Acetylglutamate kinase"
                     /protein_id="ABC27909.1"
                     /db_xref="GI:83631942"
                     /translation="MLDRDNALQVAAVLSKALPYIQRFAGKTIVIKYGGN
                     AMTDEELKNSFARDVVMMKLVGINPIVV
                     HGGGPQIGDLLQRLNIKSSFINGLRVTDSETMDVVEMV
                     LGGSVNKDIVALINRNGGKAIGLTGKDANFITARKLEVTR
                     ATPDMQKPEIIDIGHVGE
                     VTGVRKDIITMLTDSDCIPVIAPIGVGQDGASYNINADLVAGKVAEVLQA
                     EKLMLLTNIAGLMNKEGKVLTGLSTKQV
                     DELIADGTIHGGMLPKIECALSAVKNGVHSAHIIDGRV
                     PHATLLEIFTDEGVGTLITRKGCDDA"
COG ID in the qualifier of /note
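Because the COG ID here is stored as free text in /note rather than in a dedicated qualifier, a consumer has to look for it explicitly. The sketch below parses qualifiers written in the standard INSDC /key="value" or /key=value form and pulls COG-style IDs out of /note. The function names and the pattern "COG followed by four digits" are assumptions for illustration, and the /translation qualifier is left out for brevity.

```python
import re

# Qualifiers of the argB CDS, in standard INSDC /key="value" syntax.
FEATURE_TEXT = '''
/gene="argB"
/locus_tag="HCH_01027"
/EC_number="2.7.2.8"
/note="COG0548"
/codon_start=1
/transl_table=11
/product="Acetylglutamate kinase"
/protein_id="ABC27909.1"
/db_xref="GI:83631942"
'''

# One qualifier per line: quoted value or bare token.
_QUALIFIER = re.compile(
    r'^[ \t]*/(\w+)=(?:"([^"]*)"|(\S+))[ \t]*$', re.MULTILINE
)


def parse_qualifiers(text):
    """Return {qualifier name: [values]} for single-line qualifiers."""
    qualifiers = {}
    for m in _QUALIFIER.finditer(text):
        value = m.group(2) if m.group(2) is not None else m.group(3)
        qualifiers.setdefault(m.group(1), []).append(value)
    return qualifiers


def cog_ids(qualifiers):
    """Collect COG-style IDs (assumed 'COG' + 4 digits) from /note."""
    ids = []
    for note in qualifiers.get("note", []):
        ids.extend(re.findall(r"\bCOG\d{4}\b", note))
    return ids


quals = parse_qualifiers(FEATURE_TEXT)
print(quals["product"], cog_ids(quals))
```

Values are kept as lists because a feature may carry the same qualifier more than once, for example several /note or /db_xref lines.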
31
[Figure: detailed ORF grading chart. Columns: blastp hit, split into Coverage (alignment/subject and alignment/ORF against a threshold of 70) and Subject (not putative membrane nor unknown protein; putative membrane or unknown protein; putative membrane protein; unknown protein; no hit); InterProScan hit (function-known motif, unknown motif, no hit); and Grade (AAAA/BBBB, AAA/BBB, AA/BB, A, B, C, D, E, X). Potential genes: grades AAAA1-D3.]
32
3b. Annotation inconsistency or biological
variation
Agrobacterium tumefaciens C58 circular chromosome
(Cereon)
Agrobacterium tumefaciens C58 circular chromosome
(U. Washington)