Title: Sharing Genomic Data and Annotations using GFF3 format
1. Sharing Genomic Data and Annotations using GFF3 format
Dina Sulakhe and Natalia Maltsev
Bioinformatics Group, MCS, Argonne National Laboratory
Computation Institute, University of Chicago
2. What are we going to talk about?
- GFF3 overview
- GFF3 standards
  - For sharing annotations
  - For cross-referencing the data
- Extending GFF3
  - Adding annotations from public databases
  - Adding users' annotations
- Sharing and exchanging annotations using Web-services
- GFF3 genomes repository at Argonne
  - Downloads
  - Web-services
3. GFF3 overview (Lincoln Stein, 2004)
- A tab-delimited flat-file representation of genomic features
- GFF3 format
  - provides a mechanism for representing hierarchical grouping of genomic features and sub-features
  - separates the ideas of group membership and feature name/ID
  - enforces the use of controlled vocabularies by imposing constraints on the definitions of genomic features
  - allows a single feature (e.g. an exon) to belong to more than one group at a time
  - provides an explicit convention for pairwise alignments
  - provides an explicit convention for features that occupy disjoint regions
4. An Example
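As an illustration of the GFF3 properties listed in the overview, here is a minimal Python sketch that parses a few tab-delimited records and recovers the feature hierarchy. The records are hypothetical, written only for this sketch; they are not the example from the original slide. The function names (`parse_gff3`, `children_by_parent`, `segments_by_id`) are likewise assumptions, not part of any PUMA2 tool.

```python
from urllib.parse import unquote

# Hypothetical GFF3 records for illustration only (not from a real genome).
# Columns: seqid, source, type, start, end, score, strand, phase, attributes.
SAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t4000\t.\t+\t.\tID=gene1;Name=demoGene",
    "chr1\texample\tmRNA\t1000\t4000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\texample\tmRNA\t1000\t4000\t.\t+\t.\tID=mrna2;Parent=gene1",
    # One exon belonging to two groups at once (two parents).
    "chr1\texample\texon\t1000\t1500\t.\t+\t.\tID=exon1;Parent=mrna1,mrna2",
    "chr1\texample\texon\t3000\t4000\t.\t+\t.\tID=exon2;Parent=mrna1",
    # One CDS occupying disjoint regions: two lines sharing the same ID.
    "chr1\texample\tCDS\t1201\t1500\t.\t+\t0\tID=cds1;Parent=mrna1",
    "chr1\texample\tCDS\t3000\t3899\t.\t+\t0\tID=cds1;Parent=mrna1",
])

COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_attributes(field):
    """Split column 9 into a dict mapping each tag to a list of values."""
    attrs = {}
    if field == ".":
        return attrs
    for pair in field.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


def parse_gff3(text):
    """Return one dict per feature line, skipping directives and comments."""
    features = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        feature = dict(zip(COLUMNS, fields[:8]))
        feature["start"] = int(feature["start"])
        feature["end"] = int(feature["end"])
        feature["attributes"] = parse_attributes(fields[8])
        features.append(feature)
    return features


def children_by_parent(features):
    """Map each parent ID to the IDs of the features that list it as Parent."""
    children = {}
    for feature in features:
        feature_id = feature["attributes"].get("ID", [None])[0]
        for parent in feature["attributes"].get("Parent", []):
            children.setdefault(parent, []).append(feature_id)
    return children


def segments_by_id(features):
    """Collect the (start, end) segments recorded under each feature ID."""
    segments = {}
    for feature in features:
        for feature_id in feature["attributes"].get("ID", []):
            segments.setdefault(feature_id, []).append(
                (feature["start"], feature["end"])
            )
    return segments
```

In this sketch group membership lives in the `Parent` attribute and identity in `ID`, so `exon1` can appear under both `mrna1` and `mrna2`, and the two `cds1` lines are read as one feature with two segments.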
5. PUMA2/GNARE Systems
- PUMA2 (http://compbio.mcs.anl.gov/puma2) is an interactive integrated environment for high-throughput genetic sequence analysis and metabolic reconstructions of public genomes, with a Grid-based computational backend
- GNARE is PUMA2 for analysis of user-submitted genomes (http://compbio.mcs.anl.gov/gnare)
- PUMA2 contains
  - Integrated information from over 25 genomic, metabolic, structural and taxonomic databases (RefSeq, UniProt, iProClass, PDB, KEGG, EMP, CATH, NCBI Taxonomy, Phenotypes, etc.)
  - Pre-computed analysis of publicly available completely and almost completely sequenced genomes (517 bacterial, 41 archaeal, 24 eukaryotic, 638 mitochondrial and 2127 viral genomes) in the interactive PUMA2 framework
  - Automated metabolic reconstructions for 300 completely sequenced organisms
  - GNARE User Models: a framework for analysis of genomes provided by users (Shewanella federation, Apicomplexa genomes, strains of B. anthracis, Yersinia, Staphylococcus, Haemophilus, etc.)
  - A suite of unique tools for evolutionary analysis of enzymes and metabolic networks (Chisel, PhyloBlocks, etc.) developed by our group
  - PUMA2 satellite databases: Pathos (GLRCE biodefence), TarGet (MCSG structural biology), Sentra (prokaryotic signal transduction), SubUnit, Physiological Profiles, MetaGenomes (PNNL Hanford Site), etc.
6. GFF3 genomes repository at Argonne: ftp://ftp.mcs.anl.gov/pub/compbio/PUMA2/gff/gff_files/
- All completely sequenced genomes from RefSeq are converted into GFF3 format.
- GFF3 files for 8419 bacterial, eukaryotic, mitochondrial, viral, etc. genomes can be downloaded from
  - ftp://ftp.mcs.anl.gov/pub/compbio/PUMA2/gff/gff_files/
- The file names correspond to the NCBI-RefSeq accession numbers, e.g.
  - ftp://ftp.mcs.anl.gov/pub/compbio/PUMA2/gff/gff_files/NC_006815.gff
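Because the file names follow the RefSeq accession numbers, the download location for a genome can be derived from its accession. The sketch below shows one way to do that. The helper name `gff_url` and the accession pattern (two capital letters, an underscore, digits, no version suffix) are assumptions based only on the single example above, and the sketch does not check whether the file exists on the server.

```python
import re

# Base directory of the repository, as given on the slide.
BASE_URL = "ftp://ftp.mcs.anl.gov/pub/compbio/PUMA2/gff/gff_files/"

# Assumed accession shape, inferred from the one example (NC_006815):
# two capital letters, an underscore, then digits, with no version suffix.
ACCESSION_RE = re.compile(r"[A-Z]{2}_\d+")


def gff_url(accession):
    """Build the repository URL for a RefSeq accession such as NC_006815."""
    if not ACCESSION_RE.fullmatch(accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return f"{BASE_URL}{accession}.gff"
```

For example, `gff_url("NC_006815")` reproduces the URL shown above; the result can then be handed to any FTP-capable client.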
7. Future Plans: Annotations
- In 2007 we will supplement genome GFF3 annotations for public genomes with
  - additional annotations from public databases (e.g. NCBI, UniProt, Integr8, GenomeNet, etc.), and
  - annotations from our analysis tools (e.g. Chisel and PUMA2_FP) and other analysis tools
- Supplement the GFF3 files for RefSeq genomes with annotations provided by users via the GNARE system
8. Future Plans: Sharing Annotations and Cross-referencing the Data
- GFF3 format can be used by different annotation centers to share annotations and cross-references
- We plan to build services (Web-services and Web-interfaces) to allow users to
  - Submit and share their annotations via the PUMA2 GFF3 converter
  - Extract public annotations from the PUMA2 integrated database, as well as user-submitted annotations, in GFF3 format
- Support customization of the GFF3 format (e.g. include only the fields of interest to a user, provide information from a particular resource)
- Cross-references to various databases (e.g. NCBI-RefSeq, PIR, SwissProt, UniProt, and others) will be included as feature data in the GFF3
- Explore the use of ontologies for extension of the GFF3 format (we need your advice!)
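One possible way to carry cross-references and to customize output is through the attributes column: GFF3 reserves the `Dbxref` attribute for database cross-references written as `DBTAG:ID`. The sketch below is only an assumption about how this could look, not the PUMA2 converter itself. The function names, the database tags, and the identifiers are all hypothetical placeholders.

```python
# Characters that must be percent-encoded inside GFF3 attribute values.
_ESCAPES = {ch: f"%{ord(ch):02X}" for ch in "%;=&,\t\n\r"}

# Attributes kept regardless of user choice so the hierarchy stays intact.
STRUCTURAL_KEYS = ("ID", "Parent")


def escape(value):
    """Percent-encode the characters reserved in the attributes column."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def add_dbxrefs(attrs, xrefs):
    """Return a copy of attrs with (database, identifier) pairs in Dbxref."""
    updated = {key: list(values) for key, values in attrs.items()}
    existing = updated.setdefault("Dbxref", [])
    for database, identifier in xrefs:
        tag = f"{database}:{identifier}"
        if tag not in existing:  # avoid duplicate cross-references
            existing.append(tag)
    return updated


def select_attributes(attrs, wanted):
    """Keep only the attributes a user asked for, plus ID and Parent."""
    keep = set(wanted) | set(STRUCTURAL_KEYS)
    return {key: values for key, values in attrs.items() if key in keep}


def format_attributes(attrs):
    """Serialize an attribute dict back into GFF3 column 9."""
    parts = []
    for key, values in attrs.items():
        joined = ",".join(escape(value) for value in values)
        parts.append(f"{escape(key)}={joined}")
    return ";".join(parts)


# Hypothetical feature attributes and placeholder identifiers.
attrs = {
    "ID": ["gene1"],
    "Name": ["demoGene"],
    "Note": ["hypothetical protein; demo"],
}
with_xrefs = add_dbxrefs(attrs, [("RefSeq", "XX_000000"), ("UniProt", "X00000")])
full_column = format_attributes(with_xrefs)
custom_column = format_attributes(select_attributes(with_xrefs, ["Dbxref"]))
```

Here `full_column` keeps every attribute, while `custom_column` keeps only the cross-references plus the structural `ID`, which is one reading of "include only the fields of interest to a user".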
9. Future Plans: Data Distribution (GFF3 genomes repository at Argonne)
- All the feature data collected and computed by the PUMA2 project for publicly available genomes will be distributed in the GFF3 format.
- We will distribute the data through
  - Web-services
  - Web interface (HTTP)
  - FTP downloads
10. Acknowledgements
- Our team, and
- Globus: Ian Foster, Mike Wilde, Nika Nefedova, Jens Voeckler
- Condor: Zach Miller, Miron Livny
- OSG, TeraGrid
- MCS: Rick Stevens, systems, Susan Coghlan, and a lot of others.