Title: GMODTools, Argos
1 GMODTools, Argos et cetera: A Replicable Genome
infOrmation System of Common Components
Don Gilbert, gilbertd_at_indiana.edu
2 Genome DB building blocks
- GMOD Tools for public data releases
- Argos framework for genome databases
- LuceGene fast document/object search
- Genome Directory System for genome data mining
- Unified Gene Pages (XML, web page)
3 Tool Status
- GMOD Tools
  - Used to make FlyBase public data; tested with SGD lite
- Argos framework
  - Used now for 3 DBs; replicated in UK, JP; several test DBs
- LuceGene
  - Indexer working well; web interface needs work
- Genome Directory System
  - Preliminary: http://flybase.net/ws/services/Directory
- Unified Gene Pages
  - Need time and collaborators. Have FlyBase, euGenes UGP XML and other-MOD web page scraper
4 GMOD Tools: Bulkfiles
- CVS: cvs.sourceforge.net/cvsroot/gmod, checkout schema/GMODTools
5 Genome Data Tools
- Support common data update and public release tasks.
- GmodTools to load and extract reagent sequences (EST, cDNA, GSS) to/from Chado databases.
- GMOD Bulkfiles creates bulk genome sequence and feature files for public distribution from a Chado database.
- Citrina is a workflow tool to automate external databank updates, such as GenBank and Gene Ontologies.
6 12 New genomes to go
- Need to publish numerous new genomes
- Bulk files are standard public access
- Sequence (fasta, ...), features (gff, ...), searches (Blast, ...)
- 11 new Drosophila genomes, Daphnia genome, many more
- Chado database, XORT, and other GMOD Tools to export data
- http://flybase.net/species
7 Bulkfiles
- Build release files from Chado DB
- Standardized files, headers
- DNA - fasta, raw
- Features - GFF3, gnomap
- Blast indices
- Lucene file indices
- Config files (blast, gbrowse, ...)
8 Bulkfiles - BLAST indices
9 Bulkfiles - Map features
10 Bulkfiles OUTPUTS
- DNA files (full chromosomes) in raw and fasta formats
- GFF (v3) and FFF (used in FlyBase) feature files
- Fasta sequence for each feature set, with standardized headers (ID, names, db_xref, ...) from feature files
- NCBI BLAST indices and configs
- GBrowse config files with feature sets matching db
- Others added as needed (more easily than before)
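To make the idea of standardized fasta headers concrete, here is a minimal Python sketch that formats one feature as a fasta record. The header layout (`ID=...; name=...; db_xref=...`), the function name `fasta_record`, and any values passed to it are assumptions for illustration only; they are not the actual Bulkfiles header format.

```python
def fasta_record(feature_id, names, db_xrefs, sequence, width=60):
    """Format one feature as a fasta record with a key=value header.

    The header layout is hypothetical; Bulkfiles defines its own.
    """
    header_fields = [
        f"ID={feature_id}",
        f"name={','.join(names)}",
        f"db_xref={','.join(db_xrefs)}",
    ]
    header = f">{feature_id} " + "; ".join(header_fields)
    # Break the sequence into fixed-width lines, as is conventional for fasta.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(chunks) + "\n"


if __name__ == "__main__":
    # Made-up feature, only to show the call.
    print(fasta_record("gene1", ["abc"], ["DB:1"], "ACGT" * 20), end="")
```

Keeping the header fields in one function means every feature set gets the same header layout, which is the point of standardizing them.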
11 Bulkfiles Logic
- Organism/database logic (mostly) in configuration files
- Dump all chado db features using simple SQL to common intermediate table files
- Feature info is simple: type, location, name/id, and a few attributes (db_xrefs, ...; GFF-like)
- Easier checking of SQL to get all features desired
- Fast (30 - 60 min for full fly genome)
- Postprocess table files to create public use formats
- Tested with FOUR different Chado dbs (Dmel, Dmel_hetero, Dpse_Dmel, and SGDLite)
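The postprocessing step described above can be sketched in Python as a conversion from a tab-delimited intermediate table to GFF3 lines. The column order (seqid, type, start, end, strand, id, name), the `source` value, and the function names are assumptions made for this sketch; the real Bulkfiles table layout may differ.

```python
def escape_gff3(value):
    """Percent-encode characters that are reserved in GFF3 attribute values."""
    for ch in "%;=&,\t":  # '%' must be handled first
        value = value.replace(ch, f"%{ord(ch):02X}")
    return value


def table_to_gff3(table_text, source="example"):
    """Convert tab-delimited feature rows to GFF3 text.

    Assumed (hypothetical) columns: seqid, type, start, end, strand, id, name.
    """
    lines = ["##gff-version 3"]
    for row in table_text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        seqid, ftype, start, end, strand, fid, name = row.split("\t")
        attrs = f"ID={escape_gff3(fid)};Name={escape_gff3(name)}"
        # Score and phase are not in the table, so both are written as '.'.
        lines.append("\t".join(
            [seqid, source, ftype, start, end, ".", strand, ".", attrs]))
    return "\n".join(lines) + "\n"
```

Because the intermediate rows carry only type, location, name/id and a few attributes, the same table could feed other writers (fasta, FFF) without another database query.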
12 Bulkfiles stages
- Postprocess table files in stages
- Recode feature oddities to public view needs
- Better debugging of steps in the process
- Engineering time and configuration here
- Stages are loosely coupled: go back, tweak configurations, re-run partially as needed.
- Convert common feature table and DNA to several output formats in one step.
- Combine features from several dbs and other sources like cytology here.
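One way to picture loosely coupled stages that can be re-run partially is a small runner that skips any stage whose output file already exists. This is a Python sketch under stated assumptions: the stage names, the `.out` file convention, and the `force` parameter are all hypothetical and are not taken from Bulkfiles itself.

```python
import os
import tempfile


def run_stages(stages, workdir, force=()):
    """Run (name, function) stages in order, skipping finished ones.

    Each stage writes workdir/<name>.out; an existing file marks the stage
    as done unless its name is listed in `force`. Returns the names run.
    """
    ran = []
    previous = None
    for name, func in stages:
        out_path = os.path.join(workdir, name + ".out")
        if name in force or not os.path.exists(out_path):
            func(previous, out_path)
            ran.append(name)
        previous = out_path  # the next stage reads this stage's output
    return ran


def dump_features(_unused, out_path):
    """Stand-in for the database dump: writes one made-up feature row."""
    with open(out_path, "w") as out:
        out.write("2L\tgene\t100\t200\n")


def to_upper(in_path, out_path):
    """Stand-in for a recoding stage: upper-cases the previous output."""
    with open(in_path) as src, open(out_path, "w") as out:
        out.write(src.read().upper())


if __name__ == "__main__":
    stages = [("dump", dump_features), ("recode", to_upper)]
    with tempfile.TemporaryDirectory() as tmp:
        run_stages(stages, tmp)                    # first run: every stage
        run_stages(stages, tmp, force=["recode"])  # redo one stage only
```

Since each stage only depends on files left by the previous one, a configuration tweak to a late stage does not require repeating the slow database dump.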
13 Bulkfiles config example
- fbreleases
- driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password=""
- ID pattern: (FBgn|FBti)\d+
- filesets
- featuresets
- name="fbbulk-r3" relid="3"
- ROOT="GMOD_ROOT/"
- TMP="GMOD_ROOT/tmp"
- datadir="genomes/Drosophila_melanogaster"
- FlyBase Chado DB r3.2
- Configuration for feature and sequence bulk files from FlyBase chado data release 3.2.1
- dmel
- Drosophila melanogaster
- D. melanogaster euchromatin genome data from FlyBase Release 3.2.1. See http://flybase.net/annot/dmel_r3.2.1.txt
14 Bulkfiles quick test
- get software
  - cvs -d $cvsd co -d GMODTools schema/GMODTools
- load a genome chado db to Postgres
  - wget http://sgdlite.princeton.edu/download/sgdlite/-2004_05_19_sgdlite.sql.gz
  - createdb sgdlite_20040519
  - (zcat sgdlite.sql.gz | psql -d sgdlite_20040519 -f - ) > log.load
- generate file set for sgdbulk1
  - cd GMODTools
  - env GMOD_ROOT=$PWD perl -I./lib/ bin/bulkfiles.pl sgdbulk1
15 ARGOS http://www.gmod.org/argos
16 ARGOS Genome DBs
17 ARGOS Focus
- Automate genome database install and update
- Eliminate the fetch, compile, install, configure, ... cycle
- Developers test, compile, configure once; others copy/run
- Start new project quickly: copy existing project, edit to suit
- Clone servers easily (local cluster, global mirrors, company/lab, laptop)
- Compatible with most GMOD projects
- Secure collaborative genome db features
- Goal: easy for biologists to use with minimal informatics expertise
18 ARGOS Components
19 ARGOS INSTALL
20 ARGOS INSTALL
21 Edit wFleaBase
22 LuceGene (Lucy Jean) for Genome Information
Search and Retrieval
23 LuceGene
- Document/Object Search and Retrieval in Genome Databases
- High-volume data search and retrieval system for genomics and bioinformatics databases
- Standard search features: booleans, phrase, near, relevance
- Performance exceeds and extends relational databases
- Suited to a range of genome data: genes, literature, sequences, XML annotations, Medline abstracts, HTML, PDF and text documents.
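The boolean and phrase search features listed above rest on an inverted index. The following Python sketch is a toy positional index written only to illustrate that idea; the class `TinyIndex` and its methods are hypothetical, and this is neither Lucene's API nor LuceGene's implementation.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy positional inverted index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    @staticmethod
    def tokenize(text):
        """Lower-case the text and split it into alphanumeric terms."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        for pos, term in enumerate(self.tokenize(text)):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def docs(self, term):
        return set(self.postings.get(term.lower(), {}))

    def all_of(self, *terms):
        """Boolean AND: documents containing every term."""
        sets = [self.docs(t) for t in terms]
        return set.intersection(*sets) if sets else set()

    def any_of(self, *terms):
        """Boolean OR: documents containing at least one term."""
        return set().union(*(self.docs(t) for t in terms))

    def phrase(self, text):
        """Documents where the terms occur consecutively, in order."""
        terms = self.tokenize(text)
        hits = set()
        for doc_id in self.all_of(*terms):
            starts = self.postings[terms[0]][doc_id]
            if any(all(p + i in self.postings[t][doc_id]
                       for i, t in enumerate(terms)) for p in starts):
                hits.add(doc_id)
        return hits
```

Storing token positions is what makes phrase (and, by extension, "near") queries possible; relevance ranking would need term statistics on top of this and is left out of the sketch.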
24 Example LuceGene libraries
- FlyBase database
  - Annotation GAME XML, Medline XML (gamexml, medxml)
  - Genes, Annotation, References (fbgn, fban, fbrf)
  - Web, literature PDF Documents (docs)
  - Unified Gene Page XML (ugpxml)
  - Sequences, Genome Features (seqs)
- euGenes database
  - Gene summaries, Sequences, Genome Features
  - Unified Gene Page XML
  - Web Documents
- wFleaBase database
  - Sequences, Medline XML, Web documents
25 Thanks to these folks
- Josh Goodman (gmod)
- Paul Poole (gmod/iubio)
- Hardik Sheth (flybase)
- Nihar Sheth (flybase)
- Vasanth Singan (gmod)
- Victor Strelets (flybase)
- And to many developers whose work we learn from
and borrow from