GMODTools, Argos - PowerPoint PPT Presentation


1
GMODTools, Argos & cetera: A Replicable Genome
infOrmation System of Common Components
  • GMOD Meeting, Oct. 2004

Don Gilbert, gilbertd_at_indiana.edu
2
Genome DB building blocks
  • GMOD Tools for public data releases
  • Argos framework for genome databases
  • LuceGene fast document/object search
  • Genome Directory System for genome data mining
  • Unified Gene Pages (XML, web page)

3
Tool Status
  • GMOD Tools: in use to make FlyBase public data;
    tested with SGD Lite
  • Argos framework: used now for 3 DBs; replicated
    in UK, JP; several test DBs
  • LuceGene: indexer working well; web interface
    needs work
  • Genome Directory System: preliminary,
    http://flybase.net/ws/services/Directory
  • Unified Gene Pages: need time and collaborators.
    Have FlyBase, euGenes UGP XML and other-MOD web
    page scraper

4
GMOD Tools: Bulkfiles
cvs.sourceforge.net:/cvsroot/gmod checkout schema/GMODTools
5
Genome Data Tools
  • Support common data update and public release
    tasks.
  • GmodTools to load and extract reagent sequences
    (EST, cDNA, GSS) to/from Chado databases.
  • GMOD Bulkfiles creates bulk genome sequence and
    feature files for public distribution from a
    Chado database.
  • Citrina is a workflow tool to automate external
    databank updates, such as GenBank and Gene
    Ontologies.

6
12 New genomes to go
  • Need to publish numerous new genomes
  • Bulk files are standard public access
  • Sequence (fasta, ...), features (gff, ...),
    searches (Blast, ...)
  • 11 new Drosophila genomes, Daphnia genome, many
    more
  • Chado database, XORT, and other GMOD Tools to
    export data
  • http://flybase.net/species

7
Bulkfiles
  • Build release files from Chado DB
  • Standardized files, headers
  • DNA - fasta, raw
  • Features - GFF3, gnomap
  • Blast indices
  • Lucene file indices
  • Config files (blast, gbrowse, ...)

8
Bulkfiles - BLAST indices
9
Bulkfiles - Map features
10
Bulkfiles OUTPUTS
  • DNA files (full chromosomes) in raw and fasta
    formats
  • GFF (v3) and FFF (used in FlyBase) feature files
  • Fasta sequence for each feature set, with
    standardized headers (ID, names, db_xref, ...)
    from feature files
  • NCBI BLAST indices and configs
  • Gbrowse config files with feature sets matching
    db
  • Others added as needed (more easily than before)
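
Presenter's note (illustration only): a minimal Python sketch of what a FASTA record with a standardized header could look like. The header layout, the function name fasta_record, and the sample ID and db_xref values are assumptions made for this example; they are not the actual Bulkfiles header format.

```python
def fasta_record(feature_id, names, db_xrefs, sequence, width=60):
    """Format one feature as a FASTA record with a key=value style header.

    The header layout here is a hypothetical convention, chosen only to
    show ID, names and db_xref travelling together with the sequence.
    """
    header = ">{} name={}; db_xref={}".format(
        feature_id, ",".join(names), ",".join(db_xrefs))
    # Wrap the sequence into fixed-width lines, as FASTA files usually do.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(chunks) + "\n"


# Placeholder values, not real database entries.
record = fasta_record("gene0001", ["geneA"], ["EXAMPLE:123"], "ACGT" * 20)
print(record)
```

Keeping the header fields in one fixed order makes the files easy to parse with the same script across feature sets.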

11
Bulkfiles Logic
  • Organism/database logic (mostly) in configuration
    files
  • Dump all chado db features using simple sql to
    common intermediate table files
  • Feature info is simple: type, location, name/id,
    and a few attributes (db_xrefs, ...; GFF-like)
  • Easier checking of SQL to get all features
    desired
  • Fast (30 - 60 min for full fly genome)
  • Postprocess table files to create public use
    formats
  • Tested with FOUR different Chado dbs (Dmel,
    Dmel_hetero, Dpse_Dmel, and SGDLite)
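
Presenter's note (illustration only): the postprocessing idea above, sketched in Python. It assumes a hypothetical tab-separated intermediate table with eight columns (sequence, type, start, end, strand, id, name, db_xref); the real Bulkfiles table layout is not shown on the slide. The nine output columns and the ID, Name and Dbxref attribute tags follow the GFF3 format.

```python
def table_row_to_gff3(row, source="example"):
    """Convert one intermediate table row into a 9-column GFF3 line.

    The input column order is an assumption made for this sketch.
    """
    seqid, ftype, start, end, strand, fid, name, dbxref = row.rstrip("\n").split("\t")
    attrs = ["ID=" + fid, "Name=" + name]
    if dbxref:  # db_xref is optional in this hypothetical table
        attrs.append("Dbxref=" + dbxref)
    # Score and phase are not tracked here, so both are written as "."
    return "\t".join([seqid, source, ftype, start, end, ".", strand, ".", ";".join(attrs)])


# Two made-up rows; IDs and coordinates are placeholders.
rows = [
    "2L\tgene\t1000\t2500\t+\tgene0001\tgeneA\tEXAMPLE:123",
    "2L\tgene\t4000\t5200\t-\tgene0002\tgeneB\t",
]
gff_lines = ["##gff-version 3"] + [table_row_to_gff3(r) for r in rows]
print("\n".join(gff_lines))
```

Because the table rows are already flat and GFF-like, each public format needs only a small formatter of this kind.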

12
Bulkfiles stages
  • Postprocess table files in stages
  • Recode feature oddities to public view needs
  • Better debugging of steps in the process
  • Engineering time and configuration here
  • Stages are loosely coupled: go back, tweak
    configurations, re-run partially as needed.
  • Convert common feature table and DNA to several
    output formats in one step.
  • Combine features from several DBs and other
    sources like cytology here.
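
Presenter's note (illustration only): one way loosely coupled stages with partial re-runs could be arranged, sketched in Python. The stage names, the run_stages helper and the in-memory cache are all hypothetical; Bulkfiles itself is a Perl tool that works on table files, and its internals are not shown on the slide.

```python
def run_stages(stages, source, cache, start_at=None):
    """Run (name, function) stages in order, saving each stage's output.

    With start_at set, earlier stages are skipped and the saved output of
    the preceding stage is reused as input.
    """
    names = [name for name, _ in stages]
    first = names.index(start_at) if start_at is not None else 0
    data = source if first == 0 else cache[names[first - 1]]
    for name, func in stages[first:]:
        data = func(data)
        cache[name] = data
    return data


def recode(features):
    # Map an internal type name to the public one (made-up mapping).
    return [("gene" if ftype == "internal_gene" else ftype, start)
            for ftype, start in features]


def sort_by_start(features):
    return sorted(features, key=lambda f: f[1])


def to_tab_lines(features):
    return ["{}\t{}".format(ftype, start) for ftype, start in features]


def to_csv_lines(features):
    return ["{},{}".format(ftype, start) for ftype, start in features]


features = [("internal_gene", 300), ("internal_gene", 100)]
stages = [("recode", recode), ("sort", sort_by_start), ("format", to_tab_lines)]
cache = {}
full_run = run_stages(stages, features, cache)

# Tweak only the last stage, then re-run from there using saved results.
stages[2] = ("format", to_csv_lines)
partial_run = run_stages(stages, features, cache, start_at="format")
print(full_run)
print(partial_run)
```

Saving each stage's output is what makes the go back, tweak, re-run loop cheap: only the changed stage and the ones after it are repeated.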

13
Bulkfiles config example
fbreleases
driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password="" /
(FBgn|FBti)\d+ (ID pattern)
filesets
featuresets
  • name="fbbulk-r3" relid="3"
  • ROOT="GMOD_ROOT/"
  • TMP="GMOD_ROOT/tmp"
  • datadir="genomes/Drosophila_melanogaster"
  • FlyBase Chado DB r3.2
  • Configuration for feature and sequence bulk
    files from FlyBase chado data release 3.2.1
  • dmel
  • Drosophila melanogaster
  • D. melanogaster euchromatin genome data from
    FlyBase Release 3.2.1. See
    http://flybase.net/annot/dmel_r3.2.1.txt

14
Bulkfiles quick test
  • Get software:
    cvs -d $cvsd co -d GMODTools schema/GMODTools
  • Load a genome Chado DB to Postgres:
    wget http://sgdlite.princeton.edu/download/sgdlite/-2004_05_19_sgdlite.sql.gz
    createdb sgdlite_20040519
    (zcat sgdlite.sql.gz | psql -d sgdlite_20040519 -f - ) >& log.load
  • Generate file set for sgdbulk1:
    cd GMODTools
    env GMOD_ROOT=$PWD perl -I./lib/ bin/bulkfiles.pl sgdbulk1
15
ARGOS: http://www.gmod.org/argos
16
ARGOS Genome DBs
17
ARGOS Focus
  • Automate genome database install and update
  • Eliminate the fetch, compile, install, configure
    cycle
  • Developers test, compile, config once; others
    copy/run
  • Start new project quickly: copy existing project,
    edit to suit
  • Clone servers easily (local cluster, global
    mirrors, company/lab, laptop)
  • Compatible with most GMOD projects
  • Secure collaborative genome db features
  • Goal: easy for biologists to use with minimal
    informatics expertise

18
ARGOS Components
19
ARGOS INSTALL
20
ARGOS INSTALL
21
Edit wFleaBase
22
LuceGene (Lucy Jean) for Genome Information
Search and Retrieval
23
LuceGene
  • Document/Object Search and Retrieval in Genome
    Databases
  • high-volume data search and retrieval system for
    genomics and bioinformatics databases
  • standard search features: booleans, phrase, near,
    relevance
  • performance exceeds and extends relational
    databases
  • suited to a range of genome data: genes,
    literature, sequences, XML annotations, Medline
    abstracts, HTML, PDF and text documents.
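
Presenter's note (illustration only): boolean and phrase search over documents usually rests on an inverted index. The Python sketch below is a toy version of that technique, not LuceGene or Lucene code; the function names and the three sample documents are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to {doc_id: [positions in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, *terms):
    """Boolean AND: ids of documents that contain every term."""
    sets = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_phrase(index, phrase):
    """Phrase query: documents where the terms sit at consecutive positions."""
    terms = phrase.lower().split()
    hits = set()
    for doc_id in search_and(index, *terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


# Made-up documents standing in for gene, reference and sequence records.
docs = {
    "gene1": "alcohol dehydrogenase enzyme",
    "ref1": "dehydrogenase activity of alcohol metabolism",
    "seq1": "wing development gene",
}
index = build_index(docs)
and_hits = search_and(index, "alcohol", "dehydrogenase")
phrase_hits = search_phrase(index, "alcohol dehydrogenase")
print(sorted(and_hits), sorted(phrase_hits))
```

Both records mention the two words, but only one has them adjacent, which is the difference between a boolean AND and a phrase query.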

24
Example LuceGene libraries
  • FlyBase database
    • Annotation GAME XML, Medline XML (gamexml,
      medxml)
    • Genes, Annotation, References (fbgn, fban, fbrf)
    • Web, literature PDF Documents (docs)
    • Unified Gene Page XML (ugpxml)
    • Sequences, Genome Features (seqs)
  • euGenes database
    • Gene summaries, Sequences, Genome Features
    • Unified Gene Page XML
    • Web Documents
  • wFleaBase database
    • Sequences, Medline XML, Web documents
25
Thanks to these folks
  • Josh Goodman (gmod)
  • Paul Poole (gmod/iubio)
  • Hardik Sheth (flybase)
  • Nihar Sheth (flybase)
  • Vasanth Singan (gmod)
  • Victor Strelets (flybase)
  • And to many developers whose work we learn from
    and borrow from