Title: Efforts to Link Ecological Metadata with Bacterial Gene Sequences at the Sapelo Island Microbial Observatory
1Efforts to Link Ecological Metadata with Bacterial Gene Sequences at the Sapelo Island Microbial Observatory
- Wade M. Sheldon
- Mary Ann Moran
- James T. Hollibaugh
2Genetic Sequence Databases
- Major informatics success story
- Large repositories for nucleotide sequences (e.g. GenBank/EMBL/DDBJ, 16M)
- Automated and web-based data submission, required as part of publication process
- Standardized alignment/search tools support use for classification
- Numerous environmental sequences; ecologists now using them to study biogeography, community structure, eco-physiology
3Problems with GenBank
- Metadata voluntary, limited in scope
- Title (definition), authors, key words, comments, literature citation
- Many sequences unpublished, undescribed
- Quality control standards poorly enforced
- No direct way to provide links to ancillary data (URLs not officially supported, often removed)
- Very inefficient and often impossible for investigators to obtain ecological context information, even from journals
- Comparisons of matched taxa by traits not possible
4Consequence
- Tremendous amount of bacterial sequence data relevant to microbial ecologists
- No established interface
5Example: Insufficient Metadata
6Sapelo Island Microbial Observatory (http://simo.marsci.uga.edu)
- MObs: NSF-funded network of sites or "microbial observatories" established to discover novel microorganisms, microbial consortia, communities, activities and other novel properties, and to study their roles in diverse environments
- Projects supported are expected to establish or participate in an established, Internet-accessible knowledge network to disseminate the information resulting from these activities
- SIMO: Investigating the diversity of prokaryotes, their physiological and genetic characteristics, and their biogeochemical activities in a salt marsh/estuarine ecosystem in the southeastern U.S.
- Knowledge networks
- GenBank
- GCE-LTER IS
- SIMO 16S rRNA Database
7SIMO 16S rRNA Database
- Purpose: LIMS, research tool, data dissemination
- Designed to store sequence data and all supporting SIMO research information
- Hierarchical structure modeled after research workflow
- Metadata on site geography, sample collection, all methodology, personnel, ancillary measurements
- Extensive content control, error checking
- Links to information in external databases (RDP II, GenBank, GCE-LTER)
- Queries by phylogenetic and/or ecological characteristics
8Conceptual Diagram of the SIMO Database
9List-based data entry linked to metadata tables
10Controlled vocabulary supports finely-targeted queries
Automatic hyperlinks provide links to tasks
11List-based queries also simplify public interface
12Phylogenetic and ecological characteristics
combined dynamically to create overview and query
interface
13SIMO Metadata
- Metadata primarily stored in managed lists, linked to records by foreign key fields
- Scalable design: details can be added independently without altering data records
- Complete metadata for sequences generated by relational joins
- Links to external metadata in GCE-LTER database add site geography, research history, long-term environmental characteristics
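The managed-list design described on this slide can be sketched with an in-memory SQLite database. This is only an illustration: the table names (`site`, `method`, `sequence`), the column names, and the sample rows are hypothetical and are not taken from the actual SIMO schema.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database: two managed lists plus one data table.

    All table/column names and rows are hypothetical, not the SIMO schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE site   (site_id INTEGER PRIMARY KEY, name TEXT, habitat TEXT);
        CREATE TABLE method (method_id INTEGER PRIMARY KEY, description TEXT);
        CREATE TABLE sequence (
            seq_id    INTEGER PRIMARY KEY,
            label     TEXT,
            site_id   INTEGER REFERENCES site(site_id),      -- foreign key into a managed list
            method_id INTEGER REFERENCES method(method_id)   -- foreign key into a managed list
        );
        INSERT INTO site VALUES (1, 'Example marsh site', 'salt marsh');
        INSERT INTO method VALUES (1, 'Example 16S rRNA clone library');
        INSERT INTO sequence VALUES (1, 'clone-A', 1, 1);
        INSERT INTO sequence VALUES (2, 'clone-B', 1, 1);
    """)
    return conn

def full_metadata(conn, seq_id):
    """Assemble the complete metadata for one sequence with relational joins."""
    return conn.execute(
        """
        SELECT s.label, si.name, si.habitat, m.description
        FROM sequence AS s
        JOIN site   AS si ON si.site_id  = s.site_id
        JOIN method AS m  ON m.method_id = s.method_id
        WHERE s.seq_id = ?
        """,
        (seq_id,),
    ).fetchone()

conn = build_demo_db()
print(full_metadata(conn, 1))

# Adding detail to a list entry changes what every linked sequence reports,
# without touching any row in the sequence table.
conn.execute("UPDATE site SET habitat = 'salt marsh, tidal creek bank' WHERE site_id = 1")
print(full_metadata(conn, 2))
```

Because each sequence row holds only keys, the descriptive text lives in one place per list entry, which is what lets detail be added later without altering data records.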
14Metadata Standards
- No existing standard for environmental sequence metadata
- Sequence formats (FASTA, BIOML, BSML) designed for data parsing, sequence annotation
- SIMO metadata currently displayed in summary form on sequence detail pages
- Exploring adopting emerging standards like EML
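As a rough illustration of why a sequence format alone carries little ecological context, the sketch below parses FASTA text. A FASTA record offers only an identifier and one free-text description line, so anything about site or sampling has to be packed into that line or kept elsewhere. The function names and the sample record are made up for this example.

```python
def _finish(header, chunks):
    """Split a header into identifier and description, and join sequence lines."""
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)

def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records

# Hypothetical record: the description line is the only place for metadata.
sample = ">seq1 uncultured bacterium clone X\nACGT\nTTGA\n"
for ident, desc, seq in parse_fasta(sample):
    print(ident, "|", desc, "|", seq)
```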
15Sequence Details
16Future Directions
- Incorporating batch upload features for library submissions
- Integrating database with RDP SeqMatch Agent programs for automatic phylogenetic analysis, sequence annotation
- Providing full metadata in formatted/printable and parsable ASCII formats (XML)
- Participating in Entrez Link-Out to provide links to SIMO sequence entries from GenBank
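One possible shape for the parsable XML output mentioned above is sketched here with Python's standard `xml.etree.ElementTree` module. The element names (`sequenceMetadata` and the record keys) are hypothetical; they follow neither EML nor any published SIMO format.

```python
import xml.etree.ElementTree as ET

def metadata_to_xml(record):
    """Serialize a flat metadata dict to an XML string.

    Keys become child element names, so they must be valid XML names.
    The root element name is an assumption made for this sketch.
    """
    root = ET.Element("sequenceMetadata")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical record, e.g. the result of joining a sequence to its metadata lists.
xml_text = metadata_to_xml({"label": "clone-A", "habitat": "salt marsh"})
print(xml_text)

# The output can be read back by any XML parser.
parsed = ET.fromstring(xml_text)
print(parsed.find("habitat").text)
```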