Title: BeeBase: The Honey Bee Model Organism Database
1 BeeBase - The Honey Bee Model Organism Database
Chris Elsik, c-elsik_at_tamu.edu
2 Outline
- BeeBase - what it is now
- How it works
- Future Plans
3 BeeBase
http://racerx00.tamu.edu/PHP/bee_search.php
- Predicted Gene and Homolog Search Page
- Genome Browser
- Comparative Map Viewer
- Protein Families Database with Bee, Fly and Mosquito proteins
- The newest assembly (release 2.0): http://racerx00.tamu.edu/cgi-bin/gbrowse/bee_genome2
11 Gbrowse
- A module of the Generic Model Organism Database Project (GMOD), www.gmod.org
- A graphical viewer of features along a reference sequence
- Based on MySQL and Perl
- The configuration file allows us to:
  - Change fonts, colors, text.
  - Change the overview sequence: scaffold, contig, genetic map, karyotype.
  - Define tracks.
  - Modify track appearance.
12 Gbrowse Internals
- BioPerl library - allows the browser to run on top of a variety of database management systems and schemata
- Bio::Graphics module - used to graphically render any type of nucleotide or protein feature
- Bio::DB::GFF database - uses a flat coordinate system to represent genomic features. Optimized for queries that retrieve features by ID, type or region of the genome
13 Our task is to generate GFF data
- GFF: generic feature format
- A standard format that aids data exchange
- Allows you to specify a substring of a biological sequence
- The current version (2) uses terms from the Sequence Ontology project
  - A set of terms used to describe features on a nucleotide or protein sequence. It encompasses both "raw" features, such as nucleotide similarity hits, and interpretations, such as gene models.
- For information on the specifications:
  - http://www.sanger.ac.uk/Software/formats/GFF/
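As a rough illustration of how a GFF record picks out a substring of a sequence, here is a minimal Python sketch. It assumes the nine tab-separated columns of GFF version 2 (seqname, source, feature, start, end, score, strand, frame, group) and 1-based inclusive coordinates. The names `GffFeature`, `parse_gff2_line`, `feature_substring`, `scaffold_demo` and `est_demo` are hypothetical and do not come from BeeBase.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    """One GFF2 record; field names follow the GFF2 column order."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    group: str


def parse_gff2_line(line: str) -> GffFeature:
    """Split a tab-separated GFF2 line into its columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    start, end = int(cols[3]), int(cols[4])
    if start < 1 or end < start:
        raise ValueError("GFF coordinates are 1-based with start <= end")
    group = cols[8] if len(cols) > 8 else ""  # the group column is optional
    return GffFeature(cols[0], cols[1], cols[2], start, end,
                      cols[5], cols[6], cols[7], group)


def feature_substring(sequence: str, feat: GffFeature) -> str:
    """Return the part of the reference sequence covered by the feature."""
    # Convert 1-based inclusive GFF coordinates to a Python slice.
    return sequence[feat.start - 1:feat.end]


# Hypothetical record: an exon at positions 3..6 of a toy scaffold.
line = 'scaffold_demo\texonerate\texon\t3\t6\t.\t+\t.\tTarget "est_demo" 1 4'
feat = parse_gff2_line(line)
print(feature_substring("AACCGGTT", feat))  # prints CCGG
```

The sketch ignores strand, so a minus-strand feature would still need to be reverse-complemented by the caller.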
14 Computing Data for Tracks
- Markers
  - Compare marker sequences to genome scaffolds using BLASTN
  - Use ePCR (primersearch) for markers with primers, but no sequence
- ESTs
  - Compare ESTs to genome scaffolds using fasta or BLAT
  - Use exonerate (http://www.ebi.ac.uk/guy/exonerate/) to predict exon/intron boundaries for each match
- Protein Homologs
  - Compare protein sequences to genome scaffolds using tfastx to identify matches
  - Use exonerate to predict exon/intron boundaries for each match
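The step from a similarity hit to a browser track can be sketched as a small converter. The Python below is an assumption-laden illustration, not the BeeBase pipeline: it assumes BLASTN was run with the 12-column tabular output (query, subject, percent identity, length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score) and that the genome scaffold is the subject. The function name, the `similarity` feature type and the marker and scaffold names are hypothetical.

```python
def blast_hit_to_gff(line: str, source: str = "BLASTN",
                     feature: str = "similarity") -> str:
    """Convert one 12-column tabular BLAST hit into a GFF2-style line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError("expected 12 tab-separated columns")
    query, subject = cols[0], cols[1]
    qstart, qend = cols[6], cols[7]
    sstart, send = int(cols[8]), int(cols[9])
    bitscore = cols[11]
    # Tabular BLAST reports minus-strand hits with subject start > subject end;
    # GFF wants start <= end plus an explicit strand column.
    strand = "+" if sstart <= send else "-"
    start, end = min(sstart, send), max(sstart, send)
    # Group column in the style of the GFF2 "Target" convention (assumed here).
    group = f'Target "{query}" {qstart} {qend}'
    return "\t".join([subject, source, feature, str(start), str(end),
                      bitscore, strand, ".", group])


# Hypothetical hit of a marker against the minus strand of a scaffold.
hit = "marker_demo\tscaffold_demo\t98.50\t200\t3\t0\t1\t200\t5200\t5001\t1e-100\t370"
print(blast_hit_to_gff(hit))
```

Exon/intron structure is not recovered by this conversion; in the workflow above that comes from exonerate.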
15 Annotating Tracks
- The most time-consuming task in computing tracks is providing annotations for protein homologs.
- Annotations come from different sources and are in different formats depending on the protein dataset.
- We use UniProt for all homolog tracks in the assembly 1.1 and 1.2 browsers.
- Assembly 2 uses proteome sets for Drosophila (FlyBase), C. elegans (WormBase), Yeast (SGD), Mosquito (Ensembl) and Human (Ensembl) to avoid redundancy within proteomes.
- The fasta-formatted sequences are not annotated (except yeast).
- The other insect track will come from UniProt.
- To identify which sequences are insect, we use taxon-id and a locally installed NCBI taxonomy database.
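The insect filter in the last bullet amounts to walking from a taxon-id up its parent links until Insecta or the root is reached. The Python sketch below assumes the taxonomy is available as lines in the NCBI `nodes.dmp` layout (tax_id, parent tax_id, rank, ... separated by `|`). The taxon-ids 50557 (Insecta), 7460 (Apis mellifera) and 9606 (Homo sapiens) are real NCBI identifiers, but the parent links in the toy excerpt are deliberately collapsed and do not reflect the real lineages. The function names are hypothetical.

```python
INSECTA = 50557  # NCBI taxon-id for class Insecta


def load_parents(lines):
    """Build a child -> parent map from nodes.dmp-style lines."""
    parents = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents


def is_descendant(tax_id, ancestor, parents):
    """True if `ancestor` is `tax_id` itself or lies on its path to the root."""
    seen = set()
    while tax_id in parents and tax_id not in seen:
        if tax_id == ancestor:
            return True
        seen.add(tax_id)           # the root is its own parent; avoid looping
        tax_id = parents[tax_id]
    return tax_id == ancestor


# Toy excerpt with collapsed lineages (illustration only).
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|",
    "50557\t|\t1\t|\tclass\t|",
    "7460\t|\t50557\t|\tspecies\t|",
    "9606\t|\t1\t|\tspecies\t|",
]
parents = load_parents(toy_nodes)
print(is_descendant(7460, INSECTA, parents))  # True
print(is_descendant(9606, INSECTA, parents))  # False
```

An unknown taxon-id is treated as non-insect rather than raising an error, which may or may not be the desired behavior for a production filter.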
16 CMap
- CMap is a web-based tool that allows users to view comparisons of genetic and physical maps.
- The package also includes tools for curating map data.
- MySQL and Perl
- Consists of modules for data, logic (how maps are laid out), and presentation.
- Our work is to modify the configuration file and format data.
17 Future BeeBase Plans
- Redo the protein families analysis after the final gene prediction set is released; add proteins from additional model organisms (worm, yeast, mouse, human)
- Phylogenetic analysis to identify orthologs
- Gene Ontology assignment
- Create gene pages for each gene, similar to FlyBase, using the new Turnkey gmod-web module
18 More BeeBase Plans
- Curate literature for orthologs to provide an entry into the BeeSpace conceptual navigation system.
- Incorporate a QTL viewer using Dave Adelson's QTL viewer software, which was developed for cattle.
- Incorporate the OpenGeneX gene expression database and expression data from the BeeSpace project.
19 Gene Ontology For Honey Bee
20 Gene Ontology Consortium
http://www.geneontology.org/
- The goal of the Gene Ontology™ (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
- GO provides three structured networks of defined terms to describe gene product attributes.
  - Molecular Function Ontology: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
  - Biological Process Ontology: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
  - Cellular Component Ontology: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
21 GO Evidence Codes
- IDA: inferred from direct assay - enzyme assays, in vitro reconstitution (e.g. transcription), immunofluorescence (for cellular component), cell fractionation (for cellular component), physical interaction/binding assay
- IEP: inferred from expression pattern - useful for biological process ontology
- IGI: inferred from genetic interaction - "traditional" genetic interactions such as suppressors, synthetic lethals, etc., functional complementation, rescue experiments, inference about one gene drawn from the phenotype of a mutation in a different gene
- IMP: inferred from mutant phenotype - any gene mutation/knockout, overexpression/ectopic expression of wild-type or mutant genes, anti-sense experiments, RNAi experiments, specific protein inhibitors
- IPI: inferred from physical interaction - 2-hybrid interactions, co-purification, co-immunoprecipitation, ion/protein binding experiments
- IEA: inferred from electronic annotation
- ISS: inferred from sequence or structural similarity
- IC: inferred by curator; TAS: traceable author statement; NAS: non-traceable author statement; ND: no biological data available; NR: not recorded
22 Applying GO to Honey Bee
- We must rely heavily on IEA (inferred from electronic annotation - no curator) or ISS (inferred from sequence similarity - inspected by curator)
- We must make the most reliable inferences possible - based on orthology instead of homology
23 Background: Evolution-based functional inference and orthology
24 Evolution Allows us to Infer Function
- The most powerful method for inferring function of a gene or protein is by similarity searching a sequence database.
- Our ability to characterize biological properties of a protein using sequence data alone stems from properties conserved through evolutionary time.
- Homologous (evolutionarily related) proteins always share a common 3-dimensional folding structure.
- They often contain common active sites or binding domains.
- They frequently share common functions.
- Predictions made using similar, but non-homologous proteins are much less reliable.
25 Orthologs
- Homologs: genes that are evolutionarily related
- There are two kinds of homologs
  - Orthologs: genes in different species that have diverged from a common gene in an ancestral species.
  - Paralogs: genes that have diverged due to gene duplication.
- Orthologs are more likely than paralogs to have conserved function.
- Orthologs cannot be identified using BLAST or FASTA sequence comparison alone.
- Reliable ortholog identification requires phylogenetic methods.
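One way to see why a gene tree is needed: two genes are orthologs when the node where their lineages join is a speciation, and paralogs when it is a duplication. The Python sketch below illustrates only that final classification step. It assumes the internal nodes of the gene tree have already been labeled as `speciation` or `duplication` (in practice that labeling comes from phylogenetic analysis, which is not shown). The tree encoding, the function names and the leaf names are hypothetical.

```python
def leaves(node):
    """Return the set of gene names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node["children"]:
        found |= leaves(child)
    return found


def relationship(tree, gene_a, gene_b):
    """Classify two genes by the event at their lowest common ancestor."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaves(tree):
        raise ValueError("need two different genes that are both in the tree")
    node = tree
    while True:
        # Descend while a single child subtree still contains both genes.
        deeper = next((child for child in node["children"]
                       if not isinstance(child, str)
                       and {gene_a, gene_b} <= leaves(child)), None)
        if deeper is None:
            break
        node = deeper
    return "orthologs" if node["event"] == "speciation" else "paralogs"


# Toy tree: an ancient duplication, followed by a speciation in each copy.
toy_tree = {
    "event": "duplication",
    "children": [
        {"event": "speciation", "children": ["Maize-1", "Sorghum-1"]},
        {"event": "speciation", "children": ["Maize-2", "Sorghum-2"]},
    ],
}
print(relationship(toy_tree, "Maize-1", "Sorghum-1"))  # orthologs
print(relationship(toy_tree, "Maize-1", "Maize-2"))    # paralogs
print(relationship(toy_tree, "Maize-1", "Sorghum-2"))  # paralogs
```

The last case shows why sequence similarity alone can mislead: genes from different species may still be paralogs if their lineages join at a duplication.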
26 Example Gene Tree (with plant genes)
[Figure: gene tree with leaves Rice-2b, Rice-2a, Maize-2, Wheat-2, Sorghum-2, Barley-1, Wheat-1, Maize-1, Sorghum-1 and Arabidopsis; labels in the figure mark groups of paralogs and orthologs.]
27 Why shouldn't we depend on inferences based on paralogs?
- Paralogs emerge after a gene duplication.
- Possible fates of duplicated genes
  - Loss of function for one of the duplicates - lack of selective pressure allows the gene to mutate beyond recognition
  - Emergence of new functional paralogs - one duplicate acquires a new function, so selection favors its maintenance in the genome
  - Sub-functionalization - both duplicates are required to maintain the function of the original
28 Back to Gene Ontology for Honey Bee: Proposed Evidence Codes within ISS
- ISS: inferred from sequence similarity (inspected by a curator)
- We can break this down into
  - Inferred from homology (lowest)
  - Inferred from an ortholog in one species
  - Inferred from orthologs in more than one species, all of which have the same GO classification (highest).
- What if they don't all have the same GO classification? Move up in the directed acyclic graph to a point where GO classifications converge.
- This can be tricky since the graph is a directed acyclic graph and each node can have more than one parent
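"Moving up to a point where classifications converge" can be read as finding the most specific terms that are ancestors of every ortholog's GO term. The Python sketch below is one possible reading, not the method used in the Elsik lab. It assumes the ontology is given as a map from each term to its list of parents, and the term names (`root`, `B`, `C`, ...) are toy placeholders rather than real GO identifiers.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def convergence_points(terms, parents):
    """Most specific terms shared by the ancestries of all given terms."""
    if not terms:
        raise ValueError("need at least one term")
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    # Drop any shared term that is a proper ancestor of another shared term.
    return {t for t in common
            if not any(t in ancestors(other, parents) - {other}
                       for other in common)}


# Toy graph: D and H each have two parents, as GO terms can.
toy_parents = {
    "root": [],
    "B": ["root"], "C": ["root"],
    "D": ["B", "C"], "H": ["B", "C"],
    "E": ["D"], "F": ["D"],
    "G": ["B"],
}
print(convergence_points(["E", "F"], toy_parents))  # {'D'}
print(convergence_points(["E", "G"], toy_parents))  # {'B'}
print(sorted(convergence_points(["D", "H"], toy_parents)))  # ['B', 'C']
```

The last call shows the tricky case mentioned above: because a node can have more than one parent, two classifications may converge at more than one term, and a curator would still have to choose among them.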
29 Some Ongoing Gene Ontology Work in the Elsik Lab
- Cattle
  - Cattle EST Gene Family Database
  - Cattle gene families were created using assembled, translated ESTs grouped with homologous human protein families.
  - Database is searchable using GO for the human proteins.
  - The next step is phylogenetic analysis to identify human/cattle orthologs.
30 Searching by Gene Ontology
33 Borrowing More From Cattle
- Bovine QTL Database - David Adelson, TAMU
34 The Bovine QTL viewer Interface
35 Image showing all chromosomes
36 Image showing one chromosome
37 QTL Details
38 OpenGeneX
- Web-based access to database
- PostgreSQL
- Includes as a curation tool a client-side Java application that formats data in MAGE-ML
- Includes several statistical routines and data analysis tools
- Uses R statistical analysis package (open source)
39 Acknowledgements
- Collaborators
  - Bruce Schatz, Gene Robinson and the BeeSpace group, UIUC
  - William Gelbart - FlyBase (Harvard University)
  - Spencer Johnston (TAMU)
  - Danny Weaver, Bee Power LP
- Elsik Lab
  - Justin Reese
  - Kyounghwa Bae
  - Anand Venkatraman
  - Shreyas Murthi
  - Michael Dickens
  - Juan Anzola