BeeBase: The Honey Bee Model Organism Database, Chris Elsik (c-elsik_at_tamu.edu)

1
BeeBase - The Honey Bee Model Organism Database
Chris Elsik
c-elsik_at_tamu.edu
2
Outline

BeeBase - what it is now
How it works
Future Plans

3
BeeBase
http://racerx00.tamu.edu/PHP/bee_search.php

Predicted Gene and Homolog Search Page
Genome Browser
Comparative Map Viewer
Protein Families Database with Bee, Fly and
Mosquito proteins
The newest assembly (release 2.0):
http://racerx00.tamu.edu/cgi-bin/gbrowse/bee_genome2

4
(No Transcript)
5
(No Transcript)
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
Gbrowse

A module of the Generic Model Organism Database
Project (GMOD), www.gmod.org
A graphical viewer of features along a reference
sequence
Based on MySQL and Perl
The configuration file allows us to:
- Change fonts, colors, text.
- Change the overview sequence: scaffold, contig,
genetic map, karyotype.
- Define tracks.
- Modify track appearance.

12
Gbrowse Internals

BioPerl Library - allows the browser to run on top of
a variety of database management systems and
schemata
Bio::Graphics module - used to graphically render
any type of nucleotide or protein feature
Bio::DB::GFF Database - uses a flat coordinate
system to represent genomic features. Optimized
for queries that retrieve features by ID, type or
region of genome

13
Our task is to generate GFF data

GFF: generic feature format
A standard format that aids data exchange
Allows you to specify a substring of a biological
sequence
The current version (2) uses terms from the
Sequence Ontology project
- A set of terms used to describe features on a
nucleotide or protein sequence. It encompasses
both "raw" features, such as nucleotide
similarity hits, and interpretations, such as
gene models.
For information on the specifications:
http://www.sanger.ac.uk/Software/formats/GFF/
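As a rough illustration of how a GFF record specifies a substring of a sequence, the sketch below parses one GFF version 2 line and slices the matching region out of a scaffold. It assumes the usual tab-separated layout (seqname, source, feature, start, end, score, strand, frame, optional attributes) with 1-based inclusive coordinates. The scaffold name, the sequence and the attribute text are invented for the example and do not come from BeeBase.

```python
from dataclasses import dataclass


@dataclass
class GffFeature:
    seqname: str
    source: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> GffFeature:
    """Parse one tab-separated GFF2 line (8 mandatory fields, attributes optional)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    attributes = fields[8] if len(fields) > 8 else ""
    return GffFeature(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7],
        attributes,
    )


def extract(feature: GffFeature, sequence: str) -> str:
    """Return the substring of `sequence` that the feature covers."""
    # GFF coordinates are 1-based and inclusive; Python slices are 0-based, end-exclusive.
    return sequence[feature.start - 1:feature.end]


# Hypothetical scaffold and feature, for illustration only.
scaffold = "ACGTACGTACGTACGT"
example_line = 'Group1.1\texonerate\texon\t5\t10\t.\t+\t.\tEST "hypothetical_est_1"'

feat = parse_gff_line(example_line)
substring = extract(feat, scaffold)
print(feat.feature, feat.start, feat.end, substring)
```

The strand column is ignored here; a fuller version would reverse-complement the slice for features on the minus strand.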

14
Computing Data for Tracks

Markers
Compare marker sequences to genome scaffolds
using BLASTN
Use ePCR (primersearch) for markers with primers,
but no sequence
ESTs
Compare ESTs to genome scaffolds using fasta or
BLAT
Use exonerate (http://www.ebi.ac.uk/guy/exonerate/)
to predict exon/intron boundaries for each
match
Protein Homologs
Compare protein sequences to genome scaffolds
using tfastx to identify matches
Use exonerate to predict exon/intron boundaries
for each match
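The search results above still have to be turned into GFF before they can be loaded as tracks. The following is a minimal sketch of that conversion for a single marker hit, assuming the 12-column tabular BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The source label, the `Target` attribute wording and the example hit are assumptions made for illustration, not the BeeBase pipeline itself.

```python
def blast_tab_to_gff(line: str, source: str = "BLASTN",
                     feature: str = "nucleotide_match") -> str:
    """Convert one tabular BLAST hit into a GFF line located on the subject (scaffold)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError(f"expected 12 tab-separated columns, got {len(cols)}")

    query, subject = cols[0], cols[1]
    s_start, s_end = int(cols[8]), int(cols[9])
    bitscore = cols[11].strip()

    # In tabular BLASTN output a minus-strand hit has subject start > subject end.
    strand = "+" if s_start <= s_end else "-"
    # GFF requires start <= end regardless of strand.
    start, end = min(s_start, s_end), max(s_start, s_end)

    # Simplified attribute: records which marker produced the hit (hypothetical format).
    attributes = f'Target "{query}"'
    return "\t".join([subject, source, feature, str(start), str(end),
                      bitscore, strand, ".", attributes])


# Hypothetical hit of a marker sequence against a scaffold.
example_hit = "marker_A\tGroup1.1\t98.50\t200\t3\t0\t1\t200\t5200\t5001\t1e-100\t350"
gff_line = blast_tab_to_gff(example_hit)
print(gff_line)
```

Output from fasta, BLAT, tfastx or exonerate would need its own small parser, but the mapping onto the GFF columns follows the same pattern.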

15
Annotating Tracks

The most time consuming task in computing tracks
is providing annotations for protein homologs.
Annotations come from different sources and are
in different formats depending on protein
dataset.
We use UniProt for all homolog tracks in assembly
1.1 and 1.2 browsers.
Assembly 2 uses proteome sets for Drosophila
(FlyBase), C. elegans (WormBase), Yeast (SGD),
Mosquito (Ensembl) and Human (Ensembl) to avoid
redundancy within proteomes.
The fasta formatted sequences are not annotated
(except yeast).
The other insect track will come from UniProt.
To identify which sequences are insect, we use
taxon-id and a locally installed NCBI taxonomy
database.
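One way to decide whether a sequence is from an insect is to walk from its taxon ID up the parent links of the taxonomy until Insecta is reached or the root is passed. The sketch below assumes the parent links come from the `nodes.dmp` file of the NCBI taxonomy dump, whose first two `\t|\t`-separated columns are the taxon ID and its parent taxon ID. The small `TOY_PARENT` table is a deliberately abbreviated stand-in (intermediate ranks are collapsed), and the function names are invented for this example.

```python
INSECTA = 50557  # NCBI taxon ID for class Insecta


def load_parents(nodes_dmp_path: str) -> dict:
    """Read child -> parent taxon IDs from an NCBI nodes.dmp file (assumed layout)."""
    parent = {}
    with open(nodes_dmp_path) as handle:
        for line in handle:
            tax_id, parent_id = line.split("\t|\t")[:2]
            parent[int(tax_id)] = int(parent_id)
    return parent


def is_descendant(taxid: int, ancestor: int, parent: dict) -> bool:
    """Return True if `ancestor` appears on the path from `taxid` to the root."""
    seen = set()
    current = taxid
    while current not in seen:      # the root is its own parent, so this terminates
        if current == ancestor:
            return True
        seen.add(current)
        if current not in parent:   # unknown taxon ID
            return False
        current = parent[current]
    return False


# Abbreviated toy taxonomy: real lineages contain many more intermediate nodes.
TOY_PARENT = {
    1: 1,          # root
    33208: 1,      # Metazoa
    50557: 33208,  # Insecta
    7460: 50557,   # Apis mellifera
    7227: 50557,   # Drosophila melanogaster
    9606: 33208,   # Homo sapiens
}

for taxid in (7460, 7227, 9606):
    print(taxid, is_descendant(taxid, INSECTA, TOY_PARENT))
```

With a full taxonomy loaded through `load_parents`, the same check can be applied to the taxon ID of every UniProt entry to select the "other insect" set.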

16
CMAP

CMap is a web-based tool that allows users to
view comparisons of genetic and physical maps.
The package also includes tools for curating map
data.
MySQL and Perl
Consists of modules for data, logic (how maps are
laid out), and presentation.
Our work is to modify the configuration file and
format data.

17
Future BeeBase Plans

Redo protein families analysis after the final gene
prediction set is released; add proteins from
additional model organisms (worm, yeast, mouse,
human)
Phylogenetic analysis to identify orthologs
Gene Ontology assignment
Create gene pages for each gene, similar to
FlyBase, using the new Turnkey gmod-web module

18
More BeeBase Plans

Curate literature for orthologs to provide an
entry into the BeeSpace conceptual navigation
system.
Incorporate a QTL viewer using Dave Adelson's QTL
viewer software, which was developed for cattle.
Incorporate OpenGeneX gene expression database
and expression data from the BeeSpace project.

19
Gene Ontology For Honey Bee
20
Gene Ontology Consortium
http://www.geneontology.org/

The goal of the Gene Ontology™ (GO) Consortium
is to produce a controlled vocabulary that can be
applied to all organisms even as knowledge of
gene and protein roles in cells is accumulating
and changing.
GO provides three structured networks of defined
terms to describe gene product attributes.
Molecular Function Ontology: the tasks performed
by individual gene products; examples are
carbohydrate binding and ATPase activity
Biological Process Ontology: broad biological
goals, such as mitosis or purine metabolism, that
are accomplished by ordered assemblies of
molecular functions
Cellular Component Ontology: subcellular
structures, locations, and macromolecular
complexes; examples include nucleus, telomere,
and origin recognition complex

21
GO Evidence Codes

IDA: inferred from direct assay - Enzyme assays,
In vitro reconstitution (e.g. transcription),
Immunofluorescence (for cellular component), Cell
fractionation (for cellular component), Physical
interaction/binding assay
IEP: inferred from expression pattern - useful for
biological process ontology
IGI: inferred from genetic interaction -
"Traditional" genetic interactions such as
suppressors, synthetic lethals, etc., Functional
complementation, Rescue experiments, Inference
about one gene drawn from the phenotype of a
mutation in a different gene
IMP: inferred from mutant phenotype - Any gene
mutation/knockout, Overexpression/ectopic
expression of wild-type or mutant genes,
Anti-sense experiments, RNAi experiments,
Specific protein inhibitors
IPI: inferred from physical interaction -
2-hybrid interactions, Co-purification,
Co-immunoprecipitation, Ion/protein binding
experiments
IEA: inferred from electronic annotation
ISS: inferred from sequence or structural
similarity
IC: inferred by curator
TAS: traceable author statement
NAS: non-traceable author statement
ND: no biological data available
NR: not recorded

22
Applying GO to Honey Bee

We must rely heavily on IEA (inferred from
electronic annotation - no curator) or ISS
(inferred from sequence similarity - inspected by
curator)
We must make the most reliable inferences
possible - based on orthology instead of homology

23
Background: Evolution-based functional inference
and orthology
24
Evolution Allows us to Infer Function

The most powerful method for inferring function
of a gene or protein is by similarity searching a
sequence database.
Our ability to characterize biological properties
of a protein using sequence data alone stems from
properties conserved through evolutionary time.
Homologous (evolutionarily related) proteins
always share a common 3-dimensional folding
structure.
They often contain common active sites or binding
domains.
They frequently share common functions.
Predictions made using similar, but
non-homologous proteins are much less reliable.

25
Orthologs

Homologs: genes that are evolutionarily related
There are two kinds of homologs:
Orthologs: genes in different species that have
diverged from a common gene in an ancestral
species.
Paralogs: genes that have diverged due to gene
duplication.
Orthologs are more likely than paralogs to have
conserved function.
Orthologs cannot be identified using BLAST or
FASTA sequence comparison alone.
Reliable ortholog identification requires
phylogenetic methods.
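Once a gene tree has been built and each internal node labelled as a speciation or a duplication, the distinction above becomes mechanical: two genes are orthologs if their last common ancestor in the tree is a speciation node, and paralogs if it is a duplication node. The sketch below assumes the labels are already known (in practice they come from comparing the gene tree with the species tree, which is not shown). The tree, gene names and node names are invented for the example.

```python
# Toy gene tree: one duplication in the common ancestor of species A and B,
# followed by a speciation in each duplicate lineage. Child -> parent links.
PARENT = {
    "A-1": "n1", "B-1": "n1",
    "A-2": "n2", "B-2": "n2",
    "n1": "root", "n2": "root",
}

# Event assumed at each internal node (hypothetical labels).
EVENT = {"n1": "speciation", "n2": "speciation", "root": "duplication"}


def path_to_root(node: str) -> list:
    """List the node and all of its ancestors, from the node up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def last_common_ancestor(a: str, b: str) -> str:
    """Return the deepest node that is an ancestor of both genes."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    raise ValueError(f"{a} and {b} are not in the same tree")


def relationship(a: str, b: str) -> str:
    """Classify two distinct genes by the event at their last common ancestor."""
    event = EVENT[last_common_ancestor(a, b)]
    return "orthologs" if event == "speciation" else "paralogs"


for pair in (("A-1", "B-1"), ("A-1", "A-2"), ("A-1", "B-2")):
    print(pair, relationship(*pair))
```

Note that `A-1` and `B-2` are in different species yet are paralogs, which is the case a best-hit sequence comparison cannot distinguish on its own.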

26
Example Gene Tree (with plant genes)
[Figure: gene tree with leaves Rice-2b, Rice-2a, Maize-2, Wheat-2, Sorghum-2, Barley-1, Wheat-1, Maize-1, Sorghum-1 and Arabidopsis; brackets mark groups of paralogs and orthologs.]
27
Why shouldn't we depend on inferences based on
paralogs?

Paralogs emerge after a gene duplication.
Possible fates of duplicated genes:
Loss of function for one of the duplicates - lack
of selective pressure allows the gene to mutate
beyond recognition
Emergence of new functional paralogs - one
duplicate acquires a new function, so selection
favors its maintenance in the genome
Sub-functionalization - both duplicates are
required to maintain the function of the original
gene
28
Back to Gene Ontology for Honey Bee: Proposed
Evidence Codes within ISS

ISS: inferred from sequence similarity
(inspected by a curator)
We can break this down into:
- Inferred from homology (lowest)
- Inferred from an ortholog in one species
- Inferred from orthologs in more than one species,
all of which have the same GO classification
(highest)
What if they don't all have the same GO
classification? Move up in the directed acyclic
graph to a point where the GO classifications converge.
This can be tricky, since the graph is a directed
acyclic graph and each node can have more than one parent.
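The "move up until the classifications converge" step can be sketched as a search for the most specific terms that are ancestors of every ortholog's annotation. The example below is only an illustration under assumptions: the term names are placeholders rather than real GO identifiers, the graph is a hand-made toy, and "most specific" is taken to mean a shared ancestor that is not itself an ancestor of another shared ancestor.

```python
# Toy ontology as child -> list of parents. A term may have several parents.
PARENTS = {
    "root": [],
    "mid_1": ["root"],
    "mid_2": ["root"],
    "leaf_a": ["mid_1"],
    "leaf_b": ["mid_1"],
    "leaf_c": ["mid_1", "mid_2"],
    "leaf_d": ["mid_1", "mid_2"],
}


def ancestors(term: str) -> set:
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def convergence_points(terms: list) -> set:
    """Most specific terms shared by the ancestries of all given terms."""
    common = set.intersection(*(ancestors(t) for t in terms))
    # Drop any shared term that is a proper ancestor of another shared term.
    return {
        t for t in common
        if not any(t != other and t in ancestors(other) for other in common)
    }


# Orthologs annotated with sibling terms converge on their single shared parent.
print(convergence_points(["leaf_a", "leaf_b"]))
# With multiple parents there can be more than one convergence point.
print(convergence_points(["leaf_c", "leaf_d"]))
```

The second call returns two terms, which shows why the multi-parent structure makes the choice of a single "converged" annotation a curation decision rather than a purely mechanical one.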

29
Some Ongoing Gene Ontology Work in the Elsik Lab
- Cattle

Cattle EST Gene Family Database
Cattle gene families were created using
assembled, translated ESTs grouped with
homologous human protein families.
Database is searchable using GO for the human
proteins.
The next step is phylogenetic analysis to
identify human/cattle orthologs.

30
Searching by Gene Ontology
31
(No Transcript)
32
(No Transcript)
33
Borrowing More From Cattle

Bovine QTL Database - David Adelson, TAMU

34
The Bovine QTL viewer Interface
35
Image showing all chromosomes
36
Image showing one chromosome
37
QTL Details
38
OpenGeneX

Web-based access to database
PostgreSQL
Includes as a curation tool a client side Java
application that formats data in MAGE-ML
Includes several statistical routines and data
analysis tools
Uses R statistical analysis package (open source)

39
Acknowledgements

Collaborators
Bruce Schatz, Gene Robinson and the BeeSpace
group, UIUC
William Gelbart - FlyBase (Harvard University)
Spencer Johnston (TAMU)
Danny Weaver, Bee Power LP

Elsik Lab
Justin Reese
Kyounghwa Bae
Anand Venkatraman
Shreyas Murthi
Michael Dickens
Juan Anzola
