BeeBase - The Honey Bee Model Organism Database (Chris Elsik)



1
BeeBase - The Honey Bee Model Organism Database
Chris Elsik
c-elsik_at_tamu.edu
2
Outline
  • BeeBase - what it is now
  • How it works
  • Future Plans

3
BeeBase
http://racerx00.tamu.edu/PHP/bee_search.php
  • Predicted Gene and Homolog Search Page
  • Genome Browser
  • Comparative Map Viewer
  • Protein Families Database with Bee, Fly and
    Mosquito proteins
  • The newest assembly (release 2.0):
    http://racerx00.tamu.edu/cgi-bin/gbrowse/bee_genome2

11
GBrowse
  • A module of the Generic Model Organism Database
    Project (GMOD), www.gmod.org
  • A graphical viewer of features along a reference
    sequence
  • Based on MySQL and Perl
  • The configuration file allows us to:
    - Change fonts, colors, and text.
    - Change the overview sequence: scaffold, contig,
      genetic map, or karyotype.
    - Define tracks.
    - Modify track appearance.

12
GBrowse Internals
  • BioPerl library - allows the browser to run on top
    of a variety of database management systems and
    schemata
  • Bio::Graphics module - used to graphically render
    any type of nucleotide or protein feature
  • Bio::DB::GFF database - uses a flat coordinate
    system to represent genomic features. Optimized
    for queries that retrieve features by ID, type, or
    region of the genome

13
Our task is to generate GFF data
  • GFF: Generic Feature Format
  • A standard format that aids data exchange
  • Allows you to specify a substring of a biological
    sequence
  • The current version (2) uses terms from the
    Sequence Ontology project
    - A set of terms used to describe features on a
      nucleotide or protein sequence. It encompasses
      both "raw" features, such as nucleotide
      similarity hits, and interpretations, such as
      gene models.
  • For information on the specification, see
    http://www.sanger.ac.uk/Software/formats/GFF/
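As a rough illustration of what "specifying a substring of a biological sequence" means in practice, here is a minimal Python sketch that splits one GFF record into its tab-separated columns and pulls out the matching stretch of sequence. It assumes the nine-column GFF version 2 layout (seqname, source, feature, start, end, score, strand, frame, attributes) with 1-based, inclusive coordinates. The scaffold name, source, attribute text, and the function names are invented for the example and are not taken from BeeBase.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the column holds "."
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-separated GFF line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )


def feature_substring(sequence: str, record: GffRecord) -> str:
    """Return the part of the reference sequence the record points at."""
    return sequence[record.start - 1:record.end]


# Hypothetical record: a marker hit on an invented scaffold.
example = 'scaffold_1\tblastn\tsimilarity\t3\t6\t95.0\t+\t.\tMarker "marker_A"'
record = parse_gff_line(example)
hit_sequence = feature_substring("ACGTACGT", record)
```

The only subtle point is the coordinate shift: GFF counts from 1 and includes the end position, while Python slices count from 0 and exclude the end.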

14
Computing Data for Tracks
  • Markers
    - Compare marker sequences to genome scaffolds
      using BLASTN
    - Use ePCR (primersearch) for markers with primers
      but no sequence
  • ESTs
    - Compare ESTs to genome scaffolds using fasta or
      BLAT
    - Use exonerate (http://www.ebi.ac.uk/guy/exonerate/)
      to predict exon/intron boundaries for each
      match
  • Protein Homologs
    - Compare protein sequences to genome scaffolds
      using tfastx to identify matches
    - Use exonerate to predict exon/intron boundaries
      for each match
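The step from a similarity hit to a browser track can be sketched as a small conversion. The Python below turns one line of BLAST's 12-column tabular output into one GFF line. Tabular output is an assumption here (the slides do not say which BLAST output format was used), and the source label, feature type, attribute wording, marker name, and scaffold name are all placeholders.

```python
def blast_hit_to_gff(line: str, source: str = "blastn",
                     feature: str = "similarity") -> str:
    """Convert one BLAST tabular hit (12 columns) into a GFF line.

    Columns assumed: query, subject, % identity, alignment length,
    mismatches, gap opens, q.start, q.end, s.start, s.end, e-value,
    bit score.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected 12 tab-separated BLAST columns")
    query, subject = cols[0], cols[1]
    s_start, s_end = int(cols[8]), int(cols[9])
    bit_score = cols[11].strip()
    # Tabular BLAST reports minus-strand hits with s.start > s.end;
    # GFF wants start <= end plus an explicit strand column.
    strand = "+" if s_start <= s_end else "-"
    start, end = min(s_start, s_end), max(s_start, s_end)
    attributes = f'Marker "{query}"'  # placeholder attribute text
    return "\t".join([subject, source, feature, str(start), str(end),
                      bit_score, strand, ".", attributes])


# Hypothetical hit of a marker against the minus strand of a scaffold.
hit = "marker_A\tscaffold_1\t98.5\t200\t3\t0\t1\t200\t5200\t5001\t1e-100\t370"
gff_line = blast_hit_to_gff(hit)
```

Here the feature is placed on the genome scaffold (the BLAST subject), since the scaffold is the reference sequence the browser displays.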

15
Annotating Tracks
  • The most time-consuming task in computing tracks
    is providing annotations for protein homologs.
  • Annotations come from different sources and are
    in different formats depending on the protein
    dataset.
  • We use UniProt for all homolog tracks in the
    assembly 1.1 and 1.2 browsers.
  • Assembly 2 uses proteome sets for Drosophila
    (FlyBase), C. elegans (WormBase), Yeast (SGD),
    Mosquito (Ensembl) and Human (Ensembl) to avoid
    redundancy within proteomes.
    - The FASTA-formatted sequences are not annotated
      (except yeast).
  • The "other insect" track will come from UniProt.
    - To identify which sequences are insect, we use
      the taxon ID and a locally installed NCBI
      taxonomy database.
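One way the insect check could work is to walk each protein's taxon ID up the taxonomy until it either reaches Insecta or runs out of parents. The Python sketch below assumes the taxonomy is available as lines in the style of NCBI's nodes.dmp file (taxon ID and parent taxon ID as the first two "|"-separated fields). The four toy lines are a drastically shortened lineage made up for the example; a real dump has many intermediate ranks. The function names are illustrative, not part of any BeeBase code.

```python
from typing import Dict, Iterable

INSECTA_TAXID = 50557  # NCBI taxon ID for the class Insecta


def load_parents(nodes_lines: Iterable[str]) -> Dict[int, int]:
    """Build a child -> parent map from nodes.dmp-style lines."""
    parents: Dict[int, int] = {}
    for line in nodes_lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        parents[int(fields[0].strip())] = int(fields[1].strip())
    return parents


def is_descendant(taxid: int, ancestor: int, parents: Dict[int, int]) -> bool:
    """True if `ancestor` is `taxid` itself or lies on its path to the root."""
    seen = set()
    current = taxid
    while current not in seen:  # the root is its own parent, so stop on repeats
        if current == ancestor:
            return True
        seen.add(current)
        parent = parents.get(current)
        if parent is None:
            return False
        current = parent
    return False


# Toy lineage (assumption): real lineages have many more levels.
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|",
    "50557\t|\t1\t|\tclass\t|",
    "7460\t|\t50557\t|\tspecies\t|",  # 7460 = Apis mellifera
    "9606\t|\t1\t|\tspecies\t|",      # 9606 = Homo sapiens
]
parents = load_parents(toy_nodes)
bee_is_insect = is_descendant(7460, INSECTA_TAXID, parents)
human_is_insect = is_descendant(9606, INSECTA_TAXID, parents)
```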

16
CMap
  • CMap is a web-based tool that allows users to
    view comparisons of genetic and physical maps.
  • The package also includes tools for curating map
    data.
  • Based on MySQL and Perl
  • Consists of modules for data, logic (how maps are
    laid out), and presentation.
  • Our work is to modify the configuration file and
    format data.

17
Future BeeBase Plans
  • Redo protein families analysis after the final gene
    prediction set is released; add proteins from
    additional model organisms (worm, yeast, mouse,
    human)
  • Phylogenetic analysis to identify orthologs
  • Gene Ontology assignment
  • Create gene pages for each gene, similar to
    FlyBase, using the new Turnkey gmod-web module

18
More BeeBase Plans
  • Curate literature for orthologs to provide an
    entry into the BeeSpace conceptual navigation
    system.
  • Incorporate a QTL viewer using Dave Adelson's QTL
    viewer software, which was developed for cattle.
  • Incorporate OpenGeneX gene expression database
    and expression data from the BeeSpace project.

19
Gene Ontology For Honey Bee
20
Gene Ontology Consortium
http://www.geneontology.org/
  • The goal of the Gene Ontology™ (GO) Consortium
    is to produce a controlled vocabulary that can be
    applied to all organisms even as knowledge of
    gene and protein roles in cells is accumulating
    and changing.
  • GO provides three structured networks of defined
    terms to describe gene product attributes:
    - Molecular Function Ontology: the tasks performed
      by individual gene products; examples are
      carbohydrate binding and ATPase activity
    - Biological Process Ontology: broad biological
      goals, such as mitosis or purine metabolism, that
      are accomplished by ordered assemblies of
      molecular functions
    - Cellular Component Ontology: subcellular
      structures, locations, and macromolecular
      complexes; examples include nucleus, telomere,
      and origin recognition complex

21
GO Evidence Codes
  • IDA: inferred from direct assay - Enzyme assays,
    In vitro reconstitution (e.g. transcription),
    Immunofluorescence (for cellular component), Cell
    fractionation (for cellular component), Physical
    interaction/binding assay
  • IEP: inferred from expression pattern - useful for
    biological process ontology
  • IGI: inferred from genetic interaction -
    "Traditional" genetic interactions such as
    suppressors, synthetic lethals, etc., Functional
    complementation, Rescue experiments, Inference
    about one gene drawn from the phenotype of a
    mutation in a different gene
  • IMP: inferred from mutant phenotype - Any gene
    mutation/knockout, Overexpression/ectopic
    expression of wild-type or mutant genes,
    Anti-sense experiments, RNAi experiments,
    Specific protein inhibitors
  • IPI: inferred from physical interaction -
    2-hybrid interactions, Co-purification,
    Co-immunoprecipitation, Ion/protein binding
    experiments
  • IEA: inferred from electronic annotation
  • ISS: inferred from sequence or structural
    similarity
  • IC: inferred by curator; TAS: traceable author
    statement; NAS: non-traceable author statement;
    ND: no biological data available; NR: not recorded

22
Applying GO to Honey Bee
  • We must rely heavily on IEA (inferred from
    electronic annotation - no curator) or ISS
    (inferred from sequence similarity - inspected by
    curator)
  • We must make the most reliable inferences
    possible - based on orthology instead of homology

23
Background: Evolution-based functional inference
and orthology
24
Evolution Allows us to Infer Function
  • The most powerful method for inferring function
    of a gene or protein is by similarity searching a
    sequence database.
  • Our ability to characterize biological properties
    of a protein using sequence data alone stems from
    properties conserved through evolutionary time.
  • Homologous (evolutionarily related) proteins
    always share a common 3-dimensional folding
    structure.
  • They often contain common active sites or binding
    domains.
  • They frequently share common functions.
  • Predictions made using similar, but
    non-homologous proteins are much less reliable.

25
Orthologs
  • Homologs: genes that are evolutionarily related
  • There are two kinds of homologs:
    - Orthologs: genes in different species that have
      diverged from a common gene in an ancestral
      species.
    - Paralogs: genes that have diverged due to gene
      duplication.
  • Orthologs are more likely than paralogs to have
    conserved function.
  • Orthologs cannot be identified using BLAST or
    FASTA sequence comparison alone.
  • Reliable ortholog identification requires
    phylogenetic methods.
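To make the tree-based idea concrete, here is a small Python sketch of one simple heuristic, often called the species-overlap rule: find the node where two genes last shared an ancestor, and call that node a duplication if the two branches below it contain any species in common, otherwise a speciation. This is only one possible approach and is not presented as the method used for BeeBase. It also assumes a gene tree has already been built with phylogenetic software. The tree, the "species|gene" leaf labels, and the function names are invented for the example.

```python
def leaves(node):
    """Return the leaf labels under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def species_of(leaf):
    """Leaf labels are assumed to look like 'species|gene'."""
    return leaf.split("|")[0]


def classify_pair(tree, a, b):
    """Label two leaves 'orthologs' or 'paralogs' by species overlap."""
    all_leaves = leaves(tree)
    if a == b or a not in all_leaves or b not in all_leaves:
        raise ValueError("need two distinct leaves that are present in the tree")
    # Descend to the last common ancestor: keep stepping into a child
    # as long as that child still contains both genes.
    node = tree
    while True:
        deeper = [c for c in node if a in leaves(c) and b in leaves(c)]
        if not deeper:
            break
        node = deeper[0]
    side_a = next(c for c in node if a in leaves(c))
    side_b = next(c for c in node if b in leaves(c))
    species_a = {species_of(x) for x in leaves(side_a)}
    species_b = {species_of(x) for x in leaves(side_b)}
    # Shared species on both sides imply the split was a duplication.
    return "paralogs" if species_a & species_b else "orthologs"


# Toy tree (assumption): one duplication before the bee/fly split,
# with a single-copy worm gene as the outgroup.
toy_tree = ((("bee|g1", "fly|g1"), ("bee|g2", "fly|g2")), "worm|g0")
same_copy = classify_pair(toy_tree, "bee|g1", "fly|g1")
across_copies = classify_pair(toy_tree, "bee|g1", "fly|g2")
```

In the toy tree, bee|g1 and fly|g2 would look like a reasonable match in a pairwise similarity search, but the tree shows they descend from different copies of a duplicated gene.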

26
Example Gene Tree (with plant genes)
[Figure: gene tree with leaves Rice-2b, Rice-2a, Maize-2, Wheat-2, Sorghum-2, Barley-1, Wheat-1, Maize-1, Sorghum-1 and Arabidopsis; brackets on the tree mark examples of paralogs and orthologs.]
27
Why shouldn't we depend on inferences based on
paralogs?
  • Paralogs emerge after a gene duplication.
  • Possible fates of duplicated genes:
    - Loss of function for one of the duplicates - lack
      of selective pressure allows the gene to mutate
      beyond recognition
    - Emergence of new functional paralogs - one
      duplicate acquires a new function, so selection
      favors its maintenance in the genome
    - Sub-functionalization - both duplicates are
      required to maintain the function of the original
28
Back to Gene Ontology for Honey Bee: Proposed
Evidence Codes within ISS
  • ISS: inferred from sequence similarity
    (inspected by a curator)
  • We can break this down into:
    - Inferred from homology (lowest)
    - Inferred from an ortholog in one species
    - Inferred from orthologs in more than one species,
      all of which have the same GO classification
      (highest).
  • What if they don't all have the same GO
    classification? Move up in the directed acyclic
    graph to a point where GO classifications converge.
  • This can be tricky since the graph is a directed
    acyclic graph and each node can have more than one
    parent.
29
Some Ongoing Gene Ontology Work in the Elsik Lab
- Cattle
  • Cattle EST Gene Family Database
  • Cattle gene families were created using
    assembled, translated ESTs grouped with
    homologous human protein families.
  • Database is searchable using GO for the human
    proteins.
  • The next step is phylogenetic analysis to
    identify human/cattle orthologs.

30
Searching by Gene Ontology
33
Borrowing More From Cattle
  • Bovine QTL Database - David Adelson, TAMU

34
The Bovine QTL viewer Interface
35
Image showing all chromosomes
36
Image showing one chromosome
37
QTL Details
38
OpenGeneX
  • Web-based access to database
  • PostgreSQL
  • Includes a client-side Java application, used as a
    curation tool, that formats data in MAGE-ML
  • Includes several statistical routines and data
    analysis tools
  • Uses the R statistical analysis package (open source)

39
Acknowledgements
  • Collaborators
    - Bruce Schatz, Gene Robinson and the BeeSpace
      group, UIUC
    - William Gelbart - FlyBase (Harvard University)
    - Spencer Johnston (TAMU)
    - Danny Weaver, Bee Power LP
  • Elsik Lab
    - Justin Reese
    - Kyounghwa Bae
    - Anand Venkatraman
    - Shreyas Murthi
    - Michael Dickens
    - Juan Anzola