GenBank Huge amounts of data, easily accessible - PowerPoint PPT Presentation

Slides: 28
Provided by: Roderi

Transcript and Presenter's Notes


1
GenBank: Huge amounts of data, easily accessible
2
Rate of growth of phylogenetic knowledge
Number of papers with "molecular" and "phylogeny"
in Web of Science
Number of studies in TreeBASE
3
Why have a phylogeny database?
  • Archive data and trees (repeat old analyses with
    new tools)
  • Synthesize new data sets and trees (supermatrices
    and supertrees)
  • Big scale questions (tree shape, bias in tree
    building methods, stability of trees over time)
  • Hypothesis testing: find all phylogenies for
    taxa with members in Gondwana. Do they show
    similar area cladograms, amounts of sequence
    divergence, etc.?
  • Who knows? (We won't know until we try.)

4
Obstacles in the way
  • Ontologies (consistent names for organisms,
    genes, and other kinds of data)
  • How to store and query trees (what kind of
    queries do we want?)
  • Summarising information in trees (supertrees) and
    matrices (supermatrices)
  • Visualising very big trees

5
Peruvian Diving-petrel (or, what's in a name?)
  • ITIS: Pelecanoides garnotii
  • NCBI: Pelecanoides garnoti
  • TreeBASE: Pelecanoides garnoti AF076073

6
TreeBASE Names Project
http://darwin.zoology.gla.ac.uk/rpage/TreeBASE/
  • Aim is to map every name in TreeBASE onto a valid
    taxonomic name (i.e., a name in a database, or in
    the literature)
  • Use exact-, substring-, and approximate string
    matching ( BLAST)
  • So far 26819 out of 35084 names mapped
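
A minimal sketch of the three matching tiers listed above, assuming Python's standard-library difflib as a stand-in for the approximate matcher (the project's own tooling is not shown in these slides). The function name match_name, the 0.9 cutoff, and the two-entry list of valid names are illustrative assumptions; the names themselves come from slides 5 and 8.

```python
from difflib import SequenceMatcher


def match_name(query, valid_names, cutoff=0.9):
    """Map a database name onto a valid name.

    Returns (valid_name, method), or (None, None) if nothing matches.
    The cutoff is an arbitrary illustrative threshold, not a tuned value.
    """
    q = query.strip().lower()
    lowered = {name.lower(): name for name in valid_names}

    # Tier 1: exact match (case-insensitive).
    if q in lowered:
        return lowered[q], "exact"

    # Tier 2: substring match, i.e. a valid name embedded in a longer label.
    for low, name in lowered.items():
        if low in q:
            return name, "substring"

    # Tier 3: approximate match, keeping the most similar valid name.
    best_name, best_score = None, 0.0
    for low, name in lowered.items():
        score = SequenceMatcher(None, q, low).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= cutoff:
        return best_name, "approximate"
    return None, None


valid = ["Pelecanoides garnotii", "Physeter catodon"]
for label in ["Physeter catodon",
              "Physeter catodon (Sperm Whale)",
              "Physeter catadon",
              "Pelecanoides garnoti",
              "Physeter macrocephalus"]:
    print(label, "->", match_name(label, valid))
```

Misspellings such as "catadon" and "garnoti" are caught by the approximate tier, but a true synonym such as "Physeter macrocephalus" is not: string similarity alone cannot tell that two different names refer to the same organism, which is where a taxonomic database is needed.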

7
Hemideina maori (weta)
18 TreeBASE names, 1 real name
8
catodon
catadon
macrocephalus
3 TreeBASE names, 1 real name
Physeter catodon (Sperm Whale)
9
The case of the Harp seal
TreeBASE and GenBank have harp seals under two
different names, only ITIS knows that they are
the same thing
10
(No Transcript)
11
  • There are known knowns, things we know that we
    know
  • There are known unknowns, things we now know we
    don't know
  • But there are also unknown unknowns, things we do
    not know we don't know

12
Why taxonomy matters
(or, [image] vs. [image])
13
Searching on Aves in TreeBASE finds 4 studies
with birds
  • Study 1: Gauthier, J., A. G. Kluge, and T. Rowe.
    1988. Amniote phylogeny and the importance of
    fossils.
  • Study 2: Harshman, J., C. J. Huddleston, J. P.
    Bollback, T. J. Parsons, and M. J. Braun. 2003
    (in press). True and False Gavials: A Nuclear
    Gene Phylogeny of Crocodylia.
  • Hedges, S. B., K. D. Moberg, and L. R.
    Maxson. 1990. Tetrapod phylogeny inferred from 18S
    and 28S ribosomal RNA sequences and a review of
    the evidence for amniote relationships.
  • van Dijk, M. A. M., E. Paradis, F. Catzeflis, and
    W. de Jong. 1999. The virtues of gaps: Xenarthran
    (Edentate) monophyly supported by a unique
    deletion in alphaA-crystallin.

14
but there are other birds in TreeBASE!
15
Tree space in TreeBASE (overlap 1)
16
There are 24 bird studies in TreeBASE, but tree
surfing won't find them
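
A minimal sketch of why a classification helps here: a search on the literal name "Aves" only finds studies that use that exact name, while a search that walks each taxon up a classification also finds studies listing individual bird species. The parent map is a deliberately simplified toy (ranks are skipped), and the study IDs and function names are hypothetical, not TreeBASE identifiers or its API.

```python
def ancestors(name, parent):
    """Yield a taxon name followed by each of its ancestors in the parent map."""
    while name is not None:
        yield name
        name = parent.get(name)


def studies_for_taxon(query, parent, study_taxa):
    """Return sorted IDs of studies containing the query taxon or any descendant."""
    return sorted(
        study
        for study, taxa in study_taxa.items()
        if any(query in ancestors(taxon, parent) for taxon in taxa)
    )


# Toy classification (child -> parent); intermediate ranks omitted for brevity.
parent = {
    "Pelecanoides garnotii": "Pelecanoides",
    "Pelecanoides": "Aves",
    "Physeter catodon": "Physeter",
    "Physeter": "Mammalia",
}

# Hypothetical studies and the taxon names they contain.
study_taxa = {
    "study_A": ["Aves", "Mammalia"],
    "study_B": ["Pelecanoides garnotii"],
    "study_C": ["Physeter catodon"],
}

name_only = sorted(s for s, taxa in study_taxa.items() if "Aves" in taxa)
print("name search:          ", name_only)
print("classification search:", studies_for_taxon("Aves", parent, study_taxa))
```

The name-only search misses study_B because that study never mentions "Aves"; the classification-aware search recovers it.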
17
Fig. 1. The 'data availability matrix' for green
plant protein sequences from GenBank (release
132). A set of 130304 sequences for 14667 species
was clustered into 61117 groups of
homologous proteins by a combination of BLAST and
single-linkage clustering (using the program
Blastclust from the NCBI Blast toolkit,
http://www.ncbi.nlm.nih.gov/BLAST/). A column
represents a protein or protein family, a row
represents one of the species in the dataset, and
a dot indicates the existence of a sequence for
that species and protein. Species are sorted
vertically by their number of sequences; the
most-represented species (Arabidopsis thaliana)
is at the top. Proteins are sorted horizontally
by the number of taxa for which they have been
sequenced; the most heavily sequenced gene (rbcL)
is on the right. This figure shows the most
heavily sampled corner of the data availability
matrix; the remainder of the matrix is even more
sparse.
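
A minimal sketch of the two steps the caption describes: single-linkage clustering of sequences from pairwise similarity hits, then a species-by-cluster presence matrix sorted as in the figure. This is not Blastclust; the union-find clustering, the function names, the sequence IDs, and the toy hits are all illustrative assumptions standing in for real BLAST output.

```python
from collections import defaultdict


def single_linkage(items, links):
    """Cluster items so that any chain of pairwise links joins them (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for x in items:
        clusters[find(x)].add(x)
    return list(clusters.values())


def availability_matrix(clusters, species_of):
    """Rows: species, most sequences first. Columns: clusters, most taxa last."""
    seq_counts = defaultdict(int)
    for species in species_of.values():
        seq_counts[species] += 1
    rows = sorted(seq_counts, key=lambda sp: (-seq_counts[sp], sp))
    columns = sorted(({species_of[s] for s in c} for c in clusters), key=len)
    matrix = [[1 if sp in col else 0 for col in columns] for sp in rows]
    return rows, matrix


# Hypothetical sequence IDs mapped to species, and toy "significant hit" pairs.
species_of = {
    "s1": "Arabidopsis thaliana", "s2": "Oryza sativa", "s3": "Zea mays",
    "s4": "Arabidopsis thaliana", "s5": "Oryza sativa",
    "s6": "Arabidopsis thaliana",
}
hits = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]

clusters = single_linkage(species_of, hits)
rows, matrix = availability_matrix(clusters, species_of)
for species, row in zip(rows, matrix):
    print(f"{species:22s}", "".join("." if cell else " " for cell in row))
```

Note that s1 and s3 end up in the same cluster even though no hit links them directly; that chaining through s2 is what makes the clustering single-linkage.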
18
Seeing the tree (best seen when printed on 1.5 m
wide paper)
19
http://darwin.zoology.gla.ac.uk/rpage/MyToL/www
20
Demo 1
21
Demo 2
22
Comparing classifications for Psocoptera
Lienhard & Smithers (2002), courtesy of Kevin
Johnson: 4363 species
NCBI (GenBank): 9 species
23
(No Transcript)
24
Bioinformatics envy - GenBank should NOT be our
role model
www.biomoby.org www.gmod.org
25
From journal to database
Problem: not enough of the data and trees in
journals make it into databases
26
Elsevier's journal Molecular Phylogenetics and
Evolution is a criminal waste of our efforts
Text, data, trees locked up in paper and PDF
27
the database is the journal
  1. Data and trees go into database
  2. Text (annotation) added
  3. Automatically generate a report summarising the
    results
  4. The report is the publication (can have a DOI)
  5. Open Access data and text
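
The workflow above can be sketched as follows, assuming a study is held as a single record with an alignment, a Newick tree string, and free-text annotation. The StudyRecord class, its field names, the report layout, and the toy data are all hypothetical; no real database schema or DOI is implied.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StudyRecord:
    """Hypothetical database record for one study (steps 1 and 2)."""
    title: str
    matrix: Dict[str, str]      # taxon name -> aligned sequence
    tree: str                   # tree in Newick format
    annotation: str             # author-supplied text
    doi: Optional[str] = None   # assigned on publication, if at all


def render_report(record):
    """Generate a plain-text report from the stored record (step 3)."""
    n_taxa = len(record.matrix)
    n_sites = len(next(iter(record.matrix.values()))) if record.matrix else 0
    lines = [
        record.title,
        "=" * len(record.title),
        f"DOI: {record.doi or 'not yet assigned'}",
        f"Taxa: {n_taxa}",
        f"Aligned sites: {n_sites}",
        f"Tree: {record.tree}",
        "",
        record.annotation,
    ]
    return "\n".join(lines)


# Toy record; taxa, sequences, and tree are made up for illustration.
record = StudyRecord(
    title="Toy study (hypothetical)",
    matrix={"A": "ACGT", "B": "ACGA", "C": "ATGA"},
    tree="((A,B),C);",
    annotation="A and B group together to the exclusion of C.",
)
report = render_report(record)
print(report)
```

Because the report is generated from the record rather than typed separately, the published text cannot drift out of step with the deposited data and trees.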