Title: Genomes
1 Genome Databases
2 Biologists Collect Lots of Data
- Hundreds of thousands of species
- Millions of articles in scientific journals
- Genetic information
  - gene names
  - phenotypes of mutants
  - locations of genes/mutations on chromosomes
  - linkage (distances between genes)
3 Gene Sequences
- 1980s - sequenced one gene at a time
- 1990s - DNA sequencing machines!!
- whole genomes
- cDNA (made from RNA)
- short Expressed Sequence Tags (ESTs)
- Now we have the genome of the month club
- The Poodle ??!!?? (Science. Sep 26, 2003)
4 Protein Sequences
- Protein sequences are derived from genome and cDNA sequences, using the genetic code and gene-prediction software
- Functional information is mostly about proteins
  - metabolic pathways
  - conserved domains
  - 3-dimensional structure
  - protein-protein interactions
  - DNA binding (regulatory proteins, transcription factors)
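Deriving a protein sequence from DNA with the genetic code can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard genetic code, reads a single frame starting at the first base, and knows nothing about introns, so it is no substitute for gene-prediction software. The function name `translate` and the input sequences are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as 'X'.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))
```

Real pipelines also have to pick the right reading frame and strand, which is where gene-prediction software comes in.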
5 Related Data
- Gene expression
  - Measure the expression level of each gene in different types of cells, in response to different stimuli, disease, etc.
- Polymorphism
  - mutations and allelic differences
  - population biology
6 What is a Database?
- Organized data
- Like a spreadsheet
  - Columns are fields
  - Rows are records
- Fields contain data of the same type
- Rows contain data that is related to one object
- Can search for a term within just one field
  - speed and specificity
- Or combine searches in several fields
  - this can't easily be done in a simple text file
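The difference between searching one field and searching raw text can be shown with a toy example. The records, accession-like IDs, and the helper names `search` and `grep` below are all invented for illustration; a real database would use indexes rather than a linear scan.

```python
# Each dict is one record (row); each key is a field (column).
records = [
    {"accession": "DEMO0001", "organism": "Homo sapiens", "gene": "kinase A"},
    {"accession": "DEMO0002", "organism": "Mus musculus", "gene": "homo-oligomer subunit"},
    {"accession": "DEMO0003", "organism": "Homo sapiens", "gene": "phosphatase B"},
]


def search(records, **criteria):
    """Return records where every named field contains its term (case-insensitive)."""
    return [
        r for r in records
        if all(term.lower() in str(r[field]).lower() for field, term in criteria.items())
    ]


def grep(records, term):
    """Whole-record text search, similar to scanning a plain text file."""
    return [r for r in records if any(term.lower() in str(v).lower() for v in r.values())]


human = search(records, organism="homo")                        # one field
human_kinase = search(records, organism="homo", gene="kinase")  # two fields combined
anywhere = grep(records, "homo")                                # no field awareness

print([r["accession"] for r in human])
print([r["accession"] for r in human_kinase])
print([r["accession"] for r in anywhere])
```

The plain text search also returns the mouse record, because "homo" happens to appear in its gene description. The field-restricted search does not.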
8 Large Databases
- Once upon a time, GenBank sent out sequence updates on CD-ROM disks a few times per year.
- Now (Oct 2003) GenBank is over 250 gigabytes
  - (100 billion bases)
- Most biocomputing sites update their copy of GenBank every day over the internet.
- Scientists access GenBank directly over the Web
9 GenBank is a Flatfile Database
- Composed entirely of text
  - you could print the whole thing out
- Each submitted sequence is a record
- Has fields for Organism, Date, Author, etc.
- Unique identifier for each sequence
  - Locus and Accession
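Because a flatfile record is just text with labelled fields, a few lines of code can turn one into a searchable structure. The sketch below assumes a simplified layout loosely modelled on GenBank's (a keyword at the start of the line, indented continuation lines, `//` ending the record). The record itself is made up and is not a real GenBank entry, and the function name is hypothetical.

```python
def parse_flatfile_record(text):
    """Collect top-level KEYWORD fields; indented lines continue the previous field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):       # end-of-record marker
            break
        if not line.strip():
            continue
        if line[0] != " ":              # a new field starts in column 1
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None:       # continuation of the previous field
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


EXAMPLE = """\
LOCUS       DEMO0001     24 bp    DNA
DEFINITION  Made-up example record, not a real
            GenBank entry.
ACCESSION   DEMO0001
ORIGIN
        1 atggccaaga ctgaattcgg ctaa
//
"""

record = parse_flatfile_record(EXAMPLE)
print(record["ACCESSION"])
print(record["DEFINITION"])
```

Real GenBank records have nested sub-fields and feature tables, so production code normally relies on an established parser rather than a hand-rolled one.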
10 Finding Sequences in Databases
- The public DNA and protein sequence databases are huge.
- In order for these databases to be useful, the data must be readily accessible to researchers.
11 Raw Genome Data
12 What Are You Looking For?
- A gene?
- DNA or protein sequence?
- DNA sequences are essentially all in GenBank
  - Genomic, mRNA, cDNA, EST?
- Proteins are harder to pin down
  - GenPept (GenBank Peptides) is huge and poorly annotated (lots of junk)
  - SwissProt is carefully annotated, but not fully comprehensive
  - PIR is somewhere in between
  - PDB has protein 3-D structures
13 Federated Databases
- EMBL (European Molecular Biology Laboratory)
  - has the same data as GenBank, but in a different format and sometimes with different accession numbers
  - uses SRS as its query tool
- DDBJ (DNA Data Bank of Japan)
  - also the same data, in a different format, with different accession numbers
14 Finding Genes in GenBank
- These billions of G, A, T, and C letters would be almost useless without descriptions of what genes they contain, the organisms they come from, etc.
- All of this information is contained in the "annotation" part of each sequence record.
16 Accession Numbers!!
- Databases are designed to be searched by accession numbers (and locus IDs)
- These are guaranteed to be non-redundant, accurate, and not to change.
- Searching by gene names and keywords is doomed to frustration and probable failure
- Neither scientists nor computers can be trusted to accurately and consistently annotate database entries
- If only scientists would refer to genes by accession numbers in all published work!
18 Relational Databases
- GenBank is a DNA sequence database
- What about proteins?
- Could just add them into the same database
  - some fields are the same (Organism, Author, Date, etc.)
  - but some are different (phosphorylation site)
- Better to make a separate table
- But how to connect DNA and protein for the same gene?
  - Need a primary key (accession number)
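Linking a DNA table and a protein table through a shared accession number can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and accession-like IDs below are invented for illustration and do not reflect how GenBank is actually stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dna (
    accession TEXT PRIMARY KEY,   -- unique key for each DNA record
    organism  TEXT,
    sequence  TEXT
);
CREATE TABLE protein (
    protein_id    TEXT PRIMARY KEY,
    dna_accession TEXT REFERENCES dna(accession),  -- link back to the DNA record
    phospho_site  INTEGER,        -- a field that only makes sense for proteins
    sequence      TEXT
);
""")

conn.execute("INSERT INTO dna VALUES (?, ?, ?)",
             ("DEMO0001", "Homo sapiens", "ATGGCCTAA"))
conn.execute("INSERT INTO protein VALUES (?, ?, ?, ?)",
             ("DEMOP001", "DEMO0001", 2, "MA"))

# Join the two tables on the shared accession number.
rows = conn.execute("""
    SELECT dna.accession, dna.organism, protein.protein_id, protein.sequence
    FROM dna JOIN protein ON protein.dna_accession = dna.accession
    WHERE dna.accession = ?
""", ("DEMO0001",)).fetchall()

print(rows)
```

Each table keeps only the fields that apply to it, and the accession column is what lets a query move from one table to the other.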
19 Other Related Information
- Also related to GenBank are protein structures, literature in PubMed, UniGene collections of cDNA sequences, etc.
- Each dataset deserves its own table, but by linking to a GenBank accession, they can all be cross-referenced
20 Entrez is a Relational Database
- The Entrez database contains all of the nucleotide and protein sequences in GenBank (updated daily) along with all of the literature in MEDLINE and the 3-D protein structures in PDB (Protein Data Bank).
- Entrez is much more than a database: it is both a powerful search engine and a pre-computed list of relationships between all of its data elements
21 Entrez is Internally Cross-linked
- DNA and protein sequences are linked to other similar sequences
- MEDLINE citations are linked to other citations that contain similar keywords
- 3-D structures are linked to similar structures
22 Databases contain more than just DNA and protein sequences
23 Type in a Query Term
- Enter your search words in the query box and hit the Go button
25 Related Items
- You can search for a text term in sequence annotations or in MEDLINE abstracts, and find all articles, DNA, and protein sequences that mention that term.
- Then from any article or sequence, you can move to "related articles" or "related sequences".
- Relationships between sequences are computed with BLAST
- Relationships between articles are computed with "MeSH" terms (shared keywords)
- Relationships between DNA and protein sequences rely on accession numbers
- Relationships between sequences and MEDLINE articles rely on both shared keywords and the mention of accession numbers in the articles.
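The idea of pre-computing "related articles" from shared keywords can be illustrated with a toy example. This is not the algorithm Entrez actually uses. It is a minimal sketch that links two articles when they share at least a chosen number of keywords. The article IDs, keyword sets, function name, and `min_shared` threshold are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical articles, each with a set of assigned keywords.
article_keywords = {
    "article_1": {"Cystic Fibrosis", "Mutation", "Chloride Channels"},
    "article_2": {"Cystic Fibrosis", "Mutation", "Genetic Testing"},
    "article_3": {"Zebrafish", "Embryonic Development"},
}


def precompute_neighbors(keywords_by_item, min_shared=2):
    """Build a lookup table of related items, computed once ahead of any query."""
    neighbors = {item: [] for item in keywords_by_item}
    for a, b in combinations(keywords_by_item, 2):
        shared = keywords_by_item[a] & keywords_by_item[b]
        if len(shared) >= min_shared:
            neighbors[a].append(b)
            neighbors[b].append(a)
    return neighbors


related = precompute_neighbors(article_keywords)
print(related["article_1"])
```

Because the table is built in advance, following a "related" link at query time is just a dictionary lookup.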
26
- These pre-computed relationships might include genes in the same multi-gene family, articles written about genes that have the same function, or other proteins that function in the same biochemical pathway
- This potential for horizontal movement through the linked databases makes Entrez really exciting: you can think like a biologist, not a database programmer!
- A researcher can start with only a vague set of keywords or a sequence identified in the laboratory and rapidly access a set of relevant literature and a list of related database sequences.
27 Refine the Query
- Often a search finds too many (or too few) sequences, so you can go back and try again with more (or fewer) keywords in your query
- The History feature allows you to combine any of your past queries.
- The Limits feature allows you to limit a query to specific organisms, sequences submitted during a specific period of time, etc.
- Many other features are designed to search for literature in MEDLINE
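Combining past queries amounts to set arithmetic on their result lists. The sketch below models a query history as numbered sets of accession-like IDs. The IDs and the meaning attached to each query are invented, and this does not use the actual Entrez interface.

```python
# Hypothetical query history: query number -> set of matching record IDs.
history = {
    1: {"DEMO0001", "DEMO0002", "DEMO0003"},  # keyword search
    2: {"DEMO0002", "DEMO0003", "DEMO0004"},  # limited to one organism
    3: {"DEMO0003"},                          # limited to a date range
}

both = history[1] & history[2]                       # query 1 AND query 2
either = history[1] | history[2]                     # query 1 OR query 2
narrowed = (history[1] & history[2]) - history[3]    # (1 AND 2) NOT 3

print(sorted(both))
print(sorted(either))
print(sorted(narrowed))
```

Adding keywords narrows a result set (intersection), while dropping them widens it (union).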
28 Other Sequence Search Tools
- SRS (Sequence Retrieval System) was created by Dr. Thure Etzold, CABIOS 9(1): 49-57 (1993)
- It is a meta search engine for all types of biological data in hundreds of databases as well as about 20 sequence analysis programs
- SRS can be accessed over the WWW from many servers (mostly in Europe)
  - http://srs.ebi.ac.uk/
  - http://www.infobiogen.fr/srs6bin/cgi-bin/wgetz?-page+top
  - http://www.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-page+top
  - http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?-page+top
31 Why So Many Databases?
- If GenBank has all sequence data and Entrez is such a good query tool, then why are there so many other sequence databases?
  - Specialized data (single species, immunoglobulins, etc.)
  - Better annotation (e.g. SwissProt)
  - Sequences linked to other data (ACEDB)
  - Stubbornness and local pride (EMBL, DDBJ)
- Well designed databases are interlinked with others for supplemental data
- It is very hard to get all relevant information across all databases for any gene
32 Other Genetic Databases
- Genome sequence: where does a gene fall on the genome?
  - integrate multiple layers of information
  - Sequence contigs, mRNAs, predicted exons, etc.
  - Single species?
- ESTs: dbEST at NCBI
- SNPs: dbSNP at NCBI, http://snp.cshl.org (SNP Consortium)
- Metabolism/Pathways
- Gene Function (Gene Ontology)
- Protein motifs/domains and protein families
33 Genome Databases
- New area, in desperate need of development
- Chromosomes, sequence, contigs, clones, STS markers, genetic markers, genes, features, expression data, phenotype
- No single database can hold it all
- UCSC is probably the best right now
  - genome.ucsc.edu
- Need a data exchange and linkage infrastructure
36 Ensembl at EBI/EMBL
37 ESTs (Expressed Sequence Tags)
- partial cDNA sequences
- dbEST at NCBI
  - a comprehensive set of all public EST data
- UniGene at NCBI
  - clusters of ESTs and known genes from key species
  - does NOT have consensus sequences
  - has far too many clusters to be representative of real genes (129 K human clusters)
38
IDENTIFIERS
  dbEST Id: 101883
  EST name: yb01a01.s1
  GenBank Acc: T48601
  GenBank gi: 650461
  GDB Id: 490761
CLONE INFO
  Clone Id: IMAGE69864 (3')
  Other ESTs on clone: yb01a01.r1
  DNA type: cDNA
PRIMERS
  Sequencing: -21m13
  PolyA Tail: Unknown
SEQUENCE
  GGCGGCTCAGTAGCAGGTGCC
  GTCCACCTCCGCCATGACAACAGACACATTGACATGGGT
  GGGTTTACCACCAAGCGTCCGATGGTCTTCTGTGTGAAGGCCAG
  CCAGGCGCCTCCATGG
  CACCATGCAGGAGAAGGNCTCCCCCTTCTTCCAGTCCTCGGCTGCCACGC
  GCAGTATGCT GGTCACACGAAGGTCGTGGTGCC
  CTGGCTGGNTCCTNCANGGATGCCCAAGTCAGGTACT
  TNTCGCGGGGCAGCTCCTGTGACCCCTGCAGCCAGCGAACCAGCAC
  GTCCTTGGGGCTTN AAGCNGCGCTACCAGGCAC
  TTCAACCGTTCNCCAGCTTCGTTCAGGGCCANCTTC
  Quality: High quality sequence stops at base 277
  Entry Created: Feb 6 1995
  Last Updated: Feb 6 1995
COMMENTS
  High quality sequence stops: 277
  Source: IMAGE Consortium, LLNL
  This clone is available royalty-free through LLNL; contact the IMAGE Consortium (info@image.llnl.gov) for further information.
PUTATIVE ID
  Assigned by submitter: similar to gbS71043_rna1 IG ALPHA-2 CHAIN C REGION (HUMAN)
LIBRARY
  Lib Name: Stratagene placenta (937225)
  Organism: Homo sapiens
  Sex: male
  Organ: placenta
  Lab host: SOLR cells (kanamycin resistant)
  Vector: pBluescript SK-
  R. Site 1: EcoRI
  R. Site 2: XhoI
  Description: Cloned unidirectionally. Primer: Oligo dT. Caucasian. Average insert size: 1.2 kb; Uni-ZAP XR Vector; 5' adaptor sequence: 5' GAATTCGGCACGAG 3'; 3' adaptor sequence: 5' CTCGAGTTTTTTTTTTTTTTTTTT 3'
41 Digital Differential Display
43 SNPs (Single Nucleotide Polymorphisms)
- Genetic variation
  - Can be alleles of genes
  - also differences in non-coding regions collected from genome sequencing of different individuals
- dbSNP at the NCBI: all public SNP data
- SNP Consortium at CSHL: high quality set
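A single nucleotide polymorphism is, at its simplest, one position where two otherwise matching sequences differ. The sketch below finds such positions. It assumes the two sequences are already aligned and of equal length, which real SNP discovery cannot take for granted. The function name and both sequences are invented for the example.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each single-base difference.

    Positions are 1-based. Sequences must already be aligned to the same length;
    positions where either base is the ambiguity code N are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()), start=1)
        if r != s and "N" not in (r, s)
    ]


snps = find_snps("ATGGCCAAGT", "ATGGTCAAGA")
print(snps)
```

Skipping N positions matters for low-quality reads such as ESTs, where an uncalled base is not evidence of a real difference.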
46 Human Genetic Variation
- Every human has essentially the same set of genes
- But there are different forms of each gene, known as alleles
  - blue vs. brown eyes
  - genetic diseases such as cystic fibrosis or Huntington's disease are caused by dysfunctional alleles
48 KEGG: Kyoto Encyclopedia of Genes and Genomes
- Enzymatic and regulatory pathways
- Mapped out by EC number and cross-referenced to genes in all known organisms
  - (wherever sequence information exists)
- Parallel maps of regulatory pathways
52 Protein-Protein Interactions
- Metabolic and regulatory pathways
- Transcription factors
- Co-expression
- Biochemical data
- crosslinking
- yeast 2-hybrid
- affinity tagging
- Useful feedback to genome annotation/protein
function and gene expression
54 BIND - The Biomolecular Interaction Network Database
55 Gene Ontology
- Genetics is a messy science
- Scientists have been working in isolation on individual species for many years, naming genes, mutants, odd phenotypes
  - sonic hedgehog
- Now that we have complete genome sequences, how to reconcile the names across all species?
- Gene Ontology uses a single 3-part system
  - Molecular function (specific tasks)
  - Biological process (broad biological goals, e.g. cell division)
  - Cellular component (location)
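The three-part system in the bullets above maps naturally onto a small data structure: one list of terms per aspect. The sketch below is illustrative only. The class, the aspect names as Python strings, and the gene `geneX` are invented. The term IDs are intended to match real Gene Ontology identifiers, but they should be checked against the current ontology before being relied on.

```python
from dataclasses import dataclass, field

ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass
class GeneAnnotation:
    gene: str
    # One list of (term_id, name) pairs per aspect.
    terms: dict = field(default_factory=lambda: {aspect: [] for aspect in ASPECTS})

    def add(self, aspect, term_id, name):
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        self.terms[aspect].append((term_id, name))


ann = GeneAnnotation("geneX")  # hypothetical gene
ann.add("molecular_function", "GO:0003677", "DNA binding")
ann.add("biological_process", "GO:0051301", "cell division")
ann.add("cellular_component", "GO:0005634", "nucleus")

print(ann.terms["biological_process"])
```

Because the vocabulary is shared, the same term can annotate genes from different species regardless of what each community named the gene.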
57 Database Search Strategies
- General search principles, not limited to sequence (or to biology)
- Use accession numbers whenever possible
- Start with broad keywords and narrow the search using more specific terms
- Try variants of spelling, numbers, etc.
- Search all relevant databases
- Be persistent!!
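Trying spelling variants can be partly automated. The sketch below handles just one common case, a name made of letters followed by a number, and generates the joined, hyphenated, and spaced forms. The function name and the pattern are assumptions for illustration. Real gene nomenclature has many more variations (Greek letters, Roman numerals, synonyms).

```python
import re


def name_variants(name):
    """Return spelling variants for names like 'IL2', 'IL-2' or 'IL 2'."""
    match = re.fullmatch(r"([A-Za-z]+)[-\s]?(\d+)", name.strip())
    if not match:
        return [name]  # pattern not recognised: leave the name unchanged
    letters, digits = match.groups()
    return [f"{letters}{sep}{digits}" for sep in ("", "-", " ")]


variants = name_variants("IL-2")
print(variants)
```

Each variant can then be submitted as a separate query and the results combined.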