Title: Protein Sequence Databases for Proteomics: The good, the bad and the ugly
1Protein Sequence Databases for Proteomics: The good, the bad and the ugly
- US HUPO Bioinformatics for Proteomics
- Nathan Edwards March 12, 2006
2Protein Sequence Databases
- Link between mass spectra and proteins
- A protein's amino-acid sequence provides a basis for interpreting
- Enzymatic digestion
- Separation protocols
- Fragmentation
- We must interpret database information as
carefully as mass spectra.
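As a rough illustration of how a sequence supports the interpretation of enzymatic digestion, the sketch below performs an in-silico tryptic digest. It assumes the commonly used simplified rule (cleave after K or R, except when the next residue is P) and no missed cleavages. The function name `tryptic_digest` is hypothetical and not taken from any tool mentioned in these slides.

```python
def tryptic_digest(sequence):
    """Split a protein sequence into fully tryptic peptides.

    Assumes the simplified rule: cleave C-terminal to K or R,
    unless the next residue is P. Missed cleavages are not modelled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        at_end = i == len(sequence) - 1
        if residue in "KR" and not at_end and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Whatever remains after the last cleavage site is the final peptide.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


# First 30 residues of the P13746 sequence shown in the VarSplic slides.
print(tryptic_digest("MAVMAPRTLLLLLSGALALTQTWAGSHSMR"))
```

A single amino-acid difference in the database entry changes which peptides this produces, which is one reason the database deserves as much scrutiny as the spectra.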
3More than sequence
- Protein sequence databases provide much more than sequence
- Names
- Descriptions
- Facts
- Predictions
- Links to other information sources
- Protein databases provide a link to the current
state of our understanding about a protein.
4Much more than sequence
- Names
- Accession, Name, Description
- Biological Source
- Organism, Source, Taxonomy
- Literature
- Function
- Biological process, molecular function, cellular component
- Known and predicted
- Features
- Polymorphism, Isoforms, PTMs, Domains
5Database types
6Human Sequences
- Number of Human Genes is believed to be between
20,000 and 25,000
7Accessions
- Permanent labels
- Short, machine readable
- Enable precise communication
- Typos render them unusable!
- Each database uses a different format
- Swiss-Prot: P17947
- Ensembl: ENSG00000066336
- PIR: S60367
- GO: GO:0003700
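Because each database uses its own accession format, a syntax check can catch many typos before a lookup fails. The sketch below is a minimal example. The regular expressions are assumptions based on the published accession formats of each resource and should be checked against current documentation. The names `ACCESSION_PATTERNS` and `classify_accession` are hypothetical.

```python
import re

# Assumed patterns for illustration; verify against each resource's documentation.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the names of all patterns that match the whole string."""
    return [name for name, pattern in ACCESSION_PATTERNS.items()
            if pattern.fullmatch(text)]


for candidate in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
    print(candidate, classify_accession(candidate))
```

A match only says the string is well formed. It does not show that the accession exists or refers to the intended protein.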
8Names / IDs
- Compact mnemonic labels
- Not guaranteed permanent
- Require careful curation
- Conceptual objects
- Swiss-Prot names changed last year!
- ALBU_HUMAN
- Serum Albumin
- RT30_HUMAN
- Mitochondrial 28S ribosomal protein S30
- CP3A7_HUMAN
- Cytochrome P450 3A7
9Description / Name
- Free text description
- Human readable
- Space limited
- Hard for computers to interpret!
- No standard nomenclature or format
- Often abused.
- COX7R_HUMAN
- Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]
10FASTA Format
- >
- Accession number
- No uniform format
- Multiple accessions separated by |
- One line of description
- Usually pretty cryptic
- Organism of sequence?
- No uniform format
- Official latin name not necessarily used
- Amino-acid sequence in single-letter code
- Usually spread over multiple lines.
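The layout described above can be read with a few lines of code. The sketch below is a minimal FASTA reader: it joins the wrapped sequence lines and returns the header line uninterpreted, because accession and organism layout differ between databases. The function name `read_fasta` and the example record are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without the leading '>' and is not parsed
    further, since its layout is not uniform across databases.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up record: two accessions, a description, and a wrapped sequence.
example = [">ACC1|ACC2 example description", "ACDEFG", "HIKLMN"]
for header, sequence in read_fasta(example):
    print(header, len(sequence))
```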
12Organism / Species / Taxonomy
- The protein's organism
- or the source of the biological sample
- The most reliable sequence annotation available
- Useful only to the extent that it is correct
- NCBI's taxonomy is widely used
- Provides a standard of sorts; hierarchical
- Other databases don't necessarily keep up
- Organism-specific sequence databases are also available.
13Organism / Species / Taxonomy
- Buffalo rat
- Gunn rats
- Norway rat
- Rattus PC12 clone IS
- Rattus norvegicus
- Rattus norvegicus8
- Rattus norwegicus
- Rattus rattiscus
- Rattus sp.
- Rattus sp. strain Wistar
- Sprague-Dawley rat
- Wistar rats
- brown rat
- laboratory rat
- rat
- rats
- zitter rats
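The labels above all refer to rats, yet a string comparison treats them as different organisms. One way to cope, sketched below, is to map free-text labels onto a taxonomy identifier and flag anything unrecognized for manual review. The synonym table is hand-built for this example and is an assumption. 10116 is the NCBI Taxonomy ID for Rattus norvegicus. The name `lookup_taxid` is hypothetical.

```python
# Hand-built synonym table, for illustration only.
# 10116 is the NCBI Taxonomy ID for Rattus norvegicus.
RAT_TAXID = 10116
SYNONYMS = {
    "rattus norvegicus": RAT_TAXID,
    "norway rat": RAT_TAXID,
    "brown rat": RAT_TAXID,
    "rat": RAT_TAXID,
    "rats": RAT_TAXID,
}


def lookup_taxid(organism):
    """Map a free-text organism label to a taxonomy ID, or None if unknown."""
    # Lower-case and collapse whitespace so trivial differences do not matter.
    key = " ".join(organism.lower().split())
    return SYNONYMS.get(key)


for label in ["Rattus norvegicus", "brown rat", "Rattus norwegicus"]:
    print(label, lookup_taxid(label))
```

Misspellings such as "Rattus norwegicus" deliberately return `None` here rather than being guessed at, so that a curator can decide what they mean.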
14Controlled Vocabulary
- Middle ground between computers and people
- Provides precision for concepts
- Searching, sorting, browsing
- Concept relationships
- Vocabulary / Ontology must be established
- Human curation
- Link between concept and object
- Manually curated
- Automatic / Predicted
21Ontology Structure
- NCBI Taxonomy
- Tree
- Gene Ontology (GO)
- Molecular function
- Biological process
- Cellular component
- Directed, Acyclic Graph (DAG)
- Unstructured labels
- InterPro, Pfam, Swiss-Prot keywords
- Overlapping?
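The practical difference between a tree and a DAG is that a DAG term may have several parents, so collecting the ancestors of a term needs a traversal that tolerates shared ancestors. The sketch below shows one way to do that. The term names are hypothetical placeholders, not real GO identifiers, and `ancestors` is a made-up helper.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links.

    `parents` maps a term to the list of its direct parents. In a tree each
    term has at most one parent; in a DAG it may have several.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            # Shared ancestors are visited once, however many paths reach them.
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical term names; "dna_binding_tf" has two parents, as a DAG allows.
toy_dag = {
    "dna_binding_tf": ["dna_binding", "transcription_regulator"],
    "dna_binding": ["binding"],
    "transcription_regulator": ["molecular_function"],
    "binding": ["molecular_function"],
}
print(sorted(ancestors("dna_binding_tf", toy_dag)))
```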
23Protein Families
- Similar sequence implies similar function
- Similar structure implies similar function
- Common domains imply similar function
- Bootstrap up from small sets of proteins with well understood characteristics
- Usually a hybrid manual / automatic approach
26Protein Families
- PROSITE, PFam, InterPro, PRINTS
- Swiss-Prot keywords
- Differences
- Motif style, ontology structure, degree of manual curation
- Similarities
- Primarily sequence based, cross species
27Gene Ontology
- Hierarchical
- Molecular function
- Biological process
- Cellular component
- Describes the vocabulary only!
- Protein families provide GO association
- Not necessarily any appropriate GO category.
- Not necessarily in all three hierarchies.
- Sometimes general categories are used because
none of the specific categories are correct.
28Protein Family / Gene Ontology
29Sequence Variants
- Protein sequence can vary due to
- Polymorphism
- Alternative splicing
- Post-translational modification
- Sequence databases typically do not capture all versions of a protein's sequence
30Sequence Variants
- "Swiss-Prot: a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases"
- Source: Swiss-Prot web site front page
31Sequence Variants
- b) Minimal redundancy
- "Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry."
- Source: Swiss-Prot User Manual, Section 1.1
32Sequence Variants
- "IPI provides a top level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI:
- 1. effectively maintains a database of cross references between the primary data sources
- 2. provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
- 3. maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases."
- Source: IPI web site front page
33Sequence Variants
- Swiss-Prot variants, isoforms and conflicts are retained as features
- Script varsplic.pl can enumerate all sequence variants
- Command-line options for full enumeration
- -which full -varsplic -variant -conflict
34Swiss-Prot Variant Annotations
Feature viewer
Variants
37Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSM
RYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFI
AVGYVDDTQFVRF P13746-00-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-01-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-00-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF
38Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAA
VMWRRKSS------DRKGGSYTQAASSDSAQ P13746-01-01-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKG
GSYTQAASSDSAQ P13746-00-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-00-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-01-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
SQAASSDSAQ P13746-01-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
SQAASSDSAQ
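Differences between enumerated variants are easy to miss by eye. The sketch below compares two gap-padded segments of equal length column by column and reports where they differ. It uses the P13746-00-00-00 and P13746-01-00-00 segments from this slide. The helper name `differing_columns` is hypothetical, and this is not part of varsplic.pl.

```python
def differing_columns(reference, other):
    """List (index, reference_char, other_char) for each column that differs.

    Both strings must already be aligned to the same length, with '-'
    marking residues that are absent from one of them.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(reference, other))
            if a != b]


# Segments for P13746-00-00-00 and P13746-01-00-00, as shown on the slide.
reference = "SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ"
isoform = "SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ"
for index, ref_char, iso_char in differing_columns(reference, isoform):
    print(index, ref_char, iso_char)
```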
39Omnibus Database Redundancy Elimination
- Source databases often contain the same sequences with different descriptions
- Omnibus databases keep one copy of the sequence, and
- An arbitrary description, or
- All descriptions, or
- Particular description, based on source preference
- Good definitions can be lost, including taxonomy
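The strategies listed above can be sketched in a few lines. The example below groups records by exact sequence while keeping every description, then optionally picks a single description by source preference. It is a toy model under stated assumptions: descriptions are assumed to begin with a source tag followed by a space, the sequences are made up, and `collapse_identical` and `pick_description` are hypothetical names, not the procedure of any particular omnibus database.

```python
def collapse_identical(records):
    """Group (description, sequence) records by exact sequence.

    Returns a dict mapping each distinct sequence to the list of all
    descriptions seen for it, in input order.
    """
    merged = {}
    for description, sequence in records:
        merged.setdefault(sequence, []).append(description)
    return merged


def pick_description(descriptions, preferred_sources):
    """Choose one description using a source pecking order.

    Assumes each description starts with a source tag and a space,
    which is a convention invented for this sketch.
    """
    for source in preferred_sources:
        for description in descriptions:
            if description.split(" ", 1)[0] == source:
                return description
    # No preferred source present: fall back to the first description seen.
    return descriptions[0]


records = [
    ("emb hypothetical protein", "MKVLA"),
    ("sp COMM domain containing protein 4", "MKVLA"),
    ("gb unrelated protein", "MTTQR"),
]
merged = collapse_identical(records)
print(pick_description(merged["MKVLA"], ["sp", "gb", "emb"]))
```

Swapping the order of `preferred_sources` changes which description survives, which is how an informative definition can be lost.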
40Omnibus Database Redundancy Elimination
- NCBI's nr
- Keeps all descriptions, separated by ^A
- MSDB
- Pecking order: PIR1-4, TrEMBL, GenBank, Swiss-Prot, NRL3D
- IPI
- All accessions, one description
41Description Elimination
- gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
- gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
- gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
- gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
- gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
- gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
42Description Elimination
- gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
- gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
- gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
- gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
- gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
- gi|1585500|prf||2201313A UDP galactose 4'-epimerase
43Description Elimination
- gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
- gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I validated human
- gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog human, liver, Peptide, 323 aa
- gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
- gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
- gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
44DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
45Translated sequences
- Gene models describe introns and exons
- Start site?
- Splice sites?
- Alternative splicing?
- ESTs provide limited evidence of transcription only
- There is a lot we don't know about what protein sequences result from a gene
- Recent revision of the number of human genes suggests a bigger role for alternative splicing.
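To make the dependence on the gene model concrete, the sketch below splices exons out of a toy genomic sequence and translates the result with the standard genetic code. Skipping one exon yields a different protein from the same gene. The genomic sequence and exon coordinates are invented for this example, and `splice` and `translate` are hypothetical helper names.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered T, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(genomic, exons):
    """Concatenate exon slices given as (start, end) half-open intervals."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(cds):
    """Translate a coding sequence up to, but not including, the first stop."""
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented toy gene: exon, intron, exon, intron, exon.
genomic = "ATGGCC" + "GTAAGT" + "AAAGGG" + "GTCCAG" + "TTTTAA"
all_exons = [(0, 6), (12, 18), (24, 30)]
skip_middle = [(0, 6), (24, 30)]
print(translate(splice(genomic, all_exons)))
print(translate(splice(genomic, skip_middle)))
```

If the database holds only one of these products, peptides unique to the other cannot be identified by a search against it.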
46Genome Browsers
- Link genomic, transcript, and protein sequence in a graphical manner
- Genes, ESTs, SNPs, cross-species, etc.
- UC Santa Cruz
- http://genome.ucsc.edu
- Ensembl
- http://www.ensembl.org
- NCBI Map View
- http://www.ncbi.nlm.nih.gov/mapview
47UCSC Genome Browser
- Shows many sources of protein sequence evidence in a unified display
- Can use EST accession as a location!
48Summary
- Protein sequence databases should be interpreted with as much care as mass spectra
- Use controlled vocabularies
- Understand the structure of ontologies
- Take advantage of computational predictions
- Look for sequence variants
- Be careful with omnibus databases