Protein Sequence Databases for Proteomics The good, the bad

About This Presentation

Slides: 49

Provided by: nathanjoh


1
Protein Sequence Databases for Proteomics: The good, the bad and the ugly

US HUPO Bioinformatics for Proteomics
Nathan Edwards March 12, 2006

2
Protein Sequence Databases

Link between mass spectra and proteins
A protein's amino-acid sequence provides a basis for interpreting:
Enzymatic digestion
Separation protocols
Fragmentation
We must interpret database information as
carefully as mass spectra.
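
As an illustration of how a database sequence drives the interpretation of enzymatic digestion, here is a minimal sketch of an in-silico tryptic digest. It assumes the commonly used simplified rule (cleave after K or R unless the next residue is P) and ignores missed cleavages; the function name is hypothetical and not part of the presentation.

```python
import re

def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P.

    Simplified rule, assumed for illustration; no missed cleavages.
    """
    # Zero-width split points: just after K/R and not just before P.
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

# N-terminal residues of P13746 as shown on the VarSplic output slide.
peptides = tryptic_digest(
    "MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF"
)
for peptide in peptides:
    print(peptide)
```

A single-residue change in the database sequence can create or remove a cleavage site, which is one reason the sequence has to be read as carefully as the spectra.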

3
More than sequence

Protein sequence databases provide much more than
sequence
Names
Descriptions
Facts
Predictions
Links to other information sources
Protein databases provide a link to the current
state of our understanding about a protein.

4
Much more than sequence

Names
Accession, Name, Description
Biological Source
Organism, Source, Taxonomy
Literature
Function
Biological process, molecular function, cellular
component
Known and predicted
Features
Polymorphism, Isoforms, PTMs, Domains

5
Database types
6
Human Sequences

Number of Human Genes is believed to be between
20,000 and 25,000

7
Accessions

Permanent labels
Short, machine readable
Enable precise communication
Typos render them unusable!
Each database uses a different format
Swiss-Prot P17947
Ensembl ENSG00000066336
PIR S60367 S60367
GO GO:0003700
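
Because each database uses its own accession format, a format check can catch typos before they cause silent lookup failures. The sketch below is an assumption-laden illustration: the patterns cover only the six-character Swiss-Prot style, Ensembl human gene identifiers, and GO identifiers (normally written with a colon, as in GO:0003700), and the names `ACCESSION_PATTERNS` and `classify_accession` are hypothetical.

```python
import re

# Illustrative patterns only; real databases accept more forms than these.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO": re.compile(r"GO:[0-9]{7}"),
}

def classify_accession(text):
    """Return the name of the first database whose pattern matches, else None."""
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(text):
            return database
    return None

for candidate in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
    print(candidate, "->", classify_accession(candidate))
```

The last candidate is a deliberately truncated accession; it matches no pattern, which is the point of the check.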

8
Names / IDs

Compact mnemonic labels
Not guaranteed permanent
Require careful curation
Conceptual objects
Swiss-Prot names changed last year!
ALBU_HUMAN
Serum Albumin
RT30_HUMAN
Mitochondrial 28S ribosomal protein S30
CP3A7_HUMAN
Cytochrome P450 3A7

9
Description / Name

Free text description
Human readable
Space limited
Hard for computers to interpret!
No standard nomenclature or format
Often abused.
COX7R_HUMAN
Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]

10
FASTA Format
11
FASTA Format

Accession number
No uniform format
Multiple accessions separated by |
One line of description
Usually pretty cryptic
Organism of sequence?
No uniform format
Official Latin name not necessarily used
Amino-acid sequence in single-letter code
Usually spread over multiple lines.
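
The points above can be sketched as a small FASTA reader. This is a minimal illustration, not a full parser: it assumes the header's first whitespace-delimited token holds the accessions joined by `|`, and that everything after it is the one-line description. The function names are hypothetical, and the residues in the example record are placeholders rather than the real sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence is usually spread over multiple lines; join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_header(header):
    """Split a header into its |-separated identifier fields and description."""
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description

# Header taken from a later slide; the residues are made-up placeholders.
example = """>gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
MKTAYIAKQR
QISFVK
"""
for header, sequence in read_fasta(example.splitlines()):
    print(split_header(header), len(sequence))
```

Note that nothing in the header says which field is the organism; that has to be inferred per database.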

12
Organism / Species / Taxonomy

The protein's organism
or the source of the biological sample
The most reliable sequence annotation available
Useful only to the extent that it is correct
NCBI's taxonomy is widely used
Provides a standard of sorts; hierarchical
Other databases don't necessarily keep up
Organism-specific sequence databases are also
available.

13
Organism / Species / Taxonomy

Buffalo rat
Gunn rats
Norway rat
Rattus PC12 clone IS
Rattus norvegicus
Rattus norvegicus8
Rattus norwegicus
Rattus rattiscus
Rattus sp.

Rattus sp. strain Wistar
Sprague-Dawley rat
Wistar rats
brown rat
laboratory rat
rat
rats
zitter rats
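
A list like this one is why free-text organism fields need normalizing before use. The sketch below is one possible approach under stated assumptions: a hand-made synonym set (which common names really mean Rattus norvegicus is a curation decision, assumed here) plus fuzzy matching from the standard library to catch near-miss spellings. The function name and the 0.9 cutoff are hypothetical choices.

```python
import difflib

CANONICAL = "Rattus norvegicus"

# Assumed synonym set for illustration; a real mapping needs curation.
SYNONYMS = {"norway rat", "brown rat", "laboratory rat", "rat", "rats"}

def normalize_organism(name, cutoff=0.9):
    """Map a free-text organism name to the canonical name, or None."""
    cleaned = name.strip().lower()
    if cleaned in SYNONYMS:
        return CANONICAL
    # Fuzzy match catches small misspellings of the Latin name.
    if difflib.get_close_matches(cleaned, [CANONICAL.lower()], n=1, cutoff=cutoff):
        return CANONICAL
    return None

for raw in ["Rattus norwegicus", "Rattus norvegicus8", "brown rat", "Rattus sp."]:
    print(repr(raw), "->", normalize_organism(raw))
```

Names that cannot be resolved, such as "Rattus sp.", are returned as `None` rather than guessed, so they can be reviewed by hand.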

14
Controlled Vocabulary

Middle ground between computers and people
Provides precision for concepts
Searching, sorting, browsing
Concept relationships
Vocabulary / Ontology must be established
Human curation
Link between concept and object
Manually curated
Automatic / Predicted

15
Controlled Vocabulary
16
Controlled Vocabulary
17
Controlled Vocabulary
18
Controlled Vocabulary
19
Controlled Vocabulary
20
Controlled Vocabulary
21
Ontology Structure

NCBI Taxonomy
Tree
Gene Ontology (GO)
Molecular function
Biological process
Cellular component
Directed, Acyclic Graph (DAG)
Unstructured labels
InterPro, Pfam, Swiss-Prot keywords
Overlapping?
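
The practical difference between a tree and a directed acyclic graph is that a DAG term can have more than one parent, so a query for "everything under X" has to follow all parent links. A minimal sketch with a toy, entirely hypothetical ontology (the term names are placeholders, not real GO terms):

```python
def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical toy DAG: "leaf" has two parents that share one root.
PARENTS = {
    "leaf": ["mid_a", "mid_b"],
    "mid_a": ["root"],
    "mid_b": ["root"],
}

print(sorted(ancestors("leaf", PARENTS)))
```

In a tree the same traversal would visit exactly one path to the root; here it visits both branches and reports the shared root once.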

22
Ontology Structure
23
Protein Families

Similar sequence implies similar function
Similar structure implies similar function
Common domains imply similar function
Bootstrap up from small sets of proteins with
well understood characteristics
Usually a hybrid manual / automatic approach

24
Protein Families
25
Protein Families
26
Protein Families

PROSITE, Pfam, InterPro, PRINTS
Swiss-Prot keywords
Differences
Motif style, ontology structure, degree of manual
curation
Similarities
Primarily sequence based, cross species

27
Gene Ontology

Hierarchical
Molecular function
Biological process
Cellular component
Describes the vocabulary only!
Protein families provide GO association
Not necessarily any appropriate GO category.
Not necessarily in all three hierarchies.
Sometimes general categories are used because
none of the specific categories are correct.

28
Protein Family / Gene Ontology
29
Sequence Variants

Protein sequence can vary due to
Polymorphism
Alternative splicing
Post-translational modification
Sequence databases typically do not capture all
versions of a protein's sequence

30
Sequence Variants

Swiss-Prot a curated protein sequence database
which strives to provide a high level of
annotation (such as the description of the
function of a protein, its domains structure,
post-translational modifications, variants,
etc.), a minimal level of redundancy and high
level of integration with other databases
- Swiss-Prot web site front page

31
Sequence Variants

b) Minimal redundancy
Many sequence databases contain, for a given
protein sequence, separate entries which
correspond to different literature reports. In
Swiss-Prot we try as much as possible to merge
all these data so as to minimize the redundancy
of the database. If conflicts exist between
various sequencing reports, they are indicated in
the feature table of the corresponding entry.
- Swiss-Prot User Manual, Section 1.1

32
Sequence Variants

IPI provides a top level guide to the main
databases that describe the proteomes of higher
eukaryotic organisms. IPI
1. effectively maintains a database of cross
references between the primary data sources
2. provides minimally redundant yet maximally
complete sets of proteins for featured species
(one sequence per transcript)
3. maintains stable identifiers (with
incremental versioning) to allow the tracking of
sequences in IPI between IPI releases.
- IPI web site front page

33
Sequence Variants

Swiss-Prot variants, isoforms and conflicts are
retained as features
Script varsplic.pl can enumerate all sequence
variants
Command-line options for full enumeration
-which full -varsplic -variant -conflict
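
To give a sense of what full enumeration means, here is a minimal sketch that expands a base sequence with every combination of annotated single-residue substitutions. It is not varsplic.pl and does not reproduce its handling of isoforms or conflicts; the function name, the (position, reference, alternate) tuple layout, and the example sequence are all assumptions for illustration.

```python
from itertools import product

def enumerate_variants(sequence, substitutions):
    """Return all sequences formed by applying any subset of substitutions.

    Each substitution is (position, ref, alt) with a 1-based position.
    """
    for position, ref, _alt in substitutions:
        if sequence[position - 1] != ref:
            raise ValueError(f"expected {ref} at position {position}")
    results = []
    # One True/False choice per substitution: 2**n combinations in total.
    for choices in product([False, True], repeat=len(substitutions)):
        residues = list(sequence)
        for applied, (position, _ref, alt) in zip(choices, substitutions):
            if applied:
                residues[position - 1] = alt
        results.append("".join(residues))
    return results

# Made-up sequence and variants, for illustration only.
for variant in enumerate_variants("MKTAYIAK", [(2, "K", "R"), (5, "Y", "F")]):
    print(variant)
```

The count doubles with each independent variant, so full enumeration can inflate a search database quickly.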

34
Swiss-Prot Variant Annotations
35
Swiss-Prot Variant Annotations
36
Swiss-Prot Variant Annotations
Feature viewer
Variants
37
Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSM
RYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFI
AVGYVDDTQFVRF P13746-00-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-01-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-00-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF

38
Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAA
VMWRRKSS------DRKGGSYTQAASSDSAQ P13746-01-01-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKG
GSYTQAASSDSAQ P13746-00-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-00-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-01-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
SQAASSDSAQ P13746-01-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
SQAASSDSAQ

39
Omnibus Database Redundancy Elimination

Source databases often contain the same sequences
with different descriptions
Omnibus databases keep one copy of the sequence,
and
An arbitrary description, or
All descriptions, or
Particular description, based on source
preference
Good definitions can be lost, including taxonomy

40
Omnibus Database Redundancy Elimination

NCBI's nr
Keeps all descriptions, separated by ^A (Control-A)
MSDB
Pecking order: PIR1-4, TrEMBL, GenBank,
Swiss-Prot, NRL3D
IPI
All accessions, one description
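
For nr, the merged descriptions in a FASTA header have conventionally been joined with the Control-A character (ASCII 0x01), so all of them can be recovered by splitting on it. A minimal sketch, assuming that convention and a hypothetical function name; the example header is assembled from two of the COMMD4 descriptions shown on the next slide.

```python
def split_nr_defline(defline):
    """Split an nr-style header into (identifier, description) pairs."""
    records = []
    for part in defline.lstrip(">").split("\x01"):
        identifier, _, description = part.partition(" ")
        records.append((identifier, description.strip()))
    return records

defline = (
    ">gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]"
    "\x01"
    "gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4"
)
for identifier, description in split_nr_defline(defline):
    print(identifier, "|", description)
```

Tools that read only up to the first Control-A see just one description, so which entry comes first matters.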

41
Description Elimination

gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]

42
Description Elimination

gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
gi|1585500|prf||2201313A UDP galactose 4'-epimerase

43
Description Elimination

gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I validated human
gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog human, liver, Peptide, 323 aa
gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]

44
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
45
Translated sequences

Gene models describe introns and exons
Start site?
Splice sites?
Alternative splicing?
ESTs provide limited evidence of transcription
only
There is a lot we don't know about what protein
sequences result from a gene
The recent revision of the number of human genes suggests
a bigger role for alternative splicing.

46
Genome Browsers

Link genomic, transcript, and protein sequence in
a graphical manner
Genes, ESTs, SNPs, cross-species, etc.
UC Santa Cruz
http://genome.ucsc.edu
Ensembl
http://www.ensembl.org
NCBI Map View
http://www.ncbi.nlm.nih.gov/mapview

47
UCSC Genome Browser

Shows many sources of protein sequence evidence
in a unified display
Can use EST accession as a location!

48
Summary

Protein sequence databases should be interpreted
with as much care as mass spectra
Use controlled vocabularies
Understand the structure of ontologies
Take advantage of computational predictions
Look for sequence variants
Be careful with omnibus databases
