Protein Sequence Databases for Proteomics

1
Protein Sequence Databases for Proteomics
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
Protein Sequence Databases
  • Link between mass spectra and proteins
  • A protein's amino-acid sequence provides a basis
    for interpreting:
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
  • Peptide ion masses
  • We must interpret database information as
    carefully as mass spectra.
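
To make the link between sequence and peptide ion masses concrete, here is a minimal sketch of an in-silico tryptic digest. It assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and no modifications; the function names are illustrative and not part of any search engine.

```python
import re

# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P.

    Relies on zero-width splitting, which needs Python 3.7 or later.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_digest("AKRPGKC"):
        print(pep, round(peptide_mass(pep), 5))
```

An error in the database sequence changes every peptide that overlaps it, which is one reason the database deserves the same scrutiny as the spectra.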

3
More than sequence
  • Protein sequence databases provide much more than
    sequence
  • Names
  • Descriptions
  • Facts
  • Predictions
  • Links to other information sources
  • Protein databases provide a link to the current
    state of our understanding about a protein.

4
Much more than sequence
  • Names
  • Accession, Name, Description
  • Biological Source
  • Organism, Source, Taxonomy
  • Literature
  • Function
  • Biological process, molecular function, cellular
    component
  • Known and predicted
  • Features
  • Polymorphism, Isoforms, PTMs, Domains
  • Derived Data
  • Molecular weight, pI

5
Database types
6
SwissProt
  • From ExPASy
  • Expert Protein Analysis System
  • Swiss Institute of Bioinformatics
  • 180,000 protein sequence entries
  • 9,000 species represented
  • 12,000 Human proteins
  • Highly curated
  • Minimal redundancy
  • Some restrictions on commercial use

7
PIR
  • Protein Information Resource
  • Georgetown University Medical Center
  • 280,000 protein sequence entries
  • Highly curated
  • Public domain resource
  • 10,500 Human proteins
  • Grew out of the Atlas of Protein Sequence and
    Structure (1965-1978) edited by Margaret Dayhoff.

8
TrEMBL
  • Translated EMBL nucleotide sequences
  • European Molecular Biology Laboratory
  • European Bioinformatics Institute (EBI)
  • Computer annotated
  • Only sequences absent from SwissProt
  • 165,000 protein sequence entries
  • 88,000 species
  • 52,000 Human proteins

9
RefSeq
  • Reference Sequence
  • From NCBI (National Center for Biotechnology
    Information), NLM, NIH
  • Integrated genomic, transcript, and protein
    sequences.
  • Varying levels of curation
  • Reviewed, Validated, ..., Predicted, ...
  • 1,350,000 protein sequence entries
  • 44,000 reviewed
  • 28,000 Human proteins

10
RefSeq
  • Particular focus on major research organisms
  • Tightly integrated with genome projects.
  • Curated entries: NP accessions
  • Predicted entries: XP accessions

11
UniProt
  • Universal Protein Resource
  • Combination of
  • Swiss-Prot
  • TrEMBL
  • PIR
  • Knowledgebase is highly curated
  • Similar sequence clusters are available
  • 50%, 90%, 100% sequence similarity

12
IPI
  • International Protein Index
  • From EBI
  • For a specific species, combines
  • UniProt, RefSeq, Ensembl
  • Species specific databases
  • 48,000 protein sequence entries
  • Human, mouse, rat, zebrafish, Arabidopsis

13
NCBI's nr
  • non-redundant
  • Contains
  • GenBank CDS translations
  • RefSeq Proteins
  • Protein Data Bank (PDB)
  • SwissProt, TrEMBL, PIR
  • Others
  • Similar sequences suppressed
  • 100% sequence similarity
  • 1,800,000 protein sequence entries
  • 33,000 species

14
MSDB
  • From the Imperial College (London)
  • Combines
  • PIR, TrEMBL, GenBank, SwissProt
  • Distributed with Mascot
  • so well integrated with Mascot

15
Others
  • HPRD
  • Manually curated integration of literature
  • PDB
  • Focus on protein structure
  • dbEST
  • Part of GenBank - EST sequences
  • Genome Sequences

16
Human Sequences
  • Number of Human Genes is believed to be between
    20,000 and 25,000

17
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
18
Genome Browsers
  • Link genomic, transcript, and protein sequence in
    a graphical manner
  • Genes, ESTs, SNPs, cross-species, etc.
  • UC Santa Cruz
  • http://genome.ucsc.edu
  • Ensembl
  • http://www.ensembl.org
  • NCBI Map View
  • http://www.ncbi.nlm.nih.gov/mapview

19
UCSC Genome Browser
  • Shows many sources of protein sequence evidence
    in a unified display
  • Can use EST accession as a location!

20
Accessions
  • Permanent labels
  • Short, machine readable
  • Enable precise communication
  • Typos render them unusable!
  • Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367 S60367
  • GO: GO:0003700
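
Because each database uses its own accession format, a small pattern table can catch typos early. The sketch below is illustrative only: the Swiss-Prot pattern follows the published UniProtKB accession format, while the Ensembl, PIR, and GO patterns are assumptions generalized from the examples on this slide, and `classify_accession` is a hypothetical helper.

```python
import re

# Order matters: more specific patterns are tried first.
ACCESSION_PATTERNS = [
    ("GO", re.compile(r"GO:\d{7}")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("Swiss-Prot", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    # Assumption: one letter plus five digits, as in S60367.
    ("PIR", re.compile(r"[A-Z]\d{5}")),
]


def classify_accession(text):
    """Return the first database whose pattern matches the whole token."""
    token = text.strip()
    for name, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return None  # unrecognized, possibly a typo


if __name__ == "__main__":
    for acc in ["P17947", "ENSG00000066336", "S60367", "GO:0003700", "P1794"]:
        print(acc, "->", classify_accession(acc))
```

A format check only shows that a string looks like an accession; it cannot confirm that the entry exists.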

21
Names / IDs
  • Compact mnemonic labels
  • Not guaranteed permanent
  • Require careful curation
  • Conceptual objects
  • Swiss-Prot names changed recently!
  • ALBU_HUMAN
  • Serum Albumin
  • RT30_HUMAN
  • Mitochondrial 28S ribosomal protein S30
  • CP3A7_HUMAN
  • Cytochrome P450 3A7

22
Description / Name
  • Free text description
  • Human readable
  • Space limited
  • Hard for computers to interpret!
  • No standard nomenclature or format
  • Often abused.
  • COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related
    protein, mitochondrial Precursor

23
FASTA Format
24
FASTA Format
  • >
  • Accession number
  • No uniform format
  • Multiple accessions separated by |
  • One line of description
  • Usually pretty cryptic
  • Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
  • Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
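
The layout described above can be read with a few lines of code. This is a minimal sketch, not a validating parser: it assumes each record starts with a ">" line, treats the first whitespace-delimited token as the identifier field, and splits multiple accessions on "|". The record contents in the demo are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated; they may wrap arbitrarily.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split a header into its accession list and free-text description."""
    ident, _, description = header.partition(" ")
    return ident.split("|"), description


if __name__ == "__main__":
    demo = ">acc1|acc2 toy protein one\nMAVM\nAPRT\n>acc3 toy protein two\nSSQPT\n"
    for header, seq in parse_fasta(demo):
        print(split_header(header), seq)
```

Anything beyond this, such as the organism, has to be dug out of the description text, which is where the lack of a uniform format hurts.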

25
Organism / Species / Taxonomy
  • The protein's organism
  • or the source of the biological sample
  • The most reliable sequence annotation available
  • Useful only to the extent that it is correct
  • NCBI's taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don't necessarily keep up
  • Organism specific sequence databases starting to
    become available.

26
Organism / Species / Taxonomy
  • Buffalo rat
  • Gunn rats
  • Norway rat
  • Rattus PC12 clone IS
  • Rattus norvegicus
  • Rattus norvegicus8
  • Rattus norwegicus
  • Rattus rattiscus
  • Rattus sp.
  • Rattus sp. strain Wistar
  • Sprague-Dawley rat
  • Wistar rats
  • brown rat
  • laboratory rat
  • rat
  • rats
  • zitter rats
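
One way to cope with labels like these is to map them onto a single canonical name. The sketch below is a hedged illustration: the synonym table is hand-written for this example (not taken from NCBI), and the fuzzy-match cutoff of 0.9 is an arbitrary assumption chosen to catch single-character typos.

```python
import difflib

CANONICAL = "Rattus norvegicus"

# Hypothetical, hand-curated synonym table for illustration only.
SYNONYMS = {
    "rattus norvegicus": CANONICAL,
    "norway rat": CANONICAL,
    "brown rat": CANONICAL,
    "laboratory rat": CANONICAL,
    "rat": CANONICAL,
    "rats": CANONICAL,
}


def normalize_organism(label, cutoff=0.9):
    """Map a free-text organism label to a canonical name, or None."""
    key = " ".join(label.lower().split())
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to fuzzy matching to catch small spelling errors.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


if __name__ == "__main__":
    for label in ["Rattus norwegicus", "Rattus norvegicus8", "brown rat", "zitter rats"]:
        print(repr(label), "->", normalize_organism(label))
```

Strain names such as "zitter rats" fall through unmatched, so some human curation is still needed.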

27
Controlled Vocabulary
  • Middle ground between computers and people
  • Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
  • Vocabulary / Ontology must be established
  • Human curation
  • Link between concept and object
  • Manually curated
  • Automatic / Predicted

28
Controlled Vocabulary
29
Controlled Vocabulary
30
Controlled Vocabulary
31
Controlled Vocabulary
32
Controlled Vocabulary
33
Ontology Structure
  • NCBI Taxonomy
  • Tree
  • Gene Ontology (GO)
  • Molecular function
  • Biological process
  • Cellular component
  • Directed, Acyclic Graph (DAG)
  • Unstructured labels
  • Overlapping?
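
The practical difference between a tree and a DAG is that a DAG term can have several parents, so collecting "everything this term implies" means walking every path to the root. A minimal sketch with a made-up toy ontology (the term names are placeholders, not real GO or NCBI Taxonomy entries):

```python
def ancestors(term, parents):
    """Return all ancestors of a term, following every parent link."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def is_tree(parents):
    """True when no term has more than one parent."""
    return all(len(p) <= 1 for p in parents.values())


# Hypothetical toy ontologies: each term maps to its set of parents.
TOY_TREE = {"root": set(), "A": {"root"}, "C": {"A"}}
TOY_DAG = {"root": set(), "A": {"root"}, "B": {"root"}, "C": {"A", "B"}, "D": {"C"}}

if __name__ == "__main__":
    print(is_tree(TOY_TREE), sorted(ancestors("C", TOY_TREE)))
    print(is_tree(TOY_DAG), sorted(ancestors("D", TOY_DAG)))
```

When searching by a general term, annotations made to any descendant should normally be included, which is why the structure matters.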

34
Ontology Structure
35
Protein Families
  • Similar sequence implies similar function
  • Similar structure implies similar function
  • Common domains imply similar function
  • Bootstrap up from small sets of proteins with
    well understood characteristics
  • Usually a hybrid manual / automatic approach

36
Protein Families
37
Protein Families
38
Protein Families
  • PROSITE, Pfam, InterPro, PRINTS
  • Gene Ontology
  • Swiss-Prot keywords
  • Differences
  • Motif style, ontology structure, degree of manual
    curation
  • Similarities
  • Primarily sequence based, cross species
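
As an example of one motif style, PROSITE patterns are short sequence signatures that translate almost directly into regular expressions. The converter below is a simplified sketch covering only the basic syntax (x, [..], {..}, repeat counts, and the < and > anchors); it ignores rarer constructs, and `prosite_to_regex` is an illustrative name.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern, e.g. N-{P}-[ST]-{P}, to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if lo and hi:
            core += "{" + lo + "," + hi + "}"
        elif lo:
            core += "{" + lo + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}.")
    print(regex, [m.start() for m in re.finditer(regex, "AANGSAA")])
```

The N-{P}-[ST]-{P} pattern is the PROSITE N-glycosylation site signature; short patterns like this match often by chance, so a hit is a prediction rather than a fact.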

39
Sequence Variants
  • Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification
  • Sequence databases typically do not capture all
    versions of a protein's sequence

40
Sequence Variants
  • Swiss-Prot variants, isoforms and conflicts are
    retained as features
  • Script varsplic.pl can enumerate all sequence
    variants
  • Command-line options for full enumeration
  • -which full -varsplic -variant -conflict
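
The idea of enumerating variants can be sketched independently of the script. The code below is not a reimplementation of varsplic.pl and may enumerate differently: it assumes single-residue substitutions only and emits every combination of them, and the sequence and positions in the demo are made up.

```python
from itertools import product


def enumerate_variants(sequence, substitutions):
    """Return every combination of the given substitutions.

    substitutions maps a 1-based position to the alternative residue.
    """
    positions = sorted(substitutions)
    results = []
    for choice in product([False, True], repeat=len(positions)):
        residues = list(sequence)
        for pos, use_alt in zip(positions, choice):
            if use_alt:
                residues[pos - 1] = substitutions[pos]
        results.append("".join(residues))
    return results


if __name__ == "__main__":
    # Hypothetical fragment with two annotated substitutions.
    for seq in enumerate_variants("PGRGEPRF", {5: "K", 1: "A"}):
        print(seq)
```

The number of sequences doubles with each substitution, so full enumeration is only practical when restricted to annotated variants.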

41
Swiss-Prot Variant Annotations
42
Swiss-Prot Variant Annotations
43
Swiss-Prot Variant Annotations
Feature viewer
Variants
44
Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSM
RYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFI
AVGYVDDTQFVRF P13746-00-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-03-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-01-04-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVG
YVDDTQFVRF P13746-00-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-05-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-00-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-00-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF P13746-01-02-00
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVG
YVDDTQFVRF


45
Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAA
VMWRRKSS------DRKGGSYTQAASSDSAQ P13746-01-01-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKG
GSYTQAASSDSAQ P13746-00-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-00-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-03-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-04-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
TQAASSDSAQ P13746-01-05-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-01-00-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
TQAASSDSAQ P13746-00-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSY
SQAASSDSAQ P13746-01-02-00
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSY
SQAASSDSAQ


46
Omnibus Database Redundancy Elimination
  • Source databases often contain the same sequences
    with different descriptions
  • Omnibus databases keep one copy of the sequence,
    and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source
    preference
  • Good definitions can be lost, including taxonomy
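
The "particular description, based on source preference" strategy can be sketched as follows. This is an illustrative assumption about how an omnibus database might merge entries, not the procedure of any specific resource; the source codes, preference order, and placeholder sequence are made up for the demo.

```python
def merge_redundant(records, source_preference):
    """Collapse identical sequences, keeping one preferred description.

    records: iterable of (source, description, sequence) tuples.
    source_preference: source codes, most trusted first.
    """
    grouped = {}
    for source, description, sequence in records:
        grouped.setdefault(sequence, []).append((source, description))
    rank = {s: i for i, s in enumerate(source_preference)}
    merged = []
    for sequence, entries in grouped.items():
        # Unknown sources sort after every listed source.
        best = min(entries, key=lambda e: rank.get(e[0], len(rank)))
        merged.append({
            "sequence": sequence,
            "description": best[1],
            "all_descriptions": [d for _, d in entries],
        })
    return merged


if __name__ == "__main__":
    demo = [
        ("emb", "hypothetical protein", "AAAA"),  # placeholder sequence
        ("gb", "COMMD4 protein", "AAAA"),
        ("sp", "COMM domain containing protein 4", "AAAA"),
    ]
    print(merge_redundant(demo, ["sp", "ref", "gb", "emb"]))
```

Keeping the full list alongside the preferred description avoids discarding a good definition just because it came from a lower-ranked source.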

47
Description Elimination
  • gi|12053249|emb|CAB66806.1| hypothetical protein
    [Homo sapiens]
  • gi|46255828|gb|AAH68998.1| COMMD4 protein
    [Homo sapiens]
  • gi|42632621|gb|AAS22242.1| COMMD4
    [Homo sapiens]
  • gi|21361661|ref|NP_060298.2| COMM domain
    containing 4 [Homo sapiens]
  • gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain
    containing protein 4
  • gi|49065330|emb|CAG38483.1| COMMD4
    [Homo sapiens]

48
Description Elimination
  • gi|2947219|gb|AAC39645.1| UDP-galactose 4'
    epimerase [Homo sapiens]
  • gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase
    [Homo sapiens]
  • gi|14277913|pdb|1HZJ|B Chain B, Human
    Udp-Galactose 4-Epimerase Accommodation Of
    Udp-N-Acetylglucosamine Within The Active Site
  • gi|14277912|pdb|1HZJ|A Chain A, Human
    Udp-Galactose 4-Epimerase Accommodation Of
    Udp-N-Acetylglucosamine Within The Active Site
  • gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose
    4-epimerase (Galactowaldenase) (UDP-galactose
    4-epimerase)
  • gi|1585500|prf||2201313A UDP galactose
    4'-epimerase

49
Description Elimination
  • gi|4261710|gb|AAD14010.1| chlordecone reductase
    [Homo sapiens]
  • gi|2117443|pir||A57407 chlordecone reductase (EC
    1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase
    (EC 1.1.1.-) I validated human
  • gi|1839264|gb|AAB47003.1| HAKRa
    product/3 alpha-hydroxysteroid dehydrogenase
    homolog [human, liver, Peptide, 323 aa]
  • gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto
    reductase family 1 member C4 (Chlordecone reductase)
    (CDR) (3-alpha-hydroxysteroid dehydrogenase)
    (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4)
    (HAKRA)
  • gi|7328948|dbj|BAA92885.1| dihydrodiol
    dehydrogenase 4 [Homo sapiens]
  • gi|7328971|dbj|BAA92893.1| dihydrodiol
    dehydrogenase 4 [Homo sapiens]
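
In the nr FASTA files these entries come from, identifier fields are conventionally pipe-delimited (gi|number|database|accession|) and the organism, when present, sits in trailing square brackets. Under that assumption, a description line can be taken apart as sketched below; `parse_defline` and its returned field names are illustrative, and real deflines have more variations than this handles.

```python
import re

# Assumed layout: gi|<number>|<db>|<accession>|<optional name> <description>
DEFLINE = re.compile(r"gi\|(\d+)\|(\w+)\|([^|]*)\|(\S*)\s*(.*)")
ORGANISM = re.compile(r"\s*\[([^\]]+)\]\s*$")


def parse_defline(line):
    """Split an nr-style description line into labeled fields, or None."""
    m = DEFLINE.match(line.lstrip(">").strip())
    if m is None:
        return None
    gi, db, accession, extra, rest = m.groups()
    organism = None
    om = ORGANISM.search(rest)
    if om:
        organism = om.group(1)
        rest = rest[:om.start()]
    return {
        "gi": gi,
        "db": db,
        # Some sources (pir, prf) leave the first slot empty.
        "accession": accession or extra,
        "entry_name": extra if accession else "",
        "description": rest,
        "organism": organism,
    }


if __name__ == "__main__":
    print(parse_defline("gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]"))
    print(parse_defline("gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4"))
```

Once the source database is a separate field, it can drive a source-preference rule for choosing which description to keep.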

50
Summary
  • Protein sequence databases should be interpreted
    with as much care as mass spectra
  • Protein sequences come from genes
  • Use controlled vocabularies
  • Understand the structure of ontologies
  • Take advantage of computational predictions
  • Look for sequence variants
  • Be careful with omnibus databases