Title: Protein and Proteome Annotation
1Protein and Proteome Annotation
- David Wishart
- University of Alberta
- Edmonton, AB
- david.wishart_at_ualberta.ca
2Annotating 2D Gels
Trypsin Gel punch
p53
Trx
G6PDH
3Is This Annotated?
p53
Information: 1) pI 2) MW 3) name (abbr.) 4)
accession 5) relative amount
Trx
G6PDH
4How About This?
Information: 1) name (abbr.) 2) accession 3)
relative amount 4) coexpressors
5Is This Annotated?
>P12345 Sequence 1
GATTACAGATTACAGATTACAGATTACAGAT
TACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA TTACAGA
TTACAGATTACAGATTACAGATTACAGAT TACAGATTAGAGATTACAGA
TTACAGATTACAGATT ACAGATTACAGATTACAGATTACAGATTACAGA
TTA CAGATTACAGATTACAGATTACAGATTACAGATTAC AGATTACAG
ATTACAGATTACAGATTACAGATTACA GATTACAGATTACAGATTACAG
ATTACAGATTACAG ATTACAGATTACAGATTACAGATTACAGATTACAG
A TTACAGATTACAGATTACAGATTACAGATTACAGAT
6Protein Annotation
- Objective: identify and describe all the
physico-chemical, functional and structural
properties of a protein, including its sequence,
accession #, mass, pI, absorptivity, solubility,
active sites, binding sites, reactions,
substrates, homologues, function, name(s),
abundance, location, 2° structure, 3D structure,
domains, pathways, interacting partners
7Protein vs. Proteome Annotation
- Protein annotation is concerned with one or a
small number (<50) of proteins from one or several
types of organisms
- Proteome annotation is concerned with entire
proteomes (>2000 proteins) from a specific
organism (or for all organisms), hence the need for speed
8Different Levels of Annotation
- Sparse: typical of many gel or microarray
annotations; usually includes just the name and
accession number
- Moderate: typical of many sequence databases or
of experiments aimed at identifying protein
complexes or ligands
- Detailed: not typical (occasionally found in
organism-specific databases)
9Different Levels of Database Annotation
- GenBank (large # of sequences, minimal
annotation)
- PIR (large # of sequences, slightly better
annotation)
- SwissProt (small # of sequences, even better
annotation)
- Organism-specific DB (very small # of sequences,
best annotation)
10GenBank Annotation
11PIR Annotation
12Swiss-Prot Annotation
13CCDB Annotation
14CCDB Annotation
15Ultimate Goal...
- To achieve the same level of protein/proteome
annotation as found in CCDB for all
genes/proteins from 2D GE data, from microarray
data or for sequence databases in general
How?
16Annotation Methods
- Annotation by homology (BLAST)
- requires a large, well annotated database of
protein sequences
- Annotation by sequence composition
- simple statistical/mathematical methods
- Annotation by sequence features, profiles or
motifs
- requires sophisticated sequence analysis tools
17Annotation by Homology
- Statistically significant sequence matches
identified by BLAST searches against GenBank
(nr), SWISS-PROT, PIR, ProDom, BLOCKS, KEGG, WIT,
BRENDA, BIND
- Properties or annotation inferred by name,
keywords, features, comments
Databases Are Key
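A minimal sketch of the first step of annotation by homology: keep only the statistically significant hits from a BLAST search so their annotation can be inspected. It assumes BLAST+ tabular output (`-outfmt 6`, default 12 columns); the example rows, subject names, and the 1e-5 E-value cutoff are hypothetical and not taken from the slides.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent identity
    evalue: float
    bitscore: float


def parse_blast_tabular(text):
    """Parse BLAST+ tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(cols[0], cols[1], float(cols[2]),
                        float(cols[10]), float(cols[11])))
    return hits


def significant_hits(hits, max_evalue=1e-5):
    """Keep hits at or below the E-value cutoff, best (lowest) first."""
    keep = [h for h in hits if h.evalue <= max_evalue]
    return sorted(keep, key=lambda h: h.evalue)


# Hypothetical rows, for illustration only.
EXAMPLE = "\n".join([
    "query1\thypothetical_trx\t45.0\t70\t38\t0\t3\t72\t5\t74\t2e-12\t62.0",
    "query1\thypothetical_other\t28.0\t40\t29\t0\t10\t49\t1\t40\t0.5\t25.0",
])

for hit in significant_hits(parse_blast_tabular(EXAMPLE)):
    print(hit.subject, hit.evalue)
```

The names, keywords, and features of the surviving hits are what would then be transferred to the query, ideally after manual review.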
18Sequence Databases
- GenBank
- www.ncbi.nlm.nih.gov/
- EMBL/trEMBL
- www.ebi.ac.uk/trembl/
- DDBJ
- www.nig.ac.jp/
- PIR
- http://pir.georgetown.edu/
- SwissProt
- www.expasy.ch/sprot/
- UniProt
- http://www.pir.uniprot.org/
19Structure Databases
- RCSB-PDB
- http://www.rcsb.org/pdb/
- MSD
- http://www.ebi.ac.uk/msd/index.html
- CATH
- http://www.biochem.ucl.ac.uk/bsm/cath/
- SCOP
- http://scop.mrc-lmb.cam.ac.uk/scop/
20Expression Databases
- SWISS-2DPAGE
- http://ca.expasy.org/ch2d/
- SMD
- http://genome-www5.stanford.edu/MicroArray/SMD/
- ArrayExpress
- http://www.ebi.ac.uk/arrayexpress/
- Gene Expr. Omnibus
- http://www.ncbi.nlm.nih.gov/geo/
21Metabolism Databases
- KEGG
- http://www.genome.ad.jp/kegg/metabolism.html
- Roche/Boehringer
- http://www.expasy.org/cgi-bin/search-biochem-index
- EcoCyc
- www.ecocyc.org/
- MetaCyc
- http://metacyc.org/
22Interaction Databases
- BIND
- http://www.blueprint.org/bind/bind.php
- DIP
- http://dip.doe-mbi.ucla.edu/
- MINT
- http://mint.bio.uniroma2.it/mint/
- IntAct
- http://www.ebi.ac.uk/intact/index.html
23Bibliographic Databases
- PubMed Medline
- http://www.ncbi.nlm.nih.gov/PubMed/
- Science Citation Index
- http://isi4.isiknowledge.com/portal.cgi
- Your Local eLibrary
- www.XXXX.ca
- Current Contents
- http://www.isinet.com/isi/
24Annotation by Homology: An Example
- 76 residue protein from Methanobacterium
thermoautotrophicum (newly sequenced)
- What does it do?
- MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA
LPGLAVDGELKIMGRVASKEEIKKILS
25PSI-BLAST
Select Database
26PSI-BLAST
27PSI-BLAST
28PSI-BLAST
29Conclusions
- Protein is a thioredoxin or glutaredoxin
(function, family)
- Protein has thioredoxin fold (2° and 3D
structure)
- Active site is from residues 11-14 (active site
location)
- Protein is soluble, cytoplasmic (cellular
location)
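The active-site location can be cross-checked directly on the example sequence. The sketch below scans for a CXXC motif (two cysteines separated by any two residues), a common redox-active pattern in thioredoxin-like proteins. Treating CXXC as the active-site signature is an assumption of this sketch; the slides themselves derive the location from PSI-BLAST matches.

```python
import re

# Example sequence from the "Annotation by Homology: An Example" slide.
SEQ = ("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"
       "LPGLAVDGELKIMGRVASKEEIKKILS")


def find_cxxc(seq):
    """Return (start, end, motif) tuples using 1-based inclusive positions.

    A lookahead is used so that overlapping motifs would also be reported.
    """
    return [(m.start() + 1, m.start() + 4, m.group(1))
            for m in re.finditer(r"(?=(C..C))", seq)]


print(find_cxxc(SEQ))  # [(11, 14, 'CANC')]
```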
30Annotation Methods
- Annotation by homology (BLAST)
- requires a large, well annotated database of
protein sequences
- Annotation by sequence composition
- simple statistical/mathematical methods
- Annotation by sequence features, profiles or
motifs
- requires sophisticated sequence analysis tools
31Annotation by Composition
- Molecular Weight
- Isoelectric Point
- UV Absorptivity
- Hydrophobicity
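As an illustration of annotation by composition, molecular weight can be estimated by summing residue masses and adding one water for the free termini. The table of average residue masses below is standard reference data, not something given on the slides, and the function name is hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the N-terminal H and C-terminal OH


def molecular_weight(seq):
    """Average molecular weight (Da) of an unmodified linear peptide."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


print(round(molecular_weight("GG"), 2))  # glycylglycine
```

Non-standard residue codes raise a `KeyError`, which is deliberate here: silently skipping them would understate the mass.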
32Where To Go
33Isoelectric Point
- The pH at which a protein has a net charge = 0
- $Q = \sum_i N_i / (1 + 10^{\mathrm{pH} - \mathrm{p}K_i})$
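A minimal sketch of how a pI can be computed numerically: evaluate the net charge as a function of pH and bisect for the pH where it crosses zero. The slide's formula is applied here with the sign convention made explicit (basic groups contribute positive charge, acidic groups negative). The pK values are illustrative assumptions, not values from the slides; different programs use different tables and give slightly different pI values.

```python
# Assumed pK values for ionizable groups (illustrative only).
POSITIVE_PK = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    seq = seq.upper()
    pos_counts = {"Nterm": 1, **{aa: seq.count(aa) for aa in "KRH"}}
    neg_counts = {"Cterm": 1, **{aa: seq.count(aa) for aa in "DECY"}}
    # Basic groups: fraction protonated carries +1.
    charge = sum(n / (1 + 10 ** (ph - POSITIVE_PK[g]))
                 for g, n in pos_counts.items())
    # Acidic groups: fraction deprotonated carries -1.
    charge -= sum(n / (1 + 10 ** (NEGATIVE_PK[g] - ph))
                  for g, n in neg_counts.items())
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid   # still positive: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2


print(round(isoelectric_point("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"), 2))
```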
34UV Absorptivity
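- $\mathrm{OD}_{280} = (5690 \times \#W + 1280 \times \#Y) / \mathrm{MW} \times \mathrm{Conc.}$
- $\mathrm{Conc.} = \mathrm{OD}_{280} \times \mathrm{MW} / (5690 \times \#W + 1280 \times \#Y)$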
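The two relations above can be sketched in code as follows. The sketch assumes a 1 cm path length and a concentration in mg/mL (equivalently g/L), so that the 5690 and 1280 coefficients act as per-residue molar extinction coefficients for Trp and Tyr; the function names are hypothetical.

```python
def molar_extinction_280(seq):
    """Estimated molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    seq = seq.upper()
    return 5690 * seq.count("W") + 1280 * seq.count("Y")


def concentration_mg_per_ml(od280, seq, mw):
    """Concentration from OD280, assuming a 1 cm path length.

    od280: measured absorbance at 280 nm
    mw:    molecular weight in daltons
    """
    eps = molar_extinction_280(seq)
    if eps == 0:
        raise ValueError("no Trp or Tyr: OD280 cannot be used for this protein")
    return od280 * mw / eps


# Hypothetical example: one Trp, two Tyr, 10 kDa protein.
print(concentration_mg_per_ml(0.825, "WYY", 10000.0))
```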
35Hydrophobicity
- Indicates Solubility
- Indicates Stability
- Indicates Location (membrane or cytoplasm)
- Indicates Globularity or tendency to form
spherical structure
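One common way to put a number on hydrophobicity is the mean hydropathy of the sequence (often called GRAVY), together with a sliding-window profile in which long hydrophobic stretches hint at membrane-spanning segments. The sketch below uses the Kyte-Doolittle scale; the slides do not say which scale is intended, so that choice, the window size, and the function names are assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathy: mean scale value over all residues."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_profile(seq, window=9):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


print(round(gravy("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"), 3))
```

A negative average is broadly consistent with a soluble protein, but the value is only a rough indicator and should not be read as a prediction on its own.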
36Annotation Methods
- Annotation by homology (BLAST)
- requires a large, well annotated database of
protein sequences
- Annotation by sequence composition
- simple statistical/mathematical methods
- Annotation by sequence features, profiles or
motifs
- requires sophisticated sequence analysis tools
37Where To Go
38Sequence Feature Databases
- PROSITE - http://www.expasy.ch/
- BLOCKS - http://blocks.fhcrc.org/
- DOMO - http://www.infobiogen.fr/services/domo/
- PFAM - http://pfam.wustl.edu
- PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS
- SEQSITE - PepTool
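Motif databases such as PROSITE store short patterns that can be scanned against a sequence. A minimal sketch of that idea: translate a simple PROSITE-style pattern into a regular expression and report where it matches. It handles only the basic syntax (`x`, `[..]`, `{..}`, `(n)` and `(n,m)` repeats, `<` and `>` anchors), and the function names are hypothetical. The N-glycosylation pattern `N-{P}-[ST]-{P}` is used as the example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:                                # repetition such as x(2) or x(2,4)
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"
        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                   # single residue or [..] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern, seq):
    """1-based start positions of all (possibly overlapping) matches."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [m.start() + 1 for m in re.finditer(regex, seq.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "AANGSAA"))
```

Short patterns like this match by chance fairly often, so a hit is a candidate site rather than a confirmed one.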
39What Can Be Predicted?
- O-Glycosylation Sites
- Phosphorylation Sites
- Protease Cut Sites
- Nuclear Targeting Sites
- Mitochondrial Targ Sites
- Chloroplast Targ Sites
- Signal Sequences
- Signal Sequence Cleav.
- Peroxisome Targ Sites
- ER Targeting Sites
- Transmembrane Sites
- Tyrosine Sulfation Sites
- GPInositol Anchor Sites
- PEST sites
- Coiled-Coil Sites
- T-Cell/MHC Epitopes
- Protein Lifetime
- A whole lot more.
40Cutting Edge Sequence Feature Servers
- Membrane Helix Prediction
- http://www.cbs.dtu.dk/services/TMHMM-2.0/
- T-Cell Epitope Prediction
- http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
- O-Glycosylation Prediction
- http://www.cbs.dtu.dk/services/NetOGlyc/
- Phosphorylation Prediction
- http://www.cbs.dtu.dk/services/NetPhos/
- Protein Localization Prediction
- http://psort.nibb.ac.jp/
41Subcellular Localization
42Subcellular Localization
http://www.cs.ualberta.ca/bioinfo/PA/Sub/
43Proteome Analyst (SubCell)
44Secondary (2°) Structure Prediction
- PredictProtein-PHD (72%)
- http://cubic.bioc.columbia.edu/predictprotein/
- Jpred (73-75%)
- http://www.compbio.dundee.ac.uk/www-jpred/submit.html
- SAM-T02 (75%)
- http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
- PSIpred (77%)
- http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
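The figures in parentheses beside each server are presumably per-residue prediction accuracies; the slides do not define the measure. Assuming the usual three-state score (Q3: the percentage of residues whose helix/strand/coil state is predicted correctly), it can be computed as sketched below. The function name and the H/E/C strings are hypothetical.

```python
def q3_accuracy(predicted, observed):
    """Percentage of positions where predicted and observed states agree.

    Both arguments are equal-length strings over H (helix), E (strand),
    and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty sequences have no defined accuracy")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical prediction versus hypothetical reference assignment.
print(q3_accuracy("HHHHCCEEEE", "HHHCCCEEEC"))
```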
45Putting It All Together
Seq Motifs
Composition
Homology
46Putting It All Together
- PEDANT
- http://pedant.gsf.de/
- GeneQuiz
- http://jura.ebi.ac.uk:8765/ext-genequiz/
- Magpie
- http://magpie.ucalgary.ca/
- Proteome Analyst
- http://www.cs.ualberta.ca/bioinfo/PA/
48Programs Used By Pedant
- HMMER
- PSORT
- PREDATOR
- COILS
- FGENESH
- pI
- PROSEARCH
- TargetP
- SAPS
- NCBI-BLAST
- SEG
- InterProScan
- SignalP
- TMHMM
- tRNAscan-SE
- GENSCAN
49Databases Used By Pedant
- EMBL
- PIR-PSD
- SWISS-PROT
- Functional Cat
- PROSITE
- TrEMBL
- Blocks
- PDB
- SCOP
- COGs
- Pfam
- STRIDE
51http://jura.ebi.ac.uk:8765/gqsrv/submit
52GeneQuiz Functions
- Amino acid biosynthesis
- Biosynthesis of cofactors, prosthetic
groups, carriers
- Cell envelope
- Cellular processes
- Central intermediary metabolism
- Energy metabolism
- Fatty acid and phospholipid metabolism
- Other categories
- Purines, pyrimidines, nucleosides, and
nucleotides
- Regulatory functions
- Replication
- Transcription
- Translation
- Transport and binding proteins
- Unknown
57Home Page
58Proteome Analyst
- Uses PSI-BLAST, PSI-PRED and motif analysis tools
- Extracts keyword information from homologues and
uses Naïve Bayes classifiers to infer function
- Combines sequence motif and sequence profile
information to complete functional classification
- Supports custom classifier/ontology
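To make the keyword-plus-Naïve-Bayes idea concrete, here is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier with Laplace smoothing, trained on keyword lists. It is not Proteome Analyst's implementation; the class name, keywords, and functional labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class KeywordNaiveBayes:
    """Multinomial Naive Bayes over keyword lists (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_counts = Counter()           # training examples per class
        self.word_counts = defaultdict(Counter)  # keyword counts per class
        self.vocab = set()

    def fit(self, examples):
        for keywords, label in examples:
            self.class_counts[label] += 1
            self.word_counts[label].update(keywords)
            self.vocab.update(keywords)
        return self

    def predict(self, keywords):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)         # log prior
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for word in keywords:
                if word in self.vocab:          # ignore unseen keywords
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: keywords pulled from annotated homologues.
TRAINING = [
    (["thioredoxin", "redox", "disulfide"], "energy metabolism"),
    (["oxidoreductase", "redox", "electron"], "energy metabolism"),
    (["permease", "membrane", "transporter"], "transport"),
    (["abc", "transporter", "binding"], "transport"),
]

model = KeywordNaiveBayes().fit(TRAINING)
print(model.predict(["redox", "disulfide"]))
```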
59BacMap
- Picking up where we left off with the CCDB
(Google bacmap)
- Idea is to generate a visual atlas of all (not
just Escherichia coli) bacterial chromosomes and
plasmids but with links to extensive genome
annotation
- Attempt to re-use annotation and graphing tools
originally developed for the CCDB
60BacMap
http://wishart.biology.ualberta.ca/BacMap/
61BacMap
62Text Search Tools
63Sequence Search Tools
64Bacterial Biography Card
65Genome Statistics
66Proteome Statistics
67BacMap
- Each genome has a short description of the
organism and sequence data
- Supports zoomable, hyperlinked, clickable map
views of the genome
- Supports text search of gene names, protein names
and synonyms
- Supports BLAST search and supplies genome-wide
stats
- Currently going through a major update
Stothard P, et al. BacMap: an interactive picture
atlas of annotated bacterial genomes. Nucleic
Acids Res. 2005 Jan 1;33(Database issue):D317-20.
68What if Your Organism or Genome isn't in BacMap?
http://wishart.biology.ualberta.ca/basys/
69BASys
- Bacterial Annotation System
- A publicly available web server that performs
automated annotation of bacterial genomes given
only the gene sequence of a chromosome or plasmid
- Takes about 24 hrs for an average genome (4
megabases)
- Output includes images and annotation text (about
70 fields for each gene)
70Typical BASys Result
71Conclusion
- Genome annotation is the same as proteome
annotation: it is required after any gene sequencing
and gene identification effort
- Can be done either manually or automatically
- High-throughput, automated pipelines are needed
to keep up with the volume of genome sequence
data
- Area of active research and development, with
about ½ of all bioinformaticians working on some
aspect of this process