Title: Protein Feature Identification
 1Protein Feature Identification
- Microbiology 343 
- David Wishart 
- david.wishart_at_ualberta.ca
2Objectives
- To show that almost everything you do in the lab, or need to do to work with a protein, can be done on a computer
- To learn methods and algorithms for predicting composition and sequence features
- To learn when to use these tools
3Proteins
- Exhibit far more sequence and chemical 
 complexity than DNA or RNA
- Properties and structure are defined by the 
 sequence and side chains of their constituent
 amino acids
- The engines of life 
- >95% of all drug targets are proteins
- Favorite topic of post-genomic era
4The Post-genomic Challenge 
- How to rapidly identify a protein? 
- How to rapidly purify a protein? 
- How to identify post-translational modifications?
- How to find information about function? 
- How to find information about activity? 
- How to find information about location? 
- How to find information about structure?
Answer: Look at Protein Features
 5Protein Features
ACEDFHIKNMF SDQWWIPANMC ASDFDPQWERE LIQNMDKQERT QA
TRPQDS...
Sequence View Structure View 
 6Different Types of Features
- Composition Features 
- Mass, pI, Absorptivity, Rg, Volume 
- Sequence Features 
- Active sites, Binding Sites, Targeting, Location, 
 Property Profiles, 2° structure
- Structure Features 
- Supersecondary Structure, Global Fold, ASA, 
 Volume
7Where To Go
http://www.expasy.org/tools/
 8Compositional Features
- Molecular Weight 
- Amino Acid Frequency 
- Isoelectric Point 
- UV Absorptivity 
- Solubility, Size, Shape 
- Radius of Gyration 
- Free Energy of Folding
9Molecular Weight 
 10Molecular Weight
- Useful for SDS PAGE and 2D gel analysis 
- Useful for deciding on SEC matrix 
- Useful for deciding on MWC (molecular weight cutoff) for dialysis
- Essential in synthetic peptide analysis 
- Essential in peptide sequencing (classical or 
 mass-spectrometry based)
- Essential in proteomics and high throughput 
 protein characterization
11Molecular Weight
- Crude MW calculation: $MW = 110 \times N_{res}$
- Exact MW calculation: $MW = \sum_i AA_i \times MW_i$
- Remember to add 1 water (18.01 amu) after adding all residues
- Note isotopic weights
- Corrections for CHO, PO4, Acetyl, CONH2
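A minimal Python sketch of the crude and exact calculations above. The average residue masses are standard textbook values supplied here as an assumption (the slides do not list them), and the function names are illustrative only. Corrections for modifications such as CHO or PO4 are not handled.

```python
# Average residue masses in Da (free amino acid minus one water).
# Assumed textbook values; the slides do not give this table.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.015  # one water for the free N- and C-termini


def crude_mw(seq):
    """Crude estimate: 110 Da per residue."""
    return 110.0 * len(seq)


def exact_mw(seq):
    """Sum of average residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


if __name__ == "__main__":
    peptide = "ACDEFGHIK"  # hypothetical example sequence
    print(crude_mw(peptide), round(exact_mw(peptide), 2))
```

The crude estimate is usually good enough for picking a gel percentage; the exact sum is the one to compare against mass spectrometry data.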
12Amino Acid versus Residue
[Figure: chemical structures of a free amino acid (H2N-CHR-COOH) and of a residue (-NH-CHR-CO-)]
 13Protein Identification via MW
- MOWSE 
- http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
- CombSearch
- http://ca.expasy.org/tools/CombSearch/
- Mascot
- http://www.matrixscience.com/search_form_select.html
- AACompSim/AACompIdent
- http://ca.expasy.org/tools/
14Molecular Weight & Proteomics
2-D Gel QTOF Mass Spectrometry 
 15Amino Acid Frequency
- Deviations greater than 2X the average frequency indicate something of interest
- High K or R indicates a possible nucleoprotein
- High C indicates a stable but hard-to-fold protein
- High G, P, Q, or N suggests a lack of stable structure
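The 2X rule above can be sketched in Python as follows. The background frequencies are an assumption: the default is a flat 5% per residue, which is only a placeholder and not a real database average, so a measured background table should be passed in for real use. This sketch flags over-represented residues only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(seq):
    """Fraction of each of the 20 standard residues in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def unusual_residues(seq, background=None, factor=2.0):
    """Residues whose frequency exceeds factor x the background frequency."""
    if background is None:
        # Placeholder assumption: flat 5% per residue, NOT a database average.
        background = {aa: 1 / 20 for aa in AMINO_ACIDS}
    freqs = aa_frequencies(seq)
    return {aa: f for aa, f in freqs.items() if f > factor * background[aa]}


if __name__ == "__main__":
    # Hypothetical lysine-rich sequence
    print(unusual_residues("KKKKKKKKKKACDEFGHILM"))
```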
16Isoelectric Point (pI)
- The pH at which a protein has a net charge = 0
- $Q = \sum_i N_i / (1 + 10^{pH - pK_i})$
Transcendental equation
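Because the charge equation cannot be solved for pH in closed form, the pI is usually found numerically. The sketch below uses bisection. The pKa values are illustrative assumptions (published tables differ by several tenths of a pH unit), and treating acidic groups with an explicit negative term is an assumption about the sign convention, which the slide does not spell out.

```python
# Illustrative pKa values (assumed, not taken from the slides).
POSITIVE_PK = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PK = {"Cterm": 3.0, "D": 3.9, "E": 4.2, "C": 8.3, "Y": 10.1}


def net_charge(seq, ph):
    """Net charge Q of the sequence at the given pH."""
    counts = {"Nterm": 1, "Cterm": 1}
    for aa in seq.upper():
        counts[aa] = counts.get(aa, 0) + 1
    # Basic groups are +1 when protonated: N / (1 + 10^(pH - pK))
    q = sum(counts.get(g, 0) / (1 + 10 ** (ph - pk))
            for g, pk in POSITIVE_PK.items())
    # Acidic groups are -1 when deprotonated: N / (1 + 10^(pK - pH))
    q -= sum(counts.get(g, 0) / (1 + 10 ** (pk - ph))
             for g, pk in NEGATIVE_PK.items())
    return q


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection on pH; net charge decreases monotonically as pH rises."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(round(isoelectric_point("ACDKRHEY"), 2))  # hypothetical sequence
```

Bisection is used here because the net charge is monotonic in pH, so a single zero crossing is guaranteed between pH 0 and 14.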
 17Isoelectric Point
- Calculation is only approximate (+/- 1 pH)
- Does not include 3° structure interactions
- Can be used in developing purification protocols 
 via ion exchange chromatography
- Can be used in estimating spot location for 
 isoelectric focusing gels
- Can be used to decide on best pH to store or 
 analyze protein
18UV Spectroscopy 
 19UV Absorptivity
- UV (Ultraviolet light) has a wavelength of 200 to 
 400 nm
- Most proteins and peptides (and all nucleic 
 acids) absorb UV light quite strongly
- UV spectroscopy is the most common form of 
 spectroscopy performed today
- UV spectra can be used to identify or classify 
 some proteins or protein classes
20UV Absorptivity
- $OD_{280} = (5690 \times W + 1280 \times Y)/MW \times Conc.$
- $Conc. = OD_{280} \times MW/(5690 \times W + 1280 \times Y)$
[Figure: side chains of tyrosine (OH) and tryptophan (N)]
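The two OD280 relations can be sketched in Python as below, where W and Y are the counts of Trp and Tyr in the sequence. The units are an assumption: the coefficients are taken to be in M^-1 cm^-1, MW in g/mol and concentration in mg/mL with a 1 cm path length. The function names are illustrative.

```python
def molar_absorptivity_280(seq, e_trp=5690, e_tyr=1280):
    """Molar absorptivity at 280 nm from Trp and Tyr counts."""
    seq = seq.upper()
    return e_trp * seq.count("W") + e_tyr * seq.count("Y")


def concentration_from_od280(od280, seq, mw, path_cm=1.0):
    """Concentration in mg/mL (assumed units) from a measured OD280."""
    eps = molar_absorptivity_280(seq)
    if eps == 0:
        raise ValueError("no Trp or Tyr: OD280 cannot be used")
    return od280 * mw / (eps * path_cm)


if __name__ == "__main__":
    seq = "MKWVTFYSLLY"  # hypothetical sequence: 1 Trp, 2 Tyr
    print(molar_absorptivity_280(seq))
    print(concentration_from_od280(0.825, seq, mw=10000))
```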
 21Hydrophobicity
- Indicates Solubility 
- Indicates Stability 
- Indicates Location (membrane or cytoplasm) 
- Indicates Globularity or tendency to form 
 spherical structure
22Hydrophobicity
- Average Hydrophobicity: $AH = \sum_i AA_i \times H_i$
- Hydrophobic Ratio: $RH = \sum H(-) / \sum H(+)$
- Hydrophobic % Ratio: RHP = %philic/%phobic
- Linear Charge Density: LIND = (K + R + D + E + H + 2)/
- Solubility: $SOL = RH + LIND - 0.05\,AH$
- Average AH = -2.5 ± 2.5; Insol > 0.1; Unstrc < -6
- Average RH = 1.2 ± 0.4; Insol < 0.8; Unstrc > 1.9
- Average RHP = 0.9 ± 0.2; Insol < 0.7; Unstrc > 1.4
- Average LIND = 0.25; Insol < 0.2; Unstrc > 0.4
- Average SOL = 1.6 ± 0.5; Insol < 1.1; Unstrc > 2.5
23Different Types of Features
- Composition Features 
- Mass, pI, Absorptivity, Hydrophobicity 
- Sequence Features 
- Active sites, Binding Sites, Targeting, Location, 
 Property Profiles, 2° structure
- Structure Features 
- Supersecondary Structure, Global Fold, ASA, 
 Volume
24Sequence Features
 AHGQSDFILDEADGMMKSTVPN
 HGFDSAAVLDEADHILQWERTY
 GGGNDEYIVDEADSVIASDFGH
 [LIVM][LIVM]DEAD[LIVM][LIVM] (EIF 4A ATP-DEPENDENT HELICASE)
 25Sites that Support Pattern Queries
- OWL Database 
- http://umber.sbs.man.ac.uk/dbbrowser/OWL/
- PIR Website
- http://pir.georgetown.edu/pirwww/search/patmatch.html
- SCANPROSITE at EXPASY
- http://ca.expasy.org/tools/scanprosite/
- PattinProt
- http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_pattinprot.html/
26Regular Expressions
- C[ACG]T - Matches CAT, CCT and CGT only
- C.T - Matches CAT, CaT, C1T, CXT, not CT
- CA?T - Matches CT or CAT only
- C+T - Matches CT, CCT, CCCT, CCCCT
- C(HE)?A[TP] - Matches CHEAT, CAT, CHEAP, CAP
- S[A-I,L-Q,T-Z]?LK[A-I,L-Q,T-Z]?A - Matches SLKA
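The patterns on this slide can be checked with Python's `re` module. This sketch writes each character class with explicit square brackets and uses `C+T` for "one or more C", which is an assumption about the intended notation. `re.fullmatch` is used so that the whole string must match, not just a part of it.

```python
import re

# Pattern -> strings the slide says it should match
examples = {
    r"C[ACG]T": ["CAT", "CCT", "CGT"],
    r"C.T": ["CAT", "CaT", "C1T", "CXT"],
    r"CA?T": ["CT", "CAT"],
    r"C+T": ["CT", "CCT", "CCCT", "CCCCT"],
    r"C(HE)?A[TP]": ["CHEAT", "CAT", "CHEAP", "CAP"],
}


def all_match(pattern, strings):
    """True if every string is matched in full by the pattern."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)


for pattern, strings in examples.items():
    print(pattern, all_match(pattern, strings))

# Counter-examples: '.' needs exactly one character, and T is not in [ACG]
print(re.fullmatch(r"C.T", "CT"))
print(re.fullmatch(r"C[ACG]T", "CTT"))
```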
27PROSITE Pattern Expressions
- C-[ACG]-T - Matches CAT, CCT and CGT only
- C-X-T - Matches CAT, CCT, CDT, CET, etc.
- C-{A}-T - Matches every CXT except CAT
- C(1,3)-T - Matches CT, CCT, CCCT
- C-A(2)-[TP] - Matches CAAT, CAAP
- [LIV]-[VIC]-X(2)-G-[DENQ]-X-[LIVFM](2)-G
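PROSITE-style patterns map almost directly onto regular expressions. The converter below is a minimal sketch, not the official PROSITE parser: it assumes the usual conventions that `[..]` lists allowed residues, `{..}` lists excluded residues, `x` is any residue, and `(n)` or `(n,m)` is a repeat count. The function name `prosite_to_regex` is illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        # {A} (anything except A) becomes [^A]
        element = element.replace("{", "[^").replace("}", "]")
        # (2) or (1,3) repeat counts become {2} or {1,3}
        element = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", element)
        # x or X (any residue) becomes '.'
        element = re.sub(r"[xX]", ".", element)
        # < and > anchor the pattern to the sequence termini
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[LIV]-[VIC]-x(2)-G-[DENQ]-x-[LIVFM](2)-G")
    print(regex)
    # Hypothetical sequence containing one match
    print(re.search(regex, "AAALVAAGDALLGAAA"))
```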
 28Sequence Feature Databases
- PROSITE - http://ca.expasy.org/prosite/
- BLOCKS - http://www.blocks.fhcrc.org/
- DOMO - http://www.infobiogen.fr/services/domo/
- PFAM - http://pfam.wustl.edu
- PRINTS - http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
- SEQSITE - PepTool
29Phosphorylation Sites
[Figure: structures of phosphotyrosine (pY), phosphothreonine (pT) and phosphoserine (pS), each carrying a PO4 group]
 30Phosphorylation Sites 
 31Signaling Sites 
 32Protease Cut Sites 
 33Binding Sites 
 34Family Signature Sequences 
 35Enzyme Active Sites 
 36Better Methods for Sequence Feature ID
- Sequence Profiles/Scoring Matrices 
- Neural Networks 
- Hidden Markov Models 
- Bayesian Belief Nets 
- Reference Point Logistics
37What Can Be Predicted?
- O-Glycosylation Sites 
- Phosphorylation Sites 
- Protease Cut Sites 
- Nuclear Targeting Sites 
- Mitochondrial Targ Sites 
- Chloroplast Targ Sites 
- Signal Sequences 
- Signal Sequence Cleav. 
- Peroxisome Targ Sites 
- ER Targeting Sites 
- Transmembrane Sites 
- Tyrosine Sulfation Sites 
- GPInositol Anchor Sites 
- PEST sites 
- Coiled-Coil Sites
- T-Cell/MHC Epitopes 
- Protein Lifetime 
- A whole lot more.
38Cutting Edge Sequence Feature Servers
- Membrane Helix Prediction 
- http://www.cbs.dtu.dk/services/TMHMM-2.0/
- T-Cell Epitope Prediction
- http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
- O-Glycosylation Prediction
- http://www.cbs.dtu.dk/services/NetOGlyc/
- Phosphorylation Prediction
- http://www.cbs.dtu.dk/services/NetPhos/
- Protein Localization Prediction
- http://psort.nibb.ac.jp/
39Subcellular Localization
http://www.cs.ualberta.ca/bioinfo/PA/Sub/
 40Profiles & Motifs are Useful
- Helped identify active site of HIV protease 
- Helped identify SH2/SH3 class of STPs 
- Helped identify important GTP oncoproteins 
- Helped identify hidden leucine zipper in HGA 
- Used to scan for lectin binding domains 
- Regularly used to predict T-cell epitopes
41Amino Acid Property Profiles 
 42Amino Acid Property Profiles
- Intent is to predict a protein's physical properties directly from sequence, as opposed to composition or wet chemistry
- Offers a more detailed, graphical view of 
 sequence-specific properties than compositional
 analysis (more powerful?)
- Underlying assumption is amino acid properties 
 are additive
43Common Property Profiles
- Hydrophobicity (Watch Scales!) 
- Helical Wheel (Not a True Profile) 
- Hydrophobic Moments (Helix & Beta sheet)
- Flexibility (Thermal B Factors) 
- Surface Accessibility (ASA) 
- Antigenicity (B-cell epitopes/T-cell epitopes) 
44Hydrophobicity Profile
- Plotted using $\langle H \rangle_i = \sum_{n=i-k}^{i+k} H_n / (2k + 1)$
- Shows location of membrane spanning regions, 
 epitopes, surface exposed AAs, etc.
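The sliding-window average above can be sketched in Python as follows. The slides do not say which hydrophobicity scale is used, so the Kyte-Doolittle hydropathy scale is an assumption here (on this scale, positive values are hydrophobic). The window half-width `k` is a free parameter; larger windows, around 19 residues in total, are commonly used when looking for membrane-spanning helices.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(seq, k=3, scale=KYTE_DOOLITTLE):
    """Return (center index, window mean) for each full window of 2k + 1."""
    values = [scale[aa] for aa in seq.upper()]
    width = 2 * k + 1
    return [
        (i, sum(values[i - k:i + k + 1]) / width)
        for i in range(k, len(values) - k)
    ]


if __name__ == "__main__":
    # Hypothetical sequence: a hydrophobic stretch flanked by charged residues
    for index, mean_h in hydrophobicity_profile("KKDELLIVVALLIAGKRDE", k=3):
        print(index, round(mean_h, 2))
```

Positions within `k` residues of either end are skipped because a full window does not fit there; padding or shrinking the window at the ends is an alternative design.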
45Flexibility
- B factors from X-ray crystallography 
- Potentially identifies antigenic and active sites 
 from sequence data alone
46Membrane Spanning Regions 
 47Predicting via Hydrophobicity
Bacteriorhodopsin, OmpA
 48Predicting via Hydrophobicity 
 49Predicting via Neural Nets 
- PHDhtm http://cubic.bioc.columbia.edu/predictprotein/submit_adv.html
- TMAP http://www.mbb.ki.se/tmap/index.html
- TMPred http://www.ch.embnet.org/software/TMPRED_form.html
ACDEGF... 
 50Secondary Structure 
 51Secondary Structure Prediction 
 52Secondary Structure Prediction
- Statistical (Chou-Fasman, GOR) 
- Homology or Nearest Neighbor (Levin) 
- Physico-Chemical (Lim, Eisenberg) 
- Pattern Matching (Cohen, Rooman) 
- Neural Nets (Qian & Sejnowski, Karplus)
- Evolutionary Methods (Barton, Niemann) 
- Combined Approaches (Rost, Levin, Argos) 
53Chou-Fasman Statistics 
 54Prediction Performance 
 55Best of the Best
- PredictProtein-PHD (72%)
- http://cubic.bioc.columbia.edu/predictprotein
- Jpred (73-75%)
- http://www.compbio.dundee.ac.uk/www-jpred/
- PSIpred (77%)
- http://bioinf.cs.ucl.ac.uk/psipred/
- Proteus (88%)
- http://129.128.185.184:8080/proteus/
56Sample Exam Questions
- Here is the sequence for protein X, calculate its 
 molar absorptivity
- Here is the sequence for protein Y, try to locate the likely membrane spanning regions & explain your reasoning
- Here is the sequence for protein Z, show the 
 tryptic cleavage points
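For the tryptic cleavage question, a short Python sketch may help with checking answers. It assumes the common rule of thumb that trypsin cuts on the C-terminal side of K or R unless the next residue is P; real digests also show missed cleavages, which this sketch ignores. The function names are illustrative.

```python
import re


def tryptic_cleavage_points(seq):
    """1-based positions of the residues after which trypsin is expected to cut."""
    seq = seq.upper()
    return [
        m.end()
        for m in re.finditer(r"[KR](?!P)", seq)
        if m.end() < len(seq)  # a K/R at the C-terminus has no bond to cut
    ]


def tryptic_peptides(seq):
    """Split the sequence at the predicted cleavage points."""
    seq = seq.upper()
    cuts = [0] + tryptic_cleavage_points(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    protein = "AKRPGRA"  # hypothetical sequence
    print(tryptic_cleavage_points(protein))
    print(tryptic_peptides(protein))
```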