Title: Protein Expression, Structural Proteomics
1Protein Expression, Structural Proteomics
Bioinformatics
- David Wishart
- University of Alberta
- Edmonton, AB
- david.wishart_at_ualberta.ca
2Expression Questions
- Which host cell system?
- Which expression vector?
- Which cloning/expression protocols?
- Is it membrane or water soluble?
- Is it single domain or multi-domain?
- How soluble and how stable?
- Where will this protein be found?
- How to purify and how to identify?
3Host Cell System?
- Escherichia coli
- Other bacteria
- Pichia pastoris
- Other yeast
- Baculovirus
- Animal cell culture
- Plants
- Sheep/cows/humans
- Cell free
Polyhedra
4Host Cell System?
- Choice depends on size and character of protein
- Large proteins (>100 kD)? Choose eukaryote
- Small proteins (<30 kD)? Choose prokaryote
- Glycosylation essential? Choose baculovirus or mammalian cell culture
- Isotopic labelling essential? Choose E. coli
- Post-translational modifications essential? Choose yeast, baculovirus or other eukaryote
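The rules of thumb on this slide can be written down as a small lookup. The sketch below is only an illustration: the function name, the argument names and the order in which the suggestions are collected are assumptions, and the slide gives no rule for proteins between 30 and 100 kD, so none is invented here.

```python
def suggest_hosts(mass_kd, glycosylation=False, isotopic_labelling=False, other_ptm=False):
    """Collect the host suggestions from the slide that apply to one protein.

    mass_kd is the protein mass in kD; the flags mark essential requirements.
    The names and the ordering of the output are illustrative assumptions.
    """
    suggestions = []
    if glycosylation:
        suggestions.append("baculovirus or mammalian cell culture")
    if isotopic_labelling:
        suggestions.append("E. coli")
    if other_ptm:
        suggestions.append("yeast, baculovirus or other eukaryote")
    if mass_kd > 100:
        suggestions.append("eukaryote")
    elif mass_kd < 30:
        suggestions.append("prokaryote")
    # Proteins of 30-100 kD get no size-based suggestion: the slide is silent there.
    return suggestions


print(suggest_hosts(20, isotopic_labelling=True))
```

Conflicting suggestions (for example a large protein that also needs isotopic labelling) are returned side by side rather than resolved, since the slide does not rank the criteria.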
5Host Cell System?
- Try different hosts when optimizing expression (protease negative, strains with enhanced expression of rare tRNAs)
- Expression levels can vary by a factor of 10 or more depending on strain choice
- Example E. coli strains
- MC1061, UT580, GM48, JM101, DH5α, MG1065, NM522, MC4100, TOP10F, BL21(DE3), BL21-CodonPlus (DE3)
6Codon Bias
http://www.kazusa.or.jp/codon/
7Arginine Codon Bias
Organism        Group                       AGA    AGG
E. coli         Eubacteria (rare)           2.7    1.6
M. jannaschii   Archaebacteria (abundant)   27.5   9.9
H. sapiens      Eukaryote (normal)          11.2   11.1
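Before expressing an archaeal or eukaryotic gene in E. coli, it can help to see how often the two rare arginine codons on this slide occur in the coding sequence. The sketch below is a minimal scan; the function name is hypothetical, and the per-thousand figure assumes the slide's numbers are frequencies per thousand codons (the units the Kazusa database reports, but which the slide does not state).

```python
RARE_ARG_CODONS = {"AGA", "AGG"}  # the two arginine codons compared on the slide


def rare_arg_codons(cds):
    """Return (hits, per_thousand) for AGA/AGG in an in-frame coding sequence.

    hits is a list of (codon_index, codon) pairs, 0-based from the first codon.
    Trailing bases that do not fill a whole codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    hits = [(i, c) for i, c in enumerate(codons) if c in RARE_ARG_CODONS]
    per_thousand = 1000.0 * len(hits) / len(codons) if codons else 0.0
    return hits, per_thousand


hits, freq = rare_arg_codons("ATGAGAAGGCGTTAA")
print(hits, freq)
```

A high count, or several rare codons in a row, is the kind of case where a strain with extra rare tRNAs (such as the CodonPlus strains mentioned earlier) may be worth trying.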
8Host Cell System?
- American Type Culture Collection
- http://www.atcc.org
- Clontech Cell Lines
- http://www.clontech.com
- Stratagene Cells (BL21)
- http://stratagene.com
- Invitrogen Cell Lines (Pichia)
- http://www.invitrogen.com
9Fermentor or Shake Flask?
10Media Optimization
- Still using L-broth? Try using T-broth
- Tryptone - 12 g, Yeast Extract - 24 g, glycerol - 4 ml, KH2PO4 - 2.3 g, K2HPO4 - 12.5 g
- Extra Spicy Media
- More ATP: 10 ml/L glycerol + 10 g/L glucose
- More AA: add 10 g casamino acids + 10 mg L-Trp
- Add more media (30%) when you induce
- Add more antibiotic when you induce
- prevents overgrowth by cells that have lost the plasmid
11Expression Questions
- Which host cell system?
- Which expression vector?
- Which cloning/expression protocols?
- Is it membrane or water soluble?
- Is it single domain or multi-domain?
- How soluble and how stable?
- Where will this protein be found?
- How to purify and how to identify?
12Which Vector?
- Must be compatible with host cell system (prokaryotic vectors for prokaryotic cells, eukaryotic vectors for eukaryotic cells)
- Needs a good combination of
- strong promoters
- ribosome binding sites
- termination sequences
- affinity tag or solubilization sequences
- multi-enzyme restriction site
13Which Vector?
- Promoters
- arabinose systems (pBAD), phage T7 (pET), Trc/Tac promoters, phage lambda PL or PR
- Tags
- His6 for metal affinity chromatography (Ni)
- FLAG epitope tag DYKDDDDK
- CBP-calmodulin binding peptide (26 residues)
- E-coil/K-coil tags (poly E35 or poly K35)
- c-myc epitope tag EQKLISEEDL
- Glutathione-S-transferase (GST) tags
- Cellulose binding domain (CBD) tags
14Which Vector?
- VectorDB
- http://vectordb.atcg.com
- Invitrogen Vectors
- http://www.invitrogen.com/vectors.html
- Qiagen Vectors
- http://www.qiagen.com/literature/vectors.asp
- Stratagene Vectors
- http://stratagene.com/vectors/vectors.htm
15How to Clone?
Echo Cloning
16How to Clone?
Yeast Cells
17How to Clone?
Mammalian Cells
18Gateway System (Invitrogen)
- No need to design, construct or ID unique restriction sites
- Uses lambda phage site-specific recombination for gene/plasmid integration
- No need for restriction enzyme digestions
- No need for gel fragment separation and purification
- Ideal for high throughput proteomics efforts
19Gateway System (Invitrogen)
(Diagram labels: PCR product, Entry Vector, Entry Clone, Destination Vector, Desired Clone)
20Gateway System (Invitrogen)
(Diagram labels: Gene; -ve selector (anti-gyrase); attL1, attL2, attR1, attR2; Int, IHF, Xis; Entry Clone; Desired Clone; Dead-end Clone; Kmr; Ampr)
21Gateway Protocol
- Mix and incubate for 60 min at 25 °C
- Add proteinase K and incubate for 10 min at 37 °C
- Transfer to competent E. coli DH5α cells
- Express for 60 min and plate on LB-Amp
Ingredients
- Clonase reaction buffer 4 µL
- Destination Vector 300 ng
- Entry Clone 100 ng
- Clonase Enzyme mix 4 µL
- Total volume 20 µL
22Expression/Cloning -- Which Protocols?
- Molecular Cloning 3rd Edition (Sambrook and Maniatis / Russell)
- http://www.molecularcloning.com
- Molecular Biology Protocols
- http://micro.nwfsc.noaa.gov/protocols/
- Molecular Biology Shortcuts
- http://highveld.com/f/fprotocols.html
- NeeHow Protocols
- http://www.neehow.org/wonderful/protocols
23Expression Questions
- Which host cell system?
- Which expression vector?
- Which cloning/expression protocols?
- Is it membrane or water soluble?
- Is it single domain or multi-domain?
- How soluble and how stable?
- Where will this protein be found?
- How to purify and how to identify?
24Membrane or Water Soluble?
25Membrane or Water Soluble?
- Most protein scientists prefer to work with water soluble proteins or domains
- Membrane proteins are very difficult to clone, express and purify, and special techniques must be used
- Potential problems can be avoided by knowing whether the protein contains one or more membrane spanning helices and where these helices are located (cleaved?)
26Predicting via Hydrophobicity
Bacteriorhodopsin OmpA
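A hydrophobicity plot of the kind shown for these two proteins can be approximated with a sliding-window average over a hydropathy scale. The sketch below uses the Kyte-Doolittle scale; the choice of scale, the 19-residue window and the 1.6 cutoff are assumptions (they follow a commonly quoted Kyte-Doolittle heuristic for membrane-spanning helices), not necessarily what was used to draw the slide.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]  # unknown residues score 0
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


# Toy sequence: a charged stretch, a leucine/valine-rich stretch, another charged stretch.
toy = "DEKRDEKRDE" + "LVLVLVLVLVLVLVLVLVLVLV" + "KRDEKRDEKR"
print(candidate_tm_windows(toy))
```

Beta-barrel membrane proteins such as OmpA tend not to show long hydrophobic stretches, so a helix-oriented scan like this one is expected to miss them.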
27Membrane Helix Prediction
- Neural network and HMM methods now claim >80% accuracy
- PredictProtein (PHDhtm)
- http://cubic.bioc.columbia.edu/predictprotein/
- TMpred
- http://www.ch.embnet.org/software/TMPRED_form.html
- TMHMM
- http://www.cbs.dtu.dk/services/TMHMM-2.0/
28TMPred (Principles)
29TMHMM
30PredictProtein
31Expression Questions
- Which host cell system?
- Which expression vector?
- Which cloning/expression protocols?
- Is it membrane or water soluble?
- Is it single domain or multi-domain?
- How soluble and how stable?
- Where will this protein be found?
- How to purify and how to identify?
32Single Domain or MultiDomain?
33Modular Protein Domains
Example domains: BH, PDZ, FYVE, PH, DED, DEATH, SH3, 14-3-3, WW, FHA, PTB, SH2
34Single Domain or MultiDomain?
- Many eukaryotic proteins are multi-domain
- Size is a good indicator (roughly 1 domain for every 15 kD)
- Small domains behave better (X-ray, NMR)
- Limited proteolysis allows experimental identification of domains prior to structure determination by NMR or X-ray
- digestion followed by HPLC or MS analysis to detect fragments >10 kD
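The "1 domain for every 15 kD" rule of thumb is easy to apply once a mass is known. The sketch below estimates the mass from the sequence length using an average residue mass of about 110 Da; that average and the function names are assumptions added for illustration, and the result is only a rough guide to be checked by limited proteolysis or domain prediction.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # common approximation, not from the slide


def mass_from_length(n_residues):
    """Approximate protein mass in kD from the number of residues."""
    return n_residues * AVERAGE_RESIDUE_MASS_DA / 1000.0


def estimate_domains(mass_kd, kd_per_domain=15.0):
    """Rough domain count from the slide's rule: about one domain per 15 kD."""
    return max(1, round(mass_kd / kd_per_domain))


mass = mass_from_length(300)
print(mass, estimate_domains(mass))
```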
35Domain Prediction
- Domain Prediction (PredictProtein-GLOBE)
- http://cubic.bioc.columbia.edu/predictprotein
- BLAST alignments can be used to detect or predict the presence of domains by sequence homology
- Protein domains can also be predicted using CDD (Conserved Domain Database) at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
38Expression Questions
- Which host cell system?
- Which expression vector?
- Which cloning/expression protocols?
- Is it membrane or water soluble?
- Is it single domain or multi-domain?
- How soluble and how stable?
- Where will this protein be found?
- How to purify and how to identify?
39Predicting Solubility
- Even if a protein is identified as a non-membrane protein, this does not necessarily mean it will be soluble
- Solubility depends on many factors
- size (smaller ones are more soluble)
- hydrophobicity (average and local hydrophobicity)
- 3D structure and ligand interactions
- overall charge, predicted accessibility
- distribution and frequency of amino acids
40Predicting Solubility
- Solvent accessibility prediction
- PredictProtein (PHDacc)
- http://cubic.bioc.columbia.edu/predictprotein/
- Protein property/scale prediction
- EXPASY ProtScale
- http://www.expasy.ch/cgi-bin/protscale.pl
- PepTool
- www.biotools.com
41Accessible Surface Area
Reentrant Surface
Accessible Surface
Solvent Probe
Van der Waals Surface
42Predicted Accessibility
43Buried Surface Area (BASA) and Fractional Burial (FB)
- For an average protein
- ASA (NP) = 0.35 x BASA
- ASA (P) = 0.61 x BASA
- ASA (+/-) = 0.04 x BASA
- BASA can be estimated from a protein's amino acid composition: $\mathrm{BASA} = \sum_i AA_i \times FB_i$
44ProtScale
45ProtScale
46Solubility (PepTool)
- Average Hydrophobicity: $AH = \sum_i AA_i \times H_i$
- Hydrophobic Ratio: $RH = \sum H(-) / \sum H(+)$
- Hydrophobic Ratio: RHP = philic/phobic
- Linear Charge Density: LIND = (K + R + D + E + H + 2)/
- Solubility: $SOL = RH + LIND - 0.05\,AH$
- Average AH = -2.5 +/- 2.5; insoluble > 0.1; unstructured < -6
- Average RH = 1.2 +/- 0.4; insoluble < 0.8; unstructured > 1.9
- Average RHP = 0.9 +/- 0.2; insoluble < 0.7; unstructured > 1.4
- Average LIND = 0.25; insoluble < 0.2; unstructured > 0.4
- Average SOL = 1.6 +/- 0.5; insoluble < 1.1; unstructured > 2.5
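The combined solubility score and its cutoffs can be turned into a small classifier. This sketch reads the slide's solubility line as SOL = RH + LIND - 0.05 x AH, which is an interpretation of the extracted text, and takes RH, LIND and AH as already computed inputs because the slide does not specify the hydrophobicity scale or the denominator of LIND. The function names and the label strings are hypothetical.

```python
def solubility_index(rh, lind, ah):
    """SOL = RH + LIND - 0.05 * AH, as the slide's formula is read here (assumption)."""
    return rh + lind - 0.05 * ah


def classify_sol(sol, insoluble_below=1.1, unstructured_above=2.5):
    """Apply the slide's SOL cutoffs: below 1.1 insoluble, above 2.5 unstructured."""
    if sol < insoluble_below:
        return "likely insoluble"
    if sol > unstructured_above:
        return "likely unstructured"
    return "typical"


# Inputs chosen near the slide's quoted averages for RH and LIND, with AH = -2.5.
sol = solubility_index(rh=1.2, lind=0.25, ah=-2.5)
print(round(sol, 3), classify_sol(sol))
```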
47Structural Proteomics and Solubility Prediction
- Global efforts have led to the cloning and attempted expression of more than 5000 water soluble proteins
- Data contained in databases such as TargetDB allow correlations to be developed between sequence and expression levels and solubility
- Excellent opportunity to use data mining to find rules to predict protein solubility
48 49Binary Decision Trees
- Used to partition or classify data that is not linearly separable
- Unknown objects are classified by traversing the tree
- Traversing is accomplished by performing tests at each node; the direction of traversal is determined by the results of the test
- Decision trees can be trained (test threshold cutoff, test order, architecture)
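Traversal of a binary decision tree can be sketched in a few lines: each internal node tests one feature against a threshold and sends the sample left or right until a leaf label is reached. The tree below is a toy; its feature names, thresholds and labels are invented for illustration and are not the trained tree from the study cited later in the deck.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """Internal node: go to `below` if sample[feature] < threshold, else to `above`."""
    feature: str
    threshold: float
    below: Union["Node", str]
    above: Union["Node", str]


def classify(tree, sample):
    """Walk from the root to a leaf and return the leaf label (a string)."""
    node = tree
    while isinstance(node, Node):
        node = node.below if sample[node.feature] < node.threshold else node.above
    return node


# Hypothetical toy tree: thresholds are made up, not trained on any data.
TOY_TREE = Node(
    feature="length", threshold=300,
    below=Node("max_hydrophobicity", 2.0, below="soluble", above="insoluble"),
    above="insoluble",
)

print(classify(TOY_TREE, {"length": 150, "max_hydrophobicity": 1.2}))
```

Training, in this picture, means choosing which feature each node tests, in what order, and where each threshold sits.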
50Binary Decision Trees
not forming crystals
forming crystals
51Predicting Protein Solubility
1) Residue frequency: ACDEFGHIKLMNPQRSTVWY
2) Grouped residue frequency: KR, NR, DE, ST, LIM, FWY, HKR, AVILM, DENQ, GAVL, SCTM
3) Predicted secondary structure: a, b, c
4) Presence of signal sequence
5) Length of polypeptide
6) Number of residues in low complexity region (L, S)
7) Normalized low complexity value (SEG/Len)
8) Maximum hydrophobicity value
9) Length of maximum hydrophobic region
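The first two feature types in this list (single-residue and grouped residue frequencies) are simple to compute from a sequence. The sketch below covers only those two, using the residue groups exactly as listed on the slide; the function names are hypothetical, and the remaining features (secondary structure, signal sequence, low complexity) need external predictors and are not attempted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Residue groups as listed on the slide.
GROUPS = ["KR", "NR", "DE", "ST", "LIM", "FWY", "HKR", "AVILM", "DENQ", "GAVL", "SCTM"]


def residue_frequencies(seq):
    """Fraction of the sequence made up by each of the 20 standard amino acids."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def grouped_frequencies(seq):
    """Summed frequency for each residue group, keyed by the group string."""
    freqs = residue_frequencies(seq)
    return {group: sum(freqs[aa] for aa in group) for group in GROUPS}


features = {**residue_frequencies("MKRDESTLLA"), **grouped_frequencies("MKRDESTLLA")}
print(features["KR"], features["DE"], features["L"])
```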
52Solubility Decision Tree
Size of black oval = % that are soluble
53Binary Decision Trees
- Have been used to predict protein solubility and protein crystallization
- Somewhat similar to self-organizing feature maps (SOFM)
- Bertone P, Kluger Y, Lan N, Zheng D, Christendat D, Yee A, Edwards AM, Arrowsmith CH, Montelione GT, Gerstein M. Nucleic Acids Res 2001; 29(13): 2884-98
54Predicting Stability
- Even if a protein expresses and remains soluble, it may turn out to be quite unstable (easily proteolyzed)
- Proteins that are rich in Proline (P), Glutamic acid (E), Serine (S) and Threonine (T), or which have regions that are rich in these amino acids (PEST sequences), tend to have half lives of less than 2 hours
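A first look for P/E/S/T-rich regions can be done with a simple composition scan. The sketch below is a crude stand-in, not the scoring scheme used by dedicated PEST-finding tools; the window length and the 50% cutoff are arbitrary assumptions chosen for illustration.

```python
PEST_RESIDUES = set("PEST")


def pest_rich_windows(seq, window=12, min_fraction=0.5):
    """Return (start, fraction) for each window whose P/E/S/T fraction meets the cutoff.

    start is a 0-based position; window and min_fraction are illustrative defaults.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        count = sum(1 for aa in seq[start:start + window] if aa in PEST_RESIDUES)
        fraction = count / window
        if fraction >= min_fraction:
            hits.append((start, fraction))
    return hits


print(pest_rich_windows("MAAAAAAKLG" + "PESTPESTPEST" + "GLKAAAAAAA"))
```

Regions flagged this way are only candidates; a dedicated predictor should be used before redesigning a construct.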
55PEST Finder
http://www.at.embnet.org/embnet/tools/bio/PESTfind/
56Expression Questions
- Which host cell system?
- Which expression vector?
- Which cloning/expression protocols?
- Is it membrane or water soluble?
- Is it single domain or multi-domain?
- How soluble and how stable?
- Where will this protein be found?
- How to purify and how to identify?
57Protein Localization
- Is it exported? Does it go to the nucleus? Does it go through the ER? Does it localize to mitochondria? Chloroplasts? Does it go to the membrane? How do you tell?
- Eukaryotic signal sequences are usually incompatible with prokaryotic signal sequences, so expressing eukaryotic proteins in bacteria can lead to problems
58Location Prediction
http://psort.nibb.ac.jp
59Proteome Analyst
http://www.cs.ualberta.ca/bioinfo/PA/Sub/
60PSORT-B (bacteria)
http://www.psort.org/psortb/index.html
61Location Prediction
http://www.cbs.dtu.dk/services/TargetP/submission
62Other Sites or Modifications?
- Phosphorylation
- NetPhos http://cbs.dtu.dk/services/NetPhos/
- O-Glycosylation
- NetOGlyc http://cbs.dtu.dk/services/NetOGlyc/
- Coiled-coil dimerization domains
- www.ch.embnet.org/software/COILS_form.html
- Tyrosine Sulfation
- http://ca.expasy.org/tools/sulfinator/
63NetPhos 2.0
64Expression Questions
- Which host cell system?
- Which expression vector?
- Which cloning/expression protocols?
- Is it membrane or water soluble?
- Is it single domain or multi-domain?
- How soluble and how stable?
- Where will this protein be found?
- How to purify and how to identify?
65Finding and Identifying Your Protein
66Isoelectric Point
- The pH at which the protein has net charge = 0
- $Q = \sum_i N_i / (1 + 10^{\mathrm{pH} - \mathrm{p}K_i})$
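The isoelectric point can be found numerically by searching for the pH at which the net charge is zero. The sketch below splits the charge sum into basic groups (positive when protonated) and acidic groups (negative when deprotonated), which is the usual Henderson-Hasselbalch form; the pKa values are illustrative assumptions, not taken from the slides, and published pKa sets differ enough to shift the result by a few tenths of a pH unit.

```python
# Illustrative pKa values (assumed; other published sets give slightly different pI).
PK_BASIC = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PK_ACIDIC = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH, counting one free N- and C-terminus."""
    seq = seq.upper()
    basic = {"Nterm": 1, "K": seq.count("K"), "R": seq.count("R"), "H": seq.count("H")}
    acidic = {"Cterm": 1, "D": seq.count("D"), "E": seq.count("E"),
              "C": seq.count("C"), "Y": seq.count("Y")}
    charge = 0.0
    for group, n in basic.items():
        charge += n / (1.0 + 10 ** (ph - PK_BASIC[group]))
    for group, n in acidic.items():
        charge -= n / (1.0 + 10 ** (PK_ACIDIC[group] - ph))
    return charge


def isoelectric_point(seq, low=0.0, high=14.0, tol=1e-4):
    """Bisection on pH: net charge falls monotonically as pH rises."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


print(round(isoelectric_point("MKRDESTLLAHHK"), 2))
```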
67Isoelectric Point MW Calculation
68More Help?
- http://www.abrf.org
- http://www.abrf.org.JBT/JBTindex.html
- http://www.BioTechniques.com
- http://expasy.ch/alinks.html
- http://www.neehow.org/wonderful/protocols
- http://research.newfsc.noaa.gov/protocols.html
- http://www.horizonpress.com/gateway/protocols.html
69Bioinformatics Structural Proteomics
- Key to identifying targets
- Key to reducing time and material wastage in protein expression/purification steps
- Key to tracking and communicating target progression (multi-lab LIMS)
- Key to reducing redundancy and duplication by other X-ray or NMR structure labs (TargetDB, SPINE)
70TargetDB
http://targetdb.pdb.org/apps/TargetDB.html
71Structural Proteomics - Status
- 18 registered centres (30 organisms)
- 50330 targets have been selected
- 25202 targets have been cloned
- 14728 targets have been expressed
- 5122 targets are soluble
- 600 X-ray structures determined
- 164 NMR structures determined
- 633 Structures deposited in PDB (03/04/04)
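The attrition between pipeline stages is easier to see as percentages. The sketch below uses only the counts quoted on this slide; the stage labels and the function name are illustrative, and the X-ray and NMR counts are left out because the slide does not say how they overlap with the deposited total.

```python
# Counts copied from the status slide (as of 03/04/04).
PIPELINE = [
    ("selected", 50330),
    ("cloned", 25202),
    ("expressed", 14728),
    ("soluble", 5122),
    ("deposited in PDB", 633),
]


def step_yields(stages):
    """For each stage after the first: (name, % of previous stage, % of first stage)."""
    first_count = stages[0][1]
    rows = []
    for (_, previous), (name, count) in zip(stages, stages[1:]):
        rows.append((name, 100.0 * count / previous, 100.0 * count / first_count))
    return rows


for name, of_previous, of_selected in step_yields(PIPELINE):
    print(f"{name:18s} {of_previous:5.1f}% of previous stage, {of_selected:5.1f}% of selected")
```

On these numbers, roughly half of the selected targets are cloned and only about one in eighty ends up as a deposited structure, with the expressed-to-soluble step among the steepest drops.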
72Structural Proteomics - Status
- 135 structures deposited by Riken
- 117 structures deposited by Mid-West
- 85 structures deposited by North-East
- 74 structures deposited by New York
- 59 structures deposited by JCSG (UCSD)
- 34 structures deposited by Berkeley
- 24 structures deposited by Montreal/Kingston
73Protein Expression in E. coli
(Figure legend: good, promising, unfolded, poor, precipitated)
Proc. Natl. Acad. Sci. USA, Vol. 99, 1825-1830, 2002
74Protein Expression in E. coli
- M. th.: Methanobacterium thermoautotrophicum
- E. coli: Escherichia coli
- S. ce.: Saccharomyces cerevisiae
- Myx.: Myxoma virus
- T. ma.: Thermotoga maritima
75X-ray vs. NMR Results for Methanobacter
76Conclusions
- The success of proteomics (structural, functional, expressional) hinges almost entirely on successful protein production and expression
- Bioinformatics (web databases, servers, data mining tools, NNs, HMMs) can and does play an increasingly important role in optimizing or improving protein expression and coordinating large-scale proteomics efforts