Title: The Genome Access Course Protein Structure
1The Genome Access Course Protein Structure
HSP 70 (1DKG, 1DKZ) and prefoldin (1FXK)
2Protein structure
- What is the correct amino acid sequence?
- Is the predicted protein complete (is the ATG the real start codon)?
- To be sure - use ORF finder at NCBI
3ORF finder to BLAST
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
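The completeness check on the previous slide can be illustrated with a small open reading frame scan. This is only a sketch of the idea, not the algorithm used by NCBI ORF Finder: it looks at the three forward frames only, and the function name `find_orfs`, the `min_codons` cut-off and the toy sequence are assumptions made for this example.

```python
# Minimal forward-strand ORF scan (illustrative only; a real tool also
# checks the reverse strand and supports alternative genetic codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # first ATG seen in this frame since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy sequence: two leading bases, then ATG AAA CCC GGG TAA.
example = "CCATGAAACCCGGGTAAGG"
for start, end, frame in find_orfs(example):
    print(frame, start, end, example[start:end])
```

The scan reports the first in-frame ATG before each stop, which is exactly why the "is the ATG real?" question matters: an upstream in-frame ATG, or a missing 5' end, would change the predicted protein.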
4Protein Structural Elements
- 2° Structural Elements
- α-Helix
- β-Sheet
- Globular regions
- Domains
- SH2
- Leucine Zipper
5Protein function - different categories
- Protein of known function
- Protein of inferred function
- Protein of unknown function
6Protein of known function
- Work already done
- Ancillary databases (e.g. PubMed, OMIM, MGI, other organism-specific databases)
- Warning - make sure it really is the SAME protein
- First port of call - LocusLink/Entrez Gene
7Human genes and OMIM
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
8Mouse genes and MGI
http://www.informatics.jax.org/
9And the list goes on.
10Protein of inferred function
- Similar to protein of known function
- Annotated
- BLAST
- Paralogue (same species) or orthologue (different species) or just similar
- Make sure key residues are conserved, e.g. with a pairwise or multiple alignment
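Checking that key residues are conserved can be sketched as follows, assuming the alignment has already been produced by some alignment tool. The function name `check_key_residues`, the aligned strings and the chosen positions are all hypothetical.

```python
def check_key_residues(aligned_ref, aligned_query, key_positions):
    """Map 1-based positions of the ungapped reference onto an alignment
    and report which residue the query carries at each key position."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(key_positions)
    report = {}
    ref_pos = 0  # running 1-based index into the ungapped reference
    for ref_res, query_res in zip(aligned_ref, aligned_query):
        if ref_res == "-":
            continue  # insertion in the query: no reference position here
        ref_pos += 1
        if ref_pos in wanted:
            report[ref_pos] = (ref_res, query_res, ref_res == query_res)
    return report


# Hypothetical aligned pair; '-' marks a gap.
ref = "MKC-HGDLS"
query = "MRCAHG-LS"
result = check_key_residues(ref, query, [3, 4, 6])
for pos, (r, q, ok) in sorted(result.items()):
    print(pos, r, q, "conserved" if ok else "NOT conserved")
```

A high overall BLAST score does not guarantee that the few residues that matter (for example catalytic or metal-binding residues) are present, which is why the positions are checked individually here.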
11Protein of inferred function
Human X Chr
12Protein of Unknown Function
- Not similar at the primary sequence level to a protein of known function
- Can you predict function? So many caveats!
- Transmembrane protein?
- TMpred: http://www.ch.embnet.org/software/TMPRED_form.html
- Protein domains
- Can infer function, e.g. Homeobox
- Warning - some domains are poorly and/or widely predicted
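To give a feel for how a transmembrane question can be asked of a bare sequence, here is a sliding-window hydropathy sketch using the Kyte-Doolittle scale. This is a different and much simpler technique than the one TMpred uses; the window of 19 residues and the cut-off of 1.6 are commonly quoted values for this scale and are treated as assumptions here, as are the function names and the toy sequence.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]  # KeyError on non-standard residues
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score >= threshold]


# Hypothetical sequence: a 20-residue hydrophobic stretch between charged ends.
toy = "MKKDE" + "LIVALLFVGAILLVIAGLLV" + "RRKDE"
print(candidate_tm_windows(toy))
```

A hit only says that a stretch is hydrophobic enough to span a membrane; it says nothing about topology, and signal peptides produce similar peaks, so the caveats on the slide apply in full.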
13Domains
- Discrete structural units
- Can infer boundaries from sequence analysis
- 25-500 residues long
- Most < 200 residues
- Less than 50 residues usually stabilized by S-S bonds or metal ions
14Lipoxygenase Domain
>500 residues
15WW Domain
33 residues
16Domain Determination
- Internal duplications
- Detect with a dotplot
- Transmembrane segments
- Hydrophobic, 15-35 residues
- Segments easy to predict
- Topology and multiple segments harder to predict
- PHD, TMHMM, TMpred
- Low complexity segments
- Composition typically non-random
- Non-compact folds: coiled coils, rods, flexible domain linkers
- Complexity function (SEG)
- Small-pitch overlapping repeats (XNU)
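Detecting an internal duplication with a dotplot can be sketched by comparing a sequence against itself with exact word matches. Real dotplot programs usually score windows with a substitution matrix; exact matching, the word size of 3, the function names and the toy sequence are simplifying assumptions.

```python
def dotplot(seq_a, seq_b, window=3):
    """Return the set of (i, j) where seq_a[i:i+window] == seq_b[j:j+window]."""
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in index.get(seq_a[i:i + window], []):
            hits.add((i, j))
    return hits


def render_self_dotplot(seq, window=3):
    """Text rendering of a self-comparison: '*' for a match, '.' otherwise."""
    hits = dotplot(seq, seq, window)
    n = len(seq) - window + 1
    rows = []
    for i in range(n):
        rows.append("".join("*" if (i, j) in hits else "." for j in range(n)))
    return "\n".join(rows)


# Hypothetical protein with the same 9-residue unit present twice.
repeat = "ACDEFGHIK"
seq = repeat + "WW" + repeat
print(render_self_dotplot(seq))
```

The main diagonal is always filled in a self-comparison; the duplication shows up as the shorter diagonals parallel to it, offset by the distance between the two copies.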
17Protein sequence databases
- Non-curated
- TrEMBL - automatically predicted proteins from CDS in GenBank/EMBL/DDBJ
- Entrez Protein: www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
- Curated
- Swiss-Prot - proteins identified with confidence, manually added to database
- UniProt (e.g. hosted at EBI): http://www.expasy.uniprot.org/index.shtml
18Proteins of Unknown function
Protein domain databases, e.g. InterPro
http://www.ebi.ac.uk/interpro/index.html
19Comparison of Protein Family DBs
Pfam
SMART
CDD
PROSITE
SRS
20- Conserved Domain Database (NCBI)
- Linked into other NCBI resources
- Includes Pfam and SMART domains (but does not necessarily give the same answer)
21Proteins in Ensembl
22Proteins in UCSC
23- HMM family profiles constructed by hand
- Structural data in alignments
- No hierarchy
- No specific compositional bias
- Good graphical output
24Pfam-A and Pfam-B
- Pfam-A (75)
- Curated, annotated families
- Pfam-B (19)
- Families derived automatically from ProDom
- Other
25- Protein fingerprint database (fingerprints are groups of conserved motifs that characterize a protein family)
- Regular grammar for describing profiles (e.g. [EDQ]-x-G-x-[DN]-A-x-x-[GALI])
- Profile search is sensitive, but low coverage (signaling)
- Pattern search has high false positive rate
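The regular grammar used for patterns maps almost directly onto regular expressions. The sketch below translates a subset of standard PROSITE pattern syntax into a Python regex; it assumes the usual square brackets around alternative residues, as in `[EDQ]-x-G-x-[DN]-A-x-x-[GALI]`, and the function name `prosite_to_regex` and the test peptide are hypothetical.

```python
import re

# One pattern element: [ABC], {ABC}, a single residue or x,
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax to a Python regex.

    Handled: x (any residue), [ABC] (allowed), {ABC} (forbidden),
    repeats like x(2) or x(2,4), and the < and > sequence anchors.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


regex = prosite_to_regex("[EDQ]-x-G-x-[DN]-A-x-x-[GALI]")
print(regex)
# Hypothetical peptide containing one match.
hit = re.search(regex, "MKEAGWDAPTGKL")
print(hit.start(), hit.group())
```

Because such a pattern fixes only a handful of positions, it will also match unrelated sequences by chance in a large database, which is the high false positive rate noted on the slide.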
26- Highly conserved, ungapped MSAs
- Derived from PROSITE
27- Fingerprints are sets of ungapped weight matrices
- Hierarchical classification for important families
- Families, domains, and proteins
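An ungapped weight matrix can be illustrated with a small log-odds example. This is a generic sketch, not the scoring scheme of any particular database: the uniform background, the pseudocount of 1, the function names and the motif sequences are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(motif_seqs, pseudocount=1.0):
    """Log-odds (base 2) column scores from ungapped, equal-length motifs.

    Assumes a uniform background frequency of 1/20 per residue.
    """
    length = len(motif_seqs[0])
    if any(len(s) != length for s in motif_seqs):
        raise ValueError("ungapped motif sequences must share one length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(motif_seqs) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for col in range(length):
        counts = Counter(s[col] for s in motif_seqs)
        matrix.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def best_hit(matrix, seq):
    """Return (score, start) of the best-scoring ungapped window in seq."""
    width = len(matrix)
    scored = [
        (sum(col[aa] for col, aa in zip(matrix, seq[i:i + width])), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scored)  # raises ValueError if seq is shorter than the motif


# Hypothetical motif alignment and query sequence.
motifs = ["GDSGG", "GDSGA", "GDSGS", "GESGG"]
score, start = best_hit(build_weight_matrix(motifs), "MKTAYGDSGGPLV")
print(round(score, 2), start)
```

A fingerprint in the sense of the slide would be several such matrices, one per motif, that are all expected to hit the same protein in the right order.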
28- Simple Modular Architecture Research Tool
- Collected by Ponting and Bork (641 HMMs)
- Focuses on
- Signaling Domains
- Extracellular domains
- Nuclear domains
- High quality, nice graphics
29Alignment of Representative Members
Profile-HMM built with HMMer 2.0
Search Protein DB
Description
Full alignment
30- Profiles automatically built from PSI-BLAST alignments of Swiss-Prot
- No annotation
- As with other automated DBs (Pfam-B, DOMO), useful for seeing if a region appears in different contexts
31Protein Sequence Analysis
- Biochemical/biophysical properties
- Secondary Structure
- Super-secondary (signal peptides, domains, motifs)
- 3D prediction (Threading)
32Amphipathic Helix
Edge Strand
Buried Strand
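The periodicity that separates these three cases can be quantified with a hydrophobic moment, sketched here under several assumptions: the Kyte-Doolittle scale is used for hydrophobicity, 100 degrees per residue is taken as the α-helix periodicity, and the function name and peptides are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq, angle_deg=100.0):
    """Mean hydrophobic moment per residue for a given turn angle.

    Each residue contributes a vector of length KD[residue] pointing in
    the direction n * angle_deg; the moment is the length of the vector
    sum divided by the number of residues.
    """
    delta = math.radians(angle_deg)
    seq = seq.upper()
    sin_sum = sum(KD[aa] * math.sin(n * delta) for n, aa in enumerate(seq))
    cos_sum = sum(KD[aa] * math.cos(n * delta) for n, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


# Hypothetical 18-residue peptides: Leu on one helical face and Lys on the
# other, versus a uniformly hydrophobic stretch.
amphipathic = "LKKLLKKLLKLLKKLLKK"
uniform = "L" * 18
print(round(hydrophobic_moment(amphipathic), 2))
print(round(hydrophobic_moment(uniform), 2))
```

Calling the same function with an angle near 160 to 180 degrees probes strand-like alternation instead: a strand with one face exposed would be expected to score high there, while a uniformly hydrophobic buried strand would not.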
36Viewing 3D Structures
- Cn3D
- Chime
- RasMol
- Protein Explorer
38Protein of inferred or unknown function
- All predictions must be taken as exactly that
- PREDICTIONS!!
- The true function of a protein is NOT known until it has been proven in the lab