Title: Bioinformatics: Applications
1Bioinformatics Applications
- ZOO 4903
- Fall 2006, MW 10:30-11:45
- Sutton Hall, Room 312
- Jonathan Wren
- Proteomics II: Motifs and Domains
2Lecture overview
- What we've talked about so far
- High-throughput detection of protein abundance and identification
- Overview
- Proteins have constituent components and functional structures
- Proteins can be modified post-translationally
33 Kinds of Proteomics
- Expression or Analytical Proteomics
- 2-dimensional electrophoresis gels
- Mass Spectrometry, Microsequencing
- Functional or Interaction Proteomics
- Protein domains and motifs
- Post-translational modifications
- Structural Proteomics
- High throughput X-ray Crystallography/Modeling
- High throughput NMR Spectroscopy/Modeling
4How do proteins evolve?
- Proteins couldn't have begun with the functions (or size) they have today
5How do proteins evolve?
- Proteins couldn't have begun with the functions (or size) they have today
- Proteins can be broken down into constituent components
6The Protein Parts List
740-60% of proteins in the human genome are of unknown function
8Protein structures
- Primary, secondary, tertiary, quaternary
9Methods of grouping proteins
- Protein motif
- Protein domain
- 3-D structure
- Whole-protein
10Protein Domains
- Group of residues with high contact density: the number of contacts within domains is higher than the number of contacts between domains
- A stable unit of protein structure that can fold autonomously
- A rigid body linked to other domains by flexible linkers
- A portion of the protein that can be active/stable on its own if you remove it from the rest of the protein
11Protein Domains
- The term "fold" is commonly used in the context of a 3D structure
- Together, a group of proteins that share a particular domain is known as a family
- Domains are often further qualified with respect to function
- Zinc finger: binds DNA
- Intracellular domain: soluble cytoplasmic protein
- Extracellular domain: found on the outside of the cell membrane
12Protein Domains
- Domains can be 25 to 500 residues long; most are less than 200 residues
- The average protein contains 2 or 3 domains
- The total number of different types of domains: about 2000
- The same or similar domains are found in different proteins. "Nature is a tinkerer and not an inventor" (Jacob, 1977).
- Usually, each domain plays a specific role in the function of the protein
- Generally, two sequences with over 30% identity are likely to have the same fold.
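The 30% identity rule of thumb can be tried on a pairwise alignment. The following is a minimal sketch, assuming the two sequences are already aligned (equal length, `-` for gaps) and that identity is counted only over columns where neither sequence has a gap. Other denominators are in common use, and the function names and toy sequences are illustrative, not taken from the lecture.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def likely_same_fold(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """Apply the rule of thumb: identity above the threshold suggests a shared fold."""
    return percent_identity(aligned_a, aligned_b) > threshold


# Toy alignment (hypothetical sequences, '-' marks a gap).
seq_a = "MKT-AYIAKQR"
seq_b = "MKTLAYLAK-R"
print(round(percent_identity(seq_a, seq_b), 1))  # 8 identical of 9 compared columns
print(likely_same_fold(seq_a, seq_b))
```

The threshold is only a guide: short alignments can exceed 30% identity by chance, so the length of the aligned region matters as well.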
13Linkers
- Domain linkers link the protein domains together and have been found to contain an amino acid signature that is distinct from the structurally compact domains
- Average linker size: 8-9 amino acids
- Linkers are flexible and more susceptible to protease attack
14Divisibility of proteins by domain helps 3D
Structure Determination
- X-ray crystallography
- grow crystal
- collect diffraction data
- calculate electron density
- trace chain
- NMR spectroscopy
- label protein
- collect NMR spectra
- assign spectra and NOEs
- calculate structure using distance geometry
153D Structure
Proteins share the same fold, suggesting homology
Beta B1 Crystallin
Gamma Crystallin C
16Protein Domains
17Motifs are built from Multiple Alignments
18Motif detection via MSA
[Figure: structure, alignment, tree, functional site]
Lichtarge et al., JMB 1996; Lichtarge et al., JMB 1997; Lichtarge et al., PNAS 1996; Sowa et al., NSB 2001
19Motifs are Functionally Relevant
[Figure panels: Trp1 domain of Hop, Dihydropteroate Synthase, Galectin CRD. Cluster types: ligand binding site, ET clusters]
Structural Epitope: yellow = ligand, blue = residues within 5Å of the ligand
ET Clusters: yellow = ligand, red = largest cluster, other colors = trace residues
20Domains/Motifs
- These domains have conserved sequences
- Often much more similar than their respective proteins
- Exon shuffling theory (Gilbert)
- Exons correspond to folding domains, which in turn serve as functional units
- Unrelated proteins may share a single similar exon (e.g. ATPase or DNA binding function)
21Sequence Motif regular expressions
- The simplest method of defining short amino acid sequence motifs
- Example: the nuclear receptor motif
- C-x(2)-C-x-[DE]-x(5)-[HN]-[FY]-x(4)-C-x(2)-C-x(2)-F-F-x-R
- [DE]: either D or E
- x(5): five undefined positions
- {FYW}: any non-aromatic amino acid
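A motif written this way maps almost directly onto an ordinary regular expression. The sketch below assumes PROSITE-style syntax, with square brackets for alternatives, curly braces for excluded residues, and `x(n)` or `x(n,m)` for undefined positions. It handles only that subset (no N- or C-terminal anchors), and the function names and test sequence are hypothetical.

```python
import re

# One pattern element: x, a single residue, [alternatives] or {exclusions},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # single letters and [..] classes are already valid regex syntax
        if low is not None and high is not None:
            core += "{" + low + "," + high + "}"
        elif low is not None:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(pattern: str, sequence: str):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


NUCLEAR_RECEPTOR = "C-x(2)-C-x-[DE]-x(5)-[HN]-[FY]-x(4)-C-x(2)-C-x(2)-F-F-x-R"
print(prosite_to_regex(NUCLEAR_RECEPTOR))

# Synthetic sequence constructed to contain one match (not a real protein).
toy = "MK" + "CAACADAAAAAHFAAAACAACAAFFAR" + "GG"
print(find_motif(NUCLEAR_RECEPTOR, toy))
```

The lookahead wrapper in `find_motif` is a design choice so that overlapping matches are reported; a plain `finditer` on the pattern would skip them.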
22Motif Patterns (Regular Expressions)
- Signature Patterns for Functional Motifs
ProClass Motif Alignments
24Simple domains
- Common structural domains
- Membrane spanning
- Signal peptide
- Coiled coil
- Helix-turn-helix
25DNA Binding domain: Zinc-Finger
26Methyl-binding domains
MBD: methyl-CpG-binding domain
27Multidomain proteins
28Protein domains
29Protein domain abundance is skewed
30Remember: Proteins Interact
31Protein Interaction Domains
32Proteins Assemble
33Proteins localize
http://www.cs.ualberta.ca/bioinfo/PA/Sub/
34Web servers that predict secondary structures
- Predict Protein server
- http://www.predictprotein.org/
- TMpred (transmembrane prediction)
- http://www.ch.embnet.org/software/TMPRED_form.html
- COILS (coiled coil prediction)
- http://www.ch.embnet.org/software/COILS_form.html
- SignalP (signal peptides)
- http://www.cbs.dtu.dk/services/SignalP/
35Protein Domain Databases
- Known protein domains have been collected in databases
- Best database is PROSITE
- The Dictionary of Protein Sites and Patterns
- Maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland
- Contains a comprehensive list of documented protein domains constructed by expert molecular biologists
- Alignments and patterns built by hand!
- http://www.expasy.org/prosite/
36PROSITE is based on Patterns
- Each domain is defined by a simple pattern
- Patterns can have alternate amino acids in each position and defined spaces, but no gaps
- Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)
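The trade-off between exact matching and allowing mismatches can be illustrated with a small scan. This is a sketch under simplifying assumptions: the pattern has fixed length and uses only single residues, `[..]` alternatives and `x`; a mismatch is any constrained position whose residue is not allowed. The pattern, sequence and function names are made up for illustration.

```python
def parse_simple_pattern(pattern: str):
    """Parse a fixed-length pattern such as 'C-x-[DE]-H' into per-position
    sets of allowed residues (None means any residue)."""
    positions = []
    for element in pattern.split("-"):
        if element == "x":
            positions.append(None)
        elif element.startswith("[") and element.endswith("]"):
            positions.append(set(element[1:-1]))
        else:
            positions.append({element})
    return positions


def find_with_mismatches(pattern: str, sequence: str, max_mismatches: int = 0):
    """Return (1-based start, window, mismatch count) for each window whose
    number of violated positions does not exceed max_mismatches."""
    positions = parse_simple_pattern(pattern)
    width = len(positions)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            1
            for allowed, residue in zip(positions, window)
            if allowed is not None and residue not in allowed
        )
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


toy = "AACKDHGGCKNHAA"  # hypothetical sequence
print(find_with_mismatches("C-x-[DE]-H", toy))                    # exact matching
print(find_with_mismatches("C-x-[DE]-H", toy, max_mismatches=1))  # one mismatch allowed
```

With exact matching only the window `CKDH` is reported; allowing one mismatch also reports the variant `CKNH`. On real databases the same relaxation admits many unrelated windows, which is the sense in which it weakens the search.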
37PIR Pattern Search
- From Text/Sequence search result or pattern search interface
- One Query Sequence Against PROSITE Pattern Database
- One Query Pattern (PROSITE or User-Defined) Against Sequence DB
38Pattern detection in sequence
39Sequence search using pattern
40PRINTS database
- Most protein families are characterized not by one, but by several conserved motifs
- Fingerprints are groups of conserved motifs excised from sequence alignments
- Taken together, they provide diagnostic family signatures. They are the basis of the PRINTS database, and are stored in the form of aligned motifs
- Input on protein families is done manually
- True members match all elements of the fingerprint in order; subfamily members may match part of the fingerprint
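The idea of matching fingerprint elements in order can be sketched in a few lines. Note the assumptions: PRINTS itself scores aligned motifs rather than regular expressions, so the regexes here are only a stand-in, and the three motifs and the sequences are invented for illustration. The search is greedy (each motif is looked for after the end of the previous hit), which is enough for a demonstration but not an optimal matcher.

```python
import re


def match_fingerprint(sequence: str, motifs):
    """Search for each motif in order, starting after the previous hit.
    Returns a list of 0-based start positions, with None for motifs not found."""
    positions = []
    cursor = 0
    for motif in motifs:
        m = re.compile(motif).search(sequence, cursor)
        if m is None:
            positions.append(None)
        else:
            positions.append(m.start())
            cursor = m.end()
    return positions


def classify(sequence: str, motifs) -> str:
    """Full match = all elements found in order; partial match = only some."""
    hits = sum(p is not None for p in match_fingerprint(sequence, motifs))
    if hits == len(motifs):
        return "full match"
    if hits > 0:
        return "partial match"
    return "no match"


# Hypothetical three-element fingerprint written as regexes.
FINGERPRINT = ["G.G..G", "HRD", "DFG"]

print(match_fingerprint("MAGSGAAGKKKHRDLLLDFGAA", FINGERPRINT))
print(classify("MAGSGAAGKKKHRDLLLDFGAA", FINGERPRINT))
print(classify("MAGSGAAGKKKLLL", FINGERPRINT))
```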
41BLOCKS
- The BLOCKS database uses an extension of the motif approach
- Block: an alignment of the motif sequences from a family of proteins
- BLOCKS are used to produce the BLOSUM matrices
- e.g. BLOSUM62 is derived from blocks in which sequences at least 62% identical are clustered and counted as one
42Pfam
- Pfam is a collection of alignments of protein domain sequences
- Some families generated using HMMs, some created by hand
- HMM: Hidden Markov Model, a statistical method increasingly used in gene and protein modelling
- HMMs are rigorous algorithms which allow for varying gap scores
43Integrating Pattern databases
- InterPro - Integrated Documentation Resource of Protein Families, Domains and Functional Sites.
- InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
- The aim is to provide a one-stop-shop for protein family diagnostics
44InterPro
- Member Databases
- Prosite (regular expressions and profiles)
- Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and SUPERFAMILY (hidden Markov Models - HMMs)
- PRINTS (groups of aligned, un-weighted motifs)
- ProDom (uses cluster analysis to group sequences)
- Release 13.0 contains 13,147 entries and covers 77.6% of UniProtKB (2,530,773 of 3,260,640 proteins)
- Types of entries: Family, Domain, Repeat, PTM, Binding Site, Active Site
- http://www.ebi.ac.uk/interpro/
45Discovery of new Motifs
- All of the tools discussed so far rely on a database of existing domains/motifs
- How to discover new motifs
- Start with a set of related proteins
- Make a multiple alignment
- Build a pattern or profile
- You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.
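The "build a pattern" step can be sketched as follows: given a multiple alignment, each column becomes a single residue, a bracketed set of alternatives, or `x`. This is a deliberately naive sketch, assuming an ungapped-style treatment (any column containing a gap simply becomes `x`) and a hypothetical cutoff `max_alternatives`; hand-curated patterns involve much more judgement than this, and the alignments shown are toy data.

```python
from itertools import groupby


def pattern_from_alignment(aligned, max_alternatives: int = 3) -> str:
    """Derive a PROSITE-style pattern from equal-length aligned sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have the same length")

    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_alternatives:
            elements.append("x")  # too variable (or gapped): undefined position
        elif len(residues) == 1:
            elements.append(residues[0])  # fully conserved column
        else:
            elements.append("[" + "".join(residues) + "]")  # a few alternatives

    # Collapse runs of consecutive x into x(n).
    collapsed = []
    for element, run in groupby(elements):
        count = len(list(run))
        if element == "x" and count > 1:
            collapsed.append(f"x({count})")
        else:
            collapsed.extend([element] * count)
    return "-".join(collapsed)


toy_alignment = ["CAGKCD", "CSLRCE", "CTPKCD", "CVWKCD"]
print(pattern_from_alignment(toy_alignment))
```

A pattern built this way only describes the sequences it was built from, so it should be checked against known family members and against unrelated sequences before being used for searching.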
46Patterns in Unaligned Sequences
- Sometimes sequences may share just a small common region
- common signal peptide
- new transcription factors
- MEME: San Diego Supercomputing Facility
- http://meme.sdsc.edu/meme/meme-intro.html
47Post-translational Modifications
Post-translational modification is the chemical
modification of a protein after its translation.
Translation is the process of synthesizing the
peptide chain of amino acids specified by the
nucleotide sequence on the mRNA.
48The Central Dogma
- Transcription
- Translation
It is not necessary that the final product of
translation should be the final product of
protein synthesis.
49Types of Post-translational modifications
- Several types of PTMs have been characterized. Some of them:
- Proteolytic cleavage
- Glycosylation (N)
- Methylation (D - E - K)
- Phosphorylation (S - T - Y)
- Sulfation (Y)
- Acetylation (D - E - K)
- Disulfide bond formation (C)
- Carboxylation, Hydroxylation, Prenylation, Formylation, etc.
- 300 PTMs total
50Phosphorylation
Phosphorylation is the addition of a phosphate (PO4) group to a protein or a small molecule
- Phosphorylation and dephosphorylation are responsible for activating or deactivating many enzymes and receptors
- Phosphorylation is catalyzed by various specific protein kinases, dephosphorylation by phosphatases
- Can occur on Serine, Threonine, Tyrosine
- >30% of all proteins are phosphorylated during their functional life cycle
51Phosphorylation Sites
[Figure: phosphorylated residues pY, pT and pS, with PO4 and CH3 groups labeled]
52Glycosylation
Glycosylation is the addition of a saccharide to a protein or a lipid molecule
- N-Linked Glycosylation
- Amide nitrogen of Asparagine
- O-Linked Glycosylation
- Hydroxy oxygen of Serine and Threonine
53PTMs have significant biological functions
- Extend the range of possible functions that can be exhibited by a protein by introducing new chemical groups.
- Alter the hydrophobicity of a protein (synthesis of membrane proteins).
- Activating or inactivating an enzyme.
- Energy metabolism
- Oxidative phosphorylation in respiration
- Photophosphorylation in photosynthesis
- Signal transduction
- Protein degradation
- Blood coagulation
- Immune system
54PTMs affect protein behavior
Mann Nat Biotech 2003
55Post-translational modifications
56PTMs can be characterized or predicted
- Experimental methods
- Crystallography
- Mass Spectrometry
- PTM Prediction tools
- Auto-motif server
- Sulfinator
- NetPhos server
- Predphospho server
- eMOTIF
- PROSITE
57PTM detection
- Pattern prediction (PROSITE)
- Short or weak signal
- Frequent hit producer
- Best method is experimental
- MS/MS detection
- Most methods use rules joining pattern
detection and knowledge to predict sites.
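Why short patterns are frequent hit producers can be shown with a back-of-the-envelope calculation. The sketch below assumes all 20 amino acids are equally likely and positions are independent, which real proteins violate, so the numbers are only indicative. The N-glycosylation pattern `N-{P}-[ST]-{P}` is the PROSITE sequon and is used here as an example that does not appear in the lecture; the function names are hypothetical, and only fixed-length patterns are supported.

```python
import re

ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?$")


def parse_fixed_pattern(pattern: str, alphabet_size: int = 20):
    """Return a list of (number of allowed residues, repeat count) per element."""
    parsed = []
    for element in pattern.split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            allowed = alphabet_size
        elif core.startswith("["):
            allowed = len(set(core[1:-1]))
        elif core.startswith("{"):
            allowed = alphabet_size - len(set(core[1:-1]))
        else:
            allowed = 1
        parsed.append((allowed, int(repeat or 1)))
    return parsed


def match_probability(pattern: str, alphabet_size: int = 20) -> float:
    """Probability that a random window matches, under the uniform model."""
    probability = 1.0
    for allowed, repeat in parse_fixed_pattern(pattern, alphabet_size):
        probability *= (allowed / alphabet_size) ** repeat
    return probability


def expected_chance_hits(pattern: str, sequence_length: int) -> float:
    """Expected number of matching windows in a random sequence."""
    width = sum(repeat for _, repeat in parse_fixed_pattern(pattern))
    windows = max(sequence_length - width + 1, 0)
    return windows * match_probability(pattern)


print(expected_chance_hits("N-{P}-[ST]-{P}", 400))
print(expected_chance_hits(
    "C-x(2)-C-x-[DE]-x(5)-[HN]-[FY]-x(4)-C-x(2)-C-x(2)-F-F-x-R", 400))
```

Under these assumptions a 400-residue random sequence is expected to contain between one and two N-glycosylation pattern matches by chance, whereas the long nuclear receptor pattern essentially never matches by chance. This is why short PTM patterns need to be combined with other knowledge or confirmed experimentally.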
58ExPASy protein tools
http://www.expasy.org/tools/
- ChloroP - Prediction of chloroplast transit peptides
- LipoP - Prediction of lipoproteins and signal peptides in Gram negative bacteria
- MITOPROT - Prediction of mitochondrial targeting sequences
- SignalP - Prediction of signal peptide cleavage sites
- NetAcet - Prediction of N-acetyltransferase A (NatA) substrates
- NetOGlyc - Prediction of O-GalNAc (mucin type) glycosylation sites in mammalian proteins
- NetNGlyc - Prediction of N-glycosylation sites in human proteins
- YinOYang - O-beta-GlcNAc attachment sites in eukaryotic protein sequences
- big-PI Predictor - GPI Modification Site Prediction
- DGPI - Prediction of GPI-anchor and cleavage sites (Mirror site)
- Myristoylator - Prediction of N-terminal myristoylation by neural networks
- NetPhos - Prediction of Ser, Thr and Tyr phosphorylation sites in eukaryotic proteins
- NMT - Prediction of N-terminal N-myristoylation
- PrePS - Prenylation Prediction Suite
- Sulfinator - Prediction of tyrosine sulfation sites
- SUMOplot - Prediction of SUMO protein attachment sites
- TermiNator - Prediction of N-terminal modification
59PTM Databases
- General PTM Databases
- RESID
- Unimod
- Delta Mass
- PTM Databases for Specific Proteins
- Histone sequence database
- Human Protein Reference Database
- Plasma Proteome Database
- Databases for Specific PTMs
- Phospho.ELM: Phosphorylation
- GlycoSuiteDB, SweetDB: Glycosylation
60Limitations of current PTM databases
- PTMs are mostly annotated in a static fashion, e.g., an amino acid is denoted as either modified or unmodified. In reality, some amino acids are modified under one condition, and return to their initial state when the condition changes.
- The status of a specific amino acid site with respect to a modification is highly associated with biological functionality of the protein. But this association is often not annotated in the database.
- Phosphorylation vs. signal transduction
- Glycosylation vs. cell-cell interaction
- Different PTMs on the same protein may be associated with each other. These associations are not annotated in the current databases either.
61Summary
- Many different protein signature databases exist (from small patterns to alignments to complex HMMs)
- The quality of a database/server is best tested with a sequence you know very well
- Positive controls - submit sequences for which you know the right answer
- Negative controls - random or shuffled sequences
- Many proteins function only after they have been further chemically modified
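A shuffled negative control is easy to generate. The sketch below assumes a simple residue-level shuffle, which preserves length and amino acid composition but destroys the order; the function name and the example sequence are illustrative.

```python
import random


def shuffled_control(sequence: str, seed=None) -> str:
    """Return the sequence with its residues randomly permuted.
    Passing a seed makes the control reproducible."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


original = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical query sequence
control = shuffled_control(original, seed=1)
print(original)
print(control)
```

Submitting both the original and the shuffled sequence to a server gives a rough feel for its false-positive rate: hits that also appear for the shuffled control are unlikely to be meaningful.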
62For next time
- Homework 6 due
- Read Mount, Chapter 10