Title: Protein domains, function and associated prediction
1Lecture 14
Protein domains, function and associated
prediction
Introduction to Bioinformatics
2Metabolomics fluxomics
3Experimental
- Structural genomics
- Functional genomics
- Protein-protein interaction
- Metabolic pathways
- Expression data
4Issue when elucidating function experimentally
- Typically done through knock-out experiments
- Partial information (indirect interactions) and subsequent filling of the missing steps
- Negative results (elements that have been shown not to interact, enzymes missing in an organism)
- Putative interactions resulting from computational analyses
5Protein function categories
- Catalysis (enzymes)
- Binding and transport (active/passive)
- Protein-DNA/RNA binding (e.g. histones, transcription factors)
- Protein-protein interactions (e.g. antibody-lysozyme), experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening
- Protein-fatty acid binding (e.g. apolipoproteins)
- Protein-small molecule binding (drug interaction, structure decoding)
- Structural component (e.g. crystallins)
- Regulation
- Signalling
- Transcription regulation
- Immune system
- Motor proteins (actin/myosin)
6Catalytic properties of enzymes
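Michaelis-Menten equation
$$V = \frac{V_{max}[S]}{K_m + [S]}$$
Reaction scheme: $E + S \rightleftharpoons ES \rightarrow E + P$, with Km associated with the binding step and kcat with the catalytic step
- E: enzyme
- S: substrate
- ES: enzyme-substrate complex (transition state)
- P: product
- Km: Michaelis constant
- kcat: catalytic rate constant (turnover number)
- kcat/Km: specificity constant (useful for comparison)
[Figure: reaction rate V (moles/s) plotted against substrate concentration [S]; the curve approaches Vmax, and V = Vmax/2 at [S] = Km]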
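The Michaelis-Menten relation can be checked numerically. The following is a minimal sketch; the function names and the numbers are illustrative assumptions, not values from the lecture. It shows that the rate is half of Vmax when the substrate concentration equals Km.

```python
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate V at substrate concentration s (same units as km)."""
    if s < 0 or km <= 0:
        raise ValueError("s must be >= 0 and km must be > 0")
    return vmax * s / (km + s)


def specificity_constant(kcat, km):
    """kcat/Km, used to compare enzymes or substrates."""
    return kcat / km


# Hypothetical values, chosen only to show the half-maximal point at s == km.
vmax, km = 10.0, 2.0
for s in (0.5, 2.0, 20.0):
    print(s, michaelis_menten_rate(s, vmax, km))
```

At low [S] the rate grows almost linearly with [S]; at high [S] it saturates towards Vmax.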
7Protein interaction domains
http://pawsonlab.mshri.on.ca/html/domains.html
8Energy difference upon binding
- Examples of protein interactions (and of functional importance) include
  - Protein-protein (pathway analysis)
  - Protein-small molecules (drug interaction, structure decoding)
  - Protein-peptides, DNA/RNA
- The change in Gibbs free energy of the protein-ligand binding interaction can be monitored and expressed by the following equation
$$\Delta G = \Delta H - T \Delta S$$
- (H = enthalpy, S = entropy and T = temperature)
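As a small worked illustration of the Gibbs equation, the sketch below evaluates it for one set of numbers. The values and units (kJ/mol, kJ/(mol K), K) are hypothetical and chosen only to show how an unfavourable entropy term offsets a favourable enthalpy term.

```python
def gibbs_free_energy_change(delta_h, temperature, delta_s):
    """Delta G = Delta H - T * Delta S (units must be consistent)."""
    return delta_h - temperature * delta_s


# Hypothetical binding event: favourable enthalpy, unfavourable entropy.
delta_h = -50.0   # kJ/mol (assumed)
delta_s = -0.1    # kJ/(mol K) (assumed)
temperature = 298.0  # K

delta_g = gibbs_free_energy_change(delta_h, temperature, delta_s)
print("Delta G (kJ/mol):", round(delta_g, 1))
print("Binding favourable:", delta_g < 0)
```

A negative result means binding is thermodynamically favourable under these assumed conditions.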
10Protein function
- Many proteins combine functions
- Some immunoglobulin structures are thought to have more than 100 different functions (and active/binding sites)
- Alternative splicing can generate (partially) alternative structures
11Protein function Interaction
Active site / binding cleft
Shape complementarity
12Protein function evolution
Chymotrypsin
From a simple ancestral active site for cutting protein chains...
... to a more elaborate active site with four different features, all helping to optimise proteolysis (cleavage)
Gene duplication has resulted in a two-domain protein
13Protein function evolution
Chymotrypsin
The active site lies between the two domains. It
consists of residues on the same two loops
(firstly between beta-strands 3 and 4, secondly
between beta strands 5 and 6) of each of the two
barrel domains. Four features of the active site
are indicated in the figure.
The Substrate Specificity Pocket
Main Chain Substrate-binding
The Oxyanion Hole (white)
Catalytic triad
Chymotrypsin cleaves peptides at the carboxyl side of tyrosine, tryptophan, and phenylalanine because those three amino acids contain aromatic rings.
14How to infer function
- Experiment
- Deduction from sequence
  - Multiple sequence alignment conservation patterns
  - Homology searching
- Deduction from structure
  - Threading
  - Structure-structure comparison
  - Homology modelling
15A domain is a
- Compact, semi-independent unit (Richardson, 1981).
- Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
- Recurring functional and evolutionary module (Bork, 1992).
- Nature is a tinkerer and not an inventor (Jacob, 1977).
- Smallest unit of function
16Delineating domains is essential for
- Obtaining high resolution structures (X-ray, but particularly NMR, where protein size is limiting)
- Sequence analysis
  - Multiple sequence alignment methods
  - Prediction algorithms (SS, class, secondary/tertiary structure)
  - Fold recognition and threading
- Elucidating the evolution, structure and function of a protein family (e.g. Rosetta Stone method)
- Structural/functional genomics
- Cross genome comparative analysis
17Domain connectivity
linker
18Structural domain organisation can be nasty
Pyruvate kinase (phosphotransferase)
- β barrel regulatory domain
- α/β barrel catalytic substrate binding domain
- α/β nucleotide binding domain
1 continuous and 2 discontinuous domains
19Domain size
- The size of individual structural domains varies widely
  - from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998)
  - the majority (90%) having less than 200 residues (Siddiqui and Barton, 1995)
  - with an average of about 100 residues (Islam et al., 1995).
- Small domains (less than 40 residues) are often stabilised by metal ions or disulphide bonds.
- Large domains (greater than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).
27Analysis of chain hydrophobicity in multidomain
proteins
29Domain characteristics
- Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya), underlining the finding that Nature is a tinkerer and not an inventor (Jacob, 1977).
- The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001).
- Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).
30Protein function evolution - Gene (domain) duplication
Active site
Chymotrypsin
31Pyruvate phosphate dikinase
- 3-domain protein
- Two domains catalyse a 2-step reaction
  - A → B → C
- Third, so-called swivelling domain actively brings the intermediate enzymatic product (B) over 45 Å from one active site to the other
33The DEATH Domain
- Present in a variety of eukaryotic proteins involved with cell death.
- Six helices enclose a tightly packed hydrophobic core.
- Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
34Detecting Structural Domains
- A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983).
- Therefore, a structural domain can be determined by two shape characteristics: its compactness and its extent of isolation (Tsai and Nussinov, 1997).
- Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).
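The idea of "more interactions within itself than with the rest of the structure" can be sketched as a contact count. The code below is an illustration only and is not any of the cited methods: it assumes a precomputed list of residue-residue contacts (a hypothetical input format) and only considers a single cut that splits the chain into two continuous parts.

```python
def contact_counts(contacts, boundary):
    """Count contacts inside each part and across a trial boundary.

    contacts: iterable of (i, j) residue-number pairs that are in contact.
    boundary: residues <= boundary form the N-terminal part.
    """
    n_term = c_term = between = 0
    for i, j in contacts:
        left_i, left_j = i <= boundary, j <= boundary
        if left_i and left_j:
            n_term += 1
        elif not left_i and not left_j:
            c_term += 1
        else:
            between += 1
    return n_term, c_term, between


def best_boundary(contacts, n_residues, min_size=2):
    """Trial boundary with the fewest crossing contacts per internal contact."""
    best = None
    for b in range(min_size, n_residues - min_size + 1):
        n_term, c_term, between = contact_counts(contacts, b)
        internal = n_term + c_term
        score = between / internal if internal else float("inf")
        if best is None or score < best[1]:
            best = (b, score)
    return best


# Toy contact list: two tight triplets joined by a single contact (3, 4).
toy_contacts = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(best_boundary(toy_contacts, n_residues=6))
```

A single-cut scan like this cannot represent discontinuous domains, which is one reason real assignment methods are more involved.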
35Detecting Structural Domains
- However, approaches encounter problems when faced with discontinuous or highly associated domains, and many definitions will require manual interpretation.
- Consequently there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).
36Detecting Domains using Sequence only
- Even more difficult than prediction from
structure!
37Integrating protein multiple sequence alignment,
secondary and tertiary structure prediction in
order to predict structural domain boundaries in
sequence data
SnapDRAGON
- Richard A. George
- George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
38Protein structure hierarchical levels
42SNAPDRAGON: Domain boundary prediction protocol using sequence information alone (Richard George)
- 1. Input: multiple sequence alignment (MSA) and predicted secondary structure
- 2. Generate 100 DRAGON 3D models for the protein structure associated with the MSA
- 3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
- 4. Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.
43SnapDragon
Folds generated by Dragon
Multiple alignment
Boundary recognition (Taylor, 1999)
Predicted secondary structure
Summed and Smoothed Boundaries
CCHHHCCEEE
44SNAPDRAGON: Domain boundary prediction protocol using sequence information alone (Richard George)
- Input: multiple sequence alignment (MSA)
  - Sequence searches using PSI-BLAST (Altschul et al., 1997)
  - followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992)
  - and alignment by PRALINE (Heringa, 1999)
- and predicted secondary structure
  - PREDATOR secondary structure prediction program
45Domain prediction using DRAGON
Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994)
- Folded protein models based on the requirement that (conserved) hydrophobic residues cluster together.
- First construct a random high dimensional Cα distance matrix.
- Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.
46SNAPDRAGON: Domain boundary prediction protocol using sequence information alone (Richard George)
- Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA
  - DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together
  - (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand).
  - It first constructs a random high dimensional Cα (and pseudo Cβ) distance matrix
  - Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure)
- DRAGON: Distance Regularisation Algorithm for Geometry OptimisatioN
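How a predicted secondary structure string might yield distance estimates can be sketched as follows. This is not DRAGON's actual procedure: the rise-per-residue values (about 1.5 Å for an α-helix and about 3.3 Å for a β-strand) are textbook approximations introduced here as assumptions, and the function name is hypothetical.

```python
from itertools import groupby

# Assumed rise per residue in angstroms (textbook values, not taken from DRAGON).
RISE_PER_RESIDUE = {"H": 1.5, "E": 3.3}


def sse_end_distances(ss):
    """For each helix (H) or strand (E) run in a secondary structure string,
    return (first_index, last_index, type, estimated end-to-end distance).
    Indices are 0-based; coil (C) runs are skipped."""
    estimates = []
    position = 0
    for state, run in groupby(ss):
        length = len(list(run))
        if state in RISE_PER_RESIDUE and length > 1:
            distance = (length - 1) * RISE_PER_RESIDUE[state]
            estimates.append((position, position + length - 1, state, distance))
        position += length
    return estimates


# The example string shown on the slides.
for estimate in sse_end_distances("CCHHHCCEEE"):
    print(estimate)
```

Estimates of this kind would fill only a few entries of a target distance matrix; the remaining entries have to come from other sources, such as the hydrophobic clustering requirement.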
47[Figure: DRAGON workflow. Input data (multiple alignment; predicted secondary structure, e.g. CCHHHCCEEE) define an N x N target matrix; 100 randomised initial N x N Cα distance matrices are projected to 3 x N coordinates, giving 100 predictions]
- The Cα distance matrix is divided into smaller clusters.
- Separately, each cluster is embedded into a local centroid.
- The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
48Lysozyme 4lzm
PDB
DRAGON
49Methyltransferase 1sfe
PDB
DRAGON
50Phosphatase 2hhm-A
PDB
DRAGON
51Taylor method (1999): DOMAIN-3D
- 3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
- Easy and clever method
- Uses a notion of spin glass theory (disordered magnetic systems) to delineate domains in a protein 3D structure
- Steps
  - Take sequence with residue numbers (1..N)
  - Look at neighbourhood of each residue (first shell)
  - If (average neighbourhood residue number > resno) resno = resno + 1
  - else resno = resno - 1
  - If (convergence) then take regions with identical residue number as domains and terminate
Taylor, W.R. (1999) Protein structural domain identification. Protein Engineering 12, 203-216
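52Taylor method (1999)
repeat until convergence:
  if 41 < (5 + 6 + 56 + 78 + 89)/5 then Res 41 becomes 42 (up 1)
  else Res 41 becomes 40 (down 1)
Here the neighbourhood average is 234/5 = 46.8, so residue 41 moves up to 42.
[Figure: residue 41 with first-shell neighbours 5, 6, 56, 78 and 89]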
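The update rule from the slides can be written down directly. The sketch below implements only a single update and a single synchronous sweep exactly as stated on the slide. The published method contains further details (for example how convergence is reached) that are not in these notes, so no convergence loop is attempted here. The dictionary-based input format is an assumption made for illustration.

```python
def taylor_update(label, neighbour_labels):
    """Move a residue label up by 1 if the mean label of its first-shell
    neighbours is larger, otherwise down by 1 (rule as stated on the slide)."""
    mean = sum(neighbour_labels) / len(neighbour_labels)
    return label + 1 if mean > label else label - 1


def taylor_sweep(labels, neighbours):
    """One synchronous pass over all residues.

    labels: dict residue -> current label (initially the residue number).
    neighbours: dict residue -> list of neighbouring residues (assumed input).
    Residues without neighbours keep their label.
    """
    updated = {}
    for residue, label in labels.items():
        shell = neighbours.get(residue, [])
        if shell:
            updated[residue] = taylor_update(label, [labels[n] for n in shell])
        else:
            updated[residue] = label
    return updated


def domains_from_labels(labels):
    """Group residues that ended up with identical labels."""
    groups = {}
    for residue, label in sorted(labels.items()):
        groups.setdefault(label, []).append(residue)
    return list(groups.values())


# Worked example from the slide: neighbours average 46.8 > 41, so this prints 42.
print(taylor_update(41, [5, 6, 56, 78, 89]))
```

After repeated sweeps, residues in the same compact region are expected to drift towards a shared label, and `domains_from_labels` then reads the domains off the label values.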
53Taylor method (1999)
initial situation
Iterate until convergence
continuous
discontinuous
54SNAPDRAGON: Domain boundary prediction protocol using sequence information alone (Richard George)
- Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window (assign to central position)
- Window score $= \sum_{i=1}^{l} S_i W_i$, where $S_i$ is the summed boundary score at window position $i$ and $l$ is the window length
- where $W_i = (p - |p - i|)/p^2$ and $p = \frac{1}{2}(l + 1)$
- It follows that $\sum_{i=1}^{l} W_i = 1$
[Figure: triangular weight W_i plotted against window position i]
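The weighted window can be sketched in a few lines. This is a minimal illustration assuming an odd window length and treating positions beyond the sequence ends as zero; both choices are assumptions, not details taken from the paper.

```python
def window_weights(length):
    """Triangular weights W_i = (p - |p - i|) / p**2 with p = (length + 1) / 2."""
    if length < 1 or length % 2 == 0:
        raise ValueError("window length must be a positive odd number")
    p = (length + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, length + 1)]


def smooth_boundaries(scores, length):
    """Assign to each position the weighted sum of the window centred on it.
    Window positions that fall outside the sequence contribute nothing."""
    weights = window_weights(length)
    half = length // 2
    smoothed = []
    for centre in range(len(scores)):
        total = 0.0
        for k, weight in enumerate(weights):
            j = centre - half + k
            if 0 <= j < len(scores):
                total += scores[j] * weight
        smoothed.append(total)
    return smoothed


# Toy boundary counts: four models vote for position 2.
print(window_weights(3))
print(smooth_boundaries([0, 0, 4, 0, 0], 3))
```

Because the weights sum to 1, smoothing spreads a peak over its neighbours without changing the total score away from the sequence ends.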
55SNAPDRAGON
- Statistical significance
  - Convert peak scores to Z-scores using z = (x - mean)/stdev
  - If z > 2 then assign domain boundary
- Statistical significance using random models
  - Test hydrophobic collapse given distribution of hydrophobicity over sequence
  - Make 5 scrambled multiple alignments (MSAs) and predict their secondary structure
  - Make 100 models for each MSA
  - Compile mean and stdev from the boundary distribution over the 500 random models
  - If observed peak z > 2.0 stdev (from random models) then assign domain boundary
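The significance test against random models can be sketched as below. The input lists are hypothetical: `random_scores` stands in for the boundary scores collected from the scrambled-alignment models, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def significant_boundaries(observed, random_scores, threshold=2.0):
    """Return positions whose score lies more than `threshold` standard
    deviations above the mean of the random-model scores."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation (assumed choice)
    if sigma == 0:
        return []
    return [
        position
        for position, score in enumerate(observed)
        if (score - mu) / sigma > threshold
    ]


# Toy numbers: the random models give mean 2 and stdev 1.
random_scores = [1.0, 2.0, 3.0]
observed = [2.0, 5.0, 4.0]  # z-scores 0, 3 and 2
print(significant_boundaries(observed, random_scores))
```

Only the position with z = 3 passes, since the rule requires z to exceed 2 strictly.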
56SnapDRAGON prediction assessment
- Test set of 414 multiple alignments: 183 single and 231 multiple domain proteins.
- Boundary predictions are compared to the region of the protein connecting two domains (maximally ±10 residues from true boundary)
57SnapDRAGON prediction assessment
- Baseline method I
  - Divide sequence in equal parts based on number of domains predicted by SnapDRAGON
- Baseline method II
  - Similar to Wheelan et al., based on domain length partition density function (PDF)
  - PDF derived from 2750 non-redundant structures (deposited at NCBI)
  - Given sequence, calculate probability of one-domain, two-domain, ..., protein
  - Highest probability taken and sequence split equally as in baseline method I
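Both baselines end with the same equal-split step, which is simple enough to write out. A minimal sketch, with a hypothetical function name and integer rounding chosen for illustration:

```python
def equal_split_boundaries(sequence_length, n_domains):
    """Boundary positions that cut a sequence into n_domains equal parts."""
    if n_domains < 1:
        raise ValueError("n_domains must be at least 1")
    return [k * sequence_length // n_domains for k in range(1, n_domains)]


print(equal_split_boundaries(300, 3))
print(equal_split_boundaries(250, 1))
```

A single-domain prediction yields no boundaries, so such proteins contribute no boundary predictions at all.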
58Average prediction results per protein
Coverage is the % of linkers predicted (TP/(TP+FN)). Success is the % of correct predictions made (TP/(TP+FP)).
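The two measures can be computed per protein as sketched below. The ±10 residue tolerance follows the assessment described earlier; how a prediction near two true boundaries, or two predictions near one boundary, should be counted is not specified in these notes, so the counting used here is an assumption.

```python
def assess_predictions(predicted, true_boundaries, tolerance=10):
    """Return (coverage, success) as fractions.

    coverage: fraction of true boundaries with a prediction within tolerance.
    success: fraction of predictions within tolerance of a true boundary.
    """
    def near(position, others):
        return any(abs(position - other) <= tolerance for other in others)

    found = sum(near(t, predicted) for t in true_boundaries)
    correct = sum(near(p, true_boundaries) for p in predicted)
    coverage = found / len(true_boundaries) if true_boundaries else 0.0
    success = correct / len(predicted) if predicted else 0.0
    return coverage, success


# Toy case: one true boundary at 100; predictions at 95 (hit) and 200 (miss).
print(assess_predictions([95, 200], [100]))
```

Over-predicting boundaries raises coverage at the expense of success, which is why both numbers are reported.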
60Protein-protein interaction networks
61Protein Function Prediction
- How can we get the edges (connections) of the cellular networks?
- We can predict functions of genes or proteins so we know where they would fit in a metabolic network
- There are also techniques to predict whether two proteins interact, either functionally (e.g. they are involved in a two-step metabolic process) or directly physically (e.g. are together in a protein complex)
62Protein Function Prediction
The state of the art: it's not complete. Many genes are not annotated, and many more are partially or erroneously annotated. Given a genome which is partially annotated at best, how do we fill in the blanks? For each sequenced genome, the functions of 20-50% of the encoded proteins remain unknown! How then do we build a reasonably complete network when the parts list is so incomplete?
63Protein Function Prediction
For all these reasons, improving automated protein function prediction is now a cornerstone of bioinformatics and computational biology. New methods will need to integrate signals coming from sequence, expression, interaction and structural data, etc.
64Classes of function prediction methods (recap)
- Sequence-based approaches
  - Protein A has function X, and protein B is a homolog (ortholog) of protein A; hence B has function X
- Structure-based approaches
  - Protein A has structure X, and X has such-and-such structural features; hence A's functional sites are ...
- Motif-based approaches
  - A group of genes have function X and they all have motif Y; protein A has motif Y; hence protein A's function might be related to X
- Function prediction based on guilt-by-association
  - Gene A has function X and gene B is often associated with gene A; hence B might have a function related to X
65Phylogenetic profile analysis
- Function prediction of genes based on guilt-by-association: a non-homologous approach
- The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome
- Because proteins that participate in a common structural complex or metabolic pathway are likely to co-evolve, the phylogenetic profiles of such proteins are often similar
- This means that such proteins have a good chance of being physically or metabolically connected
- Phylogenetic profile (against N genomes)
  - For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows
  - If gene X has a homolog in genome i, the ith bit of X's phylogenetic profile is 1; otherwise it is 0
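Building a profile from homology search results is a direct translation of the rule above. In this minimal sketch the genome names, gene names and hit sets are hypothetical stand-ins for real homology search output.

```python
def phylogenetic_profile(genomes, genomes_with_homolog):
    """Bit string with one position per genome: '1' if a homolog was found."""
    return "".join("1" if g in genomes_with_homolog else "0" for g in genomes)


# Hypothetical reference genomes and homolog hits for two genes.
genomes = ["genome_a", "genome_b", "genome_c", "genome_d"]
homolog_hits = {
    "geneX": {"genome_a", "genome_c"},
    "geneY": {"genome_a", "genome_c", "genome_d"},
}

profiles = {
    gene: phylogenetic_profile(genomes, hits)
    for gene, hits in homolog_hits.items()
}
print(profiles)
```

The genome order must be the same for every gene, otherwise the bit positions of different profiles cannot be compared.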
67Phylogenetic profile analysis
- Example phylogenetic profiles based on 60 genomes
(rows: genes; columns: genomes)
gene     profile
orf1034  1110110110010111110100010100000000111100011111110110111010101
orf1036  1011110001000001010000010010000000010111101110011011010000101
orf1037  1101100110000001110010000111111001101111101011101111000010100
orf1038  1110100110010010110010011100000101110101101111111111110000101
orf1039  1111111111111111111111111111111111111111101111111111111111101
orf104   1000101000000000000000101000000000110000000000000100101000100
orf1040  1110111111111101111101111100000111111100111111110110111111101
orf1041  1111111111111111110111111111111101111111101111111111111111101
orf1042  1110100101010010010110000100001001111110111110101101100010101
orf1043  1110100110010000010100111100100001111110101111011101000010101
orf1044  1111100111110010010111010111111001111111111111101101100010101
orf1045  1111110110110011111111111111111101111111101111111111110010101
orf1046  0101100000010001011000000111110000010100000001010010100000000
orf1047  0000000000000001000010000001000100000000000000010000000000000
orf105   0110110110100010111101101010111001101100101111100010000010001
orf1054  0100100110000001100001000100000000100100100001000100100000000
By correlating the rows (open reading frames (ORFs) or genes) you find out about joint presence or absence of genes: this is a signal for a functional connection.
Genes with similar phylogenetic profiles have related functions or are functionally linked (D. Eisenberg and colleagues, 1999)
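Correlating rows can be as simple as counting mismatching bits. The sketch below uses the Hamming distance as the similarity measure, which is one possible choice and not necessarily the measure used in the original work; the toy profiles and the mismatch threshold are hypothetical.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of genomes in which exactly one of the two genes is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def linked_pairs(profiles, max_mismatches=1):
    """Gene pairs whose profiles differ in at most max_mismatches positions."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        distance = hamming(p1, p2)
        if distance <= max_mismatches:
            pairs.append((g1, g2, distance))
    return pairs


# Toy profiles over six genomes (not taken from the table above).
toy_profiles = {"geneA": "110100", "geneB": "110110", "geneC": "001011"}
print(linked_pairs(toy_profiles))
```

Genes present in nearly every genome match each other trivially under this measure, so in practice such uninformative profiles need separate handling.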
68Phylogenetic profile analysis
- Phylogenetic profiles contain a great amount of functional information
- Phylogenetic profile analysis can be used to distinguish orthologous genes from paralogous genes
- Example, subcellular localization: 361 yeast nucleus-encoded mitochondrial proteins were identified at 50% accuracy with 58% coverage through phylogenetic profile analysis
- Functional complementarity: by examining inverse phylogenetic profiles, one can find functionally complementary genes that might have evolved through one of several mechanisms of convergent evolution.
- Phylogenetic profiling typically has low accuracy (specificity) but can have high coverage.
69Domain fusion example
- Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt)
- In insects, the polypeptide appears as GARs-(AIRs)2-GARt
- In yeast, GARs-AIRs is encoded separately from GARt
- In bacteria each domain is encoded separately (Henikoff et al., 1997).
- GAR: glycinamide ribonucleotide
- AIR: aminoimidazole ribonucleotide
70Using observed domain fusion for prediction of protein-protein interactions: Rosetta stone method
- Gene fusion is an effective method for prediction of protein-protein interactions
- If proteins A and B are homologous to two domains of a multi-domain protein C, A and B are predicted to interact
[Figure: proteins A and B aligned to separate domains of the fused protein C]
Though gene fusion has low prediction coverage, its false-positive rate is low (high specificity)
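The Rosetta stone rule can be sketched on top of homology search hits. The hit format below, a list of (composite protein, start, end) tuples per query protein, is a hypothetical representation chosen for illustration, and requiring the matched regions to be fully disjoint is an assumption about what counts as "two domains".

```python
from itertools import combinations


def rosetta_stone_pairs(hits):
    """Predict interacting pairs (A, B) that match non-overlapping regions
    of the same composite protein C.

    hits: dict protein -> list of (composite, start, end) homology matches.
    """
    predictions = set()
    for a, b in combinations(sorted(hits), 2):
        for comp_a, start_a, end_a in hits[a]:
            for comp_b, start_b, end_b in hits[b]:
                same_composite = comp_a == comp_b
                disjoint = end_a < start_b or end_b < start_a
                if same_composite and disjoint:
                    predictions.add((a, b, comp_a))
    return sorted(predictions)


# Toy hits: A and D match the first domain of C, B matches the second.
toy_hits = {
    "A": [("C", 1, 150)],
    "B": [("C", 160, 320)],
    "D": [("C", 10, 140)],
}
print(rosetta_stone_pairs(toy_hits))
```

A and D are not paired because they match the same region of C, which suggests they are homologs of each other and not fusion partners.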
71Protein interaction database
- There are numerous databases of protein-protein interactions
- DIP is a popular protein-protein interaction database
The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions.
72Protein interaction databases
- BIND - Biomolecular Interaction Network Database
- DIP - Database of Interacting Proteins
- PIM Hybrigenics
- PathCalling Yeast Interaction Database
- MINT - a Molecular Interactions Database
- GRID - The General Repository for Interaction Datasets
- InterPreTS - protein interaction prediction through tertiary structure
- STRING - predicted functional associations among genes/proteins
- Mammalian protein-protein interaction database (PPI)
- InterDom - database of putative interacting protein domains
- FusionDB - database of bacterial and archaeal gene fusion events
- IntAct Project
- The Human Protein Interaction Database (HPID)
- ADVICE - Automated Detection and Validation of Interaction by Co-evolution
- InterWeaver - protein interaction reports with online evidence
- PathBLAST - alignment of protein interaction networks
- ClusPro - a fully automated algorithm for protein-protein docking
- HPRD - Human Protein Reference Database
73Protein interaction database
74Network of protein interactions and predicted functional links involving silencing information regulator (SIR) proteins. Filled circles represent proteins of known function; open circles represent proteins of unknown function, represented only by their Saccharomyces genome sequence numbers (http://genome-www.stanford.edu/Saccharomyces). Solid lines show experimentally determined interactions, as summarized in the Database of Interacting Proteins (ref. 19) (http://dip.doe-mbi.ucla.edu). Dashed lines show functional links predicted by the Rosetta Stone method (ref. 12). Dotted lines show functional links predicted by phylogenetic profiles (ref. 16). Some predicted links are omitted for clarity.
75Network of predicted functional linkages involving the yeast prion protein Sup35 (ref. 20). The dashed line shows the only experimentally determined interaction. The other functional links were calculated from genome and expression data (ref. 11) by a combination of methods, including phylogenetic profiles, Rosetta stone linkages and mRNA expression. Linkages predicted by more than one method, and hence particularly reliable, are shown by heavy lines. Adapted from ref. 11.
76STRING - predicted functional associations among
genes/proteins
- STRING is a database of predicted functional associations among genes/proteins.
- Genes of similar function tend to be maintained in close neighborhood, tend to be present or absent together, i.e. to have the same phylogenetic occurrence, and can sometimes be found fused into a single gene encoding a combined polypeptide.
- STRING integrates this information from as many genomes as possible to predict functional links between proteins.
Berend Snel (UU), Martijn Huynen (RUN) and the group of Peer Bork (EMBL, Heidelberg)
77STRING - predicted functional associations among
genes/proteins
- STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources:
  - Genomic Context (Synteny)
  - High-throughput Experiments
  - (Conserved) Co-expression
  - Previous Knowledge
- STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently contains 736429 proteins in 179 species
78STRING - predicted functional associations among
genes/proteins
Conserved Neighborhood: This view shows runs of genes that occur repeatedly in close neighborhood in (prokaryotic) genomes. Genes located together in a run are linked with a black line (maximum allowed intergenic distance is 300 bp). Note that if there are multiple runs for a given species, these are separated by white space. If there are other genes in the run that are below the current score threshold, they are drawn as small white triangles. Gene fusion occurrences are also drawn, but only if they are present in a run.
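Finding runs of neighbouring genes within one genome can be sketched with the 300 bp intergenic limit mentioned above. This is an illustration only and not STRING's implementation: it ignores strand, looks at a single genome, and uses hypothetical gene coordinates. Whether a run recurs across genomes would be a separate comparison step.

```python
def gene_runs(genes, max_gap=300):
    """Group genes into runs where consecutive genes are at most max_gap bp apart.

    genes: list of (name, start, end) tuples on one replicon, in any order.
    Returns only runs containing at least two genes.
    """
    runs = []
    current = []
    previous_end = None
    for name, start, end in sorted(genes, key=lambda gene: gene[1]):
        if previous_end is not None and start - previous_end > max_gap:
            runs.append(current)
            current = []
        current.append(name)
        previous_end = end
    if current:
        runs.append(current)
    return [run for run in runs if len(run) > 1]


# Hypothetical coordinates: g1/g2 and g3/g4 are close, the two pairs are not.
toy_genes = [
    ("g3", 2500, 3300),
    ("g1", 100, 900),
    ("g2", 950, 1800),
    ("g4", 3400, 4000),
]
print(gene_runs(toy_genes))
```

Runs found this way in several genomes could then be compared gene by gene to see which neighbourhoods are conserved.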
79Wrapping up
- Understand the chymotrypsin example: evolution via gene duplication of an optimised two-domain barrel enzyme with active site residues from either domain.
- Understand domain issues: structural and functional
- Understand the basic steps of the SnapDRAGON method for domain boundary prediction, but no need to memorize it all
- Understand phylogenetic profiling and the Rosetta Stone method (guilt-by-association)
- Understand that conservation patterns in the order of genes that are nearby on the genome (synteny) indicate functional relationships (used in STRING method)
- Also co-expression (genes being expressed (or not) at the same time) indicates a functional relationship (used in STRING method)