Title: Identification of Protein Domains
1Identification of Protein Domains
- Eden Dror
- Menachem Schechter
Computational Biology Seminar 2004
2Overview
- Introduction to protein domains.
- Classification of homologs.
- Representing a domain.
- PSSM
- HMM
- Internet resources
- Pfam
- SMART
- PROSITE
- InterPro
- Research example.
3Protein domains
- A discrete portion of a protein assumed to fold independently and to possess its own function.
- Mobile domain (module): a domain that can be found associated with different domain combinations in different proteins.
4Protein domains
- The assumption: the domain is the fundamental unit of protein structure and function.
- Protein family: all proteins containing a specific domain.
5What can we learn from them?
- Common ancestors: homology information for a set of proteins.
- Homology can imply properties of a protein, such as functionality and localization.
- Therefore, domains can be used to assign a new protein to a family and so infer its functionality.
6Classification of homologs
- Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes.
- Homologous genes can arise in two major ways:
- Gene duplication (within the same species).
- Speciation (splitting of one species into two).
7Classification of homologs
8Classification of homologs
- Orthologs: two genes from two different species that derive from a single gene in the last common ancestor of the species.
- Paralogs: two genes that derive from a single gene that was duplicated within a genome.
9Classification of homologs
10Classification of homologs
- Inparalogs: paralogs that evolved by gene duplication after the speciation event.
- Outparalogs: paralogs that evolved by gene duplication before the speciation event.
11Classification of homologs
12What can we learn from them?
- Orthologous proteins are evolutionary, and typically functional, counterparts in different species.
- Paralogous proteins are important for detecting lineage-specific adaptations.
- Both can reveal information about a specific species or a set of species.
13Protein domains summary
- By identifying domains we can:
- Infer the functionality and localization of a protein.
- Learn about a specific species.
- Learn about a set of species as a group.
14Domain representation
- Different methods to represent (model) domains:
- Patterns (regular expressions).
- PSSM (Position specific score matrix).
- HMM (Hidden Markov model).
15PSSM
- Position specific score matrix
- A score matrix giving the score for having each amino acid at a given position in a specific sequence.
- Based on the independent probabilities P(a_i) of observing amino acid a at position i.
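One common way to turn per-position probabilities into scores is a log-odds matrix. The sketch below assumes an ungapped toy alignment, a uniform background distribution, and a pseudocount of 1; none of these choices come from the slides, and the function name `build_pssm` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        column = {}
        for a in AMINO_ACIDS:
            p = (counts[a] + pseudocount) / total  # estimate of P(a_i)
            column[a] = math.log2(p / background)  # log-odds score
        pssm.append(column)
    return pssm

# Toy alignment, invented for illustration only.
toy_alignment = ["ACD", "ACE", "ACD", "GCD"]
pssm = build_pssm(toy_alignment)
print(round(pssm[1]["C"], 3))  # conserved column: positive score
```

The pseudocount keeps amino acids that never appear in a column from receiving a score of minus infinity.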
16PSSM Example
17PSSM Identifying a domain
- Given a sequence and a PSSM:
- Run over all positions.
- Score each sub-sequence according to the matrix.
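The scan described above can be sketched as a sliding window. The matrix `TOY_PSSM`, the default penalty for unlisted residues, and the query sequence are all made up for illustration.

```python
# Hypothetical 3-position PSSM; residues not listed at a position
# receive DEFAULT_SCORE.
TOY_PSSM = [
    {"A": 2.0, "G": 0.5},
    {"C": 3.0},
    {"D": 1.5, "E": 1.0},
]
DEFAULT_SCORE = -1.0

def scan_sequence(sequence, pssm, default=DEFAULT_SCORE):
    """Score every window of len(pssm) residues along the sequence."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i].get(res, default) for i, res in enumerate(window))
        hits.append((start, window, score))
    return hits

hits = scan_sequence("MKACDLGCE", TOY_PSSM)
best = max(hits, key=lambda hit: hit[2])
print(best)  # (2, 'ACD', 6.5)
```

In practice a score threshold would decide which windows are reported as domain matches; the choice of threshold is outside this sketch.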
18HMM Hidden Markov Model
- Markov model: a way of describing a process that goes through a series of states.
- Each state has a probability of transitioning to each of the other states.
- x_i is a random variable representing the state at step i.
19HMM Markov Model
- Example:
- States ∈ {0, 1}
20HMM Markov Model
21HMM Markov Model
- State transition example:
- States are the nucleotides A, T, G, C.
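As a rough illustration of a Markov model over nucleotides, the sketch below computes the probability of a short DNA sequence as the initial probability times the product of the transition probabilities. The transition table and the uniform initial distribution are invented values, not taken from the slides.

```python
import math

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}
# Assumed uniform distribution over the first nucleotide.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def sequence_probability(seq):
    """P(x_1) times the product of P(x_i | x_{i-1}) along the sequence."""
    p = INITIAL[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= TRANSITIONS[prev][cur]
    return p

print(sequence_probability("ACG"))
```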
22HMM Hidden Markov Model
- Hidden Markov model
- Each state x emits an output y with a specific probability.
- We only know the output (observations).
- Thus, the states are hidden.
23HMM Hidden Markov Model
- Example: states ∈ {0, 1}, output ∈ {0, 1}
24HMM Hidden Markov Model
25HMM What can we do with it?
- Given (A, B), the transition and emission probabilities, we can compute:
- The probability of given state and output sequences.
- The probability of a given output sequence.
- The most likely sequence of states that generated a given output sequence.
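The three computations listed above can be sketched for a two-state HMM with outputs in {0, 1}. The numbers in `A`, `B`, and the initial distribution `PI` are hypothetical; `PI` in particular is an extra assumption, since the slides mention only (A, B). The second function is the standard forward algorithm and the third is the Viterbi algorithm.

```python
# Hypothetical two-state HMM: states and outputs both take values in {0, 1}.
A = [[0.7, 0.3], [0.4, 0.6]]  # A[r][s]: probability of moving from state r to s
B = [[0.9, 0.1], [0.2, 0.8]]  # B[s][y]: probability that state s emits output y
PI = [0.6, 0.4]               # assumed initial state distribution
N = len(PI)

def joint_probability(states, outputs):
    """Probability of a given state sequence together with its outputs."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][outputs[t]]
    return p

def forward(outputs):
    """Probability of an output sequence, summed over all state paths."""
    alpha = [PI[s] * B[s][outputs[0]] for s in range(N)]
    for y in outputs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(N)) * B[s][y]
                 for s in range(N)]
    return sum(alpha)

def viterbi(outputs):
    """Most likely state sequence for an output sequence."""
    best = [PI[s] * B[s][outputs[0]] for s in range(N)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for y in outputs[1:]:
        pointers, new_best = [], []
        for s in range(N):
            r = max(range(N), key=lambda prev: best[prev] * A[prev][s])
            pointers.append(r)
            new_best.append(best[r] * A[r][s] * B[s][y])
        back.append(pointers)
        best = new_best
    state = max(range(N), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

observed = [0, 0, 1, 1]
print(joint_probability([0, 0, 1, 1], observed))
print(forward(observed))
print(viterbi(observed))  # [0, 0, 1, 1]
```

For long sequences these products underflow, so real implementations usually work with logarithms or rescale at each step.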
26HMM What can we do with it?
- Learning:
- Given state and output sequences, calculate the most probable (A, B).
- Easy when the states are known.
- Otherwise, use a training algorithm.
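When the states are known, the most probable (A, B) can be estimated by counting transitions and emissions and normalizing each row. The sketch below does only that; the function name, the toy data, and the optional pseudocount are illustrative assumptions. When the states are hidden, an iterative training algorithm such as Baum-Welch is the usual choice, and it is not shown here.

```python
def estimate_parameters(state_seqs, output_seqs, n_states=2, n_outputs=2,
                        pseudocount=0.0):
    """Estimate (A, B) by counting when the state sequences are known."""
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    emit = [[pseudocount] * n_outputs for _ in range(n_states)]
    for states, outputs in zip(state_seqs, output_seqs):
        for s, y in zip(states, outputs):
            emit[s][y] += 1           # state s emitted output y
        for prev, cur in zip(states, states[1:]):
            trans[prev][cur] += 1     # transition prev -> cur

    def normalize(rows):
        result = []
        for row in rows:
            total = sum(row)
            result.append([c / total if total else 0.0 for c in row])
        return result

    return normalize(trans), normalize(emit)

# Toy labelled data, invented for illustration.
A_hat, B_hat = estimate_parameters([[0, 0, 1, 1, 0]], [[0, 0, 1, 1, 1]])
print(A_hat)
print(B_hat)
```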
27HMM Profile HMM
- Use HMM to represent sequence families.
- A particular type of HMM suited to modeling multiple alignments.
- (Assume we have a multiple alignment.)
28HMM Trivial profile HMM
- We begin with ungapped regions.
- Each position corresponds to a state.
- Transitions are of probability 1.
29HMM Trivial profile HMM
- Let e_i(a) be the independent probability of observing amino acid a at position i.
- The probability of a new sequence x, according to the model
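Because each position is treated independently and every transition has probability 1, the probability presumably takes the standard product form below. Here M stands for the model and L for its length; both symbols are introduced for illustration.

```latex
P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)
```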
30HMM Trivial profile HMM
- We can score the sequence x
- Where q indicates the probability under a random
model.
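A natural score under these definitions is the log-odds ratio between the model and the random model. Assuming the random model emits each residue independently with probability q, it would read as follows, with L again denoting the model length:

```latex
S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}
```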
31HMM Trivial profile HMM
- Consider the values
- They behave like elements in a score matrix.
- The trivial profile HMM is equivalent to a PSSM.
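The equivalence can be checked numerically on a toy example. The sketch assumes the values in question are log(e_i(a) / q_a), uses a two-letter alphabet, and takes invented emission and background probabilities.

```python
import math
from itertools import product

# Hypothetical two-letter alphabet with background probabilities q.
BACKGROUND = {"A": 0.5, "C": 0.5}
# Hypothetical emission probabilities e_i(a) for a 2-position model.
EMISSIONS = [{"A": 0.9, "C": 0.1}, {"A": 0.2, "C": 0.8}]

def hmm_log_odds(x):
    """Log of P(x | model) / P(x | random model) for the ungapped HMM."""
    p_model = math.prod(EMISSIONS[i][a] for i, a in enumerate(x))
    p_random = math.prod(BACKGROUND[a] for a in x)
    return math.log(p_model / p_random)

# The same information written as a score matrix.
PSSM = [{a: math.log(e[a] / BACKGROUND[a]) for a in e} for e in EMISSIONS]

def pssm_score(x):
    """Sum of the matrix entries selected by the residues of x."""
    return sum(PSSM[i][a] for i, a in enumerate(x))

for x in ("".join(pair) for pair in product("AC", repeat=2)):
    print(x, round(hmm_log_odds(x), 6), round(pssm_score(x), 6))
```

Both columns of the printed table agree, since the logarithm of a product is the sum of the logarithms.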
32HMM profile HMM
- Let's make the model non-trivial by allowing for gaps: insertions and deletions.
- Start off with the PSSM-equivalent HMM.
33HMM profile HMM
- Handling insertions:
- Introduce new states I_j that match insertions after position j.
- These states use the emission probabilities of the random model.
34HMM profile HMM
- The score of a gap of length k
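A gap of length k enters the insert state I_j once, loops in it k-1 times, and leaves it once. Assuming the insert states emit with the random-model probabilities, so that their emissions contribute nothing to a log-odds score, the gap score would take the standard form below. The symbol a for transition probabilities and M_j for the match state at position j are introduced here for illustration.

```latex
\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1)\,\log a_{I_j I_j}
```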
35HMM profile HMM
- Handling deletions:
- Introduce silent states D_j.
- These states do not emit.
36HMM profile HMM
37Internet resources
- Databases of protein families.
- Family information and identification.
- Considerations
- Type of representation (pattern, PSSM, HMM).
- Choice of seed multiple alignment proteins.
- Quality control.
- Database features (links, annotations, views).
- Database Specificity (organism, functions).
38Pfam Home
39Pfam
- Protein families database of alignments and HMMs
- Uses profile-HMMs to represent families.
- For each family in Pfam you can
- Look at multiple alignments
- View protein domain architectures
- Examine species distribution
- Follow links to other databases
- View known protein structures
40Pfam Databases
- Two databases:
- Pfam-A: curated multiple alignments.
- Grows slowly.
- Quality controlled by experts.
- Pfam-B: automatic clustering (ProDom-derived).
- Complements Pfam-A.
- New sequences instantly incorporated.
- Unchecked: false positives, etc.
41Pfam Features
- Search by Sequence, keyword, domain, taxonomy.
- Browsing by family or genome.
- Evolutionary tree
42Pfam Construction
- Sources of seed alignments:
- Pfam-B families.
- Published articles.
- "Domain hunting" studies.
- Occasionally, entries from other databases (e.g. MEROPS for peptidases).
43Pfam Domain information
44Pfam Domain organization
45Pfam Multiple alignment
46Pfam HMM logo
47Pfam Species distribution
48Pfam Genome comparison
49PROSITE
- Database of protein families.
- Matching according to simple patterns or PSSM profiles.
- Browsing all proteins of a specific family.
- The latest release contains 1696 protein families.
50PROSITE Features
- Comprehensive domain documentation.
- All profile matches checked by experts.
- Specificity/sensitivity:
- Specificity: true-pos / all-pos
- Sensitivity: true-pos / (true-pos + false-neg)
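The two ratios can be written as small helper functions. The sketch assumes that "all-pos" means every reported match, that is, true positives plus false positives; a ratio defined this way is often called precision elsewhere. The counts in the example are invented.

```python
def specificity(true_pos, false_pos):
    """true-pos / all-pos, taking all-pos as true-pos + false-pos (assumption)."""
    return true_pos / (true_pos + false_pos)

def sensitivity(true_pos, false_neg):
    """true-pos / (true-pos + false-neg)."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts for one family profile.
tp, fp, fn = 90, 10, 30
print(specificity(tp, fp))  # 0.9
print(sensitivity(tp, fn))  # 0.75
```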
51PROSITE Example
- Specificity of Zinc finger C2H2 type domain
52SMART
53SMART
- Simple Modular Architecture Research Tool
- Identification and annotation of genetically mobile domains and the analysis of domain architectures.
- SMART consists of a library of HMMs.
- Contains 665 HMMs to date.
54SMART Features
- Finding proteins containing specific domains, i.e. of the same family.
- Function prediction
- Sub-cellular localization
- Binding partners
- Architecture
- Alternative splicing information
- Orthology information
55SMART Domain selection example
- Tyrosine kinase (TyrKc) AND Transmembrane region
(TRANS)
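Conceptually, an AND query of this kind selects the proteins whose domain set contains every requested domain. The sketch below illustrates the idea on an invented in-memory table; it does not use or reproduce any SMART interface, and the protein names are placeholders.

```python
# Hypothetical mapping from protein name to its annotated domains.
PROTEIN_DOMAINS = {
    "protein_1": {"TyrKc", "TRANS", "SH2"},
    "protein_2": {"TyrKc"},
    "protein_3": {"TRANS"},
    "protein_4": {"TyrKc", "TRANS"},
}

def select_all(required, protein_domains):
    """Return the proteins that contain every domain in `required`."""
    return sorted(name for name, domains in protein_domains.items()
                  if required <= domains)

matches = select_all({"TyrKc", "TRANS"}, PROTEIN_DOMAINS)
print(matches)  # ['protein_1', 'protein_4']
```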
56InterPro
- InterPro combines 9 other databases, such as SMART, Pfam, ProDom and more.
- Queries can use many different methods (as the other databases use different methods).
- However, thresholds are predefined and cannot be changed for those methods.
57InterPro
- Provides more results, but can sometimes be redundant.
- Coverage statistics:
- 93% of Swiss-Prot v42.5 (128540 out of 138922 proteins).
- 81% of TrEMBL v25.5 (819966 out of 1013263 proteins).
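The coverage percentages follow directly from the protein counts quoted above, as this short check shows. The helper name `coverage_percent` is hypothetical.

```python
def coverage_percent(covered, total):
    """Share of proteins with at least one match, as a percentage."""
    return 100.0 * covered / total

swissprot = coverage_percent(128540, 138922)
trembl = coverage_percent(819966, 1013263)
print(round(swissprot), round(trembl))  # 93 81
```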
58InterPro Features
- Searching by Protein/DNA sequences
- Finding domains homologs
- List of InterPro entries of type
- Family
- Domain
- Repeat
- PTM: post-translational modification
- Binding Site
- Active Site
- Keyword
59InterPro Example
60Research Example Introduction
- Goal: the systematic identification of novel protein domain families.
- Using computational methods.
61Research Example Method
62Research Example Results
- 28 new domains identified:
- 15 domains in diverse contexts, in different species.
- 3 species-specific domains.
- 7 domains with weak similarity to previously described domains.
- 3 extension domains.
63Predictions of Function
- On the basis of reports in the literature and/or co-occurrence with other identified domains, functional features can be predicted for our novel domain families.
- Examples:
- Chromatin binding
- Protein Interaction
- Predicted sub-cellular localization
64Predictions of Function: Chromatin-binding example
- The novel domain CSZ is contained in the protein SPT6, which regulates transcription via chromatin structure modification.
- SPT6 has an experimentally confirmed histone-binding capability.
- Other domains (S1, SH2) in SPT6 are unlikely to bind histones or chromatin.
- Conclusion: CSZ has a predicted histone-binding function.
65Predictions of Function: Localization example
- Some of the novel domains are only found within proteins from the initial set of nuclear domains.
- This predicts that these domains have a nuclear function.
- The other domains are likely to have roles in both the nucleus and the cytoplasm.
66Conclusion
- Domains are the functional units of proteins.
- Identifying a domain within a new protein may teach us much about it.
- There are several types of models to represent domains.
- These models can also be used to identify the domain they represent.
- Many Internet databases are available to catalogue and identify families.
- There is a protocol for identifying new domains using known ones.
67Resources
- Pfam: http://www.sanger.ac.uk/Software/Pfam/
- SMART: http://smart.embl-heidelberg.de/
- PROSITE: http://www.expasy.org/prosite/
- InterPro: http://www.ebi.ac.uk/interpro/
68The End