Title: Identification of Protein Domains

1
Identification of Protein Domains

Eden Dror
Menachem Schechter

Computational Biology Seminar 2004
2
Overview

Introduction to protein domains.
Classification of homologs.
Representing a domain.
  PSSM
  HMM
Internet resources.
  Pfam
  SMART
  PROSITE
  InterPro
Research example.

3
Protein domains

A discrete portion of a protein assumed to fold
independently, and possessing its own function.
Mobile domain (module): a domain that can be
found associated with different domain
combinations in different proteins.

4
Protein domains

The assumption: the domain is the fundamental
unit of protein structure and function.
Protein family: all proteins containing a
specific domain.

5
What can we learn from them?

Common ancestors: homology information for a set
of proteins.
Homology can imply properties of a protein, such
as functionality and localization.
Therefore, domains can be used to assign a new
protein to a family, inferring functionality.

6
Classification of homologs

Homology is not a sufficiently well-defined term
to describe the evolutionary relationships
between genes.
Homologous genes can arise in two major ways:
Gene duplication (in the same species).
Speciation (splitting of one species into two).

7
Classification of homologs
8
Classification of homologs

Orthologs: two genes from two different species
that derive from a single gene in the last common
ancestor of the species.
Paralogs: two genes that derive from a single
gene that was duplicated within a genome.

9
Classification of homologs
10
Classification of homologs

Inparalogs - paralogs that evolved by gene
duplication after the speciation event.
Outparalogs - paralogs that evolved by gene
duplication before the speciation event.

11
Classification of homologs
12
What can we learn from them?

Orthologous proteins are evolutionary, and
typically functional, counterparts in different
species.
Paralogous proteins are important for detecting
lineage-specific adaptations.
Both can reveal information about a specific
species or a set of species.

13
Protein domains summary

By identifying domains we can:
Infer the functionality and localization of a
protein.
Learn about a specific species.
Learn about a set of species as a group.

14
Domain representation

Different methods to represent (model) domains:
Patterns (regular expressions).
PSSM (Position specific score matrix).
HMM (Hidden Markov model).
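To make the first option concrete, here is a minimal sketch of turning a PROSITE-style pattern into a Python regular expression. The helper name prosite_to_regex, the subset of syntax it handles (x, repeat counts, [..] and {..} sets), and the toy sequence are all assumptions made for illustration; the example pattern is a C2H2 zinc-finger-like pattern and should be checked against the database before being relied on.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only a subset of the syntax: single residues, x (any residue),
    [..] allowed sets, {..} forbidden sets, and (n) or (n,m) repeat counts.
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split an optional repeat count such as "(3)" or "(2,4)" off the element.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # x means any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} means any residue except these
        # "[..]" sets and single letters are already valid regex fragments.
        if lo and hi:
            core += "{%s,%s}" % (lo, hi)
        elif lo:
            core += "{%s}" % lo
        parts.append(core)
    return "".join(parts)

# Assumed example pattern (C2H2 zinc-finger-like) and a made-up toy sequence.
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
REGEX = prosite_to_regex(PATTERN)
TOY_SEQUENCE = "GG" + "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(REGEX, TOY_SEQUENCE)
```

A pattern either matches or it does not, which is the main contrast with the score-based PSSM and HMM representations.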

15
PSSM

Position specific score matrix
Score matrix representing the score for having
each amino acid in a given position in a specific
sequence.
Based on the independent probabilities P(ai) of
observing amino acid a in position i.

16
PSSM Example
17
PSSM Identifying a domain

Given a sequence and a PSSM:
Run over all positions.
Score each sub-sequence according to the matrix.
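The two steps above can be sketched as a sliding-window scan. This is only an illustration: the function name scan_pssm, the penalty used for residues missing from a column, and the three-column toy matrix are assumptions, not values from the presentation.

```python
def scan_pssm(sequence, pssm, missing=-4.0):
    """Score every window of `sequence` against `pssm`.

    `pssm` is a list of columns; each column maps an amino acid to its score.
    Residues absent from a column get the (assumed) penalty `missing`.
    Returns a list of (start_position, score) pairs.
    """
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, missing) for col, res in zip(pssm, window))
        scores.append((start, score))
    return scores

# Hypothetical three-position matrix, for illustration only.
TOY_PSSM = [
    {"C": 2.0, "A": -1.0},
    {"H": 1.5, "A": -0.5},
    {"C": 2.0, "A": -1.0},
]

hits = scan_pssm("AACHCA", TOY_PSSM)
best_start, best_score = max(hits, key=lambda h: h[1])
```

In practice a window would be reported as a domain hit only if its score passes some threshold; how that threshold is chosen is outside this sketch.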

18
HMM Hidden Markov Model

Markov model: a way of describing a process that
goes through a series of states.
Each state has a probability of transitioning to
each of the other states.
x_i is a random variable denoting the state.

19
HMM Markov Model

Example
States ∈ {0, 1}

20
HMM Markov Model

Transition matrix

21
HMM Markov Model

State transition example
States are the nucleotides A, T, G, C.
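As a rough illustration of a Markov model over nucleotides, the sketch below computes the probability of a sequence from an initial distribution and a transition table. All probabilities here are made up for the example, and the names INITIAL, TRANSITIONS and chain_log_prob are hypothetical.

```python
import math

STATES = "ATGC"

# Hypothetical transition probabilities: TRANSITIONS[prev][next].
# Each row sums to 1.
TRANSITIONS = {
    "A": {"A": 0.4, "T": 0.2, "G": 0.2, "C": 0.2},
    "T": {"A": 0.2, "T": 0.4, "G": 0.2, "C": 0.2},
    "G": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

# Assumed uniform distribution over the first nucleotide.
INITIAL = {s: 0.25 for s in STATES}

def chain_log_prob(seq):
    """Log-probability of a nucleotide sequence under the Markov chain."""
    logp = math.log(INITIAL[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(TRANSITIONS[prev][nxt])
    return logp
```

Working in log space avoids numerical underflow when the sequence is long.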

22
HMM Hidden Markov Model

Hidden Markov model
Each state x emits an output y with a specific
probability.
We only know the output (observations).
Thus, the states are hidden.

23
HMM Hidden Markov Model

Example: states ∈ {0, 1}, outputs ∈ {0, 1}

24
HMM Hidden Markov Model

Emission matrix

25
HMM What can we do with it?

Given (A, B), the transition and emission
matrices, we can compute:
Probability of given states and outputs

Probability of a given output sequence

Most likely sequence of states that generated a
given output sequence
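The three questions above correspond to a direct product, the forward algorithm, and the Viterbi algorithm. The sketch below is one plain-Python way to write them for small models; the initial distribution PI is an extra assumption (the slide only names A and B), and the numeric values are invented for illustration.

```python
def joint_prob(states, outputs, pi, A, B):
    """Probability of a given state path together with its outputs."""
    p = pi[states[0]] * B[states[0]][outputs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][outputs[t]]
    return p

def forward_prob(outputs, pi, A, B):
    """Probability of an output sequence, summed over all state paths."""
    n = len(pi)
    alpha = [pi[s] * B[s][outputs[0]] for s in range(n)]
    for y in outputs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][y]
                 for s in range(n)]
    return sum(alpha)

def viterbi(outputs, pi, A, B):
    """Most likely state path for an output sequence, and its probability."""
    n = len(pi)
    delta = [pi[s] * B[s][outputs[0]] for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for y in outputs[1:]:
        prev = delta
        ptr = [max(range(n), key=lambda r: prev[r] * A[r][s]) for s in range(n)]
        delta = [prev[ptr[s]] * A[ptr[s]][s] * B[s][y] for s in range(n)]
        back.append(ptr)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], delta[last]

# Hypothetical two-state, two-output model.
PI = [0.5, 0.5]                # assumed initial state distribution
A = [[0.9, 0.1], [0.2, 0.8]]   # A[r][s]: probability of moving from r to s
B = [[0.8, 0.2], [0.3, 0.7]]   # B[s][y]: probability that state s emits y

best_path, best_p = viterbi([0, 1, 1], PI, A, B)
```

For long sequences these products underflow, so a practical version would work with log probabilities or rescale at each step.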

26
HMM What can we do with it?

Learning
Given state and output sequences, calculate the
most probable (A, B).
Easy when the states are known.
Otherwise use a training algorithm.
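For the easy case, where the states are known, the most probable parameters come from counting transitions and emissions and normalizing each row. A minimal sketch follows; the function name, the integer encoding of states and outputs, and the pseudocount default are assumptions. When the states are hidden, an iterative procedure such as Baum-Welch is the usual choice, which is not shown here.

```python
from collections import Counter

def estimate_parameters(states, outputs, n_states, n_symbols, pseudocount=1.0):
    """Estimate (A, B) from a labelled state path and its outputs by counting.

    States are integers in range(n_states), outputs in range(n_symbols).
    With pseudocount=0 every state must occur with at least one outgoing
    transition and one emission, otherwise a row would divide by zero.
    """
    trans = Counter(zip(states, states[1:]))  # (from_state, to_state) counts
    emit = Counter(zip(states, outputs))      # (state, output) counts
    A, B = [], []
    for s in range(n_states):
        row = [trans[(s, t)] + pseudocount for t in range(n_states)]
        A.append([c / sum(row) for c in row])
        row = [emit[(s, y)] + pseudocount for y in range(n_symbols)]
        B.append([c / sum(row) for c in row])
    return A, B

# Hypothetical labelled training data.
A_hat, B_hat = estimate_parameters([0, 0, 1, 1, 0], [0, 0, 1, 1, 0], 2, 2,
                                   pseudocount=0.0)
```

The pseudocount keeps unseen transitions and emissions from getting probability zero, which matters when the training data is small.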

27
HMM Profile HMM

Use HMM to represent sequence families.
A particular type of HMM suited to modeling
multiple alignments.
(Assume we have a multiple alignment).

28
HMM Trivial profile HMM

We begin with ungapped regions.
Each position corresponds to a state.
Transitions are of probability 1.

29
HMM Trivial profile HMM

Let e_i(a) be the independent probability of
observing amino acid a in position i.
The probability of a new sequence x, according to
the model:

30
HMM Trivial profile HMM

We can score the sequence x,
where q indicates the probability under a random
model.

31
HMM Trivial profile HMM

Consider the values
They behave like elements in a score matrix.
The trivial profile HMM is equivalent to a PSSM.
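The quantities referred to on the three preceding slides can be written out as follows. This is a reconstruction from the definitions given (e_i(a) and the random-model probability q), assuming a sequence x of length L whose i-th residue is x_i; the notation is an assumption rather than the slides' original typesetting.

```latex
% Probability of x under the ungapped model
P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)

% Log-odds score of x against the random model
S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}

% The values that behave like score-matrix entries
% (residue a at position i)
\log \frac{e_i(a)}{q_a}
```

Because S is a sum of per-position terms that depend only on the position and the residue, tabulating the last expression for every (i, a) pair gives a PSSM.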

32
HMM profile HMM

Let's untrivialize by allowing for gaps:
insertions and deletions.
Start off with the PSSM HMM.

33
HMM profile HMM

Handling insertions:
Introduce new states I_j that match insertions
after position j.
These states have the emission probabilities of
the random model.

34
HMM profile HMM

The score of a gap of length k
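A sketch of this score, assuming a(s, t) denotes the transition probability from state s to state t and that insert states emit with the random-model probabilities, so their emissions contribute nothing to the log-odds score:

```latex
\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1)\,\log a_{I_j I_j}
```

The first two terms pay for entering and leaving the insert state, and the last term pays for the k - 1 self-transitions, so the cost grows linearly with gap length, as in an affine gap penalty.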

35
HMM profile HMM

Handling deletions:
Introduce silent states D_j.
These states do not emit.

36
HMM profile HMM

The complete profile HMM
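One way to picture the complete model is to list its transitions. The sketch below enumerates them for a profile with a given number of match states, assuming the fully connected layout in which insert and delete states can also follow each other; some implementations leave those transitions out. The state naming (Begin, M1, I0, D1, End) is an assumption.

```python
def profile_hmm_transitions(length):
    """List the allowed transitions of a profile HMM with `length` match states.

    Column 0 is the begin state, which behaves like a match state here.
    I{j} holds insertions after column j; D{j} silently skips column j.
    """
    edges = []
    for j in range(length + 1):
        sources = [f"M{j}" if j > 0 else "Begin", f"I{j}"]
        if j > 0:
            sources.append(f"D{j}")
        next_match = f"M{j + 1}" if j < length else "End"
        for src in sources:
            edges.append((src, next_match))       # move on to the next column
            edges.append((src, f"I{j}"))          # insert residues after column j
            if j < length:
                edges.append((src, f"D{j + 1}"))  # delete (skip) the next column
    return edges

edges = profile_hmm_transitions(3)
```

Each edge would carry a transition probability, and each match and insert state an emission distribution; only the topology is shown here.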

37
Internet resources

Databases of protein families.
Family information and identification.
Considerations
Type of representation (pattern, PSSM, HMM).
Choice of seed multiple alignment proteins.
Quality control.
Database features (links, annotations, views).
Database Specificity (organism, functions).

38
Pfam Home
39
Pfam

Protein families database of alignments and HMMs
Uses profile-HMMs to represent families.
For each family in Pfam you can
Look at multiple alignments
View protein domain architectures
Examine species distribution
Follow links to other databases
View known protein structures

40
Pfam Databases

Two databases:
Pfam-A: curated multiple alignments.
Grows slowly.
Quality controlled by experts.
Pfam-B: automatic clustering (ProDom derived).
Complements Pfam-A.
New sequences instantly incorporated.
Unchecked: false positives, etc.

41
Pfam Features

Search by Sequence, keyword, domain, taxonomy.
Browsing by family or genome.
Evolutionary tree

42
Pfam Construction

Sources of seed alignments:
Pfam-B families.
Published articles.
'Domain hunting' studies.
Occasionally, entries from other databases
(e.g. MEROPS for peptidases).

43
Pfam Domain information
44
Pfam Domain organization
45
Pfam Multiple alignment
46
Pfam HMM logo
47
Pfam Species distribution
48
Pfam Genome comparison
49
PROSITE

Database of protein families.
Matching according to simple patterns or PSSM
profiles.
Browsing all proteins of a specific family.
Latest release knows 1696 protein families.

50
PROSITE Features

Comprehensive domain documentation.
All profile matches checked by experts.
Specificity/sensitivity:
Specificity: true-pos/all-pos
Sensitivity: true-pos/(true-pos + false-neg)
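The two ratios can be computed directly from match counts. In this sketch "all-pos" is read as all reported matches, i.e. true positives plus false positives, which is an interpretation of the slide; under that reading the slide's specificity is what is elsewhere called precision. The function names and example counts are hypothetical.

```python
def specificity(true_pos, false_pos):
    """Slide's definition: true positives over all reported positives."""
    return true_pos / (true_pos + false_pos)

def sensitivity(true_pos, false_neg):
    """True positives over all real family members (found or missed)."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts: 90 correct hits, 10 wrong hits, 30 missed members.
example_specificity = specificity(90, 10)
example_sensitivity = sensitivity(90, 30)
```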

51
PROSITE Example

Specificity of Zinc finger C2H2 type domain

52
SMART
53
SMART

Simple Modular Architecture Research Tool
Identification and annotation of genetically
mobile domains and the analysis of domain
architectures.
SMART consists of a library of HMMs.
Knows 665 HMMs to date.

54
SMART Features

Finding proteins containing specific domains,
i.e. of the same family
Function prediction
Sub-cellular localization
Binding partners
Architecture
Alternative splicing information
Orthology information

55
SMART Domain selection example

Tyrosine kinase (TyrKc) AND Transmembrane region
(TRANS)

56
InterPro

InterPro combines 9 other databases, such as
SMART, Pfam, ProDom and more.
Queries can use many different methods (as the
other databases use different methods).
However, thresholds are predefined and cannot be
changed for those methods.

57
InterPro

Provides more results, but can sometimes be
redundant.
Coverage statistics
93% of Swiss-Prot v42.5 (128540 out of 138922
proteins)
81% of TrEMBL v25.5 (819966 out of 1013263
proteins)

58
InterPro Features

Searching by protein/DNA sequences
Finding domain homologs
List of InterPro entries of type:
Family
Domain
Repeat
PTM - post-translational modifications
Binding Site
Active Site
Keyword

59
InterPro Example

Kringle domain

60
Research Example Introduction

Goal: the systematic identification of novel
protein domain families, using computational
methods.

61
Research Example Method
62
Research Example Results

28 New Domains identified
15 domains in diverse contexts, in different
species.
3 domains species specific.
7 domains with weak similarity to previously
described domains.
3 extension domains.

63
Predictions of Function

On the basis of reports in literature and/or
occurrence with other identified domains,
functional features can be predicted for our
novel domain families.
Examples
Chromatin binding
Protein Interaction
Predicted sub-cellular localization

64
Predictions of Function: Chromatin-Binding example

The novel domain CSZ is contained in protein
SPT6, which regulates transcription via chromatin
structure modification.
SPT6 has a histone-binding capability,
experimentally confirmed.
Other domains (S1, SH2) in SPT6 are unlikely to
bind histones or chromatin.
Conclusion: CSZ has a predicted histone-binding
function.

65
Predictions of Function: Localization example

Some of the novel domains are only found within
proteins from the initial set of nuclear domains.
This predicts that these domains have a nuclear
function.
The other domains are likely to have roles in
both nucleus and cytoplasm.

66
Conclusion