Identification of Protein Domains

1
Identification of Protein Domains
  • Eden Dror
  • Menachem Schechter

Computational Biology Seminar 2004
2
Overview
  • Introduction to protein domains.
  • Classification of homologs.
  • Representing a domain.
      • PSSM
      • HMM
  • Internet resources
      • Pfam
      • SMART
      • PROSITE
      • InterPro
  • Research example.

3
Protein domains
  • A discrete portion of a protein assumed to fold
    independently, and possessing its own function.
  • Mobile domain (module): a domain that can be
    found associated with different domain
    combinations in different proteins.

4
Protein domains
  • The assumption: the domain is the fundamental
    unit of protein structure and function.
  • Protein family: all proteins containing a
    specific domain.

5
What can we learn from them?
  • Common ancestors: homology information on a set
    of proteins.
  • Homology can imply properties of a protein, such
    as function and localization.
  • Therefore, domains can be used to assign a new
    protein to a family and so infer its function.

6
Classification of homologs
  • Homology is not a sufficiently well-defined term
    to describe the evolutionary relationships
    between genes.
  • Homologous genes can arise in two major ways:
      • Gene duplication (in the same species).
      • Speciation (splitting of one species into two).

7
Classification of homologs
8
Classification of homologs
  • Orthologs: two genes from two different species
    that derive from a single gene in the last common
    ancestor of the species.
  • Paralogs: two genes that derive from a single
    gene that was duplicated within a genome.

9
Classification of homologs
10
Classification of homologs
  • Inparalogs - paralogs that evolved by gene
    duplication after the speciation event.
  • Outparalogs - paralogs that evolved by gene
    duplication before the speciation event.

11
Classification of homologs
12
What can we learn from them?
  • Orthologous proteins are evolutionary, and
    typically functional, counterparts in different
    species.
  • Paralogous proteins are important for detecting
    lineage-specific adaptations.
  • Both of them can reveal information on a specific
    species or a set of species.

13
Protein domains summary
  • By identifying domains we can:
      • Infer the function and localization of a
        protein.
      • Learn about a specific species.
      • Learn about a set of species as a group.

14
Domain representation
  • Different methods to represent (model) domains:
      • Patterns (regular expressions).
      • PSSM (position-specific score matrix).
      • HMM (hidden Markov model).
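
Of the three representations, a pattern is the simplest to try out. The following is a minimal sketch, assuming the PROSITE pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats). The function name `prosite_to_regex`, the example pattern and the test sequence are illustrative assumptions and do not come from the slides.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    # '<' and '>' anchor the pattern to the start and end of the sequence.
    start = "^" if pattern.startswith("<") else ""
    end = "$" if pattern.endswith(">") else ""
    pattern = pattern.lstrip("<").rstrip(">")
    parts = []
    for element in pattern.split("-"):
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, repeat = m.groups()
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return start + "".join(parts) + end

# Illustrative pattern written in the style of a C2H2 zinc finger signature.
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
REGEX = prosite_to_regex(PATTERN)

# Made-up sequence that contains exactly one match, starting at index 2.
SEQUENCE = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
hit = re.search(REGEX, SEQUENCE)
```

A pattern either matches or it does not, which is why the score-based models (PSSM, HMM) are used when a graded answer is needed.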

15
PSSM
  • Position-specific score matrix:
      • A matrix giving the score for having each
        amino acid at each position of the domain.
      • Based on the independent probabilities
        $P_i(a)$ of observing amino acid $a$ in
        position $i$.

16
PSSM Example
17
PSSM Identifying a domain
  • Given a sequence and a PSSM:
      • Run over all positions of the sequence.
      • Score each sub-sequence according to the
        matrix.
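
The scan described above can be sketched as a sliding window. This is a minimal sketch, assuming the PSSM is stored as one dictionary per column and that residues missing from a column get a default penalty; `scan_with_pssm`, `best_hit`, the toy matrix and the default of -1.0 are all hypothetical choices made for illustration.

```python
def scan_with_pssm(sequence, pssm, default=-1.0):
    """Score every window of len(pssm) residues against the matrix."""
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Columns are independent, so the window score is a plain sum.
        scores.append(sum(col.get(res, default) for col, res in zip(pssm, window)))
    return scores

def best_hit(sequence, pssm, default=-1.0):
    """Return (start position, score) of the highest-scoring window."""
    scores = scan_with_pssm(sequence, pssm, default)
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]

# Toy three-column matrix (made-up scores, not a real domain).
TOY_PSSM = [{"C": 2.0}, {"H": 1.5, "K": 0.5}, {"C": 2.0}]
scores = scan_with_pssm("AKCHCA", TOY_PSSM)
```

In practice a hit would be reported only when the best score exceeds a threshold chosen for the domain.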

18
HMM Hidden Markov Model
  • Markov model: a way of describing a process that
    goes through a series of states.
  • Each state has a probability of transitioning to
    each of the other states.
  • $x_i$ is a random variable giving the state at
    step $i$.

19
HMM Markov Model
  • Example: states take values in {0, 1}.

20
HMM Markov Model
  • Transition matrix

21
HMM Markov Model
  • State transition example
  • States are the nucleotides A, T, G, C.
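
A first-order Markov chain over the nucleotides can be sketched as follows. The transition and initial probabilities here are invented for illustration (the slide's own matrix is not preserved in this transcript), and `markov_log_prob` is a hypothetical helper name.

```python
import math

NUCLEOTIDES = "ATGC"

# Hypothetical transition matrix: stay on the same nucleotide with
# probability 0.4, move to each of the other three with probability 0.2.
TRANSITION = {a: {b: (0.4 if a == b else 0.2) for b in NUCLEOTIDES}
              for a in NUCLEOTIDES}
# Hypothetical uniform distribution over the first state.
INITIAL = {a: 0.25 for a in NUCLEOTIDES}

def markov_log_prob(sequence, initial, transition):
    """Log-probability of a state sequence under a first-order Markov chain."""
    log_p = math.log(initial[sequence[0]])
    for prev, cur in zip(sequence, sequence[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p

# P(A, A, T) = 0.25 * 0.4 * 0.2
log_p_aat = markov_log_prob("AAT", INITIAL, TRANSITION)
```

Working in log space avoids numerical underflow on long sequences.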

22
HMM Hidden Markov Model
  • Hidden Markov model:
      • Each state x emits an output y with a specific
        probability.
      • We only observe the outputs.
      • Thus, the states are hidden.

23
HMM Hidden Markov Model
  • Example: states ∈ {0, 1}, outputs ∈ {0, 1}.

24
HMM Hidden Markov Model
  • Emission matrix

25
HMM What can we do with it?
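  • Given the model (A, B), where A is the transition
    matrix and B the emission matrix, we can compute:
      • The probability of given states and outputs.
      • The probability of a given output sequence.
      • The most likely sequence of states that
        generated a given output sequence.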
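
The three computations listed above can be sketched for a small HMM. This is a minimal sketch: the second and third are shown with the standard forward and Viterbi algorithms, which the slides do not name, and the initial distribution `PI` and the numbers in `A` and `B` are assumptions made up for the example.

```python
def joint_prob(states, outputs, pi, A, B):
    """P(states, outputs): product of start, transition and emission terms."""
    p = pi[states[0]] * B[states[0]][outputs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][outputs[t]]
    return p

def forward_prob(outputs, pi, A, B):
    """P(outputs), summed over all state paths (forward algorithm)."""
    n = len(pi)
    f = [pi[s] * B[s][outputs[0]] for s in range(n)]
    for y in outputs[1:]:
        f = [sum(f[r] * A[r][s] for r in range(n)) * B[s][y] for s in range(n)]
    return sum(f)

def viterbi(outputs, pi, A, B):
    """Most likely state path and its joint probability (Viterbi algorithm)."""
    n = len(pi)
    v = [pi[s] * B[s][outputs[0]] for s in range(n)]
    back = []
    for y in outputs[1:]:
        # For each state s, remember the best predecessor state.
        ptr = [max(range(n), key=lambda r, s=s: v[r] * A[r][s]) for s in range(n)]
        v = [v[ptr[s]] * A[ptr[s]][s] * B[s][y] for s in range(n)]
        back.append(ptr)
    last = max(range(n), key=v.__getitem__)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], v[last]

# Hypothetical two-state model with outputs in {0, 1}.
PI = [0.6, 0.4]                # assumed initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]   # A[r][s]: probability of moving from r to s
B = [[0.9, 0.1], [0.2, 0.8]]   # B[s][y]: probability that state s emits y
```

For example, `viterbi([0, 1, 1], PI, A, B)` returns the most probable state path for the observations 0, 1, 1 together with its probability.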

26
HMM What can we do with it?
  • Learning:
      • Given state and output sequences, calculate
        the most probable (A, B).
      • Easy when the states are known.
      • Otherwise, use a training algorithm.
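
When the states are known, the most probable (A, B) follows from counting. This is a minimal sketch under that assumption; `estimate_hmm` and its `pseudocount` parameter are hypothetical names. When the states are hidden, an iterative training algorithm such as Baum-Welch is a common choice, which is not shown here.

```python
def estimate_hmm(state_seqs, output_seqs, n_states, n_outputs, pseudocount=0.0):
    """Maximum-likelihood transition and emission matrices from labelled data."""
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    emit = [[pseudocount] * n_outputs for _ in range(n_states)]
    for states, outputs in zip(state_seqs, output_seqs):
        for s, y in zip(states, outputs):
            emit[s][y] += 1          # count emissions of y from state s
        for r, s in zip(states, states[1:]):
            trans[r][s] += 1         # count transitions r -> s

    def normalise(rows):
        # Turn each row of counts into probabilities; all-zero rows stay zero.
        return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in rows]

    return normalise(trans), normalise(emit)

# Made-up training data: one state path and the outputs it produced.
A_hat, B_hat = estimate_hmm([[0, 0, 1, 1, 0]], [[0, 0, 1, 1, 1]], 2, 2)
```

A small positive `pseudocount` avoids zero probabilities for events that never occur in the training data.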

27
HMM Profile HMM
  • Use HMM to represent sequence families.
  • A particular type of HMM suited to modeling
    multiple alignments.
  • (Assume we have a multiple alignment).

28
HMM Trivial profile HMM
  • We begin with ungapped regions.
  • Each position corresponds to a state.
  • Transitions are of probability 1.

29
HMM Trivial profile HMM
  • Let $e_i(a)$ be the independent probability of
    observing amino acid $a$ in position $i$.
  • The probability of a new sequence $x$ of length
    $L$, according to the model $M$:
    $P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$

30
HMM Trivial profile HMM
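  • We can score a sequence $x$ of length $L$ by a
    log-odds ratio:
    $S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$
  • Where $q_a$ is the probability of amino acid $a$
    under a random model.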

31
HMM Trivial profile HMM
  • Consider the values $\log \frac{e_i(a)}{q_a}$,
    where $q_a$ is the random-model probability of
    amino acid $a$.
  • They behave like elements in a score matrix.
  • The trivial profile HMM is equivalent to a PSSM.
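
The equivalence can be illustrated by building the log-odds values directly from an ungapped alignment. This is a minimal sketch: the four-letter alphabet, the uniform background, the add-one pseudocount and the function names are assumptions chosen to keep the example small.

```python
import math

def log_odds_matrix(alignment, background, pseudocount=1.0):
    """Per-column log(e_i(a) / q_a) estimated from an ungapped alignment."""
    alphabet = sorted(background)
    matrix = []
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        total = len(column) + pseudocount * len(alphabet)
        matrix.append({
            # e_i(a) is the (pseudocount-smoothed) column frequency of a.
            a: math.log(((column.count(a) + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return matrix

def log_odds_score(x, matrix):
    """Sum of the per-position log-odds values for sequence x."""
    return sum(matrix[i][a] for i, a in enumerate(x))

# Toy alphabet and alignment (made up for illustration).
BACKGROUND = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
MATRIX = log_odds_matrix(["ACD", "ACE", "ACD"], BACKGROUND)
```

Scoring with `MATRIX` is a table lookup and a sum per position, which is how a PSSM is applied.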

32
HMM profile HMM
  • Let's untrivialize by allowing for gaps:
    insertions and deletions.
  • Start off with the PSSM HMM.

33
HMM profile HMM
  • Handling insertions:
      • Introduce new states $I_j$ to match
        insertions after position $j$.
      • These states emit residues with the random
        (background) probabilities.

34
HMM profile HMM
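  • The log-odds score of a gap (insertion) of
    length $k$, with $a$ denoting transition
    probabilities:
    $\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$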
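
A short numeric sketch of the insertion score, assuming the insert states emit with background probabilities so that only the transitions contribute to the log-odds score. The function name and the three transition probabilities are hypothetical.

```python
import math

def insertion_log_score(k, a_match_insert, a_insert_insert, a_insert_match):
    """Log score of k inserted residues: open once, extend k-1 times, close once."""
    if k < 1:
        raise ValueError("an insertion must contain at least one residue")
    return (math.log(a_match_insert)
            + (k - 1) * math.log(a_insert_insert)
            + math.log(a_insert_match))

# Hypothetical transition probabilities around one insert state.
A_MI, A_II, A_IM = 0.05, 0.4, 0.6
gap_scores = [insertion_log_score(k, A_MI, A_II, A_IM) for k in (1, 2, 3)]
```

Each extra inserted residue adds the same amount, `log(A_II)`, so the cost behaves like an affine gap penalty with separate opening and extension terms.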

35
HMM profile HMM
  • Handling deletions:
      • Introduce silent states $D_j$.
      • These states do not emit.

36
HMM profile HMM
  • The complete profile HMM
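
The slide's diagram is not preserved in this transcript, so the following is a sketch of one common layout of a complete profile HMM, not a reproduction of the figure. It assumes match states M1..ML, insert states I0..IL, delete states D1..DL, plus Begin and End states, and it allows every match, insert and delete state to move to the next match state, the current insert state or the next delete state. Some implementations drop the insert-to-delete and delete-to-insert transitions. The function name is hypothetical.

```python
def profile_hmm_transitions(length):
    """List (source, target) state pairs of a profile HMM with `length` match states."""
    def match(j):
        # Begin plays the role of M0 and End the role of M(length+1).
        if j == 0:
            return "Begin"
        if j == length + 1:
            return "End"
        return f"M{j}"

    edges = []
    for j in range(length + 1):
        sources = [match(j), f"I{j}"] + ([f"D{j}"] if j >= 1 else [])
        targets = [match(j + 1), f"I{j}"] + ([f"D{j + 1}"] if j < length else [])
        edges.extend((s, t) for s in sources for t in targets)
    return edges

# Topology for a model with three match positions.
EDGES = profile_hmm_transitions(3)
```

With this layout the number of transitions grows as 9 per position plus a constant, so the model stays linear in the length of the alignment.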

37
Internet resources
  • Databases of protein families.
  • Family information and identification.
  • Considerations:
      • Type of representation (pattern, PSSM, HMM).
      • Choice of proteins for the seed multiple
        alignment.
      • Quality control.
      • Database features (links, annotations, views).
      • Database specificity (organism, functions).

38
Pfam Home
39
Pfam
  • Protein families database of alignments and HMMs
  • Uses profile-HMMs to represent families.
  • For each family in Pfam you can:
      • Look at multiple alignments
      • View protein domain architectures
      • Examine species distribution
      • Follow links to other databases
      • View known protein structures

40
Pfam Databases
  • Two databases:
      • Pfam-A: curated multiple alignments.
          • Grows slowly.
          • Quality controlled by experts.
      • Pfam-B: automatic clustering (ProDom derived).
          • Complements Pfam-A.
          • New sequences instantly incorporated.
          • Unchecked: false positives, etc.

41
Pfam Features
  • Search by Sequence, keyword, domain, taxonomy.
  • Browsing by family or genome.
  • Evolutionary tree

42
Pfam Construction
  • Sources of seed alignments:
      • Pfam-B families.
      • Published articles.
      • 'Domain hunting' studies.
      • Occasionally, entries from other databases
        (e.g. MEROPS for peptidases).

43
Pfam Domain information
44
Pfam Domain organization
45
Pfam Multiple alignment
46
Pfam HMM logo
47
Pfam Species distribution
48
Pfam Genome comparison
49
PROSITE
  • Database of protein families.
  • Matching according to simple patterns or PSSM
    profiles.
  • Browsing all proteins of a specific family.
  • The latest release at the time of this seminar
    (2004) contains 1696 protein families.

50
PROSITE Features
  • Comprehensive domain documentation.
  • All profile matches checked by experts.
  • Specificity/sensitivity:
      • Specificity: true-pos / all-pos
      • Sensitivity: true-pos / (true-pos + false-neg)
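
The two ratios can be computed as follows. This is a minimal sketch, assuming "all-pos" means every reported match (true positives plus false positives), which makes the slide's specificity the quantity often called precision elsewhere. The counts are invented for illustration.

```python
def specificity(true_pos, false_pos):
    """Fraction of reported matches that are real: true-pos / all-pos."""
    return true_pos / (true_pos + false_pos)

def sensitivity(true_pos, false_neg):
    """Fraction of real family members that are found."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts for one profile.
TRUE_POS, FALSE_POS, FALSE_NEG = 90, 10, 30
spec = specificity(TRUE_POS, FALSE_POS)   # 90 / 100
sens = sensitivity(TRUE_POS, FALSE_NEG)   # 90 / 120
```

Raising a score threshold typically trades sensitivity for specificity, and lowering it does the opposite.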

51
PROSITE Example
  • Specificity of Zinc finger C2H2 type domain

52
SMART
53
SMART
  • Simple Modular Architecture Research Tool
  • Identification and annotation of genetically
    mobile domains and the analysis of domain
    architectures.
  • SMART consists of a library of HMMs.
  • Contains 665 HMMs as of this seminar (2004).

54
SMART Features
  • Finding proteins containing specific domains,
    i.e. of the same family
  • Function prediction
  • Sub-cellular localization
  • Binding partners
  • Architecture
  • Alternative splicing information
  • Orthology information

55
SMART Domain selection example
  • Tyrosine kinase (TyrKc) AND Transmembrane region
    (TRANS)
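
The logic of such an AND query can be sketched as a set-containment test over domain architectures. This is not SMART's interface: the function name, the protein names and the architectures below are made up, and only the domain labels TyrKc and TRANS come from the slide.

```python
def proteins_with_domains(architectures, required):
    """Names of proteins whose architecture contains every required domain."""
    required = set(required)
    return sorted(name for name, domains in architectures.items()
                  if required <= set(domains))

# Hypothetical architectures: protein name -> domains in order of occurrence.
ARCHITECTURES = {
    "protA": ["TRANS", "TyrKc"],
    "protB": ["SH2", "SH3", "TyrKc"],
    "protC": ["TRANS"],
}
matches = proteins_with_domains(ARCHITECTURES, ["TyrKc", "TRANS"])
```

Only proteins containing both domains are returned, which here selects the hypothetical transmembrane tyrosine kinase.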

56
InterPro
  • InterPro combines 9 other databases, including
    SMART, Pfam and ProDom.
  • Queries can use many different methods (as the
    other databases use different methods).
  • However, thresholds are predefined and cannot be
    changed for those methods.

57
InterPro
  • Provides more results, but can sometimes be
    redundant.
  • Coverage statistics:
      • 93% of Swiss-Prot v42.5: 128540 out of 138922
        proteins.
      • 81% of TrEMBL v25.5: 819966 out of 1013263
        proteins.
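
The coverage figures can be checked from the protein counts quoted above. The helper name `coverage_percent` is an assumption; the counts are the ones on the slide.

```python
def coverage_percent(matched, total):
    """Share of proteins with at least one match, as a percentage."""
    return 100.0 * matched / total

swiss_prot = coverage_percent(128540, 138922)   # about 92.5
trembl = coverage_percent(819966, 1013263)      # about 80.9
```

Rounded to whole percentages these give the 93 and 81 quoted for Swiss-Prot and TrEMBL.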

58
InterPro Features
  • Searching by protein/DNA sequences
  • Finding domain homologs
  • List of InterPro entries of type:
      • Family
      • Domain
      • Repeat
      • PTM (post-translational modification)
      • Binding site
      • Active site
  • Keyword

59
InterPro Example
  • Kringle domain

60
Research Example Introduction
  • Goal: the systematic identification of novel
    protein domain families.
  • Using computational methods.

61
Research Example Method
62
Research Example Results
  • 28 new domains identified:
      • 15 domains in diverse contexts, in different
        species.
      • 3 species-specific domains.
      • 7 domains with weak similarity to previously
        described domains.
      • 3 extension domains.

63
Predictions of Function
  • On the basis of reports in literature and/or
    occurrence with other identified domains,
    functional features can be predicted for our
    novel domain families.
  • Examples:
      • Chromatin binding
      • Protein interaction
      • Predicted sub-cellular localization

64
Predictions of Function: Chromatin-Binding example
  • The novel domain CSZ is contained in protein
    SPT6, which regulates transcription via chromatin
    structure modification.
  • SPT6 has a histone-binding capability,
    experimentally confirmed.
  • Other domains (S1, SH2) in SPT6 are unlikely to
    bind histones or chromatin.
  • Conclusion: CSZ has a predicted histone-binding
    function.

65
Predictions of Function: Localization example
  • Some of the novel domains are only found within
    proteins from the initial set of nuclear domains.
  • This predicts that these domains have a nuclear
    function.
  • The other domains are likely to have roles in
    both nucleus and cytoplasm.

66
Conclusion
  • Domains are the functional units of proteins.
  • Identifying a domain within a new protein may
    teach us much about it.
  • There are several types of models to represent
    domains.
  • These models can also be used to identify the
    domain they represent.
  • Many Internet databases are available to
    catalogue and identify families.
  • A protocol exists to identify new domains using
    known ones.

67
Resources
  • Pfam: http://www.sanger.ac.uk/Software/Pfam/
  • SMART: http://smart.embl-heidelberg.de/
  • PROSITE: http://www.expasy.org/prosite/
  • InterPro: http://www.ebi.ac.uk/interpro/

68
The End