Characterization of Secondary Structure of Proteins using Different Vocabularies

Provided by: MadhaviGan1
Learn more at: http://www.cs.cmu.edu

1
Characterization of Secondary Structure of
Proteins using Different Vocabularies
  • Madhavi K. Ganapathiraju
  • Language Technologies Institute
  • Advisors: Raj Reddy, Judith Klein-Seetharaman, Roni Rosenfeld

2nd Biological Language Modeling Workshop, Carnegie Mellon University, May 13-14, 2003
2
Presentation overview
  • Classification of Protein Segments by their
    Secondary Structure types
  • Document Processing Techniques
  • Choice of Vocabulary in Protein Sequences
  • Application of Latent Semantic Analysis
  • Results
  • Discussion

3
Secondary Structure of Protein
Sample Protein MEPAPSAGAELQPPLFANASDAYPSACPSAGANA
SGPPGARSASSLALAIAITAL YSAVCAVGLLGNVLVMFGIVRYTKMKTA
TNIYIFNLALADALATSTLPFQSA
4
Application of Text Processing
  • Letters → Words → Sentences
  • Letter counts in languages
  • Word counts in Documents
  • Residues → Secondary Structure → Proteins → Genomes

Can unigrams distinguish Secondary Structure Elements from one another?
5
Unigrams for Document Classification
  • Word-Document matrix
  • represents documents in terms of their word
    unigrams

Bag-of-words model since the position of words
in the document is not taken into account
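A minimal sketch of how such a word-document matrix of unigram counts could be built. The function name `word_document_matrix`, whitespace tokenization and the toy documents are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def word_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the count of word i in document j.

    Word order inside a document is ignored (bag-of-words).
    """
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


# Toy documents, purely for illustration.
docs = ["the cat sat", "the dog sat on the cat"]
vocab, matrix = word_document_matrix(docs)
for word, row in zip(vocab, matrix):
    print(word, row)
```

Each column of `matrix` is then the unigram vector of one document.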
6
Word Document Matrix
7
Document Vectors
Doc-1, Doc-2, Doc-3, ..., Doc-N
12
Document Comparison
  • Documents can be compared to one another in terms
    of dot-product of document vectors


  • Formal modeling of documents is presented in the next few slides

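The dot-product comparison can be sketched as follows. The slide names only the dot product; the length-normalized cosine variant and the toy vectors are added here as assumptions for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length count vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the vector lengths (an assumed variant; 0.0 for a zero vector)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


# Toy document vectors over a five-word vocabulary.
doc_1 = [1, 0, 0, 1, 1]
doc_2 = [1, 1, 1, 1, 2]
print(dot(doc_1, doc_2), cosine_similarity(doc_1, doc_2))
```

Dividing by the vector lengths keeps long documents from dominating the comparison simply because their counts are larger.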
15
Vector Space Model construction
  • Document vectors in word-document matrix are
    normalized
  • By word counts in entire document collection
  • By document lengths
  • This gives a Vector Space Model (VSM) of the set
    of documents
  • Equations for Normalization

16
Word count normalization
Annotated terms of the normalization equation:
  • Word count in document
  • Document length
  • A factor that depends on word count in corpus
t_i is the total number of times word i occurs in the corpus
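The equation itself is not present in this transcript. One weighting that is consistent with the annotations above, and with the latent semantic analysis formulation in the Bellegarda reference listed at the end, is the entropy-based normalization sketched below. The symbols c_{ij} (count of word i in document j), n_j (length of document j), N (number of documents) and \varepsilon_i (normalized entropy of word i) are assumptions and may differ from the slide's notation.

```latex
w_{ij} = (1 - \varepsilon_i)\,\frac{c_{ij}}{n_j},
\qquad
\varepsilon_i = -\frac{1}{\log N}\sum_{j=1}^{N}\frac{c_{ij}}{t_i}\,\log\frac{c_{ij}}{t_i},
\qquad
t_i = \sum_{j=1}^{N} c_{ij}
```

Under this sketch, a word spread evenly over all documents has \varepsilon_i close to 1 and contributes little, while a word concentrated in a few documents has \varepsilon_i close to 0 and keeps most of its weight.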
17
Word-Document Matrix
Normalized Word-Document Matrix
18
Document vectors after normalisation
...
19
Use of Vector Space Model
  • A query document is also represented as a vector
  • It is normalized by corpus word counts
  • Documents related to the query document are identified by measuring similarity of document vectors to the query document vector

20
Application to Protein Secondary Structure
Prediction
21
Protein Secondary Structure
  • Dictionary of Secondary Structure of Proteins (DSSP): annotation of each residue with its structure, based on hydrogen bonding patterns and geometrical constraints
  • 7 DSSP labels for PSS: H, G, B, E, S, I, T

The labels are grouped into Helix types, Strand types and Coil types
22
Example
Residues
PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP
____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Key to DSSP labels
  • T, S, I, _: Coil
  • E, B: Strand
  • H, G: Helix
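The key above amounts to a lookup table from DSSP labels to three structure types. A minimal sketch follows; the names `DSSP_TO_CLASS` and `reduce_dssp` are hypothetical, and the grouping simply follows the slide's key.

```python
# Grouping taken from the slide's key; "_" is the blank DSSP state.
DSSP_TO_CLASS = {
    "H": "Helix", "G": "Helix",
    "E": "Strand", "B": "Strand",
    "T": "Coil", "S": "Coil", "I": "Coil", "_": "Coil",
}


def reduce_dssp(dssp):
    """Map a string of DSSP labels to a list of Helix/Strand/Coil names."""
    return [DSSP_TO_CLASS[label] for label in dssp]


print(reduce_dssp("_SEEHT"))
```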
23
Reference Model
  • Proteins are segmented into structural Segments
  • Normalized word-document matrix
  • constructed from structural segments

24
Example
Residues
PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP
____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Structural Segments obtained from the given sequence
PKPPVKFN RRIFLLNTQNVI NG YVKWAI ND VSL A LPPTP YLGAMKY NLLH
Unigrams in the structural segments
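A rough sketch of segmenting a sequence by structure type and counting unigrams in each segment. The helper names, the three-letter class codes (H, E, C) and the short toy input are assumptions for illustration; the exact segment boundaries on real data depend on how the DSSP labels are grouped.

```python
from collections import Counter
from itertools import groupby

HELIX, STRAND = set("HG"), set("EB")


def structure_class(label):
    """Reduce a DSSP label to H (helix), E (strand) or C (coil, everything else)."""
    if label in HELIX:
        return "H"
    if label in STRAND:
        return "E"
    return "C"


def structural_segments(residues, dssp):
    """Split residues into maximal runs that share one structure class."""
    if len(residues) != len(dssp):
        raise ValueError("residues and dssp must have the same length")
    segments, start = [], 0
    for cls, run in groupby(dssp, key=structure_class):
        length = len(list(run))
        segments.append((residues[start:start + length], cls))
        start += length
    return segments


# Toy input: eight residues with their DSSP labels.
segments = structural_segments("PKRRIFYL", "_SEEEHHT")
unigrams = [Counter(segment) for segment, _ in segments]
print(segments)
```

Each `Counter` in `unigrams` plays the role of one document's word counts, with amino acids as the words.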
26
Structural Segments
Amino-acid Structural-Segment Matrix
Amino Acids
Similar to Word-Document Matrix
28
Document Vectors
Word Vectors
29
Document Vectors
Query Vector
Word Vectors

30
Data Set used for PSSP
  • JPred data
  • 513 protein sequences in all
  • <25% homology between sequences
  • Residues and corresponding DSSP annotations are given
  • We used
  • 50 sequences for model construction (training)
  • 30 sequences for testing

31
Classification
  • Proteins from test set
  • segmented into structural elements
  • Called query segments
  • Segment vectors are constructed
  • For each query segment
  • n most similar reference segment vectors are
    retrieved
  • Query segment is assigned same structure as that
    of the majority of the retrieved segments

k-nearest neighbour classification
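The retrieval and majority-vote steps above can be sketched as below. Cosine similarity, the function names and the three-dimensional toy vectors are assumptions; the slides do not state which similarity measure or value of n was used.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def classify_segment(query, references, n=3):
    """Assign the majority label among the n reference vectors most similar to query.

    references is a list of (vector, label) pairs. On a tied vote, the label
    of the more similar reference wins, because Counter keeps insertion order.
    """
    ranked = sorted(references, key=lambda ref: cosine(query, ref[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:n])
    return votes.most_common(1)[0][0]


# Toy reference model with hypothetical segment vectors.
references = [
    ([1.0, 0.0, 0.0], "Helix"),
    ([0.9, 0.1, 0.0], "Helix"),
    ([0.0, 1.0, 0.0], "Strand"),
    ([0.0, 0.0, 1.0], "Coil"),
    ([0.0, 0.1, 0.9], "Coil"),
]
print(classify_segment([0.1, 0.0, 1.0], references, n=3))
```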
32
Structure type assignment to Query Vector
(Figure: Query Vector compared against the Reference Model; key: Helix, Strand, Coil)
Hence the structure type assigned to the Query Vector is Coil
33
Choice of Vocabulary in Protein Sequences
  • Amino Acids
  • But amino acids are not all distinct
  • Similarity is primarily due to chemical composition
  • So,
  • Represent protein segments in terms of types of amino acids
  • Represent in terms of chemical composition

34
Representation in terms of types of AA
  • Classify based on Electronic Properties
  • e- donors: D, E, A, P
  • weak e- donors: I, L, V
  • Ambivalent: G, H, S, W
  • weak e- acceptors: T, M, F, Q, Y
  • e- acceptors: K, R, N
  • C (by itself, another group)
  • Use Chemical Groups

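Re-encoding a segment with the six electronic-property types listed above could look like the following sketch. The group memberships come from the slide; the group names used as dictionary keys and the helper `to_type_words` are illustrative assumptions.

```python
# Group memberships as listed on the slide; the key names are made up here.
ELECTRONIC_GROUPS = {
    "donor": "DEAP",
    "weak_donor": "ILV",
    "ambivalent": "GHSW",
    "weak_acceptor": "TMFQY",
    "acceptor": "KRN",
    "cysteine": "C",
}

# Invert to a per-residue lookup covering all 20 standard amino acids.
AA_TO_TYPE = {
    aa: group for group, members in ELECTRONIC_GROUPS.items() for aa in members
}


def to_type_words(segment):
    """Replace each amino acid in a segment by its electronic-property type."""
    return [AA_TO_TYPE[aa] for aa in segment]


print(to_type_words("VSL"))
```

With this vocabulary the word-document matrix has 6 rows instead of 20, one per amino acid type.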
35
Representation using Chemical Groups
36
Results of Classification with AA as words
Results table columns: leave-one-out testing of reference vectors; unseen query segments
37
Results with chemical groups as words
  • Build VSM using both reference segments and test
    segments
  • Structure labels of reference segments are known
  • Structure labels of query segments are unknown

38
Modification to Word-Document matrix
  • Latent Semantic Analysis
  • Word-document matrix is transformed by Singular Value Decomposition

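As a sketch of the transformation, the standard rank-R truncated SVD used in latent semantic analysis is written out below. The symbols W (normalized word-document matrix with M words and N documents), R (number of retained singular values), d_q (normalized query vector) and the fold-in and similarity expressions follow common LSA practice and are assumptions, not notation taken from the slides.

```latex
W \approx \hat{W} = U S V^{T},
\qquad
U \in \mathbb{R}^{M \times R},\quad
S = \mathrm{diag}(s_1, \dots, s_R),\quad
V \in \mathbb{R}^{N \times R}

\tilde{v}_q = d_q^{T}\, U S^{-1},
\qquad
\mathrm{sim}(q, j) = \cos\!\left(\tilde{v}_q S,\; v_j S\right)
```

Here v_j is the j-th row of V, so documents and a folded-in query are compared in the same R-dimensional space rather than in the original word space.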
40
Results with AA as words, using LSA
41
Results with types of AA as words, using LSA
42
Results with chemical groups as words, using LSA
43
LSA results for Different Vocabularies
Amino acids LSA
Types of Amino acid LSA
Chemical Groups LSA
44
Model construction using all data
Matrix models are constructed using both reference and query documents together. This gives better models both for normalization and for construction of the latent semantic model.
Amino Acid
Chemical Groups
Amino acid types
45
Applications
  • Complement other methods for protein structure
    prediction
  • Segmentation approaches
  • Protein classifications as all-alpha, all-beta,
    alpha+beta or alpha/beta types
  • Automatically assigning new proteins into SCOP
    families

46
References
  1. Kabsch, W. and Sander, C., Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-637.
  2. Dwyer, D.S., Electronic properties of the amino acid side chains contribute to the structural preferences in protein folding. J Biomol Struct Dyn, 2001. 18(6): p. 881-92.
  3. Bellegarda, J., Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 2000. 88(8).

47
Thank you!
48
Use of SVD
  • Representation of Training and test segments very
    similar to that in VSM
  • Structure type assignment goes through same
    process, except that it is done with the LSA
    matrices

49
Classification of Query Document
  • A query document is also represented as a vector
  • It is normalized by corpus word counts
  • Documents related to the query are identified by measuring similarity of document vectors to the query document vector
  • The query document is assigned the same structure as the majority of the documents retrieved by the similarity measure (majority voting)

k-nearest neighbour classification
50
Notes
  • Results described are per-segment
  • Normalized Word document matrix does not preserve
    document lengths
  • Hence per residue accuracies of structure
    assignments cannot be computed