Data Mining for Protein Structure Prediction - PowerPoint PPT Presentation


Slides: 55

Provided by: ipam

Transcript and Presenter's Notes


1
Data Mining for Protein Structure Prediction

Mohammed J. Zaki
SPIDER Data Mining Project: Scalable, Parallel and Interactive Data Mining and Exploration at RPI
http://www.cs.rpi.edu/~zaki

2
Outline of the Talk

How do proteins form?
Protein folding problem
Contact map mining
Using HMMs based on local motifs
Mining physical dense frequent patterns
(non-local motifs)
Future directions
Heuristic rules
Folding pathways

3
How do Proteins Form?
4
How do Proteins Form?

Building Blocks of Biological Systems
DNA (nucleotides, 4 types): information carrier/encoder
RNA: bridge from DNA to protein
Protein (amino acids, 20 types): action molecules
Processes
Replication of DNA
Transcription of gene (DNA) to messenger RNA (mRNA)
Splicing of non-coding regions of the genes (introns)
Translation of mRNA into proteins
Folding of proteins into 3D structure
Biochemical or structural functions of proteins

9
Protein Folding Problem
10
Protein Structures

Primary structure
Unbranched polymer
20 side chains (residues or amino acids)
Higher-order structures
Secondary: local (consecutive) in sequence
Tertiary: 3D fold of one polypeptide chain
Quaternary: chains packing together

11
Amino Acid
12
Polypeptide Chain
13
Torsion Angles
14
The Protein Folding Problem
15
Contact Map Mining
16
Contact Map

Amino acids Ai and Aj are in contact if their 3D distance is less than a threshold (7 Å)
Sequence separation is given as |i - j|
Contact map C is an N x N matrix, where
C(i,j) = 1 if Ai and Aj are in contact
C(i,j) = 0 otherwise
Consider all pairs with |i - j| > 4
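A minimal sketch of the contact-map definition above, in Python. It assumes one representative 3D point per residue (the slide does not say which atom is used), and the names contact_map, threshold and min_sep are illustrative, not taken from the talk.

```python
import math

def contact_map(coords, threshold=7.0, min_sep=5):
    """Build an N x N 0/1 contact matrix from per-residue 3D coordinates.

    coords: list of (x, y, z) tuples, one point per residue (assumption:
    a single representative atom per residue; the slide does not say which).
    threshold: contact distance in angstroms (7 on the slide).
    min_sep: smallest sequence separation kept; 5 encodes |i - j| > 4.
    """
    n = len(coords)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < threshold:
                c[i][j] = c[j][i] = 1  # contact maps are symmetric
    return c

# Toy chain: 10 residues on a straight line, 1 angstrom apart (not a real protein).
toy = [(float(k), 0.0, 0.0) for k in range(10)]
cmap = contact_map(toy)
print(cmap[0])
```

Pairs closer than min_sep in sequence are left at 0 here, which mirrors the slide's restriction to sufficiently separated pairs.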

17
Protein 2igd 3D Structure
Figure labels: anti-parallel beta sheets, alpha helix, parallel beta sheets
18
Contact Map (2igd PDB)
Figure labels: anti-parallel beta sheets, parallel beta sheets, alpha helix
Axes: amino acid Ai vs. amino acid Aj
19
How Much Information Is in Amino Acids Alone?
Classification Problem

A pair of amino acids (Ai, Aj) is an instance
The class is C (1) or NC (0), i.e., contact or non-contact
Highly skewed class distribution
1.7% C and 98.3% NC: 300K C vs. 17.3M NC
Features for each instance
Ai and Aj
Class: C or NC

20
Predicting Protein Contacts

Predict contacts for new sequence

21
Classification via Association Mining

Association mining is good for skewed data
Mining: mine frequent itemsets in the contact data (Dc)
P(X | Dc) = Frequency(X in Dc) / |Dc|
Counting: find P(X | Dnc) in the non-contact data (Dnc)
Pruning
Likelihood of a contact: r = P(X | Dc) / P(X | Dnc)
Prune pattern X if the ratio r of contact to non-contact probability is less than some threshold
i.e., keep only the patterns highly predictive of contacts
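The mine/count/prune steps above can be sketched as follows. This is a brute-force illustration, not the miner used in the talk: it enumerates all itemsets up to a small size instead of running a real frequent-itemset algorithm, and the names itemset_counts, predictive_patterns, min_support and min_ratio are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def itemset_counts(instances, max_size=2):
    """Count every itemset of up to max_size items over a list of item sets."""
    counts = Counter()
    for items in instances:
        for k in range(1, max_size + 1):
            for subset in combinations(sorted(items), k):
                counts[subset] += 1
    return counts

def predictive_patterns(d_c, d_nc, min_support=2, min_ratio=2.0, max_size=2):
    """Keep itemsets frequent in the contact data whose ratio
    r = P(X | Dc) / P(X | Dnc) is at least min_ratio."""
    freq_c = itemset_counts(d_c, max_size)
    freq_nc = itemset_counts(d_nc, max_size)
    kept = {}
    for x, n_c in freq_c.items():
        if n_c < min_support:
            continue  # not frequent in the contact data
        p_c = n_c / len(d_c)
        p_nc = freq_nc.get(x, 0) / len(d_nc)
        ratio = math.inf if p_nc == 0 else p_c / p_nc
        if ratio >= min_ratio:
            kept[x] = (p_c, p_nc, ratio)
    return kept

# Toy data: each instance is the set of features of one residue pair.
d_c = [{"Ai=L", "Aj=V"}, {"Ai=L", "Aj=V"}, {"Ai=G", "Aj=V"}]
d_nc = [{"Ai=L", "Aj=K"}, {"Ai=G", "Aj=V"}, {"Ai=G", "Aj=K"}, {"Ai=L", "Aj=K"}]
patterns = predictive_patterns(d_c, d_nc)
```

In the toy data the single item "Ai=L" is frequent among contacts but almost as common among non-contacts, so it is pruned.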

22
Testing Phase

90-10 split into training and testing
2.4 million pairs, with 36K contacts (1.5%)
Evidence calculation
Find matching patterns P for each instance
Compute cumulative frequency in C and NC
Sc = sum of Frequency(X in Dc) over X in P
Snc = sum of Frequency(X in Dnc) over X in P
Compute evidence as the ratio Sc / Snc
Prediction: sort instances on evidence
Predict the top PR fraction as contacts
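A small sketch of the evidence calculation and ranking step, under stated assumptions: patterns are stored as tuples of items with raw counts in the contact and non-contact data, and the number of instances to predict is passed in directly as n_predict (how that number is derived from PR is not spelled out on the slide). All names are illustrative.

```python
import math

def evidence(instance, freq_c, freq_nc):
    """Return Sc / Snc over the patterns X that are contained in the instance."""
    matches = [x for x in freq_c if set(x) <= instance]
    s_c = sum(freq_c[x] for x in matches)
    s_nc = sum(freq_nc.get(x, 0) for x in matches)
    if s_c == 0:
        return 0.0  # no matching pattern: no evidence of a contact
    return math.inf if s_nc == 0 else s_c / s_nc

def predict_contacts(instances, freq_c, freq_nc, n_predict):
    """Sort instance indices by decreasing evidence and keep the top n_predict."""
    ranked = sorted(range(len(instances)),
                    key=lambda k: evidence(instances[k], freq_c, freq_nc),
                    reverse=True)
    return ranked[:n_predict]

# Toy pattern counts (hypothetical numbers).
freq_c = {("Aj=V",): 3, ("Ai=L", "Aj=V"): 2}
freq_nc = {("Aj=V",): 1}
test_pairs = [{"Ai=G", "Aj=K"}, {"Ai=G", "Aj=V"}, {"Ai=L", "Aj=V"}]
top = predict_contacts(test_pairs, freq_c, freq_nc, n_predict=2)
```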

23
Experiments

794 proteins from the Protein Data Bank
Distinct structures (< 25% similarity)
Longest: 907 amino acids; shortest: 35 amino acids
90-10 split for training-testing
Total pairs: 20 million (> 2.5 GB)
Contacts: 330 thousand (1.6%)
Highly uneven class distribution

24
Evaluation Metrics

Na: set of all pairs
Na+: all pairs with positive evidence
Ntc: true contacts in test data
Ntc+: true contacts with positive evidence
Npc: predicted contacts
Ntpc: correctly predicted contacts
Accuracy = Ntpc / Npc
Coverage = Ntpc / Ntc
Prediction Ratio (PR) = Ntc / Na
Random Predictor Accuracy = Ntc / Na
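The accuracy and coverage definitions above, as a short Python sketch. Contacts are represented as (i, j) index pairs, which is an assumption made for illustration; the function name is hypothetical.

```python
def accuracy_and_coverage(predicted, true_contacts):
    """Accuracy = Ntpc / Npc and Coverage = Ntpc / Ntc for sets of (i, j) pairs."""
    predicted, true_contacts = set(predicted), set(true_contacts)
    n_tpc = len(predicted & true_contacts)  # correctly predicted contacts
    accuracy = n_tpc / len(predicted) if predicted else 0.0
    coverage = n_tpc / len(true_contacts) if true_contacts else 0.0
    return accuracy, coverage

predicted = {(1, 8), (2, 9), (3, 12)}
truth = {(1, 8), (3, 12), (4, 20), (5, 30)}
acc, cov = accuracy_and_coverage(predicted, truth)
```

Since both ratios share the numerator Ntpc, they coincide whenever the number of predicted contacts equals the number of true contacts. That is one way to read the "crossover" values quoted on the results slides.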

25
Results (Amino Acids All Lengths)
Crossover: 7% accuracy and 7% coverage, 2 times over random
26
Results (Amino Acids by length)
Length 1-100: 12% accuracy (A) and coverage (C)
Length 100-170: 6% A and C
Length 170-300: 4.5% A and C
Length 300+: 2% A and C
27
Using HMMs based on Local Motifs to Improve
Classification
28
An HMM for Local Predictions

HMMSTR (Chris Bystroff, Biology, RPI)
Build a library of short sequences that tend to fold uniquely across protein families: the I-sites Library
Treat each motif as a Markov chain
Merge the motifs into a global HMM for local structure prediction

29
Training the HMM

Build I-sites Library
Short sequence motifs (3 to 19 residues)
Exhaustive clustering of sequences
Non-redundant PDB dataset (< 25% similarity)
Build an HMM
Each of 262 motifs is a chain of Markov states
Each state has sequence and structure for one position
Merge I-sites motifs hierarchically to get one global HMM for all the motifs

30
HMM Output

Total of 282 states in the HMM
Each state produces (emits):
Amino acid profile (20 probability values)
Secondary structure (D) (helix, strand or loop)
Backbone angles (R) (11 dihedral angle symbols)
Finer structural context (C) (10 context symbols)

31
I-Sites Motifs (Initiation Sites)
Beta Hairpin
Beta to Alpha
Helix C-Cap
33
Data Format and Preparation

Take the 794 PDB proteins
Compute optimal alignment to HMM
Find best state sequence for the observed amino acids
Output probability distribution of each residue over all 282 HMM states
Integrate the 3 datasets
Alignment probability distribution (N x 282)
Amino acid and context information (D, R, C)
Contact map (N x N)
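Finding the best state sequence for an observed amino acid sequence is usually done with the Viterbi algorithm. The sketch below is a generic textbook Viterbi on a toy two-state HMM. It is not HMMSTR's code or model, and all states and probabilities are made up for illustration. The per-residue distribution over states mentioned above would need a forward-backward pass, which is not shown.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs under a discrete HMM (log-space)."""
    def lg(p):
        return math.log(p) if p > 0 else -math.inf

    score = {s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + lg(trans_p[r][s]))
            score[s] = prev[best] + lg(trans_p[best][s]) + lg(emit_p[s][o])
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy model: H (helix-like) prefers A, E (strand-like) prefers V.
states = ["H", "E"]
start_p = {"H": 0.5, "E": 0.5}
trans_p = {"H": {"H": 0.9, "E": 0.1}, "E": {"H": 0.1, "E": 0.9}}
emit_p = {"H": {"A": 0.8, "V": 0.2}, "E": {"A": 0.2, "V": 0.8}}
best_path = viterbi("AAVV", states, start_p, trans_p, emit_p)
```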

34
HMMSTR Output (per Protein)
35
Adding features from HMMSTR

The class is C (1) or NC (0)
Highly skewed class distribution
Approx. 1.5% C and 98.5% NC
Features for each instance
Ai, Aj, Di, Dj, Ri, Rj, Ci, Cj
Profile: pi1, pi2, ..., pi20, pj1, pj2, ..., pj20
HMM states: qi1, qi2, ..., qi282, qj1, qj2, ..., qj282
Class: C or NC

36
HMM and AA (R,D,C) All Lengths
Left: crossover at 19% accuracy and coverage, 5.3 times over random
Right (RDC): crossover at 17% accuracy and coverage, 5 times over random
37
HMM + AA + R,D,C (by Length)
Length 1-100: 30% accuracy (A) and coverage (C)
Length 100-170: 17% A and C
Length 170-300: 10% A and C
Length 300+: 6% A and C
38
Predicted Contact Map (2igd)
39
Summary of Classification Results

Challenging prediction problem
In essence, we have to predict a contact matrix for a new protein
Hybrid HMM/Associations approach
Best results to date: 19% overall accuracy/coverage, 30% for short proteins
14.4% accuracy (Fariselli and Casadio, 1999; neural network)
13% accuracy (Thomas et al., 1996)
Short proteins: 26% (Olmea and Valencia, 1997)

40
Mining Physical Dense Frequent Patterns
(non-local motifs)
41
Characterizing Physical, Protein-like Contact Maps

A very small subset of all contact maps code for physically possible proteins (self-avoiding, globular chains)
A contact map must
Satisfy geometric constraints
Represent a low-energy structure
What are the typical non-local interactions?
Frequent dense 0/1 submatrices in contact maps
3-step approach: 1) data generation, 2) dense pattern mining, and 3) mapping to structure space

42
Dense Pattern Mining

12,524 protein-like 60-residue structures
Use HMMSTR to generate protein-like sequences
Use ROSETTA to generate their structures
Monte Carlo fragment insertion (from I-sites library)
Up to 5 possible low-energy structures retained
Frequent 2D Pattern Mining
Use W x W sliding window (W = window size)
Measure density under each window
(N-W)^2 / 2 possible windows per N-length protein
Look for minimum density; scale away from diagonal
Try different window sizes
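A rough sketch of the sliding-window density scan. It makes several assumptions the slide does not confirm: density is taken as the raw number of 1s in the window, only windows whose top-left corner lies above the diagonal are scanned, and the threshold is not scaled with distance from the diagonal. The name dense_windows and its parameters are hypothetical.

```python
def dense_windows(cmap, w, min_ones):
    """Return (i, j, ones) for each w x w window with top-left corner (i, j),
    j > i, that contains at least min_ones contacts."""
    n = len(cmap)
    hits = []
    for i in range(n - w + 1):
        for j in range(i + 1, n - w + 1):
            window = [row[j:j + w] for row in cmap[i:i + w]]
            ones = sum(sum(row) for row in window)
            if ones >= min_ones:
                hits.append((i, j, ones))
    return hits

# Toy 6 x 6 symmetric contact map with three contacts.
toy_map = [[0] * 6 for _ in range(6)]
for a, b in [(0, 3), (0, 4), (1, 3)]:
    toy_map[a][b] = toy_map[b][a] = 1
hits = dense_windows(toy_map, w=2, min_ones=3)
```

Trying different window sizes then amounts to calling the function with several values of w.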

43
Counting Dense Patterns

Naïve approach: for W = 5, N = 60 there are 1485 windows per protein. Total of 15 million possible windows for 12,524 proteins
Test if two submatrices are equal
Linear search: O(P x W^2) with P current dense patterns
Hash-based: O(W^2)
Our approach: 2-level hashing
O(W) time

44
Pattern (WxW Submatrix) Encoding

Encode submatrix as string (W integers)
Submatrix row    Integer value
00000            0
01100            12
01000            8
01000            8
00000            0
Concatenated string: 0.12.8.8.0
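The encoding above reads each row of the W x W submatrix as a binary number and joins the W integers with dots. A minimal Python sketch, with encode_window as an illustrative name:

```python
def encode_window(window):
    """Encode a 0/1 submatrix as dot-separated row integers (one per row)."""
    return ".".join(str(int("".join(map(str, row)), 2)) for row in window)

# The 5 x 5 submatrix from the slide.
window = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
string_id = encode_window(window)
print(string_id)  # 0.12.8.8.0
```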

45
Two-level Hashing

String ID (M)
Level 1 (approximate)
Level 2 (exact): h2(M) = StringID(M)
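A sketch of how two-level hashing of encoded patterns might be organized. The level-1 function is not recoverable from this transcript, so the sum of the W row integers is used here purely as an assumed stand-in. Level 2 uses the exact string ID, as stated on the slide. The class and method names are hypothetical.

```python
from collections import defaultdict

class TwoLevelCounter:
    """Support counts for encoded W x W patterns, stored in two hash levels."""

    def __init__(self):
        # level-1 key -> {level-2 key (exact string ID) -> support}
        self.buckets = defaultdict(dict)

    @staticmethod
    def h1(rows):
        # Assumed approximate hash: sum of the W row integers.
        return sum(rows)

    @staticmethod
    def string_id(rows):
        # Exact key: the dot-separated row integers.
        return ".".join(map(str, rows))

    def add(self, rows):
        bucket = self.buckets[self.h1(rows)]
        key = self.string_id(rows)
        bucket[key] = bucket.get(key, 0) + 1

    def support(self, rows):
        # .get avoids creating empty buckets for unseen patterns
        return self.buckets.get(self.h1(rows), {}).get(self.string_id(rows), 0)

counter = TwoLevelCounter()
counter.add([0, 12, 8, 8, 0])
counter.add([0, 12, 8, 8, 0])
counter.add([0, 8, 12, 8, 0])  # same level-1 key (28), different string ID
```

Python dictionaries already hash their keys, so the explicit first level is here only to make the structure visible.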

46
Binding Patterns to Proteins Sequence and
Structure

Using window size W = 5
StringID = 0.12.8.8.0, Support = 170
00000
01100
01000
01000
00000
Occurrences
pdb-name   (X,Y)    X_sequence   Y_sequence   Interaction
1070.0     52,30    ILLKN        TFVRI        alpha-beta
1145.0     51,13    VFALH        GFHIA        alpha-strand
1251.2     42,6     EVCLR        GSKFG        alpha-strand
1312.0     54,11    HGYDE        ATFAK        alpha-beta
1732.0     49,6     HRFAK        KELAG        alpha-beta
2895.0     49,7     SRCLD        DTIYY        alpha-beta
...

47
Frequent Dense Local Patterns
48
Frequent Dense Non-Local Patterns
Alpha Alpha
Alpha Beta Sheet
49
Frequent Dense Non-Local Patterns
Alpha Beta Turn
Beta Sheet Beta Turn
50
Future Directions
51
Mining Physicality Rules

Comprehensive list of non-local motifs
I-sites library catalogs local motifs
Mining heuristic rules for physicality
Based on simple geometric constraints
Rules governing contacts and non-contacts
Parallel Beta Sheets: if C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets: if C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
Alpha Helices: if C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0
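The two beta-sheet rules can be checked mechanically on a contact map. The sketch below is illustrative only: it reads the rule offsets as i+2 and j+2, scans pairs with i < j on a symmetric 0/1 matrix, and uses hypothetical function names.

```python
def parallel_violations(c):
    """Pairs (i, j) where C(i,j) = 1 and C(i+2,j+2) = 1 but
    C(i,j+2) or C(i+2,j) is also 1."""
    n = len(c)
    return [(i, j) for i in range(n - 2) for j in range(i + 1, n - 2)
            if c[i][j] and c[i + 2][j + 2] and (c[i][j + 2] or c[i + 2][j])]

def antiparallel_violations(c):
    """Pairs (i, j) where C(i,j+2) = 1 and C(i+2,j) = 1 but
    C(i,j) or C(i+2,j+2) is also 1."""
    n = len(c)
    return [(i, j) for i in range(n - 2) for j in range(i + 1, n - 2)
            if c[i][j + 2] and c[i + 2][j] and (c[i][j] or c[i + 2][j + 2])]

def make_map(n, contacts):
    """Symmetric n x n 0/1 matrix with the given (i, j) contacts set."""
    c = [[0] * n for _ in range(n)]
    for i, j in contacts:
        c[i][j] = c[j][i] = 1
    return c

ok_parallel = make_map(12, [(0, 8), (2, 10)])
bad_parallel = make_map(12, [(0, 8), (2, 10), (0, 10)])
ok_anti = make_map(12, [(0, 10), (2, 8)])
bad_anti = make_map(12, [(0, 10), (2, 8), (0, 8)])
```

A predicted contact map that triggers either check would be flagged as non-physical under these heuristics.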

52
Heuristic Rules of Physicality
Parallel Beta Sheets (diagram labels: i, i+2, j, j+2)
If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets (diagram labels: i, i+2, j, j+2)
If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
53
Protein Folding Pathways