Title: Data Mining for Protein Structure Prediction
1Data Mining for Protein Structure Prediction
- Mohammed J. Zaki
- SPIDER Data Mining Project Scalable, Parallel
and Interactive Data Mining and Exploration at
RPI - http//www.cs.rpi.edu/zaki
2Outline of the Talk
- How do proteins form?
- Protein folding problem
- Contact map mining
- Using HMMs based on local motifs
- Mining physical dense frequent patterns
(non-local motifs) - Future directions
- Heuristic rules
- Folding pathways
3How do Proteins Form?
4How do Proteins Form?
- Building Blocks of Biological Systems
- DNA (nucleotides, 4 types) information
carrier/encoder - RNA bridge from DNA to protein
- Protein (amino acids, 20 types) action
molecules. - Processes
- Replication of DNA
- Transcription of gene (DNA) to messenger RNA
(mRNA) - Splicing of non-coding regions of the genes
(introns) - Translation of mRNA into proteins
- Folding of proteins into 3D structure
- Biochemical or structural functions of proteins
5(No Transcript)
6(No Transcript)
7(No Transcript)
8(No Transcript)
9Protein Folding Problem
10Protein Structures
- Primary structure
- Un-branched polymer
- 20 side chains (residues or amino acids)
- Higher order structures
- Secondary local (consecutive) in sequence
- Tertiary 3D fold of one polypeptide chain
- Quaternary Chains packing together
11Amino Acid
12Polypeptide Chain
13Torsion Angles
14The Protein Folding Problem
15Contact Map Mining
16Contact Map
- Amino acids Ai and Aj are in contact if their 3D
distance is less than threshold (7Ã¥) - Sequence separation is given as i-j
- Contact map C is an N x N matrix, where
- C(i,j) 1 if Ai and Aj are in contact
- C(i,j) 0 otherwise
- Consider all pairs with i-j gt 4
17Protein 2igd 3D Structure
Anti-parallel Beta Sheets
Alpha Helix
Parallel Beta Sheets
18Contact Map (2igd PDB)
Anti-parallel Beta Sheets
Parallel Beta Sheets
Alpha Helix
Amino Acid Aj
Amino Acid Ai
19How much information in Amino Acids Alone
Classification Problem
- A pair of amino acids (Ai,Aj) is an instance
- The class C (1) or NC (0), i.e., contact or
non-contact - Highly skewed class distribution
- 1.7 C and 98.3 NC 300K C vs 17,3M NC
- Features for each instance
- Ai and Aj
- Class C or NC
20Predicting Protein Contacts
- Predict contacts for new sequence
21Classification via Association Mining
- Association mining good for skewed data
- Mining Mine frequent itemsets in C data (Dc)
- P(X Dc) Frequency(X Dc) / Dc
- Counting find P(X Dnc)
- Pruning
- Likelihood of a contact r P(XDc) / P(XDnc)
- Prune pattern X if ratio r of contact to
non-contact probability is less then some
threshold - i.e., keep only the patterns highly predictive of
contacts
22Testing Phase
- 90-10 split into training and testing
- 2.4 million pairs, with 36K contacts (1.5)
- Evidence calculation
- Find matching patterns P for each instance
- Compute cumulative frequency in C and NC
- Sc Sum of frequency (X Dc) where X in P
- Snc Sum of frequency (X Dnc) where X in P
- Compute evidence ratio of Sc / Snc
- Prediction Sort instances on evidence
- Predict top PR fraction as contacts
23Experiments
- 794 Proteins from Protein Data Bank
- Distinct structures (lt 25 similarity)
- Longest 907, Smallest 35 amino acids
- 90-10 split for training-testing
- Total pairs 20 million (gt 2.5 GB)
- Contacts 330 thousand (1.6)
- Highly uneven class distribution
24Evaluation Metrics
- Na set of all pairs
- Na all pairs with positive evidence
- Ntc true contacts in test data
- Ntc true contacts with positive evidence
- Npc predicted contacts
- Ntpc correctly predicted contacts
- Accuracy Ntpc / Npc
- Coverage Ntpc / Ntc
- Prediction Ratio (PR) Ntc/Na
- Random Predictor Accuracy Ntc/Na
25Results (Amino Acids All Lengths)
Crossover 7 accuracy and 7 coverage 2 times
over Random
26Results (Amino Acids by length)
1-100 12 accuracy(A) and coverage (C) 100-170
6 A and C
170-300 4.5 A and C 300 2 A and C
27Using HMMs based on Local Motifs to Improve
Classification
28An HMM for Local Predictions
- HMMSTR (Chris Bystroff, Biology, RPI)
- Build a library of short sequences that tend to
fold uniquely across protein families the
I-Sites Library - Treat each motif as a Markov chain
- Merge the motifs into a global HMM for local
structure prediction
29Training the HMM
- Build I-sites Library
- Short sequence motifs (3 to 19)
- Exhaustive clustering of sequences
- Non-redundant PDB dataset (lt 25 similarity)
- Build an HMM
- Each of 262 motifs is a chain of Markov states
- Each state has sequence and structure for one
position - Merge I-sites motifs hierarchically to get one
global HMM for all the motifs
30HMM Output
- Total of 282 States in the HMM
- Each state produces or emits
- Amino acid profile (20 probability values)
- Secondary structure (D) (helix, strand or loop)
- Backbone angles (R) (11 dihedral angle symbols)
- Finer structural context (C) (10 context symbols)
31I-Sites Motifs (Initiation Sites)
Beta Hairpin
Beta to Alpha
Helix C-Cap
32(No Transcript)
33Data Format and Preparation
- Take the 794 PDB proteins
- Compute optimal alignment to HMM
- Find best state sequence for the observed acids
- Output probability distribution of a residue over
all the 282 HMM states - Integrate the 3 datasets
- Alignment probability distribution (Nx282)
- Amino acid and context information (D, R, C)
- Contact map (NxN)
34HMMSTR Output (per Protein)
35Adding features from HMMSTR
- The class C (1) or NC (0)
- Highly skewed class distribution
- Approx 1.5 C and 98.5 NC
- Features for each instance
- Ai Aj Di Dj Ri Rj Ci Cj
- Profile pi1 pi2 pi20 pj1 pj2 pj20
- HMM States qi1 qi2 .. qi282 qj1 qj2 .. qj282
- Class C or NC
36HMM and AA (R,D,C) All Lengths
Left Crossover 19 accuracy and coverage 5.3
times over Random
Right Crossover (RDC) 17 accuracy and
coverage 5 times over Random
37HMM AA R,D,C (by length)
1-100 30 accuracy(A) and coverage (C) 100-170
17 A and C
170-300 10 A and C 300 6 A and C
38Predicted Contact Map (2igd)
39Summary of Classification Results
- Challenging prediction problem
- In essence, we have to predict a contact matrix
for a new protein - Hybrid HMM/Associations approach
- Best results to-date 19 overall
accuracy/coverage, 30 for short proteins - 14.4 Accuracy (Fariselli, Casadio 99 NN)
- 13 Accuracy (Thomas et al 96)
- Short proteins 26 (Olmea, Valencia, 97)
40Mining Physical Dense Frequent Patterns
(non-local motifs)
41Characterizing Physical, Protein-like Contact Maps
- A very small subset of all contact maps code for
physically possible proteins (self-avoiding,
globular chains) - A contact map must
- Satisfy geometric constraints
- Represent low-energy structure
- What are the typical non-local interactions?
- Frequent dense 0/1 submatrices in contact maps
- 3-step approach 1) data generation, 2) dense
pattern mining, and 3) mapping to structure space
42Dense Pattern Mining
- 12,524 protein-like 60 residue structures
- Use HMMSTR to generate protein-like sequences
- Use ROSETTA to generate their structures
- Monte Carlo fragment insertion (from I-sites
library) - Up to 5 possible low-energy structures retained
- Frequent 2D Pattern Mining
- Use WxW sliding window W window size
- Measure density under each window
- (N-W)2 / 2 possible windows per N length protein
- Look for minimum density scale away from diag
- Try different window sizes
43Counting Dense Patterns
- Naïve Approachfor W5, N60 there are 1485
windows per protein. Total 15 Million possible
windows for 12,524 proteins - Test if two submatrices are equal
- Linear search O(P x W2) with P current dense
patterns - Hash based O(W2)
- Our Approach 2-level Hashing
- O(W) time
44Pattern (WxW Submatrix) Encoding
- Encode submatrix as string (W integers)
- Submatrix Integer Value
- 00000 0
- 01100 12
- 01000 8
- 01000 8
- 00000 0
- Concatenated String 0.12.8.8.0
45Two-level Hashing
- String ID (M)
- Level 1 (approximate)
- Level2 (exact) h2 (M) StringID (M)
46Binding Patterns to Proteins Sequence and
Structure
- Using window size, W5
- StringID0.12.8.8.0, Support 170
- 00000
- 01100
- 01000
- 01000
- 00000
- Occurrences
- pdb-name (X,Y) X_sequence Y_sequence Interaction
- 1070.0 52,30 ILLKN TFVRI
alphabeta - 1145.0 51,13 VFALH GFHIA
alphastrand - 1251.2 42,6 EVCLR GSKFG
alphastrand - 1312.0 54,11 HGYDE ATFAK
alphabeta - 1732.0 49,6 HRFAK KELAG
alphabeta - 2895.0 49,7 SRCLD DTIYY
alphabeta - ...
47Frequent Dense Local Patterns
48Frequent Dense Non-Local Patterns
Alpha Alpha
Alpha Beta Sheet
49Frequent Dense Non-Local Patterns
Alpha Beta Turn
Beta Sheet Beta Turn
50Future Directions
51Mining Physicality Rules
- Comprehensive list of non-local motifs
- I-sites library catalogs local motifs
- Mining heuristic rules for physicality
- Based on simple geometric constraints
- Rules governing contacts and non-contacts
- Parallel Beta Sheets If C(i,j) 1 and
C(i2,j2) 1, then C(i,j2) 0 and C(i2,j)
0 - Anti-parallel Beta Sheets If C(i,j2) 1 and
C(i2,j) 1, then C(i,j) 0 and C(i2,j2) 0 - Alpha Helices If C(i,i4) 1, C(i,j) 1, and
C(i4,j) 1, then C(i2,j) 0
52Heuristic Rules of Physicality
Parallel Beta Sheets
Anti-parallel Beta Sheets
i2
j2
i2
j
i
j
i
j2
If C(i,j2) 1 and C(i2,j) 1, then C(i,j) 0
and C(i2,j2) 0
If C(i,j) 1 and C(i2,j2) 1, then C(i,j2)
0 and C(i2,j) 0
53Protein Folding Pathways
- Rules for Pathways in Contact Map Space
- Pathway is time-ordered sequence of contacts
- Condensation rule New contact within Smax
- U(i,j) lt Smax U(i,j) is unfolded residues from
i to j - Pathway prediction is complementary to structure
prediction
54Contact Map Folding Pathways