Title: Capstone Project Presentation
1Capstone Project Presentation
- Predicting Deleterious Mutations
- Young SP, Radivojac P, Mooney SD
2Predicting Deleterious Mutations
- Deleterious
- "Hurtful or injurious to life or health; noxious."
- (Oxford English Dictionary)
- "'Tis pity wine should be so deleterious, For tea and coffee leave us much more serious."
- (Byron, Don Juan IV, 1821)
3Predicting Deleterious Mutations
- SNPs
- What is an SNP (single nucleotide polymorphism)?
- Why are SNPs important?
- Some SNPs are nonsynonymous
- The molecular effects of SNPs vary widely
4Predicting Deleterious Mutations
- MOTIVATION
- Improve on the existing deleterious mutation prediction methods
- Use protein sequence, evolution and structure data combined with machine learning to identify potentially disease-causing SNPs
5Predicting Deleterious Mutations
- SNP data is increasingly available
- Over 40 major online databases
- dbSNP is the primary SNP database (contains 5,000,000 validated human SNPs)
- Many databases contain potentially disease-causing SNPs related to a particular disease
6Predicting Deleterious Mutations
- Deleterious effects of mutations on proteins
- Function
- Stability
- Expression
- Protein-Protein Interactions
7Current Classification Tools
- Sequence Approaches
- BLOSUM62
- An amino acid substitution score matrix
- SIFT
- Collects sequence homologues in multiple alignments and identifies non-conservative changes in amino acids
- Ng P, Henikoff S, 'Predicting Deleterious Amino Acid Substitutions.' Genome Research, 2001, 11:863-874.
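The alignment-based idea on this slide can be illustrated with a short sketch. This is not SIFT itself: it is a simplified stand-in that scores a substitution by how often the mutant residue appears in one alignment column, relative to the most common residue there. The function names, the pseudocount, and the toy column are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(column, pseudocount=1.0):
    """Return smoothed amino acid frequencies for one alignment column.

    `column` holds one residue per homologous sequence; gaps ('-') and
    any other non-standard symbols are ignored.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def substitution_tolerance(column, mutant_aa, pseudocount=1.0):
    """Frequency of the mutant residue relative to the most common residue.

    Values near 1 suggest the residue is well tolerated at this position;
    values near 0 suggest a non-conservative change. (Hypothetical score,
    not the SIFT probability.)
    """
    freqs = column_frequencies(column, pseudocount)
    return freqs[mutant_aa] / max(freqs.values())

# Toy column: a mostly hydrophobic position across ten homologues.
column = "LLLLIILLVL"
print(substitution_tolerance(column, "I"))  # conservative change
print(substitution_tolerance(column, "D"))  # non-conservative change
```

A residue already seen among the homologues scores higher than one that never occurs, which is the intuition behind flagging non-conservative changes.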
8Current Classification Tools
- Structural Approaches
- Expert rules
- Uses evolutionary and structural data
- Sunyaev et al., 'Prediction of deleterious human alleles.' Human Molecular Genetics, 2001, Vol. 10, No. 6, 593.
- Decision Trees
- Improved performance based on sequence and structural data
- Produces intuitive rules
9Our foundation for the project
- Saunders CT, Baker D
- 'Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction'
- J. Mol. Biol. (2002) 322, 891-901
- Structural and evolutionary features
- Trained classifiers based on two data sets: experimental mutations and human alleles
10Predicting Deleterious Mutations
- S&B Training Sets
- Experimental mutations (5,000)
- HIV-1 protease
- E. coli Lac repressor
- T4 Lysozyme
- Human alleles (350 mutations)
- 103 hot human genes
11Predicting Deleterious Mutations
- Why two training sets?
- Unbiased human data is hard to get
- Many disease-associated mutations are discovered through genetic association studies and may not be causative (i.e., only linked with the causative allele)
- Effect of mutations is hard to measure
- Experimental whole-gene mutagenesis data is used because it is considered unbiased
12Predicting Deleterious Mutations
- Features used in S&B Study
- SIFT
- SIFT + Solvent Accessibility (SA)
- SIFT + normalized B-factor
- SIFT + Sunyaev expert rules
- SIFT + SA + B-factor
13Predicting Deleterious Mutations
Hypothesis: Can we improve on the results of Saunders and Baker by using more structural and sequence properties?
14Predicting Deleterious Mutations
- Experimental Design
- Classification algorithm
- Decision Trees
- Support Vector Machines
- Neural Nets
- Additional Features
- Amino acid relative frequencies
- Additional structural properties
15Predicting Deleterious Mutations
- Structural Property Values
- Russ Altman (Stanford) developed a vector representation of protein structural sites
- Spheres (1.875 Å to 7.5 Å) centered on C-alpha atom of the mutation position
- 66 features
- Atom/residue counts within sphere and other features, e.g.
- Solubility
- Solvent accessibility
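The sphere-based counts described above can be sketched as follows. This is an illustration only, not the S-BLEST implementation: it assumes four equal-width shells of 1.875 Å (so the outermost radius is 7.5 Å), treats atoms as bare 3D coordinates, and uses a hypothetical function name.

```python
import math

SHELL_WIDTH = 1.875  # Å; assumed equal-width shells
N_SHELLS = 4         # 4 x 1.875 Å gives an outer radius of 7.5 Å

def shell_counts(center, atoms, shell_width=SHELL_WIDTH, n_shells=N_SHELLS):
    """Count atoms in concentric shells around `center`.

    `center` is the (x, y, z) position of the C-alpha atom at the mutation
    site and `atoms` is an iterable of (x, y, z) coordinates. Shell i covers
    distances in [i * shell_width, (i + 1) * shell_width); atoms at or
    beyond the outer radius are ignored.
    """
    counts = [0] * n_shells
    for atom in atoms:
        distance = math.dist(center, atom)
        index = int(distance // shell_width)
        if index < n_shells:
            counts[index] += 1
    return counts

# Toy coordinates (made up), one atom per shell plus one outside the sphere.
site = (0.0, 0.0, 0.0)
neighbours = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0),
              (0.0, 0.0, 7.0), (10.0, 0.0, 0.0)]
print(shell_counts(site, neighbours))
```

A real feature vector would repeat this kind of count per atom or residue type, which is how a small number of shells can yield dozens of features.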
16Predicting Deleterious Mutations
- Amino Acid Windows
- AA frequencies within a window on either side of the mutation position
- 20 AAs = 20 features
- LEFT and RIGHT = 40 features
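A minimal sketch of the window features, assuming a 0-based mutation position and a fixed alphabetical ordering of the 20 amino acids. The function name and the example sequence are hypothetical; windows that run off the end of the sequence are simply truncated.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_frequencies(sequence, position, width):
    """Return 40 features: AA frequencies left and right of `position`.

    The first 20 values describe the window of up to `width` residues
    before the mutation position, the last 20 the window after it. The
    residue at the mutation position itself is excluded. An empty window
    (e.g. at a sequence end) yields all zeros.
    """
    left = sequence[max(0, position - width):position]
    right = sequence[position + 1:position + 1 + width]
    features = []
    for window in (left, right):
        n = len(window)
        for aa in AMINO_ACIDS:
            features.append(window.count(aa) / n if n else 0.0)
    return features

# Toy sequence (made up); mutation at index 4 ('Y'), window width 3.
features = window_frequencies("MKTAYIAKQR", 4, 3)
print(len(features))   # 40
print(features[:20])   # LEFT window frequencies
print(features[20:])   # RIGHT window frequencies
```

Varying `width` gives the windows of different sizes mentioned later in the feature list.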
18Predicting Deleterious Mutations
- Tools
- Databases
- PDB - Protein structure data
- S-BLEST - Structural features
- Software
- Perl 5.8.0
- Matlab (NN, PRTools(DT), SVC)
19Predicting Deleterious Mutations
- List of Features Used
- BLOSUM62, disorder, secondary structure, molecular weight
- Grouped amino acid frequency windows of varying widths
- SIFT
- S-BLEST (vector contains four sub-shells spreading outward from site)
- Solvent accessibility (C-beta density, i.e., the number of C-beta atoms around the site)
20Predicting Deleterious Mutations
Comparison with S&B Results
21Predicting Deleterious Mutations
- 1. Human Data Set
- Human allele dataset as train and test set
- Ensembles of decision trees for classification
- 20-fold cross validation
- Progressively added features to see their effect on performance
- Because structural data was not available for all mutation sites, we used a subset of the original Saunders and Baker training set
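The 20-fold cross-validation protocol can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: the helper names are hypothetical, the labels are toy data, and a majority-class baseline stands in for the ensembles of decision trees so that the example stays self-contained.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class_trainer(features, labels):
    """Stand-in classifier: always predicts the most common training label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda x: majority

def cross_validate(features, labels, train_fn, k=20, seed=0):
    """Return overall accuracy when each fold is held out exactly once."""
    correct = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(model(features[i]) == labels[i] for i in held_out)
    return correct / len(labels)

# Toy data (made up): 30 deleterious (1) and 10 neutral (0) mutations.
labels = [1] * 30 + [0] * 10
features = [[0.0] for _ in labels]
print(cross_validate(features, labels, majority_class_trainer, k=20))
```

To study the effect of progressively added features, the same loop would be rerun with wider feature vectors and the resulting accuracies compared.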
22Predicting Deleterious Mutations
Best Features
23Predicting Deleterious Mutations
- 2. Experimental Data Set
- Same as human data set but using experimental
mutations for training and testing
24Predicting Deleterious Mutations
Evaluation of S-BLEST Using a Random Subset of
the Experimental Training Set
25Predicting Deleterious Mutations
- 3. Cross-classification
- Used the same features described above
- Trained on one dataset and tested on the other
- Human to experimental
- Experimental to human
- Experimental gene to exp. gene
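Cross-classification, i.e. training on one data set and testing on another, can be sketched as below. The one-feature threshold classifier (a decision stump), the function names, and the toy values are assumptions for illustration; the study itself used the richer feature set and classifiers described earlier.

```python
def train_stump(values, labels):
    """Fit a one-feature threshold classifier (decision stump).

    Tries every observed value as a threshold, in both directions, and
    keeps the rule with the highest training accuracy.
    """
    best = None
    for threshold in sorted(set(values)):
        for sign in (1, -1):
            preds = [1 if sign * (v - threshold) >= 0 else 0 for v in values]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, threshold, sign)
    _, threshold, sign = best
    return lambda v: 1 if sign * (v - threshold) >= 0 else 0

def accuracy(model, values, labels):
    """Fraction of samples the model labels correctly."""
    return sum(model(v) == y for v, y in zip(values, labels)) / len(labels)

def cross_classify(datasets):
    """Train on each data set and test on every other one.

    `datasets` maps a name to a (values, labels) pair. Returns a dict
    keyed by (train_name, test_name).
    """
    results = {}
    for train_name, (train_x, train_y) in datasets.items():
        model = train_stump(train_x, train_y)
        for test_name, (test_x, test_y) in datasets.items():
            if test_name != train_name:
                results[(train_name, test_name)] = accuracy(model, test_x, test_y)
    return results

# Toy data (made up): one score per mutation, label 1 = deleterious.
datasets = {
    "human": ([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1]),
    "experimental": ([0.3, 0.4, 0.6, 0.8], [0, 0, 1, 1]),
}
for pair, acc in cross_classify(datasets).items():
    print(pair, acc)
```

A gap between within-set and cross-set accuracy is one sign that a classifier has fitted quirks of its training data rather than general properties of deleterious mutations.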
30Predicting Deleterious Mutations
- Summary of Results
- Human data set
- 80% accuracy (up from 70%)
- Experimental data set
- 87% accuracy (up from 79.5%)
31Predicting Deleterious Mutations
- Conclusion
- Prediction tools CAN identify deleterious mutations
- We believe that further study is warranted to identify over-fitted classifiers to further improve classification accuracy on real-world data
32Acknowledgements
People: Andrew Campen (CCBB IT, IUPUI), Brandon Peters (CCBB, IUPUI), Haixu Tang (Capstone Coordinator, IUB)
Funding: This work was funded by a grant from the Showalter Trust (Sean Mooney, PI), INGEN, and an IUPUI McNair Scholarship. The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.
33Predicting Deleterious Mutations
Thank You