Title: Guan N. Lin (Nick)
1 Bioinformatics
Prediction of Plant Protein-Protein Interaction
Using Sequence Only
Guan N. Lin (Nick), Bioinformatics Intern
2 Outline
- Project Background
- Goals and Obstacles
- Tool Development & Method Design
- Results Analysis
- PPI prediction for leading genes
- Acknowledgements
4 Protein-protein interaction (PPI)
- PPI
- Each living cell is packed with proteins that
continuously interact with each other to control
the cell's growth, function and eventual fate.
- These interactions alter protein kinetic
properties, substrate binding, catalysis, etc.
- Researchers have developed a variety of chemical
and biochemical techniques to understand the who,
what, where, when and why of those interactions.
5 Systems biology: From cell to network
6 PPI (Protein-protein interaction) prediction
- A study combining bioinformatics and structural
biology to identify and catalog interactions
between pairs or groups of proteins.
- Determination by experiments
  - yeast two-hybrid, affinity purification,
  co-immunoprecipitation, etc.
- Prediction by computation
  - Model building through pattern discovery using
  sequences, protein structural information,
  evolutionary information, etc.
- PPI network construction provides important
insight for investigating intracellular signaling
pathways.
8 Project goals and obstacles
- Goals
  - Use parts of free tools and open-source code
  to build a PPI prediction pipeline based
  on protein sequence information only.
  - Use cross-species PPI data (Human,
  Drosophila, Yeast and C. elegans) for
  genome-scale plant PPI prediction.
- Obstacles
  - Open-source code lacks the organization and
  documentation needed for system integration.
  - Computational complexity limits how much analysis
  can be done in the time available.
  - It is difficult to generalize a consistent pattern
  from cross-species data.
9 Outline
- Project Background
- Goals and Obstacles
- Tool Development & Method Design
  - Basic scheme
  - Tool development
  - Model design and tuning
- Results Analysis
- PPI prediction for leading genes
- Acknowledgements
10 Basic scheme (how do we do it?)
Rationale: PPIs are basic structural elements of the
molecular circuitry in biological systems and
will provide valuable insights for
optimization/MOA.
Flowchart:
- Training data (sequences of interacting proteins)
- Predict new interactions from sequences
- SVM kernel classifier
- Sequence patterns
- Validation
Training set for SVM kernel classifier:
- Positive training set (experimental interactions,
some for training, some for validation)
- Negative training set (mostly randomly generated
pairs)
11 Using Conjoint Triads for sequence pattern
construction
- Reduced-alphabet sequence pattern training
- Classify the 20 AA types into 7 classes based on
their properties (hydrogen bonding, hydrophobicity,
side-chain volume, etc.).
- Build AA triplets from the 7 classes, called
conjoint triads (7 x 7 x 7 = 343 unique types). Save in V.
- Calculate the frequency of each triad for each
protein sequence.
Shen et al., PNAS 2007
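The triad counting described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the seven-class grouping in `AA_CLASSES` is written from memory of Shen et al. (2007) and should be checked against the paper, the function names are hypothetical, and representing a protein pair by concatenating the two 343-long vectors is an assumption.

```python
from itertools import product

# Seven amino-acid classes. ASSUMPTION: this grouping follows our reading of
# Shen et al. (2007); verify against the paper before relying on it.
AA_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {
    aa: str(i) for i, group in enumerate(AA_CLASSES, start=1) for aa in group
}

# All 7^3 = 343 triads in a fixed order; this ordering defines the vector V.
TRIADS = ["".join(t) for t in product("1234567", repeat=3)]
TRIAD_INDEX = {t: i for i, t in enumerate(TRIADS)}


def triad_frequencies(sequence):
    """Return a 343-long list of raw triad counts for one protein sequence."""
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIADS)
    for i in range(len(classes) - 2):
        window = classes[i:i + 3]
        if None in window:  # skip windows containing non-standard residues
            continue
        counts[TRIAD_INDEX["".join(window)]] += 1
    return counts


def pair_features(seq_a, seq_b):
    """ASSUMPTION: a protein pair is the concatenation of both triad vectors."""
    return triad_frequencies(seq_a) + triad_frequencies(seq_b)
```

A sequence of length n yields at most n - 2 triads, so longer proteins produce larger counts; some form of normalization before SVM training is probably needed but is left out of this sketch.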
13 System/Tool design flowchart
Flowchart boxes:
- Java codes
- C/C++ codes
- SVM prediction
- Input sequence
- SVM training
- Build sequence pattern
- Test sequences
- Conjoint triads
- Optimize parameters (C, γ)
- SVM test input
- SVM training model
- Triads frequency
- Build SVM training model
- Prediction
- SVM training input
- Generate negative PPI pairs
- Prepare training evidence
- Raw PPI file
Note on negative pairs: negative PPI pairs are generated from
the proteins in the positive PPI pairs. If AB and IJ are
positive PPIs, then AI, AJ, BI and BJ can be
considered negative pairs. Number of negative
pairs = number of positive pairs.
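The negative-pair rule in the flowchart can be illustrated as follows. This is a sketch under stated assumptions: the function name, the random sampling with a fixed seed, and the balancing to exactly as many negatives as positives are illustrative choices, not a description of the tool's code.

```python
import random
from itertools import combinations


def generate_negative_pairs(positive_pairs, seed=0):
    """Pair up proteins seen in positive PPIs that are not known to interact.

    Returns as many negative pairs as there are distinct positive pairs
    (fewer if not enough candidates exist).
    """
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({prot for pair in positive_pairs for prot in pair})
    # Every unordered protein pair that is not a known positive is a candidate.
    candidates = [
        (a, b)
        for a, b in combinations(proteins, 2)
        if frozenset((a, b)) not in positives
    ]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(candidates)
    return candidates[:len(positives)]
```

For the slide's example, `generate_negative_pairs([("A", "B"), ("I", "J")])` draws two pairs from AI, AJ, BI and BJ. Enumerating all candidate pairs is quadratic in the number of proteins, which is acceptable for a few thousand genes but would need sampling for larger sets.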
14 Screenshot of the PPI prediction tool
16 Publicly available experimental data
- Arabidopsis
  - 4,400 PPI pairs (TAIR, BioGRID, IntAct), 3,000
  genes
- C. elegans
  - 5,400 PPI pairs (BioGRID, IntAct)
- Human
  - 23,000 PPI pairs (HPRD, IntAct), 6,900 genes
- Drosophila
  - 24,000 PPI pairs (IntAct), 7,000 genes
- Yeast
  - 48,000 PPI pairs (BioGRID, IntAct), 7,000 genes
17 SVM for triad pattern model training and tuning
SVM training parameters
SVM parameter optimization is performed using a
grid-search procedure. Parameters:
- C: cost, to minimize training error (value range 0.125 -> 512)
- γ: kernel gamma, to maximize training
capability (value range 0.125 -> 8)
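A grid search over these two ranges might look like the sketch below. The slide gives only the end points of each range, so the power-of-two spacing is an assumption, and `score` is a hypothetical callback standing in for whatever cross-validated accuracy the SVM training step reports.

```python
def power_of_two_grid(low_exp, high_exp):
    """Values 2^low_exp, ..., 2^high_exp. ASSUMPTION: the grid doubles each step."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


C_VALUES = power_of_two_grid(-3, 9)      # 0.125 ... 512
GAMMA_VALUES = power_of_two_grid(-3, 3)  # 0.125 ... 8


def grid_search(score, c_values=C_VALUES, gamma_values=GAMMA_VALUES):
    """Try every (C, gamma) pair and return (best_score, best_C, best_gamma).

    `score(c, gamma)` is a hypothetical callback, e.g. cross-validated
    accuracy of an SVM trained with those parameters.
    """
    best = None
    for c in c_values:
        for gamma in gamma_values:
            s = score(c, gamma)
            if best is None or s > best[0]:
                best = (s, c, gamma)
    return best
```

With these grids the search trains 13 x 7 = 91 models, which is one reason the computational cost noted under Obstacles matters.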
18 Outline
- Project Background
- Goals and Obstacles
- Tool Development & Method Design
- Results Analysis
  - Preliminary results and problems
  - Further method modification
  - Further results
- PPI prediction for leading genes
- Acknowledgements
19 Accuracy measurements
                     Real outcome
Predicted outcome    TRUE                   FALSE
TRUE                 TP (True Positive)     FP (False Positive)
FALSE                FN (False Negative)    TN (True Negative)
Sensitivity = TP / (TP + FN)
Specificity = TN / (FP + TN)
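The two measures can be computed directly from lists of real and predicted labels. The following is a small sketch with hypothetical function names; labels are assumed to be booleans (True = interacting pair).

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN from parallel lists of boolean labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of real interactions that were predicted as interactions."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real non-interactions that were predicted as such."""
    return tn / (fp + tn)
```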
20 Preliminary results and observations
- Prediction for Arabidopsis (2,600 positive PPI pairs +
2,600 negative PPI pairs) using different training data sets
without any filtering or processing.
Species (X = included in a data set)
Arabidopsis X X X X
Human X X X
Yeast X X X
Drosophila X
C. elegans X
Accuracy for the six data sets, in column order:
TP 45%, TN 86% | TP 2%, TN 96% | TP 98%, TN 3% | TP 10%, TN 92% | TP 63%, TN 38% | TP 25%, TN 55%
Observations:
1. Overall low accuracies.
2. Data from different species give very different
prediction patterns; some, such as Human and Yeast,
sit at opposite extremes.
=> Conclusion: no meaningful or
useful predictions so far.
22
- Careful selection of subsets of cross-species
data for training is essential to get valid
results.
- Using GO (Gene Ontology) slim categories for data
filtering
- Red bar: Arabidopsis whole-genome proteins
- Blue bar: Arabidopsis PPI proteins
- The two distributions show a correlation of 0.92
- PPI proteins do represent the overall trend of the
whole genome
- Filtering species data by GO TAIR slim.
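A correlation such as the 0.92 quoted above can be computed from the two per-category frequency lists. The sketch below assumes it is a Pearson correlation, which the slide does not state, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold the fraction of whole-genome proteins in each GO slim category and `ys` the fraction of PPI proteins in the same categories, listed in the same order.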
23 How to categorize proteins into GO slim terms
using GO level indexing
Step 1: make GO index (from ontology files)
Step 2: link GO index to genes (from gene-to-GO association files)
Step 3: get GO slim term from GO_Index
Example: YBR085W -> GO:0055085 transmembrane transport,
index 3-10-5-44
GO slim indices:
- Developmental process (GO:0007252): 3-9-26
- Transport (GO:0006810): 3-10-5
- Signal transduction (GO:0007165): 3-7-7-15
Index 3-10-5-44 falls under 3-10-5, so
YBR085W belongs to the Transport slim category.
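The lookup in Step 3 amounts to a prefix match on the level index. The sketch below assumes each index encodes a single path from the ontology root, so that a slim term is an ancestor exactly when its index is a leading part of the term's index; the names are hypothetical, and a real GO term can have several parents and therefore several indices.

```python
# Slim indices taken from the slide's example.
SLIM_INDEX = {
    "Developmental process": "3-9-26",
    "Transport": "3-10-5",
    "Signal transduction": "3-7-7-15",
}


def slim_categories(go_index, slim_index=SLIM_INDEX):
    """Return the slim categories whose index is a leading path of go_index."""
    path = go_index.split("-")
    hits = []
    for name, index in slim_index.items():
        prefix = index.split("-")
        # Compare level by level, so that "3-10-50" does not match "3-10-5".
        if path[:len(prefix)] == prefix:
            hits.append(name)
    return hits
```

For the slide's example, `slim_categories("3-10-5-44")` returns `["Transport"]`.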
24 Next step: Using the slim category frequency
distribution to select subsets of cross-species
data
Use the category percentages observed in the Arabidopsis data to
select similarly distributed subsets for the other species.
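One way to carry out this selection is stratified sampling: draw from each GO slim category in proportion to its share in the Arabidopsis data. The sketch below is an assumption about how that could be done, not the method actually used; it selects proteins rather than pairs, and all names are hypothetical.

```python
import random
from collections import defaultdict


def select_by_distribution(protein_category, target_fraction, total, seed=0):
    """Sample proteins so category shares follow target_fraction.

    protein_category: dict mapping protein ID -> GO slim category
    target_fraction:  dict mapping category -> desired fraction (e.g. from Arabidopsis)
    total:            desired subset size
    """
    by_category = defaultdict(list)
    for protein, category in sorted(protein_category.items()):
        by_category[category].append(protein)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    selected = []
    for category, fraction in target_fraction.items():
        pool = by_category.get(category, [])
        # Never ask for more proteins than the category contains.
        k = min(len(pool), round(fraction * total))
        selected.extend(rng.sample(pool, k))
    return selected
```

If a category is too small in the other species, the subset comes out smaller than `total`; whether to shrink the other categories to compensate is a design choice this sketch leaves open.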
26 Model results comparison with modified datasets
Data selection (columns: Model 1 to Model 6, YuChen)
Arabidopsis X ?
Human X X
Yeast X X ?
C. elegans X X
Drosophila X X
Accuracy
Model      True positive        True negative        Sensitivity   Specificity
Model 1    944/1917 (49.24%)    1647/1915 (86%)      78            64
Model 2    433/1917 (22%)       1724/1915 (90%)      69            55
Model 3    492/1917 (25.7%)     1673/1915 (87.4%)    67            54
Model 4    501/1917 (26.1%)     1473/1915 (76.9%)    53            51
Model 5    440/1917 (23%)       1665/1915 (86.9%)    64            53
Model 6    756/1917 (39.4%)     1418/1915 (74%)      72            58
YuChen     51/1917 (2.7%)       NA                   NA            NA
Test data: 1917 positive-evidence Arabidopsis PPI
pairs + 1915 negative Arabidopsis PPI pairs. The
probability of predicting a random pair to be a true
PPI is 2.6%.
Observation: The modified datasets
are able to remove almost all negative pairs.
27 ROC curves show that the models' predictive power
is much better than random prediction.
Prepared by Xiao Yang
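For readers who want to reproduce such a curve, an ROC curve and its area can be derived from prediction scores as sketched below. The function names are hypothetical, labels are assumed to be 1 (real PPI) or 0 (negative pair), and both classes must be present. An area of 0.5 corresponds to random prediction.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    The threshold is lowered one distinct score at a time; tied scores
    are handled together so they contribute a single point.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```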
28 Model prediction pattern correlations
Prediction probability correlation
              arab   human      yeast      c. elegans   cdhy       drosophila
arab          1      0.123131   0.158611   0.177077     0.250751   0.047598
human                1          0.603687   0.210406     0.09851    0.553002
yeast                           1          0.309629     0.47832    0.290454
c. elegans                                 1            0.242561   0.101113
cdhy (combined)                                         1          -0.22017
drosophila                                                         1
Notes:
1. "cdhy" means combining the C. elegans,
Drosophila, human and yeast data.
2. The Drosophila dataset has the poorest prediction
trend correlation with the Arabidopsis dataset (0.048).
3. The combined dataset correlates more strongly with
Arabidopsis (0.251) than any individual dataset does.
30 Summary
- Built an easy-to-use, working system for PPI
prediction based on sequence information only.
- Constructed PPI prediction models and demonstrated the
concept of using cross-species information for
plant PPI prediction when experimental
information is lacking.
- Applied PPI prediction to the MOA study of leading genes.
32 Acknowledgements
- Zheng Li
- J.D. Liu
- Everyone in bioinformatics team
- Paggy Sullivan and University relations