Guan N. Lin (Nick) - PowerPoint PPT Presentation

About This Presentation

Title:

Guan N. Lin (Nick)

Slides: 33

Provided by: cane6

Transcript and Presenter's Notes

Title: Guan N. Lin (Nick)

1
Bioinformatics
Prediction of Plant Protein-Protein Interaction
Using Sequence Only
Guan N. Lin (Nick) Bioinformatics Intern
2
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Results Analysis
PPI prediction for leading genes
Acknowledgements

3
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Results Analysis
PPI prediction for leading genes
Acknowledgements

4
Protein-protein interaction (PPI)

PPI
Each living cell is packed with proteins that
continuously interact with each other to control
the cell's growth, function and eventual fate.
These interactions can alter protein kinetic
properties, substrate binding, catalysis, etc.
Researchers have developed a variety of chemical
and biochemical techniques to understand the who,
what, where, when and why of those interactions.

5
Systems biology: From cell to network
6
PPI (Protein-protein interaction) prediction

A study combining bioinformatics and structural
biology to identify and catalog interactions
between pairs or groups of proteins.
Determination by experiment: yeast two-hybrid,
affinity purification, co-immunoprecipitation, etc.
Prediction by computation: model building through
pattern discovery using sequences, protein
structural information, evolutionary information, etc.
PPI network construction provides important
insight for investigating intracellular signaling
pathways.

7
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Results Analysis
PPI prediction for leading genes
Acknowledgements

8
Project goals and obstacles

Goals
Use parts of free tools and open-source code
to build a PPI prediction pipeline based
on protein sequence information only.
Use cross-species PPI data, such as human,
Drosophila, yeast and C. elegans, for
genome-scale plant PPI prediction.
Obstacles
Open-source code lacks the organization and
documentation needed for system integration.
Computational complexity limits how much analysis
can be done in the limited time available.
It is difficult to generalize a consistent pattern
from cross-species data.

9
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Basic scheme
Tool development
Model design and tuning
Results Analysis
PPI prediction for leading genes
Acknowledgements

10
Basic scheme (how do we do it?)
Rationale: PPIs are basic structural elements of
molecular circuitry in biological systems and
will provide valuable insights for
optimization/MOA.
Training data (sequences of interacting proteins)
Predict new interactions from sequences
SVM Kernel classifier
Sequence patterns
Validation
Training set for the SVM kernel classifier:
Positive training set (experimental interactions,
some for training, some for validation)
Negative training set (mostly randomly generated
pairs)
11
Using conjoint triads for sequence pattern
construction

Reduced-alphabet sequence pattern training
Classify the 20 amino acid (AA) types into 7 classes based on
their properties (hydrogen bonding, hydrophobicity,
side-chain volume, etc.).
Build AA triplets from the 7 classes, called
conjoint triads (7 x 7 x 7 = 343 unique types), and store them in a vector V.
Calculate the frequency of each triad for each
protein sequence.

Shen et al., PNAS 2007
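
The triad construction on this slide can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the seven-class grouping is the one commonly reported for Shen et al. (2007) and should be checked against the paper, and the names `CLASSES`, `triad_frequencies` and `pair_features` are hypothetical. Concatenating the two proteins' vectors into one pair feature is also an assumption; the slides do not say how a pair is encoded.

```python
from itertools import product

# Assumed 7-class amino-acid grouping (as commonly cited for Shen et al., 2007).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i + 1) for i, group in enumerate(CLASSES) for aa in group}

# All 7 * 7 * 7 = 343 triads in a fixed order; this plays the role of the vector V.
TRIADS = ["".join(t) for t in product("1234567", repeat=3)]
TRIAD_INDEX = {t: i for i, t in enumerate(TRIADS)}


def triad_frequencies(sequence):
    """Return a list of 343 conjoint-triad counts for one protein sequence."""
    counts = [0] * len(TRIADS)
    # Non-standard residues map to None; windows containing them are skipped.
    encoded = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    for i in range(len(encoded) - 2):
        window = encoded[i:i + 3]
        if None in window:
            continue
        counts[TRIAD_INDEX["".join(window)]] += 1
    return counts


def pair_features(seq_a, seq_b):
    """Assumed pair encoding: the two 343-long vectors placed end to end."""
    return triad_frequencies(seq_a) + triad_frequencies(seq_b)
```

Raw counts are returned here; any normalization by sequence length would be a further design choice that the slides do not specify.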
12
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Basic scheme
Tool development
Model design and tuning
Results Analysis
PPI prediction for leading genes
Acknowledgements

13
System/Tool design flowchart
Java Codes
C/C++ Codes
SVM Prediction
Input Sequence
SVM Training
Build sequence pattern
Test sequences
Conjoint Triads
Optimize parameters (C, γ)
SVM test input SVM training model
Triads Frequency
Build SVM training model
Prediction
SVM training Input
Negative PPI pairs are generated from the
proteins in positive PPI pairs. If AB and IJ are
positive PPIs, then AI, AJ, BI and BJ can be
considered negative pairs. # of negative
pairs = # of positive pairs
Generate negative PPI pairs
Prepare training Evidence
Raw PPI file
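
The negative-pair step in the flowchart could be sketched as below. This is one possible reading of the slide, not the tool's implementation: proteins from the positive pairs are recombined, known positives are excluded, and as many negatives as positives are sampled. The function name `generate_negative_pairs` and the `seed` parameter are hypothetical.

```python
import random
from itertools import combinations


def generate_negative_pairs(positive_pairs, seed=0):
    """Sample as many candidate negative pairs as there are positive pairs."""
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({protein for pair in positive_pairs for protein in pair})
    # Any pairing of known proteins that is not a positive pair is a candidate.
    candidates = [
        (a, b)
        for a, b in combinations(proteins, 2)
        if frozenset((a, b)) not in positives
    ]
    rng = random.Random(seed)
    k = min(len(positives), len(candidates))
    return rng.sample(candidates, k)


# With positives AB and IJ, the candidates are AI, AJ, BI and BJ.
negatives = generate_negative_pairs([("A", "B"), ("I", "J")])
```

Enumerating all pairs is quadratic in the number of proteins, which is acceptable for a few thousand genes; a larger set would call for rejection sampling instead.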
14
Screenshot of the PPI prediction tool
15
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Basic scheme
Tool development
Model design and tuning
Results Analysis
PPI prediction for leading genes
Acknowledgements

16
Publicly available experimental data

Arabidopsis: 4,400 PPI pairs (TAIR, BioGRID, IntAct), 3,000 genes
C. elegans: 5,400 PPI pairs (BioGRID, IntAct)
Human: 23,000 PPI pairs (HPRD, IntAct), 6,900 genes
Drosophila: 24,000 PPI pairs (IntAct), 7,000 genes
Yeast: 48,000 PPI pairs (BioGRID, IntAct), 7,000 genes

17
SVM for triad pattern model training and tuning
SVM training parameters
SVM parameter optimization is performed using a
grid-search procedure. Parameters: C, the cost, to
minimize training error (value range 0.125 to
512); γ, the kernel gamma, to maximize training
capability (value range 0.125 to 8)
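
A grid search over these two ranges might look like the sketch below. The slide gives only the end points, so the power-of-two spacing is an assumption (it is a common convention for SVM grids). The `score` callable is a hypothetical stand-in for whatever cross-validation accuracy the real pipeline computes.

```python
from itertools import product


def power_of_two_grid(low_exp, high_exp):
    """Return 2**low_exp, ..., 2**high_exp as floats."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


# Assumed spacing: powers of two between the end points given on the slide.
C_VALUES = power_of_two_grid(-3, 9)      # 0.125 up to 512
GAMMA_VALUES = power_of_two_grid(-3, 3)  # 0.125 up to 8


def grid_search(score):
    """Return (best_score, best_C, best_gamma) for a caller-supplied score(C, gamma)."""
    best = None
    for c, gamma in product(C_VALUES, GAMMA_VALUES):
        s = score(c, gamma)
        if best is None or s > best[0]:
            best = (s, c, gamma)
    return best
```

Under this spacing the grid has 13 x 7 = 91 parameter combinations, each of which requires one training run, so the search cost grows quickly with training-set size.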
18
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Results Analysis
Preliminary results and problems
Further method modification
Further results
PPI prediction for leading genes
Acknowledgements

19
Accuracy measurements
                          Real Outcome
                          TRUE                  FALSE
Predicted Outcome TRUE    TP (True Positive)    FP (False Positive)
Predicted Outcome FALSE   FN (False Negative)   TN (True Negative)
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
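
As a small worked illustration of the two definitions above, the following sketch counts the four outcomes from paired labels and applies the formulas. The function names and the toy labels are hypothetical and are not taken from the presentation's data.

```python
def confusion_counts(real, predicted):
    """Return (tp, fp, fn, tn) from two equal-length lists of booleans."""
    tp = sum(1 for r, p in zip(real, predicted) if r and p)
    fp = sum(1 for r, p in zip(real, predicted) if not r and p)
    fn = sum(1 for r, p in zip(real, predicted) if r and not p)
    tn = sum(1 for r, p in zip(real, predicted) if not r and not p)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of real positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that were predicted negative."""
    return tn / (fp + tn)


# Toy example: 4 real interactions and 4 real non-interactions.
real = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]
tp, fp, fn, tn = confusion_counts(real, predicted)
```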
20
Preliminary results and observations

Prediction for Arabidopsis (2,600 positive PPI +
2,600 negative PPI) using different data sets
without any filtering or processing.

Species
Arabidopsis X X X X
Human X X X
Yeast X X X
Drosophila X
C. elegans X
Accuracy (one TP/TN pair per training set): TP 45, TN 86; TP 2, TN 96; TP 98, TN 3; TP 10, TN 92; TP 63, TN 38; TP 25, TN 55
Observations: 1. Overall accuracy is low. 2.
Data from different species give very different
prediction patterns; some, such as human and yeast,
go to opposite extremes.
=> Conclusion: no meaningful or useful
predictions so far.
21
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Results Analysis
Preliminary results and problems
Further method modification
Further results
PPI prediction for leading genes
Acknowledgements

22
Careful selection of subsets of cross-species
data for training is essential to get valid
results
Using GO (Gene Ontology) slim categories for data
filtering

Red bar: Arabidopsis whole-genome proteins
Blue bar: Arabidopsis PPI proteins
The correlation between them is 0.92
Proteins from the PPI data do represent the overall trend of
the whole genome
Filtering species data by GO TAIR slim.

23
How to categorize proteins into GO slim terms:
using GO level indexing
Step 1: make the GO index
Ontology files
Step 2: link the GO index to genes
Gene-to-GO association files
Step 3: get the GO slim term from GO_Index
YBR085W -> GO:0055085 transmembrane transport,
index 3-10-5-44
GO slim terms and their indexes:
Developmental process (GO:0007252) 3-9-26
Transport (GO:0006810) 3-10-5
Signal transduction (GO:0007165) 3-7-7-15
The index 3-10-5-44 begins with 3-10-5, so
YBR085W belongs to the Transport slim category
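
The lookup in the last step can be sketched as a prefix match on level indexes. This is a minimal illustration using only the three slim indexes shown on the slide. It assumes each term has a single index, whereas a real GO term can have several parent paths. The names `SLIM_INDEX` and `slim_category` are hypothetical.

```python
# Slim categories and level indexes as listed on the slide.
SLIM_INDEX = {
    "Developmental process": "3-9-26",
    "Transport": "3-10-5",
    "Signal transduction": "3-7-7-15",
}


def parse_index(index):
    """Turn '3-10-5-44' into (3, 10, 5, 44) so levels compare as numbers."""
    return tuple(int(part) for part in index.split("-"))


def slim_category(term_index, slim_index=SLIM_INDEX):
    """Return the slim category whose index is the longest prefix of term_index."""
    term = parse_index(term_index)
    best_name, best_len = None, 0
    for name, idx in slim_index.items():
        slim = parse_index(idx)
        if term[:len(slim)] == slim and len(slim) > best_len:
            best_name, best_len = name, len(slim)
    return best_name


# GO:0055085 (transmembrane transport) has index 3-10-5-44 on the slide.
category = slim_category("3-10-5-44")
```

Comparing parsed tuples rather than raw strings avoids a false match between, for example, 3-10-5 and 3-10-50.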
24
Next step: use the slim category frequency
distribution to select subsets of cross-species
data
Use the percentages observed in the Arabidopsis data to
select similar subsets for the other species
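
One way to read this selection step is as stratified sampling: draw items from another species so that the share of each slim category matches a target distribution. The sketch below assumes each item carries exactly one slim category label, which the slides do not state. The function name, the `total` parameter and the fractions used in the example are hypothetical.

```python
import random
from collections import defaultdict


def select_matching_subset(items, target_fractions, total, seed=0):
    """Pick up to `total` identifiers whose category shares follow target_fractions.

    items: list of (identifier, slim_category) tuples.
    target_fractions: dict mapping slim_category to a fraction of `total`.
    """
    by_category = defaultdict(list)
    for identifier, category in items:
        by_category[category].append(identifier)
    rng = random.Random(seed)
    subset = []
    for category, fraction in target_fractions.items():
        pool = by_category.get(category, [])
        wanted = round(fraction * total)
        # If a category is under-represented, take everything available.
        subset.extend(rng.sample(pool, min(wanted, len(pool))))
    return subset


# Made-up example: 50 items per category, target shares of 75% and 25%.
items = [(f"t{i}", "Transport") for i in range(50)]
items += [(f"s{i}", "Signal transduction") for i in range(50)]
subset = select_matching_subset(
    items, {"Transport": 0.75, "Signal transduction": 0.25}, total=40
)
```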
25
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Results Analysis
Preliminary results and problems
Further method modification
Further results
PPI prediction for leading genes
Acknowledgements

26
Comparison of model results with modified datasets
Data selection Model 1 Model 2 Model 3 Model 4 Model 5 Model 6 YuChen
Arabidopsis X ?
Human X X
Yeast X X ?
C. elegans X X
Drosophila X X
Accuracy
Model     True positive        True negative        Sensitivity   Specificity
Model 1   944/1917 (49.24%)    1647/1915 (86%)      78            64
Model 2   433/1917 (22%)       1724/1915 (90%)      69            55
Model 3   492/1917 (25.7%)     1673/1915 (87.4%)    67            54
Model 4   501/1917 (26.1%)     1473/1915 (76.9%)    53            51
Model 5   440/1917 (23%)       1665/1915 (86.9%)    64            53
Model 6   756/1917 (39.4%)     1418/1915 (74%)      72            58
YuChen    51/1917 (2.7%)       NA                   NA            NA
Test data: 1917 positive-evidence Arabidopsis PPI
pairs + 1915 negative Arabidopsis PPI pairs. The
probability of predicting a random pair to be a true
PPI is 2.6%. Observation: the modified datasets
are able to remove almost all negative pairs.
27
ROC curves show that the predictive power of the
models is much better than random prediction.
Prepared by Xiao Yang
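
For reference, a ROC curve and its area can be computed from prediction scores as in the sketch below. It is a generic illustration with made-up labels and scores, not the analysis behind this slide. A model no better than random gives an area near 0.5. Both classes must be present in `labels`, otherwise the rates are undefined.

```python
def roc_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) tuples."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a run of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: 1 marks a real interaction, 0 a non-interaction.
example_auc = auc(roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```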
28
Model prediction pattern correlations
Prediction probability correlation
                  arab   human      yeast      c. elegans   cdhy       drosophila
arab              1      0.123131   0.158611   0.177077     0.250751   0.047598
human                    1          0.603687   0.210406     0.09851    0.553002
yeast                               1          0.309629     0.47832    0.290454
c. elegans                                     1            0.242561   0.101113
cdhy (combined)                                             1          -0.22017
drosophila                                                             1
Notes: 1. "cdhy" means combining the C. elegans,
Drosophila, human and yeast data together. 2.
The Drosophila dataset has the poorest prediction
trend correlation with the Arabidopsis dataset (0.047598). 3.
The combined dataset exhibits a stronger
correlation with the Arabidopsis dataset (0.250751) than any individual dataset.
29
Outline

Project Background
Goals and Obstacles
Tool Development & Method Design
Results Analysis
Preliminary results and problems
Further method modification
Further results
PPI prediction for leading genes
Acknowledgements

30
Summary

Built an easy-to-use and successful system for PPI
prediction based on sequence information only.
Constructed PPI prediction models and proved the
concept of using cross-species information for
plant PPI prediction when experimental
information is lacking.
Applied PPI prediction to the MOA study of leading genes.

31
Outline