Title: Using protein interaction networks to identify
1Using protein interaction networks to identify
candidate disease genes in psychiatric illness.
Richard Adams, Psychiatric Genetics Lab, Medical
Genetics Section, Molecular Medicine Centre, WGH,
University of Edinburgh. richard.adams_at_ed.ac.uk
Human genetics successful in identifying genetic
causes of monogenic (Mendelian) disorders -
strong correlation between mutation and disease
- mapping studies able to pinpoint susceptible
region quite precisely - small number of
candidate genes to study - reproducible
results across populations. - gt1000 disease
genes identified to date
Less successful in identifying genetic basis of
complex (non-Mendelian) genetic disease - many
mutations in many genes may contribute to the
disease - often there is a high environmental
influence - weak correlation between mutation
and disease. - mapping studies often able to
localise susceptible region only with 10s of Mb
- hundreds of candidate genes - irreproducible
results across populations - lt10 undisputed
disease genes identified to date
2Bipolar disorder
- - Bipolar disorder is a serious psychiatric
disease (manic depression) - - Affects 1 of the population
- - Family, twin and adoption studies indicate a
genetic risk factor (50). - - No molecular or cellular mechanism of the
disease. - - candidate gene approach has not yielded
breakthrough due to the - irreproducibility of the
results. - - little evidence of
alterations in brain structure - - Experimental approach challenging
- - no disease model in experimental organisms.
- - difficulty in obtaining tissue samples.
- Many genomic regions implicated in BPAD
predisposition - - many 00s of possible genes.
- - currently impractical to
study all simultaneously. - - how can we prioritize genes for further
study?
Schizophrenia
- about 10 genes with multiple lines of evidence
- most impact on the biology of the synapse
- receptors (CHRNA7, G72, GRM3)
- synaptogenesis (DISC1, neuregulin, dysbindin,
calcineurin) - Signal transduction ( RGS4)
3Linkage studies identify a candidate region on
chr 4p.
- - approx. 9.5Mb is shared between 3 of the 4
families - 33 known genes in this region, only 1 obvious
candidate, a dopamine receptor - no convincing - association shown to date .
4Approaches to disease gene prediction
Annotation based e.g., GeneSeeker -
assists in a candidate gene approach by searching
multiple databases and finding genes
based on user-provided query e.g., Bork group -
associate pathological conditions with Gene
Ontology terms. these assume some idea of the
genes underlying the disease, or will
find genes consistent with what is already known
of the disease.
e.g., POCUS - given 2 genomic regions,
identifies genes sharing GO ids and protein
domains makes no assumptions on protein
function or expression. but results are
dependent on extent of GO annotation.
Sequence based Ponting et al., - disease genes
under represented in slowly evolving ,
intracellular widely expressed genes.
Disease genes, in general have slightly
different sequence characteristics
(tend to be longer and more conserved).
Use these differences for
developing classifiers (e.g., Prospectr).
5Disease genes are significantly longer than
non-disease genes (p lt 4e-16).
Disease genes All genes
frequency
Protein length(aa)
6Prospectr, an alternating decision tree classifier
7Improved approaches to predicting disease genes?
Sequence based Hypothesis independent
approach Data available for all genes.
Possibility of discovering unexpected disease
genes - do not take into account biological
knowledge - predictions are not disease
specific.
- ideal approach would have information for all
genes but would be hypothesis independent. - -Therefore large -scale protein interaction and
expression data may be key.
8Protein interaction networks - key facts
- Aggregation of multiple bimolecular interactions
into - an undirected , scale free cyclic graph
- Networks are not random many sparsely
connected proteins are linked through a smaller
number of highly connected hubs. - Networks are also modular - lt20 highly connected
proteins loosely connected to rest of network. - Hubs are likely to be more ancient, conserved
proteins. - Compared to random networks, any 2 nodes on
average can be connected by shorter paths.
- Can the properties of proteins in interaction
networks be used to discriminate disease vs
non-disease genes?
9Clustering coefficient Neighbor
count Articulation points Motif membership
10Clustering coefficient ( 2n/k(k-1) ) Neighbor
count Articulation points Motif membership
11Clustering coefficient Neighbor
count Articulation points Motif membership
12Clustering coefficient Neighbor count
(degree) Articulation points Motif membership
13Clustering coefficient Neighbor
count Articulation points Motif membership
14Clustering coefficient Neighbor
count Articulation points Motif membership
15Sources of human protein interaction
data. Thought to be 40 000 - 200 000 pairwise
interactions in humans Experimental-
literature (HPRD, BIND), 8802 proteins, 19000
interactions real
interactions not all hi
throughput - derived from many experimental
approaches - nomenclature
problems Predictions i) - extrapolated from
lower organism high throughput data, 3890
proteins, 11 000 interactions (high
quality), 11 000 proteins, 70 000 interactions
(low quality) (Lehner and Fraser Genome
Biology 5 R63) not
entirely theoretical -
only contains ancient proteins, no vertebrate
specific proteins . ii)
Extraploated from vertebrate interactions
iii) -predicted from structural
considerations (HPID)
20 000 proteins, 22 100 000 interactions.
probably contains most
theoretically possible interactions.
good coverage of genome.
- majority will be false
positives, needs filters to be useful.
16Generating composite protein interaction sets
- need to get as big a dataset as possible with as
many interactions as possible - therefore have developed Perl modules to merge
datasets from disparate sources - - modules are part of the BioPerl
distribution - - deals with nomenclature issues when
deciding if an interaction is common to - both datasets or not.
17Comparing network properties of disease vs Non-
disease genes.
Merge
HPRD
L-F
Database analysis
ND
D
ND
D
ND
D
15.3
18.2
24.6
23.8
19.8
20.2
Articulation point
5.8
6.4
4.8
4.6
5.3
5.1
median no. nbors
0.13
0.12
0.13
0.14
0.12
0.11
Median clust. Coeff..
Small difference, little discrimination in
classifier.
18Identifying candidate disease genes by molecular
triangulation (Krauthammer et al., PNAS 2004
101 15148-15153)
- - Alzheimers Disease is a complex genetic
disease with 4 known affective genes. - - use a custom protein interaction database
derived from text-mining, with 3100 nodes and - 17000 interactions.
-
- - assume that
- 1 . AD related genes are clustered in a
subnetwork of protein interactions, - 2 . Randomly selected seed genes are
uniformly distributed around the network. - Use set of seed genes derived from expert
list, or linkage results, or 4 known genes - - score each protein in network by how
close it is to the seed genes. - proteins can be ranked by their scores
- - proteins in expert lists can be
objectively rated - - high scoring proteins can be picked out
from others in a given linkage region
19Method Initalise give seed genes an initial
score For each protein P in network
calculate distance to each seed gene.
calculate score S for P where score ?
f(distance) increment Ps score by S next
P To calculate significance perform above
algorithm 1000x using random seed gene
selections get score distribution for each
node. P value for a given score S is the
probability of exceeding that score in random
selections e.g., if S ? seed score/(distance
1)
5
2
20Method Initialise give seed genes an initial
score For each protein P in network
calculate shortest path length to each seed gene
(Djikstra). calculate score S for P where
score ? f(distance) increment Ps score by
S next P To calculate significance perform
above algorithm 1000x using random seed gene
selections get score distribution for each
node. P value for a given score S is the
probability of exceeding that score in random
selections
5/4 (1.25)
5
2
21Method Initalise give seed genes an initial
evidence-based score For each protein P in
network calculate shortest path length to
each seed gene (Djikstra). calculate score S
for P where score ? f(distance) increment Ps
score by S next P To calculate significance
perform above algorithm 1000x using random seed
gene selections get score distribution for
each node. P value for a given score S is the
probability of exceeding that score in random
selections
5/4 2/7 1.54
5
2
22Method Initalise give seed genes an initial
score For each protein P in network
calculate shortest path length to each seed gene
(Djikstra). calculate score S for P where
score ? f(distance) increment Ps score by
S next P To calculate significance perform
above algorithm 1000x using random seed gene
selections get score distribution for each
node. P value for a given score S is the
probability of exceeding that score in random
selections
2.33
1.54
2.00
2.25
5.5
3.17
2.9
3.25
2.67
23Results of Krauthammer study
- were able to rank expert list
- - using 4 known genes as seeds, expert curated
genes appeared higher rated than by chance - highlighted novel candidate genes for which some
functional evidence exists. - produced a shortlist of 11 genes from 158
candidate genes from whole genome linkage study - .
- Shortcomings
- - Only a limited set of protein interactions and
proteins - Expert list used as a benchmark is not a gold
standard and may be flawed. - No attempt to use different distance measures.
24Identifying genes predisposing to BPAD / SCZ
- unlike Alzheimers, we dont have 4 definitively
known genes . - currently testing how this approach works using
the 10 strongest SCZ candidates - how well does this approach work for genes that
may be more loosely connected - in a network?
- - how well does assumption hold about clustering
of seeds in a sub-network? - use a leave-one-out approach - how highly
does the left out gene score? - can we enhance the concept of network distance?
- e.g., giving edges a weight depending on the
number of independent Shortest paths - between 2 nodes.
- e.g., weighting edges where both proteins are
known to exist in the brain (BIND). - e.g., calculating a Cartesian distance using a
3D network representation? - is there a way to model the expected score
distribution of random seed genes? - as number of nodes increases, shortest path
calculations become either very slow - or very memory intensive.
- - this has to date been limiting for us, so far
we have just prototyped code using a small
network.