Title: Canadian Bioinformatics Workshops
1Canadian Bioinformatics Workshops
3Module 5: Gene Function Prediction
- Quaid Morris
- Pathway and Network Analysis of omics Data
- May 2-3, 2011
http://morrislab.med.utoronto.ca
4Outline
- Concepts in gene function prediction
- Guilt-by-association
- Gene recommender systems
- Gene function prediction use cases
- Functional interaction networks
- Scoring interactions by guilt-by-association
- GeneMANIA and STRING
- GeneMANIA demo
- STRING demo
5Using genome-wide data in the lab
ChIP-chip regulation data
Protein-protein interaction data
Genetic interaction data
?!?
Microarray expression data
6Genomics revolution, the bad news
Genomics datasets are
- noisy,
- redundant,
- incomplete,
- mysterious,
- massive
11Google can't do biology
12Google can't do biology
13Guilt-by-association principle
Microarray expression data
Co-expression network
Conditions
Genes
Eisen et al (PNAS 1998)
Fraser AG, Marcotte EM. A probabilistic view of gene function. Nat Genet. 2004 Jun;36(6):559-64
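As a rough illustration of the guilt-by-association idea on this slide, the sketch below links genes whose expression profiles across conditions are strongly correlated. The gene names, expression values and the 0.8 threshold are made up for illustration, and Pearson correlation is only one possible similarity measure, not necessarily the one used in the cited papers.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_network(expression, threshold=0.8):
    """Return {(gene1, gene2): correlation} for pairs at or above the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            edges[(g1, g2)] = r
    return edges


# Hypothetical toy data: one profile per gene, one value per condition.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
network = coexpression_network(expression)
print(sorted(network))  # only geneA and geneB rise and fall together
```

In this toy data geneC is anti-correlated with the other two genes, so it is left unlinked; whether to also keep strong negative correlations is a design choice.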
14Two types of functional prediction
- "Give me more genes like these"
- e.g. find more genes in the Wnt signaling pathway, find more kinases, find more members of a protein complex
- "What does my gene do?"
- Goal: determine a gene's function based on who it interacts with (guilt-by-association).
15Give me more genes like these
Input
Network and profile data
Output
from GeneMANIA
Gene recommender system
Query list
CDC48 CPR3 MCA1 TDH2
e.g., GeneMANIA, STRING, bioPIXIE (not updated)
16What does my gene do? Solution 1
Input
Network and profile data
Output
Gene recommender system then enrichment analysis
Query list
CDC48
e.g., GeneMANIA, bioPIXIE
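The enrichment step in Solution 1 can be sketched with a one-sided hypergeometric test: given the genes returned by the recommender, how surprising is their overlap with an annotated gene set? This is one standard choice, assumed here for illustration; it is not necessarily the test GeneMANIA or bioPIXIE apply, the counts are invented, and no multiple-testing correction is shown.

```python
from math import comb


def hypergeom_pvalue(n_genome, n_annotated, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random
    from a genome in which n_annotated genes carry the annotation."""
    max_overlap = min(n_annotated, n_list)
    total = comb(n_genome, n_list)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes in total, 5 annotated to a function,
# 5 genes returned by the recommender, 4 of which carry the annotation.
p_value = hypergeom_pvalue(n_genome=20, n_annotated=5, n_list=5, n_overlap=4)
print(p_value)  # about 0.005
```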
17What does my gene do? Solution 2
CDC48
Input
Network and profile data
Supervised learning of a classifier
Classifier
from FuncBase
(e.g. Support Vector Machine, Naïve Bayes, Neural
networks, Random Forests)
Gene annotations, e.g. Gene Ontology
18Comparing solutions
- Supervised learning
- Needs gene sets for training; training is typically time-consuming and done offline, but the trained classifier is very fast
- So, fast but inflexible
- Gene recommender systems
- Typically most computation is done online (except for offline calculation of the composite functional interaction network, see next slide), so updating is easier and arbitrary gene sets can be used
- So, a little slower but much more flexible
Note: "give me more genes like these" can also be solved with supervised learning, so long as the gene set is predefined
19Composite functional interaction/linkage/association networks
ChIP-chip regulation data
Protein-protein interaction data
Genetic interaction data
Composite functional association network
Microarray expression data
20Pre-computed functional interaction networks
Pre-combine networks, e.g. by simple addition or Naïve Bayes
Pavlidis et al, 2002; Marcotte et al, 1999; bioPIXIE
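Combining by simple addition can be sketched as summing edge weights across networks. The edge representation (a dict keyed by an unordered gene pair), the gene names and the weights below are assumptions made for illustration, not the format used by any of the cited systems.

```python
def combine_networks(networks, weights=None):
    """Weighted sum of edge weights across networks.

    Each network maps frozenset({gene1, gene2}) -> edge weight.
    With weights=None every network counts equally (simple addition).
    """
    if weights is None:
        weights = [1.0] * len(networks)
    combined = {}
    for net, w in zip(networks, weights):
        for edge, value in net.items():
            combined[edge] = combined.get(edge, 0.0) + w * value
    return combined


# Hypothetical toy networks over genes A, B and C.
coexpression = {frozenset({"A", "B"}): 0.5, frozenset({"B", "C"}): 0.2}
physical = {frozenset({"A", "B"}): 1.0}

composite = combine_networks([coexpression, physical])
print(composite[frozenset({"A", "B"})])  # 1.5
```

Passing non-uniform `weights` is where a per-network weighting scheme could plug in; how those weights are chosen is a separate question.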
21Composite networks: One size doesn't fit all
- Gene function could be a/the
- Biological process,
- Biochemical/molecular function,
- Subcellular/Cellular localization,
- Regulatory targets,
- Temporal expression pattern,
- Phenotypic effect of deletion.
Some networks may be better for some types of
gene function than others
22Solution: Query-specific weights
Pavlidis et al, 2002; Lanckriet et al, 2004; Mostafavi et al, 2008
23Two rules for network weighting
- Relevance
- The network should be relevant to predicting the function of interest
- Test: Are the genes in the query list more often connected to one another than to other genes?
- Redundancy
- The network should not be redundant with other datasets; this is particularly a problem for co-expression
- Test: Do the two networks share many interactions?
- Caveat: Shared interactions also provide more confidence that the interaction is real.
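The two tests can be made concrete with crude counting statistics: link density within the query versus between the query and other genes (relevance), and the overlap between two networks' edge sets (redundancy). This is a minimal sketch under assumed data structures and invented gene names; it is not the weighting procedure of the cited papers.

```python
from itertools import combinations


def edge_density(edges, pairs):
    """Fraction of the given gene pairs that are linked in the network."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(frozenset(p) in edges for p in pairs) / len(pairs)


def relevance(edges, genes, query):
    """Link density within the query vs. between query and other genes."""
    query = set(query)
    others = set(genes) - query
    within = edge_density(edges, combinations(sorted(query), 2))
    between = edge_density(edges, ((q, o) for q in query for o in others))
    return within, between


def redundancy(edges_a, edges_b):
    """Jaccard overlap between the edge sets of two networks."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 0.0


# Hypothetical toy networks: each is a set of unordered gene pairs.
genes = ["A", "B", "C", "D", "E"]
net1 = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]}
net2 = {frozenset(p) for p in [("A", "B"), ("D", "E")]}

within, between = relevance(net1, genes, query=["A", "B", "C"])
overlap = redundancy(net1, net2)
print(within, between, overlap)
```

Here `within` is much larger than `between`, so `net1` looks relevant to the query, while the small `overlap` suggests `net2` adds mostly different interactions.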
24Scoring nodes by guilt-by-association
Query list positive examples
25Scoring nodes by guilt-by-association
Query list positive examples
Score
high
low
Two main algorithms
26Node scoring algorithm details
- Direct neighbour: node score depends on
- Strength of links to positive examples
- Number of positive neighbours
- Label propagation: node score depends on
- Strength of links to, and number of, positive direct neighbours
- Number of shared neighbours with positive examples
- Modular structure of the network
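The difference between the two approaches can be seen on a toy network. The sketch below uses a summed-link-weight score for direct neighbours and one common iterative formulation of label propagation (scores spread along degree-normalised links, damped by a factor `alpha`). Both are assumptions chosen for illustration and are not claimed to be the exact algorithms used by STRING or GeneMANIA; the network and parameter values are invented.

```python
from math import sqrt


def direct_neighbour_scores(adj, positives):
    """Score = summed weight of links to positive examples."""
    return {
        node: sum(w for nbr, w in nbrs.items() if nbr in positives)
        for node, nbrs in adj.items()
    }


def label_propagation_scores(adj, positives, alpha=0.5, n_iter=50):
    """Iterate f <- y + alpha * S f, where S is the adjacency matrix
    with each link divided by sqrt(degree_i * degree_j)."""
    degree = {node: sum(nbrs.values()) for node, nbrs in adj.items()}
    y = {node: 1.0 if node in positives else 0.0 for node in adj}
    f = dict(y)
    for _ in range(n_iter):
        f = {
            node: y[node] + alpha * sum(
                w * f[nbr] / sqrt(degree[node] * degree[nbr])
                for nbr, w in nbrs.items()
            )
            for node, nbrs in adj.items()
        }
    return f


# Hypothetical chain A - B - C - D with A as the only positive example.
adj = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
direct = direct_neighbour_scores(adj, positives={"A"})
propagated = label_propagation_scores(adj, positives={"A"})
print(direct)
print(propagated)
```

With direct-neighbour scoring only B receives a non-zero score; with label propagation, C and D also receive scores that shrink with distance from the positive example.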
27Label propagation example
Before
After
28Three parts of GeneMANIA
- A large, automatically updated collection of interaction networks.
- A query algorithm to find genes and networks that are functionally associated with your query gene list.
- An interactive, client-side network browser with extensive link-outs.
29GeneMANIA data sources
Legend: minor curation, major curation
Network types: Co-expression, Co-localization, Pathways, Physical interactions, Genetic interactions, Shared domains, Predicted interactions, Other
- Gene ID mappings from Ensembl and Ensembl Plants
- Network/gene descriptors from Entrez-Gene and PubMed
- Gene annotations from Gene Ontology, GOA, and model organism databases
MGI Chemogenomics
30Gene identifiers
- All unique identifiers within the selected organism, e.g.
- Entrez-Gene ID
- Gene symbol
- Ensembl ID
- UniProt (primary)
- also, some synonyms and organism-specific names
- We use the Ensembl database for gene mappings (but we mirror it once every 3 months, so sometimes we are out of date)
31Current status
- Six organisms
- Human, mouse, yeast, worm, fly, A. thaliana; rat coming soon
- 1,250 networks (about 50 co-expression, 35 physical interaction)
- Web network browser
32Cytoscape plugin
http://www.genemania.org/plugin/
35 QueryRunner
36http://cytoscapeweb.cytoscape.org/
37STRING http://string-db.org/
38STRING results
39STRING results
40GeneMANIA vs STRING
- STRING (2003-present)
- Large organism coverage
- Protein focused
- Uses eight pre-computed networks
- Heavy use of phylogeny to infer functional interactions; also contains interactions derived from text mining
- Uses direct interaction to score nodes
- Link weights are the probability of a functional interaction
- GeneMANIA webserver (2010-present)
- Covers 6 (not 7) major model organisms (but can add more with plugin)
- Gene focused
- Thousands of networks; weights are not pre-computed; can upload your own network
- Relies heavily on functional genomic data, so has genetic interactions, phenotypic info, chemical interactions
- Allows enrichment analysis
- Uses label propagation to score nodes
41Meaning of GeneMANIA link weights
Simple intuition: Sum of link weights to neighbors in each data source is 100
Weight 50
Weight 25
Precise definition: Weight = 100 x 1/sqrt(# of neighbours of node 1) x 1/sqrt(# of neighbours of node 2)
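A short worked example of the precise definition, reading it as weight = 100 / sqrt(n1 x n2), where n1 and n2 are the neighbour counts of the two linked nodes (an assumption about the symbols lost in the slide text). The function name and the neighbour counts are illustrative only.

```python
from math import sqrt


def link_weight(n_neighbours_1, n_neighbours_2):
    """100 x 1/sqrt(# neighbours of node 1) x 1/sqrt(# neighbours of node 2)."""
    return 100.0 / sqrt(n_neighbours_1 * n_neighbours_2)


print(link_weight(2, 2))  # 50.0
print(link_weight(4, 4))  # 25.0
```

When every node in a data source has the same number of neighbours d, each link weighs 100/d and a node's d links sum to 100, which matches the simple intuition. With uneven neighbour counts the sum only approximates 100.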
42GeneMANIA future directions
- Rat (1-3 weeks), next is probably E. coli
- Non-coding genes (miRNAs!!!!)
- Regulatory networks (ChIP, RNA-protein, miRNA-mRNAs)
- More phenotypic information (OMIM, etc.)
- Orthology mapping for inferring interologs
43GeneMANIA URLs
- Main site (stable but still fun)
- http://www.genemania.org
- Beta site (new and edgy but possibly unreliable)
- http://beta.genemania.org