Title: Integrating heterogeneous genomic data: wrap-up
1. Integrating heterogeneous genomic data: wrap-up
CSCI 5980: Functional Genomics, Systems Biology, and Bioinformatics
- Rui Kuang and Chad Myers
- Department of Computer Science and Engineering
- University of Minnesota
2. Announcements
- HW 3 due tonight at midnight!
- Project presentations May 5, 7, 12, 14
- presentation schedule out tomorrow
- required attendance!
- Project reports due midnight, May 14th
- 10 pg: abstract, intro, methods, experiments, discussion and/or conclusion
3. Outline for today
- Wrap up discussion of inference of functional linkage networks based on heterogeneous data
- Paper discussion: Workman et al.
4. Handling noise
- General approaches
- Simple strategies
- Conjunctive integration (AND integration): keep only data/features that are supported independently across multiple input datasets
- Disjunctive integration (OR integration)
- Pros/cons?
- More sophisticated idea: use machine learning to learn which datasets are reliable (optimal weighting); requires a gold standard
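To make the simple strategies concrete, here is a minimal Python sketch of AND, OR, and weighted integration over toy data. The dataset names, gene pairs, integer reliability weights, and threshold are all invented for illustration; in practice the weights would be learned from a gold standard rather than set by hand.

```python
def pair(a, b):
    """Return an order-independent key for a gene pair."""
    return frozenset((a, b))

# Toy evidence: each dataset is the set of gene pairs it supports.
datasets = {
    "coexpression": {pair("A", "B"), pair("A", "C"), pair("B", "D")},
    "two_hybrid": {pair("A", "B"), pair("C", "D")},
    "colocalization": {pair("A", "B"), pair("A", "C")},
}

def conjunctive(datasets):
    """AND integration: keep pairs supported by every dataset."""
    return set.intersection(*datasets.values())

def disjunctive(datasets):
    """OR integration: keep pairs supported by at least one dataset."""
    return set.union(*datasets.values())

def weighted(datasets, weights, threshold):
    """Keep pairs whose summed dataset weights reach the threshold."""
    scores = {}
    for name, pairs in datasets.items():
        for p in pairs:
            scores[p] = scores.get(p, 0) + weights[name]
    return {p for p, s in scores.items() if s >= threshold}

# Hypothetical reliability weights (assumed, not learned here).
weights = {"coexpression": 2, "two_hybrid": 5, "colocalization": 3}

and_pairs = conjunctive(datasets)
or_pairs = disjunctive(datasets)
weighted_pairs = weighted(datasets, weights, threshold=5)
```

On this toy input, AND integration is the most conservative and OR the most permissive; the weighted scheme sits in between, trusting a single reliable dataset over a single noisy one.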
5. Gold standards
- Expert-curated assignments of genes to functional groups, complexes, or pathways
- Gene Ontology
- KEGG: Kyoto Encyclopedia of Genes and Genomes
- MIPS: Munich Information Center for Protein Sequences
6. Remember the Gene Ontology?
- Three hierarchies
- Molecular function
- Biological process
- Cellular component
- Curated annotation
7. Gold standard (pairwise data)
Predicted gene pairs
Based on TPs and FPs, calculate precision and recall, and draw ROC curves
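The evaluation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the lecture: the gold standard is assumed to provide both positive (related) and negative (unrelated) pairs, pairs are stored as alphabetically ordered tuples, pairs absent from the gold standard are ignored, and all genes and scores are invented.

```python
def precision_recall(predicted, positives, negatives):
    """Precision and recall of a set of predicted pairs."""
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    fn = len(positives - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_points(scored_pairs, positives, negatives):
    """Sweep the score threshold from high to low; return (FPR, TPR) points.

    Assumes distinct scores (ties are not grouped in this sketch).
    """
    ranked = sorted(scored_pairs, key=scored_pairs.get, reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for p in ranked:
        if p in positives:
            tp += 1
        elif p in negatives:
            fp += 1
        else:
            continue  # pair not covered by the gold standard
        points.append((fp / len(negatives), tp / len(positives)))
    return points

# Toy gold standard and toy prediction scores (all hypothetical).
positives = {("A", "B"), ("A", "C"), ("B", "C")}
negatives = {("A", "X"), ("B", "Y")}
scored = {("A", "B"): 0.9, ("A", "X"): 0.8, ("A", "C"): 0.7,
          ("C", "Z"): 0.6, ("B", "Y"): 0.2}

predicted = {p for p, s in scored.items() if s >= 0.5}
precision, recall = precision_recall(predicted, positives, negatives)
curve = roc_points(scored, positives, negatives)
```

Plotting the `curve` points (false positive rate on the x-axis, true positive rate on the y-axis) gives the ROC curve; repeating the precision/recall calculation at each threshold gives a precision-recall curve instead.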
8. An example evaluation based on GO gold standard
9. Disagreement between gold standards
10. Caveat: functional biases in evaluations
11. Comparison of individual datasets
(based on comparison against GO biological process gold standard)
Myers et al. Finding function: evaluation methods for functional genomic data. BMC Genomics (2006).
12. Process-specific evaluation
13. Bayesian data integration: a simple model
We'd like to infer a functional relationship (FR) between a gene pair, e.g. Cdc7 and Dbf4
(naïve Bayes classifier)
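As a rough sketch of what a naïve Bayes classifier computes for a gene pair: each dataset contributes a likelihood ratio, and the ratios are multiplied together with the prior odds, which is valid only if the datasets are conditionally independent given the class. The function name, the prior, and the per-dataset likelihoods below are hypothetical values chosen for illustration.

```python
import math

def naive_bayes_posterior(prior, likelihoods):
    """P(FR | evidence) under the naive Bayes independence assumption.

    prior:       P(FR), the prior probability that a pair is related.
    likelihoods: one (P(e_i | FR), P(e_i | not FR)) tuple per dataset.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for p_related, p_unrelated in likelihoods:
        # Each dataset adds its log-likelihood ratio independently.
        log_odds += math.log(p_related / p_unrelated)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical evidence for one gene pair from three datasets.
evidence = [(0.6, 0.05), (0.3, 0.01), (0.5, 0.2)]
posterior = naive_bayes_posterior(prior=0.01, likelihoods=evidence)
```

Working in log-odds keeps the computation numerically stable when many datasets are combined, and makes each dataset's contribution additive and easy to inspect.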
14. Modeling process-specific signal
[Figure: Cdc7-Dbf4 gene pair; contexts (derived from user's query); reliability variation across datasets and contexts]
What are we assuming here?
15. Bayesian integration: an intuitive view
[Figure: Cdc7-Dbf4 gene pair; contexts (derived from user's query); reliability variation across datasets and contexts]
[Figure: for the contexts "ribosome" and "membrane organization", PDFs of observed co-expression (Pearson correlation) for unrelated genes and for functionally related genes]
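One way to read the PDFs above: for a given context, the evidence carried by an observed correlation is the ratio of its density among functionally related pairs to its density among unrelated pairs. The sketch below assumes Gaussian class-conditional densities purely for simplicity, and every mean and standard deviation in it is invented; real distributions would be estimated from the gold standard.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical (mean, sd) of Pearson correlation per context and class.
CONTEXT_MODELS = {
    "ribosome": {"related": (0.70, 0.15), "unrelated": (0.0, 0.25)},
    "membrane organization": {"related": (0.15, 0.25), "unrelated": (0.0, 0.25)},
}

def likelihood_ratio(correlation, context):
    """P(correlation | related) / P(correlation | unrelated) in a context."""
    model = CONTEXT_MODELS[context]
    related = normal_pdf(correlation, *model["related"])
    unrelated = normal_pdf(correlation, *model["unrelated"])
    return related / unrelated

# The same observed correlation carries different weight in each context.
lr_ribosome = likelihood_ratio(0.6, "ribosome")
lr_membrane = likelihood_ratio(0.6, "membrane organization")
```

With these assumed parameters, a correlation of 0.6 is much stronger evidence in the ribosome context, where the two densities are well separated, than in the membrane organization context, where they overlap heavily.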
16. Bayesian integration example
- DNA replication initiation complex, Cdc7-Dbf4
[Figure: observed evidence for the Cdc7-Dbf4 pair; contexts derived from user's query]
Inferred probability of a functional relationship (FR): 0.998
18. Evaluation experiments
Recovering known network components
Conclusion 1: Robust integration is important
[Figure: precision (TP / (TP + FP)) of recovered same-process protein pairs (8 of 174 input datasets)]
19. Evaluation experiments (2)
Does incorporating biological context information improve prediction?
Comparison: same data, same query, simple vs. context-specific integration
- Simple structure (global)
- Context-sensitive BN (datasets, contexts)
20. Evaluation experiments (2)
Conclusion 2: Using contextual information in integration is critical!
RNA splicing (GO:0008380)
10-protein query; each point is the average of 50 trials
21. Evaluation experiments (2)
RNA splicing, same 5 query genes
- Context-specific network: 6 FPs / 80% precision
- Global network: 22 FPs / 27% precision
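As a quick consistency check on these figures, precision = TP / (TP + FP) can be solved for the number of true positives. The sketch below reads the slide's precision values as percentages (80% and 27%), which is an assumption, and the function name is hypothetical.

```python
def implied_true_positives(false_positives, precision):
    """Solve precision = TP / (TP + FP) for TP."""
    return false_positives * precision / (1.0 - precision)

# FP counts and precision values as reported for the two networks.
context_tp = implied_true_positives(6, 0.80)
global_tp = implied_true_positives(22, 0.27)

# Implied network sizes (TP + FP), rounded to whole proteins.
context_size = round(context_tp) + 6
global_size = round(global_tp) + 22
```

Under that reading, both networks come out at about the same size, so the difference in precision reflects which proteins were recovered rather than how many.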
22. A consistent improvement
- Context-specific integration improves 44/53 evaluated biological process GO terms, by an average of 25%
10-protein query; each point is the average of 50 trials
[Figure: recovered proteins, context-specific network vs. global network]
23. A practical system for network discovery
bioPIXIE: Pathway Inference from eXperimental Interaction Evidence
Input data:
- Gene expression: gene expression datasets 1 to N
- Physical interactions: yeast two-hybrid dataset 1, co-precipitation dataset 1
- Genetic interactions: synthetic lethality dataset, synthetic rescue dataset
- Other: transcription factor binding sites, localization, curated literature
Pipeline:
- Data integration via a Bayesian network
- Network recovery algorithm
- Results displayed in a dynamic visualization
Myers et al. Discovery of biological networks from diverse functional genomic data. Genome Biology (2005).
24. Making our approach practical: effective data visualization
25. Making our approach practical: effective data visualization
- Guiding principles
- Accessibility (users can access most recent data with little effort)
- Drill-down (details, e.g. supporting experimental data, hidden until requested)
- Browseable
26. Graph visualization
28. Biological validation: characterizing new genes
Uncharacterized genes: YPL077C, YPL017C, YPL144W
Predicted involvement in chromosome segregation
29. Biological validation: characterizing new genes
Prediction: chromosome segregation
[Figure: Differential Interference Contrast, DAPI, and FACS panels for wild type, YPL017CΔ, YPL077CΔ, and YPL144WΔ]