Title: Automated aides to scientific discovery
1. Automated aides to scientific discovery
2. We are at an inflection point in the history of
life and medicine
3. Understanding Gene Lists
- There is no gene for any complex phenotype;
gene products function together in dynamic
groups
- A key task is to understand why a set of gene
products is grouped together in a condition,
exploiting all existing knowledge about:
- The genes (all of them)
- Their relationships (genes²)
- The condition(s) under study.
4. First application: Craniofacial Development
- NICHD-funded study (Rich Spritz and Trevor Williams)
focused on cleft lip and palate
- Well designed gene expression array experiment
- Craniofacial development in normal mice (control)
- Three tissues (Maxillary prominence, Fronto-nasal
prominence, Mandible)
- Five time points (every 12 hours from E10.5)
- Seven biological replicates per condition (well
powered)
- >1,000 genes differentially expressed among at
least 2 of the 15 conditions (FDR < 0.01)
5. The amount of information relevant to biomedicine
The 1000 Genomes Project will create 1,400 GB next
year (1000genomes.org)
6. Exponential knowledge growth
- 1,078 peer-reviewed gene-related databases in
2008 NAR db issue
- 751,195 PubMed entries in 2007 (>2,000/day)
- "Like drinking from a firehose" -- Jim Ostell
- Still fairly low on the growth curve, and today's
knowledge is
7. Not Nearly Enough!
- Experimental coverage of interactions and
pathways is still sparse, especially in mammals
8. Infrastructure of understanding
- Need to integrate, extend and harness all of that
knowledge (and data) for scientific insight.
- Google and PubMed began a transformation in
scientific use of the literature (more coming!)
- Community-curated ontologies improve the quality
of knowledge representation
- Macromolecular sequences provide easily assayable
reference points
- Computational challenge: how best to interact
with people to make discoveries
9. Reading and Inference
- Best source of information is the literature
- Asymmetric Co-occurrence Fraction (ACF)
- Information extraction
- Inferring implicit interactions
- Genes that have similar individual annotations,
e.g.
- Annotated with similar GO terms (e.g. same
biological process or cellular component)
- Knockouts result in similar phenotypes
- Enrich ontology connectivity by alignment
- E.g. linking GO:calcium transport and GO:calcium
signaling via ChEBI:calcium
Bada and Hunter, 2006; Tipney, et al., 2008
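One way to picture the "similar individual annotations" idea on this slide is a set-overlap score between the annotation sets of two genes. The sketch below is only an illustration, not the method of the cited work: the Jaccard index, the 0.5 threshold, the function names and the toy gene and term labels are all assumptions.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap of two annotation sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def infer_links(annotations: dict, threshold: float = 0.5) -> dict:
    """Return {(gene1, gene2): score} for pairs whose overlap meets the threshold.

    The threshold is an arbitrary illustrative value, not one from the talk.
    """
    links = {}
    for g1, g2 in combinations(sorted(annotations), 2):
        score = jaccard(annotations[g1], annotations[g2])
        if score >= threshold:
            links[(g1, g2)] = score
    return links


# Toy data: gene names and term labels are placeholders, not real annotations.
toy = {
    "geneA": {"term1", "term2", "term3"},
    "geneB": {"term2", "term3"},
    "geneC": {"term9"},
}
inferred = infer_links(toy)
```

Here only the geneA/geneB pair shares enough terms to produce an inferred link. The same function could be applied to knockout phenotype terms in place of GO terms.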
10. ACF improves network function prediction
Worm ACF > 0.9 literature networks, colored by GO
Molecular Function
Gabow, et al., in press
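The slides do not define ACF. Reading the name literally, one plausible interpretation is the fraction of the documents mentioning one gene that also mention the other, which gives a different value depending on which gene is the reference. The sketch below assumes that reading; the actual definition is the one in Gabow et al., and the function name and document IDs here are made up.

```python
def acf(docs_a: set, docs_b: set) -> float:
    """Fraction of the documents mentioning gene A that also mention gene B.

    Assumed reading of "asymmetric co-occurrence fraction"; returns 0.0
    when gene A has no documents.
    """
    if not docs_a:
        return 0.0
    return len(docs_a & docs_b) / len(docs_a)


# Toy document IDs per gene (placeholders).
docs = {"geneA": {1, 2, 3, 4}, "geneB": {3, 4}}

a_to_b = acf(docs["geneA"], docs["geneB"])  # 2 of geneA's 4 documents mention geneB
b_to_a = acf(docs["geneB"], docs["geneA"])  # both of geneB's documents mention geneA
```

Under this reading, a threshold such as ACF > 0.9 would keep a pair when almost every document about one gene also mentions the other, in at least one direction.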
11. Information Extraction with OpenDMAP
- Concept Recognition connects community curated
ontological concepts to literature instances
- Recognition patterns associated with (Protégé)
concepts and slots
- Patterns can contain text literals, other
concepts, constraints (conceptual or syntactic),
ordering information, or outputs of other
processing.
- Linked to many text analysis engines via UIMA
- Best performance in BioCreative II IPS task
- >500,000 instances of three predicates (with
arguments) extracted from Medline Abstracts
- Hunter, et al., 2008; http://bionlp.sourceforge.net
12. Inferred interactions
- Dramatically increase coverage
- But at the cost of much lower reliability
- New methods to assess reliability without an
explicit gold standard
- Leach, et al., 2007; Gabow, et al., in press
Top 1,000 Craniofacial genes
13. Semantic Data Integration
- Combine diverse sources of background knowledge
into a unified overview
- Genes are fiducial nodes; knowledge links them
- Every link gets a reliability value
- Combine links between a node pair into a summary
link
- Summary link gets weight via Noisy Or, or Linear
Opinion Pool
- Summary links from more sources are more reliable
- Summary links from better sources are more
reliable
- Summaries allow for effective use of noisy
inferences
- Leach PhD thesis 2007; Leach et al., 2007
- Visualization tool shows combined summary value
and allows drill-down to each source
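The two combination rules named on this slide have standard textbook forms: noisy-OR takes one minus the product of the per-link failure probabilities, and a linear opinion pool takes a weighted average. The sketch below shows those generic forms only; the function names, the uniform default weights and the example reliabilities are assumptions, and the cited thesis may assign per-source reliabilities and weights differently.

```python
from math import prod


def noisy_or(reliabilities) -> float:
    """Noisy-OR summary: 1 - product of (1 - r_i) over all supporting links."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def linear_opinion_pool(reliabilities, weights=None) -> float:
    """Weighted average of link reliabilities.

    Weights default to uniform and are normalized to sum to 1.
    """
    rs = list(reliabilities)
    if not rs:
        raise ValueError("need at least one reliability value")
    ws = [1.0] * len(rs) if weights is None else list(weights)
    if len(ws) != len(rs):
        raise ValueError("weights and reliabilities must have the same length")
    total = sum(ws)
    return sum(w * r for w, r in zip(ws, rs)) / total


# Hypothetical reliabilities of three sources that all link the same gene pair.
links = [0.5, 0.5, 0.3]
summary_or = noisy_or(links)
summary_pool = linear_opinion_pool(links)
```

With noisy-OR, adding another supporting source never lowers the summary and a more reliable source raises it more, which matches the two "more reliable" bullets above. A linear opinion pool instead averages the sources, so source quality enters through the weights.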
14. Summaries without inference
Nodes are genes in the cluster. Arcs are links
found among the explicit experts (mostly PreMod).
Arc intensity is proportional to summary
strength. Colored nodes have the GO function
annotation "regulation of transcription,
DNA-dependent". Edge attributes (below) show
supporting knowledge sources. Visualization is
via Cytoscape.
15. Same, but with inferred linkages
16. Now add experimental data
- Expression array data generates its own network
of genes via correlated expression
- Combined with background knowledge by
- Averaging (highlights already known linkages)
- Hanisch (ISMB 2002) method (emphasizes data
linkages not well known in the literature)
- Visualize 1000 highest scoring data and knowledge
linkages by color coding for scores of average,
Hanisch or both.
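The averaging step above can be sketched as follows, under stated assumptions: the data score for a gene pair is taken to be the absolute Pearson correlation of the two expression profiles, the knowledge score is a summary-link weight in [0, 1] keyed by an alphabetically ordered gene pair, and pairs without a background link score 0. None of these choices is confirmed by the slides, all names and values are hypothetical, and the Hanisch method is not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def averaged_edges(expression: dict, knowledge: dict, top_k: int) -> list:
    """Score each gene pair as mean(|r|, knowledge score); return the top_k pairs."""
    scored = []
    for g1, g2 in combinations(sorted(expression), 2):
        data_score = abs(pearson(expression[g1], expression[g2]))
        know_score = knowledge.get((g1, g2), 0.0)  # no background link -> 0
        scored.append(((g1, g2), (data_score + know_score) / 2))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy expression profiles and one background summary link (made-up values).
expression = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 1, 3, 2]}
knowledge = {("g1", "g2"): 0.9}
top = averaged_edges(expression, knowledge, top_k=1)
```

Because the average rewards pairs that score well on both sides, it tends to surface linkages that are already known, as the slide notes. A pair with strong correlation but no background link can reach at most 0.5 under this scheme.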
17. The Whole Network
Craniofacial dataset, covering all genes on the
Affy mouse chip. Graph of top 1000 edges using
AVE or HANISCH (1734 in total). Edges identified
by both. Focus on mid-size subnetwork
18. Strong data and background knowledge facilitate
explanations
[Figure legend: AVE edges; Both edges; Skeletal
muscle structural components; Skeletal muscle
contractile components; Proteins of no common
family]
- Goal is abductive inference: why are these genes
doing this?
- Specifically, why the increase in mandible before
the increase in maxilla, and not at all in the
frontonasal prominence?
- Browse experimental data, network, gene ontology,
details
19. Scientist aide: literature -> explanation
tongue development
[Figure legend: AVE edges; Both edges; Skeletal
muscle structural components; Skeletal muscle
contractile components; Proteins of no common
family]
The delayed onset, at E12.5, of the same group of
proteins during mastication muscle development.
Myoblast differentiation and proliferation
continue until E15, at which point the tongue
muscle is completely formed.
Myogenic cells invade the tongue primordia at E11
20. On to Discovery
- Add the strong data, weak background knowledge
(Hanisch) links to the previous network, bringing
in new genes.
- Five of these genes not previously implicated in
facial muscle development (1 almost completely
unannotated)
21. Prediction validated!
Zim1, E12.5
E43rik, E12.5
HoxA2, E12.5
ApoBEC2, E11.5
22. Aides for Biological Analysis
- How to replicate and extend this work?
- Deepen the connections to the literature
- NLP is critical
- Full texts are central (and increasingly
accessible)
- Staying current (live semantic data integration)
- Improve user experience
- Easier drill down into databases and literature
- Integration with an analyst's notebook
- Automated focus on interesting material
23. NLP on full texts
- Google Scholar and PubMed are good, but not enough
- New challenges (cf. abstracts)
- Coreference
- Finding relevant excerpts and figures (e.g.
Hearst's BioText)
- Infrastructure work
- Better representations
- Relationships (RO effort)
- Articles as hypotheses and supporting data
- CRAFT (2000 articles, 200 semantic classes)
- Biological upper ontology (Being Alive annotation)
24. Being Alive in Knowtator
25. Picking interesting links
- Even with summary links, there's a lot to
explore. Ranking or highlighting would help.
- Use of the tool can provide a stream of data about
the utility of links: selection, relevance feedback,
even eye tracking.
- Goal: exploit that data for ranking
- Reinforcement learning?
- Market for interestingness?
26. Acknowledgements
- Sonia Leach (Semantic data integration)
- Hannah Tipney (Analyst)
- Bill Baumgartner (UIMA, Software engineer)
- Philip Ogren (Knowtator)
- Mike Bada (Ontologist)
- Helen Johnson (Linguist)
- Kevin Cohen (NLP guru)
- Lynne Fox (Librarian)
- Aaron Gabow (Programmer)
- NIH grants
- R01 LM 009254
- R01 LM 008111
- R01 GM 083649
- G08 LM 009639
- T15 LM 009451
- MIT Press for permission to use Being Alive for
doing science
27. Many surprises to come
- Genotype can make environment more important