Automated aides to scientific discovery - PowerPoint PPT Presentation

Slides: 28
Provided by: Lar9189

1
Automated aides to scientific discovery
2
We are at an inflection point in the history of
life and medicine
3
Understanding Gene Lists
  • There is no 'gene' for any complex phenotype;
    gene products function together in dynamic
    groups
  • A key task is to understand why a set of gene
    products are grouped together in a condition,
    exploiting all existing knowledge about:
      • The genes (all of them)
      • Their relationships (genes²)
      • The condition(s) under study

4
First application: Craniofacial Development
  • NICHD-funded study (Rich Spritz & Trevor Williams)
    focused on cleft lip & palate
  • Well-designed gene expression array experiment
      • Craniofacial development in normal mice (control)
      • Three tissues (Maxillary prominence, Fronto-nasal
        prominence, Mandible)
      • Five time points (every 12 hours from E10.5)
      • Seven biological replicates per condition (well
        powered)
  • >1,000 genes differentially expressed among at
    least 2 of the 15 conditions (FDR<0.01)
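
The design arithmetic on this slide can be checked with a short sketch. The tissue names, time-point spacing, and replicate count come from the slide; the variable names are illustrative, and the sample total assumes one array per biological replicate, which the slide does not state.

```python
from itertools import product

# Values taken from the slide; names are illustrative only.
TISSUES = ["maxillary prominence", "frontonasal prominence", "mandible"]
# Five time points, one every 12 hours (0.5 embryonic day) starting at E10.5.
TIME_POINTS = [10.5 + 0.5 * i for i in range(5)]
REPLICATES = 7

# One condition per (tissue, time point) pair.
conditions = [(tissue, f"E{day:g}") for tissue, day in product(TISSUES, TIME_POINTS)]

# Assumption: each biological replicate corresponds to one array.
n_samples = len(conditions) * REPLICATES

print(len(conditions))  # 3 tissues x 5 time points
print(n_samples)
```

Under these assumptions the 15 conditions match the count quoted on the slide, and the last time point is E12.5.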

5
The amount of information relevant to biomedicine
1,000 Genomes Project will create 1,400 GB next
year (1000genomes.org)
6
Exponential knowledge growth
  • 1,078 peer-reviewed gene-related databases in
    2008 NAR db issue
  • 751,195 PubMed entries in 2007 (roughly 2,000/day)
  • "Like drinking from a firehose" -- Jim Ostell
  • Still fairly low on the growth curve, and today's
    knowledge is ...

7
Not Nearly Enough!
  • Experimental coverage of interactions and
    pathways is still sparse, especially in mammals

8
Infrastructure of understanding
  • Need to integrate, extend and harness all of that
    knowledge (and data) for scientific insight.
  • Google & PubMed began a transformation in
    scientific use of the literature (more coming!)
  • Community-curated ontologies improve the quality
    of knowledge representation
  • Macromolecular sequences provide easily assayable
    reference points
  • Computational challenge: how best to interact
    with people to make discoveries

9
Reading and Inference
  • Best source of information is the literature
  • Asymmetric Co-occurrence Fraction (ACF)
  • Information extraction
  • Inferring implicit interactions
  • Genes that have similar individual annotations,
    e.g.
  • Annotated with similar GO terms (e.g. same
    biological process or cellular component)
  • Knockouts result in similar phenotypes
  • Enrich ontology connectivity by alignment
  • E.g., linking GO:calcium transport and GO:calcium
    signaling via ChEBI:calcium (Bada & Hunter, 2006;
    Tipney, et al., 2008)
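
One way to picture "genes that have similar individual annotations" is to score each gene pair by the overlap of their annotation sets. The sketch below uses Jaccard similarity, which is a stand-in chosen for illustration and not necessarily the measure used in this work; the gene names, term labels, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two annotation sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def infer_links(annotations, threshold=0.5):
    """Return (gene1, gene2, score) for every pair whose overlap meets the threshold."""
    links = []
    for g1, g2 in combinations(sorted(annotations), 2):
        score = jaccard(annotations[g1], annotations[g2])
        if score >= threshold:
            links.append((g1, g2, score))
    return links


# Hypothetical genes and term labels, for illustration only.
annotations = {
    "geneA": {"calcium transport", "calcium signaling", "membrane"},
    "geneB": {"calcium signaling", "membrane"},
    "geneC": {"DNA repair"},
}
links = infer_links(annotations)
print(links)
```

Here only geneA and geneB share enough annotations to be linked. The same function would work on phenotype labels from knockouts in place of GO terms.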

10
ACF improves network function prediction
Worm ACF>0.9 literature networks, colored by GO
Molecular Function
Gabow, et al., in press
11
Information Extraction with OpenDMAP
  • Concept Recognition connects community curated
    ontological concepts to literature instances
  • Recognition patterns associated with (Protégé)
    concepts and slots
  • Patterns can contain text literals, other
    concepts, constraints (conceptual or syntactic),
    ordering information, or outputs of other
    processing.
  • Linked to many text analysis engines via UIMA
  • Best performance in BioCreative II IPS task
  • >500,000 instances of three predicates (with
    arguments) extracted from Medline abstracts
  • Hunter, et al., 2008; http://bionlp.sourceforge.net

12
Inferred interactions
  • Dramatically increase coverage
  • But at the cost of much lower reliability
  • New methods to assess reliability without an
    explicit gold standard
  • Leach, et al., 2007; Gabow, et al., in press

Top 1,000 Craniofacial genes
13
Semantic Data Integration
  • Combine diverse sources of background knowledge
    into a unified overview
  • Genes are fiducial nodes; knowledge links them
  • Every link gets a reliability value
  • Combine links between a node pair into a summary
    link
  • Summary link gets its weight via Noisy-OR or Linear
    Opinion Pool
  • Summary links from more sources are more reliable
  • Summary links from better sources are more
    reliable
  • Summaries allow for effective use of noisy
    inferences
  • Leach PhD thesis, 2007; Leach et al., 2007
  • Visualization tool shows combined summary value
    and allows drill-down to each source
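
The two combination rules named above have standard textbook forms, sketched here. This assumes each link's reliability can be read as an independent probability that the link is real; the per-source weights in the linear opinion pool are hypothetical, and the slides do not say how either rule was parameterized in practice.

```python
import math


def noisy_or(reliabilities):
    """Noisy-OR: the summary link fails only if every supporting link fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def linear_opinion_pool(reliabilities, weights=None):
    """Weighted average of the link reliabilities (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(reliabilities)
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, reliabilities)) / total


# Hypothetical reliabilities for links between one pair of genes.
two_sources = noisy_or([0.6, 0.5])
three_sources = noisy_or([0.6, 0.5, 0.3])
pooled = linear_opinion_pool([0.6, 0.5])
trusted = linear_opinion_pool([0.6, 0.5], weights=[3.0, 1.0])

print(two_sources, three_sources, pooled, trusted)
```

Under Noisy-OR, adding a third source raises the summary value (about 0.8 to about 0.86 here), which matches the point that links from more sources are more reliable. In the linear opinion pool, weighting the better source more heavily pulls the summary toward its value.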

14
Summaries without inference
Nodes are genes in the cluster. Arcs are links
found among the explicit experts (mostly PreMod).
Arc intensity is proportional to summary
strength. Colored nodes have the GO function
annotation "regulation of transcription,
DNA-dependent". Edge attributes (below) show
supporting knowledge sources. Visualization is
via Cytoscape.
15
Same, but with inferred linkages
16
Now add experimental data
  • Expression array data generates its own network
    of genes via correlated expression
  • Combined with background knowledge by:
      • Averaging (highlights already known linkages)
      • Hanisch (ISMB 2002) method (emphasizes data
        linkages not well known in the literature)
  • Visualize the 1000 highest-scoring data + knowledge
    linkages, color coded by whether they score highly on
    Average, Hanisch, or both.
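
The averaging step can be sketched as follows. This is a minimal illustration, assuming the data score for a gene pair is the absolute Pearson correlation of their expression profiles and the knowledge score is a summary-link value in [0, 1]; both choices, and all names and numbers in the example, are assumptions. The Hanisch method is not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def averaged_edges(expression, knowledge, top_n=1000):
    """Rank gene pairs by the mean of data score and knowledge score."""
    scored = []
    for g1, g2 in combinations(sorted(expression), 2):
        data_score = abs(pearson(expression[g1], expression[g2]))
        # Assumption: knowledge is keyed by alphabetically sorted gene pairs.
        knowledge_score = knowledge.get((g1, g2), 0.0)
        scored.append((g1, g2, (data_score + knowledge_score) / 2))
    scored.sort(key=lambda edge: edge[2], reverse=True)
    return scored[:top_n]


# Hypothetical expression profiles and summary-link values.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 1.0, 2.0],
}
knowledge = {("geneA", "geneB"): 0.8}
edges = averaged_edges(expression, knowledge)
print(edges[0])
```

Because the pair with both strong correlation and a strong summary link ranks first, the sketch reflects why averaging highlights linkages that are already known.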

17
The Whole Network
Craniofacial dataset, covering all genes on the
Affy mouse chip. Graph of top 1000 edges using
AVE or HANISCH (1734 in total). Edges identified
by both. Focus on mid-size subnetwork
18
Strong data and background knowledge facilitate
explanations
Figure legend: AVE edges; Both edges. Node groups:
skeletal muscle structural components; skeletal
muscle contractile components; proteins of no
common family.
  • Goal is abductive inference: why are these genes
    doing this?
  • Specifically, why the increase in mandible before
    the increase in maxilla, and not at all in the
    frontonasal prominence?
  • Browse experimental data, network, gene ontology,
    details

19
Scientist + aide + literature → explanation
tongue development
Figure legend: AVE edges; Both edges. Node groups:
skeletal muscle structural components; skeletal
muscle contractile components; proteins of no
common family.
The delayed onset, at E12.5, of the same group of
proteins during mastication muscle development.
Myoblast differentiation and proliferation
continues until E15 at which point the tongue
muscle is completely formed.
Myogenic cells invade the tongue primordia at E11.
20
On to Discovery
  • Add the strong data, weak background knowledge
    (Hanisch) links to the previous network, bringing
    in new genes.
  • Five of these genes not previously implicated in
    facial muscle development (1 almost completely
    unannotated)

21
Prediction validated!
Zim1, E12.5
E43rik, E12.5
HoxA2, E12.5
ApoBEC2, E11.5
22
Aides for Biological Analysis
  • How to replicate and extend this work?
  • Deepen the connections to the literature
  • NLP is critical
  • Full texts are central (and increasingly
    accessible)
  • Staying current (live semantic data integration)
  • Improve user experience
  • Easier drill down into databases and literature
  • Integration with an analyst's notebook
  • Automated focus on interesting material

23
NLP on full texts
  • Google Scholar & PubMed are good, but not enough
  • New challenges (cf. abstracts)
  • Coreference
  • Finding relevant excerpts & figures (e.g.
    Hearst's BioText)
  • Infrastructure work
  • Better representations
  • Relationships (RO effort)
  • Articles as hypotheses + supporting data
  • CRAFT (2000 articles, 200 semantic classes)
  • Biological upper ontology (Being Alive annotation)

24
Being Alive in Knowtator
25
Picking interesting links
  • Even with summary links, there's a lot to
    explore. Ranking or highlighting would help.
  • Use of the tool can provide a stream of data about
    the utility of links: selection, relevance feedback,
    even eye tracking.
  • Goal: exploit that data for ranking
  • Reinforcement learning?
  • Market for interestingness?

26
Acknowledgements
  • Sonia Leach (Semantic data integration)
  • Hannah Tipney (Analyst)
  • Bill Baumgartner (UIMA, Software engineer)
  • Philip Ogren (Knowtator)
  • Mike Bada (Ontologist)
  • Helen Johnson (Linguist)
  • Kevin Cohen (NLP guru)
  • Lynne Fox (Librarian)
  • Aaron Gabow (Programmer)
  • NIH grants
  • R01 LM 009254
  • R01 LM 008111
  • R01 GM 083649
  • G08 LM 009639
  • T15 LM 009451
  • MIT Press for permission to use Being Alive for
    doing science

27
Many surprises to come
  • Genotype can make environment more important