Re-Conceptualizing Literature-Based Discovery

1
Re-Conceptualizing Literature-Based Discovery
Neil R. Smalheiser March 29, 2008
2
What is LBD? A strategy for uncovering novel
hypotheses
  • advocated by Don Swanson
  • Magnesium-migraine, fish oil-Raynaud's
  • The key idea is putting together explicit
    assertions from different papers to form new
    implicit assertions
  • Regardless of how this is done, how the implicit
    assertions are assessed, or whether the implicit
    assertions are correct!

3
What is LBD? A routine way of life for scientists
  • greatly under-recognized! Not just background
    reading, not just identifying anomalies or
    critical incidents that appear (explicitly) in
    a paper
  • Since 1996: 8 papers with Swanson, 40 without
    (i.e. non-one-node search)
  • 24 are biological (i.e. non-informatics
    modeling); 9/24 = 3/8 > 1/3
  • Proteins in unexpected locations (Molec. Biol.
    Cell, 1996)
  • Expression of reelin in the blood (PNAS, 2000)
  • Reelin and schizophrenia (PNAS, 2000)
  • Fluoxetine and neurogenesis (Eur. J. Pharmacol.
    2001)
  • RNAi and memory (Trends in Neurosci. 2001)
  • Bath toys (New Engl. J. Med. 2003)
  • Dicer and calpain (J. Neurochem. 2005)
  • Exosomal transfer of proteins and RNAs at synapses
    (Biol Direct, 2007)
  • microRNA machinery and regulation by
    phosphorylation (BBA, 2008)

4
What is LBD? A body of research articles,
software and websites
  • mostly by information scientists and computer
    scientists
  • Mostly concerned with open discovery, or one-node
    searches: begin with a set of articles A that
    represents a problem
  • Mostly use B-terms present in A to expand the
    search and find disparate lits Ci that share
    B-terms with A
  • Try to find the Ci that is disparate yet most
    similar to A

5
What is LBD? Other researchers employ implicit
information too
  • Bioinformatics
  • gene-gene interactions
  • protein-protein interactions
  • web search
  • author disambiguation
  • text mining
  • Yet these are not viewed as examples of LBD for
    some reason!

6
Has the LBD field stagnated and not fulfilled
its promise?
  • Kostoff critique(s)
  • what is a discovery vs. an innovation
  • argues against frequency-based ranking
  • Uses very high recall, hundreds of discoveries
    claimed per question
  • Swanson's legacy: Sw refs ended 2001!
  • Bork review: Sw refs ended 1996!
  • Few gold standards are available (Mg, fish oil
    worn out)
  • Combinatorial explosion in the A-B-C search method
  • Impossible standards for what counts as an LBD
    prediction (never considered, never tested, must
    shatter a paradigm but must be proven
    experimentally??)
  • Excluding active approaches other than one-node
    search from being counted as LBD

7
Well, what DO we know about progress in LBD?
  • The two-node search
  • http://arrowsmith.psych.uic.edu
  • Begin with two lits A and C that represent a
    known finding or a hypothesis (estrogen-AD)
  • Look for meaningful links (whether or not A and C
    are disparate)
  • We use B-terms extracted from titles
  • Could use abstracts, MeSH, triples

9
Modeling the Two-Node Search-1
  • Field testers, free-form use of the tool
  • Chose 6 two-node searches as gold standards: not
    too big or small, disparate, topically coherent,
    clean questions
  • E.g. for A = retinal detachment, C = aortic
    aneurysm: a) find diseases in which both features
    appear (not necessarily in the same person), or b)
    find surgical procedures that have been applied
    to both conditions.
  • Manually marked relevant B-terms for a given
    query (sometimes several queries for the same
    two-node search)
  • Details in Bioinformatics (2007) paper

10
Modeling the Two-Node Search-2
  • Used 8 complementary features to score each
    B-term (e.g. recency, frequency, semantic
    categories)
  • Created a single combined and weighted score for
    each B-term
  • Used a logistic regression model to weight each
    feature optimally, so as to separate marked
    relevant B-terms from all others (mixed set)
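
As a rough sketch of the idea on this slide, the code below fits a logistic regression that turns per-B-term features into one weighted score. It is not the model from the Bioinformatics (2007) paper: it uses only three hypothetical features instead of eight, invented toy values, and plain batch gradient descent written with the standard library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=5000):
    """Fit feature weights and a bias by batch gradient descent on log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def score(x, weights, bias):
    """Combined score: estimated probability that a B-term is relevant."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Toy data (assumed): [recency, frequency, semantic_match], each scaled to 0..1.
rows = [[0.9, 0.8, 1.0], [0.8, 0.6, 1.0], [0.7, 0.9, 0.0],
        [0.2, 0.3, 0.0], [0.1, 0.5, 0.0], [0.3, 0.1, 1.0]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = manually marked relevant, 0 = all others

weights, bias = fit_logistic(rows, labels)
scores = [score(x, weights, bias) for x in rows]
for x, y, s in zip(rows, labels, scores):
    print(x, y, round(s, 3))
```

With real data the features would be computed per B-term for each gold-standard search, and the fitted weights would then be reused to score B-terms in new searches.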

11
Modeling the Two-Node Search-3
12
Two End-Points of this Research
  • For any two-node search, we can now rank the list
    of B-terms in order of estimated probability that
    they will be marked as relevant (meaningful) by
    SOME user for SOME query.
  • For any pair of lits A and C, we can now estimate
    the OVERALL shared implicit information between A
    and C (number of B-terms that are predicted to be
    relevant)
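
Both end-points can be illustrated with a short sketch, assuming each B-term already has an estimated relevance probability. The probabilities and terms below are invented, and the 0.5 cutoff for "predicted to be relevant" is an assumption, not a value given in the talk.

```python
def rank_b_terms(probabilities):
    """Order B-terms from most to least likely to be marked relevant."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

def shared_implicit_information(probabilities, threshold=0.5):
    """Count the B-terms predicted to be relevant (threshold is assumed)."""
    return sum(1 for p in probabilities.values() if p >= threshold)

# Toy probabilities for one A/C pair, invented for illustration.
probs = {"calcium": 0.82, "inflammation": 0.64, "patients": 0.05, "collagen": 0.47}

ranked = rank_b_terms(probs)
print([term for term, _ in ranked])        # ['calcium', 'inflammation', 'collagen', 'patients']
print(shared_implicit_information(probs))  # 2
```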

13
Relevance to the One Node Search
  • We can re-conceptualize the one-node search as a
    series of two-node searches
  • Choose A, then choose category C
  • Divide category C densely into many small,
    coherent Ci
  • For each Ci, score multi-dimensional features,
    including, but not limited to, features that
    relate A to Ci (e.g. number of B-terms in common
    or predicted relevant B-terms)
  • Rank the Ci to identify the most promising lits
    (which are presumed to point to novel hypotheses
    or implicit information helpful when applied to A)
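
The steps on this slide can be sketched as a loop of pairwise comparisons. The example assumes each literature is already reduced to a set of terms and ranks the Ci by a single feature, the number of B-terms shared with A. The function name `rank_sub_literatures`, the sub-literature labels, and the term sets are all invented for illustration.

```python
def rank_sub_literatures(terms_a, sub_literatures):
    """Run one two-node comparison per Ci and rank by shared B-term count."""
    scored = []
    for name, terms_ci in sub_literatures.items():
        shared = terms_a & terms_ci  # B-terms linking A and this Ci
        scored.append((name, len(shared), sorted(shared)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# Toy term sets, invented for illustration only.
terms_a = {"autophagy", "aggregation", "mitochondria", "caffeine"}
sub_literatures = {
    "C1": {"autophagy", "aggregation", "rapamycin"},
    "C2": {"caffeine", "diet"},
    "C3": {"zebra", "stripe"},
}

ranked = rank_sub_literatures(terms_a, sub_literatures)
for name, count, shared in ranked:
    print(name, count, shared)
```

Because every Ci is compared with A directly, the work grows with the number of Ci rather than with the number of intermediate terms. Other features could be added as extra tuple fields without changing the loop.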

14
A is evaluated pairwise against C
  • C1 might involve B-terms, C2 might not! C3, C4, ...
  • E.g. A = Huntington disease; C = lifestyle
    factors, autophagy, or therapeutic agents
15
Interestingness Measures
  • Field of data mining.
  • This allows us to encode real-life priorities and
    strategies of working scientists
  • Existing one-node search looks for novelty,
    relevance, non-triviality, likelihood of being
    true ... gets the low-hanging fruit
  • What about actionability, feasibility of
    follow-up, surprisingness, cross-discipline,
    presence of high experimental support,
    generalizability to other problems, or high
    potential impact?
  • A candidate Ci could be interesting because it is
    recently discovered and rapidly growing (e.g.
    microRNAs), or well characterized; for a disease,
    because it has an animal model; for a protein,
    because it is connected to many other proteins;
    for a drug, because it has FDA approval.
  • This not only re-conceptualizes the one-node
    search (e.g., no combinatorial explosion) but
    also generalizes the ranking methods.
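
One simple way to fold such priorities into a ranking is a weighted sum over per-candidate features, sketched below. The feature names, the weights, and the candidate values are all hypothetical; the talk does not specify how interestingness measures should be combined.

```python
# Hypothetical weights: how much each priority counts toward the final score.
WEIGHTS = {
    "b_term_overlap": 0.4,  # link to A via B-terms
    "rapid_growth": 0.3,    # recently discovered, fast-growing topic
    "animal_model": 0.2,    # disease has an animal model
    "fda_approved": 0.1,    # drug has FDA approval
}

def interestingness(features, weights=WEIGHTS):
    """Weighted sum of feature values in 0..1; missing features count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())

# Toy candidates, invented for illustration only.
candidates = {
    "microRNAs": {"b_term_overlap": 0.5, "rapid_growth": 1.0},
    "approved drug": {"b_term_overlap": 0.6, "rapid_growth": 0.1, "fda_approved": 1.0},
    "well-studied pathway": {"b_term_overlap": 0.8},
}

ranked = sorted(candidates, key=lambda name: interestingness(candidates[name]),
                reverse=True)
print(ranked)
```

Ranking on B-term overlap alone would put the well-studied pathway first. With the assumed weights, the fast-growing topic moves to the top, which is the kind of reordering the slide argues for.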

16
Gold Standards for One-Node Searches
  • Also, we can now envision preparing a series of
    gold standard searches, even automatically (cf.
    TREC 2006, 2007).
  • Use implicit assertions to reconstruct explicit
    knowledge.
  • Use review articles
  • Use lists (e.g. in the virus study, the gold
    standard was a list of viruses that were thought
    to be at risk of being exploited for biological
    warfare).
  • Use time slices
  • Avoids the paradox that one-node searches must
    predict things that have no experimental support!
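
The time-slice idea can be sketched as follows, under the assumption that we know the year in which each pair of terms first co-occurred in the literature. Pairs first seen after a cutoff year serve as the gold standard for predictions made from the earlier literature. The pairs, years, and function names are invented for illustration.

```python
def split_by_cutoff(first_seen, cutoff_year):
    """Split term pairs into those known by the cutoff and those seen later."""
    known = {pair for pair, year in first_seen.items() if year <= cutoff_year}
    gold = {pair for pair, year in first_seen.items() if year > cutoff_year}
    return known, gold

def precision_recall(predicted, gold):
    """Compare predicted pairs against the later-appearing gold standard."""
    hits = predicted & gold
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Toy data: year in which each (A, C) pair first co-occurred (invented).
first_seen = {("A", "C1"): 1995, ("A", "C2"): 1998,
              ("A", "C3"): 2003, ("A", "C4"): 2005}

known, gold = split_by_cutoff(first_seen, cutoff_year=2000)
predicted = {("A", "C3"), ("A", "C9")}  # output of a hypothetical one-node search
print(precision_recall(predicted, gold))  # (0.5, 0.5)
```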

17
Conclusions
  • LBD is (can be, will be) alive and well!
  • Need to incorporate the types of real-life
    priorities and strategies of working scientists
  • Re-conceptualize the one-node search as a series
    of two-node searches
  • Use interestingness measures to supplement
    B-term measures.

19
Journal of Biomedical Discovery and Collaboration
  • Unique multi-disciplinary audience
  • People who engage in scientific discovery and
    collaboration
  • People who make tools that enhance scientific
    discovery and collaboration
  • People who study scientific discovery and
    collaboration
  • Hosted by BioMed Central
  • Fully peer-reviewed
  • RAPID review (<3 weeks is routine)
  • Open-access, indexed in PubMed Central et al.
  • Readership goes up 10-100-fold
  • Impact goes up too
  • Article fee reduced or zeroed depending on
    institution

20
Acknowledgements
  • Don Swanson
  • Vetle Torvik
  • Wei Zhou (Clement Yu)
  • Marc Weeber

21
Ruminations
  • Should LBD analyses be user-friendly? Popular??
  • Don't they overlook true divergent discoveries?
  • Should LBD be run automatically as a program in
    the background, with alerts of possible
    discoveries?
  • Does LBD bypass, or reinforce, good old-fashioned
    hypothesis-driven science?