Title: Re-Conceptualizing Literature-Based Discovery
1Re-Conceptualizing Literature-Based Discovery
Neil R. Smalheiser March 29, 2008
2What is LBD? A strategy for uncovering novel
hypotheses
- advocated by Don Swanson
- Magnesium-migraine, fish oil-Raynaud's
- The key idea is putting together explicit assertions
from different papers to form new implicit assertions
- Regardless of how this is done, how the implicit
assertions are assessed, or whether the implicit
assertions are correct!
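A minimal sketch of the key idea, assuming explicit assertions can be reduced to directed (x, y) pairs. The function name and the toy assertions are illustrative assumptions, not output from any real extraction system.

```python
from collections import defaultdict

def implicit_assertions(explicit):
    """Chain explicit (a, b) and (b, c) assertions into implicit (a, c) pairs.

    Returns a dict mapping each implicit pair to the set of linking b items.
    Pairs that are already stated explicitly are skipped.
    """
    explicit = set(explicit)
    by_subject = defaultdict(set)
    for x, y in explicit:
        by_subject[x].add(y)

    implicit = defaultdict(set)
    for a, b in explicit:
        for c in by_subject.get(b, ()):
            if c != a and (a, c) not in explicit:
                implicit[(a, c)].add(b)
    return dict(implicit)

# Toy assertions, each imagined as coming from a different paper (illustrative only).
explicit = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's"),
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "Raynaud's"),
]
links = implicit_assertions(explicit)
```

Here `links` holds a single implicit pair, fish oil to Raynaud's, together with the two intermediate items that connect it. Whether such a pair is correct is a separate question, as the slide notes.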
3What is LBD? A routine way of life for scientists
- greatly under-recognized! Not just background
reading, not just identifying anomalies or
critical incidents that appear (explicitly) in
a paper
- Since 1996: 8 papers with Swanson, 40 without
(i.e. non-one node search)
- 24 are biological (i.e. non-informatics
modeling); 9/24 = 3/8 > 1/3
- Proteins in unexpected locations (Molec. Biol.
Cell, 1996)
- Expression of reelin in the blood (PNAS, 2000)
- Reelin and schizophrenia (PNAS 2000)
- Fluoxetine and neurogenesis (Eur. J. Pharmacol.
2001)
- RNAi and memory (Trends in Neurosci. 2001)
- Bath toys (New Engl. J. Med. 2003)
- Dicer and calpain (J. Neurochem. 2005)
- Exosomal transfer of proteins and RNAs at synapses
(Biol Direct, 2007)
- microRNA machinery and regulation by
phosphorylation (BBA, 2008)
4What is LBD? A body of research articles,
software and websites
- mostly by information scientists and computer
scientists
- Mostly concerned with open discovery or one-node
searches: begin with a set of articles A
that represents a problem
- Mostly use B-terms present in A to expand the
search and find disparate literatures Ci that share
B-terms with A
- Try to find the Ci that is disparate yet most
similar to A
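The one-node procedure described above can be sketched as follows. This is a simplified illustration: it assumes each literature is already reduced to a set of terms, that the candidates are already known to be disparate from A, and that similarity is just the count of shared B-terms. All names and term sets are hypothetical.

```python
def one_node_search(a_terms, candidate_lits, top_n=3):
    """Rank candidate literatures Ci by the number of B-terms shared with A.

    a_terms:        set of terms found in literature A
    candidate_lits: dict mapping a literature name to its set of terms
    """
    ranked = []
    for name, c_terms in candidate_lits.items():
        shared = a_terms & c_terms          # B-terms linking A and this Ci
        if shared:
            ranked.append((name, sorted(shared)))
    # Most shared B-terms first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked[:top_n]

# Toy term sets (invented for illustration).
a_terms = {"vasospasm", "platelet aggregation", "spreading depression", "serotonin"}
candidates = {
    "C1": {"vasospasm", "spreading depression", "calcium channel"},
    "C2": {"serotonin", "wound healing"},
    "C3": {"anemia"},
}
result = one_node_search(a_terms, candidates)
```

C3 shares no B-terms with A and drops out; C1 ranks above C2 because it shares two B-terms rather than one.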
5What is LBD? other researchers employ implicit
information too
- Bioinformatics
- gene-gene interactions
- protein-protein interactions
- web search
- author disambiguation
- text mining
- Yet these are not viewed as examples of LBD for
some reason!
6Has the LBD field stagnated and not fulfilled
its promise?
- Kostoff critique(s)
- what is a discovery vs. an innovation
- argues against frequency-based ranking
- Uses very high recall, hundreds of discoveries
claimed per question
- Swanson's legacy: Sw refs ended 2001!
- Bork review: refs Sw ended 1996!
- Few gold standards are available (Mg, fish oil
worn out)
- Combinatorial explosion in the A-B-C search method
- Impossible standards for what counts as an LBD
prediction (never considered, never tested, must
shatter a paradigm but must be proven
experimentally??)
- Excluding active approaches other than one-node
search as being LBD
7Well, what DO we know about progress in LBD?
- The two-node search
- http://arrowsmith.psych.uic.edu
- Begin with two literatures A and C that represent a
known finding or a hypothesis (estrogen-AD)
- Look for meaningful links
(whether or not A and C are disparate)
- We use B-terms extracted from titles
- Could use abstracts, MeSH, triples
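A minimal sketch of a two-node search over titles, assuming single-word B-terms and a tiny hand-written stopword list. The actual tool's tokenization and filtering are not described on the slide, so the helper names, stopwords, and titles here are all illustrative assumptions.

```python
import re

# Small illustrative stopword list (an assumption, not the tool's real list).
STOPWORDS = {"the", "of", "in", "and", "a", "an", "on", "for", "with", "by", "to"}

def title_terms(titles):
    """Collect lowercase single-word terms from a list of titles."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z][a-z0-9-]+", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms

def two_node_b_terms(a_titles, c_titles, a_query, c_query):
    """B-terms = terms in both title sets, excluding the query words themselves."""
    shared = title_terms(a_titles) & title_terms(c_titles)
    return sorted(shared - title_terms([a_query, c_query]))

# Invented titles standing in for the two literatures.
a_titles = [
    "Estrogen regulates antioxidant enzymes in neurons",
    "Estrogen and cholesterol transport",
]
c_titles = [
    "Antioxidant therapy in Alzheimer disease",
    "Cholesterol metabolism and Alzheimer disease risk",
]
b_terms = two_node_b_terms(a_titles, c_titles, "estrogen", "Alzheimer disease")
```

Swapping `title_terms` for a function that reads abstracts, MeSH headings, or triples would change the source of B-terms without changing the intersection step.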
8(No Transcript)
9Modeling the Two-Node Search-1
- Field testers, free-form use of the tool
- Chose 6 two-node searches as gold standards: not
too big or small, disparate, topically coherent,
clean questions
- E.g. for A = retinal detachment, C = aortic
aneurysm: a) find diseases in which both features
appear (not necessarily in the same person), or b)
find surgical procedures that have been applied
to both conditions
- Manually marked relevant B-terms for a given
query (sometimes several queries for the same
two-node search)
- Details in Bioinformatics (2007) paper
10Modeling the Two-Node Search-2
- Used 8 complementary features to score each
B-term (e.g. recency, frequency, semantic
categories)
- Created a single combined and weighted score for
each B-term
- Used a logistic regression model to optimally give
weights to each feature so as to separate marked
relevant B-terms from all others (mixed set)
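The weighting step can be illustrated with a small logistic regression fitted by plain gradient descent. This is a sketch under stated assumptions: only three of the eight features are shown (the ones named on the slide), the feature values and labels are invented, and the published model may have been fitted quite differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit feature weights and a bias by batch gradient descent on log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def score(x, weights, bias):
    """Combined, weighted score: estimated probability the B-term is relevant."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Invented B-term features: (recency, frequency, semantic-category match).
rows = [
    (0.9, 0.7, 1.0), (0.8, 0.9, 1.0), (0.6, 0.8, 1.0), (0.7, 0.3, 1.0),  # marked relevant
    (0.2, 0.4, 0.0), (0.3, 0.1, 0.0), (0.1, 0.6, 1.0), (0.4, 0.2, 0.0),  # all others
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_logistic(rows, labels)
```

Calling `score(features, weights, bias)` on a new B-term then gives the single combined score used for ranking.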
11Modeling the Two-Node Search-3
12Two End-Points of this Research
- For any two-node search, we can now rank the list
of B-terms in order of estimated probability that
they will be marked as relevant (meaningful) by
SOME user for SOME query.
- For any pair of literatures A and C, we can now
estimate the OVERALL shared implicit information
between A and C (number of B-terms that are
predicted to be relevant)
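Both end-points can be sketched in a few lines, assuming each B-term already has an estimated relevance probability. The 0.5 cut-off for "predicted to be relevant" and the probabilities themselves are assumptions made for illustration.

```python
def rank_b_terms(probabilities):
    """Order B-terms by estimated probability of being marked relevant."""
    return sorted(probabilities.items(), key=lambda kv: (-kv[1], kv[0]))

def shared_implicit_information(probabilities, threshold=0.5):
    """Count B-terms predicted relevant (threshold is an assumed cut-off)."""
    return sum(1 for p in probabilities.values() if p >= threshold)

# Invented probabilities for the B-terms of one A-C pair.
probs = {"antioxidant": 0.82, "cholesterol": 0.64, "patients": 0.08, "study": 0.03}

ranking = rank_b_terms(probs)
overall = shared_implicit_information(probs)
```

`ranking` lists the B-terms from most to least likely relevant, and `overall` is a single number summarizing how much A and C share.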
13Relevance to the One Node Search
- We can re-conceptualize the one-node search as a
series of two-node searches
- Choose A, then choose category C
- Divide category C densely into many small,
coherent Ci
- For each Ci, score multi-dimensional features,
including, but not limited to, features that
relate A to Ci (e.g. number of B-terms in common
or predicted relevant B-terms)
- Rank the Ci to identify the most promising
literatures (which are presumed to point to novel
hypotheses or implicit information helpful when
applied to A)
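The steps above can be sketched as a loop of two-node searches, one per Ci. In this illustration the relevance predictor is a hypothetical stand-in (a simple filter on a generic word), and the term sets are invented. In practice the predictor would be a trained B-term model.

```python
def rank_sub_literatures(a_terms, sub_literatures, is_relevant):
    """Run one two-node comparison per Ci and rank the Ci by their features.

    is_relevant: callable deciding whether a shared B-term is predicted relevant
                 (a placeholder for a trained model).
    """
    rows = []
    for name, c_terms in sub_literatures.items():
        shared = a_terms & c_terms
        predicted = {t for t in shared if is_relevant(t)}
        rows.append({
            "Ci": name,
            "shared": len(shared),
            "predicted_relevant": len(predicted),
        })
    # Rank by predicted relevant B-terms, then by raw overlap, then by name.
    rows.sort(key=lambda r: (-r["predicted_relevant"], -r["shared"], r["Ci"]))
    return rows

# Invented term sets: A is one literature, C1..C3 partition a chosen category C.
a_terms = {"autophagy", "aggregation", "mitochondria", "oxidative stress", "patients"}
sub_literatures = {
    "C1": {"autophagy", "mtor", "patients"},
    "C2": {"mitochondria", "oxidative stress", "patients"},
    "C3": {"platelet", "patients"},
}
ranked = rank_sub_literatures(a_terms, sub_literatures, lambda t: t != "patients")
```

Because only the chosen Ci are compared with A, the work grows with the number of Ci rather than with every possible A-B-C chain.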
14A is evaluated pairwise against C
- C1 (might involve B-terms), C2 (might not!), C3,
C4, ...
- e.g. A = Huntington Disease; C = lifestyle
factors, autophagy, or therapeutic agents
15Interestingness Measures
- Field of data mining.
- This allows us to encode real-life priorities and
strategies of working scientists
- Existing one-node search looks for novelty,
relevance, non-triviality, likelihood of being true
... get low-hanging fruit
- What about actionability, feasibility of
follow-up, surprisingness, cross-discipline,
presence of high experimental support,
generalizability to other problems, or high
potential impact?
- A candidate Ci could be interesting because it is
recently discovered and rapidly growing (e.g.
microRNAs), well characterized, for a disease
has an animal model, for a protein is connected
to many other proteins, for a drug has FDA
approval.
- This not only re-conceptualizes the one-node
search (e.g., no combinatorial explosion) but also
generalizes the ranking methods.
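One way to picture interestingness measures is as extra weighted features added to the B-term score for each Ci. The feature names, weights, and candidates below are all hypothetical. The slide does not specify how such measures would be combined.

```python
# Hypothetical weights expressing one scientist's priorities.
WEIGHTS = {
    "predicted_relevant_b_terms": 1.0,
    "rapidly_growing": 2.0,
    "has_animal_model": 1.5,
    "fda_approved": 1.5,
}

def interestingness(candidate, weights=WEIGHTS):
    """Weighted sum over B-term and interestingness features (missing = 0)."""
    return sum(w * float(candidate.get(name, 0)) for name, w in weights.items())

# Invented candidate literatures with a mix of features.
candidates = [
    {"name": "C1", "predicted_relevant_b_terms": 3, "fda_approved": True},
    {"name": "C2", "predicted_relevant_b_terms": 4},
    {"name": "C3", "predicted_relevant_b_terms": 2,
     "rapidly_growing": True, "has_animal_model": True},
]

by_b_terms = sorted(candidates, key=lambda c: -c["predicted_relevant_b_terms"])
by_interest = sorted(candidates, key=lambda c: -interestingness(c))
```

With these toy numbers, ranking by B-terms alone puts C2 first, while the combined score puts C3 first, which shows how the added measures can reorder candidates.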
16Gold Standards for One-Node Searches
- Also, we can now envision preparing a series of
gold standard searches, even automatically (cf.
TREC 2006, 2007).
- Use implicit assertions to reconstruct explicit
knowledge.
- Use review articles
- Use lists (e.g. in virus study, gold standard was a
list of viruses that were thought to be at risk
of being exploited for biological warfare).
- Use time slices
- Avoids the paradox that one-node searches must
predict things that have no experimental support!
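A time-slice gold standard might be built roughly as follows: treat links that become explicit only after a cut-off year as the targets that a search run on the earlier literature should recover. The record format, function names, years, and terms are assumptions made for this sketch.

```python
def time_slice_gold_standard(records, a_term, cutoff_year):
    """Terms linked explicitly to a_term only AFTER the cut-off year.

    records: iterable of (year, term_x, term_y) explicit co-occurrences.
    """
    before, after = set(), set()
    for year, x, y in records:
        partner = y if x == a_term else x if y == a_term else None
        if partner is None:
            continue  # record does not involve a_term
        (before if year <= cutoff_year else after).add(partner)
    return after - before

def evaluate(predicted, gold):
    """Precision and recall of predicted links against the gold standard."""
    predicted, gold = set(predicted), set(gold)
    hits = predicted & gold
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Invented records (year, term, term).
records = [
    (1985, "disease X", "factor Q"),
    (1986, "disease X", "factor R"),
    (1992, "disease X", "agent P"),
    (1995, "disease X", "factor Q"),
    (1996, "disease Y", "agent P"),
]
gold = time_slice_gold_standard(records, "disease X", cutoff_year=1988)
precision, recall = evaluate({"agent P", "agent Z"}, gold)
```

Predictions are scored against knowledge that later became explicit, so a correct prediction does have experimental support by the time it is evaluated.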
17Conclusions
- LBD is (can be, will be) alive and well!
- Need to incorporate the types of real-life
priorities and strategies of working scientists
- Re-conceptualize the one-node search as a series
of two-node searches
- Use interestingness measures to supplement
B-term measures.
18(No Transcript)
19Journal of Biomedical Discovery and Collaboration
- Unique multi-disciplinary audience
- People who engage in scientific discovery and
collaboration
- People who make tools that enhance scientific
discovery and collaboration
- People who study scientific discovery and
collaboration
- Hosted by BioMed Central
- Fully peer-reviewed
- RAPID review (<3 weeks is routine)
- Open-access, indexed in PubMed Central et al.
- Readership goes up 10-100-fold
- Impact goes up too
- Article fee reduced or zeroed depending on
institution
20Acknowledgements
- Don Swanson
- Vetle Torvik
- Wei Zhou (Clement Yu)
- Marc Weeber
21Ruminations
- Should LBD analyses be user-friendly? Popular??
- Don't they overlook true divergent discoveries?
- Should LBD be run automatically as a program in
the background, with alerts of possible
discoveries?
- Does LBD bypass, or reinforce, good old-fashioned
hypothesis-driven science?