Title: Literature-Based Knowledge Discovery using Natural Language Processing
1Literature-Based Knowledge Discovery using
Natural Language Processing
Dimitar Hristovski, PhD (1); Carol Friedman, PhD (2); Thomas C Rindflesch, PhD (3); Borut Peterlin, MD, PhD (4)
(1) Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia
(2) Department of Biomedical Informatics, Columbia University, New York
(3) National Library of Medicine, Bethesda, Maryland
(4) Division of Medical Genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia
E-mail: dimitar.hristovski_at_mf.uni-lj.si
2Part 1: Co-occurrence-based LBD
3Motivation
- Overspecialization
- Information overload
- Large databases
- Need and opportunity for computer supported
knowledge discovery
4Literature-based Discovery (LBD)
- A method for automatically generating hypotheses (discoveries) from literature
- Hypotheses have the form: Concept1 Relation Concept2
- Example: Fish oil Treats Raynaud's disease
5Background
- Concept X (Disease), e.g. Raynaud's disease
- Concepts Y (Pathological or Cell Function, ...), e.g. Blood viscosity
- Concepts Z (Drugs, ...), e.g. Fish oil
- New relation (X to Z)? e.g. Treats
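The X, Y, Z scheme on this slide can be sketched in a few lines of Python. This is only a minimal illustration of co-occurrence-based open discovery, not BITOLA's implementation. The toy `related` dictionary (including the concepts Vasoconstriction and Nifedipine) and the function name `open_discovery` are assumptions made for the example.

```python
from collections import defaultdict

# Toy co-occurrence data (hypothetical): concept -> concepts it co-occurs with.
related = {
    "Raynaud's disease": {"Blood viscosity", "Vasoconstriction", "Nifedipine"},
    "Blood viscosity": {"Raynaud's disease", "Fish oil"},
    "Vasoconstriction": {"Raynaud's disease", "Fish oil", "Nifedipine"},
    "Fish oil": {"Blood viscosity", "Vasoconstriction"},
    "Nifedipine": {"Vasoconstriction", "Raynaud's disease"},
}


def open_discovery(x, related):
    """Return {Z: set of linking Y concepts} for every Z not directly related to X."""
    direct = related.get(x, set())
    candidates = defaultdict(set)
    for y in direct:
        for z in related.get(y, set()):
            # Skip X itself and anything already known to co-occur with X.
            if z != x and z not in direct:
                candidates[z].add(y)
    return dict(candidates)


hypotheses = open_discovery("Raynaud's disease", related)
for z, ys in hypotheses.items():
    print(f"New relation? Raynaud's disease - {z}, via {sorted(ys)}")
```

In this toy data Nifedipine is dropped because it already co-occurs with the starting concept, while Fish oil survives as a candidate reached through two intermediate Y concepts.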
6Biomedical Discovery Support System (BITOLA)
- Goal
  - discover potentially new relations (knowledge) between biomedical concepts
  - to be used as research idea generator and/or as an alternative way to search Medline
- System user (researcher or intermediary)
  - interactively guides the discovery process
  - evaluates the proposed relations
7Extending and Enhancing Literature Based
Discovery
- Goal
  - Make literature-based discovery more suitable for disease candidate gene discovery
  - Decrease the number of candidate relations
- Method
  - Integrate background knowledge
    - Chromosomal location of diseases and genes
    - Gene expression location
    - Disease manifestation location
8System Overview
(Architecture diagram; components shown:)
- Databases (Medline, LocusLink, HUGO, OMIM, ...)
- Knowledge Extraction
- Knowledge Base
  - Concepts
  - Association Rules
  - Background Knowledge (Chromosomal Locations, ...)
- Discovery Algorithm
- User Interface
9Terminology Problems during Knowledge Extraction
- Gene names
- Gene symbols
- MeSH and genetic diseases
10Detected Gene Symbols by Frequency
(Detected symbol: frequency)
- type: 666548
- II: 552584
- III: 201776
- component: 179643
- CT: 175973
- AT: 151337
- ATP: 147357
- IV: 123429
- CD4: 99657
- p53: 89357
- MR: 88682
- SD: 85889
- GH: 84797
- LPS: 68982
- 59: 67272
- E2: 64616
- 82: 63521
- AMP: 61862
- TNF: 59343
- RA: 58818
- CD8: 57324
- O2: 56847
- ACTH: 54933
- CO2: 53171
- PKC: 51057
- EGF: 50483
- T3: 49632
- MS: 46813
- A2: 44896
- ER: 43212
- upstream: 41820
- PRL: 41599
11Gene Symbol Disambiguation
- Find MEDLINE docs in which we can expect to find gene symbols
- Example of false positive
  - Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390
  - breast basic conserved 1 (BBC1) gene vs. BBC1 television station featuring new drama series Life Support
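The slide does not say how the documents "in which we can expect to find gene symbols" are selected. As a rough sketch of the idea, one could accept a symbol match only when the text also contains a genetics-related cue word. The cue list `GENETIC_CUES`, both function names, and the second example title are assumptions for illustration, not the method used in BITOLA.

```python
import re

# Hypothetical cue words suggesting that a document discusses genetics.
GENETIC_CUES = {"gene", "genes", "mutation", "mutations", "chromosome", "allele"}


def mentions_symbol(text, symbol):
    """True if the symbol occurs as a whole token (case-sensitive)."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(symbol)}(?![A-Za-z0-9])"
    return re.search(pattern, text) is not None


def likely_gene_mention(text, symbol):
    """Accept a symbol match only if the text also contains a genetic cue word."""
    if not mentions_symbol(text, symbol):
        return False
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & GENETIC_CUES)


tv_title = 'Ethics in a twist: "Life Support", BBC1.'
gene_title = "Mutations in the BBC1 gene"  # made-up title for contrast

print(likely_gene_mention(tv_title, "BBC1"))
print(likely_gene_mention(gene_title, "BBC1"))
```

A real system would more plausibly rely on indexing terms or a trained classifier than on a fixed word list; the point here is only that context decides whether "BBC1" is a gene.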
12Binary Association Rules
- X → Y (confidence, support)
- If X Then Y (confidence, support)
- Confidence: % of docs containing Y within the X docs
- Support: number (or %) of docs containing both X and Y
- The relation between X and Y is not known.
- Examples
  - Multiple Sclerosis → Optic Neuritis (2.02, 117)
  - Multiple Sclerosis → Interferon-beta (5.17, 300)
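The two measures can be computed directly from per-document concept sets. The sketch below assumes confidence is expressed as a percentage and support as a document count, as in the examples on this slide; the toy corpus and the function name `association_rule` are made up for illustration. Read this way, the rule with (2.02, 117) would imply roughly 117 / 0.0202 ≈ 5,800 Multiple Sclerosis documents.

```python
def association_rule(docs, x, y):
    """Return (confidence_percent, support) for the rule X -> Y.

    docs: list of sets of concepts, one set per document.
    support: number of documents containing both X and Y.
    confidence: percentage of the X documents that also contain Y.
    """
    x_docs = [d for d in docs if x in d]
    support = sum(1 for d in x_docs if y in d)
    confidence = 100.0 * support / len(x_docs) if x_docs else 0.0
    return confidence, support


# Toy corpus (hypothetical): each document is the set of concepts assigned to it.
docs = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis", "Interferon-beta", "Optic Neuritis"},
    {"Multiple Sclerosis"},
    {"Optic Neuritis"},
]

confidence, support = association_rule(docs, "Multiple Sclerosis", "Interferon-beta")
print(f"Multiple Sclerosis -> Interferon-beta ({confidence:.2f}, {support})")
```

Note that the rule is directional: the confidence of Optic Neuritis → Multiple Sclerosis is computed over the Optic Neuritis documents and generally differs.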
13Discovery Algorithm
- Concept X (Disease)
- Concepts Y (Pathological or Cell Function, ...)
- Concepts Z (Genes)
- Proposed relation between X and Z: Candidate gene?
- Match: Chromosomal Region (disease) with Chromosomal Location (gene)
- Match: Manifestation Location (disease) with Expression Location (gene)
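The two "Match" steps can be read as filters applied to the candidate genes Z. The following sketch assumes a very crude notion of location match (string prefix on the cytogenetic band) and made-up tissue annotations; `Gene`, `location_matches`, `candidate_genes`, and the entries GENE_A and GENE_B are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    symbol: str
    location: str            # cytogenetic band, e.g. "Xq28"
    expressed_in: frozenset  # tissues in which the gene is expressed


def location_matches(gene_location, disease_region):
    """Crude match: the gene's band starts with the disease's region string."""
    return gene_location.startswith(disease_region)


def candidate_genes(genes, disease_region, manifestation_tissues):
    """Keep genes whose location and expression both match the disease."""
    return [
        g.symbol
        for g in genes
        if location_matches(g.location, disease_region)
        and g.expressed_in & manifestation_tissues
    ]


# Toy annotations (hypothetical values, for illustration only).
genes = [
    Gene("L1CAM", "Xq28", frozenset({"brain"})),
    Gene("FLNA", "Xq28", frozenset({"brain"})),
    Gene("GENE_A", "7q31", frozenset({"brain"})),  # wrong chromosomal region
    Gene("GENE_B", "Xq28", frozenset({"liver"})),  # wrong expression location
]

result = candidate_genes(genes, "Xq28", frozenset({"brain"}))
print(result)
```

A prefix test is a simplification: real cytogenetic locations are often ranges, so a production system would need proper interval comparison.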
14Ranking Concepts Z
15Problem Size
- Full Medline analyzed (approx. 15,000,000 records)
- 87,000,000 association rules between 180,000
biomedical concepts
16Bilateral Perisylvian Polymicrogyria - BPP (OMIM 300388)
- Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution
- Clinical characteristics
  - Mental retardation
  - Epilepsy
  - Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process)
- X-linked dominant inheritance
17BPP: from 237 genes in Xq28 to 2 gene candidates
- Start: 237 genes in Xq28
- Relation between semantic types Cell Movement and Gene or Gene Product: 18 gene candidates
- Sublocalisation within Xq28: 15 gene candidates
- Tissue-specific expression: 2 gene candidates, L1CAM and FLNA
18User Interface: cgi-bin version
19Automatically search for supporting Medline
Citations
20Part 1 Summary and Conclusions
- Discovery support system (BITOLA) presented
- The system can be used as
  - Research idea generator, or
  - Alternative method of searching Medline
- Genetic knowledge about the chromosomal locations
of diseases and genes included to make BITOLA
more suitable for disease candidate gene discovery
21System Availability
- URL: www.mf.uni-lj.si/bitola/
22Part 2: Exploring Semantic Relations for LBD
23Current LBD Systems
- Co-occurrence based
- Concepts
  - Title/Abstract Words/Phrases
  - MeSH
  - UMLS
  - Genes ...
- UMLS Semantic types used for filtering
- Semantic relations between concepts NOT used
24Drawbacks of Current LBD
- Not all co-occurrences represent a relation
- Users have to read many Medline citations when reviewing candidate relations
- Many spurious (false-positive) relations and hypotheses produced
- No explanation of proposed hypotheses
25Enhancing the LBD paradigm
- Use semantic relations obtained from two NLP systems (BioMedLee and SemRep) to augment a co-occurrence-based LBD system (BITOLA)
26Methods
27Discovery Patterns
- Discovery pattern: a set of conditions to be satisfied for the generation of new hypotheses
- Conditions are combinations of semantic relations between concepts
- Maybe_Treats pattern in this research has two forms
  - Maybe_Treats1
  - Maybe_Treats2
28Maybe_Treats Discovery Pattern
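Maybe_Treats1
- Disease X is associated with Change1 in Substance Y1 (or Body meas., Body funct.)
- Drug Z1 (or substance) is associated with Opposite_Change1 in Y1
- Proposed: Z1 Maybe_Treats1 X
Maybe_Treats2
- Disease X is associated with Change2 in Substance Y2 (or Body meas., Body funct.)
- Disease X2 is associated with the Same Change2 in Y2
- Drug Z2 (or substance) Treats X2
- Proposed: Z2 Maybe_Treats2 X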
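Both forms of the pattern can be written as small queries over extracted facts. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: facts are hand-typed tuples taken from the Huntington examples later in the talk, a plain `diseases` set stands in for semantic-type checking, and the names `assoc_with_change`, `treats`, `maybe_treats1`, and `maybe_treats2` are chosen for this example.

```python
# Hand-typed facts; in the talk such facts are extracted by BioMedLee and SemRep.
# Each tuple is (concept, changed_concept, direction_of_change).
assoc_with_change = {
    ("Huntington", "Substance P", "decrease"),
    ("Capsaicin", "Substance P", "increase"),
    ("Huntington", "Insulin", "decrease"),
    ("Diabetes mellitus", "Insulin", "decrease"),
}
# Each tuple is (drug, disease).
treats = {("Insulin regulation therapy", "Diabetes mellitus")}
# Stand-in for semantic typing of concepts (assumption).
diseases = {"Huntington", "Diabetes mellitus"}

OPPOSITE = {"increase": "decrease", "decrease": "increase"}


def maybe_treats1(x):
    """Z1 changes Y1 in the direction opposite to the change seen in disease X."""
    out = set()
    for concept, y, change in assoc_with_change:
        if concept != x:
            continue
        for z, y_other, change_other in assoc_with_change:
            if y_other == y and z not in diseases and change_other == OPPOSITE[change]:
                out.add((z, y))
    return out


def maybe_treats2(x):
    """Z2 treats a disease X2 that shows the same change in Y2 as disease X."""
    out = set()
    for concept, y, change in assoc_with_change:
        if concept != x:
            continue
        for x2, y_other, change_other in assoc_with_change:
            if x2 != x and x2 in diseases and y_other == y and change_other == change:
                out.update((z, x2, y) for z, d in treats if d == x2)
    return out


print(maybe_treats1("Huntington"))
print(maybe_treats2("Huntington"))
```

Each result carries the linking concept (and, for the second form, the similar disease), so the reasoning behind a proposed treatment can be shown to the user.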
29Maybe_Treats1 and Maybe_Treats2
- Goal: Propose potentially new treatments
- Can work in concert
  - Propose different treatments (complementary)
  - Propose same treatments using different discovery reasoning (reinforcing)
30Multiple Usages of Maybe_Treats
- Given Disease X as input
  - find new treatments Z
- Given Drug Z as input
  - find diseases X that can be treated
- Given Disease X and Drug Z as input
  - test whether Z can be used to treat X
31Semantic Relations Used
- Associated_with_change and Treats used to extract known facts from the literature
- Then Maybe_Treats1 and Maybe_Treats2 predict new treatments based on the known extracted facts
32Associated_with_change
- One concept associated with a change in another concept, for example
  - Associated_with(Raynauds, Blood viscosity, increase)
  - "Local increase of blood viscosity during cold-induced Raynaud's phenomenon."
  - "Increased viscosity might be a causal factor in secondary forms of Raynaud's disease, ..."
- BioMedLee (Friedman et al) used to extract Associated_with_change
33Treats
- Used to extract drugs known to treat a disease
- Major purpose in our approach
  - Eliminate drugs already known to be used to treat a disease
  - Find existing treatments for similar diseases
- TREATS(Amantadine,Huntington)
  - "treatment of Huntington's disease with amantadine"
- Treats extracted by SemRep (Rindflesch et al)
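The first purpose of Treats, eliminating drugs already known to treat the disease, amounts to a simple set filter. In this sketch the function name `drop_known_treatments` and the candidate list are assumptions; only the TREATS(Amantadine,Huntington) fact comes from the slide.

```python
def drop_known_treatments(candidates, disease, treats):
    """Remove candidate drugs that a TREATS fact already links to the disease."""
    known = {drug for drug, d in treats if d == disease}
    return [c for c in candidates if c not in known]


# (drug, disease) pairs as they might be extracted from the literature.
treats = {("Amantadine", "Huntington")}
candidates = ["Amantadine", "Capsaicin"]  # hypothetical candidate list

novel = drop_known_treatments(candidates, "Huntington", treats)
print(novel)
```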
34Results
35Huntington Disease
- Inherited neurodegenerative disorder
- All 5511 Huntington citations (Jan. 2006) processed with BioMedLee and SemRep
- 35 interesting concepts assoc. with change selected and corresponding citations (250,000) processed
36Insulin for Huntington Disease
- Assoc_with(Huntington,Insulin,decrease)
  - "Huntington's disease transgenic mice develop an age-dependent reduction of insulin mRNA expression and diminished expression of key regulators of insulin gene transcription, ..."
- Insulin also decreased in diabetes mellitus
- Therapies used to regulate insulin in diabetes might be used for Huntington
37Capsaicin for Huntington
- Assoc_with(Huntington,Substance P,decrease)
  - "In Huntington's disease brains decreased Substance P staining was found in ..."
- Assoc_with(Capsaicin,Substance P,increase)
  - "Capsaicin also attenuated the increase in Substance P content in sciatic nerve, ..."
- Capsaicin maybe treats Huntington because Substance P is decreased in Huntington and Capsaicin increases Substance P.
38Huntington Results - Summary
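Maybe_Treats1
- Huntington (Disease X): Decrease in Substance P (Substance Y1)
- Capsaicin (Drug Z1): Increase in Substance P
- Proposed: Capsaicin Maybe_Treats1 Huntington
Maybe_Treats2
- Huntington (Disease X): Decrease in Insulin (Substance Y2)
- Diabetes M (Disease X2): Decrease in Insulin
- Insulin regulation ther. (Z2) Treats Diabetes M
- Proposed: Insulin regulation ther. Maybe_Treats2 Huntington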
39Example: Parkinson disease as starting concept.
Shown below are some related concepts changed in
association with Parkinson
40Potential Treatments for Parkinson (e.g.
gabapentin)
41Showing Supporting Sentences with highlighted
concepts and relations
42Gabapentin for Parkinson
- Assoc_with(Parkinson,gamma-aminobutyric acid(GABA),decrease)
  - "studies indicate that patients with Parkinson's disease have decreased basal ganglia gamma-aminobutyric acid function"
- Assoc_with(GABA,Gabapentin,increase)
  - "Gabapentin, probably through the activation of glutamic acid decarboxylase, leads to the increase in synaptic GABA."
- Explanation: Gabapentin maybe treats Parkinson because GABA is decreased in Parkinson and Gabapentin increases GABA.
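An explanation of this kind can be generated mechanically from the two facts behind a hypothesis. The sketch below is one possible way to do it; the function name `explain`, its argument order, and the wording templates are assumptions for illustration.

```python
OPPOSITE = {"increase": "decrease", "decrease": "increase"}
PARTICIPLE = {"increase": "increased", "decrease": "decreased"}
VERB = {"increase": "increases", "decrease": "decreases"}


def explain(drug, disease, substance, disease_change, drug_change):
    """Build an explanation sentence, or return None if the changes do not oppose."""
    if drug_change != OPPOSITE[disease_change]:
        return None
    return (
        f"{drug} maybe treats {disease} because {substance} is "
        f"{PARTICIPLE[disease_change]} in {disease} and {drug} "
        f"{VERB[drug_change]} {substance}."
    )


print(explain("Gabapentin", "Parkinson", "GABA", "decrease", "increase"))
```

Pairing such a sentence with the supporting citations lets the user check each step of the reasoning rather than only the final hypothesis.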
43Part 2 Conclusions
- A new method to improve LBD presented
- Based on discovery patterns and semantic relations extracted by BioMedLee and SemRep, coupled with BITOLA LBD
- Easier for the user to evaluate a smaller number of hypotheses
- Two potentially new therapeutic approaches for Huntington proposed and one for Parkinson
- Raynaud's-Fish oil discovery replicated
44The future of Literature-based Discovery
- Development of specific discovery patterns based
on semantic relations and further integrated with
co-occurrence-based LBD
45Link, References and some propaganda
- http://www.mf.uni-lj.si/bitola
- Hristovski D, Peterlin B, Mitchell JA, Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inform. 2005. Vol. 74(2-4), pp. 289-298. → Selected for Yearbook of Medical Informatics 2006
- Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In Proc AMIA 2006 Symp 2006. p. 349-53.
- Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In Proc AMIA 2007 Symp 2007. p. 6-10. → Distinguished Paper Award AMIA 2007
- Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. → To appear as a chapter in the first LBD book in 2008