Title: GeneTUC: Natural Language Understanding in Medical Text
1GeneTUC: Natural Language Understanding in Medical Text
- PhD Defense
- Rune Sætre
- June 27th 2006
2Overview
- Motivation
- Thesis Work
- Overview (Diploma Thesis)
- Idea (Paper 1 and 2)
- Bioogle (Paper 3, 4 and 5)
- GeneTUC (Paper 6)
- Results, Related Work and Discussion
- Comments and Questions by Jong C. Park and Eivind Hovig
3Motivation
http://www.ncbi.nlm.nih.gov/PubMed/
4Motivation
- Biomedical researchers publish almost 2000 abstracts per day in MEDLINE
- Computers are needed to automatically find all (recall), and only (precision), the relevant information
- Future Solution: GeneTUC
- TUC: The Understanding Computer
- BusTUC works for natural language queries about buses in Trondheim
- GeneTUC uses full parsing to extract knowledge from MEDLINE
- After parsing the input, GeneTUC can answer simple questions about protein and gene interactions and other facts from the text
5Challenge: Medical language
- Example Input Sentences
- Subsequently, activated CREB activates transcription of genes essential for proper germ cell differentiation.
- Indeed, Ca2+/calmodulin binds a complex of RGS4 and a transition state analog of Galpha(i1)-GDP-AlF(4)(-).
- Medical language is not always natural language
- Complex grammar
- Invention of new words/names every day
PMID 11988318
BioCreative1 Example, PMID 11988318
6GeneTUC Research Overview
7Thesis Work
- GeneTUC Diploma Work
- Literature Review: NLU in Medicine
- GeneTUC: Full-parsing of MEDLINE Abstracts
- PhD Papers
- 1 Unitex: Local Grammars
- 2 ProtChew: Automatic Protein Name Recognition
- 3 Alchymoogle: Automatic Entity Annotation
- 4 gProt: Automatic Protein Interaction Annotation
- 5 WebProt: Online gProt Experiments
- 6 GeneTUC: GENIA corpus experiments
8TUC Introduction
- Chat-80, Prat-89, HSQL
- 1991 The Understanding Computer
- 1996 BusTUC (www.team-trafikk.no)
- 2000 GeneTUC, diploma project
- 2001-2006 GeneTUC has been my PhD project
9GeneTUC System Architecture
- MEDLINE Abstracts
- GO: Gene Ontology
- TUC: The Understanding Computer
- DB: TQL database
- HGNC: HUGO Gene Nomenclature Committee
- WordNet Ontology
[Architecture diagram; labels: GeneTUC, Query, Answer, MEDLINE, GO, TUC, DB, HGNC, WordNet]
10WordNet 2.0
- Online lexical reference system
- Nouns, verbs, adjectives and adverbs
- Inspired by psycholinguistic theories of human lexical memory
- Organized into synonym sets, each representing one underlying lexical concept
- Different relations link the synonym sets
- E.g. hypernyms, hyponyms, holonyms, synonyms, coordinate terms, domain
11Nomenclature, HUGO
- HUGO Gene Nomenclature Committee
- Approve a gene name and symbol for each known human gene
- Stored in the Human Gene Nomenclature Database
- Approved 13,000 symbols (20-30,000 human genes)
- Each symbol is unique
- Each gene is only given one approved gene symbol
- Similar names used, e.g. in mouse gene research
- Efforts are made to use a symbol acceptable to workers in the field
- Facilitates electronic data retrieval from publications
12Gene Ontology
- Heterarchy
- Molecular Terms
- Controlled Vocabulary
- Function, Process and Location
13GeneTUC Parser
- Top-Down, left to right
- Greedy Heuristics
- Semantic Constraints
- Interact(Agent RGS4)
- The rock grows
[Parse tree figure for "Rgs4 interacts with calmodulin"; node labels: S, NP, VP, N, V, PP, P]
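The parsing strategy listed on this slide can be sketched in a few lines of Python. This is only a toy illustration, not GeneTUC's grammar: the lexicon, the type names, and the functions `parse_np` and `parse_sentence` are all hypothetical. It also assumes that "The rock grows" is meant as a sentence that is grammatical but rejected by a semantic constraint.

```python
# Hypothetical toy lexicon and type constraints; not GeneTUC's actual grammar.
NOUNS = {"rgs4": "protein", "calmodulin": "protein",
         "rock": "mineral", "plant": "organism"}
VERBS = {"interacts": {"protein", "gene"},   # allowed agent types per verb
         "grows": {"organism"}}
PREPOSITIONS = {"with"}
DETERMINERS = {"the"}

def parse_np(tokens, i):
    """NP -> (Det) N. Returns (noun, next_index) or None."""
    if i < len(tokens) and tokens[i] in DETERMINERS:
        i += 1
    if i < len(tokens) and tokens[i] in NOUNS:
        return tokens[i], i + 1
    return None

def parse_sentence(text):
    """S -> NP V (P NP), parsed top-down and left to right; returns a tree or None."""
    tokens = text.lower().rstrip(".").split()
    np = parse_np(tokens, 0)
    if np is None:
        return None
    subject, i = np
    if i >= len(tokens) or tokens[i] not in VERBS:
        return None
    verb = tokens[i]
    i += 1
    # Semantic constraint: the agent's type must be allowed for this verb.
    if NOUNS[subject] not in VERBS[verb]:
        return None
    pp = None
    if i < len(tokens) and tokens[i] in PREPOSITIONS:
        prep = tokens[i]
        obj = parse_np(tokens, i + 1)
        if obj is None:
            return None
        pp = ("PP", prep, ("NP", obj[0]))
        i = obj[1]
    if i != len(tokens):
        return None  # leftover words: no parse
    vp = ("VP", verb) if pp is None else ("VP", verb, pp)
    return ("S", ("NP", subject), vp)

print(parse_sentence("Rgs4 interacts with calmodulin."))
print(parse_sentence("The rock grows"))  # None: rejected by the type check
```

The type check runs as soon as the verb is read, so a semantically impossible reading is dropped before the rest of the sentence is parsed.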
14Screenshot Example
- E: rgs4 interacts with calmodulin.
- TQL:
- rgs4 isa protein
- calmodulin isa protein
- interact/rgs4/sk(1)
- srel/with/thing/calmodulin/sk(1)
- event/real/sk(1)
- E: calmodulin interacts with cck.
- TQL:
- cck isa gene
- interact/calmodulin/sk(3)
- srel/with/thing/cck/sk(3)
- event/real/sk(3)
[Diagram labels: RGS4, Calmodulin, CCK]
15Screenshot Example ctd.
- E: does rgs4 interact with cck?
- TQL:
- test(rgs4 isa protein,
- cck isa gene,
- interact/rgs4/A,
- srel/with/thing/cck/A,
- event/real/A)
- Yes
- A transitive rule:
- ProteinA interacts with ProteinB and ProteinB interacts with ProteinC => ProteinA interacts with ProteinC
[Diagram labels: RGS4, Calmodulin, CCK]
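The transitive rule can be illustrated with a small Python sketch. It is not how TUC stores or queries its facts: the `facts` set and the `interacts` function are hypothetical, and the interaction is followed only in the direction in which the sentences state it.

```python
from collections import deque

# Hypothetical fact store mirroring the two example sentences:
# "rgs4 interacts with calmodulin." and "calmodulin interacts with cck."
facts = {
    ("rgs4", "calmodulin"),
    ("calmodulin", "cck"),
}

def interacts(facts, source, target):
    """True if target is reachable from source via one or more stated interactions."""
    queue = deque([source])
    seen = set()
    while queue:
        current = queue.popleft()
        for a, b in facts:
            if a != current or b in seen:
                continue
            if b == target:
                return True
            seen.add(b)
            queue.append(b)
    return False

print(interacts(facts, "rgs4", "cck"))  # True
print(interacts(facts, "cck", "rgs4"))  # False
```

Whether the relation should also be treated as symmetric is left open here.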
16Dictionary
- GeneTUC does not perform very well without a complete dictionary
- Current Solution: Bioogle can build a dictionary
17Bioogle (Paper III)
- Current ontology: 275 medical terms
- Connect Unknowns to these Concepts
- Query syntax:
- "Unknown is (a|an)"
- Parse results until a hit is found (or not)
- Pentagastrin is a synthetic peptide containing the five terminal amino acids of gastrin.
- Result: 104 of 200 terms were correctly classified
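A minimal sketch of the "Unknown is a/an" idea, under stated assumptions: the snippet list stands in for search results (no search service is called), the four-term ontology is a made-up stand-in for the 275 terms, and `build_queries` and `classify` are hypothetical names. The slide says the results are parsed; the sketch replaces parsing with a crude look at the first few words after the pattern.

```python
import re

# Hypothetical miniature ontology; the slide mentions 275 medical terms.
ONTOLOGY = {"peptide", "protein", "gene", "hormone"}

def build_queries(unknown):
    """Exact-phrase queries of the form "<unknown> is a" / "<unknown> is an"."""
    return [f'"{unknown} is a"', f'"{unknown} is an"']

def classify(unknown, snippets, ontology=ONTOLOGY, window=6):
    """Return the first ontology term found shortly after '<unknown> is a/an', else None."""
    pattern = re.compile(
        rf"\b{re.escape(unknown)}\s+is\s+an?\s+(.*)", re.IGNORECASE
    )
    for snippet in snippets:
        match = pattern.search(snippet)
        if not match:
            continue
        # Only inspect the first few words after the pattern.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", match.group(1))[:window]
        for word in words:
            if word.lower() in ontology:
                return word.lower()
    return None

snippets = [
    "Pentagastrin is a synthetic peptide containing the five "
    "terminal amino acids of gastrin.",
]
print(build_queries("Pentagastrin"))
print(classify("Pentagastrin", snippets))  # peptide
```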
18GeneTUC Ontology
Relations: AKO, Is-A, Has_A
19Google API Search
- 1000 queries per user per day
- Free to use for everybody
- Can be programmed with SOAP in most languages
- Simple Object Access Protocol
- Results are handled automatically
- Alexa (Amazon) has implemented a similar service
- $1 per processor hour
- $1 per gigabyte/year of user storage
- $1 per 50 gigabytes of data processed
- $1 per gigabyte uploaded/downloaded
http://news.bbc.co.uk/1/hi/technology/4530978.stm
20Paper IV gProt
- What about protein interactions?
- Protein Interaction
- Protein -> Protein
- BioCreAtIvE1: Protein -> Set of GeneOntology Terms
- Find publicly known interactions for a given protein, using Google as the main source for new knowledge
- Query: proteinX VerbY
- Example: Gastrin activates
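The query pattern can be sketched as follows. The slide does not say how the interaction partner is read out of the results, so the fixed window of up to three following words is an assumption, as are the names `build_query` and `candidate_objects`. The snippets are plain strings standing in for search results.

```python
import re

def build_query(protein, verb):
    """Exact-phrase query of the form "proteinX verbY", e.g. "Gastrin activates"."""
    return f'"{protein} {verb}"'

def candidate_objects(protein, verb, snippets, max_words=3):
    """Collect up to max_words words that directly follow '<protein> <verb>'."""
    pattern = re.compile(
        rf"\b{re.escape(protein)}\s+{re.escape(verb)}\s+"
        rf"([\w-]+(?:\s+[\w-]+){{0,{max_words - 1}}})",
        re.IGNORECASE,
    )
    found = []
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            found.append(match.group(1))
    return found

snippets = [
    "Gastrin activates nuclear factor kappaB (NFkappaB) through a ...",
    "gastrin activates gastric juice secretion",
]
print(build_query("Gastrin", "activates"))
print(candidate_objects("Gastrin", "activates", snippets))
# ['nuclear factor kappaB', 'gastric juice secretion']
```

The second snippet shows why filtering is needed: the phrase matches, but what follows is not a protein.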
21Paper IV gProt
22
- Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  Conclusions: Gastrin activates NF kappa B via a PKC dependent pathway which involves I kappa B kinase, NF kappa B inducing kinase, and TRAF6. ...
  gut.bmjjournals.com/cgi/content/abstract/52/6/813 - Similar pages
- Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  gut.bmjjournals.com/cgi/reprint/52/6/813 - Similar pages
- Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  BACKGROUND: We previously reported that gastrin induces expression of CXC chemokines through activat...
  www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12740336&dopt=Abstract - Similar pages
- Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  CONCLUSIONS: Gastrin activates NFkappaB via a PKC dependent pathway which involves IkappaB kinase, NFkappaB inducing kinase, and TRAF6. MeSH Terms ...
  www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12740336&dopt=Citation - Similar pages
  More results from www.ncbi.nlm.nih.gov
- Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  iHOP - Information Hyperlinked over Proteins. Gastrin activates nuclear factor kappaB (NFkappaB) through a protein kinase C dependent pathway involving ...
  www.pdg.cnb.uam.es/UniPub/iHOP/gp/9705030.html - 7k - Cached - Similar pages
- Gast - Gastrin precursor
  Gastrin activates rat stomach histidine decarboxylase via cholecystokinin-B/gastrin receptors. Abstract-863492. Gastrin activated transcription through a ...
  www.pdg.cnb.uam.es/UniPub/iHOP/gg/121191.html - 105k - Cached - Similar pages
  More results from www.pdg.cnb.uam.es
- Anatomy Physiology Lecture Outlines
  aa. gastrin activates gastric juice secretion gastric smooth muscle churning bb. gastrin activates gastroileal reflex which moves chyme from ileum to ...
  www.gwc.maricopa.edu/class/bio202/digestlc.htm - 20k - Cached - Similar pages
23Paper IV gProt
24Paper V WebProt
- Online Implementation, bigger experiment
- Can Annotate Protein Interactions with 70% precision
- Tested the effect of source filtering
- 90% precision, but recall dropping to 70%
25Google as a source
4660 facts total from WebProt
1480
26WebProt
27Screenshot
WebProt
28Paper VI GeneTUC Results
- Can parse 60% of test input sentences in the GENIA corpus (500 abstracts),
- With 86% accuracy on the POS-tagging
- Bracketing Precision and Recall scores of 70.6% and 53.9%
- And answer simple questions about the parsed sentences
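The slide reports bracketing precision and recall separately. As a worked example, the two can be combined into an F-score, the harmonic mean of precision and recall. The value printed below is derived here from the two reported numbers and is not a figure from the papers; `f_score` is a hypothetical helper name.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Bracketing precision and recall as reported on this slide, in percent.
print(round(f_score(70.6, 53.9), 1))  # 61.1
```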
29Evalb scores
Paper VI
30Summary
- 6 papers describing the steps needed to show that GeneTUC can handle medical text
- 60% parsing success-rate may not be enough for a commercial application,
- But the fact that it improved from just 10% in 2001 is very promising
- Once the parsing success-rate is good enough, GeneTUC can be tested on Question-Answering
- There is a need for a good public dataset that allows measuring and comparing between different QA systems (Future Work)
31Acknowledgements
- Biologists
- Astrid Lægreid, Kamilla Stunes, Kristine Misund, Liv Thommesen, Tonje Strømmen Steigedal
- Computer Scientists
- Tore Amble, Arne Halaas, Amund Tveit, Martin Ranang, Harald Søvik, Yoshimasa Tsuruoka, Anders Andenæs, Tor-Kristian Jenssen, Franz Günthner, Junichi Tsujii, Jörg Cassens, Waclaw Kusnierczyk, Tore Bruland, Peep Küngas, Magnus Lie Hetland, Morten Hartman, Hallgeir Bergum, Jo Kristian Bergum, Frode Jünge, Heri Ramampiaro, Rolv Inge Seehuus, Per Kristian Lehre, Clemens Marschner, Petra Maier, Holger Bosk, Sebastian Nagel, Mariya Vitusevych, Jin-Dong Kim, Hong-Woo Chun, Takashi Ninomiya, Yusuke Miyao, Frode Høyvik, Henrik Tveit, Jian Su and others
32Questions and Comments
- Associate professor Jong C. Park
- Computer Science Division,
- Korea Advanced Institute of Science and Technology (KAIST),
- Daejeon, South Korea
- Professor Eivind Hovig
- Department of Tumor Biology,
- Institute for Cancer Research,
- The Norwegian Radium Hospital
33Thesis Work
- GeneTUC Project
- Use TUC in the Medical Text Domain
- Use Google (Bioogle) to Recognize Unknown Entities
- Galpha(i1)-GDP-AlF(4)(-), Ca2+, Gastrin
- Use Google (WebProt) to do Automatic Annotation
- Mapping (BioCreative)
- From Gene/Protein -> Set of GeneOntology Terms
34Motivation
- Natural language is natural ?
- Talking computers
- Voice as input
- Repetitive tasks should be automated!
- Information Extraction is trivial, if you know what to look for
350 GeneTUC Diploma Work
- NLU Review 2002
- GENIA HPSG
- Park et al. CCG-parsing
- Numbers?
36Paper I Local Grammars
- Maurice Gross
- There are more than 10^50 ways to build a sentence with at most twenty words
Gross (1997). Construction of Local Grammars
37Paper II ProtChew
- Protein Names
- Galpha(i1)-GDP-AlF(4)(-)
- Gastrin
- Idea: Automatic Extraction
- Based on existing dictionaries and machine learning
- Results?
38evalb
- 4 OUTPUT FORMAT FROM THE SCORER
- The scorer gives individual scores for each sentence, for example:

  Sent.                        Matched  Bracket   Cross        Correct Tag
 ID  Len.  Stat. Recal  Prec.  Bracket gold test Bracket Words  Tags Accracy
   1    8    0  100.00 100.00     5      5    5      0      6     5    83.33

- At the end of the output the Summary section gives statistics for all sentences, and for sentences <=40 words in length. The summary contains the following information:
- i) Number of sentences: total number of sentences.
- ii) Number of Error/Skip sentences: should both be 0 if there is no problem with the parsed/gold files.
- iii) Number of valid sentences = Number of sentences - Number of Error/Skip sentences
- iv) Bracketing recall = (number of correct constituents) / (number of constituents in the goldfile)
- v) Bracketing precision = (number of correct constituents) / (number of constituents in the parsed file)
- vi) Complete match = percentage of sentences where recall and precision are both 100%.
- vii) Average crossing = (number of constituents crossing a goldfile constituent) / (number of sentences)
- viii) No crossing = percentage of sentences which have 0 crossing brackets.
- ix) 2 or less crossing = percentage of sentences which have <=2 crossing brackets.
- x) Tagging accuracy = percentage of correct POS tags (but see 5.3 for exact details of what is counted).
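Bracketing recall and precision (items iv and v above) can be sketched for a single sentence as follows. This is a simplified illustration, not evalb itself: evalb applies further conventions configured in its parameter file that are left out here. The function name `bracket_scores` and the example spans are hypothetical.

```python
from collections import Counter

def bracket_scores(gold, test):
    """Matched brackets, recall and precision (in percent) for one sentence.

    gold and test are lists of (label, start, end) constituents.
    """
    gold_counts = Counter(gold)
    test_counts = Counter(test)
    # Multiset intersection: a constituent counts as correct only if the
    # label and both span boundaries agree.
    matched = sum((gold_counts & test_counts).values())
    recall = 100.0 * matched / len(gold) if gold else 0.0
    precision = 100.0 * matched / len(test) if test else 0.0
    return matched, recall, precision

# Hypothetical bracketings of "rgs4 interacts with calmodulin" (word offsets 0..4).
gold = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("PP", 2, 4), ("NP", 3, 4)]
test = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)]

print(bracket_scores(gold, test))  # (3, 60.0, 75.0)
```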
39Remember
- Present one paper at a time
- Summary results and related work also at the end
- Include tables for parsing, comparison with others, etc.
- An example of a complex sentence with the GTB tree. Ref. table. Compare brackets.
- Include WebProt screenshot
- Related work!! PhD pres. Related work.
- Lexiquest, 40 verbs, what is the F-score?
- From Tore: Why only 50%? Is it semantics or grammar that makes 50% fail?
40Dr. Carl-Fredrik Sørensen (50 min, I have half the time)
- 5 min intro, state-of-the-art
- 5 min definitions NLU
- 10 min thesis/papers overview and Research Questions
- 15 min three themes and contributions. Evaluation of the work
- 10 min future work
- Proof of concept. It can be implemented. Next step?
- Industry... Results are trusted
- Academic... Results are validated through understanding the research process.
- Dennings, proof of concept
- Research question... soon... Moore's law
- Proof of performance
- Shift the work to biologists
- Medline growth graph. Figure... Everything is published.
- Background: http://www.coli.uni-saarland.de/hansu/what_is_cl.html
- Schopenhauer: imagine how clever a wise man would be, if he knew everything in his books.
- Inter-annotator agreement in gProt, maybe 80 percent precision is enough?!