Title: A Bio Text Mining Workbench combined with Active Machine Learning
1A Bio Text Mining Workbench combined with Active
Machine Learning
- Gary Geunbae Lee
- Postech
- 11/25 LBM2005
2Contents
- Introduction
- POSBIOTM/W Workbench
- POSBIOTM/NER System
- POSBIOTM/NER with Active Machine Learning
- POSBIOTM/Event System
- Current status (demo)
3Introduction
- Exponentially growing biological publications
4Introduction
- Two key issues in dealing with biological texts
- Biological named entity recognition
- Extraction of biological interactions (events) between biological entities
- Important for biological pathways
Biological Papers
5Introduction
- Bio-text mining workbench
- Development workbench (common in NLP)
- Grammar development workbench
- POS/Tree Tagging workbench
- Uses a large amount of corpus data
- Machine learning methods are used in the NER task and the event extraction task.
- An annotated corpus is essential for good results with machine-learning-based methods (both in quantity and quality).
- Lack of annotated corpora (notorious in bio/medical fields)
- Need
- Tools that support collecting, managing, creating, annotating and exploiting rich biomedical text resources
- Tools that interact with the automatic system to grow the high-quality annotated corpus
7POSBIOTM/W A development Workbench
8POSBIOTM/W Workbench
- Goal
- Help users search, collect and manage publications.
- Quick Search Bar
- Provides quick access to PubMed.
- PubMed Search Assistant
- Users can select specific abstracts for named-entity tagging and event extraction.
9POSBIOTM/W Workbench
10POSBIOTM/W Workbench
- Named-entity recognition (NER) task
- Identification of the names of the materials concerned
- Goal: automatically and effectively annotate biomedical entities
- The NER Tool is a client tool of the POSBIOTM/NER System
- Currently, three NER models are provided:
- The GENIA-NER model, the GENE-NER model and the GPCR-NER model
- Named-entity recognition with active learning
- To minimize the human labeling effort
11POSBIOTM/W Workbench
- Named-entity recognition with Active learning
12POSBIOTM/W Workbench
- Goal: to extract events, each consisting of an interaction, an effecter, and a reactant
- Named-entity types: protein (P), gene (G), small molecule (SM), and cellular process (CP)
- Interaction types: biological interaction (BI) and chemical interaction (CI)
- The Event Extraction Tool is a client tool of the POSBIOTM/Event System
13POSBIOTM/W Workbench
- Extraction Result in XML format
<Result>
<NER> ....
<Sentence SNum="4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene>, is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> ( <protein>GPCR</protein> ) for <small_molecule>sphingosine-1-phosphate</small_molecule> ( <small_molecule>SPP</small_molecule> ) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells.
</Sentence> ..... </NER>
<Event_Extraction>
<Event SNum="4"> <Interaction>stimulate</Interaction> <Effecter>sphingosine-1-phosphate</Effecter> <Reactant>angiogenesis</Reactant> </Event> .....
</Event_Extraction>
</Result>
14POSBIOTM/W Workbench
15POSBIOTM/W Workbench
- Goal
- The GUI-based annotation tool is designed for manual editing of annotations.
- Named-entity editing
- Named entities are displayed in different colors, which can be changed.
- Add, remove or correct named-entity tags, or change the boundaries of named entities, etc.
16POSBIOTM/W Workbench
- Event editing
- Extracted events are displayed in a table.
- Double-click an event to look up the original sentence from which it was extracted.
- Upload function
- Users can upload well-annotated data to the POSBIOTM system.
- Incremental build-up of a massive named-entity and event annotation corpus
17POSBIOTM/W Workbench
19POSBIOTM/NER System
- Named Entity Recognition (NER)
- Approach
- Named entity recognition is treated as a classification problem: each input token is marked with a named-entity category label.
- CRF
- Conditional random fields (CRFs) (Lafferty et al. 2001) are a probabilistic framework for labeling and segmenting sequential data (s: state (tag), o: input).
- For example
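For reference, the standard linear-chain CRF of Lafferty et al. defines the conditional probability of a state (tag) sequence s given an input sequence o as follows. The feature functions f_k, their weights \lambda_k and the normalizer Z(o) are the usual textbook notation and are not symbols taken from the slide.

```latex
P(s \mid o) = \frac{1}{Z(o)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, o, t) \Big),
\qquad
Z(o) = \sum_{s'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, o, t) \Big)
```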
20POSBIOTM/NER System
- Named Entity Recognition (NER)
Feature | Description
Lexical word | Used only when the previous/current/next words are in the surface word dictionary.
Word feature | Orthographical features of the previous/current/next words: upper-case letters, numbers, non-alphabetic characters, Greek words (e.g. alpha cells, beta hemolysis, tau interferon).
Prefix/suffix | Prefixes/suffixes contained in the prefix/suffix dictionary, i.e. biological prefix/suffix concepts (e.g. ase, blast, cyt, phore, plast).
Part-of-speech tag | POS tag of the previous/current/next words. The part of speech describes how a particular word is used (e.g. noun, verb).
Base noun phrase tag | Base noun phrase tag of the previous/current/next words.
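A minimal sketch of how the feature types listed above could be computed for one token and its neighbors. The affix list and Greek-word set are small hypothetical stand-ins for the real dictionaries, which the slides do not give, and the function names are illustrative only.

```python
import re

# Hypothetical stand-ins for the prefix/suffix dictionary and Greek-word list.
BIO_AFFIXES = ("ase", "blast", "cyt", "phore", "plast")
GREEK = {"alpha", "beta", "gamma", "delta", "kappa", "tau"}


def orthographic(word):
    """Return orthographic flags for a single word."""
    feats = []
    if any(c.isupper() for c in word):
        feats.append("HAS_UPPER")
    if any(c.isdigit() for c in word):
        feats.append("HAS_DIGIT")
    if re.search(r"[^A-Za-z0-9]", word):
        feats.append("HAS_SYMBOL")
    if word.lower() in GREEK:
        feats.append("GREEK")
    return feats


def token_features(words, pos_tags, i):
    """Features of the previous/current/next words around position i."""
    feats = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if not 0 <= j < len(words):
            continue  # no neighbor at a sentence boundary
        w = words[j].lower()
        feats[f"{name}_word"] = w
        feats[f"{name}_pos"] = pos_tags[j]
        feats[f"{name}_ortho"] = "+".join(orthographic(words[j])) or "NONE"
        feats[f"{name}_affix"] = next(
            (a for a in BIO_AFFIXES if w.startswith(a) or w.endswith(a)), "NONE"
        )
    return feats
```

Each token's feature dictionary would then be passed to the sequence labeler together with its category label.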
21POSBIOTM/NER System
- Three NER models
- GENIA model / GENE-NER model / GPCR-NER model
- GENIA model
- Named entity classes used in the evaluation:
- DNA, RNA, protein, cell_line and cell_type
- The training data consists of 2,000 MEDLINE abstracts from the GENIA version 3 corpus. These abstracts were collected using the search terms "human", "blood cell" and "transcription factor".
- The test data comes from a super-domain of the training data (blood cell, transcription factor).
22POSBIOTM/NER System
- GENE-NER model
- The GENE-NER module uses the BioCreative corpus.
- The aim of the GENE-NER module is to identify which terms in a biomedical research article are gene and/or protein names.
- The training corpus consists of 7.5k sentences, selected from MEDLINE according to their likelihood of containing gene names.
- GPCR-NER module (Postech)
- Aims at recognizing four target named entity categories:
- protein, gene, small molecule and cellular process
- The training corpus consists of 50 full articles related to the GPCR (G-protein-coupled receptor) signal transduction pathway.
23POSBIOTM/NER System
- Evaluation for Three NER models
Corpus Precision Recall F-Measure
GENIA-NER 0.6960 0.6929 0.6945
GENE-NER 0.7550 0.8404 0.7982
GPCR-NER 0.6736 0.8135 0.7370
25POSBIOTM/NER with Active Learning
- NER with Machine Learning
- Enhance NER performance by re-using the annotated data to re-train the NER module
- NER with Active Machine Learning
- Minimize the human labeling effort without degrading performance
- Select the most informative samples for training
26POSBIOTM/NER with Active Learning
- Active Learning in NER Framework
27POSBIOTM/NER with Active Learning
- Active Learning Scoring Strategy
- Uncertainty-based Sample Selection
- An entropy-based measure quantifies the uncertainty of the current classifier (entropy or normalized entropy of the CRF conditional probability).
- The most uncertain samples are selected for human annotation.
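The uncertainty score can be sketched as below. This assumes the CRF supplies, for each sentence, a probability distribution over a handful of candidate label sequences (for example an N-best list renormalized to sum to 1). Dividing by log N is one common normalization and is an assumption here; the slides do not state which normalization was used.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum value log(N) (assumed normalization)."""
    n = len(probs)
    return entropy(probs) / math.log(n) if n > 1 else 0.0


def most_uncertain(pool, k, score=entropy):
    """pool maps a sentence id to its candidate-sequence probabilities.

    Returns the k sentence ids with the highest uncertainty score.
    """
    ranked = sorted(pool, key=lambda sid: score(pool[sid]), reverse=True)
    return ranked[:k]
```

Sentences whose probability mass is spread evenly over competing labelings score highest and are sent to the annotator first.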
28POSBIOTM/NER with Active Learning
- Active Learning Scoring Strategy
- Diversity-based Sample Selection
- Goal: capture the most representative sentences in each sampling round.
- The divergence between sentences is measured by the minimum similarity among the examples.
- The similarity score of two words
- The similarity score of two sentences (for syntactic path)
29POSBIOTM/NER with Active Learning
- Active Learning Scoring Strategy
- MMR (Maximal Marginal Relevance) method
- The uncertainty and diversity measures are combined using the MMR method to give the sampling score in our active learning strategy.
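A rough sketch of MMR-style selection follows. It uses the generic MMR form, score = lam * uncertainty - (1 - lam) * (maximum similarity to already selected sentences). The bag-of-words cosine similarity is only a stand-in for the word and sentence similarity scores of the slides, whose formulas are not reproduced here, and the weight `lam` is a hypothetical parameter.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two token lists (stand-in similarity measure)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def mmr_select(candidates, uncertainty, k, lam=0.8):
    """Greedily pick k sentence ids that are uncertain yet mutually dissimilar.

    candidates maps sentence id -> token list; uncertainty maps id -> score.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(sid):
            redundancy = max(
                (cosine(candidates[sid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * uncertainty[sid] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the procedure reduces to pure uncertainty sampling; lowering `lam` penalizes near-duplicate sentences more strongly.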
30POSBIOTM/NER with Active Learning
- Experiment and Discussion
- Training Data
- 2,000 MEDLINE abstracts from the GENIA corpus
- 5 named entity classes
- DNA, RNA, protein, cell line, cell type
- Test Data
- 404 abstracts
- Half of them are from the same domain as the
training data and the other half are from the
super-domain of blood cell and transcription
factor
31POSBIOTM/NER with Active Learning
- Experiment and Discussion
- Pool-based sample selection
- 100 abstracts were used to train the initial NER module.
- Each time, we chose k examples (sentences) from the given pool to train the new NER module.
- The number k varied from 1,000 to 17,000 with a step size of 1,000.
- Active learning methods tested
- Random selection
- Entropy based uncertainty selection
- Entropy combined with Diversity
- Normalized Entropy combined with Diversity
32POSBIOTM/NER with Active Learning
- Experiment and Discussion
33POSBIOTM/NER with Active Learning
- Experiment and Discussion
- All three active learning strategies outperform random selection.
- The combined strategy requires 24.64% fewer training examples than random selection.
- The normalized combined strategy requires 35.43% fewer training examples than random selection.
- Diversity increases the classifier's performance when a large number of samples are selected.
- Up to 4,000 sentences, the entropy strategy and the combined strategy perform similarly.
- After the 11,000-sentence point, the combined strategy surpasses the entropy strategy.
35POSBIOTM/Event System
36POSBIOTM/Event System
- Template Element
- Entities: participants of an event
- protein (P), gene (G), small molecule (SM), cellular process (CP)
- Interaction: relationship between entities
- Biological interaction (BI): functional interaction
- About how/whether one component affects the other's status biologically
- Chemical interaction (CI): molecular interaction
- About the interaction among entities at the molecular structural level
- Event
- One Interaction (I)
- Connecting the effecter and reactant
- Interaction keywords (BI, CI)
- One Effecter (E)
- Provoking an event
- Template element (P, G, SM, CP) or nested event
- One Reactant (R)
- Responding to an effecter
- Template element (P, G, SM, CP) or nested event
37POSBIOTM/Event System
The cross-talk between PDGF and SPP is required for these embryonic cell movements.
Template Element
- Entities: PDGF (P), SPP (SM), cell movement (CP)
- Interaction keywords: cross-talk (BI), require (BI)
Event
- cross-talk (I), PDGF (E), SPP (R)
- require (I), cross-talk (E), cell movement (R)
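The event template, including nesting, can be modeled with a small recursive data structure. This is an illustrative sketch only: the class and field names are hypothetical, and the two events encode the cross-talk example sentence above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Element:
    text: str
    kind: str  # one of P, G, SM, CP, BI, CI


@dataclass(frozen=True)
class Event:
    interaction: Element               # interaction keyword (BI or CI)
    effecter: Union[Element, "Event"]  # template element or nested event
    reactant: Union[Element, "Event"]  # template element or nested event


def depth(event):
    """Nesting depth: 1 for a flat event, plus 1 per level of nested event."""
    inner = [
        depth(x) for x in (event.effecter, event.reactant) if isinstance(x, Event)
    ]
    return 1 + max(inner, default=0)


# "The cross-talk between PDGF and SPP is required for these embryonic cell movements."
cross_talk = Event(Element("cross-talk", "BI"), Element("PDGF", "P"), Element("SPP", "SM"))
require = Event(Element("require", "BI"), cross_talk, Element("cell movement", "CP"))
```

Here `require` takes the whole `cross_talk` event as its effecter, which is the nested case that a flat template cannot express.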
38POSBIOTM/Event System
- Sentence boundary detection
- Annotating Named Entity (NER)
- Protein
- Small molecule
- Gene
- Cellular process
- Compound/Complex Sentence Splitter
- To simplify the complicated full texts
39POSBIOTM/Event System
- Compound/Complex Sentence Splitter
- Simple splitting rules
- S NP1 VP1 NP2 SBAR that|which VP2 /SBAR /S
- => (NP1 VP1 NP2) (NP2 VP2)
- Example
- The best studied of these is EDG-1, which is implicated in cell migration and angiogenesis.
- => 1. The best studied of these is EDG-1.
- 2. EDG-1 is implicated in cell migration and angiogenesis.
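The splitting rule can be sketched as a function over pre-chunked input. The chunk format, a list of (label, text) pairs from some hypothetical chunker, is an assumption made for illustration; the slides do not describe the actual input representation.

```python
def split_relative_clause(chunks):
    """Apply NP1 VP1 NP2 SBAR(that|which VP2) => (NP1 VP1 NP2) (NP2 VP2).

    chunks is a list of (label, text) pairs. Returns the two simple
    sentences, or None when the rule does not apply.
    """
    labels = [label for label, _ in chunks]
    if labels != ["NP", "VP", "NP", "SBAR"]:
        return None
    np1, vp1, np2, sbar = (text for _, text in chunks)
    marker, _, vp2 = sbar.partition(" ")
    if marker not in ("that", "which"):
        return None
    # NP2 becomes the subject of the second sentence.
    return [f"{np1} {vp1} {np2}.", f"{np2} {vp2}."]


chunks = [
    ("NP", "The best studied of these"),
    ("VP", "is"),
    ("NP", "EDG-1"),
    ("SBAR", "which is implicated in cell migration and angiogenesis"),
]
sentences = split_relative_clause(chunks)
```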
40POSBIOTM/Event System
- Biological Event Extraction
- Two-level Event Rule Learner
41POSBIOTM/Event System
- Biological Event Extraction
- Event Rule Learner
- Adapts WHISK, a supervised machine learning algorithm
- Learns rules in the form of context-based regular expressions
- Induces the rules in a top-down manner
- Ex) NP .? (<CP>)E /NP VP (<BI>)I /VP NP both (<P>)R and .? /NP
- Limitations of WHISK
- The longer the distance between event components, the more difficult it is to extract the correct event.
- WHISK considers all lexical words between event components.
- Cannot handle nested biological events
- We propose a two-level rule learning method to address the limitations of the flat rule learning method.
42POSBIOTM/Event System
- Biological Event Extraction
- Two-level Event Rule Learner
NP <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> /NP VP is <BI>required</BI> /VP for NP these embryonic <CP>cell_movements</CP> /NP
<TAGS> B interaction cross-talk effecter PDGF reactant SPP
<TAGS> B interaction require effecter cross-talk reactant cell movement
1. Mark the long NP boundary.
2. Learn the short-span rule corresponding to the NP:
<BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> => NP (<BI>)I between (<P>)E and (<SM>)R /NP
3. Re-annotate the short-span interaction as a single noun in regular-expression format:
NP <E>cross-talk_between_PDGF_and_SPP</E> /NP VP is <BI>required</BI> /VP for NP these embryonic <CP>cell_movements</CP> /NP
<TAGS> B interaction require effecter cross-talk reactant cell movement
4. Learn the long-span rule from the re-annotated sentence.
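The effect of the two levels can be illustrated with two hand-written regular expressions applied in sequence. These patterns are written by hand for this one sentence and are not rules learned by the system; the NP/VP chunk markers are omitted for brevity.

```python
import re

SENT = (
    "<BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> "
    "is <BI>required</BI> for these embryonic <CP>cell_movements</CP>"
)

# Short-span rule: (<BI>)I between (<P>)E and (<SM>)R
SHORT = re.compile(
    r"<BI>(?P<I>[^<]+)</BI> between <P>(?P<E>[^<]+)</P> and <SM>(?P<R>[^<]+)</SM>"
)
# Long-span rule over the re-annotated sentence; <E> marks a collapsed event.
LONG = re.compile(
    r"<E>(?P<E>[^<]+)</E> is <BI>(?P<I>[^<]+)</BI> for .*?<CP>(?P<R>[^<]+)</CP>"
)


def extract_two_level(sentence):
    """Extract the nested event first, collapse it, then extract the outer event."""
    events = []
    m = SHORT.search(sentence)
    if m:
        events.append(m.groupdict())
        collapsed = "_".join([m["I"], "between", m["E"], "and", m["R"]])
        sentence = sentence[: m.start()] + f"<E>{collapsed}</E>" + sentence[m.end():]
    m = LONG.search(sentence)
    if m:
        events.append(m.groupdict())
    return sentence, events


rewritten, events = extract_two_level(SENT)
```

After the first pass the inner interaction is a single token, so the outer rule only has to span a short distance, which is the motivation given for the two-level design.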
43POSBIOTM/Event System
- Biological Event Extraction
- Event Extractor
- Extracts events with the automatically generated rules
- Uses regular-expression pattern matching
- Handles aliases and noun conjunctions
- Aliases and noun conjunctions follow general patterns such as "sphingosine-1-phosphate(SPP)" or "FP, IP, and TP receptors".
- They are handled with simple rules such as "A(B)" or "A, B, C, and D".
- Removes sentences that include negative words
- not, never, fail, etc.
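A minimal sketch of the alias, conjunction and negation handling described above, using Python's `re` module. The exact patterns are assumptions: the slides name only the general shapes "A(B)" and "A, B, C, and D" and a partial list of negative words.

```python
import re

# Negative cue words named on the slide; the full list is not given.
NEGATIVE = re.compile(r"\b(not|never|fail(s|ed)?)\b", re.IGNORECASE)
# A(B): a name immediately followed by its alias in parentheses.
ALIAS = re.compile(r"(?P<name>[\w-]+)\s*\((?P<alias>[\w-]+)\)")
# A, B, and C head: a comma-separated conjunction sharing one head noun.
CONJ = re.compile(r"((?:[\w-]+, )+)and ([\w-]+) (?P<head>\w+)")


def has_negation(sentence):
    """True if the sentence contains one of the negative cue words."""
    return NEGATIVE.search(sentence) is not None


def find_aliases(sentence):
    """Map each alias to the full name it abbreviates."""
    return {m["alias"]: m["name"] for m in ALIAS.finditer(sentence)}


def expand_conjunction(sentence):
    """Expand 'FP, IP, and TP receptors' into one phrase per conjunct."""
    m = CONJ.search(sentence)
    if not m:
        return []
    items = [x for x in m.group(1).split(", ") if x] + [m.group(2)]
    return [f"{item} {m['head']}" for item in items]
```

Sentences for which `has_negation` returns True would be dropped before rule matching; aliases and expanded conjuncts would be substituted back so that each entity can take part in an event on its own.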
44POSBIOTM/Event System
45POSBIOTM/Event System
- Goal: remove incorrectly extracted events
- Classify template elements (P, G, SM, CP, BI, CI) into 4 classes
- I (interaction), E (effecter), R (reactant), N (none)
- I, E, R: event components
- N: a template element, but not an event component
- Uses a Maximum Entropy classifier
- Features
- POS tags, phrase chunks, the template-element types of neighboring words, and semantic information
46POSBIOTM/Event System
47POSBIOTM/Event System
Extracted Biological Events
- Ev1: Requires (I), sphingosine_kinase (E), cell_migration (R)
- Ev2: Requires (I), EDG-1 (E), cell_migration (R)
- Ev3: Requires (I), EDG-1 (E), PDGF (R)
Event Component Verifier Results
- I: Requires
- E: EDG-1, sphingosine_kinase, PDGF
- R: cell_migration
Verified Biological Extracted Events
- Ev1: Requires (I), sphingosine_kinase (E), cell_migration (R)
- Ev2: Requires (I), EDG-1 (E), cell_migration (R)
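The filtering step in this example can be sketched as follows. The role sets are hard-coded from the example above rather than produced by a Maximum Entropy classifier, and the rule "keep an event only if every component was classified into the role it plays" is an assumption that is consistent with the example, not a documented specification.

```python
def verify_events(events, roles):
    """Keep events whose components all carry the role assigned by the verifier.

    events: list of (interaction, effecter, reactant) tuples.
    roles:  {"I": set, "E": set, "R": set} of verified component strings.
    """
    return [
        (i, e, r)
        for i, e, r in events
        if i in roles["I"] and e in roles["E"] and r in roles["R"]
    ]


extracted = [
    ("Requires", "sphingosine_kinase", "cell_migration"),  # Ev1
    ("Requires", "EDG-1", "cell_migration"),               # Ev2
    ("Requires", "EDG-1", "PDGF"),                         # Ev3
]
roles = {
    "I": {"Requires"},
    "E": {"EDG-1", "sphingosine_kinase", "PDGF"},
    "R": {"cell_migration"},
}
verified = verify_events(extracted, roles)
```

Ev3 is dropped because PDGF was classified as an effecter, not a reactant, so only Ev1 and Ev2 remain.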
48POSBIOTM/Event System
- Experiment and Discussion
- 500 MEDLINE abstracts including 2,314 biological events; 10-fold cross-validation
- Flat rule learner vs. two-level rule learner
- Before verification vs. after verification
- Performance comparison
- "Learning Information Extractors for Proteins and their Interactions" (2004), Razvan Bunescu et al.
- 1,000 abstracts; 10-fold cross-validation
Metric | Flat rule learner, before verification | Flat rule learner, after verification | Two-level rule learner, before verification | Two-level rule learner, after verification | Comparison system
Precision (%) | 38.3 | 54.7 | 38.2 | 53.1 | 39
Recall (%) | 58.0 | 49.2 | 68.0 | 56.1 | 63
F-measure | 46.1 | 51.8 | 48.9 | 54.6 | 48.2
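The F-measure row is the balanced harmonic mean of the precision and recall rows, F = 2PR / (P + R). A short check, with dictionary keys that are illustrative labels for the five columns:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs taken from the results table.
results = {
    "flat, before verification": (38.3, 58.0),
    "flat, after verification": (54.7, 49.2),
    "two-level, before verification": (38.2, 68.0),
    "two-level, after verification": (53.1, 56.1),
    "comparison system": (39.0, 63.0),
}

for name, (p, r) in results.items():
    print(f"{name}: F = {f_measure(p, r):.1f}")
```

Rounded to one decimal, the recomputed values are 46.1, 51.8, 48.9, 54.6 and 48.2, matching the reported row.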
49POSBIOTM/Event System
- Experiment and Discussion
- Trade-off between precision and recall
- Before verification: large gap between precision and recall
- After verification: small gap between precision and recall
- Threshold: rules are cut according to a measure of how many of the events extracted by a rule are correct
50POSBIOTM/Event System
- Experiment and Discussion
- Consistently good performance regardless of the rule learner's threshold
51Other Corpora for Bio-Relation Extraction
- BC-PPI
- From BioCreative Corpus for NER
- Protein/Gene interactions
- 255 interactions in 1000 sentences
- IEPA
- Protein/Protein interactions
- 410 interactions in 498 sentences
- LLL05
- Protein/Gene interactions
- 271 interactions in 80 sentences
- BioText
- Disease/Treatment relations
53Current Status & future works
- Re-implemented in Java (platform independent)
- Will be integrated with J-Designer in the SBW consortium
- Integrated with the active learning method to automatically suggest text for human annotation
- Used in national large-scale BIT fusion projects: search for useful peptides (usable as ligands for drugs)
- Getting more feedback from biologists
- The system gets smarter with more usage: workbench + active learning