A Bio Text Mining Workbench combined with Active Machine Learning - PowerPoint PPT Presentation

About This Presentation

Description:

Title: A Development Workbench for Machine-Learning Oriented BioMedical Text Mining System Author: songyu Last modified by: gblee Created Date: 2/4/2005 1:08:06 AM – PowerPoint PPT presentation

Slides: 54

Provided by: song151

Learn more at: http://lbm2005.biopathway.org

Transcript and Presenter's Notes

1
A Bio Text Mining Workbench combined with Active
Machine Learning

Gary Geunbae Lee
Postech
11/25 LBM2005

2
Contents

Introduction
POSBIOTM/W Workbench
POSBIOTM/NER System
POSBIOTM/NER with Active Machine Learning
POSBIOTM/Event System
Current status (demo)

3
Introduction

Exponentially growing biological publications

4
Introduction

Two key issues in dealing with biological texts:

Biological named entity recognition.
Extraction of biological interactions (events)
between biological entities.
Important for biological pathways.

Biological Papers
5
Introduction

Bio-text mining workbench

Development workbench (common in NLP)
Grammar development workbench
POS/Tree Tagging workbench
Uses a large amount of corpus data
Machine learning methods are used in the NER task
and the event extraction task.
An annotated corpus is essential for good results
with machine-learning-based methods (in both
quantity and quality)
Lack of annotated corpora (notorious in the
biomedical field)
Needs
Tools that support collecting, managing,
creating, annotating and exploiting rich
biomedical text resources.
Tools that interact with the automatic system
to grow the high-quality annotated corpus

6
Contents

Introduction
POSBIOTM/W Workbench
POSBIOTM/NER System
POSBIOTM/NER with Active Machine Learning
POSBIOTM/Event System
Current status

7
POSBIOTM/W A development Workbench

Overall Design

8
POSBIOTM/W Workbench

Managing Tool

Goal
Help users search, collect and manage
publications.
Quick Search Bar
Provides quick access to PubMed.
PubMed Search Assistant
Users can select specific abstracts for
named-entity tagging and event extraction.
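
As a rough illustration of what "quick access to PubMed" can look like programmatically, the sketch below builds a query URL for NCBI's public E-utilities ESearch service. The slides do not say how the workbench talks to PubMed, so the use of E-utilities and the helper name pubmed_search_url are assumptions made only for this example.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(query, max_results=20):
    """Build an ESearch URL that asks for PubMed IDs matching `query`."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": query,          # free-text query
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return ESEARCH + "?" + urlencode(params)


url = pubmed_search_url("G-protein coupled receptor signal transduction", max_results=5)
```

The function only builds the URL; fetching it and choosing which abstracts to tag is left to the caller.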

9
POSBIOTM/W Workbench

Managing Tool

Pubmed search Assistant

10
POSBIOTM/W Workbench

NER Tool

Named-entity recognition (NER) task
Identification of the names of the materials
(entities) of interest.
Goal: automatically and effectively annotate
biomedical entities.
The NER Tool is a client tool of the POSBIOTM/NER System.
Three NER models are currently provided:
the GENIA-NER model, the GENE-NER model and the
GPCR-NER model.
Named-entity recognition with active learning
Minimizes the human labeling effort.

11
POSBIOTM/W Workbench

NER Tool

Named-entity recognition with Active learning

12
POSBIOTM/W Workbench

Event Extraction Tool

Goal: to extract events, each consisting of an
interaction, an effecter, and a reactant.
Named-entity types: protein (P), gene (G), small
molecule (SM), and cellular process (CP).
Interaction types: biological interaction (BI) and
chemical interaction (CI).
The Event Extraction Tool is a client tool of the
POSBIOTM/Event System.

13
POSBIOTM/W Workbench

Event Extraction Tool

Extraction Result in XML format

<Result>
  <NER>
    ....
    <Sentence SNum="4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene>, is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> ( <protein>GPCR</protein> ) for <small_molecule>sphingosine-1-phosphate</small_molecule> ( <small_molecule>SPP</small_molecule> ) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells.</Sentence>
    .....
  </NER>
  <Event_Extraction>
    <Event SNum="4">
      <Interaction>stimulate</Interaction>
      <Effecter>sphingosine-1-phosphate</Effecter>
      <Reactant>angiogenesis</Reactant>
    </Event>
    .....
  </Event_Extraction>
</Result>
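
A result in this format can be read back with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened sample built from the element names shown on the slide. The sample string and the helper names read_events and read_entities are made up for illustration and are not part of the POSBIOTM tools.

```python
import xml.etree.ElementTree as ET

# Shortened sample using the element names from the slide.
SAMPLE = """<Result>
  <NER>
    <Sentence SNum="4"><small_molecule>sphingosine-1-phosphate</small_molecule> has been shown to stimulate <cellular_process>angiogenesis</cellular_process>.</Sentence>
  </NER>
  <Event_Extraction>
    <Event SNum="4">
      <Interaction>stimulate</Interaction>
      <Effecter>sphingosine-1-phosphate</Effecter>
      <Reactant>angiogenesis</Reactant>
    </Event>
  </Event_Extraction>
</Result>"""


def read_events(xml_text):
    """Return one dict per <Event> with its sentence number and three slots."""
    root = ET.fromstring(xml_text)
    return [
        {
            "sentence": int(ev.get("SNum")),
            "interaction": ev.findtext("Interaction"),
            "effecter": ev.findtext("Effecter"),
            "reactant": ev.findtext("Reactant"),
        }
        for ev in root.iter("Event")
    ]


def read_entities(xml_text):
    """Return (entity_type, text) pairs for every tagged entity in <NER>."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text) for sent in root.iter("Sentence") for el in sent]


events = read_events(SAMPLE)
entities = read_entities(SAMPLE)
```
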
14
POSBIOTM/W Workbench

Event Extraction Tool

Extraction Result

15
POSBIOTM/W Workbench

Annotation Tool

Goal
The GUI-based Annotation Tool is designed for
editing annotations manually.
Named-entity editing
Named entities are displayed in different colors,
which can be changed.
Users can add, remove or correct named-entity tags,
or change the boundaries of named entities, etc.

16
POSBIOTM/W Workbench

Annotation Tool

Event editing
Extracted events are displayed in a table.
Double-clicking an event looks up the original
sentence from which it was extracted.
Upload function
Users can upload well-annotated data to the
POSBIOTM system.
This allows the incremental build-up of a large
named-entity and event annotation corpus.

17
POSBIOTM/W Workbench

Annotation Tool

18
Contents

Introduction
POSBIOTM/W Workbench
POSBIOTM/NER System
POSBIOTM/NER with Active Machine Learning
POSBIOTM/Event System
Current status

19
POSBIOTM/NER System

Named Entity Recognition (NER)

Approach
Named entity recognition is treated as a
classification problem: each input token is marked
up with a named entity category label.
CRF
Conditional random fields (CRFs) (Lafferty
et al., 2001) are a probabilistic framework for
labeling and segmenting sequential data
(s: state (tag), o: input).
For example
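
For reference, the standard linear-chain CRF of Lafferty et al. can be written in the slide's notation, with s the state (tag) sequence and o the input sequence. The feature functions f_k, weights \lambda_k, sequence length T and normalizer Z(o) are the usual textbook symbols and are not taken from the slide.

```latex
P(s \mid o) = \frac{1}{Z(o)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, o, t) \right),
\qquad
Z(o) = \sum_{s'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, o, t) \right)
```

Tagging then amounts to finding the sequence s that maximizes P(s | o).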

20
POSBIOTM/NER System

Named Entity Recognition (NER)

Feature Set

Feature | Description
Lexical word | Used only when the previous/current/next words are in the surface word dictionary.
Word feature | Orthographical features of the previous/current/next words: upper case letters, numbers, non-alphabet letters. Greek words: alpha cells, beta hemolysis, tau interferon.
Prefix/suffix | Prefixes/suffixes contained in the prefix/suffix dictionary. Biological prefix and suffix concepts: ase, blast, cyt, phore, plast.
Part-of-speech tag | POS tag of the previous/current/next words. The part of speech describes how a particular word is used, e.g. noun, verb, etc.
Base noun phrase tag | Base noun phrase tag of the previous/current/next words.
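
A minimal sketch of how the orthographic and prefix/suffix features might be computed over a previous/current/next window. The Greek-word and affix lists contain only the examples named on the slide, the feature names are invented for illustration, and the dictionary, POS and base-NP features are left out because they need external resources.

```python
GREEK = {"alpha", "beta", "tau"}                     # examples from the slide
AFFIXES = ("ase", "blast", "cyt", "phore", "plast")  # examples from the slide


def word_features(word):
    """Orthographic and affix features for a single word."""
    lower = word.lower()
    feats = {
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_non_alpha": any(not c.isalnum() for c in word),
        "is_greek": lower in GREEK,
    }
    for affix in AFFIXES:
        if lower.startswith(affix):
            feats["prefix=" + affix] = True
        if lower.endswith(affix):
            feats["suffix=" + affix] = True
    return feats


def token_features(tokens, i):
    """Features of the previous, current and next word around position i."""
    feats = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            for key, value in word_features(tokens[j]).items():
                feats[name + ":" + key] = value
    return feats


features = token_features(["sphingosine", "kinase", "EDG-1"], 1)
```

A feature dictionary of this kind is the per-token input a sequence labeler such as a CRF would consume.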
21
POSBIOTM/NER System

NER Models

Three NER models
GENIA model / GENE-NER model / GPCR-NER model
GENIA model
Named entity classes used in the evaluation:
DNA, RNA, protein, cell_line and cell_type
The training data consists of 2,000 MEDLINE
abstracts of the GENIA version 3 corpus. These
abstracts were collected using the search terms
"human", "blood cell" and "transcription factor".
The testing data come from a super-domain of
the training data (blood cell, transcription
factor).

22
POSBIOTM/NER System

NER Models

GENE-NER model
The GENE-NER module uses the BioCreative corpus.
The aim of the GENE-NER module is to identify
which terms in biomedical research articles are
gene and/or protein names.
The training corpus consists of 7.5k sentences,
selected from MEDLINE according to their
likelihood of containing gene names.
GPCR-NER module (Postech)
Aims at recognizing four target named entity
categories:
protein, gene, small molecule and cellular
process.
The training corpus consists of 50 full articles
related to the GPCR (G-protein coupled receptor)
signal transduction pathway.

23
POSBIOTM/NER System

NER Models

Evaluation for Three NER models

Corpus Precision Recall F-Measure
GENIA-NER 0.6960 0.6929 0.6945
GENE-NER 0.7550 0.8404 0.7982
GPCR-NER 0.6736 0.8135 0.7370
24
Contents

Introduction
POSBIOTM/W Workbench
POSBIOTM/NER System
POSBIOTM/NER with Active Machine Learning
POSBIOTM/Event System
Current status

25
POSBIOTM/NER with Active Learning

Active Learning in NER

NER with Machine Learning
Enhances NER performance by re-using the
annotated data to re-train
the NER module
NER with Active Machine Learning
Minimizes the human labeling effort without
degrading performance
Selects the most informative samples for
training

26
POSBIOTM/NER with Active Learning

Active Learning in NER Framework

27
POSBIOTM/NER with Active Learning

Active Learning Scoring Strategy

Uncertainty-based Sample Selection
Using an entropy-based measure to quantify the
uncertainty that the current classifier holds
(entropy or normalized entropy of the CRF
conditional probability)
The most uncertain samples are selected for human
annotation
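
A minimal sketch of entropy-based scoring, assuming each candidate sentence comes with the probabilities of its N-best labelings under the current model, renormalized to sum to 1. The slide does not say how its normalized entropy is normalized; dividing by log N is one common choice and is an assumption here, as are the function names.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log N, so the score lies in [0, 1]."""
    n = len(probs)
    return entropy(probs) / math.log(n) if n > 1 else 0.0


def most_uncertain(candidates, k):
    """candidates: {sentence_id: [probabilities of its N-best labelings]}.

    Returns the k sentence ids the model is least sure about.
    """
    ranked = sorted(
        candidates,
        key=lambda sid: normalized_entropy(candidates[sid]),
        reverse=True,
    )
    return ranked[:k]


picked = most_uncertain({"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [1.0]}, 2)
```
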

28
POSBIOTM/NER with Active Learning

Active Learning Scoring Strategy

Diversity-based Sample Selection
To catch the most representative sentences in
each sampling.
The divergence measures of the two sentences are
represented by the minimum similarity among the
examples
The similarity score of two words
The similarity score of two sentences

(for syntactic path)
29
POSBIOTM/NER with Active Learning

Active Learning Scoring Strategy

MMR(Maximal Marginal Relevance) method
The two measures for uncertainty and diversity
will be combined using the MMR method to give the
sampling scores in our active learning strategy
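
The combination can be sketched as a greedy MMR loop: each step picks the sentence with the best trade-off between its uncertainty score and its similarity to sentences already selected. The slides' own similarity formulas are not recoverable from this transcript, so word-overlap (Jaccard) similarity is used as a stand-in, and the weight lam and all function names are assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (stand-in measure)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mmr_select(uncertainty, k, lam=0.5, sim=jaccard):
    """Greedy MMR selection.

    uncertainty: {sentence: uncertainty score}
    lam: weight on uncertainty; (1 - lam) penalizes redundancy.
    """
    selected = []
    remaining = dict(uncertainty)
    while remaining and len(selected) < k:
        def score(sentence):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((sim(sentence, t) for t in selected), default=0.0)
            return lam * remaining[sentence] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


pool = {
    "EDG-1 binds SPP": 0.9,
    "EDG-1 binds SPP strongly": 0.85,
    "PDGF activates cell migration": 0.6,
}
chosen = mmr_select(pool, 2)
```

With lam=0.5 the near-duplicate second sentence is skipped in favour of the less uncertain but more diverse third one. With lam=1.0 the selection falls back to pure uncertainty ranking.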

30
POSBIOTM/NER with Active Learning

Experiment and Discussion

Training Data
2,000 MEDLINE abstracts from the GENIA corpus
5 named entity classes
DNA, RNA, protein, cell line, cell type
Test Data
404 abstracts
Half of them are from the same domain as the
training data and the other half are from the
super-domain of blood cell and transcription
factor

31
POSBIOTM/NER with Active Learning

Experiment and Discussion

Pool-based sample selection
100 abstracts were used to train the initial NER
module
Each time, we chose k examples (sentences) from
the given pool to train a new NER module
The number k varied from 1,000 to 17,000 in
steps of 1,000
Selection methods compared
Random selection
Entropy-based uncertainty selection
Entropy combined with diversity
Normalized entropy combined with diversity
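
The experimental protocol can be sketched as a loop over k: select k sentences from the pool with a given strategy, train on the initial data plus the selection, and record the test score. The train, evaluate and select callables are placeholders that the caller supplies, and the toy stand-ins at the bottom exist only to make the sketch runnable. None of them are the authors' code.

```python
import random


def random_selection(pool, k, seed=0):
    """Baseline strategy: pick k pool items uniformly at random."""
    return random.Random(seed).sample(pool, min(k, len(pool)))


def learning_curve(initial, pool, select, train, evaluate, ks):
    """Return [(k, score)] for each selection size k.

    select(pool, k) -> list of chosen examples
    train(examples) -> model
    evaluate(model) -> score on held-out test data
    """
    curve = []
    for k in ks:
        chosen = select(pool, k)
        model = train(initial + chosen)
        curve.append((k, evaluate(model)))
    return curve


# Toy stand-ins: the "model" is a vocabulary, the "score" is its size.
toy_curve = learning_curve(
    initial=["a b"],
    pool=["c", "d e", "a"],
    select=lambda pool, k: sorted(pool, key=len, reverse=True)[:k],
    train=lambda examples: {w for s in examples for w in s.split()},
    evaluate=len,
    ks=[1, 2, 3],
)
```

For the setting on the slide, ks would be range(1000, 17001, 1000), and the curve would be computed once per selection strategy.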

32
POSBIOTM/NER with Active Learning

Experiment and Discussion

33
POSBIOTM/NER with Active Learning

Experiment and Discussion

All three active learning strategies
outperform random selection
The combined strategy reduces the training
examples by 24.64% compared with random selection
The normalized combined strategy reduces the
training examples by 35.43% compared with random
selection
Diversity improves the classifier's performance
when a large number of samples is selected
Up to 4,000 sentences, the entropy strategy and
the combined strategy perform similarly
After the 11,000-sentence point, the combined
strategy surpasses the entropy strategy

34
Contents

Introduction
POSBIOTM/W Workbench
POSBIOTM/NER System
POSBIOTM/NER with Active Machine Learning
POSBIOTM/Event System
Current status

35
POSBIOTM/Event System

System Architecture

36
POSBIOTM/Event System

Target Slot Definition

Template Element
Entities - participants of an event
protein (P), gene (G), small molecule (SM),
cellular process (CP)
Interaction - relationship between entities
biological interaction (BI): functional
interaction
About how/whether one component affects the
other's status biologically
chemical interaction (CI): molecular interaction
About the interaction among entities at the
molecular structural level
Event
One Interaction (I)
Connecting the effecter and reactant
Interaction keywords (BI, CI)
One Effecter (E)
Provoking an event
Template element (P, G, SM, CP) or nested event
One Reactant (R)
Responding to an effecter
Template element (P, G, SM, CP) or nested event

37
POSBIOTM/Event System

Target Slot Definition

Example

The cross-talk between PDGF and SPP is required for these embryonic cell movements.
Template Element
Entities: PDGF (P), SPP (SM), cell movement (CP)
Interaction keywords: cross-talk (BI), require (BI)
Event
cross-talk (I), PDGF (E), SPP (R)
require (I), cross-talk (E), cell movement (R)
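
The slot definition, including nested events, maps naturally onto a small recursive data structure. The sketch below encodes the example with Python dataclasses. The class and field names are chosen for illustration and are not taken from the POSBIOTM implementation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # "P", "G", "SM" or "CP"


@dataclass(frozen=True)
class Event:
    interaction: str                  # interaction keyword
    interaction_kind: str             # "BI" or "CI"
    effecter: Union[Entity, "Event"]  # template element or nested event
    reactant: Union[Entity, "Event"]  # template element or nested event

    def depth(self):
        """1 for a flat event, +1 for each level of nesting."""
        nested = [a for a in (self.effecter, self.reactant) if isinstance(a, Event)]
        return 1 + max((a.depth() for a in nested), default=0)


# The two events from the example sentence.
cross_talk = Event("cross-talk", "BI", Entity("PDGF", "P"), Entity("SPP", "SM"))
require = Event("require", "BI", cross_talk, Entity("cell movement", "CP"))
```

Here the "require" event takes the whole "cross-talk" event as its effecter, which is the nesting that a flat rule cannot express.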
38
POSBIOTM/Event System

Pre-Processor

Sentence boundary detection
Annotating Named Entity (NER)
Protein
Small molecule
Gene
Cellular process
Compound/Complex Sentence Splitter
To simplify the complicated full texts

39
POSBIOTM/Event System

Pre-Processor

Compound/Complex Sentence Splitter
Simple splitting rules
S NP1 VP1 NP2 SBAR that|which VP2 /SBAR /S
-> (1) NP1 VP1 NP2 (2) NP2 VP2
Example
The best studied of these is EDG-1, which is
implicated in cell migration and angiogenesis.
-> 1. The best studied of these is EDG-1.
2. EDG-1 is implicated in cell migration and
angiogenesis.
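
A minimal sketch of this splitting rule, assuming the sentence has already been chunked into labeled phrases by a shallow parser. The chunk label "REL" for the relative pronoun and the function name are invented for this example. A real splitter would work on parser output and cover more patterns.

```python
def split_relative_clause(chunks):
    """Split 'NP1 VP1 NP2, that|which VP2' into two simple sentences.

    chunks: list of (label, text) pairs, e.g. from a shallow parser.
    Returns a list with two sentences if the rule applies, otherwise a
    one-item list holding the rejoined input.
    """
    labels = [label for label, _ in chunks]
    texts = [text for _, text in chunks]
    if labels == ["NP", "VP", "NP", "REL", "VP"] and texts[3] in ("that", "which"):
        np1, vp1, np2, _, vp2 = texts
        # NP2 is repeated as the subject of the second sentence.
        return [f"{np1} {vp1} {np2}.", f"{np2} {vp2}."]
    return [" ".join(texts) + "."]


example = [
    ("NP", "The best studied of these"),
    ("VP", "is"),
    ("NP", "EDG-1"),
    ("REL", "which"),
    ("VP", "is implicated in cell migration and angiogenesis"),
]
sentences = split_relative_clause(example)
```
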

40
POSBIOTM/Event System

Biological Event Extraction

Two-level Event Rule Learner

41
POSBIOTM/Event System

Biological Event Extraction

Event Rule Learner
Adapts WHISK, a supervised machine learning
algorithm
Learns rules in the form of context-based regular
expressions
Induces the rules in a top-down manner
Ex) NP .? (<CP>)E /NP VP (<BI>)I
/VP NP both (<P>)R and .? /NP
Limitations of WHISK
The longer the distance between event components,
the more difficult it is to extract the correct event
WHISK considers all lexical words between event
components
Cannot handle nested biological events
We propose a two-level rule learning method to
address the limitations of flat rule learning

42
POSBIOTM/Event System

Biological Event Extraction

Two-level Event Rule Learner

NP <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> /NP VP is <BI>required</BI> /VP for NP these embryonic <CP>cell_movements</CP> /NP
<TAGS> B interaction cross-talk effecter PDGF reactant SPP
<TAGS> B interaction require effecter cross-talk reactant cell movement
1. Mark the long NP boundary
2. Learn the short-span rule corresponding to the NP
<BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM>
-> NP (<BI>)I between (<P>)E and (<SM>)R /NP
3. Re-annotate the short-span interaction as one noun with regular expression format
NP <E>cross-talk_between_PDGF_and_SPP</E> /NP VP is <BI>required</BI> /VP for NP these embryonic <CP>cell_movements</CP> /NP
<TAGS> B interaction require effecter cross-talk reactant cell movement
4. Learn the long-span rule with the re-annotated sentence
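
The application of the two rule levels to this sentence can be sketched with ordinary regular expressions. The two patterns below are hand-written stand-ins for what the rule learner would induce, and the function name is invented. Only the extraction side is shown, not the learning.

```python
import re

SENTENCE = (
    "NP <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> /NP "
    "VP is <BI>required</BI> /VP "
    "for NP these embryonic <CP>cell_movements</CP> /NP"
)

# Short-span rule: an interaction nested inside one noun phrase.
SHORT_RULE = re.compile(
    r"NP <BI>(?P<I>[^<]+)</BI> between <P>(?P<E>[^<]+)</P> "
    r"and <SM>(?P<R>[^<]+)</SM> /NP"
)
# Long-span rule: applied after the nested event is collapsed to <E>...</E>.
LONG_RULE = re.compile(
    r"NP <E>(?P<E>[^<]+)</E> /NP VP .*?<BI>(?P<I>[^<]+)</BI> /VP "
    r".*?<CP>(?P<R>[^<]+)</CP>"
)


def two_level_extract(sentence):
    """Return ([(interaction, effecter, reactant)], re-annotated sentence)."""
    events = []
    m = SHORT_RULE.search(sentence)
    if m:
        events.append((m["I"], m["E"], m["R"]))
        # Re-annotate the nested event as a single noun.
        merged = "_".join([m["I"], "between", m["E"], "and", m["R"]])
        sentence = sentence[: m.start()] + f"NP <E>{merged}</E> /NP" + sentence[m.end():]
    m = LONG_RULE.search(sentence)
    if m:
        events.append((m["I"], m["E"], m["R"]))
    return events, sentence


events, rewritten = two_level_extract(SENTENCE)
```

The sketch returns surface forms ("required", "cell_movements") and the merged noun as the effecter. Mapping them to the normalized slot values shown on the slide ("require", "cross-talk", "cell movement") would need an extra lemmatization step that is not shown.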
43
POSBIOTM/Event System

Biological Event Extraction

Event Extractor
Extracts events with the automatically
generated rules
by using regular expression pattern matching
Handles aliases and noun conjunctions
Aliases and noun conjunctions follow general
patterns such as "sphingosine-1-phosphate(SPP)" or
"FP, IP, and TP receptors"
They are handled with simple rules such as "A(B)"
or "A, B, C, and D"
Removes sentences that include negative words
not, never, fail, etc.
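
These simple rules can be approximated with a few regular expressions. The patterns and function names below are illustrative guesses at what "A(B)" and "A, B, C, and D" rules might look like. The negation list contains only the three words named on the slide.

```python
import re

ALIAS = re.compile(r"^(?P<name>.+?)\s*\((?P<alias>[^()]+)\)$")  # "A(B)"
CONJUNCTION = re.compile(r",\s*(?:and\s+)?|\s+and\s+")          # "A, B, and C"
NEGATION = re.compile(r"\b(?:not|never|fail\w*)\b", re.IGNORECASE)


def split_alias(mention):
    """'A(B)' -> ('A', 'B'); a mention without an alias -> (mention, None)."""
    m = ALIAS.match(mention)
    return (m["name"], m["alias"]) if m else (mention, None)


def split_conjunction(mention):
    """'A, B, C, and D' -> ['A', 'B', 'C', 'D']."""
    return [part for part in CONJUNCTION.split(mention) if part]


def is_negated(sentence):
    """True if the sentence contains one of the listed negative words."""
    return NEGATION.search(sentence) is not None


alias = split_alias("sphingosine-1-phosphate(SPP)")
parts = split_conjunction("FP, IP, and TP receptors")
```

Note that the conjunction split leaves the shared head noun on the last item only ("TP receptors"). Distributing it over "FP" and "IP" would need an extra step.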

44
POSBIOTM/Event System

Event Component Verifier

45
POSBIOTM/Event System

Event Component Verifier

Removes incorrectly extracted events
Classifies template elements (P, G, SM, CP, BI, CI)
into 4 classes:
I (interaction), E (effecter), R (reactant), N
(none)
I, E, R: event components
N: a template element, but not an event
component
Uses a maximum entropy classifier
Features:
POS tag, phrase chunks, the template element type
of neighboring words, and semantic
information

46
POSBIOTM/Event System

Event Component Verifier

47
POSBIOTM/Event System

Event Component Verifier

Example

Extracted Biological Events
Ev1: Requires (I), sphingosine_kinase (E), cell_migration (R)
Ev2: Requires (I), EDG-1 (E), cell_migration (R)
Ev3: Requires (I), EDG-1 (E), PDGF (R)
Event Component Verifier Results
I: Requires
E: EDG-1, sphingosine_kinase, PDGF
R: cell_migration
Verified Biological Extracted Events
Ev1: Requires (I), sphingosine_kinase (E), cell_migration (R)
Ev2: Requires (I), EDG-1 (E), cell_migration (R)
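
The filtering step in this example can be sketched as a simple consistency check: an extracted event is kept only if each of its components was assigned the matching role by the verifier. The sketch keys roles by component string, which is a simplification made here for brevity. The classifier itself is not shown.

```python
def verify_events(events, roles):
    """Keep events whose components agree with the verifier's role labels.

    events: list of (interaction, effecter, reactant) tuples
    roles:  {"I": set, "E": set, "R": set} from the component classifier
    """
    return [
        (i, e, r)
        for (i, e, r) in events
        if i in roles["I"] and e in roles["E"] and r in roles["R"]
    ]


extracted = [
    ("Requires", "sphingosine_kinase", "cell_migration"),  # Ev1
    ("Requires", "EDG-1", "cell_migration"),               # Ev2
    ("Requires", "EDG-1", "PDGF"),                         # Ev3
]
roles = {
    "I": {"Requires"},
    "E": {"EDG-1", "sphingosine_kinase", "PDGF"},
    "R": {"cell_migration"},
}
verified = verify_events(extracted, roles)
```

Ev3 is dropped because PDGF was classified as an effecter, not a reactant.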
48
POSBIOTM/Event System

Experiment and Discussion

500 MEDLINE abstracts including 2,314 biological
events; 10-fold cross validation
Flat rule learner vs. two-level rule learner
Before verification vs. after verification
Performance comparison
"Learning Information Extractors for Proteins and
their Interactions" (2004), Razvan Bunescu et
al.
1,000 abstracts; 10-fold cross validation

System                                        Precision (%)  Recall (%)  F-measure
Flat rule learner, before verification        38.3           58.0        46.1
Flat rule learner, after verification         54.7           49.2        51.8
Two-level rule learner, before verification   38.2           68.0        48.9
Two-level rule learner, after verification    53.1           56.1        54.6
Comparison system                             39             63          48.2
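
The F-measure values in this table are consistent with the usual balanced F-measure, the harmonic mean of precision and recall. The short check below recomputes them from the reported precision and recall. The dictionary keys are shorthand labels chosen for this example.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the table.
REPORTED = {
    "flat, before": (38.3, 58.0, 46.1),
    "flat, after": (54.7, 49.2, 51.8),
    "two-level, before": (38.2, 68.0, 48.9),
    "two-level, after": (53.1, 56.1, 54.6),
    "comparison": (39.0, 63.0, 48.2),
}

recomputed = {name: round(f_measure(p, r), 1) for name, (p, r, _) in REPORTED.items()}
```
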
49
POSBIOTM/Event System

Experiment and Discussion

Trade-off between precision and recall
Before verification: large gap between precision
and recall
After verification: small gap between precision
and recall
Threshold: rules are cut according to a
measure of how many of the events extracted by
a rule are correct

50
POSBIOTM/Event System

Experiment and Discussion

Constant good performance regardless of the
threshold of rule learner

51
Other Corpora for Bio-Relation Extraction

BC-PPI
From BioCreative Corpus for NER
Protein/Gene interactions
255 interactions in 1000 sentences
IEPA
Protein/Protein interactions
410 interactions in 498 sentences
LLL05
Protein/Gene interactions
271 interactions in 80 sentences
BioText
Disease/Treatment relations

52
Contents

Introduction
POSBIOTM/W Workbench
POSBIOTM/NER System
POSBIOTM/NER with Active Machine Learning
POSBIOTM/Event System
Current status

53
Current Status and Future Work

Re-implemented in Java (platform independent)
To be integrated with J-Designer in the SBW
consortium
Integrated with the active learning method to
automatically suggest text for human annotation
Used for national large-scale BIT fusion
projects: search for useful peptides (usable as
ligands for drugs)
Getting more feedback from biologists
The system gets smarter with more usage: workbench
plus active learning