A Bio Text Mining Workbench combined with Active Machine Learning

About This Presentation
Title:

A Bio Text Mining Workbench combined with Active Machine Learning

Description:

Title: A Development Workbench for Machine-Learning Oriented BioMedical Text Mining System
Author: songyu
Last modified by: gblee
Created Date: 2/4/2005 1:08:06 AM

Slides: 54
Provided by: song151

Transcript and Presenter's Notes


1
A Bio Text Mining Workbench combined with Active
Machine Learning
  • Gary Geunbae Lee
  • Postech
  • 11/25 LBM2005

2
Contents
  • Introduction
  • POSBIOTM/W Workbench
  • POSBIOTM/NER System
  • POSBIOTM/NER with Active Machine Learning
  • POSBIOTM/Event System
  • Current status (demo)

3
Introduction
  • Exponentially growing biological publications

4
Introduction
  • Two key issues in dealing with biological texts:
  • Biological named entity recognition.
  • Extraction of biological interactions (events)
    between biological entities.
  • Important for biological pathways.

Biological Papers
5
Introduction
  • Bio-text mining workbench
  • Development workbench (common in NLP)
  • Grammar development workbench
  • POS/Tree Tagging workbench
  • Uses large amounts of corpus data
  • Machine learning methods are used in the NER task and
    the event extraction task.
  • An annotated corpus is essential to achieve good
    results with machine learning based methods (in both
    quantity and quality)
  • Lack of annotated corpora (notorious in
    bio/medical fields)
  • Need
  • Tools that support collecting, managing,
    creating, annotating and exploiting rich
    biomedical text resources.
  • Tools that interact with the automatic system
    to grow the high-quality annotated corpus

6
Contents
  • Introduction
  • POSBIOTM/W Workbench
  • POSBIOTM/NER System
  • POSBIOTM/NER with Active Machine Learning
  • POSBIOTM/Event System
  • Current status

7
POSBIOTM/W: A Development Workbench
  • Overall Design

8
POSBIOTM/W Workbench
  • Managing Tool
  • Goal
  • help users to search, collect and manage
    publications.
  • Quick Search Bar
  • provides quick access to PubMed.
  • PubMed Search Assistant
  • Users can select specific abstracts for
    named-entity tagging and event extraction

9
POSBIOTM/W Workbench
  • Managing Tool
  • PubMed Search Assistant

10
POSBIOTM/W Workbench
  • NER Tool
  • Named-entity recognition (NER) task
  • Identification of the material names of interest.
  • Goal: automatically and effectively annotate
    biomedical-related entities.
  • The NER Tool is a client tool of the POSBIOTM/NER System
  • Currently, three NER models are provided:
  • the GENIA-NER model, the GENE-NER model and the
    GPCR-NER model
  • Named-entity recognition with active learning
  • To minimize the human labeling effort

11
POSBIOTM/W Workbench
  • NER Tool
  • Named-entity recognition with Active learning

12
POSBIOTM/W Workbench
  • Event Extraction Tool
  • Goal: to extract events, each of which consists of an
    interaction, an effecter, and a reactant
  • Named-entity types: protein (P), gene (G), small
    molecule (SM), and cellular process (CP).
  • Interaction types: biological interaction (BI) and
    chemical interaction (CI).
  • The Event Extraction Tool is a client tool of the
    POSBIOTM/Event System

13
POSBIOTM/W Workbench
  • Event Extraction Tool
  • Extraction Result in XML format

<Result>
<NER>
....
<Sentence SNum="4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene>, is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> ( <protein>GPCR</protein> ) for <small_molecule>sphingosine-1-phosphate</small_molecule> ( <small_molecule>SPP</small_molecule> ) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells.
</Sentence>
.....
</NER>
<Event_Extraction>
<Event SNum="4">
<Interaction>stimulate</Interaction>
<Effecter>sphingosine-1-phosphate</Effecter>
<Reactant>angiogenesis</Reactant>
</Event>
.....
</Event_Extraction>
</Result>
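A result in this format can be read with any XML parser. The following is a minimal sketch in Python, assuming the element and attribute names shown on the slide (Result, NER, Sentence, Event_Extraction, Event, SNum). The sample document is abbreviated for illustration, and the helper names read_events and read_entities are hypothetical, not part of POSBIOTM.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the slide's format (illustrative, not real system output).
RESULT_XML = """<Result>
<NER>
<Sentence SNum="4"><protein>EDG-1</protein> is a receptor for
<small_molecule>sphingosine-1-phosphate</small_molecule> that stimulates
<cellular_process>angiogenesis</cellular_process>.</Sentence>
</NER>
<Event_Extraction>
<Event SNum="4">
<Interaction>stimulate</Interaction>
<Effecter>sphingosine-1-phosphate</Effecter>
<Reactant>angiogenesis</Reactant>
</Event>
</Event_Extraction>
</Result>"""


def read_events(xml_text):
    """Return one dict per <Event> with its sentence number and three slots."""
    root = ET.fromstring(xml_text)
    events = []
    for event in root.iter("Event"):
        events.append({
            "sentence": int(event.get("SNum")),
            "interaction": event.findtext("Interaction"),
            "effecter": event.findtext("Effecter"),
            "reactant": event.findtext("Reactant"),
        })
    return events


def read_entities(xml_text):
    """Return (entity_type, surface_text) pairs for every tagged entity."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text)
            for sentence in root.iter("Sentence")
            for child in sentence]


if __name__ == "__main__":
    print(read_events(RESULT_XML))
    print(read_entities(RESULT_XML))
```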
14
POSBIOTM/W Workbench
  • Event Extraction Tool
  • Extraction Result

15
POSBIOTM/W Workbench
  • Annotation Tool
  • Goal
  • The GUI-based Annotation Tool is designed for
    editing annotations manually.
  • Named-entity editing
  • Named entities are displayed in different colors, which
    can be changed
  • Add, remove or correct named-entity tags, or
    change the boundaries of named entities, etc.

16
POSBIOTM/W Workbench
  • Annotation Tool
  • Event editing
  • Extracted events are displayed in a table
  • Double-clicking an event looks up the original
    sentence from which it was extracted
  • Upload function
  • Users can upload well-annotated data to the
    POSBIOTM system
  • Incremental build-up of a massive
    named-entity and event annotation corpus.

17
POSBIOTM/W Workbench
  • Annotation Tool

18
Contents
  • Introduction
  • POSBIOTM/W Workbench
  • POSBIOTM/NER System
  • POSBIOTM/NER with Active Machine Learning
  • POSBIOTM/Event System
  • Current status

19
POSBIOTM/NER System
  • Named Entity Recognition (NER)
  • Approach
  • The named entity recognition problem is treated
    as a classification problem: each input token is
    marked up with a named entity category label.
  • CRF
  • Conditional random fields (CRFs) (Lafferty
    et al. 2001) are a probabilistic framework for
    labeling and segmenting sequential data. (s:
    state (tag); o: input)
  • For example
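As background, in the standard linear-chain formulation (general knowledge, not taken from the slide), a CRF defines the conditional probability of a state sequence s given an input sequence o as shown here. The feature functions f_k, weights λ_k and sequence length T are generic symbols introduced for illustration.

```latex
P(s \mid o) = \frac{1}{Z(o)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \right),
\qquad
Z(o) = \sum_{s'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s'_{t-1}, s'_t, o, t) \right)
```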

20
POSBIOTM/NER System
  • Named Entity Recognition (NER)
  • Feature Set

Feature               Description
Lexical word          Used only when the previous/current/next words are in the surface word dictionary.
Word feature          Orthographical features of the previous/current/next words: upper-case letters, numbers, non-alphabet letters, Greek words (alpha cells, beta hemolysis, tau interferon).
Prefix/suffix         Prefixes/suffixes contained in the prefix/suffix dictionary. Biological prefix/suffix concepts: ase, blast, cyt, phore, plast.
Part-of-speech tag    POS tag of the previous/current/next words. The part of speech describes how a particular word is used, e.g. noun, verb, etc.
Base noun phrase tag  Base noun phrase tag of the previous/current/next words.
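A feature set of this kind is usually computed per token over a window of previous, current and next words. The sketch below is illustrative only: the function names, feature-name strings and the tiny dictionaries are assumptions, the affix list is the one given on the slide, and the Greek list adds "gamma" and "delta" to the slide's examples. The base noun phrase tag is omitted for brevity.

```python
import re

GREEK = {"alpha", "beta", "tau", "gamma", "delta"}    # slide examples plus two additions
BIO_AFFIXES = ("ase", "blast", "cyt", "phore", "plast")  # affix concepts from the slide


def orthographic(word):
    """Return orthographic flags for a single word."""
    flags = []
    if any(c.isupper() for c in word):
        flags.append("HAS_UPPER")
    if any(c.isdigit() for c in word):
        flags.append("HAS_DIGIT")
    if re.search(r"[^A-Za-z0-9]", word):
        flags.append("HAS_NON_ALNUM")
    if word.lower() in GREEK:
        flags.append("GREEK")
    return flags


def token_features(words, pos_tags, i, lexicon=frozenset()):
    """Build a feature dict for token i from its previous/current/next words."""
    feats = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if not 0 <= j < len(words):
            continue
        word = words[j]
        lower = word.lower()
        if lower in lexicon:              # lexical feature only for dictionary words
            feats[f"{name}:word"] = lower
        feats[f"{name}:pos"] = pos_tags[j]
        for flag in orthographic(word):
            feats[f"{name}:{flag}"] = "1"
        for affix in BIO_AFFIXES:
            if lower.startswith(affix):
                feats[f"{name}:prefix={affix}"] = "1"
            if lower.endswith(affix):
                feats[f"{name}:suffix={affix}"] = "1"
    return feats


if __name__ == "__main__":
    words = ["IL-2", "activates", "kinase"]
    tags = ["NN", "VBZ", "NN"]
    print(token_features(words, tags, 1, lexicon={"activates"}))
```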
21
POSBIOTM/NER System
  • NER Models
  • Three NER models
  • GENIA model / GENE-NER model / GPCR-NER model
  • GENIA model
  • The named entity classes used in the evaluation:
  • DNA, RNA, protein, cell_line and cell_type
  • The training data consists of 2000 MEDLINE
    abstracts from the GENIA version 3 corpus. These
    abstracts were collected using the search terms
    human, blood cell, transcription factor.
  • The testing data comes from a super-domain of
    the training data (blood cell, transcription
    factor).

22
POSBIOTM/NER System
  • NER Models
  • GENE-NER model
  • The GENE-NER module uses the BioCreative corpus.
  • The aim of the GENE-NER module is to identify
    which terms in biomedical research articles
    are gene and/or protein names.
  • The training corpus consists of 7.5k sentences,
    selected from MEDLINE according to their
    likelihood of containing gene names.
  • GPCR-NER module (Postech)
  • Aims at recognizing four target named entity
    categories:
  • protein, gene, small molecule and cellular
    process.
  • The training corpus consists of 50 full articles
    related to the GPCR (G-protein coupled receptor)
    signal transduction pathway.

23
POSBIOTM/NER System
  • NER Models
  • Evaluation for Three NER models

Model      Precision  Recall  F-Measure
GENIA-NER  0.6960     0.6929  0.6945
GENE-NER   0.7550     0.8404  0.7982
GPCR-NER   0.6736     0.8135  0.7370
24
Contents
  • Introduction
  • POSBIOTM/W Workbench
  • POSBIOTM/NER System
  • POSBIOTM/NER with Active Machine Learning
  • POSBIOTM/Event System
  • Current status

25
POSBIOTM/NER with Active Learning
  • Active Learning in NER
  • NER with Machine Learning
  • To enhance NER performance by re-using the
    annotated data to re-train
    the NER module
  • NER with Active Machine Learning
  • To minimize the human labeling effort without
    degrading the performance
  • To select the most informative samples for
    training

26
POSBIOTM/NER with Active Learning
  • Active Learning in NER Framework

27
POSBIOTM/NER with Active Learning
  • Active Learning Scoring Strategy
  • Uncertainty-based Sample Selection
  • Using an entropy-based measure to quantify the
    uncertainty that the current classifier holds
    (entropy or normalized entropy of the CRF
    conditional probability)
  • The most uncertain samples are selected for human
    annotation
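The entropy-based score can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: it assumes the CRF returns the conditional probabilities of its N best labelings for a sentence, renormalizes them over that list, and normalizes by log N. The slide does not specify either choice, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of the given probabilities after renormalizing them."""
    total = sum(probs)
    if total <= 0:
        return 0.0
    h = 0.0
    for p in probs:
        if p > 0:
            q = p / total
            h -= q * math.log(q)
    return h


def normalized_entropy(probs):
    """Entropy divided by its maximum, log(N), so that scores lie in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    return entropy(probs) / math.log(n)


if __name__ == "__main__":
    # Hypothetical N-best probabilities for two unlabeled sentences.
    pool = {
        "sentence A": [0.90, 0.05, 0.05],   # classifier is fairly sure
        "sentence B": [0.40, 0.35, 0.25],   # classifier is uncertain
    }
    ranked = sorted(pool, key=lambda s: normalized_entropy(pool[s]), reverse=True)
    print(ranked)  # the most uncertain sentence comes first
```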

28
POSBIOTM/NER with Active Learning
  • Active Learning Scoring Strategy
  • Diversity-based Sample Selection
  • To select the most representative sentences in
    each sampling round.
  • The divergence between sentences is
    measured by the minimum similarity among the
    examples
  • The similarity score of two words
  • The similarity score of two sentences

(for syntactic path)
29
POSBIOTM/NER with Active Learning
  • Active Learning Scoring Strategy
  • MMR(Maximal Marginal Relevance) method
  • The two measures for uncertainty and diversity
    will be combined using the MMR method to give the
    sampling scores in our active learning strategy
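One common way to write an MMR-style combination is score(s) = λ · uncertainty(s) − (1 − λ) · max similarity(s, already selected), applied greedily. The sketch below follows that general form. The weight lam, the function names and the token-overlap (Jaccard) similarity are assumptions made for illustration; Jaccard is a simple stand-in, not the similarity measure used in the system.

```python
def jaccard(a, b):
    """Token-overlap similarity between two sentences (a simple stand-in)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mmr_select(uncertainty, k, lam=0.8, sim=jaccard):
    """Greedily pick k sentences that are uncertain yet dissimilar to earlier picks.

    uncertainty maps each candidate sentence to its uncertainty score.
    """
    selected = []
    candidates = set(uncertainty)

    def score(sentence):
        redundancy = max((sim(sentence, t) for t in selected), default=0.0)
        return lam * uncertainty[sentence] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    pool = {
        "IL-2 activates T cells": 0.90,
        "IL-2 activates B cells": 0.85,
        "NF-kappa B binds DNA": 0.60,
    }
    print(mmr_select(pool, k=2, lam=0.5))
```

With lam=1.0 the selection reduces to pure uncertainty sampling; lower values trade uncertainty for diversity.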

30
POSBIOTM/NER with Active Learning
  • Experiment and Discussion
  • Training Data
  • 2,000 MEDLINE abstracts from the GENIA corpus
  • 5 named entity classes
  • DNA, RNA, protein, cell line, cell type
  • Test Data
  • 404 abstracts
  • Half of them are from the same domain as the
    training data and the other half are from the
    super-domain of blood cell and transcription
    factor

31
POSBIOTM/NER with Active Learning
  • Experiment and Discussion
  • Pool-based sample selection
  • 100 abstracts were used to train initial NER
    module
  • Each time, we chose k examples (sentences) from
    the given pool to train the new NER module
  • The number k varied from 1,000 to 17,000 with
    step size 1,000
  • Active learning methods for test
  • Random selection
  • Entropy based uncertainty selection
  • Entropy combined with Diversity
  • Normalized Entropy combined with Diversity

32
POSBIOTM/NER with Active Learning
  • Experiment and Discussion

33
POSBIOTM/NER with Active Learning
  • Experiment and Discussion
  • All three active learning strategies
    outperform random selection
  • The combined strategy reduces the number of training
    examples by 24.64% compared with random selection
  • The normalized combined strategy reduces the number of
    training examples by 35.43% compared with random
    selection
  • Diversity increases the classifier's performance
    when a large number of samples are selected
  • Up to 4,000 sentences, the entropy strategy and
    the combined strategy perform similarly
  • After the 11,000-sentence point, the combined
    strategy surpasses the entropy strategy

34
Contents
  • Introduction
  • POSBIOTM/W Workbench
  • POSBIOTM/NER System
  • POSBIOTM/NER with Active Machine Learning
  • POSBIOTM/Event System
  • Current status

35
POSBIOTM/Event System
  • System Architecture

36
POSBIOTM/Event System
  • Target Slot Definition
  • Template Element
  • Entities - participants of an event
  • protein (P), gene (G), small molecule (SM),
    cellular process (CP)
  • Interaction - relationship between entities
  • biological interaction (BI): functional
    interaction
  • About how/whether one component affects the
    other's status biologically
  • chemical interaction (CI): molecular interaction
  • About the interaction among entities at the
    molecular structural level
  • Event
  • One Interaction (I)
  • Connecting the effecter and reactant
  • Interaction keywords (BI, CI)
  • One Effecter (E)
  • Provoking an event
  • Template element (P, G, SM, CP) or nested event
  • One Reactant (R)
  • Responding to an effecter
  • Template element (P, G, SM, CP) or nested event

37
POSBIOTM/Event System
  • Target Slot Definition
  • Example

Sentence: The cross-talk between PDGF and SPP is required for these embryonic cell movements.
Template elements
  Entities: PDGF (P), SPP (SM), cell movement (CP)
  Interaction keywords: cross-talk (BI), require (BI)
Events
  cross-talk (I), PDGF (E), SPP (R)
  require (I), cross-talk (E), cell movement (R)
38
POSBIOTM/Event System
  • Pre-Processor
  • Sentence boundary detection
  • Annotating Named Entity (NER)
  • Protein
  • Small molecule
  • Gene
  • Cellular process
  • Compound/Complex Sentence Splitter
  • To simplify the complicated full texts

39
POSBIOTM/Event System
  • Pre-Processor
  • Compound/Complex Sentence Splitter
  • Simple splitting rules
  • S NP1 VP1 NP2 SBAR that|which VP2 /SBAR
    /S
  • => NP1 VP1 NP2. NP2 VP2.
  • Example
  • The best studied of these is EDG-1, which is
    implicated in cell migration and angiogenesis.
  • => 1. The best studied of these is EDG-1.
  • 2. EDG-1 is implicated in cell migration and
    angiogenesis.
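The splitting rule can be sketched on pre-chunked input. This is an illustration only: it assumes a chunker has already produced (label, text) pairs, uses a made-up REL label for the relative pronoun, and covers just the single pattern shown on the slide. The function name is hypothetical.

```python
def split_relative_clause(chunks):
    """Apply NP1 VP1 NP2 [that|which VP2] => 'NP1 VP1 NP2.' and 'NP2 VP2.'

    chunks is a list of (label, text) pairs; any other shape is returned
    unsplit as a single sentence.
    """
    labels = [label for label, _ in chunks]
    texts = [text for _, text in chunks]
    unsplit = [" ".join(texts) + "."]
    if labels != ["NP", "VP", "NP", "REL", "VP"]:
        return unsplit
    np1, vp1, np2, rel, vp2 = texts
    if rel.lower() not in {"that", "which"}:
        return unsplit
    return [f"{np1} {vp1} {np2}.", f"{np2} {vp2}."]


if __name__ == "__main__":
    chunks = [
        ("NP", "The best studied of these"),
        ("VP", "is"),
        ("NP", "EDG-1"),
        ("REL", "which"),
        ("VP", "is implicated in cell migration and angiogenesis"),
    ]
    for sentence in split_relative_clause(chunks):
        print(sentence)
```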

40
POSBIOTM/Event System
  • Biological Event Extraction
  • Two-level Event Rule Learner

41
POSBIOTM/Event System
  • Biological Event Extraction
  • Event Rule Learner
  • Adapts the supervised machine learning algorithm
    WHISK
  • Learns rules in the form of context-based regular
    expressions
  • Induces the rules in a top-down manner
  • Ex) NP .? (<CP>)E /NP VP (<BI>)I
    /VP NP both (<P>)R and .? /NP
  • Limitations of WHISK
  • The longer the distance between event components, the
    more difficult it is to extract the correct event
  • WHISK considers all lexical words between event
    components
  • Cannot handle nested biological events
  • We propose a two-level rule learning method to address
    the limitations of the flat rule learning method

42
POSBIOTM/Event System
  • Biological Event Extraction
  • Two-level Event Rule Learner

NP <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> /NP VP is <BI>required</BI> /VP for NP these embryonic <CP>cell_movements</CP> /NP
<TAGS> B interaction cross-talk effecter PDGF reactant SPP
<TAGS> B interaction require effecter cross-talk reactant cell movement
1. Marking long NP boundary
2. Learn the short-span rule corresponding to the NP
   <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM>
   => NP (<BI>)I between (<P>)E and (<SM>)R /NP
3. Re-annotate the short-span interaction as one noun with regular expression format
   NP <E>cross-talk_between_PDGF_and_SPP</E> /NP VP is <BI>required</BI> /VP for NP these embryonic <CP>cell_movements</CP> /NP
   <TAGS> B interaction require effecter cross-talk reactant cell movement
4. Learn the long-span rule with the re-annotated sentence
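Applying two such rules at extraction time can be sketched with ordinary regular expressions. The rules below are hand-written for this one sentence rather than learned, the square-bracket chunk markers are an assumed notation, and all names are hypothetical. The sketch only shows how a short-span event is collapsed into a single effecter before a long-span rule is matched.

```python
import re

# Tagged sentence from the slide, with assumed [NP ... /NP] chunk brackets.
SENTENCE = (
    "[NP <BI>cross-talk</BI> between <P>PDGF</P> and <SM>SPP</SM> /NP] "
    "[VP is <BI>required</BI> /VP] for "
    "[NP these embryonic <CP>cell_movements</CP> /NP]"
)

# Level 1: short-span rule inside one noun phrase.
SHORT_RULE = re.compile(
    r"\[NP <BI>(?P<I>[^<]+)</BI> between <P>(?P<E>[^<]+)</P>"
    r" and <SM>(?P<R>[^<]+)</SM> /NP\]"
)
# Level 2: long-span rule over the re-annotated sentence.
LONG_RULE = re.compile(
    r"\[NP <E>(?P<E>[^<]+)</E> /NP\] \[VP .*?<BI>(?P<I>[^<]+)</BI> /VP\]"
    r" .*?<CP>(?P<R>[^<]+)</CP>"
)


def extract_two_level(tagged):
    """Return (interaction, effecter, reactant) tuples, nested event first."""
    events = []
    m = SHORT_RULE.search(tagged)
    if m:
        events.append((m["I"], m["E"], m["R"]))
        # Re-annotate the whole short-span event as one effecter noun.
        nested = f"{m['I']}_between_{m['E']}_and_{m['R']}"
        tagged = tagged[:m.start()] + f"[NP <E>{nested}</E> /NP]" + tagged[m.end():]
    m = LONG_RULE.search(tagged)
    if m:
        events.append((m["I"], m["E"], m["R"]))
    return events


if __name__ == "__main__":
    for event in extract_two_level(SENTENCE):
        print(event)
```

The sketch keeps the surface form "required"; the slide lists the lemma "require", which would need a separate normalization step.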
43
POSBIOTM/Event System
  • Biological Event Extraction
  • Event Extractor
  • To extract events with the automatically
    generated rules
  • by using regular expression pattern matching
  • To handle aliases and noun conjunctions
  • Aliases and noun conjunctions follow general
    patterns such as "sphingosine-1-phosphate(SPP)" or
    "FP, IP, and TP receptors"
  • They are handled with simple rules such as "A(B)" or
    "A, B, C, and D"
  • To remove sentences containing negative words
  • not, never, fail, etc.
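These three clean-up steps lend themselves to small regular-expression helpers. The following is a rough sketch with hypothetical function names; the patterns are simplified assumptions that cover only the examples from the slide, and the negation list is limited to the words named there.

```python
import re

NEGATION = re.compile(r"\b(?:not|never|fail(?:s|ed)?)\b", re.IGNORECASE)
ALIAS = re.compile(r"(?P<full>[\w-]+)\s*\((?P<alias>[\w-]+)\)")
CONJUNCTION = re.compile(
    r"(?P<items>[\w-]+(?:, [\w-]+)+,? and [\w-]+) (?P<head>[\w-]+)"
)


def has_negation(sentence):
    """True if the sentence contains one of the listed negative words."""
    return NEGATION.search(sentence) is not None


def find_aliases(text):
    """Return (full_name, alias) pairs matching the A(B) pattern."""
    return [(m["full"], m["alias"]) for m in ALIAS.finditer(text)]


def expand_conjunction(phrase):
    """Expand 'A, B, and C head' into ['A head', 'B head', 'C head']."""
    m = CONJUNCTION.fullmatch(phrase)
    if m is None:
        return [phrase]
    items = re.split(r",\s*(?:and\s+)?|\s+and\s+", m["items"])
    return [f"{item} {m['head']}" for item in items if item]


if __name__ == "__main__":
    print(find_aliases("sphingosine-1-phosphate(SPP)"))
    print(expand_conjunction("FP, IP, and TP receptors"))
    print(has_negation("EDG-1 does not stimulate angiogenesis"))
```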

44
POSBIOTM/Event System
  • Event Component Verifier

45
POSBIOTM/Event System
  • Event Component Verifier
  • To remove the incorrectly extracted events
  • Classify template elements (P, G, SM, CP, BI, CI)
    into 4 classes
  • I (interaction), E (effecter), R (reactant), N
    (none)
  • I, E, R: event components
  • N: a template element, but not an event
    component
  • Use a Maximum Entropy Classifier
  • Features
  • POS tag, phrase chunks, the type of template
    element of neighboring words and semantic
    information

46
POSBIOTM/Event System
  • Event Component Verifier

47
POSBIOTM/Event System
  • Event Component Verifier
  • Example

Extracted biological events
  Ev1: Requires (I), sphingosine_kinase (E), cell_migration (R)
  Ev2: Requires (I), EDG-1 (E), cell_migration (R)
  Ev3: Requires (I), EDG-1 (E), PDGF (R)
Event Component Verifier results
  I: Requires
  E: EDG-1, sphingosine_kinase, PDGF
  R: cell_migration
Verified biological events
  Ev1: Requires (I), sphingosine_kinase (E), cell_migration (R)
  Ev2: Requires (I), EDG-1 (E), cell_migration (R)
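The filtering step in this example can be sketched as below. The classifier itself is not modeled: its output is passed in as a plain dictionary from template element to predicted class (I, E, R or N), which is an assumption made for illustration, as is the function name.

```python
def verify_events(events, predicted_role):
    """Keep only events whose three slots agree with the verifier's classes.

    events: list of (interaction, effecter, reactant) tuples.
    predicted_role: maps each template element to "I", "E", "R" or "N".
    """
    verified = []
    for interaction, effecter, reactant in events:
        if (predicted_role.get(interaction) == "I"
                and predicted_role.get(effecter) == "E"
                and predicted_role.get(reactant) == "R"):
            verified.append((interaction, effecter, reactant))
    return verified


if __name__ == "__main__":
    extracted = [
        ("Requires", "sphingosine_kinase", "cell_migration"),  # Ev1
        ("Requires", "EDG-1", "cell_migration"),               # Ev2
        ("Requires", "EDG-1", "PDGF"),                         # Ev3
    ]
    roles = {
        "Requires": "I",
        "EDG-1": "E",
        "sphingosine_kinase": "E",
        "PDGF": "E",
        "cell_migration": "R",
    }
    # Ev3 is dropped because PDGF was classified as an effecter, not a reactant.
    print(verify_events(extracted, roles))
```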
48
POSBIOTM/Event System
  • Experiment and Discussion
  • 500 MEDLINE abstracts containing 2,314 biological
    events; 10-fold cross-validation
  • Flat rule learner vs. two-level rule learner
  • Before verification vs. after verification
  • Performance comparison
  • Learning Information Extractors for Proteins and
    their Interactions (2004) - Razvan Bunescu et
    al.
  • 1000 abstracts; 10-fold cross-validation

               Flat rule learner                        Two-level rule learner                   Comparison system
               Before verification  After verification  Before verification  After verification
Precision (%)  38.3                 54.7                38.2                 53.1                39
Recall (%)     58.0                 49.2                68.0                 56.1                63
F-measure      46.1                 51.8                48.9                 54.6                48.2
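The F-measure values are consistent with the balanced F-measure, the harmonic mean of precision and recall. The slide does not state the definition, so this is an assumption. A short check that recomputes the row from the precision and recall figures:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table, in percent.
SYSTEMS = {
    "flat, before verification": (38.3, 58.0),
    "flat, after verification": (54.7, 49.2),
    "two-level, before verification": (38.2, 68.0),
    "two-level, after verification": (53.1, 56.1),
    "comparison system": (39.0, 63.0),
}

if __name__ == "__main__":
    for name, (p, r) in SYSTEMS.items():
        print(f"{name}: {f_measure(p, r):.1f}")
```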
49
POSBIOTM/Event System
  • Experiment and Discussion
  • Trade-off between precision and recall
  • Before verification: large gap between precision
    and recall
  • After verification: small gap between precision
    and recall
  • Threshold: rules are cut according to a
    measure of how many of the events extracted by
    a rule are correct

50
POSBIOTM/Event System
  • Experiment and Discussion
  • Consistently good performance regardless of the
    rule learner's threshold

51
Other Corpora for Bio-Relation Extraction
  • BC-PPI
  • From BioCreative Corpus for NER
  • Protein/Gene interactions
  • 255 interactions in 1000 sentences
  • IEPA
  • Protein/Protein interactions
  • 410 interactions in 498 sentences
  • LLL05
  • Protein/Gene interactions
  • 271 interactions in 80 sentences
  • BioText
  • Disease/Treatment relations

52
Contents
  • Introduction
  • POSBIOTM/W Workbench
  • POSBIOTM/NER System
  • POSBIOTM/NER with Active Machine Learning
  • POSBIOTM/Event System
  • Current status

53
Current Status and Future Work
  • Re-implemented in Java (platform independent)
  • To be integrated with J-Designer in the SBW
    consortium
  • Integrated with the active learning method to
    automatically suggest samples for human annotation
  • Used for national large-scale BIT fusion
    projects: searching for useful peptides (usable as
    ligands for drugs)
  • Getting more feedback from biologists
  • The system gets smarter with more usage: workbench
    plus active learning