Provided by: malvina3
1
SMBM Talks
NLP for Biomedical Text Mining
SMBM, Cambridge, April 11-13 (Edinburgh May 2)
2
Resources and Tools for Biomedical Text Mining
Junichi Tsujii (U of Tokyo)
Keywords: GENIA corpus annotation
Main point: progress in text mining depends on
the integration of growing GENIA annotation
(e.g. coreference) with lexical resources for
domain knowledge (ontologies) and software
development.
Take home message: see main point above
3
  • annotated corpus
      • POS
      • NER
      • coreference (670 abstracts, Singapore)
      • interaction (biological events; cooperation with CNRS)
      • parse trees (1.5 million GENIA abstracts parsed in 10 days using a 100 PC cluster)
  • ontology
      • top nodes: substance, source, other
  • software development
      • POS tagger
      • NER tagger
      • parser
      • IR system (Medusa)
      • IE system (event extraction, gene/disease relation)

4
  • POS tagger
      • MaxEnt model (Kazama and Tsujii 2003, 2005)
      • Trained on WSJ (>39,000 sent.) and GENIA (18,500 sent.)

train \ test    WSJ     GENIA
WSJ             97.0    84.3
GENIA           75.2    98.1
WSJ+GENIA       96.9    98.1

  • NER tagger
      • combines a rule-based and statistical approach
      • on BioNLP 70.8 (?) -- our system got 70.1

5
  • HPSG-based parser (Enju)
      • see Miyao et al. ACL05
      • available on website
      • XML output
      • dependency relations
  • predicate-argument accuracy
      • PTB: prec 88.3, rec 87.2
      • GENIA: lower...
  • gene/disease relation extraction
      • pred/arg works better than bag of words or local context (gives best precision)

6
Recognising noun phrases in biomedical text: an
evaluation of lab prototypes and a commercial
chunker
J. Wermter, J. Fluck, J. Stroetgen, S. Geissler, U.
Hahn (U. Jena, Temis)
Keywords: chunking, portability
Main point: take several existing chunkers trained
on (or developed for) newspaper text and evaluate
their performance on biomedical data (beta
version of GENIA syntactic annotation).
  • Take home messages
      • overall performance drop (3-6 points) for ML systems when shifting to bio domain
      • no significant difference between statistical and rule-based systems

7
  • Three statistical chunkers
      • YamCha (support vector machine)
      • Tbl (transformation-based error-driven learning)
      • BoSS (boundary predictor combining observed probabilities of NP boundaries and POS patterns in the training set)
  • One rule-based commercial system
      • Temis
          1. Uses words rather than GENIA POS tags
          2. Computes morphological information (XeLDA toolkit)
          3. HMM POS tagger disambiguates chain of POS tags
      • hand-coded grammar had to be modified (on PTB)
      • tagset had to be translated (not straightforward)

8
Training and Test Sets
  • Train
      • sections 15-18 of Penn Treebank (over 200,000 tokens, POS-tagged and IOB-chunked)
  • Test
      • GENIA treebank (beta version): 200 MedLine abstracts with syntactic annotation
      • the GENIA treebank was automatically converted into the IOB format
      • just under 45,000 tokens
          • 11,000 devtest for setting Temis IOB output
          • 34,000 actual test set
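
The conversion into IOB format can be illustrated with a minimal sketch. The slides do not describe the actual conversion script, so the input format here (tokens plus NP spans given as token offsets) and the function name are assumptions made for illustration only.

```python
def spans_to_iob(tokens, np_spans):
    """Tag each token B-NP (chunk start), I-NP (inside a chunk) or O (outside).

    np_spans: list of (start, end) token offsets, end exclusive. This is an
    assumed input format, not the GENIA treebank's own encoding.
    """
    tags = ["O"] * len(tokens)
    for start, end in np_spans:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return list(zip(tokens, tags))


# Made-up example sentence with two noun phrases
tokens = ["The", "IL-2", "gene", "is", "expressed", "in", "T", "cells"]
np_spans = [(0, 3), (6, 8)]  # "The IL-2 gene", "T cells"
iob = spans_to_iob(tokens, np_spans)
```

Each chunker's output can then be compared token by token (or chunk by chunk) against tags produced this way.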

9
Results and Errors

            PTB Corpus               GENIA Corpus
            Rec    Prec   F          Rec    Prec   F
YamCha      94.29  94.15  94.22      89.00  89.30  89.15
BoSS        89.92  90.10  90.01      86.46  86.84  86.65
Tbl         92.27  91.80  92.03      86.31  85.49  85.90
Temis       86.94  86.29  86.61      87.14  85.34  86.23

  • Errors
      • Coordination
      • bracketed elements
      • ...

After domain adaptations

            Rec    Prec   F
Temis       91.24  90.59  90.91
BoSS        87.25  89.19  88.21
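
The F column in these results is consistent with the balanced F-score, the harmonic mean of recall and precision. A short sketch of that arithmetic, using recall/precision pairs reported on this slide (the function name is illustrative):

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs as reported on the slide
yamcha = f_score(94.29, 94.15)        # YamCha, first result row
boss_adapted = f_score(87.25, 89.19)  # BoSS after domain adaptation
```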
10
Automatic Term List Generation for Entity Tagging
Ted Sandler, Andrew Schein, and Lyle Ungar (CS,
UPenn)
Keywords: NER, automatic gazetteer creation
Main point: term lists can be obtained
automatically and, when integrated in a NER
(gene) tagger (CRF), boost its performance to a
level comparable with hand-modelled lists
  • Take home messages
      • unsupervised gazetteer creation is feasible and useful
      • supervised methods for obtaining terms outperform unsupervised methods

11
  • Overall Approach
      • choose set of vocabulary items (nouns) to partition into classes
      • choose set of useful syntactic relations
          • frequent
          • informative
          • relatively noise-free
      • parse corpus to extract relations and collect statistics
      • use clustering algorithm to partition the vocabulary
      • resulting partitions are term lists
  • 4 related methods for generating term lists; they differ wrt (see table)
      • word representation
      • clustering algorithms to partition the words
      • choice of feature weighting
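
The "collect statistics" and "word representation" steps can be sketched as follows. This is a rough illustration, not the authors' implementation: the (noun, context) pair format, the context labels and the use of cosine similarity are all assumptions, since the slides do not say how the parser output is encoded or which similarity measure is used.

```python
import math
from collections import Counter, defaultdict


def build_vectors(pairs):
    """Map each noun to a Counter of the syntactic contexts it occurs in."""
    vectors = defaultdict(Counter)
    for noun, context in pairs:
        vectors[noun][context] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical (noun, syntactic context) pairs as a parser might yield them
pairs = [
    ("kinase", "obj-of:activate"), ("kinase", "subj-of:phosphorylate"),
    ("receptor", "obj-of:activate"), ("receptor", "subj-of:bind"),
    ("patient", "obj-of:treat"),
]
vectors = build_vectors(pairs)
# Pairwise similarities between all vocabulary items
affinity = {(a, b): cosine(vectors[a], vectors[b])
            for a in vectors for b in vectors}
```

Nouns that share contexts ("kinase", "receptor") end up closer to each other than to unrelated nouns ("patient"), which is what a clustering algorithm then exploits.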

12
  • Corpus
      • 15,000 sentences from BioCreative + 1,800,547 Medline abs
      • parsed using Minipar; vocabulary: 7782 single-token nouns
  • Representation of the base vocabulary
      • vector space, where each item is represented by the set of syntactic configurations it occurs in
      • affinity matrix, where each item is represented as its similarities to other items in the vocabulary
  • Weighting Schemes
      • Pearson's chi-square test
      • Generalized Likelihood Ratio (G-square; Dunning 1993; better with sparse data)
      • first better at common-sense generalisations, second better at domain-specific generalisations
  • Clustering Algorithms
      • k-means clustering for words in vector space (high recall)
      • agglomerative clustering for data in affinity matrix (high prec)
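
The two weighting schemes can be compared on a 2x2 contingency table of a noun against a syntactic context. The sketch below uses the textbook definitions of Pearson's chi-square and the G-square statistic; the counts are made up, and how the statistics are turned into feature weights in the actual system is not stated on the slides.

```python
import math


def expected(table):
    """Expected counts under independence for a 2x2 table [[a, b], [c, d]].

    Assumes all row and column totals are non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]


def pearson_chi_square(table):
    exp = expected(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))


def g_square(table):
    """Generalized likelihood ratio statistic; zero cells contribute 0."""
    exp = expected(table)
    return 2 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                   for i in range(2) for j in range(2) if table[i][j] > 0)


# Hypothetical counts: rows = noun / other nouns, columns = context / other contexts
table = [[10, 90], [40, 860]]
chi2 = pearson_chi_square(table)
g2 = g_square(table)
```

Both statistics are zero when noun and context are independent and grow with the strength of association; they differ mainly in how they behave on small counts.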

13
  • NER (Gene) Tagging
      • McDonald and Pereira's CRF tagger
      • automatically generated 2,164 overlapping term lists
          • incorporated as features in the model
          • binary feature (0/1) for each term list (in=1, not=0)
      • baseline tagger without lists
      • tagger augmented with hand-compiled lists of genes (57,563)
      • tagger augmented with large list of genes obtained via supervised learning (Tanabe and Wilbur Gene.Lexicon: 1,145,913)
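
The binary term-list features can be pictured with a small sketch. The list names, their contents and the lowercasing step are hypothetical; how the features are actually wired into the CRF tagger is not shown on the slides.

```python
def term_list_features(token, term_lists):
    """Return one 0/1 feature per term list: 1 if the token is in the list."""
    word = token.lower()  # lowercasing is an assumption, not from the slides
    return {f"in_list_{name}": int(word in terms)
            for name, terms in term_lists.items()}


# Hypothetical overlapping term lists (the real system used 2,164 of them)
term_lists = {
    "cluster_17": {"kinase", "phosphatase", "protease"},
    "cluster_42": {"kinase", "receptor"},
    "cluster_99": {"patient", "mouse"},
}
features = term_list_features("Kinase", term_lists)
```

Because the lists overlap, a single token can switch on several features at once, as "kinase" does here.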

TRAIN/TEST: 5-fold cross-validation on 394,661 words
of BioCreative (1/5 for training and 4/5 for
testing)

              prec    rec     f-score
Baseline      0.698   0.613   0.653
Unsupervised  0.705   0.622   0.661
Supervised    0.709   0.621   0.662
Manual        0.716   0.631   0.671
14
Protein-Protein Interaction Extraction: A
Supervised Learning Approach
J. Xiao, J. Su, G. Zhou, C. Tan (Inst. for
Infocomm Research, Singapore)
Keywords: relation extraction
Main point: a MaxEnt approach to protein-protein
relation extraction that exploits simple local
features performs better than co-occurrence and
rule-based approaches, achieving nearly 94%
recall and 88% precision on 303 MedLine abstracts.
  • Take home message
      • supervised learning with shallow features works well for protein-protein interaction extraction

15
  • Task: extract pairs of interacting proteins
      • no direction
      • perfect NER (manual annotation)
  • Procedure
      • tokenisation and morphological analysis
      • POS tagging
      • NER
      • sentence analysis (parsing)
      • coreference resolution (including abbreviations and aliases)
      • MaxEnt classifier

16
  • Features
      • Words
          • all words that appear in two protein names
          • words in between two protein names
          • previous/next words in an n-word window (unordered)
      • Overlap
          • number of protein names in between 2 protein names
      • Keywords
          • occurrence of word from keyword list in surroundings
      • Chunks
          • all heads of base phrases in between 2 protein names
          • all heads surrounding the protein name pair
          • all phrase types between 2 protein names
      • Parse Tree
      • Dependency Tree
          • dependency between two proteins
      • Pair of heads of protein names
      • Pair of abbreviations of two proteins
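
A few of the shallow features (words and overlap) can be sketched for one candidate protein pair. Everything here is illustrative: the span format, the function and feature names, the window size and the example sentence are assumptions, and the chunk, keyword and tree features are left out.

```python
def pair_features(tokens, protein_spans, i, j, window=2):
    """Word and overlap features for the protein pair (i, j).

    protein_spans: list of (start, end) token offsets, end exclusive, sorted
    by position; i < j index into that list. Names are hypothetical.
    """
    (s1, e1), (s2, e2) = protein_spans[i], protein_spans[j]
    return {
        "words_in_names": tokens[s1:e1] + tokens[s2:e2],
        "words_between": tokens[e1:s2],
        # unordered windows before the first and after the second name
        "words_before": set(tokens[max(0, s1 - window):s1]),
        "words_after": set(tokens[e2:e2 + window]),
        # overlap: number of other protein names between the two
        "overlap": sum(1 for s, e in protein_spans if s >= e1 and e <= s2),
    }


# Made-up sentence with three annotated protein names
tokens = ["We", "show", "that", "RAD51", "binds", "BRCA2",
          "and", "interacts", "with", "p53"]
protein_spans = [(3, 4), (5, 6), (9, 10)]
feats = pair_features(tokens, protein_spans, 0, 2)
```

A feature dictionary like this would be built for every protein pair in a sentence and passed to the classifier as a positive or negative instance.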

17
  • Experiment and Results
      • corpus: IEPA (Iowa University)
          • 303 Medline abstracts
          • 633 positive instances
          • 1080 negative instances
      • POS tagger trained on GENIA using an HMM model
      • Collins parser
      • 10-fold cross-validation
      • best result: rec 93.9, prec 88, F 90.9

GOOD Features
  - words (esp. surrounding)
  - chunks
  - pairs of protein heads
  - pairs of abbreviations
  - keywords (so and so)
NOT SO GOOD Features
  - overlap
  - parse trees
  - dependency relations
18
Challenges of Information Mining in a
Pharmaceutical Environment
Philippe Sanseau (Glaxo-Smith-Kline, UK)
Main point: Q: How do you see the role of NLP in
your field? A: Excuse me, could someone explain
what NLP is, please.
Take home question: are NLP and pharmaceutical
communities on the same track?