Title: Linguistic techniques for Text Mining
1. Linguistic techniques for Text Mining
- NaCTeM team
- www.nactem.ac.uk
- Sophia Ananiadou
- Chikashi Nobata
- Yutaka Sasaki
- Yoshimasa Tsuruoka
2. From raw (unstructured) text to annotated (structured) text
[Figure: the NLP pipeline, drawing on a lexicon and an ontology, applies part-of-speech tagging, named entity recognition, and deep syntactic parsing to the sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." The output shows a full parse tree (S, VP, NP, PP nodes over POS tags such as NN, IN, VBZ, VBN), the named entities TNF (protein_molecule), BHA (organic_compound), and PMA-stimulated U937 cells (cell_line), and the event "negative regulation".]
3. Basic Steps of Natural Language Processing
- Sentence splitting
- Tokenization
- Part-of-speech tagging
- Shallow parsing
- Named entity recognition
- Syntactic parsing
- (Semantic Role Labeling)
4. Sentence splitting
Input text:
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Split into sentences:
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
5. A heuristic rule for sentence splitting
- Sentence boundary: period + space(s) + capital letter
- Regular expression in Perl:
s/\. ([A-Z])/\.\n\1/g
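A minimal Python sketch of the same heuristic, assuming only the rule above (period + space(s) + capital letter); real splitters need the exception handling discussed on the next slide:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Equivalent of the Perl rule s/\. ([A-Z])/.\n\1/g:
    # break after a period followed by space(s) and a capital letter.
    marked = re.sub(r'\. +([A-Z])', '.\n\\1', text)
    return marked.split('\n')

text = ("Current protocols reduce pro-inflammatory cytokines. However, "
        "cytokine production is important in host defense.")
print(split_sentences(text))
# ['Current protocols reduce pro-inflammatory cytokines.',
#  'However, cytokine production is important in host defense.']
```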
6. Errors
Correct:
IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).
The heuristic wrongly splits after "e.g." because the next token starts with a capital letter:
IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).
- Two solutions
- Add more rules to handle exceptions
- Machine learning
7. Tools for sentence splitting
- JASMINE
- Rule-based
- http://uvdb3.hgc.jp/ALICE/program_download.html
- Scott Piao's splitter
- Rule-based
- http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
- OpenNLP
- Maximum-entropy learning
- https://sourceforge.net/projects/opennlp/
- Needs training data
8. Tokenization
The protein is activated by IL2.
The | protein | is | activated | by | IL2 | .
- Convert a sentence into a sequence of tokens
- Why do we tokenize?
- Because we do not want to treat a sentence as a sequence of characters!
9. Tokenization
The protein is activated by IL2.
The | protein | is | activated | by | IL2 | .
- Tokenizing general English sentences is relatively straightforward
- Use spaces as the boundaries
- Use some heuristics to handle exceptions
10. Tokenisation issues
- Separate possessive endings or abbreviated forms from preceding words
- Mary's → Mary 's
- Mary's → Mary is
- Mary's → Mary has
- Separate punctuation marks and quotes from words
- Mary. → Mary .
- "new" → " new "
11. Tokenization
- Tokenizer.sed: a simple script in sed
- http://www.cis.upenn.edu/~treebank/tokenization.html
- Undesirable tokenization
- original: 1,25(OH)2D3
- tokenized: 1 , 25 ( OH ) 2D3
- Tokenization for biomedical text
- Not straightforward
- Needs a dictionary? Machine learning?
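A rough Python sketch of this style of tokenization (not the actual tokenizer.sed script): detach punctuation and possessive endings, then split on whitespace. Run on the chemical name above, it reproduces exactly the undesirable behaviour the slide warns about:

```python
import re

def tokenize(sentence: str) -> list[str]:
    s = re.sub(r'([.,;:?!()"])', r' \1 ', sentence)  # detach punctuation
    s = re.sub(r"'s\b", " 's", s)                    # Mary's -> Mary 's
    return s.split()

print(tokenize("The protein is activated by IL2."))
# ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
print(tokenize("1,25(OH)2D3"))
# ['1', ',', '25', '(', 'OH', ')', '2D3']  <- the undesirable tokenization
```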
12. Tokenisation problems in Bio-text
- Commas
- 2,6-diaminohexanoic acid
- tricyclo(3.3.1.13,7)decanone
- Four kinds of hyphens
- Syntactic: Calcium-dependent, Hsp-60
- Knocked-out gene: lush-- flies
- Negation: -fever
- Electric charge: Cl-
- (K. Cohen, NAACL 2007)
13. Tokenisation
- Tokenization divides the text into smallest units (usually words), removing punctuation.
- Challenge: what should be done with punctuation that has linguistic meaning?
- Negative charge (Cl-)
- Absence of symptom (-fever)
- Knocked-out gene (Ski-/-)
- Gene name (IL-2 mediated)
- Plus, syntactic uses (insulin-dependent)
- (K. Cohen, NAACL 2007)
14. Part-of-speech tagging
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS
- Assign a part-of-speech tag to each token in a sentence.
15. Part-of-speech tags
- The Penn Treebank tagset
- http://www.cis.upenn.edu/~treebank/
- 45 tags
NN: Noun, singular or mass
NNS: Noun, plural
NNP: Proper noun, singular
NNPS: Proper noun, plural
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBZ: Verb, 3rd person singular present
JJ: Adjective
JJR: Adjective, comparative
JJS: Adjective, superlative
DT: Determiner
CD: Cardinal number
CC: Coordinating conjunction
IN: Preposition or subordinating conjunction
FW: Foreign word
16. Part-of-speech tagging is not easy
- Parts-of-speech are often ambiguous
- We need to look at the context
- But how?
I have to go to school. (go = verb)
I had a go at skiing. (go = noun)
17. Writing rules for part-of-speech tagging
I have to go to school. (go = verb)
I had a go at skiing. (go = noun)
- If the previous word is "to", then it's a verb.
- If the previous word is "a", then it's a noun.
- If the next word is ...
- ...
- Writing rules manually is impossible
18. Learning from examples
Training data:
The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN ./.
training ↓
Machine Learning Algorithm
Applied to unseen text:
We demonstrate that → We/PRP demonstrate/VBP that/IN
19. Part-of-speech tagging with Hidden Markov Models
[Figure: an HMM lattice in which hidden tags emit words; arrows between adjacent tags carry transition probabilities, and arrows from tags to words carry output probabilities.]
20. First-order Hidden Markov Models
- Training
- Estimate the transition and output probabilities
- Counting (+ smoothing)
- Using the tagger (the standard equations are reconstructed below)
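The formulas on the original slide are not preserved in this transcript; the standard first-order HMM model they correspond to is

$$P(w_1 \ldots w_n, t_1 \ldots t_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)$$

Training estimates both probabilities by counting (plus smoothing):

$$P(t \mid t') = \frac{C(t', t)}{C(t')}, \qquad P(w \mid t) = \frac{C(t, w)}{C(t)}$$

and tagging selects $\arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)$, computed efficiently with the Viterbi algorithm.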
21. Machine learning using diverse features
- We want to use diverse types of information when predicting the tag.
He opened it → Verb
Many clues: the word is "opened"; the suffix is "-ed"; the previous word is "He".
22. Machine learning with log-linear models
[Figure: the log-linear model equation, with the feature functions and feature weights labeled; reconstructed below.]
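The lost equation is the standard log-linear (maximum entropy) model, where $f_i$ is a feature function and $\lambda_i$ its weight:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big), \qquad Z(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)$$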
23. Machine learning with log-linear models
- Maximum likelihood estimation
- Find the parameters that maximize the conditional log-likelihood of the training data
- Gradient (the standard form is given below)
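For reference, the gradient of the conditional log-likelihood $L(\lambda) = \sum_j \log p(y_j \mid x_j)$ has the standard form

$$\frac{\partial L}{\partial \lambda_i} = \sum_j \Big( f_i(x_j, y_j) - \mathbb{E}_{p(y \mid x_j)}\big[f_i(x_j, y)\big] \Big),$$

the empirical feature count minus the model expectation, which is exactly what the next slide computes.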
24. Computing likelihood and model expectation
- Example
- Two possible tags: Noun and Verb
- Two types of features: word and suffix
[Figure: the probabilities of tagging "opened" in "He opened it" as Verb or as Noun are computed from the word and suffix feature weights, giving the likelihood and the model expectation of each feature.]
25. Conditional Random Fields (CRFs)
- A single log-linear model on the whole sentence
- The number of classes is HUGE, so it is impossible to do the estimation in a naive way.
26. Conditional Random Fields (CRFs)
- Solution
- Let's restrict the types of features
- You can then use a dynamic programming algorithm that drastically reduces the amount of computation
- Features you can use (in first-order CRFs)
- Features defined on the tag
- Features defined on the adjacent pair of tags
27. Features
- Feature weights are associated with states and edges
[Figure: the Noun/Verb lattice for "He has opened it". A state feature such as (W0=He, Tag=Noun) attaches a weight to a node; an edge feature such as (Tag_left=Noun, Tag_right=Noun) attaches a weight to a transition between adjacent nodes.]
28. A naive way of calculating Z(x)
Enumerate all 2^4 = 16 tag sequences for "He has opened it" and sum their scores:
Noun Noun Noun Noun : 7.2
Verb Noun Noun Noun : 4.1
Noun Noun Noun Verb : 1.3
Verb Noun Noun Verb : 0.8
Noun Noun Verb Noun : 4.5
Verb Noun Verb Noun : 9.7
Noun Noun Verb Verb : 0.9
Verb Noun Verb Verb : 5.5
Noun Verb Noun Noun : 2.3
Verb Verb Noun Noun : 5.7
Noun Verb Noun Verb : 11.2
Verb Verb Noun Verb : 4.3
Noun Verb Verb Noun : 3.4
Verb Verb Verb Noun : 2.2
Noun Verb Verb Verb : 2.5
Verb Verb Verb Verb : 1.9
Sum: Z(x) = 67.5
29. Dynamic programming
- Results of intermediate computation can be reused.
[Figure: the Noun/Verb lattice for "He has opened it" swept left to right by the forward algorithm, accumulating at each node the total score of all partial paths ending there.]
30. Dynamic programming
- Results of intermediate computation can be reused.
[Figure: the same lattice swept right to left by the backward algorithm.]
31. Dynamic programming
- Computing marginal distribution (a code sketch follows)
[Figure: combining the forward and backward scores at a node of the lattice gives the marginal probability of that tag at that position.]
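A minimal Python sketch of the forward computation of Z(x), with invented state and edge scores for the two-tag lattice from the slides; it agrees with the naive enumeration of all 2^4 sequences but needs only O(n·|tags|^2) work:

```python
import itertools
import math

TAGS = ["Noun", "Verb"]
WORDS = ["He", "has", "opened", "it"]

# Hypothetical exponentiated feature scores: one per (position, tag) state
# and one per (tag, tag) edge. Any positive numbers would do here.
state = {(i, t): 1.0 + 0.3 * i + (0.5 if t == "Noun" else 0.2)
         for i in range(len(WORDS)) for t in TAGS}
edge = {(a, b): (1.2 if a == b else 0.8) for a in TAGS for b in TAGS}

def z_naive() -> float:
    """Sum the score of every one of the 2^4 tag sequences (slide 28)."""
    total = 0.0
    for seq in itertools.product(TAGS, repeat=len(WORDS)):
        score = state[(0, seq[0])]
        for i in range(1, len(WORDS)):
            score *= edge[(seq[i - 1], seq[i])] * state[(i, seq[i])]
        total += score
    return total

def z_forward() -> float:
    """Reuse partial sums: alpha[t] = total score of paths ending in tag t."""
    alpha = {t: state[(0, t)] for t in TAGS}
    for i in range(1, len(WORDS)):
        alpha = {t: state[(i, t)] * sum(alpha[s] * edge[(s, t)] for s in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

assert math.isclose(z_naive(), z_forward())
print(z_forward())
```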
32. Maximum entropy learning and Conditional Random Fields
- Maximum entropy learning
- Log-linear modeling + MLE
- Parameter estimation
- Likelihood of each sample
- Model expectation of each feature
- Conditional Random Fields
- Log-linear modeling on the whole sentence
- Features are defined on states and edges
- Dynamic programming
33. POS tagging algorithms
- Performance on the Wall Street Journal corpus
| Method | Training cost | Speed | Accuracy (%) |
| Dependency Net (2003) | Low | Low | 97.2 |
| Conditional Random Fields | High | High | 97.1 |
| Support vector machines (2003) | | | 97.1 |
| Bidirectional MEMM (2005) | Low | | 97.1 |
| Brill's tagger (1995) | Low | | 96.6 |
| HMM (2000) | Very low | High | 96.7 |
34. POS taggers
- Brill's tagger
- http://www.cs.jhu.edu/~brill/
- TnT tagger
- http://www.coli.uni-saarland.de/~thorsten/tnt/
- Stanford tagger
- http://nlp.stanford.edu/software/tagger.shtml
- SVMTool
- http://www.lsi.upc.es/~nlp/SVMTool/
- GENIA tagger
- http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
35. Tagging errors made by a WSJ-trained POS tagger
and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS
by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN
36. Taggers for general text do not work well on biomedical text
Performance of the Brill tagger evaluated on 1000 randomly selected MEDLINE sentences: 86.8% (Smith et al., 2004)
Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus (Tsuruoka et al., 2005):
| Evaluation criterion | Accuracy (%) |
| Exact | 84.4 |
| NNP = NN, NNPS = NNS accepted | 90.0 |
| LS = NN accepted | 91.3 |
| JJ = NN accepted | 94.9 |
37. MedPost (Smith et al., 2004)
- Hidden Markov Models (HMMs)
- Training data
- 5700 sentences randomly selected from various thematic subsets
- Accuracy
- 97.43% (native tagset), 96.9% (Penn tagset)
- Evaluated on 1,000 sentences
- Available from
- ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz
38. Training POS taggers with bio-corpora (Tsuruoka and Tsujii, 2005)
Accuracy (%) on each test corpus:
| Training corpus | WSJ | GENIA | PennBioIE |
| WSJ | 97.2 | 91.6 | 90.5 |
| GENIA | 85.3 | 98.6 | 92.2 |
| PennBioIE | 87.4 | 93.4 | 97.9 |
| WSJ + GENIA | 97.2 | 98.5 | 93.6 |
| WSJ + PennBioIE | 97.2 | 94.0 | 98.0 |
| GENIA + PennBioIE | 88.3 | 98.4 | 97.8 |
| WSJ + GENIA + PennBioIE | 97.2 | 98.4 | 97.9 |
39. Performance on new data
Relative performance evaluated on recent abstracts selected from three journals:
- Nucleic Acid Research (NAR)
- Nature Medicine (NMED)
- Journal of Clinical Investigation (JCI)
| Training corpus | NAR | NMED | JCI | Total (Acc. %) |
| WSJ | 109 | 47 | 102 | 258 (70.9) |
| GENIA | 121 | 74 | 132 | 327 (89.8) |
| PennBioIE | 129 | 65 | 122 | 316 (86.6) |
| WSJ + GENIA | 125 | 74 | 135 | 334 (91.8) |
| WSJ + PennBioIE | 133 | 71 | 133 | 337 (92.6) |
| GENIA + PennBioIE | 128 | 75 | 135 | 338 (92.9) |
| WSJ + GENIA + PennBioIE | 133 | 74 | 139 | 346 (95.1) |
40. Chunking (shallow parsing)
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September] .
- A chunker (shallow parser) segments a sentence into non-recursive phrases.
41. Extracting noun phrases from MEDLINE (Bennett, 1999)
- Rule-based noun phrase extraction
- Tokenization
- Part-of-speech tagging
- Pattern matching
Noun phrase extraction accuracies (%) evaluated on 40 abstracts:
| | FastNPE | NPtool | Chopper | AZ Phraser |
| Recall | 50 | 95 | 97 | 92 |
| Precision | 80 | 96 | 90 | 86 |
42. Chunking with machine learning
- Chunking performance on the Penn Treebank
| Method | Recall | Precision | F-score |
| Winnow (with basic features) (Zhang, 2002) | 93.60 | 93.54 | 93.57 |
| Perceptron (Carreras, 2003) | 93.29 | 94.19 | 93.74 |
| SVM voting (Kudoh, 2003) | 93.92 | 93.89 | 93.91 |
| SVM (Kudo, 2000) | 93.51 | 93.45 | 93.48 |
| Bidirectional MEMM (Tsuruoka, 2005) | 93.70 | 93.70 | 93.70 |
43. Machine learning-based chunking
- Convert a treebank into sentences that are annotated with chunk information (see the BIO-encoding sketch below)
- CoNLL-2000 data set
- http://www.cnts.ua.ac.be/conll2000/chunking/
- The conversion script is available
- Apply a sequence tagging algorithm such as HMM, MEMM, CRF, or Semi-CRF
- YamCha: an SVM-based chunker
- http://www.chasen.org/~taku/software/yamcha/
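For illustration, a small sketch of the usual conversion: chunks become per-token B/I/O tags (the CoNLL-2000 representation), so that any sequence tagger can be applied. The chunk spans below are taken from the slide-40 example:

```python
tokens = ["He", "reckons", "the", "current", "account", "deficit"]
chunks = [("NP", 0, 1), ("VP", 1, 2), ("NP", 2, 6)]  # (label, start, end)

# B-X marks the first token of a chunk of type X, I-X its continuation,
# O any token outside every chunk.
bio = ["O"] * len(tokens)
for label, start, end in chunks:
    bio[start] = "B-" + label
    for i in range(start + 1, end):
        bio[i] = "I-" + label

print(list(zip(tokens, bio)))
# [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'),
#  ('current', 'I-NP'), ('account', 'I-NP'), ('deficit', 'I-NP')]
```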
44. GENIA tagger
- Algorithm: Bidirectional MEMM
- POS tagging
- Trained on WSJ, GENIA and Penn BioIE
- Accuracy: 97-98%
- Shallow parsing
- Trained on WSJ and GENIA
- Accuracy: 90-94%
- Can output base forms
- Available from
- http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
45. Named-Entity Recognition
We have shown that [protein interleukin-1] ([protein IL-1]) and [protein IL-2] control [DNA IL-2 receptor alpha (IL-2R alpha) gene] transcription in [cell_line CD4-CD8- murine T lymphocyte precursors].
- Recognize named entities in a sentence
- Gene/protein names
- Protein, DNA, RNA, cell_line, cell_type
46. Performance of biomedical NE recognition
- Shared task data for the COLING 2004 BioNLP workshop
- Entity types: protein, DNA, RNA, cell_type, and cell_line
| Method | Recall | Precision | F-score |
| SVM + HMM (Zhou, 2004) | 76.0 | 69.4 | 72.6 |
| Semi-Markov CRFs (in prep.) | 72.7 | 70.4 | 71.5 |
| Two-Phase (Kim, 2005) | 72.8 | 69.7 | 71.2 |
| Sliding Window (in prep.) | 71.5 | 70.2 | 70.8 |
| CRF (Settles, 2005) | 72.0 | 69.1 | 70.5 |
| MEMM (Finkel, 2004) | 71.6 | 68.6 | 70.1 |
47. Features
Classification models and main features used in NLPBA (Kim, 2004). Column order: CM lx af or sh gn gz po np sy tr ab ca do pa pr ext. (the per-column alignment of the feature marks is not preserved in this transcript):
Zho: SH, x x x x x x x x x
Fin: M, x x x x x x x x x x, B,W
Set: C, x x x x (x) (x) x, (W)
Son: SC, x x x x x, V
Zha: H, x x, M
Classification Model (CM): S = SVM, H = HMM, M = MEMM, C = CRF.
Features: lx = lexical features, af = affix information (character n-grams), or = orthographic information, sh = word shapes, gn = gene sequence, gz = gazetteers, po = part-of-speech tags, np = noun phrase tags, sy = syntactic tags, tr = word triggers, ab = abbreviations, ca = cascaded entities, do = global document information, pa = parentheses handling, pr = previously predicted entity tags.
External resources (ext.): B = British National Corpus, W = WWW, V = virtually generated corpus, M = MEDLINE.
48. CFG parsing
[Figure: a phrase-structure (CFG) parse tree with S, VP, NP, NP, and QP nodes over the tagged sentence:]
Estimated/VBN volume/NN was/VBD a/DT light/JJ 2.4/CD million/CD ounces/NNS ./.
49. Phrase structure + head information
[Figure: the same parse tree with the head child of each node marked.]
Estimated/VBN volume/NN was/VBD a/DT light/JJ 2.4/CD million/CD ounces/NNS ./.
50. Dependency relations
[Figure: the head-marked tree converted into word-to-word dependency arcs over the same sentence.]
Estimated/VBN volume/NN was/VBD a/DT light/JJ 2.4/CD million/CD ounces/NNS ./.
51. CFG parsing algorithms
- Performance on the Penn Treebank
| Method | LR | LP | F-score |
| Generative model (Collins, 1999) | 88.1 | 88.3 | 88.2 |
| Maxent-inspired (Charniak, 2000) | 89.6 | 89.5 | 89.5 |
| Simple Synchrony Networks (Henderson, 2004) | 89.8 | 90.4 | 90.1 |
| Data Oriented Parsing (Bod, 2003) | 90.8 | 90.7 | 90.7 |
| Re-ranking (Johnson, 2005) | | | 91.0 |
52. CFG parsers
- Collins parser
- http://people.csail.mit.edu/mcollins/code.html
- Bikel's parser
- http://www.cis.upenn.edu/~dbikel/software.html#stat-parser
- Charniak parser
- http://www.cs.brown.edu/people/ec/
- Re-ranking parser
- http://www.cog.brown.edu:16080/~mj/Software.htm
- SSN parser
- http://homepages.inf.ed.ac.uk/jhender6/parser/ssn_parser.html
53. Parsing biomedical documents
- CFG parsing accuracies on the GENIA treebank (Clegg, 2005)
| Parser | LR | LP | F-score |
| Bikel 0.9.8 | 77.43 | 81.33 | 79.33 |
| Charniak | 76.05 | 77.12 | 76.58 |
| Collins model 2 | 74.49 | 81.30 | 77.75 |
- In order to improve performance:
- Unsupervised parse combination (Clegg, 2005)
- Use lexical information (Lease, 2005)
- 14.2% reduction in error
54. Parse tree
[Figure: a parse tree (nodes such as NP1, VP15, DT2, AJ5) for the sentence "A normal serum CRP measurement does not exclude deep vein thrombosis."]
55. Semantic structure
Predicate-argument relations
[Figure: the same parse tree for "A normal serum CRP measurement does not exclude deep vein thrombosis." overlaid with predicate-argument arcs (ARG1, ARG2, MOD); e.g. "exclude" takes the measurement NP as ARG1 and the thrombosis NP as ARG2, while "not" and the adjectives contribute MOD relations.]
56. Abstraction of surface expressions
57. HPSG parsing
- HPSG
- A few schemas
- Many lexical entries
- Deep syntactic analysis
- Grammar
- Corpus-based grammar construction (Miyao et al., 2004)
- Parser
- Beam search (Tsuruoka et al.)
[Figure: an HPSG derivation for "Mary walked slowly". The lexical entry for "walked" is [HEAD: verb, SUBJ: <noun>, COMPS: <>]; the head-modifier schema combines it with the adverb "slowly" ([HEAD: adv, MOD: verb]); the subject-head schema then combines "Mary" ([HEAD: noun, SUBJ: <>, COMPS: <>]) with the result, yielding [HEAD: verb, SUBJ: <>, COMPS: <>].]
58. Experimental results
- Training set: Penn Treebank Sections 02-21 (39,832 sentences)
- Test set: Penn Treebank Section 23 (< 40 words, 2,164 sentences)
- Accuracy of predicate-argument relations (i.e., the red arrows) is measured
| Precision | Recall | F-score |
| 87.9 | 86.9 | 87.4 |
59. Parsing MEDLINE with HPSG
- Enju
- A wide-coverage HPSG parser
- http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
60. Extraction of protein-protein interactions: predicate-argument relations + SVM (1)
CD4 protein interacts with non-polymorphic regions of MHCII . (ENTITY1 = CD4 protein, ENTITY2 = MHCII)
- Extraction patterns based on predicate-argument relations
[Figure: the predicate-argument structure of the sentence, with arg1/arg2/argM arcs linking "interact" to ENTITY1 (CD4 protein) and, through "with ... region of", to ENTITY2 (MHCII).]
- SVM learning with predicate-argument patterns (an illustrative sketch follows)
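A purely illustrative sketch of the idea: the deep parser reduces the sentence to predicate-argument triples, and an extraction pattern fires on the triples rather than on the surface word order, so syntactic variations such as passives map to the same pattern. The triples and the pattern below are invented for the example:

```python
# Predicate-argument triples a deep parser might produce for
# "CD4 protein interacts with non-polymorphic regions of MHCII."
triples = [
    ("interact", "arg1", "CD4 protein"),   # ENTITY1 fills arg1
    ("interact", "argM", "region"),        # via "with ... regions"
    ("of", "arg1", "region"),
    ("of", "arg2", "MHCII"),               # ENTITY2
]

def matches_interaction_pattern(triples, entity1, entity2):
    # Pattern: some predicate has ENTITY1 as its arg1, and ENTITY2 appears
    # as an arg2/argM in the same structure. A real system would traverse
    # the full argument graph and feed the matched pattern to an SVM.
    has_e1 = any(r == "arg1" and entity1 in a for _, r, a in triples)
    has_e2 = any(r in ("arg2", "argM") and entity2 in a for _, r, a in triples)
    return has_e1 and has_e2

print(matches_interaction_pattern(triples, "CD4", "MHCII"))  # True
```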
61. Text Mining for Biology
- MEDIE: an interactive intelligent IR system retrieving events
- Performs a semantic search
- Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them
62. MEDIE system overview
[Figure: the input textbase is processed by a deep parser and an entity recognizer into a semantically-annotated textbase; a region-algebra search engine answers queries against it and returns search results.]
65. Service extracting interactions
- Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them
- System components
- MEDIE
- Extraction of protein-protein interactions
- Multi-window interface on a browser
- UTokyo: NaCTeM self-funded partner
- https://www-tsujii.is.s.u-tokyo.ac.jp/info-pubmed/
66. Info-PubMed
- Helps biologists to search for their interests
- Genes, proteins, their interactions, and evidence sentences
- Extracted from MEDLINE (about 15 million abstracts of biomedical papers)
- Uses many of the NLP techniques explained above in order to achieve high precision of retrieval
67. Flow Chart
Input → Output:
- token "TNF" → gene or protein keywords
- gene or protein keywords → gene or protein entities (e.g., Gene: TNF)
- gene or protein entity → interactions around the given gene
- interaction (e.g., TNF and IL6) → evidence sentences describing the given interaction
68. Techniques (1/2)
- Biomedical entity recognition in abstract sentences (a minimal sketch follows)
- Prepare a gene dictionary
- String match
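A minimal sketch of the dictionary string-matching step, with an invented three-entry dictionary; longer names are matched first so that nested names do not split:

```python
GENE_DICT = ["IL-2 receptor alpha", "TNF", "IL6"]  # invented entries

def find_entities(sentence: str):
    """Return (start, end, name) spans of dictionary names in the sentence."""
    hits = []
    for name in sorted(GENE_DICT, key=len, reverse=True):
        start = sentence.find(name)
        while start != -1:
            hits.append((start, start + len(name), name))
            start = sentence.find(name, start + len(name))
    return hits

print(find_entities("TNF interacts with IL6."))
# [(0, 3, 'TNF'), (19, 22, 'IL6')]
```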
69. Techniques (2/2)
- Extract sentences describing protein-protein interaction
- Deep parser based on HPSG syntax
- Can detect semantic relations between phrases
- Domain-dependent pattern recognition
- Can learn and expand source patterns
- By using the result of the deep parser, it can extract semantically true patterns
- Not affected by syntactic variations
70. Info-PubMed