Title: Linguistic techniques for Text Mining
1. Linguistic techniques for Text Mining
- NaCTeM team
- www.nactem.ac.uk
- Sophia Ananiadou
- Chikashi Nobata
- Yutaka Sasaki
- Yoshimasa Tsuruoka
2. From raw (unstructured) text to annotated (structured) text
[Figure: the NLP pipeline, drawing on a lexicon and an ontology, applies part-of-speech tagging, named entity recognition, and deep syntactic parsing to the sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." The output shows a full parse tree (S, VP, NP, PP nodes over POS tags such as NN, IN, VBZ, VBN), the named entities TNF (protein_molecule), BHA (organic_compound), and PMA-stimulated U937 cells (cell_line), and the event "negative regulation".]
3. Basic Steps of Natural Language Processing
- Sentence splitting
- Tokenization
- Part-of-speech tagging
- Shallow parsing
- Named entity recognition
- Syntactic parsing
- (Semantic Role Labeling)
4. Sentence splitting
Input text:
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Split into sentences:
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
5. A heuristic rule for sentence splitting
- Sentence boundary: period + space(s) + capital letter
- Regular expression in Perl:
s/\. ([A-Z])/\.\n\1/g
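A minimal Python sketch of the same heuristic, assuming only the rule above (period + space(s) + capital letter); real splitters need the exception handling discussed on the next slide:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Equivalent of the Perl rule s/\. ([A-Z])/.\n\1/g:
    # break after a period followed by space(s) and a capital letter.
    marked = re.sub(r'\. +([A-Z])', '.\n\\1', text)
    return marked.split('\n')

text = ("Current protocols reduce pro-inflammatory cytokines. However, "
        "cytokine production is important in host defense.")
print(split_sentences(text))
# ['Current protocols reduce pro-inflammatory cytokines.',
#  'However, cytokine production is important in host defense.']
```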
6. Errors
Correct:
IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).
The heuristic wrongly splits after "e.g." because the next token starts with a capital letter:
IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).
- Two solutions
- Add more rules to handle exceptions
- Machine learning
7. Tools for sentence splitting
- JASMINE
- Rule-based
- http://uvdb3.hgc.jp/ALICE/program_download.html
- Scott Piao's splitter
- Rule-based
- http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
- OpenNLP
- Maximum-entropy learning
- https://sourceforge.net/projects/opennlp/
- Needs training data
8. Tokenization
The protein is activated by IL2.
The | protein | is | activated | by | IL2 | .
- Convert a sentence into a sequence of tokens
- Why do we tokenize?
- Because we do not want to treat a sentence as a sequence of characters!
9. Tokenization
The protein is activated by IL2.
The | protein | is | activated | by | IL2 | .
- Tokenizing general English sentences is relatively straightforward
- Use spaces as the boundaries
- Use some heuristics to handle exceptions
10. Tokenisation issues
- Separate possessive endings or abbreviated forms from preceding words
- Mary's → Mary 's
- Mary's → Mary is
- Mary's → Mary has
- Separate punctuation marks and quotes from words
- Mary. → Mary .
- "new" → " new "
11. Tokenization
- Tokenizer.sed: a simple script in sed
- http://www.cis.upenn.edu/~treebank/tokenization.html
- Undesirable tokenization
- original: 1,25(OH)2D3
- tokenized: 1 , 25 ( OH ) 2D3
- Tokenization for biomedical text
- Not straightforward
- Needs a dictionary? Machine learning?
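A rough Python sketch of this style of tokenization (not the actual tokenizer.sed script): detach punctuation and possessive endings, then split on whitespace. Run on the chemical name above, it reproduces exactly the undesirable behaviour the slide warns about:

```python
import re

def tokenize(sentence: str) -> list[str]:
    s = re.sub(r'([.,;:?!()"])', r' \1 ', sentence)  # detach punctuation
    s = re.sub(r"'s\b", " 's", s)                    # Mary's -> Mary 's
    return s.split()

print(tokenize("The protein is activated by IL2."))
# ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
print(tokenize("1,25(OH)2D3"))
# ['1', ',', '25', '(', 'OH', ')', '2D3']  <- the undesirable tokenization
```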
12. Tokenisation problems in Bio-text
- Commas
- 2,6-diaminohexanoic acid
- tricyclo(3.3.1.13,7)decanone
- Four kinds of hyphens
- Syntactic: Calcium-dependent, Hsp-60
- Knocked-out gene: lush-- flies
- Negation: -fever
- Electric charge: Cl-
- (K. Cohen, NAACL 2007)
13. Tokenisation
- Tokenization divides the text into smallest units (usually words), removing punctuation.
- Challenge: what should be done with punctuation that has linguistic meaning?
- Negative charge (Cl-)
- Absence of symptom (-fever)
- Knocked-out gene (Ski-/-)
- Gene name (IL-2 mediated)
- Plus, syntactic uses (insulin-dependent)
- (K. Cohen, NAACL 2007)
14. Part-of-speech tagging
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS
- Assign a part-of-speech tag to each token in a sentence.
15. Part-of-speech tags
- The Penn Treebank tagset
- http://www.cis.upenn.edu/~treebank/
- 45 tags
NN: Noun, singular or mass
NNS: Noun, plural
NNP: Proper noun, singular
NNPS: Proper noun, plural
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBZ: Verb, 3rd person singular present
JJ: Adjective
JJR: Adjective, comparative
JJS: Adjective, superlative
DT: Determiner
CD: Cardinal number
CC: Coordinating conjunction
IN: Preposition or subordinating conjunction
FW: Foreign word
16. Part-of-speech tagging is not easy
- Parts-of-speech are often ambiguous
- We need to look at the context
- But how?
I have to go to school. (go = verb)
I had a go at skiing. (go = noun)
17. Writing rules for part-of-speech tagging
I have to go to school. (go = verb)
I had a go at skiing. (go = noun)
- If the previous word is "to", then it's a verb.
- If the previous word is "a", then it's a noun.
- If the next word is ...
- ...
- Writing rules manually is impossible
18. Learning from examples
Training data:
The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN ./.
training ↓
Machine Learning Algorithm
Applied to unseen text:
We demonstrate that → We/PRP demonstrate/VBP that/IN
19. Part-of-speech tagging with Hidden Markov Models
[Figure: an HMM lattice in which hidden tags emit words; arrows between adjacent tags carry transition probabilities, and arrows from tags to words carry output probabilities.]
20. First-order Hidden Markov Models
- Training
- Estimate the transition and output probabilities
- Counting (+ smoothing)
- Using the tagger (the standard equations are reconstructed below)
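The formulas on the original slide are not preserved in this transcript; the standard first-order HMM model they correspond to is

$$P(w_1 \ldots w_n, t_1 \ldots t_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)$$

Training estimates both probabilities by counting (plus smoothing):

$$P(t \mid t') = \frac{C(t', t)}{C(t')}, \qquad P(w \mid t) = \frac{C(t, w)}{C(t)}$$

and tagging selects $\arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)$, computed efficiently with the Viterbi algorithm.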
21. Machine learning using diverse features
- We want to use diverse types of information when predicting the tag.
He opened it → Verb
Many clues: the word is "opened"; the suffix is "-ed"; the previous word is "He".
22. Machine learning with log-linear models
[Figure: the log-linear model equation, with the feature functions and feature weights labeled; reconstructed below.]
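The lost equation is the standard log-linear (maximum entropy) model, where $f_i$ is a feature function and $\lambda_i$ its weight:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big), \qquad Z(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)$$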
23. Machine learning with log-linear models
- Maximum likelihood estimation
- Find the parameters that maximize the conditional log-likelihood of the training data
- Gradient (the standard form is given below)
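For reference, the gradient of the conditional log-likelihood $L(\lambda) = \sum_j \log p(y_j \mid x_j)$ has the standard form

$$\frac{\partial L}{\partial \lambda_i} = \sum_j \Big( f_i(x_j, y_j) - \mathbb{E}_{p(y \mid x_j)}\big[f_i(x_j, y)\big] \Big),$$

the empirical feature count minus the model expectation, which is exactly what the next slide computes.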
24. Computing likelihood and model expectation
- Example
- Two possible tags: Noun and Verb
- Two types of features: word and suffix
[Figure: the probabilities of tagging "opened" in "He opened it" as Verb or as Noun are computed from the word and suffix feature weights, giving the likelihood and the model expectation of each feature.]
25. Conditional Random Fields (CRFs)
- A single log-linear model on the whole sentence
- The number of classes is HUGE, so it is impossible to do the estimation in a naive way.
26. Conditional Random Fields (CRFs)
- Solution
- Let's restrict the types of features
- You can then use a dynamic programming algorithm that drastically reduces the amount of computation
- Features you can use (in first-order CRFs)
- Features defined on the tag
- Features defined on the adjacent pair of tags
27. Features
- Feature weights are associated with states and edges
[Figure: the Noun/Verb lattice for "He has opened it". A state feature such as (W0=He, Tag=Noun) attaches a weight to a node; an edge feature such as (Tag_left=Noun, Tag_right=Noun) attaches a weight to a transition between adjacent nodes.]
28. A naive way of calculating Z(x)
Enumerate all 2^4 = 16 tag sequences for "He has opened it" and sum their scores:
Noun Noun Noun Noun : 7.2
Verb Noun Noun Noun : 4.1
Noun Noun Noun Verb : 1.3
Verb Noun Noun Verb : 0.8
Noun Noun Verb Noun : 4.5
Verb Noun Verb Noun : 9.7
Noun Noun Verb Verb : 0.9
Verb Noun Verb Verb : 5.5
Noun Verb Noun Noun : 2.3
Verb Verb Noun Noun : 5.7
Noun Verb Noun Verb : 11.2
Verb Verb Noun Verb : 4.3
Noun Verb Verb Noun : 3.4
Verb Verb Verb Noun : 2.2
Noun Verb Verb Verb : 2.5
Verb Verb Verb Verb : 1.9
Sum: Z(x) = 67.5
29. Dynamic programming
- Results of intermediate computation can be reused.
[Figure: the Noun/Verb lattice for "He has opened it" swept left to right by the forward algorithm, accumulating at each node the total score of all partial paths ending there.]
30. Dynamic programming
- Results of intermediate computation can be reused.
[Figure: the same lattice swept right to left by the backward algorithm.]
31. Dynamic programming
- Computing marginal distribution (a code sketch follows)
[Figure: combining the forward and backward scores at a node of the lattice gives the marginal probability of that tag at that position.]
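A minimal Python sketch of the forward computation of Z(x), with invented state and edge scores for the two-tag lattice from the slides; it agrees with the naive enumeration of all 2^4 sequences but needs only O(n·|tags|^2) work:

```python
import itertools
import math

TAGS = ["Noun", "Verb"]
WORDS = ["He", "has", "opened", "it"]

# Hypothetical exponentiated feature scores: one per (position, tag) state
# and one per (tag, tag) edge. Any positive numbers would do here.
state = {(i, t): 1.0 + 0.3 * i + (0.5 if t == "Noun" else 0.2)
         for i in range(len(WORDS)) for t in TAGS}
edge = {(a, b): (1.2 if a == b else 0.8) for a in TAGS for b in TAGS}

def z_naive() -> float:
    """Sum the score of every one of the 2^4 tag sequences (slide 28)."""
    total = 0.0
    for seq in itertools.product(TAGS, repeat=len(WORDS)):
        score = state[(0, seq[0])]
        for i in range(1, len(WORDS)):
            score *= edge[(seq[i - 1], seq[i])] * state[(i, seq[i])]
        total += score
    return total

def z_forward() -> float:
    """Reuse partial sums: alpha[t] = total score of paths ending in tag t."""
    alpha = {t: state[(0, t)] for t in TAGS}
    for i in range(1, len(WORDS)):
        alpha = {t: state[(i, t)] * sum(alpha[s] * edge[(s, t)] for s in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

assert math.isclose(z_naive(), z_forward())
print(z_forward())
```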
32. Maximum entropy learning and Conditional Random Fields
- Maximum entropy learning
- Log-linear modeling + MLE
- Parameter estimation
- Likelihood of each sample
- Model expectation of each feature
- Conditional Random Fields
- Log-linear modeling on the whole sentence
- Features are defined on states and edges
- Dynamic programming
33. POS tagging algorithms
- Performance on the Wall Street Journal corpus
| Method | Training cost | Speed | Accuracy (%) |
| Dependency Net (2003) | Low | Low | 97.2 |
| Conditional Random Fields | High | High | 97.1 |
| Support vector machines (2003) | | | 97.1 |
| Bidirectional MEMM (2005) | Low | | 97.1 |
| Brill's tagger (1995) | Low | | 96.6 |
| HMM (2000) | Very low | High | 96.7 |
34. POS taggers
- Brill's tagger
- http://www.cs.jhu.edu/~brill/
- TnT tagger
- http://www.coli.uni-saarland.de/~thorsten/tnt/
- Stanford tagger
- http://nlp.stanford.edu/software/tagger.shtml
- SVMTool
- http://www.lsi.upc.es/~nlp/SVMTool/
- GENIA tagger
- http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
35. Tagging errors made by a WSJ-trained POS tagger
and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS
by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN
36. Taggers for general text do not work well on biomedical text
Performance of the Brill tagger evaluated on 1000 randomly selected MEDLINE sentences: 86.8% (Smith et al., 2004)
Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus (Tsuruoka et al., 2005):
| Evaluation criterion | Accuracy (%) |
| Exact | 84.4 |
| NNP = NN, NNPS = NNS accepted | 90.0 |
| LS = NN accepted | 91.3 |
| JJ = NN accepted | 94.9 |
37. MedPost (Smith et al., 2004)
- Hidden Markov Models (HMMs)
- Training data
- 5700 sentences randomly selected from various thematic subsets
- Accuracy
- 97.43% (native tagset), 96.9% (Penn tagset)
- Evaluated on 1,000 sentences
- Available from
- ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz
38. Training POS taggers with bio-corpora (Tsuruoka and Tsujii, 2005)
Accuracy (%) on each test corpus:
| Training corpus | WSJ | GENIA | PennBioIE |
| WSJ | 97.2 | 91.6 | 90.5 |
| GENIA | 85.3 | 98.6 | 92.2 |
| PennBioIE | 87.4 | 93.4 | 97.9 |
| WSJ + GENIA | 97.2 | 98.5 | 93.6 |
| WSJ + PennBioIE | 97.2 | 94.0 | 98.0 |
| GENIA + PennBioIE | 88.3 | 98.4 | 97.8 |
| WSJ + GENIA + PennBioIE | 97.2 | 98.4 | 97.9 |
39. Performance on new data
Relative performance evaluated on recent abstracts selected from three journals:
- Nucleic Acid Research (NAR)
- Nature Medicine (NMED)
- Journal of Clinical Investigation (JCI)
| Training corpus | NAR | NMED | JCI | Total (Acc. %) |
| WSJ | 109 | 47 | 102 | 258 (70.9) |
| GENIA | 121 | 74 | 132 | 327 (89.8) |
| PennBioIE | 129 | 65 | 122 | 316 (86.6) |
| WSJ + GENIA | 125 | 74 | 135 | 334 (91.8) |
| WSJ + PennBioIE | 133 | 71 | 133 | 337 (92.6) |
| GENIA + PennBioIE | 128 | 75 | 135 | 338 (92.9) |
| WSJ + GENIA + PennBioIE | 133 | 74 | 139 | 346 (95.1) |
40. Chunking (shallow parsing)
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September] .
- A chunker (shallow parser) segments a sentence into non-recursive phrases.
41. Extracting noun phrases from MEDLINE (Bennett, 1999)
- Rule-based noun phrase extraction
- Tokenization
- Part-of-speech tagging
- Pattern matching
Noun phrase extraction accuracies (%) evaluated on 40 abstracts:
| | FastNPE | NPtool | Chopper | AZ Phraser |
| Recall | 50 | 95 | 97 | 92 |
| Precision | 80 | 96 | 90 | 86 |
42. Chunking with machine learning
- Chunking performance on the Penn Treebank
| Method | Recall | Precision | F-score |
| Winnow (with basic features) (Zhang, 2002) | 93.60 | 93.54 | 93.57 |
| Perceptron (Carreras, 2003) | 93.29 | 94.19 | 93.74 |
| SVM voting (Kudoh, 2003) | 93.92 | 93.89 | 93.91 |
| SVM (Kudo, 2000) | 93.51 | 93.45 | 93.48 |
| Bidirectional MEMM (Tsuruoka, 2005) | 93.70 | 93.70 | 93.70 |
43. Machine learning-based chunking
- Convert a treebank into sentences that are annotated with chunk information (see the BIO-encoding sketch below)
- CoNLL-2000 data set
- http://www.cnts.ua.ac.be/conll2000/chunking/
- The conversion script is available
- Apply a sequence tagging algorithm such as HMM, MEMM, CRF, or Semi-CRF
- YamCha: an SVM-based chunker
- http://www.chasen.org/~taku/software/yamcha/
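For illustration, a small sketch of the usual conversion: chunks become per-token B/I/O tags (the CoNLL-2000 representation), so that any sequence tagger can be applied. The chunk spans below are taken from the slide-40 example:

```python
tokens = ["He", "reckons", "the", "current", "account", "deficit"]
chunks = [("NP", 0, 1), ("VP", 1, 2), ("NP", 2, 6)]  # (label, start, end)

# B-X marks the first token of a chunk of type X, I-X its continuation,
# O any token outside every chunk.
bio = ["O"] * len(tokens)
for label, start, end in chunks:
    bio[start] = "B-" + label
    for i in range(start + 1, end):
        bio[i] = "I-" + label

print(list(zip(tokens, bio)))
# [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'),
#  ('current', 'I-NP'), ('account', 'I-NP'), ('deficit', 'I-NP')]
```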
44. GENIA tagger
- Algorithm: Bidirectional MEMM
- POS tagging
- Trained on WSJ, GENIA and Penn BioIE
- Accuracy: 97-98%
- Shallow parsing
- Trained on WSJ and GENIA
- Accuracy: 90-94%
- Can output base forms
- Available from
- http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
45. Named-Entity Recognition
We have shown that [protein interleukin-1] ([protein IL-1]) and [protein IL-2] control [DNA IL-2 receptor alpha (IL-2R alpha) gene] transcription in [cell_line CD4-CD8- murine T lymphocyte precursors].
- Recognize named entities in a sentence
- Gene/protein names
- Protein, DNA, RNA, cell_line, cell_type
46. Performance of biomedical NE recognition
- Shared task data for the COLING 2004 BioNLP workshop
- Entity types: protein, DNA, RNA, cell_type, and cell_line
| Method | Recall | Precision | F-score |
| SVM + HMM (Zhou, 2004) | 76.0 | 69.4 | 72.6 |
| Semi-Markov CRFs (in prep.) | 72.7 | 70.4 | 71.5 |
| Two-Phase (Kim, 2005) | 72.8 | 69.7 | 71.2 |
| Sliding Window (in prep.) | 71.5 | 70.2 | 70.8 |
| CRF (Settles, 2005) | 72.0 | 69.1 | 70.5 |
| MEMM (Finkel, 2004) | 71.6 | 68.6 | 70.1 |
47. Features
Classification models and main features used in NLPBA (Kim, 2004). Column order: CM lx af or sh gn gz po np sy tr ab ca do pa pr ext. (the per-column alignment of the feature marks is not preserved in this transcript):
Zho: SH, x x x x x x x x x
Fin: M, x x x x x x x x x x, B,W
Set: C, x x x x (x) (x) x, (W)
Son: SC, x x x x x, V
Zha: H, x x, M
Classification Model (CM): S = SVM, H = HMM, M = MEMM, C = CRF.
Features: lx = lexical features, af = affix information (character n-grams), or = orthographic information, sh = word shapes, gn = gene sequence, gz = gazetteers, po = part-of-speech tags, np = noun phrase tags, sy = syntactic tags, tr = word triggers, ab = abbreviations, ca = cascaded entities, do = global document information, pa = parentheses handling, pr = previously predicted entity tags.
External resources (ext.): B = British National Corpus, W = WWW, V = virtually generated corpus, M = MEDLINE.
48. CFG parsing
[Figure: a phrase-structure (CFG) parse tree with S, VP, NP, NP, and QP nodes over the tagged sentence:]
Estimated/VBN volume/NN was/VBD a/DT light/JJ 2.4/CD million/CD ounces/NNS ./.
49. Phrase structure + head information
[Figure: the same parse tree with the head child of each node marked.]
Estimated/VBN volume/NN was/VBD a/DT light/JJ 2.4/CD million/CD ounces/NNS ./.
50. Dependency relations
[Figure: the head-marked tree converted into word-to-word dependency arcs over the same sentence.]
Estimated/VBN volume/NN was/VBD a/DT light/JJ 2.4/CD million/CD ounces/NNS ./.
51. CFG parsing algorithms
- Performance on the Penn Treebank
| Method | LR | LP | F-score |
| Generative model (Collins, 1999) | 88.1 | 88.3 | 88.2 |
| Maxent-inspired (Charniak, 2000) | 89.6 | 89.5 | 89.5 |
| Simple Synchrony Networks (Henderson, 2004) | 89.8 | 90.4 | 90.1 |
| Data Oriented Parsing (Bod, 2003) | 90.8 | 90.7 | 90.7 |
| Re-ranking (Johnson, 2005) | | | 91.0 |
52. CFG parsers
- Collins parser
- http://people.csail.mit.edu/mcollins/code.html
- Bikel's parser
- http://www.cis.upenn.edu/~dbikel/software.html#stat-parser
- Charniak parser
- http://www.cs.brown.edu/people/ec/
- Re-ranking parser
- http://www.cog.brown.edu:16080/~mj/Software.htm
- SSN parser
- http://homepages.inf.ed.ac.uk/jhender6/parser/ssn_parser.html
53. Parsing biomedical documents
- CFG parsing accuracies on the GENIA treebank (Clegg, 2005)
| Parser | LR | LP | F-score |
| Bikel 0.9.8 | 77.43 | 81.33 | 79.33 |
| Charniak | 76.05 | 77.12 | 76.58 |
| Collins model 2 | 74.49 | 81.30 | 77.75 |
- In order to improve performance:
- Unsupervised parse combination (Clegg, 2005)
- Use lexical information (Lease, 2005)
- 14.2% reduction in error
54. Parse tree
[Figure: a parse tree (nodes such as NP1, VP15, DT2, AJ5) for the sentence "A normal serum CRP measurement does not exclude deep vein thrombosis."]
55. Semantic structure
Predicate-argument relations
[Figure: the same parse tree for "A normal serum CRP measurement does not exclude deep vein thrombosis." overlaid with predicate-argument arcs (ARG1, ARG2, MOD); e.g. "exclude" takes the measurement NP as ARG1 and the thrombosis NP as ARG2, while "not" and the adjectives contribute MOD relations.]
56. Abstraction of surface expressions
57. HPSG parsing
- HPSG
- A few schemas
- Many lexical entries
- Deep syntactic analysis
- Grammar
- Corpus-based grammar construction (Miyao et al., 2004)
- Parser
- Beam search (Tsuruoka et al.)
[Figure: an HPSG derivation for "Mary walked slowly". The lexical entry for "walked" is [HEAD: verb, SUBJ: <noun>, COMPS: <>]; the head-modifier schema combines it with the adverb "slowly" ([HEAD: adv, MOD: verb]); the subject-head schema then combines "Mary" ([HEAD: noun, SUBJ: <>, COMPS: <>]) with the result, yielding [HEAD: verb, SUBJ: <>, COMPS: <>].]
58. Experimental results
- Training set: Penn Treebank Sections 02-21 (39,832 sentences)
- Test set: Penn Treebank Section 23 (< 40 words, 2,164 sentences)
- Accuracy of predicate-argument relations (i.e., the red arrows) is measured
| Precision | Recall | F-score |
| 87.9 | 86.9 | 87.4 |
59. Parsing MEDLINE with HPSG
- Enju
- A wide-coverage HPSG parser
- http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
60. Extraction of protein-protein interactions: predicate-argument relations + SVM (1)
CD4 protein interacts with non-polymorphic regions of MHCII . (ENTITY1 = CD4 protein, ENTITY2 = MHCII)
- Extraction patterns based on predicate-argument relations
[Figure: the predicate-argument structure of the sentence, with arg1/arg2/argM arcs linking "interact" to ENTITY1 (CD4 protein) and, through "with ... region of", to ENTITY2 (MHCII).]
- SVM learning with predicate-argument patterns (an illustrative sketch follows)
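A purely illustrative sketch of the idea: the deep parser reduces the sentence to predicate-argument triples, and an extraction pattern fires on the triples rather than on the surface word order, so syntactic variations such as passives map to the same pattern. The triples and the pattern below are invented for the example:

```python
# Predicate-argument triples a deep parser might produce for
# "CD4 protein interacts with non-polymorphic regions of MHCII."
triples = [
    ("interact", "arg1", "CD4 protein"),   # ENTITY1 fills arg1
    ("interact", "argM", "region"),        # via "with ... regions"
    ("of", "arg1", "region"),
    ("of", "arg2", "MHCII"),               # ENTITY2
]

def matches_interaction_pattern(triples, entity1, entity2):
    # Pattern: some predicate has ENTITY1 as its arg1, and ENTITY2 appears
    # as an arg2/argM in the same structure. A real system would traverse
    # the full argument graph and feed the matched pattern to an SVM.
    has_e1 = any(r == "arg1" and entity1 in a for _, r, a in triples)
    has_e2 = any(r in ("arg2", "argM") and entity2 in a for _, r, a in triples)
    return has_e1 and has_e2

print(matches_interaction_pattern(triples, "CD4", "MHCII"))  # True
```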
61. Text Mining for Biology
- MEDIE: an interactive intelligent IR system retrieving events
- Performs a semantic search
- Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them
62. MEDIE system overview
[Figure: the input textbase is processed by a deep parser and an entity recognizer into a semantically-annotated textbase; a region-algebra search engine answers queries against it and returns search results.]
65. Service extracting interactions
- Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them
- System components
- MEDIE
- Extraction of protein-protein interactions
- Multi-window interface on a browser
- UTokyo: NaCTeM self-funded partner
- https://www-tsujii.is.s.u-tokyo.ac.jp/info-pubmed/
66. Info-PubMed
- Helps biologists to search for their interests
- Genes, proteins, their interactions, and evidence sentences
- Extracted from MEDLINE (about 15 million abstracts of biomedical papers)
- Uses many of the NLP techniques explained above in order to achieve high precision of retrieval
67. Flow Chart
Input → Output:
- token "TNF" → gene or protein keywords
- gene or protein keywords → gene or protein entities (e.g., Gene: TNF)
- gene or protein entity → interactions around the given gene
- interaction (e.g., TNF and IL6) → evidence sentences describing the given interaction
68. Techniques (1/2)
- Biomedical entity recognition in abstract sentences (a minimal sketch follows)
- Prepare a gene dictionary
- String match
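A minimal sketch of the dictionary string-matching step, with an invented three-entry dictionary; longer names are matched first so that nested names do not split:

```python
GENE_DICT = ["IL-2 receptor alpha", "TNF", "IL6"]  # invented entries

def find_entities(sentence: str):
    """Return (start, end, name) spans of dictionary names in the sentence."""
    hits = []
    for name in sorted(GENE_DICT, key=len, reverse=True):
        start = sentence.find(name)
        while start != -1:
            hits.append((start, start + len(name), name))
            start = sentence.find(name, start + len(name))
    return hits

print(find_entities("TNF interacts with IL6."))
# [(0, 3, 'TNF'), (19, 22, 'IL6')]
```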
69. Techniques (2/2)
- Extract sentences describing protein-protein interaction
- Deep parser based on HPSG syntax
- Can detect semantic relations between phrases
- Domain-dependent pattern recognition
- Can learn and expand source patterns
- By using the result of the deep parser, it can extract semantically true patterns
- Not affected by syntactic variations
70. Info-PubMed