Applications of Statistical Natural Language Processing

1
Applications of Statistical Natural Language
Processing
  • Shih-Hung Wu
  • Dept. of CSIE,
  • Chaoyang University of Technology

2
My Research Topics
  • Information Retrieval
  • LSI text classification, Ontological Engineering,
    Question Answering
  • Multi-Agent coordination
  • Game theoretical Multi-agent Negotiation
  • NLP
  • Shallow parsing, grammar debug, Named Entity
    Recognition
  • Learning Technology
  • Word Problems in primary school mathematics
  • Web Intelligence
  • Wikipedia wrapper, Cross-language IR

3
Outline
  • AI and NLP
  • Statistical Natural Language Processing
  • Example 1: Preposition Usage by Language Model
  • Sequential tagging technique
  • Maximum Entropy
  • Example 2: Bio NER
  • Example 3: Chinese Shallow Parsing

4
AI expected
  • Robots that can understand human language and
    interact with humans.

C-3PO and R2-D2
5
What we have now
  • Home robots that can dance or clean dust

6
What is lost?
  • They cannot use natural language well

7
Research Topics in NLP/IR
  • Applications of NLP/IR
  • Question Answering system, Input system,
    Information extraction, Ontology extraction,
    Grammar Checker
  • Other tough NLP tasks
  • Machine Translation, Summarization, Natural
    Language Generation
  • Sub-goals in NLP
  • Word segmentation, POS tagging, Full Parsing,
    Alignment, Suffix pattern, Shallow parsing,
    Semantic Role Labeling

8
Natural Language Processing (NLP)
  • Categories of the Development in NLP
  • Corpus-based Methods
  • Statistical Methods
  • Textbooks of NLP

Allen (1995) Natural Language Understanding
Manning (1999) Foundations of Statistical NLP
Jurafsky (2000) Speech and Language Processing
Jackson (2002) NLP for Online Applications
9
Methodology
  • Machine Learning, Pattern Recognition, and
    Statistical NLP share the same methodology
  • Training and testing
  • Train on a large set of training examples
  • Test on an independent set of examples

10
Preposition Usage by Language Model
  • Based on a conference paper:
  • Shih-Hung Wu, Chen-Yu Su, Tian-Jian Jiang, and
    Wen-Lian Hsu, "An Evaluation of Adopting
    Language Model as the Checker of Preposition
    Usage," Proceedings of ROCLING 2006.

11
Motivation
  • Microsoft Word detects grammar errors but does
    not deal with the usage of prepositions
  • A language model can predict the next word
  • Original idea: use a language model to predict
    the right preposition
  • Current approach: calculate the probability of
    each candidate sentence

12
Language Model
  • An LM uses a short history to predict the next
    word.
  • Ex.
  • Sue swallowed the large green ___.
  • ___ could be "pill" or "frog"
  • large green ___
  • ___ could be "pill" or "frog"
  • Markov assumption
  • Only the prior local context affects the
    prediction

13
Sentence Probability
  • The probability of a sentence in a language, say
    English, is defined as the probability of the
    word sequence
  • Decomposed by the chain rule of conditional
    probability under the Markov assumption (the
    formulas are reconstructed below)
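
The original slide's formulas did not survive the transcript; a standard reconstruction, consistent with the bi-gram and tri-gram models used later, is:

```latex
P(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})              % bi-gram (first-order Markov)
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})      % tri-gram (second-order Markov)
```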

14
Maximum Likelihood Estimation
Maximum Likelihood Estimation: bi-gram
Maximum Likelihood Estimation: n-gram
(The estimation formulas are reconstructed below.)
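
The slide's equations are missing from the transcript; the standard MLE relative-frequency estimates, presumably what was shown, are:

```latex
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}
\qquad
P_{\mathrm{MLE}}(w_i \mid w_{i-n+1}^{\,i-1}) = \frac{C(w_{i-n+1}^{\,i})}{C(w_{i-n+1}^{\,i-1})}
```

where C(·) is the count of the n-gram in the training corpus.
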
15
Smoothing 1: GT
  • Good-Turing Discounting (GT)
  • adjusts the count of an n-gram from r to r*,
    based on the assumption that the distribution of
    counts is binomial [Good, 1953]. (A standard form
    of the adjustment is given below.)
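
The discounting formula itself is missing from the transcript; the usual Good-Turing adjusted count, presumably what the slide showed, is:

```latex
r^{*} = (r + 1)\,\frac{n_{r+1}}{n_{r}}
```

where n_r is the number of n-grams that occur exactly r times in the training data.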

16
Smoothing 2: AD
  • Absolute Discounting (AD)
  • subtracts a fixed discount from each seen n-gram
    count; all the unseen events gain frequency
    uniformly [Ney et al., 1994]

17
Smoothing 3: mKN
  • Modified Kneser-Ney discounting (mKN)
  • mKN has three different parameters, D1, D2, and
    D3, that are applied to n-grams with one, two, and
    three or more counts [Chen and Goodman, 1998]

18
Entropy and Perplexity
  • Entropy
  • Perplexity
  • (Both definitions are reconstructed below.)
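
The formulas are missing from the transcript; the standard per-word cross-entropy and perplexity of a model on a test sequence of N words, presumably what the slide showed, are:

```latex
H = -\frac{1}{N} \log_2 P(w_1 w_2 \cdots w_N)
\qquad
\mathrm{Perplexity} = 2^{H}
```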

19
Experiment setting: training
  • Training sets for bi-gram model
  • Training sets for tri-gram model

20
Closed test setting
  • select 100 sentences from the training corpus
  • replacing the correct preposition with other
    prepositions
  • Ex.
  • My sister whispered in my ear.
  • My sister whispered on my ear.
  • My sister whispered at my ear.
  • Assume the sentence with the lowest perplexity is
    considered as the correct one.
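
A minimal sketch of the selection idea, not the authors' code: it trains a toy bi-gram LM with add-one smoothing (standing in for the GT/AD/mKN smoothing above) and picks the preposition whose sentence has the lowest perplexity. The corpus, preposition list, and function names are illustrative assumptions.

```python
import math
from collections import defaultdict

PREPOSITIONS = ["in", "on", "at", "to", "with"]

class BigramLM:
    def __init__(self, sentences):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.vocab = set()
        for sent in sentences:
            tokens = ["<s>"] + sent.lower().split() + ["</s>"]
            self.vocab.update(tokens)
            for w1, w2 in zip(tokens, tokens[1:]):
                self.unigrams[w1] += 1
                self.bigrams[(w1, w2)] += 1

    def prob(self, w1, w2):
        # add-one smoothed conditional probability P(w2 | w1)
        return (self.bigrams[(w1, w2)] + 1) / (self.unigrams[w1] + len(self.vocab))

    def perplexity(self, sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        logprob = sum(math.log2(self.prob(w1, w2)) for w1, w2 in zip(tokens, tokens[1:]))
        return 2 ** (-logprob / (len(tokens) - 1))

def best_preposition(lm, template):
    # template contains one "__" slot, e.g. "my sister whispered __ my ear"
    candidates = [template.replace("__", p) for p in PREPOSITIONS]
    return min(candidates, key=lm.perplexity)

if __name__ == "__main__":
    corpus = ["my sister whispered in my ear", "he lives in a large town"]
    lm = BigramLM(corpus)
    print(best_preposition(lm, "my sister whispered __ my ear"))
```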

21
Closed test results
  • Bi-gram model
  • Tri-gram model

22
Open test
  • 100 sentences collected from:
  • 1. Amusements in Mathematics by Henry Ernest
    Dudeney.
  • 2. Grimm's Fairy Tales by Jacob Grimm and Wilhelm
    Grimm.
  • 3. The Art of War by Sun-Zi.
  • 4. The Best American Humorous Short Stories.
  • 5. The War of the Worlds by H. G. Wells.
  • Then use the same replacement setting as in the
    closed test

23
Open test results
  • Bi-gram model
  • Tri-gram model

24
TOEFL test
  • 100 test questions like this one

My sister whispered __ my ear. (a) in (b) to (c)
with (d) on
  • Calculate the probability of each sentence:
  • My sister whispered in my ear. (correct)
  • My sister whispered to my ear. (wrong)
  • My sister whispered with my ear. (wrong)
  • My sister whispered on my ear. (wrong)

25
TOEFL test results
26
Error case 1
27
Error case 2
28
Error case 3
29
Discussions
  • The experimental results show that the accuracy
    on the open test is 71% and the accuracy on the
    closed test is 89%. The accuracy is 70% on
    TOEFL-level tests.
  • Uses only an untagged corpus

30
Future works
  • Encode more features into the statistical model
  • Rule: use in (not at) before the names of
    countries, regions, cities, and large towns.
  • NER is necessary
  • Use advanced statistical models
  • Maximum Entropy (ME) [Berger et al., 1996]
  • Conditional Random Fields (CRF) [Lafferty et al.,
    2001]

31
Sequential Tagging with Maximum Entropy
32
Sequence Tagging
  • Given a sequence X = x1, x2, x3, ..., xn
  • and a tag set T = {t1, t2, t3, ..., tm}
  • Find the most probable valid sequence
    Y = y1, y2, y3, ..., yn
  • where yi ∈ T

33
Sequence Tagging-Naïve Bayes
  • Calculate P(yi | x1, x2, x3, ..., xn), ignoring
    features other than unigrams
  • Naïve Bayes
  • argmax P(yi | x1, x2, ..., xn)
    yi ∈ T
  • = argmax P(yi) P(x1, x2, ..., xn | yi)   (by Bayes' theorem)
    yi ∈ T
  • = argmax P(yi) P(x1 | yi) P(x2 | yi) ... P(xn | yi)   (independence)
    yi ∈ T
  • Pros: simple
  • Cons: cannot handle overlapping features due to
    its independence assumption

34
Maximum Entropy
35
Applications
  • NLP tasks
  • Machine Translation [Berger et al., 1996]
  • Sentence Boundary Detection
  • Part-of-Speech Tagging [Ratnaparkhi, 1998]
  • NER in the Newswire Domain [Borthwick, 1998]
  • Adaptive Statistical Language Modeling
    [Rosenfeld, 1996]
  • Chunking [Osborne, 2000]
  • Junk Mail Filtering [Zhang, 2003]
  • Biomedical NER [Lin et al., 2004]

36
Maximum Entropy-A Simple Example
  • Suppose yi relates only to xi (unigram)
  • the alphabet of X is denoted as S = {a, b, c};
    T = {t1, t2}
  • We define a joint probability distribution p
    over S × T
  • Given N observations (x1, y1), (x2, y2), ..., (xN, yN)
  • such as (a, t1), (b, t2), (c, t1), ...
  • How can we estimate the underlying model p?

37
The First Constraint (Cont.)
  • Two possible distributions are
  • Intuitively, we think the right one is better,
    since it is more uniform than the left one

38
The Second Constraint (Cont.)
  • Now we observe that t1 appears 70% of the time,
    so we add a new constraint to our model:
    P(a, t1) + P(b, t1) + P(c, t1) = 0.7
  • Again, two possible distributions

more uniform
39
The Third Constraint (Cont.)
  • Now we observe that a appears 50% of the time, so
    we add a new constraint to our model:
    P(a, t1) + P(a, t2) = 0.5
  • Two Questions
  • What does "uniform" mean?
  • How can we find the most uniform model subject to
    a set of constraints?

40
Entropy
  • A mathematical measure of uniformity
    (uncertainty)
  • For a conditional distribution p(y|x), its
    conditional entropy is (reconstructed below)
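
The formula is missing from the transcript; the conditional entropy used in the ME literature (e.g. Berger et al., 1996), presumably what the slide showed, is:

```latex
H(p) = -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)
```

where p̃(x) is the empirical distribution of x in the training data.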

41
Maximum Entropy Solution
  • Goal to select a model from a set C of
    allowed probability distributions which maximizes
    entropy H(p)
  • In other words, p is the most uniform
    distribution we can have.
  • We call p the maximum entropy solution.

42
Expectation of Features
  • Once a feature is defined, its expectation under
    the model is
  • Its observed (empirical) expectation is
  • We require the two expectations to be equal
  • More explicitly, we write (reconstructed below)
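
The slide's equations are missing; in the standard ME formulation the two expectations and the constraint, presumably what was shown, are:

```latex
E_{p}[f] = \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y)
\qquad
E_{\tilde{p}}[f] = \sum_{x, y} \tilde{p}(x, y)\, f(x, y)
\qquad
E_{p}[f] = E_{\tilde{p}}[f]
```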

43
Represent Constraints by Features
  • Under the ME framework, constraints imposed on a
    model are represented by features, known as
    feature functions, of the form
  • For example, the previous constraints can be
    written as

44
Expectation of Features (Cont.)
  • In the previous example, the fact that t1 appears
    70% of the time can be formalized as
  • And the fact that a appears 50% of the time can
    be written as

45
Maximum Entropy Framework
  • In general, suppose we have k constraints
    (features); we would like to find a model p* that
    lies in the subset C of P
  • defined by
  • which maximizes the entropy

46
Maximum Entropy Solution
  • It can be proved that the Maximum Entropy
    solution p* must have the exponential form
    (reconstructed below),
  • where k is the number of features and Z(x) is a
    normalization factor that ensures the
    distribution sums to one
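
The form itself is missing from the transcript; the standard ME solution, presumably what the slide showed, is:

```latex
p^{*}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x, y)\Big),
\qquad
Z_{\lambda}(x) = \sum_{y} \exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x, y)\Big)
```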

47
Maximum Entropy Solution (Cont.)
  • So our task is to estimate the parameters λi in
    p* which maximize H(p).
  • In simple cases, we can find the solution
    analytically (as in the previous example); when
    the problem becomes more complex, we need a way to
    automatically derive the λi, given a set of
    constraints.

48
Parameters Estimation
  • A Constrained Optimization Problem
  • Finding a set of parameters
  • Λ = {λ1, λ2, λ3, ..., λn} of an exponential model
    which maximizes its log-likelihood.
  • To find values for the parameters of p*:
  • Generalized Iterative Scaling [Darroch and
    Ratcliff, 1972]
  • Improved Iterative Scaling [Berger et al., 1996]

49
ME's Advantages and Disadvantages
  • Advantages
  • Knowledge-Poor Features
  • Reusable Software
  • Free Incorporation of Overlapping and
    Interdependent Features
  • Disadvantages
  • Slow Training Procedure
  • No Explicit Control on Parameter Variance (unlike
    SVMs)

50
Biomedical NER
  • Based on an in-press journal paper:
  • Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu,
    "Integrating Linguistic Knowledge into a
    Conditional Random Field Framework to Identify
    Biomedical Named Entities," Expert Systems with
    Applications, Volume 30, Issue 1, January 2006,
    pp. 117-128. (SCI)

51
Sequence Tagging-NER Example
  • Determine the best tags for the sentence "IL-2
    gene induced NF-Kappa B"
  • We can formulate this example as
  • X = IL-2, gene, induced, NF-Kappa, B
  • assume T = {Ps, Pc, Pe, Pu, Ds, Dc, De, Du, O}
  • > candidate 1 of Y: Ps, Ps, Ps, Ps, Ps (invalid)
  • > candidate 2 of Y: O, Ps, Pe, O, O (valid)
  • > candidate k of Y: Ds, De, O, Ps, Pe (valid)
  • The answer of this example is Y = Ds, De, O, Ps, Pe

52
Decoding
  • By multiplying all P(yi | contexti), we can get
    the probability of a tag sequence Y
  • But some tag sequences are invalid
  • Ex: X = IL-2, gene, induced, NF-Kappa, B
  • assume T = {Ps, Pc, Pe, Pu, Ds, Dc, De, Du, O}
  • > candidate 1 of Y: Ps, Ps, Ps, Ps, Ps (invalid)
  • > candidate k of Y: Ds, De, O, Ps, Pe (valid)
  • Even if P(candidate 1) > P(candidate k), we still
    must output candidate k.
  • Use Viterbi search to find the most probable
    valid tag sequence (a minimal sketch follows)
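
A minimal Viterbi sketch of the decoding step, not the authors' implementation: `emissions[i][t]` stands for the model's P(tag t | context of token i), and `allowed(prev, cur)` is an assumed validity test for tag bigrams (e.g. a _continue tag may only follow the matching _start or _continue tag).

```python
import math

def viterbi(emissions, tags, allowed):
    n = len(emissions)
    NEG_INF = float("-inf")
    score = [{t: NEG_INF for t in tags} for _ in range(n)]  # best log-score per (position, tag)
    back = [{t: None for t in tags} for _ in range(n)]      # backpointers
    for t in tags:
        score[0][t] = math.log(emissions[0].get(t, 1e-12))
    for i in range(1, n):
        for cur in tags:
            emit = math.log(emissions[i].get(cur, 1e-12))
            for prev in tags:
                if not allowed(prev, cur):
                    continue  # skip transitions that would create an invalid sequence
                s = score[i - 1][prev] + emit
                if s > score[i][cur]:
                    score[i][cur] = s
                    back[i][cur] = prev
    # trace back from the best-scoring final tag
    best_last = max(tags, key=lambda t: score[n - 1][t])
    path = [best_last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage: two tokens, and "De" may not follow "O".
emissions = [{"Ds": 0.6, "O": 0.4}, {"De": 0.7, "O": 0.3}]
print(viterbi(emissions, ["Ds", "De", "O"], lambda p, c: not (c == "De" and p == "O")))
```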

53
Biomedical Named Entity Recognition
  • "... were used to model changes in susceptibility
    to NK cell killing caused by transient vs stable
    ..."
  • We assign a tag to the current token xi according
    to its features in context.

[Figure: candidate tags such as Protein_St, RNA_St, and DNA_St are scored for
the current token xi from its context; Feature 1 is AllCaps.]
54
Biomedical Named Entity Recognition
  • BioNER identifies biomedical names in text and
    categorizes them into different types
  • Essentials
  • Tagged Corpus
  • GENIA
  • Features
  • Orthographical Features
  • Morphological Features
  • Head Noun Features
  • POS Features
  • Tag Set

55
Biomedical NER - Tagged Corpus
  • GENIA Corpus (Ohta et al., 2002)
  • V1.1: 670 MEDLINE abstracts
  • V2.1: 670 MEDLINE abstracts with POS tagging
  • V3.0: 2000 MEDLINE abstracts
  • V3.0p: 2000 MEDLINE abstracts with POS tagging
  • V3.02
  • POS Tag Set
  • Penn Treebank (PTB)

56
Biomedical NER-Internal/External Features
  • Internal
  • Found within the name string itself
  • e.g., primary NK cells
  • External
  • Context
  • e.g., activate ROI formation
  • The CD4 coreceptor interacts

57
Orthographical Features
58
Morphological Features
59
Head Nouns
60
Biomedical NER-Tag Set
  • 23 NE Categories:
  • Protein, Other Name, DNA, CellType, Other
    Organic,
  • CellLine, Lipid, Multi Cell, Virus, RNA,
    CellComponent,
  • Body Part, Tissue, AminoAcidMonomer, Polynucleotide,
  • Mono Cell, Inorganic, Peptide, Nucleotide,
  • Atom, Other Artificial, Carbohydrate, Organic
  • Each NE category c has 4 tags: c_start,
    c_continue, c_end, c_unique
  • Ex: Protein has protein_start, protein_continue,
    protein_end, protein_unique
  • In addition, there's a non-NE tag o.
  • Therefore, |T| = 23 × 4 + 1 = 93

61
Bio NER: System Architecture
62
Nested Named Entity
  • Nested Annotation
  • <RNA><DNA>CIITA</DNA> mRNA</RNA>
  • From the perspective of parsing, we prefer bigger
    chunks, that is, <RNA>CIITA mRNA</RNA>. However,
    ME sometimes only recognizes CIITA as DNA
  • 16.57% of NEs in GENIA 3.02 contain one or more
    shorter NEs [Zhang, 2003]
  • Solution: Post-Processing
  • Boundary Extension
  • Re-classification

63
Post-Processing: Boundary Extension
  • Boundary extension for nested NEs
  • Extend the right boundary repeatedly if the NE is
    followed by another NE, a head noun, or an
    R-boundary word with a valid POS tag.
  • Extend the left boundary repeatedly if the NE is
    preceded by an L-boundary word with a valid POS
    tag.
  • Boundary extension for NEs containing brackets or
    slashes
  • NE NE ( NE ) NE or head noun or
    R-boundary word with valid POS tag
  • NE NE / NE ( / NE ) NE or head
    noun or R-boundary word with valid POS tag

64
Experimental Results: State-of-the-art Systems
GENIA v3.02 (10 Fold-CV)
65
Remarks
  • ME offers a clear way to incorporate various
    evidence into a single, powerful model.
  • It detects 80% of the rough positions of Bio-NEs.
  • Due to the nested annotations in GENIA and the
    preference for bigger chunks, we apply a
    post-processing technique and obtain the highest
    F-score on GENIA so far.

66
References
  • Borthwick, A. (1999). A Maximum Entropy Approach
    to Named Entity Recognition, New York University.
  • Hou, W.-J. and H.-H. Chen (2003). Enhancing
    Performance of Protein Name Recognizers Using
    Collocation. ACL Workshop on Natural Language
    Processing in Biomedicine, Sapporo, Japan.
  • Kazama, J., T. Makino, et al. (2002). Tuning
    Support Vector Machine for Biomedical Named
    Entity Recognition. ACL Workshop on NLP in the
    Biomedical Domain.
  • Lee, K.-J., Y.-S. Hwang, et al. (2003). Two-Phase
    Biomedical NE Recognition based on SVMs. ACL
    Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • McDonald, D. (1996). Internal and External
    Evidence in the Identification and Semantic
    Categorization of Proper Names. Corpus Processing
    for Lexical Acquisition. B. Boguraev and J.
    Pustejovsky. Cambridge, MA, MIT Press 21-39.
  • Nenadic, G., S. Rice, et al. (2003). Selecting
    Text Features for Gene Name Classification from
    Documents to Terms. ACL Workshop on Natural
    Language Processing in Biomedicine, Sapporo,
    Japan.
  • Ohta, T., Y. Tateisi, et al. (2002). The GENIA
    corpus: An annotated research abstract corpus in
    molecular biology domain. HLT 2002.
  • Shen, D., J. Zhang, et al. (2003). Effective
    Adaptation of Hidden Markov Model-based Named
    Entity Recognizer for Biomedical Domain. ACL
    Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • Takeuchi, K. and N. Collier (2003). Bio-Medical
    Entity Extraction using Support Vector Machines.
    ACL Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • Torii, M., S. Kamboj, et al. (2003). An
    Investigation of Various Information Sources for
    Classifying Biological names. ACL Workshop on
    Natural Language Processing in Biomedicine,
    Sapporo, Japan.
  • Tsuruoka, Y. and J. i. Tsujii (2003). Boosting
    Precision and Recall of Dictionary-Based Protein
    Name Recognition. ACL Workshop on Natural
    Language Processing in Biomedicine, Sapporo,
    Japan.
  • Yamamoto, K., T. Kudo, et al. (2003). Protein
    Name Tagging for Biomedical Annotation in Text.
    ACL Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • Zhang, J., D. Shen, et al. (2003). Exploring
    Various Evidences for Recognition of Named
    Entities in Biomedical Domain. EMNLP 2003.

67
Chinese Shallow Parsing
  • Based on a conference paper:
  • Shih-Hung Wu, Cheng-Wei Shih, Chia-Wei Wu,
    Tzong-Han Tsai, and Wen-Lian Hsu, "Applying
    Maximum Entropy to Robust Chinese Shallow
    Parsing," Proceedings of ROCLING 2005, NCKU,
    Tainan.

68
Outline
  • Introduction
  • Method
  • Sequential tagging
  • Maximum Entropy
  • Experiment
  • Noise Generation
  • Test on the Noisy Training Set
  • Conclusion and future works

69
Introduction
  • Full-Parsing is useful but difficult
  • Ambiguity, Unknown word
  • Chunking is achievable
  • Fast and robust (suitable for online
    applications)
  • Shallow Parsing (Chunking) applications
  • information retrieval, information extraction,
    question answering, and automatic document
    summarization
  • Our goal
  • Build a Chinese shallow parser and test the
    robustness

70
Chunking with unknown word
  • ???/??/?/??/??/???/??
  • Standard chunking ????????/NP ??/Dd ???/DM
    ??/VP
  • If ?/?/? VH13/Dd/P15 is an unknown word, then
    the chunking might be
  • ??/NP ??????/PP ??/Dd ???/DM ??/VP

71
Related works in Beijing, Harbin, Shenyang, and
Hong Kong [10, 15, 16, 20, 21]
  • Standard
  • News corpus, UPenn Treebank, Sinica Treebank
  • Method
  • Memory-based learning, Naïve Bayes, SVM, CRF, ME
  • Evaluation
  • Perplexity, Token accuracy, Chunk accuracy
  • Noisy Model
  • random noise, filled noise, and repeated noise
    [13]

72
Chunk standard
  • First level phrases of Sinica Treebank 3.0

73
Phrases
74
Sequential Tagging Scheme
  • B-I-O Scheme

75
Our Tagset
  • Each token is tagged with one of the following 11
    tags (a small conversion sketch follows this
    list):
  • NP_begin, NP_continue,
  • VP_begin, VP_continue,
  • PP_begin, PP_continue,
  • GP_begin, GP_continue,
  • S_begin, S_continue,
  • and X (others).
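
An illustrative sketch (not the authors' code) of how a chunked sentence maps onto this tag set; the chunk representation `(phrase_type, tokens)`, with `None` for tokens outside any chunk, is an assumption made for the example.

```python
def chunks_to_tags(chunks):
    """chunks: list of (phrase_type, [tokens]); phrase_type None means outside any chunk."""
    tags = []
    for phrase, tokens in chunks:
        for i, _tok in enumerate(tokens):
            if phrase is None:
                tags.append("X")                   # outside any chunk
            elif i == 0:
                tags.append(f"{phrase}_begin")     # first token of the chunk
            else:
                tags.append(f"{phrase}_continue")  # remaining tokens of the chunk
    return tags

# e.g. chunks_to_tags([("NP", ["w1", "w2"]), (None, ["w3"]), ("VP", ["w4"])])
# -> ["NP_begin", "NP_continue", "X", "VP_begin"]
```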

76
Maximum Entropy 2
  • Conditional Exponential Model
  • Binary-valued feature functions
  • Decoding

77
Feature functions
  • Each feature is represented by a binary-valued
    function

78
Conditional Exponential Model
  • Feature fi(x, y) is a binary-valued function
  • Parameter λi is a real-valued weight associated
    with fi.
  • Model Λ = {λ1, λ2, ..., λn}
  • Normalizing factor

79
Conditional Exponential Model
  • the probability of observation o, given history h
  • Feature fi(x, y) is a binary-valued function
  • Parameter αi is a real-valued weight associated
    with fi.
  • Normalizing factor Z(h)

80
Use the Model
  • Training
  • Use empirical data to estimate the parameters
    with the Improved Iterative Scaling (IIS)
    algorithm
  • Test
  • Decoding: find the highest-probability path
    through the lattice of conditional probabilities

81
Experiment
82
Data and Features
  • Sinica Treebank 3.0 contains more than 54,000
    sentences, from which we randomly extract 30,000
    for training and 10,000 for testing.
  • Features
  • words, adjacent characters, prefixes of words (1
    and 2 characters), suffixes of words (1 and 2
    characters), word length, POS of words, adjacent
    POS tags, and the word's location in the chunk it
    belongs to

83
Noise Model Generation
  • Type 1 (single characters)
  • ???? Nca replaced by
  • ?, Nab, ?, Dbab, ?, Ncda, ?, Nca
  • Type 2 (AUTOTAG segmentation)
  • ???? Nb would be tagged as ??/Nb, ??/Nb.

84
Results and Discussion
85
Evaluation Criteria
  • We define four types of accuracy
  • chunk boundary accuracy
  • Ignore the category
  • chunk category accuracy
  • Ignore the boundary
  • Token accuracy
  • Chunk accuracy

86
Evaluation Criteria Example
  • Standard parsing
  • ???/NP ??/VC ?/NP ?-???/VP
  • 4 chunks, 5 tokens
  • If the parsing result is
  • ???/NP ??/VC ?/NP ?/Db ???/VE
  • Then the
  • chunk boundary: 3/4 = 0.75
  • chunk category: 3/4 = 0.75
  • token accuracy: 3/5 = 0.6
  • chunk accuracy: 3/4 = 0.75
  • (A sketch that reproduces these numbers follows.)
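
A hedged sketch of one plausible reading of the four criteria, using chunks as (category, start, end) token spans; it reproduces the numbers in the example above but is not the authors' exact scoring script.

```python
def evaluate(gold, pred, n_tokens):
    # chunk boundary accuracy: gold span found among predicted spans (category ignored)
    pred_spans = {(s, e) for _, s, e in pred}
    boundary = sum((s, e) in pred_spans for _, s, e in gold) / len(gold)
    # chunk accuracy: both category and span must match
    chunk = sum(ch in pred for ch in gold) / len(gold)
    # chunk category accuracy: some overlapping predicted chunk has the same
    # category (boundary ignored)
    def overlaps(a, b):  # half-open token spans (start, end)
        return a[0] < b[1] and b[0] < a[1]
    category = sum(
        any(pc == c and overlaps((s, e), (ps, pe)) for pc, ps, pe in pred)
        for c, s, e in gold
    ) / len(gold)
    # token accuracy: the chunk covering each token agrees in category and span
    def cover(chunks, i):
        return next((ch for ch in chunks if ch[1] <= i < ch[2]), None)
    token = sum(cover(gold, i) == cover(pred, i) for i in range(n_tokens)) / n_tokens
    return boundary, category, token, chunk

gold = [("NP", 0, 1), ("VC", 1, 2), ("NP", 2, 3), ("VP", 3, 5)]
pred = [("NP", 0, 1), ("VC", 1, 2), ("NP", 2, 3), ("Db", 3, 4), ("VE", 4, 5)]
print(evaluate(gold, pred, n_tokens=5))   # -> (0.75, 0.75, 0.6, 0.75)
```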

87
Result of Type 1 noisy data
  • the percentage of Nb and Nc replaced by
    single character noisy data

88
Evaluation of the boundaries in different
experiment configurations
89
Evaluation of the chunking category in different
experiment configurations
90
Evaluation of tokens in different experiment
configurations
91
Evaluation of chunks in different experiment
configurations
92
Results of Type 2 noisy data
  • C-C: Using a clean training model and clean test
    data.
  • C-N: Using a clean training model and noisy test
    data in which all Nb and Nc are replaced by
    tokenized results.
  • N-C: Using a training model with noisy data, in
    which all Nb and Nc are replaced by tokenized
    results, to chunk clean test data.
  • N-N: Both the training model and the test data
    have noisy data in which all Nb and Nc are
    replaced by tokenized results.

93
Tokenized string noisy data (the Type 2 noise
model) vs. the AUTOTAG-parsed model
94
Error Analysis
95
Chunking examples with Type 1 noise
96
Shallow parsing examples with Type 2 noise
97
Shallow parsing examples with AUTOTAG-parsed
training data and test data
98
Chunk results of Open Corpus
99
Conclusion and Future Works
  • The system can chunk Chinese sentences into five
    chunk types
  • The accuracy on data with simulated unknown words
    decreases only slightly in chunk parsing
  • On an open corpus, the system yields interesting
    chunking results.
  • Future work
  • adopting other POS systems, such as the Penn
    Chinese Treebank tagset, for Chinese shallow
    parsing could prove both interesting and useful
  • adding more types of noise, such as the random
    noise, filled noise, and repeated noise proposed
    by Osborne [13].
  • In addition to the Sinica Treebank, we will
    extend our training corpus by incorporating other
    corpora, such as the Penn Chinese Treebank.

100
Appendix
  • ME and Improved Iterative Scaling Algorithm

101
Model a Problem
102
Conditional Exponential Model
  • Feature fi(x, y) is a binary-valued function
  • Parameter λi is a real-valued weight associated
    with fi.
  • Model Λ = {λ1, λ2, ..., λn}
  • Normalizing factor

103
Notes on the Model
  • Features are domain-dependent
  • The exponential form guarantees positive
    probabilities
  • Initially, the parameters λ1, λ2, ..., λn are
    unknown
  • Use empirical data to estimate them
  • maximum log-likelihood

104
Maximum log-likelihood
  • Given a joint empirical distribution
  • Log-likelihood as a measure of the quality of the
    model Λ (reconstructed below)
  • Log-likelihood ≤ 0 always
  • Log-likelihood = 0 is optimal
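
The definition is missing from the transcript; the conditional log-likelihood normally used for this model (e.g. Berger et al., 1996), presumably what the slide showed, is:

```latex
L_{\tilde{p}}(\Lambda) = \sum_{x, y} \tilde{p}(x, y) \log p_{\Lambda}(y \mid x)
```

Since each log-probability is at most 0, the log-likelihood is never positive; it reaches 0 only if the model assigns probability 1 to every observed (x, y).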

105
Maximum likelihood of Conditional Exponential
Model
  • Differentiating with respect to each λi
  • The expectation of fi(x,y) with respect to the
    empirical distribution and the model

106
From Maximum Entropy to Exponential Model
  • Through Lagrange Multipliers

107
Entropy → Lagrangian
  • H(p) = -Σx p(x) log p(x)
  • Lagrangian
  • Optimize the Lagrangian

108
  • For fixed λ, the Lagrangian has a maximum where

109
Training
  • Improved Iterative Scaling (IIS) Algorithm

110
Finding optimal λ by iteration
  • The change in log-likelihood from λ to λ + δ

111
Finding optimal λ by iteration 2
  • Use the inequality -log α ≥ 1 - α

where
therefore
112
Finding optimal λ by iteration 3
  • By Jensen's inequality (cf. appendix):
    exp(Σx p(x) q(x)) ≤ Σx p(x) exp(q(x))

where
113
Finding optimal λ by iteration 4
  • call it B(δ) and differentiate it
  • δi appears alone, so we can solve for each δi

114
Improved Iterative Scaling Algorithm
  • IIS Algorithm
  • Start with some value for each λi
  • Repeat until convergence:
  • find each δi by solving the IIS equation
  • Set λi ← λi + δi
  • (An illustrative iterative-scaling sketch follows.)
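
A small, self-contained sketch of the iterate-until-convergence structure. It uses the closed-form Generalized Iterative Scaling update rather than solving the IIS equation for each δi (a plainly named simplification); all names, the toy data, and the binary-feature assumption are illustrative, not the authors' code.

```python
import math

def gis(samples, features, labels, iterations=200):
    """samples: list of (x, y) pairs; features: list of functions f(x, y) -> 0/1."""
    # GIS assumes every (x, y) fires the same number C of features (a "slack"
    # feature is the usual fix); here C is taken as the maximum observed sum.
    C = max(sum(f(x, y) for f in features) for x, _ in samples for y in labels)
    lam = [0.0] * len(features)

    def p(y, x):
        # model probability p_lambda(y | x)
        scores = {yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
                  for yy in labels}
        return scores[y] / sum(scores.values())

    # empirical expectations E~[f_i] over the training sample
    emp = [sum(f(x, y) for x, y in samples) / len(samples) for f in features]
    for _ in range(iterations):
        # model expectations E_p[f_i] under the current parameters
        mod = [sum(p(y, x) * f(x, y) for x, _ in samples for y in labels) / len(samples)
               for f in features]
        # closed-form GIS update; IIS would instead solve for delta_i numerically
        for i, (e, m) in enumerate(zip(emp, mod)):
            if e > 0 and m > 0:
                lam[i] += math.log(e / m) / C
    return lam

# Tiny usage example: features pairing a token with a tag.
samples = [("gene", "DNA"), ("gene", "DNA"), ("cell", "O")]
features = [lambda x, y: 1 if (x, y) == ("gene", "DNA") else 0,
            lambda x, y: 1 if (x, y) == ("cell", "O") else 0]
print(gis(samples, features, labels=["DNA", "O"]))
```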

115
Applications
  • All sequential labeling problems
  • Natural language processing
  • NER, POS tagging, Chunking
  • Speech recognition
  • Graphics
  • Noise reduction
  • Many others

116
Reference
  • A. L. Berger, S. A. Della Pietra, and V. J. Della
    Pietra, "A maximum entropy approach to natural
    language processing," 1996.
  • A. Berger, "The Improved Iterative Scaling
    Algorithm: A Gentle Introduction," 1997.