Applications of Statistical Natural Language Processing

1
Applications of Statistical Natural Language
Processing
  • Shih-Hung Wu
  • Dept. of CSIE,
  • Chaoyang University of Technology

2
My Research Topics
  • Information Retrieval
  • LSI text classification, Ontological Engineering,
    Question Answering
  • Multi-Agent coordination
  • Game theoretical Multi-agent Negotiation
  • NLP
  • Shallow parsing, grammar debug, Named Entity
    Recognition
  • Learning Technology
  • Word Problems in primary school mathematics
  • Web Intelligence
  • Wikipedia wrapper, Cross-language IR

3
Outline
  • AI and NLP
  • Statistical Natural Language Processing
  • Example 1: Preposition Usage by Language Model
  • Sequential tagging technique
  • Maximum Entropy
  • Example 2: Bio NER
  • Example 3: Chinese Shallow Parsing

4
AI expected
  • Robots that can understand human language and
    interact with humans.

C-3PO and R2-D2
5
What we have now
  • Home robots that can dance or clean dust

6
What is lost?
  • They cannot use natural language well

7
Research Topics in NLP/IR
  • Applications of NLP/IR
  • Question Answering system, Input system,
    Information extraction, Ontology extraction,
    Grammar Checker
  • Other tough NLP tasks
  • Machine Translation, Summarization, Natural
    Language Generation
  • Sub-goals in NLP
  • Word segmentation, POS tagging, Full Parsing,
    Alignment, Suffix pattern, Shallow parsing,
    Semantic Role Labeling

8
Natural Language Processing (NLP)
  • Categories of the Development in NLP
  • Corpus-based Methods
  • Statistical Methods
  • Textbooks of NLP

Allen (1995) Natural Language Understanding
Manning (1999) Foundations of Statistical NLP
Jurafsky (2000) Speech and Language Processing
Jackson (2002) NLP for Online Applications
9
Methodology
  • Machine Learning, Pattern Recognition, and
    Statistical NLP share the same methodology
  • Training and testing
  • Train on a large set of training examples
  • Test on an independent set of examples

10
Preposition Usage by Language Model
  • Based on a conference paper:
  • Shih-Hung Wu, Chen-Yu Su, Tian-Jian Jiang, and
    Wen-Lian Hsu, "An Evaluation of Adopting
    Language Model as the Checker of Preposition
    Usage," Proceedings of ROCLING 2006.

11
Motivation
  • Microsoft Word detects grammar errors but does
    not deal with the usage of prepositions
  • A language model can predict the next word
  • Original idea: use a language model to predict
    the right preposition
  • Current approach: calculate the probability of
    each candidate sentence

12
Language Model
  • An LM uses a short history to predict the next
    word.
  • Ex.
  • Sue swallowed the large green ___.
  • ___ could be "pill" or "frog"
  • large green ___
  • ___ could be "pill" or "frog"
  • Markov assumption
  • Only the prior local context affects the
    prediction

13
Sentence Probability
  • The probability of a sentence in a language, say
    English, is defined as the probability of the
    word sequence
  • Decomposed by the chain rule of conditional
    probability under the Markov assumption (the
    formulas are reconstructed below)
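
The original slide's formulas did not survive the transcript; a standard reconstruction, consistent with the bi-gram and tri-gram models used later, is:

```latex
P(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})              % bi-gram (first-order Markov)
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})      % tri-gram (second-order Markov)
```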

14
Maximum Likelihood Estimation
Maximum Likelihood Estimation: bi-gram
Maximum Likelihood Estimation: n-gram
(The estimation formulas are reconstructed below.)
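
The slide's equations are missing from the transcript; the standard MLE relative-frequency estimates, presumably what was shown, are:

```latex
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}
\qquad
P_{\mathrm{MLE}}(w_i \mid w_{i-n+1}^{\,i-1}) = \frac{C(w_{i-n+1}^{\,i})}{C(w_{i-n+1}^{\,i-1})}
```

where C(·) is the count of the n-gram in the training corpus.
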
15
Smoothing 1: GT
  • Good-Turing Discounting (GT)
  • adjusts the count of an n-gram from r to r*,
    based on the assumption that the distribution of
    counts is binomial [Good, 1953]. (A standard form
    of the adjustment is given below.)
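
The discounting formula itself is missing from the transcript; the usual Good-Turing adjusted count, presumably what the slide showed, is:

```latex
r^{*} = (r + 1)\,\frac{n_{r+1}}{n_{r}}
```

where n_r is the number of n-grams that occur exactly r times in the training data.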

16
Smoothing 2: AD
  • Absolute Discounting (AD)
  • subtracts a fixed discount from each seen n-gram
    count; all the unseen events gain frequency
    uniformly [Ney et al., 1994]

17
Smoothing 3: mKN
  • Modified Kneser-Ney discounting (mKN)
  • mKN has three different parameters, D1, D2, and
    D3, that are applied to n-grams with one, two, and
    three or more counts [Chen and Goodman, 1998]

18
Entropy and Perplexity
  • Entropy
  • Perplexity
  • (Both definitions are reconstructed below.)
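
The formulas are missing from the transcript; the standard per-word cross-entropy and perplexity of a model on a test sequence of N words, presumably what the slide showed, are:

```latex
H = -\frac{1}{N} \log_2 P(w_1 w_2 \cdots w_N)
\qquad
\mathrm{Perplexity} = 2^{H}
```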

19
Experiment setting: training
  • Training sets for bi-gram model
  • Training sets for tri-gram model

20
Closed test setting
  • select 100 sentences from the training corpus
  • replacing the correct preposition with other
    prepositions
  • Ex.
  • My sister whispered in my ear.
  • My sister whispered on my ear.
  • My sister whispered at my ear.
  • Assume the sentence with the lowest perplexity is
    considered as the correct one.
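
A minimal sketch of the selection idea, not the authors' code: it trains a toy bi-gram LM with add-one smoothing (standing in for the GT/AD/mKN smoothing above) and picks the preposition whose sentence has the lowest perplexity. The corpus, preposition list, and function names are illustrative assumptions.

```python
import math
from collections import defaultdict

PREPOSITIONS = ["in", "on", "at", "to", "with"]

class BigramLM:
    def __init__(self, sentences):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.vocab = set()
        for sent in sentences:
            tokens = ["<s>"] + sent.lower().split() + ["</s>"]
            self.vocab.update(tokens)
            for w1, w2 in zip(tokens, tokens[1:]):
                self.unigrams[w1] += 1
                self.bigrams[(w1, w2)] += 1

    def prob(self, w1, w2):
        # add-one smoothed conditional probability P(w2 | w1)
        return (self.bigrams[(w1, w2)] + 1) / (self.unigrams[w1] + len(self.vocab))

    def perplexity(self, sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        logprob = sum(math.log2(self.prob(w1, w2)) for w1, w2 in zip(tokens, tokens[1:]))
        return 2 ** (-logprob / (len(tokens) - 1))

def best_preposition(lm, template):
    # template contains one "__" slot, e.g. "my sister whispered __ my ear"
    candidates = [template.replace("__", p) for p in PREPOSITIONS]
    return min(candidates, key=lm.perplexity)

if __name__ == "__main__":
    corpus = ["my sister whispered in my ear", "he lives in a large town"]
    lm = BigramLM(corpus)
    print(best_preposition(lm, "my sister whispered __ my ear"))
```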

21
Closed test results
  • Bi-gram model
  • Tri-gram model

22
Open test
  • 100 sentences collected from:
  • 1. Amusements in Mathematics by Henry Ernest
    Dudeney.
  • 2. Grimm's Fairy Tales by Jacob Grimm and Wilhelm
    Grimm.
  • 3. The Art of War by Sun-Zi.
  • 4. The Best American Humorous Short Stories.
  • 5. The War of the Worlds by H. G. Wells.
  • Then use the same replacement setting as in the
    closed test

23
Open test results
  • Bi-gram model
  • Tri-gram model

24
TOEFL test
  • 100 test questions like this one

My sister whispered __ my ear. (a) in (b) to (c)
with (d) on
  • Calculate the probability of each sentence:
  • My sister whispered in my ear. (correct)
  • My sister whispered to my ear. (wrong)
  • My sister whispered with my ear. (wrong)
  • My sister whispered on my ear. (wrong)

25
TOEFL test results
26
Error case 1
27
Error case 2
28
Error case 3
29
Discussions
  • The experimental results show that the accuracy
    on the open test is 71% and the accuracy on the
    closed test is 89%. The accuracy is 70% on
    TOEFL-level tests.
  • Uses only an untagged corpus

30
Future works
  • Encode more features into the statistical model
  • Rule: use in (not at) before the names of
    countries, regions, cities, and large towns.
  • NER is necessary
  • Use advanced statistical models
  • Maximum Entropy (ME) [Berger et al., 1996]
  • Conditional Random Fields (CRF) [Lafferty et al.,
    2001]

31
Sequential Tagging with Maximum Entropy
32
Sequence Tagging
  • Given a sequence X = x1, x2, x3, ..., xn
  • and a tag set T = {t1, t2, t3, ..., tm}
  • Find the most probable valid sequence
    Y = y1, y2, y3, ..., yn
  • where yi ∈ T

33
Sequence Tagging-Naïve Bayes
  • Calculate P(yi | x1, x2, x3, ..., xn), ignoring
    features other than unigrams
  • Naïve Bayes
  • argmax P(yi | x1, x2, ..., xn)
    yi ∈ T
  • = argmax P(yi) P(x1, x2, ..., xn | yi)   (by Bayes' theorem)
    yi ∈ T
  • = argmax P(yi) P(x1 | yi) P(x2 | yi) ... P(xn | yi)   (independence)
    yi ∈ T
  • Pros: simple
  • Cons: cannot handle overlapping features due to
    its independence assumption

34
Maximum Entropy
35
Applications
  • NLP tasks
  • Machine Translation [Berger et al., 1996]
  • Sentence Boundary Detection
  • Part-of-Speech Tagging [Ratnaparkhi, 1998]
  • NER in the Newswire Domain [Borthwick, 1998]
  • Adaptive Statistical Language Modeling
    [Rosenfeld, 1996]
  • Chunking [Osborne, 2000]
  • Junk Mail Filtering [Zhang, 2003]
  • Biomedical NER [Lin et al., 2004]

36
Maximum Entropy-A Simple Example
  • Suppose yi relates only to xi (unigram)
  • the alphabet of X is denoted as S = {a, b, c};
    T = {t1, t2}
  • We define a joint probability distribution p
    over S × T
  • Given N observations (x1, y1), (x2, y2), ..., (xN, yN)
  • such as (a, t1), (b, t2), (c, t1), ...
  • How can we estimate the underlying model p?

37
The First Constraint (Cont.)
  • Two possible distributions are
  • Intuitively, we think the right one is better,
    since it is more uniform than the left one

38
The Second Constraint (Cont.)
  • Now we observe that t1 appears 70% of the time,
    so we add a new constraint to our model:
    P(a, t1) + P(b, t1) + P(c, t1) = 0.7
  • Again, two possible distributions

more uniform
39
The Third Constraint (Cont.)
  • Now we observe that a appears 50% of the time, so
    we add a new constraint to our model:
    P(a, t1) + P(a, t2) = 0.5
  • Two Questions
  • What does "uniform" mean?
  • How can we find the most uniform model subject to
    a set of constraints?

40
Entropy
  • A mathematical measure of uniformity
    (uncertainty)
  • For a conditional distribution p(y|x), its
    conditional entropy is (reconstructed below)
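
The formula is missing from the transcript; the conditional entropy used in the ME literature (e.g. Berger et al., 1996), presumably what the slide showed, is:

```latex
H(p) = -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)
```

where p̃(x) is the empirical distribution of x in the training data.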

41
Maximum Entropy Solution
  • Goal to select a model from a set C of
    allowed probability distributions which maximizes
    entropy H(p)
  • In other words, p is the most uniform
    distribution we can have.
  • We call p the maximum entropy solution.

42
Expectation of Features
  • Once a feature is defined, its expectation under
    the model is
  • Its observed (empirical) expectation is
  • We require the two expectations to be equal
  • More explicitly, we write (reconstructed below)
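
The slide's equations are missing; in the standard ME formulation the two expectations and the constraint, presumably what was shown, are:

```latex
E_{p}[f] = \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y)
\qquad
E_{\tilde{p}}[f] = \sum_{x, y} \tilde{p}(x, y)\, f(x, y)
\qquad
E_{p}[f] = E_{\tilde{p}}[f]
```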

43
Represent Constraints by Features
  • Under the ME framework, constraints imposed on a
    model are represented by features, known as
    feature functions, of the form
  • For example, the previous constraints can be
    written as

44
Expectation of Features (Cont.)
  • In the previous example, the fact that t1 appears
    70% of the time can be formalized as
  • And the fact that a appears 50% of the time can
    be written as

45
Maximum Entropy Framework
  • In general, suppose we have k constraints
    (features); we would like to find a model p* that
    lies in the subset C of P
  • defined by
  • which maximizes the entropy

46
Maximum Entropy Solution
  • It can be proved that the Maximum Entropy
    solution p* must have the exponential form
    (reconstructed below),
  • where k is the number of features and Z(x) is a
    normalization factor that ensures the
    distribution sums to one
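
The form itself is missing from the transcript; the standard ME solution, presumably what the slide showed, is:

```latex
p^{*}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x, y)\Big),
\qquad
Z_{\lambda}(x) = \sum_{y} \exp\!\Big(\sum_{i=1}^{k} \lambda_i f_i(x, y)\Big)
```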

47
Maximum Entropy Solution (Cont.)
  • So our task is to estimate the parameters λi in
    p* which maximize H(p).
  • In simple cases, we can find the solution
    analytically (as in the previous example); when
    the problem becomes more complex, we need a way to
    automatically derive the λi, given a set of
    constraints.

48
Parameters Estimation
  • A Constrained Optimization Problem
  • Finding a set of parameters
  • Λ = {λ1, λ2, λ3, ..., λn} of an exponential model
    which maximizes its log-likelihood.
  • To find values for the parameters of p*:
  • Generalized Iterative Scaling [Darroch and
    Ratcliff, 1972]
  • Improved Iterative Scaling [Berger et al., 1996]

49
ME's Advantages and Disadvantages
  • Advantages
  • Knowledge-Poor Features
  • Reusable Software
  • Free Incorporation of Overlapping and
    Interdependent Features
  • Disadvantages
  • Slow Training Procedure
  • No Explicit Control on Parameter Variance (unlike
    SVMs)

50
Biomedical NER
  • Based on an in-press journal paper:
  • Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu,
    "Integrating Linguistic Knowledge into a
    Conditional Random Field Framework to Identify
    Biomedical Named Entities," Expert Systems with
    Applications, Volume 30, Issue 1, January 2006,
    pp. 117-128. (SCI)

51
Sequence Tagging-NER Example
  • Determine the best tags for the sentence "IL-2
    gene induced NF-Kappa B"
  • We can formulate this example as
  • X = IL-2, gene, induced, NF-Kappa, B
  • assume T = {Ps, Pc, Pe, Pu, Ds, Dc, De, Du, O}
  • > candidate 1 of Y: Ps, Ps, Ps, Ps, Ps (invalid)
  • > candidate 2 of Y: O, Ps, Pe, O, O (valid)
  • > candidate k of Y: Ds, De, O, Ps, Pe (valid)
  • The answer of this example is Y = Ds, De, O, Ps, Pe

52
Decoding
  • By multiplying all P(yi | contexti), we can get
    the probability of a tag sequence Y
  • But some tag sequences are invalid
  • Ex: X = IL-2, gene, induced, NF-Kappa, B
  • assume T = {Ps, Pc, Pe, Pu, Ds, Dc, De, Du, O}
  • > candidate 1 of Y: Ps, Ps, Ps, Ps, Ps (invalid)
  • > candidate k of Y: Ds, De, O, Ps, Pe (valid)
  • Even if P(candidate 1) > P(candidate k), we still
    must output candidate k.
  • Use Viterbi search to find the most probable
    valid tag sequence (a minimal sketch follows)
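
A minimal Viterbi sketch of the decoding step, not the authors' implementation: `emissions[i][t]` stands for the model's P(tag t | context of token i), and `allowed(prev, cur)` is an assumed validity test for tag bigrams (e.g. a _continue tag may only follow the matching _start or _continue tag).

```python
import math

def viterbi(emissions, tags, allowed):
    n = len(emissions)
    NEG_INF = float("-inf")
    score = [{t: NEG_INF for t in tags} for _ in range(n)]  # best log-score per (position, tag)
    back = [{t: None for t in tags} for _ in range(n)]      # backpointers
    for t in tags:
        score[0][t] = math.log(emissions[0].get(t, 1e-12))
    for i in range(1, n):
        for cur in tags:
            emit = math.log(emissions[i].get(cur, 1e-12))
            for prev in tags:
                if not allowed(prev, cur):
                    continue  # skip transitions that would create an invalid sequence
                s = score[i - 1][prev] + emit
                if s > score[i][cur]:
                    score[i][cur] = s
                    back[i][cur] = prev
    # trace back from the best-scoring final tag
    best_last = max(tags, key=lambda t: score[n - 1][t])
    path = [best_last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage: two tokens, and "De" may not follow "O".
emissions = [{"Ds": 0.6, "O": 0.4}, {"De": 0.7, "O": 0.3}]
print(viterbi(emissions, ["Ds", "De", "O"], lambda p, c: not (c == "De" and p == "O")))
```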

53
Biomedical Named Entity Recognition
  • "... were used to model changes in susceptibility
    to NK cell killing caused by transient vs stable
    ..."
  • We assign a tag to the current token xi according
    to its features in context.

[Figure: candidate tags such as Protein_St, RNA_St, and DNA_St are scored for
the current token xi from its context; Feature 1 is AllCaps.]
54
Biomedical Named Entity Recognition
  • BioNER identifies biomedical names in text and
    categorizes them into different types
  • Essentials
  • Tagged Corpus
  • GENIA
  • Features
  • Orthographical Features
  • Morphological Features
  • Head Noun Features
  • POS Features
  • Tag Set

55
Biomedical NER - Tagged Corpus
  • GENIA Corpus (Ohta et al., 2002)
  • V1.1: 670 MEDLINE abstracts
  • V2.1: 670 MEDLINE abstracts with POS tagging
  • V3.0: 2000 MEDLINE abstracts
  • V3.0p: 2000 MEDLINE abstracts with POS tagging
  • V3.02
  • POS Tag Set
  • Penn Treebank (PTB)

56
Biomedical NER-Internal/External Features
  • Internal
  • Found within the name string itself
  • e.g., primary NK cells
  • External
  • Context
  • e.g., activate ROI formation
  • The CD4 coreceptor interacts

57
Orthographical Features
58
Morphological Features
59
Head Nouns
60
Biomedical NER-Tag Set
  • 23 NE Categories:
  • Protein, Other Name, DNA, CellType, Other
    Organic,
  • CellLine, Lipid, Multi Cell, Virus, RNA,
    CellComponent,
  • Body Part, Tissue, AminoAcidMonomer, Polynucleotide,
  • Mono Cell, Inorganic, Peptide, Nucleotide,
  • Atom, Other Artificial, Carbohydrate, Organic
  • Each NE category c has 4 tags: c_start,
    c_continue, c_end, c_unique
  • Ex: Protein has protein_start, protein_continue,
    protein_end, protein_unique
  • In addition, there's a non-NE tag o.
  • Therefore, |T| = 23 × 4 + 1 = 93

61
Bio NER: System Architecture
62
Nested Named Entity
  • Nested Annotation
  • <RNA><DNA>CIITA</DNA> mRNA</RNA>
  • From the perspective of parsing, we prefer bigger
    chunks, that is, <RNA>CIITA mRNA</RNA>. However,
    ME sometimes only recognizes CIITA as DNA
  • 16.57% of NEs in GENIA 3.02 contain one or more
    shorter NEs [Zhang, 2003]
  • Solution: Post-Processing
  • Boundary Extension
  • Re-classification

63
Post-Processing: Boundary Extension
  • Boundary extension for nested NEs
  • Extend the right boundary repeatedly if the NE is
    followed by another NE, a head noun, or an
    R-boundary word with a valid POS tag.
  • Extend the left boundary repeatedly if the NE is
    preceded by an L-boundary word with a valid POS
    tag.
  • Boundary extension for NEs containing brackets or
    slashes
  • NE NE ( NE ) NE or head noun or
    R-boundary word with valid POS tag
  • NE NE / NE ( / NE ) NE or head
    noun or R-boundary word with valid POS tag

64
Experimental Results: State-of-the-art Systems
GENIA v3.02 (10 Fold-CV)
65
Remarks
  • ME offers a clear way to incorporate various
    evidence into a single, powerful model.
  • It detects 80% of the rough positions of Bio-NEs.
  • Due to the nested annotations in GENIA and the
    preference for bigger chunks, we apply a
    post-processing technique and obtain the highest
    F-score on GENIA so far.

66
References
  • Borthwick, A. (1999). A Maximum Entropy Approach
    to Named Entity Recognition, New York University.
  • Hou, W.-J. and H.-H. Chen (2003). Enhancing
    Performance of Protein Name Recognizers Using
    Collocation. ACL Workshop on Natural Language
    Processing in Biomedicine, Sapporo, Japan.
  • Kazama, J., T. Makino, et al. (2002). Tuning
    Support Vector Machine for Biomedical Named
    Entity Recognition. ACL Workshop on NLP in the
    Biomedical Domain.
  • Lee, K.-J., Y.-S. Hwang, et al. (2003). Two-Phase
    Biomedical NE Recognition based on SVMs. ACL
    Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • McDonald, D. (1996). Internal and External
    Evidence in the Identification and Semantic
    Categorization of Proper Names. Corpus Processing
    for Lexical Acquisition. B. Boguraev and J.
    Pustejovsky. Cambridge, MA, MIT Press 21-39.
  • Nenadic, G., S. Rice, et al. (2003). Selecting
    Text Features for Gene Name Classification from
    Documents to Terms. ACL Workshop on Natural
    Language Processing in Biomedicine, Sapporo,
    Japan.
  • Ohta, T., Y. Tateisi, et al. (2002). The GENIA
    corpus: An annotated research abstract corpus in
    molecular biology domain. HLT 2002.
  • Shen, D., J. Zhang, et al. (2003). Effective
    Adaptation of Hidden Markov Model-based Named
    Entity Recognizer for Biomedical Domain. ACL
    Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • Takeuchi, K. and N. Collier (2003). Bio-Medical
    Entity Extraction using Support Vector Machines.
    ACL Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • Torii, M., S. Kamboj, et al. (2003). An
    Investigation of Various Information Sources for
    Classifying Biological names. ACL Workshop on
    Natural Language Processing in Biomedicine,
    Sapporo, Japan.
  • Tsuruoka, Y. and J. i. Tsujii (2003). Boosting
    Precision and Recall of Dictionary-Based Protein
    Name Recognition. ACL Workshop on Natural
    Language Processing in Biomedicine, Sapporo,
    Japan.
  • Yamamoto, K., T. Kudo, et al. (2003). Protein
    Name Tagging for Biomedical Annotation in Text.
    ACL Workshop on Natural Language Processing in
    Biomedicine, Sapporo, Japan.
  • Zhang, J., D. Shen, et al. (2003). Exploring
    Various Evidences for Recognition of Named
    Entities in Biomedical Domain. EMNLP 2003.

67
Chinese Shallow Parsing
  • Based on a conference paper:
  • Shih-Hung Wu, Cheng-Wei Shih, Chia-Wei Wu,
    Tzong-Han Tsai, and Wen-Lian Hsu, "Applying
    Maximum Entropy to Robust Chinese Shallow
    Parsing," Proceedings of ROCLING 2005, NCKU,
    Tainan.

68
Outline
  • Introduction
  • Method
  • Sequential tagging
  • Maximum Entropy
  • Experiment
  • Noise Generation
  • Test on the Noisy Training Set
  • Conclusion and future works

69
Introduction
  • Full-Parsing is useful but difficult
  • Ambiguity, Unknown word
  • Chunking is achievable
  • Fast and robust (suitable for online
    applications)
  • Shallow Parsing (Chunking) applications
  • information retrieval, information extraction,
    question answering, and automatic document
    summarization
  • Our goal
  • Build a Chinese shallow parser and test the
    robustness

70
Chunking with unknown word
  • ???/??/?/??/??/???/??
  • Standard chunking ????????/NP ??/Dd ???/DM
    ??/VP
  • If ?/?/? VH13/Dd/P15 is an unknown word, then
    the chunking might be
  • ??/NP ??????/PP ??/Dd ???/DM ??/VP

71
Related works in Beijing, Harbin, Shenyang, and
Hong Kong [10, 15, 16, 20, 21]
  • Standard
  • News corpus, UPenn Treebank, Sinica Treebank
  • Method
  • Memory-based learning, Naïve Bayes, SVM, CRF, ME
  • Evaluation
  • Perplexity, Token accuracy, Chunk accuracy
  • Noisy Model
  • random noise, filled noise, and repeated noise
    [13]

72
Chunk standard
  • First level phrases of Sinica Treebank 3.0

73
Phrases
74
Sequential Tagging Scheme
  • B-I-O Scheme

75
Our Tagset
  • Each token is tagged with one of the following 11
    tags (a small conversion sketch follows this
    list):
  • NP_begin, NP_continue,
  • VP_begin, VP_continue,
  • PP_begin, PP_continue,
  • GP_begin, GP_continue,
  • S_begin, S_continue,
  • and X (others).
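
An illustrative sketch (not the authors' code) of how a chunked sentence maps onto this tag set; the chunk representation `(phrase_type, tokens)`, with `None` for tokens outside any chunk, is an assumption made for the example.

```python
def chunks_to_tags(chunks):
    """chunks: list of (phrase_type, [tokens]); phrase_type None means outside any chunk."""
    tags = []
    for phrase, tokens in chunks:
        for i, _tok in enumerate(tokens):
            if phrase is None:
                tags.append("X")                   # outside any chunk
            elif i == 0:
                tags.append(f"{phrase}_begin")     # first token of the chunk
            else:
                tags.append(f"{phrase}_continue")  # remaining tokens of the chunk
    return tags

# e.g. chunks_to_tags([("NP", ["w1", "w2"]), (None, ["w3"]), ("VP", ["w4"])])
# -> ["NP_begin", "NP_continue", "X", "VP_begin"]
```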

76
Maximum Entropy 2
  • Conditional Exponential Model
  • Binary-valued feature functions
  • Decoding

77
Feature functions
  • Each feature is represented by a binary-valued
    function

78
Conditional Exponential Model
  • Feature fi(x, y) is a binary-valued function
  • Parameter λi is a real-valued weight associated
    with fi.
  • Model Λ = {λ1, λ2, ..., λn}
  • Normalizing factor

79
Conditional Exponential Model
  • the probability of observation o, given history h
  • Feature fi(x, y) is a binary-valued function
  • Parameter αi is a real-valued weight associated
    with fi.
  • Normalizing factor Z(h)

80
Use the Model
  • Training
  • Use empirical data to estimate the parameters
    with the Improved Iterative Scaling (IIS)
    algorithm
  • Test
  • Decoding: find the highest-probability path
    through the lattice of conditional probabilities

81
Experiment
82
Data and Features
  • Sinica Treebank 3.0 contains more than 54,000
    sentences, from which we randomly extract 30,000
    for training and 10,000 for testing.
  • Features
  • words, adjacent characters, prefixes of words (1
    and 2 characters), suffixes of words (1 and 2
    characters), word length, POS of words, adjacent
    POS tags, and the word's location in the chunk it
    belongs to

83
Noise Model Generation
  • Type 1 (single characters)
  • ???? Nca replaced by
  • ?, Nab, ?, Dbab, ?, Ncda, ?, Nca
  • Type 2 (AUTOTAG segmentation)
  • ???? Nb would be tagged as ??/Nb, ??/Nb.

84
Results and Discussion
85
Evaluation Criteria
  • We define four types of accuracy
  • chunk boundary accuracy
  • Ignore the category
  • chunk category accuracy
  • Ignore the boundary
  • Token accuracy
  • Chunk accuracy

86
Evaluation Criteria Example
  • Standard parsing
  • ???/NP ??/VC ?/NP ?-???/VP
  • 4 chunks, 5 tokens
  • If the parsing result is
  • ???/NP ??/VC ?/NP ?/Db ???/VE
  • Then the
  • chunk boundary: 3/4 = 0.75
  • chunk category: 3/4 = 0.75
  • token accuracy: 3/5 = 0.6
  • chunk accuracy: 3/4 = 0.75
  • (A sketch that reproduces these numbers follows.)
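
A hedged sketch of one plausible reading of the four criteria, using chunks as (category, start, end) token spans; it reproduces the numbers in the example above but is not the authors' exact scoring script.

```python
def evaluate(gold, pred, n_tokens):
    # chunk boundary accuracy: gold span found among predicted spans (category ignored)
    pred_spans = {(s, e) for _, s, e in pred}
    boundary = sum((s, e) in pred_spans for _, s, e in gold) / len(gold)
    # chunk accuracy: both category and span must match
    chunk = sum(ch in pred for ch in gold) / len(gold)
    # chunk category accuracy: some overlapping predicted chunk has the same
    # category (boundary ignored)
    def overlaps(a, b):  # half-open token spans (start, end)
        return a[0] < b[1] and b[0] < a[1]
    category = sum(
        any(pc == c and overlaps((s, e), (ps, pe)) for pc, ps, pe in pred)
        for c, s, e in gold
    ) / len(gold)
    # token accuracy: the chunk covering each token agrees in category and span
    def cover(chunks, i):
        return next((ch for ch in chunks if ch[1] <= i < ch[2]), None)
    token = sum(cover(gold, i) == cover(pred, i) for i in range(n_tokens)) / n_tokens
    return boundary, category, token, chunk

gold = [("NP", 0, 1), ("VC", 1, 2), ("NP", 2, 3), ("VP", 3, 5)]
pred = [("NP", 0, 1), ("VC", 1, 2), ("NP", 2, 3), ("Db", 3, 4), ("VE", 4, 5)]
print(evaluate(gold, pred, n_tokens=5))   # -> (0.75, 0.75, 0.6, 0.75)
```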

87
Result of Type 1 noisy data
  • the percentage of Nb and Nc replaced by
    single character noisy data

88
Evaluation of the boundaries in different
experiment configurations
89
Evaluation of the chunking category in different
experiment configurations
90
Evaluation of tokens in different experiment
configurations
91
Evaluation of chunks in different experiment
configurations
92
Results of Type 2 noisy data
  • C-C: Using a clean training model and clean test
    data.
  • C-N: Using a clean training model and noisy test
    data in which all Nb and Nc are replaced by
    tokenized results.
  • N-C: Using a training model with noisy data, in
    which all Nb and Nc are replaced by tokenized
    results, to chunk clean test data.
  • N-N: Both the training model and the test data
    have noisy data in which all Nb and Nc are
    replaced by tokenized results.

93
Tokenized string noisy data (the Type 2 noise
model) vs. the AUTOTAG-parsed model
94
Error Analysis
95
Chunking examples with Type 1 noise
96
Shallow parsing examples with Type 2 noise
97
Shallow parsing examples with AUTOTAG-parsed
training data and test data
98
Chunk results of Open Corpus
99
Conclusion and Future Works
  • The system can chunk Chinese sentences into five
    chunk types
  • The accuracy on data with simulated unknown words
    decreases only slightly in chunk parsing
  • On an open corpus, the system yields interesting
    chunking results.
  • Future work
  • adopting other POS systems, such as the Penn
    Chinese Treebank tagset, for Chinese shallow
    parsing could prove both interesting and useful
  • adding more types of noise, such as the random
    noise, filled noise, and repeated noise proposed
    by Osborne [13].
  • In addition to the Sinica Treebank, we will
    extend our training corpus by incorporating other
    corpora, such as the Penn Chinese Treebank.

100
Appendix
  • ME and Improved Iterative Scaling Algorithm

101
Model a Problem
102
Conditional Exponential Model
  • Feature fi(x, y) is a binary-valued function
  • Parameter λi is a real-valued weight associated
    with fi.
  • Model Λ = {λ1, λ2, ..., λn}
  • Normalizing factor

103
Notes on the Model
  • Features are domain-dependent
  • The exponential form guarantees positive
    probabilities
  • Initially, the parameters λ1, λ2, ..., λn are
    unknown
  • Use empirical data to estimate them
  • maximum log-likelihood

104
Maximum log-likelihood
  • Given a joint empirical distribution
  • Log-likelihood as a measure of the quality of the
    model Λ (reconstructed below)
  • Log-likelihood ≤ 0 always
  • Log-likelihood = 0 is optimal
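
The definition is missing from the transcript; the conditional log-likelihood normally used for this model (e.g. Berger et al., 1996), presumably what the slide showed, is:

```latex
L_{\tilde{p}}(\Lambda) = \sum_{x, y} \tilde{p}(x, y) \log p_{\Lambda}(y \mid x)
```

Since each log-probability is at most 0, the log-likelihood is never positive; it reaches 0 only if the model assigns probability 1 to every observed (x, y).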

105
Maximum likelihood of Conditional Exponential
Model
  • Differentiating with respect to each λi
  • The expectation of fi(x,y) with respect to the
    empirical distribution and the model

106
From Maximum Entropy to Exponential Model
  • Through Lagrange Multipliers

107
Entropy → Lagrangian
  • H(p) = -Σx p(x) log p(x)
  • Lagrangian
  • Optimize the Lagrangian

108
  • For fixed λ, the Lagrangian has a maximum where

109
Training
  • Improved Iterative Scaling (IIS) Algorithm

110
Finding optimal λ by iteration
  • The change in log-likelihood from λ to λ + δ

111
Finding optimal λ by iteration 2
  • Use the inequality -log α ≥ 1 - α

where
therefore
112
Finding optimal λ by iteration 3
  • By Jensen's inequality (cf. appendix):
    exp(Σx p(x) q(x)) ≤ Σx p(x) exp(q(x))

where
113
Finding optimal λ by iteration 4
  • call it B(δ) and differentiate it
  • δi appears alone, so we can solve for each δi

114
Improved Iterative Scaling Algorithm
  • IIS Algorithm
  • Start with some value for each λi
  • Repeat until convergence:
  • find each δi by solving the IIS equation
  • Set λi ← λi + δi
  • (An illustrative iterative-scaling sketch follows.)
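
A small, self-contained sketch of the iterate-until-convergence structure. It uses the closed-form Generalized Iterative Scaling update rather than solving the IIS equation for each δi (a plainly named simplification); all names, the toy data, and the binary-feature assumption are illustrative, not the authors' code.

```python
import math

def gis(samples, features, labels, iterations=200):
    """samples: list of (x, y) pairs; features: list of functions f(x, y) -> 0/1."""
    # GIS assumes every (x, y) fires the same number C of features (a "slack"
    # feature is the usual fix); here C is taken as the maximum observed sum.
    C = max(sum(f(x, y) for f in features) for x, _ in samples for y in labels)
    lam = [0.0] * len(features)

    def p(y, x):
        # model probability p_lambda(y | x)
        scores = {yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
                  for yy in labels}
        return scores[y] / sum(scores.values())

    # empirical expectations E~[f_i] over the training sample
    emp = [sum(f(x, y) for x, y in samples) / len(samples) for f in features]
    for _ in range(iterations):
        # model expectations E_p[f_i] under the current parameters
        mod = [sum(p(y, x) * f(x, y) for x, _ in samples for y in labels) / len(samples)
               for f in features]
        # closed-form GIS update; IIS would instead solve for delta_i numerically
        for i, (e, m) in enumerate(zip(emp, mod)):
            if e > 0 and m > 0:
                lam[i] += math.log(e / m) / C
    return lam

# Tiny usage example: features pairing a token with a tag.
samples = [("gene", "DNA"), ("gene", "DNA"), ("cell", "O")]
features = [lambda x, y: 1 if (x, y) == ("gene", "DNA") else 0,
            lambda x, y: 1 if (x, y) == ("cell", "O") else 0]
print(gis(samples, features, labels=["DNA", "O"]))
```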

115
Applications
  • All sequential labeling problems
  • Natural language processing
  • NER, POS tagging, Chunking
  • Speech recognition
  • Graphics
  • Noise reduction
  • Many others

116
Reference
  • A. L. Berger, S. A. Della Pietra, and V. J. Della
    Pietra, "A maximum entropy approach to natural
    language processing," 1996.
  • A. Berger, "The Improved Iterative Scaling
    Algorithm: A Gentle Introduction," 1997.