The Improved Iterative Scaling Algorithm: A Gentle Introduction

1
The Improved Iterative Scaling Algorithm: A
Gentle Introduction
  • Adam Berger, CMU, 1997

2
Introduction
  • Random process
  • Produces some output value y, a member of a
    (necessarily finite) set of possible output
    values
  • The value of the random variable y is influenced
    by some conditioning information (or context) x
  • Language modeling problem
  • Assign a probability p(y | x) to the event that
    the next word in a sequence of text will be y,
    given x, the value of the previous words

3
Features and constraints
  • The goal is to construct a statistical model of
    the process which generated the training sample
  • The building blocks of this model will be a set
    of statistics of the training sample
  • The frequency with which in translated to either
    dans or en was 3/10
  • The frequency with which in translated to either
    dans or au cours de was 1/2
  • And so on

Statistics of the training sample
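For concreteness, statistics of this kind can be computed directly from counts. The
ten-event sample below is hypothetical (the actual training sample is not reproduced
in the presentation); it is chosen only so that the two frequencies quoted above come
out to 3/10 and 1/2.

# Hypothetical sample of French renderings of the English word "in".
from collections import Counter

sample = ["dans", "dans", "en",
          "au cours de", "au cours de", "au cours de",
          "à", "à", "pendant", "pendant"]

counts = Counter(sample)
n = len(sample)

# Statistics of the training sample:
freq_dans_or_en = (counts["dans"] + counts["en"]) / n                    # 3/10
freq_dans_or_au_cours_de = (counts["dans"] + counts["au cours de"]) / n  # 1/2
print(freq_dans_or_en, freq_dans_or_au_cours_de)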
4
Features and constraints
  • Conditioning information x
  • E.g., in the training sample, if April is the
    word following in, then the translation of in is
    en with frequency 9/10
  • Indicator function
  • Expected value of f
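The two formulas referenced by the last two bullets appeared as images on the slide.
Based on the surrounding text, they are presumably the standard ones: an indicator
(feature) function such as

  f(x, y) =
    \begin{cases}
      1 & \text{if } y = \text{en and April follows in} \\
      0 & \text{otherwise}
    \end{cases}

and its expected value under the empirical distribution \tilde{p}(x, y) of the
training sample,

  \tilde{p}(f) \;=\; \sum_{x, y} \tilde{p}(x, y)\, f(x, y).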

5
Features and constraints
  • We can express any statistic of the sample as the
    expected value of an appropriate binary-valued
    indicator function f
  • We call such a function a feature function, or
    feature for short

6
Features and constraints
  • When we discover a statistic that we feel is
    useful, we can acknowledge its importance by
    requiring that our model accord with it
  • We do this by constraining the expected value
    that the model assigns to the corresponding
    feature function f
  • The expected value of f with respect to the model
    p(y | x) is
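The expression itself was an image on the slide; in the standard maxent setup it reads

  p(f) \;=\; \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y),

where \tilde{p}(x) is the empirical distribution of contexts x in the training sample.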

7
Features and constraints
  • We constrain this expected value to be the same
    as the expected value of f in the training
    sample. That is, we require
  • We call this requirement a constraint equation or
    simply a constraint
  • Finally, we get
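The required equality, shown as an image in the original slide, is presumably

  p(f) \;=\; \tilde{p}(f),

which, written out with the two expectations defined above, becomes

  \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) \;=\; \sum_{x, y} \tilde{p}(x, y)\, f(x, y).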

8
Features and constraints
  • To sum up so far, we now have
  • A means of representing statistical phenomena
    inherent in a sample of data (namely, the expected
    value of a feature f in the sample)
  • A means of requiring that our model of the
    process exhibit these phenomena (namely, the
    constraint that the model's expected value of f
    match its expected value in the sample)
  • Feature
  • Is a binary-valued function of (x, y)
  • Constraint
  • Is an equation between the expected value of the
    feature function in the model and its expected
    value in the training data

9
The maxent principle
  • Suppose that we are given n feature functions fi,
    which determine statistics we feel are important
    in modeling the process. We would like our model
    to accord with these statistics
  • That is, we would like p to lie in the subset C
    of P defined by
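The defining condition was an image; in the usual notation, with P the space of all
conditional distributions, it reads

  C \;=\; \bigl\{\, p \in P \;:\; p(f_i) = \tilde{p}(f_i) \ \text{for } i \in \{1, \dots, n\} \,\bigr\}.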

10
Exponential form
  • The maximum entropy principle presents us with a
    problem in constrained optimization: find the
    p ∈ C which maximizes H(p)
  • Find
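The objective, shown as an image on the slide, is in the standard formulation

  p^{*} \;=\; \operatorname*{argmax}_{p \in C} H(p),
  \qquad
  H(p) \;=\; -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x).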

11
Exponential form
  • We maximize H(p) subject to the following
    constraints
  • 1.
  • 2.
  • This and the previous condition guarantee that p
    is a conditional probability distribution
  • 3.
  • In other words, p ∈ C, and so satisfies the active
    constraints C
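The three constraints themselves were images; in the standard formulation they are

  1. \; p(y \mid x) \ge 0 \quad \text{for all } x, y

  2. \; \sum_{y} p(y \mid x) = 1 \quad \text{for all } x

  3. \; \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f_i(x, y) \;=\; \sum_{x, y} \tilde{p}(x, y)\, f_i(x, y)
       \quad \text{for } i \in \{1, \dots, n\}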

12
Exponential form
  • To solve this optimization problem, introduce the
    Lagrangian
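The Lagrangian was shown as an image; one standard way to write it, with a multiplier
\lambda_i for each feature constraint and \gamma for the normalization constraint
(strictly, one such multiplier per context x), is

  \xi(p, \Lambda, \gamma) \;=\; H(p)
    \;+\; \sum_{i} \lambda_i \bigl( p(f_i) - \tilde{p}(f_i) \bigr)
    \;+\; \gamma \Bigl( \sum_{y} p(y \mid x) - 1 \Bigr).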

13
Exponential form
  [Equations (1) and (2) appeared as images on slides 13 and 14 of the original deck.]
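From the slide title and the standard derivation, equations (1) and (2) are presumably
the exponential (Gibbs) form of the solution and its normalizing constant:

  p_{\lambda}(y \mid x) \;=\; \frac{1}{Z_{\lambda}(x)} \exp\Bigl( \sum_{i} \lambda_i f_i(x, y) \Bigr),
  \qquad
  Z_{\lambda}(x) \;=\; \sum_{y} \exp\Bigl( \sum_{i} \lambda_i f_i(x, y) \Bigr).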
15
Maximum likelihood
  [The slide content, including equation (4), appeared as images on slides 15 and 16.]
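The equations here were also images; in the standard presentation the point of these
slides is that maximum entropy and maximum likelihood select the same model. The
log-likelihood of the training sample under a conditional model p is

  L_{\tilde{p}}(p) \;=\; \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x),

and the model p^{*} \in C of maximum entropy is also the model of exponential form
that maximizes L_{\tilde{p}}.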
17
Finding λ
  [Equations (5)-(7), together with content labelled p(x) and q(x), appeared as
    images on slides 17-19.]
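The derivation itself was in the image content; the improved iterative scaling update
it leads to is, in its standard form: for each feature f_i, find the \delta_i solving

  \sum_{x, y} \tilde{p}(x)\, p_{\lambda}(y \mid x)\, f_i(x, y)\,
      \exp\bigl( \delta_i f^{\#}(x, y) \bigr) \;=\; \tilde{p}(f_i),
  \qquad
  f^{\#}(x, y) \;=\; \sum_{i} f_i(x, y),

then set \lambda_i \leftarrow \lambda_i + \delta_i; repeating this for all features
until convergence monotonically increases the likelihood.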
20
  [Equation (8) appeared as an image on slide 20.]
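To make the algorithm concrete, here is a minimal, illustrative Python sketch of
improved iterative scaling for a conditional model of the exponential form above.
Everything in it (the data structures, the bisection solver, the toy interface) is an
assumption of this sketch, not something specified in the slides.

# Minimal sketch of Improved Iterative Scaling (IIS) for a conditional maxent model
#   p_lambda(y | x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x).
# Illustrative only: the function names and the bisection solver are choices of this
# sketch, not part of the original presentation.
import math

def train_iis(samples, features, Y, iterations=50):
    """samples: list of (x, y) pairs; features: list of binary functions f(x, y);
    Y: the (finite) set of possible outputs."""
    n, N = len(features), len(samples)
    lam = [0.0] * n
    contexts = [x for x, _ in samples]          # empirical contexts; repeats encode p~(x)
    # empirical expectations  p~(f_i) = (1/N) * sum over (x, y) in the sample of f_i(x, y)
    emp = [sum(f(x, y) for x, y in samples) / N for f in features]

    def p_cond(x):
        """model distribution p_lambda(y | x) over all y in Y"""
        scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features))) for y in Y}
        Z = sum(scores.values())
        return {y: s / Z for y, s in scores.items()}

    def fsharp(x, y):
        return sum(f(x, y) for f in features)   # f#(x, y): number of active features

    for _ in range(iterations):
        deltas = []
        for i, fi in enumerate(features):
            # Solve  (1/N) sum_x sum_y p(y|x) f_i(x,y) exp(delta * f#(x,y)) = p~(f_i)
            # for delta; the left side is monotone in delta, so bisection suffices.
            def g(delta):
                total = 0.0
                for x in contexts:
                    p = p_cond(x)
                    for y in Y:
                        total += p[y] * fi(x, y) * math.exp(delta * fsharp(x, y))
                return total / N - emp[i]
            lo, hi = -20.0, 20.0
            for _ in range(60):
                mid = (lo + hi) / 2.0
                lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
            deltas.append((lo + hi) / 2.0)
        for i, d in enumerate(deltas):          # apply all updates for this iteration
            lam[i] += d
    return lam, p_cond

With the hypothetical translation sample above, one could define binary features such
as f(x, y) = 1 when y is dans, and call train_iis(samples, features, Y) to obtain
weights whose model expectations match the empirical ones.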