Transcript and Presenter's Notes

Title: Back to Conditional Log-Linear Modeling


1
Back to Conditional Log-Linear Modeling
2
Probability is Useful
summary of half of the course (statistics)
  • We love probability distributions!
  • We've learned how to define and use p(…) functions.
  • Pick best output text T from a set of candidates
  • speech recognition (HW2), machine translation, OCR, spell correction, ...
  • maximize p1(T) for some appropriate distribution p1
  • Pick best annotation T for a fixed input I
  • text categorization, parsing, part-of-speech tagging
  • maximize p(T | I), or equivalently maximize the joint probability p(I,T)
  • often define p(I,T) by a noisy channel: p(I,T) = p(T) p(I | T)
  • speech recognition and the other tasks above are cases of this too:
    we're maximizing an appropriate p1(T) defined by p(T | I)
  • Pick best probability distribution (a meta-problem!)
  • really, pick best parameters θ: train HMM, PCFG, n-grams, clusters
  • maximum likelihood; smoothing; EM if unsupervised (incomplete data)
  • Bayesian smoothing: max p(θ | data), i.e., max p(θ, data) = p(θ) p(data | θ)

3
Probability is Flexible
summary of the other half of the course (linguistics)
  • We love probability distributions!
  • We've learned how to define and use p(…) functions.
  • We want p(…) to define the probability of linguistic objects
  • Trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)
  • Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs;
    regex compilation, best-paths, forward-backward, collocations)
  • Vectors (decision lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)
  • We've also seen some not-so-probabilistic stuff
  • Syntactic features, semantics, morphology, Gold. Could they be stochasticized?
  • Methods can be quantitative and data-driven but not fully probabilistic:
    transformation-based learning, bottom-up clustering, LSA, competitive linking
  • But probabilities have wormed their way into most things
  • p(…) has to capture our intuitions about the linguistic data

4
An Alternative Tradition
  • Old AI hacking technique
  • Possible parses (or whatever) have scores.
  • Pick the one with the best score.
  • How do you define the score?
  • Completely ad hoc!
  • Throw anything you want into the stew
  • Add a bonus for this, a penalty for that, etc.
  • Learns over time as you adjust bonuses and
    penalties by hand to improve performance.
  • Total kludge, but totally flexible too
  • Can throw in any intuitions you might have

5
An Alternative Tradition

Exposé at 9: Probabilistic Revolution Not Really a Revolution, Critics Say.
"Log-probabilities no more than scores in disguise. We're just adding stuff
up like the old corrupt regime did," admits spokesperson.
6
Nuthin' but adding weights
  • n-grams: log p(w7 | w5, w6) + log p(w8 | w6, w7)
  • PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP)
  • HMM tagging: log p(t7 | t5, t6) + log p(w7 | t7)
  • Noisy channel: log p(source) + log p(data | source)
  • Cascade of composed FSTs: log p(A) + log p(B | A) + log p(C | B)
  • Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class)
  • Note: Today we'll use +logprob, not -logprob; i.e., bigger weights are better.

7
Nuthin' but adding weights
  • n-grams: log p(w7 | w5, w6) + log p(w8 | w6, w7)
  • PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP)
  • Can describe any linguistic object as a collection of features
    (here, a tree's features are all of its component rules)
    (different meaning of "features" from singular/plural/etc.)
  • Weight of the object = total weight of its features
  • Our weights have always been conditional log-probs (≤ 0)
  • but what if we changed that?
  • HMM tagging: log p(t7 | t5, t6) + log p(w7 | t7)
  • Noisy channel: log p(source) + log p(data | source)
  • Cascade of FSTs: log p(A) + log p(B | A) + log p(C | B)
  • Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class)

8
What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that)
  • n-grams: log p(w7 | w5, w6) + log p(w8 | w6, w7)
  • PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP)
  • HMM tagging: log p(t7 | t5, t6) + log p(w7 | t7)
  • Noisy channel: log p(source) + log p(data | source)
  • Cascade of FSTs: log p(A) + log p(B | A) + log p(C | B)
  • Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class)

9
What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that)
  • n-grams: λ(w7 | w5, w6) + λ(w8 | w6, w7)
  • PCFG: λ(NP VP | S) + λ(Papa | NP) + λ(VP PP | VP)
  • HMM tagging: λ(t7 | t5, t6) + λ(w7 | t7)
  • Noisy channel: λ(source) + λ(data | source)
  • Cascade of FSTs: λ(A) + λ(B | A) + λ(C | B)
  • Naïve Bayes: λ(Class) + λ(feature1 | Class) + λ(feature2 | Class)

In practice, λ is a hash table: it maps from a feature name (a string or
object) to a feature weight (a float), e.g., λ(NP VP | S) = weight of the
S → NP VP rule, say -0.1 or 1.3
10
What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that),
or, with a prettier name, λ(that, this)
  • n-grams: λ(w5 w6 w7) + λ(w6 w7 w8)
  • PCFG: λ(S → NP VP) + λ(NP → Papa) + λ(VP → VP PP)
  • HMM tagging: λ(t5 t6 t7) + λ(t7 → w7)
  • Noisy channel: λ(source) + λ(source, data)
  • Cascade of FSTs: λ(A) + λ(A, B) + λ(B, C)
  • Naïve Bayes: λ(Class) + λ(Class, feature1) + λ(Class, feature2)

In practice, λ is a hash table: it maps from a feature name (a string or
object) to a feature weight (a float), e.g., λ(S → NP VP) = weight of the
S → NP VP rule, say -0.1 or 1.3
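As a concrete illustration of this hash-table view, here is a minimal Python
sketch (not code from the course; the names WEIGHTS and score and the toy
weights are invented): an object's score is simply the total weight of its
features.

from collections import Counter

# lambda as a hash table: feature name -> feature weight (any real number)
WEIGHTS = {
    "S -> NP VP": -0.1,
    "NP -> Det N": 1.3,
    "VP -> V NP": 0.7,
}

def score(features):
    """Score of an object = total weight of its features (with multiplicity)."""
    counts = Counter(features)              # a tree may use the same rule twice
    return sum(WEIGHTS.get(f, 0.0) * n for f, n in counts.items())

# A toy parse described as the multiset of rules it uses:
tree_features = ["S -> NP VP", "NP -> Det N", "NP -> Det N", "VP -> V NP"]
print(score(tree_features))                 # -0.1 + 1.3 + 1.3 + 0.7 = 3.2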
11
What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(that, this)
  • n-grams: λ(w5 w6 w7) + λ(w6 w7 w8)
  • Best string is the one whose trigrams have the highest total weight
  • PCFG: λ(S → NP VP) + λ(NP → Papa) + λ(VP → VP PP)
  • Best parse is the one whose rules have the highest total weight
    (use CKY/Earley)
  • HMM tagging: λ(t5 t6 t7) + λ(t7 → w7)
  • Best tagging has the highest total weight of all transitions and
    emissions (see the Viterbi sketch after this list)
  • Noisy channel: λ(source) + λ(source, data)
  • To guess source: maximize (weight of source + weight of source-data match)
  • Naïve Bayes: λ(Class) + λ(Class, feature1) + λ(Class, feature2)
  • Best class maximizes prior weight + weight of compatibility with features
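To make the tagging case concrete, here is a small Viterbi sketch in Python
(illustrative only; the tag set and the transition/emission weight tables are
invented, and arbitrary real-valued weights are allowed, not just log-probs):

def viterbi(words, tags, trans, emit):
    """Best tagging = highest total weight of transitions + emissions.
    trans[(t_prev, t)] and emit[(t, word)] are arbitrary real weights."""
    best = [{t: emit.get((t, words[0]), 0.0) for t in tags}]   # column 0: emissions only
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            cands = [(best[i-1][tp] + trans.get((tp, t), 0.0)
                      + emit.get((t, words[i]), 0.0), tp) for tp in tags]
            best[i][t], back[i][t] = max(cands)
    t = max(tags, key=lambda tag: best[-1][tag])               # best final tag
    path = [t]
    for i in range(len(words) - 1, 0, -1):                     # follow backpointers
        t = back[i][t]
        path.append(t)
    return list(reversed(path))

tags = ["N", "V"]
trans = {("N", "V"): 1.0, ("V", "N"): 0.5}
emit = {("N", "Papa"): 2.0, ("V", "ate"): 2.0, ("N", "caviar"): 2.0}
print(viterbi("Papa ate caviar".split(), tags, trans, emit))   # ['N', 'V', 'N']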

12
The general problem
  • Given some input x
  • Occasionally empty, e.g., no input is needed for a generative n-gram
    model or other model of strings (randsent)
  • Consider a set of candidate outputs y
  • Classifications for x (small number: often just 2)
  • Taggings of x (exponentially many)
  • Parses of x (exponential, even
    infinite)
  • Translations of x (exponential, even
    infinite)
  • Want to find the best y, given x

13
Finding the best y given x
  • Given some input x
  • Consider a set of candidate outputs y
  • Define a scoring function score(x,y)
  • We're talking about linear functions: a sum of feature weights
  • Choose the y that maximizes score(x,y) (see the sketch after this list)
  • Easy when there are only two candidates y (spam classification, binary
    WSD, etc.): just try both!
  • Hard for structured prediction: but you now know how!
  • At least for linear scoring functions with certain kinds of features.
  • Generalizing beyond this is an active area!
  • Approximate inference in graphical models, integer linear programming,
    weighted MAX-SAT, etc.; see 600.325/425 Declarative Methods
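A minimal sketch of the "just try both" case in Python (the feature function,
feature names, and weights below are invented for illustration; they are not
from the course):

def extract_features(x, y):
    """Features of the (input, output) pair: here, simple class-word pairs."""
    return ["%s&%s" % (y, word) for word in x.lower().split()] + ["bias&" + y]

def score(weights, x, y):
    """Linear scoring function: a sum of feature weights."""
    return sum(weights.get(f, 0.0) for f in extract_features(x, y))

def best_output(weights, x, candidates):
    """Decode by brute force: try every candidate y and keep the argmax."""
    return max(candidates, key=lambda y: score(weights, x, y))

weights = {"spam&buy": 2.0, "spam&now": 0.5, "ham&meeting": 1.5, "bias&ham": 0.3}
print(best_output(weights, "Buy now", ["spam", "ham"]))   # -> spam (2.5 vs 0.3)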

14
  • Given sentence x
  • You know how to find the max-score parse y (or min-cost parse)
  • Provided that the score of a parse is a sum over its individual rules
    (a runnable sketch follows the rule table below)
  • Each rule's score can add up several features of that rule
  • But a feature can't look at 2 rules at once (how to solve?)

1 S  → NP VP      6 S  → Vst NP     2 S  → S PP
1 VP → V NP       2 VP → VP PP
1 NP → Det N      2 NP → NP PP      3 NP → NP NP
0 PP → P NP
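Here is a minimal min-cost CKY sketch in Python in the spirit of the bullets
above: the cost of a parse is the sum of the costs of its rules, using the
rule table just shown. The LEXICON entries are hypothetical (the slide does
not show them); they are added only so the example runs end to end.

import math
from collections import defaultdict

RULES = [  # (cost, parent, (left child, right child)), from the table above
    (1, "S", ("NP", "VP")), (6, "S", ("Vst", "NP")), (2, "S", ("S", "PP")),
    (1, "VP", ("V", "NP")), (2, "VP", ("VP", "PP")),
    (1, "NP", ("Det", "N")), (2, "NP", ("NP", "PP")), (3, "NP", ("NP", "NP")),
    (0, "PP", ("P", "NP")),
]

LEXICON = {  # word -> {preterminal: cost}; hypothetical entries
    "Papa": {"NP": 0}, "ate": {"V": 0}, "the": {"Det": 0}, "caviar": {"N": 0},
    "with": {"P": 0}, "a": {"Det": 0}, "spoon": {"N": 0},
}

def cky_min_cost(words):
    n = len(words)
    best = defaultdict(lambda: math.inf)       # (i, j, symbol) -> min cost
    for i, w in enumerate(words):              # width-1 spans from the lexicon
        for sym, cost in LEXICON.get(w, {}).items():
            best[(i, i + 1, sym)] = min(best[(i, i + 1, sym)], cost)
    for width in range(2, n + 1):              # build wider spans bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for cost, parent, (left, right) in RULES:
                    total = cost + best[(i, k, left)] + best[(k, j, right)]
                    best[(i, j, parent)] = min(best[(i, j, parent)], total)
    return best

chart = cky_min_cost("Papa ate the caviar with a spoon".split())
print(chart[(0, 7, "S")])   # min total cost of an S covering the whole sentence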
15
  • Given upper string x
  • You know how to find the lower string y such that score(x,y) is highest
  • Provided that score(x,y) is a sum of arc scores along the best path that
    transduces x to y
  • Each arc's score can add up several features of that arc
  • But a feature can't look at 2 arcs at once (how to solve?)

16
Linear model notation
  • Given some input x
  • Consider a set of candidate outputs y
  • Define a scoring function score(x,y)
  • Linear function: a sum of feature weights (you pick the features!)
  • Choose y that maximizes score(x,y)

18
Probabilists Rally Behind Paradigm
  • ".2, .4, .6, .8! We're not gonna take your bait!"
  • Can estimate our parameters automatically
  • e.g., log p(t7 | t5, t6) (trigram tag probability)
  • from supervised or unsupervised data
  • Our results are more meaningful
  • Can use probabilities to place bets, quantify risk
  • e.g., how sure are we that this is the correct parse?
  • Our results can be meaningfully combined ⇒ modularity!
  • Multiply independent conditional probs (normalized, unlike scores)
  • p(English text) · p(English phonemes | English text) · p(Jap. phonemes |
    English phonemes) · p(Jap. text | Jap. phonemes)
  • p(semantics) · p(syntax | semantics) · p(morphology | syntax) ·
    p(phonology | morphology) · p(sounds | phonology)

19
Probabilists Regret Being Bound by Principle
  • Problem with our course's principled approach:
  • All we've had is the chain rule + backoff.
  • But this forced us to make some tough either-or decisions.
  • p(t7 | t5, t6): do we want to back off to t6 or to t5?
  • p(S → NP VP | S) with features: do we want to back off from number or
    from gender features first?
  • p(spam | message text): which words of the message do we back off from??

p(Paul Revere wins | weather's clear, ground is dry, jockey getting over
sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race
is on May 17, …)
20
News Flash! Hope arrives
  • So far: chain rule + backoff
  • directed graphical model
  • Bayesian network or Bayes net
  • locally normalized model
  • We do have a good trick to help with this:
  • Conditional log-linear model (look back at the smoothing lecture)
  • Solves the problems on the previous slide!
  • Computationally a bit harder to train:
  • Have to compute Z(x) for each condition x (see the sketch below)
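A small Python sketch of what "compute Z(x) for each condition x" means for a
conditional log-linear model, p(y | x) = exp(score(x,y)) / Z(x). The feature
function, weights, and candidate set are invented for illustration; they are
not from the lecture.

import math

def score(weights, feats):
    return sum(weights.get(f, 0.0) for f in feats)

def cond_prob(weights, x, y, candidates, feature_fn):
    """p(y | x) under a conditional log-linear model."""
    scores = {yp: score(weights, feature_fn(x, yp)) for yp in candidates}
    z_x = sum(math.exp(s) for s in scores.values())  # Z(x): one sum per condition x
    return math.exp(scores[y]) / z_x

feature_fn = lambda x, y: ["%s&%s" % (y, w) for w in x.lower().split()]
weights = {"spam&buy": 2.0, "ham&meeting": 1.5}
print(cond_prob(weights, "Buy now", "spam", ["spam", "ham"], feature_fn))  # ~0.88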

21
Gradient-based training
General function maximization algorithms include gradient ascent, L-BFGS,
and simulated annealing.
  • Gradually try to adjust λ in a direction that will improve the function
    we're trying to maximize
  • So compute that function's partial derivatives with respect to the
    feature weights in λ: the gradient.
  • Here's how the key part works out:
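A standard way this works out for a conditional log-linear model (stated here
without the slide's own derivation): for p(y | x) = (1/Z(x)) exp Σk λk fk(x,y),
the gradient of the conditional log-likelihood of training pairs (xj, yj) is

  ∂/∂λk Σj log p(yj | xj) = Σj [ fk(xj, yj) − Σy' p(y' | xj) fk(xj, y') ]

i.e., each feature's count in the supervised data minus its expected count
under the current model; the gradient is zero when the two match.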

22
Why Bother?
  • Gives us probs, not just scores.
  • Can use 'em to bet, or combine with other probs.
  • We can now learn weights from data!

23
News Flash! More hope
  • So far: chain rule + backoff
  • directed graphical model
  • Bayesian network or Bayes net
  • locally normalized model
  • Also consider: Markov Random Field
  • undirected graphical model
  • log-linear model (globally normalized)
  • exponential model
  • maximum entropy model
  • Gibbs distribution

24
Maximum Entropy
  • Suppose there are 10 classes, A through J.
  • I don't give you any other information.
  • Question: Given message m, what is your guess for p(C | m)?
  • Suppose I tell you that 55% of all messages are in class A.
  • Question: Now what is your guess for p(C | m)?
  • Suppose I also tell you that 10% of all messages contain Buy
    and 80% of these are in class A or C.
  • Question: Now what is your guess for p(C | m), if m contains Buy?
  • OUCH!

25
Maximum Entropy
  • Column A sums to 0.55 (55% of all messages are in class A)

26
Maximum Entropy
  • Column A sums to 0.55
  • Row Buy sums to 0.1 (10% of all messages contain Buy)

27
Maximum Entropy
  • Column A sums to 0.55
  • Row Buy sums to 0.1
  • (Buy, A) and (Buy, C) cells sum to 0.08 (80% of the 10%)
  • Given these constraints, fill in the cells as equally as possible:
    maximize the entropy (related to cross-entropy, perplexity)
  • Entropy = -.051 log .051 - .0025 log .0025 - .029 log .029 - ...
  • Largest if probabilities are evenly distributed

28
Maximum Entropy
  • Column A sums to 0.55
  • Row Buy sums to 0.1
  • (Buy, A) and (Buy, C) cells sum to 0.08 (80% of the 10%)
  • Given these constraints, fill in the cells as equally as possible:
    maximize the entropy
  • Now p(Buy, C) = .029 and p(C | Buy) = .29
  • We got a compromise: p(C | Buy) < p(A | Buy) < .55

29
Generalizing to More Features
(The slide's table is extended with additional features, e.g. "<100" and "Other".)
30
What we just did
  • For each feature ("contains Buy"), see what fraction of the training
    data has it
  • Many distributions p(c,m) would predict these fractions (including the
    unsmoothed one where all mass goes to feature combinations we've
    actually seen)
  • Of these, pick the distribution that has max entropy
  • Amazing Theorem: this distribution has the form
    p(m,c) = (1/Z(λ)) exp Σi λi fi(m,c)
  • So it is log-linear. In fact it is the same log-linear distribution that
    maximizes Πj p(mj, cj) as before! That is, among all distributions whose
    expected feature counts match the observed training fractions, the
    maximum-entropy one and the maximum-likelihood log-linear one coincide.
  • Gives another motivation for our log-linear approach.

31
Overfitting
  • If we have too many features, we can choose weights to model the
    training data perfectly.
  • If we have a feature that only appears in spam training, not ling
    training, it will get weight +∞ so as to maximize p(spam | feature) at 1.
  • These behaviors overfit the training data.
  • Will probably do poorly on test data.

32
Solutions to Overfitting
  • Throw out rare features.
  • Require every feature to occur > 4 times, and to occur at least once
    with each output class.
  • Only keep 1000 features.
  • Add one at a time, always greedily picking the one that most improves
    performance on held-out data.
  • Smooth the observed feature counts.
  • Smooth the weights by using a prior.
  • max p(λ | data) = max p(λ, data) = p(λ) p(data | λ)
  • decree p(λ) to be high when most weights are close to 0 (one concrete
    choice is sketched below)
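One common concrete choice for such a prior (not specified on the slide, so
an assumption here) is an independent Gaussian prior on each weight, which
turns the training objective into an L2-penalized log-likelihood:

  maximize  Σj log p(yj | xj)  −  Σk λk² / (2σ²)

where smaller σ² pushes the weights more strongly toward 0.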