1
Statistical techniques in NLP
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Learning
  • Central to statistical NLP
  • In most cases, supervised methods are used, with
    a separate training set
  • Unsupervised methods (clustering) recalculate the
    entire model on new data

3
Parameterized models
  • Assume that the observed (training) data D is
    described by a given distribution
  • This distribution, possibly with some parameters
    θ, is our model
  • We want to maximize the likelihood function,
    P(D|θ) (also written P(D;θ))

4
Maximum likelihood estimation
  • Find the θ that maximizes P(D|θ), i.e.,
    θ_MLE = argmax_θ P(D|θ)
  • Example: binomial distribution,
    P(D|μ) = C(N,D) μ^D (1−μ)^(N−D)
  • Therefore, μ_MLE = D/N (for D successes in N
    trials)
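
To check this numerically (a sketch not on the slides; plain
standard-library Python with toy numbers), the likelihood of D
successes in N trials does peak at μ = D/N:

      from math import comb

      N, D = 10, 7                      # observed: 7 successes in 10 trials

      def likelihood(mu):
          """Binomial likelihood P(D | mu) for a fixed number of trials N."""
          return comb(N, D) * mu**D * (1 - mu)**(N - D)

      # Maximize over a fine grid of candidate parameter values
      grid = [i / 1000 for i in range(1, 1000)]
      print(max(grid, key=likelihood))  # 0.7, matching the closed form D/N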

5
Smoothing
  • MLE assigns zero probability to unseen events
  • Example: trigrams in part-of-speech tagging
    (roughly 2/3 unseen)
  • Solution: smoothing (reserve small probabilities
    for unseen data)
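
One concrete scheme is add-one (Laplace) smoothing; the slide does
not name a particular method, so this is a minimal sketch with
invented counts:

      from collections import Counter

      def laplace_prob(counts, event, num_events):
          """Add-one smoothed estimate: unseen events get small nonzero mass."""
          total = sum(counts.values())
          return (counts[event] + 1) / (total + num_events)

      trigram_counts = Counter({("DT", "JJ", "NN"): 5, ("DT", "NN", "VB"): 3})
      # An unseen tag trigram still gets a small, nonzero probability
      # (num_events = possible trigrams for a hypothetical 50-tag set):
      print(laplace_prob(trigram_counts, ("NN", "NN", "NN"), num_events=50**3))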

6
Bayesian learning
  • It is often impossible to solve the maximization
    analytically
  • Bayes decision rule: choose the θ that maximizes
    P(θ|D) (minimum error rate)
  • But it may be hard to calculate P(θ|D) directly
  • Use Bayes' rule:
    P(θ|D) = P(D|θ)P(θ) / P(D) ∝ P(D|θ)P(θ)
  • Naïve Bayes: additionally assume that the
    features of D are independent given the class
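
A minimal Naïve Bayes sketch (training data and vocabulary size
invented for illustration): the chosen class maximizes the prior
times the product of per-feature probabilities.

      import math
      from collections import Counter, defaultdict

      def train(docs):                      # docs: list of (words, label)
          priors = Counter(label for _, label in docs)
          word_counts = defaultdict(Counter)
          for words, label in docs:
              word_counts[label].update(words)
          return priors, word_counts

      def classify(words, priors, word_counts, vocab=10000):
          def log_posterior(c):
              total = sum(word_counts[c].values())
              score = math.log(priors[c] / sum(priors.values()))
              # log P(w | c) with add-one smoothing, summed over features
              return score + sum(math.log((word_counts[c][w] + 1) / (total + vocab))
                                 for w in words)
          return max(priors, key=log_posterior)

      docs = [(["money", "loan", "bank"], "bank/money"),
              (["river", "water", "bank"], "bank/river")]
      print(classify(["loan", "bank"], *train(docs)))   # -> bank/money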

7
Examples
  • Gale et al. 1992: 90% sense disambiguation
    accuracy (choosing between bank/money and
    bank/river)
  • Hindle and Rooth 1990: prepositional phrase
    attachment
  • "He ate pasta with cheese" (attach to the noun)
  • "He ate pasta with a fork" (attach to the verb)
  • Both rely on observable features (nearby words,
    the verb)

8
Markov models
  • A stochastic process follows a sequence of states
    over time with some transition probabilities
  • If the process is stationary and with limited
    memory, we have a Markov chain (illustrated
    below)
  • The states can be directly visible, or hidden
    (hidden Markov model, HMM)
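
A toy illustration of such a chain (all transition probabilities
invented): the next state depends only on the current one.

      import random

      transitions = {                       # hypothetical POS-tag chain
          "DT": {"JJ": 0.3, "NN": 0.7},
          "JJ": {"JJ": 0.2, "NN": 0.8},
          "NN": {"VB": 0.6, "NN": 0.4},
          "VB": {"DT": 0.5, "NN": 0.5},
      }

      def sample_chain(state, steps):
          """Walk the chain; each step uses only the current state."""
          path = [state]
          for _ in range(steps):
              nxt = transitions[state]
              state = random.choices(list(nxt), weights=nxt.values())[0]
              path.append(state)
          return path

      print(sample_chain("DT", 5))          # e.g. ['DT', 'NN', 'VB', 'DT', ...]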

9
Example: N-gram language models
  • Result for a word depends only on the word and a
    limited number of neighbors
  • Part-of-speech tagging: maximize
    P(t1 … tn | w1 … wn)
  • With Bayes' rule, the chain rule, and
    independence assumptions, this reduces to
    maximizing ∏i P(wi | ti) P(ti | ti−1)
  • Use HMM for automatically adjusting back-off
    smoothing

10
Example: Speech recognition
  • Need to find the correct sequence of words given
    the acoustic signal
  • Language model (N-gram) accounts for dependencies
    between words
  • Acoustic model maps from the visible level
    (phonemes) to the hidden level (words)
  • An HMM combines both models
  • The Viterbi algorithm finds the optimal word
    sequence efficiently
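
A compact Viterbi sketch (the tagging model and numbers are
invented, not the slides' speech model): dynamic programming keeps,
for every state at every step, the probability and identity of the
best path reaching it.

      def viterbi(obs, states, start_p, trans_p, emit_p):
          """Most probable hidden-state sequence for the observed sequence."""
          V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
          for o in obs[1:]:
              row = {}
              for s in states:
                  # Best predecessor state for s at this time step
                  ps = max(states, key=lambda p: V[-1][p][0] * trans_p[p][s])
                  prob = V[-1][ps][0] * trans_p[ps][s] * emit_p[s][o]
                  row[s] = (prob, V[-1][ps][1] + [s])
              V.append(row)
          best = max(states, key=lambda s: V[-1][s][0])
          return V[-1][best][1]

      states = ("NN", "VB")
      start_p = {"NN": 0.6, "VB": 0.4}
      trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.8, "VB": 0.2}}
      emit_p = {"NN": {"fish": 0.6, "sleep": 0.4},
                "VB": {"fish": 0.3, "sleep": 0.7}}
      print(viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p))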

11
Expectation-Maximization (EM)
  • In general, we can iteratively estimate complex
    models with hidden parameters
  • Define a quality function Q as the expected
    log-likelihood of the complete data given the
    current parameters
  • Estimate Q from an initial choice for θ
  • Choose the new θ that maximizes Q, and iterate
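
A minimal EM sketch on a concrete instance (a two-coin mixture,
invented for illustration): the E-step computes expected
assignments of trials to coins under the current parameters, and
the M-step re-estimates the parameters from those expected counts.

      from math import comb

      def em_two_coins(trials, theta=(0.6, 0.5), iters=25):
          """trials: list of (heads, flips); returns the two coins' biases."""
          tA, tB = theta
          for _ in range(iters):
              # E-step: expected heads/tails attributed to each coin
              hA = tailsA = hB = tailsB = 0.0
              for h, n in trials:
                  like_A = comb(n, h) * tA**h * (1 - tA)**(n - h)
                  like_B = comb(n, h) * tB**h * (1 - tB)**(n - h)
                  wA = like_A / (like_A + like_B)   # responsibility of coin A
                  hA += wA * h
                  tailsA += wA * (n - h)
                  hB += (1 - wA) * h
                  tailsB += (1 - wA) * (n - h)
              # M-step: maximize Q by re-estimating each bias from the counts
              tA = hA / (hA + tailsA)
              tB = hB / (hB + tailsB)
          return tA, tB

      trials = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
      print(em_two_coins(trials))   # converges to roughly (0.80, 0.52)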

12
Example: PCFG parsing
  • Probabilistic context-free grammars
  • The probability of each rule (e.g., VP → V or
    VP → V NP) is a basic parameter (see the sketch
    below)
  • Combined probability of the entire tree gives the
    quality function
  • The inside-outside algorithm (the PCFG analogue
    of forward-backward) gives the solution
  • Lexicalization (Collins, 1996, 1997)
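
A sketch of the rule parameterization referenced above (toy grammar
and probabilities invented): a tree's probability is the product of
the probabilities of the rules it uses.

      rule_p = {                            # hypothetical P(rhs | lhs)
          ("S",  ("NP", "VP")):  1.0,
          ("VP", ("V",)):        0.4,
          ("VP", ("V", "NP")):   0.6,
          ("NP", ("she",)):      0.5,
          ("NP", ("pasta",)):    0.5,
          ("V",  ("ate",)):      1.0,
      }

      def tree_prob(tree):
          """tree = (label, child, ...); leaf children are plain strings."""
          label, *children = tree
          rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
          p = rule_p[(label, rhs)]
          for c in children:
              if not isinstance(c, str):
                  p *= tree_prob(c)
          return p

      t = ("S", ("NP", "she"), ("VP", ("V", "ate"), ("NP", "pasta")))
      print(tree_prob(t))                   # 1.0 * 0.5 * 0.6 * 1.0 * 0.5 = 0.15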

13
Example: Machine Translation
  • The noisy channel model (Brown et al., 1991)
  • Input in one language (e.g., English) is garbled
    into another (e.g., French)
  • Estimate the probability of each word or phrase
    generating words or phrases in the other
    language, and of how many it generates
    (fertility)
  • A similar approach: transliteration (Knight,
    1998)

14
Linear regression
  • Predict output as a linear combination of input
    variables
  • Choose weights that minimize the sum of squared
    residuals (least squares)
  • Can be computed efficiently via a matrix
    decomposition and inversion
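
A least-squares sketch (assumes NumPy, which the slides do not
mention; data invented): numpy.linalg.lstsq solves the minimization
via an orthogonal matrix decomposition.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))                  # input variables
      true_w = np.array([2.0, -1.0, 0.5])
      y = X @ true_w + 0.1 * rng.normal(size=100)    # noisy linear output

      # Weights minimizing the sum of squared residuals ||X w - y||^2
      w, *_ = np.linalg.lstsq(X, y, rcond=None)
      print(w)                                       # close to [2.0, -1.0, 0.5]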

15
Log-linear regression
  • Ideal output is 0 or 1
  • Because the response distribution is binomial
    rather than normal, a direct least-squares fit is
    not accurate
  • Solution: use an intermediate linear predictor η,
    linked to the 0/1 output through the logistic
    function
  • Can be approximated with iteratively reweighted
    least squares (IRLS)
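
An IRLS sketch for the logistic case (assumes NumPy; data
invented): each iteration is a Newton step, equivalently a weighted
least-squares fit on the current predictions.

      import numpy as np

      def irls_logistic(X, y, iters=15):
          """Fit logistic-regression weights by iteratively reweighted LS."""
          w = np.zeros(X.shape[1])
          for _ in range(iters):
              p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
              weights = p * (1 - p)                   # per-example LS weights
              H = X.T @ (X * weights[:, None])        # X^T W X
              w += np.linalg.solve(H, X.T @ (y - p))  # Newton/IRLS update
          return w

      rng = np.random.default_rng(1)
      X = rng.normal(size=(200, 2))
      y = (X @ np.array([1.5, -2.0]) + rng.normal(size=200) > 0).astype(float)
      print(irls_logistic(X, y))   # roughly proportional to [1.5, -2.0]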

16
Examples
  • Text categorization for information retrieval
    (Yang, 1998)
  • Many types of sentence/word classification:
  • cue words (Passonneau and Litman, 1993)
  • prosodic features (Pan and McKeown, 1999)

17
Singular-value decomposition
  • A technique for reducing dimensionality: data
    points are projected onto a lower-dimensional
    space
  • Given matrix A (n×m), find matrices T (n×k), S
    (k×k), and D (k×m) so that their product
    approximates A
  • S is the diagonal matrix of the top k singular
    values of A
  • Projection is achieved by multiplying T^T and A
  • Application: Latent Semantic Indexing (LSI)
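
An SVD sketch (assumes NumPy; the term-document counts are
invented): keep the top k singular values, form the rank-k
approximation, and project documents into k dimensions.

      import numpy as np

      A = np.array([[1, 0, 1, 0],       # hypothetical term-document matrix
                    [1, 1, 0, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 1.0]])

      T, s, D = np.linalg.svd(A, full_matrices=False)
      k = 2
      T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), D[:k, :]

      A_k = T_k @ S_k @ D_k             # best rank-k approximation of A
      docs_k = T_k.T @ A                # documents projected to k dimensions
      print(np.linalg.norm(A - A_k), docs_k.shape)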

18
Methods without an explicit probability model
  • Use empirical techniques to directly provide
    output without calculating a model
  • Decision trees: each node is associated with a
    decision on one of the input features
  • The tree is built incrementally by choosing
    features with the most discriminatory power
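
A sketch of this selection step (data invented): discriminatory
power is commonly measured as information gain, the drop in label
entropy after splitting on a feature.

      import math
      from collections import Counter

      def entropy(labels):
          n = len(labels)
          return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

      def info_gain(rows, labels, feature):
          """Entropy reduction from splitting on one categorical feature."""
          gain = entropy(labels)
          for value in set(row[feature] for row in rows):
              subset = [l for row, l in zip(rows, labels) if row[feature] == value]
              gain -= len(subset) / len(labels) * entropy(subset)
          return gain

      rows = [{"capitalized": 1, "ends_in_s": 0},
              {"capitalized": 1, "ends_in_s": 1},
              {"capitalized": 0, "ends_in_s": 0},
              {"capitalized": 0, "ends_in_s": 1}]
      labels = ["NNP", "NNP", "NN", "NN"]
      # The root decision: pick the feature with the highest gain
      print(max(rows[0], key=lambda f: info_gain(rows, labels, f)))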

19
Variations on decision trees
  • Shrinking (pruning) to prevent overfitting
  • Decision lists (Yarowsky, 1997): use only the
    single strongest matching feature; applied to
    accent restoration

20
Rule induction
  • Similar to decision trees, but the rules are
    allowed to vary and contain different operators
  • Examples: RIPPER (Cohen, 1996),
    transformation-based learning (Brill, 1996),
    genetic algorithms (Siegel, 1998)

21
Methods without explicit model
  • k-Nearest Neighbor classification (sketched
    below)
  • Neural networks
  • Genetic algorithms
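
A k-nearest-neighbor sketch, as promised above (data invented): no
model is computed; a new point takes the majority label among its k
closest training examples.

      from collections import Counter

      def knn(train, point, k=3):
          """train: list of (vector, label); Euclidean distance, majority vote."""
          dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
          nearest = sorted(train, key=lambda ex: dist(ex[0], point))[:k]
          return Counter(label for _, label in nearest).most_common(1)[0][0]

      train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
               ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
      print(knn(train, (1, 1)))   # -> "A"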

22
Support vector machines
  • Find the hyperplane that maximizes the margin,
    i.e., the distance to the nearest training points
    (the support vectors)
  • Non-linear transformation: from the original
    space to a separable space via a kernel function
    (see the sketch below)
  • Text categorization (Joachims, 1997), OCR (Burges
    and Vapnik, 1996), Speech recognition (Schmidt,
    1996)
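
A usage sketch (assumes scikit-learn, which the slides do not
mention; data invented): an RBF kernel separates classes that are
not linearly separable in the original space.

      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(2)
      X = rng.normal(size=(200, 2))
      y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # circular boundary

      clf = SVC(kernel="rbf", C=1.0)        # kernel trick: non-linear boundary
      clf.fit(X, y)
      print(clf.score(X, y))                # high accuracy on this data
      print(clf.support_vectors_.shape)     # the margin-defining points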

23
Classification issues
  • Two or many classes
  • Classifier confidence, probability of membership
    in each class
  • Training / test set distributions
  • Balance of training data across classes

24
When to use each method?
  • Probabilistic models depend on distributional
    assumptions
  • Linear models (and SVD) assume a normal data
    distribution, and generalized linear models a
    Poisson, binomial, or negative binomial
  • Markov models capture limited dependencies
  • Rule-based models handle multi-way classification
    more easily than linear/log-linear ones
  • For many applications, it is important to get a
    confidence estimate