1
Statistical techniques in NLP
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Learning
  • Central to statistical NLP
  • In most cases, supervised methods are used, with
    a separate training set
  • Unsupervised methods (clustering) recalculate the
    entire model on new data

3
Parameterized models
  • Assume that the observed (training) data D is
    described by a given distribution
  • This distribution, possibly with some parameters θ, is our model μ.
  • We want to maximize the likelihood function, P(D|μ) or P(D|θ).

4
Maximum likelihood estimation
  • Find the θ that maximizes P(D|θ), i.e., θ = argmax_θ P(D|θ)
  • Example: binomial distribution, with D successes observed in N trials
  • P(D|m) = C(N, D) m^D (1−m)^(N−D)
  • Therefore, the maximum likelihood estimate is m = D/N
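A minimal sketch of this estimate (the toy numbers are hypothetical; it simply checks numerically that the binomial likelihood peaks at m = D/N):

```python
from math import comb

def binomial_likelihood(m, D, N):
    """P(D | m): probability of D successes in N trials with success probability m."""
    return comb(N, D) * m**D * (1 - m)**(N - D)

# Scanning candidate values of m confirms the closed-form MLE m = D/N
D, N = 7, 10
candidates = [i / 100 for i in range(1, 100)]
m_hat = max(candidates, key=lambda m: binomial_likelihood(m, D, N))
print(m_hat, D / N)   # both approximately 0.7
```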

5
Smoothing
  • MLE assigns zero probability to unseen events
  • Example: trigrams in part-of-speech tagging (roughly 2/3 unseen)
  • Solution: smoothing (reserve small probabilities for unseen data)
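As one deliberately simple scheme, an add-one (Laplace) smoothed trigram estimate might look like the sketch below; the slides do not commit to a particular smoothing method, so the helper and counts here are purely illustrative:

```python
from collections import Counter

def smoothed_trigram_prob(tri_counts, bi_counts, vocab_size, w1, w2, w3):
    """Add-one smoothed P(w3 | w1, w2); unseen trigrams get a small nonzero probability."""
    return (tri_counts[(w1, w2, w3)] + 1) / (bi_counts[(w1, w2)] + vocab_size)

tags = "DT NN VB DT NN".split()
tri = Counter(zip(tags, tags[1:], tags[2:]))
bi = Counter(zip(tags, tags[1:]))
print(smoothed_trigram_prob(tri, bi, 4, "DT", "NN", "VB"))   # seen trigram
print(smoothed_trigram_prob(tri, bi, 4, "DT", "NN", "JJ"))   # unseen trigram, still > 0
```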

6
Bayesian learning
  • It is often impossible to solve the maximum likelihood problem analytically
  • Bayes decision rule: choose the θ that maximizes P(θ|D) (minimum error rate)
  • But it may be hard to calculate P(θ|D) directly
  • Use Bayes' rule: P(θ|D) = P(D|θ) P(θ) / P(D)
  • Naïve Bayes: additionally assume the observed features are conditionally independent given the class
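A toy Naïve Bayes sketch in the spirit of the sense-disambiguation example on the next slide (the counts and vocabulary size are made up; a real system would estimate them from a training corpus):

```python
from collections import defaultdict
import math

# Toy training data: (sense, context words) pairs -- illustrative only
training = [
    ("bank/money", ["deposit", "interest", "loan"]),
    ("bank/money", ["cash", "interest"]),
    ("bank/river", ["water", "shore", "fishing"]),
]

prior = defaultdict(int)
word_counts = defaultdict(lambda: defaultdict(int))
for sense, words in training:
    prior[sense] += 1
    for w in words:
        word_counts[sense][w] += 1

def classify(context, vocab_size=1000):
    """Pick the sense maximizing log P(sense) + sum_w log P(w | sense), with add-one smoothing."""
    best, best_score = None, float("-inf")
    for sense in prior:
        total = sum(word_counts[sense].values())
        score = math.log(prior[sense] / len(training))
        for w in context:
            score += math.log((word_counts[sense][w] + 1) / (total + vocab_size))
        if score > best_score:
            best, best_score = sense, score
    return best

print(classify(["interest", "loan"]))   # bank/money
```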

7
Examples
  • Gale et al. (1992): 90% sense disambiguation accuracy (choosing between bank/money and bank/river)
  • Hindle and Rooth (1993): prepositional phrase attachment
  • He ate pasta with cheese
  • He ate pasta with a fork
  • Both rely on observable features (nearby words,
    the verb)

8
Markov models
  • A stochastic process follows a sequence of states
    over time with some transition probabilities
  • If the process is stationary and has limited memory, we have a Markov chain
  • The states can be directly visible, or hidden (HMM)

9
Example: N-gram language models
  • The prediction for a word depends only on the word itself and a limited number of neighboring words
  • Part-of-speech tagging: maximize P(t_1 … t_n | w_1 … w_n)
  • With Bayes' rule, the chain rule, and independence assumptions, this reduces to maximizing Π_i P(w_i|t_i) P(t_i|t_{i-1}) (see the scoring sketch below)
  • Use an HMM to automatically adjust back-off smoothing
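A sketch of the resulting scoring function for one candidate tag sequence (the probability tables are hypothetical; a real tagger estimates them from a tagged corpus and smooths them):

```python
import math

# Hypothetical smoothed parameters
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.7, ("NN", "VB"): 0.4}       # P(t_i | t_{i-1})
emit = {("DT", "the"): 0.6, ("NN", "dog"): 0.01, ("VB", "barks"): 0.02}  # P(w_i | t_i)

def log_score(words, tags):
    """log of prod_i P(t_i | t_{i-1}) * P(w_i | t_i), with a tiny floor for unseen events."""
    score, prev = 0.0, "<s>"
    for w, t in zip(words, tags):
        score += math.log(trans.get((prev, t), 1e-6)) + math.log(emit.get((t, w), 1e-6))
        prev = t
    return score

print(log_score(["the", "dog", "barks"], ["DT", "NN", "VB"]))
```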

10
Example: Speech recognition
  • Need to find correct sequence of words given
    aural signal
  • Language model (N-gram) accounts for dependencies
    between words
  • The acoustic model maps from the visible level (phonemes) to the hidden level (words)
  • HMM combines both
  • The Viterbi algorithm finds the optimal solution (sketch below)
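A compact Viterbi sketch over a generic HMM (toy states and probabilities, not an actual acoustic model; it returns the most probable hidden state sequence):

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden state sequence for the observation sequence."""
    # V[t][s]: probability of the best path that ends in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 1e-12) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p].get(s, 1e-12) * emit_p[s].get(observations[t], 1e-12), p)
                for p in states)
            V[t][s], back[t][s] = prob, prev
    # Trace back from the best final state
    path = [max(V[-1], key=V[-1].get)]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["hot", "cold"]
print(viterbi(["high", "low"], states,
              start_p={"hot": 0.6, "cold": 0.4},
              trans_p={"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}},
              emit_p={"hot": {"high": 0.8, "low": 0.2}, "cold": {"high": 0.3, "low": 0.7}}))
```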

11
Expectation-Maximization (EM)
  • In general, we can iteratively estimate complex
    models with hidden parameters
  • Define a quality function Q as the expected log-likelihood of the complete data, conditioned on the current parameters
  • Estimate Q from an initial choice for θ (E-step)
  • Choose the new θ that maximizes Q (M-step)
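An illustrative EM sketch for one very small hidden-variable model, a two-component 1-D Gaussian mixture with unknown means (the choice of model is mine, not the slides'; the E-step/M-step alternation is the point):

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """EM for a mixture of two 1-D unit-variance Gaussians with unknown means and equal weights."""
    mu = np.array([x.min(), x.max()])                 # initial choice of parameters
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point, given current mu
        d0 = np.exp(-0.5 * (x - mu[0]) ** 2)
        d1 = np.exp(-0.5 * (x - mu[1]) ** 2)
        r1 = d1 / (d0 + d1)
        # M-step: choose the means that maximize the expected complete-data log-likelihood Q
        mu = np.array([np.average(x, weights=1 - r1), np.average(x, weights=r1)])
    return mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(em_two_gaussians(x))   # approximately [-2, 3]
```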

12
Example: PCFG parsing
  • Probabilistic context-free grammars
  • Likelihood of each rule (e.g., VP → V or VP → V NP) is a basic parameter
  • Combined probability of the entire tree gives the
    quality function
  • The inside-outside algorithm (the PCFG analogue of forward-backward) gives the solution
  • Lexicalization (Collins, 1996, 1997)
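A sketch of how the basic parameters combine: the probability of a parse tree is the product of the probabilities of the rules it uses (the toy grammar and its probabilities are invented for illustration):

```python
import math

# Hypothetical rule probabilities P(rhs | lhs), e.g. VP -> V NP
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V",)): 0.4,
    ("NP", ("she",)): 0.3,
    ("NP", ("pasta",)): 0.2,
    ("V", ("ate",)): 1.0,
}

def tree_log_prob(tree):
    """tree = (label, children); leaves are plain strings. Sum log P(rule) over the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    logp = math.log(rule_prob[(label, rhs)])
    for c in children:
        if not isinstance(c, str):
            logp += tree_log_prob(c)
    return logp

tree = ("S", [("NP", ["she"]), ("VP", [("V", ["ate"]), ("NP", ["pasta"])])])
print(math.exp(tree_log_prob(tree)))   # 1.0 * 0.3 * 0.6 * 1.0 * 0.2 = 0.036
```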

13
Example: Machine Translation
  • The noisy channel model (Brown et al., 1991)
  • Input in one language (e.g., English) is garbled
    into another (e.g., French)
  • Estimate probabilities of each word or phrase
    generating words or phrases in the other language
    and how many of them (fertility)
  • A similar approach: transliteration (Knight, 1998)

14
Linear regression
  • Predict output as a linear combination of input
    variables
  • Choose weights that minimize the sum of squared residuals (least squares)
  • Can be computed efficiently via a matrix
    decomposition and inversion
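A least-squares sketch in NumPy (lstsq performs the matrix decomposition internally; the data here are synthetic):

```python
import numpy as np

# Synthetic data: y is roughly 2*x1 + 3*x2 + 1 plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + rng.normal(scale=0.1, size=100)

# Append an intercept column, then minimize the sum of squared residuals
X1 = np.column_stack([X, np.ones(len(X))])
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(weights)   # approximately [2, 3, 1]
```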

15
Log-linear regression
  • Ideal output is 0 or 1
  • Because the distribution changes from normal to
    binomial, a transformed LS fit is not accurate
  • Solution: use an intermediate linear predictor η, related to the output probability through the logistic (logit) link
  • Can be fit approximately with iteratively reweighted least squares (IRLS)
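A sketch of iteratively reweighted least squares for the 0/1 case, in plain NumPy (the small `ridge` term is my addition purely to keep the linear solve numerically stable):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, ridge=1e-6):
    """Fit logistic-regression weights by iteratively reweighted least squares (Newton updates)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # current predicted probabilities
        W = p * (1.0 - p)                       # IRLS weights (diagonal of the weight matrix)
        H = X.T @ (W[:, None] * X) + ridge * np.eye(X.shape[1])
        w = w + np.linalg.solve(H, X.T @ (y - p))
    return w

# Toy usage: one feature plus an intercept column
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), np.ones(200)])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(2 * X[:, 0] - 1)))).astype(float)
print(irls_logistic(X, y))   # roughly [2, -1]
```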

16
Examples
  • Text categorization for information retrieval
    (Yang, 1998)
  • Many types of sentence/word classification
  • cue words (Passonneau and Litman, 1993)
  • prosodic features (Pan and McKeown, 1999)

17
Singular-value decomposition
  • A technique for reducing dimensionality: data points are projected into a lower-dimensional space
  • Given a matrix A (n×m), find matrices T (n×k), S (k×k), and D (k×m) whose product approximates A
  • S is diagonal and contains the top k singular values of A
  • Projection is achieved by multiplying T^T and A
  • Application: Latent Semantic Indexing (LSI)
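A NumPy sketch of the decomposition and projection, keeping the slide's T, S, D names (np.linalg.svd returns the factors as U, s, Vᵀ; the toy matrix is arbitrary):

```python
import numpy as np

def lsi_project(A, k):
    """Project the columns (documents) of an n x m matrix A into a k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    T, S, D = U[:, :k], np.diag(s[:k]), Vt[:k, :]      # keep only the top k singular values
    docs_k = T.T @ A                                   # equals S @ D: a k-dim vector per document
    return T, S, D, docs_k

A = np.random.default_rng(0).random((6, 4))            # toy 6-term x 4-document matrix
T, S, D, docs_k = lsi_project(A, k=2)
print(docs_k.shape)                                    # (2, 4)
```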

18
Methods without an explicit probability model
  • Use empirical techniques to directly provide
    output without calculating a model
  • Decision trees: each node is associated with a decision on one of the input features
  • The tree is built incrementally by choosing the features with the most discriminatory power (see the sketch below)
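A sketch of the feature-selection step only, scoring each binary feature's discriminatory power by information gain (toy part-of-speech-style data; a full tree builder would recurse on the resulting partitions):

```python
import math
from collections import Counter

def entropy(labels):
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, feature):
    """Reduction in label entropy obtained by splitting on a binary feature."""
    left = [l for ex, l in zip(examples, labels) if ex[feature]]
    right = [l for ex, l in zip(examples, labels) if not ex[feature]]
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - remainder

examples = [{"capitalized": True, "ends_in_s": False},
            {"capitalized": True, "ends_in_s": True},
            {"capitalized": False, "ends_in_s": False},
            {"capitalized": False, "ends_in_s": True}]
labels = ["NNP", "NNP", "NN", "NNS"]
best = max(["capitalized", "ends_in_s"], key=lambda f: information_gain(examples, labels, f))
print(best)   # capitalized
```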

19
Variations on decision trees
  • Shrinking to prevent over-training
  • Decision lists (Yarowsky, 1997) use only the single strongest matching feature; applied to accent restoration

20
Rule induction
  • Similar to decision trees, but the rules are
    allowed to vary and contain different operators
  • Examples: RIPPER (Cohen, 1996), transformation-based learning (Brill, 1996), genetic algorithms (Siegel, 1998)

21
Methods without an explicit model
  • k-Nearest Neighbor classification
  • Neural networks
  • Genetic algorithms

22
Support vector machines
  • Find the hyperplane that maximizes the distance to the closest training points (the support vectors)
  • Non-linear transformation: from the original space to a separable space via a kernel function
  • Text categorization (Joachims, 1997), OCR (Burges
    and Vapnik, 1996), Speech recognition (Schmidt,
    1996)
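A minimal sketch with scikit-learn's SVC, assuming scikit-learn is installed (the RBF kernel and synthetic data are just placeholders for a real task such as text categorization):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for document feature vectors
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Maximum-margin classifier; the kernel implicitly maps to a space where the classes separate better
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.support_vectors_.shape)   # the support vectors the solution depends on
print(clf.predict(X[:5]))
```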

23
Classification issues
  • Two or many classes
  • Classifier confidence, probability of membership
    in each class
  • Training / test set distributions
  • Balance of training data across classes

24
When to use each method?
  • Probabilistic models depend on distributional
    assumptions
  • Linear models (and SVD) assume a normal data
    distribution, and generalized linear models a
    Poisson, binomial, or negative binomial
  • Markov models capture limited dependencies
  • Rule-based models handle multi-way classification more easily than linear/log-linear ones
  • For many applications, it is important to get a
    confidence estimate