1. Seminar: Statistical NLP
Machine Learning for Natural Language Processing
Lluís Màrquez
TALP Research Center, Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Girona, June 2003
2. Outline
- The Classification Problem
- Three ML Algorithms
- Applications to NLP
4. Machine Learning
ML4NLP
- There are many general-purpose definitions of Machine Learning (or artificial learning)
- Learners are computers: we study learning algorithms
- Resources are scarce: time, memory, data, etc.
- It has (almost) nothing to do with cognitive science, neuroscience, the theory of scientific discovery and research, etc.
- Biological plausibility is welcome, but it is not the main goal
5. Machine Learning
ML4NLP
- Learning... but what for?
- To perform some particular task
- To react to environmental inputs
- Concept learning from data
- modelling concepts underlying data
- predicting unseen observations
- compacting the knowledge representation
- knowledge discovery for expert systems
- We will concentrate on
- Supervised inductive learning for classification
- discriminative learning
6. Machine Learning
ML4NLP
A more precise definition (Mitchell, 1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- What to read?
- Machine Learning (Mitchell, 1997)
7. Empirical NLP
ML4NLP
1990s: application of Machine Learning (ML) techniques to NLP problems
- Lexical and structural ambiguity problems
- Word selection (SR, MT)
- Part-of-speech tagging
- Semantic ambiguity (polysemy)
- Prepositional phrase attachment
- Reference ambiguity (anaphora)
- etc.
- What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)
8. NLP classification problems
ML4NLP
- Ambiguity is a crucial problem for natural language understanding/processing.
Ambiguity Resolution ⇒ Classification
He was shot in the hand as he chased the robbers
in the back street
(The Wall Street Journal Corpus)
10. NLP classification problems
ML4NLP
- Morpho-syntactic ambiguity
Part of Speech Tagging
He was shot in the hand as he chased the robbers
in the back street
NN vs. VB, JJ vs. VB, NN vs. VB (alternative tags for the ambiguous words)
(The Wall Street Journal Corpus)
12. NLP classification problems
ML4NLP
- Semantic (lexical) ambiguity
Word Sense Disambiguation
He was shot in the hand as he chased the robbers
in the back street
body-part vs. clock-part (the two senses of "hand")
(The Wall Street Journal Corpus)
13. NLP classification problems
ML4NLP
- Structural (syntactic) ambiguity
He was shot in the hand as he chased the robbers
in the back street
(The Wall Street Journal Corpus)
15. NLP classification problems
ML4NLP
- Structural (syntactic) ambiguity
PP-attachment disambiguation
He was shot in the hand as he (chased (the robbers)_NP (in the back street)_PP)
(The Wall Street Journal Corpus)
16. Outline
- The Classification Problem
- Three ML Algorithms in detail
- Applications to NLP
17. Feature Vector Classification
Classification
AI perspective
- An instance is a vector x = ⟨x1, ..., xn⟩ whose components, called features (or attributes), are discrete or real-valued.
- Let X be the space of all possible instances.
- Let Y = {y1, ..., ym} be the set of categories (or classes).
- The goal is to learn an unknown target function, f : X → Y
- A training example is an instance x belonging to X, labelled with the correct value for f(x), i.e., a pair ⟨x, f(x)⟩
- Let D be the set of all training examples.
18. Feature Vector Classification
Classification
- The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions
- The goal is to find a function h belonging to H such that for every pair ⟨x, f(x)⟩ belonging to D, h(x) = f(x) (a toy code sketch of this setting follows below)
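A minimal Python sketch of this setting, with invented toy features and data (a hedged illustration, not part of the original slides):

from typing import List, Tuple

# An instance x is a feature vector; here both features are discrete.
Instance = Tuple[str, str]              # e.g. (color, shape)
Label = str                             # one of the categories in Y

# Toy training set D: pairs <x, f(x)> labelled by the unknown target f.
D: List[Tuple[Instance, Label]] = [
    (("red", "circle"), "positive"),
    (("red", "square"), "negative"),
    (("blue", "circle"), "negative"),
]

# One hypothesis h : X -> Y from the hypothesis space H.
def h(x: Instance) -> Label:
    color, shape = x
    return "positive" if color == "red" and shape == "circle" else "negative"

# The learner's goal: h must agree with f on every training example.
print(all(h(x) == y for x, y in D))     # True for this toy hypothesis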
19. An Example
Classification
otherwise ⇒ negative
20. An Example
Classification
21. Some important concepts
Classification
- Inductive Bias
- Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias (Mooney & Cardie, 99)
- Language / Search bias
22. Some important concepts
Classification
- Inductive Bias
- Training error and generalization error
- Generalization ability and overfitting
- Batch learning vs. on-line learning
- Symbolic vs. statistical learning
- Propositional vs. first-order learning
23. Classification
Propositional vs. Relational Learning
color(red) ∧ shape(circle) ⇒ classA
24. The Classification Setting: Class, Point, Example, Data Set, ...
Classification
CoLT/SLT perspective
- Input Space: X ⊆ R^n
- (binary) Output Space: Y = {+1, -1}
- A point, pattern or instance: x ∈ X, x = (x1, x2, ..., xn)
- Example: (x, y) with x ∈ X, y ∈ Y
- Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), ..., (xm, ym)} ∈ (X × Y)^m
25. The Classification Setting: Learning, Error, ...
Classification
- The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form h(x) = sign(⟨w, x⟩ + b)
- The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM)
26. The Classification Setting: Learning, Error, ...
Classification
- Expected error (risk): R(h) = ∫ L(y, h(x)) dP(x, y)
- Problem: P itself is unknown. Known are training examples ⇒ an induction principle is needed
- Empirical Risk Minimization (ERM): find the function h belonging to H for which the training error (empirical risk) R_emp(h) = (1/m) Σ_i L(y_i, h(x_i)) is minimal (a small code sketch of ERM follows below)
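A small code sketch of ERM, assuming 0-1 loss (the slides do not fix a particular loss function):

from typing import Callable, List, Tuple

def empirical_risk(h: Callable, S: List[Tuple[tuple, int]]) -> float:
    # Training error under 0-1 loss: R_emp(h) = (1/m) sum 1[h(x_i) != y_i]
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H: List[Callable], S: List[Tuple[tuple, int]]) -> Callable:
    # ERM over a finite hypothesis space: pick h with minimal training error
    return min(H, key=lambda h: empirical_risk(h, S))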
27. The Classification Setting: Error, Over(under)fitting, ...
Classification
- Low training error ⇒ low true error?
- The overfitting dilemma
(Müller et al., 2001)
- Trade-off between training error and complexity
- Different learning biases can be used
29. Outline
- The Classification Problem
- Three ML Algorithms
- Decision Trees
- AdaBoost
- Support Vector Machines
- Applications to NLP
30. Learning Paradigms
Algorithms
- Statistical learning
- HMM, Bayesian Networks, ME, CRF, etc.
- Traditional methods from Artificial Intelligence (ML, AI)
- Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
- Methods from Computational Learning Theory (CoLT/SLT)
- Winnow, AdaBoost, SVMs, etc.
31. Learning Paradigms
Algorithms
- Classifier combination
- Bagging, Boosting, Randomization, ECOC, Stacking, etc.
- Semi-supervised learning: learning from labelled and unlabelled examples
- Bootstrapping, EM, Transductive learning (SVMs, AdaBoost), Co-Training, etc.
- etc.
32. Decision Trees
Algorithms
- Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data.
- They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with some of the following purposes: description, classification, and generalization.
- From a machine-learning perspective, Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes
33. Decision Trees
Algorithms
- Acquisition: Top-Down Induction of Decision Trees (TDIDT)
- Systems:
- CART (Breiman et al. 84)
- ID3, C4.5, C5.0 (Quinlan 86, 93, 98)
- ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95)
- etc.
34. An Example
Algorithms
35. Learning Decision Trees
Algorithms
36. General Induction Algorithm
Algorithms
37. General Induction Algorithm
Algorithms
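The induction algorithm on these two slides is shown only as a figure; below is a rough TDIDT sketch assuming discrete features, Information Gain as the selection function (one of the criteria on the next slide), and no pruning:

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_c p_c * log2(p_c) over the class distribution of S
    m = len(labels)
    return -sum((n / m) * math.log2(n / m) for n in Counter(labels).values())

def info_gain(X, y, feat):
    # Information Gain of splitting on one discrete feature (Quinlan 86)
    split = {}
    for x, label in zip(X, y):
        split.setdefault(x[feat], []).append(label)
    rest = sum(len(ys) / len(y) * entropy(ys) for ys in split.values())
    return entropy(y) - rest

def tdidt(X, y, feats):
    # Top-Down Induction of Decision Trees: grow the tree recursively
    if len(set(y)) == 1 or not feats:                # pure node / no tests left
        return Counter(y).most_common(1)[0][0]       # leaf: majority class
    best = max(feats, key=lambda f: info_gain(X, y, f))
    rest = [f for f in feats if f != best]
    tree = {}
    for v in set(x[best] for x in X):                # one branch per value
        idx = [i for i, x in enumerate(X) if x[best] == v]
        tree[(best, v)] = tdidt([X[i] for i in idx], [y[i] for i in idx], rest)
    return tree

For example, tdidt(X, y, list(range(num_features))) returns nested dictionaries keyed by (feature, value) pairs, with majority-class labels at the leaves.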
38. Feature Selection Criteria
Algorithms
- Functions derived from Information Theory:
- Information Gain, Gain Ratio (Quinlan 86)
- Functions derived from Distance Measures:
- Gini Diversity Index (Breiman et al. 84)
- RLM (López de Mántaras 91)
- Statistically-based:
- Chi-square test (Sestito & Dillon 94)
- Symmetrical Tau (Zhou & Dillon 91)
- RELIEFF-IG: variant of RELIEFF (Kononenko 94)
39. Extensions of DTs
Algorithms
(Murthy 95)
- Pruning (pre/post)
- Minimize the effect of the greedy approach: lookahead
- Non-linear splits
- Combination of multiple models
- Incremental learning (on-line)
- etc.
40. Decision Trees and NLP
Algorithms
- Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
- POS Tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00)
- Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
- Parsing (Magerman 95,96; Haruno et al. 98,99)
- Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
- Text summarization (Mani & Bloedorn 98)
- Dialogue act tagging (Samuel et al. 98)
41. Decision Trees and NLP
Algorithms
- Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
- Discourse analysis in information extraction (Soderland & Lehnert 94)
- Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
- Verb classification in Machine Translation (Tanaka 96; Siegel 97)
42. Decision Trees: pros & cons
Algorithms
- Advantages
- Acquires symbolic knowledge in an understandable way
- Very well studied ML algorithms and variants
- Can be easily translated into rules
- Existence of available software: C4.5, C5.0, etc.
- Can be easily integrated into an ensemble
43. Decision Trees: pros & cons
Algorithms
- Drawbacks
- Computationally expensive when scaling to large natural language domains: training examples, features, etc.
- Data sparseness and data fragmentation: the problem of small disjuncts ⇒ probability estimation
- DTs are a model with high variance (unstable)
- Tendency to overfit training data: pruning is necessary
- Requires quite a big effort in tuning the model
44. Boosting algorithms
Algorithms
- Idea: to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier
- AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively
- Many other variants and extensions (1997-2003)
- http://www.lsi.upc.es/~lluism/seminari/mlnlp.html
45. AdaBoost: general scheme
Algorithms
TRAINING
46. AdaBoost algorithm
Algorithms
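The slide presents the algorithm as a figure; here is a sketch of standard binary AdaBoost (Freund & Schapire 95) with labels in {-1, +1}; the weak_learner argument is an assumption standing in for any inducer of weak hypotheses:

import math

def adaboost(X, y, weak_learner, T):
    # weak_learner(X, y, D) must return a hypothesis h(x) -> {-1, +1}
    # whose weighted error on distribution D is below 1/2.
    m = len(X)
    D = [1.0 / m] * m                  # start from the uniform distribution
    ensemble = []                      # list of (alpha_t, h_t)
    for _ in range(T):
        h = weak_learner(X, y, D)
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])   # weighted error
        if eps == 0.0 or eps >= 0.5:
            break
        alpha = 0.5 * math.log((1 - eps) / eps)                # hypothesis weight
        # Re-weight: raise the weight of misclassified examples, then normalize
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, h))

    def H(x):
        # Combined hypothesis: sign of the weighted vote of the weak hypotheses
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H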
47. AdaBoost example
Algorithms
Weak hypotheses: vertical/horizontal hyperplanes
48. AdaBoost round 1
Algorithms
49. AdaBoost round 2
Algorithms
50. AdaBoost round 3
Algorithms
51. Combined Hypothesis
Algorithms
52. AdaBoost and NLP
Algorithms
- POS Tagging (Abney et al. 99; Màrquez 99)
- Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
- PP-attachment Disambiguation (Abney et al. 99)
- Parsing (Haruno et al. 99)
- Word Sense Disambiguation (Escudero et al. 00, 01)
- Shallow parsing (Carreras & Màrquez 01a, 02)
- Email spam filtering (Carreras & Màrquez 01b)
- Term Extraction (Vivaldi et al. 01)
53. AdaBoost: pros & cons
Algorithms
- Easy to implement and few parameters to set
- Time and space grow linearly with the number of examples. Ability to manage very large learning problems
- Does not constrain explicitly the complexity of the learner
- Naturally combines feature selection with learning
- Has been successfully applied to many practical problems
54. AdaBoost: pros & cons
Algorithms
- Seems to be rather robust to overfitting (number of rounds) but sensitive to noise
- Performance is very good when there are relatively few relevant terms (features)
- Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers: the training errors of the base classifiers become too large too quickly
56. SVM: A General Definition
Algorithms
- Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory. (Cristianini & Shawe-Taylor, 2000)
Key Concepts
57. Linear Classifiers
Algorithms
- Hyperplanes in R^N
- Defined by a weight vector (w) and a threshold (b)
- They induce a classification rule: h(x) = sign(⟨w, x⟩ + b) (sketched in code below)
58. Optimal Hyperplane: Geometric Intuition
Algorithms
59. Optimal Hyperplane: Geometric Intuition
Algorithms
Maximal Margin Hyperplane
60. Linearly separable data
Algorithms
Quadratic Programming
61. Non-separable case (soft margin)
Algorithms
62. Non-linear SVMs
Algorithms
- Implicit mapping into feature space via kernel
functions
63. Non-linear SVMs
Algorithms
- Kernel functions
- Must be efficiently computable
- Characterization via Mercer's theorem
- "One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space!" (Cristianini & Shawe-Taylor, 2000)
- Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc. (sketched below)
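Sketches of the kernels just listed (parameter names and default values are illustrative, not from the slides):

import math

def polynomial_kernel(x, z, degree=3, c=1.0):
    # K(x, z) = (<x, z> + c)^degree
    return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian RBF: K(x, z) = exp(-gamma * ||x - z||^2)
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def sigmoid_kernel(x, z, kappa=1.0, theta=-1.0):
    # Two-layer sigmoidal NN kernel: K(x, z) = tanh(kappa * <x, z> + theta)
    # (satisfies Mercer's conditions only for some parameter values)
    return math.tanh(kappa * sum(xi * zi for xi, zi in zip(x, z)) + theta)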
64. Non-linear SVMs
Algorithms
Degree 3 polynomial kernel
(figures: linearly non-separable in input space vs. linearly separable in the kernel-induced feature space)
65. Toy Examples
Algorithms
- All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National University of Taiwan)
- "LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs."
- Available from www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web-integrated demo tool; a usage sketch follows below)
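A hypothetical usage sketch with the Python bindings distributed with LIBSVM (module layout and option strings may differ between versions; the toy data is invented):

# Requires the svmutil module shipped in LIBSVM's python/ directory.
from svmutil import svm_train, svm_predict

# Toy 2D training data: labels in {+1, -1}, instances as {index: value} dicts.
y = [+1, +1, -1, -1]
x = [{1: 0.9, 2: 0.8}, {1: 0.7, 2: 1.0},
     {1: -0.8, 2: -0.6}, {1: -1.0, 2: -0.9}]

# C-SVC with an RBF kernel: -t 2 selects RBF, -c sets the C parameter
# (the margin vs. training-error trade-off shown in the toy examples below).
model = svm_train(y, x, '-s 0 -t 2 -c 1 -g 0.5')

# Predict two new points (labels are passed only so accuracy can be reported).
labels, acc, vals = svm_predict([+1, -1],
                                [{1: 0.8, 2: 0.9}, {1: -0.7, 2: -0.8}], model)
print(labels)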
66. Toy Examples (I)
Algorithms
Linearly separable data set; linear SVM; maximal margin hyperplane
67. Toy Examples (I)
Algorithms
(still) Linearly separable data set; linear SVM; high value of the C parameter; maximal margin hyperplane
The example is correctly classified
68. Toy Examples (I)
Algorithms
(still) Linearly separable data set; linear SVM; low value of the C parameter; trade-off between margin and training error
The example is now a bounded SV
69. Toy Examples (II)
Algorithms
70. Toy Examples (II)
Algorithms
71. Toy Examples (II)
Algorithms
72. Toy Examples (III)
Algorithms
73. SVM Summary
Algorithms
- SVMs introduced in COLT'92 (Boser, Guyon, Vapnik, 1992). Great development since then
- Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces ()
- Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs ()
- Compact representation of the induced hypothesis. The solution is sparse in terms of SVs ()
74. SVM Summary
Algorithms
- Due to Mercer's conditions on the kernels the optimisation problems are convex. No local minima ()
- Optimisation theory guides the implementation. Efficient learning ()
- Mainly for classification, but also for regression, density estimation, clustering, etc.
- Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. ()
- Parameter tuning (). Implications in convergence times, sparsity of the solution, etc.
75. Outline
- The Classification Problem
- Three ML Algorithms
- Applications to NLP
76. NLP problems
Applications
- Warning! We will not focus on final NLP applications, but on intermediate tasks...
- We will classify the NLP tasks according to their (structural) complexity
77. NLP problems: structural complexity
Applications
- Decisional problems
- Text Categorization, Document filtering, Word Sense Disambiguation, etc.
- Sequence tagging and detection of sequential structures
- POS tagging, Named Entity extraction, syntactic chunking, etc.
- Hierarchical structures
- Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc.
78. POS tagging
Applications
- Morpho-syntactic ambiguity
Part of Speech Tagging
He was shot in the hand as he chased the robbers
in the back street
NN vs. VB, JJ vs. VB, NN vs. VB (alternative tags for the ambiguous words)
(The Wall Street Journal Corpus)
79. POS tagging
Applications
preposition-adverb tree
80. POS tagging
Applications
preposition-adverb tree
Collocations
as_RB much_RB as_IN
as_RB soon_RB as_IN
as_RB well_RB as_IN
81. POS tagging
Applications
RTT (Màrquez & Rodríguez 97)
(flow diagram: raw text → morphological analysis → a disambiguation loop of classify / update / filter steps driven by the language model, repeated until the stop condition holds → tagged text)
82. POS tagging
Applications
STT (Màrquez & Rodríguez 97)
83. Detection of sequential and hierarchical structures
Applications
- Named Entity recognition
- Clause detection
84. Summary/conclusions
Conclusions
- We have briefly outlined:
- The ML setting: supervised learning for classification
- Three concrete machine learning algorithms
- How to apply them to solve intermediate NLP tasks
85. Conclusions
Summary/conclusions
- Any ML algorithm for NLP should be:
- Robust to noise and outliers
- Efficient in large feature/example spaces
- Adaptive to new/changing domains: portability, tuning, etc.
- Able to take advantage of unlabelled examples: semi-supervised learning
86. Summary/conclusions
Conclusions
- Statistical and ML-based Natural Language
Processing is a very active and multidisciplinary
area of research
87. Some current research lines
Conclusions
- Appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ 99), TBEDL (Brill 95), ME (Ratnaparkhi 98), SNoW (Roth 98), CRF (Pereira & Singer 02)
- Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc.
- Resolution of complex NLP problems: inference with classifiers, constraint satisfaction
- etc.
88. Bibliography
Conclusions
- You may find additional information at:
- http://www.lsi.upc.es/~lluism/
- tesi.html
- publicacions/pubs.html
- cursos/talks.html
- cursos/MLandNL.html
- cursos/emnlp1.html
- This talk at:
- http://www.lsi.upc.es/~lluism/udg03.ppt.gz
89. Seminar: Statistical NLP
Machine Learning for Natural Language Processing
Lluís Màrquez
TALP Research Center, Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Girona, June 2003