Aquesta - PowerPoint PPT Presentation

Title: Aquesta és una prova petita. Author: lluism. Last modified by: lluism. Created: 5/20/1999 10:25:04 PM. Presentation format: on-screen presentation.

Transcript and Presenter's Notes

1
Seminar: Statistical NLP
Machine Learning for Natural Language Processing
Lluís Màrquez
TALP Research Center, Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Girona, June 2003
2
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

4
Machine Learning
ML4NLP
  • There are many general-purpose definitions of
    Machine Learning (or artificial learning)
  • Learners are computers: we study learning
    algorithms
  • Resources are scarce: time, memory, data, etc.
  • It has (almost) nothing to do with Cognitive
    science, neuroscience, theory of scientific
    discovery and research, etc.
  • Biological plausibility is welcome but not the
    main goal

5
Machine Learning
ML4NLP
  • Learning... but what for?
      • To perform some particular task
      • To react to environmental inputs
      • Concept learning from data:
          • modelling concepts underlying data
          • predicting unseen observations
          • compacting the knowledge representation
          • knowledge discovery for expert systems
  • We will concentrate on:
      • Supervised inductive learning for classification
          • discriminative learning

6
Machine Learning
ML4NLP
A more precise definition
  • What to read? Machine Learning (Mitchell, 1997)

7
Empirical NLP
ML4NLP
1990s: application of Machine Learning (ML)
techniques to NLP problems
  • Lexical and structural ambiguity problems:
      • Word selection (SR, MT)
      • Part-of-speech tagging
      • Semantic ambiguity (polysemy)
      • Prepositional phrase attachment
      • Reference ambiguity (anaphora)
      • etc.
  • What to read? Foundations of Statistical Natural
    Language Processing (Manning & Schütze, 1999)

8
NLP classification problems
ML4NLP
  • Ambiguity is a crucial problem for natural
    language understanding/processing.
  • Ambiguity resolution = classification

He was shot in the hand as he chased the robbers
in the back street
(The Wall Street Journal Corpus)
9
NLP classification problems
ML4NLP
  • Morpho-syntactic ambiguity

He was shot in the hand as he chased the robbers
in the back street
[Candidate POS tags shown for the ambiguous words: NN/VB, JJ/VB, NN/VB]
(The Wall Street Journal Corpus)
10
NLP classification problems
ML4NLP
  • Morpho-syntactic ambiguity
    Part of Speech Tagging

He was shot in the hand as he chased the robbers
in the back street
[Candidate POS tags shown for the ambiguous words: NN/VB, JJ/VB, NN/VB]
(The Wall Street Journal Corpus)
11
NLP classification problems
ML4NLP
  • Semantic (lexical) ambiguity

He was shot in the hand as he chased the robbers
in the back street
Candidate senses of "hand": body-part, clock-part
(The Wall Street Journal Corpus)
12
NLP classification problems
ML4NLP
  • Semantic (lexical) ambiguity
    Word Sense Disambiguation

He was shot in the hand as he chased the robbers
in the back street
Candidate senses of "hand": body-part, clock-part
(The Wall Street Journal Corpus)
13
NLP classification problems
ML4NLP
  • Structural (syntactic) ambiguity

He was shot in the hand as he chased the robbers
in the back street
(The Wall Street Journal Corpus)
15
NLP classification problems
ML4NLP
  • Structural (syntactic) ambiguity
    PP-attachment disambiguation

He was shot in the hand as he (chased (the
robbers)NP (in the back street)PP)
(The Wall Street Journal Corpus)
16
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms in detail
  • Applications to NLP

17
Feature Vector Classification
Classification
AI perspective
  • An instance is a vector $x = \langle x_1, \ldots, x_n \rangle$ whose
    components, called features (or attributes), are
    discrete or real-valued.
  • Let $X$ be the space of all possible instances.
  • Let $Y = \{y_1, \ldots, y_m\}$ be the set of categories (or
    classes).
  • The goal is to learn an unknown target function
    $f: X \to Y$.
  • A training example is an instance $x$ belonging to
    $X$, labelled with the correct value for $f(x)$,
    i.e., a pair $\langle x, f(x) \rangle$.
  • Let $D$ be the set of all training examples.

18
Feature Vector Classification
Classification
  • The hypothesis space, $H$, is the set of functions
    $h: X \to Y$ that the learner can consider as
    possible definitions of the target function.
  • The goal is to find a function $h$ belonging to $H$
    such that $h(x) = f(x)$ for every pair
    $\langle x, f(x) \rangle$ belonging to $D$.
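
A minimal sketch of this setting, under stated assumptions: training examples are (feature vector, label) pairs, and a candidate hypothesis is checked for consistency with D. The toy features, the labels, and the rule inside `h` are illustrative and do not come from the slides.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, str]        # feature name -> discrete value
Example = Tuple[Instance, str]   # a pair <x, f(x)>

# Hypothetical toy training set D (feature names and values are made up).
D: List[Example] = [
    ({"color": "red", "shape": "circle"}, "positive"),
    ({"color": "red", "shape": "square"}, "negative"),
    ({"color": "blue", "shape": "circle"}, "negative"),
]

def h(x: Instance) -> str:
    """One candidate hypothesis from H: a single conjunctive rule."""
    if x["color"] == "red" and x["shape"] == "circle":
        return "positive"
    return "negative"

def is_consistent(hyp: Callable[[Instance], str], data: List[Example]) -> bool:
    """True if hyp(x) equals f(x) for every pair <x, f(x)> in data."""
    return all(hyp(x) == y for x, y in data)

print(is_consistent(h, D))
```

Several different hypotheses can be consistent with the same D, which is why the choice among them (the inductive bias) matters.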

19
An Example
Classification
otherwise ⇒ negative
20
An Example
Classification
21
Some important concepts
Classification
  • Inductive Bias
      • Any means that a classification learning
        system uses to choose between two functions that
        are both consistent with the training data is
        called inductive bias (Mooney & Cardie, 99)
      • Language / Search bias

22
Some important concepts
Classification
  • Inductive Bias
  • Training error and generalization error
  • Generalization ability and overfitting
  • Batch learning vs. on-line learning
  • Symbolic vs. statistical Learning
  • Propositional vs. first-order learning

23
Propositional vs. Relational Learning
Classification
  • Propositional learning

color(red) ∧ shape(circle) ⇒ classA
24
The Classification Setting: Class, Point, Example,
Data Set, ...
Classification
CoLT/SLT perspective
  • Input space: $X \subseteq \mathbb{R}^n$
  • (Binary) output space: $Y = \{+1, -1\}$
  • A point, pattern or instance: $x \in X$,
    $x = (x_1, x_2, \ldots, x_n)$
  • Example: $(x, y)$ with $x \in X$, $y \in Y$
  • Training set: a set of $m$ examples generated
    i.i.d. according to an unknown distribution
    $P(x, y)$:
    $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (X \times Y)^m$

25
The Classification Setting: Learning, Error, ...
Classification
  • The hypothesis space, $H$, is the set of functions
    $h: X \to Y$ that the learner can consider as
    possible definitions. In SVMs they are of the form
    [formula missing in the transcript]
  • The goal is to find a function $h$ belonging to $H$
    such that the expected misclassification error on
    new examples, also drawn from $P(x, y)$, is minimal
    (Risk Minimization, RM)

26
The Classification Setting: Learning, Error, ...
Classification
  • Expected error (risk)
  • Problem: $P$ itself is unknown. Only the training
    examples are known ⇒ an induction principle is needed
  • Empirical Risk Minimization (ERM): find the
    function $h$ belonging to $H$ for which the training
    error (empirical risk) is minimal
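
The two quantities named above have standard definitions, sketched here with a generic loss function $\ell$ (the 0/1 loss in the case of misclassification error). The symbols $R$, $R_{\mathrm{emp}}$ and $\ell$ are notational assumptions, since the formulas of the original slide are not preserved.

```latex
% Expected risk: average loss over the unknown distribution P(x, y)
R(h) = \int \ell\bigl(h(x), y\bigr) \, dP(x, y)

% Empirical risk: average loss over the m training examples in S
R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell\bigl(h(x_i), y_i\bigr)
```

ERM selects the $h \in H$ that minimizes $R_{\mathrm{emp}}(h)$ as a stand-in for the unobservable $R(h)$.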

27
The Classification Setting: Error,
Over(under)fitting, ...
Classification
  • Low training error ⇒ low true error?
  • The overfitting dilemma

(Müller et al., 2001)
  • Trade-off between training error and complexity
  • Different learning biases can be used

28
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

29
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
      • Decision Trees
      • AdaBoost
      • Support Vector Machines
  • Applications to NLP

30
Learning Paradigms
Algorithms
  • Statistical learning
      • HMM, Bayesian Networks, ME, CRF, etc.
  • Traditional methods from Artificial Intelligence
    (ML, AI)
      • Decision trees/lists, exemplar-based learning,
        rule induction, neural networks, etc.
  • Methods from Computational Learning Theory
    (CoLT/SLT)
      • Winnow, AdaBoost, SVMs, etc.

31
Learning Paradigms
Algorithms
  • Classifier combination
      • Bagging, Boosting, Randomization, ECOC, Stacking,
        etc.
  • Semi-supervised learning: learning from labelled
    and unlabelled examples
      • Bootstrapping, EM, Transductive learning (SVMs,
        AdaBoost), Co-Training, etc.
  • etc.

32
Decision Trees
Algorithms
  • Decision trees are a way to represent rules
    underlying training data, with hierarchical
    structures that recursively partition the data.
  • They have been used by many research communities
    (Pattern Recognition, Statistics, ML, etc.) for
    data exploration with some of the following
    purposes: Description, Classification, and
    Generalization.
  • From a machine-learning perspective, Decision
    Trees are n-ary branching trees that represent
    classification rules for classifying the objects
    of a certain domain into a set of mutually
    exclusive classes

33
Decision Trees
Algorithms
  • Acquisition:
    Top-Down Induction of Decision Trees (TDIDT)
  • Systems:
      • CART (Breiman et al. 84)
      • ID3, C4.5, C5.0 (Quinlan 86, 93, 98)
      • ASSISTANT, ASSISTANT-R (Cestnik et al. 87;
        Kononenko et al. 95)
      • etc.
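
The top-down induction scheme can be sketched roughly as follows. This is a simplified ID3-style illustration for discrete features, using information gain as the splitting criterion and no pruning; it is not the code of any of the systems listed, and the toy data set and function names are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, feature):
    """Entropy reduction obtained by splitting the examples on one feature."""
    remainder = 0.0
    for value in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([y for _, y in examples]) - remainder

def tdidt(examples, features):
    """Return a class label (leaf) or a (feature, {value: subtree}) node."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    # Greedy step: pick the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(examples, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = tdidt(subset, remaining)  # recurse on the partition
    return (best, branches)

def classify(tree, x, default=None):
    """Follow the branches matching x until a leaf is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        if x[feature] not in branches:
            return default
        tree = branches[x[feature]]
    return tree

# Hypothetical toy data: (feature dictionary, class label).
data = [
    ({"color": "red", "shape": "circle"}, "A"),
    ({"color": "red", "shape": "square"}, "B"),
    ({"color": "blue", "shape": "circle"}, "B"),
    ({"color": "blue", "shape": "square"}, "B"),
]
tree = tdidt(data, ["color", "shape"])
print(tree)
```

Real systems add pruning, handling of numeric features and missing values, and other splitting criteria.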

34
An Example
Algorithms
35
Learning Decision Trees
Algorithms
36
General Induction Algorithm
Algorithms
37
General Induction Algorithm
Algorithms
38
Feature Selection Criteria
Algorithms
  • Functions derived from Information Theory
      • Information Gain, Gain Ratio (Quinlan 86)
  • Functions derived from Distance Measures
      • Gini Diversity Index (Breiman et al. 84)
      • RLM (López de Mántaras 91)
  • Statistically-based
      • Chi-square test (Sestito & Dillon 94)
      • Symmetrical Tau (Zhou & Dillon 91)
  • RELIEFF-IG: variant of RELIEFF (Kononenko 94)
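
To make three of these criteria concrete, the sketch below computes information gain, gain ratio and the decrease in Gini impurity for two candidate features. The toy data and helper names are assumptions for illustration: feature 0 is an identifier-like feature with a distinct value per example, feature 1 is a binary feature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini diversity index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(examples, feature):
    """Group the class labels by the value taken by the given feature."""
    groups = {}
    for x, y in examples:
        groups.setdefault(x[feature], []).append(y)
    return list(groups.values())

def impurity_decrease(examples, feature, impurity):
    """Impurity before the split minus the weighted impurity after it."""
    labels = [y for _, y in examples]
    after = sum(len(p) / len(labels) * impurity(p)
                for p in partition(examples, feature))
    return impurity(labels) - after

def gain_ratio(examples, feature):
    """Information gain normalized by the entropy of the split itself."""
    n = len(examples)
    split_info = -sum(len(p) / n * math.log2(len(p) / n)
                      for p in partition(examples, feature))
    gain = impurity_decrease(examples, feature, entropy)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical data: feature 0 is unique per example, feature 1 is binary.
data = [(("a", "x"), "pos"), (("b", "x"), "pos"),
        (("c", "y"), "neg"), (("d", "y"), "neg")]

for f in (0, 1):
    print(f,
          impurity_decrease(data, f, entropy),  # information gain
          gain_ratio(data, f),
          impurity_decrease(data, f, gini))
```

On this data both features obtain the same information gain, while gain ratio penalizes the many-valued feature 0.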

39
Extensions of DTs
Algorithms
(Murthy 95)
  • Pruning (pre/post)
  • Minimize the effect of the greedy approach:
    lookahead
  • Non-linear splits
  • Combination of multiple models
  • Incremental learning (on-line)
  • etc.

40
Decision Trees and NLP
Algorithms
  • Speech processing (Bahl et al. 89; Bakiri &
    Dietterich 99)
  • POS Tagging (Cardie 93; Schmid 94b; Magerman 95;
    Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
  • Word sense disambiguation (Brown et al. 91;
    Cardie 93; Mooney 96)
  • Parsing (Magerman 95, 96; Haruno et al. 98, 99)
  • Text categorization (Lewis & Ringuette 94; Weiss
    et al. 99)
  • Text summarization (Mani & Bloedorn 98)
  • Dialogue act tagging (Samuel et al. 98)

41
Decision Trees and NLP
Algorithms
  • Noun phrase coreference
    (Aone & Bennett 95; McCarthy & Lehnert 95)
  • Discourse analysis in information extraction
    (Soderland & Lehnert 94)
  • Cue phrase identification in text and speech
    (Litman 94; Siegel & McKeown 94)
  • Verb classification in Machine Translation
    (Tanaka 96; Siegel 97)

42
Decision Trees: pros & cons
Algorithms
  • Advantages
      • Acquires symbolic knowledge in an understandable
        way
      • Very well studied ML algorithms and variants
      • Can be easily translated into rules
      • Existence of available software: C4.5, C5.0, etc.
      • Can be easily integrated into an ensemble

43
Decision Trees: pros & cons
Algorithms
  • Drawbacks
      • Computationally expensive when scaling to large
        natural language domains: training examples,
        features, etc.
      • Data sparseness and data fragmentation: the
        problem of the small disjuncts ⇒ probability
        estimation
      • DTs are a high-variance (unstable) model
      • Tendency to overfit training data: pruning is
        necessary
      • Requires quite a big effort in tuning the model

44
Boosting algorithms
Algorithms
  • Idea:
    to combine many simple and moderately accurate
    hypotheses (weak classifiers) into a single and
    highly accurate classifier
  • AdaBoost (Freund & Schapire 95) has been
    theoretically and empirically studied extensively
  • Many other variants & extensions (1997-2003)
  • http://www.lsi.upc.es/~lluism/seminari/mlnlp.html
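
As an illustration of the boosting idea, here is a minimal sketch of discrete AdaBoost with axis-parallel threshold rules (decision stumps) as weak hypotheses. It follows the usual textbook formulation and is not taken from the slides; the one-dimensional toy data, the exhaustive stump search and all function names are assumptions made for the example.

```python
import math

def stump_predict(stump, x):
    """Axis-parallel threshold rule: `sign` if x[axis] > threshold, else -sign."""
    axis, threshold, sign = stump
    return sign if x[axis] > threshold else -sign

def best_stump(points, labels, weights):
    """Exhaustively pick the stump with the smallest weighted training error."""
    best, best_error = None, float("inf")
    for axis in range(len(points[0])):
        values = sorted({p[axis] for p in points})
        thresholds = [values[0] - 1.0] + [(a + b) / 2
                                          for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for sign in (1, -1):
                stump = (axis, threshold, sign)
                error = sum(w for p, y, w in zip(points, labels, weights)
                            if stump_predict(stump, p) != y)
                if error < best_error:
                    best, best_error = stump, error
    return best, best_error

def adaboost(points, labels, rounds):
    """Return a list of (alpha, stump) pairs learned in `rounds` rounds."""
    m = len(points)
    weights = [1.0 / m] * m                     # start from uniform weights
    ensemble = []
    for _ in range(rounds):
        stump, error = best_stump(points, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this stump
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, p))
                   for p, y, w in zip(points, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak hypotheses."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes.
points = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
labels = [1, 1, -1, -1, 1, 1]

one_round = [predict(adaboost(points, labels, 1), p) for p in points]
three_rounds = [predict(adaboost(points, labels, 3), p) for p in points]
print(one_round, three_rounds)
```

On this data a single stump misclassifies at least two points, while the weighted vote of three stumps fits all six training points.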

45
AdaBoost general scheme
Algorithms
TRAINING
46
AdaBoost algorithm
Algorithms
47
AdaBoost example
Algorithms
Weak hypotheses: vertical/horizontal hyperplanes
48
AdaBoost round 1
Algorithms
49
AdaBoost round 2
Algorithms
50
AdaBoost round 3
Algorithms
51
Combined Hypothesis
Algorithms
52
AdaBoost and NLP
Algorithms
  • POS Tagging (Abney et al. 99; Màrquez 99)
  • Text and Speech Categorization
    (Schapire & Singer 98; Schapire et al. 98; Weiss
    et al. 99)
  • PP-attachment Disambiguation (Abney et al. 99)
  • Parsing (Haruno et al. 99)
  • Word Sense Disambiguation (Escudero et al. 00,
    01)
  • Shallow parsing (Carreras & Màrquez, 01a; 02)
  • Email spam filtering (Carreras & Màrquez, 01b)
  • Term Extraction (Vivaldi et al. 01)

53
AdaBoost: pros & cons
Algorithms
  • Easy to implement and few parameters to set
  • Time and space grow linearly with number of
    examples. Ability to manage very large learning
    problems
  • Does not constrain explicitly the complexity of
    the learner
  • Naturally combines feature selection with
    learning
  • Has been successfully applied to many practical
    problems

54
AdaBoost: pros & cons
Algorithms
  • Seems to be rather robust to overfitting
    (number of rounds) but sensitive to noise
  • Performance is very good when there are
    relatively few relevant terms (features)
  • Can perform poorly when there is insufficient
    training data relative to the complexity of the
    base classifiers, or when the training errors of
    the base classifiers become too large too quickly

55
SVM: A General Definition
Algorithms
  • "Support Vector Machines (SVM) are learning
    systems that use a hypothesis space of linear
    functions in a high dimensional feature space,
    trained with a learning algorithm from
    optimisation theory that implements a learning
    bias derived from statistical learning theory."
    (Cristianini & Shawe-Taylor, 2000)

56
SVM: A General Definition
Algorithms
  • "Support Vector Machines (SVM) are learning
    systems that use a hypothesis space of linear
    functions in a high dimensional feature space,
    trained with a learning algorithm from
    optimisation theory that implements a learning
    bias derived from statistical learning theory."
    (Cristianini & Shawe-Taylor, 2000)

Key Concepts
57
Linear Classifiers
Algorithms
  • Hyperplanes in $\mathbb{R}^N$.
  • Defined by a weight vector (w) and a threshold
    (b).
  • They induce a classification rule
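
The rule itself is missing from the transcript. Assuming the usual convention, an instance x is assigned the sign of w·x + b, which can be sketched as follows (the weights, the threshold and the function name are hypothetical):

```python
from typing import Sequence

def linear_classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 if w.x + b >= 0 and -1 otherwise."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1

# Hypothetical hyperplane in R^2.
w, b = [1.0, -1.0], 0.5
print(linear_classify(w, b, [2.0, 1.0]))   # 2 - 1 + 0.5 = 1.5, positive side
print(linear_classify(w, b, [0.0, 3.0]))   # 0 - 3 + 0.5 = -2.5, negative side
```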

58
Optimal Hyperplane Geometric Intuition
Algorithms
59
Optimal Hyperplane Geometric Intuition
Algorithms
Maximal Margin Hyperplane
60
Linearly separable data
Algorithms
Quadratic Programming
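
The formulas of this slide are not preserved. In its standard form, and assuming a training set $\{(x_i, y_i)\}_{i=1}^{m}$ with $y_i \in \{+1, -1\}$, the maximal margin hyperplane for linearly separable data is the solution of the following quadratic program:

```latex
\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \bigl( \langle w, x_i \rangle + b \bigr) \ge 1, \qquad i = 1, \ldots, m
```

Minimizing $\lVert w \rVert$ under these constraints maximizes the geometric margin, which equals $1 / \lVert w \rVert$.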
61
Non-separable case (soft margin)
Algorithms
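
The formulas of this slide are not preserved. In the standard soft-margin formulation, assumed here, slack variables $\xi_i$ allow some examples to violate the margin, and a parameter $C$ controls the trade-off between margin width and training error:

```latex
\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i \bigl( \langle w, x_i \rangle + b \bigr) \ge 1 - \xi_i, \quad
\xi_i \ge 0, \qquad i = 1, \ldots, m
```

A high value of $C$ penalizes margin violations heavily and approaches the separable case; a low value tolerates more training errors in exchange for a wider margin.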
62
Non-linear SVMs
Algorithms
  • Implicit mapping into feature space via kernel
    functions

63
Non-linear SVMs
Algorithms
  • Kernel functions
      • Must be efficiently computable
      • Characterization via Mercer's theorem
  • "One of the curious facts about using a kernel is
    that we do not need to know the underlying
    feature map in order to be able to learn in the
    feature space!" (Cristianini & Shawe-Taylor, 2000)
  • Examples: polynomials, Gaussian radial basis
    functions, two-layer sigmoidal neural networks,
    etc.
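
A small numerical check of this idea, as a sketch: for 2-D inputs, the homogeneous degree-2 polynomial kernel gives the same value as an inner product under an explicit feature map, without ever building that map. The function names and the sample vectors are assumptions for illustration; a Gaussian RBF kernel is included for comparison.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, z)                          # computed in input space
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # computed in feature space
print(implicit, explicit)
```

The two values agree up to floating-point rounding; for kernels such as the Gaussian RBF the corresponding feature space is infinite-dimensional, so only the implicit computation is practical.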

64
Non-linear SVMs
Algorithms
[Figure: degree-3 polynomial kernel; panels labelled "lin. non-separable" and "lin. separable"]
65
Toy Examples
Algorithms
  • All examples have been run with the 2D graphic
    interface of LIBSVM (Chang and Lin, National
    Taiwan University)
  • "LIBSVM is an integrated software for support
    vector classification (C-SVC, nu-SVC),
    regression (epsilon-SVR, nu-SVR) and distribution
    estimation (one-class SVM). It supports
    multi-class classification. The basic algorithm
    is a simplification of both SMO by Platt and
    SVMLight by Joachims. It is also a simplification
    of the modification 2 of SMO by Keerthi et al.
    Our goal is to help users from other fields to
    easily use SVM as a tool. LIBSVM provides a
    simple interface where users can easily link it
    with their own programs."
  • Available from www.csie.ntu.edu.tw/~cjlin/libsvm
    (it includes a Web-integrated demo tool)

66
Toy Examples (I)
Algorithms
Linearly separable data set. Linear SVM. Maximal
margin hyperplane.
67
Toy Examples (I)
Algorithms
(still) Linearly separable data set. Linear SVM.
High value of the C parameter. Maximal margin
hyperplane.
The example is correctly classified
68
Toy Examples (I)
Algorithms
(still) Linearly separable data set. Linear SVM.
Low value of the C parameter. Trade-off between
margin and training error.
The example is now a bounded SV
69
Toy Examples (II)
Algorithms
70
Toy Examples (II)
Algorithms
71
Toy Examples (II)
Algorithms
72
Toy Examples (III)
Algorithms
73
SVM Summary
Algorithms
  • SVMs were introduced at COLT'92 (Boser, Guyon &
    Vapnik, 1992). Great development since then
  • Kernel-induced feature spaces: SVMs work
    efficiently in very high dimensional feature
    spaces
  • Learning bias: maximal margin optimisation.
    Reduces the danger of overfitting. Generalization
    bounds for SVMs
  • Compact representation of the induced hypothesis.
    The solution is sparse in terms of SVs

74
SVM Summary
Algorithms
  • Due to Mercer's conditions on the kernels, the
    optimisation problems are convex. No local
    minima
  • Optimisation theory guides the implementation.
    Efficient learning
  • Mainly for classification but also for
    regression, density estimation, clustering, etc.
  • Success in many real-world applications: OCR,
    vision, bioinformatics, speech recognition, NLP
    (TextCat, POS tagging, chunking, parsing, etc.)
  • Parameter tuning. Implications in convergence
    times, sparsity of the solution, etc.

75
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

76
NLP problems
Applications
  • Warning! We will not focus on final NLP
    applications, but on intermediate tasks...
  • We will classify the NLP tasks according to their
    (structural) complexity

77
NLP problems: structural complexity
Applications
  • Decisional problems
      • Text Categorization, Document filtering, Word
        Sense Disambiguation, etc.
  • Sequence tagging and detection of sequential
    structures
      • POS tagging, Named Entity extraction, syntactic
        chunking, etc.
  • Hierarchical structures
      • Clause detection, full parsing, IE of complex
        concepts, composite Named Entities, etc.

78
POS tagging
Applications
  • Morpho-syntactic ambiguity
    Part of Speech Tagging

He was shot in the hand as he chased the robbers
in the back street
[Candidate POS tags shown for the ambiguous words: NN/VB, JJ/VB, NN/VB]
(The Wall Street Journal Corpus)
79
POS tagging
Applications
preposition-adverb tree
80
POS tagging
Applications
preposition-adverb tree
Collocations
as_RB much_RB as_IN
as_RB soon_RB as_IN
as_RB well_RB as_IN
81
POS tagging
Applications
RTT (Màrquez & Rodríguez 97)
[Flowchart labels: Raw text, Morphological analysis, Disambiguation (Classify, Update, Filter; stop? yes/no), Language Model, Tagged text]
82
POS tagging
Applications
STT (Màrquez & Rodríguez 97)
83
Detection of sequential and hierarchical
structures
Applications
  • Named Entity recognition
  • Clause detection

84
Summary/conclusions
Conclusions
  • We have briefly outlined:
      • The ML setting: supervised learning for
        classification
      • Three concrete machine learning algorithms
      • How to apply them to solve intermediate NLP tasks

85
Summary/conclusions
Conclusions
  • Any ML algorithm for NLP should be:
      • Robust to noise and outliers
      • Efficient in large feature/example spaces
      • Adaptive to new/changing domains:
        portability, tuning, etc.
      • Able to take advantage of unlabelled examples:
        semi-supervised learning

86
Summary/conclusions
Conclusions
  • Statistical and ML-based Natural Language
    Processing is a very active and multidisciplinary
    area of research

87
Some current research lines
Conclusions
  • Appropriate learning paradigm for all kinds of NLP
    problems: TiMBL (DBZ99), TBEDL (Brill95), ME
    (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira &
    Singer02).
  • Definition of an adequate (and task-specific)
    feature space: mapping from the input space to a
    high dimensional feature space, kernels, etc.
  • Resolution of complex NLP problems: inference
    with classifiers and constraint satisfaction
  • etc.

88
Bibliography
Conclusions
  • You may find additional information at
    http://www.lsi.upc.es/~lluism/
      • tesi.html
      • publicacions/pubs.html
      • cursos/talks.html
      • cursos/MLandNL.html
      • cursos/emnlp1.html
  • This talk is available at
    http://www.lsi.upc.es/~lluism/udg03.ppt.gz
