Machine Learning in Natural Language - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Machine Learning in Natural Language


1
Machine Learning in Natural Language
  • No Lecture on Thursday.
  • Instead:
  • Monday, 4pm, 1404SC
  • Mark Johnson lectures on Bayesian Models of
    Language Acquisition

2
Machine Learning in Natural Language: Features
and Kernels
  • The idea of kernels
  • Kernel Perceptron
  • Structured Kernels
  • Tree and Graph Kernels
  • Lessons
  • Multi-class classification

3
Can be done explicitly (generate expressive
features) or implicitly (use kernels).
Embedding
The new discriminator is functionally simpler.
4
Kernel Based Methods
  • A method to run Perceptron on a very large
    feature set, without incurring the cost of
    keeping a very large weight vector.
  • Computing the weight vector is done in the
    original space.
  • Notice this pertains only to efficiency.
  • Generalization is still relative to the real
    dimensionality.
  • This is the main trick in SVMs (the algorithm itself
    is different), although many applications actually
    use linear kernels.

5
Kernel Based Methods
  • Let I be the set {t1, t2, t3, ...} of monomials
    (conjunctions) over the feature space x1, x2, ..., xn.
  • Then we can write a linear function over this new
    feature space: f(x) = Σ_{t in I} w_t t(x).
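A minimal sketch (illustrative Python, not from the slides) of the explicit alternative: blow a Boolean example up into conjunction (monomial) features of bounded size. The name expand_monomials is an assumption made for this example.

    from itertools import combinations

    def expand_monomials(x, max_degree=2):
        """Explicitly map a Boolean vector x to monomial (conjunction)
        features of size <= max_degree, as {index-tuple: 0/1}."""
        n = len(x)
        features = {}
        for d in range(1, max_degree + 1):
            for idx in combinations(range(n), d):
                features[idx] = int(all(x[i] for i in idx))
        return features

    # Example: x = (1, 0, 1) -> x1, x3 and x1&x3 are the active monomials.
    print([k for k, v in expand_monomials((1, 0, 1)).items() if v == 1])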

6
Kernel Based Methods
  • Great increase in expressivity.
  • We can run Perceptron, Winnow, or Logistic regression,
    but the convergence bound may suffer exponential
    growth.
  • An exponential number of monomials are true in each
    example.
  • Also, we will have to keep many weights.

7
The Kernel Trick (1)
  • Consider the value of w used in the prediction.
  • Each previous mistake on example z makes an additive
    contribution of +/-1 to the coordinate w_t of w,
    iff t(z) = 1.
  • The value of each coordinate w_t is determined by the
    number of mistakes on which t(·) was satisfied.

8
The Kernel Trick (2)
  • P: the set of examples on which we Promoted.
  • D: the set of examples on which we Demoted.
  • M = P ∪ D.

9
The Kernel Trick (3)
  • P: the set of examples on which we Promoted.
  • D: the set of examples on which we Demoted.
  • M = P ∪ D.
  • Here S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D.
    Reordering the prediction sum:
    Σ_{t in I} w_t t(x) = Σ_{z in M} S(z) Σ_{t in I} t(x) t(z).

10
The Kernel Trick (4)
  • S(y) = 1 if y ∈ P and S(y) = -1 if y ∈ D.
  • A mistake on z contributes the value +/-1 to all
    monomials satisfied by z. The total contribution
    of z to the sum is equal to the number of
    monomials that satisfy both x and z.
  • Define a dot product in the t-space:
    K(x, z) = Σ_{t in I} t(x) t(z).
  • We get the standard notation:
    f(x) = Σ_{z in M} S(z) K(x, z).

11
Kernel Based Methods
  • What does this representation give us?
  • We can view this kernel as the distance between x and z,
    measured in the t-space.
  • But K(x, z) can be computed in the original
    space, without explicitly writing the
    t-representation of x and z.

12
Kernel Based Methods
  • Consider the space of all 3^n monomials (allowing
    both positive and negative literals).
  • Then, if same(x, z) is the number of features that have
    the same value for both x and z, we get
    K(x, z) = 2^same(x, z).
  • Example: take n = 2, x = (0, 0), z = (0, 1). Then
    same(x, z) = 1 and K(x, z) = 2 (the only monomials
    satisfied by both are the empty monomial and ¬x1).
  • Proof: let k = same(x, z). For each of these k features,
    choose to (1) include the literal with the right
    polarity in the monomial, or (2) not include it at all.
  • Other kernels can be used.
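To sanity-check the identity K(x, z) = 2^same(x, z), here is a brute-force enumeration over all 3^n monomials (illustrative Python, not from the slides):

    from itertools import product

    def monomial_kernel(x, z):
        """Count monomials over positive/negative literals that are
        satisfied by both Boolean vectors x and z."""
        n = len(x)
        count = 0
        # Each variable is required true (+1), required false (-1), or absent (0).
        for pattern in product((+1, -1, 0), repeat=n):
            satisfies = lambda v: all(
                p == 0 or (p == +1 and v[i] == 1) or (p == -1 and v[i] == 0)
                for i, p in enumerate(pattern))
            if satisfies(x) and satisfies(z):
                count += 1
        return count

    x, z = (0, 0), (0, 1)
    same = sum(a == b for a, b in zip(x, z))
    print(monomial_kernel(x, z), 2 ** same)   # prints: 2 2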

13
Implementation
  • Simply run Perceptron in an on-line mode, but
    keep track of the set M.
  • Keeping the set M allows us to keep track of S(z).
  • Rather than remembering the weight vector w,
    remember the set M (P and D): all those
    examples on which we made mistakes
    (a minimal sketch follows below).

Dual Representation
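A minimal sketch (assumed, not from the slides) of this dual representation: the learner stores only the mistake set M with the signs S(z) and predicts through the kernel.

    def kernel_perceptron_train(examples, labels, kernel, epochs=5):
        """Dual (kernel) Perceptron: store mistakes (z, S(z)) instead of w."""
        mistakes = []  # the set M, as (example, +1/-1) pairs
        for _ in range(epochs):
            for x, y in zip(examples, labels):
                score = sum(s * kernel(x, z) for z, s in mistakes)
                if (1 if score > 0 else -1) != y:
                    mistakes.append((x, y))  # promote (+1) or demote (-1)
        return mistakes

    def kernel_perceptron_predict(mistakes, kernel, x):
        return 1 if sum(s * kernel(x, z) for z, s in mistakes) > 0 else -1

    # Using the monomial kernel K(x, z) = 2**same(x, z) from above,
    # learn the OR function over two Boolean variables:
    K = lambda x, z: 2 ** sum(a == b for a, b in zip(x, z))
    M = kernel_perceptron_train([(0, 0), (0, 1), (1, 0), (1, 1)],
                                [-1, 1, 1, 1], K)
    print(kernel_perceptron_predict(M, K, (0, 0)))   # -1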
14
Summary: Kernel Based Methods I
  • A method to run Perceptron on a very large
    feature set, without incurring the cost of
    keeping a very large weight vector.
  • Computing the weight vector can still be done in
    the original feature space.
  • Notice this pertains only to efficiency: the
    classifier is identical to the one you get by
    blowing up the feature space.
  • Generalization is still relative to the real
    dimensionality.
  • This is the main trick in SVMs (the algorithm itself
    is different), although most applications actually
    use linear kernels.

15
Summary: Kernel Trick
  • Separating hyperplanes (produced by Perceptron,
    SVM) can be computed in terms of dot products
    over a feature based representation of examples.
  • We want to define a dot product in a high
    dimensional space.
  • Given two examples x = (x1, x2, ..., xn) and
    y = (y1, y2, ..., yn), we want to map them to a high
    dimensional space (example: quadratic):
    Φ(x1, x2, ..., xn) = (x1, ..., xn, x1^2, ..., xn^2,
    x1x2, ..., xn-1xn)
    Φ(y1, y2, ..., yn) = (y1, ..., yn, y1^2, ..., yn^2,
    y1y2, ..., yn-1yn)
  • Computing the dot product A = Φ(x) · Φ(y) directly
    takes O(n^2) time.
  • Instead, in the original space, compute
    B = f(x · y) = (1 + (x1, x2, ..., xn) · (y1, y2, ..., yn))^2.
  • Theorem: A = B.
  • The coefficients do not really matter; this can be done
    for other functions.
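A small numerical check of this equivalence (illustrative Python; the explicit map below uses the standard sqrt(2) scaling so the two sides match exactly, which is the "coefficients do not matter" point):

    import itertools, math, random

    def phi(x):
        """Explicit quadratic feature map, scaled so that
        phi(x).phi(y) == (1 + x.y)**2 exactly."""
        n = len(x)
        feats = [1.0]
        feats += [math.sqrt(2) * xi for xi in x]
        feats += [xi * xi for xi in x]
        feats += [math.sqrt(2) * x[i] * x[j]
                  for i, j in itertools.combinations(range(n), 2)]
        return feats

    def quad_kernel(x, y):
        return (1 + sum(a * b for a, b in zip(x, y))) ** 2

    x = [random.random() for _ in range(5)]
    y = [random.random() for _ in range(5)]
    A = sum(a * b for a, b in zip(phi(x), phi(y)))   # O(n^2) features
    B = quad_kernel(x, y)                            # O(n) work
    print(abs(A - B) < 1e-9)                         # True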

16
Efficiency-Generalization Tradeoff
  • There is a tradeoff between the computational
    efficiency with which these kernels can be
    computed and the generalization ability of the
    classifier.
  • For example, using such kernels the Perceptron
    algorithm can make an exponential number of
    mistakes even when learning simple functions.
  • In addition, computing with kernels depends
    strongly on the number of examples. It turns out
    that sometimes working in the blown up space is
    more efficient than using kernels.
  • Next: more complicated kernels.

17
Structured Input
[NP Which type] [PP of [NP submarine]] [VP was bought]
[ADVP recently] [PP by [NP South Korea]] ?
  • S: John will join the board as a director.

Knowledge Representation
18
Learning From Structured Input
  • We want to extract features from structured
    domain elements
  • their internal (hierarchical) structure should be
    encoded.
  • A feature is a mapping from the instance space
    to {0, 1} or [0, 1].
  • With an appropriate representation language it is
    possible to represent expressive features that
    constitute an infinite dimensional space (FEX).
  • Learning can be done in the infinite attribute
    domain.
  • What does it mean to extract features?
  • Conceptually: different data instantiations may
    be abstracted to yield the same representation
    (quantified elements).
  • Computationally: some kind of graph matching
    process.
  • Challenge
  • Provide the expressivity necessary to deal with
    large scale and highly structured domains
  • Meet the strong tractability requirements for
    these tasks.

19
Example
  • Only those descriptions that are ACTIVE in the
    input are listed
  • Michael Collins developed kernels over parse
    trees.
  • Cumby/Roth developed parameterized kernels over
    structures.
  • When is it better to use a kernel vs. the primal
    (explicit) representation?

D = (AND word (before tag))
Explicit features
20
Overview & Goals (Cumby & Roth 2003)
  • Applying kernel learning methods to structured
    domains.
  • Develop a unified formalism for structured
    kernels (Collins & Duffy, Gaertner & Lloyd,
    Haussler).
  • A flexible language that measures distance between
    structures with respect to a given substructure.
  • Examine complexity & generalization across
    different feature sets and learners.
  • When does each type of feature set perform better,
    and with what learners?
  • Exemplify with experiments from bioinformatics &
    NLP: Mutagenesis, Named-Entity prediction.

21
Feature Description Logic
  • A flexible knowledge representation for feature
    extraction from structured data.
  • Domain elements are represented as labeled graphs
    (concept graphs) that correspond to FDL
    expressions.
  • FDL is formed from an alphabet of
    attribute, value, and role symbols.
  • Well defined syntax and equivalent semantics.
  • E.g., descriptions are defined inductively with
    sensors as primitives:
  • Sensor: a basic description; a term of the form
    a(v), or a.
    a is an attribute symbol, v a value symbol (ground
    sensor).
  • An existential sensor a describes an object that has
    some value for attribute a.
  • AND clauses, and (role D) clauses for relations
    between objects.
  • Expressive and efficient feature extraction (a toy
    illustration follows below).
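A toy sketch (illustrative Python, not the actual FDL/FEX implementation) of a labeled concept graph and the two kinds of sensors just described; the dictionary layout is an assumption made for this example.

    # A labeled graph: nodes carry attribute labels and role-labeled edges.
    graph = {
        "n1": {"labels": {"word": "dog", "tag": "N"}, "roles": {"before": ["n2"]}},
        "n2": {"labels": {"word": "ran", "tag": "V"}, "roles": {"before": ["n3"]}},
        "n3": {"labels": {"word": "fast", "tag": "ADV"}, "roles": {}},
    }

    def ground_sensor(node, attr, value):
        """Sensor a(v): the node has value v for attribute a."""
        return graph[node]["labels"].get(attr) == value

    def existential_sensor(node, attr):
        """Existential sensor a: the node has some value for attribute a."""
        return attr in graph[node]["labels"]

    print(ground_sensor("n1", "tag", "N"), existential_sensor("n3", "word"))  # True True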

Knowledge Representation
22
Example (Cont.)
  • Features: Feature Generation Functions,
    extensions & subsumption
    (see paper).
  • Basically:
  • Only those descriptions that are ACTIVE in the
    input are listed
  • The language is expressive enough to generate
    linguistically interesting features such as
    agreements, etc.

D (AND word (before tag))
D? (AND word(the) (before tag(N)), (AND
word(dog) (before tag(V)), (AND word(ran) (before
tag(ADV)), (AND word(very) (before tag(ADJ))
Explicit features
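An illustrative sketch (Python; not the FEX tool itself) of how such explicit (AND word (before tag)) features could be generated from a tagged sentence; the tag names are assumptions made for this example.

    def and_word_before_tag(tagged_sentence):
        """Features of the form (AND word(w) (before tag(t))): each word
        paired with the tag of the word that follows it."""
        return [f"(AND word({w}) (before tag({next_tag})))"
                for (w, _), (_, next_tag) in zip(tagged_sentence, tagged_sentence[1:])]

    sent = [("the", "DET"), ("dog", "N"), ("ran", "V"), ("very", "ADV"), ("fast", "ADJ")]
    print(and_word_before_tag(sent))
    # ['(AND word(the) (before tag(N)))', '(AND word(dog) (before tag(V)))', ...]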
23
Kernels
  • It is possible to define FDL based kernels for
    structured data.
  • When using linear classifiers it is important to
    enhance the set of features to gain expressivity.
  • A common way: blow up the feature space by
    generating functions of primitive features.
  • For some algorithms (SVM, Perceptron), kernel
    functions can be used to expand the feature space
    while still working in the original space.
  • Is it worth doing in structured domains?
  • Answers are not clear so far:
  • Computationally: yes, when we simulate a huge
    space.
  • Generalization: not always [Khardon, Roth &
    Servedio, NIPS'01; Ben-David et al.].

Kernels
24
Kernels in Structured Domains
  • We define a kernel family K parameterized by FDL
    descriptions.
  • The definition is recursive on the structure of D:
    sensor, existential sensor, role
    description, AND.
  • Key: many previous structured kernels considered
    all substructures (e.g., Collins & Duffy '02, Tree
    Kernels).
  • This is analogous to an exponential feature
    space and risks overfitting.

Kernels
25
FDL Kernel Definition
  • Kernel family K parameterized by feature type
    descriptions. For a description D and nodes n1, n2:
  • If D is a ground sensor s(v):
    K_D(n1, n2) = 1 if s(v) is a label of both n1 and n2,
    and 0 otherwise.
  • If D is an existential sensor s, and the sensor
    descriptions s(v1), s(v2), ..., s(vj) are labels of
    both n1 and n2: K_D(n1, n2) = j.
  • If D is a role description (r D'), then K_D(n1, n2)
    sums K_D'(n1', n2') over those nodes n1', n2' that
    have an r-labeled edge from n1, n2.
  • If D is a description (AND D1 D2 ... Dn) with li
    repetitions of any Di, then K_D(n1, n2) combines the
    K_Di(n1, n2) multiplicatively, accounting for the li
    repetitions (see the paper for the exact form).
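A much-simplified sketch of this recursion (illustrative Python; the graph encoding is an assumption, and the AND case below is a plain product that ignores the repetition correction):

    def fdl_kernel(D, g1, n1, g2, n2):
        kind = D[0]
        if kind == "sensor":                 # ground sensor s(v)
            _, attr, value = D
            return int(g1[n1]["labels"].get(attr) == value ==
                       g2[n2]["labels"].get(attr))
        if kind == "exists":                 # existential sensor s
            _, attr = D
            v1, v2 = g1[n1]["labels"].get(attr), g2[n2]["labels"].get(attr)
            return int(v1 is not None and v1 == v2)   # shared values (here 0 or 1)
        if kind == "role":                   # (r D')
            _, r, D2 = D
            return sum(fdl_kernel(D2, g1, m1, g2, m2)
                       for m1 in g1[n1]["roles"].get(r, [])
                       for m2 in g2[n2]["roles"].get(r, []))
        if kind == "and":                    # (AND D1 ... Dn), simplified
            prod = 1
            for Di in D[1:]:
                prod *= fdl_kernel(Di, g1, n1, g2, n2)
            return prod

    g1 = {"a": {"labels": {"word": "dog", "tag": "N"}, "roles": {"before": ["b"]}},
          "b": {"labels": {"word": "ran", "tag": "V"}, "roles": {}}}
    g2 = {"x": {"labels": {"word": "dog", "tag": "N"}, "roles": {"before": ["y"]}},
          "y": {"labels": {"word": "barked", "tag": "V"}, "roles": {}}}
    D = ("and", ("exists", "word"), ("role", "before", ("sensor", "tag", "V")))
    print(fdl_kernel(D, g1, "a", g2, "x"))   # 1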

Kernels
26
Kernel Example
  • D = (AND word (before word))
  • G1: The dog ran very fast
  • G2: The dog ran quickly
  • The final output is 2, since there are 2 matching
    collocations ("The dog" and "dog ran").
  • Can simulate Boolean kernels, as seen in
    [Khardon, Roth et al.].
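A minimal version of this example (illustrative Python): for D = (AND word (before word)) the kernel reduces to counting (word, next-word) collocations shared by the two sentences.

    def word_before_word_kernel(s1, s2):
        """K for D = (AND word (before word)): count of (word, next-word)
        collocations occurring in both sentences."""
        bigrams = lambda s: [(a, b) for a, b in zip(s, s[1:])]
        b2 = bigrams(s2)
        return sum(b2.count(bg) for bg in bigrams(s1))

    G1 = "The dog ran very fast".split()
    G2 = "The dog ran quickly".split()
    print(word_before_word_kernel(G1, G2))   # 2: "The dog" and "dog ran"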

Kernels
27
Complexity & Generalization
  • How does this compare in complexity and generalization
    to other kernels for structured data?
  • For m examples, with average example size g and
    time t1 to evaluate the kernel, kernel Perceptron
    takes O(m^2 g^2 t1);
  • if extracting a feature explicitly takes t2,
    Perceptron takes O(m g t2);
  • most kernels that simulate a well defined
    feature space have t1 << t2.
  • By restricting the size of the expanded feature space
    we avoid overfitting; even SVM suffers under many
    irrelevant features (Weston).
  • Margin argument: the margin goes down when you have
    more features.
  • Given a linearly separable set of points S =
    {x1, ..., xm} ⊆ R^n with separator w ∈ R^n,
  • embed S into an n' > n dimensional space by adding
    zero-mean random noise e to the additional n' - n
    dimensions, s.t. w' = (w, 0) ∈ R^n' still
    separates S.
  • Now the margin is unchanged,
  • but the norm of the examples grows, so the margin
    relative to the data radius (and with it the mistake
    bound) gets worse.

Analysis
28
Experiments
  • The experiments serve as a comparison: our features
    with kernel Perceptron, normal Winnow, and all-subtrees
    expanded features.
  • Bioinformatics experiment in mutagenesis
    prediction:
  • 188 compounds with atom-bond data, binary
    prediction.
  • 10-fold cross validation with 12 training runs.
  • NLP experiment in classifying detected NEs:
  • 4700 training / 1500 test phrases from MUC-7
  • person, location, organization
  • Trained and tested with kernel Perceptron and Winnow
    (SNoW) classifiers, with the FDL kernel and the
    respective explicit features; also an all-subtrees
    kernel based on Collins & Duffy's work.
  • (Figure: Mutagenesis concept graph;
    features simulated with the all-subtrees kernel.)

29
Discussion
  • (Results table: microaveraged accuracy.)
  • We have a kernel that simulates the features obtained
    with FDL.
  • But the quadratic training time means it is cheaper to
    extract features and learn explicitly than to run
    kernel Perceptron.
  • SVM could take (slightly) even longer, but might
    perform better.
  • Still, restricted features might work better than
    larger spaces simulated by other kernels.
  • Can we improve on the benefits of useful features?
  • Compile examples together?
  • More sophisticated kernels than the matching kernel?
  • The kernel still provides a metric for similarity
    based approaches.

30
Conclusion
  • Kernels for learning from structured data are an
    interesting idea.
  • Different kernels may expand/restrict the
    hypothesis space in useful ways.
  • We need to know the benefits and hazards:
  • To justify these methods we must embed in a space
    much larger than the training set size.
  • This can decrease the margin.
  • Expressive knowledge representations can be used
    to create features explicitly or in implicit
    kernel-spaces.
  • The data representation could allow us to plug in
    different base kernels to replace the matching
    kernel.
  • The parameterized kernel allows us to direct the way
    the feature space is blown up, to encode
    background knowledge.