Title: Machine Learning in Natural Language
1. Machine Learning in Natural Language
- No lecture on Thursday.
- Instead: Monday, 4pm, 1404SC, Mark Johnson lectures on Bayesian Models of Language Acquisition.
2. Machine Learning in Natural Language: Features and Kernels
- The idea of kernels
- Kernel Perceptron
- Structured Kernels
- Tree and Graph Kernels
- Lessons
- Multi-class classification
3. Embedding
- Feature expansion can be done explicitly (generate expressive features) or implicitly (use kernels).
- In the embedded space, the new discriminator is functionally simpler.
4. Kernel Based Methods
- A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
- Computing the weight vector is done in the original space.
- Notice this pertains only to efficiency: generalization is still relative to the real dimensionality.
- This is the main trick in SVMs (the algorithm is different), although many applications actually use linear kernels.
5. Kernel Based Methods
- Let I be the set {t1, t2, t3, ...} of monomials (conjunctions) over the feature space x1, x2, ..., xn.
- Then we can write a linear function over this new feature space: f(x) = Σ_{t∈I} w_t t(x), compared against a threshold θ.
6. Kernel Based Methods
- Great increase in expressivity.
- Can run Perceptron, Winnow, Logistic regression, but the convergence bound may suffer exponential growth.
- An exponential number of monomials are true in each example (a small sketch of the blow-up follows below).
- Also, we would have to keep many weights.
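To make the blow-up concrete, here is a minimal Python sketch (my illustration, not from the lecture) that enumerates the 3^n monomials with positive and negative literals and counts how many hold on one example:

from itertools import product

def monomials(n):
    # Each feature is required positive (1), required negative (0), or unused (None),
    # giving 3**n monomials (conjunctions) in total.
    return list(product([1, 0, None], repeat=n))

def t(monomial, x):
    # A monomial holds on example x (a 0/1 tuple) iff every required literal is satisfied.
    return int(all(lit is None or xi == lit for lit, xi in zip(monomial, x)))

n = 4
I = monomials(n)
x = (1, 0, 1, 1)
print(len(I))                   # 81 = 3**4 monomials: exponential blow-up
print(sum(t(m, x) for m in I))  # 16 = 2**4 monomials hold on x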
7. The Kernel Trick (1)
- Consider the value of w used in the prediction.
- Each previous mistake, on example z, makes an additive contribution of +/-1 to the weight w_t of monomial t, iff t(z) = 1.
- The value of w_t is therefore determined by the number of mistakes on which t was satisfied.
8. The Kernel Trick (2)
- P: the set of examples on which we promoted.
- D: the set of examples on which we demoted.
- M = P ∪ D.
9. The Kernel Trick (3)
- P: the set of examples on which we promoted.
- D: the set of examples on which we demoted.
- M = P ∪ D.
- Where S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D.
- Reordering the sums over monomials and over mistakes expresses f(x) in terms of the examples in M (made explicit on the next slide).
10. The Kernel Trick (4)
- S(y) = 1 if y ∈ P and S(y) = -1 if y ∈ D.
- A mistake on z contributes the value +/-1 to all monomials satisfied by z. The total contribution of z to the sum is equal to the number of monomials that are satisfied by both x and z.
- Define a dot product in the t-space: K(x,z) = Σ_{t∈I} t(x) t(z).
- We get the standard notation: f(x) = Th_θ( Σ_{z∈M} S(z) K(x,z) ).
11. Kernel Based Methods
- What does this representation give us?
- We can view this kernel as the distance between x and z, measured in the t-space.
- But K(x,z) can be computed in the original space, without explicitly writing the t-representation of x and z.
12. Kernel Based Methods
- Consider the space of all 3^n monomials (allowing both positive and negative literals). Then K(x,z) = 2^{same(x,z)}, where same(x,z) is the number of features that have the same value in both x and z.
- Example: take n = 2, x = (0,0), z = (0,1); then same(x,z) = 1 and K(x,z) = 2 (checked in the sketch below).
- Proof: let k = same(x,z). For each of these k features we can either (1) include the literal with the right polarity in the monomial, or (2) not include it at all; features on which x and z disagree cannot appear.
- Other kernels can be used.
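A minimal check of the closed form against brute-force enumeration (my sketch, reusing the 0/1/None encoding of monomials from the earlier sketch):

from itertools import product

def t(monomial, x):
    # monomial entries: 1 = positive literal, 0 = negative literal, None = feature unused
    return int(all(lit is None or xi == lit for lit, xi in zip(monomial, x)))

def K_bruteforce(x, z):
    # Dot product in the monomial space: count monomials true on both x and z.
    return sum(t(m, x) * t(m, z) for m in product([1, 0, None], repeat=len(x)))

def K_closed_form(x, z):
    same = sum(xi == zi for xi, zi in zip(x, z))
    return 2 ** same

x, z = (0, 0), (0, 1)
print(K_bruteforce(x, z), K_closed_form(x, z))   # 2 2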
13. Implementation
- Simply run Perceptron in an on-line mode, but keep track of the set M.
- Keeping the set M allows us to keep track of S(z).
- Rather than remembering the weight vector w, remember the set M (P and D): all those examples on which we made mistakes (see the sketch below).
Dual Representation
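A minimal sketch of this dual-form Perceptron (my own illustration, not code from the lecture); any kernel K can be plugged in, here the monomial kernel 2^{same(x,z)}:

def kernel_perceptron(examples, K, epochs=1, theta=0.0):
    # Instead of a weight vector, keep the mistake set M as (example, sign) pairs,
    # with sign +1 for promotions and -1 for demotions.
    M = []

    def f(x):
        return sum(s * K(x, z) for z, s in M)

    for _ in range(epochs):
        for x, y in examples:            # labels y are +1 or -1
            y_hat = 1 if f(x) > theta else -1
            if y_hat != y:               # mistake: record x with its label
                M.append((x, y))
    return M, f

K = lambda x, z: 2 ** sum(xi == zi for xi, zi in zip(x, z))
data = [((1, 0, 1), 1), ((0, 0, 1), -1), ((1, 1, 0), 1), ((0, 1, 0), -1)]
M, f = kernel_perceptron(data, K, epochs=5)
print([1 if f(x) > 0 else -1 for x, _ in data])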
14. Summary: Kernel Based Methods I
- A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
- Computing the weight vector can still be done in the original feature space.
- Notice this pertains only to efficiency: the classifier is identical to the one you get by blowing up the feature space.
- Generalization is still relative to the real dimensionality.
- This is the main trick in SVMs (the algorithm is different), although most applications actually use linear kernels.
15. Summary: The Kernel Trick
- Separating hyperplanes (produced by Perceptron, SVM) can be computed in terms of dot products over a feature-based representation of examples.
- We want to define a dot product in a high dimensional space. Given two examples x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), we want to map them to a high dimensional space, for example the quadratic one:
- Φ(x1, ..., xn) = (x1, ..., xn, x1², ..., xn², x1x2, ..., x_{n-1}x_n)
- Φ(y1, ..., yn) = (y1, ..., yn, y1², ..., yn², y1y2, ..., y_{n-1}y_n)
- Computing the dot product A = Φ(x) · Φ(y) directly takes time quadratic in n.
- Instead, in the original space, compute B = f(x · y) = (1 + (x1, ..., xn) · (y1, ..., yn))².
- Theorem: A = B (a numeric check follows below).
- Coefficients do not really matter; the same can be done for other functions.
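A quick numeric check of the identity (my sketch). For exact equality the explicit map needs a constant feature and √2 coefficients on the linear terms; this is the sense in which the coefficients do not really matter:

import random

def phi(x):
    # Explicit quadratic feature map with the coefficients that make the identity exact.
    n = len(x)
    return [1.0] + [(2 ** 0.5) * xi for xi in x] + [x[i] * x[j] for i in range(n) for j in range(n)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

random.seed(0)
x = [random.random() for _ in range(5)]
y = [random.random() for _ in range(5)]

A = dot(phi(x), phi(y))        # O(n^2) work in the expanded space
B = (1 + dot(x, y)) ** 2       # O(n) work in the original space
print(abs(A - B) < 1e-9)       # True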
16. Efficiency-Generalization Tradeoff
- There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier.
- For example, using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions.
- In addition, computing with kernels depends strongly on the number of examples. It turns out that sometimes working in the blown-up space is more efficient than using kernels.
- Next: more complicated kernels.
17. Structured Input
- [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] ?
- S: John will join the board as a director.
18. Learning From Structured Input
- We want to extract features from structured domain elements; their internal (hierarchical) structure should be encoded.
- A feature is a mapping from the instance space to {0,1} or [0,1].
- With an appropriate representation language it is possible to represent expressive features that constitute an infinite dimensional space (FEX).
- Learning can be done in the infinite attribute domain.
- What does it mean to extract features?
- Conceptually: different data instantiations may be abstracted to yield the same representation (quantified elements).
- Computationally: some kind of graph matching process.
- Challenge:
- Provide the expressivity necessary to deal with large scale and highly structured domains.
- Meet the strong tractability requirements for these tasks.
19. Example
- Only those descriptions that are ACTIVE in the input are listed.
- Michael Collins developed kernels over parse trees.
- Cumby/Roth developed parameterized kernels over structures.
- When is it better to use a kernel vs. using the primal representation?
D = (AND word (before tag))
Explicit features
20. Overview and Goals (Cumby & Roth 2003)
- Applying kernel learning methods to structured domains.
- Develop a unified formalism for structured kernels (Collins & Duffy, Gaertner & Lloyd, Haussler).
- A flexible language that measures distance between structures with respect to a given substructure.
- Examine complexity and generalization across different feature sets and learners: when does each type of feature set perform better, and with what learners?
- Exemplify with experiments from bioinformatics and NLP: mutagenesis, named-entity prediction.
21. Feature Description Logic
- A flexible knowledge representation for feature extraction from structured data.
- Domain elements are represented as labeled graphs: concept graphs that correspond to FDL expressions.
- FDL is formed from an alphabet of attribute, value, and role symbols.
- Well defined syntax and equivalent semantics; descriptions are defined inductively with sensors as primitives:
- Sensor: a basic description, a term of the form a(v) or a, where a is an attribute symbol and v a value symbol (a ground sensor).
- Existential sensor a: describes an object that has some value for attribute a.
- AND clauses, and (role D) clauses for relations between objects.
- Expressive and efficient feature extraction (a small sketch of the representation follows below).
22. Example (Cont.)
- Features: feature generation functions, extensions, subsumption (see paper).
- Basically, only those descriptions that are ACTIVE in the input are listed.
- The language is expressive enough to generate linguistically interesting features such as agreements, etc.
D = (AND word (before tag))
D' = (AND word(the) (before tag(N))), (AND word(dog) (before tag(V))), (AND word(ran) (before tag(ADV))), (AND word(very) (before tag(ADJ)))
Explicit features (a sketch of this generation step follows below)
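A hedged sketch of generating the explicit features for D = (AND word (before tag)) over a tagged sentence (my illustration, not FEX itself; the tag set is an assumption):

sentence = [("the", "DET"), ("dog", "N"), ("ran", "V"), ("very", "ADV"), ("fast", "ADJ")]

def active_features(tagged):
    # For each word, emit (AND word(w) (before tag(t))) where t is the tag of the
    # next word; only descriptions ACTIVE in the input are generated.
    return [f"(AND word({w}) (before tag({next_tag})))"
            for (w, _), (_, next_tag) in zip(tagged, tagged[1:])]

for feat in active_features(sentence):
    print(feat)
# (AND word(the) (before tag(N)))
# (AND word(dog) (before tag(V)))
# ...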
23. Kernels
- It is possible to define FDL-based kernels for structured data.
- When using linear classifiers it is important to enhance the set of features to gain expressivity.
- A common way: blow up the feature space by generating functions of primitive features.
- For some algorithms (SVM, Perceptron), kernel functions can be used to expand the feature space while still working in the original space.
- Is it worth doing in structured domains? The answers are not clear so far:
- Computationally: yes, when we simulate a huge space.
- Generalization: not always [Khardon, Roth, Servedio, NIPS'01; Ben-David et al.].
24. Kernels in Structured Domains
- We define a kernel family K parameterized by FDL descriptions.
- The definition is recursive on the structure of D: sensor, existential sensor, role description, AND.
- Key: many previous structured kernels considered all substructures (e.g., Collins & Duffy '02, tree kernels). That is analogous to an exponential feature space, and risks overfitting.
25. FDL Kernel Definition
- The kernel family K is parameterized by feature type descriptions. For a description D, the kernel of two concept graphs is defined by cases:
- If D is a ground sensor s(v): the base case applies when s(v) is a label of both nodes being compared.
- If D is an existential sensor s: the kernel counts the sensor descriptions s(v1), s(v2), ..., s(vj) that are labels of both nodes.
- If D is a role description (r D): the kernel recurses on D over those nodes that have an r-labeled edge from n1 and n2.
- If D is a description (AND D1 D2 ... Dn) with li repetitions of any Di: the kernel combines the kernels of the Di components.
26. Kernel Example
- D = (AND word (before word))
- G1: The dog ran very fast
- G2: The dog ran quickly
- Working through the definition, the final output is 2, since there are 2 matching collocations (see the sketch below).
- Can simulate Boolean kernels as in [Khardon, Roth et al.].
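A simplified sketch of this example (my reconstruction, not the paper's kernel): for D = (AND word (before word)) it counts collocations, i.e. adjacent word pairs, that occur in both inputs:

def bigrams(sentence):
    words = sentence.lower().split()
    return [(a, b) for a, b in zip(words, words[1:])]

def K(g1, g2):
    # Count (word, next word) pairs shared by the two inputs.
    b2 = bigrams(g2)
    return sum(1 for pair in bigrams(g1) if pair in b2)

G1 = "The dog ran very fast"
G2 = "The dog ran quickly"
print(K(G1, G2))   # 2: ("the", "dog") and ("dog", "ran") match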
27. Complexity and Generalization
- How does this compare in complexity and generalization to other kernels for structured data?
- For m examples, with average example size g and time t1 to evaluate the kernel, kernel Perceptron takes O(m² g² t1).
- If extracting a feature explicitly takes t2, Perceptron takes O(m g t2).
- Most kernels that simulate a well defined feature space have t1 << t2.
- By restricting the size of the expanded feature space we avoid overfitting; even SVM suffers under many irrelevant features (Weston).
- Margin argument: the margin goes down when you have more features (illustrated in the sketch below).
- Given a linearly separable set of points S = {x1, ..., xm} ⊂ R^n with separator w ∈ R^n,
- embed S into an n' > n dimensional space by adding zero-mean random noise e to the additional n' - n dimensions, s.t. w' = (w, 0) ∈ R^{n'} still separates S.
- The margin is unchanged, but the norm of the examples grows, so the margin relative to the data radius decreases.
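A small numeric illustration of the margin argument (my sketch): after embedding with noise, the separator's margin stays the same but the data radius R grows, so the Perceptron mistake bound (R/γ)² gets worse:

import random

random.seed(0)

# Toy linearly separable set in R^2, separated by w = (1, 0), i.e. by sign(x1).
S = [((1.0 + random.random(), random.uniform(-1, 1)), 1) for _ in range(20)] + \
    [((-1.0 - random.random(), random.uniform(-1, 1)), -1) for _ in range(20)]

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

def margin_and_radius(points, w):
    gamma = min(y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in points) / norm(w)
    return gamma, max(norm(x) for x, _ in points)

w = (1.0, 0.0)
gamma, R = margin_and_radius(S, w)

# Embed into R^50 by appending zero-mean noise; w' = (w, 0, ..., 0) still separates S.
extra = 48
S_hi = [(x + tuple(random.gauss(0, 1) for _ in range(extra)), y) for x, y in S]
w_hi = w + (0.0,) * extra
gamma_hi, R_hi = margin_and_radius(S_hi, w_hi)

print(gamma, R, (R / gamma) ** 2)              # original margin, radius, mistake bound
print(gamma_hi, R_hi, (R_hi / gamma_hi) ** 2)  # same margin, larger radius, worse bound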
28. Experiments
- Serve as a comparison: our features with kernel Perceptron, normal Winnow, and all-subtrees expanded features.
- Bioinformatics experiment in mutagenesis prediction: 188 compounds with atom-bond data, binary prediction; 10-fold cross validation with 12 training runs.
- NLP experiment in classifying detected NEs: 4700 training and 1500 test phrases from MUC-7; classes are person, location, organization.
- Trained and tested with kernel Perceptron and Winnow (SNoW) classifiers with the FDL kernel and the respective features; also an all-subtrees kernel based on Collins & Duffy's work.
- (Figure: Mutagenesis concept graph; features simulated with the all-subtrees kernel.)
29. Discussion
- (Results reported as micro-averaged accuracy.)
- We have a kernel that simulates the features obtained with FDL.
- But quadratic training time means it is cheaper to extract features and learn explicitly than to use kernel Perceptron.
- SVM could take (slightly) even longer, but might perform better.
- However, restricted features might work better than the larger spaces simulated by other kernels.
- Can we improve on the benefits of useful features? Compile examples together? More sophisticated kernels than the matching kernel?
- Still provides a metric for similarity-based approaches.
30. Conclusion
- Kernels for learning from structured data are an interesting idea.
- Different kernels may expand or restrict the hypothesis space in useful ways; we need to know the benefits and hazards.
- To justify these methods we must embed in a space much larger than the training set size, which can decrease the margin.
- Expressive knowledge representations can be used to create features explicitly or in implicit kernel spaces.
- The data representation could allow us to plug in different base kernels to replace the matching kernel.
- The parameterized kernel allows us to direct the way the feature space is blown up, so as to encode background knowledge.