Machine Learning in Natural Language - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Machine Learning in Natural Language


1
Machine Learning in Natural Language
  • No Lecture on Thursday.
  • Instead:
  • Monday, 4pm, 1404SC
  • Mark Johnson lectures on Bayesian Models of
    Language Acquisition

2
Machine Learning in Natural Language: Features
and Kernels
  • The idea of kernels
  • Kernel Perceptron
  • Structured Kernels
  • Tree and Graph Kernels
  • Lessons
  • Multi-class classification

3
Can be done explicitly (generate expressive
features) or implicitly (use kernels).
Embedding
The new discriminator is functionally simpler.
4
Kernel Based Methods
  • A method to run Perceptron on a very large
    feature set, without incurring the cost of
    keeping a very large weight vector.
  • Computing the weight vector is done in the
    original space.
  • Notice this pertains only to efficiency.
  • Generalization is still relative to the real
    dimensionality.
  • This is the main trick in SVMs (the algorithm itself
    is different), although many applications actually
    use linear kernels.

5
Kernel Based Methods
  • Let I be the set {t1, t2, t3, ...} of monomials
    (conjunctions) over the feature space x1, x2, ..., xn.
  • Then we can write a linear function over this new
    feature space: f(x) = Σ_{t in I} w_t t(x).
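A minimal sketch (illustrative Python, not from the slides) of the explicit alternative: blow a Boolean example up into conjunction (monomial) features of bounded size. The name expand_monomials is an assumption made for this example.

    from itertools import combinations

    def expand_monomials(x, max_degree=2):
        """Explicitly map a Boolean vector x to monomial (conjunction)
        features of size <= max_degree, as {index-tuple: 0/1}."""
        n = len(x)
        features = {}
        for d in range(1, max_degree + 1):
            for idx in combinations(range(n), d):
                features[idx] = int(all(x[i] for i in idx))
        return features

    # Example: x = (1, 0, 1) -> x1, x3 and x1&x3 are the active monomials.
    print([k for k, v in expand_monomials((1, 0, 1)).items() if v == 1])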

6
Kernel Based Methods
  • Great increase in expressivity.
  • We can run Perceptron, Winnow, or Logistic regression,
    but the convergence bound may suffer exponential
    growth.
  • An exponential number of monomials are true in each
    example.
  • Also, we will have to keep many weights.

7
The Kernel Trick (1)
  • Consider the value of w used in the prediction.
  • Each previous mistake on example z makes an additive
    contribution of +/-1 to the coordinate w_t of w,
    iff t(z) = 1.
  • The value of each coordinate w_t is determined by the
    number of mistakes on which t(·) was satisfied.

8
The Kernel Trick (2)
  • P: the set of examples on which we Promoted.
  • D: the set of examples on which we Demoted.
  • M = P ∪ D.

9
The Kernel Trick (3)
  • P: the set of examples on which we Promoted.
  • D: the set of examples on which we Demoted.
  • M = P ∪ D.
  • Here S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D.
    Reordering the prediction sum:
    Σ_{t in I} w_t t(x) = Σ_{z in M} S(z) Σ_{t in I} t(x) t(z).

10
The Kernel Trick (4)
  • S(y) = 1 if y ∈ P and S(y) = -1 if y ∈ D.
  • A mistake on z contributes the value +/-1 to all
    monomials satisfied by z. The total contribution
    of z to the sum is equal to the number of
    monomials that satisfy both x and z.
  • Define a dot product in the t-space:
    K(x, z) = Σ_{t in I} t(x) t(z).
  • We get the standard notation:
    f(x) = Σ_{z in M} S(z) K(x, z).

11
Kernel Based Methods
  • What does this representation give us?
  • We can view this kernel as the distance between x and z,
    measured in the t-space.
  • But K(x, z) can be computed in the original
    space, without explicitly writing the
    t-representation of x and z.

12
Kernel Based Methods
  • Consider the space of all 3^n monomials (allowing
    both positive and negative literals).
  • Then, if same(x, z) is the number of features that have
    the same value for both x and z, we get
    K(x, z) = 2^same(x, z).
  • Example: take n = 2, x = (0, 0), z = (0, 1). Then
    same(x, z) = 1 and K(x, z) = 2 (the only monomials
    satisfied by both are the empty monomial and ¬x1).
  • Proof: let k = same(x, z). For each of these k features,
    choose to (1) include the literal with the right
    polarity in the monomial, or (2) not include it at all.
  • Other kernels can be used.
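To sanity-check the identity K(x, z) = 2^same(x, z), here is a brute-force enumeration over all 3^n monomials (illustrative Python, not from the slides):

    from itertools import product

    def monomial_kernel(x, z):
        """Count monomials over positive/negative literals that are
        satisfied by both Boolean vectors x and z."""
        n = len(x)
        count = 0
        # Each variable is required true (+1), required false (-1), or absent (0).
        for pattern in product((+1, -1, 0), repeat=n):
            satisfies = lambda v: all(
                p == 0 or (p == +1 and v[i] == 1) or (p == -1 and v[i] == 0)
                for i, p in enumerate(pattern))
            if satisfies(x) and satisfies(z):
                count += 1
        return count

    x, z = (0, 0), (0, 1)
    same = sum(a == b for a, b in zip(x, z))
    print(monomial_kernel(x, z), 2 ** same)   # prints: 2 2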

13
Implementation
  • Simply run Perceptron in an on-line mode, but
    keep track of the set M.
  • Keeping the set M allows us to keep track of S(z).
  • Rather than remembering the weight vector w,
    remember the set M (P and D): all those
    examples on which we made mistakes
    (a minimal sketch follows below).

Dual Representation
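A minimal sketch (assumed, not from the slides) of this dual representation: the learner stores only the mistake set M with the signs S(z) and predicts through the kernel.

    def kernel_perceptron_train(examples, labels, kernel, epochs=5):
        """Dual (kernel) Perceptron: store mistakes (z, S(z)) instead of w."""
        mistakes = []  # the set M, as (example, +1/-1) pairs
        for _ in range(epochs):
            for x, y in zip(examples, labels):
                score = sum(s * kernel(x, z) for z, s in mistakes)
                if (1 if score > 0 else -1) != y:
                    mistakes.append((x, y))  # promote (+1) or demote (-1)
        return mistakes

    def kernel_perceptron_predict(mistakes, kernel, x):
        return 1 if sum(s * kernel(x, z) for z, s in mistakes) > 0 else -1

    # Using the monomial kernel K(x, z) = 2**same(x, z) from above,
    # learn the OR function over two Boolean variables:
    K = lambda x, z: 2 ** sum(a == b for a, b in zip(x, z))
    M = kernel_perceptron_train([(0, 0), (0, 1), (1, 0), (1, 1)],
                                [-1, 1, 1, 1], K)
    print(kernel_perceptron_predict(M, K, (0, 0)))   # -1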
14
Summary: Kernel Based Methods I
  • A method to run Perceptron on a very large
    feature set, without incurring the cost of
    keeping a very large weight vector.
  • Computing the weight vector can still be done in
    the original feature space.
  • Notice this pertains only to efficiency: the
    classifier is identical to the one you get by
    blowing up the feature space.
  • Generalization is still relative to the real
    dimensionality.
  • This is the main trick in SVMs (the algorithm itself
    is different), although most applications actually
    use linear kernels.

15
Summary: Kernel Trick
  • Separating hyperplanes (produced by Perceptron,
    SVM) can be computed in terms of dot products
    over a feature based representation of examples.
  • We want to define a dot product in a high
    dimensional space.
  • Given two examples x = (x1, x2, ..., xn) and
    y = (y1, y2, ..., yn), we want to map them to a high
    dimensional space (example: quadratic):
    Φ(x1, x2, ..., xn) = (x1, ..., xn, x1^2, ..., xn^2,
    x1x2, ..., xn-1xn)
    Φ(y1, y2, ..., yn) = (y1, ..., yn, y1^2, ..., yn^2,
    y1y2, ..., yn-1yn)
  • Computing the dot product A = Φ(x) · Φ(y) directly
    takes O(n^2) time.
  • Instead, in the original space, compute
    B = f(x · y) = (1 + (x1, x2, ..., xn) · (y1, y2, ..., yn))^2.
  • Theorem: A = B.
  • The coefficients do not really matter; this can be done
    for other functions.
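A small numerical check of this equivalence (illustrative Python; the explicit map below uses the standard sqrt(2) scaling so the two sides match exactly, which is the "coefficients do not matter" point):

    import itertools, math, random

    def phi(x):
        """Explicit quadratic feature map, scaled so that
        phi(x).phi(y) == (1 + x.y)**2 exactly."""
        n = len(x)
        feats = [1.0]
        feats += [math.sqrt(2) * xi for xi in x]
        feats += [xi * xi for xi in x]
        feats += [math.sqrt(2) * x[i] * x[j]
                  for i, j in itertools.combinations(range(n), 2)]
        return feats

    def quad_kernel(x, y):
        return (1 + sum(a * b for a, b in zip(x, y))) ** 2

    x = [random.random() for _ in range(5)]
    y = [random.random() for _ in range(5)]
    A = sum(a * b for a, b in zip(phi(x), phi(y)))   # O(n^2) features
    B = quad_kernel(x, y)                            # O(n) work
    print(abs(A - B) < 1e-9)                         # True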

16
Efficiency-Generalization Tradeoff
  • There is a tradeoff between the computational
    efficiency with which these kernels can be
    computed and the generalization ability of the
    classifier.
  • For example, using such kernels the Perceptron
    algorithm can make an exponential number of
    mistakes even when learning simple functions.
  • In addition, computing with kernels depends
    strongly on the number of examples. It turns out
    that sometimes working in the blown up space is
    more efficient than using kernels.
  • Next: more complicated kernels.

17
Structured Input
[NP Which type] [PP of [NP submarine]] [VP was bought]
[ADVP recently] [PP by [NP South Korea]] ?
  • S: John will join the board as a director.

Knowledge Representation
18
Learning From Structured Input
  • We want to extract features from structured
    domain elements
  • their internal (hierarchical) structure should be
    encoded.
  • A feature is a mapping from the instance space
    to {0, 1} or [0, 1].
  • With an appropriate representation language it is
    possible to represent expressive features that
    constitute an infinite dimensional space (FEX).
  • Learning can be done in the infinite attribute
    domain.
  • What does it mean to extract features?
  • Conceptually: different data instantiations may
    be abstracted to yield the same representation
    (quantified elements).
  • Computationally: some kind of graph matching
    process.
  • Challenge
  • Provide the expressivity necessary to deal with
    large scale and highly structured domains
  • Meet the strong tractability requirements for
    these tasks.

19
Example
  • Only those descriptions that are ACTIVE in the
    input are listed
  • Michael Collins developed kernels over parse
    trees.
  • Cumby/Roth developed parameterized kernels over
    structures.
  • When is it better to use a kernel vs. the primal
    (explicit) representation?

D = (AND word (before tag))
Explicit features
20
Overview & Goals (Cumby & Roth 2003)
  • Applying kernel learning methods to structured
    domains.
  • Develop a unified formalism for structured
    kernels (Collins & Duffy, Gaertner & Lloyd,
    Haussler).
  • A flexible language that measures distance between
    structures with respect to a given substructure.
  • Examine complexity & generalization across
    different feature sets and learners.
  • When does each type of feature set perform better,
    and with what learners?
  • Exemplify with experiments from bioinformatics &
    NLP: Mutagenesis, Named-Entity prediction.

21
Feature Description Logic
  • A flexible knowledge representation for feature
    extraction from structured data.
  • Domain elements are represented as labeled graphs
    (concept graphs) that correspond to FDL
    expressions.
  • FDL is formed from an alphabet of
    attribute, value, and role symbols.
  • Well defined syntax and equivalent semantics.
  • E.g., descriptions are defined inductively with
    sensors as primitives:
  • Sensor: a basic description; a term of the form
    a(v), or a.
    a is an attribute symbol, v a value symbol (ground
    sensor).
  • An existential sensor a describes an object that has
    some value for attribute a.
  • AND clauses, and (role D) clauses for relations
    between objects.
  • Expressive and efficient feature extraction (a toy
    illustration follows below).
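A toy sketch (illustrative Python, not the actual FDL/FEX implementation) of a labeled concept graph and the two kinds of sensors just described; the dictionary layout is an assumption made for this example.

    # A labeled graph: nodes carry attribute labels and role-labeled edges.
    graph = {
        "n1": {"labels": {"word": "dog", "tag": "N"}, "roles": {"before": ["n2"]}},
        "n2": {"labels": {"word": "ran", "tag": "V"}, "roles": {"before": ["n3"]}},
        "n3": {"labels": {"word": "fast", "tag": "ADV"}, "roles": {}},
    }

    def ground_sensor(node, attr, value):
        """Sensor a(v): the node has value v for attribute a."""
        return graph[node]["labels"].get(attr) == value

    def existential_sensor(node, attr):
        """Existential sensor a: the node has some value for attribute a."""
        return attr in graph[node]["labels"]

    print(ground_sensor("n1", "tag", "N"), existential_sensor("n3", "word"))  # True True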

Knowledge Representation
22
Example (Cont.)
  • Features: Feature Generation Functions,
    extensions & subsumption
    (see paper).
  • Basically:
  • Only those descriptions that are ACTIVE in the
    input are listed
  • The language is expressive enough to generate
    linguistically interesting features such as
    agreements, etc.

D (AND word (before tag))
D? (AND word(the) (before tag(N)), (AND
word(dog) (before tag(V)), (AND word(ran) (before
tag(ADV)), (AND word(very) (before tag(ADJ))
Explicit features
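An illustrative sketch (Python; not the FEX tool itself) of how such explicit (AND word (before tag)) features could be generated from a tagged sentence; the tag names are assumptions made for this example.

    def and_word_before_tag(tagged_sentence):
        """Features of the form (AND word(w) (before tag(t))): each word
        paired with the tag of the word that follows it."""
        return [f"(AND word({w}) (before tag({next_tag})))"
                for (w, _), (_, next_tag) in zip(tagged_sentence, tagged_sentence[1:])]

    sent = [("the", "DET"), ("dog", "N"), ("ran", "V"), ("very", "ADV"), ("fast", "ADJ")]
    print(and_word_before_tag(sent))
    # ['(AND word(the) (before tag(N)))', '(AND word(dog) (before tag(V)))', ...]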
23
Kernels
  • It is possible to define FDL based kernels for
    structured data.
  • When using linear classifiers it is important to
    enhance the set of features to gain expressivity.
  • A common way: blow up the feature space by
    generating functions of primitive features.
  • For some algorithms (SVM, Perceptron), kernel
    functions can be used to expand the feature space
    while still working in the original space.
  • Is it worth doing in structured domains?
  • Answers are not clear so far:
  • Computationally: yes, when we simulate a huge
    space.
  • Generalization: not always [Khardon, Roth &
    Servedio, NIPS'01; Ben-David et al.].

Kernels
24
Kernels in Structured Domains
  • We define a kernel family K parameterized by FDL
    descriptions.
  • The definition is recursive on the structure of D:
    sensor, existential sensor, role
    description, AND.
  • Key: many previous structured kernels considered
    all substructures (e.g., Collins & Duffy '02, Tree
    Kernels).
  • This is analogous to an exponential feature
    space and risks overfitting.

Kernels
25
FDL Kernel Definition
  • Kernel family K parameterized by feature type
    descriptions. For a description D and nodes n1, n2:
  • If D is a ground sensor s(v):
    K_D(n1, n2) = 1 if s(v) is a label of both n1 and n2,
    and 0 otherwise.
  • If D is an existential sensor s, and the sensor
    descriptions s(v1), s(v2), ..., s(vj) are labels of
    both n1 and n2: K_D(n1, n2) = j.
  • If D is a role description (r D'), then K_D(n1, n2)
    sums K_D'(n1', n2') over those nodes n1', n2' that
    have an r-labeled edge from n1, n2.
  • If D is a description (AND D1 D2 ... Dn) with li
    repetitions of any Di, then K_D(n1, n2) combines the
    K_Di(n1, n2) multiplicatively, accounting for the li
    repetitions (see the paper for the exact form).
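A much-simplified sketch of this recursion (illustrative Python; the graph encoding is an assumption, and the AND case below is a plain product that ignores the repetition correction):

    def fdl_kernel(D, g1, n1, g2, n2):
        kind = D[0]
        if kind == "sensor":                 # ground sensor s(v)
            _, attr, value = D
            return int(g1[n1]["labels"].get(attr) == value ==
                       g2[n2]["labels"].get(attr))
        if kind == "exists":                 # existential sensor s
            _, attr = D
            v1, v2 = g1[n1]["labels"].get(attr), g2[n2]["labels"].get(attr)
            return int(v1 is not None and v1 == v2)   # shared values (here 0 or 1)
        if kind == "role":                   # (r D')
            _, r, D2 = D
            return sum(fdl_kernel(D2, g1, m1, g2, m2)
                       for m1 in g1[n1]["roles"].get(r, [])
                       for m2 in g2[n2]["roles"].get(r, []))
        if kind == "and":                    # (AND D1 ... Dn), simplified
            prod = 1
            for Di in D[1:]:
                prod *= fdl_kernel(Di, g1, n1, g2, n2)
            return prod

    g1 = {"a": {"labels": {"word": "dog", "tag": "N"}, "roles": {"before": ["b"]}},
          "b": {"labels": {"word": "ran", "tag": "V"}, "roles": {}}}
    g2 = {"x": {"labels": {"word": "dog", "tag": "N"}, "roles": {"before": ["y"]}},
          "y": {"labels": {"word": "barked", "tag": "V"}, "roles": {}}}
    D = ("and", ("exists", "word"), ("role", "before", ("sensor", "tag", "V")))
    print(fdl_kernel(D, g1, "a", g2, "x"))   # 1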

Kernels
26
Kernel Example
  • D = (AND word (before word))
  • G1: The dog ran very fast
  • G2: The dog ran quickly
  • The final output is 2, since there are 2 matching
    collocations ("The dog" and "dog ran").
  • Can simulate Boolean kernels, as seen in
    [Khardon, Roth et al.].
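A minimal version of this example (illustrative Python): for D = (AND word (before word)) the kernel reduces to counting (word, next-word) collocations shared by the two sentences.

    def word_before_word_kernel(s1, s2):
        """K for D = (AND word (before word)): count of (word, next-word)
        collocations occurring in both sentences."""
        bigrams = lambda s: [(a, b) for a, b in zip(s, s[1:])]
        b2 = bigrams(s2)
        return sum(b2.count(bg) for bg in bigrams(s1))

    G1 = "The dog ran very fast".split()
    G2 = "The dog ran quickly".split()
    print(word_before_word_kernel(G1, G2))   # 2: "The dog" and "dog ran"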

Kernels
27
Complexity & Generalization
  • How does this compare in complexity and generalization
    to other kernels for structured data?
  • For m examples, with average example size g and
    time t1 to evaluate the kernel, kernel Perceptron
    takes O(m^2 g^2 t1);
  • if extracting a feature explicitly takes t2,
    Perceptron takes O(m g t2);
  • most kernels that simulate a well defined
    feature space have t1 << t2.
  • By restricting the size of the expanded feature space
    we avoid overfitting; even SVM suffers under many
    irrelevant features (Weston).
  • Margin argument: the margin goes down when you have
    more features.
  • Given a linearly separable set of points S =
    {x1, ..., xm} ⊆ R^n with separator w ∈ R^n,
  • embed S into an n' > n dimensional space by adding
    zero-mean random noise e to the additional n' - n
    dimensions, s.t. w' = (w, 0) ∈ R^n' still
    separates S.
  • Now the margin is unchanged,
  • but the norm of the examples grows, so the margin
    relative to the data radius (and with it the mistake
    bound) gets worse.

Analysis
28
Experiments
  • The experiments serve as a comparison: our features
    with kernel Perceptron, normal Winnow, and all-subtrees
    expanded features.
  • Bioinformatics experiment in mutagenesis
    prediction:
  • 188 compounds with atom-bond data, binary
    prediction.
  • 10-fold cross validation with 12 training runs.
  • NLP experiment in classifying detected NEs:
  • 4700 training / 1500 test phrases from MUC-7
  • person, location, organization
  • Trained and tested with kernel Perceptron and Winnow
    (SNoW) classifiers, with the FDL kernel and the
    respective explicit features; also an all-subtrees
    kernel based on Collins & Duffy's work.
  • (Figure: Mutagenesis concept graph;
    features simulated with the all-subtrees kernel.)

29
Discussion
  • (Results table: microaveraged accuracy.)
  • We have a kernel that simulates the features obtained
    with FDL.
  • But the quadratic training time means it is cheaper to
    extract features and learn explicitly than to run
    kernel Perceptron.
  • SVM could take (slightly) even longer, but might
    perform better.
  • Still, restricted features might work better than
    larger spaces simulated by other kernels.
  • Can we improve on the benefits of useful features?
  • Compile examples together?
  • More sophisticated kernels than the matching kernel?
  • The kernel still provides a metric for similarity
    based approaches.

30
Conclusion
  • Kernels for learning from structured data are an
    interesting idea.
  • Different kernels may expand/restrict the
    hypothesis space in useful ways.
  • We need to know the benefits and hazards:
  • To justify these methods we must embed in a space
    much larger than the training set size.
  • This can decrease the margin.
  • Expressive knowledge representations can be used
    to create features explicitly or in implicit
    kernel-spaces.
  • The data representation could allow us to plug in
    different base kernels to replace the matching
    kernel.
  • The parameterized kernel allows us to direct the way
    the feature space is blown up, to encode
    background knowledge.