Improving Subcategorization Acquisition using Word Sense Disambiguation

Slides: 19
Provided by: oana2

Transcript and Presenter's Notes


1
Improving Subcategorization Acquisition using
Word Sense Disambiguation
  • Anna Korhonen and Judita Preiss
  • University of Cambridge, Computer Laboratory
  • 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
  • Anna.Korhonen_at_cl.cam.ac.uk,
    Judita.Preiss_at_cl.cam.ac.uk

2
Outline
  • Subcategorization Acquisition
  • Baseline System
  • Baseline System combined with WSD
  • Probabilistic WSD
  • Experiment
  • Evaluation
  • Methods

3
Introduction
  • Subcategorization
  • The dependents of a verb are classified into
  • arguments: subject, object, direct
    object
  • - subject
  • - non-subject
    arguments (complements)
  • e.g. Mary knows that she is winning.
  • adjuncts
  • e.g. She read the book with great
    interest.
  • The types of complements that a verb permits
    determine the verb's classification
  • This verb classification is called
    subcategorization
  • SCFs: subcategorization frames for a
    given predicate; essential for parsing

4
Introduction
  • SCF: a particular set of arguments that a verb
    can appear with
  • Intransitive verb. NP(subject). They danced.
  • Transitive verb. NP(subject), NP(object). Mary
    appreciates her Professor.
  • Intransitive with PP. NP(subject), PP. He
    lives in Paris.
  • Transitive with PP. NP(subject), NP(object), PP.
    She put the book on the table.

5
Introduction
  • Manual versus automatic subcategorization acquisition
  • Manual - does not provide the relative frequency
    of SCFs
  • - predicates change behavior
  • Automatic - no lexical/semantic
    information is exploited
  • - reveals only syntactic aspects
  • - no distinction between predicate
    senses
  • Korhonen (2002): a model with back-off estimates
    based on the predominant sense of a verb
    (WordNet)
  • Acquisition goal: a domain-specific lexicon
    (written vs. spoken genre, based on
    different senses)

6
Subcategorization Acquisition
  • Baseline System
  • a system that uses knowledge of verb semantics
  • Levin (1993)
  • - verb senses divide into classes that are
    distinctive for subcategorization
  • Korhonen (2002)
  • - verb forms can be divided into
    semantic classes based on their
    predominant sense (fly - move)
  • - determine the sense and the
    semantic class (Levin classes: Motion
    verbs)
  • Briscoe & Carroll (1997): SCF distributions
    are acquired from corpus data

7
Subcategorization Acquisition
  • Baseline System description
  • Linear interpolation smoothing with back-off
    estimates is used for the SCF distribution
  • The method of obtaining back-off estimates:
  • a) 4-5 representative verbs are chosen
    from a verb class
  • b) for these verbs the SCF distribution is
    built by manual analysis of 300
    occurrences of each verb (BNC)
  • c) the resulting SCF distributions are
    merged, giving equal weight to each
    distribution
  • E.g. fly - move, slide, arrive,
    travel, sail
  • An empirical threshold is used to filter out
    noisy SCFs
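
The three steps on this slide (equal-weight merging, linear interpolation, threshold filtering) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the SCF labels, the toy probabilities, the interpolation weight `lam` and the threshold value are all hypothetical, since the slides give no concrete numbers.

```python
from collections import defaultdict


def merge_equal_weight(distributions):
    """Average several SCF distributions, giving each the same weight."""
    merged = defaultdict(float)
    for dist in distributions:
        for scf, p in dist.items():
            merged[scf] += p / len(distributions)
    return dict(merged)


def interpolate(observed, backoff, lam):
    """Linear interpolation: lam * observed + (1 - lam) * backoff."""
    scfs = set(observed) | set(backoff)
    return {
        s: lam * observed.get(s, 0.0) + (1 - lam) * backoff.get(s, 0.0)
        for s in scfs
    }


def filter_by_threshold(dist, threshold):
    """Keep only SCFs whose smoothed probability reaches the threshold."""
    return {s: p for s, p in dist.items() if p >= threshold}


# Hypothetical toy distributions for two class members (not BNC figures).
move = {"NP": 0.5, "NP_PP": 0.5}
slide = {"NP": 0.25, "NP_PP": 0.5, "NP_NP": 0.25}
backoff = merge_equal_weight([move, slide])

# Hypothetical corpus-derived distribution for the target verb "fly".
observed_fly = {"NP": 0.75, "NP_NP_PP": 0.25}
smoothed = interpolate(observed_fly, backoff, lam=0.5)
lexicon = filter_by_threshold(smoothed, threshold=0.1)
```

Note that this sketch does not renormalise the distribution after filtering; whether the original system does so is not stated on the slide.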

8
Subcategorization Acquisition
  • Combining with WSD
  • Preiss & Korhonen (2002)
  • - created separate corpus datasets for the
    senses being disambiguated (first and/or
    second) and further datasets for the
    remaining senses
  • - SCFs were acquired from both types of
    datasets
  • - back-off estimates were used for the SCFs
    acquired from the initial dataset; the estimates
    were used for smoothing according to the
    relevant sense
  • - the acquired SCF lexicons were merged at the
    end, so the SCF distribution was specific to a
    verb rather than to a sense
  • - problems with subcategorization acquisition:
    datasets too small; separating the data
    was unnecessary

9
Subcategorization Acquisition
  • New method
  • does not involve separating the data;
    it uses back-off estimates for the whole sense
    distribution given by the WSD system, not only
    for the predominant sense
  • $p_j(scf_i)$, $j = 1, \dots, nbo$ ($nbo$ = the number of
    back-off estimates)
  • - the probabilities of SCFs in the
    different back-off distributions
  • $P(scf_i) = \sum_{j=1}^{nbo} \lambda_j \, p_j(scf_i)$
  • $\lambda_j$ - weights for the different distributions
    that sum to 1; they are obtained from the
    probabilistic WSD system
  • Probabilistic WSD
  • - determines a probability distribution
    over senses for each noun, verb, adjective
    and adverb
  • - for each verb, determines a probability
    distribution over its senses and computes
    the average of it
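
The mixture of back-off distributions on this slide can be illustrated with a short sketch. It assumes that "the average" means the mean of the per-occurrence sense distributions of a verb, which the slide does not spell out; the sense names, SCF labels and probabilities are hypothetical.

```python
def average_sense_distribution(per_occurrence):
    """Average the sense distributions assigned to each occurrence of a verb."""
    n = len(per_occurrence)
    senses = {s for dist in per_occurrence for s in dist}
    return {s: sum(d.get(s, 0.0) for d in per_occurrence) / n for s in senses}


def sense_weighted_backoff(sense_weights, backoff_by_sense):
    """Mix per-sense back-off distributions, weighting each by its sense weight."""
    if abs(sum(sense_weights.values()) - 1.0) > 1e-9:
        raise ValueError("sense weights must sum to 1")
    mixed = {}
    for sense, weight in sense_weights.items():
        for scf, p in backoff_by_sense[sense].items():
            mixed[scf] = mixed.get(scf, 0.0) + weight * p
    return mixed


# Hypothetical WSD output for two occurrences of one verb.
occurrences = [
    {"motion": 0.75, "transport": 0.25},
    {"motion": 0.25, "transport": 0.75},
]
weights = average_sense_distribution(occurrences)

# Hypothetical back-off estimates, one distribution per sense.
backoffs = {
    "motion": {"NP": 0.5, "NP_PP": 0.5},
    "transport": {"NP_NP": 1.0},
}
p_scf = sense_weighted_backoff(weights, backoffs)
```

Because the weights sum to 1 and each back-off distribution sums to 1, the mixed result is again a probability distribution over SCFs.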

10
Subcategorization Acquisition
  • System Description
  • - it is based on the Stevenson and Wilks (2001)
    system, which combines knowledge sources to
    produce a WSD tool
  • - the probabilistic WSD system combines the
    probability distributions over senses produced
    by each module used (modules described in
    Yarowsky (2000), Mihalcea (2002), Pedersen (2002))
  • - each module's distribution is smoothed
    according to its confidence value; a module
    with low confidence is smoothed extensively
    towards the uniform distribution
  • - the optimal combination of modules is based
    on the accuracy (F-measure) for the English
    all-words task
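
One simple way to picture confidence-based smoothing is a linear blend between a module's sense distribution and the uniform distribution. The slides do not give the actual smoothing formula or the rule used to combine modules, so both functions below are assumptions made for illustration (a linear blend, followed by a plain average); the sense labels and confidence values are hypothetical.

```python
def smooth_towards_uniform(dist, confidence):
    """Blend a sense distribution with the uniform one.

    confidence = 1.0 leaves the distribution unchanged;
    confidence = 0.0 replaces it with the uniform distribution.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    uniform = 1.0 / len(dist)
    return {s: confidence * p + (1 - confidence) * uniform for s, p in dist.items()}


def combine_modules(smoothed_outputs):
    """Average the smoothed distributions (an assumed combination rule)."""
    senses = {s for dist in smoothed_outputs for s in dist}
    n = len(smoothed_outputs)
    return {s: sum(d.get(s, 0.0) for d in smoothed_outputs) / n for s in senses}


# Hypothetical outputs of two modules for one verb occurrence.
module_a = smooth_towards_uniform({"s1": 1.0, "s2": 0.0}, confidence=0.5)
module_b = smooth_towards_uniform({"s1": 0.25, "s2": 0.75}, confidence=1.0)
combined = combine_modules([module_a, module_b])
```

With this blend, an overconfident but unreliable module (here `module_a`) contributes a flatter distribution and so has less influence on the combined result.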

11
Subcategorization Acquisition
  • Experiment
  • Test Data
  • - polysemous verbs whose predominant
    sense is not very frequent; 29 verbs
    chosen randomly
  • - the WordNet senses of the chosen verbs
    were mapped to Levin-style senses
  • - the maximum number of Levin
    senses considered was 4, and some of the
    given senses were left out

12
Subcategorization Acquisition

13
Subcategorization Acquisition
  • Evaluation
  • Method
  • - 20 million words of the BNC were used, and
    all senses for the test verbs were extracted
  • - 1000 sentences for each verb were
    disambiguated with the probabilistic
    WSD
  • - the modified subcategorization
    system was applied
  • - for each verb an individual set of
    back-off estimates was built, based on
    the frequencies of the different senses
    in the corpus data
  • - results were evaluated against a manual
    analysis of the corpus data
  • - from an average of 300 occurrences of
    each verb in the BNC test data, 5-21
    gold standard SCFs were found (16 SCFs
    per verb)

14
Subcategorization Acquisition
  • Evaluation
  • Method
  • F-measure = 2PR / (P + R)
  • P: precision
  • R: recall
  • RC: Spearman rank correlation
  • KL: Kullback-Leibler distance
  • CE: cross entropy
  • - record the total number of SCFs
    missing from the distribution, to
    determine the accuracy of the back-off
    estimates
  • - comparison with other systems: the
    baseline and another which assumed no sense
    at all
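
The measures listed above can be sketched with the Python standard library. The slides do not state the logarithm base, the direction of the KL distance, or how ties are ranked, so this sketch assumes base 2, KL from gold to acquired, and no tie handling; the SCF labels and probabilities are hypothetical.

```python
import math


def f_measure(acquired, gold):
    """F = 2PR / (P + R) over sets of SCF types."""
    true_positives = len(acquired & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(acquired)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def kl_distance(p, q):
    """D(p || q) in bits; q must be non-zero wherever p is non-zero."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


def cross_entropy(p, q):
    """H(p, q) in bits; equals the entropy of p plus D(p || q)."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)


def spearman_rank_correlation(p, q):
    """Spearman correlation over SCFs present in both distributions (no ties)."""
    keys = sorted(set(p) & set(q))
    n = len(keys)
    if n < 2:
        raise ValueError("need at least two shared SCFs")

    def ranks(dist):
        ordered = sorted(keys, key=lambda k: dist[k], reverse=True)
        return {k: r for r, k in enumerate(ordered, start=1)}

    rp, rq = ranks(p), ranks(q)
    d_squared = sum((rp[k] - rq[k]) ** 2 for k in keys)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical gold-standard and acquired SCF distributions.
gold_dist = {"NP": 0.5, "NP_PP": 0.375, "NP_NP": 0.125}
acquired_dist = {"NP": 0.5, "NP_PP": 0.125, "NP_NP": 0.375}

f_score = f_measure({"NP", "NP_PP", "PP"}, {"NP", "NP_PP", "NP_NP", "S"})
rc = spearman_rank_correlation(gold_dist, acquired_dist)
kl = kl_distance(gold_dist, acquired_dist)
ce = cross_entropy(gold_dist, acquired_dist)
```

KL distance and cross entropy are undefined when the acquired distribution assigns zero probability to a gold SCF, which is one reason the number of missing SCFs is recorded separately.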

15
Subcategorization Acquisition
  • Results
  • - of a total of 175 gold standard SCFs unseen
    in the unsmoothed lexicon, 107 remain
    unseen after using the predominant sense method
  • - using the WSD method only 22 remain unseen
  • the performance improves with the number of
    senses
  • - the IS measure shows an intersection between
    the acquired and the gold standard SCFs
    when WSD is used

16
Subcategorization Acquisition

17
Subcategorization Acquisition
  • Results
  • - improvement for the highly polysemous verbs
    (bear, count, roar, etc.)
  • - verbs whose senses differ substantially in
    terms of subcategorization (conceive, continue,
    grasp, etc.)
  • - verbs whose sense involves mainly NP/PP
  • - SCFs seem to appear in the data as families
    for a sense of a verb
  • - worse performance for seek using WSD, even
    though it is highly polysemous and differs in
    terms of subcategorization
  • - no clear improvement: choose, compose,
    induce, watch

18
Subcategorization Acquisition
  • Conclusions
  • - using WSD, an improvement can be shown
    for SCF acquisition of difficult verbs, because
    the senses also differ in terms of
    subcategorization, not only in the degree of
    polysemy
  • Future work
  • - a better way of integrating the frequency of
    acquired senses into the SCFs, and a refinement of
    the subcategorization method