Number Sense Disambiguation

Number Sense Disambiguation. Stuart Moore. Supervised by: Anna Korhonen (Computer Lab) and Sabine Buchholz (Toshiba CRL).

Transcript and Presenter's Notes

1
Number Sense Disambiguation
  • Stuart Moore
  • Supervised by
  • Anna Korhonen (Computer Lab)
  • Sabine Buchholz (Toshiba CRL)

2
Number Sense Disambiguation
  • Similar to Word Sense Disambiguation
  • Seek to classify numbers into different senses
  • e.g. year, time, telephone number...

3
Applications
  • Speech Synthesis
  • 1990
  • nineteen-ninety
  • one thousand, nine hundred and ninety
  • 2015
  • two thousand and fifteen
  • eight fifteen p.m.
  • Information Retrieval
  • Parsing

4
Aim
  • To successfully classify numbers into sense
    categories
  • To use a semi-supervised method
  • Avoids the need for a large, human-annotated
    training set
  • Allows economical adaptation to different
    languages and domains

5
Differences with Word Sense Disambiguation
  • There are infinitely many numbers, so you will
    almost certainly come across 'digit strings' you
    have not seen in the training data.
  • Intuitively, the models for 2007 and 2008 should
    be similar
  • But the model for 5, or 2007.4, should be
    different
  • There is no resource equivalent to a dictionary,
    enumerating all possible senses of a number.

6
Previous System
  • The report Normalization of Non-Standard Words
    (Sproat et al., 2001) defines a taxonomy of 13
    'senses' for numbers
  • They annotated 4 corpora, the largest of which is
    a subsection of the North American News Text
    Corpus: newswire text from 1994-97
  • They used this to create a decision tree
    classifier
  • The main focus of the report was the performance
    when expanding abbreviations, and numbers are not
    examined in detail.

7
Number Sense Categories
(Counts are from the training data of the North
American News Text Corpus)
8
Overview of my system
  • Based on work by Yarowsky (1995) investigating
    decision lists for Word Sense Disambiguation
  • Takes a few annotated 'seed examples', together
    with a large, unannotated corpus.
  • Generates one model using the seed examples, and
    applies this to the unannotated corpus.
  • This is used as input to generate another model.
  • The process can be iterated many times
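
The loop described above can be sketched roughly as follows. This is a toy illustration, not the system from the talk: rules here are single features scored by raw count (the talk uses log likelihood), and the names `train_decision_list`, `classify` and `bootstrap` are made up for this example.

```python
from collections import Counter, defaultdict

def train_decision_list(labelled):
    """Toy model: one rule per feature that only ever co-occurs with one sense."""
    counts = defaultdict(Counter)
    for features, sense in labelled:
        for f in features:
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        sense, n = c.most_common(1)[0]
        if n == sum(c.values()):  # keep only unambiguous features
            rules.append((n, f, sense))
    # Most frequent rules first; feature name breaks ties deterministically.
    rules.sort(key=lambda r: (-r[0], r[1]))
    return [(f, sense) for _, f, sense in rules]

def classify(rules, features):
    """Return the sense of the first matching rule, or None if nothing matches."""
    for f, sense in rules:
        if f in features:
            return sense
    return None

def bootstrap(seeds, unlabelled, iterations=2):
    """Train on the seeds, label the corpus, retrain on the result, and repeat."""
    labelled = list(seeds)
    rules = []
    for _ in range(iterations):
        rules = train_decision_list(labelled)
        labelled = list(seeds)
        for features in unlabelled:
            sense = classify(rules, features)
            if sense is not None:
                labelled.append((features, sense))
    return rules

seeds = [({"in", "year"}, "YEAR"), ({"call", "phone"}, "TELEPHONE")]
unlabelled = [{"in", "since"}, {"call", "number"},
              {"since", "ago"}, {"number", "dial"}]
model = bootstrap(seeds, unlabelled, iterations=2)
print(classify(model, {"since", "ago"}))
```

In this toy corpus the seed model cannot label the example containing only "since" and "ago"; the second model can, because "since" was picked up from an example the first model labelled.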

9
Overview of my system
10
Features
  • The context of each number is examined for a list
    of features.
  • Local context: 5 tokens from the number
  • Punctuation, words, word stems, number features
  • Specific location (e.g. token following number)
  • Wider context: 15 tokens from the number
  • Words and word stems only
  • Bag of words (anywhere within the window)
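
A minimal sketch of the two context windows, assuming whitespace-tokenised input. The feature string format and the function name `extract_features` are invented for illustration; word stems and number features are left out to keep the example short.

```python
import re

def extract_features(tokens, i, local=5, wide=15):
    """Return a set of string features for the number at tokens[i]."""
    feats = set()
    # Local context: position-specific features, e.g. "token following number".
    for offset in range(-local, local + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(tokens):
            continue
        tok = tokens[j].lower()
        kind = "punct" if re.fullmatch(r"\W+", tok) else "word"
        feats.add(f"{kind}@{offset:+d}={tok}")
    # Wider context: bag of words, position within the window is ignored.
    lo, hi = max(0, i - wide), min(len(tokens), i + wide + 1)
    for j in range(lo, hi):
        if j != i and tokens[j].isalpha():
            feats.add(f"bag={tokens[j].lower()}")
    return feats

tokens = "He was born in 1990 , in London .".split()
feats = extract_features(tokens, 4)
print(sorted(feats))
```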

11
Rules
  • Each rule is conditional on the presence of one
    or two features
  • Consider all possible combinations of features
    that occur together at least five times in the
    training corpus.
  • Based on Yarowsky's rules, but more powerful
  • He had 'bag of words' rules, and some rules
    combining two words in the local area
  • He did not have any specific numeric or
    punctuation features.
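
Candidate rule generation could look something like the sketch below. It assumes each training example has already been reduced to a set of feature strings, and it applies the threshold of five to single features as well as to pairs; the talk only states the threshold for features that occur together, so that is an assumption. `candidate_rules` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

def candidate_rules(examples, min_count=5):
    """Return every one- or two-feature rule seen at least min_count times."""
    counts = Counter()
    for feats in examples:
        ordered = sorted(feats)  # fixed order so each pair has one canonical form
        for f in ordered:
            counts[(f,)] += 1
        for pair in combinations(ordered, 2):
            counts[pair] += 1
    return {rule for rule, n in counts.items() if n >= min_count}

examples = [{"a", "b"}] * 5 + [{"a", "c"}] * 4
rules = candidate_rules(examples)
print(sorted(rules))
```

Counting all pairs is quadratic in the number of features per example, which is one reason the talk later flags rules with more than two features as hard to handle.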

12
Ranking Rules
  • Follows Yarowsky (1995)
  • For each rule, count the number of examples for
    each number sense
  • Calculate the log likelihood
  • α is a parameter that can be varied to change the
    effect of negative examples on the model
  • Rank rules according to log likelihood
  • When classifying, use the first rule that matches
    the target sentence
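
The formula itself did not survive in this transcript, so the sketch below substitutes a common smoothed log-likelihood ratio in the spirit of Yarowsky (1995): log((n_sense + alpha) / (n_other + alpha)). How the talk's parameter actually enters the formula is an assumption here, as are the names `rank_rules` and `classify`.

```python
import math
from collections import Counter, defaultdict

def rank_rules(examples, alpha=0.1):
    """examples: list of (set_of_matching_rules, sense). Returns a sorted decision list."""
    counts = defaultdict(Counter)
    for rules, sense in examples:
        for r in rules:
            counts[r][sense] += 1
    ranked = []
    for r, c in counts.items():
        total = sum(c.values())
        for sense, n in c.items():
            # Assumed scoring: positives vs. all other senses, both smoothed by alpha.
            score = math.log((n + alpha) / (total - n + alpha))
            ranked.append((score, r, sense))
    ranked.sort(key=lambda t: (-t[0], str(t[1]), t[2]))
    return ranked

def classify(ranked, matching_rules, cutoff=0.0):
    """Use the first rule above the cut-off that matches; otherwise abstain."""
    for score, r, sense in ranked:
        if score < cutoff:
            break
        if r in matching_rules:
            return sense
    return None

examples = ([({"next=p.m."}, "TIME")] * 3
            + [({"prev=in"}, "YEAR")] * 4
            + [({"prev=in"}, "TIME")])
ranked = rank_rules(examples)
print(classify(ranked, {"prev=in"}))
```

Raising `cutoff` trades coverage for precision, which is the quantity plotted on the "Log Likelihood cut off" axis in the later slides.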

13
Performance as a fully supervised system
  • We applied the method to the entire training set,
    and investigated its performance on the training
    and test sets
  • This gives an idea of the 'upper bound' of
    performance of the system

14
Performance on training data
(Plot against log likelihood cut-off; value shown: 97.2%)
15
Performance on test data
(Plot against log likelihood cut-off; values shown: 81.2% and 66.0%)
16
Performance as a fully supervised system - Summary
  • Accuracy is 66.0% on test data
  • Using the most common number type for
    unclassified examples increases accuracy to 81.2%
  • The Sproat et al. system achieves an accuracy of
    97.6% on the same task
  • Uses decision trees instead of decision lists
  • Decision trees generally classify everything,
    which makes them less suitable for an iterative
    process.
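
The difference between the two accuracy figures comes from how unclassified examples are scored. A small sketch of that bookkeeping, with made-up labels and hypothetical helper names:

```python
from collections import Counter

def most_common_sense(training_labels):
    """The majority class in the training data, used as a fallback label."""
    return Counter(training_labels).most_common(1)[0][0]

def accuracy(predictions, gold, fallback=None):
    """Fraction correct; a None prediction is wrong unless a fallback is supplied."""
    filled = [p if p is not None else fallback for p in predictions]
    return sum(p == g for p, g in zip(filled, gold)) / len(gold)

gold = ["NUM", "NUM", "YEAR", "TIME"]
preds = ["NUM", None, "YEAR", None]  # None = no rule matched
fallback = most_common_sense(["NUM", "NUM", "YEAR"])
print(accuracy(preds, gold), accuracy(preds, gold, fallback))
```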

17
Performance as a fully supervised system - Summary
  • A large proportion of the test data,
    approximately 25%, was unclassified.
  • By adding in unlabelled data to the training set,
    we hope to increase coverage of the rules, and
    thereby boost accuracy
  • (experiment not yet performed)

18
Performance as a semi-supervised system
  • Concept
  • Provide a small number of seed examples, from
    which rules are extrapolated over various
    iterations.
  • Important to have high precision in the first
    iteration
  • (Recall can be low, as long as it's not too low)
  • Future iterations aim to improve recall

19
Performance as a semi-supervised system
  • After experimenting with a few different
    strategies for the first iteration, the following
    was found to perform best:
  • Rank all rules based on their scores from the
    seed examples
  • For each number type, take the three highest
    scoring rules (more if several had an equal
    score)
  • Apply these rules to the unlabelled data.
  • If a number is matched by rules from more than
    one number type, do not classify it
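
The first-iteration strategy above might be sketched like this, assuming rules have already been scored as (score, rule, sense) triples and that each rule is a single feature string. Both function names are invented for the example.

```python
from collections import defaultdict

def top_rules_per_sense(scored_rules, k=3):
    """Keep the k highest-scoring rules per sense, plus any tied with the k-th."""
    by_sense = defaultdict(list)
    for score, rule, sense in scored_rules:
        by_sense[sense].append((score, rule))
    selected = {}
    for sense, items in by_sense.items():
        items.sort(key=lambda t: -t[0])
        threshold = items[min(k, len(items)) - 1][0]
        selected[sense] = {rule for score, rule in items if score >= threshold}
    return selected

def label_first_iteration(selected, features):
    """Label only when the matching rules all come from a single number type."""
    senses = {s for s, rules in selected.items() if rules & features}
    return senses.pop() if len(senses) == 1 else None

scored = [(3.0, "a", "YEAR"), (2.0, "b", "YEAR"), (1.0, "c", "YEAR"),
          (1.0, "d", "YEAR"), (0.5, "e", "YEAR"), (2.0, "x", "TIME")]
selected = top_rules_per_sense(scored)
print(label_first_iteration(selected, {"a"}))
```

Abstaining on conflicts lowers recall but protects precision, which matches the stated priority for the first iteration.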

20
How many seed examples are needed?
  • Seed examples were randomly picked from the
    training data
  • Equal numbers of seed examples for each number
    type
  • Definite improvement seen up to 40 seed
    examples
  • Limited improvement after this point

Precision (% of those assigned where the category
is correct)
21
Performance of the second iteration: training
data
Peak: 84.84% (log likelihood > 5.0)
Baseline: 56.24%
Log likelihood cut-off
22
Performance of the second iteration: test data
Peak: 75.2% (log likelihood > 5.2)
Using the previous peak cut-off of 5.0 gives
74.93% accuracy
Log likelihood cut-off
23
Future Work
  • Error analysis of the data
  • More sophisticated features
  • Part of Speech tags, or a parser
  • More sophisticated rules
  • Try to allow more than two features per rule,
    without creating too many rules to be handled.
  • Different rule strategies
  • Closer to a decision tree
  • Other machine learning methods?

24
Future Work
  • Increase coverage
  • Investigate use of document-level features, using
    the method from Stevenson et al., 2008
  • Investigate different strategies for picking the
    seed examples
  • Distribute according to relative frequency of
    categories, rather than a set number per category
  • Investigate the effects of more unannotated data
  • Can use sections of the North American News
    Corpus that haven't been annotated.

25
Future Work
  • Consider modifying the number classes
  • Should some categories be combined?
  • Would moving the categories into a tree structure
    improve performance?
  • Are different classes needed for different
    domains (e.g. financial, biomedical) or
    languages?
  • Investigate corpus for consistency
  • A few inconsistent examples have been identified

27
Number Features
  • Does the number start with a leading zero
  • Is the number an integer
  • How many digits in the number
  • The real value of the number
  • The number rounded to one significant figure
  • So 1500 ≤ x < 2500 maps to 2000
  • The token with all digits removed
  • 1st becomes st, 70mph becomes mph
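
A rough sketch of how these number features could be computed for a single token. The dictionary keys and the function name `number_features` are made up for illustration, thousands separators such as "1,500" are not handled, and rounding uses half-up so that 1500 maps to 2000 and 2500 maps to 3000, as the bounds above imply.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

def number_features(token):
    """Return a dict of number features for a token containing digits."""
    match = re.search(r"\d+(?:\.\d+)?", token)
    if match is None:
        return {}
    text = match.group()
    value = Decimal(text)
    # One significant figure: quantize to the place of the leading digit.
    quantum = Decimal(1).scaleb(value.adjusted())
    return {
        "leading_zero": re.match(r"0\d", text) is not None,
        "is_integer": "." not in text,
        "num_digits": sum(ch.isdigit() for ch in text),
        "value": value,
        "rounded_1sf": value.quantize(quantum, rounding=ROUND_HALF_UP),
        "residue": re.sub(r"\d", "", token),  # token with all digits removed
    }

print(number_features("70mph")["residue"])
print(number_features("1500")["rounded_1sf"] == 2000)
```

Python's built-in `round` uses round-half-to-even, which would send 2500 to 2000, so `Decimal` with `ROUND_HALF_UP` is used instead.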