HLT, Data Sparsity and Semantic Tagging PowerPoint PPT Presentation

presentation player overlay
1 / 21
About This Presentation
Transcript and Presenter's Notes

Title: HLT, Data Sparsity and Semantic Tagging


1
HLT, Data Sparsity and Semantic Tagging Louise
Guthrie (University of Sheffield) Roberto Basili
(University of Tor Vergata, Rome) Hamish
Cunningham (University of Sheffield)
2
Outline
  • A ubiquitous problem data sparsity
  • The approach
  • coarse-grained semantic tagging
  • learning by combining multiple evidence
  • The evaluation intrinsic and extrinsic measures
  • The expected outcomes architectures, tools,
    development support

3
Applications
  • PresentWeve seen growing interest in a range of
    HLT tasks
  • e.g. IE, MT
  • Trends
  • Fully portable IE, unsupervised learning
  • Content Extraction vs. IE

4
Data Sparsity
  • Language Processing depends on a model of the
    features important to an application.
  • MT - Trigrams and frequencies
  • Extraction - Word patterns
  • New texts always seem to have lots of phenomena
    we havent seen before

5
Different kinds of patterns
  • Person was appointed as post of company
  • Company named person to post
  • Almost all extraction systems tried to find
    patterns of mixed words and entities.
  • People, Locations, Organizations, dates, times,
    currencies

6
Can we do more?
  • Astronauts aboard the space shuttle Endeavor were
    forced to dodge a derelict Air Force satellite
    Friday
  • Humans aboard space_vehicle dodge satellite
    timeref.

7
Could we know these are the same?
  • The IRA bombed a family owned shop in Belfast
    yesterday.
  • FMLN set off a series of explosions in central
    Bogota today.
  • ORGANIZATION ATTACKED LOCATION DATE

8
Machine translation
  • Ambiguity of words often means that a word can
    translate several ways.
  • Would knowing the semantic class of a word, help
    us to know the translation?

9
Sometimes . . .
  • Crane the bird vs crane the machine
  • Bat the animal vs bat for cricket and baseball
  • Seal on a letter vs the animal

10
SO ..
  • P(translation(crane) grulla animal) gt
  • P(translation(crane)
    grulla)
  • P(translation(crane) grua machine) gt
  • P(translation(crane)
    grua)
  • Can we show the overall effect lowers entropy?

11
Language Modeling Data Sparseness again ..
  • We need to estimate Pr (w3 w1 w2)
  • If we have never seen w1w2 w3 before
  • Can we instead develop a model and estimate Pr
    (w3 C1 C2) or Pr (C3 C1 C2)

12
A Semantic Tagging technology. How?
  • We will exploit similarity with NE tagging, ...
  • Development of pattern matching rules as
    incremental wrapper induction
  • ... with semantic (sense) disambiguation
  • Use as much evidence as possible
  • Exploit existing resources like MRD or LKBs
  • ... and with machine learning tasks
  • Generalize from positive examples in training data

13
Multiple Sources of Evidence
  • Lexical information (priming effects)
  • Distributional information from general and
    training texts
  • Syntactic features
  • SVO patterns or Adjectival modifiers
  • Semantic features
  • Structural information in LKBs
  • (LKB-based) similarity measures

14
Machine Learning for ST
  • Similarity estimation
  • among contexts (texts overlaps, )
  • among lexical items wrt MRD/LKBs
  • We will experiment
  • Decision tree learning (e.g. C4.5)
  • Support Vector Machines (e.g. SVM light)
  • Memory-based Learning (TiMBL)
  • Bayesian learning

15
Whats New?
  • Granularity
  • Semantic categories are coarser than word senses
    (cfr. homograph level in MRD)
  • Integration of existing ML methods
  • Pattern induction is combined with probabilistic
    description of word semantic classes
  • Co-training
  • Annotated data are used to drive the sampling of
    further evidence from unannotated material
    (active learning)

16
  • How we know what weve done measurement, the
    corpus
  • Hand-annotated corpus
  • from the BNC, 100-million word balanced corpus
  • 1 million words annotated
  • a little under ½ million categorised noun
    phrases
  • Extrinsic evaluationPerplexity of lexical choice
    in Machine Translation
  • Intrinsic evaluationStandard measures or
    precision, recall, false positives
  • (baseline tag with most common category 33)

17
Ambiguity levels in the training data
NPs by semantic categories 0 104824 23.1 1 1192
28 26.3 2 96852 21.4 3 44385 9.8 4 35671 7.9 5
15499 3.4 6 13555 3.0 7 7635 1.7 8 6000 1.3 9
2191 0.5 10 3920 0.9 11 1028 0.2 12 606 0.1 1
3 183 0.0 14 450 0.1 15 919 0.2 17 414 0.1
Total NPs (interim) 453360
18
  • Maximising project outputssoftware
    infrastructure for HLT
  • Three outputs from the project
  • 1. A new resource
  • Automatical annotation of the whole corpus
  • Experimental evidence re. 1.- how accurate the
    final results are- how accurate the various
    methods employed are
  • Component tools for doing 1., based on GATE(a
    General Architecture for Text Engineering)

19
  •                                                
                                                    
                               
  • What is GATE?
  • An architectureA macro-level organisational
    picture for LE software systems.
  • A frameworkFor programmers, GATE is an
    object-oriented class library that implements the
    architecture.
  • A development environmentFor language
    engineers, computational linguists et al, GATE is
    a graphical development environment bundled with
    a set of tools for doing e.g. Information
    Extraction.
  • Some free components... ...and wrappers for
    other people's components
  • Tools for evaluation visualise/edit
    persistence IR IE dialogue ontologies etc.
  • Free software (LGPL). Download at
    http//gate.ac.uk/download/

20
  •                                                
                                                    
                               
  • Where did GATE come from?
  • A number of researchers realised in the early-
    mid-1990s (e.g. in TIPSTER)
  • Increasing trend towards multi-site collaborative
    projects
  • Role of engineering in scalable, reusable, and
    portable HLT solutions
  • Support for large data, in multiple media,
    languages, formats, and locations
  • Lower the cost of creation of new language
    processing components
  • Promote quantitative evaluation metrics via tools
    and a level playing field
  • History
  • 1996 2002 GATE version 1, proof of concept
  • March 2002 version 2, rewritten in Java,
    component based, LGPL, more users
  • Fall 2003 new development cycle

21
  •                          Role of GATE in the
    project
  • Productivity- reuse some baseline components for
    simple tasks- development environment support
    for implementors (MATLAB for HLT?)- reduce
    integration overhead (standard interfaces between
    components)- system takes care of persistency,
    visualisation, multilingual edit, ...
  • Quantification- tool support for metrics
    generation - visualisation of key/response
    differences- regression test tool for nightly
    progress verification
  • Repeatability- open source supported,
    maintained, documented software- cross-platform
    (Linux, Windows, Solaris, others)- easy install
    and proven useability (thousands of people,
    hundreds of sites)- mobile code if you write in
    Java web services otherwise
Write a Comment
User Comments (0)
About PowerShow.com