HLT, Data Sparsity and Semantic Tagging presentation

1
HLT, Data Sparsity and Semantic Tagging
Louise Guthrie (University of Sheffield)
Roberto Basili (University of Tor Vergata, Rome)
Hamish Cunningham (University of Sheffield)
2
Outline

A ubiquitous problem: data sparsity
The approach:
coarse-grained semantic tagging
learning by combining multiple sources of evidence
The evaluation: intrinsic and extrinsic measures
The expected outcomes: architectures, tools, development support

3
Applications

Present: we've seen growing interest in a range of HLT tasks (e.g. IE, MT)
Trends:
Fully portable IE, unsupervised learning
Content Extraction vs. IE

4
Data Sparsity

Language processing depends on a model of the features important to an application.
MT: trigrams and frequencies
Extraction: word patterns
New texts always seem to have lots of phenomena we haven't seen before

5
Different kinds of patterns

Person was appointed as post of company
Company named person to post
Almost all extraction systems tried to find patterns of mixed words and entities.
Entities: people, locations, organizations, dates, times, currencies

6
Can we do more?

Astronauts aboard the space shuttle Endeavor were
forced to dodge a derelict Air Force satellite
Friday
Humans aboard space_vehicle dodge satellite
timeref.

7
Could we know these are the same?

The IRA bombed a family owned shop in Belfast
yesterday.
FMLN set off a series of explosions in central
Bogota today.
ORGANIZATION ATTACKED LOCATION DATE
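
A minimal sketch of how the two sentences above could collapse to the same coarse pattern. The lexicon, its entries and the function name coarse_pattern are hypothetical and exist only for this illustration; a real tagger would rely on learned evidence rather than a hand-written word list.

```python
# Toy lexicon mapping surface words to coarse semantic categories.
# Every entry is a hypothetical example, not a resource from the project.
LEXICON = {
    "ira": "ORGANIZATION",
    "fmln": "ORGANIZATION",
    "bombed": "ATTACKED",
    "explosions": "ATTACKED",
    "belfast": "LOCATION",
    "bogota": "LOCATION",
    "yesterday": "DATE",
    "today": "DATE",
}


def coarse_pattern(sentence):
    """Return the sequence of coarse categories found in a sentence.

    Words missing from the lexicon are dropped, and adjacent repeats of
    the same category are collapsed into one.
    """
    pattern = []
    for token in sentence.lower().replace(".", " ").split():
        category = LEXICON.get(token)
        if category and (not pattern or pattern[-1] != category):
            pattern.append(category)
    return pattern


s1 = "The IRA bombed a family owned shop in Belfast yesterday."
s2 = "FMLN set off a series of explosions in central Bogota today."
print(coarse_pattern(s1))
print(coarse_pattern(s2))
print(coarse_pattern(s1) == coarse_pattern(s2))
```

Once both sentences are reduced to category sequences, a single extraction pattern covers them even though they share almost no words.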

8
Machine translation

Ambiguity of words often means that a word can be translated in several ways.
Would knowing the semantic class of a word help us to know the translation?

9
Sometimes . . .

Crane the bird vs crane the machine
Bat the animal vs bat for cricket and baseball
Seal on a letter vs the animal

10
So ...

P(translation(crane) = grulla | animal) > P(translation(crane) = grulla)
P(translation(crane) = grua | machine) > P(translation(crane) = grua)
Can we show that the overall effect lowers entropy?
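
One way to make the entropy question concrete is to compare the entropy of the translation choice with and without knowledge of the semantic class. The sketch below does this on invented counts; the observations, the class names and the function names are all assumptions for illustration, not project data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (semantic class, Spanish translation) observations for "crane".
OBSERVATIONS = (
    [("animal", "grulla")] * 18 + [("animal", "grua")] * 2
    + [("machine", "grua")] * 27 + [("machine", "grulla")] * 3
)


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)


def conditional_entropy(pairs):
    """H(T | C): entropy of the translation, averaged over semantic classes."""
    by_class = defaultdict(Counter)
    for sem_class, translation in pairs:
        by_class[sem_class][translation] += 1
    total = len(pairs)
    return sum((sum(c.values()) / total) * entropy(c) for c in by_class.values())


h_t = entropy(Counter(t for _, t in OBSERVATIONS))
h_t_given_c = conditional_entropy(OBSERVATIONS)
print(f"H(T)     = {h_t:.3f} bits")
print(f"H(T | C) = {h_t_given_c:.3f} bits")
```

On any single set of counts the conditional entropy cannot exceed the unconditional one; the empirical question is how large the reduction is when the semantic class comes from an automatic tagger and is evaluated on unseen text.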

11
Language Modeling: Data Sparseness again ...

We need to estimate Pr(w3 | w1 w2)
If we have never seen w1 w2 w3 before, can we instead develop a model and estimate Pr(w3 | C1 C2) or Pr(C3 | C1 C2)?
(Ci is the semantic category of wi)
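
A minimal sketch of the idea, assuming a hypothetical word-to-class map and a three-sentence toy corpus: the word trigram estimate is zero for an unseen history, while the class-conditioned estimate Pr(w3 | C1 C2) can still draw on other words of the same classes. The names WORD_CLASS, p_word and p_class are illustrative only.

```python
from collections import Counter

# Hypothetical word-to-class map and toy training sentences.
WORD_CLASS = {
    "astronauts": "HUMAN", "pilots": "HUMAN",
    "aboard": "PREP",
    "shuttle": "VEHICLE", "plane": "VEHICLE",
}
TRAINING = [
    ["astronauts", "aboard", "shuttle"],
    ["pilots", "aboard", "plane"],
    ["pilots", "aboard", "plane"],
]


def cls(word):
    """Coarse semantic class of a word, or OTHER if unknown."""
    return WORD_CLASS.get(word, "OTHER")


word_trigrams = Counter()   # counts of (w1, w2, w3)
word_history = Counter()    # counts of (w1, w2) as a trigram history
class_ctx_word = Counter()  # counts of (C1, C2, w3)
class_ctx = Counter()       # counts of (C1, C2) as a history

for sent in TRAINING:
    for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
        word_trigrams[(w1, w2, w3)] += 1
        word_history[(w1, w2)] += 1
        class_ctx_word[(cls(w1), cls(w2), w3)] += 1
        class_ctx[(cls(w1), cls(w2))] += 1


def p_word(w3, w1, w2):
    """Maximum-likelihood Pr(w3 | w1 w2); 0.0 when the history is unseen."""
    seen = word_history[(w1, w2)]
    return word_trigrams[(w1, w2, w3)] / seen if seen else 0.0


def p_class(w3, w1, w2):
    """Maximum-likelihood Pr(w3 | C1 C2), using the classes of w1 and w2."""
    ctx = (cls(w1), cls(w2))
    seen = class_ctx[ctx]
    return class_ctx_word[ctx + (w3,)] / seen if seen else 0.0


# "astronauts aboard plane" never occurs in training.
print(p_word("plane", "astronauts", "aboard"))
print(p_class("plane", "astronauts", "aboard"))
```

The estimates here are unsmoothed; a practical model would interpolate or back off between the word and class distributions.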

12
A Semantic Tagging technology. How?

We will exploit similarity with NE tagging, ...
Development of pattern matching rules as
incremental wrapper induction
... with semantic (sense) disambiguation
Use as much evidence as possible
Exploit existing resources like machine-readable dictionaries (MRDs) or lexical knowledge bases (LKBs)
... and with machine learning tasks
Generalize from positive examples in training data

13
Multiple Sources of Evidence

Lexical information (priming effects)
Distributional information from general and
training texts
Syntactic features
SVO patterns or Adjectival modifiers
Semantic features
Structural information in LKBs
(LKB-based) similarity measures

14
Machine Learning for ST

Similarity estimation
among contexts (text overlaps, ...)
among lexical items wrt MRD/LKBs
We will experiment with
Decision tree learning (e.g. C4.5)
Support Vector Machines (e.g. SVM light)
Memory-based Learning (TiMBL)
Bayesian learning
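
As an illustration of the memory-based option, the sketch below stores training instances and labels a new one by an overlap-weighted vote among its nearest neighbours. It is a plain-Python stand-in and does not use TiMBL; the feature triples (head noun, adjectival modifier, governing verb), the category names and the training instances are all hypothetical.

```python
from collections import Counter

# Hypothetical training instances: (features, coarse semantic category).
# Features are (head noun, adjectival modifier, governing verb).
TRAINING = [
    (("crane", "grey", "fly"), "ANIMAL"),
    (("crane", "tall", "lift"), "MACHINE"),
    (("bat", "small", "fly"), "ANIMAL"),
    (("bat", "wooden", "swing"), "ARTEFACT"),
    (("seal", "grey", "swim"), "ANIMAL"),
    (("seal", "wax", "break"), "ARTEFACT"),
]


def overlap(a, b):
    """Number of feature positions where two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def classify(features, k=3):
    """Label an instance by an overlap-weighted vote of its k nearest neighbours."""
    ranked = sorted(
        TRAINING, key=lambda item: overlap(features, item[0]), reverse=True
    )
    votes = Counter()
    for neighbour, label in ranked[:k]:
        votes[label] += overlap(features, neighbour)
    return votes.most_common(1)[0][0]


print(classify(("crane", "white", "fly")))
print(classify(("seal", "wax", "stamp")))
```

The overlap count plays the role of the similarity estimate among contexts; a similarity measure derived from an MRD or LKB could be substituted for it without changing the voting step.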

15
What's New?

Granularity
Semantic categories are coarser than word senses (cf. homograph level in MRD)
Integration of existing ML methods
Pattern induction is combined with probabilistic
description of word semantic classes
Co-training
Annotated data are used to drive the sampling of
further evidence from unannotated material
(active learning)

How we know what we've done: measurement, the corpus
Hand-annotated corpus
from the BNC, a 100-million-word balanced corpus
1 million words annotated
a little under ½ million categorised noun phrases
Extrinsic evaluation: perplexity of lexical choice in Machine Translation
Intrinsic evaluation: standard measures of precision, recall, false positives
(baseline: tag with most common category, 33%)
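
A minimal sketch of the intrinsic measures, computed per category from a key (gold) and a response (system) tagging. The six labels are invented for illustration, and for brevity the most-common-category baseline is computed on the key itself, whereas in practice the most common category would be taken from training data.

```python
from collections import Counter

# Hypothetical key (gold) and response (system) categories for six noun phrases.
KEY = ["HUMAN", "LOCATION", "HUMAN", "ORGANIZATION", "HUMAN", "LOCATION"]
RESPONSE = ["HUMAN", "HUMAN", "HUMAN", "ORGANIZATION", "LOCATION", "LOCATION"]


def precision_recall(key, response, category):
    """Return (precision, recall, false positives) for one category."""
    pairs = list(zip(key, response))
    tp = sum(k == category and r == category for k, r in pairs)
    fp = sum(k != category and r == category for k, r in pairs)
    fn = sum(k == category and r != category for k, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp


def baseline_accuracy(key):
    """Accuracy of always assigning the most common category."""
    most_common_count = Counter(key).most_common(1)[0][1]
    return most_common_count / len(key)


for category in sorted(set(KEY)):
    p, r, fp = precision_recall(KEY, RESPONSE, category)
    print(f"{category}: precision={p:.2f} recall={r:.2f} false positives={fp}")
print(f"baseline accuracy: {baseline_accuracy(KEY):.2f}")
```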

17
Ambiguity levels in the training data

NPs by semantic categories

Categories      NPs      %
0            104824   23.1
1            119228   26.3
2             96852   21.4
3             44385    9.8
4             35671    7.9
5             15499    3.4
6             13555    3.0
7              7635    1.7
8              6000    1.3
9              2191    0.5
10             3920    0.9
11             1028    0.2
12              606    0.1
13              183    0.0
14              450    0.1
15              919    0.2
17              414    0.1
Total NPs (interim): 453360
18

Maximising project outputs: software infrastructure for HLT
Three outputs from the project:
1. A new resource: automatic annotation of the whole corpus
2. Experimental evidence re. 1.:
- how accurate the final results are
- how accurate the various methods employed are
3. Component tools for doing 1., based on GATE (a General Architecture for Text Engineering)

What is GATE?
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
Some free components... ...and wrappers for other people's components
Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL). Download at http://gate.ac.uk/download/

Where did GATE come from?
A number of researchers realised in the early to mid-1990s (e.g. in TIPSTER):
Increasing trend towards multi-site collaborative projects
Role of engineering in scalable, reusable, and portable HLT solutions
Support for large data, in multiple media, languages, formats, and locations
Lower the cost of creation of new language processing components
Promote quantitative evaluation metrics via tools and a level playing field
History:
1996-2002: GATE version 1, proof of concept
March 2002: version 2, rewritten in Java, component based, LGPL, more users
Fall 2003: new development cycle

Role of GATE in the project
Productivity
- reuse some baseline components for simple tasks
- development environment support for implementors (MATLAB for HLT?)
- reduce integration overhead (standard interfaces between components)
- system takes care of persistence, visualisation, multilingual edit, ...
Quantification
- tool support for metrics generation
- visualisation of key/response differences
- regression test tool for nightly progress verification
Repeatability
- open source, supported, maintained, documented software
- cross-platform (Linux, Windows, Solaris, others)
- easy install and proven usability (thousands of people, hundreds of sites)
- mobile code if you write in Java; web services otherwise
