Title: HLT, Data Sparsity and Semantic Tagging
1HLT, Data Sparsity and Semantic Tagging Louise
Guthrie (University of Sheffield) Roberto Basili
(University of Tor Vergata, Rome) Hamish
Cunningham (University of Sheffield)
2Outline
- A ubiquitous problem data sparsity
- The approach
- coarse-grained semantic tagging
- learning by combining multiple evidence
- The evaluation intrinsic and extrinsic measures
- The expected outcomes architectures, tools,
development support
3Applications
- PresentWeve seen growing interest in a range of
HLT tasks - e.g. IE, MT
- Trends
- Fully portable IE, unsupervised learning
- Content Extraction vs. IE
4Data Sparsity
- Language Processing depends on a model of the
features important to an application. - MT - Trigrams and frequencies
- Extraction - Word patterns
- New texts always seem to have lots of phenomena
we havent seen before
5Different kinds of patterns
- Person was appointed as post of company
- Company named person to post
- Almost all extraction systems tried to find
patterns of mixed words and entities. - People, Locations, Organizations, dates, times,
currencies
6Can we do more?
- Astronauts aboard the space shuttle Endeavor were
forced to dodge a derelict Air Force satellite
Friday - Humans aboard space_vehicle dodge satellite
timeref.
7Could we know these are the same?
- The IRA bombed a family owned shop in Belfast
yesterday. - FMLN set off a series of explosions in central
Bogota today. - ORGANIZATION ATTACKED LOCATION DATE
8Machine translation
- Ambiguity of words often means that a word can
translate several ways. - Would knowing the semantic class of a word, help
us to know the translation?
9Sometimes . . .
- Crane the bird vs crane the machine
- Bat the animal vs bat for cricket and baseball
- Seal on a letter vs the animal
10SO ..
- P(translation(crane) grulla animal) gt
- P(translation(crane)
grulla) - P(translation(crane) grua machine) gt
- P(translation(crane)
grua) - Can we show the overall effect lowers entropy?
11Language Modeling Data Sparseness again ..
- We need to estimate Pr (w3 w1 w2)
- If we have never seen w1w2 w3 before
- Can we instead develop a model and estimate Pr
(w3 C1 C2) or Pr (C3 C1 C2)
12A Semantic Tagging technology. How?
- We will exploit similarity with NE tagging, ...
- Development of pattern matching rules as
incremental wrapper induction - ... with semantic (sense) disambiguation
- Use as much evidence as possible
- Exploit existing resources like MRD or LKBs
- ... and with machine learning tasks
- Generalize from positive examples in training data
13Multiple Sources of Evidence
- Lexical information (priming effects)
- Distributional information from general and
training texts - Syntactic features
- SVO patterns or Adjectival modifiers
- Semantic features
- Structural information in LKBs
- (LKB-based) similarity measures
14Machine Learning for ST
- Similarity estimation
- among contexts (texts overlaps, )
- among lexical items wrt MRD/LKBs
- We will experiment
- Decision tree learning (e.g. C4.5)
- Support Vector Machines (e.g. SVM light)
- Memory-based Learning (TiMBL)
- Bayesian learning
15Whats New?
- Granularity
- Semantic categories are coarser than word senses
(cfr. homograph level in MRD) - Integration of existing ML methods
- Pattern induction is combined with probabilistic
description of word semantic classes - Co-training
- Annotated data are used to drive the sampling of
further evidence from unannotated material
(active learning)
16- How we know what weve done measurement, the
corpus - Hand-annotated corpus
- from the BNC, 100-million word balanced corpus
- 1 million words annotated
- a little under ½ million categorised noun
phrases - Extrinsic evaluationPerplexity of lexical choice
in Machine Translation - Intrinsic evaluationStandard measures or
precision, recall, false positives - (baseline tag with most common category 33)
17Ambiguity levels in the training data
NPs by semantic categories 0 104824 23.1 1 1192
28 26.3 2 96852 21.4 3 44385 9.8 4 35671 7.9 5
15499 3.4 6 13555 3.0 7 7635 1.7 8 6000 1.3 9
2191 0.5 10 3920 0.9 11 1028 0.2 12 606 0.1 1
3 183 0.0 14 450 0.1 15 919 0.2 17 414 0.1
Total NPs (interim) 453360
18- Maximising project outputssoftware
infrastructure for HLT - Three outputs from the project
- 1. A new resource
- Automatical annotation of the whole corpus
- Experimental evidence re. 1.- how accurate the
final results are- how accurate the various
methods employed are - Component tools for doing 1., based on GATE(a
General Architecture for Text Engineering)
19-
- What is GATE?
- An architectureA macro-level organisational
picture for LE software systems. - A frameworkFor programmers, GATE is an
object-oriented class library that implements the
architecture. - A development environmentFor language
engineers, computational linguists et al, GATE is
a graphical development environment bundled with
a set of tools for doing e.g. Information
Extraction. - Some free components... ...and wrappers for
other people's components - Tools for evaluation visualise/edit
persistence IR IE dialogue ontologies etc. - Free software (LGPL). Download at
http//gate.ac.uk/download/
20-
- Where did GATE come from?
- A number of researchers realised in the early-
mid-1990s (e.g. in TIPSTER) - Increasing trend towards multi-site collaborative
projects - Role of engineering in scalable, reusable, and
portable HLT solutions - Support for large data, in multiple media,
languages, formats, and locations - Lower the cost of creation of new language
processing components - Promote quantitative evaluation metrics via tools
and a level playing field - History
- 1996 2002 GATE version 1, proof of concept
- March 2002 version 2, rewritten in Java,
component based, LGPL, more users - Fall 2003 new development cycle
21- Role of GATE in the
project - Productivity- reuse some baseline components for
simple tasks- development environment support
for implementors (MATLAB for HLT?)- reduce
integration overhead (standard interfaces between
components)- system takes care of persistency,
visualisation, multilingual edit, ... - Quantification- tool support for metrics
generation - visualisation of key/response
differences- regression test tool for nightly
progress verification - Repeatability- open source supported,
maintained, documented software- cross-platform
(Linux, Windows, Solaris, others)- easy install
and proven useability (thousands of people,
hundreds of sites)- mobile code if you write in
Java web services otherwise