Title: Semantic Annotation for Interlingual Representation of Multilingual Texts
1. Semantic Annotation for Interlingual Representation of Multilingual Texts
- Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell (NMSU), Nizar Habash (Columbia), Stephen Helmreich (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Owen Rambow (Columbia), Flo Reeder (MITRE), Advaith Siddharthan (Columbia)
- LREC 2004 Workshop "Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks"
2. (No transcript)
3. IAMTC (Interlingua Annotation of Multilingual Corpora) Project
- Goals
  - Develop MT / interlingua representations and test them by human annotation on texts from six languages (Japanese, Arabic, Korean, Spanish, French, English)
  - Develop annotation methodology
  - Develop semantic annotation tools
  - Design new metrics and evaluation for the interlingual representation
4. IAMTC Project
- Collaboration: New Mexico, Maryland, Columbia, MITRE, CMU, ISI
- Outcomes
  - IL design for a set of complex representational phenomena
  - Annotation methodology, manuals, tools, evaluations
  - Parallel texts annotated according to the IL, for training data
- Funding: NSF, 1 year
5. Theoretical goal: Getting at meaning
- K1E1: Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. The subscribers cannot switch again to another provider for the first 3 months, but they can cancel the switch in 14 days if they are not satisfied with services like voice quality.
- K1E2: Starting January 1st of next year, customers of SK Telecom can change their service company to LG Telecom or KTF. Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they can cancel the change anytime within 14 days if problems such as poor call quality are experienced.
- Additional/less information
6. Corpus and Data
- Initial corpus
  - 10 texts in each language
  - 2 translations each into English
- Interlingua designed for MT
- Multiple English translations of the same source show translation divergences. Some phenomena:
  - Lexical level: word changes
  - Syntactic level: phrasing, thematization, nominalization
  - Semantic level: additional/different content
  - Discourse level: multi-clause structure, anaphora
  - Pragmatic level: speech acts, implicatures, style, interpersonal
- Causes of divergence
  - Genuine ambiguity/vagueness of source meaning
  - Translator error/reinterpretation
7. IL Development: Staged, deepening
- IL0: simple dependency tree gives structure
- IL1: semantic annotations for nouns, verbs, adjectives, adverbs, and theta roles
  - Not yet fully semantic (buy vs. sell still distinct); many remaining simplifications
  - Concept senses from ISI's Omega ontology
  - Theta roles from Dorr's LCS work
  - Elaborate annotation manuals
  - Tiamat annotation interface
  - Post-annotation reconciliation process and interface
  - Evaluation scores: annotator agreement
- IL2: comes next
8. Details of IL0
- Deep syntactic dependency representation
  - Removes auxiliary verbs, determiners, and some function words
  - Normalizes passives, clefts, etc.
  - Includes syntactic roles (Subj, Obj)
- Construction
  1. Dependency parsed using Connexor (English)
  2. Hand-corrected
- Extensive manual and instructions on website
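The kind of deep-syntactic tree IL0 describes can be sketched with a tiny data structure. This is a hypothetical illustration, not the project's actual format: the class `IL0Node`, its method names, and the example fragment are ours.

```python
from dataclasses import dataclass, field

# A minimal sketch of an IL0-style node: a dependency tree over content
# words, with auxiliaries and determiners dropped and syntactic roles kept.
@dataclass
class IL0Node:
    word: str                   # content-word lemma
    role: str = "Root"          # syntactic role relative to the parent (Subj, Obj, ...)
    children: list = field(default_factory=list)

    def add(self, word, role):
        child = IL0Node(word, role)
        self.children.append(child)
        return child

# Fragment of "The study led them ...": determiners removed, roles marked.
root = IL0Node("lead")
root.add("study", "Subj")
root.add("they", "Obj")

def show(node, depth=0):
    """Render the dependency tree, one node per line, indented by depth."""
    lines = ["  " * depth + f"{node.word} ({node.role})"]
    for c in node.children:
        lines.extend(show(c, depth + 1))
    return lines

print("\n".join(show(root)))
```

Running this prints `lead (Root)` with `study (Subj)` and `they (Obj)` indented beneath it, mirroring the hand-corrected parser output described above.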
9. Details of IL1
- Intermediate semantic representation
- Annotations performed manually, each annotator working alone
- Associate open-class lexical items with Omega ontology items
- Replace syntactic relations with one of approx. 20 semantic (theta) roles (from Dorr), e.g., AGENT, THEME, GOAL, INSTR
- No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure
- Nodes may receive more than one concept
  - Average: about 1.2
- Manual under development; annotation tool built
10. Example of IL1 internal representation
- The study led them to ask the Czech government to recapitalize CSA at this level.
- 3, lead, V, lead, Root, LEAD&lt;GET, GUIDE
- 2, study, N, study, AGENT, SURVEY&lt;WORK, REPORT
- 4, they, N, they, THEME, ---, ---
- 6, ask, V, ask, PROPOSITION, ---, ---
- 9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION
- 8, Czech, Adj, Czech, MOD, CZECH&lt;CZECHOSLOVAKIA, ---
- 11, recapitalize, V, recapitalize, PROP, CAPITALIZE&lt;SUPPLY, INVEST
- 12, csa, N, csa, THEME, AIRLINE&lt;LINE, ---
- 16, at, P, value_at, GOAL, ---, ---
- 15, level, N, level, ---, DEGREE, MEASURE
- 14, this, Det, this, ---, ---, ---
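The records above appear to follow a fixed comma-separated layout: token index, surface form, POS, lemma, theta role, then one or more Omega concepts ("---" meaning none assigned). A small parser for that assumed layout (the field names are ours, inferred from the slide, not the project's specification):

```python
# Hypothetical parser for the IL1 record lines shown on this slide.
# Assumed field order: index, surface, POS, lemma, theta role, concepts...
def parse_il1(line):
    parts = [p.strip() for p in line.split(",")]
    idx, surface, pos, lemma, role = parts[:5]
    # "---" marks an empty slot; a node may carry more than one concept.
    concepts = [c for c in parts[5:] if c and c != "---"]
    return {"index": int(idx), "surface": surface, "pos": pos,
            "lemma": lemma, "theta_role": role, "concepts": concepts}

rec = parse_il1("2, study, N, study, AGENT, SURVEY<WORK, REPORT")
print(rec["theta_role"], rec["concepts"])
```

For the "study" record this yields the AGENT role with two concepts, consistent with the roughly 1.2 concepts per node reported on the previous slide.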
11. Details of IL2: In development
- Start capturing meaning
  - Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION, ...)
  - Conversives (buy vs. sell) at the FrameNet level
  - Non-literal language usage ("open the door to customers" vs. "start doing business")
  - Extended paraphrases involving syntax, lexicon, grammatical features
  - Possible incorporation of other standardized notations for temporal and spatial expressions
- Still excluded
  - Quantification and negation
  - Discourse structure
  - Pragmatics
12. Omega ontology
- Single set of all semantic terms, taxonomized and interconnected (http://omega.isi.edu)
- Merger of existing ontologies and other resources
  - Manually built top structure, from ISI
  - WordNet (110,000 nodes), from Princeton
  - Mikrokosmos (6,000 nodes), from NMSU
  - Penman Upper Model (300 nodes), from ISI
  - 1 million instances (people, locations), from ISI
  - TAP domain relations, from Stanford
- Undergoing constant reconciliation and pruning
- Used in several past projects (metadata formation for database integration, MT, QA, summarization)
13. Dependency parser and Omega ontology
[Diagram: Omega (ISI), 110,000 concepts (WordNet, Mikrokosmos, etc.), 1.1 million instances, http://omega.isi.edu; dependency parser (Prague)]
14. Tiamat annotation interface
- For each new sentence:
  - Step 1: find Omega concepts for objects and events (candidate concepts shown)
  - Step 2: select event frame (theta roles)
15. Evaluation webpage
16. Evaluation
- Three approaches to evaluation
  - Inter-annotator agreement: completed
  - Sentence generation from the extracted annotation structure: to be completed
  - Comparison of interlingual structures (graph comparisons): not planned
- Inter-annotator agreement: Is the IL sufficiently defined to permit consistent annotation?
  - Impacts ontology and theta-role coverage and precision
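The slide does not name a specific agreement statistic, so as one plausible sketch, chance-corrected agreement on theta-role labels could be computed with Cohen's kappa (the example labels below are invented for illustration):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy theta-role annotations from two annotators over the same four nodes.
ann1 = ["AGENT", "THEME", "GOAL", "AGENT"]
ann2 = ["AGENT", "THEME", "AGENT", "AGENT"]
print(round(cohen_kappa(ann1, ann2), 3))  # → 0.556
```

Raw percent agreement here is 0.75, but correcting for the skew toward AGENT drops kappa to about 0.56, which is why a chance-corrected measure matters for a small role inventory.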
17. Annotation Issues
- Post-annotation consistency checking
  - Novice annotators may make inconsistent annotations within the same text.
  - Intra-annotator consistency checking procedure, e.g.: if two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences
- Post-annotation reconciliation
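The co-indexing rule lends itself to an automatic check. A minimal sketch, with an invented record shape (co-index id plus the node's assigned concept set); the function and its inputs are illustrative, not the project's tooling:

```python
# Flag co-indexed nodes whose assigned Omega concepts differ:
# under the rule above, co-indexed nodes must carry the same meaning.
def coindex_conflicts(nodes):
    """nodes: iterable of (coindex_id, frozenset_of_concepts)."""
    seen = {}
    conflicts = set()
    for cid, concepts in nodes:
        if cid in seen and seen[cid] != concepts:
            conflicts.add(cid)
        seen.setdefault(cid, concepts)   # remember the first assignment
    return conflicts

annots = [("e1", frozenset({"GOVERNMENT"})),
          ("e1", frozenset({"AUTHORITIES"})),   # conflicts with the first e1
          ("e2", frozenset({"AIRLINE"}))]
print(coindex_conflicts(annots))  # → {'e1'}
```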
18. Post-annotation reconciliation
- Question: How much can annotators be brought into agreement?
- Procedure
  - Each annotator sees all annotations and votes Yes/Maybe/No on each
  - Annotators then discuss all differences (telephone conference)
  - Annotators then vote again, independently
  - We collapse all Yes and Maybe votes and compare them with No votes to identify all serious disagreements
- Results
  - Annotators derive a common methodology
  - Small errors and oversights removed during discussion
  - Inter-annotator agreement improved
  - Serious problems of interpretation or error identified
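The vote-collapsing step above can be sketched as follows. The data layout (a dict from annotation id to second-round votes) is our assumption for illustration:

```python
# Collapse Yes/Maybe together and flag annotations where the collapsed
# votes still split against No after discussion: a serious disagreement.
def serious_disagreements(votes):
    """votes: {annotation_id: list of 'Yes'/'Maybe'/'No' from round two}."""
    flagged = []
    for ann_id, vs in votes.items():
        collapsed = ["YM" if v in ("Yes", "Maybe") else "No" for v in vs]
        # A unanimous verdict either way is not a disagreement.
        if "No" in collapsed and "YM" in collapsed:
            flagged.append(ann_id)
    return flagged

round2 = {"a1": ["Yes", "Maybe", "Yes"],   # unanimous once collapsed
          "a2": ["Yes", "No", "Maybe"],    # genuine split
          "a3": ["No", "No", "No"]}        # unanimous rejection
print(serious_disagreements(round2))  # → ['a2']
```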
19. Annotation across Translations
- Question: How different are the translations?
- Procedure
  - Annotator sees annotations across both translations and identifies differences of form and meaning
  - Annotator selects true meaning(s)
- Results (work still in progress)
  - Impacts ontology richness/conciseness
  - Improvement in interlingua representation depth
  - Useful for IL2 design development
- Observations
  - This is very hard work
  - Methodology unclear: what is seen first, how to show alternatives, what to do with results
20. Principal problems to date
- Proper nouns
  - Proposed solution: automatically tag with one of 6 types (Person, Location, Org, DateTime, etc.)
- Noun compounds
  - Alternatives: tag head only; parse and tag whole structure
- Omega is too rich
  - Concepts hard to distinguish from one another
  - Granularity of concept selection
- Light verbs
  - Proposed solution: rephrase to remove the light verb where possible ("take a shower" → "shower"), though some cases resist rephrasing
- Vagueness and ambiguity
  - Annotate all plausible senses ("propose" as both Urge and Suggest)
- Idioms and metaphors
  - Proposed solution: ?
21. Discussion and conclusion
- Results are encouraging
  - But more work must be done to solidify them
- Outcomes: how have we done?
  - IL design: partly, with IL2 in the works
  - Annotation methodology, manuals, tools, evaluations: yes
  - Annotated parallel texts: approx. 150 done
- Next steps
  - Foreign-language annotation standards and tools
  - Development of IL2
  - Addressing coverage gaps (1/3 of open-class words marked as having no concept)
  - Generation of surface structure from deep structure: is it possible?
22. Toward a Theory of Annotation
- Recently, a sharp increase in the number of annotated resources being built
  - Penn Treebank, PropBank, many others
- For annotation, we need:
  - A theory behind the phenomena being annotated
  - Annotation term sets (e.g., WordNet, FrameNet, VerbNet, HowNet)
  - A standard (?) annotation corpus (the same old Treebank?)
  - Annotation tools; they make an immense difference
  - A carefully considered annotation procedure (interleaving per text vs. per sentence, etc.)
  - Reconciliation and consistency-checking procedures
  - Evaluation measures, appropriately defined
23. Contact information
- URLs and Wiki pages
- Project website: http://aitc.aitcnet.org/nsf/iamtc/