Semantic Annotation for Interlingual Representation of Multilingual Texts
1
Semantic Annotation for Interlingual
Representation of Multilingual Texts
  • Teruko Mitamura (CMU), Keith Miller (MITRE),
  • Bonnie Dorr (Maryland), David Farwell (NMSU),
    Nizar Habash (Columbia), Stephen Helmreich
    (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Owen
    Rambow (Columbia),
  • Flo Reeder (MITRE), Advaith Siddharthan
    (Columbia)
  • LREC 2004 Workshop: Beyond Named Entity
    Recognition
  • Semantic labelling for NLP tasks

3
IAMTC (Interlingua Annotation of Multilingual
Corpora) Project
  • Goals
  • Develop MT / interlingua representations and test
    them by human annotation on texts from six
    languages (Japanese, Arabic, Korean, Spanish,
    French, English)
  • Develop annotation methodology
  • Develop semantic annotation tools
  • Design new metrics and evaluations for the
    interlingual representation

4
IAMTC Project
  • Collaboration: New Mexico, Maryland, Columbia,
    MITRE, CMU, ISI
  • Outcomes
  • IL design for set of complex representational
    phenomena
  • Annotation methodology, manuals, tools,
    evaluations
  • Annotated parallel texts according to IL, for
    training data
  • Funding: NSF, 1 year

5
Theoretical goal: Getting at meaning
K1E1: Starting on January 1 of next year, SK
Telecom subscribers can switch to less expensive
LG Telecom or KTF. The subscribers cannot
switch again to another provider for the first 3
months, but they can cancel the switch in 14
days if they are not satisfied with services
like voice quality.
K1E2: Starting January 1st of next year,
customers of SK Telecom can change their service
company to LG Telecom or KTF. Once a service
company swap has been made, customers are not
allowed to change companies again within the
first three months, although they can cancel the
change anytime within 14 days if problems such
as poor call quality are experienced.
  • Semantically identical
  • Semantically equivalent
  • Semantically different
  • Additional/less information
  • Different information

6
Corpus and Data
  • Initial Corpus
  • 10 texts in each language
  • 2 translations each into English
  • Interlingua designed for MT
  • Multiple English translations of the same source
    show translation divergences. Some phenomena:
  • Lexical level: word changes
  • Syntactic level: phrasing, thematization,
    nominalization
  • Semantic level: additional/different content
  • Discourse level: multi-clause structure, anaphora
  • Pragmatic level: speech acts, implicatures,
    style, interpersonal
  • Causes of divergence
  • Genuine ambiguity/vagueness of source meaning
  • Translator error/reinterpretation

7
IL Development: Staged, deepening
  • IL0: simple dependency tree gives structure
  • IL1: semantic annotations for Nouns, Verbs, Adjs,
    Advs, and Theta Roles
  • Not yet fully semantic (buy ≠ sell); many
    remaining simplifications
  • Concept senses from ISI's Omega ontology
  • Theta Roles from Dorr's LCS work
  • Elaborate annotation manuals
  • Tiamat annotation interface
  • Post-annotation reconciliation process and
    interface
  • Evaluation scores: annotator agreement
  • IL2: comes next

8
Details of IL0
  • Deep syntactic dependency representation
  • Removes auxiliary verbs, determiners, and some
    function words
  • Normalizes passives, clefts, etc.
  • Includes syntactic roles (Subj, Obj)
  • Construction
  • 1. Dependency parsed using Connexor (English)
  • 2. Hand-corrected
  • Extensive manual and instructions on website
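As an illustration of the IL0 idea, here is a minimal sketch (a hypothetical `IL0Node` class; Connexor's real output format differs) of a deep-syntactic dependency node that keeps content words and syntactic roles, while auxiliaries and determiners are simply not represented:

```python
from dataclasses import dataclass, field

# Hypothetical IL0 node: a deep-syntactic dependency tree that keeps
# content words, records syntactic roles (Subj, Obj), and omits
# auxiliaries and determiners entirely.
@dataclass
class IL0Node:
    lemma: str
    pos: str
    role: str = "Root"          # syntactic role relative to the head
    children: list = field(default_factory=list)

    def add(self, lemma, pos, role):
        child = IL0Node(lemma, pos, role)
        self.children.append(child)
        return child

# "The subscribers can switch to LG Telecom": drop "the" and "can",
# normalize to switch(Subj: subscriber, Obj: LG Telecom).
root = IL0Node("switch", "V")
root.add("subscriber", "N", "Subj")
root.add("LG Telecom", "N", "Obj")

print([(c.lemma, c.role) for c in root.children])
```

The point of the sketch is only that IL0 is purely structural: nothing semantic is attached yet.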

9
Details of IL1
  • Intermediate semantic representation
  • Annotations performed manually, by each annotator
    working alone
  • Associate open-class lexical items with Omega
    Ontology items
  • Replace syntactic relations by one of approx. 20
    semantic (theta) roles (from Dorr), e.g., AGENT,
    THEME, GOAL, INSTR
  • No treatment of prepositions, quantification,
    negation, time, modality, idioms, proper names,
    NP-internal structure
  • Nodes may receive more than one concept
  • Average: about 1.2
  • Manual under development; annotation tool built
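The IL1 step described above can be sketched as follows (a hypothetical dict-based node and `to_il1` helper; the real Tiamat tool and the full role inventory differ): the syntactic role is replaced by a theta role, and one or more candidate Omega senses are attached.

```python
# Hypothetical IL1 annotation step: replace a node's syntactic role
# with one of ~20 theta roles and attach candidate Omega concept senses.
THETA_ROLES = {"AGENT", "THEME", "GOAL", "INSTR", "PROP",
               "PROPOSITION", "MOD"}   # subset, for illustration

def to_il1(node, concepts, theta_role=None):
    if theta_role is not None:
        if theta_role not in THETA_ROLES:
            raise ValueError(f"unknown theta role: {theta_role}")
        node["role"] = theta_role       # Subj/Obj replaced by theta role
    node["concepts"] = list(concepts)   # nodes may carry more than one
    return node

study = {"lemma": "study", "pos": "N", "role": "Subj"}
to_il1(study, ["SURVEY<WORK", "REPORT"], theta_role="AGENT")
print(study["role"], study["concepts"])
```

Allowing a list of concepts reflects the slide's note that nodes average about 1.2 concepts each.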

10
Example of IL1 internal representation
  • The study led them to ask the Czech government to
    recapitalize CSA at this level.
  • 3, lead, V, lead, Root, LEAD&lt;GET, GUIDE
  • 2, study, N, study, AGENT, SURVEY&lt;WORK, REPORT
  • 4, they, N, they, THEME, ---, ---
  • 6, ask, V, ask, PROPOSITION, ---, ---
  • 9, government, N, government, GOAL,
    AUTHORITIES, GOVERNMENTAL-ORGANIZATION
  • 8, Czech, Adj, Czech, MOD, CZECH&lt;CZECHOSLOVAKIA,
    ---
  • 11, recapitalize, V, recapitalize, PROP,
    CAPITALIZE&lt;SUPPLY, INVEST
  • 12, csa, N, csa, THEME, AIRLINE&lt;LINE, ---
  • 16, at, P, value_at, GOAL, ---, ---
  • 15, level, N, level, ---, DEGREE, MEASURE
  • 14, this, Det, this, ---, ---, ---
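The example lines follow a comma-separated tuple layout: index, surface word, POS, lemma, theta role, then up to two Omega concepts, with `---` marking an empty field (concepts are written with their hypernym, e.g. SURVEY under WORK). A hypothetical parser for that layout:

```python
# Hypothetical parser for the IL1 tuple layout shown on the slide:
# index, word, POS, lemma, theta role, concept1, concept2
# ("---" marks an empty field).
def parse_il1(line):
    fields = [f.strip() for f in line.split(",")]
    idx, word, pos, lemma, role = fields[:5]
    concepts = [c for c in fields[5:] if c and c != "---"]
    return {"index": int(idx), "word": word, "pos": pos, "lemma": lemma,
            "role": None if role == "---" else role,
            "concepts": concepts}

rec = parse_il1("2, study, N, study, AGENT, SURVEY<WORK, REPORT")
print(rec["role"], rec["concepts"])   # AGENT ['SURVEY<WORK', 'REPORT']
```

Note that "study" carries two candidate senses, matching the ~1.2-concepts-per-node average mentioned earlier.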

11
Details of IL2: In development
  • Start capturing meaning
  • Handle proper names: one of around 5 classes
    (PERSON, LOCATION, TIME, ORGANIZATION)
  • Conversives (buy vs. sell) at the FrameNet level
  • Non-literal language usage (open the door to
    customers vs. start doing business)
  • Extended paraphrases involving syntax, lexicon,
    grammatical features
  • Possible incorporation of other standardized
    notations for temporal and spatial expressions
  • Still excluded
  • Quantification and negation
  • Discourse structure
  • Pragmatics

12
Omega ontology
  • Single set of all semantic terms, taxonomized and
    interconnected (http://omega.isi.edu)
  • Merger of existing ontologies and other
    resources:
  • Manually built top structure from ISI
  • WordNet (110,000 nodes) from Princeton
  • Mikrokosmos (6000 nodes) from NMSU
  • Penman Upper model (300 nodes) from ISI
  • 1-million instances (people, locations) from ISI
  • TAP domain relations from Stanford
  • Undergoing constant reconciliation and pruning
  • Used in several past projects (metadata formation
    for database integration; MT; QA; summarization)
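The annotations link each concept to its hypernym (e.g. SURVEY under WORK). A toy taxonomy-walk sketch of such a lookup (the entries below are illustrative, not actual Omega content, which spans ~110,000 concepts):

```python
# Toy stand-in for a taxonomized ontology: each concept maps to its
# parent. The entries are illustrative, not real Omega content.
TAXONOMY = {
    "SURVEY": "WORK",
    "CAPITALIZE": "SUPPLY",
    "AIRLINE": "LINE",
    "WORK": "ACTIVITY",       # hypothetical upper-model node
}

def hypernym_chain(concept):
    """Walk parent links up to a taxonomy root."""
    chain = [concept]
    while concept in TAXONOMY:
        concept = TAXONOMY[concept]
        chain.append(concept)
    return chain

print("<".join(hypernym_chain("SURVEY")))   # SURVEY<WORK<ACTIVITY
```

This is the structure an annotator browses when choosing between candidate senses at different granularities.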

13
Dependency parser and Omega ontology
[Figure: dependency parse (Prague parser) linked to the
Omega ontology (ISI): 110,000 concepts (WordNet,
Mikrokosmos, etc.), 1.1 million instances;
http://omega.isi.edu]
14
Tiamat annotation interface
[Screenshot: for each new sentence, Step 1: find Omega
concepts for objects and events (candidate concepts are
listed); Step 2: select the event frame (theta roles)]
15
Evaluation webpage
16
Evaluation
  • Three approaches to evaluation
  • Inter-annotator agreement: completed
  • Sentence generation from extracted annotation
    structure: to be completed
  • Comparison of interlingual structures (graph
    comparisons): not planned
  • Inter-annotator agreement: Is the IL sufficiently
    defined to permit consistent annotation?
  • Impacts ontology and theta-role coverage and
    precision
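A minimal way to quantify inter-annotator agreement on concept choices (a hypothetical per-node exact-match measure with illustrative annotations; the slide does not specify the project's actual metric):

```python
# Hypothetical agreement measure: the fraction of nodes annotated by
# both people on which they chose the same Omega concept.
def agreement(ann_a, ann_b):
    shared = set(ann_a) & set(ann_b)     # nodes both annotators labeled
    if not shared:
        return 0.0
    same = sum(1 for node in shared if ann_a[node] == ann_b[node])
    return same / len(shared)

# Illustrative annotations (concept names are made up for the example).
a = {"study": "SURVEY<WORK", "lead": "LEAD<GET",   "ask": "REQUEST"}
b = {"study": "SURVEY<WORK", "lead": "LEAD<GUIDE", "ask": "REQUEST"}
print(agreement(a, b))   # 2 of 3 shared nodes agree
```

A refinement would give partial credit when the chosen concepts are taxonomic neighbors, which bears directly on the granularity problem noted later.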

17
Annotation Issues
  • Post-annotation consistency checking
  • Novice annotators may make inconsistent
    annotations within the same text.
  • Intra-annotator consistency checking procedure,
    e.g.:
  • If two nodes in different sentences are
    co-indexed, then annotators must ensure that the
    two nodes carry the same meaning in the context
    of the two different sentences
  • Post-annotation reconciliation

18
2. Post-annotation reconciliation
  • Question: How much can annotators be brought into
    agreement?
  • Procedure:
  • Annotator sees all annotations, votes
    Yes/Maybe/No on each
  • Annotators then discuss all differences
    (telephone conf)
  • Annotators then vote again, independently
  • We collapse all Yes and Maybe votes, compare them
    with No to identify all serious disagreement
  • Results:
  • Annotators derive common methodology
  • Small errors and oversights removed during
    discussion
  • Inter-annotator agreement improved
  • Serious problems of interpretation or error
    identified
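The vote-collapsing step can be sketched directly from the procedure above (vote records are hypothetical):

```python
# Reconciliation sketch, following the slide's procedure: collapse
# Yes and Maybe into one bucket and compare against No, so that only
# hard disagreements are flagged for discussion.
def serious_disagreements(votes):
    """votes: {annotation_id: [vote, ...]} with votes in Yes/Maybe/No."""
    flagged = []
    for ann_id, vs in votes.items():
        collapsed = {"Yes" if v in ("Yes", "Maybe") else "No" for v in vs}
        if len(collapsed) > 1:          # both Yes-ish and No present
            flagged.append(ann_id)
    return sorted(flagged)

votes = {"n1": ["Yes", "Maybe", "Yes"],   # consensus after collapsing
         "n2": ["Yes", "No", "Maybe"],    # genuine disagreement
         "n3": ["No", "No", "No"]}        # consensus rejection
print(serious_disagreements(votes))   # ['n2']
```

Treating Maybe as Yes is the design choice the slide describes: it filters out hedged votes so discussion time goes only to true conflicts.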

19
Annotation across Translations
  • Question: How different are the translations?
  • Procedure:
  • Annotator sees annotations across both
    translations, identifies differences of form and
    meaning
  • Annotator selects true meaning(s)
  • Results (work still in progress):
  • Impacts ontology richness/conciseness
  • Improvement in Interlingua representation depth
  • Useful for IL2 design development
  • Observations
  • This is very hard work
  • Methodology unclear: what is seen first, how to
    show alternatives, what to do with results

20
Principal problems to date
  • Proper nouns
  • Proposed solution: automatically tag with one of
    6 types (Person, Location, Org, DateTime, etc.)
  • Noun compounds
  • Alternatives: tag head only; parse and tag whole
    structure
  • Omega is too rich
  • Senses are hard to distinguish from one another
  • Granularity of concept selection
  • Light verbs
  • Proposed solution: rephrase to remove the light
    verb where possible (take a shower → shower), but
    not every construction has a single-verb
    counterpart
  • Vagueness and ambiguity
  • Annotate all plausible senses (e.g., propose as
    both Urge and Suggest)
  • Idioms and metaphors
  • Proposed solution: ?
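The light-verb proposal lends itself to a small rewrite table. A sketch (the lexicon entries are hypothetical; a real resource would need much broader coverage):

```python
# Light-verb rewriting sketch: where a light-verb construction has a
# single-verb counterpart, annotate the rewritten verb instead.
# Lexicon entries are hypothetical examples.
LIGHT_VERB_REWRITES = {
    ("take", "shower"): "shower",
    ("make", "decision"): "decide",
}

def rewrite_light_verb(verb, noun):
    """Return the single-verb paraphrase, or None when none exists."""
    return LIGHT_VERB_REWRITES.get((verb, noun))

print(rewrite_light_verb("take", "shower"))   # shower
print(rewrite_light_verb("take", "nap"))      # None
```

The `None` branch is the unresolved case on the slide: constructions with no single-verb counterpart still need some other annotation convention.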

21
Discussion and conclusion
  • Results are encouraging
  • But more work must be done to solidify them
  • Outcomes: how have we done?
  • IL design: partly, and IL2 in the works
  • Annotation methodology, manuals, tools, evals:
    yes
  • Annotated parallel texts: approx. 150 done
  • Next steps
  • Foreign language annotation standards and tools
  • Development of IL2
  • Addressing coverage gaps (1/3 of open-class words
    marked as having no concept)
  • Generation of surface structure from deep
    structure
  • Is it possible?

22
Toward a Theory of Annotation
  • Recently, sharp increase in number of annotated
    resources being built
  • Penn Treebank, Propbank, many others
  • For annotation, one needs:
  • A theory behind the phenomena being annotated
  • Annotation termsets (even WordNet, FrameNet,
    VerbNet, HowNet)
  • Standard (?) annotation corpus (same old
    Treebank?)
  • Annotation tools (they make an immense
    difference)
  • Carefully considered annotation procedure
    (interleaving per text vs. per sentence, etc.)
  • Reconciliation and consistency checking
    procedures
  • Evaluation measures, appropriately defined

23
Contact information
  • URLs and Wiki pages
  • Project website: http://aitc.aitcnet.org/nsf/iamtc/