Romanian Semantic Role Resource - PowerPoint PPT Presentation

About This Presentation
Title:

Romanian Semantic Role Resource

Description:

AGENT: Columbus discovered America. PATIENT: Columbus discovered America. ... hours, i.e. 20 months, considering 8 hours a day, 5 days a week working time. ... – PowerPoint PPT presentation

Slides: 27
Provided by: alexsidian
Learn more at: http://www.lrec-conf.org

Transcript and Presenter's Notes

Title: Romanian Semantic Role Resource


1
Romanian Semantic Role Resource
  • Diana Trandabat (1,3) and Maria Husarciuc (1,2)
  • (1) Faculty of Computer Science, Al. I. Cuza
    University of Iasi, Romania
  • (2) Faculty of Letters, Al. I. Cuza University of
    Iasi, Romania
  • (3) Institute for Computer Science, Romanian Academy

2
Motivation
  • Annotated language resources have become a must
    in natural language processing, for:
  • supervised learning: training and evaluation
  • unsupervised learning: evaluation
  • hand-crafted systems: evaluation.
  • Quality control is an important issue, since
    annotations, in order to be used as gold standard
    for evaluation, need to be very accurate.
  • What if we have short deadlines and limited human
    and financial possibilities?
  • A good solution would be to use existing language
    resources (built with considerable efforts for a
    specific language), and import them for a new
    language.

3
Predication
  • Predicational word: a word that demands a
    specific argument structure in order to express
    its sense.
  • Each predicational word has:
  • Arguments
  • Verbs with one argument: John leaves.
  • Verbs with two arguments: John reads a book.
  • Verbs with three arguments: John gives a book to
    Mary.
  • Adjuncts
  • John leaves New York.

4
Predication
  • Besides verbs, there are also predicational nouns
    (also called nominalizations) and predicational
    adjectives:
  • predicational verbs
  • John wrote the paper on time.
  • predicational nouns
  • John's writing of the paper was difficult.
  • predicational adjectives
  • The paper written by John was the best
    one.

5
Case Grammar
  • Language representation:
  • Surface Structure (the syntactic knowledge)
  • Deep Structure (the semantic knowledge).
  • Case Roles (Ch. Fillmore): representations at a
    semantic level of the lexical arguments
  • Examples:
  • AGENT: Columbus discovered America.
  • PATIENT: Columbus discovered America.
  • INSTRUMENT: The window was broken by the storm.
  • Temporal LOCALISATION: They dined at 5 a.m.
  • Spatial LOCALISATION: John goes to London.
  • etc.

6
Semantic Frames Databases
  • FrameNet
  • http://framenet.icsi.berkeley.edu/
  • 135,000 examples from the British National Corpus
  • 10,000 lexical units
  • over 800 semantic frames
  • PropBank
  • http://www.cs.rochester.edu/gildea/PropBank/Sort
  • Semantic annotation for the Penn Treebank
  • Salsa
  • http://www.coli.uni-saarland.de/projects/salsa
  • VerbNet
  • http://verbs.colorado.edu/kipper/verbnet.html

7
FrameNet
  • FrameNet is a lexicographic research project
    developed in Berkeley, California,
    which produced a lexicon containing very detailed
    information about English predicational words
    (verbs, nouns and adjectives).
  • A frame structure consists of:
  • a definition
  • a set of frame elements, FEs (semantic roles):
    the valences of a target predicational word.
  • core frame elements: mandatory for the verb's
    lexico-semantic realization => arguments
  • non-core frame elements: optional => adjuncts
  • a set of lexical units, LUs: the predicational
    words to which the combinatory properties (the
    semantic frame) apply.

8
FrameNet frame example for sell
  • Frame elements (semantic roles):
  • Core FEs: Buyer, Seller, Goods
  • Non-core FEs: Duration, Manner, Means, Money,
    Place, Purpose, Rate, Reason, Time, Unit
  • Lexical units:
  • Verbs: retail, sell, vend
  • Nouns: retailer, vendor.
  • Example:
  • [He]Seller will probably [sell]Target [her]Buyer
    [the book]Goods [for 15]Money.
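
The frame structure above can be written down as a small data structure. The following is only a minimal sketch: the class name `Frame`, its field names, and the helper `split_roles` are hypothetical and are not part of FrameNet or of the authors' software. The frame elements, lexical units and example sentence are those listed on the slide.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical container for one semantic frame (not a FrameNet API)."""
    name: str
    core_fes: set
    non_core_fes: set
    lexical_units: dict  # part of speech -> list of lemmas


# Values copied from the slide; the frame name "sell" is used here as a label only.
sell_frame = Frame(
    name="sell",
    core_fes={"Buyer", "Seller", "Goods"},
    non_core_fes={"Duration", "Manner", "Means", "Money", "Place",
                  "Purpose", "Rate", "Reason", "Time", "Unit"},
    lexical_units={"verb": ["retail", "sell", "vend"],
                   "noun": ["retailer", "vendor"]},
)

# The annotated example sentence as (text span, label) pairs.
# None marks text that carries no label.
example = [
    ("He", "Seller"),
    ("will probably", None),
    ("sell", "Target"),
    ("her", "Buyer"),
    ("the book", "Goods"),
    ("for 15", "Money"),
]


def split_roles(frame, annotation):
    """Separate annotated spans into core (arguments) and non-core (adjuncts)."""
    core = [(s, l) for s, l in annotation if l in frame.core_fes]
    non_core = [(s, l) for s, l in annotation if l in frame.non_core_fes]
    return core, non_core


core, non_core = split_roles(sell_frame, example)
print("core:", core)
print("non-core:", non_core)
```

In this sketch the Target label falls in neither group, since it marks the predicational word itself rather than one of its frame elements.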

9
Semantic Role Resource Building from Scratch or
Importing?
10
Semantic Role Resource Building from Scratch or
Importing?
  • Annotation of a new corpus
  • Assume that we already have the corpus, the schema,
    the software and two very well trained
    annotators with good knowledge of semantic frames,
    so that we only need to worry about the
    annotation process itself.
  • Our tests revealed that a person can annotate an
    average of 30 medium-sized sentences per hour.
    For a target of 100,000 sentences, we computed
    around 3500 hours, i.e. 20 months, considering 8
    hours a day, 5 days a week working time.
  • The main problem with this approach was the lack
    of a definite list of possible semantic roles.
    Therefore, different annotators can give
    different names (agent or seller or vendor, for
    instance) to the same role, which distorts the
    corpus quality metrics.
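
The effort estimate can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming a month of 52/12 working weeks (an assumption not stated on the slide); the function name `annotation_effort` is hypothetical. With these assumptions the result is roughly 3,300 hours and a little over 19 months, in line with the rounded figures of 3500 hours and 20 months quoted above.

```python
def annotation_effort(sentences, sentences_per_hour,
                      hours_per_day=8, days_per_week=5,
                      weeks_per_month=52 / 12):
    """Return (total hours, calendar months) for one full-time annotator.

    weeks_per_month is an assumption; the slide only gives the
    8 hours/day, 5 days/week working time.
    """
    hours = sentences / sentences_per_hour
    weeks = hours / (hours_per_day * days_per_week)
    months = weeks / weeks_per_month
    return hours, months


hours, months = annotation_effort(100_000, 30)
print(f"{hours:.0f} hours, about {months:.1f} months")
```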

11
Semantic Role Resource Building from Scratch or
Importing?
  • Import of the annotation
  • For the import method, the main time-consuming
    task is the translation.
  • A professional translator can translate up to
    40-50 sentences an hour, or even more if a
    translation memory is used.
  • But the real gain is that the corpus can be split
    among several translators (who are cheaper and
    easier to find than semantic annotators).
  • After the automatic alignment and import, a
    single annotator is needed to validate
    the created corpus, focusing on the
    cases where the alignment was not 1:1 (15% of
    the total number of sentences).
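
For comparison with the annotation estimate, the translation effort can be sketched in the same way. The 100,000-sentence target is taken from the annotation estimate earlier in the deck; the number of translators (5) and the 52/12 weeks-per-month figure are purely hypothetical values chosen for illustration, as are the function names.

```python
def translation_hours(sentences, sentences_per_hour):
    """Total translator hours needed for the whole corpus."""
    return sentences / sentences_per_hour


def calendar_months(total_hours, workers, hours_per_week=40,
                    weeks_per_month=52 / 12):
    """Calendar time when the work is split evenly among several workers."""
    return total_hours / workers / hours_per_week / weeks_per_month


TARGET = 100_000      # target size from the annotation estimate
TRANSLATORS = 5       # hypothetical team size, not from the slides

slow = translation_hours(TARGET, 40)   # lower bound of the 40-50 range
fast = translation_hours(TARGET, 50)   # upper bound of the 40-50 range
months_slow = calendar_months(slow, TRANSLATORS)

# Sentences whose alignment is not 1:1 and therefore need manual validation.
to_validate = round(TARGET * 0.15)

print(f"{fast:.0f}-{slow:.0f} translator hours in total")
print(f"about {months_slow:.1f} months with {TRANSLATORS} translators")
print(f"{to_validate} sentences to validate manually")
```

The sketch leaves out the validator's time, because the slides give no validation speed.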

12
Towards a Romanian Semantic Frames database
  • Given these estimates, the fact that we
    did not have two annotators available to work for
    20 months solely on semantic annotation, and the
    belief that, once the import program exists, other
    languages could use it to transfer
    annotations for themselves, we created a
    Romanian FrameNet based on the English
    annotation.
  • The intuition:
  • Most of the frames defined in the English
    FrameNet are likely to be valid
    cross-linguistically.
  • Semantic frames express conceptual structures,
    language independent at the deep structure level.
  • The surface realization follows
    each language's syntactic constraints.

13
Steps towards a parallel Romanian/English FrameNet
  • manual translation, by professional translators,
    of 1094 sentences from the English FrameNet: 110
    randomly selected sentences and the Event frame.
  • word-level alignment of the Romanian sentences
    with the English ones, using the aligner developed
    by the Institute of Research in Artificial
    Intelligence.
  • automatic import of the English annotation,
    followed by a manual verification to detect the
    mismatching cases.
  • an optimization process which, based on inference
    rules, corrects the automatic annotation.

14
Automatic import
EUROLAN 2005 Summer School
15
Automatic import
  • The algorithm:
  • reading the English XML files and the alignment
    files
  • labeling each English word with the corresponding
    semantic role (FE), converting the character
    indexes into a word-level annotation
  • mapping the English words to the aligned
    Romanian correspondences
  • writing an output XML file containing the
    Romanian annotated corpus.
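
The middle steps of the algorithm (character offsets to word-level labels, then projection through the alignment) can be sketched as follows. This is an illustrative sketch, not the authors' import program: the function names, the whitespace tokenization and the word alignment pairs are all assumptions, and the sentence is a shortened excerpt of the example used later in the deck, with the Time span cut accordingly. Offsets are treated as inclusive at both ends.

```python
import re


def token_spans(text):
    """Whitespace tokens as (start, inclusive end) character offsets."""
    return [(m.start(), m.end() - 1) for m in re.finditer(r"\S+", text)]


def labels_to_words(text, labels):
    """Map each word index to the roles whose character span contains it."""
    roles = {}
    for i, (s, e) in enumerate(token_spans(text)):
        for name, ls, le in labels:
            if ls <= s and e <= le:
                roles.setdefault(i, []).append(name)
    return roles


def project(word_roles, alignment):
    """Copy roles from source words to target words along alignment pairs."""
    target = {}
    for src, tgt in alignment:
        for name in word_roles.get(src, []):
            if name not in target.setdefault(tgt, []):
                target[tgt].append(name)
    return target


def words_to_labels(text, roles_by_word):
    """Merge adjacent words with the same role back into character spans."""
    labels = []  # each entry: [name, start, end, last word index]
    for i, (s, e) in enumerate(token_spans(text)):
        for name in roles_by_word.get(i, []):
            prev = next((l for l in reversed(labels) if l[0] == name), None)
            if prev is not None and prev[3] == i - 1:
                prev[2], prev[3] = e, i
            else:
                labels.append([name, s, e, i])
    return [(n, s, e) for n, s, e, _ in labels]


en = "The incident occurred after a dispute"
ro = "Incidentul a aparut dupa o disputa"
en_labels = [("Event", 0, 11), ("Target", 13, 20), ("Time", 22, 36)]
# Hypothetical word alignment as (English index, Romanian index) pairs.
alignment = [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

ro_labels = words_to_labels(ro, project(labels_to_words(en, en_labels), alignment))
print(ro_labels)
```

Under these assumptions the projected Event and Target spans come out as 0-9 and 11-18, the same offsets as in the Romanian annotation shown later in the deck.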

16
English semantic roles
  <annotationSet ID="1052804" status="MANUAL">
    <layers>
      <layer ID="6375447" name="FE">
        <labels>
          <label name="Event" start="0" end="11" />
          <label name="Time" start="22" end="62" />
          <label name="Place" start="64" end="106" />
        </labels>
      </layer>
      <layer ID="6375452" name="Target">
        <labels>
          <label name="Target" start="13" end="20" />
        </labels>
      </layer>
      <layer ID="6375453" name="Verb" />
    </layers>
    <sentence ID="797186" aPos="103724676">
      <text>The incident occurred after a dispute between the man and staff at a branch of the Bank of Ireland in Cahir . </text>
    </sentence>
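
To see what the character indexes refer to, the annotation set above can be read with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the closing `</annotationSet>` tag is added here as an assumption so that the fragment parses (the slide stops before it), the function name `extract_spans` is hypothetical, and the `end` offsets are treated as inclusive, which is what the numbers on the slide imply.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide; the closing </annotationSet> is an assumption.
XML = """<annotationSet ID="1052804" status="MANUAL">
  <layers>
    <layer ID="6375447" name="FE">
      <labels>
        <label name="Event" start="0" end="11" />
        <label name="Time" start="22" end="62" />
        <label name="Place" start="64" end="106" />
      </labels>
    </layer>
    <layer ID="6375452" name="Target">
      <labels>
        <label name="Target" start="13" end="20" />
      </labels>
    </layer>
    <layer ID="6375453" name="Verb" />
  </layers>
  <sentence ID="797186" aPos="103724676">
    <text>The incident occurred after a dispute between the man and staff at a branch of the Bank of Ireland in Cahir . </text>
  </sentence>
</annotationSet>"""


def extract_spans(xml_string):
    """Return (layer name, label name, covered text) for every label."""
    root = ET.fromstring(xml_string)
    text = root.find("sentence/text").text
    spans = []
    for layer in root.iter("layer"):
        for label in layer.iter("label"):
            start = int(label.get("start"))
            end = int(label.get("end"))  # inclusive offset
            spans.append((layer.get("name"), label.get("name"),
                          text[start:end + 1]))
    return spans


spans = extract_spans(XML)
for layer_name, label_name, covered in spans:
    print(f"{layer_name:6} {label_name:6} {covered!r}")
```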

17
Romanian semantic roles
  <annotationSet ID="1" status="AUTOMATIC">
    <layers>
      <layer ID="6375447" name="FE">
        <labels>
          <label name="Event" start="0" end="9" />
          <label name="Time" start="20" end="59" />
          <label name="Place" start="61" end="101" />
        </labels>
      </layer>
      <layer ID="6375452" name="Target">
        <labels>
          <label name="Target" start="11" end="18" />
        </labels>
      </layer>
      <layer ID="6375453" name="Verb" />
    </layers>
    <sentence ID="671" aPos="103724676">
      <text>Incidentul a aparut dupa o disputa între individ si personal la o filiala a Bancii Irlandeze din Cahir . </text>
    </sentence>

18
Optimization of the Romanian obtained database
  • Translations
  • produced by professional translators to minimize
    errors.
  • problems are mainly due to the lack of context in
    the English sentences.
  • however, if the English semantic frame is
    taken into account, this problem is surmountable.
  • Alignment
  • performed with the aligner developed by the
    Institute of Research in Artificial Intelligence,
    which is reported to have a precision of 87.17%
    and a recall of 70.25%.
  • however, the aligner results were manually
    validated before entering the annotation import
    program.
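
Precision and recall are often summarized as a single F-measure. The slides do not report one, so the value computed below is a derived figure, not a result from the authors; it assumes the two numbers above are percentages, and the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Aligner figures quoted on the slide, read as percentages.
f1 = f_measure(87.17, 70.25)
print(f"F1 = {f1:.2f}")
```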

19
Optimization of the Romanian obtained database
  • The correctness of the
    obtained Romanian corpus is assessed manually.
  • The first results of the annotation import show
    an overall accuracy of approx. 83%.
  • The validation focuses on detecting the cases
    where the import has failed, trying to discover
    whether the problems are due to the translation or
    to the semantic or syntactic specificities of
    Romanian.
  • Import difficulties:
  • the double annotation
  • the existence of imbricated frame elements
  • unexpressed semantic frames
  • the lack of total correspondence between English
    and Romanian frames.

20
Double annotation
  • The double annotation applies only to the
    non-core frame elements, due to the fact that the
    same phrase can refer to multiple circumstances
    (peripheral roles) of an event.
  • When a semantic element is doubly annotated in
    English, the same generally holds for
    Romanian.
  • The most frequent case of double annotation is
    for the Time/Cause roles, since almost any
    temporal specification involves a cause and/or a
    goal.

[The incident]Event OCCURRED [after a dispute
between the man and staff]Time/Cause [at a branch
of the Bank of Ireland in Cahir]Place
[Incidentul]Event A APARUT [dupa o disputa între
individ si personal]Time/Cause [la o filiala a Bancii
Irlandeze din Cahir]Place.
21
Imbrications
  • A word can be part of two semantic elements
    without being doubly annotated.
  • Imbrication is common in the English
    annotations, mainly in possessive noun
    phrases. It does not occur in
    Romanian.
  • Even if there is no absolute correspondence
    for the whole BodyPart FE from English to
    Romanian, the noun mâna (hand) is correctly
    annotated in Romanian as representing the
    BodyPart frame element.

[When she got over the stroke]Time/Cause [she]Exp
fell and BROKE [[her]Exp hand]BodyPart.
[Când si-a revenit dupa atac]Time/Cause , a cazut si
[si]Exp-a RUPT [mâna]BodyPart .
22
Imbrications
  • The import of the annotation also works when the
    Romanian target word is a gerund followed by a
    reflexive pronoun and a noun phrase.
  • Although apparently similar to the English
    structure, in the Romanian sentence the frame
    elements are not imbricated but successive, since
    the head (regent) of the pronoun si is not the noun
    glezna (ankle) but the gerund verb.

[Josef Jakobs]Prot landed in a potato field in
North Stifford , Essex , falling heavily and
BREAKING [[his]Prot ankle]BodyP .
[Josef Jakobs]Prot a aterizat într-un câmp de cartofi în
North Stifford , Essex , cazând greu si
RUPÂNDU-[si]Prot [glezna]BodyP .
23
Unexpressed Semantic Frames
  • An FE can be expressed in English but implicit in
    Romanian, or vice versa. While the first case poses
    no problem for the transfer, the second one
    requires adding roles that are unexpressed in
    English.

[Blood]Undergoer had CONGEALED [thickly]Manner
[on the end of the smashed fibula]Place .
[Sângele]Undergoer se ÎNGROSA [spre capatul
fibulei zdrobite]Place .
QUIT [smoking]Process .
LASATI-[va]Protagonist de [fumat]Process .
24
The lack of total correspondence between frames
  • In the English FrameNet, similar sentences can
    serve as examples for different, related frames.
  • The relation between the Communication and
    Contacting frames is illustrated by two sentences
    that are apparently semantically equivalent.
  • The Romanian translation of both sentences is
    similar, due to the absence in Romanian of a
    simple verb corresponding to e-mail.

Contacting frame: I e-mailed him my new phone
number.
Communication frame: I communicated my
new phone number to him by e-mail.
Communication frame: I-am trimis prin e-mail noul
meu numar de telefon.
25
Conclusions and Further work
  • The import method was preferred to the
    classical creation by hand of a manually
    annotated corpus because it can potentially be
    automated. We are currently investigating the
    possibility of using a translation engine for the
    most time-consuming task, namely the translation
    of the English sentences.
  • The resulting resource can also be used as a
    verification resource for the syntactic annotation.
  • FrameNet comes, besides frame elements, with a
    syntactic analysis of each of the sentences. This
    annotation can also be imported, but it is not
    representative, since syntax represents the
    surface level, i.e. the one with language
    specificities.
  • Therefore, the Romanian sentences are
    syntactically parsed at the alignment stage. The
    comparison of the two annotations is very
    useful for creating a syntax transfer model.

26
  • Thank you!
  • dtrandabat_at_info.uaic.ro
  • mhusarciuc_at_gmail.com