Spanish FrameNet Project presentation

Title: Spanish FrameNet Project

1
Spanish FrameNet Project

Autonomous University of Barcelona
Marc Ortega

2
Spanish FrameNet Project

Spanish FrameNet (SFN) is a research project
sponsored by the Department of Education of Spain
(Grant No. TSI2005-01200) from December 2005 to
December 2006.
A new grant proposal has been submitted to the
Spanish Department of Education for the period
2007-2009.
SFN is developed at the Autonomous University of
Barcelona (Spain) and the International Computer
Science Institute (Berkeley, CA) in cooperation
with the FrameNet Project.
PI: Carlos Subirats; System Analyst: Marc Ortega;
2 linguists.

3
SFN Goals

The Spanish FrameNet Project is creating an
online lexical resource for Spanish, based on
frame semantics and supported by corpus evidence.
SFN will be available to the public by July 2007.
SFN will contain approximately 1,000 lexical items
or more (verbs, predicative nouns, adjectives,
adverbs, prepositions, and entities)
representative of a wide range of semantic
domains.
The aim is to document the range of semantic and
syntactic combinatory possibilities (valences) of
each word in each of its senses.

4
Frame Semantics

Spanish FrameNet (SFN) uses FrameNet (FN) frames
and modifies them where needed to fit Spanish
Some SFN frames are the same as the English FN
frames (with Spanish examples)
Some SFN frames have the same name as an English
FN frame but differ from it (slightly different
definition, different FEs, or different core
sets)
To adapt FN to Spanish, we defined some new frames
and left some FN frames unused (new frames use
the same format as FN frames), such as:
Cause_to_halt
Change_emotional_state
Collapse
Inventing
Motion_backwards, Motion_interruption,
Motion_manner, Motion_medium, Motion_up_downwards
Return
Social_interaction
Think_up

5
Current Project Status

Frames defined: 92
Lexical units: 624
Annotated: 413
Subcorporated: 130
Created but without subcorporation: 23

6
Spanish FrameNet Corpus and Tools

Spanish FrameNet uses a 350-million-word
corpus
It includes both European and New World Spanish
(40% and 60%, respectively)
The SFN Corpus was developed by the SFN
research team, since no large public-domain
Spanish corpora are available
The SFN Corpus is lemmatized and tagged with a
set of in-house tools
FNDesktop
Web Reports
Sato Tool

7
The SFN tagging and chunking system

The SFN Corpus is tagged and lemmatized using:
An electronic dictionary of Spanish of 600,000
forms, which is expanded from a dictionary of
93,000 lemmas:
66,000 single-word lexical units, like unir
(unite), inmoralidad (immorality), allí (there),
etc.
26,000 multi-word lexical units (MWLU), like
muerte cerebral (brain death), etc., which are
automatically expanded into 55,000 inflected MWLU
forms.
A corpus tagger that converts plain text to
deterministic finite state automata (DFSA)
2,000 finite state transducers (FST) of
multi-word verbs
Transducers of heads of verbal phrases (compound
verbal tenses)
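
The dictionary lookup described on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the entries, the tag names (DET, PRON, N, A, ADV), the function name tag, and the choice to prefer the longest multi-word match. The real SFN dictionary and tagger are not shown in the slides, and the real system is based on automata rather than Python dictionaries.

```python
# Toy dictionary of inflected single-word forms: form -> list of (lemma, POS).
# Entries and tag names are illustrative, not taken from the SFN dictionary.
SINGLE = {
    "la": [("el", "DET"), ("lo", "PRON")],
    "muerte": [("muerte", "N")],
    "muertes": [("muerte", "N")],
    "cerebral": [("cerebral", "A")],
    "cerebrales": [("cerebral", "A")],
    "allí": [("allí", "ADV")],
}

# Toy table of inflected multi-word forms: tuple of forms -> (lemma, POS).
MULTI = {
    ("muerte", "cerebral"): ("muerte cerebral", "N"),
    ("muertes", "cerebrales"): ("muerte cerebral", "N"),
}
MAX_MWLU_LEN = max(len(key) for key in MULTI)


def tag(tokens):
    """Return (surface, analyses) pairs, preferring the longest MWLU match."""
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(MAX_MWLU_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MULTI:
                out.append((" ".join(tokens[i:i + n]), [MULTI[key]]))
                i += n
                matched = True
                break
        if not matched:
            form = tokens[i]
            # Unknown forms keep their surface form as lemma.
            out.append((form, SINGLE.get(form.lower(), [(form, "UNKNOWN")])))
            i += 1
    return out


result = tag("la muerte cerebral".split())
```

Here the ambiguous form la keeps both of its analyses, while muerte cerebral is returned as a single multi-word unit. Keeping all analyses at this stage is what makes a later disambiguation step necessary.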

8
The SFN tagging and chunking system

The POS tagging process produces two corpus formats:
Automata Corpus
IMS-CWB (Institut für Maschinelle
Sprachverarbeitung Corpus Workbench)

9
Automata Corpus

Lexical tagging (part-of-speech, lemma)
Word ambiguities are represented in deterministic
finite state automata (DFSAs) as different
possible transitions between two consecutive
states

Allows efficient word disambiguation
Allows extended lexical tagging using automata
transduction
Compound verbal forms tagging
Multi-word verb recognition

Very efficient process rates
Human access is almost impossible
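
As a rough illustration of how word ambiguities can be stored as alternative transitions between consecutive states, the following Python sketch builds a linear sentence automaton and prunes it. The example sentence, the tag names, and the forbidden tag pairs are invented for the sketch. The pruning rule is a stand-in and is not the SFN disambiguation method, which the slides do not describe.

```python
from itertools import product


def build_automaton(analyses):
    """analyses: one list of (form, lemma, pos) labels per token.
    State i precedes token i; each label is a transition from i to i + 1."""
    return {i: [(label, i + 1) for label in options]
            for i, options in enumerate(analyses)}


def paths(automaton):
    """Enumerate every reading (one label per token) through the automaton."""
    arcs = [[label for label, _ in automaton[i]] for i in range(len(automaton))]
    return [list(p) for p in product(*arcs)]


def prune(automaton, forbidden_bigrams):
    """Keep only transitions lying on a path with no forbidden POS bigram."""
    good = [p for p in paths(automaton)
            if all((a[2], b[2]) not in forbidden_bigrams
                   for a, b in zip(p, p[1:]))]
    return {i: [(label, nxt) for label, nxt in automaton[i]
                if any(p[i] == label for p in good)]
            for i in range(len(automaton))}


# "la casa blanca": "la" is a determiner or a pronoun, "casa" a noun or a verb form.
sentence = [
    [("la", "el", "DET"), ("la", "lo", "PRON")],
    [("casa", "casa", "N"), ("casa", "casar", "V")],
    [("blanca", "blanco", "A")],
]
automaton = build_automaton(sentence)
# Hypothetical local constraints, chosen only to make the example resolve.
pruned = prune(automaton, {("DET", "V"), ("PRON", "N"), ("V", "A")})
```

The unpruned automaton encodes four readings in five transitions. After pruning, a single transition remains between each pair of states. Enumerating all paths is only practical for short sentences; it is used here to keep the sketch small.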

10
CWB Corpus

Lexical tagging (part-of-speech, lemma)
Text DFSAs are disambiguated and converted to XML
format
Unambiguous corpus
Allows human access to corpus contents
Allows human corpus search
Corpus contents are codified and indexed for an
efficient corpus search
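
A minimal sketch of the conversion and indexing steps, assuming the disambiguated corpus is available as lists of (form, lemma, pos) tuples. The element names (corpus, s, w), the attribute names, and both function names are hypothetical; the slides do not specify the actual SFN XML schema or how CWB is fed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def to_xml(sentences):
    """Serialize disambiguated sentences as XML, one <w> element per token."""
    corpus = ET.Element("corpus")
    for s_no, tokens in enumerate(sentences, start=1):
        s = ET.SubElement(corpus, "s", id=str(s_no))
        for form, lemma, pos in tokens:
            w = ET.SubElement(s, "w", lemma=lemma, pos=pos)
            w.text = form
    return ET.tostring(corpus, encoding="unicode")


def build_lemma_index(sentences):
    """Map each lemma to its (sentence number, token offset) positions."""
    index = defaultdict(list)
    for s_no, tokens in enumerate(sentences, start=1):
        for t_no, (_form, lemma, _pos) in enumerate(tokens):
            index[lemma].append((s_no, t_no))
    return dict(index)


sentences = [[("la", "el", "DET"), ("casa", "casa", "N"), ("blanca", "blanco", "A")]]
xml_text = to_xml(sentences)
index = build_lemma_index(sentences)
```

Because each token now carries exactly one lemma and one tag, a search by lemma reduces to a lookup in the index, which is the property that makes the corpus searchable by people.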

11
Multi-word verb recognition

Inflectional morphological properties are kept
The adverb siempre is detected between the core
verb and the idiom
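
The behaviour on this slide can be sketched in Python as follows. The multi-word verb tener en cuenta, the feature string, the tag names, and the function name are assumptions chosen for illustration. The sketch uses a plain loop, whereas SFN uses finite state transducers for this step.

```python
# Hypothetical multi-word verb table: core verb lemma -> list of idiom tails.
MULTIWORD_VERBS = {"tener": [("en", "cuenta")]}


def find_multiword_verbs(tokens):
    """tokens: list of (form, lemma, pos, features) tuples.
    Adverbs between the core verb and the idiom tail are skipped and reported."""
    hits = []
    for i, (_form, lemma, pos, feats) in enumerate(tokens):
        if pos != "V" or lemma not in MULTIWORD_VERBS:
            continue
        # Collect any adverbs inserted right after the core verb.
        j = i + 1
        adverbs = []
        while j < len(tokens) and tokens[j][2] == "ADV":
            adverbs.append(tokens[j][0])
            j += 1
        for tail in MULTIWORD_VERBS[lemma]:
            forms = tuple(t[0].lower() for t in tokens[j:j + len(tail)])
            if forms == tail:
                hits.append({
                    "lemma": lemma + " " + " ".join(tail),
                    "features": feats,  # inflection of the core verb is kept
                    "span": (i, j + len(tail)),
                    "adverbs": adverbs,
                })
    return hits


tokens = [
    ("Juan", "Juan", "NPROP", ""),
    ("tiene", "tener", "V", "3sg.pres.ind"),
    ("siempre", "siempre", "ADV", ""),
    ("en", "en", "PREP", ""),
    ("cuenta", "cuenta", "N", ""),
]
hits = find_multiword_verbs(tokens)
```

The match covers tiene siempre en cuenta, is lemmatized as tener en cuenta, and carries the inflectional features of tiene, with siempre reported separately as inserted material.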

12
Subcorporation Process

The internal tools GramCreator and XQS are used to
create subcorporation grammars

Request solicitud N-de-GN-de <PALABRA>
4 <NPRED> ( <APRED> <PALABRA> )
<de.PREP> ( (<PRON> (
( <E> <PREDET> ) ( <E> <DET>
<APOS> ) ( <E> <APRED> <VPREDPP>
) )) <N> (<NPROP> ( <E> <NPROP>
)) ) <de.PREP>
Solicitud grammar example: the syntactic
structure N-de-GN-de is detected

13
Subcorporation Process

Each grammar (regular expression) is converted to
a Finite State Transducer
The sentences for each LU are transduced with the
set of grammar FSTs to produce a set of subcorpora
The transduction process allows very efficient
process rates (100 transductions per second)
The subcorporation set is converted to XML and
imported to FNDesktop
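
The idea of sorting an LU's sentences into subcorpora by syntactic pattern can be sketched as below. This is an approximation under stated assumptions: Python regular expressions over a string of tags stand in for the grammar FSTs, the two patterns are simplified versions of an N-de-GN-de style grammar, and the tag names, the TARGET placeholder, and the "other" bucket are hypothetical.

```python
import re

# Hypothetical grammars, tried in order: subcorpus name -> regex over tag strings.
GRAMMARS = {
    "N-de-GN-de": re.compile(r"\bTARGET de (?:DET )?(?:A )?N de\b"),
    "N-de-GN": re.compile(r"\bTARGET de (?:DET )?(?:A )?N\b"),
}


def encode(tokens, target_lemma):
    """Turn (form, lemma, pos) tokens into a space-separated tag string."""
    parts = []
    for _form, lemma, pos in tokens:
        if lemma == target_lemma:
            parts.append("TARGET")
        elif lemma == "de" and pos == "PREP":
            parts.append("de")
        else:
            parts.append(pos)
    return " ".join(parts)


def subcorporate(sentences, target_lemma):
    """Assign each sentence to the first grammar that matches, else to 'other'."""
    subcorpora = {name: [] for name in GRAMMARS}
    subcorpora["other"] = []
    for tokens in sentences:
        encoded = encode(tokens, target_lemma)
        for name, pattern in GRAMMARS.items():
            if pattern.search(encoded):
                subcorpora[name].append(tokens)
                break
        else:
            subcorpora["other"].append(tokens)
    return subcorpora


s1 = [("la", "el", "DET"), ("solicitud", "solicitud", "N"), ("de", "de", "PREP"),
      ("la", "el", "DET"), ("empresa", "empresa", "N"), ("de", "de", "PREP"),
      ("transportes", "transporte", "N")]
s2 = [("una", "uno", "DET"), ("solicitud", "solicitud", "N"),
      ("de", "de", "PREP"), ("ayuda", "ayuda", "N")]
s3 = [("la", "el", "DET"), ("solicitud", "solicitud", "N"), ("llegó", "llegar", "V")]
subcorpora = subcorporate([s1, s2, s3], "solicitud")
```

The more specific pattern is listed first so that a sentence matching both patterns lands in the N-de-GN-de subcorpus. Sentences matching no grammar are kept in a separate bucket rather than dropped.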

14
Subcorporation Process
N-de-GN-de structure detection
15
Annotation Tool

SFN uses the FN annotation tool (FNDesktop) to
add semantic annotation to the LU subcorporation
sets
The FNClassifier has been adapted to Spanish: the
classifier has new rules adapted to the Spanish
tags and to Spanish local syntactic contexts

16
Annotation search tools (Web Reports)
17
Annotation search tools (Sato Tool)
