Title: Spanish FrameNet Project
1Spanish FrameNet Project
- Autonomous University of Barcelona
- Marc Ortega
2Spanish FrameNet Project
- Spanish FrameNet is a research project sponsored by the
Department of Education of Spain (Grant No. TSI2005-01200)
from December 2005 to December 2006.
- A new grant proposal has been submitted to the Spanish
Department of Education for the period 2007-2009.
- SFN is developed at the Autonomous University of
Barcelona (Spain) and the International Computer
Science Institute (Berkeley, CA) in cooperation
with the FrameNet Project.
- PI: Carlos Subirats; System Analyst: Marc Ortega;
2 linguists
3SFN Goals
- The Spanish FrameNet Project is creating an
online lexical resource for Spanish, based on
frame semantics and supported by corpus evidence.
- SFN will be available to the public by July 2007
- SFN will contain at least about 1,000 lexical items
(verbs, predicative nouns, adjectives, adverbs,
prepositions and entities) representative of a
wide range of semantic domains
- The aim is to document the range of semantic and
syntactic combinatory possibilities (valences) of
each word in each of its senses
4Frame Semantics
- Spanish FrameNet (SFN) uses FrameNet frames,
adapting and changing them where needed to fit
Spanish
- Some SFN frames are the same as in English FN (with
Spanish examples)
- Some SFN frames have the same name as an English FN
frame but differ from it (slightly different
definition, different FEs, or different core
sets)
- To adapt FN to Spanish, some FN frames are not used
and some new frames have been defined (new frames
use the same FN format), such as:
- Cause_to_halt
- Change_emotional_state
- Collapse
- Inventing
- Motion_backwards, Motion_interruption,
Motion_manner, Motion_medium, Motion_up_downwards
- Return
- Social_interaction
- Think_up
5Current Project Status
- Frames defined: 92
- Lexical units: 624
- Annotated: 413
- Subcorporated: 130
- Created but without subcorporation: 23
6Spanish FrameNet Corpus and Tools
- Spanish FrameNet uses a 350-million-word corpus
- It includes both European and New World Spanish
(40% and 60%, respectively)
- The SFN Corpus has been developed by the SFN
research team, since there are no (large) public
domain Spanish corpora available
- The SFN Corpus is lemmatized and tagged with a
set of in-house tools
- FNDesktop
- Web Reports
- Sato Tool
7The SFN tagging and chunking system
- The SFN Corpus is tagged and lemmatized using:
- An electronic dictionary of Spanish with 600,000
forms, expanded from a dictionary of 93,000 lemmas
- 66,000 single-word lexical units, like unir
(unite), inmoralidad (immorality), allí (there),
etc.
- 26,000 multi-word lexical units (MWLU), like
muerte cerebral (brain death), etc., which are
automatically expanded into 55,000 inflected MWLU
forms
- A corpus tagger that converts plain text to
deterministic finite state automata (DFSA)
- 2,000 finite state transducers (FST) for
multi-word verbs
- Transducers for heads of verbal phrases (compound
verbal tenses)
8The SFN tagging and chunking system
- The POS tagging process yields two corpus formats
- Automata Corpus
- IMS-CWB (Institut für Maschinelle
Sprachverarbeitung Corpus Workbench)
9Automata Corpus
- Lexical tagging (part-of-speech, lemma)
- Word ambiguities are represented in deterministic
finite state automata (DFSAs) as different
possible transitions between two consecutive
states
- Allows efficient word disambiguation
- Allows extended lexical tagging using automata
transduction
- Compound verbal forms tagging
- Multi-word verb recognition
- Very high processing rates
- Direct human access is almost impossible
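
The representation described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the SFN implementation: the toy lexicon, the tag names (DET, PRON, N, V, A, UNK) and the helper names build_sentence_automaton, count_readings and disambiguate are all hypothetical. Each ambiguous word contributes several parallel transitions between the same two consecutive states, and disambiguation amounts to deleting transitions.

```python
from collections import defaultdict

# Toy lexicon (hypothetical entries): surface form -> list of (lemma, POS) readings.
LEXICON = {
    "la": [("el", "DET"), ("lo", "PRON")],
    "casa": [("casa", "N"), ("casar", "V")],
    "blanca": [("blanco", "A")],
}

def build_sentence_automaton(tokens):
    """Return transitions[state] -> list of (next_state, form, lemma, pos).

    State i sits before token i; every reading of token i is a separate
    transition from state i to state i + 1.
    """
    transitions = defaultdict(list)
    for i, form in enumerate(tokens):
        for lemma, pos in LEXICON.get(form, [(form, "UNK")]):
            transitions[i].append((i + 1, form, lemma, pos))
    return transitions

def count_readings(transitions, n_tokens):
    """Number of distinct paths from state 0 to the final state."""
    total = 1
    for state in range(n_tokens):
        total *= len(transitions[state])
    return total

def disambiguate(transitions, keep):
    """Drop every transition rejected by keep(state, lemma, pos)."""
    return {
        state: [arc for arc in arcs if keep(state, arc[2], arc[3])]
        for state, arcs in transitions.items()
    }

tokens = ["la", "casa", "blanca"]
automaton = build_sentence_automaton(tokens)
# Keep only the determiner, noun and adjective readings.
pruned = disambiguate(automaton, lambda state, lemma, pos: pos in {"DET", "N", "A"})
```

Because the ambiguity is stored per word rather than per sentence, the number of readings grows multiplicatively while the structure itself grows only linearly with sentence length.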
10CWB Corpus
- Lexical tagging (part-of-speech, lemma)
- Text DFSAs are disambiguated and converted to XML
format
- Unambiguous corpus
- Allows human access to corpus contents
- Allows human corpus search
- Corpus contents are codified and indexed for an
efficient corpus search
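
As a rough sketch of what an unambiguous, searchable corpus can look like, the Python below writes disambiguated tokens in the "verticalized" layout that IMS-CWB is commonly fed (one token per line, tab-separated attributes, XML-style tags around each sentence) and builds a toy lemma index. The slides do not give the exact layout or indexing scheme used by SFN, so the attribute order, the s tag, and the function names to_vertical and build_lemma_index are assumptions.

```python
from collections import defaultdict

def to_vertical(sentences):
    """Serialize sentences as one token per line: form, POS, lemma (tab-separated)."""
    lines = []
    for sentence in sentences:
        lines.append("<s>")
        for form, pos, lemma in sentence:
            lines.append(f"{form}\t{pos}\t{lemma}")
        lines.append("</s>")
    return "\n".join(lines)

def build_lemma_index(sentences):
    """Map each lemma to the corpus positions (token offsets) where it occurs."""
    index = defaultdict(list)
    position = 0
    for sentence in sentences:
        for _form, _pos, lemma in sentence:
            index[lemma].append(position)
            position += 1
    return index

# Hypothetical disambiguated input: exactly one (form, POS, lemma) per token.
sentences = [
    [("la", "DET", "el"), ("solicitud", "N", "solicitud")],
    [("las", "DET", "el"), ("solicitudes", "N", "solicitud")],
]
vertical = to_vertical(sentences)
index = build_lemma_index(sentences)
```

Looking up a lemma in the index returns every inflected occurrence without scanning the text, which is the general idea behind an indexed corpus search.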
11Multi-word verb recognition
- Inflectional morphological properties are kept
- The adverb siempre (always) is detected between the
core verb and the idiom
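
The behaviour described on this slide can be illustrated with a small Python sketch. It is not the SFN transducer: the multi-word verb tener en cuenta (take into account), the example sentence, the tag names and the function find_multiword_verbs are hypothetical and chosen only for illustration. The matcher anchors on the verb lemma, so the inflected form is kept, and it tolerates an adverb between the core verb and the idiom.

```python
def find_multiword_verbs(tokens, verb_lemma, idiom, max_adverbs=1):
    """Find a multi-word verb in tokens, a list of (form, lemma, pos).

    The verb is matched by lemma, so any inflected form is accepted and
    reported; up to max_adverbs adverbs may sit between verb and idiom.
    """
    matches = []
    for i, (form, lemma, pos) in enumerate(tokens):
        if lemma != verb_lemma or pos != "V":
            continue
        j = i + 1
        adverbs = []
        while j < len(tokens) and tokens[j][2] == "ADV" and len(adverbs) < max_adverbs:
            adverbs.append(tokens[j][0])
            j += 1
        idiom_span = [t[0] for t in tokens[j:j + len(idiom)]]
        if idiom_span == list(idiom):
            matches.append({
                "lemma": " ".join([verb_lemma, *idiom]),
                "verb_form": form,
                "inserted_adverbs": adverbs,
                "span": (i, j + len(idiom)),
            })
    return matches

# Hypothetical tagged sentence: "María tiene siempre en cuenta los plazos".
sentence = [
    ("María", "María", "NPROP"),
    ("tiene", "tener", "V"),
    ("siempre", "siempre", "ADV"),
    ("en", "en", "PREP"),
    ("cuenta", "cuenta", "N"),
    ("los", "el", "DET"),
    ("plazos", "plazo", "N"),
]
matches = find_multiword_verbs(sentence, "tener", ("en", "cuenta"))
```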
12Subcorporation Process
- Internal tools GramCreator and XQS are used to
create subcorporation grammars
Request solicitud N-de-GN-de <PALABRA>
4 <NPRED> ( <APRED> <PALABRA> )
<de.PREP> ( (<PRON> (
( <E> <PREDET> ) ( <E> <DET>
<APOS> ) ( <E> <APRED> <VPREDPP>
) )) <N> (<NPROP> ( <E> <NPROP>
)) ) <de.PREP>
Solicitud grammar example: the syntactic
structure N-de-GN-de is detected
13Subcorporation Process
- Each grammar (regular expression) is converted to
a finite state transducer (FST)
- Each LU subcorpus is transduced with a set of
grammar FSTs to produce a set of subcorpora
- The transduction process is very efficient
(100 transductions per second)
- The set of subcorpora is converted to XML and
imported into FNDesktop
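
The general idea of sorting an LU's sentences into subcorpora by syntactic pattern can be sketched in Python. This sketch swaps in ordinary regular-expression matching over POS-tag strings in place of compiled finite state transducers, and the two grammars, the tag names, the example sentences and the function names are simplified assumptions rather than the GramCreator or XQS formats.

```python
import re
from collections import defaultdict

# A tag must start at the beginning of the string or right after a space.
BOUNDARY = r"(?<![^ ])"

# Hypothetical, heavily simplified grammars over space-terminated POS tags.
GRAMMARS = {
    "N-de-GN-de": re.compile(
        BOUNDARY + r"N de\.PREP (?:DET )?(?:A )?N (?:A )?de\.PREP "
    ),
    "N-A": re.compile(BOUNDARY + r"N A "),
}

def tag_string(sentence):
    """Encode a sentence, a list of (form, tag), as 'TAG TAG ... ' with trailing space."""
    return "".join(tag + " " for _form, tag in sentence)

def subcorporate(sentences):
    """Place each sentence in the subcorpus of every grammar that matches it."""
    subcorpora = defaultdict(list)
    for sentence in sentences:
        tags = tag_string(sentence)
        for name, pattern in GRAMMARS.items():
            if pattern.search(tags):
                subcorpora[name].append(" ".join(form for form, _tag in sentence))
    return dict(subcorpora)

# Hypothetical tagged sentences for the LU solicitud (request).
sentences = [
    [("solicitud", "N"), ("de", "de.PREP"), ("los", "DET"),
     ("vecinos", "N"), ("de", "de.PREP"), ("Madrid", "NPROP")],
    [("solicitud", "N"), ("urgente", "A")],
    [("presentó", "V"), ("una", "DET"), ("solicitud", "N")],
]
subcorpora = subcorporate(sentences)
```

Sentences that match no grammar, like the third one here, simply end up in no subcorpus.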
14Subcorporation Process
N-de-GN-de structure detection
15Annotation Tool
- SFN uses the FN annotation tool (FNDesktop) to
add semantic annotation to the LU subcorporation
sets
- The FNClassifier has been adapted to Spanish: the
classifier has new rules adapted to the Spanish
tags and Spanish local syntactic contexts
16Annotation search tools (Web Reports)
17Annotation search tools (Sato Tool)