Title: Using bilingual LSA for FN annotation of French text from generic resources
1. Using bilingual LSA for FN annotation of French text from generic resources
- Guillaume Pitel - LORIA/LED
- FR.FrameNet Project
- Funded by France-Berkeley Fund
2. Outline
- The (small) FR.FrameNet project
- The projection problem
- Realizations
- French Frames database
- Annotated reference sub-corpus
- English semantic clusters from FEs
- Projection into French
- Other potential applications
3. The (small) FR.FrameNet project
- A Berkeley-Nancy collaboration, funded by the France-Berkeley Fund
- ICSI, ATILF, LORIA
- French participants: Susanne Alt, Benoît Crabbé, Christiane Jadelot, Guillaume Pitel, Laurent Romary
- Setting the foundations for a cheap bootstrapping of a French FrameNet
- Reusing existing French lexical semantic resources
- Reusing any available resources
- Focus on automatic methods
4. The projection problem
- Use a semantic lexicon in language A to annotate a corpus in language B
- Resulting data is expected to be of much lower quality than a handcrafted lexicon
- It is a bootstrapping process that requires manual correction
- Important question: does it really speed up the final production?
5. Pado & Lapata approach
- Using a Source language/Target language parallel corpus
- The Source side of the corpus must be FN-annotated
- The roles are projected onto the Target corpus
- Train a statistical semantic role parser for the Target language
- Automatic annotation of any corpus in the Target language
6. Pado & Lapata approach
- Problems
- translation is not frame-conserving in many cases (20-30%)
- parallel corpora are a rare resource
- Berkeley's FrameNet is not built on the English side of a parallel corpus
- But very useful with a resource like Europarl
7. The main bottleneck
- Existence of parallel AND annotated corpora: rare and expensive to build
- But
- Annotated corpora are available
- Parallel, aligned corpora are available
8. The Semantic Space based approach (using LSA)
- Pure semantic annotation
- no grammatical function
- no POS
- Use a bilingual LSA space to make the projection
- Preparation
- Find the lexical units in the Target language that fit each frame
- Use an available resource
- Compute them automatically
- Compute the semantic clusters of each frame element
9. The Semantic Space based approach (using LSA)
- Usage: automatic preannotation (or selection)
- For each sentence in the Target corpus
- Find potential frames from LUs
- Compare each word (or head of constituent) of the sentence with the computed semantic clusters of the (core) roles of the candidate frames (or the corresponding roles in parents if training data is missing)
- Candidate Frames and FEs are rated by the semantic distance
- What we can expect
- Can't deal with anaphora
- Can't deal with FEs that are not semantically narrow
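The preannotation loop on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's code: the toy vectors, the `LU_TO_FRAMES` and `ROLE_CENTERS` tables, the frame name `Other_frame`, and the use of cosine similarity as the semantic distance are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy data: positions of French lemmas in a 3-d semantic space.
WORD_VECTORS = {
    "marie": [0.9, 0.1, 0.1],
    "seau": [0.1, 0.1, 0.9],
    "eau": [0.1, 0.9, 0.2],
}
# Hypothetical LU table: which frames a French lemma may evoke.
LU_TO_FRAMES = {"remplir": ["Filling", "Other_frame"]}
# Hypothetical cluster centers for the core roles of each candidate frame.
ROLE_CENTERS = {
    "Filling": {"Agent": [1, 0, 0], "Theme": [0, 1, 0], "Goal": [0, 0, 1]},
    "Other_frame": {"Role_x": [1, 1, 1]},
}

def rate_frames(lemmas):
    """Rate each candidate frame by the mean best role similarity of the other words."""
    results = []
    for lu in lemmas:
        for frame in LU_TO_FRAMES.get(lu, []):
            assignments = {}
            for word in lemmas:
                if word == lu or word not in WORD_VECTORS:
                    continue
                # Pick the role whose center is closest to this word.
                role, sim = max(
                    ((r, cosine(WORD_VECTORS[word], c))
                     for r, c in ROLE_CENTERS[frame].items()),
                    key=lambda pair: pair[1],
                )
                assignments[word] = (role, sim)
            if assignments:
                score = sum(s for _, s in assignments.values()) / len(assignments)
                results.append((frame, score, assignments))
    return sorted(results, key=lambda r: r[1], reverse=True)

ranked = rate_frames(["marie", "remplir", "seau", "eau"])
best_frame, best_score, best_roles = ranked[0]
```

In this toy setting the frame with narrow, well-separated role centers ranks first. A pronoun in place of "marie" would carry no usable position, which is the anaphora limitation listed above.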
10. Subprojects
- Convert frames to French
- Using the ISC Semantic Atlas (built from 2 synonym dictionaries and a minimal FR-EN parallel corpus)
- Annotation of reference subcorpus
- 1000 sentences from Europarl
- Projection using LSA
11. Convert Frames to French
12. English LUs to French LUs
- For each Frame in Berkeley FrameNet
- For each LU, find potential translations in French, using the Semantic Atlas (Ploux & Ji, 2003)
- other languages?
- Compute the French profile of the Frame
- Manually check that a lemma can actually evoke the frame (pure subjective judgment)
- Frame-by-frame procedure
- Must be validated later by corpus evidence
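A rough sketch of the candidate-gathering step, before the manual check. The `TRANSLATIONS` table is a hypothetical stand-in for Semantic Atlas lookups (its entries are copied from the translation slides of this deck), and reading the "French profile" as the set of candidate lemmas with the English LUs that support them is an assumption of this example.

```python
# Hypothetical stand-in for dictionary lookups; entries taken from the
# translation lists shown in this deck.
TRANSLATIONS = {
    "adorn": ["chamarrer", "embellir", "enjoliver", "orner", "parer", "revêtir"],
    "butter": ["beurrer"],
    "coat": ["empâter", "enduire", "enrober", "revêtir"],
}
# A small excerpt of the English LUs of the Filling frame.
FRAME_LUS = {"Filling": ["adorn", "butter", "coat", "douse"]}

def french_profile(frame):
    """Map each candidate French lemma to the English LUs that produced it."""
    profile = {}
    missing = []  # English LUs for which no translation was found
    for lu in FRAME_LUS[frame]:
        candidates = TRANSLATIONS.get(lu, [])
        if not candidates:
            missing.append(lu)
        for lemma in candidates:
            profile.setdefault(lemma, []).append(lu)
    return profile, missing

profile, missing = french_profile("Filling")
# Show lemmas supported by several English LUs first, to guide the manual check.
ranked = sorted(profile, key=lambda lemma: (-len(profile[lemma]), lemma))
```

Sorting by the number of supporting LUs is only one possible way to order candidates for the annotator. The final decision stays manual, as the slide says.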
13. Lexical units in the Filling frame
- adorn.v, anoint.v, asphalt.v, brush.v, butter.v,
coat.v, cover.v, cram.v, crowd.v, dab.v, daub.v,
douse.v, drape.v, drizzle.v, dust.v, embellish.v,
fill.v, flood.v, gild.v, glaze.v, hang.v, heap.v,
inject.v, jam.v, load.v, pack.v, paint.v,
panel.v, pave.v, pile.v, plant.v, plaster.v,
pump.v, scatter.v, seed.v, shower.v, smear.v,
sow.v, spatter.v, splash.v, splatter.v, spray.v,
spread.v, sprinkle.v, squirt.v, strew.v, stuff.v,
suffuse.v, surface.v, tile.v, varnish.v,
wallpaper.v, wrap.v
14. Translations 1/4
- Adorn: chamarrer, embellir, enjoliver, orner, parer, revêtir
- Anoint: oindre
- Asphalt: asphalter, bitumer
- Brush: badigeonner, brosser, effleurer
- Butter: beurrer
- Coat: empâter, enduire, enrober, revêtir
- Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser
- Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser
- Crowd: foule (should also be peupler)
- Dab: bassiner, tamponner, toucher
- Daub: badigeonner, barbouiller, peinturlurer
- Douse: ???
- Drape: draper
- Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner
- Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter
- Embellish: broder, embellir, enjoliver, orner
- Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir
15. Manual selection 1/4
- Adorn: chamarrer, embellir, enjoliver, orner, parer, revêtir
- Anoint: oindre
- Asphalt: asphalter, bitumer
- Brush: badigeonner, brosser, effleurer
- Butter: beurrer
- Coat: empâter, enduire, enrober, revêtir
- Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser
- Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser
- Crowd: foule (should also be peupler)
- Dab: bassiner, tamponner, toucher
- Daub: badigeonner, barbouiller, peinturlurer
- Douse: ???
- Drape: draper
- Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner
- Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter
- Embellish: broder, embellir, enjoliver, orner
- Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir
16. Frame building: Conclusion
- Quite inexpensive compared to introspection from scratch or a corpus-based approach (Filling is a big frame with a lot of LUs; it took me 30 min to select good instances, including manual color setting)
- Probably far from perfect coverage, low precision
- Need several annotators to duplicate the work
17. Our approach to cross-language semantic annotation
- The goal
- A lemma can be related to several Frames
- We want to disambiguate between the possible choices
- And also try to attribute roles (at least core roles) once we have made the choice
- All of this in French, while we have the training data in English
18. Bilingual LSA approach
19. Latent Semantic Analysis
- Improvement of cooccurrence matrices
- Reduce the number of dimensions
- Example
- A occurs in documents (or contexts) 1,2,3
- B in 2,3,4,5
- C in 4,5,6
- A and C never occur in the same document
- LSA would make it possible to reduce documents 1-6 to a single dimension, in which A and C become close because both co-occur with B
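The A/B/C example can be worked through numerically. The sketch below uses only the standard library and computes the single strongest dimension by power iteration on the term-by-term matrix, which is one way to obtain a rank-1 truncated SVD. It is an illustration of the idea, not the Infomap-NLP procedure.

```python
import math

# Rows: terms A, B, C. Columns: documents 1-6 (1 = the term occurs).
M = [
    [1, 1, 1, 0, 0, 0],  # A
    [0, 1, 1, 1, 1, 0],  # B
    [0, 0, 0, 1, 1, 1],  # C
]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term-by-term Gram matrix G = M * M^T.
G = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in M] for row_i in M]

# Power iteration converges to the dominant eigenvector of G, i.e. the
# first left singular vector of M (the one dimension kept here).
u = [1.0, 1.0, 1.0]
for _ in range(100):
    u = [sum(g * x for g, x in zip(row, u)) for row in G]
    length = math.sqrt(sum(x * x for x in u))
    u = [x / length for x in u]

# Singular value: square root of the dominant eigenvalue u^T G u.
sigma = math.sqrt(sum(u[i] * G[i][j] * u[j] for i in range(3) for j in range(3)))
coords = [[sigma * x] for x in u]  # 1-d positions of A, B, C

raw_ac = cosine(M[0], M[2])            # similarity in the original document space
lsa_ac = cosine(coords[0], coords[2])  # similarity after the reduction
```

In the raw matrix A and C have similarity 0, while in the reduced space they coincide. A single dimension is an extreme case chosen to match the slide; with more dimensions kept, the effect is less drastic.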
20. Evaluating the semantic position of Frame Elements in LSA
- Computing an English LSA space
- Tools: TreeTagger + Infomap-NLP
- Corpus: BNC + English part of Europarl + translation of Balzac
- Tokens are POS + lemma pairs, e.g. NN year
- Keep only Verbs, Adjectives, Nouns, Adverbs
- Other combinations (no POS, all POS, raw form) don't perform as well
21. Example
- FE annotations for Filling.Theme
- with water.
- with a fungicide such as green or yellow sulphur.
- with a soft brush and malathion dust.
- with a little cayenne pepper.
- ...
- Terms used for the FE's representation
- NN water, NN fungicide, JJ such, JJ green, JJ yellow, NN sulphur, JJ soft, NN brush, NN malathion, NN dust, JJ little, NN cayenne, NN pepper
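The step from an annotated FE span to its terms can be sketched as below. The tagged input is hand-written here as a hypothetical stand-in for tagger output, and the tag prefixes, the two-letter tag truncation, and the `_` separator are assumptions of this example rather than the format actually used.

```python
# Tag prefixes treated as nouns, adjectives, verbs and adverbs
# (assumption: Penn-style tags).
CONTENT_PREFIXES = ("NN", "JJ", "VB", "VV", "RB")

def to_terms(tagged, sep="_"):
    """Keep content words only and return POS+lemma terms."""
    terms = []
    for lemma, tag in tagged:
        if tag.startswith(CONTENT_PREFIXES):
            # Truncate the tag to its coarse two-letter class.
            terms.append(tag[:2] + sep + lemma.lower())
    return terms

# Hypothetical tagger output for "with a soft brush and malathion dust."
tagged = [
    ("with", "IN"), ("a", "DT"), ("soft", "JJ"), ("brush", "NN"),
    ("and", "CC"), ("malathion", "NN"), ("dust", "NN"), (".", "SENT"),
]
terms = to_terms(tagged)
```

Function words such as "with", "a" and "and" drop out, which matches the term list shown on the slide for this example.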
22. Evaluating FEs' semantic coherence
- Compute the semantic center of the FE: the center of the positions of all the FE's terms
- Find the N nearest neighbors of this center
- If the center is in a semantically coherent region, the average similarity between neighbors and center is high.
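A minimal sketch of this coherence measure, assuming cosine similarity and an unweighted mean as the center. The 2-d `SPACE` vectors are made up for the illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def center(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def coherence(fe_terms, space, n_neighbors):
    """Average similarity between the FE center and its N nearest neighbors."""
    c = center([space[t] for t in fe_terms if t in space])
    ranked = sorted(((cosine(c, vec), word) for word, vec in space.items()),
                    reverse=True)
    nearest = ranked[:n_neighbors]
    average = sum(sim for sim, _ in nearest) / len(nearest)
    return average, nearest

# Hypothetical toy space: substances on one side, time words on the other.
SPACE = {
    "water": [0.9, 0.1], "powder": [0.8, 0.2], "paste": [0.85, 0.15],
    "monday": [0.1, 0.9], "week": [0.2, 0.8], "year": [0.15, 0.95],
}

tight, tight_neighbors = coherence(["water", "powder"], SPACE, 3)
loose, loose_neighbors = coherence(["water", "monday"], SPACE, 3)
```

An FE whose terms sit in one region gets a center surrounded by close neighbors, while an FE mixing unrelated terms gets a center in an empty area and a lower average. The FE's own terms are not excluded from the neighbor list in this sketch.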
23. FEs of Filling
Frame.FE            Average   Std        Min       Max       Nb annot
Filling.Agent       0.604941  0.0413504  0.563591  0.717469  279
Filling.Cause
Filling.Degree      0.595513  0.0431123  0.552401  0.697830  4
Filling.Depictive   0.683302  0.0502735  0.633029  0.804053  1
Filling.Goal        0.6483    0.0510976  0.597202  0.793063  543
Filling.Instrument  0.646028  0.0715617  0.574466  0.844308  4
Filling.Manner      0.647012  0.0795992  0.567413  0.896142  25
Filling.Means       0.67356   0.0502949  0.623265  0.820630  1
Filling.Path        0.708096  0.069683   0.638413  0.925448  2
Filling.Place       0.562765  0.0364663  0.526299  0.683526  2
Filling.Purpose     0.631099  0.0585047  0.572594  0.761788  5
Filling.Result      0.734567  0.0585102  0.676057  0.825459  37
Filling.Source      0.611222  0.0447367  0.566485  1.000000  1
Filling.Subregion   0.782659  0.0756196  0.707039  0.944916  2
Filling.Theme       0.747146  0.0485786  0.698567  0.890307  450
Filling.Time        0.474269  0.0474972  0.426772  0.628049  16
24. Neighbors of Filling.Theme
- powder 0.890307
- spray 0.836283
- dry 0.821666
- crushed 0.820905
- charcoal 0.813571
- plastic 0.806768
- copper 0.804459
- paste 0.802643
- foam 0.802201
- brush 0.799847
- Computed from: with fake diamonds. with pictures of cute white bunnies. with jewels and fine gowns. with one of these pegs. with pictures, flowers, and messages of peace. with wreaths of flowers and garlands of feathers. with the finest furniture from a firm in London's New Bond Street. with a crown. with beautifully hooked melodies and harmonies. with chrism, the sacred ointment,. with gel. with such a leaden armour of expectations. with the poison. with these substances. with vaseline. with his pungent urine. with holy oil. in bulb fibre. in whipped cream and honey. with a foot of topsoil. with her hand.
25. Neighbors of Filling.Agent
- oliver 0.717469
- jack 0.696716
- joe 0.691628
- marie 0.686812
- harry 0.684113
- charlie 0.681887
- billy 0.680378
- tom 0.678887
- jane 0.676179
- rose 0.669748
- Computed from: Your man. I. They. The priests. He. the wife of Cnut's henchman Tofi the Proud. The Reclusiarch. she. What father. The Indians. Over 200 species of birds. He. He. Father Peter. Viktor. by ecclesiastics. We. One girl. She. she. he. the white gravel. the reluctant soldier. I. Eva. he. Two people. he. the good beachcombers. Sylvester. he. He. Two girls. you. Cecil Beaton. you. Larsen. you. He. you. you. He. he. she. Mina and K. She. you. she. the programme that turns the cameras on teenagers and let's them do the talking and the interviews. Baldwin. by Molly Fletcher. She. I. They. she. Endill. They. He. the BBC and official propaganda
26. FE clusters
- Grouping terms of the FE by minimal distance (arbitrarily set), i.e. 0.8 74
- Keeping clusters with more than 5% of terms
- http://guillaume.work.free.fr/Frames.en.3
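One way to picture the grouping step is threshold-based single-link clustering, sketched below. The slide does not say which linkage was used, so single link is an assumption, as are the toy vectors; the 0.8 threshold follows the slide, and the minimum cluster size is left as a parameter.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_terms(vectors, threshold=0.8, min_size=2):
    """Link terms whose similarity reaches the threshold; keep large enough groups."""
    terms = list(vectors)
    parent = {t: t for t in terms}

    def find(t):
        # Union-find root lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return [sorted(g) for g in groups.values() if len(g) >= min_size]

# Hypothetical 2-d positions for a few terms of an Agent-like FE.
VECTORS = {
    "rachel": [1.0, 0.1], "sara": [0.9, 0.2],
    "tom": [0.1, 1.0], "john": [0.2, 0.9],
    "gravel": [-1.0, 0.5],
}
clusters = cluster_terms(VECTORS)
```

Isolated terms such as "gravel" end up in singleton groups and are discarded by the size filter, which is how an FE with scattered terms can end with zero clusters.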
27. Clusters of the Filling frame
- Agent 2 cluster(s)
- Degree 4 cluster(s)
- Depictive 6 cluster(s)
- Goal 2 cluster(s)
- Instrument 6 cluster(s)
- Manner 2 cluster(s)
- Means 2 cluster(s)
- Path 1 cluster(s)
- Place 5 cluster(s)
- Purpose 1 cluster(s)
- Result 2 cluster(s)
- Source 1 cluster(s)
- Subregion 1 cluster(s)
- Theme 2 cluster(s)
- Time 0 cluster(s)
28. Clusters: Filling.Agent
- Cluster 1: rachel 0.867663, sara 0.863332, ellen 0.856612, lily 0.855513, sally 0.853933, alice 0.849205, emily 0.847480, dad 0.845598, jenny 0.844003, kate 0.839664, maggie 0.836391
- Cluster 2: tom 0.924026, john 0.908828, hugh 0.898049, michael 0.897622, scott 0.892861, sir 0.891623, david 0.889539, frank 0.889324, murray 0.879660, anthony 0.879149, geoffrey 0.876748
29. Clusters: Filling.Goal
- Cluster 1: tin 0.924426, pot 0.908988, jar 0.908169, cake 0.893367, bottle 0.888083, bag 0.871596, jug 0.866099, bowl 0.860658, basket 0.858857, plastic 0.852992, dish 0.846176, peel 0.834313
- Cluster 2: wall 0.911646, wooden 0.864492, entrance 0.851708, front 0.846124, floor 0.834214, porch 0.834039, staircase 0.827131, roof 0.823297, rear 0.815847, corner 0.815765, rear 0.813187, front 0.813136
30. Clusters: Filling.Theme
- Cluster 1: powder 0.913015, salt 0.907773, dry 0.900202, aromatic 0.886529, vegetable 0.870903, spray 0.867004, bean 0.860508, herb 0.858321, meat 0.852165, apple 0.848998, vinegar 0.848045, pea 0.845492
- Cluster 2: shiny 0.915945, red 0.908281, pink 0.905748, tint 0.900729, grey 0.899490, yellow 0.882565, blue 0.882097, white 0.877434, ribbon 0.876266, brown 0.875334, pale 0.875016, silk 0.865824
31. Projection
- Compute French clusters from English clusters
- Corpus collection
- Europarl (French-English)
- Parallel French-English Balzac from Project Gutenberg
- French-English parallel corpus: 50M lemmas
- Shakespeare, Hansard Corpus to be included
32. Training data
- Lemmas interleaved on a sentence alignment basis
- Training with a larger window
- Only parallel corpus: experiments that introduce bits of pure monolingual corpus show a quality loss
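Interleaving on a sentence-alignment basis could look like the sketch below. The exact scheme is not given on the slide, so the strict one-to-one alternation and the `en:` / `fr:` prefixes (used here to keep homographs of the two languages apart) are assumptions of this example.

```python
from itertools import zip_longest

def interleave(en_lemmas, fr_lemmas):
    """Alternate English and French lemmas of one aligned sentence pair."""
    mixed = []
    for en, fr in zip_longest(en_lemmas, fr_lemmas):
        if en is not None:
            mixed.append("en:" + en)
        if fr is not None:
            mixed.append("fr:" + fr)
    return mixed

# One hypothetical aligned pair, already lemmatized.
english = ["he", "drink", "water"]
french = ["il", "boire", "de", "le", "eau"]
mixed = interleave(english, french)
```

Because the two sentences are mixed token by token, a co-occurrence window slides over lemmas of both languages at once. A larger window, as mentioned on the slide, makes it more likely that a word and its translation fall inside the same window even when word order differs.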
33. Similarity between translations in the Bilingual Semantic Space
- Results
- eat / manger 0.98 (32)
- fleuve / river 0.94 (55)
- green / vert 0.83 (92)
- bleu / blue 0.87 (81)
- eat / fleuve 0.77 (107)
- drink / écran 0.82 (96)
34-36. Neighborhood in Bilingual Semantic Space
37. Projection: Conclusion
- Projecting whole clusters gives variable results
- Results in the projection are very disappointing
- Unusable in this state
- Seems that it may simply come from alignment mistakes
- Can we improve the projected clusters with a bilingual dictionary?
- Relating clusters to Synsets? Not necessarily a good idea: Champagne and Caviar are not related in WN
- More generally, simple translation may cause undesired broadening of the cluster
38. Potential application
- Statistical processing is interesting because it can capture usage-based regularities
- Clusters built with LSA can be interesting information sources for the lexicographer
- They can also more simply be used to automatically find new semantic types/selectional preferences emerging from the annotation of a new domain (metaphors occurring frequently, for instance)
- In a multilingual, collaborative annotation task, they could be useful for transferring work between languages without requiring annotation of a parallel corpus.