Methodological provisions in the construction of idiom resources

1
Methodological provisions in the construction of idiom resources
Collocations and idioms 2006: Linguistic, computational and psycholinguistic perspectives
Berlin, Nov. 3, 2006
Eric Laporte
Institut Gaspard-Monge
Université de Marne-la-Vallée
France
http://www-igm.univ-mlv.fr/laporte/
2
Why construct resources describing idioms?
Defining objectives of quality
- accuracy
- coverage
Data- or computer-based provisions
- corpus attestations
- statistical analyses
- golden-standard-based evaluation
Human-based provisions
- objective
- introspective
3
Why construct resources describing idioms?
Linguistic interest
- Idioms make up a large part of languages
Computer applications
- Text analysis for information retrieval, information extraction, translation...
- Text generation
4
What kinds of idioms?
Verbal: make ends meet
Adverbial: in the long term
Nominal: American coffee
Adjectival: rough and ready
Prepositional phrase: in a hurry
Not support verb constructions: make a decision
5
1. Defining objectives of quality
Goals and methodological provisions must be adapted to each other
Provisions depend on goals
- Provisions are responses to goal-specific risks
- Example. Objective: know idioms in 1st century AD Latin. Provision: gather 1st century AD Latin text
- More ambition, more methodological provisions
Compatibility between objectives and provisions
- Example. Objective: know idioms in 1st century AD Latin. Provision: human control over acceptability of idioms
- Trade-off between ambition and provisions
6
Defining objectives of quality
General objective of quality
- Conformity with linguistic reality
- Inclusion of all relevant information
Realistic goals
- Already attained for some languages
7
Selected objectives of quality: accuracy
Complementarity with grammar
- The salmons swim up the river: grammar
- John drank up his beer: grammar
- Mike gave up the piano: idiom resource
- Compositional: grammar
- Non-compositional: idiom resources
- In fact, idioms also require a grammar
Formalization of description
- Conventional dictionaries, second-language grammars... are interesting but not formalized enough for computer exploitation
8
Selected objectives of quality: consistency between intended and actual coverage
Independence from authors' idiolects
- au petit bonheur la chance (Fr.)
- au petit bonheur de la chance (my idiolect), discovered through paper reviewing
Geographical limits
- Luc amuse le temps (Québec)
- Luc amuse le temps (France)
Limits with respect to language plays
Inclusion of variants
Recall or completeness vs. silence or undergeneration
Precision vs. noise or overgeneration
9
Intended and actual coverage
Completeness vs. undergeneration
Examples of undergeneration
- Neglecting variants: in the long term, in the very long term
- Considering an idiom as compositional (i.e. taken into account by grammar): pomme de terre (recent conversation with a linguist)
10
Intended and actual coverage
Precision vs. overgeneration
Inclusion of obsolete idioms (out-of-date dictionaries)
- It rains cats and dogs (?)
Admission of unacceptable variants
- John is on the verge of giving up again
- John is on a new verge of giving up
Checking lemmas, not inflected forms
- Il faut voir les choses en face: idiomatic meaning
- Il faut voir la chose en face: no idiomatic meaning
11
Intended and actual coverage
The linguistic notion underlying over- and undergeneration is obviously that of constraints
Example: co-reference of possessives
- Leurs hôtes préviennent leurs désirs: not necessarily co-referent to subject
- Leurs hôtes reprennent leurs esprits: co-referent to subject (cf. lose one's temper)
12
Intended and actual coverage
Goals with respect to language plays
Examples: creative reworking of lexicalised metaphors
- John spilled the beans: lexicalised
- John spilled the beans of their relationship: lexicalised
- John spilled coffee on the bed and the beans of their relationship with it: creative
- La direction a jeté le bébé avec l'eau du bain: lexicalised
- La direction a jeté le bébé de la qualité avec l'eau du bain de la formation: creative
13
Intended and actual coverage
A realistic goal
Include
- Fully lexicalised forms
- Limits of variation of fully lexicalised forms
Exclude
- Creative reworking
A basis for future studies about creative reworking
14
Intended and actual coverage
Syntactic variants
- Someone spilled the beans: idiomatic
- The beans were spilled: idiomatic
- The beans spilled: not idiomatic
A realistic goal
- Describe idiomatic variants of idioms
- Link all variants of each idiom
- Ex.: Freckleton 1985, Machonis 1985
A common overgeneralization
- A frequent base form, infrequent variants
- Luc n'a pas été gâté par la nature: more frequent
- La nature n'a pas gâté Luc: less frequent, active
15
Other objectives of quality
Less relevant
- psychological plausibility of description
- etymology
- ...
16
2. Data- and computer-based provisions
Corpus linguistics
A reaction to biased introspective linguistics
- normativity
- idiolect generalization
- tendency to disregard contexts
- reliance on incomplete conventional dictionaries
- necessity of updates
Convergence with computational linguistics
- Automatization of corpus linguistics
17
Corpus attestations
Attestations give information about existence and frequency of idioms (example: the 'Collocations in the German Language' project)
- Balanced corpora
- Annotated corpora
- The web as corpus (example: the BFQS project)
Recognising the limits of language plays
- Context: headlines, advertisement...
- Requires intuition also
18
Corpus attestations
Concordancers
- Most corpus linguists use concordancers without lexicons
- Unitex, an open-source generator of lemmatized concordances from raw corpora: http://igm.univ-mlv.fr/unitex
- Contains lexicons produced through introspective approaches
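As an illustration of what a concordancer without a lexicon produces, here is a minimal keyword-in-context sketch in Python. It is not Unitex and does not use any Unitex API; the function name `kwic`, the naive tokenizer and the sample text are assumptions made for this example. It matches raw word forms only, which is exactly the limitation that lemmatized concordances are meant to overcome.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    # Naive tokenizer (assumption): lowercase, keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Sample text built from the talk's own example idiom.
sample = "John spilled the beans. The beans were spilled by John."
for line in kwic(sample, "beans"):
    print(line)
```

Searching for "beans" finds both occurrences, but a search for "spill" would find nothing, since only the inflected form "spilled" occurs in the text.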
19
Corpus attestations
Results
- Conventional dictionaries (e.g. COBUILD) for human users
Problem
- No attestations of unacceptability
- petite cuillère 'tea spoon' absent from a large Canadian corpus of French texts
- Corpus-dependent information about frequency can be in contradiction with real language use (Garrigues 1993)
20
Statistical analysis
Can be seen as a methodological provision against subjectivity
- For many researchers, other motivations: more fun ('Manual construction of resources is tedious'), better salaries?...
Example
- Statistical attraction as a sign of frozenness
- Similarity of contexts as a sign of semantic proximity
- More efficient on technical terms than on verbal idioms
Human revision
- Required (methodological provisions: human-based, part 3)
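One common way to quantify "statistical attraction" between adjacent words is pointwise mutual information (PMI). The talk does not name a specific measure, so PMI is a swapped-in choice here, and the function name `pmi` and the toy corpus are assumptions. A minimal sketch:

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information (in bits) of the adjacent pair (w1, w2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams[(w1, w2)] / (len(tokens) - 1)
    if p_pair == 0:
        return float("-inf")  # the pair never occurs in this order
    p1 = unigrams[w1] / len(tokens)
    p2 = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p1 * p2))

# Toy corpus (assumption), far too small for real use.
tokens = "in the long term we plan for the long term and the short run".split()
print(pmi(tokens, "long", "term"))  # strongly attracted pair
print(pmi(tokens, "the", "long"))   # weaker: "the" also precedes other words
```

A high score only flags a candidate. As the slide says, human revision is still required, and infrequent idioms give scores too unreliable to use.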
21
Statistical analysis
Problems
- Quality of results of automatic analysis of natural language: shallow parsing, small tagsets, incomplete data about sense distinctions
- Infrequent idioms are a challenge (e.g. variations, constraints)
- Detection of properties: semantic properties, creative reworking of idioms
22
Statistical analysis
Results
- Lists only: properties (variants, constraints) still largely out of reach
- Usually not made available
- Terminological lists placed on the market
23
Golden-standard evaluation
Evaluation of an idiom extractor
- Manual annotation of a sub-corpus (golden standard)
- Comparison with results of automatic extraction
Problems
- Golden standards for idioms are small and rare
- Little communication about methodological problems in building them (human-based provisions)
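The comparison between an extractor's output and a golden standard is usually summarised as precision (the complement of noise) and recall (the complement of silence). The sketch below assumes both are plain sets of idiom strings; the function name and the sample sets are illustrative, not data from the talk.

```python
def precision_recall(extracted, gold):
    """Compare extracted idioms with a manually built golden standard."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Illustrative sets (assumption): the extractor misses two idioms (silence)
# and returns one compositional phrase (noise).
gold = {"spill the beans", "make ends meet", "in the long term", "rough and ready"}
extracted = {"spill the beans", "in the long term", "drink up his beer"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

With a small golden standard, a single missing or spurious entry moves both figures substantially, which is one reason small golden standards are a problem.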
24
Lexicon-Grammar of idioms as Golden standard
A manually constructed Lexicon-Grammar of French idioms
- Authors: Maurice Gross, Laurence Danlos
- 10,000 entries
- Made available on line in 2006: http://infolingu.univ-mlv.fr/english
Users
- Use as golden standard
- Do not be put off by the amount of information: you can use only the lists if you prefer
Users and descriptive linguists
- Constructive criticism is welcome
25
Lexicon-Grammar of idioms
26
3. Human-based provisions
Objective
- psycholinguistic experiments
Introspective
- avoid preconceptions
- native linguists
- mutual control
- time limitation
- readability of resources
- formal criteria
- differential semantic judgment
27
Psycholinguistic experiments
A reaction to biased introspective linguistics
- Separate informant from scientist
- Control age, sex, origin, number... of informants
Examples
- Recognising idioms as such
- Paraphrasing idioms
28
Psycholinguistic experiments
Drawbacks
- Typical time required by an experiment on 20 forms: 2 months
- Extrapolated velocity of construction of resources: 40 lexical entries/year (counting 3 forms/entry)
- Usually, the idioms need to be known beforehand
- Not applicable for comprehensive resources
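The extrapolation above can be checked with a few lines of arithmetic. The last step goes beyond the slide: it applies the same rate to the 10,000 entries mentioned for the Lexicon-Grammar of French idioms, assuming one experiment at a time, which is an illustrative assumption rather than a figure from the talk.

```python
forms_per_experiment = 20
months_per_experiment = 2
forms_per_entry = 3

# 6 experiments per year, 20 forms each.
forms_per_year = forms_per_experiment * 12 / months_per_experiment
entries_per_year = forms_per_year / forms_per_entry
print(entries_per_year)  # 40.0, the figure given on the slide

# Assumption beyond the slide: a 10,000-entry resource built at that rate.
years_for_10000_entries = 10_000 / entries_per_year
print(years_for_10000_entries)  # 250.0
```

At that pace, a comprehensive resource is out of reach, which is the point of the last bullet.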
29
Human-based provisions: introspective
Specific solutions to the biases of introspective linguistics
Methodology and actual description simultaneously
European tradition
- Lexis/grammar interaction
- Description of idioms, 1980-now: http://infolingu.univ-mlv.fr/english
American tradition
- WordNet
30
Avoid preconceptions (1/2)
Preconception 1
- 'Manual construction of language resources is too difficult'
- 'Manual construction of resources is error-prone'
Frequently read in (peer-reviewed) computer scientists' papers
The quality of manually constructed resources depends on the background, skills, training and effort of authors
- Cf. software
Dysfunction of scientific democracy in a case of multi-disciplinarity
- At stake: the future of the institutions around the world that train people to construct high-quality language resources
31
Avoid preconceptions (2/2)
Preconception 2
- 'Descriptive linguistics is not difficult enough to be interesting'
- 'Descriptive linguistics does not require much skill'
- 'Making lists is not the point'
In fact, results of descriptive linguistics are basic information for theoretical linguistics and for computer applications
32
Native linguists
Native linguists are much better than non-native ones at
- taking into account sense distinctions
- inserting idioms in relevant sentences (this ensures that context is taken into account)
- taking into account semantic properties
Example
- La défense a cité un témoin: témoin can have co-referents
- Le patron a chié une pendule (not an elegant phrase): pendule cannot have co-referents
33
Native linguists
Drawbacks
- Results depend on skill, training and effort of the linguist
- Not applicable to languages without native speakers with higher education
- Not applicable to extinct languages
34
Mutual control
An idiom resource should be built by a team
Examples
- Gross' Lexicon-Grammar of French verbal idioms: most idioms were listed during the meetings of construction of the Lexicon-Grammar of French verbs (5 linguists)
- The Belgium/France/Québec/Switzerland (BFQS) project: differences between idioms in these 4 varieties of French (4 to 6 linguists)
35
The BFQS project
36
The BFQS project
1. Make a separate list for each variety
2. Compare lists
Comparison requires meetings
- If an idiom is not in the F list, the F author may have missed it
- If an idiom in the B list is not understood by the F author, it is considered evidence that it does not belong to the F variety
- Intermediate case: passively understood, not actively used
- If an idiom in the B list is understood by the F author, compare interpretations: they can be different
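The comparison step can be pictured as set operations followed by a human decision. The sketch below is not the BFQS project's actual tooling; the function name, the case labels and the placeholder idioms are assumptions. It only sorts the idioms of one list that are missing from another into the cases listed above, leaving the judgment itself to the meeting.

```python
def compare_lists(own_list, other_list, understood):
    """Sort idioms of `other_list` missing from `own_list` into follow-up cases.

    `understood` is the set of idioms that the author of `own_list`
    reports understanding, actively or passively.
    """
    cases = {}
    for idiom in sorted(other_list - own_list):
        if idiom in understood:
            # Possibly missed by the author, or only passively known;
            # interpretations still have to be compared in a meeting.
            cases[idiom] = "understood: compare interpretations"
        else:
            cases[idiom] = "not understood: evidence of absence from this variety"
    return cases

# Placeholder labels (assumption), not real idioms.
b_list = {"idiom 1", "idiom 2", "idiom 3"}
f_list = {"idiom 1"}
f_understands = {"idiom 1", "idiom 2"}

for idiom, case in compare_lists(f_list, b_list, f_understands).items():
    print(idiom, "->", case)
```

Idioms present in both lists are skipped in this sketch, although their interpretations may also differ between varieties.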
37
Mutual control
Drawback
- Cost: several years of weekly or monthly meetings
- The grant for the BFQS project will cover only a part of publication costs
38
Readability of description
Goal: facilitate critical reviewing, update of resources
Example: table representation
- Rows: lexical items
- Columns: structure and properties
- Open-source software: HOOP (Sastre 2006)
Density of representation
- Number of lexical items on the same screen or page
- Number of properties on the same screen or page
Metalanguage should not invade the description (which is the case with feature structures)
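A table of this kind can be sketched as a mapping from lexical items to property values, printed one item per row. This is not the HOOP format; the property names, the `render` function and the use of "?" for undescribed cells are assumptions. The only values filled in are the two judgments the talk gives for "spill the beans" (passive is idiomatic, "The beans spilled" is not).

```python
PROPERTIES = ["passive", "intransitive"]

table = {
    "spill the beans": {"passive": "+", "intransitive": "-"},
    "make ends meet": {},  # properties not described in the talk
}

def render(table, properties):
    """Print-ready table: one lexical item per row, one property per column."""
    width = max(len(entry) for entry in table)
    lines = ["  ".join(["entry".ljust(width)] + properties)]
    for entry, values in table.items():
        # "?" marks a cell that has not been described yet.
        cells = [values.get(p, "?").center(len(p)) for p in properties]
        lines.append("  ".join([entry.ljust(width)] + cells))
    return "\n".join(lines)

print(render(table, PROPERTIES))
```

Each cell holds a single character, so many items and properties fit on one screen, and no metalanguage is repeated inside the rows.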
39
Readability of description
Drawbacks
- Readable formats are usually not directly exploitable in computer applications
- Compilation processes are required
- Cf. source code vs. executable code, lemma lexicon vs. inflected-form lexicon
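The lemma lexicon vs. inflected-form lexicon analogy can be made concrete with a toy compilation step. In this sketch the readable lexicon lists the forms of each lemma explicitly, which is a simplifying assumption (real lexicons typically derive forms from inflection codes); the function and variable names are hypothetical.

```python
def compile_lexicon(lemma_lexicon):
    """Expand a readable lemma lexicon into a form -> lemmas lookup table."""
    compiled = {}
    for lemma, forms in lemma_lexicon.items():
        for form in forms:
            # A form may belong to several lemmas, hence a set.
            compiled.setdefault(form, set()).add(lemma)
    return compiled

# Readable "source" lexicon (sample data, not exhaustive).
lemma_lexicon = {
    "spill": ["spill", "spills", "spilled", "spilling"],
    "bean": ["bean", "beans"],
}

form_lexicon = compile_lexicon(lemma_lexicon)
print(form_lexicon["spilled"])  # {'spill'}
```

Linguists review and update the lemma lexicon, while applications query the compiled table, much as programmers edit source code and machines run the executable.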
40
Time limitation
The description of a lexical item is normally limited to a few minutes
Regularities -> classification -> similar items are described in sequence -> efficiency
For properties, description by property is more efficient than description by entry
Even so, manual description of all idioms of a language takes several years
41
Formal criteria
Formal criteria based on acceptability of sentences
Example: co-referent of possessives
- Leurs hôtes préviennent leurs désirs
- Leurs hôtes préviennent nos désirs
- Leurs hôtes reprennent leurs esprits
- Leurs hôtes reprennent nos esprits
Identifying such a constraint is immediate for a linguist trained in distributional analysis
42
Formal criteria
Complementarity between idiom resource and grammar is obtained through distributional analysis
- Luc a couché par écrit ses instructions
- Luc a mis par écrit ses instructions
- Luc a placé par écrit ses instructions
- Luc a couché par imprimé ses instructions
- Luc a couché par écrit ses demandes
- ?Ses instructions sont par écrit
-> 2 expressions
- N0 coucher par écrit N2
- N0 mettre par écrit N2
43
Formal criteria
Limits of variation are obtained through systematic tests
- Luc met cela par écrit
- Cela est mis par écrit par Luc
- Luc n'entend pas cela de cette oreille
- Cela n'est pas entendu de cette oreille par Luc
44
Differential semantic judgment
Comparison of variants
- Distributional analysis
- Taking into account connotations, implications
Recognising the limits of language plays
- Intuition (cf. acceptability judgment, lexicalization, institutionalization)
- Requires a corpus also (context: headlines, advertisement...)
45
Results
Theoretical results (M. Gross 1982, P. Freckleton 1985, P. Machonis 1985)
'Free' grammar vs. idiom grammar
- Idiom grammar accounts for variants
- Idiom grammar is close to free grammar: same structures, same transformations
- Idiom entries are more numerous than simple entries
Support verb constructions vs. idioms
46
Results
Idioms with free determiner, including indefinite determiner
(1) La défense a cité (un ce) témoin
    Ce numéro a été le clou de (un le) spectacle de 2002
- Distributionally frozen
- The noun with the free determiner can have co-referents, even without language plays
- More in core than in periphery
- The noun has to be attached both to a simple entry and to the idiom entry
- 1700 examples like (1), mostly technical
47
Conclusion
Different backgrounds, different approaches
- Backgrounds are so different that much synergy between researchers is missed
A result can have very different users
- Theoretical
- Practical: computer applications, construction of further resources
Distinct approaches can converge to an objective
48
Conclusion
Synergy between corpus approaches and introspective approaches
- Introspective approaches produce dense, informative resources
- Resources are useful to corpus exploration
- Corpus exploration is an aid to introspective approaches
Excessive methodological provisions
- Throwing away the baby of idiom description with the bath water of introspective linguistics
49
Bibliographical references
Freckleton, Peter. 1985. Sentence idioms in English. Working Papers in Linguistics, University of Melbourne, pp. 153-168, appendix (196 p.).
Gross, Maurice. 1982. Une classification des phrases "figées" du français. Revue Québécoise de Linguistique 11.2, pp. 151-185, Montréal: UQAM.
Machonis, Peter A. 1985. Transformations of verb phrase idioms: passivization, particle movement, dative shift. American Speech 60:4, pp. 291-308.
Sastre Martinez, Javier M. 2006. Computer Tools for the Management of Lexicon-Grammar Databases, poster. Proceedings of the 13th Conference on natural language processing, TALN 2006, Leuven, 10-13 April 2006, UCL, Presses Universitaires de Louvain, pp. 600-608.