Title: Computational Approaches to Lexical Collocations
1Computational Approaches to Lexical Collocations
- Stefan Evert, IMS Stuttgart
- Brigitte Krenn, ÖFAI Vienna
2Collocations
- What kind of animal are they?
- What are they good for?
- How can we find them?
- How likely are we to find them?
3Schedule
- Monday: Introduction to Collocations
- Tuesday: Extraction of Cooccurrence Data
- Wednesday: Association Measures (AMs)
- Thursday: Evaluation of AMs
- Friday: Significance of Result Differences
4Collocations: Terminology and Definitions
- Firth's Notion of Collocation
- "Meaning by collocation is an abstraction at the syntagmatic level and is not directly concerned with the conceptual or idea approach to the meaning of words."
- "One of the meanings of night is its collocability with dark, and of dark, of course, its collocation with night." (Firth 1957)
5Collocations: Terminology and Definitions
- Choueka's Notion of Collocation
- "A collocation is defined as a sequence of two
or more consecutive words, that has
characteristics of a syntactic and semantic unit,
and whose exact and unambiguous meaning cannot be
derived directly from the meaning or connotation
of its components." (Choueka 1988)
6Collocations: Terminology and Definitions
- Summing up, we have two different views:
- Firth: collocations as lexical proximities in text
- Choueka: collocations as syntactic and semantic units, semantic irregularity
7Terminology
- Idioms: the term preferably used in the English literature, e.g. Bar-Hillel55, Hockett58, KatzPostal63, Healey68, Makkai72.
8Terminology
- Phraseological Units (Ger. Phraseologismen): a widely used generic term in the German literature, e.g. BurgerEA82, Fleischer82.
9Terminology
- Light-verb Constructions, Support-verb Constructions: refer to very particular phenomena; cross-categorisation with idioms
10Terminology
- Terms stemming from Computational Linguistics, e.g.
- multi-word lexemes, e.g. Tschichold97, BreidtEA96.
- multi-word expressions, e.g. SegondTapanainen95
- non-compositional compounds, e.g. Melamed97
11Terminology
- Collocation: e.g. red herring, kick the bucket, dark-night, dog-bark
- Base, Collocate: the elements of a collocation; the base selects its collocates. It is not always clear what the base is, e.g. red-herring, kick-bucket
12Terminology
- Collocation Phrase: a grammatical unit which is constituted by a base and its collocates or by two or more collocates, e.g. a red herring, kick the bucket, unexpectedly kick the bucket
13Summing up
- Terminology and definitions are influenced by
- different linguistic traditions
- computational linguistic applications
14Summing up
- The phenomena covered are manifold
- lexical proximities in texts
- syntactic and semantic units
- semantic irregularity
- syntactic rigidity
15Observations: Generativity
- Collocations range from completely fixed to syntactically flexible constructions.
- Syntactic restrictions usually coincide with semantic restrictions and thus are indicators for the degree of lexicalization of a particular word combination.
16Observations: Generativity
- Particular word combinations are associated with
specific restrictions that cannot be inferred
from standard rules of grammar and thus need to
be stored together with the collocation.
17Observations: Recurrence
- Within corpora, the proportion of collocations is larger among highly recurrent word combinations than among infrequent ones.
- Recurrence is an effect of lexicalization.
18Observations: Idiomaticity
- Semantic opacity is not sufficient for the definition of collocations, as there exists a variety of conventionalized word combinations that range from
- fully compositional ones like Hut aufsetzen ('put on a hat'), Jacke anziehen ('put on a jacket')
- to semantically opaque ones like ins Gras beissen ('bite into the grass' literal meaning, 'die' idiomatic meaning).
19Observations: Words, Multi-words or Phrases
- Collocations can be
- word level phenomena
- phrase level phenomena (collocation phrase)
- Collocation phrases consist of the lexically
determined words (collocates) only or contain
additional lexically underspecified material.
20Word-level Collocations
- Adjective- and Adverb-Like Collocations
- nichts desto trotz ('nonetheless'), adverb
- fix und fertig ('exhausted'), adjective
- Preposition-Like Collocations
- im Lauf(e), im Zuge ('during')
- an Hand ('with the help of')
21Word-level Collocations
- Noun-Like Collocations
- Rotes Kreuz (Red Cross)
- Wiener Sängerknaben (Vienna choir boys)
- Hinz und Kunz ('every Tom, Dick and Harry')
- Sequences where the nouns are duplicated
- Schulter an Schulter (shoulder to shoulder),
- Kopf an Kopf (neck and neck)
22Phrase-level Collocations
- Modal constructions
- sich (nicht) lumpen lassen ('to splash out')
- Verb-object combinations
- übers Ohr hauen ('take somebody for a ride')
- unter die Lupe nehmen ('take a close look at')
- zum Vorschein bringen ('bring something to light')
- des Weges kommen ('to approach')
- Lügen strafen ('prove somebody a liar')
23Phrase-level Collocations
- Copula constructions
- guten Glaubens sein ('be in good faith')
- auf Draht sein ('be on the ball')
- Proverbs
- Morgenstund hat Gold im Mund (literally 'morning hour has gold in the mouth'; 'the early bird catches the worm')
- wissen, wo der Barthel den Most holt (literally 'know where the Barthel the cider fetches'; 'know every trick in the book')
24Summing up
- Structural dependency
- the collocates of a collocation are syntactic dependents, thus knowledge of syntactic structure is a precondition for accurate collocation identification.
- Syntactic context
- may help to discriminate literal and collocational readings, see for instance im Lauf, im Zug, where a genitive to the right is a strong indicator for a collocational reading.
25Summing up
- Markedness
- morphologically or syntactically marked
constructions like seemingly incomplete syntactic
structure or archaic e-suffix are suitable
indicators for collocations, see im Laufe, im
Zuge for e-suffix and zu Recht, an Hand for
incomplete syntactic structures.
26Summing up
- Single-word versus multi-word units
- single-word occurrences of word combinations indicate word-level collocations, see for instance zu Recht, zurecht.
- Syntactic rigidity
- is an important indicator for collocations, see for instance Hinz und Kunz, an und für sich, fix und fertig, Kopf an Kopf.
27Identification of Collocations: Operationalizable Criteria
- disproportionately high recurrence of collocational word combinations compared to noncollocational word combinations in corpora
- lexical determination of the collocates of a collocation
- collocations constitute grammatical units
- grammatical restrictions in the collocation phrases
28Positional N-Grams
- numerical span
- typically AMs work on bigrams, i.e., all <w_i, w_j> pairs within a certain span are considered
- <w_i, w_i> pairs are not considered!
- e.g. w_{i-3} ... w_{i-1} w_i w_{i+1} ... w_{i+3} (span size is 6)
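The numerical span can be sketched in a few lines of Python. This is only an illustration under assumptions not stated on the slide: the function name `span_pairs`, the parameter `k` (number of positions on each side, so `k=3` corresponds to a span size of 6) and the toy sentence are all made up, and "<w_i, w_i> pairs are not considered" is read here as "a position is never paired with itself".

```python
from collections import Counter

def span_pairs(tokens, k=3):
    """Yield (w_i, w_j) for every position j within k tokens left or right of i.

    Hypothetical helper: k=3 gives the window w_{i-3} ... w_{i+3},
    i.e. a span size of 6. A position is never paired with itself.
    """
    for i, w_i in enumerate(tokens):
        lo = max(0, i - k)                 # clip the window at the left edge
        hi = min(len(tokens), i + k + 1)   # clip the window at the right edge
        for j in range(lo, hi):
            if j != i:
                yield (w_i, tokens[j])

# Toy input, invented for illustration only.
tokens = "the dog began to bark at the dark night".split()
pair_counts = Counter(span_pairs(tokens, k=3))
```

The resulting `pair_counts` holds the cooccurrence frequency of each ordered pair; such counts are the kind of input that association measures operate on.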
29Positional N-Grams
- grammatical span
- e.g. we assume that collocational relations do not exceed sentence boundaries
- i.e., span size is bounded by sentence size
30Positional N-Grams: Problems
- huge amounts of cooccurrence data need to be managed
- this requires special algorithms (e.g. Yamamoto, Church 2001)
31Positional N-Grams: Problems
- definition of span size is crucial
- If the span size is kept small, it is unlikely to properly cover nonadjacent collocates of structurally flexible collocations.
- Enlarging the span size leads to an increase of candidate collocations, including an increase of noisy data which need to be discarded in a further processing step.
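The growth in candidate pairs with span size can be made concrete with a small count. The sketch below is an illustration only; `candidate_count`, the parameter `k` (positions on each side of w_i) and the toy sentence are assumptions, not part of the original material.

```python
def candidate_count(tokens, k):
    """Count ordered pairs of distinct positions at most k tokens apart."""
    count = 0
    for i in range(len(tokens)):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                count += 1
    return count

# Toy input, invented for illustration only.
tokens = "the dog began to bark at the dark night".split()
counts = {k: candidate_count(tokens, k) for k in (1, 2, 3)}
# counts == {1: 16, 2: 30, 3: 42}
```

Even on a nine-token sentence, widening the window from one to three positions per side more than doubles the number of candidates, most of which are noise.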
32Positional N-Grams: Problems
- High amount of noise in the collocation data due to
- inappropriate span size
33Positional N-Grams: Problems
- High amount of noise in the collocation data due to
- disproportionately high frequency of function words within texts
- use stop word lists
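A stop word filter over candidate pairs might look as follows. This is a minimal sketch: the stop word list is a tiny hypothetical one, and `drop_stop_word_pairs` and the candidate pairs are invented for illustration.

```python
# Hypothetical toy stop word list; a real list would be far longer.
STOP_WORDS = {"the", "a", "of", "to", "at", "in"}

def drop_stop_word_pairs(pairs, stop_words=STOP_WORDS):
    """Keep only pairs in which neither member is a stop word."""
    return [
        (a, b)
        for a, b in pairs
        if a.lower() not in stop_words and b.lower() not in stop_words
    ]

# Toy candidate pairs, invented for illustration only.
candidates = [("the", "dog"), ("dog", "bark"), ("bark", "at"),
              ("dark", "night"), ("of", "the")]
filtered = drop_stop_word_pairs(candidates)
# filtered == [("dog", "bark"), ("dark", "night")]
```

Note that such a filter would also discard collocations that contain function words, such as the preposition-like cases (im Laufe, an Hand), so whether to apply it depends on the collocation type of interest.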
34Positional N-Grams: Problems
- High amount of noise in the collocation data due to
- insensitivity to punctuation
- use a sentence as the largest unit within which the collocates of a collocation may occur
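Treating the sentence as the largest unit can be sketched as below. The sentence splitter is deliberately naive (it splits on `.`, `!` and `?`), and the names `sentences`, `pairs_within_sentences` and `k` are assumptions made for this illustration; a real pipeline would use a proper sentence segmenter.

```python
import re

def sentences(text):
    """Naive sentence split on ., ! and ? (an assumption, not a real segmenter)."""
    parts = re.split(r"[.!?]+", text)
    return [p.split() for p in parts if p.strip()]

def pairs_within_sentences(text, k=3):
    """Collect (w_i, w_j) with w_j up to k tokens to the right of w_i,
    never crossing a sentence boundary."""
    pairs = []
    for sent in sentences(text):
        for i, w in enumerate(sent):
            for j in range(i + 1, min(len(sent), i + k + 1)):
                pairs.append((w, sent[j]))
    return pairs

# Toy input, invented for illustration only.
text = "The dog barks. Night falls."
pairs = pairs_within_sentences(text)
```

Here `("barks", "Night")` is never produced, although the two words are adjacent in the token stream.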
35Positional N-Grams: Problems
- High amount of noise in the collocation data due to
- insensitivity to parts-of-speech
- knowing parts-of-speech allows a large number of syntactically invalid n-grams to be excluded beforehand
36Positional N-Grams: Problems
- High amount of noise in the collocation data due to
- insensitivity to syntactic structure
- structural and/or dependency information further improves the quality of the selected collocation candidates
37Proposal
- (Partial) replacement of
- positional n-grams
- by
- relational n-grams
- What does it imply?
- In which cases do we want/need it?
38From Positional to Relational N-Grams
- positional with part-of-speech, e.g. (Evert, Kermes 2003)
- <Adj, N> pairs within a certain span
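Positional extraction with part-of-speech information can be sketched as follows. This is not the method of the cited work, only an illustration under assumptions: the input is already tagged, the tag names `ADJ` and `NOUN` are a hypothetical tagset, and `adj_noun_pairs` and `k` are invented names.

```python
def adj_noun_pairs(tagged, k=2):
    """Return (adjective, noun) pairs where the noun follows the
    adjective within k tokens. `tagged` is a list of (word, tag) tuples."""
    out = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "ADJ":
            continue
        for j in range(i + 1, min(len(tagged), i + k + 1)):
            other, other_tag = tagged[j]
            if other_tag == "NOUN":
                out.append((word, other))
    return out

# Toy tagged sentence, invented for illustration only.
tagged = [("a", "DET"), ("red", "ADJ"), ("herring", "NOUN"),
          ("swam", "VERB"), ("in", "ADP"), ("the", "DET"),
          ("dark", "ADJ"), ("night", "NOUN")]
result = adj_noun_pairs(tagged)
# result == [("red", "herring"), ("dark", "night")]
```

Compared with purely positional pairs, the tag filter removes combinations such as (herring, swam) or (in, the) before any counting takes place.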
39From Positional to Relational N-Grams
- positional with part-of-speech and lexical knowledge, e.g. Breidt 93
- sentence-final <noun, past participle> pairs
- past participles must be part of a list of 16 German support verbs, e.g. kommen, bringen, stehen, stellen, ...
40From Positional to Relational N-Grams
- partially relational, e.g. Krenn 2000
- extraction of PN-combinations from PPs
- extraction of main verbs
- combination of PN-pairs and verbs co-occurring in
a sentence
41From Positional to Relational N-Grams
- Result
- a theoretical maximum of PNV combinations, i.e.,
- verbs are duplicated in sentences that contain more than one PP,
- PPs are duplicated in sentences where more than one main verb is found.
- This has effects on counting!
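The duplication effect can be shown with a cross product of the PN-pairs and main verbs found in one sentence. The sketch below is an illustration only, not the extraction procedure of Krenn 2000: `pnv_candidates` is an invented name, and the input (two PN-pairs and one verb) is a toy example in which the second PN-pair is made up.

```python
from collections import Counter
from itertools import product

def pnv_candidates(pn_pairs, main_verbs):
    """Combine every PN-pair with every main verb of the same sentence."""
    return [(p, n, v) for (p, n), v in product(pn_pairs, main_verbs)]

# Toy sentence-level input, invented for illustration only:
# two PPs and one main verb.
sentence_pn = [("unter", "Lupe"), ("in", "Jahr")]
sentence_verbs = ["nehmen"]

triples = pnv_candidates(sentence_pn, sentence_verbs)
verb_counts = Counter(v for _, _, v in triples)
# The single verb token is counted once per PP: verb_counts["nehmen"] == 2
```

Because one verb token contributes to two candidate triples, marginal frequencies derived from the triples overstate the verb's actual corpus frequency, which is the counting effect noted above.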
42Relational N-Grams
- Assumption
- The collocates of a collocation are syntactically
related!
43Relational N-Grams
- Requirements
- part-of-speech tagging
- partial parsing (phrase chunking)
- full parsing (Evert, Kermes 2003)
44Relational N-Grams
- Problems
- full parsing of free text is error-prone
- there are cases where collocation extraction from
fully parsed text is less accurate
45Summary
- Use an application-dependent notion of collocation as opposed to Firth/Choueka
- Extract collocations from relational data (relational n-grams)
- Interested in collocation phrases and their collocates
- Consider grammatically homogeneous data (e.g. Adj-N, PP-V)
- Recurrence/cooccurrence frequency is the main criterion for collocation extraction (statistical approaches)