1
Computational Approaches to Lexical Collocations
  • Stefan Evert
  • IMS
  • Stuttgart

  • Brigitte Krenn
  • ÖFAI
  • Vienna
2
Collocations
  • What kind of animal are they?
  • What are they good for?
  • How can we find them?
  • How likely are we to find them?

3
Schedule
  • Monday: Introduction to Collocations
  • Tuesday: Extraction of Cooccurrence Data
  • Wednesday: Association Measures (AMs)
  • Thursday: Evaluation of AMs
  • Friday: Significance of Result Differences

4
Collocations: Terminology and Definitions
  • Firth's Notion of Collocation
  • "Meaning by collocation is an abstraction at the
    syntagmatic level and is not directly concerned
    with the conceptual or idea approach to the
    meaning of words."
  • "One of the meanings of night is its
    collocability with dark, and of dark, of course,
    its collocation with night." (Firth 1957)

5
Collocations: Terminology and Definitions
  • Choueka's Notion of Collocation
  • "A collocation is defined as a sequence of two
    or more consecutive words, that has
    characteristics of a syntactic and semantic unit,
    and whose exact and unambiguous meaning cannot be
    derived directly from the meaning or connotation
    of its components." (Choueka 1988)

6
Collocations: Terminology and Definitions
  • Summing up, we have two different views:
  • Firth: collocations as lexical proximities in
    text
  • Choueka: collocations as syntactic and semantic
    units, semantic irregularity

7
Terminology
  • Idioms: preferably used in the English literature,
    e.g. Bar-Hillel55, Hockett58, KatzPostal63,
    Healey68, Makkai72.

8
Terminology
  • Phraseological Units (Ger. Phraseologismen): a
    widely used generic term in the German
    literature, e.g. BurgerEA82, Fleischer82.

9
Terminology
  • Light-verb Constructions, Support-verb
    Constructions: refer to very particular phenomena;
    cross-categorisation with idioms

10
Terminology
  • Terms stemming from Computational Linguistics,
    e.g.
  • multi-word lexemes, e.g. Tschichold97,
    BreidtEA96.
  • multi-word expressions, e.g. SegondTapanainen95
  • non-compositional compounds, e.g. Melamed97

11
Terminology
  • Collocation: e.g. red herring, kick the bucket,
    dark-night, dog-bark
  • Base, Collocate: elements of a collocation; the
    base selects its collocates; it is not always
    clear what the base is, e.g.
    red-herring, kick-bucket

12
Terminology
  • Collocation Phrase: grammatical unit which is
    constituted by a base and its collocates or by
    two or more collocates, e.g. a red herring, kick
    the bucket, unexpectedly kick the bucket

13
Summing up
  • Terminology and definitions are influenced by
  • different linguistic traditions
  • computational linguistic applications

14
Summing up
  • The phenomena covered are manifold
  • lexical proximities in texts
  • syntactic and semantic units
  • semantic irregularity
  • syntactic rigidity

15
Observations: Generativity
  • Collocations range from completely fixed to
    syntactically flexible constructions.
  • Syntactic restrictions usually coincide with
    semantic restrictions and thus are indicators for
    the degree of lexicalization of a particular word
    combination.

16
Observations: Generativity
  • Particular word combinations are associated with
    specific restrictions that cannot be inferred
    from standard rules of grammar and thus need to
    be stored together with the collocation.

17
Observations: Recurrence
  • Within corpora, the proportion of collocations is
    larger among highly recurrent word combinations
    than among infrequent ones.
  • Recurrence is an effect of lexicalization.

18
Observations: Idiomaticity
  • Semantic opacity is not sufficient for the
    definition of collocations as there exists a
    variety of conventionalized word combinations
    that range from
  • fully compositional ones like Hut aufsetzen ('put
    on a hat'), Jacke anziehen ('put on a jacket')
  • to
  • semantically opaque ones like ins Gras beissen
    (literal meaning 'bite into the grass',
    idiomatic meaning 'die').

19
Observations: Words, Multi-words or Phrases
  • Collocations can be
  • word level phenomena
  • phrase level phenomena (collocation phrase)
  • Collocation phrases consist of the lexically
    determined words (collocates) only or contain
    additional lexically underspecified material.

20
Word-level Collocations
  • Adjective- and Adverb-Like Collocations
  • nichts desto trotz ('nonetheless'): adverb
  • fix und fertig ('exhausted'): adjective
  • Preposition-Like Collocations
  • im Lauf(e), im Zuge ('during')
  • an Hand ('with the help of')

21
Word-level Collocations
  • Noun-Like Collocations
  • Rotes Kreuz (Red Cross)
  • Wiener Sängerknaben (Vienna choir boys)
  • Hinz und Kunz ('every Tom, Dick and Harry')
  • Sequences where the nouns are duplicated
  • Schulter an Schulter (shoulder to shoulder),
  • Kopf an Kopf (neck and neck)

22
Phrase-level Collocations
  • Modal constructions
  • sich (nicht) lumpen lassen ('to splash out')
  • Verb-object combinations
  • übers Ohr hauen ('take somebody for a ride')
  • unter die Lupe nehmen ('take a close look at')
  • zum Vorschein bringen ('bring something to
    light')
  • des Weges kommen ('to approach')
  • Lügen strafen ('prove somebody a liar')

23
Phrase-level Collocations
  • Copula constructions
  • guten Glaubens sein ('be in good faith')
  • auf Draht sein ('be on the ball')
  • Proverbs
  • Morgenstund hat Gold im Mund (literally 'morning
    hour has gold in the mouth'; 'the early bird
    catches the worm')
  • wissen, wo der Barthel den Most holt (literally
    'know where the Barthel the cider fetches';
    'know every trick in the book')

24
Summing up
  • Structural dependency
  • the collocates of a collocation are syntactic
    dependents, thus knowledge of syntactic structure
    is a precondition for accurate collocation
    identification.
  • Syntactic context
  • may help to discriminate literal and
    collocational readings, see for instance im Lauf,
    im Zug where a genitive to the right is a strong
    indicator for collocational reading.

25
Summing up
  • Markedness
  • morphologically or syntactically marked
    constructions like seemingly incomplete syntactic
    structure or archaic e-suffix are suitable
    indicators for collocations, see im Laufe, im
    Zuge for e-suffix and zu Recht, an Hand for
    incomplete syntactic structures.

26
Summing up
  • Single-word versus multi-word units
  • single-word occurrences of word combinations
    indicate word-level collocations, see for
    instance zu Recht, zurecht.
  • Syntactic rigidity
  • is an important indicator for collocations,
    see for instance Hinz und Kunz, an und für sich,
    fix und fertig, Kopf an Kopf.

27
Identification of Collocations
Operationalizable Criteria
  • disproportionately high recurrence of
    collocational word combinations compared to
    noncollocational word combinations in corpora
  • lexical determination of the collocates of a
    collocation
  • collocations constitute grammatical units
  • grammatical restrictions in the collocation
    phrases
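
The first criterion, recurrence, can be approximated by a plain frequency cut-off. The following is a minimal sketch under that assumption; the function name `frequent_candidates` and the threshold `min_freq` are illustrative and not taken from the slides.

```python
from collections import Counter

def frequent_candidates(pairs, min_freq=2):
    """Return (pair, frequency) tuples seen at least `min_freq` times, most frequent first."""
    counts = Counter(pairs)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_freq]

# Toy cooccurrence data (invented for illustration only).
observed = [("dark", "night")] * 3 + [("dark", "dog")]
candidates = frequent_candidates(observed, min_freq=2)
```

A raw frequency threshold only ranks candidates; it does not by itself test the other three criteria.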

28
Positional N-Grams
  • numerical span
  • typically AMs work on bi-grams, i.e., all <w_i,
    w_j> pairs within a certain span are considered
  • <w_i, w_i> pairs are not considered!
  • e.g. w_{i-3} ... w_{i-1} w_i w_{i+1} ... w_{i+3}
    (span size is 6)
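
A numerical span of this kind might be implemented roughly as follows. This is a sketch, not the authors' extraction tool: the function name `span_pairs` and the parameters `left` and `right` are assumptions, with three positions on each side giving the span size of 6 used in the example.

```python
from collections import Counter

def span_pairs(tokens, left=3, right=3):
    """Yield (w_i, w_j) for every w_j at most `left` positions before or `right` after w_i."""
    for i, w_i in enumerate(tokens):
        lo = max(0, i - left)
        hi = min(len(tokens), i + right + 1)
        for j in range(lo, hi):
            if j != i:  # a position is never paired with itself
                yield (w_i, tokens[j])

tokens = "the dog began to bark at the dark night".split()
pair_counts = Counter(span_pairs(tokens))
```

Here `pair_counts` holds the cooccurrence frequency of each ordered pair, which is the raw input an association measure would work on.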

29
Positional N-Grams
  • grammatical span
  • e.g. we assume that collocational relations do
    not exceed sentence boundaries
  • i.e., span size = sentence size
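
Taking the sentence as the span could look like the sketch below. It assumes the text has already been split into sentences and tokens; `sentence_pairs` is an illustrative name.

```python
from collections import Counter
from itertools import combinations

def sentence_pairs(sentences):
    """Yield every pair of words that cooccur in one sentence, in left-to-right order."""
    for sentence in sentences:
        for left, right in combinations(sentence, 2):
            yield (left, right)

sentences = [
    ["the", "dog", "barks"],
    ["a", "dark", "night"],
]
pair_counts = Counter(sentence_pairs(sentences))
```

No pair crosses a sentence boundary, so words from the first sentence are never combined with words from the second.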

30
Positional N-Grams: Problems
  • huge amounts of cooccurrence data need to be
    managed
  • this requires special algorithms (e.g.
    Yamamoto, Church 2001)

31
Positional N-Grams: Problems
  • definition of span size is crucial
  • If the span size is kept small, it is unlikely to
    properly cover nonadjacent collocates of
    structurally flexible collocations.
  • Enlarging the span size leads to an increase of
    candidate collocations including an increase of
    noisy data which need to be discarded in a
    further processing step.
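
The trade-off can be made concrete with a small experiment. The sketch below uses an invented example sentence and an illustrative helper, `candidate_types`, to count distinct candidate pairs for several right-hand span sizes and to check whether a nonadjacent pair is covered.

```python
def candidate_types(tokens, span):
    """Return the set of distinct (w_i, w_j) pairs with w_j up to `span` tokens right of w_i."""
    return {
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(len(tokens), i + span + 1))
    }

tokens = "he will kick the old rusty bucket tomorrow".split()
for span in (1, 2, 4):
    types = candidate_types(tokens, span)
    # Report span size, number of candidate types, and coverage of the nonadjacent pair.
    print(span, len(types), ("kick", "bucket") in types)
```

In this toy sentence the pair (kick, bucket) is only found once the span reaches four tokens, while the number of candidate types grows with every increase of the span.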

32
Positional N-Grams: Problems
  • High amount of noise in the collocation data due
    to
  • inappropriate span size

33
Positional N-Grams: Problems
  • High amount of noise in the collocation data due
    to
  • disproportionately high frequency of function
    words within texts
  • use stop word lists
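
Stop word filtering can be sketched as below. The stop word list is a tiny illustrative sample, not a list used by the authors, and `drop_function_word_pairs` is a hypothetical helper name.

```python
# Illustrative stop word list; a real one would be much longer and language specific.
STOP_WORDS = {"the", "a", "of", "to", "at", "in", "and"}

def drop_function_word_pairs(pair_counts, stop_words=STOP_WORDS):
    """Keep only pairs in which neither member is a stop word."""
    return {
        pair: n
        for pair, n in pair_counts.items()
        if pair[0].lower() not in stop_words and pair[1].lower() not in stop_words
    }

counts = {("the", "dog"): 5, ("dark", "night"): 2, ("bark", "at"): 3}
filtered = drop_function_word_pairs(counts)
```

Note that this blanket filter would also discard preposition-like collocations such as an Hand, so whether it is appropriate depends on the collocation type of interest.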

34
Positional N-Grams: Problems
  • High amount of noise in the collocation data due
    to
  • insensitivity to punctuation
  • use a sentence as the largest unit within which
    the collocates of a collocation may occur

35
Positional N-Grams: Problems
  • High amount of noise in the collocation data due
    to
  • insensitivity to parts-of-speech
  • knowing parts-of-speech allows a large number of
    syntactically invalid n-grams to be excluded
    beforehand

36
Positional N-Grams: Problems
  • High amount of noise in the collocation data due
    to
  • insensitivity to syntactic structure
  • further improvement of the appropriateness of the
    collocation candidates selected is achieved by
    the availability of structural and/or dependency
    information

37
Proposal
  • (Partial) replacement of
  • positional n-grams
  • by
  • relational n-grams
  • What does it imply?
  • In which cases do we want/need it?

38
From Positional to Relational N-Grams
  • positional with part-of-speech, e.g. (Evert, Kermes
    2003)
  • <Adj, N> pairs within a certain span
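
A part-of-speech constraint on positional pairs might be sketched as follows. The tag names "ADJ" and "NOUN", the input format of (word, tag) tuples, and the function name `adj_noun_pairs` are assumptions made for illustration; this is not the procedure of Evert and Kermes.

```python
def adj_noun_pairs(tagged, span=3):
    """Yield (adjective, noun) pairs where the noun follows within `span` tokens."""
    for i, (word, tag) in enumerate(tagged):
        if tag != "ADJ":
            continue
        # Only look to the right of the adjective, up to `span` tokens.
        for other, other_tag in tagged[i + 1 : i + 1 + span]:
            if other_tag == "NOUN":
                yield (word, other)

tagged = [("a", "DET"), ("red", "ADJ"), ("herring", "NOUN"), ("swims", "VERB")]
pairs = list(adj_noun_pairs(tagged))
```

All pairs with other tag combinations are excluded before counting, which removes a large share of syntactically invalid candidates.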

39
From Positional to Relational N-Grams
  • positional with part-of-speech and lexical
    knowledge, e.g. Breidt 93
  • sentence-final <noun, past participle> pairs
  • past participles must be part of a list of 16
    German support verbs e.g. kommen, bringen,
    stehen, stellen, ...

40
From Positional to Relational N-Grams
  • partially relational, e.g. Krenn 2000
  • extraction of PN-combinations from PPs
  • extraction of main verbs
  • combination of PN-pairs and verbs co-occurring in
    a sentence

41
From Positional to Relational N-Grams
  • Result
  • a theoretical maximum of PNV combinations, i.e.,
  • verbs are duplicated in sentences that contain
    more than one PP,
  • PPs are duplicated in sentences where more than
    one main verb is found.
  • This has effects on counting!
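
The counting effect can be illustrated with a cross product per sentence. The sketch below assumes each sentence has already been reduced to its PN-pairs and main verbs; the input format, the example data, and the name `pnv_combinations` are hypothetical.

```python
from collections import Counter
from itertools import product

def pnv_combinations(sentences):
    """Count every (preposition, noun, verb) triple formed within one sentence."""
    counts = Counter()
    for pn_pairs, verbs in sentences:
        for (prep, noun), verb in product(pn_pairs, verbs):
            counts[(prep, noun, verb)] += 1
    return counts

sentences = [
    # Two PPs and one main verb: the verb is used twice.
    ([("unter", "Lupe"), ("in", "Labor")], ["nehmen"]),
    # One PP and two main verbs: the PP is used twice.
    ([("zu", "Vorschein")], ["bringen", "kommen"]),
]
pnv_counts = pnv_combinations(sentences)
nehmen_total = sum(n for (_, _, verb), n in pnv_counts.items() if verb == "nehmen")
```

In the first toy sentence the single occurrence of nehmen contributes to two triples, so verb frequencies derived from the triples are inflated compared with the corpus.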

42
Relational N-Grams
  • Assumption
  • The collocates of a collocation are syntactically
    related!

43
Relational N-Grams
  • Requirements
  • part-of-speech tagging
  • partial parsing (phrase chunking)
  • full parsing (Evert, Kermes 2003)

44
Relational N-Grams
  • Problems
  • full parsing of free text is error prone
  • there are cases where collocation extraction from
    fully parsed text is less accurate

45
Summary
  • Use an application-dependent notion of collocation
    as opposed to Firth/Choueka
  • Extract collocations from relational data
    (relational n-grams)
  • Interested in collocation phrases and their
    collocates
  • Consider grammatically homogeneous data (e.g.
    Adj-N, PP-V)
  • Recurrence/cooccurrence frequency is main
    criterion for collocation extraction (statistical
    approaches)