Title: Corpora in lexical studies
1Corpora in lexical studies
- Corpus Linguistics
- Richard Xiao
- lancsxiaoz_at_googlemail.com
2Aims of this session
- Lecture
- Corpus-based lexicography
- Collocation and colligation
- Lab session
- Collocation using WST
- Collocation using AntConc
- Collocation and colligation in Xaira
- Using the BNCweb to study collocation
3Corpus revolution in lexicographic and lexical
studies
- Lexicographic and lexical studies are the
greatest beneficiaries of corpora
- Corpora have revolutionised dictionary making
and reference publishing
- It is now nearly unheard of for new dictionaries
and new editions of old dictionaries published
from the 1990s onwards not to claim to be based
on corpus data
4Why use corpora in dictionary making?
- Machine-readable corpora allow dictionary makers
to extract all authentic, typical examples of the
usage of a lexical item from a large body of text
in a few seconds
- Corpora allow dictionary makers to select entries
based on frequency information
- Corpora can readily provide frequency information
and collocation information for readers
- Textual (e.g. register, genre and domain) and
sociolinguistic (e.g. user gender and age)
information encoded in corpora allows
lexicographers to give a more accurate
description of the usage of a lexical item
5Why use corpora in dictionary making?
- Corpus annotations such as part-of-speech tagging
and word sense disambiguation also enable a more
sensible grouping of polysemous words and
homographs
- A monitor corpus allows lexicographers to track
subtle changes in the meaning and usage of a
lexical item so as to keep their dictionaries
up to date
- Corpus evidence can complement or refute the
intuitions of individual lexicographers, which
are not always reliable because of potential
biases
6Five emphases
- Changes brought about by corpora to dictionaries
and other reference books: five emphases
(Hunston 2002)
- an emphasis on frequency
- an emphasis on collocation and phraseology
- an emphasis on variation
- an emphasis on lexis in grammar
- an emphasis on authenticity
7Top 1000 written / spoken words
Authentic examples
8Corpus-based learner dictionaries
- First fully corpus-based dictionary
- Collins Cobuild English Dictionary (1987)
- Some corpus-based learner dictionaries
- Longman Dictionary of Contemporary English (3rd
edition)
- Oxford Advanced Learner's Dictionary (OALD, 5th
edition)
- Cambridge International Dictionary of English
(1st edition)
9Frequency dictionaries
10Collocation
- Collocation is among the linguistic concepts
which have benefited most from advances in corpus
linguistics
- What is collocation?
- strong tea, powerful car (Halliday 1976)
- "collocations of a given word are statements of
the habitual or customary places of that word ...
the company that words keep" (Firth 1968: 181-2)
- "One of the meanings of night is its
collocability with dark" (Firth 1957: 196)
- "a frequent co-occurrence of two lexical items in
the language" (Greenbaum 1974: 82)
- expel a school child vs. cashier an army officer
- "I propose to bring forward as a technical term,
meaning by collocation, and apply the test of
collocability" (Firth 1957: 194)
11Meaning by collocation
- "There is frequently so high a degree of
interdependence between lexemes which tend to
occur in texts in collocation with one another
that their potentiality for collocation is
reasonably described as being part of their
meaning" (Lyons 1977: 613)
- Complete description of the meaning of a word
would have to include the other word or words
that collocate with it
- "You shall know a word by the company it keeps!"
(Firth 1968: 179)
- Collocation is part of the word meaning
12Two types of collocation
- Coherence collocation vs. neighbourhood
(horizontal) collocation (Scott 1998)
- Coherence collocation
- Collocates associated with a word
(e.g. letter - stamp, post office)
- Neighbourhood collocation
- Words which do actually co-occur with the word
(letter - my, this, a, etc.)
13Coherence collocation
- "A cover term for the cohesion that results from
the co-occurrence of lexical items that are in
some way or other typically associated with one
another, because they tend to occur in similar
environments." (Halliday & Hasan 1976: 287)
- candle flame flicker
- hair comb curl wave
- sky sunshine cloud rain
- Difficult to measure using a statistical formula
14Neighbourhood collocation
- Collocation in corpus linguistics
- Structure of collocation: collocation window
- "We may use the term node to refer to an item
whose collocations we are studying, and we may
then define a span as the number of lexical items
on each side of a node that we consider relevant
to that node. Items in the environment set by the
span we will call collocates." (Sinclair
1966: 415)
- Casual vs. significant collocation
- Significant collocation: collocation that occurs
more frequently than would be expected (in a
statistical sense) on the basis of the individual
items
- n.b. Neighbourhood (horizontal) collocations can
include some coherence collocations
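Sinclair's node, span and collocates can be sketched in a few lines of Python. This is an illustrative sketch only, not how WST, AntConc, Xaira or BNCweb work internally: the function name collocates, the span of 2 and the sample sentence are all assumptions made for the example.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on each side of every occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens after the node
        counts.update(left + right)
    return counts

# Invented sample text (not corpus data), whitespace-tokenised for simplicity.
text = "on the stroke of midnight the clock struck and on the stroke of noon the bell rang"
tokens = text.split()
result = collocates(tokens, "stroke", span=2)
print(result.most_common())
```

Raw counts like these only show casual co-occurrence; deciding which collocates are significant needs one of the statistical measures.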
15Intuition vs. collocation
- Greenbaum (1974): people disagree on
collocations in introspection-based elicitation
experiments
- "Although collocation can be observed informally
on the basis of intuitions, it is more reliable
to measure it statistically, and for this a
corpus is essential" (Hunston 2002: 68)
- Intuition is often a poor guide to collocation
- "because each of us has only a partial knowledge
of the language, we have prejudices and
preferences, our memory is weak, our imagination
is powerful (so we can conceive of possible
contexts for the most implausible utterances),
and we tend to notice unusual words or structures
but often overlook ordinary ones" (Krishnamurthy
2000: 32-33)
- Collocation can be measured on the basis of
co-occurrence statistics (MI, z, t, LL etc.);
more discussion to follow
16Collocation is syntagmatic
Langue (Language system) paradigmatic
- famous boots. On the stroke of full time the
- Stoke the lead on the stroke of half-time with a goal
- Smith sin-binned on the stroke of half-time, added a
- clinched their win on the stroke of lunch after resuming
- chase by declaring on the stroke of lunch. <p> With a lead
- expectant crowd, on the stroke of midday. The bird
- hour began not upon the stroke of midnight but upon the
- of midnight but upon the stroke of noon. There was,
- booked in advance. On the stroke of seven, a gong summons
- Promptly on the stroke of six o'clock, the chooks
- from Edinburgh on the stroke of the Millennium.
Parole (Utterance) syntagmatic
17Collocation vs. colligation
- Collocation
- Relationship between a lexical item and other
lexical items
- Relationship between words at the lexical level
- E.g. very collocates with good
- Colligation
- Relationship between a lexical item and a
grammatical category
- Relationship between words at the grammatical
level
- E.g. very colligates with ADJ
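The difference can be made concrete with a part-of-speech tagged text: a colligation count tallies the tags next to the node rather than the words. The sketch below assumes a hand-tagged toy sample; the tag labels, the function name colligates and the sample itself are invented for illustration and do not reproduce the Xaira query interface.

```python
from collections import Counter

def colligates(tagged, node, offset=1):
    """Count the POS tags found `offset` positions from each occurrence of `node`."""
    counts = Counter()
    for i, (word, _tag) in enumerate(tagged):
        j = i + offset
        if word == node and 0 <= j < len(tagged):
            counts[tagged[j][1]] += 1
    return counts

# Hand-tagged toy sample (hypothetical tags, not output from a real tagger).
tagged = [
    ("a", "DET"), ("very", "ADV"), ("good", "ADJ"), ("idea", "NOUN"),
    ("and", "CONJ"), ("a", "DET"), ("very", "ADV"), ("sweet", "ADJ"),
    ("smell", "NOUN"), ("very", "ADV"), ("quickly", "ADV"),
]
right_tags = colligates(tagged, "very")
print(right_tags.most_common())
```

In this sample the position to the right of very is filled by ADJ twice and ADV once, which is the pattern summarised by "very colligates with ADJ".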
18WST Collocate settings
Concord tab
19WST collocates
Strength of relationship is displayed as 0.000 if
it hasn't yet been computed
20Strength of collocation relationship
A wordlist is required
21Highlight and double click
22to see the selected collocate
23Collocates in AntConc
24Collocation in Xaira
25Colligation in Xaira
26Exploring collocation with BNCweb
- http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
27Search for sweet
28Concordances of sweet
KWIC view
29KWIC view
30Dropdown menu collocations
31Collocation setting
32Collocation database (default settings)
33Adjusting settings
34Noun collocates of sweet
Click on a word to see its collocation info
35Collocation info of sweet smell
Click on a number to see concordances of
collocates at that position
36Concordances of smell at R2
37Collocation statistics
38Rank by frequency
Frequent words crowd into the top of the
collocate list. Are they genuine collocates?
39Rank by the t test
- Also focusing on frequent words?
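One common formulation of the t score is t = (O - E) / sqrt(O), where O is the observed co-occurrence frequency and E the frequency expected by chance. The sketch below assumes E = f(node) x f(collocate) x span / N; BNCweb and other tools may compute the score differently, and all counts are invented.

```python
import math

def t_score(f_joint, f_node, f_coll, corpus_size, span=1):
    """t = (O - E) / sqrt(O); requires f_joint > 0."""
    expected = f_node * f_coll * span / corpus_size
    return (f_joint - expected) / math.sqrt(f_joint)

# Invented counts for a 1,000,000-token corpus (not real BNC figures).
t_frequent = t_score(f_joint=100, f_node=1000, f_coll=20000, corpus_size=1_000_000)
t_rare = t_score(f_joint=4, f_node=1000, f_coll=5, corpus_size=1_000_000)
print(t_frequent, t_rare)
```

With these counts the frequent collocate scores well above the rare one, which is consistent with the t test favouring high-frequency pairs.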
40Rank by MI
- Infrequent words at the top of the list
- How useful are they (especially to English
learners)?
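Mutual information is usually given as MI = log2(O / E), the observed co-occurrence frequency over the frequency expected by chance. The sketch below is a minimal version under the assumption that E = f(node) x f(collocate) x span / N; the counts are invented and the exact formula used by a given tool may differ.

```python
import math

def mutual_information(f_joint, f_node, f_coll, corpus_size, span=1):
    """MI = log2(O / E); requires f_joint > 0."""
    expected = f_node * f_coll * span / corpus_size
    return math.log2(f_joint / expected)

# Invented counts for a 1,000,000-token corpus (not real BNC figures).
mi_frequent = mutual_information(f_joint=100, f_node=1000, f_coll=20000, corpus_size=1_000_000)
mi_rare = mutual_information(f_joint=4, f_node=1000, f_coll=5, corpus_size=1_000_000)
print(round(mi_frequent, 2), round(mi_rare, 2))
```

Here the collocate that occurs only five times in the whole corpus receives the higher score, which illustrates why infrequent words rise to the top of an MI-ranked list.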
41Rank by the z score
- Like MI, the z score also over-estimates
infrequent items (e.g. nothings, afton, marjoram)
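A simplified form of the z score is z = (O - E) / sqrt(E). Because the expected frequency E sits in the denominator, a very small E inflates the score. The sketch assumes this simplified formula and E = f(node) x f(collocate) x span / N; the counts are invented and a given tool may use a fuller variance term.

```python
import math

def z_score(f_joint, f_node, f_coll, corpus_size, span=1):
    """Simplified z = (O - E) / sqrt(E)."""
    expected = f_node * f_coll * span / corpus_size
    return (f_joint - expected) / math.sqrt(expected)

# Invented counts for a 1,000,000-token corpus (not real BNC figures).
z_frequent = z_score(f_joint=100, f_node=1000, f_coll=20000, corpus_size=1_000_000)
z_rare = z_score(f_joint=4, f_node=1000, f_coll=5, corpus_size=1_000_000)
print(round(z_frequent, 2), round(z_rare, 2))
```

The rare collocate outranks the frequent one here, as it does under MI.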
42Log-likelihood test
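The log-likelihood statistic is commonly computed from a 2 x 2 contingency table as 2 x sum of O x ln(O / E) over the four cells. The sketch below assumes adjacent co-occurrence (no span adjustment) and invented counts; it is one standard formulation, not necessarily the exact BNCweb calculation.

```python
import math

def log_likelihood(f_joint, f_node, f_coll, corpus_size):
    """Log-likelihood (G2) for a 2 x 2 table of node/collocate co-occurrence."""
    observed = [
        [f_joint, f_node - f_joint],                                   # node with / without collocate
        [f_coll - f_joint, corpus_size - f_node - f_coll + f_joint],   # other words with / without collocate
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                e = row_totals[i] * col_totals[j] / corpus_size
                g2 += o * math.log(o / e)
    return 2 * g2

# Invented counts: the first pair co-occurs exactly as often as chance predicts.
ll_chance = log_likelihood(f_joint=10, f_node=100, f_coll=100, corpus_size=1000)
ll_associated = log_likelihood(f_joint=50, f_node=100, f_coll=100, corpus_size=1000)
print(ll_chance, round(ll_associated, 2))
```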
43Rank by MI3
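MI3 is usually described as MI with the observed frequency cubed, MI3 = log2(O^3 / E), which gives more weight to frequent pairs. A minimal sketch, assuming E = f(node) x f(collocate) x span / N and using invented counts:

```python
import math

def mi3(f_joint, f_node, f_coll, corpus_size, span=1):
    """MI3 = log2(O**3 / E); requires f_joint > 0."""
    expected = f_node * f_coll * span / corpus_size
    return math.log2(f_joint ** 3 / expected)

# Invented counts for a 1,000,000-token corpus (not real BNC figures).
mi3_frequent = mi3(f_joint=100, f_node=1000, f_coll=20000, corpus_size=1_000_000)
mi3_rare = mi3(f_joint=4, f_node=1000, f_coll=5, corpus_size=1_000_000)
print(round(mi3_frequent, 2), round(mi3_rare, 2))
```

With these counts the frequent collocate ranks above the rare one, the reverse of the ordering plain MI would give for the same figures.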
44Rank by dice coefficient
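The Dice coefficient is commonly defined as 2 x f(node, collocate) / (f(node) + f(collocate)). The sketch below assumes that definition with invented counts; when joint counts are taken from a multi-word span a tool may normalise them differently.

```python
def dice(f_joint, f_node, f_coll):
    """Dice coefficient: 2 * joint frequency / sum of the two individual frequencies."""
    return 2 * f_joint / (f_node + f_coll)

# Invented counts (not real BNC figures).
dice_half = dice(f_joint=50, f_node=100, f_coll=100)   # the words co-occur half the time
dice_always = dice(f_joint=10, f_node=10, f_coll=10)   # the words only ever occur together
print(dice_half, dice_always)
```

Unlike MI or the z score, the value does not depend on corpus size, only on how much of each word's total frequency is accounted for by the pair.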