Title: Wei Naixing
1 Corpus-based and Corpus-driven Studies of Collocation: Ideas and Methods
- Wei Naixing
- Shanghai Jiaotong University
2 An Outline for the Presentation
- 1 Major Principles of Firthian Linguistics
- 2 Towards defining collocations
- 3 Corpus-based and Corpus-driven Studies
- 4 Extensions
3 J. R. Firth (1890-1960)
1 Major Principles of Firthian Linguistics
- Major works include
- Papers in Linguistics 1934-1951. London: Oxford University Press.
- A Synopsis of Linguistic Theory 1930-1955. Oxford Philological Society.
4 Neo-Firthians
- M. Halliday
- J. Sinclair
- A. McIntosh
- P. Strevens
- etc.
5 Post-Firthian Corpus Linguists
- John Sinclair
- Michael Stubbs
- Antoinette Renouf
- Wolfgang Teubert
- Elena Tognini-Bonelli
- Etc.
6 Principles
- Language is a mode of action, a way of doing things.
- Linguistics is concerned with the study of meaning; meaning is always contextual.
7 Principles
- Language should be studied in actual, attested, authentic instances of use, not as intuitive, invented sentences.
- Linguistic analysis is empirical.
8 Principles
- There is no boundary between lexis and syntax; lexis and syntax are interdependent. Form and meaning are inseparable.
- Much language use is routine.
- Language in use transmits the culture.
- Language is monist and probabilistic.
9 1.1 Language is a mode of action, a way of doing things.
- Firth: Anti-mentalism
- "As we know so little about mind and as our study is essentially social, I shall cease to respect the duality of mind and body, thought and word, and be satisfied with the whole man, thinking and acting as a whole, in association with his fellows." (1957: 19)
- Firthian linguistics is a part of sociology. The object of study is E-language.
10 - A comparison: Chomskyan mentalism
- Linguistics is to account for the ideal language user's competence in his language, that is, the innate knowledge of linguistic rules. The central object of linguistics is the I-language.
- Inevitably, linguistics is a part of cognitive psychology; it is speculative, hypothetical, and explanatory.
11 1.2 Linguistics is concerned with the study of meaning; meaning is always contextual
- Firth: Contextualism
- "The complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." (1957: 7)
12 What does Firth say about collocation?
- "You shall know a word by the company it keeps."
- The collocation of a given word, rather than a mere juxtaposition, is an order of mutual expectancy. The words in a collocation have customary or habitual places and are mutually expected and prehended.
13 What does Firth say about collocation?
- Collocation is a mode of meaning
- "Meaning by collocation is an abstraction at the syntagmatic level and is not directly concerned with the conceptual or idea approach to the meaning of words. One of the meanings of night is its collocability with dark, and of dark, of course, collocation with night." (1957: 196)
14 What does Firth say about collocation?
- Colligation and collocation
- Colligation is the abstract inter-relation of grammatical categories.
- Collocations are actual words in habitual company. A word in a usual collocation stares you in the face just as it is. Colligation cannot be words as such. Colligations of grammatical categories related in a given structure do not necessarily follow word divisions or even sub-divisions of words. A colligation is not to be interpreted as abstraction in parallel with collocation of exemplifying words in text.
15 1.3 Language should be studied in actual, attested, authentic instances of use, not as intuitive, invented, isolated sentences
- The farmer kills the duckling.
- I have not seen your father's pen, but I have read the book of your uncle's gardener (ibid.: 60-1).
- Walter played the piano more often in Chicago than his brother conducted concerts in the rest of the states. (Quirk 1985: 1132)
- I've never seen a dog more obviously friendly than your cat. (Quirk 1985: 1132)
16 - Much linguistics is based on invented sentences, and often only a small number of invented sentences are discussed.
Invented, isolated sentences: grammatically well-formed, but difficult or impossible to imagine in use.
17 - An important point
- invented examples are really part of the
explanations. They have no independent authority
or reason for their existence, and they are
constructed to refine the explanations and in
many cases to clarify the explanation. Usage
cannot be invented, it can only be recorded.
18 A comparison of Chomskyan assumptions with the neo-Firthian principles
- "The critical problem for grammatical theory today is not a paucity of evidence but rather the inadequacy of present theories of language to account for masses of evidence that are hardly open to serious question." (Chomsky 1965: 19-20)
- "Starved of adequate data, linguistics languished - indeed it became almost totally introverted." (Sinclair 1991a: 1)
19 1.4 There is no boundary between lexis and syntax; lexis and syntax are interdependent
- Grammar and lexis are two perspectives from which we look at language. They are two sides of the same coin.
- "All linguistic items enter into patterns of both kinds. They are grammatical items when described grammatically, as entering (via classes) into closed systems and ordered structures, and lexical items when described lexically, as entering into open sets and linear collocations." (Halliday, 1976: 77)
20 - Collocation is where grammar and lexis meet
- 1) Any syntactic structure restricts the lexis that occurs in it, and conversely any lexical item can be specified in terms of the structures in which it occurs.
- 2) Such restrictions are typically not absolute, but clear tendencies; grammar is inherently probabilistic.
21 - 3) Native speakers have no reliable intuitions
about such statistical tendencies. Grammars based
on intuitive data will imply more freedom of
combination than is in fact possible. Grammar is
corpus-driven in the sense that the corpus tells
us what the facts are.
22 - 4) Every sense or meaning of a word has its own grammar; each meaning is associated with a distinct formal patterning.
- 5) Words are systematically co-selected; the normal use of language is to select more than one word at a time.
23 - 6) Since paradigmatic choices are not made independently of position in the syntagmatic chain, the relation between paradigmatic and syntagmatic has to be rethought.
- 7) In all cases so far examined, each meaning can be associated with a distinct formal patterning. There is ultimately no distinction between form and meaning. Meaning affects the structure, and this is the principal observation of corpus linguistics in the last decade. (Sinclair, 1991: 6-7)
24 - A comparison: Chomskyan positions
- Grammar is grammar and usage is usage.
- "the understanding of knowledge of grammar involves going beyond an examination of language in use." (p. 691)
- "Probabilistic information drawn from corpora is of utmost value for many aspects of linguistic inquiry. But it is all but useless for providing insights into the grammar of any individual speaker." (p. 698)
- "In summary, we have grammar and we have usage. Grammar supports usage, but there is a world of difference between what a grammar is and what we do and need to do when we speak." (p. 695)
- (Frederick J. Newmeyer 2003)
25 1.5 Much language use is routine
- Language use is conventional and prefabricated.
- "Man is born free and is everywhere in chains. The bonds of family, neighborhood, class, occupation, country and religion are knit by speech and language." (Firth 1957: 185)
- "It is true that in everyday life we generally say what the other fellow expects us, one way or the other, to say, but this expectancy is the measure even of our delightful surprises, and good personal style is highly valued." (Firth 1957: 186)
26 - A multitude of terms
- linguistic pre-fabrications
- stereotyping
- memorized chunks
- formulaic expressions
- pre-assembled parts
- etc
27 1.6 Language in use transmits the culture.
- Cultural behavioral patterns of language users
Behavioral pattern of key words Recurrent
collocations
Cultural behavioral pattern
Linguistic units
Cultural units
28 1.7 Language is monist and probabilistic
- A monist view
- Firth: Saussurean dualisms are misconceived.
- Such a language in the Saussurean sense is a
system of signs placed in categories. It is a
system of different values, not of concrete and
positive terms. Actual people do not talk such a
language. However systematically you may talk,
you do not talk systematics. According to strict
Saussurean doctrine, therefore, there are no
sentences in a language considered as a system.
Strictly speaking, in a language there are no
real words either, but only examples of
phonological and morphological categories.
- (Firth 1957: 180)
29 Chomskyan dualisms are not necessary.
- "Chomsky's theory of competence and performance had driven a massive wedge between the system and instance, making it impossible by definition that analysis of actual texts could play any part in explaining the grammar of a language - let alone in formulating a general linguistic theory." (Halliday 1991: 30)
30 A probabilistic view
- "It had always seemed to me that the linguistic system was inherently probabilistic, and that frequency in text was the instantiation of probability in the grammar." (Halliday 1991: 31)
- Metaphor of weather and climate
31 - The weather and the climate are the same
phenomenon but regarded from different time
depths. If we are thinking of the next few hours,
then we are thinking of the weather and this
perspective determines what kinds of actions we
might take, for instance going to the beach or
taking an umbrella. If we are thinking of the
next decade or the next century, then we are
thinking of the climate and this perspective
also determines what kinds of actions we might
take, for example, legislating against industrial
processes which are destroying the ozone layer.
If the climate changes, then obviously the
weather changes. But conversely, each day's
weather affects the climate, however
infinitesimally, either maintaining the status
quo or helping to tip the balance towards
climatic change. Instance and system, probability
and categoriality, micro and macro, are two sides
of the same coin, relative to the observer's
position.
32 2 Towards defining collocations
2.1 Semantic considerations
The mechanism is selectional restriction, that is, a lexical item's semantic properties presuppose certain restrictions on the choice of items that can occur in its environment.
Semantically motivated collocations: drink water, pregnant women, murder a suspect
33 - Semantically unmotivated collocations
- spotless reputation; flawless reputation
- flawless performance; unblemished performance
- bear a grudge; bear a hatred; bear a scorn
- pay attention/respect/visit; pay greeting; pay welcome
- rulelessness, arbitrariness, idiosyncrasy
34 2.2 Grammatical considerations
- "A collocation is a sequence of words that occur more than once in identical form (in the Brown Corpus) and which is grammatically well-structured." (Kjellmer, 1987: 133)
- "By collocation is meant the co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern." (Cowie, 1978: 132)
- e.g. Table 6.2
35 Many discontinuous collocations cut across
sentence boundaries.
- laugh...joke
- ill...doctor
- try...succeed
- king...crown
- cradle...flame...flicker
- hair...comb...curl...wave
- sky...sunshine...cloud...rain, etc.
- (Halliday and Hasan 287).
36 2.3 The lexical co-occurrence approach
- The lexical approach is based on the assumption that words receive their meaning from the words they co-occur with. Linguists taking this approach, Firthians in particular, perceived collocations as a lexical phenomenon independent of grammar.
37 - "You shall know a word by the company it keeps." (Firth, 1957: 12)
- "... lexis seems to require the recognition merely of linear co-occurrence together with some measure of significant proximity, either a scale or at least a cut-off point. It is this syntagmatic relation which is referred to as collocation." (Halliday, 1976: 75)
38 - "Collocation is the occurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening." (Sinclair, 1991: 170)
39 - Collocates of 'back'
- I crawled back to camp.
- I'll drive you back to your flat.
- We had to go back to the hotel.
- You have just got back from the office.
- Set back from the road.
- All the way back to the village.
- He leaned back in his chair.
- Tom went back to the window.
- Britain would be back on his feet.
- They got back into the car.
- You must come back to the kitchen.
- She went back into the living room.
(Sinclair, 1991: 120)
40 - Lexical Combinations on the Syntagmatic Axis
Figure 6-1 A continuum for syntagmatic
combinations
41 - Defining features
- 1. Collocations are syntagmatic associations of words in contexts.
- 2. Collocations may exist in a grammatical construct, or may cut across structure boundaries.
- 3. Collocations are recurrent or statistically significant expressions.
- 4. Collocations are largely register-dependent.
- 5. Collocations are arbitrary and conventional.
- 6. Collocations vary in length.
42 3. Corpus approaches to the study of collocations
3.1 Generalizing collocational patterns on the basis of colligations and lexical co-occurrences
- Colligation: a grammatical construct in which a key word co-occurs with other words. Table 6.3
- Concordance, e.g. of 'data'
43 3.2 Corpus-driven lexical computing
- Node
- The node word in a collocational study is the one whose lexical behaviour is under examination.
- Span
- The span is the measurement, in words, of the co-text of a word selected for study. Usually a span of -4/+4 or -5/+5 is adopted in collocational studies, which means that four or five words on either side of the node word will be taken to be its relevant environment.
- Collocates
- Collocates are those items which occur in the environment defined by the span.
- The idea is to investigate the collocational pattern of the node word by examining the 2S x N word positions around it, where S stands for the defined span on either side of the node and N stands for the occurrences of the node word in a corpus.
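The node, span, and collocate definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the presentation: the function name `collocates`, the whitespace tokenisation, and the toy sentence are all assumptions made for the example.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count word forms found within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after the node
        counts.update(left + right)
    return counts

# Toy text (hypothetical), tokenised naively on whitespace.
text = "he went back to the hotel and she went back into the living room"
tokens = text.lower().split()
table = collocates(tokens, "back", span=2)
print(table.most_common(3))
```

With a span of 2 and two occurrences of the node, the counter covers 2 x 2 x 2 = 8 word positions, which is the 2S x N environment described above. Windows are simply truncated at the edges of the text.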
44 Statistical measures
- Z-score
- The Z-score expresses the difference between the observed frequency of a collocate and its expected frequency in standard deviation units, and thus tells where one score lies in relation to other scores. It is used to compare a score's relative position in two or more score sets. In statistics, the Z-score is usually applied to the test of a large sample while the T-score is applied to the test of a small sample.
45 - W: the total number of words in a corpus
- N: the total occurrences of the node
- C1: the occurrences of a collocate in the corpus
- S: the defined span
- C2: the frequency of the collocate co-occurring with the node
- The probability of a collocate co-occurring with each successive node: C1(2S+1)/W
- The probability of a collocate co-occurring with a node occurring N times
46 The expected frequency of the co-occurrence of
the node and the collocate
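As a rough illustration of how these quantities combine, the sketch below computes an expected frequency and a Z-score. Two things are assumptions rather than statements from the slides: the expected frequency is taken as E = C1(2S+1)N/W, and the standard deviation is taken as sqrt(E(1 - C1/W)), a common formulation in the collocation literature. The function name `z_score` and the sample figures are hypothetical.

```python
import math

def z_score(W, N, C1, C2, S):
    """Return (expected frequency, Z-score) for one collocate of a node.

    W: corpus size, N: node occurrences, C1: collocate occurrences,
    C2: observed node-collocate co-occurrences, S: span on each side.
    """
    p = C1 / W                       # chance of the collocate at any one position
    E = p * (2 * S + 1) * N          # assumed expected co-occurrence frequency
    sd = math.sqrt(E * (1 - p))      # assumed standard deviation
    return E, (C2 - E) / sd

# Hypothetical figures: a 1-million-word corpus, node seen 100 times,
# collocate seen 1000 times, 10 co-occurrences within a span of 4.
E, z = z_score(W=1_000_000, N=100, C1=1000, C2=10, S=4)
print(round(E, 3), round(z, 3))
```

A large positive Z-score means the pair co-occurs far more often than the chance model predicts; a value near zero means the observed frequency is about what chance alone would give.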
47 Casual collocation and threshold frequency: a minimum frequency is set for statistical measurement; if a word form co-occurs with the node less often than the minimum frequency, it is not passed on to the statistical measure.
Significant collocation: a significant collocation is one in which the two items co-occur more often than could be predicted on the basis of their respective frequencies in the length of text under consideration.
Table 6.5 Z-scores of the collocates of 'performed'
Table 6.6 T-scores of the collocates of 'knowledge'
48 - Mutual Information
- Mutual information measures the amount of information that the occurrence of one word yields about the probability of the occurrence of another word.
- MI principle: in a corpus of 10 million words, the word 'kin' occurs 10 times, so the probability of occurrence of 'kin' is 0.000001. Suppose that, in the same corpus, the word 'kith' occurs 5 times and that in all five instances 'kin' follows 'kith'. Having seen 'kith', we would then estimate the probability of seeing 'kin' next to be 1 (5 out of 5) rather than 0.000001. So the occurrence of the word 'kith' gives us a great deal of information about the likelihood of seeing the word 'kin' nearby.
49 MI calculates the probability of the two words
co-occurring by comparing the product of their
relative frequencies in the corpus with the
observed frequencies of their co-occurrences. The
difference between these values will reveal the
degree of significance of the co-occurrence.
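Here a and b are two word forms, P(a, b) is the probability of their co-occurrence, P(a) is the probability of a occurring, and P(b) is the probability of b occurring. If a and b are independent of each other, P(a, b) equals P(a) P(b). The mutual information I(a, b) is given by
$$I(a, b) = \log_2 \frac{P(a, b)}{P(a)\,P(b)}$$
50 If the corpus size is W, F(a) is the frequency of a, F(b) is the frequency of b, and F(a, b) is the frequency of their co-occurrence, then
$$I(a, b) = \log_2 \frac{F(a, b) \cdot W}{F(a)\,F(b)}$$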
51 - If word a and word b tend to occur in conjunction, their mutual information will be high, which means that their collocational strength is strong. If they are not related and occur together only by chance, their mutual information will be close to zero, which means their collocational strength is weak or no mutual attraction exists between them. Finally, if the two words tend to avoid each other, their mutual information will be negative (the two words repel each other).
- Table 6.7 MI scores of the collocates of 'knowledge'
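The three cases above (attraction, independence, repulsion) can be checked numerically. The sketch below assumes the usual pointwise form of mutual information, log2 of the observed co-occurrence probability over the product of the two word probabilities, estimated from raw frequencies. The function name `mutual_information` is hypothetical, and apart from the kith/kin figures taken from the example earlier, the numbers are invented for illustration.

```python
import math

def mutual_information(W, f_a, f_b, f_ab):
    """MI of words a and b from corpus size W and raw frequencies.

    f_a, f_b: frequencies of a and b; f_ab: frequency of their co-occurrence.
    """
    if f_ab == 0:
        raise ValueError("MI is undefined when the pair never co-occurs")
    return math.log2(f_ab * W / (f_a * f_b))

# kith/kin: 10-million-word corpus, kith 5 times, kin 10 times, together 5 times.
mi_kith_kin = mutual_information(W=10_000_000, f_a=5, f_b=10, f_ab=5)

# Hypothetical pairs: one co-occurring exactly at chance level, one below it.
mi_chance = mutual_information(W=1000, f_a=20, f_b=50, f_ab=1)
mi_repel = mutual_information(W=1000, f_a=100, f_b=100, f_ab=5)

print(round(mi_kith_kin, 2), mi_chance, mi_repel)
```

The kith/kin pair scores very high even though both words are rare, which illustrates why MI tends to favour low-frequency pairs and is usually applied together with a threshold frequency.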
52 3.3 Extended collocations: word clusters and chunks
- A word cluster is a continuous collocational sequence.
- "The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments."
- "It thus appears that a model of language which divides grammar and lexis, and which uses the grammar to provide a string of lexical choice points, is a secondary model. It cannot be relinquished, because a text still has many switch points where the open-choice model will come into play. It has an abstract relevance, in the sense that much of the text shows a potential for being analysed as the result of open choices, but the other principle, the idiom principle, dominates." (Sinclair 1987: 320; 1991: 110)
53 - Extracting word clusters of various lengths from the corpus
- Table 6.8 The 20 most frequent 4-word clusters in the LOB
- The ratio between observed frequency and expected frequency
If the observed frequency of the word form a and that of word form b are F(a) and F(b) respectively, then the probability of the two forming a sequence is F(a)/W x F(b)/W. Multiplying this theoretical probability by the corpus size W, we obtain the expected frequency of the two word forms forming a sequence:
$$E(a, b) = \frac{F(a)}{W} \cdot \frac{F(b)}{W} \cdot W = \frac{F(a)\,F(b)}{W}$$
54 - If a sequence consists of n word forms, that is, a1, a2, ..., an, then the same reasoning gives
$$E(a_1, \dots, a_n) = W \prod_{i=1}^{n} \frac{F(a_i)}{W} = \frac{F(a_1)\,F(a_2) \cdots F(a_n)}{W^{\,n-1}}$$
- According to the formula we can work out the expected frequencies of the above clusters and the observed frequency/expected frequency ratio.
- Table 6.9 Statistics of the most frequent clusters of 'point' in LOB
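A minimal sketch of the cluster procedure: extract n-word clusters, then compare each cluster's observed frequency with the frequency expected if its word forms combined independently (corpus size times the product of the individual relative frequencies). The function names and the toy text are assumptions for illustration only; a real study would run over a corpus such as LOB.

```python
from collections import Counter

def clusters(tokens, n):
    """Count every continuous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def expected_frequency(cluster, unigram, W):
    """Expected frequency under independence: W times each word's F/W."""
    e = W
    for word in cluster:
        e *= unigram[word] / W
    return e

# Toy text (hypothetical), tokenised naively on whitespace.
tokens = "at the end of the day at the end of the road".split()
W = len(tokens)
unigram = Counter(tokens)

target = ("at", "the", "end", "of")
observed = clusters(tokens, 4)[target]
ratio = observed / expected_frequency(target, unigram, W)
print(observed, round(ratio, 1))
```

On so small a text the ratio is inflated, but the principle carries over: the further the observed frequency exceeds the expected frequency, the stronger the case for treating the sequence as a prefabricated chunk.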
55 - Instances of extended collocations
56 4 Extensions
- 4.1 Collocations and culture
- Frequent collocations reflect cultural units and meaning
The case of afternoon tea.
57 4.2 Collocations and social phenomena
- Newly emerging collocations reflect growth of new
phenomena and concepts
Instances of working mothers.
Instances of single parent
58 4.3 Collocations and ideology
- Cultural key words: social values and attitudes
Labour: casual, cheap, deskilling, manual, semi-skilled, unemployed, unproductive, unskilled
Labourer: agricultural, building, casual, clerk, farm, manual, poor, shop, unemployed, unskilled
Career
59 Collocations and ideology
- Critical discourse analysis: how language is used to intervene in society
Frequent collocations in Chris Patten's speech: individual rights, individual freedom, opportunities of the individual, privacy of the individual, respect for the individual, rule of law
Purpose: to create an illusion of a good colony
60 4.4 Collocations and language change
- Strong collocations tend to become fixed phrases
and convey packages of information
Instances of falling standards
61 - Thank you for your attention!
- Comments and questions are welcome!
62 - He is a heavy drinker.
- He is drinking pretty heavily.
- He drinks heavily.
- He is putting in some heavy drinking.
64 Table 6.3 Colligations and collocations
67 - N
- field, input, inspection, laboratory, performance, survey
- N
- management, banks, base, introduction
- ADJ
- analytical, attitudinal, available, centralized, digital, distributed, empirical, experimental, intuitional, invented, longitudinal, measured, natural, observational, preliminary, quantitative, raw
70 22 80 70 65 80 60 190 266 104 79 417 149 1445 707 338
71 Table 6.8 The 20 most frequent 4-word clusters in the LOB