Corpus Linguistics Lecture 1

1
Corpus Linguistics: Lecture 1
  • Albert Gatt

2
Contact details
  • My email: albert.gatt_at_um.edu.mt
  • Drop me a line with queries etc, and to arrange
    meetings.

3
Course web page
  • Course web page:
    http://staff.um.edu.mt/albert.gatt/home/teaching/corpusLing.html
  • Details of tutorials, lectures etc will always be
    on the web page.
  • Readings for the lecture
  • Downloadable lecture notes (available after the
    lecture)

4
Suggested text
  • T. McEnery and A. Wilson. (2001). Corpus
    Linguistics. Edinburgh University Press
  • NB: Over the course of these lectures, other
    readings will also be proposed and made
    available, usually online.

5
Lectures and assessment
  • Structure of lectures:
  • all lectures will take place in the lab
  • usually, about half the lecture (1hr) will be
    devoted to practical work
  • Course assessment: assignment
  • Final essay (ca. 1500-2000 words)
  • Essay topics will involve research on corpora!

6
Questions
  • ?

7
What is corpus linguistics?
  • A new theory of language?
  • No. In principle, any theory of language is
    compatible with corpus-based research.
  • A separate branch of linguistics (in addition to
    syntax, semantics)?
  • No. Most aspects of language can be studied using
    a corpus (in principle).
  • A methodology to study language in all its
    aspects?
  • Yes! The most important principle is that aspects
    of language are studied empirically by analysing
    natural data using a corpus.
  • A corpus is an electronic, machine-readable
    collection of texts that represent real life
    language use.

8
Goals of this lecture
  • To define the terms:
  • corpus linguistics
  • corpus
  • To give an overview of the history of corpus
    linguistics
  • To contrast the corpus-based approach with other
    methodologies used in the study of language

9
An initial example
  • Suppose you're a linguist interested in the
    syntax of verb phrases.
  • Some verbs are transitive, some intransitive
  • I ate the meat pie (transitive)
  • I swam (intransitive)
  • What about
  • quiver
  • quake
  • Are these really intransitive?

Most traditional grammars characterise these as
intransitive.

10
One possible methodology
  • The standard method relies on the linguist's
    intuition
  • I never use quiver/quake with a direct object.
  • I am a native speaker of this language.
  • All native speakers have a common mental grammar
    or competence (Chomsky).
  • Therefore, my mental grammar is the same as
    everyone else's.
  • Therefore, my intuition accurately reflects
    English speakers' competence.
  • Therefore, quiver/quake are intransitive.
  • NB: The above is a gross simplification! E.g.
    linguists often rely on judgements elicited from
    other native speakers.

11
Another possible methodology
  • This one relies on data
  • I may never use quiver/quake with a direct
    object, but other people might.
  • Therefore, I'll get my hands on a large sample of
    written and/or spoken English and check.

12
Quiver/quake: the corpus linguist's answer
  • A study by Atkins and Levin (1995) found that
    quiver and quake do occur in transitive
    constructions:
  • the insect quivered its wings
  • it quaked his bowels (with fear)
  • They used a corpus of 50 million words to find
    examples of the verbs.
  • With sufficient data, you can find examples that
    your own intuition won't give you.

13
Example II: lexical semantics
  • Quasi-synonymous lexical items exhibit subtle
    differences in context.
  • strong
  • powerful
  • A fine-grained theory of lexical semantics would
    benefit from data about these contextual cues to
    meaning.

14
Example II continued
  • Some differences between strong and powerful
    (source: British National Corpus)
  • strong
  • powerful
  • The differences are subtle, but examining their
    collocates helps.
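
One way to examine collocates is to count the words that appear immediately after each adjective. The Python sketch below is not part of the original slides. Its sentences are invented for illustration and are not BNC data, and the helper name `right_collocates` is hypothetical.

```python
from collections import Counter

# Invented example sentences (NOT taken from the BNC); they only
# illustrate the counting method.
SENTENCES = [
    "she drank a cup of strong tea",
    "he made a strong argument",
    "they bought a powerful computer",
    "the car has a powerful engine",
    "strong tea keeps me awake",
]


def right_collocates(sentences, node):
    """Count the words that occur immediately after `node`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that follows it.
        for current, following in zip(words, words[1:]):
            if current == node:
                counts[following] += 1
    return counts


for node in ("strong", "powerful"):
    print(node, right_collocates(SENTENCES, node).most_common())
```

On a real corpus the same counting would be run over millions of words, and a wider window than one word to the right is often used.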

15
Some preliminary definitions
  • The second approach is typical of the
    corpus-based methodology
  • Corpus: A large, machine-readable collection of
    texts.
  • Often, in addition to the texts themselves, a
    corpus is annotated with relevant linguistic
    information.
  • Corpus-based methodology: An approach to Natural
    Language analysis that relies on generalisations
    made from data.

16
Example (British National Corpus)
  • British National Corpus (BNC)
  • 100 million words of English
  • 90% written, 10% spoken
  • Designed to be representative and balanced.
  • Texts from different genres (literature, news,
    academic writing)
  • Annotated: every single word is accompanied by
    part-of-speech information.

17
Example (continued)
  • A sentence in the BNC:
  • Explosives found on Hampstead Heath.
  • <s>
  • <w NN2>Explosives
  • <w VVD>found
  • <w PRP>on
  • <w NP0>Hampstead
  • <w NP0>Heath
  • <PUN>.

18
Example (continued)
  • <s> (new sentence)
  • <w NN2>Explosives (plural noun)
  • <w VVD>found (past tense verb)
  • <w PRP>on (preposition)
  • <w NP0>Hampstead (proper noun)
  • <w NP0>Heath (proper noun)
  • <PUN>. (punctuation)
  • Explosives found on Hampstead Heath

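Annotation in this form can be read by a program. The Python sketch below is not part of the original slides. It assumes the simplified `<w TAG>word` layout of the example sentence; real BNC files carry more markup than this. The function name `parse_tagged` is hypothetical.

```python
import re

# The example sentence in simplified BNC-style markup (an assumption:
# actual BNC files contain additional markup).
TAGGED = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
          "<w NP0>Hampstead <w NP0>Heath <PUN>.")

# Glosses for the tags that appear in the example.
TAG_GLOSSES = {
    "NN2": "plural noun",
    "VVD": "past tense verb",
    "PRP": "preposition",
    "NP0": "proper noun",
    "PUN": "punctuation",
}

# Matches <w TAG>token or <TAG>token. The sentence marker <s> is lower-case
# and carries no token, so it is skipped.
TOKEN_RE = re.compile(r"<(?:w )?([A-Z0-9]+)>(\S+)")


def parse_tagged(text):
    """Return a list of (word, tag) pairs from the simplified markup."""
    return [(word, tag) for tag, word in TOKEN_RE.findall(text)]


for word, tag in parse_tagged(TAGGED):
    print(f"{word:12} {tag:4} {TAG_GLOSSES[tag]}")
```
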
19
Important to note
  • This is not raw text.
  • Annotation means we can search for particular
    patterns.
  • E.g. for the quiver/quake study: find all
    occurrences of quiver which are verbs, followed
    by a determiner and a noun.
  • The collection is very large
  • Only in very large collections are we likely to
    find rare occurrences.
  • Corpus search is done by computer. You can't
    trawl through 100 million words manually!
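
The kind of pattern search described above (a verb form of quiver followed by a determiner and a noun) can be sketched in Python as follows. This is an illustration, not the procedure used by Atkins and Levin. The tagged sample is hand-made, the tag sets are a small selection from the BNC tagset, and the function name `find_transitive_candidates` is hypothetical.

```python
# Tag sets chosen for this sketch (a subset of the BNC tagset).
DETERMINER_TAGS = {"AT0", "DT0", "DPS"}   # article, determiner, possessive
NOUN_TAGS = {"NN0", "NN1", "NN2"}         # common nouns
QUIVER_FORMS = {"quiver", "quivers", "quivered", "quivering"}

# Hand-made sample of (word, tag) pairs; illustrative only.
SAMPLE = [
    ("the", "AT0"), ("insect", "NN1"), ("quivered", "VVD"),
    ("its", "DPS"), ("wings", "NN2"), (".", "PUN"),
    ("a", "AT0"), ("quiver", "NN1"), ("of", "PRF"),
    ("arrows", "NN2"), (".", "PUN"),
    ("she", "PNP"), ("quivered", "VVD"), (".", "PUN"),
]


def find_transitive_candidates(tokens):
    """Return (verb, determiner, noun) triples where a verb form of
    'quiver' is directly followed by a determiner and a noun."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i:i + 3]
        if (w1.lower() in QUIVER_FORMS and t1.startswith("VV")
                and t2 in DETERMINER_TAGS and t3 in NOUN_TAGS):
            hits.append((w1, w2, w3))
    return hits


print(find_transitive_candidates(SAMPLE))
```

The noun "quiver" (as in "a quiver of arrows") is ignored because its tag is not a verb tag, which is exactly what the part-of-speech annotation makes possible. Hits are only candidates: a linguist still has to check each one by hand.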

20
The practical objections
  • But we're linguists, not computer scientists! Do I
    have to write programs?
  • No, there are literally dozens of available tools
    to search in a corpus.
  • Are all corpora good for all purposes?
  • No. Some are general-purpose, like the BNC.
    Others are designed to address specific issues.

21
The theoretical objections
  • What guarantee do we have that the texts in our
    corpus are good data, quality texts, written by
    people we can trust?
  • How do I know that what I find isn't just a
    small, exceptional case? E.g. quiver in a
    transitive construction could really be a
    one-off!
  • Just because there are a few examples of
    something doesn't mean that all native speakers
    use a certain construction!
  • Do we throw intuition out of the window?

22
Part 2
  • A brief history of corpus linguistics

23
Language and the cognitive revolution
  • Before the 1950s, the linguist's task was:
  • to collect data about a language
  • to make generalisations from the data (e.g. In
    Maltese, the verb always agrees in number and
    gender with the subject NP)
  • The basic idea: language is "out there", the sum
    total of things people say and write.
  • After the 1950s:
  • the so-called cognitive revolution
  • language treated as a mental phenomenon
  • no longer about collecting data, but explaining
    what mental capabilities speakers have

24
The 19th and early 20th Century
  • Many early studies relied on corpora.
  • Language acquisition research was based on
    collections of child data.
  • Anthropologists collected samples of unknown
    languages.
  • Comparative linguists used large samples from
    different languages.
  • A lot of work done on frequencies
  • frequency of words
  • frequency of grammatical patterns
  • frequency of different spellings
  • All of this was interrupted around 1955.

25
Chomsky and the cognitive turn
  • Chomsky (1957) was primarily responsible for the
    new, cognitive view of language.
  • He distinguished (1965):
  • Descriptive adequacy: describing language, making
    generalisations such as "X occurs more often than
    Y"
  • Explanatory adequacy: explaining why some things
    are found in a language, but not others, by
    appealing to speakers' competence, their mental
    grammar
  • He made several criticisms of corpus-based
    approaches.

26
Criticisms of corpora (I)
  • Competence vs. performance
  • To explain language, we need to focus on the
    competence of an idealised speaker-hearer.
  • Competence: internalised, tacit knowledge of
    language
  • Performance: the language we speak/write is
    not a good mirror of our knowledge
  • it depends on situations
  • it can be degraded
  • it can be influenced by other cognitive factors
    beyond linguistic knowledge

27
Criticisms of corpora (II)
  • Early work using corpora assumed that
  • the number of sentences of a language is finite
    (so we can get to know everything about language
    if the sample is large enough)
  • But actually, it is impossible to count the
    number of sentences in a language.
  • Syntactic rules make the possibilities literally
    infinite
  • the man in the house (NP -> NP PP)
  • the man in the house on the beach (PP -> PREP
    NP)
  • the man in the house on the beach by the lake
  • So what use is a corpus? We're never going to
    have an infinite corpus.
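
The way a recursive rule such as NP -> NP PP yields an unbounded set of phrases can be sketched in Python. This is an illustration, not part of the original slides. The generator name `noun_phrases` and the fixed list of prepositional phrases are assumptions made for the example.

```python
from itertools import islice

# Prepositional phrases used for the illustration; they are cycled through
# so that the generator never runs out of material.
PLACES = ["in the house", "on the beach", "by the lake"]


def noun_phrases(head="the man"):
    """Yield ever-longer noun phrases by reapplying NP -> NP PP."""
    phrase = head
    depth = 0
    while True:  # the rule can always apply once more
        yield phrase
        phrase = f"{phrase} {PLACES[depth % len(PLACES)]}"
        depth += 1


# Only the first few phrases are printed; the generator itself never ends.
for phrase in islice(noun_phrases(), 4):
    print(phrase)
```

Because the loop has no stopping condition, no finite sample can list every phrase the rule licenses, which is the point of the criticism.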

28
Criticisms of corpora (III)
  • A corpus is always skewed, i.e. biased in favour
    of certain things.
  • Certain obvious things are simply never said.
    E.g. we probably won't find "a dog is a dog" in our
    corpus.
  • A corpus is always partial: we will only find
    things in a corpus if they are frequent enough.
  • A corpus is necessarily only a sample.
  • Rare things are likely to be omitted from a
    sample.

29
Criticisms of corpora (IV)
  • Why use a corpus if we already know things by
    introspection?
  • How can a corpus tell us what is ungrammatical?
  • Corpora won't contain disallowed structures,
    because these are by definition not part of the
    language.
  • So a corpus contains exclusively positive
    evidence: you only get the allowed things.
  • But if X is not in the corpus, this doesn't mean
    it's not allowed.
  • It might just be rare, and your corpus isn't big
    enough. (Skewness)
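
The link between rarity and corpus size can be made concrete with a small calculation. The Python sketch below is not part of the original slides. It makes the simplifying assumption that a construction occurs independently at each word position with a fixed probability, so the chance of seeing it at least once in N words is 1 - (1 - p)^N. The rate of 0.05 occurrences per million words is a made-up figure, and `chance_of_seeing` is a hypothetical name.

```python
def chance_of_seeing(rate_per_million, corpus_size):
    """Probability of at least one occurrence in `corpus_size` words,
    assuming independent occurrences at a fixed rate (a simplification)."""
    p = rate_per_million / 1_000_000
    return 1 - (1 - p) ** corpus_size


# Hypothetical rare construction: 0.05 occurrences per million words.
for size in (1_000_000, 10_000_000, 100_000_000):
    print(f"{size:>11,} words: {chance_of_seeing(0.05, size):.3f}")
```

Under these assumptions the construction is unlikely to appear in a one-million-word corpus but very likely to appear in a hundred-million-word one, so absence from a small corpus says little about whether it is allowed.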

30
Refutations
  • Corpora can be better than introspective evidence
    because:
  • They are public: other people can verify and
    replicate your results (the essence of scientific
    method).
  • Some kinds of data are simply not available to
    introspection. E.g. people aren't good at
    estimating the frequency of words or structures.
  • Skewness can itself be informative: if X occurs
    more frequently than Y in a corpus, that in
    itself is an interesting fact.

31
Refutations (II)
  • By the way, nobody's saying "throw introspection
    out the window".
  • There is no reason not to combine the
    corpus-based and the introspection-based method.
  • Many other objections can be overcome by using
    large enough corpora.
  • Pre-1950, most corpus work was done manually, so
    it was error prone.
  • Machine-readable corpora mean we have a great
    new tool to analyse language very efficiently!

32
Corpora in the late 20th Century
  • Corpus linguistics enjoyed a revival with the
    advent of the digital personal computer.
  • Kucera and Francis: the Brown Corpus, one of the
    first
  • Svartvik: the London-Lund Corpus, which built on
    Brown
  • These were rapidly followed by others. Today,
    corpora are firmly back on the linguistic
    landscape.

33
Summary
  • Introduced the notion of corpus and corpus-based
    research
  • Gave a quick overview of the history of this
    methodology
  • Looked at some possible objections to
    corpus-based methods, and some possible
    counter-arguments

34
Next lecture
  • We look more closely at some important properties
    of a corpus:
  • Machine-readability
  • Balance
  • Representativeness