LELA 30922 English Corpus Linguistics

1
LELA 30922 English Corpus Linguistics
  • Harold Somers
  • Professor of Language Engineering
  • Office: Lamb 1.15

2
Syllabus
3
Assessment
  • A practical project in which students will use
    the BNC (or other approved corpus material) to
    investigate some question of English language
    usage.
  • Suggestion: base your project (more or less
    closely) on some existing study.
  • Project write-up will include relevant background
    material and results and discussion of a
    corpus-based analysis.
  • In other words: summarize (and criticize) the
    chosen study, then do your own version, and
    compare the results.

4
Reading matter
  • Main recommendations
  • Kennedy, G.D. (1998) An introduction to corpus
    linguistics. London: Longman.
  • McEnery, T. & A. Wilson (2001, 2nd ed.) Corpus
    linguistics. Edinburgh: Edinburgh University
    Press.
  • Meyer, C. (2002) English corpus linguistics: An
    introduction. Cambridge: Cambridge University
    Press.
  • Lots of other books, focussing on particular
    aspects
  • Do not ignore journals (Int J Corp Ling) and
    specialist conferences, especially when
    considering practical assignment.
  • http://tinyurl.com/32abhb for list of resources
    available at UoM

5
What is a corpus?
  • Corpus (pl. corpora) = body
  • Collection of written text or transcribed speech
  • Usually but not necessarily purposefully
    collected
  • Usually but not necessarily structured
  • Usually but not necessarily annotated
  • (Usually stored on and accessible via computer)
  • Corpus vs. text archive

6
Computers and corpus linguistics
  • Historically, manual analysis of large bodies of
    text (esp. in literary and biblical studies)
  • Error-prone, time-consuming, not verifiable
  • Computers have introduced:
  • Reliability, accuracy and replicability
  • increased speed and capacity means you can do
    more on a grander scale
  • new tools mean you can do things you might not
    have thought of doing

7
What is corpus linguistics?
  • Not a branch of linguistics, like socio-,
    psycho-, ...
  • Not a theory of linguistics
  • A set of tools and methods (and a philosophy) to
    support linguistic investigation across all
    branches of the subject

8
Evidence in linguistics
  • Real attested usage as linguistic evidence
  • Contrasts with introspective approach previously
    typical
  • Relates to the competence/performance
    (langue/parole) distinction
  • Corpus linguists often more interested in trends
    than rules (probabilities rather than
    certainties)
  • Famous stories of corpus evidence contradicting
    widely-held assumptions about language use.
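
The idea of trends rather than rules can be illustrated with relative frequencies. The following is a minimal Python sketch, not part of the original slides: the function names `tokenize` and `per_million` and the sample sentence are invented for illustration, and the tokenizer is deliberately crude. It counts how often a word occurs and scales the count to occurrences per million tokens, so that figures from corpora of different sizes can be compared.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(word, tokens):
    """Relative frequency of `word`, scaled to occurrences per million tokens."""
    counts = Counter(tokens)
    return counts[word.lower()] / len(tokens) * 1_000_000


# Invented sample sentence, for illustration only (not from any real corpus).
sample = "I shall go if you will go, and we will all go together."
tokens = tokenize(sample)

print(len(tokens))                         # number of tokens in the sample
print(round(per_million("will", tokens)))  # "will" per million tokens
print(round(per_million("shall", tokens))) # "shall" per million tokens
```

On a sample this small the figures mean nothing; on a real corpus the same calculation lets you compare, say, "shall" against "will" as tendencies rather than absolute rules.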

9
Activities in corpus linguistics
  • Design and compilation of corpora
  • Development of tools for corpus analysis
  • Descriptive linguists using corpora to analyze
    lexical and grammatical behaviour of language, e.g.
    for lexicography
  • Exploiting corpora in applied linguistics:
    language teaching, translation.

10
History of Corpus Linguistics
www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/history.html
  • Textual study has always included an element of
    counting and cataloguing, despite
    impracticalities: notably concordances of
    Shakespeare, the Bible, etc.
  • Arrival of computers in 1950s of course changed
    everything
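
To show what a concordance is and why computers made it easy, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not from the original slides: the function name `kwic`, the `width` parameter, and the bracketed output format are assumptions made for illustration. Real concordancers handle tokenization, alignment and sorting far more carefully.

```python
import re


def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side of each hit."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


# A short line from Hamlet, with punctuation removed for simplicity.
sample = "To be or not to be that is the question"
for line in kwic(sample, "be", width=2):
    print(line)
```

Each output line shows one occurrence of the keyword in brackets with its immediate context, which is the task that once took concordance compilers years to do by hand.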

11
Brown corpus
  • First modern computer-readable corpus
  • W.N. Francis and H. Kučera, Brown University,
    Providence, RI
  • one million words of American English texts
    printed in 1961
  • sampled from 15 different text categories
  • used as model for other corpora, including

12
LOB corpus
  • compiled by researchers in Lancaster, Oslo and
    Bergen
  • one million words of British English texts
    printed in 1961
  • sampled from same 15 text categories as Brown
    corpus
  • All texts 2,000 words long
  • Kolhapur corpus of Indian English compiled in
    1978 to same specification

13
Chomsky's criticisms
  • Chomsky's ideas drove linguists away from
    empiricism (data) towards rationalism
    (introspection)
  • Chomsky switched focus onto abstract models of
    language competence
  • He was especially scathing about corpus-based
    approaches
  • Based on mistaken view that corpus linguists
    confused finiteness of data with finiteness of
    language
  • See McEnery & Wilson, chapter 1

14
The London-Lund Corpus of Spoken English (LLC)
  • First corpus of transcribed spoken language
  • Part of Survey of Spoken English at Lund
    University under the direction of J. Svartvik
  • 500,000 words of spoken British English recorded
    from 1953 to 1987
  • different categories, such as spontaneous
    conversation, spontaneous commentary, spontaneous
    and prepared oration

15
COBUILD
  • 1m-word corpus too small for many applications
  • 1980: Collins instigated collection of 20m-word
    corpus to support lexicographers writing new
    Collins Birmingham University International
    Learner's Dictionary (John Sinclair)
  • Now expanded to Bank of English corpus, 320m
    words and growing
  • www.collins.co.uk/Corpus/CorpusSearch.aspx
  • www.collins.co.uk/books.aspx?group=153

16
BNC (1995)
  • http://www.natcorp.ox.ac.uk/
  • 100m-word collection of written and spoken text
    from 1975-93 (already dated in some respects!)
  • Carefully designed and balanced
  • Corpus is closed (finite, synchronic)
  • All text tagged to high quality
  • Lots of tools available for exploration

17
etc.
  • Many other corpus projects now underway,
    sometimes modelled on BNC or other well-known
    corpora
  • Various national projects
  • Specialized corpora
  • Historical texts
  • Learner English
  • International English
  • Translated English
  • Spoken dialogues for certain domains
  • When widely used, they become a kind of
    benchmark, e.g. Wall Street Journal corpus
    (treebank)
  • This can have pros and cons