Corpus Linguistics - PowerPoint PPT Presentation

Slides: 27
Provided by: Har57

Transcript and Presenter's Notes

Title: Corpus Linguistics


1
Corpus Linguistics
2
Varieties of English
  • Relevance of corpus linguistics to this course
  • Previously, studies of stylistics were largely
    informal and subjective
  • Using computers to look at larger amounts of data
    allows us to be more formal and objective
  • Corpus linguistics basically provides a
    mindset (and some procedures) for doing this

3
What is a corpus?
  • Corpus (pl. corpora) = body
  • Collection of written text or transcribed speech
  • Usually but not necessarily purposefully
    collected
  • Usually but not necessarily structured
  • Usually but not necessarily annotated
  • (Usually stored on and accessible via computer)
  • Corpus ≠ text archive

4
Purposefully collected
  • Text samples collected to meet a specific need
  • Corpus may be quite focused, eg corpus of
    newswire texts, or may be more general
  • Issue of balance often important
  • Demographic features (age, sex, location, social
    class of writer/reader)
  • Different styles and genres

5
Structured
  • Overall corpus is divided into sections defined
    by parameters
  • Again balance will ensure that different genres
    or demographic features are equally represented

6
Parameters in the BNC (written portion)
7
Genre distinctions in the BNC (written portion)
8
Parameters in BNC (spoken part)
9
Parameters in BNC (spoken part) cont
10
Annotated
  • Not just plain text
  • Most corpora are at least POS tagged
  • Each word has its part of speech (POS) identified
  • POS tags contain quite rich information, eg not
    just verb but including some morphological
    information
  • Tags also disambiguate where possible, eg
    between book as noun and book as verb
  • Some may also have other information indicated
  • structural information resulting from parse
  • word sense distinctions for same-POS homonyms
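
As a minimal sketch of what POS annotation makes possible, the Python below parses a tagged sentence and lists the tags attached to each word form. The sentence itself is invented for illustration, and the "word_TAG" layout is only an assumed storage format; the tags follow the CLAWS C5 tagset used for the BNC (NN1 = singular noun, VVI = infinitive of a lexical verb).

```python
from collections import defaultdict

# Invented example sentence in an assumed "word_TAG" layout.
# Tags are CLAWS C5: PNP pronoun, VM0 modal, VVI infinitive verb,
# AT0 article, NN1 singular noun, CJC coordinating conjunction.
TAGGED = "I_PNP will_VM0 book_VVI a_AT0 table_NN1 and_CJC read_VVI a_AT0 book_NN1"


def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def tags_by_word(pairs):
    """Map each lower-cased word form to the set of tags it occurs with."""
    index = defaultdict(set)
    for word, tag in pairs:
        index[word.lower()].add(tag)
    return dict(index)


pairs = parse_tagged(TAGGED)
index = tags_by_word(pairs)

# "book" occurs once as a verb and once as a noun in this sentence.
print(sorted(index["book"]))  # ['NN1', 'VVI']
```

Because each token carries its own tag, a search for book as a verb can ignore the noun uses, which plain text alone would not allow.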

11
(No Transcript)
12
(No Transcript)
13
What is corpus linguistics?
  • Not a branch of linguistics, like socio- or
    psycholinguistics
  • Not a theory of linguistics
  • A set of tools and methods (and a philosophy) to
    support linguistic investigation across all
    branches of the subject

14
Evidence in linguistics
  • Real attested usage as linguistic evidence
  • Contrasts with introspective approach previously
    typical
  • Relates to the competence/performance
    (langue/parole) distinction
  • Corpus linguists often more interested in trends
    than rules (probabilities rather than
    certainties)
  • Famous stories of corpus evidence contradicting
    widely-held assumptions about language use.

15
Activities in corpus linguistics
  • Design and compilation of corpora
  • Development of tools for corpus analysis
  • Descriptive linguists using corpora to analyze
    lexical and grammatical behaviour of language, eg
    for lexicography, and of course stylistics
  • Exploiting corpora in applied linguistics:
    language teaching, translation.

16
History of Corpus Linguistics
www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/history.html
  • Textual study has always included an element of
    counting and cataloguing, despite the
    impracticalities: notably concordances of
    Shakespeare, the Bible, etc.
  • Arrival of computers in 1950s of course changed
    everything

17
Brown corpus
  • First modern computer-readable corpus
  • W.N. Francis and H. Kucera, Brown University,
    Providence, RI
  • one million words of American English texts
    printed in 1961
  • sampled from 15 different text categories
  • used as model for other corpora, including

18
LOB corpus
  • compiled by researchers in Lancaster, Oslo and
    Bergen
  • one million words of British English texts
    printed in 1961
  • sampled from same 15 text categories as Brown
    corpus
  • All texts 2,000 words long
  • Kolhapur corpus of Indian English compiled in
    1978 to same specification

19
The London-Lund Corpus of Spoken English (LLC)
  • First corpus of transcribed spoken language
  • Part of Survey of Spoken English at Lund
    University under the direction of J. Svartvik
  • 500,000 words of spoken British English recorded
    from 1953 to 1987
  • different categories, such as spontaneous
    conversation, spontaneous commentary, spontaneous
    and prepared oration

20
COBUILD
  • 1m-word corpus too small for many applications
  • 1980: Collins instigated collection of 20m-word
    corpus to support lexicographers writing new
    learner's dictionary at Birmingham University
    under John Sinclair (COBUILD = Collins Birmingham
    University International Language Database)
  • Now expanded to Bank of English corpus, 320m
    words and growing
  • www.collins.co.uk/Corpus/CorpusSearch.aspx
  • www.collins.co.uk/books.aspx?group=153

21
BNC (1995)
  • http://www.natcorp.ox.ac.uk/
  • 100m word collection of written and spoken text
    from 1975-93 (already dated in some respects!)
  • Carefully designed and balanced
  • Corpus is closed (finite, synchronic)
  • All text tagged to high quality
  • Lots of tools available for exploration
  • Nice online interface (available on campus)

http://bnc.humanities.manchester.ac.uk/cgi-bnc/BNCquery.pl?theQuery=search&urlTest=yes
22
What can you do with a corpus?
  • Many things, but just some examples
  • Investigate behaviour of words and how they
    relate to genre, mode, sex of speaker/hearer
  • Prove (or disprove) supposed trends with
    quantitative data

23
Example 1: swearing
  • Women and men swear (and use taboo words)
    differently
  • Data (from BNC spoken part) shows:
  • Women and men use different swear words
  • They use them for different effect (men use them
    to disparage, women use them to intensify)
  • Their use changes depending on the sex of the
    listener(s): women swear more in single-sex
    groups; men don't swear more in mixed-sex groups
    than amongst themselves

24
Example 2.1: Near synonyms
  • Subtle differences in the meaning of near
    synonyms can be distinguished by looking at the
    words they collocate with
  • "You shall know a word by the company it keeps"
    (Firth)
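
One way to make "the company it keeps" concrete is to count the words that appear within a few tokens of a node word. The sketch below does this for frail and fragile over a handful of invented sentences (toy data, not taken from the BNC); the function name `collocates` and the window size of 2 are illustrative assumptions, not a standard tool.

```python
from collections import Counter

# Invented toy sentences, for illustration only (not BNC data).
SENTENCES = [
    "the frail old man walked slowly",
    "a frail elderly woman sat down",
    "the fragile glass vase broke",
    "a fragile peace agreement held",
    "the frail old lady smiled",
]


def collocates(node, sentences, window=2):
    """Count words occurring within `window` tokens of `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


frail = collocates("frail", SENTENCES)
fragile = collocates("fragile", SENTENCES)

# In this toy data "old" accompanies frail but never fragile.
print(frail["old"], fragile["old"])  # 2 0
```

On a real corpus the raw counts would normally be filtered for function words and weighted by an association measure, but the comparison of neighbouring words is the same idea.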

25
frail vs fragile
26
Example 2.2: Near synonyms
  • In addition, near synonyms can be shown to be
    favoured depending on genre, eg big vs large

Frequency per million words

Category                                   big      large
Spoken conversation                     768.55     488.34
Other spoken material                   395.89     447.58
Newspapers                              365.27     431.62
Fiction and verse                       333.00     293.06
Other published written material        290.84     223.43
Unpublished written material            247.39     186.35
Non-academic prose and biography        139.63     181.19
Academic prose                           38.85      45.11
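
Figures like these are normalised so that sections of different sizes can be compared: raw count divided by section size, times one million. The sketch below shows that calculation and then uses the rates from the table above to see which word each category favours. The helper names `per_million` and `favoured` are made up for illustration, and the raw counts behind the table are not given in the slides, so none are assumed here.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Rates per million words, copied from the table above: (big, large).
RATES = {
    "Spoken conversation": (768.55, 488.34),
    "Other spoken material": (395.89, 447.58),
    "Newspapers": (365.27, 431.62),
    "Fiction and verse": (333.00, 293.06),
    "Other published written material": (290.84, 223.43),
    "Unpublished written material": (247.39, 186.35),
    "Non-academic prose and biography": (139.63, 181.19),
    "Academic prose": (38.85, 45.11),
}


def favoured(big_rate, large_rate):
    """Return whichever word has the higher normalised frequency."""
    return "big" if big_rate > large_rate else "large"


for category, (big_rate, large_rate) in RATES.items():
    ratio = big_rate / large_rate
    print(f"{category}: favours {favoured(big_rate, large_rate)} "
          f"(big/large = {ratio:.2f})")
```

For example, `per_million(250, 5_000_000)` gives 50.0, so 250 hits in a five-million-word section count the same as 50 hits in a one-million-word section.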