Title: Corpus linguistics: an introduction
1Corpus linguistics: an introduction
2What is a corpus?
- A collection of naturally occurring language
text, chosen to characterise a state or variety
of language (Sinclair)
- A collection of linguistic data, either written
text or a transcription of recorded data, which
can be used as a starting-point of linguistic
description or as a means of verifying hypotheses
about a language (Dictionary of linguistics and
phonetics)
3What is a corpus? (II)
- Large body of evidence typically composed of
attested language use (McEnery)
- Usually a corpus is in machine-readable format
and is ideally viewable and analysable through (a
single) software package
- The word corpus comes from the Latin for "body"
and the plural is corpora
4What is not a corpus
- Lists of words
- Lists of sentences produced with the purpose of
creating a corpus
- Archive: a repository of readable electronic
texts not linked in any coordinated way
(http://www.archive.org): "The Internet Archive
is building a digital library of Internet sites
and other cultural artifacts in digital form.
Like a paper library, we provide free access to
researchers, historians, scholars, and the
general public."
5What can we do with a corpus?
- Corpus-based approaches: hypotheses are checked
against a corpus
- Corpus-driven approaches: hypotheses are drawn
from the corpus
6What can we do with a corpus? (II)
- 'Alright,' said the computer Deep Thought. 'The
Answer to the Great Question...'
- 'Yes...!'
- 'Of Life, the Universe and Everything ... ' said
Deep Thought.
- 'Yes ... !'
- 'Is ...'
- 'Yes...!!!...?'
- 'Forty-two,' said Deep Thought, with infinite
majesty and calm.
- It was a long time before anyone spoke.
- 'Forty-two!' yelled someone in the audience. 'Is
that all you've got to show for seven and a half
million years' work?'
- 'I checked it very thoroughly,' said the
computer, 'and that quite definitely is the
answer. I think the problem, to be quite honest
with you, is that you've never actually known
what the question is.'
- The Hitchhiker's Guide to the Galaxy by Douglas Adams
7Fields where corpora are used
- Lexicography: to design dictionaries
- Language studies (relations between languages,
differences between genres, evolution of the
language)
- Computational linguistics (training and testing
methods)
- Language teaching (learner corpora)
- Cultural studies, psycholinguistics
8The characteristics of analysis using corpora
(Biber, 1998)
- It is empirical, analysing the actual patterns
of use from natural texts
- It utilises a large and principled collection of
natural texts as the basis for analysis
- It makes extensive use of computers for
analysis, using both automatic and interactive
techniques
- It depends on both quantitative and qualitative
analytical techniques
9History
- We have to split the history into two periods:
before Chomsky and after Chomsky
- Before Chomsky, methods similar to the ones in
corpus linguistics were used (empiricism)
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus1/1fra1.htm
10Early corpus linguistics
- Before Chomsky
- Computers were not available so it was difficult
to analyse large collections of text
- Studies of child language using diaries kept by
parents
- Spelling conventions in a German corpus of 11
million words
- Foreign language pedagogy
11Early corpus linguistics (II)
- All the work of early corpus linguistics was
underpinned by two fundamental, yet flawed
assumptions:
  - The sentences of a natural language are finite.
  - The sentences of a natural language can be
  collected and enumerated.
- Most linguists saw the corpus as the only source
of linguistic evidence in the formation of
linguistic theories
12Chomsky
- Between 1957 and 1965 Chomsky changed the
direction of linguistics from empiricism towards
rationalism
- "Any natural corpus will be skewed. Some
sentences won't occur because they are obvious,
others because they are false, still others
because they are impolite. The corpus, if
natural, will be so wildly skewed that the
description would be no more than a mere list"
(Chomsky, 1962)
- Introspection started to be used instead
13Problems with introspection
- Naturally occurring data is observable and
verifiable by everyone.
- Introspective data is artificial.
- Human beings have only the vaguest notion of the
frequency of a construct or a word.
14The revival of corpus linguistics
- The research in corpus linguistics was continued
in small centres
- The hardware still imposed some restrictions;
the real development started in the 80s
- Fields like computational linguistics were not
interested in using corpora
15The revival of corpus linguistics (II)
- 1960s: Brown Corpus (at Brown University,
American English)
- 1970s: LOB corpus (British English)
- 1980s: Bank of English in Birmingham
- 1990s: (BNC, LDC, ICE corpus, ELRA, TRACTOR,
ICAME)
16Why bother with corpora?
- Even expert speakers have only a partial
knowledge of a language. A corpus can be more
comprehensive and balanced.
- Even expert speakers tend to notice the unusual
and think of what is possible. A corpus can show
us what is common and typical.
- Even expert speakers cannot quantify their
knowledge of language. A corpus can give us
accurate statistics.
17Why bother with corpora? (II)
- Even expert speakers cannot remember everything
they know. A corpus can store and recall all the
information that has been input.
- Even expert speakers cannot make up natural
examples. A corpus can provide us with a vast
number of real examples.
- Even expert speakers have prejudices and
preferences, and every language has cultural
connotations and underlying ideology. A corpus can
give you more objective evidence.
18Why bother with corpora? (III)
- Even expert speakers are not always available to
be consulted. A corpus can be made permanently
accessible to all.
- Even expert speakers cannot keep up with language
change. A constantly updated corpus can reflect
even recent changes in the language.
- Even expert speakers lack authority: they can be
challenged by other expert speakers. A corpus can
encompass the actual language use of many expert
speakers.
19Parameters of a corpus
- Language
  - Monolingual
  - Multilingual (comparable corpora)
  - Parallel
- Type of source
  - Written
  - Spoken
  - Mix
20Parameters of a corpus (II)
- Size of the corpus is not all important and it
depends very much on the type of texts used
- Annotated/not annotated (type of encoding used:
plain text, SGML/XML encoded)
- Static corpus/monitor corpus
- Corpus/sub-corpus
- Number of words/types
21Type/token ratio
- From Brown corpus, 1m tokens (written only):
50,406 types
- From 1980s Birmingham/Cobuild corpora, 1m tokens
(spoken only): 36,807 types, 17,459 occur only
once
- NB: fewer types than Brown (written only);
spoken language is more repetitive and a smaller
vocabulary is used
- 4m tokens (Times newspapers only): 122,773 types,
54,144 occur only once
- 18m tokens (general corpus): 228,323 types,
131,299 occur only once
22Type/token ratio
- 121m tokens (general corpus): 475,633 types,
213,684 occur only once
- 211m tokens (general corpus): 638,901 types
- 323m tokens (general corpus): 812,467 types
- 418m tokens (general corpus): 938,914 types,
438,647 occur only once
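The counts on these two slides can be turned into type/token ratios directly. The following is a minimal sketch in Python, assuming a deliberately naive regular-expression tokenizer; the names `tokenize` and `type_token_stats` are illustrative, and the token totals copied from the slides are the rounded values given there (1m, 418m).

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer, an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_stats(tokens):
    """Return (tokens, types, types occurring only once, type/token ratio)."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    n_tokens = len(tokens)
    n_types = len(counts)
    n_once = sum(1 for c in counts.values() if c == 1)
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, n_once, ratio

# Toy text, not taken from any of the corpora named on the slides.
sample = "the cat sat on the mat and the dog sat on the rug"
print(type_token_stats(tokenize(sample)))

# Ratios for three rows of the slides (token totals are rounded on the slides).
slide_figures = [
    ("Brown, written only", 1_000_000, 50_406),
    ("Birmingham/Cobuild, spoken only", 1_000_000, 36_807),
    ("General corpus", 418_000_000, 938_914),
]
for name, n_tokens, n_types in slide_figures:
    print(f"{name}: {n_types / n_tokens:.4f}")
```

The ratio falls sharply as the corpus grows, so type/token ratios are only comparable between corpora of similar size.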
23Ways to exploit a corpus
- Word (token)/type frequency lists
- N-grams
- Concordances
- Collocations/colligations
- Specially designed programs (especially when the
corpus is annotated)
24Frequency lists
- are lists which indicate the words which appear
in a corpus and their frequency
- they provide a survey of the corpus
- a frequency list becomes more meaningful when
compared with other lists
- they remove a word from its contexts
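A frequency list of this kind takes only a few lines to build. The sketch below is one possible version in Python, assuming a naive tokenizer that keeps letters and apostrophes only; `frequency_list` and `relative_frequency` are hypothetical helper names, and the two sample texts are invented for illustration.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

def relative_frequency(text, word):
    """Share of all tokens in the text that are the given word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens.count(word.lower()) / len(tokens) if tokens else 0.0

# Invented examples standing in for a spoken and a written sample.
spoken = "well you know I mean you know it was you know fine"
written = "the committee reviewed the proposal and approved the budget"

print(frequency_list(spoken)[:3])
# Comparing the same word across two lists is what makes the numbers meaningful.
print(relative_frequency(spoken, "you"), relative_frequency(written, "you"))
```

Relative rather than raw frequencies are used for the comparison because the two texts differ in length.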
25N-grams
- groups of N words which appear in sequence in
the text
- they are presented using frequency lists
- good way to identify recurring/specific
expressions for a corpus
- provide limited context for the words
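N-gram extraction can be sketched as a sliding window over the token sequence, with the results presented as a frequency list. This is a minimal Python illustration; `ngrams` and `ngram_frequencies` are names chosen here, and the input is assumed to be already tokenized.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(tokens, n):
    """Return (n-gram, count) pairs, most frequent first."""
    return Counter(ngrams(tokens, n)).most_common()

tokens = "to be or not to be".split()
for gram, count in ngram_frequencies(tokens, 2):
    print(" ".join(gram), count)
```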
26Concordances
- show words in the context in which they appear
- usually they are obtained using special programs
which allow the lists of concordances to be
manipulated
- KWIC (Key Word In Context) is the most common
format
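A KWIC display can be approximated with a short function that collects a fixed number of tokens on either side of each hit. The sketch below is illustrative only: `kwic` and `print_kwic` are hypothetical names, the context width is counted in tokens (an assumption, real concordancers often count characters), and matching is case-insensitive.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

def print_kwic(hits):
    """Print the hits with the keyword aligned in one column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    for left, kw, right in hits:
        print(f"{left:>{pad}}  [{kw}]  {right}")

text = "the cat sat on the mat while the dog slept"
print_kwic(kwic(text.split(), "the", width=2))
```

Sorting the returned tuples by left or right context is one simple way to manipulate the concordance lines further.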
27Collocations
- collocation: the occurrence of two or more
words within a short space of each other in text
- the collocates are extracted using a window to
the left and right of a specified word
- can be used to further analyse the context of a
word
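Window-based extraction of collocates can be sketched as follows. This is a minimal Python illustration using raw co-occurrence counts only; the function name `collocates` and the default window size are assumptions, and no association measure is applied.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens to the left or right of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented example sentence.
tokens = "strong tea and strong coffee but weak tea".split()
print(collocates(tokens, "tea", window=1).most_common())
```

Raw counts favour very frequent words, so in practice the counts are usually compared against how often each collocate occurs in the corpus as a whole.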
28The word gamut
29Building corpora
- Ways to acquire corpora
  - Direct conversion from electronic format
  - Optical scanning
  - Keyboarding
  - Speech transcription
30Building corpora (II)
- Criteria in corpus design
  - Size (small corpora are for genre-specific
  studies, whereas big corpora make robust, general
  statements about a language)
  - Genre (domain, distribution, age, ...)
- The structure of the corpus can be decided
  - A priori (Brown, LOB, ...)
  - A posteriori
- Old material is replaced with new material
(monitor corpus)
31Building corpora (III)
- Selection, permission, acquisition
- Data conversion, optical scanning, keyboarding,
speech transcription
- Cleaning, spell-checking, encoding (annotation),
indexing
- Writing documentation
- Evaluation of corpora
- Distribution of corpora
32Possible problems when building a corpus
- A sampling frame designed to allow the
exploitation of certain linguistic properties
- Balance and representativeness
- Information can be lost through cleaning
- Duplication
- When working with speech, information can be lost
through transcribing
33Web as a corpus
- The Web can be a very useful source of texts
- The Web is very helpful for languages other than
English
- Quite often there is no control over the language
which is investigated, therefore filtering (if
possible) is necessary
34Corpus annotation
- Enrichment of a corpus with various types of
information
- It can be done at every level
  - Word: part of speech, sense
  - Sentence: sentence boundaries, syntactic tree
  - Discourse: coreferential chains, discourse
  segments
  - Certain expressions: named entities
35Annotation scheme
- A standard used to annotate certain
characteristics
- Gives meaning to a tag
- Nowadays it is in XML
- Usually in addition to an annotation scheme, a
set of guidelines is produced to assist the
annotation
36Examples (II)
- <P><S><W POS="PRON" NUM="PL" LEMMA="we">We</W>
  <W POS="V" LEMMA="have">have</W>
  <W POS="EN" LEMMA="develop">developed</W>
  <NP><W POS="DET" LEMMA="a">a</W>
  <W POS="A" LEMMA="computational">computational</W>
  <W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>
  <W POS="PUNCT">,</W> ...</NP> ... </S></P>
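Once a text is encoded like this, the annotation can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened version of the sentence above; dropping the elided parts ("...") and the punctuation token is an assumption made so that the fragment is complete, and `read_words` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, self-contained version of the annotated sentence (elisions dropped).
fragment = (
    '<P><S>'
    '<W POS="PRON" NUM="PL" LEMMA="we">We</W>'
    '<W POS="V" LEMMA="have">have</W>'
    '<W POS="EN" LEMMA="develop">developed</W>'
    '<NP><W POS="DET" LEMMA="a">a</W>'
    '<W POS="A" LEMMA="computational">computational</W>'
    '<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W></NP>'
    '</S></P>'
)

def read_words(xml_text):
    """Return (word form, part of speech, lemma) for every W element, in text order."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("POS"), w.get("LEMMA")) for w in root.iter("W")]

words = read_words(fragment)
for form, pos, lemma in words:
    print(form, pos, lemma)

# Example query: all lemmas tagged as nouns.
print([lemma for _, pos, lemma in words if pos == "N"])
```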
37What are the advantages of corpus annotation?
- Ease of exploitation
- Reusability
- Multi-functionality
- Explicit analyses
- Once a corpus is annotated it can be used in
further research
38Annotation of a corpus
- Can be done automatically, semi-automatically
or manually
- Sometimes the method is automatic and then the
results are postprocessed
- Usually special tools are used to minimise
human error
39Criticism of corpus annotation
- Corpus annotation produces impure corpora
- Sometimes annotation can hide certain features
- Consistency versus accuracy
- Measures to compute the reliability of an
annotation
- Sometimes the annotation scheme can cover a
phenomenon only partially.
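The slides do not name a particular reliability measure. As an illustration only, the sketch below computes Cohen's kappa, one commonly used agreement statistic for two annotators who label the same items; the function name `cohen_kappa` and the sample tags are assumptions.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented part-of-speech tags from two hypothetical annotators.
annotator_1 = ["N", "V", "N", "N"]
annotator_2 = ["N", "V", "V", "N"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is often preferred to a simple percentage.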
40Existing corpora
- Brown Corpus/LOB corpus
- Bank of English
- Wall Street Journal, Penn Treebank, BNC, ANC,
ICE, WBE, Reuters Corpus
- Canadian Hansard: parallel corpus English-French
- York-Helsinki Parsed Corpus of Old English Poetry
- Tiger corpus: German
- CORIS/CODIS: contemporary written Italian
- MULTEXT: 1984 and The Republic in many languages
41Distributors of corpora
- LDC (Linguistic Data Consortium)
- ELRA (European Language Resources Association)
- TRACTOR (TELRI Research Archive of Computational
Tools and Resources)
- ICAME (International Computer Archive of Modern
and Medieval English)
42References
- Karin Aijmer and Bengt Altenberg (1991) English
corpus linguistics, Longman
- Douglas Biber, Susan Conrad and Randi Reppen (1998)
Corpus linguistics, Cambridge University Press
- Graeme D. Kennedy (1998) An introduction to
corpus linguistics, Longman
- Tony McEnery and Andrew Wilson (1996) Corpus
linguistics, Edinburgh University Press
43References (II)
- Geoff Barnbrook (1996) Language and Computers,
Edinburgh University Press
- Tony McEnery (2003) Corpus linguistics. In
Ruslan Mitkov (ed.) The Oxford Handbook of
Computational Linguistics, Oxford University
Press