PowerPoint Presentation

Slides: 103

Provided by: glo995

Transcript and Presenter's Notes

1
(No Transcript)
2
What is a CORPUS?

A corpus is a collection of pieces of language
that are selected and ordered according to
explicit linguistic criteria in order to be used
as a sample of the language
(Sinclair 1996)

3
What is a CORPUS?

the term corpus as used in modern
linguistics can best be defined as a collection
of sampled texts, written or spoken, in
machine-readable form which may be annotated with
various forms of linguistic information
(McEnery, Xiao and Tono 2006)

4
Key concepts re. Corpora

Machine-readable texts
Authentic texts
Sampled texts
Representative of a particular language or
language variety

5
Is Corpus Linguistics a new approach to the study
of language?

The expression Corpus Linguistics first appeared
in the early 1980s.
Corpus-based language study, however, has a
substantial history.

6
Corpus-based language study

In the pre-Chomskyan era
Field linguists (Boas)
Structuralists (Sapir, Newman, Bloomfield, Pike,
etc.)
Corpora were a few paper slips with data.
"Shoebox corpora": non-representative.
Corpus-based only in that the methodology was
empirical and based on observable data.

7
The 50s: the protests

Chomsky (1962) criticized the (contemporary) corpus
methodology on account of the skewedness of
corpora.
Objections: non-representative, time-consuming,
competence vs. performance, I-language vs. E-language.
Corpora were marginalized.

8
The revolutionary 60s

With the advances in computer technology the
exploitation of massive corpora became feasible.
Brown Corpus
Brown University Standard Corpus of American
Present-day English

9
The 80s: the boom

From the 80s onwards the number and size of
corpora and corpus-based studies have increased
dramatically.
Corpora have revolutionized almost all
branches of linguistics.

10
A few remarks

Computers
allow us to speed up the processing of data.
avoid human bias in data analysis
allow the enrichment of data with metadata

11
Intuition vs. Corpus

Intuition should be applied with caution
Influence of dialect, sociolect, idiolect
No universal agreement on (degree of)
acceptability
Informants monitor their use of language
(non-spontaneous)
Introspection is not observable

12
Intuition vs. Corpus

Corpus-based approach draws upon authentic or
real texts
Computer-based analysis can retrieve differences
that intuition alone cannot perceive
Reliable quantitative data

13
Should we dismiss intuition then?

Not at all!
The key to using corpus data is to find the
balance between the use of corpus data and the
use of one's own intuition.

14
Should we dismiss intuition then?

Not all research questions can be addressed by
the corpus-based approach.
Corpus-based approach and intuition-based
approach
ARE NOT MUTUALLY EXCLUSIVE

15
Leech (1991: 14) writes

Neither the corpus linguist of the 1950s,
who rejected intuition, nor the general linguist
of the 1960s, who rejected corpus data, was able
to achieve the interaction of data coverage and
the insight that characterise the many successful
corpus analyses of recent years.

16
Is CL a methodology or a theory?

No universal agreement.
CL is a METHODOLOGY and not an independent branch
of linguistics such as semantics, pragmatics,
syntax, etc.
CL can be employed to explore almost any area of
linguistic research.

17
Corpus-based or Corpus-driven approaches?

Corpus-based approaches are used to expound,
test or exemplify theories and descriptions that
were formulated before large corpora became
available to inform language study
(Tognini-Bonelli 2001: 65).
Therefore, corpus-based linguists are not
strictly committed to corpus data and they would
discard inconvenient evidence by insulation,
standardisation and instantiation (i.e. via
corpus annotation).

18
Corpus-based or Corpus-driven approaches?

Corpus-driven linguists are strictly committed
to the integrity of the data as a whole.
Theoretical statements are fully consistent with,
and reflect directly, the evidence provided by
the corpus.
(Tognini-Bonelli 2001: 84-85).

19
Corpus-based or Corpus-driven approaches?

The distinction is overstated; they are 2
idealized extremes.
4 basic differences between the 2 approaches:
Types of corpora used
Attitudes towards theories and intuitions
Focuses of research
Paradigmatic claims

C.B. Approaches
Corpus must be representative and balanced
Size is not all-important
Minimum frequency is used to exclude non-relevant
results
In favour of corpus annotation.
CB approaches generally have existing theory as a
starting point and correct and revise such theory
in the light of corpus evidence.
Distinction between the different levels of
language analysis.

C.D. Approaches
Corpus will balance itself when it grows to be
big enough (cumulative representativeness)
Corpus must be very large
Corpus evidence is exploited fully, but this way
the number of the combinations is enormous
Against corpus annotation (no preconceived
theories)
No distinction between lexis, syntax,
pragmatics, etc. There is only 1 level of
language description: the functionally complete
unit of meaning or language patterning

We will only refer to
CORPUS-BASED APPROACHES
A few key notions in
Corpus Linguistics

22
Representativeness

Essential feature of a corpus.
Balance (the range of genres included in a
corpus) and sampling (how the text chunks for
each genre are selected) ensure
representativeness.

23
Representativeness

A corpus is representative if:
the findings based on its contents can be
generalized to the said language variety (Leech
1991)
its samples include the full range of
variability in a population (Biber 1993)

24
Representativeness

It changes over time (Hunston 2002): if a corpus
is not regularly updated, it rapidly becomes
unrepresentative.

25
Representativeness

Criteria to select texts for a corpus:
External criteria (Biber's situational
perspective): defined situationally, e.g. genres,
registers, text types, etc.
Internal criteria (Biber's linguistic
perspective): defined linguistically, taking into
account the distribution of linguistic features.
Internal criteria are CIRCULAR because a corpus is
typically designed to study linguistic distribution,
so there is no point in analysing a corpus where the
distribution of linguistic features is predetermined.

26
Representativeness

2 main types (according to the range of text
categories represented):
General corpora: a basis for an overall
description of a language (variety); their
representativeness depends on the sampling from a
broad range of genres.
Specialized corpora: domain- or genre-specific
corpora; their representativeness can be measured
by the degree of closure or saturation (lexical
features).
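
One rough way to look at closure or saturation is to cut a corpus into equal segments and count how many new word types each segment adds; when the counts approach zero, the lexicon is close to saturated. The sketch below is only an illustration under that assumption: the function name and the toy token list are hypothetical, and the slides do not prescribe this exact procedure.

```python
def new_types_per_segment(tokens, n_segments):
    """Count how many previously unseen word types each segment adds."""
    size = len(tokens) // n_segments  # leftover tokens at the end are ignored
    seen = set()
    growth = []
    for i in range(n_segments):
        segment = tokens[i * size:(i + 1) * size]
        fresh = set(segment) - seen
        growth.append(len(fresh))
        seen |= fresh
    return growth

# Toy specialized "corpus" (hypothetical data, 20 tokens).
tokens = ("the pump feeds the reactor the valve feeds the pump "
          "the reactor feeds the valve the pump feeds the valve").split()
growth = new_types_per_segment(tokens, 4)
print(growth)  # [4, 1, 0, 0]
```

A falling curve like this one suggests that further sampling adds little new vocabulary for the domain in question.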

27
Balance

The range of text categories included in the
corpus
The acceptable balance is determined by the
intended uses.
A balanced corpus covers a wide range of text
categories which are supposed to be
representative of the language (variety) under
consideration.

28
Balance

There is no scientific measure for balance.
It is more important for sample corpora than
for monitor corpora

29
Sampling

A corpus is a sample of a given population
A sample is representative if what we find for
the sample holds for the general population
Samples are scaled-down versions of a larger
population

30
Sampling

Sampling unit: for written text, a sampling unit
could be a book, periodical or newspaper.
Population: the assembly of all sampling units;
it can be defined in terms of language
production, reception (demographic, sex, age,
etc.) or language as a product (category, genre
of language data).
Sampling frame: the list of sampling units

31
Sampling

Sampling techniques:
Simple random sampling: all sampling units within
the sampling frame are numbered and the sample is
chosen by use of a table of random numbers; rare
features may not be accounted for.
Stratified random sampling: the population is
divided into relatively homogeneous groups, i.e.
the strata, which are then sampled at random;
never less representative than the former
method.
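
The two techniques can be sketched in Python as follows. This is a minimal illustration, not a procedure taken from the slides: the sampling frame, the genre labels and the function names are hypothetical, a pseudo-random generator stands in for the table of random numbers, and sampling the same fraction from every stratum is an assumption.

```python
import random
from collections import defaultdict

def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units at random from the whole sampling frame."""
    rng = random.Random(seed)
    return rng.sample(frame, k)

def stratified_random_sample(frame, stratum_of, fraction, seed=0):
    """Split the frame into strata, then sample each stratum at random."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for units in strata.values():
        # At least one unit per stratum, so small categories are not lost.
        k = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, k))
    return sample

# Hypothetical sampling frame: (text id, genre) pairs.
frame = [(f"news{i}", "news") for i in range(90)] + \
        [(f"law{i}", "legal") for i in range(10)]

srs = simple_random_sample(frame, 10)
strat = stratified_random_sample(frame, lambda unit: unit[1], 0.1)
```

With simple random sampling the ten legal texts may be missed entirely; the stratified draw always includes at least one of them.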

32
Sampling

Sample size
Full texts: no balance; peculiarities of
individual texts may show through.
Text chunks (e.g. 2000 running words): frequent
linguistic features are stable in their
distribution, and hence short text chunks are
sufficient for their study (Biber 1993).
Text-initial, middle and end samples must be balanced.

33
Sampling

Proportion and number of samples
The number of samples across text categories
should be proportional to their frequencies
and/or weights in the target population in order
for the resulting corpus to be considered as
representative
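
As a worked illustration of proportional allocation, the sketch below shares a fixed number of samples across text categories according to their weights. The category names, the weights and the function name are hypothetical, and the largest-remainder rounding step is an assumption added so that the counts always sum to the requested total.

```python
def allocate_samples(weights, total):
    """Share `total` samples across categories in proportion to their weights."""
    weight_sum = sum(weights.values())
    exact = {cat: total * w / weight_sum for cat, w in weights.items()}
    counts = {cat: int(x) for cat, x in exact.items()}
    # Hand out the samples lost to rounding down, largest remainder first.
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cat in by_remainder[:leftover]:
        counts[cat] += 1
    return counts

# Hypothetical weights for three text categories in the target population.
weights = {"press": 5, "fiction": 3, "academic": 2}
plan = allocate_samples(weights, 500)
print(plan)  # {'press': 250, 'fiction': 150, 'academic': 100}
```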

34
What matters is the Research Question!

Claims of corpus representativeness and balance
should be interpreted in relative terms as there
is no objective way to balance a corpus or to
measure its representativeness.
Representativeness is a fluid concept: the
research question that one has in mind when
building a corpus determines what is an
acceptable balance for the corpus one should use
and whether it is suitably representative.

35
Data collection

Spoken data must be transcribed from audio
recordings.
Written text must be rendered machine-readable by
keyboarding or OCR (Optical Character
Recognition) scanning.
Language data so collected form a RAW CORPUS.

36
Corpus Mark-up

System of standard codes inserted into a
document stored in electronic form to provide
information about the text itself and govern
formatting, printing and other processes.
Most widely used mark-up schemes:
TEI (Text Encoding Initiative)
CES (Corpus Encoding Standard)

37
Corpus Mark-up

It is essential in corpus-building because:
sampled texts are out of context and mark-up
allows contextual information to be recovered
it provides more information than the file
names alone (re. text types, sociolinguistic
variables, textual information structure)
it adds value to the corpus because it allows
a broader range of questions to be addressed
it allows editorial comments to be inserted
during the corpus-building process.

38
Corpus Mark-up

Extra-textual and textual information must be
kept separate from the corpus data.
Example:
COCOA mark-up scheme
<A WILLIAM SHAKESPEARE>
A: author (attribute name)
WILLIAM SHAKESPEARE: attribute value
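
To show how such mark-up keeps extra-textual information separable from the corpus data, here is a minimal sketch that pulls COCOA-style tags of the form <NAME VALUE> out of a text. The regular expression, the function name and the extra <T HAMLET> tag are assumptions made for illustration; real COCOA files may need a more careful treatment.

```python
import re

# A COCOA-style tag: angle brackets around an attribute name and its value.
COCOA_TAG = re.compile(r"<(\w+)\s+([^>]+)>")

def split_cocoa(text):
    """Return (metadata dict, text with the COCOA tags stripped)."""
    metadata = {name: value.strip() for name, value in COCOA_TAG.findall(text)}
    body = COCOA_TAG.sub("", text).strip()
    return metadata, body

# <T HAMLET> is a hypothetical second tag added for the example.
sample = "<A WILLIAM SHAKESPEARE>\n<T HAMLET>\nTo be, or not to be"
metadata, body = split_cocoa(sample)
print(metadata)  # {'A': 'WILLIAM SHAKESPEARE', 'T': 'HAMLET'}
print(body)      # To be, or not to be
```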

39
TEI Mark-up Scheme

Each individual text is a document consisting of
a header and a body, in turn composed of
different elements.
E.g. in the header there are 4 main elements:
A file description <fileDesc>
An encoding description <encodingDesc>
A text profile <profileDesc>
A revision history <revisionDesc>
Tags can be nested, i.e. they can appear inside
other elements.
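
The nesting of header elements can be sketched with Python's standard XML library. This only illustrates the structure: it is not a complete or valid TEI header, and the title text is a placeholder.

```python
import xml.etree.ElementTree as ET

# The header with its four main elements, in the order listed above.
header = ET.Element("teiHeader")
for name in ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc"):
    ET.SubElement(header, name)

# Nesting: a title statement inside the file description.
file_desc = header.find("fileDesc")
title_stmt = ET.SubElement(file_desc, "titleStmt")
ET.SubElement(title_stmt, "title").text = "Sample text"  # placeholder value

xml_string = ET.tostring(header, encoding="unicode")
print(xml_string)
```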

40
TEI Mark-up Scheme

It can be expressed using a number of different
formal languages:
SGML (Standard Generalized Mark-up Language,
used by the BNC)
XML (Extensible Mark-up Language)

41
CES Mark-up Scheme

Designed specifically for the encoding of
language corpora.
Document-wide mark-up (bibliographical
description, encoding description, etc.)
Gross structural mark-up (volume, chapter,
paragraph, footnotes, etc.; it also specifies
recommended character sets)
Mark-up for subparagraph structures (sentences,
quotations, words, abbreviations, etc.)

42
CES Mark-up Scheme

It specifies a minimal encoding level that
corpora must achieve to be considered
standardized in terms of descriptive
representation as well as general architecture.
3 levels of standardization designed to achieve
the goal of universal document interchange:
Metalanguage level
Syntactic level
Semantic level

43
Corpus Annotation

Necessary in order to extract relevant
information from corpora.
The process of adding interpretive,
linguistic information to an electronic corpus of
spoken and/or written language data
(Leech 1997)

44
Annotation vs. Mark-up

Corpus mark-up provides objective, verifiable
information.
Annotation is concerned with interpretive
linguistic information.

45
The advantages of annotation

1. It makes extracting information easier and
faster, and enables human analysts to exploit and
retrieve analyses of which they are not
themselves capable.

46
The advantages of annotation

2. Annotated corpora are reusable resources.
3. Annotated corpora are multifunctional: they
can be annotated for one purpose and reused
for another.

47
The advantages of annotation

4. Corpus annotation records a linguistic
analysis explicitly.
5. Corpus annotation provides a standard
reference resource, a stable base of linguistic
analyses, so that successive studies can be
compared and contrasted on a common basis.

48
Criticisms of corpus annotation

Annotation produces cluttered corpora
Annotation imposes an analysis
Annotation overvalues corpora, making them less
accessible
Is annotation accurate and consistent?

49
How are corpora annotated?

Automatic annotation
Computer-assisted annotation
Manual annotation
Sinclair (1992): the introduction of the human
element in corpus annotation reduces consistency.

50
Types of annotation

Different types of annotation can be carried out
with different means.
For some types automatic annotation is very
accurate. Other types require post-editing,
i.e. human correction.

51
Types of annotation

Corpora can be annotated at different levels of
linguistic analysis.
Phonological level
Syllable boundaries (phonetic/phonemic
annotation)
Prosodic features (prosodic annotation)

52
Types of annotation

Morphological level
Prefixes
Suffixes
Stems
(morphological annotation)

53
Types of annotation

Lexical level
Part of speech (POS Tagging)
Lemmas (lemmatization)
Semantic fields (semantic annotation)
Syntactic level
parsing
treebanking
bracketing

54
Types of annotation

Discourse level
Anaphoric relations (coreference annotation)
Speech acts (pragmatic annotation)
Stylistic features such as speech and thought
presentation (stylistic annotation).

55
POS Tagging

POS is the most common type of annotation.
Also known as grammatical tagging or
morpho-syntactic annotation.
It provides the basis of further forms of
analysis such as parsing and semantic
annotation.
Many linguistic analyses, e.g. of the collocates
of a word, depend heavily on POS tagging.

56
POS Tagging

It can be performed automatically with taggers
like CLAWS:
http://www.comp.lancs.ac.uk/ucrel/claws/
You can try it for free online.
Examples of tags: NN1 (singular common noun), VVZ
(verb in the third person singular of the simple
present tense), VVD (verb in the simple past
form), AJ0 (adjective in the basic form), etc.

57
POS Tagging

Problems:
Word segmentation (tokenization)
Multiwords (so that, in spite of)
Mergers (can't, gonna)
Variably spelled compounds (noticeboard,
notice-board, notice board)
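
A toy tokenizer makes the first three problems concrete. This is a sketch under stated assumptions, not how CLAWS itself works: the two lookup tables are hypothetical and tiny, the split of "gonna" into "gon" + "na" follows the tagged example at the end of this transcript, and the split of "can't" into "ca" + "n't" is assumed by analogy.

```python
import re

# Hypothetical lookup tables; a real tagger would use much larger lexicons.
MULTIWORDS = {("so", "that"), ("in", "spite", "of")}
MERGERS = {"can't": ["ca", "n't"], "gonna": ["gon", "na"]}

def tokenize(text):
    """Lower-case, split off punctuation, split mergers, join multiwords."""
    words = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text.lower())
    # Step 1: split mergers into their component tokens.
    split = []
    for word in words:
        split.extend(MERGERS.get(word, [word]))
    # Step 2: join known multiword units, trying the longest match first.
    tokens, i = [], 0
    while i < len(split):
        for n in (3, 2):
            if tuple(split[i:i + n]) in MULTIWORDS:
                tokens.append("_".join(split[i:i + n]))
                i += n
                break
        else:
            tokens.append(split[i])
            i += 1
    return tokens

tokens = tokenize("I can't go, in spite of the offer.")
print(tokens)
# ['i', 'ca', "n't", 'go', ',', 'in_spite_of', 'the', 'offer', '.']
```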

58
Lemmatization

Type of annotation that reduces the inflectional
variants of words to their respective lexemes or
lemmas as they appear in dictionary entries:
do, does, did, done, doing → DO
corpus, corpora → CORPUS
Small capital letters are the convention.
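
In its simplest form, lemmatization is a table lookup from word forms to lemmas. The sketch below uses only the two examples from this slide; the table, the function name and the fallback for unknown forms (treating the form as its own lemma) are assumptions, and real lemmatizers combine much larger lexicons with POS information.

```python
# Hypothetical lookup table built from the two examples on the slide.
LEMMAS = {
    "do": "DO", "does": "DO", "did": "DO", "done": "DO", "doing": "DO",
    "corpus": "CORPUS", "corpora": "CORPUS",
}

def lemmatize(tokens):
    """Map each token to its lemma; unknown forms become their own lemma."""
    return [LEMMAS.get(token.lower(), token.upper()) for token in tokens]

lemmas = lemmatize("Corpora did what a corpus does".split())
print(lemmas)  # ['CORPUS', 'DO', 'WHAT', 'A', 'CORPUS', 'DO']
```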

59
Lemmatization

It is important in vocabulary studies and
lexicography, e.g. in studying the distribution
pattern of lexemes and improving dictionaries
and computer lexicons.
It can be automatically performed.

60
Parsing

Once a corpus is POS tagged, it is possible to
bring these morpho-syntactic categories into
higher level syntactic relationships with one
another, that is, to analyse the sentences in a
corpus into their constituents.
Parsing consists in bracketing.
It can be automated but with a low precision
rate.

61
Parsing

Example
(S (NP Mary)
   (VP visited
       (NP a
           (ADJP very nice)
           boy)))
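
A bracketed parse like this can be read back into a tree structure with a few lines of code. The sketch below assumes balanced brackets, with the object NP placed inside the VP; the function name and the nested-list representation are choices made for this illustration only.

```python
def parse_brackets(text):
    """Turn a bracketed parse into nested lists: [label, child, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                  # skip "("
        node = [tokens[pos]]      # constituent label, e.g. NP
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())   # nested constituent
            else:
                node.append(tokens[pos])   # word
                pos += 1
        pos += 1                  # skip ")"
        return node

    return read_node()

tree = parse_brackets("(S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))")
print(tree)
# ['S', ['NP', 'Mary'], ['VP', 'visited', ['NP', 'a', ['ADJP', 'very', 'nice'], 'boy']]]
```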

62
Semantic annotation

It assigns codes indicating the semantic features
or the semantic fields of the words in a text. It
is knowledge-based, so it needs to be manual most
of the time.
Two types:
One marks the semantic relationships between the
constituents in a sentence
One marks the semantic features of words in a
text

63
Coreference annotation

Pronouns
Repetition
Substitution
Ellipsis
Computer-assisted at best.

64
Pragmatic annotation

Speech/dialogue acts in domain-specific dialogue.
The most coherent system is DRI (Discourse
Resource Initiative).
3 layers of coding:
Segmentation (dividing dialogue into textual
units, utterances)
Functional annotation (dialogue act annotation)
Utterance tags (applying utterance tags that
characterize the role of the utterance as a
dialogue act)

65
Pragmatic annotation

Utterance tags
Communicative status (intelligible, complete,
etc.)
Information level and status (indicating the
semantic content of the utterance and how it
relates to the task in question)
Forward-looking communicative function
(utterances that may constrain or affect the
discourse, e.g. assert, request, question and
offer)
Backward-looking communicative function
(utterances that relate to previous parts of the
discourse, e.g. accept, backchannelling, answer)

66
Stylistic annotation

It is particularly associated with stylistic
features in literary texts.
An example: the representation of people's
speech and thoughts, known as speech and thought
presentation (STP)

67
Other types of tagging

Error tagging
Problem-oriented annotation

68
Types of corpora

Multilingual
Monolingual

69
Multilingual Corpora

Parallel corpora (source texts plus
translations), e.g. the Canadian Hansard
Comparable corpora (monolingual subcorpora
designed using the same sampling techniques),
e.g. the Aarhus corpus of contract law
Multilingual
Bilingual

70
Multilingual Corpora

Important resources for translation and
contrastive studies.
Multilingual corpora:
give new insight into the languages compared
can be used to study language specific and
universal features
illuminate differences between source texts and
translations
can be used for a number of practical
applications, in lexicography, language teaching,
translation, etc.

71
Parallel Corpora

Bilingual vs. Multilingual
Unidirectional (from La to Lb or from Lb to La
alone) vs. Bidirectional (from La to Lb and from
Lb to La) vs. Multidirectional (from La to Lb, Lc
etc.)

72
Comparable corpora

A corpus containing components that are collected
using the same sampling techniques and similar
balance and representativeness, e.g. the same
proportions of the texts of the same genres in
the same domains in a range of different
languages in the same sampling period.

73
Comparable vs. parallel corpora

The sampling frame is essential for comparable
corpora but not for parallel corpora because the
texts are exact translations of each other.

74
Corpus Alignment

In order for us to be able to fully exploit
parallel corpora, they need to be aligned.
Different types of alignment
Word-level alignment
Sentence-level alignment
Paragraph alignment

75
General Corpora

British National Corpus (100,106,008 words)
The American National Corpus
ICE-CUP

76
Specialized Corpora

Guangzhou Petroleum English Corpus (411,612 words
of written English from the petrochemical domain)
HKUST Computer Science Corpus (1,000,000 words of
written English sampled from undergraduate
textbooks in computer science)
CPSA (Corpus of Professional Spoken American
English)
MICASE (1,700,000 words of English spoken in the
academic domain)

77
Written Corpora

BROWN Corpus (written texts, AE in 1961)
LOB Corpus (Comparable to BROWN Corpus, BE, early
1960s)
FROWN Corpus (AE, Early 1990s)
FLOB Corpus (BE, Early 1990s)

78
Spoken Corpora

London-Lund Corpus (LLC)
Lancaster/IBM Spoken English Corpus (SEC)
Cambridge and Nottingham Corpus of Discourse in
English (CANCODE)
Santa Barbara Corpus of Spoken American English
(SBCSAE)
Wellington Corpus of Spoken New Zealand English
(WSC)

79
Synchronic Corpora

Useful to compare varieties of English. Texts
all date to the same period.
Brown and LOB
Frown and FLOB
International Corpus of English (ICE) (Texts
produced after 1989)
BNC

80
Diachronic Corpora

Texts date to different periods in time. Ideal to
study language change and history.
Brown/Frown
LOB/FLOB
Helsinki Diachronic Corpus of English Texts
(8th-18th century)
ARCHER Corpus: A Representative Corpus of
Historical English Registers (BE and AE,
1650-1990).

81
Learner/developmental Corpora

L2 acquisition (learner corpora, LC) / L1 acquired by children (developmental corpora, DC)
CHILDES (DC)
International Corpus of Learner English ICLE
(LC)
Cambridge Learner Corpus (LC)

82
Monitor Corpora

Constantly supplemented with fresh material and
keep increasing in size, though the proportion of
text types included in the corpus remains
constant.
Bank of English (BoE)
Global English Monitor Corpus
AVIATOR

83
The BNC

The British National Corpus (BNC) is a 100
million word collection of samples of written and
spoken language from a wide range of sources,
designed to represent a wide cross-section of
current British English, both spoken and written.

84
The BNC

The written part of the BNC (90%) includes, for
example, extracts from regional and national
newspapers, specialist periodicals and journals
for all ages and interests, academic books and
popular fiction, published and unpublished
letters and memoranda, school and university
essays, etc. The spoken part (10%) includes a
large amount of unscripted informal conversation,
recorded by volunteers selected from different
ages, regions and social classes in a
demographically balanced way, together with
spoken language collected in all kinds of
different contexts, ranging from formal business
or government meetings to radio shows and
phone-ins.

85
The BNC

The corpus is encoded according to the
Guidelines of the Text Encoding Initiative (TEI)
to represent both the output from CLAWS
(automatic part-of-speech tagger) and a variety
of other structural properties of texts (e.g.
headings, paragraphs, lists etc.). Full
classification, contextual and bibliographic
information is also included with each text in
the form of a TEI-conformant header.

86
What sort of corpus is the BNC?

Monolingual: it deals with modern British
English, not other languages used in Britain.
However, non-British English and foreign-language
words do occur in the corpus.
Synchronic: it covers British English of the late
twentieth century, rather than the historical
development which produced it.
General: it includes many different styles and
varieties, and is not limited to any particular
subject field, genre or register. In particular,
it contains examples of both spoken and written
language.
Sample: for written sources, samples of 45,000
words are taken from various parts of
single-author texts. Shorter texts up to a
maximum of 45,000 words, or multi-author texts
such as magazines and newspapers, are included in
full. Sampling allows for a wider coverage of
texts within the 100 million limit, and avoids
over-representing idiosyncratic texts.

87
BNC and Sketch Engine

Sketch Engine is an excellent user-interface to
query the BNC.
Here are some screenshots.

88-100
(No Transcript)
101
An example of a POS-tagged text

I've been giving some thought to the whole idea
of writing a book as of late (I've also been
giving some thought to winning the lottery, and
we can all see where that's got me) and it came
to me while showering the other night that if I
were to ever write a book (which ain't gonna
happen, but let's just say for the sake of
argument) I would bill myself as the anti-Francis
Mayes.

102
An example of a POS-tagged text

I_PNP 've_VHB been_VBN giving_VVG some_DT0
thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1
of_PRF writing_VVG a_AT0 book_NN1 as_PRP21
of_PRP22 late_AJ0 (_( I_PNP 've_VHB also_AV0
been_VBN giving_VVG some_DT0 thought_NN1 to_PRP
winning_VVG the_AT0 lottery_NN1 ,_, and_CJC
we_PNP can_VM0 all_DT0 see_VVI where_AVQ that_DT0
's_VHZ got_VVN me_PNP )_) and_CJC it_PNP came_VVD
to_PRP me_PNP while_CJS showering_VVG the_AT0
other_AJ0 night_NN1 that_CJT if_CJS I_PNP
were_VBD to_TO0 ever_AV0 write_VVI a_AT0 book_NN1
(_( which_DTQ ai_UNC n't_XX0 gon_VVG na_TO0
happen_VVI ,_, but_CJC let_VM021 's_VM022
just_AV0 say_VVI for_PRP the_AT0 sake_NN1 of_PRF
argument_NN1 )_) I_PNP would_VM0 bill_NN1
myself_PNX as_PRP the_AT0 anti-Francis_AJ0
Mayes_NP0 ._.
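
Text in this word_TAG format is easy to process once each token is split at its last underscore. The sketch below reads the first line of the tagged example, counts the tags and pulls out the nouns; the function and variable names are hypothetical.

```python
from collections import Counter

def read_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split at the LAST underscore, so the word itself may contain "_".
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

tagged = ("I_PNP 've_VHB been_VBN giving_VVG some_DT0 "
          "thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1")
pairs = read_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)              # ['thought', 'idea']
print(tag_counts["NN1"])  # 2
```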