Title: The Newcastle Electronic Corpus of Tyneside English
- Karen Corrigan
- Hermann Moisl
Introduction
- The Newcastle Electronic Corpus of Tyneside
English (NECTE) is a project devoted to the
preservation of materials relating to the
linguistic and cultural heritage of Tyneside in
north-east England.
Introduction
- Specifically, the NECTE project:
- Preserves interviews with Tyneside people of the
late 1960s and the early 1990s that provide
fascinating insights not only into how Tynesiders
spoke at those times, but also into their lives
and attitudes.
- Uses current information technology to provide
ready access to this material on the Web, and to
ensure that it will not be lost to future
generations.
Introduction
- The presentation is in three main parts:
- Outlines the history and current state of the
project
- Describes the construction of the NECTE corpus
- Shows how cluster analysis can be applied to gain
useful results from an actual electronic corpus
like NECTE.
1. History and current state of the project
- The NECTE project amalgamates two separate
collections of recorded speech.
- One of the collections was made in the late 1960s
and early 1970s as part of the Tyneside
Linguistic Survey (TLS) project, based in the
Department of English Language at Newcastle
University.
- The other collection was made in 1994 as part of
the Phonological Variation and Change (PVC)
project based in Newcastle University's
Department of Speech.
- We will look briefly at both of these.
1. History and current state of the project
- The Tyneside Linguistic Survey
- The TLS project originally consisted of
reel-to-reel audiotaped, loosely-structured
30-minute interviews with 100 informants drawn
from a stratified random sample of Gateshead in
North-East England.
- Many, but not all, of the interviews were
transcribed at a number of levels of
representation, including English orthography,
segmental phonology, and syntagmatic,
paralinguistic, prosodic, and grammatical
features.
- In addition, a file of social data was
established for each informant.
- Sixty-four of the segmental phonological
transcriptions and the social data were then
electronically encoded, and in that form provided
the basis for subsequent computational analysis.
1. History and current state of the project
- The TLS was an extremely ambitious project, and
in many ways ahead of its time.
- With hindsight it is clear that,
given the technology available at the time, the
far-reaching objectives of the project were
unattainable.
- Significant work on intonation, phonetics,
linguistic variation, and computational data
analysis did, however, emerge from the TLS, the
most comprehensive publication being Val
Jones-Sargent's book Tyne Bytes: A computerised
sociolinguistic study of Tyneside, which appeared
in 1983.
- Thereafter, work on the TLS material languished,
though it was occasionally used by individual
researchers.
1. History and current state of the project
- Phonological Variation and Change in Contemporary
Spoken British English (Milroy et al. 1997)
- The ESRC-funded PVC corpus was collected in the
Tyneside area in 1994.
- High-quality audio tape recorders and microphones
were used, and the corpus was originally in the
form of 20 DAT tapes, each of which averages 60
minutes in length.
- Dyads of friends or relatives were encouraged to
converse freely with minimal interference from
the fieldworker, and, as with the TLS, informants
were divided between various social class
groupings of male and female speakers in young,
middle, and old-age cohorts.
2. Construction
- In 2001 we were awarded a substantial research
grant from the AHRB/C to produce an enhanced
electronic corpus resource from a combination of
the TLS and the PVC collections.
- The NECTE corpus was completed in 2004, and is
available on the Web at http://www.ncl.ac.uk/necte.
Access to the data is restricted for legal
reasons, but the necessary permission is readily
available to bona fide researchers from Karen
Corrigan.
- The construction of NECTE is described in three
main parts:
- Content representation
- Content alignment
- Content structuring
2. Construction: Representation
- The NECTE corpus contains four different
representations of the TLS and PVC materials:
- Audio
- Orthographic transcription
- Grammatical markup
- Phonetic transcription
2. Construction: Representation - Audio
- The TLS and PVC corpora are preserved on
audiotape, and, as such, the primary NECTE data
representation is audio.
- The high quality of the PVC recordings has
enabled a trouble-free preparation of the
material for the NECTE corpus.
- The TLS recordings, on the other hand, required a
degree of restoration.
- The original analog recordings, both reel-to-reel
and cassette versions, were first digitized at a
high sampling rate.
- A graphic equalisation process was then applied
to clarify the sound.
- A hiss reduction filter and a click eliminator
were applied.
- Variations in tape recording speed were
eliminated.
- Audio was represented in a high-resolution
digital audio format (WAV).
2. Construction: Representation - Orthographic
- The audio content of the TLS and PVC corpora has
been transcribed into British English
orthographic representation, and this, too, is
included in its entirety in the NECTE corpus.
- Two problems were encountered and, we hope,
resolved in creating this representation:
- Application of English orthography to nonstandard
spoken English
- Transcription accuracy
2. Construction: Representation - Orthographic
- Application of English orthography to nonstandard
spoken English
- Tyneside spoken English differs significantly
from standard spoken English across all
linguistic levels, from phonetic to pragmatic.
This raises the obvious question of how
nonstandard features should be rendered
orthographically.
- Since NECTE makes sound files and some phonetic
transcriptions available, we decided not to try
to represent the nonstandard phonology of
Tyneside English with semi-phonetic spelling.
Thus, for example, the characteristic /na/ for
SE "know" is transcribed <know>, not <knaa> as in
popular representations of the dialect.
2. Construction: Representation - Orthographic
- Transcription accuracy
- Any large-scale textual transcription is subject
to human error. In addition, the now very old TLS
tapes have become degraded in various ways, and
are often difficult or impossible to interpret.
- Acoustic filtering in the course of digitization
improved audibility in some, but by no means all,
cases.
- We have used orthographic transcriptions made by
the TLS, but these transcriptions cover only part
of the corpus.
- To maximize accuracy, we conducted two correction
passes on our primary transcription. These were
carried out by two different members of the NECTE
team who were themselves not involved in the
primary transcription; the decision criterion was
majority agreement.
2. Construction: Representation - Grammatical markup
- Grammatical markup or tagging of a corpus is an
extremely useful basis for linguistic analysis.
- It is also extremely time-consuming if done
manually, and difficult to do reliably if
automated.
- From the outset of the project NECTE wanted to
provide some degree of grammatical tagging as one
of its data representations.
- The selection of part-of-speech tagging was
determined by what was possible within the
timescale of the project, subject to the
following constraints:
2. Construction: Representation - Grammatical markup
- Existing tagging software had to be used, since
there was insufficient time to develop
project-specific software.
- The chosen software had to be able to deal with
nonstandard English reliably, that is, without
the need for extensive human intervention in the
tagging process and/or for extensive subsequent
proofreading.
- For reasons to be discussed below, the chosen
software had to be XML-conformant, both in terms
of being able to deal with text in XML format,
and, preferably, of generating XML output.
2. Construction: Representation - Grammatical markup
- Having surveyed currently available tagging
software, we selected the CLAWS4 (Constituent
Likelihood Automatic Word-tagging System) tagger,
developed by UCREL (University Centre for
Computer Corpus Research on Language) at
Lancaster University, UK, for part-of-speech
tagging the c. 100 million word British National
Corpus (BNC).
- It fulfilled the above requirements in that it is
a mature system continuously developed since the
early 1980s, has consistently achieved an
accuracy rate of 96-97% in relation to the BNC,
and is XML-conformant.
- CLAWS4 performed with that level of accuracy on
our corpus of nonstandard English.
2. Construction: Representation - Grammatical markup
Here is an example of output for a single line
randomly selected from the corpus:
- Orthographic
- Well, I'm quite happy here I must ad-, you-know I
must say, but, Lobley-Hill.
- Tagged
- <w id="14.2" pos="RR">Well</w> <w id="14.3"
pos=",">,</w> <w id="14.4" pos="PPIS1">I</w> <w
id="14.5" pos="VVBM">'m</w> <w id="14.6"
pos="RG">quite</w> <w id="14.7"
pos="JJ">happy</w> <w id="14.8"
pos="RL">here</w> <w id="14.9" pos="PPIS1">I</w>
<w id="14.10" pos="VM">must</w> <ad-> <you-know>
<w id="14.11" pos="PPIS1">I</w> <w id="14.12"
pos="VM">must</w> <w id="14.13" pos="VVI">say</w>
<w id="14.14" pos=",">,</w> <w id="14.15"
pos="CCB">but</w> <Lobley-Hill> <w id="14.16"
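As a rough illustration of how tagged output of this kind might be consumed, the sketch below pulls (word, tag) pairs out of the first few tokens of the line above. It assumes each tagged token is a `<w>` element with `id` and `pos` attributes, as in the example; the function name `word_pos_pairs` is hypothetical, and a regular expression is used because the excerpt also contains bracketed items such as `<ad->` that are not well-formed XML.

```python
import re

# First six tokens of the tagged example line, with the markup restored.
tagged = (
    '<w id="14.2" pos="RR">Well</w> <w id="14.3" pos=",">,</w> '
    '<w id="14.4" pos="PPIS1">I</w> <w id="14.5" pos="VVBM">\'m</w> '
    '<w id="14.6" pos="RG">quite</w> <w id="14.7" pos="JJ">happy</w>'
)

# Assumption: every tagged token is a <w> element with id and pos attributes.
W_ELEMENT = re.compile(r'<w id="([^"]+)"\s*pos="([^"]+)">(.*?)</w>')


def word_pos_pairs(line):
    """Return (word, pos) pairs for every <w> element found in a tagged line."""
    return [(word, pos) for _id, pos, word in W_ELEMENT.findall(line)]


pairs = word_pos_pairs(tagged)
for word, pos in pairs:
    print(f"{word}\t{pos}")
```

Items that carry no `<w>` wrapper are simply skipped by this pattern, which may or may not be what a given analysis wants.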
2. Construction: Representation - Phonetic transcription
- One of the data representations that the TLS
provided was phonetic transcription of the audio
material. This transcription was, and is, partial
in three ways:
- Most but not all of the original 100 recordings
were transcribed, and of those that were, only 63
have survived.
- Because the interviewee responses were of primary
interest, only these, and not the interviewer's
utterances, were transcribed.
- For each interview, only the first 200 or so
interviewee utterances were transcribed.
2. Construction: Representation - Phonetic transcription
- The phonetic transcriptions were originally
recorded on index cards.
2. Construction: Representation - Phonetic transcription
- The TLS subsequently encoded the transcriptions
electronically, and all but one of these
electronic versions have survived.
- They were systematically corrupted, but we have
been able to restore them to their original form
and to include them in our corpus as one of the
NECTE data representations.
- NECTE has made no attempt either to review the
TLS phonetic transcriptions relative to the
original audio recordings, or to extend the
phonetic representation to what the TLS did not
cover.
- The TLS transcriptions are offered as a
historical artefact, and the reason they are so
offered is their intrinsic interest to
researchers who want to study the phonetics of
the TLS material: the phonetic analysis is
extremely detailed, providing from one up to ten
realizations of any given phonological segment.
2. Construction: Representation - Phonetic transcription
- The following example gives a broad phonetic IPA
representation. In the corpus each segment is,
however, indexed into a precise phonetic
realization that cannot be shown simply because
the IPA does not provide the requisite symbolism,
but that is nevertheless available for analysis.
Orthographic: Down by Clark Chapman's
Phonetic: d??n ba? kl?k ?æpm?nz
2. Construction: Representation - Alignment
- Alignment in a corpus is the provision of a
mechanism whereby corresponding elements in
different data representations are linked so that
they are simultaneously available to the user.
- NECTE provides such a linking mechanism to
coordinate corresponding audio, standard
orthographic, grammatical markup, and phonetic
transcription data representations.
- The main issue in the design of such a mechanism
is the size of the alignment unit, that is, the
granularity.
2. Construction: Representation - Alignment
- In NECTE, the two extremes of granularity are the
phonetic segment and the interview:
- In the first case, the audio, orthographic,
tagged, and phonetic representations would be
linked on a segment-by-segment basis.
- In the second, the alignment consists of a
juxtaposition of four complete and different
representations of the same interview.
- Neither extreme is particularly useful, and the
segment-based alignment is probably unworkable;
some granularity between these two extremes is
required.
- We looked at various alternatives, such as
alignment by speaker utterance or syntactic unit,
but these proved problematic, so in the end we
adopted alignment by real-time interval.
2. Construction: Representation - Alignment
- Our real-time interval alignment mechanism works
as follows.
- It begins with the observation that real time
(time as it is conceived by humans in day-to-day
life) is meaningful only for the audio level of
representation in the corpus; text, be it
orthographic, tagged, or a sequence of phonetic
symbols, has no temporal dimension.
- A time interval t is selected, and the audio
level is partitioned into some number n of
length-t audio segments s: s(t x 1), s(t x
2), ..., s(t x n), where 'x' denotes multiplication.
2. Construction: Representation - Alignment
- Corresponding markers are then inserted into the
other levels of representation such that they
demarcate substrings corresponding to the audio
segments.
- That is, there are markers in the other
representational levels which identify the
corresponding orthographic, phonetic, and
part-of-speech tagged segments.
- In this way, selection of any segment s in any
level of representation allows the segments
corresponding to s in all the other levels to be
identified.
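The partitioning step can be sketched as follows. This is only an illustration under stated assumptions: time is measured in whole seconds, the final segment is allowed to be shorter than t, and the marker names produced by the hypothetical helper `anchor_id` merely imitate the look of ids such as "tlsg01necteortho0020"; the slides do not say that the numeric suffix is an elapsed time in seconds.

```python
def partition(duration, t):
    """Split the span [0, duration) into consecutive intervals of length t.

    The last interval is shortened when duration is not a multiple of t.
    """
    segments = []
    start = 0
    while start < duration:
        end = min(start + t, duration)
        segments.append((start, end))
        start = end
    return segments


def anchor_id(interview, level, offset):
    # Hypothetical naming scheme: interview label + level name + 4-digit offset.
    return f"{interview}{level}{offset:04d}"


# A 70-second recording cut into 20-second segments (values chosen for illustration).
segments = partition(duration=70, t=20)
anchors = [anchor_id("tlsg01", "necteortho", start) for start, _end in segments]
print(segments)
print(anchors)
```

Because every level of representation would receive markers derived from the same offsets, a segment chosen in one level can be looked up in the others by its offset alone.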
2. Construction: Representation - Alignment
- For example, the excerpts below show orthographic
and phonetic transcription representations with
corresponding time anchors inserted using XML, of
which more in a moment. The anchors correspond to
an elapsed-time segment in the audio
representation.
- <anchor id="tlsg01necteortho0020"/>where do you
mean by that eh </u><u who="informantTlsg01">
that's ehm <pause/> down by eh clark chapman's
</u><u who="interviewerTlsg01"> oh aye like
saltmeadows</u><u who="informantTlsg01"> yes
saltmeadows </u><u who="interviewerTlsg01">
<unclear/> whereabouts else have you lived since
then you know i mean how long did you stay there
</u><u who="informantTlsg01"> five year <anchor
id="tlsg01necteortho0040"/>
- <anchor id="tlsg01phonetic0020"/>02081 02301 08580
02322 01443 02741 02201 01284 08580 02383 02801
00421 02421 02501 00342 02164 02721 02021 02741
02642 04321 02621 00503 02825 02301 02721 00246
02341 12601 02642 02541 01284 02561 02881 01641
<anchor id="tlsg01phonetic0040"/>
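One way to exploit such anchors is to cut each representation at its anchors and pair up the pieces that share the same numeric suffix. The sketch below does this for shortened versions of the two excerpts above (utterance tags omitted for brevity). The function name `segments_by_offset` is hypothetical, and the assumption that the last four digits of an anchor id identify the shared segment is inferred from the example ids, not stated in the slides.

```python
import re

# Assumption: an anchor id ends in a four-digit offset shared across levels.
ANCHOR = re.compile(r'<anchor id="([^"]+?)(\d{4})"\s*/>')


def segments_by_offset(text):
    """Map each anchor's 4-digit suffix to the text between it and the next anchor."""
    matches = list(ANCHOR.finditer(text))
    result = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        result[match.group(2)] = text[match.end():end].strip()
    return result


# Shortened excerpts from the slide.
ortho = ('<anchor id="tlsg01necteortho0020"/>where do you mean by that eh '
         '<anchor id="tlsg01necteortho0040"/>')
phonetic = ('<anchor id="tlsg01phonetic0020"/>02081 02301 08580 '
            '<anchor id="tlsg01phonetic0040"/>')

ortho_segments = segments_by_offset(ortho)
phonetic_segments = segments_by_offset(phonetic)

# Pair the segments that carry the same offset in both levels.
aligned = {offset: (ortho_segments[offset], phonetic_segments[offset])
           for offset in ortho_segments if offset in phonetic_segments}
print(aligned["0020"])
```

Note that the paired strings cover the same time interval, not the same words: the phonetic level transcribes only the informant, so the contents of a pair need not correspond token for token.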
2. Construction: Representation - Structure
- The NECTE corpus is structured using
TEI-conformant XML. This and subsequent slides
describe that structure.
- Every TEI corpus consists of two main elements:
- A prolog that contains meta-information about the
corpus
- The document instance that contains the content
of the corpus
- The prolog is too technical to be presented here.
What follows is an overview of the document
instance.
2. Construction: Representation - Structure
The corpus consists of a sequence of interviews.
Each of the labels below is, in XML /
TEI-speak, an entity reference, and refers to a
file containing a single speaker interview.
- <teiCorpus.2>
- <teiHeader type='corpus'>
- <fileDesc></fileDesc>
- <encodingDesc></encodingDesc>
- <profileDesc></profileDesc>
- <revisionDesc></revisionDesc>
- </teiHeader>
- &tlsg01; &tlsg02; &tlsg03; &tlsg04; &tlsg05; &tlsg06;
&tlsg07; &tlsg08; &tlsg09; &tlsg10; &tlsg11; &tlsg12;
&tlsg13; &tlsg14; &tlsg15; &tlsg16; &tlsg17; &tlsg18;
&tlsg19; &tlsg20; &tlsg21; &tlsg22; &tlsg23; &tlsg24;
&tlsg25; &tlsg26; &tlsg27; &tlsg28; &tlsg29; &tlsg30;
&tlsg31; &tlsg32; &tlsg33; &tlsg34; &tlsg35; &tlsg36;
&tlsg37; &tlsn01; &tlsn02; &tlsn03; &tlsn04; &tlsn05;
&tlsn06; &tlsn07; &pvc01; &pvc02; &pvc03; &pvc04;
&pvc05; &pvc06; &pvc07; &pvc08; &pvc09; &pvc10;
&pvc11; &pvc12; &pvc13; &pvc14; &pvc15; &pvc16;
&pvc17; &pvc18;
- </teiCorpus.2>
2. Construction: Representation - Structure
Each entity referred to by an entity reference
like &tlsg01; contains a single interview, which
itself has a structure:
- <TEI.2 id="tlsg01">
- <teiHeader type="text">
- <!-- Header information -->
- </teiHeader>
- <text>
- <!-- Content -->
- </text>
- </TEI.2>
2. Construction: Representation - Structure
- <text>
- <group>
- <text id='tlsg01audio'>
- <body>
- <!-- content -->
- </body>
- </text>
- <text id='tlsg01necteortho'>
- <body>
- <!-- content -->
- </body>
- </text>
- <text id='tlsg01phonetic'>
- <body>
- <!-- content -->
- </body>
- </text>
- <text id='tlsg01tagged'>
- <body>
Between the <text> and </text> tags of an
interview, there is yet further structure: there
is a group, and the group consists of the four
types of content representation described
earlier. The next slide shows how this looks in
practice.
2. Construction: Representation - Structure
- <text>
- <group>
- <text id="tlsg01audio">
- <body>
- <p>tlsg01 audio file</p>
- <audio entity="tlsaudiog01" />
- </body>
- </text>
- <text id="tlsg01necteortho">
- <body>
- <u who="interviewerTlsg01">
- <anchor id="tlsg01necteortho0000" />
- ehm well could you tell us first of all
where you were born please where you born in
gateshead
- </u>
- ... Remainder of orthographic representation
- </body>
- </text>
- Phonetic and tagged representations
- </group>
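To give a feel for how this structure can be processed, the following sketch parses a cut-down orthographic `<text>` element with Python's standard `xml.etree.ElementTree` module and lists each utterance with its speaker. The fragment reuses the element and attribute names shown above but is trimmed to a single utterance, and the function name `utterances` is hypothetical; a real NECTE file would also need its entity references resolved before parsing.

```python
import xml.etree.ElementTree as ET

# Cut-down fragment modelled on the orthographic <text> element shown above.
fragment = """
<text id="tlsg01necteortho">
  <body>
    <u who="interviewerTlsg01">
      <anchor id="tlsg01necteortho0000" />
      ehm well could you tell us first of all where you were born please
    </u>
  </body>
</text>
""".strip()


def utterances(text_element):
    """Return (speaker, words) for each <u> element, with whitespace normalised."""
    result = []
    for u in text_element.iter("u"):
        words = " ".join("".join(u.itertext()).split())
        result.append((u.get("who"), words))
    return result


root = ET.fromstring(fragment)
spoken = utterances(root)
for speaker, words in spoken:
    print(f"{speaker}: {words}")
```

The empty `<anchor>` elements contribute no text of their own, so they drop out of the joined string while remaining available in the tree for alignment purposes.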
3. Cluster analysis
- NECTE can be used in the traditional way for
research into such things as social history,
sociolinguistics, and dialectology, simply by reading
through it and noting features of interest.
- This talk is, however, essentially about why
specifically digital electronic representation of
text collections is useful in arts and humanities
(AH) research, and, in Part 1, it used cluster
analysis to exemplify the type of computational
analysis and results that are not feasible using
traditional methods.
- The remainder of the discussion looks at cluster
analysis of the NECTE corpus and presents some
sociolinguistic results.
3. Cluster analysis
- Let's say one wants to know if there are any
systematic differences of pronunciation among
speakers: say, between men and women, old and
young men, and so on.
- One can either listen to all the speakers over
and over (and over) again, comparing them and
eventually drawing conclusions
- OR
- One can use cluster analysis to do the job
quickly and objectively.
- The TLS had the foresight to use cluster analysis all
those years ago. It was cutting-edge in
linguistics then, and it still is.
3. Cluster analysis
- Here, in essence, is how cluster analysis of
NECTE works.
- 1. Construct a profile for the pronunciation used
by each informant in the corpus by counting the
number of times each of the large number of
sounds used in speech occurs in that informant's
interview. The resulting data looks like this.
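A minimal sketch of step 1, assuming each interview is available as a sequence of five-digit phonetic codes of the kind shown in the alignment excerpt. The speaker labels and code sequences below are invented for illustration and are not NECTE data; the function name `build_profiles` is likewise hypothetical.

```python
from collections import Counter

# Invented code sequences standing in for two interviews.
interviews = {
    "speakerA": "02081 02301 08580 02301 02081 02081".split(),
    "speakerB": "02301 02741 02741 08580".split(),
}


def build_profiles(interviews):
    """Count how often each code occurs per speaker.

    Returns the sorted list of all codes and, for each speaker, a row of
    counts in that same order (0 where the speaker never uses a code).
    """
    counts = {name: Counter(codes) for name, codes in interviews.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[code] for code in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


vocabulary, rows = build_profiles(interviews)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

Each row is one speaker's profile; stacking the rows gives a speaker-by-sound frequency table suitable for comparison.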
3. Cluster analysis
- 2. Compare the profiles to see if they can be
grouped according to similarity.
- This is difficult (and for large data sets
impossible) for humans, but easy for a computer
with cluster analysis software.
- The result is a cluster tree, shown on the next
slide.
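Step 2 can be illustrated with a small agglomerative clustering routine. The slides do not say which clustering method or distance measure was used, so the choices here (Euclidean distance, average linkage) are assumptions, and the four profiles are invented. The sequence of merges it returns is the information a cluster tree displays.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, rows):
    """Mean Euclidean distance between every profile in cluster a and cluster b."""
    return sum(dist(rows[i], rows[j]) for i in a for j in b) / (len(a) * len(b))


def cluster(rows):
    """Repeatedly merge the two closest clusters until one remains.

    rows maps a speaker name to a list of counts. Returns the merges as
    (cluster_a, cluster_b, distance) tuples in the order they were made.
    """
    clusters = [(name,) for name in rows]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], rows))
        merges.append((a, b, average_linkage(a, b, rows)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Invented profiles: s1 and s2 resemble each other, as do s3 and s4.
profiles = {"s1": [10, 0, 1], "s2": [9, 1, 1], "s3": [0, 8, 5], "s4": [1, 9, 4]}
merges = cluster(profiles)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

The distance recorded at each merge plays the role of line length in a cluster tree: late merges at large distances join dissimilar groups.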
3. Cluster analysis
- The lengths of the horizontal lines represent
relativities of similarity between pairs of
speaker profiles or speaker profile groups: the
longer the line, the more dissimilar the
profiles.
- Knowing this, it is clear that there are two main
clusters, here labelled NG1 and NG2, that NG1
contains well-defined subclusters NG1a and NG1b,
and that NG1a also contains well-defined
subclusters NG1a(i) and NG1a(ii).
- Correlating these clusters with the social data
such as gender, age, and socio-economic status
available for the TLS speakers, it emerged that
those in the NG1 cluster were almost all
working-class speakers with moderate levels of education
from Gateshead on the south side of the river
Tyne, and those in NG2 were all well-educated
middle-class speakers from Newcastle on the north
side.
3. Cluster analysis
Among the Gateshead speakers in NG1, moreover,
there are two main clusters, labelled NG1a and
NG1b, and NG1a itself consists of two main
subclusters, NG1a(i) and NG1a(ii). Once again,
there was a systematic correlation with the
social data available for the speakers. The
clearest correlation is between cluster structure
and gender: NG1b consists entirely of men, and
NG1a mainly though not exclusively of women.
3. Cluster analysis
- With a few slight exceptions, the men in NG1b
have the minimum legal level of education, and
all are in unskilled, semi-skilled, and skilled
manual employment.
- In NG1a there is a clear split between a cluster
consisting mainly of women with minimum education
in unskilled, semi-skilled, and skilled manual
employment (NG1a(i)), and one consisting of men
and women with a slightly higher educational and
employment level (NG1a(ii)).
3. Cluster analysis
- Numerous advanced techniques for analyzing data
have been and are being developed in an effort to
deal with the deluge of electronic information
worldwide.
- We are also experimenting with using such
techniques on the NECTE data.
- The picture you are about to see was generated by
an artificial neural network working on the NECTE
data, and represents Tyneside linguistic usage as
a landscape.
- I won't attempt to explain it, but apart from the
information it contains, it is rather beautiful,
just like the dialect it's based on.
3. Cluster analysis
- A topographic map of the NECTE phonetic data