Title: The Newcastle Electronic Corpus of Tyneside English


1
The Newcastle Electronic Corpus of Tyneside
English
  • Karen Corrigan
  • Hermann Moisl

2
Introduction
  • The Newcastle Electronic Corpus of Tyneside
    English (NECTE) is a project devoted to the
    preservation of materials relating to the
    linguistic and cultural heritage of Tyneside in
    north-east England.

3
Introduction
  • Specifically, the NECTE project:
  • Preserves interviews with Tyneside people of the
    late 1960s and the early 1990s that provide
    fascinating insights not only into how Tynesiders
    spoke at those times, but also into their lives
    and attitudes.
  • Uses current information technology to provide
    ready access to this material on the Web, and to
    ensure that it will not be lost to future
    generations.

4
Introduction
  • The presentation is in three main parts:
  • Outlines the history and current state of the
    project
  • Describes the construction of the NECTE project
  • Shows how cluster analysis can be applied to gain
    useful results from an actual electronic corpus
    like NECTE.

5
1. History and current state of the project
  • The NECTE project amalgamates two separate
    collections of recorded speech.
  • One of the collections was made in the late 1960s
    and early 1970s as part of the Tyneside
    Linguistic Survey (TLS) project, based in the
    Department of English Language at Newcastle
    University.
  • The other collection was made in 1994 as part of
    the Phonological Variation and Change (PVC)
    project based in Newcastle University's
    Department of Speech.
  • We will look briefly at both of these.

6
1. History and current state of the project
  • The Tyneside Linguistic Survey
  • The TLS project originally consisted of
    reel-to-reel audiotaped, loosely-structured
    30-minute interviews with 100 informants drawn
    from a stratified random sample of Gateshead in
    North-East England.
  • Many, but not all, of the interviews were
    transcribed at a number of levels of
    representation, including English orthography,
    segmental phonology, and syntagmatic,
    paralinguistic, prosodic, and grammatical
    features.
  • In addition, a file of social data was
    established for each informant.
  • Sixty-four of the segmental phonological
    transcriptions and the social data were then
    electronically encoded, and in that form provided
    the basis for subsequent computational analysis.

7
1. History and current state of the project
  • The TLS was an extremely ambitious project, and
    in many ways ahead of its time.
  • With the power of hindsight it is clear that,
    given the technology available at the time, the
    far-reaching objectives of the project were
    unattainable.
  • Significant work on intonation, phonetics,
    linguistic variation, and computational data
    analysis did, however, emerge from the TLS, the
    most comprehensive publication being Val
    Jones-Sargent's book Tyne Bytes: A computerised
    sociolinguistic study of Tyneside, which appeared
    in 1983.
  • Thereafter, work on the TLS material languished,
    though it was occasionally used by individual
    researchers.

8
1. History and current state of the project
  • Phonological Variation and Change in Contemporary
    Spoken British English (Milroy et al. 1997)
  • The ESRC-funded PVC corpus was collected in the
    Tyneside area in 1994.
  • High quality audio tape recorders / microphones
    were used, and the corpus was originally in the
    form of 20 DAT tapes, each of which averages 60
    minutes in length.
  • Dyads of friends or relatives were encouraged to
    converse freely with minimal interference from
    the fieldworker, and, as with the TLS, informants
    were divided between various social class
    groupings of male and female speakers in young,
    middle, and old-age cohorts.

9
2. Construction
  • In 2001 we were awarded a substantial research
    grant from the AHRB/C to produce an enhanced
    electronic corpus resource from a combination of
    the TLS and the PVC collections.
  • The NECTE corpus was completed in 2004, and is
    available on the Web at http://www.ncl.ac.uk/necte.
    Access to the data is restricted for legal
    reasons, but the necessary permission is readily
    available to bona fide researchers from Karen
    Corrigan.
  • The construction of NECTE is described in three
    main parts:
  • Content representation
  • Content alignment
  • Content structuring

10
2. Construction Representation
  • The NECTE corpus contains four different
    representations of the TLS and PVC materials:
  • Audio
  • Orthographic transcription
  • Grammatical markup
  • Phonetic transcription

11
2. Construction Representation - Audio
  • The TLS and PVC corpora are preserved on
    audiotape, and, as such, the primary NECTE data
    representation is audio.
  • The high quality of the PVC recordings has
    enabled a trouble-free preparation of the
    material for the NECTE corpus.
  • The TLS recordings, on the other hand, required a
    degree of restoration.
  • The original analog recordings, both reel-to-reel
    and cassette versions, were first digitized at a
    high sampling rate.
  • A graphic equalisation process was then applied
    to clarify the sound.
  • A hiss reduction filter and a click eliminator
    were applied.
  • Variations in tape recording speed were
    eliminated.
  • Audio was represented in a high-resolution
    digital audio format (WAV).

12
2. Construction Representation - Orthographic
  • The audio content of the TLS and PVC corpora has
    been transcribed into British English
    orthographic representation, and this, too, is
    included in its entirety in the NECTE corpus.
  • Two problems were encountered and, we hope,
    resolved in creating this representation:
  • Application of English orthography to nonstandard
    spoken English
  • Transcription accuracy

13
2. Construction Representation - Orthographic
  • Application of English orthography to nonstandard
    spoken English
  • Tyneside spoken English differs significantly
    from standard spoken English across all
    linguistic levels, from phonetic to pragmatic.
    This raises the obvious question of how
    nonstandard features should be rendered
    orthographically.
  • Since NECTE makes sound files and some phonetic
    transcriptions available, we decided not to try
    to represent the non-standard phonology of
    Tyneside English with semi-phonetic spelling.
    Thus, for example, the characteristic /na/ for
    SE know is transcribed <know>, not <knaa>, as in
    popular representations of the dialect.

14
2. Construction Representation - Orthographic
  • Transcription accuracy
  • Any large-scale textual transcription is subject
    to human error. In addition, the now very old TLS
    tapes have become degraded in various ways, and
    are often difficult or impossible to interpret.
  • Acoustic filtering in the course of digitization
    improved audibility in some, but by no means all,
    cases.
  • We have used orthographic transcriptions made by
    the TLS, but these transcriptions cover only part
    of the corpus.
  • To maximize accuracy, we conducted two correction
    passes on our primary transcription. These were
    carried out by two different members of the NECTE
    team who were themselves not involved in the
    primary transcription; the decision criterion was
    majority agreement.

15
2. Construction Representation - Grammatical
markup
  • Grammatical markup or tagging of a corpus is an
    extremely useful basis for linguistic analysis.
  • It is also extremely time-consuming if done
    manually, and difficult to do reliably if
    automated.
  • From the outset of the project NECTE wanted to
    provide some degree of grammatical tagging as one
    of its data representations.
  • The selection of part-of-speech tagging was
    determined by what was possible within the
    timescale of the project, subject to the
    following constraints:

16
2. Construction Representation - Grammatical
markup
  • Existing tagging software had to be used, since
    there was insufficient time to develop
    project-specific software.
  • The chosen software had to be able to deal with
    nonstandard English reliably, that is, without
    the need for extensive human intervention in the
    tagging process and/or for extensive subsequent
    proofreading.
  • For reasons to be discussed below, the chosen
    software had to be XML-conformant both in terms
    of being able to deal with text in XML format,
    and, preferably, of generating XML output.

17
2. Construction Representation - Grammatical
markup
  • Having surveyed currently-available tagging
    software, we selected the CLAWS4 (Constituent
    Likelihood Automatic Word-tagging System) tagger
    developed by UCREL (University Centre for
    Computer Corpus Research on Language) at
    Lancaster University, UK for part-of-speech
    tagging the c.100 million word British National
    Corpus.
  • It fulfilled the above requirements in that it is
    a mature system continuously developed since the
    early 1980s, has consistently achieved an
    accuracy rate of 96-97% in relation to the BNC
    corpus, and is XML-conformant.
  • CLAWS4 performed with that level of accuracy on
    our corpus of nonstandard English.

18
2. Construction Representation - Grammatical
markup
Here is an example of output for a single line
randomly selected from the corpus:
  • Orthographic
  • Well, I'm quite happy here I must ad-, you-know I
    must say, but, Lobley-Hill.
  • Tagged
  • <w id="14.2" pos="RR">Well</w> <w id="14.3"
    pos=",">,</w> <w id="14.4" pos="PPIS1">I</w> <w
    id="14.5" pos="VVBM">'m</w> <w id="14.6"
    pos="RG">quite</w> <w id="14.7"
    pos="JJ">happy</w> <w id="14.8"
    pos="RL">here</w> <w id="14.9" pos="PPIS1">I</w>
    <w id="14.10" pos="VM">must</w> <ad-> <you-know>
    <w id="14.11" pos="PPIS1">I</w> <w id="14.12"
    pos="VM">must</w> <w id="14.13" pos="VVI">say</w>
    <w id="14.14" pos=",">,</w> <w id="14.15"
    pos="CCB">but</w> <Lobley-Hill> <w id="14.16"
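
Tagged output of this kind can be read back into (id, token, tag) triples for counting or searching. The following is a minimal sketch, not the project's own tooling: it assumes the markup uses ordinary XML syntax for word elements (an opening w tag carrying id and pos attributes), and the names WORD, parse_tagged, and SAMPLE are hypothetical. The sample reuses the first eight tokens of the example line.

```python
import re
from collections import Counter

# Assumed element shape: <w id="..." pos="...">token</w>
WORD = re.compile(r'<w\s+id="([^"]+)"\s*pos="([^"]+)">([^<]*)</w>')


def parse_tagged(markup):
    """Return (id, token, pos) triples for every <w> element in the markup."""
    return [(m.group(1), m.group(3), m.group(2)) for m in WORD.finditer(markup)]


# First eight tokens of the example line, with XML punctuation restored.
SAMPLE = '''<w id="14.2" pos="RR">Well</w> <w id="14.3" pos=",">,</w>
<w id="14.4" pos="PPIS1">I</w> <w id="14.5" pos="VVBM">'m</w>
<w id="14.6" pos="RG">quite</w> <w id="14.7" pos="JJ">happy</w>
<w id="14.8" pos="RL">here</w> <w id="14.9" pos="PPIS1">I</w>'''

tokens = parse_tagged(SAMPLE)
# Frequency of each part-of-speech tag in the sample.
tag_counts = Counter(pos for _, _, pos in tokens)
print(tokens[0])
print(tag_counts.most_common(3))
```

A regular expression is enough here because the word elements are flat; a full XML parser would be the safer choice for whole interview files.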

19
2. Construction Representation - Phonetic
transcription
  • One of the data representations that the TLS
    provided was phonetic transcription of the audio
    material. This transcription was, and is, partial
    in three ways:
  • Most but not all of the original 100 recordings
    were transcribed, and of those that were, only 63
    have survived.
  • Because the interviewee responses were of primary
    interest, only these, and not the interviewer's
    utterances, were transcribed.
  • For each interview, only the first 200 or so
    interviewee utterances were transcribed.

20
2. Construction Representation - Phonetic
transcription
  • The phonetic transcriptions were originally
    recorded on index cards

21
2. Construction Representation - Phonetic
transcription
  • The TLS subsequently encoded the transcriptions
    electronically, and all but one of these
    electronic versions have survived.
  • They were systematically corrupted, but we have
    been able to restore them to their original form
    and to include them in our corpus as one of the
    NECTE data representations.
  • NECTE has made no attempt either to review the
    TLS phonetic transcriptions relative to the
    original audio recordings, or to extend the
    phonetic representation to what the TLS did not
    cover.
  • The TLS transcriptions are offered as an
    historical artefact, and the reason they are so
    offered is their intrinsic interest to
    researchers who want to study the phonetics of
    the TLS material: the phonetic analysis is
    extremely detailed, providing from one up to ten
    realizations of any given phonological segment.

22
2. Construction Representation - Phonetic
transcription
  • The following example gives a broad phonetic IPA
    representation. In the corpus each segment is,
    however, indexed into a precise phonetic
    realization that cannot be shown simply because
    the IPA does not provide the requisite symbolism,
    but that is nevertheless available for analysis.

Orthographic: Down by Clark Chapman's
Phonetic: d??n ba? kl?k ?æpm?nz

23
2. Construction Representation - Alignment
  • Alignment in a corpus is the provision of a
    mechanism whereby corresponding elements in
    different data representations are linked so that
    they are simultaneously available to the user.
  • NECTE provides such a linking mechanism to
    coordinate corresponding audio, standard
    orthographic, grammatical markup, and phonetic
    transcription data representations.
  • The main issue in the design of such a mechanism
    is the size of the alignment unit, that is, the
    granularity.

24
2. Construction Representation - Alignment
  • In NECTE, the two extremes of granularity are the
    phonetic segment and the interview:
  • In the first case, the audio, orthographic,
    tagged, and phonetic representations would be
    linked on a segment-by-segment basis.
  • In the second, the alignment consists of a
    juxtaposition of four complete and different
    representations of the same interview.
  • Neither extreme is particularly useful, and the
    segment-based alignment is probably unworkable;
    some granularity between these two extremes is
    required.
  • We looked at various alternatives, such as
    alignment by speaker utterance or syntactic unit,
    but these proved problematical, so in the end we
    adopted alignment by real-time interval.

25
2. Construction Representation - Alignment
  • Our real-time interval alignment mechanism works
    as follows.
  • It begins with the observation that real time
    (time as it is conceived by humans in day-to-day
    life) is meaningful only for the audio level of
    representation in the corpus; text, be it
    orthographic, tagged, or a sequence of phonetic
    symbols, has no temporal dimension.
  • A time interval t is selected, and the audio
    level is partitioned into some number n of
    length-t audio segments s: s(t x 1), s(t x
    2), ..., s(t x n), where 'x' denotes multiplication.

26
2. Construction Representation - Alignment
  • Corresponding markers are then inserted into the
    other levels of representation such that they
    demarcate substrings corresponding to the audio
    segments.
  • That is, there are markers in the other
    representational levels which identify the
    corresponding orthographic, phonetic, and
    part-of-speech tagged segments. 
  • In this way, selection of any segment s in any
    level of representation allows the segments
    corresponding to s in all the other levels to be
    identified.
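
The partitioning and lookup just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the NECTE implementation: the interval t = 20 seconds, the four-digit zero-padded time suffix, and the function names partition, anchor_id, and aligned_anchors are all hypothetical choices made for the example.

```python
def partition(duration, t):
    """Split [0, duration) into consecutive segments of length t (last may be shorter)."""
    bounds = []
    start = 0
    while start < duration:
        bounds.append((start, min(start + t, duration)))
        start += t
    return bounds


def anchor_id(interview, level, start, width=4):
    """Build a marker name from interview, representation level, and segment start time."""
    return f"{interview}{level}{start:0{width}d}"


def aligned_anchors(interview, levels, time_point, t):
    """For a moment in the audio, name the matching marker in every other level."""
    start = (time_point // t) * t  # start of the segment containing time_point
    return {level: anchor_id(interview, level, start) for level in levels}


segments_50s = partition(50, 20)
markers = aligned_anchors("tlsg01", ["necteortho", "phonetic"], 27, 20)
print(segments_50s)
print(markers)
```

Because every level shares the same segment start time, choosing a segment in one level is enough to find its counterparts in all the others.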

27
2. Construction Representation - Alignment
  • For example, the excerpts below show orthographic
    and phonetic transcription representations with
    corresponding time anchors inserted using XML, of
    which more in a moment. The anchors correspond to
    an elapsed-time segment in the audio
    representation.
  • <anchor id="tlsg01necteortho0020"/>where do you
    mean by that eh </u><u who="informantTlsg01">
    that's ehm <pause/> down by eh clark chapman's
    </u><u who="interviewerTlsg01"> oh aye like
    saltmeadows</u><u who="informantTlsg01"> yes
    saltmeadows </u><u who="interviewerTlsg01">
    <unclear/> whereabouts else have you lived since
    then you know i mean how long did you stay there
    </u><u who="informantTlsg01"> five year <anchor
    id="tlsg01necteortho0040"/>
  • <anchor id="tlsg01phonetic0020"/>02081 02301 08580
    02322 01443 02741 02201 01284 08580 02383 02801
    00421 02421 02501 00342 02164 02721 02021 02741
    02642 04321 02621 00503 02825 02301 02721 00246
    02341 12601 02642 02541 01284 02561 02881 01641
    <anchor id="tlsg01phonetic0040"/>
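
One way to use such anchors is to cut each representation at its anchors and pair the pieces by their shared time suffix. The sketch below assumes that anchors are written as empty XML elements with an id attribute and that the last four characters of the id are the time marker; ANCHOR, segments, plain_text, and the shortened sample strings are hypothetical.

```python
import re

# Assumed anchor shape: <anchor id="..."/>
ANCHOR = re.compile(r'<anchor\s+id="([^"]+)"\s*/>')


def segments(markup):
    """Map each anchor id to the markup that follows it, up to the next anchor."""
    parts = ANCHOR.split(markup)  # [text before first anchor, id1, text1, id2, text2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}


def plain_text(markup):
    """Drop remaining tags and normalise whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", markup).split())


ORTHO = ('<anchor id="tlsg01necteortho0020"/>where do you mean by that eh </u>'
         '<u who="informantTlsg01"> that\'s ehm <pause/> down by eh clark chapman\'s </u>'
         '<anchor id="tlsg01necteortho0040"/>')
PHON = ('<anchor id="tlsg01phonetic0020"/>02081 02301 08580 '
        '<anchor id="tlsg01phonetic0040"/>')

ortho = segments(ORTHO)
phon = segments(PHON)

# Pair the two levels on the trailing time marker of each anchor id.
by_time = {key[-4:]: plain_text(value) for key, value in ortho.items()}
pairs = {key[-4:]: (by_time.get(key[-4:]), value) for key, value in phon.items()}
print(pairs["0020"])
```

The final anchor in each excerpt opens a segment whose content is not shown, so it maps to an empty string here.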

28
2. Construction Representation - Structure
  • The NECTE corpus is structured using
    TEI-conformant XML. This and subsequent slides
    describe that structure.
  • Every TEI corpus consists of two main elements:
  • A prolog that contains meta-information about the
    corpus
  • The document instance that contains the content
    of the corpus
  • The prolog is too technical to be presented here.
    What follows is an overview of the document
    instance.

29
2. Construction Representation - Structure
The corpus consists of a sequence of interviews.
Each of the labels is, in XML /
TEI-speak, an entity reference, and refers to a
file containing a single speaker interview.
  • <teiCorpus.2>
  • <teiHeader type='corpus'>
  • <fileDesc></fileDesc>
  • <encodingDesc></encodingDesc>
  • <profileDesc></profileDesc>
  • <revisionDesc></revisionDesc>
  • </teiHeader>
  • &tlsg01; &tlsg22; &tlsn06;
    &tlsg02; &tlsg23; &tlsn07;
    &tlsg03; &tlsg24; &pvc01;
    &tlsg04; &tlsg25; &pvc02;
    &tlsg05; &tlsg26; &pvc03;
    &tlsg06; &tlsg27; &pvc04;
    &tlsg07; &tlsg28; &pvc05;
    &tlsg08; &tlsg29; &pvc06;
    &tlsg09; &tlsg30; &pvc07;
    &tlsg10; &tlsg31; &pvc08;
    &tlsg11; &tlsg32; &pvc09;
    &tlsg12; &tlsg33; &pvc10;
    &tlsg13; &tlsg34; &pvc11;
    &tlsg14; &tlsg35; &pvc12;
    &tlsg15; &tlsg36; &pvc13;
    &tlsg16; &tlsg37; &pvc14;
    &tlsg17; &tlsn01; &pvc15;
    &tlsg18; &tlsn02; &pvc16;
    &tlsg19; &tlsn03; &pvc17;
    &tlsg20; &tlsn04; &pvc18;
    &tlsg21; &tlsn05;
  • </teiCorpus.2>

30
2. Construction Representation - Structure
Each entity referred to by an entity reference
like tlsg01 contains a single interview, which
itself has a structure:
  • <TEI.2 id="tlsg01">
  • <teiHeader type="text">
  • <!-- Header information -->
  • </teiHeader>
  • <text>
  • <!-- Content -->
  • </text>
  • </TEI.2>

31
2. Construction Representation - Structure
  • <text>
  • <group>
  • <text id='tlsg01audio'>
  • <body>
  • <!-- content -->
  • </body>
  • </text>
  • <text id='tlsg01necteortho'>
  • <body>
  • <!-- content -->
  • </body>
  • </text>
  • <text id='tlsg01phonetic'>
  • <body>
  • <!-- content -->
  • </body>
  • </text>
  • <text id='tlsg01tagged'>
  • <body>

Between the <text> and </text> tags of an
interview, there is yet further structure: there
is a group, and the group consists of the four
types of content representation described
earlier. The next slide shows how this looks in
practice.

32
2. Construction Representation - Structure
  • <text>
  • <group>
  • <text id="tlsg01audio">
  • <body>
  • <p>tlsg01 audio file</p>
  • <audio entity="tlsaudiog01" />
  • </body>
  • </text>
  • <text id="tlsg01necteortho">
  • <body>
  • <u who="interviewerTlsg01">
  • <anchor id="tlsg01necteortho0000" />
  • ehm well could you tell us first of all
    where you were born please where you born in
    gateshead
  • </u>
  • ... Remainder of orthographic representation
  • </body>
  • </text>
  • Phonetic and tagged representations
  • </group>
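
A grouped interview of this shape can be walked with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a cut-down, well-formed stand-in for one interview; the SAMPLE string and the helper names representations and utterances are hypothetical, and real corpus files with entity references would need their entities resolved first.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for one interview: a group holding two representations.
SAMPLE = """<text>
  <group>
    <text id="tlsg01audio">
      <body><p>tlsg01 audio file</p><audio entity="tlsaudiog01" /></body>
    </text>
    <text id="tlsg01necteortho">
      <body>
        <u who="interviewerTlsg01"><anchor id="tlsg01necteortho0000" />
        ehm well could you tell us first of all where you were born please</u>
      </body>
    </text>
  </group>
</text>"""


def representations(xml_string):
    """Map each representation id to its <text> element inside the group."""
    root = ET.fromstring(xml_string)
    return {t.get("id"): t for t in root.find("group").findall("text")}


def utterances(text_elem):
    """Return (speaker, words) pairs for every <u> element, whitespace-normalised."""
    return [(u.get("who"), " ".join("".join(u.itertext()).split()))
            for u in text_elem.iter("u")]


reps = representations(SAMPLE)
print(list(reps))
print(utterances(reps["tlsg01necteortho"]))
```

Keeping each representation in its own text element means one level can be selected by id without touching the others.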

33
3. Cluster analysis
  • NECTE can be used in the traditional way for
    research into such things as social history,
    sociolinguistics, and dialectology, simply by reading
    through it and noting features of interest.
  • This talk is, however, essentially about why
    specifically digital electronic representation of
    text collections is useful in AH research, and,
    in Part 1, it used cluster analysis to exemplify
    the type of computational analysis and results
    that are not feasible using traditional methods.
  • The remainder of the discussion looks at cluster
    analysis of the NECTE corpus and presents some
    sociolinguistic results.

34
3. Cluster analysis
  • Let's say one wants to know if there are any
    systematic differences of pronunciation among
    speakers: say, between men and women, old and
    young men, and so on.
  • One can either listen to all the speakers over
    and over (and over) again, comparing them and
    eventually drawing conclusions
  • OR
  • One can use cluster analysis to do the job
    quickly and objectively.
  • TLS had the foresight to use cluster analysis all
    those years ago. It was cutting-edge in
    linguistics then, and it still is.

35
3. Cluster analysis
  • Here, in essence, is how cluster analysis of
    NECTE works.
  • 1. Construct a profile for the pronunciation used
    by each informant in the corpus by counting the
    number of times each of the large number of
    sounds used in speech occurs in that informant's
    interview. The resulting data looks like this:
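
Step 1 can be illustrated with a short sketch. It is not the project's code: it assumes each informant's phonetic transcription is available as a string of space-separated numeric codes, and the two tiny transcriptions below are invented for the example (only the code format follows the corpus excerpts). The name build_profiles is hypothetical.

```python
from collections import Counter


def build_profiles(transcriptions):
    """Count how often each phonetic code occurs per informant.

    Returns the sorted list of codes (the columns) and, for each informant,
    a frequency vector with one entry per code.
    """
    counts = {who: Counter(codes.split()) for who, codes in transcriptions.items()}
    variables = sorted(set().union(*counts.values()))
    matrix = {who: [c[v] for v in variables] for who, c in counts.items()}
    return variables, matrix


# Invented toy data, for illustration only.
toy = {
    "tlsg01": "02081 02301 08580 02301",
    "tlsg02": "02081 02081 02741",
}
variables, matrix = build_profiles(toy)
print(variables)
for who, row in matrix.items():
    print(who, row)
```

Each row is one informant's profile; a code the informant never used simply gets a count of zero.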

36
3. Cluster analysis
  • 2. Compare the profiles to see if they can be
    grouped according to similarity
  • This is difficult (and for large data sets
    impossible) for humans, but easy for a computer
    with cluster analysis software.
  • The result is a cluster tree, shown on the next
    slide.
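
Step 2 can be sketched as a small hierarchical clustering routine. The slides do not say which distance measure or clustering method was used, so Euclidean distance and average linkage are assumptions here, and the four profiles are invented; the names euclidean, average_distance, and cluster are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distance(profiles, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def cluster(profiles):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(profiles, *pair))
        merges.append((a, b, average_distance(profiles, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Invented toy profiles: A and B resemble each other, as do C and D.
toy = {"A": [10, 0, 1], "B": [9, 1, 1], "C": [0, 10, 5], "D": [1, 9, 6]}
merges = cluster(toy)
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

The merge history is what a cluster tree draws: each merge is a join in the tree, and the recorded distance sets the length of the line leading to it.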

37
3. Cluster analysis
  • The lengths of the horizontal lines represent
    relativities of similarity between pairs of
    speaker profiles or speaker profile groups: the
    longer the line, the more dissimilar the
    profiles.
  • Knowing this, it is clear that there are two main
    clusters, here labelled NG1 and NG2, that NG1
    contains well-defined subclusters NG1a and NG1b,
    and that NG1a also contains well-defined
    subclusters NG1a(i) and NG1a(ii).
  • Correlating these clusters with the social data
    such as gender, age, and socio-economic status
    available for the TLS speakers, it emerged that
    those in the NG1 cluster were almost all working
    class speakers with moderate levels of education
    from Gateshead on the south side of the river
    Tyne, and those in NG2 were all well educated
    middle class speakers from Newcastle on the north
    side.

38
3. Cluster analysis
Among the Gateshead speakers in NG1, moreover,
there are two main clusters, labelled NG1a and
NG1b, and NG1a itself consists of two main
subclusters NG1a(i) and NG1a(ii). Once again,
there was a systematic correlation with the
social data available for the speakers. The
clearest correlation is between cluster structure
and gender: NG1b consists entirely of men, and
NG1a mainly though not exclusively of women.

39
3. Cluster analysis
  • With a few slight exceptions, the men in NG1b
    have the minimum legal level of education, and
    all are in unskilled, semi-skilled, and skilled
    manual employment.
  • In NG1a there is a clear split between a cluster
    consisting mainly of women with minimum education
    in unskilled, semi-skilled, and skilled manual
    employment (NG1a(i)), and one consisting of men
    and women with a slightly higher educational and
    employment level (NG1a(ii)).

40
3. Cluster analysis
  • Numerous advanced techniques for analyzing data
    have been and are being developed in an effort to
    deal with the deluge of electronic information
    worldwide.
  • We are also experimenting with using such
    techniques on the NECTE data.
  • The picture you are about to see was generated by
    an artificial neural network working on the NECTE
    data, and represents Tyneside linguistic usage as
    a landscape.
  • I won't attempt to explain it, but apart from the
    information it contains, it is rather beautiful,
    just like the dialect it's based on.

41
3. Cluster analysis
  • A topographic map of the NECTE phonetic data