Title: Will Allen w.h.a.allen@ncl.ac.uk, Warren Maguire w.n.maguire@ncl.ac.uk, Hermann Moisl hermann.moisl@ncl.a
Phonetic variation in Tyneside: exploratory multivariate analysis of the Newcastle Electronic Corpus of Tyneside English
Will Allen, Warren Maguire, Hermann Moisl
School of English Literature, Language, and Linguistics
University of Newcastle upon Tyne
Introduction
- The newly-created Newcastle Electronic Corpus of
Tyneside English (NECTE) offers an opportunity to
study an historically-recent sample of English
spoken in the Tyneside region of North-East
England.
- This paper gives an overview of an exploratory analysis of phonetic data derived from NECTE.
- The analysis was undertaken with the aim of generating hypotheses about the main directions of phonetic variation among individual speakers and speaker groups in the corpus, and about how this variation correlates with associated social factors.
- The discussion is in four main parts.
  - The first part outlines exploratory multivariate analysis in general and, in particular, hierarchical cluster analysis, the method used in this study.
  - The second describes the NECTE phonetic data used in the analysis.
  - The third carries out a hierarchical cluster analysis of a sample of that data and states some hypotheses based on it.
  - The fourth raises some caveats and indicates directions for future work.
1. Multivariate analysis
- The proliferation of computational technology has generated an explosive production of electronically encoded information of all kinds.
- In the face of this, traditional paper-based methods for search and interpretation of data have been overwhelmed by sheer volume, and a wide variety of computational methods has been developed in an attempt to make the deluge at least tractable.
- As such methods have been refined and new ones introduced, something over and above tractability has emerged: new and unexpected ways of understanding the data.
- The fact that a computer can deal with vastly larger data sets than a human is an obvious factor, but there are two others of at least equal importance:
  - one is the ease with which data can be manipulated and reanalyzed in interesting ways without the often prohibitive labour that this would involve using manual techniques;
  - the other is the extensive scope for visualization that computer graphics provide.
- These developments have clear implications for the analysis of large bodies of text in corpus linguistics.
- Effective analysis of the large electronic corpora now being generated will increasingly be tractable only by adapting the interpretative methods developed by the statistical, data mining, and related communities.
- In the present paper we are interested in one particular type of tool: multivariate analysis. What is multivariate analysis?
- Observation of nature plays a fundamental role in science.
- In current scientific methodology, an hypothesis about some phenomenon is proposed and its adequacy assessed using data obtained from observation of the domain of inquiry.
- But nature is complex, and there is no hope of being able to observe it exhaustively. Instead, particular aspects of the domain are selected for observation.
- Each selected aspect is represented by a variable, and a series of observations is conducted in which, at each observation, the values for each variable are recorded. A body of data is thereby built up on the basis of which an hypothesis can be assessed.
- One might choose to observe only one aspect, in which case the data set consists of a set of values assigned to one variable; such a data set is referred to as univariate.
- If two aspects are observed, then the data set is bivariate, if three trivariate, and so on up to some arbitrary number n; the general term for n-variable data sets is multivariate.
- As the number of variables grows, so does the difficulty of understanding the data, that is, of conceptualizing:
  - the interrelationships of variables within a single data item: how, for example, are the height, weight, and heart rate of any given person interrelated?
  - the interrelationships of complete data items: how do people measured on the above variables compare to one another?
- Multivariate analysis is the computational use of mathematical and statistical tools for understanding such interrelationships in data.
- Numerous techniques for multivariate analysis exist. They can be divided into two main categories, which are usually referred to as 'exploratory' and 'confirmatory'.
- Exploratory analysis aims to discover regularities in data which can serve as the basis for formulation of hypotheses about the domain from which the data comes. Such techniques emphasize intuitively accessible, usually graphical, representations of data structure.
- Confirmatory multivariate analysis attempts to determine whether or not there are significant relationships between some number of selected independent variables and one or more dependent ones.
- These two types are complementary in that the first generates hypotheses about data, and the second tries to determine whether or not such hypotheses are valid. Exploratory analysis is naturally prior to confirmatory analysis; this discussion is concerned with the former.
1. Multivariate analysis: hierarchical cluster analysis
- Hierarchical cluster analysis is a variety of exploratory multivariate analysis.
- To understand how it works it is first necessary to understand the concept of distance between data points in vector space.
- Assume a domain of inquiry, say a linguistic corpus, which will be studied using six variables.
- If the six-dimensional data is to be analyzed using an exploratory method, it has to be represented mathematically.
- This is done in the form of vectors, where a vector is a sequence of values indexed by the positive integers 1, 2, 3, and so on.
- Where the data consists of more than one case, which it usually does, each case is represented by a vector, and the set of vectors is assembled into a matrix, which is a sequence of vectors arranged in rows, with the rows indexed by the positive integers 1, 2, 3, and so on.
- In matrix M, case 2 is at row M2 and the value of the third variable for that case is at M2,3, that is, 0.1.
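The row-and-column indexing just described can be sketched in a few lines of Python. This is only an illustration: the matrix below is hypothetical, and the single value taken from the text is the 0.1 at row 2, column 3. The helper name `value_at` is likewise an assumption, introduced because Python lists count from 0 whereas the text counts from 1.

```python
# Hypothetical data matrix M: 3 cases (rows) measured on 6 variables (columns).
# Only the value 0.1 at row 2, column 3 comes from the text; the rest is made up.
M = [
    [0.4, 0.7, 0.2, 0.9, 0.5, 0.3],  # case 1
    [0.6, 0.8, 0.1, 0.2, 0.7, 0.4],  # case 2
    [0.3, 0.5, 0.9, 0.6, 0.1, 0.8],  # case 3
]


def value_at(matrix, case, variable):
    """Return the value for a 1-indexed case (row) and variable (column)."""
    return matrix[case - 1][variable - 1]


case_2 = M[2 - 1]          # the vector for case 2
m_2_3 = value_at(M, 2, 3)  # value of the third variable for case 2
print(case_2)
print(m_2_3)
```

Each row is one case vector, so selecting a row gives the whole case and selecting a row and a column gives a single measurement.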
- A vector space is a geometrical interpretation of a set of vectors.
- The dimensionality n of the vectors, that is, the number of elements in each, defines an n-dimensional space.
- The values in a vector define the coordinates of a point in that space.
- For example, a bivariate data set defines a 2-dimensional space in which each vector specifies the coordinates of a point in that space.
- Take a data set consisting of vectors that specify the age and weight of some number of individuals. A single such vector might be v = (36, 160).
- In geometrical terms, the x or age axis is 0..100, the y or weight axis is 0..200, and any vector in the data set can be plotted in the (x,y) space, as in the accompanying plot.
- If more vectors are plotted in the space, nonrandom structure may or may not emerge, depending on the interrelationships of the real-world characteristics that the variables represent.
- Where there are no structured real-world interrelationships, the result will look something like the upper plot of random points.
- If there is structure, the plot might look something like the lower one, where two clusters have clearly emerged. These clusters say something substantive about the interrelationships of the represented entities.
- Analogously, a trivariate (age, weight, height) vector v = (36, 160, 71) from a data set of length-3 vectors defines a point in 3-dimensional space, as in the upper plot.
- Multiple vectors representing a structured domain plotted in the space might look like the lower figure.
- The structure of data with dimensionality higher than 3 cannot be directly visualized.
- How can it be represented in an intuitively accessible way?
- The various exploratory multivariate methods provide indirect visualizations.
- Hierarchical cluster analysis, in particular, constructs dendrograms or trees that show the constituency structure of clusters using relative distance between and among points in the high-dimensional data vector space.
- Distance can be understood quite literally: the distance between points A and B in the figure below can be measured, and it is less than the measured distance between A and C.
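The literal notion of distance can be made concrete with a short sketch. It assumes the ordinary Euclidean (straight-line) distance, and the coordinates given to A, B and C are hypothetical, chosen only so that A is nearer to B than to C as in the text.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical coordinates for points A, B and C (not taken from the slides).
A = (1.0, 1.0)
B = (4.0, 5.0)
C = (9.0, 7.0)

d_ab = euclidean_distance(A, B)
d_ac = euclidean_distance(A, C)
print(d_ab, d_ac)
```

The same function works unchanged for vectors of any length, which is what allows distance to be measured in spaces that cannot be drawn.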
- As an example of how dendrograms graphically represent the structure of higher-dimensional data, we use a benchmark data set that measures flowers on 4 variables.
- A hierarchical cluster analysis generates a tree in which the different lengths of the horizontal lines represent relativities of distance among data vectors in 4-dimensional space: the longer the lines, the greater the distance.
- Knowing this, it can easily be seen that there are three main clusters, that is, three types of flower, and that each cluster has internal structure.
2. Data: the corpus
- The NECTE corpus is based on two pre-existing corpora of audio-recorded speech, one of them gathered during the Tyneside Linguistic Survey (TLS) undertaken in the late 1960s, and the other between 1991 and 1994 for the Phonological Variation and Change in Contemporary Spoken English (PVC) project, both at Newcastle University.
- NECTE's aim has been to enhance, improve access to, and promote the re-use of the TLS and PVC corpora by amalgamating them into a single, TEI-conformant electronic corpus.
- The result will shortly be made available to the research community in a variety of formats: digitized sound, phonetic transcription, and standard orthographic transcription, all aligned and accessible on the Web.
- This discussion is concerned with the TLS component of NECTE.
- It originally consisted of 150 loosely-structured 30-minute interviews with Tyneside informants that were recorded onto analog reel-to-reel tapes.
- As part of its research activity based on these recordings, the TLS produced highly detailed phonetic transcriptions of about 10 minutes of each of 64 recordings, of which 61 survive.
- These 61 transcriptions are the basis for the data used in this presentation.
- One of the main aims of the TLS project was to see whether systematic phonetic variation among Tyneside speakers of the period could be significantly correlated with variation in their social characteristics.
- To this end they developed a methodology which was radical at the time and remains so today:
  - in contrast to the then-universal and still-dominant theory-driven approach, where social and linguistic factors are pre-selected by the analyst,
  - the TLS proposed a fundamentally empirical approach in which salient factors are extracted from the data itself and then serve as the basis for model construction.
- To realize its research aim using its empirical methodology, the TLS had to compare the audio interviews it had collected at the phonetic level of representation.
- To be able to do this, the TLS phonetically transcribed a substantial sample of its audio corpus, as noted.
- The TLS invented its own transcription scheme. It is too complex for presentation here, but its main features are:
  - It captures variation in the distribution of phonetic segments across lexical environments.
  - There are two levels of transcription: relatively broad and very narrow.
2. Data: abstraction
- The analysis reported in this talk is at the broad phonetic transcription level.
- It is based on comparison of phonetic profiles associated with each of the TLS speakers.
- A profile for any speaker S is the number of times S uses each of the phonetic segments defined by the TLS transcription scheme in his or her interview.
- More specifically, the profile P associated with S is a vector having as many elements as there are codes, such that:
  - Each vector element Pj represents the jth segment, where j is in the range 1..number of phonetic segments in the TLS scheme.
  - The value stored at Pj is an integer representing the number of times S uses the jth code.
- There are 156 codes, and so a profile is a length-156 vector. For example:
- There are 61 TLS speakers, and their profiles are represented in a matrix having 61 rows, one for each profile.
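The construction of profiles and of the speaker matrix can be sketched as follows. Everything concrete here is hypothetical: the code names, the two speakers and their transcriptions are invented for illustration, and the inventory has 4 codes where the TLS scheme has 156. The function name `build_profile` is also an assumption.

```python
from collections import Counter


def build_profile(transcription, codes):
    """Count how often each code occurs in one speaker's transcription.

    The result has one element per code, in the fixed order given by `codes`.
    """
    counts = Counter(transcription)
    return [counts[code] for code in codes]


# Hypothetical inventory of 4 codes (the TLS scheme has 156).
codes = ["c1", "c2", "c3", "c4"]

# Hypothetical transcriptions: one sequence of codes per speaker.
transcriptions = {
    "speaker_1": ["c1", "c3", "c1", "c2", "c1"],
    "speaker_2": ["c4", "c4", "c2"],
}

speakers = sorted(transcriptions)
matrix = [build_profile(transcriptions[s], codes) for s in speakers]
for name, row in zip(speakers, matrix):
    print(name, row)
```

Keeping the code order fixed across speakers is what makes the rows comparable: element j of every row refers to the same code.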
2. Data: preprocessing
- Prior to analysis, this matrix was modified in two ways:
  - Normalization for variation in text length
  - Dimensionality reduction: elimination of superfluous variables
- The result is a 61 x 80 length-normalized matrix, which is the data for the analysis that follows.
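The talk does not say how either preprocessing step was carried out, so the sketch below rests on two loudly stated assumptions: that length normalization rescales each profile to the mean profile total, and that superfluous variables are those with the lowest variance. Both choices, the function names and the toy counts are hypothetical and may differ from what was actually done.

```python
from statistics import pvariance


def normalize_for_length(matrix):
    """ASSUMED method: rescale each row so that it sums to the mean row total."""
    totals = [sum(row) for row in matrix]
    mean_total = sum(totals) / len(totals)
    return [
        [value * mean_total / total for value in row] if total else list(row)
        for row, total in zip(matrix, totals)
    ]


def keep_highest_variance(matrix, k):
    """ASSUMED method: keep the k highest-variance columns, in original order."""
    n_cols = len(matrix[0])
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    kept = sorted(ranked[:k])
    return [[row[j] for j in kept] for row in matrix], kept


# Hypothetical counts: 3 speakers x 4 codes.
counts = [
    [10, 0, 5, 5],
    [20, 0, 10, 10],
    [4, 0, 2, 34],
]
normalized = normalize_for_length(counts)
reduced, kept_columns = keep_highest_variance(normalized, k=2)
print(kept_columns)
```

In this toy data the first two speakers differ only in how much they said, so their normalized rows coincide, and the code that nobody uses is among the columns dropped.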
3. Analysis
- This section analyzes the data matrix developed in the preceding section.
- One particular variety of hierarchical cluster analysis is used: the squared Euclidean distance measure with the increase in sum of squares clustering algorithm, or, more simply, Ward's method.
- The implications of this choice are discussed in due course.
- The aim is to get an initial indication of the structure of the TLS phonetic data using the matrix of speaker profiles, normalized for length and reduced in dimensionality as described. The resulting cluster tree is shown in the accompanying figure.
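The idea behind Ward's method can be sketched in plain Python. This is a deliberately naive illustration under stated assumptions, not the software used for the analysis: at each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares, computed from the squared Euclidean distance between cluster centroids. The four profiles and all function names are hypothetical.

```python
def centroid(points):
    """Mean vector of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def squared_euclidean(u, v):
    """Squared straight-line distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ward_increase(points_a, points_b):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    n_a, n_b = len(points_a), len(points_b)
    gap = squared_euclidean(centroid(points_a), centroid(points_b))
    return n_a * n_b / (n_a + n_b) * gap


def ward_clustering(data):
    """Agglomerate rows of `data`; return merges as (members_a, members_b, increase)."""
    clusters = [[i] for i in range(len(data))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                cost = ward_increase(
                    [data[i] for i in clusters[x]],
                    [data[i] for i in clusters[y]],
                )
                if best is None or cost < best[0]:
                    best = (cost, x, y)
        cost, x, y = best
        merges.append((clusters[x], clusters[y], cost))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical length-normalized profiles: 4 speakers x 2 variables.
profiles = [
    [1.0, 1.0],
    [1.0, 2.0],
    [8.0, 8.0],
    [9.0, 8.0],
]
for members_a, members_b, increase in ward_clustering(profiles):
    print(members_a, "+", members_b, "increase:", increase)
```

The recorded increases play the role of the line lengths in a dendrogram: the two tight pairs join at a small cost, and the final merge of the two pairs is far more expensive.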
- Four main clusters emerge, labelled A-D.
- D clusters markedly against the rest, and comprises the Newcastle group of speakers.
- On the basis of the phonetic segment frequency distribution evidence, therefore, Newcastle speakers are strongly distinguished from Gateshead ones.
- The Gateshead speakers can be further analyzed, but for present purposes we stay with the Newcastle / Gateshead distinction.
- Knowing that there are well-defined clusters is one thing.
- Knowing why is another: what are the main phonetic segmental determinants of the clusters?
- Several ways of answering this question exist; we take a graphical approach.
- The phonetic profiles for all 61 speakers were simultaneously plotted with the aim of visually identifying systematic differences between the Newcastle and Gateshead clusters.
- The problem is immediately apparent: too much detail. The level of detail was reduced as follows:
  - A significantly smaller number of the highest-variance variables was selected, thus reducing the density of information on the x-axis.
  - The frequency vectors for all the Gateshead speakers 1-57 were averaged, yielding a mean frequency vector for these speakers.
  - The same was done for the Newcastle speakers.
  - The mean frequency vectors were plotted against each other on the same graph.
- For the 30 highest-variance variables shown here,
those for which the mean vectors differ most are
the most important in differentiating Newcastle
and Gateshead speakers.
- The table on the right gives the interpretation of this plot:
  - Rank indexes the 6 largest variable differences between the two clusters: 1 is the largest difference, 2 the second largest, etc.
  - Variable nr shows, for each Rank, the corresponding variable number in the range 1..40 on the x-axis of the plot.
  - Variable symbol identifies the phonetic symbol corresponding to the Variable nr.
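The averaging and ranking procedure can be sketched numerically rather than graphically. This assumes that "difference" means the absolute difference between the two group means for each variable; the profiles, group sizes and function names are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


def rank_differences(group_a, group_b, top):
    """Rank variables by absolute difference between the two group means.

    Returns (variable_number, difference) pairs, largest difference first,
    with variables numbered from 1.
    """
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    diffs = [(j + 1, abs(a - b)) for j, (a, b) in enumerate(zip(mean_a, mean_b))]
    diffs.sort(key=lambda pair: pair[1], reverse=True)
    return diffs[:top]


# Hypothetical normalized profiles over 4 variables.
gateshead = [[10.0, 2.0, 5.0, 1.0], [12.0, 4.0, 5.0, 3.0]]
newcastle = [[3.0, 3.0, 6.0, 8.0], [5.0, 3.0, 4.0, 8.0]]

ranking = rank_differences(gateshead, newcastle, top=2)
for rank, (variable_nr, diff) in enumerate(ranking, 1):
    print(rank, variable_nr, diff)
```

Each printed row corresponds to one line of a Rank / Variable nr table; mapping the variable number to its phonetic symbol would need the code inventory, which is not reproduced here.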
- The variables that are most important for distinguishing Gateshead from Newcastle speakers can be read off from the table. They are:
  - ? in words like standard and interview (localised Tyneside English often has ? here)
  - closely related to this, ? in words like baker and china (localised Tyneside English often has ? here)
  - ? in the KIT lexical set
  - ? in words like houses and places
  - ?? in the GOAT lexical set (RP type English has ?? here)
  - e? in the PRICE lexical set (RP type English has a? here).
- We have conducted similar analyses on the Gateshead subcluster.
- There isn't time to go into the details, but the essence of the results is that, on the basis of the phonetic frequency profiles:
  - The main phonetic segmental determinants for the clustering can be extracted.
  - The main subclusters can be correlated with social factors, as follows.
- 1. There is a correlation between gender and cluster:
  - cluster A is almost exclusively female
  - cluster BC is almost exclusively male
  - cluster DE is mostly female.
- 2. There is a correlation between socio-economic status and cluster:
  - speakers in cluster A are, by and large, from the lowest of the three socio-economic groups
  - speakers in cluster BC are from socio-economic groups 1 and 2
  - speakers from cluster DE are from the two higher socio-economic groups, 2 and 3.
- When the interaction between these two social variables is considered, the picture becomes even clearer. The following table summarises the typical social characteristics of the clusters.
4. Discussion
- The foregoing analysis has used only one of a wide range of possible distance measure / clustering algorithm combinations available under the rubric 'hierarchical cluster analysis'.
- In general, with respect to any given data set, different combinations of distance measure and clustering algorithm typically give results that differ from one another to greater or lesser degrees.
- Given that there is no obvious way of selecting the best analysis, the one that captures the true structure of the data, the question is: how reliable a tool is hierarchical cluster analysis for linguistic research?
- Our own cluster analyses have shown such variation with respect to the TLS data.
- How can we claim any validity for our results?
- In principle, the response is that the purpose of exploratory multivariate analysis is to generate hypotheses rather than to provide definitive answers, and different analyses of the same data simply generate different hypotheses to be tested.
- In practice, if different selections of distance measure / clustering algorithm produce wildly different analyses of one and the same data set, it is difficult to see what advantage they offer over unguided hypothesizing.
- Therefore, the next step is to see if a stable analysis emerges when the data is analyzed using:
  - various other distance measure / clustering algorithm combinations in hierarchical cluster analysis
  - different clustering methods such as multidimensional scaling and self-organizing maps.
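One simple way to check for stability is to cluster the same data under several linkage rules and compare the resulting partitions. The sketch below is an assumption about how such a check might look, not a method taken from the talk: it uses single, complete and average linkage with Euclidean distance as stand-ins for "other combinations", cuts each tree at two clusters, and tests whether the partitions agree. The profiles are hypothetical.

```python
import math
from itertools import combinations


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_link(data, a, b):
    """Distance between the closest pair of members."""
    return min(distance(data[i], data[j]) for i in a for j in b)


def complete_link(data, a, b):
    """Distance between the farthest pair of members."""
    return max(distance(data[i], data[j]) for i in a for j in b)


def average_link(data, a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(distance(data[i], data[j]) for i in a for j in b) / (len(a) * len(b))


def cut_into(data, k, linkage):
    """Merge the closest clusters until k remain; return them as a set of frozensets."""
    clusters = [frozenset([i]) for i in range(len(data))]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(data, *pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


# Hypothetical profiles: 5 speakers x 2 variables.
profiles = [[1.0, 1.0], [1.5, 1.0], [8.0, 8.0], [8.0, 9.0], [9.0, 8.5]]

partitions = {
    link.__name__: cut_into(profiles, 2, link)
    for link in (single_link, complete_link, average_link)
}
stable = len({frozenset(p) for p in partitions.values()}) == 1
print(stable)
```

Agreement on well-separated toy data like this is expected; the interesting case is real data, where disagreement between partitions would show which groupings depend on the choice of method.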
4. Conclusion
- The hierarchical analyses we have undertaken so far look promising. The next steps are:
  - Detailed phonetic and sociolinguistic analysis of the cluster tree generated by the distance measure / clustering algorithm used in this talk.
  - Comparison of this tree with structural analyses generated by other varieties of hierarchical cluster analysis and by alternative clustering methods.
  - Relation of our results to existing work on Tyneside English, and in particular to that of Val Jones-Sargent, one of the original TLS researchers, on whose work our own is based.
- V. Jones-Sargent, Tyne Bytes: A computerised sociolinguistic study of Tyneside, Peter Lang, 1983.