Title: Using corpora in contrastive and translation studies
1Using corpora in contrastive and translation
studies
- Corpus Linguistics
- Richard Xiao
- lancsxiaoz_at_googlemail.com
2Aims of this session
- Lecture
- Corpora in contrastive and translation studies
- Use of comparable and parallel corpora
- Case study: Translation universals, do they really exist?
- Lab session
- CUC paraconc and Babel parallel corpus
- Closing
- Shedding of valedictory tears
3Types of corpora Some distinctions
- Monolingual versus multilingual corpora
- Parallel versus comparable corpora
- Comparable versus comparative corpora
4Monolingual versus multilingual corpora
- Monolingual corpora
- A corpus that only involves one language
- Multilingual corpora
- A corpus that involves texts of more than one language
- A corpus covering two languages is conventionally known as bilingual
- Multilingual corpora, in a narrow sense, must involve more than two languages
- Multilingual and bilingual are often used interchangeably
- Parallel and comparable corpora
5Parallel versus comparable corpora
- Terminological confusion centres on the terms parallel and comparable
- For some scholars (e.g. Aijmer and Altenberg 1996; Granger 1996: 38)
- Corpora composed of source texts in one language and their translations in another language (or other languages) are translation corpora, while those comprising different components sampled from different native languages using comparable sampling techniques are called parallel corpora
- For many others (e.g. Baker 1993: 248, 1995, 1999; Barlow 1995, 2000: 110; Hunston 2002: 15; McEnery and Wilson 1996: 57; McEnery, Xiao and Tono 2006)
- Corpora of the first type are labelled parallel corpora, while those of the latter type are comparable corpora
6Parallel versus comparable corpora
- In classifying corpora, the criteria used must be consistent and logical
- We can say a corpus is a translation or a non-translation corpus if the criterion of corpus content is used
- But if we choose to define corpus types by the criterion of corpus form, we must use the criterion consistently
- We can say a corpus is parallel if it contains source texts and translations in parallel, or comparable if its components or subcorpora are made comparable by applying the same sampling techniques and similar balance and coverage
- It is simply inconsistent and illogical to refer to corpora of the first type as translation corpora by the criterion of content while referring to corpora of the latter type as parallel corpora by the criterion of form!
7Multilingual vs. monolingual comparable corpora
- A common practice in TS is to compare a corpus of translated texts (translational corpus) with a corpus consisting of comparably sampled non-translated texts in the same language
- The two sub-corpora form a monolingual comparable corpus for translation research, as opposed to a multilingual comparable corpus composed of comparable texts in different languages for cross-linguistic contrast
8Comparative corpora
- Corpora containing different regional varieties of the same language are not comparable corpora
- E.g. the International Corpus of English (ICE), the Brown family of corpora
- All corpora, as a resource for linguistic research, have always been pre-eminently suited for comparative studies (Aarts 1998: ix), either intralingually or interlingually
- Corpora of this kind are comparative corpora
9Use of parallel and comparable corpora
- Parallel and comparable corpora offer specific uses and possibilities for contrastive and translation studies (Aijmer and Altenberg 1996: 12)
- they give new insights into the languages compared, insights that are not likely to be gained from the study of monolingual corpora
- they can be used for a range of comparative purposes and increase our knowledge of language-specific, typological and cultural differences, as well as of universal features
- they illuminate differences between source texts and translations, and between native and non-native texts
- they can be used for a number of practical applications, e.g. in lexicography, language teaching and translation.
10Use of parallel and comparable corpora
- Used primarily for translation and contrastive studies
- The two types of corpora have their own characteristics, and serve different purposes
- Parallel corpora are useful in translation studies, but they alone serve as a poor basis for cross-linguistic contrast, because translations cannot avoid the effect of translationese
- Comparable corpora are well suited for contrastive research, but are less useful in translation studies
11Using corpora in translation studies
- Parallel corpora
- Useful in exploring how an idea in one language is conveyed in another language, thus providing indirect evidence for the study of translation processes
- Indispensable for building statistical or example-based machine translation (EBMT) systems, and for the development of bilingual lexicons and translation memories
- Parallel concordancing is a useful tool for translators
- Comparable corpora
- Useful in improving the translator's understanding of the subject field and improving the quality of translation in terms of fluency, correct term choice and idiomatic expressions in the chosen subject field
- Can also be used to build terminology banks
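The idea of parallel concordancing mentioned on this slide can be illustrated with a minimal sketch. It assumes the corpus is already sentence-aligned and held in memory as (source, target) pairs; the function name, the substring matching and the two example pairs are hypothetical and do not describe how any particular concordancer works.

```python
from typing import List, Tuple

AlignedPair = Tuple[str, str]  # (source sentence, target sentence)


def parallel_concordance(
    aligned_pairs: List[AlignedPair],
    keyword: str,
    case_sensitive: bool = False,
) -> List[AlignedPair]:
    """Return every aligned pair whose source side contains the keyword.

    Matching is plain substring search, which is a simplification:
    a real tool would normally match on tokens or patterns.
    """
    needle = keyword if case_sensitive else keyword.lower()
    hits = []
    for source, target in aligned_pairs:
        haystack = source if case_sensitive else source.lower()
        if needle in haystack:
            hits.append((source, target))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned English-Chinese sentence pairs.
    pairs = [
        ("The corpus is balanced.", "该语料库是平衡的。"),
        ("She reads books.", "她读书。"),
    ]
    for source, target in parallel_concordance(pairs, "corpus"):
        print(source, "|", target)
```

Each hit shows the source sentence next to its translation, which is what lets a translator see how an expression has been rendered before.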
12Using corpora in translation studies
- Translational corpora
- Provide primary evidence in product-oriented Translation Studies, and in studies of translation universals
- If corpora of this kind are encoded with sociolinguistic and cultural parameters, they can also be used to study the sociocultural environment of translations (e.g. functions of translation in DTS)
- Monolingual corpora (source / target language)
- Raising the translator's linguistic and cultural awareness in general
- Providing a useful and effective reference tool for translators
- In combination with a parallel corpus to form a so-called translation evaluation corpus that helps translator trainers or critics to evaluate translations more effectively and objectively
13Corpus-based translation studies
- Laviosa (1998a)
- the corpus-based approach is evolving, through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation.
- Hypotheses about translation universals can be tested with corpus data (Baker 1993, 1995)
- Rapid development of corpus linguistics, esp. multilingual corpus research in the early 1990s
- Increasing interest in Descriptive Translation Studies (Toury 1995)
- Tymoczko (1998)
- Corpus Translation Studies is central to the way that Translation Studies as a discipline will remain vital and move forward.
- Meta 43/4 (1998); Kenny (2001); Laviosa (2002); Granger et al (eds.) (2003); Olohan (2004); Mauranen et al (eds.) (2004); Kruger and Munday (ed.) (2011); Hu (2011), Wang (2011), Xiao (2012)
14The Holmes-Toury map
- Applied Translation Studies
- Descriptive Translation Studies
- Theoretical Translation Studies
15Applied Translation Studies
- Three major contributions of corpora
- Corpus-assisted translating
- Bowker (1998: 631): corpus-assisted translations are of a higher quality with respect to subject field understanding, correct term choice and idiomatic expressions.
- Corpus-aided translation teaching and training
- Bernardini (1997): large corpora concordancing (LCC) can help students to develop awareness, reflectiveness and resourcefulness, which are said to be the skills that distinguish a translator from unskilled amateurs
- Development of translation tools
- Corpora, and especially aligned parallel corpora, are essential for the development of translation technology such as machine translation (MT) systems and computer-aided translation (CAT) tools
16Descriptive Translation Studies
- Characterized by its emphasis on the study of translation per se
- It seeks to answer the question of why a translator translates in this way, rather than how to translate
- Baker (1993) predicted that the availability of large corpora of both source and translated texts, together with the development of the corpus-based approach, would enable translation scholars to uncover the nature of translation as a mediated communicative event
17Descriptive Translation Studies
- Three focuses (Holmes 1972/1988)
- Translation as a product
- Concerned with describing translation as a product by comparing corpora of translated and non-translated native texts in the target language
- Attempting to uncover evidence to support or reject the so-called translation universal hypotheses
- Translation as a process
- Aims at revealing the thought processes that take place in the mind of the translator while she or he is translating
- One possible way for corpus-based DTS is to investigate, off-line, the written transcripts of recorded think-aloud sessions, which are known as Think-Aloud Protocols (TAPs)
- Translation as product providing indirect evidence for translation as process
- The function of translation
- The study of contexts rather than texts: the function or impact of a translation
- Relatively few function-oriented studies are corpus-based
18Theoretical Translation Studies
- Aims to establish general principles by means of which these phenomena can be explained and predicted (Holmes 1988: 71)
- Closely related to, and often reliant on, the empirical findings produced by Descriptive Translation Studies
- One good battleground for using DTS findings to pursue a general theory of translation is the hypothesis of so-called translation universals (TUs) and its related sub-hypotheses
- Sometimes referred to as the inherent features of translational language, or translationese
19TU: A focus of CBTS
- An important area of corpus-based TS over the past decade
- Baker (1993, 1996); Chesterman (2004); Kenny (1998, 1999, 2000, 2001); Laviosa (1998b); Mauranen and Kujamäki (2004); McEnery and Xiao (2002, 2007); Olohan (2004); Olohan and Baker (2000); Øverås (1998); Pym (2005); Xiao and Yue (2008), Xiao (2010), Xiao and Dai (2010), Xiao (2010, 2011, 2012)
- The Translational English Corpus (TEC)
- Manual
- http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
- Software
- http://ronaldo.cs.tcd.ie/tec2/jnlp/
20Features of translated English
- Laviosa (1998b): Four core patterns of lexical use in translational English
- A relatively low proportion of lexical words over function words
- A relatively high proportion of high-frequency words over low-frequency words
- A relatively great repetition of the most frequent words
- Less variety in the most frequently used words
21Features of translated English
- Beyond the lexical level
- Simplification: tendency to simplify the language used in translation (Baker 1996: 181-182)
- simpler language than target native language, lexically / syntactically / stylistically
- Normalization: tendency to exaggerate features of the target language and to conform to its typical patterns (Baker 1996: 183)
- more normal than the target native language
- Explicitation: tendency in translations to spell things out rather than leave them implicit (Baker 1996: 180)
- more frequent use of conjunctions, and increased cohesion in translated text
- Sanitization: translated texts are somewhat sanitized versions of the original (Kenny 1998: 515)
- Lost or reduced connotational meaning in translation
- TU hypotheses
22TU: A target of debate
- Is translational language different from target native language?
- Translational language is at best an unrepresentative special variant of the target language because translations cannot possibly avoid the effect of translationese
- e.g. Baker 1993; Gellerstam 1996; Hartmann 1985; Laviosa 1997; McEnery and Wilson 2001; McEnery and Xiao (2002, 2007); Teubert 1996
23TU: A target of debate
- Are the features uncovered on the basis of translational English generalizable to other translated languages?
- Existing evidence has largely come from translational English and related European languages
- If such features are to be generalized as translation universals, the language pairs involved must not be restricted to English and closely related languages
- Cheong's (2006) study of English-Korean translation contradicts even the least controversial explicitation hypothesis
- Evidence from genetically distinct language pairs such as English and Chinese is undoubtedly more convincing, if not indispensable
24The ZCTC corpus
- Created with the explicit aim of studying the features of translated Chinese
- A translational counterpart of the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus of native Chinese (McEnery and Xiao 2004)
- www.ling.lancs.ac.uk/corplang/lcmc/
- Five hundred 2,000-word text samples taken proportionally from fifteen written text categories published in China in the 1990s
- www.ling.lancs.ac.uk/corplang/ZCTC/
25LCMC / ZCTC corpus design
26ZCTC vs. LCMC
27Corpus markup and annotation
- CES-compliant XML
- CES: www.cs.vassar.edu/CES/
- Tokenization and POS tagging
- ICTCLAS2008: www.ictclas.org
- A precision rate of 98.54% for tokenization
- Paragraph, sentence, word token
- Encoded in Unicode (UTF-8)
28Core patterns of lexical use
- Do the core patterns of lexical use that Laviosa (1998b) observes in translational English also apply in translated Chinese?
- Same criteria and parameters as in Laviosa (1998b)
- Lexical density
- Frequency profiles
- Mean sentence length
29Lexical density
- The Stubbs-style lexical density: the ratio between the number of lexical words (i.e. content words) and the total number of words (Stubbs 1986: 33; 1996: 172)
- Measure of informational load
- Adopted in Laviosa (1998b)
- Lexical density measured by TTR or Standardized TTR (STTR) (Scott 2004)
- Measure of lexical variability
- Commonly used in Corpus Linguistics
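The two measures on this slide can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function-word tag set is a hypothetical toy tagset, the input is assumed to be already tokenized and POS-tagged, and the STTR function assumes the usual approach of averaging the type-token ratio over consecutive fixed-size chunks (1,000 words by default) and ignoring a trailing partial chunk.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical function-word tags; a real tagset needs its own list.
FUNCTION_TAGS: Set[str] = {"DET", "PREP", "CONJ", "PRON", "AUX", "PART"}


def lexical_density(
    tagged_tokens: Iterable[Tuple[str, str]],
    function_tags: Set[str] = FUNCTION_TAGS,
) -> float:
    """Stubbs-style lexical density: lexical words as a percentage of all words."""
    tokens = list(tagged_tokens)
    if not tokens:
        return 0.0
    lexical = sum(1 for _, tag in tokens if tag not in function_tags)
    return 100.0 * lexical / len(tokens)


def sttr(words: Sequence[str], chunk_size: int = 1000) -> float:
    """Standardized TTR: mean type-token ratio (%) over full chunks."""
    words = list(words)
    if not words:
        return 0.0
    ratios = []
    for start in range(0, len(words) - chunk_size + 1, chunk_size):
        chunk = words[start:start + chunk_size]
        ratios.append(100.0 * len(set(chunk)) / chunk_size)
    if not ratios:
        # Text shorter than one chunk: fall back to the plain TTR.
        return 100.0 * len(set(words)) / len(words)
    return sum(ratios) / len(ratios)
```

Lexical density needs POS tags because it depends on the lexical/function distinction, whereas STTR only needs the word forms, which is why the two measures capture different things (informational load versus lexical variability).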
30Stubbs-style lexical density
- Mean LD is significantly greater in the native than in the translational corpus (66.93 vs. 61.59, t = -4.94, p < 0.001)
- In addition, the native Chinese corpus displays a greater LD score in all of the 15 genres, and the difference is significant for nearly all genres (except for M)
- Translations make more frequent use of function words
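A comparison of mean scores of this kind rests on a two-sample t-test over per-text scores. The slide does not say which variant was used, so the sketch below assumes Welch's unequal-variance test and computes only the t statistic and degrees of freedom with the standard library; the function name is hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(sample_a: Sequence[float], sample_b: Sequence[float]) -> Tuple[float, float]:
    """Return (t, degrees of freedom) for Welch's two-sample t-test.

    t is computed as mean(sample_a) - mean(sample_b) over the standard
    error, so it is negative when sample_b has the greater mean.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1)
    )
    return t, df
```

Passing the translational scores first and the native scores second yields a negative t when the native mean is greater, which would match the sign reported here if that ordering was the one used.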
31Standardized TTR
- Mean STTR is slightly greater in the native than in the translational corpus (46.58 vs. 45.73), but the difference is not significant (t = -0.573, p = 0.571)
- The differences in most genres are also marginal
- Greater STTR scores can be found in both native
(e.g. A) and translated (C) Chinese genres
32Lexical-function ratio Stubbs LD
- Mean ratio between lexical and function words is significantly greater in the native than in the translational corpus (2.08 vs. 1.64, t = -4.88, p < 0.001)
- Also, native Chinese has a greater ratio in all genres, and the differences are significant in nearly all genres (except for M)
- In line with Laviosa's (1998b) initial hypothesis that translational language has a relatively low proportion of lexical words over function words
33Frequency profiles of LCMC/ZCTC
- Laviosa's (1998b) list head, or high frequency words
- Wordlist items which individually account for at least 0.10% of the total number of tokens in a corpus
- The same criterion for high frequency words is used in this study to ensure comparability
34Frequency profiles
- The numbers of high frequency words are very similar in the two corpora
- High frequency words account for a considerably greater proportion of tokens in the translational corpus (40.47% vs. 35.70%)
- High frequency words display a much greater repetition rate in translated Chinese (3154.37 vs. 2870.37)
- Also, the ratio between high- and low-frequency words is greater in the translational corpus (0.6988 vs. 0.5659)
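A frequency profile of this kind can be sketched from a plain token list. The definitions below are one plausible reading and are assumptions rather than the study's exact procedure: a word type is high frequency if it accounts for at least a given percentage of all tokens, the repetition rate is taken as high-frequency tokens divided by high-frequency types, and the high/low ratio as high-frequency tokens divided by the remaining tokens.

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_profile(tokens: Iterable[str], threshold_pct: float = 0.10) -> Dict[str, float]:
    """Summarize the share and repetition of high frequency words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty token list")
    # Minimum raw count for a type to belong to the list head.
    cutoff = total * threshold_pct / 100.0
    high = {word: n for word, n in counts.items() if n >= cutoff}
    high_tokens = sum(high.values())
    low_tokens = total - high_tokens
    return {
        "high_types": len(high),
        "high_token_pct": 100.0 * high_tokens / total,
        "repetition_rate": high_tokens / len(high) if high else 0.0,
        "high_low_ratio": high_tokens / low_tokens if low_tokens else float("inf"),
    }
```

Running the same function with the same threshold over a native and a translational corpus gives figures that can be compared directly, which is the point of reusing one criterion across studies.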
35Mean sentence length vs. simplification
- Conflicting observations of mean sentence length as an indicator of simplification (e.g. Laviosa 1998b vs. Malmkjaer 1997)
- The native Chinese corpus (LCMC) shows a marginally greater mean sentence length, but the difference is not significant (t = -1.41, p = 0.17)
- Mean sentence length is sensitive to genre variation and may not be reliable as an indicator of simplification in translational Chinese
- (Mean sentence segment length)
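Mean sentence length and mean sentence segment length can both be computed from a tokenized text by counting words between delimiters. The sketch below is illustrative only: the delimiter sets are assumptions (sentence-final marks for sentences, plus the comma, semicolon and colon for segments), and the study's own definitions may differ.

```python
from typing import Iterable, Set

# Assumed delimiter sets for tokenized Chinese text.
SENTENCE_END: Set[str] = {"。", "！", "？"}
SEGMENT_END: Set[str] = SENTENCE_END | {"，", "；", "："}


def mean_unit_length(tokens: Iterable[str], delimiters: Set[str]) -> float:
    """Mean number of words per unit, where units end at a delimiter.

    Punctuation that is not a delimiter for the current unit type is
    skipped, so it is never counted as a word.
    """
    lengths = []
    current = 0
    for token in tokens:
        if token in delimiters:
            if current:
                lengths.append(current)
            current = 0
        elif token in SEGMENT_END:
            continue
        else:
            current += 1
    if current:
        lengths.append(current)
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Calling the function with `SENTENCE_END` gives mean sentence length, and calling it with `SEGMENT_END` gives mean segment length for the same text.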
36Lexical use in translational Chinese
- Summary
- Analysis of lexical density and frequency profiles shows that the four core patterns of lexical use in translational English are essentially also applicable in translated Chinese
- But mean sentence length is less reliable as an indicator of simplification in translational Chinese
37Explicitation: Connectives as a device?
- Perhaps the most studied topic in TU research and also the least controversial hypothesis
- Chen (2006)
- Connectives are a device for explicitation in English-Chinese translation of popular science books
- Xiao and Yue (2008)
- Connectives are significantly more frequent in translational than native Chinese fiction
- Question
- Can we generalize this finding from these specific genres to Mandarin Chinese in general?
38Conjunctions in ZCTC and LCMC
- Mean frequency of conjunctions is significantly greater in the translational than in the native corpus
- 306.42 and 243.23 instances per 10,000 tokens, LL = 723.12 for 1 d.f., p < 0.001
- In addition, genre-based distribution shows that most genres covered in the corpora display a significantly more frequent use of conjunctions in translational Chinese, in spite of some genre-based subtleties (e.g. F, J)
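The LL figure on this slide is a log-likelihood statistic for a frequency difference between two corpora. A minimal sketch of the two-corpus log-likelihood calculation commonly used in corpus comparison follows; it is an assumption that the reported value was computed with exactly this formulation, and the function names are hypothetical. With 1 d.f., a value above 10.83 corresponds to p < 0.001.

```python
import math


def per_10k(freq: int, corpus_size: int) -> float:
    """Normalize a raw frequency to instances per 10,000 tokens."""
    return 10000 * freq / corpus_size


def log_likelihood(freq1: int, size1: int, freq2: int, size2: int) -> float:
    """Log-likelihood for one item's frequency in two corpora.

    freq1/freq2 are raw counts of the item; size1/size2 are the corpus
    sizes in tokens. Expected counts assume the item is equally likely
    in both corpora.
    """
    total_freq = freq1 + freq2
    total_size = size1 + size2
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2.0 * ll
```

The statistic needs raw counts and corpus sizes rather than the normalized rates, so the per-10,000 figures alone are not enough to reproduce the reported value.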
39Conjunctions of different frequency bands
- More conjunction types of high frequency bands (0.01 or above) are used in the translational corpus
- There is an equal number of conjunction types (56 types) of the medium frequency band (0.005) in the translational and native corpora
- Beyond this balance point, the native corpus displays a greater number of conjunction types of the low frequency band (0.001 or below)
- In line with observations about high vs. low frequency words
40Conjunctions of different styles
- A closer comparison of the lists of frequent conjunctions (0.001 or above) in their respective corpora also sheds some new light on the simplification hypothesis
- There are 91 and 99 types of frequent conjunctions in the two corpora; 86 items overlap in the two lists
- Conjunctions on the translational but not the native list are all informal, colloquial and simple, and usually have more formal alternatives (e.g. ?? for ??,?? for ????)
- Conjunctions on the native but not the translational list are typically formal, literate and archaic (e.g. ????????????????????????)
- These results provide evidence for the simplification hypothesis but against the normalization hypothesis
41Conclusions
- Results based on two comparable Chinese corpora suggest that the core patterns of lexical use in translational English are generally also applicable in translated Chinese
- Beyond the lexical level
- Mean sentence length is sensitive to genre variation and may not be reliable as an indicator of simplification
- A comparison of frequent conjunctions in native and translated Chinese shows that simpler forms tend to be used in translations
- In spite of some genre-based subtleties, conjunctions are more frequently used in translational Chinese, which provides evidence in favour of the explicitation hypothesis
- Corpus Translation Studies is a promising area of research
- ??????????????????????,?????????,2012
42CUC ParaConc
43- Shedding of valedictory tears
- lancsxiaoz_at_googlemail.com
- xiaoz_at_zju.edu.cn