Using corpora in contrastive and translation studies PowerPoint PPT Presentation


Title: Using corpora in contrastive and translation studies


1
Using corpora in contrastive and translation
studies
  • Corpus Linguistics
  • Richard Xiao
  • lancsxiaoz_at_googlemail.com

2
Aims of this session
  • Lecture
  • Corpora in contrastive and translation studies
  • Use of comparable and parallel corpora
  • Case study: Translation universals, do they
    really exist?
  • Lab session
  • CUC ParaConc and the Babel parallel corpus
  • Closing
  • Shedding of valedictory tears

3
Types of corpora: Some distinctions
  • Monolingual versus multilingual corpora
  • Parallel versus comparable corpora
  • Comparable versus comparative corpora

4
Monolingual versus multilingual corpora
  • Monolingual corpora
  • A corpus that only involves one language
  • Multilingual corpora
  • A corpus that involves texts of more than one
    language
  • A corpus covering two languages is conventionally
    known as bilingual
  • Multilingual corpora, in a narrow sense, must
    involve more than two languages
  • Multilingual and bilingual are often used
    interchangeably
  • Parallel and comparable corpora

5
Parallel versus comparable corpora
  • Terminological confusion centres around the terms
  • For some scholars (e.g. Aijmer and Altenberg
    1996; Granger 1996: 38)
  • Corpora composed of source texts in one language
    and their translations in another language (or
    other languages) are translation corpora while
    those comprising different components sampled
    from different native languages using comparable
    sampling techniques are called parallel corpora
  • For many others (e.g. Baker 1993: 248, 1995,
    1999; Barlow 1995, 2000: 110; Hunston 2002: 15;
    McEnery and Wilson 1996: 57; McEnery, Xiao and
    Tono 2006)
  • Corpora of the first type are labelled parallel
    corpora while those of the latter type are
    comparable corpora

6
Parallel versus comparable corpora
  • In classifying corpora, the criteria used must be
    consistent and logical
  • - We can say a corpus is a translation or a
    non-translation corpus if the criterion of corpus
    content is used
  • - But if we choose to define corpus types by the
    criterion of corpus form, we must use the
    criterion consistently
  • - We can say a corpus is parallel if the corpus
    contains source texts and translations in
    parallel, or it is a comparable corpus if its
    components or subcorpora are comparable by
    applying the same sampling techniques and similar
    balance and coverage
  • - It is simply inconsistent and illogical to
    refer to corpora of the first type as translation
    corpora by the criterion of content while
    referring to corpora of the latter type as
    parallel corpora by the criterion of form!

7
Multilingual vs. monolingual comparable corpora
  • A common practice in TS is to compare a corpus of
    translated texts (translational corpus) with a
    corpus consisting of comparably sampled
    non-translated texts in the same language
  • The two sub-corpora form a monolingual comparable
    corpus for translation research, as opposed to a
    multilingual comparable corpus composed of
    comparable texts for different languages for
    cross-linguistic contrast

8
Comparative corpora
  • Corpora containing different regional varieties
    of the same language are not comparable corpora
  • E.g. the International Corpus of English (ICE),
    the Brown family of corpora
  • All corpora, as a resource for linguistic
    research, have always been pre-eminently suited
    for comparative studies (Aarts 1998: ix), either
    intralingually or interlingually
  • Corpora of this kind are comparative corpora

9
Use of parallel and comparable corpora
  • Parallel and comparable corpora offer specific
    uses and possibilities for contrastive and
    translation studies (Aijmer & Altenberg 1996: 12)
  • - they give new insights into the languages
    compared, insights that are not likely to be
    gained from the study of monolingual corpora
  • - they can be used for a range of comparative
    purposes and increase our knowledge of
    language-specific, typological and cultural
    differences, as well as of universal features
  • - they illuminate differences between source
    texts and translations, and between native and
    non-native texts
  • - they can be used for a number of practical
    applications, e.g. in lexicography, language
    teaching and translation.

10
Use of parallel and comparable corpora
  • Used primarily for translation and contrastive
    studies
  • The two types of corpora have their own
    characteristics, and serve different purposes
  • Parallel corpora are useful in translation
    studies, but they alone serve as a poor basis for
    cross-linguistic contrast, because translations
    cannot avoid the effect of translationese
  • Comparable corpora are well suited for
    contrastive research, but are less useful in
    translation studies

11
Using corpora in translation studies
  • Parallel corpora
  • Useful in exploring how an idea in one language
    is conveyed in another language, thus providing
    indirect evidence to the study of translation
    processes
  • Indispensable for building statistical or
    example-based machine translation (EBMT) systems,
    and for the development of bilingual lexicons and
    translation memories
  • Parallel concordancing is a useful tool for
    translators
  • Comparable corpora
  • Useful in improving the translator's
    understanding of the subject field and improving
    the quality of translation in terms of fluency,
    correct term choice and idiomatic expressions in
    the chosen subject field
  • Can also be used to build terminology banks
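
Parallel concordancing, as mentioned above, can be illustrated with a minimal Python sketch. It assumes the corpus is already sentence-aligned and held as (source, target) pairs; the function name and the toy English-French pairs are invented for illustration and say nothing about how a tool such as ParaConc works internally.

```python
def parallel_concordance(aligned_pairs, query):
    """Yield (source, target) pairs whose source sentence contains the query."""
    query = query.lower()
    for source, target in aligned_pairs:
        if query in source.lower():
            yield source, target

# Hypothetical sentence-aligned pairs, invented for illustration only.
pairs = [
    ("The corpus is large.", "Le corpus est grand."),
    ("She translated the novel.", "Elle a traduit le roman."),
    ("A parallel corpus contains translations.",
     "Un corpus parallele contient des traductions."),
]

hits = list(parallel_concordance(pairs, "corpus"))
for source, target in hits:
    print(source, "|", target)
```

Each hit shows a source sentence next to its aligned translation, which is the basic view a translator uses to see how an expression has been rendered before.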

12
Using corpora in translation studies
  • Translational corpora
  • Provide primary evidence in product-oriented
    Translation Studies, and in studies of
    translation universals
  • If corpora of this kind are encoded with
    sociolinguistic and cultural parameters, they can
    also be used to study the sociocultural
    environment of translations (e.g. functions of
    translation in DTS)
  • Monolingual corpora (source / target language)
  • Raising the translator's linguistic and cultural
    awareness in general
  • Providing a useful and effective reference tool
    for translators
  • In combination with a parallel corpus to form a
    so-called translation evaluation corpus that
    helps translator trainers or critics to evaluate
    translations more effectively and objectively

13
Corpus-based translation studies
  • Laviosa (1998a)
  • the corpus-based approach is evolving, through
    theoretical elaboration and empirical
    realisation, into a coherent, composite and rich
    paradigm that addresses a variety of issues
    pertaining to theory, description, and the
    practice of translation.
  • Hypotheses about translation universals can be
    tested by corpus data (Baker 1993, 1995)
  • Rapid development of corpus linguistics, esp.
    multilingual corpus research in the early 1990s
  • Increasing interest in Descriptive Translation
    Studies (Toury 1995)
  • Tymoczko (1998)
  • Corpus Translation Studies is central to the way
    that Translation Studies as a discipline will
    remain vital and move forward.
  • Meta 43/4 (1998); Kenny (2001); Laviosa (2002);
    Granger et al (eds.) (2003); Olohan (2004);
    Mauranen et al (eds.) (2004); Kruger Munday
    (ed.) (2011); Hu (2011); Wang (2011); Xiao (2012)

14
The Holmes-Toury map
  • Applied Translation Studies
  • Descriptive Translation Studies
  • Theoretical Translation Studies

15
Applied Translation Studies
  • Three major contributions of corpora
  • Corpus-assisted translating
  • Bowker (1998: 631): corpus-assisted translations
    are of a higher quality with respect to subject
    field understanding, correct term choice and
    idiomatic expressions.
  • Corpus-aided translation teaching and training
  • Bernardini (1997): large corpora concordancing
    (LCC) can help students to develop awareness,
    reflectiveness and resourcefulness, which are
    said to be the skills that distinguish a
    translator from unskilled amateurs
  • Development of translation tools
  • Corpora, and especially aligned parallel corpora,
    are essential for the development of translation
    technology such as machine translation (MT)
    systems, and computer-aided translation (CAT)
    tools

16
Descriptive Translation Studies
  • Characterized by its emphasis on the study of
    translation per se
  • It seeks to answer the question of why a translator
    translates in a particular way, rather than how to
    translate
  • Baker (1993) predicted that the availability of
    large corpora of both source and translated
    texts, together with the development of the
    corpus-based approach, would enable translation
    scholars to uncover the nature of translation as
    a mediated communicative event

17
Descriptive Translation Studies
  • Three focuses (Holmes 1972/1988)
  • Translation as a product
  • Concerned with describing translation as a
    product by comparing corpora of translated and
    non-translational native texts in the target
    language
  • Attempting to uncover evidence to support or
    reject the so-called translation universal
    hypotheses
  • Translation as a process
  • Aims at revealing the thought processes that take
    place in the mind of the translator while she or
    he is translating
  • One possible way for corpus-based DTS is to
    investigate off-line the written transcripts of
    translators' recorded verbalizations, which are
    known as Think-Aloud Protocols (or TAPs)
  • Translation as product providing indirect
    evidence to translation as process
  • The function of translation
  • The study of contexts rather than texts: the
    function or impact of a translation
  • Relatively few function-oriented studies that are
    corpus-based

18
Theoretical Translation Studies
  • Aims to establish general principles by means of
    which these phenomena can be explained and
    predicted (Holmes 1988: 71)
  • Closely related to, and often reliant on the
    empirical findings produced by Descriptive
    Translation Studies
  • One good battleground of using DTS findings to
    pursue general theory of translation is the
    hypothesis of so-called translation universals
    (TUs) and its related sub-hypotheses
  • Sometimes referred to as the inherent features of
    translational language, or translationese

19
TU: A focus of CBTS
  • An important area of corpus-based TS over the
    past decade
  • Baker (1993, 1996); Chesterman (2004); Kenny
    (1998, 1999, 2000, 2001); Laviosa (1998b);
    Mauranen & Kujamäki (2004); McEnery & Xiao (2002,
    2007); Olohan (2004); Olohan & Baker (2000);
    Øverås (1998); Pym (2005); Xiao and Yue (2008);
    Xiao (2010); Xiao & Dai (2010); Xiao (2010, 2011,
    2012)
  • The Translational English Corpus (TEC)
  • Manual
  • http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
  • Software
  • http://ronaldo.cs.tcd.ie/tec2/jnlp/

20
Features of translated English
  • Laviosa (1998b): Four core patterns of lexical
    use in translational English
  • - A relatively low proportion of lexical words
    over function words
  • - A relatively high proportion of high-frequency
    words over low-frequency words
  • - A relatively great repetition of the most
    frequent words
  • - Less variety in most frequently used words

21
Features of translated English
  • Beyond the lexical level
  • Simplification: tendency to simplify the
    language used in translation (Baker 1996:
    181-182)
  • simpler language than target native language
    lexically / syntactically / stylistically
  • Normalization: tendency to exaggerate features
    of the target language and to conform to its
    typical patterns (Baker 1996: 183)
  • more normal than the target native language
  • Explicitation: tendency in translations to spell
    things out rather than leave them implicit
    (Baker 1996: 180)
  • more frequent use of conjunctions, and increased
    cohesion in translated text
  • Sanitization: translated texts are somewhat
    sanitized versions of the original (Kenny
    1998: 515)
  • Lost or reduced connotational meaning in
    translation
  • TU hypotheses

22
TU: A target of debate
  • Is translational language different from target
    native language?
  • Translational language is at best an
    unrepresentative special variant of the target
    language because translations cannot possibly
    avoid the effect of translationese
  • e.g. Baker 1993; Gellerstam 1996; Hartmann 1985;
    Laviosa 1997; McEnery & Wilson 2001; McEnery &
    Xiao (2002, 2007); Teubert 1996

23
TU: A target of debate
  • Are the features uncovered on the basis of
    translational English generalizable to other
    translated languages?
  • Existing evidence has largely come from
    translational English and related European
    languages
  • If such features are to be generalized as
    translation universals, the language pairs
    involved must not be restricted to English and
    closely related languages
  • Cheong's (2006) study of English-Korean
    translation contradicts even the least
    controversial explicitation hypothesis
  • Evidence from genetically distinct language
    pairs such as English and Chinese is undoubtedly
    more convincing, if not indispensable

24
The ZCTC corpus
  • Created with the explicit aim of studying the
    features of translated Chinese
  • A translational counterpart of the Lancaster
    Corpus of Mandarin Chinese (LCMC), a
    one-million-word balanced corpus of native
    Chinese (McEnery & Xiao 2004)
  • www.ling.lancs.ac.uk/corplang/lcmc/
  • Five hundred 2,000-word text samples taken
    proportionally from fifteen written text
    categories published in China in the 1990s
  • www.ling.lancs.ac.uk/corplang/ZCTC/

25
LCMC / ZCTC corpus design

26
ZCTC vs. LCMC

27
Corpus markup and annotation
  • CES-compliant XML
  • CES: www.cs.vassar.edu/CES/
  • Tokenization and POS tagging
  • ICTCLAS2008: www.ictclas.org
  • A precision rate of 98.54% for tokenization
  • Paragraph, sentence, word token
  • Encoded in Unicode (UTF-8)

28
Core patterns of lexical use
  • Do the core patterns of lexical use Laviosa
    (1998b) observes in translational English also
    apply in translated Chinese?
  • Same criteria and parameters as in Laviosa
    (1998b)
  • Lexical density
  • Frequency profiles
  • Mean sentence length

29
Lexical density
  • Stubbs-style lexical density: the ratio
    between the number of lexical words (i.e. content
    words) and the total number of words (Stubbs
    1986: 33; 1996: 172)
  • Measure of informational load
  • Adopted in Laviosa (1998b)
  • Lexical density measured by TTR or Standardized
    TTR (STTR) (Scott 2004)
  • Measure of lexical variability
  • Commonly used in Corpus Linguistics
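
The Stubbs-style measure defined above can be sketched in a few lines of Python. This is a minimal illustration, assuming each text is already tokenized and POS-tagged as (word, tag) pairs. The set of tag prefixes counted as lexical and the toy sentence are assumptions made for the example; the slides do not list the tags actually treated as content words.

```python
# Tag prefixes treated as lexical (content) words: noun, verb, adjective, adverb.
# This set is an assumption for illustration; adapt it to the corpus tagset.
LEXICAL_TAG_PREFIXES = ("n", "v", "a", "d")

def lexical_density(tagged_tokens, lexical_prefixes=LEXICAL_TAG_PREFIXES):
    """Return Stubbs-style lexical density as a percentage of all tokens."""
    total = len(tagged_tokens)
    if total == 0:
        raise ValueError("cannot compute lexical density of an empty text")
    lexical = sum(1 for _, tag in tagged_tokens if tag.startswith(lexical_prefixes))
    return 100.0 * lexical / total

# Hypothetical hand-tagged toy sentence (tags are simplified, not real output).
sample = [("the", "r"), ("cat", "n"), ("sat", "v"),
          ("on", "p"), ("the", "r"), ("mat", "n")]
print(lexical_density(sample))
```

In the toy sentence three of the six tokens carry lexical tags, so half of the text counts as content words.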

30
Stubbs-style lexical density
  • Mean LD is significantly greater in native than
    translational corpus (66.93% vs. 61.59%, t =
    -4.94, p < 0.001)
  • In addition, the native Chinese corpus displays a
    greater LD score in all of the 15 genres and
    significant for nearly all genres (except for M)
  • Translations make more frequent use of function
    words
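
The slide reports a t value without naming the test. As an assumption, the sketch below uses Student's two-sample t-test with pooled variance, applied to per-genre (or per-text) scores from the two corpora. The function name and the toy scores are hypothetical and are not values from LCMC or ZCTC.

```python
import math
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Student's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df

# Hypothetical toy scores, invented for illustration only.
translated = [60.0, 62.0, 64.0]
native = [65.0, 67.0, 69.0]
t, df = pooled_t(translated, native)
print(round(t, 3), df)
```

With the translated sample passed first, a negative t means the translated scores are lower on average than the native ones.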

31
Standardized TTR
  • Mean STTR is slightly greater in native than
    translation corpus (46.58 vs. 45.73), but the
    difference is not significant (t = -0.573, p = 0.571)
  • The differences in most genres are also marginal
  • Greater STTR scores can be found in both native
    (e.g. A) and translated (C) Chinese genres
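
For reference, a minimal sketch of how a standardized type-token ratio is usually computed: the TTR is taken for each consecutive chunk of a fixed number of tokens and the chunk values are averaged, with any incomplete final chunk ignored. The chunk size of 1,000 is an assumption (a common default in WordSmith Tools); the slides do not state the setting used.

```python
def sttr(tokens, chunk_size=1000):
    """Mean type-token ratio (as a percentage) over consecutive full chunks."""
    ratios = []
    # Step through the text in full chunks; a trailing partial chunk is skipped.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(100.0 * len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

# Toy token list with a tiny chunk size so the arithmetic is easy to follow.
tokens = "a b a c a b d d e f".split()
print(sttr(tokens, chunk_size=5))
```

Standardizing by chunk makes texts of different lengths comparable, which a plain TTR does not, since TTR falls as a text gets longer.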

32
Lexical-function ratio Stubbs LD
  • Mean ratio between lexical and function words is
    significantly greater in native than
    translational corpus (2.08 vs. 1.64, t = -4.88,
    p < 0.001)
  • Also, native Chinese has a greater ratio in all
    genres, and the differences are significant in
    nearly all genres (except for M)
  • In line with Laviosa's (1998b) initial hypothesis
    that translational language has a relatively low
    proportion of lexical words over function words

33
Frequency profiles of LCMC/ZCTC
  • Laviosa's (1998b) list head, or high frequency
    words
  • Wordlist items which individually account for at
    least 0.10% of the total number of tokens in a
    corpus
  • The same criterion for high frequency words in
    this study to ensure comparability

34
Frequency profiles
  • The numbers of high frequency words are very
    similar in the two corpora
  • High frequency words account for a considerably
    greater proportion of tokens in the translational
    corpus (40.47% vs. 35.70%)
  • High frequency words display a much greater
    repetition rate in translated Chinese (3154.37
    vs. 2870.37)
  • Also the ratio between high- and low-frequency
    words is greater in the translational corpus
    (0.6988 vs. 0.5659)
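
A frequency profile of this kind can be sketched as follows. The list head criterion (items covering at least 0.10% of tokens) comes from the slides. The formulas for repetition rate and for the high/low ratio are not given there, so the definitions in the code are assumptions, as are the function name and the toy data.

```python
from collections import Counter

def frequency_profile(tokens, threshold=0.001):
    """Summarize high-frequency ("list head") words in a token list.

    threshold is a proportion of all tokens: 0.001 corresponds to 0.10%.
    """
    if not tokens:
        raise ValueError("empty token list")
    counts = Counter(tokens)
    total = len(tokens)
    head = {word: n for word, n in counts.items() if n / total >= threshold}
    head_tokens = sum(head.values())
    low_tokens = total - head_tokens
    return {
        "head_types": len(head),
        "head_coverage_pct": 100.0 * head_tokens / total,
        # Assumed definition: mean occurrences per high-frequency type.
        "repetition_rate": head_tokens / len(head) if head else 0.0,
        # Assumed definition: high-frequency tokens per low-frequency token.
        "high_low_ratio": head_tokens / low_tokens if low_tokens else float("inf"),
    }

# Toy data with an inflated threshold so the list head is visible by eye.
toy = ["de"] * 6 + ["shi"] * 2 + ["mao", "gou"]
profile = frequency_profile(toy, threshold=0.2)
print(profile)
```

On a real corpus the default threshold of 0.001 applies the 0.10% criterion directly, so the same function can be run on both corpora and the four figures compared.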

35
Mean sentence length vs. simplification
  • Conflicting observations of mean sentence length
    as an indicator of simplification (e.g. Laviosa
    1998b vs. Malmkjaer 1997)
  • The native Chinese corpus (LCMC) shows a
    marginally greater mean sentence length, but the
    difference is not significant (t = -1.41, p = 0.17)
  • Mean sentence length is sensitive to genre
    variation and may not be reliable as an indicator
    of simplification in translational Chinese
  • (Mean sentence segment length)
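
Mean sentence length and mean sentence segment length can both be computed from a token stream by changing the set of delimiters. The sketch below is an illustration only: the delimiter sets, the decision not to count punctuation as words, and the toy tokens are all assumptions, not the settings used in the study.

```python
# Assumed delimiter sets (ASCII and full-width forms).
SENTENCE_END = {".", "!", "?", "。", "！", "？"}
SEGMENT_END = SENTENCE_END | {",", ";", ":", "，", "；", "："}

def mean_unit_length(tokens, delimiters, skip=SEGMENT_END):
    """Mean number of word tokens per unit, where units end at a delimiter.

    Tokens in `skip` that are not delimiters are ignored, not counted as words.
    """
    lengths, current = [], 0
    for token in tokens:
        if token in delimiters:
            if current:
                lengths.append(current)
            current = 0
        elif token not in skip:
            current += 1
    if current:  # trailing unit without closing punctuation
        lengths.append(current)
    if not lengths:
        raise ValueError("no units found")
    return sum(lengths) / len(lengths)

# Hypothetical pre-tokenized toy text.
tokens = ["I", "like", "cats", ",", "he", "likes", "dogs", ".", "and", "you", "?"]
sentence_len = mean_unit_length(tokens, SENTENCE_END)
segment_len = mean_unit_length(tokens, SEGMENT_END)
print(sentence_len, segment_len)
```

Passing the wider delimiter set gives the segment-based measure, which is less affected by long comma-chained sentences.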

36
Lexical use in translational Chinese
  • Summary
  • - Analysis of lexical density and frequency
    profiles shows that the four core patterns of
    lexical use in translational English are
    essentially also applicable in translated Chinese
  • - But mean sentence length is less reliable as
    an indicator of simplification in translational
    Chinese

37
Explicitation: Connectives as a device?
  • Perhaps the most studied topic in TU research and
    also the least controversial hypothesis
  • Chen (2006)
  • Connectives are a device for explicitation in
    English-Chinese translation of popular science
    books
  • Xiao and Yue (2008)
  • Connectives are significantly more frequent in
    translational than native Chinese fiction
  • Question
  • Can we generalize this finding from these
    specific genres to Mandarin Chinese in general?

38
Conjunctions in ZCTC and LCMC
  • Mean frequency of conjunctions is significantly
    greater in translational than native corpus
  • 306.42 and 243.23 instances per 10,000 tokens,
    LL = 723.12 for 1 d.f., p < 0.001
  • In addition, genre-based distribution shows that
    most genres covered in the corpora display a
    significantly more frequent use of conjunctions
    in translational Chinese in spite of some
    genre-based subtleties (e.g. F, J)
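
The LL score above is a log-likelihood ratio with one degree of freedom. A minimal sketch of the two-corpus form commonly used in corpus linguistics follows; whether the study used exactly this variant is an assumption. The corpus sizes below are hypothetical round numbers chosen to match the reported rates, so the result will not reproduce the slide's figure exactly.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio for one item's frequency in two corpora (1 d.f.)."""
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected frequencies if the item were equally likely in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical counts: 306.42 and 243.23 per 10,000 tokens in corpora of
# exactly 1,000,000 tokens each (the real token counts are not given here).
ll = log_likelihood(30642, 1_000_000, 24323, 1_000_000)
CRITICAL_P001 = 10.83  # chi-square critical value for p < 0.001 at 1 d.f.
print(ll > CRITICAL_P001)
```

A score above 10.83 corresponds to p < 0.001 at one degree of freedom, which is the threshold the slide reports as exceeded.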

39
Conjunctions of different frequency bands
  • More conjunction types of high frequency bands
    (0.01 or above) are used in translational corpus
  • There are an equal number of conjunction types
    (56 types) of medium frequency band (0.005) in
    translational and native corpora
  • Beyond this balance point, the native corpus
    displays a greater number of conjunction types of
    low frequency band (0.001 or below)
  • In line with observations about high vs. low
    frequency words

40
Conjunctions of different styles
  • A closer comparison of the lists of frequent
    conjunctions (0.001 or above) in their
    respective corpus also sheds some new light on
    the simplification hypothesis
  • There are 91 and 99 types of frequent
    conjunctions in the two corpora; 86 items
    overlap in the two lists
  • Conjunctions on the translational but not native
    list are all informal, colloquial, and simple,
    which usually have more formal alternatives (e.g.
    ?? for ??,?? for ????)
  • Conjunctions on the native but not translation
    list are typically formal, literate and archaic
    (e.g. ????????????????????????)
  • These results provide evidence for the
    simplification hypothesis but against the
    normalization hypothesis

41
Conclusions
  • Results based on two comparable Chinese corpora
    suggest that the core patterns of lexical use in
    translational English are generally also
    applicable in translated Chinese
  • Beyond the lexical level
  • Mean sentence length is sensitive to genre
    variation and may not be reliable as an indicator
    of simplification
  • A comparison of frequent conjunctions in native
    and translated Chinese shows that simpler forms
    tend to be used in translations
  • In spite of some genre-based subtleties,
    conjunctions are more frequently used in
    translational Chinese, which provides evidence in
    favour of the explicitation hypothesis
  • Corpus Translation Studies is a promising area of
    research
  • ??????????????????????,?????????,2012

42
CUC ParaConc
  • Software demo

43
  • Shedding of valedictory tears
  • lancsxiaoz_at_googlemail.com
  • xiaoz_at_zju.edu.cn