The Lancaster Corpus of Mandarin Chinese (LCMC): A corpus for monolingual and contrastive study

File metadata: Title: Developing Asian Language Corpora: standards and practice; Author: Zhonghua Xiao; Last modified by: Zhonghua Xiao; Created: 3/12/2004 6:50:31 PM

1
The Lancaster Corpus of Mandarin Chinese (LCMC)
A corpus for monolingual and contrastive study
  • Tony McEnery
  • Richard Xiao

2
The LCMC Corpus: Aims
  • Built for the ESRC project "Contrasting tense and
    aspect in English and Chinese" (Grant Ref.
    RES-000-220135)
  • See http://www.regard.ac.uk
  • A Chinese counterpart to FLOB (BrE) and Frown (AmE)
  • A publicly available balanced corpus of Mandarin
    Chinese
  • Distributed free of charge for use in
    non-profit-making research

3
LCMC: Profile
  • One million words
  • Standard character and Romanized Pinyin versions
  • 1990-1993 (ca. 87% of samples from 1991-1992)
  • 15 text categories
  • 500 text samples
  • Major text provider: SSReader Digital Library,
    China
  • Unicode (UTF-8)
  • XML-conformant mark-up
  • Marked for paragraphs and sentences
  • POS-tagged (precision rate 98%)
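
As a rough illustration of what XML mark-up with paragraph, sentence and POS annotation makes possible, the sketch below reads a small invented sample with Python's standard library. The element and attribute names (`file`, `p`, `s`, `w`, `POS`, `n`) and the sample text are assumptions made for this example, not a statement of the actual LCMC schema; the corpus documentation is the authority on that.

```python
import xml.etree.ElementTree as ET

# Invented sample. Element and attribute names are assumptions for
# illustration only, not the documented LCMC schema.
SAMPLE = """<file ID="X01">
  <p>
    <s n="0001"><w POS="r">我</w><w POS="v">喜欢</w><w POS="n">语言学</w><w POS="w">。</w></s>
    <s n="0002"><w POS="r">他</w><w POS="d">也</w><w POS="v">喜欢</w><w POS="w">。</w></s>
  </p>
</file>"""


def sentences(xml_text):
    """Yield (sentence id, [(token, POS tag), ...]) for each <s> element."""
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        yield s.get("n"), [(tok.text, tok.get("POS")) for tok in s.iter("w")]


parsed = list(sentences(SAMPLE))
for sent_id, tokens in parsed:
    # Print each sentence in the familiar word/tag format.
    print(sent_id, " ".join(f"{word}/{tag}" for word, tag in tokens))
```

Because the text is stored as Unicode and the annotation lives in attributes, the same loop can serve the character version or the Pinyin version without changes, provided both share the mark-up assumed here.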

4
Major Chinese corpus resources (1)
  • Sinica Corpus of Mandarin Chinese
      • 5 million words of Mandarin as used in Taiwan
      • http://www.sinica.edu.tw/SinicaCorpus
  • PH corpus
      • Ca. 2 million words of newswire text (1990-1991)
      • Available at ftp://ftp.cogsci.ed.ac.uk/pub/chinese
      • POS version available at
        http://www.ling.lancs.ac.uk/corplang/
  • PFR People's Daily Corpus
      • Newspaper text from People's Daily 1998
      • Sample (01/98) available at
        http://icl.pku.edu.cn/Introduction/corpustagging.htm
      • Searchable at http://www.ling.lancs.ac.uk/corplang/

5
Major Chinese corpus resources (2)
  • Linguistic Variation in Chinese Speech
    Communities
      • Text from newspapers and electronic media in six
        Chinese speech communities
      • http://www.livac.org/
  • Spoken Chinese Corpus of Situated Discourse
    (SCCSD)
      • See Gu, Y. 2002. Towards an understanding of
        workplace discourse. In C. Candlin (ed.), Research
        and Practice in Professional Discourse (pp.
        137-86). Hong Kong: City University of Hong Kong
        Press.
  • Three Mandarin corpora released by LDC
      • TREC, Gigaword and Callhome
      • See the LDC catalogue

6
Chinese corpora: A comparison

7
LCMC: Sampling frame

8
LCMC: Markup (XML)

9
LCMC: Annotations
  • Word segmentation
  • POS tagging
  • Applying the Peking University tagset
  • 26 Level 1 POS tags
  • 50 Level 2 POS tags
  • POS tagger (ICT Chinese Lexical Analyzing System)
  • Developed by the Institute of Computing
    Technologies, the Chinese Academy of Sciences
  • Automatic tagging with a precision rate of 97.16%
  • Post-editing improved the precision to over 98%

10
LCMC Level 1 POS tags
  • a. adjective
  • b. non-predicative adj.
  • c. conjunction
  • d. adverb
  • e. interjection
  • f. directional locality
  • g. morpheme
  • h. prefix
  • i. idiom
  • j. abbreviation
  • k. suffix
  • l. fixed expression
  • m. numeral
  • n. noun
  • o. onomatopoeia
  • p. preposition
  • q. classifier
  • r. pronoun
  • s. space word
  • t. time word
  • u. auxiliary
  • v. verb
  • w. punctuation/symbol
  • x. unclassified item
  • y. modal particle
  • z. descriptive
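
For scripted work, the 26 Level 1 tags listed above can be held in a simple lookup table. The following is a minimal sketch: the tag letters and labels come from the slide, while the helper name `tag_profile`, the "unknown tag" fallback and the sample tokens are assumptions introduced only for illustration.

```python
from collections import Counter

# Level 1 tags as listed on the slide above.
LEVEL1_TAGS = {
    "a": "adjective", "b": "non-predicative adjective", "c": "conjunction",
    "d": "adverb", "e": "interjection", "f": "directional locality",
    "g": "morpheme", "h": "prefix", "i": "idiom", "j": "abbreviation",
    "k": "suffix", "l": "fixed expression", "m": "numeral", "n": "noun",
    "o": "onomatopoeia", "p": "preposition", "q": "classifier",
    "r": "pronoun", "s": "space word", "t": "time word", "u": "auxiliary",
    "v": "verb", "w": "punctuation/symbol", "x": "unclassified item",
    "y": "modal particle", "z": "descriptive",
}


def tag_profile(tagged_tokens):
    """Count tokens per Level 1 category, keyed by the readable label.

    Tags missing from the table are counted under "unknown tag"
    (a fallback chosen for this sketch).
    """
    return Counter(LEVEL1_TAGS.get(tag, "unknown tag") for _, tag in tagged_tokens)


# Invented sample tokens as (word, tag) pairs.
tokens = [("我", "r"), ("喜欢", "v"), ("语言学", "n"), ("。", "w")]
profile = tag_profile(tokens)
print(profile)
```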

11
LCMC corpus exploration tools
  • Unicode-compliant, XML-aware corpus tools
  • WebConc: designed for use with LCMC
      • http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
  • Xaira (XML-aware Sara)
      • Sara: SGML-Aware Retrieval Application
      • Originally developed for use with the British
        National Corpus (BNC)
      • Known as Xara before beta version 1.06
      • Documentation available at
        http://www.oucs.ox.ac.uk/rts/xaira/
      • A tutorial available at the LCMC website
  • The WordSmith Tools version 4
      • Beta version available
      • http://www.lexically.net/wordsmith/version4/index.htm

12
Software demonstration
  • Using Xaira to search LCMC
      • Query types: Quick query, word query (pattern),
        POS query, pattern query (regex), Query builder
        (e.g. a-n vs. a-de-n), etc.
      • Display mode: KWIC mode vs. sentence mode
      • Display format: Plain vs. XML
      • Status bar: Reference
      • Other useful features: distribution, sort,
        collocation, partition, user-defined stylesheets,
        etc.
  • Using WebConc to access LCMC
      • http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
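
The Query builder example "a-n vs. a-de-n" contrasts an adjective directly followed by a noun with an adjective linked to a noun by the particle de. The sketch below shows the idea in plain Python over (word, tag) pairs; it is not Xaira's query syntax. It assumes that de is written 的 and carries the Level 1 tag `u` (auxiliary); the function name `find_patterns` and the sample sentence are invented for the example.

```python
def find_patterns(tagged):
    """Return (a_n, a_de_n): matched word sequences in one tagged sentence.

    a_n    : adjective immediately followed by a noun
    a_de_n : adjective + 的 (assumed tag "u") + noun
    """
    a_n, a_de_n = [], []
    for i, (word, tag) in enumerate(tagged):
        if tag != "a":
            continue
        following = tagged[i + 1:i + 3]  # up to two tokens after the adjective
        if following and following[0][1] == "n":
            a_n.append((word, following[0][0]))
        elif (len(following) == 2
              and following[0] == ("的", "u")
              and following[1][1] == "n"):
            a_de_n.append((word, "的", following[1][0]))
    return a_n, a_de_n


# Invented sample sentence as (word, tag) pairs.
sentence = [("这", "r"), ("是", "v"), ("新", "a"), ("书", "n"), ("，", "w"),
            ("漂亮", "a"), ("的", "u"), ("衣服", "n"), ("。", "w")]
a_n, a_de_n = find_patterns(sentence)
print(a_n)
print(a_de_n)
```

Matching on tags rather than on characters is what distinguishes a POS query from a plain word or regex query: the same two patterns apply to any adjective and noun in the corpus.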

13
LCMC: Potential use
  • Monolingual study
      • Studying modern Mandarin Chinese as a whole
      • Exploring variation across 15 text categories
  • Contrastive study (in conjunction with
    FLOB/Frown)
      • Contrasting Chinese and BrE/AmE
      • Contrasting text categories in Chinese and English
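
Comparing text categories, or comparing LCMC with FLOB/Frown, generally means comparing samples of different sizes, so raw counts are usually normalised first. The sketch below shows one common convention, occurrences per 10,000 words. It is offered as general corpus practice rather than a procedure taken from these slides, and the category labels and counts are invented placeholders.

```python
def per_10k(hits, corpus_size):
    """Normalise a raw frequency to occurrences per 10,000 words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 10_000 / corpus_size


# Hypothetical figures: (raw hits of some feature, words in the category).
counts = {"category A": (120, 88_000), "category B": (310, 58_000)}
normalised = {cat: round(per_10k(hits, size), 2)
              for cat, (hits, size) in counts.items()}
print(normalised)
```

With these placeholder figures, category B shows the feature at a higher rate than category A even after its smaller size is taken into account, which is the kind of contrast the raw counts alone would not establish.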

14
LCMC: Availability
  • Both the standard and Romanized versions are
    available free of charge for use in
    non-profit-making research
  • Distributed by ELRA and Oxford Text Archive
  • Searchable via WebConc on the corpus website
  • The LCMC website
      • http://www.ling.lancs.ac.uk/corplang/lcmc
  • The Chinese mirror site (the Chinese Academy of
    Social Sciences)
      • http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm

15
LCMC: Release notes and licensees
  • Release notes
      • 06/2003: Corpus mounted on the website of
        Corpus-based Language Studies and announced at
        the UCREL website
      • 08/2003: Chinese mirror site for the corpus
        established, hosted by the Chinese Academy of
        Social Sciences, Beijing
      • 12/2003: Corpus release announced at
        CORPORA-list, ELSNET-list and CLUK-list
      • 03/2004: Corpus release publicised at the 4th
        Workshop on Asian Language Resources
      • 05/2004: Corpus taken over by ELRA and the Oxford
        Text Archive
  • Number of licensees as of 08/04/2004
      • 55 academic institutions and over 40 private and
        non-academic users

16
Related publications
  • McEnery, A., Xiao, Z. & Mo, L. 2003. Aspect
    marking in English and Chinese: Using the
    Lancaster Corpus of Mandarin Chinese for
    contrastive language study. Literary and
    Linguistic Computing 18(4): 361-378.
  • Xiao, Z. & McEnery, A. 2004. A corpus-based
    two-level model of situation aspect. Journal of
    Linguistics 40(2).
  • Xiao, Z., McEnery, A., Baker, P. & Hardie, A.
    2004. Developing Asian language corpora:
    Standards and practice. Proceedings of the 4th
    Workshop on Asian Language Resources, pp. 1-8.
    March 25, 2004. Sanya, China.
  • Xiao, Z. & McEnery, A. Forthcoming. Aspect in
    Chinese. Amsterdam: Benjamins.
  • Xiao, Z. & McEnery, A. Under review. Near
    synonymy, collocation and semantic prosody: a
    cross-linguistic perspective.