Using corpora to study Classifiers in Mandarin Chinese - PowerPoint PPT Presentation

About This Presentation
Title:

Using corpora to study Classifiers in Mandarin Chinese

Description:

Title: Richard presentation.ppt Author: Xiao, Zhonghua Last modified by: Richard Xiao Created Date: 5/27/2006 5:18:45 PM Subject: COST A31 WG1 Meeting presentation files – PowerPoint PPT presentation

Slides: 35
Provided by: Xiao132

Transcript and Presenter's Notes


1
Using corpora to study Classifiers in Mandarin Chinese
  • Richard Xiao
  • z.xiao_at_lancaster.ac.uk

2
Chinese corpus linguistics
  • In relation to English, Chinese has a much
    shorter history of using corpora
  • Sinica Balanced Corpus of Chinese
  • The first annotated corpus of Mandarin
  • Freely accessible online since the mid-1990s
  • Rapid progress over the last decade
  • Corpus building and exploration technology
  • Publicly available corpus resources

3
Chinese text processing
  • Computational processing of Chinese text is more
    complex than English
  • Chinese text is encoded in double-byte native
    encodings
  • Potential confusion of bytes in running text
  • GB2312 for SC and Big5 for TC
  • The advent of Unicode has facilitated Chinese
    computing
  • But most existing data and tools are based on
    native encoding
  • Word tokenization is an essential first step in
    serious Chinese computing
  • Defining legitimate words in running text
  • Involving dictionary matching and the use of
    statistical models
  • Part-of-speech tagging depends on the results of
    tokenization
  • Accuracy of tokenization: 98%
  • Accuracy of POS tagging: 96%
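
The dictionary-matching side of word tokenization can be sketched as greedy forward maximum matching. This is only an illustrative baseline, not the tokenizer used for the corpora described here: the function name, the `max_len` limit and the toy lexicon are all assumptions made for the example, and the statistical models mentioned above are not covered.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, preferring the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit;
        # a single character is always accepted as the fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, invented for illustration only.
LEXICON = {"中文", "语料库", "研究", "三", "本", "书"}

print(forward_max_match("中文语料库研究", LEXICON))
print(forward_max_match("三本书", LEXICON))
```

A greedy matcher of this kind cannot resolve genuinely ambiguous strings, which is one reason dictionary matching is usually combined with statistical models.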

4
Concordancers for Chinese
  • Many concordancers designed for English do not
    work well with Chinese data
  • There are presently three types of tools for
    Chinese
  • Unicode-based tools
  • WordSmith version 4 (commercial product)
  • Xaira (open source freeware)
  • Concordancers dependent on language support packs
    (or, in Windows XP, on the default language for
    non-Unicode programs being set to Chinese)
  • AntConc (freeware)
  • ConcApp (freeware)
  • MonoConc Pro (commercial product)
  • Concordance (shareware)
  • Web-based query systems bundled with specific
    online corpora
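
Why byte-oriented tools built for English can stumble over natively encoded Chinese can be shown with a small sketch. It uses only Python's built-in codecs; the two-character sample string is an assumption chosen for the example.

```python
text = "中文"

gb = text.encode("gb2312")    # native encoding for Simplified Chinese
big5 = text.encode("big5")    # native encoding for Traditional Chinese
utf8 = text.encode("utf-8")   # a Unicode encoding

# Two bytes per character in the native encodings, three in UTF-8 here.
sizes = (len(gb), len(big5), len(utf8))

# Bytes 1-2 straddle the boundary between the two characters, yet they
# decode as a valid GB2312 character that does not occur in the text.
ghost = gb[1:3].decode("gb2312")
false_hit = ghost.encode("gb2312") in gb   # a byte-level search "finds" it
true_hit = ghost in text                   # a character-level search does not

# The same bytes read with the wrong codec yield different text.
misread = gb.decode("big5", errors="replace")

print(sizes, false_hit, true_hit, misread != text)
```

Searching decoded text rather than raw bytes avoids the false hit, which is what Unicode-based tools do by design.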

5
Chinese corpus resources
  • Sinica Balanced Corpus
  • http://www.sinica.edu.tw/SinicaCorpus/
  • Sinica Tagged Corpus of Early Mandarin
  • http://www.sinica.edu.tw/Early_Mandarin/
  • Modern Chinese Language Corpus
  • http://219.238.40.213:8080/CpsQrySv.srf
  • PKU-CCL Chinese Corpus
  • http://ccl.pku.edu.cn/YuLiao_Contents.Asp
  • BLCU Modern Chinese Corpus
  • http://202.112.195.8:8089/ccir_login?input
  • Chinese Internet Corpus
  • http://corpus.leeds.ac.uk/query-zh.html
  • Lancaster Corpus of Mandarin Chinese
  • http://www.ling.lancs.ac.uk/corplang/lcmc/
  • Lancaster Los Angeles Spoken Chinese Corpus
  • http://www.ling.lancs.ac.uk/corplang/llscc/
  • More details of more corpora in more languages
    are on the handout

6
Lancaster Corpus of Mandarin Chinese (LCMC)
  • Designed as a Chinese match for FLOB and Frown
  • Representing written Mandarin as used in mainland
    China in the early 1990s
  • A balanced corpus of one million words in 500
    samples proportionally taken from 15 text
    categories
  • Marked up in XML and encoded in Unicode
  • Tokenized and POS tagged
  • Freely searchable online
  • http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
  • Released by ELRA and OTA free of charge for
    academic and educational purposes
  • An indexed version for use with Xaira is
    available
  • V1.2 incorporates validated details of classifier
    use
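
As a rough illustration of what XML markup with tokenization and POS tagging makes possible, the sketch below extracts word/tag pairs from one sentence using Python's standard library. The element name `w`, the attribute name `POS`, the tags `m` and `n`, and the sample sentence are assumptions made for this example and may not match the released corpus files; only the classifier tag `q` is taken from the annotation scheme in this presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence markup; real corpus files may use other names.
SAMPLE = (
    '<s n="0001">'
    '<w POS="m">三</w><w POS="q">本</w><w POS="n">书</w>'
    '</s>'
)


def word_tag_pairs(xml_text):
    """Return (word, tag) pairs for every <w> element in the fragment."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("POS", "")) for w in root.iter("w")]


def classifiers(pairs):
    """Keep the words whose tag starts with q, the classifier tag family."""
    return [word for word, tag in pairs if tag.startswith("q")]


pairs = word_tag_pairs(SAMPLE)
print(pairs)
print(classifiers(pairs))
```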

7
Lancaster Los Angeles Spoken Chinese Corpus
(LLSCC)
  • One million words of spoken Mandarin
  • Both dialogues (55%) and monologues (45%)
  • Both spontaneous (57%) and scripted (43%) speech
  • Seven spoken registers
  • face-to-face conversation, telephone
    conversation, play/movie scripts, TV talk show
    transcripts, formal debates, spontaneous oral
    narrative, edited oral narrative
  • Marked up in XML and encoded in Unicode
  • Tokenised and POS tagged
  • The Telephone Conversation part is tagged with
    details of classifier use
  • The unannotated version of this part is available
    from the LDC as CallHome Mandarin Transcripts
  • More information
  • http://www.ling.lancs.ac.uk/corplang/llscc/

8
Annotation scheme for classifiers (q)
Tag   Gloss
qu    Unit classifier
ql    Collective classifier
qa    Arrangement classifier
qc    Container classifier
qm    Standard measure
qs    Species classifier
qt    Temporal classifier
qv    Verbal classifier

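
The tagset above can be held as a simple lookup table and used to profile classifier types in tagged text. This is a minimal sketch: the function name, the pinyin sample tokens and the non-classifier tags `m` and `n` are assumptions made for illustration; only the eight q-tags and their glosses come from the scheme.

```python
from collections import Counter

CLASSIFIER_TAGS = {
    "qu": "Unit classifier",
    "ql": "Collective classifier",
    "qa": "Arrangement classifier",
    "qc": "Container classifier",
    "qm": "Standard measure",
    "qs": "Species classifier",
    "qt": "Temporal classifier",
    "qv": "Verbal classifier",
}


def classifier_profile(tagged_tokens):
    """Count classifier tokens per type, ignoring all other tags."""
    counts = Counter(
        tag for _, tag in tagged_tokens if tag in CLASSIFIER_TAGS
    )
    return {CLASSIFIER_TAGS[tag]: n for tag, n in counts.items()}


# Invented sample; tags m (numeral) and n (noun) are assumed here.
sample = [
    ("san", "m"), ("ben", "qu"), ("shu", "n"),
    ("yi", "m"), ("ci", "qv"),
    ("liang", "m"), ("ge", "qu"), ("ren", "n"),
]
profile = classifier_profile(sample)
print(profile)
```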
9
Why classifiers are necessary (1)
  • Grammatically mandatory
  • san ben shu        *san shu
  • three CL book      three book
  • 'three books'      (intended: 'three books')
  • Distinguishing between word senses
  • yi tiao xian       yi gen xian
  • one CL line        one CL thread
  • 'a line'           'a thread'

10
Why classifiers are necessary (2)
  • Resolving syntactic ambiguity
  • Example A)
  • Ho laozong gei-le ta yi-ba shouqiang
  • Ho general give-Asp him one-CL pistol
  • General Ho gave him a pistol.
  • Example B)
  • Ho laozong gei-le ta yi shouqiang
  • Ho general give-Asp him one pistol (CL)
  • General Ho shot him once with a pistol.

11
Use and name of classifiers
  • The use of classifiers dates back more than
    3,300 years
  • Oracle bone inscriptions excavated from the Yin
    Ruins (1300-1100 B.C.)
  • Classifiers became established as a separate word
    class in Chinese only in the 1950s
  • Ding et al. (1952), A Talk on Grammar in Modern
    Chinese
  • Different terms had been used for classifiers
  • But mainly treated as a subclass of nouns

12
Syntactic features of classifiers
  • Classifiers were the last of the 11 word classes
    in Chinese to be recognised, because they
    cannot be used independently as sentential
    constituents
  • Typically following a numeral or a demonstrative
    pronoun: zhe (这) 'this', na (那) 'that', or na (哪)
    'which'
  • Monosyllabic classifiers can be reduplicated to
    function as different sentential constituents,
    expressing a general grammatical meaning with
    different situational variants (Guo 1999)
  • Co-existence or repetition of entities or events
  • All around, many, one by one, continuous

13
Levels of grammaticalization
  • Specialised classifiers
  • Fully grammaticalized
  • Functioning as classifiers only
  • Bleaching of lexical meaning, difficult to find
    translation equivalents in a non-classifier
    language
  • E.g. (n) ?,?,?,?,?,?,?,? (v) ?,?,?,?,?,?,?,?,?,?
  • Concurrent classifiers
  • Mainly derived from nouns and verbs
  • Can be used as nouns/verbs and classifiers
  • The classifier use is semantically related to the
    lexical meaning of the original noun/verb
  • E.g. ?,?,??,? ?,?,?,?,?
  • Temporary borrowings
  • Mainly borrowed from nouns, verbs, and adjectives
  • Functioning as classifiers only on an ad hoc
    basis
  • Full lexical meaning
  • E.g. ? (face),?? (house) ? (knife),? (gun),?
    (foot),? (fist)

14
Semantic types of classifiers (1)
  • Nominal classifiers (6 types): quantifying nouns
  • Unit classifiers
  • Count individual entities
  • E.g. ? (63.5% of unit classifiers, 38.8% of all
    classifiers),?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?
  • Collective classifiers
  • Provide a collective reference for separate
    entities
  • E.g. ? 'set', ? 'batch', ? 'pair', ?? 'series', ?
    'pair', ? 'group', ? 'generation', ? 'group', ?
    'pair', ? 'team'
  • Arrangement classifiers
  • Also refer to a collection, but focus on
    constellation aspect (shape), i.e. how entities
    are arranged or grouped together
  • E.g. ? 'layer', ? 'pile', ? 'ball', ? 'pad', ?
    'string', ? 'thread', ? 'row', ? 'handful', ?
    'drop', ? 'bunch', ? 'thread', ? 'row'

15
Semantic types of classifiers (2)
  • Nominal classifiers: quantifying nouns
  • Standard measure classifiers
  • Express exact measures of various kinds, in local
    or international units
  • E.g. ?,?,?,?,?,??,?,??,?,?,???,?,??,??,?,?,?,?,?
  • Container classifiers
  • Denote types of containers, which are borrowed
    temporarily to provide an inexact measure of mass
    or entities usually associated with such
    containers
  • E.g. ?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?
    ,??
  • Special container classifiers can only take yi
    'one', read as 'full'; more descriptive than
    quantifying
  • Species classifiers
  • Denote the type of entities grouped together
  • E.g. ? (kind, over 90%), ? (sort), ? (grade), ?
    (type), ? (grade), ? (class)

16
Semantic types of classifiers (3)
  • Verbal classifiers: quantifying verbs
  • 9 specialised verbal classifiers
  • E.g. ?(times, 40.8% of all verbal
    classifiers),?(stroke),?(course of action),?(once
    over),?(step of action),?(return
    journey),?(times),?(once through),?(criticising,
    abusing)
  • Borrowed verbal classifiers
  • An open set, mostly nouns denoting tools and
    related items
  • E.g. ?,?,?,?,?,?,??,?,?
  • Temporal classifiers: measuring time
  • Exact measures
  • ?,?,?,??,??,?,?,??,?,??,?,?,??,??,?,?,?,??,?
  • Inexact measures
  • E.g. ??,?,??,??,?,?,??

17
Classifiers in writing and speech
  • Unit classifiers are by far the most common type
    in both speech and writing
  • Because of the weight of the generalised
    classifier ge, unit classifiers are particularly
    frequent in speech
  • Other common types: temporal, verbal
  • Infrequent types: container, arrangement,
    collective

18
Variation across genres
  • Apart from the speech-writing difference, various
    genres also differ in classifier use
  • Most frequent in news reportage (A), humour (R),
    and speech (S): over 3,000 per 100,000 words
  • Least common in news review (B), news editorial
    (C), religious writing (D), and academic prose
    (J): below 2,000 per 100,000 words
  • Generally more common in imaginative (K-R)
    writing and speech (S) than in informative
    writing (A-J)
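
Comparing genres of different sizes relies on normalised rather than raw frequencies. A minimal sketch of the per-100,000 normalisation follows; the function name and the counts in the example are hypothetical and are not figures from the corpus.

```python
def per_100k(raw_count, corpus_tokens):
    """Normalise a raw count to occurrences per 100,000 tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return raw_count * 100_000 / corpus_tokens


# Hypothetical genre: 1,320 classifiers in a 44,000-token sample.
rate = per_100k(1_320, 44_000)
print(rate)
```

With these invented numbers the genre would sit exactly at the 3,000 per 100,000 mark.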

19
Distribution of classifier types
  • Distribution of different types of classifiers
    also varies across genres
  • Unit classifier is the most common type in all
    genres (2/3 of all classifiers)
  • Container, arrangement, and collective
    classifiers are relatively rare in all genres
  • Standard measure classifiers are most frequent in
    news reportage (A) and official documents (H)
  • Species classifiers are more common in
    informative than imaginative writing

20
Cognitive basis of classifier use
  • Allan (1977): number of dimensions
  • Adams and Conklin (1973): elasticity, hardness,
    discreteness
  • Shi (2001): ratio between different dimensions,
    and materiality
  • Dimensions and use of classifiers
  • 0-D point, e.g. yi dian (点) mo 'a point of ink'
  • 1-D line, e.g. yi xian (线) xiwang 'a thread of
    hope'
  • 2-D area (Y being the longer dimension)
  • Y/X approaches 1: zhang (张), e.g. yi zhang
    zhaopian 'a photo'
  • Y/X approaches 0: tiao (条), e.g. yi tiao malu
    'a road'
  • 3-D block (Q = Y/X)
  • Z/Q approaches 0: pian (片), e.g. yi pian shuye
    'a leaf'
  • Z/Q approaches 1: kuai (块), e.g. yi kuai tang
    'a lump of sugar'
  • Z/Q sufficiently large: gen (根), e.g. yi gen
    dianxian 'a cable'
  • While the use of nominal classifiers is closely
    associated with shape, this is not the only
    criterion: nouns and classifiers co-select each
    other
  • Five co-selection criteria

21
Co-selection by similarity
  • Classifiers are closely related to shapes which
    are historically associated with the nouns that
    have given rise to these classifiers, e.g. tiao
    (条)
  • tiao 'small branch/twig' > long, narrow,
    flexible: jie (街) 'street', tui (腿) 'leg', lu
    (路) 'road', xian (线) 'line, thread', he (河)
    'river', yu (鱼) 'fish', etc.
  • tiao 'bamboo slips for writing' > guiding (规定)
    'regulation', jianyi (建议) 'suggestion', falu
    (法律) 'law', xinwen (新闻) 'news', etc.
  • kuai (块) 'soil, lump/block' > something of a
    lumpy/blocky shape, e.g. a wrist watch
  • kuai 'territory, soil' > something with a
    boundary, e.g. a scar

22
Co-selection by metonymy
  • The original lexical meanings of classifiers
    refer to the most salient features of the
    entities being classified, e.g.
  • kou (口) 'mouth' (for pigs), tou (头) 'head' (for
    cattle), wei (尾) 'tail' (for fish), ding (顶)
    'top' (for hats, sedan chairs, etc.)
  • BUT long-term linguistic conventions are always
    important in language use
  • tou is not used for rabbits or cats
  • wei is not used for peacocks or squirrels

23
Co-selection by relatedness
  • The original lexical meanings of classifiers
    refer to actions closely related to entities
    being classified, e.g.
  • bao (包) 'wrap' -> 'pack' (the result of packing)
  • chuan (串) 'string together' -> 'string, bunch'
  • kun (捆) 'tie up, fasten' -> 'bundle'
  • peng (捧) 'hold in both hands' -> 'a double handful'

24
Co-selection by association
  • The original lexical meanings of classifiers
    refer to tools, containers, and places, etc
    closely associated with the entities being
    classified, e.g.
  • dao (刀) 'knife' -> 'a cut of (meat)'
  • wan (碗) 'bowl' -> 'a bowl of (rice)'
  • chuang (床) 'bed' -> 'a bed of (quilt/sheet etc.)'
  • mu (幕) 'curtain' -> 'an act of (a play)'

25
Co-selection by conventions
  • Sometimes, co-selection has to be interpreted by
    following linguistic conventions because it is
    not always possible to track the
    grammaticalization path of a classifier to
    ascertain the relationship between its original
    lexical meaning and the entities being
    classified
  • In what way is tiao historically related to
    renming 'human life'?
  • Why is tou used for pigs and cattle but not
    rabbits or cats?
  • Why is wei used for fish but not for peacocks or
    squirrels, even though their tails are as
    salient as, if not more salient than, those of
    fish?
  • Such missing links have to be accounted for by
    linguistic conventions of the speech community

26
Collocates
  • Let's now have a look at the noun collocates of
    some common classifiers in Chinese to see how
    well the proposed co-selection criteria work
  • Defining collocates (in 2 million words)
  • Window span of L5-R5
  • z > 3.0
  • Minimum co-occurrence frequency of 5
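
The three settings above can be sketched in code. The slides do not say which z-score formula was used, so the version below, a widely used formulation usually attributed to Berry-Rogghe, is an assumption; the function names and all frequencies in the example are hypothetical.

```python
import math
from collections import Counter


def cooccurrences(tokens, node, window=5):
    """Count tokens within `window` positions left and right of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def z_score(observed, node_freq, collocate_freq, corpus_size, span=10):
    """Assumed formula: (O - E) / sqrt(E * (1 - p)).

    p is the chance of meeting the collocate at any position outside
    the node; E is the expected count over span * node_freq positions.
    Requires collocate_freq > 0 so that E is non-zero.
    """
    p = collocate_freq / (corpus_size - node_freq)
    expected = p * node_freq * span
    return (observed - expected) / math.sqrt(expected * (1 - p))


def is_collocate(observed, node_freq, collocate_freq, corpus_size,
                 min_z=3.0, min_freq=5):
    """Apply both thresholds: co-occurrence >= 5 and z > 3.0."""
    if observed < min_freq:
        return False
    return z_score(observed, node_freq, collocate_freq, corpus_size) > min_z


# Hypothetical frequencies in a 2-million-token corpus.
z = z_score(observed=20, node_freq=1_000, collocate_freq=200,
            corpus_size=2_000_000)
print(round(z, 1), is_collocate(20, 1_000, 200, 2_000_000))
```

The default `span=10` corresponds to five positions on each side of the node (L5-R5).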

27
Collocates of zhang (张)
Collocate  Gloss                      Frequency  z-score
?          playing card               64         85.9
??         notepaper                  9          49.7
??         cheque                     6          40.4
??         photo                      7          26.6
?          ticket                     10         21.5
?          paper                      13         21.2
?          (thick/thin) face, cheek   12         17.2
?          skin/leather               7          17.2
?          drawing                    6          9.5
?          (prototypical) bed         6          9.3

28
Collocates of tiao (条) - 1
Collocate  Gloss            Frequency  z-score
??         stipulation      51         41.9
??         regulation       11         26.1
?          street           11         25.4
?          leg              14         22.5
??         (traffic) lane   6          20.4
?          road             23         19.7
??         straight line    6          19.4
?          river            6          11.7

29
Collocates of tiao (条) - 2
Collocate  Gloss          Frequency  z-score
??         instruction    6          10.7
??         suggestion     7          9.2
?          fish           9          8.8
?          line, thread   7          7.4
??         principle      8          7.0
??         comment        6          5.3
??         news           6          4.3

30
Collocates of kuai (块)
Collocate  Gloss          Frequency  z-score
??         level ground   6          59.0
??         stone          11         51.1
?          cloth          6          23.0
?          land, field    9          3.2

31
Collocates of ge (个)
  • Generalised classifier ge (个): bamboo (竹) split
    into halves, initially a counter for bamboos
    and arrows. When a bamboo chip is used for
    counting, it becomes a symbol of the entity being
    counted. In other words, the entity loses its
    shape, colour, function or any other attribute
    and becomes a unit of counting, ge.
  • Ge can be used for any noun (people or things,
    large or small) that does not have a specific
    classifier, and it can be used to replace
    specific classifiers of many nouns.
  • A total of 115 noun collocates
  • 29 refer to human beings, 86 to non-human
    entities
  • 66 refer to concrete entities, 49 to abstract
    entities
  • 12 related to time
  • Top 20 noun collocates (z > 8.8, F > 5, in order
    of z-score)
  • ? month, ?? week, ? person, ?? hour, ??
    phone call, ?? week, ? character, ???
    percentage, ?? place, ?? corner, ??
    project, ?? hour, ?? problem/question, ???
    rice cooker, ?? woman, ?? character, ??
    example, ?? box, ??? camera, ?? stuff

32
Classifiers for dongxi (东西)
  • A noun with a rather general and vague referent:
    it can refer to anything except human beings
  • It is an insult to say someone is a dongxi, or is
    not a dongxi
  • The vagueness in reference makes it possible to
    use a nominal classifier of any type for dongxi
  • Unit classifier
  • (General) ge (个), jian (件) 'piece', fen (份)
    'portion'
  • (Shape) tiao (条), zhang (张), and kuai (块)
  • (Book/paper) ben (本, for books), pian (篇, for a
    piece of writing)
  • Collective classifier
  • tao (套) 'set'
  • Arrangement classifier
  • dui (堆) 'pile'
  • Container classifier
  • xiangzi (箱子) 'box', bao (包) 'pack'
  • Standard measure classifier
  • dun (吨) 'ton'
  • Species classifier
  • yang (样) 'type', zhong (种) 'kind', lei (类) 'class'

33
Variations
  • Not all instances of classifier use are in line
    with these co-selection criteria
  • Regional variation
  • dao (刀) 'knife'
  • Mandarin: yi-ba (把) dao 'a knife'
  • Cantonese: yi-zhang (张) dao 'a knife'
  • niu (牛) 'cattle'
  • Mandarin: yi-tou (头) niu 'a cow'
  • Wu: yi-zhi (只) niu 'a cow'
  • ren (人) 'person'
  • Mandarin: yi-ge (个) ren
  • Fuzhou: yi-zhi (只) ren
  • Unconventional, creative use of classifiers often
    found in literary works
  • Diachronic variation

34
  • Thank you!