Title: Using corpora to study Classifiers in Mandarin Chinese
1 Using corpora to study classifiers in Mandarin Chinese
- Richard Xiao
- z.xiao_at_lancaster.ac.uk
2 Chinese corpus linguistics
- In relation to English, Chinese has a much shorter history of using corpora
- Sinica Balanced Corpus of Chinese
- The first annotated corpus of Mandarin
- Freely accessible online since the mid-1990s
- Rapid progress over the last decade
- Corpus building and exploration technology
- Publicly available corpus resources
3 Chinese text processing
- Computational processing of Chinese text is more complex than that of English
- Chinese text is encoded in double-byte native encodings
- Potential confusion of bytes in running text
- GB2312 for Simplified Chinese (SC) and Big5 for Traditional Chinese (TC)
- The advent of Unicode has facilitated Chinese computing
- But most existing data and tools are based on native encodings
- Word tokenization is an essential first step in serious Chinese computing
- Defining legitimate words in running text
- Involving dictionary matching and the use of statistical models
- Part-of-speech tagging depends on the results of tokenization
- Accuracy of tokenization: 98%
- Accuracy of POS tagging: 96%
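Two of the points above can be illustrated with a minimal Python sketch. The first part shows one way bytes can be confused in a double-byte encoding: a byte-level search matches a "character" that straddles two real characters. The second part is a forward maximum matching tokenizer, one simple form of dictionary matching. The slides do not say which algorithm or lexicon was actually used, so `fmm_tokenize`, `TOY_LEXICON`, and `max_len` are illustrative assumptions.

```python
# Part 1: byte confusion in a double-byte native encoding (GB2312).
text = "中文"
raw = text.encode("gb2312")        # two bytes per character, four in total

# Second byte of the first character plus first byte of the second character.
straddle_bytes = raw[1:3]
straddle_char = straddle_bytes.decode("gb2312")

# A naive byte-level search reports a match for a character
# that does not occur in the text at all.
found_in_bytes = straddle_bytes in raw     # True
found_in_text = straddle_char in text      # False


# Part 2: dictionary matching by forward maximum matching (FMM).
# TOY_LEXICON is a hypothetical word list, for illustration only.
TOY_LEXICON = {"三", "本", "书", "一", "把", "手枪"}


def fmm_tokenize(sentence, lexicon, max_len=4):
    """Split a sentence by always taking the longest lexicon match.

    Characters not covered by the lexicon fall back to single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(fmm_tokenize("一把手枪", TOY_LEXICON))
```

Real tokenizers combine a much larger lexicon with statistical models to resolve ambiguous segmentations, which pure longest-match cannot do.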
4 Concordancers for Chinese
- Many concordancers designed for English do not work well with Chinese data
- There are presently three types of tools for Chinese
- Unicode-based tools
- WordSmith version 4 (commercial product)
- Xaira (open source freeware)
- Concordancers dependent on language support packs (or in WinXP, default non-Unicode font set as Chinese)
- AntConc (freeware)
- ConcApp (freeware)
- MonoConc Pro (commercial product)
- Concordance (shareware)
- Web-based query systems bundled with specific online corpora
5 Chinese corpus resources
- Sinica Balanced Corpus
- http://www.sinica.edu.tw/SinicaCorpus/
- Sinica Tagged Corpus of Early Mandarin
- http://www.sinica.edu.tw/Early_Mandarin/
- Modern Chinese Language Corpus
- http://219.238.40.2138080/CpsQrySv.srf
- PKU-CCL Chinese Corpus
- http://ccl.pku.edu.cn/YuLiao_Contents.Asp
- BLCU Modern Chinese Corpus
- http://202.112.195.88089/ccir_login?input
- Chinese Internet Corpus
- http://corpus.leeds.ac.uk/query-zh.html
- Lancaster Corpus of Mandarin Chinese
- http://www.ling.lancs.ac.uk/corplang/lcmc/
- Lancaster Los Angeles Spoken Chinese Corpus
- http://www.ling.lancs.ac.uk/corplang/llscc/
- More details of more corpora in more languages are on the handout
6 Lancaster Corpus of Mandarin Chinese (LCMC)
- Designed as a Chinese match for FLOB and Frown
- Representing written Mandarin as used in mainland China in the early 1990s
- A balanced corpus of one million words in 500 samples proportionally taken from 15 text categories
- Marked up in XML and encoded in Unicode
- Tokenized and POS tagged
- Freely searchable online
- http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
- Released by ELRA and OTA free of charge for academic and educational purposes
- An indexed version for use with Xaira is available
- V1.2 incorporates validated details of classifier use
7 Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)
- One million words of spoken Mandarin
- Both dialogues (55%) and monologues (45%)
- Both spontaneous (57%) and scripted (43%) speech
- Seven spoken registers
- Face-to-face conversation, telephone conversation, play/movie scripts, TV talk show transcripts, formal debates, spontaneous oral narrative, edited oral narrative
- Marked up in XML and encoded in Unicode
- Tokenised and POS tagged
- The Telephone Conversation part is tagged with details of classifier use
- The unannotated version of this part is available from the LDC as CallHome Mandarin Transcripts
- More information
- http://www.ling.lancs.ac.uk/corplang/llscc/
8 Annotation scheme for classifiers (q)
Tag   Gloss
qu    Unit classifier
ql    Collective classifier
qa    Arrangement classifier
qc    Container classifier
qm    Standard measure
qs    Species classifier
qt    Temporal classifier
qv    Verbal classifier
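As a rough illustration of how such a tagset can be used, the sketch below counts classifier subtypes in a small XML fragment. It assumes, purely for the example, that each word is a `<w>` element carrying its tag in a `POS` attribute; the exact markup of the released corpora should be checked against their documentation. The function name `count_classifier_types` and the sample sentence are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tag-to-gloss mapping taken from the annotation scheme above.
CLASSIFIER_TAGS = {
    "qu": "Unit classifier",
    "ql": "Collective classifier",
    "qa": "Arrangement classifier",
    "qc": "Container classifier",
    "qm": "Standard measure",
    "qs": "Species classifier",
    "qt": "Temporal classifier",
    "qv": "Verbal classifier",
}


def count_classifier_types(xml_text):
    """Count classifier subtypes among <w POS="..."> elements (assumed markup)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for word in root.iter("w"):
        tag = word.get("POS", "")
        if tag in CLASSIFIER_TAGS:
            counts[CLASSIFIER_TAGS[tag]] += 1
    return counts


# Hypothetical sample: san ben shu 'three books', with ben tagged as a unit classifier.
SAMPLE = '<s><w POS="m">三</w><w POS="qu">本</w><w POS="n">书</w></s>'
print(count_classifier_types(SAMPLE))
```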
9 Why classifiers are necessary (1)
- Grammatically mandatory
- san ben shu        *san shu
- three CL book      three book
- 'three books'      'three books'
- Distinguishing between word senses
- yi tiao xian       yi gen xian
- one CL line        one CL thread
- 'a line'           'a thread'
10 Why classifiers are necessary (2)
- Resolving syntactic ambiguity
- Example A)
- Ho laozong gei-le ta yi-ba shouqiang
- Ho general give-Asp him one-CL pistol
- General Ho gave him a pistol.
- Example B)
- Ho laozong gei-le ta yi shouqiang
- Ho general give-Asp him one pistol (CL)
- General Ho shot him once with a pistol.
11 Use and name of classifiers
- The use of classifiers dates back over 3,300 years
- Oracle bone inscriptions excavated from the Yin Ruins (1300-1100 B.C.)
- Classifiers became established as a separate word class in Chinese only in the 1950s
- Ding et al. (1952) A Talk on Grammar in Modern Chinese
- Different terms had been used for classifiers
- But mainly treated as a subclass of nouns
12 Syntactic features of classifiers
- Classifiers were the last to become one of the 11 word classes in Chinese because they cannot be used independently as sentential constituents
- Typically following a numeral or a demonstrative pronoun: zhe (这) 'this', na (那) 'that', or na (哪) 'which'
- Monosyllabic classifiers can be reduplicated to function as different sentential constituents, expressing a general grammatical meaning with different situational variants (Guo 1999)
- Co-existence or repetition of entities or events
- All around, many, one by one, continuous
13 Levels of grammaticalization
- Specialised classifiers
- Fully grammaticalized
- Functioning as classifiers only
- Bleaching of lexical meaning, difficult to find translation equivalents in a non-classifier language
- E.g. (n) ?,?,?,?,?,?,?,? (v) ?,?,?,?,?,?,?,?,?,?
- Concurrent classifiers
- Mainly derived from nouns and verbs
- Can be used as nouns/verbs and classifiers
- The classifier use is semantically related to the lexical meaning of the original noun/verb
- E.g. ?,?,??,? ?,?,?,?,?
- Temporary borrowings
- Mainly borrowed from nouns, verbs, and adjectives
- Functioning as classifiers only on an ad hoc basis
- Full lexical meaning
- E.g. ? (face), ?? (house), ? (knife), ? (gun), ? (foot), ? (fist)
14 Semantic types of classifiers (1)
- Nominal classifiers (6 types): quantifying nouns
- Unit classifiers
- Count individual entities
- E.g. ? (63.5% of unit classifiers, 38.8% of all classifiers),?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?
- Collective classifiers
- Provide a collective reference for separate entities
- E.g. ? set, ? batch, ? pair, ?? series, ? pair, ? group, ? generation, ? group, ? pair, ? team
- Arrangement classifiers
- Also refer to a collection, but focus on the constellation aspect (shape), i.e. how entities are arranged or grouped together
- E.g. ? layer, ? pile, ? ball, ? pad, ? string, ? thread, ? row, ? handful, ? drop, ? bunch, ? thread, ? row
15 Semantic types of classifiers (2)
- Nominal classifiers: quantifying nouns
- Standard measure classifiers
- Express exact measures of various kinds, in local or international units
- E.g. ?,?,?,?,?,??,?,??,?,?,???,?,??,??,?,?,?,?,?
- Container classifiers
- Denote types of containers, which are borrowed temporarily to provide an inexact measure of mass or entities usually associated with such containers
- E.g. ?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,??
- Special container classifiers, can only take yi -> 'full', more descriptive than quantifying
- Species classifiers
- Denote the type of entities grouped together
- E.g. ? (kind, over 90%), ? (sort), ? (grade), ? (type), ? (grade), ? (class)
16 Semantic types of classifiers (3)
- Verbal classifiers: quantifying verbs
- 9 specialised verbal classifiers
- E.g. ? (times, 40.8% of all verbal classifiers), ? (stroke), ? (course of action), ? (once over), ? (step of action), ? (return journey), ? (times), ? (once through), ? (criticising, abusing)
- Borrowed verbal classifiers
- An open set, mostly nouns denoting tools and related items
- E.g. ?,?,?,?,?,?,??,?,?
- Temporal classifiers: measuring time
- Exact measures
- ?,?,?,??,??,?,?,??,?,??,?,?,??,??,?,?,?,??,?
- Inexact measures
- E.g. ??,?,??,??,?,?,??
17 Classifiers in writing and speech
- Unit classifiers are by far the most common, in both speech and writing
- Because of the weight of the generalised classifier ge, unit classifiers are particularly frequent in speech
- Other common types: temporal, verbal
- Infrequent types: container, arrangement, collective
18 Variation across genres
- Apart from the speech-writing difference, various genres also differ in classifier use
- Most frequent in news reportage (A), humour (R), and speech (S): over 3K in 100K
- Least common in news review (B), news editorial (C), religious writing (D), and academic prose (J): below 2K in 100K
- Generally more common in imaginative writing (K-R) and speech (S) than in informative writing (A-J)
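Figures such as "over 3K in 100K" are normalised frequencies, which make genres of different sizes comparable. A minimal sketch of the arithmetic follows, assuming the base is 100,000 tokens. The function name `per_100k` and the counts are hypothetical and are not taken from LCMC or LLSCC.

```python
def per_100k(raw_count, corpus_size):
    """Normalise a raw frequency to a base of 100,000 tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 100_000 / corpus_size


# Hypothetical counts, for illustration only.
small_genre = per_100k(2_640, 88_000)     # 2,640 classifiers in 88,000 tokens
large_genre = per_100k(2_880, 160_000)    # 2,880 classifiers in 160,000 tokens

# The larger sample has more classifiers in absolute terms,
# but fewer once both are scaled to the same base.
print(small_genre, large_genre)
```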
19 Distribution of classifier types
- Distribution of different types of classifiers also varies across genres
- Unit classifiers are the most common type in all genres (2/3 of all classifiers)
- Container, arrangement, and collective classifiers are relatively rare in all genres
- Standard measure classifiers are most frequent in news reportage (A) and official documents (H)
- Species classifiers are more common in informative than imaginative writing
20 Cognitive basis of classifier use
- Allan (1977): number of dimensions
- Adams and Conklin (1973): elasticity, hardness, discreteness
- Shi (2001): ratio between different dimensions, and materiality
- Dimensions and use of classifiers
- 0-D point, e.g. yi dian (点) mo 'a point of ink'
- 1-D line, e.g. yi xian (线) xiwang 'a thread of hope'
- 2-D area (Y being the longer dimension)
- Y/X >> 1 -> zhang (张), e.g. yi zhang zhaopian 'a photo'
- Y/X >> 0 -> tiao (条), e.g. yi tiao malu 'a road'
- 3-D block (Q = Y/X)
- Z/Q >> 0 -> pian (片), e.g. yi pian shuye 'a leaf'
- Z/Q >> 1 -> kuai (块), e.g. yi kuai tang 'a lump of sugar'
- Z/Q >> sufficiently large -> gen (根), e.g. yi gen dianxian 'a cable'
- While the use of nominal classifiers is closely associated with shape, this is not the only criterion: nouns and classifiers co-select each other
- Five co-selection criteria
21 Co-selection by similarity
- Classifiers are closely related to shapes which are historically associated with the nouns that have given rise to these classifiers, e.g.
- tiao (条): 'small branch/twig' -> long, narrow, flexible: jie (街) 'street', tui (腿) 'leg', lu (路) 'road', xian (线) 'line, thread', he (河) 'river', yu (鱼) 'fish', etc.; 'bamboo slips for writing' -> guiding (规定) 'regulation', jianyi (建议) 'suggestion', falu (法律) 'law', xinwen (新闻) 'news', etc.
- kuai (块): 'soil lump/block' -> something of a lumpy/blocky shape, e.g. a wrist watch; 'territory, soil' -> something with a boundary, e.g. a scar
22 Co-selection by metonymy
- The original lexical meanings of classifiers refer to the most salient features of the entities being classified, e.g.
- kou (口) 'mouth' (for pigs), tou (头) 'head' (for cattle), wei (尾) 'tail' (for fish), ding (顶) 'top' (for hats, sedan chairs, etc.)
- BUT long-term linguistic conventions are always important in language use
- tou is not used for rabbits or cats
- wei is not used for peacocks or squirrels
23 Co-selection by relatedness
- The original lexical meanings of classifiers refer to actions closely related to the entities being classified, e.g.
- bao (包) 'wrap' -> pack (the result of packing)
- chuan (串) 'string together' -> string, bunch
- kun (捆) 'tie up, fasten' -> bundle
- peng (捧) 'hold in both hands' -> a double handful
24 Co-selection by association
- The original lexical meanings of classifiers refer to tools, containers, places, etc. closely associated with the entities being classified, e.g.
- dao (刀) 'knife' -> a cut of (meat)
- wan (碗) 'bowl' -> a bowl of (rice)
- chuang (床) 'bed' -> a bed of (quilt/sheet etc.)
- mu (幕) 'curtain' -> an act of (play)
25 Co-selection by conventions
- Sometimes, co-selection has to be interpreted by following linguistic conventions because it is not always possible to track the grammaticalization path of a classifier to ascertain the relationship between its original lexical meaning and the entities being classified
- In what way is tiao historically related to renming 'human life'?
- Why is tou used for pigs and cattle but not rabbits or cats?
- Why is wei used for fish but not for peacocks or squirrels, even though their tails are as salient as, if not more salient than, those of fish?
- Such missing links have to be accounted for by linguistic conventions of the speech community
26 Collocates
- Let's now have a look at the noun collocates of some common classifiers in Chinese to see how well the proposed co-selection criteria work
- Defining collocates (in 2 million words)
- Window span of L5-R5
- z > 3.0
- Minimum co-occurrence frequency of 5
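The three thresholds above can be sketched in Python as follows. The slides do not state which z-score formula or which tool was used, so this uses one common formulation, z = (O - E) / sqrt(E(1 - p)), as an assumption. The function name `find_collocates` and its parameters are hypothetical.

```python
import math
from collections import Counter


def find_collocates(tokens, node, span=5, min_freq=5, min_z=3.0):
    """Return (word, co-occurrence frequency, z-score) tuples for a node word.

    Assumed formula: p = F(word) / (N - F(node)), E = p * F(node) * 2 * span,
    z = (O - E) / sqrt(E * (1 - p)). Windows cut short at the text edges are
    still treated as full-size, which is a simplification.
    """
    total = len(tokens)
    freq = Counter(tokens)
    node_freq = freq[node]

    # Count words inside the L5-R5 window around each occurrence of the node.
    observed = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for j in range(max(0, i - span), min(total, i + span + 1)):
            if j != i:
                observed[tokens[j]] += 1

    results = []
    for word, count in observed.items():
        if word == node or count < min_freq:
            continue
        p = freq[word] / (total - node_freq)
        if p >= 1.0:
            continue  # degenerate case: no variance to estimate
        expected = p * node_freq * 2 * span
        z = (count - expected) / math.sqrt(expected * (1 - p))
        if z > min_z:
            results.append((word, count, z))
    return sorted(results, key=lambda item: item[2], reverse=True)
```

With a tokenized corpus loaded as a list of strings, `find_collocates(tokens, "张")` would list the candidate collocates of zhang. Restricting them to nouns, as in the tables that follow, would additionally require the POS tags.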
27 Collocates of zhang (张)
Collocate   Gloss                       Frequency   z-score
?           playing card                64          85.9
??          notepaper                   9           49.7
??          cheque                      6           40.4
??          photo                       7           26.6
?           ticket                      10          21.5
?           paper                       13          21.2
?           (thick/thin) face, cheek    12          17.2
?           skin/leather                7           17.2
?           drawing                     6           9.5
?           (prototypical) bed          6           9.3
28 Collocates of tiao (条) - 1
Collocate   Gloss             Frequency   z-score
??          stipulation       51          41.9
??          regulation        11          26.1
?           street            11          25.4
?           leg               14          22.5
??          (traffic) lane    6           20.4
?           road              23          19.7
??          straight line     6           19.4
?           river             6           11.7
29 Collocates of tiao (条) - 2
Collocate   Gloss           Frequency   z-score
??          instruction     6           10.7
??          suggestion      7           9.2
?           fish            9           8.8
?           line, thread    7           7.4
??          principle       8           7.0
??          comment         6           5.3
??          news            6           4.3
30 Collocates of kuai (块)
Collocate   Gloss           Frequency   z-score
??          level ground    6           59.0
??          stone           11          51.1
?           cloth           6           23.0
?           land, field     9           3.2
31 Collocates of ge (个)
- Generalised classifier ge (个): bamboo (竹) split into halves, initially used as a counter for bamboos and arrows. When a bamboo chip is used for counting, it becomes a symbol of the entity being counted. In other words, the entity loses its shape, colour, function or any other attribute and becomes a unit of counting, ge.
- Ge can be used for any noun (people or things, large or small) that does not have a specific classifier, and it can be used to replace the specific classifiers of many nouns.
- A total of 115 noun collocates
- 29 refer to human beings, 86 to non-human entities
- 66 refer to concrete entities, 49 to abstract entities
- 12 related to time
- Top 20 noun collocates (z > 8.8, F > 5, in the order of z-scores)
- ? month; ?? week; ? person; ?? hour; ?? phone call; ?? week; ? character; ??? percentage; ?? place; ?? corner; ?? project; ?? hour; ?? problem, question; ??? rice cooker; ?? woman; ?? character; ?? example; ?? box; ??? camera; ?? stuff
32 Classifiers for dongxi (东西)
- A noun with a rather general and vague referent: it can refer to anything, but not to a human being
- It is an insult to say someone is a dongxi, or is not a dongxi
- The vagueness in reference makes it possible to use a nominal classifier of any type for dongxi
- Unit classifier
- (General) ge (个), jian (件) 'piece', fen (份) 'portion'
- (Shape) tiao (条), zhang (张), and kuai (块)
- (Book/paper) ben (本, for books), pian (篇, for a piece of writing)
- Collective classifier
- tao (套) 'set'
- Arrangement classifier
- dui (堆) 'pile'
- Container classifier
- xiangzi (箱子) 'box', bao (包) 'pack'
- Standard measure classifier
- dun (吨) 'ton'
- Species classifier
- yang (样) 'type', zhong (种) 'kind', lei (类) 'class'
33 Variations
- Not all instances of classifier use are in line with these co-selection criteria
- Regional variation
- dao (刀) 'knife'
- Mandarin: yi-ba (把) dao 'a knife'
- Cantonese: yi-zhang (张) dao 'a knife'
- niu (牛) 'cattle'
- Mandarin: yi-tou (头) niu 'a cow'
- Wu: yi-zhi (只) niu 'a cow'
- ren (人) 'person'
- Mandarin: yi-ge (个) ren
- Fuzhou: yi-zhi (只) ren
- Unconventional, creative use of classifiers is often found in literary works
- Diachronic variation