Title: COMP323 Foundations of Chinese Computing
2 Course Introduction
- Lecturer
- Qin LU
- csluqin_at_comp.polyu.edu.hk
- Room PQ814, Tel. 2766 7247
- Teaching Assistant (responsible for some labs and project assignments)
- Chen Yirong
- csyrchen_at_comp.polyu.edu.hk
- Room QT416, Tel. 2766 7326
3 Course Introduction
- COMP323 Reference Books
- CJKV Information Processing: Chinese, Japanese, Korean and Vietnamese Computing (PL1074.5 .L86)
- An Introduction to Chinese, Japanese and Korean Computing (QA76.H7795)
- ????????? (PL1074.5.C42) and others
- Tutorials and labs: PQ604A
- Tuesday Group: 9:30-10:30, Tuesdays
- Thursday Group: 9:30-10:30, Thursdays
- Try to finish the labs and the online assignment/QA during lab hours
4 Course Introduction
- COMP323 Website
- WebCT
- Lecture notes available Wed. by 5pm
- Print as NotePage
- Method of Assessment
- Course Work: 55%
- 2 Programming Assignments: 20%
- 2 online quizzes: 20%
- 1 online homework: 5%
- 4 online QA (labs): 8%
- Class attendance (punctuality): 2%
- Final Examination: 45%
5 Course Introduction
- Introduction to Chinese Computing
- Chinese computing is the computer processing of data related to Chinese, involving any human-computer interaction in which communication is achieved using the Chinese language.
- About one-fifth of the people in the world speak some form of Chinese as their native language, making it the language with the most native speakers.
6 Course Introduction
- Fundamental Problems with Chinese Computing
- At Chinese Character Level
- Large and not Closed Character Set
- Computer Representation, Input and Output
- At Chinese Language Level
- Lack of Morphological Variation
- Lack of Grammar
- Very Arbitrary and Flexible
- Superimposed Grammar
- Texts are Running Together
8 Course Introduction
- Fundamental Problems with Chinese Language
- Bi-lingual, Tri-lingual and Multi-lingual Computing
- Question: Is Hong Kong a multi-lingual society?
- How can a system be designed so that it can be used with different languages with minimal changes?
- How can a system be designed so that it can be used for multiple languages?
- Distinguish Chinese and English Characters
- Chinese text, English text, or Chinese text mixed with English text?
9 Course Introduction
- Fundamental Problems with Chinese Language
- Bi-lingual, Tri-lingual and Multi-lingual Computing
- Example: Count the number of (Chinese and/or English) characters or words
10 Tentative Teaching Content
- Characteristics of Chinese Language
- Reading System (Pronunciation)
- Writing System (Look)
- Computer Representation of Chinese Characters
- Character Set Standards (GB, Big5, Unicode, ...)
- Encoding Schemes (ISO and UTF)
- Chinese Character Input
- Chinese Input Processing by (Pen, Image, Speech and) Key Stroke
- Shape-based Keystroke Input Method
- Phonetic-based Keystroke Input Method
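As a preview of the computer representation topic, the sketch below shows how one character maps to different byte sequences under different encodings. It uses only Python's built-in codecs; the helper name `encode_all` and the choice of sample character are assumptions made for illustration.

```python
def encode_all(text, codecs=("utf-8", "gb2312", "big5")):
    """Return {codec name: encoded bytes} for the given text."""
    return {name: text.encode(name) for name in codecs}

sample = "中"
print(f"code point: U+{ord(sample):04X}")
for name, data in encode_all(sample).items():
    # Show the raw bytes in hex together with the byte count.
    print(f"{name:>6}: {data.hex(' ')} ({len(data)} bytes)")
```

The same character takes three bytes in UTF-8 but two bytes in GB2312 and Big5, and the byte values differ between the two double-byte sets. Encoding a character that a given set does not contain raises `UnicodeEncodeError`.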
11 Tentative Teaching Content
- Chinese Character Output
- Bitmap and Outline Font Representation
- Compression
- Scaling Problem
- Software Development for Chinese
- Text Processing, such as Character Searching, Editing, and Deletion
- Software Localization and Internationalization
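The scaling problem of bitmap fonts can be illustrated with a small sketch. The 5x5 glyph below is a hypothetical hand-drawn approximation of 人, and `scale_bitmap` is an illustrative name; the method shown is plain pixel replication, chosen because it is the simplest way to enlarge a bitmap.

```python
# Hypothetical 5x5 bitmap, roughly the shape of 人; '#' is an inked pixel.
GLYPH = [
    "..#..",
    "..#..",
    ".#.#.",
    ".#.#.",
    "#...#",
]

def scale_bitmap(rows, factor):
    """Enlarge a bitmap by an integer factor using pixel replication."""
    scaled = []
    for row in rows:
        wide = "".join(pixel * factor for pixel in row)  # stretch horizontally
        scaled.extend([wide] * factor)                   # repeat vertically
    return scaled

for line in scale_bitmap(GLYPH, 2):
    print(line)
```

Each pixel becomes a square block, so slanted strokes turn into visible staircases as the factor grows. Outline fonts avoid this by storing stroke contours and rasterizing them at the target size.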
12 Tentative Teaching Content
- Chinese Language Processing
- Word Segmentation
- Part-of-Speech (POS) Tagging
- Syntactic Analysis (Grammatical Analysis)
- Chinese Information (Document) Retrieval
- Document Retrieval Models
- Language-Related Issues
- Advanced Topics (possibly)
- Information Extraction
- Text Summarization
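To give a feel for the word segmentation topic, here is a minimal sketch of forward maximum matching, one simple dictionary-based approach. It is not necessarily the method taught in the course; the lexicon is a toy set invented for the example, and `forward_max_match` is an illustrative name.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon (assumption): a real system would load a large dictionary.
LEXICON = {"香港", "理工", "大學", "理工大學", "香港理工大學", "學生"}

print(forward_max_match("香港理工大學學生", LEXICON, max_len=4))
print(forward_max_match("香港理工大學學生", LEXICON, max_len=6))
```

The two calls give different segmentations of the same string, which shows how sensitive the result is to the lexicon and the maximum word length.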
13 Lecture 1: Characteristics of Chinese
14 The Chinese Language
- General Characteristics
- The official language in China is Mandarin (普通話), but there are many dialects in spoken form (50).
- Different Pronunciation across Different Dialects
- Relatively Unified Writing System
- Dialect-specific Characters and Variant Character Writing
- Different words express the same meaning, e.g. ? and ? (to be)
- Word order reversal, e.g. ?? and ?? (look for)
16 The Chinese Language
- Characteristics of Chinese Characters
- Each Chinese character is associated with three features: its look (graphemics), its pronunciation (phonetics), and its meaning (semantics).
- Graphemics (The Look)
- Phonetics (The Sound)
- Semantics (The Meaning)
17 Chinese Writing System
- Radicals (部首)
- Chinese characters are composed of smaller units, called radicals.
- 214 radicals are used for indexing Chinese characters.
- The advantage of a radical is that one does not have to know the pronunciation of a character to look it up in a dictionary.
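Radical-based lookup can be sketched as a two-level index: first the radical, then the number of strokes left after removing the radical. The entries below are a toy sample chosen for illustration, and the names `build_radical_index` and `lookup` are assumptions rather than anything defined in the course.

```python
from collections import defaultdict

# Toy data (assumption): (character, radical, strokes besides the radical).
ENTRIES = [
    ("木", "木", 0),
    ("林", "木", 4),
    ("森", "木", 8),
    ("明", "日", 4),
    ("跳", "足", 6),
]

def build_radical_index(entries):
    """Group characters by radical, then by residual stroke count."""
    index = defaultdict(lambda: defaultdict(list))
    for char, radical, residual in entries:
        index[radical][residual].append(char)
    return index

def lookup(index, radical, residual_strokes):
    """Return the characters filed under a radical and residual stroke count."""
    return index.get(radical, {}).get(residual_strokes, [])

INDEX = build_radical_index(ENTRIES)
print(lookup(INDEX, "木", 4))
print(lookup(INDEX, "足", 6))
```

The reader only needs to recognize the radical and count strokes; no knowledge of the pronunciation is involved at any step.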
18 Chinese Writing System
- Radicals
- Remark: Several radicals can stand alone as single, meaningful Chinese characters.
Radical | Standalone | Examples
? | Yes | ????????????
? | Yes | ????????????
? | Yes | ????????????
? | Yes | ????????????
19 Chinese Writing System
- Strokes (筆畫)
- Radicals in turn are composed of smaller units, called strokes.
- 30 strokes are the most basic elements of a character.
- The 5 basic strokes are 一 (橫, a horizontal stroke), 丨 (豎, a vertical stroke), 丶 (點, a dot), 丿 (撇, a stroke curved to the left) and 乙 (折, a bend stroke).
20 Chinese Writing System
- Strokes
- Stroke Order (筆順)
- The strokes of each Chinese character are to be drawn in a defined order.
- Basic principles are: left to right, top to bottom, outside to inside, horizontal before vertical, left slant before right slant, center before two sides, etc.
- See animations here: http://www.chinawestexchange.com/Chinese/characters.htm
21 Chinese Writing System
- Tree Structure of Chinese Characters
22 Chinese Writing System
- Character Classifications and Formation
- Type 1: Pictographs (Picture Characters) (象形)
- They look like the things they represent.
- Examples are 日 (sun), 山 (mountain), 水 (water), 鳥 (bird), 火 (fire), 木 (tree), 車 (car, cart), and 口 (mouth, opening), etc.
- Does the character 月 really look like a moon to you? Centuries ago, its written form was closer to a picture of a crescent moon.
23 Chinese Writing System
- Evolution of Chinese Characters
24 Chinese Writing System
- Character Classifications and Formation
- Type 2: (Simple) Ideographs (?? or ??)
- They represent abstract concepts or ideas, such as numbers and directions, e.g. 一 (one), 二 (two), 三 (three), 中 (center, middle), 上 (above), and 下 (below).
25 Chinese Writing System
- Character Classifications and Formation
- Type 3: Compound Ideographs (會意)
- Pictographs and ideographs can be combined to form more complex characters, which usually reflect the combined meaning of their parts.
- Examples:
- sun 日 + moon 月 = bright 明
- person 人 + person 人 = agree/follow 从
- sun 日 + tree 木 = east 東 (sun rising above the trees in the east)
- tree 木 + tree 木 = forest 林
- forest 林 + one more tree 木 = full of trees 森
- More interesting animations from the Internet: http://www.language.berkeley.edu/fanjian/compound_ideographs.html
28 Chinese Writing System
- Character Classifications and Formation
- Type 4: Phonetic Ideographs (形聲)
- They usually have at least two component characters: one influences the sound and the other influences the meaning.
- They account for more than 90% of all Chinese characters in use today.
- For example, in the character 跳 (jump), the left part 足 means foot. The meanings of characters that contain 足 are related to the foot in some way. The right part 兆 indicates the sound: 跳 (tiào) and 兆 (zhào) share the same vowel.
29 Chinese Writing System
Thought to be the oldest types of characters,
pictographs were originally pictures of things.
During the past 5,000 years or so they have
become simplified and stylised.
Ideographs are graphical representations of
abstract ideas.
Compound pictographs and ideographs combine two
or more pictographs or ideographs to form new
characters. All component parts contribute to
the meaning of the compound character.
30 Chinese Writing System
Semantic-phonetic compounds represent around 90%
of all existing characters and consist of two
parts: a semantic component, or radical, which
hints at the meaning of the character, and a
phonetic component, which gives a clue to the
pronunciation of the character. Characters
containing the same phonetic component may have
the same sound and the same tone, the same sound
but a different tone, the same initial or final
sound, or a different sound and a different tone.
Phonetic components are generally a more
reliable indication of pronunciation than
semantic components are of meaning.
31 Chinese Writing System
- Traditional and Simplified Characters
- Over time, frequently used and complex Chinese characters tend to be simplified.
- One simplification method is to retain only one part of the traditional character.
- More about Pitfalls and Complexities of Chinese to Chinese Conversion: http://www.cjk.org/cjk/c2c/c2cbasis.htm
32 Chinese Writing System
- Chinese Language (Chinese Text)
- Chinese characters are combined with other Chinese characters into words to express more complex ideas and concepts.
- Question: How many Chinese characters are there?
The Chinese writing system is open-ended, meaning
that there is no upper limit to the number of
characters. The largest Chinese dictionaries
include about 56,000 characters, but most of them
are archaic, obscure or rare variant forms.
Knowledge of about 3,000 characters enables you
to read about 99% of the characters in Chinese
newspapers and magazines. To read Chinese
literature, technical writings or classical
Chinese, though, you need to be familiar with
about 6,000 characters.
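The coverage claim can be checked on any corpus with a frequency count. The sketch below is illustrative only: the function name `coverage` is an assumption, the sample string is a tiny made-up stand-in for real newspaper text, and only the main CJK Unified Ideographs block is counted.

```python
from collections import Counter

def coverage(text, top_n):
    """Fraction of Chinese character occurrences covered by the top_n most frequent characters."""
    # Assumption: only code points in U+4E00..U+9FFF are counted as Chinese.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total

# Toy sample (assumption): 的 x4, 一 x3, 是 x2, 不 x1.
sample = "的的的的一一一是是不"
print(coverage(sample, 2))  # 7 of the 10 occurrences are 的 or 一
```

Run over a large corpus with `top_n=3000`, the same function would give the kind of figure quoted above; the exact value depends on the corpus.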
33 Chinese Reading System
- Pronunciation
- The phonetic information is not explicit.
- Sometimes, you can guess the pronunciation from the component characters.
- Sometimes, the pronunciation has no relation to its components at all.
- This makes learning Chinese difficult without a phonetic transcription system.
- Phonetic transcription: a notation for recording pronunciations
- Sufficiency: there are symbols to indicate all sounds in the language
- Uniqueness: one sound is denoted by only one symbol
34 Chinese Reading System
- Pronunciation
- Pinyin: transcribing Mandarin Chinese
- Initials (聲母, the leading consonant) and finals (韻母, the vowel part)
- More about Pronunciation: http://www.chinese-outpost.com/language/pronunciation/mandarin-chinese-initials-and-finals-table-1.asp
For example, consider Beijing. In bei, b is an
initial and ei is a final; in jing, j is an initial
and ing is a final. In speech, Chinese words are
created using just 21 beginning sounds called
initials, and 37 ending sounds called finals.
Initials and finals combine to create
the basic sounds of Chinese.
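The split of a syllable into initial and final can be sketched as a prefix match against the list of 21 initials. This is a simplified illustration: it assumes toneless, lowercase-able Pinyin input, and it treats syllables written with y or w (such as yi or wu) as having no initial. The name `split_syllable` is made up for the example.

```python
# The 21 initials of Hanyu Pinyin. Two-letter initials come first so that
# "zh" is tried before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

def split_syllable(syllable):
    """Split a toneless Pinyin syllable into (initial, final)."""
    s = syllable.lower()
    for initial in INITIALS:
        # Require a non-empty final after the initial.
        if s.startswith(initial) and len(s) > len(initial):
            return initial, s[len(initial):]
    return "", s  # no initial, e.g. "an" or "er"

for syllable in ("bei", "jing", "zhong", "an"):
    print(syllable, split_syllable(syllable))
```

For Beijing this reproduces the split described above: b + ei and j + ing.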
36 Chinese Reading System
- Pronunciation
- Tones of Chinese
- Chinese is a tonal language.
- Mandarin has 4 tones (5 counting the neutral tone) and Cantonese has 6 (9 by the traditional count, which includes the entering tones), which makes Cantonese much harder to learn than Mandarin.
37 Chinese Reading System
- Pronunciation
- Tones differentiate meanings.
Everyone seems to know this one: just by
saying ma in different tones, you can ask, "Did
mother scold the horse?" 媽媽罵馬嗎？ (māma mà mǎ ma?)
鞏俐 (Gǒng Lì, with third and fourth tones) is the
name of the star of Raise the Red Lantern and
other contemporary Chinese films. However, 公里
(gōng lǐ, with first and third tones) means
kilometer.
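Since the tone mark is the only thing separating such pairs, a program has to keep it. The sketch below reads the tone number off a Pinyin syllable written with tone marks by decomposing it with Unicode normalization. The function names are illustrative, and numbering the neutral tone as 5 is a common convention adopted here as an assumption.

```python
import unicodedata

# Combining marks used for the four Mandarin tones.
TONE_MARKS = {
    "\u0304": 1,  # macron, first tone
    "\u0301": 2,  # acute accent, second tone
    "\u030c": 3,  # caron, third tone
    "\u0300": 4,  # grave accent, fourth tone
}

def tone_of(syllable):
    """Return the tone number of a Pinyin syllable (5 if it has no tone mark)."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 5

def strip_tone(syllable):
    """Remove the tone mark but keep other diacritics, such as the dots on ü."""
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

for word in (["gǒng", "lì"], ["gōng", "lǐ"]):
    print([strip_tone(s) for s in word], [tone_of(s) for s in word])
```

Both words reduce to the same toneless spelling, gong li, and differ only in their tone numbers, which is why dropping tones loses meaning.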