Wide character vs. Multi-byte characters - PowerPoint PPT Presentation

Slides: 18

Provided by: xxxx150

1
Wide character vs. Multi-byte characters

Text information needs to be represented by the right data types.
Multi-byte characters: data are processed on a per-byte basis (Big5, GB, EUC, even UTF-8).
Wide characters: fixed-width encoding, so no testing of the high bit is needed.
Processing representation for wide characters: big endian vs. little endian.
Byte order matters only for wide character data types.
Byte order depends on the system architecture.
Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in big endian and as FF FE in little endian.

2
Character Input

Input method: a scheme of mapping characters from their external representations to the internal codepoints used in computer systems.
Classification of input methods:
Images
Off-line character recognition (optical character recognition)
On-line character recognition
Speech: voice recognition
Character features: keyboard input based on glyph shapes and pronunciations.

3
Character Input Based on Images

Optical character recognition (via image, off-line)
Written material --> scanner --> bitmap image file (e.g. TIFF, JPEG) --> characters (represented by an internal code)
Very difficult for unrestricted handwritten characters; commercially viable for printed materials, where accuracy depends on printing quality.
Degree of difficulty increases as the total number of characters to be recognized increases.
On-line character recognition (by pen writing devices)
Handwriting information capture (pen-in, pen-out, pen-movement, on-line) --> stroke information (preprocessing with noise reduction) --> searching for the character based on the sequence of strokes.
Commercially viable.

Speech recognition (by voice input)
Capture speech by microphone --> speech signal segmentation --> speech signal converted to phonetic transcription --> phonetic spelling converted to internal code.
Becoming commercially viable; problems with non-native speakers and with conversion from colloquial to written text.
Expected to become more affordable and more common in the next 5-10 years.

Keyboard-based input method: an encoding method which maps a sequence of keystrokes (with a predefined keyboard layout) to the internal code of a character.
Conceptually, an input method can be considered as a mapping table with two columns: the 1st column X is a sequence of keys, the 2nd column Y is the corresponding internal code.
Uniqueness requirement: for any two internal codepoints Yi and Yj, if Yi ≠ Yj then Xi ≠ Xj.
Input methods are normally language (script) dependent.
Input for Chinese and for Greek letters in GB are two different input methods and are thus invoked separately.
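
The two-column table view of an input method can be sketched as a dictionary from key sequences to internal codes. This is only an illustration: the key sequences `"ab"` and `"ac"` are invented, Unicode codepoints stand in for the internal code, and `build_input_table` is a hypothetical helper that rejects a key sequence assigned to two different codes.

```python
def build_input_table(rows):
    """Build {key sequence X: internal code Y} from (X, Y) rows.

    Raises ValueError if one key sequence is given two different codes,
    which would violate the uniqueness requirement.
    """
    table = {}
    for keys, code in rows:
        if keys in table and table[keys] != code:
            raise ValueError(f"key sequence {keys!r} is ambiguous")
        table[keys] = code
    return table


# Hypothetical rows: invented key sequences, Unicode codepoints as codes.
rows = [("ab", 0x4E2D), ("ac", 0x56FD)]
table = build_input_table(rows)
print(chr(table["ab"]), chr(table["ac"]))
```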

Typing in the internal code is straightforward, easiest to implement, and accurate, but requires labour-intensive training; only good for professionals.
Why do we need to design input methods?
People cannot relate characters to their internal codes.
? -> (BCAB, base 16) ? -> (BCAC, base 16)
The number of characters is much larger than the number of keys on the keyboard => a sequence of keystrokes maps into one character.
What is the restriction? A limited number of keys (people cannot remember too many different keys with unrelated numbers).
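
Both points on this slide can be tried out in a few lines. The sketch below assumes the internal code is GB2312 and uses Python's built-in `gb2312` codec; the figures 6,763 (the number of Chinese characters in GB2312) and 26 (letter keys) are illustrative values that are not taken from the slides, and both function names are invented.

```python
def char_from_internal_code(hex_code: str, encoding: str = "gb2312") -> str:
    """Turn a typed hexadecimal internal code into its character."""
    return bytes.fromhex(hex_code).decode(encoding)


def min_sequence_length(num_chars: int, num_keys: int) -> int:
    """Shortest fixed key-sequence length that can address num_chars characters."""
    length, reachable = 1, num_keys
    while reachable < num_chars:
        length += 1
        reachable *= num_keys
    return length


print(char_from_internal_code("D6D0"))
print(min_sequence_length(6763, 26))  # 3, since 26**2 = 676 but 26**3 = 17576
```

Typing the code directly always works, but the user has to memorise four hex digits per character; the second function shows why any keyboard method needs multi-key sequences instead.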

What information do we have?
All input methods must use some features associated with the characters: pronunciation, radicals, components, strokes, writing sequence, etc., or combinations of them.
Different mapping methods lead to different input methods.
Users: professional typists, casual users, daily users.
Different modes of input:
Typing by looking at printed material
Typing while thinking

8
Design considerations

Ease of learning
Shorter learning time: easy to pick up (perhaps easy to forget), but slow input speed.
Longer learning time: difficult to learn, but once trained, not easy to forget, and faster input speed.
Mapping of features to keys on the keyboard
Physical control of the different fingers and access to different key positions on the keyboard
Frequency analysis of the features
Uniqueness: one-to-one mapping and user friendliness
Equal-length keystroke sequences vs. uneven-length keystroke sequences

9
Input methods based on glyphs

Problems
What are the fundamental units?
How to put the units together (or how to form sequences)? Need to translate 2-D spatial relations into a 1-D ordering.
Example: 夵 (U+5935) and 尖 (U+5C16)
How difficult is it to learn? Trade-off between ease of learning and speed.
Features related to glyphs
Strokes (??): ? ? ? ? ?
Radicals (??): for indexing mostly, not unique
Components (??): ? and ? in ??
Character (??): ?
Spatial relations (????): left-right, upper-lower,

10
Principles of Input method design

Design example using strokes only
Suppose we assign the strokes to keys 1, 2, 3, 4, 5 respectively, using only 5 keys.
Example: ? = 23144233232, a very long sequence.
What problems do we have for characters like these??
=> At least one extra key must be used to distinguish them.
As there are more keys available, some keys can be assigned to multiple strokes.
2-stroke keys: if the first stroke is x and the second stroke is y, how many different 2-stroke keys are there?
Example
Total no. of keys now?
With these additional keys the number of key presses is reduced from 11 to 6:
23 14 42 33 23 2
With 3-stroke keys xyz, how many additional keys?
Total no. of keys?
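
The counting questions on this slide can be checked with a short sketch. It assumes every ordered combination of the 5 basic strokes gets its own key, which is the simplest reading of the slide; `group_strokes` and `total_keys` are names chosen for this example.

```python
def group_strokes(sequence: str, size: int) -> list[str]:
    """Cut a stroke sequence into key presses of up to `size` strokes each."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def total_keys(num_strokes: int, max_group: int) -> int:
    """Keys needed if every ordered group of 1..max_group strokes has a key."""
    return sum(num_strokes ** n for n in range(1, max_group + 1))


print(group_strokes("23144233232", 2))  # ['23', '14', '42', '33', '23', '2']
print(total_keys(5, 2))  # 30 = 5 single-stroke keys + 25 two-stroke keys
print(total_keys(5, 3))  # 155 = 30 + 125 three-stroke keys
```

The key count grows as a power of the group size, which is why a practical design assigns keys only to the most frequent stroke combinations.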

12
Study of character features and use patterns

Study of character frequency (based on 50,000 characters)
The 2,000 most frequently used characters: 97%
Out of those, the first 100 characters: 45%
The first 10 characters: 12%
Example: ? ? ? ? ? ? ? ? assign keys
2-stroke keys
3-stroke keys, etc.: use the most frequently used
Other considerations are:
easily identifiable
reducing the length of the key sequence
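
Coverage figures of this kind come from a cumulative frequency count over a corpus. The sketch below shows the calculation on a toy string; the sample text and the function name `coverage` are placeholders, not the data behind the slide.

```python
from collections import Counter


def coverage(text: str, top_n: int) -> float:
    """Fraction of the text accounted for by its top_n most frequent characters."""
    if not text:
        return 0.0
    counts = Counter(text)
    top = counts.most_common(top_n)
    return sum(count for _, count in top) / len(text)


sample = "aaaabbbccd"  # toy corpus: a x4, b x3, c x2, d x1
print(coverage(sample, 1))  # 0.4
print(coverage(sample, 2))  # 0.7
```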

13
Keyboard Arrangements

Some fingers are easier to control; assign a priority L. Use only the index (2nd) finger to the 5th finger for typing.
General principle: assign the more frequently used feature keys to the positions on the keyboard which are easier to reach.
One simple method:
Some keyboard rows are easier to press; assign a priority R.
Keys are ranked according to L x R.
All the selected strokes (characters and combined strokes) are ranked according to frequency of use, K.
Then map the feature keys according to rank.
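
The ranking method on this slide can be sketched as two sorts and a pairing. All numbers below are invented for illustration: the finger scores L, the row scores R, the four keys and the feature frequencies K are hypothetical, with a higher score meaning easier or more frequent.

```python
# Hypothetical ease scores (higher = easier).
FINGER_EASE = {"index": 4, "middle": 3, "ring": 2, "little": 1}  # L
ROW_EASE = {"home": 3, "top": 2, "bottom": 1}  # R

# Hypothetical keys: key -> (finger, row).
KEYS = {
    "f": ("index", "home"),
    "d": ("middle", "home"),
    "r": ("index", "top"),
    "v": ("index", "bottom"),
}

# Hypothetical feature frequencies K.
FEATURE_FREQ = {"feature_A": 120, "feature_B": 300, "feature_C": 45, "feature_D": 80}


def assign_features(keys, feature_freq):
    """Pair the most frequent features with the keys that score highest on L x R."""
    ranked_keys = sorted(
        keys,
        key=lambda k: FINGER_EASE[keys[k][0]] * ROW_EASE[keys[k][1]],
        reverse=True,
    )
    ranked_features = sorted(feature_freq, key=feature_freq.get, reverse=True)
    return dict(zip(ranked_features, ranked_keys))


print(assign_features(KEYS, FEATURE_FREQ))
```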

14
Phonetic-based IM 拼音 (Pinyin)

Romanized input method vs. native phonetic symbols based input method
Romanized letter strings (usually 1-2 characters), which can use the English keyboard readily
Native phonetic symbols are easier for people to relate to
Design problems and solutions
Homonyms (???) in GB
Without tone: only 18 characters have no homonyms. The largest set, yi, has 114.
With tone: 262 have no homonyms; the largest set is reduced to 60.
Solutions: (1) specification of tone is optional (1-4 for Putonghua and 1-9 for Cantonese), (2) use a window to show all the candidates, (3) word/bigram input.
Multiple pronunciations of the same character
Enter all possible pronunciations into the phonetic spelling database (e.g. che and kui for ? in Cantonese).
Quantitatively not a significant problem
May slow down input if extra spellings are added for fault tolerance (fuzzy input)
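
Optional tone entry and the candidate window can be sketched with a small lookup table. The table below is a tiny hand-picked sample of characters read yi in Putonghua, not the full homonym set counted on the slide, and `candidates` is a name invented for this example.

```python
# Small illustrative sample: (syllable, tone) -> characters.
PINYIN_TABLE = {
    ("yi", 1): ["一", "衣", "医"],
    ("yi", 2): ["移", "疑"],
    ("yi", 3): ["以", "已"],
    ("yi", 4): ["意", "易", "义"],
}


def candidates(spelling: str) -> list[str]:
    """Return candidate characters; a trailing digit narrows the tone."""
    if spelling and spelling[-1].isdigit():
        syllable, tone = spelling[:-1], int(spelling[-1])
        return list(PINYIN_TABLE.get((syllable, tone), []))
    result = []
    for (syllable, _tone), chars in PINYIN_TABLE.items():
        if syllable == spelling:
            result.extend(chars)
    return result


print(candidates("yi"))   # all tones: the list a candidate window would show
print(candidates("yi4"))  # tone given: a shorter list
```

Typing the tone costs one extra keystroke but shrinks the list the user has to scan, which is the trade-off behind making the tone optional.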

User problems
Some sounds are difficult to analyze
Similar consonants: /b/ vs /p/, /t/ vs /d/, /g/ vs /k/
Tone interacts with vowel: the way we say things and the standard pinyin are different, e.g. 普洱 pu3 er3 is pronounced pu2 er3 (Putonghua)
Difficult to analyze the behaviour of non-native speakers because accent interferes with phonetic analysis
Tedious to find the correct character from a set of candidates that have no apparent relationship
When the user cannot use shape-based keystroke input, then try phonetic spelling!
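
One way an input method might tolerate the consonant confusions listed above is to look up the alternative spellings as well. The sketch below is a simplified assumption about how such fuzzy input could work: it only swaps the initial consonant within the three pairs from the slide, and `fuzzy_variants` is an invented name.

```python
# Confusable initial consonants taken from the slide.
CONFUSABLE = [("b", "p"), ("d", "t"), ("g", "k")]


def fuzzy_variants(syllable: str) -> set[str]:
    """Return the syllable plus the spelling with its confusable initial swapped."""
    variants = {syllable}
    for first, second in CONFUSABLE:
        if syllable.startswith(first):
            variants.add(second + syllable[len(first):])
        elif syllable.startswith(second):
            variants.add(first + syllable[len(second):])
    return variants


print(sorted(fuzzy_variants("pu")))
print(sorted(fuzzy_variants("ma")))
```

Every extra variant means another table lookup and a longer candidate list, which is the slowdown the previous slide attributes to fault tolerance.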

16
Other IMs for Chinese

Zhuyin (注音), also called bopomofo
Chinese/Japanese phonetic symbols (similar to Katakana or Hiragana)
Includes the use of numeral keystrokes
Similar English sounds: bpmfdtnlgkhjsaor
Tone keys: . (tone 0), <space> (tone 1), 2 (tone 2), 3 (tone 3), 4 (tone 4)
One-to-one mapping to Pinyin (pages 218-219)
ㄅㄆㄇㄈ to bo, po, mo, fo
?? mapping into number keys: good for small appliances such as mobile phones, PDAs, etc.
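
The symbol-to-Pinyin mapping and the tone keys can be sketched as two small tables. Only six Zhuyin symbols are included, and the symbol-by-symbol conversion is a simplification that holds for plain syllables such as these; the tone key layout follows the slide, while `zhuyin_to_pinyin` is a name invented for this example.

```python
# A few Zhuyin symbols and their Pinyin letters (deliberately incomplete).
ZHUYIN_TO_PINYIN = {"ㄅ": "b", "ㄆ": "p", "ㄇ": "m", "ㄈ": "f", "ㄛ": "o", "ㄚ": "a"}

# Tone keys as listed on the slide: "." = tone 0, space = tone 1, digits 2 to 4.
TONE_KEYS = {".": 0, " ": 1, "2": 2, "3": 3, "4": 4}


def zhuyin_to_pinyin(symbols: str, tone_key: str = " ") -> str:
    """Convert simple Zhuyin symbols plus a tone key to numbered Pinyin."""
    letters = "".join(ZHUYIN_TO_PINYIN[symbol] for symbol in symbols)
    return f"{letters}{TONE_KEYS[tone_key]}"


print(zhuyin_to_pinyin("ㄅㄛ"))       # tone key defaults to space (tone 1)
print(zhuyin_to_pinyin("ㄇㄚ", "3"))
```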

17
Japanese and Korean

Since hiragana and katakana are both phonetic, they have a unique Romanized mapping.
Example: a i u e o, ha hi hu he ho
But separate key (native symbol) mapping is also provided (p. 248).
Romanized input and native symbol-based direct mapping are different input methods.
Similar for Korean Hangul.
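
Romanized kana input can be sketched as a longest-match lookup over the typed letters. The table covers only the ten syllables from the example above, and `romaji_to_hiragana` is a name chosen for this sketch; a real input method would cover the full syllabary.

```python
# Only the syllables from the slide's example.
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
}


def romaji_to_hiragana(text: str) -> str:
    """Convert romanized input to hiragana, trying two-letter syllables first."""
    out = []
    i = 0
    while i < len(text):
        for length in (2, 1):
            chunk = text[i:i + length]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no kana for input at position {i}: {text[i:]!r}")
    return "".join(out)


print(romaji_to_hiragana("aiueo"))
print(romaji_to_hiragana("hahihuheho"))
```

Because each kana has one romanized spelling in this table, no candidate window is needed at this stage, unlike the Pinyin case.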
