Wide character vs. Multi-byte characters - PowerPoint PPT Presentation

Slides: 18
Provided by: xxxx150
Transcript and Presenter's Notes

Title: Wide character vs. Multi-byte characters


1
Wide character vs. Multi-byte characters
  • Text information needs to be represented by the
    right data types.
  • Multi-byte character data is processed on a
    per-byte basis: Big5, GB, EUC, even UTF-8.
  • Wide characters: fixed-width encoding, so no
    testing of the high bit is needed.
  • Processing representation for wide characters:
  • Big Endian vs. Little Endian
  • Data-type dependent only for wide characters
  • System-architecture dependent
  • Distinction: the byte order mark U+FEFF is stored
    as bytes FE FF in Big Endian and as FF FE in
    Little Endian

2
Character Input
  • Input method: a scheme for mapping characters
    from their external representations to the
    internal codepoints used in computer systems.
  • Classification of input methods:
  • Images
  • Off-line character recognition (optical character
    recognition)
  • On-line character recognition
  • Speech: voice recognition
  • Character features: keyboard input based on glyph
    shapes and pronunciations.

3
Character Input Based on Images
  • Optical Character Recognition (via image,
    off-line)
  • Written material --> scanner --> bitmap image
    file (e.g. TIFF, JPEG) --> characters
    (represented by an internal code)
  • Very difficult for unrestricted handwritten
    characters; commercially viable for printed
    materials, where accuracy depends on printing
    quality
  • Degree of difficulty increases as the total
    number of characters to be recognized increases
  • On-line Character Recognition (by pen writing
    devices)
  • Handwriting information capture (pen-in, pen-out,
    pen-movement, on-line) --> stroke information
    (pre-processing with noise reduction) -->
    search for the character based on the sequence
    of strokes
  • Commercially viable

4
  • Speech Recognition (by voice input)
  • Capture speech by microphone --> speech signal
    segmentation --> speech signal converted to
    phonetic transcription --> phonetic spelling
    converted to internal code
  • Becoming commercially viable; problems with
    non-native speakers and with conversion from
    colloquial to written text
  • Expected to become more affordable and more
    common in the next 5-10 years

5
  • Keyboard-based input method: an encoding method
    which maps a sequence of keystrokes (with a
    predefined keyboard layout) to the internal code
    of a character.
  • Conceptually, an input method can be considered
    a mapping table with two columns: the 1st column,
    X, is a sequence of keys; the 2nd column, Y, is
    the corresponding internal code.
  • Uniqueness requirement: for any two internal
    codepoints Yi and Yj, if Yi ≠ Yj then Xi ≠ Xj.
  • Input methods are normally language (script)
    dependent
  • Input for Chinese and for Greek letters in GB are
    two different input methods and are thus
    separately invoked.
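
The two-column view of an input method and its uniqueness requirement can be sketched as follows. The key sequences are invented for illustration and do not come from any real input method; the internal codes are produced by Python's `gb2312` codec. The helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-column tables: X (key sequence) -> Y (internal GB code).
CHINESE_TABLE = [
    ("a", "啊".encode("gb2312")),
    ("zh", "中".encode("gb2312")),
]
GREEK_TABLE = [
    ("a", "α".encode("gb2312")),
    ("b", "β".encode("gb2312")),
]

def violates_uniqueness(table):
    """Return the key sequences that map to more than one internal code."""
    codes_by_keys = defaultdict(set)
    for keys, code in table:
        codes_by_keys[keys].add(code)
    return sorted(k for k, codes in codes_by_keys.items() if len(codes) > 1)

def lookup(table, keys):
    """Return the internal code for a key sequence, or None if absent."""
    return dict(table).get(keys)

print(violates_uniqueness(CHINESE_TABLE))                # no conflicts
print(violates_uniqueness(CHINESE_TABLE + GREEK_TABLE))  # "a" is ambiguous
print(lookup(GREEK_TABLE, "a").hex())
```

Each table is unambiguous on its own, but merging them makes the key sequence "a" ambiguous, which is one way to see why the two methods are invoked separately.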

6
  • Typing in the internal code is straightforward,
    easiest to implement, and accurate, but requires
    labour-intensive training; only good for
    professionals
  • Why do we need to design input methods?
  • People cannot relate characters to their internal
    codes
  • ? => (BCAB hex), ? => (BCAC hex)
  • The number of characters is much larger than the
    number of keys on the keyboard => a sequence of
    keystrokes maps into one character
  • What is the restriction? A limited number of
    keys (people cannot remember too many different
    keys with unrelated numbers)
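
As a small illustration of internal-code input, the sketch below converts between a character and the hexadecimal code a typist would have to memorize. It assumes GB2312 as the internal code; the function names are hypothetical.

```python
def char_from_internal_code(hex_code: str, encoding: str = "gb2312") -> str:
    """Turn a typed hexadecimal internal code into the character it denotes."""
    return bytes.fromhex(hex_code).decode(encoding)

def internal_code_of(char: str, encoding: str = "gb2312") -> str:
    """Return the hexadecimal internal code of a character."""
    return char.encode(encoding).hex().upper()

print(internal_code_of("啊"))            # B0A1
print(char_from_internal_code("B0A1"))   # 啊
```

The mapping is trivial to implement and unambiguous, but nothing about the code B0A1 hints at the character, which is the usability problem described above.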

7
  • What information do we have?
  • All input methods must use some features
    associated with the characters: pronunciation,
    radicals, components, strokes, writing sequence,
    etc., or combinations of them.
  • Different mapping methods lead to different
    input methods
  • Users: professional typists, casual users, daily
    users
  • Different modes of input:
  • Typing while looking at printed material
  • Typing while thinking
8
Design considerations
  • Ease of learning
  • Shorter learning time: easy to pick up (perhaps
    easy to forget), but slow input speed
  • Longer learning time: difficult to learn, but
    once you are trained, not easy to forget and
    faster input speed
  • Mapping of features to keys on the keyboard
  • Physical control of the different fingers and
    access to different key positions on the keyboard
  • Frequency analysis of the features
  • Uniqueness: one-to-one mapping and user
    friendliness
  • Equal keystroke sequence vs. uneven keystroke
    sequence

9
Input methods based on glyphs
  • Problems
  • What are the fundamental units?
  • How to put the units together (or how to form
    sequences)? Need to translate 2-D spatial
    relations into 1-D ordering
  • Example: 夵 (U+5935) and 尖 (U+5C16)
  • How difficult is it to learn? Trade-off between
    ease of learning and speed
  • Features related to glyphs
  • Strokes (??): ? ? ? ? ?
  • Radicals (??): for indexing mostly, not unique
  • Components (??): ? and ? in ??
  • Character (??): ?
  • Spatial relations (????): left-right, upper-lower,
10
Principles of Input method design
  • Design example using strokes only
  • Suppose we assign the strokes to keys 1, 2, 3, 4,
    5, respectively, using only 5 keys
  • Example: ?, 23144233232, a very long sequence
  • What problems do we have for characters like
    these??
  • => At least one extra key must be used to
    distinguish them
  • As more keys are available, some keys can
    be assigned to multiple strokes
11
  • 2-stroke keys: if the first stroke is x and the
    second stroke is y, how many different 2-stroke
    keys are there?
  • Example
  • Total no. of keys now?
  • With these additional keys the number of key
    presses is reduced to:
  • 23 14 42 33 23 2
  • With 3-stroke keys xyz, additional keys
  • Total no. of keys
12
Study of character features and use patterns
  • Study of character frequency (based on
    50,000 characters)
  • The 2,000 most frequently used characters: 97%
  • Of those, the first 100 characters: 45%
  • The first 10 characters: 12%
  • Example: ? ? ? ? ? ? ? ? assign keys
  • 2-stroke keys
  • 3-stroke keys, etc.: use the most frequently used
  • Other considerations are
  • easily identifiable
  • reducing the length of key sequence
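
Coverage figures of this kind come from counting characters in a corpus. A minimal sketch, assuming a plain-text corpus held in a string and ignoring whitespace (the function name and the toy input are made up for illustration):

```python
from collections import Counter

def coverage(text: str, k: int) -> float:
    """Fraction of character occurrences covered by the k most frequent characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total if total else 0.0

sample = "aaabbc"
print(coverage(sample, 1))   # 0.5
print(coverage(sample, 3))   # 1.0
```

Run on a real corpus, the same function would show how quickly a small set of characters accounts for most of the text, which is what justifies giving those characters the shortest key sequences.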

13
Keyboard Arrangements
  • Some fingers are easier to control: assign each
    a priority L; use only the index (2nd) finger to
    the 5th finger for typing.
  • General principle: assign the more frequently
    used feature keys to the positions on the
    keyboard that are easier to reach
  • One simple method:
  • Some keyboard rows are easier to press: assign
    each a priority R
  • Keys are ranked according to L × R
  • All the selected strokes (characters and combined
    strokes) are ranked according to frequency of
    use, K
  • Then map the feature keys according to rank.
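
The ranking method above can be sketched directly. All numbers below are hypothetical: the finger scores L, the row scores R, and the feature frequencies K are invented for illustration, and the finger and row of each key follow standard QWERTY touch typing for the left hand.

```python
# Hypothetical ease scores (higher = easier); not taken from the slides.
FINGER_EASE = {"index": 4, "middle": 3, "ring": 2, "little": 1}   # L
ROW_EASE = {"home": 3, "top": 2, "bottom": 1}                     # R

# key -> (finger, row)
KEYS = {
    "f": ("index", "home"),
    "d": ("middle", "home"),
    "s": ("ring", "home"),
    "a": ("little", "home"),
    "r": ("index", "top"),
    "v": ("index", "bottom"),
}

# Hypothetical frequency of use K for each feature.
FEATURE_FREQ = {
    "horizontal": 300,
    "vertical": 190,
    "turning": 180,
    "dot": 170,
    "left-falling": 160,
}

def key_score(key: str) -> int:
    """Ease of a key, computed as L x R."""
    finger, row = KEYS[key]
    return FINGER_EASE[finger] * ROW_EASE[row]

def assign_features(keys, features):
    """Give the most frequent feature the easiest key, and so on down the ranks."""
    ranked_keys = sorted(keys, key=key_score, reverse=True)
    ranked_features = sorted(features, key=features.get, reverse=True)
    return dict(zip(ranked_features, ranked_keys))

print(assign_features(KEYS, FEATURE_FREQ))
```

With these made-up scores, the most frequent feature lands on the index-finger home-row key and the least frequent on a harder-to-reach key.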

14
Phonetic-based IM: 拼音 (Pinyin)
  • Romanized input method vs. native phonetic
    symbol-based input method
  • Romanized letter strings (usually 1-2 characters),
    which can use the English keyboard readily
  • Native phonetic symbols are easier for people to
    relate to
  • Design Problems and Solutions
  • Homonyms (???) in GB
  • Without tone: only 18 characters have no
    homonyms. The largest set, yi, has 114.
  • With tone: 262 have no homonyms; the largest set
    is reduced to 60.
  • Solutions: (1) specification of tone is optional
    (1-4 for Putonghua and 1-9 for Cantonese), (2)
    use a window to show all the candidates, (3)
    word/bigram input.
  • Multiple pronunciations of the same character:
    enter all possible pronunciations into the
    phonetic spelling database (e.g. che and kui for
    ? in Cantonese).
  • Quantitatively not a significant problem
  • May slow down input when done for fault tolerance
    (fuzzy input)
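
Two of the solutions above, an optional tone and a candidate window, plus the handling of multiple pronunciations, can be sketched with a toy phonetic spelling database. The entries are a tiny hand-picked sample for Putonghua, not a real dictionary, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy phonetic spelling database: (character, pinyin with tone digit).
ENTRIES = [
    ("一", "yi1"), ("衣", "yi1"), ("医", "yi1"),
    ("移", "yi2"), ("以", "yi3"), ("已", "yi3"),
    ("意", "yi4"), ("易", "yi4"),
    ("行", "xing2"), ("行", "hang2"),   # one character, two pronunciations
    ("银", "yin2"),
]

def build_index(entries):
    """Index every entry both with and without its tone digit."""
    index = defaultdict(list)
    for char, spelling in entries:
        index[spelling].append(char)        # tone given
        index[spelling[:-1]].append(char)   # tone omitted
    return index

INDEX = build_index(ENTRIES)

def candidates(typed: str):
    """Return the candidate window for a typed spelling; the tone is optional."""
    return INDEX.get(typed, [])

print(candidates("yi"))     # every yi character, all tones
print(candidates("yi4"))    # the tone narrows the window
print(candidates("hang"), candidates("xing2"))
```

Typing the tone shrinks the candidate list, and a character with several pronunciations is simply reachable under each of its spellings.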

15
  • User Problems
  • Some sounds are difficult to analyze:
  • Similar consonants: /b/ vs /p/, /t/ vs /d/, /g/
    vs /k/
  • Tone interacts with vowels: the way we say things
    and the standard pinyin differ, e.g. 普洱 pu3 er3
    is spoken as pu2 er3 (Putonghua)
  • Difficult to analyze the behaviour of non-native
    speakers because accent interferes with
    phonetic analysis
  • Tedious to find the correct character in a
    set of candidates that have no apparent
    relationship to one another
  • When the user cannot use shape-based keystroke
    input, try phonetic spelling!

16
Other IMs for Chinese
  • Zhuyin (注音), also called bopomofo
  • Chinese/Japanese phonetic symbols (similar to
    Katakana or Hiragana)
  • Includes the use of numeral keystrokes
  • Similar English sounds: b p m f d t n l g k h j
    s a o r
  • Tone keys: . (tone 0), <space> (tone 1), 2 (tone
    2), 3 (tone 3), 4 (tone 4)
  • One-to-one mapping to Pinyin (pages 218-219)
  • ㄅㄆㄇㄈ to bo, po, mo, fo
  • ?? mapping into number keys: good for small
    appliances (mobile phone, PDA, etc.)
17
Japanese and Korean
  • Since hiragana and katakana are both phonetic,
    they have a unique Romanized mapping
  • Example: a i u e o, ha hi hu he ho
  • But a separate key (native symbol) mapping is
    also provided (p. 248)
  • Romanized input and native symbol-based direct
    mapping input methods are different
  • Similar for Korean Hangul
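
Romanized kana input can be sketched as a longest-match table lookup. The table holds only the ten syllables from the example above, using the same romanization (hu rather than fu); the function name is hypothetical.

```python
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
}

def to_hiragana(romaji: str) -> str:
    """Convert a romanized string to hiragana, trying two-letter syllables first."""
    out = []
    i = 0
    while i < len(romaji):
        for length in (2, 1):                  # longest match first
            piece = romaji[i:i + length]
            if piece in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no kana for {romaji[i:]!r}")
    return "".join(out)

print(to_hiragana("aiueo"))
print(to_hiragana("hahihuheho"))
```

A native symbol-based method would skip this conversion step and bind each kana directly to a key.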