Lecture4 1
Wide character vs. Multi-byte characters
• Text information needs to be represented by the right data types.
  – Multi-byte characters: data are processed on a per-byte basis: Big5, GB, EUC, even UTF-8
  – Wide characters: fixed-width encoding, so no testing of the high bit is needed.
• Processing representation for wide characters (byte order):
  – Big Endian vs. Little Endian
    • Data type dependent: only relevant for wide characters
    • System architecture dependent
    • Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in Big Endian and as FF FE in Little Endian
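The two points above can be sketched in a few lines of Python. This is a minimal illustration, not a full decoder: `count_characters_euc` and `utf16_byte_order` are hypothetical helper names, and the first function assumes well-formed EUC-CN (GB2312) input in which every byte with the high bit set starts a two-byte character.

```python
import codecs


def count_characters_euc(data: bytes) -> int:
    """Count characters in EUC-CN (GB2312) bytes by testing the high bit."""
    count = 0
    i = 0
    while i < len(data):
        # High bit set: first byte of a two-byte character.
        # High bit clear: a single ASCII byte.
        i += 2 if data[i] & 0x80 else 1
        count += 1
    return count


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order announced by a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"
```

With wide characters the per-byte test disappears, but the reader must know the byte order before pairing bytes into 16-bit units, which is the job of the byte order mark.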
Character Input
• Input method: A scheme of mapping characters from their external representations to the internal codepoints used in computer systems.
• Classification of input methods:
  – Images:
    • Off-line character recognition (optical character recognition)
    • On-line character recognition
  – Speech: voice recognition
  – Character features: keyboard input based on glyph shapes and pronunciations.
Character Input Based on Images
• Optical Character Recognition (via image, off-line):
  – Written material --> scanner --> bitmap image file (e.g. TIFF, JPEG) --> characters (represented by an internal code)
  – Very difficult for unrestricted handwritten characters; commercially viable for printed materials, where accuracy depends on printing quality
  – The degree of difficulty increases as the total number of characters to be recognized increases
• On-line Character Recognition (by pen writing devices):
  – Handwriting information capture (pen-in, pen-out, pen-movement, on-line) --> stroke information (preprocessing with noise reduction) --> searching for the character based on the sequence of strokes
  – Commercially viable
• Speech Recognition (by voice input):
  – Capture speech by microphone --> speech signal segmentation --> speech signal converted to phonetic transcription --> phonetic spelling converted to internal code
  – Becoming commercially viable; remaining problems are non-native speakers and the conversion from colloquial speech to written text
  – Expected to become more affordable and more common in the next 5-10 years
• Keyboard-based input method: an encoding method that maps a sequence of keystrokes (on a predefined keyboard layout) to the internal code of a character.
  – Conceptually, an input method can be considered a mapping table with two columns: the 1st column X is a sequence of keys, the 2nd column Y is the corresponding internal code.
  – Uniqueness requirement: for any two internal codepoints Yi and Yj, if Yi ≠ Yj then Xi ≠ Xj, i.e. one key sequence never stands for two different characters.
• Input methods are normally language (script) dependent:
  – In GB, Chinese characters and Greek letters are entered with two different input methods, which are invoked separately.
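The mapping-table view and the uniqueness requirement can be checked mechanically. The sketch below is only an illustration: `check_uniqueness` is a hypothetical helper, and the key sequences used with it are made up rather than taken from any real input method.

```python
def check_uniqueness(table):
    """Return the key sequences that violate the uniqueness requirement.

    table is a list of (key_sequence, character) rows, i.e. the X and Y
    columns of the mapping table. The result maps each offending key
    sequence to the set of different characters it was assigned to.
    """
    seen = {}       # key sequence -> first character seen for it
    conflicts = {}  # key sequence -> all characters sharing it
    for keys, char in table:
        if keys in seen and seen[keys] != char:
            conflicts.setdefault(keys, {seen[keys]}).add(char)
        else:
            seen.setdefault(keys, char)
    return conflicts
```

An empty result means every key sequence identifies exactly one character; a non-empty result lists the sequences that would need a candidate window or an extra distinguishing key.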
• Typing in the internal code directly is straightforward, easiest to implement, and accurate, but it requires labour-intensive training and is only suitable for professionals
• Why do we need to design input methods?
  – People cannot relate characters to their internal codes
    • 憤 => 0xBCAB, 憔 => 0xBCAC
  – The number of characters is much larger than the number of keys on the keyboard => a sequence of keystrokes maps to one character
• What is the restriction? A limited number of keys (people cannot remember too many different keys with unrelated numbers)
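Typing the internal code directly is indeed the easiest input method to implement, as the sketch below suggests. It assumes the internal code is Big5 (the slide does not name the encoding), and both function names are hypothetical.

```python
def char_from_internal_code(hex_code, encoding="big5"):
    """Turn a typed hexadecimal internal code into the character it encodes."""
    return bytes.fromhex(hex_code).decode(encoding)


def internal_code_of(char, encoding="big5"):
    """Return the internal code of a character as an uppercase hex string."""
    return char.encode(encoding).hex().upper()
```

The whole "input method" is one decode call, yet the user has to memorize a four-digit hexadecimal number per character, which is why it is only practical for trained professionals.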
• What information do we have? All input methods must use some features associated with the characters: pronunciation, radicals, components, strokes, writing sequence, etc., or combinations of them.
• Different mapping methods lead to different input methods
• Users: professional typists, casual users, daily users
• Different modes of input:
  – Typing while looking at printed material
  – Typing while thinking
Design considerations:
• Ease of learning
  – Shorter learning time: easy to pick up (perhaps easy to forget), but slow input speed
  – Longer learning time: difficult to learn, but once trained, not easy to forget and faster input speed
• Mapping of features to keys on the keyboard:
  – Physical control of the different fingers and access to different key positions on the keyboard
  – Frequency analysis of the features
• Uniqueness: one-to-one mapping and user friendliness
• Equal-length keystroke sequences vs. uneven-length keystroke sequences
Input methods based on glyphs
• Problems:
  – What are the fundamental units?
  – How to put the units together (or how to form sequences)? Need to translate 2-D spatial relations into a 1-D ordering. Example: 夵 (U+5935) and 尖 (U+5C16)
  – How difficult is it to learn? Trade-off between ease of learning and speed
• Features related to glyphs:
  – Strokes (筆劃): 點 橫 豎 撇 捺
  – Radicals (偏旁): mostly for indexing, not unique
  – Components (部件): 女 and 且 in 姐 組
  – Character (整字): 甘
  – Spatial relations (方位關係): left-right, upper-lower, etc.
Principles of Input method design
• Design example: using strokes only
• Suppose we assign the five basic strokes to keys 1, 2, 3, 4, 5 respectively, using only 5 keys
• Example: 哲 => 23144233232, a very long sequence
• What problem do we have with characters like 岭 and 岺? They are written with the same strokes in the same order, so at least one extra key must be used to distinguish them
• As there are more keys available, some keys can be assigned to multiple strokes:
• 2-stroke keys: if the first stroke is x and the second stroke is y, how many different 2-stroke keys are there?
  – Example:
• Total number of keys now?
• With these additional keys, the number of key presses is reduced to: 23 14 42 33 23 2
• With 3-stroke keys xyz, additional keys:
• Total number of keys:
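The trade-off between the number of keys and the number of key presses can be worked through with a small sketch. It assumes that every ordered combination of the five basic strokes gets its own key, which is an upper bound the slides do not state; `group_strokes` and `keys_needed` are hypothetical helper names.

```python
def group_strokes(sequence, max_strokes_per_key):
    """Split a stroke sequence into key presses of at most n strokes each."""
    n = max_strokes_per_key
    return [sequence[i:i + n] for i in range(0, len(sequence), n)]


def keys_needed(num_basic_strokes, max_strokes_per_key):
    """Count the keys if every 1..n stroke combination has its own key."""
    return sum(num_basic_strokes ** k
               for k in range(1, max_strokes_per_key + 1))


presses = group_strokes("23144233232", 2)
print(" ".join(presses), "->", len(presses), "key presses")
print("keys with 2-stroke keys:", keys_needed(5, 2))
print("keys with 3-stroke keys:", keys_needed(5, 3))
```

Under that assumption, shortening the sequence for 哲 from 11 presses to 6 costs 25 extra keys, and going to 3-stroke keys costs far more than a keyboard offers, so a practical design would keep only the most frequent combinations.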
Study of character features and use patterns
• Study of character frequency (based on 50,000 characters)
  – The 2,000 most frequently used characters: 97%
  – Of those, the first 100 characters: 45%
  – The first 10 characters: 12%
  – Example: 有 的 口 是 我 不 女 日: assign keys
  – 2-stroke keys:
  – 3-stroke keys, etc.: use the most frequently used ones
• Other considerations are:
  – easily identifiable
  – reducing the length of the key sequence
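Coverage figures such as those above come from counting characters in a corpus and summing the counts of the top-ranked ones. A minimal sketch follows; `coverage` is a hypothetical helper, and any text passed to it here is a toy sample rather than the 50,000-character study.

```python
from collections import Counter


def coverage(text, top_n):
    """Fraction of the text covered by its top_n most frequent characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total
```

The same ranking tells the designer which characters, or stroke combinations, deserve a key of their own.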
Keyboard Arrangements
• Some fingers are easier to control: assign each finger a priority L, using only the index finger (2nd finger) to the 5th finger for typing.
• General principle: assign the more frequently used feature keys to the keyboard positions that are easier to reach
• One simple method:
  – Some keyboard rows are easier to press: assign each row a priority R
  – Keys are ranked according to L×R
  – All the selected strokes (characters, and combined strokes) are ranked according to frequency of use, K
  – Then map the features to the keys according to rank.
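The rank-matching method above can be sketched as follows. The priority values, key positions, and feature frequencies passed in are hypothetical inputs chosen by the designer; `assign_features` is an illustrative name, and a larger priority is taken to mean "easier to reach".

```python
def assign_features(finger_priority, row_priority, key_positions,
                    feature_frequency):
    """Map the most frequent features to the easiest keys.

    key_positions maps a key label to its (finger, row) pair.
    A key's score is L x R; features are ranked by frequency K.
    """
    def score(key):
        finger, row = key_positions[key]
        return finger_priority[finger] * row_priority[row]

    ranked_keys = sorted(key_positions, key=score, reverse=True)
    ranked_features = sorted(feature_frequency,
                             key=feature_frequency.get, reverse=True)
    # Pair rank 1 with rank 1, rank 2 with rank 2, and so on.
    return dict(zip(ranked_features, ranked_keys))
```

Ties in L×R or in K are broken by input order here; a real design would also weigh hand alternation and how easily the user can remember the assignment.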
Phonetic-based IM: 拼音 (Pinyin)
• Romanized input method vs. input method based on native phonetic symbols
  – Romanized letter strings (usually 1-2 characters), which can use the English keyboard readily
  – Native phonetic symbols are easier for people to relate to
• Design problems and solutions:
  – Homonyms (同音字) in GB:
    • Without tone: only 18 characters have no homonyms. The largest homonym set, yi, has 114 characters.
    • With tone: 262 characters have no homonyms, and the largest set is reduced to 60.
    • Solutions: (1) specification of tone is optional (1-4 for Putonghua and 1-9 for Cantonese), (2) use a window to show all the candidates, (3) word/bigram input.
  – Multiple pronunciations of the same character: enter all possible pronunciations into the phonetic spelling database (e.g. che and kui for 車 in Cantonese).
    • Quantitatively not a significant problem
    • May slow down input when extra spellings are accepted for fault tolerance (fuzzy input)
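Solutions (1) and (2) can be sketched as a lookup with an optional tone digit. The table below is a tiny hand-made sample, not the GB character set, and `candidates` is a hypothetical function name.

```python
# Toy phonetic spelling database: syllable -> tone -> characters.
PINYIN_TABLE = {
    "yi": {1: "一衣医", 2: "移疑", 3: "以已", 4: "意义易"},
    "ma": {1: "妈", 2: "麻", 3: "马", 4: "骂"},
}


def candidates(spelling):
    """List the characters to show in the candidate window.

    A trailing digit selects one tone; without it all tones are listed.
    """
    if spelling and spelling[-1].isdigit():
        syllable, tone = spelling[:-1], int(spelling[-1])
        return list(PINYIN_TABLE.get(syllable, {}).get(tone, ""))
    by_tone = PINYIN_TABLE.get(spelling, {})
    return [ch for tone in sorted(by_tone) for ch in by_tone[tone]]
```

Typing `yi` lists every homonym, while `yi4` narrows the window to the fourth-tone characters, which mirrors how tone specification shrinks the largest homonym set.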
• User problems:
  – Some sounds are difficult to analyze:
    • Similar consonants: /b/ vs /p/, /t/ vs /d/, /g/ vs /k/
    • Tones interact: the way we say things differs from the standard pinyin, e.g. 普洱 pu3 er3 is pronounced pu2 er3 (Putonghua)
  – Difficult to analyze the behaviour of non-native speakers because accent interferes with phonetic analysis
  – Tedious to find the correct character in a set of candidates that have no apparent relationship to each other
• When the user cannot use shape-based keystroke input, try phonetic spelling!
Other IMs for Chinese
• Zhuyin (注音) [also called bopomofo]
  – Chinese phonetic symbols (similar to Japanese Katakana or Hiragana)
  – Includes the use of numeral keystrokes
  – Similar English sounds: b p m f d t n l g k h j s a o r
  – Tone: . (tone 0), <space> (tone 1), 2 (tone 2), 3 (tone 3), 4 (tone 4)
  – One-to-one mapping to Pinyin (pages 218-219): ㄅㄆㄇㄈ to bo, po, mo, fo
• 九方: mapping into number keys, good for small appliances: mobile phone, PDA, etc.
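The Zhuyin-to-Pinyin correspondence and the tone keys listed above can be sketched with two small tables. Only a handful of symbols are included, the converter handles just "initial + single final + tone key" syllables, and `zhuyin_to_pinyin` is a hypothetical name, so this is an illustration rather than a complete converter.

```python
ZHUYIN_INITIALS = {"ㄅ": "b", "ㄆ": "p", "ㄇ": "m", "ㄈ": "f",
                   "ㄉ": "d", "ㄊ": "t", "ㄋ": "n", "ㄌ": "l",
                   "ㄍ": "g", "ㄎ": "k", "ㄏ": "h"}
ZHUYIN_FINALS = {"ㄚ": "a", "ㄛ": "o", "ㄜ": "e", "ㄧ": "i", "ㄨ": "u"}
TONE_KEYS = {".": 0, " ": 1, "2": 2, "3": 3, "4": 4}  # keys from the slide


def zhuyin_to_pinyin(keys):
    """Convert e.g. 'ㄇㄚ3' to 'ma3' (initial, final, then tone key)."""
    if len(keys) != 3:
        raise ValueError("expected initial + final + tone key")
    initial, final, tone_key = keys
    try:
        return (ZHUYIN_INITIALS[initial] + ZHUYIN_FINALS[final]
                + str(TONE_KEYS[tone_key]))
    except KeyError:
        raise ValueError(f"unsupported symbol in {keys!r}") from None
```

Because each symbol maps to exactly one Pinyin spelling in these tables, the same phonetic database can serve both the Romanized and the native-symbol input method.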
Japanese and Korean
• Since hiragana and katakana are both phonetic, they have a unique Romanized mapping
• Example: a i u e o, ha hi hu he ho
• But a separate key mapping (native symbols) is also provided (p. 248)
• Romanized input and native symbol-based direct mapping are different input methods
• Similar for Korean Hangul
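Romanized kana input can be sketched as a longest-match table lookup. The table covers only the ten syllables from the example above, and `to_hiragana` is a hypothetical function name.

```python
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
}


def to_hiragana(romaji):
    """Convert a Romanized string to hiragana, longest match first."""
    out = []
    i = 0
    while i < len(romaji):
        for length in (2, 1):
            chunk = romaji[i:i + length]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no kana for input at position {i}")
    return "".join(out)
```

A native-symbol input method would skip this step and bind each kana to its own key, which is why the two are treated as different input methods.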