
Lecture4 1

Wide character vs. Multi-byte characters

• Text information needs to be represented by the right data types.

– Multi-byte characters: data are processed on a per-byte basis: Big5, GB, EUC, even UTF-8

– Wide characters: fixed-width encoding (the same number of bytes for every character), so no testing of the high bit is needed.

• Processing representation for wide characters:

– Big Endian vs. Little Endian

• Data type dependent: only for wide characters

• System architecture dependent

• Distinction: the byte order mark U+FEFF is stored as the bytes FE FF (0xFEFF) in Big Endian and as FF FE (0xFFFE) in Little Endian
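The two ideas on this slide can be sketched in a few lines of Python. This is an illustration only: the function names `split_euc_cn` and `utf16_byte_order` are made up for this example, and the splitter assumes well-formed GB2312 (EUC-CN) input in which every byte with the high bit set starts a two-byte character.

```python
import codecs


def split_euc_cn(data):
    """Split GB2312 (EUC-CN) bytes into characters by testing the high bit."""
    chars = []
    i = 0
    while i < len(data):
        # High bit set: lead byte of a two-byte character. Otherwise: ASCII.
        width = 2 if data[i] & 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from its byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"


if __name__ == "__main__":
    mixed = "a中b".encode("gb2312")
    print(split_euc_cn(mixed))
    print(utf16_byte_order(codecs.BOM_UTF16_LE + "中".encode("utf-16-le")))
```

With a wide-character (fixed-width) encoding the first function is unnecessary, because every character occupies the same number of bytes; the byte order question only arises in that case.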

Lecture4 2

Character Input

• Input method: A scheme of mapping characters from their external representations to the internal codepoints used in computer systems.

• Classification of input methods:

– Images:

• Off-line character recognition (Optical character recognition)

• On-line character recognition

– Speech: voice recognition

– Character features: Keyboard input based on glyph shapes and pronunciations.

Lecture4 3

Character Input Based on Images

• Optical Character Recognition (via image, off-line):

– Written material --> scanner --> bitmap image file (e.g. TIFF, JPEG) --> characters (represented by an internal code)

– Very difficult for unrestricted handwritten characters; commercially viable for printed materials, where accuracy depends on printing quality

– Degree of difficulty increases when the total number of characters to be recognized increases

• On-line character recognition (by pen writing devices):

– Handwriting information capture (pen-in, pen-out, pen-movement, on-line) --> Stroke information (pre-processing with noise reduction) --> Searching for the character based on the sequence of strokes.

– commercially viable

Lecture4 4

• Speech Recognition (by voice input):

– Capture speech by microphones --> speech signal segmentation --> speech signal converted to phonetic transcription --> phonetic spelling converted to internal code.

– Becoming commercially viable; problems remain with non-native speakers and with conversion from colloquial to written text

– More affordable and expected to become common in the next 5-10 years

Lecture4 5

• Keyboard-based input method: an encoding method which maps a sequence of keystrokes (with a predefined keyboard layout) to the internal code of a character.

– Conceptually, an input method can be considered as a mapping table with two columns: 1st column X is a sequence of keys, 2nd column Y is the corresponding internal code.

– Uniqueness requirement: for any two internal codepoints Yi and Yj, if Yi ≠ Yj then Xi ≠ Xj.

• Input methods are normally language (script) dependent:

– In GB, input of Chinese characters and input of Greek letters are two different input methods and are thus invoked separately.
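The two-column mapping table and its uniqueness requirement can be sketched as follows. The key sequences and the helper names `find_conflicts` and `build_table` are hypothetical; only the idea (a key sequence X must not lead to two different internal codes Y) comes from the slide.

```python
def find_conflicts(rows):
    """Return the key sequences that are mapped to more than one internal code.

    rows is a list of (key_sequence, internal_code) pairs, i.e. the two
    columns X and Y of the mapping table.
    """
    first_code = {}
    conflicts = set()
    for keys, code in rows:
        if keys in first_code and first_code[keys] != code:
            conflicts.add(keys)
        first_code.setdefault(keys, code)
    return sorted(conflicts)


def build_table(rows):
    """Build the lookup table, refusing rows that break uniqueness."""
    conflicts = find_conflicts(rows)
    if conflicts:
        raise ValueError("ambiguous key sequences: %s" % ", ".join(conflicts))
    return dict(rows)


if __name__ == "__main__":
    # Hypothetical key sequences paired with two-byte internal codes.
    rows = [("abc", 0xBCAB), ("abd", 0xBCAC)]
    table = build_table(rows)
    print(hex(table["abd"]))
```

Many practical input methods relax this requirement and show a candidate window when one key sequence matches several characters; the strict check above models only the idealised table.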

Lecture4 6

• Typing in the internal code is straightforward, easiest to implement, and accurate, but it requires labour-intensive training and is only good for professionals

• Why do we need to design input methods:

– People cannot relate characters with internal code

• 憤 => (0xBCAB) 憔 => (0xBCAC)

– The number of characters is much larger than the number of keys on the keyboard => a sequence of keystrokes maps into one character

• What is the restriction: limited number of keys (people cannot remember too many different keys with unrelated numbers)
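A rough calculation shows why key sequences are unavoidable. The sketch below assumes fixed-length sequences and that every possible sequence is usable; `min_sequence_length` is a name invented for this example. The figure 6763 is the number of Chinese characters in GB2312.

```python
def min_sequence_length(num_chars, num_keys):
    """Smallest n such that num_keys ** n >= num_chars."""
    length = 1
    capacity = num_keys
    while capacity < num_chars:
        length += 1
        capacity *= num_keys
    return length


if __name__ == "__main__":
    # 26 letter keys: 26**2 = 676 is too few, 26**3 = 17576 is enough.
    print(min_sequence_length(6763, 26))  # 3
    # Only 5 keys: 5**5 = 3125 is too few, 5**6 = 15625 is enough.
    print(min_sequence_length(6763, 5))   # 6
```

Fewer keys are easier to remember but force longer sequences, which is the trade-off the restriction above describes.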

Lecture4 7

• What information do we know? All input methods must use some features associated with the characters: pronunciation, radicals, components, strokes, writing sequence, etc., or combinations of them.

• Different mapping methods lead to different input methods

• Users: Professional typists, casual users, daily users

• Different modes of input:

– Typing by looking at printed material

– Typing while thinking

Lecture4 8

Design considerations:

• Ease of learning

– Shorter learning time: Easy to pick up (perhaps easy to forget), but slow input speed

– Longer learning time: Difficult to learn, but once you are trained, not easy to forget and faster input speed

• Mapping of features to keys on the keyboard:

– Physical control of the different fingers and access to different key positions on the keyboard

– Frequency analysis of the features

• Uniqueness: one to one mapping and user friendliness

• Equal keystroke sequence vs. uneven keystroke sequence

Lecture4 9

Input methods based on glyphs

• Problems:

– What are the fundamental units?

– How to put the units together (or how to form sequences)? Need to translate 2-D spatial relations into 1-D ordering. Example: 夵 (U+5935) and 尖 (U+5C16)

– How difficult is it to learn? Trade-off between ease of learning and speed

• Features related to glyphs:

– Strokes (筆劃): 點橫豎撇捺 (dot, horizontal, vertical, left-falling, right-falling)

– Radicals (偏旁): for indexing mostly, not unique

– Components (部件): 女 and 且 in 姐組

– Character (整字): 甘

– Spatial relations (方位關係): left-right, upper-lower,
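The 夵 / 尖 example can be made concrete with a tiny lookup keyed on a 1-D ordering of components. This is a sketch under one assumed convention (components are listed top to bottom); the table and the function `lookup_by_components` are hypothetical.

```python
# Hypothetical table: components listed from top to bottom -> character.
COMPONENT_TABLE = {
    ("大", "小"): "夵",  # 大 above 小
    ("小", "大"): "尖",  # 小 above 大
}


def lookup_by_components(components):
    """Return the character for a top-to-bottom component sequence, or None."""
    return COMPONENT_TABLE.get(tuple(components))


if __name__ == "__main__":
    print(lookup_by_components(["大", "小"]))
    print(lookup_by_components(["小", "大"]))
```

Both characters use the same two components, so the ordering rule is what keeps them apart. An ordering rule alone does not separate characters whose components appear in the same order but in a different layout, so some methods also encode the spatial relation itself.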

Lecture4 10

Principles of Input method design

• Design example: using strokes only

• Suppose we assign the strokes to keys 1,2,3,4,5, respectively, using only 5 keys

• Example: 哲, 23144233232, a very long sequence

• What problems do we have for characters like these: 岭 岺 => At least an extra key must be used to distinguish them

• As there are more keys available, some keys can be assigned to multiple strokes:

Lecture4 11

• 2-stroke keys: if the first stroke is x, second stroke is y, how many different 2-stroke keys?

– Example:

• Total No. of keys now?

• With these additional keys the number of key presses is reduced to:

23 14 42 33 23 2

• With 3 stroke keys: xyz, additional keys:

• Total No. of keys:
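The counting questions on this slide can be explored with a short sketch. It assumes, purely for illustration, that every ordered combination of the five basic strokes gets its own key; a practical design would more likely keep only the frequent combinations. The names `total_keys` and `group_strokes` are invented here.

```python
BASIC_STROKES = 5  # keys 1-5, one per basic stroke


def total_keys(max_group):
    """Keys needed if every combination of 1..max_group strokes has a key."""
    return sum(BASIC_STROKES ** n for n in range(1, max_group + 1))


def group_strokes(sequence, size):
    """Cut a stroke sequence into key presses of at most `size` strokes."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


if __name__ == "__main__":
    zhe = "23144233232"  # stroke sequence for 哲 from the slide
    print(total_keys(2), group_strokes(zhe, 2))
    print(total_keys(3), group_strokes(zhe, 3))
```

Under this assumption there are 5 x 5 = 25 two-stroke keys (30 keys in total) and 5 x 5 x 5 = 125 three-stroke keys (155 in total). Grouping by two reproduces the six presses 23 14 42 33 23 2 shown above, and grouping by three needs four presses.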

Lecture4 12

Study of character features and use patterns

• Study of character frequency (based on 50,000 char.)

– 2,000 most frequently used characters: 97%

– out of that: first 100 characters: 45%

– the first 10 characters: 12%

– Example: 有的口是我不女日 : assign keys

– 2-stroke keys:

– 3-stroke keys, etc.: use the most frequently used

• Other considerations are:

– easily identifiable

– reducing the length of key sequence
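Coverage figures like those above come from counting characters in a corpus. A minimal sketch of such a count follows; the function name `coverage` and the toy input are assumptions, and the sketch does not reproduce the study quoted on the slide.

```python
from collections import Counter


def coverage(text, top_n):
    """Fraction of the text covered by its top_n most frequent characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


if __name__ == "__main__":
    sample = "的的的的是是我不"
    for n in (1, 2, 4):
        print(n, coverage(sample, n))
```

The same counting, applied to strokes or components instead of whole characters, gives the frequency ranking used to decide which features deserve their own keys.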

Lecture4 13

Keyboard Arrangements

• Some fingers are easier to control: assign each finger a priority L. Only the index finger (2nd finger) to the 5th finger are used for typing.

• General Principle: Assign the more frequently used feature keys to the positions on the keyboard which are easier to reach

• One simple method:

– Some keyboard rows are easier to press: assign each row a priority R

– Keys are ranked according to L x R

– All the selected strokes (characters, and combined strokes) are ranked according to frequency of use, K

– Then map the feature keys according to rank.
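The ranking method can be sketched as two sorts followed by a pairing. All scores below are invented for illustration, and the sketch assumes that a larger L or R means "easier"; `assign_features` is a hypothetical helper.

```python
def assign_features(key_scores, feature_freq):
    """Pair the most frequent features with the easiest keys.

    key_scores:   {key: (L, R)} finger priority and row priority per key
    feature_freq: {feature: K}  frequency of use per feature
    """
    ranked_keys = sorted(
        key_scores,
        key=lambda k: key_scores[k][0] * key_scores[k][1],
        reverse=True,
    )
    ranked_features = sorted(feature_freq, key=feature_freq.get, reverse=True)
    # zip stops at the shorter list, so surplus keys or features stay unassigned.
    return dict(zip(ranked_features, ranked_keys))


if __name__ == "__main__":
    keys = {"f": (4, 3), "j": (4, 2), "q": (1, 2)}   # made-up L and R values
    features = {"一": 50, "丨": 30, "丶": 10}          # made-up frequencies K
    print(assign_features(keys, features))
```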

Lecture4 14

Phonetic-based IM: 拼音 (Pinyin)

• Romanized input method vs. native phonetic symbols based input method

– Romanized letter strings (usually 1-2 characters) which can use the English keyboard readily

– Native phonetic symbols are easier for people to relate to

• Design Problems and Solutions:

– Homonyms (同音字) in GB:

• No tone: only 18 characters have no homonyms. The largest set, yi, has 114.

• With tone: 262 have no homonyms; the largest set is reduced to 60.

• Solutions: (1) Specification of tone is optional (1-4 for Putonghua and 1-9 for Cantonese), (2) use a window to show all the candidates, (3) word/bigram input.

– Multiple pronunciations of the same character: enter all possible pronunciations into the phonetic spelling database (e.g. che and kui for 車 in Cantonese).

• Quantitatively not a significant problem

• May slow down input when this is done for fault-tolerance reasons (fuzzy input)
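Solutions (1) and (2) can be illustrated with a small candidate lookup in which the tone digit is optional. The candidate table is a tiny hand-made sample, not real input-method data, and `lookup` is a hypothetical function.

```python
import re

# Tiny hand-made sample: syllable -> tone -> candidate characters.
CANDIDATES = {
    "yi": {
        1: ["一", "衣"],
        2: ["移", "宜"],
        3: ["以", "已"],
        4: ["意", "易"],
    },
}


def lookup(typed):
    """Return candidates for input such as 'yi' or 'yi4' (tone optional)."""
    match = re.fullmatch(r"([a-z]+)([1-4])?", typed)
    if match is None:
        return []
    syllable, tone = match.group(1), match.group(2)
    by_tone = CANDIDATES.get(syllable, {})
    if tone is not None:
        return list(by_tone.get(int(tone), []))
    # No tone given: show every candidate, ordered by tone.
    return [ch for t in sorted(by_tone) for ch in by_tone[t]]


if __name__ == "__main__":
    print(lookup("yi"))   # what a candidate window would list
    print(lookup("yi4"))  # the tone digit narrows the list
```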

Lecture4 15

• User Problems:

– Some sounds are difficult to analyze:

• similar consonants: /b/ vs /p/, /t/ vs /d/, /g/ vs /k/

• Tones change in context (tone sandhi), so the way we say things differs from the standard pinyin: 普洱 is written pu3 er3 but pronounced pu2 er3 (Putonghua)

– Difficult to analyze the behaviour of non-native speakers because of accent interfering with phonetic analysis

– Tedious to find the correct character from the set of candidates that have no apparent relationships

• When the user cannot use shape-based keystroke input, try phonetic spelling!
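One way an input method can tolerate the similar-consonant confusions listed above is to also search the spelling with the confusable initial swapped. The sketch below is an assumption about how such fuzzy matching might look, not a description of any particular product; `fuzzy_variants` is a made-up name.

```python
# Initial consonant pairs that users commonly confuse (from the slide).
CONFUSABLE = [("b", "p"), ("d", "t"), ("g", "k")]


def fuzzy_variants(syllable):
    """Return the typed syllable plus the spelling with the initial swapped."""
    variants = [syllable]
    for first, second in CONFUSABLE:
        if syllable.startswith(first):
            variants.append(second + syllable[len(first):])
        elif syllable.startswith(second):
            variants.append(first + syllable[len(second):])
    return variants


if __name__ == "__main__":
    print(fuzzy_variants("ba"))
    print(fuzzy_variants("tang"))
```

Each extra variant enlarges the candidate set, which is why fuzzy input tends to lengthen the list the user has to scan.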

Lecture4 16

Other IMs for Chinese

• Zhuyin (注音) [also called bopomofo]

– Chinese phonetic symbols (similar to Japanese Katakana or Hiragana)

– Includes the use of numeral keystrokes

– Similar English sounds: bpmfdtnlgkhjsaor

– Tone: . (tone 0), <space> (tone 1), 2 (tone 2), 3 (tone 3), 4 (tone 4)

– One-to-one mapping to Pinyin (pages 218-219)

ㄅㄆㄇㄈ to bo, po, mo, fo

• 九方: mapping into number keys, good for small appliances: mobile phone, PDA, etc.
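The one-to-one mapping and the tone keys can be written as two small tables. Only the four symbols named on the slide are included, and the function `decode` is hypothetical.

```python
# The four symbols from the slide and their recited names in Pinyin.
ZHUYIN_NAMES = {"ㄅ": "bo", "ㄆ": "po", "ㄇ": "mo", "ㄈ": "fo"}

# Tone keys as listed on the slide: '.' tone 0, space tone 1, digits 2-4.
TONE_KEYS = {".": 0, " ": 1, "2": 2, "3": 3, "4": 4}


def decode(symbols, tone_key):
    """Translate Zhuyin symbols and a tone keystroke into Pinyin terms."""
    names = [ZHUYIN_NAMES[s] for s in symbols]
    return names, TONE_KEYS[tone_key]


if __name__ == "__main__":
    print(decode("ㄅㄆㄇㄈ", " "))
```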

Lecture4 17

Japanese and Korean

• Since hiragana and katakana are all phonetic based, they have unique Romanized mapping

• Example: a i u e o, ha hi hu he ho

• But separate key (native symbols) mapping is also provided (pp. 248)

• Romanized input and native symbol-based direct mapping input methods are different

• Similar for Korean Hangul
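Because the Romanized mapping is unique, Romanized input reduces to a table lookup with longest-match scanning. The sketch covers only the ten syllables from the example above; `romaji_to_hiragana` is a name chosen for this illustration.

```python
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
}


def romaji_to_hiragana(text):
    """Convert Romanized input to hiragana, trying two letters before one."""
    output = []
    i = 0
    while i < len(text):
        for width in (2, 1):
            chunk = text[i:i + width]
            if chunk in ROMAJI_TO_HIRAGANA:
                output.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError("no mapping for input at position %d" % i)
    return "".join(output)


if __name__ == "__main__":
    print(romaji_to_hiragana("hai"))
    print(romaji_to_hiragana("aiueo"))
```

A native-symbol layout skips this step, since each key press already produces one kana.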
