Lecture 4: Wide character vs. multi-byte characters

Post on 22-Dec-2015






  • Slide 1
  • Wide character vs. multi-byte characters
      • Text information needs to be represented by the right data types.
      • Multi-byte characters: data are processed on a per-byte basis, testing the high bit of each byte. Examples: Big5, GB, EUC, and even UTF-8.
      • Wide characters: fixed-width encoding, so no testing of the high bit is needed.
      • Processing representation for wide characters: big endian vs. little endian.
          • Data type dependent: byte order matters only for wide characters.
          • System architecture dependent.
          • Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in big endian and as FF FE in little endian, so a reader that sees 0xFFFE knows the bytes are swapped.
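The difference between the two representations can be seen by encoding a single character. The following is a minimal sketch, not part of the original slides: the sample character U+4E2D and the helper names `high_bit_set` and `detect_byte_order` are chosen only for illustration, and UTF-16 stands in for a generic wide-character type.

```python
import codecs

ch = "\u4e2d"  # sample character, chosen only for illustration

utf8 = ch.encode("utf-8")          # multi-byte form: three bytes
utf16_be = ch.encode("utf-16-be")  # 16-bit unit, big-endian byte order
utf16_le = ch.encode("utf-16-le")  # same unit, little-endian byte order


def high_bit_set(byte):
    """Return True if the byte belongs to a multi-byte character (bit 7 set)."""
    return byte & 0x80 != 0


def detect_byte_order(data):
    """Guess the byte order of UTF-16 data from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"


print(utf8.hex(), utf16_be.hex(), utf16_le.hex())
print(detect_byte_order(codecs.BOM_UTF16_LE + utf16_le))
```

Per-byte processing has to call something like `high_bit_set` on every byte to find character boundaries, while the wide form only needs the byte order to be known once, up front.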
  • Slide 2
  • Character input
      • Input method: a scheme for mapping characters from their external representations to the internal codepoints used in computer systems.
      • Classification of input methods:
          • Images: off-line character recognition (optical character recognition) and on-line character recognition.
          • Speech: voice recognition.
          • Character features: keyboard input based on glyph shapes and pronunciations.
  • Slide 3
  • Character input based on images
      • Optical character recognition (via image, off-line): written material --> scanner --> bitmap image file (e.g. TIFF, JPEG) --> characters (represented by an internal code).
          • Very difficult for unrestricted handwritten characters; commercially viable for printed material, where accuracy depends on printing quality.
          • The degree of difficulty increases as the total number of characters to be recognized increases.
      • On-line character recognition (by pen writing devices): handwriting information capture (pen-in, pen-out, pen movement, on-line) --> stroke information (pre-processing with noise reduction) --> search for the character based on the sequence of strokes.
          • Commercially viable.
  • Slide 4
  • Speech recognition (by voice input)
      • Capture speech by microphone --> speech signal segmentation --> speech signal converted to phonetic transcription --> phonetic spelling converted to internal code.
      • Becoming commercially viable; remaining problems are non-native speakers and the conversion from colloquial speech to written text.
      • Expected to become more affordable and more common in the next 5-10 years.
  • Slide 5
  • Keyboard-based input method
      • An encoding method that maps a sequence of keystrokes (on a predefined keyboard layout) to the internal code of a character.
      • Conceptually, an input method can be considered a mapping table with two columns: the first column X is a sequence of keys, the second column Y is the corresponding internal code.
      • Uniqueness requirement: for any two internal codepoints Y_i and Y_j, if Y_i ≠ Y_j then X_i ≠ X_j.
      • Input methods are normally language (script) dependent: in GB, input for Chinese characters and input for Greek letters are two different input methods and are therefore invoked separately.
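The two-column table and its uniqueness requirement can be checked mechanically. This is a minimal sketch under stated assumptions: the key sequences and code values in `table` are invented for illustration, and `find_conflicts` is a hypothetical helper, not part of any real input method.

```python
from collections import defaultdict


def find_conflicts(table):
    """Return key sequences X that are mapped to more than one internal code Y.

    `table` is a list of (key_sequence, internal_code) rows. The uniqueness
    requirement holds exactly when the returned dict is empty.
    """
    codes_by_keys = defaultdict(set)
    for keys, code in table:
        codes_by_keys[keys].add(code)
    return {keys: codes for keys, codes in codes_by_keys.items() if len(codes) > 1}


# Hypothetical rows: the third row reuses the key sequence "ab" for a new code.
table = [("ab", 0xA440), ("ac", 0xA441), ("ab", 0xA442)]

print(find_conflicts(table))      # the sequence "ab" is ambiguous
print(find_conflicts(table[:2]))  # the first two rows alone are unique
```

An input method that cannot avoid such conflicts needs an extra disambiguation step, such as a candidate window.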
  • Slide 6
  • Typing in the internal code
      • Straightforward, easiest to implement, and accurate, but requires labour-intensive training; only good for professionals.
      • Why do we need to design input methods?
          • People cannot relate characters to internal codes: ( ) => BCAB₁₆, ( ) => BCAC₁₆.
          • The number of characters is much larger than the number of keys on the keyboard => a sequence of keystrokes maps to one character.
      • What is the restriction? A limited number of keys (people cannot remember too many different keys with unrelated numbers).
  • Slide 7
  • What information do we have?
      • All input methods must use some features associated with the characters: pronunciation, radicals, components, strokes, writing sequence, etc., or combinations of them.
      • Different mapping methods lead to different input methods.
      • Users: professional typists, casual users, daily users.
      • Different modes of input: typing while looking at printed material, and typing while thinking.
  • Slide 8
  • Design considerations
      • Ease of learning:
          • Shorter learning time: easy to pick up (perhaps easy to forget), but slow input speed.
          • Longer learning time: difficult to learn, but once trained, not easy to forget, and faster input speed.
      • Mapping of features to keys on the keyboard:
          • Physical control of the different fingers and access to different key positions on the keyboard.
          • Frequency analysis of the features.
      • Uniqueness: one-to-one mapping and user friendliness.
      • Equal-length keystroke sequences vs. uneven-length keystroke sequences.
  • Slide 9
  • Input methods based on glyphs
      • Problems:
          • What are the fundamental units?
          • How to put the units together (or how to form sequences)? 2-D spatial relations need to be translated into a 1-D ordering. Example: 夵 (U+5935) and 尖 (U+5C16).
          • How difficult is it to learn? There is a trade-off between ease of learning and speed.
      • Features related to glyphs:
          • Strokes ( )
          • Radicals ( ): mostly for indexing, not unique
          • Components ( ): ( ) and ( ) in ( )
          • Characters ( )
          • Spatial relations ( ): e.g. left-right, upper-lower
  • Slide 10
  • Principles of input method design
      • Design example: using strokes only. Suppose we assign the strokes ( ) to keys 1, 2, 3, 4, 5, respectively, using only 5 keys.
      • Example: ( ), 23144233232, a very long sequence.
      • What problems do we have for characters like these: ( ) => at least one extra key must be used to distinguish them.
      • As there are more keys available, some keys can be assigned to multiple strokes: ( )
  • Slide 11
  • 2-stroke keys
      • If the first stroke is x and the second stroke is y, how many different 2-stroke keys are there? Example: ( ). What is the total number of keys now?
      • With these additional keys, the number of key presses is reduced to six: 23 14 42 33 23 2.
      • With 3-stroke keys xyz: how many additional keys, and what is the total number of keys?
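The counting questions on this slide can be worked through numerically. The sketch below assumes that any of the 5 basic strokes may follow any other (including itself), which gives 5² = 25 two-stroke keys and 5³ = 125 three-stroke keys; the slides do not state the intended answer, and a real design would likely keep only the frequent combinations. The helper names `key_count` and `chunk` are invented for illustration.

```python
def key_count(num_strokes, max_len):
    """Total keys when every stroke combination of length 1..max_len gets its own key."""
    return sum(num_strokes ** n for n in range(1, max_len + 1))


def chunk(sequence, size):
    """Split a single-stroke key sequence into groups of `size` strokes."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


strokes = "23144233232"  # the 11-keystroke example from the slides

print(key_count(5, 1), key_count(5, 2), key_count(5, 3))
print(chunk(strokes, 2), len(chunk(strokes, 2)))
print(chunk(strokes, 3), len(chunk(strokes, 3)))
```

The trade-off is visible in the numbers: each extra level of combined keys shortens the sequence but multiplies the number of keys the user has to remember.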
  • Slide 12
  • Study of character features and usage patterns
      • Study of character frequency (based on 50,000 characters):
          • The 2,000 most frequently used characters account for 97%.
          • Out of that, the first 100 characters account for 45%.
          • The first 10 characters account for 12%.
      • Example: ( ): assign keys.
      • For 2-stroke keys, 3-stroke keys, etc., use the most frequently used combinations.
      • Other considerations: features should be easily identifiable, and the length of the key sequence should be reduced.
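Coverage figures like these come from counting characters in a corpus and summing the counts of the most frequent ones. A minimal sketch, assuming a toy string in place of the 50,000-character corpus (the `sample` text and the `coverage` helper are invented for illustration):

```python
from collections import Counter


def coverage(text, top_n):
    """Fraction of `text` covered by its `top_n` most frequent characters."""
    counts = Counter(text)
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


sample = "aaaabbbccd"  # toy stand-in for a real corpus

for n in (1, 2, 4):
    print(n, coverage(sample, n))
```

The same ranking can be applied to strokes or stroke combinations instead of characters when deciding which features deserve their own key.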
  • Slide 13
  • Keyboard arrangements
      • Some fingers are easier to control; assign each finger a priority L. Only the index finger (2nd finger) through the 5th finger are used for typing.
      • General principle: assign the keys for more frequently used features to the positions on the keyboard that are easier to reach.
      • One simple method:
          • Some keyboard rows are easier to press; assign each row a priority R.
          • Keys are ranked according to L x R.
          • All the selected strokes (characters and combined strokes) are ranked according to frequency of use, K.
          • The feature keys are then mapped according to rank.
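The ranking method can be sketched in a few lines. All numbers here are hypothetical: the finger priorities (L), row priorities (R), the four sample keys, and the feature frequencies (K) are invented to show the mechanics, and `assign_features` is not taken from any real input method.

```python
# Hypothetical priorities: a higher value means easier to use.
FINGER_PRIORITY = {"index": 4, "middle": 3, "ring": 2, "little": 1}  # L
ROW_PRIORITY = {"home": 3, "top": 2, "bottom": 1}                    # R

# (key label, finger, row) for a few sample keys.
KEYS = [
    ("v", "index", "bottom"),
    ("d", "middle", "home"),
    ("f", "index", "home"),
    ("r", "index", "top"),
]

# Hypothetical frequency of use (K) for four features.
FEATURE_FREQ = {"s1": 50, "s2": 30, "s3": 15, "s4": 5}


def assign_features(keys, feature_freq):
    """Map the most frequent features to the keys with the highest L x R score."""
    ranked_keys = sorted(
        keys,
        key=lambda k: FINGER_PRIORITY[k[1]] * ROW_PRIORITY[k[2]],
        reverse=True,
    )
    ranked_features = sorted(feature_freq, key=feature_freq.get, reverse=True)
    return {feat: key[0] for feat, key in zip(ranked_features, ranked_keys)}


print(assign_features(KEYS, FEATURE_FREQ))
```

With these numbers the scores are f = 12, d = 9, r = 8, v = 4, so the most frequent feature lands on the index finger's home-row key.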
  • Slide 14
  • Phonetic-based IM (Pinyin)
      • Romanized input method vs. native phonetic symbol-based input method:
          • Romanized letter strings (usually 1-2 characters) can use the English keyboard readily.
          • Native phonetic symbols are easier for people to relate to.
      • Design problems and solutions:
          • Homonyms ( ) in GB: without tone, only 18 characters have no homonyms, and the largest set, yi, has 114. With tone, 262 characters have no homonyms, and the largest set is reduced to 60.
          • Solutions: (1) specification of tone is optional (1-4 for Putonghua and 1-9 for Cantonese); (2) use a window to show all the candidates; (3) word/bigram input.
          • Multiple pronunciations of the same character: enter all possible pronunciations into the phonetic spelling database (e.g. che and kui for ( ) in Cantonese). Quantitatively not a significant problem.
          • Lookup may slow down if extra spellings are added for fault tolerance (fuzzy input).
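Optional tone specification and the candidate window can be illustrated with a tiny lookup table. This is a sketch under stated assumptions: `DB` holds only a handful of Putonghua entries picked for illustration, and `candidates` is a hypothetical helper rather than the API of any real input method.

```python
# Toy phonetic spelling database: pinyin with tone digit -> characters.
# A character with several pronunciations is simply entered under each one.
DB = {
    "ma1": ["妈"],
    "ma2": ["麻"],
    "ma3": ["马"],
    "ma4": ["骂"],
    "xing2": ["行"],
    "hang2": ["行"],
}


def candidates(typed):
    """Return the candidate list for a spelling typed with or without a tone digit."""
    if typed and typed[-1].isdigit():
        # Tone given: exact lookup, short candidate list.
        return list(DB.get(typed, []))
    # Tone omitted: collect the homonyms across all tones of the syllable.
    return [
        ch
        for key, chars in DB.items()
        if key.rstrip("0123456789") == typed
        for ch in chars
    ]


print(candidates("ma"))    # every tone of "ma", shown in a candidate window
print(candidates("ma3"))   # the tone narrows the list
print(candidates("hang"))  # second pronunciation of the same character
```

Typing the tone costs one extra keystroke but shrinks the list the user has to scan, which is the trade-off behind making it optional.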
  • Slide 15
  • User problems
      • Some sounds are difficult to analyze:
          • Similar consonants: /b/ vs. /p/, /t/ vs. /d/, /g/ vs. /k/.
          • Tone interacts with the vowel: the way we say things differs from the standard pinyin, e.g. pu3 er3 is spoken as pu2 er3 (Putonghua).
      • It is difficult to analyze the behaviour of non-native speakers because accent interferes with phonetic analysis.
      • It is tedious to find the correct character in a set of candidates that have no apparent relationship.
      • When the user cannot use shape-based keystroke input, try phonetic spelling!
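The pu3 er3 example looks like third-tone sandhi, where a third tone is spoken as a second tone before another third tone. Assuming that reading, a fault-tolerant input method could accept the spoken form by normalizing it. The sketch below uses a simplified pairwise rule; the behaviour of longer third-tone runs depends on phrasing and is not modelled, and the function name is invented.

```python
def spoken_form(syllables):
    """Apply a simplified third-tone sandhi rule to tone-numbered pinyin.

    A syllable ending in "3" that is directly followed by another syllable
    ending in "3" is rewritten with tone 2. Other syllables are unchanged.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            result[i] = syllables[i][:-1] + "2"
    return result


print(spoken_form(["pu3", "er3"]))
print(spoken_form(["ma1", "ma3"]))
```

A database could store both the standard spelling and `spoken_form` of it, at the cost of the larger candidate sets noted for fuzzy input.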
  • Slide 16
  • Other IMs for Chinese
      • Zhuyin ( ) [also called bopomofo]:
          • Chinese phonetic symbols (similar to Japanese katakana or hiragana).
          • Includes the use of numeral keystrokes.
          • Similar English sounds: b p m f d t n l g k h j s a o r.
          • Tone: . (tone 0), ( ) (tone 1), 2 (tone 2), 3 (tone 3), 4 (tone 4).
          • One-to-one mapping to Pinyin (pages 218-219): ( ) to bo, po, mo, fo.
      • Mapping into number keys: good for small appliances such as mobile phones, PDAs, etc.
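The mapping between Zhuyin and Pinyin can be sketched for the four syllables that give bopomofo its name. This is an illustration only: the table covers five symbols, and symbol-by-symbol substitution is a simplification, since a full converter works from a per-syllable table such as the one the slide cites. The names `ZHUYIN_TO_PINYIN` and `zhuyin_to_pinyin` are invented.

```python
# Partial symbol table: four initials and one final.
ZHUYIN_TO_PINYIN = {
    "ㄅ": "b",
    "ㄆ": "p",
    "ㄇ": "m",
    "ㄈ": "f",
    "ㄛ": "o",
}


def zhuyin_to_pinyin(syllable):
    """Convert a Zhuyin syllable to Pinyin by substituting each symbol."""
    return "".join(ZHUYIN_TO_PINYIN[symbol] for symbol in syllable)


for syllable in ("ㄅㄛ", "ㄆㄛ", "ㄇㄛ", "ㄈㄛ"):
    print(syllable, zhuyin_to_pinyin(syllable))
```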
  • Slide 17
  • Japanese and Korean
      • Since hiragana and katakana are both phonetic, they have a unique Romanized mapping. Example: a i u e o, ha hi hu he ho.
      • A separate key mapping for the native symbols is also provided (p. 248).
      • Romanized input and native symbol-based direct-mapping input are different input methods.
      • Similar for Korean Hangul.
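Romanized kana input amounts to a longest-match lookup over the typed letters. A minimal sketch covering only the ten syllables in the slide's example; the `KANA` table and the function name are chosen for illustration, and a real table covers the full syllabary.

```python
# Romanized spelling -> hiragana, limited to the slide's example syllables.
KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
}

MAX_KEY_LEN = max(len(key) for key in KANA)


def romaji_to_hiragana(text):
    """Convert Romanized input to hiragana, preferring the longest matching key."""
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY_LEN, 0, -1):
            piece = text[i:i + size]
            if piece in KANA:
                out.append(KANA[piece])
                i += len(piece)
                break
        else:
            # No key matches at this position.
            raise ValueError("no kana for input at position %d" % i)
    return "".join(out)


print(romaji_to_hiragana("aiueo"))
print(romaji_to_hiragana("hai"))
```

Because every kana has exactly one spelling in the table, the conversion needs no candidate window, unlike the Chinese phonetic methods.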