Localizing your apps for multibyte languages
Ken ISHIMOTO (K’s Room Japan)
Localizing your apps
• Part I - WebObjects
• Part II - What is a multibyte Language
• Part III - Combine multibyte Language with WebObjects
• Part IV - multibyte & WOdka
Part I - WebObjects
• Eclipse
• Ant build
• Properties (to make WebObjects ready)
• Database
Eclipse
• Set your Workspace to UTF-8
If you don’t, you can get all kinds of problems; non-English characters in the source code can also break compilation.
Ant build
• Set your Ant Compile task script to UTF-8
Properties in your app
• These are the properties that we use
• file.encoding=UTF-8
• er.extensions.ERXApplication.DefaultEncoding=UTF-8
• er.extensions.ERXApplication.DefaultMessageEncoding=UTF-8
• er.extensions.ERXLocalizationEditor.encoding=UTF-8
• wodka.Application.LanguageEncoding={Japanese = UTF-8; }
CSS
@charset "UTF-8";
Javascript
<script type="text/javascript" charset="UTF-8">
Database - MySQL
• MySQL = &useUnicode=true&characterEncoding=UTF-8
Don’t forget to create a ‘utf8’ database.
Database - FrontBase
Nothing to do, just works
Part II - What is a multibyte Language (Japanese)
• Basics
• Alphabet (how Japanese works)
• Encoding (which encoding to use)
Basics
• This is a sample Page from a Book
• A book is read from right to left, so you open it where you would usually close it.
• You read from right to left and from top to bottom.
• This can be very complex for word-processing software, so XX Word isn’t a good choice for writing books or magazines. That is also one reason why there are Japanese text editors that can do this.
Spaces between Words
• This is a pen.
• これはペンです。
• Today we have a good weather in Tokyo.
• 今日、東京はとてもいい天気です。
Another big problem can be that there are no spaces between words.
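A minimal sketch of why this matters for code: a tokenizer that splits on whitespace finds four words in the English sentence but only one "word" in the Japanese one. The class and method names are hypothetical and use only the standard library.

```java
class WhitespaceSplitDemo {
    // Counts the tokens produced by splitting on runs of whitespace.
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("This is a pen."));   // 4
        System.out.println(countTokens("これはペンです。"));   // 1, the whole sentence is one token
    }
}
```

Search, word counts and line wrapping that assume spaces between words therefore need a different strategy for Japanese text.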
yen symbol vs backslash
• If you’re familiar with the Japanese keyboard, the backslash key (\) is replaced by the symbol for the Yen (¥). Way back when, we did a Japanese version of BRIEF, so I was familiar with this phenomenon: paths would be separated by Yen symbols, but everything worked as expected.
• set the URL_A_chars to “$+!’,?;&@=#%><{}[]"~`^\|*()”
• completely failed to compile, because it looked like this:
• set the URL_A_chars to “$+!’,?;&@=#%><{}[]¥"~`^¥¥|*()”
• and ¥ didn’t escape as you’d expect.
• If I create a new file, either on my system or the English only system I can use any font and type the \ key and I get the \ glyph. Side by side in this file I can use exactly the same font but when I type the \ symbol I get the ¥ glyph.
Japanese Alphabet
• 漢字 Kanji (Chinese characters)
• ひらがな Hiragana (Japanese Alphabet)
• カタカナ Katakana (Alphabet for Foreign Words)
• ローマ字 Romaji (English characters)
• 記号 Kigo (Sign)
• 絵文字 Emoji (Smilies)
• 外字 Gaiji (Self-made characters)
• 振り仮名 Furigana
漢字 Kanji
• The complexity of these characters
• The vast majority of these are not in common use in either Japan or China; as discussed below, approximately 2,000 to 3,000 characters are in common use in Japan, a few thousand more find occasional use, and a total of about 13,000 characters can be encoded in various Japanese Industrial Standards for kanji.
• Kyōiku kanji: The Kyōiku kanji (教育漢字, "education kanji") are 1,006 characters that Japanese children learn in elementary school.
• Jōyō kanji: The Jōyō kanji (常用漢字, "regular-use kanji") are 2,136 characters consisting of all the Kyōiku kanji, plus 1,130 additional kanji taught in junior high and high school. In publishing, characters outside this category are often given furigana.
• Jinmeiyō kanji Since September 27, 2004, the Jinmeiyō kanji (人名用漢字, "kanji for use in personal
Encoding of 生
• UNICODE : 751F
• UTF-8 : E7 94 9F
• Shift-JIS : 90B6
A character does not always fit in 16 bits, and today a multibyte character can take 32 bits or more. So it is difficult to say that a name field in a database needs only VARCHAR(20). That may be enough for some languages, but if the limit is counted in bytes, UTF-8 text can be only a few characters long, which is not enough.
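The byte counts above can be checked with the standard library. This is a minimal sketch; the class name, the helper names and the sample name 山田太郎 are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {
    // Returns the UTF-8 bytes of s as space-separated upper-case hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Number of bytes s occupies when stored as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("生".length());           // 1 (one UTF-16 code unit)
        System.out.println(utf8Hex("生"));            // E7 94 9F
        System.out.println(utf8Length("山田太郎"));    // 12 bytes for 4 characters
        System.out.println(utf8Length("Taro"));       // 4 bytes for 4 characters
    }
}
```

If a column limit is counted in bytes, a 20-byte field holds only six kanji in UTF-8, so it is worth checking whether the database counts characters or bytes.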
Pronunciation : 生
• ON : Chinese-style reading for kanji.
ショウ, ショウ_ジル, ショウ_ズル, ジョウ, セイ, ゼイ
Shou, Shou_jiru, Shou_zuru, Jou, Sei, Zei
• KUN : Japanese-style reading for kanji.
イ_カス, イ_キ, イ_キル, イ_ケル, ウ_マレ, ウ_マレル, ウ_ム, ウブ, ウマ_レ, ウマ_レル, オ_イ, オ_ウ, キ, ナ_ス, ナ_ル, ナマ, ハ_エ, ハ_エル, ハ_ヤス, バ_エ
i_kasu, i_ki, i_kiru, i_keru, u_mare, u_mareru, u_mu ....
• Special reading.
アイ, イク, イケ, エ, オ, サ, ナリ, ニュウ, ヌク, フ, ブ, ム_ス, ヨイ
ai, iku, ike, e, o, sa, nari, nyuu, nuku, fu, bu, mu_su, yoi
• In Chinese this is read : Shēng
Difference between countries
手紙
Japanese: letter / Chinese: toilet paper
Japanese and Chinese are very different, even if some kanji look the same.
It is like English and French: they share some letters, but can you read and understand both?
Character : 生
• 生きる Ikiru ..... live, living , alive
• 生クリーム Nama kuri-mu ..... fresh cream
• 生涯 Shougai ..... lifetime
• 生命 Seimei ..... life
• 生む Umu ..... give birth
We can see that one kanji can have many different meanings and pronunciations.
So it makes no sense at all to sort a database by kanji.
People wouldn’t find the data where they expected it, and the sort would only be a Unicode sort, which has no meaning.
Every character is easy to use and access; no special treatment is necessary.
ひらがな Hiragana
• Hiragana is a Japanese syllabary, one basic component of the Japanese writing system.
• Hiragana is used to write native words for which there are no kanji, including grammatical particles, and suffixes such as さん ~san "Mr., Mrs., Miss, Ms.".
Every character is easy to use and access; no special treatment is necessary.
カタカナ Katakana
• Katakana is a Japanese syllabary, one component of the Japanese writing system.
• In contrast to the hiragana syllabary, which is used for those Japanese language words and grammatical inflections which kanji does not cover, the katakana syllabary is primarily used for transcription of foreign language words into Japanese.
Every character is easy to use and access; no special treatment is necessary.
Half-width kana 半角カナ
• Half-width kana (半角カナ Hankaku kana) are katakana characters displayed at half their normal width (a 2:1 aspect ratio), instead of the usual square (1:1) aspect ratio.
• Half-width kana were used in the early days of Japanese computing, to allow Japanese characters to be displayed on the same grid as monospaced fonts of Latin characters.
• Half-width hiragana or kanji were not used.
• Half-width kana characters are not generally used today, but find some use in specific settings, such as cash register displays, on shop receipts, and Japanese digital television and DVD subtitles.
Caution!
These kinds of characters can be a pain, so a good program converts half-width katakana to full-width.
String s1 = "ｱﾅﾀ"; // half-width (Hankaku) katakana
String s2 = "アナタ"; // full-width (Zenkaku) katakana
ERXStringUtilitiesEXTENDED.changeHanKatakanaToZenkakuKatakana(s1); // RESULT = "アナタ"
s1.equalsIgnoreCase(s2) // RESULT = false
s1.length() // RESULT = 3
s2.length() // RESULT = 3
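For readers without the ERXStringUtilitiesEXTENDED helper, a comparable conversion can be sketched with the JDK's java.text.Normalizer. This is a swapped-in technique, not the method used on the slide: NFKC normalization maps half-width katakana to full-width, but it also rewrites other compatibility characters (for example full-width letters and digits), so treat it as a sketch. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class KanaWidthDemo {
    // NFKC folds compatibility forms, including half-width katakana,
    // into their canonical full-width equivalents.
    static String normalizeWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String half = "\uFF71\uFF85\uFF80";   // ｱﾅﾀ in half-width katakana
        String full = "\u30A2\u30CA\u30BF";   // アナタ in full-width katakana
        System.out.println(half.equals(full));                  // false
        System.out.println(normalizeWidth(half).equals(full));  // true

        // A half-width kana plus a separate voicing mark becomes one character.
        String ga = "\uFF76\uFF9E";           // ｶ followed by ﾞ
        System.out.println(normalizeWidth(ga).length());        // 1 (ガ)
    }
}
```

Normalizing on input, before the value is compared or stored, keeps later equality checks and searches simple.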
NUMBER 数字
• As with spaces, numbers also have variations:
• single byte (Hankaku)
• double byte (Zenkaku)
• Chinese character version (Kanji)
• Hankaku (single) - 0123456789
• Zenkaku (double) - ０１２３４５６７８９
• Kanji - 0 is 零 or 〇 / 1 is 一 or 壱 / 2 is 二 or 弐 / 3 is 三 or 参 / 四 五 六 七 八 九
Converting every number to single-byte form before storing it in the database is the easy way to go.
isDigitsOnly
String s1 = "0123456789"; // Hankaku
String s2 = "０１２３４５６７８９"; // Zenkaku
ERXStringUtilities.isDigitsOnly(s1); // RESULT = true
ERXStringUtilities.isDigitsOnly(s2); // RESULT = true
s1.equalsIgnoreCase(s2); // RESULT = false
replace double to single
String s = "０１２３４５６７８９";
ERXStringUtilitiesEXTENDED.changeZenkakuNumberToHanNumber(s); // RESULT = "0123456789"
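As a rough standard-library equivalent of the conversion above, the sketch below maps every Unicode decimal digit to its ASCII form with Character.isDigit and Character.digit. The class and method names are hypothetical; kanji numerals such as 一 are not decimal digits in this sense and are left untouched.

```java
class DigitNormalizer {
    // Replaces each Unicode decimal digit (e.g. full-width 0-9) by its ASCII digit.
    static String toAsciiDigits(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                sb.append((char) ('0' + Character.digit(c, 10)));
            } else {
                sb.append(c);   // letters, kanji numerals, punctuation stay as they are
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zenkaku = "\uFF10\uFF11\uFF12\uFF13";      // full-width 0123
        System.out.println(toAsciiDigits(zenkaku));        // 0123
        System.out.println(toAsciiDigits("\u4E00"));       // 一 is unchanged
    }
}
```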
LETTER 英字
• Everybody loves the simple 26 letters, which take two years to learn in most schools.
• In some countries there are variations, like German with ÜÖÄ.
• For each letter there is a double-byte letter
• ‘U’ == ‘Ｕ’
Converting every letter to single-byte form before storing it in the database is the easy way to go.
String s1 = "BC"; // Hankaku
String s2 = "ＢＣ"; // Zenkaku
s1.equalsIgnoreCase(s2); // RESULT = false
s1 = ERXStringUtilitiesEXTENDED.changeZenkakuEijiToHanEiji(s2); // RESULT = "BC"
Sign 記号
• For each sign there is a double-byte counterpart
• ‘!’ == ‘！’
Converting every sign to single-byte form before storing it in the database is the easy way to go.
String s1 = "!@#$%^&*()"; // Hankaku
String s2 = "！＠＃＄％＾＆＊（）"; // Zenkaku
s1 = ERXStringUtilitiesEXTENDED.changeZenkakuKigouToHanKigou(s2); // RESULT = "!@#$%^&*()"
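The double-byte letters, digits and signs shown so far all live in one Unicode range, U+FF01 to U+FF5E, which sits at a fixed offset of 0xFEE0 from ASCII '!' to '~'. A minimal sketch that relies on this fact follows; the class and method names are hypothetical and this is not the ERXStringUtilitiesEXTENDED implementation.

```java
class FullWidthAscii {
    // Maps full-width ASCII variants (U+FF01..U+FF5E) and the full-width
    // space (U+3000) to their single-byte ASCII counterparts.
    static String toHalfWidth(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));
            } else if (c == '\u3000') {
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHalfWidth("\uFF22\uFF23"));         // BC
        System.out.println(toHalfWidth("\uFF01\uFF20\uFF03"));   // !@#
    }
}
```

Kanji and kana fall outside the range and pass through unchanged, so the helper can be applied to whole input fields.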
SPACE スペース
• String a = " ";
• String b = "　";
a == space char
b == double-size space char
Converting every space to single-byte form before storing it in the database is the easy way to go.
trim
// head and tail are 3 space chars
String s = "   A B C   ";
s.trim(); // RESULT = 'A B C'
ERXStringUtilities.trimString(s); // RESULT = 'A B C'
ERXStringUtilitiesEXTENDED.trimStringWithZenkaku(s); // RESULT = 'A B C'
better trim
// head and tail are 3 Japanese ZENKAKU (double-byte) space chars
String s = "　　　A B C　　　";
s.trim(); // RESULT = '　　　A B C　　　'
ERXStringUtilities.trimString(s); // RESULT = '　　　A B C　　　'
ERXStringUtilitiesEXTENDED.trimStringWithZenkaku(s); // RESULT = 'A B C'
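The reason String.trim() leaves the Zenkaku space in place is that it only removes characters up to U+0020, while the double-byte space is U+3000. A minimal sketch of a trim based on Character.isWhitespace, which does treat U+3000 as whitespace, is shown below. The class and method names are hypothetical, and this is not the trimStringWithZenkaku implementation.

```java
class UnicodeTrim {
    // Strips leading and trailing characters that Character.isWhitespace
    // accepts, which includes the full-width space U+3000.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && Character.isWhitespace(s.charAt(start))) start++;
        while (end > start && Character.isWhitespace(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String s = "\u3000\u3000\u3000A B C\u3000\u3000\u3000";
        System.out.println(s.trim().length());        // 11, nothing was removed
        System.out.println(trimUnicode(s));            // A B C
    }
}
```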
remove Space between chars
// between A and B are 2 single spaces + 2 double spaces + 2 single spaces
String s = "A  　　  B";
s.replace(" ", ""); // RESULT = 'A　　B'
ERXStringUtilities.removeCharacters(s, " "); // RESULT = 'A　　B'
ERXStringUtilitiesEXTENDED.changeZenkakuToHanKakaku(s).replace(" ", ""); // RESULT = 'AB'
絵文字 Emoji (Smilies)
• Emoji (絵文字; Japanese pronunciation: [emodʑi]) is the Japanese term for the ideograms or smileys used in Japanese electronic messages and webpages.
• Emoji pictograms by au are specified using the IMG tag. SoftBank Mobile emoji are wrapped between SI/SO escape sequences, and support colors and animation. DoCoMo's emoji are the most compact to transmit while au's version is more flexible based on open standards.
If you are creating a CMS or a data-entry app such as a blog or forum, you will have to deal with emoji. Japanese people love to use them.
WOEmoji
Last year at WOWODC 2012 I spoke about the SnoWOman CMS, which includes a framework named WOEmoji. With this framework it is easy to convert emoji for saving to the database, and they will automatically work on Windows and Android devices as well.
Version 2 of this framework (in progress) can also convert to the new open-standard emoji that is being developed in Japan right now.
I am a paid supporter of this project and am waiting for delivery, so that WOEmoji can be updated.
外字 Gaiji (Self-made characters)
• Gaiji (外字), literally meaning "external characters", are kanji that are not represented in existing Japanese encoding systems. These include variant forms of common kanji that need to be represented alongside the more conventional glyph in reference works, and can include non-kanji symbols as well.
辻葛楢
Win XP : it had only a few thousand kanji, and it wasn’t easy to use kanji that were not available. So people started creating their own, and the look was sometimes different.
Win Vista : you can see the font is a little different. But you have to buy this 1,500-character Gaiji package for about USD 500.
OS X : works out of the box, and it is free.
Gaiji 外字 Editor
• This is an old Gaiji editor with which users could make their own characters, and that was nice. It started with the first version of Windows. But now with the Internet there is a problem: many people do not realize that such a character can be seen only on the one machine where it was created. After it is sent by mail or entered into a database, it looks different on every other machine. So you need to strip out these characters and give the user feedback not to use them.
ERXStringUtilitiesEXTENDED.delete_ModelDependenceCharacters(true, s, 200, false, false);
Because I don’t have a Windows machine here, I wasn’t able to create a sample string, but there is a command for deleting that character area.
Furigana 振り仮名
• Furigana (振り仮名) is a Japanese reading aid, consisting of smaller kana, or syllabic characters, printed next to a kanji (ideographic character) or other character to indicate its pronunciation. It is typically used to clarify rare, nonstandard or ambiguous readings, or in children's or learners' materials.
Encoding
• UTF-8
• EUC-JP
• Shift JIS
• ISO/IEC 2022
• and some more ...
UTF-8
• UTF-8 (UCS Transformation Format 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
We now use UTF-8 for every project. With it you are mostly safe and do not have to take care of other encodings, but...
EUC-JP
• EUC-JP Extended Unix Code
Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
• The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94^2) characters, or 830584 (94^3) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the lower half of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).
If you have to work with some Windows machines, you may have to import data encoded in this encoding. In my experience, I have never had to use it.
Shift JIS
• Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.
This is the most used encoding in Japan, and you can be sure that if you get data from an existing database or have to connect to a database, you will have to deal with it.
We did a lot of SJIS - UTF-8 conversion in the past.
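An SJIS to UTF-8 conversion of this kind can be sketched with plain JDK classes: decode the bytes with the source charset, then encode the resulting String with the target charset. The class and method names are hypothetical. The sketch assumes the JDK ships the Shift_JIS charset, which full JDK builds normally do; data coming from Windows may actually be in the windows-31j variant, so the source charset is a parameter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SjisToUtf8 {
    // Decodes the input with the source charset and re-encodes it with the target charset.
    static byte[] convert(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        byte[] source = { (byte) 0x90, (byte) 0xB6 };   // 生 in Shift JIS
        byte[] utf8 = convert(source, sjis, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);        // E7 94 9F
        }
        System.out.println();
    }
}
```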
ISO/IEC 2022
• ISO/IEC 2022 "Information technology: Character code structure and extension techniques" is an ISO standard (equivalent to the ECMA standard ECMA-35) specifying
• a technique for including multiple character sets in a single character encoding system, and
• a technique for representing these character sets in both 7 and 8 bit systems using the same encoding.
You only have to deal with this if you build mailing solutions, but I really don’t worry about it anymore; JavaMail works just fine.
Part III - Combine multibyte Language with WebObjects
Localization ローカライズ
• Localization of your App
• Localization Data
• Sorting
Localization of your App
ERXLocalizer
// Writing components and code with ERXLocalizer makes your life very easy.
// There are so many things you can do with it, so get comfortable with it.
// Localized String from Code
ERXLocalizer.defaultLocalizer().localizedStringForKey("Nav.Main");
// Localized String in HTML
<wo:str value = "$localizer.Nav.Main" />
<wo:localized value="Nav.Main" />
* This is a bad example because I am using the power of the ‘dark force’, inline bindings. You shouldn’t do that, but I always do. Sorry, I am a bad guy.
.strings
In your app’s ‘Resources’ folder, create a folder named language name + ‘.lproj’.
Make the .strings file a plist file of key-value pairs,
and save the file as UTF-16 or UTF-8.
With UTF-8 it is easier to read, and git commits can be viewed.
Localization Data
Localization of Data
1. Add one attribute per language to the entity
2. Set the data on the edit page
3. Display the attribute depending on the localizer
[[eo]].name_en() or
[[eo]].name_ja() or
[[eo]].valueForKey("name")
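The third step can be pictured as a lookup that appends the current language code to the attribute name. The sketch below is only an illustration of that idea with a plain Map standing in for the enterprise object; the class, the method, the fallback rule and the sample values are assumptions and not WebObjects or Wonder API.

```java
import java.util.HashMap;
import java.util.Map;

class LocalizedAttribute {
    // Looks up key + "_" + languageCode and falls back to a default language
    // when the localized attribute is missing.
    static String localizedValue(Map<String, String> row, String key,
                                 String languageCode, String fallbackCode) {
        String value = row.get(key + "_" + languageCode);
        return value != null ? value : row.get(key + "_" + fallbackCode);
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();   // stand-in for an EO
        row.put("name_en", "Tokyo");
        row.put("name_ja", "東京");
        System.out.println(localizedValue(row, "name", "ja", "en"));   // 東京
        System.out.println(localizedValue(row, "name", "de", "en"));   // Tokyo (fallback)
    }
}
```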
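Sorting
Sorting 1
• name (how it is written)
• furigana (how it is pronounced)
Sorting 2
• Person 1: 森 (漢字 Kanji, Chinese characters) / もり (ひらがな Hiragana or カタカナ Katakana, Japanese Alphabet) / Mr. Mori
• Person 2: 林 (漢字 Kanji, Chinese characters) / はやし (ひらがな Hiragana or カタカナ Katakana, Japanese Alphabet) / Mr. Hayashi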
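To illustrate why the furigana attribute is the one to sort on, the sketch below sorts four sample people first by their kanji name and then by their reading. The Person class, the sample names and the comparator names are assumptions for illustration. Plain String comparison of hiragana only approximates Japanese dictionary order; java.text.Collator for Locale.JAPANESE may be a closer fit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FuriganaSort {
    static final class Person {
        final String name;       // how it is written
        final String furigana;   // how it is pronounced
        Person(String name, String furigana) { this.name = name; this.furigana = furigana; }
    }

    static final Comparator<Person> BY_KANJI = Comparator.comparing((Person p) -> p.name);
    static final Comparator<Person> BY_FURIGANA = Comparator.comparing((Person p) -> p.furigana);

    static List<Person> sample() {
        return Arrays.asList(
                new Person("山田", "やまだ"),
                new Person("鈴木", "すずき"),
                new Person("佐藤", "さとう"),
                new Person("田中", "たなか"));
    }

    // Returns the written names in the order given by the comparator.
    static List<String> namesSortedBy(List<Person> people, Comparator<Person> order) {
        List<Person> copy = new ArrayList<>(people);
        copy.sort(order);
        List<String> names = new ArrayList<>();
        for (Person p : copy) names.add(p.name);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesSortedBy(sample(), BY_KANJI));      // [佐藤, 山田, 田中, 鈴木]
        System.out.println(namesSortedBy(sample(), BY_FURIGANA));   // [佐藤, 鈴木, 田中, 山田]
    }
}
```

The kanji order follows Unicode code points and means nothing to a reader, while the furigana order (sa, su, ta, ya) is the one people expect in a list.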
Part IV - multibyte & WOdka
WOdka improvements
• Language-switching
WOdkaLanguageEnums
• Language name
• Locale Code
• Date format + 24 hours setting
• Data for Flag information
WOdkaCountryEnums
• Country name
• code2 : ISO Code for Country
• code3 : ISO Code for Country
• money : ERXMoneyEnums
• language : WOdkaLanguageEnums
• telephone code
• tax : tax info
• zip : zip format
• company Mailing Format
• family Mailing Format
• Localized words : male, female, sexMale, sexFemale
• flag : Path to Flag-data
• continent : ERXContinentEnums
• EU : ERXEuropeanUnionsEnums
family Mailing Format
"[S][CR][T][_][F][_][L]"
"[L] [F]様"
s = sex, t = title, f = first name, l = last name, cr = next line
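How such a format string might be expanded can be sketched with simple token replacement. This is not the WOdka implementation: the class and method names are hypothetical, and reading "[_]" as a single space is an assumption, since the legend does not define it.

```java
class MailingFormat {
    // Expands the bracketed tokens of a mailing format template.
    // Assumption: "[_]" stands for one space and "[CR]" for a line break.
    static String format(String template, String sex, String title,
                         String firstName, String lastName) {
        return template
                .replace("[CR]", "\n")
                .replace("[_]", " ")
                .replace("[S]", sex)
                .replace("[T]", title)
                .replace("[F]", firstName)
                .replace("[L]", lastName);
    }

    public static void main(String[] args) {
        System.out.println(format("[S][CR][T][_][F][_][L]", "Mr.", "Dr.", "Ken", "Ishimoto"));
        System.out.println(format("[L] [F]様", "", "", "太郎", "山田"));   // 山田 太郎様
    }
}
```

Keeping the name order and honorific in a per-country template means the calling code does not need to know that Japanese puts the family name first.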
Thanks to
• Masahiko TANI - A10 Objects Inc., (Japan)
• Hiroyuki FUKUI - Astonish Create (Japan)
Special Thanks to
• Paul YU - Green orchid llc (USA)
Thank You
WOWODC 2013