Localizing your apps for multibyte languages

Ken ISHIMOTO (K’s Room Japan)

Localizing your apps

• Part I - WebObjects

• Part II - What is a multibyte language

• Part III - Combining multibyte languages with WebObjects

• Part IV - Multibyte & WOdka

Part I - WebObjects

• Eclipse

• Ant build

• Properties (to make WebObjects ready)

• Database

Eclipse

• Set your Workspace to UTF-8

If you do not do that, you can get all kinds of problems. Non-English text in the source code can even break the compilation.

Ant build

• Set the encoding of the compile task in your Ant build script to UTF-8

Properties in your app

• These are the properties that we use

• file.encoding=UTF-8

• er.extensions.ERXApplication.DefaultEncoding=UTF-8

• er.extensions.ERXApplication.DefaultMessageEncoding=UTF-8

• er.extensions.ERXLocalizationEditor.encoding=UTF-8

• wodka.Application.LanguageEncoding={Japanese = UTF-8; }
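To confirm that the properties above took effect, an app could log the JVM default charset at startup. This is a minimal sketch using only the JDK; the class name EncodingCheck and its methods are made up for illustration and are not part of WebObjects, Wonder or WOdka.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports which encoding the JVM actually uses.
final class EncodingCheck {

    /** Returns true when the JVM default charset is UTF-8. */
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    /** Builds a one-line report suitable for logging at application start. */
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(report());
        if (!defaultIsUtf8()) {
            System.err.println("Warning: default charset is not UTF-8");
        }
    }
}
```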

CSS

@charset "UTF-8";

Javascript

<script type="text/javascript" charset="UTF-8">

Database - MySQL

• Append to the JDBC connection URL: &useUnicode=true&characterEncoding=UTF-8

Don’t forget to create the database with the ‘utf8’ character set.

Database - FrontBase

Nothing to do; it just works.

Part II - What is a multibyte Language (Japanese)

• Basics

• Alphabet (how Japanese works)

• Encoding (which encoding to use)

Basics

• This is a sample page from a book.

• A Japanese book is read from right to left, so you open it where you would usually close it.

• You read from right to left and from top to bottom.

• This can be very complex for word-processing software, so XX Word isn’t a good choice for writing books or magazines. That is also one reason why there are Japanese text editors that can do this.

Spaces between Words

• This is a pen.

• これはペンです。

• Today we have good weather in Tokyo.

• 今日、東京はとてもいい天気です。

Another big problem can be that there are no spaces between words.

Yen symbol vs. backslash

• If you’re familiar with the Japanese keyboard, the backslash key (\) is replaced by the symbol for the Yen (¥). Way back when, we did a Japanese version of BRIEF, so I was familiar with this phenomenon: paths would be separated by Yen symbols, but everything worked as expected.

• set the URL_A_chars to "$+!',?;&@=#%><{}[]\"~`^\\|*()"

• completely failed to compile, because it looked like this:

• set the URL_A_chars to "$+!',?;&@=#%><{}[]¥"~`^¥¥|*()"

• and ¥ didn’t escape as you’d expect.

• If I create a new file, either on my system or on the English-only system, I can use any font, type the \ key, and get the \ glyph. Side by side in this file I can use exactly the same font, but when I type the \ symbol I get the ¥ glyph.

Japanese Alphabet

• 漢字 Kanji (Chinese characters)

• ひらがな Hiragana (Japanese Alphabet)

• カタカナ Katakana (Alphabet for Foreign Words)

• ローマ字 Romaji (English characters)

• 記号 Kigo (Sign)

• 絵文字 Emoji (Smilies)

• 外字 Gaiji (Self-made characters)

• 振り仮名 Furigana

漢字 Kanji

• The complexity of these characters

• The vast majority of these are not in common use in either Japan or China; as discussed below, approximately 2,000 to 3,000 characters are in common use in Japan, a few thousand more find occasional use, and a total of about 13,000 characters can be encoded in various Japanese Industrial Standards for kanji.

• Kyōiku kanji: The Kyōiku kanji (教育漢字, "education kanji") are 1,006 characters that Japanese children learn in elementary school.

• Jōyō kanji: The Jōyō kanji (常用漢字, "regular-use kanji") are 2,136 characters consisting of all the Kyōiku kanji, plus 1,130 additional kanji taught in junior high and high school. In publishing, characters outside this category are often given furigana.

• Jinmeiyō kanji Since September 27, 2004, the Jinmeiyō kanji (人名用漢字, "kanji for use in personal

Encoding of 生

• Unicode : U+751F

• UTF-8 : E7 94 9F

• Shift-JIS : 90B6

A character is not always 16 bits wide: in UTF-8 the kanji 生 takes 3 bytes, and some characters take 4. So it is difficult to say that a varchar(20) name field in a database is enough. Where the length is counted in bytes, that would be enough for some languages, but in UTF-8 it holds only a few Japanese characters, which is not enough.
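The byte counts above can be checked with the JDK alone. The sketch below measures how many bytes a string needs in UTF-8, which is the number that matters when a column length is counted in bytes. The class name ByteLengthDemo is made up for illustration, and the kanji is written as a Unicode escape so the sample compiles regardless of the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows that character count and byte count differ.
final class ByteLengthDemo {

    /** Number of bytes the text occupies when stored as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Upper-case hex dump of the UTF-8 bytes, separated by spaces. */
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sei = "\u751F";                 // the kanji from the slide (U+751F)
        System.out.println(sei.length());      // 1 (one UTF-16 code unit)
        System.out.println(utf8Length(sei));   // 3
        System.out.println(utf8Hex(sei));      // E7 94 9F
        System.out.println(utf8Length("Ken")); // 3 (three ASCII letters)
    }
}
```

One kanji needs as many UTF-8 bytes as three ASCII letters, so a field sized for 20 bytes holds only six such characters.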

Pronunciation : 生

• ON : Chinese-style reading for kanji.

ショウ, ショウ＿ジル, ショウ＿ズル, ジョウ, セイ, ゼイ
Shou, Shou_jiru, Shou_zuru, Jou, Sei, Zei

• KUN : Japanese-style reading for kanji.

イ＿カス, イ＿キ, イ＿キル, イ＿ケル, ウ＿マレ, ウ＿マレル, ウ＿ム, ウブ, ウマ＿レ, ウマ＿レル, オ＿イ, オ＿ウ, キ, ナ＿ス, ナ＿ル, ナマ, ハ＿エ, ハ＿エル, ハ＿ヤス, バ＿エ
i_kasu, i_ki, i_kiru, i_keru, u_mare, u_mareru, u_mu ....

• Special reading.

アイ, イク, イケ, エ, オ, サ, ナリ, ニュウ, ヌク, フ, ブ, ム＿ス, ヨイ
ai, iku, ike, e, o, sa, nari, nyuu, nuku, fu, bu, mu_su, yoi

• In China this is read: Shēng

Difference between countries

手紙

Japanese: letter. Chinese: toilet paper.

Japanese and Chinese are very different, even if some kanji look the same.

It is like English and French: they share the same letters, but can you read and understand both?

Character : 生

• 生きる Ikiru ..... live, living , alive

• 生クリーム Nama kuri-mu ..... fresh cream

• 生涯 Shougai ..... lifetime

• 生命 Seimei ..... life

• 生む Umu ..... to give birth

We can see that one kanji can have many different meanings and pronunciations.

So it makes no sense at all to sort database records by their kanji.

People wouldn’t find the data where they expected it, and the result would be only a Unicode sort that has no meaning.

Every character is easy to use and access; no special treatment is necessary.

ひらがな Hiragana

• Hiragana is a Japanese syllabary, one basic component of the Japanese writing system.

• Hiragana is used to write native words for which there are no kanji, including grammatical particles, and suffixes such as さん ~san "Mr., Mrs., Miss, Ms.".

Every character is easy to use and access; no special treatment is necessary.

カタカナ Katakana

• Katakana is a Japanese syllabary, one component of the Japanese writing system.

• In contrast to the hiragana syllabary, which is used for those Japanese language words and grammatical inflections which kanji does not cover, the katakana syllabary is primarily used for transcription of foreign language words into Japanese.

Every character is easy to use and access; no special treatment is necessary.

Half-width kana 半角カナ

• Half-width kana (半角カナ Hankaku kana) are katakana characters displayed at half their normal width (a 1:2 aspect ratio), instead of the usual square (1:1) aspect ratio.

• Half-width kana were used in the early days of Japanese computing, to allow Japanese characters to be displayed on the same grid as monospaced fonts of Latin characters.

• Half-width hiragana or kanji were not used.

• Half-width kana characters are not generally used today, but find some use in specific settings, such as cash register displays, on shop receipts, and Japanese digital television and DVD subtitles.

Caution!

These characters can be a pain, so a good program converts half-width katakana to full-width katakana.

String s1 = "ｱﾅﾀ";
String s2 = "アナタ";

ERXStringUtilitiesEXTENDED.changeHanKatakanaToZenkakuKatakana(s1); // RESULT = "アナタ"

s1.equalsIgnoreCase(s2); // RESULT = false

s1.length(); // RESULT = 3

s2.length(); // RESULT = 3
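Without the ERXStringUtilitiesEXTENDED helper, a similar conversion can be sketched with the JDK's java.text.Normalizer. This is not how the helper above is implemented (its source is not shown here); it is an alternative under the assumption that Unicode NFKC normalization is acceptable for your data. Note that NFKC also folds other compatibility forms, for example full-width Latin letters and digits become their ASCII counterparts. The class name KanaWidth is made up.

```java
import java.text.Normalizer;

// Hypothetical helper: folds half-width katakana to full-width via NFKC.
final class KanaWidth {

    /**
     * Applies Unicode NFKC normalization. Half-width katakana become
     * full-width katakana, and a kana followed by a half-width voicing
     * mark is composed into a single voiced kana.
     */
    static String toFullWidthKana(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String half = "\uFF71\uFF85\uFF80"; // half-width "a na ta"
        String full = "\u30A2\u30CA\u30BF"; // full-width "a na ta"
        System.out.println(half.equals(full));                  // false
        System.out.println(toFullWidthKana(half).equals(full)); // true
    }
}
```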

NUMBER 数字

• As with spaces, numbers also have variations:

• single byte (Hankaku)

• double byte (Zenkaku)

• Chinese character version (Kanji)

• Hankaku (single) - 0123456789

• Zenkaku - ０１２３４５６７８９

• Kanji - 0 is 零 or 〇 / 1 is 一 or 壱 / 2 is 二 or 弐 / 3 is 三 or 参 / 四 五 六 七 八 九

Converting every number to its single-byte form before storing it in the database is the easy way to go.

isDigitsOnly

String s1 = "0123456789";
String s2 = "０１２３４５６７８９";

ERXStringUtilities.isDigitsOnly(s1); // RESULT = true

ERXStringUtilities.isDigitsOnly(s2); // RESULT = true

s1.equalsIgnoreCase(s2); // RESULT = false

Replace double with single

String s = "０１２３４５６７８９";

ERXStringUtilitiesEXTENDED.changeZenkakuNumberToHanNumber(s); // RESULT = "0123456789"
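The same normalization can be sketched with plain JDK calls. Character.isDigit and Character.digit both recognize full-width digits, so a loop over the code points is enough. This is an illustration, not the implementation of the helper above; the class name DigitWidth is made up. Be aware that it converts every Unicode decimal digit (not only the Japanese full-width ones) and leaves kanji numerals such as 一 untouched.

```java
// Hypothetical helper: replaces Unicode decimal digits with ASCII digits.
final class DigitWidth {

    /** Converts full-width (and other Unicode) decimal digits to '0'..'9'. */
    static String toAsciiDigits(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int value = Character.digit(cp, 10); // -1 when cp is not a decimal digit
            if (Character.isDigit(cp) && value >= 0) {
                sb.append((char) ('0' + value));
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zenkaku = "\uFF10\uFF11\uFF12\uFF13\uFF14\uFF15\uFF16\uFF17\uFF18\uFF19";
        System.out.println(toAsciiDigits(zenkaku)); // 0123456789
    }
}
```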

LETTER 英字

• Everybody loves the simple 26 characters, which in most schools take 2 years to learn.

• Some languages have variations, such as German with ÜÖÄ.

• For each letter there is a double-byte counterpart.

• ‘U’ corresponds to ‘Ｕ’

Converting every letter to its single-byte form before storing it in the database is the easy way to go.

String s1 = "BC";
String s2 = "ＢＣ";

s1.equalsIgnoreCase(s2); // RESULT = false

s1 = ERXStringUtilitiesEXTENDED.changeZenkakuEijiToHanEiji(s2); // RESULT = "BC"

Sign 記号

• For each sign there is a double-byte counterpart.

• ‘!’ corresponds to ‘！’

Converting every sign to its single-byte form before storing it in the database is the easy way to go.

String s1 = "!@#$%^&*()";
String s2 = "！＠＃＄％＾＆＊（）";

s1 = ERXStringUtilitiesEXTENDED.changeZenkakuKigouToHanKigou(s2); // RESULT = "!@#$%^&*()"
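Full-width letters and signs sit in one contiguous Unicode range (U+FF01 to U+FF5E) that mirrors ASCII 0x21 to 0x7E at a fixed offset of 0xFEE0. A conversion can therefore be sketched as simple arithmetic. This is an illustration of the idea, not the implementation of the helpers above; the class name AsciiWidth is made up.

```java
// Hypothetical helper: maps full-width ASCII variants to plain ASCII.
final class AsciiWidth {

    private static final int OFFSET = 0xFEE0; // U+FF01 - U+0021

    /** Converts full-width letters, digits and signs to their ASCII forms. */
    static String toHalfWidthAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - OFFSET));
            } else {
                sb.append(c); // everything else is left as it is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHalfWidthAscii("\uFF22\uFF23")); // BC
        System.out.println(toHalfWidthAscii(
                "\uFF01\uFF20\uFF03\uFF04\uFF05\uFF3E\uFF06\uFF0A\uFF08\uFF09")); // !@#$%^&*()
    }
}
```

The full-width space (U+3000) is outside this range and needs separate handling.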

SPACE スペース

• String a = " ";

• String b = "　";

a is the normal space character; b is the double-size (Zenkaku) space character.

Converting every space to its single-byte form before storing it in the database is the easy way to go.

trim

// head and tail are 3 space chars
String s = "   A B C   ";

s.trim(); // RESULT = "A B C"

ERXStringUtilities.trimString(s); // RESULT = "A B C"

ERXStringUtilitiesEXTENDED.trimStringWithZenkaku(s); // RESULT = "A B C"

better trim

// head and tail are 3 Japanese ZENKAKU (double-byte) space chars
String s = "　　　A B C　　　";

s.trim(); // RESULT = "　　　A B C　　　"

ERXStringUtilities.trimString(s); // RESULT = "　　　A B C　　　"

ERXStringUtilitiesEXTENDED.trimStringWithZenkaku(s); // RESULT = "A B C"

remove space between chars

// between A and B are 2 single spaces + 2 double spaces + 2 single spaces
String s = "A  　　  B";

s.replace(" ", ""); // RESULT = "A　　B"

ERXStringUtilities.removeCharacters(s, " "); // RESULT = "A　　B"

ERXStringUtilitiesEXTENDED.changeZenkakuToHanKakaku(s).replace(" ", ""); // RESULT = "AB"
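String.trim() only removes characters up to U+0020, which is why the Zenkaku space (U+3000) survives it. A Zenkaku-aware trim and space removal can be sketched with the JDK alone; this is an illustration, not the code behind trimStringWithZenkaku, and the class name ZenkakuSpace is made up.

```java
// Hypothetical helper: treats U+3000 like an ordinary space.
final class ZenkakuSpace {

    private static final char IDEOGRAPHIC_SPACE = '\u3000';

    private static boolean isSpace(char c) {
        return c <= ' ' || c == IDEOGRAPHIC_SPACE;
    }

    /** Trims ASCII whitespace and U+3000 from both ends of the text. */
    static String trim(String text) {
        int start = 0;
        int end = text.length();
        while (start < end && isSpace(text.charAt(start))) {
            start++;
        }
        while (end > start && isSpace(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(start, end);
    }

    /** Removes every ASCII space and every U+3000 anywhere in the text. */
    static String removeSpaces(String text) {
        return text.replace(" ", "")
                   .replace(String.valueOf(IDEOGRAPHIC_SPACE), "");
    }

    public static void main(String[] args) {
        System.out.println(trim("\u3000\u3000\u3000A B C\u3000\u3000\u3000")); // A B C
        System.out.println(removeSpaces("A  \u3000\u3000  B"));               // AB
    }
}
```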







絵文字 Emoji (Smilies)

• Emoji (絵文字; Japanese pronunciation: [emodʑi]) is the Japanese term for the ideograms or smileys used in Japanese electronic messages and webpages.

• Emoji pictograms by au are specified using the IMG tag. SoftBank Mobile emoji are wrapped between SI/SO escape sequences, and support colors and animation. DoCoMo's emoji are the most compact to transmit while au's version is more flexible based on open standards.

If you are creating a CMS or a data-entry application such as a blog or forum, you will have to deal with emoji. Japanese people love to use them.

WOEmoji

Last year, at WOWODC 2012, I spoke about the SnoWOman CMS. It includes a framework named WOEmoji, which makes it easy to convert emoji for saving to the database; they then also work automatically on Windows or Android devices.

Version 2 of this framework (work in progress) can also convert to the new open-standard emoji that is being developed right now in Japan.

I am a paying supporter of this project and am waiting for delivery, so that WOEmoji can be updated.

外字 Gaiji (Self-made characters)

• Gaiji (外字), literally meaning "external characters", are kanji that are not represented in existing Japanese encoding systems. These include variant forms of common kanji that need to be represented alongside the more conventional glyph in reference works, and can include non-kanji symbols as well.

辻葛楢

Win XP: it had only a few thousand kanji, and it wasn’t easy to use kanji that were not available. So people started creating their own, and the look was sometimes different.

Win Vista: you can see the font is a little different. But you have to buy this 1,500-character gaiji package for about USD 500.

OS X: works out of the box, and it is free.

Gaiji 外字 Editor

• This is an old gaiji editor, with which users could make their own characters, and that was nice. It started with the first version of Windows. But now with the Internet there is a problem: many people do not realize that such a character can be seen only on that one machine. After it is sent by mail or entered into a database, it looks different on every other machine. So you need to strip out these characters and give the user feedback not to use them.

ERXStringUtilitiesEXTENDED.delete_ModelDependenceCharacters(true, s, 200, false, false);

I don’t have a Windows machine here, so I wasn’t able to create a sample string, but there is a command for deleting that range of characters.
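As a rough illustration of the idea, user-defined characters usually end up in the Unicode Private Use Area once the text is converted to Unicode. That is an assumption here; it is not a description of what delete_ModelDependenceCharacters does internally. Under that assumption, detecting and stripping them could be sketched with the JDK alone (the class name GaijiFilter is made up):

```java
// Hypothetical helper: finds and strips Private Use Area code points,
// assuming that is where machine-specific gaiji end up.
final class GaijiFilter {

    /** Returns true when the text contains at least one private-use code point. */
    static boolean containsPrivateUse(String text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.getType(cp) == Character.PRIVATE_USE);
    }

    /** Returns the text without any private-use code points. */
    static String stripPrivateUse(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> Character.getType(cp) != Character.PRIVATE_USE)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "A\uE000B"; // U+E000 is the first private-use code point
        if (containsPrivateUse(input)) {
            System.out.println("Please do not use machine-specific characters.");
        }
        System.out.println(stripPrivateUse(input)); // AB
    }
}
```

Checking first and stripping second lets the app give the user the feedback mentioned above instead of silently changing the input.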

Furigana 振り仮名

• Furigana (振り仮名) is a Japanese reading aid, consisting of smaller kana, or syllabic characters, printed next to a kanji (ideographic character) or other character to indicate its pronunciation. It is typically used to clarify rare, nonstandard or ambiguous readings, or in children's or learners' materials.

Encoding

• UTF-8

• EUC-JP

• Shift JIS

• ISO/IEC 2022

• and some more ...

UTF-8

• UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

We now use UTF-8 for every project. With it you are mostly safe and do not have to care about other encodings, but...

EUC-JP

• EUC-JP: Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.

• The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8,836 (94²) characters, or 830,584 (94³) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the lower half of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).

If you have to work with some Windows machines, you may have to import data in this encoding. In my experience, I have never needed it.

Shift JIS

• Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.

This is the most used encoding in Japan, and you can be sure that if you get data from an existing database, or have to connect to one, you will have to deal with it. We did a lot of SJIS to UTF-8 conversion in the past.
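A byte-level SJIS to UTF-8 conversion can be sketched with the JDK's charset support. The class name SjisConverter is made up, and the sketch assumes the runtime ships the "Shift_JIS" charset (it is not one of the charsets every Java runtime must provide, hence the isSupported check). Data produced on Windows often uses Microsoft's variant, which Java names "windows-31j"; which of the two your source really uses is something to verify against the data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encodes Shift_JIS bytes as UTF-8 bytes.
final class SjisConverter {

    /** Decodes the bytes as Shift_JIS and encodes the resulting text as UTF-8. */
    static byte[] sjisToUtf8(byte[] sjisBytes) {
        Charset sjis = Charset.forName("Shift_JIS");
        String text = new String(sjisBytes, sjis);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS is not available in this runtime");
            return;
        }
        byte[] sjis = {(byte) 0x90, (byte) 0xB6}; // the kanji from the encoding slide
        byte[] utf8 = sjisToUtf8(sjis);
        System.out.println(utf8.length); // 3 (E7 94 9F)
    }
}
```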

ISO/IEC 2022

• ISO/IEC 2022, Information technology: Character code structure and extension techniques, is an ISO standard (equivalent to the ECMA standard ECMA-35) specifying

• a technique for including multiple character sets in a single character encoding system, and

• a technique for representing these character sets in both 7 and 8 bit systems using the same encoding.

You only have to deal with this if you build mailing solutions, but I really don’t care about it anymore; JavaMail works just fine.

Localization ローカライズ

• Localization of your App

• Localization Data

• Sorting

Localization of your App

ERXLocalizer

// Writing components and code with ERXLocalizer makes your life very easy.
// There are so many things you can do with it, so get comfortable with it.

// Localized String from Code

ERXLocalizer.defaultLocalizer().localizedStringForKey("Nav.Main");

// Localized String in HTML

<wo:str value = "$localizer.Nav.Main" />

<wo:localized value="Nav.Main" />

* This is a bad example because I am using the power of the ‘dark force’, inline bindings. You shouldn’t do that, but I always use it. Sorry, I am a bad guy.

.strings

In your app’s ‘Resources’ folder, create a folder named language name + ‘.lproj’.

Inside it, create the .strings file as a plist with key-value pairs.

Save the file as UTF-16 or UTF-8. With UTF-8 it is easier to read, and git commits can be viewed.

Localization of Data

1. Attributes in Entity

2. Set the data in the edit page

3. Display the attribute depending on the localizer

[[eo]].name_en()
or
[[eo]].name_ja
or
[[eo]].valueForKey("name")
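The pattern behind the three accessors above is one attribute per language, selected by the current language code. A minimal stand-alone sketch, with a plain Map standing in for the enterprise object, might look like this. LocalizedAttribute and its methods are hypothetical names, and the fallback to a default language is an added assumption, not something the slides state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: picks "name_en", "name_ja", ... by language code.
final class LocalizedAttribute {

    /** Builds the per-language key, e.g. ("name", "ja") gives "name_ja". */
    static String localizedKey(String baseKey, String languageCode) {
        return baseKey + "_" + languageCode;
    }

    /** Looks up the localized value and falls back to the default language. */
    static Object localizedValue(Map<String, Object> record, String baseKey,
                                 String languageCode, String defaultLanguageCode) {
        Object value = record.get(localizedKey(baseKey, languageCode));
        if (value != null) {
            return value;
        }
        return record.get(localizedKey(baseKey, defaultLanguageCode));
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("name_en", "Tokyo");
        record.put("name_ja", "\u6771\u4EAC"); // "Tokyo" written in kanji
        System.out.println(localizedValue(record, "name", "de", "en")); // Tokyo
    }
}
```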

Sorting

Store two attributes:

name (how it is written)

furigana (how it is pronounced)

The name is written in 漢字 Kanji (Chinese characters); the furigana is written in ひらがな Hiragana or カタカナ Katakana (Japanese Alphabet).

Person 1: 森 / もり / Mr. Mori

Person 2: 林 / はやし / Mr. Hayashi
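To see why the furigana attribute is the one to sort on, take two other names: 山田 (Yamada) and 青木 (Aoki). Sorted by kanji code point, Yamada comes first; sorted by reading, Aoki comes first, which is what a Japanese reader expects. The sketch below uses a made-up Person class rather than an enterprise object, and plain String comparison of the kana, which only approximates Japanese dictionary order; a locale-aware Collator may be preferable in production. The strings are written as Unicode escapes so the sample compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical example: sort people by reading instead of by kanji.
final class FuriganaSort {

    static final class Person {
        final String name;     // how it is written (kanji)
        final String furigana; // how it is pronounced (kana)

        Person(String name, String furigana) {
            this.name = name;
            this.furigana = furigana;
        }
    }

    /** Sorts by the written name, which is just a code point order. */
    static List<Person> sortByName(List<Person> people) {
        List<Person> sorted = new ArrayList<>(people);
        sorted.sort(Comparator.comparing((Person p) -> p.name));
        return sorted;
    }

    /** Sorts by the reading, which follows the kana order. */
    static List<Person> sortByFurigana(List<Person> people) {
        List<Person> sorted = new ArrayList<>(people);
        sorted.sort(Comparator.comparing((Person p) -> p.furigana));
        return sorted;
    }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<>();
        people.add(new Person("\u5C71\u7530", "\u3084\u307E\u3060")); // Yamada
        people.add(new Person("\u9752\u6728", "\u3042\u304A\u304D")); // Aoki
        System.out.println(sortByName(people).get(0).furigana);       // reading of Yamada
        System.out.println(sortByFurigana(people).get(0).furigana);   // reading of Aoki
    }
}
```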

WOdka improvements

• Language-switching

WOdkaLanguageEnums

• Language name

• Locale Code

• Date format + 24 hours setting

• Data for Flag information

WOdkaCountryEnums

• Country name

• code2 : ISO Code for Country

• code3 : ISO Code for Country

• money : ERXMoneyEnums

• language : WOdkaLanguageEnums

• telephone code

• tax : tax info

• zip : zip format

• company Mailing Format

• family Mailing Format

• Localized words : male, female, sexMale, sexFemale

• flag : Path to Flag-data

• continent : ERXContinentEnums

• EU : ERXEuropeanUnionsEnums

family Mailing Format

"[S][CR][T][_][F][_][L]"

"[L] [F]様"

s = sex
t = title
f = first name
l = last name
cr = next line
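How such a format string might be expanded can be sketched as a small token replacement. This is not WOdka's implementation; MailingFormat is a made-up class, and the meaning of [_] as a single space is an assumption, since the legend above does not list it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands tokens such as [F] and [L] in a mailing format.
final class MailingFormat {

    private static final Pattern TOKEN = Pattern.compile("\\[(S|T|F|L|CR|_)\\]");

    /** Expands the tokens in one pass; pass "" for parts that are not used. */
    static String format(String pattern, String sex, String title,
                         String firstName, String lastName) {
        Map<String, String> values = new HashMap<>();
        values.put("S", sex);
        values.put("T", title);
        values.put("F", firstName);
        values.put("L", lastName);
        values.put("CR", "\n");
        values.put("_", " "); // assumption: [_] stands for a space

        Matcher matcher = TOKEN.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Japanese style: last name first, followed by the honorific "sama"
        System.out.println(format("[L] [F]\u69D8", "", "", "Ken", "Ishimoto"));
    }
}
```

Replacing all tokens in a single pass avoids expanding a token a second time when a name happens to contain bracketed text.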

Thanks to

• Masahiko TANI - A10 Objects Inc., (Japan)

• Hiroyuki FUKUI - Astonish Create (Japan)

Special Thanks to

• Paul YU - Green orchid llc (USA)

Thank You

WOWODC 2013
