arabic script should not be so scary!

Arabic Script SHOULD NOT be so Scary! What else Unicode still needs to do for the Perso-Arabic script Behnam Esfahbod behnam.es 37th Internationalization and Unicode Conference – October 2013 – Santa Clara, CA

Upload: behnam-esfahbod

Post on 07-Dec-2014


DESCRIPTION

What else Unicode still needs to do for the Perso-Arabic script. Behnam Esfahbod – http://behnam.es/ 37th Internationalization and Unicode Conference – October 2013 – Santa Clara, CA

TRANSCRIPT

Page 1: Arabic Script SHOULD NOT be so Scary!

Arabic Script SHOULD NOT be so Scary!

What else Unicode still needs to do for the Perso-Arabic script

Behnam Esfahbod
behnam.es

37th Internationalization and Unicode Conference – October 2013 – Santa Clara, CA

Page 2: Arabic Script SHOULD NOT be so Scary!

In the past decade...

The FarsiWeb Project
● National Standard for Persian Unicode text
● Standard Persian keyboard layouts on all major OSs
● Standard-based Open Source Persian TTF fonts
● Persian support in GNOME & Mozilla projects

IRNIC, The .IR ccTLD Registry
● Deploy Persian (second-level) IDNs
● Apply for the IDN ccTLD (actually, 2 of them!)

Persian Computing Community
● Over 300 engineers, designers, …
● SIGs, e.g. Persian Typography

ICANN & IETF
● IDN Variant Issues Project
● MESWG Task Force on Arabic-Script IDNs

Page 3: Arabic Script SHOULD NOT be so Scary!

Previously on Unicode

Page 4: Arabic Script SHOULD NOT be so Scary!

Mainly Arabic and Indo-Iranian languages
● Is the basis for many alphabets
● Persian, Urdu, Pashto, …, and Arabic!

No alphabet uses all kinds of shapes
● The four-dot shape is not used in Arabic or Persian
● But 99% of the shapes are recognized by everyone

Many writing styles
● Naskh, Nasta’liq, etc.

Not all alphabets use all the features
● The Arabic language/alphabet doesn’t use ZWNJ, traditionally
  ● Rooted in the properties of the language and its conjugation forms
  ● But recently it is being used for non-Arabic words

IBM ⇒ آی‌بی‌ام

Perso-Arabic Script

Page 5: Arabic Script SHOULD NOT be so Scary!

Semantic Encoding

Writing Direction
● Letters: right-to-left
● Digits: left-to-right
⇒ Unicode Bidirectional Algorithm (Bidi/UBA)

س ل ا م

One code point for each letter
⇒ Unicode Arabic Joining Algorithm

سلـامسلم

● Joining Control Characters
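The split between right-to-left letters and left-to-right digits shows up directly in the Bidi_Class property that the UBA works from. A minimal Python sketch using the standard `unicodedata` module; the sample characters and the `bidi_classes` helper are illustrative choices, not taken from the slides:

```python
import unicodedata

# Illustrative sample characters (assumption: picked for this sketch only).
SAMPLES = {
    "Arabic letter Seen": "\u0633",
    "ASCII digit one": "1",
    "Arabic-Indic digit one": "\u0661",
    "Extended Arabic-Indic (Persian) digit one": "\u06f1",
}

def bidi_classes(chars):
    """Map each label to the Bidi_Class of its character."""
    return {label: unicodedata.bidirectional(ch) for label, ch in chars.items()}

classes = bidi_classes(SAMPLES)
for label, bidi_class in classes.items():
    print(f"{label}: {bidi_class}")
```

Letters report `AL` (Arabic Letter, right-to-left), while digits report `EN` or `AN`, the number classes that the algorithm lays out left-to-right inside right-to-left text.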

Page 6: Arabic Script SHOULD NOT be so Scary!

Dual-Joining (yellow) & Right-Joining (blue)

Page 7: Arabic Script SHOULD NOT be so Scary!

Special Characters

1. U+0640 ARABIC TATWEEL
● a.k.a. Kashida
● کشیــــــــــــــــــده ⇒ کشیده

2. U+200C ZERO WIDTH NON-JOINER
● a.k.a. ZWNJ
● نامه‌ای ⇒ نامهای

3. U+200D ZERO WIDTH JOINER
● a.k.a. ZWJ
● ه‍.ش. ⇒ ه.ش.
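To make the three characters concrete, the following sketch prints their names and general categories with Python's `unicodedata`, then shows why Kashida matters for processing: an elongated word and its plain spelling are different strings. The `describe` helper and the sample word are assumptions made for illustration.

```python
import unicodedata

TATWEEL = "\u0640"  # Kashida
ZWNJ = "\u200c"
ZWJ = "\u200d"

def describe(ch):
    """Return code point, character name, and general category."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"

for ch in (TATWEEL, ZWNJ, ZWJ):
    print(describe(ch))

# The same word with and without elongation (sample word: "keshideh").
plain = "\u06a9\u0634\u06cc\u062f\u0647"
stretched = "\u06a9\u0634\u06cc" + TATWEEL * 3 + "\u062f\u0647"

print(plain == stretched)                       # compared as raw strings
print(plain == stretched.replace(TATWEEL, ""))  # compared with Kashida removed
```

Tatweel is a modifier letter (`Lm`), while the two joining controls are format characters (`Cf`); none of them is a letter of any alphabet, yet all of them end up in stored text.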

Page 8: Arabic Script SHOULD NOT be so Scary!

Status Quo

1. Kashida
● On the keyboard: too similar to Hyphen
● Makes the text hard to process (search, …)

2. ZWNJ
● Inaccessible in non-standard keyboard layouts
● Confuses search engines and applications
  ● IDNA2003
  ● UAX #31 (Unicode Identifier and Pattern Syntax)
  ● IDNA2008
  ● UTR #36 (Unicode Security Considerations)

3. ZWJ
● Too little use
● But when it is needed, the user is already tired of the other issues

Page 9: Arabic Script SHOULD NOT be so Scary!

Be Stylish

Page 10: Arabic Script SHOULD NOT be so Scary!

Western styles for Latin-like scripts
● Bold
● Italic & Slanted (including Iranic: left-slanted)
● Underline

Question: When, how, and by whom did these three become the de facto standard?

Justification & Elongation
● Traditionally used in manuscripts
● Used to be available in movable type too
● A recent study shows Elongation is still the right way to emphasize text in Perso-Arabic script

Typographic Styles

Page 11: Arabic Script SHOULD NOT be so Scary!

© 2013 Nasser Hadjloo, “Curious Case of Word Stylers”, TEDxTehran

1000 persons × 10 emphasis methods ⇒ Justification & Elongation lead the way!

Page 12: Arabic Script SHOULD NOT be so Scary!

Typographic Emphasis

Letter Spacing

L A T I N S C R I P T

Elongation + Word Spacing

خــــط فـــــارســــی

● CSS
  ● “letter-spacing” is useless (ط ‍ ‍خ )
  ● “text-justify” does work, but only for justification

⇒ No sane way to do elongation

Page 13: Arabic Script SHOULD NOT be so Scary!

Verbal Emphasis

Letter Repetition

Latin script… Noooo!

Elongation

خط فارسی… بـلـــــی!

● Keyboard
  ● Direct mapping of characters to typographic elements (glyphs) is bad!
  ● Pressing a Kashida key a few times to emphasize some letter: bad again!

Page 14: Arabic Script SHOULD NOT be so Scary!

Text Input & Storage

Page 15: Arabic Script SHOULD NOT be so Scary!

Joining Control

ZWNJ is very common in Persian
● Necessary to create compound words
● And mandatory for some words

خانه‌ای   بی‌بی

● But preferential in some others

خطکش   می‌شود

Larry Tesler: “no modes”
● Mode: a distinct setting in which the same user input will produce perceived results different from those it would in other settings

Page 16: Arabic Script SHOULD NOT be so Scary!

Modeless Input Methods

Current keyboard techniques
● Based on typewriter technologies
● Based on the needs of Latin script

Kashida insertion should be implicit
● Three key presses give you three letters
● Three letters need three key presses

[ب]...............[ل]....[ه]
بـــــــــــــــــــلــــــه

ZWNJ insertion should be implicit
● Instead of a “mode”, we need a “modifier”
● The Shift key would make much sense!

Note: only applies to dual-joining letters, if followed by a joining letter
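The note above can be read as a small rule for an input method: a "do not join" modifier only needs to emit a ZWNJ when the previous letter joins forward and the new letter joins backward. The sketch below is one possible reading of that rule, not the author's implementation. The `type_letter` function, the Shift-as-modifier behavior, and the hand-picked `RIGHT_JOINING` table are all assumptions; a real IME would take joining types from the Unicode `ArabicShaping.txt` data file.

```python
import unicodedata

ZWNJ = "\u200c"

# Assumption: a small hand-picked subset of right-joining letters
# (alef forms, teh marbuta, dal, thal, reh, zain, waw, jeh).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062f\u0630\u0631\u0632\u0648\u0698"
)

def is_joining_letter(ch):
    """Rough test for an Arabic-script letter that joins at all."""
    return (
        "\u0620" <= ch <= "\u06ff"
        and ch != "\u0621"  # hamza does not join
        and unicodedata.category(ch) == "Lo"
    )

def joins_forward(ch):
    """True for dual-joining letters, which connect to the next letter."""
    return is_joining_letter(ch) and ch not in RIGHT_JOINING

def type_letter(buffer, letter, shift=False):
    """Append a letter; with the hypothetical Shift modifier, break the join.

    A ZWNJ is inserted only when it would have a visible effect.
    """
    if shift and buffer and joins_forward(buffer[-1]) and is_joining_letter(letter):
        buffer += ZWNJ
    return buffer + letter

# Typing meem, yeh, Shift+sheen, waw, Shift+dal.
keystrokes = [
    ("\u0645", False),
    ("\u06cc", False),
    ("\u0634", True),   # yeh is dual-joining, so a ZWNJ is inserted
    ("\u0648", False),
    ("\u062f", True),   # waw is right-joining, so no ZWNJ is needed
]
typed = ""
for letter, shift in keystrokes:
    typed = type_letter(typed, letter, shift)
```

Because the modifier is evaluated per key press, no unnecessary ZWNJ ever reaches storage, which is the property the "no modes" argument is after.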

Page 17: Arabic Script SHOULD NOT be so Scary!

Obstacles

Input
● Need to remove unnecessary ZWNJ dynamically
● Hard to implement in keyboard engines
● Has to be implemented in an IME

Processing
● Search
  ● Kashida should be ignored
  ● ZWNJ should be respected, sometimes
● Identifiers
  ● Standards have to deal with them, individually!

Output
● A few apps still have problems with ZWNJ
● Kashida not handled correctly in most fonts
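For the search case, one simple way to express "ignore Kashida, respect ZWNJ sometimes" is to compare folded keys instead of raw text. This is a minimal sketch under that assumption; `search_key`, `contains`, and the `ignore_zwnj` switch are hypothetical names, and the slides do not say when ZWNJ should be ignored, so that decision is left to the caller.

```python
TATWEEL = "\u0640"
ZWNJ = "\u200c"

def search_key(text, ignore_zwnj=False):
    """Fold text for matching: always drop Kashida, optionally drop ZWNJ."""
    key = text.replace(TATWEEL, "")
    if ignore_zwnj:
        key = key.replace(ZWNJ, "")
    return key

def contains(haystack, needle, ignore_zwnj=False):
    """Substring search on folded keys."""
    return search_key(needle, ignore_zwnj) in search_key(haystack, ignore_zwnj)

# Elongated "khat farsi" still matches the plain word "farsi".
elongated = "\u062e\u0640\u0640\u0637 \u0641\u0640\u0640\u0627\u0631\u0633\u0640\u06cc"
plain_word = "\u0641\u0627\u0631\u0633\u06cc"

# The same verb typed with and without ZWNJ.
with_zwnj = "\u0645\u06cc\u200c\u0634\u0648\u062f"
without_zwnj = "\u0645\u06cc\u0634\u0648\u062f"

print(contains(elongated, plain_word))
print(contains(with_zwnj, without_zwnj))
print(contains(with_zwnj, without_zwnj, ignore_zwnj=True))
```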

Page 18: Arabic Script SHOULD NOT be so Scary!

Multilingual Environments

Page 19: Arabic Script SHOULD NOT be so Scary!

HCI Model & Language

Language in the computer
Language in the user’s mind

Page 20: Arabic Script SHOULD NOT be so Scary!

Encoding Challenges

Characters with similar shapes (Not in all joining forms)

Arabic Yeh vs. Persian Yeh
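The two Yehs are separate code points even though they look alike in some joining forms. The sketch below prints their Unicode names and applies a per-language mapping of the kind discussed later in the deck. The `SHAPE_MAP` table and `map_shapes` function are hypothetical, and the Kaf/Keheh pair is an added assumption that does not appear on this slide.

```python
import unicodedata

ARABIC_YEH = "\u064a"
FARSI_YEH = "\u06cc"
ARABIC_KAF = "\u0643"   # assumption: included as a second look-alike pair
KEHEH = "\u06a9"

# Hypothetical per-language tables: map look-alikes to the preferred letter.
SHAPE_MAP = {
    "fa": {ARABIC_YEH: FARSI_YEH, ARABIC_KAF: KEHEH},
    "ar": {FARSI_YEH: ARABIC_YEH, KEHEH: ARABIC_KAF},
}

def map_shapes(text, lang):
    """Replace look-alike letters with the ones preferred for `lang`.

    Text is returned unchanged when the language has no table.
    """
    return text.translate(str.maketrans(SHAPE_MAP.get(lang, {})))

for ch in (ARABIC_YEH, FARSI_YEH):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# "farsi" typed on an Arabic keyboard layout, then mapped for Persian.
typed_with_arabic_yeh = "\u0641\u0627\u0631\u0633\u064a"
persian = map_shapes(typed_with_arabic_yeh, "fa")
```

When no language is given the text is left alone, which reflects how unclear the language-less case is rather than solving it.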

Page 21: Arabic Script SHOULD NOT be so Scary!

© SIL International

Preferred shapes of the Heh letter family, by language

Page 22: Arabic Script SHOULD NOT be so Scary!

Internationalized Identifiers

Confused about Kashida
● Use NFKC to get rid of it

Confused about ZWNJ
● Some protocols mandate some rules
  ● And FAIL if the input doesn’t comply
  ● UAX #31 (Unicode Identifier and Pattern Syntax) rule 2.3.A1
  ● UTS #46 ⇢ IDNA2008 ⇢ UAX #31
● No standard approach to make it more user-friendly
  ● Although the semantics are obvious to the user

No harmony in same-shape problems
● Protocols barely talk about it
● Policy-making works only for a few cases
● Applications decide how to handle it, if they do

Page 23: Arabic Script SHOULD NOT be so Scary!

Developing a Solution

Page 24: Arabic Script SHOULD NOT be so Scary!

Existing Similar Methods

Case-Mapping
● Letter cases (different representations of the same concept)
● Useful for Western scripts (Latin, Cyrillic, Greek, …)
  ● Doesn’t work for the other scripts
● Language-dependent

Normalization Form Canonical
● Deals with encoding issues
  ● Ensures backward-compatibility
● Allows more expansion of the UCD
  ● Better forward-compatibility

Normalization Form Compatibility
● Works well for Arabic script (Kashida)
● But does too much damage to other scripts
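The difference between the two normalization families can be seen with Python's `unicodedata.normalize`. The sample strings are illustrative choices: the canonical form only unifies equivalent encodings, while the compatibility form also folds Arabic presentation forms and rewrites characters in other scripts, which is the kind of damage the last bullet refers to.

```python
import unicodedata

def nfc(text):
    return unicodedata.normalize("NFC", text)

def nfkc(text):
    return unicodedata.normalize("NFKC", text)

# Canonical: "e" + combining acute accent becomes the precomposed letter.
composed = nfc("e\u0301")

# Compatibility: an Arabic presentation-form ligature (U+FEFB, isolated
# Lam-Alef) is folded back to the two ordinary letters.
lam_alef = nfkc("\ufefb")

# Compatibility also changes text in other scripts:
squared = nfkc("x\u00b2")    # superscript two turns into a plain digit
ligature = nfkc("\ufb01")    # the Latin "fi" ligature is split in two

# The canonical form leaves the superscript alone.
untouched = nfc("x\u00b2")
```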

Page 25: Arabic Script SHOULD NOT be so Scary!

The Solution SHALL be able to…

● Remove unnecessary (invisible) ZWNJs

● Remove Kashida characters

● Place text into a specific language
  ● Maintaining the expected shape

● Stay consistent when language is not specified

Page 26: Arabic Script SHOULD NOT be so Scary!

Arabic Shape Mapping

● Language-less Basic Normalization
  ● Remove any unnecessary (invisible) ZWNJ
● Language-less Identifier Normalization
  ● Basic normalization
  ● Remove any Kashida
● Language-based Shape Mapping
  ● Map characters with the same joining-form shapes to the right one for the target language
● Language-less Shape Mapping
  ● Not straightforward at all!
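The two language-less steps can be sketched as follows. This is an illustration under stated assumptions, not the proposal's reference algorithm: `basic_normalize` and `identifier_normalize` are hypothetical names, the joining types come from a hand-picked table instead of `ArabicShaping.txt`, and a ZWNJ is treated as "unnecessary" when it does not sit between a letter that joins forward and a letter that joins backward.

```python
import unicodedata

ZWNJ = "\u200c"
TATWEEL = "\u0640"

# Assumption: simplified subset of right-joining letters.
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062f\u0630\u0631\u0632\u0648\u0698"
)

def joining_type(ch):
    """Return 'D' (dual), 'R' (right) or 'U' (non-joining), simplified."""
    if ch == TATWEEL:
        return "D"  # Kashida connects on both sides
    if ch in RIGHT_JOINING:
        return "R"
    if "\u0620" <= ch <= "\u06ff" and ch != "\u0621" and unicodedata.category(ch) == "Lo":
        return "D"
    return "U"

def _is_mark(ch):
    """Combining marks are transparent to joining and are skipped."""
    return unicodedata.category(ch) == "Mn"

def basic_normalize(text):
    """Drop every ZWNJ that has no visible effect on joining."""
    out = []
    for i, ch in enumerate(text):
        if ch != ZWNJ:
            out.append(ch)
            continue
        before = next((c for c in reversed(out) if not _is_mark(c)), "")
        after = next((c for c in text[i + 1:] if c != ZWNJ and not _is_mark(c)), "")
        if joining_type(before) == "D" and joining_type(after) in ("D", "R"):
            out.append(ch)
    return "".join(out)

def identifier_normalize(text):
    """Remove Kashida, then apply the basic normalization.

    Kashida is removed first so that ZWNJ decisions are made on the
    remaining letters (an ordering choice made for this sketch).
    """
    return basic_normalize(text.replace(TATWEEL, ""))

needed = "\u0645\u06cc\u200c\u0634\u0648\u062f"           # ZWNJ after yeh: visible
unneeded = "\u062f\u0648\u200c\u0628\u0627\u0631\u0647"   # ZWNJ after waw: invisible
trailing = "\u06a9\u062a\u0627\u0628\u200c"               # ZWNJ at end of text
elongated = "\u06a9\u0634\u06cc\u0640\u0640\u0640\u062f\u0647"
```

Doubled ZWNJs collapse to one under this rule, because the second one follows a non-joining character.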

Page 27: Arabic Script SHOULD NOT be so Scary!

T H E   E N D

پـــــــــــــــــــــــایان