creating multi-lingual and multi-locale databases international unicode conference 19 presented by...

33
Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods, Inc.

Upload: diego-salisbury

Post on 26-Mar-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Creating Multi-Lingual and Multi-Locale Databases

International Unicode Conference 19

Presented by Addison Phillips

Globalization Architect

webMethods, Inc.

Page 2: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Introduction

Audience: Beginning Developers Presenter: Addison Phillips

Globalization ArchitectwebMethods, Inc.mailto:[email protected]

Presentation: http://www.inter-locale.com Creating complex systems in a global environment requires more than

internationalized code. Since most Enterprise system rely on relational databases, a global-ready system must also consider database design in order to be truly effective.

Page 3: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Our Problem

This presentation is based on the lessons learned in developing and deploying a B2B “conversation management” system (webMethods for Trading Networks) and “partnership management” software (webMethods PartnerConnect).

The products we created share a central database that allow webMethods customers to manage their B2B trading partnerships.

Terminology: A “trading partner” is a company that you want to do business with. An “initiative” is a specific opportunity to work together.

Page 4: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Trading Networks

Companies need to store information about transactions and business relationships world wide and in real time.

We call this “Global Business Visibility”

Page 5: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Partner Connect Goals

Centrally served (one instance).

Centrally managed (initiatives can be deployed anywhere).

Localized (so partners can interact with initiatives in their own language).

Cultural and market sensitivity (customized to fit different market conditions locally).

Created and managed by the customer entirely through HTML interface.

Page 6: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Profile and Conversation Management

Page 7: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Enter the Database

Serve both global and local initiatives from a single instance.

Store data in multiple writing systems (scripts, languages).

Provide for actual differences in the data due to user location (“locale”).

Provide for localization of global content.

Provide for local content management.

Page 8: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Basic Rules for a Global DB Schema

1. Expand fields to support changes in character encoding.2. Expand fields to support differences in the storage

requirements of other locales (cultural or linguistic expansion, as opposed to encoding)

3. Classify data as locale-neutral, locale-intrinsic, or locale-related and re-normalize the tables accordingly.

4. Create efficient access to both global and locale-specific information.

Page 9: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Selecting an Encoding

If a database instance will only serve a single locale (or compatible locales), then the character encoding can be selected based on local requirements (“legacy encoding”).

If the database must store data from many locales (or incompatible writing systems), then the character encoding selected must be a Unicode encoding.

Each database vendor has a unique approach to this. Encodings vary in terms of performance and capability.

Generally the two choices you have are: UTF-8 UTF-16 (formerly known as UCS-2)

Page 10: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Character Encodings and DDL

Each database vendor provides their own encoding support.

Most support “legacy” encodings and their variants.

Need a Unicode encoding to support multiple languages (globally)*

Each vendor handles Unicode encodings differently.

CREATE TABLE Address (

cust_id number,

attn varchar(50),

department varchar(50),

street1 varchar(50),

street2 varchar(50),

city varchar(50),

state char(2),

zip varchar(5),

country varchar(18));

Page 11: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Example: Cloudscape

Cloudscape is a pure Java database.

Uses java.lang.String objects to store char and varchar data, so all string data is stored as UCS-2.

(1) java.lang.Character equals (1) unit in DDL

CREATE TABLE Address (

cust_id number,

attn varchar(50),

department varchar(50),

street1 varchar(50),

street2 varchar(50),

city varchar(50),

state char(2),

zip varchar(5),

country varchar(18));

Page 12: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Example: Oracle

Oracle provides several native Unicode encodings. The most commonly used one is called “UTF8”.

Characters in UTF-8 range from one to four bytes*

Char and varchar2 types are defined in bytes, not characters.

So a varchar2(30) can reliably store 10 Unicode 2.1.8 characters (and as many as 30).

Note that a varchar2(60) is required to store surrogate pairs.

CREATE TABLE Address (

cust_id number,

attn varchar2(150),

department varchar2(150),

street1 varchar2(150),

street2 varchar2(150),

city varchar2(150),

state char(6),

zip varchar2(15),

country varchar2(18));

Page 13: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Oracle Example

Create a table and insert values. Notice that “multibyte” values take

more room to store.

Page 14: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Example: MS SQL Server 2000

SQL Server 2000 provides support for the UTF-16 encoding of Unicode via the nchar and nvarchar datatypes.

Char and varchar2 must use a legacy encoding, with sizes defined in bytes (so a varchar(30) can store as many as 30 and as few as 15 characters in Shift-JIS [CP932]).

Nchar and nvarchar are defined in characters, so an nvarchar(30) can store 30 characters.

Note that an nvarchar(30) can only store 15 characters beyond U+FFFF.

CREATE TABLE Address (

cust_id integer,

attn nvarchar(50),

department nvarchar(50),

street1 nvarchar(50),

street2 nvarchar(50),

city nvarchar(50),

state nchar(2),

zip nvarchar (5),

country nvarchar(18));

Page 15: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

SQL Server Example

Create a table. Insert data using “multibyte”

characters. Insert data using “single-byte”

characters.

Page 16: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Oracle Unicode Encoding Variations

AL24UTFFSS. The original Unicode encoding supported by Oracle. It is not compatible with modern Unicode and should be avoided.

UTF8. A multibyte encoding used by most versions of Oracle. This version encodes Unicode Scalar Values larger than U+FFFF as the UTF-8 sequence for a pair of surrogate characters.

This results in binary sorting sequences compatible with UTF-16 representations.

This violates the Unicode “shortest form” requirement (note that this is invisible to my Java application).

*(All JDBC drivers adjust the connection to use this encoding automatically.)

AL32UTF8. A UTF-8 encoding provided in Oracle 9i that correctly encodes Unicode Scalar Values larger than U+FFFE using the shortest form. Note that the sorting sequence is different.

nchar/nvharchar support for UTF-16 in Oracle 9i.

Page 17: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

MS SQL Server 2000 Issues

Code Page 65001: This is Microsoft’s code page for UTF-8. It can not be used as a char/varchar/text encoding, even in the most recent

versions of MS SQL Server. See http://support.microsoft.com/support/kb/articles/Q232/5/80.ASP for

more information. You can use a different encoding (by setting the collation) for each data

column, but this is not a very convenient way to work in a global environment.

JDBC connections to SQL Server use the JDBC-ODBC driver. This driver cannot tell the difference between n-types and “regular” types, and thus cannot retrieve Unicode string values.

Note that this also applies to several middleware products, notably Merant. Note that this also applies to use of variant text types in other databases

(such as Oracle 9i).

Page 18: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Some Other Databases

Sybase ASE supports UTF-8. Sybase 11 supports UTF-8 via an add-on. ASE is adding support for UTF-16 via a new data type.

IBM DB/2 Supports UTF-16 (as CCSID 13844). Supports UTF-8 as a database encoding.

MySQL doesn’t support Unicode. The Open Source folks need to get to work …

Page 19: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Modifying Size Constraints

UTF-8 has a maximum number of bytes-per-character of 4.

Vast majority of characters use 3 or less.

Older systems (JDK, for example) cannot access the 4-byte characters.

Determine size requirements: Specific constraint

--or--

Arbitrary constraint.

Specific Size Limit: Check length using code (database

fields have variable restrictions). For UTF-8: Multiply by 3 (or 4) bytes

to get field length. Example: varchar2(10) becomes

varchar2(30).

Arbitrary Size Limit: Multiply the desired maximum by 3

bytes to get approximate size. Adjust according to database and

performance requirements. Example: varchar2(100) becomes

varchar2(255). [Was able to store 100 characters, now a “minimum maximum” of 85.]

Page 20: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Cultural Data Expansion

Data also changes size (and sometimes type) because of “culture” or locale.

Examples: Social Security Number is 13 digits in France “Postal code” not all numeric outside USA and may be quite long. Different address units than “State” Spanish users often have two or three “middle” names.

Avoid arbitrarily small char and varchar field lengths. Most databases optimize storage of variable length fields.

But avoid performance killing sizes. Oracle block size limitations. Oracle JDBC character conversion “latching” maximum

(2000 bytes in 8.0.5).

Page 21: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Cultural Data Expansion

State (char2) becomes province (up to 85 characters).

ZIP code (varchar9) becomes postalcode (up to 50 characters and probably much more).

Address fields expand from 50 bytes to 255 bytes (or about 85 characters).

Don’t assume that the same fields will always represent the exact same data values.

CREATE TABLE Address (

address_id char(24),

cust_id char(24),

contact_id char(24),

country_id char(2),

department varchar2(255),

street1 varchar2(255),

street2 varchar2(255),

city varchar2(255),

province varchar2(255),

postalcode varchar2(150));

Page 22: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

What’s Left?

So far we’ve: Expanded storage to deal with character encodings. Expanded storage to deal with cultural and linguistic data expansion.

We still need to: Allow for localization of textual elements. Allow for relational changes due to cultural or linguistic requirements.

Page 23: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Basic Questionnaire Table Structure

QUESTION--------Q_ID char(24)QUES_ID char(24)TYPE_ID char(2)QUESTION varchar(255)SEQ_NUM number

Page 24: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Structure with Localization

QUESTION--------Q_ID char(24)PKQUES_ID char(24)PKTYPE_ID char(2)DESCRIPTION varchar(255)SEQ_NUM number

QUESTION_LOCALE---------------Q_ID char(24)PK, FKQUES_ID char(24)PK, FKLCID number PKNAME varchar(255)

Page 25: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Selecting the Locale

How Java does it: <baseclass>+<specific language>+ <specific

country>+<specific variant> <baseclass>+<specific language>+ <specific country> <baseclass>+<specific language> <baseclass>+<default language>+ <default country>+<default

variant> <baseclass>+<default language>+ <default country> <baseclass>+<default language> <baseclass>

How can we replicate this in SQL?

Page 26: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

One Method…

SELECT * FROM Questionnaire WHERE InitiativeID = ?SELECT * FROM Question WHERE Q_ID = ?(while more questions)(do)SELECT * FROM QuestionLocale WHERE Ques_ID = ? AND LCID=?

(until you find a record…)(wend)

QUESTIONLOCALE--------------Q_IDQUES_IDLANG_IDTERRITORY_IDQUESTION

• Inefficient.• Difficult to manage.

Page 27: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Our Solution

Concept of “installed locale”

Create associated “installed locale” records at the hub or questionnaire level.

Perform locale negotiation once.

No additional searches required.

Let’s look…

Page 28: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,
Page 29: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,
Page 30: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Appearance of Questions

Page 31: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Data and Locale

Some data is locale neutral. Formatted at display time to

match user’s locale. Values don’t vary by locale. Note: It may be in a language.

Some data is locale related. Data locale implied by context. Formatting/Validation is supplied

by context. Locale can be inherited or

cascaded.

Some data is locale intrinsic.Business Logic (format/validation) changes due to data’s locale.Locale must be tagged.Implies a separate table.

Page 32: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

Simplify with Locale Related

Page 33: Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods,

QUESTIONS AND ANSWERS

Presentation Available at http://www.inter-locale.com

mailto:[email protected]