Adobe® Marketing Cloud

Multi-Byte Characters

Contents

Multi-Byte Character Sets
Web Page Encodings and Character Sets
    ISO-8859-1 Encoding and Character Set
    CP1252 Windows-1252 Character Set
    UTF-8 Encoding Unicode Character Set
Analytics Report Suites - Standard ISO and Multi-byte Enabled
Using the charSet Property
Analytics Display Language
Character Codes 128-255 - ISO vs. UTF-8
Variable Lengths
Enabling Multi-Byte Support
    Supported Character Sets
Contact and Legal Information

Last updated 2/11/2015

Multi-Byte Character Sets

Analytics can capture and report data in multiple languages, so international sites can be tagged with Analytics code and generate reports that reflect the site content as displayed to the user. A single report suite can be used to collect and report data in multiple languages.

Properly utilizing the internationalization capability of Analytics involves coordinating the report suite configuration, the web page encoding, and the Analytics property charSet.

For example, if the sites mysite.com (English), mysite.co.jp (Japanese), and mysite.co.kr (Korean) are all sending data to a single global report suite, Analytics can display the English, Japanese, and Korean data simultaneously in a single report.

In addition to collecting and displaying international data, the Analytics interface can be displayed in several languages, including English, German, Japanese, Chinese, and Korean.

Web Page Encodings and Character Sets

Web pages display textual data by converting numeric character codes to physical characters based on the page encoding, which defines the range of available characters that can be properly displayed on the page.

The page encoding is set with one of the following three methods.

• Using a <META> tag inside the <HEAD> tag of the page, for example, <META http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

• Within the HTTP header, for example, Content-Type: text/html; charset=ISO-8859-1

• By browser auto-detection. If methods one and two are not used, modern browsers will attempt to detect the page encoding based on the content or simply use a default encoding based on user preferences.

For greater visibility of the page encoding, Adobe recommends using the first method whenever possible. The third method may be unreliable for international sites and should be avoided whenever possible.
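To confirm which encoding a page declares, the charset parameter can be read out of a Content-Type value, whether it comes from the HTTP header or from the content attribute of the <META> tag. The following is a minimal sketch, not part of the Analytics code; parseCharset is a hypothetical helper name. In a browser, the standard document.characterSet property reports the encoding that was actually applied.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// The parameter name is matched case-insensitively and the value may be quoted.
function parseCharset(contentType) {
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
  return match ? match[1] : null;
}

console.log(parseCharset("text/html; charset=ISO-8859-1")); // ISO-8859-1
console.log(parseCharset("text/html"));                     // null
```

A null result means the value declares no encoding, in which case the browser falls back to auto-detection (the third method above).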

For additional information on encodings and character sets, refer to http://www.w3.org/International/tutorials/tutorial-char-enc/.

ISO-8859-1 Encoding and Character Set

The most commonly used encoding for Latin-based languages (English, French, Spanish, etc.) is "ISO-8859-1," which is one of many standards that use single-byte encodings.

Each character is represented by one (and only one) byte of data. Therefore, single-byte encodings, including ISO-8859-1, are limited to 256 character codes.

The following table contains the complete set of characters that are available within ISO-8859-1.

Character Number  Character                    Character Description
0-31              non-displayed control codes  N/A
32                space                        space
33                !                            exclamation point
34                "                            straight quote marks
35                #                            hash mark/number sign
36                $                            dollar sign
37                %                            percent sign
38                &                            ampersand
39                '                            straight quote mark/apostrophe
40                (                            left parenthesis
41                )                            right parenthesis
42                *                            asterisk
43                +                            plus sign
44                ,                            comma
45                -                            hyphen
46                .                            period
47                /                            slash
48                0                            zero
49                1                            one
50                2                            two
51                3                            three
52                4                            four
53                5                            five
54                6                            six
55                7                            seven
56                8                            eight
57                9                            nine
58                :                            colon
59                ;                            semicolon
60                <                            less than sign
61                =                            equals sign
62                >                            greater than sign
63                ?                            question mark
64                @                            commercial "at" sign
65                A                            uppercase A
66                B                            uppercase B
67                C                            uppercase C
68                D                            uppercase D
69                E                            uppercase E
70                F                            uppercase F
71                G                            uppercase G
72                H                            uppercase H
73                I                            uppercase I
74                J                            uppercase J
75                K                            uppercase K
76                L                            uppercase L
77                M                            uppercase M
78                N                            uppercase N
79                O                            uppercase O
80                P                            uppercase P
81                Q                            uppercase Q
82                R                            uppercase R
83                S                            uppercase S
84                T                            uppercase T
85                U                            uppercase U
86                V                            uppercase V
87                W                            uppercase W
88                X                            uppercase X
89                Y                            uppercase Y
90                Z                            uppercase Z
91                [                            left square bracket
92                \                            backslash
93                ]                            right square bracket
94                ^                            caret
95                _                            underscore bar
96                `                            grave accent
97                a                            lowercase a
98                b                            lowercase b
99                c                            lowercase c
100               d                            lowercase d
101               e                            lowercase e
102               f                            lowercase f
103               g                            lowercase g
104               h                            lowercase h
105               i                            lowercase i
106               j                            lowercase j
107               k                            lowercase k
108               l                            lowercase l
109               m                            lowercase m
110               n                            lowercase n
111               o                            lowercase o
112               p                            lowercase p
113               q                            lowercase q
114               r                            lowercase r
115               s                            lowercase s
116               t                            lowercase t
117               u                            lowercase u
118               v                            lowercase v
119               w                            lowercase w
120               x                            lowercase x
121               y                            lowercase y
122               z                            lowercase z
123               {                            left curly brace
124               |                            solid vertical bar/pipe
125               }                            right curly brace
126               ~                            tilde
127-159           unused                       N/A
160               space                        non-breaking space
161               ¡                            inverted exclamation point
162               ¢                            cents sign
163               £                            pound sterling sign
164               ¤                            general currency sign
165               ¥                            yen sign
166               ¦                            broken vertical bar
167               §                            section
168               ¨                            umlaut/dieresis
169               ©                            copyright symbol
170               ª                            feminine ordinal
171               «                            left angle quote marks
172               ¬                            not sign
173               hyphen                       soft hyphen
174               ®                            registered symbol
175               ¯                            macron accent
176               °                            degree sign
177               ±                            plus or minus
178               ²                            superscript 2
179               ³                            superscript 3
180               ´                            acute accent
181               µ                            micro sign (Greek mu)
182               ¶                            paragraph sign
183               ·                            middle dot
184               ¸                            cedilla
185               ¹                            superscript 1
186               º                            masculine ordinal
187               »                            right angle quote marks
188               ¼                            fraction one-fourth
189               ½                            fraction one-half
190               ¾                            fraction three-fourths
191               ¿                            inverted question mark
192               À                            uppercase A, grave accent
193               Á                            uppercase A, acute accent
194               Â                            uppercase A, circumflex accent
195               Ã                            uppercase A, tilde
196               Ä                            uppercase A, umlaut/dieresis
197               Å                            uppercase A, ring
198               Æ                            uppercase AE ligature, diphthong
199               Ç                            uppercase C, cedilla
200               È                            uppercase E, grave accent
201               É                            uppercase E, acute accent
202               Ê                            uppercase E, circumflex accent
203               Ë                            uppercase E, umlaut/dieresis
204               Ì                            uppercase I, grave accent
205               Í                            uppercase I, acute accent
206               Î                            uppercase I, circumflex accent
207               Ï                            uppercase I, umlaut/dieresis
208               Ð                            uppercase Eth, Icelandic
209               Ñ                            uppercase N, tilde
210               Ò                            uppercase O, grave accent
211               Ó                            uppercase O, acute accent
212               Ô                            uppercase O, circumflex accent
213               Õ                            uppercase O, tilde
214               Ö                            uppercase O, umlaut/dieresis
215               ×                            multiplication sign
216               Ø                            uppercase O, slash
217               Ù                            uppercase U, grave accent
218               Ú                            uppercase U, acute accent
219               Û                            uppercase U, circumflex accent
220               Ü                            uppercase U, umlaut/dieresis
221               Ý                            uppercase Y, acute accent
222               Þ                            uppercase Thorn, Icelandic
223               ß                            small sharp s, German
224               à                            lowercase a, grave accent
225               á                            lowercase a, acute accent
226               â                            lowercase a, circumflex accent
227               ã                            lowercase a, tilde
228               ä                            lowercase a, umlaut/dieresis
229               å                            lowercase a, ring
230               æ                            lowercase ae ligature, diphthong
231               ç                            lowercase c, cedilla
232               è                            lowercase e, grave accent
233               é                            lowercase e, acute accent
234               ê                            lowercase e, circumflex accent
235               ë                            lowercase e, umlaut/dieresis
236               ì                            lowercase i, grave accent
237               í                            lowercase i, acute accent
238               î                            lowercase i, circumflex accent
239               ï                            lowercase i, umlaut/dieresis
240               ð                            lowercase eth, Icelandic
241               ñ                            lowercase n, tilde
242               ò                            lowercase o, grave accent
243               ó                            lowercase o, acute accent
244               ô                            lowercase o, circumflex accent
245               õ                            lowercase o, tilde
246               ö                            lowercase o, umlaut/dieresis
247               ÷                            division sign
248               ø                            lowercase o, slash/null set
249               ù                            lowercase u, grave accent
250               ú                            lowercase u, acute accent
251               û                            lowercase u, circumflex accent
252               ü                            lowercase u, umlaut/dieresis
253               ý                            lowercase y, acute accent
254               þ                            small thorn, Icelandic
255               ÿ                            lowercase y, umlaut/dieresis

CP1252 Windows-1252 Character Set

The CP1252 encoding and character set (otherwise known as the Windows-1252 or simply Windows character set) is a superset of ISO-8859-1.

The CP1252 character set was developed by Microsoft and is used primarily by Microsoft Windows systems. This encoding uses the 128-159 code range to display additional characters not included in the ISO-8859-1 character set.

Character Number  Character                    Character Description
128               €                            Euro currency symbol
129                                            (unused)
130               ‚                            single low-9 quotation mark
131               ƒ                            Latin letter f with hook
132               „                            double low-9 quotation mark
133               …                            horizontal ellipsis
134               †                            dagger
135               ‡                            double dagger
136               ˆ                            modifier letter circumflex accent
137               ‰                            per mille sign
138               Š                            Latin letter S with caron
139               ‹                            single left angle quotation mark
140               Œ                            Latin ligature OE
141                                            (unused)
142               Ž                            Latin letter Z with caron
143                                            (unused)
144                                            (unused)
145               ‘                            left single quotation mark
146               ’                            right single quotation mark
147               “                            left double quotation mark
148               ”                            right double quotation mark
149               •                            bullet
150               –                            en dash
151               —                            em dash
152               ˜                            small tilde
153               ™                            trademark sign
154               š                            Latin letter s with caron
155               ›                            single right angle quotation mark
156               œ                            Latin ligature oe
157                                            (unused)
158               ž                            Latin letter z with caron
159               Ÿ                            Latin letter Y with dieresis

Note: Since this character set is not standardized across all platforms and browsers, these character codes are not valid HTML, though they will display properly on some systems and browsers. Use of these character codes will result in inconsistent display across browser versions and operating systems. Displaying these characters properly requires a more advanced character set and encoding, such as UTF-8 (see UTF-8 Encoding Unicode Character Set).
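Text pasted from word processors often contains these characters (curly quotes, dashes, the Euro sign). The following sketch shows one way to flag them before a value is sent; it is illustrative only, and findCp1252OnlyCharacters is a hypothetical helper. The set lists the Unicode code points of the 27 characters assigned to codes 128-159 in Windows-1252.

```javascript
// Unicode code points of the characters that Windows-1252 assigns to 128-159.
// None of them exist in ISO-8859-1.
const CP1252_ONLY = new Set(
  "\u20AC\u201A\u0192\u201E\u2026\u2020\u2021\u02C6\u2030\u0160\u2039\u0152\u017D" +
  "\u2018\u2019\u201C\u201D\u2022\u2013\u2014\u02DC\u2122\u0161\u203A\u0153\u017E\u0178"
);

// Hypothetical helper: return each distinct CP1252-only character found in text,
// in order of first appearance.
function findCp1252OnlyCharacters(text) {
  return [...new Set(text)].filter((ch) => CP1252_ONLY.has(ch));
}

// The Euro sign, en dash, and curly quotes are reported; plain ASCII is not.
console.log(findCp1252OnlyCharacters("Price: \u20AC5 \u2013 \u201Csale\u201D"));
```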

UTF-8 Encoding Unicode Character Set

UTF-8 encoding is quickly becoming the standard for displaying multilingual (as well as mathematical and scientific) data on the web. UTF-8 is based on the standardized (but evolving) Unicode character set.

Unicode is an advanced character set that, as of version 4.0, includes more than 70,000 characters from nearly all written languages. UTF-8 is one of the most common encoding methods used to convert Unicode character codes into a data byte sequence. Unlike single-byte encoding methods, each character can consist of one to four bytes of data in UTF-8.
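The one-to-four-byte range can be observed directly with the standard TextEncoder API, which always encodes to UTF-8. This is a small illustration; utf8ByteLength is a hypothetical helper name.

```javascript
// Hypothetical helper: number of bytes a string occupies when encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength("A"));         // 1 (code 65, ASCII range)
console.log(utf8ByteLength("\u00C0"));    // 2 (uppercase A with grave accent, code 192)
console.log(utf8ByteLength("\u65E5"));    // 3 (a CJK ideograph)
console.log(utf8ByteLength("\u{1F600}")); // 4 (a character outside the Basic Multilingual Plane)
```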

For more information on Unicode and UTF-8, refer to the following web sites.

• http://www.unicode.org

• http://en.wikipedia.org/wiki/Unicode

• http://en.wikipedia.org/wiki/UTF-8

Analytics Report Suites - Standard ISO and Multi-byte Enabled

Each Analytics report suite is configured as either a standard (ISO) report suite or a multi-byte (UTF-8/localized) report suite.

This setting determines what encoding is used to store and display Analytics data. A standard report suite uses ISO-8859-1 encoding, while a multi-byte report suite uses UTF-8 encoding. Any characters that are not in the ISO-8859-1 character set (including those in the CP1252 character set) will not display properly in a standard ISO report suite. Some of these unsupported characters might cause display problems such as line breaks, odd characters, or even truncation of the value passed to Analytics.

If the data you are passing to Analytics contains any characters not in the ISO-8859-1 character set, you should use a multi-byte report suite. Contact your Implementation Consultant or Adobe Customer Care to make the change. A report suite can be changed from standard to multi-byte, and vice versa. However, for data that has already been collected, characters with codes above 127 might not display properly after the change is made. The best practice is to determine the needed report suite type when the report suite is created.
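When auditing existing values to decide between the two report suite types, a quick check is whether every character falls within codes 0-255, since the first 256 Unicode code points coincide with ISO-8859-1. The sketch below is an assumption-laden illustration, not an Adobe utility; isIso88591Safe is a hypothetical name.

```javascript
// Hypothetical helper: true when every character of value has a code in 0-255,
// that is, when the value can be represented in ISO-8859-1.
function isIso88591Safe(value) {
  for (const ch of value) {
    if (ch.codePointAt(0) > 255) return false;
  }
  return true;
}

console.log(isIso88591Safe("caf\u00E9"));          // true  (e with acute accent is code 233)
console.log(isIso88591Safe("\u20AC5"));            // false (the Euro sign is a CP1252 character)
console.log(isIso88591Safe("\u65E5\u672C\u8A9E")); // false (Japanese text)
```

If any collected value returns false, the data calls for a multi-byte report suite.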

Using the charSet Property

The charSet property, which is normally set in the JavaScript file, is used by Analytics to convert incoming data into UTF-8 for storage and reporting.

Note: The charSet property is required when sending data to a multi-byte report suite and should never be used with a standard report suite. Setting the charSet property with a standard ISO report suite can result in variable truncation or unexpected character conversion.

The value of the charSet property should match the web page encoding in the META tag or HTTP header, even though the syntax may differ slightly. Although the META tag may use an alias for the encoding, the value of charSet should use the preferred (or official) name of the encoding.

Some of the more common encodings with their preferred name and aliases are listed in the following table.

Preferred Name    Aliases
ISO-8859-1        ISO_8859-1, CP819, latin1
ISO-8859-2        ISO_8859-2, latin2
ISO-8859-5        ISO_8859-5, cyrillic
Big5              Big-5
Shift_JIS         SJIS

Because numerous encodings and aliases exist, contact your Implementation Consultant or Adobe Customer Care to confirm the proper value for charSet if it does not appear in the table above.
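The alias table can be expressed as a lookup that normalizes whatever name the page declares to its preferred name. This sketch covers only the five encodings listed above; preferredCharSetName is a hypothetical helper, and the case-insensitive matching is an assumption of the example.

```javascript
// Preferred names and aliases, copied from the table above.
const ALIAS_TABLE = {
  "ISO-8859-1": ["ISO_8859-1", "CP819", "latin1"],
  "ISO-8859-2": ["ISO_8859-2", "latin2"],
  "ISO-8859-5": ["ISO_8859-5", "cyrillic"],
  "Big5": ["Big-5"],
  "Shift_JIS": ["SJIS"],
};

// Build a case-insensitive index from every known name to its preferred name.
const PREFERRED_NAMES = new Map();
for (const [preferred, aliases] of Object.entries(ALIAS_TABLE)) {
  for (const name of [preferred, ...aliases]) {
    PREFERRED_NAMES.set(name.toLowerCase(), preferred);
  }
}

// Hypothetical helper: returns the preferred name, or null for an unlisted encoding.
function preferredCharSetName(name) {
  return PREFERRED_NAMES.get(name.trim().toLowerCase()) || null;
}

console.log(preferredCharSetName("latin1")); // ISO-8859-1
console.log(preferredCharSetName("EUC-KR")); // null (not in the table above)
```

A null result corresponds to the case where the value should be confirmed with Adobe rather than guessed.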

If a site has different web encodings on different pages, or a single JavaScript file is used for multiple sites, the charSet property can be set to a default value in the JavaScript file and then reset on specific pages as needed to override the default; for example, s.charSet="UTF-8" or s.charSet="SJIS".
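The default-then-override pattern might look like the following. Here s is a plain object standing in for the Analytics tracking object that the JavaScript file normally creates, and applyPageCharSet is a hypothetical helper introduced only for this illustration; the values mirror the example in the text.

```javascript
// Stand-in for the Analytics tracking object (an assumption of this sketch).
const s = {};

// Default value, set once in the shared JavaScript file.
s.charSet = "UTF-8";

// Hypothetical helper: override the default only when a page declares its own value.
function applyPageCharSet(tracker, pageCharSet) {
  if (pageCharSet) {
    tracker.charSet = pageCharSet;
  }
  return tracker.charSet;
}

console.log(applyPageCharSet(s, undefined)); // UTF-8 (default kept)
console.log(applyPageCharSet(s, "SJIS"));    // SJIS  (page-level override)
```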

Any non-blank value of the charSet parameter will cause data to be converted into UTF-8 for storage. Any characters in the 128-255 range will be converted to the proper UTF-8 two-byte sequence and stored. These characters will not display properly in a standard report suite. Therefore, the charSet property should never be used with a standard report suite.

Likewise, a blank value of the charSet parameter will bypass the data conversion process, and any characters in the range 128-255 will be stored as a single byte. These characters will not display properly in a multi-byte report suite since the single-byte codes for these characters are not valid UTF-8. Therefore, the charSet parameter should always be used with a multi-byte report suite. Additionally, the proper value should be used with respect to the web page encoding.

Analytics Display Language

The Analytics interface can be displayed in alternate languages using the Language menu in the interface.

Selecting any option other than English causes Analytics to display using UTF-8 encoding. Displaying a standard report suite using a setting other than English might cause some data to display improperly.

Character Codes 128-255 - ISO vs. UTF-8

Characters in the range 0-127 are represented by the same byte sequence (actually a single byte) in ISO-8859-1 and UTF-8. However, the characters in the range 128-255, including all characters with diacritical (accent) marks, are represented by a single byte in ISO-8859-1 and two bytes in UTF-8.

The difference becomes apparent when changing the report suite type. For collected data, characters in the 128-255 range that display properly in a standard report suite will not display properly in a multi-byte report suite. Any of these characters that display properly in a multi-byte report suite will not display properly in a standard report suite. Determining the proper report suite type before collecting data is absolutely critical.
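The byte-level difference can be made concrete for a single character. The sketch below encodes "é" (code 233) both ways and then shows that its lone ISO-8859-1 byte is not valid UTF-8. The helper names are hypothetical; TextEncoder and TextDecoder are standard APIs.

```javascript
// Encode text as ISO-8859-1: each code in 0-255 becomes exactly one byte.
function iso88591Bytes(text) {
  return Uint8Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    if (code > 255) throw new RangeError("not representable in ISO-8859-1: " + ch);
    return code;
  });
}

// Encode text as UTF-8.
function utf8Bytes(text) {
  return new TextEncoder().encode(text);
}

// Format bytes as space-separated hexadecimal.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex(iso88591Bytes("\u00E9"))); // e9
console.log(toHex(utf8Bytes("\u00E9")));     // c3 a9

// Read as UTF-8, the single byte 0xE9 decodes to the replacement character U+FFFD.
console.log(new TextDecoder().decode(Uint8Array.of(0xe9)) === "\uFFFD"); // true
```

This is why data collected under one report suite type does not display properly under the other.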

Variable Lengths

For a standard report suite, all characters occupy a single byte by definition. When sending data to a standard report suite, every variable length limit expressed in bytes is the same limit in characters.

For a multi-byte report suite, data is stored as UTF-8. Each character in UTF-8 encoding can occupy one to four bytes of data, which means the effective limit of a variable can be as low as one quarter of its byte limit (25 characters for a 100-byte variable). Additionally, the limit on the number of characters is determined by the characters themselves. For example, with a 100-byte limit, a page name could consist of 100 "A" characters but only 50 "À" characters, since the character code of "À" (192) requires two bytes for storage in UTF-8.

Languages such as French and Spanish frequently make use of diacritical characters. Since each of these characters occupies two bytes of data when stored as UTF-8, variable length limits become an issue. With languages such as Japanese and Chinese, the issue is more profound since each variable can be limited to as few as 25 characters.

Compounding the issue, if you simply pass a longer variable to Analytics, the string is truncated at the byte limit when the data is stored. This can change the last character displayed, since the database may contain only part of that character's byte sequence. For web pages using UTF-8 encoding, you can use JavaScript to limit a variable to a set number of bytes before sending it to Analytics. However, this technique may not be possible with other encodings such as Big5 or Shift_JIS.

Each Analytics variable has a defined length limit expressed in bytes. For standard report suites, each character is represented by a single byte; therefore, a variable with a limit of 100 bytes also has a limit of 100 characters. However, multi-byte report suites store data as UTF-8, which expresses each character with one to four bytes of data. This effectively limits some variables to as few as 25 characters with languages such as Japanese and Chinese that commonly use between two and four bytes per character.

The character limit is directly related to the characters being used, which makes a fixed character limit difficult to determine in advance. For multi-byte report suites, the best practice is to limit Analytics variables to the specific number of bytes for the variable before passing data to Analytics.
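One way to apply this best practice on a UTF-8 page is to trim a value character by character until it fits the byte limit, so that no character is cut in the middle of its byte sequence. The following is a minimal sketch rather than Adobe-supplied code; truncateToUtf8Bytes is a hypothetical helper, and the 100-byte limit is only the example figure used in this section.

```javascript
// Hypothetical helper: keep as many whole characters as fit in maxBytes of UTF-8.
function truncateToUtf8Bytes(text, maxBytes) {
  const encoder = new TextEncoder();
  let result = "";
  let usedBytes = 0;
  // for...of iterates by code point, so surrogate pairs are never split.
  for (const ch of text) {
    const size = encoder.encode(ch).length;
    if (usedBytes + size > maxBytes) break;
    result += ch;
    usedBytes += size;
  }
  return result;
}

// With a 100-byte limit, the character limit depends on the characters used.
console.log(truncateToUtf8Bytes("A".repeat(150), 100).length);      // 100 (one byte each)
console.log(truncateToUtf8Bytes("\u00C0".repeat(150), 100).length); // 50  (two bytes each)
console.log(truncateToUtf8Bytes("\u65E5".repeat(150), 100).length); // 33  (three bytes each)
```

The sketch assumes the value will be stored as UTF-8, as it is in a multi-byte report suite.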

Enabling Multi-Byte Support

Steps to enable multi-byte support.

1. The multi-byte pages must use a standard language encoding character set.

2. The Analytics report suite must be multi-byte enabled.

3. The Analytics code (charSet) must be set to the correct language identifier for a given language-encoded page.

The JS file must define the charSet variable. (All page views and traffic are assumed to be standard 7-bit ASCII unless otherwise specified.) Setting the charSet variable tells the Analytics engine which language encoding should be translated into UTF-8. Some language identifiers used in meta tags or JavaScript variables do not match up with the Analytics conversion filter. Supported Character Sets describes the character sets currently supported by Analytics.
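The three requirements above, together with the rules from Using the charSet Property, can be sketched as a simple consistency check. The function and its parameter names are hypothetical, and the comparison is a simplification: it matches names case-insensitively and does not resolve aliases.

```javascript
// Hypothetical check of a page's setup against the rules in this document.
// Returns a list of problems; an empty list means no conflict was found.
function checkCharSetSetup({ reportSuiteIsMultiByte, pageEncoding, charSet }) {
  const problems = [];
  if (reportSuiteIsMultiByte && !charSet) {
    problems.push("charSet is required for a multi-byte report suite");
  }
  if (!reportSuiteIsMultiByte && charSet) {
    problems.push("charSet should not be set for a standard report suite");
  }
  if (charSet && pageEncoding && charSet.toLowerCase() !== pageEncoding.toLowerCase()) {
    problems.push("charSet does not match the page encoding");
  }
  return problems;
}

console.log(checkCharSetSetup({
  reportSuiteIsMultiByte: true,
  pageEncoding: "UTF-8",
  charSet: "UTF-8",
})); // []

console.log(checkCharSetSetup({
  reportSuiteIsMultiByte: false,
  pageEncoding: "ISO-8859-1",
  charSet: "UTF-8",
})); // two problems: charSet set on a standard suite, and an encoding mismatch
```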

Supported Character Sets

List of other single-byte and multi-byte encodings that are used on the web.

Some of the more common additional encodings include the following:

Character Set  3-Character Language Code  Language         2-Character Code  Country
Big5           chi                        HK Trad Chinese  hk                Hong Kong
Big5           chi                        TW Trad Chinese  tw                Taiwan
EUC-KR         kor                        Korean           kr                Korea
GB2312         chi                        Simp Chinese     cn                China
ISO-8859-1     eng                        English          aa                Africa
ISO-8859-1     fre                        French           aa                Africa
ISO-8859-1     spa                        LA Spanish       ar                Argentina
ISO-8859-1     eng                        English          au                Australia
ISO-8859-1     ger                        German           at                Austria
ISO-8859-1     dut                        Dutch            be                Belgium
ISO-8859-1     fre                        French           be                Belgium
ISO-8859-1     spa                        LA Spanish       bo                Bolivia
ISO-8859-1     por                        BR Portuguese    br                Brazil
ISO-8859-1     fre                        Canadian French  ca                Canada
ISO-8859-1     eng                        English          ca                Canada
ISO-8859-1     eng                        English          cb                Caribbean
ISO-8859-1     spa                        LA Spanish       ns                Central America
ISO-8859-1     spa                        LA Spanish       cl                Chile
ISO-8859-1     spa                        LA Spanish       co                Colombia
ISO-8859-1     dan                        Danish           dk                Denmark
ISO-8859-1     spa                        LA Spanish       ec                Ecuador
ISO-8859-1     fin                        Finnish          fi                Finland
ISO-8859-1     fre                        French           fr                France
ISO-8859-1     ger                        German           de                Germany
ISO-8859-1     eng                        English          hk                Hong Kong
ISO-8859-1     eng                        English          in                India
ISO-8859-1     eng                        English          id                Indonesia
ISO-8859-1     eng                        English          ie                Ireland
ISO-8859-1     ita                        Italian          it                Italy
ISO-8859-1     eng                        English          my                Malaysia
ISO-8859-1     spa                        LA Spanish       mx                Mexico
ISO-8859-1     eng                        English          me                Middle East
ISO-8859-1     dut                        Dutch            ni                Netherlands
ISO-8859-1     eng                        English          nz                New Zealand
ISO-8859-1     nor                        Norwegian        no                Norway
ISO-8859-1     spa                        LA Spanish       py                Paraguay
ISO-8859-1     spa                        LA Spanish       pe                Peru
ISO-8859-1     eng                        English          ph                Philippines
ISO-8859-1     por                        PT Portuguese    pt                Portugal
ISO-8859-1     spa                        LA Spanish       pr                Puerto Rico
ISO-8859-1     eng                        English          sg                Singapore
ISO-8859-1     eng                        English          za                South Africa
ISO-8859-1     spa                        Spanish          es                Spain
ISO-8859-1     swe                        Swedish          se                Sweden
ISO-8859-1     fre                        French           ch                Switzerland
ISO-8859-1     ger                        German           ch                Switzerland
ISO-8859-1     eng                        English          th                Thailand
ISO-8859-1     eng                        English          uk                United Kingdom
ISO-8859-1     eng                        English          us                United States
ISO-8859-1     spa                        LA Spanish       uy                Uruguay
ISO-8859-1     spa                        LA Spanish       ve                Venezuela
ISO-8859-1     eng                        English          vn                Vietnam
ISO-8859-10    est                        Estonian         ee                Estonia
ISO-8859-2     cro                        Croatian         hr                Croatia
ISO-8859-2     cze                        Czech            cz                Czech Republic
ISO-8859-2     hun                        Hungarian        hu                Hungary
ISO-8859-2     pol                        Polish           pl                Poland
ISO-8859-2     rom                        Romanian         ro                Romania
ISO-8859-2     slk                        Slovak           sk                Slovak Republic
ISO-8859-2     slv                        Slovenian        si                Slovenia
ISO-8859-4     lit                        Lithuanian       lt                Lithuania
ISO-8859-5     bul                        Bulgarian        bg                Bulgaria
Windows-1257   ukr                        Russian          ua                Ukraine
Windows-1257   rus                        Russian          ru                Russian Federation
Windows-1257   gre                        Greek            gr                Greece
Windows-1257   tur                        Turkish          tr                Turkey
Windows-1257   heb                        Hebrew           il                Israel
Windows-1257   lat                        Latvian          lv                Latvia
SJIS           jpn                        Japanese         jp                Japan

Contact and Legal Information

Information to help you contact Adobe and to understand the legal issues concerning your use of this product and documentation.

Help & Technical Support

The Adobe Marketing Cloud Customer Care team is here to assist you and provides a number of mechanisms by which they can be engaged:

• Check the Marketing Cloud help pages for advice, tips, and FAQs

• Ask us a quick question on Twitter @AdobeMktgCare

• Log an incident in our customer portal

• Contact the Customer Care team directly

• Check availability and status of Marketing Cloud Solutions

Service, Capability & Billing

Dependent on your solution configuration, some options described in this documentation might not be available to you. As each account is unique, please refer to your contract for pricing, due dates, terms, and conditions. If you would like to add to or otherwise change your service level, or if you have questions regarding your current service, please contact your Account Manager.

Feedback

We welcome any suggestions or feedback regarding this solution. Enhancement ideas and suggestions for Adobe Analytics can be added to our Customer Idea Exchange.

Legal

© 2015 Adobe Systems Incorporated. All Rights Reserved. Published by Adobe Systems Incorporated.

Terms of Use | Privacy Center

Adobe and the Adobe logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.

All third-party trademarks are the property of their respective owners.
