abap slide class4 unicode-plusfiles

ABAP Workshop: Unicode and File Handling


Topics
• Characters and Encoding
• ASCII standards
• Glyphs and Fonts
• Extended ASCII and issues
• Character Sets and Code Pages
• Little and Big Endian
• Unicode
• Unicode Transformation Formats [UTF-8, etc.]
• Unicode SAP system
• SAP Unicode Overhead
• SAP File Interface
• SAP Authorization for File Access
• Files on the Application Server
• File Interface Statements (Open, Transfer, Read, Get, Set, etc.)
• Error Handling
• Attributes and Other commands
• Files on the Presentation Server


Characters and Encoding

• Characters are represented by character codes
• This coding is called character encoding
• Character codes are generated and stored when a user inputs and saves a document
• When a document is read by the system, it interprets the stored character codes and displays them as characters in a format that we understand


ASCII standards

• The American National Standards Institute (ANSI) created the American Standard Code for Information Interchange (ASCII) standard
• For example, in ASCII the character ‘A’ is represented by decimal code 65 or hexadecimal code 41 and is stored as binary code 01000001
• Single-byte character sets provide 256 character codes. This is an adequate number to encode most of the characters needed for Western Europe
• BTW: Extended Binary Coded Decimal Interchange Code (EBCDIC), an 8-bit character encoding used on IBM mainframe operating systems, is not discussed here


Glyphs and Fonts

• A glyph (pronounced "glif") is a visual representation of a character, for example the letter A drawn in several different typefaces: A A A A A A A A
• Users don't view or print characters; they view or print glyphs
• The character "Capital Letter A" is represented by a different glyph in Times New Roman Bold than in Arial Bold (each glyph looks visually different)
• A single character can be represented by several different glyphs in a font
• A font is a collection of glyphs


Extended ASCII and issues

• ASCII represents every printable character using a number between 32 and 127 (the codes below 32 are control characters). Space is 32, the letter "A" is 65, etc. This could conveniently be stored in 7 bits because all the codes are less than 128 (2^7)
• Historically most computers used 8-bit bytes, so there was still 1 bit to spare
• Extended ASCII, which made use of this spare bit, was not standardized across the world
• The IBM PC had what came to be known as the OEM [Original Equipment Manufacturer] character set, which provided some accented characters for European languages, plus vertical and horizontal line-drawing characters that text-mode PCs could display and print
• An assortment of 256-character Windows ANSI character sets covers all the 8-bit languages targeted by Windows
• Programmers in Israel, Russia (USSR), and Asia used the 8th bit to represent their own language characters, so there was no universal standard left for the codes from 128 and up; confusion prevailed over the 8th bit
• Something was required to map the various character codes created and used, not only for Extended ASCII but also for any new mapping developed


Character Sets and Code Pages

• A Character Set is any specific collection of characters
• A Code Page is a list of selected character codes for a Character Set in a particular order
• Code Page is another name for the encoding of each character in a Character Set (fonts could have their own Character Set)
• A Code Page is a character set encoding that can include numbers, punctuation marks, and other glyphs. Code Pages are not the same for each language
• Many Code Pages are single-byte Character Sets, that is, they contain no more than 256 characters
• A Code Page is a representation of a Character Set used by a computer (OS) to support a specific language or set of languages

  Character Set     Windows Code Page
  US-ASCII          20127
  German (IA5)      20106
  Korean (ISO)      50225

• Some languages, such as Japanese, have multi-byte characters, while others, like English and German, need only one byte to represent each character


Character Sets and Code Pages (cont…)

Within each Code Page, the characters from the Character Set are mapped to character codes (encoded)


Character Sets and Code Pages (cont…)

So potentially we could have hundreds of Character Sets, and these have to be mapped to numerous Code Pages, which is a maintenance nightmare


Character Sets and Code Pages (cont…)

• Not all Code Pages exist on all computers; they can differ between computers, and they can be changed on a single computer
• This results in confusion and emails like these:
  – Dear □ □ ??? Thank □□□ █ █ █ █ ???


Little and Big Endian

• Some examples of ABAP built-in data types are:

  Type   Length    Description
  b      1 byte    1-byte integer (internal)
  i      4 bytes   4-byte integer
  f      8 bytes   Floating point number

• Question: for multi-byte data (say, i or f shown above), where does the biggest (most significant or highest-order) byte appear in memory?
• Little Endian: as used in Intel processors, stores the low-order byte of a number in memory at the lowest address
• Big Endian: as used by Motorola processors, IBM's 370 mainframes, and most RISC-based computers, stores the high-order byte of a number in memory at the lowest address

Example 1: a 4-byte long integer [Byte3 Byte2 Byte1 Byte0] is arranged in memory as shown:

                  Base Address+0   +1      +2      +3
  Little Endian   Byte0            Byte1   Byte2   Byte3
  Big Endian      Byte3            Byte2   Byte1   Byte0


Little and Big Endian (cont…)

• Example 2: to store the two bytes required for the hexadecimal number 4F52, the two methods use the following representations (BTW: this is equal to 2*16^0 + 5*16^1 + 15*16^2 + 4*16^3 = 20306 in decimal)
• Little Endian, representation in memory:
    Base Address+0   52
    Base Address+1   4F
• Big Endian, representation in memory:
    Base Address+0   4F
    Base Address+1   52
• Big Endian is easy to understand because it is consistent with the order we use naturally when we read and write text and numbers
• Irrespective of the byte order, which depends on the Big Endian or Little Endian representation, the bit order within each byte is always big-endian:
    01001001 = 0 + 2^6 + 0 + 0 + 2^3 + 0 + 0 + 2^0 = 64 + 8 + 1 = 73
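
To see which byte order a given SAP application server uses, a short report can read the ENDIAN attribute of class CL_ABAP_CHAR_UTILITIES. This is a minimal sketch; the report name is hypothetical, and it assumes a release in which this class attribute is available.

```abap
REPORT zdemo_endian.

" cl_abap_char_utilities=>endian holds 'B' (big endian) or
" 'L' (little endian) for the current application server.
IF cl_abap_char_utilities=>endian = 'L'.
  WRITE / 'This application server is little endian'.
ELSE.
  WRITE / 'This application server is big endian'.
ENDIF.
```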


Need for Standards - Unicode

• We have seen the confusion that arises when each entity, including hardware manufacturers, software companies, regions, countries, and groups, creates Code Pages for its own requirements and its own Character Sets
• Without any set standards, and with the advent of the internet, sharing of information could be almost impossible
• What if we had one standard Code Page, with a set of all possible character codes that any computer or software could decipher?
• Well, Unicode is the answer. It is not a Code Page, but more like a “meta-Code Page”
• Unicode is a brave effort to create a single character set that includes every reasonable writing system on the planet
• Think of Unicode as a set of all possible character codes
• Unicode is a single, very large (and still growing) character set and encoding, which encompasses essentially all the standard computer character sets that predated it


Unicode

• Unicode provides a unique number (a code point) for every character
  – NO matter what the platform
  – NO matter what the program
  – NO matter what the language
• Unicode is an international standard that assigns a unique number to characters from virtually every language and script
• Unicode defines more than 100,000 characters, with room for more than 1 million. With Unicode, all characters used in business-relevant languages can be represented


Unicode (cont…)

• Almost any computer Code Page can be mapped to Unicode and back. However, in computer systems Unicode is largely replacing Code Page based approaches
• Instead of having dozens of Code Pages each using and re-using the same numbered slots for different characters, each character gets its own unique numbered slot in Unicode
• Think of Unicode as a label attached to the character, via which the character can be accessed by applications and operating systems
• Example: the English letter A is U+0041, the Hebrew letter alef is U+05D0, the Greek letter alpha (α) is U+03B1, etc.; basically they are all covered


Does Unicode encode Language, Font, Size, Positioning, Glyphs?

• The Unicode Standard does not attempt to encode features such as language, font, size, positioning, glyphs, and so forth. For example, it does not preserve language as a part of character encoding: French i grec, German ypsilon, and English wye are all represented by the same character code, U+0059 “Y”. The Unicode Standard deals only with character codes.

• Glyphs represent the shapes that characters can have when they are rendered or displayed. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters. A repertoire of glyphs makes up a font. Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards, and are not part of the Unicode Standard.

• A A A A A A A A All represented by Latin capital letter A (U+0041)

• a a a a a a a a a a All represented by Latin small letter a (U+0061)


Unicode Challenges

• But have we addressed all the issues?
• Of course not. Unicode has mapped all the characters uniquely, but how do we store them in memory or represent them in an email message? The English letter A is U+0041, but should it be stored in memory as [00 41] or as [41 00] (endianness)?
• What about all those zeros? Are we doubling the disk space, resulting in more cooling costs and more greenhouse issues? [TX okay, but CA?]
• Welcome to the UTF-8 standard!
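
The [00 41] versus [41 00] question can be tried out in ABAP. The following is a minimal sketch that converts the letter A to UTF-16 in both byte orders with class CL_ABAP_CONV_OUT_CE. The report name is hypothetical, and the sketch assumes a Unicode system on which this converter class and the SAP code pages 4102 (UTF-16 big endian) and 4103 (UTF-16 little endian) are available.

```abap
REPORT zdemo_utf16_byte_order.

DATA: lo_conv  TYPE REF TO cl_abap_conv_out_ce,
      lv_bytes TYPE xstring.

" SAP code page 4102 is UTF-16 big endian.
lo_conv = cl_abap_conv_out_ce=>create( encoding = '4102' ).
lo_conv->convert( EXPORTING data   = 'A'
                  IMPORTING buffer = lv_bytes ).
WRITE: / 'UTF-16 big endian:   ', lv_bytes.

" SAP code page 4103 is UTF-16 little endian.
lo_conv = cl_abap_conv_out_ce=>create( encoding = '4103' ).
lo_conv->convert( EXPORTING data   = 'A'
                  IMPORTING buffer = lv_bytes ).
WRITE: / 'UTF-16 little endian:', lv_bytes.
```

An xstring is written as hexadecimal digits, so the two output lines should show the same two bytes in opposite order.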


Unicode UTF-8 standard

• UTF-8 (8-bit UCS/UTF) is a variable-length character encoding for Unicode. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6 bytes, but RFC 3629 limits UTF-8 to 4)
• If a legacy system can understand ASCII, it can understand the English portion of UTF-8, so old programs can still decipher English text from UTF-8. They cannot decipher any other language in UTF-8 that uses two or more bytes per character (they were not designed to read other languages, so they are basically not affected)
• With the UTF-8 standard, memory and disk space are conserved
• UTF-8 is interpreted as a sequence of bytes, so there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units
• UCS stands for Universal Character Set
• UTF stands for Unicode Transformation Format
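
The variable length of UTF-8 can be observed by converting a few characters and counting the resulting bytes. This is a minimal sketch, assuming a Unicode system and the converter class CL_ABAP_CONV_OUT_CE; the report name and the three sample characters are chosen for illustration only.

```abap
REPORT zdemo_utf8_length.

DATA: lo_conv  TYPE REF TO cl_abap_conv_out_ce,
      lt_chars TYPE STANDARD TABLE OF string,
      lv_char  TYPE string,
      lv_bytes TYPE xstring,
      lv_len   TYPE i.

APPEND `A` TO lt_chars.   " U+0041, below 128
APPEND `α` TO lt_chars.   " U+03B1, Greek small letter alpha
APPEND `€` TO lt_chars.   " U+20AC, euro sign

lo_conv = cl_abap_conv_out_ce=>create( encoding = 'UTF-8' ).

LOOP AT lt_chars INTO lv_char.
  " Convert one character to its UTF-8 byte sequence.
  lo_conv->convert( EXPORTING data   = lv_char
                    IMPORTING buffer = lv_bytes ).
  " xstrlen returns the number of bytes in the byte string.
  lv_len = xstrlen( lv_bytes ).
  WRITE: / lv_char, lv_bytes, lv_len.
ENDLOOP.
```

The byte counts should grow from one byte for A to two and three bytes for the other two characters.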


Unicode other standards

• UCS-2 (2 bytes) or UTF-16 (16 bits)
  – Big-endian UCS-2 or little-endian UCS-2
• UTF-7 (similar to UTF-8, but guarantees that the high bit will always be zero, to be consistent with the requirements of old programs)
• UTF-32 (32 bits)
• UTF-8 is the most popular standard today
• A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order
• Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings; it has nothing to do with byte order
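
In ABAP, the byte order marks are available as constants of class CL_ABAP_CHAR_UTILITIES. The following minimal sketch simply lists them as hexadecimal values. The report name is hypothetical, and the sketch assumes a release that provides these constants.

```abap
REPORT zdemo_bom.

" The constants are byte fields, so WRITE shows them as hex digits.
WRITE: / 'BOM for UTF-8:               ',
         cl_abap_char_utilities=>byte_order_mark_utf8,
       / 'BOM for UTF-16 big endian:   ',
         cl_abap_char_utilities=>byte_order_mark_big,
       / 'BOM for UTF-16 little endian:',
         cl_abap_char_utilities=>byte_order_mark_little.
```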


Conveying the Encoding used

• How do we preserve the information about which encoding a string uses?
  – For an email message, you are expected to have a string in the header of the form
      Content-Type: text/plain; charset="UTF-8"
  – For an HTML page, by using a meta tag in the head:
      <html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
• For the most consistent results, any new application should use a Unicode encoding, such as UTF-8 or UTF-16, instead of a specific code page
• For Unicode UTF-8, the Windows Code Page is 65001


Unicode SAP system

• Enables you to harness Internet technologies better
• Allows better integration with non-SAP products and seamless integration with existing SAP systems
• Offers a superior platform for collaborative, cross-system business applications
• Works with all languages and language combinations in the world
• Allows you to install a central system for worldwide business processes, e.g. to gather and store aggregate customer data
• Enables you to optimize your system landscape and reduce your costs


Unicode SAP system (cont…)

• Unicode Program: a Unicode program is an ABAP program in which the Unicode checks are in effect and in which certain statements have different semantics from those that apply in non-Unicode programs
• Unicode System: a single-code-page system in which characters are coded in the Unicode character representation
• The Unicode check was tightened as of Release 6.10


SAP Unicode Overhead

• Main memory: average increase +40...50% (reason: application servers are based on UTF-16)
• Network load: ~0% (almost no change, due to efficient compression)
• Database size, average increase:
  – UTF-8: +10% (smaller systems (< 200 GB) might grow more)
  – UTF-16: +20...60%


SAP File Interface and Unicode

• It is possible to exchange files between Unicode and non-Unicode systems, between different Unicode systems, and between different non-Unicode systems with different code pages
• Instead of implicit programming with standard settings over which we have no control, programmers are required to program explicitly and to specify all important parameters (stringent requirements that enforce good programming practice)
• Examples of explicit programming: a file must be opened before each read/write; the access type and the type of data storage must be specified; a file opened with read-only access remains that way throughout the program; a file opened as text can contain text only; etc.


SAP Authorization for File Access

• Operating system check: the system automatically checks the entries in the database table SPTH for access to individual files. Neither of the following checks (S_PATH / S_DATASET) can override this.

• Program-independent authorization check: the check against the authorization object S_PATH is independent of the ABAP program used and is not restricted to an individual file; it covers all files in the path/folder.

• User and program authorization check: the check against the authorization object S_DATASET is based on the program name, the file name, and the activity (Delete, Read, Write, Read with Filter, and Write with Filter).


File Interface Statements

• OPEN DATASET
• TRANSFER
• READ DATASET
• GET DATASET
• SET DATASET
• TRUNCATE DATASET
• CLOSE DATASET
• DELETE DATASET


Opening a File

• OPEN DATASET dset FOR access IN mode [position] [os_addition] [error_handling].
  – dset is the file name including the path (/usr/tmp/test.dat)
  – access can be
    • INPUT (opens the file for reading only, with the file pointer set at the start of the file; if the file does not exist, sy-subrc is set to 8. In a Unicode program it is not possible to write to a file opened for reading, whereas a non-Unicode program allows both)
    • OUTPUT (opens a new file for writing; if the file already exists, its content is deleted. Read access is permitted)
    • APPENDING (opens the file for appending, with the file pointer set at the end of the file; if the file does not exist, it is created. A read attempt fails and sy-subrc is set to 4)
    • UPDATE (opens the file for updating, with the file pointer set at the start of the file; if the file does not exist, sy-subrc is set to 8)
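
The sy-subrc behavior described for FOR INPUT can be sketched as follows. The report name and the variable name are hypothetical, the path is the sample path from the slide, and ENCODING DEFAULT is chosen here only to keep the example short.

```abap
REPORT zdemo_open_dataset.

DATA lv_file TYPE string VALUE '/usr/tmp/test.dat'.

" Open the file for reading; sy-subrc is 8 if it cannot be opened,
" for example because it does not exist.
OPEN DATASET lv_file FOR INPUT IN TEXT MODE ENCODING DEFAULT.
IF sy-subrc = 8.
  WRITE: / 'File could not be opened:', lv_file.
ELSE.
  WRITE: / 'File opened for reading:', lv_file.
  CLOSE DATASET lv_file.
ENDIF.
```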


Opening a File (continued)

– Syntax of mode
  • BINARY MODE (opens the file as a binary file; the binary content of a data object is transferred unchanged)
  • TEXT MODE ENCODING code (opens the file as a text file. When writing, the content of a data object is converted to the representation specified after code [UTF-8 or non-Unicode] and transferred to the file. For character fields, trailing blanks are truncated, but not for strings. When reading, the content of the file is read up to the next end-of-line marker, converted from the format specified after code into the current character format [UTF-8, or the non-Unicode code page specified in database table TCP0C], and transferred to a data object)
  • LEGACY BINARY MODE [endian] [codepage]
  • LEGACY TEXT MODE [endian] [codepage]


Opening a File (continued)

• AT POSITION pos: when a file is opened with this option, pos defines where the file pointer is positioned, in bytes (0 means the start of the file, -1 means the end of the file, and any value i means i bytes from the start of the file)

• TYPE attr: on non-Microsoft operating systems, attr can contain OS-specific parameters for the file to be opened (OS/400 'blksize=8000', etc.). On Microsoft operating systems, if attr contains "NT" the end of line is marked by "CRLF", and if it contains "UNIX" the end of line is marked by "LF".

• FILTER opcom: opcom can be an OS command that is started when OPEN DATASET is executed, for example FILTER 'compress' or FILTER 'uncompress'

  OPEN DATASET filexyz FOR OUTPUT IN BINARY MODE FILTER 'compress'.
  OPEN DATASET filexyz FOR INPUT IN BINARY MODE FILTER 'uncompress'.


Error Handling

• [MESSAGE msg]: when an error occurs, the OS error message is assigned to the data object msg, so that the ABAP program can display it to the user

• [IGNORING CONVERSION ERRORS]: this addition suppresses the treatable exceptions defined by class CX_SY_CONVERSION_CODEPAGE; each unconvertible character is replaced by the literal ‘#’

• [REPLACEMENT CHARACTER rc]: same as above, except that each unconvertible character is replaced by the single character specified by rc (not applicable for binary files)


TRANSFER and READ Commands

• TRANSFER dobj TO dset [LENGTH len] [NO END OF LINE]: the content of dobj is written to the file starting at the current file pointer. LENGTH determines how many characters/bytes are written to the file. NO END OF LINE prevents the end-of-line marker from being appended to the data transferred

• READ DATASET dset INTO dobj [MAXIMUM LENGTH mlen] [[ACTUAL] LENGTH alen]: this reads data from the file specified in dset into the data object dobj, starting at the current file pointer. The MAXIMUM LENGTH addition limits the number of characters or bytes read from the file. The ACTUAL LENGTH addition returns the number of characters or bytes actually read (mlen can be 100, but if the file is small and only 60 are read, alen is returned as 60)
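
Putting OPEN DATASET, TRANSFER, READ DATASET, and CLOSE DATASET together, a write-then-read round trip might look like the sketch below. The report name, variable names, and file content are hypothetical, and the path is the sample path used earlier; the MESSAGE addition captures the OS error text if an open fails.

```abap
REPORT zdemo_transfer_read.

DATA: lv_file TYPE string VALUE '/usr/tmp/test.dat',
      lv_msg  TYPE c LENGTH 100,
      lv_line TYPE string,
      lv_len  TYPE i.

" Write two lines to a new UTF-8 text file.
OPEN DATASET lv_file FOR OUTPUT IN TEXT MODE ENCODING UTF-8
     MESSAGE lv_msg.
IF sy-subrc <> 0.
  WRITE: / 'Open for output failed:', lv_msg.
  RETURN.
ENDIF.
lv_line = `First line`.
TRANSFER lv_line TO lv_file.
lv_line = `Second line`.
TRANSFER lv_line TO lv_file.
CLOSE DATASET lv_file.

" Read the file back line by line until the end of the file.
OPEN DATASET lv_file FOR INPUT IN TEXT MODE ENCODING UTF-8
     MESSAGE lv_msg.
IF sy-subrc <> 0.
  WRITE: / 'Open for input failed:', lv_msg.
  RETURN.
ENDIF.
DO.
  READ DATASET lv_file INTO lv_line ACTUAL LENGTH lv_len.
  IF sy-subrc <> 0.
    EXIT.   " sy-subrc = 4 signals that the end of the file was reached
  ENDIF.
  WRITE: / lv_line, lv_len.
ENDDO.
CLOSE DATASET lv_file.
```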


GET and SET Commands

• GET DATASET dset [POSITION pos] [ATTRIBUTES attr]: POSITION returns the current position of the file pointer. ATTRIBUTES lets us read the values of the fixed and changeable file attributes

• SET DATASET dset [POSITION pos|{END OF FILE}] [ATTRIBUTES attr]: POSITION sets the file pointer to the new position indicated by pos. ATTRIBUTES lets us update the values of the changeable file attributes
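
As an illustration of the POSITION additions, the following sketch moves the file pointer to the end of an existing file and reads the position back, which gives the file size in bytes. The report name and variable names are hypothetical, and the file is assumed to exist and to be small enough for its size to fit into a type i field.

```abap
REPORT zdemo_get_set_dataset.

DATA: lv_file TYPE string VALUE '/usr/tmp/test.dat',
      lv_pos  TYPE i.

OPEN DATASET lv_file FOR INPUT IN BINARY MODE.
IF sy-subrc = 0.
  " Directly after opening for input, the pointer is at the start.
  GET DATASET lv_file POSITION lv_pos.
  WRITE: / 'Position after open:', lv_pos.

  " Move the pointer to the end of the file and read it back.
  SET DATASET lv_file POSITION END OF FILE.
  GET DATASET lv_file POSITION lv_pos.
  WRITE: / 'Position at end of file:', lv_pos.

  CLOSE DATASET lv_file.
ELSE.
  WRITE: / 'File could not be opened:', lv_file.
ENDIF.
```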


ATTRIBUTES

• Fixed Attributes
  – Indicator (sub-structure with the following fields; each contains ‘X’ if the corresponding attribute is significant)
  – Mode (Text (T), Binary (B), Legacy Binary (LB), and Legacy Text (LT))
  – Access_type (reading (I), writing (O), appending (A), and editing (U))
  – Encoding (UTF-8 and NON-UNICODE)
  – Filter (filter command, for example ‘compress’)
• Changeable Attributes
  – Indicator (sub-structure with the following fields; each contains ‘X’ if the corresponding attribute is significant)
  – Repl_char (replacement character rc)
  – Conv_error (contains ‘I’ if the IGNORING CONVERSION ERRORS addition was used, ‘R’ otherwise)
  – Code_page (code page that was specified, initial otherwise)
  – Endian (B for Big Endian, L for Little Endian, initial otherwise)

Example:
  DATA attr TYPE dset_attributes. "dset_attributes is defined by SAP in type group DSET
  GET DATASET dset ATTRIBUTES attr.
  IF attr-fixed-indicator-filter <> 'X'. … ENDIF.


Other commands

• TRUNCATE DATASET dset AT {CURRENT POSITION} | {POSITION pos}: the file size is changed by setting the end of the file at the current position or at pos. When the file is shortened, it is truncated after the new end of file; when it is extended (pos > current file size), it is filled with hexadecimal zeros from the old to the new end of file.

• CLOSE DATASET dset: closes the file on the application server.

• DELETE DATASET dset: deletes the file on the application server.


Files on the Presentation Server

• The CL_GUI_FRONTEND_SERVICES class of the class library contains the methods required for processing files on the presentation server (client/PC). There are no ABAP statements available for processing files here.
  – GUI_DOWNLOAD for writing files
  – GUI_UPLOAD for reading files
  – DIRECTORY_CREATE and DIRECTORY_DELETE for creating and deleting a directory
  – FILE_DELETE, FILE_COPY, FILE_EXIST, etc., for file operations
• The above are methods of the class, but the function modules GUI_DOWNLOAD and GUI_UPLOAD can also be used.
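
A minimal sketch of reading a text file from the presentation server with the GUI_UPLOAD method is shown below. The report name and the file path are hypothetical, the program has to run in a dialog session with SAP GUI, and all errors are collapsed into OTHERS for brevity.

```abap
REPORT zdemo_gui_upload.

DATA: lt_lines TYPE STANDARD TABLE OF string,
      lv_line  TYPE string.

" Read a text file from the user's PC into an internal table.
cl_gui_frontend_services=>gui_upload(
  EXPORTING
    filename = `C:\temp\test.txt`   " hypothetical path on the PC
    filetype = 'ASC'                " read as text, one row per line
  CHANGING
    data_tab = lt_lines
  EXCEPTIONS
    OTHERS   = 1 ).

IF sy-subrc <> 0.
  WRITE / 'Upload failed'.
ELSE.
  LOOP AT lt_lines INTO lv_line.
    WRITE / lv_line.
  ENDLOOP.
ENDIF.
```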
