
BI/Data Warehouse

LEVERAGING AND ENRICHING THE CAPABILITIES OF ORACLE DATABASE 11g

Dylan Kucera, Director – Data Architecture, Ontario Teachers' Pension Plan

INTRODUCTION

Facilitating and managing a large-scale data warehouse in an effective and efficient manner cannot be achieved using a simple turn-key set of approaches. Oracle Database 11g offers a number of key features to help facilitate success in data warehouse management, and where its features fall short, enough extensibility is available in the platform to bridge the gap. This paper focuses on the capabilities that sustain an efficient data warehouse inside the Oracle Database 11g environment. We will remain focused on facilitation and will not explore what might traditionally be found in a BI/Data Warehousing paper, such as ETL or Reporting tools.

PARTITIONING – ENABLING EFFICIENT DATA RETENTION

Partitioning is a feature often looked at in light of how it can assist with data access. While this is an absolutely critical aspect, partitioning can also ease the job of the data warehouse administrator. Consider a problem where you have an analysis process that runs daily and produces a 200-million-record result set that is stored in your Data Warehouse. The data retention requirements state that you must store the analysis results for the previous 10 business days, in addition to the previous 12 month-end dates and the previous year-end date.

Every day, you must delete 200 million records; once a month it will be 400 million records, and once a year it’s 600 million records to delete. The traditional way could take hours:

DELETE FROM RISK.PNL_DETAIL WHERE ANALYSIS_ID = 12345

However, were you to partition the results table by ANALYSIS_ID, this operation is virtually instantaneous:

ALTER TABLE RISK.PNL_DETAIL DROP PARTITION P12345

There are two other important benefits to note with this approach. First, the lengthy delete would chew up Undo space while it works and consume valuable I/O bandwidth; dropping a partition avoids both. Second, depending on the nature of your data, you may be able to avoid costly index maintenance by relying on access by partition/sub-partition instead.
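For concreteness, here is a minimal sketch of how the results table might be declared so that each analysis run lands in its own droppable partition. Only ANALYSIS_ID comes from the example above; the other columns and the LIST partitioning scheme are illustrative assumptions:

CREATE TABLE RISK.PNL_DETAIL (
  ANALYSIS_ID  NUMBER NOT NULL,   -- partition key; one analysis run per partition
  POSITION_ID  NUMBER NOT NULL,   -- illustrative detail columns
  PNL          NUMBER
)
PARTITION BY LIST (ANALYSIS_ID) (
  PARTITION P12345 VALUES (12345)
);

-- Each daily run would then add its own partition before loading:
ALTER TABLE RISK.PNL_DETAIL ADD PARTITION P12346 VALUES (12346);

With this shape, the retention job reduces to a near-instant DROP PARTITION per expired analysis rather than a multi-hour DELETE.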


ADVANCED COMPRESSION – SAVING SPACE AND I/O TIME

Consider the case where you need to store flat files in your Data Warehouse, perhaps for traceability and auditability of the artifacts you load from your counterparties or data providers. These days, with high-end disk being the most expensive hardware in your data centre, and with disk I/O times being the slowest link in your data access chain, it pays to minimize both the amount of storage you use and the I/O time spent accessing that storage. Oracle Database Advanced Compression is an excellent tool for achieving this goal and is very simple to implement. All you need to do is add the compression clause to your file archive table like so:

CREATE TABLE ETL.ETL_ARCHIVE (
  ETL_ID  NUMBER,
  DATE_   DATE,
  LENGTH  NUMBER,
  FILE_   BLOB
)
LOB (FILE_) STORE AS SECUREFILE (
  COMPRESS
  CACHE
)

Figure 1 : Oracle Database Advanced Compression

If the files you are storing are simple comma delimited text files or XML files, you can expect to achieve well in excess of 50% compression. This means you’ve more than halved the amount of I/O it takes to retrieve the files. With the speed of today’s CPUs, you will see no material difference in access times due to the compression algorithm, and whatever overhead there might be is dwarfed by the savings in I/O time.
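If you want to confirm the savings, a rough check is to compare the logical size of the stored files against the physical size of the LOB segment. This is a sketch only; it assumes the ETL.ETL_ARCHIVE table from Figure 1 and DBA-level access to the dictionary views:

-- Logical (uncompressed) size of the archived files
SELECT SUM(DBMS_LOB.GETLENGTH(FILE_)) / 1024 / 1024 AS logical_mb
FROM   ETL.ETL_ARCHIVE;

-- Physical size of the SecureFile LOB segment on disk
SELECT SUM(s.bytes) / 1024 / 1024 AS physical_mb
FROM   dba_segments s
WHERE  s.owner = 'ETL'
AND    s.segment_name = (SELECT l.segment_name
                         FROM   dba_lobs l
                         WHERE  l.owner       = 'ETL'
                         AND    l.table_name  = 'ETL_ARCHIVE'
                         AND    l.column_name = 'FILE_');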


ORACLE DATABASE JVM

Suppose that you would like your Data Warehouse to serve itself in terms of data acquisition or parsing, but External Tables are not sufficient because, for example, the target data resides on a foreign (S)FTP site or is published in Excel .XLS format. In these cases you can supplement the Oracle Database functionality with an Open Source Java library. Take for example the case where your Data Warehouse needs to import the following Excel (native .XLS, not CSV) file:

AFN Afghanistan Afghanis 44.61500

ALL Albania Leke 107.80000

DZD Algeria Dinars 75.29000

ARS Argentina Pesos 3.94400

AUD Australia Dollars 1.11865

BSD Bahamas Dollars 1.00000

BHD Bahrain Dinars 0.37705

BDT Bangladesh Taka 69.75000

BBD Barbados Dollars 2.00000

BMD Bermuda Dollars 1.00000

BRL Brazil Reais 1.75806

BGN Bulgaria Leva 1.54157

XOF CFA BCEAO Francs 516.94708

XAF CFA BEAC Francs 516.94708

CAD Canada Dollars 1.05650

Figure 2 : Excel File to be loaded to Oracle Data Warehouse

One could perhaps shell out and use an Excel macro to save the file in a more readable CSV format, but this forms a disconnected architecture and could lead to reliability issues. To perform the task in a single connected set of operations, we need to begin by teaching the Oracle Database how to read native XLS format. This can be achieved by importing a Java library suited to the task into the Database. "Java Excel API", aka "jexcelapi", is a very complete and compatible set of classes designed to read and write native Microsoft Excel format files. Better still, the library is completely Open Source, which affords you the opportunity to fix or enhance it should it not be up to your particular requirement. First, download the latest build of the Java Excel API here:

http://jexcelapi.sourceforge.net

Then, load the .jar file that contains the library’s classes to your target database schema from a Shell/Command window:

loadjava -schema ETL -user *******/********@EDBDEV -verbose jxl.jar


This operation will produce some rather lengthy output, though all you really must verify is that the Java classes were loaded and that no errors were encountered in the process. These indicators appear at the very end of the loadjava output:


creating : class ETL.jxl/NumberCell

loading : class ETL.jxl/NumberCell

creating : class ETL.jxl/NumberFormulaCell

loading : class ETL.jxl/NumberFormulaCell

creating : class ETL.jxl/StringFormulaCell

loading : class ETL.jxl/StringFormulaCell

Classes Loaded: 520

Resources Loaded: 7

Sources Loaded: 0

Published Interfaces: 0

Classes generated: 0

Classes skipped: 0

Synonyms Created: 0

Errors: 0

Figure 3 : Output of a successful “loadjava” operation

The challenging part is putting the classes together such that an Excel file can be turned into something that the Database can understand. Now that the Java Excel API classes are imported into your schema (in this example, the schema is "ETL"), it's straightforward to get started:

create or replace and compile java source named etl.utl_file as

import jxl.Workbook;
import jxl.WorkbookSettings;
import jxl.Sheet;
import jxl.Cell;
import jxl.CellType;
import jxl.NumberCell;
import jxl.FormulaCell;
import jxl.DateCell;

Figure 4 : Java Stored Procedure – importing supplemental Open Source classes

In the body of your Java class, you would then have a static procedure that takes a BLOB as a parameter; the contents of the BLOB are expected to be an XLS file. You can then loop through every sheet, row, and column and write each cell to a row in a (temporary) table. From there you have something that you can deal with in the Database using SQL or PL/SQL. While the entire code sample for the Java class cannot be provided due to ownership/copyright, a screenshot of the core bits is included here, which should provide a running start for your implementation:


Figure 5 : Partial code sample bridging gap between Java capabilities and Oracle Database
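Since the screenshot itself cannot be reproduced here, the following is a hedged sketch of what such a procedure might look like. The class and method names are hypothetical, and it assumes the generic staging table ETL.TEMP_ETL_EXCEL (SHEET, ROW_, COLUMN_, VALUE_) shown in Figures 6 and 7; the jexcelapi calls (Workbook.getWorkbook, Sheet.getCell, Cell.getContents) are the library's documented API:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import jxl.Workbook;
import jxl.Sheet;
import jxl.Cell;

public class XlsReader {
    // Static method exposed to PL/SQL via a call spec; reads the BLOB as a
    // native XLS workbook and writes one row per cell to the staging table.
    public static void readXls(java.sql.Blob xls) throws Exception {
        // Server-side internal JDBC driver: we are already inside the session,
        // so this connection must not be closed.
        Connection conn = DriverManager.getConnection("jdbc:default:connection:");
        PreparedStatement ins = conn.prepareStatement(
            "INSERT INTO ETL.TEMP_ETL_EXCEL (SHEET, ROW_, COLUMN_, VALUE_) " +
            "VALUES (?, ?, ?, ?)");
        Workbook wb = Workbook.getWorkbook(xls.getBinaryStream());
        try {
            for (int s = 0; s < wb.getNumberOfSheets(); s++) {
                Sheet sheet = wb.getSheet(s);
                for (int r = 0; r < sheet.getRows(); r++) {
                    for (int c = 0; c < sheet.getColumns(); c++) {
                        Cell cell = sheet.getCell(c, r);   // (column, row)
                        ins.setInt(1, s);
                        ins.setInt(2, r);
                        ins.setInt(3, c);
                        ins.setString(4, cell.getContents());
                        ins.executeUpdate();
                    }
                }
            }
        } finally {
            wb.close();
            ins.close();
        }
    }
}

A PL/SQL call specification then makes the method callable like any stored procedure (again, the names are illustrative):

CREATE OR REPLACE PROCEDURE ETL.READ_XLS (p_file BLOB) AS
  LANGUAGE JAVA NAME 'XlsReader.readXls(java.sql.Blob)';
/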

After reading the Excel file into a generic Excel-style database table by calling this procedure, the resulting data is still very "Excel"-ish in structure, with generic columns named SHEET, ROW_, COLUMN_, and VALUE_ (see Figure 7). However, knowing the structure of the Excel file, a simple and elegant PIVOT query can turn the general Excel structure into something more usable for your Data Warehouse:

SELECT ISO_CODE, CURRENCY_NAME, RATE
FROM   ETL.TEMP_ETL_EXCEL
PIVOT (MAX(VALUE_)
       FOR COLUMN_ IN (1 AS ISO_CODE,
                       2 AS CURRENCY_NAME,
                       3 AS RATE))
WHERE  ROW_ > 0
ORDER  BY CURRENCY_NAME

Figure 6 : PIVOT query reshaping the generic Excel structure for the Data Warehouse

By designing our Java class so that it can read any arbitrary Excel file, we create a re-usable component. The output from this component can then be turned into its final product using the approach above, so long as you can rely on the consistency of the overall structure of the Excel file.


Figure 7 : Excel Data translated to generic Oracle table structure

Figure 8 : PIVOTing the generic Excel table to a Data Warehouse style structure


Other useful Java libraries that are freely available as Open Source components are:

Apache Commons Net – Useful for implementing File Transfer protocols other than HTTP(S) (which is available natively in Oracle Database package SYS.UTL_HTTP)

http://commons.apache.org/net/

Orion SSH2 – Apache Commons Net is missing the SFTP protocol; this library fills that gap

http://sourceforge.net/projects/orion-ssh2/

jCIFS – NetBIOS/CIFS/SMB network protocols – useful if you are running on a Microsoft Windows driven network (e.g. it can take the IP address of a PC and query the Windows machine name even if the IP is not in the DNS)

http://jcifs.samba.org/

ORACLE ADVANCED QUEUEING – ASYNCHRONOUS NOTIFICATION

Following on the topic of extending the capabilities of the Database is the situation where the Data Warehouse must notify an external process of a data event. Oracle Advanced Queueing (AQ) is often well suited to this problem space, depending somewhat on the capabilities of the target that must be notified. In a sentence, Oracle Advanced Queueing provides a fully featured messaging implementation, with the ability to enqueue and dequeue messages using simple PL/SQL calls. Oracle Advanced Queueing also supplies a complete JMS API, so attaching your ESB to an AQ is a simple way to allow your Data Warehouse to announce important changes to data.
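As a hedged illustration of how simple the PL/SQL side is (the queue name and payload text are invented for this sketch, and the queue is assumed to have been created with the SYS.AQ$_JMS_TEXT_MESSAGE payload type so that JMS consumers can read the same message):

DECLARE
  enq_opts  DBMS_AQ.ENQUEUE_OPTIONS_T;
  msg_props DBMS_AQ.MESSAGE_PROPERTIES_T;
  msg_id    RAW(16);
  payload   SYS.AQ$_JMS_TEXT_MESSAGE;
BEGIN
  -- Announce a data event; a JMS/ESB subscriber can dequeue this message
  payload := SYS.AQ$_JMS_TEXT_MESSAGE.CONSTRUCT;
  payload.SET_TEXT('BOOK_CLOSED:2011-03');
  DBMS_AQ.ENQUEUE(
    queue_name         => 'DW.BOOK_CLOSE_Q',
    enqueue_options    => enq_opts,
    message_properties => msg_props,
    payload            => payload,
    msgid              => msg_id);
  COMMIT;
END;
/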

Another powerful feature of AQ is its ability to dynamically create consumers for a queue when a message is enqueued. This feature, called PL/SQL Notification, comes in handiest when you want to spin up a background process to complete a long-running operation without holding back the user that created the need for it. As a more specific example, suppose you have a team of individuals preparing monthly reports. Once the team has signed off on the results in the operational system, the Team Leader closes the book for the month, which in turn demands a refresh of the materialized views in the Data Warehouse. The Team Leader doesn't want their reporting publisher to show an "hourglass" for (say) half an hour or more while the materialized views refresh; as long as they are notified some time later that the operation completed, that's all they need. PL/SQL Notification is one effective way to spin up the refresh operation in the background, allowing the Team Lead to move on with their other work.
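A minimal sketch of the moving parts, reusing the hypothetical queue from the enqueue example above (the five-parameter callback signature is the one PL/SQL Notification dictates; a production callback would also dequeue the message identified by descr and handle failures):

-- Callback that AQ invokes automatically when a message arrives
CREATE OR REPLACE PROCEDURE DW.REFRESH_MV_CALLBACK (
  context  RAW,
  reginfo  SYS.AQ$_REG_INFO,
  descr    SYS.AQ$_DESCRIPTOR,
  payload  RAW,
  payloadl NUMBER
) AS
BEGIN
  -- The long-running work now happens in a background session
  DBMS_MVIEW.REFRESH('DW.MONTHLY_BOOK_MV');
END;
/

-- Register the callback for a subscriber on the queue
BEGIN
  DBMS_AQ.REGISTER(
    SYS.AQ$_REG_INFO_LIST(
      SYS.AQ$_REG_INFO(
        'DW.BOOK_CLOSE_Q:MV_REFRESHER',       -- queue:consumer
        DBMS_AQ.NAMESPACE_AQ,
        'plsql://DW.REFRESH_MV_CALLBACK',     -- PL/SQL notification target
        HEXTORAW('FF'))),                     -- user context passed to callback
    1);
END;
/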

Always be sure to consider the necessary communication requirements (i.e. sending an e-mail) at the end of your PL/SQL Notification processes. Should the process fail, it may be necessary for someone to intervene to fix the problem, and without appropriate communication the problem could go unnoticed until a Data Warehouse consumer phones you with an issue!


ORACLE STREAMS AS AN ENABLER OF DATA WAREHOUSE CENTRALIZATION

Suppose your environment contains one or more Data Warehouses that need to be consolidated, or Data Marts that must be populated from the Warehouse. Perhaps these targets are technologies other than Oracle Database. Oracle Streams combined with Transparent Gateway allows for seamless Heterogeneous Replication back to the legacy Data Warehouse or Data Marts. Using this approach, the Data Warehouse staff need build and support only one ETL process, and DBAs support Oracle Streams like any other aspect of the Database infrastructure.

ORACLE STREAMS – BUILDING A HETEROGENEOUS STREAM

When building a Heterogeneous Streams setup, the traditional separated Capture and Apply model must be used. Much can be learned about the architecture of Oracle Streams by reading the Oracle Streams Concepts and Administration manual. In a very small nutshell, the Capture process is responsible for mining the archive logs and finding/queueing all DML that needs to be sent to the legacy Data Warehouse target. The Apply process takes from this queue and actually ships the data downstream to the legacy target.

In general, Streams is a very memory-hungry process. Be prepared to allocate 2 to 4 gigabytes of memory to the Streams Pool. Explicitly split your Capture and Apply processes over multiple nodes if you are employing RAC in order to smooth the memory usage across your environment. The value that Streams provides to your Data Warehouse migration strategy should more than pay for the memory it requires.
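For example, the Streams Pool can be sized explicitly rather than left entirely to automatic SGA management (the 2G figure simply follows the guideline above; the right value varies by environment, and with automatic memory management in effect this setting acts as a floor below which the pool will not shrink):

ALTER SYSTEM SET STREAMS_POOL_SIZE = 2G SCOPE = BOTH;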

ORACLE STREAMS – CAPTURE PROCESS AND RULES

The Capture process is created the same way as any homogeneous Capture process would be and is well described in the manual Oracle Streams Concepts and Administration. This paper will therefore not focus further on the creation of the Capture process, except to show a script that can be used to create an example Capture process called "SAMPLE_CAPTURE" and a Capture rule to capture the table "PLAY.NHL_PLAYER_STAT":

BEGIN
  DBMS_STREAMS_ADM.SET_UP_QUEUE(
    queue_table => 'SAMPLE_STREAM_QT',
    queue_name  => 'SAMPLE_STREAM_Q',
    queue_user  => 'STRMADMIN'
  );
END;
/

BEGIN
  DBMS_CAPTURE_ADM.CREATE_CAPTURE(
    queue_name                => 'SAMPLE_STREAM_Q',
    capture_name              => 'SAMPLE_CAPTURE',
    capture_user              => 'STRMADMIN',
    checkpoint_retention_time => 3
  );
END;
/

Figure 9 : Oracle Streams – Standard Capture


BEGIN
  DBMS_STREAMS_ADM.ADD_TABLE_RULES(
    table_name         => 'PLAY.NHL_PLAYER_STAT',
    streams_type       => 'CAPTURE',
    streams_name       => 'SAMPLE_CAPTURE',
    queue_name         => 'SAMPLE_STREAM_Q',
    include_dml        => true,
    include_ddl        => false,
    include_tagged_lcr => false,
    inclusion_rule     => true
  );
END;
/

Figure 10 : Oracle Streams – Standard Capture Rule

ORACLE STREAMS – TRANSPARENT GATEWAY CONFIGURATION

Before you begin building the Streams Apply process, a Transparent Gateway Database Link must first be in place. The recommended configuration is to create a separate Database Link for your Streams processes even if you already have a Database Link to the same remote target available to applications and users. Doing so allows you to use different permissions for the Streams user (e.g. the Streams link must be able to write to remote tables, while applications must not write to those same tables or the replication will fall out of sync!), and also provides flexibility in configuring, or even upgrading and patching, the gateway for Streams differently than the gateway for applications and users.

Creating and configuring the Database Link for Streams is therefore like creating any other Database Link, except we will make it owned by the database user STRMADMIN. This example shows a link named MSSQL_STREAMS_NORTHWIND that links to the SQL Server Northwind database on a server named SQLDEV2:

#
# HS init parameters
#
HS_FDS_CONNECT_INFO=SQLDEV2//Northwind
HS_FDS_TRACE_LEVEL=OFF
HS_COMMIT_POINT_STRENGTH=0
HS_FDS_RESULTSET_SUPPORT=TRUE
HS_FDS_DEFAULT_OWNER=dbo

Figure 11 : Text file "initLDB_STREAMS_NORTHWIND.ora"

CREATE DATABASE LINK MSSQL_STREAMS_NORTHWIND
  CONNECT TO STRMADMIN IDENTIFIED BY ********
  USING 'LDB_STREAMS_NORTHWIND';

Figure 12 : DDL to create Database Link MSSQL_STREAMS_NORTHWIND
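The connect string 'LDB_STREAMS_NORTHWIND' must resolve to the gateway's listener. A hedged sketch of the corresponding tnsnames.ora entry follows (host and port are placeholders for your gateway listener; the (HS = OK) clause is what marks the connection as Heterogeneous Services):

LDB_STREAMS_NORTHWIND =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = gateway-host)(PORT = 1521))
    (CONNECT_DATA = (SID = LDB_STREAMS_NORTHWIND))
    (HS = OK)
  )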


ORACLE STREAMS – APPLY PROCESS AND RULES

The Streams Apply process is where the work to send rows to the Heterogeneous target occurs. Each step in the Apply process and rule creation/configuration is worth examining in some detail, so this paper focuses more closely on the Apply configuration than on the previous steps.

When creating a Heterogeneous Apply process, a Database Link is named. This means that in the design of your Streams topology, you will need at least one Apply process for each "Database" on the target server. This is especially important to consider when targeting Microsoft SQL Server or Sybase, as a Database in those environments is more like a Schema in Oracle. Below is a script to create a sample Heterogeneous Apply process called "SAMPLE_APPLY_NORTHWIND":

BEGIN
  DBMS_APPLY_ADM.CREATE_APPLY(
    queue_name          => 'SAMPLE_STREAM_Q',
    apply_name          => 'SAMPLE_APPLY_NORTHWIND',
    apply_captured      => TRUE,
    apply_database_link => 'MSSQL_STREAMS_NORTHWIND'
  );
END;
/

Figure 13 : Oracle Streams – Heterogeneous Apply

In a Heterogeneous Apply situation, the Apply Table Rule itself does not differ from a typical Streams Apply Table Rule. Below is an example of an Apply Table Rule that includes the same table we captured in the sections above, PLAY.NHL_PLAYER_STAT, as part of the table rules for the Apply process SAMPLE_APPLY_NORTHWIND.

BEGIN
  DBMS_STREAMS_ADM.ADD_TABLE_RULES(
    table_name   => 'PLAY.NHL_PLAYER_STAT',
    streams_type => 'APPLY',
    streams_name => 'SAMPLE_APPLY_NORTHWIND',
    queue_name   => 'SAMPLE_STREAM_Q',
    include_dml  => true,
    include_ddl  => false
  );
END;
/

Figure 14 : Oracle Streams – Standard Apply Rule


ORACLE STREAMS – APPLY TRANSFORMS – TABLE RENAME

The Apply Table Rename transform is one of the most noteworthy steps in setting up Heterogeneous Streams because it is absolutely required unless you are applying to the same schema on the legacy Data Warehouse as the schema that owns the table in the new Oracle Data Warehouse. It is more likely that you have either redesigned your schemas to align with the current business model, or, in the case of a Microsoft SQL Server legacy, made Oracle Schemas out of the SQL Server Databases, where the legacy owner of the tables is "dbo". You may also have taken the opportunity to create the table in the Oracle Data Warehouse using more accurate or standardized names. Below is an example of an Apply Table Rename transform that maps the new table PLAY.NHL_PLAYER_STAT to the legacy dbo.NHL_PLAYER_STATS table in the Northwind database:

BEGIN
  DBMS_STREAMS_ADM.RENAME_TABLE(
    rule_name       => 'NHL_PLAYER_STAT2283',
    from_table_name => 'PLAY.NHL_PLAYER_STAT',
    to_table_name   => '"dbo".NHL_PLAYER_STATS',
    step_number     => 0,
    operation       => 'ADD');
END;
/

Figure 15 : Oracle Streams – Apply Table Rename rule

Notice that the rule name in this example is suffixed with the number 2283. This number was chosen by Oracle in the Add Table Rule step. You will need to pull it out of the view DBA_STREAMS_RULES after executing the ADD_TABLE_RULES step, or write a more sophisticated script that captures the rule name in a variable using the overloaded ADD_TABLE_RULES procedure that returns it as an OUT parameter.
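Both approaches are sketched below under the same sample names used above. The dictionary query assumes DBA privileges and the documented SCHEMA_NAME/OBJECT_NAME columns of DBA_STREAMS_RULES; the second block uses the DBMS_STREAMS_ADM.ADD_TABLE_RULES overload with the dml_rule_name/ddl_rule_name OUT parameters in place of the plain call:

-- Option 1: look the generated rule name up after the fact
SELECT rule_name
FROM   dba_streams_rules
WHERE  streams_name = 'SAMPLE_APPLY_NORTHWIND'
AND    streams_type = 'APPLY'
AND    schema_name  = 'PLAY'
AND    object_name  = 'NHL_PLAYER_STAT';

-- Option 2: capture the rule name at creation time
DECLARE
  v_dml_rule VARCHAR2(64);
  v_ddl_rule VARCHAR2(64);
BEGIN
  DBMS_STREAMS_ADM.ADD_TABLE_RULES(
    table_name    => 'PLAY.NHL_PLAYER_STAT',
    streams_type  => 'APPLY',
    streams_name  => 'SAMPLE_APPLY_NORTHWIND',
    queue_name    => 'SAMPLE_STREAM_Q',
    include_dml   => true,
    include_ddl   => false,
    dml_rule_name => v_dml_rule,   -- e.g. NHL_PLAYER_STAT2283
    ddl_rule_name => v_ddl_rule);
  DBMS_OUTPUT.PUT_LINE('DML rule: ' || v_dml_rule);
END;
/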

One final note about the Rename Table transform: it is not possible to Apply to a Heterogeneous target table whose name is in mixed case. Microsoft SQL Server, for example, allows mixed-case table names; you will need to have your DBAs change the table names to upper case on the target before the Apply process will work. Luckily, Microsoft SQL Server (under its default collation) is case insensitive when it comes to the use of the tables, so while changing the table names to upper case may make a legacy "Camel Case" table list look rather ugly, nothing should functionally break as a result of this change.

ORACLE STREAMS – APPLY TRANSFORMS – COLUMN RENAME

The Column Rename transform is similar in nature to the Table Rename transform. Notice in the example below how a column is being renamed because the legacy table contains a column named "DATE", which is disallowed as an ordinary (unquoted) Oracle column name because DATE is a keyword (the data type). The same restriction applies to column names as to table names in a Heterogeneous Apply configuration: all column names on the target must be in upper case. Again, this should have no impact on your legacy code, as systems that allow mixed-case column names, such as Microsoft SQL Server, are typically not case sensitive when using the column.


BEGIN
  DBMS_STREAMS_ADM.RENAME_COLUMN(
    rule_name        => 'NHL_PLAYER_STAT2283',
    table_name       => 'PLAY.NHL_PLAYER_STAT',
    from_column_name => '"DATE_"',
    to_column_name   => '"DATE"',
    value_type       => '*',
    step_number      => 0,
    operation        => 'ADD');
END;
/

BEGIN
  DBMS_STREAMS_ADM.RENAME_COLUMN(
    rule_name        => 'NHL_PLAYER_STAT2283',
    table_name       => 'PLAY.NHL_PLAYER_STAT',
    from_column_name => '"NAME_"',
    to_column_name   => '"NAME"',
    value_type       => '*',
    step_number      => 0,
    operation        => 'ADD');
END;
/

Figure 16 : Oracle Streams – Apply Column Rename rules

Finally, the Apply process must be told the SCN from which changes to the table are valid for applying; that is, the table must be instantiated. Assuming the source and target start in sync (see the next section), the current SCN can be used:

DECLARE
  iscn NUMBER;
BEGIN
  iscn := DBMS_FLASHBACK.GET_SYSTEM_CHANGE_NUMBER;
  DBMS_APPLY_ADM.SET_TABLE_INSTANTIATION_SCN(
    source_object_name   => 'PLAY.NHL_PLAYER_STAT',
    source_database_name => 'EDBDEV.OTPP.COM',
    instantiation_scn    => iscn,
    apply_database_link  => 'MSSQL_STREAMS_NORTHWIND');
END;
/

Figure 17 : Oracle Streams – Instantiating the Apply process


ORACLE STREAMS – EXERCISING THE STREAM

Assuming we have tables set up in both the legacy and new Data Warehouse that differ in structure only by the table name and the column names renamed above, the steps above are sufficient to put the Stream into action. Note that Streams has no ability to synchronize tables that are out of sync; before setting up the Stream you must ensure that the table content matches exactly. Let's assume for now that you are starting with zero rows and plan to insert all the data after the Stream is set up. The screenshot below illustrates for this example that the legacy target table on Microsoft SQL Server is empty:

Figure 18 : Oracle Streams – Empty Microsoft SQL Server target table

Below is a rudimentary script showing some seed data being inserted into the Oracle table. You would of course want to use a more sophisticated approach such as SQL*Loader; however, this sample is kept simple for the purposes of understanding and transparency:


SQL> INSERT INTO PLAY.NHL_PLAYER_STAT (DATE_, TEAM, SWEATER_NO, NAME_, BIRTH_DATE, POINTS, FACE_OFF_PCT)
     VALUES (DATE '2010-08-31', 'VAN', 33, 'Henrik Sedin', DATE '1980-09-26', 112, 49.5);

1 row inserted

SQL> INSERT INTO PLAY.NHL_PLAYER_STAT (DATE_, TEAM, SWEATER_NO, NAME_, BIRTH_DATE, POINTS, FACE_OFF_PCT)
     VALUES (DATE '2010-08-31', 'PIT', 87, 'Sidney Crosby', DATE '1987-08-07', 109, 55.9);

1 row inserted

SQL> INSERT INTO PLAY.NHL_PLAYER_STAT (DATE_, TEAM, SWEATER_NO, NAME_, BIRTH_DATE, POINTS, FACE_OFF_PCT)
     VALUES (DATE '2010-08-31', 'WSH', 8, 'Alex Ovechkin', DATE '1985-09-17', 109, 45.4);

1 row inserted

SQL> INSERT INTO PLAY.NHL_PLAYER_STAT (DATE_, TEAM, SWEATER_NO, NAME_, BIRTH_DATE, POINTS, FACE_OFF_PCT)
     VALUES (DATE '2010-08-31', 'SJS', 19, 'Joe Thornton', DATE '1979-07-02', 89, 53.9);

1 row inserted

SQL> INSERT INTO PLAY.NHL_PLAYER_STAT (DATE_, TEAM, SWEATER_NO, NAME_, BIRTH_DATE, POINTS, FACE_OFF_PCT)
     VALUES (DATE '2010-08-31', 'OTT', 11, 'Daniel Alfredsson', DATE '1972-12-11', 71, 35.0);

1 row inserted

SQL> INSERT INTO PLAY.NHL_PLAYER_STAT (DATE_, TEAM, SWEATER_NO, NAME_, BIRTH_DATE, POINTS, FACE_OFF_PCT)
     VALUES (DATE '2010-08-31', 'CGY', 12, 'Jarome Iginla', DATE '1977-07-01', 69, 47.0);

1 row inserted

SQL> INSERT INTO PLAY.NHL_PLAYER_STAT (DATE_, TEAM, SWEATER_NO, NAME_, BIRTH_DATE, POINTS, FACE_OFF_PCT)
     VALUES (DATE '2010-08-31', 'TOR', 15, 'Tomas Kaberle', DATE '1978-03-02', 49, NULL);

1 row inserted

SQL> COMMIT;

Commit complete

Figure 19 : Oracle Streams – Inserting to the new Oracle Data Warehouse Table

After allowing the Capture and Apply processes a few seconds to catch up, re-executing the query from above on the legacy Data Warehouse shows that the rows have been replicated through Streams to the target.

Figure 20 : Oracle Streams – Populated Microsoft SQL Server target Table


DATA WAREHOUSE MODERNIZATION – EDITION BASED REDEFINITION

Every information system has a useful life, and Data Warehouse structures are no exception. The lifecycle of Data Warehouse structures is particularly difficult to manage because of the large number of consumers. Eventually one reaches the point where everyone agrees that new structures are required, but time is needed to make the transition. It is likely not the case that every Data Warehouse modernization would justify an entirely new warehouse; in fact, a targeted and staged approach is more prudent. Historically, the challenge with modernizing a subset of a Data Warehouse has been: how does one control the use of newer structures in conjunction with the older structures such that new dependencies on legacy structures are not created? After all, do we not wish to retire the legacy structures after some reasonable period of parallel availability?

Edition Based Redefinition, available in Oracle Database as of 11gR2, addresses the need to provide two (or more) structural representations of Data Warehouse segments while allowing only one to be used at a time by any one database connection. The edition context of the connection specifies one clear definition of data warehouse entities, either legacy or current. Of course, a crafty developer could work around the edition segregation by opening multiple connections or by switching their edition context mid-session; however, this would most likely be a purposeful violation of the target state rather than an easily made oversight.
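A minimal sketch of the mechanics follows. The edition, schema, grantee, and view names are invented for illustration; editionable objects such as views and synonyms can differ per edition while the underlying tables are shared:

-- Allow a schema's objects to be editioned, then create a new edition
ALTER USER DW ENABLE EDITIONS;
CREATE EDITION DW_V2;
GRANT USE ON EDITION DW_V2 TO DW_CONSUMERS;

-- In the new edition, present the modernized shape of a legacy table
ALTER SESSION SET EDITION = DW_V2;
CREATE OR REPLACE EDITIONING VIEW DW.POSITION AS
  SELECT POS_ID  AS POSITION_ID,   -- standardized names over legacy columns
         POS_VAL AS MARKET_VALUE
  FROM   DW.POSITION_TAB;

-- Sessions that have not switched editions keep seeing the legacy definition
ALTER SESSION SET EDITION = ORA$BASE;

Connections bound to DW_V2 see only the current model, while legacy consumers continue against ORA$BASE until they are migrated and the old edition can be retired.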

CONCLUSION

Oracle Database is well equipped to handle some of the most challenging Data Warehouse requirements with sophisticated features such as Partitioning, Advanced Compression, Materialized Views, Advanced Queueing, and Heterogeneous Streams. When the stack of capabilities as shipped is insufficient, it is often not too far a stretch to enrich the available functionality with some Open Source Java classes. As our Data Warehouses begin to show their age and demand modernization, perhaps Oracle's latest addition to the stack, Edition Based Redefinition, has arrived not a moment too soon. Through the extraordinary capability of Editions, we can now take control of our legacy and manage an orderly transition of data models in our Data Warehouse environment.
