Case study: real-time OLAP cubes

Post on 25-Jun-2015


DESCRIPTION

Are OLAP cubes "large monsters" that deliver quick data retrieval at the expense of long load times? This presentation shows one way to dispel that myth.

TRANSCRIPT

Case study: 

Quasi real-time OLAP cubes

by Ziemowit Jankowski, Database Architect

OLAP Cubes - what are they?

• Used to quickly analyze and retrieve data from different perspectives

• Numeric data

• Structured data:

o can be represented as numeric values (or sets thereof) accessed by a composite key

o each of the parts of the composite key belongs to a well-defined set of values

• Facts = numeric values

• Dimensions = parts of the composite key

• Source = usually a star or snowflake schema in a relational DB (other sources possible)

OLAP Cubes - data sources

[Diagrams: star schema and snowflake schema]

OLAP Facts and dimensions

• Every "cell" in an OLAP cube contains numeric data, a.k.a. "measures".

• Every "cell" may contain more than one measure, e.g. forecast and outcome.

• Every "cell" has a unique combination of dimension values.

OLAP Cubes - operations

• Slice = choose values corresponding to ONE value on one or more dimensions

• Dice = choose values corresponding to one slice, or a number of consecutive slices, on two or more of the cube's dimensions

OLAP Cubes - operations (cont'd)

• Drill down/up = choose lower/higher level details. Used in context of hierarchical dimensions.

• Pivot = rotate the orientation of the data for reporting purposes

• Roll-up = summarize data by aggregating along a dimension hierarchy (the counterpart of drill-down)
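The operations above can be sketched on a toy in-memory cube. The dimension names and figures below are made up for illustration; a real MOLAP engine stores and indexes this very differently.

```python
# A toy "cube": measures keyed by a composite dimension key.
cube = {
    ("2023", "EU", "widgets"): 120,
    ("2023", "EU", "gadgets"): 80,
    ("2023", "US", "widgets"): 200,
    ("2024", "EU", "widgets"): 150,
}
DIMS = ("year", "region", "product")

def slice_cube(cube, dim, value):
    """Slice = fix ONE value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def roll_up(cube, dim):
    """Roll-up = aggregate away one dimension (summing the measure)."""
    i = DIMS.index(dim)
    out = {}
    for k, v in cube.items():
        reduced = k[:i] + k[i + 1:]          # drop the rolled-up dimension
        out[reduced] = out.get(reduced, 0) + v
    return out

eu_2023 = slice_cube(cube, "year", "2023")   # all 2023 cells
by_year_region = roll_up(cube, "product")    # summed over products
print(by_year_region[("2023", "EU")])        # 120 + 80 = 200
```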

OLAP Cubes - refresh methods

• Incremental:

o possible when cubes grow "outwards", i.e. no "scattered" changes in data

o only delta data need to be read

o refresh may be fast if the delta is small

• Full:

o possible for all cubes, even when changes are "scattered" all over the data

o all data need to be re-read with every refresh

o refresh may take a long time (hours)

The situation on hand

• Business operating on a 24*6 basis (Sun-Fri)

• Events from production systems are aggregated into flows and production units

• Production figures may be adjusted manually long after the production date

• Daily production figures are the basis for daily forecasts, with the simplified formula: forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm

• Adjustments in production figures will alter forecast figures

• Outcome and forecast should be stored in MS OLAP cubes as per software architecture demands

• The system should simplify comparisons between forecast and outcome figures
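The simplified forecast formula above can be expressed directly. The figures in the usage lines are hypothetical, not from the presentation.

```python
def forecast(production_prev_year, trend, manual_adjustment=0.0):
    # Simplified formula from the slides:
    # forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm
    return production_prev_year * trend + manual_adjustment

# Hypothetical figures: last year's production 1000 units, +5% trend,
# plus a manual forecast adjustment of 25 units.
f = forecast(1000, 1.05, manual_adjustment=25)
print(f)  # 1075.0
```

Because production figures may be adjusted long after the fact, re-running this calculation also changes forecasts derived from old data, which is what forces "scattered" updates deep into the cube's history.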

Software

• Source of data:

o relational database

o Oracle 10g database

o extensive use of PL/SQL in the database

• Destination of data:

o OLAP cubes - MS SQL Server Analysis Services (versions 2005 and 2008)

• Other software:

o MS SQL Server database

QUESTION

Can we get almost real-time reports from MS OLAP cubes?

ANSWER

YES! The answer lies in "cube partitioning".

Cube partitioning - the basics

• Cube partitions may be updated independently

• Cube partitions must not overlap (otherwise duplicate values may occur)

• Time is a good dimension to partition on

MS OLAP cube partitioning - details

• Every cube partition has its own query to define the data set fetched from the data source

• The SQL statements define the non-overlapping data sets
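As a sketch, each partition's source query can be generated from a half-open date range, which guarantees the SQL statements define non-overlapping, gap-free data sets. The table and column names below are hypothetical, not from the presentation.

```python
from datetime import date, timedelta

def partition_query(start: date, end: date) -> str:
    """Build the source query for one cube partition.

    The half-open range [start, end) means adjacent partitions
    can neither overlap nor leave gaps.
    """
    return (
        "SELECT * FROM fact_production "            # hypothetical fact table
        f"WHERE prod_date >= DATE '{start.isoformat()}' "
        f"AND prod_date < DATE '{end.isoformat()}'"
    )

today = date(2010, 6, 23)
q_now = partition_query(today, today + timedelta(days=1))      # "Now" partition
q_recent = partition_query(today - timedelta(days=7), today)   # recent history
```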

How to partition? - theory

• Partitions with different lengths and different update frequencies:

o current data = very small partition, very short update times, updated often

o "not very current" data = a bit larger partition, longer update times, updated less often

o historical data = large partition, long update times, updated seldom

• Operation 24x6 delivers the "seldom" window
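The three tiers might be expressed as adjacent half-open date ranges. The tier lengths below are illustrative assumptions, not figures from the presentation.

```python
from datetime import date, timedelta

def partition_tiers(today: date):
    """Return (start, end) half-open ranges for the three tiers.

    Adjacent tiers share a boundary, so by construction there
    are no gaps and no overlaps between partitions.
    """
    recent_start = today - timedelta(days=7)   # assumed: last week
    history_start = date(2000, 1, 1)           # assumed: start of history
    return {
        "now": (today, today + timedelta(days=1)),   # current day, refreshed often
        "recent": (recent_start, today),             # refreshed less often
        "history": (history_start, recent_start),    # refreshed seldom
    }

tiers = partition_tiers(date(2010, 6, 23))
```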

How to partition? - theory cont'd

• One cube for both forecast and outcome

Solution - approach one

Decisions:

• Cubes partitioned on date boundaries

• MOLAP cubes (for better query performance)

• Use SSIS to populate cubes:

o dimensions populated by incremental processing

o facts populated by full processing

o jobs for historical data must be run after midnight to compensate for the date change

Actions:

• Cubes built

• SSIS packages deployed inside SQL Server (not on the filesystem)

• SSIS packages set up as scheduled database jobs

Did it work?

No!

Malfunctions:

• Simultaneous updates of cube partitions could lead to deadlocks

• Deadlocks left cube partitions in an unprocessed state

Amendment:

• Cube partitions must not be updated simultaneously

Solution - approach two

Decisions:

• Cube processing must handle ONE partition at a time

• Scheduling done by an SSIS "super package":

o a SQL Server table contains approximate frequencies and package names

o the "super package" executes SSIS packages as indicated by the table

Actions:

• Scheduling table created

• "Super package" created to be self-modifying
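A minimal sketch of the "super package" idea, assuming a scheduling table holding package names and approximate frequencies. The real implementation is an SSIS package; this models only the strictly sequential, one-partition-at-a-time scheduling logic. All names are hypothetical.

```python
import time
from dataclasses import dataclass

@dataclass
class Job:
    """One row of the scheduling table (hypothetical columns)."""
    package: str          # SSIS package to execute
    interval_s: int       # approximate run frequency in seconds
    next_run: float = 0.0 # epoch time of next scheduled run

def run_package(name: str) -> None:
    # Placeholder for "execute SSIS package <name>".
    print(f"processing {name}")

def super_package(jobs, now=None):
    """Run due jobs one at a time, earliest deadline first,
    so two partition updates can never collide (and deadlock)."""
    now = time.time() if now is None else now
    due = sorted((j for j in jobs if j.next_run <= now),
                 key=lambda j: j.next_run)
    for job in due:                      # sequential, never parallel
        run_package(job.package)
        job.next_run = now + job.interval_s

jobs = [Job("history_partition", 7 * 24 * 3600, next_run=5),
        Job("now_partition", 120, next_run=0)]
super_package(jobs, now=10)   # both jobs are due; "now_partition" runs first
```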

Did it work?

Not really!

Malfunctions:

• Historical data had to be updated after midnight, and real-time updates for the "Now" partition were postponed meanwhile. This was done to avoid "gaps" in outcome data and "overlaps" in forecast data.

• Real-time updates ended soon after midnight and were resumed a few hours later. (That was NOT acceptable.)

Amendment:

• Re-think!

Solution - approach three

Decisions:

• Take advantage of the 24*6 cycle (as opposed to 24*7)

• Switch dates on Saturdays only:

o the "Now" partition had to stretch from Saturday to Saturday

o all other partitions had to stretch from one Saturday to another Saturday

• Re-process all time-consuming partitions on Saturday after switch of date

Solution - approach three cont'd

Actions:

• Create logic in the Oracle database to do date calculations "modulo week", i.e. based on Saturday. Logic implemented as a function.

• Rewrite the SQL statements for the cube partitions so that they employ the Oracle function (as above) instead of the current date +/- a given number of days.

• Reschedule the time-consuming updates so they run every 7th day.
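The "modulo week" logic was implemented as an Oracle PL/SQL function; a Python sketch of the same Saturday-based date calculation might look like this (function names are mine, not from the presentation):

```python
from datetime import date, timedelta

def last_saturday(d: date) -> date:
    """Most recent Saturday on or before d ("modulo week" date logic).
    Python's weekday(): Monday=0 ... Saturday=5, Sunday=6."""
    return d - timedelta(days=(d.weekday() - 5) % 7)

def partition_window(d: date, weeks_back: int) -> tuple:
    """Half-open [start, end) range stretching from one Saturday
    to another, as the approach-three partitions require."""
    end = last_saturday(d)
    return end - timedelta(weeks=weeks_back), end

# Partition boundaries stay fixed all week and only "switch" on Saturday:
print(last_saturday(date(2010, 6, 23)))  # 2010-06-19 (a Saturday)
```

Driving the partition queries from such a function, rather than from "current date +/- N days", is what keeps the boundaries stable between Saturdays.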

Did it work?

Yes!

Malfunctions:

• None, really.

Lessons learned

• It is possible to build real-time OLAP cubes in MS technology

• It is possible to make the partitions self-maintaining in terms of partition boundaries

• The concept needs careful engineering, as there are pitfalls along the way.

Omitted details

Some details have been omitted:

• the quasi real-time updates are scheduled to occur every 2nd or 3rd minute

• scheduling is not exact, as the Super-job keeps track of what is to be run and when, and executes SSIS packages based on their "scheduled-to-run" state, their priority, and a few other criteria

• the source of data is not a proper star schema; it is rather an emulation of facts and dimensions by means of data tables and views in Oracle.
