Section D

Online transaction processing

Online transaction processing, or OLTP, is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions. OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automated teller machine (ATM) for a bank is an example of a commercial transaction processing application.

Benefits

Online transaction processing has two key benefits: simplicity and efficiency. Reduced paper trails and faster, more accurate forecasts for revenues and expenses are both examples of how OLTP makes things simpler for businesses.

Online transaction processing increasingly requires support for transactions that span a network and may include more than one company. For this reason, newer online transaction processing software uses client/server processing and brokering software that allows transactions to run on different computer platforms in a network.

In large applications, efficient OLTP may depend on sophisticated transaction management software (such as CICS) and/or database optimization tactics to facilitate the processing of large numbers of concurrent updates to an OLTP-oriented database.

For even more demanding decentralized database systems, OLTP brokering programs can distribute transaction processing among multiple computers on a network. OLTP is often integrated into service-oriented architecture (SOA) and Web services.

Online Transaction Processing (OLTP) involves gathering input information, processing the information and updating existing information to reflect the gathered and processed information. Today, most organizations use a database management system to support OLTP. OLTP is typically carried out in a client-server system.

Online transaction processing Requirements

Online analytical processing

In computing, online analytical processing, or OLAP, is an approach to answering multi-dimensional analytical (MDA) queries swiftly.

OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining.

Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture.

The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).


OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.

OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.

Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions.

For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. By contrast, the drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by individual products that make up a region's sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.
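To make consolidation and slicing concrete, here is a minimal SQL sketch. The table and column names (sales_fact, store_dim, product_dim, time_dim, sales_amount) are hypothetical and not part of the original slides; GROUP BY ROLLUP is standard SQL supported by most major databases.

-- Roll-up (consolidation): aggregate store-level sales up to region level and to a grand total.
SELECT s.region, s.store_name, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   store_dim s ON s.store_id = f.store_id
GROUP  BY ROLLUP (s.region, s.store_name);

-- Slice: fix the time dimension to one quarter and view the remaining dimensions.
SELECT s.region, p.product_name, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   store_dim   s ON s.store_id   = f.store_id
JOIN   product_dim p ON p.product_id = f.product_id
JOIN   time_dim    t ON t.time_id    = f.time_id
WHERE  t.quarter = 'Q1'
GROUP  BY s.region, p.product_name;

Drill-down is the reverse of the first query: adding store_name (or the individual product) back into the GROUP BY list exposes the detail behind a regional total.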

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time. They borrow aspects of navigational databases, hierarchical databases and relational databases.

Overview of OLAP systems

The core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures which are categorized by dimensions. The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space. The usual interface to manipulate an OLAP cube is a matrix interface like pivot tables in a spreadsheet program, which performs projection operations along the dimensions, such as aggregation or averaging.

• The cube metadata is typically created from a star schema or snowflake schema or fact constellation of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.

• Each measure can be thought of as having a set of labels, or meta-data associated with it. A dimension is what describes these labels; it provides information about the measure.

• A simple example would be a cube that contains a store's sales as a measure, and Date/Time as a dimension. Each Sale has a Date/Time label that describes more about that sale.

• Any number of dimensions can be added to the structure such as Store, Cashier, or Customer by adding a foreign key column to the fact table. This allows an analyst to view the measures along any combination of the dimensions.

• For example, a store's sales cube can be laid out as a fact table with one foreign key column per dimension, as sketched below.
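A minimal DDL sketch of such a cube stored as a star schema; all table and column names are illustrative assumptions (Oracle-style data types), not taken from the slides:

CREATE TABLE time_dim (
  time_id     NUMBER PRIMARY KEY,
  calendar_dt DATE,
  quarter     VARCHAR2(10),
  year_no     NUMBER
);

CREATE TABLE store_dim (
  store_id   NUMBER PRIMARY KEY,
  store_name VARCHAR2(50),
  region     VARCHAR2(50)
);

CREATE TABLE product_dim (
  product_id   NUMBER PRIMARY KEY,
  product_name VARCHAR2(50)
);

-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE sales_fact (
  time_id      NUMBER REFERENCES time_dim (time_id),
  store_id     NUMBER REFERENCES store_dim (store_id),
  product_id   NUMBER REFERENCES product_dim (product_id),
  sales_amount NUMBER(12,2)
);

Adding another dimension, such as Cashier or Customer, only requires another dimension table and another foreign key column in sales_fact, exactly as described above.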

Transaction processing system

Transaction processing is a style of computing that divides work into individual, indivisible operations, called transactions. A transaction processing system (TPS) or transaction server is a software system, or software/hardware combination, that supports transaction processing.

Processing types

Transaction processing is distinct from other computer processing models — batch processing, time-sharing, and real-time processing.

Batch processing

Batch processing is execution of a series of programs (jobs) on a computer without manual intervention. Several transactions, called a batch, are collected and processed at the same time. The results of each transaction are not immediately available when the transaction is being entered; there is a time delay.

Real-time processing

Real-time systems attempt to guarantee an appropriate response to a stimulus or request quickly enough to affect the conditions that caused the stimulus. Each transaction in real-time processing is unique; it is not part of a group of transactions.


Time-sharing

Time sharing is the sharing of a computer system among multiple users, usually giving each user the illusion that they have exclusive control of the system. The users may be working on the same project or different projects, but there are usually few restrictions on the type of work each user is doing.

Transaction processing

Transaction processing systems also attempt to provide predictable response times to requests, although this is not as critical as for real-time systems. Rather than allowing the user to run arbitrary programs as time-sharing does, transaction processing allows only predefined, structured transactions. Each transaction is usually of short duration, and the processing activity for each transaction is programmed in advance.


Transaction processing system features

The following features are considered important in evaluating transaction processing systems:

• Performance

Fast performance with a rapid response time is critical. Transaction processing systems are usually measured by the number of transactions they can process in a given period of time.

• Continuous availability

The system must be available during the time period when the users are entering transactions. Many organizations rely heavily on their TPS; a breakdown will disrupt operations or even stop the business.

• Data integrity

The system must be able to handle hardware or software problems without corrupting data. Multiple users must be protected from attempting to change the same piece of data at the same time, for example two operators cannot sell the same seat on an airplane.

• Ease of use

Often users of transaction processing systems are casual users. The system should be simple for them to understand, protect them from data-entry errors as much as possible, and allow them to easily correct their errors.

• Modular growth

The system should be capable of growth at incremental costs, rather than requiring a complete replacement. It should be possible to add, replace, or update hardware and software components without shutting down the system.

Concurrency: Definition

Concurrency is the ability of a database to allow multiple users to affect multiple transactions. This is one of the main properties that separates a database from other forms of data storage like spreadsheets.

The ability to offer concurrency is unique to databases. Spreadsheets or other flat-file means of storage are often compared to databases, but they differ in this one important regard. Spreadsheets cannot offer several users the ability to view and work on the different data in the same file, because once the first user opens the file it is locked to other users. Other users can read the file, but may not edit data.

The problems caused by concurrency are even more important than the ability to support concurrent transactions. For example, when one user is changing data but has not yet saved (committed) that data, then the database should not allow other users who query the same data to view the changed, unsaved data. Instead the user should only view the original data.


Concurrency Problems

Concurrency problems include:

• Lost or buried updates.

• Uncommitted dependency (dirty read).

• Inconsistent analysis (non-repeatable read).

• Phantom reads.


• Lost Updates

Lost updates occur when two or more transactions select the same row and then update the row based on the value originally selected. Each transaction is unaware of other transactions. The last update overwrites updates made by the other transactions, which results in lost data.

For example:

Two editors make an electronic copy of the same document. Each editor changes the copy independently and then saves the changed copy, thereby overwriting the original document. The editor who saves the changed copy last overwrites changes made by the first editor. This problem could be avoided if the second editor could not make changes until the first editor had finished.
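A hedged SQL sketch of the usual cure for lost updates: lock the row when it is read, so the second writer cannot even read it until the first one commits. The documents table and its columns are hypothetical.

-- Editor 1: lock the row before working on it.
SELECT body FROM documents WHERE doc_id = 42 FOR UPDATE;
UPDATE documents SET body = 'version saved by editor 1' WHERE doc_id = 42;
COMMIT;

-- Editor 2: the same SELECT ... FOR UPDATE blocks until editor 1 commits,
-- so editor 2 always starts from the latest committed copy instead of a stale one.
SELECT body FROM documents WHERE doc_id = 42 FOR UPDATE;
UPDATE documents SET body = 'version saved by editor 2' WHERE doc_id = 42;
COMMIT;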

• Uncommitted Dependency (Dirty Read)

Uncommitted dependency occurs when a second transaction selects a row that is being updated by another transaction. The second transaction is reading data that has not been committed yet and may be changed by the transaction updating the row.

• For example,

An editor is making changes to an electronic document. During the changes, a second editor takes a copy of the document that includes all the changes made so far, and distributes the document to the intended audience. The first editor then decides the changes made so far are wrong and removes the edits and saves the document. The distributed document contains edits that no longer exist, and should be treated as if they never existed. This problem could be avoided if no one could read the changed document until the first editor determined that the changes were final.


• Inconsistent Analysis (Non-repeatable Read)

Inconsistent analysis occurs when a second transaction accesses the same row several times and reads different data each time. Inconsistent analysis is similar to uncommitted dependency in that another transaction is changing the data that a second transaction is reading. However, in inconsistent analysis, the data read by the second transaction was committed by the transaction that made the change. Also, inconsistent analysis involves multiple reads (two or more) of the same row and each time the information is changed by another transaction; thus, the term non-repeatable read.

• For example, an editor reads the same document twice, but between each reading, the writer rewrites the document. When the editor reads the document for the second time, it has changed. The original read was not repeatable. This problem could be avoided if the editor could read the document only after the writer has finished writing it.

• Phantom Reads

Phantom reads occur when an insert or delete action is performed against a row that belongs to a range of rows being read by a transaction. The transaction's first read of the range of rows shows a row that no longer exists in the second or succeeding read, as a result of a deletion by a different transaction. Similarly, as the result of an insert by a different transaction, the transaction's second or succeeding read shows a row that did not exist in the original read.

• For example, an editor makes changes to a document submitted by a writer, but when the changes are incorporated into the master copy of the document by the production department, they find that new unedited material has been added to the document by the author. This problem could be avoided if no one could add new material to the document until the editor and production department finish working with the original document.
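Dirty reads, non-repeatable reads and phantom reads are exactly the anomalies against which the SQL standard transaction isolation levels are defined. Below is a sketch of how a session would request them; keyword support and default levels vary by product (Oracle, for example, does not offer READ UNCOMMITTED).

-- Prevents dirty reads:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Additionally prevents non-repeatable reads:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Additionally prevents phantom reads:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;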


Concurrency Control

• Lock-Based Protocols

• Timestamp-Based Protocols

• Validation-Based Protocols

• Multiple Granularity

• Multiversion Schemes

• Insert and Delete Operations


Lock-Based Protocols

• A lock is a mechanism to control concurrent access to a data item

• Data items can be locked in two modes :

1. Exclusive (X) mode: the data item can be both read and written. An X-lock is requested using the lock-X instruction.

2. Shared (S) mode: the data item can only be read. An S-lock is requested using the lock-S instruction.

• Lock requests are made to the concurrency-control manager. A transaction can proceed only after the request is granted.
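In SQL these two modes surface, for example, as explicit table locks. The statements below are an Oracle-style sketch and the accounts table is hypothetical; row-level locks are usually taken implicitly by UPDATE/DELETE or by SELECT ... FOR UPDATE.

-- Shared (S) mode: other transactions may still read the table and take S locks of their own.
LOCK TABLE accounts IN SHARE MODE;

-- Exclusive (X) mode: other transactions can neither lock nor modify the table until COMMIT or ROLLBACK.
LOCK TABLE accounts IN EXCLUSIVE MODE;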

Lock-Based Protocols (Cont.)

• Lock-compatibility matrix:

            S               X
   S        compatible      not compatible
   X        not compatible  not compatible

A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions

Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on that item.

If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.

Lock-Based Protocols (Cont.)

• Example of a transaction performing locking:

T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)

• Locking as above is not sufficient to guarantee serializability — if A and B get updated in between the read of A and B, the displayed sum would be wrong.

• A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.

Pitfalls of Lock-Based Protocols

• Consider a partial schedule in which T3 holds a lock on B and requests lock-X(A), while T4 holds a lock on A and requests lock-S(B).

Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.

Such a situation is called a deadlock. To handle a deadlock one of T3 or T4 must be rolled back and its locks released.
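The same deadlock shows up in plain SQL whenever two sessions lock the same rows in opposite orders. A sketch with a hypothetical accounts table:

-- Session 1:
UPDATE accounts SET bal = bal - 10 WHERE acct_id = 'A';   -- takes an X-lock on row A

-- Session 2:
UPDATE accounts SET bal = bal - 10 WHERE acct_id = 'B';   -- takes an X-lock on row B

-- Session 1:
UPDATE accounts SET bal = bal + 10 WHERE acct_id = 'B';   -- waits for session 2

-- Session 2:
UPDATE accounts SET bal = bal + 10 WHERE acct_id = 'A';   -- waits for session 1: deadlock;
                                                          -- the DBMS detects the cycle and rolls one victim back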

Pitfalls of Lock-Based Protocols (Cont.)

• The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.

• Starvation is also possible if the concurrency-control manager is badly designed. For example:

  • A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item.

  • The same transaction is repeatedly rolled back due to deadlocks.

• The concurrency-control manager can be designed to prevent starvation.


The Two-Phase Locking Protocol

• This is a protocol which ensures conflict-serializable schedules.

• Phase 1: Growing Phase
  • transaction may obtain locks
  • transaction may not release locks

• Phase 2: Shrinking Phase
  • transaction may release locks
  • transaction may not obtain locks

The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).
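In practice, most SQL databases give you (rigorous) two-phase locking behaviour implicitly: locks are acquired as statements run and are released only at COMMIT or ROLLBACK. A sketch against a hypothetical accounts table:

-- Growing phase: each statement acquires locks; none are released yet.
SELECT bal FROM accounts WHERE acct_id = 'A' FOR UPDATE;   -- X-lock on row A
SELECT bal FROM accounts WHERE acct_id = 'B' FOR UPDATE;   -- X-lock on row B

UPDATE accounts SET bal = bal - 10 WHERE acct_id = 'A';
UPDATE accounts SET bal = bal + 10 WHERE acct_id = 'B';

-- Shrinking phase: all locks are released in one step.
COMMIT;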


The Two-Phase Locking Protocol (Cont.)

• Two-phase locking does not ensure freedom from deadlocks.

• Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.

• Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.

• There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.

• However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:

Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.

Multiple Granularity

• Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation. It is the extent to which a larger entity is subdivided. For example, a yard broken into inches has finer granularity than a yard broken into feet.

• Allow data items to be of various sizes and define a hierarchy of data granularities, where the small granularities are nested within larger ones.

• Can be represented graphically as a tree (but don't confuse with tree-locking protocol).

• When a transaction locks a node in the tree explicitly, it implicitly locks all the node's descendents in the same mode.

Example of Granularity Hierarchy

The levels, starting from the coarsest (top) level, are:

database
area
file
record

Intention Lock Modes

• In addition to S and X lock modes, there are three additional lock modes with multiple granularity:

  • intention-shared (IS): indicates explicit locking at a lower level of the tree, but only with shared locks.

  • intention-exclusive (IX): indicates explicit locking at a lower level with exclusive or shared locks.

  • shared and intention-exclusive (SIX): the subtree rooted by that node is locked explicitly in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks.

• Intention locks allow a higher-level node to be locked in S or X mode without having to check all descendent nodes.


Deadlock Handling

• A deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.

• In an operating system, a deadlock is a situation which occurs when a process or thread enters a waiting state because a requested resource is being held by another waiting process, which in turn is waiting for another resource. If a process is unable to change its state indefinitely because the resources requested by it are being used by another waiting process, then the system is said to be in a deadlock.

A system is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set.


Deadlock Handling

• Deadlock prevention protocols ensure that the system will never enter into a deadlock state.

• Some prevention strategies:

  • Require that each transaction locks all its data items before it begins execution (predeclaration).

  • Impose a partial ordering of all data items and require that a transaction can lock data items only in the order specified by the partial order (graph-based protocol).

• Timeout-based schemes:

  • A transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.

  • Thus deadlocks are not possible.

  • Simple to implement; but starvation* is possible. It is also difficult to determine a good value for the timeout interval.

* Starvation: a transaction is repeatedly rolled back or kept waiting indefinitely while other transactions make progress.
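Oracle-style SQL exposes a per-statement version of the timeout idea on explicit row locks (NOWAIT and WAIT n are real Oracle options; the accounts table is hypothetical):

-- Fail immediately if another transaction already holds the lock.
SELECT bal FROM accounts WHERE acct_id = 'A' FOR UPDATE NOWAIT;

-- Wait at most 5 seconds for the lock; on timeout an error is raised and the
-- application can roll back and retry, as in the timeout-based scheme above.
SELECT bal FROM accounts WHERE acct_id = 'A' FOR UPDATE WAIT 5;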

Deadlock Detection

• After a deadlock is detected, it can be corrected by using one of the following methods:

• Process termination: one or more processes involved in the deadlock may be aborted. We can choose to abort all processes involved in the deadlock; this ensures that the deadlock is resolved with certainty and speed, but the expense is high as partial computations will be lost. Or, we can choose to abort one process at a time until the deadlock is resolved. This approach has high overhead, because after each abort an algorithm must determine whether the system is still in deadlock. Several factors must be considered while choosing a candidate for termination, such as the priority and age of the process.

• Resource Preemption: Resources allocated to various processes may be successively preempted and allocated to other processes until the deadlock is broken.

Deadlock Detection (Cont.)

(Figures: a wait-for graph without a cycle, and a wait-for graph with a cycle.)

Deadlock Recovery

• When a deadlock is detected:

  • Some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the transaction that will incur minimum cost.

  • Rollback: determine how far to roll back the transaction.

    • Total rollback: abort the transaction and then restart it.

    • It is more effective to roll back the transaction only as far as necessary to break the deadlock.

  • Starvation happens if the same transaction is always chosen as victim. Include the number of rollbacks in the cost factor to avoid starvation.


Other Approaches to Concurrency Control

• Timestamp-Based Protocols

• Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti) < TS(Tj).

• The protocol manages concurrent execution such that the time-stamps determine the serializability order.

• In order to assure such behavior, the protocol maintains for each data item Q two timestamp values:

  • W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q) successfully.

  • R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q) successfully.


Timestamp-Based Protocols

• The timestamp ordering protocol ensures that any conflicting read and write operations are executed in timestamp order.

• Suppose a transaction Ti issues a read(Q)

1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.

2. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to max(R-timestamp(Q), TS(Ti)).

• Suppose that transaction Ti issues write(Q).

1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously, and the system assumed that that value would never be produced. Hence, the write operation is rejected, and Ti is rolled back.

2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, this write operation is rejected, and Ti is rolled back.

3. Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).

Correctness of Timestamp-Ordering Protocol

• The timestamp-ordering protocol guarantees serializability, since all the arcs in the precedence graph are of the form: transaction with smaller timestamp → transaction with larger timestamp. Thus, there will be no cycles in the precedence graph.

• The timestamp protocol ensures freedom from deadlock, as no transaction ever waits.

• But the schedule may not be cascade-free, and may not even be recoverable.

DCL commands are used to enforce database security in a multi-user database environment. Two types of DCL commands are GRANT and REVOKE. Only the database administrator or the owner of a database object can grant or revoke privileges on that object.

• SQL GRANT Command

SQL GRANT is a command used to provide access or privileges on database objects to users.

The syntax for the GRANT command is:

GRANT privilege_name
ON object_name
TO {user_name | PUBLIC | role_name}
[WITH GRANT OPTION];


• SQL GRANT Command

• privilege_name is the access right or privilege granted to the user. Some of the access rights are ALL, EXECUTE, and SELECT.

• object_name is the name of a database object like TABLE, VIEW, STORED PROC and SEQUENCE.

• user_name is the name of the user to whom an access right is being granted.

• PUBLIC is used to grant access rights to all users.

• ROLES are a set of privileges grouped together.

• WITH GRANT OPTION allows a user to grant access rights to other users.


SQL GRANT Command

• For Example:

GRANT SELECT ON employee TO user1;

This command grants SELECT permission on the employee table to user1. Use WITH GRANT OPTION carefully: for example, if you GRANT SELECT privilege on the employee table to user1 using WITH GRANT OPTION, then user1 can GRANT SELECT privilege on the employee table to another user, such as user2. Later, if you REVOKE the SELECT privilege on employee from user1, user2 will still have SELECT privilege on the employee table.
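A short sketch of that grant chain, reusing the user1 and user2 names from the example above:

-- Owner or DBA grants the privilege with the grant option:
GRANT SELECT ON employee TO user1 WITH GRANT OPTION;

-- Connected as user1, the privilege can now be passed on:
GRANT SELECT ON employee TO user2;

-- Later, the original grantor revokes user1's privilege; what happens to user2's copy
-- depends on the product (Oracle cascades object-privilege revokes, other systems may not):
REVOKE SELECT ON employee FROM user1;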

SQL REVOKE Command:

• The REVOKE command removes a user's access rights or privileges on database objects.

• The syntax for the REVOKE command is:

REVOKE privilege_name
ON object_name
FROM {user_name | PUBLIC | role_name};

For example:

REVOKE SELECT ON employee FROM user1;

This command will REVOKE the SELECT privilege on the employee table from user1. When you REVOKE the SELECT privilege on a table from a user, the user will no longer be able to SELECT data from that table. However, if the user has received SELECT privileges on that table from more than one user, he or she can SELECT from that table until everyone who granted the permission revokes it. You cannot REVOKE privileges if they were not initially granted by you.

Privileges and Roles

Privileges: a privilege defines the access rights provided to a user on a database object. There are two types of privileges.

1) System privileges - these allow the user to CREATE, ALTER, or DROP database objects.

2) Object privileges - these allow the user to EXECUTE, SELECT, INSERT, UPDATE, or DELETE data from database objects to which the privileges apply.

A few CREATE system privileges are listed below:

System Privilege      Description
CREATE object         Allows users to create the specified object in their own schema.
CREATE ANY object     Allows users to create the specified object in any schema.


The above rules also apply for ALTER and DROP system privileges.

A few of the object privileges are listed below:

Object Privilege      Description
INSERT                Allows users to insert rows into a table.
SELECT                Allows users to select data from a database object.
UPDATE                Allows users to update data in a table.
EXECUTE               Allows users to execute a stored procedure or a function.
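A sketch of granting both kinds of privilege, reusing user1 and the employee table from the earlier examples; calc_payroll is a hypothetical stored procedure:

-- System privileges: the right to create objects.
GRANT CREATE TABLE TO user1;
GRANT CREATE ANY TABLE TO user1;

-- Object privileges: rights on specific, existing objects.
GRANT SELECT, INSERT, UPDATE ON employee TO user1;
GRANT EXECUTE ON calc_payroll TO user1;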

Roles: roles are a collection of privileges or access rights. When there are many users in a database it becomes difficult to grant or revoke privileges to each user individually. Therefore, if you define roles, you can grant or revoke privileges to the role, and users who are granted the role automatically gain or lose those privileges. You can either create roles or use the system roles predefined by Oracle.

Some of the privileges granted to the system roles are as given below:

System Role    Privileges Granted to the Role
CONNECT        CREATE TABLE, CREATE VIEW, CREATE SYNONYM, CREATE SEQUENCE, CREATE SESSION, etc.
RESOURCE       CREATE PROCEDURE, CREATE SEQUENCE, CREATE TABLE, CREATE TRIGGER, etc. The primary usage of the RESOURCE role is to restrict access to database objects.
DBA            ALL SYSTEM PRIVILEGES


• Creating Roles:

The syntax to create a role is:

CREATE ROLE role_name
[IDENTIFIED BY password];

It is easier to GRANT or REVOKE privileges to users through a role than to assign a privilege directly to every user. If a role is identified by a password, then, when you GRANT or REVOKE privileges to the role, you have to identify it with the password.
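A sketch of the whole role workflow; testing_role is a hypothetical role name, and user1/user2 are the users from the earlier examples:

-- Create the role (optionally protected by a password):
CREATE ROLE testing_role;

-- Grant privileges to the role instead of to each user:
GRANT SELECT, INSERT ON employee TO testing_role;

-- Grant the role to users; they inherit every privilege the role holds:
GRANT testing_role TO user1;
GRANT testing_role TO user2;

-- Revoking from the role later removes the privilege from all users who hold the role:
REVOKE INSERT ON employee FROM testing_role;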

Database Recovery

There are many situations in which a transaction may not reach a commit or abort point:

• An operating system crash can terminate the DBMS processes
• The DBMS can crash
• The process issuing the transaction can abnormally terminate
• The system might lose power
• A disk may fail or other hardware may fail
• Human error can result in deletion of critical data

In any of these situations, data in the database may become inconsistent or lost.


Two general classes of failures:

• Process or Processor failures in which no data is lost but the database must be restarted and recovered.

• Storage failure wherein data on disk is lost or corrupted.

In such cases, data must be restored from backup and then the database must be restarted and recovered.

Database Recovery is the process of restoring the database and the data to a consistent state. This may include restoring lost data up to the point of the event (e.g. system crash).

There are many different approaches to recover a database. Here we discuss Manual Reprocessing, and a variety of Automated recovery techniques.

Manual Reprocessing

• In a manual reprocessing approach, the database is periodically backed up (a database save) and all transactions applied since the last save are recorded.

• If the system crashes, the latest database save is restored and all of the transactions are re-applied (by users) to bring the database back up to the point just before the crash.

• Several shortcomings:

  • Time required to re-apply transactions
  • Transactions might have other (physical) consequences
  • Re-applying concurrent transactions is not straightforward


Automated Recovery

There are several types of Automated Recovery techniques including: Deferred update and immediate update, shadow paging, etc.

Recall, a transaction can be in one of the following states:

• Active - when the transaction just begins

• Partially committed - after the last operation has completed (but before the commit point is reached)

• Failed - normal operation is prevented (e.g., in the event of a system crash)

• Aborted - the transaction is rolled back; that is, all of its effects are undone

• Committed - the transaction completes all operations and moves the database to the next consistent state

Deferred Update Recovery

• Also called NO-UNDO/REDO. During a transaction, only record the changes to data items in the log. When the transaction commits, actually update the data items on disk. Two main rules:

  • A transaction cannot change any items in the database until it commits.

  • A transaction may not commit until all of its write operations are successfully recorded in the log. This means that we must check that the log is actually written to disk.

Suppose, for example, that T1 and T4 committed while T2 and T3 did not. The changes of T1 and T4 were therefore written to disk, the changes of T2 and T3 were not, and to recover we simply ignore the transactions that did not commit.

Advantages: recovery is made easier. Any transaction that reached the commit point (according to the log) has its writes applied to the database (REDO); all other transactions are ignored. Cascading rollback does not occur, because no transaction sees the work of another until it is committed (no stale reads).


Immediate Update Recovery

• Immediate Update applies the write operations to the database as the transaction is executing (as opposed to at the end as with deferred update).

Writes are still saved in the log before being applied to the database - a write-ahead log (WAL)

• Maintain two logs:

  • REDO log: a record of each new data item in the database.
  • UNDO log: a record of each updated data item (old values).

• Two rules:

  • Transaction T may not update the database until all UNDO entries have been written to the UNDO log.
  • Transaction T is not allowed to commit until all REDO and UNDO log entries are written (force-written to disk).

• To recover:

  • Begin at the end of the log and read backwards to the last checkpoint. Create two lists: C - transactions that have committed; NC - transactions that did not commit.

  • For each transaction T in C, make a third list of transactions, RB, that have read items that were updated by NC transactions during T.

  • UNDO: undo the effects of the transactions in NC and RB.

  • REDO: apply the effects of all of the transactions that did commit (all T in list C).

Advantages:

• Immediate update allows higher concurrency, because transactions write continuously to the database rather than waiting until the commit point.

Disadvantages:

• Step 2 above can lead to cascading rollbacks, which is time consuming and may be problematic.


OLAP (online analytical processing)

• OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view.

For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period.

To facilitate this kind of analysis, OLAP data is stored in a multidimensional database. Whereas a relational database can be thought of as two-dimensional, a multidimensional database considers each data attribute (such as product, geographic sales region, and time period) as a separate "dimension." OLAP software can locate the intersection of dimensions (all products sold in the Eastern region above a certain price during a certain time period) and display them. Attributes such as time periods can be broken down into sub-attributes.

• OLAP can be used for data mining, or the discovery of previously undiscerned relationships between data items. An OLAP database does not need to be as large as a data warehouse, since not all transactional data is needed for trend analysis. Using Open Database Connectivity (ODBC), data can be imported from existing relational databases to create a multidimensional database for OLAP.

• Online Analytical Processing is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, they provide time series and trend analysis views. OLAP is often used in data mining.

• The chief component of OLAP is the OLAP server, which sits between a client and a database management system (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.


OLTP vs. OLAP

• OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. In an OLTP database there is detailed and current data, and the schema used to store transactional data is the entity model (usually 3NF).

• OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques. In an OLAP database there is aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).

The following table summarizes the main differences between an OLTP system (Online Transaction Processing, the operational system) and an OLAP system (Online Analytical Processing, the data warehouse).

Source of data
  OLTP: Operational data; OLTPs are the original source of the data.
  OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger, due to the existence of aggregation structures and historical data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized, with many tables.
  OLAP: Typically de-normalized, with fewer tables; use of star and/or snowflake schemas.

Backup and recovery
  OLTP: Back up religiously; operational data is critical to running the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.


What Is a Data Warehouse?

• A data warehouse is a relational or multidimensional database that is designed for query and analysis. Data warehouses are not optimized for transaction processing, which is the domain of OLTP systems. Data warehouses usually consolidate historical and analytic data derived from multiple sources. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources.

• A data warehouse usually stores many months or years of data to support historical analysis. The data in a data warehouse is typically loaded through an extraction, transformation, and loading (ETL) process from one or more data sources such as OLTP applications, mainframe applications, or external data providers.

• Users of the data warehouse perform data analyses that are often time-related. Examples include consolidation of last year's sales figures, inventory analysis, and profit by product and by customer


The Key Characteristics of a Data Warehouse

• The key characteristics of a data warehouse are as follows:

• Some data is denormalized for simplification and to improve performance

• Large amounts of historical data are used

• Queries often retrieve large amounts of data

• Both planned and ad hoc queries are common

• The data load is controlled

In general, fast query performance with high data throughput is the key to a successful data warehouse.


Common Oracle Data Warehousing Tasks

• As an Oracle data warehousing administrator or designer, you can expect to be involved in the following tasks:

• Configuring an Oracle database for use as a data warehouse
• Designing data warehouses
• Performing upgrades of the database and data warehousing software to new releases
• Managing schema objects, such as tables, indexes, and materialized views
• Managing users and security
• Developing routines used for the extraction, transformation, and loading (ETL) processes
• Creating reports based on the data in the data warehouse
• Backing up the data warehouse and performing recovery when necessary
• Monitoring the data warehouse's performance and taking preventive or corrective action as required

In a small-to-midsize data warehouse environment, you might be the sole person performing these tasks. In large, enterprise environments, the job is often divided among several DBAs and designers, each with their own specialty, such as database security or database tuning.


Characteristics of Data Warehouses

• Data warehouses have several distinguishing characteristics.

• These systems combine data from multiple sources. Operational systems such as ERP systems provide production data, financial systems supply revenue and expense data, and human resource systems present employee data.

• The data copied into a data warehouse does not change (except to correct errors). The data warehouse is a historical record of the state of an organization. The frequent changes of the source OLTP systems are reflected in the data warehouse by adding new data, not by changing existing data.

• Data warehouses are subject oriented, that is, they focus on measuring entities, such as sales, inventory, and quality. OLTP systems, by contrast, are function oriented and focus on operations such as order fulfillment.

• In data warehouses, data from distinct function-oriented systems is integrated to provide a single view of an operational entity.

• Data warehouses are designed for business users, not database programmers, so they are easy to understand and query.

• Data warehouses are designed for speed. Users interact with data warehouses—running queries, analyzing data, formulating new queries based on information gleaned from earlier queries—so the systems must be responsive.


Metadata

• Metadata is "data about data". The term is ambiguous, as it is used for two

fundamentally different concepts (types). Structural metadata is about the

design and specification of data structures and is more properly called "data

about the containers of data"; descriptive metadata, on the other hand, is about

individual instances of application data, the data content.

Metadata (metacontent) are defined as the data providing information about one or more aspects of the data, such as:

• Means of creation of the data• Purpose of the data• Time and date of creation• Creator or author of the data• Location on a computer network where the data were created• Standards used


Metadata types

While the application of metadata is manifold, covering a large variety of fields, there are specialized and well-accepted models to specify types of metadata.

Structural metadata are used to describe the structure of database objects such as tables, columns, keys and indexes.

Guide metadata are used to help humans find specific items and are usually expressed as a set of keywords in a natural language.

Technical metadata correspond to internal metadata, and business metadata to external metadata. Kimball adds a third category, named process metadata.

On the other hand, NISO distinguishes among three types of metadata: descriptive, structural and administrative.

Descriptive metadata are the information used to search and locate an object, such as title, author, subjects, keywords, publisher.

Structural metadata give a description of how the components of the object are organised.

Administrative metadata refer to technical information, including file type. Two sub-types of administrative metadata are rights management metadata and preservation metadata.


OLAP Metadata Model

• The basic data model in a relational database is a table composed of one or more columns of data. All of the data is stored in columns. In contrast, the basic data model for multidimensional analysis is a cube, which is composed of measures, dimensions, and attributes.

• The OLAP Catalog metadata informs applications about the data that is available within the database. The application can then define multidimensional objects to represent that data. When the application runs, it instantiates these objects and populates them with data.


Mapping Data Objects to Metadata Objects

The objects comprising a data warehouse and Oracle OLAP metadata use different data structures. The data objects in your data warehouse are represented to the OLAP metadata catalog in the following relational objects, regardless of whether the data is actually stored in relational tables or workspace variables:

• Fact tables or views
• Level-based dimension tables or views

Oracle OLAP metadata catalog maps the data warehouse schema to these multidimensional data objects:

• Measures
• Dimensions
• Dimension attributes
• Levels
• Level attributes
• Hierarchies
• Cubes
• Measure folders


Measures

Measures are the same as facts. The term "fact" is typically used in relational databases, and the term "measure" is typically used in multidimensional applications.

Measures are thus located in fact tables. A fact table has columns that store measures (or facts) and foreign key columns that create the association with dimension tables.

Measures contain the data that you wish to analyze, such as Sales or Cost. OLAP Catalog metadata requires that a column have a numerical or a date data type to be identified as a measure. Most frequently, a measure is numerical and additive.


Dimensions

• Dimensions identify and categorize your data. Dimension members are stored in a dimension table. Each column represents a particular level in a hierarchy. In a star schema, the columns are all in the same table; in a snowflake schema, the columns are in separate tables for each level.

• Because measures are typically multidimensional, a single value in a measure must be qualified by a member of each dimension to be meaningful.

For example, the unit_cost measure has two dimensions: products_dim and times_dim. A value of unit_cost (21.60) is only meaningful when it is qualified by a specific product code (1575) and a time period (28-jan-1998).
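A hedged SQL sketch of that qualification, assuming the measure is stored in a fact table unit_cost_fact with foreign keys to products_dim and times_dim (the fact table and column names are assumptions; the dimension names come from the example above):

SELECT f.unit_cost
FROM   unit_cost_fact f
JOIN   products_dim p ON p.product_id = f.product_id
JOIN   times_dim    t ON t.time_id    = f.time_id
WHERE  p.product_id = 1575                  -- the product dimension member
AND    t.time_id    = DATE '1998-01-28';    -- the time dimension member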


Attributes

Attributes provide descriptive information about the data and are typically used for display.

Level Attributes

Level attributes provide supplementary information about the dimension members at a particular level of a dimension hierarchy. The dimension members themselves may be meaningless, such as a value of "1296" for a time period. These cryptic values for dimension members are used internally for selecting and sorting quickly, but are meaningless to users.

For example:

You might have columns for employee number (ENUM), last name (LAST_NAME), first name (FIRST_NAME), and telephone extension (TELNO). ENUM is the best choice for a level column, since it is a key column and its values uniquely identify the employees. ENUM also has a NUMBER data type, which makes it more efficient than a text column for the creation of indexes. LAST_NAME, FIRST_NAME, and TELNO are attributes. Even though they are dimensioned by ENUM, they do not make suitable measures because they are descriptive text rather than business measurements


Dimension Attributes

• Dimension attributes specify groupings of level attributes for a specific dimension. Whereas level attributes map to specific data values, dimension attributes are purely logical metadata objects.

• An example of a dimension attribute is end date, which is required for time dimensions. If a time dimension has month, quarter, and year levels, end date identifies the last date of each month, each quarter, and each year. Within a relational schema, the three level attributes that make up the end date dimension attribute would be stored in columns with names like month_end_date, quarter_end_date, and year_end_date.


Measure Folders

• Measures can be organized within measure folders, which facilitate the browsing of data by business area. Measure folders are also known as catalogs.

• Whereas dimensions and measures are associated with the schemas that contain their source data, measure folders are schema independent. Each OLAP client can view all measure folders defined within the Oracle instance.


ETL - Extract, Transform, Load

• ETL is short for extract, transform, load: three database functions that are combined into one tool to pull data out of one database and place it into another database.

• Extract is the process of reading data from a database.

• Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.

• Load is the process of writing the data into the target database.

• ETL is used to migrate data from one database to another, to form data marts and data warehouses, and also to convert databases from one format or type to another.


ETL - Extract, Transform, Load

• In managing databases, extract, transform, load (ETL) refers to three separate functions combined into a single programming tool. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data - using rules or lookup tables, or creating combinations with other data - to convert it to the desired state. Finally, the load function is used to write the resulting data (either all of the subset or just the changes) to a target database, which may or may not previously exist.

• ETL can be used to acquire a temporary subset of data for reports or other purposes, or a more permanent data set may be acquired for other purposes such as: the population of a data mart or data warehouse; conversion from one database type to another; and the migration of data from one database or platform to another
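A minimal sketch of the three steps written directly in SQL. The staging table orders_stage, the database link source_db and the target star-schema tables (reused from the earlier star-schema sketch) are all assumptions:

-- Extract: read the desired subset from the source system into a staging table.
INSERT INTO orders_stage (order_date, store_name, product_name, amount)
SELECT order_date, store_name, product_name, amount
FROM   orders@source_db                 -- Oracle-style database link to the OLTP source
WHERE  order_date >= DATE '2015-01-01';

-- Transform and Load: look up surrogate keys in the dimension tables (a lookup-table
-- transformation), apply simple rules, and write the rows into the target fact table.
INSERT INTO sales_fact (time_id, store_id, product_id, sales_amount)
SELECT t.time_id,
       st.store_id,
       p.product_id,
       ROUND(o.amount, 2)
FROM   orders_stage o
JOIN   time_dim    t  ON t.calendar_dt  = TRUNC(o.order_date)
JOIN   store_dim   st ON st.store_name  = o.store_name
JOIN   product_dim p  ON p.product_name = o.product_name;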

Dimensional modeling

• Dimensional modeling (DM) is the name of a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship (ER) modeling. Dimensional modeling does not necessarily involve a relational database; the same modeling approach, at the logical level, can be used for any physical form, such as a multidimensional database or even flat files.

• DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance.

• While ER modeling is very useful for transaction capture, it should be avoided for end-user delivery.

• Dimensional modeling always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact; timestamp, product, register#, store#, etc. are elements of dimensions.

• Dimensional models are built by business process area, e.g. store sales, inventory, claims, etc. Because the different business process areas share some but not all dimensions, efficiency in design, operation, and consistency is achieved by using conformed dimensions, i.e. using one copy of the shared dimension across subject areas.

Dimensional modeling process

• The dimensional model is built on a star-like schema, with dimensions surrounding the fact table.

To build the schema, the following design model is used:

• Choose the business process
• Declare the grain
• Identify the dimensions
• Identify the facts

Choose the business process –

The process of dimensional modeling builds on a 4-step design method that helps to ensure the usability of the dimensional model and the use of the data warehouse. The basics in the design build on the actual business process which the data warehouse should cover. Therefore the first step in the model is to describe the business process which the model builds on. This could for instance be a sales situation in a retail store. To describe the business process, one can choose to do this in plain text or use basic Business Process Modeling Notation (BPMN) or other design guides like the Unified Modeling Language (UML).


Dimensional modeling process

• Declare the grain

After describing the Business Process, the next step in the design is to declare the grain of the model. The grain of the model is the exact description of what the dimensional model should be focusing on. This could for instance be “An individual line item on a customer slip from a retail store”. To clarify what the grain means, you should pick the central process and describe it with one sentence. Furthermore the grain (sentence) is what you are going to build your dimensions and fact table from. You might find it necessary to go back to this step to alter the grain due to new information gained on what your model is supposed to be able to deliver.

• Identify the dimensions

The third step in the design process is to define the dimensions of the model. The dimensions must be defined within the grain from the second step of the 4-step process. Dimensions are the foundation of the fact table, and are where the data for the fact table is collected. Typically dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data is stored. For example, the date dimension could contain data such as year, month and weekday.


Dimensional modeling process

• Identify the facts

After defining the dimensions, the next step in the process is to make keys for the fact table. This step is to identify the numeric facts that will populate each fact table row. This step is closely related to the business users of the system, since this is where they get access to data stored in the data warehouse. Therefore most of the fact table rows are numerical, additive figures such as quantity or cost per unit, etc.


Dimensional modeling process

The benefits of dimensional modeling are the following:

• Understandability - Compared to the normalized model, the dimensional model is easier to understand and more intuitive. In dimensional models, information is grouped into coherent business categories or dimensions, making it easier to read and interpret. Simplicity also allows software to navigate databases efficiently. In normalized models, data is divided into many discrete entities and even a simple business process might result in dozens of tables joined together in a complex way.

• Query performance - Dimensional models are more denormalized and optimized for data querying, while normalized models seek to eliminate data redundancies and are optimized for transaction loading and updating. The predictable framework of a dimensional model allows the database to make strong assumptions about the data that aid in performance. Each dimension is an equivalent entry point into the fact table, and this symmetrical structure allows effective handling of complex queries. Query optimization for star join databases is simple, predictable, and controllable.

• Extensibility - Dimensional models are extensible and easily accommodate unexpected new data. Existing tables can be changed in place either by simply adding new data rows into the table or executing SQL alter table commands. No queries or other applications that sit on top of the Warehouse need to be reprogrammed to accommodate changes. Old queries and applications continue to run without yielding different results. But in normalized models each modification should be considered carefully, because of the complex dependencies between database tables.

Page 73: Copy of sec d (2)

Fact Tables, Dimension Tables

• In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is located at the center of a star schema or a snowflake schema surrounded by dimension tables. Where multiple fact tables are used, these are arranged as a fact constellation schema.

• A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures, such as additive, non-additive, and semi-additive measures.

• Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed.

• Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).
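To make that grain concrete (table and key names below are illustrative assumptions, and dim_store is a hypothetical dimension), each row can be enforced as unique per day, product, and store, while an affiliate dimension such as region is reached through the store dimension and only yields coarser slices:

-- Daily sales fact table whose composite key enforces the stated grain.
CREATE TABLE fact_daily_sales (
    date_key     INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    store_key    INTEGER NOT NULL,
    sales_volume INTEGER NOT NULL,
    PRIMARY KEY (date_key, product_key, store_key)
);

-- Rolling the same facts up to the affiliate region level.
SELECT st.region, SUM(f.sales_volume) AS total_volume
FROM fact_daily_sales f
JOIN dim_store st ON st.store_key = f.store_key
GROUP BY st.region;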

Page 74: Copy of sec d (2)

Types of fact tables

There are three fundamental measurement events, which characterize all fact tables.

• Transactional

A transactional table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt. Typically a transactional fact table holds data of the most detailed level, causing it to have a great number of dimensions associated with it.

• Periodic snapshots

The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output (a brief sketch of both snapshot types follows the accumulating-snapshot description below).

• Accumulating snapshots

This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it's important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.
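Two hedged SQL sketches of the snapshot types above, reusing the hypothetical retail tables from earlier; fact_sales_monthly, the salesperson_key column, and the specific key values are assumptions made for the example only:

-- Periodic snapshot: one row per salesman per month, derived from the
-- detailed transactional fact table.
INSERT INTO fact_sales_monthly (calendar_year, calendar_month, salesperson_key,
                                total_quantity, total_sales)
SELECT d.calendar_year, d.calendar_month, f.salesperson_key,
       SUM(f.quantity_sold), SUM(f.sales_amount)
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.calendar_year, d.calendar_month, f.salesperson_key;

-- Accumulating snapshot: one row per order, one date key per milestone;
-- a reserved key (-1) points at the "unknown date" member until then.
CREATE TABLE fact_order_fulfillment (
    order_key         INTEGER PRIMARY KEY,
    order_date_key    INTEGER NOT NULL,
    ship_date_key     INTEGER NOT NULL DEFAULT -1,
    delivery_date_key INTEGER NOT NULL DEFAULT -1,
    order_amount      DECIMAL(9,2) NOT NULL
);

-- As each milestone is reached, the existing row is updated in place.
UPDATE fact_order_fulfillment
SET ship_date_key = 1234            -- surrogate key of the actual ship date
WHERE order_key = 1001;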

Page 75: Copy of sec d (2)

Dimension table

In data warehousing, a dimension table is one of the set of companion tables to a fact table. The fact table contains business facts (or measures) and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables. Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are typically textual fields (or discrete numbers that behave like text). These attributes are designed to serve two critical purposes: query constraining and/or filtering, and query result set labeling.

Dimension attributes should be:

• Verbose (labels consisting of full words)
• Descriptive
• Complete (having no missing values)
• Discretely valued (having only one value per dimension table row)
• Quality assured (having no misspellings or impossible values)

Page 76: Copy of sec d (2)

• Dimension table rows are uniquely identified by a single key field. It is recommended that this key be a simple integer surrogate key, because the key value is meaningless in itself and is used only for joining the fact and dimension tables (a minimal sketch follows the list of advantages below).

The use of surrogate dimension keys brings several advantages, including

• Performance. Join processing is made much more efficient by using a single field (the surrogate key)

• Buffering from operational key management practices. This prevents situations where removed data rows might reappear when their natural keys get reused or reassigned after a long period of dormancy

• Mapping to integrate disparate sources
• Handling unknown or not-applicable connections
• Tracking changes in dimension attribute values
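To make the idea concrete, here is a hedged sketch of a product dimension with an integer surrogate key, the retained natural key, and a reserved row for unknown connections (all names are illustrative, not taken from any particular system):

-- Product dimension: the surrogate key is used in fact joins, while the
-- operational natural key is kept only as a descriptive attribute.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,     -- surrogate key
    product_code VARCHAR(20) NOT NULL,    -- natural key from the source system
    product_name VARCHAR(100) NOT NULL,
    category     VARCHAR(50) NOT NULL
);

-- A reserved row handles unknown or not-applicable connections.
INSERT INTO dim_product VALUES (-1, 'N/A', 'Unknown product', 'Unknown');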

Page 77: Copy of sec d (2)

The goal of a dimension table is to create standardized, conformed dimensions that can be shared across the enterprise's data warehouse environment, and enable joining to multiple fact tables representing various business processes.

Conformed dimensions are important to the enterprise nature of DW/BI systems because they promote:

• Consistency - Every fact table is filtered consistently, so that query answers are labeled consistently.

• Integration - Queries can drill into different process fact tables separately, then join the results on common dimension attributes (see the sketch after this list).

• Reduced development time to market - The common dimensions are available without having to recreate them.
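A hedged illustration of such a drill-across query, assuming two hypothetical fact tables (fact_sales and fact_inventory, with a quantity_on_hand measure) that share the conformed dim_product dimension from the earlier sketches:

-- Each process fact table is summarized on its own, then the two result
-- sets are joined on the shared (conformed) product attribute.
SELECT s.product_name, s.total_sold, i.total_on_hand
FROM (SELECT p.product_name, SUM(f.quantity_sold) AS total_sold
      FROM fact_sales f
      JOIN dim_product p ON p.product_key = f.product_key
      GROUP BY p.product_name) s
JOIN (SELECT p.product_name, SUM(f.quantity_on_hand) AS total_on_hand
      FROM fact_inventory f
      JOIN dim_product p ON p.product_key = f.product_key
      GROUP BY p.product_name) i
  ON i.product_name = s.product_name;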

Page 78: Copy of sec d (2)

Introduction to Cubes

• Cubes are the main objects in online analytic processing (OLAP), a technology that provides fast access to data in a data warehouse. A cube is a set of data that is usually constructed from a subset of a data warehouse and is organized and summarized into a multidimensional structure defined by a set of dimensions and measures.

• A cube provides an easy-to-use mechanism for querying data with quick and uniform response times. End users use client applications to connect to an Analysis server and query the cubes on the server. In most client applications, end users issue a query on a cube by manipulating the user interface controls, which determine the contents of the query. This spares end users from writing language-based queries. Precalculated summary data called aggregations provides the mechanism for rapid and uniform response times to queries. Aggregations are created for a cube before end users access it. The results of a query are retrieved from the aggregations, the cube's source data in the data warehouse, a copy of this data on the Analysis server, the client cache, or a combination of these sources. An Analysis server can support many different cubes, such as a cube for sales, a cube for inventory, a cube for customers, and so on.
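In relational terms, such an aggregation can be sketched as a pre-computed summary table from which matching queries are answered instead of scanning the detail rows. The statement below is a generic illustration built on the earlier hypothetical tables, not the syntax of any particular Analysis server:

-- Pre-computed aggregation at the month/store level; queries that only
-- need this level can be answered from the summary instead of the detail.
CREATE TABLE agg_sales_month_store AS
SELECT d.calendar_year, d.calendar_month, f.store_key,
       SUM(f.sales_amount)  AS sales_amount,
       SUM(f.quantity_sold) AS quantity_sold
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.calendar_year, d.calendar_month, f.store_key;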

Page 79: Copy of sec d (2)

Introduction to Cubes

• Every cube has a schema, which is the set of joined tables in the data warehouse from which the cube draws its source data. The central table in the schema is the fact table, the source of the cube's measures. The other tables are dimension tables, the sources of the cube's dimensions. For more information about schemas, see Cube Structure.

• A cube is defined by the measures and dimensions that it contains. For example, a cube for sales analysis includes the measures Item_Sale_Price and Item_Cost and the dimensions Store_Location, Product_Line, and Fiscal_Year. This cube enables end users to separate Item_Sale_Price and Item_Cost into various categories by Store_Location, Product_Line, and Fiscal_Year.

• Each cube dimension can contain a hierarchy of levels to specify the categorical breakdown available to end users. For example, the Store_Location dimension includes the level hierarchy: Continent, Country, State_Province, City, Store_Number. Each level in a dimension is of finer granularity than its parent. For example, continents contain countries or regions, and states or provinces contain cities. Similarly, the hierarchy of the Fiscal_Year dimension includes the levels Year, Quarter, Month, and Day.
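As a rough relational illustration of moving between levels (the fact and dimension table layout here, sales_fact and dim_store_location with their keys, is assumed; only the level and measure names come from the example above), drilling down from Country to State_Province is simply a finer grouping:

-- Roll-up: the measure summarized at the Country level.
SELECT loc.Country, SUM(f.Item_Sale_Price) AS total_sales
FROM sales_fact f
JOIN dim_store_location loc ON loc.store_location_key = f.store_location_key
GROUP BY loc.Country;

-- Drill-down: the same measure one level finer, by State_Province.
SELECT loc.Country, loc.State_Province, SUM(f.Item_Sale_Price) AS total_sales
FROM sales_fact f
JOIN dim_store_location loc ON loc.store_location_key = f.store_location_key
GROUP BY loc.Country, loc.State_Province;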

Page 80: Copy of sec d (2)

OLAP cube

• An OLAP cube is a multidimensional database that is optimized for data warehouse and online analytical processing (OLAP) applications.

• An OLAP cube is a method of storing data in a multidimensional form, generally for reporting purposes. In OLAP cubes, data (measures) are categorized by dimensions. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.

• Although it stores data like a traditional database does, an OLAP cube is structured very differently. Databases, historically, are designed according to the requirements of the IT systems that use them. OLAP cubes, however, are used by business users for advanced analytics. Thus, OLAP cubes are designed using business logic and understanding. They are optimized for analytical purposes, so that they can report on millions of records at a time. Business users can query OLAP cubes using plain English.

Page 81: Copy of sec d (2)

Dimension Hierarchies

• A hierarchy defines relationships among a set of attributes that are grouped by levels in the dimension of a cube model. These relationships between levels are usually defined with a functional dependency to improve optimization. Multiple hierarchies can be defined for a dimension of a cube model.

• Hierarchies document the relationships between levels in a dimension. When you define a hierarchy, DB2 Data Warehouse Edition creates a functional dependency by default between consecutive levels, such that the level key attributes of one level are functionally dependent on the level key attributes of the level below it. For example, in a Region hierarchy defined with the levels Region, State, and City, DB2 Data Warehouse Edition creates two functional dependencies: one to show that City determines State, and a second to show that State determines Region.
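Viewed purely relationally (a sketch only; the actual cube-model metadata that DB2 Data Warehouse Edition stores is not shown, and the table and column names are assumptions), the hierarchy can be pictured as level-key columns in one dimension table whose values obey those functional dependencies:

-- City-grain dimension backing the Region > State > City hierarchy;
-- each city determines its state, and each state determines its region.
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,    -- surrogate key
    city         VARCHAR(50) NOT NULL,   -- lowest level
    state        VARCHAR(50) NOT NULL,   -- functionally dependent on city
    region       VARCHAR(50) NOT NULL    -- functionally dependent on state
);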

Page 82: Copy of sec d (2)

An "All" (grand total) level is usually not meaningful for the hierarchies in the following types of dimensions:

Time dimensions - Aggregating your fact data across all of time is rarely useful, unless you know that the fact table contains data for a specific period of time, such as the year 2005.

Scenario dimensions - A scenario dimension describes likely ways of analyzing your data. You might have two members, such as Actual and Budget, in your scenario dimension. In this case, it does not make sense to combine the data values for budget and actual into a combined aggregation.

The hierarchy type describes the relationship among the levels within the hierarchy. The following four hierarchy types are supported:

Page 83: Copy of sec d (2)

Balanced

• A hierarchy with meaningful levels and branches that have a consistent depth. Each level's logical parent is in the level directly above it. A balanced hierarchy can represent time where the meaning and depth of each level, such as Year, Quarter, and Month, is consistent. They are consistent because each level represents the same type of information, and each level is logically equivalent. Figure 1 shows an example of a balanced time hierarchy.

Figure 1. Example of a balanced hierarchy

Page 84: Copy of sec d (2)

Unbalanced

• A hierarchy with levels that have a consistent parent-child relationship but logically inconsistent levels. The hierarchy branches can also have inconsistent depths. An unbalanced hierarchy can represent an organization chart. For example, Figure 2 shows a chief executive officer (CEO) on the top level of the hierarchy and at least two of the people that might branch off below, including the chief operating officer and the executive secretary. The chief operating officer has more people branching off as well, but the executive secretary does not. The parent-child relationships on both branches of the hierarchy are consistent, but the levels of the two branches are not logical equivalents; for example, an executive secretary is not the logical equivalent of a chief operating officer.

Figure 2. Example of an unbalanced hierarchy

Page 85: Copy of sec d (2)

Ragged

A hierarchy in which each level has a consistent meaning, but the branches have inconsistent depths because at least one member attribute in a branch level is unpopulated. A ragged hierarchy can represent a geographic hierarchy in which the meaning of each level, such as city or country, is used consistently, but the depth of the hierarchy varies. Figure 3 shows a geographic hierarchy that has Continent, Country, Province/State, and City levels defined. One branch has North America as the Continent, United States as the Country, California as the Province or State, and San Francisco as the City. However, the hierarchy becomes ragged when one member does not have an entry at all of the levels. For example, another branch has Europe as the Continent, Greece as the Country, and Athens as the City, but has no entry for the Province or State level because this level is not applicable to Greece for the business model in this example. In this example, the Greece and United States branches descend to different depths, creating a ragged hierarchy.

Figure 3. Example of a ragged hierarchy

Page 86: Copy of sec d (2)

Network

A hierarchy in which the order of levels is not specified, but in which levels do have semantic meaning. For example, Figure 4 shows a network hierarchy that describes product attributes such as Color, Size, and PackageType. Because the levels do not have an inherent parent-child relationship, the order of the levels is not important. A widget company might have member entries like white for Color, small for Size, and shrink wrap for PackageType. A second member entry might be red for Color, large for Size, and box for PackageType.

Figure 4. Example of a network hierarchy

Page 87: Copy of sec d (2)

What advantages are there from developing a data warehouse? The three major advantages are:

1. Integrating data from multiple sources;
2. Performing new types of analyses; and
3. Reducing the cost to access historical data.

Other benefits may include:

1. Standardizing data across the organization, a "single version of the truth";
2. Improving turnaround time for analysis and reporting;
3. Sharing data and allowing others to easily access data;
4. Supporting ad hoc reporting and inquiry;
5. Reducing the development burden on IS/IT; and
6. Removing informational processing load from transaction-oriented databases.

Page 88: Copy of sec d (2)

Advantages of data marts

Frequently needed data is easy to access from the database when a client requires it.

Access can be granted to the specific group of users who need to view the data mart, and because the data set is small, query performance is good.

A data mart is easy to create and maintain, since it relates to a specific business area.

Creating a data mart is also far less costly than building a full data warehouse with its much larger storage requirements.

Page 89: Copy of sec d (2)

Difference between Data Warehouse and Data Mart

• A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

How Is It Different from a Data Warehouse?

• A data warehouse, in contrast, deals with multiple subject areas and is typically implemented and controlled by a central organizational unit such as the Corporate Information Technology (IT) group. Often, it is called a central or enterprise data warehouse. Typically, a data warehouse assembles data from multiple source systems.

• Nothing in these basic definitions limits the size of a data mart or the complexity of the decision-support data that it contains. Nevertheless, data marts are typically smaller and less complex than data warehouses; hence, they are typically easier to build and maintain. The following table summarizes the basic differences between a data warehouse and a data mart:

Page 90: Copy of sec d (2)

Difference between Data Warehouse and Data mart

Parameters          | Data Warehouse  | Data Mart
Scope               | Corporate       | Line-of-Business (LoB)
Subjects            | Multiple        | Single subject
Data sources        | Many            | Few
Size (typical)      | 100 GB to TB+   | Less than 100 GB
Implementation time | Months to years | Months

Page 91: Copy of sec d (2)

DATA WAREHOUSE REPORTING TOOLS

REPORTING TOOLS

• connecting to one or multiple databases in a single report

• defining queries on the data warehouse database

• loading millions of records for aggregation and analysis (big data)

• creating simple and complex filters and sorts

• creating pixel-perfect reports with a variety of formatting options

• conditional formatting

• automatic summarization of the measures: average, median, count, sum, and so forth

• displaying pivot tables, simple lists, or summarized tables

• creating sections guided by a dimension

• presenting data in many types of graphs

Page 92: Copy of sec d (2)

DATA WAREHOUSE REPORTING TOOLS

ANALYSIS TOOLS

• clicking on data items to drill down into the details

• one-click graphs

• slicing and dicing – easily switching dimensions

• nesting dimensions

• sorting and filtering

• conditional formatting

• extreme performance – getting answers instantly

• automatic summarization of the measures

Page 93: Copy of sec d (2)

Query and Reporting Tools

The data warehouse is accessed using an end-user query and reporting tool from Business Objects. Business Objects provides several tools to securely access the data warehouse or personal data files with a point-and-click interface, including the following:

• BusinessObjects (Reporter and Explorer) - a Microsoft Windows-based query and reporting tool.

• InfoView - a web-based tool that allows reports to be refreshed on demand (but cannot create new reports).

• InfoBurst - a web-based server tool that allows reports to be refreshed, scheduled, and distributed. It can be used to distribute reports and data to users or servers in various formats (e.g. text, Excel, PDF, HTML).

• Data Warehouse List Upload - a web-based tool that allows lists of data to be uploaded into the data warehouse for use as input to queries.