Object_Oriented Databases, by Dr. Khalil
1
Data Warehousing Concepts
Dr. Awad KhalilDr. Awad Khalil
Computer Science DepartmentComputer Science Department
AUCAUC
Object_Oriented Databases, by Dr. Khalil
2
Content
Why Data Warehousing?Why Data Warehousing? What Data Warehousing?What Data Warehousing? Benefits of Data WarehousingBenefits of Data Warehousing Problems of Data WarehousingProblems of Data Warehousing
Object_Oriented Databases, by Dr. Khalil
3
Why Data Warehousing? Corporate decision-makers require access to all the organization’s Corporate decision-makers require access to all the organization’s
data, wherever it is located.data, wherever it is located. To provide comprehensive analysis of the organization, its To provide comprehensive analysis of the organization, its
business, its requirements, and any trends, requires access to not business, its requirements, and any trends, requires access to not only the current values in the database, but also to historical data.only the current values in the database, but also to historical data.
To facilitate this type of analysis, the data warehouse has been To facilitate this type of analysis, the data warehouse has been created to hold data drawn from several data sources, maintained created to hold data drawn from several data sources, maintained by different operating units, together with historical and summary by different operating units, together with historical and summary transformations.transformations.
However, decision-makers also require powerful analysis tools. However, decision-makers also require powerful analysis tools. Two main types of analysis tools have emerged over the last few Two main types of analysis tools have emerged over the last few years:years: Online Analytical Processing (OLAP)Online Analytical Processing (OLAP) Data MiningData Mining
Object_Oriented Databases, by Dr. Khalil
4
What Data Warehousing? Data Warehousing: : A subject-oriented, integrated, time-A subject-oriented, integrated, time-
variant, and non-volatile collection of data in support of variant, and non-volatile collection of data in support of management’s decision-making processmanagement’s decision-making process..
Subject-oriented: : as the warehouse is organized around the as the warehouse is organized around the major subjects of the enterprise (such as customers, products, major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales).customer invoicing, stock control, and product sales).
Integrated: : because of the coming together of source data because of the coming together of source data from different enterprise-wide applications systems.from different enterprise-wide applications systems.
Time-variant:: because data in the warehouse is only accurate because data in the warehouse is only accurate and valid at some point in time or over some time interval.and valid at some point in time or over some time interval.
Non-volatile: : as the data is not updated in real time but as the data is not updated in real time but refreshed from operational systems on a regular basis.refreshed from operational systems on a regular basis.
Object_Oriented Databases, by Dr. Khalil
5
Benefits of Data Warehousing
Potential high returns on investmentPotential high returns on investmentCompetitive advantageCompetitive advantageIncrease productivity of corporate Increase productivity of corporate
decision-makersdecision-makers
Object_Oriented Databases, by Dr. Khalil
6
OLTP versus Data WarehousingOLTP Systems
Holds current data Stores detailed data Data is dynamic Repetitive processing High level of transaction
throughput Predictable pattern of usage Transaction-driven Application-oriented Supports day-to-day
decisions Serve large number of users
Data Warehousing Holds historical data
Stores detailed, lightly, and highly summarized data
Data is largely static
Ad hoc, unstructured and heuristic processing
Medium to low level of transaction throughput
Unpredictable pattern of usage
Analysis-driven
Subject-oriented Supports strategic decisions Serves relatively low number
of managerial users
Object_Oriented Databases, by Dr. Khalil
7
Problems of Data Warehousing Underestimation of resources for data loadingUnderestimation of resources for data loading Hidden problems with source systemsHidden problems with source systems Required data not capturedRequired data not captured Increased end-user demandsIncreased end-user demands Data homogenizationData homogenization High demand for resourcesHigh demand for resources Data ownershipData ownership High maintenanceHigh maintenance Long-duration projectsLong-duration projects Complexity of integrationComplexity of integration
Object_Oriented Databases, by Dr. Khalil
8
Data Warehouse Architecture
Operational Data Operational Data
Store (ODS) Load Manager Warehouse
Manager Query Manager Detailed Data Lightly and Highly
Summarized Data Archive/Backup
Data Metadata End-User Access
Tools
Object_Oriented Databases, by Dr. Khalil
9
Data Warehouse Architecture (Cont’d)
Operational Data Mainframe operational data held in first generation hierarchical Mainframe operational data held in first generation hierarchical
and network databases.and network databases. Departmental data held in proprietary file systems such as Departmental data held in proprietary file systems such as
VSAM, RMS, and relational DBMSs such as Oracle and DB2.VSAM, RMS, and relational DBMSs such as Oracle and DB2. Private data held on workstations and private servers.Private data held on workstations and private servers. External systems such as the Internet, commercially available External systems such as the Internet, commercially available
databases, or databases associated with an organization’s databases, or databases associated with an organization’s suppliers or customers.suppliers or customers.
Operational Data Store (ODS) ODS is a repository of current and integrated operational data ODS is a repository of current and integrated operational data
used for analysis.used for analysis. It is often structured and supplied with data in the same way as It is often structured and supplied with data in the same way as
the data warehouse , but may in fact act simply as a staging area the data warehouse , but may in fact act simply as a staging area for data to be moved into the data warehouse.for data to be moved into the data warehouse.
Object_Oriented Databases, by Dr. Khalil
10
Data Warehouse Architecture (Cont’d)
Load Manager The load manager (also called the The load manager (also called the frontend frontend component) component)
performs all the operations associated with the performs all the operations associated with the extraction and loading of data into the warehouse.extraction and loading of data into the warehouse.
The operations performed by the load manager may The operations performed by the load manager may include simple transformations of the data to prepare include simple transformations of the data to prepare the data for entry into the warehouse.the data for entry into the warehouse.
The size and complexity of this component will vary The size and complexity of this component will vary between data warehouses and may be constructed using between data warehouses and may be constructed using a combination of vendor data loading tools and custom-a combination of vendor data loading tools and custom-built programs.built programs.
Object_Oriented Databases, by Dr. Khalil
11
Data Warehouse Architecture (Cont’d)Warehouse Manager The warehouse manager performs all the operations associated with the management The warehouse manager performs all the operations associated with the management
of the data in the warehouse.of the data in the warehouse. This component is constructed using vendor data management tools and custom-built This component is constructed using vendor data management tools and custom-built
programs.programs. The operations performed by the warehouse manager include:The operations performed by the warehouse manager include:
Analysis of data to ensure consistency.Analysis of data to ensure consistency. Transformation and merging of source data from temporary storage into data Transformation and merging of source data from temporary storage into data
warehouse tables.warehouse tables. Creation of indexes and views on base tables.Creation of indexes and views on base tables. Generation of denormalizations (if necessary).Generation of denormalizations (if necessary). Generation of aggregations (if necessary).Generation of aggregations (if necessary). Backing-up and archiving data.Backing-up and archiving data.
Query Manager The query manager (also called the The query manager (also called the backendbackend component) performs all the operations component) performs all the operations
associated with the management of user queries.associated with the management of user queries. This component is typically constructed using vendor end-user data access tools, data This component is typically constructed using vendor end-user data access tools, data
warehouse monitoring tools, database facilities, and custom-built programs.warehouse monitoring tools, database facilities, and custom-built programs.
Object_Oriented Databases, by Dr. Khalil
12
Data Warehouse Architecture (Cont’d)
Detailed Data This area of the warehouse stores all the detailed data in the This area of the warehouse stores all the detailed data in the
database schema.database schema. In most cases, the detailed data is not stored online but is made In most cases, the detailed data is not stored online but is made
available by aggregating the data to the next level of detail.available by aggregating the data to the next level of detail. However, on a regular basis, detailed data is added to the However, on a regular basis, detailed data is added to the
warehouse to supplement the aggregated data.warehouse to supplement the aggregated data.
Lightly and Highly Summarized Data This area stores all the predefined lightly and highly summarized This area stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.(aggregated) data generated by the warehouse manager. It is transient area as it will be subject to change on an ongoing It is transient area as it will be subject to change on an ongoing
basis in order to respond to changing query profiles.basis in order to respond to changing query profiles. The purpose of summary information is to speed up the The purpose of summary information is to speed up the
performance of the queries.performance of the queries.
Object_Oriented Databases, by Dr. Khalil
13
Data Warehouse Architecture (Cont’d)
Archive/Backup Data This area of the warehouse stores the detailed and This area of the warehouse stores the detailed and
summarized data for the purpose of archiving and summarized data for the purpose of archiving and backup.backup.
Metadata This area stores all the metadata (data about data) This area stores all the metadata (data about data)
definitions used by all the processes in the warehouse. definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes including:Metadata is used for a variety of purposes including: The extraction and loading processes.The extraction and loading processes. The warehouse management process.The warehouse management process. As part of the query management process.As part of the query management process.
Object_Oriented Databases, by Dr. Khalil
14
Data Warehouse Architecture (Cont’d)
End-User Access Tools The principal purpose of data warehousing is to provide The principal purpose of data warehousing is to provide
information to business users for strategic decision-making.information to business users for strategic decision-making. These users interact with the warehouse using end-user access These users interact with the warehouse using end-user access
tools.tools. The data warehouse must efficiently support ad hoc and routine The data warehouse must efficiently support ad hoc and routine
analysis.analysis. Although the definition of end-user access tools can overlap, Although the definition of end-user access tools can overlap,
they can be categorized into five main groups:they can be categorized into five main groups: Reporting and query tools;Reporting and query tools; Application development tools;Application development tools; Executive Information Systems (EIS) tools;Executive Information Systems (EIS) tools; Online Analytical Processing (OLAP) tools;Online Analytical Processing (OLAP) tools; Data mining tools.Data mining tools.
Object_Oriented Databases, by Dr. Khalil
15
Data Warehouse Data Flows
Data warehousing focuses on the Data warehousing focuses on the management of five primary data management of five primary data flows, namely the inflow, upflow, flows, namely the inflow, upflow, downflow, outflow, and downflow, outflow, and metaflow:metaflow:
InflowInflow: Extracting, cleansing, : Extracting, cleansing, and loading of the source and loading of the source data.data.
UpflowUpflow: Adding value to the : Adding value to the data in the warehouse data in the warehouse through summarizing, through summarizing, packaging, and distribution packaging, and distribution of the data.of the data.
DownflowDownflow: Archiving and : Archiving and backing-up the data in the backing-up the data in the warehouse.warehouse.
OutflowOutflow: Making the data : Making the data available to end-usersavailable to end-users
MetaflowMetaflow: Managing the : Managing the metadata.metadata.
Object_Oriented Databases, by Dr. Khalil
16
Data Warehouse Tools and Technologies
Extraction, Cleansing, and Transformation Tools Selecting the correct extraction, cleansing, and Selecting the correct extraction, cleansing, and
transformation tools are critical steps in the transformation tools are critical steps in the construction of a data warehouse.construction of a data warehouse.
The tasks of capturing data from a source system, The tasks of capturing data from a source system, cleansing and transforming it, and then loading the cleansing and transforming it, and then loading the results into target system can be carried out either by results into target system can be carried out either by separate products, or by single integrated solution.separate products, or by single integrated solution.
Integrated solutions fall into one of the following Integrated solutions fall into one of the following categories:categories: Code generators;Code generators; Database replication tools;Database replication tools; Dynamic transformation engines.Dynamic transformation engines.
Object_Oriented Databases, by Dr. Khalil
17
Data Warehouse Tools and Technologies (Cont’d)
Data Warehouse DBMS Requirements: The specialized requirements for a relational DBMS suitable for data warehousing : The specialized requirements for a relational DBMS suitable for data warehousing
are as follows:are as follows: Load performance and load processingLoad performance and load processing Data quality managementData quality management Query performanceQuery performance Terabyte scalability and mass user scalabilityTerabyte scalability and mass user scalability Networked data warehouseNetworked data warehouse Warehouse administrationWarehouse administration Integrated dimensional analysisIntegrated dimensional analysis Advanced query functionalityAdvanced query functionality
Parallel DBMSs: Data warehousing requires the processing of enormous amounts of data and : Data warehousing requires the processing of enormous amounts of data and parallel database technology offers a solution to providing the necessary growth in performance.parallel database technology offers a solution to providing the necessary growth in performance.
The success of parallel DBMSs depends on the efficient operation of many resources including The success of parallel DBMSs depends on the efficient operation of many resources including processors, memory, disks and network connections.processors, memory, disks and network connections.
The aim behind using parallel DBMS is to solve decision-support problems using multiple nodes The aim behind using parallel DBMS is to solve decision-support problems using multiple nodes working on the same problem.working on the same problem.
The major characteristics of parallel DBMSs are scalability, operability, and availability.The major characteristics of parallel DBMSs are scalability, operability, and availability. The parallel DBMS performs many database operations simultaneously, splitting individual tasks The parallel DBMS performs many database operations simultaneously, splitting individual tasks
into smaller parts so that tasks can be spread across multiple processors.into smaller parts so that tasks can be spread across multiple processors. Parallel DBMSs must be capable of running parallel queries.Parallel DBMSs must be capable of running parallel queries. Parallel DBMSs must be capable of parallel data loading, table scaling, and data archiving and Parallel DBMSs must be capable of parallel data loading, table scaling, and data archiving and
backup.backup.
Object_Oriented Databases, by Dr. Khalil
18
Data Warehouse Tools and Technologies (Cont’d)
Data Warehouse Metadata The major purpose of metadata is to show the pathway back to where the data begun, so that The major purpose of metadata is to show the pathway back to where the data begun, so that
the warehouse administrators know the history of any item in the warehouse.the warehouse administrators know the history of any item in the warehouse. Metadata has several functions within the warehouse that relates to the processes associated Metadata has several functions within the warehouse that relates to the processes associated
with data transformation and loading, data warehouse management, and query generation.with data transformation and loading, data warehouse management, and query generation. The metadata associated with data transformation and loading must describe the source data The metadata associated with data transformation and loading must describe the source data
and any changes that were made to the data.and any changes that were made to the data. The metadata associated with data management describes the data as it is stored in the The metadata associated with data management describes the data as it is stored in the
warehouse.warehouse. The metadata is also required by the query manager to generate appropriate queries. In turn, The metadata is also required by the query manager to generate appropriate queries. In turn,
the query manager generates additional metadata about the queries that are run, which can the query manager generates additional metadata about the queries that are run, which can be used to generate a history on all the queries and a query profile for each user, group of be used to generate a history on all the queries and a query profile for each user, group of users, or the data warehouse.users, or the data warehouse.
There is also metadata associated with the users of queries that includes, for example, There is also metadata associated with the users of queries that includes, for example, information describing what the term ‘price’ or ‘customer’ means in a particular database information describing what the term ‘price’ or ‘customer’ means in a particular database and whether the meaning has changed over time.and whether the meaning has changed over time.
Synchronizing Metadata The major integration issue is how to synchronize the various types of metadata used The major integration issue is how to synchronize the various types of metadata used
throughout the data warehouse. throughout the data warehouse. The various tools of a data warehouse generate and use their own metadata, and to achieve The various tools of a data warehouse generate and use their own metadata, and to achieve
integration, we require that these tools are capable of sharing their metadata.integration, we require that these tools are capable of sharing their metadata. The challenge is to synchronize metadata between different products from different vendors The challenge is to synchronize metadata between different products from different vendors
using different metadata stores.using different metadata stores.
Object_Oriented Databases, by Dr. Khalil
19
Data Warehouse Tools and Technologies (Cont’d)
Administration and Management Tools A data warehouse requires tools to support the administration and A data warehouse requires tools to support the administration and
management of such a complex environment.management of such a complex environment. The data warehouse administration and management tools must be capable The data warehouse administration and management tools must be capable
of supporting the following tasks:of supporting the following tasks: Monitoring data loading from multiple sources;Monitoring data loading from multiple sources; Data quality and integrity checks;Data quality and integrity checks; Managing and updating metadata;Managing and updating metadata; Monitoring database performance to ensure efficient query response Monitoring database performance to ensure efficient query response
times and resources utilization;times and resources utilization; Auditing data warehouse usage to provide user chargeback information;Auditing data warehouse usage to provide user chargeback information; Replicating, subsetting, and distributing data;Replicating, subsetting, and distributing data; Maintaining efficient data storage management;Maintaining efficient data storage management; Purging data;Purging data; Archiving and backing-up data;Archiving and backing-up data; Implementing recovery following failures;Implementing recovery following failures; Security management.Security management.
Object_Oriented Databases, by Dr. Khalil
20
Data Marts
Data Mart is a subset of a data Data Mart is a subset of a data warehouse that supports the warehouse that supports the requirements of a particular requirements of a particular department or business department or business function.function.
The characteristics that The characteristics that differentiate data marts and data differentiate data marts and data warehouses include:warehouses include: A data mart focuses on only A data mart focuses on only
the requirements of users the requirements of users associated with one associated with one department or business department or business function;function;
Data marts do not normally Data marts do not normally contain detailed operational contain detailed operational data, unlike data data, unlike data warehouse;warehouse;
As data marts contain less As data marts contain less data compared with data data compared with data warehouse, data marts are warehouse, data marts are more easily understood and more easily understood and navigated.navigated.
Object_Oriented Databases, by Dr. Khalil
21
Reasons for Creating a Data Mart
To give users access to the data they need to analyze most often;To give users access to the data they need to analyze most often; To provide data in a form that matches the collective view of the data by a To provide data in a form that matches the collective view of the data by a
group of users in a department or business function;group of users in a department or business function; To improve end-user response time due to the reduction in the volume of To improve end-user response time due to the reduction in the volume of
data to be accessed;data to be accessed; To provide appropriately structured data as dictated by the requirements of To provide appropriately structured data as dictated by the requirements of
end-user access tools such as OLAP and data mining tools, which may end-user access tools such as OLAP and data mining tools, which may require their own internal database structures.require their own internal database structures.
Data marts normally use less data so tasks such as data cleansing, loading, Data marts normally use less data so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data setting up a data mart is simpler than establishing a corporate data warehouse.warehouse.
The cost of implementing data marts is normally less than that required to The cost of implementing data marts is normally less than that required to establish a data warehouse.establish a data warehouse.
The potential users of a data mart are more clearly defined and can be more The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.corporate data warehouse project.
Object_Oriented Databases, by Dr. Khalil
22
Data Marts Issues
Data mart functionality: The capabilities of data marts have increased with the growth in : The capabilities of data marts have increased with the growth in their popularity. Rather than being simply small, easy-to-access databases, some data marts their popularity. Rather than being simply small, easy-to-access databases, some data marts must now be scalable to hundreds of gigabytes, and provide sophisticated analysis using OLAP must now be scalable to hundreds of gigabytes, and provide sophisticated analysis using OLAP and/or data mining tools.and/or data mining tools.
Data mart size: Users expect faster response times from data marts than from data : Users expect faster response times from data marts than from data warehouse, however, performance deteriorates as data marts grow in size. warehouse, however, performance deteriorates as data marts grow in size.
Data mart load performance: A data mart has to balance two critical components: end-: A data mart has to balance two critical components: end-user response time and data loading performance. user response time and data loading performance.
User’s access to data in multiple data marts: One approach is to replicate data : One approach is to replicate data between different data marts or, alternatively, build virtual data marts. Virtual data marts are between different data marts or, alternatively, build virtual data marts. Virtual data marts are views of several physical data marts or the corporate data warehouse tailored to meet the views of several physical data marts or the corporate data warehouse tailored to meet the requirements of specific group of users.requirements of specific group of users.
Data mart Internet/Intranet access: Internet/Intranet technology offers users low-cost : Internet/Intranet technology offers users low-cost access to data marts and the data warehouse using Web browsers.access to data marts and the data warehouse using Web browsers.
Data mart administration: As the number of data marts in an organization increases, so : As the number of data marts in an organization increases, so does the need to centrally manage and coordinate data mart activities.does the need to centrally manage and coordinate data mart activities.
Data mart installation: Data marts are becoming increasingly complex to build. Vendors : Data marts are becoming increasingly complex to build. Vendors are offering products referred to as “Data marts in a box” that provide a low-cost source of data are offering products referred to as “Data marts in a box” that provide a low-cost source of data mart tools.mart tools.
Object_Oriented Databases, by Dr. Khalil
23
Thank you