rusty barnett
DESCRIPTION
TRANSCRIPT
Data Warehousing De-Mystified
Presented to the AnnualGeorgia Oracle Users Group
Conference on March 20, 2006 By Rusty Barnett
Presented by Rusty Barnett 2
Data Warehousing De-Mystified
What do people mean when they call a grouping of data a “Data Warehouse”?
1) Large Data Volume, Loaded using an ETL Process (Extract, Transform, Load)
2) Data is Time-sensitive and Subject Oriented3) Enterprise Data Warehouse, Data Marts, ODS
(Operational Data Stores), DSS, EIS4) Dimensional Model, Third-Normal Form
Presented by Rusty Barnett 3
Data Warehousing De-Mystified
What do people mean when they refer to “Data Warehousing”?
1) Dimensional Data Modeling
2) ETL processing and administrating
3) Business Intelligence Reporting
4) Database managing and administrating
… As it pertains to a Data Warehouse
Presented by Rusty Barnett 4
Data Warehousing De-Mystified
What is a “Data Warehouse”?1) "A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile collection of data in support of management's decision making process“. Bill Inmon, 1990.
2) "A data warehouse is the queryable presentation resource for an enterprise’s data". Ralph Kimball, page 19, "The Data Warehouse Lifecycle Toolkit“.
Presented by Rusty Barnett 5
Data Warehousing De-Mystified
Definitions of Data Warehouse on the Web:• A data warehouse is a database geared towards the business
intelligence requirements of an organization. www.oranz.co.uk/glossary_text.htm
• An information infrastructure that enables businesses to access and analyze detailed data and trends.www.adobe.com/products/vdp/glossary.html
• A collection of integrated, subject-oriented databases designed to support the DSS function. The data warehouse contains atomic data and lightly summarized data.it.csumb.edu/departments/data/glossary.html
• A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.www.pdacortex.com/glossary.htm
Presented by Rusty Barnett 6
Data Warehousing De-Mystified
Definitions of Data Warehouse on the Web Continued:• The Data Warehouse is a central repository of data that provides
the MIT community with integrated, up-to-date data from various administrative systems. web.mit.edu/sapr3/docs/webdocs/glossary/glCD.html
• An information repository from which queries & analysis are made.www.pcai.com/web/glossary/pcai_d_f_glossary.html
• This vast database stores information like a data repository, but goes a step further, allowing users to access data to perform research-oriented analyses.www.payorid.com/glossary.asp
• A collection of databases combined with a flexible data extraction system.www.genpromag.com/Glossary~LETTER~D.html
Presented by Rusty Barnett 7
Data Warehousing De-Mystified
Definitions of Data Warehouse on the Web Continued :• Datawarehouse (database) A generic term for a system for storing,
retrieving and managing large amounts of any type of data. www2.themanualpage.org/glossary/glo_d.php3
• The department or entity charged with collecting organization-wide data, verifying its accuracy, and analyzing, managing, and distributing it throughout the organization. In organizations without a data warehouse, each department may collect, analyze, manage, and distribute the data it needs for its operations.www2.uta.edu/ssw/trainasfa/glossary.htm
• A repository made up of databases of data extracted from a variety of sources, with a view to analysis to reveal additional information.www.gbc.t-online.hu/english/bszotare2.htm
Presented by Rusty Barnett 8
Data Warehousing De-Mystified
Definitions of Data Warehouse on the Web Continued :• Central repository of data extracted from various sources. The
current CMSU DW is a ‘read-only’ system with extracted data from SIS+, AFINS, HRS and Enrollment Management.www.cmsu.edu/x18299.xml
• A collection of data and information from various source systems.www.gov.bc.ca/prem/popt/service_plans/srv_pln/pssg/appen_a.htm
• A subject-oriented non-volatile collection of data used to support strategic decision making. The warehouse is the central point of data integration for business intelligence. It is the source of data for data marts within an enterprise and delivers a common view of enterprise data.publib.boulder.ibm.com/tividd/td/TEDW/GC32-0744-01/en_US/HTML/insmst342.htm
Presented by Rusty Barnett 9
Data Warehousing De-Mystified
Definitions of Data Warehouse on the Web Continued :• A logical collection of information, gathered from many different
operational databases, that supports business analysis activities and decision-making taskswww.321site.com/greg/courses/mis1/glossary.htm
• A data warehouse is, primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favor efficient data analysis and reporting (especially OLAP). Data warehousing is not meant for current, "live" data. en.wikipedia.org/wiki/Data_warehouse
Presented by Rusty Barnett 10
Data Warehousing De-Mystified
Definitions of Data Warehouse on the Web Continued :• A very large repository of data comprising nearly all of a company’s
information.www.mosaictec.com/storage/storage_terms.htm
• A large database capable of storing all the information possessed by a large organization.www.jqjacobs.net/edu/cis105/concepts/CIS105_concepts_13.html
• A database used for storing historical data, which is used for data analysis.docs.rinet.ru/O8/glossary.htm
• One or more data stores originating from prime authoritative data sources by an auditable replication process.www.data-core.com/glossary-of-terms.htm
Presented by Rusty Barnett 11
Data Warehousing De-Mystified
Goal of Data Warehousing:
A System that displays Large Volumes of Data as Business INFORMATION!
Presented by Rusty Barnett 12
Data Warehousing De-Mystified
Terms:Data Warehouse ---The term Data Warehouse was coined
by Bill Inmon in 1990 Enterprise Data Warehouse --- Accumulation of data from
all significant areas of an EnterpriseData Mart --- A subset of a data warehouse, for use by a
functional area (department) on a subjectDimensional Modeling --- A design technique that seeks to
present the data in a framework that’s intuitive, allows for high-performance access, and resistance to change
ERD --- Entity Relational Diagram
Presented by Rusty Barnett 13
Data Warehousing De-Mystified
Terms:Schema --- A collection of database objects, including
tables, views, indexes, and synonymsStar Schema --- A method of designing tables with a Fact
Table in the middle and Dimension Tables around it; a standard technique for designing and building multi-dimensional databases
Snowflake Schema --- Star Schema with child DimensionsStar Query --- A join between a Fact table and a number of
Dimension tables, where the Dimension tables are not joined to each other, only to the Fact table
Presented by Rusty Barnett 14
Data Warehousing De-Mystified
Terms:Fact Table --- Primary table in Dimensional Model that is
meant to contain measurements of the business, whose most useful columns are numeric and additive
Dimension Table --- One of the set of companion tables to a Fact Table, whose columns are usually textual and are used for constraining, grouping, ordering within queries
Grain --- The level of the data, like Weekly or YearlyMeasures --- The numeric columns in a Fact TableAudit Columns --- Columns specifying change (who,when)
Presented by Rusty Barnett 15
Data Warehousing De-Mystified
Terms:Conformance --- A Dimension is said to be conformed if it
can be used by more than one Fact Table and/or Data Mart and/or subject area, i.e., “shareable”
Associative (Intersection) Table --- A table that is used to associate a value from one table to a value in another table in order to reduce many-to-many relationships between tables
Staging Tables --- Tables used to “stage” data along the way toward loading data, especially useful in ETL
ETL --- Extract, Transform, and Load; also a Verb, meaning to load data into one system from data in another
Presented by Rusty Barnett 16
Data Warehousing De-Mystified
Data Warehousing Roles:Business Users --- People with knowledge of the business
and needs for reportingReport Writers --- People who write reports (Developers);
tools for writing reportsData Modelers --- People who create data models from
Business RequirementsETL Developers --- People who write / develop ETL
processesDBA --- People who manage and support databases
Presented by Rusty Barnett 17
Data Warehousing De-Mystified
Tips for Data Modelers:1) Understanding the capabilities of the ETL tool,
the RDBMS, and the Reporting tool, is just as important as understanding the needs of the business and the users requirements
2) All Data Warehouse primary keys (PK) should be single-column surrogate keys (sequence)
3) Keep Referential Integrity in the DW4) Use Associative tables to handle M:M5) Order columns in tables by their “fill” factor
Presented by Rusty Barnett 18
Data Warehousing De-Mystified
Tips for ETL Developers:1) Embrace the use of multiple Staging Tables2) Primarily INSERT data in each ETL process step,
especially into the Staging Tables3) Invalidate Indexes (other than Primary Keys, Unique
Keys, and Foreign Keys) before loading/updating massive amounts of data. After the data is loaded/updated, then rebuild these Indexes
4) Be careful in usage of Oracle ROWNUM5) Test new features, like 10g’s MERGE enhancements,
Pipeline Functions, Grouped Table Outer Joins, Conditional Update/Insert/Delete statements, etc.
Presented by Rusty Barnett 19
Data Warehousing De-Mystified
Tips for Report Developers:1) Avoid using Hints in your SQL2) Consider using Materialized Views for reporting on a
common set of pre-processed data 3) Be very careful (in fact, be hesitant) in using normal
Views in Data Warehouses 4) Consider using the SQL clauses “intersect”, “minus”,
and “union [all]” in set processing5) “JOINs are expensive” --- NOT, I/O is expensive 6) Test new features, like 10g’s Connect_by, MView and
MERGE enhancements, Pipeline Functions, Grouped Table Outer Joins, Time Series & Interrow Calc’s, etc.
Presented by Rusty Barnett 20
Data Warehousing De-Mystified
Tips for DBA’s:1) Use as much of each Data Block as possible
(PCTFREE, PCTUSED) 2) Set optimizer_index_cost_adj to lower than 100 3) Use ARCHIVELOG mode & RMAN for backups 4) Collect Statistics in best way for your Oracle Release 5) Test Benefits of Parallel Query and Query Re-write 6) Test new features of RDBMS --- 10g features like BFT,
new Flashback, Security, Partitioning, and RMAN features, Improved VLDB support, etc.
Presented by Rusty Barnett 21
Data Warehousing De-Mystified
Current Debates in Data Warehousing:1) Fresh Data versus Live Data2) Dimensional Model versus Third-Normal Form3) Is a Snowflake schema bad for performance?4) Time spent to “physicalize” the Data Model?5) Who is responsible for Data Model?6) Manage DB via Data Model? (OWB)7) When to partition and when not to?8) Acceptance of Level 3 Changing Dimensions
Presented by Rusty Barnett 22
Data Warehousing De-Mystified
Current Standards DW Tools:Database --- Oracle 9i / 10g
ERD Data Modeling --- ERwin (Oracle Designer)
ETL --- Informatica (OWB)
Report Writing --- Oracle Discoverer, MicroStrategy, Business Objects, Cognos
Presented by Rusty Barnett 23
Data Warehousing De-Mystified
Questions
And
Answer
Session
Presented by Rusty Barnett 24
Data Warehousing De-Mystified
The End