aggregating knowledge in a data warehouse and multidimensional analysis rafal lukawiecki strategic...
TRANSCRIPT
Aggregating Knowledge in a Data Warehouse and Multidimensional AnalysisRafal LukawieckiStrategic Consultant, Project Botticelli [email protected]
2
Objectives
• Explain the basics of:1. Data Warehousing2. ETL3. OLAP/Multidimensional Data
• Relate the theory to SQL Server 2008 SSAS and SSIS
This seminar is based on a number of sources including a few dozen of Microsoft-owned presentations, used with permission. Thank you to Marin Bezic, Kathy Sabourin, Aydin Gencler, Bryan Bredehoeft, and Chris Dial for all the support. Thank you to Maciej Pilecki for assistance with demos.
The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation.
Portions © 2009 Project Botticelli Ltd & entire material © 2009 Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed or as already covered by Microsoft Copyright ownerships. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.
5
Let’s Store the Intelligence: DW• SQL Server Analysis Services server is a logical
endpoint for data being aggregated with SSIS• But do not store actual data in it
• Data physically rests in another relational database called a Data Warehouse
• Modelling of data stored in DW and analysed using SSAS is at the heart of good Data Warehouse design
7
Star Schema Benefits
• Transforms normalized data into a simpler model• Delivers high-performance queries• Delivers higher performing queries using Star
Join Query Optimization• Uses mature modeling techniques that are
widely supported by many BI tools• Requires low maintenance as the data
warehouse design evolves
8
Snowflake Dimension Tables• Define hierarchies using multiple dimension
tables• Support fact tables with varying granularity• Simplify consolidation of data from multiple
sources
Potential for slower query performance in relational reporting
No difference in performance in Analysis Services database
9
OLAP Hierarchies
• Benefits• View of data at different levels of summarization• Path to drill down or drill up
• Implementation• Denormalized DW star
schema dimension• Normalized DW snowflake
dimension• Self-referencing
relationship
11
Fact Table Fundamentals
• Collection of measurements associated with a specific business process
• Specific column types• Foreign keys to dimensions• Measures – numeric and aggregatable• Metadata and lineage
• Consistent granularity – the most atomic level by which the facts can be defined
12
Fact Table Examples
Day Grain
Quarter Grain
Reseller sales data by:•Product•Order Date•Reseller•Employee•Sales Territory
Sales quota data by:•Employee•Time
13
Date Dimension Table
• Most common dimension used in analysis (aka Time dimension)
• Used consistently with all facts for efficient and flexible analysis
• Useful common attributes – Year, Quarter, Month, Day• Time series analysis support• Navigation and summarization enabled with
hierarchies, such as calendar or fiscal• Single table design (typically not snowflake
design)Tip: Format the key of the dimension as yyyymmdd (e.g. 20060925) to make it readily understandable
14
Slowly Changing Dimensions
• Support primary role of data warehouse to describe the past accurately
• Maintain historical context as new or changed data is loaded into dimension tables
• Implement changes by Slowly Changing Dimension (SCD) type• Type 1: Overwrite the existing dimension record• Type 2: Insert a new ‘versioned’ dimension record• Type 3: Track limited history with attributes
16
SCD Type 2
• Existing record is ‘expired’ and new record inserted
• History is preserved• Most common form of SCD
17
SCD Type 3
• Existing record is updated• Limited history is preserved• Implementation is rare
SalesTerritoryKey update to 10
18
Let’s Get the Data
• We would like to populate facts and dimensions in our Data Warehouse from OLTP data...
20
Let’s do ETL with SSIS
• SQL Server Integration Services (SSIS) service
• SSIS object model• Two distinct runtime engines:
• Control flow• Data flow
• 32-bit and 64-bit editions
21
The Package
• The basic unit of work, deployment, and execution
• An organized collection of:• Connection managers• Control flow components• Data flow components• Variables• Event handlers• Configurations
• Can be designed graphically or built programmatically
• Saved in XML format to the file system or SQL Server
22
Control Flow
• Control flow is a process-oriented workflow engine
• A package contains a single control flow• Control flow elements
• Containers• Tasks• Precedence constraints• Variables
23
Data Flow
• The Data Flow Task• Performs traditional ETL and more• Fast and scalable
• Data Flow Components• Extract data from Sources• Load data into Destinations• Modify data with Transformations
• Service Paths• Connect data flow components• Create the pipeline
28
Row Transformations
• Update column values or create new columns• Transform each row in the pipeline input
29
Rowset Transformations• Create new rowsets that can include
• Aggregated values• Sorted values• Sample rowsets• Pivoted or unpivoted rowsets
• This is a heavy-weight performer of SSIS• Are also called asynchronous components
30
Split and Join Transformations
• Distribute rows to different outputs• Create copies of the transformation inputs• Join multiple inputs into one output• Perform lookup operations
33
SQL Server 2008 Analysis Services
• OLAP component• Aggregates and organizes data from
business data sources• Performs calculations difficult to perform
using relational queries• Supports advanced business intelligence,
such as Key Performance Indicators• Data mining component
• Discovers patterns in both relational and OLAP data
• Enhances the OLAP component with discovered results
34
Cube = Unified Dimensional Model• Multidimensional data• Combination of measures and dimensions as
one conceptual model• Measures are sourced from fact tables• Dimensions are sourced from dimension tables
35
Dimensions
• Members from tables/views in a data source view (based on a Data Warehouse)
• Contain attributes matching dimension columns• Organize attributes as hierarchies
• One All level and one leaf level• User hierarchies are multi-level combinations of
attributes• Can be placed in display folders
• Used for slicing and dicing by attribute
36
Hierarchy
• Defined in Analysis Services• Ordered collection of attributes into levels• Navigation path through dimensional space• Very important to get right!
Customers by Geography
Country
State
City
Customer
Customers by Demographics
Marital
Gender
Customer
37
Measure Group
• Group of measures with same dimensionality• Analogous to a fact table• Cube can contain more than one measure group
• E.g. Sales, Inventory, Finance• Defined by dimension relationships
38
Sales Inventory Finance
Customers X
Products X X
Time X X X
Promotions X
Warehouse X
Department
X
Account X
Scenario X
Measure Group
Measure GroupD
imen
sio
n
39
Dimension Relationships
• Define interaction between dimensions and measure groups
• Relationship types• Regular (a typical dimension)• Fact (Degenerate)• Reference• Many-to-many• Data mining
41
Calculations
• Expressions evaluated at query time for values that cannot be stored in fact table
• Types of calculations• Calculated members• Named sets• Scoped assignments
• Calculations are defined using MDX
MDX = MultiDimensional EXpressions
43
Summary
• As a platform for enterprise Business Intelligence you should consider three things:• A Data Warehouse• Process of Data Integration (incl. ETL)• Multidimensional Analysis (OLAP)
= SQL Server 2008 Engine, SSIS, and SSAS
• Now you can support decision making and performance management through:• Reports, dashboards, Excel integration, data
mining, and better business software
44
© 2009 Microsoft Corporation & Project Botticelli Ltd. All rights reserved.
The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation.
Portions © 2009 Project Botticelli Ltd & entire material © 2009 Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed or as already covered by Microsoft Copyright ownerships. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.