best practices for query performance in a data warehouse
DESCRIPTION
Best Practices for Query Performance In a Data Warehouse. Calisto Zuzarte IBM [email protected] Session Code: D09 May 13, 2010 8:30AM–9:30AM Platform: Linux, Unix and Windows. Data Warehouse Life Cycle. Database design / Application design - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/1.jpg)
Best Practices for Query Performance In a Data Warehouse
Calisto [email protected]
Session Code: D09
May 13, 2010 8:30AM–9:30AMPlatform: Linux, Unix and Windows
![Page 2: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/2.jpg)
Data Warehouse Life Cycle• Database design / Application design
• The Warehouse Application architects and Database Administrators work together to design the queries and schema before they put the application in production
• Database performance layer implementation• In order to meet SLAs, DBAs usual go through some iterations augmenting the
database with performance layer objects and set up the initial configuration to get good performance
• Database tuning operations• During production, with changing requirements and change in data, there is on-
going tuning required to keep operations smooth.
![Page 3: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/3.jpg)
Motivation
• Data warehouse environments characteristics:• Large volumes of data
• Millions/Billions of rows involved in some tables• Large amounts of data rolled-in and rolled-out
• Complex queries • Large Joins• Large Sorts, • Large amounts of Aggregations• Many tables involved
• Ad Hoc Queries
• It is important to pay attention to query performance
![Page 4: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/4.jpg)
Objectives
• Provide recommendations so that you can improve data warehouse query performance• Database Design considerations• Application Design considerations• Performance Layer Considerations• Ongoing Tuning Considerations
![Page 5: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/5.jpg)
Agenda
• Best Practices – Database Design • Best Practices – Application Design• Best Practices – Performance Layer• Best Practices – Configuration and Operations
![Page 6: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/6.jpg)
Best Practices – Database Design
• Best Practices - Parallelism• Inter-partition Shared nothing parallelism • Intra-Query Parallelism (SMP)
• Best Practices - Partitioning• Database Partitioning• Table Partitioning • Multi-Dimension Clustering• UNION ALL Views
• Best Practices – Schema • Best Practices - Compression
![Page 7: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/7.jpg)
Best Practices - Parallelism
• Database partition feature (DPF) is recommended • To achieve parallelism in a data warehouse• For scalability and query performance
• SMP (Intra-Query Parallelism) not recommended• In concurrent multi-user environments with heavy CPU usage
• SMP recommended • When CPUs are highly under utilized• When DPF is not an option
![Page 8: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/8.jpg)
Partitioning (Complimentary Strategies in DB2)• Database Partitioning (DPF) … DISTRIBUTE BY HASH
• Key Benefit : Better scalability and performance through parallelism
• Multidimensional Clustering (MDC) … ORGANIZE BY DIMENSION• Key Benefit : Better query performance through data clustering
• Table (Range) Partitioning … PARTITION BY RANGE• Table Partitioning• Key Benefit : Better data management (roll-in and roll-out of data)
• UNION ALL Views • Key Benefit : Independent branch optimization
![Page 9: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/9.jpg)
Divide And Conquer ! Distribute, Partition, Organize !
Distribute By
Partition By
Organize By
![Page 10: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/10.jpg)
Best Practices – Database Partitioning
• Collocate the fact and largest dimension • Choose to avoid significant skew on some partitions• Avoid DATE dimension where active transactions for
current date all fall on one database partition (TIMESTAMP is good)
• Possibilities for workload isolation for data marts• Different partition groups but common dimension tables• Needs replicated tables (discussed later)
![Page 11: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/11.jpg)
Best Practices – Table Partitioning
• Recommend partitioning the fact tables• Typically based on DATE dimension• Works better with application key predicates applied directly
• Table or Range Partitioning• Recommend table or range partitioning (V9.7 :partitioned indexes) • Choose partitioning based on roll-in / roll-out granularity
• UNION ALL Views• Each branch optimized independently • Use with well designed applications (Dangers of materialization)• Large number of branches require time and memory to optimize• Needs predicates with constants for branch elimination
![Page 12: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/12.jpg)
Best Practices – Multidimensional Clustering (MDC)
• Recommend defining MDC on the fact table• Guaranteed clustering (Avoids the need to REORG for clustering)• I/O optimization• Compact indexes (compact, coexists with regular indexes)
• Choose dimensions based on query predicates • Recommend the use of 1 to 4 dimensions• Need to ensure dimensions are chosen such that they do not
waste storage
• Could choose a finer granularity of Table partitioning range• For example: Table partition range by month, MDC by date
![Page 13: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/13.jpg)
Star Schema
Product_idStore_idChannel_idDate_id
AmountQuantity…
SALESStore_id
Region_id…
STORE
Date_id
Month_idQuarter_idYear_id
TIME
Channel_id
…
CHANNEL
Product_id
Class_idGroup_idFamily_idLine_idDivision_id…
PRODUCT
![Page 14: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/14.jpg)
Dimension Hierarchy
Channel Store
Month
Product
Class
Group
Family
Line
Division
Quarter
Year
Retailer
Sales Fact
Product Dimension
Time Dimension
Store Dimension
Channel Dimension
Level 5
Level 1
Level 2
Level 3
Level 4
Level 0Date
![Page 15: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/15.jpg)
Best Practices - Schema• Surrogate Keys
• As far as possible use application keys themselves • allows predicates to be applied/transferred directly on the fact table
• DATE is a good candidate (easier to roll-in/roll-out and for MDC )
• Star Schema / Snowflakes• Separate tables for each dimension hierarchy (snowflake) may result in a large
number of joins• Flattened dimensions may contain a lot of redundancy (space)
• Define Columns NOT NULL when appropriate• Many optimizations that are done based on NOT NULL
• Define Uniqueness when appropriate• Primary Keys / Unique Constraints / Unique Indexes
![Page 16: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/16.jpg)
Compression
• Table, Index and Temp Table compression • Huge benefits with storage savings
• With table and TEMP compression 30-70%• With Index compression 30-40%
• Performance gains because • Less I/O and better use of bufferpools
• TEMP table compression helps operators like Hash Join, Merge Join, Sorts and Table Queues if they spill
![Page 17: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/17.jpg)
Best Practices - Compression
• Consider compression particularly with the fact table• Strongly recommend compression on the fact table when
not CPU bound
![Page 18: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/18.jpg)
Agenda
• Best Practices – Database Design• Best Practices – Application Design• Best Practices – Performance Layer• Best Practices – Configuration and Operations
![Page 19: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/19.jpg)
Best Practices – Application Considerations
• Use constants instead of expressions in the query• Example
• WHERE DateCol <= CURRENT DATE – 5
• Use VALUES(CURRENT DATE – 5) to get the resulting constant first and use it in the query
• Avoid expressions on indexed columns• Example
• WHERE DATECOL – 2 DAYS > ‘2009-10-22’
• WHERE DATECOL > ‘2009-10-22’ + 2 DAYS
![Page 20: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/20.jpg)
Best Practices – Application Considerations
• Avoid mixing data types in join predicates• Example
• WHERE IntegerCol = DecimalCol
• Use Global Temporary Tables to split a query if it contains more than 10-15 tables• Reduces optimization time
![Page 21: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/21.jpg)
Agenda
• Best Practices – Database Design• Best Practices – Application Design• Best Practices – Performance Layer• Best Practices – Configuration and Operations
![Page 22: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/22.jpg)
Best Practices – Performance Layer
• Indexes• Statistics
• Distribution Statistics• Column Group Statistics• Statistical Views
• Constraints• Referential Integrity
• Materialized Query Tables• Replicated Tables
![Page 23: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/23.jpg)
Indexes
• Indexes are a vertical subset of the data in the table• Indexes provide ORDER• Indexes may allow for clustered access to the table
![Page 24: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/24.jpg)
• To get Index Only Access instead of more expensive ISCAN-FETCH or TSCAN (Table Scan)
• To avoid SORTs particularly those that spill
• To promote index-ORing and index-ANDing
• To promote Star Joins
• When you have range join predicates• Better possibilities with Nested Loop Join
• Indexes for clustering (MDC)
Index Considerations
![Page 25: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/25.jpg)
![Page 26: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/26.jpg)
• Estimating the size of intermediate results is critical to getting good query execution plans
• Without sufficient information, the optimizer can only guess based on some assumptions
• Data skew and statistical correlation between multiple column values introduce uncertainty
• Pay attention to DATE columns
Cardinality Estimation
![Page 27: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/27.jpg)
Best Practices - Statistics
• Collect distribution Statistics when there is skew and predicates use constants
• Consider a high number of quantile statistics on columns with DATE range predicates and character string columns
![Page 28: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/28.jpg)
Column Group Statistics
• Example: COUNTRY = ‘Germany’ And CITY = ‘Frankfurt’
• No CGS: Selectivity = ½ * 1/3 = 1/6 …Estimate 1 row
• With CGS: Selectivity = 1/3 …Estimate 2 rows
Country City Hotel Name
Germany
Bremen Hilton
Germany
Bremen Best Western
Germany
Frankfurt
InterCity
Germany
Frankfurt
Shangri-La
Canada Toronto Four Seasons
Canada Toronto Intercontinental
1
2
3
![Page 29: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/29.jpg)
CUSTID CNAME
1 ‘ABC’
2 ‘DEF’
3 ‘GHI’
4 ‘IBM’
5 ‘JKL’
6 ‘MNO’
7 ‘PQR’
… …
… …
100 ‘XYZ’
CUSTID # of Rows
4 2000000
10 700000
8 500000
63 300000
72 100000
9 50000
12 20000
… …
… …
5 5
CUST Table Frequency Statistics – SALES Table
SELECT … FROM SALES, CUSTWHERE CUST.CNAME = ‘IBM’ AND CUST.CUSTID = SALES.CUSTID
Cardinality Estimate with Uniformity = 100,000Actual Cardinality : 2,000,000 !!!!!!!!!!!!!!!!!!!!
100 rows, 100 custids10000000 rowsProblem Scenario - Skew
![Page 30: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/30.jpg)
Best Practices - Statistics• Collect Column Group Statistics with multiple predicates on the
same table• WHERE Country = ‘CANADA’ and City = ‘TORONTO’RUNSTATS … ON ALL COLUMNS AND ON COLUMNS ((country, city) … ) …
• Consider Statistical Views when • There is skew on the join column• There is a significant difference in the range of values in the fact and the
dimensionCREATE VIEW SV1 AS (SELECT C.* FROM CUST C, FACT F WHERE C.CUST_ID = F.CUST_ID)ALTER VIEW cust_fact ENABLE QUERY OPTIMIZATIONRUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION
![Page 31: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/31.jpg)
Referential Integrity (RI)
• Facilitates aggregation push down• Example in the appendix section
• Eliminates redundant joins in views• RI helps determine that queries that do not require data from a
primary key table need not do that join even if it is in the view
• Helps with Materialized Query Table matching• Allows Queries to match MQTs with more dimension table joins
![Page 32: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/32.jpg)
Consider Materialized Query Tables
Fact
GB
JOIN
Dim2
Dim1
JOIN
Fact
GB
JOIN
Dim2
Dim1
JOIN
Fact
GB
JOIN
Dim2
Dim1
JOIN
Fact
GB
JOIN
Dim2
Dim1
JOIN
Joe’s Query
Sue’s Query
Bob’s QueryFact
GB
JOIN
Dim2
Dim1
JOIN
MQT
Q9 Q9
Bob’s QSue’s
Joe’s Q
![Page 33: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/33.jpg)
Best Practices - Defining Materialized Query Tables
• What MQTs should I define ?• Estimate the size of the candidate MQTs by executing COUNT
queries against base tables. • Try to achieve at least a 10X reduction in size between fact and
the MQT• Build MQTs with a reasonable number of GROUP BY columns (3
to 6 dimension keys) at a time based on query patterns
• As far as possible build the MQT from the fact table alone• Use Table Partitioning for the fact table and the MQTs
![Page 34: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/34.jpg)
Best Practices - MQT Matching
• Define Referential Integrity to help with matching MQTs that contain more tables than the queries
• Define Functional Dependencies for thinner MQTs• Use COUNT_BIG instead of COUNT for DPF MQTs• Define indexes on MQTs• Keep statistics up-to-date• Define base table columns NOT NULL as far as possible
• For example we can match SUM(A + B) with SUM(A) + SUM(B)
![Page 35: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/35.jpg)
Best Practices – MQT Maintenance
• REFRESH IMMEDIATE• Create an index on the GROUP BY columns• Create the index on the set of columns that form a unique key • Always keep the base table and MQT statistics up-to-date
• REFRESH DEFERRED• If log space is an issue, consider NOT LOGGED INITIALLY or LOAD
from cursor • An MQT can be temporarily toggled into a regular table by using
• ALTER TABLE … DROP MATERIALIZED QUERY • ALTER TABLE … ADD MATERIALIZED QUERY
• Use ATTACH / DETACH if fact table and MQT are range partitioned tables
![Page 36: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/36.jpg)
Replicated Tables
• Replicate dimension tables (unless collocated with fact )
• Benefit : Avoids data movement
• Important : Define suitable indexes
• If too large, replicate a subset of frequently used columns
BTQ
JOIN
BTQ
JOIN
SALES SALESCUST SALES
BTQ
JOIN
CUSTCOPY
CUSTCOPY
CUSTCOPY
![Page 37: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/37.jpg)
Agenda
• Best Practices – Database Design• Best Practices – Application Design• Best Practices – Performance Layer• Best Practices – Configuration and Operations
![Page 38: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/38.jpg)
Best Practices – Configuration
• Optimization Level 5• Registry Variables
• DB2_ANTIJOIN=EXTEND• If slow queries have NOT EXISTS, NOT IN predicates
• DB2_REDUCED_OPTIMIZATION=YES• If compile time is an issue
• Configuration thumb rules • BUFFPOOL ~= SHEAPTHRES• SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)
![Page 39: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/39.jpg)
Best Practices - Statistics
• The DB2 Query Optimizer relies on reasonably accurate statistics to get a good query plans
• User runs RUNSTATS when data changes (part of ETL)• Statistics Fabrication (unreliable)
• DB2 keeps UPDATE / DELETE / INSERT counters• Fabrication limited to a few statistics – Not enough
• Consider configuring Automatic Statistics• Automatically collects statistics on tables in need• Runs in the background as a low priority job
• Consider configuring Real Time Statistics• Collects statistics on-the-fly
![Page 40: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/40.jpg)
Summary – Best Practices
• Database Design :• Parallelism, Partitioning, Schema, Compression
• Application Design• SQL Tips
• Performance Layer• Indexes, Statistics, Referential Integrity, Materialized Query
Tables, Replicated Tables
• Configuration and Operations• Configuration, Collecting Statistics
![Page 41: Best Practices for Query Performance In a Data Warehouse](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812ef1550346895d94914e/html5/thumbnails/41.jpg)
Calisto [email protected]