best practices for query performance in a data warehouse

Best Practices for Query Performance In a Data Warehouse

Calisto [email protected]

Session Code: D09

May 13, 2010 8:30AM–9:30AMPlatform: Linux, Unix and Windows

Data Warehouse Life Cycle• Database design / Application design

• The Warehouse Application architects and Database Administrators work together to design the queries and schema before they put the application in production

• Database performance layer implementation• In order to meet SLAs, DBAs usual go through some iterations augmenting the

database with performance layer objects and set up the initial configuration to get good performance

• Database tuning operations• During production, with changing requirements and change in data, there is on-

going tuning required to keep operations smooth.

Motivation

• Data warehouse environments characteristics:• Large volumes of data

• Millions/Billions of rows involved in some tables• Large amounts of data rolled-in and rolled-out

• Complex queries • Large Joins• Large Sorts, • Large amounts of Aggregations• Many tables involved

• Ad Hoc Queries

• It is important to pay attention to query performance

Objectives

• Provide recommendations so that you can improve data warehouse query performance• Database Design considerations• Application Design considerations• Performance Layer Considerations• Ongoing Tuning Considerations

Agenda

• Best Practices – Database Design • Best Practices – Application Design• Best Practices – Performance Layer• Best Practices – Configuration and Operations

Best Practices – Database Design

• Best Practices - Parallelism• Inter-partition Shared nothing parallelism • Intra-Query Parallelism (SMP)

• Best Practices - Partitioning• Database Partitioning• Table Partitioning • Multi-Dimension Clustering• UNION ALL Views

• Best Practices – Schema • Best Practices - Compression

Best Practices - Parallelism

• Database partition feature (DPF) is recommended • To achieve parallelism in a data warehouse• For scalability and query performance

• SMP (Intra-Query Parallelism) not recommended• In concurrent multi-user environments with heavy CPU usage

• SMP recommended • When CPUs are highly under utilized• When DPF is not an option

Partitioning (Complimentary Strategies in DB2)• Database Partitioning (DPF) … DISTRIBUTE BY HASH

• Key Benefit : Better scalability and performance through parallelism

• Multidimensional Clustering (MDC) … ORGANIZE BY DIMENSION• Key Benefit : Better query performance through data clustering

• Table (Range) Partitioning … PARTITION BY RANGE• Table Partitioning• Key Benefit : Better data management (roll-in and roll-out of data)

• UNION ALL Views • Key Benefit : Independent branch optimization

Divide And Conquer ! Distribute, Partition, Organize !

Distribute By

Partition By

Organize By

Best Practices – Database Partitioning

• Collocate the fact and largest dimension • Choose to avoid significant skew on some partitions• Avoid DATE dimension where active transactions for

current date all fall on one database partition (TIMESTAMP is good)

• Possibilities for workload isolation for data marts• Different partition groups but common dimension tables• Needs replicated tables (discussed later)

Best Practices – Table Partitioning

• Recommend partitioning the fact tables• Typically based on DATE dimension• Works better with application key predicates applied directly

• Table or Range Partitioning• Recommend table or range partitioning (V9.7 :partitioned indexes) • Choose partitioning based on roll-in / roll-out granularity

• UNION ALL Views• Each branch optimized independently • Use with well designed applications (Dangers of materialization)• Large number of branches require time and memory to optimize• Needs predicates with constants for branch elimination

Best Practices – Multidimensional Clustering (MDC)

• Recommend defining MDC on the fact table• Guaranteed clustering (Avoids the need to REORG for clustering)• I/O optimization• Compact indexes (compact, coexists with regular indexes)

• Choose dimensions based on query predicates • Recommend the use of 1 to 4 dimensions• Need to ensure dimensions are chosen such that they do not

waste storage

• Could choose a finer granularity of Table partitioning range• For example: Table partition range by month, MDC by date

Star Schema

Product_idStore_idChannel_idDate_id

AmountQuantity…

SALESStore_id

Region_id…

STORE

Date_id

Month_idQuarter_idYear_id

TIME

Channel_id

…

CHANNEL

Product_id

Class_idGroup_idFamily_idLine_idDivision_id…

PRODUCT

Dimension Hierarchy

Channel Store

Month

Product

Class

Group

Family

Line

Division

Quarter

Year

Retailer

Sales Fact

Product Dimension

Time Dimension

Store Dimension

Channel Dimension

Level 5

Level 1

Level 2

Level 3

Level 4

Level 0Date

Best Practices - Schema• Surrogate Keys

• As far as possible use application keys themselves • allows predicates to be applied/transferred directly on the fact table

• DATE is a good candidate (easier to roll-in/roll-out and for MDC )

• Star Schema / Snowflakes• Separate tables for each dimension hierarchy (snowflake) may result in a large

number of joins• Flattened dimensions may contain a lot of redundancy (space)

• Define Columns NOT NULL when appropriate• Many optimizations that are done based on NOT NULL

• Define Uniqueness when appropriate• Primary Keys / Unique Constraints / Unique Indexes

Compression

• Table, Index and Temp Table compression • Huge benefits with storage savings

• With table and TEMP compression 30-70%• With Index compression 30-40%

• Performance gains because • Less I/O and better use of bufferpools

• TEMP table compression helps operators like Hash Join, Merge Join, Sorts and Table Queues if they spill

Best Practices - Compression

• Consider compression particularly with the fact table• Strongly recommend compression on the fact table when

not CPU bound

Agenda

• Best Practices – Database Design• Best Practices – Application Design• Best Practices – Performance Layer• Best Practices – Configuration and Operations

Best Practices – Application Considerations

• Use constants instead of expressions in the query• Example

• WHERE DateCol <= CURRENT DATE – 5

• Use VALUES(CURRENT DATE – 5) to get the resulting constant first and use it in the query

• Avoid expressions on indexed columns• Example

• WHERE DATECOL – 2 DAYS > ‘2009-10-22’

• WHERE DATECOL > ‘2009-10-22’ + 2 DAYS

Best Practices – Application Considerations

• Avoid mixing data types in join predicates• Example

• WHERE IntegerCol = DecimalCol

• Use Global Temporary Tables to split a query if it contains more than 10-15 tables• Reduces optimization time

Agenda


Best Practices – Performance Layer

• Indexes• Statistics

• Distribution Statistics• Column Group Statistics• Statistical Views

• Constraints• Referential Integrity

• Materialized Query Tables• Replicated Tables

Indexes

• Indexes are a vertical subset of the data in the table• Indexes provide ORDER• Indexes may allow for clustered access to the table

• To get Index Only Access instead of more expensive ISCAN-FETCH or TSCAN (Table Scan)

• To avoid SORTs particularly those that spill

• To promote index-ORing and index-ANDing

• To promote Star Joins

• When you have range join predicates• Better possibilities with Nested Loop Join

• Indexes for clustering (MDC)

Index Considerations

• Estimating the size of intermediate results is critical to getting good query execution plans

• Without sufficient information, the optimizer can only guess based on some assumptions

• Data skew and statistical correlation between multiple column values introduce uncertainty

• Pay attention to DATE columns

Cardinality Estimation

Best Practices - Statistics

• Collect distribution Statistics when there is skew and predicates use constants

• Consider a high number of quantile statistics on columns with DATE range predicates and character string columns

Column Group Statistics

• Example: COUNTRY = ‘Germany’ And CITY = ‘Frankfurt’

• No CGS: Selectivity = ½ * 1/3 = 1/6 …Estimate 1 row

• With CGS: Selectivity = 1/3 …Estimate 2 rows

Country City Hotel Name

Germany

Bremen Hilton

Germany

Bremen Best Western

Germany

Frankfurt

InterCity

Germany

Frankfurt

Shangri-La

Canada Toronto Four Seasons

Canada Toronto Intercontinental

1

2

3

CUSTID CNAME

1 ‘ABC’

2 ‘DEF’

3 ‘GHI’

4 ‘IBM’

5 ‘JKL’

6 ‘MNO’

7 ‘PQR’

… …

… …

100 ‘XYZ’

CUSTID # of Rows

4 2000000

10 700000

8 500000

63 300000

72 100000

9 50000

12 20000

… …

… …

5 5

CUST Table Frequency Statistics – SALES Table

SELECT … FROM SALES, CUSTWHERE CUST.CNAME = ‘IBM’ AND CUST.CUSTID = SALES.CUSTID

Cardinality Estimate with Uniformity = 100,000Actual Cardinality : 2,000,000 !!!!!!!!!!!!!!!!!!!!

100 rows, 100 custids10000000 rowsProblem Scenario - Skew

Best Practices - Statistics• Collect Column Group Statistics with multiple predicates on the

same table• WHERE Country = ‘CANADA’ and City = ‘TORONTO’RUNSTATS … ON ALL COLUMNS AND ON COLUMNS ((country, city) … ) …

• Consider Statistical Views when • There is skew on the join column• There is a significant difference in the range of values in the fact and the

dimensionCREATE VIEW SV1 AS (SELECT C.* FROM CUST C, FACT F WHERE C.CUST_ID = F.CUST_ID)ALTER VIEW cust_fact ENABLE QUERY OPTIMIZATIONRUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION

Referential Integrity (RI)

• Facilitates aggregation push down• Example in the appendix section

• Eliminates redundant joins in views• RI helps determine that queries that do not require data from a

primary key table need not do that join even if it is in the view

• Helps with Materialized Query Table matching• Allows Queries to match MQTs with more dimension table joins

Consider Materialized Query Tables

Fact

GB

JOIN

Dim2

Dim1

JOIN

Fact

GB

JOIN

Dim2

Dim1

JOIN

Fact

GB

JOIN

Dim2

Dim1

JOIN

Fact

GB

JOIN

Dim2

Dim1

JOIN

Joe’s Query

Sue’s Query

Bob’s QueryFact

GB

JOIN

Dim2

Dim1

JOIN

MQT

Q9 Q9

Bob’s QSue’s

Joe’s Q

Best Practices - Defining Materialized Query Tables

• What MQTs should I define ?• Estimate the size of the candidate MQTs by executing COUNT

queries against base tables. • Try to achieve at least a 10X reduction in size between fact and

the MQT• Build MQTs with a reasonable number of GROUP BY columns (3

to 6 dimension keys) at a time based on query patterns

• As far as possible build the MQT from the fact table alone• Use Table Partitioning for the fact table and the MQTs

Best Practices - MQT Matching

• Define Referential Integrity to help with matching MQTs that contain more tables than the queries

• Define Functional Dependencies for thinner MQTs• Use COUNT_BIG instead of COUNT for DPF MQTs• Define indexes on MQTs• Keep statistics up-to-date• Define base table columns NOT NULL as far as possible

• For example we can match SUM(A + B) with SUM(A) + SUM(B)

Best Practices – MQT Maintenance

• REFRESH IMMEDIATE• Create an index on the GROUP BY columns• Create the index on the set of columns that form a unique key • Always keep the base table and MQT statistics up-to-date

• REFRESH DEFERRED• If log space is an issue, consider NOT LOGGED INITIALLY or LOAD

from cursor • An MQT can be temporarily toggled into a regular table by using

• ALTER TABLE … DROP MATERIALIZED QUERY • ALTER TABLE … ADD MATERIALIZED QUERY

• Use ATTACH / DETACH if fact table and MQT are range partitioned tables

Replicated Tables

• Replicate dimension tables (unless collocated with fact )

• Benefit : Avoids data movement

• Important : Define suitable indexes

• If too large, replicate a subset of frequently used columns

BTQ

JOIN

BTQ

JOIN

SALES SALESCUST SALES

BTQ

JOIN

CUSTCOPY

CUSTCOPY

CUSTCOPY

Agenda


Best Practices – Configuration

• Optimization Level 5• Registry Variables

• DB2_ANTIJOIN=EXTEND• If slow queries have NOT EXISTS, NOT IN predicates

• DB2_REDUCED_OPTIMIZATION=YES• If compile time is an issue

• Configuration thumb rules • BUFFPOOL ~= SHEAPTHRES• SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)

Best Practices - Statistics

• The DB2 Query Optimizer relies on reasonably accurate statistics to get a good query plans

• User runs RUNSTATS when data changes (part of ETL)• Statistics Fabrication (unreliable)

• DB2 keeps UPDATE / DELETE / INSERT counters• Fabrication limited to a few statistics – Not enough

• Consider configuring Automatic Statistics• Automatically collects statistics on tables in need• Runs in the background as a low priority job

• Consider configuring Real Time Statistics• Collects statistics on-the-fly

Summary – Best Practices

• Database Design :• Parallelism, Partitioning, Schema, Compression

• Application Design• SQL Tips

• Performance Layer• Indexes, Statistics, Referential Integrity, Materialized Query

Tables, Replicated Tables

• Configuration and Operations• Configuration, Collecting Statistics

Calisto [email protected]

best practices for query performance in a data warehouse

Documents

better query performance

data warehousecalisto

database partition timestamp

data warehousefor scalability

tableslarge amounts

better data management

database administrators

performance layer objects