december 01-03, 2009 minneapolis, chicago, milwaukee 0 best practices to improve query performance...

December 01-03, 2009 •Minneapolis, Chicago, Milwaukee1

Best Practices to Improve Query Performance in a Data Warehouse - 2

Calisto Zuzarte, STSM, IBM, [email protected]


Data Warehouse Life Cycle

Database design / Application design– The Warehouse Application architects and Database

Administrators work together to design the queries and schema before they put the application in production

Database performance layer implementation– In order to meet SLAs, DBAs usual go through some iterations

augmenting the database with performance layer objects and set up the initial configuration to get good performance

Database tuning operations– During production, with changing requirements and change in

data, there is on-going tuning required to keep operations smooth.


Motivation

Data warehouse environments characteristics:– Large volumes of data

• Millions/Billions of rows involved in some tables

• Large Joins

• Large Sorts,

• Large Aggregations

• Many tables involved

• Large amount of data rolled-in and rolled-out

– Complex queries • Report Queries

• Ad Hoc Queries

It is important to pay attention to query performance


Objective

Provide recommendations from a DB2 optimizer perspective to improve query performance through the Data Warehouse life cycle


Agenda

SESSION 1

Best Practices – Database Design

Best Practices – Application Design

Best Practices – Configuration and Operations

SESSION 2

Best Practices – Performance Layer



Indexes

Statistics

Constraints

Materialized Query Tables

Replicated Tables


We want Index Only Access instead of more expensive ISCAN-FETCH or TSCAN (Table Scan)

We want to avoid SORTs particularly those that spill

We want to promote index-ORing and index-ANDing

We want to promote Star Joins

When you have range join predicates– Better performance with Nested Loop Join

When we want to avoid random I/O with better clustering

Indexes are useful when …


Nested Loop Join Merge Join Hash Join

HSJOIN

outer

NLJOINMGJN

inner outer inner probe build

4030

1919

29

319

319

31219

79

0239

191940

33379

12191919

SORT/ Index SORT / Index

319

319

31219

79

51020151015

555

51020

55

155

1015

51020151015

555

4030

1919

29

3231233

3231233

3323123

Better With Index

Indexes Can Help NLJN and MGJN


HSJOIN /------------------+-----------------\ TBSCAN DTQ | | TABLE:SUPPLIER NLJOIN /------+------\ TBSCAN FETCH | /----+----\ SORT IXSCAN TABLE: PART | | FETCH P_PK /----+---\ RIDSCN TABLE: PARTSUPP | SORT | IXAND /-----------+-----------\ NLJOIN NLJOIN /------+-----\ /----+---\ TBSCAN IXSCAN BTQ IXSCAN | | | | TABLE: PART INDEX: PS_PK TBSCAN INDEX: PS_SK | TABLE: SUPPLIER Star Join requires fact table Indexes

Indexes Promote Star Joins


SORT Spills

The details for a SORT will indicate if the SORT spilled

The I/Os indicate that there was spilling associated with the SORT.

Minimize spills by considering indexes and (Also discussed later by balancing SORTHEAP, SHEAPTHRES and BUFFERPOOL) SORT

( 16) 6.14826e+06 1.30119e+06 | 3.65665e+07 TBSCAN ( 17) 2.00653e+06 1.14286e+06 | 3.74999e+07 TABLE: TPCD.ORDERS


Define Indexes-To get index-only access-To avoid SORT (spills)-To promote Index-OR / Index AND-To promote star joins-To better handle range join predicate-For point Queries (Nested Loop Join)-For better clustering (MDC Indexes)



Indexes

Statistics

Constraints


Replicated Tables


Estimating the size of intermediate results is critical to getting good query execution plans

Without sufficient information, the optimizer can only guess based on some assumptions

Data skew and statistical correlation between multiple column values introduce uncertainty

Cardinality Estimation


Cardinality Estimation With and Without Distribution Statistics

Distribution statistics are exploited ONLY when predicates use constantsThey make a significant difference if you have skew

SELECT * FROM T WHERE C1 = 300

Equality Predicate (no distribution Statistics)• CARD = 1000, COLCARD = 100• FF = 1/COLCARD = 1/100 = 0.01 (1% of the rows)• Cardinality Estimate = 0.01 * 1000 = 10 rows

Equality Predicate with distribution • Value 300 has VALCOUNT 200• Cardinality = 200

0

50

100

150

200

250

45 300 50 200 250 10 20 50 30 280 … …

Frequency


Collect Distribution Statistics

when you have skew and predicates with constants


Quantile or Histogram Statistics – Date Columns

Table size ~400,000 ROWS

4 buckets each ~100,000 ROWS (number of quantiles = 4)

Assume 4 months per bucket

20080601 20081001 20090201 20090601 20101001

WHERE DateCol BETWEEN 20081115 AND 20081101

~100,000

Cardinality Estimate = ((20081115 – 20081101) / (20090201 – 20081001)) * 100000 = (14 / 9200 ) * 100000 = ~152 ROWS


Quantile or Histogram Statistics – Date Columns

Table size ~400,000 ROWS

16 buckets each ~25,000 ROWS (number of quantiles = 16)

20080601 20081001 20090201 20090601 20101001

20081101 20081201 20090101

~25,000

WHERE DateCol BETWEEN 20081115 AND 20081101

Cardinality Estimate = ((20081115 – 20081101) / (20081201 - 20081101)) * 25000 = (14 / 100 ) * 25000 = ~3500 ROWS

Could be more accurate if 20081130


Collect a large number of Quantile Statistics

on DATE columns


Column Group Statistics

Example: WHERE COUNTRY = ‘Germany’ And CITY = ‘Frankfurt’

No CGS: Selectivity = ½ * 1/3 = 1/6 …Estimate 1 row

With CGS: Selectivity = 1/3 …Estimate 2 rows

Country City Hotel Name

Germany Bremen Hilton

Germany Bremen Best Western

Germany Frankfurt InterCity

Germany Frankfurt Shangri-La

Canada Toronto Four Seasons

Canada Toronto Intercontinental

1

2

3


Column Group Statistics

The DB2 optimizer can exploit column group statistics for better estimates

SELECT … FROM dw.hotel H

WHERE H.country = ? and H.city = ?

RUNSTATS ON TABLE dw.hotel … ON COLUMNS (… (country, city) … ) …

If you have an index on (country, city) we have FIRST2KEYCARD statistic so no need of CGS


Collect Column Group Statistics

when you have multiple equality predicates on the

same table on columns that might be correlated


Join Cardinality Estimation

Join cardinality estimation is typically based on inclusion and uniformity

PRODUCT.PROD_ID = SALES.PROD_ID– PRODUCT Table

• 20000 rows• 10000 distinct PROD_IDs

– SALES• 100000 rows• 5000 distinct PROD_IDs

– Join Cardinality• 20000*100000 / MAX(10000, 5000) = 200000


CUSTID CNAME

1 ‘ABC’

2 ‘DEF’

3 ‘GHI’

4 ‘IBM’

5 ‘JKL’

6 ‘MNO’

7 ‘PQR’

… …

… …

100 ‘XYZ’

CUSTID # of Rows

4 2000000

10 700000

8 500000

63 300000

72 100000

9 50000

12 20000

… …

… …

5 5

CUST Table Frequency Statistics – SALES Table

SELECT … FROM SALES, CUSTWHERE CUST.CNAME = ‘IBM’ AND CUST.CUSTID = SALES.CUSTID

Cardinality Estimate with Uniformity : Cardinality Estimate of CUST after local predicate = (1/100 * 100) = 1Join Card Estimate= (1 * 10000000) / MAX(1, 100) = 100000Actual Cardinality : 2000000 !!!!!!!!!!!!!!!!!!!!

100 rows100 distinct values

10000000 rows100 distinct values

Problem Scenario - Skew


What Are Statistical Views

Views with statistics that can be used by the DB2 optimizer

If appropriate for a query, these statistics can be exploited to get better cardinality estimates

The view need not be directly referenced in the query - matched like MQTs

Single table or simple join views works best


Creating Statistical Views

Create a view like a regular view :CREATE VIEW cust_fact AS(SELECT C.* FROM CUSTOMER C, FACT F

WHERE C.CUST_ID = F.CUST_ID)

Alter the view to make it a statistical viewALTER VIEW cust_fact ENABLE QUERY OPTIMIZATION

Collecting StatisticsRUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION


Define Statistical Views

when you have skew on join columns



Indexes

Statistics

Constraints


Replicated Tables


Referential Integrity (RI)

Facilitates aggregation push down– Example follows

Eliminates redundant joins in views– RI helps determine that queries that do not require data from a

primary key table need not do that join even if it is in the view

Helps with Materialized Query Table matching– Allows Queries to match MQTs with more dimension table joins


Exploiting Referential Integrity Constraints

Find the top 20 stores in terms of total sales, and include the store name and city information

SELECT st.store_id, st.name, st.city, sum(F.sales) as sm FROM salesF as F, store as ST WHERE F.store_id = ST.store_id GROUP BY st.store_id, st.name, st.city ORDER BY sm desc FETCH FIRST 20 ROWS ONLY;



f.store_id=st.store_id /* FK=PK */

sum(sales) as sm group by store_id, name, city

store_id, name, city, sm

order by sm descfetch first 20 rows only

store_id, name, city

2,000 rows store_id, sales

1,000,000 rows

JOIN

Sales (F) Store (ST)

GROUPBY

SORT

SELECT

2,000 rows

20 rows

1,000,000 rows

20 rows




sum(sales) as sm group by store_id




2,000 rows

GROUPBY


JOIN

SORT

SELECT

2,000 rows

store_id, sales

1,000,000 rows

20 rows

20 rows

2,000 rows




sum(sales) as sm group by store_id




2,000 rows

GROUPBY


SORT

JOIN

SELECT

2,000 rows

store_id, sales

1,000,000 rows

20 rows

20 rows

20 rows


Define Referential Integrity

Whenever possible(Informational if you know for sure)



Indexes

Statistics

Constraints


Replicated Tables


Consider Materialized Query Tables

Fact

GB

JOIN

Dim2

Dim1

JOIN

Fact

GB

JOIN

Dim2

Dim1

JOIN

Fact

GB

JOIN

Dim2

Dim1

JOIN

Fact

GB

JOIN

Dim2

Dim1

JOIN

Joe’s Query

Sue’s Query

Bob’s QueryFact

GB

JOIN

Dim2

Dim1

JOIN

MQT

Q9 Q9

Bob’s QSue’s

Joe’s Q


Best Practices - Defining Materialized Query Tables

Compute and use the dimension lattice to assist in MQT selection. Determine cardinality of aggregates that appear to be good candidates by executing COUNT queries against base tables.

Try to achieve at least a 10X reduction in size between fact and the MQT

Build MQTs with a reasonable number of GROUP BY columns (3 to 4 dimension keys) at a time based on query patterns

As far as possible build the MQT from the fact table alone. This is easier to manage although the dimension tables may need to be joined for most queries.


Best Practices - MQT Matching

Define Referential Integrity to help with matching MQTs that contain more tables than the queries

Define Functional Dependencies for thinner MQTs

Use COUNT_BIG instead of COUNT for DPF MQTs

Define indexes on MQTs

Keep statistics up-to-date

Include as many Foreign Key Columns as the size allows in the MQT

– Allows joins of dimension tables when needed

Define base table columns NOT NULL as far as possible– For example we can match SUM(A + B) with SUM(A) + SUM(B)


Best Practices – MQT Maintenance

REFRESH IMMEDIATE– Create an index on the GROUP BY columns

– If not aggregating, create the index on the set of columns that form a unique key

– If possible, define the base table columns used in the GROUP BY clause of REFRESH IMMEDIATE MQTs as NOT NULL

– Define RI relationships wherever applicable

– Always keep the base table and MQT statistics up-to-date

REFRESH DEFERRED– If log space is an issue, consider ALTER TABLE … NOT LOGGED

INITIALLY

– An MQT can be temporarily toggled into a regular table by using • ALTER TABLE … DROP MATERIALIZED QUERY

• ALTER TABLE … ADD MATERIALIZED QUERY

– For Full REFRESH, consider LOAD FROM CURSOR.


Best Practices MQTs

Define Referential Integrity Make FK columns

NOT NULL

Define FunctionalDependencies

Use COUNT_BIG(*) in DPFConsider GROUPING SETS,

ROLLUP or CUBE

10X Reduction ?

Indexes and RUNSTATS

Include FK columns in the MQT



Indexes

Statistics

Constraints


Replicated Tables


Replicated Tables

Replicate dimension tables (unless collocated with fact table)

Benefit : Avoids data movement, Avoids too many agents

Important : Define suitable indexes

If too large, replicate a subset of frequently used columns and/or rows

BTQ

JOIN

BTQ

JOIN

SALES SALESCUST SALES

BTQ

JOIN

CUSTCOPY

CUSTCOPY

CUSTCOPY


Largest dimension table- Collocate with the fact table

All other dimension tables-Single partition and replicated

- At least the HOT columns


Summary

Session 2 : Best practices to improve data warehouse query performance related to the performance layer

– Indexes

– Statistics (Column Group Statistics and Statistical Views)

– Referential Integrity

– Materialized Query Tables

– Replicated Tables

Recap : Session 1 covered Database Design, Application Design, Configuration and Operations

– These include considerations related to Parallelism, Partitioning, Schema, Application queries, Configuration

45

© Copyright IBM Corporation [current year]. All rights reserved.U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

Please update paragraph below for the particular product or family brand trademarks you mention such as WebSphere, DB2, Maximo, Clearcase, Lotus, etc

IBM, the IBM logo, ibm.com, [IBM Brand, if trademarked], and [IBM Product, if trademarked] are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

If you have mentioned trademarks that are not from IBM, please update and add the following lines:

[Insert any special 3rd party trademark names/attributions here] Other company, product, or service names may be trademarks or service marks of others.

Disclaimer

december 01-03, 2009 minneapolis, chicago, milwaukee 0 best practices to improve query performance...

Documents

ibm brand

ibm logo

current list of ibm

ibm trademarked terms

trademark information

common law trademarks

family brand trademarks

particular product